Source: tang_chen_etal_2012_csda.pdf (Tang, Chen, Tsui, and Doksum, 2012, Computational Statistics & Data Analysis)
Nonparametric Variable Selection and Classification: The CATCH Algorithm

Shijie Tang^a, Lisha Chen^{b,*}, Kam-Wah Tsui^{c,1}, Kjell Doksum^{d,1}

^a Boehringer Ingelheim Pharmaceuticals, Danbury, CT
^b Department of Statistics, Yale University, New Haven, CT
^c Department of Statistics, University of Wisconsin, Madison, WI
^d Department of Statistics, University of Wisconsin, Madison, WI, and Department of Statistics, Columbia University, New York City, NY

* Corresponding and joint first author. 1 Partially supported by NSF grant DMS-0604931.
Abstract
In a nonparametric framework we consider the problem of classifying a categorical response $Y$ whose distribution depends on a vector of predictors $X$, where the coordinates $X_j$ of $X$ may be continuous, discrete, or categorical. To select the variables to be used for classification, we construct an algorithm which, for each variable $X_j$, computes an importance score $s_j$ to measure the strength of association of $X_j$ with $Y$. The algorithm deletes $X_j$ if $s_j$ falls below a certain threshold. It is shown in Monte Carlo simulations that the algorithm has a high probability of only selecting variables associated with $Y$. Moreover, when this variable selection rule is used for dimension reduction prior to applying classification procedures, it improves the performance of these procedures. Our approach for computing importance scores is based on root chi-square type statistics computed for randomly selected regions (tubes) of the sample space. The size and shape of the regions are adjusted iteratively and adaptively using the data to enhance the ability of the importance score to detect local relationships between the response and the predictors. These local scores are then averaged over the tubes to form a global importance score $s_j$ for variable $X_j$. When confounding and spurious associations are issues, the nonparametric importance score for variable $X_j$ is computed conditionally by using tubes to restrict the other variables. We call this variable selection procedure CATCH (Categorical Adaptive Tube Covariate Hunting). We establish asymptotic properties, including consistency.
Keywords: Adaptive variable selection; Importance score; Chi-square statistic
1 Introduction
We consider classification problems with a large number of predictors that can be numerical or categorical. We are interested in the case in which many of the predictors may be irrelevant for classification; thus the variables useful for classification need to be selected. In genomic research the classes considered are often cases and controls, so variable selection is the same as the important problem of deciding which variables are associated with disease. With modern technology, data of large dimension arise in many scientific disciplines, including biology, genomics, astronomy, economics, and computer science. In particular, the number of variables can be greater than the sample size, which poses a considerable challenge to the statistical analysis. If only some of the variables are useful for classification, over-fitting is a problem for methods that use all variables, and therefore variable selection becomes critical for statistical analysis.
Methods for variable selection in the classification context include methods that incorporate variable selection as part of the classification procedure. This class includes random forest (Breiman [1]), CART (Breiman et al. [2]), and GUIDE (Loh [12]). Random forest assigns an importance score to each of the predictors, and one can drop those variables whose importance scores fall below a certain threshold. CART, after pruning, will choose a subset of optimal splitting variables to be the most significant variables. GUIDE is a tree-based method particularly powerful in unbiased variable selection and interaction detection. Other research includes methods that incorporate variable selection by applying shrinkage methods with $L_1$-norm constraints on the parameters (Tibshirani [17]), which generate sparse vectors of parameter estimates. Wang and Shen [18] and Zhang et al. [19] incorporate variable selection with classification based on support vector machine (SVM) methods. Qiao et al. [14] consider variable selection based on linear discriminant analysis. One limitation of these variable selection methods is that they are not nonparametric, and therefore they may not work well when there is a complex relationship between the predictors and the response to be classified.
We propose a nonparametric method for variable selection, called Categorical Adaptive Tube Covariate Hunting (CATCH), that performs well as a variable selection algorithm and can be used to improve the performance of available classification procedures. The idea is to construct a nonparametric measure of the relational strength between each predictor and the categorical response, and to retain those predictors whose relationship to the response is above a certain threshold. The nonparametric measure of importance for each predictor is obtained by first measuring the importance of the predictor using local information and then combining such local importance scores to obtain an overall importance score. The local importance scores are based on root chi-square type statistics for local contingency tables.
In addition to the aforementioned nonparametric feature, the CATCH procedure has another property: it measures the importance of each variable conditioning on all other variables, thereby reducing the confounding that may lead to selection of variables spuriously related to the categorical variable $Y$. This is accomplished by constraining all predictors but the one we are focusing on. For the case where the number of predictors is huge ($d \gg n$), this can be done by restricting principal components for some types of studies. See Remark 3.4.
Our approach to nonparametric variable selection is related to the EARTH algorithm (Doksum et al. [3]), which applies to nonparametric regression problems with a continuous response variable and continuous predictors. It measures the association between a predictor $X_i$ and the response variable conditional on all the other predictors $\{X_j : j \ne i\}$ by constraining the $X_j$, $j \ne i$, variables to regions called tubes. The local importance score is based on a local linear or a local polynomial regression. The contribution of the current paper is to develop variable selection methods for the classification problem with a categorical response variable and predictors that can be continuous, discrete, or categorical.
Our CATCH algorithm can be used as a variable selection step before classification. Any classification method, preferably nonparametric, can be used after we single out the important variables. In particular, SVM and random forest are statistical classification methods that can be used with CATCH. We show in simulation studies that when the true model is highly nonlinear and there are many irrelevant predictors, using CATCH to screen out irrelevant or weak predictors greatly improves the performance of SVM and random forest.
The CATCH algorithm works with general classification data with both numerical and categorical predictors. Moreover, the CATCH algorithm is robust when the predictors interact with each other, especially when numerical and categorical predictors interact, e.g., for hierarchical interactive association between the numerical and categorical predictors. We present a model with a reasonable sample size and predictor dimension for which random forest has a relatively low chance of detecting the significance of the categorical predictor. The CART and GUIDE algorithms with pruning can find the correct splitting variables, including the categorical one, but yield relatively high classification errors. Moreover, CART and GUIDE have trouble choosing the splitting predictors in the correct order. The CATCH algorithm achieves higher accuracy in the task of variable selection. This is due to the importance scores being conditional, as illustrated in the simulation example in Section 4.1.
The rest of the paper proceeds as follows. In Section 2 we introduce importance scores for univariate predictors. In Sections 3.1-3.4 we extend these scores to multivariate predictors, and in Section 3.5 we introduce the CATCH algorithm. In Section 4 we use simulation studies to show the effectiveness of the CATCH algorithm. A real example is provided in Section 5. In Section 6 we provide some theoretical properties to justify the definition of local contingency efficacy, and in Section 7 we show asymptotic consistency of the algorithm.
2 Importance Scores for Univariate Classification
Let $(X^{(i)}, Y^{(i)})$, $i = 1, \dots, n$, be independent and identically distributed (i.i.d.) as $(X, Y) \sim P$. We first consider the case of a univariate $X$, then use the methods constructed for the univariate case to construct the multivariate versions.
2.1 Numerical Predictor: Local Contingency Efficacy
Consider the classification problem with a numerical predictor. Let

$$\Pr(Y = c \mid x) = p_c(x), \quad c = 1, \dots, C, \qquad \sum_{c=1}^{C} p_c(x) \equiv 1. \quad (2.1)$$
We introduce a measure of how strongly $X$ is associated with $Y$ in a neighborhood of a "fixed" point $x^{(0)}$. Points $x^{(0)}$ will be selected at random by our algorithm. Local importance scores will be computed for each point, then averaged. For a fixed $h > 0$, define $N_h(x^{(0)}) = \{x : 0 < |x - x^{(0)}| \le h\}$ to be a neighborhood of $x^{(0)}$, and let $n(x^{(0)}, h) = \sum_{i=1}^{n} I(X^{(i)} \in N_h(x^{(0)}))$ denote the number of data points in the neighborhood. For $c = 1, \dots, C$, let $n_c^-(x^{(0)}, h)$ be the number of observations $(X^{(i)}, Y^{(i)})$ satisfying $x^{(0)} - h \le X^{(i)} < x^{(0)}$ and $Y^{(i)} = c$; let $n_c^+(x^{(0)}, h)$ be the number of $(X^{(i)}, Y^{(i)})$ satisfying $x^{(0)} < X^{(i)} \le x^{(0)} + h$ and $Y^{(i)} = c$. Let $n_c(x^{(0)}, h) = n_c^+(x^{(0)}, h) + n_c^-(x^{(0)}, h)$, $n^-(x^{(0)}, h) = \sum_{c=1}^{C} n_c^-(x^{(0)}, h)$, and $n^+(x^{(0)}, h) = \sum_{c=1}^{C} n_c^+(x^{(0)}, h)$. Table 1 gives the resulting local contingency table.
Table 1: Local contingency table

                                 | $Y = 1$            | $Y = 2$            | $\cdots$ | $Y = C$            | Total
$x^{(0)} - h \le X < x^{(0)}$    | $n_1^-(x^{(0)},h)$ | $n_2^-(x^{(0)},h)$ | $\cdots$ | $n_C^-(x^{(0)},h)$ | $n^-(x^{(0)},h)$
$x^{(0)} < X \le x^{(0)} + h$    | $n_1^+(x^{(0)},h)$ | $n_2^+(x^{(0)},h)$ | $\cdots$ | $n_C^+(x^{(0)},h)$ | $n^+(x^{(0)},h)$
Total                            | $n_1(x^{(0)},h)$   | $n_2(x^{(0)},h)$   | $\cdots$ | $n_C(x^{(0)},h)$   | $n(x^{(0)},h)$
To measure how strongly $Y$ relates to local restrictions on $X$, we consider the chi-square statistic

$$\mathcal{X}^2(x^{(0)}, h) = \sum_{c=1}^{C} \left( \frac{\left(n_c^-(x^{(0)}, h) - E_c^-(x^{(0)}, h)\right)^2}{E_c^-(x^{(0)}, h)} + \frac{\left(n_c^+(x^{(0)}, h) - E_c^+(x^{(0)}, h)\right)^2}{E_c^+(x^{(0)}, h)} \right), \quad (2.2)$$
where $0/0 \equiv 0$ and

$$E_c^-(x^{(0)}, h) = \frac{n^-(x^{(0)}, h)\, n_c(x^{(0)}, h)}{n(x^{(0)}, h)}, \qquad E_c^+(x^{(0)}, h) = \frac{n^+(x^{(0)}, h)\, n_c(x^{(0)}, h)}{n(x^{(0)}, h)}. \quad (2.3)$$
Due to the local restriction on $X$, the local contingency table might have some zero cells. However, this is not an issue for the $\chi^2$ statistic: it can detect local dependence between $Y$ and $X$ as long as there exist observations from at least two categories of $Y$ in the neighborhood of $x^{(0)}$. When all observations belong to the same category of $Y$, intuitively no dependence can be detected; in this case the $\chi^2$ statistic is equal to zero, which coincides with our intuition.
We call the neighborhood $N_h(x^{(0)})$ a section and maximize $\mathcal{X}^2(x^{(0)}, h)$ of (2.2) with respect to the section size $h$. Let plim denote limit in probability. Our local measure of association is $\zeta(x^{(0)})$, as given in

Definition 1 (Local Contingency Efficacy, for a numerical predictor). The local contingency section efficacy and the Local Contingency Efficacy (LCE) of $Y$ on a numerical predictor $X$ at the point $x^{(0)}$ are

$$\zeta(x^{(0)}, h) = \operatorname*{plim}_{n \to \infty} n^{-1/2} \sqrt{\mathcal{X}^2(x^{(0)}, h)}, \qquad \zeta(x^{(0)}) = \sup_{h > 0} \zeta(x^{(0)}, h). \quad (2.4)$$
If $p_c(x)$ is constant in $x$ for all $c \in \{1, \dots, C\}$ for $x$ near $x^{(0)}$, then as $n \to \infty$, $\mathcal{X}^2(x^{(0)}, h)$ converges in distribution to a $\chi^2_{C-1}$ variable, and in this case $\zeta(x^{(0)}, h) = 0$. On the other hand, if $p_c(x)$ is not constant near $x^{(0)}$, then $\zeta(x^{(0)}, h)$ determines the asymptotic power, for Pitman alternatives, of the test based on $\mathcal{X}^2(x^{(0)}, h)$ for testing whether $X$ is locally independent of $Y$, and is called the efficacy of this test. Moreover, $\zeta^2(x^{(0)}, h)$ is the asymptotic noncentrality parameter of the statistic $n^{-1}\mathcal{X}^2(x^{(0)}, h)$.
The estimators of $\zeta(x^{(0)}, h)$ and $\zeta(x^{(0)})$ are

$$\hat\zeta(x^{(0)}, h) = n^{-1/2} \sqrt{\mathcal{X}^2(x^{(0)}, h)}, \qquad \hat\zeta(x^{(0)}) = \sup_{h > 0} \hat\zeta(x^{(0)}, h). \quad (2.5)$$
We select the section size $h$ as

$$\hat h = \operatorname*{argmax}_{h} \left\{ \hat\zeta(x^{(0)}, h) : h \in \{h_1, \dots, h_g\} \right\}$$

for some grid $h_1, \dots, h_g$. This corresponds to selecting $h$ by maximizing estimated power. See Doksum and Schafer [5], Gao and Gijbels [7], Doksum et al. [3], and Schafer and Doksum [16].
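As a concrete illustration of (2.2)-(2.5), the following Python sketch (a hypothetical helper, not part of the paper) builds the local $2 \times C$ table of Table 1 for each $h$ in a grid and maximizes $n^{-1/2}\sqrt{\mathcal{X}^2(x^{(0)}, h)}$ over the grid:

```python
import numpy as np

def local_contingency_efficacy(x, y, x0, h_grid, n_classes):
    """Estimated local contingency efficacy (2.5) of Y on a numerical X at x0.

    A sketch of Definition 1: for each half-width h in h_grid, build the
    local 2 x C contingency table of Table 1, compute the chi-square
    statistic (2.2) with the convention 0/0 == 0, and return the maximum
    of n^{-1/2} * sqrt(X^2) over the grid.
    """
    n = len(x)
    best = 0.0
    for h in h_grid:
        left = (x >= x0 - h) & (x < x0)        # row: x0 - h <= X < x0
        right = (x > x0) & (x <= x0 + h)       # row: x0 < X <= x0 + h
        counts = np.zeros((2, n_classes))
        for c in range(n_classes):
            counts[0, c] = np.sum(left & (y == c))
            counts[1, c] = np.sum(right & (y == c))
        total = counts.sum()
        if total == 0:
            continue                           # empty section contributes nothing
        row = counts.sum(axis=1, keepdims=True)
        col = counts.sum(axis=0, keepdims=True)
        expected = row * col / total           # E_c^- and E_c^+ of (2.3)
        with np.errstate(divide="ignore", invalid="ignore"):
            terms = (counts - expected) ** 2 / expected
        chi2 = np.nansum(terms)                # 0/0 cells treated as 0
        best = max(best, np.sqrt(chi2) / np.sqrt(n))
    return best
```

The convention $0/0 \equiv 0$ is implemented by summing the ratio terms with `nansum`, so empty or single-class sections contribute zero, in line with the discussion following (2.3).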
2.2 Categorical Predictor: Contingency Efficacy
Let $(X^{(i)}, Y^{(i)})$, $i = 1, \dots, n$, be as before, except that $X \in \{1, \dots, C'\}$ is a categorical predictor. Let $n(c, c')$ be the number of observations satisfying $Y = c$ and $X = c'$, and let

$$\mathcal{X}^2 = \sum_{c=1}^{C} \sum_{c'=1}^{C'} \frac{\left(n(c, c') - E(c, c')\right)^2}{E(c, c')}, \quad (2.6)$$

where

$$E(c, c') = \frac{\left(\sum_{c=1}^{C} n(c, c')\right) \left(\sum_{c'=1}^{C'} n(c, c')\right)}{n}. \quad (2.7)$$
Definition 2 (Contingency Efficacy, for a categorical predictor). The Contingency Efficacy (CE) of $Y$ on a categorical predictor $X$ is

$$\zeta = \operatorname*{plim}_{n \to \infty} n^{-1/2} \sqrt{\mathcal{X}^2}. \quad (2.8)$$

Under the null hypothesis that $Y$ and $X$ are independent, $\mathcal{X}^2$ converges in distribution to a $\chi^2_{(C-1)(C'-1)}$ variable; in this case $\zeta = 0$. In general, $\zeta^2$ is the asymptotic noncentrality parameter of the statistic $n^{-1}\mathcal{X}^2$. The contingency efficacy $\zeta$ determines the asymptotic power of the test based on $\mathcal{X}^2$ for testing whether $Y$ and $X$ are independent. We use $\zeta$ as an importance score that measures how strongly $Y$ depends on $X$. The estimator of (2.8) is

$$\hat\zeta = n^{-1/2} \sqrt{\mathcal{X}^2}. \quad (2.9)$$
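For a categorical predictor, the estimator (2.9) is the root of the ordinary Pearson statistic scaled by $n^{-1/2}$; a minimal sketch (the function name is hypothetical):

```python
import numpy as np

def contingency_efficacy(x, y):
    """Estimated contingency efficacy (2.9) of Y on a categorical X.

    Sketch of Definition 2: form the full C x C' table of counts
    n(c, c'), compute the Pearson chi-square statistic (2.6) with
    expected counts (2.7), and return n^{-1/2} * sqrt(X^2).
    """
    n = len(x)
    y_levels = np.unique(y)
    x_levels = np.unique(x)
    counts = np.array([[np.sum((y == c) & (x == cp)) for cp in x_levels]
                       for c in y_levels], dtype=float)
    row = counts.sum(axis=1, keepdims=True)    # totals over X categories
    col = counts.sum(axis=0, keepdims=True)    # totals over Y categories
    expected = row * col / n                   # E(c, c') of (2.7)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = (counts - expected) ** 2 / expected
    chi2 = np.nansum(terms)                    # guard against empty cells
    return np.sqrt(chi2) / np.sqrt(n)
```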
3 Variable Selection for Classification Problems: The CATCH Algorithm
Now we consider a multivariate $X = (X_1, \dots, X_d)$, $d > 1$. To examine whether a predictor $X_l$ is related to $Y$ in the presence of all other variables, we calculate efficacies conditional on these variables, as well as unconditional or marginal efficacies.
3.1 Numerical Predictors
To examine the effect of $X_l$ in the local region of $x^{(0)} = (x^{(0)}_1, \dots, x^{(0)}_d)$, we build a "tube" around the center point $x^{(0)}$ in which data points are within a certain radius $\delta$ of $x^{(0)}$ with respect to all variables except $X_l$. Define $X_{-l} = (X_1, \dots, X_{l-1}, X_{l+1}, \dots, X_d)^T$ to be the complementary vector to $X_l$. Let $D(\cdot, \cdot)$ be a distance in $\mathbb{R}^{d-1}$, which we will define later.

Definition 3. A tube $T_l$ of size $\delta$ ($\delta > 0$) for the variable $X_l$ at $x^{(0)}$ is the set

$$T_l \equiv T_l(x^{(0)}, \delta, D) \equiv \{x : D(x_{-l}, x^{(0)}_{-l}) \le \delta\}. \quad (3.1)$$
We adjust the size of the tube so that the number of points in the tube is a preassigned number $k$. We find that $k \ge 50$ is in general a good choice. Note that $k = n$ corresponds to unconditional or marginal nonparametric regression.

Within the tube $T_l(x^{(0)}, \delta, D)$, $X_l$ is unconstrained. To measure the dependence of $Y$ on $X_l$ locally, we consider neighborhoods of $x^{(0)}$ along the direction of $X_l$ within the tube. Let

$$N_{l,h,\delta,D}(x^{(0)}) = \{x = (x_1, \dots, x_d)^T \in \mathbb{R}^d : x \in T_l(x^{(0)}, \delta, D),\ |x_l - x^{(0)}_l| \le h\} \quad (3.2)$$

be a section of the tube $T_l(x^{(0)}, \delta, D)$. To measure how strongly $X_l$ relates to $Y$ in $N_{l,h,\delta,D}(x^{(0)})$, we consider

$$\mathcal{X}^2_{l,\delta,D}(x^{(0)}, h), \quad (3.3)$$

where (3.3) is defined by (2.2) using only the observations in the tube $T_l(x^{(0)}, \delta, D)$.
Analogous to the definitions of local contingency efficacy (2.4), we define

Definition 4. The local contingency tube section efficacy and the local contingency tube efficacy of $Y$ on $X_l$ at the point $x^{(0)}$, and their estimates, are

$$\zeta(x^{(0)}, h, l) = \operatorname*{plim}_{n \to \infty} n^{-1/2} \sqrt{\mathcal{X}^2_{l,\delta,D}(x^{(0)}, h)}, \qquad \zeta_l(x^{(0)}) = \sup_{h > 0} \zeta(x^{(0)}, h, l), \quad (3.4)$$

$$\hat\zeta(x^{(0)}, h, l) = n^{-1/2} \sqrt{\mathcal{X}^2_{l,\delta,D}(x^{(0)}, h)}, \qquad \hat\zeta_l(x^{(0)}) = \sup_{h > 0} \hat\zeta(x^{(0)}, h, l). \quad (3.5)$$
For an illustration of local contingency tube efficacy, consider $Y \in \{1, 2\}$ and suppose $\operatorname{logit} P(Y = 2 \mid x) = (x - \tfrac{1}{2})^2$, $x \in [0, 1]$. Then the local efficacy with $h$ small will be large, while if $h \ge 1$ the efficacy will be zero.
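Constructing a tube of Definition 3 with its size $\delta$ set so that a preassigned number $k$ of points fall inside amounts to ranking observations by their distance to the center in the $-l$ coordinates. A minimal sketch, assuming a weighted $L_1$ distance in the spirit of (3.9) (names are hypothetical):

```python
import numpy as np

def tube_members(X, l, x0, k, weights=None):
    """Indices of the k sample points inside the tube T_l(x0, delta, D).

    Sketch of Definition 3: the tube radius delta is set adaptively so
    that exactly k points fall inside, i.e. delta is the k-th smallest
    distance D(x_{-l}, x0_{-l}). `weights` plays the role of the w_j in
    (3.9) (uniform if omitted); coordinate l itself is unconstrained.
    """
    d = X.shape[1]
    w = np.ones(d) if weights is None else np.asarray(weights, float)
    mask = np.ones(d, dtype=bool)
    mask[l] = False                       # X_l is excluded from the distance
    dist = np.abs(X[:, mask] - x0[mask]) @ w[mask]
    return np.argsort(dist)[:k]           # the k nearest in the -l coordinates
```

Applying the univariate score of Section 2 to $(X_l^{(i)}, Y^{(i)})$ for the returned indices gives the tube-restricted statistic (3.3) and hence the estimate $\hat\zeta_l(x^{(0)})$ of (3.5).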
3.2 Numerical and Categorical Predictors

If $X_l$ is categorical but some of the other predictors are numerical, we construct tubes $T_l(x^{(0)}, \delta, D)$ based on the numerical predictors, leaving the categorical variables unconstrained, and we use the statistic defined by (2.6) to measure the strength of the association between $Y$ and $X_l$ in the tube $T_l(x^{(0)}, \delta, D)$, using only the observations in the tube. We denote this version of (2.6) by

$$\mathcal{X}^2_{l,\delta,D}(x^{(0)}). \quad (3.6)$$
Analogous to the definition of contingency efficacy given in (2.8), we define

Definition 5. The local contingency tube efficacy of $Y$ on $X_l$ at $x^{(0)}$, and its estimate, are

$$\zeta_l(x^{(0)}) = \operatorname*{plim}_{n \to \infty} n^{-1/2} \sqrt{\mathcal{X}^2_{l,\delta,D}(x^{(0)})}, \qquad \hat\zeta_l(x^{(0)}) = n^{-1/2} \sqrt{\mathcal{X}^2_{l,\delta,D}(x^{(0)})}. \quad (3.7)$$
3.3 The Empirical Importance Scores

The empirical efficacies (3.5) and (3.7) measure how strongly $Y$ depends on $X_l$ near one point $x^{(0)}$. To measure the overall dependency of $Y$ on $X_l$, we randomly sample $M$ bootstrap points $x^*_a$, $a = 1, \dots, M$, with replacement from $\{x^{(i)}, 1 \le i \le n\}$, and calculate the average of the local tube efficacy estimates at these points.

Definition 6. The CATCH empirical importance score for variable $l$ is

$$s_l = M^{-1} \sum_{a=1}^{M} \hat\zeta_l(x^*_a), \quad (3.8)$$

where $\hat\zeta_l(x^*_a)$ is defined in (3.5) and (3.7) for numerical and categorical predictors, respectively.
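Given any routine for the local tube efficacy, the score (3.8) is simply an average over $M$ bootstrap centers; a sketch (the callback signature is an assumption made for illustration):

```python
import numpy as np

def importance_score(X, y, l, M, local_efficacy, rng=None):
    """CATCH empirical importance score (3.8) for variable l.

    Sketch of Definition 6: average the estimated local tube efficacies
    at M centers sampled with replacement from the data.
    `local_efficacy(X, y, l, x0)` is assumed to implement the
    tube-restricted score of (3.5) or (3.7), as appropriate.
    """
    rng = np.random.default_rng(rng)
    centers = rng.integers(0, len(X), size=M)   # bootstrap center indices
    return float(np.mean([local_efficacy(X, y, l, X[a]) for a in centers]))
```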
3.4 Adaptive Tube Distances
Doksum et al. [3] used an example to demonstrate that the distance $D$ needs to be adaptive to keep variables strongly associated with $Y$ from obscuring the effect of weaker variables. Suppose $Y$ depends on three variables $X_1$, $X_2$, and $X_3$ among the larger set of predictors, with the strength of the dependency ranging from weak to strong as estimated by the empirical importance scores. As we examine the effect of $X_2$ on $Y$, we want to define the tube so that there is relatively less variation in $X_3$ than in $X_1$ in the tube, because $X_3$ is more likely to obscure the effect of $X_2$ than $X_1$ is. To this end, we introduce a tube distance for the variable $X_l$ of the form

$$D_l(x_{-l}, x^{(0)}_{-l}) = \sum_{j=1}^{d} w_j |x_j - x^{(0)}_j| I(j \ne l), \quad (3.9)$$

where the weights $w_j$ adjust the contribution of strong variables to the distance so that the strong variables are more constrained.
The weights $w_j$ are proportional to $s_j - s'_j$, which are determined iteratively and adaptively as follows. In the first iteration, set $s_j = s^{(1)}_j = 1$ in (3.9) and compute the importance score $s^{(2)}_j$ using (3.8). For the second iteration, we use the weights $s_j = s^{(2)}_j$ and the distance

$$D_l(x_{-l}, x^{(0)}_{-l}) = \frac{\sum_{j=1}^{d} (s_j - s'_j)_+ |x_j - x^{(0)}_j| I(j \ne l)}{\sum_{j=1}^{d} (s_j - s'_j)_+ I(j \ne l)}, \quad (3.10)$$

where $0/0 \equiv 1$ and $s'_j$ is a threshold value for $s_j$ under the assumption of no conditional association of $X_j$ with $Y$, computed by a simple Monte Carlo technique. This adjustment is important because some predictors by definition have larger importance scores than others even when they are all independent of $Y$. For example, a categorical predictor with more levels has a larger importance score than one with fewer levels. Therefore the unadjusted scores $s_j$ cannot accurately quantify the relative importance of the predictors in the tube distance calculation. Next we use (3.8) to produce $s_j = s^{(3)}_j$, use $s^{(3)}_j$ in the next iteration, and so on.
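The adaptive distance (3.10) can be written directly as a weighted $L_1$ average; a sketch (hypothetical helper), using the convention $0/0 \equiv 1$ when all adjusted weights $(s_j - s'_j)_+$ vanish:

```python
import numpy as np

def adaptive_distance(x, x0, l, s, s_prime):
    """Adaptive tube distance (3.10) for variable X_l.

    A weighted L1 distance over the coordinates other than l, with
    weights (s_j - s'_j)_+ so that strong variables are constrained
    more tightly. If every weight vanishes, the convention 0/0 == 1
    stated after (3.10) is applied.
    """
    w = np.maximum(np.asarray(s, float) - np.asarray(s_prime, float), 0.0)
    w[l] = 0.0                                  # coordinate l is excluded
    mask = np.arange(len(x)) != l
    num = np.sum(w[mask] * np.abs(np.asarray(x)[mask] - np.asarray(x0)[mask]))
    den = np.sum(w[mask])
    if den == 0.0:
        return 1.0                              # the 0/0 == 1 convention
    return num / den
```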
In the definition above we need to define $|x_j - x^{(0)}_j|$ appropriately for a categorical variable $X_j$. Because $X_j$ is categorical, we assume that $|x_j - x^{(0)}_j|$ is constant when $x_j$ and $x^{(0)}_j$ take any pair of different categories. One approach is to define $|x_j - x^{(0)}_j| = \infty \cdot I(x_j \ne x^{(0)}_j)$, in which case all points in the tube are given the same $X_j$ value as the tube center. This approach is a very sensible one, as it strictly implements the idea of conditioning on $X_j$ when evaluating the effect of another predictor $X_l$, so that $X_j$ does not contribute to any variation in $Y$. The problem with this definition, though, is that when the number of categories of $X_j$ is relatively large compared to the sample size, there will not be enough points in the tube if we restrict $X_j$ to a single category. In such a case we do not restrict $X_j$, which means that $X_j$ can take any category in the tube, and we define $|x_j - x^{(0)}_j| = k_0 I(x_j \ne x^{(0)}_j)$, where $k_0$ is a normalizing constant. The value of $k_0$ is chosen to achieve a balance between numerical variables and categorical variables. Suppose that $X_1$ is a numerical variable with standard deviation one and $X_2$ is a categorical variable. For $X_2$ we set $k_0 \equiv E(|X^{(1)}_1 - X^{(2)}_1|)$, where $X^{(1)}_1$ and $X^{(2)}_1$ are two independent realizations of $X_1$. Note that $k_0 = 2/\sqrt{\pi}$ if $X_1$ follows a standard normal distribution. Also, $k_0$ can be based on the empirical distributions of the numerical variables. Based on the preceding discussion of the choice of $k_0$, we use $k_0 = \infty$ in the simulations and $k_0 = 2/\sqrt{\pi}$ in the real data example.
Remark 3.1 (choice of $k_0$). $k_0 = 2/\sqrt{\pi}$ is used unless the sample size is large enough in the sense explained below. Suppose $k_0 = \infty$; then $D_l(x_{-l}, x^{(0)}_{-l})$ is finite only for those observations with $x_j = x^{(0)}_j$. When there are multiple categorical variables, $D_l$ is finite only for those points, denoted by the set $\Omega^{(0)}$, that satisfy $x_j = x^{(0)}_j$ for all categorical $x_j$. When there are many categorical variables, or when the total sample size is small, the size of this set can be very small or close to zero, in which case the tube cannot be well defined; therefore $k_0 = 2/\sqrt{\pi}$ is used, as in the real data example. On the other hand, when the size of $\Omega^{(0)}$ is larger than the specified tube size, we can use $k_0 = \infty$, as in the simulation examples.
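As noted above, $k_0$ can be based on the empirical distributions of the numerical variables. A sketch of the empirical analogue of $k_0 = E(|X_1^{(1)} - X_1^{(2)}|)$, the mean absolute difference over all ordered pairs of observations (function name hypothetical):

```python
import numpy as np

def empirical_k0(x_num):
    """Empirical analogue of k0 = E|X^(1) - X^(2)| for a standardized
    numerical variable (Section 3.4): the mean absolute difference over
    all ordered pairs of distinct observations. For a standard normal
    variable this targets 2/sqrt(pi) ~= 1.128.
    """
    diffs = np.abs(x_num[:, None] - x_num[None, :])  # all pairwise |x_i - x_j|
    n = len(x_num)
    return diffs.sum() / (n * (n - 1))               # diagonal terms are zero
```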
3.5 The CATCH Algorithm
The variable selection algorithm for a classification problem is based on importance scores and an iteration procedure.

The Algorithm

(0) Standardizing. Standardize all the numerical predictors to have sample mean 0 and sample standard deviation 1 using linear transformations.
(1) Initializing Importance Scores. Let $S = (s_1, \dots, s_d)$ be importance scores for the predictors. Let $S' = (s'_1, \dots, s'_d)$ be such that $s'_l$ is a threshold importance score, which corresponds to the model where $X_l$ is independent of $Y$ given $X_{-l}$. Initialize $S$ to be $(1, \dots, 1)$ and $S'$ to be $(0, \dots, 0)$. Let $M$ be the number of tube centers and $m$ the number of observations within each tube.

(2) Tube Hunting Loop. For $l = 1, \dots, d$, do (a), (b), (c), and (d) below.

(a) Selecting tube centers. For $0 < b \le 1$, set $M = [n^b]$, where $[\,\cdot\,]$ is the greatest integer function, and randomly sample $M$ bootstrap points $X^*_1, \dots, X^*_M$ with replacement from the $n$ observations. Set $k = 1$ and do (b).

(b) Constructing tubes. For $0 < a < 1$, set $m = [n^a]$. Using the tube distance $D_l$ defined in (3.10), select the tube size $\delta$ so that there are exactly $m \ge 10$ observations inside the tube for $X_l$ at $X^*_k$.
(c) Calculating efficacy. If $X_l$ is a numerical variable, let $\hat\zeta_l(X^*_k)$ be the estimated local contingency tube efficacy of $Y$ on $X_l$ at $X^*_k$, as defined in (3.5). If $X_l$ is a categorical variable, let $\hat\zeta_l(X^*_k)$ be the estimated contingency tube efficacy of $Y$ on $X_l$ at $X^*_k$, as defined in (3.7).
(d) Thresholding. Let $Y^*$ denote a random variable which is identically distributed as $Y$ but independent of $X$. Let $(Y^{(1)}_0, \dots, Y^{(n)}_0)$ be a random permutation of $(Y^{(1)}, \dots, Y^{(n)})$; then $Y_0 \equiv (Y^{(1)}_0, \dots, Y^{(n)}_0)$ can be viewed as a realization of $Y^*$. Let $\hat\zeta^*_l(X^*_k)$ be the estimated local contingency tube efficacy of $Y_0$ on $X_l$ at $X^*_k$, calculated as in (c) with $Y_0$ in place of $Y$.

If $k < M$, set $k = k + 1$ and go to (b); otherwise go to (e).
(e) Updating importance scores. Set

$$s^{\text{new}}_l = \frac{1}{M} \sum_{k=1}^{M} \hat\zeta_l(X^*_k), \qquad s'^{\text{new}}_l = \frac{1}{M} \sum_{k=1}^{M} \hat\zeta^*_l(X^*_k).$$

Let $S^{\text{new}} = (s^{\text{new}}_1, \dots, s^{\text{new}}_d)$ and $S'^{\text{new}} = (s'^{\text{new}}_1, \dots, s'^{\text{new}}_d)$ denote the updates of $S$ and $S'$.

(3) Iterations and Stopping Rule. Repeat (2) $I$ times. Stop when the change in $S^{\text{new}}$ is small. Record the last $S$ and $S'$ as $S^{\text{stop}}$ and $S'^{\text{stop}}$, respectively.

(4) Deletion step. Using $S^{\text{stop}}$ and $S'^{\text{stop}}$, delete $X_l$ if $s^{\text{stop}}_l - s'^{\text{stop}}_l < \sqrt{2}\,\mathrm{SD}(s'^{\text{stop}})$. If $d$ is small, we can generate several sets of $S'^{\text{stop}}$ and then use the sample standard deviation of all the values in these sets as $\mathrm{SD}(s'^{\text{stop}})$. See Remark 3.3.

End of the algorithm.
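The deletion step (4) can be sketched as a vectorized rule; here $\mathrm{SD}(s'^{\text{stop}})$ is taken as the sample standard deviation over the components of $S'^{\text{stop}}$, one simple reading of the rule (see Remark 3.3 for alternatives):

```python
import numpy as np

def deletion_mask(s_stop, s_prime_stop):
    """Step (4): delete X_l when s_l^stop - s'_l^stop < sqrt(2) * SD(s'^stop),
    the analogue of the one-standard-deviation rule of CART and GUIDE
    (Remark 3.3). Returns a boolean array, True = keep the variable.
    """
    s = np.asarray(s_stop, float)
    sp = np.asarray(s_prime_stop, float)
    threshold = np.sqrt(2.0) * np.std(sp, ddof=1)  # sample SD of thresholds
    return (s - sp) >= threshold
```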
Remark 3.2. Step (d) needs to be replaced by (d') below when a global permutation of $Y$ does not result in an appropriate "local null distribution." For example, when the sample sizes of the different classes are extremely unequal, which is known as an unbalanced classification problem, it is very likely that a local region contains a pure class of $Y$ after the global permutation, so the threshold $S'$ will be spuriously low and irrelevant variables will be falsely selected into the model. The approach in (d') is more nonparametric but more time consuming.
(d') Local Thresholding. Let $Y^*$ denote a random variable which is distributed as $Y$ but independent of $X_l$ given $X_{-l}$. Let $(Y^{(1)}_0, \dots, Y^{(m)}_0)$ be a random permutation of the $Y$'s inside the tube; then $(Y^{(1)}_0, \dots, Y^{(m)}_0)$ can be viewed as a realization of $Y^*$. If $X_l$ is a numerical variable, let $\hat\zeta^*_l(X^*_k)$ be the estimated local contingency tube efficacy of $Y^*$ on $X_l$ at $X^*_k$, as defined in (3.5). If $X_l$ is a categorical variable, let $\hat\zeta^*_l(X^*_k)$ be the estimated contingency tube efficacy of $Y^*$ on $X_l$ at $X^*_k$, as defined in (3.7).
Remark 3.3 (The threshold). Under the assumption $H^{(l)}_0$ that $Y$ is independent of $X_l$ given $X_{-l}$ in $T_l$, $s_l$ and $s'_l$ are nearly independent and $\mathrm{SD}(s_l - s'_l) \approx \sqrt{2}\,\mathrm{SD}(s'_l)$. Thus the rule that deletes $X_l$ when $s^{\text{stop}}_l - s'^{\text{stop}}_l < \sqrt{2}\,\mathrm{SE}(s'^{\text{stop}})$ is similar to the one-standard-deviation rule of CART and GUIDE. The choice of threshold is also discussed in Section 7. Alternatively, we can calculate a p-value for each covariate by a permutation test: more specifically, we permute the response variable, obtain the importance scores for the permuted data, and then calculate how extreme $s^{\text{stop}}_l$ is compared to the permuted scores. We then identify a covariate as important if its p-value is less than 0.1.
Remark 3.4 (Large $d$, small $n$). A key idea in this paper is that when determining the importance of the variable $X_j$, we restrict (condition) the vector $X_{-j}$ to a relatively small set. This is done to address the problem of spurious correlation, of great concern in genomics (Price et al. [13], Lin and Zeng [11]). In such genomic studies the number of predictors $d$ typically greatly exceeds the sample size $n$. For numerical variables this can be dealt with by replacing $X_{-j}$ with a vector of principal components explaining most of the variation in $X_{-j}$. Although our paper does not deal with the $d \gg n$ case directly, this remark shows that it is also applicable to the genome-wide association studies described in Price et al. [13].
Remark 3.5. We now consider the computational complexity of the algorithm. Recall the notation: $I$ is the number of repetitions, $M$ is the number of tube centers, $m$ is the tube size, $d$ is the dimension, and $n$ is the sample size. The number of operations needed for the algorithm is $I \times d \times M \times (N_{2b} + N_{2c} + N_{2d}) + N_{2e}$, where $N_{2b}$, $N_{2c}$, $N_{2d}$, $N_{2e}$ denote the numbers of operations required for steps (2b), (2c), (2d), and (2e). The computational complexities of steps (2b), (2c), (2d), and (2e) are $O(nd)$, $O(m)$, $O(m)$, and $O(M)$, respectively. Since $nd \gg m$, the required computation is dominated by step (2b), and the computational complexity of the algorithm is $O(IMnd^2)$. Solving a least squares problem requires $O(nd^2)$ operations, so our algorithm has higher computational complexity by a factor of $IM$ compared to least squares. However, the computations for the $M$ tube centers are independent of each other and can be performed in parallel.
4 Simulation Studies
4.1 Variable Selection with Categorical and Numerical Predictors
Example 4.1. Consider $d \ge 5$ predictors $X_1, X_2, \dots, X_d$ and a categorical response variable $Y$. The fifth predictor $X_5$ is a Bernoulli random variable which takes the two values 0 and 1 with equal probability. All other predictors are uniformly distributed on $[0, 1]$.
When $X_5 = 0$, the dependent variable $Y$ is related to the predictors in the following fashion:

$$Y = \begin{cases} 0 & \text{if } X_1 - X_2 - 0.25\sin(16\pi X_2) \le -0.5, \\ 1 & \text{if } -0.5 < X_1 - X_2 - 0.25\sin(16\pi X_2) \le 0, \\ 2 & \text{if } 0 < X_1 - X_2 - 0.25\sin(16\pi X_2) \le 0.5, \\ 3 & \text{if } X_1 - X_2 - 0.25\sin(16\pi X_2) > 0.5. \end{cases} \quad (4.1)$$

When $X_5 = 1$, $Y$ depends on $X_3$ and $X_4$ in the same way as above, with $X_1$ replaced by $X_3$ and $X_2$ replaced by $X_4$. The other variables $X_6, \dots, X_d$ are irrelevant noise variables. All $d$ predictors are independent.
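A sketch for simulating one data set from model (4.1) (function name hypothetical; columns are 0-indexed, so $X_1$ is column 0 and $X_5$ is column 4):

```python
import numpy as np

def simulate_example_41(n, d, rng=None):
    """Draw one sample from the model of Example 4.1, equation (4.1).

    X5 is Bernoulli(1/2); all other predictors are Uniform[0,1] and all
    d predictors are independent. When X5 = 0, Y is cut from
    g = X1 - X2 - 0.25*sin(16*pi*X2) at -0.5, 0, 0.5; when X5 = 1,
    (X3, X4) take the roles of (X1, X2).
    """
    rng = np.random.default_rng(rng)
    X = rng.uniform(size=(n, d))
    X[:, 4] = rng.integers(0, 2, size=n)              # X5 in {0, 1}
    a = np.where(X[:, 4] == 0, X[:, 0], X[:, 2])      # X1 or X3
    b = np.where(X[:, 4] == 0, X[:, 1], X[:, 3])      # X2 or X4
    g = a - b - 0.25 * np.sin(16 * np.pi * b)
    # right=True puts boundary values in the lower class, matching (4.1)
    y = np.digitize(g, [-0.5, 0.0, 0.5], right=True)  # classes 0..3
    return X, y
```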
In addition to the presence of the noise variables $X_6, \dots, X_d$, this example is challenging in two respects. First, there is an interaction effect between the continuous predictors $X_1, \dots, X_4$ and the categorical predictor $X_5$. Such an interaction effect may arise in practice: for example, suppose $Y$ represents the effect resulting from multiple treatments $X_1, \dots, X_4$, and that $X_5$ represents the gender of a patient; our model allows males and females to respond differently to treatments. Second, the classification boundaries for fixed $X_5$ are highly nonlinear, as shown in Figure 1. This design demonstrates that our method can detect nonlinear dependence between the response and the predictors.
Table 2 shows the number of simulations in which the predictors $X_1, \dots, X_5$ are correctly kept in the model and, in column 6, the average number of simulations in which the $d - 5$ irrelevant predictors are falsely selected into the model, out of 100 simulation trials from model (4.1). In the simulation, $n = 400$ and $d = 10$, 50, and 100 are used. We compare CATCH with other methods, including Marginal CATCH, the Random Forest classifier (RFC) (Breiman [1]), and GUIDE (Loh [12]). For Marginal CATCH, we test the association between $Y$ and $X_l$ without conditioning on $X_{-l}$. RFC is well known as a powerful tool for prediction; here we use importance scores from RFC for variable selection. We create 30 additional noise variables and use their average importance score multiplied by a factor of 2 as a threshold (Doksum et al. [3]) to decide whether a variable should be selected into the model. In particular, we generate the noise variables using uniform and Bernoulli distributions for numerical and categorical predictors, respectively.
The main goal of GUIDE is to solve the variable selection bias problem in regression and classification tree methods. As an illustration, if a categorical predictor takes $K$ distinct values, then the splitting test exhausts $2^{K-1} - 1$ possibilities and therefore is
Table 2: Simulation results from Example 4.1 with sample size $n = 400$. For $1 \le i \le 5$, the column for $X_i$ gives the estimated probability of selection of the relevant variable $X_i$ and its standard error, based on 100 simulations. The last column gives the mean of the estimated probability of selection and the standard error for the irrelevant variables $X_6, \dots, X_d$. Number of selections (100 simulations).

METHOD          | d   | X1          | X2          | X3          | X4          | X5          | Xi, i >= 6
CATCH           | 10  | 0.82 (0.04) | 0.75 (0.04) | 0.78 (0.04) | 0.82 (0.04) | 1 (0)       | 0.05 (0.02)
CATCH           | 50  | 0.76 (0.04) | 0.76 (0.04) | 0.87 (0.03) | 0.84 (0.04) | 0.96 (0.02) | 0.08 (0.03)
CATCH           | 100 | 0.77 (0.04) | 0.73 (0.04) | 0.82 (0.04) | 0.80 (0.04) | 0.83 (0.04) | 0.09 (0.03)
Marginal CATCH  | 10  | 0.77 (0.04) | 0.67 (0.05) | 0.70 (0.05) | 0.80 (0.04) | 0.06 (0.02) | 0.10 (0.03)
Marginal CATCH  | 50  | 0.69 (0.05) | 0.70 (0.05) | 0.76 (0.04) | 0.79 (0.04) | 0.07 (0.03) | 0.10 (0.03)
Marginal CATCH  | 100 | 0.78 (0.04) | 0.76 (0.04) | 0.73 (0.04) | 0.73 (0.04) | 0.12 (0.03) | 0.10 (0.03)
Random Forest   | 10  | 0.71 (0.05) | 0.51 (0.05) | 0.47 (0.05) | 0.59 (0.05) | 0.37 (0.05) | 0.06 (0.02)
Random Forest   | 50  | 0.64 (0.05) | 0.61 (0.05) | 0.65 (0.05) | 0.58 (0.05) | 0.28 (0.04) | 0.08 (0.03)
Random Forest   | 100 | 0.66 (0.05) | 0.58 (0.05) | 0.59 (0.05) | 0.65 (0.05) | 0.17 (0.04) | 0.11 (0.03)
GUIDE           | 10  | 0.96 (0.02) | 0.98 (0.01) | 0.93 (0.03) | 0.99 (0.01) | 0.92 (0.03) | 0.33 (0.05)
GUIDE           | 50  | 0.82 (0.04) | 0.85 (0.04) | 0.90 (0.03) | 0.89 (0.03) | 0.67 (0.05) | 0.12 (0.03)
GUIDE           | 100 | 0.83 (0.04) | 0.84 (0.04) | 0.86 (0.03) | 0.83 (0.04) | 0.61 (0.05) | 0.08 (0.03)
Figure 1: The relationship between $Y$ and $X_1, X_2$ for model (4.1). The four areas represent the four values of $Y$.
biased toward being selected. The way that GUIDE alleviates this bias is to employ lack-of-fit tests, i.e., $\chi^2$-tests applicable to both numerical and categorical predictors, with the p-values adjusted by a bootstrap procedure. Since GUIDE has negligible selection bias, it can include tests for local pairwise interactions. Furthermore, it achieves sensitivity to local curvature by using a linear model at each node.
The results in Table 2 show that Marginal CATCH does a very poor job of selecting X5. This illustrates the importance of local conditioning. RFC does better than Marginal CATCH but also has difficulties, while CATCH and GUIDE successfully identify X5 along with X1 to X4 with high frequency. When d = 10, GUIDE does very well for the relevant variables but selects too many irrelevant variables. Surprisingly, GUIDE selects fewer irrelevant variables as d increases; it becomes more cautious as the dimension grows. CATCH performs very well in the high-dimensional cases (d = 50 and d = 100). This indicates that the CATCH algorithm is robust to the intricate interaction structures between the predictors.
In model (4.1) above, the predictors interact in a hierarchical way. When X5 = 0, the response is related to X1 and X2 as shown in Figure 1, where the two axes are the two relevant predictors and the four areas partitioned by the curves correspond to different values of Y. When X5 = 1, the response depends on X3 and X4 in the same way. This type of hierarchical interaction between categorical and numerical predictors is difficult to detect even by state-of-the-art classification methods such as the support vector machine (SVM) and RFC. In particular, RFC produces an importance
score vector that identifies the four numerical variables X1, ..., X4 as very significant, but often leaves the categorical variable X5 unidentified, as we see in Table 2.
The CATCH algorithm, however, performs very well for this complicated model. We now explain why CATCH is able to identify X5 as an important variable. To explore the association between X5 and Y, we construct M contingency tables of 2 rows and 4 columns based on M randomly centered tubes, calculate contingency efficacy scores, and then average the scores. Figure 2 (a) and (b) illustrate how one of the contingency efficacy scores is calculated. 1000 data points are generated from model (4.1) and are split into two sets according to x5: (a) shows approximately half of the points, those with x5 = 0, in the (x1, x2) dimensions, and (b) shows the remaining points, those with x5 = 1, in the (x3, x4) dimensions. The data points are colored according to Y. This representation lets us clearly see the classification boundaries (grey lines). The randomly selected tube center, for which x5 is zero, is shown as a star in (a). The data points within the tube, i.e., close to the tube center in terms of X−5, are highlighted by larger dots. Since the distance defining the tube does not involve X5, the larger dots have x5 equal to 0 or 1 and appear in both (a) and (b). The counts of the larger dots in different colors in (a) and (b) correspond to the cell counts of the two rows x5 = 0 and x5 = 1 in the contingency table. As we can see, larger dots in (a) are mostly red and larger dots in (b) are mostly blue and orange, which shows that the distribution of Y depends on the value of X5, so the contingency efficacy score is high. We expect the contingency efficacy score to be low only if the tube center has similar values in (x1, x2) and (x3, x4); such tube centers are very few among the M randomly chosen tube centers. Thus the average of the contingency efficacy scores is high and X5 is identified as an important variable.
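The tube-and-table computation just described can be sketched in a few lines. The sketch below is a simplified stand-in for one local contingency efficacy score, not the authors' full algorithm: it fixes a single tube, uses Euclidean distance on the coordinates other than Xj, takes the m nearest points as the tube, and returns the root of the chi-square statistic scaled by the tube size.

```python
import numpy as np

def tube_efficacy(X, y, j, center, m):
    """Sketch of one local contingency efficacy score for a binary X_j.

    Distances ignore coordinate j; the m nearest points form the tube,
    and the score is sqrt(chi-square / m) for the 2 x C table whose rows
    are X_j = 0/1 and whose columns are the classes of y.
    """
    mask = np.ones(X.shape[1], dtype=bool)
    mask[j] = False
    dist = np.linalg.norm(X[:, mask] - center[mask], axis=1)
    tube = np.argsort(dist)[:m]                 # indices of points inside the tube
    rows, cols = X[tube, j].astype(int), y[tube]
    classes = np.unique(y)
    n_rc = np.array([[np.sum((rows == r) & (cols == c)) for c in classes]
                     for r in (0, 1)])
    E = np.outer(n_rc.sum(1), n_rc.sum(0)) / m  # expected counts under independence
    with np.errstate(divide="ignore", invalid="ignore"):
        chi2 = np.nansum(np.where(E > 0, (n_rc - E) ** 2 / E, 0.0))
    return np.sqrt(chi2 / m)

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 3))
X[:, 2] = rng.integers(0, 2, size=500)
y = X[:, 2].astype(int)                         # Y fully determined by X_3
print(tube_efficacy(X, y, 2, X[0], 100))        # near 1: strong local dependence
```

When Y is unrelated to Xj inside the tube the table is close to independent and the score is close to 0; averaging such scores over many random tube centers gives the global importance score.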
To contrast this example with a simpler data-generating model, suppose the effects of the predictors are additive, that is, the categorical variable enters the response as a term added to the effect of the numerical predictors. For such simpler models the RFC and GUIDE algorithms can efficiently identify relevant predictors, but so can the CATCH algorithm. We do not include the simulation results for additive models here.
Remark 4.1. We observe that the CATCH algorithm is not very sensitive to the choice of the number of tube centers M or the tube size m = [na], where 0 < a ≤ 1 and [ ] is the greatest integer function. For Example 4.1 we perform the simulation using M = n and M = n/2, and a = 0.1, 0.15, 0.2, 0.3, 0.5. The results are summarized in Table 3. The CATCH algorithm performs similarly for M = n and M = n/2, with M = n having a small winning edge. We generally suggest using M = n if computational resources are not a concern or n is not large (≤ 200). When computational cost is a concern (Remark 3.5), one can reduce the number of tube centers to M = n/2 as a compromise between detection power and computational cost. Table 3 also shows that the tube size works well for a in the range between 0.1 and 0.3, while the detection power for X5 decreases for a = 0.5, where the tube is too big to achieve local conditioning. This observation is consistent with the comparison to Marginal CATCH in Table 2. On the other hand, the tube size should not be too small, so that sufficient detection power is achieved. We suggest using a = 0.1 or 0.15 for sample sizes n ≥ 400, or m = 50 for smaller sample sizes. In light of these guidelines, we use M = n/2 = 200 and a = 0.15 for Examples 4.1-4.3, and M = n = 200 and m = 50 for Example 4.4.
Table 3: Sensitivity study of CATCH for Example 4.1 (n = 400, d = 50) using different numbers of tube centers M and tube sizes m = [na]. For 1 ≤ i ≤ 5, the ith column gives the number of times Xi was selected. Column 6 gives the number of times irrelevant variables were selected.

Number of Selections (100 simulations)
M        a     X1  X2  X3  X4  X5  Xi, i >= 6
M = 200  0.10  75  67  77  69  92  78
         0.15  76  76  87  84  96  82
         0.20  85  84  90  87  93  99
         0.30  89  87  95  89  85  96
         0.50  89  90  96  93  62  99
M = 400  0.10  74  76  82  77  92  72
         0.15  85  82  89  87  94  86
         0.20  87  88  90  86  96  91
         0.30  88  89  91  91  92  94
         0.50  89  91  94  93  61  102
In Example 4.1 all predictors are independent. In a similar setting, we next consider the case where the predictors are dependent.
Example 4.2. We generate standard normal random variables Zi, i = 1, ..., d, in such a way that cor(Z1, Z4) = 0.5, cor(Z3, Z6) = 0.9, and the other Zi's are independent. We set Xi = Φ(Zi), i = 1, ..., d, where Φ(·) is the CDF of a standard normal; therefore the Xi are marginally uniformly distributed, as in Example 4.1. The correlation structure of the Xi's is similar to that of the Zi's. The dependent variable Y is generated in the same way as in Example 4.1.
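A sketch of this predictor-generation step is below (the response mechanism of Example 4.1 is not reproduced here, and the 0-based column indices are our own convention):

```python
import numpy as np
from math import erf, sqrt

def draw_example42_predictors(n, d, rng):
    """Draw n rows of the Example 4.2 predictors.

    cor(Z1, Z4) = 0.5 and cor(Z3, Z6) = 0.9; all other Z's are independent
    standard normals. X_i = Phi(Z_i) is then marginally Uniform(0, 1) with
    a similar correlation structure.
    """
    cov = np.eye(d)
    cov[0, 3] = cov[3, 0] = 0.5   # cor(Z1, Z4), 0-based indices
    cov[2, 5] = cov[5, 2] = 0.9   # cor(Z3, Z6)
    Z = rng.multivariate_normal(np.zeros(d), cov, size=n)
    Phi = np.vectorize(lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0))))
    return Phi(Z)

X = draw_example42_predictors(2000, 10, np.random.default_rng(1))
```

Note that the Pearson correlation of the transformed uniforms is slightly attenuated relative to that of the underlying normals (6/π · arcsin(ρ/2) rather than ρ), which is why the text says the structures are "similar" rather than identical.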
The results of Example 4.2 are given in Table 4. Comparing with the results in Table 2, both CATCH and GUIDE do well in the presence of dependence between the numerical variables. The explanation is two-fold. First, both methods identify the categorical X5 as an important predictor most of the time, which makes the identification of the other relevant predictors possible. Second, conditioning on X5, Y depends on a pair of variables ("X1 and X2" or "X3 and X4") jointly, and in particular this dependence is nonlinear, which makes both methods less affected by the correlation introduced here.
Figure 2: Sample scatter plots of (x1, x2) and (x3, x4) for simulated data using model (4.1). The left panel, (a) X5 = 0, includes all the pairs (x1, x2) with x5 = 0, while the right panel, (b) X5 = 1, includes all the pairs (x3, x4) with x5 = 1. The larger dots in the two plots are within the tube around a tube center, shown as the star.
We now explain the latter in detail for both methods. In CATCH, the algorithm adaptively weighs the effects of predictors by their importance scores when defining the tubes they are conditioned on; therefore identifying either of the two relevant variables helps to identify the other, whereas an irrelevant variable, though strongly associated with one of the relevant variables, is down-weighted iteratively and does not strongly affect the selection of the relevant variable. This explains the slightly decreasing selection probability for X3 in the presence of the strongly correlated X6 and, consequently, the overall lower selection of X3 and X4 compared to X1 and X2, given the correlation between X1 and X4. In GUIDE, the difference between the dependent and independent cases is even smaller, because GUIDE is very powerful in detecting local interactions by literally including pairwise interactions when selecting splitting variables.
Marginal CATCH and Random Forest do poorly in this case, mainly because X5 cannot be detected, as we explained earlier. More specifically, the dependence between Y and X4 in the X5 = 1 case is the same as that between Y and X2 in the X5 = 0 case. The linear terms of X1 and X2 have opposite signs in the decision function; therefore the marginal effects of X1 and X4 are obscured when X5 is not identified as an important variable.
4.2 Improving The Performance of SVM and RF Classifiers
The criterion used in the CATCH algorithm is not based on classification accuracy. But if we use CATCH to screen out the irrelevant variables and only use the important
Table 4: Simulation results from Example 4.2 with sample size n = 400. For 1 ≤ i ≤ 6, the ith column gives the estimated probability of selection of variable Xi and the standard error based on 100 simulations, where Xi, i = 1, ..., 5, are relevant variables with cor(X1, X4) = 0.5; X6 is an irrelevant variable that is correlated with X3 with correlation 0.9. Column 7 gives the mean of the estimated probability of selection and that of the standard error for the irrelevant variables X7, ..., Xd.

Number of Selections (100 simulations)
METHOD           d    X1          X2          X3          X4          X5          X6          Xi, i >= 7
CATCH            10   0.76(0.04)  0.77(0.04)  0.73(0.04)  0.68(0.05)  0.99(0.01)  0.34(0.05)  0.07(0.02)
                 50   0.78(0.04)  0.77(0.04)  0.74(0.04)  0.63(0.05)  0.92(0.03)  0.57(0.05)  0.09(0.03)
                 100  0.76(0.04)  0.77(0.04)  0.8(0.04)   0.74(0.04)  0.82(0.04)  0.55(0.05)  0.09(0.03)
Marginal CATCH   10   0.35(0.05)  0.63(0.05)  0.8(0.04)   0.32(0.05)  0.09(0.03)  0.71(0.05)  0.12(0.03)
                 50   0.34(0.05)  0.71(0.05)  0.7(0.05)   0.3(0.05)   0.1(0.03)   0.6(0.05)   0.11(0.03)
                 100  0.34(0.05)  0.68(0.05)  0.75(0.04)  0.3(0.05)   0.1(0.03)   0.59(0.05)  0.1(0.03)
Random Forest    10   0.29(0.05)  0.48(0.05)  0.55(0.05)  0.2(0.04)   0.3(0.05)   0.47(0.05)  0.05(0.02)
                 50   0.26(0.04)  0.52(0.05)  0.57(0.05)  0.3(0.05)   0.24(0.04)  0.45(0.05)  0.08(0.03)
                 100  0.36(0.05)  0.61(0.05)  0.66(0.05)  0.37(0.05)  0.32(0.05)  0.46(0.05)  0.12(0.03)
GUIDE            10   0.96(0.02)  0.96(0.02)  0.94(0.02)  0.95(0.02)  0.96(0.02)  0.84(0.04)  0.3(0.05)
                 50   0.8(0.04)   0.92(0.03)  0.88(0.03)  0.79(0.04)  0.72(0.04)  0.68(0.05)  0.12(0.03)
                 100  0.81(0.04)  0.89(0.03)  0.9(0.03)   0.76(0.04)  0.75(0.04)  0.58(0.05)  0.08(0.03)
variables for classification, the performance of other algorithms, such as the Support Vector Machine (SVM) and Random Forest Classifier (RFC), can be significantly improved. In this section we compare the prediction performance of (1) SVM, (2) RFC, (3) CATCH followed by SVM, and (4) CATCH followed by RFC. We label (3) and (4) CATCH+SVM and CATCH+RFC, respectively. We use the "svm" function in the "e1071" package in R with the radial kernel to perform the SVM analysis, and the "randomForest" function in the "randomForest" package to perform the RFC analysis. We use simulation experiments to illustrate the improvement obtained by using CATCH prior to classifying Y.
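As a sketch, the screen-then-classify pipeline can be written with scikit-learn stand-ins for the R functions mentioned above (SVC with an RBF kernel in place of e1071's "svm", RandomForestClassifier in place of "randomForest"). The `keep` argument is an assumed input standing in for the variable set retained by CATCH, whose computation is not reproduced here.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

def screen_then_classify(X_train, y_train, X_test, keep, base="svm"):
    """Fit a classifier on the screened columns only and predict the test set.

    `keep` plays the role of the CATCH-selected variable set; passing all
    columns recovers the plain SVM / RFC baselines.
    """
    if base == "svm":
        clf = SVC(kernel="rbf")                       # radial-kernel SVM
    else:
        clf = RandomForestClassifier(random_state=0)  # random forest classifier
    clf.fit(X_train[:, keep], y_train)
    return clf.predict(X_test[:, keep])

# Toy check: only the first two of 20 columns carry signal.
rng = np.random.default_rng(0)
X = rng.uniform(size=(600, 20))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)
pred = screen_then_classify(X[:400], y[:400], X[400:], keep=[0, 1])
```

Training on the screened columns only is what shields the kernel and the forest from the noise coordinates as d grows.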
Let the true model be
(X, Y) ∼ P,  (4.2)
where P will be specified later. In each of 100 Monte Carlo experiments, n pairs (X(i), Y(i)), i = 1, ..., n, are generated from the true model (4.2). Based on the simulated data (X(i), Y(i)), i = 1, ..., n, the methods (1) SVM, (2) RFC, (3) CATCH+SVM, and (4) CATCH+RFC are used to classify a future Y(0) for which we have available a corresponding X(0).
The integrated misclassification rate (IMR; Friedman [6]), defined below, is used to evaluate the performance of the classification algorithms. Let F_{X,Y}(·) be the classifier constructed by a classification algorithm based on X = (X(1), ..., X(n)) and Y = (Y(1), ..., Y(n))^T.
Let (X(0), Y(0)) be a future observation with the same distribution as (X(i), Y(i)). We only know X(0) = x(0) and want to classify Y(0). For a possible future observation (x(0), y(0)), define
MR(X, Y, x(0), y(0)) = P(F_{X,Y}(x(0)) ≠ y(0) | X, Y).  (4.3)
Our criterion is
IMR = E E0 [MR(X, Y, x(0), y(0))],  (4.4)
where E0 is taken with respect to the distribution of the random (x(0), y(0)), and E with respect to the distribution of (X, Y).
This double expectation is evaluated using Monte Carlo simulation; the result is denoted IMR*. More precisely, E0 is estimated using 5000 simulated (x(0), y(0)) observations; IMR* is then computed on the basis of 100 simulated samples (X_i, Y_i), i = 1, ..., 100.
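In code, the IMR* computation is a nested Monte Carlo loop. The sketch below assumes generic `draw_xy` (a sampler from the true model P) and `fit` (a training procedure returning a classifier) interfaces, with smaller replication counts than the paper's 100 and 5000:

```python
import numpy as np

def imr_star(draw_xy, fit, n, reps, n_future, seed=0):
    """Monte Carlo estimate of IMR in (4.4).

    E0 is approximated by the error rate on `n_future` fresh draws from P;
    E is approximated by averaging over `reps` simulated training sets.
    """
    rng = np.random.default_rng(seed)
    rates = []
    for _ in range(reps):
        X, y = draw_xy(n, rng)            # training sample from P
        classify = fit(X, y)              # returns a function: X-matrix -> labels
        X0, y0 = draw_xy(n_future, rng)   # future observations from P
        rates.append(np.mean(classify(X0) != y0))
    return float(np.mean(rates))

# Toy model: Y is a deterministic function of X1 and `fit` returns the Bayes
# rule, so the estimated IMR* is exactly 0.
def draw(n, rng):
    X = rng.uniform(size=(n, 2))
    return X, (X[:, 0] > 0.5).astype(int)

def bayes(X, y):
    return lambda X0: (X0[:, 0] > 0.5).astype(int)

print(imr_star(draw, bayes, n=50, reps=5, n_future=500))  # 0.0
```

Replacing `fit` with an SVM or random forest trainer, with or without a screening step, reproduces the comparison carried out in Examples 4.3 and 4.4.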
Example 4.3. First consider model (4.1) in Section 4.1. Figure 3 shows IMR* versus the number of irrelevant variables for this example. The left panel shows that the IMR*'s of the SVM approach increase dramatically as the number of irrelevant variables increases, while the IMR*'s of CATCH+SVM are stable. Thus CATCH stabilizes the
performance of SVM for this complex model. The right panel shows that CATCH also stabilizes the performance of RFC. These results indicate that CATCH can improve the prediction accuracy of other classification algorithms.
Figure 3: IMR*'s of SVM, RFC, CATCH+SVM, and CATCH+RFC versus the number of predictors for the simulation study of model (4.1) in Example 4.1. The sample size is n = 400.
Example 4.4 (Contaminated Logistic Model). Consider a binary response Y ∈ {0, 1} with

P(Y = 0 | X = x) = F(x) if r = 0,  G(x) if r = 1,  (4.5)

where x = (x1, ..., xd),

r ∼ Bernoulli(γ),  (4.6)

F(x) = (1 + exp{−(0.25 + 0.5 x1 + x2)})^{−1},  (4.7)

G(x) = 0.1 if 0.25 + 0.5 x1 + x2 < 0,  0.9 if 0.25 + 0.5 x1 + x2 ≥ 0,  (4.8)

and x1, x2, ..., xd are realizations of independent standard normal random variables. Hence, given X(i) = x(i), 1 ≤ i ≤ n, the joint distribution of Y(1), Y(2), ..., Y(n) is the contaminated logistic model with contamination parameter γ:
(1 − γ) ∏_{i=1}^{n} [F(x(i))]^{Y(i)} [1 − F(x(i))]^{1−Y(i)} + γ ∏_{i=1}^{n} [G(x(i))]^{Y(i)} [1 − G(x(i))]^{1−Y(i)}.  (4.9)
We perform a simulation study for n = 200, d = 50, and γ = 0, 0.2, 0.4, 0.6, 0.8, and 1, where γ = 0 corresponds to no contamination. Figure 4 shows the IMR*'s versus the different values of γ for this example. CATCH improves the performance of the SVM and Random Forest classifiers for this contaminated logistic regression model.
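A generator for this model follows (4.5)-(4.8) directly; the variable names in the sketch below are our own:

```python
import numpy as np

def draw_contaminated_logistic(n, d, gamma, rng):
    """Draw (X, Y) from the contaminated logistic model (4.5)-(4.8).

    With probability 1 - gamma, P(Y = 0 | x) follows the logistic law F;
    with probability gamma, the contaminating law G. Both depend on x only
    through the linear score 0.25 + 0.5*x1 + x2.
    """
    X = rng.standard_normal((n, d))             # independent standard normals
    score = 0.25 + 0.5 * X[:, 0] + X[:, 1]      # linear score in (4.7)-(4.8)
    F = 1.0 / (1.0 + np.exp(-score))            # (4.7)
    G = np.where(score < 0.0, 0.1, 0.9)         # (4.8)
    r = rng.random(n) < gamma                   # r ~ Bernoulli(gamma), (4.6)
    p0 = np.where(r, G, F)                      # P(Y = 0 | x), (4.5)
    Y = (rng.random(n) >= p0).astype(int)       # Y = 0 with probability p0
    return X, Y

X, Y = draw_contaminated_logistic(200, 50, gamma=0.4,
                                  rng=np.random.default_rng(0))
```

Because G flattens the class probabilities to 0.1 / 0.9 on the two sides of the decision boundary, increasing γ keeps the Bayes boundary in place while changing the noise level, which is what makes the model a useful robustness stress test.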
Figure 4: IMR*'s of SVM, RFC, CATCH+SVM, and CATCH+RFC versus different contamination parameters γ for simulation model (4.9) in Example 4.4.
5 CATCH Analysis of The ANN Data
In this section CATCH is applied to a data set available at the UCI database (ftp://ftp.ics.uci.edu/pub/machine-learning-databases/thyroid-disease). The data are from a clinical trial to determine whether a patient is hypothyroid (see, e.g., Quinlan [15]). The dataset is split into two parts: a training set with 3772 observations and a testing set with 3428 observations. The response is a categorical variable with three classes: normal (not hypothyroid), hyperfunction, and subnormal functioning. There are 21 predictors, including 15 categorical predictors and 6 numerical predictors. The three classes are very imbalanced, with 92.5% of the patients normal. A good classifier should produce a significant decrease from the 7.5% error rate that can be obtained by classifying all observations into the normal class.
CATCH identifies 7 variables as important, including 4 continuous variables and 3 categorical variables. We consider three classification methods: 1. Support Vector Machine with radial kernel, denoted SVM(r); 2. Support Vector Machine with polynomial kernel, denoted SVM(p); 3. RFC. We apply these three methods to the training set to build classification models and use the test set to estimate their misclassification rates. We apply the methods with CATCH and without CATCH, where "with CATCH" means that only the 7 important variables identified by CATCH are used to build the model, and "without CATCH" means that all 21 variables are used in the analysis.
Table 5: Misclassification rates of three classification methods (SVM(r), SVM(p), RFC) with CATCH and without CATCH.

              SVM(r)  SVM(p)  RFC
w/o CATCH     0.052   0.063   0.024
with CATCH    0.037   0.055   0.015
Table 5 shows that if CATCH is used to select important variables before the classification methods are applied, the misclassification rates decrease. CATCH reduces the misclassification rates by 29%, 13%, and 37% for SVM(r), SVM(p), and RFC, respectively. CATCH+RFC has the best performance. Although CATCH is not designed to improve classification accuracy, it nonetheless helps other classification methods perform better.
6 Properties of Local Contingency Efficacy
In this section we investigate the properties of the local contingency efficacy. For a univariate continuous predictor, Theorem 6.1 below shows that the local contingency efficacy is well defined and equals 0 if the response variable is independent of the predictor. Consider an R × C contingency table. Let n_{rc}, r = 1, ..., R, c = 1, ..., C, be the observed frequency in the (r, c)-cell, and let p_{rc} be the probability that an observation belongs to the (r, c)-cell. Let
p_r = \sum_{c=1}^{C} p_{rc}, \qquad q_c = \sum_{r=1}^{R} p_{rc},  (6.1)
and

n_{r·} = \sum_{c=1}^{C} n_{rc}, \quad n_{·c} = \sum_{r=1}^{R} n_{rc}, \quad n = \sum_{r=1}^{R} \sum_{c=1}^{C} n_{rc}, \quad E_{rc} = n_{r·} n_{·c} / n.  (6.2)
Consider testing that the rows and the columns are independent, i.e.,

H_0: p_{rc} = p_r q_c for all r, c  versus  H_1: p_{rc} ≠ p_r q_c for some r, c;  (6.3)
a standard test statistic is

X^2 = \sum_{r=1}^{R} \sum_{c=1}^{C} (n_{rc} − E_{rc})^2 / E_{rc}.  (6.4)
Under H_0, X^2 converges in distribution to χ^2_{(R−1)(C−1)}. Moreover, defining 0/0 = 0, we have the well known result:
Lemma 6.1. As n tends to infinity, X^2/n tends in probability to

τ^2 ≡ \sum_{r=1}^{R} \sum_{c=1}^{C} (p_{rc} − p_r q_c)^2 / (p_r q_c).

Moreover, if p_r q_c is bounded away from 0 and 1 as n → ∞, and if R and C are fixed as n → ∞, then τ^2 = 0 if and only if H_0 is true.
Remark 6.1. Lemma 6.1 shows that the contingency efficacy ζ in (2.8) is well defined. Moreover, ζ = 0 if and only if X and Y are independent.
Theorem 6.1. Consider a categorical response Y and a continuous predictor X with density function f(x) such that, for a fixed h > 0, f(x) > 0 for x ∈ (x(0) − h, x(0) + h). Then:

(1) For fixed h, the estimated local section efficacy ζ̂(x(0), h) defined in (2.5) converges almost surely to the section efficacy ζ(x(0), h) defined in (2.4).

(2) If X and Y are independent, then the local efficacy ζ(x(0)) defined in (2.4) is zero.

(3) If there exists c ∈ {1, ..., C} such that p_c(x) = P(Y = c | x) has a continuous derivative at x = x(0) with p′_c(x(0)) ≠ 0, then ζ(x(0)) > 0.
The χ^2 statistic (6.4) can be written in terms of multinomial proportions p̂_{rc}, with √n(p̂_{rc} − p_{rc}), 1 ≤ r ≤ R, 1 ≤ c ≤ C, converging in distribution to a multivariate normal. By a Taylor expansion of T_n ≡ √(χ^2/n) at τ = plim √(χ^2/n), we find that
Theorem 6.2.

√n (T_n − τ) →_d N(0, v^2),  (6.5)

where the asymptotic variance v^2, which can be obtained from the Taylor expansion, is not needed in this paper.
Remark 6.2. The result (6.5) applies to the multivariate χ^2-based efficacies ζ̂, with n replaced by the number of observations that go into the computation of ζ̂. In this case we use the notation χ^2_j for the chi-square statistic for the jth variable.
7 Consistency Of CATCH
Let Y denote a variable to be classified into one of the categories 1, ..., C by using the Xj's in the vector X = (X1, ..., Xd). Let X−j ≡ X − Xj denote X without the Xj component. Assume that the squared Pearson nonparametric correlation (Doksum and Samarov [4]; Huang and Chen [9]) between Xj and X−j is less than 0.5. We call the variable Xj relevant for Y if, conditionally given X−j, L(Y | Xj = x) is not constant in x, and irrelevant otherwise. Our data (X(1), Y(1)), ..., (X(n), Y(n)) are i.i.d. as (X, Y). A variable selection procedure is consistent if the probability that all the relevant variables are selected and none of the irrelevant variables are selected tends to one as n tends to infinity.
We give a general nonparametric framework in which the variable selection rule based on the CATCH importance scores sj given in (3.8) is consistent.
When the predictors are not all categorical and CATCH involves tubes, we assume that the number of observations in each tube tends to ∞ as n → ∞, and that, for those importance scores that require the selection of a section size h, the selection is from a finite set {hj} with min_j{hj} > h0 for some h0 > 0. It follows that the number of observations m_{jk} in the kth tube used to calculate the importance score sj for the variable Xj tends to infinity as n → ∞.
The key quantities that determine consistency are the limits λj of the sj. That is,

λ_j = τ(ζ_j, P) = E_P[ζ_j(X)] = ∫ ζ_j(x) dP(x), j = 1, ..., d,
where ζ_j(x) is the local contingency efficacy defined by (2.4) for numerical X's and by (2.8) for categorical X's, and P is the probability distribution of X. Our estimate s_j of λ_j is

s_j = τ(ζ̂_j, P̂) = (1/M) \sum_{k=1}^{M} ζ̂_j(X*_k),

where ζ̂_j is defined in Section 3 and P̂ is the empirical distribution of X.

Let d_0 denote the number of X's that are irrelevant for Y. Set d_1 = d − d_0 and, without loss of generality, reorder the λ_j so that
λ_j > 0 if j = 1, ..., d_1;  λ_j = 0 if j = d_1 + 1, ..., d.  (7.1)
Our variable selection rule is

"keep variable X_j if s_j ≥ t", for some threshold t.  (7.2)
Rule (7.2) is consistent if and only if

P(min_{j ≤ d_1} s_j ≥ t and max_{j > d_1} s_j < t) → 1.  (7.3)

To establish (7.3) it is enough to show that

P(max_{j > d_1} s_j > t) → 0  (7.4)

and

P(min_{j ≤ d_1} s_j ≤ t) → 0.  (7.5)
We call (7.4) and (7.5) type I and type II consistency, respectively. Type I and type II error probabilities refer to the probabilities of any false positives and any false negatives, respectively.
7.1 Type I consistency
We consider (7.4) first and note that

P(max_{j > d_1} s_j > t) ≤ \sum_{j = d_1 + 1}^{d} P(s_j > t).  (7.6)
For the jth irrelevant variable X_j, j > d_1, the results in Section 6 apply to the local efficacy ζ̂_j(x) at a fixed point x. The importance score s_j defined in (3.8) is the average of such efficacies over a sample of M random points x*_a. We assume that there is a point x(0) such that, for irrelevant X_j's and for n large enough, ζ̂_j(x(0)) is stochastically larger in the right tail than the average M^{-1} \sum_{a=1}^{M} ζ̂_j(x*_a). That is, writing s_j^{(0)} ≡ ζ̂_j(x(0)), we assume that there exist t > 0 and n_0 such that for n ≥ n_0,

P(s_j > t) ≤ P(s_j^{(0)} > t), j > d_1.  (7.7)

The existence of such a point x(0) is established by considering the probability limit of

arg max{ζ̂_j(x*_a) : x*_a ∈ {x*_1, ..., x*_M}}.
Note that by Lemma 6.1, s_j^{(0)} →_p 0 as n → ∞. It follows that we have established (7.4) for t and d_0 = d − d_1 that are fixed as n → ∞. To examine the interesting case where d_0 → ∞ as n → ∞, we need large deviation results, which we turn to next. In the d_0 → ∞ case we consider t ≡ t_n that tends to ∞ as n → ∞, and we use the weaker assumption: there exist x(0) and n_0 such that

P(s_j > t_n) ≤ P(s_j^{(0)} > t_n)  (7.8)
for n ≥ n_0 and each t_n with lim_{n→∞} t_n = ∞. We also assume that the number of columns C and rows R in the contingency table that defines the efficacy χ^2_{jn} given by (6.4) stays fixed as n → ∞. In this case P(s_j^{(0)} > t_n) → 0 provided each term

[T_{rc}^{(j)}]^2 ≡ [(n_{rc} − E_{rc}) / (√n √E_{rc})]^2

in s_j^2 = χ^2_{jn}/n, defined for X_j by (6.4), satisfies

P(|T_{rc}^{(j)}| > t_n) → 0.
This is because

P(√(\sum_r \sum_c [T_{rc}^{(j)}]^2) > t_n) ≤ P(RC max_{r,c} [T_{rc}^{(j)}]^2 > t_n^2) ≤ RC max_{r,c} P(|T_{rc}^{(j)}| > t_n / √(RC)),
and t_n and t_n/√(RC) are of the same order as n → ∞. By large deviation theory (Hall [8], Jing et al. [10]), for some constant A,

P(|T_{rc}^{(j)}| > t_n) = (2 t_n^{−1} / √(2π)) e^{−t_n^2/2} [1 + A t_n^3 n^{−1/2} + o(t_n^3 n^{−1/2})],  (7.9)
provided t_n = o(n^{1/6}). If (7.9) is uniform in j > d_1, then (7.6) and (7.9) imply that type I consistency holds if

d_0 (t_n √(2π))^{−1} e^{−t_n^2/2} → 0.  (7.10)
To examine (7.10), write d_0 and t_n as d_0 = exp(n^b) and t_n = √2 n^r. By large deviation theory, (7.10) holds when r < 1/6. Thus if 0 < b/2 < r < 1/6, then

d_0 (t_n √(2π))^{−1} e^{−t_n^2/2} = n^{−r} exp(n^b − n^{2r}) → 0.
Thus d_0 = exp(n^b) with b = 1/3 − ε, for any ε > 0, is the largest d_0 can be to have consistency. For instance, if there are exactly d_0 = exp(n^{1/4}) irrelevant variables and we use the threshold t_n = √2 n^{1/7}, the probability of selecting any of the irrelevant variables tends to zero at the rate

n^{−1/7} exp(n^{1/4} − n^{2/7}).

7.2 Type II consistency
We now assume that for each relevant variable X_j there is a least favorable point x(1) such that the local efficacy s_j^{(1)} ≡ ζ̂(x(1)) is stochastically smaller than the importance score s_j = M^{−1} \sum_{a=1}^{M} ζ̂(x*_a). That is, we assume that for each relevant variable X_j there exist n_1 and x(1) such that for n ≥ n_1,

P(s_j^{(1)} ≤ t_n) ≥ P(s_j ≤ t_n), j ≤ d_1.  (7.11)

The existence of such a least favorable x(1) is established by considering the probability limit of

arg min{ζ̂_j(x*_a) : x*_a ∈ {x*_1, ..., x*_M}}.

We again assume that R and C are fixed, so that to show type II consistency it is enough to show that for each (r, c) and j ≤ d_1,

P(|T_{rc}^{(j)}| ≤ t_n) → 0.
This is because

P(√(\sum_r \sum_c [T_{rc}^{(j)}]^2) ≤ t_n) ≤ P(min_{r,c} [T_{rc}^{(j)}]^2 ≤ t_n^2) ≤ RC max_{r,c} P(|T_{rc}^{(j)}| ≤ t_n).
By Theorem 6.2, √n (s_j^{(1)} − τ_j) →_d N(0, ν^2), τ_j > 0. It follows that if t_n and d_1 are bounded above as n → ∞ and τ_j > 0, then P(s_j^{(1)} ≤ t_n) → 0. Thus type II consistency holds in this case. To examine the more interesting case, where t_n and d_1 are not bounded above and τ_j may tend to zero as n → ∞, we need to center the T_{rc}^{(j)}. Thus set

θ_n^{(j)} = √n (p_{rc}^{(j)} − p_r^{(j)} p_c^{(j)}) / √(p_r^{(j)} p_c^{(j)}),  (7.12)

T_n^{(j)} = T_{rc}^{(j)} − θ_n^{(j)},

where for simplicity the dependence of T_n^{(j)} and θ_n^{(j)} on r and c is not indicated.
Lemma 7.1.

P(min_{j ≤ d_1} |T_{rc}^{(j)}| ≤ t_n) ≤ \sum_{j=1}^{d_1} P(|T_n^{(j)}| ≥ min_{j ≤ d_1} |θ_n^{(j)}| − t_n).

Proof.

P(min_{j ≤ d_1} |T_{rc}^{(j)}| ≤ t_n) = P(min_{j ≤ d_1} |T_n^{(j)} + θ_n^{(j)}| ≤ t_n)
≤ P(min_{j ≤ d_1} (|θ_n^{(j)}| − |T_n^{(j)}|) ≤ t_n)
≤ P(max_{j ≤ d_1} |T_n^{(j)}| ≥ min_{j ≤ d_1} |θ_n^{(j)}| − t_n)
≤ \sum_{j=1}^{d_1} P(|T_n^{(j)}| ≥ min_{j ≤ d_1} |θ_n^{(j)}| − t_n).
Set

a_n = min_{j ≤ d_1} |θ_n^{(j)}| − t_n;

then by large deviation theory, if a_n = o(n^{1/6}),

P(|T_n^{(j)}| ≥ a_n) ∝ (a_n √(2π))^{−1} exp{−a_n^2/2},  (7.13)
where ∝ signifies asymptotic order as n → ∞, a_n → ∞. It follows that if (7.13) holds uniformly in j, then by Lemma 7.1,

P(min_{j ≤ d_1} |T_{rc}^{(j)}| ≤ t_n) ∝ d_1 (a_n √(2π))^{−1} exp{−a_n^2/2}.
Thus type II consistency holds when a_n = o(n^{1/6}) and

d_1 (a_n √(2π))^{−1} exp{−a_n^2/2} → 0.  (7.14)

In Section 7.1, t_n is of order n^r, 0 < r < 1/6. For this t_n, type II consistency holds if a_n is of order n^r, 0 < r < 1/6, and

d_1 n^{−r} exp{−n^{2r}/2} → 0.
If a_n = O(n^k) with k ≥ 1/6, then P(|T_n^{(j)}| ≥ a_n) tends to zero at a rate faster than in (7.13) and consistency is ensured. For instance, if all relevant variables X_j have p_{rc}^{(j)} fixed as n → ∞, then min_{j ≤ d_1} |θ_n^{(j)}| is of order √n and a_n is of order n^{1/2} − t_n.
7.3 Type I and II consistency
We see from Sections 7.1 and 7.2 that consistency holds under a variety of conditions. One set of conditions is:

Theorem 7.1. Under the regularity conditions of Sections 7.1 and 7.2, if

d_0 = c_0 exp(n^b), t_n = c_1 n^r, min_{j ≤ d_1} θ^{(j)} − t_n = c_2 n^r, d_1 = c_3 exp(n^b)

for positive constants c_0, c_1, c_2, and c_3, and 0 < b/2 < r < 1/6, then CATCH is consistent.
Appendix: Proof of Theorem 6.1

(1) Let N*_h = \sum_{i=1}^{n} I(X(i) ∈ N_h(x(0))). Then N*_h ∼ Bi(n, ∫_{x(0)−h}^{x(0)+h} f(x) dx) and N*_h →_{a.s.} ∞ as n → ∞.
ζ̂(x(0), h) = n^{−1/2} √(X^2(x(0), h)) = (N*_h / n)^{1/2} (N*_h)^{−1/2} √(X^2(x(0), h)).

Since N*_h ∼ Bi(n, ∫_{x(0)−h}^{x(0)+h} f(x) dx), we have

N*_h / n → ∫_{x(0)−h}^{x(0)+h} f(x) dx.  (7.15)
By Lemma 6.1 and Table 1,

(N*_h)^{−1/2} √(X^2(x(0), h)) → (\sum_{r=+,−} \sum_{c=1}^{C} (p_{rc} − p_r q_c)^2 / (p_r q_c))^{1/2},  (7.16)

where

p_{+c} = Pr(X > x(0), Y = c | X ∈ N_h(x(0))), p_{−c} = Pr(X ≤ x(0), Y = c | X ∈ N_h(x(0)))
for r = +, − and c = 1, ..., C, and

p_r = \sum_{c=1}^{C} p_{rc}, \qquad q_c = \sum_{r=+,−} p_{rc}.

In fact, p_+ = Pr(X > x(0) | X ∈ N_h(x(0))), p_− = Pr(X ≤ x(0) | X ∈ N_h(x(0))), and q_c = Pr(Y = c | X ∈ N_h(x(0))).
By (7.15) and (7.16),

ζ̂(x(0), h) → (∫_{x(0)−h}^{x(0)+h} f(x) dx)^{1/2} (\sum_{r=+,−} \sum_{c=1}^{C} (p_{rc} − p_r q_c)^2 / (p_r q_c))^{1/2} ≡ ζ(x(0), h).
(2) Since X and Y are independent, p_{rc} = p_r q_c for r = +, − and c = 1, ..., C; then ζ(x(0), h) = 0 for any h. Hence

ζ(x(0)) = sup_{h>0} ζ(x(0), h) = 0.
(3) Recall that p_c(x(0)) = Pr(Y = c | x(0)). Without loss of generality, assume p′_c(x(0)) > 0. Then there exists h such that

Pr(Y = c | x(0) < X < x(0) + h) > Pr(Y = c | x(0) − h < X < x(0)),

which is equivalent to p_{+c}/p_+ > p_{−c}/p_−. This shows that ζ(x(0)) > 0, since by Lemma 6.1, ζ(x(0)) = 0 would imply p_{+c}/p_+ = p_{−c}/p_− = p_c.
References
[1] Breiman L 2001 Random forests Machine Learning 45 5ndash32
[2] Breiman, L., Friedman, J. H., Olshen, R. A., Stone, C. J., 1984. Classification and regression trees. Wadsworth, Belmont.
[3] Doksum K Tang S Tsui K-W 2008 Nonparametric variable selection Theearth algorithm Journal of the American Statistical Association 103(484) 1609ndash1620
[4] Doksum K A Samarov A 1995 Nonparametric estimation of global functionalsand a measure of explanatory power of covariates in regression Annals of Statistics23 1443ndash1473
29
[5] Doksum K A Schafer C 2006 Powerful choices Tuning parameter selectionbased on power Frontiers in Statistics Dedicated to Peter Bickel Editors J Fanand H L Koul Imperial College Press 113ndash141
[6] Friedman, J., 1991. Multivariate adaptive regression splines. The Annals of Statistics 19, 1–67.
[7] Gao J Gijbels I 2008 Bandwidth selection in nonparametric kernel testingJournal of the American Statistical Association 103 (484) 1584ndash1594
[8] Hall P 1992 The Bootstrap and Edgeworth Expansions Springer New York
[9] Huang L-S Chen J 2008 Analysis of variance coefficient of determinationand f-test for local polynomial regression Annals of Statistics 36 2085ndash2109
[10] Jing, B.-Y., Shao, Q.-M., Wang, Q., 2003. Self-normalized Cramér-type large deviations for independent random variables. Annals of Probability 31, 2167–2215.
[11] Lin D Y Zeng D 2011 Correcting for population stratification in genomewideassociation studies Journal of the American Statistical Association 106 997ndash1008
[12] Loh, W.-Y., 2009. Improving the precision of classification trees. The Annals of Applied Statistics 3, 1710–1737.
[13] Price A Patterson N Plenge R Weinblatt M Shadick N Reich D 2006Principal components analysis corrects for stratification in genome-wide associationstudies Nature genetics 38 (8) 904ndash909
[14] Qiao Z Zhou L Huang J 2008 Linear discriminant analysis for high dimen-sional Low Sample Size Data special issue of World Congress on Engineering2008
[15] Quinlan J R 1987 Simplifying decision trees International Journal of Man-Machine Studies 27(3) 221ndash234
[16] Schafer C Doksum K A 2009 Selecting local models in multiple regression bymaximizing power Metrika 69 283ndash304
[17] Tibshirani R 1996 Regression shrinkage and selection via the lasso J RoyalStatist Soc B 58 267ndash288
[18] Wang L Shen X 2007 On l1-norm multi-class support vector machinesmethodology and theory Journal of the American Statistical Association 102 583ndash594
[19] Zhang H H Liu Y Wu Y Zhu J 2008 Variable selection for the multicate-gory svm via adaptive sup-norm regularization Electronic Journal of Statistics 2149ndash167
Abstract
In a nonparametric framework we consider the problem of classifying a categorical response Y whose distribution depends on a vector of predictors X, where the coordinates Xj of X may be continuous, discrete, or categorical. To select the variables to be used for classification, we construct an algorithm which for each variable Xj computes an importance score sj to measure the strength of association of Xj with Y. The algorithm deletes Xj if sj falls below a certain threshold. It is shown in Monte Carlo simulations that the algorithm has a high probability of only selecting variables associated with Y. Moreover, when this variable selection rule is used for dimension reduction prior to applying classification procedures, it improves the performance of these procedures. Our approach for computing importance scores is based on root chi-square type statistics computed for randomly selected regions (tubes) of the sample space. The size and shape of the regions are adjusted iteratively and adaptively using the data to enhance the ability of the importance score to detect local relationships between the response and the predictors. These local scores are then averaged over the tubes to form a global importance score sj for variable Xj. When confounding and spurious associations are issues, the nonparametric importance score for variable Xj is computed conditionally by using tubes to restrict the other variables. We call this variable selection procedure CATCH (Categorical Adaptive Tube Covariate Hunting). We establish asymptotic properties, including consistency.
Keywords Adaptive variable selection Importance score Chi-square statistic
1 Introduction
We consider classification problems with a large number of predictors that can be numerical or categorical. We are interested in the case in which many of the predictors may be irrelevant for classification, so that the variables useful for classification need to be selected. In genomic research the classes considered are often cases and controls; variable selection is then the same as the important problem of deciding which variables are associated with disease. With modern technology, data of large dimension arise in many scientific disciplines, including biology, genomics, astronomy, economics and computer science. In particular, the number of variables can be greater than the sample size, which poses a considerable challenge to the statistical analysis. If only some of the variables are useful for classification, over-fitting is a problem for methods that use all variables, and therefore variable selection becomes critical for statistical analysis.
Methods for variable selection in the classification context include methods that incorporate variable selection as part of the classification procedure. This class includes random forest (Breiman [1]), CART (Breiman et al. [2]) and GUIDE (Loh [12]). Random forest assigns an importance score to each of the predictors, and one can drop those variables whose importance scores fall below a certain threshold. CART, after pruning, will choose a subset of optimal splitting variables as the most significant variables. GUIDE is a tree-based method particularly powerful in unbiased variable selection and interaction detection. Other research includes methods that incorporate variable selection by applying shrinkage methods with L1-norm constraints on the parameters (Tibshirani [17]), which generate sparse vectors of parameter estimates. Wang and Shen [18] and Zhang et al. [19] incorporate variable selection with classification based on support vector machine (SVM) methods. Qiao et al. [14] consider variable selection based on linear discriminant analysis. One limitation of these variable selection methods is that they are not based on nonparametric methods, and therefore they may not work well when there is a complex relationship between the predictor variables and the variable to be classified.
We propose a nonparametric method for variable selection, called Categorical Adaptive Tube Covariate Hunting (CATCH), that performs well as a variable selection algorithm and can be used to improve the performance of available classification procedures. The idea is to construct a nonparametric measure of the strength of association between each predictor and the categorical response, and to retain those predictors whose relationship to the response is above a certain threshold. The nonparametric measure of importance for each predictor is obtained by first measuring the importance of the predictor using local information and then combining such local importance scores into an overall importance score. The local importance scores are based on root chi-square type statistics for local contingency tables.
In addition to the aforementioned nonparametric feature, the CATCH procedure has another property: it measures the importance of each variable conditioning on all other variables, thereby reducing the confounding that may lead to selection of variables spuriously related to the categorical variable Y. This is accomplished by constraining all predictors but the one we are focusing on. For the case where the number of predictors is huge (d ≫ n), this can be done by restricting principal components for some types of studies; see Remark 3.4.
Our approach to nonparametric variable selection is related to the EARTH algorithm (Doksum et al. [3]), which applies to nonparametric regression problems with a continuous response variable and continuous predictors. It measures the conditional association between a predictor Xi and the response variable, conditional on all the other predictors {Xj : j ≠ i}, by constraining the Xj, j ≠ i, variables to regions called tubes. Its local importance score is based on a local linear or local polynomial regression. The contribution of the current paper is to develop variable selection methods for the classification problem, with a categorical response variable and predictors that can be continuous, discrete or categorical.
Our CATCH algorithm can be used as a variable selection step before classification. Any classification method, preferably nonparametric, can be used after we single out the important variables. In particular, SVM and random forest are statistical classification methods that can be used with CATCH. We show in simulation studies that when the true model is highly nonlinear and there are many irrelevant predictors, using CATCH to screen out irrelevant or weak predictors greatly improves the performance of SVM and random forest.
The CATCH algorithm works with general classification data with both numerical and categorical predictors. Moreover, the CATCH algorithm is robust when the predictors interact with each other, especially when numerical and categorical predictors interact, e.g., under hierarchical interactive association between the numerical and categorical predictors. We present a model, with a reasonable sample size and predictor dimension, for which random forest has a relatively low chance of detecting the significance of the categorical predictor. The CART and GUIDE algorithms with pruning can find the correct splitting variables, including the categorical one, but yield relatively high classification errors; moreover, CART and GUIDE have trouble choosing the splitting predictors in the correct order. The CATCH algorithm achieves higher accuracy in the task of variable selection. This is due to the importance scores being conditional, as illustrated in the simulation example in Section 4.1.
The rest of the paper proceeds as follows. In Section 2 we introduce importance scores for univariate predictors. In Sections 3.1-3.4 we extend these scores to multivariate predictors, and in Section 3.5 we introduce the CATCH algorithm. In Section 4 we use simulation studies to show the effectiveness of the CATCH algorithm. A real example is provided in Section 5. In Section 6 we provide some theoretical properties to justify the definition of local contingency efficacy, and in Section 7 we show asymptotic consistency of the algorithm.
2 Importance Scores for Univariate Classification
Let (X^(i), Y^(i)), i = 1, ..., n, be independent and identically distributed (i.i.d.) as (X, Y) ~ P. We first consider the case of a univariate X, then use the methods constructed for the univariate case to construct the multivariate versions.
2.1 Numerical Predictor: Local Contingency Efficacy
Consider the classification problem with a numerical predictor. Let

Pr(Y = c | x) = p_c(x), c = 1, ..., C, with Σ_{c=1}^C p_c(x) ≡ 1.   (2.1)
We introduce a measure of how strongly X is associated with Y in a neighborhood of a "fixed" point x^(0). Points x^(0) will be selected at random by our algorithm; local importance scores will be computed for each point, then averaged. For a fixed h > 0, define N_h(x^(0)) = {x : 0 < |x − x^(0)| ≤ h} to be a neighborhood of x^(0), and let n(x^(0), h) = Σ_{i=1}^n I(X^(i) ∈ N_h(x^(0))) denote the number of data points in the neighborhood. For c = 1, ..., C, let n_c^-(x^(0), h) be the number of observations (X^(i), Y^(i)) satisfying x^(0) − h ≤ X^(i) < x^(0) and Y^(i) = c, and let n_c^+(x^(0), h) be the number of (X^(i), Y^(i)) satisfying x^(0) < X^(i) ≤ x^(0) + h and Y^(i) = c. Let n_c(x^(0), h) = n_c^+(x^(0), h) + n_c^-(x^(0), h), n^-(x^(0), h) = Σ_{c=1}^C n_c^-(x^(0), h), and n^+(x^(0), h) = Σ_{c=1}^C n_c^+(x^(0), h). Table 1 gives the resulting local contingency table.
Table 1: Local contingency table

                          Y = 1             Y = 2             · · ·   Y = C             Total
x^(0) − h ≤ X < x^(0)     n_1^-(x^(0), h)   n_2^-(x^(0), h)   · · ·   n_C^-(x^(0), h)   n^-(x^(0), h)
x^(0) < X ≤ x^(0) + h     n_1^+(x^(0), h)   n_2^+(x^(0), h)   · · ·   n_C^+(x^(0), h)   n^+(x^(0), h)
Total                     n_1(x^(0), h)     n_2(x^(0), h)     · · ·   n_C(x^(0), h)     n(x^(0), h)
To measure how strongly Y relates to local restrictions on X, we consider the chi-square statistic

X²(x^(0), h) = Σ_{c=1}^C [ (n_c^-(x^(0), h) − E_c^-(x^(0), h))² / E_c^-(x^(0), h) + (n_c^+(x^(0), h) − E_c^+(x^(0), h))² / E_c^+(x^(0), h) ],   (2.2)

where 0/0 ≡ 0 and

E_c^-(x^(0), h) = n^-(x^(0), h) n_c(x^(0), h) / n(x^(0), h),   E_c^+(x^(0), h) = n^+(x^(0), h) n_c(x^(0), h) / n(x^(0), h).   (2.3)
Due to the local restriction on X, the local contingency table might have some zero cells. However, this is not an issue for the χ² statistic: it can detect local dependence between Y and X as long as there exist observations from at least two categories of Y in the neighborhood of x^(0). When all observations belong to the same category of Y, intuitively no dependence can be detected; in this case the χ² statistic equals zero, which coincides with our intuition.
We call the neighborhood N_h(x^(0)) a section and maximize X²(x^(0), h) in (2.2) with respect to the section size h. Let plim denote limit in probability. Our local measure of association is ζ(x^(0)), as given in

Definition 1 (Local Contingency Efficacy, for a numerical predictor). The local contingency section efficacy and the Local Contingency Efficacy (LCE) of Y on a numerical predictor X at the point x^(0) are

ζ(x^(0), h) = plim_{n→∞} n^{−1/2} √(X²(x^(0), h)),   ζ(x^(0)) = sup_{h>0} ζ(x^(0), h).   (2.4)
If p_c(x) is constant in x for all c ∈ {1, ..., C} for x near x^(0), then as n → ∞, X²(x^(0), h) converges in distribution to a χ²_{C−1} variable, and in this case ζ(x^(0), h) = 0. On the other hand, if p_c(x) is not constant near x^(0), then ζ(x^(0), h) determines the asymptotic power, for Pitman alternatives, of the test based on X²(x^(0), h) for testing whether X is locally independent of Y, and it is called the efficacy of this test. Moreover, ζ²(x^(0), h) is the asymptotic noncentrality parameter of the statistic n^{−1} X²(x^(0), h).
The estimators of ζ(x^(0), h) and ζ(x^(0)) are

ζ̂(x^(0), h) = n^{−1/2} √(X²(x^(0), h)),   ζ̂(x^(0)) = sup_{h>0} ζ̂(x^(0), h).   (2.5)

We select the section size h as

ĥ = argmax{ζ̂(x^(0), h) : h ∈ {h_1, ..., h_g}}

for some grid h_1, ..., h_g. This corresponds to selecting h by maximizing estimated power. See Doksum and Schafer [5], Gao and Gijbels [7], Doksum et al. [3], and Schafer and Doksum [16].
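The estimator (2.5) is easy to compute directly: for each h in a grid, build the 2 × C local table of Table 1, form the Pearson statistic (2.2) with expected counts (2.3), and keep the largest n^{−1/2} √X². A minimal NumPy sketch (the function name and grid are ours, not part of the paper):

```python
import numpy as np

def local_contingency_efficacy(x, y, x0, h_grid, n_classes):
    """Estimate zeta-hat(x0) of (2.5): sup over the h-grid of
    n^{-1/2} * sqrt(X^2(x0, h)), where X^2 is the chi-square (2.2)
    of the 2 x C local contingency table of Table 1."""
    n = len(x)
    best = 0.0
    for h in h_grid:
        left = (x >= x0 - h) & (x < x0)    # row 1: x0 - h <= X < x0
        right = (x > x0) & (x <= x0 + h)   # row 2: x0 < X <= x0 + h
        table = np.array(
            [[np.sum(left & (y == c)) for c in range(n_classes)],
             [np.sum(right & (y == c)) for c in range(n_classes)]], float)
        total = table.sum()
        if total == 0:
            continue                       # empty section: no information
        expected = np.outer(table.sum(1), table.sum(0)) / total   # (2.3)
        cells = np.where(expected > 0,
                         (table - expected) ** 2 / expected, 0.0)  # 0/0 := 0
        best = max(best, np.sqrt(cells.sum() / n))
    return best
```

At a point where p_c(x) jumps, the score is large; when Y is independent of X it stays near zero, reflecting the χ²_{C−1} null limit noted above.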
2.2 Categorical Predictor: Contingency Efficacy
Let (X^(i), Y^(i)), i = 1, ..., n, be as before, except that X ∈ {1, ..., C′} is a categorical predictor. Let n(c, c′) be the number of observations satisfying Y = c and X = c′, and let

X² = Σ_{c=1}^C Σ_{c′=1}^{C′} (n(c, c′) − E(c, c′))² / E(c, c′),   (2.6)

where

E(c, c′) = [Σ_{c=1}^C n(c, c′)] [Σ_{c′=1}^{C′} n(c, c′)] / n.   (2.7)
Definition 2 (Contingency Efficacy, for a categorical predictor). The Contingency Efficacy (CE) of Y on a categorical predictor X is

ζ = plim_{n→∞} n^{−1/2} √(X²).   (2.8)

Under the null hypothesis that Y and X are independent, X² converges in distribution to a χ²_{(C−1)(C′−1)} variable; in this case ζ = 0. In general, ζ² is the asymptotic noncentrality parameter of the statistic n^{−1} X². The contingency efficacy ζ determines the asymptotic power of the test based on X² for testing whether Y and X are independent. We use ζ as an importance score that measures how strongly Y depends on X. The estimator of (2.8) is

ζ̂ = n^{−1/2} √(X²).   (2.9)
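For a categorical predictor, the score (2.9) is simply the root Pearson statistic of the full C × C′ table, scaled by n^{−1/2}. A small sketch (names are ours):

```python
import numpy as np

def contingency_efficacy(y, x_cat, n_classes, n_levels):
    """Estimate the contingency efficacy (2.9): n^{-1/2} times the
    square root of the Pearson chi-square (2.6) of the C x C' table
    of Y versus the categorical predictor X."""
    table = np.zeros((n_classes, n_levels))
    for c, cp in zip(y, x_cat):
        table[c, cp] += 1.0
    n = table.sum()
    expected = np.outer(table.sum(1), table.sum(0)) / n    # (2.7)
    chi2 = np.where(expected > 0,
                    (table - expected) ** 2 / expected, 0.0).sum()
    return np.sqrt(chi2 / n)
```

Note that for a 2 × 2 table with perfect association X² = n, so ζ̂ = 1; the n^{−1/2} scaling makes scores comparable across sample sizes.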
3 Variable Selection for Classification Problems: The CATCH Algorithm
Now we consider a multivariate X = (X1, ..., Xd), d > 1. To examine whether a predictor Xl is related to Y in the presence of all other variables, we calculate efficacies conditional on these variables, as well as unconditional or marginal efficacies.
3.1 Numerical Predictors
To examine the effect of Xl in the local region of x^(0) = (x_1^(0), ..., x_d^(0)), we build a "tube" around the center point x^(0) in which data points are within a certain radius δ of x^(0) with respect to all variables except Xl. Define X_{−l} = (X1, ..., X_{l−1}, X_{l+1}, ..., Xd)^T to be the complementary vector to Xl. Let D(·, ·) be a distance in R^{d−1}, which we will define later.

Definition 3. A tube T_l of size δ (δ > 0) for the variable Xl at x^(0) is the set

T_l ≡ T_l(x^(0), δ, D) ≡ {x : D(x_{−l}, x_{−l}^(0)) ≤ δ}.   (3.1)
We adjust the size of the tube so that the number of points in the tube is a preassigned number k. We find that k ≥ 50 is in general a good choice. Note that k = n corresponds to unconditional or marginal nonparametric regression.
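Choosing δ so that the tube holds exactly k points amounts to taking δ as the k-th smallest distance from the center. A sketch, assuming a weighted L1 distance of the form (3.9) with uniform weights by default (the function name is ours):

```python
import numpy as np

def tube_membership(X, center_idx, l, k, weights=None):
    """Boolean mask of the tube T_l of Definition 3: the k sample
    points closest to the center in a weighted L1 distance over all
    coordinates except X_l (X_l itself is left unconstrained)."""
    n, d = X.shape
    w = np.ones(d) if weights is None else np.asarray(weights, float)
    keep = np.ones(d, dtype=bool)
    keep[l] = False                           # drop the coordinate of interest
    dist = np.abs(X[:, keep] - X[center_idx, keep]) @ w[keep]
    delta = np.partition(dist, k - 1)[k - 1]  # k-th smallest distance = delta
    return dist <= delta
```

With continuous predictors, ties occur with probability zero, so the tube contains exactly k points, always including its own center.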
Within the tube T_l(x^(0), δ, D), Xl is unconstrained. To measure the dependence of Y on Xl locally, we consider neighborhoods of x^(0) along the direction of Xl within the tube. Let

N_{l,h,δ,D}(x^(0)) = {x = (x1, ..., xd)^T ∈ R^d : x ∈ T_l(x^(0), δ, D), |x_l − x_l^(0)| ≤ h}   (3.2)

be a section of the tube T_l(x^(0), δ, D). To measure how strongly Xl relates to Y in N_{l,h,δ,D}(x^(0)), we consider

X²_{l,δ,D}(x^(0), h),   (3.3)

where (3.3) is defined by (2.2) using only the observations in the tube T_l(x^(0), δ, D).
Analogous to the definitions of local contingency efficacy in (2.4), we define

Definition 4. The local contingency tube section efficacy and the local contingency tube efficacy of Y on Xl at the point x^(0), and their estimates, are

ζ(x^(0), h, l) = plim_{n→∞} n^{−1/2} √(X²_{l,δ,D}(x^(0), h)),   ζ_l(x^(0)) = sup_{h>0} ζ(x^(0), h, l),   (3.4)

ζ̂(x^(0), h, l) = n^{−1/2} √(X²_{l,δ,D}(x^(0), h)),   ζ̂_l(x^(0)) = sup_{h>0} ζ̂(x^(0), h, l).   (3.5)
For an illustration of local contingency tube efficacy, consider Y ∈ {1, 2} and suppose logit P(Y = 2 | x) = (x − 1/2)², x ∈ [0, 1]. Then the local efficacy with h small will be large, while if h ≥ 1 the efficacy will be zero.
3.2 Numerical and Categorical Predictors

If Xl is categorical but some of the other predictors are numerical, we construct tubes T_l(x^(0), δ, D) based on the numerical predictors, leaving the categorical variables unconstrained, and we use the statistic defined by (2.6) to measure the strength of the association between Y and Xl using only the observations in the tube T_l(x^(0), δ, D). We denote this version of (2.6) by

X²_{l,δ,D}(x^(0)).   (3.6)
Analogous to the definition of contingency efficacy given in (2.8), we define

Definition 5. The local contingency tube efficacy of Y on Xl at x^(0), and its estimate, are

ζ_l(x^(0)) = plim_{n→∞} n^{−1/2} √(X²_{l,δ,D}(x^(0))),   ζ̂_l(x^(0)) = n^{−1/2} √(X²_{l,δ,D}(x^(0))).   (3.7)
3.3 The Empirical Importance Scores
The empirical efficacies (3.5) and (3.7) measure how strongly Y depends on Xl near one point x^(0). To measure the overall dependency of Y on Xl, we randomly sample M bootstrap points x*_a, a = 1, ..., M, with replacement from {x^(i), 1 ≤ i ≤ n} and calculate the average of the local tube efficacy estimates at these points.

Definition 6. The CATCH empirical importance score for variable l is

s_l = M^{−1} Σ_{a=1}^M ζ̂_l(x*_a),   (3.8)

where ζ̂_l(x*_a) is defined in (3.5) and (3.7) for numerical and categorical predictors, respectively.
3.4 Adaptive Tube Distances
Doksum et al. [3] used an example to demonstrate that the distance D needs to be adaptive to keep variables strongly associated with Y from obscuring the effect of weaker variables. Suppose Y depends on three variables X1, X2 and X3 among the larger set of predictors, with the strength of the dependency ranging from weak to strong, as estimated by the empirical importance scores. As we examine the effect of X2 on Y, we want to define the tube so that there is relatively less variation in X3 than in X1 in the tube, because X3 is more likely than X1 to obscure the effect of X2. To this end we introduce a tube distance for the variable Xl of the form

D_l(x_{−l}, x_{−l}^(0)) = Σ_{j=1}^d w_j |x_j − x_j^(0)| I(j ≠ l),   (3.9)

where the weights w_j adjust the contribution of strong variables to the distance so that the strong variables are more constrained.
The weights w_j are proportional to s_j − s′_j, which are determined iteratively and adaptively as follows. In the first iteration, set s_j = s_j^(1) = 1 in (3.9) and compute the importance score s_j^(2) using (3.8). For the second iteration we use the weights s_j = s_j^(2) and the distance

D_l(x_{−l}, x_{−l}^(0)) = [Σ_{j=1}^d (s_j − s′_j)_+ |x_j − x_j^(0)| I(j ≠ l)] / [Σ_{j=1}^d (s_j − s′_j)_+ I(j ≠ l)],   (3.10)

where 0/0 ≡ 1 and s′_j is a threshold value for s_j under the assumption of no conditional association of X_j with Y, computed by a simple Monte Carlo technique. This adjustment is important because some predictors, by definition, have larger importance scores than others even when they are all independent of Y. For example, a categorical predictor with more levels has a larger importance score than one with fewer levels; therefore the unadjusted scores s_j cannot accurately quantify the relative importance of the predictors in the tube distance calculation. Next we use (3.8) to produce s_j = s_j^(3), use s_j^(3) in the next iteration, and so on.
In the definition above we need to define |x_j − x_j^(0)| appropriately for a categorical variable X_j. Because X_j is categorical, we assume that |x_j − x_j^(0)| is constant when x_j and x_j^(0) take any pair of different categories. One approach is to define |x_j − x_j^(0)| = ∞ · I(x_j ≠ x_j^(0)), in which case all points in the tube are given the same X_j value as the tube center. This approach is a very sensible one, as it strictly implements the idea of conditioning on X_j when evaluating the effect of another predictor X_l, so that X_j does not contribute to any variation in Y. The problem with this definition, though, is that when the number of categories of X_j is relatively large compared to the sample size, there will not be enough points in the tube if we restrict X_j to a single category. In such a case we do not restrict X_j, which means that X_j can take any category in the tube. We define |x_j − x_j^(0)| = k_0 I(x_j ≠ x_j^(0)), where k_0 is a normalizing constant. The value of k_0 is chosen to achieve a balance between numerical variables and categorical variables. Suppose that X1 is a numerical variable with standard deviation one and X2 is a categorical variable. For X2 we set k_0 ≡ E(|X_1^(1) − X_1^(2)|), where X_1^(1) and X_1^(2) are two independent realizations of X1. Note that k_0 = 2/√π if X1 follows a standard normal distribution. Also, k_0 can be based on the empirical distributions of the numerical variables. Based on the preceding discussion of the choice of k_0, we use k_0 = ∞ in the simulations and k_0 = 2/√π in the real data example.
Remark 3.1 (choice of k_0). k_0 = 2/√π is used unless the sample size is large enough, in the sense explained below. Suppose k_0 = ∞; then D_l(x_{−l}, x_{−l}^(0)) is finite only for those observations with x_j = x_j^(0). When there are multiple categorical variables, D_l is finite only for those points, denoted by the set Ω^(0), that satisfy x_j = x_j^(0) for all categorical x_j. When there are many categorical variables, or when the total sample size is small, the size of this set can be very small or close to zero, in which case the tube cannot be well defined; therefore k_0 = 2/√π is used, as in the real data example. On the other hand, when the size of Ω^(0) is larger than the specified tube size, we can use k_0 = ∞, as in the simulation examples.
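The adaptive distance (3.10), with the categorical convention |x_j − x_j^(0)| = k_0 I(x_j ≠ x_j^(0)), can be sketched as follows; passing k_0 = ∞ reproduces exact matching on every categorical coordinate, as in Remark 3.1 (all names are ours):

```python
import math

def adaptive_tube_distance(x, x0, l, s, s_prime, is_cat, k0):
    """Adaptive tube distance (3.10): coordinates are weighted by the
    excess importance (s_j - s'_j)_+, so stronger predictors are held
    more tightly; a categorical gap is k0 when the categories differ
    and 0 when they match (k0 = inf forces exact matching)."""
    num = den = 0.0
    for j in range(len(x)):
        if j == l:
            continue                       # X_l itself is unconstrained
        w = max(s[j] - s_prime[j], 0.0)
        if is_cat[j]:
            gap = k0 if x[j] != x0[j] else 0.0
        else:
            gap = abs(x[j] - x0[j])
        num += w * gap
        den += w
    return num / den if den > 0 else 1.0   # 0/0 := 1, as in the text
```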
3.5 The CATCH Algorithm
The variable selection algorithm for a classification problem is based on importance scores and an iteration procedure.

The Algorithm

(0) Standardizing. Standardize all the numerical predictors to have sample mean 0 and sample standard deviation 1, using linear transformations.

(1) Initializing Importance Scores. Let S = (s_1, ..., s_d) be importance scores for the predictors. Let S′ = (s′_1, ..., s′_d), where s′_l is a threshold importance score corresponding to the model in which X_l is independent of Y given X_{−l}. Initialize S to be (1, ..., 1) and S′ to be (0, ..., 0). Let M be the number of tube centers and m the number of observations within each tube.

(2) Tube Hunting Loop. For l = 1, ..., d, do (a), (b), (c) and (d) below.

(a) Selecting tube centers. For 0 < b ≤ 1, set M = [nb], where [·] is the greatest integer function, and randomly sample M bootstrap points X*_1, ..., X*_M with replacement from the n observations. Set k = 1; do (b).

(b) Constructing tubes. For 0 < a < 1, set m = [na]. Using the tube distance D_l defined in (3.10), select the tube size δ so that there are exactly m ≥ 10 observations inside the tube for X_l at X*_k.

(c) Calculating efficacy. If X_l is a numerical variable, let ζ̂_l(X*_k) be the estimated local contingency tube efficacy of Y on X_l at X*_k, as defined in (3.5). If X_l is a categorical variable, let ζ̂_l(X*_k) be the estimated contingency tube efficacy of Y on X_l at X*_k, as defined in (3.7).

(d) Thresholding. Let Y* denote a random variable which is identically distributed as Y but independent of X. Let (Y_0^(1), ..., Y_0^(n)) be a random permutation of (Y^(1), ..., Y^(n)); then Y_0 ≡ (Y_0^(1), ..., Y_0^(n)) can be viewed as a realization of Y*. Let ζ̂*_l(X*_k) be the estimated local contingency tube efficacy of Y_0 on X_l at X*_k, calculated as in (c) with Y_0 in place of Y.

If k < M, set k = k + 1 and go to (b); otherwise go to (e).

(e) Updating importance scores. Set

s_l^new = (1/M) Σ_{k=1}^M ζ̂_l(X*_k),   s′_l^new = (1/M) Σ_{k=1}^M ζ̂*_l(X*_k).

Let S^new = (s_1^new, ..., s_d^new) and S′^new = (s′_1^new, ..., s′_d^new) denote the updates of S and S′.

(3) Iterations and Stopping Rule. Repeat (2) I times. Stop when the change in S^new is small. Record the last S and S′ as S^stop and S′^stop, respectively.

(4) Deletion step. Using S^stop and S′^stop, delete X_l if s_l^stop − s′_l^stop < √2 SD(S′^stop). If d is small, we can generate several sets of S′^stop and then use the sample standard deviation of all the values in these sets as SD(S′^stop). See Remark 3.3.

End of the algorithm.
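The steps above can be sketched end to end. The following is a drastically simplified, numerical-predictors-only illustration of the loop: the permutation threshold of step (d), the adaptive weighting of (3.10), and the deletion rule of step (4). Every name, the toy section grid, and the crude stand-in for the efficacy estimate (3.5) are ours, not the paper's implementation:

```python
import numpy as np

def catch_select(X, y, n_iter=2, a=0.15, b=0.25, seed=0):
    """Simplified CATCH for numerical predictors: iterate importance
    scores s_l and permutation thresholds s'_l (steps (1)-(3)), then
    delete X_l when s_l - s'_l < sqrt(2) * SD(s') (step (4))."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    M, m = int(n * b), max(int(n * a), 10)   # tube centers / tube size
    s, s_prime = np.ones(d), np.zeros(d)

    def tube_efficacy(l, c0, yy):
        # Tube: the m nearest points in the weighted L1 distance (3.10).
        w = np.maximum(s - s_prime, 0.0)
        w[l] = 0.0
        w = w / w.sum() if w.sum() > 0 else np.full(d, 1.0 / d)
        tube = np.argsort(np.abs(X - X[c0]) @ w)[:m]
        best = 0.0
        for h in (0.25, 0.5, 1.0):           # sup over a toy section grid
            xl = X[tube, l]
            lo = (xl >= X[c0, l] - h) & (xl < X[c0, l])
            hi = (xl > X[c0, l]) & (xl <= X[c0, l] + h)
            tab = np.array([np.bincount(yy[tube][lo], minlength=yy.max() + 1),
                            np.bincount(yy[tube][hi], minlength=yy.max() + 1)],
                           float)
            tot = tab.sum()
            if tot == 0:
                continue
            exp = np.outer(tab.sum(1), tab.sum(0)) / tot
            chi2 = np.where(exp > 0, (tab - exp) ** 2 / exp, 0.0).sum()
            best = max(best, np.sqrt(chi2 / m))
        return best

    for _ in range(n_iter):                  # step (3): repeat tube hunting
        y_perm = rng.permutation(y)          # step (d): global permutation
        s_new, sp_new = np.zeros(d), np.zeros(d)
        for l in range(d):
            centers = rng.integers(0, n, size=M)
            s_new[l] = np.mean([tube_efficacy(l, c, y) for c in centers])
            sp_new[l] = np.mean([tube_efficacy(l, c, y_perm) for c in centers])
        s, s_prime = s_new, sp_new           # step (e): update scores
    keep = (s - s_prime) >= np.sqrt(2) * np.std(s_prime)   # step (4)
    return keep, s, s_prime
```

On data where only the first coordinate carries signal, the gap s_l − s′_l for that coordinate dominates the noise coordinates, which is exactly what the deletion rule exploits.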
Remark 3.2. Step (d) needs to be replaced by (d′) below when a global permutation of Y does not result in an appropriate "local null distribution". For example, when the sample sizes of the different classes are extremely unequal, known as the unbalanced classification problem, it is very likely that a local region contains a pure class of Y after the global permutation; the threshold S′ will then be spuriously low, and irrelevant variables will be falsely selected into the model. The local thresholding approach (d′) is more nonparametric but more time consuming.
(d′) Local Thresholding. Let Y* denote a random variable which is distributed as Y but independent of X_l given X_{−l}. Let (Y_0^(1), ..., Y_0^(m)) be a random permutation of the Y's inside the tube; then (Y_0^(1), ..., Y_0^(m)) can be viewed as a realization of Y*. If X_l is a numerical variable, let ζ̂*_l(X*_k) be the estimated local contingency tube efficacy of Y* on X_l at X*_k, as defined in (3.5). If X_l is a categorical variable, let ζ̂*_l(X*_k) be the estimated contingency tube efficacy of Y* on X_l at X*_k, as defined in (3.7).
Remark 3.3 (The threshold). Under the assumption H_0^(l) that Y is independent of X_l given X_{−l} in T_l, s_l and s′_l are nearly independent and SD(s_l − s′_l) ≈ √2 SD(s′_l). Thus the rule that deletes X_l when s_l^stop − s′_l^stop < √2 SE(s′^stop) is similar to the one-standard-deviation rule of CART and GUIDE. The choice of threshold is also discussed in Section 7. Alternatively, we calculate a p-value for each covariate by a permutation test: more specifically, we permute the response variable, obtain the importance scores for the permuted data, and then calculate how extreme s_l^stop is compared to the permuted scores. We then identify a covariate as important if its p-value is less than 0.1.
Remark 3.4 (Large d, small n). A key idea in this paper is that when determining the importance of the variable X_j, we restrict (condition) the vector X_{−j} to a relatively small set. This is done to address the problem of spurious correlation, of great concern in genomics (Price et al. [13], Lin and Zeng [11]). In such genomic studies the number of predictors d typically greatly exceeds the sample size n. For numerical variables this can be dealt with by replacing X_{−j} with a vector of principal components explaining most of the variation in X_{−j}. Although our paper does not deal with the d ≫ n case directly, this remark shows that the approach is also applicable to the genome-wide association studies described in Price et al. [13].
Remark 3.5. We now provide the computational complexity of the algorithm. Recall the notation: I is the number of repetitions, M is the number of tube centers, m is the tube size, d is the dimension, and n is the sample size. The number of operations needed for the algorithm is I × d × M × (2b + 2c + 2d) + 2e, where 2b, 2c, 2d, 2e denote the numbers of operations required for steps 2(b), 2(c), 2(d) and 2(e). The computational complexities of 2(b), 2(c), 2(d) and 2(e) are O(nd), O(m), O(m) and O(M), respectively. Since nd ≫ m, the required computation is dominated by step 2(b), and the computational complexity of the algorithm is O(IMnd²). Solving a least squares problem requires O(nd²) operations, so our algorithm has higher computational complexity by a factor of IM compared to least squares. However, the computations for the M tube centers are independent of each other and can be performed in parallel.
4 Simulation Studies
4.1 Variable Selection with Categorical and Numerical Predictors
Example 4.1. Consider d ≥ 5 predictors X1, X2, ..., Xd and a categorical response variable Y. The fifth predictor X5 is a Bernoulli random variable which takes the two values 0 and 1 with equal probability. All other predictors are uniformly distributed on [0, 1].
When X5 = 0, the dependent variable Y is related to the predictors in the following fashion:

Y = 0 if X1 − X2 − 0.25 sin(16πX2) ≤ −0.5,
Y = 1 if −0.5 < X1 − X2 − 0.25 sin(16πX2) ≤ 0,
Y = 2 if 0 < X1 − X2 − 0.25 sin(16πX2) ≤ 0.5,
Y = 3 if X1 − X2 − 0.25 sin(16πX2) > 0.5.   (4.1)

When X5 = 1, Y depends on X3 and X4 in the same way as above, with X1 replaced by X3 and X2 replaced by X4. The other variables X6, ..., Xd are irrelevant noise variables. All d predictors are independent.
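A generator for this model is straightforward; in the sketch below np.digitize assigns the four labels (its half-open bin convention differs from the model's closed boundaries only on a probability-zero set, and the function name is ours):

```python
import numpy as np

def simulate_example_41(n, d, seed=None):
    """Draw (X, Y) from Example 4.1: X5 (column index 4) is
    Bernoulli(1/2) and selects whether (X1, X2) or (X3, X4) drives Y
    through f(u, v) = u - v - 0.25 sin(16 pi v), thresholded at
    -0.5, 0, 0.5."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n, d))             # X_j ~ Uniform[0, 1]
    X[:, 4] = rng.integers(0, 2, size=n)     # the categorical X5
    u = np.where(X[:, 4] == 0, X[:, 0], X[:, 2])
    v = np.where(X[:, 4] == 0, X[:, 1], X[:, 3])
    f = u - v - 0.25 * np.sin(16 * np.pi * v)
    Y = np.digitize(f, [-0.5, 0.0, 0.5])     # labels 0, 1, 2, 3
    return X, Y
```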
In addition to the presence of the noise variables X6, ..., Xd, this example is challenging in two respects. First, there is an interaction effect between the continuous predictors X1, ..., X4 and the categorical predictor X5. Such an interaction effect may arise in practice: for example, suppose Y represents the effect resulting from multiple treatments X1, ..., X4, and X5 represents the gender of a patient; our model then allows males and females to respond differently to treatments. Secondly, the classification boundaries for fixed X5 are highly nonlinear, as shown in Figure 1. This design demonstrates that our method can detect nonlinear dependence between the response and the predictors.
Table 2 shows, for 100 simulation trials from model (4.1), the number of simulations in which the predictors X1, ..., X5 are correctly kept in the model and, in column 6, the average number of simulations in which the d − 5 irrelevant predictors are falsely selected into the model. In the simulation, n = 400 and d = 10, 50 and 100 are used. We compare CATCH with other methods, including Marginal CATCH, the Random Forest classifier (RFC) (Breiman [1]) and GUIDE (Loh [12]). For Marginal CATCH we test the association between Y and Xl without conditioning on X−l. RFC is well known as a powerful tool for prediction; here we use the importance scores from RFC for variable selection. We create an additional 30 noise variables and use their average importance score, multiplied by a factor of 2, as a threshold (Doksum et al. [3]) to decide whether a variable should be selected into the model. In particular, we generate the noise variables using uniform and Bernoulli distributions, respectively, for numerical and categorical predictors.
The main goal of GUIDE is to solve the variable selection bias problem in regression and classification tree methods. As an illustration, if a categorical predictor takes K distinct values, then the splitting test exhausts 2^K − 1 possibilities and therefore is
Table 2: Simulation results from Example 4.1 with sample size n = 400. For 1 ≤ i ≤ 5, the ith column gives the estimated probability of selection of the relevant variable Xi and the standard error, based on 100 simulations. Column 6 gives the mean of the estimated probability of selection, and the standard error, for the irrelevant variables X6, ..., Xd.

Number of Selections (100 simulations)

METHOD            d     X1           X2           X3           X4           X5           Xi, i ≥ 6
CATCH             10    0.82 (0.04)  0.75 (0.04)  0.78 (0.04)  0.82 (0.04)  1 (0)        0.05 (0.02)
                  50    0.76 (0.04)  0.76 (0.04)  0.87 (0.03)  0.84 (0.04)  0.96 (0.02)  0.08 (0.03)
                  100   0.77 (0.04)  0.73 (0.04)  0.82 (0.04)  0.8 (0.04)   0.83 (0.04)  0.09 (0.03)
Marginal CATCH    10    0.77 (0.04)  0.67 (0.05)  0.7 (0.05)   0.8 (0.04)   0.06 (0.02)  0.1 (0.03)
                  50    0.69 (0.05)  0.7 (0.05)   0.76 (0.04)  0.79 (0.04)  0.07 (0.03)  0.1 (0.03)
                  100   0.78 (0.04)  0.76 (0.04)  0.73 (0.04)  0.73 (0.04)  0.12 (0.03)  0.1 (0.03)
Random Forest     10    0.71 (0.05)  0.51 (0.05)  0.47 (0.05)  0.59 (0.05)  0.37 (0.05)  0.06 (0.02)
                  50    0.64 (0.05)  0.61 (0.05)  0.65 (0.05)  0.58 (0.05)  0.28 (0.04)  0.08 (0.03)
                  100   0.66 (0.05)  0.58 (0.05)  0.59 (0.05)  0.65 (0.05)  0.17 (0.04)  0.11 (0.03)
GUIDE             10    0.96 (0.02)  0.98 (0.01)  0.93 (0.03)  0.99 (0.01)  0.92 (0.03)  0.33 (0.05)
                  50    0.82 (0.04)  0.85 (0.04)  0.9 (0.03)   0.89 (0.03)  0.67 (0.05)  0.12 (0.03)
                  100   0.83 (0.04)  0.84 (0.04)  0.86 (0.03)  0.83 (0.04)  0.61 (0.05)  0.08 (0.03)
Figure 1: The relationship between Y and X1, X2 for model (4.1). The four areas represent the four values of Y.
biased toward being selected. The way GUIDE alleviates this bias is to employ lack-of-fit tests, i.e., χ²-tests applicable to both numerical and categorical predictors, with the p-values adjusted by a bootstrap procedure. Since GUIDE has negligible selection bias, it can include tests for local pairwise interactions. Furthermore, it achieves sensitivity to local curvature by using a linear model at each node.
The results in Table 2 show that Marginal CATCH does a very poor job of selecting X5; this illustrates the importance of local conditioning. RFC does better than Marginal CATCH but also has difficulties, while CATCH and GUIDE successfully identify X5, along with X1 to X4, with high frequency. When d = 10, GUIDE does very well for the relevant variables but selects too many irrelevant variables. Surprisingly, GUIDE selects fewer irrelevant variables as d increases; it becomes more cautious as the dimension d grows. CATCH performs very well in the high dimensional cases (d = 50 and d = 100). This indicates that the CATCH algorithm is robust to the intricate interaction structures between the predictors.
In model (4.1) above, the predictors interact in a hierarchical way. When X5 = 0, the response is related to X1 and X2 as shown in Figure 1, where the two axes are the two relevant predictors and the four areas partitioned by the curves correspond to the different values of Y. When X5 = 1, the response depends on X3 and X4 in the same way. This type of hierarchical interaction between categorical and numerical predictors is difficult to detect even by state-of-the-art classification methods such as the support vector machine (SVM) and RFC. In particular, RFC produces an importance score vector that identifies the four numerical variables X1, ..., X4 as very significant but often leaves the categorical variable X5 unidentified, as we see in Table 2.
The CATCH algorithm, however, performs very well for this complicated model. We now explain why CATCH is able to identify X5 as an important variable. To explore the association between X5 and Y, we construct M contingency tables with 2 rows and 4 columns based on M randomly centered tubes, calculate the contingency efficacy scores, and then average the scores. Figure 2 (a) and (b) illustrate how one of the contingency efficacy scores is calculated. 1000 data points are generated from model (4.1) and split into two sets according to x5: (a) shows the roughly half of the points with x5 = 0 in the (x1, x2) dimensions, and (b) shows the remaining points with x5 = 1 in the (x3, x4) dimensions. The data points are colored according to Y; this representation enables us to clearly see the classification boundaries (grey lines). A randomly selected tube center with x5 = 0 is shown as a star in (a). The data points within the tube, i.e., close to the tube center in terms of X−5, are highlighted by larger dots. Since the distance defining the tube does not involve X5, the larger dots have x5 equal to 0 or 1 and are present in both (a) and (b). The counts of the larger dots in different colors in (a) and (b) correspond to the cell counts of the two rows, x5 = 0 and x5 = 1, of the contingency table. As we can see, the larger dots in (a) are mostly red and the larger dots in (b) are mostly blue and orange, which shows that the distribution of Y depends on the value of X5, so the contingency efficacy score is high. We expect the contingency efficacy score to be low only if the tube center has similar values in (x1, x2) and (x3, x4); such tube centers are very few among the M tube centers, since the centers are randomly chosen. Thus the average contingency efficacy score is high and X5 is identified as an important variable.
To contrast this example with a simpler data-generating model, suppose the effects of the predictors are additive; that is, the categorical variable is related to the response as a term added to the effect of the numerical predictors. In this type of simpler model, the RFC and GUIDE algorithms can efficiently identify relevant predictors, but so can the CATCH algorithm. We do not include the simulation results for additive models here.
Remark 4.1. We observe that the CATCH algorithm is not very sensitive to the choice of the number of tube centers M or the tube size m = [na], where 0 < a ≤ 1 and [·] is the greatest integer function. For Example 4.1, we perform the simulation using M = n and M = n/2, and a = 0.1, 0.15, 0.2, 0.3, 0.5. The results are summarized in Table 3. The CATCH algorithm performs similarly for M = n and M = n/2, with M = n having a small winning edge. We generally suggest using M = n if computational resources are not a concern or n is not large (≤ 200). When computational cost is a concern (Remark 3.5), one can reduce the number of tube centers to M = n/2 as a compromise between detection power and computational cost. Table 3 also shows that the tube size works well for a in the range between 0.1 and 0.3, while the detection power for X5 decreases in the case a = 0.5, where the tube is too big to achieve local conditioning. This observation is consistent with the comparison to Marginal CATCH in Table 2. On the other hand, the tube size should not be too small, or the detection power is insufficient. We suggest using a = 0.1 or 0.15 for sample sizes n ≥ 400, and m = 50 for smaller sample sizes. In light of these guidelines, we use M = n/2 = 200 and a = 0.15 for Examples 4.1-4.3, and M = n = 200 and m = 50 for Example 4.4.
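As a quick illustration, the guidance in this remark can be collected into a small helper. The function name, its flag, and the integer arithmetic are our paraphrase of the remark, not part of any CATCH software.

```python
def catch_tuning(n, computation_is_cheap=True):
    """Suggested (M, m) following the guidance of Remark 4.1: M tube
    centers and tube size m = [n*a], with [.] the greatest integer
    function (computed here in integer arithmetic for a = 0.15)."""
    # Use M = n when computation is cheap or n is small; otherwise M = n/2.
    M = n if (computation_is_cheap or n <= 200) else n // 2
    # a between 0.1 and 0.3 conditions locally; a = 0.5 makes tubes too global.
    m = (n * 15) // 100 if n >= 400 else 50   # m = 50 for smaller samples
    return M, m
```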
Table 3: Sensitivity study of CATCH for Example 4.1 (n = 400, d = 50) using different numbers of tube centers M and tube sizes m = [na]. For 1 ≤ i ≤ 5, the ith column gives the number of times Xi was selected. Column 6 gives the percentage of times irrelevant variables were selected.

Number of Selections (100 simulations)
M        a     X1  X2  X3  X4  X5  Xi, i ≥ 6
M = 200  0.10  75  67  77  69  92   78
         0.15  76  76  87  84  96   82
         0.20  85  84  90  87  93   99
         0.30  89  87  95  89  85   96
         0.50  89  90  96  93  62   99
M = 400  0.10  74  76  82  77  92   72
         0.15  85  82  89  87  94   86
         0.20  87  88  90  86  96   91
         0.30  88  89  91  91  92   94
         0.50  89  91  94  93  61  102
In Example 4.1 all predictors are independent. In a similar setting, we next consider the case where the predictors are dependent.
Example 4.2. We generate standard normal random variables Zi, for i = 1, ..., d, in such a way that cor(Z1, Z4) = 0.5, cor(Z3, Z6) = 0.9, and the other Zi's are independent. We set Xi = Φ(Zi), for i = 1, ..., d, where Φ(·) is the CDF of a standard normal; therefore the Xi are uniformly distributed marginally, as in Example 4.1. The correlation structure of the Xi's is similar to that of the Zi's. The response Y is generated in the same way as in Example 4.1.
The results of Example 4.2 are given in Table 4. Comparing with the results in Table 2, both CATCH and GUIDE do well in the presence of dependence between the numerical variables. The explanation is two-fold. First, both methods identify the categorical X5 as an important predictor most of the time, which makes the identification of the other relevant predictors possible. Second, conditioning on X5, Y depends on a pair of variables ("X1 and X2" or "X3 and X4") jointly, and in particular this dependence is nonlinear, which makes both methods less affected by the correlation introduced here.
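A data generator for this example is easy to sketch. The construction below (names ours) induces the two correlations with the standard device Z4 ← ρZ1 + sqrt(1 − ρ²)Z4, which preserves the standard normal marginals and hence the uniform marginals of Xi = Φ(Zi).

```python
import math
import random

def phi(z):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gen_predictors(n, d=6, rho14=0.5, rho36=0.9, seed=0):
    """Dependent uniform predictors as in Example 4.2 (assumes d >= 6):
    Z_i standard normal with cor(Z_1, Z_4) = rho14 and cor(Z_3, Z_6) = rho36,
    then X_i = Phi(Z_i), so each X_i is Uniform(0, 1) marginally."""
    rng = random.Random(seed)
    X = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # induce the two pairwise correlations without changing marginals
        z[3] = rho14 * z[0] + math.sqrt(1 - rho14 ** 2) * z[3]
        z[5] = rho36 * z[2] + math.sqrt(1 - rho36 ** 2) * z[5]
        X.append([phi(zi) for zi in z])
    return X
```

Note that the correlation of the uniforms is slightly smaller than that of the normals ((6/π) arcsin(ρ/2) for normal correlation ρ), which is why the text says the correlation structure of the Xi's is "similar to" that of the Zi's.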
Figure 2: Sample scatter plots of (x1, x2) and (x3, x4) for simulated data from model (4.1). Panel (a) includes all the pairs (x1, x2) with x5 = 0, and panel (b) includes all the pairs (x3, x4) with x5 = 1; both axes run from 0 to 1. The larger dots in the two plots are within the tube around a tube center, shown as a star.
We now explain the latter in detail for both methods. In CATCH, when predictors are being conditioned on, the algorithm adaptively weighs their effects by their importance scores in defining the tubes; therefore identifying either of the two will help to identify the other, whereas an irrelevant variable, though strongly associated with one of the relevant variables, is down-weighted iteratively and does not strongly affect the selection of the relevant variable. This explains the slightly decreasing selection probability for X3 in the presence of the strongly correlated X6 and, consequently, the overall lower selection of X3 and X4 compared to X1 and X2, given the correlation between X1 and X4. In GUIDE, the difference between the dependent and independent cases is even smaller, because GUIDE is very powerful in detecting local interactions by literally including pairwise interactions when selecting splitting variables.
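The adaptive down-weighting described here amounts to using importance-weighted distances when forming tubes. A minimal sketch follows; the notation is ours, and the precise weighting scheme is the one given in Section 3, not this one.

```python
def weighted_tube_distance(x, center, weights, skip):
    """Illustrative distance used to form the tube when scoring variable
    X_skip: coordinates are weighted by the current importance scores,
    and the coordinate under study is excluded from the distance."""
    return sum(w * (a - b) ** 2
               for k, (w, a, b) in enumerate(zip(weights, x, center))
               if k != skip)
```

At each iteration the weights would be refreshed from the latest scores, so a variable like X6, strongly correlated with X3 but irrelevant for Y, receives a shrinking weight and stops distorting the tubes used to score X3.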
Marginal CATCH and Random Forest do poorly in this case, mainly because X5 cannot be detected, as we explained earlier. More specifically, the dependence between Y and X4 in the X5 = 1 case is the same as that between Y and X2 in the X5 = 0 case. The linear terms of X1 and X2 have opposite signs in the decision function; therefore the marginal effects of X1 and X4 are obscured when X5 is not identified as an important variable.
4.2 Improving the Performance of SVM and RF Classifiers
The criterion used in the CATCH algorithm is not based on classification accuracy. But if we use CATCH to screen out the irrelevant variables and only use the important
Table 4: Simulation results from Example 4.2 with sample size n = 400. For 1 ≤ i ≤ 6, the ith column gives the estimated probability of selection of variable Xi and the standard error, based on 100 simulations, where Xi, i = 1, ..., 5, are relevant variables with cor(X1, X4) = 0.5. X6 is an irrelevant variable that is correlated with X3, with correlation 0.9. Column 7 gives the mean of the estimated probability of selection, and that of the standard error, for the irrelevant variables X7, ..., Xd.

Number of Selections (100 simulations)
METHOD          d     X1          X2          X3          X4          X5          X6          Xi, i ≥ 7
CATCH           100   0.76(0.04)  0.77(0.04)  0.73(0.04)  0.68(0.05)  0.99(0.01)  0.34(0.05)  0.07(0.02)
                500   0.78(0.04)  0.77(0.04)  0.74(0.04)  0.63(0.05)  0.92(0.03)  0.57(0.05)  0.09(0.03)
                1000  0.76(0.04)  0.77(0.04)  0.80(0.04)  0.74(0.04)  0.82(0.04)  0.55(0.05)  0.09(0.03)
Marginal CATCH  100   0.35(0.05)  0.63(0.05)  0.80(0.04)  0.32(0.05)  0.09(0.03)  0.71(0.05)  0.12(0.03)
                500   0.34(0.05)  0.71(0.05)  0.70(0.05)  0.30(0.05)  0.10(0.03)  0.60(0.05)  0.11(0.03)
                1000  0.34(0.05)  0.68(0.05)  0.75(0.04)  0.30(0.05)  0.10(0.03)  0.59(0.05)  0.10(0.03)
Random Forest   100   0.29(0.05)  0.48(0.05)  0.55(0.05)  0.20(0.04)  0.30(0.05)  0.47(0.05)  0.05(0.02)
                500   0.26(0.04)  0.52(0.05)  0.57(0.05)  0.30(0.05)  0.24(0.04)  0.45(0.05)  0.08(0.03)
                1000  0.36(0.05)  0.61(0.05)  0.66(0.05)  0.37(0.05)  0.32(0.05)  0.46(0.05)  0.12(0.03)
GUIDE           100   0.96(0.02)  0.96(0.02)  0.94(0.02)  0.95(0.02)  0.96(0.02)  0.84(0.04)  0.30(0.05)
                500   0.80(0.04)  0.92(0.03)  0.88(0.03)  0.79(0.04)  0.72(0.04)  0.68(0.05)  0.12(0.03)
                1000  0.81(0.04)  0.89(0.03)  0.90(0.03)  0.76(0.04)  0.75(0.04)  0.58(0.05)  0.08(0.03)
variables for classification, the performance of other algorithms, such as the Support Vector Machine (SVM) and Random Forest Classifier (RFC), can be significantly improved. In this section we compare the prediction performance of (1) SVM, (2) RFC, (3) CATCH followed by SVM, and (4) CATCH followed by RFC. We label (3) and (4) CATCH+SVM and CATCH+RFC, respectively. We use the "svm" function in the "e1071" package in R, with the radial kernel, to perform the SVM analysis, and the "randomForest" function in the "randomForest" package to perform the RFC analysis. We use simulation experiments to illustrate the improvement obtained by using CATCH prior to classifying Y.
Let the true model be

(X, Y) ~ P,   (4.2)

where P will be specified later. In each of 100 Monte Carlo experiments, n pairs (X(i), Y(i)), i = 1, ..., n, are generated from the true model (4.2). Based on the simulated data (X(i), Y(i)), i = 1, ..., n, the methods (1) SVM, (2) RFC, (3) CATCH+SVM, and (4) CATCH+RFC are used to classify a future Y(0) for which a corresponding X(0) is available.

The integrated misclassification rate (IMR; Friedman [6]), defined below, is used to evaluate the performance of the classification algorithms. Let F_{X,Y}(·) be the classifier constructed by a classification algorithm based on X = (X(1), ..., X(n)) and Y = (Y(1), ..., Y(n))'.

Let (X(0), Y(0)) be a future observation with the same distribution as (X(i), Y(i)). We only know X(0) = x(0) and want to classify Y(0). For a possible future observation (x(0), y(0)), define

MR(X, Y, x(0), y(0)) = P(F_{X,Y}(x(0)) ≠ y(0) | X, Y).   (4.3)

Our criterion is

IMR = E E_0 [MR(X, Y, x(0), y(0))],   (4.4)

where E_0 is with respect to the distribution of the random (x(0), y(0)) and E is with respect to the distribution of (X, Y).

This double expectation is evaluated using Monte Carlo simulation; the result is denoted IMR*. More precisely, E_0 is estimated using 5000 simulated (x(0), y(0)) observations; IMR* is then computed on the basis of 100 simulated samples (X_i, Y_i), i = 1, ..., 100.
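The nested Monte Carlo evaluation of the double expectation (4.4) can be sketched as follows. Here `gen_xy` and `fit` are placeholders for the model P and a classification algorithm, and the default loop sizes are illustrative rather than the paper's (100 training samples, 5000 test draws).

```python
import random

def estimate_imr(gen_xy, fit, n=100, n_train_sets=20, n_test=500, seed=0):
    """Nested Monte Carlo estimate of the IMR in (4.4): the outer loop
    over training sets (X, Y) estimates E, and the inner draw of test
    pairs estimates E_0[MR]. `gen_xy(rng, n)` samples n pairs from P;
    `fit(X, Y)` returns a classifier x -> predicted label."""
    rng = random.Random(seed)
    rates = []
    for _ in range(n_train_sets):
        X, Y = gen_xy(rng, n)
        clf = fit(X, Y)
        test_X, test_Y = gen_xy(rng, n_test)
        errors = sum(clf(x) != y for x, y in zip(test_X, test_Y))
        rates.append(errors / n_test)
    return sum(rates) / n_train_sets
```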
Example 4.3. First consider model (4.1) in Section 4.1. Figure 3 shows IMR* versus the number of irrelevant variables for this example. The left panel shows that the IMR*'s of the SVM approach increase dramatically as the number of irrelevant variables increases, while the IMR*'s of CATCH+SVM are stable. Thus CATCH stabilizes the performance of SVM for this complex model. The right panel shows that CATCH also stabilizes the performance of RFC. These results indicate that CATCH can improve the prediction accuracy of other classification algorithms.
Figure 3: IMR*'s of SVM, RFC, CATCH+SVM, and CATCH+RFC versus the number of predictors d for the simulation study of model (4.1) in Example 4.1; the left panel compares SVM with CATCH+SVM, and the right panel compares RFC with CATCH+RFC. The sample size is n = 400.
Example 4.4 (Contaminated Logistic Model). Consider a binary response Y ∈ {0, 1} with

P(Y = 0 | X = x) = { F(x) if r = 0;  G(x) if r = 1 },   (4.5)

where x = (x1, ..., xd),

r ~ Bernoulli(γ),   (4.6)

F(x) = (1 + exp{−(0.25 + 0.5 x1 + x2)})^(−1),   (4.7)

G(x) = { 0.1 if 0.25 + 0.5 x1 + x2 < 0;  0.9 if 0.25 + 0.5 x1 + x2 ≥ 0 },   (4.8)

and x1, x2, ..., xd are realizations of independent standard normal random variables. Hence, given X(i) = x(i), 1 ≤ i ≤ n, the joint distribution of Y(1), Y(2), ..., Y(n) is the contaminated logistic model with contamination parameter γ:

(1 − γ) ∏(i=1 to n) [F(x(i))]^(Y(i)) [1 − F(x(i))]^(1−Y(i)) + γ ∏(i=1 to n) [G(x(i))]^(Y(i)) [1 − G(x(i))]^(1−Y(i)).   (4.9)
We perform a simulation study for n = 200, d = 50, and γ = 0, 0.2, 0.4, 0.6, 0.8, and 1, where γ = 0 corresponds to no contamination. Figure 4 shows the IMR*'s versus the different values of γ for this example. CATCH improves the performance of the SVM and Random Forest classifiers for this contaminated logistic regression model.
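The data-generating process (4.5)-(4.9) can be sketched directly. Note that (4.9) mixes at the level of the whole sample, so a single contamination indicator r is drawn per data set; the function name and signature below are ours.

```python
import math
import random

def gen_contaminated(n, d, gamma, rng):
    """Sample (X, Y) from the contaminated logistic model (4.5)-(4.9):
    with probability 1 - gamma the whole sample follows the logistic
    P(Y=0|x) = F(x); with probability gamma it follows the step model G."""
    F = lambda x: 1.0 / (1.0 + math.exp(-(0.25 + 0.5 * x[0] + x[1])))
    G = lambda x: 0.1 if 0.25 + 0.5 * x[0] + x[1] < 0 else 0.9
    r = 1 if rng.random() < gamma else 0   # one contamination draw per sample
    X, Y = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        p0 = G(x) if r == 1 else F(x)      # probability that Y = 0
        X.append(x)
        Y.append(0 if rng.random() < p0 else 1)
    return X, Y
```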
Figure 4: IMR*'s of SVM, RFC, CATCH+SVM, and CATCH+RFC versus the contamination parameter γ ∈ {0, 0.2, 0.4, 0.6, 0.8, 1} for simulation model (4.9) in Example 4.4; the left panel compares SVM with CATCH+SVM, and the right panel compares RFC with CATCH+RFC.
5 CATCH Analysis of the ANN Data

In this section CATCH is applied to a data set available from the UCI database (http://ftp.ics.uci.edu/pub/machine-learning-databases/thyroid-disease). The data are from a clinical trial to determine whether a patient is hypothyroid (see, e.g., Quinlan [15]). The data set is split into two parts: a training set with 3772 observations and a test set with 3428 observations. The response is a categorical variable with three classes: normal (not hypothyroid), hyperfunction, and subnormal functioning. There are 21 predictors, including 15 categorical predictors and 6 numerical predictors. The three classes are very imbalanced, with 92.5% of the patients normal. A good classifier should produce a significant decrease from the 7.5% error rate that can be obtained by classifying all observations to the normal class.
CATCH identifies 7 variables as important, including 4 continuous variables and 3 categorical variables. We consider three classification methods: 1. Support Vector Machine with radial kernel, denoted by SVM(r); 2. Support Vector Machine with polynomial kernel, denoted by SVM(p); 3. RFC. We apply these three methods to the training set to build classification models and use the test set to estimate the misclassification rate of each method. We apply the methods with CATCH and without CATCH, respectively, where "with CATCH" means that only the 7 important variables identified by CATCH are used to build the model and "without CATCH" means that all 21 variables are used in the analysis.
Table 5: Misclassification rates of three classification methods (SVM(r), SVM(p), RFC) with CATCH and without CATCH.

            SVM(r)  SVM(p)  RFC
w/o CATCH   0.052   0.063   0.024
with CATCH  0.037   0.055   0.015
Table 5 shows that if CATCH is used to select important variables before the classification methods are applied, the misclassification rates decrease. CATCH reduces the misclassification rates by 29%, 13%, and 37% for SVM(r), SVM(p), and RFC, respectively. CATCH+RFC has the best performance. Although CATCH is not designed for improving classification accuracy, it nonetheless can help other classification methods perform better.
6 Properties of Local Contingency Efficacy
In this section we investigate the properties of the local contingency efficacy. For a univariate continuous predictor, Theorem 6.1 below shows that the local contingency efficacy is well defined and equals 0 if the response variable is independent of the predictor. Consider an R × C contingency table. Let n_rc, r = 1, ..., R, c = 1, ..., C, be the observed frequency in the (r, c)-cell, and let p_rc be the probability that an observation belongs to the (r, c)-cell. Let

p_r = Σ(c=1 to C) p_rc,   q_c = Σ(r=1 to R) p_rc,   (6.1)

and

n_r· = Σ(c=1 to C) n_rc,   n_·c = Σ(r=1 to R) n_rc,   n = Σ(r=1 to R) Σ(c=1 to C) n_rc,   E_rc = n_r· n_·c / n.   (6.2)

For testing that the rows and the columns are independent, i.e.,

H0: p_rc = p_r q_c for all r, c   versus   H1: p_rc ≠ p_r q_c for some r, c,   (6.3)

a standard test statistic is

X² = Σ(r=1 to R) Σ(c=1 to C) (n_rc − E_rc)² / E_rc.   (6.4)

Under H0, X² converges in distribution to χ² with (R − 1)(C − 1) degrees of freedom. Moreover, defining 0/0 = 0, we have the well-known result:

Lemma 6.1. As n tends to infinity, X²/n tends in probability to

τ² ≡ Σ(r=1 to R) Σ(c=1 to C) (p_rc − p_r q_c)² / (p_r q_c).

Moreover, if p_r q_c is bounded away from 0 and 1 as n → ∞, and if R and C are fixed as n → ∞, then τ² = 0 if and only if H0 is true.
Remark 6.1. Lemma 6.1 shows that the contingency efficacy ζ in (2.8) is well defined. Moreover, ζ = 0 if and only if X and Y are independent.
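The population quantity τ² of Lemma 6.1 is simple to evaluate for a given table of cell probabilities; the helper name below is ours. Under independence every cell satisfies p_rc = p_r q_c and τ² = 0, while any departure from independence makes τ² strictly positive.

```python
def tau_squared(p):
    """tau^2 of Lemma 6.1 for a table of cell probabilities p[r][c];
    cells with p_r q_c = 0 contribute 0 (the 0/0 = 0 convention)."""
    R, C = len(p), len(p[0])
    pr = [sum(p[r][c] for c in range(C)) for r in range(R)]  # row marginals
    qc = [sum(p[r][c] for r in range(R)) for c in range(C)]  # column marginals
    return sum((p[r][c] - pr[r] * qc[c]) ** 2 / (pr[r] * qc[c])
               for r in range(R) for c in range(C)
               if pr[r] * qc[c] > 0)
```

For instance, a product table gives τ² = 0, and the perfectly dependent 2 × 2 table with mass 1/2 on each diagonal cell gives τ² = 1.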
Theorem 6.1. Consider a categorical response Y and a continuous predictor X with density function f(x) such that, for a fixed h > 0, f(x) > 0 for x ∈ (x(0) − h, x(0) + h). Then:

(1) For fixed h, the estimated local section efficacy ζ̂(x(0), h) defined in (2.5) converges almost surely to the section efficacy ζ(x(0), h) defined in (2.4).

(2) If X and Y are independent, then the local efficacy ζ(x(0)) defined in (2.4) is zero.

(3) If there exists c ∈ {1, ..., C} such that p_c(x) = P(Y = c | x) has a continuous derivative at x = x(0) with p′_c(x(0)) ≠ 0, then ζ(x(0)) > 0.
The χ² statistic (6.4) can be written in terms of multinomial proportions p̂_rc, with √n(p̂_rc − p_rc), 1 ≤ r ≤ R, 1 ≤ c ≤ C, converging in distribution to a multivariate normal. By a Taylor expansion of T_n ≡ √(X²/n) at τ = plim √(X²/n), we find that:
Theorem 6.2.

√n (T_n − τ) →_d N(0, v²),   (6.5)

where the asymptotic variance v², which can be obtained from the Taylor expansion, is not needed in this paper.

Remark 6.2. The result (6.5) applies to the multivariate χ²-based efficacies ζ̂, with n replaced by the number of observations that go into the computation of ζ̂. In this case we use the notation χ²_j for the chi-square statistic for the jth variable.
7 Consistency of CATCH

Let Y denote a variable to be classified into one of the categories 1, ..., C by using the Xj's in the vector X = (X1, ..., Xd). Let X−j ≡ X − Xj denote X without the Xj component. Assume that the squared Pearson nonparametric correlation (Doksum and Samarov [4], Huang and Chen [9]) between Xj and X−j is less than 0.5. We call the variable Xj relevant for Y if, conditionally given X−j, the law L(Y | Xj = x) is not constant in x, and irrelevant otherwise. Our data (X(1), Y(1)), ..., (X(n), Y(n)) are i.i.d. as (X, Y). A variable selection procedure is consistent if the probability that all the relevant variables are selected, and none of the irrelevant variables are selected, tends to one as n tends to infinity.

We give a general nonparametric framework in which the variable selection rule based on the CATCH importance scores sj given in (3.8) is consistent.

When the predictors are not all categorical and CATCH involves tubes, we assume that the number of observations in each tube tends to ∞ as n → ∞, and we assume that, for those importance scores that require the selection of a section size h, the selection is from a finite set {hj} with minj{hj} > h0 for some h0 > 0. It follows that the number of observations mjk in the kth tube used to calculate the importance score sj for the variable Xj tends to infinity as n → ∞.
The key quantities that determine consistency are the limits λj of the sj. That is,

λj = τ(ζj, P) = E_P[ζj(X)] = ∫ ζj(x) dP(x),   j = 1, ..., d,

where ζj(x) is the local contingency efficacy defined by (2.4) for numerical X's and by (2.8) for categorical X's, and P is the probability distribution of X. Our estimate sj of λj is

sj = τ(ζ̂j, P̂) = (1/M) Σ(k=1 to M) ζ̂j(X*_k),

where ζ̂j is defined in Section 3 and P̂ is the empirical distribution of X.

Let d0 denote the number of X's that are irrelevant for Y. Set d1 = d − d0 and, without loss of generality, reorder the λj so that
λj > 0 if j = 1, ..., d1;   λj = 0 if j = d1 + 1, ..., d.   (7.1)

Our variable selection rule is

"keep variable Xj if sj ≥ t" for some threshold t.   (7.2)

Rule (7.2) is consistent if and only if

P(min(j≤d1) sj ≥ t and max(j>d1) sj < t) → 1.   (7.3)

To establish (7.3), it is enough to show that

P(max(j>d1) sj > t) → 0   (7.4)

and

P(min(j≤d1) sj ≤ t) → 0.   (7.5)
We call (7.4) and (7.5) type I and type II consistency, respectively. Type I and type II error probabilities refer to the probabilities of any false positives and any false negatives, respectively.
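In code, the thresholding rule (7.2) and the two error events behind (7.4) and (7.5) read as follows; the helper names and the convention that relevant variables come first (indices below d1, matching the reordering in (7.1)) are ours.

```python
def select_variables(scores, t):
    """Rule (7.2): keep X_j when its importance score s_j meets threshold t."""
    return [j for j, s in enumerate(scores) if s >= t]

def consistency_events(scores, t, d1):
    """Error events behind (7.4) and (7.5), with the relevant variables
    reordered to indices 0..d1-1: a false positive occurs when some
    irrelevant score reaches t; a false negative when some relevant
    score falls below t."""
    false_positive = any(s >= t for s in scores[d1:])
    false_negative = any(s < t for s in scores[:d1])
    return false_positive, false_negative
```

Consistency of the rule means that, as n grows, both events simultaneously have probability tending to zero.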
7.1 Type I consistency
We consider (7.4) first and note that

P(max(j>d1) sj > t) ≤ Σ(j=d1+1 to d) P(sj > t).   (7.6)

For the jth irrelevant variable Xj, j > d1, the results in Section 6 apply to the local efficacy ζ̂j(x) at a fixed point x. The importance score sj defined in (3.8) is the average of such efficacies over a sample of M random points x*_a. We assume that there is a point x(0) such that, for irrelevant Xj's and for n large enough, s(0)_j ≡ ζ̂j(x(0)) is stochastically larger in the right tail than the average M^(−1) Σ(a=1 to M) ζ̂j(x*_a). That is, we assume that there exist t > 0 and n0 such that for n ≥ n0,

P(sj > t) ≤ P(s(0)_j > t),   j > d1.   (7.7)

The existence of such a point x(0) is established by considering the probability limit of

arg max{ζ̂j(x*_a) : x*_a ∈ {x*_1, ..., x*_M}}.

Note that by Lemma 6.1, s(0)_j →_p 0 as n → ∞. It follows that we have established (7.4) for t and d0 = d − d1 that are fixed as n → ∞. To examine the interesting case where d0 → ∞ as n → ∞, we need large deviation results, to which we turn next. In the d0 → ∞ case we consider thresholds t ≡ tn that tend to ∞ as n → ∞, and we use the weaker assumption: there exist x(0) and n0 such that
P(sj > tn) ≤ P(s(0)_j > tn)   (7.8)

for n ≥ n0 and each tn with lim(n→∞) tn = ∞. We also assume that the number of columns C and rows R in the contingency table that defines the efficacy χ²_jn given by (6.4) stays fixed as n → ∞. In this case P(s(0)_j > tn) → 0 provided each term

[T(j)_rc]² ≡ [(n_rc − E_rc) / √E_rc]²

in χ²_jn, defined for Xj by (6.4), satisfies

P(|T(j)_rc| > tn) → 0.

This is because

P(√(Σ_r Σ_c [T(j)_rc]²) > tn) ≤ P(RC max_rc [T(j)_rc]² > tn²) ≤ RC max_rc P(|T(j)_rc| > tn / √(RC)),

and tn and tn/√(RC) are of the same order as n → ∞. By large deviation theory (Hall [8], Jing et al. [10]), for some constant A,

P(|T(j)_rc| > tn) = (2 tn^(−1) / √(2π)) e^(−tn²/2) [1 + A tn³ n^(−1/2) + o(tn³ n^(−1/2))]   (7.9)

provided tn = o(n^(1/6)). If (7.9) is uniform in j > d1, then (7.6) and (7.9) imply that type I consistency holds if

d0 (tn/√2)^(−1) e^(−tn²/2) → 0.   (7.10)

To examine (7.10), write d0 and tn as d0 = exp(n^b) and tn = √2 n^r. By large deviation theory, (7.10) holds when r < 1/6. Thus if 0 < b/2 < r < 1/6, then

d0 (tn/√2)^(−1) e^(−tn²/2) = n^(−r) exp(n^b − n^(2r)) → 0.

Thus d0 = exp(n^b) with b = 1/3 − ε, for any ε > 0, is the largest d0 can be to have consistency. For instance, if there are exactly d0 = exp(n^(1/4)) irrelevant variables and we use the threshold tn = √2 n^(1/7), the probability of selecting any of the irrelevant variables tends to zero at the rate n^(−1/7) exp(n^(1/4) − n^(2/7)).
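The growth-rate bookkeeping above is easy to verify numerically. The helper below (ours, for illustration) evaluates the logarithm of the type I bound in (7.10) under d0 = exp(n^b) and tn = √2 n^r; for 0 < b/2 < r < 1/6 it decreases to −∞ as n grows, so the bound itself vanishes.

```python
import math

def log_type_i_bound(n, b, r):
    """log of d0 * (t_n / sqrt(2))**(-1) * exp(-t_n**2 / 2) with
    d0 = exp(n**b) and t_n = sqrt(2) * n**r, which simplifies to
    n**b - n**(2*r) - r*log(n)."""
    return n ** b - n ** (2 * r) - r * math.log(n)
```

With b = 1/4 and r = 1/7, as in the example in the text, the condition 0 < b/2 < r < 1/6 holds (1/8 < 1/7 < 1/6), and the log-bound is negative and decreasing in n.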
7.2 Type II consistency
We now assume that for each relevant variable Xj there is a least favorable point x(1) such that the local efficacy s(1)_j ≡ ζ̂(x(1)) is stochastically smaller than the importance score sj = M^(−1) Σ(a=1 to M) ζ̂(x*_a). That is, we assume that for each relevant variable Xj there exist n1 and x(1) such that for n ≥ n1,

P(s(1)_j ≤ tn) ≥ P(sj ≤ tn),   j ≤ d1.   (7.11)

The existence of such a least favorable x(1) is established by considering the probability limit of

arg min{ζ̂j(x*_a) : x*_a ∈ {x*_1, ..., x*_M}}.

We again assume that R and C are fixed, so that to show type II consistency it is enough to show that, for each (r, c) and j ≤ d1,

P(|T(j)_rc| ≤ tn) → 0.
This is because

P(√(Σ_r Σ_c [T(j)_rc]²) ≤ tn) ≤ P(min_rc [T(j)_rc]² ≤ tn²) ≤ RC max_rc P(|T(j)_rc| ≤ tn).

By Theorem 6.2, √n(s(1)_j − τj) →_d N(0, ν²) with τj > 0. It follows that if tn and d1 are bounded above as n → ∞ and τj > 0, then P(s(1)_j ≤ tn) → 0. Thus type II consistency holds in this case. To examine the more interesting case, where tn and d1 are not bounded above and τj may tend to zero as n → ∞, we will need to center the T(j)_rc. Thus set

θ(j)_n = √n (p(j)_rc − p(j)_r p(j)_c) / √(p(j)_r p(j)_c),   (7.12)

T(j)_n = T(j)_rc − θ(j)_n,

where for simplicity the dependence of T(j)_n and θ(j)_n on r and c is not indicated.
Lemma 7.1.

P(min(j≤d1) |T(j)_rc| ≤ tn) ≤ Σ(j=1 to d1) P(|T(j)_n| ≥ min(j≤d1) |θ(j)_n| − tn).

Proof.

P(min(j≤d1) |T(j)_rc| ≤ tn) = P(min(j≤d1) |T(j)_n + θ(j)_n| ≤ tn)
  ≤ P(min(j≤d1) |θ(j)_n| − max(j≤d1) |T(j)_n| ≤ tn)
  = P(max(j≤d1) |T(j)_n| ≥ min(j≤d1) |θ(j)_n| − tn)
  ≤ Σ(j=1 to d1) P(|T(j)_n| ≥ min(j≤d1) |θ(j)_n| − tn).
Set

an = min(j≤d1) |θ(j)_n| − tn;

then by large deviation theory, if an = o(n^(1/6)),

P(|T(j)_n| ≥ an) ∝ (an/√2)^(−1) e^(−an²/2),   (7.13)

where ∝ signifies asymptotic order as n → ∞, an → ∞. It follows that if (7.13) holds uniformly in j, then by Lemma 7.1,

P(min(j≤d1) |T(j)_rc| ≤ tn) ∝ d1 (an/√2)^(−1) e^(−an²/2).

Thus type II consistency holds when an = o(n^(1/6)) and

d1 (an/√2)^(−1) e^(−an²/2) → 0.   (7.14)

In Section 7.1, tn is of order n^r, 0 < r < 1/6. For this tn, type II consistency holds if an is of order n^r, 0 < r < 1/6, and

d1 n^(−r) exp(−n^(2r)/2) → 0.

If an = O(n^k) with k ≥ 1/6, then P(|T(j)_n| ≥ an) tends to zero at a rate faster than in (7.13), and consistency is ensured. For instance, if all relevant variables Xj have p(j)_rc fixed as n → ∞, then min(j≤d1) |θ(j)_n| is of order √n, and an is of order n^(1/2) − tn.
7.3 Type I and II consistency

We see from Sections 7.1 and 7.2 that consistency holds under a variety of conditions. One set of conditions is the following.

Theorem 7.1. Under the regularity conditions of Sections 7.1 and 7.2, if

d0 = c0 exp(n^b),   tn = c1 n^r,   min(j≤d1) |θ(j)_n| − tn = c2 n^r,   d1 = c3 exp(n^b),

for positive constants c0, c1, c2, and c3, and 0 < b/2 < r < 1/6, then CATCH is consistent.
Appendix: Proof of Theorem 6.1

(1) Let N_h = Σ(i=1 to n) I(X(i) ∈ N_h(x(0))). Then N_h ~ Binomial(n, ∫(x(0)−h to x(0)+h) f(x) dx) and N_h → ∞ almost surely as n → ∞. Write

ζ̂(x(0), h) = n^(−1/2) √(X²(x(0), h)) = (N_h / n)^(1/2) (N_h)^(−1/2) √(X²(x(0), h)).

Since N_h ~ Binomial(n, ∫(x(0)−h to x(0)+h) f(x) dx), we have

N_h / n → ∫(x(0)−h to x(0)+h) f(x) dx.   (7.15)

By Lemma 6.1 and Table 1,

(N_h)^(−1/2) √(X²(x(0), h)) → [Σ(r=+,−) Σ(c=1 to C) (p_rc − p_r q_c)² / (p_r q_c)]^(1/2),   (7.16)

where

p_{+c} = Pr(X > x(0), Y = c | X ∈ N_h(x(0))),   p_{−c} = Pr(X ≤ x(0), Y = c | X ∈ N_h(x(0)))

for r = +, − and c = 1, ..., C,

p_r = Σ(c=1 to C) p_rc,   q_c = Σ(r=+,−) p_rc.

In fact, p_+ = Pr(X > x(0) | X ∈ N_h(x(0))), p_− = Pr(X ≤ x(0) | X ∈ N_h(x(0))), and q_c = Pr(Y = c | X ∈ N_h(x(0))).

By (7.15) and (7.16),

ζ̂(x(0), h) → [∫(x(0)−h to x(0)+h) f(x) dx]^(1/2) [Σ(r=+,−) Σ(c=1 to C) (p_rc − p_r q_c)² / (p_r q_c)]^(1/2) ≡ ζ(x(0), h).

(2) Since X and Y are independent, p_rc = p_r q_c for r = +, − and c = 1, ..., C; then ζ(x(0), h) = 0 for any h. Hence

ζ(x(0)) = sup(h>0) ζ(x(0), h) = 0.

(3) Recall that p_c(x(0)) = Pr(Y = c | x(0)). Without loss of generality, assume p′_c(x(0)) > 0. Then there exists h such that

Pr(Y = c | x(0) < X < x(0) + h) > Pr(Y = c | x(0) − h < X < x(0)),

which is equivalent to p_{+c}/p_+ > p_{−c}/p_−. This shows that ζ(x(0)) > 0, since by Lemma 6.1, ζ(x(0)) = 0 would imply p_{+c}/p_+ = p_{−c}/p_− = q_c.
References
[1] Breiman, L., 2001. Random forests. Machine Learning 45, 5-32.

[2] Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J., 1984. Classification and Regression Trees. Wadsworth, Belmont.

[3] Doksum, K., Tang, S., Tsui, K.-W., 2008. Nonparametric variable selection: the EARTH algorithm. Journal of the American Statistical Association 103(484), 1609-1620.

[4] Doksum, K.A., Samarov, A., 1995. Nonparametric estimation of global functionals and a measure of explanatory power of covariates in regression. Annals of Statistics 23, 1443-1473.

[5] Doksum, K.A., Schafer, C., 2006. Powerful choices: tuning parameter selection based on power. In: Frontiers in Statistics: Dedicated to Peter Bickel (J. Fan and H.L. Koul, eds.), Imperial College Press, 113-141.

[6] Friedman, J., 1991. Multivariate adaptive regression splines. The Annals of Statistics 19, 1-67.

[7] Gao, J., Gijbels, I., 2008. Bandwidth selection in nonparametric kernel testing. Journal of the American Statistical Association 103(484), 1584-1594.

[8] Hall, P., 1992. The Bootstrap and Edgeworth Expansion. Springer, New York.

[9] Huang, L.-S., Chen, J., 2008. Analysis of variance, coefficient of determination and F-test for local polynomial regression. Annals of Statistics 36, 2085-2109.

[10] Jing, B.-Y., Shao, Q.-M., Wang, Q., 2003. Self-normalized Cramér-type large deviations for independent random variables. Annals of Probability 31, 2167-2215.

[11] Lin, D.Y., Zeng, D., 2011. Correcting for population stratification in genomewide association studies. Journal of the American Statistical Association 106, 997-1008.

[12] Loh, W.-Y., 2009. Improving the precision of classification trees. The Annals of Applied Statistics 3, 1710-1737.

[13] Price, A., Patterson, N., Plenge, R., Weinblatt, M., Shadick, N., Reich, D., 2006. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38(8), 904-909.

[14] Qiao, Z., Zhou, L., Huang, J., 2008. Linear discriminant analysis for high-dimensional, low sample size data. Special issue of World Congress on Engineering 2008.

[15] Quinlan, J.R., 1987. Simplifying decision trees. International Journal of Man-Machine Studies 27(3), 221-234.

[16] Schafer, C., Doksum, K.A., 2009. Selecting local models in multiple regression by maximizing power. Metrika 69, 283-304.

[17] Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267-288.

[18] Wang, L., Shen, X., 2007. On L1-norm multiclass support vector machines: methodology and theory. Journal of the American Statistical Association 102, 583-594.

[19] Zhang, H.H., Liu, Y., Wu, Y., Zhu, J., 2008. Variable selection for the multicategory SVM via adaptive sup-norm regularization. Electronic Journal of Statistics 2, 149-167.
1 Introduction
We consider classification problems with a large number of predictors that can be numerical or categorical. We are interested in the case in which many of the predictors may be irrelevant for classification, so that the variables useful for classification need to be selected. In genomic research the classes considered are often cases and controls; thus variable selection is the same as the important problem of deciding which variables are associated with disease. With modern technology, data of large dimension arise in many scientific disciplines, including biology, genomics, astronomy, economics, and computer science. In particular, the number of variables can be greater than the sample size, which poses a considerable challenge to the statistical analysis. If only some of the variables are useful for classification, over-fitting is a problem for methods that use all variables, and therefore variable selection becomes critical for statistical analysis.
Methods for variable selection in the classification context include methods that incorporate variable selection as part of the classification procedure. This class includes random forest (Breiman [1]), CART (Breiman et al. [2]), and GUIDE (Loh [12]). Random forest assigns an importance score to each of the predictors, and one can drop those variables whose importance scores fall below a certain threshold. CART, after pruning, will choose a subset of optimal splitting variables as the most significant variables. GUIDE is a tree-based method particularly powerful in unbiased variable selection and interaction detection. Other research includes methods that incorporate variable selection by applying shrinkage methods with L1-norm constraints on the parameters (Tibshirani [17]), which generate sparse vectors of parameter estimates. Wang and Shen [18] and Zhang et al. [19] incorporate variable selection into classification based on support vector machine (SVM) methods. Qiao et al. [14] consider variable selection based on linear discriminant analysis. One limitation of these variable selection methods is that they are not based on nonparametric methods, and therefore they may not work well when there is a complex relationship between the predictor variables and the variables to be classified.
We propose a nonparametric method for variable selection, called Categorical Adaptive Tube Covariate Hunting (CATCH), that performs well as a variable selection algorithm and can be used to improve the performance of available classification procedures. The idea is to construct a nonparametric measure of the strength of the relationship between each predictor and the categorical response, and to retain those predictors whose relationship to the response is above a certain threshold. The nonparametric measure of importance for each predictor is obtained by first measuring the importance of the predictor using local information and then combining such local importance scores into an overall importance score. The local importance scores are based on root chi-square type statistics for local contingency tables.
In addition to the aforementioned nonparametric feature, the CATCH procedure has another property: it measures the importance of each variable conditionally on all other variables, thereby reducing the confounding that may lead to the selection of variables spuriously related to the categorical variable Y. This is accomplished by constraining all predictors but the one we are focusing on. For the case where the number of predictors is huge (d >> n), this can be done by restricting principal components for some types of studies; see Remark 3.4.
Our approach to nonparametric variable selection is related to the EARTH algorithm (Doksum et al. [3]), which applies to nonparametric regression problems with a continuous response variable and continuous predictors. It measures the association between a predictor $X_i$ and the response variable, conditional on all the other predictors $\{X_j : j \ne i\}$, by constraining the $X_j$, $j \ne i$, variables to regions called tubes. The local importance score is based on a local linear or a local polynomial regression. The contribution of the current paper is to develop variable selection methods for the classification problem with a categorical response variable and predictors that can be continuous, discrete, or categorical.
Our CATCH algorithm can be used as a variable selection step before classification. Any classification method, preferably nonparametric, can be used after we single out the important variables. In particular, SVM and random forest are statistical classification methods that can be used with CATCH. We show in simulation studies that when the true model is highly nonlinear and there are many irrelevant predictors, using CATCH to screen out irrelevant or weak predictors greatly improves the performance of SVM and random forest.
The CATCH algorithm works with general classification data with both numerical and categorical predictors. Moreover, the CATCH algorithm is robust when the predictors interact with each other, especially when numerical and categorical predictors interact, e.g., under hierarchical interactive association between the numerical and categorical predictors. We present a model with a reasonable sample size and predictor dimension for which random forest has a relatively low chance of detecting the significance of the categorical predictor. The CART and GUIDE algorithms with pruning can find the correct splitting variables, including the categorical one, but yield relatively high classification errors. Moreover, CART and GUIDE have trouble choosing the splitting predictors in the correct order. The CATCH algorithm achieves higher accuracy in the task of variable selection. This is due to the importance scores being conditional, as illustrated in the simulation example in Section 4.1.
The rest of the paper proceeds as follows. In Section 2 we introduce importance scores for univariate predictors. In Sections 3.1-3.4 we extend these scores to multivariate predictors, and in Section 3.5 we introduce the CATCH algorithm. In Section 4 we use simulation studies to show the effectiveness of the CATCH algorithm. A real example is provided in Section 5. In Section 6 we provide some theoretical properties to justify the definition of local contingency efficacy, and in Section 7 we show asymptotic consistency of the algorithm.
2 Importance Scores for Univariate Classification
Let $(X^{(i)}, Y^{(i)})$, $i = 1, \ldots, n$, be independent and identically distributed (i.i.d.) as $(X, Y) \sim P$. We first consider the case of a univariate $X$, then use the methods constructed for the univariate case to construct the multivariate versions.
2.1 Numerical Predictor: Local Contingency Efficacy
Consider the classification problem with a numerical predictor. Let

$$\Pr(Y = c \mid x) = p_c(x), \quad c = 1, \ldots, C, \qquad \sum_{c=1}^{C} p_c(x) \equiv 1. \tag{2.1}$$
We introduce a measure of how strongly $X$ is associated with $Y$ in a neighborhood of a "fixed" point $x^{(0)}$. Points $x^{(0)}$ will be selected at random by our algorithm; local importance scores will be computed for each point and then averaged. For a fixed $h > 0$, define $N_h(x^{(0)}) = \{x : 0 < |x - x^{(0)}| \le h\}$ to be a neighborhood of $x^{(0)}$, and let $n(x^{(0)}, h) = \sum_{i=1}^{n} I(X^{(i)} \in N_h(x^{(0)}))$ denote the number of data points in the neighborhood. For $c = 1, \ldots, C$, let $n_c^-(x^{(0)}, h)$ be the number of observations $(X^{(i)}, Y^{(i)})$ satisfying $x^{(0)} - h \le X^{(i)} < x^{(0)}$ and $Y^{(i)} = c$, and let $n_c^+(x^{(0)}, h)$ be the number of $(X^{(i)}, Y^{(i)})$ satisfying $x^{(0)} < X^{(i)} \le x^{(0)} + h$ and $Y^{(i)} = c$. Let $n_c(x^{(0)}, h) = n_c^+(x^{(0)}, h) + n_c^-(x^{(0)}, h)$, $n^-(x^{(0)}, h) = \sum_{c=1}^{C} n_c^-(x^{(0)}, h)$, and $n^+(x^{(0)}, h) = \sum_{c=1}^{C} n_c^+(x^{(0)}, h)$. Table 1 gives the resulting local contingency table.

Table 1: Local contingency table

                                Y = 1               Y = 2               ...   Y = C               Total
x^(0) - h <= X < x^(0)          n_1^-(x^(0), h)     n_2^-(x^(0), h)     ...   n_C^-(x^(0), h)     n^-(x^(0), h)
x^(0) < X <= x^(0) + h          n_1^+(x^(0), h)     n_2^+(x^(0), h)     ...   n_C^+(x^(0), h)     n^+(x^(0), h)
Total                           n_1(x^(0), h)       n_2(x^(0), h)       ...   n_C(x^(0), h)       n(x^(0), h)
To measure how strongly $Y$ relates to local restrictions on $X$, we consider the chi-square statistic

$$\mathcal{X}^2(x^{(0)}, h) = \sum_{c=1}^{C}\left(\frac{(n_c^-(x^{(0)}, h) - E_c^-(x^{(0)}, h))^2}{E_c^-(x^{(0)}, h)} + \frac{(n_c^+(x^{(0)}, h) - E_c^+(x^{(0)}, h))^2}{E_c^+(x^{(0)}, h)}\right), \tag{2.2}$$
where $0/0 \equiv 0$ and

$$E_c^-(x^{(0)}, h) = \frac{n^-(x^{(0)}, h)\, n_c(x^{(0)}, h)}{n(x^{(0)}, h)}, \qquad E_c^+(x^{(0)}, h) = \frac{n^+(x^{(0)}, h)\, n_c(x^{(0)}, h)}{n(x^{(0)}, h)}. \tag{2.3}$$
Due to the local restriction on $X$, the local contingency table might have some zero cells. However, this is not an issue for the $\chi^2$ statistic: it can detect local dependence between $Y$ and $X$ as long as there exist observations from at least two categories of $Y$ in the neighborhood of $x^{(0)}$. When all observations belong to the same category of $Y$, intuitively no dependence can be detected; in this case the $\chi^2$ statistic equals zero, which coincides with our intuition.
We call the neighborhood $N_h(x^{(0)})$ a section and maximize $\mathcal{X}^2(x^{(0)}, h)$ in (2.2) with respect to the section size $h$. Let plim denote the limit in probability. Our local measure of association is $\zeta(x^{(0)})$, as given in

Definition 1 (Local Contingency Efficacy, numerical predictor). The local contingency section efficacy and the Local Contingency Efficacy (LCE) of $Y$ on a numerical predictor $X$ at the point $x^{(0)}$ are

$$\zeta(x^{(0)}, h) = \operatorname*{plim}_{n\to\infty} n^{-1/2}\sqrt{\mathcal{X}^2(x^{(0)}, h)}, \qquad \zeta(x^{(0)}) = \sup_{h>0}\,\zeta(x^{(0)}, h). \tag{2.4}$$
If $p_c(x)$ is constant in $x$ for all $c \in \{1, \ldots, C\}$ for $x$ near $x^{(0)}$, then as $n \to \infty$, $\mathcal{X}^2(x^{(0)}, h)$ converges in distribution to a $\chi^2_{C-1}$ variable, and in this case $\zeta(x^{(0)}, h) = 0$. On the other hand, if $p_c(x)$ is not constant near $x^{(0)}$, then $\zeta(x^{(0)}, h)$ determines the asymptotic power for Pitman alternatives of the test based on $\mathcal{X}^2(x^{(0)}, h)$ for testing whether $X$ is locally independent of $Y$, and is called the efficacy of this test. Moreover, $\zeta^2(x^{(0)}, h)$ is the asymptotic noncentrality parameter of the statistic $n^{-1}\mathcal{X}^2(x^{(0)}, h)$.
The estimators of $\zeta(x^{(0)}, h)$ and $\zeta(x^{(0)})$ are

$$\hat\zeta(x^{(0)}, h) = n^{-1/2}\sqrt{\mathcal{X}^2(x^{(0)}, h)}, \qquad \hat\zeta(x^{(0)}) = \sup_{h>0}\,\hat\zeta(x^{(0)}, h). \tag{2.5}$$
We select the section size $h$ as

$$\hat h = \operatorname{argmax}\{\hat\zeta(x^{(0)}, h) : h \in \{h_1, \ldots, h_g\}\}$$

for some grid $h_1, \ldots, h_g$. This corresponds to selecting $h$ by maximizing estimated power. See Doksum and Schafer [5], Gao and Gijbels [7], Doksum et al. [3], and Schafer and Doksum [16].
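For concreteness, the local contingency table, the statistic (2.2), and the grid search over $h$ in (2.5) can be sketched as follows. This is a minimal sketch; the function names and the grid are our own choices, not part of the paper.

```python
import numpy as np

def local_chi_square(x, y, x0, h, n_classes):
    """Chi-square statistic (2.2) from the local 2 x C table at center x0:
    row 1 counts points with x0 - h <= X < x0, row 2 those with x0 < X <= x0 + h."""
    left = (x >= x0 - h) & (x < x0)
    right = (x > x0) & (x <= x0 + h)
    n_minus = np.array([np.sum(left & (y == c)) for c in range(n_classes)])
    n_plus = np.array([np.sum(right & (y == c)) for c in range(n_classes)])
    n_tot = n_minus.sum() + n_plus.sum()
    if n_tot == 0:
        return 0.0
    n_col = n_minus + n_plus                  # column totals n_c
    e_minus = n_minus.sum() * n_col / n_tot   # expected counts (2.3)
    e_plus = n_plus.sum() * n_col / n_tot
    stat = 0.0
    for nc, ec in zip(np.r_[n_minus, n_plus], np.r_[e_minus, e_plus]):
        if ec > 0:                            # convention 0/0 = 0
            stat += (nc - ec) ** 2 / ec
    return stat

def local_efficacy(x, y, x0, h_grid, n_classes):
    """Estimated LCE (2.5): maximize n^{-1/2} sqrt(X^2(x0, h)) over the h grid."""
    n = len(x)
    return max(np.sqrt(local_chi_square(x, y, x0, h, n_classes) / n)
               for h in h_grid)
```

With classes perfectly separated on either side of $x^{(0)}$, the local statistic equals the local sample size and the estimated efficacy approaches its maximum.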
2.2 Categorical Predictor: Contingency Efficacy
Let $(X^{(i)}, Y^{(i)})$, $i = 1, \ldots, n$, be as before, except that $X \in \{1, \ldots, C'\}$ is a categorical predictor. Let $n(c, c')$ be the number of observations satisfying $Y = c$ and $X = c'$, and let

$$\mathcal{X}^2 = \sum_{c=1}^{C}\sum_{c'=1}^{C'} \frac{(n(c, c') - E(c, c'))^2}{E(c, c')}, \tag{2.6}$$

where

$$E(c, c') = \frac{\left(\sum_{c=1}^{C} n(c, c')\right)\left(\sum_{c'=1}^{C'} n(c, c')\right)}{n}. \tag{2.7}$$
Definition 2 (Contingency Efficacy, categorical predictor). The Contingency Efficacy (CE) of $Y$ on a categorical predictor $X$ is

$$\zeta = \operatorname*{plim}_{n\to\infty} n^{-1/2}\sqrt{\mathcal{X}^2}. \tag{2.8}$$
Under the null hypothesis that $Y$ and $X$ are independent, $\mathcal{X}^2$ converges in distribution to a $\chi^2_{(C-1)(C'-1)}$ variable; in this case $\zeta = 0$. In general, $\zeta^2$ is the asymptotic noncentrality parameter of the statistic $n^{-1}\mathcal{X}^2$. The contingency efficacy $\zeta$ determines the asymptotic power of the test based on $\mathcal{X}^2$ for testing whether $Y$ and $X$ are independent. We use $\zeta$ as an importance score that measures how strongly $Y$ depends on $X$. The estimator of (2.8) is

$$\hat\zeta = n^{-1/2}\sqrt{\mathcal{X}^2}. \tag{2.9}$$
3 Variable Selection for Classification Problems: The CATCH Algorithm
Now we consider a multivariate $X = (X_1, \ldots, X_d)$, $d > 1$. To examine whether a predictor $X_l$ is related to $Y$ in the presence of all other variables, we calculate efficacies conditional on these variables, as well as unconditional (marginal) efficacies.
3.1 Numerical Predictors
To examine the effect of $X_l$ in a local region of $x^{(0)} = (x_1^{(0)}, \ldots, x_d^{(0)})$, we build a "tube" around the center point $x^{(0)}$ in which data points are within a certain radius $\delta$ of $x^{(0)}$ with respect to all variables except $X_l$. Define $X_{-l} = (X_1, \ldots, X_{l-1}, X_{l+1}, \ldots, X_d)^T$ to be the vector complementary to $X_l$, and let $D(\cdot, \cdot)$ be a distance on $\mathbb{R}^{d-1}$, which we will define later.

Definition 3. A tube $T_l$ of size $\delta$ ($\delta > 0$) for the variable $X_l$ at $x^{(0)}$ is the set

$$T_l \equiv T_l(x^{(0)}, \delta, D) \equiv \{x : D(x_{-l}, x^{(0)}_{-l}) \le \delta\}. \tag{3.1}$$
We adjust the size of the tube so that the number of points in the tube is a preassigned number $k$. We find that $k \ge 50$ is in general a good choice. Note that $k = n$ corresponds to unconditional (marginal) nonparametric regression.
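To pin the number of in-tube points at $k$, one can take $\delta$ to be the $k$-th smallest weighted distance to the center. A minimal sketch, using the unnormalized weighted distance of (3.9) with weights we supply (the function name is ours):

```python
import numpy as np

def tube_members(X, center, l, weights, k):
    """Indices of the k points nearest to `center` in the distance D_l,
    which ignores coordinate l, together with the tube size delta
    (the k-th smallest distance), so exactly k points fall in the tube."""
    w = np.asarray(weights, dtype=float).copy()
    w[l] = 0.0                               # X_l is unconstrained in the tube
    d = np.abs(X - np.asarray(center)) @ w   # D_l(x_{-l}, x0_{-l}) for every row
    order = np.argsort(d, kind="stable")
    return order[:k], d[order[k - 1]]
```

The tube is then the set of returned indices; coordinate $l$ of the returned points is free to vary, as required.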
Within the tube $T_l(x^{(0)}, \delta, D)$, $X_l$ is unconstrained. To measure the dependence of $Y$ on $X_l$ locally, we consider neighborhoods of $x^{(0)}$ along the direction of $X_l$ within the tube. Let

$$N_{l,h,\delta,D}(x^{(0)}) = \{x = (x_1, \ldots, x_d)^T \in \mathbb{R}^d : x \in T_l(x^{(0)}, \delta, D),\ |x_l - x_l^{(0)}| \le h\} \tag{3.2}$$

be a section of the tube $T_l(x^{(0)}, \delta, D)$. To measure how strongly $X_l$ relates to $Y$ in $N_{l,h,\delta,D}(x^{(0)})$, we consider

$$\mathcal{X}^2_{l,\delta,D}(x^{(0)}, h), \tag{3.3}$$

where (3.3) is defined by (2.2) using only the observations in the tube $T_l(x^{(0)}, \delta, D)$.
Analogous to the definitions of local contingency efficacy in (2.4), we define

Definition 4. The local contingency tube section efficacy and the local contingency tube efficacy of $Y$ on $X_l$ at the point $x^{(0)}$, and their estimates, are

$$\zeta(x^{(0)}, h, l) = \operatorname*{plim}_{n\to\infty} n^{-1/2}\sqrt{\mathcal{X}^2_{l,\delta,D}(x^{(0)}, h)}, \qquad \zeta_l(x^{(0)}) = \sup_{h>0}\,\zeta(x^{(0)}, h, l), \tag{3.4}$$

$$\hat\zeta(x^{(0)}, h, l) = n^{-1/2}\sqrt{\mathcal{X}^2_{l,\delta,D}(x^{(0)}, h)}, \qquad \hat\zeta_l(x^{(0)}) = \sup_{h>0}\,\hat\zeta(x^{(0)}, h, l). \tag{3.5}$$
For an illustration of local contingency tube efficacy, consider $Y \in \{1, 2\}$ and suppose $\operatorname{logit} P(Y = 2 \mid x) = (x - \tfrac{1}{2})^2$, $x \in [0, 1]$. Then the local efficacy with $h$ small will be large, while if $h \ge 1$ the efficacy will be zero.
3.2 Numerical and Categorical Predictors

If $X_l$ is categorical but some of the other predictors are numerical, we construct tubes $T_l(x^{(0)}, \delta, D)$ based on the numerical predictors, leaving the categorical variables unconstrained, and we use the statistic defined by (2.6), computed from only the observations in the tube $T_l(x^{(0)}, \delta, D)$, to measure the strength of the association between $Y$ and $X_l$. We denote this version of (2.6) by

$$\mathcal{X}^2_{l,\delta,D}(x^{(0)}). \tag{3.6}$$
Analogous to the definition of contingency efficacy given in (2.8), we define

Definition 5. The local contingency tube efficacy of $Y$ on $X_l$ at $x^{(0)}$ and its estimate are

$$\zeta_l(x^{(0)}) = \operatorname*{plim}_{n\to\infty} n^{-1/2}\sqrt{\mathcal{X}^2_{l,\delta,D}(x^{(0)})}, \qquad \hat\zeta_l(x^{(0)}) = n^{-1/2}\sqrt{\mathcal{X}^2_{l,\delta,D}(x^{(0)})}. \tag{3.7}$$
3.3 The Empirical Importance Scores
The empirical efficacies (3.5) and (3.7) measure how strongly $Y$ depends on $X_l$ near one point $x^{(0)}$. To measure the overall dependence of $Y$ on $X_l$, we randomly sample $M$ bootstrap points $x^*_a$, $a = 1, \ldots, M$, with replacement from $\{x^{(i)}, 1 \le i \le n\}$ and calculate the average of the local tube efficacy estimates at these points.

Definition 6. The CATCH empirical importance score for variable $l$ is

$$\hat s_l = M^{-1}\sum_{a=1}^{M} \hat\zeta_l(x^*_a), \tag{3.8}$$

where $\hat\zeta_l(x^*_a)$ is defined in (3.5) and (3.7) for numerical and categorical predictors, respectively.
3.4 Adaptive Tube Distances
Doksum et al. [3] used an example to demonstrate that the distance $D$ needs to be adaptive to keep variables strongly associated with $Y$ from obscuring the effects of weaker variables. Suppose $Y$ depends on three variables $X_1$, $X_2$, and $X_3$ among the larger set of predictors, with the strength of the dependency ranging from weak to strong, as estimated by the empirical importance scores. As we examine the effect of $X_2$ on $Y$, we want to define the tube so that there is relatively less variation in $X_3$ than in $X_1$ inside the tube, because $X_3$ is more likely than $X_1$ to obscure the effect of $X_2$. To this end we introduce a tube distance for the variable $X_l$ of the form

$$D_l(x_{-l}, x^{(0)}_{-l}) = \sum_{j=1}^{d} w_j\, |x_j - x_j^{(0)}|\, I(j \ne l), \tag{3.9}$$
where the weights wj adjust the contribution of strong variables to the distance so thatthe strong variables are more constrained
The weights $w_j$ are proportional to $(s_j - s'_j)_+$, which are determined iteratively and adaptively as follows. In the first iteration, set $s_j = s_j^{(1)} = 1$ in (3.9) and compute the importance score $s_j^{(2)}$ using (3.8). For the second iteration we use the weights $s_j = s_j^{(2)}$ and the distance

$$D_l(x_{-l}, x^{(0)}_{-l}) = \frac{\sum_{j=1}^{d} (s_j - s'_j)_+\, |x_j - x_j^{(0)}|\, I(j \ne l)}{\sum_{j=1}^{d} (s_j - s'_j)_+\, I(j \ne l)}, \tag{3.10}$$

where $0/0 \equiv 1$ and $s'_j$ is a threshold value for $s_j$ under the assumption of no conditional association of $X_j$ with $Y$, computed by a simple Monte Carlo technique. This adjustment is important because some predictors by definition have larger importance scores than others even when they are all independent of $Y$. For example, a categorical predictor
with more levels has a larger importance score than one with fewer levels. Therefore the unadjusted scores $s_j$ cannot accurately quantify the relative importance of the predictors in the tube distance calculation. Next we use (3.8) to produce $s_j = s_j^{(3)}$, use $s_j^{(3)}$ in the next iteration, and so on.
In the definition above we need to define $|x_j - x_j^{(0)}|$ appropriately for a categorical variable $X_j$. Because $X_j$ is categorical, we require that $|x_j - x_j^{(0)}|$ be constant whenever $x_j$ and $x_j^{(0)}$ take any pair of different categories. One approach is to define $|x_j - x_j^{(0)}| = \infty \cdot I(x_j \ne x_j^{(0)})$, in which case all points in the tube are given the same $X_j$ value as the tube center. This approach is a very sensible one, as it strictly implements the idea of conditioning on $X_j$ when evaluating the effect of another predictor $X_l$, so that $X_j$ does not contribute any variation in $Y$. The problem with this definition, though, is that when the number of categories of $X_j$ is relatively large compared to the sample size, there will not be enough points in the tube if we restrict $X_j$ to a single category. In such a case we do not restrict $X_j$, which means that $X_j$ can take any category in the tube; we define $|x_j - x_j^{(0)}| = k_0\, I(x_j \ne x_j^{(0)})$, where $k_0$ is a normalizing constant. The value of $k_0$ is chosen to achieve a balance between numerical and categorical variables. Suppose that $X_1$ is a numerical variable with standard deviation one and $X_2$ is a categorical variable. For $X_2$ we set $k_0 \equiv E(|X_1^{(1)} - X_1^{(2)}|)$, where $X_1^{(1)}$ and $X_1^{(2)}$ are two independent realizations of $X_1$. Note that $k_0 = 2/\sqrt{\pi}$ if $X_1$ follows a standard normal distribution. Alternatively, $k_0$ can be based on the empirical distributions of the numerical variables. Based on the preceding discussion of the choice of $k_0$, we use $k_0 = \infty$ in the simulations and $k_0 = 2/\sqrt{\pi}$ in the real data example.
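The constant $k_0$ can also be estimated from data as the mean absolute difference between two independent draws of a standardized numerical predictor. A small sketch; the $O(n \log n)$ sorted-data identity and the Monte Carlo check against the normal-theory value $2/\sqrt{\pi}$ are our additions:

```python
import numpy as np

def empirical_k0(x):
    """k0 = E|X^(1) - X^(2)| under the empirical distribution of x,
    computed via the sorted-data identity for the mean absolute
    pairwise difference (all n^2 ordered pairs, including i = i)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    i = np.arange(1, n + 1)
    return 2.0 * np.sum((2 * i - n - 1) * x) / n**2

rng = np.random.default_rng(0)
z = rng.standard_normal(100_000)
k0_hat = empirical_k0(z)   # should be close to 2 / sqrt(pi) ~ 1.1284
```

For two points the estimate is simply their absolute difference halved over the four ordered pairs, which matches the direct definition.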
Remark 3.1 (Choice of $k_0$). The value $k_0 = 2/\sqrt{\pi}$ is used unless the sample size is large enough in the sense explained below. Suppose $k_0 = \infty$; then $D_l(x_{-l}, x^{(0)}_{-l})$ is finite only for those observations with $x_j = x_j^{(0)}$. When there are multiple categorical variables, $D_l$ is finite only for the points, denoted by the set $\Omega^{(0)}$, that satisfy $x_j = x_j^{(0)}$ for all categorical $x_j$. When there are many categorical variables, or when the total sample size is small, the size of this set can be very small or close to zero, in which case the tube cannot be well defined; therefore $k_0 = 2/\sqrt{\pi}$ is used, as in the real data example. On the other hand, when the size of $\Omega^{(0)}$ is larger than the specified tube size, we can use $k_0 = \infty$, as in the simulation examples.
3.5 The CATCH Algorithm

The variable selection algorithm for a classification problem is based on importance scores and an iteration procedure.

The Algorithm

(0) Standardizing. Standardize all the numerical predictors to have sample mean 0 and sample standard deviation 1 using linear transformations.
(1) Initializing importance scores. Let $S = (s_1, \ldots, s_d)$ be importance scores for the predictors. Let $S' = (s'_1, \ldots, s'_d)$, where $s'_l$ is a threshold importance score corresponding to the model in which $X_l$ is independent of $Y$ given $X_{-l}$. Initialize $S$ to $(1, \ldots, 1)$ and $S'$ to $(0, \ldots, 0)$. Let $M$ be the number of tube centers and $m$ the number of observations within each tube.

(2) Tube hunting loop. For $l = 1, \ldots, d$, do (a), (b), (c), and (d) below.

(a) Selecting tube centers. For $0 < b \le 1$, set $M = [n^b]$, where $[\,\cdot\,]$ is the greatest integer function, and randomly sample $M$ bootstrap points $X^*_1, \ldots, X^*_M$ with replacement from the $n$ observations. Set $k = 1$ and do (b).

(b) Constructing tubes. For $0 < a < 1$, set $m = [n^a]$. Using the tube distance $D_l$ defined in (3.10), select the tube size $\delta$ so that there are exactly $m \ge 10$ observations inside the tube for $X_l$ at $X^*_k$.

(c) Calculating efficacy. If $X_l$ is a numerical variable, let $\hat\zeta_l(X^*_k)$ be the estimated local contingency tube efficacy of $Y$ on $X_l$ at $X^*_k$, as defined in (3.5). If $X_l$ is a categorical variable, let $\hat\zeta_l(X^*_k)$ be the estimated contingency tube efficacy of $Y$ on $X_l$ at $X^*_k$, as defined in (3.7).
(d) Thresholding. Let $Y^*$ denote a random variable that is identically distributed as $Y$ but independent of $X$. Let $(Y_0^{(1)}, \ldots, Y_0^{(n)})$ be a random permutation of $(Y^{(1)}, \ldots, Y^{(n)})$; then $Y_0 \equiv (Y_0^{(1)}, \ldots, Y_0^{(n)})$ can be viewed as a realization of $Y^*$. Let $\hat\zeta^*_l(X^*_k)$ be the estimated local contingency tube efficacy of $Y_0$ on $X_l$ at $X^*_k$, calculated as in (c) with $Y_0$ in place of $Y$.

If $k < M$, set $k = k + 1$ and go to (b); otherwise go to (e).
(e) Updating importance scores. Set

$$s_l^{\mathrm{new}} = \frac{1}{M}\sum_{k=1}^{M}\hat\zeta_l(X^*_k), \qquad s_l^{\prime\,\mathrm{new}} = \frac{1}{M}\sum_{k=1}^{M}\hat\zeta^*_l(X^*_k).$$

Let $S^{\mathrm{new}} = (s_1^{\mathrm{new}}, \ldots, s_d^{\mathrm{new}})$ and $S^{\prime\,\mathrm{new}} = (s_1^{\prime\,\mathrm{new}}, \ldots, s_d^{\prime\,\mathrm{new}})$ denote the updates of $S$ and $S'$.

(3) Iterations and stopping rule. Repeat (2) $I$ times. Stop when the change in $S^{\mathrm{new}}$ is small. Record the last $S$ and $S'$ as $S^{\mathrm{stop}}$ and $S^{\prime\,\mathrm{stop}}$, respectively.

(4) Deletion step. Using $S^{\mathrm{stop}}$ and $S^{\prime\,\mathrm{stop}}$, delete $X_l$ if $s_l^{\mathrm{stop}} - s_l^{\prime\,\mathrm{stop}} < \sqrt{2}\,\mathrm{SD}(S^{\prime\,\mathrm{stop}})$. If $d$ is small, we can generate several sets of $S^{\prime\,\mathrm{stop}}$ and then use the sample standard deviation of all the values in these sets as $\mathrm{SD}(S^{\prime\,\mathrm{stop}})$. See Remark 3.3.

End of the algorithm.
Remark 3.2. Step (d) needs to be replaced by (d') below when a global permutation of Y does not yield an appropriate "local null distribution". For example, when the sample sizes of the different classes are extremely unequal, known as the unbalanced classification problem, a local region is very likely to contain a pure class of Y after the global permutation, so the threshold $S'$ will be spuriously low and irrelevant variables will be falsely selected into the model. The approach in (d') is more nonparametric but more time consuming.
(d') Local thresholding. Let $Y^*$ denote a random variable that is distributed as $Y$ but independent of $X_l$ given $X_{-l}$. Let $(Y_0^{(1)}, \ldots, Y_0^{(m)})$ be a random permutation of the $Y$'s inside the tube; then $(Y_0^{(1)}, \ldots, Y_0^{(m)})$ can be viewed as a realization of $Y^*$. If $X_l$ is a numerical variable, let $\hat\zeta^*_l(X^*_k)$ be the estimated local contingency tube efficacy of $Y^*$ on $X_l$ at $X^*_k$, as defined in (3.5). If $X_l$ is a categorical variable, let $\hat\zeta^*_l(X^*_k)$ be the estimated contingency tube efficacy of $Y^*$ on $X_l$ at $X^*_k$, as defined in (3.7).
Remark 3.3 (The threshold). Under the assumption $H_0^{(l)}$ that $Y$ is independent of $X_l$ given $X_{-l}$ in $T_l$, $s_l$ and $s'_l$ are nearly independent and $\mathrm{SD}(s_l - s'_l) \approx \sqrt{2}\,\mathrm{SD}(s'_l)$. Thus the rule that deletes $X_l$ when $s_l^{\mathrm{stop}} - s_l^{\prime\,\mathrm{stop}} < \sqrt{2}\,\mathrm{SE}(s^{\prime\,\mathrm{stop}})$ is similar to the one-standard-deviation rule of CART and GUIDE. The choice of threshold is also discussed in Section 7. Alternatively, we can calculate a p-value for each covariate by a permutation test: we permute the response variable, obtain the importance scores for the permuted data, and then calculate how extreme $s_l^{\mathrm{stop}}$ is compared to the permuted scores. We then identify a covariate as important if its p-value is less than 0.1.
Remark 3.4 (Large d, small n). A key idea in this paper is that when determining the importance of the variable $X_j$, we restrict (condition on) the vector $X_{-j}$ to a relatively small set. This is done to address the problem of spurious correlation, which is of great concern in genomics (Price et al. [13], Lin and Zeng [11]). In such genomic studies the number of predictors $d$ typically greatly exceeds the sample size $n$. For numerical variables this can be dealt with by replacing $X_{-j}$ with a vector of principal components explaining most of the variation in $X_{-j}$. Although this paper does not deal with the $d \gg n$ case directly, this remark shows that the method is also applicable to the genome-wide association studies described in Price et al. [13].
Remark 3.5. We now consider the computational complexity of the algorithm. Recall the notation: $I$ is the number of repetitions, $M$ is the number of tube centers, $m$ is the tube size, $d$ is the dimension, and $n$ is the sample size. The number of operations needed for the algorithm is $I \times d \times M \times (2b + 2c + 2d) + 2e$, where $2b$, $2c$, $2d$, $2e$ denote the numbers of operations required for steps 2(b), 2(c), 2(d), and 2(e). The computational complexities of steps 2(b), 2(c), 2(d), and 2(e) are $O(nd)$, $O(m)$, $O(m)$, and $O(M)$, respectively. Since $nd \gg m$, the required computation is dominated by step 2(b), and the computational complexity of the algorithm is $O(IMnd^2)$. Solving a least squares problem requires $O(nd^2)$ operations, so our algorithm has higher computational complexity by a factor of $IM$ compared to least squares. However, the computations for the $M$ tube centers are independent of each other and can be performed in parallel.
4 Simulation Studies
4.1 Variable Selection with Categorical and Numerical Predictors
Example 4.1. Consider $d \ge 5$ predictors $X_1, X_2, \ldots, X_d$ and a categorical response variable $Y$. The fifth predictor $X_5$ is a Bernoulli random variable taking the values 0 and 1 with equal probability. All other predictors are uniformly distributed on $[0, 1]$.
When $X_5 = 0$, the dependent variable $Y$ is related to the predictors in the following fashion:

$$Y = \begin{cases} 0 & \text{if } X_1 - X_2 - 0.25\sin(16\pi X_2) \le -0.5, \\ 1 & \text{if } -0.5 < X_1 - X_2 - 0.25\sin(16\pi X_2) \le 0, \\ 2 & \text{if } 0 < X_1 - X_2 - 0.25\sin(16\pi X_2) \le 0.5, \\ 3 & \text{if } X_1 - X_2 - 0.25\sin(16\pi X_2) > 0.5. \end{cases} \tag{4.1}$$

When $X_5 = 1$, $Y$ depends on $X_3$ and $X_4$ in the same way as above, with $X_1$ replaced by $X_3$ and $X_2$ replaced by $X_4$. The other variables $X_6, \ldots, X_d$ are irrelevant noise variables. All $d$ predictors are independent.
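A generator for model (4.1) can be sketched as follows; the function name and the vectorized form are ours, and boundary ties (which have probability zero for continuous predictors) are broken on the open side:

```python
import numpy as np

def generate_example_41(n, d, rng=None):
    """Draw (X, Y) from model (4.1): X5 ~ Bernoulli(1/2) switches which pair
    of uniform predictors, (X1, X2) or (X3, X4), drives the 4-class response."""
    rng = np.random.default_rng(rng)
    X = rng.uniform(size=(n, d))
    X[:, 4] = rng.integers(0, 2, size=n)          # X5 in {0, 1}
    a = np.where(X[:, 4] == 0, X[:, 0], X[:, 2])  # X1 if X5 = 0, else X3
    b = np.where(X[:, 4] == 0, X[:, 1], X[:, 3])  # X2 if X5 = 0, else X4
    t = a - b - 0.25 * np.sin(16 * np.pi * b)
    y = np.digitize(t, bins=[-0.5, 0.0, 0.5])     # classes 0, 1, 2, 3
    return X, y
```

The remaining columns $X_6, \ldots, X_d$ are the independent noise predictors.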
In addition to the presence of the noise variables $X_6, \ldots, X_d$, this example is challenging in two respects. First, there is an interaction effect between the continuous predictors $X_1, \ldots, X_4$ and the categorical predictor $X_5$. Such an interaction effect may arise in practice: for example, suppose $Y$ represents the effect resulting from multiple treatments $X_1, \ldots, X_4$ and $X_5$ represents the gender of a patient; our model allows males and females to respond differently to treatments. Second, the classification boundaries for fixed $X_5$ are highly nonlinear, as shown in Figure 1. This design demonstrates that our method can detect nonlinear dependence between the response and the predictors.
Table 2 shows, in columns 1-5, the number of simulations in which the predictors $X_1, \ldots, X_5$ are correctly kept in the model and, in column 6, the average number of simulations in which the $d - 5$ irrelevant predictors are falsely selected into the model, out of 100 simulation trials from model (4.1). In the simulation, $n = 400$ and $d = 10$, 50, and 100 are used. We compare CATCH with other methods, including Marginal CATCH, the Random Forest classifier (RFC) (Breiman [1]), and GUIDE (Loh [12]). For Marginal CATCH we test the association between $Y$ and $X_l$ without conditioning on $X_{-l}$. RFC is well known as a powerful tool for prediction; here we use the importance scores from RFC for variable selection. We create an additional 30 noise variables and use their average importance score, multiplied by a factor of 2, as a threshold (Doksum et al. [3]) to decide whether a variable should be selected into the model. In particular, we generate the noise variables using uniform and Bernoulli distributions for numerical and categorical predictors, respectively.
The main goal of GUIDE is to solve the variable selection bias problem in regression and classification tree methods. As an illustration, if a categorical predictor takes $K$ distinct values, then the splitting test exhausts $2^{K-1} - 1$ possibilities and therefore is
Table 2: Simulation results from Example 4.1 with sample size n = 400. For 1 <= i <= 5, the ith column gives the estimated probability of selection of the relevant variable X_i and its standard error, based on 100 simulations. Column 6 gives the mean of the estimated probabilities of selection and the standard error for the irrelevant variables X_6, ..., X_d.

Number of Selections (100 simulations)

METHOD           d     X1          X2          X3          X4          X5          Xi, i >= 6
CATCH            10    0.82(0.04)  0.75(0.04)  0.78(0.04)  0.82(0.04)  1(0)        0.05(0.02)
                 50    0.76(0.04)  0.76(0.04)  0.87(0.03)  0.84(0.04)  0.96(0.02)  0.08(0.03)
                 100   0.77(0.04)  0.73(0.04)  0.82(0.04)  0.8(0.04)   0.83(0.04)  0.09(0.03)
Marginal CATCH   10    0.77(0.04)  0.67(0.05)  0.7(0.05)   0.8(0.04)   0.06(0.02)  0.1(0.03)
                 50    0.69(0.05)  0.7(0.05)   0.76(0.04)  0.79(0.04)  0.07(0.03)  0.1(0.03)
                 100   0.78(0.04)  0.76(0.04)  0.73(0.04)  0.73(0.04)  0.12(0.03)  0.1(0.03)
Random Forest    10    0.71(0.05)  0.51(0.05)  0.47(0.05)  0.59(0.05)  0.37(0.05)  0.06(0.02)
                 50    0.64(0.05)  0.61(0.05)  0.65(0.05)  0.58(0.05)  0.28(0.04)  0.08(0.03)
                 100   0.66(0.05)  0.58(0.05)  0.59(0.05)  0.65(0.05)  0.17(0.04)  0.11(0.03)
GUIDE            10    0.96(0.02)  0.98(0.01)  0.93(0.03)  0.99(0.01)  0.92(0.03)  0.33(0.05)
                 50    0.82(0.04)  0.85(0.04)  0.9(0.03)   0.89(0.03)  0.67(0.05)  0.12(0.03)
                 100   0.83(0.04)  0.84(0.04)  0.86(0.03)  0.83(0.04)  0.61(0.05)  0.08(0.03)
Figure 1: The relationship between Y and (X_1, X_2) for model (4.1). The four areas represent the four values of Y.
biased toward being selected. The way that GUIDE alleviates this bias is to employ lack-of-fit tests, i.e., $\chi^2$-tests applicable to both numerical and categorical predictors, with the p-values adjusted by a bootstrap procedure. Since GUIDE has negligible selection bias, it can include tests for local pairwise interactions. Furthermore, it achieves sensitivity to local curvature by using linear models at each node.
The results in Table 2 show that Marginal CATCH does a very poor job of selecting $X_5$, which illustrates the importance of local conditioning. RFC does better than Marginal CATCH but also has difficulties, while CATCH and GUIDE successfully identify $X_5$ along with $X_1$ to $X_4$ with high frequency. When $d = 10$, GUIDE does very well for the relevant variables but selects too many irrelevant variables. Surprisingly, GUIDE selects fewer irrelevant variables as $d$ increases: it becomes more cautious as the dimension $d$ grows. CATCH performs very well in the high dimensional cases ($d = 50$ and $d = 100$), which indicates that the CATCH algorithm is robust to the intricate interaction structures between the predictors.
In model (4.1) above the predictors interact in a hierarchical way. When $X_5 = 0$, the response is related to $X_1$ and $X_2$ as shown in Figure 1, where the two axes are the two relevant predictors and the four areas partitioned by the curves correspond to the different values of $Y$. When $X_5 = 1$, the response depends on $X_3$ and $X_4$ in the same way. This type of hierarchical interaction between categorical and numerical predictors is difficult to detect even by state-of-the-art classification methods such as the support vector machine (SVM) and RFC. In particular, RFC produces an importance score vector that identifies the four numerical variables $X_1, \ldots, X_4$ as very significant but often leaves the categorical variable $X_5$ unidentified, as we see in Table 2.
The CATCH algorithm, however, performs very well for this complicated model. We now explain why CATCH is able to identify $X_5$ as an important variable. To explore the association between $X_5$ and $Y$, we construct $M$ contingency tables with 2 rows and 4 columns based on $M$ randomly centered tubes, calculate contingency efficacy scores, and then average the scores. Figure 2(a) and (b) illustrate how one of the contingency efficacy scores is calculated. Here 1000 data points are generated from model (4.1) and split into two sets according to $x_5$: panel (a) shows the approximately half of the points with $x_5 = 0$ in the $(x_1, x_2)$ dimensions, and panel (b) shows the remaining points, with $x_5 = 1$, in the $(x_3, x_4)$ dimensions. The data points are colored according to $Y$; this representation lets us clearly see the classification boundaries (grey lines). The randomly selected tube center, which has $x_5 = 0$, is shown as a star in (a). The data points within the tube, that is, those close to the tube center in terms of $X_{-5}$, are highlighted by larger dots. Since the distance defining the tube does not involve $X_5$, the larger dots have $x_5$ equal to 0 or 1 and are present in both (a) and (b). The counts of the larger dots of different colors in (a) and (b) correspond to the cell counts of the two rows $x_5 = 0$ and $x_5 = 1$ of the contingency table. As we can see, the larger dots in (a) are mostly red and the larger dots in (b) are mostly blue and orange, which shows that the distribution of $Y$ depends on the value of $X_5$, so the contingency efficacy score is high. The contingency efficacy score would be low only if the tube center had similar values in $(x_1, x_2)$ and $(x_3, x_4)$; such tube centers are rare among the $M$ randomly chosen tube centers. Thus the average contingency efficacy score is high and $X_5$ is identified as an important variable.
To contrast this example with a simpler data-generating model, suppose the effects of the predictors are additive, that is, the categorical variable enters the response as a term added to the effect of the numerical predictors. For this type of simpler model, the RFC and GUIDE algorithms can efficiently identify relevant predictors, but so can the CATCH algorithm. We do not include the simulation results for additive models here.
Remark 4.1. We observe that the CATCH algorithm is not very sensitive to the choice of the number of tube centers $M$ or the tube size $m = [n^a]$, where $0 < a \le 1$ and $[\,\cdot\,]$ is the greatest integer function. For Example 4.1, we performed the simulation using $M = n$ and $M = n/2$ and $a = 0.1, 0.15, 0.2, 0.3, 0.5$. The results are summarized in Table 3. The CATCH algorithm performs similarly for $M = n$ and $M = n/2$, with $M = n$ having a small winning edge. We generally suggest using $M = n$ if computational resources are of less concern or $n$ is not large ($n \le 200$). When computational cost is a concern (Remark 3.5), one can reduce the number of tube centers to $M = n/2$ as a compromise between detection power and computational cost. Table 3 also shows that the tube size works well in the range between 0.1 and 0.3, while the detection power for $X_5$ decreases in the case $a = 0.5$, where the tube is too big to achieve local conditioning; this observation is consistent with the comparison to Marginal CATCH in Table 2. On the other hand, the tube size should not be so small that detection power is lost. We suggest using $a = 0.1$ or 0.15 for sample sizes $n \ge 400$, or $m = 50$ for smaller sample sizes. In light of these guidelines, we use $M = n/2 = 200$ and $a = 0.15$ for Examples 4.1-4.3, and $M = n = 200$ and $m = 50$ for Example 4.4.
Table 3: Sensitivity study of CATCH for Example 4.1 (n = 400, d = 50) using different numbers of tube centers M and tube sizes m = [n^a]. For 1 <= i <= 5, the ith column gives the number of times X_i was selected out of 100 simulations. Column 6 gives the percentage of times irrelevant variables were selected.

Number of Selections (100 simulations)

          a     X1  X2  X3  X4  X5  Xi, i >= 6
M = 200   0.10  75  67  77  69  92  7.8
          0.15  76  76  87  84  96  8.2
          0.20  85  84  90  87  93  9.9
          0.30  89  87  95  89  85  9.6
          0.50  89  90  96  93  62  9.9
M = 400   0.10  74  76  82  77  92  7.2
          0.15  85  82  89  87  94  8.6
          0.20  87  88  90  86  96  9.1
          0.30  88  89  91  91  92  9.4
          0.50  89  91  94  93  61  10.2
In Example 4.1 all predictors are independent. In a similar setting, we next consider the case where the predictors are dependent.
Example 4.2. We generate standard normal random variables Z_i, i = 1, ..., d, in such a way that cor(Z_1, Z_4) = 0.5, cor(Z_3, Z_6) = 0.9, and the other Z_i's are independent. We set X_i = Φ(Z_i), i = 1, ..., d, where Φ(·) is the CDF of a standard normal, so the X_i are marginally uniformly distributed as in Example 4.1. The correlation structure of the X_i's is similar to that of the Z_i's. The dependent variable Y is generated in the same way as in Example 4.1.
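The construction above is essentially a Gaussian copula. A minimal pure-Python sketch of one way to simulate such predictors (the function names and the choice of d = 8 are ours, not from the paper):

```python
import math
import random

def std_normal_cdf(z):
    # Phi(z) via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gen_predictors(d, rng):
    # One draw of (X_1, ..., X_d) as in Example 4.2: cor(Z_1, Z_4) = 0.5,
    # cor(Z_3, Z_6) = 0.9, the other Z_i's independent; X_i = Phi(Z_i)
    # is then Uniform(0, 1) marginally.
    z = [rng.gauss(0.0, 1.0) for _ in range(d)]
    z[3] = 0.5 * z[0] + math.sqrt(1.0 - 0.5 ** 2) * rng.gauss(0.0, 1.0)  # Z_4
    z[5] = 0.9 * z[2] + math.sqrt(1.0 - 0.9 ** 2) * rng.gauss(0.0, 1.0)  # Z_6
    return [std_normal_cdf(zi) for zi in z]

def pearson(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    return cov / math.sqrt(sum((a - mu) ** 2 for a in u) *
                           sum((b - mv) ** 2 for b in v))

rng = random.Random(0)
sample = [gen_predictors(8, rng) for _ in range(4000)]
corr14 = pearson([s[0] for s in sample], [s[3] for s in sample])
corr36 = pearson([s[2] for s in sample], [s[5] for s in sample])
mean1 = sum(s[0] for s in sample) / len(sample)
```

As the text notes, the correlation structure of the X_i's is only similar to that of the Z_i's: for bivariate normals the transformed uniforms have correlation (6/pi) arcsin(rho/2), slightly below rho.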
The results of Example 4.2 are given in Table 4. Comparison with the results in Table 2 shows that both CATCH and GUIDE do well in the presence of dependence between the numerical variables. The explanation is two-fold. First, both methods identify the categorical X5 as an important predictor most of the time, which makes the identification of the other relevant predictors possible. Second, conditioning on X5, Y depends on a pair of variables ("X1 and X2" or "X3 and X4") jointly, and in particular this dependence is nonlinear, which makes both methods less affected by the correlation introduced here.
Figure 2: Sample scatter plots of (x1, x2) and (x3, x4) for data simulated from model (4.1). The left panel, (a) X5 = 0, includes all the pairs (x1, x2) with x5 = 0, while the right panel, (b) X5 = 1, includes all the pairs (x3, x4) with x5 = 1. The larger dots in the two plots are within the tube around a tube center, shown as the star.
We now explain the latter point in detail for both methods. In CATCH, the algorithm adaptively weighs the effects of predictors by their importance scores when defining the tubes they are conditioned on; therefore identifying either variable of a relevant pair helps to identify the other, whereas an irrelevant variable, though strongly associated with one of the relevant variables, is down-weighted iteratively and does not strongly affect the selection of the relevant variable. This explains the slightly decreasing selection probability of X3 in the presence of a strongly correlated X6 and, consequently, the overall less frequent selection of X3 and X4 compared to X1 and X2, given the correlation between X1 and X4. In GUIDE the difference between the dependent and independent cases is even smaller, because GUIDE is very powerful in detecting local interactions by explicitly including pairwise interactions when selecting splitting variables.
Marginal CATCH and Random Forest do poorly in this case, mainly because X5 cannot be detected, as we explained earlier. More specifically, the dependence between Y and X4 in the X5 = 1 case is the same as that between Y and X2 in the X5 = 0 case. The linear terms in X1 and X2 have opposite signs in the decision function; therefore the marginal effects of X1 and X4 are obscured when X5 is not identified as an important variable.
4.2 Improving the Performance of SVM and RF Classifiers
The criterion used in the CATCH algorithm is not based on classification accuracy. But if we use CATCH to screen out the irrelevant variables and only use the important
Table 4: Simulation results from Example 4.2 with sample size n = 400. For 1 <= i <= 6, the ith column gives the estimated probability of selection of variable X_i and the standard error based on 100 simulations, where X_i, i = 1, ..., 5, are relevant variables with cor(X1, X4) = 0.5; X6 is an irrelevant variable that is correlated with X3 with correlation 0.9. Column 7 gives the mean of the estimated probability of selection and that of the standard error for the irrelevant variables X7, ..., Xd.

                        Number of Selections (100 simulations)
METHOD           d    X1          X2          X3          X4          X5          X6          Xi, i >= 7
CATCH            10   0.76(0.04)  0.77(0.04)  0.73(0.04)  0.68(0.05)  0.99(0.01)  0.34(0.05)  0.07(0.02)
                 50   0.78(0.04)  0.77(0.04)  0.74(0.04)  0.63(0.05)  0.92(0.03)  0.57(0.05)  0.09(0.03)
                 100  0.76(0.04)  0.77(0.04)  0.80(0.04)  0.74(0.04)  0.82(0.04)  0.55(0.05)  0.09(0.03)
Marginal CATCH   10   0.35(0.05)  0.63(0.05)  0.80(0.04)  0.32(0.05)  0.09(0.03)  0.71(0.05)  0.12(0.03)
                 50   0.34(0.05)  0.71(0.05)  0.70(0.05)  0.30(0.05)  0.10(0.03)  0.60(0.05)  0.11(0.03)
                 100  0.34(0.05)  0.68(0.05)  0.75(0.04)  0.30(0.05)  0.10(0.03)  0.59(0.05)  0.10(0.03)
Random Forest    10   0.29(0.05)  0.48(0.05)  0.55(0.05)  0.20(0.04)  0.30(0.05)  0.47(0.05)  0.05(0.02)
                 50   0.26(0.04)  0.52(0.05)  0.57(0.05)  0.30(0.05)  0.24(0.04)  0.45(0.05)  0.08(0.03)
                 100  0.36(0.05)  0.61(0.05)  0.66(0.05)  0.37(0.05)  0.32(0.05)  0.46(0.05)  0.12(0.03)
GUIDE            10   0.96(0.02)  0.96(0.02)  0.94(0.02)  0.95(0.02)  0.96(0.02)  0.84(0.04)  0.30(0.05)
                 50   0.80(0.04)  0.92(0.03)  0.88(0.03)  0.79(0.04)  0.72(0.04)  0.68(0.05)  0.12(0.03)
                 100  0.81(0.04)  0.89(0.03)  0.90(0.03)  0.76(0.04)  0.75(0.04)  0.58(0.05)  0.08(0.03)
variables for classification, the performance of other algorithms, such as the Support Vector Machine (SVM) and the Random Forest Classifier (RFC), can be significantly improved. In this section we compare the prediction performance of (1) SVM, (2) RFC, (3) CATCH followed by SVM, and (4) CATCH followed by RFC. We label (3) and (4) CATCH+SVM and CATCH+RFC, respectively. We use the "svm" function in the "e1071" package in R with radial kernel to perform the SVM analysis, and the "randomForest" function in the "randomForest" package to perform the RFC analysis. We use simulation experiments to illustrate the improvement obtained by using CATCH prior to classifying Y.
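To fix ideas, here is a minimal pure-Python sketch of the "screen, then classify" pipeline: a crude marginal root-chi-square score (a stand-in for the CATCH score s_j, without its conditional tubes, and keeping the top-k features rather than thresholding), followed by a 1-nearest-neighbour classifier on the retained coordinates. All names and the toy data below are ours, not the paper's R code:

```python
import math
import random

def chi2_feature_score(xcol, y):
    # Median-split-by-class root chi-square: a crude marginal
    # association score (a stand-in for the CATCH score s_j).
    med = sorted(xcol)[len(xcol) // 2]
    classes = sorted(set(y))
    n = len(y)
    lo = [sum(1 for x, c in zip(xcol, y) if x <= med and c == k) for k in classes]
    hi = [sum(1 for x, c in zip(xcol, y) if x > med and c == k) for k in classes]
    nlo, nhi = sum(lo), sum(hi)
    stat = 0.0
    for a, b in zip(lo, hi):
        for obs, tot in ((a, nlo), (b, nhi)):
            e = tot * (a + b) / n           # expected count under independence
            if e > 0:
                stat += (obs - e) ** 2 / e
    return math.sqrt(stat / n)              # root-chi-square scaling, as in the paper

def screen_then_1nn(X, y, x0, keep=2):
    # Screen features by marginal score, then 1-NN on the kept coordinates.
    d = len(X[0])
    scores = [chi2_feature_score([row[j] for row in X], y) for j in range(d)]
    kept = sorted(range(d), key=lambda j: -scores[j])[:keep]
    def dist(row):
        return sum((row[j] - x0[j]) ** 2 for j in kept)
    return y[min(range(len(X)), key=lambda i: dist(X[i]))]

rng = random.Random(1)
X, y = [], []
for i in range(200):                        # toy data: only feature 0 matters
    c = i % 2
    X.append([c + 0.1 * rng.random(), rng.random(), rng.random()])
    y.append(c)
scores = [chi2_feature_score([row[j] for row in X], y) for j in range(3)]
pred = screen_then_1nn(X, y, [0.95, 0.5, 0.5], keep=1)
```

Any classifier can replace the 1-NN step; the point is only that classification is run on the screened coordinates.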
Let the true model be

(X, Y) ~ P,     (4.2)
where P will be specified later. In each of 100 Monte Carlo experiments, n pairs (X^(i), Y^(i)), i = 1, ..., n, are generated from the true model (4.2). Based on the simulated data (X^(i), Y^(i)), i = 1, ..., n, the methods (1) SVM, (2) RFC, (3) CATCH+SVM, and (4) CATCH+RFC are used to classify a future Y^(0) for which we have available a corresponding X^(0).
The integrated misclassification rate (IMR; Friedman [6]), defined below, is used to evaluate the performance of the classification algorithms. Let F_{X,Y}(·) be the classifier constructed by a classification algorithm based on X = (X^(1), ..., X^(n)) and Y = (Y^(1), ..., Y^(n))^T.

Let (X^(0), Y^(0)) be a future observation with the same distribution as (X^(i), Y^(i)). We only know X^(0) = x^(0) and want to classify Y^(0). For a possible future observation (x^(0), y^(0)), define

MR(X, Y, x^(0), y^(0)) = P(F_{X,Y}(x^(0)) != y^(0) | X, Y).     (4.3)

Our criterion is

IMR = E E_0 [MR(X, Y, x^(0), y^(0))],     (4.4)

where E_0 is with respect to the distribution of the random (x^(0), y^(0)), and E is with respect to the distribution of (X, Y).
This double expectation is evaluated using Monte Carlo simulation, and the result is denoted IMR*. More precisely, E_0 is estimated using 5000 simulated (x^(0), y^(0)) observations; IMR* is then computed on the basis of 100 simulated samples (X_i, Y_i), i = 1, ..., 100.
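The Monte Carlo evaluation just described can be sketched as follows: an outer loop over simulated training samples (estimating E) and an inner average over fresh (x^(0), y^(0)) draws (estimating E_0). The toy data generator and the toy nearest-neighbour learner below are our own placeholders for model (4.2) and for SVM/RFC; smaller loop sizes than the paper's 100 x 5000 keep the sketch fast:

```python
import random

def draw_pair(rng):
    # Toy (X, Y): a hypothetical stand-in for model (4.2).
    x = rng.random()
    y = 1 if rng.random() < (0.2 + 0.6 * x) else 0
    return x, y

def fit_classifier(train):
    # Placeholder learner: majority vote among the 25 training x's
    # nearest to x0 (any classifier can be plugged in here).
    def classify(x0):
        near = sorted(train, key=lambda p: abs(p[0] - x0))[:25]
        return 1 if sum(y for _, y in near) * 2 > len(near) else 0
    return classify

def imr_star(n_train=200, n_outer=20, n_inner=500, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_outer):                    # outer E over (X, Y) samples
        train = [draw_pair(rng) for _ in range(n_train)]
        f = fit_classifier(train)
        errs = sum(f(x0) != y0                  # inner E_0 over future (x0, y0)
                   for x0, y0 in (draw_pair(rng) for _ in range(n_inner)))
        total += errs / n_inner
    return total / n_outer

rate = imr_star()
```

For this toy model the Bayes error is 0.35, so the estimate should land somewhat above that.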
Example 4.3. First consider model (4.1) in Section 4.1. Figure 3 shows IMR* versus the number of irrelevant variables for this example. The left panel shows that the IMR*'s of the SVM approach increase dramatically as the number of irrelevant variables increases, while the IMR*'s of CATCH+SVM are stable. Thus CATCH stabilizes the
performance of SVM for this complex model. The right panel shows that CATCH also stabilizes the performance of RFC. These results indicate that CATCH can improve the prediction accuracy of other classification algorithms.
Figure 3: IMR*'s of SVM, RFC, CATCH+SVM, and CATCH+RFC versus the number of predictors for the simulation study of model (4.1) in Example 4.1. The sample size is n = 400.
Example 4.4. Contaminated Logistic Model. Consider a binary response Y in {0, 1} with

P(Y = 0 | X = x) = F(x) if r = 0,  G(x) if r = 1,     (4.5)

where x = (x1, ..., xd),

r ~ Bernoulli(γ),     (4.6)

F(x) = (1 + exp{-(0.25 + 0.5 x1 + x2)})^(-1),     (4.7)

G(x) = 0.1 if 0.25 + 0.5 x1 + x2 < 0,  0.9 if 0.25 + 0.5 x1 + x2 >= 0,     (4.8)

and x1, x2, ..., xd are realizations of independent standard normal random variables. Hence, given X^(i) = x^(i), 1 <= i <= n, the joint distribution of Y^(1), Y^(2), ..., Y^(n) is the contaminated logistic model with contamination parameter γ:
(1 - γ) prod_{i=1}^n [F(x^(i))]^{Y^(i)} [1 - F(x^(i))]^{1 - Y^(i)} + γ prod_{i=1}^n [G(x^(i))]^{Y^(i)} [1 - G(x^(i))]^{1 - Y^(i)}.     (4.9)
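A sketch of simulating one data set from this model (pure Python; note that, per the likelihood (4.9), a single contamination indicator r applies to the entire sample):

```python
import math
import random

def draw_contaminated(n, d, gamma, seed=0):
    # Simulate (X, Y) from (4.5)-(4.8): with probability 1 - gamma the
    # logistic model F governs P(Y = 0 | x), with probability gamma the
    # contaminating model G does. One r per data set, as in (4.9).
    rng = random.Random(seed)
    r = 1 if rng.random() < gamma else 0
    data = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        s = 0.25 + 0.5 * x[0] + x[1]
        if r == 0:
            p0 = 1.0 / (1.0 + math.exp(-s))     # F(x) = P(Y = 0 | x)
        else:
            p0 = 0.1 if s < 0 else 0.9          # G(x)
        y = 0 if rng.random() < p0 else 1
        data.append((x, y))
    return data

clean = draw_contaminated(2000, 5, gamma=0.0, seed=1)   # gamma = 0: pure logistic
big_s = [(x, y) for x, y in clean if 0.25 + 0.5 * x[0] + x[1] > 1.0]
frac_y0 = sum(1 for _, y in big_s if y == 0) / len(big_s)
```

In the uncontaminated case, observations with a large linear score s have P(Y = 0 | x) = F(x) close to 1, which the fraction computed above reflects.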
We perform a simulation study for n = 200, d = 50, and γ = 0, 0.2, 0.4, 0.6, 0.8, and 1, where γ = 0 corresponds to no contamination. Figure 4 shows the IMR*'s versus the different values of γ for this example. CATCH improves the performance of the SVM and Random Forest classifiers for this contaminated logistic regression model.
Figure 4: IMR*'s of SVM, RFC, CATCH+SVM, and CATCH+RFC versus different contamination parameters γ for simulation model (4.9) in Example 4.4.
5 CATCH Analysis of the ANN Data

In this section CATCH is applied to a data set available at the UCI data base (http://ftp.ics.uci.edu/pub/machine-learning-databases/thyroid-disease). The data is from a clinical trial to determine whether a patient is hypothyroid (see, e.g., Quinlan [15]). The dataset is split into two parts: a training set with 3772 observations and a testing set with 3428 observations. The response is a categorical variable with three classes: normal (not hypothyroid), hyperfunction, and subnormal functioning. There are 21 predictors, including 15 categorical predictors and 6 numerical predictors. The three classes are very imbalanced, with 92.5% normal patients; a good classifier should produce a significant decrease from the 7.5% error rate that can be obtained by classifying all observations to the normal class.
CATCH identifies 7 variables as important, including 4 continuous variables and 3 categorical variables. We consider three classification methods: 1. Support Vector Machine with radial kernel, denoted SVM(r); 2. Support Vector Machine with polynomial kernel, denoted SVM(p); 3. RFC. We apply these three methods to the training set to build classification models and use the test set to estimate the misclassification rate of each method. We apply the methods with CATCH and without CATCH, respectively, where "with CATCH" means that only the 7 important variables identified by CATCH are used to build the model, and "without CATCH" means that all 21 variables are used in the analysis.
Table 5: Misclassification rates of three classification methods (SVM(r), SVM(p), RFC) with CATCH and without CATCH.

              SVM(r)   SVM(p)   RFC
w/o CATCH     0.052    0.063    0.024
with CATCH    0.037    0.055    0.015
Table 5 shows that if CATCH is used to select important variables before the classification methods are applied, the misclassification rates decrease. CATCH reduces the misclassification rates by 29%, 13%, and 37% for SVM(r), SVM(p), and RFC, respectively. CATCH + RFC has the best performance. Although CATCH is not designed for improving classification accuracy, it can nonetheless help other classification methods to perform better.
6 Properties of Local Contingency Efficacy

In this section we investigate the properties of local contingency efficacy. For a univariate continuous predictor, Theorem 6.1 below shows that local contingency efficacy is well defined and equals 0 if the response variable is independent of the predictor. Consider an R x C contingency table. Let n_rc, r = 1, ..., R, c = 1, ..., C, be the observed frequency in the (r, c)-cell, and let p_rc be the probability that an observation belongs to the (r, c)-cell. Let

p_r = sum_{c=1}^C p_rc,   q_c = sum_{r=1}^R p_rc,     (6.1)

and

n_{r.} = sum_{c=1}^C n_rc,   n_{.c} = sum_{r=1}^R n_rc,   n = sum_{r=1}^R sum_{c=1}^C n_rc,   E_rc = n_{r.} n_{.c} / n.     (6.2)
To test that the rows and the columns are independent, i.e.,

H0: for all r, c, p_rc = p_r q_c   versus   H1: there exist r, c with p_rc != p_r q_c,     (6.3)

a standard test statistic is

X^2 = sum_{r=1}^R sum_{c=1}^C (n_rc - E_rc)^2 / E_rc.     (6.4)

Under H0, X^2 converges in distribution to χ^2_{(R-1)(C-1)}. Moreover, defining 0/0 = 0, we have the well-known result:
Lemma 6.1. As n tends to infinity, X^2/n tends in probability to

τ^2 ≡ sum_{r=1}^R sum_{c=1}^C (p_rc - p_r q_c)^2 / (p_r q_c).

Moreover, if p_r q_c is bounded away from 0 and 1 as n -> infinity, and if R and C are fixed as n -> infinity, then τ^2 = 0 if and only if H0 is true.
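A quick numeric illustration of Lemma 6.1 (our own toy example, assuming a dependent 2 x 2 table): X^2/n computed from simulated counts approaches τ^2.

```python
import random

def chi2_stat(counts):
    # Pearson X^2 for an R x C table of counts, as in (6.4)
    R, C = len(counts), len(counts[0])
    row = [sum(counts[r]) for r in range(R)]
    col = [sum(counts[r][c] for r in range(R)) for c in range(C)]
    n = float(sum(row))
    return sum((counts[r][c] - row[r] * col[c] / n) ** 2 / (row[r] * col[c] / n)
               for r in range(R) for c in range(C))

# a dependent 2 x 2 cell-probability matrix p_rc and its tau^2
p = [[0.40, 0.10],
     [0.10, 0.40]]
pr = [sum(r) for r in p]                       # row margins p_r
qc = [p[0][c] + p[1][c] for c in range(2)]     # column margins q_c
tau2 = sum((p[r][c] - pr[r] * qc[c]) ** 2 / (pr[r] * qc[c])
           for r in range(2) for c in range(2))          # = 0.36 here

rng = random.Random(0)
counts = [[0, 0], [0, 0]]
for _ in range(20000):                         # sample cells by inverse cdf
    u = rng.random()
    if u < 0.40:
        counts[0][0] += 1
    elif u < 0.50:
        counts[0][1] += 1
    elif u < 0.60:
        counts[1][0] += 1
    else:
        counts[1][1] += 1
ratio = chi2_stat(counts) / 20000.0
```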
Remark 6.1. Lemma 6.1 shows that the contingency efficacy ζ in (2.8) is well defined. Moreover, ζ = 0 if and only if X and Y are independent.
Theorem 6.1. Consider a categorical response Y and a continuous predictor X with density function f(x) such that, for a fixed h > 0, f(x) > 0 for x in (x^(0) - h, x^(0) + h). Then:

(1) For fixed h, the estimated local section efficacy ζ_hat(x^(0), h) defined in (2.5) converges almost surely to the section efficacy ζ(x^(0), h) defined in (2.4).

(2) If X and Y are independent, then the local efficacy ζ(x^(0)) defined in (2.4) is zero.

(3) If there exists c in {1, ..., C} such that p_c(x) = P(Y = c | x) has a continuous derivative at x = x^(0) with p_c'(x^(0)) != 0, then ζ(x^(0)) > 0.
The χ^2 statistic (6.4) can be written in terms of multinomial proportions p_hat_rc, with sqrt(n)(p_hat_rc - p_rc), 1 <= r <= R, 1 <= c <= C, converging in distribution to a multivariate normal. By a Taylor expansion of T_n ≡ sqrt(X^2/n) at τ = plim sqrt(X^2/n), we find that:
Theorem 6.2.

sqrt(n)(T_n - τ) -->_d N(0, v^2),     (6.5)

where the asymptotic variance v^2, which can be obtained from the Taylor expansion, is not needed in this paper.

Remark 6.2. The result (6.5) applies to the multivariate χ^2-based efficacies ζ_hat, with n replaced by the number of observations that go into the computation of ζ_hat. In this case we use the notation χ^2_j for the chi-square statistic for the jth variable.
7 Consistency of CATCH

Let Y denote a variable to be classified into one of the categories 1, ..., C by using the X_j's in the vector X = (X_1, ..., X_d). Let X_{-j} ≡ X - {X_j} denote X without the X_j component. Assume that the squared Pearson nonparametric correlation (Doksum and Samarov [4], Huang and Chen [9]) between X_j and X_{-j} is less than 0.5. We call the variable X_j relevant for Y if, conditionally given X_{-j}, L(Y | X_j = x) is not constant in x, and irrelevant otherwise. Our data are (X^(1), Y^(1)), ..., (X^(n), Y^(n)), i.i.d. as (X, Y). A variable selection procedure is consistent if the probability that all the relevant variables are selected and none of the irrelevant variables are selected tends to one as n tends to infinity.
We give a general nonparametric framework in which the variable selection rule based on the CATCH importance scores s_j given in (3.8) is consistent.
When the predictors are not all categorical and CATCH involves tubes, we assume that the number of observations in each tube tends to infinity as n -> infinity, and we assume that, for those importance scores that require the selection of a section size h, the selection is from a finite set {h_j} with min_j {h_j} > h_0 for some h_0 > 0. It follows that the number of observations m_jk in the kth tube used to calculate the importance score s_j for the variable X_j tends to infinity as n -> infinity.
The key quantities that determine consistency are the limits λ_j of s_j. That is,

λ_j = τ(ζ_j, P) = E_P[ζ_j(X)] = ∫ ζ_j(x) dP(x),   j = 1, ..., d,

where ζ_j(x) is the local contingency efficacy defined by (2.4) for numerical X's and by (2.8) for categorical X's, and P is the probability distribution of X. Our estimate s_j of λ_j is

s_j = τ(ζ_hat_j, P_hat) = (1/M) sum_{k=1}^M ζ_hat_j(X*_k),

where ζ_hat_j is defined in Section 3 and P_hat is the empirical distribution of X.

Let d_0 denote the number of X's that are irrelevant for Y. Set d_1 = d - d_0 and, without loss of generality, reorder the λ_j so that

λ_j > 0 if j = 1, ..., d_1;   λ_j = 0 if j = d_1 + 1, ..., d.     (7.1)
Our variable selection rule is

"keep variable X_j if s_j >= t" for some threshold t.     (7.2)

Rule (7.2) is consistent if and only if

P(min_{j <= d_1} s_j >= t and max_{j > d_1} s_j < t) --> 1.     (7.3)

To establish (7.3), it is enough to show that

P(max_{j > d_1} s_j > t) --> 0     (7.4)

and

P(min_{j <= d_1} s_j <= t) --> 0.     (7.5)

We call (7.4) and (7.5) type I and type II consistency, respectively. Type I and II error probabilities refer to the probabilities of any false positives and any false negatives, respectively.
7.1 Type I consistency

We consider (7.4) first and note that

P(max_{j > d_1} s_j > t) <= sum_{j = d_1 + 1}^d P(s_j > t).     (7.6)
For the jth irrelevant variable X_j, j > d_1, the results in Section 6 apply to the local efficacy ζ_hat_j(x) at a fixed point x. The importance score s_j defined in (3.8) is the average of such efficacies over a sample of M random points x*_a. We assume that there is a point x^(0) such that, for irrelevant X_j's and for n large enough, ζ_hat_j(x^(0)) is stochastically larger in the right tail than the average M^{-1} sum_{a=1}^M ζ_hat_j(x*_a). That is, we assume that there exist t > 0 and n_0 such that for n >= n_0,

P(s_j > t) <= P(s_j^(0) > t),   j > d_1,     (7.7)

where s_j^(0) = ζ_hat_j(x^(0)). The existence of such a point x^(0) is established by considering the probability limit of

argmax{ζ_hat_j(x*_a) : x*_a in {x*_1, ..., x*_M}}.

Note that, by Lemma 6.1, s_j^(0) -->_p 0 as n -> infinity. It follows that we have established
(7.4) for t and d_0 = d - d_1 that are fixed as n -> infinity. To examine the interesting case where d_0 -> infinity as n -> infinity, we need large deviation results, to which we turn next. In the d_0 -> infinity case we consider t ≡ t_n that tends to infinity as n -> infinity, and we use the weaker assumption: there exist x^(0) and n_0 such that

P(s_j > t_n) <= P(s_j^(0) > t_n)     (7.8)

for n >= n_0 and each t_n with lim_{n -> infinity} t_n = infinity. We also assume that the number of columns C and rows R in the contingency table that defines the efficacy χ^2_{jn} given by (6.4) stays fixed as n -> infinity. In this case, P(s_j^(0) > t_n) -> 0 provided each term

[T_rc^(j)]^2 ≡ [(n_rc - E_rc) / (sqrt(n) sqrt(E_rc))]^2

in s_j^2 = χ^2_{jn} / n, defined for X_j by (6.4), satisfies

P(|T_rc^(j)| > t_n) --> 0.
This is because

P(sqrt(sum_r sum_c [T_rc^(j)]^2) > t_n) <= P(RC max_{r,c} [T_rc^(j)]^2 > t_n^2) <= RC max P(|T_rc^(j)| > t_n / sqrt(RC)),
and t_n and t_n / sqrt(RC) are of the same order as n -> infinity. By large deviation theory (Hall [8], Jing et al. [10]), for some constant A,

P(|T_rc^(j)| > t_n) = (2 / (t_n sqrt(2π))) e^{-t_n^2 / 2} [1 + A t_n^3 n^{-1/2} + o(t_n^3 n^{-1/2})],     (7.9)

provided t_n = o(n^{1/6}). If (7.9) is uniform in j > d_1, then (7.6) and (7.9) imply that
type I consistency holds if

d_0 (t_n sqrt(2π))^{-1} e^{-t_n^2 / 2} --> 0.     (7.10)

To examine (7.10), write d_0 and t_n as d_0 = exp(n^b) and t_n = sqrt(2) n^r. By large deviation theory, (7.10) holds when r < 1/6. Thus, if 0 < (1/2)b < r < 1/6, then

d_0 (t_n sqrt(2π))^{-1} e^{-t_n^2 / 2} = n^{-r} exp(n^b - n^{2r}) --> 0.
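As a sanity check on this display (our own numeric illustration, taking b = 1/4 and r = 1/7 as in the example below), the bound indeed decays to zero:

```python
import math

def type1_bound(n, b=0.25, r=1.0 / 7.0):
    # d0 * (t_n * sqrt(2*pi))^(-1) * exp(-t_n^2 / 2)
    # with d0 = exp(n^b) and t_n = sqrt(2) * n^r
    tn = math.sqrt(2.0) * n ** r
    return math.exp(n ** b) * math.exp(-tn * tn / 2.0) / (tn * math.sqrt(2.0 * math.pi))

vals = [type1_bound(n) for n in (10 ** 3, 10 ** 5, 10 ** 7, 10 ** 9)]
```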
Thus d_0 = exp(n^b) with b = 1/3 - ε, for any ε > 0, is the largest d_0 can be to have consistency. For instance, if there are exactly d_0 = exp(n^{1/4}) irrelevant variables and we use the threshold t_n = sqrt(2) n^{1/7}, the probability of selecting any of the irrelevant variables tends to zero at the rate n^{-1/7} exp(n^{1/4} - n^{2/7}).

7.2 Type II consistency
We now assume that, for each relevant variable X_j, there is a least favorable point x^(1) such that the local efficacy s_j^(1) ≡ ζ_hat_j(x^(1)) is stochastically smaller than the importance score s_j = M^{-1} sum_{a=1}^M ζ_hat_j(x*_a). That is, we assume that for each relevant variable X_j there exist n_1 and x^(1) such that for n >= n_1,

P(s_j^(1) <= t_n) >= P(s_j <= t_n),   j <= d_1.     (7.11)

The existence of such a least favorable x^(1) is established by considering the probability limit of

argmin{ζ_hat_j(x*_a) : x*_a in {x*_1, ..., x*_M}}.
We again assume that R and C are fixed, so that to show type II consistency it is enough to show that, for each (r, c) and j <= d_1,

P(|T_rc^(j)| <= t_n) --> 0.
This is because

P(sqrt(sum_r sum_c [T_rc^(j)]^2) <= t_n) <= P(min_{r,c} [T_rc^(j)]^2 <= t_n^2) <= RC max_{r,c} P(|T_rc^(j)| <= t_n).
By Theorem 6.2, sqrt(n)(s_j^(1) - τ_j) -->_d N(0, ν^2), with τ_j > 0. It follows that if t_n and d_1 are bounded above as n -> infinity and τ_j > 0, then P(s_j^(1) <= t_n) -> 0; thus type II consistency holds in this case. To examine the more interesting case, where t_n and d_1 are not bounded above and τ_j may tend to zero as n -> infinity, we need to center the T_rc^(j). Thus set
θ_n^(j) = sqrt(n) (p_rc^(j) - p_r^(j) p_c^(j)) / sqrt(p_r^(j) p_c^(j)),     (7.12)

T_n^(j) = T_rc^(j) - θ_n^(j),

where, for simplicity, the dependence of T_n^(j) and θ_n^(j) on r and c is not indicated.
Lemma 7.1.

P(min_{j <= d_1} |T_rc^(j)| <= t_n) <= sum_{j=1}^{d_1} P(|T_n^(j)| >= min_{j <= d_1} |θ_n^(j)| - t_n).

Proof.

P(min_{j <= d_1} |T_rc^(j)| <= t_n) = P(min_{j <= d_1} |T_n^(j) + θ_n^(j)| <= t_n)
  <= P(min_{j <= d_1} |θ_n^(j)| - max_{j <= d_1} |T_n^(j)| <= t_n)
  = P(max_{j <= d_1} |T_n^(j)| >= min_{j <= d_1} |θ_n^(j)| - t_n)
  <= sum_{j=1}^{d_1} P(|T_n^(j)| >= min_{j <= d_1} |θ_n^(j)| - t_n).
Set

a_n = min_{j <= d_1} |θ_n^(j)| - t_n;

then, by large deviation theory, if a_n = o(n^{1/6}),

P(|T_n^(j)| >= a_n) ∝ (a_n sqrt(2π))^{-1} exp{-a_n^2 / 2},     (7.13)

where ∝ signifies asymptotic order as n -> infinity, a_n -> infinity. It follows that if (7.13) holds uniformly in j, then by Lemma 7.1,

P(min_{j <= d_1} |T_rc^(j)| <= t_n) ∝ d_1 (a_n sqrt(2π))^{-1} exp{-a_n^2 / 2}.

Thus type II consistency holds when a_n = o(n^{1/6}) and

d_1 (a_n sqrt(2π))^{-1} exp{-a_n^2 / 2} --> 0.     (7.14)

In Section 7.1, t_n is of order n^r, 0 < r < 1/6. For this t_n, type II consistency holds if a_n is of order n^r, 0 < r < 1/6, and

d_1 n^{-r} exp{-n^{2r} / 2} --> 0.
If a_n = O(n^k) with k >= 1/6, then P(|T_n^(j)| >= a_n) tends to zero at a rate faster than in (7.13), and consistency is ensured. For instance, if all relevant variables X_j have p_rc^(j) fixed as n -> infinity, then min_{j <= d_1} |θ_n^(j)| is of order sqrt(n), and a_n is of order n^{1/2} - t_n.
7.3 Type I and II consistency

We see from Sections 7.1 and 7.2 that consistency holds under a variety of conditions. One set of conditions is:

Theorem 7.1. Under the regularity conditions of Sections 7.1 and 7.2, if

d_0 = c_0 exp(n^b),   t_n = c_1 n^r,   min_{j <= d_1} θ^(j) - t_n = c_2 n^r,   d_1 = c_3 exp(n^b)

for positive constants c_0, c_1, c_2, and c_3, and 0 < (1/2)b < r < 1/6, then CATCH is consistent.
Appendix: Proof of Theorem 6.1

(1) Let N_h = sum_{i=1}^n I(X^(i) in N_h(x^(0))). Then N_h ~ Bi(n, ∫_{x^(0)-h}^{x^(0)+h} f(x) dx) and N_h ->_{a.s.} infinity as n -> infinity. Moreover,

ζ_hat(x^(0), h) = n^{-1/2} sqrt(X^2(x^(0), h)) = (N_h / n)^{1/2} (N_h)^{-1/2} sqrt(X^2(x^(0), h)).

Since N_h ~ Bi(n, ∫_{x^(0)-h}^{x^(0)+h} f(x) dx), we have

N_h / n --> ∫_{x^(0)-h}^{x^(0)+h} f(x) dx.     (7.15)

By Lemma 6.1 and Table 1,

(N_h)^{-1/2} sqrt(X^2(x^(0), h)) --> (sum_{r = +,-} sum_{c=1}^C (p_rc - p_r q_c)^2 / (p_r q_c))^{1/2},     (7.16)

where

p_{+c} = Pr(X > x^(0), Y = c | X in N_h(x^(0))),   p_{-c} = Pr(X <= x^(0), Y = c | X in N_h(x^(0))),
and, for r = +,- and c = 1, ..., C,

p_r = sum_{c=1}^C p_rc,   q_c = sum_{r = +,-} p_rc.

In fact, p_+ = Pr(X > x^(0) | X in N_h(x^(0))), p_- = Pr(X <= x^(0) | X in N_h(x^(0))), and q_c = Pr(Y = c | X in N_h(x^(0))).
By (7.15) and (7.16),

ζ_hat(x^(0), h) --> (∫_{x^(0)-h}^{x^(0)+h} f(x) dx)^{1/2} (sum_{r = +,-} sum_{c=1}^C (p_rc - p_r q_c)^2 / (p_r q_c))^{1/2} ≡ ζ(x^(0), h).
(2) Since X and Y are independent, p_rc = p_r q_c for r = +,- and c = 1, ..., C, so ζ(x^(0), h) = 0 for any h. Hence

ζ(x^(0)) = sup_{h > 0} ζ(x^(0), h) = 0.
(3) Recall that p_c(x^(0)) = Pr(Y = c | x^(0)). Without loss of generality assume p_c'(x^(0)) > 0. Then there exists h such that

Pr(Y = c | x^(0) < X < x^(0) + h) > Pr(Y = c | x^(0) - h < X < x^(0)),

which is equivalent to p_{+c}/p_+ > p_{-c}/p_-. This shows that ζ(x^(0)) > 0, since by Lemma 6.1, ζ(x^(0)) = 0 would imply p_{+c}/p_+ = p_{-c}/p_- = p_c.
References

[1] Breiman, L., 2001. Random forests. Machine Learning 45, 5-32.

[2] Breiman, L., Friedman, J. H., Olshen, R. A., Stone, C. J., 1984. Classification and Regression Trees. Wadsworth, Belmont.

[3] Doksum, K., Tang, S., Tsui, K.-W., 2008. Nonparametric variable selection: The EARTH algorithm. Journal of the American Statistical Association 103 (484), 1609-1620.

[4] Doksum, K. A., Samarov, A., 1995. Nonparametric estimation of global functionals and a measure of explanatory power of covariates in regression. Annals of Statistics 23, 1443-1473.

[5] Doksum, K. A., Schafer, C., 2006. Powerful choices: Tuning parameter selection based on power. In: Frontiers in Statistics: Dedicated to Peter Bickel (J. Fan and H. L. Koul, eds.). Imperial College Press, 113-141.

[6] Friedman, J., 1991. Multivariate adaptive regression splines. The Annals of Statistics 19, 1-67.

[7] Gao, J., Gijbels, I., 2008. Bandwidth selection in nonparametric kernel testing. Journal of the American Statistical Association 103 (484), 1584-1594.

[8] Hall, P., 1992. The Bootstrap and Edgeworth Expansion. Springer, New York.

[9] Huang, L.-S., Chen, J., 2008. Analysis of variance, coefficient of determination and F-test for local polynomial regression. Annals of Statistics 36, 2085-2109.

[10] Jing, B.-Y., Shao, Q.-M., Wang, Q., 2003. Self-normalized Cramér-type large deviations for independent random variables. Annals of Probability 31, 2167-2215.

[11] Lin, D. Y., Zeng, D., 2011. Correcting for population stratification in genomewide association studies. Journal of the American Statistical Association 106, 997-1008.

[12] Loh, W.-Y., 2009. Improving the precision of classification trees. The Annals of Applied Statistics 3, 1710-1737.

[13] Price, A., Patterson, N., Plenge, R., Weinblatt, M., Shadick, N., Reich, D., 2006. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38 (8), 904-909.

[14] Qiao, Z., Zhou, L., Huang, J., 2008. Linear discriminant analysis for high dimensional, low sample size data. Special issue of World Congress on Engineering 2008.

[15] Quinlan, J. R., 1987. Simplifying decision trees. International Journal of Man-Machine Studies 27 (3), 221-234.

[16] Schafer, C., Doksum, K. A., 2009. Selecting local models in multiple regression by maximizing power. Metrika 69, 283-304.

[17] Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267-288.

[18] Wang, L., Shen, X., 2007. On L1-norm multicategory support vector machines: methodology and theory. Journal of the American Statistical Association 102, 583-594.

[19] Zhang, H. H., Liu, Y., Wu, Y., Zhu, J., 2008. Variable selection for the multicategory SVM via adaptive sup-norm regularization. Electronic Journal of Statistics 2, 149-167.
spuriously related to the categorical variable Y. This is accomplished by constraining all predictors but the one we are focusing on. For the case where the number of predictors is huge (d >> n), this can be done by restricting principal components for some types of studies; see Remark 3.4.
Our approach to nonparametric variable selection is related to the EARTH algorithm (Doksum et al. [3]), which applies to nonparametric regression problems with a continuous response variable and continuous predictors. It measures the conditional association between a predictor X_i and the response variable, conditional on all the other predictors {X_j}_{j != i}, by constraining the {X_j}_{j != i} variables to regions called tubes. The local importance score is based on a local linear or a local polynomial regression. The contribution of the current paper is to develop variable selection methods for the classification problem, with a categorical response variable and predictors that can be continuous, discrete, or categorical.
Our CATCH algorithm can be used as a variable selection step before classification. Any classification method, preferably nonparametric, can be used after we single out the important variables. In particular, SVM and random forest are statistical classification methods that can be used with CATCH. We show in simulation studies that, when the true model is highly nonlinear and there are many irrelevant predictors, using CATCH to screen out irrelevant or weak predictors greatly improves the performance of SVM and random forest.
The CATCH algorithm works with general classification data with both numerical and categorical predictors. Moreover, the CATCH algorithm is robust when the predictors interact with each other, especially when numerical and categorical predictors interact, e.g., under hierarchical interactive association between the numerical and categorical predictors. We present a model with a reasonable sample size and predictor dimension for which random forest has a relatively low chance of detecting the significance of the categorical predictor. The CART and GUIDE algorithms with pruning can find the correct splitting variables, including the categorical one, but yield relatively high classification errors; moreover, CART and GUIDE have trouble choosing the splitting predictors in the correct order. The CATCH algorithm achieves higher accuracy in the task of variable selection. This is due to the importance scores being conditional, as illustrated in the simulation example in Section 4.1.
The rest of the paper proceeds as follows. In Section 2 we introduce importance scores for univariate predictors. In Sections 3.1-3.4 we extend these scores to multivariate predictors, and in Section 3.5 we introduce the CATCH algorithm. In Section 4 we use simulation studies to show the effectiveness of the CATCH algorithm. A real example is provided in Section 5. In Section 6 we provide some theoretical properties to justify the definition of local contingency efficacy, and in Section 7 we show asymptotic consistency of the algorithm.
2 Importance Scores for Univariate Classification

Let (X^(i), Y^(i)), i = 1, ..., n, be independent and identically distributed (i.i.d.) as (X, Y) ~ P. We first consider the case of a univariate X, then use the methods constructed for the univariate case to construct the multivariate versions.
2.1 Numerical Predictor: Local Contingency Efficacy

Consider the classification problem with a numerical predictor. Let

Pr(Y = c | x) = p_c(x),   c = 1, ..., C,   sum_{c=1}^C p_c(x) ≡ 1.     (2.1)
We introduce a measure of how strongly X is associated with Y in a neighborhood of a "fixed" point x^(0). Points x^(0) will be selected at random by our algorithm; local importance scores will be computed for each point and then averaged. For a fixed h > 0, define N_h(x^(0)) = {x : 0 < |x - x^(0)| <= h} to be a neighborhood of x^(0), and let n(x^(0), h) = sum_{i=1}^n I(X^(i) in N_h(x^(0))) denote the number of data points in the neighborhood. For c = 1, ..., C, let n_c^-(x^(0), h) be the number of observations (X^(i), Y^(i)) satisfying x^(0) - h <= X^(i) < x^(0) and Y^(i) = c, and let n_c^+(x^(0), h) be the number of (X^(i), Y^(i)) satisfying x^(0) < X^(i) <= x^(0) + h and Y^(i) = c. Let n_c(x^(0), h) = n_c^+(x^(0), h) + n_c^-(x^(0), h), n^-(x^(0), h) = sum_{c=1}^C n_c^-(x^(0), h), and n^+(x^(0), h) = sum_{c=1}^C n_c^+(x^(0), h). Table 1 gives the resulting local contingency table.
Table 1: Local contingency table.

                          Y = 1            Y = 2            ...   Y = C            Total
x^(0) - h <= X < x^(0)    n_1^-(x^(0), h)  n_2^-(x^(0), h)  ...   n_C^-(x^(0), h)  n^-(x^(0), h)
x^(0) < X <= x^(0) + h    n_1^+(x^(0), h)  n_2^+(x^(0), h)  ...   n_C^+(x^(0), h)  n^+(x^(0), h)
Total                     n_1(x^(0), h)    n_2(x^(0), h)    ...   n_C(x^(0), h)    n(x^(0), h)
To measure how strongly Y relates to local restrictions on X, we consider the chi-square statistic

X^2(x^(0), h) = sum_{c=1}^C [ (n_c^-(x^(0), h) - E_c^-(x^(0), h))^2 / E_c^-(x^(0), h) + (n_c^+(x^(0), h) - E_c^+(x^(0), h))^2 / E_c^+(x^(0), h) ],     (2.2)
where 0/0 ≡ 0 and

E_c^-(x^(0), h) = n^-(x^(0), h) n_c(x^(0), h) / n(x^(0), h),   E_c^+(x^(0), h) = n^+(x^(0), h) n_c(x^(0), h) / n(x^(0), h).     (2.3)
Due to the local restriction on X, the local contingency table might have some zero cells. However, this is not an issue for the χ^2 statistic: it can detect local dependence between Y and X as long as there exist observations from at least two categories of Y in the neighborhood of x^(0). When all observations belong to the same category of Y, intuitively no dependence can be detected; in this case the χ^2 statistic equals zero, which coincides with our intuition.
We call the neighborhood N_h(x^(0)) a section and maximize X²(x^(0), h) in (2.2) with respect to the section size h. Let plim denote limit in probability. Our local measure of association is ζ(x^(0)), as given in

Definition 1 (Local Contingency Efficacy, for a numerical predictor). The local contingency section efficacy and the Local Contingency Efficacy (LCE) of Y on a numerical predictor X at the point x^(0) are

ζ(x^(0), h) = plim_{n→∞} n^{−1/2} √X²(x^(0), h),   ζ(x^(0)) = sup_{h>0} ζ(x^(0), h).   (2.4)
If p_c(x) is constant in x for all c ∈ {1, ..., C} for x near x^(0), then as n → ∞, X²(x^(0), h) converges in distribution to a χ²_{C−1} variable, and in this case ζ(x^(0), h) = 0. On the other hand, if p_c(x) is not constant near x^(0), then ζ(x^(0), h) determines the asymptotic power against Pitman alternatives of the test based on X²(x^(0), h) for testing whether X is locally independent of Y, and is called the efficacy of this test. Moreover, ζ²(x^(0), h) is the asymptotic noncentrality parameter of the statistic n^{−1} X²(x^(0), h).
The estimators of ζ(x^(0), h) and ζ(x^(0)) are

ζ̂(x^(0), h) = n^{−1/2} √X²(x^(0), h),   ζ̂(x^(0)) = sup_{h>0} ζ̂(x^(0), h).   (2.5)

We select the section size h as

ĥ = argmax{ ζ̂(x^(0), h) : h ∈ {h_1, ..., h_g} }

for some grid h_1, ..., h_g. This corresponds to selecting h by maximizing estimated power; see Doksum and Schafer [5], Gao and Gijbels [7], Doksum et al. [3], and Schafer and Doksum [16].
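The statistic (2.2) and the estimator (2.5), together with the grid search for h, can be sketched in Python as follows. This is an illustrative implementation, not the authors' code; the function names `local_chisq` and `efficacy_hat` are ours.

```python
import numpy as np

def local_chisq(x, y, x0, h, n_classes):
    """Chi-square statistic (2.2) for the local 2 x C contingency table:
    row 1 counts points with x0 - h <= x < x0, row 2 those with x0 < x <= x0 + h.
    Cells with zero expected count contribute 0 (the 0/0 = 0 convention)."""
    x, y = np.asarray(x), np.asarray(y)
    left = (x >= x0 - h) & (x < x0)
    right = (x > x0) & (x <= x0 + h)
    counts = np.array([[np.sum(side & (y == c)) for c in range(n_classes)]
                       for side in (left, right)], dtype=float)
    n = counts.sum()
    if n == 0:
        return 0.0
    expected = counts.sum(axis=1, keepdims=True) * counts.sum(axis=0) / n
    safe = np.where(expected > 0, expected, 1.0)  # avoid 0/0; those cells give 0
    return float((np.where(expected > 0, (counts - expected) ** 2, 0.0) / safe).sum())

def efficacy_hat(x, y, x0, h_grid, n_classes):
    """Estimator (2.5): zeta_hat(x0) is the maximum over the grid of
    n^{-1/2} sqrt(X^2(x0, h)); the maximizing h is the selected section size."""
    n = len(x)
    scores = [np.sqrt(local_chisq(x, y, x0, h, n_classes) / n) for h in h_grid]
    i = int(np.argmax(scores))
    return scores[i], h_grid[i]
```

For instance, with two classes separated exactly at x0, the local 2 × 2 table is diagonal and the statistic attains its maximal value for that sample.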
2.2 Categorical Predictor: Contingency Efficacy
Let (X^(i), Y^(i)), i = 1, ..., n, be as before, except that X ∈ {1, ..., C′} is a categorical predictor. Let n(c, c′) be the number of observations satisfying Y = c and X = c′, and let

X² = Σ_{c=1}^C Σ_{c′=1}^{C′} (n(c, c′) − E(c, c′))² / E(c, c′),   (2.6)

where

E(c, c′) = [Σ_{a=1}^C n(a, c′)] [Σ_{b=1}^{C′} n(c, b)] / n,   (2.7)

that is, the column total for c′ times the row total for c, divided by n.
Definition 2 (Contingency Efficacy, for a categorical predictor). The Contingency Efficacy (CE) of Y on a categorical predictor X is

ζ = plim_{n→∞} n^{−1/2} √X².   (2.8)

Under the null hypothesis that Y and X are independent, X² converges in distribution to a χ²_{(C−1)(C′−1)} variable; in this case ζ = 0. In general, ζ² is the asymptotic noncentrality parameter of the statistic n^{−1} X². The contingency efficacy ζ determines the asymptotic power of the test based on X² for testing whether Y and X are independent. We use ζ as an importance score that measures how strongly Y depends on X. The estimator of (2.8) is

ζ̂ = n^{−1/2} √X².   (2.9)
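As a sketch (not the authors' code), the estimator (2.9) can be computed directly from the C × C′ table of counts; the helper below is an illustrative Python implementation for discrete-coded labels, with the Pearson statistic (2.6) and expected counts (2.7) formed explicitly.

```python
import numpy as np

def contingency_efficacy(x_cat, y):
    """Estimated contingency efficacy (2.9): zeta_hat = n^{-1/2} sqrt(X^2),
    where X^2 is the Pearson chi-square statistic (2.6) of the C x C' table
    of counts n(c, c'), with expected counts as in (2.7)."""
    x_cat, y = np.asarray(x_cat), np.asarray(y)
    levels_x, levels_y = np.unique(x_cat), np.unique(y)
    table = np.array([[np.sum((y == c) & (x_cat == a)) for a in levels_x]
                      for c in levels_y], dtype=float)
    n = table.sum()
    # row total x column total / n; margins are positive since levels observed
    expected = table.sum(axis=1, keepdims=True) * table.sum(axis=0) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    return float(np.sqrt(chi2 / n))
```

A perfectly associated pair gives ζ̂ = 1 for a 2 × 2 table, while an exactly balanced (independent) table gives ζ̂ = 0.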
3 Variable Selection for Classification Problems: The CATCH Algorithm
Now we consider a multivariate X = (X_1, ..., X_d), d > 1. To examine whether a predictor X_l is related to Y in the presence of all other variables, we calculate efficacies conditional on these variables, as well as unconditional or marginal efficacies.
3.1 Numerical Predictors
To examine the effect of X_l in the local region of x^(0) = (x^(0)_1, ..., x^(0)_d), we build a "tube" around the center point x^(0) in which data points are within a certain radius δ of x^(0) with respect to all variables except X_l. Define X_{−l} = (X_1, ..., X_{l−1}, X_{l+1}, ..., X_d)^T to be the complementary vector to X_l. Let D(·, ·) be a distance on R^{d−1}, which we will define later.

Definition 3. A tube T_l of size δ (δ > 0) for the variable X_l at x^(0) is the set

T_l ≡ T_l(x^(0), δ, D) ≡ {x : D(x_{−l}, x^(0)_{−l}) ≤ δ}.   (3.1)
We adjust the size of the tube so that the number of points in the tube is a preassigned number k. We find that k ≥ 50 is in general a good choice. Note that k = n corresponds to unconditional or marginal nonparametric regression.
Within the tube T_l(x^(0), δ, D), X_l is unconstrained. To measure the dependence of Y on X_l locally, we consider neighborhoods of x^(0) along the direction of X_l within the tube. Let

N_{l,h,δ,D}(x^(0)) = {x = (x_1, ..., x_d)^T ∈ R^d : x ∈ T_l(x^(0), δ, D), |x_l − x^(0)_l| ≤ h}   (3.2)

be a section of the tube T_l(x^(0), δ, D). To measure how strongly X_l relates to Y in N_{l,h,δ,D}(x^(0)), we consider

X²_{l,δ,D}(x^(0), h),   (3.3)

where (3.3) is defined by (2.2) using only the observations in the tube T_l(x^(0), δ, D).
Analogous to the definition of local contingency efficacy (2.4), we define

Definition 4. The local contingency tube section efficacy and the local contingency tube efficacy of Y on X_l at the point x^(0), and their estimates, are

ζ(x^(0), h, l) = plim_{n→∞} n^{−1/2} √X²_{l,δ,D}(x^(0), h),   ζ_l(x^(0)) = sup_{h>0} ζ(x^(0), h, l),   (3.4)

ζ̂(x^(0), h, l) = n^{−1/2} √X²_{l,δ,D}(x^(0), h),   ζ̂_l(x^(0)) = sup_{h>0} ζ̂(x^(0), h, l).   (3.5)
For an illustration of local contingency tube efficacy, consider Y ∈ {1, 2} and suppose logit P(Y = 2 | x) = (x − 1/2)², x ∈ [0, 1]. Then the local efficacy with h small will be large, while if h ≥ 1 the efficacy will be zero.
3.2 Numerical and Categorical Predictors

If X_l is categorical but some of the other predictors are numerical, we construct tubes T_l(x^(0), δ, D) based on the numerical predictors, leaving the categorical variables unconstrained, and we use the statistic defined by (2.6) to measure the strength of the association between Y and X_l in the tube T_l(x^(0), δ, D), using only the observations in the tube. We denote this version of (2.6) by

X²_{l,δ,D}(x^(0)).   (3.6)
Analogous to the definition of contingency efficacy given in (2.8), we define

Definition 5. The local contingency tube efficacy of Y on X_l at x^(0) and its estimate are

ζ_l(x^(0)) = plim_{n→∞} n^{−1/2} √X²_{l,δ,D}(x^(0)),   ζ̂_l(x^(0)) = n^{−1/2} √X²_{l,δ,D}(x^(0)).   (3.7)
3.3 The Empirical Importance Scores
The empirical efficacies (3.5) and (3.7) measure how strongly Y depends on X_l near one point x^(0). To measure the overall dependency of Y on X_l, we randomly sample M bootstrap points x*_a, a = 1, ..., M, with replacement from {x^(i), 1 ≤ i ≤ n}, and calculate the average of the local tube efficacy estimates at these points.

Definition 6. The CATCH empirical importance score for variable l is

s_l = M^{−1} Σ_{a=1}^M ζ̂_l(x*_a),   (3.8)

where ζ̂_l(x*_a) is defined in (3.5) and (3.7) for numerical and categorical predictors, respectively.
3.4 Adaptive Tube Distances
Doksum et al. [3] used an example to demonstrate that the distance D needs to be adaptive to keep variables strongly associated with Y from obscuring the effect of weaker variables. Suppose Y depends on three variables X_1, X_2, and X_3 among the larger set of predictors, with the strength of the dependency ranging from weak to strong, as estimated by the empirical importance scores. As we examine the effect of X_2 on Y, we want to define the tube so that there is relatively less variation in X_3 than in X_1 within the tube, because X_3 is more likely than X_1 to obscure the effect of X_2. To this end, we introduce a tube distance for the variable X_l of the form

D_l(x_{−l}, x^(0)_{−l}) = Σ_{j=1}^d w_j |x_j − x^(0)_j| I(j ≠ l),   (3.9)

where the weights w_j adjust the contribution of strong variables to the distance so that the strong variables are more constrained.
The weights w_j are proportional to (s_j − s′_j)_+ and are determined iteratively and adaptively as follows. In the first iteration, set s_j = s^(1)_j = 1 in (3.9) and compute the importance scores s^(2)_j using (3.8). For the second iteration we use the weights s_j = s^(2)_j and the distance

D_l(x_{−l}, x^(0)_{−l}) = [Σ_{j=1}^d (s_j − s′_j)_+ |x_j − x^(0)_j| I(j ≠ l)] / [Σ_{j=1}^d (s_j − s′_j)_+ I(j ≠ l)],   (3.10)

where 0/0 ≡ 1 and s′_j is a threshold value for s_j under the assumption of no conditional association of X_j with Y, computed by a simple Monte Carlo technique. This adjustment is important because some predictors by definition have larger importance scores than others even when they are all independent of Y. For example, a categorical predictor
with more levels has a larger importance score than one with fewer levels. Therefore the unadjusted scores s_j cannot accurately quantify the relative importance of the predictors in the tube distance calculation. Next we use (3.8) to produce s_j = s^(3)_j, use s^(3)_j in the next iteration, and so on.
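A minimal sketch of tube construction under the adaptive distance (3.10), assuming the importance scores s and their null thresholds s′ are already available. This is illustrative code, not the authors' implementation; `tube_members` is a hypothetical helper name, and ties are resolved by a stable sort.

```python
import numpy as np

def tube_members(X, center, l, s, s_prime, m):
    """Indices of the m points nearest the tube center in the adaptive
    distance (3.10): a weighted L1 distance with weights (s_j - s'_j)_+
    that ignores coordinate l, so X_l is unconstrained inside the tube."""
    X, center = np.asarray(X, float), np.asarray(center, float)
    w = np.clip(np.asarray(s, float) - np.asarray(s_prime, float), 0.0, None)
    w[l] = 0.0                          # the factor I(j != l) drops coordinate l
    total = w.sum()
    if total > 0:
        w = w / total                   # the normalization in (3.10)
    d = np.abs(X - center) @ w          # weighted L1 distance to the center
    return np.argsort(d, kind="stable")[:m]
```

In practice the tube size δ is implicit here: it is the mth smallest distance, which matches the paper's rule of fixing the number of points in the tube rather than δ itself.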
In the definition above we need to define |x_j − x^(0)_j| appropriately for a categorical variable X_j. Because X_j is categorical, we require that |x_j − x^(0)_j| be constant whenever x_j and x^(0)_j take any pair of different categories. One approach is to define |x_j − x^(0)_j| = ∞ · I(x_j ≠ x^(0)_j), in which case all points in the tube have the same X_j value as the tube center. This approach is a very sensible one, as it strictly implements the idea of conditioning on X_j when evaluating the effect of another predictor X_l, so that X_j does not contribute to any variation in Y. The problem with this definition, though, is that when the number of categories of X_j is large relative to the sample size, there will not be enough points in the tube if we restrict X_j to a single category. In such a case we do not restrict X_j, which means that X_j can take any category in the tube: we instead define |x_j − x^(0)_j| = k_0 I(x_j ≠ x^(0)_j), where k_0 is a normalizing constant. The value of k_0 is chosen to achieve a balance between numerical and categorical variables. Suppose that X_1 is a numerical variable with standard deviation one and X_2 is a categorical variable. For X_2 we set k_0 ≡ E(|X^(1)_1 − X^(2)_1|), where X^(1)_1 and X^(2)_1 are two independent realizations of X_1. Note that k_0 = 2/√π if X_1 follows a standard normal distribution. Also, k_0 can be based on the empirical distributions of the numerical variables. Based on the preceding discussion of the choice of k_0, we use k_0 = ∞ in the simulations and k_0 = 2/√π in the real data example.
Remark 3.1 (choice of k_0). k_0 = 2/√π is used unless the sample size is large enough, in the sense explained below. Suppose k_0 = ∞; then D_l(x_{−l}, x^(0)_{−l}) is finite only for those observations with x_j = x^(0)_j. When there are multiple categorical variables, D_l is finite only for the points, denoted by the set Ω^(0), that satisfy x_j = x^(0)_j for all categorical X_j. When there are many categorical variables, or when the total sample size is small, this set can be very small or nearly empty, in which case the tube cannot be well defined; therefore k_0 = 2/√π is used, as in the real data example. On the other hand, when the size of Ω^(0) is larger than the specified tube size, we can use k_0 = ∞, as in the simulation examples.
3.5 The CATCH Algorithm

The variable selection algorithm for a classification problem is based on importance scores and an iteration procedure.

The Algorithm

(0) Standardizing. Standardize all the numerical predictors to have sample mean 0 and sample standard deviation 1 using linear transformations.
(1) Initializing Importance Scores. Let S = (s_1, ..., s_d) be importance scores for the predictors. Let S′ = (s′_1, ..., s′_d), where s′_l is a threshold importance score corresponding to the model in which X_l is independent of Y given X_{−l}. Initialize S to (1, ..., 1) and S′ to (0, ..., 0). Let M be the number of tube centers and m the number of observations within each tube.

(2) Tube Hunting Loop. For l = 1, ..., d, do (a), (b), (c), and (d) below.

(a) Selecting tube centers. For 0 < b ≤ 1, set M = [n^b], where [·] is the greatest integer function, and randomly sample M bootstrap points X*_1, ..., X*_M with replacement from the n observations. Set k = 1 and do (b).
(b) Constructing tubes. For 0 < a < 1, set m = [n^a]. Using the tube distance D_l defined in (3.10), select the tube size δ so that there are exactly m ≥ 10 observations inside the tube for X_l at X*_k.
(c) Calculating efficacy. If X_l is a numerical variable, let ζ̂_l(X*_k) be the estimated local contingency tube efficacy of Y on X_l at X*_k, as defined in (3.5). If X_l is a categorical variable, let ζ̂_l(X*_k) be the estimated contingency tube efficacy of Y on X_l at X*_k, as defined in (3.7).
(d) Thresholding. Let Y* denote a random variable which is identically distributed as Y but independent of X. Let (Y^(1)_0, ..., Y^(n)_0) be a random permutation of (Y^(1), ..., Y^(n)); then Y_0 ≡ (Y^(1)_0, ..., Y^(n)_0) can be viewed as a realization of Y*. Let ζ̂*_l(X*_k) be the estimated local contingency tube efficacy of Y_0 on X_l at X*_k, calculated as in (c) with Y_0 in place of Y.

If k < M, set k = k + 1 and go to (b); otherwise go to (e).
(e) Updating importance scores. Set

s_l^new = (1/M) Σ_{k=1}^M ζ̂_l(X*_k),   s′_l^new = (1/M) Σ_{k=1}^M ζ̂*_l(X*_k).

Let S^new = (s_1^new, ..., s_d^new) and S′^new = (s′_1^new, ..., s′_d^new) denote the updates of S and S′.

(3) Iterations and Stopping Rule. Repeat (2) I times. Stop when the change in S^new is small. Record the last S and S′ as S^stop and S′^stop, respectively.

(4) Deletion step. Using S^stop and S′^stop, delete X_l if s_l^stop − s′_l^stop < √2 SD(s′^stop). If d is small, we can generate several sets of S′^stop and use the sample standard deviation of all the values in these sets as SD(s′^stop); see Remark 3.3.

End of the algorithm.
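The deletion rule in step (4) is simple to state in code. The following is an illustrative sketch, not the authors' implementation: the function name is ours, and the sample standard deviation of the null scores stands in for SD(s′^stop).

```python
import numpy as np

def catch_deletion(s_stop, s_prime_stop):
    """Deletion step (4): delete X_l when s_l^stop - s'_l^stop < sqrt(2) * SD(s'^stop).
    Returns the indices of the variables that are kept."""
    s = np.asarray(s_stop, dtype=float)
    sp = np.asarray(s_prime_stop, dtype=float)
    threshold = np.sqrt(2.0) * np.std(sp, ddof=1)  # sample SD of the null scores
    return np.nonzero(s - sp >= threshold)[0]
```

For example, scores (0.9, 0.12, 0.8) against null scores (0.10, 0.20, 0.15) keep the first and third variables, since only their excesses over the null exceed the √2 SD threshold.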
Remark 3.2. Step (d) needs to be replaced by (d′) below when a global permutation of Y does not produce an appropriate "local null distribution." For example, when the sample sizes of the different classes are extremely unequal, known as an unbalanced classification problem, a local region is very likely to contain a pure class of Y after the global permutation; the threshold S′ will then be spuriously low, and irrelevant variables will be falsely selected into the model. The approach in (d′) is more nonparametric but more time consuming.
(d′) Local Thresholding. Let Y* denote a random variable which is distributed as Y but independent of X_l given X_{−l}. Let (Y^(1)_0, ..., Y^(m)_0) be a random permutation of the Y's inside the tube; then (Y^(1)_0, ..., Y^(m)_0) can be viewed as a realization of Y*. If X_l is a numerical variable, let ζ̂′_l(X*_k) be the estimated local contingency tube efficacy of Y* on X_l at X*_k, as defined in (3.5). If X_l is a categorical variable, let ζ̂′_l(X*_k) be the estimated contingency tube efficacy of Y* on X_l at X*_k, as defined in (3.7).
Remark 3.3 (The threshold). Under the assumption H^(l)_0 that Y is independent of X_l given X_{−l} in T_l, s_l and s′_l are nearly independent and SD(s_l − s′_l) ≈ √2 SD(s′_l). Thus the rule that deletes X_l when s_l^stop − s′_l^stop < √2 SE(s′^stop) is similar to the one-standard-deviation rule of CART and GUIDE. The choice of threshold is also discussed in Section 7. Alternatively, we can calculate a p-value for each covariate by a permutation test: we permute the response variable, obtain the importance scores for the permuted data, and then calculate how extreme s_l^stop is compared to the permuted scores. We then identify a covariate as important if its p-value is less than 0.1.
Remark 3.4 (Large d, small n). A key idea in this paper is that when determining the importance of the variable X_j, we restrict (condition on) the vector X_{−j} to a relatively small set. This is done to address the problem of spurious correlation, which is of great concern in genomics (Price et al. [13], Lin and Zeng [11]). In such genomic studies the number of predictors d typically greatly exceeds the sample size n. For numerical variables this can be dealt with by replacing X_{−j} with a vector of principal components explaining most of the variation in X_{−j}. Although our paper does not deal with the d ≫ n case directly, this remark shows that our approach is also applicable to the genome-wide association studies described in Price et al. [13].
Remark 3.5. We now consider the computational complexity of the algorithm. Recall the notation: I is the number of repetitions, M is the number of tube centers, m is the tube size, d is the dimension, and n is the sample size. The number of operations needed for the algorithm is I × d × M × (2b + 2c + 2d) + 2e, where 2b, 2c, 2d, and 2e denote the numbers of operations required for steps (2b), (2c), (2d), and (2e). The computational complexities of (2b), (2c), (2d), and (2e) are O(nd), O(m), O(m), and O(M), respectively. Since nd ≫ m, the required computation is dominated by step (2b), and the computational complexity of the algorithm is O(IMnd²). Solving a least squares problem requires O(nd²) operations, so our algorithm has higher computational complexity by a factor of IM compared to least squares. However, the computations for the M tube centers are independent of each other and can be performed in parallel.
4 Simulation Studies
4.1 Variable Selection with Categorical and Numerical Predictors
Example 4.1. Consider d ≥ 5 predictors X_1, X_2, ..., X_d and a categorical response variable Y. The fifth predictor X_5 is a Bernoulli random variable taking the values 0 and 1 with equal probability. All other predictors are uniformly distributed on [0, 1].
When X_5 = 0, the dependent variable Y is related to the predictors in the following fashion:

Y = 0 if X_1 − X_2 − 0.25 sin(16πX_2) ≤ −0.5,
    1 if −0.5 < X_1 − X_2 − 0.25 sin(16πX_2) ≤ 0,
    2 if 0 < X_1 − X_2 − 0.25 sin(16πX_2) ≤ 0.5,
    3 if X_1 − X_2 − 0.25 sin(16πX_2) > 0.5.   (4.1)
When X_5 = 1, Y depends on X_3 and X_4 in the same way as above, with X_1 replaced by X_3 and X_2 replaced by X_4. The other variables X_6, ..., X_d are irrelevant noise variables. All d predictors are independent.
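Under the stated design, data from model (4.1) can be simulated as follows. This is an illustrative sketch, not the simulation code used in the paper; the function name is ours.

```python
import numpy as np

def generate_example41(n, d, rng=None):
    """Simulate model (4.1): Y in {0,1,2,3} is driven by (X1, X2) when X5 = 0
    and by (X3, X4) when X5 = 1; X6, ..., Xd are irrelevant noise."""
    rng = np.random.default_rng(rng)
    X = rng.uniform(size=(n, d))
    X[:, 4] = rng.integers(0, 2, size=n)           # X5 ~ Bernoulli(1/2)
    a = np.where(X[:, 4] == 0, X[:, 0], X[:, 2])   # active pair switches with X5
    b = np.where(X[:, 4] == 0, X[:, 1], X[:, 3])
    z = a - b - 0.25 * np.sin(16 * np.pi * b)
    # thresholds -0.5, 0, 0.5 from (4.1); right=True gives the "<= threshold" cuts
    Y = np.digitize(z, [-0.5, 0.0, 0.5], right=True)
    return X, Y
```

The `right=True` option makes `np.digitize` use the half-open intervals of (4.1), e.g. −0.5 < z ≤ 0 maps to Y = 1.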
In addition to the presence of the noise variables X_6, ..., X_d, this example is challenging in two respects. First, there is an interaction effect between the continuous predictors X_1, ..., X_4 and the categorical predictor X_5. Such interaction effects may arise in practice: for example, suppose Y represents the effect resulting from multiple treatments X_1, ..., X_4 and X_5 represents the gender of a patient; our model then allows males and females to respond differently to treatments. Second, the classification boundaries for fixed X_5 are highly nonlinear, as shown in Figure 1. This design demonstrates that our method can detect nonlinear dependence between the response and the predictors.
Table 2 shows the number of simulations (out of 100 trials from model (4.1)) in which each of the predictors X_1, ..., X_5 is correctly kept in the model and, in column 6, the average number of simulations in which the d − 5 irrelevant predictors are falsely selected. In the simulations, n = 400 and d = 10, 50, and 100 are used. We compare CATCH with other methods, including Marginal CATCH, the Random Forest classifier (RFC) (Breiman [1]), and GUIDE (Loh [12]). For Marginal CATCH, we test the association between Y and X_l without conditioning on X_{−l}. RFC is well known as a powerful tool for prediction; here we use importance scores from RFC for variable selection. We create an additional 30 noise variables and use their average importance score multiplied by a factor of 2 as a threshold (Doksum et al. [3]) to decide whether a variable should be selected into the model. In particular, we generate the noise variables using uniform and Bernoulli distributions for numerical and categorical predictors, respectively.
The main goal of GUIDE is to solve the variable selection bias problem in regression and classification tree methods. As an illustration, if a categorical predictor takes K distinct values, then the splitting test exhausts 2^(K−1) − 1 possible binary splits, and the predictor is therefore biased toward being selected. GUIDE alleviates this bias by employing lack-of-fit tests, i.e., χ²-tests applicable to both numerical and categorical predictors, with the p-values adjusted by a bootstrap procedure. Since GUIDE has negligible selection bias, it can include tests for local pairwise interactions. Furthermore, it achieves sensitivity to local curvature by using a linear model at each node.

Table 2: Simulation results from Example 4.1 with sample size n = 400. For 1 ≤ i ≤ 5, the ith column gives the estimated probability of selection of the relevant variable X_i and the standard error, based on 100 simulations. Column 6 gives the mean of the estimated probability of selection and the standard error for the irrelevant variables X_6, ..., X_d.

Number of Selections (100 simulations)
METHOD           d     X1          X2          X3          X4          X5          Xi, i ≥ 6
CATCH            10    0.82(0.04)  0.75(0.04)  0.78(0.04)  0.82(0.04)  1(0)        0.05(0.02)
                 50    0.76(0.04)  0.76(0.04)  0.87(0.03)  0.84(0.04)  0.96(0.02)  0.08(0.03)
                 100   0.77(0.04)  0.73(0.04)  0.82(0.04)  0.8(0.04)   0.83(0.04)  0.09(0.03)
Marginal CATCH   10    0.77(0.04)  0.67(0.05)  0.7(0.05)   0.8(0.04)   0.06(0.02)  0.1(0.03)
                 50    0.69(0.05)  0.7(0.05)   0.76(0.04)  0.79(0.04)  0.07(0.03)  0.1(0.03)
                 100   0.78(0.04)  0.76(0.04)  0.73(0.04)  0.73(0.04)  0.12(0.03)  0.1(0.03)
Random Forest    10    0.71(0.05)  0.51(0.05)  0.47(0.05)  0.59(0.05)  0.37(0.05)  0.06(0.02)
                 50    0.64(0.05)  0.61(0.05)  0.65(0.05)  0.58(0.05)  0.28(0.04)  0.08(0.03)
                 100   0.66(0.05)  0.58(0.05)  0.59(0.05)  0.65(0.05)  0.17(0.04)  0.11(0.03)
GUIDE            10    0.96(0.02)  0.98(0.01)  0.93(0.03)  0.99(0.01)  0.92(0.03)  0.33(0.05)
                 50    0.82(0.04)  0.85(0.04)  0.9(0.03)   0.89(0.03)  0.67(0.05)  0.12(0.03)
                 100   0.83(0.04)  0.84(0.04)  0.86(0.03)  0.83(0.04)  0.61(0.05)  0.08(0.03)

Figure 1: The relationship between Y and X_1, X_2 for model (4.1). The four areas represent the four values of Y.
The results in Table 2 show that Marginal CATCH does a very poor job of selecting X_5, which illustrates the importance of local conditioning. RFC does better than Marginal CATCH but also has difficulties, while CATCH and GUIDE successfully identify X_5 along with X_1 to X_4 with high frequency. When d = 10, GUIDE does very well for the relevant variables but selects too many irrelevant variables. Surprisingly, GUIDE selects fewer irrelevant variables as d increases; it becomes more cautious as the dimension grows. CATCH performs very well in the high-dimensional cases (d = 50 and d = 100), indicating that the CATCH algorithm is robust to the intricate interaction structures between the predictors.
In model (4.1) above, the predictors interact in a hierarchical way. When X_5 = 0, the response is related to X_1 and X_2 as shown in Figure 1, where the two axes are the two relevant predictors and the four areas partitioned by the curves correspond to the different values of Y. When X_5 = 1, the response depends on X_3 and X_4 in the same way. This type of hierarchical interaction between categorical and numerical predictors is difficult to detect even by state-of-the-art classification methods such as the support vector machine (SVM) and RFC. In particular, RFC produces an importance score vector that identifies the four numerical variables X_1, ..., X_4 as very significant but often leaves the categorical variable X_5 unidentified, as we see in Table 2.
The CATCH algorithm, however, performs very well for this complicated model. We now explain why CATCH is able to identify X_5 as an important variable. To explore the association between X_5 and Y, we construct M contingency tables with 2 rows and 4 columns based on M randomly centered tubes, calculate the contingency efficacy scores, and then average the scores. Figure 2(a) and (b) illustrate how one of the contingency efficacy scores is calculated. 1000 data points are generated from model (4.1) and split into two sets according to x_5: (a) shows the approximately half of the points with x_5 = 0 in the (x_1, x_2) dimensions, and (b) shows the remaining points with x_5 = 1 in the (x_3, x_4) dimensions. The data points are colored according to Y; this representation lets us clearly see the classification boundaries (grey lines). A randomly selected tube center with x_5 = 0 is shown as a star in (a). The data points within the tube, i.e., close to the tube center in terms of X_{−5}, are highlighted by larger dots. Since the distance defining the tube does not involve X_5, the larger dots have x_5 equal to 0 or 1 and are present in both (a) and (b). The counts of the larger dots of different colors in (a) and (b) correspond to the cell counts of the two rows x_5 = 0 and x_5 = 1 in the contingency table. As we can see, the larger dots in (a) are mostly red and the larger dots in (b) are mostly blue and orange, which shows that the distribution of Y depends on the value of X_5, so the contingency efficacy score is high. The contingency efficacy score would be low only if the tube center had similar values in (x_1, x_2) and (x_3, x_4); since the M tube centers are randomly chosen, very few of them are of this kind. Thus the average of the contingency efficacy scores is high, and X_5 is identified as an important variable.
To contrast this example with a simpler data-generating model, suppose the effects of the predictors are additive, that is, the categorical variable is related to the response as a term added to the effect of the numerical predictors. For this type of simpler model, the RFC and GUIDE algorithms can efficiently identify the relevant predictors, but so can the CATCH algorithm. We do not include the simulation results for additive models here.
Remark 4.1. We observe that the CATCH algorithm is not very sensitive to the choice of the number of tube centers M or the tube size m = [n^a], where 0 < a ≤ 1 and [·] is the greatest integer function. For Example 4.1 we perform the simulation using M = n and M = n/2, with a = 0.1, 0.15, 0.2, 0.3, 0.5. The results are summarized in Table 3. The CATCH algorithm performs similarly for M = n and M = n/2, with M = n having a small winning edge. We generally suggest using M = n if computational resources are of less concern or n is not large (≤ 200). When computational cost is a concern (Remark 3.5), one can reduce the number of tube centers to M = n/2 as a compromise between detection power and computational cost. Table 3 also shows that the tube size works well in the range a = 0.1 to 0.3, while the detection power for X_5 decreases for a = 0.5, where the tube is too big to achieve local conditioning; this observation is consistent with the comparison to Marginal CATCH in Table 2. On the other hand, the tube size should not be so small that detection power is insufficient. We suggest using a = 0.1 or 0.15 for sample sizes n ≥ 400, or m = 50 for smaller sample sizes. In light of these guidelines, we use M = n/2 = 200 and a = 0.15 for Examples 4.1-4.3, and M = n = 200 and m = 50 for Example 4.4.
Table 3: Sensitivity study of CATCH for Example 4.1 (n = 400, d = 50) using different numbers of tube centers M and tube sizes m = [n^a]. For 1 ≤ i ≤ 5, the ith column gives the number of times X_i was selected. Column 6 gives the percentage of times irrelevant variables were selected.

Number of Selections (100 simulations)
M         a      X1   X2   X3   X4   X5   Xi, i ≥ 6 (%)
M = 200   0.10   75   67   77   69   92    7.8
          0.15   76   76   87   84   96    8.2
          0.20   85   84   90   87   93    9.9
          0.30   89   87   95   89   85    9.6
          0.50   89   90   96   93   62    9.9
M = 400   0.10   74   76   82   77   92    7.2
          0.15   85   82   89   87   94    8.6
          0.20   87   88   90   86   96    9.1
          0.30   88   89   91   91   92    9.4
          0.50   89   91   94   93   61   10.2
In Example 41 all predictors are independent In a similar setting we next considerthe case where the predictors are dependent
Example 4.2. We generate standard normal random variables Z_i, i = 1, ..., d, such that cor(Z_1, Z_4) = 0.5, cor(Z_3, Z_6) = 0.9, and the other Z_i's are independent. We set X_i = Φ(Z_i) for i = 1, ..., d, where Φ(·) is the CDF of a standard normal; the X_i are therefore marginally uniformly distributed, as in Example 4.1. The correlation structure of the X_i's is similar to that of the Z_i's. The dependent variable Y is generated in the same way as in Example 4.1.
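The Gaussian-copula construction of Example 4.2 can be sketched as follows; this is illustrative code (the function name is ours), not the paper's simulation script.

```python
import numpy as np
from math import erf, sqrt

def generate_correlated_uniforms(n, d, rng=None):
    """Example 4.2 design: Z ~ N(0, Sigma) with cor(Z1, Z4) = 0.5 and
    cor(Z3, Z6) = 0.9, other coordinates independent; then X_j = Phi(Z_j)
    is marginally Uniform(0, 1) with a similar correlation structure."""
    rng = np.random.default_rng(rng)
    sigma = np.eye(d)
    sigma[0, 3] = sigma[3, 0] = 0.5   # cor(Z1, Z4)
    sigma[2, 5] = sigma[5, 2] = 0.9   # cor(Z3, Z6)
    Z = rng.multivariate_normal(np.zeros(d), sigma, size=n)
    phi = np.vectorize(lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0))))  # standard normal CDF
    return phi(Z)
```

The transformed pairs keep most of the latent correlation: for a Gaussian copula the correlation of the uniforms is (6/π) arcsin(ρ/2), e.g. about 0.48 when ρ = 0.5.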
The results of Example 4.2 are given in Table 4. Comparing with the results in Table 2, both CATCH and GUIDE do well in the presence of dependence between the numerical variables. The explanation is two-fold. First, both methods identify the categorical X_5 as an important predictor most of the time, which makes the identification of the other relevant predictors possible. Second, conditioning on X_5, Y depends on a pair of variables ("X_1 and X_2" or "X_3 and X_4") jointly, and in particular this dependence is nonlinear, which makes both methods less affected by the correlation introduced here.
[Figure 2 appears here: two scatter plots on [0, 1]², panel (a) X_5 = 0 with axes X_1, X_2 and panel (b) X_5 = 1 with axes X_3, X_4.]

Figure 2: Sample scatter plots of (x_1, x_2) and (x_3, x_4) for simulated data using model (4.1). The left panel includes all the pairs (x_1, x_2) with x_5 = 0, while the right panel includes all the pairs (x_3, x_4) with x_5 = 1. The larger dots in the two plots are within the tube around a tube center, shown as the star.
We now explain the latter in detail for both methods. In CATCH, the algorithm adaptively weighs the effects of the predictors being conditioned on by their importance scores when defining the tubes; therefore identifying either of the two relevant variables in a pair helps to identify the other, whereas an irrelevant variable, though strongly associated with one of the relevant variables, is down-weighted iteratively and does not strongly affect the selection of the relevant variable. This explains the slightly decreasing selection probability of X_3 in the presence of the strongly correlated X_6 and, consequently, the overall lower selection of X_3 and X_4 compared to X_1 and X_2, given the correlation between X_1 and X_4. In GUIDE, the difference between the dependent and independent cases is even smaller, because GUIDE is very powerful in detecting local interactions by explicitly including pairwise interactions when selecting splitting variables.
Marginal CATCH and Random Forest do poorly in this case, mainly because X_5 cannot be detected, as explained earlier. More specifically, the dependence between Y and X_4 in the X_5 = 1 case is the same as that between Y and X_2 in the X_5 = 0 case. The linear terms in X_1 and X_2 have opposite signs in the decision function; therefore the marginal effects of X_1 and X_4 are obscured when X_5 is not identified as an important variable.
4.2 Improving the Performance of SVM and RF Classifiers
The criterion used in the CATCH algorithm is not based on classification accuracy. But if we use CATCH to screen out the irrelevant variables and use only the important variables for classification, the performance of other algorithms such as the Support Vector Machine (SVM) and Random Forest Classifier (RFC) can be significantly improved. In this section we compare the prediction performance of (1) SVM, (2) RFC, (3) CATCH followed by SVM, and (4) CATCH followed by RFC; we label (3) and (4) CATCH+SVM and CATCH+RFC, respectively. We use the "svm" function in the "e1071" package in R, with a radial kernel, to perform the SVM analysis, and the "randomForest" function in the "randomForest" package to perform the RFC analysis. We use simulation experiments to illustrate the improvement obtained by using CATCH prior to classifying Y.

Table 4: Simulation results from Example 4.2 with sample size n = 400. For 1 ≤ i ≤ 6, the ith column gives the estimated probability of selection of variable X_i and the standard error, based on 100 simulations, where X_i, i = 1, ..., 5, are relevant variables with cor(X_1, X_4) = 0.5. X_6 is an irrelevant variable correlated with X_3 with correlation 0.9. Column 7 gives the mean of the estimated probability of selection and that of the standard error for the irrelevant variables X_7, ..., X_d.

Number of Selections (100 simulations)
METHOD           d     X1          X2          X3          X4          X5          X6          Xi, i ≥ 7
CATCH            10    0.76(0.04)  0.77(0.04)  0.73(0.04)  0.68(0.05)  0.99(0.01)  0.34(0.05)  0.07(0.02)
                 50    0.78(0.04)  0.77(0.04)  0.74(0.04)  0.63(0.05)  0.92(0.03)  0.57(0.05)  0.09(0.03)
                 100   0.76(0.04)  0.77(0.04)  0.8(0.04)   0.74(0.04)  0.82(0.04)  0.55(0.05)  0.09(0.03)
Marginal CATCH   10    0.35(0.05)  0.63(0.05)  0.8(0.04)   0.32(0.05)  0.09(0.03)  0.71(0.05)  0.12(0.03)
                 50    0.34(0.05)  0.71(0.05)  0.7(0.05)   0.3(0.05)   0.1(0.03)   0.6(0.05)   0.11(0.03)
                 100   0.34(0.05)  0.68(0.05)  0.75(0.04)  0.3(0.05)   0.1(0.03)   0.59(0.05)  0.1(0.03)
Random Forest    10    0.29(0.05)  0.48(0.05)  0.55(0.05)  0.2(0.04)   0.3(0.05)   0.47(0.05)  0.05(0.02)
                 50    0.26(0.04)  0.52(0.05)  0.57(0.05)  0.3(0.05)   0.24(0.04)  0.45(0.05)  0.08(0.03)
                 100   0.36(0.05)  0.61(0.05)  0.66(0.05)  0.37(0.05)  0.32(0.05)  0.46(0.05)  0.12(0.03)
GUIDE            10    0.96(0.02)  0.96(0.02)  0.94(0.02)  0.95(0.02)  0.96(0.02)  0.84(0.04)  0.3(0.05)
                 50    0.8(0.04)   0.92(0.03)  0.88(0.03)  0.79(0.04)  0.72(0.04)  0.68(0.05)  0.12(0.03)
                 100   0.81(0.04)  0.89(0.03)  0.9(0.03)   0.76(0.04)  0.75(0.04)  0.58(0.05)  0.08(0.03)
Let the true model be

(X, Y) ~ P,   (4.2)

where P will be specified later. In each of 100 Monte Carlo experiments, n pairs (X^(i), Y^(i)), i = 1, ..., n, are generated from the true model (4.2). Based on the simulated data (X^(i), Y^(i)), i = 1, ..., n, the methods (1) SVM, (2) RFC, (3) CATCH+SVM, and (4) CATCH+RFC are used to classify a future Y^(0) for which a corresponding X^(0) is available.
The integrated misclassification rate (IMR; Friedman [6]), defined below, is used to evaluate the performance of the classification algorithms. Let F_{X,Y}(·) be the classifier constructed by a classification algorithm based on X = (X^(1), ..., X^(n)) and Y = (Y^(1), ..., Y^(n))^T.

Let (X^(0), Y^(0)) be a future observation with the same distribution as (X^(i), Y^(i)). We observe only X^(0) = x^(0) and want to classify Y^(0). For a possible future observation (x^(0), y^(0)), define

MR(X, Y; x^(0), y^(0)) = P(F_{X,Y}(x^(0)) ≠ y^(0) | X, Y).   (4.3)

Our criterion is

IMR = E E_0 [MR(X, Y; x^(0), y^(0))],   (4.4)

where E_0 is taken with respect to the distribution of the random (x^(0), y^(0)) and E with respect to the distribution of (X, Y).
This double expectation is evaluated using Monte Carlo simulation; the result is denoted IMR*. More precisely, E_0 is estimated using 5000 simulated (x^(0), y^(0)) observations, and IMR* is then computed on the basis of 100 simulated samples (X_i, Y_i), i = 1, ..., 100.
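The double expectation above can be approximated by nested Monte Carlo loops: an outer loop over training samples (estimating E) and an inner loop over future observations (estimating E_0). Below is a minimal Python sketch under a toy univariate model, with a simple majority-vote rule standing in for the actual SVM/RFC fits; the model, classifier, and all function names are our own illustration, not the authors' code:

```python
import random

random.seed(0)

def draw_pair():
    # Toy model: X ~ Uniform(0, 1), Y = 1 with probability x, else 0.
    x = random.random()
    y = 1 if random.random() < x else 0
    return x, y

def fit_classifier(train):
    # Stand-in for an SVM/RFC fit: predict the majority label among
    # training points on the same side of x = 0.5.
    def side_majority(side):
        labels = [y for x, y in train if (x > 0.5) == side]
        return 1 if sum(labels) * 2 >= len(labels) else 0
    left, right = side_majority(False), side_majority(True)
    return lambda x: right if x > 0.5 else left

def imr_star(n_train=50, n_outer=100, n_inner=5000):
    # Outer loop: expectation E over training samples (X, Y);
    # inner loop: expectation E0 over future observations (x0, y0).
    total = 0.0
    for _ in range(n_outer):
        clf = fit_classifier([draw_pair() for _ in range(n_train)])
        errs = sum(clf(x0) != y0
                   for x0, y0 in (draw_pair() for _ in range(n_inner)))
        total += errs / n_inner
    return total / n_outer

print(round(imr_star(), 3))
```

With the toy model above, the printed value is close to the Bayes-type error of the side-majority rule (about 0.25); the structure, not the number, is the point.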
Example 4.3. First consider model (4.1) in Section 4.1. Figure 3 shows IMR* versus the number of irrelevant variables for this example. The left panel shows that the IMR* of the SVM approach increases dramatically as the number of irrelevant variables increases, while the IMR* of CATCH+SVM is stable. Thus CATCH stabilizes the performance of SVM for this complex model. The right panel shows that CATCH also stabilizes the performance of RFC. These results indicate that CATCH can improve the prediction accuracy of other classification algorithms.
Figure 3: IMR* of SVM, RFC, CATCH+SVM, and CATCH+RFC versus the number of predictors for the simulation study of model (4.1) in Example 4.1. The sample size is n = 400.
Example 4.4 (Contaminated Logistic Model). Consider a binary response Y ∈ {0, 1} with

P(Y = 0 | X = x) = F(x) if r = 0;  G(x) if r = 1,   (4.5)

where x = (x_1, ..., x_d),

r ~ Bernoulli(γ),   (4.6)

F(x) = (1 + exp{−(0.25 + 0.5 x_1 + x_2)})^{−1},   (4.7)

G(x) = 0.1 if 0.25 + 0.5 x_1 + x_2 < 0;  0.9 if 0.25 + 0.5 x_1 + x_2 ≥ 0,   (4.8)

and x_1, x_2, ..., x_d are realizations of independent standard normal random variables. Hence, given X^(i) = x^(i), 1 ≤ i ≤ n, the joint distribution of Y^(1), Y^(2), ..., Y^(n) is the contaminated logistic model with contamination parameter γ:
(1 − γ) ∏_{i=1}^n [F(x^(i))]^{Y^(i)} [1 − F(x^(i))]^{1−Y^(i)} + γ ∏_{i=1}^n [G(x^(i))]^{Y^(i)} [1 − G(x^(i))]^{1−Y^(i)}.   (4.9)
We perform a simulation study for n = 200, d = 50, and γ = 0, 0.2, 0.4, 0.6, 0.8, and 1, where γ = 0 corresponds to no contamination. Figure 4 shows IMR* versus the different values of γ for this example. CATCH improves the performance of the SVM and Random Forest classifiers under this contaminated logistic regression model.
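The contaminated logistic mechanism above is easy to simulate directly. A sketch (the function name is ours; note that the contamination indicator r is drawn once per sample, matching the mixture form of the likelihood, which mixes the two product terms rather than individual observations):

```python
import math
import random

random.seed(1)

def contaminated_logistic_sample(n, d, gamma):
    """Draw n pairs (x, y) from the contaminated logistic model."""
    r = 1 if random.random() < gamma else 0  # contamination indicator
    data = []
    for _ in range(n):
        x = [random.gauss(0.0, 1.0) for _ in range(d)]
        eta = 0.25 + 0.5 * x[0] + x[1]
        if r == 0:
            p0 = 1.0 / (1.0 + math.exp(-eta))  # logistic F(x)
        else:
            p0 = 0.1 if eta < 0 else 0.9       # contaminating G(x)
        y = 0 if random.random() < p0 else 1   # P(Y = 0 | x) = p0
        data.append((x, y))
    return data

data = contaminated_logistic_sample(n=200, d=50, gamma=0.2)
print(len(data), len(data[0][0]))  # prints "200 50"
```

Only x_1 and x_2 carry signal; the remaining d − 2 coordinates are irrelevant, which is what the variable selection step is meant to exploit.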
Figure 4: IMR* of SVM, RFC, CATCH+SVM, and CATCH+RFC versus the contamination parameter γ for simulation model (4.9) in Example 4.4.
5 CATCH Analysis of The ANN Data
In this section CATCH is applied to a data set available at the UCI database (http://ftp.ics.uci.edu/pub/machine-learning-databases/thyroid-disease). The data are from a clinical trial to determine whether a patient is hypothyroid (see, e.g., Quinlan [15]). The dataset is split into two parts: a training set with 3772 observations and a test set with 3428 observations. The response is a categorical variable with three classes: normal (not hypothyroid), hyperfunction, and subnormal functioning. There are 21 predictors: 15 categorical and 6 numerical. The three classes are very imbalanced, with 92.5% of patients in the normal class. A good classifier should produce a significant decrease from the 7.5% error rate that can be obtained by classifying all observations to the normal class.
CATCH identifies 7 variables as important: 4 continuous and 3 categorical. We consider three classification methods: 1. Support Vector Machine with radial kernel, denoted SVM(r); 2. Support Vector Machine with polynomial kernel, denoted SVM(p); 3. RFC. We apply these three methods to the training set to build classification models and use the test set to estimate their misclassification rates. We apply the methods with and without CATCH, where "with CATCH" means that only the 7 important variables identified by CATCH are used to build the model, and "without CATCH" means that all 21 variables are used in the analysis.
Table 5: Misclassification rates of three classification methods (SVM(r), SVM(p), RFC) with CATCH and without CATCH.

              SVM(r)   SVM(p)   RFC
w/o CATCH     0.052    0.063    0.024
with CATCH    0.037    0.055    0.015
Table 5 shows that if CATCH is used to select important variables before the classification methods are applied, the misclassification rates decrease. CATCH reduces the misclassification rates by 29%, 13%, and 37% for SVM(r), SVM(p), and RFC, respectively; CATCH+RFC has the best performance. Although CATCH is not designed to improve classification accuracy, it nonetheless helps other classification methods perform better.
6 Properties of Local Contingency Efficacy
In this section we investigate the properties of the local contingency efficacy. For a univariate continuous predictor, Theorem 6.1 below shows that the local contingency efficacy is well defined and equals 0 if the response variable is independent of the predictor. Consider an R × C contingency table. Let n_rc, r = 1, ..., R, c = 1, ..., C, be the observed frequency in the (r, c) cell, and let p_rc be the probability that an observation belongs to the (r, c) cell. Let

p_r = Σ_{c=1}^C p_rc,  q_c = Σ_{r=1}^R p_rc,   (6.1)
and

n_{r·} = Σ_{c=1}^C n_rc,  n_{·c} = Σ_{r=1}^R n_rc,  n = Σ_{r=1}^R Σ_{c=1}^C n_rc,  E_rc = n_{r·} n_{·c}/n.   (6.2)
For testing whether the rows and the columns are independent, i.e.,

H_0: p_rc = p_r q_c for all r, c  versus  H_1: p_rc ≠ p_r q_c for some r, c,   (6.3)

a standard test statistic is

X² = Σ_{r=1}^R Σ_{c=1}^C (n_rc − E_rc)² / E_rc.   (6.4)
Under H_0, X² converges in distribution to χ²_{(R−1)(C−1)}. Moreover, with the convention 0/0 = 0, we have the following well-known result.

Lemma 6.1. As n tends to infinity, X²/n tends in probability to

τ² ≡ Σ_{r=1}^R Σ_{c=1}^C (p_rc − p_r q_c)² / (p_r q_c).

Moreover, if the p_r q_c are bounded away from 0 and 1 as n → ∞, and if R and C are fixed as n → ∞, then τ² = 0 if and only if H_0 is true.

Remark 6.1. Lemma 6.1 shows that the contingency efficacy ζ in (2.8) is well defined. Moreover, ζ = 0 if and only if X and Y are independent.
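The chi-square statistic and the limit τ² in the lemma above can be computed directly from a table of counts or of cell probabilities; a minimal sketch (function names are ours):

```python
def chi_square_stat(n_table):
    """Pearson X^2 for an R x C table of counts n_rc."""
    R, C = len(n_table), len(n_table[0])
    row = [sum(n_table[r]) for r in range(R)]                       # n_{r.}
    col = [sum(n_table[r][c] for r in range(R)) for c in range(C)]  # n_{.c}
    n = sum(row)
    x2 = 0.0
    for r in range(R):
        for c in range(C):
            e = row[r] * col[c] / n  # expected count E_rc = n_{r.} n_{.c} / n
            if e > 0:                # 0/0 convention: skip empty margins
                x2 += (n_table[r][c] - e) ** 2 / e
    return x2, n

def tau_squared(p_table):
    """Population limit tau^2 of X^2/n, from cell probabilities p_rc."""
    R, C = len(p_table), len(p_table[0])
    p_r = [sum(p_table[r]) for r in range(R)]
    q_c = [sum(p_table[r][c] for r in range(R)) for c in range(C)]
    return sum((p_table[r][c] - p_r[r] * q_c[c]) ** 2 / (p_r[r] * q_c[c])
               for r in range(R) for c in range(C))

# Under row-column independence, tau^2 = 0:
print(tau_squared([[0.25, 0.25], [0.25, 0.25]]))  # prints 0.0
```

Dividing the first function's X² by n and letting the table grow gives a consistent estimate of τ², which is the content of the lemma.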
Theorem 6.1. Consider a categorical response Y and a continuous predictor X with density function f(x) such that, for a fixed h > 0, f(x) > 0 for x ∈ (x^(0) − h, x^(0) + h). Then:

(1) For fixed h, the estimated local section efficacy ζ̂(x^(0), h) defined in (2.5) converges almost surely to the section efficacy ζ(x^(0), h) defined in (2.4).

(2) If X and Y are independent, then the local efficacy ζ(x^(0)) defined in (2.4) is zero.

(3) If there exists c ∈ {1, ..., C} such that p_c(x) = P(Y = c | x) has a continuous derivative at x = x^(0) with p′_c(x^(0)) ≠ 0, then ζ(x^(0)) > 0.
The χ² statistic (6.4) can be written in terms of multinomial proportions p̂_rc, with √n(p̂_rc − p_rc), 1 ≤ r ≤ R, 1 ≤ c ≤ C, converging in distribution to a multivariate normal. By a Taylor expansion of T_n ≡ √(X²/n) at τ = plim √(X²/n), we find:
Theorem 6.2.

√n (T_n − τ) →_d N(0, v²),   (6.5)

where the asymptotic variance v², which can be obtained from the Taylor expansion, is not needed in this paper.
Remark 6.2. The result (6.5) applies to the multivariate χ²-based efficacies ζ̂, with n replaced by the number of observations that go into the computation of ζ̂. In this case we use the notation χ²_j for the chi-square statistic of the jth variable.
7 Consistency Of CATCH
Let Y denote a variable to be classified into one of the categories 1, ..., C using the X_j's in the vector X = (X_1, ..., X_d). Let X_−j ≡ X − {X_j} denote X without the X_j component. Assume that the squared Pearson nonparametric correlation (Doksum and Samarov [4], Huang and Chen [9]) between X_j and X_−j is less than 0.5. We call the variable X_j relevant for Y if, conditionally given X_−j, L(Y | X_j = x) is not constant in x, and irrelevant otherwise. Our data (X^(1), Y^(1)), ..., (X^(n), Y^(n)) are i.i.d. as (X, Y). A variable selection procedure is consistent if the probability that all the relevant variables are selected and none of the irrelevant variables are selected tends to one as n tends to infinity.

We give a general nonparametric framework in which the variable selection rule based on the CATCH importance scores s_j given in (3.8) is consistent.

When the predictors are not all categorical and CATCH involves tubes, we assume that the number of observations in each tube tends to ∞ as n → ∞, and that for those importance scores that require the selection of a section size h, the selection is from a finite set {h_j} with min_j{h_j} > h_0 for some h_0 > 0. It follows that the number of observations m_jk in the kth tube used to calculate the importance score s_j for the variable X_j tends to infinity as n → ∞.
The key quantities that determine consistency are the limits λ_j of s_j. That is,

λ_j = τ(ζ_j, P) = E_P[ζ_j(X)] = ∫ ζ_j(x) dP(x),  j = 1, ..., d,

where ζ_j(x) is the local contingency efficacy defined by (2.4) for numerical X's and by (2.8) for categorical X's, and P is the probability distribution of X. Our estimate s_j of λ_j is
s_j = τ(ζ̂_j, P̂) = (1/M) Σ_{k=1}^M ζ̂_j(X*_k),

where ζ̂_j is defined in Section 3 and P̂ is the empirical distribution of X.

Let d_0 denote the number of X's that are irrelevant for Y. Set d_1 = d − d_0 and, without loss of generality, reorder the λ_j so that
λ_j > 0 if j = 1, ..., d_1;  λ_j = 0 if j = d_1 + 1, ..., d.   (7.1)
Our variable selection rule is

"keep variable X_j if s_j ≥ t," for some threshold t.   (7.2)
Rule (7.2) is consistent if and only if

P(min_{j≤d_1} s_j ≥ t and max_{j>d_1} s_j < t) → 1.   (7.3)
To establish (7.3) it is enough to show that

P(max_{j>d_1} s_j > t) → 0   (7.4)

and

P(min_{j≤d_1} s_j ≤ t) → 0.   (7.5)
We call (7.4) and (7.5) type I and type II consistency, respectively. The type I and type II error probabilities refer to the probabilities of any false positives and any false negatives, respectively.
7.1 Type I consistency
We consider (7.4) first and note that

P(max_{j>d_1} s_j > t) ≤ Σ_{j=d_1+1}^d P(s_j > t).   (7.6)
For the jth irrelevant variable X_j, j > d_1, the results in Section 6 apply to the local efficacy ζ̂_j(x) at a fixed point x. The importance score s_j defined in (3.8) is the average of such efficacies over a sample of M random points x*_a. We assume that there is a point x^(0) such that, for irrelevant X_j's and for n large enough, s_j^(0) ≡ ζ̂_j(x^(0)) is stochastically larger in the right tail than the average M^{−1} Σ_{a=1}^M ζ̂_j(x*_a). That is, we assume that there exist t > 0 and n_0 such that for n ≥ n_0,

P(s_j > t) ≤ P(s_j^(0) > t),  j > d_1.   (7.7)

The existence of such a point x^(0) is established by considering the probability limit of

argmax{ζ̂_j(x*_a) : x*_a ∈ {x*_1, ..., x*_M}}.
Note that by Lemma 6.1, s_j^(0) →_p 0 as n → ∞. It follows that we have established (7.4) for t and d_0 = d − d_1 that are fixed as n → ∞. To examine the interesting case where d_0 → ∞ as n → ∞, we need large deviation results, to which we turn next. In the d_0 → ∞ case we consider thresholds t ≡ t_n that tend to ∞ as n → ∞, and we use the weaker assumption that there exist x^(0) and n_0 such that

P(s_j > t_n) ≤ P(s_j^(0) > t_n)   (7.8)

for n ≥ n_0 and each t_n with lim_{n→∞} t_n = ∞. We also assume that the number of columns C and rows R in the contingency table that defines the efficacy χ²_jn given by (6.4) stays fixed as n → ∞. In this case P(s_j^(0) > t_n) → 0 provided that each term

[T_rc^(j)]² ≡ [(n_rc − E_rc)/√E_rc]²

in n s_j² = χ²_jn, defined for X_j by (6.4), satisfies

P(|T_rc^(j)| > t_n) → 0.
This is because

P(√(Σ_r Σ_c [T_rc^(j)]²) > t_n) ≤ P(RC max_{r,c} [T_rc^(j)]² > t_n²) ≤ RC max_{r,c} P(|T_rc^(j)| > t_n/√(RC)),

and t_n and t_n/√(RC) are of the same order as n → ∞. By large deviation theory (Hall [8], Jing et al. [10]), for some constant A,

P(|T_rc^(j)| > t_n) = (2 t_n^{−1}/√(2π)) e^{−t_n²/2} [1 + A t_n³ n^{−1/2} + o(t_n³ n^{−1/2})],   (7.9)

provided t_n = o(n^{1/6}). If (7.9) is uniform in j > d_1, then (7.6) and (7.9) imply that type I consistency holds if

d_0 (t_n √2)^{−1} e^{−t_n²/2} → 0.   (7.10)
To examine (7.10), write d_0 and t_n as d_0 = exp(n^b) and t_n = √2 n^r. By large deviation theory, (7.10) holds when r < 1/6. Thus if 0 < b/2 < r < 1/6, then

d_0 (t_n √2)^{−1} e^{−t_n²/2} = n^{−r} exp(n^b − n^{2r}) → 0.
Thus d_0 = exp(n^b) with b = 1/3 − ε, for any ε > 0, is essentially the largest d_0 can be while retaining consistency. For instance, if there are exactly d_0 = exp(n^{1/4}) irrelevant variables and we use the threshold t_n = √2 n^{1/7}, then the probability of selecting any of the irrelevant variables tends to zero at the rate n^{−1/7} exp(n^{1/4} − n^{2/7}).

7.2 Type II consistency
We now assume that for each relevant variable X_j there is a least favorable point x^(1) such that the local efficacy s_j^(1) ≡ ζ̂_j(x^(1)) is stochastically smaller than the importance score s_j = M^{−1} Σ_{a=1}^M ζ̂_j(x*_a). That is, we assume that for each relevant variable X_j there exist n_1 and x^(1) such that for n ≥ n_1,

P(s_j^(1) ≤ t_n) ≥ P(s_j ≤ t_n),  j ≤ d_1.   (7.11)

The existence of such a least favorable x^(1) is established by considering the probability limit of

argmin{ζ̂_j(x*_a) : x*_a ∈ {x*_1, ..., x*_M}}.
We again assume that R and C are fixed, so that to show type II consistency it is enough to show that, for each (r, c) and j ≤ d_1,

P(|T_rc^(j)| ≤ t_n) → 0.
This is because

P(√(Σ_r Σ_c [T_rc^(j)]²) ≤ t_n) ≤ P(min_{r,c} [T_rc^(j)]² ≤ t_n²) ≤ RC max_{r,c} P(|T_rc^(j)| ≤ t_n).

By Theorem 6.2, √n(s_j^(1) − τ_j) →_d N(0, ν²) with τ_j > 0. It follows that if t_n and d_1 are bounded above as n → ∞ and τ_j > 0, then P(s_j^(1) ≤ t_n) → 0; thus type II consistency holds in this case. To examine the more interesting case where t_n and d_1 are not bounded above and τ_j may tend to zero as n → ∞, we need to center the T_rc^(j). Thus set

θ_n^(j) = √n (p_rc^(j) − p_r^(j) p_c^(j)) / √(p_r^(j) p_c^(j)),   (7.12)

T_n^(j) = T_rc^(j) − θ_n^(j),

where for simplicity the dependence of T_n^(j) and θ_n^(j) on r and c is not indicated.
Lemma 7.1.

P(min_{j≤d_1} |T_rc^(j)| ≤ t_n) ≤ Σ_{j=1}^{d_1} P(|T_n^(j)| ≥ min_{j≤d_1} |θ_n^(j)| − t_n).

Proof.

P(min_{j≤d_1} |T_rc^(j)| ≤ t_n) = P(min_{j≤d_1} |T_n^(j) + θ_n^(j)| ≤ t_n)
≤ P(min_{j≤d_1} (|θ_n^(j)| − |T_n^(j)|) ≤ t_n)
≤ P(max_{j≤d_1} |T_n^(j)| ≥ min_{j≤d_1} |θ_n^(j)| − t_n)
≤ Σ_{j=1}^{d_1} P(|T_n^(j)| ≥ min_{j≤d_1} |θ_n^(j)| − t_n).  ∎
Set

a_n = min_{j≤d_1} |θ_n^(j)| − t_n;

then by large deviation theory, if a_n = o(n^{1/6}),

P(|T_n^(j)| ≥ a_n) ∝ (a_n √2)^{−1} exp{−a_n²/2},   (7.13)

where ∝ signifies asymptotic order as n → ∞, a_n → ∞. It follows that if (7.13) holds uniformly in j, then by Lemma 7.1,

P(min_{j≤d_1} |T_rc^(j)| ≤ t_n) ∝ d_1 (a_n √2)^{−1} exp{−a_n²/2}.
Thus type II consistency holds when a_n = o(n^{1/6}) and

d_1 (a_n √2)^{−1} exp{−a_n²/2} → 0.   (7.14)

In Section 7.1, t_n is of order n^r, 0 < r < 1/6. For this t_n, type II consistency holds if a_n is of order n^r, 0 < r < 1/6, and

d_1 n^{−r} exp{−n^{2r}/2} → 0.
If a_n = O(n^k) with k ≥ 1/6, then P(|T_n^(j)| ≥ a_n) tends to zero at a rate faster than in (7.13), and consistency is ensured. For instance, if all relevant variables X_j have p_rc^(j) fixed as n → ∞, then min_{j≤d_1} |θ_n^(j)| is of order √n and a_n is of order n^{1/2} − t_n.
7.3 Type I and II consistency
We see from Sections 7.1 and 7.2 that consistency holds under a variety of conditions. One set of conditions is:

Theorem 7.1. Under the regularity conditions of Sections 7.1 and 7.2, if

d_0 = c_0 exp(n^b),  t_n = c_1 n^r,  min_{j≤d_1} |θ_n^(j)| − t_n = c_2 n^r,  d_1 = c_3 exp(n^b)

for positive constants c_0, c_1, c_2, and c_3, and 0 < b/2 < r < 1/6, then CATCH is consistent.
Appendix: Proof of Theorem 6.1

(1) Let N_h = Σ_{i=1}^n I(X^(i) ∈ N_h(x^(0))). Then N_h ~ Binomial(n, ∫_{x^(0)−h}^{x^(0)+h} f(x) dx), and N_h →_{a.s.} ∞ as n → ∞. Moreover,

ζ̂(x^(0), h) = n^{−1/2} √(X²(x^(0), h)) = (N_h/n)^{1/2} (N_h)^{−1/2} √(X²(x^(0), h)).

Since N_h ~ Binomial(n, ∫_{x^(0)−h}^{x^(0)+h} f(x) dx), we have

N_h/n → ∫_{x^(0)−h}^{x^(0)+h} f(x) dx.   (7.15)

By Lemma 6.1 and Table 1,

(N_h)^{−1/2} √(X²(x^(0), h)) → (Σ_{r=+,−} Σ_{c=1}^C (p_rc − p_r q_c)²/(p_r q_c))^{1/2},   (7.16)

where

p_{+c} = Pr(X > x^(0), Y = c | X ∈ N_h(x^(0))),  p_{−c} = Pr(X ≤ x^(0), Y = c | X ∈ N_h(x^(0)))

for r = +, − and c = 1, ..., C, and

p_r = Σ_{c=1}^C p_rc,  q_c = Σ_{r=+,−} p_rc.

In fact, p_+ = Pr(X > x^(0) | X ∈ N_h(x^(0))), p_− = Pr(X ≤ x^(0) | X ∈ N_h(x^(0))), and q_c = Pr(Y = c | X ∈ N_h(x^(0))).

By (7.15) and (7.16),

ζ̂(x^(0), h) → (∫_{x^(0)−h}^{x^(0)+h} f(x) dx)^{1/2} (Σ_{r=+,−} Σ_{c=1}^C (p_rc − p_r q_c)²/(p_r q_c))^{1/2} ≡ ζ(x^(0), h).

(2) If X and Y are independent, then p_rc = p_r q_c for r = +, − and c = 1, ..., C, so ζ(x^(0), h) = 0 for every h. Hence

ζ(x^(0)) = sup_{h>0} ζ(x^(0), h) = 0.

(3) Recall that p_c(x^(0)) = Pr(Y = c | x^(0)). Without loss of generality assume p′_c(x^(0)) > 0. Then there exists h such that

Pr(Y = c | x^(0) < X < x^(0) + h) > Pr(Y = c | x^(0) − h < X < x^(0)),

which is equivalent to p_{+c}/p_+ > p_{−c}/p_−. This shows that ζ(x^(0)) > 0, since by Lemma 6.1, ζ(x^(0)) = 0 would imply p_{+c}/p_+ = p_{−c}/p_− = p_c.
References

[1] Breiman, L., 2001. Random forests. Machine Learning 45, 5–32.

[2] Breiman, L., Friedman, J. H., Olshen, R. A., Stone, C. J., 1984. Classification and Regression Trees. Wadsworth, Belmont.

[3] Doksum, K., Tang, S., Tsui, K.-W., 2008. Nonparametric variable selection: The EARTH algorithm. Journal of the American Statistical Association 103(484), 1609–1620.

[4] Doksum, K. A., Samarov, A., 1995. Nonparametric estimation of global functionals and a measure of the explanatory power of covariates in regression. Annals of Statistics 23, 1443–1473.

[5] Doksum, K. A., Schafer, C., 2006. Powerful choices: Tuning parameter selection based on power. In: Frontiers in Statistics: Dedicated to Peter Bickel (J. Fan and H. L. Koul, eds.), Imperial College Press, 113–141.

[6] Friedman, J., 1991. Multivariate adaptive regression splines. The Annals of Statistics 19, 1–67.

[7] Gao, J., Gijbels, I., 2008. Bandwidth selection in nonparametric kernel testing. Journal of the American Statistical Association 103(484), 1584–1594.

[8] Hall, P., 1992. The Bootstrap and Edgeworth Expansion. Springer, New York.

[9] Huang, L.-S., Chen, J., 2008. Analysis of variance, coefficient of determination and F-test for local polynomial regression. Annals of Statistics 36, 2085–2109.

[10] Jing, B.-Y., Shao, Q.-M., Wang, Q., 2003. Self-normalized Cramér-type large deviations for independent random variables. Annals of Probability 31, 2167–2215.

[11] Lin, D. Y., Zeng, D., 2011. Correcting for population stratification in genomewide association studies. Journal of the American Statistical Association 106, 997–1008.

[12] Loh, W.-Y., 2009. Improving the precision of classification trees. The Annals of Applied Statistics 3, 1710–1737.

[13] Price, A., Patterson, N., Plenge, R., Weinblatt, M., Shadick, N., Reich, D., 2006. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38(8), 904–909.

[14] Qiao, Z., Zhou, L., Huang, J., 2008. Linear discriminant analysis for high-dimensional, low-sample-size data. Special issue of World Congress on Engineering 2008.

[15] Quinlan, J. R., 1987. Simplifying decision trees. International Journal of Man-Machine Studies 27(3), 221–234.

[16] Schafer, C., Doksum, K. A., 2009. Selecting local models in multiple regression by maximizing power. Metrika 69, 283–304.

[17] Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267–288.

[18] Wang, L., Shen, X., 2007. On L1-norm multiclass support vector machines: methodology and theory. Journal of the American Statistical Association 102, 583–594.

[19] Zhang, H. H., Liu, Y., Wu, Y., Zhu, J., 2008. Variable selection for the multicategory SVM via adaptive sup-norm regularization. Electronic Journal of Statistics 2, 149–167.
2 Importance Scores for Univariate Classification
Let (X^(i), Y^(i)), i = 1, ..., n, be independent and identically distributed (i.i.d.) as (X, Y) ~ P. We first consider the case of a univariate X, and then use the methods constructed for the univariate case to build the multivariate versions.
2.1 Numerical Predictor: Local Contingency Efficacy
Consider the classification problem with a numerical predictor. Let

Pr(Y = c | x) = p_c(x), c = 1, ..., C,  Σ_{c=1}^C p_c(x) ≡ 1.   (2.1)
We introduce a measure of how strongly X is associated with Y in a neighborhood of a "fixed" point x^(0). Points x^(0) will be selected at random by our algorithm; local importance scores will be computed for each point and then averaged. For a fixed h > 0, define N_h(x^(0)) = {x : 0 < |x − x^(0)| ≤ h} to be a neighborhood of x^(0), and let n(x^(0), h) = Σ_{i=1}^n I(X^(i) ∈ N_h(x^(0))) denote the number of data points in the neighborhood. For c = 1, ..., C, let n_c^−(x^(0), h) be the number of observations (X^(i), Y^(i)) satisfying x^(0) − h ≤ X^(i) < x^(0) and Y^(i) = c, and let n_c^+(x^(0), h) be the number of (X^(i), Y^(i)) satisfying x^(0) < X^(i) ≤ x^(0) + h and Y^(i) = c. Let n_c(x^(0), h) = n_c^+(x^(0), h) + n_c^−(x^(0), h), n^−(x^(0), h) = Σ_{c=1}^C n_c^−(x^(0), h), and n^+(x^(0), h) = Σ_{c=1}^C n_c^+(x^(0), h). Table 1 gives the resulting local contingency table.
Table 1: Local contingency table.

                           Y = 1             ...   Y = C             Total
x^(0) − h ≤ X < x^(0)      n_1^−(x^(0), h)   ...   n_C^−(x^(0), h)   n^−(x^(0), h)
x^(0) < X ≤ x^(0) + h      n_1^+(x^(0), h)   ...   n_C^+(x^(0), h)   n^+(x^(0), h)
Total                      n_1(x^(0), h)     ...   n_C(x^(0), h)     n(x^(0), h)
To measure how strongly Y relates to local restrictions on X, we consider the chi-square statistic

X²(x^(0), h) = Σ_{c=1}^C [ (n_c^−(x^(0), h) − E_c^−(x^(0), h))² / E_c^−(x^(0), h) + (n_c^+(x^(0), h) − E_c^+(x^(0), h))² / E_c^+(x^(0), h) ],   (2.2)
where 0/0 ≡ 0 and

E_c^−(x^(0), h) = n^−(x^(0), h) n_c(x^(0), h) / n(x^(0), h),  E_c^+(x^(0), h) = n^+(x^(0), h) n_c(x^(0), h) / n(x^(0), h).   (2.3)
Due to the local restriction on X, the local contingency table may have some zero cells. However, this is not an issue for the χ² statistic, which can detect local dependence between Y and X as long as there are observations from at least two categories of Y in the neighborhood of x^(0). When all observations belong to the same category of Y, intuitively no dependence can be detected; in this case the χ² statistic equals zero, which matches this intuition.
We call the neighborhood N_h(x^(0)) a section and maximize X²(x^(0), h) in (2.2) with respect to the section size h. Let plim denote limit in probability. Our local measure of association is ζ(x^(0)), as given in:

Definition 1 (Local Contingency Efficacy for a numerical predictor). The local contingency section efficacy and the local contingency efficacy (LCE) of Y on a numerical predictor X at the point x^(0) are

ζ(x^(0), h) = plim_{n→∞} n^{−1/2} √(X²(x^(0), h)),  ζ(x^(0)) = sup_{h>0} ζ(x^(0), h).   (2.4)
If p_c(x) is constant in x for all c ∈ {1, ..., C} for x near x^(0), then as n → ∞, X²(x^(0), h) converges in distribution to a χ²_{C−1} variable, and in this case ζ(x^(0), h) = 0. On the other hand, if p_c(x) is not constant near x^(0), then ζ(x^(0), h) determines the asymptotic power for Pitman alternatives of the test based on X²(x^(0), h) for testing whether X is locally independent of Y, and is called the efficacy of this test. Moreover, ζ²(x^(0), h) is the asymptotic noncentrality parameter of the statistic n^{−1} X²(x^(0), h).
The estimators of ζ(x^(0), h) and ζ(x^(0)) are

ζ̂(x^(0), h) = n^{−1/2} √(X²(x^(0), h)),  ζ̂(x^(0)) = sup_{h>0} ζ̂(x^(0), h).   (2.5)
We select the section size h as

ĥ = argmax{ζ̂(x^(0), h) : h ∈ {h_1, ..., h_g}}

for some grid h_1, ..., h_g. This corresponds to selecting h by maximizing estimated power; see Doksum and Schafer [5], Gao and Gijbels [7], Doksum et al. [3], and Schafer and Doksum [16].
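The local efficacy estimate with the grid search over h can be sketched as follows. This is a schematic in Python, not the authors' code; the helper names are ours, and data is a list of (x, y) pairs with y a category label:

```python
import math

def local_efficacy(data, x0, h_grid):
    """Estimate the local contingency efficacy at x0: the maximum over a
    grid of section sizes h of n^{-1/2} * sqrt of the local chi-square
    statistic for the 2 x C table (left/right half-section by category)."""
    n = len(data)
    best = 0.0
    for h in h_grid:
        # Split the neighborhood N_h(x0) into left and right half-sections.
        left = [y for x, y in data if x0 - h <= x < x0]
        right = [y for x, y in data if x0 < x <= x0 + h]
        cats = sorted(set(left) | set(right))
        m = len(left) + len(right)
        if m == 0 or len(cats) < 2:
            continue  # no dependence detectable in this section
        x2 = 0.0
        for c in cats:
            nc = left.count(c) + right.count(c)   # column total n_c
            for side in (left, right):
                e = len(side) * nc / m            # expected count
                if e > 0:
                    x2 += (side.count(c) - e) ** 2 / e
        best = max(best, math.sqrt(x2 / n))
    return best

# Toy check: Y jumps from 0 to 1 at x = 0, so the efficacy at x0 = 0 is positive.
data = [(x / 100.0 - 0.5, int(x / 100.0 - 0.5 > 0)) for x in range(100)]
print(local_efficacy(data, 0.0, [0.1, 0.2, 0.4]) > 0)  # prints True
```

The grid maximization mirrors the selection of the section size by maximizing estimated power.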
2.2 Categorical Predictor: Contingency Efficacy
Let (X^(i), Y^(i)), i = 1, ..., n, be as before, except that X ∈ {1, ..., C′} is a categorical predictor. Let n(c, c′) be the number of observations satisfying Y = c and X = c′, and let

X² = Σ_{c=1}^C Σ_{c′=1}^{C′} (n(c, c′) − E(c, c′))² / E(c, c′),   (2.6)

where

E(c, c′) = [Σ_{c=1}^C n(c, c′)] [Σ_{c′=1}^{C′} n(c, c′)] / n.   (2.7)
Definition 2 (Contingency Efficacy for a categorical predictor). The contingency efficacy (CE) of Y on a categorical predictor X is

ζ = plim_{n→∞} n^{−1/2} √X².   (2.8)

Under the null hypothesis that Y and X are independent, X² converges in distribution to a χ²_{(C−1)(C′−1)} variable; in this case ζ = 0. In general, ζ² is the asymptotic noncentrality parameter of the statistic n^{−1} X². The contingency efficacy ζ determines the asymptotic power of the test based on X² for testing whether Y and X are independent. We use ζ as an importance score that measures how strongly Y depends on X. The estimator of (2.8) is

ζ̂ = n^{−1/2} √X².   (2.9)
3 Variable Selection for Classification Problems: The CATCH Algorithm
Now consider a multivariate X = (X_1, ..., X_d), d > 1. To examine whether a predictor X_l is related to Y in the presence of all other variables, we calculate efficacies conditional on these variables, as well as unconditional, or marginal, efficacies.
3.1 Numerical Predictors
To examine the effect of X_l in the local region of x^(0) = (x_1^(0), ..., x_d^(0)), we build a "tube" around the center point x^(0), in which data points are within a certain radius δ of x^(0) with respect to all variables except X_l. Define X_−l = (X_1, ..., X_{l−1}, X_{l+1}, ..., X_d)^T to be the complementary vector to X_l. Let D(·, ·) be a distance on R^{d−1}, which we will define later.

Definition 3. A tube T_l of size δ (δ > 0) for the variable X_l at x^(0) is the set

T_l ≡ T_l(x^(0), δ, D) ≡ {x : D(x_−l, x_−l^(0)) ≤ δ}.   (3.1)
We adjust the size of the tube so that the number of points in the tube is a preassigned number k. We find that k ≥ 50 is in general a good choice. Note that k = n corresponds to unconditional, or marginal, nonparametric regression.
Within the tube T_l(x^(0), δ, D), X_l is unconstrained. To measure the dependence of Y on X_l locally, we consider neighborhoods of x^(0) along the direction of X_l within the tube. Let

N_{l,h,δ,D}(x^(0)) = {x = (x_1, ..., x_d)^T ∈ R^d : x ∈ T_l(x^(0), δ, D), |x_l − x_l^(0)| ≤ h}   (3.2)

be a section of the tube T_l(x^(0), δ, D). To measure how strongly X_l relates to Y in N_{l,h,δ,D}(x^(0)), we consider

X²_{l,δ,D}(x^(0), h),   (3.3)

where (3.3) is defined by (2.2) using only the observations in the tube T_l(x^(0), δ, D). Analogous to the definition of local contingency efficacy (2.4), we define:

Definition 4. The local contingency tube section efficacy and the local contingency tube efficacy of Y on X_l at the point x^(0), and their estimates, are

ζ(x^(0), h, l) = plim_{n→∞} n^{−1/2} √(X²_{l,δ,D}(x^(0), h)),  ζ_l(x^(0)) = sup_{h>0} ζ(x^(0), h, l),   (3.4)

ζ̂(x^(0), h, l) = n^{−1/2} √(X²_{l,δ,D}(x^(0), h)),  ζ̂_l(x^(0)) = sup_{h>0} ζ̂(x^(0), h, l).   (3.5)
For an illustration of the local contingency tube efficacy, consider Y ∈ {1, 2} and suppose logit P(Y = 2 | x) = (x − 1/2)², x ∈ [0, 1]. Then the local efficacy with h small will be large, while if h ≥ 1 the efficacy will be zero.
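A tube and its sections can be realized by first keeping the points closest to x^(0) in the complementary coordinates and then windowing along X_l. A schematic Python sketch (function names are ours; the distance is an unweighted L1 distance, a simplification of the adaptive distance defined later):

```python
def tube(points, center, l, k):
    """Keep the k points closest to `center` in all coordinates except l,
    i.e. a tube for variable l whose size is tuned to contain k points."""
    def d_minus_l(x):
        return sum(abs(xj - cj) for j, (xj, cj) in enumerate(zip(x, center))
                   if j != l)
    return sorted(points, key=d_minus_l)[:k]

def tube_section(points, center, l, k, h):
    """Section of the tube: tube points with |x_l - x0_l| <= h."""
    return [x for x in tube(points, center, l, k)
            if abs(x[l] - center[l]) <= h]

# On a 10 x 10 grid, the tube for l = 0 at (0.5, 0.5) constrains only
# the second coordinate; its sections then window the first coordinate.
pts = [(i / 10.0, j / 10.0) for i in range(10) for j in range(10)]
sec = tube_section(pts, center=(0.5, 0.5), l=0, k=20, h=0.2)
print(all(abs(x[1] - 0.5) <= 0.1 for x in sec))  # prints True
```

Within the section, the local chi-square statistic of the previous section is computed using X_l against Y, which is exactly the conditional-efficacy construction.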
3.2 Numerical and Categorical Predictors

If X_l is categorical but some of the other predictors are numerical, we construct tubes T_l(x^(0), δ, D) based on the numerical predictors, leaving the categorical variables unconstrained, and we measure the strength of the association between Y and X_l with the statistic defined by (2.6), computed using only the observations in the tube T_l(x^(0), δ, D). We denote this version of (2.6) by

X²_{l,δ,D}(x^(0)).   (3.6)

Analogous to the definition of contingency efficacy given in (2.8), we define:

Definition 5. The local contingency tube efficacy of Y on X_l at x^(0), and its estimate, are

ζ_l(x^(0)) = plim_{n→∞} n^{−1/2} √(X²_{l,δ,D}(x^(0))),  ζ̂_l(x^(0)) = n^{−1/2} √(X²_{l,δ,D}(x^(0))).   (3.7)
3.3 The Empirical Importance Scores

The empirical efficacies (3.5) and (3.7) measure how strongly Y depends on X_l near one point x^(0). To measure the overall dependence of Y on X_l, we randomly sample M bootstrap points x*_a, a = 1, ..., M, with replacement from {x^(i), 1 ≤ i ≤ n} and calculate the average of the local tube efficacy estimates at these points.

Definition 6. The CATCH empirical importance score for variable l is

s_l = M^{−1} Σ_{a=1}^M ζ̂_l(x*_a),   (3.8)

where ζ̂_l(x*_a) is defined in (3.5) and (3.7) for numerical and categorical predictors, respectively.
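Given a local-efficacy routine, the empirical importance score is just an average over M randomly chosen tube centers. A sketch (the hypothetical `local_tube_efficacy` argument stands in for the tube-based efficacy estimates; names are ours):

```python
import random

random.seed(2)

def importance_score(data, local_tube_efficacy, M=100):
    """Average the local tube efficacies at M bootstrap centers drawn
    with replacement from the observed x's."""
    centers = [random.choice(data)[0] for _ in range(M)]
    return sum(local_tube_efficacy(data, x_star) for x_star in centers) / M

# Toy stand-in efficacy: constant 0.5, so the score is exactly 0.5.
data = [((i,), i % 2) for i in range(10)]
print(importance_score(data, lambda d, x: 0.5, M=20))  # prints 0.5
```

In the full algorithm, the same averaging is applied with the tube-and-section efficacy in place of the constant stand-in, once per predictor.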
3.4 Adaptive Tube Distances

Doksum et al. [3] used an example to demonstrate that the distance D needs to be adaptive to keep variables strongly associated with Y from obscuring the effects of weaker variables. Suppose Y depends on three variables X_1, X_2, and X_3 among the larger set of predictors, with the strength of the dependency ranging from weak to strong; these strengths are estimated by the empirical importance scores. As we examine the effect of X_2 on Y, we want to define the tube so that there is relatively less variation in X_3 than in X_1 within the tube, because X_3 is more likely than X_1 to obscure the effect of X_2. To this end, we introduce a tube distance for the variable X_l of the form

D_l(x_−l, x_−l^(0)) = Σ_{j=1}^d w_j |x_j − x_j^(0)| I(j ≠ l),   (3.9)

where the weights w_j adjust the contribution of strong variables to the distance so that the strong variables are more constrained.

The weights w_j are proportional to (s_j − s′_j)_+ and are determined iteratively and adaptively as follows. In the first iteration, set s_j = s_j^(1) = 1 in (3.9) and compute the importance score s_j^(2) using (3.8). For the second iteration, we use the weights s_j = s_j^(2) and the distance

D_l(x_−l, x_−l^(0)) = [Σ_{j=1}^d (s_j − s′_j)_+ |x_j − x_j^(0)| I(j ≠ l)] / [Σ_{j=1}^d (s_j − s′_j)_+ I(j ≠ l)],   (3.10)
where 0/0 ≡ 1 and s′_j is a threshold value for s_j under the assumption of no conditional association of X_j with Y, computed by a simple Monte Carlo technique. This adjustment is important because some predictors by definition have larger importance scores than others even when all are independent of Y; for example, a categorical predictor with more levels has a larger importance score than one with fewer levels. Therefore the unadjusted scores s_j cannot accurately quantify the relative importance of the predictors in the tube distance calculation. Next we use (3.8) to produce s_j = s_j^(3), use s_j^(3) in the next iteration, and so on.
In the definition above we need to define $|x_j - x^{(0)}_j|$ appropriately for a categorical variable $X_j$. Because $X_j$ is categorical, we assume that $|x_j - x^{(0)}_j|$ is constant when $x_j$ and $x^{(0)}_j$ take any pair of different categories. One approach is to define $|x_j - x^{(0)}_j| = \infty \cdot I(x_j \neq x^{(0)}_j)$, in which case all points in the tube are given the same $X_j$ value as the tube center. This approach is a very sensible one, as it strictly implements the idea of conditioning on $X_j$ when evaluating the effect of another predictor $X_l$: $X_j$ does not contribute to any variation in $Y$. The problem with this definition, though, is that when the number of categories of $X_j$ is large relative to the sample size, there will not be enough points in the tube if we restrict $X_j$ to a single category. In such a case we do not restrict $X_j$, which means that $X_j$ can take any category in the tube, and we define $|x_j - x^{(0)}_j| = k_0 I(x_j \neq x^{(0)}_j)$, where $k_0$ is a normalizing constant. The value of $k_0$ is chosen to achieve a balance between numerical and categorical variables. Suppose that $X_1$ is a numerical variable with standard deviation one and $X_2$ is a categorical variable. For $X_2$ we set $k_0 \equiv E(|X^{(1)}_1 - X^{(2)}_1|)$, where $X^{(1)}_1$ and $X^{(2)}_1$ are two independent realizations of $X_1$. Note that $k_0 = 2/\sqrt{\pi}$ if $X_1$ follows a standard normal distribution. Alternatively, $k_0$ can be based on the empirical distributions of the numerical variables. Based on the preceding discussion of the choice of $k_0$, we use $k_0 = \infty$ in the simulations and $k_0 = 2/\sqrt{\pi}$ in the real data example.
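As a quick check of the normal-case value of $k_0$, a Monte Carlo sketch (assumption: $X_1$ standard normal):

```python
import numpy as np

# Monte Carlo check of k0 = E|X1 - X1'| for two independent N(0,1) draws.
# Analytically X1 - X1' ~ N(0, 2), so k0 = sqrt(2) * sqrt(2/pi) = 2/sqrt(pi).
rng = np.random.default_rng(0)
x_a = rng.standard_normal(1_000_000)
x_b = rng.standard_normal(1_000_000)
k0_mc = float(np.abs(x_a - x_b).mean())
k0_exact = 2.0 / np.sqrt(np.pi)       # about 1.1284
```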
Remark 3.1 (choice of $k_0$). $k_0 = 2/\sqrt{\pi}$ is used unless the sample size is large enough, in the sense explained below. Suppose $k_0 = \infty$; then $D_l(x_{-l}, x^{(0)}_{-l})$ is finite only for those observations with $x_j = x^{(0)}_j$. When there are multiple categorical variables, $D_l$ is finite only for those points, denoted by the set $\Omega^{(0)}$, that satisfy $x_j = x^{(0)}_j$ for all categorical $x_j$. When there are many categorical variables, or when the total sample size is small, the size of this set can be very small or close to zero, in which case the tube cannot be well defined; therefore $k_0 = 2/\sqrt{\pi}$ is used, as in the real data example. On the other hand, when the size of $\Omega^{(0)}$ is larger than the specified tube size, we can use $k_0 = \infty$, as in the simulation examples.
3.5 The CATCH Algorithm

The variable selection algorithm for a classification problem is based on importance scores and an iteration procedure.

The Algorithm

(0) Standardizing. Standardize all the numerical predictors to have sample mean 0 and sample standard deviation 1 using linear transformations.

(1) Initializing importance scores. Let $S = (s_1, \cdots, s_d)$ be importance scores for the predictors. Let $S' = (s'_1, \cdots, s'_d)$, where $s'_l$ is a threshold importance score corresponding to the model in which $X_l$ is independent of $Y$ given $X_{-l}$. Initialize $S$ to $(1, \cdots, 1)$ and $S'$ to $(0, \cdots, 0)$. Let $M$ be the number of tube centers and $m$ the number of observations within each tube.

(2) Tube hunting loop. For $l = 1, \cdots, d$, do (a), (b), (c), and (d) below.

(a) Selecting tube centers. For $0 < b \le 1$, set $M = [n^b]$, where $[\,\cdot\,]$ is the greatest integer function, and randomly sample $M$ bootstrap points $X^*_1, \cdots, X^*_M$ with replacement from the $n$ observations. Set $k = 1$; do (b).

(b) Constructing tubes. For $0 < a < 1$, set $m = [n^a]$. Using the tube distance $D_l$ defined in (3.10), select the tube size so that there are exactly $m \ge 10$ observations inside the tube for $X_l$ at $X^*_k$.

(c) Calculating efficacy. If $X_l$ is a numerical variable, let $\hat\zeta_l(X^*_k)$ be the estimated local contingency tube efficacy of $Y$ on $X_l$ at $X^*_k$ as defined in (3.5). If $X_l$ is a categorical variable, let $\hat\zeta_l(X^*_k)$ be the estimated contingency tube efficacy of $Y$ on $X_l$ at $X^*_k$ as defined in (3.7).

(d) Thresholding. Let $Y^*$ denote a random variable identically distributed as $Y$ but independent of $X$. Let $(Y^{(1)}_0, \cdots, Y^{(n)}_0)$ be a random permutation of $(Y^{(1)}, \cdots, Y^{(n)})$; then $Y_0 \equiv (Y^{(1)}_0, \cdots, Y^{(n)}_0)$ can be viewed as a realization of $Y^*$. Let $\hat\zeta^*_l(X^*_k)$ be the estimated local contingency tube efficacy of $Y_0$ on $X_l$ at $X^*_k$, calculated as in (c) with $Y_0$ in place of $Y$.

If $k < M$, set $k = k + 1$ and go to (b); otherwise go to (e).

(e) Updating importance scores. Set

$$s^{new}_l = \frac{1}{M}\sum_{k=1}^M \hat\zeta_l(X^*_k), \qquad s'^{\,new}_l = \frac{1}{M}\sum_{k=1}^M \hat\zeta^*_l(X^*_k).$$

Let $S^{new} = (s^{new}_1, \cdots, s^{new}_d)$ and $S'^{\,new} = (s'^{\,new}_1, \cdots, s'^{\,new}_d)$ denote the updates of $S$ and $S'$.

(3) Iterations and stopping rule. Repeat (2) $I$ times. Stop when the change in $S^{new}$ is small. Record the last $S$ and $S'$ as $S^{stop}$ and $S'^{\,stop}$, respectively.

(4) Deletion step. Using $S^{stop}$ and $S'^{\,stop}$, delete $X_l$ if $s^{stop}_l - s'^{\,stop}_l < \sqrt{2}\,SD(s'^{\,stop})$. If $d$ is small, we can generate several sets of $S'^{\,stop}$ and use the sample standard deviation of all the values in these sets as $SD(s'^{\,stop})$; see Remark 3.3.

End of the algorithm.
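The loop in steps (1)-(3) can be sketched schematically as follows; `local_efficacy` and `tube_distance` are hypothetical stand-ins for the contingency tube efficacies (3.5)/(3.7) and the distance (3.10), not the paper's exact implementations:

```python
import numpy as np

def catch_scores(X, Y, local_efficacy, tube_distance, a=0.15, b=1.0,
                 n_iter=5, rng=None):
    """Schematic CATCH loop: returns importance scores S and thresholds S'.
    `local_efficacy(X, Y, l, idx)` scores variable l on the tube `idx`;
    `tube_distance(X, x0, S, Sp, l)` returns distances to the center x0."""
    rng = rng or np.random.default_rng()
    n, d = X.shape
    M, m = int(n ** b), max(10, int(n ** a))     # tube centers, tube size
    S, Sp = np.ones(d), np.zeros(d)
    for _ in range(n_iter):
        for l in range(d):
            centers = rng.integers(n, size=M)    # bootstrap tube centers
            z, zp = np.empty(M), np.empty(M)
            Y_perm = rng.permutation(Y)          # global permutation for S'
            for k, c in enumerate(centers):
                dist = tube_distance(X, X[c], S, Sp, l)
                idx = np.argsort(dist)[:m]       # m nearest points = tube
                z[k] = local_efficacy(X, Y, l, idx)
                zp[k] = local_efficacy(X, Y_perm, l, idx)
            S[l], Sp[l] = z.mean(), zp.mean()    # step (e) update
    return S, Sp
```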
Remark 3.2. Step (d) needs to be replaced by (d') below when a global permutation of $Y$ does not result in an appropriate "local null distribution". For example, when the sample sizes of the different classes are extremely unequal, which is known as an unbalanced classification problem, a local region is very likely to contain a pure class of $Y$ after the global permutation; the threshold $S'$ will then be spuriously low, and irrelevant variables will be falsely selected into the model. The local approach below is more nonparametric but more time consuming.

(d') Local thresholding. Let $Y^*$ denote a random variable which is distributed as $Y$ but independent of $X_l$ given $X_{-l}$. Let $(Y^{(1)}_0, \cdots, Y^{(m)}_0)$ be a random permutation of the $Y$'s inside the tube; then $(Y^{(1)}_0, \cdots, Y^{(m)}_0)$ can be viewed as a realization of $Y^*$. If $X_l$ is a numerical variable, let $\hat\zeta'_l(X^*_k)$ be the estimated local contingency tube efficacy of $Y^*$ on $X_l$ at $X^*_k$ as defined in (3.5). If $X_l$ is a categorical variable, let $\hat\zeta'_l(X^*_k)$ be the estimated contingency tube efficacy of $Y^*$ on $X_l$ at $X^*_k$ as defined in (3.7).
Remark 3.3 (The threshold). Under the assumption $H^{(l)}_0$ that $Y$ is independent of $X_l$ given $X_{-l}$ in $T_l$, $s_l$ and $s'_l$ are nearly independent and $SD(s_l - s'_l) \approx \sqrt{2}\,SD(s'_l)$. Thus the rule that deletes $X_l$ when $s^{stop}_l - s'^{\,stop}_l < \sqrt{2}\,SE(s'^{\,stop})$ is similar to the one-standard-deviation rule of CART and GUIDE. The choice of threshold is also discussed in Section 7. Alternatively, we can calculate a p-value for each covariate by a permutation test: we permute the response variable, obtain the importance scores for the permuted data, and calculate how extreme $s^{stop}_l$ is compared to the permuted scores. We then identify a covariate as important if its p-value is less than 0.1.
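The permutation p-value alternative can be sketched as follows; the `gap` statistic is a hypothetical stand-in for the CATCH importance score (any statistic that is large when the predictor is associated with the response):

```python
import numpy as np

def permutation_pvalue(score_fn, x, y, n_perm=200, rng=None):
    """Permutation p-value: the fraction of permuted-label scores at least as
    large as the observed score, with the usual +1 correction."""
    rng = rng or np.random.default_rng()
    observed = score_fn(x, y)
    perm = np.array([score_fn(x, rng.permutation(y)) for _ in range(n_perm)])
    return (1 + np.sum(perm >= observed)) / (n_perm + 1)

# toy check: a predictor strongly tied to y should get a small p-value
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
x = y + 0.1 * rng.standard_normal(200)                  # relevant predictor
gap = lambda x, y: abs(x[y == 1].mean() - x[y == 0].mean())
p_rel = permutation_pvalue(gap, x, y, rng=np.random.default_rng(1))
```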
Remark 3.4 (Large $d$, small $n$). A key idea in this paper is that when determining the importance of the variable $X_j$, we restrict (condition) the vector $X_{-j}$ to a relatively small set. This is done to address the problem of spurious correlation, which is of great concern in genomics (Price et al. [13], Lin and Zeng [11]). In such genomic studies the number of predictors $d$ typically greatly exceeds the sample size $n$. For numerical variables this can be dealt with by replacing $X_{-j}$ with a vector of principal components explaining most of the variation in $X_{-j}$. Although our paper does not deal with the $d \gg n$ case directly, this remark shows that the approach is also applicable to the genome-wide association studies described in Price et al. [13].
Remark 3.5. We now give the computational complexity of the algorithm. Recall the notation: $I$ is the number of repetitions, $M$ is the number of tube centers, $m$ is the tube size, $d$ is the dimension, and $n$ is the sample size. The number of operations needed for the algorithm is $I \times d \times M \times (c_{2b} + c_{2c} + c_{2d}) + c_{2e}$, where $c_{2b}, c_{2c}, c_{2d}, c_{2e}$ denote the number of operations required for steps (2b), (2c), (2d), and (2e). The computational complexities of (2b), (2c), (2d), and (2e) are $O(nd)$, $O(m)$, $O(m)$, and $O(M)$, respectively. Since $nd \gg m$, the required computation is dominated by step (2b), and the computational complexity of the algorithm is $O(IMnd^2)$. Solving a least squares problem requires $O(nd^2)$ operations, so our algorithm has higher computational complexity by a factor of $IM$ compared to least squares. However, the computations for the $M$ tube centers are independent of each other and can be performed in parallel.
4 Simulation Studies
4.1 Variable Selection with Categorical and Numerical Predictors
Example 4.1. Consider $d \ge 5$ predictors $X_1, X_2, \ldots, X_d$ and a categorical response variable $Y$. The fifth predictor $X_5$ is a Bernoulli random variable which takes the two values 0 and 1 with equal probability. All other predictors are uniformly distributed on [0, 1].

When $X_5 = 0$, the response $Y$ is related to the predictors in the following fashion:

$$Y = \begin{cases} 0 & \text{if } X_1 - X_2 - 0.25\sin(1.6\pi X_2) \le -0.5, \\ 1 & \text{if } -0.5 < X_1 - X_2 - 0.25\sin(1.6\pi X_2) \le 0, \\ 2 & \text{if } 0 < X_1 - X_2 - 0.25\sin(1.6\pi X_2) \le 0.5, \\ 3 & \text{if } X_1 - X_2 - 0.25\sin(1.6\pi X_2) > 0.5. \end{cases} \qquad (4.1)$$

When $X_5 = 1$, $Y$ depends on $X_3$ and $X_4$ in the same way as above, with $X_1$ replaced by $X_3$ and $X_2$ replaced by $X_4$. The other variables $X_6, \ldots, X_d$ are irrelevant noise variables. All $d$ predictors are independent.
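A sketch of data generation from model (4.1); the sine frequency $1.6\pi$ is our reading of a dropped decimal point in the transcript and should be checked against the original paper:

```python
import numpy as np

def gen_example_41(n, d=10, rng=None):
    """Simulate (X, Y) from model (4.1): X5 (index 4) is Bernoulli(1/2),
    the other predictors are Uniform(0,1); Y depends on (X1, X2) when
    X5 = 0 and on (X3, X4) when X5 = 1."""
    rng = rng or np.random.default_rng()
    X = rng.random((n, d))
    X[:, 4] = rng.integers(0, 2, size=n)          # X5 in {0, 1}
    u = np.where(X[:, 4] == 0, X[:, 0], X[:, 2])  # X1 or X3
    v = np.where(X[:, 4] == 0, X[:, 1], X[:, 3])  # X2 or X4
    g = u - v - 0.25 * np.sin(1.6 * np.pi * v)
    Y = np.digitize(g, [-0.5, 0.0, 0.5])          # classes 0, 1, 2, 3
    return X, Y

X, Y = gen_example_41(400, d=10, rng=np.random.default_rng(0))
```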
In addition to the presence of the noise variables $X_6, \ldots, X_d$, this example is challenging in two respects. First, there is an interaction effect between the continuous predictors $X_1, \ldots, X_4$ and the categorical predictor $X_5$. Such an interaction effect may arise in practice: for example, suppose $Y$ represents the effect resulting from multiple treatments $X_1, \ldots, X_4$, and $X_5$ represents the gender of a patient; our model allows males and females to respond differently to treatments. Second, the classification boundaries for fixed $X_5$ are highly nonlinear, as shown in Figure 1. This design demonstrates that our method can detect nonlinear dependence between the response and the predictors.
Table 2 shows the number of simulations in which the predictors $X_1, \cdots, X_5$ are correctly kept in the model and, in column 6, the average number of simulations in which the $d - 5$ irrelevant predictors are falsely selected into the model, out of 100 simulation trials from model (4.1). In the simulation, $n = 400$ and $d = 10, 50, 100$ are used. We compare CATCH with other methods, including Marginal CATCH, the Random Forest classifier (RFC) (Breiman [1]), and GUIDE (Loh [12]). For Marginal CATCH we test the association between $Y$ and $X_l$ without conditioning on $X_{-l}$. RFC is well known as a powerful tool for prediction; here we use the importance scores from RFC for variable selection. We create an additional 30 noise variables and use their average importance score, multiplied by a factor of 2, as a threshold (Doksum et al. [3]) to decide whether a variable should be selected into the model. In particular, we generate the noise variables using uniform and Bernoulli distributions for numerical and categorical predictors, respectively.
The main goal of GUIDE is to solve the variable selection bias problem in regression and classification tree methods. As an illustration, if a categorical predictor takes $K$ distinct values, then the splitting test exhausts $2^{K-1} - 1$ possibilities and therefore is
Table 2: Simulation results from Example 4.1 with sample size n = 400. For 1 <= i <= 5, the ith column gives the estimated probability of selection of relevant variable Xi and the standard error based on 100 simulations. Column 6 gives the mean of the estimated probability of selection and the standard error for the irrelevant variables X6, ..., Xd.

Number of Selections (100 simulations)

METHOD           d    X1          X2          X3          X4          X5          Xi, i >= 6
CATCH            10   0.82(0.04)  0.75(0.04)  0.78(0.04)  0.82(0.04)  1(0)        0.05(0.02)
                 50   0.76(0.04)  0.76(0.04)  0.87(0.03)  0.84(0.04)  0.96(0.02)  0.08(0.03)
                 100  0.77(0.04)  0.73(0.04)  0.82(0.04)  0.8(0.04)   0.83(0.04)  0.09(0.03)
Marginal CATCH   10   0.77(0.04)  0.67(0.05)  0.7(0.05)   0.8(0.04)   0.06(0.02)  0.1(0.03)
                 50   0.69(0.05)  0.7(0.05)   0.76(0.04)  0.79(0.04)  0.07(0.03)  0.1(0.03)
                 100  0.78(0.04)  0.76(0.04)  0.73(0.04)  0.73(0.04)  0.12(0.03)  0.1(0.03)
Random Forest    10   0.71(0.05)  0.51(0.05)  0.47(0.05)  0.59(0.05)  0.37(0.05)  0.06(0.02)
                 50   0.64(0.05)  0.61(0.05)  0.65(0.05)  0.58(0.05)  0.28(0.04)  0.08(0.03)
                 100  0.66(0.05)  0.58(0.05)  0.59(0.05)  0.65(0.05)  0.17(0.04)  0.11(0.03)
GUIDE            10   0.96(0.02)  0.98(0.01)  0.93(0.03)  0.99(0.01)  0.92(0.03)  0.33(0.05)
                 50   0.82(0.04)  0.85(0.04)  0.9(0.03)   0.89(0.03)  0.67(0.05)  0.12(0.03)
                 100  0.83(0.04)  0.84(0.04)  0.86(0.03)  0.83(0.04)  0.61(0.05)  0.08(0.03)
Figure 1: The relationship between $Y$ and $(X_1, X_2)$ for model (4.1). The four areas represent the four values of $Y$.
biased toward being selected. The way GUIDE alleviates this bias is to employ lack-of-fit tests, i.e., $\chi^2$-tests applicable to both numerical and categorical predictors, with the p-values adjusted by a bootstrap procedure. Since GUIDE has negligible selection bias, it can include tests for local pairwise interactions. Furthermore, it achieves sensitivity to local curvature by fitting a linear model at each node.
The results in Table 2 show that Marginal CATCH does a very poor job of selecting $X_5$; this illustrates the importance of local conditioning. RFC does better than Marginal CATCH but also has difficulties, while CATCH and GUIDE successfully identify $X_5$ along with $X_1$ to $X_4$ with high frequency. When $d = 10$, GUIDE does very well for the relevant variables but selects too many irrelevant variables. Surprisingly, GUIDE selects fewer irrelevant variables as $d$ increases; it becomes more cautious as the dimension grows. CATCH performs very well in the high dimensional cases ($d = 50$ and $d = 100$), which indicates that the CATCH algorithm is robust to the intricate interaction structures between the predictors.
In model (4.1) the predictors interact in a hierarchical way. When $X_5 = 0$, the response is related to $X_1$ and $X_2$ as shown in Figure 1, where the two axes are the two relevant predictors and the four areas partitioned by the curves correspond to different values of $Y$. When $X_5 = 1$, the response depends on $X_3$ and $X_4$ in the same way. This type of hierarchical interaction between categorical and numerical predictors is difficult to detect even by state-of-the-art classification methods such as the support vector machine (SVM) and RFC. In particular, RFC produces an importance score vector that identifies the four numerical variables $X_1, \ldots, X_4$ as very significant but often leaves the categorical variable $X_5$ unidentified, as we see in Table 2.
The CATCH algorithm, however, performs very well for this complicated model. We now explain why CATCH is able to identify $X_5$ as an important variable. To explore the association between $X_5$ and $Y$, we construct $M$ contingency tables with 2 rows and 4 columns based on $M$ randomly centered tubes, calculate the contingency efficacy scores, and then average the scores. Figure 2 (a) and (b) illustrate how one of the contingency efficacy scores is calculated. 1000 data points are generated from model (4.1) and split into two sets according to $x_5$: (a) shows the roughly half of the points with $x_5 = 0$ in the $(x_1, x_2)$ dimensions, and (b) shows the remaining points with $x_5 = 1$ in the $(x_3, x_4)$ dimensions. The data points are colored according to $Y$; this representation enables us to clearly see the classification boundaries (grey lines). The randomly selected tube center, with $x_5 = 0$, is shown as a star in (a). The data points within the tube, i.e., close to the tube center in terms of $X_{-5}$, are highlighted by larger dots. Since the distance defining the tube does not involve $X_5$, the larger dots have $x_5$ equal to 0 or 1 and are present in both (a) and (b). The counts of the larger dots in different colors in (a) and (b) correspond to the cell counts of the two rows $x_5 = 0$ and $x_5 = 1$ in the contingency table. As we can see, the larger dots in (a) are mostly red and the larger dots in (b) are mostly blue and orange, which shows that the distribution of $Y$ depends on the value of $X_5$, so the contingency efficacy score is high. We expect the contingency efficacy score to be low only if the tube center has similar values in $(x_1, x_2)$ and $(x_3, x_4)$; such tube centers would be very few among the $M$ randomly chosen centers. Thus the average contingency efficacy score is high and $X_5$ is identified as an important variable.
To contrast this example with a simpler data generating model, suppose the effects of the predictors are additive, that is, the categorical variable enters the response as a term added to the effect of the numerical predictors. In this type of simpler model, the RFC and GUIDE algorithms can efficiently identify the relevant predictors, but so can the CATCH algorithm. We do not include the simulation results for additive models here.
Remark 4.1. We observe that the CATCH algorithm is not very sensitive to the choice of the number of tube centers $M$ or the tube size $m = [n^a]$, where $0 < a \le 1$ and $[\,\cdot\,]$ is the greatest integer function. For Example 4.1 we perform the simulation using $M = n$ and $M = n/2$, and $a = 0.1, 0.15, 0.2, 0.3, 0.5$. The results are summarized in Table 3. The CATCH algorithm performs similarly for $M = n$ and $M = n/2$, with $M = n$ having a small winning edge. We generally suggest using $M = n$ if computational resources are not a concern or $n$ is not large ($\le 200$). When computational cost is a concern (Remark 3.5), one can reduce the number of tube centers to $M = n/2$ as a compromise between detection power and computational cost. Table 3 also shows that the tube size works well in the range between 0.1 and 0.3, while the detection power for $X_5$ decreases for $a = 0.5$, where the tube is too big to achieve local conditioning. This observation is consistent with the comparison to Marginal CATCH in Table 2. On the other hand, the tube size should not be too small, so that sufficient detection power is achieved. We suggest using $a = 0.1$ or 0.15 for sample sizes $n \ge 400$, or $m = 50$ for smaller sample sizes. In light of these guidelines, we use $M = n/2 = 200$ and $a = 0.15$ for Examples 4.1-4.3, and $M = n = 200$ and $m = 50$ for Example 4.4.
Table 3: Sensitivity study of CATCH for Example 4.1 (n = 400, d = 50) using different numbers of tube centers M and tube sizes m = [n^a]. For 1 <= i <= 5, the ith column gives the number of times Xi was selected. Column 6 gives the percentage of times irrelevant variables were selected.

Number of Selections (100 simulations)

M        a     X1  X2  X3  X4  X5  Xi, i >= 6 (%)
M = 200  0.10  75  67  77  69  92   7.8
         0.15  76  76  87  84  96   8.2
         0.20  85  84  90  87  93   9.9
         0.30  89  87  95  89  85   9.6
         0.50  89  90  96  93  62   9.9
M = 400  0.10  74  76  82  77  92   7.2
         0.15  85  82  89  87  94   8.6
         0.20  87  88  90  86  96   9.1
         0.30  88  89  91  91  92   9.4
         0.50  89  91  94  93  61  10.2
In Example 4.1 all predictors are independent. In a similar setting, we next consider the case where the predictors are dependent.
Example 4.2. We generate standard normal random variables $Z_i$, $i = 1, \cdots, d$, in such a way that $cor(Z_1, Z_4) = 0.5$, $cor(Z_3, Z_6) = 0.9$, and the other $Z_i$'s are independent. We set $X_i = \Phi(Z_i)$ for $i = 1, \cdots, d$, where $\Phi(\cdot)$ is the CDF of a standard normal; therefore the $X_i$ are marginally uniformly distributed, as in Example 4.1. The correlation structure of the $X_i$'s is similar to that of the $Z_i$'s. The response $Y$ is generated in the same way as in Example 4.1.
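The Gaussian-copula construction of the dependent predictors can be sketched as follows (the function name is ours):

```python
import numpy as np
from math import erf

def gen_correlated_uniforms(n, d, rng=None):
    """Gaussian-copula predictors for Example 4.2: cor(Z1, Z4) = 0.5,
    cor(Z3, Z6) = 0.9, other Z's independent; X_i = Phi(Z_i) ~ Uniform(0,1)."""
    rng = rng or np.random.default_rng()
    C = np.eye(d)
    C[0, 3] = C[3, 0] = 0.5            # cor(Z1, Z4)
    C[2, 5] = C[5, 2] = 0.9            # cor(Z3, Z6)
    Z = rng.multivariate_normal(np.zeros(d), C, size=n)
    return 0.5 * (1.0 + np.vectorize(erf)(Z / np.sqrt(2.0)))  # Phi(Z)

X = gen_correlated_uniforms(5000, 10, rng=np.random.default_rng(0))
```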
The results for Example 4.2 are given in Table 4. Comparing with the results in Table 2, both CATCH and GUIDE do well in the presence of dependence between the numerical variables. The explanation is two-fold. First, both methods identify the categorical $X_5$ as an important predictor most of the time, which makes the identification of the other relevant predictors possible. Second, conditionally on $X_5$, $Y$ depends on a pair of variables ($X_1$ and $X_2$, or $X_3$ and $X_4$) jointly, and in particular this dependence is nonlinear, which makes both methods less affected by the correlation introduced here.
[Figure 2 here: two scatter panels, (a) $X_5 = 0$ with axes $X_1, X_2$ and (b) $X_5 = 1$ with axes $X_3, X_4$; both axes range from 0.0 to 1.0.]

Figure 2: Sample scatter plots of $(x_1, x_2)$ and $(x_3, x_4)$ for data simulated from model (4.1). The left panel includes all the pairs $(x_1, x_2)$ with $x_5 = 0$, while the right panel includes all the pairs $(x_3, x_4)$ with $x_5 = 1$. The larger dots in the two plots are within the tube around a tube center, shown as the star.
We now explain the latter in detail for both methods. In CATCH, the algorithm adaptively weighs the effects of the predictors by their importance scores when defining the tubes in which they are conditioned on; therefore identifying either of the two relevant variables helps to identify the other, whereas an irrelevant variable, though strongly associated with one of the relevant variables, is down-weighted iteratively and does not strongly affect the selection of the relevant variable. This explains the slightly decreasing selection probability of $X_3$ in the presence of the strongly correlated $X_6$, and consequently the overall lower selection of $X_3$ and $X_4$ compared to $X_1$ and $X_2$, given the correlation between $X_1$ and $X_4$. In GUIDE, the difference between the dependent and independent cases is even smaller, because GUIDE is very powerful in detecting local interactions by literally including pairwise interactions when selecting splitting variables.
Marginal CATCH and Random Forest do poorly in this case, mainly because $X_5$ cannot be detected, as explained earlier. More specifically, the dependence between $Y$ and $X_4$ in the $X_5 = 1$ case is the same as that between $Y$ and $X_2$ in the $X_5 = 0$ case. The linear terms $X_1$ and $X_2$ have opposite signs in the decision function; therefore the marginal effects of $X_1$ and $X_4$ are obscured when $X_5$ is not identified as an important variable.
4.2 Improving the Performance of SVM and RF Classifiers
The criterion used in the CATCH algorithm is not based on classification accuracy. But if we use CATCH to screen out the irrelevant variables and only use the important
Table 4: Simulation results from Example 4.2 with sample size n = 400. For 1 <= i <= 6, the ith column gives the estimated probability of selection of variable Xi and the standard error based on 100 simulations, where Xi, i = 1, ..., 5, are relevant variables, with cor(X1, X4) = 0.5. X6 is an irrelevant variable that is correlated with X3, with correlation 0.9. Column 7 gives the mean of the estimated probability of selection and that of the standard error for the irrelevant variables X7, ..., Xd.

Number of Selections (100 simulations)

METHOD           d    X1          X2          X3          X4          X5          X6          Xi, i >= 7
CATCH            10   0.76(0.04)  0.77(0.04)  0.73(0.04)  0.68(0.05)  0.99(0.01)  0.34(0.05)  0.07(0.02)
                 50   0.78(0.04)  0.77(0.04)  0.74(0.04)  0.63(0.05)  0.92(0.03)  0.57(0.05)  0.09(0.03)
                 100  0.76(0.04)  0.77(0.04)  0.8(0.04)   0.74(0.04)  0.82(0.04)  0.55(0.05)  0.09(0.03)
Marginal CATCH   10   0.35(0.05)  0.63(0.05)  0.8(0.04)   0.32(0.05)  0.09(0.03)  0.71(0.05)  0.12(0.03)
                 50   0.34(0.05)  0.71(0.05)  0.7(0.05)   0.3(0.05)   0.1(0.03)   0.6(0.05)   0.11(0.03)
                 100  0.34(0.05)  0.68(0.05)  0.75(0.04)  0.3(0.05)   0.1(0.03)   0.59(0.05)  0.1(0.03)
Random Forest    10   0.29(0.05)  0.48(0.05)  0.55(0.05)  0.2(0.04)   0.3(0.05)   0.47(0.05)  0.05(0.02)
                 50   0.26(0.04)  0.52(0.05)  0.57(0.05)  0.3(0.05)   0.24(0.04)  0.45(0.05)  0.08(0.03)
                 100  0.36(0.05)  0.61(0.05)  0.66(0.05)  0.37(0.05)  0.32(0.05)  0.46(0.05)  0.12(0.03)
GUIDE            10   0.96(0.02)  0.96(0.02)  0.94(0.02)  0.95(0.02)  0.96(0.02)  0.84(0.04)  0.3(0.05)
                 50   0.8(0.04)   0.92(0.03)  0.88(0.03)  0.79(0.04)  0.72(0.04)  0.68(0.05)  0.12(0.03)
                 100  0.81(0.04)  0.89(0.03)  0.9(0.03)   0.76(0.04)  0.75(0.04)  0.58(0.05)  0.08(0.03)
variables for classification, the performance of other algorithms, such as the Support Vector Machine (SVM) and the Random Forest classifier (RFC), can be significantly improved. In this section we compare the prediction performance of (1) SVM, (2) RFC, (3) CATCH followed by SVM, and (4) CATCH followed by RFC. We label (3) and (4) CATCH+SVM and CATCH+RFC, respectively. We use the "svm" function in the "e1071" package in R, with the radial kernel, to perform the SVM analysis, and the "randomForest" function in the "randomForest" package to perform the RFC analysis. We use simulation experiments to illustrate the improvement obtained by using CATCH prior to classifying $Y$.
Let the true model be

$$(X, Y) \sim P, \qquad (4.2)$$

where $P$ will be specified later. In each of 100 Monte Carlo experiments, $n$ pairs $(X^{(i)}, Y^{(i)})$, $i = 1, \cdots, n$, are generated from the true model (4.2). Based on the simulated data, the methods (1) SVM, (2) RFC, (3) CATCH+SVM, and (4) CATCH+RFC are used to classify a future $Y^{(0)}$ for which a corresponding $X^{(0)}$ is available.

The integrated misclassification rate (IMR; Friedman [6]), defined below, is used to evaluate the performance of the classification algorithms. Let $F_{X,Y}(\cdot)$ be the classifier constructed by a classification algorithm based on $X = (X^{(1)}, \cdots, X^{(n)})$ and $Y = (Y^{(1)}, \cdots, Y^{(n)})^T$.

Let $(X^{(0)}, Y^{(0)})$ be a future observation with the same distribution as $(X^{(i)}, Y^{(i)})$. We only observe $X^{(0)} = x^{(0)}$ and want to classify $Y^{(0)}$. For a possible future observation $(x^{(0)}, y^{(0)})$, define

$$MR(X, Y, x^{(0)}, y^{(0)}) = P(F_{X,Y}(x^{(0)}) \neq y^{(0)} \mid X, Y). \qquad (4.3)$$

Our criterion is

$$IMR = E\,E_0[MR(X, Y, x^{(0)}, y^{(0)})], \qquad (4.4)$$

where $E_0$ is with respect to the distribution of the random $(x^{(0)}, y^{(0)})$ and $E$ is with respect to the distribution of $(X, Y)$.

This double expectation is evaluated using Monte Carlo simulation; the result is denoted IMR*. More precisely, $E_0$ is estimated using 5000 simulated $(x^{(0)}, y^{(0)})$ observations, and IMR* is then computed on the basis of 100 simulated samples $(X_i, Y_i)$, $i = 1, \cdots, 100$.
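The Monte Carlo evaluation of (4.4) can be sketched as follows; `simulate` and `fit` are hypothetical stand-ins for a data-generating model $P$ and a classifier-training routine:

```python
import numpy as np

def imr_star(simulate, fit, n_train=200, n_test=5000, n_rep=100, rng=None):
    """Monte Carlo estimate of the integrated misclassification rate (4.4):
    the outer E averages over n_rep training samples, the inner E0 over a
    large test sample drawn from the same model."""
    rng = rng or np.random.default_rng()
    rates = []
    for _ in range(n_rep):
        X, Y = simulate(n_train, rng)             # training data ~ P
        predict = fit(X, Y)                       # classifier F_{X,Y}
        X0, Y0 = simulate(n_test, rng)            # future observations ~ P
        rates.append(np.mean(predict(X0) != Y0))  # estimates E0[MR]
    return float(np.mean(rates))                  # outer expectation E
```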
Example 4.3. First consider model (4.1) in Section 4.1. Figure 3 shows IMR* versus the number of irrelevant variables for this example. The left panel shows that the IMR* of the SVM approach increases dramatically as the number of irrelevant variables increases, while the IMR* of CATCH+SVM is stable. Thus CATCH stabilizes the performance of SVM for this complex model. The right panel shows that CATCH also stabilizes the performance of RFC. These results indicate that CATCH can improve the prediction accuracy of other classification algorithms.
[Figure 3 here: two panels, "SVM" (curves SVM and CATCH+SVM) and "Random Forest" (curves RFC and CATCH+RFC); x-axis d from 20 to 100, y-axis error rate from 0.25 to 0.60.]

Figure 3: IMR* of SVM, RFC, CATCH+SVM, and CATCH+RFC versus the number of predictors for the simulation study of model (4.1) in Example 4.1. The sample size is n = 400.
Example 4.4 (Contaminated Logistic Model). Consider a binary response $Y \in \{0, 1\}$ with

$$P(Y = 1 \mid X = x) = \begin{cases} F(x) & \text{if } r = 0, \\ G(x) & \text{if } r = 1, \end{cases} \qquad (4.5)$$

where $x = (x_1, \ldots, x_d)$,

$$r \sim \text{Bernoulli}(\gamma), \qquad (4.6)$$

$$F(x) = \big(1 + \exp\{-(0.25 + 0.5x_1 + x_2)\}\big)^{-1}, \qquad (4.7)$$

$$G(x) = \begin{cases} 0.1 & \text{if } 0.25 + 0.5x_1 + x_2 < 0, \\ 0.9 & \text{if } 0.25 + 0.5x_1 + x_2 \ge 0, \end{cases} \qquad (4.8)$$

and $x_1, x_2, \ldots, x_d$ are realizations of independent standard normal random variables. Hence, given $X^{(i)} = x^{(i)}$, $1 \le i \le n$, the joint distribution of $Y^{(1)}, Y^{(2)}, \ldots, Y^{(n)}$ is the contaminated logistic model with contamination parameter $\gamma$:

$$(1 - \gamma)\prod_{i=1}^n [F(x^{(i)})]^{Y^{(i)}}[1 - F(x^{(i)})]^{1 - Y^{(i)}} + \gamma \prod_{i=1}^n [G(x^{(i)})]^{Y^{(i)}}[1 - G(x^{(i)})]^{1 - Y^{(i)}}. \qquad (4.9)$$

We perform a simulation study for $n = 200$, $d = 50$, and $\gamma = 0, 0.2, 0.4, 0.6, 0.8$, and 1, where $\gamma = 0$ corresponds to no contamination. Figure 4 shows IMR* versus the different values of $\gamma$ for this example. CATCH improves the performance of the SVM and Random Forest classifiers for this contaminated logistic regression model.
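A sketch of data generation from the contaminated model, treating $F$ and $G$ as $P(Y = 1 \mid x)$, which is the reading consistent with the likelihood (4.9); note that $r$ is drawn once per sample, since (4.9) mixes whole-sample likelihoods:

```python
import numpy as np

def gen_contaminated_logistic(n, d=50, gamma=0.2, rng=None):
    """Simulate from (4.5)-(4.8): with probability 1-gamma the logistic
    component F generates the whole sample, with probability gamma the
    step-function contamination G does."""
    rng = rng or np.random.default_rng()
    X = rng.standard_normal((n, d))
    eta = 0.25 + 0.5 * X[:, 0] + X[:, 1]
    F = 1.0 / (1.0 + np.exp(-eta))        # logistic component (4.7)
    G = np.where(eta < 0, 0.1, 0.9)       # contaminating component (4.8)
    r = rng.random() < gamma              # one contamination draw, (4.6)
    p1 = G if r else F                    # P(Y = 1 | x)
    Y = (rng.random(n) < p1).astype(int)
    return X, Y
```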
[Figure 4 here: two panels, "SVM" (curves SVM and CATCH+SVM) and "Random Forest" (curves RFC and CATCH+RFC); x-axis γ in {0, 0.2, 0.4, 0.6, 0.8, 1}, y-axis error rate from 0.1 to 0.4.]

Figure 4: IMR* of SVM, RFC, CATCH+SVM, and CATCH+RFC versus the contamination parameter γ for simulation model (4.9) in Example 4.4.
5 CATCH Analysis of The ANN Data
In this section CATCH is applied to a data set available from the UCI database (http://ftp.ics.uci.edu/pub/machine-learning-databases/thyroid-disease). The data are from a clinical trial to determine whether a patient is hypothyroid (see, e.g., Quinlan [15]). The dataset is split into two parts: a training set with 3772 observations and a testing set with 3428 observations. The response is a categorical variable with three classes: normal (not hypothyroid), hyperfunction, and subnormal functioning. There are 21 predictors, comprising 15 categorical and 6 numerical predictors. The three classes are very imbalanced, with 92.5% normal patients; a good classifier should produce a significant decrease from the 7.5% error rate that can be obtained by classifying all observations to the normal class.
CATCH identifies 7 variables as important, including 4 continuous variables and 3 categorical variables. We consider three classification methods: (1) Support Vector Machine with radial kernel, denoted SVM(r); (2) Support Vector Machine with polynomial kernel, denoted SVM(p); and (3) RFC. We apply these three methods to the training set to build classification models and use the test set to estimate the misclassification rates. We apply the methods with and without CATCH, where "with CATCH" means that only the 7 important variables identified by CATCH are used to build the model and "without CATCH" means that all 21 variables are used in the analysis.
Table 5: Misclassification rates of three classification methods (SVM(r), SVM(p), RFC) with CATCH and without CATCH.

              SVM(r)  SVM(p)  RFC
w/o CATCH     0.052   0.063   0.024
with CATCH    0.037   0.055   0.015
Table 5 shows that if CATCH is used to select important variables before the classification methods are applied, the misclassification rates decrease. CATCH reduces the misclassification rates by 29%, 13%, and 37% for SVM(r), SVM(p), and RFC, respectively. CATCH+RFC has the best performance. Although CATCH is not designed to improve classification accuracy, it nonetheless helps other classification methods perform better.
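The reported reductions follow directly from Table 5 as relative changes (the text rounds them to 29%, 13%, and 37%):

```python
# Reductions in misclassification rate implied by Table 5: (w/o - with) / w/o.
rates_wo   = {"SVM(r)": 0.052, "SVM(p)": 0.063, "RFC": 0.024}
rates_with = {"SVM(r)": 0.037, "SVM(p)": 0.055, "RFC": 0.015}
reduction_pct = {k: 100 * (rates_wo[k] - rates_with[k]) / rates_wo[k]
                 for k in rates_wo}
# roughly 29%, 13%, and 37%, matching the text
```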
6 Properties of Local Contingency Efficacy
In this section we investigate the properties of the local contingency efficacy. For a univariate continuous predictor, Theorem 6.1 below shows that the local contingency efficacy is well defined and equals 0 if the response variable is independent of the predictor. Consider an $R \times C$ contingency table. Let $n_{rc}$, $r = 1, \cdots, R$, $c = 1, \cdots, C$, be the observed frequency in the $(r, c)$ cell, and let $p_{rc}$ be the probability that an observation belongs to the $(r, c)$ cell. Let

$$p_r = \sum_{c=1}^C p_{rc}, \qquad q_c = \sum_{r=1}^R p_{rc}, \qquad (6.1)$$

and

$$n_{r\cdot} = \sum_{c=1}^C n_{rc}, \qquad n_{\cdot c} = \sum_{r=1}^R n_{rc}, \qquad n = \sum_{r=1}^R \sum_{c=1}^C n_{rc}, \qquad E_{rc} = n_{r\cdot}\, n_{\cdot c} / n. \qquad (6.2)$$
Consider testing that the rows and the columns are independent, i.e.,

$$H_0: \forall r, c,\; p_{rc} = p_r q_c \quad \text{versus} \quad H_1: \exists r, c,\; p_{rc} \neq p_r q_c. \qquad (6.3)$$

A standard test statistic is

$$\mathcal{X}^2 = \sum_{r=1}^R \sum_{c=1}^C \frac{(n_{rc} - E_{rc})^2}{E_{rc}}. \qquad (6.4)$$

Under $H_0$, $\mathcal{X}^2$ converges in distribution to $\chi^2_{(R-1)(C-1)}$. Moreover, defining $0/0 = 0$, we have the well-known result:

Lemma 6.1. As $n$ tends to infinity, $\mathcal{X}^2/n$ tends in probability to

$$\tau^2 \equiv \sum_{r=1}^R \sum_{c=1}^C \frac{(p_{rc} - p_r q_c)^2}{p_r q_c}.$$

Moreover, if $p_r q_c$ is bounded away from 0 and 1 as $n \to \infty$, and if $R$ and $C$ are fixed as $n \to \infty$, then $\tau^2 = 0$ if and only if $H_0$ is true.
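The statistic (6.4) and the plug-in estimate $\mathcal{X}^2/n$ of $\tau^2$ from Lemma 6.1 can be computed directly (a minimal sketch; the function name is ours):

```python
import numpy as np

def chi2_and_tau2(counts):
    """Pearson chi-square statistic (6.4) for an R x C table of counts and
    the plug-in estimate of tau^2 = X^2 / n from Lemma 6.1."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    E = np.outer(counts.sum(axis=1), counts.sum(axis=0)) / n   # E_rc
    with np.errstate(invalid="ignore", divide="ignore"):
        terms = np.where(E > 0, (counts - E) ** 2 / E, 0.0)    # 0/0 := 0
    return terms.sum(), terms.sum() / n

# independent rows and columns -> X^2 = 0 and tau^2-hat = 0
x2, tau2_hat = chi2_and_tau2(np.array([[50, 50], [50, 50]]))
```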
Remark 6.1. Lemma 6.1 shows that the contingency efficacy $\zeta$ in (2.8) is well defined. Moreover, $\zeta = 0$ if and only if $X$ and $Y$ are independent.
Theorem 6.1. Consider a categorical response $Y$ and a continuous predictor $X$ with density function $f(x)$ such that, for a fixed $h > 0$, $f(x) > 0$ for $x \in (x^{(0)} - h, x^{(0)} + h)$. Then:

(1) For fixed $h$, the estimated local section efficacy $\hat\zeta(x^{(0)}, h)$ defined in (2.5) converges almost surely to the section efficacy $\zeta(x^{(0)}, h)$ defined in (2.4).

(2) If $X$ and $Y$ are independent, then the local efficacy $\zeta(x^{(0)})$ defined in (2.4) is zero.

(3) If there exists $c \in \{1, \cdots, C\}$ such that $p_c(x) = P(Y = c \mid x)$ has a continuous derivative at $x = x^{(0)}$ with $p'_c(x^{(0)}) \neq 0$, then $\zeta(x^{(0)}) > 0$.
The $\chi^2$ statistic (6.4) can be written in terms of the multinomial proportions $\hat p_{rc}$, with $\sqrt{n}(\hat p_{rc} - p_{rc})$, $1 \le r \le R$, $1 \le c \le C$, converging in distribution to a multivariate normal. By a Taylor expansion of $T_n \equiv \sqrt{\mathcal{X}^2/n}$ at $\tau = \text{plim}\sqrt{\mathcal{X}^2/n}$, we find:

Theorem 6.2.
$$\sqrt{n}(T_n - \tau) \stackrel{d}{\longrightarrow} N(0, v^2), \qquad (6.5)$$
where the asymptotic variance $v^2$, which can be obtained from the Taylor expansion, is not needed in this paper.

Remark 6.2. The result (6.5) applies to the multivariate $\chi^2$-based efficacies $\hat\zeta$ with $n$ replaced by the number of observations that go into the computation of $\hat\zeta$. In this case we use the notation $\chi^2_j$ for the chi-square statistic for the $j$th variable.
7 Consistency of CATCH

Let $Y$ denote a variable to be classified into one of the categories $1, \ldots, C$ by using the $X_j$'s in the vector $X = (X_1, \ldots, X_d)$. Let $X_{-j} \equiv X - \{X_j\}$ denote $X$ without the $X_j$ component. Assume that the squared Pearson nonparametric correlation (Doksum and Samarov [4], Huang and Chen [9]) between $X_j$ and $X_{-j}$ is less than 0.5. We call the variable $X_j$ relevant for $Y$ if, conditionally given $X_{-j}$, $\mathcal{L}(Y \mid X_j = x)$ is not constant in $x$, and irrelevant otherwise. Our data $(X^{(1)}, Y^{(1)}), \ldots, (X^{(n)}, Y^{(n)})$ are i.i.d. as $(X, Y)$. A variable selection procedure is consistent if the probability that all the relevant variables are selected and none of the irrelevant variables are selected tends to one as $n$ tends to infinity.

We give a general nonparametric framework in which the variable selection rule based on the CATCH importance scores $s_j$ given in (3.8) is consistent.

When the predictors are not all categorical and CATCH involves tubes, we assume that the number of observations in each tube tends to $\infty$ as $n \to \infty$, and we assume that, for those importance scores that require the selection of a section size $h$, the selection is from a finite set $\{h_j\}$ with $\min_j\{h_j\} > h_0$ for some $h_0 > 0$. It follows that the number of observations $m_{jk}$ in the $k$th tube, used to calculate the importance score $s_j$ for the variable $X_j$, tends to infinity as $n \to \infty$.
The key quantities that determine consistency are the limits $\lambda_j$ of $s_j$. That is,
$$\lambda_j = \tau(\zeta_j, P) = E_P[\zeta_j(X)] = \int \zeta_j(x)\, dP(x), \qquad j = 1, \ldots, d,$$
where $\zeta_j(x)$ is the local contingency efficacy defined by (2.4) for numerical $X$'s and by (2.8) for categorical $X$'s, and $P$ is the probability distribution of $X$. Our estimate $s_j$ of $\tau_j \equiv \lambda_j$ is
$$s_j = \tau(\hat\zeta_j, \hat P) = \frac{1}{M} \sum_{k=1}^{M} \hat\zeta_j(X^*_k),$$
where $\hat\zeta_j$ is defined in Section 3 and $\hat P$ is the empirical distribution of $X$.

Let $d_0$ denote the number of $X$'s that are irrelevant for $Y$. Set $d_1 = d - d_0$ and, without loss of generality, reorder the $\lambda_j$ so that
$$\lambda_j > 0 \ \text{if } j = 1, \ldots, d_1; \qquad \lambda_j = 0 \ \text{if } j = d_1 + 1, \ldots, d. \qquad (7.1)$$
Our variable selection rule is

"keep variable $X_j$ if $s_j \ge t$" for some threshold $t$. $\qquad (7.2)$

Rule (7.2) is consistent if and only if
$$P\Big(\min_{j \le d_1} s_j \ge t \ \text{ and } \ \max_{j > d_1} s_j < t\Big) \longrightarrow 1. \qquad (7.3)$$
To establish (7.3), it is enough to show that
$$P\Big(\max_{j > d_1} s_j > t\Big) \longrightarrow 0 \qquad (7.4)$$
and
$$P\Big(\min_{j \le d_1} s_j \le t\Big) \longrightarrow 0. \qquad (7.5)$$
We call (7.4) and (7.5) type I and type II consistency, respectively. Type I and II error probabilities refer to the probabilities of any false positives and any false negatives, respectively.
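As an illustration of rule (7.2), here is a minimal sketch (our own, with hypothetical scores, not the paper's code) of the thresholding rule and the false-positive/false-negative events behind (7.4) and (7.5):

```python
# Illustrative sketch of the selection rule (7.2): keep X_j when s_j >= t.
# The scores below are hypothetical; the first d1 = 3 variables play the role
# of relevant variables, the rest of irrelevant ones.

def select_variables(scores, t):
    """Indices j with importance score s_j >= t (rule (7.2))."""
    return [j for j, s in enumerate(scores) if s >= t]

scores = [0.9, 0.7, 0.8, 0.05, 0.02, 0.04]
d1, t = 3, 0.3

kept = select_variables(scores, t)
false_positive = any(j >= d1 for j in kept)                                # event in (7.4)
false_negative = any(j < d1 for j in range(len(scores)) if j not in kept)  # event in (7.5)
print(kept, false_positive, false_negative)  # → [0, 1, 2] False False
```

With this threshold the rule keeps exactly the relevant variables, so both error events are empty.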
7.1 Type I consistency
We consider (7.4) first and note that
$$P\Big(\max_{j > d_1} s_j > t\Big) \le \sum_{j = d_1 + 1}^{d} P(s_j > t). \qquad (7.6)$$
For the $j$th irrelevant variable $X_j$, $j > d_1$, the results in Section 6 apply to the local efficacy $\hat\zeta_j(x)$ at a fixed point $x$. The importance score $s_j$ defined in (3.8) is the average of such efficacies over a sample of $M$ random points $x^*_a$. We assume that there is a point $x^{(0)}$ such that, for irrelevant $X_j$'s and for $n$ large enough, $\hat\zeta_j(x^{(0)})$ is stochastically larger in the right tail than the average $M^{-1}\sum_{a=1}^{M} \hat\zeta_j(x^*_a)$. That is, we assume that there exist $t > 0$ and $n_0$ such that for $n \ge n_0$,
$$P(s_j > t) \le P(s^{(0)}_j > t), \qquad j > d_1, \qquad (7.7)$$
where $s^{(0)}_j \equiv \hat\zeta_j(x^{(0)})$. The existence of such a point $x^{(0)}$ is established by considering the probability limit of
$$\arg\max\{\hat\zeta_j(x^*_a) : x^*_a \in \{x^*_1, \ldots, x^*_M\}\}.$$
Note that by Lemma 6.1, $s^{(0)}_j \to_p 0$ as $n \to \infty$. It follows that we have established (7.4) for $t$ and $d_0 = d - d_1$ that are fixed as $n \to \infty$. To examine the interesting case where $d_0 \to \infty$ as $n \to \infty$, we need large deviation results, to which we turn next. In the $d_0 \to \infty$ case we consider thresholds $t \equiv t_n$ that tend to $\infty$ as $n \to \infty$, and we use the weaker assumption that there exist $x^{(0)}$ and $n_0$ such that
$$P(s_j > t_n) \le P(s^{(0)}_j > t_n) \qquad (7.8)$$
for $n \ge n_0$ and each $t_n$ with $\lim_{n\to\infty} t_n = \infty$. We also assume that the number of columns $C$ and rows $R$ in the contingency table that defines the efficacy $\chi^2_{jn}$ given by (6.4) stays fixed as $n \to \infty$. In this case $P(s^{(0)}_j > t_n) \to 0$ provided each term
$$[T^{(j)}_{rc}]^2 \equiv \left[\frac{n_{rc} - E_{rc}}{\sqrt{n}\sqrt{E_{rc}}}\right]^2$$
in $s^2_j = \chi^2_{jn}/n$, defined for $X_j$ by (6.4), satisfies
$$P\big(|T^{(j)}_{rc}| > t_n\big) \longrightarrow 0.$$
This is because
$$P\left(\sqrt{\sum_r \sum_c [T^{(j)}_{rc}]^2} > t_n\right) \le P\Big(RC \max_{r,c} [T^{(j)}_{rc}]^2 > t_n^2\Big) \le RC \max_{r,c} P\big(|T^{(j)}_{rc}| > t_n/\sqrt{RC}\big),$$
and $t_n$ and $t_n/\sqrt{RC}$ are of the same order as $n \to \infty$. By large deviation theory (Hall [8] and Jing et al. [10]), for some constant $A$,
$$P(|T^{(j)}_{rc}| > t_n) = \frac{2 t_n^{-1}}{\sqrt{2\pi}}\, e^{-t_n^2/2}\left[1 + A t_n^3 n^{-1/2} + o\big(t_n^3 n^{-1/2}\big)\right] \qquad (7.9)$$
provided $t_n = o(n^{1/6})$. If (7.9) is uniform in $j > d_1$, then (7.6) and (7.9) imply that
type I consistency holds if
$$d_0 \left(\frac{t_n}{\sqrt{2}}\right)^{-1} e^{-t_n^2/2} \longrightarrow 0. \qquad (7.10)$$
To examine (7.10), write $d_0$ and $t_n$ as $d_0 = \exp(n^b)$ and $t_n = \sqrt{2}\, n^r$. By large deviation theory, (7.10) holds when $r < 1/6$. Thus if $0 < \frac{1}{2}b < r < \frac{1}{6}$, then
$$d_0 \left(\frac{t_n}{\sqrt{2}}\right)^{-1} e^{-t_n^2/2} = n^{-r} \exp\big(n^b - n^{2r}\big) \longrightarrow 0.$$
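The rate calculation above rests on the leading term of (7.9) matching the two-sided standard normal tail. A quick numerical sanity check (our own sketch, not from the paper) compares the exact tail $P(|Z| > t)$ with the leading factor $2 t^{-1} (2\pi)^{-1/2} e^{-t^2/2}$:

```python
# Compare the exact normal tail with the leading large-deviation term in (7.9).
import math

def normal_two_sided_tail(t):
    """Exact P(|Z| > t) for Z ~ N(0, 1), via the complementary error function."""
    return math.erfc(t / math.sqrt(2.0))

def leading_term(t):
    """Leading term of (7.9): 2 t^{-1} (2*pi)^{-1/2} exp(-t^2/2)."""
    return 2.0 / (t * math.sqrt(2.0 * math.pi)) * math.exp(-t * t / 2.0)

for t in (2.0, 3.0, 4.0):
    print(t, normal_two_sided_tail(t), leading_term(t))
```

The ratio of the two quantities tends to one as $t$ grows (it is already within about 6% at $t = 4$), which is what makes the $e^{-t_n^2/2}$ factor the driver of the rate in (7.10).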
Thus $d_0 = \exp(n^b)$ with $b = 1/3 - \epsilon$ for any $\epsilon > 0$ is the largest $d_0$ can be to have consistency. For instance, if there are exactly $d_0 = \exp(n^{1/4})$ irrelevant variables and we use the threshold $t_n = \sqrt{2}\, n^{1/7}$, the probability of selecting any of the irrelevant variables tends to zero at the rate
$$n^{-1/7} \exp\big(n^{1/4} - n^{2/7}\big).$$

7.2 Type II consistency
We now assume that for each relevant variable $X_j$ there is a least favorable point $x^{(1)}$ such that the local efficacy $s^{(1)}_j \equiv \hat\zeta_j(x^{(1)})$ is stochastically smaller than the importance score $s_j = M^{-1}\sum_{a=1}^{M} \hat\zeta_j(x^*_a)$. That is, we assume that for each relevant variable $X_j$ there exist $n_1$ and $x^{(1)}$ such that for $n \ge n_1$,
$$P\big(s^{(1)}_j \le t_n\big) \ge P(s_j \le t_n), \qquad j \le d_1. \qquad (7.11)$$
The existence of such a least favorable $x^{(1)}$ is established by considering the probability limit of
$$\arg\min\{\hat\zeta_j(x^*_a) : x^*_a \in \{x^*_1, \ldots, x^*_M\}\}.$$
We again assume that $R$ and $C$ are fixed, so that to show type II consistency it is enough to show that for each $(r, c)$ and $j \le d_1$,
$$P\big(|T^{(j)}_{rc}| \le t_n\big) \longrightarrow 0.$$
This is because
$$P\left(\sqrt{\sum_r \sum_c [T^{(j)}_{rc}]^2} \le t_n\right) \le P\Big(\min_{r,c}\, [T^{(j)}_{rc}]^2 \le t_n^2\Big) \le RC \max_{r,c} P\big(|T^{(j)}_{rc}| \le t_n\big).$$
By Theorem 6.2, $\sqrt{n}\big(s^{(1)}_j - \tau_j\big) \rightarrow_d N(0, \nu^2)$, $\tau_j > 0$. It follows that if $t_n$ and $d_1$ are bounded above as $n \rightarrow \infty$ and $\tau_j > 0$, then $P\big(s^{(1)}_j \le t_n\big) \rightarrow 0$; thus type II consistency holds in this case. To examine the more interesting case where $t_n$ and $d_1$ are not bounded above and $\tau_j$ may tend to zero as $n \rightarrow \infty$, we will need to center the $T^{(j)}_{rc}$. Thus set
$$\theta^{(j)}_n = \frac{\sqrt{n}\big(p^{(j)}_{rc} - p^{(j)}_r p^{(j)}_c\big)}{\sqrt{p^{(j)}_r p^{(j)}_c}}, \qquad (7.12)$$
$$T^{(j)}_n = T^{(j)}_{rc} - \theta^{(j)}_n,$$
where for simplicity the dependence of $T^{(j)}_n$ and $\theta^{(j)}_n$ on $r$ and $c$ is not indicated.
Lemma 7.1.
$$P\Big(\min_{j \le d_1} |T^{(j)}_{rc}| \le t_n\Big) \le \sum_{j=1}^{d_1} P\Big(|T^{(j)}_n| \ge \min_{j \le d_1} |\theta^{(j)}_n| - t_n\Big).$$

Proof.
$$P\Big(\min_{j \le d_1} |T^{(j)}_{rc}| \le t_n\Big) = P\Big(\min_{j \le d_1} |T^{(j)}_n + \theta^{(j)}_n| \le t_n\Big)$$
$$\le P\Big(\min_{j \le d_1} |\theta^{(j)}_n| - \max_{j \le d_1} |T^{(j)}_n| \le t_n\Big)$$
$$= P\Big(\max_{j \le d_1} |T^{(j)}_n| \ge \min_{j \le d_1} |\theta^{(j)}_n| - t_n\Big)$$
$$\le \sum_{j=1}^{d_1} P\Big(|T^{(j)}_n| \ge \min_{j \le d_1} |\theta^{(j)}_n| - t_n\Big). \qquad \square$$
Set
$$a_n = \min_{j \le d_1} \big|\theta^{(j)}_n\big| - t_n;$$
then by large deviation theory, if $a_n = o(n^{1/6})$,
$$P\big(|T^{(j)}_n| \ge a_n\big) \propto \left(\frac{a_n}{\sqrt{2}}\right)^{-1} \exp\{-a_n^2/2\}, \qquad (7.13)$$
where $\propto$ signifies asymptotic order as $n \rightarrow \infty$, $a_n \rightarrow \infty$. It follows that if (7.13) holds uniformly in $j$, then by Lemma 7.1,
$$P\Big(\min_{j \le d_1} \big|T^{(j)}_{rc}\big| \le t_n\Big) \propto d_1 \left(\frac{a_n}{\sqrt{2}}\right)^{-1} \exp\{-a_n^2/2\}.$$
Thus type II consistency holds when $a_n = o(n^{1/6})$ and
$$d_1 \left(\frac{a_n}{\sqrt{2}}\right)^{-1} \exp\{-a_n^2/2\} \longrightarrow 0. \qquad (7.14)$$
In Section 7.1, $t_n$ is of order $n^r$, $0 < r < 1/6$. For this $t_n$, type II consistency holds if $a_n$ is of order $n^r$, $0 < r < 1/6$, and
$$d_1 n^{-r} \exp\{-n^{2r}\} \longrightarrow 0.$$
If $a_n = O(n^k)$ with $k \ge 1/6$, then $P\big(|T^{(j)}_n| \ge a_n\big)$ tends to zero at a rate faster than in (7.13), and consistency is ensured. For instance, if all relevant variables $X_j$ have $p^{(j)}_{rc}$ fixed as $n \rightarrow \infty$, then $\min_{j \le d_1} \big|\theta^{(j)}_n\big|$ is of order $\sqrt{n}$ and $a_n$ is of order $\sqrt{n} - t_n$.
7.3 Type I and II consistency

We see from Sections 7.1 and 7.2 that consistency holds under a variety of conditions. One set of conditions is:

Theorem 7.1. Under the regularity conditions of Sections 7.1 and 7.2, if
$$d_0 = c_0 \exp(n^b), \quad t_n = c_1 n^r, \quad \min_{j \le d_1} \theta^{(j)} - t_n = c_2 n^r, \quad d_1 = c_3 \exp(n^b)$$
for positive constants $c_0, c_1, c_2, c_3$ and $0 < \frac{1}{2}b < r < \frac{1}{6}$, then CATCH is consistent.
Appendix

Proof of Theorem 6.1.

(1) Let $N_h = \sum_{i=1}^{n} I(X^{(i)} \in N_h(x^{(0)}))$. Then $N_h \sim \mathrm{Bi}\left(n, \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx\right)$ and $N_h \rightarrow_{a.s.} \infty$ as $n \rightarrow \infty$. Write
$$\hat\zeta(x^{(0)}, h) = n^{-1/2}\sqrt{\mathcal{X}^2(x^{(0)}, h)} = (N_h/n)^{1/2}\, (N_h)^{-1/2} \sqrt{\mathcal{X}^2(x^{(0)}, h)}.$$
Since $N_h \sim \mathrm{Bi}\left(n, \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx\right)$, we have
$$N_h/n \rightarrow \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx. \qquad (7.15)$$
By Lemma 6.1 and Table 1,
$$(N_h)^{-1/2} \sqrt{\mathcal{X}^2(x^{(0)}, h)} \rightarrow \left(\sum_{r=+,-} \sum_{c=1}^{C} \frac{(p_{rc} - p_r q_c)^2}{p_r q_c}\right)^{1/2}, \qquad (7.16)$$
where
$$p_{+c} = \Pr(X > x^{(0)}, Y = c \mid X \in N_h(x^{(0)})), \qquad p_{-c} = \Pr(X \le x^{(0)}, Y = c \mid X \in N_h(x^{(0)}))$$
for $r = +, -$ and $c = 1, \ldots, C$, and
$$p_r = \sum_{c=1}^{C} p_{rc}, \qquad q_c = \sum_{r=+,-} p_{rc}.$$
That is, $p_+ = \Pr(X > x^{(0)} \mid X \in N_h(x^{(0)}))$, $p_- = \Pr(X \le x^{(0)} \mid X \in N_h(x^{(0)}))$, and $q_c = \Pr(Y = c \mid X \in N_h(x^{(0)}))$. By (7.15) and (7.16),
$$\hat\zeta(x^{(0)}, h) \rightarrow \left(\int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx\right)^{1/2} \left(\sum_{r=+,-} \sum_{c=1}^{C} \frac{(p_{rc} - p_r q_c)^2}{p_r q_c}\right)^{1/2} \equiv \zeta(x^{(0)}, h).$$
(2) Since $X$ and $Y$ are independent, $p_{rc} = p_r q_c$ for $r = +, -$ and $c = 1, \ldots, C$; then $\zeta(x^{(0)}, h) = 0$ for any $h$. Hence
$$\zeta(x^{(0)}) = \sup_{h > 0} \zeta(x^{(0)}, h) = 0.$$

(3) Recall that $p_c(x^{(0)}) = \Pr(Y = c \mid x^{(0)})$. Without loss of generality, assume $p'_c(x^{(0)}) > 0$. Then there exists $h$ such that
$$\Pr(Y = c \mid x^{(0)} < X < x^{(0)} + h) > \Pr(Y = c \mid x^{(0)} - h < X < x^{(0)}),$$
which is equivalent to $p_{+c}/p_+ > p_{-c}/p_-$. This shows that $\zeta(x^{(0)}) > 0$, since by Lemma 6.1, $\zeta(x^{(0)}) = 0$ would imply $p_{+c}/p_+ = p_{-c}/p_- = p_c$. $\square$
References

[1] Breiman, L., 2001. Random forests. Machine Learning 45, 5–32.

[2] Breiman, L., Friedman, J. H., Olshen, R. A., Stone, C. J., 1984. Classification and Regression Trees. Wadsworth, Belmont.

[3] Doksum, K., Tang, S., Tsui, K.-W., 2008. Nonparametric variable selection: The EARTH algorithm. Journal of the American Statistical Association 103 (484), 1609–1620.

[4] Doksum, K. A., Samarov, A., 1995. Nonparametric estimation of global functionals and a measure of explanatory power of covariates in regression. Annals of Statistics 23, 1443–1473.

[5] Doksum, K. A., Schafer, C., 2006. Powerful choices: Tuning parameter selection based on power. In: Frontiers in Statistics: Dedicated to Peter Bickel (J. Fan and H. L. Koul, eds.). Imperial College Press, 113–141.

[6] Friedman, J., 1991. Multivariate adaptive regression splines. The Annals of Statistics 19, 1–67.

[7] Gao, J., Gijbels, I., 2008. Bandwidth selection in nonparametric kernel testing. Journal of the American Statistical Association 103 (484), 1584–1594.

[8] Hall, P., 1992. The Bootstrap and Edgeworth Expansion. Springer, New York.

[9] Huang, L.-S., Chen, J., 2008. Analysis of variance, coefficient of determination and F-test for local polynomial regression. Annals of Statistics 36, 2085–2109.

[10] Jing, B.-Y., Shao, Q.-M., Wang, Q., 2003. Self-normalized Cramér-type large deviations for independent random variables. Annals of Probability 31, 2167–2215.

[11] Lin, D. Y., Zeng, D., 2011. Correcting for population stratification in genomewide association studies. Journal of the American Statistical Association 106, 997–1008.

[12] Loh, W.-Y., 2009. Improving the precision of classification trees. The Annals of Applied Statistics 3, 1710–1737.

[13] Price, A., Patterson, N., Plenge, R., Weinblatt, M., Shadick, N., Reich, D., 2006. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38 (8), 904–909.

[14] Qiao, Z., Zhou, L., Huang, J., 2008. Linear discriminant analysis for high-dimensional, low-sample-size data. Special issue of World Congress on Engineering 2008.

[15] Quinlan, J. R., 1987. Simplifying decision trees. International Journal of Man-Machine Studies 27 (3), 221–234.

[16] Schafer, C., Doksum, K. A., 2009. Selecting local models in multiple regression by maximizing power. Metrika 69, 283–304.

[17] Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267–288.

[18] Wang, L., Shen, X., 2007. On L1-norm multi-class support vector machines: methodology and theory. Journal of the American Statistical Association 102, 583–594.

[19] Zhang, H. H., Liu, Y., Wu, Y., Zhu, J., 2008. Variable selection for the multicategory SVM via adaptive sup-norm regularization. Electronic Journal of Statistics 2, 149–167.
where $0/0 \equiv 0$ and
$$E^-_c(x^{(0)}, h) = \frac{n_-(x^{(0)}, h)\, n_c(x^{(0)}, h)}{n(x^{(0)}, h)}, \qquad E^+_c(x^{(0)}, h) = \frac{n_+(x^{(0)}, h)\, n_c(x^{(0)}, h)}{n(x^{(0)}, h)}. \qquad (2.3)$$

Due to the local restriction on $X$, the local contingency table might have some zero cells. However, this is not an issue for the $\chi^2$ statistic: it can detect local dependence between $Y$ and $X$ as long as there exist observations from at least two categories of $Y$ in the neighborhood of $x^{(0)}$. When all observations belong to the same category of $Y$, intuitively no dependence can be detected; in this case the $\chi^2$ statistic equals zero, which coincides with our intuition.
We call the neighborhood $N_h(x^{(0)})$ a section and maximize $\mathcal{X}^2(x^{(0)}, h)$ in (2.2) with respect to the section size $h$. Let plim denote limit in probability. Our local measure of association is $\zeta(x^{(0)})$, as given in:

Definition 1. Local Contingency Efficacy (for a numerical predictor). The local contingency section efficacy and the Local Contingency Efficacy (LCE) of $Y$ on a numerical predictor $X$ at the point $x^{(0)}$ are
$$\zeta(x^{(0)}, h) = \operatorname{plim}_{n\to\infty} n^{-1/2}\sqrt{\mathcal{X}^2(x^{(0)}, h)}, \qquad \zeta(x^{(0)}) = \sup_{h>0} \zeta(x^{(0)}, h). \qquad (2.4)$$

If $p_c(x)$ is constant in $x$ for all $c \in \{1, \ldots, C\}$ for $x$ near $x^{(0)}$, then as $n \rightarrow \infty$, $\mathcal{X}^2(x^{(0)}, h)$ converges in distribution to a $\chi^2_{C-1}$ variable, and in this case $\zeta(x^{(0)}, h) = 0$. On the other hand, if $p_c(x)$ is not constant near $x^{(0)}$, then $\zeta(x^{(0)}, h)$ determines the asymptotic power, for Pitman alternatives, of the test based on $\mathcal{X}^2(x^{(0)}, h)$ for testing whether $X$ is locally independent of $Y$, and is called the efficacy of this test. Moreover, $\zeta^2(x^{(0)}, h)$ is the asymptotic noncentrality parameter of the statistic $n^{-1}\mathcal{X}^2(x^{(0)}, h)$.

The estimators of $\zeta(x^{(0)}, h)$ and $\zeta(x^{(0)})$ are
$$\hat\zeta(x^{(0)}, h) = n^{-1/2}\sqrt{\mathcal{X}^2(x^{(0)}, h)}, \qquad \hat\zeta(x^{(0)}) = \sup_{h>0} \hat\zeta(x^{(0)}, h). \qquad (2.5)$$
We select the section size $h$ as
$$\hat h = \arg\max\{\hat\zeta(x^{(0)}, h) : h \in \{h_1, \ldots, h_g\}\}$$
for some grid $h_1, \ldots, h_g$. This corresponds to selecting $h$ by maximizing estimated power. See Doksum and Schafer [5], Gao and Gijbels [7], Doksum et al. [3], Schafer and Doksum [16].
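A minimal sketch (our own, not the authors' implementation) of the estimate (2.5) with grid maximization over $h$: within the section $|x - x^{(0)}| \le h$, cross-classify the sign of $x - x^{(0)}$ against the class label, form the Pearson chi-square statistic, and take the largest value of $n^{-1/2}\sqrt{\mathcal{X}^2}$ over the grid.

```python
# Sketch of the estimated local contingency efficacy (2.5) at a point x0,
# maximized over a grid of section sizes h.
import math

def chi2_table(table):
    """Pearson chi-square for a list-of-lists contingency table (empty margins skipped)."""
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    n = sum(row)
    chi2 = 0.0
    for i, r in enumerate(table):
        for j, o in enumerate(r):
            e = row[i] * col[j] / n
            if e > 0:
                chi2 += (o - e) ** 2 / e
    return chi2

def local_efficacy(xs, ys, x0, h_grid, classes):
    """sup over the grid of n^{-1/2} sqrt(chi2(x0, h)), cf. (2.4)-(2.5)."""
    n = len(xs)
    best = 0.0
    for h in h_grid:
        sect = [(x, y) for x, y in zip(xs, ys) if abs(x - x0) <= h]
        table = [[sum(1 for x, y in sect if (x > x0) == side and y == c)
                  for c in classes] for side in (False, True)]
        best = max(best, math.sqrt(chi2_table(table) / n) if sect else 0.0)
    return best

# Toy data: class 1 sits entirely to the right of x0 = 0.5, so the largest
# section gives a perfectly associated 2 x 2 table and efficacy 1.
xs = [i / 19 for i in range(20)]
ys = [0 if x <= 0.5 else 1 for x in xs]
print(local_efficacy(xs, ys, 0.5, [0.1, 0.25, 0.5], classes=[0, 1]))  # → 1.0
```

Larger sections capture more of the left/right class separation here, so the grid maximum is attained at $h = 0.5$.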
2.2 Categorical Predictor: Contingency Efficacy

Let $(X^{(i)}, Y^{(i)})$, $i = 1, \ldots, n$, be as before, except that $X \in \{1, \ldots, C'\}$ is a categorical predictor. Let $n(c, c')$ be the number of observations satisfying $Y = c$ and $X = c'$, and let
$$\mathcal{X}^2 = \sum_{c=1}^{C} \sum_{c'=1}^{C'} \frac{(n(c, c') - E(c, c'))^2}{E(c, c')}, \qquad (2.6)$$
where
$$E(c, c') = \frac{\left[\sum_{c=1}^{C} n(c, c')\right]\left[\sum_{c'=1}^{C'} n(c, c')\right]}{n}. \qquad (2.7)$$

Definition 2. Contingency Efficacy (for a categorical predictor). The Contingency Efficacy (CE) of $Y$ on a categorical predictor $X$ is
$$\zeta = \operatorname{plim}_{n\to\infty} n^{-1/2}\sqrt{\mathcal{X}^2}. \qquad (2.8)$$

Under the null hypothesis that $Y$ and $X$ are independent, $\mathcal{X}^2$ converges in distribution to a $\chi^2_{(C-1)(C'-1)}$ variable; in this case $\zeta = 0$. In general, $\zeta^2$ is the asymptotic noncentrality parameter of the statistic $n^{-1}\mathcal{X}^2$. The contingency efficacy $\zeta$ determines the asymptotic power of the test based on $\mathcal{X}^2$ for testing whether $Y$ and $X$ are independent. We use $\zeta$ as an importance score that measures how strongly $Y$ depends on $X$. The estimator of (2.8) is
$$\hat\zeta = n^{-1/2}\sqrt{\mathcal{X}^2}. \qquad (2.9)$$
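The estimate (2.9) can be sketched directly from the $C \times C'$ table of counts; the following is our own illustrative code, not the authors':

```python
# Sketch of the contingency efficacy estimate (2.9): zeta_hat = sqrt(chi2 / n)
# from the C x C' table of (Y, X) counts, with E(c, c') as in (2.7).
import math
from collections import Counter

def contingency_efficacy(ys, xs):
    n = len(ys)
    counts = Counter(zip(ys, xs))          # n(c, c')
    y_tot = Counter(ys)                    # row totals over c
    x_tot = Counter(xs)                    # column totals over c'
    chi2 = 0.0
    for c in y_tot:
        for cp in x_tot:
            e = y_tot[c] * x_tot[cp] / n   # E(c, c') as in (2.7)
            chi2 += (counts.get((c, cp), 0) - e) ** 2 / e
    return math.sqrt(chi2 / n)

# Perfectly associated toy data: Y == X, so chi2 = n and zeta_hat = 1.
ys = [0, 0, 1, 1, 0, 1]
xs = [0, 0, 1, 1, 0, 1]
print(contingency_efficacy(ys, xs))  # → 1.0
```

Under independence the score hovers near zero, while perfect association gives the maximal value 1.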
3 Variable Selection for Classification Problems: The CATCH Algorithm

Now we consider a multivariate $X = (X_1, \ldots, X_d)$, $d > 1$. To examine whether a predictor $X_l$ is related to $Y$ in the presence of all other variables, we calculate efficacies conditional on these variables, as well as unconditional or marginal efficacies.

3.1 Numerical Predictors

To examine the effect of $X_l$ in the local region of $x^{(0)} = (x^{(0)}_1, \ldots, x^{(0)}_d)$, we build a "tube" around the center point $x^{(0)}$ in which data points are within a certain radius $\delta$ of $x^{(0)}$ with respect to all variables except $X_l$. Define $X_{-l} = (X_1, \ldots, X_{l-1}, X_{l+1}, \ldots, X_d)^T$ to be the complementary vector to $X_l$. Let $D(\cdot, \cdot)$ be a distance in $\mathbb{R}^{d-1}$, which we will define later.

Definition 3. A tube $T_l$ of size $\delta$ ($\delta > 0$) for the variable $X_l$ at $x^{(0)}$ is the set
$$T_l \equiv T_l(x^{(0)}, \delta, D) \equiv \{x : D(x_{-l}, x^{(0)}_{-l}) \le \delta\}. \qquad (3.1)$$

We adjust the size of the tube so that the number of points in the tube is a preassigned number $k$. We find that $k \ge 50$ is in general a good choice. Note that $k = n$ corresponds to unconditional or marginal nonparametric regression.
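Choosing $\delta$ so that the tube holds exactly $k$ points amounts to taking the $k$ nearest observations in a distance that ignores coordinate $l$. A minimal sketch (ours, under a simple weighted $L_1$ distance) of Definition 3:

```python
# Illustrative sketch of Definition 3: the tube for X_l at center x0 keeps the
# k observations closest to x0 in a weighted L1 distance over all coordinates
# except l. The weights default to 1 (the adaptive weights come later, in (3.10)).

def tube(data, x0, l, k, weights=None):
    """Return the k rows of `data` nearest to x0, ignoring coordinate l."""
    d = len(x0)
    w = weights or [1.0] * d
    def dist(row):
        return sum(w[j] * abs(row[j] - x0[j]) for j in range(d) if j != l)
    return sorted(data, key=dist)[:k]

data = [(0.1, 0.9), (0.2, 0.5), (0.9, 0.51), (0.4, 0.1)]
# Tube for X_0 (l = 0) at center (0.5, 0.5): only the second coordinate counts,
# so the two nearest rows are those whose x2 is closest to 0.5, regardless of x1.
print(tube(data, (0.5, 0.5), l=0, k=2))  # → [(0.2, 0.5), (0.9, 0.51)]
```

Note that the first coordinate of the selected rows varies freely (0.2 vs. 0.9): inside the tube, $X_l$ is unconstrained, which is exactly what the next subsection exploits.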
Within the tube $T_l(x^{(0)}, \delta, D)$, $X_l$ is unconstrained. To measure the dependence of $Y$ on $X_l$ locally, we consider neighborhoods of $x^{(0)}$ along the direction of $X_l$ within the tube. Let
$$N_{lh\delta D}(x^{(0)}) = \{x = (x_1, \ldots, x_d)^T \in \mathbb{R}^d : x \in T_l(x^{(0)}, \delta, D),\ |x_l - x^{(0)}_l| \le h\} \qquad (3.2)$$
be a section of the tube $T_l(x^{(0)}, \delta, D)$. To measure how strongly $X_l$ relates to $Y$ in $N_{lh\delta D}(x^{(0)})$, we consider
$$\mathcal{X}^2_{l\delta D}(x^{(0)}, h), \qquad (3.3)$$
where (3.3) is defined by (2.2) using only the observations in the tube $T_l(x^{(0)}, \delta, D)$. Analogous to the definition of local contingency efficacy (2.4), we define:

Definition 4. The local contingency tube section efficacy and the local contingency tube efficacy of $Y$ on $X_l$ at the point $x^{(0)}$, and their estimates, are
$$\zeta(x^{(0)}, h, l) = \operatorname{plim}_{n\to\infty} n^{-1/2}\sqrt{\mathcal{X}^2_{l\delta D}(x^{(0)}, h)}, \qquad \zeta_l(x^{(0)}) = \sup_{h>0} \zeta(x^{(0)}, h, l), \qquad (3.4)$$
$$\hat\zeta(x^{(0)}, h, l) = n^{-1/2}\sqrt{\mathcal{X}^2_{l\delta D}(x^{(0)}, h)}, \qquad \hat\zeta_l(x^{(0)}) = \sup_{h>0} \hat\zeta(x^{(0)}, h, l). \qquad (3.5)$$
For an illustration of local contingency tube efficacy, consider $Y \in \{1, 2\}$ and suppose $\operatorname{logit} P(Y = 2 \mid x) = \left(x - \frac{1}{2}\right)^2$, $x \in [0, 1]$. Then the local efficacy with $h$ small will be large, while if $h \ge 1$ the efficacy will be zero.
3.2 Numerical and Categorical Predictors

If $X_l$ is categorical but some of the other predictors are numerical, we construct tubes $T_l(x^{(0)}, \delta, D)$ based on the numerical predictors, leaving the categorical variables unconstrained, and we use the statistic defined by (2.6) to measure the strength of the association between $Y$ and $X_l$ in the tube $T_l(x^{(0)}, \delta, D)$, using only the observations in the tube. We denote this version of (2.6) by
$$\mathcal{X}^2_{l\delta D}(x^{(0)}). \qquad (3.6)$$
Analogous to the definition of contingency efficacy given in (2.8), we define:

Definition 5. The local contingency tube efficacy of $Y$ on $X_l$ at $x^{(0)}$, and its estimate, are
$$\zeta_l(x^{(0)}) = \operatorname{plim}_{n\to\infty} n^{-1/2}\sqrt{\mathcal{X}^2_{l\delta D}(x^{(0)})}, \qquad \hat\zeta_l(x^{(0)}) = n^{-1/2}\sqrt{\mathcal{X}^2_{l\delta D}(x^{(0)})}. \qquad (3.7)$$
3.3 The Empirical Importance Scores

The empirical efficacies (3.5) and (3.7) measure how strongly $Y$ depends on $X_l$ near one point $x^{(0)}$. To measure the overall dependence of $Y$ on $X_l$, we randomly sample $M$ bootstrap points $x^*_a$, $a = 1, \ldots, M$, with replacement from $\{x^{(i)}, 1 \le i \le n\}$ and calculate the average of the local tube efficacy estimates at these points.

Definition 6. The CATCH empirical importance score for variable $l$ is
$$s_l = M^{-1} \sum_{a=1}^{M} \hat\zeta_l(x^*_a), \qquad (3.8)$$
where $\hat\zeta_l(x^*_a)$ is defined in (3.5) and (3.7) for numerical and categorical predictors, respectively.
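The averaging in (3.8) can be sketched as follows (our own code; `efficacy_at` is a hypothetical stand-in for the tube efficacy estimates (3.5)/(3.7)):

```python
# Sketch of the empirical importance score (3.8): average a local efficacy
# estimate over M bootstrap centers drawn with replacement from the data.
import random

def importance_score(data, efficacy_at, M, seed=0):
    rng = random.Random(seed)
    centers = [rng.choice(data) for _ in range(M)]   # bootstrap points x*_a
    return sum(efficacy_at(c) for c in centers) / M

data = [0.1, 0.4, 0.6, 0.9]
# With a constant stand-in efficacy the average is that constant.
score = importance_score(data, efficacy_at=lambda c: 1.0, M=100)
print(score)  # → 1.0
```

In the actual algorithm, `efficacy_at` would recompute the tube, the sections, and the chi-square statistic at each bootstrap center.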
3.4 Adaptive Tube Distances

Doksum et al. [3] used an example to demonstrate that the distance $D$ needs to be adaptive to keep variables strongly associated with $Y$ from obscuring the effect of weaker variables. Suppose $Y$ depends on three variables $X_1, X_2$ and $X_3$ among the larger set of predictors, with the strength of the dependency ranging from weak to strong; the strength will be estimated by the empirical importance scores. As we examine the effect of $X_2$ on $Y$, we want to define the tube so that there is relatively less variation in $X_3$ than in $X_1$ in the tube, because $X_3$ is more likely to obscure the effect of $X_2$ than $X_1$ is. To this end we introduce a tube distance for the variable $X_l$ of the form
$$D_l(x_{-l}, x^{(0)}_{-l}) = \sum_{j=1}^{d} w_j |x_j - x^{(0)}_j|\, I(j \ne l), \qquad (3.9)$$
where the weights $w_j$ adjust the contribution of strong variables to the distance so that the strong variables are more constrained.

The weights $w_j$ are proportional to $(s_j - s'_j)_+$ and are determined iteratively and adaptively as follows. In the first iteration, set $s_j = s^{(1)}_j = 1$ in (3.9) and compute the importance score $s^{(2)}_j$ using (3.8). For the second iteration we use the scores $s_j = s^{(2)}_j$ and the distance
$$D_l(x_{-l}, x^{(0)}_{-l}) = \frac{\sum_{j=1}^{d} (s_j - s'_j)_+\, |x_j - x^{(0)}_j|\, I(j \ne l)}{\sum_{j=1}^{d} (s_j - s'_j)_+\, I(j \ne l)}, \qquad (3.10)$$
where $0/0 \equiv 1$ and $s'_j$ is a threshold value for $s_j$ under the assumption of no conditional association of $X_j$ with $Y$, computed by a simple Monte Carlo technique. This adjustment is important because some predictors by definition have larger importance scores than others even when they are all independent of $Y$: for example, a categorical predictor with more levels has a larger importance score than one with fewer levels. Therefore the unadjusted scores $s_j$ cannot accurately quantify the relative importance of the predictors in the tube distance calculation. Next we use (3.8) to produce $s_j = s^{(3)}_j$, use $s^{(3)}_j$ in the next iteration, and so on.
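The adaptive distance (3.10) can be sketched as a small function (our own illustration, with hypothetical scores and null thresholds):

```python
# Sketch of the adaptive tube distance (3.10): coordinates whose scores exceed
# their null thresholds get weight (s_j - s'_j)_+, so stronger variables are
# more tightly constrained inside the tube; the 0/0 case is taken as 1.

def adaptive_distance(x, x0, l, s, s_null):
    d = len(x0)
    w = [max(s[j] - s_null[j], 0.0) for j in range(d)]   # (s_j - s'_j)_+
    num = sum(w[j] * abs(x[j] - x0[j]) for j in range(d) if j != l)
    den = sum(w[j] for j in range(d) if j != l)
    return num / den if den > 0 else 1.0                 # 0/0 ≡ 1 convention

s      = [0.9, 0.5, 0.1]   # hypothetical current importance scores
s_null = [0.2, 0.2, 0.2]   # hypothetical Monte Carlo null thresholds s'_j
x0 = (0.0, 0.0, 0.0)
# Distance for the tube of X_2 (l = 2): deviations in X_0 count more than in
# X_1 because (s_0 - s'_0)_+ = 0.7 outweighs (s_1 - s'_1)_+ = 0.3, and the
# coordinate X_2 itself is ignored entirely.
print(adaptive_distance((1.0, 0.0, 5.0), x0, l=2, s=s, s_null=s_null))
```

A variable whose score does not beat its null threshold gets weight zero and is left unconstrained, which is the intended behavior for apparently irrelevant predictors.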
In the definition above we need to define $|x_j - x^{(0)}_j|$ appropriately for a categorical variable $X_j$. Because $X_j$ is categorical, we need $|x_j - x^{(0)}_j|$ to be constant when $x_j$ and $x^{(0)}_j$ take any pair of different categories. One approach is to define $|x_j - x^{(0)}_j| = \infty \cdot I(x_j \ne x^{(0)}_j)$, in which case all points in the tube are given the same $X_j$ value as the tube center. This approach is a very sensible one, as it strictly implements the idea of conditioning on $X_j$ when evaluating the effect of another predictor $X_l$, so that $X_j$ does not contribute to any variation in $Y$. The problem with this definition, though, is that when the number of categories of $X_j$ is relatively large compared to the sample size, there will not be enough points in the tube if we restrict $X_j$ to a single category. In such a case we do not restrict $X_j$, which means that $X_j$ can take any category in the tube. We then define $|x_j - x^{(0)}_j| = k_0 I(x_j \ne x^{(0)}_j)$, where $k_0$ is a normalizing constant. The value of $k_0$ is chosen to achieve a balance between numerical variables and categorical variables. Suppose that $X_1$ is a numerical variable with standard deviation one and $X_2$ is a categorical variable. For $X_2$ we set $k_0 \equiv E(|X^{(1)}_1 - X^{(2)}_1|)$, where $X^{(1)}_1$ and $X^{(2)}_1$ are two independent realizations of $X_1$. Note that $k_0 = 2/\sqrt{\pi}$ if $X_1$ follows a standard normal distribution. Also, $k_0$ can be based on the empirical distributions of the numerical variables. Based on the preceding discussion of the choice of $k_0$, we use $k_0 = \infty$ in the simulations and $k_0 = 2/\sqrt{\pi}$ in the real data example.
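The constant $k_0 = E|X_1^{(1)} - X_1^{(2)}|$ is easy to approximate by Monte Carlo; the following sketch (ours) checks the standard normal value $2/\sqrt{\pi} \approx 1.1284$:

```python
# Monte Carlo estimate of k0 = E|X1 - X1'| for a standard normal variable,
# compared with the exact value 2/sqrt(pi).
import math
import random

rng = random.Random(42)
pairs = 200_000
k0_hat = sum(abs(rng.gauss(0, 1) - rng.gauss(0, 1)) for _ in range(pairs)) / pairs
print(k0_hat, 2 / math.sqrt(math.pi))
```

Replacing `rng.gauss` with draws from the empirical distribution of a numerical predictor gives the data-based version of $k_0$ mentioned above.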
Remark 3.1 (Choice of $k_0$). $k_0 = 2/\sqrt{\pi}$ is used by default, unless the sample size is large enough in the sense explained below. Suppose $k_0 = \infty$; then $D_l(x_{-l}, x^{(0)}_{-l})$ is finite only for those observations with $x_j = x^{(0)}_j$. When there are multiple categorical variables, $D_l$ is finite only for those points, denoted by the set $\Omega^{(0)}$, that satisfy $x_j = x^{(0)}_j$ for all categorical $x_j$. When there are many categorical variables, or when the total sample size is small, the size of this set can be very small or close to zero, in which case the tube cannot be well defined; therefore $k_0 = 2/\sqrt{\pi}$ is used, as in the real data example. On the other hand, when the size of $\Omega^{(0)}$ is larger than the specified tube size, we can use $k_0 = \infty$, as in the simulation examples.
3.5 The CATCH Algorithm

The variable selection algorithm for a classification problem is based on importance scores and an iteration procedure.

The Algorithm

(0) Standardizing. Standardize all the numerical predictors to have sample mean 0 and sample standard deviation 1 using linear transformations.

(1) Initializing Importance Scores. Let $S = (s_1, \ldots, s_d)$ be importance scores for the predictors. Let $S' = (s'_1, \ldots, s'_d)$ be such that $s'_l$ is a threshold importance score corresponding to the model where $X_l$ is independent of $Y$ given $X_{-l}$. Initialize $S$ to be $(1, \ldots, 1)$ and $S'$ to be $(0, \ldots, 0)$. Let $M$ be the number of tube centers and $m$ the number of observations within each tube.

(2) Tube Hunting Loop. For $l = 1, \ldots, d$, do (a), (b), (c) and (d) below.

(a) Selecting tube centers. For $0 < b \le 1$, set $M = [n^b]$, where $[\,\cdot\,]$ is the greatest integer function, and randomly sample $M$ bootstrap points $X^*_1, \ldots, X^*_M$ with replacement from the $n$ observations. Set $k = 1$; do (b).

(b) Constructing tubes. For $0 < a < 1$, set $m = [n^a]$. Using the tube distance $D_l$ defined in (3.10), select the tube size $\delta$ so that there are exactly $m \ge 10$ observations inside the tube for $X_l$ at $X^*_k$.

(c) Calculating efficacy. If $X_l$ is a numerical variable, let $\hat\zeta_l(X^*_k)$ be the estimated local contingency tube efficacy of $Y$ on $X_l$ at $X^*_k$ as defined in (3.5). If $X_l$ is a categorical variable, let $\hat\zeta_l(X^*_k)$ be the estimated contingency tube efficacy of $Y$ on $X_l$ at $X^*_k$ as defined in (3.7).

(d) Thresholding. Let $Y^*$ denote a random variable which is identically distributed as $Y$ but independent of $X$. Let $(Y^{(1)}_0, \ldots, Y^{(n)}_0)$ be a random permutation of $(Y^{(1)}, \ldots, Y^{(n)})$; then $Y_0 \equiv (Y^{(1)}_0, \ldots, Y^{(n)}_0)$ can be viewed as a realization of $Y^*$. Let $\hat\zeta^*_l(X^*_k)$ be the estimated local contingency tube efficacy of $Y_0$ on $X_l$ at $X^*_k$, calculated as in (c) with $Y_0$ in place of $Y$.

If $k < M$, set $k = k + 1$ and go to (b); otherwise go to (e).

(e) Updating importance scores. Set
$$s^{new}_l = \frac{1}{M} \sum_{k=1}^{M} \hat\zeta_l(X^*_k), \qquad s'^{\,new}_l = \frac{1}{M} \sum_{k=1}^{M} \hat\zeta^*_l(X^*_k).$$
Let $S^{new} = (s^{new}_1, \ldots, s^{new}_d)$ and $S'^{\,new} = (s'^{\,new}_1, \ldots, s'^{\,new}_d)$ denote the updates of $S$ and $S'$.

(3) Iterations and Stopping Rule. Repeat (2) $I$ times. Stop when the change in $S^{new}$ is small. Record the last $S$ and $S'$ as $S^{stop}$ and $S'^{\,stop}$, respectively.

(4) Deletion step. Using $S^{stop}$ and $S'^{\,stop}$, delete $X_l$ if $s^{stop}_l - s'^{\,stop}_l < \sqrt{2}\, SD(s'^{\,stop})$. If $d$ is small, we can generate several sets of $S'^{\,stop}$ and then use the sample standard deviation of all the values in these sets as $SD(s'^{\,stop})$. See Remark 3.3.

End of the algorithm.
Remark 3.2. Step (d) needs to be replaced by (d') below when a global permutation of $Y$ does not produce an appropriate "local null distribution". For example, when the sample sizes of the different classes are extremely unequal, which is known as an unbalanced classification problem, it is very likely that a local region contains a pure class of $Y$ after the global permutation; the threshold $S'$ will then be spuriously low and irrelevant variables will be falsely selected into the model. The local approach (d') is more nonparametric but more time consuming.

(d') Local Thresholding. Let $Y^*$ denote a random variable which is distributed as $Y$ but independent of $X_l$ given $X_{-l}$. Let $(Y^{(1)}_0, \ldots, Y^{(m)}_0)$ be a random permutation of the $Y$'s inside the tube; then $(Y^{(1)}_0, \ldots, Y^{(m)}_0)$ can be viewed as a realization of $Y^*$. If $X_l$ is a numerical variable, let $\hat\zeta'_l(X^*_k)$ be the estimated local contingency tube efficacy of $Y^*$ on $X_l$ at $X^*_k$ as defined in (3.5). If $X_l$ is a categorical variable, let $\hat\zeta'_l(X^*_k)$ be the estimated contingency tube efficacy of $Y^*$ on $X_l$ at $X^*_k$ as defined in (3.7).
Remark 3.3 (The threshold). Under the assumption $H^{(l)}_0$ that $Y$ is independent of $X_l$ given $X_{-l}$ in $T_l$, $s_l$ and $s'_l$ are nearly independent and $SD(s_l - s'_l) \approx \sqrt{2}\, SD(s'_l)$. Thus the rule that deletes $X_l$ when $s^{stop}_l - s'^{\,stop}_l < \sqrt{2}\, SE(s'^{\,stop})$ is similar to the one-standard-deviation rule of CART and GUIDE. The choice of threshold is also discussed in Section 7. Alternatively, we can calculate a p-value for each covariate by a permutation test. More specifically, we permute the response variable, obtain the importance scores for the permuted data, and then calculate how extreme $s^{stop}_l$ is compared to the permutation scores. We then identify a covariate as important if its p-value is less than 0.1.
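The permutation p-value alternative above can be sketched as follows (our own code; the toy `score_fn` is a hypothetical stand-in for the CATCH importance score of one covariate):

```python
# Sketch of the permutation p-value in Remark 3.3: permute the response,
# recompute the score, and report the fraction of permuted scores at least
# as large as the observed one.
import random

def permutation_pvalue(xs, ys, score_fn, B=200, seed=0):
    rng = random.Random(seed)
    observed = score_fn(xs, ys)
    ge = 0
    for _ in range(B):
        perm = ys[:]
        rng.shuffle(perm)
        if score_fn(xs, perm) >= observed:
            ge += 1
    return (ge + 1) / (B + 1)   # add-one correction for a valid p-value

# Toy score: absolute mean difference of x between the two classes.
def score_fn(xs, ys):
    g0 = [x for x, y in zip(xs, ys) if y == 0]
    g1 = [x for x, y in zip(xs, ys) if y == 1]
    return abs(sum(g1) / len(g1) - sum(g0) / len(g0))

xs = [0.0, 0.1, 0.2, 0.1, 2.0, 2.1, 1.9, 2.2]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
p = permutation_pvalue(xs, ys, score_fn)
print(p)
```

For this strongly separated toy covariate the p-value falls well below the 0.1 cutoff, so the covariate would be flagged as important.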
Remark 3.4 (Large $d$, small $n$). A key idea in this paper is that when determining the importance of the variable $X_j$, we restrict (condition) the vector $X_{-j}$ to a relatively small set. This is done to address the problem of spurious correlation, which is of great concern in genomics (Price et al. [13], Lin and Zeng [11]). In such genomic studies the number of predictors $d$ typically greatly exceeds the sample size $n$. For numerical variables this can be dealt with by replacing $X_{-j}$ with a vector of principal components explaining most of the variation in $X_{-j}$. Although our paper does not deal with the $d \gg n$ case directly, this remark shows that the approach is also applicable to the genome-wide association studies described in Price et al. [13].
Remark 3.5. We now consider the computational complexity of the algorithm. Recall the notation: $I$ is the number of repetitions, $M$ is the number of tube centers, $m$ is the tube size, $d$ is the dimension, and $n$ is the sample size. The number of operations needed for the algorithm is $I \times d \times M \times (2b + 2c + 2d) + 2e$, where $2b, 2c, 2d, 2e$ denote the numbers of operations required for steps (2b), (2c), (2d) and (2e). The computational complexities of steps (2b), (2c), (2d) and (2e) are $O(nd)$, $O(m)$, $O(m)$ and $O(M)$, respectively. Since $nd \gg m$, the required computation is dominated by step (2b), and the computational complexity of the algorithm is $O(IMnd^2)$. Solving a least squares problem requires $O(nd^2)$ operations, so our algorithm has higher computational complexity by a factor of $IM$ compared to least squares. However, the computations for the $M$ tube centers are independent of each other and can be performed in parallel.
4 Simulation Studies

4.1 Variable Selection with Categorical and Numerical Predictors

Example 4.1. Consider $d \ge 5$ predictors $X_1, X_2, \ldots, X_d$ and a categorical response variable $Y$. The fifth predictor $X_5$ is a Bernoulli random variable which takes the two values 0 and 1 with equal probability. All other predictors are uniformly distributed on $[0, 1]$.

When $X_5 = 0$, the dependent variable $Y$ is related to the predictors in the following fashion:
$$Y = \begin{cases} 0 & \text{if } X_1 - X_2 - 0.25\sin(16\pi X_2) \le -0.5, \\ 1 & \text{if } -0.5 < X_1 - X_2 - 0.25\sin(16\pi X_2) \le 0, \\ 2 & \text{if } 0 < X_1 - X_2 - 0.25\sin(16\pi X_2) \le 0.5, \\ 3 & \text{if } X_1 - X_2 - 0.25\sin(16\pi X_2) > 0.5. \end{cases} \qquad (4.1)$$

When $X_5 = 1$, $Y$ depends on $X_3$ and $X_4$ in the same way as above, with $X_1$ replaced by $X_3$ and $X_2$ replaced by $X_4$. The other variables $X_6, \ldots, X_d$ are irrelevant noise variables. All $d$ predictors are independent.

In addition to the presence of the noise variables $X_6, \ldots, X_d$, this example is challenging in two respects. First, there is an interaction effect between the continuous predictors $X_1, \ldots, X_4$ and the categorical predictor $X_5$. Such an interaction effect may arise in practice: for example, suppose $Y$ represents the effect resulting from multiple treatments $X_1, \ldots, X_4$ and $X_5$ represents the gender of a patient; our model allows males and females to respond differently to treatments. Second, the classification boundaries for fixed $X_5$ are highly nonlinear, as shown in Figure 1. This design demonstrates that our method can detect nonlinear dependence between the response and the predictors.
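The data-generating mechanism of Example 4.1 can be sketched directly from the displayed formulas (our own code, written from model (4.1)):

```python
# Sketch of the data-generating model (4.1): X5 switches which pair of
# continuous predictors drives the class label.
import math
import random

def label(x1, x2, x3, x4, x5):
    u, v = (x3, x4) if x5 == 1 else (x1, x2)   # X5 selects the active pair
    z = u - v - 0.25 * math.sin(16 * math.pi * v)
    if z <= -0.5:
        return 0
    if z <= 0:
        return 1
    if z <= 0.5:
        return 2
    return 3

def draw(d, rng):
    x = [rng.random() for _ in range(d)]       # uniforms on [0, 1]
    x[4] = rng.randrange(2)                    # X5 is Bernoulli(1/2)
    return x, label(x[0], x[1], x[2], x[3], x[4])

rng = random.Random(1)
sample = [draw(10, rng) for _ in range(5)]
print([y for _, y in sample])
```

Variables $X_6, \ldots, X_d$ are drawn but never enter `label`, which is exactly what makes them irrelevant noise variables for the selection methods compared in Table 2.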
Table 2 shows the number of simulations in which the predictors $X_1, \ldots, X_5$ are correctly kept in the model and, in column 6, the average number of simulations in which the $d - 5$ irrelevant predictors are falsely selected into the model, out of 100 simulation trials from model (4.1). In the simulation, $n = 400$ and $d = 10, 50$ and 100 are used. We compare CATCH with other methods, including Marginal CATCH, the Random Forest classifier (RFC) (Breiman [1]), and GUIDE (Loh [12]). For Marginal CATCH we test the association between $Y$ and $X_l$ without conditioning on $X_{-l}$. RFC is well known as a powerful tool for prediction; here we use importance scores from RFC for variable selection. We create an additional 30 noise variables and use their average importance scores multiplied by a factor of 2 as a threshold (Doksum et al. [3]) to decide whether a variable should be selected into the model. In particular, we generate the noise variables using uniform and Bernoulli distributions for numerical and categorical predictors, respectively.

The main goal of GUIDE is to solve the variable selection bias problem in regression and classification tree methods. As an illustration, if a categorical predictor takes $K$ distinct values, then the splitting test exhausts $2^{K-1} - 1$ possibilities and therefore is
Table 2: Simulation results from Example 4.1 with sample size n = 400. For 1 ≤ i ≤ 5, the ith column gives the estimated probability of selection of relevant variable Xi and the standard error based on 100 simulations. Column 6 gives the mean of the estimated probability of selection and the standard error for the irrelevant variables X6, ..., Xd.

                                Number of Selections (100 simulations)
METHOD           d    X1          X2          X3          X4          X5          Xi, i >= 6
CATCH           10    0.82(0.04)  0.75(0.04)  0.78(0.04)  0.82(0.04)  1(0)        0.05(0.02)
                50    0.76(0.04)  0.76(0.04)  0.87(0.03)  0.84(0.04)  0.96(0.02)  0.08(0.03)
               100    0.77(0.04)  0.73(0.04)  0.82(0.04)  0.8(0.04)   0.83(0.04)  0.09(0.03)
Marginal CATCH  10    0.77(0.04)  0.67(0.05)  0.7(0.05)   0.8(0.04)   0.06(0.02)  0.1(0.03)
                50    0.69(0.05)  0.7(0.05)   0.76(0.04)  0.79(0.04)  0.07(0.03)  0.1(0.03)
               100    0.78(0.04)  0.76(0.04)  0.73(0.04)  0.73(0.04)  0.12(0.03)  0.1(0.03)
Random Forest   10    0.71(0.05)  0.51(0.05)  0.47(0.05)  0.59(0.05)  0.37(0.05)  0.06(0.02)
                50    0.64(0.05)  0.61(0.05)  0.65(0.05)  0.58(0.05)  0.28(0.04)  0.08(0.03)
               100    0.66(0.05)  0.58(0.05)  0.59(0.05)  0.65(0.05)  0.17(0.04)  0.11(0.03)
GUIDE           10    0.96(0.02)  0.98(0.01)  0.93(0.03)  0.99(0.01)  0.92(0.03)  0.33(0.05)
                50    0.82(0.04)  0.85(0.04)  0.9(0.03)   0.89(0.03)  0.67(0.05)  0.12(0.03)
               100    0.83(0.04)  0.84(0.04)  0.86(0.03)  0.83(0.04)  0.61(0.05)  0.08(0.03)
Figure 1: The relationship between Y and (X1, X2) for model (4.1). The four areas represent the four values of Y.
biased toward being selected. The way that GUIDE alleviates this bias is to employ lack-of-fit tests, i.e., χ²-tests applicable to both numerical and categorical predictors, with the p-values adjusted by a bootstrap procedure. Since GUIDE has negligible selection bias, it can include tests for local pairwise interactions. Furthermore, it achieves sensitivity to local curvature by using a linear model at each node.
The results in Table 2 show that Marginal CATCH does a very poor job of selecting X5. This illustrates the importance of local conditioning. RFC does better than Marginal CATCH but also has difficulties, while CATCH and GUIDE successfully identify X5 along with X1 to X4 with high frequency. When d = 10, GUIDE does very well for the relevant variables but selects too many irrelevant variables. Surprisingly, GUIDE selects fewer irrelevant variables as d increases: it becomes more cautious as the dimension grows. CATCH performs very well in the high dimensional cases (d = 50 and d = 100). This indicates that the CATCH algorithm is robust to the intricate interaction structures between the predictors.
In model (4.1) above the predictors interact in a hierarchical way. When X5 = 0, the response is related to X1 and X2 as shown in Figure 1, where the two axes are the two relevant predictors and the four areas partitioned by the curves correspond to different values of Y. When X5 = 1, the response depends on X3 and X4 in the same way. This type of hierarchical interaction between the categorical and numerical predictors is difficult to detect even by state-of-the-art classification methods such as the support vector machine (SVM) and RFC. In particular, RFC produces an importance
score vector that identifies the four numerical variables X1, ..., X4 as very significant but often leaves the categorical variable X5 unidentified, as we see in Table 2.
The CATCH algorithm, however, performs very well for this complicated model. We now explain why CATCH is able to identify X5 as an important variable. To explore the association between X5 and Y, we construct M contingency tables of 2 rows and 4 columns based on M randomly centered tubes, calculate contingency efficacy scores, and then average the scores. Figure 2 (a) and (b) illustrate how one of the contingency efficacy scores is calculated. 1000 data points are generated from model (4.1) and split into two sets according to x5: (a) shows approximately half of the points, those with x5 = 0, in the (x1, x2) dimensions, and (b) shows the remaining points, those with x5 = 1, in the (x3, x4) dimensions. The data points are colored according to Y. This representation enables us to clearly see the classification boundaries (grey lines). The randomly selected tube center, whose x5 is zero, is shown as a star in (a). The data points within the tube, i.e., close to the tube center in terms of X−5, are highlighted by larger dots. Since the distance defining the tube does not involve X5, the larger dots have x5 equal to 0 or 1 and are present in both (a) and (b). The counts of the larger dots of different colors in (a) and (b) correspond to the cell counts of the two rows, x5 = 0 and x5 = 1, of the contingency table. As we can see, larger dots in (a) are mostly red and larger dots in (b) are mostly blue and orange, which shows that the distribution of Y depends on the value of X5, so the contingency efficacy score is high. We expect the contingency efficacy score to be low only if the tube center has similar values in (x1, x2) and (x3, x4). Such tube centers would be very few among the M tube centers since they are randomly chosen. Thus the average of the contingency efficacy scores is high and X5 is identified as an important variable.
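The tube-based score for X5 described above can be sketched as follows. This is a simplified stand-in, not the paper's implementation: the data generator uses an arbitrary nonlinear boundary in place of the exact model (4.1), the tube is a plain nearest-neighbor ball in X−5 rather than the adaptively weighted tube of the CATCH algorithm, and the score is the root chi-square statistic of the resulting 2 × C table.

```python
import numpy as np

rng = np.random.default_rng(1)

def generate(n):
    """Hierarchical-interaction data: Y depends on (X1, X2) when X5 = 0
    and on (X3, X4) in the same way when X5 = 1 (stand-in boundaries)."""
    X = rng.uniform(size=(n, 4))
    x5 = rng.integers(0, 2, size=n)
    u = np.where(x5 == 0, X[:, 0], X[:, 2])
    v = np.where(x5 == 0, X[:, 1], X[:, 3])
    y = 2 * (u + v > 1.0) + (np.sin(4 * u) > v)   # 4 classes, nonlinear boundary
    return X, x5, y

def tube_score_x5(X, x5, y, m=50):
    """Root chi-square score of the 2 x C table formed by the m points
    nearest (in X_{-5}) to a randomly chosen tube center."""
    center = X[rng.integers(len(X))]
    idx = np.argsort(((X - center) ** 2).sum(axis=1))[:m]   # tube ignores X5
    rows, cols = x5[idx], y[idx]
    classes = np.unique(y)
    n_rc = np.array([[np.sum((rows == r) & (cols == c)) for c in classes]
                     for r in (0, 1)])
    E = np.outer(n_rc.sum(1), n_rc.sum(0)) / m              # expected counts
    chi2 = np.sum((n_rc - E) ** 2 / np.where(E > 0, E, 1))
    return np.sqrt(chi2 / m)

X, x5, y = generate(1000)
scores = [tube_score_x5(X, x5, y) for _ in range(200)]      # M = 200 tubes
print(round(float(np.mean(scores)), 3))  # large averaged score flags X5 as relevant
```

Within a tube, the four coordinates of X−5 are nearly constant, so the class of a point is driven by its x5 value; the two rows of the table then have different class distributions and the chi-square statistic is large, exactly the mechanism described in the text.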
To contrast this example with a simpler data generating model, suppose the effects of the predictors are additive, that is, the categorical variable is related to the response as a term added to the effect of the numerical predictors. In this type of simpler model, the RFC and GUIDE algorithms can efficiently identify the relevant predictors, but so can the CATCH algorithm. We do not include the simulation results for additive models here.
Remark 4.1. We observe that the CATCH algorithm is not very sensitive to the choice of the number of tube centers M or the tube size m = [na], where 0 < a ≤ 1 and [·] is the greatest integer function. For Example 4.1 we perform the simulation using M = n and M = n/2 and a = 0.1, 0.15, 0.2, 0.3, 0.5. The results are summarized in Table 3. The CATCH algorithm performs similarly for M = n and M = n/2, with M = n having a small winning edge. We generally suggest using M = n if computational resources are of less concern or n is not large (≤ 200). When computational cost is a concern (Remark 3.5), one can reduce the number of tube centers to M = n/2 as a compromise between detection power and computational cost. Table 3 also shows that the tube size works well for a in the range between 0.1 and 0.3, while the detection power for X5 decreases in the case a = 0.5, where the tube size is too big to achieve local conditioning. This is a consistent
observation with the comparison to Marginal CATCH in Table 2. On the other hand, the tube size should not be too small, so that sufficient detection power is achieved. We suggest using a = 0.1 or 0.15 for sample sizes n ≥ 400, or m = 50 for smaller sample sizes. In light of these guidelines, we use M = n/2 = 200 and a = 0.15 for Examples 4.1-4.3, and M = n = 200 and m = 50 for Example 4.4.
Table 3: Sensitivity study of CATCH for Example 4.1 (n = 400, d = 50) using different numbers of tube centers M and tube sizes m = [na]. For 1 ≤ i ≤ 5, the ith column gives the number of times Xi was selected. Column 6 gives the percentage of times irrelevant variables were selected.

                  Number of Selections (100 simulations)
          a      X1   X2   X3   X4   X5   Xi, i >= 6 (%)
M = 200   0.10   75   67   77   69   92    7.8
          0.15   76   76   87   84   96    8.2
          0.20   85   84   90   87   93    9.9
          0.30   89   87   95   89   85    9.6
          0.50   89   90   96   93   62    9.9
M = 400   0.10   74   76   82   77   92    7.2
          0.15   85   82   89   87   94    8.6
          0.20   87   88   90   86   96    9.1
          0.30   88   89   91   91   92    9.4
          0.50   89   91   94   93   61   10.2
In Example 4.1 all predictors are independent. In a similar setting we next consider the case where the predictors are dependent.
Example 4.2. We generate standard normal random variables Zi for i = 1, ..., d in such a way that cor(Z1, Z4) = 0.5, cor(Z3, Z6) = 0.9, and the other Zi's are independent. We set Xi = Φ(Zi) for i = 1, ..., d, where Φ(·) is the CDF of a standard normal; therefore the Xi are marginally uniformly distributed as in Example 4.1. The correlation structure of the Xi's is similar to that of the Zi's. The response Y is generated in the same way as in Example 4.1.
The results of Example 4.2 are given in Table 4. Comparing with the results in Table 2, both CATCH and GUIDE do well in the presence of dependence between the numerical variables. The explanation is two-fold. First, both methods identify the categorical X5 as an important predictor most of the time, which makes the identification of the other relevant predictors possible. Second, conditioning on X5, Y depends on a pair of variables ("X1 and X2" or "X3 and X4") jointly, and in particular this dependence is nonlinear, which makes both methods less affected by the correlation introduced here.
Figure 2: Sample scatter plots of (x1, x2) and (x3, x4) for simulated data using model (4.1). The left panel, (a) X5 = 0, includes all the pairs (x1, x2) with x5 = 0, while the right panel, (b) X5 = 1, includes all the pairs (x3, x4) with x5 = 1. The larger dots in the two plots are within the tube around a tube center, shown as the star.
We now explain the latter in detail for both methods. In CATCH, the algorithm adaptively weighs the effects of predictors by their importance scores in defining the tubes when they are being conditioned on; therefore identifying either of the two will help to identify the other, whereas an irrelevant variable, though strongly associated with one of the relevant variables, is down-weighted iteratively and does not strongly affect the selection of the relevant variable. This explains the slightly decreasing selection probability of X3 in the presence of the strongly correlated X6 and, consequently, the overall lower selection of X3 and X4 compared to X1 and X2, given the correlation between X1 and X4. In GUIDE, the difference between the dependent and independent cases is even smaller because GUIDE is very powerful in detecting local interaction, by literally including pairwise interactions when selecting splitting variables.
Marginal CATCH and Random Forest do poorly in this case, mainly because X5 cannot be detected, as we explained earlier. More specifically, the dependence between Y and X4 in the X5 = 1 case is the same as that between Y and X2 in the X5 = 0 case. The linear terms of X1 and X2 have opposite signs in the decision function; therefore the marginal effects of X1 and X4 are obscured when X5 is not identified as an important variable.
4.2 Improving the Performance of SVM and RF Classifiers
The criterion used in the CATCH algorithm is not based on classification accuracy. But if we use CATCH to screen out the irrelevant variables and only use the important
Table 4: Simulation results from Example 4.2 with sample size n = 400. For 1 ≤ i ≤ 6, the ith column gives the estimated probability of selection of variable Xi and the standard error based on 100 simulations, where Xi, i = 1, ..., 5, are relevant variables with cor(X1, X4) = 0.5. X6 is an irrelevant variable that is correlated with X3, with correlation 0.9. Column 7 gives the mean of the estimated probability of selection and that of the standard error for the irrelevant variables X7, ..., Xd.

                                Number of Selections (100 simulations)
METHOD           d    X1          X2          X3          X4          X5          X6          Xi, i >= 7
CATCH           10    0.76(0.04)  0.77(0.04)  0.73(0.04)  0.68(0.05)  0.99(0.01)  0.34(0.05)  0.07(0.02)
                50    0.78(0.04)  0.77(0.04)  0.74(0.04)  0.63(0.05)  0.92(0.03)  0.57(0.05)  0.09(0.03)
               100    0.76(0.04)  0.77(0.04)  0.8(0.04)   0.74(0.04)  0.82(0.04)  0.55(0.05)  0.09(0.03)
Marginal CATCH  10    0.35(0.05)  0.63(0.05)  0.8(0.04)   0.32(0.05)  0.09(0.03)  0.71(0.05)  0.12(0.03)
                50    0.34(0.05)  0.71(0.05)  0.7(0.05)   0.3(0.05)   0.1(0.03)   0.6(0.05)   0.11(0.03)
               100    0.34(0.05)  0.68(0.05)  0.75(0.04)  0.3(0.05)   0.1(0.03)   0.59(0.05)  0.1(0.03)
Random Forest   10    0.29(0.05)  0.48(0.05)  0.55(0.05)  0.2(0.04)   0.3(0.05)   0.47(0.05)  0.05(0.02)
                50    0.26(0.04)  0.52(0.05)  0.57(0.05)  0.3(0.05)   0.24(0.04)  0.45(0.05)  0.08(0.03)
               100    0.36(0.05)  0.61(0.05)  0.66(0.05)  0.37(0.05)  0.32(0.05)  0.46(0.05)  0.12(0.03)
GUIDE           10    0.96(0.02)  0.96(0.02)  0.94(0.02)  0.95(0.02)  0.96(0.02)  0.84(0.04)  0.3(0.05)
                50    0.8(0.04)   0.92(0.03)  0.88(0.03)  0.79(0.04)  0.72(0.04)  0.68(0.05)  0.12(0.03)
               100    0.81(0.04)  0.89(0.03)  0.9(0.03)   0.76(0.04)  0.75(0.04)  0.58(0.05)  0.08(0.03)
variables for classification, the performance of other algorithms, such as the Support Vector Machine (SVM) and Random Forest Classifiers (RFC), can be significantly improved. In this section we compare the prediction performance of (1) SVM, (2) RFC, (3) CATCH followed by SVM, and (4) CATCH followed by RFC. We label (3) and (4) CATCH+SVM and CATCH+RFC, respectively. We use the "svm" function in the "e1071" package in R with the radial kernel to perform the SVM analysis, and the "randomForest" function in the "randomForest" package to perform the RFC analysis. We use simulation experiments to illustrate the improvement obtained by using CATCH prior to classifying Y.
Let the true model be

(X, Y) ~ P, (4.2)

where P will be specified later. In each of 100 Monte Carlo experiments, n pairs (X^{(i)}, Y^{(i)}), i = 1, ..., n, are generated from the true model (4.2). Based on the simulated data (X^{(i)}, Y^{(i)}), i = 1, ..., n, the methods (1) SVM, (2) RFC, (3) CATCH+SVM, and (4) CATCH+RFC are used to classify a future Y^{(0)} for which we have available a corresponding X^{(0)}.

The integrated misclassification rate (IMR; Friedman [6]) defined below is used to evaluate the performance of the classification algorithms. Let F_{X,Y}(·) be the classifier constructed by a classification algorithm based on X = (X^{(1)}, ..., X^{(n)}) and Y = (Y^{(1)}, ..., Y^{(n)})^T.

Let (X^{(0)}, Y^{(0)}) be a future observation with the same distribution as (X^{(i)}, Y^{(i)}). We only know X^{(0)} = x^{(0)} and want to classify Y^{(0)}. For a possible future observation (x^{(0)}, y^{(0)}), define

MR(X, Y, x^{(0)}, y^{(0)}) = P(F_{X,Y}(x^{(0)}) ≠ y^{(0)} | X, Y). (4.3)

Our criterion is

IMR = E E_0 [MR(X, Y, x^{(0)}, y^{(0)})], (4.4)

where E_0 is with respect to the distribution of the random (x^{(0)}, y^{(0)}) and E is with respect to the distribution of (X, Y).

This double expectation is evaluated using Monte Carlo simulation; the result is denoted IMR*. More precisely, E_0 is estimated using 5000 simulated (x^{(0)}, y^{(0)}) observations; IMR* is then computed on the basis of 100 simulated samples (X_i, Y_i), i = 1, ..., 100.
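The nested Monte Carlo evaluation of IMR described above can be sketched as follows. The sketch uses a trivial nearest-centroid classifier and a toy two-class model as stand-ins for SVM/RFC and the paper's models, and shrinks the replication counts from the paper's 100 x 5000 for speed; all names are ours.

```python
import numpy as np

rng = np.random.default_rng(2)

def draw(n):
    """Toy model P: Y = 1{X1 + noise > 0}, with one irrelevant coordinate."""
    X = rng.normal(size=(n, 2))
    y = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)
    return X, y

def fit_centroid(X, y):
    """Stand-in classifier F_{X,Y}: assign to the class with the nearer centroid."""
    c0, c1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    return lambda x: (np.linalg.norm(x - c1, axis=-1)
                      < np.linalg.norm(x - c0, axis=-1)).astype(int)

def imr_star(n=200, n_train_reps=20, n_future=1000):
    """Outer expectation E over training sets, inner E_0 over future draws."""
    rates = []
    for _ in range(n_train_reps):
        f = fit_centroid(*draw(n))              # classifier built on (X, Y)
        X0, y0 = draw(n_future)                 # future (x^(0), y^(0)) pairs
        rates.append(np.mean(f(X0) != y0))      # MR estimate for this training set
    return float(np.mean(rates))

print(imr_star())  # well below the 0.5 error rate of random guessing
```

Replacing `fit_centroid` with any other training routine, and `draw` with the model of interest, reproduces the evaluation scheme used for Figures 3 and 4.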
Example 4.3. First consider model (4.1) in Section 4.1. Figure 3 shows IMR* versus the number of irrelevant variables for this example. The left panel shows that the IMR*'s of the SVM approach increase dramatically as the number of irrelevant variables increases, while the IMR*'s of CATCH+SVM are stable. Thus CATCH stabilizes the
performance of SVM for this complex model. The right panel shows that CATCH also stabilizes the performance of RFC. These results indicate that CATCH can improve the prediction accuracy of other classification algorithms.
Figure 3: IMR*'s of SVM, RFC, CATCH+SVM, and CATCH+RFC versus the number of predictors d for the simulation study of model (4.1) in Example 4.1. The sample size is n = 400. The left panel compares the error rates of SVM and CATCH+SVM; the right panel compares RFC and CATCH+RFC.
Example 4.4 (Contaminated Logistic Model). Consider a binary response Y ∈ {0, 1} with

P(Y = 0 | X = x) = F(x) if r = 0, and G(x) if r = 1, (4.5)

where x = (x1, ..., xd),

r ~ Bernoulli(γ), (4.6)

F(x) = (1 + exp{−(0.25 + 0.5x1 + x2)})^{−1}, (4.7)

G(x) = 0.1 if 0.25 + 0.5x1 + x2 < 0, and 0.9 if 0.25 + 0.5x1 + x2 ≥ 0, (4.8)

and x1, x2, ..., xd are realizations of independent standard normal random variables. Hence, given X^{(i)} = x^{(i)}, 1 ≤ i ≤ n, the joint distribution of Y^{(1)}, Y^{(2)}, ..., Y^{(n)} is the contaminated logistic model with contamination parameter γ:
(1 − γ) ∏_{i=1}^n [F(x^{(i)})]^{Y^{(i)}} [1 − F(x^{(i)})]^{1−Y^{(i)}} + γ ∏_{i=1}^n [G(x^{(i)})]^{Y^{(i)}} [1 − G(x^{(i)})]^{1−Y^{(i)}}. (4.9)
We perform a simulation study for n = 200, d = 50, and γ = 0, 0.2, 0.4, 0.6, 0.8, and 1, where γ = 0 corresponds to no contamination. Figure 4 shows the IMR*'s versus the different values of γ for this example. CATCH improves the performance of the SVM and Random Forest classifiers for this contaminated logistic regression model.
Figure 4: IMR*'s of SVM, RFC, CATCH+SVM, and CATCH+RFC versus different contamination parameters γ for simulation model (4.9) in Example 4.4. The left panel compares the error rates of SVM and CATCH+SVM; the right panel compares RFC and CATCH+RFC.
5 CATCH Analysis of the ANN Data
In this section CATCH is applied to a data set available at the UCI data base (http://ftp.ics.uci.edu/pub/machine-learning-databases/thyroid-disease). The data are from a clinical trial to determine whether a patient is hypothyroid (see, e.g., Quinlan [15]). The dataset is split into two parts: a training set with 3772 observations and a testing set with 3428 observations. The response is a categorical variable with three classes: normal (not hypothyroid), hyperfunction, and subnormal functioning. There are 21 predictors, including 15 categorical predictors and 6 numerical predictors. The three classes are very imbalanced, with 92.5% normal patients. A good classifier should produce a significant decrease from the error rate of 7.5%, which can be obtained by classifying all observations into the normal class.
CATCH identifies 7 variables as important, including 4 continuous variables and 3 categorical variables. We consider three classification methods: 1. Support Vector Machine with radial kernel, denoted by SVM(r); 2. Support Vector Machine with polynomial kernel, denoted by SVM(p); 3. RFC. We apply these three methods to the training set to build classification models, and use the test set to estimate the misclassification rate of the methods. We apply the methods with CATCH and without CATCH, respectively, where "with CATCH" means that only the 7 important variables identified by CATCH are used to build the model and "without CATCH" means that all 21 variables are used in the analysis.
Table 5: Misclassification rates of three classification methods (SVM(r), SVM(p), RFC) with CATCH and without CATCH.

                 SVM(r)   SVM(p)   RFC
w/o CATCH        0.052    0.063    0.024
with CATCH       0.037    0.055    0.015
Table 5 shows that if CATCH is used to select important variables before the classification methods are applied, the misclassification rates decrease. CATCH reduces the misclassification rates by 29%, 13%, and 37% for SVM(r), SVM(p), and RFC, respectively. CATCH + RFC has the best performance. Although CATCH is not designed for improving classification accuracy, it can nonetheless help other classification methods to perform better.
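The relative reductions quoted above follow directly from the rates in Table 5; a quick check:

```python
rates = {"SVM(r)": (0.052, 0.037), "SVM(p)": (0.063, 0.055), "RFC": (0.024, 0.015)}
for method, (without, with_catch) in rates.items():
    reduction = 100 * (without - with_catch) / without
    print(f"{method}: {reduction:.1f}%")
# SVM(r): 28.8%, SVM(p): 12.7%, RFC: 37.5% (quoted as 29%, 13%, 37% in the text)
```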
6 Properties of Local Contingency Efficacy
In this section we investigate the properties of the local contingency efficacy. For a univariate continuous predictor, Theorem 6.1 below shows that the local contingency efficacy is well-defined and equals 0 if the response variable is independent of the predictor. Consider an R × C contingency table. Let n_{rc}, r = 1, ..., R, c = 1, ..., C, be the observed frequency in the (r, c)-cell and let p_{rc} be the probability that an observation belongs to the (r, c)-cell. Let

p_r = Σ_{c=1}^C p_{rc},   q_c = Σ_{r=1}^R p_{rc}, (6.1)
and

n_{r·} = Σ_{c=1}^C n_{rc},   n_{·c} = Σ_{r=1}^R n_{rc},   n = Σ_{r=1}^R Σ_{c=1}^C n_{rc},   E_{rc} = n_{r·} n_{·c} / n. (6.2)
Consider testing that the rows and the columns are independent, i.e.,

H0: p_{rc} = p_r q_c for all r, c   versus   H1: p_{rc} ≠ p_r q_c for some r, c; (6.3)

a standard test statistic is

X² = Σ_{r=1}^R Σ_{c=1}^C (n_{rc} − E_{rc})² / E_{rc}. (6.4)
Under H0, X² converges in distribution to χ²_{(R−1)(C−1)}. Moreover, defining 0/0 = 0, we have the well-known result:

Lemma 6.1. As n tends to infinity, X²/n tends in probability to

τ² ≡ Σ_{r=1}^R Σ_{c=1}^C (p_{rc} − p_r q_c)² / (p_r q_c).

Moreover, if p_r q_c is bounded away from 0 and 1 as n → ∞, and if R and C are fixed as n → ∞, then τ² = 0 if and only if H0 is true.
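Lemma 6.1 can be checked numerically: for a fixed cell-probability table, X²/n computed from multinomial counts settles near the population quantity τ², and τ² vanishes exactly when the table factorizes. This sketch uses our own helper names.

```python
import numpy as np

rng = np.random.default_rng(3)

def tau_squared(p):
    """Population limit of X^2/n for a table of cell probabilities p (Lemma 6.1)."""
    pr = p.sum(axis=1, keepdims=True)   # row marginals p_r
    qc = p.sum(axis=0, keepdims=True)   # column marginals q_c
    return float(np.sum((p - pr * qc) ** 2 / (pr * qc)))

def chi2_over_n(p, n):
    """X^2/n from a multinomial sample of size n with cell probabilities p."""
    counts = rng.multinomial(n, p.ravel()).reshape(p.shape).astype(float)
    E = np.outer(counts.sum(1), counts.sum(0)) / n      # E_rc of (6.2)
    return float(np.sum((counts - E) ** 2 / E) / n)     # statistic (6.4) over n

indep = np.outer([0.4, 0.6], [0.2, 0.3, 0.5])           # p_rc = p_r q_c
dep = np.array([[0.30, 0.10, 0.05], [0.05, 0.20, 0.30]])
print(tau_squared(indep) < 1e-12)   # True: factorized table gives tau^2 = 0
print(tau_squared(dep) > 0)         # True: dependence gives tau^2 > 0
print(abs(chi2_over_n(dep, 100000) - tau_squared(dep)) < 0.05)  # True: tolerance far exceeds sampling error at this n
```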
Remark 6.1. Lemma 6.1 shows that the contingency efficacy ζ in (2.8) is well defined. Moreover, ζ = 0 if and only if X and Y are independent.
Theorem 6.1. Consider a categorical response Y and a continuous predictor X with density function f(x) such that, for a fixed h > 0, f(x) > 0 for x ∈ (x^{(0)} − h, x^{(0)} + h). Then:

(1) For fixed h, the estimated local section efficacy ζ̂(x^{(0)}, h) defined in (2.5) converges almost surely to the section efficacy ζ(x^{(0)}, h) defined in (2.4).

(2) If X and Y are independent, then the local efficacy ζ(x^{(0)}) defined in (2.4) is zero.

(3) If there exists c ∈ {1, ..., C} such that p_c(x) = P(Y = c | x) has a continuous derivative at x = x^{(0)} with p′_c(x^{(0)}) ≠ 0, then ζ(x^{(0)}) > 0.
The χ² statistic (6.4) can be written in terms of multinomial proportions p̂_{rc}, with √n(p̂_{rc} − p_{rc}), 1 ≤ r ≤ R, 1 ≤ c ≤ C, converging in distribution to a multivariate normal. By a Taylor expansion of T_n ≡ √(χ²/n) at τ = plim √(χ²/n), we find that
Theorem 6.2.

√n(T_n − τ) →_d N(0, v²), (6.5)

where the asymptotic variance v², which can be obtained from the Taylor expansion, is not needed in this paper.
Remark 6.2. The result (6.5) applies to the multivariate χ²-based efficacies ζ̂, with n replaced by the number of observations that go into the computation of ζ̂. In this case we use the notation χ²_j for the chi-square statistic for the jth variable.
7 Consistency of CATCH

Let Y denote a variable to be classified into one of the categories 1, ..., C by using the Xj's in the vector X = (X1, ..., Xd). Let X−j ≡ X − Xj denote X without the Xj component. Assume that the squared Pearson nonparametric correlation (Doksum and Samarov [4], Huang and Chen [9]) between Xj and X−j is less than 0.5. We call the variable Xj relevant for Y if, conditionally given X−j, L(Y | Xj = x) is not constant in x, and irrelevant otherwise. Our data are (X^{(1)}, Y^{(1)}), ..., (X^{(n)}, Y^{(n)}), iid as (X, Y). A variable selection procedure is consistent if the probability that all the relevant variables are selected and none of the irrelevant variables are selected tends to one as n tends to infinity.

We give a general nonparametric framework in which the variable selection rule based on the CATCH importance scores sj given in (3.8) is consistent.

When the predictors are not all categorical and CATCH involves tubes, we assume that the number of observations in each tube tends to ∞ as n → ∞, and we assume that, for those importance scores that require the selection of a section size h, the selection is from a finite set {hj} with minj{hj} > h0 for some h0 > 0. It follows that the number of observations mjk in the kth tube used to calculate the importance score sj for the variable Xj tends to infinity as n → ∞.

The key quantities that determine consistency are the limits λj of sj. That is,

λj = τ(ζj, P) = E_P[ζj(X)] = ∫ ζj(x) dP(x), j = 1, ..., d,
where ζj(x) is the local contingency efficacy defined by (2.4) for numerical X's and by (2.8) for categorical X's, and P is the probability distribution of X. Our estimate sj of λj is

sj = τ(ζ̂j, P̂) = (1/M) Σ_{k=1}^M ζ̂j(X*_k),
where ζ̂j is defined in Section 3 and P̂ is the empirical distribution of X.

Let d0 denote the number of X's that are irrelevant for Y. Set d1 = d − d0 and, without loss of generality, reorder the λj so that
λj > 0 if j = 1, ..., d1;   λj = 0 if j = d1 + 1, ..., d. (7.1)
Our variable selection rule is

"keep variable Xj if sj ≥ t" for some threshold t. (7.2)

Rule (7.2) is consistent if and only if

P(min_{j≤d1} sj ≥ t and max_{j>d1} sj < t) → 1. (7.3)

To establish (7.3) it is enough to show that

P(max_{j>d1} sj > t) → 0 (7.4)

and

P(min_{j≤d1} sj ≤ t) → 0. (7.5)

We call (7.4) and (7.5) type I and type II consistency, respectively. Type I and II error probabilities refer to the probabilities of any false positives and any false negatives, respectively.
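Rule (7.2) and the two error events can be made concrete with simulated scores; the score distributions below are arbitrary stand-ins for the sampling behavior of the sj, chosen only so that the thresholding is visible.

```python
import random

random.seed(4)

def select(scores, t):
    """Rule (7.2): keep variable j when its importance score s_j >= t."""
    return [j for j, s in enumerate(scores) if s >= t]

def error_events(d1, d, t, reps=2000):
    """Estimate P(any false positive) and P(any false negative) under
    stand-in score distributions: relevant scores ~ U(0.5, 1),
    irrelevant scores ~ U(0, 0.2)."""
    fp = fn = 0
    for _ in range(reps):
        scores = ([random.uniform(0.5, 1.0) for _ in range(d1)]
                  + [random.uniform(0.0, 0.2) for _ in range(d - d1)])
        kept = set(select(scores, t))
        fp += any(j in kept for j in range(d1, d))       # type I error event
        fn += any(j not in kept for j in range(d1))      # type II error event
    return fp / reps, fn / reps

print(error_events(d1=5, d=50, t=0.3))  # (0.0, 0.0): t separates the two groups
```

The consistency results below ask exactly when such a separating threshold t (or sequence tn) exists as n and d grow.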
7.1 Type I consistency
We consider (7.4) first and note that

P(max_{j>d1} sj > t) ≤ Σ_{j=d1+1}^d P(sj > t). (7.6)

For the jth irrelevant variable Xj, j > d1, the results in Section 6 apply to the local efficacy ζ̂j(x) at a fixed point x. The importance score sj defined in (3.8) is the average of such efficacies over a sample of M random points x*_a. We assume that there is a point x^{(0)} such that, for irrelevant Xj's and for n large enough, s_j^{(0)} ≡ ζ̂j(x^{(0)}) is stochastically larger in the right tail than the average M^{−1} Σ_{a=1}^M ζ̂j(x*_a). That is, we assume that there exist t > 0 and n0 such that for n ≥ n0,

P(sj > t) ≤ P(s_j^{(0)} > t), j > d1. (7.7)

The existence of such a point x^{(0)} is established by considering the probability limit of

arg max{ζ̂j(x*_a) : x*_a ∈ {x*_1, ..., x*_M}}.

Note that by Lemma 6.1, s_j^{(0)} →_p 0 as n → ∞. It follows that we have established
(7.4) for t and d0 = d − d1 that are fixed as n → ∞. To examine the interesting case where d0 → ∞ as n → ∞, we need large deviation results, to which we turn next. In the d0 → ∞ case we consider t ≡ tn that tends to ∞ as n → ∞, and we use the weaker assumption: there exist x^{(0)} and n0 such that

P(sj > tn) ≤ P(s_j^{(0)} > tn) (7.8)

for n ≥ n0 and each tn with lim_{n→∞} tn = ∞. We also assume that the number of columns C and rows R of the contingency table that defines the efficacy χ²_{jn} given by (6.4) stays fixed as n → ∞. In this case P(s_j^{(0)} > tn) → 0 provided each term

[T_{rc}^{(j)}]² ≡ [(n_{rc} − E_{rc}) / (√n √(E_{rc}))]²

in s_j² = χ²_{jn}/n, defined for Xj by (6.4), satisfies

P(|T_{rc}^{(j)}| > tn) → 0.
This is because

P(√(Σ_r Σ_c [T_{rc}^{(j)}]²) > tn) ≤ P(RC max_{r,c} [T_{rc}^{(j)}]² > tn²) ≤ RC max_{r,c} P(|T_{rc}^{(j)}| > tn/√(RC)),
and tn and tn/√(RC) are of the same order as n → ∞. By large deviation theory (Hall [8] and Jing et al. [10]), for some constant A,

P(|T_{rc}^{(j)}| > tn) = (2 tn^{−1} / √(2π)) e^{−tn²/2} [1 + A tn³ n^{−1/2} + o(tn³ n^{−1/2})], (7.9)

provided tn = o(n^{1/6}). If (7.9) is uniform in j > d1, then (7.6) and (7.9) imply that
type I consistency holds if

d0 (tn/√2)^{−1} e^{−tn²/2} → 0. (7.10)
To examine (7.10), write d0 and tn as d0 = exp(n^b) and tn = √2 n^r. By large deviation theory, (7.10) holds when r < 1/6. Thus if 0 < (1/2)b < r < 1/6, then

d0 (tn/√2)^{−1} e^{−tn²/2} = n^{−r} exp(n^b − n^{2r}) → 0.
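As a numerical sanity check of this limit, take, say, b = 1/4 and r = 1/7, so that 2r = 2/7 > b and the negative term in the exponent eventually dominates:

```python
import math

def bound(n, b=0.25, r=1 / 7):
    """n^{-r} * exp(n^b - n^{2r}); tends to 0 when 2r > b."""
    return n ** (-r) * math.exp(n ** b - n ** (2 * r))

values = [bound(10 ** k) for k in (2, 4, 6, 8)]
print(values[0] > values[1] > values[2] > values[3])  # True: monotone decay
```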
Thus d0 = exp(n^b) with b = 1/3 − ε, for any ε > 0, is as large as d0 can be to have consistency.
For instance, if there are exactly d0 = exp(n^{1/4}) irrelevant variables and we use the threshold tn = √2 n^{1/7}, the probability of selecting any of the irrelevant variables tends to zero at the rate n^{−1/7} exp(n^{1/4} − n^{2/7}).

7.2 Type II consistency
We now assume that, for each relevant variable Xj, there is a least favorable point x^{(1)} such that the local efficacy s_j^{(1)} ≡ ζ̂(x^{(1)}) is stochastically smaller than the importance score sj = M^{−1} Σ_{a=1}^M ζ̂(x*_a). That is, we assume that for each relevant variable Xj there exist n1 and x^{(1)} such that for n ≥ n1,

P(s_j^{(1)} ≤ tn) ≥ P(sj ≤ tn), j ≤ d1. (7.11)
The existence of such a least favorable x^{(1)} is established by considering the probability limit of

arg min{ζ̂j(x*_a) : x*_a ∈ {x*_1, ..., x*_M}}.

We again assume that R and C are fixed, so that to show type II consistency it is enough to show that, for each (r, c) and j ≤ d1,

P(|T_{rc}^{(j)}| ≤ tn) → 0.
This is because

P(√(Σ_r Σ_c [T_{rc}^{(j)}]²) ≤ tn) ≤ P(min_{r,c} [T_{rc}^{(j)}]² ≤ tn²) ≤ RC max_{r,c} P(|T_{rc}^{(j)}| ≤ tn).
By Theorem 6.2, √n(s_j^{(1)} − τj) →_d N(0, ν²) with τj > 0. It follows that if tn and d1 are bounded above as n → ∞ and τj > 0, then P(s_j^{(1)} ≤ tn) → 0. Thus type II consistency holds in this case. To examine the more interesting case, where tn and d1 are not bounded above and τj may tend to zero as n → ∞, we need to center the T_{rc}^{(j)}. Thus set
θ_n^{(j)} = √n (p_{rc}^{(j)} − p_r^{(j)} p_c^{(j)}) / √(p_r^{(j)} p_c^{(j)}), (7.12)
T_n^{(j)} = T_{rc}^{(j)} − θ_n^{(j)},
where for simplicity the dependence of T_n^{(j)} and θ_n^{(j)} on r and c is not indicated.
Lemma 7.1.

P(min_{j≤d1} |T_{rc}^{(j)}| ≤ tn) ≤ Σ_{j=1}^{d1} P(|T_n^{(j)}| ≥ min_{j≤d1} |θ_n^{(j)}| − tn).

Proof.
P(min_{j≤d1} |T_{rc}^{(j)}| ≤ tn) = P(min_{j≤d1} |T_n^{(j)} + θ_n^{(j)}| ≤ tn)
  ≤ P(min_{j≤d1} |θ_n^{(j)}| − max_{j≤d1} |T_n^{(j)}| ≤ tn)
  = P(max_{j≤d1} |T_n^{(j)}| ≥ min_{j≤d1} |θ_n^{(j)}| − tn)
  ≤ Σ_{j=1}^{d1} P(|T_n^{(j)}| ≥ min_{j≤d1} |θ_n^{(j)}| − tn).
Set

an = min_{j≤d1} |θ_n^{(j)}| − tn;

then by large deviation theory, if an = o(n^{1/6}),

P(|T_n^{(j)}| ≥ an) ∝ (an/√2)^{−1} exp{−an²/2}, (7.13)
where ∝ signifies asymptotic order as n → ∞, an → ∞. It follows that if (7.13) holds uniformly in j, then by Lemma 7.1,

P(min_{j≤d1} |T_{rc}^{(j)}| ≤ tn) ∝ d1 (an/√2)^{−1} exp{−an²/2}.

Thus type II consistency holds when an = o(n^{1/6}) and

d1 (an/√2)^{−1} exp{−an²/2} → 0. (7.14)
In Section 7.1, tn is of order n^r, 0 < r < 1/6. For this tn, type II consistency holds if an is of order n^r, 0 < r < 1/6, and

d1 n^{−r} exp{−n^{2r}/2} → 0.
If an = O(n^k) with k ≥ 1/6, then P(|T_n^{(j)}| ≥ an) tends to zero at a rate faster than in (7.13) and consistency is ensured. For instance, if all relevant variables Xj have p_{rc}^{(j)} fixed as n → ∞, then min_{j≤d1} |θ_n^{(j)}| is of order √n and an is of order n^{1/2} − tn.
7.3 Type I and II consistency

We see from Sections 7.1 and 7.2 that consistency holds under a variety of conditions. One set of conditions is:

Theorem 7.1. Under the regularity conditions of Sections 7.1 and 7.2, if

d0 = c0 exp(n^b),  tn = c1 n^r,  min_{j≤d1} θ^{(j)} − tn = c2 n^r,  d1 = c3 exp(n^b)

for positive constants c0, c1, c2, and c3, and 0 < (1/2)b < r < 1/6, then CATCH is consistent.
Appendix: Proof of Theorem 6.1

(1) Let N_h = Σ_{i=1}^n I(X^{(i)} ∈ N_h(x^{(0)})). Then N_h ~ Bi(n, ∫_{x^{(0)}−h}^{x^{(0)}+h} f(x)dx) and N_h →_{a.s.} ∞ as n → ∞. We have

ζ̂(x^{(0)}, h) = n^{−1/2} √(X²(x^{(0)}, h)) = (N_h/n)^{1/2} (N_h)^{−1/2} √(X²(x^{(0)}, h)).

Since N_h ~ Bi(n, ∫_{x^{(0)}−h}^{x^{(0)}+h} f(x)dx), we have

N_h/n → ∫_{x^{(0)}−h}^{x^{(0)}+h} f(x)dx. (7.15)

By Lemma 6.1 and Table 1,

(N_h)^{−1/2} √(X²(x^{(0)}, h)) → (Σ_{r=+,−} Σ_{c=1}^C (p_{rc} − p_r q_c)² / (p_r q_c))^{1/2}, (7.16)

where

p_{+c} = Pr(X > x^{(0)}, Y = c | X ∈ N_h(x^{(0)})),  p_{−c} = Pr(X ≤ x^{(0)}, Y = c | X ∈ N_h(x^{(0)}))
for r = +, − and c = 1, ..., C,

p_r = Σ_{c=1}^C p_{rc},  q_c = Σ_{r=+,−} p_{rc}.

In fact, p_+ = Pr(X > x^{(0)} | X ∈ N_h(x^{(0)})), p_− = Pr(X ≤ x^{(0)} | X ∈ N_h(x^{(0)})), and q_c = Pr(Y = c | X ∈ N_h(x^{(0)})).

By (7.15) and (7.16),

ζ̂(x^{(0)}, h) → (∫_{x^{(0)}−h}^{x^{(0)}+h} f(x)dx)^{1/2} (Σ_{r=+,−} Σ_{c=1}^C (p_{rc} − p_r q_c)² / (p_r q_c))^{1/2} ≡ ζ(x^{(0)}, h).
(2) Since X and Y are independent, p_{rc} = p_r q_c for r = +, − and c = 1, ..., C, so ζ(x^{(0)}, h) = 0 for any h. Hence

ζ(x^{(0)}) = sup_{h>0} ζ(x^{(0)}, h) = 0.
(3) Recall that p_c(x^{(0)}) = Pr(Y = c | x^{(0)}). Without loss of generality assume p′_c(x^{(0)}) > 0. Then there exists h such that

Pr(Y = c | x^{(0)} < X < x^{(0)} + h) > Pr(Y = c | x^{(0)} − h < X < x^{(0)}),

which is equivalent to p_{+c}/p_+ > p_{−c}/p_−. This shows that ζ(x^{(0)}) > 0 since, by Lemma 6.1, ζ(x^{(0)}) = 0 would imply p_{+c}/p_+ = p_{−c}/p_− = p_c.
References

[1] Breiman, L., 2001. Random forests. Machine Learning 45, 5–32.

[2] Breiman, L., Friedman, J. H., Olshen, R. A., Stone, C. J., 1984. Classification and Regression Trees. Wadsworth, Belmont.

[3] Doksum, K., Tang, S., Tsui, K.-W., 2008. Nonparametric variable selection: The EARTH algorithm. Journal of the American Statistical Association 103 (484), 1609–1620.

[4] Doksum, K. A., Samarov, A., 1995. Nonparametric estimation of global functionals and a measure of the explanatory power of covariates in regression. Annals of Statistics 23, 1443–1473.

[5] Doksum, K. A., Schafer, C., 2006. Powerful choices: Tuning parameter selection based on power. In: Frontiers in Statistics: Dedicated to Peter Bickel (J. Fan and H. L. Koul, eds.), Imperial College Press, 113–141.

[6] Friedman, J., 1991. Multivariate adaptive regression splines. The Annals of Statistics 19, 1–67.

[7] Gao, J., Gijbels, I., 2008. Bandwidth selection in nonparametric kernel testing. Journal of the American Statistical Association 103 (484), 1584–1594.

[8] Hall, P., 1992. The Bootstrap and Edgeworth Expansion. Springer, New York.

[9] Huang, L.-S., Chen, J., 2008. Analysis of variance, coefficient of determination and F-test for local polynomial regression. Annals of Statistics 36, 2085–2109.

[10] Jing, B.-Y., Shao, Q.-M., Wang, Q., 2003. Self-normalized Cramér-type large deviations for independent random variables. Annals of Probability 31, 2167–2215.

[11] Lin, D. Y., Zeng, D., 2011. Correcting for population stratification in genomewide association studies. Journal of the American Statistical Association 106, 997–1008.

[12] Loh, W.-Y., 2009. Improving the precision of classification trees. The Annals of Applied Statistics 3, 1710–1737.

[13] Price, A., Patterson, N., Plenge, R., Weinblatt, M., Shadick, N., Reich, D., 2006. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38 (8), 904–909.

[14] Qiao, Z., Zhou, L., Huang, J., 2008. Linear discriminant analysis for high-dimensional, low-sample-size data. Special issue of World Congress on Engineering 2008.

[15] Quinlan, J. R., 1987. Simplifying decision trees. International Journal of Man-Machine Studies 27 (3), 221–234.

[16] Schafer, C., Doksum, K. A., 2009. Selecting local models in multiple regression by maximizing power. Metrika 69, 283–304.

[17] Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267–288.

[18] Wang, L., Shen, X., 2007. On L1-norm multi-class support vector machines: methodology and theory. Journal of the American Statistical Association 102, 583–594.

[19] Zhang, H. H., Liu, Y., Wu, Y., Zhu, J., 2008. Variable selection for the multicategory SVM via adaptive sup-norm regularization. Electronic Journal of Statistics 2, 149–167.
30
2.2 Categorical Predictor Contingency Efficacy
Let (X^{(i)}, Y^{(i)}), i = 1, ..., n, be as before except that X ∈ {1, ..., C'} is a categorical predictor. Let n(c, c') be the number of observations satisfying Y = c and X = c', and let

    X² = Σ_{c=1}^{C} Σ_{c'=1}^{C'} [n(c, c') − E(c, c')]² / E(c, c'),    (2.6)

where

    E(c, c') = [Σ_{c=1}^{C} n(c, c')] [Σ_{c'=1}^{C'} n(c, c')] / n.    (2.7)
Definition 2 (Contingency Efficacy, categorical predictor). The contingency efficacy (CE) of Y on a categorical predictor X is

    ζ = plim_{n→∞} n^{-1/2} √X².    (2.8)

Under the null hypothesis that Y and X are independent, X² converges in distribution to a χ²_{(C−1)(C'−1)} variable; in this case ζ = 0. In general, ζ² is the asymptotic noncentrality parameter of the statistic n^{-1}X². The contingency efficacy ζ determines the asymptotic power of the test based on X² for testing whether Y and X are independent. We use ζ as an importance score that measures how strongly Y depends on X. The estimator of (2.8) is

    ζ̂ = n^{-1/2} √X².    (2.9)
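Because (2.6)-(2.9) fully specify the estimator, ζ̂ can be computed in a few lines. The following is a minimal NumPy sketch; the function name and interface are ours, not the paper's:

```python
import numpy as np

def contingency_efficacy(y, x):
    """Estimated contingency efficacy (2.9): zeta-hat = n^{-1/2} sqrt(X^2),
    where X^2 is the Pearson chi-square statistic of the Y-by-X table (2.6)."""
    y, x = np.asarray(y), np.asarray(x)
    n = len(y)
    y_levels, x_levels = np.unique(y), np.unique(x)
    # Cell counts n(c, c') and expected counts E(c, c') from (2.7)
    counts = np.array([[np.sum((y == c) & (x == cp)) for cp in x_levels]
                       for c in y_levels], dtype=float)
    expected = counts.sum(1, keepdims=True) * counts.sum(0, keepdims=True) / n
    chi2 = np.sum((counts - expected) ** 2 / expected)
    return np.sqrt(chi2 / n)
```

For a perfectly dependent pair (x = y with two balanced classes) the score is 1; for an exactly balanced independent table it is 0.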
3 Variable Selection for Classification Problems: The CATCH Algorithm

Now we consider a multivariate X = (X_1, ..., X_d), d > 1. To examine whether a predictor X_l is related to Y in the presence of all other variables, we calculate efficacies conditional on these variables, as well as unconditional (marginal) efficacies.
3.1 Numerical Predictors

To examine the effect of X_l in a local region around x^{(0)} = (x_1^{(0)}, ..., x_d^{(0)}), we build a "tube" around the center point x^{(0)} in which data points are within a certain radius δ of x^{(0)} with respect to all variables except X_l. Define X_{−l} = (X_1, ..., X_{l−1}, X_{l+1}, ..., X_d)^T to be the complementary vector to X_l, and let D(·, ·) be a distance on R^{d−1}, which we will define later.

Definition 3. A tube T_l of size δ (δ > 0) for the variable X_l at x^{(0)} is the set

    T_l ≡ T_l(x^{(0)}, δ, D) ≡ {x : D(x_{−l}, x_{−l}^{(0)}) ≤ δ}.    (3.1)
We adjust the size of the tube so that the number of points in the tube equals a preassigned number k. We find that k ≥ 50 is in general a good choice. Note that k = n corresponds to unconditional (marginal) nonparametric regression.

Within the tube T_l(x^{(0)}, δ, D), X_l is unconstrained. To measure the dependence of Y on X_l locally, we consider neighborhoods of x^{(0)} along the direction of X_l within the tube. Let

    N_{l,h,δ,D}(x^{(0)}) = {x = (x_1, ..., x_d)^T ∈ R^d : x ∈ T_l(x^{(0)}, δ, D), |x_l − x_l^{(0)}| ≤ h}    (3.2)

be a section of the tube T_l(x^{(0)}, δ, D). To measure how strongly X_l relates to Y in N_{l,h,δ,D}(x^{(0)}), we consider

    X²_{l,δ,D}(x^{(0)}, h),    (3.3)

where (3.3) is defined by (2.2) using only the observations in the tube T_l(x^{(0)}, δ, D).
Analogous to the definition of local contingency efficacy (2.4), we define:

Definition 4. The local contingency tube section efficacy and the local contingency tube efficacy of Y on X_l at the point x^{(0)}, and their estimates, are

    ζ(x^{(0)}, h, l) = plim_{n→∞} n^{-1/2} √X²_{l,δ,D}(x^{(0)}, h),    ζ_l(x^{(0)}) = sup_{h>0} ζ(x^{(0)}, h, l),    (3.4)

    ζ̂(x^{(0)}, h, l) = n^{-1/2} √X²_{l,δ,D}(x^{(0)}, h),    ζ̂_l(x^{(0)}) = sup_{h>0} ζ̂(x^{(0)}, h, l).    (3.5)

For an illustration of the local contingency tube efficacy, consider Y ∈ {1, 2} and suppose logit P(Y = 2|x) = (x − 1/2)², x ∈ [0, 1]. Then the local efficacy with h small will be large, while if h ≥ 1 the efficacy will be zero.
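The sup over h in (3.5) can be approximated by a maximum over a grid of section half-widths. A minimal sketch follows; since the statistic (2.2) is not restated in this excerpt, we use an illustrative surrogate (binning X_l at the section median and crossing it with Y), which is our simplification, not the paper's exact statistic:

```python
import numpy as np

def section_efficacy(y, xl, x0l, h):
    """zeta-hat(x0, h, l): a root-chi-square score from the observations in
    the section |x_l - x0_l| <= h. The median binning of X_l is an
    illustrative stand-in for the section statistic (2.2)."""
    y, xl = np.asarray(y), np.asarray(xl)
    keep = np.abs(xl - x0l) <= h
    ys, xs = y[keep], xl[keep]
    n = len(ys)
    if n == 0:
        return 0.0
    bins = (xs > np.median(xs)).astype(int)
    chi2 = 0.0
    for c in np.unique(ys):
        for b in np.unique(bins):
            obs = np.sum((ys == c) & (bins == b))
            exp = np.sum(ys == c) * np.sum(bins == b) / n
            if exp > 0:
                chi2 += (obs - exp) ** 2 / exp
    return float(np.sqrt(chi2 / n))

def tube_efficacy(y, xl, x0l, h_grid):
    """zeta-hat_l(x0) of (3.5): the sup over h, approximated on a finite grid."""
    return max(section_efficacy(y, xl, x0l, h) for h in h_grid)
```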
3.2 Numerical and Categorical Predictors

If X_l is categorical but some of the other predictors are numerical, we construct tubes T_l(x^{(0)}, δ, D) based on the numerical predictors, leaving the categorical variables unconstrained, and we use the statistic defined by (2.6) to measure the strength of the association between Y and X_l in the tube T_l(x^{(0)}, δ, D), using only the observations in the tube. We denote this version of (2.6) by

    X²_{l,δ,D}(x^{(0)}).    (3.6)

Analogous to the definition of contingency efficacy given in (2.8), we define:

Definition 5. The local contingency tube efficacy of Y on X_l at x^{(0)} and its estimate are

    ζ_l(x^{(0)}) = plim_{n→∞} n^{-1/2} √X²_{l,δ,D}(x^{(0)}),    ζ̂_l(x^{(0)}) = n^{-1/2} √X²_{l,δ,D}(x^{(0)}).    (3.7)
3.3 The Empirical Importance Scores

The empirical efficacies (3.5) and (3.7) measure how strongly Y depends on X_l near one point x^{(0)}. To measure the overall dependence of Y on X_l, we randomly sample M bootstrap points x*_a, a = 1, ..., M, with replacement from {x^{(i)}, 1 ≤ i ≤ n} and average the local tube efficacy estimates at these points.

Definition 6. The CATCH empirical importance score for variable l is

    s_l = M^{-1} Σ_{a=1}^{M} ζ̂_l(x*_a),    (3.8)

where ζ̂_l(x*_a) is defined in (3.5) and (3.7) for numerical and categorical predictors, respectively.
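The averaging in (3.8) is straightforward once a local efficacy routine is available; here is a minimal sketch where `local_efficacy` is a placeholder for ζ̂_l evaluated at a center point:

```python
import numpy as np

def importance_score(points, local_efficacy, M, seed=0):
    """Empirical importance score s_l of (3.8): average the local tube
    efficacy estimate over M bootstrap centers drawn with replacement.
    `local_efficacy` is a stand-in for zeta-hat_l at a center point."""
    rng = np.random.default_rng(seed)
    centers = points[rng.integers(0, len(points), size=M)]
    return float(np.mean([local_efficacy(c) for c in centers]))
```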
3.4 Adaptive Tube Distances

Doksum et al. [3] used an example to demonstrate that the distance D needs to be adaptive to keep variables strongly associated with Y from obscuring the effect of weaker variables. Suppose Y depends on three variables X_1, X_2 and X_3 among the larger set of predictors, with the strength of the dependency ranging from weak to strong, as estimated by the empirical importance scores. As we examine the effect of X_2 on Y, we want to define the tube so that there is relatively less variation in X_3 than in X_1 inside the tube, because X_3 is more likely than X_1 to obscure the effect of X_2. To this end we introduce a tube distance for the variable X_l of the form

    D_l(x_{−l}, x_{−l}^{(0)}) = Σ_{j=1}^{d} w_j |x_j − x_j^{(0)}| I(j ≠ l),    (3.9)

where the weights w_j adjust the contribution of strong variables to the distance so that the strong variables are more constrained.

The weights w_j are proportional to s_j − s'_j and are determined iteratively and adaptively as follows. In the first iteration, set s_j = s_j^{(1)} = 1 in (3.9) and compute the importance scores s_j^{(2)} using (3.8). For the second iteration we use the weights s_j = s_j^{(2)} and the distance

    D_l(x_{−l}, x_{−l}^{(0)}) = [Σ_{j=1}^{d} (s_j − s'_j)_+ |x_j − x_j^{(0)}| I(j ≠ l)] / [Σ_{j=1}^{d} (s_j − s'_j)_+ I(j ≠ l)],    (3.10)

where 0/0 ≡ 1 and s'_j is a threshold value for s_j under the assumption of no conditional association of X_j with Y, computed by a simple Monte Carlo technique. This adjustment is important because some predictors by definition have larger importance scores than others even when they are all independent of Y; for example, a categorical predictor with more levels has a larger importance score than one with fewer levels. Therefore the unadjusted scores s_j cannot accurately quantify the relative importance of the predictors in the tube distance calculation. Next we use (3.8) to produce s_j = s_j^{(3)}, use s_j^{(3)} in the next iteration, and so on.
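The weighted distance (3.10) can be sketched directly; the function below follows the formula, including the 0/0 ≡ 1 convention:

```python
import numpy as np

def tube_distance(x, x0, s, s_thresh, l):
    """Adaptive tube distance D_l of (3.10): a weighted L1 distance over the
    coordinates j != l with weights (s_j - s'_j)_+, so stronger variables are
    held more tightly inside the tube; the 0/0 case is taken to be 1."""
    x, x0 = np.asarray(x, float), np.asarray(x0, float)
    w = np.clip(np.asarray(s, float) - np.asarray(s_thresh, float), 0.0, None)
    mask = np.arange(len(x)) != l
    denom = w[mask].sum()
    if denom == 0.0:
        return 1.0                      # the paper's 0/0 = 1 convention
    return float((w[mask] * np.abs(x - x0)[mask]).sum() / denom)
```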
In the definition above we need to define |x_j − x_j^{(0)}| appropriately for a categorical variable X_j. Because X_j is categorical, we assume that |x_j − x_j^{(0)}| is constant whenever x_j and x_j^{(0)} take any pair of different categories. One approach is to define |x_j − x_j^{(0)}| = ∞ · I(x_j ≠ x_j^{(0)}), in which case all points in the tube have the same X_j value as the tube center. This approach is very sensible, as it strictly implements the idea of conditioning on X_j when evaluating the effect of another predictor X_l, so that X_j does not contribute any variation in Y. The problem with this definition, though, is that when the number of categories of X_j is large relative to the sample size, there will not be enough points in the tube if we restrict X_j to a single category. In such a case we do not restrict X_j, which means that X_j can take any category in the tube. We then define |x_j − x_j^{(0)}| = k_0 I(x_j ≠ x_j^{(0)}), where k_0 is a normalizing constant chosen to achieve a balance between numerical and categorical variables. Suppose that X_1 is a numerical variable with standard deviation one and X_2 is a categorical variable. For X_2 we set k_0 ≡ E(|X_1^{(1)} − X_1^{(2)}|), where X_1^{(1)} and X_1^{(2)} are two independent realizations of X_1. Note that k_0 = √(2/π) if X_1 follows a standard normal distribution. Alternatively, k_0 can be based on the empirical distributions of the numerical variables. Based on the preceding discussion of the choice of k_0, we use k_0 = ∞ in the simulations and k_0 = √(2/π) in the real data example.

Remark 3.1 (choice of k_0). k_0 = √(2/π) is used unless the sample size is large enough in the sense explained below. Suppose k_0 = ∞; then D_l(x_{−l}, x_{−l}^{(0)}) is finite only for those observations with x_j = x_j^{(0)}. When there are multiple categorical variables, D_l is finite only for the points, denoted by the set Ω^{(0)}, that satisfy x_j = x_j^{(0)} for all categorical X_j. When there are many categorical variables or when the total sample size is small, this set can be very small or even empty, in which case the tube cannot be well defined; therefore k_0 = √(2/π) is used, as in the real data example. On the other hand, when the size of Ω^{(0)} is larger than the specified tube size, we can use k_0 = ∞, as in the simulation examples.
3.5 The CATCH Algorithm

The variable selection algorithm for a classification problem is based on importance scores and an iteration procedure.

The Algorithm

(0) Standardizing. Standardize all the numerical predictors to have sample mean 0 and sample standard deviation 1 using linear transformations.

(1) Initializing importance scores. Let S = (s_1, ..., s_d) be importance scores for the predictors. Let S' = (s'_1, ..., s'_d), where s'_l is a threshold importance score corresponding to the model in which X_l is independent of Y given X_{−l}. Initialize S to be (1, ..., 1) and S' to be (0, ..., 0). Let M be the number of tube centers and m the number of observations within each tube.

(2) Tube hunting loop. For l = 1, ..., d, do (a), (b), (c) and (d) below.

(a) Selecting tube centers. For 0 < b ≤ 1, set M = [n^b], where [·] is the greatest integer function, and randomly sample M bootstrap points X*_1, ..., X*_M with replacement from the n observations. Set k = 1 and do (b).

(b) Constructing tubes. For 0 < a < 1, set m = [n^a]. Using the tube distance D_l defined in (3.10), select the tube size δ so that there are exactly m (m ≥ 10) observations inside the tube for X_l at X*_k.

(c) Calculating efficacy. If X_l is a numerical variable, let ζ̂_l(X*_k) be the estimated local contingency tube efficacy of Y on X_l at X*_k as defined in (3.5). If X_l is a categorical variable, let ζ̂_l(X*_k) be the estimated contingency tube efficacy of Y on X_l at X*_k as defined in (3.7).

(d) Thresholding. Let Y* denote a random variable which is identically distributed as Y but independent of X. Let (Y_0^{(1)}, ..., Y_0^{(n)}) be a random permutation of (Y^{(1)}, ..., Y^{(n)}); then Y_0 ≡ (Y_0^{(1)}, ..., Y_0^{(n)}) can be viewed as a realization of Y*. Let ζ̂*_l(X*_k) be the estimated local contingency tube efficacy of Y_0 on X_l at X*_k, calculated as in (c) with Y_0 in place of Y.

If k < M, set k = k + 1 and go to (b); otherwise go to (e).

(e) Updating importance scores. Set

    s_l^{new} = (1/M) Σ_{k=1}^{M} ζ̂_l(X*_k),    s'_l^{new} = (1/M) Σ_{k=1}^{M} ζ̂*_l(X*_k).

Let S^{new} = (s_1^{new}, ..., s_d^{new}) and S'^{new} = (s'_1^{new}, ..., s'_d^{new}) denote the updates of S and S'.

(3) Iterations and stopping rule. Repeat (2) I times. Stop when the change in S^{new} is small. Record the last S and S' as S^{stop} and S'^{stop}, respectively.

(4) Deletion step. Using S^{stop} and S'^{stop}, delete X_l if s_l^{stop} − s'_l^{stop} < √2 SD(S'^{stop}). If d is small, we can generate several sets of S'^{stop} and use the sample standard deviation of all the values in these sets as SD(S'^{stop}); see Remark 3.3.

End of the algorithm.
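Steps (0)-(4) can be sketched in miniature. The version below is our simplification, not the paper's implementation: it uses equal initial weights, nearest-m tubes under the weighted L1 distance, a median-binned chi-square surrogate for the efficacy, and a single global permutation threshold per iteration:

```python
import numpy as np

def chi2_efficacy(y, xl):
    """Root-chi-square association score inside a tube; numerical X_l is
    binned at its median, an illustrative simplification of steps (c)/(d)."""
    y, xl = np.asarray(y), np.asarray(xl)
    n = len(y)
    bins = (xl > np.median(xl)).astype(int)
    chi2 = 0.0
    for c in np.unique(y):
        for b in np.unique(bins):
            obs = np.sum((y == c) & (bins == b))
            exp = np.sum(y == c) * np.sum(bins == b) / n
            if exp > 0:
                chi2 += (obs - exp) ** 2 / exp
    return np.sqrt(chi2 / n)

def catch_select(X, y, M=50, m=30, n_iter=3, seed=0):
    """Steps (0)-(4) in miniature: bootstrap tube centers, nearest-m tubes,
    a global permutation threshold, and the sqrt(2)*SD deletion rule."""
    rng = np.random.default_rng(seed)
    X = (X - X.mean(0)) / X.std(0)                 # step (0): standardize
    n, d = X.shape
    s, s_prime = np.ones(d), np.zeros(d)
    for _ in range(n_iter):                        # step (3): iterate
        w = np.clip(s - s_prime, 0.0, None)
        y_perm = rng.permutation(y)                # step (d): permuted response
        s_new, s_prime_new = np.zeros(d), np.zeros(d)
        for l in range(d):                         # step (2): tube hunting loop
            mask = np.arange(d) != l
            wl = w[mask] / max(w[mask].sum(), 1e-12)
            for k in rng.integers(0, n, size=M):   # (a) bootstrap centers
                dist = np.abs(X[:, mask] - X[k, mask]) @ wl
                tube = np.argsort(dist)[:m]        # (b) tube of m points
                s_new[l] += chi2_efficacy(y[tube], X[tube, l])             # (c)
                s_prime_new[l] += chi2_efficacy(y_perm[tube], X[tube, l])  # (d)
        s, s_prime = s_new / M, s_prime_new / M    # (e) update scores
    keep = (s - s_prime) >= np.sqrt(2) * np.std(s_prime)  # step (4)
    return np.flatnonzero(keep)
```

On data where Y is a function of one predictor, the sketch reliably retains that predictor.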
Remark 3.2. Step (d) needs to be replaced by (d') below when a global permutation of Y does not produce an appropriate "local null distribution". For example, when the sample sizes of the different classes are extremely unequal, which is known as an unbalanced classification problem, a local region is likely to contain a pure class of Y after the global permutation, so the threshold S' will be spuriously low and irrelevant variables will be falsely selected into the model. The local approach below is more nonparametric but more time consuming.

(d') Local thresholding. Let Y* denote a random variable which is distributed as Y but independent of X_l given X_{−l}. Let (Y_0^{(1)}, ..., Y_0^{(m)}) be a random permutation of the Y's inside the tube; then (Y_0^{(1)}, ..., Y_0^{(m)}) can be viewed as a realization of Y*. If X_l is a numerical variable, let ζ̂*_l(X*_k) be the estimated local contingency tube efficacy of Y* on X_l at X*_k as defined in (3.5). If X_l is a categorical variable, let ζ̂*_l(X*_k) be the estimated contingency tube efficacy of Y* on X_l at X*_k as defined in (3.7).
Remark 3.3 (The threshold). Under the hypothesis H_0^{(l)} that Y is independent of X_l given X_{−l} in T_l, s_l and s'_l are nearly independent and SD(s_l − s'_l) ≈ √2 SD(s'_l). Thus the rule that deletes X_l when s_l^{stop} − s'_l^{stop} < √2 SE(S'^{stop}) is similar to the one-standard-deviation rule of CART and GUIDE. The choice of threshold is also discussed in Section 7. Alternatively, we can calculate a p-value for each covariate by a permutation test: we permute the response variable, obtain the importance scores for the permuted data, and calculate how extreme s_l^{stop} is compared to the permuted scores. We then identify a covariate as important if its p-value is less than 0.1.
Remark 3.4 (Large d, small n). A key idea in this paper is that when determining the importance of the variable X_j, we restrict (condition) the vector X_{−j} to a relatively small set. This is done to address the problem of spurious correlation, which is of great concern in genomics (Price et al. [13], Lin and Zeng [11]). In such genomic studies the number of predictors d typically greatly exceeds the sample size n. For numerical variables this can be dealt with by replacing X_{−j} with a vector of principal components explaining most of the variation in X_{−j}. Although our paper does not deal with the d ≫ n case directly, this remark shows that the approach is also applicable to the genome-wide association studies described in Price et al. [13].
Remark 3.5. We now consider the computational complexity of the algorithm. Recall the notation: I is the number of repetitions, M is the number of tube centers, m is the tube size, d is the dimension, and n is the sample size. The number of operations needed for the algorithm is I × d × M × (2b + 2c + 2d) + 2e, where 2b, 2c, 2d, 2e denote the numbers of operations required for the corresponding steps. The computational complexities of steps (2b), (2c), (2d) and (2e) are O(nd), O(m), O(m) and O(M), respectively. Since nd ≫ m, the computation is dominated by step (2b), and the computational complexity of the algorithm is O(IMnd²). Solving a least squares problem requires O(nd²) operations, so our algorithm has higher computational complexity by a factor of IM compared to least squares. However, the computations for the M tube centers are independent of one another and can be performed in parallel.
4 Simulation Studies

4.1 Variable Selection with Categorical and Numerical Predictors
Example 4.1. Consider d ≥ 5 predictors X_1, X_2, ..., X_d and a categorical response variable Y. The fifth predictor X_5 is a Bernoulli random variable taking the values 0 and 1 with equal probability. All other predictors are uniformly distributed on [0, 1].

When X_5 = 0, the dependent variable Y is related to the predictors in the following fashion:

    Y = 0  if X_1 − X_2 − 0.25 sin(16πX_2) ≤ −0.5,
        1  if −0.5 < X_1 − X_2 − 0.25 sin(16πX_2) ≤ 0,
        2  if 0 < X_1 − X_2 − 0.25 sin(16πX_2) ≤ 0.5,
        3  if X_1 − X_2 − 0.25 sin(16πX_2) > 0.5.    (4.1)

When X_5 = 1, Y depends on X_3 and X_4 in the same way, with X_1 replaced by X_3 and X_2 replaced by X_4. The other variables X_6, ..., X_d are irrelevant noise variables. All d predictors are independent.
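A data generator for model (4.1) can be sketched directly from the display above (the boundary convention at the thresholds differs from (4.1) on a set of measure zero):

```python
import numpy as np

def generate_example_41(n, d, seed=0):
    """Simulate from model (4.1): X5 is Bernoulli(1/2) and switches which
    pair, (X1, X2) or (X3, X4), drives the class label Y in {0, 1, 2, 3}."""
    rng = np.random.default_rng(seed)
    X = rng.random((n, d))
    X[:, 4] = rng.integers(0, 2, size=n)           # X5 in {0, 1}
    a = np.where(X[:, 4] == 0, X[:, 0], X[:, 2])   # active pair chosen by X5
    b = np.where(X[:, 4] == 0, X[:, 1], X[:, 3])
    z = a - b - 0.25 * np.sin(16 * np.pi * b)
    y = np.digitize(z, [-0.5, 0.0, 0.5])           # thresholds of (4.1)
    return X, y
```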
In addition to the presence of the noise variables X_6, ..., X_d, this example is challenging in two respects. First, there is an interaction effect between the continuous predictors X_1, ..., X_4 and the categorical predictor X_5. Such interaction effects may arise in practice: for example, suppose Y represents the effect resulting from multiple treatments X_1, ..., X_4 and X_5 represents the gender of a patient; our model allows males and females to respond differently to the treatments. Second, the classification boundaries for fixed X_5 are highly nonlinear, as shown in Figure 1. This design demonstrates that our method can detect nonlinear dependence between the response and the predictors.
Table 2 shows the number of simulations, out of 100 simulation trials from model (4.1), in which the predictors X_1, ..., X_5 are correctly kept in the model, and in column 6 the average number of simulations in which the d − 5 irrelevant predictors are falsely selected into the model. In the simulation, n = 400 and d = 10, 50 and 100 are used. We compare CATCH with other methods, including Marginal CATCH, the Random Forest classifier (RFC) (Breiman [1]) and GUIDE (Loh [12]). For Marginal CATCH we test the association between Y and X_l without conditioning on X_{−l}. RFC is well known as a powerful tool for prediction; here we use importance scores from RFC for variable selection. We create 30 additional noise variables and use their average importance score multiplied by a factor of 2 as a threshold (Doksum et al. [3]) to decide whether a variable should be selected into the model. In particular, we generate the noise variables using uniform and Bernoulli distributions for numerical and categorical predictors, respectively.
The main goal of GUIDE is to solve the variable selection bias problem in regression and classification tree methods. As an illustration, if a categorical predictor takes K distinct values, then the splitting test exhausts 2^{K−1} possibilities and therefore is biased toward being selected. The way that GUIDE alleviates this bias is to employ lack-of-fit tests, i.e., χ²-tests applicable to both numerical and categorical predictors, with the p-values adjusted by a bootstrap procedure. Since GUIDE has negligible selection bias, it can include tests for local pairwise interactions. Furthermore, it achieves sensitivity to local curvature by fitting a linear model at each node.

Table 2: Simulation results from Example 4.1 with sample size n = 400. For 1 ≤ i ≤ 5, the ith column gives the estimated probability of selection of the relevant variable X_i and the standard error based on 100 simulations. Column 6 gives the mean of the estimated probability of selection and the standard error for the irrelevant variables X_6, ..., X_d.

                       Number of Selections (100 simulations)
METHOD           d    X1          X2          X3          X4          X5          Xi, i ≥ 6
CATCH            10   0.82(0.04)  0.75(0.04)  0.78(0.04)  0.82(0.04)  1(0)        0.05(0.02)
                 50   0.76(0.04)  0.76(0.04)  0.87(0.03)  0.84(0.04)  0.96(0.02)  0.08(0.03)
                 100  0.77(0.04)  0.73(0.04)  0.82(0.04)  0.8(0.04)   0.83(0.04)  0.09(0.03)
Marginal CATCH   10   0.77(0.04)  0.67(0.05)  0.7(0.05)   0.8(0.04)   0.06(0.02)  0.1(0.03)
                 50   0.69(0.05)  0.7(0.05)   0.76(0.04)  0.79(0.04)  0.07(0.03)  0.1(0.03)
                 100  0.78(0.04)  0.76(0.04)  0.73(0.04)  0.73(0.04)  0.12(0.03)  0.1(0.03)
Random Forest    10   0.71(0.05)  0.51(0.05)  0.47(0.05)  0.59(0.05)  0.37(0.05)  0.06(0.02)
                 50   0.64(0.05)  0.61(0.05)  0.65(0.05)  0.58(0.05)  0.28(0.04)  0.08(0.03)
                 100  0.66(0.05)  0.58(0.05)  0.59(0.05)  0.65(0.05)  0.17(0.04)  0.11(0.03)
GUIDE            10   0.96(0.02)  0.98(0.01)  0.93(0.03)  0.99(0.01)  0.92(0.03)  0.33(0.05)
                 50   0.82(0.04)  0.85(0.04)  0.9(0.03)   0.89(0.03)  0.67(0.05)  0.12(0.03)
                 100  0.83(0.04)  0.84(0.04)  0.86(0.03)  0.83(0.04)  0.61(0.05)  0.08(0.03)

Figure 1: The relationship between Y and X_1, X_2 for model (4.1). The four areas represent the four values of Y.
The results in Table 2 show that Marginal CATCH does a very poor job of selecting X_5, which illustrates the importance of local conditioning. RFC does better than Marginal CATCH but also has difficulties, while CATCH and GUIDE successfully identify X_5 along with X_1 to X_4 with high frequency. When d = 10, GUIDE does very well for the relevant variables but selects too many irrelevant variables. Surprisingly, GUIDE selects fewer irrelevant variables as d increases; it becomes more cautious as the dimension d grows. CATCH performs very well in the high-dimensional cases (d = 50 and d = 100), which indicates that the CATCH algorithm is robust to the intricate interaction structure among the predictors.
In model (4.1) above, the predictors interact in a hierarchical way. When X_5 = 0 the response is related to X_1 and X_2 as shown in Figure 1, where the two axes are the two relevant predictors and the four areas partitioned by the curves correspond to the different values of Y. When X_5 = 1 the response depends on X_3 and X_4 in the same way. This type of hierarchical interaction between categorical and numerical predictors is difficult to detect even by state-of-the-art classification methods such as the support vector machine (SVM) and RFC. In particular, RFC produces an importance score vector that identifies the four numerical variables X_1, ..., X_4 as very significant but often leaves the categorical variable X_5 unidentified, as we see in Table 2.
The CATCH algorithm, however, performs very well for this complicated model. We now explain why CATCH is able to identify X_5 as an important variable. To explore the association between X_5 and Y, we construct M contingency tables with 2 rows and 4 columns based on M randomly centered tubes, calculate contingency efficacy scores, and then average the scores. Figure 2(a) and (b) illustrate how one of the contingency efficacy scores is calculated: 1000 data points are generated from model (4.1) and split into two sets according to x_5. Panel (a) shows the (roughly half of the) points with x_5 = 0 in the (x_1, x_2) coordinates, and panel (b) shows the remaining points with x_5 = 1 in the (x_3, x_4) coordinates. The data points are colored according to Y; this representation lets us clearly see the classification boundaries (grey lines). A randomly selected tube center with x_5 = 0 is shown as a star in (a). The data points within the tube, i.e., close to the tube center in terms of X_{−5}, are highlighted by larger dots. Since the distance defining the tube does not involve X_5, the larger dots have x_5 equal to 0 or 1 and appear in both (a) and (b). The counts of the larger dots of different colors in (a) and (b) give the cell counts of the two rows x_5 = 0 and x_5 = 1 of the contingency table. As we can see, the larger dots in (a) are mostly red while those in (b) are mostly blue and orange, which shows that the distribution of Y depends on the value of X_5, so the contingency efficacy score is high. We would expect a low contingency efficacy score only if the tube center had similar values in (x_1, x_2) and (x_3, x_4); such tube centers are rare among the M randomly chosen centers. Thus the average contingency efficacy score is high and X_5 is identified as an important variable.
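The 2-by-4 local contingency table described above can be sketched as follows; the unweighted L1 distance is our simplification of the adaptive distance (3.10):

```python
import numpy as np

def local_table_for_x5(X, y, center, m=50):
    """Rows are the two levels of X5, columns the four classes, counted over
    the m points closest to the tube center in the coordinates X_{-5}."""
    mask = np.arange(X.shape[1]) != 4
    dist = np.abs(X[:, mask] - X[center, mask]).sum(axis=1)
    tube = np.argsort(dist)[:m]
    table = np.zeros((2, 4), dtype=int)
    for i in tube:
        table[int(X[i, 4]), int(y[i])] += 1
    return table
```

A strong row contrast in such a table is exactly what produces a high contingency efficacy score for X_5.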
To contrast this example with a simpler data-generating model, suppose the effects of the predictors are additive, that is, the categorical variable enters the response as a term added to the effect of the numerical predictors. For this type of simpler model the RFC and GUIDE algorithms can efficiently identify the relevant predictors, but so can the CATCH algorithm. We do not include the simulation results for additive models here.
Remark 4.1. We observe that the CATCH algorithm is not very sensitive to the choice of the number of tube centers M or the tube size m = [n^a], where 0 < a ≤ 1 and [·] is the greatest integer function. For Example 4.1 we perform the simulation using M = n and M = n/2, and a = 0.1, 0.15, 0.2, 0.3, 0.5. The results are summarized in Table 3. The CATCH algorithm performs similarly for M = n and M = n/2, with M = n having a small winning edge. We generally suggest using M = n if computational resources are less of a concern or n is not large (≤ 200). When computational cost is a concern (Remark 3.5), one can reduce the number of tube centers to M = n/2 as a compromise between detection power and computational cost. Table 3 also shows that the tube size works well in the range a = 0.1 to 0.3, while the detection power for X_5 decreases for a = 0.5, where the tube is too big to achieve local conditioning; this observation is consistent with the comparison to Marginal CATCH in Table 2. On the other hand, the tube size should not be too small, so as to retain sufficient detection power. We suggest using a = 0.1 or 0.15 for sample sizes n ≥ 400, or m = 50 for smaller sample sizes. In light of these guidelines, we use M = n/2 = 200 and a = 0.15 for Examples 4.1-4.3, and M = n = 200 and m = 50 for Example 4.4.
Table 3: Sensitivity study of CATCH for Example 4.1 (n = 400, d = 50) using different numbers of tube centers M and tube sizes m = [n^a]. For 1 ≤ i ≤ 5, the ith column gives the number of times X_i was selected. Column 6 gives the percentage of times irrelevant variables were selected.

          Number of Selections (100 simulations)
M         a     X1  X2  X3  X4  X5  Xi, i ≥ 6
M = 200   0.10  75  67  77  69  92   7.8
          0.15  76  76  87  84  96   8.2
          0.20  85  84  90  87  93   9.9
          0.30  89  87  95  89  85   9.6
          0.50  89  90  96  93  62   9.9
M = 400   0.10  74  76  82  77  92   7.2
          0.15  85  82  89  87  94   8.6
          0.20  87  88  90  86  96   9.1
          0.30  88  89  91  91  92   9.4
          0.50  89  91  94  93  61  10.2
In Example 4.1 all predictors are independent. In a similar setting, we next consider the case where the predictors are dependent.
Example 4.2. We generate standard normal random variables Z_i, i = 1, ..., d, in such a way that cor(Z_1, Z_4) = 0.5, cor(Z_3, Z_6) = 0.9, and the other Z_i's are independent. We set X_i = Φ(Z_i), i = 1, ..., d, where Φ(·) is the CDF of a standard normal, so each X_i is marginally uniformly distributed as in Example 4.1. The correlation structure of the X_i's is similar to that of the Z_i's. The dependent variable Y is generated in the same way as in Example 4.1.
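The Gaussian-copula construction of Example 4.2 can be sketched as follows (the function name and interface are ours):

```python
import math
import numpy as np

def correlated_uniforms(n, d, pairs, seed=0):
    """Example 4.2 construction: X_i = Phi(Z_i), where Z is multivariate
    normal with unit variances, cor(Z_i, Z_j) = rho for each listed
    (i, j, rho) pair, and the remaining coordinates independent; each X_i is
    marginally Uniform(0, 1)."""
    rng = np.random.default_rng(seed)
    cov = np.eye(d)
    for i, j, rho in pairs:
        cov[i, j] = cov[j, i] = rho
    Z = rng.multivariate_normal(np.zeros(d), cov, size=n)
    # Standard normal CDF via the error function
    Phi = np.vectorize(lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))
    return Phi(Z)
```

The paper's setting corresponds to `pairs = [(0, 3, 0.5), (2, 5, 0.9)]` in 0-based indexing.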
The results for Example 4.2 are given in Table 4. Comparing with the results in Table 2, both CATCH and GUIDE do well in the presence of dependence between the numerical variables. The explanation is two-fold. First, both methods identify the categorical X_5 as an important predictor most of the time, which makes the identification of the other relevant predictors possible. Second, conditionally on X_5, Y depends on a pair of variables ("X_1 and X_2" or "X_3 and X_4") jointly, and in particular this dependence is nonlinear, which makes both methods less affected by the correlation introduced here.
Figure 2: Sample scatter plots of (x_1, x_2) and (x_3, x_4) for simulated data from model (4.1). The left panel (a) includes all the pairs (x_1, x_2) with x_5 = 0, while the right panel (b) includes all the pairs (x_3, x_4) with x_5 = 1. The larger dots in the two plots are within the tube around a tube center, shown as the star.
We now explain the latter point in detail for both methods. In CATCH, the algorithm adaptively weighs the effects of the predictors by their importance scores when defining the tubes they are conditioned on. Therefore identifying either variable of a relevant pair helps to identify the other, whereas an irrelevant variable, though strongly associated with one of the relevant variables, is down-weighted iteratively and does not strongly affect the selection of the relevant variable. This explains the slightly decreasing selection probability for X_3 in the presence of the strongly correlated X_6, and consequently the overall lower selection of X_3 and X_4 compared to X_1 and X_2, given the correlation between X_1 and X_4. In GUIDE, the difference between the dependent and independent cases is even smaller, because GUIDE is very powerful in detecting local interactions by explicitly including pairwise interactions when selecting splitting variables.
Marginal CATCH and Random Forest do poorly in this case, mainly because X_5 cannot be detected, as we explained earlier. More specifically, the dependence between Y and X_4 in the X_5 = 1 case is the same as that between Y and X_2 in the X_5 = 0 case. The linear terms in X_1 and X_2 have opposite signs in the decision function, so the marginal effects of X_1 and X_4 are obscured when X_5 is not identified as an important variable.
4.2 Improving the Performance of SVM and RF Classifiers

The criterion used in the CATCH algorithm is not based on classification accuracy. But if we use CATCH to screen out the irrelevant variables and use only the important variables for classification, the performance of other algorithms such as the support vector machine (SVM) and random forest classifiers (RFC) can be significantly improved. In this section we compare the prediction performance of (1) SVM, (2) RFC, (3) CATCH followed by SVM, and (4) CATCH followed by RFC; we label (3) and (4) CATCH+SVM and CATCH+RFC, respectively. We use the "svm" function in the "e1071" package in R with the radial kernel to perform the SVM analysis, and the "randomForest" function in the "randomForest" package to perform the RFC analysis. We use simulation experiments to illustrate the improvement obtained by using CATCH prior to classifying Y.

Table 4: Simulation results from Example 4.2 with sample size n = 400. For 1 ≤ i ≤ 6, the ith column gives the estimated probability of selection of variable X_i and the standard error based on 100 simulations, where X_i, i = 1, ..., 5, are relevant variables with cor(X_1, X_4) = 0.5. X_6 is an irrelevant variable that is correlated with X_3 with correlation 0.9. Column 7 gives the mean of the estimated probability of selection and that of the standard error for the irrelevant variables X_7, ..., X_d.

                       Number of Selections (100 simulations)
METHOD           d    X1          X2          X3          X4          X5          X6          Xi, i ≥ 7
CATCH            10   0.76(0.04)  0.77(0.04)  0.73(0.04)  0.68(0.05)  0.99(0.01)  0.34(0.05)  0.07(0.02)
                 50   0.78(0.04)  0.77(0.04)  0.74(0.04)  0.63(0.05)  0.92(0.03)  0.57(0.05)  0.09(0.03)
                 100  0.76(0.04)  0.77(0.04)  0.8(0.04)   0.74(0.04)  0.82(0.04)  0.55(0.05)  0.09(0.03)
Marginal CATCH   10   0.35(0.05)  0.63(0.05)  0.8(0.04)   0.32(0.05)  0.09(0.03)  0.71(0.05)  0.12(0.03)
                 50   0.34(0.05)  0.71(0.05)  0.7(0.05)   0.3(0.05)   0.1(0.03)   0.6(0.05)   0.11(0.03)
                 100  0.34(0.05)  0.68(0.05)  0.75(0.04)  0.3(0.05)   0.1(0.03)   0.59(0.05)  0.1(0.03)
Random Forest    10   0.29(0.05)  0.48(0.05)  0.55(0.05)  0.2(0.04)   0.3(0.05)   0.47(0.05)  0.05(0.02)
                 50   0.26(0.04)  0.52(0.05)  0.57(0.05)  0.3(0.05)   0.24(0.04)  0.45(0.05)  0.08(0.03)
                 100  0.36(0.05)  0.61(0.05)  0.66(0.05)  0.37(0.05)  0.32(0.05)  0.46(0.05)  0.12(0.03)
GUIDE            10   0.96(0.02)  0.96(0.02)  0.94(0.02)  0.95(0.02)  0.96(0.02)  0.84(0.04)  0.3(0.05)
                 50   0.8(0.04)   0.92(0.03)  0.88(0.03)  0.79(0.04)  0.72(0.04)  0.68(0.05)  0.12(0.03)
                 100  0.81(0.04)  0.89(0.03)  0.9(0.03)   0.76(0.04)  0.75(0.04)  0.58(0.05)  0.08(0.03)
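The screen-then-classify pipeline is classifier-agnostic. The paper runs SVM and RFC in R (e1071, randomForest); here is a generic Python sketch in which a simple nearest-centroid rule is a hypothetical stand-in for those classifiers:

```python
import numpy as np

def screen_then_classify(X_train, y_train, X_test, selected):
    """CATCH+classifier pipeline: keep only the CATCH-selected columns, then
    fit any classifier on them; a nearest-centroid rule stands in here for
    the SVM / random forest classifiers used in the paper."""
    Xs, Xt = X_train[:, selected], X_test[:, selected]
    classes = np.unique(y_train)
    centroids = np.array([Xs[y_train == c].mean(axis=0) for c in classes])
    d2 = ((Xt[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[np.argmin(d2, axis=1)]
```

Screening first matters because distance-based classifiers degrade as irrelevant coordinates are added; restricting to the selected columns removes that noise before fitting.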
Let the true model be

    (X, Y) ~ P,    (4.2)

where P will be specified later. In each of 100 Monte Carlo experiments, n pairs (X^{(i)}, Y^{(i)}), i = 1, ..., n, are generated from the true model (4.2). Based on the simulated data (X^{(i)}, Y^{(i)}), i = 1, ..., n, the methods (1) SVM, (2) RFC, (3) CATCH+SVM and (4) CATCH+RFC are used to classify a future Y^{(0)} for which a corresponding X^{(0)} is available.
The integrated misclassification rate (IMR; Friedman [6]), defined below, is used to evaluate the performance of the classification algorithms. Let F_{X,Y}(·) be the classifier constructed by a classification algorithm based on X = (X^(1), …, X^(n)) and Y = (Y^(1), …, Y^(n))ᵀ.
Let (X^(0), Y^(0)) be a future observation with the same distribution as (X^(i), Y^(i)). We only know X^(0) = x^(0) and want to classify Y^(0). For a possible future observation (x^(0), y^(0)), define
MR(X, Y, x^(0), y^(0)) = P(F_{X,Y}(x^(0)) ≠ y^(0) | X, Y).  (4.3)
Our criterion is
IMR = E E₀[MR(X, Y, x^(0), y^(0))],  (4.4)

where E₀ is taken with respect to the distribution of the random (x^(0), y^(0)) and E with respect to the distribution of (X, Y).
This double expectation is evaluated using Monte Carlo simulation; the result is denoted IMR*. More precisely, E₀ is estimated using 5000 simulated (x^(0), y^(0)) observations, and IMR* is then computed on the basis of 100 simulated samples (X_i, Y_i), i = 1, …, 100.
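The Monte Carlo approximation of (4.4) can be sketched in code. This is an illustrative stand-in, not the authors' implementation: `gen_data` is a hypothetical sampler for the true model (4.2), and `fit` is the classification algorithm producing F_{X,Y}.

```python
import numpy as np

def imr_star(gen_data, fit, n=400, n_future=5000, n_reps=100, rng=None):
    """Monte Carlo approximation of IMR* as described in the text:
    E_0 is estimated from n_future simulated (x0, y0) observations, and the
    outer expectation E from n_reps simulated training samples (X, Y)."""
    rng = np.random.default_rng(rng)
    rates = []
    for _ in range(n_reps):                  # outer expectation over (X, Y)
        X, y = gen_data(n, rng)
        clf = fit(X, y)                      # classifier F_{X,Y}
        X0, y0 = gen_data(n_future, rng)     # inner expectation over (x0, y0)
        rates.append(np.mean(clf.predict(X0) != y0))
    return float(np.mean(rates))
```

A perfect classifier on a noiseless model yields IMR* = 0, which gives a quick sanity check of the estimator.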
Example 4.3. First consider model (4.1) in Section 4.1. Figure 3 shows IMR* versus the number of irrelevant variables for this example. The left panel shows that the IMR*'s of the SVM approach increase dramatically as the number of irrelevant variables increases, while the IMR*'s of CATCH+SVM are stable. Thus CATCH stabilizes the
performance of SVM for this complex model. The right panel shows that CATCH also stabilizes the performance of RFC. These results indicate that CATCH can improve the prediction accuracy of other classification algorithms.
Figure 3: IMR*'s of SVM, RFC, CATCH+SVM, and CATCH+RFC versus the number of predictors for the simulation study of model (4.1) in Example 4.1. The sample size is n = 400.
Example 4.4 (Contaminated Logistic Model). Consider a binary response Y ∈ {0, 1} with

P(Y = 0 | X = x) = F(x) if r = 0, and G(x) if r = 1,  (4.5)
where x = (x1, …, xd),

r ∼ Bernoulli(γ),  (4.6)

F(x) = (1 + exp{−(0.25 + 0.5x1 + x2)})⁻¹,  (4.7)
G(x) = 0.1 if 0.25 + 0.5x1 + x2 < 0, and 0.9 if 0.25 + 0.5x1 + x2 ≥ 0,  (4.8)
and x1, x2, …, xd are realizations of independent standard normal random variables. Hence, given X^(i) = x^(i), 1 ≤ i ≤ n, the joint distribution of Y^(1), Y^(2), …, Y^(n) is the contaminated logistic model with contamination parameter γ:
(1 − γ) ∏_{i=1}^{n} [F(x^(i))]^{Y^(i)} [1 − F(x^(i))]^{1−Y^(i)} + γ ∏_{i=1}^{n} [G(x^(i))]^{Y^(i)} [1 − G(x^(i))]^{1−Y^(i)}.  (4.9)
We perform a simulation study for n = 200, d = 50, and γ = 0, 0.2, 0.4, 0.6, 0.8, and 1, where γ = 0 corresponds to no contamination. Figure 4 shows the IMR*'s versus the different values of γ for this example. CATCH improves the performance of the SVM and Random Forest classifiers for this contaminated logistic regression model.
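A data generator for the model (4.5)-(4.9) can be sketched as follows; the function name and interface are our own, and a single Bernoulli(γ) draw selects between the two whole-sample likelihoods, as in (4.9):

```python
import numpy as np

def sample_contaminated_logistic(n=200, d=50, gamma=0.2, rng=None):
    """Draw (X, Y) from the contaminated logistic model (4.5)-(4.9):
    with probability 1 - gamma the sample follows the logistic F (4.7),
    with probability gamma the two-valued contamination G (4.8)."""
    rng = np.random.default_rng(rng)
    X = rng.standard_normal((n, d))             # x1, ..., xd iid N(0, 1)
    eta = 0.25 + 0.5 * X[:, 0] + X[:, 1]        # only x1, x2 are relevant
    F = 1.0 / (1.0 + np.exp(-eta))              # P(Y = 0 | x) under (4.7)
    G = np.where(eta < 0, 0.1, 0.9)             # P(Y = 0 | x) under (4.8)
    r = rng.random() < gamma                    # r ~ Bernoulli(gamma), as in (4.9)
    p0 = G if r else F
    Y = (rng.random(n) >= p0).astype(int)       # Y = 0 with probability p0
    return X, Y
```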
Figure 4: IMR*'s of SVM, RFC, CATCH+SVM, and CATCH+RFC versus different contamination parameters γ for simulation model (4.9) in Example 4.4.
5 CATCH Analysis of The ANN Data
In this section CATCH is applied to a data set available at the UCI data base (http://ftp.ics.uci.edu/pub/machine-learning-databases/thyroid-disease). The data are from a clinical trial to determine whether a patient is hypothyroid (see, e.g., Quinlan [15]). The dataset is split into two parts: a training set with 3772 observations and a testing set with 3428 observations. The response is a categorical variable with three classes: normal (not hypothyroid), hyperfunction, and subnormal functioning. There are 21 predictors, including 15 categorical predictors and 6 numerical predictors. The three classes are very imbalanced, with 92.5% normal patients. A good classifier should produce a significant decrease from the 7.5% error rate that can be obtained by classifying all observations to the normal class.
CATCH identifies 7 variables as important, including 4 continuous variables and 3 categorical variables. We consider three classification methods: 1. Support Vector Machine with radial kernel, denoted by SVM(r); 2. Support Vector Machine with polynomial kernel, denoted by SVM(p); 3. RFC. We apply these three methods to the training set to build classification models and use the test set to estimate the misclassification rate of the methods. We apply the methods with CATCH and without CATCH, respectively, where "with CATCH" means that only the 7 important variables identified by CATCH are used to build the model, and "without CATCH" means that all 21 variables are used in the analysis.
Table 5: Misclassification rates of three classification methods (SVM(r), SVM(p), RFC) with CATCH and without CATCH.

             SVM(r)   SVM(p)   RFC
w/o CATCH    0.052    0.063    0.024
with CATCH   0.037    0.055    0.015
Table 5 shows that if CATCH is used to select important variables before the classification methods are applied, the misclassification rates decrease. CATCH reduces the misclassification rates by 29%, 13%, and 37% for SVM(r), SVM(p), and RFC, respectively. CATCH+RFC has the best performance. Although CATCH is not designed for improving classification accuracy, it nonetheless can help other classification methods perform better.
6 Properties of Local Contingency Efficacy
In this section we investigate the properties of local contingency efficacy. For a univariate continuous predictor, Theorem 6.1 below shows that the local contingency efficacy is well defined and equals 0 if the response variable is independent of the predictor. Consider an R × C contingency table. Let n_rc, r = 1, …, R, c = 1, …, C, be the observed frequency in the (r, c)-cell, and let p_rc be the probability that an observation belongs to the (r, c)-cell. Let
p_r = Σ_{c=1}^{C} p_rc,  q_c = Σ_{r=1}^{R} p_rc,  (6.1)
and
n_r· = Σ_{c=1}^{C} n_rc,  n_·c = Σ_{r=1}^{R} n_rc,  n = Σ_{r=1}^{R} Σ_{c=1}^{C} n_rc,  E_rc = n_r· n_·c / n.  (6.2)
Consider testing that the rows and the columns are independent, i.e.,

H₀: p_rc = p_r q_c for all r, c  versus  H₁: p_rc ≠ p_r q_c for some r, c.  (6.3)

A standard test statistic is
X² = Σ_{r=1}^{R} Σ_{c=1}^{C} (n_rc − E_rc)² / E_rc.  (6.4)
Under H₀, X² converges in distribution to χ²_{(R−1)(C−1)}. Moreover, defining 0/0 = 0, we have the well-known result:
Lemma 6.1. As n tends to infinity, X²/n tends in probability to

τ² ≡ Σ_{r=1}^{R} Σ_{c=1}^{C} (p_rc − p_r q_c)² / (p_r q_c).

Moreover, if p_r q_c is bounded away from 0 and 1 as n → ∞, and if R and C are fixed as n → ∞, then τ² = 0 if and only if H₀ is true.
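The statistic (6.4) and the limit τ² of Lemma 6.1 are straightforward to compute; a minimal sketch (our own helper functions, for illustration only):

```python
import numpy as np

def chi_square_stat(counts):
    """Pearson X^2 of (6.4) for an R x C table of cell counts n_rc."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    # E_rc = n_r. * n_.c / n
    E = np.outer(counts.sum(axis=1), counts.sum(axis=0)) / n
    return float(((counts - E) ** 2 / E).sum())

def tau_squared(p):
    """Limit tau^2 of X^2/n from Lemma 6.1, for cell probabilities p_rc."""
    p = np.asarray(p, dtype=float)
    pq = np.outer(p.sum(axis=1), p.sum(axis=0))   # p_r * q_c
    return float(((p - pq) ** 2 / pq).sum())
```

Under independence (p_rc = p_r q_c), `tau_squared` returns 0, matching the last claim of the lemma.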
Remark 6.1. Lemma 6.1 shows that the contingency efficacy ζ in (2.8) is well defined. Moreover, ζ = 0 if and only if X and Y are independent.
Theorem 6.1. Consider a categorical response Y and a continuous predictor X with density function f(x) such that, for a fixed h > 0, f(x) > 0 for x ∈ (x^(0) − h, x^(0) + h). Then:

(1) For fixed h, the estimated local section efficacy ζ̂(x^(0), h) defined in (2.5) converges almost surely to the section efficacy ζ(x^(0), h) defined in (2.4).

(2) If X and Y are independent, then the local efficacy ζ(x^(0)) defined in (2.4) is zero.

(3) If there exists c ∈ {1, …, C} such that p_c(x) = P(Y = c | x) has a continuous derivative at x = x^(0) with p′_c(x^(0)) ≠ 0, then ζ(x^(0)) > 0.
The χ² statistic (6.4) can be written in terms of multinomial proportions p̂_rc, with √n(p̂_rc − p_rc), 1 ≤ r ≤ R, 1 ≤ c ≤ C, converging in distribution to a multivariate normal. By a Taylor expansion of T_n ≡ √(X²/n) at τ = plim √(X²/n), we find that:
Theorem 6.2.

√n(T_n − τ) →_d N(0, v²),  (6.5)

where the asymptotic variance v², which can be obtained from the Taylor expansion, is not needed in this paper.
Remark 6.2. The result (6.5) applies to the multivariate χ²-based efficacies ζ, with n replaced by the number of observations that go into the computation of ζ. In this case we use the notation χ²_j for the chi-square statistic for the jth variable.
7 Consistency Of CATCH
Let Y denote a variable to be classified into one of the categories 1, …, C by using the X_j's in the vector X = (X1, …, Xd). Let X_{−j} ≡ X − {X_j} denote X without the X_j component. Assume that the squared Pearson nonparametric correlation (Doksum and Samarov [4], Huang and Chen [9]) between X_j and X_{−j} is less than 0.5. We call the variable X_j relevant for Y if, conditionally given X_{−j}, L(Y | X_j = x) is not constant in x, and irrelevant otherwise. Our data are (X^(1), Y^(1)), …, (X^(n), Y^(n)), i.i.d. as (X, Y). A variable selection procedure is consistent if the probability that all the relevant variables are selected and none of the irrelevant variables are selected tends to one as n tends to infinity.
We give a general nonparametric framework in which the variable selection rule based on the CATCH importance scores s_j given in (3.8) is consistent.
When the predictors are not all categorical and CATCH involves tubes, we assume that the number of observations in each tube tends to ∞ as n → ∞, and we assume that, for those importance scores that require the selection of a section size h, the selection is from a finite set {h_j} with min_j{h_j} > h₀ for some h₀ > 0. It follows that the number of observations m_jk in the kth tube used to calculate the importance score s_j for the variable X_j tends to infinity as n → ∞.
The key quantities that determine consistency are the limits λ_j of s_j. That is,

λ_j = τ(ζ_j, P) = E_P[ζ_j(X)] = ∫ ζ_j(x) dP(x),  j = 1, …, d,
where ζ_j(x) is the local contingency efficacy defined by (2.4) for numerical X's and by (2.8) for categorical X's, and P is the probability distribution of X. Our estimate s_j of τ_j is
s_j = τ(ζ̂_j, P̂) = (1/M) Σ_{k=1}^{M} ζ̂_j(X*_k),
where ζ̂_j is defined in Section 3 and P̂ is the empirical distribution of X. Let d₀ denote the number of X's that are irrelevant for Y. Set d₁ = d − d₀ and, without loss of generality, reorder the λ_j so that
λ_j > 0 if j = 1, …, d₁;  λ_j = 0 if j = d₁ + 1, …, d.  (7.1)
Our variable selection rule is

"keep variable X_j if s_j ≥ t for some threshold t".  (7.2)
Rule (7.2) is consistent if and only if

P(min_{j≤d₁} s_j ≥ t and max_{j>d₁} s_j < t) → 1.  (7.3)

To establish (7.3) it is enough to show that

P(max_{j>d₁} s_j > t) → 0  (7.4)

and

P(min_{j≤d₁} s_j ≤ t) → 0.  (7.5)
We call (7.4) and (7.5) type I and type II consistency, respectively. Type I and II error probabilities refer to the probabilities of any false positives and any false negatives, respectively.
7.1 Type I consistency
We consider (7.4) first and note that

P(max_{j>d₁} s_j > t) ≤ Σ_{j=d₁+1}^{d} P(s_j > t).  (7.6)
For the jth irrelevant variable X_j, j > d₁, the results in Section 6 apply to the local efficacy ζ̂_j(x) at a fixed point x. The importance score s_j defined in (3.8) is the average of such efficacies over a sample of M random points x*_a. We assume that there is a point x^(0) such that, for irrelevant X_j's and for n large enough, ζ̂_j(x^(0)) is stochastically larger in the right tail than the average M⁻¹ Σ_{a=1}^{M} ζ̂_j(x*_a). That is, writing s^(0)_j ≡ ζ̂_j(x^(0)), we assume that there exist t > 0 and n₀ such that for n ≥ n₀,

P(s_j > t) ≤ P(s^(0)_j > t),  j > d₁.  (7.7)

The existence of such a point x^(0) is established by considering the probability limit of

arg max{ζ̂_j(x*_a) : x*_a ∈ {x*_1, …, x*_M}}.
Note that by Lemma 6.1, s^(0)_j →_p 0 as n → ∞. It follows that we have established (7.4) for t and d₀ = d − d₁ that are fixed as n → ∞. To examine the interesting case where d₀ → ∞ as n → ∞, we need large deviation results, to which we turn next. In the d₀ → ∞ case we consider t ≡ t_n tending to ∞ as n → ∞, and we use the weaker assumption: there exist x^(0) and n₀ such that
P(s_j > t_n) ≤ P(s^(0)_j > t_n)  (7.8)
for n ≥ n₀ and each t_n with lim_{n→∞} t_n = ∞. We also assume that the number of columns C and rows R in the contingency table that defines the efficacy χ²_{jn} given by (6.4) stays fixed as n → ∞. In this case P(s^(0)_j > t_n) → 0 provided each term

[T^(j)_rc]² ≡ [(n_rc − E_rc)/(√n √E_rc)]²

in s_j² = χ²_{jn}/n, defined for X_j by (6.4), satisfies
P(|T^(j)_rc| > t_n) → 0.
This is because
P(√(Σ_r Σ_c [T^(j)_rc]²) > t_n) ≤ P(RC max_{r,c} [T^(j)_rc]² > t_n²) ≤ RC max_{r,c} P(|T^(j)_rc| > t_n/√(RC)),
and t_n and t_n/√(RC) are of the same order as n → ∞. By large deviation theory (Hall [8], Jing et al. [10]), for some constant A,
P(|T^(j)_rc| > t_n) = (2 t_n⁻¹/√(2π)) e^{−t_n²/2} [1 + A t_n³ n^{−1/2} + o(t_n³ n^{−1/2})],  (7.9)
provided t_n = o(n^{1/6}). If (7.9) is uniform in j > d₁, then (7.6) and (7.9) imply that type I consistency holds if

d₀ (t_n/√2)⁻¹ e^{−t_n²/2} → 0.  (7.10)
To examine (7.10), write d₀ and t_n as d₀ = exp(n^b) and t_n = √2 n^r. By large deviation theory, (7.10) holds when r < 1/6. Thus if 0 < b/2 < r < 1/6, then

d₀ (t_n/√2)⁻¹ e^{−t_n²/2} = n^{−r} exp(n^b − n^{2r}) → 0.
Thus d₀ = exp(n^b) with b = 1/3 − ε, for any ε > 0, is the largest d₀ can be to have consistency. For instance, if there are exactly d₀ = exp(n^{1/4}) irrelevant variables and we use the threshold t_n = √2 n^{1/7}, the probability of selecting any of the irrelevant variables tends to zero at the rate

n^{−1/7} exp(n^{1/4} − n^{2/7}).

7.2 Type II consistency
We now assume that for each relevant variable X_j there is a least favorable point x^(1) such that the local efficacy s^(1)_j ≡ ζ̂_j(x^(1)) is stochastically smaller than the importance score s_j = M⁻¹ Σ_{a=1}^{M} ζ̂_j(x*_a). That is, we assume that for each relevant variable X_j there exist n₁ and x^(1) such that for n ≥ n₁,

P(s^(1)_j ≤ t_n) ≥ P(s_j ≤ t_n),  j ≤ d₁.  (7.11)
The existence of such a least favorable x^(1) is established by considering the probability limit of

arg min{ζ̂_j(x*_a) : x*_a ∈ {x*_1, …, x*_M}}.
We again assume that R and C are fixed, so that to show type II consistency it is enough to show that for each (r, c) and j ≤ d₁,

P(|T^(j)_rc| ≤ t_n) → 0.
This is because
P(√(Σ_r Σ_c [T^(j)_rc]²) ≤ t_n) ≤ P(min_{r,c} [T^(j)_rc]² ≤ t_n²) ≤ RC max_{r,c} P(|T^(j)_rc| ≤ t_n).

By Theorem 6.2, √n(s^(1)_j − τ_j) →_d N(0, ν²), τ_j > 0. It follows that if t_n and d₁ are bounded above as n → ∞ and τ_j > 0, then P(s^(1)_j ≤ t_n) → 0. Thus type II consistency holds in this case. To examine the more interesting case, where t_n and d₁ are not bounded above and τ_j may tend to zero as n → ∞, we will need to center the T^(j)_rc. Thus set
θ^(j)_n = √n (p^(j)_rc − p^(j)_r p^(j)_c) / √(p^(j)_r p^(j)_c),  (7.12)
T^(j)_n = T^(j)_rc − θ^(j)_n,

where, for simplicity, the dependence of T^(j)_n and θ^(j)_n on r and c is not indicated.
Lemma 7.1.

P(min_{j≤d₁} |T^(j)_rc| ≤ t_n) ≤ Σ_{j=1}^{d₁} P(|T^(j)_n| ≥ min_{j≤d₁} |θ^(j)_n| − t_n).

Proof.
P(min_{j≤d₁} |T^(j)_rc| ≤ t_n) = P(min_{j≤d₁} |T^(j)_n + θ^(j)_n| ≤ t_n)
≤ P(min_{j≤d₁} |θ^(j)_n| − max_{j≤d₁} |T^(j)_n| ≤ t_n)
= P(max_{j≤d₁} |T^(j)_n| ≥ min_{j≤d₁} |θ^(j)_n| − t_n)
≤ Σ_{j=1}^{d₁} P(|T^(j)_n| ≥ min_{j≤d₁} |θ^(j)_n| − t_n).
Set

a_n = min_{j≤d₁} |θ^(j)_n| − t_n;

then, by large deviation theory, if a_n = o(n^{1/6}),
P(|T^(j)_n| ≥ a_n) ∝ (a_n/√2)⁻¹ exp{−a_n²/2},  (7.13)
where ∝ signifies asymptotic order as n → ∞, a_n → ∞. It follows that if (7.13) holds uniformly in j, then by Lemma 7.1,
P(min_{j≤d₁} |T^(j)_rc| ≤ t_n) ∝ d₁ (a_n/√2)⁻¹ exp{−a_n²/2}.
Thus type II consistency holds when a_n = o(n^{1/6}) and

d₁ (a_n/√2)⁻¹ exp{−a_n²/2} → 0.  (7.14)
In Section 7.1, t_n is of order n^r, 0 < r < 1/6. For this t_n, type II consistency holds if a_n is of order n^r, 0 < r < 1/6, and

d₁ n^{−r} exp{−n^{2r}/2} → 0.
If a_n = O(n^k) with k ≥ 1/6, then P(|T^(j)_n| ≥ a_n) tends to zero at a rate faster than in (7.13) and consistency is ensured. For instance, if all relevant variables X_j have p^(j)_rc fixed as n → ∞, then min_{j≤d₁} |θ^(j)_n| is of order √n and a_n is of order n^{1/2} − t_n.
7.3 Type I and II consistency
We see from Sections 7.1 and 7.2 that consistency holds under a variety of conditions. One set of conditions is:
Theorem 7.1. Under the regularity conditions of Sections 7.1 and 7.2, if

d₀ = c₀ exp(n^b),  t_n = c₁ n^r,  min_{j≤d₁} θ^(j) − t_n = c₂ n^r,  d₁ = c₃ exp(n^b)

for positive constants c₀, c₁, c₂, and c₃, and 0 < b/2 < r < 1/6, then CATCH is consistent.
Appendix

Proof of Theorem 6.1
(1) Let N_h = Σ_{i=1}^{n} I(X^(i) ∈ N_h(x^(0))). Then N_h ∼ Bi(n, ∫_{x^(0)−h}^{x^(0)+h} f(x) dx) and N_h →_{a.s.} ∞ as n → ∞. Now

ζ̂(x^(0), h) = n^{−1/2} √(X²(x^(0), h)) = (N_h/n)^{1/2} (N_h)^{−1/2} √(X²(x^(0), h)).
Since N_h ∼ Bi(n, ∫_{x^(0)−h}^{x^(0)+h} f(x) dx), we have

N_h/n → ∫_{x^(0)−h}^{x^(0)+h} f(x) dx.  (7.15)
By Lemma 6.1 and Table 1,
(N_h)^{−1/2} √(X²(x^(0), h)) → (Σ_{r=+,−} Σ_{c=1}^{C} (p_rc − p_r q_c)² / (p_r q_c))^{1/2},  (7.16)
where
p_{+c} = Pr(X > x^(0), Y = c | X ∈ N_h(x^(0))),  p_{−c} = Pr(X ≤ x^(0), Y = c | X ∈ N_h(x^(0))),

and, for r = +, − and c = 1, …, C,

p_r = Σ_{c=1}^{C} p_rc,  q_c = Σ_{r=+,−} p_rc.

In particular, p_+ = Pr(X > x^(0) | X ∈ N_h(x^(0))), p_− = Pr(X ≤ x^(0) | X ∈ N_h(x^(0))), and q_c = Pr(Y = c | X ∈ N_h(x^(0))).
By (7.15) and (7.16),
ζ̂(x^(0), h) → (∫_{x^(0)−h}^{x^(0)+h} f(x) dx)^{1/2} (Σ_{r=+,−} Σ_{c=1}^{C} (p_rc − p_r q_c)² / (p_r q_c))^{1/2} ≡ ζ(x^(0), h).
(2) Since X and Y are independent, p_rc = p_r q_c for r = +, − and c = 1, …, C; then ζ(x^(0), h) = 0 for any h. Hence

ζ(x^(0)) = sup_{h>0} ζ(x^(0), h) = 0.
(3) Recall that p_c(x^(0)) = Pr(Y = c | x^(0)). Without loss of generality, assume p′_c(x^(0)) > 0. Then there exists h such that

Pr(Y = c | x^(0) < X < x^(0) + h) > Pr(Y = c | x^(0) − h < X < x^(0)),

which is equivalent to p_{+c}/p_+ > p_{−c}/p_−. This shows that ζ(x^(0)) > 0, since by Lemma 6.1, ζ(x^(0)) = 0 results in p_{+c}/p_+ = p_{−c}/p_− = p_c.
References
[1] Breiman, L., 2001. Random forests. Machine Learning 45, 5–32.

[2] Breiman, L., Friedman, J. H., Olshen, R. A., Stone, C. J., 1984. Classification and Regression Trees. Wadsworth, Belmont.

[3] Doksum, K., Tang, S., Tsui, K.-W., 2008. Nonparametric variable selection: The EARTH algorithm. Journal of the American Statistical Association 103(484), 1609–1620.

[4] Doksum, K. A., Samarov, A., 1995. Nonparametric estimation of global functionals and a measure of the explanatory power of covariates in regression. Annals of Statistics 23, 1443–1473.

[5] Doksum, K. A., Schafer, C., 2006. Powerful choices: Tuning parameter selection based on power. In: Frontiers in Statistics: Dedicated to Peter Bickel (J. Fan and H. L. Koul, eds.), Imperial College Press, 113–141.

[6] Friedman, J., 1991. Multivariate adaptive regression splines. The Annals of Statistics 19, 1–67.

[7] Gao, J., Gijbels, I., 2008. Bandwidth selection in nonparametric kernel testing. Journal of the American Statistical Association 103(484), 1584–1594.

[8] Hall, P., 1992. The Bootstrap and Edgeworth Expansion. Springer, New York.

[9] Huang, L.-S., Chen, J., 2008. Analysis of variance, coefficient of determination and F-test for local polynomial regression. Annals of Statistics 36, 2085–2109.

[10] Jing, B.-Y., Shao, Q.-M., Wang, Q., 2003. Self-normalized Cramér-type large deviations for independent random variables. Annals of Probability 31, 2167–2215.

[11] Lin, D. Y., Zeng, D., 2011. Correcting for population stratification in genomewide association studies. Journal of the American Statistical Association 106, 997–1008.

[12] Loh, W.-Y., 2009. Improving the precision of classification trees. The Annals of Applied Statistics 3, 1710–1737.

[13] Price, A., Patterson, N., Plenge, R., Weinblatt, M., Shadick, N., Reich, D., 2006. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38(8), 904–909.

[14] Qiao, Z., Zhou, L., Huang, J., 2008. Linear discriminant analysis for high-dimensional, low sample size data. Special issue of World Congress on Engineering 2008.

[15] Quinlan, J. R., 1987. Simplifying decision trees. International Journal of Man-Machine Studies 27(3), 221–234.

[16] Schafer, C., Doksum, K. A., 2009. Selecting local models in multiple regression by maximizing power. Metrika 69, 283–304.

[17] Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267–288.

[18] Wang, L., Shen, X., 2007. On L1-norm multiclass support vector machines: methodology and theory. Journal of the American Statistical Association 102, 583–594.

[19] Zhang, H. H., Liu, Y., Wu, Y., Zhu, J., 2008. Variable selection for the multicategory SVM via adaptive sup-norm regularization. Electronic Journal of Statistics 2, 149–167.
We adjust the size of the tube so that the number of points in the tube is a preassigned number k. We find that k ≥ 50 is in general a good choice. Note that k = n corresponds to unconditional or marginal nonparametric regression.
Within the tube T_l(x^(0), δ, D), X_l is unconstrained. To measure the dependence of Y on X_l locally, we consider neighborhoods of x^(0) along the direction of X_l within the tube. Let

N_{l,h,δ,D}(x^(0)) = {x = (x1, …, xd)ᵀ ∈ R^d : x ∈ T_l(x^(0), δ, D), |x_l − x^(0)_l| ≤ h}  (3.2)
be a section of the tube T_l(x^(0), δ, D). To measure how strongly X_l relates to Y in N_{l,h,δ,D}(x^(0)), we consider

X²_{l,δ,D}(x^(0), h),  (3.3)

where (3.3) is defined by (2.2) using only the observations in the tube T_l(x^(0), δ, D).
Analogous to the definitions of local contingency efficacy (2.4), we define:

Definition 4. The local contingency tube section efficacy and the local contingency tube efficacy of Y on X_l at the point x^(0), and their estimates, are

ζ(x^(0), h, l) = plim_{n→∞} n^{−1/2} √(X²_{l,δ,D}(x^(0), h)),  ζ_l(x^(0)) = sup_{h>0} ζ(x^(0), h, l),  (3.4)

ζ̂(x^(0), h, l) = n^{−1/2} √(X²_{l,δ,D}(x^(0), h)),  ζ̂_l(x^(0)) = sup_{h>0} ζ̂(x^(0), h, l).  (3.5)
For an illustration of local contingency tube efficacy, consider Y ∈ {1, 2} and suppose logit P(Y = 2 | x) = (x − 1/2)², x ∈ [0, 1]. Then the local efficacy with h small will be large, while if h ≥ 1 the efficacy will be zero.
3.2 Numerical and Categorical Predictors

If X_l is categorical but some of the other predictors are numerical, we construct tubes T_l(x^(0), δ, D) based on the numerical predictors, leaving the categorical variables unconstrained, and we use the statistic defined by (2.6), computed from only the observations in the tube T_l(x^(0), δ, D), to measure the strength of the association between Y and X_l in the tube. We denote this version of (2.6) by

X²_{l,δ,D}(x^(0)).  (3.6)
Analogous to the definition of contingency efficacy given in (2.8), we define:

Definition 5. The local contingency tube efficacy of Y on X_l at x^(0) and its estimate are

ζ_l(x^(0)) = plim_{n→∞} n^{−1/2} √(X²_{l,δ,D}(x^(0))),  ζ̂_l(x^(0)) = n^{−1/2} √(X²_{l,δ,D}(x^(0))).  (3.7)
3.3 The Empirical Importance Scores
The empirical efficacies (3.5) and (3.7) measure how strongly Y depends on X_l near one point x^(0). To measure the overall dependency of Y on X_l, we randomly sample M bootstrap points x*_a, a = 1, …, M, with replacement from {x^(i), 1 ≤ i ≤ n} and calculate the average of the local tube efficacy estimates at these points.
Definition 6. The CATCH empirical importance score for variable l is

s_l = M⁻¹ Σ_{a=1}^{M} ζ̂_l(x*_a),  (3.8)

where ζ̂_l(x*_a) is defined in (3.5) and (3.7) for numerical and categorical predictors, respectively.
3.4 Adaptive Tube Distances
Doksum et al. [3] used an example to demonstrate that the distance D needs to be adaptive to keep variables strongly associated with Y from obscuring the effect of weaker variables. Suppose Y depends on three variables X1, X2, and X3 among the larger set of predictors, with the strength of the dependency ranging from weak to strong, as estimated by the empirical importance scores. As we examine the effect of X2 on Y, we want to define the tube so that there is relatively less variation in X3 than in X1 in the tube, because X3 is more likely to obscure the effect of X2 than X1. To this end we introduce a tube distance for the variable X_l of the form
D_l(x_{−l}, x^(0)_{−l}) = Σ_{j=1}^{d} w_j |x_j − x^(0)_j| I(j ≠ l),  (3.9)

where the weights w_j adjust the contribution of strong variables to the distance so that the strong variables are more constrained.
The weights w_j are proportional to s_j − s′_j and are determined iteratively and adaptively as follows. In the first iteration, set s_j = s^(1)_j = 1 in (3.9) and compute the importance scores s^(2)_j using (3.8). For the second iteration we use the weights s_j = s^(2)_j and the distance

D_l(x_{−l}, x^(0)_{−l}) = Σ_{j=1}^{d} (s_j − s′_j)₊ |x_j − x^(0)_j| I(j ≠ l) / Σ_{j=1}^{d} (s_j − s′_j)₊ I(j ≠ l),  (3.10)
where 0/0 ≡ 1 and s′_j is a threshold value for s_j under the assumption of no conditional association of X_j with Y, computed by a simple Monte Carlo technique. This adjustment is important because some predictors by definition have larger importance scores than others even when they are all independent of Y. For example, a categorical predictor
with more levels has a larger importance score than one with fewer levels. Therefore the unadjusted scores s_j cannot accurately quantify the relative importance of the predictors in the tube distance calculation. Next we use (3.8) to produce s_j = s^(3)_j, use s^(3)_j in the next iteration, and so on.
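The adaptive distance (3.10) can be sketched as follows (a hypothetical interface of our own; full x-vectors are passed and coordinate l is excluded via the indicator):

```python
import numpy as np

def tube_distance(x, x0, s, s_prime, l):
    """Adaptive tube distance D_l of (3.10): coordinates with larger
    (s_j - s'_j)_+ get more weight and hence are more constrained.
    Uses the 0/0 = 1 convention from the text when all weights vanish."""
    w = np.clip(np.asarray(s, float) - np.asarray(s_prime, float), 0.0, None)
    w[l] = 0.0                                   # I(j != l): exclude X_l itself
    denom = w.sum()
    if denom == 0.0:                             # 0/0 = 1 convention
        return 1.0
    diffs = np.abs(np.asarray(x, float) - np.asarray(x0, float))
    return float((w * diffs).sum() / denom)
```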
In the definition above we need to define |x_j − x^(0)_j| appropriately for a categorical variable X_j. Because X_j is categorical, we assume that |x_j − x^(0)_j| is constant when x_j and x^(0)_j take any pair of different categories. One approach is to define |x_j − x^(0)_j| = ∞·I(x_j ≠ x^(0)_j), in which case all points in the tube are given the same X_j value as the tube center. This approach is a very sensible one, as it strictly implements the idea of conditioning on X_j when evaluating the effect of another predictor X_l, so that X_j does not contribute to any variation in Y. The problem with this definition, though, is that when the number of categories of X_j is relatively large compared to the sample size, there will not be enough points in the tube if we restrict X_j to a single category. In such a case we do not restrict X_j, which means that X_j can take any category in the tube. We define |x_j − x^(0)_j| = k₀ I(x_j ≠ x^(0)_j), where k₀ is a normalizing constant. The value of k₀ is chosen to achieve a balance between numerical variables and categorical variables. Suppose that X1 is a numerical variable with standard deviation one and X2 is a categorical variable. For X2 we set k₀ ≡ E(|X1^(1) − X1^(2)|), where X1^(1) and X1^(2) are two independent realizations of X1. Note that k₀ = 2/√π if X1 follows a standard normal distribution. Also, k₀ can be based on the empirical distributions of the numerical variables. Based on the preceding discussion of the choice of k₀, we use k₀ = ∞ in the simulations and k₀ = 2/√π in the real data example.
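The value k₀ = E|X1^(1) − X1^(2)| = 2/√π ≈ 1.128 for standard normal X1 is easy to verify by simulation:

```python
import numpy as np

# Monte Carlo check that E|X1 - X2| = 2/sqrt(pi) for X1, X2 iid N(0, 1):
# X1 - X2 ~ N(0, 2), and E|Z| = sigma * sqrt(2/pi) for Z ~ N(0, sigma^2),
# giving sqrt(2) * sqrt(2/pi) = 2/sqrt(pi) ~ 1.1284.
rng = np.random.default_rng(0)
x1, x2 = rng.standard_normal((2, 1_000_000))
k0_mc = np.mean(np.abs(x1 - x2))
print(k0_mc, 2 / np.sqrt(np.pi))   # both close to 1.1284
```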
Remark 3.1 (choice of k₀). k₀ = 2/√π is used all the time unless the sample size is large enough, in the sense explained below. Suppose k₀ = ∞; then D_l(x_{−l}, x^(0)_{−l}) is finite only for those observations with x_j = x^(0)_j. When there are multiple categorical variables, D_l is finite only for those points, denoted by the set Ω^(0), that satisfy x_j = x^(0)_j for all categorical x_j. When there are many categorical variables, or when the total sample size is small, the size of this set can be very small or close to zero, in which case the tube cannot be well defined; therefore k₀ = 2/√π is used, as in the real data example. On the other hand, when the size of Ω^(0) is larger than the specified tube size, we can use k₀ = ∞, as in the simulation examples.
3.5 The CATCH Algorithm
The variable selection algorithm for a classification problem is based on importance scores and an iteration procedure.

The Algorithm

(0) Standardizing. Standardize all the numerical predictors to have sample mean 0 and sample standard deviation 1 using linear transformations.
(1) Initializing Importance Scores. Let S = (s1, …, sd) be importance scores for the predictors. Let S′ = (s′1, …, s′d) be such that s′_l is a threshold importance score, corresponding to the model where X_l is independent of Y given X_{−l}. Initialize S to be (1, …, 1) and S′ to be (0, …, 0). Let M be the number of tube centers and m the number of observations within each tube.

(2) Tube Hunting Loop. For l = 1, …, d, do (a), (b), (c), and (d) below.
(a) Selecting tube centers. For 0 < b ≤ 1, set M = [n^b], where [·] is the greatest integer function, and randomly sample M bootstrap points X*_1, …, X*_M with replacement from the n observations. Set k = 1 and do (b).
(b) Constructing tubes. For 0 < a < 1, set m = [n^a]. Using the tube distance D_l defined in (3.10), select the tube size so that there are exactly m ≥ 10 observations inside the tube for X_l at X*_k.
(c) Calculating efficacy. If X_l is a numerical variable, let ζ̂_l(X*_k) be the estimated local contingency tube efficacy of Y on X_l at X*_k as defined in (3.5). If X_l is a categorical variable, let ζ̂_l(X*_k) be the estimated contingency tube efficacy of Y on X_l at X*_k as defined in (3.7).
(d) Thresholding. Let Y* denote a random variable which is identically distributed as Y but independent of X. Let (Y^(1)_0, …, Y^(n)_0) be a random permutation of (Y^(1), …, Y^(n)); then Y₀ ≡ (Y^(1)_0, …, Y^(n)_0) can be viewed as a realization of Y*. Let ζ̂*_l(X*_k) be the estimated local contingency tube efficacy of Y₀ on X_l at X*_k, calculated as in (c) with Y₀ in place of Y.

If k < M, set k = k + 1 and go to (b); otherwise go to (e).
(e) Updating importance scores. Set

s^new_l = (1/M) Σ_{k=1}^{M} ζ̂_l(X*_k),  s′^new_l = (1/M) Σ_{k=1}^{M} ζ̂*_l(X*_k).
Let S^new = (s^new_1, …, s^new_d) and S′^new = (s′^new_1, …, s′^new_d) denote the updates of S and S′.

(3) Iterations and Stopping Rule. Repeat (2) I times. Stop when the change in S^new is small. Record the last S and S′ as S^stop and S′^stop, respectively.

(4) Deletion step. Using S^stop and S′^stop, delete X_l if s^stop_l − s′^stop_l < √2 SD(S′^stop).

If d is small, we can generate several sets of S′^stop and then use the sample standard deviation of all the values in these sets as SD(S′^stop). See Remark 3.3.

End of the algorithm.
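The overall loop of steps (0)-(4) can be summarized in a compact sketch. This is a simplified stand-in, not the authors' implementation: `local_efficacy` is a hypothetical callable abstracting steps (b)-(c), a single global permutation supplies the null scores of step (d), and the adaptive weights would feed the distance (3.10):

```python
import numpy as np

def catch_select(X, y, local_efficacy, n_iter=3, b=0.5, rng=None):
    """Skeleton of the CATCH algorithm (steps (1)-(4)).

    local_efficacy(X, y, l, center, weights) -> float is a hypothetical
    stand-in for the tube-efficacy estimate of steps (b)-(c)."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    M = int(n ** b)                              # number of tube centers, M = [n^b]
    s, s_prime = np.ones(d), np.zeros(d)         # step (1): initialize S, S'
    for _ in range(n_iter):                      # step (3): repeat step (2)
        w = np.clip(s - s_prime, 0.0, None)      # adaptive weights for (3.10)
        y_perm = rng.permutation(y)              # step (d): global permutation
        for l in range(d):                       # step (2): tube hunting loop
            centers = X[rng.integers(0, n, size=M)]
            s[l] = np.mean([local_efficacy(X, y, l, c, w) for c in centers])
            s_prime[l] = np.mean([local_efficacy(X, y_perm, l, c, w)
                                  for c in centers])
    threshold = np.sqrt(2) * s_prime.std()       # step (4): deletion rule
    return [l for l in range(d) if s[l] - s_prime[l] >= threshold]
```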
Remark 3.2. Step (d) needs to be replaced by (d′) below when a global permutation of Y does not result in an appropriate "local null distribution". For example, when the sample sizes of the different classes are extremely unequal, which is known as an unbalanced classification problem, it is very likely that a local region contains a pure class of Y after the global permutation, so the threshold S′ will be spuriously low, and therefore irrelevant variables will be falsely selected into the model. The approach in (d′) is more nonparametric but more time consuming.
(d′) Local Thresholding. Let Y* denote a random variable which is distributed as Y but independent of X_l given X_{−l}. Let (Y^(1)_0, …, Y^(m)_0) be a random permutation of the Y's inside the tube; then (Y^(1)_0, …, Y^(m)_0) can be viewed as a realization of Y*. If X_l is a numerical variable, let ζ̂′_l(X*_k) be the estimated local contingency tube efficacy of Y* on X_l at X*_k as defined in (3.5). If X_l is a categorical variable, let ζ̂′_l(X*_k) be the estimated contingency tube efficacy of Y* on X_l at X*_k as defined in (3.7).
Remark 3.3 (The threshold). Under the assumption H^(l)_0 that Y is independent of X_l given X_{−l} in T_l, s_l and s′_l are nearly independent and SD(s_l − s′_l) ≈ √2 SD(s′_l). Thus the rule that deletes X_l when s^stop_l − s′^stop_l < √2 SE(S′^stop) is similar to the one standard deviation rule of CART and GUIDE. The choice of threshold is also discussed in Section 7. Alternatively, we can calculate a p-value for each covariate by a permutation test: more specifically, we permute the response variable, obtain the importance scores for the permuted data, and then calculate how extreme s^stop_l is compared to the permuted scores. We then identify a covariate as important if its p-value is less than 0.1.
Remark 3.4 (Large d, small n). A key idea in this paper is that when determining the importance of the variable X_j, we restrict (condition) the vector X_{-j} to a relatively small set. This is done to address the problem of spurious correlation, which is of great concern in genomics (Price et al. [13], Lin and Zeng [11]). In such genomic studies the number of predictors d typically greatly exceeds the sample size n. For numerical variables this can be dealt with by replacing X_{-j} with a vector of principal components explaining most of the variation in X_{-j}. Although our paper does not deal with the d ≫ n case directly, this remark shows that it is also applicable to the genome-wide association studies described in Price et al. [13].
Remark 3.5. We now provide the computational complexity of the algorithm. Recall the notation: I is the number of repetitions, M is the number of tube centers, m is the tube size, d is the dimension, and n is the sample size. The number of operations needed for the algorithm is I × d × M × (2b + 2c + 2d) + 2e, where 2b, 2c, 2d, 2e denote the numbers of operations required for the corresponding steps. The computational complexities of steps 2b, 2c, 2d, and 2e are O(nd), O(m), O(m), and O(M), respectively. Since nd ≫ m, the required computation is dominated by step 2b, and the computational complexity of the algorithm is O(IMnd^2). Solving a least squares problem requires O(nd^2) operations, so our algorithm has higher computational complexity by a factor of IM compared to least squares. However, the computations for the M tube centers are independent of each other and can be performed in parallel.
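Because the M per-center computations are independent, the outer loop over tube centers parallelizes directly. A minimal sketch, assuming a toy per-tube score in place of the actual local contingency efficacy and thread-based workers:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def tube_score(X, y, center, m=40):
    """Toy per-tube statistic standing in for one local contingency-efficacy
    evaluation; the O(nd) distance computation (step 2b) dominates."""
    d2 = ((X - X[center]) ** 2).sum(axis=1)
    tube = np.argsort(d2)[:m]
    return abs(y[tube].mean() - y.mean())

def importance_score(X, y, M=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = rng.integers(0, len(X), M)
    # the M per-center computations are mutually independent, so they can
    # be farmed out to a pool of workers
    with ThreadPoolExecutor() as pool:
        scores = list(pool.map(lambda c: tube_score(X, y, c), centers))
    return float(np.mean(scores))

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 10))
y = (X[:, 0] > 0).astype(float)
print(importance_score(X, y))
```

The pool here only illustrates the independence claim; in practice the tube computations could equally be distributed across processes or machines.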
4 Simulation Studies
4.1 Variable Selection with Categorical and Numerical Predictors
Example 4.1. Consider d ≥ 5 predictors X1, X2, ..., Xd and a categorical response variable Y. The fifth predictor X5 is a Bernoulli random variable which takes the two values 0 and 1 with equal probability. All other predictors are uniformly distributed on [0, 1].
When X5 = 0, the dependent variable Y is related to the predictors in the following fashion:
Y = 0 if X1 - X2 - 0.25 sin(16πX2) ≤ -0.5,
    1 if -0.5 < X1 - X2 - 0.25 sin(16πX2) ≤ 0,
    2 if 0 < X1 - X2 - 0.25 sin(16πX2) ≤ 0.5,
    3 if X1 - X2 - 0.25 sin(16πX2) > 0.5.      (4.1)
When X5 = 1, Y depends on X3 and X4 in the same way as above, with X1 replaced by X3 and X2 replaced by X4. The other variables X6, ..., Xd are irrelevant noise variables. All d predictors are independent.
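For concreteness, the data-generating mechanism of model (4.1) can be sketched as follows (the function name and the 0-indexed columns are our own):

```python
import numpy as np

def generate_example_41(n, d, seed=0):
    """Simulate from model (4.1): X5 switches which pair of predictors,
    (X1, X2) or (X3, X4), drives the class label."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n, d))
    X[:, 4] = rng.integers(0, 2, n)            # X5 is Bernoulli(1/2)
    # active pair depends on X5 (columns are 0-indexed)
    u = np.where(X[:, 4] == 0, X[:, 0], X[:, 2])
    v = np.where(X[:, 4] == 0, X[:, 1], X[:, 3])
    z = u - v - 0.25 * np.sin(16 * np.pi * v)
    # thresholds -0.5, 0, 0.5 give classes 0, 1, 2, 3 exactly as in (4.1)
    y = np.digitize(z, [-0.5, 0.0, 0.5], right=True)
    return X, y

X, y = generate_example_41(400, 10)
print(np.unique(y))  # → [0 1 2 3]
```

With `right=True`, `np.digitize` matches the half-open intervals in (4.1), e.g. class 1 corresponds to -0.5 < z ≤ 0.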
In addition to the presence of the noise variables X6, ..., Xd, this example is challenging in two aspects. First, there is an interaction effect between the continuous predictors X1, ..., X4 and the categorical predictor X5. Such an interaction effect may arise in practice. For example, suppose Y represents the effect resulting from multiple treatments X1, ..., X4 and X5 represents the gender of a patient; our model allows males and females to respond differently to treatments. Secondly, the classification boundaries for fixed X5 are highly nonlinear, as shown in Figure 1. This design demonstrates that our method can detect nonlinear dependence between the response and the predictors.
Table 2 shows the number of simulations in which the predictors X1, ..., X5 are correctly kept in the model and, in column 6, the average number of simulations in which the d - 5 irrelevant predictors are falsely selected into the model, out of 100 simulation trials from model (4.1). In the simulation, n = 400 and d = 10, 50, and 100 are used. We compare CATCH with other methods, including Marginal CATCH, the Random Forest classifier (RFC) (Breiman [1]), and GUIDE (Loh [12]). For Marginal CATCH we test the association between Y and X_l without conditioning on X_{-l}. RFC is well known as a powerful tool for prediction; here we use the importance scores from RFC for variable selection. We create an additional 30 noise variables and use their average importance score multiplied by a factor of 2 as a threshold (Doksum et al. [3]) to decide whether a variable should be selected into the model. In particular, we generate the noise variables using uniform and Bernoulli distributions for numerical and categorical predictors, respectively.
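The noise-variable thresholding rule can be sketched generically; the importance values below are made up for illustration, whereas in the paper they come from the RFC:

```python
import numpy as np

def select_by_noise_threshold(importances, noise_importances, factor=2.0):
    """Keep variable j if its importance exceeds `factor` times the
    average importance of artificially generated noise variables."""
    threshold = factor * np.mean(noise_importances)
    return np.flatnonzero(importances > threshold)

# illustrative importances: 5 relevant predictors, 45 irrelevant ones,
# and 30 reference noise variables (all values are hypothetical)
rng = np.random.default_rng(0)
importances = np.r_[rng.uniform(0.5, 1.0, 5), rng.uniform(0.0, 0.08, 45)]
noise_importances = np.linspace(0.02, 0.08, 30)  # average 0.05 -> threshold 0.1
print(select_by_noise_threshold(importances, noise_importances))  # → [0 1 2 3 4]
```

The rule thus calibrates the cutoff from variables known to carry no signal rather than from a fixed constant.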
The main goal of GUIDE is to solve the variable selection bias problem in regression and classification tree methods. As an illustration, if a categorical predictor takes K distinct values, then the splitting test exhausts 2^{K-1} - 1 possibilities and therefore is
Table 2: Simulation results from Example 4.1 with sample size n = 400. For 1 ≤ i ≤ 5, the ith column gives the estimated probability of selection of the relevant variable X_i and the standard error based on 100 simulations. Column 6 gives the mean of the estimated probability of selection and the standard error for the irrelevant variables X6, ..., Xd.

Number of Selections (100 simulations)

METHOD          |  d  | X1          | X2          | X3          | X4          | X5          | Xi, i ≥ 6
CATCH           |  10 | 0.82 (0.04) | 0.75 (0.04) | 0.78 (0.04) | 0.82 (0.04) | 1 (0)       | 0.05 (0.02)
                |  50 | 0.76 (0.04) | 0.76 (0.04) | 0.87 (0.03) | 0.84 (0.04) | 0.96 (0.02) | 0.08 (0.03)
                | 100 | 0.77 (0.04) | 0.73 (0.04) | 0.82 (0.04) | 0.8 (0.04)  | 0.83 (0.04) | 0.09 (0.03)
Marginal CATCH  |  10 | 0.77 (0.04) | 0.67 (0.05) | 0.7 (0.05)  | 0.8 (0.04)  | 0.06 (0.02) | 0.1 (0.03)
                |  50 | 0.69 (0.05) | 0.7 (0.05)  | 0.76 (0.04) | 0.79 (0.04) | 0.07 (0.03) | 0.1 (0.03)
                | 100 | 0.78 (0.04) | 0.76 (0.04) | 0.73 (0.04) | 0.73 (0.04) | 0.12 (0.03) | 0.1 (0.03)
Random Forest   |  10 | 0.71 (0.05) | 0.51 (0.05) | 0.47 (0.05) | 0.59 (0.05) | 0.37 (0.05) | 0.06 (0.02)
                |  50 | 0.64 (0.05) | 0.61 (0.05) | 0.65 (0.05) | 0.58 (0.05) | 0.28 (0.04) | 0.08 (0.03)
                | 100 | 0.66 (0.05) | 0.58 (0.05) | 0.59 (0.05) | 0.65 (0.05) | 0.17 (0.04) | 0.11 (0.03)
GUIDE           |  10 | 0.96 (0.02) | 0.98 (0.01) | 0.93 (0.03) | 0.99 (0.01) | 0.92 (0.03) | 0.33 (0.05)
                |  50 | 0.82 (0.04) | 0.85 (0.04) | 0.9 (0.03)  | 0.89 (0.03) | 0.67 (0.05) | 0.12 (0.03)
                | 100 | 0.83 (0.04) | 0.84 (0.04) | 0.86 (0.03) | 0.83 (0.04) | 0.61 (0.05) | 0.08 (0.03)
Figure 1: The relationship between Y and X1, X2 for model (4.1). The four areas represent the four values of Y.
biased toward being selected. The way that GUIDE alleviates this bias is to employ lack-of-fit tests, i.e., χ²-tests applicable to both numerical and categorical predictors, with the p-values adjusted by a bootstrap procedure. Since GUIDE has negligible selection bias, it can include tests for local pairwise interactions. Furthermore, it achieves sensitivity to local curvature by using a linear model at each node.
The results in Table 2 show that Marginal CATCH does a very poor job of selecting X5; this illustrates the importance of local conditioning. RFC does better than Marginal CATCH but also has difficulties, while CATCH and GUIDE successfully identify X5 along with X1 to X4 with high frequency. When d = 10, GUIDE does very well for the relevant variables but selects too many irrelevant variables. Surprisingly, GUIDE selects fewer irrelevant variables as d increases; it becomes more cautious as the dimension d increases. CATCH performs very well in the high dimensional cases (d = 50 and d = 100). This indicates that the CATCH algorithm is robust to the intricate interaction structures between the predictors.
In model (4.1) above the predictors interact in a hierarchical way. When X5 = 0, the response is related to X1 and X2 as shown in Figure 1, where the two axes are the two relevant predictors and the four areas partitioned by the curves correspond to the different values of Y. When X5 = 1, the response depends on X3 and X4 in the same way. This type of hierarchical interaction between categorical and numerical predictors is difficult to detect even by state-of-the-art classification methods such as the support vector machine (SVM) and RFC. In particular, RFC produces an importance
score vector that identifies the four numerical variables X1, ..., X4 as very significant but often leaves the categorical variable X5 unidentified, as we see in Table 2.
The CATCH algorithm, however, performs very well for this complicated model. We now explain why CATCH is able to identify X5 as an important variable. To explore the association between X5 and Y, we construct M contingency tables of 2 rows and 4 columns based on M randomly centered tubes, calculate contingency efficacy scores, and then average the scores. Figure 2 (a) and (b) illustrate how one of the contingency efficacy scores is calculated. 1000 data points are generated from model (4.1) and split into two sets according to x5: (a) shows the roughly half of the points with x5 = 0 in the (x1, x2) dimensions, and (b) shows the remaining points with x5 = 1 in the (x3, x4) dimensions. The data points are colored according to Y. This representation enables us to clearly see the classification boundaries (grey lines). The randomly selected tube center, for which x5 is zero, is shown as a star in (a). The data points within the tube, those close to the tube center in terms of X_{-5}, are highlighted by larger dots. Since the distance defining the tube does not involve X5, the larger dots have x5 equal to 0 or 1 and are present in both (a) and (b). The counts of the larger dots in different colors in (a) and (b) correspond to the cell counts of the two rows x5 = 0 and x5 = 1 in the contingency table. As we can see, the larger dots in (a) are mostly red and the larger dots in (b) are mostly blue and orange, which shows that the distribution of Y depends on the value of X5, so the contingency efficacy score is high. We would expect the contingency efficacy score to be low only if the tube center had similar values in (x1, x2) and (x3, x4); such tube centers are very few among the M tube centers, since the centers are randomly chosen. Thus the average contingency efficacy score is high and X5 is identified as an important variable.
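The tube-based 2 × 4 contingency calculation described here can be sketched as follows. This is a simplified stand-in for the CATCH score: we use plain Euclidean distance on X_{-5} and the raw √(X²/m) statistic, without the adaptive weighting of the full algorithm:

```python
import numpy as np

def tube_efficacy(X, y, center, m, j=4):
    """Cross-tabulate the binary X_j against Y inside the tube of the m
    points nearest `center` in the X_{-j} coordinates; return sqrt(chi2/m)."""
    keep = [k for k in range(X.shape[1]) if k != j]   # distance ignores X_j
    d2 = ((X[:, keep] - X[center, keep]) ** 2).sum(axis=1)
    tube = np.argsort(d2)[:m]
    rows, cols = X[tube, j].astype(int), y[tube]
    n_rc = np.zeros((2, y.max() + 1))
    for r, c in zip(rows, cols):
        n_rc[r, c] += 1
    E = np.outer(n_rc.sum(1), n_rc.sum(0)) / m        # E_rc = n_r. * n_.c / m
    with np.errstate(divide="ignore", invalid="ignore"):
        chi2 = np.nansum((n_rc - E) ** 2 / E)         # 0/0 counted as 0
    return np.sqrt(chi2 / m)

# toy data where Y depends on X5 (column 4): classes {0,1} when x5 = 0
# and classes {2,3} when x5 = 1
rng = np.random.default_rng(0)
n = 500
X = rng.uniform(size=(n, 6))
X[:, 4] = rng.integers(0, 2, n)
y = (2 * X[:, 4] + (X[:, 0] > 0.5)).astype(int)
score = np.mean([tube_efficacy(X, y, c, m=60) for c in rng.integers(0, n, 50)])
print(score > 0.5)  # → True: X5 is flagged as important
```

Because the rows of every tube's table split the Y classes into disjoint sets here, each local score is near its maximum, mirroring the mostly-red versus mostly-blue/orange pattern in Figure 2.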
To contrast this example with a simpler data generating model, suppose the effects of the predictors are additive, that is, the categorical variable is related to the response as a term added to the effect of the numerical predictors. In this type of simpler model the RFC and GUIDE algorithms can efficiently identify relevant predictors, but so can the CATCH algorithm. We do not include the simulation results for additive models here.
Remark 4.1. We observe that the CATCH algorithm is not very sensitive to the choice of the number of tube centers M or the tube size m = [n^a], where 0 < a ≤ 1 and [·] is the greatest integer function. For Example 4.1 we perform the simulation using M = n and M = n/2, and a = 0.1, 0.15, 0.2, 0.3, 0.5. The results are summarized in Table 3. The CATCH algorithm performs similarly for M = n and M = n/2, with M = n having a small winning edge. We generally suggest using M = n if computational resources are of little concern or n is not large (≤ 200). When computational cost is a concern (Remark 3.5), one can reduce the number of tube centers to M = n/2 as a compromise between detection power and computational cost. Table 3 also shows that the tube size works well for a in the range between 0.1 and 0.3, while the detection power for X5 decreases in the case a = 0.5, where the tube is too big to achieve local conditioning. This observation is consistent with the comparison to Marginal CATCH in Table 2. On the other hand, the tube size should not be so small that detection power becomes insufficient. We suggest using a = 0.1 or 0.15 for sample sizes n ≥ 400, or m = 50 for smaller sample sizes. In light of these guidelines, we use M = n/2 = 200 and a = 0.15 for Examples 4.1-4.3, and M = n = 200 and m = 50 for Example 4.4.
Table 3: Sensitivity study of CATCH for Example 4.1 (n = 400, d = 50) using different numbers of tube centers M and tube sizes m = [n^a]. For 1 ≤ i ≤ 5, the ith column gives the number of times X_i was selected. Column 6 gives the percentage of times irrelevant variables were selected.

Number of Selections (100 simulations)

M       |  a   | X1 | X2 | X3 | X4 | X5 | Xi, i ≥ 6
M = 200 | 0.10 | 75 | 67 | 77 | 69 | 92 | 7.8
        | 0.15 | 76 | 76 | 87 | 84 | 96 | 8.2
        | 0.20 | 85 | 84 | 90 | 87 | 93 | 9.9
        | 0.30 | 89 | 87 | 95 | 89 | 85 | 9.6
        | 0.50 | 89 | 90 | 96 | 93 | 62 | 9.9
M = 400 | 0.10 | 74 | 76 | 82 | 77 | 92 | 7.2
        | 0.15 | 85 | 82 | 89 | 87 | 94 | 8.6
        | 0.20 | 87 | 88 | 90 | 86 | 96 | 9.1
        | 0.30 | 88 | 89 | 91 | 91 | 92 | 9.4
        | 0.50 | 89 | 91 | 94 | 93 | 61 | 10.2
In Example 4.1 all predictors are independent. In a similar setting, we next consider the case where the predictors are dependent.
Example 4.2. We generate standard normal random variables Z_i for i = 1, ..., d in such a way that cor(Z1, Z4) = 0.5, cor(Z3, Z6) = 0.9, and the other Z_i's are independent. We set X_i = Φ(Z_i) for i = 1, ..., d, where Φ(·) is the CDF of a standard normal, and therefore the X_i are marginally uniformly distributed as in Example 4.1. The correlation structure of the X_i's is similar to that of the Z_i's. The dependent variable Y is generated in the same way as in Example 4.1.
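The Gaussian-copula construction of Example 4.2 can be sketched as follows (a minimal sketch; the function name is ours):

```python
import math
import numpy as np

def generate_example_42(n, d, seed=0):
    """Dependent Uniform(0,1) predictors: correlated normals Z_i pushed
    through the standard normal CDF, with cor(Z1,Z4)=0.5, cor(Z3,Z6)=0.9."""
    rng = np.random.default_rng(seed)
    cov = np.eye(d)
    cov[0, 3] = cov[3, 0] = 0.5        # cor(Z1, Z4) = 0.5  (0-indexed columns)
    cov[2, 5] = cov[5, 2] = 0.9        # cor(Z3, Z6) = 0.9
    Z = rng.multivariate_normal(np.zeros(d), cov, size=n)
    Phi = np.vectorize(lambda z: 0.5 * (1 + math.erf(z / math.sqrt(2))))
    return Phi(Z)                       # each column is marginally Uniform(0,1)

X = generate_example_42(2000, 8)
print(np.corrcoef(X[:, 0], X[:, 3])[0, 1])  # close to 0.5 (the CDF map shrinks it slightly)
```

The transform preserves the dependence pattern of the Z_i's while making each marginal exactly uniform, as the example requires.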
The results of Example 4.2 are given in Table 4. Comparing with the results in Table 2, both CATCH and GUIDE do well in the presence of dependence between the numerical variables. The explanation is two-fold. First, both methods identify the categorical X5 as an important predictor most of the time, which makes the identification of the other relevant predictors possible. Second, conditionally on X5, Y depends on a pair of variables ("X1 and X2" or "X3 and X4") jointly, and in particular this dependence is nonlinear, which makes both methods less affected by the correlation introduced here.
Figure 2: Sample scatter plots of (x1, x2) and (x3, x4) for data simulated from model (4.1). The left panel includes all the pairs (x1, x2) with x5 = 0, while the right panel includes all the pairs (x3, x4) with x5 = 1. The larger dots in the two plots are within the tube around a tube center, shown as the star.
We now explain the latter in detail for both methods. In CATCH, when predictors are conditioned on, the algorithm adaptively weighs their effects by their importance scores in defining the tubes; therefore identifying either of the two will help to identify the other, whereas an irrelevant variable, though strongly associated with one of the relevant variables, is down-weighted iteratively and does not strongly affect the selection of the relevant variable. This can explain the slightly decreasing selection probability of X3 in the presence of a strongly correlated X6 and, given the correlation between X1 and X4, the overall less frequent selection of X3 and X4 compared to X1 and X2. In GUIDE the difference between the dependent and independent cases is even smaller, because GUIDE is very powerful in detecting local interactions by literally including pairwise interactions when selecting splitting variables.
Marginal CATCH and Random Forest do poorly in this case, mainly because X5 cannot be detected, as we explained earlier. More specifically, the dependence between Y and X4 in the X5 = 1 case is the same as that between Y and X2 in the X5 = 0 case. The linear terms of X1 and X2 have opposite signs in the decision function; therefore the marginal effects of X1 and X4 are obscured when X5 is not identified as an important variable.
4.2 Improving the Performance of SVM and RF Classifiers
The criterion used in the CATCH algorithm is not based on classification accuracyBut if we use CATCH to screen out the irrelevant variables and only use the important
Table 4: Simulation results from Example 4.2 with sample size n = 400. For 1 ≤ i ≤ 6, the ith column gives the estimated probability of selection of variable X_i and the standard error based on 100 simulations, where X_i, i = 1, ..., 5, are relevant variables with cor(X1, X4) = 0.5; X6 is an irrelevant variable that is correlated with X3 with correlation 0.9. Column 7 gives the mean of the estimated probability of selection and that of the standard error for the irrelevant variables X7, ..., Xd.

Number of Selections (100 simulations)

METHOD          |  d  | X1          | X2          | X3          | X4          | X5          | X6          | Xi, i ≥ 7
CATCH           |  10 | 0.76 (0.04) | 0.77 (0.04) | 0.73 (0.04) | 0.68 (0.05) | 0.99 (0.01) | 0.34 (0.05) | 0.07 (0.02)
                |  50 | 0.78 (0.04) | 0.77 (0.04) | 0.74 (0.04) | 0.63 (0.05) | 0.92 (0.03) | 0.57 (0.05) | 0.09 (0.03)
                | 100 | 0.76 (0.04) | 0.77 (0.04) | 0.8 (0.04)  | 0.74 (0.04) | 0.82 (0.04) | 0.55 (0.05) | 0.09 (0.03)
Marginal CATCH  |  10 | 0.35 (0.05) | 0.63 (0.05) | 0.8 (0.04)  | 0.32 (0.05) | 0.09 (0.03) | 0.71 (0.05) | 0.12 (0.03)
                |  50 | 0.34 (0.05) | 0.71 (0.05) | 0.7 (0.05)  | 0.3 (0.05)  | 0.1 (0.03)  | 0.6 (0.05)  | 0.11 (0.03)
                | 100 | 0.34 (0.05) | 0.68 (0.05) | 0.75 (0.04) | 0.3 (0.05)  | 0.1 (0.03)  | 0.59 (0.05) | 0.1 (0.03)
Random Forest   |  10 | 0.29 (0.05) | 0.48 (0.05) | 0.55 (0.05) | 0.2 (0.04)  | 0.3 (0.05)  | 0.47 (0.05) | 0.05 (0.02)
                |  50 | 0.26 (0.04) | 0.52 (0.05) | 0.57 (0.05) | 0.3 (0.05)  | 0.24 (0.04) | 0.45 (0.05) | 0.08 (0.03)
                | 100 | 0.36 (0.05) | 0.61 (0.05) | 0.66 (0.05) | 0.37 (0.05) | 0.32 (0.05) | 0.46 (0.05) | 0.12 (0.03)
GUIDE           |  10 | 0.96 (0.02) | 0.96 (0.02) | 0.94 (0.02) | 0.95 (0.02) | 0.96 (0.02) | 0.84 (0.04) | 0.3 (0.05)
                |  50 | 0.8 (0.04)  | 0.92 (0.03) | 0.88 (0.03) | 0.79 (0.04) | 0.72 (0.04) | 0.68 (0.05) | 0.12 (0.03)
                | 100 | 0.81 (0.04) | 0.89 (0.03) | 0.9 (0.03)  | 0.76 (0.04) | 0.75 (0.04) | 0.58 (0.05) | 0.08 (0.03)
variables for classification, the performance of other algorithms such as the Support Vector Machine (SVM) and the Random Forest Classifier (RFC) can be significantly improved. In this section we compare the prediction performance of (1) SVM, (2) RFC, (3) CATCH followed by SVM, and (4) CATCH followed by RFC. We label (3) and (4) CATCH+SVM and CATCH+RFC, respectively. We use the "svm" function in the "e1071" package in R with the radial kernel to perform the SVM analysis, and the "randomForest" function in the "randomForest" package to perform the RFC analysis. We use simulation experiments to illustrate the improvement obtained by using CATCH prior to classifying Y.
Let the true model be
(X, Y) ~ P, (4.2)
where P will be specified later. In each of 100 Monte Carlo experiments, n pairs (X^(i), Y^(i)), i = 1, ..., n, are generated from the true model (4.2). Based on the simulated data (X^(i), Y^(i)), i = 1, ..., n, the methods (1) SVM, (2) RFC, (3) CATCH+SVM, and (4) CATCH+RFC are used to classify a future Y^(0) for which we have available a corresponding X^(0).
The integrated misclassification rate (IMR; Friedman [6]) defined below is used to evaluate the performance of the classification algorithms. Let F_{X,Y}(·) be the classifier constructed by a classification algorithm based on X = (X^(1), ..., X^(n)) and Y = (Y^(1), ..., Y^(n))^T.

Let (X^(0), Y^(0)) be a future observation with the same distribution as (X^(i), Y^(i)). We only know X^(0) = x^(0) and want to classify Y^(0). For a possible future observation (x^(0), y^(0)), define

MR(X, Y, x^(0), y^(0)) = P(F_{X,Y}(x^(0)) ≠ y^(0) | X, Y). (4.3)

Our criterion is

IMR = E E_0[MR(X, Y, x^(0), y^(0))], (4.4)

where E_0 is with respect to the distribution of the random (x^(0), y^(0)) and E is with respect to the distribution of (X, Y).
This double expectation is evaluated using Monte Carlo simulation; the result is denoted IMR*. More precisely, E_0 is estimated using 5000 simulated (x^(0), y^(0)) observations, and then IMR* is computed on the basis of 100 simulated samples (X_i, Y_i), i = 1, ..., 100.
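The nested Monte Carlo evaluation of (4.4) can be sketched as follows, with a toy two-class data generator and a nearest-class-mean classifier standing in for the SVM/RFC fits (all names here are ours):

```python
import numpy as np

def estimate_imr(gen, fit, n=200, n_train_sets=20, n_test=1000, seed=0):
    """Monte Carlo version of (4.4): outer average over training sets (X, Y),
    inner average over future observations (x0, y0)."""
    rng = np.random.default_rng(seed)
    rates = []
    for _ in range(n_train_sets):
        X, y = gen(n, rng)
        classify = fit(X, y)
        X0, y0 = gen(n_test, rng)                  # future observations
        rates.append(np.mean(classify(X0) != y0))  # MR for this training set
    return float(np.mean(rates))

def gen(n, rng):  # toy model: two Gaussian classes in 2 dimensions
    y = rng.integers(0, 2, n)
    return y[:, None] + rng.normal(0.0, 1.0, (n, 2)), y

def fit(X, y):    # nearest-class-mean classifier (stand-in for SVM/RFC)
    mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    return lambda Z: (((Z - mu1) ** 2).sum(1) < ((Z - mu0) ** 2).sum(1)).astype(int)

print(estimate_imr(gen, fit))  # roughly the Bayes error of the toy model
```

Swapping in a different `fit` (e.g. one that first screens variables) lets the same harness compare methods, which is how the IMR* curves in the figures below are produced in spirit.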
Example 4.3. First consider model (4.1) in Section 4.1. Figure 3 shows IMR* versus the number of irrelevant variables for this example. The left panel shows that the IMR*'s of the SVM approach increase dramatically as the number of irrelevant variables increases, while the IMR*'s of CATCH+SVM are stable. Thus CATCH stabilizes the
performance of SVM for this complex model. The right panel shows that CATCH also stabilizes the performance of RFC. These results indicate that CATCH can improve the prediction accuracy of other classification algorithms.
Figure 3: IMR*'s of SVM, RFC, CATCH+SVM, and CATCH+RFC versus the number of predictors for the simulation study of model (4.1) in Example 4.1. The sample size is n = 400.
Example 4.4 (Contaminated Logistic Model). Consider a binary response Y ∈ {0, 1} with

P(Y = 0 | X = x) = F(x) if r = 0; G(x) if r = 1, (4.5)

where x = (x1, ..., xd),

r ~ Bernoulli(γ), (4.6)

F(x) = (1 + exp{-(0.25 + 0.5x1 + x2)})^{-1}, (4.7)

G(x) = 0.1 if 0.25 + 0.5x1 + x2 < 0; 0.9 if 0.25 + 0.5x1 + x2 ≥ 0, (4.8)

and x1, x2, ..., xd are realizations of independent standard normal random variables. Hence, given X^(i) = x^(i), 1 ≤ i ≤ n, the joint distribution of Y^(1), Y^(2), ..., Y^(n) is the contaminated logistic model with contamination parameter γ:
(1 - γ) Π_{i=1}^n [F(x^(i))]^{Y^(i)} [1 - F(x^(i))]^{1-Y^(i)} + γ Π_{i=1}^n [G(x^(i))]^{Y^(i)} [1 - G(x^(i))]^{1-Y^(i)}. (4.9)
We perform a simulation study for n = 200, d = 50, and γ = 0, 0.2, 0.4, 0.6, 0.8, and 1, where γ = 0 corresponds to no contamination. Figure 4 shows the IMR*'s versus the different values of γ for this example. CATCH improves the performance of the SVM and Random Forest classifiers for this contaminated logistic regression model.
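A generator for the contaminated logistic model (4.5)-(4.8) can be sketched as (function name ours):

```python
import numpy as np

def generate_contaminated(n, d, gamma, seed=0):
    """Draw (X, Y) from the contaminated logistic model (4.5)-(4.8)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    eta = 0.25 + 0.5 * X[:, 0] + X[:, 1]
    F = 1.0 / (1.0 + np.exp(-eta))         # (4.7): P(Y=0|x) when r = 0
    G = np.where(eta < 0, 0.1, 0.9)        # (4.8): P(Y=0|x) when r = 1
    r = rng.random(n) < gamma              # (4.6): r ~ Bernoulli(gamma)
    p0 = np.where(r, G, F)
    y = (rng.random(n) >= p0).astype(int)  # Y = 0 with probability p0
    return X, y

X, y = generate_contaminated(200, 50, gamma=0.4)
print(X.shape)  # → (200, 50)
```

Setting `gamma=0` recovers the uncontaminated logistic model, and `gamma=1` the pure step-function contamination.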
Figure 4: IMR*'s of SVM, RFC, CATCH+SVM, and CATCH+RFC versus different contamination parameters γ for simulation model (4.9) in Example 4.4.
5 CATCH Analysis of the ANN Data
In this section CATCH is applied to a data set available at the UCI data base (http://ftp.ics.uci.edu/pub/machine-learning-databases/thyroid-disease). The data are from a clinical trial to determine whether a patient is hypothyroid (see, e.g., Quinlan [15]). The dataset is split into two parts: a training set with 3772 observations and a testing set with 3428 observations. The response is a categorical variable with three classes: normal (not hypothyroid), hyperfunction, and subnormal functioning. There are 21 predictors, including 15 categorical predictors and 6 numerical predictors. The three classes are very imbalanced, with 92.5% normal patients. A good classifier should produce a significant decrease from the error rate of 7.5%, which can be obtained by classifying all observations to the normal class.
CATCH identifies 7 variables as important, including 4 continuous variables and 3 categorical variables. We consider three classification methods: (1) Support Vector Machine with radial kernel, denoted by SVM(r); (2) Support Vector Machine with polynomial kernel, denoted by SVM(p); (3) RFC. We apply these three methods to the training set to build classification models and use the test set to estimate the misclassification rate of the methods. We apply the methods with CATCH and without CATCH, where "with CATCH" means that only the 7 important variables identified by CATCH are used to build the model, and "without CATCH" means that all 21 variables are used in the analysis.
Table 5: Misclassification rates of three classification methods (SVM(r), SVM(p), RFC) with CATCH and without CATCH.

           | SVM(r) | SVM(p) | RFC
w/o CATCH  | 0.052  | 0.063  | 0.024
with CATCH | 0.037  | 0.055  | 0.015
Table 5 shows that if CATCH is used to select important variables before the classification methods are applied, the misclassification rates decrease. CATCH reduces the misclassification rates by 29%, 13%, and 37% for SVM(r), SVM(p), and RFC, respectively. CATCH + RFC has the best performance. Although CATCH is not designed to improve classification accuracy, it nonetheless can help other classification methods perform better.
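The quoted reductions follow directly from the rates in Table 5:

```python
# relative reduction in misclassification rate: (without - with) / without
rates = {"SVM(r)": (0.052, 0.037), "SVM(p)": (0.063, 0.055), "RFC": (0.024, 0.015)}
for method, (without, with_catch) in rates.items():
    reduction = 100 * (without - with_catch) / without
    print(method, round(reduction, 1))  # 28.8, 12.7, 37.5 -> about 29%, 13%, 37%
```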
6 Properties of Local Contingency Efficacy
In this section we investigate the properties of the local contingency efficacy. For a univariate continuous predictor, Theorem 6.1 below shows that the local contingency efficacy is well defined and equals 0 if the response variable is independent of the predictor. Consider an R × C contingency table. Let n_rc, r = 1, ..., R, c = 1, ..., C, be the observed frequency in the (r, c)-cell, and let p_rc be the probability that an observation belongs to the (r, c)-cell. Let

p_r = Σ_{c=1}^C p_rc,  q_c = Σ_{r=1}^R p_rc, (6.1)
and

n_r· = Σ_{c=1}^C n_rc,  n_·c = Σ_{r=1}^R n_rc,  n = Σ_{r=1}^R Σ_{c=1}^C n_rc,  E_rc = n_r· n_·c / n. (6.2)
For testing that the rows and the columns are independent, i.e.,

H_0: p_rc = p_r q_c for all r, c  versus  H_1: p_rc ≠ p_r q_c for some r, c, (6.3)

a standard test statistic is

X² = Σ_{r=1}^R Σ_{c=1}^C (n_rc - E_rc)² / E_rc. (6.4)
Under H_0, X² converges in distribution to χ²_{(R-1)(C-1)}. Moreover, defining 0/0 = 0, we have the well-known result:
Lemma 6.1. As n tends to infinity, X²/n tends in probability to

τ² ≡ Σ_{r=1}^R Σ_{c=1}^C (p_rc - p_r q_c)² / (p_r q_c).

Moreover, if the p_r q_c are bounded away from 0 and 1 as n → ∞, and if R and C are fixed as n → ∞, then τ² = 0 if and only if H_0 is true.
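Lemma 6.1 is easy to verify numerically; a sketch comparing X²/n with τ² for a dependent 2 × 2 table:

```python
import numpy as np

def chi2_over_n(n_rc):
    """X^2 / n for an observed R x C table of counts, per (6.4)."""
    n = n_rc.sum()
    E = np.outer(n_rc.sum(1), n_rc.sum(0)) / n   # E_rc = n_r. * n_.c / n
    return ((n_rc - E) ** 2 / E).sum() / n

def tau_squared(p_rc):
    """Population limit tau^2 from Lemma 6.1."""
    p_r, q_c = p_rc.sum(1), p_rc.sum(0)
    return ((p_rc - np.outer(p_r, q_c)) ** 2 / np.outer(p_r, q_c)).sum()

p = np.array([[0.30, 0.20],
              [0.10, 0.40]])                     # dependent rows and columns
rng = np.random.default_rng(0)
counts = rng.multinomial(200000, p.ravel()).reshape(2, 2)
print(abs(chi2_over_n(counts) - tau_squared(p)) < 0.02)  # → True
```

For an independent table (p_rc = p_r q_c exactly) `tau_squared` returns 0, matching the "if and only if" statement.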
Remark 6.1. Lemma 6.1 shows that the contingency efficacy ζ in (2.8) is well defined. Moreover, ζ = 0 if and only if X and Y are independent.
Theorem 6.1. Consider a categorical response Y and a continuous predictor X with density function f(x) such that, for a fixed h > 0, f(x) > 0 for x ∈ (x^(0) - h, x^(0) + h). Then:

(1) For fixed h, the estimated local section efficacy ζ̂(x^(0), h) defined in (2.5) converges almost surely to the section efficacy ζ(x^(0), h) defined in (2.4).

(2) If X and Y are independent, then the local efficacy ζ(x^(0)) defined in (2.4) is zero.

(3) If there exists c ∈ {1, ..., C} such that p_c(x) = P(Y = c | x) has a continuous derivative at x = x^(0) with p'_c(x^(0)) ≠ 0, then ζ(x^(0)) > 0.
The χ² statistic (6.4) can be written in terms of the multinomial proportions p̂_rc, with √n(p̂_rc - p_rc), 1 ≤ r ≤ R, 1 ≤ c ≤ C, converging in distribution to a multivariate normal. By a Taylor expansion of T_n ≡ √(χ²/n) at τ = plim √(χ²/n), we find that
Theorem 6.2.

√n(T_n - τ) →_d N(0, v²), (6.5)

where the asymptotic variance v², which can be obtained from the Taylor expansion, is not needed in this paper.
Remark 6.2. The result (6.5) applies to the multivariate χ²-based efficacies ζ̂, with n replaced by the number of observations that go into the computation of ζ̂. In this case we use the notation χ²_j for the chi-square statistic for the jth variable.
7 Consistency of CATCH
Let Y denote a variable to be classified into one of the categories 1, ..., C by using the X_j's in the vector X = (X1, ..., Xd). Let X_{-j} ≡ X - {X_j} denote X without the X_j component. Assume that the squared Pearson nonparametric correlation (Doksum and Samarov [4], Huang and Chen [9]) between X_j and X_{-j} is less than 0.5. We call the variable X_j relevant for Y if, conditionally given X_{-j}, L(Y | X_j = x) is not constant in x, and irrelevant otherwise. Our data (X^(1), Y^(1)), ..., (X^(n), Y^(n)) are i.i.d. as (X, Y). A variable selection procedure is consistent if the probability that all the relevant variables are selected and none of the irrelevant variables are selected tends to one as n tends to infinity.
We give a general nonparametric framework in which the variable selection rule based on the CATCH importance scores s_j given in (3.8) is consistent.
When the predictors are not all categorical and CATCH involves tubes, we assume that the number of observations in each tube tends to ∞ as n → ∞, and we assume that, for those importance scores that require the selection of a section size h, the selection is from a finite set {h_j} with min_j{h_j} > h_0 for some h_0 > 0. It follows that the number of observations m_jk in the kth tube used to calculate the importance score s_j for the variable X_j tends to infinity as n → ∞.
The key quantities that determine consistency are the limits λ_j of s_j. That is,

λ_j = τ(ζ_j, P) = E_P[ζ_j(X)] = ∫ ζ_j(x) dP(x), j = 1, ..., d,

where ζ_j(x) is the local contingency efficacy defined by (2.4) for numerical X's and by (2.8) for categorical X's, and P is the probability distribution of X. Our estimate s_j of λ_j is

s_j = τ(ζ̂_j, P̂) = (1/M) Σ_{k=1}^M ζ̂_j(X*_k),

where ζ̂_j is defined in Section 3 and P̂ is the empirical distribution of X. Let d_0 denote the number of X's that are irrelevant for Y. Set d_1 = d - d_0 and, without loss of generality, reorder the λ_j so that
λ_j > 0 if j = 1, ..., d_1;  λ_j = 0 if j = d_1 + 1, ..., d. (7.1)
Our variable selection rule is

"keep variable X_j if s_j ≥ t" for some threshold t. (7.2)
Rule (7.2) is consistent if and only if

P(min_{j≤d_1} s_j ≥ t and max_{j>d_1} s_j < t) → 1. (7.3)

To establish (7.3) it is enough to show that

P(max_{j>d_1} s_j > t) → 0 (7.4)

and

P(min_{j≤d_1} s_j ≤ t) → 0. (7.5)
We call (7.4) and (7.5) type I and type II consistency, respectively. Type I and type II error probabilities refer to the probabilities of any false positives and any false negatives, respectively.
7.1 Type I consistency
We consider (7.4) first and note that

P(max_{j>d_1} s_j > t) ≤ Σ_{j=d_1+1}^d P(s_j > t). (7.6)
For the jth irrelevant variable X_j, j > d_1, the results in Section 6 apply to the local efficacy ζ̂_j(x) at a fixed point x. The importance score s_j defined in (3.8) is the average of such efficacies over a sample of M random points x*_a. We assume that there is a point x^(0) such that, for irrelevant X_j's and for n large enough, s_j^(0) ≡ ζ̂_j(x^(0)) is stochastically larger in the right tail than the average M^{-1} Σ_{a=1}^M ζ̂_j(x*_a). That is, we assume that there exist t > 0 and n_0 such that for n ≥ n_0,

P(s_j > t) ≤ P(s_j^(0) > t), j > d_1. (7.7)
The existence of such a point x^(0) is established by considering the probability limit of

arg max{ζ̂_j(x*_a) : x*_a ∈ {x*_1, ..., x*_M}}.
Note that by Lemma 6.1, s_j^(0) →_p 0 as n → ∞. It follows that we have established
(7.4) for t and d_0 = d - d_1 that are fixed as n → ∞. To examine the interesting case where d_0 → ∞ as n → ∞, we need large deviation results, which we turn to next. In the d_0 → ∞ case we consider t ≡ t_n tending to ∞ as n → ∞, and we use the weaker assumption: there exist x^(0) and n_0 such that
P(s_j > t_n) ≤ P(s_j^(0) > t_n) (7.8)
for n ≥ n_0 and each t_n with lim_{n→∞} t_n = ∞. We also assume that the number of columns C and rows R in the contingency table that defines the efficacy χ²_jn given by (6.4) stays fixed as n → ∞. In this case P(s_j^(0) > t_n) → 0 provided each term
[T^(j)_rc]² ≡ [(n_rc - E_rc)/√E_rc]²

in s_j² = χ²_jn, defined for X_j by (6.4), satisfies

P(|T^(j)_rc| > t_n) → 0.
This is because

P(√(Σ_r Σ_c [T^(j)_rc]²) > t_n) ≤ P(RC max_{r,c} [T^(j)_rc]² > t_n²) ≤ RC max_{r,c} P(|T^(j)_rc| > t_n/√(RC)),
and t_n and t_n/√(RC) are of the same order as n → ∞. By large deviation theory (Hall [8], Jing et al. [10]), for some constant A,

P(|T^(j)_rc| > t_n) = (2 t_n^{-1}/√(2π)) e^{-t_n²/2} [1 + A t_n³ n^{-1/2} + o(t_n³ n^{-1/2})], (7.9)
provided t_n = o(n^{1/6}). If (7.9) is uniform in j > d_1, then (7.6) and (7.9) imply that type I consistency holds if

d_0 (t_n √2)^{-1} e^{-t_n²/2} → 0. (7.10)
To examine (7.10), write d_0 and t_n as d_0 = exp(n^b) and t_n = √2 n^r. By large deviation theory, (7.10) holds when r < 1/6. Thus if 0 < (1/2)b < r < 1/6, then

d_0 (t_n √2)^{-1} e^{-t_n²/2} = n^{-r} exp(n^b - n^{2r}) → 0.
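The rate in this display is easy to check numerically; a sketch with b = 1/4 and r = 1/7, which satisfy 0 < b/2 < r < 1/6:

```python
import math

def type1_bound(n, b, r):
    """n^{-r} * exp(n^b - n^{2r}); the type I bound vanishes when 2r > b."""
    return n ** (-r) * math.exp(n ** b - n ** (2 * r))

vals = [type1_bound(n, b=1 / 4, r=1 / 7) for n in (10**3, 10**6, 10**9)]
print(vals[0] > vals[1] > vals[2])  # → True: the bound decreases toward 0
```

The dominant factor exp(n^b - n^{2r}) is what allows exponentially many irrelevant variables d_0 = exp(n^b) while keeping the false-positive probability vanishing.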
Thus d_0 = exp(n^b) with b = 1/3 - ε, for any ε > 0, is the largest d_0 can be to have consistency. For instance, if there are exactly d_0 = exp(n^{1/4}) irrelevant variables and we use the threshold t_n = √2 n^{1/7}, then the probability of selecting any of the irrelevant variables tends to zero at the rate

n^{-1/7} exp(n^{1/4} - n^{2/7}).

7.2 Type II consistency
We now assume that for each relevant variable X_j there is a least favorable point x^(1) such that the local efficacy s_j^(1) ≡ ζ̂_j(x^(1)) is stochastically smaller than the importance score s_j = M^{-1} Σ_{a=1}^M ζ̂_j(x*_a). That is, we assume that for each relevant variable X_j there exist n_1 and x^(1) such that for n ≥ n_1,

P(s_j^(1) ≤ t_n) ≥ P(s_j ≤ t_n), j ≤ d_1. (7.11)
The existence of such a least favorable x^(1) is established by considering the probability limit of

arg min{ζ̂_j(x*_a) : x*_a ∈ {x*_1, ..., x*_M}}.
We again assume that R and C are fixed, so that to show type II consistency it is enough to show that, for each (r, c) and j ≤ d_1,

P(|T^(j)_rc| ≤ t_n) → 0.
This is because

P(√(Σ_r Σ_c [T^(j)_rc]²) ≤ t_n) ≤ P(min_{r,c} [T^(j)_rc]² ≤ t_n²) ≤ RC max_{r,c} P(|T^(j)_rc| ≤ t_n).
By Theorem 6.2, √n(s_j^(1) - τ_j) →_d N(0, ν²), with τ_j > 0. It follows that if t_n and d_1 are bounded above as n → ∞ and τ_j > 0, then P(s_j^(1) ≤ t_n) → 0. Thus type II consistency holds in this case. To examine the more interesting case where t_n and d_1 are not bounded above and τ_j may tend to zero as n → ∞, we need to center the T^(j)_rc. Thus set
θ^(j)_n = √n (p^(j)_rc - p^(j)_r p^(j)_c) / √(p^(j)_r p^(j)_c), (7.12)

T^(j)_n = T^(j)_rc - θ^(j)_n,

where for simplicity the dependence of T^(j)_n and θ^(j)_n on r and c is not indicated.
Lemma 7.1.

P(min_{j≤d_1} |T^(j)_rc| ≤ t_n) ≤ Σ_{j=1}^{d_1} P(|T^(j)_n| ≥ min_{j≤d_1} |θ^(j)_n| - t_n).

Proof.

P(min_{j≤d_1} |T^(j)_rc| ≤ t_n) = P(min_{j≤d_1} |T^(j)_n + θ^(j)_n| ≤ t_n)
≤ P(min_{j≤d_1} |θ^(j)_n| - max_{j≤d_1} |T^(j)_n| ≤ t_n)
= P(max_{j≤d_1} |T^(j)_n| ≥ min_{j≤d_1} |θ^(j)_n| - t_n)
≤ Σ_{j=1}^{d_1} P(|T^(j)_n| ≥ min_{j≤d_1} |θ^(j)_n| - t_n).
Set
$$a_n = \min_{j\le d_1}\big|\theta^{(j)}_n\big| - t_n;$$
then by large deviation theory, if $a_n = o(n^{1/6})$,
$$P\big(\big|T^{(j)}_n\big| \ge a_n\big) \propto \big(a_n\sqrt{2\pi}\big)^{-1}\exp\{-a_n^2/2\}, \qquad (7.13)$$
where $\propto$ signifies asymptotic order as $n \to \infty$, $a_n \to \infty$. It follows that if (7.13) holds uniformly in $j$, then by Lemma 7.1,
$$P\Big(\min_{j\le d_1}\big|T^{(j)}_{rc}\big| \le t_n\Big) \propto d_1\big(a_n\sqrt{2\pi}\big)^{-1}\exp\{-a_n^2/2\}.$$
Thus type II consistency holds when $a_n = o(n^{1/6})$ and
$$d_1\big(a_n\sqrt{2\pi}\big)^{-1}\exp\{-a_n^2/2\} \to 0. \qquad (7.14)$$
In Section 7.1, $t_n$ is of order $n^r$, $0 < r < 1/6$. For this $t_n$, type II consistency holds if $a_n$ is of order $n^r$, $0 < r < 1/6$, and
$$d_1 n^{-r}\exp\{-n^{2r}/2\} \to 0.$$
If $a_n = O(n^k)$ with $k \ge 1/6$, then $P\big(\big|T^{(j)}_n\big| \ge a_n\big)$ tends to zero at a rate faster than in (7.13) and consistency is ensured. For instance, if all relevant variables $X_j$ have $p^{(j)}_{rc}$ fixed as $n \to \infty$, then $\min_{j\le d_1}\big|\theta^{(j)}_n\big|$ is of order $\sqrt{n}$ and $a_n$ is of order $n^{1/2} - t_n$.
7.3 Type I and II consistency

We see from Sections 7.1 and 7.2 that consistency holds under a variety of conditions. One set of conditions is the following.

Theorem 7.1. Under the regularity conditions of Sections 7.1 and 7.2, if
$$d_0 = c_0\exp(n^b), \quad t_n = c_1 n^r, \quad \min_{j\le d_1}\theta^{(j)} - t_n = c_2 n^r, \quad d_1 = c_3\exp(n^b)$$
for positive constants $c_0$, $c_1$, $c_2$, and $c_3$, and $0 < \tfrac{1}{2}b < r < \tfrac{1}{6}$, then CATCH is consistent.
Appendix

Proof of Theorem 6.1.

(1) Let $N_h = \sum_{i=1}^{n}I\big(X^{(i)} \in N_h(x^{(0)})\big)$. Then $N_h \sim \mathrm{Bin}\big(n, \int_{x^{(0)}-h}^{x^{(0)}+h}f(x)\,dx\big)$, and $N_h \to_{a.s.} \infty$ as $n \to \infty$,
$$\hat\zeta(x^{(0)}, h) = n^{-1/2}\sqrt{\mathcal{X}^2(x^{(0)}, h)} = (N_h/n)^{1/2}\,N_h^{-1/2}\sqrt{\mathcal{X}^2(x^{(0)}, h)}.$$
Since $N_h \sim \mathrm{Bin}\big(n, \int_{x^{(0)}-h}^{x^{(0)}+h}f(x)\,dx\big)$, we have
$$N_h/n \to \int_{x^{(0)}-h}^{x^{(0)}+h}f(x)\,dx. \qquad (7.15)$$
By Lemma 6.1 and Table 1,
$$N_h^{-1/2}\sqrt{\mathcal{X}^2(x^{(0)}, h)} \to \Big(\sum_{r=+,-}\sum_{c=1}^{C}\frac{(p_{rc} - p_r q_c)^2}{p_r q_c}\Big)^{1/2}, \qquad (7.16)$$
where
$$p_{+c} = \Pr\big(X > x_0, Y = c \mid X \in N_h(x^{(0)})\big), \quad p_{-c} = \Pr\big(X \le x_0, Y = c \mid X \in N_h(x^{(0)})\big)$$
for $r = +,-$ and $c = 1, \cdots, C$, and
$$p_r = \sum_{c=1}^{C}p_{rc}, \qquad q_c = \sum_{r=+,-}p_{rc}.$$
That is, $p_+ = \Pr\big(X > x_0 \mid X \in N_h(x^{(0)})\big)$, $p_- = \Pr\big(X \le x_0 \mid X \in N_h(x^{(0)})\big)$, and $q_c = \Pr\big(Y = c \mid X \in N_h(x^{(0)})\big)$. By (7.15) and (7.16),
$$\hat\zeta(x^{(0)}, h) \to \Big(\int_{x^{(0)}-h}^{x^{(0)}+h}f(x)\,dx\Big)^{1/2}\Big(\sum_{r=+,-}\sum_{c=1}^{C}\frac{(p_{rc} - p_r q_c)^2}{p_r q_c}\Big)^{1/2} \equiv \zeta(x^{(0)}, h).$$

(2) Since $X$ and $Y$ are independent, $p_{rc} = p_r q_c$ for $r = +,-$ and $c = 1, \cdots, C$; then $\zeta(x^{(0)}, h) = 0$ for any $h$. Hence
$$\zeta(x^{(0)}) = \sup_{h>0}\zeta(x^{(0)}, h) = 0.$$

(3) Recall that $p_c(x^{(0)}) = \Pr(Y = c \mid x^{(0)})$. Without loss of generality, assume $p'_c(x^{(0)}) > 0$. Then there exists $h$ such that
$$\Pr\big(Y = c \mid x^{(0)} < X < x^{(0)} + h\big) > \Pr\big(Y = c \mid x^{(0)} - h < X < x^{(0)}\big),$$
which is equivalent to $p_{+c}/p_+ > p_{-c}/p_-$. This shows that $\zeta(x^{(0)}) > 0$, since by Lemma 6.1, $\zeta(x^{(0)}) = 0$ results in $p_{+c}/p_+ = p_{-c}/p_- = p_c$.
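The empirical quantity $n^{-1/2}\sqrt{\mathcal{X}^2(x^{(0)}, h)}$ appearing in the proof can be made concrete with a minimal numerical sketch (our own simplification, not the authors' code: a fixed window stands in for the neighborhood, and a plain chi-square statistic is computed on the resulting $2 \times C$ table):

```python
import numpy as np

def local_efficacy(x, y, x0, h, n_classes):
    """Local contingency efficacy sketch for one numerical predictor:
    cross-classify points in the window (x0-h, x0+h) by the sign of X - x0
    and the class label, then return n**-0.5 * sqrt(chi-square)."""
    n = len(x)
    in_win = (x > x0 - h) & (x < x0 + h)
    xw, yw = x[in_win], y[in_win]
    counts = np.zeros((2, n_classes))
    for r, mask in enumerate([xw <= x0, xw > x0]):
        for c in range(n_classes):
            counts[r, c] = np.sum(yw[mask] == c)
    tot = counts.sum()
    if tot == 0:
        return 0.0
    exp_ = counts.sum(1, keepdims=True) * counts.sum(0, keepdims=True) / tot
    valid = exp_ > 0
    chi2 = np.sum((counts[valid] - exp_[valid]) ** 2 / exp_[valid])
    return np.sqrt(chi2 / n)

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 2000)
y_dep = (x > 0.5).astype(int)         # Y depends on X: class flips at 0.5
y_ind = rng.integers(0, 2, 2000)      # Y independent of X
s_dep = local_efficacy(x, y_dep, 0.5, 0.2, 2)
s_ind = local_efficacy(x, y_ind, 0.5, 0.2, 2)
assert s_dep > s_ind  # the efficacy detects the local association
```

Under independence (part (2) of the theorem) the rescaled statistic is near zero; when the class probabilities change across $x^{(0)}$ (part (3)), it is bounded away from zero.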
References

[1] Breiman, L., 2001. Random forests. Machine Learning 45, 5-32.

[2] Breiman, L., Friedman, J. H., Olshen, R. A., Stone, C. J., 1984. Classification and Regression Trees. Wadsworth, Belmont.

[3] Doksum, K., Tang, S., Tsui, K.-W., 2008. Nonparametric variable selection: The EARTH algorithm. Journal of the American Statistical Association 103 (484), 1609-1620.

[4] Doksum, K. A., Samarov, A., 1995. Nonparametric estimation of global functionals and a measure of the explanatory power of covariates in regression. Annals of Statistics 23, 1443-1473.

[5] Doksum, K. A., Schafer, C., 2006. Powerful choices: Tuning parameter selection based on power. In: Frontiers in Statistics: Dedicated to Peter Bickel (J. Fan and H. L. Koul, eds.), Imperial College Press, 113-141.

[6] Friedman, J., 1991. Multivariate adaptive regression splines. The Annals of Statistics 19, 1-67.

[7] Gao, J., Gijbels, I., 2008. Bandwidth selection in nonparametric kernel testing. Journal of the American Statistical Association 103 (484), 1584-1594.

[8] Hall, P., 1992. The Bootstrap and Edgeworth Expansion. Springer, New York.

[9] Huang, L.-S., Chen, J., 2008. Analysis of variance, coefficient of determination and F-test for local polynomial regression. Annals of Statistics 36, 2085-2109.

[10] Jing, B.-Y., Shao, Q.-M., Wang, Q., 2003. Self-normalized Cramér-type large deviations for independent random variables. Annals of Probability 31, 2167-2215.

[11] Lin, D. Y., Zeng, D., 2011. Correcting for population stratification in genomewide association studies. Journal of the American Statistical Association 106, 997-1008.

[12] Loh, W.-Y., 2009. Improving the precision of classification trees. The Annals of Applied Statistics 3, 1710-1737.

[13] Price, A., Patterson, N., Plenge, R., Weinblatt, M., Shadick, N., Reich, D., 2006. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38 (8), 904-909.

[14] Qiao, Z., Zhou, L., Huang, J., 2008. Linear discriminant analysis for high-dimensional, low-sample-size data. Special issue of the World Congress on Engineering 2008.

[15] Quinlan, J. R., 1987. Simplifying decision trees. International Journal of Man-Machine Studies 27 (3), 221-234.

[16] Schafer, C., Doksum, K. A., 2009. Selecting local models in multiple regression by maximizing power. Metrika 69, 283-304.

[17] Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267-288.

[18] Wang, L., Shen, X., 2007. On L1-norm multi-class support vector machines: methodology and theory. Journal of the American Statistical Association 102, 583-594.

[19] Zhang, H. H., Liu, Y., Wu, Y., Zhu, J., 2008. Variable selection for the multicategory SVM via adaptive sup-norm regularization. Electronic Journal of Statistics 2, 149-167.
3.3 The Empirical Importance Scores

The empirical efficacies (3.5) and (3.7) measure how strongly $Y$ depends on $X_l$ near one point $x^{(0)}$. To measure the overall dependency of $Y$ on $X_l$, we randomly sample $M$ bootstrap points $x^*_a$, $a = 1, \ldots, M$, with replacement from $\{x^{(i)}, 1 \le i \le n\}$ and calculate the average of the local tube efficacy estimates at these points.

Definition 6. The CATCH empirical importance score for variable $l$ is
$$s_l = M^{-1}\sum_{a=1}^{M}\hat\zeta_l(x^*_a), \qquad (3.8)$$
where $\hat\zeta_l(x^*_a)$ is defined in (3.5) and (3.7) for numerical and categorical predictors, respectively.
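Definition 6 amounts to a bootstrap average of local scores. A minimal sketch, in which a stand-in lambda plays the role of the local efficacy $\hat\zeta_l$ of (3.5)/(3.7):

```python
import numpy as np

def importance_score(x_rows, local_score, M, rng):
    """Sketch of (3.8): average a local score over M bootstrap tube
    centers drawn with replacement from the observed rows."""
    idx = rng.integers(0, len(x_rows), size=M)  # centers sampled with replacement
    return np.mean([local_score(x_rows[i]) for i in idx])

rng = np.random.default_rng(1)
x_rows = rng.uniform(0, 1, size=(500, 3))
# toy local score (a placeholder, not zeta-hat): larger near x_1 = 0.5
s = importance_score(x_rows, lambda c: 1.0 - abs(c[0] - 0.5), M=200, rng=rng)
assert 0.5 < s < 1.0  # mean of the toy score is about 0.75
```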
3.4 Adaptive Tube Distances

Doksum et al. [3] used an example to demonstrate that the distance $D$ needs to be adaptive in order to keep variables strongly associated with $Y$ from obscuring the effect of weaker variables. Suppose $Y$ depends on three variables, $X_1$, $X_2$, and $X_3$, among the larger set of predictors, with the strength of the dependency ranging from weak to strong; this strength will be estimated by the empirical importance scores. As we examine the effect of $X_2$ on $Y$, we want to define the tube so that there is relatively less variation in $X_3$ than in $X_1$ inside the tube, because $X_3$ is more likely than $X_1$ to obscure the effect of $X_2$. To this end we introduce a tube distance for the variable $X_l$ of the form
$$D_l\big(x_{-l}, x^{(0)}_{-l}\big) = \sum_{j=1}^{d}w_j\,|x_j - x^{(0)}_j|\,I(j \ne l), \qquad (3.9)$$
where the weights $w_j$ adjust the contribution of strong variables to the distance so that the strong variables are more constrained.

The weights $w_j$ are proportional to $s_j - s'_j$, which are determined iteratively and adaptively as follows: in the first iteration, set $s_j = s^{(1)}_j = 1$ in (3.9) and compute the importance score $s^{(2)}_j$ using (3.8). For the second iteration we use the weights $s_j = s^{(2)}_j$ and the distance
$$D_l\big(x_{-l}, x^{(0)}_{-l}\big) = \frac{\sum_{j=1}^{d}(s_j - s'_j)_+\,|x_j - x^{(0)}_j|\,I(j \ne l)}{\sum_{j=1}^{d}(s_j - s'_j)_+\,I(j \ne l)}, \qquad (3.10)$$
where $0/0 \equiv 1$ and $s'_j$ is a threshold value for $s_j$ under the assumption of no conditional association of $X_l$ with $Y$, computed by a simple Monte Carlo technique. This adjustment is important because some predictors by definition have larger importance scores than others even when they are all independent of $Y$. For example, a categorical predictor with more levels has a larger importance score than one with fewer levels; the unadjusted scores $s_j$ therefore cannot accurately quantify the relative importance of the predictors in the tube distance calculation. Next we use (3.8) to produce $s_j = s^{(3)}_j$, use $s^{(3)}_j$ in the next iteration, and so on.
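The weighted distance (3.10) can be sketched as follows (our own illustration, with the function name and the `numeric` flag vector chosen here for exposition; categorical mismatches contribute a normalizing constant $k_0$, whose choice is discussed next):

```python
import numpy as np

def tube_distance(x, center, l, s, s_prime, numeric, k0=np.sqrt(2.0 / np.pi)):
    """Sketch of the adaptive tube distance (3.10) for variable l.
    Weights (s_j - s'_j)_+ constrain strong variables more; categorical
    coordinates contribute k0 on a mismatch (k0 is a normalizing constant)."""
    w = np.maximum(np.asarray(s, float) - np.asarray(s_prime, float), 0.0)
    num = den = 0.0
    for j in range(len(center)):
        if j == l:
            continue  # the variable under study is left unconstrained
        d_j = abs(x[j] - center[j]) if numeric[j] else k0 * (x[j] != center[j])
        num += w[j] * d_j
        den += w[j]
    return num / den if den > 0 else 1.0  # 0/0 treated as 1, as in (3.10)

# a hypothetical 3-variable example: variable 0 is strong, variable 2 categorical
d = tube_distance(x=[0.8, 0.3, 1], center=[0.2, 0.3, 0], l=1,
                  s=[0.9, 0.2, 0.5], s_prime=[0.1, 0.1, 0.1],
                  numeric=[True, True, False])
expected = (0.8 * 0.6 + 0.4 * np.sqrt(2.0 / np.pi)) / 1.2
assert np.isclose(d, expected)
```

Because variable 0 carries the largest weight $(s_0 - s'_0)_+ = 0.8$, departures in that coordinate dominate the distance, so tubes for $X_2$ are narrowest in the direction of the strongest competing variable.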
In the definition above we need to define $|x_j - x^{(0)}_j|$ appropriately for a categorical variable $X_j$. Because $X_j$ is categorical, we assume that $|x_j - x^{(0)}_j|$ is constant whenever $x_j$ and $x^{(0)}_j$ take any pair of different categories. One approach is to define $|x_j - x^{(0)}_j| = \infty \cdot I(x_j \ne x^{(0)}_j)$, in which case all points in the tube are given the same $X_j$ value as the tube center. This approach is a very sensible one, as it strictly implements the idea of conditioning on $X_j$ when evaluating the effect of another predictor $X_l$, so that $X_j$ does not contribute any variation in $Y$. The problem with this definition, though, is that when the number of categories of $X_j$ is relatively large compared to the sample size, there will not be enough points in the tube if we restrict $X_j$ to a single category. In such a case we do not restrict $X_j$, which means that $X_j$ can take any category in the tube: we define $|x_j - x^{(0)}_j| = k_0 I(x_j \ne x^{(0)}_j)$, where $k_0$ is a normalizing constant. The value of $k_0$ is chosen to achieve a balance between numerical variables and categorical variables. Suppose that $X_1$ is a numerical variable with standard deviation one and $X_2$ is a categorical variable. For $X_2$ we set $k_0 \equiv E\big(|X^{(1)}_1 - X^{(2)}_1|\big)$, where $X^{(1)}_1$ and $X^{(2)}_1$ are two independent realizations of $X_1$. Note that $k_0 = \sqrt{2/\pi}$ if $X_1$ follows a standard normal distribution. Also, $k_0$ can be based on the empirical distributions of the numerical variables. Based on the preceding discussion of the choice of $k_0$, we use $k_0 = \infty$ in the simulations and $k_0 = \sqrt{2/\pi}$ in the real data example.

Remark 3.1 (choice of $k_0$). $k_0 = \sqrt{2/\pi}$ is used unless the sample size is large enough in the sense explained below. Suppose $k_0 = \infty$; then $D_l\big(x_{-l}, x^{(0)}_{-l}\big)$ is finite only for those observations with $x_j = x^{(0)}_j$. When there are multiple categorical variables, $D_l$ is finite only for those points, denoted by the set $\Omega^{(0)}$, that satisfy $x_j = x^{(0)}_j$ for all categorical $x_j$. When there are many categorical variables, or when the total sample size is small, the size of this set can be very small or close to zero, in which case the tube cannot be well defined; therefore $k_0 = \sqrt{2/\pi}$ is used, as in the real data example. On the other hand, when the size of $\Omega^{(0)}$ is larger than the specified tube size, we can use $k_0 = \infty$, as in the simulation examples.
3.5 The CATCH Algorithm

The variable selection algorithm for a classification problem is based on importance scores and an iteration procedure.

The Algorithm

(0) Standardizing. Standardize all the numerical predictors to have sample mean 0 and sample standard deviation 1 using linear transformations.

(1) Initializing importance scores. Let $S = (s_1, \cdots, s_d)$ be importance scores for the predictors. Let $S' = (s'_1, \cdots, s'_d)$ be such that $s'_l$ is a threshold importance score corresponding to the model in which $X_l$ is independent of $Y$ given $X_{-l}$. Initialize $S$ to be $(1, \cdots, 1)$ and $S'$ to be $(0, \cdots, 0)$. Let $M$ be the number of tube centers and $m$ the number of observations within each tube.

(2) Tube hunting loop. For $l = 1, \cdots, d$, do (a), (b), (c), and (d) below.

(a) Selecting tube centers. For $0 < b \le 1$, set $M = [n^b]$, where $[\,\cdot\,]$ is the greatest integer function, and randomly sample $M$ bootstrap points $X^*_1, \cdots, X^*_M$ with replacement from the $n$ observations. Set $k = 1$; do (b).

(b) Constructing tubes. For $0 < a < 1$, set $m = [n^a]$. Using the tube distance $D_l$ defined in (3.10), select the tube so that there are exactly $m$ (with $m \ge 10$) observations inside the tube for $X_l$ at $X^*_k$.

(c) Calculating efficacy. If $X_l$ is a numerical variable, let $\hat\zeta_l(X^*_k)$ be the estimated local contingency tube efficacy of $Y$ on $X_l$ at $X^*_k$, as defined in (3.5). If $X_l$ is a categorical variable, let $\hat\zeta_l(X^*_k)$ be the estimated contingency tube efficacy of $Y$ on $X_l$ at $X^*_k$, as defined in (3.7).

(d) Thresholding. Let $Y^*$ denote a random variable which is identically distributed as $Y$ but independent of $X$. Let $(Y^{(1)}_0, \cdots, Y^{(n)}_0)$ be a random permutation of $(Y^{(1)}, \cdots, Y^{(n)})$; then $Y_0 \equiv (Y^{(1)}_0, \cdots, Y^{(n)}_0)$ can be viewed as a realization of $Y^*$. Let $\hat\zeta^*_l(X^*_k)$ be the estimated local contingency tube efficacy of $Y_0$ on $X_l$ at $X^*_k$, calculated as in (c) with $Y_0$ in place of $Y$.

If $k < M$, set $k = k + 1$ and go to (b); otherwise go to (e).

(e) Updating importance scores. Set
$$s^{\mathrm{new}}_l = \frac{1}{M}\sum_{k=1}^{M}\hat\zeta_l(X^*_k), \qquad s'^{\,\mathrm{new}}_l = \frac{1}{M}\sum_{k=1}^{M}\hat\zeta^*_l(X^*_k).$$
Let $S^{\mathrm{new}} = (s^{\mathrm{new}}_1, \cdots, s^{\mathrm{new}}_d)$ and $S'^{\,\mathrm{new}} = (s'^{\,\mathrm{new}}_1, \cdots, s'^{\,\mathrm{new}}_d)$ denote the updates of $S$ and $S'$.

(3) Iterations and stopping rule. Repeat (2) $I$ times. Stop when the change in $S^{\mathrm{new}}$ is small. Record the last $S$ and $S'$ as $S^{\mathrm{stop}}$ and $S'^{\,\mathrm{stop}}$, respectively.

(4) Deletion step. Using $S^{\mathrm{stop}}$ and $S'^{\,\mathrm{stop}}$, delete $X_l$ if $s^{\mathrm{stop}}_l - s'^{\,\mathrm{stop}}_l < \sqrt{2}\,\mathrm{SD}(s'^{\,\mathrm{stop}})$. If $d$ is small, we can generate several sets of $S'^{\,\mathrm{stop}}$ and then use the sample standard deviation of all the values in these sets as $\mathrm{SD}(s'^{\,\mathrm{stop}})$. See Remark 3.3.

End of the algorithm.
Remark 3.2. Step (d) needs to be replaced by (d') below when a global permutation of $Y$ does not produce an appropriate "local null distribution". For example, when the sample sizes of the different classes are extremely unequal, which is known as an unbalanced classification problem, it is very likely that a local region contains a pure class of $Y$ after the global permutation; the threshold $S'$ will then be spuriously low and irrelevant variables will be falsely selected into the model. The approach in (d') is more nonparametric but more time consuming.

(d') Local thresholding. Let $Y^*$ denote a random variable which is distributed as $Y$ but independent of $X_l$ given $X_{-l}$. Let $(Y^{(1)}_0, \cdots, Y^{(m)}_0)$ be a random permutation of the $Y$'s inside the tube; then $(Y^{(1)}_0, \cdots, Y^{(m)}_0)$ can be viewed as a realization of $Y^*$. If $X_l$ is a numerical variable, let $\hat\zeta'_l(X^*_k)$ be the estimated local contingency tube efficacy of $Y^*$ on $X_l$ at $X^*_k$, as defined in (3.5). If $X_l$ is a categorical variable, let $\hat\zeta'_l(X^*_k)$ be the estimated contingency tube efficacy of $Y^*$ on $X_l$ at $X^*_k$, as defined in (3.7).
Remark 3.3 (The threshold). Under the assumption $H^{(l)}_0$ that $Y$ is independent of $X_l$ given $X_{-l}$ in the tube $T_l$, $s_l$ and $s'_l$ are nearly independent and $\mathrm{SD}(s_l - s'_l) \approx \sqrt{2}\,\mathrm{SD}(s'_l)$. Thus the rule that deletes $X_l$ when $s^{\mathrm{stop}}_l - s'^{\,\mathrm{stop}}_l < \sqrt{2}\,\mathrm{SE}(s'^{\,\mathrm{stop}})$ is similar to the one-standard-deviation rule of CART and GUIDE. The choice of threshold is also discussed in Section 7. Alternatively, we can calculate a p-value for each covariate by a permutation test: we permute the response variable, obtain the importance scores for the permuted data, and then calculate how extreme $s^{\mathrm{stop}}_l$ is compared to the permuted scores. We then identify a covariate as important if its p-value is less than 0.1.
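The permutation p-value of Remark 3.3 can be sketched as follows (the add-one correction below is a common convention we adopt for the sketch, not necessarily the authors' exact formula):

```python
import numpy as np

def perm_pvalue(observed, perm_scores):
    """Permutation p-value: the proportion of permuted-response importance
    scores at least as extreme as the observed score (add-one corrected)."""
    perm_scores = np.asarray(perm_scores, float)
    return (1.0 + np.sum(perm_scores >= observed)) / (1.0 + len(perm_scores))

# observed score far above all permuted scores -> small p-value, variable kept
assert perm_pvalue(0.9, [0.10, 0.20, 0.15, 0.30]) == 0.2
# observed score in the middle of the permutation distribution -> large p-value
assert perm_pvalue(0.12, [0.10, 0.20, 0.15, 0.30]) == 0.8
```

In practice one would use hundreds of permutations, since with only a handful of permuted scores the smallest attainable p-value cannot fall below the 0.1 cutoff.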
Remark 3.4 (Large $d$, small $n$). A key idea in this paper is that when determining the importance of the variable $X_j$, we restrict (condition) the vector $X_{-j}$ to a relatively small set. This is done to address the problem of spurious correlation, which is of great concern in genomics (Price et al. [13], Lin and Zeng [11]). In such genomic studies the number of predictors $d$ typically greatly exceeds the sample size $n$. For numerical variables this can be dealt with by replacing $X_{-j}$ with a vector of principal components explaining most of the variation in $X_{-j}$. Although our paper does not deal with the $d \gg n$ case directly, this remark shows that the approach is also applicable to the genome-wide association studies described in Price et al. [13].
Remark 3.5. We now consider the computational complexity of the algorithm. Recall the notation: $I$ is the number of repetitions, $M$ is the number of tube centers, $m$ is the tube size, $d$ is the dimension, and $n$ is the sample size. The number of operations needed for the algorithm is $I \times d \times M \times (C_{2b} + C_{2c} + C_{2d}) + C_{2e}$, where $C_{2b}$, $C_{2c}$, $C_{2d}$, $C_{2e}$ denote the numbers of operations required for steps (2b), (2c), (2d), and (2e). The computational complexities of these steps are $O(nd)$, $O(m)$, $O(m)$, and $O(M)$, respectively. Since $nd \gg m$ and the required computation is dominated by step (2b), the computational complexity of the algorithm is $O(IMnd^2)$. Solving a least squares problem requires $O(nd^2)$ operations, so our algorithm has higher computational complexity by a factor of $IM$ compared to least squares. However, the computations for the $M$ tube centers are independent of one another and can be performed in parallel.
4 Simulation Studies

4.1 Variable Selection with Categorical and Numerical Predictors

Example 4.1. Consider $d \ge 5$ predictors $X_1, X_2, \ldots, X_d$ and a categorical response variable $Y$. The fifth predictor $X_5$ is a Bernoulli random variable which takes the two values 0 and 1 with equal probability. All other predictors are uniformly distributed on $[0, 1]$.

When $X_5 = 0$, the response variable $Y$ is related to the predictors in the following fashion:
$$Y = \begin{cases} 0 & \text{if } X_1 - X_2 - 0.25\sin(1.6\pi X_2) \le -0.5, \\ 1 & \text{if } -0.5 < X_1 - X_2 - 0.25\sin(1.6\pi X_2) \le 0, \\ 2 & \text{if } 0 < X_1 - X_2 - 0.25\sin(1.6\pi X_2) \le 0.5, \\ 3 & \text{if } X_1 - X_2 - 0.25\sin(1.6\pi X_2) > 0.5. \end{cases} \qquad (4.1)$$
When $X_5 = 1$, $Y$ depends on $X_3$ and $X_4$ in the same way as above, with $X_1$ replaced by $X_3$ and $X_2$ replaced by $X_4$. The other variables $X_6, \ldots, X_d$ are irrelevant noise variables. All $d$ predictors are independent.
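A sketch of a generator for model (4.1) (our own code; we read the constants as 0.25 and $1.6\pi$, since decimal points were lost in transcription):

```python
import numpy as np

def generate_example_41(n, d, rng):
    """Simulate (X, y) from model (4.1): X5 (column index 4) is Bernoulli(1/2)
    and switches whether (X1, X2) or (X3, X4) drives the class label."""
    X = rng.uniform(0.0, 1.0, size=(n, d))
    X[:, 4] = rng.integers(0, 2, size=n)            # the categorical switch X5
    u = np.where(X[:, 4] == 0, X[:, 0], X[:, 2])    # X1 when X5 = 0, else X3
    v = np.where(X[:, 4] == 0, X[:, 1], X[:, 3])    # X2 when X5 = 0, else X4
    g = u - v - 0.25 * np.sin(1.6 * np.pi * v)
    y = np.digitize(g, [-0.5, 0.0, 0.5], right=True)  # classes 0, 1, 2, 3
    return X, y

X, y = generate_example_41(400, 10, np.random.default_rng(0))
assert X.shape == (400, 10) and set(np.unique(y)) <= {0, 1, 2, 3}
```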
In addition to the presence of the noise variables $X_6, \ldots, X_d$, this example is challenging in two respects. First, there is an interaction effect between the continuous predictors $X_1, \ldots, X_4$ and the categorical predictor $X_5$. Such an interaction effect may arise in practice: for example, suppose $Y$ represents the effect resulting from multiple treatments $X_1, \ldots, X_4$ and $X_5$ represents the gender of a patient; our model allows males and females to respond differently to treatments. Second, the classification boundaries for fixed $X_5$ are highly nonlinear, as shown in Figure 1. This design demonstrates that our method can detect nonlinear dependence between the response and the predictors.
Table 2 shows, for 100 simulation trials from model (4.1), the number of simulations in which the predictors $X_1, \cdots, X_5$ are correctly kept in the model and, in column 6, the average number of simulations in which the $d - 5$ irrelevant predictors are falsely selected into the model. In the simulations, $n = 400$ and $d = 10, 50, 100$ are used. We compare CATCH with other methods, including Marginal CATCH, the Random Forest classifier (RFC) (Breiman [1]), and GUIDE (Loh [12]). For Marginal CATCH we test the association between $Y$ and $X_l$ without conditioning on $X_{-l}$. RFC is well known as a powerful tool for prediction; here we use the importance scores from RFC for variable selection. We create an additional 30 noise variables and use their average importance score multiplied by a factor of 2 as a threshold (Doksum et al. [3]) to decide whether a variable should be selected into the model. In particular, we generate the noise variables using uniform and Bernoulli distributions for numerical and categorical predictors, respectively.
The main goal of GUIDE is to solve the variable selection bias problem in regression and classification tree methods. As an illustration, if a categorical predictor takes $K$ distinct values, then the splitting test exhausts $2^{K-1}$ possibilities and therefore is
Table 2: Simulation results from Example 4.1 with sample size n = 400. For 1 ≤ i ≤ 5, the ith column gives the estimated probability of selection of the relevant variable Xi and the standard error based on 100 simulations. Column 6 gives the mean of the estimated probability of selection and the standard error for the irrelevant variables X6, ..., Xd.

Number of Selections (100 simulations)
METHOD            d    X1          X2          X3          X4          X5          Xi, i ≥ 6
CATCH             10   0.82(0.04)  0.75(0.04)  0.78(0.04)  0.82(0.04)  1(0)        0.05(0.02)
                  50   0.76(0.04)  0.76(0.04)  0.87(0.03)  0.84(0.04)  0.96(0.02)  0.08(0.03)
                  100  0.77(0.04)  0.73(0.04)  0.82(0.04)  0.8(0.04)   0.83(0.04)  0.09(0.03)
Marginal CATCH    10   0.77(0.04)  0.67(0.05)  0.7(0.05)   0.8(0.04)   0.06(0.02)  0.1(0.03)
                  50   0.69(0.05)  0.7(0.05)   0.76(0.04)  0.79(0.04)  0.07(0.03)  0.1(0.03)
                  100  0.78(0.04)  0.76(0.04)  0.73(0.04)  0.73(0.04)  0.12(0.03)  0.1(0.03)
Random Forest     10   0.71(0.05)  0.51(0.05)  0.47(0.05)  0.59(0.05)  0.37(0.05)  0.06(0.02)
                  50   0.64(0.05)  0.61(0.05)  0.65(0.05)  0.58(0.05)  0.28(0.04)  0.08(0.03)
                  100  0.66(0.05)  0.58(0.05)  0.59(0.05)  0.65(0.05)  0.17(0.04)  0.11(0.03)
GUIDE             10   0.96(0.02)  0.98(0.01)  0.93(0.03)  0.99(0.01)  0.92(0.03)  0.33(0.05)
                  50   0.82(0.04)  0.85(0.04)  0.9(0.03)   0.89(0.03)  0.67(0.05)  0.12(0.03)
                  100  0.83(0.04)  0.84(0.04)  0.86(0.03)  0.83(0.04)  0.61(0.05)  0.08(0.03)
Figure 1: The relationship between Y and X1, X2 for model (4.1). The four areas represent the four values of Y.
biased toward being selected. The way GUIDE alleviates this bias is to employ lack-of-fit tests, i.e., χ²-tests applicable to both numerical and categorical predictors, with the p-values adjusted by a bootstrap procedure. Since GUIDE has negligible selection bias, it can include tests for local pairwise interactions. Furthermore, it achieves sensitivity to local curvature by using a linear model at each node.
The results in Table 2 show that Marginal CATCH does a very poor job of selecting X5, which illustrates the importance of local conditioning. RFC does better than Marginal CATCH but also has difficulties, while CATCH and GUIDE successfully identify X5 along with X1 to X4 with high frequency. When d = 10, GUIDE does very well for the relevant variables but selects too many irrelevant variables. Surprisingly, GUIDE selects fewer irrelevant variables as d increases; it becomes more cautious as the dimension d grows. CATCH performs very well in the high-dimensional cases (d = 50 and d = 100). This indicates that the CATCH algorithm is robust to the intricate interaction structures between the predictors.
In model (4.1) above, the predictors interact in a hierarchical way. When X5 = 0, the response is related to X1 and X2 as shown in Figure 1, where the two axes are the two relevant predictors and the four areas partitioned by the curves correspond to different values of Y. When X5 = 1, the response depends on X3 and X4 in the same way. This type of hierarchical interaction between categorical and numerical predictors is difficult to detect even by state-of-the-art classification methods such as the support vector machine (SVM) and RFC. In particular, RFC produces an importance score vector that identifies the four numerical variables X1, ..., X4 as very significant but often leaves the categorical variable X5 unidentified, as we see in Table 2.
The CATCH algorithm, however, performs very well for this complicated model. We now explain why CATCH is able to identify X5 as an important variable. To explore the association between X5 and Y, we construct M contingency tables with 2 rows and 4 columns based on M randomly centered tubes, calculate the contingency efficacy scores, and then average the scores. Figure 2 (a) and (b) illustrate how one of the contingency efficacy scores is calculated. 1000 data points are generated from model (4.1) and split into two sets according to x5: (a) shows the approximately half of the points with x5 = 0 in the (x1, x2) dimensions, and (b) shows the remaining points with x5 = 1 in the (x3, x4) dimensions. The data points are colored according to Y; this representation lets us clearly see the classification boundaries (grey lines). A randomly selected tube center with x5 = 0 is shown as a star in (a). The data points within the tube, i.e., close to the tube center in terms of X₋₅, are highlighted by larger dots. Since the distance defining the tube does not involve X5, the larger dots have x5 equal to 0 or 1 and are present in both (a) and (b). The counts of the larger dots in different colors in (a) and (b) correspond to the cell counts of the two rows, x5 = 0 and x5 = 1, of the contingency table. As we can see, the larger dots in (a) are mostly red while the larger dots in (b) are mostly blue and orange, which shows that the distribution of Y depends on the value of X5, and the contingency efficacy score is high. We expect the contingency efficacy score to be low only if the tube center has similar values in (x1, x2) and (x3, x4); such tube centers would be very few among the M randomly chosen centers. Thus the average contingency efficacy score is high and X5 is identified as an important variable.
To contrast this example with a simpler data-generating model, suppose the effects of the predictors are additive, that is, the categorical variable enters the model as a term added to the effect of the numerical predictors. For this type of simpler model, the RFC and GUIDE algorithms can efficiently identify the relevant predictors, but so can the CATCH algorithm. We do not include the simulation results for additive models here.
Remark 4.1. We observe that the CATCH algorithm is not very sensitive to the choice of the number of tube centers M or the tube size m = [nᵃ], where 0 < a ≤ 1 and [·] is the greatest integer function. For Example 4.1 we perform the simulation using M = n and M = n/2, and a = 0.1, 0.15, 0.2, 0.3, 0.5. The results are summarized in Table 3. The CATCH algorithm performs similarly for M = n and M = n/2, with M = n having a small winning edge. We generally suggest using M = n if computational resources are of little concern or n is not large (≤ 200). When computational cost is a concern (Remark 3.5), one can reduce the number of tube centers to M = n/2 as a compromise between detection power and computational cost. Table 3 also shows that the tube size works well in the range between a = 0.1 and a = 0.3, while the detection power for X5 decreases for a = 0.5, where the tube is too big to achieve local conditioning; this observation is consistent with the comparison to Marginal CATCH in Table 2. On the other hand, the tube size should not be so small that detection power is lost. We suggest a = 0.1 or 0.15 for sample sizes n ≥ 400, or m = 50 for smaller sample sizes. In light of these guidelines, we use M = n/2 = 200 and a = 0.15 for Examples 4.1-4.3, and M = n = 200 and m = 50 for Example 4.4.
Table 3: Sensitivity study of CATCH for Example 4.1 (n = 400, d = 50) using different numbers of tube centers M and tube sizes m = [nᵃ]. For 1 ≤ i ≤ 5, the ith column gives the number of times Xi was selected. Column 6 gives the percentage of times irrelevant variables were selected.

Number of Selections (100 simulations)
M        a     X1  X2  X3  X4  X5  Xi, i ≥ 6
M = 200  0.10  75  67  77  69  92   7.8
         0.15  76  76  87  84  96   8.2
         0.20  85  84  90  87  93   9.9
         0.30  89  87  95  89  85   9.6
         0.50  89  90  96  93  62   9.9
M = 400  0.10  74  76  82  77  92   7.2
         0.15  85  82  89  87  94   8.6
         0.20  87  88  90  86  96   9.1
         0.30  88  89  91  91  92   9.4
         0.50  89  91  94  93  61  10.2
In Example 4.1 all predictors are independent. In a similar setting, we next consider the case where the predictors are dependent.

Example 4.2. We generate standard normal random variables Zᵢ, i = 1, ..., d, in such a way that cor(Z1, Z4) = 0.5, cor(Z3, Z6) = 0.9, and the other Zᵢ's are independent. We set Xᵢ = Φ(Zᵢ) for i = 1, ..., d, where Φ(·) is the CDF of a standard normal, so the Xᵢ are marginally uniformly distributed as in Example 4.1. The correlation structure of the Xᵢ's is similar to that of the Zᵢ's. The response variable Y is generated in the same way as in Example 4.1.

The results of Example 4.2 are given in Table 4. Comparing with the results in Table 2, both CATCH and GUIDE do well in the presence of dependence between the numerical variables. The explanation is two-fold. First, both methods identify the categorical X5 as an important predictor most of the time, which makes the identification of the other relevant predictors possible. Second, conditioning on X5, Y depends on a pair of variables ("X1 and X2" or "X3 and X4") jointly, and in particular this dependence is nonlinear, which makes both methods less affected by the correlation introduced here.
Figure 2: Sample scatter plots of (x1, x2) and (x3, x4) for simulated data from model (4.1). The left panel, (a) X5 = 0, includes all the pairs (x1, x2) with x5 = 0, while the right panel, (b) X5 = 1, includes all the pairs (x3, x4) with x5 = 1. The larger dots in the two plots are within the tube around a tube center, shown as the star.
We now explain the latter point in detail for both methods. In CATCH, the algorithm adaptively weighs the effects of the predictors by their importance scores when defining the tubes in which they are conditioned on; therefore identifying either variable of a relevant pair helps to identify the other, whereas an irrelevant variable, though strongly associated with one of the relevant variables, is down-weighted iteratively and does not strongly affect the selection of the relevant variable. This explains the slightly decreasing selection probability of X3 in the presence of the strongly correlated X6, and consequently the overall lower selection of X3 and X4 compared to X1 and X2, given the correlation between X1 and X4. In GUIDE, the difference between the dependent and independent cases is even smaller, because GUIDE is very powerful in detecting local interactions by literally including pairwise interactions when selecting splitting variables.
Marginal CATCH and Random Forest do poorly in this case, mainly because X5 cannot be detected, as we explained earlier. More specifically, the dependence between Y and X4 in the X5 = 1 case is the same as that between Y and X2 in the X5 = 0 case. The linear terms of X1 and X2 have opposite signs in the decision function; therefore the marginal effects of X1 and X4 are obscured when X5 is not identified as an important variable.
4.2 Improving the Performance of SVM and RF Classifiers

The criterion used in the CATCH algorithm is not based on classification accuracy. But if we use CATCH to screen out the irrelevant variables and only use the important
Table 4: Simulation results from Example 4.2 with sample size n = 400. For 1 ≤ i ≤ 6, the ith column gives the estimated probability of selection of variable Xi and the standard error based on 100 simulations, where Xi, i = 1, ..., 5, are relevant variables with cor(X1, X4) = 0.5. X6 is an irrelevant variable that is correlated with X3, with correlation 0.9. Column 7 gives the mean of the estimated probability of selection and that of the standard error for the irrelevant variables X7, ..., Xd.

Number of Selections (100 simulations)
METHOD            d    X1          X2          X3          X4          X5          X6          Xi, i ≥ 7
CATCH             10   0.76(0.04)  0.77(0.04)  0.73(0.04)  0.68(0.05)  0.99(0.01)  0.34(0.05)  0.07(0.02)
                  50   0.78(0.04)  0.77(0.04)  0.74(0.04)  0.63(0.05)  0.92(0.03)  0.57(0.05)  0.09(0.03)
                  100  0.76(0.04)  0.77(0.04)  0.8(0.04)   0.74(0.04)  0.82(0.04)  0.55(0.05)  0.09(0.03)
Marginal CATCH    10   0.35(0.05)  0.63(0.05)  0.8(0.04)   0.32(0.05)  0.09(0.03)  0.71(0.05)  0.12(0.03)
                  50   0.34(0.05)  0.71(0.05)  0.7(0.05)   0.3(0.05)   0.1(0.03)   0.6(0.05)   0.11(0.03)
                  100  0.34(0.05)  0.68(0.05)  0.75(0.04)  0.3(0.05)   0.1(0.03)   0.59(0.05)  0.1(0.03)
Random Forest     10   0.29(0.05)  0.48(0.05)  0.55(0.05)  0.2(0.04)   0.3(0.05)   0.47(0.05)  0.05(0.02)
                  50   0.26(0.04)  0.52(0.05)  0.57(0.05)  0.3(0.05)   0.24(0.04)  0.45(0.05)  0.08(0.03)
                  100  0.36(0.05)  0.61(0.05)  0.66(0.05)  0.37(0.05)  0.32(0.05)  0.46(0.05)  0.12(0.03)
GUIDE             10   0.96(0.02)  0.96(0.02)  0.94(0.02)  0.95(0.02)  0.96(0.02)  0.84(0.04)  0.3(0.05)
                  50   0.8(0.04)   0.92(0.03)  0.88(0.03)  0.79(0.04)  0.72(0.04)  0.68(0.05)  0.12(0.03)
                  100  0.81(0.04)  0.89(0.03)  0.9(0.03)   0.76(0.04)  0.75(0.04)  0.58(0.05)  0.08(0.03)
variables for classification, the performance of other algorithms such as the Support Vector Machine (SVM) and Random Forest Classifier (RFC) can be significantly improved. In this section we compare the prediction performance of (1) SVM, (2) RFC, (3) CATCH followed by SVM, and (4) CATCH followed by RFC; we label (3) and (4) CATCH+SVM and CATCH+RFC, respectively. We use the "svm" function in the "e1071" package in R with the radial kernel to perform the SVM analysis, and the "randomForest" function in the "randomForest" package to perform the RFC analysis. We use simulation experiments to illustrate the improvement obtained by using CATCH prior to classifying Y.
Let the true model be
(X, Y) ∼ P,  (4.2)

where P will be specified later. In each of 100 Monte Carlo experiments, n pairs (X(i), Y(i)), i = 1, ..., n, are generated from the true model (4.2). Based on the simulated data (X(i), Y(i)), i = 1, ..., n, the methods (1) SVM, (2) RFC, (3) CATCH+SVM, and (4) CATCH+RFC are used to classify a future Y(0) for which a corresponding X(0) is available.
The integrated misclassification rate (IMR; Friedman [6]), defined below, is used to evaluate the performance of the classification algorithms. Let F_{X,Y}(·) be the classifier constructed by a classification algorithm based on X = (X(1), ..., X(n)) and Y = (Y(1), ..., Y(n))^T. Let (X(0), Y(0)) be a future observation with the same distribution as (X(i), Y(i)). We observe only X(0) = x(0) and want to classify Y(0). For a possible future observation (x(0), y(0)), define

MR(X, Y, x(0), y(0)) = P(F_{X,Y}(x(0)) ≠ y(0) | X, Y).  (4.3)

Our criterion is

IMR = E E0[MR(X, Y, x(0), y(0))],  (4.4)

where E0 is with respect to the distribution of the random future observation (x(0), y(0)) and E is with respect to the distribution of (X, Y). This double expectation is evaluated using Monte Carlo simulation, and the result is denoted IMR*. More precisely, E0 is estimated using 5000 simulated (x(0), y(0)) observations; IMR* is then computed on the basis of 100 simulated samples (Xi, Yi), i = 1, ..., 100.
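This nested Monte Carlo scheme can be sketched as follows. The sketch is ours, not the authors' code: it uses a toy one-dimensional model and a trivial threshold classifier in place of SVM/RFC, and smaller replication counts (`n_train_sets`, `n_eval`) than the 100 and 5000 used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw(n):
    # Toy stand-in for P in (4.2): X ~ U(0, 1), Y = 1{X > 0.5} with 10% label noise.
    x = rng.uniform(size=n)
    y = ((x > 0.5) ^ (rng.uniform(size=n) < 0.1)).astype(int)
    return x, y

def fit(x, y):
    # Trivial plug-in classifier standing in for SVM or RFC.
    return lambda x0: (x0 > 0.5).astype(int)

n, n_train_sets, n_eval = 200, 20, 1000
mr = []
for _ in range(n_train_sets):      # outer expectation E over training sets (X, Y)
    clf = fit(*draw(n))
    x0, y0 = draw(n_eval)          # inner expectation E0 over future (x0, y0)
    mr.append(np.mean(clf(x0) != y0))
imr_star = float(np.mean(mr))      # Monte Carlo estimate of the IMR criterion (4.4)
print(round(imr_star, 3))
```

With this toy model the classifier matches the Bayes rule, so IMR* concentrates near the 10% label-noise level.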
Example 4.3. First consider model (4.1) in Section 4.1. Figure 3 shows IMR* versus the number of irrelevant variables for this example. The left panel shows that the IMR* of the SVM approach increases dramatically as the number of irrelevant variables increases, while the IMR* of CATCH+SVM is stable. Thus CATCH stabilizes the performance of SVM for this complex model. The right panel shows that CATCH also stabilizes the performance of RFC. These results indicate that CATCH can improve the prediction accuracy of other classification algorithms.
Figure 3: IMR* of SVM, RFC, CATCH+SVM, and CATCH+RFC versus the number of predictors d for the simulation study of model (4.1) in Section 4.1. The sample size is n = 400.
Example 4.4 (Contaminated Logistic Model). Consider a binary response Y ∈ {0, 1} with

P(Y = 0 | X = x) = F(x) if r = 0,  G(x) if r = 1,  (4.5)

where x = (x1, ..., xd),

r ∼ Bernoulli(γ),  (4.6)

F(x) = (1 + exp{−(0.25 + 0.5 x1 + x2)})^(−1),  (4.7)

G(x) = 0.1 if 0.25 + 0.5 x1 + x2 < 0,  0.9 if 0.25 + 0.5 x1 + x2 ≥ 0,  (4.8)

and x1, x2, ..., xd are realizations of independent standard normal random variables. Hence, given X(i) = x(i), 1 ≤ i ≤ n, the joint distribution of Y(1), Y(2), ..., Y(n) is the contaminated logistic model with contamination parameter γ:

(1 − γ) ∏_{i=1}^n [F(x(i))]^{Y(i)} [1 − F(x(i))]^{1−Y(i)} + γ ∏_{i=1}^n [G(x(i))]^{Y(i)} [1 − G(x(i))]^{1−Y(i)}.  (4.9)
We perform a simulation study with n = 200, d = 50, and γ = 0, 0.2, 0.4, 0.6, 0.8, and 1, where γ = 0 corresponds to no contamination. Figure 4 shows IMR* versus the different values of γ for this example. CATCH improves the performance of the SVM and Random Forest classifiers for this contaminated logistic regression model.
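For concreteness, data from the contaminated logistic model (4.5)-(4.8) can be generated as in the sketch below (our code; the function and variable names are not from the paper):

```python
import numpy as np

def simulate_contaminated_logistic(n, d, gamma, rng):
    # Draw (X, Y): X has i.i.d. N(0, 1) columns; with probability gamma the
    # conditional law of Y switches from the logistic F(x) to the contaminant G(x).
    x = rng.standard_normal((n, d))
    eta = 0.25 + 0.5 * x[:, 0] + x[:, 1]
    f = 1.0 / (1.0 + np.exp(-eta))           # F(x) of (4.7)
    g = np.where(eta < 0, 0.1, 0.9)          # G(x) of (4.8)
    r = rng.uniform(size=n) < gamma          # r ~ Bernoulli(gamma), as in (4.6)
    p0 = np.where(r, g, f)                   # P(Y = 0 | x), as in (4.5)
    y = (rng.uniform(size=n) >= p0).astype(int)
    return x, y

rng = np.random.default_rng(1)
x, y = simulate_contaminated_logistic(200, 50, 0.2, rng)
print(x.shape, set(y.tolist()))
```

Only the first two columns of X carry signal; the remaining d − 2 columns are irrelevant, matching the design of the simulation study above.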
Figure 4: IMR* of SVM, RFC, CATCH+SVM, and CATCH+RFC versus the contamination parameter γ for simulation model (4.9) in Example 4.4.
5 CATCH Analysis of The ANN Data
In this section CATCH is applied to a data set available at the UCI database (http://ftp.ics.uci.edu/pub/machine-learning-databases/thyroid-disease). The data are from a clinical trial to determine whether a patient is hypothyroid (see, e.g., Quinlan [15]). The dataset is split into two parts: a training set with 3772 observations and a test set with 3428 observations. The response is a categorical variable with three classes: normal (not hypothyroid), hyperfunction, and subnormal functioning. There are 21 predictors, including 15 categorical predictors and 6 numerical predictors. The three classes are very imbalanced, with 92.5% normal patients; a good classifier should produce a significant decrease from the 7.5% error rate that can be obtained by classifying all observations to the normal class.
CATCH identifies 7 variables as important, including 4 continuous variables and 3 categorical variables. We consider three classification methods: 1. the Support Vector Machine with radial kernel, denoted SVM(r); 2. the Support Vector Machine with polynomial kernel, denoted SVM(p); 3. RFC. We apply these three methods to the training set to build classification models and use the test set to estimate the misclassification rate of each method. We apply the methods with CATCH and without CATCH, where "with CATCH" means that only the 7 important variables identified by CATCH are used to build the model and "without CATCH" means that all 21 variables are used in the analysis.
Table 5: Misclassification rates of three classification methods (SVM(r), SVM(p), RFC) with CATCH and without CATCH.

              SVM(r)   SVM(p)   RFC
w/o CATCH     0.052    0.063    0.024
with CATCH    0.037    0.055    0.015
Table 5 shows that if CATCH is used to select important variables before the classification methods are applied, the misclassification rates decrease: CATCH reduces them by 29%, 13%, and 37% for SVM(r), SVM(p), and RFC, respectively. CATCH+RFC has the best performance. Although CATCH is not designed for improving classification accuracy, it nonetheless helps other classification methods perform better.
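The quoted reductions are simple relative changes computed from Table 5; as a quick arithmetic check (our snippet):

```python
# Relative reduction 1 - (rate with CATCH) / (rate without CATCH), from Table 5.
without = {"SVM(r)": 0.052, "SVM(p)": 0.063, "RFC": 0.024}
with_catch = {"SVM(r)": 0.037, "SVM(p)": 0.055, "RFC": 0.015}

reduction = {m: 100 * (1 - with_catch[m] / without[m]) for m in without}
print({m: round(v, 1) for m, v in reduction.items()})
```

The values 28.8%, 12.7%, and 37.5% agree with the 29%, 13%, and 37% quoted above (the last truncated from 37.5%).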
6 Properties of Local Contingency Efficacy
In this section we investigate the properties of the local contingency efficacy. For a univariate continuous predictor, Theorem 6.1 below shows that the local contingency efficacy is well defined and equals 0 if the response variable is independent of the predictor. Consider an R × C contingency table. Let n_rc, r = 1, ..., R, c = 1, ..., C, be the observed frequency in the (r, c)-cell and let p_rc be the probability that an observation belongs to the (r, c)-cell. Let

p_r = ∑_{c=1}^C p_rc,  q_c = ∑_{r=1}^R p_rc,  (6.1)
and

n_{r·} = ∑_{c=1}^C n_rc,  n_{·c} = ∑_{r=1}^R n_rc,  n = ∑_{r=1}^R ∑_{c=1}^C n_rc,  E_rc = n_{r·} n_{·c}/n.  (6.2)

For testing that the rows and the columns are independent, i.e.,

H0: p_rc = p_r q_c for all r, c  versus  H1: p_rc ≠ p_r q_c for some r, c,  (6.3)

a standard test statistic is

X² = ∑_{r=1}^R ∑_{c=1}^C (n_rc − E_rc)²/E_rc.  (6.4)
Under H0, X² converges in distribution to χ²_{(R−1)(C−1)}. Moreover, defining 0/0 = 0, we have the well-known result:

Lemma 6.1. As n tends to infinity, X²/n tends in probability to

τ² ≡ ∑_{r=1}^R ∑_{c=1}^C (p_rc − p_r q_c)²/(p_r q_c).

Moreover, if p_r q_c is bounded away from 0 and 1 as n → ∞, and if R and C are fixed as n → ∞, then τ² = 0 if and only if H0 is true.
Remark 6.1. Lemma 6.1 shows that the contingency efficacy ζ in (2.8) is well defined. Moreover, ζ = 0 if and only if X and Y are independent.
Theorem 6.1. Consider a categorical response Y and a continuous predictor X with density function f(x) such that, for a fixed h > 0, f(x) > 0 for x ∈ (x(0) − h, x(0) + h). Then:

(1) For fixed h, the estimated local section efficacy ζ̂(x(0), h) defined in (2.5) converges almost surely to the section efficacy ζ(x(0), h) defined in (2.4).

(2) If X and Y are independent, then the local efficacy ζ(x(0)) defined in (2.4) is zero.

(3) If there exists c ∈ {1, ..., C} such that p_c(x) = P(Y = c | x) has a continuous derivative at x = x(0) with p′_c(x(0)) ≠ 0, then ζ(x(0)) > 0.
The χ² statistic (6.4) can be written in terms of multinomial proportions p̂_rc, with √n(p̂_rc − p_rc), 1 ≤ r ≤ R, 1 ≤ c ≤ C, converging in distribution to a multivariate normal. By a Taylor expansion of T_n ≡ √(X²/n) at τ = plim √(X²/n), we find:

Theorem 6.2.

√n(T_n − τ) →_d N(0, v²),  (6.5)

where the asymptotic variance v², which can be obtained from the Taylor expansion, is not needed in this paper.

Remark 6.2. The result (6.5) applies to the multivariate χ²-based efficacies ζ̂, with n replaced by the number of observations that enter the computation of ζ̂. In this case we use the notation χ²_j for the chi-square statistic for the jth variable.
7 Consistency Of CATCH
Let Y denote a variable to be classified into one of the categories 1, ..., C by using the Xj's in the vector X = (X1, ..., Xd). Let X−j ≡ X − {Xj} denote X without the Xj component. Assume that the squared Pearson nonparametric correlation (Doksum and Samarov [4], Huang and Chen [9]) between Xj and X−j is less than 0.5. We call the variable Xj relevant for Y if, conditionally given X−j, L(Y | Xj = x) is not constant in x, and irrelevant otherwise. Our data (X(1), Y(1)), ..., (X(n), Y(n)) are i.i.d. as (X, Y). A variable selection procedure is consistent if the probability that all the relevant variables and none of the irrelevant variables are selected tends to one as n tends to infinity.

We give a general nonparametric framework in which the variable selection rule based on the CATCH importance scores sj given in (3.8) is consistent.

When the predictors are not all categorical and CATCH involves tubes, we assume that the number of observations in each tube tends to ∞ as n → ∞ and, for those importance scores that require the selection of a section size h, that the selection is from a finite set {hj} with min_j{hj} > h0 for some h0 > 0. It follows that the number of observations mjk in the kth tube used to calculate the importance score sj for the variable Xj tends to infinity as n → ∞.
The key quantities that determine consistency are the limits λj of sj. That is,

λj = τ(ζj, P) = E_P[ζj(X)] = ∫ ζj(x) dP(x),  j = 1, ..., d,

where ζj(x) is the local contingency efficacy defined by (2.4) for numerical X's and by (2.8) for categorical X's, and P is the probability distribution of X. Our estimate sj of λj is

sj = τ(ζ̂j, P̂) = (1/M) ∑_{k=1}^M ζ̂j(X*_k),

where ζ̂j is defined in Section 3 and P̂ is the empirical distribution of X.

Let d0 denote the number of X's that are irrelevant for Y. Set d1 = d − d0 and, without loss of generality, reorder the λj so that
λj > 0 if j = 1, ..., d1;  λj = 0 if j = d1 + 1, ..., d.  (7.1)

Our variable selection rule is:

"keep variable Xj if sj ≥ t" for some threshold t.  (7.2)

Rule (7.2) is consistent if and only if

P(min_{j≤d1} sj ≥ t and max_{j>d1} sj < t) → 1.  (7.3)

To establish (7.3), it is enough to show that

P(max_{j>d1} sj > t) → 0  (7.4)

and

P(min_{j≤d1} sj ≤ t) → 0.  (7.5)

We call (7.4) and (7.5) type I and type II consistency, respectively; type I and II error probabilities refer to the probabilities of any false positives and any false negatives, respectively.
7.1 Type I consistency
We consider (7.4) first and note that

P(max_{j>d1} sj > t) ≤ ∑_{j=d1+1}^d P(sj > t).  (7.6)

For the jth irrelevant variable Xj, j > d1, the results in Section 6 apply to the local efficacy ζ̂j(x) at a fixed point x. The importance score sj defined in (3.8) is the average of such efficacies over a sample of M random points x*_a. We assume that there is a point x(0) such that, for irrelevant Xj's and for n large enough, s^(0)_j ≡ ζ̂j(x(0)) is stochastically larger in the right tail than the average M^(−1) ∑_{a=1}^M ζ̂j(x*_a). That is, we assume that there exist t > 0 and n0 such that for n ≥ n0,

P(sj > t) ≤ P(s^(0)_j > t),  j > d1.  (7.7)

The existence of such a point x(0) is established by considering the probability limit of

arg max{ζ̂j(x*_a) : x*_a ∈ {x*_1, ..., x*_M}}.
Note that by Lemma 6.1, s^(0)_j →_p 0 as n → ∞. It follows that we have established (7.4) for t and d0 = d − d1 that are fixed as n → ∞. To examine the interesting case where d0 → ∞ as n → ∞, we need large deviation results, to which we turn next. In the d0 → ∞ case we consider t ≡ tn tending to ∞ as n → ∞, and we use the weaker assumption that there exist x(0) and n0 such that

P(sj > tn) ≤ P(s^(0)_j > tn)  (7.8)

for n ≥ n0 and each tn with lim_{n→∞} tn = ∞. We also assume that the numbers of columns C and rows R in the contingency table defining the efficacy χ²_jn given by (6.4) stay fixed as n → ∞. In this case P(s^(0)_j > tn) → 0, provided each term

[T^(j)_rc]² ≡ [(n_rc − E_rc)/(√n √E_rc)]²

in s²_j = χ²_jn, defined for Xj by (6.4), satisfies

P(|T^(j)_rc| > tn) → 0.
This is because

P(√(∑_r ∑_c [T^(j)_rc]²) > tn) ≤ P(RC max_{r,c} [T^(j)_rc]² > tn²) ≤ RC max_{r,c} P(|T^(j)_rc| > tn/√(RC)),

and tn and tn/√(RC) are of the same order as n → ∞. By large deviation theory (Hall [8], Jing et al. [10]), for some constant A,
P(|T^(j)_rc| > tn) = (2 tn^(−1)/√(2π)) e^(−tn²/2) [1 + A tn³ n^(−1/2) + o(tn³ n^(−1/2))],  (7.9)
provided tn = o(n^(1/6)). If (7.9) is uniform in j > d1, then (7.6) and (7.9) imply that type I consistency holds if

d0 (tn/√2)^(−1) e^(−tn²/2) → 0.  (7.10)

To examine (7.10), write d0 = exp(n^b) and tn = √2 n^r. By large deviation theory, (7.10) holds when r < 1/6. Thus if 0 < b/2 < r < 1/6, then

d0 (tn/√2)^(−1) e^(−tn²/2) = n^(−r) exp(n^b − n^(2r)) → 0.
Thus d0 = exp(n^b) with b = 1/3 − ε, for any ε > 0, is the largest d0 can be to have consistency. For instance, if there are exactly d0 = exp(n^(1/4)) irrelevant variables and we use the threshold tn = √2 n^(1/7), the probability of selecting any of the irrelevant variables tends to zero at the rate n^(−1/7) exp(n^(1/4) − n^(2/7)).
7.2 Type II consistency
We now assume that for each relevant variable Xj there is a least favorable point x(1) such that the local efficacy s^(1)_j ≡ ζ̂(x(1)) is stochastically smaller than the importance score sj = M^(−1) ∑_{a=1}^M ζ̂(x*_a). That is, we assume that for each relevant variable Xj there exist n1 and x(1) such that for n ≥ n1,

P(s^(1)_j ≤ tn) ≥ P(sj ≤ tn),  j ≤ d1.  (7.11)

The existence of such a least favorable x(1) is established by considering the probability limit of

arg min{ζ̂j(x*_a) : x*_a ∈ {x*_1, ..., x*_M}}.

We again assume that R and C are fixed, so that to show type II consistency it is enough to show that, for each (r, c) and j ≤ d1,

P(|T^(j)_rc| ≤ tn) → 0.
This is because

P(√(∑_r ∑_c [T^(j)_rc]²) ≤ tn) ≤ P(min_{r,c} [T^(j)_rc]² ≤ tn²) ≤ RC max_{r,c} P(|T^(j)_rc| ≤ tn).

By Theorem 6.2, √n(s^(1)_j − τj) →_d N(0, ν²), τj > 0. It follows that if tn and d1 are bounded above as n → ∞ and τj > 0, then P(s^(1)_j ≤ tn) → 0. Thus type II consistency holds in this case. To examine the more interesting case, where tn and d1 are not bounded above and τj may tend to zero as n → ∞, we need to center the T^(j)_rc. Thus set
θ^(j)_n = √n (p^(j)_rc − p^(j)_r p^(j)_c)/√(p^(j)_r p^(j)_c),  (7.12)

T^(j)_n = T^(j)_rc − θ^(j)_n,

where for simplicity the dependence of T^(j)_n and θ^(j)_n on r and c is not indicated.
Lemma 7.1.

P(min_{j≤d1} |T^(j)_rc| ≤ tn) ≤ ∑_{j=1}^{d1} P(|T^(j)_n| ≥ min_{j≤d1} |θ^(j)_n| − tn).

Proof.

P(min_{j≤d1} |T^(j)_rc| ≤ tn) = P(min_{j≤d1} |T^(j)_n + θ^(j)_n| ≤ tn)
≤ P(max_{j≤d1} |T^(j)_n| ≥ min_{j≤d1} |θ^(j)_n| − tn)
≤ ∑_{j=1}^{d1} P(|T^(j)_n| ≥ min_{j≤d1} |θ^(j)_n| − tn).
Set

a_n = min_{j≤d1} |θ^(j)_n| − tn;

then by large deviation theory, if a_n = o(n^(1/6)),

P(|T^(j)_n| ≥ a_n) ∝ (a_n/√2)^(−1) exp{−a_n²/2},  (7.13)

where ∝ signifies asymptotic order as n → ∞, a_n → ∞. It follows that if (7.13) holds uniformly in j, then by Lemma 7.1,

P(min_{j≤d1} |T^(j)_rc| ≤ tn) ∝ d1 (a_n/√2)^(−1) exp{−a_n²/2}.

Thus type II consistency holds when a_n = o(n^(1/6)) and

d1 (a_n/√2)^(−1) exp{−a_n²/2} → 0.  (7.14)

In Section 7.1, tn is of order n^r, 0 < r < 1/6. For this tn, type II consistency holds if a_n is of order n^r, 0 < r < 1/6, and

d1 n^(−r) exp{−n^(2r)/2} → 0.
If a_n = O(n^k) with k ≥ 1/6, then P(|T^(j)_n| ≥ a_n) tends to zero at a rate faster than in (7.13) and consistency is ensured. For instance, if all relevant variables Xj have p^(j)_rc fixed as n → ∞, then min_{j≤d1} |θ^(j)_n| is of order √n and a_n is of order n^(1/2) − tn.
7.3 Type I and II consistency
We see from Sections 7.1 and 7.2 that consistency holds under a variety of conditions. One set of conditions is:

Theorem 7.1. Under the regularity conditions of Sections 7.1 and 7.2, if

d0 = c0 exp(n^b),  tn = c1 n^r,  min_{j≤d1} θ^(j) − tn = c2 n^r,  d1 = c3 exp(n^b)

for positive constants c0, c1, c2, and c3, and 0 < b/2 < r < 1/6, then CATCH is consistent.
Appendix: Proof of Theorem 6.1

(1) Let N_h = ∑_{i=1}^n I(X(i) ∈ N_h(x(0))). Then N_h ∼ Bi(n, ∫_{x(0)−h}^{x(0)+h} f(x)dx) and N_h → ∞ almost surely as n → ∞. Write

ζ̂(x(0), h) = n^(−1/2) √(X²(x(0), h)) = (N_h/n)^(1/2) (N_h)^(−1/2) √(X²(x(0), h)).

Since N_h ∼ Bi(n, ∫_{x(0)−h}^{x(0)+h} f(x)dx), we have

N_h/n → ∫_{x(0)−h}^{x(0)+h} f(x)dx.  (7.15)

By Lemma 6.1 and Table 1,

(N_h)^(−1/2) √(X²(x(0), h)) → ( ∑_{r=+,−} ∑_{c=1}^C (p_rc − p_r q_c)²/(p_r q_c) )^(1/2),  (7.16)
where

p_{+c} = Pr(X > x(0), Y = c | X ∈ N_h(x(0))),  p_{−c} = Pr(X ≤ x(0), Y = c | X ∈ N_h(x(0)))

for r = +, − and c = 1, ..., C, and

p_r = ∑_{c=1}^C p_rc,  q_c = ∑_{r=+,−} p_rc.

In fact, p_+ = Pr(X > x(0) | X ∈ N_h(x(0))), p_− = Pr(X ≤ x(0) | X ∈ N_h(x(0))), and q_c = Pr(Y = c | X ∈ N_h(x(0))).

By (7.15) and (7.16),

ζ̂(x(0), h) → ( ∫_{x(0)−h}^{x(0)+h} f(x)dx )^(1/2) ( ∑_{r=+,−} ∑_{c=1}^C (p_rc − p_r q_c)²/(p_r q_c) )^(1/2) ≡ ζ(x(0), h).

(2) Since X and Y are independent, p_rc = p_r q_c for r = +, − and c = 1, ..., C, so ζ(x(0), h) = 0 for any h. Hence

ζ(x(0)) = sup_{h>0} ζ(x(0), h) = 0.

(3) Recall that p_c(x(0)) = Pr(Y = c | x(0)). Without loss of generality, assume p′_c(x(0)) > 0. Then there exists h such that

Pr(Y = c | x(0) < X < x(0) + h) > Pr(Y = c | x(0) − h < X < x(0)),

which is equivalent to p_{+c}/p_+ > p_{−c}/p_−. This shows that ζ(x(0)) > 0, since by Lemma 6.1, ζ(x(0)) = 0 would imply p_{+c}/p_+ = p_{−c}/p_− = p_c.
References
[1] Breiman, L., 2001. Random forests. Machine Learning 45, 5–32.

[2] Breiman, L., Friedman, J. H., Olshen, R. A., Stone, C. J., 1984. Classification and Regression Trees. Wadsworth, Belmont.

[3] Doksum, K., Tang, S., Tsui, K.-W., 2008. Nonparametric variable selection: The EARTH algorithm. Journal of the American Statistical Association 103 (484), 1609–1620.

[4] Doksum, K. A., Samarov, A., 1995. Nonparametric estimation of global functionals and a measure of explanatory power of covariates in regression. Annals of Statistics 23, 1443–1473.

[5] Doksum, K. A., Schafer, C., 2006. Powerful choices: Tuning parameter selection based on power. In: Frontiers in Statistics: Dedicated to Peter Bickel (J. Fan and H. L. Koul, eds.). Imperial College Press, 113–141.

[6] Friedman, J., 1991. Multivariate adaptive regression splines. The Annals of Statistics, 1–67.

[7] Gao, J., Gijbels, I., 2008. Bandwidth selection in nonparametric kernel testing. Journal of the American Statistical Association 103 (484), 1584–1594.

[8] Hall, P., 1992. The Bootstrap and Edgeworth Expansion. Springer, New York.

[9] Huang, L.-S., Chen, J., 2008. Analysis of variance, coefficient of determination and F-test for local polynomial regression. Annals of Statistics 36, 2085–2109.

[10] Jing, B.-Y., Shao, Q.-M., Wang, Q., 2003. Self-normalized Cramér-type large deviations for independent random variables. Annals of Probability 31, 2167–2215.

[11] Lin, D. Y., Zeng, D., 2011. Correcting for population stratification in genomewide association studies. Journal of the American Statistical Association 106, 997–1008.

[12] Loh, W.-Y., 2009. Improving the precision of classification trees. The Annals of Applied Statistics 3, 1710–1737.

[13] Price, A., Patterson, N., Plenge, R., Weinblatt, M., Shadick, N., Reich, D., 2006. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38 (8), 904–909.

[14] Qiao, Z., Zhou, L., Huang, J., 2008. Linear discriminant analysis for high dimensional, low sample size data. Special issue of the World Congress on Engineering 2008.

[15] Quinlan, J. R., 1987. Simplifying decision trees. International Journal of Man-Machine Studies 27 (3), 221–234.

[16] Schafer, C., Doksum, K. A., 2009. Selecting local models in multiple regression by maximizing power. Metrika 69, 283–304.

[17] Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267–288.

[18] Wang, L., Shen, X., 2007. On L1-norm multiclass support vector machines: Methodology and theory. Journal of the American Statistical Association 102, 583–594.

[19] Zhang, H. H., Liu, Y., Wu, Y., Zhu, J., 2008. Variable selection for the multicategory SVM via adaptive sup-norm regularization. Electronic Journal of Statistics 2, 149–167.
with more levels has a larger importance score than one with fewer levels. Therefore the unadjusted scores sj cannot accurately quantify the relative importance of the predictors in the tube distance calculation. Next we use (3.8) to produce sj = s^(3)_j, use s^(3)_j in the next iteration, and so on.
In the definition above, we need to define |xj − x(0)_j| appropriately for a categorical variable Xj. Because Xj is categorical, we need to assume that |xj − x(0)_j| is constant when xj and x(0)_j take any pair of different categories. One approach is to define |xj − x(0)_j| = ∞ · I(xj ≠ x(0)_j), in which case all points in the tube are given the same Xj value as the tube center. This approach is a very sensible one, as it strictly implements the idea of conditioning on Xj when evaluating the effect of another predictor Xl, so that Xj does not contribute to any variation in Y. The problem with this definition, though, is that when the number of categories of Xj is relatively large compared to the sample size, there will not be enough points in the tube if we restrict Xj to a single category. In such a case we do not restrict Xj, which means that Xj can take any category in the tube. We define |xj − x(0)_j| = k0 I(xj ≠ x(0)_j), where k0 is a normalizing constant whose value is chosen to achieve a balance between numerical variables and categorical variables. Suppose that X1 is a numerical variable with standard deviation one and X2 is a categorical variable. For X2 we set k0 ≡ E(|X1^(1) − X1^(2)|), where X1^(1) and X1^(2) are two independent realizations of X1. Note that k0 = 2/√π if X1 follows a standard normal distribution. Also, k0 can be based on the empirical distributions of the numerical variables. Based on the preceding discussion of the choice of k0, we use k0 = ∞ in the simulations and k0 = 2/√π in the real data example.
Remark 3.1 (choice of k0). The finite value k0 = 2/√π is used unless the sample size is large enough in the sense explained below. Suppose k0 = ∞; then D_l(x_{−l}, x(0)_{−l}) is finite only for those observations with xj = x(0)_j. When there are multiple categorical variables, D_l is finite only for those points, denoted by the set Ω(0), that satisfy xj = x(0)_j for all categorical xj. When there are many categorical variables, or when the total sample size is small, the size of this set can be very small or close to zero, in which case the tube cannot be well defined; therefore the finite k0 is used, as in the real data example. On the other hand, when the size of Ω(0) is larger than the specified tube size, we can use k0 = ∞, as in the simulation examples.
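The constant k0 = E|X1^(1) − X1^(2)| can also be estimated empirically, exactly as the remark suggests; a minimal sketch (ours), which for a standard normal variable recovers the closed-form value 2/√π ≈ 1.128:

```python
import math
import random

random.seed(0)

def empirical_k0(draw, n_pairs=200_000):
    # Monte Carlo estimate of k0 = E|X' - X''| for two independent draws.
    return sum(abs(draw() - draw()) for _ in range(n_pairs)) / n_pairs

k0_hat = empirical_k0(lambda: random.gauss(0.0, 1.0))
k0_exact = 2 / math.sqrt(math.pi)   # E|X' - X''| when X ~ N(0, 1)
print(round(k0_hat, 3), round(k0_exact, 3))
```

Passing the empirical distribution of an observed numerical variable as `draw` gives the data-based version of k0 mentioned above.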
3.5 The CATCH Algorithm
The variable selection algorithm for a classification problem is based on importance scores and an iteration procedure.

The Algorithm

(0) Standardizing: Standardize all the numerical predictors to have sample mean 0 and sample standard deviation 1 using linear transformations.

(1) Initializing Importance Scores: Let S = (s1, ..., sd) be the importance scores for the predictors. Let S′ = (s′1, ..., s′d) be such that s′l is a threshold importance score corresponding to the model where Xl is independent of Y given X−l. Initialize S to (1, ..., 1) and S′ to (0, ..., 0). Let M be the number of tube centers and m the number of observations within each tube.

(2) Tube Hunting Loop: For l = 1, ..., d, do (a), (b), (c), and (d) below.

(a) Selecting tube centers: For 0 < b ≤ 1, set M = [n^b], where [·] is the greatest integer function, and randomly sample M bootstrap points X*_1, ..., X*_M with replacement from the n observations. Set k = 1 and do (b).

(b) Constructing tubes: For 0 < a < 1, set m = [n^a]. Using the tube distance D_l defined in (3.10), select the tube size so that there are exactly m ≥ 10 observations inside the tube for Xl at X*_k.

(c) Calculating efficacy: If Xl is a numerical variable, let ζ̂_l(X*_k) be the estimated local contingency tube efficacy of Y on Xl at X*_k, as defined in (3.5). If Xl is a categorical variable, let ζ̂_l(X*_k) be the estimated contingency tube efficacy of Y on Xl at X*_k, as defined in (3.7).

(d) Thresholding: Let Y* denote a random variable that is identically distributed as Y but independent of X. Let (Y(1)_0, ..., Y(n)_0) be a random permutation of (Y(1), ..., Y(n)); then Y_0 ≡ (Y(1)_0, ..., Y(n)_0) can be viewed as a realization of Y*. Let ζ̂*_l(X*_k) be the estimated local contingency tube efficacy of Y_0 on Xl at X*_k, calculated as in (c) with Y_0 in place of Y.

If k < M, set k = k + 1 and go to (b); otherwise go to (e).

(e) Updating importance scores: Set

s_l^new = (1/M) ∑_{k=1}^M ζ̂_l(X*_k),  s′_l^new = (1/M) ∑_{k=1}^M ζ̂*_l(X*_k).

Let S^new = (s1^new, ..., sd^new) and S′^new = (s′1^new, ..., s′d^new) denote the updates of S and S′.

(3) Iterations and Stopping Rule: Repeat (2) I times; stop when the change in S^new is small. Record the last S and S′ as S^stop and S′^stop, respectively.

(4) Deletion step: Using S^stop and S′^stop, delete Xl if s_l^stop − s′_l^stop < √2 SD(S′^stop). If d is small, we can generate several sets of S′^stop and then use the sample standard deviation of all the values in these sets as SD(S′^stop). See Remark 3.3.

End of the algorithm.
Remark 3.2. Step (d) needs to be replaced by (d′) below when a global permutation of Y does not produce an appropriate "local null distribution". For example, when the sample sizes of the different classes are extremely unequal, which is known as an unbalanced classification problem, it is very likely that a local region contains a pure class of Y after the global permutation; the threshold S′ will then be spuriously low, and irrelevant variables will be falsely selected into the model. The approach in (d′) is more nonparametric but more time consuming.

(d′) Local Thresholding: Let Y* denote a random variable that is distributed as Y but independent of Xl given X−l. Let (Y(1)_0, ..., Y(m)_0) be a random permutation of the Y's inside the tube; then (Y(1)_0, ..., Y(m)_0) can be viewed as a realization of Y*. If Xl is a numerical variable, let ζ̂′_l(X*_k) be the estimated local contingency tube efficacy of Y* on Xl at X*_k, as defined in (3.5). If Xl is a categorical variable, let ζ̂′_l(X*_k) be the estimated contingency tube efficacy of Y* on Xl at X*_k, as defined in (3.7).
Remark 3.3 (The threshold). Under the assumption H(l)_0 that Y is independent of Xl given X−l in the tube T_l, s_l and s′_l are nearly independent, and SD(s_l − s′_l) ≈ √2 SD(s′_l). Thus the rule that deletes Xl when s_l^stop − s′_l^stop < √2 SD(S′^stop) is similar to the one-standard-deviation rule of CART and GUIDE. The choice of threshold is also discussed in Section 7. Alternatively, we can calculate a p-value for each covariate by a permutation test: we permute the response variable, obtain the importance scores for the permuted data, and calculate how extreme s_l^stop is compared to the permuted scores. We then identify a covariate as important if its p-value is less than 0.1.
Remark 3.4 (Large d, small n). A key idea in this paper is that when determining the importance of the variable Xj, we restrict (condition on) the vector X−j over a relatively small set. This is done to address the problem of spurious correlation, which is of great concern in genomics (Price et al. [13], Lin and Zeng [11]). In such genomic studies the number of predictors d typically greatly exceeds the sample size n. For numerical variables this can be dealt with by replacing X−j with a vector of principal components explaining most of the variation in X−j. Although our paper does not deal with the d ≫ n case directly, this remark shows that our approach is also applicable to the genome-wide association studies described in Price et al. [13].
Remark 3.5. We now consider the computational complexity of the algorithm. Recall the notation: I is the number of repetitions, M is the number of tube centers, m is the tube size, d is the dimension, and n is the sample size. The number of operations needed for the algorithm is I × d × M × (2b + 2c + 2d) + 2e, where 2b, 2c, 2d, and 2e denote the numbers of operations required for those steps; their computational complexities are O(nd), O(m), O(m), and O(M), respectively. Since nd ≫ m, the required computation is dominated by step 2b, and the computational complexity of the algorithm is O(IMnd²). Solving a least squares problem requires O(nd²) operations, so our algorithm has higher computational complexity by a factor of IM compared to least squares. However, the computations for the M tube centers are independent of one another and can be performed in parallel.
4 Simulation Studies
4.1 Variable Selection with Categorical and Numerical Predictors
Example 4.1. Consider d ≥ 5 predictors X1, X2, ..., Xd and a categorical response variable Y. The fifth predictor X5 is a Bernoulli random variable taking the values 0 and 1 with equal probability; all other predictors are uniformly distributed on [0, 1].

When X5 = 0, the dependent variable Y is related to the predictors in the following fashion:

Y = 0 if X1 − X2 − 0.25 sin(16πX2) ≤ −0.5,
    1 if −0.5 < X1 − X2 − 0.25 sin(16πX2) ≤ 0,
    2 if 0 < X1 − X2 − 0.25 sin(16πX2) ≤ 0.5,
    3 if X1 − X2 − 0.25 sin(16πX2) > 0.5.  (4.1)

When X5 = 1, Y depends on X3 and X4 in the same way, with X1 replaced by X3 and X2 replaced by X4. The other variables X6, ..., Xd are irrelevant noise variables. All d predictors are independent.
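A generator for this design can be sketched as follows (our code; the boundary convention at ties is immaterial for continuous predictors):

```python
import numpy as np

def simulate_example_41(n, d, rng):
    # Generate (X, Y) from the model of Example 4.1; requires d >= 5.
    x = rng.uniform(size=(n, d))
    x[:, 4] = rng.integers(2, size=n)            # X5 ~ Bernoulli(1/2) (column index 4)
    def label(u, v):
        z = u - v - 0.25 * np.sin(16 * np.pi * v)
        return np.digitize(z, [-0.5, 0.0, 0.5])  # four regions -> Y in {0, 1, 2, 3}
    y = np.where(x[:, 4] == 0,
                 label(x[:, 0], x[:, 1]),        # X5 = 0: Y driven by (X1, X2)
                 label(x[:, 2], x[:, 3]))        # X5 = 1: Y driven by (X3, X4)
    return x, y

rng = np.random.default_rng(0)
x, y = simulate_example_41(400, 10, rng)
print(x.shape, sorted(set(y.tolist())))
```

With n = 400 and d = 10 this matches the smallest configuration used in the simulation study below.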
In addition to the presence of the noise variables X6, ..., Xd, this example is challenging in two respects. First, there is an interaction effect between the continuous predictors X1, ..., X4 and the categorical predictor X5. Such an interaction effect may arise in practice: for example, suppose Y represents the effect resulting from multiple treatments X1, ..., X4 and X5 represents the gender of a patient; our model allows males and females to respond differently to treatments. Second, the classification boundaries for fixed X5 are highly nonlinear, as shown in Figure 1. This design demonstrates that our method can detect nonlinear dependence between the response and the predictors.
Table 2 shows the number of simulations in which the predictors X1, ..., X5 are correctly kept in the model and, in column 6, the average number of simulations in which the d − 5 irrelevant predictors are falsely selected into the model, out of 100 simulation trials from model (4.1). In the simulations, n = 400 and d = 10, 50, and 100 are used. We compare CATCH with other methods, including Marginal CATCH, the Random Forest classifier (RFC) (Breiman [1]), and GUIDE (Loh [12]). For Marginal CATCH we test the association between Y and Xl without conditioning on X−l. RFC is well known as a powerful tool for prediction; here we use importance scores from RFC for variable selection. We create 30 additional noise variables and use their average importance score multiplied by a factor of 2 as a threshold (Doksum et al. [3]) to decide whether a variable should be selected into the model. In particular, we generate the noise variables using uniform and Bernoulli distributions for numerical and categorical predictors, respectively.
The main goal of GUIDE is to solve the variable selection bias problem in regression and classification tree methods. As an illustration, if a categorical predictor takes K distinct values, then the splitting test exhausts 2^{K−1} − 1 possible binary partitions and therefore is
Table 2: Simulation results from Example 4.1 with sample size n = 400. For 1 ≤ i ≤ 5, the ith column gives the estimated probability of selection of the relevant variable Xi and the standard error based on 100 simulations. Column 6 gives the mean of the estimated probability of selection and the standard error for the irrelevant variables X6, ..., Xd.

Number of Selections (100 simulations)

METHOD           d     X1          X2          X3          X4          X5          Xi, i ≥ 6
CATCH            10    0.82(0.04)  0.75(0.04)  0.78(0.04)  0.82(0.04)  1(0)        0.05(0.02)
                 50    0.76(0.04)  0.76(0.04)  0.87(0.03)  0.84(0.04)  0.96(0.02)  0.08(0.03)
                 100   0.77(0.04)  0.73(0.04)  0.82(0.04)  0.8(0.04)   0.83(0.04)  0.09(0.03)
Marginal CATCH   10    0.77(0.04)  0.67(0.05)  0.7(0.05)   0.8(0.04)   0.06(0.02)  0.1(0.03)
                 50    0.69(0.05)  0.7(0.05)   0.76(0.04)  0.79(0.04)  0.07(0.03)  0.1(0.03)
                 100   0.78(0.04)  0.76(0.04)  0.73(0.04)  0.73(0.04)  0.12(0.03)  0.1(0.03)
Random Forest    10    0.71(0.05)  0.51(0.05)  0.47(0.05)  0.59(0.05)  0.37(0.05)  0.06(0.02)
                 50    0.64(0.05)  0.61(0.05)  0.65(0.05)  0.58(0.05)  0.28(0.04)  0.08(0.03)
                 100   0.66(0.05)  0.58(0.05)  0.59(0.05)  0.65(0.05)  0.17(0.04)  0.11(0.03)
GUIDE            10    0.96(0.02)  0.98(0.01)  0.93(0.03)  0.99(0.01)  0.92(0.03)  0.33(0.05)
                 50    0.82(0.04)  0.85(0.04)  0.9(0.03)   0.89(0.03)  0.67(0.05)  0.12(0.03)
                 100   0.83(0.04)  0.84(0.04)  0.86(0.03)  0.83(0.04)  0.61(0.05)  0.08(0.03)
Figure 1: The relationship between Y and (X1, X2) for model (4.1). The four areas represent the four values of Y.
biased toward being selected. GUIDE alleviates this bias by employing lack-of-fit tests, i.e., χ²-tests applicable to both numerical and categorical predictors, with the p-values adjusted by a bootstrap procedure. Since GUIDE has negligible selection bias, it can include tests for local pairwise interactions. Furthermore, it achieves sensitivity to local curvature by fitting a linear model at each node.
The results in Table 2 show that Marginal CATCH does a very poor job of selecting X5; this illustrates the importance of local conditioning. RFC does better than Marginal CATCH but also has difficulties, while CATCH and GUIDE successfully identify X5 along with X1 to X4 with high frequency. When d = 10, GUIDE does very well for the relevant variables but selects too many irrelevant variables. Surprisingly, GUIDE selects fewer irrelevant variables as d increases; it becomes more cautious as the dimension d grows. CATCH performs very well in the high dimensional cases (d = 50 and d = 100). This indicates that the CATCH algorithm is robust to the intricate interaction structures between the predictors.
In model (4.1) above, the predictors interact in a hierarchical way. When X5 = 0, the response is related to X1 and X2 as shown in Figure 1, where the two axes are the two relevant predictors and the four areas partitioned by the curves correspond to different values of Y. When X5 = 1, the response depends on X3 and X4 in the same way. This type of hierarchical interaction between categorical and numerical predictors is difficult to detect, even by state-of-the-art classification methods such as the support vector machine (SVM) and RFC. In particular, RFC produces an importance
score vector that identifies the four numerical variables X1, ..., X4 as very significant but often leaves the categorical variable X5 unidentified, as we see in Table 2.
The CATCH algorithm, however, performs very well for this complicated model. We now explain why CATCH is able to identify X5 as an important variable. To explore the association between X5 and Y, we construct M contingency tables with 2 rows and 4 columns based on M randomly centered tubes, calculate contingency efficacy scores, and then average the scores. Figure 2 (a) and (b) illustrate how one of the contingency efficacy scores is calculated. 1000 data points are generated from model (4.1) and split into two sets according to x5: (a) shows the approximately half of the points with x5 = 0 in the (x1, x2) dimensions, and (b) shows the remaining points with x5 = 1 in the (x3, x4) dimensions. The data points are colored according to Y. This representation enables us to clearly see the classification boundaries (grey lines). The randomly selected tube center, for which x5 is zero, is shown as a star in (a). The data points within the tube, i.e., close to the tube center in terms of X−5, are highlighted by larger dots. Since the distance defining the tube does not involve X5, the larger dots have x5 equal to 0 or 1 and are present in both (a) and (b). The counts of the larger dots in different colors in (a) and (b) correspond to the cell counts of the two rows, x5 = 0 and x5 = 1, of the contingency table. As we can see, larger dots in (a) are mostly red and larger dots in (b) are mostly blue and orange, which shows that the distribution of Y depends on the value of X5 and the contingency efficacy score is high. We expect the contingency efficacy score to be low only if the tube center has similar values in (x1, x2) and (x3, x4). Such tube centers would be very few among the M tube centers, since they are randomly chosen. Thus the average of the contingency efficacy scores is high and X5 is identified as an important variable.
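A single tube-based score of this kind can be sketched as follows. The function name is hypothetical and the plain Euclidean distance in X−5 is a simplification of the paper's adaptive, importance-weighted tube distances; the 2 × 4 table (rows x5 = 0, 1; columns the classes of Y) and the root chi-square form follow the description above.

```python
import numpy as np

def tube_efficacy_x5(X, Y, center, m=50):
    """Hedged sketch: one local contingency efficacy score for the
    categorical X5, from the m points nearest to `center` in X_{-5}."""
    others = np.delete(np.arange(X.shape[1]), 4)
    d2 = ((X[:, others] - center[others]) ** 2).sum(axis=1)
    tube = np.argsort(d2)[:m]                    # distance ignores X5 itself
    counts = np.zeros((2, 4))                    # rows: x5 = 0, 1; cols: Y = 0..3
    for r, c in zip(X[tube, 4].astype(int), Y[tube].astype(int)):
        counts[r, c] += 1
    E = np.outer(counts.sum(axis=1), counts.sum(axis=0)) / m
    with np.errstate(invalid="ignore", divide="ignore"):
        cells = (counts - E) ** 2 / E            # empty 0/0 cells treated as 0
    return np.sqrt(np.nansum(cells) / m)         # root chi-square type score
```

When Y is fully determined by x5 inside the tube the score is close to 1; when Y is independent of x5 it stays near 0, which is the contrast the averaging over M tubes exploits.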
To contrast this example with a simpler data-generating model, suppose the effects of the predictors are additive; that is, the categorical variable enters the response as a term added to the effect of the numerical predictors. In this type of simpler model, the RFC and GUIDE algorithms can efficiently identify relevant predictors, but so can the CATCH algorithm. We do not include the simulation results for additive models here.
Remark 4.1. We observe that the CATCH algorithm is not very sensitive to the choice of the number of tube centers M or the tube size m = [n^a], where 0 < a ≤ 1 and [·] is the greatest integer function. For Example 4.1 we perform the simulation using M = n and M = n/2, and a = 0.1, 0.15, 0.2, 0.3, 0.5. The results are summarized in Table 3. The CATCH algorithm performs similarly for M = n and M = n/2, with M = n having a small winning edge. We generally suggest using M = n if computational resources are not a concern or n is not large (≤ 200). When computational cost is a concern (Remark 3.5), one can reduce the number of tube centers to M = n/2 as a compromise between detection power and computational cost. Table 3 also shows that the tube size works well in the range between 0.1 and 0.3, while the detection power for X5 decreases in the case a = 0.5, where the tube is too big to achieve local conditioning. This is a consistent
observation with the comparison to Marginal CATCH in Table 2. On the other hand, the tube size should not be too small, in order to achieve sufficient detection power. We suggest using a = 0.1 or 0.15 for sample sizes n ≥ 400, or m = 50 for smaller sample sizes. In light of these guidelines, we use M = n/2 = 200 and a = 0.15 for Examples 4.1-4.3, and M = n = 200 and m = 50 for Example 4.4.
Table 3: Sensitivity study of CATCH for Example 4.1 (n = 400, d = 50) using different numbers of tube centers M and tube sizes m = [n^a]. For 1 ≤ i ≤ 5, the ith column gives the number of times Xi was selected. Column 6 gives the number of times irrelevant variables were selected.
Number of Selections (100 simulations)

M        a     X1  X2  X3  X4  X5  Xi, i ≥ 6
M = 200  0.10  75  67  77  69  92  78
         0.15  76  76  87  84  96  82
         0.20  85  84  90  87  93  99
         0.30  89  87  95  89  85  96
         0.50  89  90  96  93  62  99
M = 400  0.10  74  76  82  77  92  72
         0.15  85  82  89  87  94  86
         0.20  87  88  90  86  96  91
         0.30  88  89  91  91  92  94
         0.50  89  91  94  93  61  102
In Example 4.1 all predictors are independent. In a similar setting, we next consider the case where the predictors are dependent.
Example 4.2. We generate standard normal random variables Zi, i = 1, ..., d, in such a way that cor(Z1, Z4) = 0.5, cor(Z3, Z6) = 0.9, and the other Zi's are independent. We set Xi = Φ(Zi), i = 1, ..., d, where Φ(·) is the CDF of the standard normal; therefore the Xi are uniformly distributed marginally, as in Example 4.1. The correlation structure of the Xi's is similar to that of the Zi's. The response Y is generated in the same way as in Example 4.1.
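This construction (a Gaussian copula) can be sketched as follows; the function name is hypothetical, and only the two stated correlations are built in, via the usual representation Z4 = 0.5 Z1 + √(1 − 0.5²) ε.

```python
import numpy as np
from math import erf, sqrt

def generate_predictors_42(n, d, rng):
    """Hedged sketch of the Example 4.2 design: correlated normals pushed
    through the standard normal CDF, giving uniform marginals."""
    Z = rng.standard_normal((n, d))
    Z[:, 3] = 0.5 * Z[:, 0] + sqrt(1 - 0.5 ** 2) * Z[:, 3]   # cor(Z1, Z4) = 0.5
    Z[:, 5] = 0.9 * Z[:, 2] + sqrt(1 - 0.9 ** 2) * Z[:, 5]   # cor(Z3, Z6) = 0.9
    Phi = np.vectorize(lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0))))
    return Phi(Z)                                            # Xi = Phi(Zi) ~ U(0, 1)
```

Note that the Pearson correlations of the resulting uniforms are slightly attenuated relative to those of the Zi's (e.g., roughly 0.48 rather than 0.5), which is what "similar to that of the Zi's" alludes to.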
The results for Example 4.2 are given in Table 4. Comparing with the results in Table 2, both CATCH and GUIDE do well in the presence of dependence between the numerical variables. The explanation is two-fold. First, both methods identify the categorical X5 as an important predictor most of the time, which makes the identification of the other relevant predictors possible. Second, conditioning on X5, Y depends on a pair of variables ("X1 and X2" or "X3 and X4") jointly, and in particular this dependence is nonlinear, which makes both methods less affected by the correlation introduced here.
Figure 2: Sample scatter plots of (x1, x2) and (x3, x4) for simulated data using model (4.1). The left panel includes all the pairs (x1, x2) with x5 = 0, while the right panel includes all the pairs (x3, x4) with x5 = 1. The larger dots in the two plots are within the tube around a tube center, shown as a star.
We now explain the latter in detail for both methods. In CATCH, when predictors are conditioned on, the algorithm adaptively weighs their effects by their importance scores in defining the tubes; therefore identifying either of the two will help to identify the other, whereas an irrelevant variable, though strongly associated with one of the relevant variables, is down-weighted iteratively and does not strongly affect the selection of the relevant variable. This explains the slightly decreasing selection probability for X3 in the presence of the strongly correlated X6 and, consequently, the overall lower selection of X3 and X4 compared to X1 and X2, given the correlation between X1 and X4. In GUIDE the difference between the dependent and independent cases is even smaller, because GUIDE is very powerful in detecting local interactions by explicitly including pairwise interactions when selecting splitting variables.
Marginal CATCH and Random Forest do poorly in this case, mainly because X5 cannot be detected, as we explained earlier. More specifically, the dependence between Y and X4 in the X5 = 1 case is the same as that between Y and X2 in the X5 = 0 case. The linear terms in X1 and X2 have opposite signs in the decision function; therefore the marginal effects of X1 and X4 are obscured when X5 is not identified as an important variable.
4.2 Improving the Performance of SVM and RF Classifiers
The criterion used in the CATCH algorithm is not based on classification accuracy. But if we use CATCH to screen out the irrelevant variables and only use the important
Table 4: Simulation results from Example 4.2 with sample size n = 400. For 1 ≤ i ≤ 6, the ith column gives the estimated probability of selection of variable Xi and the standard error based on 100 simulations, where Xi, i = 1, ..., 5, are relevant variables with cor(X1, X4) = 0.5; X6 is an irrelevant variable that is correlated with X3 with correlation 0.9. Column 7 gives the mean of the estimated probability of selection and that of the standard error for the irrelevant variables X7, ..., Xd.

Number of Selections (100 simulations)

METHOD           d     X1          X2          X3          X4          X5          X6          Xi, i ≥ 7
CATCH            10    0.76(0.04)  0.77(0.04)  0.73(0.04)  0.68(0.05)  0.99(0.01)  0.34(0.05)  0.07(0.02)
                 50    0.78(0.04)  0.77(0.04)  0.74(0.04)  0.63(0.05)  0.92(0.03)  0.57(0.05)  0.09(0.03)
                 100   0.76(0.04)  0.77(0.04)  0.8(0.04)   0.74(0.04)  0.82(0.04)  0.55(0.05)  0.09(0.03)
Marginal CATCH   10    0.35(0.05)  0.63(0.05)  0.8(0.04)   0.32(0.05)  0.09(0.03)  0.71(0.05)  0.12(0.03)
                 50    0.34(0.05)  0.71(0.05)  0.7(0.05)   0.3(0.05)   0.1(0.03)   0.6(0.05)   0.11(0.03)
                 100   0.34(0.05)  0.68(0.05)  0.75(0.04)  0.3(0.05)   0.1(0.03)   0.59(0.05)  0.1(0.03)
Random Forest    10    0.29(0.05)  0.48(0.05)  0.55(0.05)  0.2(0.04)   0.3(0.05)   0.47(0.05)  0.05(0.02)
                 50    0.26(0.04)  0.52(0.05)  0.57(0.05)  0.3(0.05)   0.24(0.04)  0.45(0.05)  0.08(0.03)
                 100   0.36(0.05)  0.61(0.05)  0.66(0.05)  0.37(0.05)  0.32(0.05)  0.46(0.05)  0.12(0.03)
GUIDE            10    0.96(0.02)  0.96(0.02)  0.94(0.02)  0.95(0.02)  0.96(0.02)  0.84(0.04)  0.3(0.05)
                 50    0.8(0.04)   0.92(0.03)  0.88(0.03)  0.79(0.04)  0.72(0.04)  0.68(0.05)  0.12(0.03)
                 100   0.81(0.04)  0.89(0.03)  0.9(0.03)   0.76(0.04)  0.75(0.04)  0.58(0.05)  0.08(0.03)
variables for classification, the performance of other algorithms, such as the Support Vector Machine (SVM) and the Random Forest Classifier (RFC), can be significantly improved. In this section we compare the prediction performance of (1) SVM, (2) RFC, (3) CATCH followed by SVM, and (4) CATCH followed by RFC; we label (3) and (4) CATCH+SVM and CATCH+RFC, respectively. We use the "svm" function in the "e1071" package in R with the radial kernel to perform the SVM analysis, and the "randomForest" function in the "randomForest" package to perform the RFC analysis. We use simulation experiments to illustrate the improvement obtained by using CATCH prior to classifying Y.
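The screen-then-classify idea can be sketched as below. Since the paper's R calls are not reproduced here, a 1-nearest-neighbour rule stands in for the SVM and RFC classifiers, and the function name is hypothetical; the essential point is only that classification is performed on the variables that survive the CATCH threshold.

```python
import numpy as np

def screen_then_classify(X_train, y_train, X_test, scores, threshold):
    """Hedged sketch of CATCH+classifier: keep the variables whose
    importance score clears the threshold, then classify on the
    reduced design (1-NN stands in for svm()/randomForest())."""
    keep = np.flatnonzero(np.asarray(scores) >= threshold)
    A, B = X_train[:, keep], X_test[:, keep]
    d2 = ((B[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)  # squared distances
    return y_train[np.argmin(d2, axis=1)]                    # nearest neighbour's label
```

Screening matters for distance-based rules like this one because irrelevant coordinates inflate all pairwise distances equally, drowning out the signal carried by the relevant ones.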
Let the true model be

\[ (\mathbf{X}, Y) \sim P, \tag{4.2} \]

where P will be specified later. In each of 100 Monte Carlo experiments, n pairs (X^(i), Y^(i)), i = 1, ..., n, are generated from the true model (4.2). Based on the simulated data (X^(i), Y^(i)), i = 1, ..., n, the methods (1) SVM, (2) RFC, (3) CATCH+SVM, and (4) CATCH+RFC are used to classify a future Y^(0) for which we have available a corresponding X^(0).
The integrated misclassification rate (IMR; Friedman [6]), defined below, is used to evaluate the performance of the classification algorithms. Let F_{X,Y}(·) be the classifier constructed by a classification algorithm based on X = (X^(1), ..., X^(n)) and Y = (Y^(1), ..., Y^(n))^T.

Let (X^(0), Y^(0)) be a future observation with the same distribution as (X^(i), Y^(i)). We only know X^(0) = x^(0) and want to classify Y^(0). For a possible future observation (x^(0), y^(0)), define

\[ \mathrm{MR}(\mathbf{X}, \mathbf{Y}, x^{(0)}, y^{(0)}) = P\left(F_{\mathbf{X},\mathbf{Y}}(x^{(0)}) \neq y^{(0)} \mid \mathbf{X}, \mathbf{Y}\right). \tag{4.3} \]

Our criterion is

\[ \mathrm{IMR} = E\,E_0\left[\mathrm{MR}(\mathbf{X}, \mathbf{Y}, x^{(0)}, y^{(0)})\right], \tag{4.4} \]

where E_0 is with respect to the distribution of (x^(0), y^(0)) and E is with respect to the distribution of (X, Y).
This double expectation is evaluated by Monte Carlo simulation; the result is denoted IMR*. More precisely, E_0 is estimated using 5000 simulated (x^(0), y^(0)) observations, and IMR* is then computed on the basis of 100 simulated samples (X_i, Y_i), i = 1, ..., 100.
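The IMR* computation described above can be sketched as follows; `gen` and `fit` are hypothetical user-supplied hooks for the data-generating model P and the classifier, and the function name is an assumption.

```python
import numpy as np

def imr_star(gen, fit, n, n_future=5000, n_rep=100, seed=0):
    """Hedged Monte Carlo sketch of IMR* from (4.3)-(4.4): average, over
    n_rep training samples, of the misclassification rate on n_future
    fresh draws.  fit(X, y) is assumed to return a prediction function."""
    rng = np.random.default_rng(seed)
    rates = []
    for _ in range(n_rep):
        X, y = gen(n, rng)                        # training sample from P
        predict = fit(X, y)                       # trained classifier F_{X,Y}
        X0, y0 = gen(n_future, rng)               # future observations from P
        rates.append(np.mean(predict(X0) != y0))  # estimates MR of (4.3)
    return float(np.mean(rates))                  # outer expectation of (4.4)
```

The inner average over the n_future draws estimates E_0, and the outer average over the n_rep training samples estimates E, matching the two expectations in (4.4).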
Example 4.3. First consider model (4.1) in Section 4.1. Figure 3 shows IMR* versus the number of irrelevant variables for this example. The left panel shows that the IMR*'s of the SVM approach increase dramatically as the number of irrelevant variables increases, while the IMR*'s of CATCH+SVM are stable. Thus CATCH stabilizes the
performance of SVM for this complex model. The right panel shows that CATCH also stabilizes the performance of RFC. These results indicate that CATCH can improve the prediction accuracy of other classification algorithms.
Figure 3: IMR*'s of SVM, RFC, CATCH+SVM, and CATCH+RFC versus the number of predictors for the simulation study of model (4.1) in Example 4.1. The sample size is n = 400.
Example 4.4 (Contaminated Logistic Model). Consider a binary response Y ∈ {0, 1} with

\[
P(Y = 0 \mid X = x) = \begin{cases} F(x) & \text{if } r = 0,\\ G(x) & \text{if } r = 1, \end{cases} \tag{4.5}
\]

where x = (x_1, ..., x_d),

\[ r \sim \mathrm{Bernoulli}(\gamma), \tag{4.6} \]

\[ F(x) = \left(1 + \exp\{-(0.25 + 0.5x_1 + x_2)\}\right)^{-1}, \tag{4.7} \]

\[
G(x) = \begin{cases} 0.1 & \text{if } 0.25 + 0.5x_1 + x_2 < 0,\\ 0.9 & \text{if } 0.25 + 0.5x_1 + x_2 \ge 0, \end{cases} \tag{4.8}
\]

and x_1, x_2, ..., x_d are realizations of independent standard normal random variables. Hence, given X^(i) = x^(i), 1 ≤ i ≤ n, the joint distribution of Y^(1), Y^(2), ..., Y^(n) is the contaminated logistic model with contamination parameter γ:
\[
(1-\gamma)\prod_{i=1}^{n}[F(x^{(i)})]^{Y^{(i)}}[1-F(x^{(i)})]^{1-Y^{(i)}} \;+\; \gamma\prod_{i=1}^{n}[G(x^{(i)})]^{Y^{(i)}}[1-G(x^{(i)})]^{1-Y^{(i)}}. \tag{4.9}
\]
We perform a simulation study for n = 200, d = 50, and γ = 0, 0.2, 0.4, 0.6, 0.8, and 1, where γ = 0 corresponds to no contamination. Figure 4 shows the IMR*'s versus the different values of γ for this example. CATCH improves the performance of the SVM and Random Forest classifiers for this contaminated logistic regression model.
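A minimal sketch of sampling from the contaminated model (4.5)-(4.8); the function name is hypothetical, and drawing the contamination indicator r independently per observation follows the Bernoulli(γ) statement in (4.6).

```python
import numpy as np

def generate_model_44(n, d, gamma, rng):
    """Hedged sketch of the contaminated logistic model (4.5)-(4.8)."""
    X = rng.standard_normal((n, d))              # x1, ..., xd i.i.d. N(0, 1)
    eta = 0.25 + 0.5 * X[:, 0] + X[:, 1]
    F = 1.0 / (1.0 + np.exp(-eta))               # logistic P(Y = 0 | x), (4.7)
    G = np.where(eta < 0, 0.1, 0.9)              # contaminating component, (4.8)
    r = rng.uniform(size=n) < gamma              # r ~ Bernoulli(gamma), (4.6)
    p0 = np.where(r, G, F)                       # mixture (4.5)
    Y = (rng.uniform(size=n) >= p0).astype(int)  # Y = 0 with probability p0
    return X, Y
```

Note that G reverses the logistic relationship where eta is large (F near 1 but G = 0.9 capped, and near the boundary G jumps), which is what makes the contaminated labels hard for classifiers trained on the clean part.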
Figure 4: IMR*'s of SVM, RFC, CATCH+SVM, and CATCH+RFC versus different contamination parameters γ for simulation model (4.9) in Example 4.4.
5 CATCH Analysis of the ANN Data

In this section CATCH is applied to a data set available at the UCI database (http://ftp.ics.uci.edu/pub/machine-learning-databases/thyroid-disease). The data come from a clinical trial to determine whether a patient is hypothyroid (see, e.g., Quinlan [15]). The dataset is split into two parts: a training set with 3772 observations and a test set with 3428 observations. The response is a categorical variable with three classes: normal (not hypothyroid), hyperfunction, and subnormal functioning. There are 21 predictors, including 15 categorical and 6 numerical predictors. The three classes are very imbalanced, with 92.5% of the patients normal. A good classifier should produce a significant decrease from the 7.5% error rate that can be obtained by classifying all observations to the normal class.
CATCH identifies 7 variables as important, including 4 continuous variables and 3 categorical variables. We consider three classification methods: 1. Support Vector Machine with radial kernel, denoted SVM(r); 2. Support Vector Machine with polynomial kernel, denoted SVM(p); 3. RFC. We apply these three methods to the training set to build classification models and use the test set to estimate their misclassification rates. We apply the methods with CATCH and without CATCH, where "with CATCH" means that only the 7 important variables identified by CATCH are used to build the model, and "without CATCH" means that all 21 variables are used in the analysis.
Table 5: Misclassification rates of three classification methods (SVM(r), SVM(p), RFC) with CATCH and without CATCH.

             SVM(r)  SVM(p)  RFC
w/o CATCH    0.052   0.063   0.024
with CATCH   0.037   0.055   0.015
Table 5 shows that if CATCH is used to select important variables before the classification methods are applied, the misclassification rates decrease. CATCH reduces the misclassification rates by 29%, 13%, and 37% for SVM(r), SVM(p), and RFC, respectively. CATCH+RFC has the best performance. Although CATCH is not designed to improve classification accuracy, it nonetheless helps other classification methods perform better.
6 Properties of Local Contingency Efficacy
In this section we investigate the properties of the local contingency efficacy. For a univariate continuous predictor, Theorem 6.1 below shows that the local contingency efficacy is well defined and equals 0 if the response variable is independent of the predictor. Consider an R × C contingency table. Let n_rc, r = 1, ..., R, c = 1, ..., C, be the observed frequency in the (r, c) cell and let p_rc be the probability that an observation belongs to the (r, c) cell. Let
\[ p_r = \sum_{c=1}^{C} p_{rc}, \qquad q_c = \sum_{r=1}^{R} p_{rc}, \tag{6.1} \]
and

\[ n_{r\cdot} = \sum_{c=1}^{C} n_{rc}, \quad n_{\cdot c} = \sum_{r=1}^{R} n_{rc}, \quad n = \sum_{r=1}^{R}\sum_{c=1}^{C} n_{rc}, \quad E_{rc} = \frac{n_{r\cdot}\, n_{\cdot c}}{n}. \tag{6.2} \]
For testing that the rows and the columns are independent, i.e.,

\[ H_0: \forall r, c,\ p_{rc} = p_r q_c \quad \text{versus} \quad H_1: \exists r, c,\ p_{rc} \neq p_r q_c, \tag{6.3} \]

a standard test statistic is

\[ \mathcal{X}^2 = \sum_{r=1}^{R}\sum_{c=1}^{C} \frac{(n_{rc} - E_{rc})^2}{E_{rc}}. \tag{6.4} \]
Under H_0, \(\mathcal{X}^2\) converges in distribution to \(\chi^2_{(R-1)(C-1)}\). Moreover, defining 0/0 = 0, we have the well-known result:

Lemma 6.1. As n tends to infinity, \(\mathcal{X}^2/n\) tends in probability to

\[ \tau^2 \equiv \sum_{r=1}^{R}\sum_{c=1}^{C} \frac{(p_{rc} - p_r q_c)^2}{p_r q_c}. \]

Moreover, if \(p_r q_c\) is bounded away from 0 and 1 as \(n \to \infty\), and if R and C are fixed as \(n \to \infty\), then \(\tau^2 = 0\) if and only if \(H_0\) is true.
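Lemma 6.1 can be checked numerically: the statistic (6.4) divided by n approaches τ² as the total count grows, and τ² vanishes exactly on independent tables. The helper names below are hypothetical.

```python
import numpy as np

def chi2_stat(counts):
    """Pearson X^2 of (6.4) for an R x C table of observed counts."""
    n = counts.sum()
    E = np.outer(counts.sum(axis=1), counts.sum(axis=0)) / n  # n_{r.} n_{.c} / n
    return ((counts - E) ** 2 / E).sum()

def tau2(p):
    """Population limit of X^2 / n in Lemma 6.1 for cell probabilities p."""
    prod = np.outer(p.sum(axis=1), p.sum(axis=0))             # p_r * q_c
    return ((p - prod) ** 2 / prod).sum()
```

Sampling a large multinomial table from a dependent p and comparing chi2_stat(counts)/n with tau2(p) illustrates the convergence in probability stated in the lemma.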
Remark 6.1. Lemma 6.1 shows that the contingency efficacy ζ in (2.8) is well defined. Moreover, ζ = 0 if and only if X and Y are independent.
Theorem 6.1. Consider a categorical response Y and a continuous predictor X with density function f(x) such that, for a fixed h > 0, f(x) > 0 for x ∈ (x^(0) − h, x^(0) + h). Then:

(1) For fixed h, the estimated local section efficacy ζ̂(x^(0), h) defined in (2.5) converges almost surely to the section efficacy ζ(x^(0), h) defined in (2.4).

(2) If X and Y are independent, then the local efficacy ζ(x^(0)) defined in (2.4) is zero.

(3) If there exists c ∈ {1, ..., C} such that p_c(x) = P(Y = c | x) has a continuous derivative at x = x^(0) with p'_c(x^(0)) ≠ 0, then ζ(x^(0)) > 0.
The χ² statistic (6.4) can be written in terms of multinomial proportions \(\hat{p}_{rc}\), with \(\sqrt{n}(\hat{p}_{rc} - p_{rc})\), 1 ≤ r ≤ R, 1 ≤ c ≤ C, converging in distribution to a multivariate normal. By a Taylor expansion of \(T_n \equiv \sqrt{\chi^2/n}\) at \(\tau = \operatorname{plim}\sqrt{\chi^2/n}\), we find that:
Theorem 6.2.

\[ \sqrt{n}(T_n - \tau) \xrightarrow{d} N(0, v^2), \tag{6.5} \]

where the asymptotic variance v², which can be obtained from the Taylor expansion, is not needed in this paper.
Remark 6.2. The result (6.5) applies to the multivariate χ²-based efficacies ζ̂, with n replaced by the number of observations that go into the computation of ζ̂. In this case we use the notation χ²_j for the chi-square statistic for the jth variable.
7 Consistency of CATCH

Let Y denote a variable to be classified into one of the categories 1, ..., C by using the Xj's in the vector X = (X1, ..., Xd). Let X−j ≡ X − {Xj} denote X without the Xj component. Assume that the squared Pearson nonparametric correlation (Doksum and Samarov [4]; Huang and Chen [9]) between Xj and X−j is less than 0.5. We call the variable Xj relevant for Y if, conditionally given X−j, L(Y | Xj = x) is not constant in x, and irrelevant otherwise. Our data (X^(1), Y^(1)), ..., (X^(n), Y^(n)) are i.i.d. as (X, Y). A variable selection procedure is consistent if the probability that all the relevant variables are selected and none of the irrelevant variables are selected tends to one as n tends to infinity.
We give a general nonparametric framework in which the variable selection rule based on the CATCH importance scores sj given in (3.8) is consistent.

When the predictors are not all categorical and CATCH involves tubes, we assume that the number of observations in each tube tends to ∞ as n → ∞, and that, for those importance scores that require the selection of a section size h, the selection is from a finite set {hj} with min_j{hj} > h0 for some h0 > 0. It follows that the number of observations mjk in the kth tube used to calculate the importance score sj for the variable Xj tends to infinity as n → ∞.
The key quantities that determine consistency are the limits λj of sj. That is,

\[ \lambda_j = \tau(\zeta_j, P) = E_P[\zeta_j(X)] = \int \zeta_j(x)\, dP(x), \quad j = 1, \ldots, d, \]

where ζj(x) is the local contingency efficacy defined by (2.4) for numerical X's and by (2.8) for categorical X's, and P is the probability distribution of X. Our estimate sj of λj is

\[ s_j = \tau(\hat{\zeta}_j, \hat{P}) = \frac{1}{M}\sum_{k=1}^{M} \hat{\zeta}_j(X^*_k), \]

where ζ̂j is defined in Section 3 and P̂ is the empirical distribution of X.

Let d0 denote the number of X's that are irrelevant for Y. Set d1 = d − d0 and, without loss of generality, reorder the λj so that
\[ \lambda_j > 0 \ \text{if } j = 1, \ldots, d_1; \qquad \lambda_j = 0 \ \text{if } j = d_1 + 1, \ldots, d. \tag{7.1} \]

Our variable selection rule is

"keep variable Xj if sj ≥ t" for some threshold t. (7.2)

Rule (7.2) is consistent if and only if

\[ P\left(\min_{j \le d_1} s_j \ge t \ \text{and}\ \max_{j > d_1} s_j < t\right) \longrightarrow 1. \tag{7.3} \]

To establish (7.3) it is enough to show that

\[ P\left(\max_{j > d_1} s_j > t\right) \longrightarrow 0 \tag{7.4} \]

and

\[ P\left(\min_{j \le d_1} s_j \le t\right) \longrightarrow 0. \tag{7.5} \]

We call (7.4) and (7.5) type I and type II consistency, respectively. Type I and II error probabilities refer to the probabilities of any false positives and any false negatives, respectively.
7.1 Type I consistency

We consider (7.4) first and note that

\[ P\left(\max_{j > d_1} s_j > t\right) \le \sum_{j=d_1+1}^{d} P(s_j > t). \tag{7.6} \]

For the jth irrelevant variable Xj, j > d1, the results in Section 6 apply to the local efficacy ζ̂j(x) at a fixed point x. The importance score sj defined in (3.8) is the average of such efficacies over a sample of M random points x*_a. We assume that there is a point x^(0) such that, for irrelevant Xj's and for n large enough, ζ̂j(x^(0)) is stochastically larger in the right tail than the average \(M^{-1}\sum_{a=1}^{M}\hat{\zeta}_j(x^*_a)\). That is, writing \(s^{(0)}_j \equiv \hat{\zeta}_j(x^{(0)})\), we assume that there exist t > 0 and n0 such that, for n ≥ n0,

\[ P(s_j > t) \le P(s^{(0)}_j > t), \quad j > d_1. \tag{7.7} \]

The existence of such a point x^(0) is established by considering the probability limit of

\[ \arg\max\{\hat{\zeta}_j(x^*_a) : x^*_a \in \{x^*_1, \ldots, x^*_M\}\}. \]
Note that, by Lemma 6.1, \(s^{(0)}_j \to_p 0\) as \(n \to \infty\). It follows that we have established
(7.4) for t and d0 = d − d1 that are fixed as n → ∞. To examine the interesting case where d0 → ∞ as n → ∞, we need large deviation results, to which we turn next. In the d0 → ∞ case we consider t ≡ tn that tends to ∞ as n → ∞, and we use the weaker assumption: there exist x^(0) and n0 such that

\[ P(s_j > t_n) \le P(s^{(0)}_j > t_n) \tag{7.8} \]

for n ≥ n0 and each tn with lim_{n→∞} tn = ∞. We also assume that the number of columns C and rows R in the contingency table that defines the efficacy χ²_{jn} given by (6.4) stays fixed as n → ∞. In this case, \(P(s^{(0)}_j > t_n) \to 0\) provided each term

\[ [T^{(j)}_{rc}]^2 \equiv \left[\frac{n_{rc} - E_{rc}}{\sqrt{E_{rc}}}\right]^2 \]

in \(\chi^2_{jn} = n s_j^2\), defined for Xj by (6.4), satisfies

\[ P\left(|T^{(j)}_{rc}| > t_n\right) \longrightarrow 0. \]

This is because

\[ P\left(\sqrt{\sum_r \sum_c [T^{(j)}_{rc}]^2} > t_n\right) \le P\left(RC \max_{r,c} [T^{(j)}_{rc}]^2 > t_n^2\right) \le RC \max_{r,c} P\left(|T^{(j)}_{rc}| > t_n/\sqrt{RC}\right), \]

and \(t_n\) and \(t_n/\sqrt{RC}\) are of the same order as n → ∞. By large deviation theory (Hall [8]; Jing et al. [10]), for some constant A,

\[ P(|T^{(j)}_{rc}| > t_n) = \frac{2 t_n^{-1}}{\sqrt{2\pi}}\, e^{-t_n^2/2}\left[1 + A t_n^3 n^{-1/2} + o\left(t_n^3 n^{-1/2}\right)\right], \tag{7.9} \]

provided \(t_n = o(n^{1/6})\). If (7.9) is uniform in j > d1, then (7.6) and (7.9) imply that type I consistency holds if

\[ d_0 \left(t_n\sqrt{2}\right)^{-1} e^{-t_n^2/2} \longrightarrow 0. \tag{7.10} \]

To examine (7.10), write d0 and tn as \(d_0 = \exp(n^b)\) and \(t_n = \sqrt{2}\, n^r\). By large deviation theory, (7.10) holds when r < 1/6. Thus if 0 < ½b < r < 1/6, then

\[ d_0 \left(t_n\sqrt{2}\right)^{-1} e^{-t_n^2/2} \propto n^{-r} \exp\left(n^b - n^{2r}\right) \rightarrow 0. \]
Thus \(d_0 = \exp(n^b)\) with b = 1/3 − ε, for any ε > 0, is the largest d0 can be to have consistency. For instance, if there are exactly \(d_0 = \exp(n^{1/4})\) irrelevant variables and we use the threshold \(t_n = \sqrt{2}\, n^{1/7}\), the probability of selecting any of the irrelevant variables tends to zero at the rate

\[ n^{-1/7} \exp\left(n^{1/4} - n^{2/7}\right). \]
7.2 Type II consistency

We now assume that, for each relevant variable Xj, there is a least favorable point x^(1) such that the local efficacy \(s^{(1)}_j \equiv \hat{\zeta}_j(x^{(1)})\) is stochastically smaller than the importance score \(s_j = M^{-1}\sum_{a=1}^{M}\hat{\zeta}_j(x^*_a)\). That is, we assume that for each relevant variable Xj there exist n1 and x^(1) such that, for n ≥ n1,

\[ P\left(s^{(1)}_j \le t_n\right) \ge P(s_j \le t_n), \quad j \le d_1. \tag{7.11} \]

The existence of such a least favorable x^(1) is established by considering the probability limit of

\[ \arg\min\{\hat{\zeta}_j(x^*_a) : x^*_a \in \{x^*_1, \ldots, x^*_M\}\}. \]

We again assume that R and C are fixed, so that to show type II consistency it is enough to show that, for each (r, c) and j ≤ d1,

\[ P\left(|T^{(j)}_{rc}| \le t_n\right) \rightarrow 0. \]

This is because

\[ P\left(\sqrt{\sum_r \sum_c [T^{(j)}_{rc}]^2} \le t_n\right) \le P\left(\min_{r,c} [T^{(j)}_{rc}]^2 \le t_n^2\right) \le RC \max_{r,c} P\left(|T^{(j)}_{rc}| \le t_n\right). \]

By Theorem 6.2, \(\sqrt{n}\left(s^{(1)}_j - \tau_j\right) \to_d N(0, \nu^2)\), \(\tau_j > 0\). It follows that if tn and d1 are bounded above as n → ∞ and \(\tau_j > 0\), then \(P(s^{(1)}_j \le t_n) \to 0\); thus type II consistency holds in this case. To examine the more interesting case where tn and d1 are not bounded above and \(\tau_j\) may tend to zero as n → ∞, we need to center the \(T^{(j)}_{rc}\). Thus set

\[ \theta^{(j)}_n = \frac{\sqrt{n}\left(p^{(j)}_{rc} - p^{(j)}_r p^{(j)}_c\right)}{\sqrt{p^{(j)}_r p^{(j)}_c}}, \tag{7.12} \]
\[ T^{(j)}_n = T^{(j)}_{rc} - \theta^{(j)}_n, \]

where, for simplicity, the dependence of \(T^{(j)}_n\) and \(\theta^{(j)}_n\) on r and c is not indicated.

Lemma 7.1.

\[ P\left(\min_{j \le d_1} |T^{(j)}_{rc}| \le t_n\right) \le \sum_{j=1}^{d_1} P\left(|T^{(j)}_n| \ge \min_{j \le d_1} |\theta^{(j)}_n| - t_n\right). \]

Proof.

\[
\begin{aligned}
P\left(\min_{j \le d_1} |T^{(j)}_{rc}| \le t_n\right)
&= P\left(\min_{j \le d_1} |T^{(j)}_n + \theta^{(j)}_n| \le t_n\right)\\
&\le P\left(\min_{j \le d_1} |\theta^{(j)}_n| - \max_{j \le d_1} |T^{(j)}_n| \le t_n\right)\\
&= P\left(\max_{j \le d_1} |T^{(j)}_n| \ge \min_{j \le d_1} |\theta^{(j)}_n| - t_n\right)\\
&\le \sum_{j=1}^{d_1} P\left(|T^{(j)}_n| \ge \min_{j \le d_1} |\theta^{(j)}_n| - t_n\right).
\end{aligned}
\]

Set

\[ a_n = \min_{j \le d_1} \left|\theta^{(j)}_n\right| - t_n; \]

then by large deviation theory, if \(a_n = o(n^{1/6})\),

\[ P\left(\left|T^{(j)}_n\right| \ge a_n\right) \propto \left(a_n\sqrt{2}\right)^{-1} \exp\{-a_n^2/2\}, \tag{7.13} \]

where ∝ signifies asymptotic order as n → ∞, \(a_n \to \infty\). It follows that if (7.13) holds uniformly in j, then by Lemma 7.1,

\[ P\left(\min_{j \le d_1} \left|T^{(j)}_{rc}\right| \le t_n\right) \propto d_1 \left(a_n\sqrt{2}\right)^{-1} \exp\{-a_n^2/2\}. \]

Thus type II consistency holds when \(a_n = o(n^{1/6})\) and

\[ d_1 \left(a_n\sqrt{2}\right)^{-1} \exp\{-a_n^2/2\} \rightarrow 0. \tag{7.14} \]

In Section 7.1, tn is of order n^r, 0 < r < 1/6. For this tn, type II consistency holds if \(a_n\) is of order n^r, 0 < r < 1/6, and

\[ d_1 n^{-r} \exp\{-n^{2r}/2\} \rightarrow 0. \]
If \(a_n = O(n^k)\) with k ≥ 1/6, then \(P(|T^{(j)}_n| \ge a_n)\) tends to zero at a rate faster than in (7.13), and consistency is ensured. For instance, if all relevant variables Xj have \(p^{(j)}_{rc}\) fixed as n → ∞, then \(\min_{j \le d_1}|\theta^{(j)}_n|\) is of order \(\sqrt{n}\) and \(a_n\) is of order \(n^{1/2} - t_n\).
7.3 Type I and II consistency

We see from Sections 7.1 and 7.2 that consistency holds under a variety of conditions. One set of conditions is:

Theorem 7.1. Under the regularity conditions of Sections 7.1 and 7.2, if

\[ d_0 = c_0 \exp(n^b), \quad t_n = c_1 n^r, \quad \min_{j \le d_1} \theta^{(j)} - t_n = c_2 n^r, \quad d_1 = c_3 \exp(n^b) \]

for positive constants c0, c1, c2, and c3, and 0 < ½b < r < 1/6, then CATCH is consistent.
Appendix: Proof of Theorem 6.1

(1) Let \(N^*_h = \sum_{i=1}^{n} I(X^{(i)} \in N_h(x^{(0)}))\). Then \(N^*_h \sim \mathrm{Bi}\left(n, \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx\right)\) and \(N^*_h \to_{a.s.} \infty\) as n → ∞. Now

\[
\hat{\zeta}(x^{(0)}, h) = n^{-1/2}\sqrt{\mathcal{X}^2(x^{(0)}, h)} = (N^*_h/n)^{1/2}\, (N^*_h)^{-1/2}\sqrt{\mathcal{X}^2(x^{(0)}, h)}.
\]

Since \(N^*_h \sim \mathrm{Bi}\left(n, \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx\right)\), we have

\[ N^*_h/n \rightarrow \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx. \tag{7.15} \]

By Lemma 6.1 and Table 1,

\[ (N^*_h)^{-1/2}\sqrt{\mathcal{X}^2(x^{(0)}, h)} \rightarrow \left(\sum_{r=+,-}\sum_{c=1}^{C} \frac{(p_{rc} - p_r q_c)^2}{p_r q_c}\right)^{1/2}, \tag{7.16} \]

where

\[ p_{+c} = \Pr(X > x^{(0)}, Y = c \mid X \in N_h(x^{(0)})), \qquad p_{-c} = \Pr(X \le x^{(0)}, Y = c \mid X \in N_h(x^{(0)})) \]

for r = +, − and c = 1, ..., C, and

\[ p_r = \sum_{c=1}^{C} p_{rc}, \qquad q_c = \sum_{r=+,-} p_{rc}. \]

In fact, \(p_+ = \Pr(X > x^{(0)} \mid X \in N_h(x^{(0)}))\), \(p_- = \Pr(X \le x^{(0)} \mid X \in N_h(x^{(0)}))\), and \(q_c = \Pr(Y = c \mid X \in N_h(x^{(0)}))\).

By (7.15) and (7.16),

\[
\hat{\zeta}(x^{(0)}, h) \rightarrow \left(\int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx\right)^{1/2} \left(\sum_{r=+,-}\sum_{c=1}^{C} \frac{(p_{rc} - p_r q_c)^2}{p_r q_c}\right)^{1/2} \equiv \zeta(x^{(0)}, h).
\]

(2) Since X and Y are independent, \(p_{rc} = p_r q_c\) for r = +, − and c = 1, ..., C; then \(\zeta(x^{(0)}, h) = 0\) for any h. Hence

\[ \zeta(x^{(0)}) = \sup_{h > 0} \zeta(x^{(0)}, h) = 0. \]

(3) Recall that \(p_c(x^{(0)}) = \Pr(Y = c \mid x^{(0)})\). Without loss of generality, assume \(p'_c(x^{(0)}) > 0\). Then there exists h such that

\[ \Pr(Y = c \mid x^{(0)} < X < x^{(0)} + h) > \Pr(Y = c \mid x^{(0)} - h < X < x^{(0)}), \]

which is equivalent to \(p_{+c}/p_+ > p_{-c}/p_-\). This shows that \(\zeta(x^{(0)}) > 0\), since by Lemma 6.1, \(\zeta(x^{(0)}) = 0\) would imply \(p_{+c}/p_+ = p_{-c}/p_- = p_c\).
References

[1] Breiman, L., 2001. Random forests. Machine Learning 45, 5–32.

[2] Breiman, L., Friedman, J. H., Olshen, R. A., Stone, C. J., 1984. Classification and Regression Trees. Wadsworth, Belmont.

[3] Doksum, K., Tang, S., Tsui, K.-W., 2008. Nonparametric variable selection: The EARTH algorithm. Journal of the American Statistical Association 103(484), 1609–1620.

[4] Doksum, K. A., Samarov, A., 1995. Nonparametric estimation of global functionals and a measure of explanatory power of covariates in regression. Annals of Statistics 23, 1443–1473.

[5] Doksum, K. A., Schafer, C., 2006. Powerful choices: Tuning parameter selection based on power. In: Frontiers in Statistics: Dedicated to Peter Bickel (J. Fan and H. L. Koul, eds.), Imperial College Press, 113–141.

[6] Friedman, J., 1991. Multivariate adaptive regression splines. Annals of Statistics 19, 1–67.

[7] Gao, J., Gijbels, I., 2008. Bandwidth selection in nonparametric kernel testing. Journal of the American Statistical Association 103(484), 1584–1594.

[8] Hall, P., 1992. The Bootstrap and Edgeworth Expansion. Springer, New York.

[9] Huang, L.-S., Chen, J., 2008. Analysis of variance, coefficient of determination and F-test for local polynomial regression. Annals of Statistics 36, 2085–2109.

[10] Jing, B.-Y., Shao, Q.-M., Wang, Q., 2003. Self-normalized Cramér-type large deviations for independent random variables. Annals of Probability 31, 2167–2215.

[11] Lin, D. Y., Zeng, D., 2011. Correcting for population stratification in genomewide association studies. Journal of the American Statistical Association 106, 997–1008.

[12] Loh, W.-Y., 2009. Improving the precision of classification trees. The Annals of Applied Statistics 3, 1710–1737.

[13] Price, A., Patterson, N., Plenge, R., Weinblatt, M., Shadick, N., Reich, D., 2006. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38(8), 904–909.

[14] Qiao, Z., Zhou, L., Huang, J., 2008. Linear discriminant analysis for high dimensional, low sample size data. Special issue of the World Congress on Engineering 2008.

[15] Quinlan, J. R., 1987. Simplifying decision trees. International Journal of Man-Machine Studies 27(3), 221–234.

[16] Schafer, C., Doksum, K. A., 2009. Selecting local models in multiple regression by maximizing power. Metrika 69, 283–304.

[17] Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267–288.

[18] Wang, L., Shen, X., 2007. On L1-norm multi-class support vector machines: methodology and theory. Journal of the American Statistical Association 102, 583–594.

[19] Zhang, H. H., Liu, Y., Wu, Y., Zhu, J., 2008. Variable selection for the multicategory SVM via adaptive sup-norm regularization. Electronic Journal of Statistics 2, 149–167.
30
(1) Initializing Importance Scores: Let S = (s_1, ..., s_d) be importance scores for the predictors. Let S' = (s'_1, ..., s'_d) be such that s'_l is a threshold importance score corresponding to the model where X_l is independent of Y given X_{-l}. Initialize S to (1, ..., 1) and S' to (0, ..., 0). Let M be the number of tube centers and m the number of observations within each tube.

(2) Tube Hunting Loop: For l = 1, ..., d, do (a), (b), (c) and (d) below.

(a) Selecting tube centers: For 0 < b <= 1, set M = [n^b], where [.] is the greatest integer function, and randomly sample M bootstrap points X*_1, ..., X*_M with replacement from the n observations. Set k = 1 and do (b).

(b) Constructing tubes: For 0 < a < 1, set m = [n^a]. Using the tube distance D_l defined in (3.10), select the tube size so that there are exactly m >= 10 observations inside the tube for X_l at X*_k.

(c) Calculating efficacy: If X_l is a numerical variable, let \hat\zeta_l(X*_k) be the estimated local contingency tube efficacy of Y on X_l at X*_k as defined in (3.5). If X_l is a categorical variable, let \hat\zeta_l(X*_k) be the estimated contingency tube efficacy of Y on X_l at X*_k as defined in (3.7).

(d) Thresholding: Let Y* denote a random variable which is identically distributed as Y but independent of X. Let (Y_0^{(1)}, ..., Y_0^{(n)}) be a random permutation of (Y^{(1)}, ..., Y^{(n)}); then Y_0 = (Y_0^{(1)}, ..., Y_0^{(n)}) can be viewed as a realization of Y*. Let \hat\zeta^*_l(X*_k) be the estimated local contingency tube efficacy of Y_0 on X_l at X*_k, calculated as in (c) with Y_0 in place of Y.

If k < M, set k = k + 1 and go to (b); otherwise go to (e).

(e) Updating importance scores: Set

$$s_l^{new} = \frac{1}{M}\sum_{k=1}^{M} \hat\zeta_l(X^*_k), \qquad s_l'^{\,new} = \frac{1}{M}\sum_{k=1}^{M} \hat\zeta^*_l(X^*_k).$$

Let S^{new} = (s_1^{new}, ..., s_d^{new}) and S'^{new} = (s_1'^{new}, ..., s_d'^{new}) denote the updates of S and S'.

(3) Iterations and Stopping Rule: Repeat (2) I times. Stop when the change in S^{new} is small. Record the last S and S' as S^{stop} and S'^{stop}, respectively.

(4) Deletion step: Using S^{stop} and S'^{stop}, delete X_l if s_l^{stop} - s_l'^{stop} < \sqrt{2}\,SD(S'^{stop}). If d is small, we can generate several sets of S'^{stop} and then use the sample standard deviation of all the values in these sets as SD(S'^{stop}); see Remark 3.3.

End of the algorithm.
Remark 3.2. Step (d) needs to be replaced by (d') below when a global permutation of Y does not result in an appropriate "local null distribution". For example, when the sample sizes of the different classes are extremely unequal, known as the unbalanced classification problem, it is very likely that a local region contains a pure class of Y after the global permutation; the threshold S' is then spuriously low and irrelevant variables are falsely selected into the model. The approach in (d') is more nonparametric but more time consuming.
(d') Local Thresholding: Let Y* denote a random variable which is distributed as Y but independent of X_l given X_{-l}. Let (Y_0^{(1)}, ..., Y_0^{(m)}) be a random permutation of the Y's inside the tube; then (Y_0^{(1)}, ..., Y_0^{(m)}) can be viewed as a realization of Y*. If X_l is a numerical variable, let \hat\zeta'_l(X*_k) be the estimated local contingency tube efficacy of Y* on X_l at X*_k as defined in (3.5). If X_l is a categorical variable, let \hat\zeta'_l(X*_k) be the estimated contingency tube efficacy of Y* on X_l at X*_k as defined in (3.7).
Remark 3.3 (The threshold). Under the assumption H_0^{(l)} that Y is independent of X_l given X_{-l} in T_l, s_l and s'_l are nearly independent and SD(s_l - s'_l) \approx \sqrt{2}\,SD(s'_l). Thus the rule that deletes X_l when s_l^{stop} - s_l'^{stop} < \sqrt{2}\,SD(S'^{stop}) is similar to the one-standard-deviation rule of CART and GUIDE. The choice of threshold is also discussed in Section 7. Alternatively, we calculate a p-value for each covariate by a permutation test: more specifically, we permute the response variable, obtain the importance scores for the permuted data, and then calculate how extreme s_l^{stop} is compared to the permuted scores. We then identify a covariate as important if its p-value is less than 0.1.
Remark 3.4 (Large d, small n). A key idea in this paper is that when determining the importance of the variable X_j we restrict (condition) the vector X_{-j} to a relatively small set. This is done to address the problem of spurious correlation, which is of great concern in genomics (Price et al. [13], Lin and Zeng [11]). In such genomic studies the number of predictors d typically greatly exceeds the sample size n. For numerical variables this can be dealt with by replacing X_{-j} with a vector of principal components explaining most of the variation in X_{-j}. Although our paper does not deal with the d >> n case directly, this remark shows that the approach is also applicable to the genome-wide association studies described in Price et al. [13].
Remark 3.5. We now give the computational complexity of the algorithm. Recall the notation: I is the number of repetitions, M is the number of tube centers, m is the tube size, d is the dimension, and n is the sample size. The number of operations needed for the algorithm is I x d x M x (2b + 2c + 2d) + 2e, where 2b, 2c, 2d, 2e denote the numbers of operations required for those steps. The computational complexities of steps 2b, 2c, 2d and 2e are O(nd), O(m), O(m) and O(M), respectively. Since nd >> m, the required computation is dominated by step 2b, and the computational complexity of the algorithm is O(IMnd^2). Solving a least squares problem requires O(nd^2) operations, so our algorithm requires higher computational complexity by a factor of IM compared to least squares. However, the computations for the M tube centers are independent of each other and can be performed in parallel.
4 Simulation Studies
4.1 Variable Selection with Categorical and Numerical Predictors
Example 4.1. Consider d >= 5 predictors X_1, X_2, ..., X_d and a categorical response variable Y. The fifth predictor X_5 is a Bernoulli random variable which takes the two values 0 and 1 with equal probability. All other predictors are uniformly distributed on [0, 1].

When X_5 = 0, the dependent variable Y is related to the predictors in the following fashion:
$$
Y = \begin{cases}
0 & \text{if } X_1 - X_2 - 0.25\sin(16\pi X_2) \le -0.5,\\
1 & \text{if } -0.5 < X_1 - X_2 - 0.25\sin(16\pi X_2) \le 0,\\
2 & \text{if } 0 < X_1 - X_2 - 0.25\sin(16\pi X_2) \le 0.5,\\
3 & \text{if } X_1 - X_2 - 0.25\sin(16\pi X_2) > 0.5.
\end{cases} \tag{4.1}
$$
When X_5 = 1, Y depends on X_3 and X_4 in the same way as above, with X_1 replaced by X_3 and X_2 replaced by X_4. The other variables X_6, ..., X_d are irrelevant noise variables. All d predictors are independent.
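Drawing data from model (4.1) can be sketched as follows; the sine frequency is taken as printed in (4.1), and the class labelling follows the four bands above.

```python
import math
import random

def generate_example_41(n, d, rng):
    """Draw n observations from model (4.1) with d >= 5 predictors.

    X5 (index 4) is Bernoulli(1/2); all other predictors are Uniform(0,1).
    When X5 = 0, Y is driven by (X1, X2); when X5 = 1, by (X3, X4).
    """
    def label(u, v):
        t = u - v - 0.25 * math.sin(16 * math.pi * v)
        if t <= -0.5:
            return 0
        if t <= 0:
            return 1
        if t <= 0.5:
            return 2
        return 3

    data = []
    for _ in range(n):
        x = [rng.random() for _ in range(d)]
        x[4] = float(rng.random() < 0.5)          # the categorical X5
        y = label(x[0], x[1]) if x[4] == 0 else label(x[2], x[3])
        data.append((x, y))
    return data
```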
In addition to the presence of the noise variables X_6, ..., X_d, this example is challenging in two respects. First, there is an interaction effect between the continuous predictors X_1, ..., X_4 and the categorical predictor X_5. Such an interaction may arise in practice: for example, suppose Y represents the effect resulting from multiple treatments X_1, ..., X_4, and X_5 represents the gender of a patient; our model then allows males and females to respond differently to treatments. Second, the classification boundaries for fixed X_5 are highly nonlinear, as shown in Figure 1. This design demonstrates that our method can detect nonlinear dependence between the response and the predictors.
Table 2 shows the number of simulations, out of 100 trials from model (4.1), in which each of the predictors X_1, ..., X_5 is correctly kept in the model, and in column 6 the average frequency with which the d - 5 irrelevant predictors are falsely selected into the model. In the simulation, n = 400 and d = 10, 50 and 100 are used. We compare CATCH with other methods, including Marginal CATCH, the Random Forest classifier (RFC) (Breiman [1]) and GUIDE (Loh [12]). For Marginal CATCH we test the association between Y and X_l without conditioning on X_{-l}. RFC is well known as a powerful tool for prediction; here we use the importance scores from RFC for variable selection. We create 30 additional noise variables and use their average importance score multiplied by a factor of 2 as a threshold (Doksum et al. [3]) to decide whether a variable should be selected into the model. In particular, we generate the noise variables using uniform and Bernoulli distributions for numerical and categorical predictors, respectively.
The main goal of GUIDE is to solve the variable selection bias problem in regression and classification tree methods. As an illustration, if a categorical predictor takes K distinct values, then the splitting test exhausts 2^{K-1} possibilities, and the predictor is therefore
Table 2: Simulation results from Example 4.1 with sample size n = 400. For 1 <= i <= 5, the ith column gives the estimated probability of selection of the relevant variable X_i and the standard error, based on 100 simulations. Column 6 gives the mean of the estimated probability of selection and the standard error for the irrelevant variables X_6, ..., X_d.

Number of Selections (100 simulations)

METHOD           d    X1          X2          X3          X4          X5          Xi, i >= 6
CATCH            10   0.82(0.04)  0.75(0.04)  0.78(0.04)  0.82(0.04)  1(0)        0.05(0.02)
                 50   0.76(0.04)  0.76(0.04)  0.87(0.03)  0.84(0.04)  0.96(0.02)  0.08(0.03)
                 100  0.77(0.04)  0.73(0.04)  0.82(0.04)  0.80(0.04)  0.83(0.04)  0.09(0.03)
Marginal CATCH   10   0.77(0.04)  0.67(0.05)  0.70(0.05)  0.80(0.04)  0.06(0.02)  0.10(0.03)
                 50   0.69(0.05)  0.70(0.05)  0.76(0.04)  0.79(0.04)  0.07(0.03)  0.10(0.03)
                 100  0.78(0.04)  0.76(0.04)  0.73(0.04)  0.73(0.04)  0.12(0.03)  0.10(0.03)
Random Forest    10   0.71(0.05)  0.51(0.05)  0.47(0.05)  0.59(0.05)  0.37(0.05)  0.06(0.02)
                 50   0.64(0.05)  0.61(0.05)  0.65(0.05)  0.58(0.05)  0.28(0.04)  0.08(0.03)
                 100  0.66(0.05)  0.58(0.05)  0.59(0.05)  0.65(0.05)  0.17(0.04)  0.11(0.03)
GUIDE            10   0.96(0.02)  0.98(0.01)  0.93(0.03)  0.99(0.01)  0.92(0.03)  0.33(0.05)
                 50   0.82(0.04)  0.85(0.04)  0.90(0.03)  0.89(0.03)  0.67(0.05)  0.12(0.03)
                 100  0.83(0.04)  0.84(0.04)  0.86(0.03)  0.83(0.04)  0.61(0.05)  0.08(0.03)
Figure 1: The relationship between Y and (X_1, X_2) for model (4.1). The four areas represent the four values of Y.
biased toward being selected. GUIDE alleviates this bias by employing lack-of-fit tests, i.e., chi-square tests applicable to both numerical and categorical predictors, with the p-values adjusted by a bootstrap procedure. Since GUIDE has negligible selection bias, it can include tests for local pairwise interactions. Furthermore, it achieves sensitivity to local curvature by fitting a linear model at each node.
The results in Table 2 show that Marginal CATCH does a very poor job of selecting X_5. This illustrates the importance of local conditioning. RFC does better than Marginal CATCH but also has difficulties, while CATCH and GUIDE successfully identify X_5, along with X_1 to X_4, with high frequency. When d = 10, GUIDE does very well for the relevant variables but selects too many irrelevant variables. Surprisingly, GUIDE selects fewer irrelevant variables as d increases: it becomes more cautious as the dimension d grows. CATCH performs very well in the high-dimensional cases (d = 50 and d = 100). This indicates that the CATCH algorithm is robust to the intricate interaction structures between the predictors.
In model (4.1) above the predictors interact in a hierarchical way. When X_5 = 0, the response is related to X_1 and X_2 as shown in Figure 1, where the two axes are the two relevant predictors and the four areas partitioned by the curves correspond to the different values of Y. When X_5 = 1, the response depends on X_3 and X_4 in the same way. This type of hierarchical interaction between the categorical and numerical predictors is difficult to detect even by state-of-the-art classification methods such as the support vector machine (SVM) and RFC. In particular, RFC produces an importance score vector that identifies the four numerical variables X_1, ..., X_4 as very significant but often leaves the categorical variable X_5 unidentified, as we see in Table 2.
The CATCH algorithm, however, performs very well for this complicated model. We now explain why CATCH is able to identify X_5 as an important variable. To explore the association between X_5 and Y, we construct M contingency tables of 2 rows and 4 columns based on M randomly centered tubes, calculate contingency efficacy scores, and then average the scores. Figure 2 (a) and (b) illustrate how one of the contingency efficacy scores is calculated. 1000 data points are generated from model (4.1) and split into two sets according to x_5: (a) shows the roughly half of the points with x_5 = 0 in the (x_1, x_2) dimensions, and (b) shows the remaining points with x_5 = 1 in the (x_3, x_4) dimensions. The data points are colored according to Y; this representation enables us to clearly see the classification boundaries (grey lines). A randomly selected tube center with x_5 = 0 is shown as a star in (a). The data points within the tube, i.e., those close to the tube center in terms of X_{-5}, are highlighted by larger dots. Since the distance defining the tube does not involve X_5, the larger dots have x_5 equal to 0 or 1 and are present in both (a) and (b). The counts of the larger dots in different colors in (a) and (b) correspond to the cell counts of the two rows x_5 = 0 and x_5 = 1 in the contingency table. As we can see, the larger dots in (a) are mostly red and the larger dots in (b) are mostly blue and orange, which shows that the distribution of Y depends on the value of X_5, so the contingency efficacy score is high. We would expect a low contingency efficacy score only if the tube center had similar values in (x_1, x_2) and (x_3, x_4); such tube centers are few among the M randomly chosen ones. Thus the average contingency efficacy score is high and X_5 is identified as an important variable.
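The 2-row by 4-column table described above, built from the X_5 values and class labels of the observations inside one tube, yields a root chi-square score along the lines of the following sketch, a simplified stand-in for the contingency tube efficacy of (3.7).

```python
import math
from collections import Counter

def tube_table_score(x5_vals, y_vals):
    """Root chi-square score of a 2 x C table built from one tube.

    x5_vals, y_vals: the X5 values (0/1) and class labels of the m
    observations inside a tube selected by distance in X_{-5}.  Rows are
    the two values of X5, columns are the classes of Y, as in Figure 2.
    """
    m = len(y_vals)
    col_tot = Counter(y_vals)
    row_tot = Counter(x5_vals)
    cell = Counter(zip(x5_vals, y_vals))
    chi2 = 0.0
    for r in (0, 1):
        for c in col_tot:
            expected = row_tot[r] * col_tot[c] / m
            if expected > 0:
                chi2 += (cell[(r, c)] - expected) ** 2 / expected
    return math.sqrt(chi2 / m)
```

The score is large when the class distribution differs sharply between the two rows, as in Figure 2, and zero when the rows have identical class proportions.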
To contrast this example with a simpler data-generating model, suppose the effects of the predictors are additive, that is, the categorical variable enters as a term added to the effect of the numerical predictors. In this type of simpler model, the RFC and GUIDE algorithms can efficiently identify relevant predictors, but so can the CATCH algorithm. We do not include the simulation results for additive models here.
Remark 4.1. We observe that the CATCH algorithm is not very sensitive to the choice of the number of tube centers M or the tube size m = [n^a], where 0 < a <= 1 and [.] is the greatest integer function. For Example 4.1 we perform the simulation using M = n and M = n/2, and a = 0.1, 0.15, 0.2, 0.3, 0.5. The results are summarized in Table 3. The CATCH algorithm performs similarly for M = n and M = n/2, with M = n having a small winning edge. We generally suggest using M = n if computational resources are not a concern or n is not large (<= 200). When computational cost is a concern (Remark 3.5), one can reduce the number of tube centers to M = n/2 as a compromise between detection power and computational cost. Table 3 also shows that the tube size works well in the range between 0.1 and 0.3, while the detection power for X_5 decreases in the case a = 0.5, where the tube is too big to achieve local conditioning. This observation is consistent with the comparison to Marginal CATCH in Table 2. On the other hand, the tube size should not be too small, in order to retain sufficient detection power. We suggest using a = 0.1 or 0.15 for sample sizes n >= 400, or m = 50 for smaller sample sizes. In light of these guidelines, we use M = n/2 = 200 and a = 0.15 for Examples 4.1-4.3, and M = n = 200 and m = 50 for Example 4.4.
Table 3: Sensitivity study of CATCH for Example 4.1 (n = 400, d = 50) using different numbers of tube centers M and tube sizes m = [n^a]. For 1 <= i <= 5, the ith column gives the number of times X_i was selected. Column 6 gives the percentage of times irrelevant variables were selected.

Number of Selections (100 simulations)

M        a     X1  X2  X3  X4  X5  Xi, i >= 6 (%)
M = 200  0.10  75  67  77  69  92   7.8
         0.15  76  76  87  84  96   8.2
         0.20  85  84  90  87  93   9.9
         0.30  89  87  95  89  85   9.6
         0.50  89  90  96  93  62   9.9
M = 400  0.10  74  76  82  77  92   7.2
         0.15  85  82  89  87  94   8.6
         0.20  87  88  90  86  96   9.1
         0.30  88  89  91  91  92   9.4
         0.50  89  91  94  93  61  10.2
In Example 4.1 all predictors are independent. In a similar setting, we next consider the case where the predictors are dependent.

Example 4.2. We generate standard normal random variables Z_i, i = 1, ..., d, in such a way that cor(Z_1, Z_4) = 0.5, cor(Z_3, Z_6) = 0.9, and the other Z_i's are independent. We set X_i = \Phi(Z_i) for i = 1, ..., d, where \Phi(.) is the CDF of the standard normal; the X_i are therefore marginally uniformly distributed, as in Example 4.1. The correlation structure of the X_i's is similar to that of the Z_i's. The dependent variable Y is generated in the same way as in Example 4.1.
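One draw of the Example 4.2 predictors can be sketched via the construction Z_4 = 0.5 Z_1 + sqrt(1 - 0.5^2) W and Z_6 = 0.9 Z_3 + sqrt(1 - 0.9^2) W', which yields the stated correlations while keeping each Z_i standard normal; this particular construction is an assumption, since the text does not specify how the correlated normals are generated.

```python
import math
import random

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def generate_example_42_predictors(d, rng):
    """One draw of (X1, ..., Xd) for Example 4.2 (d >= 6).

    Z4 and Z6 are built from Z1 and Z3 so that cor(Z1, Z4) = 0.5 and
    cor(Z3, Z6) = 0.9; every Xi = phi(Zi) is then marginally Uniform(0, 1).
    """
    z = [rng.gauss(0.0, 1.0) for _ in range(d)]
    z[3] = 0.5 * z[0] + math.sqrt(1 - 0.5 ** 2) * rng.gauss(0.0, 1.0)
    z[5] = 0.9 * z[2] + math.sqrt(1 - 0.9 ** 2) * rng.gauss(0.0, 1.0)
    return [phi(zi) for zi in z]
```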
The results of Example 4.2 are given in Table 4. Comparing with the results in Table 2, both CATCH and GUIDE do well in the presence of dependence between the numerical variables. The explanation is two-fold. First, both methods identify the categorical X_5 as an important predictor most of the time, which makes the identification of the other relevant predictors possible. Second, conditioning on X_5, Y depends on a pair of variables ("X_1 and X_2" or "X_3 and X_4") jointly, and in particular this dependence is nonlinear, which makes both methods less affected by the correlation introduced here.
Figure 2: Sample scatter plots of (x_1, x_2) and (x_3, x_4) for simulated data from model (4.1). The left panel, (a) X_5 = 0, includes all the pairs (x_1, x_2) with x_5 = 0, while the right panel, (b) X_5 = 1, includes all the pairs (x_3, x_4) with x_5 = 1; both axes run from 0 to 1. The larger dots in the two plots are within the tube around a tube center, shown as the star.
We now explain the latter in detail for both methods. In CATCH, the algorithm adaptively weighs the effects of the predictors by their importance scores when they are conditioned on in defining the tubes; therefore, identifying either of the two relevant variables helps to identify the other, whereas an irrelevant variable, though strongly associated with one of the relevant variables, is down-weighted iteratively and does not strongly affect the selection of the relevant variable. This can explain the slightly decreasing selection probability of X_3 in the presence of the strongly correlated X_6, and consequently the overall lower selection of X_3 and X_4 compared to X_1 and X_2, given the correlation between X_1 and X_4. In GUIDE, the difference between the dependent and independent cases is even smaller, because GUIDE is very powerful in detecting local interactions by explicitly including pairwise interactions when selecting splitting variables.
Marginal CATCH and Random Forest do poorly in this case, mainly because X_5 cannot be detected, as we explained earlier. More specifically, the dependence between Y and X_4 in the X_5 = 1 case is the same as that between Y and X_2 in the X_5 = 0 case. The linear terms in X_1 and X_2 have opposite signs in the decision function; therefore the marginal effects of X_1 and X_4 are obscured when X_5 is not identified as an important variable.
4.2 Improving the Performance of SVM and RF Classifiers
The criterion used in the CATCH algorithm is not based on classification accuracy. But if we use CATCH to screen out the irrelevant variables and only use the important
Table 4: Simulation results from Example 4.2 with sample size n = 400. For 1 <= i <= 6, the ith column gives the estimated probability of selection of variable X_i and the standard error, based on 100 simulations, where X_i, i = 1, ..., 5, are relevant variables with cor(X_1, X_4) = 0.5. X_6 is an irrelevant variable that is correlated with X_3, with correlation 0.9. Column 7 gives the mean of the estimated probability of selection and that of the standard error for the irrelevant variables X_7, ..., X_d.

Number of Selections (100 simulations)

METHOD           d    X1          X2          X3          X4          X5          X6          Xi, i >= 7
CATCH            10   0.76(0.04)  0.77(0.04)  0.73(0.04)  0.68(0.05)  0.99(0.01)  0.34(0.05)  0.07(0.02)
                 50   0.78(0.04)  0.77(0.04)  0.74(0.04)  0.63(0.05)  0.92(0.03)  0.57(0.05)  0.09(0.03)
                 100  0.76(0.04)  0.77(0.04)  0.80(0.04)  0.74(0.04)  0.82(0.04)  0.55(0.05)  0.09(0.03)
Marginal CATCH   10   0.35(0.05)  0.63(0.05)  0.80(0.04)  0.32(0.05)  0.09(0.03)  0.71(0.05)  0.12(0.03)
                 50   0.34(0.05)  0.71(0.05)  0.70(0.05)  0.30(0.05)  0.10(0.03)  0.60(0.05)  0.11(0.03)
                 100  0.34(0.05)  0.68(0.05)  0.75(0.04)  0.30(0.05)  0.10(0.03)  0.59(0.05)  0.10(0.03)
Random Forest    10   0.29(0.05)  0.48(0.05)  0.55(0.05)  0.20(0.04)  0.30(0.05)  0.47(0.05)  0.05(0.02)
                 50   0.26(0.04)  0.52(0.05)  0.57(0.05)  0.30(0.05)  0.24(0.04)  0.45(0.05)  0.08(0.03)
                 100  0.36(0.05)  0.61(0.05)  0.66(0.05)  0.37(0.05)  0.32(0.05)  0.46(0.05)  0.12(0.03)
GUIDE            10   0.96(0.02)  0.96(0.02)  0.94(0.02)  0.95(0.02)  0.96(0.02)  0.84(0.04)  0.30(0.05)
                 50   0.80(0.04)  0.92(0.03)  0.88(0.03)  0.79(0.04)  0.72(0.04)  0.68(0.05)  0.12(0.03)
                 100  0.81(0.04)  0.89(0.03)  0.90(0.03)  0.76(0.04)  0.75(0.04)  0.58(0.05)  0.08(0.03)
variables for classification, the performance of other algorithms, such as the Support Vector Machine (SVM) and the Random Forest Classifier (RFC), can be significantly improved. In this section we compare the prediction performance of (1) SVM, (2) RFC, (3) CATCH followed by SVM, and (4) CATCH followed by RFC; we label (3) and (4) CATCH+SVM and CATCH+RFC, respectively. We use the "svm" function in the "e1071" package in R with the radial kernel to perform the SVM analysis, and the "randomForest" function in the "randomForest" package to perform the RFC analysis. We use simulation experiments to illustrate the improvement obtained by applying CATCH prior to classifying Y.
Let the true model be

$$(X, Y) \sim P, \tag{4.2}$$

where P will be specified later. In each of 100 Monte Carlo experiments, n pairs (X^{(i)}, Y^{(i)}), i = 1, ..., n, are generated from the true model (4.2). Based on the simulated data (X^{(i)}, Y^{(i)}), i = 1, ..., n, the methods (1) SVM, (2) RFC, (3) CATCH+SVM and (4) CATCH+RFC are used to classify a future Y^{(0)} for which we have available a corresponding X^{(0)}.

The integrated misclassification rate (IMR; Friedman [6]), defined below, is used to evaluate the performance of the classification algorithms. Let F_{X,Y}(.) be the classifier constructed by a classification algorithm based on X = (X^{(1)}, ..., X^{(n)}) and Y = (Y^{(1)}, ..., Y^{(n)})^T.

Let (X^{(0)}, Y^{(0)}) be a future observation with the same distribution as (X^{(i)}, Y^{(i)}). We only know X^{(0)} = x^{(0)} and want to classify Y^{(0)}. For a possible future observation (x^{(0)}, y^{(0)}), define

$$MR(X, Y, x^{(0)}, y^{(0)}) = P\big(F_{X,Y}(x^{(0)}) \neq y^{(0)} \mid X, Y\big). \tag{4.3}$$

Our criterion is

$$IMR = E\,E_0\big[MR(X, Y, x^{(0)}, y^{(0)})\big], \tag{4.4}$$

where E_0 is with respect to the distribution of (x^{(0)}, y^{(0)}) and E is with respect to the distribution of (X, Y).
This double expectation is evaluated using Monte Carlo simulation, and the result is denoted IMR*. More precisely, E_0 is estimated using 5000 simulated (x^{(0)}, y^{(0)}) observations; IMR* is then computed on the basis of 100 simulated samples (X_i, Y_i), i = 1, ..., 100.
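The nested Monte Carlo evaluation of (4.4) can be sketched generically; `generate` and `fit` are placeholders for the data-generating model P and the classification algorithm, and the sample sizes here are illustrative.

```python
import random

def estimate_imr(generate, fit, n_train, n_test, n_reps, rng):
    """Nested Monte Carlo estimate IMR* of (4.4).

    generate(n, rng) draws n (x, y) pairs from the true model P and
    fit(train) returns a classifier x -> predicted y.  The inner average
    over a fresh test set estimates E0[MR]; the outer average over
    independent training sets estimates the outer expectation E.
    """
    total = 0.0
    for _ in range(n_reps):
        train = generate(n_train, rng)
        classify = fit(train)
        test = generate(n_test, rng)
        errors = sum(1 for x, y in test if classify(x) != y)
        total += errors / n_test
    return total / n_reps
```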
Example 4.3. First consider model (4.1) in Section 4.1. Figure 3 shows IMR* versus the number of irrelevant variables for this example. The left panel shows that the IMR*'s of the SVM approach increase dramatically as the number of irrelevant variables increases, while the IMR*'s of CATCH+SVM are stable. Thus CATCH stabilizes the performance of SVM for this complex model. The right panel shows that CATCH also stabilizes the performance of RFC. These results indicate that CATCH can improve the prediction accuracy of other classification algorithms.
Figure 3: IMR*'s of SVM, RFC, CATCH+SVM and CATCH+RFC versus the number of predictors d for the simulation study of model (4.1) in Example 4.1; left panel, SVM versus CATCH+SVM; right panel, RFC versus CATCH+RFC. The sample size is n = 400.
Example 4.4 (Contaminated Logistic Model). Consider a binary response Y \in \{0, 1\} with

$$P(Y = 0 \mid X = x) = \begin{cases} F(x) & \text{if } r = 0,\\ G(x) & \text{if } r = 1, \end{cases} \tag{4.5}$$

where x = (x_1, ..., x_d),

$$r \sim \text{Bernoulli}(\gamma), \tag{4.6}$$

$$F(x) = \big(1 + \exp\{-(0.25 + 0.5x_1 + x_2)\}\big)^{-1}, \tag{4.7}$$

$$G(x) = \begin{cases} 0.1 & \text{if } 0.25 + 0.5x_1 + x_2 < 0,\\ 0.9 & \text{if } 0.25 + 0.5x_1 + x_2 \ge 0, \end{cases} \tag{4.8}$$

and x_1, x_2, ..., x_d are realizations of independent standard normal random variables. Hence, given X^{(i)} = x^{(i)}, 1 <= i <= n, the joint distribution of Y^{(1)}, Y^{(2)}, ..., Y^{(n)} is the contaminated logistic model with contamination parameter \gamma:

$$(1-\gamma)\prod_{i=1}^{n}[F(x^{(i)})]^{Y^{(i)}}[1-F(x^{(i)})]^{1-Y^{(i)}} + \gamma\prod_{i=1}^{n}[G(x^{(i)})]^{Y^{(i)}}[1-G(x^{(i)})]^{1-Y^{(i)}}. \tag{4.9}$$
We perform a simulation study for n = 200, d = 50 and \gamma = 0, 0.2, 0.4, 0.6, 0.8 and 1, where \gamma = 0 corresponds to no contamination. Figure 4 shows the IMR*'s versus the different values of \gamma for this example. CATCH improves the performance of the SVM and Random Forest classifiers for this contaminated logistic regression model.
Figure 4: IMR*'s of SVM, RFC, CATCH+SVM and CATCH+RFC versus the contamination parameter \gamma \in \{0, 0.2, 0.4, 0.6, 0.8, 1\} for simulation model (4.9) in Example 4.4; left panel, SVM versus CATCH+SVM; right panel, RFC versus CATCH+RFC.
5 CATCH Analysis of The ANN Data
In this section CATCH is applied to a data set available at the UCI database (http://ftp.ics.uci.edu/pub/machine-learning-databases/thyroid-disease). The data are from a clinical trial to determine whether a patient is hypothyroid (see, e.g., Quinlan [15]). The dataset is split into two parts: a training set with 3772 observations and a testing set with 3428 observations. The response is a categorical variable with three classes: normal (not hypothyroid), hyperfunction and subnormal functioning. There are 21 predictors, including 15 categorical predictors and 6 numerical predictors. The three classes are very imbalanced, with 92.5% normal patients. A good classifier should produce a significant decrease from the 7.5% error rate that can be obtained by classifying all observations to the normal class.
CATCH identifies 7 variables as important, including 4 continuous variables and 3 categorical variables. We consider three classification methods: 1. Support Vector Machine with radial kernel, denoted by SVM(r); 2. Support Vector Machine with polynomial kernel, denoted by SVM(p); 3. RFC. We apply these three methods to the training set to build classification models and use the test set to estimate the misclassification rate of each method. We apply the methods with CATCH and without CATCH, respectively, where "with CATCH" means that only the 7 important variables identified by CATCH are used to build the model and "without CATCH" means that all 21 variables are used in the analysis.
Table 5: Misclassification rates of three classification methods (SVM(r), SVM(p), RFC) with CATCH and without CATCH.

              SVM(r)  SVM(p)  RFC
w/o CATCH     0.052   0.063   0.024
with CATCH    0.037   0.055   0.015
Table 5 shows that if CATCH is used to select important variables before the classification methods are applied, the misclassification rates decrease. CATCH reduces the misclassification rates by 29%, 13% and 37% for SVM(r), SVM(p) and RFC, respectively. CATCH+RFC has the best performance. Although CATCH is not designed to improve classification accuracy, it nonetheless helps other classification methods perform better.
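The quoted reductions follow, up to rounding, from the rates in Table 5:

```python
def reduction(without, with_catch):
    """Relative reduction in misclassification rate, in percent."""
    return 100.0 * (without - with_catch) / without

# Table 5 rates for SVM(r), SVM(p), RFC without and with CATCH
reductions = [reduction(0.052, 0.037),   # SVM(r): about 29%
              reduction(0.063, 0.055),   # SVM(p): about 13%
              reduction(0.024, 0.015)]   # RFC:    about 37%
```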
6 Properties of Local Contingency Efficacy
In this section we investigate the properties of the local contingency efficacy. For a univariate continuous predictor, Theorem 6.1 below shows that the local contingency efficacy is well defined and equals 0 if the response variable is independent of the predictor. Consider an R x C contingency table. Let n_{rc}, r = 1, ..., R, c = 1, ..., C, be the observed frequency in the (r, c)-cell and let p_{rc} be the probability that an observation belongs to the (r, c)-cell. Let

$$p_r = \sum_{c=1}^{C} p_{rc}, \qquad q_c = \sum_{r=1}^{R} p_{rc}, \tag{6.1}$$
and

$$n_{r\cdot} = \sum_{c=1}^{C} n_{rc}, \qquad n_{\cdot c} = \sum_{r=1}^{R} n_{rc}, \qquad n = \sum_{r=1}^{R}\sum_{c=1}^{C} n_{rc}, \qquad E_{rc} = n_{r\cdot}\, n_{\cdot c}/n. \tag{6.2}$$

For testing that the rows and the columns are independent, i.e.

$$H_0: \forall r, c,\ p_{rc} = p_r q_c \quad \text{versus} \quad H_1: \exists r, c,\ p_{rc} \neq p_r q_c, \tag{6.3}$$

a standard test statistic is

$$\mathcal{X}^2 = \sum_{r=1}^{R}\sum_{c=1}^{C} \frac{(n_{rc} - E_{rc})^2}{E_{rc}}. \tag{6.4}$$
Under H_0, \mathcal{X}^2 converges in distribution to \chi^2_{(R-1)(C-1)}. Moreover, defining 0/0 = 0, we have the well-known result:

Lemma 6.1. As n tends to infinity, \mathcal{X}^2/n tends in probability to

$$\tau^2 \equiv \sum_{r=1}^{R}\sum_{c=1}^{C} \frac{(p_{rc} - p_r q_c)^2}{p_r q_c}.$$

Moreover, if p_r q_c is bounded away from 0 and 1 as n \to \infty, and if R and C are fixed as n \to \infty, then \tau^2 = 0 if and only if H_0 is true.
Remark 6.1. Lemma 6.1 shows that the contingency efficacy \zeta in (2.8) is well defined. Moreover, \zeta = 0 if and only if X and Y are independent.
Theorem 6.1. Consider a categorical response Y and a continuous predictor X with density function f(x) such that, for a fixed h > 0, f(x) > 0 for x \in (x^{(0)} - h, x^{(0)} + h). Then:

(1) For fixed h, the estimated local section efficacy \hat\zeta(x^{(0)}, h) defined in (2.5) converges almost surely to the section efficacy \zeta(x^{(0)}, h) defined in (2.4).

(2) If X and Y are independent, then the local efficacy \zeta(x^{(0)}) defined in (2.4) is zero.

(3) If there exists c \in \{1, ..., C\} such that p_c(x) = P(Y = c \mid x) has a continuous derivative at x = x^{(0)} with p'_c(x^{(0)}) \neq 0, then \zeta(x^{(0)}) > 0.
The \chi^2 statistic (6.4) can be written in terms of the multinomial proportions \hat p_{rc}, with \sqrt{n}(\hat p_{rc} - p_{rc}), 1 \le r \le R, 1 \le c \le C, converging in distribution to a multivariate normal. By a Taylor expansion of T_n \equiv \sqrt{\chi^2/n} at \tau = \operatorname{plim}\sqrt{\chi^2/n}, we find that
Theorem 6.2.

$$\sqrt{n}\,(T_n - \tau) \xrightarrow{d} N(0, v^2), \tag{6.5}$$

where the asymptotic variance v^2, which can be obtained from the Taylor expansion, is not needed in this paper.
Remark 6.2. The result (6.5) applies to the multivariate \chi^2-based efficacies \hat\zeta, with n replaced by the number of observations that go into the computation of \hat\zeta. In this case we use the notation \chi^2_j for the chi-square statistic for the jth variable.
7 Consistency Of CATCH
Let Y denote a variable to be classified into one of the categories 1, ..., C by using the X_j's in the vector X = (X_1, ..., X_d). Let X_{-j} \equiv X - \{X_j\} denote X without the X_j component. Assume that the squared Pearson nonparametric correlation (Doksum and Samarov [4], Huang and Chen [9]) between X_j and X_{-j} is less than 0.5. We call the variable X_j relevant for Y if, conditionally given X_{-j}, L(Y \mid X_j = x) is not constant in x, and irrelevant otherwise. Our data (X^{(1)}, Y^{(1)}), ..., (X^{(n)}, Y^{(n)}) are i.i.d. as (X, Y). A variable selection procedure is consistent if the probability that all the relevant variables are selected and none of the irrelevant variables are selected tends to one as n tends to infinity.
We give a general nonparametric framework in which the variable selection rule based on the CATCH importance scores $s_j$ given in (3.8) is consistent.
When the predictors are not all categorical and CATCH involves tubes, we assume that the number of observations in each tube tends to $\infty$ as $n \to \infty$, and that for those importance scores that require the selection of a section size $h$, the selection is from a finite set $\{h_j\}$ with $\min_j h_j > h_0$ for some $h_0 > 0$. It follows that the number of observations $m_{jk}$ in the $k$th tube used to calculate the importance score $s_j$ for the variable $X_j$ tends to infinity as $n \to \infty$.
The key quantities that determine consistency are the limits $\lambda_j$ of $s_j$. That is,
$$\lambda_j = \tau(\zeta_j, P) = E_P[\zeta_j(X)] = \int \zeta_j(x)\, dP(x), \qquad j = 1, \ldots, d,$$
where $\zeta_j(x)$ is the local contingency efficacy defined by (2.4) for numerical $X$'s and by (2.8) for categorical $X$'s, and $P$ is the probability distribution of $X$. Our estimate $s_j$ of $\lambda_j$ is
$$s_j = \tau(\hat\zeta_j, \hat P) = \frac{1}{M}\sum_{k=1}^{M} \hat\zeta_j(X^*_k),$$
where $\hat\zeta_j$ is defined in Section 3 and $\hat P$ is the empirical distribution of $X$.

Let $d_0$ denote the number of $X$'s that are irrelevant for $Y$. Set $d_1 = d - d_0$ and, without loss of generality, reorder the $\lambda_j$ so that
$$\lambda_j > 0 \ \text{ if } j = 1, \ldots, d_1; \qquad \lambda_j = 0 \ \text{ if } j = d_1 + 1, \ldots, d. \qquad (7.1)$$
Our variable selection rule is
$$\text{``keep variable } X_j \text{ if } s_j \ge t\text{''} \quad \text{for some threshold } t. \qquad (7.2)$$
Rule (7.2) is consistent if and only if
$$P\left(\min_{j \le d_1} s_j \ge t \ \text{ and } \ \max_{j > d_1} s_j < t\right) \longrightarrow 1. \qquad (7.3)$$
To establish (7.3), it is enough to show that
$$P\left(\max_{j > d_1} s_j > t\right) \longrightarrow 0 \qquad (7.4)$$
and
$$P\left(\min_{j \le d_1} s_j \le t\right) \longrightarrow 0. \qquad (7.5)$$
We call (7.4) and (7.5) type I and type II consistency, respectively. Type I and II error probabilities refer to the probabilities of any false positives and any false negatives, respectively.
7.1 Type I consistency

We consider (7.4) first and note that
$$P\left(\max_{j > d_1} s_j > t\right) \le \sum_{j = d_1 + 1}^{d} P(s_j > t). \qquad (7.6)$$
For the $j$th irrelevant variable $X_j$, $j > d_1$, the results in Section 6 apply to the local efficacy $\hat\zeta_j(x)$ at a fixed point $x$. The importance score $s_j$ defined in (3.8) is the average of such efficacies over a sample of $M$ random points $x^*_a$. We assume that there is a point $x^{(0)}$ such that, for irrelevant $X_j$'s and for $n$ large enough, $s^{(0)}_j \equiv \hat\zeta_j(x^{(0)})$ is stochastically larger in the right tail than the average $M^{-1}\sum_{a=1}^{M}\hat\zeta_j(x^*_a)$. That is, we assume that there exist $t > 0$ and $n_0$ such that, for $n \ge n_0$,
$$P(s_j > t) \le P(s^{(0)}_j > t), \qquad j > d_1. \qquad (7.7)$$
The existence of such a point $x^{(0)}$ is established by considering the probability limit of
$$\arg\max\{\hat\zeta_j(x^*_a) : x^*_a \in \{x^*_1, \ldots, x^*_M\}\}.$$
Note that by Lemma 6.1, $s^{(0)}_j \to_p 0$ as $n \to \infty$. It follows that we have established (7.4) for $t$ and $d_0 = d - d_1$ that are fixed as $n \to \infty$. To examine the interesting case where $d_0 \to \infty$ as $n \to \infty$, we need large deviation results, to which we turn next. In the $d_0 \to \infty$ case we consider thresholds $t \equiv t_n$ that tend to $\infty$ as $n \to \infty$, and we use the weaker assumption that there exist $x^{(0)}$ and $n_0$ such that
$$P(s_j > t_n) \le P(s^{(0)}_j > t_n) \qquad (7.8)$$
for $n \ge n_0$ and each $t_n$ with $\lim_{n\to\infty} t_n = \infty$. We also assume that the number of columns $C$ and rows $R$ in the contingency table that defines the efficacy $\chi^2_{jn}$ given by (6.4) stays fixed as $n \to \infty$. In this case $P(s^{(0)}_j > t_n) \to 0$ provided each term
$$\left[T^{(j)}_{rc}\right]^2 \equiv \left[\frac{n_{rc} - E_{rc}}{\sqrt{E_{rc}}}\right]^2$$
in $s_j^2 = \chi^2_{jn}$, defined for $X_j$ by (6.4) with observed cell counts $n_{rc}$ and expected cell counts $E_{rc}$, satisfies
$$P\left(|T^{(j)}_{rc}| > t_n\right) \longrightarrow 0.$$
This is because
$$P\left(\sqrt{\sum_r \sum_c \left[T^{(j)}_{rc}\right]^2} > t_n\right) \le P\left(RC \max_{r,c}\left[T^{(j)}_{rc}\right]^2 > t_n^2\right) \le RC \max_{r,c} P\left(|T^{(j)}_{rc}| > t_n/\sqrt{RC}\right)$$
and $t_n$ and $t_n/\sqrt{RC}$ are of the same order as $n \to \infty$. By large deviation theory (Hall [8] and Jing et al. [10]), for some constant $A$,
$$P\left(|T^{(j)}_{rc}| > t_n\right) = \frac{2 t_n^{-1}}{\sqrt{2\pi}}\, e^{-t_n^2/2}\left[1 + A t_n^3 n^{-1/2} + o\left(t_n^3 n^{-1/2}\right)\right] \qquad (7.9)$$
provided $t_n = o(n^{1/6})$. If (7.9) is uniform in $j > d_1$, then (7.6) and (7.9) imply that
type I consistency holds if
$$d_0 \left(\frac{t_n}{\sqrt{2}}\right)^{-1} e^{-t_n^2/2} \longrightarrow 0. \qquad (7.10)$$
To examine (7.10), write $d_0$ and $t_n$ as $d_0 = \exp(n^b)$ and $t_n = \sqrt{2}\, n^r$. By large deviation theory, (7.10) holds when $r < \tfrac{1}{6}$. Thus, if $0 < \tfrac{1}{2} b < r < \tfrac{1}{6}$, then
$$d_0 \left(\frac{t_n}{\sqrt{2}}\right)^{-1} e^{-t_n^2/2} = n^{-r} \exp\left(n^b - n^{2r}\right) \longrightarrow 0.$$
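As a quick sanity check of this rate (our own illustration, not part of the paper), one can evaluate the bound $n^{-r}\exp(n^b - n^{2r})$ numerically for an admissible pair $(b, r)$ with $0 < b/2 < r < 1/6$ and watch it decay:

```python
import math

# Evaluate the type I bound n^{-r} exp(n^b - n^{2r}) for b = 1/4, r = 1/7,
# which satisfies 0 < b/2 < r < 1/6; the values decrease in n because
# n^{2r} dominates n^b when 2r > b.
b, r = 1.0 / 4.0, 1.0 / 7.0
vals = [n ** (-r) * math.exp(n ** b - n ** (2 * r))
        for n in (10 ** 3, 10 ** 6, 10 ** 9)]
```

The sequence `vals` shrinks rapidly toward zero, in line with the displayed limit.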
Thus $d_0 = \exp(n^b)$ with $b = \tfrac{1}{3} - \varepsilon$, for any $\varepsilon > 0$, is the largest $d_0$ can be to have consistency. For instance, if there are exactly $d_0 = \exp(n^{1/4})$ irrelevant variables and we use the threshold $t_n = \sqrt{2}\, n^{1/7}$, the probability of selecting any of the irrelevant variables tends to zero at the rate
$$n^{-1/7} \exp\left(n^{1/4} - n^{2/7}\right).$$

7.2 Type II consistency
We now assume that for each relevant variable $X_j$ there is a least favorable point $x^{(1)}$ such that the local efficacy $s^{(1)}_j \equiv \hat\zeta(x^{(1)})$ is stochastically smaller than the importance score $s_j = M^{-1}\sum_{a=1}^{M}\hat\zeta(x^*_a)$. That is, we assume that for each relevant variable $X_j$ there exist $n_1$ and $x^{(1)}$ such that, for $n \ge n_1$,
$$P\left(s^{(1)}_j \le t_n\right) \ge P(s_j \le t_n), \qquad j \le d_1. \qquad (7.11)$$
The existence of such a least favorable $x^{(1)}$ is established by considering the probability limit of
$$\arg\min\{\hat\zeta_j(x^*_a) : x^*_a \in \{x^*_1, \ldots, x^*_M\}\}.$$
We again assume that $R$ and $C$ are fixed, so that to show type II consistency it is enough to show that, for each $(r, c)$ and $j \le d_1$,
$$P\left(|T^{(j)}_{rc}| \le t_n\right) \longrightarrow 0.$$
This is because
$$P\left(\sqrt{\sum_r \sum_c \left[T^{(j)}_{rc}\right]^2} \le t_n\right) \le P\left(\min_{r,c}\left[T^{(j)}_{rc}\right]^2 \le t_n^2\right) \le RC \max_{r,c} P\left(|T^{(j)}_{rc}| \le t_n\right).$$
By Theorem 6.2, $\sqrt{n}\left(s^{(1)}_j - \tau_j\right) \to_d N(0, \nu^2)$ with $\tau_j > 0$. It follows that if $t_n$ and $d_1$ are bounded above as $n \to \infty$ and $\tau_j > 0$, then $P\left(s^{(1)}_j \le t_n\right) \to 0$. Thus type II consistency holds in this case. To examine the more interesting case where $t_n$ and $d_1$ are not bounded above and $\tau_j$ may tend to zero as $n \to \infty$, we will need to center the $T^{(j)}_{rc}$. Thus set
$$\theta^{(j)}_n = \frac{\sqrt{n}\left(p^{(j)}_{rc} - p^{(j)}_r p^{(j)}_c\right)}{\sqrt{p^{(j)}_r p^{(j)}_c}}, \qquad (7.12)$$
$$T^{(j)}_n = T^{(j)}_{rc} - \theta^{(j)}_n,$$
where for simplicity the dependence of $T^{(j)}_n$ and $\theta^{(j)}_n$ on $r$ and $c$ is not indicated.
Lemma 7.1.
$$P\left(\min_{j \le d_1}|T^{(j)}_{rc}| \le t_n\right) \le \sum_{j=1}^{d_1} P\left(|T^{(j)}_n| \ge \min_{j \le d_1}|\theta^{(j)}_n| - t_n\right).$$
Proof.
$$\begin{aligned}
P\left(\min_{j \le d_1}|T^{(j)}_{rc}| \le t_n\right) &= P\left(\min_{j \le d_1}|T^{(j)}_n + \theta^{(j)}_n| \le t_n\right)\\
&\le P\left(\min_{j \le d_1}|\theta^{(j)}_n| - \max_{j \le d_1}|T^{(j)}_n| \le t_n\right)\\
&= P\left(\max_{j \le d_1}|T^{(j)}_n| \ge \min_{j \le d_1}|\theta^{(j)}_n| - t_n\right)\\
&\le \sum_{j=1}^{d_1} P\left(|T^{(j)}_n| \ge \min_{j \le d_1}|\theta^{(j)}_n| - t_n\right).
\end{aligned}$$
Set
$$a_n = \min_{j \le d_1}\left|\theta^{(j)}_n\right| - t_n;$$
then by large deviation theory, if $a_n = o(n^{1/6})$,
$$P\left(\left|T^{(j)}_n\right| \ge a_n\right) \propto \left(\frac{a_n}{\sqrt{2}}\right)^{-1} \exp\{-a_n^2/2\}, \qquad (7.13)$$
where $\propto$ signifies asymptotic order as $n \to \infty$, $a_n \to \infty$. It follows that if (7.13) holds uniformly in $j$, then by Lemma 7.1,
$$P\left(\min_{j \le d_1}\left|T^{(j)}_{rc}\right| \le t_n\right) \propto d_1\left(\frac{a_n}{\sqrt{2}}\right)^{-1}\exp\{-a_n^2/2\}.$$
Thus type II consistency holds when $a_n = o(n^{1/6})$ and
$$d_1\left(\frac{a_n}{\sqrt{2}}\right)^{-1}\exp\{-a_n^2/2\} \longrightarrow 0. \qquad (7.14)$$
In Section 7.1, $t_n$ is of order $n^r$, $0 < r < \tfrac{1}{6}$. For this $t_n$, type II consistency holds if $a_n$ is of order $n^r$, $0 < r < \tfrac{1}{6}$, and
$$d_1 n^{-r}\exp\{-n^{2r}/2\} \longrightarrow 0.$$
If $a_n = O(n^k)$ with $k \ge \tfrac{1}{6}$, then $P\left(|T^{(j)}_n| \ge a_n\right)$ tends to zero at a rate faster than in (7.13), and consistency is ensured. For instance, if all relevant variables $X_j$ have $p^{(j)}_{rc}$ fixed as $n \to \infty$, then $\min_{j \le d_1}|\theta^{(j)}_n|$ is of order $\sqrt{n}$ and $a_n$ is of order $n^{1/2} - t_n$.
7.3 Type I and II consistency

We see from Sections 7.1 and 7.2 that consistency holds under a variety of conditions. One set of conditions is:

Theorem 7.1. Under the regularity conditions of Sections 7.1 and 7.2, if
$$d_0 = c_0 \exp(n^b), \quad t_n = c_1 n^r, \quad \min_{j \le d_1}\left|\theta^{(j)}_n\right| - t_n = c_2 n^r, \quad d_1 = c_3 \exp(n^b)$$
for positive constants $c_0$, $c_1$, $c_2$, and $c_3$ and $0 < \tfrac{1}{2} b < r < \tfrac{1}{6}$, then CATCH is consistent.
Appendix

Proof of Theorem 6.1.

(1) Let $N_h = \sum_{i=1}^{n} I\left(X^{(i)} \in N_h(x^{(0)})\right)$. Then $N_h \sim \mathrm{Bi}\left(n, \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx\right)$ and $N_h \to_{a.s.} \infty$ as $n \to \infty$. Write
$$\hat\zeta(x^{(0)}, h) = n^{-1/2}\sqrt{\mathcal{X}^2(x^{(0)}, h)} = (N_h/n)^{1/2}\, N_h^{-1/2}\sqrt{\mathcal{X}^2(x^{(0)}, h)}.$$
Since $N_h \sim \mathrm{Bi}\left(n, \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx\right)$, we have
$$N_h/n \longrightarrow \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx. \qquad (7.15)$$
By Lemma 6.1 and Table 1,
$$N_h^{-1/2}\sqrt{\mathcal{X}^2(x^{(0)}, h)} \longrightarrow \left(\sum_{r=+,-}\sum_{c=1}^{C}\frac{(p_{rc} - p_r q_c)^2}{p_r q_c}\right)^{1/2}, \qquad (7.16)$$
where
$$p_{+c} = \Pr\left(X > x^{(0)}, Y = c \mid X \in N_h(x^{(0)})\right), \qquad p_{-c} = \Pr\left(X \le x^{(0)}, Y = c \mid X \in N_h(x^{(0)})\right)$$
for $r = +, -$ and $c = 1, \ldots, C$, and
$$p_r = \sum_{c=1}^{C} p_{rc}, \qquad q_c = \sum_{r=+,-} p_{rc}.$$
In fact, $p_+ = \Pr\left(X > x^{(0)} \mid X \in N_h(x^{(0)})\right)$, $p_- = \Pr\left(X \le x^{(0)} \mid X \in N_h(x^{(0)})\right)$, and $q_c = \Pr\left(Y = c \mid X \in N_h(x^{(0)})\right)$.
By (7.15) and (7.16),
$$\hat\zeta(x^{(0)}, h) \longrightarrow \left(\int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx\right)^{1/2}\left(\sum_{r=+,-}\sum_{c=1}^{C}\frac{(p_{rc} - p_r q_c)^2}{p_r q_c}\right)^{1/2} \equiv \zeta(x^{(0)}, h).$$
(2) Since $X$ and $Y$ are independent, $p_{rc} = p_r q_c$ for $r = +, -$ and $c = 1, \ldots, C$, so $\zeta(x^{(0)}, h) = 0$ for any $h$. Hence
$$\zeta(x^{(0)}) = \sup_{h > 0}\,\zeta(x^{(0)}, h) = 0.$$

(3) Recall that $p_c(x^{(0)}) = \Pr(Y = c \mid x^{(0)})$. Without loss of generality, assume $p_c'(x^{(0)}) > 0$. Then there exists $h$ such that
$$\Pr\left(Y = c \mid x^{(0)} < X < x^{(0)} + h\right) > \Pr\left(Y = c \mid x^{(0)} - h < X < x^{(0)}\right),$$
which is equivalent to $p_{+c}/p_+ > p_{-c}/p_-$. This shows that $\zeta(x^{(0)}) > 0$, since by Lemma 6.1, $\zeta(x^{(0)}) = 0$ would imply $p_{+c}/p_+ = p_{-c}/p_- = p_c$.
References
[1] Breiman, L., 2001. Random forests. Machine Learning 45, 5–32.

[2] Breiman, L., Friedman, J. H., Olshen, R. A., Stone, C. J., 1984. Classification and Regression Trees. Wadsworth, Belmont.

[3] Doksum, K., Tang, S., Tsui, K.-W., 2008. Nonparametric variable selection: The EARTH algorithm. Journal of the American Statistical Association 103 (484), 1609–1620.

[4] Doksum, K. A., Samarov, A., 1995. Nonparametric estimation of global functionals and a measure of the explanatory power of covariates in regression. Annals of Statistics 23, 1443–1473.

[5] Doksum, K. A., Schafer, C., 2006. Powerful choices: Tuning parameter selection based on power. In: Frontiers in Statistics: Dedicated to Peter Bickel (J. Fan and H. L. Koul, eds.), Imperial College Press, 113–141.

[6] Friedman, J., 1991. Multivariate adaptive regression splines. The Annals of Statistics 19, 1–67.

[7] Gao, J., Gijbels, I., 2008. Bandwidth selection in nonparametric kernel testing. Journal of the American Statistical Association 103 (484), 1584–1594.

[8] Hall, P., 1992. The Bootstrap and Edgeworth Expansion. Springer, New York.

[9] Huang, L.-S., Chen, J., 2008. Analysis of variance, coefficient of determination and F-test for local polynomial regression. Annals of Statistics 36, 2085–2109.

[10] Jing, B.-Y., Shao, Q.-M., Wang, Q., 2003. Self-normalized Cramér-type large deviations for independent random variables. Annals of Probability 31, 2167–2215.

[11] Lin, D. Y., Zeng, D., 2011. Correcting for population stratification in genomewide association studies. Journal of the American Statistical Association 106, 997–1008.

[12] Loh, W.-Y., 2009. Improving the precision of classification trees. The Annals of Applied Statistics 3, 1710–1737.

[13] Price, A., Patterson, N., Plenge, R., Weinblatt, M., Shadick, N., Reich, D., 2006. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38 (8), 904–909.

[14] Qiao, Z., Zhou, L., Huang, J., 2008. Linear discriminant analysis for high-dimensional, low-sample-size data. Special issue of World Congress on Engineering 2008.

[15] Quinlan, J. R., 1987. Simplifying decision trees. International Journal of Man-Machine Studies 27 (3), 221–234.

[16] Schafer, C., Doksum, K. A., 2009. Selecting local models in multiple regression by maximizing power. Metrika 69, 283–304.

[17] Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267–288.

[18] Wang, L., Shen, X., 2007. On L1-norm multi-class support vector machines: methodology and theory. Journal of the American Statistical Association 102, 583–594.

[19] Zhang, H. H., Liu, Y., Wu, Y., Zhu, J., 2008. Variable selection for the multicategory SVM via adaptive sup-norm regularization. Electronic Journal of Statistics 2, 149–167.
of different classes are extremely unequal, which is known as the unbalanced classification problem, it is very likely that a local region contains a pure class of $Y$ after the global permutation, and the threshold $S'$ will be spuriously low. Therefore irrelevant variables will be falsely selected into the model. This approach is more nonparametric but more time-consuming.
(d') Local Thresholding. Let $Y^*$ denote a random variable which is distributed as $Y$ but independent of $X_l$ given $X_{-l}$. Let $(Y_0^{(1)}, \ldots, Y_0^{(m)})$ be a random permutation of the $Y$'s inside the tube; then $(Y_0^{(1)}, \ldots, Y_0^{(m)})$ can be viewed as a realization of $Y^*$. If $X_l$ is a numerical variable, let $\hat\zeta_l'(X^*_k)$ be the estimated local contingency tube efficacy of $Y^*$ on $X_l$ at $X^*_k$, as defined in (3.5). If $X_l$ is a categorical variable, let $\hat\zeta_l'(X^*_k)$ be the estimated contingency tube efficacy of $Y^*$ on $X_l$ at $X^*_k$, as defined in (3.7).
Remark 3.3 (The threshold). Under the assumption $H_0^{(l)}$ that $Y$ is independent of $X_l$ given $X_{-l}$ in $T_l$, $s_l$ and $s_l'$ are nearly independent and $SD(s_l - s_l') \approx \sqrt{2}\, SD(s_l')$. Thus the rule that deletes $X_l$ when $s_l^{stop} - s_l'^{stop} < \sqrt{2}\, SE(s'^{stop})$ is similar to the one-standard-deviation rule of CART and GUIDE. The choice of threshold is also discussed in Section 7. Alternatively, we calculate p-values for each covariate by permutation tests. More specifically, we permute the response variable, obtain the importance scores for the permuted data, and then calculate how extreme $s_l'^{stop}$ is compared to the permuted scores. We then identify a covariate as important if its p-value is less than 0.1.
Remark 3.4 (Large $d$, small $n$). A key idea in this paper is that, when determining the importance of the variable $X_j$, we restrict (condition) the vector $X_{-j}$ to a relatively small set. This is done to address the problem of spurious correlation, which is of great concern in genomics (Price et al. [13], Lin and Zeng [11]). In such genomic studies, the number of predictors $d$ typically greatly exceeds the sample size $n$. For numerical variables this can be dealt with by replacing $X_{-j}$ with a vector of principal components explaining most of the variation in $X_{-j}$. Although our paper does not deal with the $d \gg n$ case directly, this remark shows that it is also applicable to the genome-wide association studies described in Price et al. [13].
Remark 3.5 (Computational complexity). We now give the computational complexity of the algorithm. Recall the notation: $I$ is the number of repetitions, $M$ is the number of tube centers, $m$ is the tube size, $d$ is the dimension, and $n$ is the sample size. The number of operations needed for the algorithm is $I \times d \times M \times (2b + 2c + 2d) + 2e$, where $2b$, $2c$, $2d$, and $2e$ denote the numbers of operations required for those steps. The computational complexities of steps 2b, 2c, 2d, and 2e are $O(nd)$, $O(m)$, $O(m)$, and $O(M)$, respectively. Since $nd \gg m$, the required computation is dominated by step 2b, and the computational complexity of the algorithm is $O(IMnd^2)$. Solving a least squares problem requires $O(nd^2)$ operations, so our algorithm has higher computational complexity by a factor of $IM$ compared to least squares. However, the computations for the $M$ tube centers are independent of each other and can be performed in parallel.
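The loop structure behind this complexity count can be sketched as follows (a structural illustration only, assuming a user-supplied local efficacy function; all names and defaults are ours, and the adaptive tube distances of Section 3 are replaced by plain unweighted distances):

```python
import numpy as np

def catch_scores_skeleton(X, y, efficacy, I=1, M=50, m=50, seed=0):
    """I repetitions x d variables x M tube centers; the O(nd) distance
    computation (step 2b) dominates, giving O(I*M*n*d^2) overall."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    scores = np.zeros(d)
    for _ in range(I):                       # I repetitions
        for j in range(d):                   # d variables
            local = []
            for _ in range(M):               # M tube centers
                k = rng.integers(n)
                # step 2b: distances in X_{-j}, O(nd)
                dist = np.abs(np.delete(X, j, axis=1)
                              - np.delete(X[k], j)).sum(axis=1)
                tube = np.argsort(dist)[:m]  # the m points nearest in X_{-j}
                # steps 2c, 2d: local efficacy of y on X_j inside the tube
                local.append(efficacy(X[tube, j], y[tube]))
            scores[j] += np.mean(local)      # step 2e: average over tubes
    return scores / I
```

Because the inner loop over tube centers touches every observation, parallelizing over the $M$ centers, as the remark suggests, removes the dominant cost.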
4 Simulation Studies
4.1 Variable Selection with Categorical and Numerical Predictors
Example 4.1. Consider $d \ge 5$ predictors $X_1, X_2, \ldots, X_d$ and a categorical response variable $Y$. The fifth predictor $X_5$ is a Bernoulli random variable which takes the two values 0 and 1 with equal probability. All other predictors are uniformly distributed on $[0, 1]$.
When $X_5 = 0$, the dependent variable $Y$ is related to the predictors in the following fashion:
$$Y = \begin{cases} 0 & \text{if } X_1 - X_2 - 0.25\sin(16\pi X_2) \le -0.5,\\ 1 & \text{if } -0.5 < X_1 - X_2 - 0.25\sin(16\pi X_2) \le 0,\\ 2 & \text{if } 0 < X_1 - X_2 - 0.25\sin(16\pi X_2) \le 0.5,\\ 3 & \text{if } X_1 - X_2 - 0.25\sin(16\pi X_2) > 0.5. \end{cases} \qquad (4.1)$$
When $X_5 = 1$, $Y$ depends on $X_3$ and $X_4$ in the same way as above, with $X_1$ replaced by $X_3$ and $X_2$ replaced by $X_4$. The other variables $X_6, \ldots, X_d$ are irrelevant noise variables. All $d$ predictors are independent.
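For readers who want to reproduce the design, a minimal data generator for model (4.1) might look as follows (our own sketch; the function name, the seed handling, and the placement of $X_5$ in column index 4 are our choices):

```python
import numpy as np

def generate_example_41(n=400, d=10, seed=0):
    """Draw (X, Y) from model (4.1): X5 ~ Bernoulli(1/2), the other
    predictors Uniform[0,1], and Y in {0,1,2,3} obtained by thresholding
    g(u, v) = u - v - 0.25*sin(16*pi*v) at -0.5, 0, 0.5."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n, d))
    X[:, 4] = rng.integers(0, 2, size=n)      # X5 lives in column index 4
    g = lambda u, v: u - v - 0.25 * np.sin(16 * np.pi * v)
    z = np.where(X[:, 4] == 0, g(X[:, 0], X[:, 1]), g(X[:, 2], X[:, 3]))
    y = np.digitize(z, bins=[-0.5, 0.0, 0.5], right=True)
    return X, y
```

The `right=True` option makes each bin boundary inclusive on the right, matching the "$\le$" cases in (4.1).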
In addition to the presence of the noise variables $X_6, \ldots, X_d$, this example is challenging in two respects. First, there is an interaction effect between the continuous predictors $X_1, \ldots, X_4$ and the categorical predictor $X_5$. Such an interaction effect may arise in practice: for example, suppose $Y$ represents the effect resulting from multiple treatments $X_1, \ldots, X_4$ and $X_5$ represents the gender of a patient; our model allows males and females to respond differently to treatments. Second, the classification boundaries for fixed $X_5$ are highly nonlinear, as shown in Figure 1. This design demonstrates that our method can detect nonlinear dependence between the response and the predictors.
Table 2 shows the number of simulations in which the predictors $X_1, \ldots, X_5$ are correctly kept in the model and, in column 6, the average number of simulations in which the $d - 5$ irrelevant predictors are falsely selected into the model, out of 100 simulation trials from model (4.1). In the simulation, $n = 400$ and $d = 10, 50, 100$ are used. We compare CATCH with other methods, including Marginal CATCH, the Random Forest classifier (RFC) (Breiman [1]), and GUIDE (Loh [12]). For Marginal CATCH we test the association between $Y$ and $X_l$ without conditioning on $X_{-l}$. RFC is well known as a powerful tool for prediction; here we use importance scores from RFC for variable selection. We create 30 additional noise variables and use their average importance scores, multiplied by a factor of 2, as a threshold (Doksum et al. [3]) to decide whether a variable should be selected into the model. In particular, we generate the noise variables using uniform and Bernoulli distributions, respectively, for numerical and categorical predictors.
The main goal of GUIDE is to solve the variable selection bias problem in regression and classification tree methods. As an illustration, if a categorical predictor takes $K$ distinct values, then the splitting test exhausts $2^{K-1}$ possibilities and therefore is
Table 2: Simulation results from Example 4.1 with sample size n = 400. For 1 <= i <= 5, the ith column gives the estimated probability of selection of the relevant variable Xi and the standard error, based on 100 simulations. Column 6 gives the mean of the estimated probability of selection and the standard error for the irrelevant variables X6, ..., Xd.

Number of Selections (100 simulations)
METHOD           d     X1          X2          X3          X4          X5          Xi, i >= 6
CATCH            10    0.82(0.04)  0.75(0.04)  0.78(0.04)  0.82(0.04)  1(0)        0.05(0.02)
                 50    0.76(0.04)  0.76(0.04)  0.87(0.03)  0.84(0.04)  0.96(0.02)  0.08(0.03)
                 100   0.77(0.04)  0.73(0.04)  0.82(0.04)  0.8(0.04)   0.83(0.04)  0.09(0.03)
Marginal CATCH   10    0.77(0.04)  0.67(0.05)  0.7(0.05)   0.8(0.04)   0.06(0.02)  0.1(0.03)
                 50    0.69(0.05)  0.7(0.05)   0.76(0.04)  0.79(0.04)  0.07(0.03)  0.1(0.03)
                 100   0.78(0.04)  0.76(0.04)  0.73(0.04)  0.73(0.04)  0.12(0.03)  0.1(0.03)
Random Forest    10    0.71(0.05)  0.51(0.05)  0.47(0.05)  0.59(0.05)  0.37(0.05)  0.06(0.02)
                 50    0.64(0.05)  0.61(0.05)  0.65(0.05)  0.58(0.05)  0.28(0.04)  0.08(0.03)
                 100   0.66(0.05)  0.58(0.05)  0.59(0.05)  0.65(0.05)  0.17(0.04)  0.11(0.03)
GUIDE            10    0.96(0.02)  0.98(0.01)  0.93(0.03)  0.99(0.01)  0.92(0.03)  0.33(0.05)
                 50    0.82(0.04)  0.85(0.04)  0.9(0.03)   0.89(0.03)  0.67(0.05)  0.12(0.03)
                 100   0.83(0.04)  0.84(0.04)  0.86(0.03)  0.83(0.04)  0.61(0.05)  0.08(0.03)
Figure 1: The relationship between $Y$ and $X_1, X_2$ for model (4.1). The four areas represent the four values of $Y$.
biased toward being selected. GUIDE alleviates this bias by employing lack-of-fit tests, i.e., $\chi^2$-tests applicable to both numerical and categorical predictors, with the p-values adjusted by a bootstrap procedure. Since GUIDE has negligible selection bias, it can include tests for local pairwise interactions. Furthermore, it achieves sensitivity to local curvature by using a linear model at each node.
The results in Table 2 show that Marginal CATCH does a very poor job of selecting $X_5$. This illustrates the importance of local conditioning. RFC does better than Marginal CATCH but also has difficulties, while CATCH and GUIDE successfully identify $X_5$, along with $X_1$ to $X_4$, with high frequency. When $d = 10$, GUIDE does very well for the relevant variables but selects too many irrelevant variables. Surprisingly, GUIDE selects fewer irrelevant variables as $d$ increases: it becomes more cautious as the dimension $d$ grows. CATCH performs very well in the high-dimensional cases ($d = 50$ and $d = 100$). This indicates that the CATCH algorithm is robust to the intricate interaction structures between the predictors.
In model (4.1) above, the predictors interact in a hierarchical way. When $X_5 = 0$, the response is related to $X_1$ and $X_2$ as shown in Figure 1, where the two axes are the two relevant predictors and the four areas partitioned by the curves correspond to different values of $Y$. When $X_5 = 1$, the response depends on $X_3$ and $X_4$ in the same way. This type of hierarchical interaction between categorical and numerical predictors is difficult to detect, even by state-of-the-art classification methods such as the support vector machine (SVM) and RFC. In particular, RFC produces an importance
score vector that identifies the four numerical variables $X_1, \ldots, X_4$ as very significant but often leaves the categorical variable $X_5$ unidentified, as we see in Table 2.
The CATCH algorithm, however, performs very well for this complicated model. We now explain why CATCH is able to identify $X_5$ as an important variable. To explore the association between $X_5$ and $Y$, we construct $M$ contingency tables of 2 rows and 4 columns based on $M$ randomly centered tubes, calculate contingency efficacy scores, and then average the scores. Figure 2 (a) and (b) illustrate how one of the contingency efficacy scores is calculated. 1000 data points are generated from model (4.1) and are split into two sets according to $x_5$: (a) shows the roughly half of the points with $x_5 = 0$ in the $(x_1, x_2)$ dimensions, and (b) shows the remaining points with $x_5 = 1$ in the $(x_3, x_4)$ dimensions. The data points are colored according to $Y$. This representation lets us clearly see the classification boundaries (grey lines). The randomly selected tube center, whose $x_5$ is zero, is shown as a star in (a). The data points within the tube, i.e., those close to the tube center in terms of $X_{-5}$, are highlighted by larger dots. Since the distance defining the tube does not involve $X_5$, the larger dots have $x_5$ equal to 0 or 1 and are present in both (a) and (b). The counts of the larger dots in different colors in (a) and (b) correspond to the cell counts of the two rows, $x_5 = 0$ and $x_5 = 1$, in the contingency table. As we can see, the larger dots in (a) are mostly red while the larger dots in (b) are mostly blue and orange, which shows that the distribution of $Y$ depends on the value of $X_5$, so the contingency efficacy score is high. We would expect the contingency efficacy score to be low only if the tube center had similar values in $(x_1, x_2)$ and $(x_3, x_4)$. Such tube centers would be very few among the $M$ tube centers, since the centers are randomly chosen. Thus the average contingency efficacy score is high, and $X_5$ is identified as an important variable.
To contrast this example with a simpler data-generating model, suppose the effects of the predictors are additive, that is, the categorical variable enters the response as a term added to the effect of the numerical predictors. In this type of simpler model, the RFC and GUIDE algorithms can efficiently identify the relevant predictors, but so can the CATCH algorithm. We do not include the simulation results for additive models here.
Remark 4.1. We observe that the CATCH algorithm is not very sensitive to the choice of the number of tube centers $M$ or the tube size $m = [n^a]$, where $0 < a \le 1$ and $[\,\cdot\,]$ is the greatest integer function. For Example 4.1 we perform the simulation using $M = n$ and $M = n/2$, and $a = 0.1, 0.15, 0.2, 0.3, 0.5$. The results are summarized in Table 3. The CATCH algorithm performs similarly for $M = n$ and $M = n/2$, with $M = n$ having a small winning edge. We generally suggest using $M = n$ if computational resources are of less concern or $n$ is not large ($\le 200$). When computational cost is a concern (Remark 3.5), one can reduce the number of tube centers to $M = n/2$ as a compromise between detection power and computational cost. Table 3 also shows that the tube size works well for $a$ in the range between 0.1 and 0.3, while the detection power for $X_5$ decreases in the case $a = 0.5$, where the tube is too big to achieve local conditioning. This observation is consistent with the comparison to Marginal CATCH in Table 2. On the other hand, the tube should not be too small, so that sufficient detection power is retained. We suggest using $a = 0.1$ or 0.15 for sample sizes $n \ge 400$, or $m = 50$ for smaller sample sizes. In light of these guidelines, we use $M = n/2 = 200$ and $a = 0.15$ for Examples 4.1–4.3, and $M = n = 200$ and $m = 50$ for Example 4.4.
Table 3: Sensitivity study of CATCH for Example 4.1 ($n = 400$, $d = 50$) using different numbers of tube centers $M$ and tube sizes $m = [n^a]$. For $1 \le i \le 5$, the $i$th column gives the number of times $X_i$ was selected. Column 6 gives the number of times irrelevant variables were selected.
Number of Selections (100 simulations)
           a     X1  X2  X3  X4  X5  Xi, i >= 6
M = 200    0.10  75  67  77  69  92  78
           0.15  76  76  87  84  96  82
           0.20  85  84  90  87  93  99
           0.30  89  87  95  89  85  96
           0.50  89  90  96  93  62  99
M = 400    0.10  74  76  82  77  92  72
           0.15  85  82  89  87  94  86
           0.20  87  88  90  86  96  91
           0.30  88  89  91  91  92  94
           0.50  89  91  94  93  61  102
In Example 4.1 all the predictors are independent. In a similar setting, we next consider the case where the predictors are dependent.
Example 4.2. We generate standard normal random variables $Z_i$, $i = 1, \ldots, d$, in such a way that $\operatorname{cor}(Z_1, Z_4) = 0.5$, $\operatorname{cor}(Z_3, Z_6) = 0.9$, and the other $Z_i$'s are independent. We set $X_i = \Phi(Z_i)$ for $i = 1, \ldots, d$, where $\Phi(\cdot)$ is the CDF of a standard normal, and therefore the $X_i$ are marginally uniformly distributed, as in Example 4.1. The correlation structure of the $X_i$'s is similar to that of the $Z_i$'s. The dependent variable $Y$ is generated in the same way as in Example 4.1.
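The correlated predictor design can be sketched as follows (our own illustration; it covers only the Gaussian-copula construction of the continuous predictors, and the function name is ours):

```python
import math
import numpy as np

def generate_example_42_predictors(n=400, d=10, seed=0):
    """Standard normal Z with cor(Z1, Z4) = 0.5 and cor(Z3, Z6) = 0.9
    (all other pairs independent), mapped to Uniform[0,1] margins by
    X_i = Phi(Z_i)."""
    rng = np.random.default_rng(seed)
    cov = np.eye(d)
    cov[0, 3] = cov[3, 0] = 0.5          # cor(Z1, Z4)
    cov[2, 5] = cov[5, 2] = 0.9          # cor(Z3, Z6)
    Z = rng.multivariate_normal(np.zeros(d), cov, size=n)
    Phi = np.vectorize(lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))
    return Phi(Z)
```

The probability-integral transform preserves the positive association of the correlated pairs while giving each coordinate a uniform margin.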
The results for Example 4.2 are given in Table 4. Comparing with the results in Table 2, both CATCH and GUIDE do well in the presence of dependence between the numerical variables. The explanation is two-fold. First, both methods identify the categorical $X_5$ as an important predictor most of the time, which makes the identification of the other relevant predictors possible. Second, conditionally on $X_5$, $Y$ depends on a pair of variables ("$X_1$ and $X_2$" or "$X_3$ and $X_4$") jointly, and in particular this dependence is nonlinear, which makes both methods less affected by the correlation introduced here.
Figure 2: Sample scatter plots of $(x_1, x_2)$ and $(x_3, x_4)$ for data simulated from model (4.1). The left panel (a) shows all the pairs $(x_1, x_2)$ with $x_5 = 0$, and the right panel (b) shows all the pairs $(x_3, x_4)$ with $x_5 = 1$; both axes run from 0 to 1. The larger dots in the two plots are within the tube around a tube center, shown as the star.
We now explain the latter in detail for both methods. In CATCH, the algorithm adaptively weighs the effects of the predictors by their importance scores when they are being conditioned on in defining the tubes; therefore identifying either of the two relevant variables helps identify the other, whereas an irrelevant variable, though strongly associated with one of the relevant variables, is down-weighted iteratively and does not strongly affect the selection of the relevant variable. This explains the slightly decreasing selection probability of $X_3$ in the presence of the strongly correlated $X_6$, and consequently the overall lower selection of $X_3$ and $X_4$ compared to $X_1$ and $X_2$, given the correlation between $X_1$ and $X_4$. In GUIDE, the difference between the dependent and independent cases is even smaller, because GUIDE is very powerful in detecting local interactions, literally including pairwise interactions when selecting splitting variables.
Marginal CATCH and Random Forest do poorly in this case, mainly because $X_5$ cannot be detected, as we explained earlier. More specifically, the dependence between $Y$ and $X_4$ in the $X_5 = 1$ case is the same as that between $Y$ and $X_2$ in the $X_5 = 0$ case. The linear terms of $X_1$ and $X_2$ have opposite signs in the decision function; therefore the marginal effects of $X_1$ and $X_4$ are obscured when $X_5$ is not identified as an important variable.
4.2 Improving the Performance of SVM and RF Classifiers

The criterion used in the CATCH algorithm is not based on classification accuracy. But if we use CATCH to screen out the irrelevant variables and use only the important
Table 4: Simulation results from Example 4.2 with sample size n = 400. For 1 <= i <= 6, the ith column gives the estimated probability of selection of variable Xi and the standard error, based on 100 simulations, where Xi, i = 1, ..., 5, are relevant variables with cor(X1, X4) = 0.5. X6 is an irrelevant variable that is correlated with X3, with correlation 0.9. Column 7 gives the mean of the estimated probability of selection and that of the standard error for the irrelevant variables X7, ..., Xd.

Number of Selections (100 simulations)
METHOD           d     X1          X2          X3          X4          X5          X6          Xi, i >= 7
CATCH            10    0.76(0.04)  0.77(0.04)  0.73(0.04)  0.68(0.05)  0.99(0.01)  0.34(0.05)  0.07(0.02)
                 50    0.78(0.04)  0.77(0.04)  0.74(0.04)  0.63(0.05)  0.92(0.03)  0.57(0.05)  0.09(0.03)
                 100   0.76(0.04)  0.77(0.04)  0.8(0.04)   0.74(0.04)  0.82(0.04)  0.55(0.05)  0.09(0.03)
Marginal CATCH   10    0.35(0.05)  0.63(0.05)  0.8(0.04)   0.32(0.05)  0.09(0.03)  0.71(0.05)  0.12(0.03)
                 50    0.34(0.05)  0.71(0.05)  0.7(0.05)   0.3(0.05)   0.1(0.03)   0.6(0.05)   0.11(0.03)
                 100   0.34(0.05)  0.68(0.05)  0.75(0.04)  0.3(0.05)   0.1(0.03)   0.59(0.05)  0.1(0.03)
Random Forest    10    0.29(0.05)  0.48(0.05)  0.55(0.05)  0.2(0.04)   0.3(0.05)   0.47(0.05)  0.05(0.02)
                 50    0.26(0.04)  0.52(0.05)  0.57(0.05)  0.3(0.05)   0.24(0.04)  0.45(0.05)  0.08(0.03)
                 100   0.36(0.05)  0.61(0.05)  0.66(0.05)  0.37(0.05)  0.32(0.05)  0.46(0.05)  0.12(0.03)
GUIDE            10    0.96(0.02)  0.96(0.02)  0.94(0.02)  0.95(0.02)  0.96(0.02)  0.84(0.04)  0.3(0.05)
                 50    0.8(0.04)   0.92(0.03)  0.88(0.03)  0.79(0.04)  0.72(0.04)  0.68(0.05)  0.12(0.03)
                 100   0.81(0.04)  0.89(0.03)  0.9(0.03)   0.76(0.04)  0.75(0.04)  0.58(0.05)  0.08(0.03)
variables for classification, the performance of other algorithms, such as the Support Vector Machine (SVM) and the Random Forest Classifier (RFC), can be significantly improved. In this section we compare the prediction performance of (1) SVM, (2) RFC, (3) CATCH followed by SVM, and (4) CATCH followed by RFC. We label (3) and (4) CATCH+SVM and CATCH+RFC, respectively. We use the "svm" function in the "e1071" package in R, with the radial kernel, to perform the SVM analysis, and the "randomForest" function in the "randomForest" package to perform the RFC analysis. We use simulation experiments to illustrate the improvement obtained by using CATCH prior to classifying $Y$.
Let the true model be
$$(X, Y) \sim P, \qquad (4.2)$$
where $P$ will be specified later. In each of 100 Monte Carlo experiments, $n$ pairs $(X^{(i)}, Y^{(i)})$, $i = 1, \ldots, n$, are generated from the true model (4.2). Based on the simulated data $(X^{(i)}, Y^{(i)})$, $i = 1, \ldots, n$, the methods (1) SVM, (2) RFC, (3) CATCH+SVM, and (4) CATCH+RFC are used to classify a future $Y^{(0)}$ for which we have available a corresponding $X^{(0)}$.
The integrated misclassification rate (IMR; Friedman [6]), defined below, is used to evaluate the performance of the classification algorithms. Let $F_{\mathbf{X}\mathbf{Y}}(\cdot)$ be the classifier constructed by a classification algorithm based on $\mathbf{X} = (X^{(1)}, \ldots, X^{(n)})$ and $\mathbf{Y} = (Y^{(1)}, \ldots, Y^{(n)})^T$.
Let $(X^{(0)}, Y^{(0)})$ be a future observation with the same distribution as the $(X^{(i)}, Y^{(i)})$. We know only $X^{(0)} = x^{(0)}$ and want to classify $Y^{(0)}$. For a possible future observation $(x^{(0)}, y^{(0)})$, define
$$MR(\mathbf{X}, \mathbf{Y}; x^{(0)}, y^{(0)}) = P\left(F_{\mathbf{X}\mathbf{Y}}(x^{(0)}) \neq y^{(0)} \mid \mathbf{X}, \mathbf{Y}\right). \qquad (4.3)$$
Our criterion is
$$IMR = E\, E_0\left[MR(\mathbf{X}, \mathbf{Y}; x^{(0)}, y^{(0)})\right], \qquad (4.4)$$
where $E_0$ is with respect to the distribution of $(x^{(0)}, y^{(0)})$ and $E$ is with respect to the distribution of $(\mathbf{X}, \mathbf{Y})$.
This double expectation is evaluated using Monte Carlo simulation; the result is denoted IMR*. More precisely, $E_0$ is estimated using 5000 simulated $(x^{(0)}, y^{(0)})$ observations, and IMR* is then computed on the basis of 100 simulated samples $(\mathbf{X}_i, \mathbf{Y}_i)$, $i = 1, \ldots, 100$.
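The Monte Carlo recipe for IMR* can be sketched generically as follows (our own illustration: `train_fn` and `sampler` are hypothetical user-supplied hooks standing in for the classifier and the true model $P$, and the defaults mirror the 100 x 5000 design above):

```python
import numpy as np

def imr_star(train_fn, sampler, n=200, n_future=5000, n_mc=100, seed=0):
    """Estimate IMR* of (4.4): average, over n_mc training samples from P,
    the misclassification rate of the fitted classifier on n_future fresh
    (x0, y0) draws (the inner E_0), then average over samples (outer E)."""
    rng = np.random.default_rng(seed)
    rates = []
    for _ in range(n_mc):
        X, y = sampler(rng, n)                    # one training sample
        predict = train_fn(X, y)                  # the classifier F_{X,Y}
        X0, y0 = sampler(rng, n_future)           # future observations
        rates.append(np.mean(predict(X0) != y0))  # estimates E_0[MR]
    return float(np.mean(rates))
```

Any classifier that exposes a fit-then-predict interface can be plugged in for `train_fn`, which is how the SVM and RFC comparisons below could be reproduced.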
Example 4.3. First consider model (4.1) in Section 4.1. Figure 3 shows IMR* versus the number of irrelevant variables for this example. The left panel shows that the IMR*'s of the SVM approach increase dramatically as the number of irrelevant variables increases, while the IMR*'s of CATCH+SVM are stable. Thus CATCH stabilizes the performance of SVM for this complex model. The right panel shows that CATCH also stabilizes the performance of RFC. These results indicate that CATCH can improve the prediction accuracy of other classification algorithms.
[Figure 3: two panels titled "SVM" (curves: SVM, CATCH+SVM) and "Random Forest" (curves: RFC, CATCH+RFC); the x-axis is d (20 to 100) and the y-axis is the error rate (0.25 to 0.60).]
Figure 3: IMR*'s of SVM, RFC, CATCH+SVM, and CATCH+RFC versus the number of predictors for the simulation study of model (4.1) in Example 4.1. The sample size is n = 400.
Example 4.4 (Contaminated Logistic Model). Consider a binary response Y ∈ {0, 1} with

    P(Y = 0 | X = x) = F(x) if r = 0,  G(x) if r = 1,    (4.5)

where x = (x1, ..., xd),

    r ~ Bernoulli(γ),    (4.6)

    F(x) = (1 + exp{−(0.25 + 0.5 x1 + x2)})^{−1},    (4.7)

    G(x) = 0.1 if 0.25 + 0.5 x1 + x2 < 0,  0.9 if 0.25 + 0.5 x1 + x2 ≥ 0,    (4.8)

and x1, x2, ..., xd are realizations of independent standard normal random variables. Hence, given X^(i) = x^(i), 1 ≤ i ≤ n, the joint distribution of Y^(1), Y^(2), ..., Y^(n) is the contaminated logistic model with contamination parameter γ:
    (1 − γ) Π_{i=1}^{n} [F(x^(i))]^{Y^(i)} [1 − F(x^(i))]^{1−Y^(i)} + γ Π_{i=1}^{n} [G(x^(i))]^{Y^(i)} [1 − G(x^(i))]^{1−Y^(i)}.    (4.9)
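A data-generating sketch for this model (pure Python; the name `sample_contaminated` is our own) makes the role of r explicit: a single Bernoulli(γ) draw decides whether the whole sample follows F or G, matching the two-component mixture likelihood:

```python
import math
import random

def sample_contaminated(n=200, d=50, gamma=0.2, rng=random):
    """Draw a sample from the contaminated logistic model (4.5)-(4.8)."""
    eta = lambda x: 0.25 + 0.5 * x[0] + x[1]
    F = lambda x: 1.0 / (1.0 + math.exp(-eta(x)))  # (4.7): P(Y = 0 | x)
    G = lambda x: 0.1 if eta(x) < 0 else 0.9       # (4.8): contaminating law
    r = 1 if rng.random() < gamma else 0           # (4.6): one draw per sample
    data = []
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in range(d)]    # independent standard normals
        p0 = G(x) if r == 1 else F(x)              # (4.5)
        data.append((x, 0 if rng.random() < p0 else 1))
    return data

random.seed(0)
sample = sample_contaminated(n=5, d=3, gamma=0.5)
```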
We perform a simulation study for n = 200, d = 50, and γ = 0, 0.2, 0.4, 0.6, 0.8, and 1, where γ = 0 corresponds to no contamination. Figure 4 shows the IMR*'s versus the different values of γ for this example. CATCH improves the performance of the SVM and Random Forest classifiers for this contaminated logistic regression model.
[Figure 4: two panels titled "SVM" (curves: SVM, CATCH+SVM) and "Random Forest" (curves: RFC, CATCH+RFC); the x-axis is γ (0 to 1) and the y-axis is the error rate (0.1 to 0.4).]
Figure 4: IMR*'s of SVM, RFC, CATCH+SVM, and CATCH+RFC versus different contamination parameters γ for simulation model (4.9) in Example 4.4.
5 CATCH Analysis of The ANN Data
In this section CATCH is applied to a data set available at the UCI database (http://ftp.ics.uci.edu/pub/machine-learning-databases/thyroid-disease). The data are from a clinical trial to determine whether a patient is hypothyroid (see, e.g., Quinlan [15]). The dataset is split into two parts: a training set with 3772 observations and a test set with 3428 observations. The response is a categorical variable with three classes: normal (not hypothyroid), hyperfunction, and subnormal functioning. There are 21 predictors, including 15 categorical predictors and 6 numerical predictors. The three classes are very imbalanced, with 92.5% normal patients. A good classifier should produce a significant decrease from the 7.5% error rate that can be obtained by classifying all observations to the normal class.
CATCH identifies 7 variables as important: 4 continuous and 3 categorical. We consider three classification methods: 1. Support Vector Machine with radial kernel, denoted SVM(r); 2. Support Vector Machine with polynomial kernel, denoted SVM(p); 3. RFC. We apply these three methods to the training set to build classification models and use the test set to estimate the misclassification rate of each method. We apply the methods with CATCH and without CATCH, where "with CATCH" means that only the 7 important variables identified by CATCH are used to build the model and "without CATCH" means that all 21 variables are used in the analysis.
Table 5: Misclassification rates of three classification methods (SVM(r), SVM(p), RFC) with CATCH and without CATCH.

                  SVM(r)   SVM(p)   RFC
    w/o CATCH     0.052    0.063    0.024
    with CATCH    0.037    0.055    0.015
Table 5 shows that if CATCH is used to select important variables before the classification methods are applied, the misclassification rates decrease. CATCH reduces the misclassification rates by 29%, 13%, and 37% for SVM(r), SVM(p), and RFC, respectively. CATCH+RFC has the best performance. Although CATCH is not designed for improving classification accuracy, it nonetheless helps other classification methods perform better.
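The quoted reductions follow directly from Table 5; a quick arithmetic check:

```python
# Misclassification rates from Table 5: (without CATCH, with CATCH).
rates = {"SVM(r)": (0.052, 0.037), "SVM(p)": (0.063, 0.055), "RFC": (0.024, 0.015)}
for method, (without, with_catch) in rates.items():
    pct = 100 * (without - with_catch) / without
    print(f"{method}: {pct:.1f}% reduction")
# SVM(r): 28.8%, SVM(p): 12.7%, RFC: 37.5% -- i.e., about 29%, 13%, and 37%.
```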
6 Properties of Local Contingency Efficacy
In this section we investigate the properties of the local contingency efficacy. For a univariate continuous predictor, Theorem 6.1 below shows that the local contingency efficacy is well defined and equals 0 if the response variable is independent of the predictor. Consider an R × C contingency table. Let n_rc, r = 1, ..., R, c = 1, ..., C, be the observed frequency in the (r, c)-cell and let p_rc be the probability that an observation belongs to the (r, c)-cell. Let
    p_r = Σ_{c=1}^{C} p_rc,   q_c = Σ_{r=1}^{R} p_rc,    (6.1)
and

    n_{r·} = Σ_{c=1}^{C} n_rc,   n_{·c} = Σ_{r=1}^{R} n_rc,   n = Σ_{r=1}^{R} Σ_{c=1}^{C} n_rc,   E_rc = n_{r·} n_{·c} / n.    (6.2)
For testing that the rows and the columns are independent, i.e.,

    H0: ∀ r, c, p_rc = p_r q_c   versus   H1: ∃ r, c, p_rc ≠ p_r q_c,    (6.3)

a standard test statistic is

    X² = Σ_{r=1}^{R} Σ_{c=1}^{C} (n_rc − E_rc)² / E_rc.    (6.4)
Under H0, X² converges in distribution to χ²_{(R−1)(C−1)}. Moreover, defining 0/0 = 0, we have the well-known result:
Lemma 6.1. As n tends to infinity, X²/n tends in probability to

    τ² ≡ Σ_{r=1}^{R} Σ_{c=1}^{C} (p_rc − p_r q_c)² / (p_r q_c).

Moreover, if p_r q_c is bounded away from 0 and 1 as n → ∞, and if R and C are fixed as n → ∞, then τ² = 0 if and only if H0 is true.
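Lemma 6.1 is easy to check numerically: for a table of counts drawn from fixed cell probabilities p_rc, X²/n should approach τ². A pure-Python sketch (the 2 × 2 probabilities are an arbitrary illustration of ours):

```python
import random

def chisq(table):
    """Pearson X^2 of (6.4) for a table of counts given as a list of rows."""
    R, C = len(table), len(table[0])
    n = sum(map(sum, table))
    row = [sum(t) for t in table]
    col = [sum(table[r][c] for r in range(R)) for c in range(C)]
    return sum((table[r][c] - row[r] * col[c] / n) ** 2 / (row[r] * col[c] / n)
               for r in range(R) for c in range(C))

def tau_sq(p):
    """Limit of X^2/n in Lemma 6.1 for cell probabilities p."""
    R, C = len(p), len(p[0])
    pr = [sum(q) for q in p]
    qc = [sum(p[r][c] for r in range(R)) for c in range(C)]
    return sum((p[r][c] - pr[r] * qc[c]) ** 2 / (pr[r] * qc[c])
               for r in range(R) for c in range(C))

# Dependent rows/columns: tau^2 > 0, and X^2/n from a large sample is close.
p = [[0.3, 0.1], [0.2, 0.4]]
random.seed(0)
n = 50000
counts = [[0, 0], [0, 0]]
for _ in range(n):
    u = random.random()
    r, c = (0, 0) if u < 0.3 else (0, 1) if u < 0.4 else (1, 0) if u < 0.6 else (1, 1)
    counts[r][c] += 1
print(tau_sq(p), chisq(counts) / n)  # both near 1/6
```

For independent cell probabilities, `tau_sq` returns exactly 0, as the lemma asserts.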
Remark 6.1. Lemma 6.1 shows that the contingency efficacy ζ in (2.8) is well defined. Moreover, ζ = 0 if and only if X and Y are independent.
Theorem 6.1. Consider a categorical response Y and a continuous predictor X with density function f(x) such that, for a fixed h > 0, f(x) > 0 for x ∈ (x^(0) − h, x^(0) + h). Then:

(1) For fixed h, the estimated local section efficacy ζ̂(x^(0), h) defined in (2.5) converges almost surely to the section efficacy ζ(x^(0), h) defined in (2.4).

(2) If X and Y are independent, then the local efficacy ζ(x^(0)) defined in (2.4) is zero.

(3) If there exists c ∈ {1, ..., C} such that p_c(x) = P(Y = c | x) has a continuous derivative at x = x^(0) with p′_c(x^(0)) ≠ 0, then ζ(x^(0)) > 0.
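To make the quantities in Theorem 6.1 concrete, here is a sketch (our own pure-Python illustration, not the paper's code) of the estimated local efficacy: the points with X in (x0 − h, x0 + h) form a 2 × C table with rows split at x0 and columns given by the classes of Y, and the statistic is √(X²/n), as in the proof in the Appendix:

```python
import math
import random

def local_efficacy(data, x0, h):
    """Estimated local contingency efficacy at x0 with section size h."""
    classes = sorted({y for _, y in data})
    n = len(data)
    table = {(r, c): 0 for r in (0, 1) for c in classes}
    for x, y in data:
        if x0 - h < x < x0 + h:
            table[(1 if x > x0 else 0, y)] += 1  # row 1: X > x0; row 0: X <= x0
    nh = sum(table.values())
    if nh == 0:
        return 0.0
    row = {r: sum(table[(r, c)] for c in classes) for r in (0, 1)}
    col = {c: table[(0, c)] + table[(1, c)] for c in classes}
    x2 = sum((table[(r, c)] - row[r] * col[c] / nh) ** 2 / (row[r] * col[c] / nh)
             for r in (0, 1) for c in classes if row[r] * col[c] > 0)
    return math.sqrt(x2 / n)

# Y depends on X: P(Y = 1 | x) = x, so the efficacy at x0 = 0.5 is positive;
# with Y independent of X it hovers near 0.
random.seed(0)
dep = [(x, 1 if random.random() < x else 0)
       for x in (random.random() for _ in range(2000))]
ind = [(random.random(), 1 if random.random() < 0.5 else 0) for _ in range(2000)]
print(local_efficacy(dep, 0.5, 0.2), local_efficacy(ind, 0.5, 0.2))
```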
The χ² statistic (6.4) can be written in terms of multinomial proportions p̂_rc, with √n(p̂_rc − p_rc), 1 ≤ r ≤ R, 1 ≤ c ≤ C, converging in distribution to a multivariate normal. By a Taylor expansion of T_n ≡ √(χ²/n) at τ = plim √(χ²/n), we find that:
Theorem 6.2.

    √n (T_n − τ) →_d N(0, v²),    (6.5)

where the asymptotic variance v², which can be obtained from the Taylor expansion, is not needed in this paper.
Remark 6.2. The result (6.5) applies to the multivariate χ²-based efficacies ζ with n replaced by the number of observations that go into the computation of ζ. In this case we use the notation χ²_j for the chi-square statistic for the jth variable.
7 Consistency of CATCH

Let Y denote a variable to be classified into one of the categories 1, ..., C by using the Xj's in the vector X = (X1, ..., Xd). Let X_{−j} ≡ X − {Xj} denote X without the Xj component. Assume that the squared Pearson nonparametric correlation (Doksum and Samarov [4]; Huang and Chen [9]) between Xj and X_{−j} is less than 0.5. We call the variable Xj relevant for Y if, conditionally given X_{−j}, L(Y | Xj = x) is not constant in x, and irrelevant otherwise. Our data are (X^(1), Y^(1)), ..., (X^(n), Y^(n)), i.i.d. as (X, Y). A variable selection procedure is consistent if the probability that all the relevant variables are selected and none of the irrelevant variables are selected tends to one as n tends to infinity.
We give a general nonparametric framework in which the variable selection rule based on the CATCH importance scores sj given in (3.8) is consistent.
When the predictors are not all categorical and CATCH involves tubes, we assume that the number of observations in each tube tends to ∞ as n → ∞, and that for those importance scores that require the selection of a section size h, the selection is from a finite set {hj} with min_j{hj} > h0 for some h0 > 0. It follows that the number of observations mjk in the kth tube used to calculate the importance score sj for the variable Xj tends to infinity as n → ∞.
The key quantities that determine consistency are the limits λj of sj. That is,

    λj = τ(ζj, P) = E_P[ζj(X)] = ∫ ζj(x) dP(x),   j = 1, ..., d,
where ζj(x) is the local contingency efficacy defined by (2.4) for numerical X's and by (2.8) for categorical X's, and P is the probability distribution of X. Our estimate sj of λj is

    sj = τ(ζ̂j, P̂) = (1/M) Σ_{k=1}^{M} ζ̂j(X*_k),

where ζ̂j is defined in Section 3 and P̂ is the empirical distribution of X.
Let d0 denote the number of X's that are irrelevant for Y. Set d1 = d − d0 and,
without loss of generality, reorder the λj so that

    λj > 0 if j = 1, ..., d1;   λj = 0 if j = d1 + 1, ..., d.    (7.1)
Our variable selection rule is:

    "keep variable Xj if sj ≥ t" for some threshold t.    (7.2)

Rule (7.2) is consistent if and only if

    P(min_{j≤d1} sj ≥ t and max_{j>d1} sj < t) → 1.    (7.3)
To establish (7.3) it is enough to show that

    P(max_{j>d1} sj > t) → 0    (7.4)

and

    P(min_{j≤d1} sj ≤ t) → 0.    (7.5)
We call (7.4) and (7.5) type I and type II consistency, respectively. Type I and II error probabilities refer to the probabilities of any false positives and any false negatives, respectively.
7.1 Type I consistency
We consider (7.4) first and note that

    P(max_{j>d1} sj > t) ≤ Σ_{j=d1+1}^{d} P(sj > t).    (7.6)
For the jth irrelevant variable Xj, j > d1, the results in Section 6 apply to the local efficacy ζj(x) at a fixed point x. The importance score sj defined in (3.8) is the average of such efficacies over a sample of M random points x*_a. We assume that there is a point x^(0) such that, for irrelevant Xj's and for n large enough, ζ̂j(x^(0)) is stochastically larger in the right tail than the average M^{−1} Σ_{a=1}^{M} ζ̂j(x*_a). That is, we assume that there exist t > 0 and n0 such that for n ≥ n0,

    P(sj > t) ≤ P(s_j^(0) > t),   j > d1.    (7.7)

The existence of such a point x^(0) is established by considering the probability limit of

    arg max{ζ̂j(x*_a) : x*_a ∈ {x*_1, ..., x*_M}}.
Note that by Lemma 6.1, s_j^(0) →_p 0 as n → ∞. It follows that we have established (7.4) for t and d0 = d − d1 that are fixed as n → ∞. To examine the interesting case where d0 → ∞ as n → ∞, we need large deviation results, to which we turn next. In the d0 → ∞ case we consider t ≡ tn that tends to ∞ as n → ∞, and we use the weaker assumption: there exist x^(0) and n0 such that

    P(sj > tn) ≤ P(s_j^(0) > tn)    (7.8)
for n ≥ n0 and each tn with lim_{n→∞} tn = ∞. We also assume that the number of columns C and rows R in the contingency table that defines the efficacy χ²_{jn} given by (6.4) stays fixed as n → ∞. In this case P(s_j^(0) > tn) → 0 provided each term

    [T_rc^(j)]² ≡ [(n_rc − E_rc) / √E_rc]²

in s_j² = χ²_{jn}, defined for Xj by (6.4), satisfies

    P(|T_rc^(j)| > tn) → 0.
This is because

    P(√(Σ_r Σ_c [T_rc^(j)]²) > tn) ≤ P(RC max_{r,c} [T_rc^(j)]² > tn²) ≤ RC max_{r,c} P(|T_rc^(j)| > tn/√(RC)),
and tn and tn/√(RC) are of the same order as n → ∞. By large deviation theory (Hall [8]; Jing et al. [10]), for some constant A,

    P(|T_rc^(j)| > tn) = 2 tn^{−1} (2π)^{−1/2} e^{−tn²/2} [1 + A tn³ n^{−1/2} + o(tn³ n^{−1/2})],    (7.9)

provided tn = o(n^{1/6}). If (7.9) is uniform in j > d1, then (7.6) and (7.9) imply that
type I consistency holds if

    d0 (tn √(2π))^{−1} e^{−tn²/2} → 0.    (7.10)
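The factor multiplying d0 in (7.10) is the familiar Gaussian tail (Mills-ratio) approximation P(|Z| > t) ≈ 2φ(t)/t, the leading term of (7.9). It can be compared with the exact normal tail via `erfc` (a sketch of ours):

```python
import math

def tail_exact(t):
    """P(|Z| > t) for standard normal Z."""
    return math.erfc(t / math.sqrt(2))

def tail_approx(t):
    """Leading term of (7.9): 2 * phi(t) / t."""
    return 2.0 * math.exp(-t * t / 2) / (t * math.sqrt(2 * math.pi))

for t in (2.0, 3.0, 4.0):
    print(t, tail_exact(t), tail_approx(t), tail_approx(t) / tail_exact(t))
```

The ratio tends to 1 as t → ∞, which is why the threshold tn controls exponentially many irrelevant variables in (7.10).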
To examine (7.10), write d0 and tn as d0 = exp(n^b) and tn = √2 n^r. By large deviation theory, (7.10) holds when r < 1/6. Thus if 0 < b/2 < r < 1/6, then

    d0 (tn √(2π))^{−1} e^{−tn²/2} = (2√π)^{−1} n^{−r} exp(n^b − n^{2r}) → 0.
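The bound just displayed can be sanity-checked numerically for a concrete pair (b, r) satisfying 0 < b/2 < r < 1/6; the sketch below uses b = 1/4, r = 1/7 (our own illustrative choice, constants dropped):

```python
import math

def bound(n, b=0.25, r=1 / 7):
    """n^{-r} * exp(n^b - n^{2r}): the type-I bound with d0 = exp(n^b)
    and t_n = sqrt(2) n^r, up to constants."""
    return n ** (-r) * math.exp(n ** b - n ** (2 * r))

for n in (10, 100, 10_000, 10**6):
    print(n, bound(n))  # decreases toward 0 as n grows
```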
Thus d0 = exp(n^b) with b = 1/3 − ε, for any ε > 0, is the largest d0 can be to have consistency. For instance, if there are exactly d0 = exp(n^{1/4}) irrelevant variables and we use the threshold tn = √2 n^{1/7}, the probability of selecting any of the irrelevant variables tends to zero at the rate n^{−1/7} exp(n^{1/4} − n^{2/7}).

7.2 Type II consistency
We now assume that for each relevant variable Xj there is a least favorable point x^(1) such that the local efficacy s_j^(1) ≡ ζ̂(x^(1)) is stochastically smaller than the importance score sj = M^{−1} Σ_{a=1}^{M} ζ̂(x*_a). That is, we assume that for each relevant variable Xj there exist n1 and x^(1) such that for n ≥ n1,

    P(s_j^(1) ≤ tn) ≥ P(sj ≤ tn),   j ≤ d1.    (7.11)

The existence of such a least favorable x^(1) is established by considering the probability limit of

    arg min{ζ̂j(x*_a) : x*_a ∈ {x*_1, ..., x*_M}}.
We again assume that R and C are fixed, so that to show type II consistency it is enough to show that for each (r, c) and j ≤ d1,

    P(|T_rc^(j)| ≤ tn) → 0.

This is because

    P(√(Σ_r Σ_c [T_rc^(j)]²) ≤ tn) ≤ P(min_{r,c} [T_rc^(j)]² ≤ tn²) ≤ RC max_{r,c} P(|T_rc^(j)| ≤ tn).
By Theorem 6.2, √n(s_j^(1) − τj) →_d N(0, ν²), τj > 0. It follows that if tn and d1 are bounded above as n → ∞ and τj > 0, then P(s_j^(1) ≤ tn) → 0. Thus type II consistency holds in this case. To examine the more interesting case where tn and d1 are not bounded above and τj may tend to zero as n → ∞, we need to center the T_rc^(j). Thus set

    θ_n^(j) = √n (p_rc^(j) − p_r^(j) p_c^(j)) / √(p_r^(j) p_c^(j)),    (7.12)
    T_n^(j) = T_rc^(j) − θ_n^(j),

where for simplicity the dependence of T_n^(j) and θ_n^(j) on r and c is not indicated.
Lemma 7.1.

    P(min_{j≤d1} |T_rc^(j)| ≤ tn) ≤ Σ_{j=1}^{d1} P(|T_n^(j)| ≥ min_{j≤d1} |θ_n^(j)| − tn).

Proof.

    P(min_{j≤d1} |T_rc^(j)| ≤ tn) = P(min_{j≤d1} |T_n^(j) + θ_n^(j)| ≤ tn)
        ≤ P(min_{j≤d1} |θ_n^(j)| − max_{j≤d1} |T_n^(j)| ≤ tn)
        = P(max_{j≤d1} |T_n^(j)| ≥ min_{j≤d1} |θ_n^(j)| − tn)
        ≤ Σ_{j=1}^{d1} P(|T_n^(j)| ≥ min_{j≤d1} |θ_n^(j)| − tn).  □
Set

    an = min_{j≤d1} |θ_n^(j)| − tn;

then by large deviation theory, if an = o(n^{1/6}),

    P(|T_n^(j)| ≥ an) ∝ (an √(2π))^{−1} exp{−an²/2},    (7.13)
where ∝ signifies asymptotic order as n → ∞, an → ∞. It follows that if (7.13) holds uniformly in j, then by Lemma 7.1,

    P(min_{j≤d1} |T_rc^(j)| ≤ tn) ∝ d1 (an √(2π))^{−1} exp{−an²/2}.

Thus type II consistency holds when an = o(n^{1/6}) and

    d1 (an √(2π))^{−1} exp{−an²/2} → 0.    (7.14)
In Section 7.1, tn is of order n^r, 0 < r < 1/6. For this tn, type II consistency holds if an is of order n^r, 0 < r < 1/6, and

    d1 n^{−r} exp{−n^{2r}/2} → 0.
If an = O(n^k) with k ≥ 1/6, then P(|T_n^(j)| ≥ an) tends to zero at a rate faster than in (7.13) and consistency is ensured. For instance, if all relevant variables Xj have p_rc^(j) fixed as n → ∞, then min_{j≤d1} |θ_n^(j)| is of order √n and an is of order n^{1/2} − tn.
7.3 Type I and II consistency

We see from Sections 7.1 and 7.2 that consistency holds under a variety of conditions. One set of conditions is:

Theorem 7.1. Under the regularity conditions of Sections 7.1 and 7.2, if

    d0 = c0 exp(n^b),   tn = c1 n^r,   min_{j≤d1} |θ_n^(j)| − tn = c2 n^r,   d1 = c3 exp(n^b)

for positive constants c0, c1, c2, c3 and 0 < b/2 < r < 1/6, then CATCH is consistent.
Appendix: Proof of Theorem 6.1

(1) Let N_h = Σ_{i=1}^{n} I(X^(i) ∈ N_h(x^(0))). Then N_h ~ Bin(n, ∫_{x^(0)−h}^{x^(0)+h} f(x) dx) and N_h →_a.s. ∞ as n → ∞.
    ζ̂(x^(0), h) = n^{−1/2} √(X²(x^(0), h)) = (N_h/n)^{1/2} (N_h)^{−1/2} √(X²(x^(0), h)).

Since N_h ~ Bin(n, ∫_{x^(0)−h}^{x^(0)+h} f(x) dx), we have

    N_h/n → ∫_{x^(0)−h}^{x^(0)+h} f(x) dx.    (7.15)
By Lemma 6.1 and Table 1,

    (N_h)^{−1/2} √(X²(x^(0), h)) → ( Σ_{r=+,−} Σ_{c=1}^{C} (p_rc − p_r q_c)² / (p_r q_c) )^{1/2},    (7.16)

where

    p_{+c} = Pr(X > x^(0), Y = c | X ∈ N_h(x^(0))),   p_{−c} = Pr(X ≤ x^(0), Y = c | X ∈ N_h(x^(0)))
for r = +, − and c = 1, ..., C,

    p_r = Σ_{c=1}^{C} p_rc,   q_c = Σ_{r=+,−} p_rc.

In fact, p_+ = Pr(X > x^(0) | X ∈ N_h(x^(0))), p_− = Pr(X ≤ x^(0) | X ∈ N_h(x^(0))), and q_c = Pr(Y = c | X ∈ N_h(x^(0))).
By (7.15) and (7.16),

    ζ̂(x^(0), h) → ( ∫_{x^(0)−h}^{x^(0)+h} f(x) dx )^{1/2} ( Σ_{r=+,−} Σ_{c=1}^{C} (p_rc − p_r q_c)² / (p_r q_c) )^{1/2} ≡ ζ(x^(0), h).
(2) Since X and Y are independent, p_rc = p_r q_c for r = +, − and c = 1, ..., C, so ζ(x^(0), h) = 0 for any h. Hence

    ζ(x^(0)) = sup_{h>0} ζ(x^(0), h) = 0.
(3) Recall that p_c(x^(0)) = Pr(Y = c | x^(0)). Without loss of generality assume p′_c(x^(0)) > 0. Then there exists h such that

    Pr(Y = c | x^(0) < X < x^(0) + h) > Pr(Y = c | x^(0) − h < X < x^(0)),

which is equivalent to p_{+c}/p_+ > p_{−c}/p_−. This shows that ζ(x^(0)) > 0, since by Lemma 6.1, ζ(x^(0)) = 0 would imply p_{+c}/p_+ = p_{−c}/p_− = p_c.
References

[1] Breiman, L., 2001. Random forests. Machine Learning 45, 5–32.

[2] Breiman, L., Friedman, J. H., Olshen, R. A., Stone, C. J., 1984. Classification and Regression Trees. Wadsworth, Belmont.

[3] Doksum, K., Tang, S., Tsui, K.-W., 2008. Nonparametric variable selection: the EARTH algorithm. Journal of the American Statistical Association 103(484), 1609–1620.

[4] Doksum, K. A., Samarov, A., 1995. Nonparametric estimation of global functionals and a measure of the explanatory power of covariates in regression. Annals of Statistics 23, 1443–1473.

[5] Doksum, K. A., Schafer, C., 2006. Powerful choices: tuning parameter selection based on power. In: Frontiers in Statistics: Dedicated to Peter Bickel (J. Fan and H. L. Koul, eds.). Imperial College Press, 113–141.

[6] Friedman, J., 1991. Multivariate adaptive regression splines. The Annals of Statistics 19, 1–67.

[7] Gao, J., Gijbels, I., 2008. Bandwidth selection in nonparametric kernel testing. Journal of the American Statistical Association 103(484), 1584–1594.

[8] Hall, P., 1992. The Bootstrap and Edgeworth Expansion. Springer, New York.

[9] Huang, L.-S., Chen, J., 2008. Analysis of variance, coefficient of determination and F-test for local polynomial regression. Annals of Statistics 36, 2085–2109.

[10] Jing, B.-Y., Shao, Q.-M., Wang, Q., 2003. Self-normalized Cramér-type large deviations for independent random variables. Annals of Probability 31, 2167–2215.

[11] Lin, D. Y., Zeng, D., 2011. Correcting for population stratification in genomewide association studies. Journal of the American Statistical Association 106, 997–1008.

[12] Loh, W.-Y., 2009. Improving the precision of classification trees. The Annals of Applied Statistics 3, 1710–1737.

[13] Price, A., Patterson, N., Plenge, R., Weinblatt, M., Shadick, N., Reich, D., 2006. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38(8), 904–909.

[14] Qiao, Z., Zhou, L., Huang, J., 2008. Linear discriminant analysis for high-dimensional, low-sample-size data. Special issue of the World Congress on Engineering 2008.

[15] Quinlan, J. R., 1987. Simplifying decision trees. International Journal of Man-Machine Studies 27(3), 221–234.

[16] Schafer, C., Doksum, K. A., 2009. Selecting local models in multiple regression by maximizing power. Metrika 69, 283–304.

[17] Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267–288.

[18] Wang, L., Shen, X., 2007. On L1-norm multi-class support vector machines: methodology and theory. Journal of the American Statistical Association 102, 583–594.

[19] Zhang, H. H., Liu, Y., Wu, Y., Zhu, J., 2008. Variable selection for the multicategory SVM via adaptive sup-norm regularization. Electronic Journal of Statistics 2, 149–167.
4 Simulation Studies
4.1 Variable Selection with Categorical and Numerical Predictors

Example 4.1. Consider d ≥ 5 predictors X1, X2, ..., Xd and a categorical response variable Y. The fifth predictor X5 is a Bernoulli random variable which takes the two values 0 and 1 with equal probability. All other predictors are uniformly distributed on [0, 1].
When X5 = 0, the dependent variable Y is related to the predictors in the following fashion:
    Y = 0 if X1 − X2 − 0.25 sin(16πX2) ≤ −0.5,
        1 if −0.5 < X1 − X2 − 0.25 sin(16πX2) ≤ 0,
        2 if 0 < X1 − X2 − 0.25 sin(16πX2) ≤ 0.5,
        3 if X1 − X2 − 0.25 sin(16πX2) > 0.5.    (4.1)
When X5 = 1, Y depends on X3 and X4 in the same way as above, with X1 replaced by X3 and X2 replaced by X4. The other variables X6, ..., Xd are irrelevant noise variables. All d predictors are independent.
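For concreteness, a data-generating sketch of model (4.1) (our own Python rendering of the display above; the oscillation constant 16π is taken as printed):

```python
import math
import random

def sample_model_41(n=400, d=10, rng=random):
    """Generate (X, Y) pairs from model (4.1); X5 (index 4) is Bernoulli(1/2)
    and selects which pair of uniforms, (X1, X2) or (X3, X4), drives Y."""
    def label(u, v):
        z = u - v - 0.25 * math.sin(16 * math.pi * v)
        return 0 if z <= -0.5 else 1 if z <= 0 else 2 if z <= 0.5 else 3
    data = []
    for _ in range(n):
        x = [rng.random() for _ in range(d)]
        x[4] = 1 if rng.random() < 0.5 else 0   # the categorical X5
        y = label(x[0], x[1]) if x[4] == 0 else label(x[2], x[3])
        data.append((x, y))
    return data
```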
In addition to the presence of the noise variables X6, ..., Xd, this example is challenging in two respects. First, there is an interaction effect between the continuous predictors X1, ..., X4 and the categorical predictor X5. Such interaction effects may arise in practice. For example, suppose Y represents the effect resulting from multiple treatments X1, ..., X4 and X5 represents the gender of a patient; our model allows males and females to respond differently to treatments. Second, the classification boundaries for fixed X5 are highly nonlinear, as shown in Figure 1. This design demonstrates that our method can detect nonlinear dependence between the response and the predictors.
Table 2 shows the number of simulations in which the predictors X1, ..., X5 are correctly kept in the model and, in column 6, the average number of simulations in which the d − 5 irrelevant predictors are falsely selected into the model, out of 100 simulation trials from model (4.1). In the simulation, n = 400 and d = 10, 50, and 100 are used. We compare CATCH with other methods, including Marginal CATCH, the Random Forest classifier (RFC) (Breiman [1]), and GUIDE (Loh [12]). For Marginal CATCH we test the association between Y and Xl without conditioning on X_{−l}. RFC is well known as a powerful tool for prediction; here we use importance scores from RFC for variable selection. We create an additional 30 noise variables and use their average importance score multiplied by a factor of 2 as a threshold (Doksum et al. [3]) to decide whether a variable should be selected into the model. In particular, we generate the noise variables using uniform and Bernoulli distributions, respectively, for numerical and categorical predictors.
The main goal of GUIDE is to solve the variable selection bias problem in regression and classification tree methods. As an illustration, if a categorical predictor takes K distinct values, then the splitting test exhausts 2^{K−1} − 1 possibilities and therefore is
Table 2: Simulation results from Example 4.1 with sample size n = 400. For 1 ≤ i ≤ 5, the ith column gives the estimated probability of selection of the relevant variable Xi and the standard error based on 100 simulations. Column 6 gives the mean of the estimated probability of selection and the standard error for the irrelevant variables X6, ..., Xd.

    Number of Selections (100 simulations)
    METHOD           d     X1          X2          X3          X4          X5          Xi, i ≥ 6
    CATCH            10    0.82(0.04)  0.75(0.04)  0.78(0.04)  0.82(0.04)  1(0)        0.05(0.02)
                     50    0.76(0.04)  0.76(0.04)  0.87(0.03)  0.84(0.04)  0.96(0.02)  0.08(0.03)
                     100   0.77(0.04)  0.73(0.04)  0.82(0.04)  0.8(0.04)   0.83(0.04)  0.09(0.03)
    Marginal CATCH   10    0.77(0.04)  0.67(0.05)  0.7(0.05)   0.8(0.04)   0.06(0.02)  0.1(0.03)
                     50    0.69(0.05)  0.7(0.05)   0.76(0.04)  0.79(0.04)  0.07(0.03)  0.1(0.03)
                     100   0.78(0.04)  0.76(0.04)  0.73(0.04)  0.73(0.04)  0.12(0.03)  0.1(0.03)
    Random Forest    10    0.71(0.05)  0.51(0.05)  0.47(0.05)  0.59(0.05)  0.37(0.05)  0.06(0.02)
                     50    0.64(0.05)  0.61(0.05)  0.65(0.05)  0.58(0.05)  0.28(0.04)  0.08(0.03)
                     100   0.66(0.05)  0.58(0.05)  0.59(0.05)  0.65(0.05)  0.17(0.04)  0.11(0.03)
    GUIDE            10    0.96(0.02)  0.98(0.01)  0.93(0.03)  0.99(0.01)  0.92(0.03)  0.33(0.05)
                     50    0.82(0.04)  0.85(0.04)  0.9(0.03)   0.89(0.03)  0.67(0.05)  0.12(0.03)
                     100   0.83(0.04)  0.84(0.04)  0.86(0.03)  0.83(0.04)  0.61(0.05)  0.08(0.03)
Figure 1: The relationship between Y and X1, X2 for model (4.1). The four areas represent the four values of Y.
biased toward being selected. The way that GUIDE alleviates this bias is to employ lack-of-fit tests, i.e., χ²-tests applicable to both numerical and categorical predictors, with the p-values adjusted by a bootstrap procedure. Since GUIDE has negligible selection bias, it can include tests for local pairwise interactions. Furthermore, it achieves sensitivity to local curvature by using a linear model at each node.
The results in Table 2 show that Marginal CATCH does a very poor job of selecting X5. This illustrates the importance of local conditioning. RFC does better than Marginal CATCH but also has difficulties, while CATCH and GUIDE successfully identify X5, along with X1 to X4, with high frequency. When d = 10, GUIDE does very well for the relevant variables but selects too many irrelevant variables. Surprisingly, GUIDE selects fewer irrelevant variables as d increases; it becomes more cautious as the dimension d grows. CATCH performs very well in the high-dimensional cases (d = 50 and d = 100). This indicates that the CATCH algorithm is robust to the intricate interaction structures between the predictors.
In model (4.1) above, the predictors interact in a hierarchical way. When X5 = 0, the response is related to X1 and X2 as shown in Figure 1, where the two axes are the two relevant predictors and the four areas partitioned by the curves correspond to the different values of Y. When X5 = 1, the response depends on X3 and X4 in the same way. This type of hierarchical interaction between the categorical and numerical predictors is difficult to detect even by state-of-the-art classification methods such as the support vector machine (SVM) and RFC. In particular, RFC produces an importance
score vector that identifies the four numerical variables X1, ..., X4 as very significant but often leaves the categorical variable X5 unidentified, as we see in Table 2.
The CATCH algorithm, however, performs very well for this complicated model. We now explain why CATCH is able to identify X5 as an important variable. To explore the association between X5 and Y, we construct M contingency tables with 2 rows and 4 columns based on M randomly centered tubes, calculate the contingency efficacy scores, and then average the scores. Figure 2 (a) and (b) illustrate how one of the contingency efficacy scores is calculated. 1000 data points are generated from model (4.1) and split into two sets according to x5: (a) shows the (roughly half of the) points with x5 = 0 in the (x1, x2) dimensions, and (b) shows the remaining points with x5 = 1 in the (x3, x4) dimensions. The data points are colored according to Y; this representation lets us clearly see the classification boundaries (grey lines). The randomly selected tube center, for which x5 is zero, is shown as a star in (a). The data points within the tube, i.e., close to the tube center in terms of X_{−5}, are highlighted by larger dots. Since the distance defining the tube does not involve X5, the larger dots have x5 equal to 0 or 1 and are present in both (a) and (b). The counts of the larger dots in different colors in (a) and (b) correspond to the cell counts of the two rows x5 = 0 and x5 = 1 in the contingency table. As we can see, the larger dots in (a) are mostly red while those in (b) are mostly blue and orange, which shows that the distribution of Y depends on the value of X5, so the contingency efficacy score is high. The contingency efficacy score would be low only if the tube center had similar values in (x1, x2) and (x3, x4); such tube centers are rare among the M randomly chosen centers. Thus the average contingency efficacy score is high and X5 is identified as an important variable.
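The construction just described can be sketched in a few lines (a simplified rendering of ours; the real CATCH algorithm also adapts the tube shape and distances iteratively, which is omitted here). `tube_score_x5` averages 2 × C contingency efficacies over M tubes whose distance ignores coordinate X5:

```python
import math
import random

def tube_score_x5(data, m=50, M=50, rng=random):
    """Average tube-based contingency efficacy for the categorical X5
    (index 4): rows of each local table are X5 = 0, 1; columns are the
    classes of Y observed among the m points nearest the tube center in
    the X_{-5} coordinates."""
    def efficacy(points):
        classes = sorted({y for _, y in points})
        table = {(r, c): 0 for r in (0, 1) for c in classes}
        for x, y in points:
            table[(x[4], y)] += 1
        nh = len(points)
        row = {r: sum(table[(r, c)] for c in classes) for r in (0, 1)}
        col = {c: table[(0, c)] + table[(1, c)] for c in classes}
        x2 = sum((table[(r, c)] - row[r] * col[c] / nh) ** 2
                 / (row[r] * col[c] / nh)
                 for r in (0, 1) for c in classes if row[r] * col[c] > 0)
        return math.sqrt(x2 / nh)
    total = 0.0
    for _ in range(M):
        center = rng.choice(data)[0]
        d2 = lambda x: sum((x[j] - center[j]) ** 2
                           for j in range(len(x)) if j != 4)
        total += efficacy(sorted(data, key=lambda p: d2(p[0]))[:m])
    return total / M

# Toy data (not model (4.1)): Y is determined jointly by X1 and X5, so within
# every tube the X5 rows separate the classes and the score is maximal (1.0).
random.seed(0)
data = []
for _ in range(400):
    x = [random.random() for _ in range(6)]
    x[4] = 1 if random.random() < 0.5 else 0
    data.append((x, int(x[0] > 0.5) + 2 * x[4]))
print(round(tube_score_x5(data), 2))
```

When Y does not depend on X5, the per-tube chi-square statistic stays near its null size and the averaged score is small.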
To contrast this example with a simpler data-generating model, suppose the effects of the predictors are additive; that is, the categorical variable enters the response as a term added to the effect of the numerical predictors. For this type of simpler model the RFC and GUIDE algorithms can efficiently identify the relevant predictors, but so can the CATCH algorithm. We do not include the simulation results for additive models here.
Remark 4.1. We observe that the CATCH algorithm is not very sensitive to the choice of the number of tube centers M or the tube size m = [n^a], where 0 < a ≤ 1 and [ ] is the greatest integer function. For Example 4.1 we perform the simulation using M = n and M = n/2, and a = 0.1, 0.15, 0.2, 0.3, 0.5. The results are summarized in Table 3. The CATCH algorithm performs similarly for M = n and M = n/2, with M = n having a small winning edge. We generally suggest using M = n if computational resources are of less concern or n is not large (≤ 200). When computational cost is a concern (Remark 3.5), one can reduce the number of tube centers to M = n/2 as a compromise between detection power and computational cost. Table 3 also shows that the tube size works well in the range between 0.1 and 0.3, while the detection power for X5 decreases in the case a = 0.5, where the tube is too big to achieve local conditioning. This observation is consistent with the comparison to Marginal CATCH in Table 2. On the other hand, the tube size should not be too small to achieve sufficient detection power. We suggest using a = 0.1 or 0.15 for sample sizes n ≥ 400, or m = 50 for smaller sample sizes. In light of these guidelines, we use M = n/2 = 200 and a = 0.15 for Examples 4.1-4.3, and M = n = 200 and m = 50 for Example 4.4.
Table 3: Sensitivity study of CATCH for Example 4.1 (n = 400, d = 50) using different numbers of tube centers M and tube sizes m = [n^a]. For 1 ≤ i ≤ 5, the ith column gives the number of times Xi was selected. Column 6 gives the percentage of times irrelevant variables were selected.
    Number of Selections (100 simulations)
    M        a     X1  X2  X3  X4  X5   Xi, i ≥ 6 (%)
    M = 200  0.10  75  67  77  69  92    7.8
             0.15  76  76  87  84  96    8.2
             0.20  85  84  90  87  93    9.9
             0.30  89  87  95  89  85    9.6
             0.50  89  90  96  93  62    9.9
    M = 400  0.10  74  76  82  77  92    7.2
             0.15  85  82  89  87  94    8.6
             0.20  87  88  90  86  96    9.1
             0.30  88  89  91  91  92    9.4
             0.50  89  91  94  93  61   10.2
In Example 4.1 all predictors are independent. In a similar setting, we next consider the case where the predictors are dependent.

Example 4.2. We generate standard normal random variables Zi, i = 1, ..., d, in such a way that cor(Z1, Z4) = 0.5, cor(Z3, Z6) = 0.9, and the other Zi's are independent. We set Xi = Φ(Zi), i = 1, ..., d, where Φ(·) is the CDF of the standard normal, so that the Xi are marginally uniformly distributed as in Example 4.1. The correlation structure of the Xi's is similar to that of the Zi's. The dependent variable Y is generated in the same way as in Example 4.1.

The results of Example 4.2 are given in Table 4. Comparing with the results in Table 2, both CATCH and GUIDE do well in the presence of dependence between the numerical variables. The explanation is two-fold. First, both methods identify the categorical X5 as an important predictor most of the time, which makes the identification of the other relevant predictors possible. Second, conditioning on X5, Y depends on a pair of variables ("X1 and X2" or "X3 and X4") jointly, and in particular this dependence is nonlinear, which makes both methods less affected by the correlation introduced here.
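The Gaussian-copula construction of Example 4.2 is a one-liner per correlated pair; a sketch (pure Python; `correlated_uniforms` is our own name):

```python
import math
import random

def norm_cdf(z):
    """Standard normal CDF Phi via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def correlated_uniforms(d=10, rho14=0.5, rho36=0.9, rng=random):
    """One draw of X = (Phi(Z1), ..., Phi(Zd)) with cor(Z1, Z4) = rho14,
    cor(Z3, Z6) = rho36, and the other Zi independent (requires d >= 6)."""
    z = [rng.gauss(0, 1) for _ in range(d)]
    # Mixing with an independent N(0,1) keeps unit variance and sets the
    # correlation: Z4 = rho * Z1 + sqrt(1 - rho^2) * W.
    z[3] = rho14 * z[0] + math.sqrt(1 - rho14 ** 2) * z[3]
    z[5] = rho36 * z[2] + math.sqrt(1 - rho36 ** 2) * z[5]
    return [norm_cdf(v) for v in z]
```

Each Xi is marginally Uniform(0, 1); the Pearson correlation of the uniform grades is slightly below that of the normals ((6/π) arcsin(ρ/2) ≈ 0.48 for ρ = 0.5), so "similar correlation structure" is the right description.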
[Figure 2: two scatter plots, "(a) X5 = 0" with axes X1, X2 and "(b) X5 = 1" with axes X3, X4; both axes run from 0.0 to 1.0.]
Figure 2: Sample scatter plots of (x1, x2) and (x3, x4) for data simulated from model (4.1). The left panel includes all the pairs (x1, x2) with x5 = 0, while the right panel includes all the pairs (x3, x4) with x5 = 1. The larger dots in the two plots are within the tube around a tube center, shown as the star.
We now explain the latter in details for both methods In CATCH the algorithmadaptively weighs the effects of predictors by their importance scores in defining thetubes when they are being conditioned on therefore identifying either of the two willhelp to identify the other whereas an irrelevant variable though strongly associated withone of the relevant variables is down-weighted iteratively and does not strongly affectthe selection of the relevant variable This can explain the slight decreasing selectionprobability in X3 in presence of a strongly correlated X6 and consequently overall lessselection of X3 and X4 compared to X1 and X2 given the correlation between X1 andX4 In GUIDE the difference between the dependent and independent cases is evenless because GUIDE is very powerful in detecting local interaction by literally includingpairwise interactions when selecting splitting variables
Marginal CATCH and Random Forest do poorly in this case, mainly because X5 cannot be detected, as we explained earlier. More specifically, the dependence between Y and X4 in the X5 = 1 case is the same as that between Y and X2 in the X5 = 0 case. The linear terms of X1 and X2 have opposite signs in the decision function; therefore the marginal effects of X1 and X4 are obscured when X5 is not identified as an important variable.
4.2 Improving the Performance of SVM and RF Classifiers

Table 4: Simulation results from Example 4.2 with sample size n = 400. For 1 ≤ i ≤ 6, the ith column gives the estimated probability of selection of variable Xi and the standard error based on 100 simulations, where Xi, i = 1, ..., 5, are relevant variables with cor(X1, X4) = 0.5; X6 is an irrelevant variable that is correlated with X3 with correlation 0.9. Column 7 gives the mean of the estimated probability of selection and that of the standard error for the irrelevant variables X7, ..., Xd.

Number of Selections (100 simulations)
METHOD           d    X1         X2         X3         X4         X5         X6         Xi, i ≥ 7
CATCH            10   0.76(0.04) 0.77(0.04) 0.73(0.04) 0.68(0.05) 0.99(0.01) 0.34(0.05) 0.07(0.02)
                 50   0.78(0.04) 0.77(0.04) 0.74(0.04) 0.63(0.05) 0.92(0.03) 0.57(0.05) 0.09(0.03)
                 100  0.76(0.04) 0.77(0.04) 0.80(0.04) 0.74(0.04) 0.82(0.04) 0.55(0.05) 0.09(0.03)
Marginal CATCH   10   0.35(0.05) 0.63(0.05) 0.80(0.04) 0.32(0.05) 0.09(0.03) 0.71(0.05) 0.12(0.03)
                 50   0.34(0.05) 0.71(0.05) 0.70(0.05) 0.30(0.05) 0.10(0.03) 0.60(0.05) 0.11(0.03)
                 100  0.34(0.05) 0.68(0.05) 0.75(0.04) 0.30(0.05) 0.10(0.03) 0.59(0.05) 0.10(0.03)
Random Forest    10   0.29(0.05) 0.48(0.05) 0.55(0.05) 0.20(0.04) 0.30(0.05) 0.47(0.05) 0.05(0.02)
                 50   0.26(0.04) 0.52(0.05) 0.57(0.05) 0.30(0.05) 0.24(0.04) 0.45(0.05) 0.08(0.03)
                 100  0.36(0.05) 0.61(0.05) 0.66(0.05) 0.37(0.05) 0.32(0.05) 0.46(0.05) 0.12(0.03)
GUIDE            10   0.96(0.02) 0.96(0.02) 0.94(0.02) 0.95(0.02) 0.96(0.02) 0.84(0.04) 0.30(0.05)
                 50   0.80(0.04) 0.92(0.03) 0.88(0.03) 0.79(0.04) 0.72(0.04) 0.68(0.05) 0.12(0.03)
                 100  0.81(0.04) 0.89(0.03) 0.90(0.03) 0.76(0.04) 0.75(0.04) 0.58(0.05) 0.08(0.03)

The criterion used in the CATCH algorithm is not based on classification accuracy. But if we use CATCH to screen out the irrelevant variables and only use the important variables for classification, the performance of other algorithms such as the Support Vector Machine (SVM) and Random Forest Classifier (RFC) can be significantly improved. In this section we compare the prediction performance of (1) SVM, (2) RFC, (3) CATCH followed by SVM, and (4) CATCH followed by RFC; we label (3) and (4) CATCH+SVM and CATCH+RFC, respectively. We use the "svm" function in the "e1071" package in R with radial kernel to perform the SVM analysis, and the "randomForest" function in the "randomForest" package to perform the RFC analysis. We use simulation experiments to illustrate the improvement obtained by using CATCH prior to classifying Y.
Let the true model be

(X, Y) ∼ P,  (4.2)

where P will be specified later. In each of 100 Monte Carlo experiments, n pairs (X(i), Y(i)), i = 1, ..., n, are generated from the true model (4.2). Based on the simulated data (X(i), Y(i)), i = 1, ..., n, the methods (1) SVM, (2) RFC, (3) CATCH+SVM, and (4) CATCH+RFC are used to classify a future Y(0) for which we have available a corresponding X(0).
The integrated misclassification rate (IMR; Friedman [6]), defined below, is used to evaluate the performance of the classification algorithms. Let F_{X,Y}(·) be the classifier constructed by a classification algorithm based on X = (X(1), ..., X(n)) and Y = (Y(1), ..., Y(n))^T.

Let (X(0), Y(0)) be a future observation with the same distribution as (X(i), Y(i)). We only know X(0) = x(0) and want to classify Y(0). For a possible future observation (x(0), y(0)), define

MR(X, Y, x(0), y(0)) = P(F_{X,Y}(x(0)) ≠ y(0) | X, Y).  (4.3)

Our criterion is

IMR = E E_0 [MR(X, Y, x(0), y(0))],  (4.4)

where E_0 is with respect to the distribution of the random (x(0), y(0)) and E is with respect to the distribution of (X, Y).

This double expectation is evaluated using Monte Carlo simulation; the result is denoted IMR*. More precisely, E_0 is estimated using 5000 simulated (x(0), y(0)) observations; then IMR* is computed on the basis of 100 simulated samples (X_i, Y_i), i = 1, ..., 100.
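The double expectation in (4.4) is just a nested Monte Carlo loop. A minimal sketch with a toy data model and a trivial plug-in classifier standing in for F_{X,Y} (all names and settings here are illustrative assumptions, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(1)

def draw(n):
    """Toy model P: X ~ U(0,1), Y = 1{X > 0.5} with 10% label noise."""
    x = rng.random(n)
    y = ((x > 0.5) ^ (rng.random(n) < 0.1)).astype(int)
    return x, y

def fit_classifier(x, y):
    """Plug-in rule: majority class on each side of 0.5 (stand-in for F_{X,Y})."""
    left = int(round(y[x <= 0.5].mean())) if (x <= 0.5).any() else 0
    right = int(round(y[x > 0.5].mean())) if (x > 0.5).any() else 1
    return lambda x0: np.where(x0 > 0.5, right, left)

B, n, n0 = 100, 50, 5000        # outer reps, training size, test size
mr = np.empty(B)
for b in range(B):
    xtr, ytr = draw(n)          # one training sample -> one classifier (E)
    f = fit_classifier(xtr, ytr)
    x0, y0 = draw(n0)           # fresh future observations (E_0)
    mr[b] = np.mean(f(x0) != y0)
imr_star = mr.mean()
print(round(imr_star, 2))       # close to the 10% noise floor
```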
Example 4.3. First consider model (4.1) in Section 4.1. Figure 3 shows IMR* versus the number of irrelevant variables for this example. The left panel shows that the IMR*'s of the SVM approach increase dramatically as the number of irrelevant variables increases, while the IMR*'s of CATCH+SVM are stable. Thus CATCH stabilizes the performance of SVM for this complex model. The right panel shows that CATCH also stabilizes the performance of RFC. These results indicate that CATCH can improve the prediction accuracy of other classification algorithms.
Figure 3: IMR*'s of SVM, RFC, CATCH+SVM, and CATCH+RFC versus the number of predictors d for the simulation study of model (4.1) in Example 4.1; the left panel compares SVM with CATCH+SVM and the right panel (Random Forest) compares RFC with CATCH+RFC, with error rates between roughly 0.25 and 0.60 for d = 20 to 100. The sample size is n = 400.
Example 4.4. (Contaminated Logistic Model) Consider a binary response Y ∈ {0, 1} with

P(Y = 0 | X = x) = { F(x) if r = 0;  G(x) if r = 1 },  (4.5)

where x = (x1, ..., xd),

r ∼ Bernoulli(γ),  (4.6)

F(x) = (1 + exp{−(0.25 + 0.5 x1 + x2)})^{−1},  (4.7)

G(x) = { 0.1 if 0.25 + 0.5 x1 + x2 < 0;  0.9 if 0.25 + 0.5 x1 + x2 ≥ 0 },  (4.8)

and x1, x2, ..., xd are realizations of independent standard normal random variables. Hence, given X(i) = x(i), 1 ≤ i ≤ n, the joint distribution of Y(1), Y(2), ..., Y(n) is the contaminated logistic model with contamination parameter γ:

(1 − γ) ∏_{i=1}^{n} [F(x(i))]^{Y(i)} [1 − F(x(i))]^{1−Y(i)} + γ ∏_{i=1}^{n} [G(x(i))]^{Y(i)} [1 − G(x(i))]^{1−Y(i)}.  (4.9)
We perform a simulation study for n = 200, d = 50, and γ = 0, 0.2, 0.4, 0.6, 0.8, and 1, where γ = 0 corresponds to no contamination. Figure 4 shows the IMR*'s versus the different values of γ for this example. CATCH improves the performance of the SVM and Random Forest classifiers for this contaminated logistic regression model.
Figure 4: IMR*'s of SVM, RFC, CATCH+SVM, and CATCH+RFC versus the contamination parameter γ = 0, 0.2, 0.4, 0.6, 0.8, 1 for simulation model (4.9) in Example 4.4; the left panel compares SVM with CATCH+SVM and the right panel (Random Forest) compares RFC with CATCH+RFC, with error rates between roughly 0.1 and 0.4.
5 CATCH Analysis of The ANN Data
In this section CATCH is applied to a data set available at the UCI database (http://ftp.ics.uci.edu/pub/machine-learning-databases/thyroid-disease/). The data are from a clinical trial to determine whether a patient is hypothyroid (see, e.g., Quinlan [15]). The dataset is split into two parts: a training set with 3772 observations and a testing set with 3428 observations. The response is a categorical variable with three classes: normal (not hypothyroid), hyperfunction, and subnormal functioning. There are 21 predictors, including 15 categorical predictors and 6 numerical predictors. The three classes are very imbalanced, with 92.5% of the patients normal. A good classifier should produce a significant decrease from the 7.5% error rate that can be obtained by classifying all observations to the normal class.
CATCH identifies 7 variables as important, including 4 continuous variables and 3 categorical variables. We consider three classification methods: 1. Support Vector Machine with radial kernel, denoted by SVM(r); 2. Support Vector Machine with polynomial kernel, denoted by SVM(p); 3. RFC. We apply these three methods to the training set to build classification models and use the test set to estimate the misclassification rate of the methods. We apply the methods with CATCH and without CATCH, respectively, where "with CATCH" means that only the 7 important variables identified by CATCH are used to build the model and "without CATCH" means that all 21 variables are used in the analysis.
Table 5: Misclassification rates of three classification methods (SVM(r), SVM(p), RFC) with CATCH and without CATCH.

              SVM(r)   SVM(p)   RFC
w/o CATCH     0.052    0.063    0.024
with CATCH    0.037    0.055    0.015
Table 5 shows that if CATCH is used to select important variables before the classification methods are applied, the misclassification rates decrease. CATCH reduces the misclassification rates by 29%, 13%, and 37% for SVM(r), SVM(p), and RFC, respectively. CATCH + RFC has the best performance. Although CATCH is not designed for improving classification accuracy, it nonetheless helps other classification methods to perform better.
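The quoted reductions follow directly from Table 5; as an arithmetic check:

```python
# Relative reduction in misclassification rate: (without - with) / without
rates = {"SVM(r)": (0.052, 0.037), "SVM(p)": (0.063, 0.055), "RFC": (0.024, 0.015)}
for method, (without, with_catch) in rates.items():
    reduction = 100 * (without - with_catch) / without
    print(f"{method}: {reduction:.1f}%")   # 28.8%, 12.7%, 37.5%
```

rounded in the text to 29%, 13%, and 37%.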
6 Properties of Local Contingency Efficacy
In this section we investigate the properties of the local contingency efficacy. For a univariate continuous predictor, Theorem 6.1 below shows that the local contingency efficacy is well defined and equals 0 if the response variable is independent of the predictor. Consider an R × C contingency table. Let n_{rc}, r = 1, ..., R, c = 1, ..., C, be the observed frequency in the (r, c)-cell, and let p_{rc} be the probability that an observation belongs to the (r, c)-cell. Let

p_r = sum_{c=1}^{C} p_{rc},  q_c = sum_{r=1}^{R} p_{rc},  (6.1)

and

n_{r·} = sum_{c=1}^{C} n_{rc},  n_{·c} = sum_{r=1}^{R} n_{rc},  n = sum_{r=1}^{R} sum_{c=1}^{C} n_{rc},  E_{rc} = n_{r·} n_{·c} / n.  (6.2)
To test whether the rows and the columns are independent, i.e.,

H_0: p_{rc} = p_r q_c for all r, c  versus  H_1: p_{rc} ≠ p_r q_c for some r, c,  (6.3)

a standard test statistic is

X^2 = sum_{r=1}^{R} sum_{c=1}^{C} (n_{rc} − E_{rc})^2 / E_{rc}.  (6.4)

Under H_0, X^2 converges in distribution to χ^2_{(R−1)(C−1)}. Moreover, defining 0/0 = 0, we have the well-known result:

Lemma 6.1. As n tends to infinity, X^2/n tends in probability to

τ^2 ≡ sum_{r=1}^{R} sum_{c=1}^{C} (p_{rc} − p_r q_c)^2 / (p_r q_c).

Moreover, if p_r q_c is bounded away from 0 and 1 as n → ∞, and if R and C are fixed as n → ∞, then τ^2 = 0 if and only if H_0 is true.
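Lemma 6.1 can be checked numerically: the statistic (6.4) and its limit τ² are a few lines of numpy (the tables of probabilities below are made up for illustration):

```python
import numpy as np

def chi2_stat(n_rc):
    """X^2 of (6.4) from an R x C table of observed counts n_rc."""
    n = n_rc.sum()
    E = np.outer(n_rc.sum(axis=1), n_rc.sum(axis=0)) / n   # E_rc = n_r. n_.c / n
    return ((n_rc - E) ** 2 / E).sum()

def tau2(p_rc):
    """Limit of X^2/n in Lemma 6.1, from the cell probabilities p_rc."""
    p_r = p_rc.sum(axis=1, keepdims=True)
    q_c = p_rc.sum(axis=0, keepdims=True)
    return ((p_rc - p_r * q_c) ** 2 / (p_r * q_c)).sum()

# Under independence the limit is 0; otherwise it is positive
p_indep = np.outer([0.4, 0.6], [0.25, 0.25, 0.5])
p_dep = np.array([[0.30, 0.05, 0.05],
                  [0.10, 0.20, 0.30]])
print(tau2(p_indep) < 1e-12)   # True
print(tau2(p_dep) > 0)         # True

# X^2/n matches tau^2 when the counts equal their expectations n * p_rc
n = 100000
print(abs(chi2_stat(n * p_dep) / n - tau2(p_dep)) < 1e-9)   # True
```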
Remark 6.1. Lemma 6.1 shows that the contingency efficacy ζ in (2.8) is well defined. Moreover, ζ = 0 if and only if X and Y are independent.
Theorem 6.1. Consider a categorical response Y and a continuous predictor X with density function f(x) such that, for a fixed h > 0, f(x) > 0 for x ∈ (x(0) − h, x(0) + h). Then:

(1) For fixed h, the estimated local section efficacy ζ̂(x(0), h) defined in (2.5) converges almost surely to the section efficacy ζ(x(0), h) defined in (2.4).

(2) If X and Y are independent, then the local efficacy ζ(x(0)) defined in (2.4) is zero.

(3) If there exists c ∈ {1, ..., C} such that p_c(x) = P(Y = c | x) has a continuous derivative at x = x(0) with p_c'(x(0)) ≠ 0, then ζ(x(0)) > 0.
The χ^2 statistic (6.4) can be written in terms of multinomial proportions p̂_{rc}, with √n(p̂_{rc} − p_{rc}), 1 ≤ r ≤ R, 1 ≤ c ≤ C, converging in distribution to a multivariate normal. By a Taylor expansion of T_n ≡ √(X^2/n) at τ = plim √(X^2/n), we find that

Theorem 6.2.

√n (T_n − τ) →_d N(0, v^2),  (6.5)

where the asymptotic variance v^2, which can be obtained from the Taylor expansion, is not needed in this paper.
Remark 6.2. The result (6.5) applies to the multivariate χ^2-based efficacies ζ with n replaced by the number of observations that go into the computations of ζ. In this case we use the notation χ^2_j for the chi-square statistic for the jth variable.
7 Consistency Of CATCH
Let Y denote a variable to be classified into one of the categories 1, ..., C by using the Xj's in the vector X = (X1, ..., Xd). Let X−j ≡ X − {Xj} denote X without the Xj component. Assume that the squared Pearson nonparametric correlation (Doksum and Samarov [4], Huang and Chen [9]) between Xj and X−j is less than 0.5. We call the variable Xj relevant for Y if, conditionally given X−j, L(Y | Xj = x) is not constant in x, and irrelevant otherwise. Our data (X(1), Y(1)), ..., (X(n), Y(n)) are i.i.d. as (X, Y). A variable selection procedure is consistent if the probability that all the relevant variables are selected and none of the irrelevant variables are selected tends to one as n tends to infinity.

We give a general nonparametric framework in which the variable selection rule based on the CATCH importance scores s_j given in (3.8) is consistent.

When the predictors are not all categorical and CATCH involves tubes, we assume that the number of observations in each tube tends to ∞ as n → ∞; and for those importance scores that require the selection of a section size h, we assume the selection is from a finite set {h_j} with min_j {h_j} > h_0 for some h_0 > 0. It follows that the number of observations m_{jk} in the kth tube used to calculate the importance score s_j for the variable Xj tends to infinity as n → ∞.
The key quantities that determine consistency are the limits λ_j of s_j. That is,

λ_j = τ(ζ_j, P) = E_P[ζ_j(X)] = ∫ ζ_j(x) dP(x),  j = 1, ..., d,

where ζ_j(x) is the local contingency efficacy defined by (2.4) for numerical X's and by (2.8) for categorical X's, and P is the probability distribution of X. Our estimate s_j of λ_j is

s_j = τ(ζ̂_j, P̂) = (1/M) sum_{k=1}^{M} ζ̂_j(X*_k),

where ζ̂_j is defined in Section 3 and P̂ is the empirical distribution of X.

Let d_0 denote the number of X's that are irrelevant for Y. Set d_1 = d − d_0 and,
without loss of generality reorder the λj so that
λ_j > 0 if j = 1, ..., d_1;  λ_j = 0 if j = d_1 + 1, ..., d.  (7.1)

Our variable selection rule is

"keep variable Xj if s_j ≥ t" for some threshold t.  (7.2)

Rule (7.2) is consistent if and only if

P(min_{j≤d_1} s_j ≥ t and max_{j>d_1} s_j < t) → 1.  (7.3)

To establish (7.3), it is enough to show that

P(max_{j>d_1} s_j > t) → 0  (7.4)

and

P(min_{j≤d_1} s_j ≤ t) → 0.  (7.5)
We call (7.4) and (7.5) type I and type II consistency, respectively. Type I and II error probabilities refer to the probabilities of any false positives and any false negatives, respectively.
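Schematically, the importance score s_j and rule (7.2) amount to the following sketch, where zeta_hat is an arbitrary stand-in for the per-tube efficacy of Section 3 and the threshold t is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)

def importance_score(X, zeta_hat, M):
    """s_j-style score: average a local efficacy over M random tube centers."""
    centers = X[rng.integers(0, len(X), size=M)]   # centers drawn from P-hat
    return float(np.mean([zeta_hat(c) for c in centers]))

def select(scores, t):
    """Rule (7.2): keep variable j when s_j >= t."""
    return [j for j, s in enumerate(scores) if s >= t]

X = rng.random((400, 3))
# Stand-in efficacies: variables 0 and 1 behave as relevant, variable 2 does not
zetas = [lambda c: 0.6, lambda c: 0.4 + 0.1 * c[0], lambda c: 0.01]
scores = [importance_score(X, z, M=200) for z in zetas]
print(select(scores, t=0.2))   # [0, 1]
```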
7.1 Type I consistency
We consider (7.4) first and note that

P(max_{j>d_1} s_j > t) ≤ sum_{j=d_1+1}^{d} P(s_j > t).  (7.6)
For the jth irrelevant variable Xj, j > d_1, the results in Section 6 apply to the local efficacy ζ̂_j(x) at a fixed point x. The importance score s_j defined in (3.8) is the average of such efficacies over a sample of M random points x*_a. We assume that there is a point x(0) such that, for irrelevant Xj's and for n large enough, ζ̂_j(x(0)) is stochastically larger in the right tail than the average M^{−1} sum_{a=1}^{M} ζ̂_j(x*_a). That is, we assume that there exist t > 0 and n_0 such that for n ≥ n_0,

P(s_j > t) ≤ P(s_j^{(0)} > t),  j > d_1,  (7.7)

where s_j^{(0)} ≡ ζ̂_j(x(0)). The existence of such a point x(0) is established by considering the probability limit of

arg max{ζ̂_j(x*_a) : x*_a ∈ {x*_1, ..., x*_M}}.

Note that by Lemma 6.1, s_j^{(0)} →_p 0 as n → ∞. It follows that we have established (7.4) for t and d_0 = d − d_1 that are fixed as n → ∞. To examine the interesting case where d_0 → ∞ as n → ∞, we need large deviation results, which we turn to next. In the d_0 → ∞ case we consider t ≡ t_n that tends to ∞ as n → ∞, and we use the weaker assumption: there exist x(0) and n_0 such that
P(s_j > t_n) ≤ P(s_j^{(0)} > t_n)  (7.8)

for n ≥ n_0 and each t_n with lim_{n→∞} t_n = ∞. We also assume that the number of columns C and rows R in the contingency table that defines the efficacy χ^2_{jn} given by (6.4) stays fixed as n → ∞. In this case P(s_j^{(0)} > t_n) → 0 provided each term

[T_{rc}^{(j)}]^2 ≡ [(n_{rc} − E_{rc}) / (√n √E_{rc})]^2

in s_j^2 = χ^2_{jn}/n, defined for Xj by (6.4), satisfies

P(|T_{rc}^{(j)}| > t_n) → 0.

This is because

P(√(sum_r sum_c [T_{rc}^{(j)}]^2) > t_n) ≤ P(RC max_{r,c} [T_{rc}^{(j)}]^2 > t_n^2) ≤ RC max P(|T_{rc}^{(j)}| > t_n/√(RC)),

and t_n and t_n/√(RC) are of the same order as n → ∞. By large deviation theory (Hall [8] and Jing et al. [10]), for some constant A,

P(|T_{rc}^{(j)}| > t_n) = (2 t_n^{−1} / √(2π)) e^{−t_n^2/2} [1 + A t_n^3 n^{−1/2} + o(t_n^3 n^{−1/2})],  (7.9)

provided t_n = o(n^{1/6}). If (7.9) is uniform in j > d_1, then (7.6) and (7.9) imply that
type I consistency holds if

d_0 (t_n/√2)^{−1} e^{−t_n^2/2} → 0.  (7.10)

To examine (7.10), write d_0 and t_n as d_0 = exp(n^b) and t_n = √2 n^r. By large deviation theory, (7.10) holds when r < 1/6. Thus if 0 < (1/2)b < r < 1/6, then

d_0 (t_n/√2)^{−1} e^{−t_n^2/2} = n^{−r} exp(n^b − n^{2r}) → 0.
Thus d_0 = exp(n^b) with b = 1/3 − ε, for any ε > 0, is the largest d_0 can be to have consistency. For instance, if there are exactly d_0 = exp(n^{1/4}) irrelevant variables and we use the threshold t_n = √2 n^{1/7}, the probability of selecting any of the irrelevant variables tends to zero at the rate n^{−1/7} exp(n^{1/4} − n^{2/7}).
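As a quick numerical sanity check of the type I rate, the bound n^{−r} exp(n^b − n^{2r}) from (7.10) evaluated at b = 1/4 and r = 1/7 does decay to zero (the n values are arbitrary):

```python
from math import exp

def bound(n, b=0.25, r=1 / 7):
    """Type I bound d0 (tn/sqrt(2))^(-1) exp(-tn^2/2) with d0 = exp(n^b)
    and tn = sqrt(2) n^r, i.e. n^(-r) exp(n^b - n^(2r))."""
    return n ** (-r) * exp(n ** b - n ** (2 * r))

vals = [bound(n) for n in (10, 10**3, 10**6)]
print(vals[0] > vals[1] > vals[2])   # True: the bound is shrinking
print(vals[2] < 1e-8)                # True: essentially zero by n = 10^6
```

Since b/2 = 1/8 < 1/7 < 1/6, this (b, r) pair satisfies the condition 0 < (1/2)b < r < 1/6.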
7.2 Type II consistency
We now assume that for each relevant variable Xj there is a least favorable point x(1) such that the local efficacy s_j^{(1)} ≡ ζ̂_j(x(1)) is stochastically smaller than the importance score s_j = M^{−1} sum_{a=1}^{M} ζ̂_j(x*_a). That is, we assume that for each relevant variable Xj there exist n_1 and x(1) such that for n ≥ n_1,

P(s_j^{(1)} ≤ t_n) ≥ P(s_j ≤ t_n),  j ≤ d_1.  (7.11)

The existence of such a least favorable x(1) is established by considering the probability limit of

arg min{ζ̂_j(x*_a) : x*_a ∈ {x*_1, ..., x*_M}}.

We again assume that R and C are fixed, so that to show type II consistency it is enough to show that, for each (r, c) and j ≤ d_1,

P(|T_{rc}^{(j)}| ≤ t_n) → 0.
This is because

P(√(sum_r sum_c [T_{rc}^{(j)}]^2) ≤ t_n) ≤ P(min_{r,c} [T_{rc}^{(j)}]^2 ≤ t_n^2) ≤ RC max_{r,c} P(|T_{rc}^{(j)}| ≤ t_n).
By Theorem 6.2, √n(s_j^{(1)} − τ_j) →_d N(0, ν^2), τ_j > 0. It follows that if t_n and d_1 are bounded above as n → ∞ and τ_j > 0, then P(s_j^{(1)} ≤ t_n) → 0. Thus type II consistency holds in this case. To examine the more interesting case where t_n and d_1 are not bounded above and τ_j may tend to zero as n → ∞, we need to center the T_{rc}^{(j)}. Thus set

θ_n^{(j)} = √n (p_{rc}^{(j)} − p_r^{(j)} p_c^{(j)}) / √(p_r^{(j)} p_c^{(j)}),  (7.12)

T_n^{(j)} = T_{rc}^{(j)} − θ_n^{(j)},

where for simplicity the dependence of T_n^{(j)} and θ_n^{(j)} on r and c is not indicated.
Lemma 7.1.

P(min_{j≤d_1} |T_{rc}^{(j)}| ≤ t_n) ≤ sum_{j=1}^{d_1} P(|T_n^{(j)}| ≥ min_{j≤d_1} |θ_n^{(j)}| − t_n).

Proof.

P(min_{j≤d_1} |T_{rc}^{(j)}| ≤ t_n) = P(min_{j≤d_1} |T_n^{(j)} + θ_n^{(j)}| ≤ t_n)
≤ P(min_{j≤d_1} |θ_n^{(j)}| − max_{j≤d_1} |T_n^{(j)}| ≤ t_n)
= P(max_{j≤d_1} |T_n^{(j)}| ≥ min_{j≤d_1} |θ_n^{(j)}| − t_n)
≤ sum_{j=1}^{d_1} P(|T_n^{(j)}| ≥ min_{j≤d_1} |θ_n^{(j)}| − t_n).
Set

a_n = min_{j≤d_1} |θ_n^{(j)}| − t_n;

then by large deviation theory, if a_n = o(n^{1/6}),

P(|T_n^{(j)}| ≥ a_n) ∝ (a_n/√2)^{−1} exp{−a_n^2/2},  (7.13)

where ∝ signifies asymptotic order as n → ∞, a_n → ∞. It follows that if (7.13) holds uniformly in j, then by Lemma 7.1,

P(min_{j≤d_1} |T_{rc}^{(j)}| ≤ t_n) ∝ d_1 (a_n/√2)^{−1} exp{−a_n^2/2}.

Thus type II consistency holds when a_n = o(n^{1/6}) and

d_1 (a_n/√2)^{−1} exp{−a_n^2/2} → 0.  (7.14)

In Section 7.1, t_n is of order n^r, 0 < r < 1/6. For this t_n, type II consistency holds if a_n is of order n^r, 0 < r < 1/6, and

d_1 n^{−r} exp{−n^{2r}/2} → 0.
If a_n = O(n^k) with k ≥ 1/6, then P(|T_n^{(j)}| ≥ a_n) tends to zero at a rate faster than in (7.13) and consistency is ensured. For instance, if all relevant variables Xj have p_{rc}^{(j)} fixed as n → ∞, then min_{j≤d_1} |θ_n^{(j)}| is of order √n and a_n is of order n^{1/2} − t_n.
7.3 Type I and II consistency

We see from Sections 7.1 and 7.2 that consistency holds under a variety of conditions. One set of conditions is:

Theorem 7.1. Under the regularity conditions of Sections 7.1 and 7.2, if

d_0 = c_0 exp(n^b),  t_n = c_1 n^r,  min_{j≤d_1} θ_n^{(j)} − t_n = c_2 n^r,  d_1 = c_3 exp(n^b)

for positive constants c_0, c_1, c_2, and c_3, and 0 < (1/2)b < r < 1/6, then CATCH is consistent.
Appendix: Proof of Theorem 6.1

(1) Let N_h = sum_{i=1}^{n} I(X(i) ∈ N_h(x(0))). Then N_h ∼ Bi(n, ∫_{x(0)−h}^{x(0)+h} f(x)dx) and N_h → ∞ almost surely as n → ∞. We have

ζ̂(x(0), h) = n^{−1/2} √(X^2(x(0), h)) = (N_h/n)^{1/2} (N_h)^{−1/2} √(X^2(x(0), h)).

Since N_h ∼ Bi(n, ∫_{x(0)−h}^{x(0)+h} f(x)dx), we have

N_h/n → ∫_{x(0)−h}^{x(0)+h} f(x)dx.  (7.15)

By Lemma 6.1 and Table 1,

(N_h)^{−1/2} √(X^2(x(0), h)) → (sum_{r=+,−} sum_{c=1}^{C} (p_{rc} − p_r q_c)^2 / (p_r q_c))^{1/2},  (7.16)

where

p_{+c} = Pr(X > x(0), Y = c | X ∈ N_h(x(0))),  p_{−c} = Pr(X ≤ x(0), Y = c | X ∈ N_h(x(0)))

for r = +, − and c = 1, ..., C, and

p_r = sum_{c=1}^{C} p_{rc},  q_c = sum_{r=+,−} p_{rc}.

That is, p_+ = Pr(X > x(0) | X ∈ N_h(x(0))), p_− = Pr(X ≤ x(0) | X ∈ N_h(x(0))), and q_c = Pr(Y = c | X ∈ N_h(x(0))).

By (7.15) and (7.16),

ζ̂(x(0), h) → (∫_{x(0)−h}^{x(0)+h} f(x)dx)^{1/2} (sum_{r=+,−} sum_{c=1}^{C} (p_{rc} − p_r q_c)^2 / (p_r q_c))^{1/2} ≡ ζ(x(0), h).

(2) Since X and Y are independent, p_{rc} = p_r q_c for r = +, − and c = 1, ..., C; then ζ(x(0), h) = 0 for any h. Hence

ζ(x(0)) = sup_{h>0} ζ(x(0), h) = 0.

(3) Recall that p_c(x(0)) = Pr(Y = c | x(0)). Without loss of generality assume p_c'(x(0)) > 0. Then there exists h such that

Pr(Y = c | x(0) < X < x(0) + h) > Pr(Y = c | x(0) − h < X < x(0)),

which is equivalent to p_{+c}/p_+ > p_{−c}/p_−. This shows that ζ(x(0)) > 0, since by Lemma 6.1, ζ(x(0)) = 0 results in p_{+c}/p_+ = p_{−c}/p_− = p_c.
References
[1] Breiman, L., 2001. Random forests. Machine Learning 45, 5–32.

[2] Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J., 1984. Classification and Regression Trees. Wadsworth, Belmont.

[3] Doksum, K., Tang, S., Tsui, K.-W., 2008. Nonparametric variable selection: The EARTH algorithm. Journal of the American Statistical Association 103(484), 1609–1620.

[4] Doksum, K.A., Samarov, A., 1995. Nonparametric estimation of global functionals and a measure of the explanatory power of covariates in regression. Annals of Statistics 23, 1443–1473.

[5] Doksum, K.A., Schafer, C., 2006. Powerful choices: Tuning parameter selection based on power. In: Frontiers in Statistics: Dedicated to Peter Bickel (J. Fan and H.L. Koul, eds.). Imperial College Press, 113–141.

[6] Friedman, J., 1991. Multivariate adaptive regression splines. The Annals of Statistics 19, 1–67.

[7] Gao, J., Gijbels, I., 2008. Bandwidth selection in nonparametric kernel testing. Journal of the American Statistical Association 103(484), 1584–1594.

[8] Hall, P., 1992. The Bootstrap and Edgeworth Expansion. Springer, New York.

[9] Huang, L.-S., Chen, J., 2008. Analysis of variance, coefficient of determination and F-test for local polynomial regression. Annals of Statistics 36, 2085–2109.

[10] Jing, B.-Y., Shao, Q.-M., Wang, Q., 2003. Self-normalized Cramér-type large deviations for independent random variables. Annals of Probability 31, 2167–2215.

[11] Lin, D.Y., Zeng, D., 2011. Correcting for population stratification in genomewide association studies. Journal of the American Statistical Association 106, 997–1008.

[12] Loh, W.-Y., 2009. Improving the precision of classification trees. The Annals of Applied Statistics 3, 1710–1737.

[13] Price, A., Patterson, N., Plenge, R., Weinblatt, M., Shadick, N., Reich, D., 2006. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38(8), 904–909.

[14] Qiao, Z., Zhou, L., Huang, J., 2008. Linear discriminant analysis for high-dimensional, low-sample-size data. Special issue of World Congress on Engineering 2008.

[15] Quinlan, J.R., 1987. Simplifying decision trees. International Journal of Man-Machine Studies 27(3), 221–234.

[16] Schafer, C., Doksum, K.A., 2009. Selecting local models in multiple regression by maximizing power. Metrika 69, 283–304.

[17] Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267–288.

[18] Wang, L., Shen, X., 2007. On L1-norm multi-class support vector machines: methodology and theory. Journal of the American Statistical Association 102, 583–594.

[19] Zhang, H.H., Liu, Y., Wu, Y., Zhu, J., 2008. Variable selection for the multicategory SVM via adaptive sup-norm regularization. Electronic Journal of Statistics 2, 149–167.
Table 2: Simulation results from Example 4.1 with sample size n = 400. For 1 ≤ i ≤ 5, the ith column gives the estimated probability of selection of relevant variable Xi and the standard error based on 100 simulations. Column 6 gives the mean of the estimated probability of selection and the standard error for the irrelevant variables X6, ..., Xd.

Number of Selections (100 simulations)
METHOD           d    X1         X2         X3         X4         X5         Xi, i ≥ 6
CATCH            10   0.82(0.04) 0.75(0.04) 0.78(0.04) 0.82(0.04) 1(0)       0.05(0.02)
                 50   0.76(0.04) 0.76(0.04) 0.87(0.03) 0.84(0.04) 0.96(0.02) 0.08(0.03)
                 100  0.77(0.04) 0.73(0.04) 0.82(0.04) 0.80(0.04) 0.83(0.04) 0.09(0.03)
Marginal CATCH   10   0.77(0.04) 0.67(0.05) 0.70(0.05) 0.80(0.04) 0.06(0.02) 0.10(0.03)
                 50   0.69(0.05) 0.70(0.05) 0.76(0.04) 0.79(0.04) 0.07(0.03) 0.10(0.03)
                 100  0.78(0.04) 0.76(0.04) 0.73(0.04) 0.73(0.04) 0.12(0.03) 0.10(0.03)
Random Forest    10   0.71(0.05) 0.51(0.05) 0.47(0.05) 0.59(0.05) 0.37(0.05) 0.06(0.02)
                 50   0.64(0.05) 0.61(0.05) 0.65(0.05) 0.58(0.05) 0.28(0.04) 0.08(0.03)
                 100  0.66(0.05) 0.58(0.05) 0.59(0.05) 0.65(0.05) 0.17(0.04) 0.11(0.03)
GUIDE            10   0.96(0.02) 0.98(0.01) 0.93(0.03) 0.99(0.01) 0.92(0.03) 0.33(0.05)
                 50   0.82(0.04) 0.85(0.04) 0.90(0.03) 0.89(0.03) 0.67(0.05) 0.12(0.03)
                 100  0.83(0.04) 0.84(0.04) 0.86(0.03) 0.83(0.04) 0.61(0.05) 0.08(0.03)
Figure 1: The relationship between Y and (X1, X2) for model (4.1). The four areas represent the four values of Y.

biased toward being selected. The way that GUIDE alleviates this bias is to employ lack-of-fit tests, i.e., χ^2-tests applicable to both numerical and categorical predictors, with the p-values adjusted by a bootstrap procedure. Since GUIDE has negligible selection bias, it can include tests for local pairwise interactions. Furthermore, it achieves sensitivity to local curvature by using a linear model at each node.
The results in Table 2 show that Marginal CATCH does a very poor job of selecting X5. This illustrates the importance of local conditioning. RFC does better than Marginal CATCH but also has difficulties, while CATCH and GUIDE successfully identify X5 along with X1 to X4 with high frequency. When d = 10, GUIDE does very well for the relevant variables but selects too many irrelevant variables. Surprisingly, GUIDE selects fewer irrelevant variables as d increases; it becomes more cautious as the dimension d increases. CATCH performs very well in the high-dimensional cases (d = 50 and d = 100). This indicates that the CATCH algorithm is robust to the intricate interaction structures between the predictors.
In model (4.1) above, the predictors interact in a hierarchical way. When X5 = 0, the response is related to X1 and X2 as shown in Figure 1, where the two axes are the two relevant predictors and the four areas partitioned by the curves correspond to different values of Y. When X5 = 1, the response depends on X3 and X4 in the same way. This type of hierarchical interaction between the categorical and numerical predictors is difficult to detect even by state-of-the-art classification methods such as the support vector machine (SVM) and RFC. In particular, RFC produces an importance score vector that identifies the four numerical variables X1, ..., X4 as very significant but often leaves the categorical variable X5 unidentified, as we see in Table 2.
The CATCH algorithm, however, performs very well for this complicated model. We now explain why CATCH is able to identify X5 as an important variable. To explore the association between X5 and Y, we construct M contingency tables of 2 rows and 4 columns based on M randomly centered tubes, calculate contingency efficacy scores, and then average the scores. Figure 2 (a) and (b) illustrate how one of the contingency efficacy scores is calculated. 1000 data points are generated from model (4.1), and they are split into two sets according to x5: (a) shows approximately half of the points, those with x5 = 0, in the (x1, x2) dimensions, and (b) shows the remaining points, those with x5 = 1, in the (x3, x4) dimensions. The data points are colored according to Y. This representation enables us to clearly see the classification boundaries (grey lines). The randomly selected tube center, whose x5 is zero, is shown as a star in (a). The data points within the tube, i.e., close to the tube center in terms of X−5, are highlighted by larger dots. Since the distance defining the tube does not involve X5, the larger dots have x5 equal to 0 or 1 and are present in both (a) and (b). The counts of the larger dots in different colors in (a) and (b) correspond to the cell counts of the two rows, x5 = 0 and x5 = 1, in the contingency table. As we can see, the larger dots in (a) are mostly red and the larger dots in (b) are mostly blue and orange, which shows that the distribution of Y depends on the value of X5 and the contingency efficacy score is high. We expect the contingency efficacy score to be low only if the tube center has similar values in (x1, x2) and (x3, x4). Such tube centers would be very few among the M tube centers since they are randomly chosen. Thus the average of the contingency efficacy scores is high and X5 is identified as an important variable.
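The construction of one 2 × 4 table can be sketched as follows. This is a simplified stand-in: the tube here is an ordinary Euclidean neighborhood in (X1, ..., X4) ignoring X5, not CATCH's adaptively weighted tube, and the data are placeholders rather than draws from model (4.1):

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy data in the spirit of model (4.1): X5 binary, Y in {0, 1, 2, 3}
n = 1000
X = rng.random((n, 4))                     # stand-ins for X1..X4
x5 = rng.integers(0, 2, size=n)
Y = rng.integers(0, 4, size=n)             # placeholder labels

center = X[rng.integers(n)]                # random tube center
dist = np.linalg.norm(X - center, axis=1)  # distance ignores X5
in_tube = dist <= np.sort(dist)[59]        # keep the m = 60 nearest points

# 2 x 4 contingency table: rows = value of x5, columns = class of Y
table = np.zeros((2, 4), dtype=int)
for r, c in zip(x5[in_tube], Y[in_tube]):
    table[r, c] += 1
print(table.sum())   # 60: every in-tube point lands in exactly one cell
```

The chi-square-type efficacy of Section 2 is then computed from this table; when Y's distribution differs between the two rows, the score is large.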
To contrast this example with a simpler data-generating model, suppose the effects of the predictors are additive; that is, the categorical variable is related to the response as a term added to the effect of the numerical predictors. In this type of simpler model, the RFC and GUIDE algorithms can efficiently identify relevant predictors, but so can the CATCH algorithm. We do not include the simulation results for additive models here.
Remark 4.1. We observe that the CATCH algorithm is not very sensitive to the choice of the number of tube centers M or the tube size m = [n^a], where 0 < a ≤ 1 and [·] is the greatest integer function. For Example 4.1 we perform the simulation using M = n and M = n/2 and a = 0.1, 0.15, 0.2, 0.3, 0.5. The results are summarized in Table 3. The CATCH algorithm performs similarly for M = n and M = n/2, with M = n having a small winning edge. We generally suggest using M = n if computational resources are less of a concern or n is not large (≤ 200). When computational cost is a concern (Remark 3.5), one can reduce the number of tube centers to M = n/2 for a compromise between detection power and computational cost. Table 3 also shows that the tube size works well for a in the range between 0.1 and 0.3, while the detection power for X5 decreases in the case a = 0.5, where the tube size is too big to achieve local conditioning. This observation is consistent with the comparison to Marginal CATCH in Table 2. On the other hand, the tube size should not be too small, so as to retain sufficient detection power. We suggest using a = 0.1 or 0.15 for sample sizes n ≥ 400, or m = 50 for smaller sample sizes. In light of these guidelines, we use M = n/2 = 200 and a = 0.15 for Examples 4.1-4.3, and M = n = 200 and m = 50 for Example 4.4.
Table 3: Sensitivity study of CATCH for Example 4.1 (n = 400, d = 50) using different numbers of tube centers M and tube sizes m = [n^a]. For 1 ≤ i ≤ 5, the ith column gives the number of times Xi was selected. Column 6 gives the percentage of times irrelevant variables were selected.

Number of Selections (100 simulations)
M        a     X1  X2  X3  X4  X5  Xi, i ≥ 6
M = 200  0.10  75  67  77  69  92  78
         0.15  76  76  87  84  96  82
         0.20  85  84  90  87  93  99
         0.30  89  87  95  89  85  96
         0.50  89  90  96  93  62  99
M = 400  0.10  74  76  82  77  92  72
         0.15  85  82  89  87  94  86
         0.20  87  88  90  86  96  91
         0.30  88  89  91  91  92  94
         0.50  89  91  94  93  61  102
In Example 4.1 all predictors are independent. In a similar setting, we next consider the case where the predictors are dependent.
Example 42 We generate standard normal random variables Zi for i = 1 middot middot middot d insuch a way that cor(Z1 Z4) = 05 cor(Z3 Z6) = 09and the other Zirsquos are independentWe set Xi = Φ(Zi) for i = 1 middot middot middot d where Φ(middot) is the CDF of a standard normal andtherefore Xi are uniformly distributed marginally as in Example 41 The correlationstructure between the Xirsquos are similar to that of the Zirsquos The dependent variable Y isgenerated in the same way as in Example 41
The results of Example 42 are given in Table 4 By comparing with results in Table2 both CATCH and GUIDE are doing well in the presence of dependence between thenumerical variables The explanation is two-fold First both methods could identify thecategorical X5 as an important predictor most of the time which make the identificationof other relevant predictors possible Second conditioning on X5 Y depends on a pairof variables (ldquoX1 and X2rdquo or ldquoX3 and X4rdquo) jointly and in particular this dependence isnonlinear which makes both methods less affected by the correlation introduced here
[Figure 2 here: two scatter plots on [0, 1] x [0, 1]; panel (a) X5 = 0 with axes X1, X2; panel (b) X5 = 1 with axes X3, X4.]
Figure 2: Sample scatter plots of (x1, x2) and (x3, x4) for data simulated from model (4.1). The left panel includes all the pairs (x1, x2) with x5 = 0, while the right panel includes all the pairs (x3, x4) with x5 = 1. The larger dots in the two plots are within the tube around a tube center, shown as the star.
We now explain the latter in detail for both methods. In CATCH, the algorithm adaptively weighs the effects of predictors by their importance scores when defining the tubes that condition on them. Therefore identifying either variable of a relevant pair helps to identify the other, whereas an irrelevant variable, though strongly associated with one of the relevant variables, is down-weighted iteratively and does not strongly affect the selection of the relevant variable. This explains the slightly decreasing selection probability of X3 in the presence of the strongly correlated X6, and consequently the overall less frequent selection of X3 and X4 compared to X1 and X2, given the correlation between X1 and X4. In GUIDE, the difference between the dependent and independent cases is even smaller, because GUIDE is very powerful in detecting local interactions by explicitly including pairwise interactions when selecting splitting variables.
Marginal CATCH and Random Forest do poorly in this case, mainly because X5 cannot be detected, as we explained earlier. More specifically, the dependence between Y and X4 in the X5 = 1 case is the same as that between Y and X2 in the X5 = 0 case. The linear terms of X1 and X2 have opposite signs in the decision function; therefore the marginal effects of X1 and X4 are obscured when X5 is not identified as an important variable.
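A minimal sketch of the iterative down-weighting idea described above; the squaring, the normalization, and the weighted L1 metric are our illustrative choices, not specifics taken from the paper:

```python
import numpy as np

def update_tube_weights(scores, power=2.0):
    """One reweighting step of the tube metric: coordinates with higher
    current importance scores count more in the distance, so an irrelevant
    (even if correlated) variable is driven toward weight zero."""
    w = np.maximum(np.asarray(scores, dtype=float), 0.0) ** power
    return w / w.sum()

def tube_distance(X, center, w):
    """Weighted L1 distance used to pick the m nearest tube members."""
    return (w * np.abs(X - center)).sum(axis=1)

# hypothetical importance scores: X5-like variable strong, X6-like weak
scores = np.array([0.60, 0.55, 0.50, 0.45, 0.90, 0.12])
w = update_tube_weights(scores)
print(w.round(3))  # the last (irrelevant but correlated) variable gets near-zero weight
```

On the next iteration the tubes are therefore nearly orthogonal to the down-weighted coordinate, which is the mechanism the text credits for keeping X6 from masking X3.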
4.2 Improving the Performance of SVM and RF Classifiers
The criterion used in the CATCH algorithm is not based on classification accuracy. But if we use CATCH to screen out the irrelevant variables and only use the important
Table 4: Simulation results from Example 4.2 with sample size n = 400. For 1 ≤ i ≤ 6, the ith column gives the estimated probability of selection of variable Xi and the standard error, based on 100 simulations, where Xi, i = 1, ..., 5, are relevant variables with cor(X1, X4) = 0.5 and X6 is an irrelevant variable that is correlated with X3 with correlation 0.9. Column 7 gives the mean of the estimated probability of selection and that of the standard error for the irrelevant variables X7, ..., Xd.

                         Number of Selections (100 simulations)
METHOD          d    X1          X2          X3          X4          X5          X6          Xi, i ≥ 7
CATCH           10   0.76 (0.04) 0.77 (0.04) 0.73 (0.04) 0.68 (0.05) 0.99 (0.01) 0.34 (0.05) 0.07 (0.02)
                50   0.78 (0.04) 0.77 (0.04) 0.74 (0.04) 0.63 (0.05) 0.92 (0.03) 0.57 (0.05) 0.09 (0.03)
                100  0.76 (0.04) 0.77 (0.04) 0.80 (0.04) 0.74 (0.04) 0.82 (0.04) 0.55 (0.05) 0.09 (0.03)
Marginal CATCH  10   0.35 (0.05) 0.63 (0.05) 0.80 (0.04) 0.32 (0.05) 0.09 (0.03) 0.71 (0.05) 0.12 (0.03)
                50   0.34 (0.05) 0.71 (0.05) 0.70 (0.05) 0.30 (0.05) 0.10 (0.03) 0.60 (0.05) 0.11 (0.03)
                100  0.34 (0.05) 0.68 (0.05) 0.75 (0.04) 0.30 (0.05) 0.10 (0.03) 0.59 (0.05) 0.10 (0.03)
Random Forest   10   0.29 (0.05) 0.48 (0.05) 0.55 (0.05) 0.20 (0.04) 0.30 (0.05) 0.47 (0.05) 0.05 (0.02)
                50   0.26 (0.04) 0.52 (0.05) 0.57 (0.05) 0.30 (0.05) 0.24 (0.04) 0.45 (0.05) 0.08 (0.03)
                100  0.36 (0.05) 0.61 (0.05) 0.66 (0.05) 0.37 (0.05) 0.32 (0.05) 0.46 (0.05) 0.12 (0.03)
GUIDE           10   0.96 (0.02) 0.96 (0.02) 0.94 (0.02) 0.95 (0.02) 0.96 (0.02) 0.84 (0.04) 0.30 (0.05)
                50   0.80 (0.04) 0.92 (0.03) 0.88 (0.03) 0.79 (0.04) 0.72 (0.04) 0.68 (0.05) 0.12 (0.03)
                100  0.81 (0.04) 0.89 (0.03) 0.90 (0.03) 0.76 (0.04) 0.75 (0.04) 0.58 (0.05) 0.08 (0.03)
variables for classification, the performance of other algorithms such as the Support Vector Machine (SVM) and Random Forest Classifier (RFC) can be significantly improved. In this section we compare the prediction performance of (1) SVM, (2) RFC, (3) CATCH followed by SVM, and (4) CATCH followed by RFC. We label (3) and (4) CATCH+SVM and CATCH+RFC, respectively. We use the "svm" function in the "e1071" package in R with a radial kernel to perform the SVM analysis, and the "randomForest" function in the "randomForest" package to perform the RFC analysis. We use simulation experiments to illustrate the improvement obtained by using CATCH prior to classifying Y.
Let the true model be
(X, Y) ∼ P,   (4.2)
where P will be specified later. In each of 100 Monte Carlo experiments, n pairs (X^(i), Y^(i)), i = 1, ..., n, are generated from the true model (4.2). Based on the simulated data (X^(i), Y^(i)), i = 1, ..., n, the methods (1) SVM, (2) RFC, (3) CATCH+SVM, and (4) CATCH+RFC are used to classify a future Y^(0) for which a corresponding X^(0) is available.
The integrated misclassification rate (IMR; Friedman [6]), defined below, is used to evaluate the performance of the classification algorithms. Let F_XY(·) be the classifier constructed by a classification algorithm based on X = (X^(1), ..., X^(n)) and Y = (Y^(1), ..., Y^(n))^T.
Let (X^(0), Y^(0)) be a future observation with the same distribution as (X^(i), Y^(i)). We only know X^(0) = x^(0) and want to classify Y^(0). For a possible future observation (x^(0), y^(0)), define

MR(X, Y, x^(0), y^(0)) = P(F_XY(x^(0)) ≠ y^(0) | X, Y).   (4.3)
Our criterion is

IMR = E E0[MR(X, Y, x^(0), y^(0))],   (4.4)

where E0 is with respect to the distribution of the future observation (x^(0), y^(0)) and E is with respect to the distribution of (X, Y).
This double expectation is evaluated using Monte Carlo simulation; the result is denoted IMR*. More precisely, E0 is estimated using 5000 simulated (x^(0), y^(0)) observations; then IMR* is computed on the basis of 100 simulated samples (X_i, Y_i), i = 1, ..., 100.
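The double expectation in (4.4) can be sketched as a nested Monte Carlo loop; the data generator and the majority-vote classifier below are toy stand-ins of ours, not the paper's models:

```python
import numpy as np

def imr_star(gen_data, fit, n=200, outer=20, inner=1000, seed=0):
    """Nested Monte Carlo estimate of IMR (4.4): the outer loop averages
    over training samples (E), the inner sample estimates E0 over future
    observations. (The paper uses 100 outer samples and 5000 inner draws;
    smaller values here keep the sketch fast.)"""
    rng = np.random.default_rng(seed)
    rates = []
    for _ in range(outer):
        Xtr, ytr = gen_data(n, rng)      # one training sample (X, Y)
        predict = fit(Xtr, ytr)          # classifier F_XY built on it
        X0, y0 = gen_data(inner, rng)    # future observations (x0, y0)
        rates.append(float(np.mean(predict(X0) != y0)))
    return float(np.mean(rates))

# toy ingredients: Y = 1{X1 > 0}, and a constant "majority vote" rule
def gen(m, rng):
    X = rng.standard_normal((m, 3))
    return X, (X[:, 0] > 0).astype(int)

def majority_fit(X, y):
    label = int(np.mean(y) > 0.5)
    return lambda Xnew: np.full(len(Xnew), label)

print(imr_star(gen, majority_fit))  # near 0.5: the rule ignores X entirely
```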
Example 4.3. First consider model (4.1) in Section 4.1. Figure 3 shows IMR* versus the number of irrelevant variables for this example. The left panel shows that the IMR* of the SVM approach increases dramatically as the number of irrelevant variables increases, while the IMR* of CATCH+SVM is stable. Thus CATCH stabilizes the
performance of SVM for this complex model. The right panel shows that CATCH also stabilizes the performance of RFC. These results indicate that CATCH can improve the prediction accuracy of other classification algorithms.
[Figure 3 here: error rate (0.25-0.60) versus d (20-100); left panel "SVM" with curves SVM and CATCH+SVM; right panel "Random Forest" with curves RFC and CATCH+RFC.]
Figure 3: IMR* of SVM, RFC, CATCH+SVM, and CATCH+RFC versus the number of predictors for the simulation study of model (4.1) in Example 4.1. The sample size is n = 400.
Example 4.4 (Contaminated Logistic Model). Consider a binary response Y ∈ {0, 1} with

P(Y = 0 | X = x) = F(x) if r = 0, and G(x) if r = 1,   (4.5)

where x = (x1, ..., xd),

r ∼ Bernoulli(γ),   (4.6)

F(x) = (1 + exp{−(0.25 + 0.5x1 + x2)})^(−1),   (4.7)

G(x) = 0.1 if 0.25 + 0.5x1 + x2 < 0, and 0.9 if 0.25 + 0.5x1 + x2 ≥ 0,   (4.8)

and x1, x2, ..., xd are realizations of independent standard normal random variables. Hence, given X^(i) = x^(i), 1 ≤ i ≤ n, the joint distribution of Y^(1), Y^(2), ..., Y^(n) is the contaminated logistic model with contamination parameter γ:
(1 − γ) ∏_{i=1}^n [F(x^(i))]^{Y^(i)} [1 − F(x^(i))]^{1−Y^(i)} + γ ∏_{i=1}^n [G(x^(i))]^{Y^(i)} [1 − G(x^(i))]^{1−Y^(i)}.   (4.9)
We perform a simulation study for n = 200, d = 50, and γ = 0, 0.2, 0.4, 0.6, 0.8, and 1, where γ = 0 corresponds to no contamination. Figure 4 shows IMR* versus the different values of γ for this example. CATCH improves the performance of the SVM and Random Forest classifiers for this contaminated logistic regression model.
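A sketch of sampling from this model; note that, following the mixture of product likelihoods in (4.9), a single contamination indicator r is drawn for the whole sample (the function name and defaults are ours):

```python
import numpy as np

def contaminated_logistic(n=200, d=50, gamma=0.4, seed=0):
    """Sample (X, Y) from the contaminated logistic model (4.5)-(4.9).
    One r ~ Bernoulli(gamma) selects F or G for the entire sample,
    matching the mixture form of the joint distribution (4.9)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    eta = 0.25 + 0.5 * X[:, 0] + X[:, 1]
    F = 1.0 / (1.0 + np.exp(-eta))       # logistic component (4.7)
    G = np.where(eta < 0, 0.1, 0.9)      # contaminating component (4.8)
    p0 = G if rng.random() < gamma else F
    Y = (rng.random(n) >= p0).astype(int)   # P(Y = 0 | x) = p0
    return X, Y

X, Y = contaminated_logistic()
print(X.shape, float(Y.mean()))
```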
[Figure 4 here: error rate (0.1-0.4) versus γ (0-1); left panel "SVM" with curves SVM and CATCH+SVM; right panel "Random Forest" with curves RFC and CATCH+RFC.]
Figure 4: IMR* of SVM, RFC, CATCH+SVM, and CATCH+RFC versus the contamination parameter γ for simulation model (4.9) in Example 4.4.
5 CATCH Analysis of the ANN Data
In this section CATCH is applied to a data set available at the UCI database (ftp://ftp.ics.uci.edu/pub/machine-learning-databases/thyroid-disease). The data are from a clinical trial to determine whether a patient is hypothyroid (see, e.g., Quinlan [15]). The dataset is split into two parts: a training set with 3772 observations and a test set with 3428 observations. The response is a categorical variable with three classes: normal (not hypothyroid), hyperfunction, and subnormal functioning. There are 21 predictors: 15 categorical and 6 numerical. The three classes are very imbalanced, with 92.5% normal patients. A good classifier should produce a significant decrease from the 7.5% error rate that can be obtained by classifying all observations to the normal class.
CATCH identifies 7 variables as important, including 4 continuous variables and 3 categorical variables. We consider three classification methods: 1. Support Vector Machine with radial kernel, denoted SVM(r); 2. Support Vector Machine with polynomial kernel, denoted SVM(p); 3. RFC. We apply these three methods to the training set to build classification models and use the test set to estimate the misclassification rates. We apply the methods with CATCH and without CATCH, where "with CATCH" means that only the 7 important variables identified by CATCH are used to build the model, and "without CATCH" means that all 21 variables are used in the analysis.
Table 5: Misclassification rates of three classification methods (SVM(r), SVM(p), RFC) with CATCH and without CATCH.

             SVM(r)   SVM(p)   RFC
w/o CATCH    0.052    0.063    0.024
with CATCH   0.037    0.055    0.015
Table 5 shows that if CATCH is used to select important variables before the classification methods are applied, the misclassification rates decrease. CATCH reduces the misclassification rates by 29%, 13%, and 37% for SVM(r), SVM(p), and RFC, respectively. CATCH+RFC has the best performance. Although CATCH is not designed for improving classification accuracy, it nonetheless helps other classification methods perform better.
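The quoted reductions follow directly from the rates in Table 5; a quick arithmetic check (the exact RFC reduction is 37.5%):

```python
# Relative reductions in misclassification rate implied by Table 5:
# 100 * (rate without CATCH - rate with CATCH) / rate without CATCH.
rates = {"SVM(r)": (0.052, 0.037), "SVM(p)": (0.063, 0.055), "RFC": (0.024, 0.015)}
for method, (without, with_catch) in rates.items():
    reduction = 100.0 * (without - with_catch) / without
    print(f"{method}: {reduction:.1f}%")  # 28.8%, 12.7%, 37.5%
```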
6 Properties of Local Contingency Efficacy

In this section we investigate the properties of local contingency efficacy. For a univariate continuous predictor, Theorem 6.1 below shows that the local contingency efficacy is well defined and equals 0 if the response variable is independent of the predictor. Consider an R × C contingency table. Let n_rc, r = 1, ..., R, c = 1, ..., C, be the observed frequency in the (r, c)-cell, and let p_rc be the probability that an observation belongs to the (r, c)-cell. Let
p_r = Σ_{c=1}^C p_rc,  q_c = Σ_{r=1}^R p_rc,   (6.1)

and

n_{r·} = Σ_{c=1}^C n_rc,  n_{·c} = Σ_{r=1}^R n_rc,  n = Σ_{r=1}^R Σ_{c=1}^C n_rc,  E_rc = n_{r·} n_{·c} / n.   (6.2)
Consider testing that the rows and the columns are independent, i.e.,

H0: p_rc = p_r q_c for all r, c  versus  H1: p_rc ≠ p_r q_c for some r, c.   (6.3)

A standard test statistic is

X² = Σ_{r=1}^R Σ_{c=1}^C (n_rc − E_rc)² / E_rc.   (6.4)

Under H0, X² converges in distribution to χ²_{(R−1)(C−1)}. Moreover, defining 0/0 = 0, we have the well-known result:
Lemma 6.1. As n tends to infinity, X²/n tends in probability to

τ² ≡ Σ_{r=1}^R Σ_{c=1}^C (p_rc − p_r q_c)² / (p_r q_c).

Moreover, if p_r q_c is bounded away from 0 and 1 as n → ∞, and if R and C are fixed as n → ∞, then τ² = 0 if and only if H0 is true.
Remark 6.1. Lemma 6.1 shows that the contingency efficacy ζ in (2.8) is well defined. Moreover, ζ = 0 if and only if X and Y are independent.
Theorem 6.1. Consider a categorical response Y and a continuous predictor X with density function f(x) such that, for a fixed h > 0, f(x) > 0 for x ∈ (x^(0) − h, x^(0) + h). Then:

(1) For fixed h, the estimated local section efficacy ζ̂(x^(0), h) defined in (2.5) converges almost surely to the section efficacy ζ(x^(0), h) defined in (2.4).

(2) If X and Y are independent, then the local efficacy ζ(x^(0)) defined in (2.4) is zero.

(3) If there exists c ∈ {1, ..., C} such that p_c(x) = P(Y = c | x) has a continuous derivative at x = x^(0) with p′_c(x^(0)) ≠ 0, then ζ(x^(0)) > 0.
The χ² statistic (6.4) can be written in terms of multinomial proportions p̂_rc, with √n(p̂_rc − p_rc), 1 ≤ r ≤ R, 1 ≤ c ≤ C, converging in distribution to a multivariate normal. By a Taylor expansion of T_n ≡ √(X²/n) at τ = plim √(X²/n), we find:

Theorem 6.2.

√n(T_n − τ) →_d N(0, v²),   (6.5)

where the asymptotic variance v², which can be obtained from the Taylor expansion, is not needed in this paper.
Remark 6.2. The result (6.5) applies to the multivariate χ²-based efficacies ζ, with n replaced by the number of observations that go into the computation of ζ. In this case we use the notation χ²_j for the chi-square statistic for the jth variable.
7 Consistency of CATCH

Let Y denote a variable to be classified into one of the categories 1, ..., C by using the Xj's in the vector X = (X1, ..., Xd). Let X_{−j} denote X without the Xj component. Assume that the squared Pearson nonparametric correlation (Doksum and Samarov [4], Huang and Chen [9]) between Xj and X_{−j} is less than 0.5. We call the variable Xj relevant for Y if, conditionally given X_{−j}, L(Y | Xj = x) is not constant in x, and irrelevant otherwise. Our data (X^(1), Y^(1)), ..., (X^(n), Y^(n)) are i.i.d. as (X, Y). A variable selection procedure is consistent if the probability that all the relevant variables are selected and none of the irrelevant variables are selected tends to one as n tends to infinity.
We give a general nonparametric framework in which the variable selection rule based on the CATCH importance scores sj given in (3.8) is consistent.
When the predictors are not all categorical and CATCH involves tubes, we assume that the number of observations in each tube tends to ∞ as n → ∞, and we assume that, for those importance scores that require the selection of a section size h, the selection is from a finite set {hj} with min_j{hj} > h0 for some h0 > 0. It follows that the number of observations m_jk in the kth tube used to calculate the importance score sj for the variable Xj tends to infinity as n → ∞.
The key quantities that determine consistency are the limits λj of the sj. That is,

λj = τ(ζj, P) = E_P[ζj(X)] = ∫ ζj(x) dP(x), j = 1, ..., d,

where ζj(x) is the local contingency efficacy defined by (2.4) for numerical X's and by (2.8) for categorical X's, and P is the probability distribution of X. Our estimate sj of λj is

sj = τ(ζ̂j, P̂) = (1/M) Σ_{k=1}^M ζ̂j(X*_k),

where ζ̂j is defined in Section 3 and P̂ is the empirical distribution of X.

Let d0 denote the number of X's that are irrelevant for Y. Set d1 = d − d0 and, without loss of generality, reorder the λj so that
λj > 0 if j = 1, ..., d1;  λj = 0 if j = d1 + 1, ..., d.   (7.1)

Our variable selection rule is:

keep variable Xj if sj ≥ t, for some threshold t.   (7.2)

Rule (7.2) is consistent if and only if

P(min_{j≤d1} sj ≥ t and max_{j>d1} sj < t) → 1.   (7.3)

To establish (7.3) it is enough to show that

P(max_{j>d1} sj > t) → 0   (7.4)

and

P(min_{j≤d1} sj ≤ t) → 0.   (7.5)

We call (7.4) and (7.5) type I and type II consistency, respectively. Type I and type II error probabilities refer to the probabilities of any false positives and any false negatives, respectively.
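For concreteness, the score sj and rule (7.2) can be sketched as follows. This is a deliberately simplified version with our own unweighted L1 neighborhood metric and a split-at-the-center 2 x C table; it omits the adaptive tube reweighting of the actual CATCH algorithm:

```python
import numpy as np

def local_efficacy(xj, y, idx, split, n_classes):
    """Root chi-square of a 2 x C table over one tube: rows split Xj at
    the tube center's jth coordinate, columns are the classes of Y."""
    rows = (xj[idx] > split).astype(int)
    counts = np.zeros((2, n_classes))
    for r, c in zip(rows, y[idx]):
        counts[r, c] += 1
    m = counts.sum()
    E = np.outer(counts.sum(axis=1), counts.sum(axis=0)) / m
    mask = E > 0
    return float(np.sqrt(((counts - E)[mask] ** 2 / E[mask]).sum() / m))

def catch_scores(X, y, M=30, m=50, seed=0):
    """Simplified importance scores s_j: average the local efficacy over
    M randomly centered tubes of the m points nearest to the center in
    the remaining coordinates."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    n_classes = int(y.max()) + 1
    s = np.zeros(d)
    for j in range(d):
        others = np.delete(np.arange(d), j)
        for _ in range(M):
            k = int(rng.integers(n))                       # random tube center
            dist = np.abs(X[:, others] - X[k, others]).sum(axis=1)
            idx = np.argsort(dist)[:m]                     # points in the tube
            s[j] += local_efficacy(X[:, j], y, idx, X[k, j], n_classes)
    return s / M

# toy check: only X1 is relevant
rng = np.random.default_rng(1)
X = rng.random((400, 5))
y = (X[:, 0] > 0.5).astype(int)
s = catch_scores(X, y)
print(s.round(2))  # s[0] dominates; rule (7.2) keeps Xj whenever s_j >= t
```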
7.1 Type I consistency

We consider (7.4) first and note that

P(max_{j>d1} sj > t) ≤ Σ_{j=d1+1}^d P(sj > t).   (7.6)

For the jth irrelevant variable Xj, j > d1, the results in Section 6 apply to the local efficacy ζ̂j(x) at a fixed point x. The importance score sj defined in (3.8) is the average of such efficacies over a sample of M random points x*_a. We assume that there is a point x^(0) such that, for irrelevant Xj's and for n large enough, ζ̂j(x^(0)) is stochastically larger in the right tail than the average M^{−1} Σ_{a=1}^M ζ̂j(x*_a). That is, we assume that there exist t > 0 and n0 such that for n ≥ n0,

P(sj > t) ≤ P(s_j^(0) > t), j > d1,   (7.7)

where s_j^(0) ≡ ζ̂j(x^(0)). The existence of such a point x^(0) is established by considering the probability limit of

arg max{ζ̂j(x*_a) : x*_a ∈ {x*_1, ..., x*_M}}.

Note that by Lemma 6.1, s_j^(0) →_p 0 as n → ∞. It follows that we have established
(7.4) for t and d0 = d − d1 that are fixed as n → ∞. To examine the interesting case where d0 → ∞ as n → ∞, we need large deviation results, to which we turn next. In the d0 → ∞ case we consider thresholds t ≡ tn that tend to ∞ as n → ∞, and we use the weaker assumption: there exist x^(0) and n0 such that

P(sj > tn) ≤ P(s_j^(0) > tn)   (7.8)

for n ≥ n0 and each tn with lim_{n→∞} tn = ∞. We also assume that the numbers of columns C and rows R in the contingency table that defines the efficacy χ²_{jn} given by (6.4) stay fixed as n → ∞. In this case P(s_j^(0) > tn) → 0 provided each term

[T_rc^(j)]² ≡ [(n_rc − E_rc) / (√n √E_rc)]²

in s_j² = χ²_{jn}/n, defined for Xj by (6.4), satisfies

P(|T_rc^(j)| > tn) → 0.

This is because

P(√(Σ_r Σ_c [T_rc^(j)]²) > tn) ≤ P(RC max_{r,c} [T_rc^(j)]² > tn²) ≤ RC max_{r,c} P(|T_rc^(j)| > tn/√(RC)),

and tn and tn/√(RC) are of the same order as n → ∞. By large deviation theory (Hall [8], Jing et al. [10]), for some constant A,

P(|T_rc^(j)| > tn) = (2 tn^{−1} / √(2π)) e^{−tn²/2} [1 + A tn³ n^{−1/2} + o(tn³ n^{−1/2})]   (7.9)

provided tn = o(n^{1/6}). If (7.9) is uniform in j > d1, then (7.6) and (7.9) imply that type I consistency holds if

d0 (tn √2)^{−1} e^{−tn²/2} → 0.   (7.10)

To examine (7.10), write d0 and tn as d0 = exp(n^b) and tn = √2 n^r. By large deviation theory, (7.10) holds when r < 1/6. Thus if 0 < b/2 < r < 1/6, then

d0 (tn √2)^{−1} e^{−tn²/2} = n^{−r} exp(n^b − n^{2r}) → 0.

Thus d0 = exp(n^b) with b = 1/3 − ε, for any ε > 0, is the largest d0 can be to have consistency.
For instance, if there are exactly d0 = exp(n^{1/4}) irrelevant variables and we use the threshold tn = √2 n^{1/7}, the probability of selecting any of the irrelevant variables tends to zero at the rate n^{−1/7} exp(n^{1/4} − n^{2/7}).
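This rate can be checked numerically; the function below is our sketch of the order of the type I bound with constants dropped (it vanishes because the exponent n^{1/4} − n^{2/7} → −∞, since 2/7 > 1/4):

```python
import math

def type1_bound(n, b=0.25, r=1.0 / 7.0):
    """Order of the type I bound (7.10) with d0 = exp(n**b) and
    tn = sqrt(2) * n**r: n**(-r) * exp(n**b - n**(2*r))."""
    return n ** (-r) * math.exp(n ** b - n ** (2 * r))

for n in (10 ** 4, 10 ** 6, 10 ** 8):
    print(n, type1_bound(n))  # decreases rapidly toward zero
```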
7.2 Type II consistency
We now assume that for each relevant variable Xj there is a least favorable point x^(1) such that the local efficacy s_j^(1) ≡ ζ̂j(x^(1)) is stochastically smaller than the importance score sj = M^{−1} Σ_{a=1}^M ζ̂j(x*_a). That is, we assume that for each relevant variable Xj there exist n1 and x^(1) such that for n ≥ n1,

P(s_j^(1) ≤ tn) ≥ P(sj ≤ tn), j ≤ d1.   (7.11)

The existence of such a least favorable x^(1) is established by considering the probability limit of

arg min{ζ̂j(x*_a) : x*_a ∈ {x*_1, ..., x*_M}}.

We again assume that R and C are fixed, so that to show type II consistency it is enough to show that, for each (r, c) and j ≤ d1,

P(|T_rc^(j)| ≤ tn) → 0.

This is because

P(√(Σ_r Σ_c [T_rc^(j)]²) ≤ tn) ≤ P(min_{r,c} [T_rc^(j)]² ≤ tn²) ≤ RC max_{r,c} P(|T_rc^(j)| ≤ tn).

By Theorem 6.2, √n(s_j^(1) − τj) →_d N(0, ν²), with τj > 0. It follows that if tn and d1 are bounded above as n → ∞ and τj > 0, then P(s_j^(1) ≤ tn) → 0. Thus type II consistency holds in this case. To examine the more interesting case, where tn and d1 are not bounded above and τj may tend to zero as n → ∞, we need to center the T_rc^(j). Thus set

θ_n^(j) = √n (p_rc^(j) − p_r^(j) p_c^(j)) / √(p_r^(j) p_c^(j)),   (7.12)
T_n^(j) = T_rc^(j) − θ_n^(j),

where for simplicity the dependence of T_n^(j) and θ_n^(j) on r and c is not indicated.

Lemma 7.1.

P(min_{j≤d1} |T_rc^(j)| ≤ tn) ≤ Σ_{j=1}^{d1} P(|T_n^(j)| ≥ min_{j≤d1} |θ_n^(j)| − tn).

Proof. Since |T_n^(j) + θ_n^(j)| ≥ |θ_n^(j)| − |T_n^(j)|,

P(min_{j≤d1} |T_rc^(j)| ≤ tn) = P(min_{j≤d1} |T_n^(j) + θ_n^(j)| ≤ tn)
  ≤ P(min_{j≤d1} |θ_n^(j)| − max_{j≤d1} |T_n^(j)| ≤ tn)
  = P(max_{j≤d1} |T_n^(j)| ≥ min_{j≤d1} |θ_n^(j)| − tn)
  ≤ Σ_{j=1}^{d1} P(|T_n^(j)| ≥ min_{j≤d1} |θ_n^(j)| − tn).
Set

an = min_{j≤d1} |θ_n^(j)| − tn;

then by large deviation theory, if an = o(n^{1/6}),

P(|T_n^(j)| ≥ an) ∝ (an √2)^{−1} exp(−an²/2),   (7.13)

where ∝ signifies asymptotic order as n → ∞, an → ∞. It follows that if (7.13) holds uniformly in j, then by Lemma 7.1,

P(min_{j≤d1} |T_rc^(j)| ≤ tn) ∝ d1 (an √2)^{−1} exp(−an²/2).

Thus type II consistency holds when an = o(n^{1/6}) and

d1 (an √2)^{−1} exp(−an²/2) → 0.   (7.14)

In Section 7.1, tn is of order n^r, 0 < r < 1/6. For this tn, type II consistency holds if an is of order n^r, 0 < r < 1/6, and

d1 n^{−r} exp(−n^{2r}/2) → 0.
If an = O(n^k) with k ≥ 1/6, then P(|T_n^(j)| ≥ an) tends to zero at a rate faster than in (7.13), and consistency is ensured. For instance, if all relevant variables Xj have p_rc^(j) fixed as n → ∞, then min_{j≤d1} |θ_n^(j)| is of order √n and an is of order n^{1/2} − tn.
7.3 Type I and II consistency

We see from Sections 7.1 and 7.2 that consistency holds under a variety of conditions. One set of conditions is the following.

Theorem 7.1. Under the regularity conditions of Sections 7.1 and 7.2, if

d0 = c0 exp(n^b),  tn = c1 n^r,  min_{j≤d1} θ^(j) − tn = c2 n^r,  d1 = c3 exp(n^b)

for positive constants c0, c1, c2, c3, and 0 < b/2 < r < 1/6, then CATCH is consistent.
Appendix: Proof of Theorem 6.1

(1) Let N_h = Σ_{i=1}^n I(X^(i) ∈ N_h(x^(0))). Then N_h ∼ Binomial(n, ∫_{x^(0)−h}^{x^(0)+h} f(x) dx) and N_h → ∞ almost surely as n → ∞. Also,

ζ̂(x^(0), h) = n^{−1/2} √(X²(x^(0), h)) = (N_h/n)^{1/2} (N_h)^{−1/2} √(X²(x^(0), h)).

Since N_h ∼ Binomial(n, ∫_{x^(0)−h}^{x^(0)+h} f(x) dx), we have

N_h/n → ∫_{x^(0)−h}^{x^(0)+h} f(x) dx.   (7.15)

By Lemma 6.1 and Table 1,

(N_h)^{−1/2} √(X²(x^(0), h)) → (Σ_{r=+,−} Σ_{c=1}^C (p_rc − p_r q_c)² / (p_r q_c))^{1/2},   (7.16)

where

p_{+c} = Pr(X > x^(0), Y = c | X ∈ N_h(x^(0))),  p_{−c} = Pr(X ≤ x^(0), Y = c | X ∈ N_h(x^(0)))

for r = +, − and c = 1, ..., C,

p_r = Σ_{c=1}^C p_rc,  q_c = Σ_{r=+,−} p_rc.

That is, p_+ = Pr(X > x^(0) | X ∈ N_h(x^(0))), p_− = Pr(X ≤ x^(0) | X ∈ N_h(x^(0))), and q_c = Pr(Y = c | X ∈ N_h(x^(0))).

By (7.15) and (7.16),

ζ̂(x^(0), h) → (∫_{x^(0)−h}^{x^(0)+h} f(x) dx)^{1/2} (Σ_{r=+,−} Σ_{c=1}^C (p_rc − p_r q_c)² / (p_r q_c))^{1/2} ≡ ζ(x^(0), h).

(2) Since X and Y are independent, p_rc = p_r q_c for r = +, − and c = 1, ..., C; then ζ(x^(0), h) = 0 for any h. Hence

ζ(x^(0)) = sup_{h>0} ζ(x^(0), h) = 0.

(3) Recall that p_c(x^(0)) = Pr(Y = c | x^(0)). Without loss of generality, assume p′_c(x^(0)) > 0. Then there exists h such that

Pr(Y = c | x^(0) < X < x^(0) + h) > Pr(Y = c | x^(0) − h < X < x^(0)),

which is equivalent to p_{+c}/p_+ > p_{−c}/p_−. This shows that ζ(x^(0)) > 0, since by Lemma 6.1, ζ(x^(0)) = 0 would imply p_{+c}/p_+ = p_{−c}/p_− = p_c.
References

[1] Breiman, L., 2001. Random forests. Machine Learning 45, 5–32.

[2] Breiman, L., Friedman, J. H., Olshen, R. A., Stone, C. J., 1984. Classification and Regression Trees. Wadsworth, Belmont.

[3] Doksum, K., Tang, S., Tsui, K.-W., 2008. Nonparametric variable selection: The EARTH algorithm. Journal of the American Statistical Association 103(484), 1609–1620.

[4] Doksum, K. A., Samarov, A., 1995. Nonparametric estimation of global functionals and a measure of the explanatory power of covariates in regression. Annals of Statistics 23, 1443–1473.

[5] Doksum, K. A., Schafer, C., 2006. Powerful choices: Tuning parameter selection based on power. In: Frontiers in Statistics: Dedicated to Peter Bickel (J. Fan and H. L. Koul, eds.). Imperial College Press, 113–141.

[6] Friedman, J., 1991. Multivariate adaptive regression splines. The Annals of Statistics 19, 1–67.

[7] Gao, J., Gijbels, I., 2008. Bandwidth selection in nonparametric kernel testing. Journal of the American Statistical Association 103(484), 1584–1594.

[8] Hall, P., 1992. The Bootstrap and Edgeworth Expansion. Springer, New York.

[9] Huang, L.-S., Chen, J., 2008. Analysis of variance, coefficient of determination and F-test for local polynomial regression. Annals of Statistics 36, 2085–2109.

[10] Jing, B.-Y., Shao, Q.-M., Wang, Q., 2003. Self-normalized Cramér-type large deviations for independent random variables. Annals of Probability 31, 2167–2215.

[11] Lin, D. Y., Zeng, D., 2011. Correcting for population stratification in genomewide association studies. Journal of the American Statistical Association 106, 997–1008.

[12] Loh, W.-Y., 2009. Improving the precision of classification trees. The Annals of Applied Statistics 3, 1710–1737.

[13] Price, A., Patterson, N., Plenge, R., Weinblatt, M., Shadick, N., Reich, D., 2006. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38(8), 904–909.

[14] Qiao, Z., Zhou, L., Huang, J., 2008. Linear discriminant analysis for high-dimensional, low-sample-size data. Special issue of World Congress on Engineering 2008.

[15] Quinlan, J. R., 1987. Simplifying decision trees. International Journal of Man-Machine Studies 27(3), 221–234.

[16] Schafer, C., Doksum, K. A., 2009. Selecting local models in multiple regression by maximizing power. Metrika 69, 283–304.

[17] Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267–288.

[18] Wang, L., Shen, X., 2007. On L1-norm multi-class support vector machines: methodology and theory. Journal of the American Statistical Association 102, 583–594.

[19] Zhang, H. H., Liu, Y., Wu, Y., Zhu, J., 2008. Variable selection for the multicategory SVM via adaptive sup-norm regularization. Electronic Journal of Statistics 2, 149–167.
Figure 1: The relationship between Y and X1, X2 for model (4.1). The four areas represent the four values of Y.
biased toward being selected. The way GUIDE alleviates this bias is to employ lack-of-fit tests, i.e., χ²-tests applicable to both numerical and categorical predictors, with the p-values adjusted by a bootstrap procedure. Since GUIDE has negligible selection bias, it can include tests for local pairwise interactions. Furthermore, it achieves sensitivity to local curvature by using a linear model at each node.
The results in Table 2 show that Marginal CATCH does a very poor job of selecting X5. This illustrates the importance of local conditioning. RFC does better than Marginal CATCH but also has difficulties, while CATCH and GUIDE successfully identify X5, along with X1 to X4, with high frequency. When d = 10, GUIDE does very well for the relevant variables but selects too many irrelevant variables. Surprisingly, GUIDE selects fewer irrelevant variables as d increases; it becomes more cautious as the dimension d grows. CATCH performs very well in the high-dimensional cases (d = 50 and d = 100). This indicates that the CATCH algorithm is robust to the intricate interaction structures between the predictors.
In model (4.1) above, the predictors interact in a hierarchical way. When X5 = 0, the response is related to X1 and X2 as shown in Figure 1, where the two axes are the two relevant predictors and the four areas partitioned by the curves correspond to the different values of Y. When X5 = 1, the response depends on X3 and X4 in the same way. This type of hierarchical interaction between categorical and numerical predictors is difficult to detect even by state-of-the-art classification methods such as the support vector machine (SVM) and RFC. In particular, RFC produces an importance
score vector that identifies the four numerical variables X1, ..., X4 as very significant but often leaves the categorical variable X5 unidentified, as we see in Table 2.
The CATCH algorithm, however, performs very well for this complicated model. We now explain why CATCH is able to identify X5 as an important variable. To explore the association between X5 and Y, we construct M contingency tables of 2 rows and 4 columns based on M randomly centered tubes, calculate the contingency efficacy scores, and then average the scores. Figure 2 (a) and (b) illustrate how one of the contingency efficacy scores is calculated: 1000 data points are generated from model (4.1) and split into two sets according to x5. Panel (a) shows approximately half of the points, those with x5 = 0, in the (x1, x2) dimensions, and panel (b) shows the rest, with x5 = 1, in the (x3, x4) dimensions. The data points are colored according to Y. This representation lets us clearly see the classification boundaries (grey lines). A randomly selected tube center with x5 = 0 is shown as a star in (a). The data points within the tube, i.e., close to the tube center in terms of X_{−5}, are highlighted by larger dots. Since the distance defining the tube does not involve X5, the larger dots have x5 equal to 0 or 1 and are present in both (a) and (b). The counts of the larger dots of different colors in (a) and (b) correspond to the cell counts of the two rows, x5 = 0 and x5 = 1, in the contingency table. As we can see, the larger dots in (a) are mostly red and the larger dots in (b) are mostly blue and orange, which shows that the distribution of Y depends on the value of X5, so the contingency efficacy score is high. We would expect a low contingency efficacy score only if the tube center had similar values in (x1, x2) and (x3, x4); such tube centers are very few among the M randomly chosen centers. Thus the average contingency efficacy score is high and X5 is identified as an important variable.
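The row counts described above can be formed as follows; the function and the toy "tube" are our illustration, not code from the paper:

```python
import numpy as np

def categorical_tube_table(x5, y, in_tube, n_classes=4):
    """2 x 4 contingency table for a binary categorical predictor:
    rows are x5 = 0, 1 and columns the classes of Y, counted over the
    points falling in one tube (the 'larger dots' of Figure 2). The
    distance defining the tube must not involve X5, so both rows can
    be populated."""
    table = np.zeros((2, n_classes), dtype=int)
    for r, c in zip(x5[in_tube], y[in_tube]):
        table[r, c] += 1
    return table

# toy tube: within the tube, Y concentrates on different classes per x5 value
x5 = np.array([0] * 30 + [1] * 30)
y = np.array([0] * 25 + [1] * 5 + [2] * 20 + [3] * 10)
tube = np.arange(60)  # pretend all 60 points fall in this tube
T = categorical_tube_table(x5, y, tube)
print(T)  # mass sits in different columns per row, so the efficacy is high
```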
To contrast this example with a simpler data-generating model, suppose the effects of the predictors are additive, that is, the categorical variable enters the response as a term added to the effect of the numerical predictors. For this type of simpler model the RFC and GUIDE algorithms can efficiently identify the relevant predictors, but so can the CATCH algorithm. We do not include simulation results for additive models here.
Remark 4.1. We observe that the CATCH algorithm is not very sensitive to the choice of the number of tube centers M or the tube size m = [na], where 0 < a ≤ 1 and [·] is the greatest integer function. For Example 4.1, we performed the simulation using M = n and M = n/2, and a = 0.1, 0.15, 0.2, 0.3, 0.5. The results are summarized in Table 3. The CATCH algorithm performs similarly for M = n and M = n/2, with M = n having a small winning edge. We generally suggest using M = n if computation is of less concern or n is not large (≤ 200). When computational cost is a concern (Remark 3.5), one can reduce the number of tube centers to M = n/2 as a compromise between detection power and computational cost. Table 3 also shows that the tube size works well in the range between a = 0.1 and 0.3, while the detection power for X5 decreases in the case a = 0.5, where the tube size is too big to achieve local conditioning. This is a consistent
14
observation with the comparison to Marginal CATCH in Table 2 On the other handthe tube size should not be too small to achieve sufficient detection power We suggestto use a = 01 or 015 for sample sizes n ge 400 or m = 50 for smaller sample sizesIn light of these guidelines we use M = n2 = 200 and a = 015 for Examples 41-43M = n = 200 and m = 50 for Example 44
Table 3 Sensitivity study of CATCH for Example 41 (n = 400 d = 50) using different number oftube centers M and tube size m = [na] For 1 le i le 5 the ith column gives the number of times Xi
was selected Column 6 gives the percentage of times irrelevant variables were selected
Number of Selections(100 simulations)M a X1 X2 X3 X4 X5 Xi i ge 6
M = 200 010 75 67 77 69 92 78015 76 76 87 84 96 82020 85 84 90 87 93 99030 89 87 95 89 85 96050 89 90 96 93 62 99
M = 400 010 74 76 82 77 92 72015 85 82 89 87 94 86020 87 88 90 86 96 91030 88 89 91 91 92 94050 89 91 94 93 61 102
In Example 41 all predictors are independent In a similar setting we next considerthe case where the predictors are dependent
Example 42 We generate standard normal random variables Zi for i = 1 middot middot middot d insuch a way that cor(Z1 Z4) = 05 cor(Z3 Z6) = 09and the other Zirsquos are independentWe set Xi = Φ(Zi) for i = 1 middot middot middot d where Φ(middot) is the CDF of a standard normal andtherefore Xi are uniformly distributed marginally as in Example 41 The correlationstructure between the Xirsquos are similar to that of the Zirsquos The dependent variable Y isgenerated in the same way as in Example 41
The results of Example 42 are given in Table 4 By comparing with results in Table2 both CATCH and GUIDE are doing well in the presence of dependence between thenumerical variables The explanation is two-fold First both methods could identify thecategorical X5 as an important predictor most of the time which make the identificationof other relevant predictors possible Second conditioning on X5 Y depends on a pairof variables (ldquoX1 and X2rdquo or ldquoX3 and X4rdquo) jointly and in particular this dependence isnonlinear which makes both methods less affected by the correlation introduced here
Figure 2: Sample scatter plots of (x1, x2) and (x3, x4) for data simulated from model (4.1). The left panel includes all the pairs (x1, x2) with x5 = 0, while the right panel includes all the pairs (x3, x4) with x5 = 1. The larger dots in the two plots are within the tube around a tube center, shown as the star.
We now explain the latter in detail for both methods. In CATCH, the algorithm adaptively weights the effects of predictors by their importance scores when defining the tubes that condition on them; therefore identifying either member of a relevant pair helps to identify the other, whereas an irrelevant variable, even one strongly associated with a relevant variable, is down-weighted iteratively and does not strongly affect the selection of the relevant variable. This explains the slight decrease in the selection probability of X3 in the presence of the strongly correlated X6, and consequently the overall lower selection of X3 and X4 compared to X1 and X2, given the correlation between X1 and X4. In GUIDE, the difference between the dependent and independent cases is even smaller because GUIDE is very powerful in detecting local interactions, literally including pairwise interactions when selecting splitting variables.
Marginal CATCH and Random Forest do poorly in this case, mainly because X5 cannot be detected, as we explained earlier. More specifically, the dependence between Y and X4 in the X5 = 1 case is the same as that between Y and X2 in the X5 = 0 case. The linear terms of X1 and X2 have opposite signs in the decision function; therefore the marginal effects of X1 and X4 are obscured when X5 is not identified as an important variable.
4.2 Improving the Performance of SVM and RF Classifiers
The criterion used in the CATCH algorithm is not based on classification accuracy. But if we use CATCH to screen out the irrelevant variables and only use the important
Table 4: Simulation results from Example 4.2 with sample size n = 400. For 1 ≤ i ≤ 6, the ith column gives the estimated probability of selection of variable X_i and the standard error based on 100 simulations, where X_i, i = 1, ..., 5, are relevant variables with cor(X_1, X_4) = 0.5; X_6 is an irrelevant variable that is correlated with X_3 with correlation 0.9. Column 7 gives the mean of the estimated probability of selection, and that of the standard error, for the irrelevant variables X_7, ..., X_d.

Number of Selections (100 simulations)
METHOD           d    X1          X2          X3          X4          X5          X6          X_i, i ≥ 7
CATCH            10   0.76(0.04)  0.77(0.04)  0.73(0.04)  0.68(0.05)  0.99(0.01)  0.34(0.05)  0.07(0.02)
                 50   0.78(0.04)  0.77(0.04)  0.74(0.04)  0.63(0.05)  0.92(0.03)  0.57(0.05)  0.09(0.03)
                 100  0.76(0.04)  0.77(0.04)  0.80(0.04)  0.74(0.04)  0.82(0.04)  0.55(0.05)  0.09(0.03)
Marginal CATCH   10   0.35(0.05)  0.63(0.05)  0.80(0.04)  0.32(0.05)  0.09(0.03)  0.71(0.05)  0.12(0.03)
                 50   0.34(0.05)  0.71(0.05)  0.70(0.05)  0.30(0.05)  0.10(0.03)  0.60(0.05)  0.11(0.03)
                 100  0.34(0.05)  0.68(0.05)  0.75(0.04)  0.30(0.05)  0.10(0.03)  0.59(0.05)  0.10(0.03)
Random Forest    10   0.29(0.05)  0.48(0.05)  0.55(0.05)  0.20(0.04)  0.30(0.05)  0.47(0.05)  0.05(0.02)
                 50   0.26(0.04)  0.52(0.05)  0.57(0.05)  0.30(0.05)  0.24(0.04)  0.45(0.05)  0.08(0.03)
                 100  0.36(0.05)  0.61(0.05)  0.66(0.05)  0.37(0.05)  0.32(0.05)  0.46(0.05)  0.12(0.03)
GUIDE            10   0.96(0.02)  0.96(0.02)  0.94(0.02)  0.95(0.02)  0.96(0.02)  0.84(0.04)  0.30(0.05)
                 50   0.80(0.04)  0.92(0.03)  0.88(0.03)  0.79(0.04)  0.72(0.04)  0.68(0.05)  0.12(0.03)
                 100  0.81(0.04)  0.89(0.03)  0.90(0.03)  0.76(0.04)  0.75(0.04)  0.58(0.05)  0.08(0.03)
variables for classification, the performance of other algorithms, such as Support Vector Machines (SVM) and Random Forest Classifiers (RFC), can be significantly improved. In this section we compare the prediction performance of (1) SVM, (2) RFC, (3) CATCH followed by SVM, and (4) CATCH followed by RFC; we label (3) and (4) CATCH+SVM and CATCH+RFC, respectively. We use the "svm" function in the "e1071" package in R with the radial kernel to perform the SVM analysis, and the "randomForest" function in the "randomForest" package to perform the RFC analysis. We use simulation experiments to illustrate the improvement obtained by using CATCH prior to classifying Y.
Let the true model be

(X, Y) ~ P,    (4.2)
where P will be specified later. In each of 100 Monte Carlo experiments, n pairs (X^(i), Y^(i)), i = 1, ..., n, are generated from the true model (4.2). Based on the simulated data (X^(i), Y^(i)), i = 1, ..., n, the methods (1) SVM, (2) RFC, (3) CATCH+SVM, and (4) CATCH+RFC are used to classify a future Y^(0) for which we have available a corresponding X^(0).
The integrated misclassification rate (IMR; Friedman [6]), defined below, is used to evaluate the performance of the classification algorithms. Let F_{X,Y}(·) be the classifier constructed by a classification algorithm based on X = (X^(1), ..., X^(n)) and Y = (Y^(1), ..., Y^(n))^T.

Let (X^(0), Y^(0)) be a future observation with the same distribution as (X^(i), Y^(i)). We only know X^(0) = x^(0) and want to classify Y^(0). For a possible future observation (x^(0), y^(0)), define

MR(X, Y, x^(0), y^(0)) = P(F_{X,Y}(x^(0)) ≠ y^(0) | X, Y).    (4.3)

Our criterion is

IMR = E E_0 [MR(X, Y, x^(0), y^(0))],    (4.4)

where E_0 is with respect to the distribution of the random (x^(0), y^(0)) and E is with respect to the distribution of (X, Y).
This double expectation is evaluated using Monte Carlo simulation, and the result is denoted IMR*. More precisely, E_0 is estimated using 5000 simulated (x^(0), y^(0)) observations; then IMR* is computed on the basis of 100 simulated samples (X_i, Y_i), i = 1, ..., 100.
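The nested Monte Carlo scheme can be sketched as follows. This is a rough illustration only: the data-generating model `draw` is a toy stand-in for the true model P of (4.2), and a simple nearest-centroid classifier plays the role of F_{X,Y} in place of SVM or RFC.

```python
import numpy as np

rng = np.random.default_rng(1)

def draw(n):
    # Toy stand-in for the true model P in (4.2): Y depends on X1 only.
    X = rng.normal(size=(n, 5))
    Y = (X[:, 0] + rng.normal(size=n) > 0).astype(int)
    return X, Y

def fit_nearest_centroid(X, Y):
    # A very simple classifier standing in for F_{X,Y}.
    centroids = np.stack([X[Y == c].mean(axis=0) for c in (0, 1)])
    return lambda X0: np.argmin(
        ((X0[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1)

n_train, n_test, n_reps = 200, 5000, 20   # the paper uses 100 training replicates
rates = []
for _ in range(n_reps):                   # outer expectation E over (X, Y)
    X, Y = draw(n_train)
    classify = fit_nearest_centroid(X, Y)
    X0, Y0 = draw(n_test)                 # inner expectation E_0 over (x0, y0)
    rates.append(np.mean(classify(X0) != Y0))
imr_star = float(np.mean(rates))
```

Each inner average estimates MR(X, Y, ·, ·) for one training sample; averaging those over independent training samples estimates IMR in (4.4).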
Example 4.3. First consider model (4.1) in Section 4.1. Figure 3 shows IMR* versus the number of irrelevant variables for this example. The left panel shows that the IMR* of the SVM approach increases dramatically as the number of irrelevant variables increases, while the IMR* of CATCH+SVM is stable. Thus CATCH stabilizes the
performance of SVM for this complex model. The right panel shows that CATCH also stabilizes the performance of RFC. These results indicate that CATCH can improve the prediction accuracy of other classification algorithms.
Figure 3: IMR*'s of SVM, RFC, CATCH+SVM, and CATCH+RFC versus the number of predictors for the simulation study of model (4.1) in Example 4.1. The sample size is n = 400.
Example 4.4 (Contaminated Logistic Model). Consider a binary response Y ∈ {0, 1} with

P(Y = 0 | X = x) = { F(x) if r = 0;  G(x) if r = 1 },    (4.5)

where x = (x1, ..., xd),

r ~ Bernoulli(γ),    (4.6)

F(x) = (1 + exp{−(0.25 + 0.5 x1 + x2)})^{−1},    (4.7)

G(x) = { 0.1 if 0.25 + 0.5 x1 + x2 < 0;  0.9 if 0.25 + 0.5 x1 + x2 ≥ 0 },    (4.8)

and x1, x2, ..., xd are realizations of independent standard normal random variables. Hence, given X^(i) = x^(i), 1 ≤ i ≤ n, the joint distribution of Y^(1), Y^(2), ..., Y^(n) is the contaminated logistic model with contamination parameter γ:

(1 − γ) ∏_{i=1}^{n} [F(x^(i))]^{Y^(i)} [1 − F(x^(i))]^{1−Y^(i)} + γ ∏_{i=1}^{n} [G(x^(i))]^{Y^(i)} [1 − G(x^(i))]^{1−Y^(i)}.    (4.9)
We perform a simulation study for n = 200, d = 50, and γ = 0, 0.2, 0.4, 0.6, 0.8, and 1, where γ = 0 corresponds to no contamination. Figure 4 shows IMR* versus the different values of γ for this example. CATCH improves the performance of the SVM and Random Forest classifiers for this contaminated logistic regression model.
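A minimal sketch of drawing one simulated sample from this model follows. Note one interpretive assumption: following the mixture likelihood (4.9), the contamination indicator r is drawn once per sample, and F, G are used as the success probabilities for Y = 1.

```python
import numpy as np

rng = np.random.default_rng(2)

def draw_contaminated_logistic(n, d, gamma):
    X = rng.normal(size=(n, d))            # independent standard normals
    eta = 0.25 + 0.5 * X[:, 0] + X[:, 1]
    F = 1.0 / (1.0 + np.exp(-eta))         # logistic probabilities, (4.7)
    G = np.where(eta < 0, 0.1, 0.9)        # contaminating probabilities, (4.8)
    r = rng.random() < gamma               # one indicator per sample, per (4.9)
    p = G if r else F
    Y = (rng.random(n) < p).astype(int)    # Bernoulli draws with probability p
    return X, Y

X, Y = draw_contaminated_logistic(200, 50, 0.4)
```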
Figure 4: IMR*'s of SVM, RFC, CATCH+SVM, and CATCH+RFC versus the contamination parameter γ for simulation model (4.9) in Example 4.4.
5 CATCH Analysis of the ANN Data
In this section CATCH is applied to a data set available at the UCI data base (ftp://ftp.ics.uci.edu/pub/machine-learning-databases/thyroid-disease). The data are from a clinical trial to determine whether a patient is hypothyroid (see, e.g., Quinlan [15]). The dataset is split into two parts: a training set with 3772 observations and a testing set with 3428 observations. The response is a categorical variable with three classes: normal (not hypothyroid), hyperfunction, and subnormal functioning. There are 21 predictors, including 15 categorical predictors and 6 numerical predictors. The three classes are very imbalanced, with 92.5% of patients normal. A good classifier should produce a significant decrease from the 7.5% error rate that can be obtained by classifying all observations to the normal class.
CATCH identifies 7 variables as important, including 4 continuous variables and 3 categorical variables. We consider three classification methods: 1. Support Vector Machine with radial kernel, denoted SVM(r); 2. Support Vector Machine with polynomial kernel, denoted SVM(p); 3. RFC. We apply these three methods to the training set to build classification models and use the test set to estimate the misclassification rate of each method. We apply the methods with CATCH and without CATCH, where "with CATCH" means that only the 7 important variables identified by CATCH are used to build the model, and "without CATCH" means that all 21 variables are used in the analysis.
Table 5: Misclassification rates of three classification methods (SVM(r), SVM(p), RFC) with CATCH and without CATCH.

              SVM(r)  SVM(p)  RFC
w/o CATCH     0.052   0.063   0.024
with CATCH    0.037   0.055   0.015
Table 5 shows that if CATCH is used to select important variables before the classification methods are applied, the misclassification rates decrease: CATCH reduces the misclassification rates by 29%, 13%, and 37% for SVM(r), SVM(p), and RFC, respectively. CATCH+RFC has the best performance. Although CATCH is not designed to improve classification accuracy, it nonetheless helps other classification methods perform better.
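The quoted reductions follow from the Table 5 rates by simple arithmetic (the RFC figure computes to 37.5%, reported as 37%):

```python
# Percent reduction in misclassification rate from the Table 5 rates,
# as (without - with) / without, per method.
rates = {"SVM(r)": (0.052, 0.037), "SVM(p)": (0.063, 0.055), "RFC": (0.024, 0.015)}
pct = {m: 100 * (wo - w) / wo for m, (wo, w) in rates.items()}
# approximately 28.8%, 12.7%, and 37.5%
```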
6 Properties of Local Contingency Efficacy
In this section we investigate the properties of the local contingency efficacy. For a univariate continuous predictor, Theorem 6.1 below shows that the local contingency efficacy is well defined and equals 0 if the response variable is independent of the predictor. Consider an R × C contingency table. Let n_rc, r = 1, ..., R, c = 1, ..., C, be the observed frequency in the (r, c) cell, and let p_rc be the probability that an observation belongs to the (r, c) cell. Let

p_r = Σ_{c=1}^{C} p_rc,  q_c = Σ_{r=1}^{R} p_rc,    (6.1)
and

n_{r·} = Σ_{c=1}^{C} n_rc,  n_{·c} = Σ_{r=1}^{R} n_rc,  n = Σ_{r=1}^{R} Σ_{c=1}^{C} n_rc,  E_rc = n_{r·} n_{·c} / n.    (6.2)
Consider testing that the rows and the columns are independent, i.e.,

H_0: ∀ r, c, p_rc = p_r q_c  versus  H_1: ∃ r, c, p_rc ≠ p_r q_c;    (6.3)

a standard test statistic is

X² = Σ_{r=1}^{R} Σ_{c=1}^{C} (n_rc − E_rc)² / E_rc.    (6.4)

Under H_0, X² converges in distribution to χ²_{(R−1)(C−1)}. Moreover, defining 0/0 = 0, we have the well-known result:
Lemma 6.1. As n tends to infinity, X²/n tends in probability to

τ² ≡ Σ_{r=1}^{R} Σ_{c=1}^{C} (p_rc − p_r q_c)² / (p_r q_c).

Moreover, if the p_r q_c are bounded away from 0 and 1 as n → ∞, and if R and C are fixed as n → ∞, then τ² = 0 if and only if H_0 is true.

Remark 6.1. Lemma 6.1 shows that the contingency efficacy ζ in (2.8) is well defined. Moreover, ζ = 0 if and only if X and Y are independent.
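The quantities in (6.4) and Lemma 6.1 are straightforward to compute from a contingency table; here is a minimal sketch (the example table is made up):

```python
import numpy as np

def chi_square_efficacy(table):
    # X^2 from (6.4) and the root-chi-square quantity sqrt(X^2 / n),
    # whose probability limit is tau in Lemma 6.1.
    table = np.asarray(table, dtype=float)
    n = table.sum()
    E = np.outer(table.sum(axis=1), table.sum(axis=0)) / n  # E_rc = n_r. n_.c / n
    X2 = ((table - E) ** 2 / E).sum()
    return X2, np.sqrt(X2 / n)

# Strong row/column association gives a root chi-square well above 0.
X2, zeta = chi_square_efficacy([[40, 10], [10, 40]])
# → X2 = 36.0, zeta = 0.6
```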
Theorem 6.1. Consider a categorical response Y and a continuous predictor X with density function f(x) such that, for a fixed h > 0, f(x) > 0 for x ∈ (x^(0) − h, x^(0) + h). Then:

(1) For fixed h, the estimated local section efficacy ζ̂(x^(0), h) defined in (2.5) converges almost surely to the section efficacy ζ(x^(0), h) defined in (2.4).

(2) If X and Y are independent, then the local efficacy ζ(x^(0)) defined in (2.4) is zero.

(3) If there exists c ∈ {1, ..., C} such that p_c(x) = P(Y = c | x) has a continuous derivative at x = x^(0) with p′_c(x^(0)) ≠ 0, then ζ(x^(0)) > 0.
The χ² statistic (6.4) can be written in terms of the multinomial proportions p̂_rc, with √n(p̂_rc − p_rc), 1 ≤ r ≤ R, 1 ≤ c ≤ C, converging in distribution to a multivariate normal. By a Taylor expansion of T_n ≡ √(X²/n) at τ = plim √(X²/n), we find that

Theorem 6.2.

√n (T_n − τ) →_d N(0, v²),    (6.5)

where the asymptotic variance v², which can be obtained from the Taylor expansion, is not needed in this paper.
Remark 6.2. The result (6.5) applies to the multivariate χ²-based efficacies ζ̂, with n replaced by the number of observations that go into the computation of ζ̂. In this case we use the notation χ²_j for the chi-square statistic for the jth variable.
7 Consistency of CATCH
Let Y denote a variable to be classified into one of the categories 1, ..., C by using the X_j's in the vector X = (X_1, ..., X_d). Let X_{−j} denote X without the X_j component. Assume that the squared Pearson nonparametric correlation (Doksum and Samarov [4], Huang and Chen [9]) between X_j and X_{−j} is less than 0.5. We call the variable X_j relevant for Y if, conditionally given X_{−j}, the law L(Y | X_j = x) is not constant in x, and irrelevant otherwise. Our data (X^(1), Y^(1)), ..., (X^(n), Y^(n)) are i.i.d. as (X, Y). A variable selection procedure is consistent if the probability that all the relevant variables are selected and none of the irrelevant variables are selected tends to one as n tends to infinity.
We give a general nonparametric framework in which the variable selection rule based on the CATCH importance scores s_j given in (3.8) is consistent.

When the predictors are not all categorical and CATCH involves tubes, we assume that the number of observations in each tube tends to ∞ as n → ∞, and that, for those importance scores that require the selection of a section size h, the selection is from a finite set {h_j} with min_j{h_j} > h_0 for some h_0 > 0. It follows that the number of observations m_jk in the kth tube used to calculate the importance score s_j for the variable X_j tends to infinity as n → ∞.
The key quantities that determine consistency are the limits λ_j of s_j. That is,

λ_j = τ(ζ_j, P) = E_P[ζ_j(X)] = ∫ ζ_j(x) dP(x),  j = 1, ..., d,

where ζ_j(x) is the local contingency efficacy defined by (2.4) for numerical X's and by (2.8) for categorical X's, and P is the probability distribution of X. Our estimate s_j of λ_j is

s_j = τ(ζ̂_j, P̂) = (1/M) Σ_{k=1}^{M} ζ̂_j(X*_k),

where ζ̂_j is defined in Section 3 and P̂ is the empirical distribution of X.

Let d_0 denote the number of X's that are irrelevant for Y. Set d_1 = d − d_0 and, without loss of generality, reorder the λ_j so that
λ_j > 0 if j = 1, ..., d_1;  λ_j = 0 if j = d_1 + 1, ..., d.    (7.1)
Our variable selection rule is:

"keep variable X_j if s_j ≥ t" for some threshold t.    (7.2)

Rule (7.2) is consistent if and only if

P(min_{j ≤ d_1} s_j ≥ t and max_{j > d_1} s_j < t) → 1.    (7.3)

To establish (7.3), it is enough to show that

P(max_{j > d_1} s_j > t) → 0    (7.4)

and

P(min_{j ≤ d_1} s_j ≤ t) → 0.    (7.5)

We call (7.4) and (7.5) type I and type II consistency, respectively. Type I and II error probabilities refer to the probabilities of any false positives and any false negatives, respectively.
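Rule (7.2) is a simple thresholding step; as a minimal sketch (the scores and threshold here are made-up numbers):

```python
import numpy as np

# Keep X_j when its importance score s_j reaches the threshold t.
s = np.array([0.61, 0.55, 0.48, 0.07, 0.05, 0.09])  # hypothetical scores
t = 0.30
keep = np.flatnonzero(s >= t)  # indices of the selected variables
# → array([0, 1, 2])
```

A false positive is an index ≥ d_1 landing in `keep`; a false negative is an index < d_1 missing from it, matching the type I/II terminology above.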
7.1 Type I consistency

We consider (7.4) first and note that

P(max_{j > d_1} s_j > t) ≤ Σ_{j = d_1 + 1}^{d} P(s_j > t).    (7.6)

For the jth irrelevant variable X_j, j > d_1, the results in Section 6 apply to the local efficacy ζ̂_j(x) at a fixed point x. The importance score s_j defined in (3.8) is the average of such efficacies over a sample of M random points x*_a. We assume that there is a point x^(0) such that, for irrelevant X_j's and for n large enough, s_j^(0) ≡ ζ̂_j(x^(0)) is stochastically larger in the right tail than the average M^{−1} Σ_{a=1}^{M} ζ̂_j(x*_a). That is, we assume that there exist t > 0 and n_0 such that for n ≥ n_0,

P(s_j > t) ≤ P(s_j^(0) > t),  j > d_1.    (7.7)

The existence of such a point x^(0) is established by considering the probability limit of

arg max{ζ̂_j(x*_a) : x*_a ∈ {x*_1, ..., x*_M}}.

Note that by Lemma 6.1, s_j^(0) →_p 0 as n → ∞. It follows that we have established
24
(74) for t and d0 = d minus d1 that are fixed as n rarr infin To examine the interesting casewhere d0 rarr infin as n rarr infin we need large deviation results which we turn to next Inthe d0 rarrinfin case we consider t equiv tn that tend to infin as nrarrinfin and we use the weakerassumption there exists x(0) and n0 such that
P (sj gt tn) le P (s(0)j gt tn) (78)
for n ge n0 and each tn with limnrarrinfin tn = infin We also assume that the number ofcolumns C and rows R in the contingency table that define the efficacy χ2
jn given by
(64) stays fixed as nrarrinfin In this case P (s(0)j gt tn)rarr 0 provided each term
[T (j)rc ]2 equiv
[nrcErcradicnradicErc
]2in s2j = χ2
jn defined for Xj by (64) satisfies
P(|T (j)rc | gt tn
)minusrarr 0
This is because
P
radicsumr
sumc
[T(j)rc ]2 gt tn
le P(RC max
rc[T (j)rc ]2 gt t2n
)le RC maxP
(|T (j)rc | gt tn
radicRC)
and tn and tnradicRC are of the same order as n rarr infin By large deviation theory Hall
[8] and Jing et al [10] for some constant A
P (|T (j)rc | gt tn) =
2tminus1nradic2πeminust
2n2[1 + At3nn
minus 12 + o
(t3nnminus 1
2
)](79)
provided tn = o(n
16
) If (79) is uniform in j gt d1 then (76) and (79) implies that
type I consistency holds if
d0
(tnradic
2
)minus1eminus
t2n2 rarr 0 (710)
To examine (710) write d0 and tn as d0 = exp(nb)
and tn =radic
2nr By large deviationtheory (710) holds when r lt 1
6 Thus if 0 lt 1
2b lt r lt 1
6 then
d0
(tnradic
2
)minus1eminus
t2n2 = nminusr exp
(nb minus n2r
)rarr 0
25
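A quick numeric illustration of this rate, with hypothetical constants b = 0.2 and r = 0.15 chosen only to satisfy 0 < b/2 < r < 1/6:

```python
import math

def type1_bound(n, b=0.2, r=0.15):
    # The type I bound in (7.10) with d0 = exp(n^b), tn = sqrt(2) n^r:
    # d0 (tn / sqrt(2))^(-1) exp(-tn^2 / 2) = n^(-r) exp(n^b - n^(2r)).
    return n ** (-r) * math.exp(n ** b - n ** (2 * r))

vals = [type1_bound(10 ** k) for k in range(1, 7)]
# The bound decreases monotonically toward 0 as n grows.
```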
Thus d_0 = exp(n^b) with b = 1/3 − ε, for any ε > 0, is essentially the largest d_0 can be to have consistency. For instance, if there are exactly d_0 = exp(n^{1/4}) irrelevant variables and we use the threshold t_n = √2 n^{1/7}, the probability of selecting any of the irrelevant variables tends to zero at the rate n^{−1/7} exp(n^{1/4} − n^{2/7}).

7.2 Type II consistency
We now assume that for each relevant variable X_j there is a least favorable point x^(1) such that the local efficacy s_j^(1) ≡ ζ̂_j(x^(1)) is stochastically smaller than the importance score s_j = M^{−1} Σ_{a=1}^{M} ζ̂_j(x*_a). That is, we assume that for each relevant variable X_j there exist n_1 and x^(1) such that for n ≥ n_1,

P(s_j^(1) ≤ t_n) ≥ P(s_j ≤ t_n),  j ≤ d_1.    (7.11)

The existence of such a least favorable x^(1) is established by considering the probability limit of

arg min{ζ̂_j(x*_a) : x*_a ∈ {x*_1, ..., x*_M}}.
We again assume that R and C are fixed, so that to show type II consistency it is enough to show that for each (r, c) and j ≤ d_1,

P(|T_rc^(j)| ≤ t_n) → 0.

This is because

P(√(Σ_r Σ_c [T_rc^(j)]²) ≤ t_n) ≤ P(min_{r,c} [T_rc^(j)]² ≤ t_n²) ≤ RC max_{r,c} P(|T_rc^(j)| ≤ t_n).
By Theorem 6.2, √n(s_j^(1) − τ_j) →_d N(0, ν²), τ_j > 0. It follows that if t_n and d_1 are bounded above as n → ∞ and τ_j > 0, then P(s_j^(1) ≤ t_n) → 0. Thus type II consistency holds in this case. To examine the more interesting case, where t_n and d_1 are not bounded above and τ_j may tend to zero as n → ∞, we need to center the T_rc^(j). Thus set

θ_n^(j) = √n (p_rc^(j) − p_r^(j) p_c^(j)) / √(p_r^(j) p_c^(j)),    (7.12)

T_n^(j) = T_rc^(j) − θ_n^(j),

where for simplicity the dependence of T_n^(j) and θ_n^(j) on r and c is not indicated.
Lemma 7.1.

P(min_{j ≤ d_1} |T_rc^(j)| ≤ t_n) ≤ Σ_{j=1}^{d_1} P(|T_n^(j)| ≥ min_{j ≤ d_1} |θ_n^(j)| − t_n).

Proof.

P(min_{j ≤ d_1} |T_rc^(j)| ≤ t_n) = P(min_{j ≤ d_1} |T_n^(j) + θ_n^(j)| ≤ t_n)
  ≤ P(min_{j ≤ d_1} |θ_n^(j)| − max_{j ≤ d_1} |T_n^(j)| ≤ t_n)
  = P(max_{j ≤ d_1} |T_n^(j)| ≥ min_{j ≤ d_1} |θ_n^(j)| − t_n)
  ≤ Σ_{j=1}^{d_1} P(|T_n^(j)| ≥ min_{j ≤ d_1} |θ_n^(j)| − t_n).  □
Set

a_n = min_{j ≤ d_1} |θ_n^(j)| − t_n;

then by large deviation theory, if a_n = o(n^{1/6}),

P(|T_n^(j)| ≥ a_n) ∝ (a_n/√2)^{−1} exp(−a_n²/2),    (7.13)

where ∝ signifies asymptotic order as n → ∞, a_n → ∞. It follows that if (7.13) holds uniformly in j, then by Lemma 7.1,

P(min_{j ≤ d_1} |T_rc^(j)| ≤ t_n) ∝ d_1 (a_n/√2)^{−1} exp(−a_n²/2).

Thus type II consistency holds when a_n = o(n^{1/6}) and

d_1 (a_n/√2)^{−1} exp(−a_n²/2) → 0.    (7.14)

In Section 7.1, t_n is of order n^r, 0 < r < 1/6. For this t_n, type II consistency holds if a_n is of order n^r, 0 < r < 1/6, and

d_1 n^{−r} exp(−n^{2r}/2) → 0.
If a_n = O(n^k) with k ≥ 1/6, then P(|T_n^(j)| ≥ a_n) tends to zero at a rate faster than in (7.13), and consistency is ensured. For instance, if all relevant variables X_j have p_rc^(j) fixed as n → ∞, then min_{j ≤ d_1} |θ_n^(j)| is of order √n and a_n is of order n^{1/2} − t_n.
7.3 Type I and II consistency

We see from Sections 7.1 and 7.2 that consistency holds under a variety of conditions. One set of conditions is:

Theorem 7.1. Under the regularity conditions of Sections 7.1 and 7.2, if

d_0 = c_0 exp(n^b),  t_n = c_1 n^r,  min_{j ≤ d_1} θ^(j) − t_n = c_2 n^r,  d_1 = c_3 exp(n^b)

for positive constants c_0, c_1, c_2, c_3, and 0 < b/2 < r < 1/6, then CATCH is consistent.
Appendix

Proof of Theorem 6.1.

(1) Let N_h = Σ_{i=1}^{n} I(X^(i) ∈ N_h(x^(0))). Then N_h ~ Bi(n, ∫_{x^(0)−h}^{x^(0)+h} f(x) dx) and N_h →_{a.s.} ∞ as n → ∞. Write

ζ̂(x^(0), h) = n^{−1/2} √(X²(x^(0), h)) = (N_h/n)^{1/2} (N_h)^{−1/2} √(X²(x^(0), h)).

Since N_h ~ Bi(n, ∫_{x^(0)−h}^{x^(0)+h} f(x) dx), we have

N_h/n → ∫_{x^(0)−h}^{x^(0)+h} f(x) dx.    (7.15)

By Lemma 6.1 and Table 1,

(N_h)^{−1/2} √(X²(x^(0), h)) → (Σ_{r=+,−} Σ_{c=1}^{C} (p_rc − p_r q_c)² / (p_r q_c))^{1/2},    (7.16)

where

p_{+c} = Pr(X > x^(0), Y = c | X ∈ N_h(x^(0))),  p_{−c} = Pr(X ≤ x^(0), Y = c | X ∈ N_h(x^(0)))
for r = +, − and c = 1, ..., C,

p_r = Σ_{c=1}^{C} p_rc,  q_c = Σ_{r=+,−} p_rc.

In fact, p_+ = Pr(X > x^(0) | X ∈ N_h(x^(0))), p_− = Pr(X ≤ x^(0) | X ∈ N_h(x^(0))), and q_c = Pr(Y = c | X ∈ N_h(x^(0))).

By (7.15) and (7.16),

ζ̂(x^(0), h) → (∫_{x^(0)−h}^{x^(0)+h} f(x) dx)^{1/2} (Σ_{r=+,−} Σ_{c=1}^{C} (p_rc − p_r q_c)² / (p_r q_c))^{1/2} ≡ ζ(x^(0), h).
(2) Since X and Y are independent, p_rc = p_r q_c for r = +, − and c = 1, ..., C; then ζ(x^(0), h) = 0 for any h. Hence

ζ(x^(0)) = sup_{h>0} ζ(x^(0), h) = 0.

(3) Recall that p_c(x^(0)) = Pr(Y = c | x^(0)). Without loss of generality assume p′_c(x^(0)) > 0. Then there exists h such that

Pr(Y = c | x^(0) < X < x^(0) + h) > Pr(Y = c | x^(0) − h < X < x^(0)),

which is equivalent to p_{+c}/p_+ > p_{−c}/p_−. This shows that ζ(x^(0)) > 0, since by Lemma 6.1 ζ(x^(0)) = 0 would imply p_{+c}/p_+ = p_{−c}/p_− = p_c.  □
References
[1] Breiman, L., 2001. Random forests. Machine Learning 45, 5-32.

[2] Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J., 1984. Classification and Regression Trees. Wadsworth, Belmont.

[3] Doksum, K., Tang, S., Tsui, K.-W., 2008. Nonparametric variable selection: The EARTH algorithm. Journal of the American Statistical Association 103(484), 1609-1620.

[4] Doksum, K.A., Samarov, A., 1995. Nonparametric estimation of global functionals and a measure of the explanatory power of covariates in regression. Annals of Statistics 23, 1443-1473.

[5] Doksum, K.A., Schafer, C., 2006. Powerful choices: Tuning parameter selection based on power. In: Frontiers in Statistics: Dedicated to Peter Bickel (J. Fan and H.L. Koul, eds.). Imperial College Press, 113-141.

[6] Friedman, J., 1991. Multivariate adaptive regression splines. The Annals of Statistics 19, 1-67.

[7] Gao, J., Gijbels, I., 2008. Bandwidth selection in nonparametric kernel testing. Journal of the American Statistical Association 103(484), 1584-1594.

[8] Hall, P., 1992. The Bootstrap and Edgeworth Expansion. Springer, New York.

[9] Huang, L.-S., Chen, J., 2008. Analysis of variance, coefficient of determination and F-test for local polynomial regression. Annals of Statistics 36, 2085-2109.

[10] Jing, B.-Y., Shao, Q.-M., Wang, Q., 2003. Self-normalized Cramér-type large deviations for independent random variables. Annals of Probability 31, 2167-2215.

[11] Lin, D.Y., Zeng, D., 2011. Correcting for population stratification in genomewide association studies. Journal of the American Statistical Association 106, 997-1008.

[12] Loh, W.-Y., 2009. Improving the precision of classification trees. The Annals of Applied Statistics 3, 1710-1737.

[13] Price, A., Patterson, N., Plenge, R., Weinblatt, M., Shadick, N., Reich, D., 2006. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38(8), 904-909.

[14] Qiao, Z., Zhou, L., Huang, J., 2008. Linear discriminant analysis for high-dimensional, low-sample-size data. Special issue of World Congress on Engineering 2008.

[15] Quinlan, J.R., 1987. Simplifying decision trees. International Journal of Man-Machine Studies 27(3), 221-234.

[16] Schafer, C., Doksum, K.A., 2009. Selecting local models in multiple regression by maximizing power. Metrika 69, 283-304.

[17] Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267-288.

[18] Wang, L., Shen, X., 2007. On L1-norm multiclass support vector machines: methodology and theory. Journal of the American Statistical Association 102, 583-594.

[19] Zhang, H.H., Liu, Y., Wu, Y., Zhu, J., 2008. Variable selection for the multicategory SVM via adaptive sup-norm regularization. Electronic Journal of Statistics 2, 149-167.
observation with the comparison to Marginal CATCH in Table 2 On the other handthe tube size should not be too small to achieve sufficient detection power We suggestto use a = 01 or 015 for sample sizes n ge 400 or m = 50 for smaller sample sizesIn light of these guidelines we use M = n2 = 200 and a = 015 for Examples 41-43M = n = 200 and m = 50 for Example 44
Table 3 Sensitivity study of CATCH for Example 41 (n = 400 d = 50) using different number oftube centers M and tube size m = [na] For 1 le i le 5 the ith column gives the number of times Xi
was selected Column 6 gives the percentage of times irrelevant variables were selected
Number of Selections(100 simulations)M a X1 X2 X3 X4 X5 Xi i ge 6
M = 200 010 75 67 77 69 92 78015 76 76 87 84 96 82020 85 84 90 87 93 99030 89 87 95 89 85 96050 89 90 96 93 62 99
M = 400 010 74 76 82 77 92 72015 85 82 89 87 94 86020 87 88 90 86 96 91030 88 89 91 91 92 94050 89 91 94 93 61 102
In Example 41 all predictors are independent In a similar setting we next considerthe case where the predictors are dependent
Example 42 We generate standard normal random variables Zi for i = 1 middot middot middot d insuch a way that cor(Z1 Z4) = 05 cor(Z3 Z6) = 09and the other Zirsquos are independentWe set Xi = Φ(Zi) for i = 1 middot middot middot d where Φ(middot) is the CDF of a standard normal andtherefore Xi are uniformly distributed marginally as in Example 41 The correlationstructure between the Xirsquos are similar to that of the Zirsquos The dependent variable Y isgenerated in the same way as in Example 41
The results of Example 42 are given in Table 4 By comparing with results in Table2 both CATCH and GUIDE are doing well in the presence of dependence between thenumerical variables The explanation is two-fold First both methods could identify thecategorical X5 as an important predictor most of the time which make the identificationof other relevant predictors possible Second conditioning on X5 Y depends on a pairof variables (ldquoX1 and X2rdquo or ldquoX3 and X4rdquo) jointly and in particular this dependence isnonlinear which makes both methods less affected by the correlation introduced here
15
00 02 04 06 08 10
00
02
04
06
08
10
(a) X5 = 0
X1
X2
00 02 04 06 08 10
00
02
04
06
08
10
(b) X5 = 1
X3
X4
Figure 2 Sample scatter plots of (x1 x2) and (x3 x4) for simulated data using model (41) The leftpanel includes all the pairs (x1 x2) with x5 = 0 while the right panel includes all the pairs (x3 x4)with x5 = 1 The larger dots in the two plots are within the tube around a tube center shown as thestar
We now explain the latter in details for both methods In CATCH the algorithmadaptively weighs the effects of predictors by their importance scores in defining thetubes when they are being conditioned on therefore identifying either of the two willhelp to identify the other whereas an irrelevant variable though strongly associated withone of the relevant variables is down-weighted iteratively and does not strongly affectthe selection of the relevant variable This can explain the slight decreasing selectionprobability in X3 in presence of a strongly correlated X6 and consequently overall lessselection of X3 and X4 compared to X1 and X2 given the correlation between X1 andX4 In GUIDE the difference between the dependent and independent cases is evenless because GUIDE is very powerful in detecting local interaction by literally includingpairwise interactions when selecting splitting variables
Marginal CATCH and Random Forest do poorly in this case mainly because X5 cannot be detected as we explained earlier More specifically the dependence between Yand X4 in X5 = 1 case is the same as that between Y and X2 in X5 = 0 The linearterms of X1 and X2 have opposite signs in decision function therefore marginal effectsof X1 and X4 are obscured when X5 is not identified as an important variable
4.2 Improving the Performance of SVM and RF Classifiers
The criterion used in the CATCH algorithm is not based on classification accuracy. But if we use CATCH to screen out the irrelevant variables and only use the important
Table 4: Simulation results from Example 4.2 with sample size n = 400. For 1 ≤ i ≤ 6, the ith column gives the estimated probability of selection of variable Xi and the standard error, based on 100 simulations, where Xi, i = 1, ..., 5, are relevant variables with cor(X1, X4) = 0.5; X6 is an irrelevant variable that is correlated with X3, with correlation 0.9. Column 7 gives the mean of the estimated probability of selection and that of the standard error for the irrelevant variables X7, ..., Xd.

Number of Selections (100 simulations)
METHOD            d    X1           X2           X3           X4           X5           X6           Xi, i ≥ 7
CATCH             10   0.76 (0.04)  0.77 (0.04)  0.73 (0.04)  0.68 (0.05)  0.99 (0.01)  0.34 (0.05)  0.07 (0.02)
                  50   0.78 (0.04)  0.77 (0.04)  0.74 (0.04)  0.63 (0.05)  0.92 (0.03)  0.57 (0.05)  0.09 (0.03)
                  100  0.76 (0.04)  0.77 (0.04)  0.80 (0.04)  0.74 (0.04)  0.82 (0.04)  0.55 (0.05)  0.09 (0.03)
Marginal CATCH    10   0.35 (0.05)  0.63 (0.05)  0.80 (0.04)  0.32 (0.05)  0.09 (0.03)  0.71 (0.05)  0.12 (0.03)
                  50   0.34 (0.05)  0.71 (0.05)  0.70 (0.05)  0.30 (0.05)  0.10 (0.03)  0.60 (0.05)  0.11 (0.03)
                  100  0.34 (0.05)  0.68 (0.05)  0.75 (0.04)  0.30 (0.05)  0.10 (0.03)  0.59 (0.05)  0.10 (0.03)
Random Forest     10   0.29 (0.05)  0.48 (0.05)  0.55 (0.05)  0.20 (0.04)  0.30 (0.05)  0.47 (0.05)  0.05 (0.02)
                  50   0.26 (0.04)  0.52 (0.05)  0.57 (0.05)  0.30 (0.05)  0.24 (0.04)  0.45 (0.05)  0.08 (0.03)
                  100  0.36 (0.05)  0.61 (0.05)  0.66 (0.05)  0.37 (0.05)  0.32 (0.05)  0.46 (0.05)  0.12 (0.03)
GUIDE             10   0.96 (0.02)  0.96 (0.02)  0.94 (0.02)  0.95 (0.02)  0.96 (0.02)  0.84 (0.04)  0.30 (0.05)
                  50   0.80 (0.04)  0.92 (0.03)  0.88 (0.03)  0.79 (0.04)  0.72 (0.04)  0.68 (0.05)  0.12 (0.03)
                  100  0.81 (0.04)  0.89 (0.03)  0.90 (0.03)  0.76 (0.04)  0.75 (0.04)  0.58 (0.05)  0.08 (0.03)
variables for classification, the performance of other algorithms, such as the Support Vector Machine (SVM) and Random Forest Classifier (RFC), can be significantly improved. In this section we compare the prediction performance of (1) SVM, (2) RFC, (3) CATCH followed by SVM, and (4) CATCH followed by RFC; we label (3) and (4) CATCH+SVM and CATCH+RFC, respectively. We use the "svm" function in the "e1071" package in R, with a radial kernel, to perform the SVM analysis, and the "randomForest" function in the "randomForest" package to perform the RFC analysis. We use simulation experiments to illustrate the improvement obtained by using CATCH prior to classifying Y.
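The screen-then-classify idea above can be sketched as follows, in Python rather than R. The marginal two-bin chi-square score and the toy data below are simplified stand-ins for illustration only: this is not the adaptive CATCH score, and the final classifier (SVM or RFC in the paper) is omitted.

```python
import numpy as np

def two_bin_score(x, y):
    """Stand-in importance score: root chi-square of (x split at its median)
    versus y, scaled by 1/sqrt(n). A crude marginal analogue of an efficacy
    score -- not the adaptive CATCH score."""
    r = (x > np.median(x)).astype(int)
    n = len(y)
    chi2 = 0.0
    for rv in (0, 1):
        for c in np.unique(y):
            obs = np.sum((r == rv) & (y == c))
            exp = np.sum(r == rv) * np.sum(y == c) / n
            chi2 += (obs - exp) ** 2 / exp
    return np.sqrt(chi2 / n)

def screen(X, y, t):
    """Keep variable j if its score s_j >= threshold t."""
    s = np.array([two_bin_score(X[:, j], y) for j in range(X.shape[1])])
    return np.flatnonzero(s >= t)

rng = np.random.default_rng(0)
n, d = 400, 20
X = rng.uniform(size=(n, d))
y = (X[:, 0] + X[:, 1] > 1).astype(int)   # only the first two variables matter
kept = screen(X, y, t=0.15)
# an SVM or RFC would then be trained on X[:, kept] only
```

The point of the two-stage design is that the classifier never sees the noise variables, which is what stabilizes its error rate as d grows.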
Let the true model be
$$(X, Y) \sim P, \qquad (4.2)$$
where $P$ will be specified later. In each of 100 Monte Carlo experiments, $n$ pairs $(X^{(i)}, Y^{(i)})$, $i = 1, \ldots, n$, are generated from the true model (4.2). Based on the simulated data $(X^{(i)}, Y^{(i)})$, $i = 1, \ldots, n$, the methods (1) SVM, (2) RFC, (3) CATCH+SVM, and (4) CATCH+RFC are used to classify a future $Y^{(0)}$ for which we have available a corresponding $X^{(0)}$.
The integrated misclassification rate (IMR; Friedman [6]), defined below, is used to evaluate the performance of the classification algorithms. Let $F_{\mathbf{X},\mathbf{Y}}(\cdot)$ be the classifier constructed by a classification algorithm based on $\mathbf{X} = (X^{(1)}, \ldots, X^{(n)})$ and $\mathbf{Y} = (Y^{(1)}, \ldots, Y^{(n)})^T$. Let $(X^{(0)}, Y^{(0)})$ be a future observation with the same distribution as $(X^{(i)}, Y^{(i)})$. We only know $X^{(0)} = x^{(0)}$ and want to classify $Y^{(0)}$. For a possible future observation $(x^{(0)}, y^{(0)})$, define
$$\mathrm{MR}(\mathbf{X}, \mathbf{Y}, x^{(0)}, y^{(0)}) = P\big(F_{\mathbf{X},\mathbf{Y}}(x^{(0)}) \ne y^{(0)} \,\big|\, \mathbf{X}, \mathbf{Y}\big). \qquad (4.3)$$
Our criterion is
$$\mathrm{IMR} = E E_0\big[\mathrm{MR}(\mathbf{X}, \mathbf{Y}, x^{(0)}, y^{(0)})\big], \qquad (4.4)$$
where $E_0$ is with respect to the distribution of the random $(x^{(0)}, y^{(0)})$ and $E$ is with respect to the distribution of $(\mathbf{X}, \mathbf{Y})$.
This double expectation is evaluated using Monte Carlo simulation; the result is denoted IMR*. More precisely, $E_0$ is estimated using 5000 simulated $(x^{(0)}, y^{(0)})$ observations; then IMR* is computed on the basis of 100 simulated samples $(\mathbf{X}_i, \mathbf{Y}_i)$, $i = 1, \ldots, 100$.
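The nested Monte Carlo structure of the IMR* computation can be sketched as follows. The data-generating process `draw` and the midpoint-threshold classifier `fit` are hypothetical placeholders, not the models or classifiers of this section:

```python
import numpy as np

rng = np.random.default_rng(1)

def draw(n):
    """Hypothetical true model P: X uniform, Y = 1{X > 0.5} with 10% label noise."""
    x = rng.uniform(size=n)
    y = ((x > 0.5) ^ (rng.uniform(size=n) < 0.1)).astype(int)
    return x, y

def fit(x, y):
    """Hypothetical classifier: threshold at the midpoint of the class means."""
    cut = 0.5 * (x[y == 0].mean() + x[y == 1].mean())
    return lambda x0: (x0 > cut).astype(int)

mrs = []
for _ in range(100):                  # outer loop: E over training samples (X, Y)
    f = fit(*draw(400))
    x0, y0 = draw(5000)               # inner loop: E0 over future observations
    mrs.append(np.mean(f(x0) != y0))  # MR of (4.3) for this training sample
imr_star = float(np.mean(mrs))        # IMR* estimate of (4.4)
```

With this toy model the estimate settles near the 10% label-noise floor, the best achievable error rate here.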
Example 4.3. First consider model (4.1) in Section 4.1. Figure 3 shows IMR* versus the number of irrelevant variables for this example. The left panel shows that the IMR*'s of the SVM approach increase dramatically as the number of irrelevant variables increases, while the IMR*'s of CATCH+SVM are stable. Thus CATCH stabilizes the
performance of SVM for this complex model. The right panel shows that CATCH also stabilizes the performance of RFC. These results indicate that CATCH can improve the prediction accuracy of other classification algorithms.
Figure 3: IMR*'s of SVM, RFC, CATCH+SVM, and CATCH+RFC versus the number of predictors for the simulation study of model (4.1) in Example 4.1. The sample size is n = 400.
Example 4.4 (Contaminated Logistic Model). Consider a binary response $Y \in \{0, 1\}$ with
$$P(Y = 0 \mid X = x) = \begin{cases} F(x) & \text{if } r = 0 \\ G(x) & \text{if } r = 1 \end{cases} \qquad (4.5)$$
where $x = (x_1, \ldots, x_d)$,
$$r \sim \mathrm{Bernoulli}(\gamma), \qquad (4.6)$$
$$F(x) = \big(1 + \exp\{-(0.25 + 0.5 x_1 + x_2)\}\big)^{-1}, \qquad (4.7)$$
$$G(x) = \begin{cases} 0.1 & \text{if } 0.25 + 0.5 x_1 + x_2 < 0 \\ 0.9 & \text{if } 0.25 + 0.5 x_1 + x_2 \ge 0 \end{cases} \qquad (4.8)$$
and $x_1, x_2, \ldots, x_d$ are realizations of independent standard normal random variables. Hence, given $X^{(i)} = x^{(i)}$, $1 \le i \le n$, the joint distribution of $Y^{(1)}, Y^{(2)}, \ldots, Y^{(n)}$ is the contaminated logistic model with contamination parameter $\gamma$:
$$(1 - \gamma)\prod_{i=1}^{n}\big[F(x^{(i)})\big]^{Y^{(i)}}\big[1 - F(x^{(i)})\big]^{1 - Y^{(i)}} + \gamma\prod_{i=1}^{n}\big[G(x^{(i)})\big]^{Y^{(i)}}\big[1 - G(x^{(i)})\big]^{1 - Y^{(i)}}. \qquad (4.9)$$
We perform a simulation study for n = 200, d = 50, and γ = 0, 0.2, 0.4, 0.6, 0.8, and 1, where γ = 0 corresponds to no contamination. Figure 4 shows the IMR*'s versus the different values of γ for this example. CATCH improves the performance of the SVM and Random Forest classifiers for this contaminated logistic regression model.
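A data generator for the model (4.5)-(4.8) can be sketched as follows. One point of interpretation, flagged as an assumption: since the likelihood (4.9) is a mixture of two product likelihoods, we read the contamination indicator r as drawn once for the whole sample rather than per observation.

```python
import numpy as np

def draw_contaminated(n, d, gamma, rng):
    """One sample from the contaminated logistic model (4.5)-(4.8).
    Assumption: the indicator r of (4.6) applies to the whole sample,
    matching the two-component mixture form of the likelihood (4.9)."""
    X = rng.standard_normal((n, d))
    eta = 0.25 + 0.5 * X[:, 0] + X[:, 1]
    if rng.uniform() < gamma:               # r = 1: contaminated component
        p0 = np.where(eta < 0, 0.1, 0.9)    # G(x), eq. (4.8)
    else:                                   # r = 0: logistic component
        p0 = 1.0 / (1.0 + np.exp(-eta))     # F(x), eq. (4.7)
    y = (rng.uniform(size=n) >= p0).astype(int)   # P(Y = 0 | x) = p0
    return X, y

rng = np.random.default_rng(2)
X, y = draw_contaminated(n=200, d=50, gamma=0.4, rng=rng)
```

Only the first two coordinates of x enter the decision function; the remaining d − 2 predictors are pure noise, which is what the CATCH screen is meant to remove.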
Figure 4: IMR*'s of SVM, RFC, CATCH+SVM, and CATCH+RFC versus different contamination parameters γ for simulation model (4.9) in Example 4.4.
5 CATCH Analysis of the ANN Data
In this section CATCH is applied to a data set available in the UCI data base (http://ftp.ics.uci.edu/pub/machine-learning-databases/thyroid-disease). The data are from a clinical trial to determine whether a patient is hypothyroid (see, e.g., Quinlan [15]). The dataset is split into two parts: a training set with 3772 observations and a testing set with 3428 observations. The response is a categorical variable with three classes: normal (not hypothyroid), hyperfunction, and subnormal functioning. There are 21 predictors, including 15 categorical predictors and 6 numerical predictors. The three classes are very imbalanced, with 92.5% normal patients; a good classifier should therefore produce a significant decrease from the 7.5% error rate that can be obtained by classifying all observations to the normal class.
CATCH identifies 7 variables as important, including 4 continuous variables and 3 categorical variables. We consider three classification methods: (1) Support Vector Machine with radial kernel, denoted SVM(r); (2) Support Vector Machine with polynomial kernel, denoted SVM(p); (3) RFC. We apply these three methods to the training set to build classification models and use the test set to estimate the misclassification rate of each method. We apply the methods with CATCH and without CATCH, respectively, where "with CATCH" means that only the 7 important variables identified by CATCH are used to build the model, and "without CATCH" means that all 21 variables are used in the analysis.
Table 5: Misclassification rates of three classification methods (SVM(r), SVM(p), RFC) with CATCH and without CATCH.

              SVM(r)   SVM(p)   RFC
w/o CATCH     0.052    0.063    0.024
with CATCH    0.037    0.055    0.015
Table 5 shows that if CATCH is used to select important variables before the classification methods are applied, the misclassification rates decrease. CATCH reduces the misclassification rates by 29%, 13%, and 37% for SVM(r), SVM(p), and RFC, respectively; CATCH+RFC has the best performance. Although CATCH is not designed to improve classification accuracy, it can nonetheless help other classification methods perform better.
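The quoted relative reductions can be checked directly from Table 5 (the RFC reduction is exactly 37.5%, quoted as 37% in the text):

```python
# Relative reduction in misclassification rate from Table 5:
# (rate without CATCH - rate with CATCH) / rate without CATCH, in percent.
rates = {"SVM(r)": (0.052, 0.037), "SVM(p)": (0.063, 0.055), "RFC": (0.024, 0.015)}
reduction = {k: 100 * (wo - w) / wo for k, (wo, w) in rates.items()}
# SVM(r) ~ 28.8%, SVM(p) ~ 12.7%, RFC = 37.5%
```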
6 Properties of Local Contingency Efficacy
In this section we investigate the properties of the local contingency efficacy. For a univariate continuous predictor, Theorem 6.1 below shows that the local contingency efficacy is well defined and equals 0 if the response variable is independent of the predictor. Consider an $R \times C$ contingency table. Let $n_{rc}$, $r = 1, \ldots, R$, $c = 1, \ldots, C$, be the observed frequency in the $(r, c)$-cell, and let $p_{rc}$ be the probability that an observation belongs to the $(r, c)$-cell. Let
$$p_r = \sum_{c=1}^{C} p_{rc}, \qquad q_c = \sum_{r=1}^{R} p_{rc}, \qquad (6.1)$$
and
$$n_{r\cdot} = \sum_{c=1}^{C} n_{rc}, \qquad n_{\cdot c} = \sum_{r=1}^{R} n_{rc}, \qquad n = \sum_{r=1}^{R}\sum_{c=1}^{C} n_{rc}, \qquad E_{rc} = n_{r\cdot}\, n_{\cdot c}\, /\, n. \qquad (6.2)$$
To test that the rows and the columns are independent, i.e.,
$$H_0: \forall r, c,\; p_{rc} = p_r q_c \quad \text{versus} \quad H_1: \exists r, c,\; p_{rc} \ne p_r q_c, \qquad (6.3)$$
a standard test statistic is
$$\mathcal{X}^2 = \sum_{r=1}^{R}\sum_{c=1}^{C} \frac{(n_{rc} - E_{rc})^2}{E_{rc}}. \qquad (6.4)$$
Under $H_0$, $\mathcal{X}^2$ converges in distribution to $\chi^2_{(R-1)(C-1)}$. Moreover, defining $0/0 = 0$, we have the well-known result:
Lemma 6.1. As $n$ tends to infinity, $\mathcal{X}^2/n$ tends in probability to
$$\tau^2 \equiv \sum_{r=1}^{R}\sum_{c=1}^{C} \frac{(p_{rc} - p_r q_c)^2}{p_r q_c}.$$
Moreover, if $p_r q_c$ is bounded away from 0 and 1 as $n \to \infty$, and if $R$ and $C$ are fixed as $n \to \infty$, then $\tau^2 = 0$ if and only if $H_0$ is true.
Remark 6.1. Lemma 6.1 shows that the contingency efficacy $\zeta$ in (2.8) is well defined. Moreover, $\zeta = 0$ if and only if $X$ and $Y$ are independent.
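A minimal computation of the statistic (6.4), illustrating Lemma 6.1: $\mathcal{X}^2/n$ estimates $\tau^2$, which is 0 exactly when rows and columns are independent.

```python
import numpy as np

def chi_square_stat(table):
    """The statistic X^2 of (6.4) for an R x C table of counts n_rc."""
    t = np.asarray(table, dtype=float)
    n = t.sum()
    E = np.outer(t.sum(axis=1), t.sum(axis=0)) / n   # E_rc = n_r. * n_.c / n
    return float(((t - E) ** 2 / E).sum())

# Proportional rows (independence): X^2 = 0, consistent with tau^2 = 0.
x2_indep = chi_square_stat([[10, 20], [20, 40]])
# Dependent table: X^2 > 0, and X^2 / n estimates tau^2 of Lemma 6.1.
x2_dep = chi_square_stat([[20, 10], [10, 20]])       # equals 100/15
tau2_hat = x2_dep / 60                               # n = 60 for this table
```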
Theorem 6.1. Consider a categorical response $Y$ and a continuous predictor $X$ with density function $f(x)$ such that, for a fixed $h > 0$, $f(x) > 0$ for $x \in (x^{(0)} - h,\, x^{(0)} + h)$. Then:

(1) For fixed $h$, the estimated local section efficacy $\hat{\zeta}(x^{(0)}, h)$ defined in (2.5) converges almost surely to the section efficacy $\zeta(x^{(0)}, h)$ defined in (2.4).

(2) If $X$ and $Y$ are independent, then the local efficacy $\zeta(x^{(0)})$ defined in (2.4) is zero.

(3) If there exists $c \in \{1, \ldots, C\}$ such that $p_c(x) = P(Y = c \mid x)$ has a continuous derivative at $x = x^{(0)}$ with $p'_c(x^{(0)}) \ne 0$, then $\zeta(x^{(0)}) > 0$.
The $\chi^2$ statistic (6.4) can be written in terms of multinomial proportions $\hat{p}_{rc}$, with $\sqrt{n}(\hat{p}_{rc} - p_{rc})$, $1 \le r \le R$, $1 \le c \le C$, converging in distribution to a multivariate normal. By a Taylor expansion of $T_n \equiv \sqrt{\mathcal{X}^2/n}$ at $\tau = \operatorname{plim} \sqrt{\mathcal{X}^2/n}$, we find that
Theorem 6.2.
$$\sqrt{n}(T_n - \tau) \xrightarrow{d} N(0, v^2), \qquad (6.5)$$
where the asymptotic variance $v^2$, which can be obtained from the Taylor expansion, is not needed in this paper.
Remark 6.2. The result (6.5) applies to the multivariate $\chi^2$-based efficacies $\hat{\zeta}_j$, with $n$ replaced by the number of observations that go into the computation of $\hat{\zeta}_j$. In this case we use the notation $\chi^2_j$ for the chi-square statistic for the $j$th variable.
7 Consistency of CATCH
Let $Y$ denote a variable to be classified into one of the categories $1, \ldots, C$ by using the $X_j$'s in the vector $X = (X_1, \ldots, X_d)$. Let $X_{-j} \equiv X - X_j$ denote $X$ without the $X_j$ component. Assume that the squared Pearson nonparametric correlation (Doksum and Samarov [4], Huang and Chen [9]) between $X_j$ and $X_{-j}$ is less than 0.5. We call the variable $X_j$ relevant for $Y$ if, conditionally given $X_{-j}$, $\mathcal{L}(Y \mid X_j = x)$ is not constant in $x$, and irrelevant otherwise. Our data $(X^{(1)}, Y^{(1)}), \ldots, (X^{(n)}, Y^{(n)})$ are i.i.d. as $(X, Y)$. A variable selection procedure is consistent if the probability that all the relevant variables are selected and none of the irrelevant variables are selected tends to one as $n$ tends to infinity.
We give a general nonparametric framework in which the variable selection rule based on the CATCH importance scores $s_j$ given in (3.8) is consistent.
When the predictors are not all categorical and CATCH involves tubes, we assume that the number of observations in each tube tends to $\infty$ as $n \to \infty$, and we assume that, for those importance scores that require the selection of a section size $h$, the selection is from a finite set $\{h_j\}$ with $\min_j\{h_j\} > h_0$ for some $h_0 > 0$. It follows that the number of observations $m_{jk}$ in the $k$th tube used to calculate the importance score $s_j$ for the variable $X_j$ tends to infinity as $n \to \infty$.
The key quantities that determine consistency are the limits $\lambda_j$ of $s_j$. That is,
$$\lambda_j = \tau(\zeta_j, P) = E_P[\zeta_j(X)] = \int \zeta_j(x)\, dP(x), \qquad j = 1, \ldots, d,$$
where $\zeta_j(x)$ is the local contingency efficacy defined by (2.4) for numerical $X$'s and by (2.8) for categorical $X$'s, and $P$ is the probability distribution of $X$. Our estimate $s_j$ of $\tau_j \equiv \lambda_j$ is
$$s_j = \tau(\hat{\zeta}_j, \hat{P}) = \frac{1}{M}\sum_{k=1}^{M} \hat{\zeta}_j(X^*_k),$$
where $\hat{\zeta}_j$ is defined in Section 3 and $\hat{P}$ is the empirical distribution of $X$.

Let $d_0$ denote the number of $X$'s that are irrelevant for $Y$. Set $d_1 = d - d_0$ and, without loss of generality, reorder the $\lambda_j$ so that
$$\lambda_j > 0 \ \text{if}\ j = 1, \ldots, d_1; \qquad \lambda_j = 0 \ \text{if}\ j = d_1 + 1, \ldots, d. \qquad (7.1)$$
Our variable selection rule is:
$$\text{"keep variable } X_j \text{ if } s_j \ge t \text{" for some threshold } t. \qquad (7.2)$$
Rule (7.2) is consistent if and only if
$$P\Big(\min_{j \le d_1} s_j \ge t \ \text{and}\ \max_{j > d_1} s_j < t\Big) \longrightarrow 1. \qquad (7.3)$$
To establish (7.3) it is enough to show that
$$P\Big(\max_{j > d_1} s_j > t\Big) \longrightarrow 0 \qquad (7.4)$$
and
$$P\Big(\min_{j \le d_1} s_j \le t\Big) \longrightarrow 0. \qquad (7.5)$$
We call (7.4) and (7.5) type I and type II consistency, respectively. Type I and II error probabilities refer to the probabilities of any false positives and any false negatives, respectively.
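A stylized numerical illustration of rule (7.2) and the event in (7.3). The assumed score behavior (relevant scores concentrating near a positive limit $\lambda_j$, irrelevant scores shrinking at rate $n^{-1/2}$) is a deliberate simplification of the theory, not the actual CATCH scores:

```python
import numpy as np

rng = np.random.default_rng(3)

def consistent_event(n, d1=5, d0=45, lam=1.0, t=0.5):
    """One draw of stylized scores: a relevant score behaves like
    lam + Z/sqrt(n) with lam > 0, an irrelevant one like |Z|/sqrt(n).
    Returns whether the event in (7.3) occurs for rule (7.2)."""
    s_rel = lam + rng.standard_normal(d1) / np.sqrt(n)
    s_irr = np.abs(rng.standard_normal(d0)) / np.sqrt(n)
    return bool(s_rel.min() >= t and s_irr.max() < t)

# for large n both error types vanish, so the event rate approaches 1
rate = np.mean([consistent_event(n=10_000) for _ in range(200)])
```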
7.1 Type I consistency
We consider (7.4) first and note that
$$P\Big(\max_{j > d_1} s_j > t\Big) \le \sum_{j = d_1 + 1}^{d} P(s_j > t). \qquad (7.6)$$
For the $j$th irrelevant variable $X_j$, $j > d_1$, the results in Section 6 apply to the local efficacy $\hat{\zeta}_j(x)$ at a fixed point $x$. The importance score $s_j$ defined in (3.8) is the average of such efficacies over a sample of $M$ random points $x^*_a$. We assume that there is a point $x^{(0)}$ such that, for irrelevant $X_j$'s and for $n$ large enough, $s_j^{(0)} \equiv \hat{\zeta}_j(x^{(0)})$ is stochastically larger in the right tail than the average $M^{-1}\sum_{a=1}^{M}\hat{\zeta}_j(x^*_a)$. That is, we assume that there exist $t > 0$ and $n_0$ such that, for $n \ge n_0$,
$$P(s_j > t) \le P\big(s_j^{(0)} > t\big), \qquad j > d_1. \qquad (7.7)$$
The existence of such a point $x^{(0)}$ is established by considering the probability limit of
$$\arg\max\big\{\hat{\zeta}_j(x^*_a) : x^*_a \in \{x^*_1, \ldots, x^*_M\}\big\}.$$
Note that by Lemma 6.1, $s_j^{(0)} \to_p 0$ as $n \to \infty$. It follows that we have established
(7.4) for $t$ and $d_0 = d - d_1$ that are fixed as $n \to \infty$. To examine the interesting case where $d_0 \to \infty$ as $n \to \infty$, we need large deviation results, to which we turn next. In the $d_0 \to \infty$ case we consider $t \equiv t_n$ tending to $\infty$ as $n \to \infty$, and we use the weaker assumption: there exist $x^{(0)}$ and $n_0$ such that
$$P(s_j > t_n) \le P\big(s_j^{(0)} > t_n\big) \qquad (7.8)$$
for $n \ge n_0$ and each $t_n$ with $\lim_{n\to\infty} t_n = \infty$. We also assume that the number of columns $C$ and rows $R$ in the contingency table that defines the efficacy $\chi^2_{jn}$ given by (6.4) stays fixed as $n \to \infty$. In this case $P\big(s_j^{(0)} > t_n\big) \to 0$ provided each term
$$\big[T^{(j)}_{rc}\big]^2 \equiv \left[\frac{n_{rc} - E_{rc}}{\sqrt{n}\,\sqrt{E_{rc}}}\right]^2$$
in $s_j^2 = \chi^2_{jn}/n$, defined for $X_j$ by (6.4), satisfies
$$P\big(\big|T^{(j)}_{rc}\big| > t_n\big) \longrightarrow 0.$$
This is because
$$P\left(\sqrt{\sum_r \sum_c \big[T^{(j)}_{rc}\big]^2} > t_n\right) \le P\left(RC \max_{r,c}\big[T^{(j)}_{rc}\big]^2 > t_n^2\right) \le RC \max P\big(\big|T^{(j)}_{rc}\big| > t_n/\sqrt{RC}\big),$$
and $t_n$ and $t_n/\sqrt{RC}$ are of the same order as $n \to \infty$. By large deviation theory (Hall [8] and Jing et al. [10]), for some constant $A$,
$$P\big(\big|T^{(j)}_{rc}\big| > t_n\big) = \frac{2 t_n^{-1}}{\sqrt{2\pi}}\, e^{-t_n^2/2}\Big[1 + A t_n^3 n^{-1/2} + o\big(t_n^3 n^{-1/2}\big)\Big], \qquad (7.9)$$
provided $t_n = o(n^{1/6})$. If (7.9) is uniform in $j > d_1$, then (7.6) and (7.9) imply that type I consistency holds if
$$d_0 \left(t_n/\sqrt{2}\right)^{-1} e^{-t_n^2/2} \to 0. \qquad (7.10)$$
To examine (7.10), write $d_0$ and $t_n$ as $d_0 = \exp(n^b)$ and $t_n = \sqrt{2}\, n^r$. By large deviation theory, (7.10) holds when $r < \tfrac{1}{6}$. Thus if $0 < \tfrac{1}{2} b < r < \tfrac{1}{6}$, then
$$d_0 \left(t_n/\sqrt{2}\right)^{-1} e^{-t_n^2/2} = n^{-r} \exp\big(n^b - n^{2r}\big) \to 0.$$
Thus $d_0 = \exp(n^b)$ with $b = \tfrac{1}{3} - \epsilon$, for any $\epsilon > 0$, is the largest $d_0$ can be to have consistency. For instance, if there are exactly $d_0 = \exp(n^{1/4})$ irrelevant variables and we use the threshold $t_n = \sqrt{2}\, n^{1/7}$, the probability of selecting any of the irrelevant variables tends to zero at the rate $n^{-1/7}\exp\big(n^{1/4} - n^{2/7}\big)$.
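The type I bound from (7.10) can be evaluated numerically for the exponents used above ($b = 1/4$, $r = 1/7$):

```python
import math

def type1_bound(n, b=0.25, r=1/7):
    """The bound n^(-r) * exp(n^b - n^(2r)) derived from (7.10); it tends
    to 0 whenever 0 < b/2 < r < 1/6 (here b/2 = 1/8 < 1/7 = r < 1/6)."""
    return n ** (-r) * math.exp(n ** b - n ** (2 * r))

vals = [type1_bound(n) for n in (10, 10**2, 10**4, 10**6)]
# the bound decreases monotonically toward zero as n grows
```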
7.2 Type II consistency
We now assume that, for each relevant variable $X_j$, there is a least favorable point $x^{(1)}$ such that the local efficacy $s_j^{(1)} \equiv \hat{\zeta}_j(x^{(1)})$ is stochastically smaller than the importance score $s_j = M^{-1}\sum_{a=1}^{M}\hat{\zeta}_j(x^*_a)$. That is, we assume that for each relevant variable $X_j$ there exist $n_1$ and $x^{(1)}$ such that, for $n \ge n_1$,
$$P\big(s_j^{(1)} \le t_n\big) \ge P(s_j \le t_n), \qquad j \le d_1. \qquad (7.11)$$
The existence of such a least favorable $x^{(1)}$ is established by considering the probability limit of
$$\arg\min\big\{\hat{\zeta}_j(x^*_a) : x^*_a \in \{x^*_1, \ldots, x^*_M\}\big\}.$$
We again assume that $R$ and $C$ are fixed, so that to show type II consistency it is enough to show that, for each $(r, c)$ and $j \le d_1$,
$$P\big(\big|T^{(j)}_{rc}\big| \le t_n\big) \to 0.$$
This is because
$$P\left(\sqrt{\sum_r \sum_c \big[T^{(j)}_{rc}\big]^2} \le t_n\right) \le P\left(\min_{r,c}\big[T^{(j)}_{rc}\big]^2 \le t_n^2\right) \le RC \max_{r,c} P\big(\big|T^{(j)}_{rc}\big| \le t_n\big).$$
By Theorem 6.2, $\sqrt{n}\big(s_j^{(1)} - \tau_j\big) \to_d N(0, \nu^2)$, $\tau_j > 0$. It follows that if $t_n$ and $d_1$ are bounded above as $n \to \infty$ and $\tau_j > 0$, then $P\big(s_j^{(1)} \le t_n\big) \to 0$; thus type II consistency holds in this case. To examine the more interesting case, where $t_n$ and $d_1$ are not bounded above and $\tau_j$ may tend to zero as $n \to \infty$, we will need to center the $T^{(j)}_{rc}$. Thus set
$$\theta_n^{(j)} = \frac{\sqrt{n}\big(p^{(j)}_{rc} - p^{(j)}_r p^{(j)}_c\big)}{\sqrt{p^{(j)}_r p^{(j)}_c}}, \qquad (7.12)$$
$$T_n^{(j)} = T^{(j)}_{rc} - \theta_n^{(j)},$$
where for simplicity the dependence of $T_n^{(j)}$ and $\theta_n^{(j)}$ on $r$ and $c$ is not indicated.
Lemma 7.1.
$$P\Big(\min_{j \le d_1}\big|T^{(j)}_{rc}\big| \le t_n\Big) \le \sum_{j=1}^{d_1} P\Big(\big|T_n^{(j)}\big| \ge \min_{j \le d_1}\big|\theta_n^{(j)}\big| - t_n\Big).$$

Proof.
$$P\Big(\min_{j \le d_1}\big|T^{(j)}_{rc}\big| \le t_n\Big) = P\Big(\min_{j \le d_1}\big|T_n^{(j)} + \theta_n^{(j)}\big| \le t_n\Big) \le P\Big(\min_{j \le d_1}\big|\theta_n^{(j)}\big| - \max_{j \le d_1}\big|T_n^{(j)}\big| \le t_n\Big)$$
$$= P\Big(\max_{j \le d_1}\big|T_n^{(j)}\big| \ge \min_{j \le d_1}\big|\theta_n^{(j)}\big| - t_n\Big) \le \sum_{j=1}^{d_1} P\Big(\big|T_n^{(j)}\big| \ge \min_{j \le d_1}\big|\theta_n^{(j)}\big| - t_n\Big). \qquad \square$$
Set
$$a_n = \min_{j \le d_1}\big|\theta_n^{(j)}\big| - t_n;$$
then by large deviation theory, if $a_n = o(n^{1/6})$,
$$P\big(\big|T_n^{(j)}\big| \ge a_n\big) \propto \left(a_n/\sqrt{2}\right)^{-1} \exp\{-a_n^2/2\}, \qquad (7.13)$$
where $\propto$ signifies asymptotic order as $n \to \infty$, $a_n \to \infty$. It follows that if (7.13) holds uniformly in $j$, then by Lemma 7.1,
$$P\Big(\min_{j \le d_1}\big|T^{(j)}_{rc}\big| \le t_n\Big) \propto d_1 \left(a_n/\sqrt{2}\right)^{-1} \exp\{-a_n^2/2\}.$$
Thus type II consistency holds when $a_n = o(n^{1/6})$ and
$$d_1 \left(a_n/\sqrt{2}\right)^{-1} \exp\{-a_n^2/2\} \to 0. \qquad (7.14)$$
In Section 7.1, $t_n$ is of order $n^r$, $0 < r < \tfrac{1}{6}$. For this $t_n$, type II consistency holds if $a_n$ is of order $n^r$, $0 < r < \tfrac{1}{6}$, and
$$d_1 n^{-r} \exp\{-n^{2r}/2\} \to 0.$$
If $a_n = O(n^k)$ with $k \ge \tfrac{1}{6}$, then $P\big(\big|T_n^{(j)}\big| \ge a_n\big)$ tends to zero at a rate faster than in (7.13), and consistency is ensured. For instance, if all relevant variables $X_j$ have $p^{(j)}_{rc}$ fixed as $n \to \infty$, then $\min_{j \le d_1}\big|\theta_n^{(j)}\big|$ is of order $\sqrt{n}$ and $a_n$ is of order $n^{1/2} - t_n$.
7.3 Type I and II consistency

We see from Sections 7.1 and 7.2 that consistency holds under a variety of conditions. One set of conditions is:

Theorem 7.1. Under the regularity conditions of Sections 7.1 and 7.2, if
$$d_0 = c_0 \exp(n^b), \quad t_n = c_1 n^r, \quad \min_{j \le d_1}\big|\theta_n^{(j)}\big| - t_n = c_2 n^r, \quad d_1 = c_3 \exp(n^b)$$
for positive constants $c_0, c_1, c_2$, and $c_3$, and $0 < \tfrac{1}{2} b < r < \tfrac{1}{6}$, then CATCH is consistent.
Appendix: Proof of Theorem 6.1
(1) Let $N_h = \sum_{i=1}^{n} I\big(X^{(i)} \in N_h(x^{(0)})\big)$. Then $N_h \sim \mathrm{Bi}\big(n, \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx\big)$ and $N_h \to_{a.s.} \infty$ as $n \to \infty$. Write
$$\hat{\zeta}(x^{(0)}, h) = n^{-1/2}\sqrt{\mathcal{X}^2(x^{(0)}, h)} = (N_h/n)^{1/2}\,(N_h)^{-1/2}\sqrt{\mathcal{X}^2(x^{(0)}, h)}.$$
Since $N_h \sim \mathrm{Bi}\big(n, \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx\big)$, we have
$$N_h/n \to \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx. \qquad (7.15)$$
By Lemma 6.1 and Table 1,
$$(N_h)^{-1/2}\sqrt{\mathcal{X}^2(x^{(0)}, h)} \to \left(\sum_{r=+,-}\sum_{c=1}^{C} \frac{(p_{rc} - p_r q_c)^2}{p_r q_c}\right)^{1/2}, \qquad (7.16)$$
where
$$p_{+c} = \Pr\big(X > x^{(0)},\, Y = c \mid X \in N_h(x^{(0)})\big), \qquad p_{-c} = \Pr\big(X \le x^{(0)},\, Y = c \mid X \in N_h(x^{(0)})\big)$$
for $r = +, -$ and $c = 1, \ldots, C$, and
$$p_r = \sum_{c=1}^{C} p_{rc}, \qquad q_c = \sum_{r=+,-} p_{rc}.$$
In fact, $p_+ = \Pr\big(X > x^{(0)} \mid X \in N_h(x^{(0)})\big)$, $p_- = \Pr\big(X \le x^{(0)} \mid X \in N_h(x^{(0)})\big)$, and $q_c = \Pr\big(Y = c \mid X \in N_h(x^{(0)})\big)$.
By (7.15) and (7.16),
$$\hat{\zeta}(x^{(0)}, h) \to \left(\int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx\right)^{1/2}\left(\sum_{r=+,-}\sum_{c=1}^{C} \frac{(p_{rc} - p_r q_c)^2}{p_r q_c}\right)^{1/2} \equiv \zeta(x^{(0)}, h).$$
(2) Since $X$ and $Y$ are independent, $p_{rc} = p_r q_c$ for $r = +, -$ and $c = 1, \ldots, C$; then $\zeta(x^{(0)}, h) = 0$ for any $h$. Hence
$$\zeta(x^{(0)}) = \sup_{h>0}\zeta(x^{(0)}, h) = 0.$$
(3) Recall that $p_c(x^{(0)}) = \Pr(Y = c \mid x^{(0)})$. Without loss of generality, assume $p'_c(x^{(0)}) > 0$. Then there exists $h$ such that
$$\Pr\big(Y = c \mid x^{(0)} < X < x^{(0)} + h\big) > \Pr\big(Y = c \mid x^{(0)} - h < X < x^{(0)}\big),$$
which is equivalent to $p_{+c}/p_+ > p_{-c}/p_-$. This shows that $\zeta(x^{(0)}) > 0$, since by Lemma 6.1, $\zeta(x^{(0)}) = 0$ would imply $p_{+c}/p_+ = p_{-c}/p_- = p_c$. $\square$
References
[1] Breiman, L., 2001. Random forests. Machine Learning 45, 5-32.

[2] Breiman, L., Friedman, J. H., Olshen, R. A., Stone, C. J., 1984. Classification and Regression Trees. Wadsworth, Belmont.

[3] Doksum, K., Tang, S., Tsui, K.-W., 2008. Nonparametric variable selection: The EARTH algorithm. Journal of the American Statistical Association 103 (484), 1609-1620.

[4] Doksum, K. A., Samarov, A., 1995. Nonparametric estimation of global functionals and a measure of the explanatory power of covariates in regression. Annals of Statistics 23, 1443-1473.

[5] Doksum, K. A., Schafer, C., 2006. Powerful choices: Tuning parameter selection based on power. In: Frontiers in Statistics: Dedicated to Peter Bickel (J. Fan and H. L. Koul, eds.). Imperial College Press, 113-141.

[6] Friedman, J., 1991. Multivariate adaptive regression splines. The Annals of Statistics, 1-67.

[7] Gao, J., Gijbels, I., 2008. Bandwidth selection in nonparametric kernel testing. Journal of the American Statistical Association 103 (484), 1584-1594.

[8] Hall, P., 1992. The Bootstrap and Edgeworth Expansion. Springer, New York.

[9] Huang, L.-S., Chen, J., 2008. Analysis of variance, coefficient of determination and F-test for local polynomial regression. Annals of Statistics 36, 2085-2109.

[10] Jing, B.-Y., Shao, Q.-M., Wang, Q., 2003. Self-normalized Cramér-type large deviations for independent random variables. Annals of Probability 31, 2167-2215.

[11] Lin, D. Y., Zeng, D., 2011. Correcting for population stratification in genomewide association studies. Journal of the American Statistical Association 106, 997-1008.

[12] Loh, W.-Y., 2009. Improving the precision of classification trees. The Annals of Applied Statistics 3, 1710-1737.

[13] Price, A., Patterson, N., Plenge, R., Weinblatt, M., Shadick, N., Reich, D., 2006. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38 (8), 904-909.

[14] Qiao, Z., Zhou, L., Huang, J., 2008. Linear discriminant analysis for high dimensional, low sample size data. Special issue of the World Congress on Engineering 2008.

[15] Quinlan, J. R., 1987. Simplifying decision trees. International Journal of Man-Machine Studies 27 (3), 221-234.

[16] Schafer, C., Doksum, K. A., 2009. Selecting local models in multiple regression by maximizing power. Metrika 69, 283-304.

[17] Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267-288.

[18] Wang, L., Shen, X., 2007. On L1-norm multi-class support vector machines: methodology and theory. Journal of the American Statistical Association 102, 583-594.

[19] Zhang, H. H., Liu, Y., Wu, Y., Zhu, J., 2008. Variable selection for the multicategory SVM via adaptive sup-norm regularization. Electronic Journal of Statistics 2, 149-167.
observation with the comparison to Marginal CATCH in Table 2. On the other hand, the tube size should not be too small, so as to retain sufficient detection power. We suggest using a = 0.1 or 0.15 for sample sizes n ≥ 400, or m = 50 for smaller sample sizes. In light of these guidelines, we use M = n/2 = 200 and a = 0.15 for Examples 4.1-4.3, and M = n = 200 and m = 50 for Example 4.4.
Table 3: Sensitivity study of CATCH for Example 4.1 (n = 400, d = 50) using different numbers of tube centers M and tube sizes m = [na]. For 1 ≤ i ≤ 5, the ith column gives the number of times Xi was selected. Column 6 gives the percentage of times irrelevant variables were selected.
Number of Selections (100 simulations)
M        a     X1  X2  X3  X4  X5  Xi, i ≥ 6
M = 200  0.10  75  67  77  69  92   78
         0.15  76  76  87  84  96   82
         0.20  85  84  90  87  93   99
         0.30  89  87  95  89  85   96
         0.50  89  90  96  93  62   99
M = 400  0.10  74  76  82  77  92   72
         0.15  85  82  89  87  94   86
         0.20  87  88  90  86  96   91
         0.30  88  89  91  91  92   94
         0.50  89  91  94  93  61  102
In Example 4.1 all predictors are independent. In a similar setting, we next consider the case where the predictors are dependent.
Example 4.2. We generate standard normal random variables Zi, i = 1, ..., d, in such a way that cor(Z1, Z4) = 0.5, cor(Z3, Z6) = 0.9, and the other Zi's are independent. We set Xi = Φ(Zi), i = 1, ..., d, where Φ(·) is the CDF of the standard normal; therefore the Xi are marginally uniformly distributed, as in Example 4.1. The correlation structure of the Xi's is similar to that of the Zi's. The response Y is generated in the same way as in Example 4.1.
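The predictor-generation step of Example 4.2 is a Gaussian copula: correlated normals pushed through Φ. A minimal sketch (with d = 6, the correlated entries from the example, and an illustrative sample size):

```python
import math
import numpy as np

rng = np.random.default_rng(4)
n, d = 2000, 6
cov = np.eye(d)
cov[0, 3] = cov[3, 0] = 0.5      # cor(Z1, Z4) = 0.5
cov[2, 5] = cov[5, 2] = 0.9      # cor(Z3, Z6) = 0.9
# correlated normals via the Cholesky factor of the covariance
Z = rng.standard_normal((n, d)) @ np.linalg.cholesky(cov).T
# standard normal CDF applied elementwise: each X_i is marginally Uniform(0, 1)
phi = np.vectorize(lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))
X = phi(Z)
```

The transformation preserves the dependence (monotone in each coordinate) while making each marginal uniform, so the correlation structure of the Xi's stays close to that of the Zi's.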
The results of Example 42 are given in Table 4 By comparing with results in Table2 both CATCH and GUIDE are doing well in the presence of dependence between thenumerical variables The explanation is two-fold First both methods could identify thecategorical X5 as an important predictor most of the time which make the identificationof other relevant predictors possible Second conditioning on X5 Y depends on a pairof variables (ldquoX1 and X2rdquo or ldquoX3 and X4rdquo) jointly and in particular this dependence isnonlinear which makes both methods less affected by the correlation introduced here
15
00 02 04 06 08 10
00
02
04
06
08
10
(a) X5 = 0
X1
X2
00 02 04 06 08 10
00
02
04
06
08
10
(b) X5 = 1
X3
X4
Figure 2 Sample scatter plots of (x1 x2) and (x3 x4) for simulated data using model (41) The leftpanel includes all the pairs (x1 x2) with x5 = 0 while the right panel includes all the pairs (x3 x4)with x5 = 1 The larger dots in the two plots are within the tube around a tube center shown as thestar
We now explain the latter in details for both methods In CATCH the algorithmadaptively weighs the effects of predictors by their importance scores in defining thetubes when they are being conditioned on therefore identifying either of the two willhelp to identify the other whereas an irrelevant variable though strongly associated withone of the relevant variables is down-weighted iteratively and does not strongly affectthe selection of the relevant variable This can explain the slight decreasing selectionprobability in X3 in presence of a strongly correlated X6 and consequently overall lessselection of X3 and X4 compared to X1 and X2 given the correlation between X1 andX4 In GUIDE the difference between the dependent and independent cases is evenless because GUIDE is very powerful in detecting local interaction by literally includingpairwise interactions when selecting splitting variables
Marginal CATCH and Random Forest do poorly in this case mainly because X5 cannot be detected as we explained earlier More specifically the dependence between Yand X4 in X5 = 1 case is the same as that between Y and X2 in X5 = 0 The linearterms of X1 and X2 have opposite signs in decision function therefore marginal effectsof X1 and X4 are obscured when X5 is not identified as an important variable
42 Improving The Performance of SVM and RF Classifiers
The criterion used in the CATCH algorithm is not based on classification accuracyBut if we use CATCH to screen out the irrelevant variables and only use the important
16
Tab
le4
Sim
ula
tion
resu
lts
from
Exam
ple
42
wit
hsa
mp
lesi
zen
=400
For
1leile
6
theit
hC
olu
mn
giv
esth
ees
tim
ate
dp
rob
ab
ilit
yof
sele
ctio
nof
vari
ableX
ian
dth
est
and
ard
erro
rb
ased
on
100
sim
ula
tion
sw
her
eX
ii
=1middotmiddotmiddot
5are
rele
vant
vari
ab
les
wit
hcor(X
1X
4)
=5
X
6is
anir
rele
vant
vari
able
that
isco
rrel
ated
wit
hX
3w
ith
corr
elati
on
9
Colu
mn
7giv
esth
em
ean
of
the
esti
mate
dpro
bab
ilit
yof
sele
ctio
nan
dth
atof
the
stan
dar
der
ror
for
the
irre
leva
nt
vari
ab
lesX
7middotmiddotmiddotX
d
Num
ber
ofSel
ecti
ons(
100
sim
ula
tion
s)
ME
TH
OD
dX
1X
2X
3X
4X
5X
6Xiige
7
CA
TC
H10
076
(00
4)0
77(0
04)
073
(00
4)0
68(0
05)
099
(00
1)0
34(0
05)
007
(00
2)
500
78(0
04)
077
(00
4)0
74(0
04)
063
(00
5)0
92(0
03)
057
(00
5)0
09(0
03)
100
076
(00
4)0
77(0
04)
08
(00
4)0
74(0
04)
082
(00
4)0
55(0
05)
009
(00
3)
Mar
ginal
CA
TC
H10
035
(00
5)0
63(0
05)
08
(00
4)0
32(0
05)
009
(00
3)0
71(0
05)
012
(00
3)
500
34(0
05)
071
(00
5)0
7(0
05)
03
(00
5)0
1(0
03)
06
(00
5)0
11(0
03)
100
034
(00
5)0
68(0
05)
075
(00
4)0
3(0
05)
01
(00
3)0
59(0
05)
01
(00
3)
Ran
dom
For
est
100
29(0
05)
048
(00
5)0
55(0
05)
02
(00
4)0
3(0
05)
047
(00
5)0
05(0
02)
500
26(0
04)
052
(00
5)0
57(0
05)
03
(00
5)0
24(0
04)
045
(00
5)0
08(0
03)
100
036
(00
5)0
61(0
05)
066
(00
5)0
37(0
05)
032
(00
5)0
46(0
05)
012
(00
3)
GU
IDE
100
96(0
02)
096
(00
2)0
94(0
02)
095
(00
2)0
96(0
02)
084
(00
4)0
3(0
05)
500
8(0
04)
092
(00
3)0
88(0
03)
079
(00
4)0
72(0
04)
068
(00
5)0
12(0
03)
100
081
(00
4)0
89(0
03)
09
(00
3)0
76(0
04)
075
(00
4)0
58(0
05)
008
(00
3)
17
variables for classification the performance of other algorithms such as Support VectorMachine (SVM) and Random Forest Classifiers (RFC) can be significantly improvedWe compare the prediction performance of (1) SVM (2) RFC (3) CATCH followedby SVM and (4) CATCH followed by RFC in this section We label (3) and (4)with CATCH+SVM and CATCH+RFC respectively In this section we use the ldquosvmrdquofunction in the ldquoe1071rdquo package in R with radial kernel to perform SVM analysisand use the ldquorandomForestrdquo function in the ldquorandomForestrdquo package to perform RFCanalysis We use simulation experiments to illustrate the improvement obtained byusing CATCH prior to classifying Y
Let the true model be
(X Y ) sim P (42)
where P will be specified later In each of 100 Monte Carlo experiments n pairs of(X(i) Y (i)) i = 1 middot middot middot n are generated from the true model (42) Based on the simu-lated data (X(i) Y (i)) i = 1 middot middot middot n the methods (1) SVM (2) RFC (3) CATCH+SVM(4) CATCH+RFC are used to classify a future Y (0) for which we have available a cor-responding X(0)
The integrated-misclassification rate (IMR Friedman [6]) defined below is used toevaluate the performance of the classification algorithms Let FXY(middot) be the classifierconstructed by a classification algorithm based on X = (X(1) middot middot middot X(n)) and Y =(Y (1) middot middot middot Y (n))T
Let (X(0) Y (0)) be a future observation with the same distribution as (X(i) Y (i))We only know X(0) = x(0) and want to classify Y (0) For a possible future observation(x(0) y(0)) define
MR(XYx(0) y(0)) = P (FXY(x(0)) 6= y(0)|XY) (43)
Our criterion is
IMR = EE0[MR(XYx(0) y(0))] (44)
where E0 is with respect to the distribution of (x(0) y(0)) random and E is with respectto the distribution of (XY)
This double expectation is evaluated using Monte Carlo simulation The result isdenoted as IMRlowast More precisely E0 is estimated using 5000 simulated (x(0) y(0)) obser-vations then IMRlowast is computed on the basis of 100 simulated (XiYi) i = 1 middot middot middot 100samples
Example 43 First consider model (41) in section 41 Figure 3 shows IMRlowast versusthe number of irrelevant variables for this example The left panel shows that theIMRlowastrsquos of the SVM approach increases dramatically as the number of irrelevant variablesincreases while the IMRlowastrsquos of CATCH+SVM are stable Thus CATCH stabilizes the
18
performance of SVM for this complex model The right panel shows that CATCH alsostabilizes the performance of RFC These results indicate that CATCH can improve theprediction accuracy of other classification algorithms
Figure 3: IMR*'s of SVM, RFC, CATCH+SVM, and CATCH+RFC versus the number of predictors for the simulation study of model (4.1) in Example 4.1. The sample size is n = 400.
Example 4.4 (Contaminated Logistic Model). Consider a binary response $Y \in \{0, 1\}$ with

$$P(Y = 0 \mid X = x) = \begin{cases} F(x) & \text{if } r = 0, \\ G(x) & \text{if } r = 1, \end{cases} \qquad (4.5)$$

where $x = (x_1, \ldots, x_d)$,

$$r \sim \mathrm{Bernoulli}(\gamma), \qquad (4.6)$$

$$F(x) = \big(1 + \exp\{-(0.25 + 0.5 x_1 + x_2)\}\big)^{-1}, \qquad (4.7)$$

$$G(x) = \begin{cases} 0.1 & \text{if } 0.25 + 0.5 x_1 + x_2 < 0, \\ 0.9 & \text{if } 0.25 + 0.5 x_1 + x_2 \ge 0, \end{cases} \qquad (4.8)$$

and $x_1, x_2, \ldots, x_d$ are realizations of independent standard normal random variables. Hence, given $X^{(i)} = x^{(i)}$, $1 \le i \le n$, the joint distribution of $Y^{(1)}, Y^{(2)}, \ldots, Y^{(n)}$ is the contaminated logistic model with contamination parameter $\gamma$:

$$(1 - \gamma)\prod_{i=1}^{n}\big[F(x^{(i)})\big]^{1 - Y^{(i)}}\big[1 - F(x^{(i)})\big]^{Y^{(i)}} + \gamma\prod_{i=1}^{n}\big[G(x^{(i)})\big]^{1 - Y^{(i)}}\big[1 - G(x^{(i)})\big]^{Y^{(i)}}. \qquad (4.9)$$
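A data generator for this model can be sketched as follows (an illustrative reading of (4.5)–(4.8), with the contamination indicator $r$ drawn once per data set, matching the mixture form of (4.9)):

```python
import math
import random

random.seed(1)

def draw_sample(n, d, gamma):
    """Draw one data set from the contaminated logistic model (4.5)-(4.8);
    the contamination indicator r is drawn once per data set, as in (4.9)."""
    r = 1 if random.random() < gamma else 0            # (4.6)
    X, Y = [], []
    for _ in range(n):
        x = [random.gauss(0.0, 1.0) for _ in range(d)]
        s = 0.25 + 0.5 * x[0] + x[1]
        if r == 0:
            p0 = 1.0 / (1.0 + math.exp(-s))            # F(x), (4.7)
        else:
            p0 = 0.1 if s < 0 else 0.9                 # G(x), (4.8)
        y = 0 if random.random() < p0 else 1           # P(Y = 0 | x) = p0
        X.append(x)
        Y.append(y)
    return X, Y

X, Y = draw_sample(n=200, d=50, gamma=0.2)
print(len(X), len(X[0]), set(Y) <= {0, 1})   # 200 50 True
```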
We perform a simulation study for $n = 200$, $d = 50$, and $\gamma = 0, 0.2, 0.4, 0.6, 0.8$, and $1$, where $\gamma = 0$ corresponds to no contamination. Figure 4 shows the IMR*'s versus the different values of $\gamma$ for this example. CATCH improves the performance of the SVM and Random Forest classifiers for this contaminated logistic regression model.
Figure 4: IMR*'s of SVM, RFC, CATCH+SVM, and CATCH+RFC versus the contamination parameter γ for simulation model (4.9) in Example 4.4.
5 CATCH Analysis of The ANN Data
In this section CATCH is applied to a data set available at the UCI database (ftp://ftp.ics.uci.edu/pub/machine-learning-databases/thyroid-disease). The data are from a clinical trial to determine whether a patient is hypothyroid (see, e.g., Quinlan [15]). The dataset is split into two parts: a training set with 3772 observations and a test set with 3428 observations. The response is a categorical variable with three classes: normal (not hypothyroid), hyperfunction, and subnormal functioning. There are 21 predictors: 15 categorical and 6 numerical. The three classes are very imbalanced, with 92.5% normal patients, so a good classifier should produce a significant decrease from the 7.5% error rate that can be obtained by classifying all observations to the normal class.
CATCH identifies 7 variables as important: 4 continuous and 3 categorical. We consider three classification methods: 1. the Support Vector Machine with radial kernel, denoted SVM(r); 2. the Support Vector Machine with polynomial kernel, denoted SVM(p); 3. RFC. We apply these three methods to the training set to build classification models and use the test set to estimate their misclassification rates. We apply the methods with CATCH and without CATCH, where "with CATCH" means that only the 7 important variables identified by CATCH are used to build the model, and "without CATCH" means that all 21 variables are used in the analysis.
Table 5: Misclassification rates of three classification methods (SVM(r), SVM(p), RFC) with CATCH and without CATCH.

                SVM(r)   SVM(p)   RFC
  w/o CATCH     0.052    0.063    0.024
  with CATCH    0.037    0.055    0.015
Table 5 shows that if CATCH is used to select important variables before the classification methods are applied, the misclassification rates decrease: CATCH reduces the misclassification rates by 29%, 13%, and 37% for SVM(r), SVM(p), and RFC, respectively. CATCH + RFC has the best performance. Although CATCH is not designed to improve classification accuracy, it nonetheless helps other classification methods perform better.
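The quoted reductions follow from Table 5 by simple arithmetic; a quick check:

```python
# Relative reduction in misclassification rate implied by Table 5.
rates = {"SVM(r)": (0.052, 0.037), "SVM(p)": (0.063, 0.055), "RFC": (0.024, 0.015)}
reductions = {m: round(100 * (wo - w) / wo, 1) for m, (wo, w) in rates.items()}
print(reductions)   # {'SVM(r)': 28.8, 'SVM(p)': 12.7, 'RFC': 37.5}
```

Rounded to whole percentages, these match the 29%, 13%, and 37% quoted above.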
6 Properties of Local Contingency Efficacy
In this section we investigate the properties of the local contingency efficacy. For a univariate continuous predictor, Theorem 6.1 below shows that the local contingency efficacy is well defined and equals 0 if the response variable is independent of the predictor. Consider an $R \times C$ contingency table. Let $n_{rc}$, $r = 1, \ldots, R$, $c = 1, \ldots, C$, be the observed frequency in the $(r, c)$-cell, and let $p_{rc}$ be the probability that an observation belongs to the $(r, c)$-cell. Let
$$p_r = \sum_{c=1}^{C} p_{rc}, \qquad q_c = \sum_{r=1}^{R} p_{rc}, \qquad (6.1)$$

and

$$n_{r\cdot} = \sum_{c=1}^{C} n_{rc}, \qquad n_{\cdot c} = \sum_{r=1}^{R} n_{rc}, \qquad n = \sum_{r=1}^{R}\sum_{c=1}^{C} n_{rc}, \qquad E_{rc} = n_{r\cdot}\, n_{\cdot c}/n. \qquad (6.2)$$
Consider testing whether the rows and the columns are independent, i.e.,

$$H_0: \forall r, c,\ p_{rc} = p_r q_c \quad \text{versus} \quad H_1: \exists r, c,\ p_{rc} \neq p_r q_c. \qquad (6.3)$$

A standard test statistic is

$$\mathcal{X}^2 = \sum_{r=1}^{R}\sum_{c=1}^{C} \frac{(n_{rc} - E_{rc})^2}{E_{rc}}. \qquad (6.4)$$
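As a quick executable sketch (written for this transcript, not from the paper), the statistic (6.4) can be computed directly from a table of counts:

```python
# Chi-square statistic (6.4) for an R x C table of counts.
def chi_square(table):
    R, C = len(table), len(table[0])
    row = [sum(table[r]) for r in range(R)]                       # n_r.
    col = [sum(table[r][c] for r in range(R)) for c in range(C)]  # n_.c
    n = sum(row)
    x2 = 0.0
    for r in range(R):
        for c in range(C):
            e = row[r] * col[c] / n                               # E_rc
            x2 += (table[r][c] - e) ** 2 / e
    return x2, n

# An exactly "independent" table gives X^2 = 0, while a dependent one
# gives X^2/n near tau^2 > 0 (see Lemma 6.1 below).
x2_ind, n_ind = chi_square([[10, 20], [30, 60]])
x2_dep, n_dep = chi_square([[30, 10], [10, 30]])
print(x2_ind / n_ind, x2_dep / n_dep)   # 0.0 0.25
```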
Under $H_0$, $\mathcal{X}^2$ converges in distribution to $\chi^2_{(R-1)(C-1)}$. Moreover, defining $0/0 = 0$, we have the well-known result:
Lemma 6.1. As $n$ tends to infinity, $\mathcal{X}^2/n$ tends in probability to

$$\tau^2 \equiv \sum_{r=1}^{R}\sum_{c=1}^{C} \frac{(p_{rc} - p_r q_c)^2}{p_r q_c}.$$

Moreover, if $p_r q_c$ is bounded away from 0 and 1 as $n \to \infty$, and if $R$ and $C$ are fixed as $n \to \infty$, then $\tau^2 = 0$ if and only if $H_0$ is true.
Remark 6.1. Lemma 6.1 shows that the contingency efficacy ζ in (2.8) is well defined. Moreover, ζ = 0 if and only if X and Y are independent.
Theorem 6.1. Consider a categorical response $Y$ and a continuous predictor $X$ with density function $f(x)$ such that, for a fixed $h > 0$, $f(x) > 0$ for $x \in (x^{(0)} - h, x^{(0)} + h)$. Then:

(1) For fixed $h$, the estimated local section efficacy $\hat\zeta(x^{(0)}, h)$ defined in (2.5) converges almost surely to the section efficacy $\zeta(x^{(0)}, h)$ defined in (2.4).

(2) If $X$ and $Y$ are independent, then the local efficacy $\zeta(x^{(0)})$ defined in (2.4) is zero.

(3) If there exists $c \in \{1, \ldots, C\}$ such that $p_c(x) = P(Y = c \mid x)$ has a continuous derivative at $x = x^{(0)}$ with $p'_c(x^{(0)}) \neq 0$, then $\zeta(x^{(0)}) > 0$.
The $\chi^2$ statistic (6.4) can be written in terms of multinomial proportions $\hat p_{rc}$, with $\sqrt{n}(\hat p_{rc} - p_{rc})$, $1 \le r \le R$, $1 \le c \le C$, converging in distribution to a multivariate normal. By a Taylor expansion of $T_n \equiv \sqrt{\mathcal{X}^2/n}$ at $\tau = \operatorname{plim}\sqrt{\mathcal{X}^2/n}$, we find that:
Theorem 6.2.

$$\sqrt{n}\,(T_n - \tau) \longrightarrow_d N(0, v^2), \qquad (6.5)$$

where the asymptotic variance $v^2$, which can be obtained from the Taylor expansion, is not needed in this paper.
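A small Monte Carlo sketch of (6.5), written for this transcript rather than taken from the paper: for a fixed 2 × 2 cell-probability table with τ = 0.2, the scaled deviations √n(Tₙ − τ) should stay centered near zero.

```python
import math
import random

random.seed(5)

# Cell probabilities p_rc of a 2 x 2 table; here tau = 0.2 exactly.
p = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.2, (1, 1): 0.3}
tau = math.sqrt(sum((prc - 0.25) ** 2 / 0.25 for prc in p.values()))

def sample_cell():
    u, acc = random.random(), 0.0
    for k, prc in p.items():
        acc += prc
        if u < acc:
            return k
    return (1, 1)  # guard against floating-point round-off

def t_n(n):
    """T_n = sqrt(X^2/n) computed from n multinomial draws."""
    counts = {k: 0 for k in p}
    for _ in range(n):
        counts[sample_cell()] += 1
    row = [counts[0, 0] + counts[0, 1], counts[1, 0] + counts[1, 1]]
    col = [counts[0, 0] + counts[1, 0], counts[0, 1] + counts[1, 1]]
    x2 = sum((counts[r, c] - row[r] * col[c] / n) ** 2 / (row[r] * col[c] / n)
             for r in (0, 1) for c in (0, 1))
    return math.sqrt(x2 / n)

devs = [math.sqrt(2000) * (t_n(2000) - tau) for _ in range(200)]
mean_dev = sum(devs) / len(devs)
print(round(tau, 3), round(mean_dev, 2))   # tau = 0.2; mean deviation near 0
```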
Remark 6.2. The result (6.5) applies to the multivariate $\chi^2$-based efficacies $\hat\zeta$ with $n$ replaced by the number of observations that goes into the computation of $\hat\zeta$. In this case we use the notation $\chi^2_j$ for the chi-square statistic for the $j$th variable.
7 Consistency of CATCH
Let $Y$ denote a variable to be classified into one of the categories $1, \ldots, C$ by using the $X_j$'s in the vector $X = (X_1, \ldots, X_d)$. Let $X_{-j} \equiv X - \{X_j\}$ denote $X$ without the $X_j$ component. Assume that the squared Pearson nonparametric correlation (Doksum and Samarov [4], Huang and Chen [9]) between $X_j$ and $X_{-j}$ is less than 0.5. We call the variable $X_j$ relevant for $Y$ if, conditionally given $X_{-j}$, $\mathcal{L}(Y \mid X_j = x)$ is not constant in $x$, and irrelevant otherwise. Our data $(X^{(1)}, Y^{(1)}), \ldots, (X^{(n)}, Y^{(n)})$ are i.i.d. as $(X, Y)$. A variable selection procedure is consistent if the probability that all the relevant variables are selected and none of the irrelevant variables are selected tends to one as $n$ tends to infinity.
We give a general nonparametric framework in which the variable selection rule based on the CATCH importance scores $s_j$ given in (3.8) is consistent.
When the predictors are not all categorical and CATCH involves tubes, we assume that the number of observations in each tube tends to $\infty$ as $n \to \infty$, and that, for those importance scores that require the selection of a section size $h$, the selection is from a finite set $\{h_j\}$ with $\min_j\{h_j\} > h_0$ for some $h_0 > 0$. It follows that the number of observations $m_{jk}$ in the $k$th tube used to calculate the importance score $s_j$ for the variable $X_j$ tends to infinity as $n \to \infty$.
The key quantities that determine consistency are the limits $\lambda_j$ of $s_j$. That is,

$$\lambda_j = \tau(\zeta_j, P) = E_P[\zeta_j(X)] = \int \zeta_j(x)\, dP(x), \qquad j = 1, \ldots, d,$$
where $\zeta_j(x)$ is the local contingency efficacy defined by (2.4) for numerical $X$'s and by (2.8) for categorical $X$'s, and $P$ is the probability distribution of $X$. Our estimate $s_j$ of $\lambda_j$ is

$$s_j = \tau(\hat\zeta_j, \hat P) = \frac{1}{M}\sum_{k=1}^{M} \hat\zeta_j(X^*_k),$$

where $\hat\zeta_j$ is defined in Section 3 and $\hat P$ is the empirical distribution of $X$.

Let $d_0$ denote the number of $X$'s that are irrelevant for $Y$. Set $d_1 = d - d_0$ and, without loss of generality, reorder the $\lambda_j$ so that
$$\lambda_j > 0 \ \text{if}\ j = 1, \ldots, d_1; \qquad \lambda_j = 0 \ \text{if}\ j = d_1 + 1, \ldots, d. \qquad (7.1)$$
Our variable selection rule is

"keep variable $X_j$ if $s_j \ge t$," for some threshold $t$. (7.2)
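An illustrative sketch of the score $s_j$ and rule (7.2), written for this transcript under heavy simplifications (univariate numerical predictors, binary $Y$, a fixed section size $h$, and no adaptive tubes — so this mimics but is not the full CATCH score):

```python
import math
import random

random.seed(2)

def local_efficacy(xs, ys, x0, h):
    """Simplified local efficacy: n^{-1/2} sqrt(X^2) from the 2 x 2 table
    of points in the section (x0-h, x0+h), rows = side of x0, cols = Y."""
    t = [[0, 0], [0, 0]]
    for x, y in zip(xs, ys):
        if x0 - h < x < x0 + h:
            t[0 if x <= x0 else 1][y] += 1
    row = [sum(r) for r in t]
    col = [t[0][c] + t[1][c] for c in (0, 1)]
    m, x2 = sum(row), 0.0
    for r in (0, 1):
        for c in (0, 1):
            if row[r] and col[c]:
                e = row[r] * col[c] / m
                x2 += (t[r][c] - e) ** 2 / e
    return math.sqrt(x2 / len(xs))

n, M, h, thresh = 2000, 50, 0.2, 0.05
x1 = [random.random() for _ in range(n)]              # relevant: P(Y=1|x) = x
x2_ = [random.random() for _ in range(n)]             # irrelevant
ys = [1 if random.random() < x else 0 for x in x1]
pts = [random.random() for _ in range(M)]             # M random evaluation points
scores = {j: sum(local_efficacy(xs, ys, p, h) for p in pts) / M
          for j, xs in (("X1", x1), ("X2", x2_))}
keep = [j for j, s in scores.items() if s >= thresh]  # rule (7.2)
print(keep)   # expected: only the relevant variable X1 survives
```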
Rule (7.2) is consistent if and only if

$$P\Big(\min_{j \le d_1} s_j \ge t \ \ \text{and}\ \ \max_{j > d_1} s_j < t\Big) \longrightarrow 1. \qquad (7.3)$$

To establish (7.3) it is enough to show that

$$P\Big(\max_{j > d_1} s_j > t\Big) \longrightarrow 0 \qquad (7.4)$$

and

$$P\Big(\min_{j \le d_1} s_j \le t\Big) \longrightarrow 0. \qquad (7.5)$$
We call (7.4) and (7.5) type I and type II consistency, respectively. Type I and II error probabilities refer to the probabilities of any false positives and any false negatives, respectively.
7.1 Type I consistency
We consider (7.4) first and note that

$$P\Big(\max_{j > d_1} s_j > t\Big) \le \sum_{j=d_1+1}^{d} P(s_j > t). \qquad (7.6)$$
For the $j$th irrelevant variable $X_j$, $j > d_1$, the results in Section 6 apply to the local efficacy $\hat\zeta_j(x)$ at a fixed point $x$. The importance score $s_j$ defined in (3.8) is the average of such efficacies over a sample of $M$ random points $x^*_a$. We assume that there is a point $x^{(0)}$ such that, for irrelevant $X_j$'s and for $n$ large enough, $s^{(0)}_j \equiv \hat\zeta_j(x^{(0)})$ is stochastically larger in the right tail than the average $M^{-1}\sum_{a=1}^{M}\hat\zeta_j(x^*_a)$. That is, we assume that there exist $t > 0$ and $n_0$ such that for $n \ge n_0$,

$$P(s_j > t) \le P\big(s^{(0)}_j > t\big), \qquad j > d_1. \qquad (7.7)$$

The existence of such a point $x^{(0)}$ is established by considering the probability limit of

$$\arg\max\big\{\hat\zeta_j(x^*_a) : x^*_a \in \{x^*_1, \ldots, x^*_M\}\big\}.$$
Note that by Lemma 6.1, $s^{(0)}_j \to_p 0$ as $n \to \infty$. It follows that we have established (7.4) for $t$ and $d_0 = d - d_1$ that are fixed as $n \to \infty$. To examine the interesting case where $d_0 \to \infty$ as $n \to \infty$, we need large deviation results, to which we turn next. In the $d_0 \to \infty$ case we consider thresholds $t \equiv t_n$ that tend to $\infty$ as $n \to \infty$, and we use the weaker assumption: there exist $x^{(0)}$ and $n_0$ such that

$$P(s_j > t_n) \le P\big(s^{(0)}_j > t_n\big) \qquad (7.8)$$
for $n \ge n_0$ and each $t_n$ with $\lim_{n\to\infty} t_n = \infty$. We also assume that the number of columns $C$ and rows $R$ in the contingency table that defines the efficacy $\chi^2_{jn}$ given by (6.4) stays fixed as $n \to \infty$. In this case $P\big(s^{(0)}_j > t_n\big) \to 0$ provided each term

$$\big[T^{(j)}_{rc}\big]^2 \equiv \left[\frac{n_{rc} - E_{rc}}{\sqrt{n}\,\sqrt{E_{rc}}}\right]^2$$

in $s_j^2 = \chi^2_{jn}/n$, defined for $X_j$ by (6.4), satisfies

$$P\big(\big|T^{(j)}_{rc}\big| > t_n\big) \longrightarrow 0.$$
This is because

$$P\left(\sqrt{\sum_r \sum_c \big[T^{(j)}_{rc}\big]^2} > t_n\right) \le P\Big(RC \max_{r,c}\big[T^{(j)}_{rc}\big]^2 > t_n^2\Big) \le RC \max_{r,c} P\big(\big|T^{(j)}_{rc}\big| > t_n/\sqrt{RC}\big),$$
and $t_n$ and $t_n/\sqrt{RC}$ are of the same order as $n \to \infty$. By large deviation theory (Hall [8], Jing et al. [10]), for some constant $A$,

$$P\big(\big|T^{(j)}_{rc}\big| > t_n\big) = \frac{2 t_n^{-1}}{\sqrt{2\pi}}\, e^{-t_n^2/2}\Big[1 + A t_n^3 n^{-1/2} + o\big(t_n^3 n^{-1/2}\big)\Big] \qquad (7.9)$$

provided $t_n = o(n^{1/6})$. If (7.9) is uniform in $j > d_1$, then (7.6) and (7.9) imply that type I consistency holds if
$$d_0 \big(t_n/\sqrt{2}\big)^{-1} e^{-t_n^2/2} \longrightarrow 0. \qquad (7.10)$$
To examine (7.10), write $d_0$ and $t_n$ as $d_0 = \exp(n^b)$ and $t_n = \sqrt{2}\, n^r$. By large deviation theory, (7.10) holds when $r < 1/6$. Thus if $0 < \tfrac{1}{2}b < r < \tfrac{1}{6}$, then

$$d_0 \big(t_n/\sqrt{2}\big)^{-1} e^{-t_n^2/2} = n^{-r} \exp\big(n^b - n^{2r}\big) \longrightarrow 0.$$
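A quick numeric illustration of this bound (an illustration written for this transcript; the exponents $b$ and $r$ below are one choice satisfying the stated condition):

```python
import math

# Bound in (7.10) with d0 = exp(n^b) and t_n = sqrt(2) n^r:
# it equals n^(-r) exp(n^b - n^(2r)), which tends to 0 when b < 2r.
b, r = 0.25, 1.0 / 7.0          # satisfies 0 < b/2 < r < 1/6

def bound(n):
    return n ** (-r) * math.exp(n ** b - n ** (2 * r))

vals = [bound(10 ** k) for k in (2, 4, 6, 8)]
print(["%.2e" % v for v in vals])   # strictly decreasing toward 0
```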
Thus $d_0 = \exp(n^b)$ with $b = 1/3 - \epsilon$, for any $\epsilon > 0$, is the largest $d_0$ can be to have consistency. For instance, if there are exactly $d_0 = \exp(n^{1/4})$ irrelevant variables and we use the threshold $t_n = \sqrt{2}\, n^{1/7}$, the probability of selecting any of the irrelevant variables tends to zero at the rate $n^{-1/7} \exp\big(n^{1/4} - n^{2/7}\big)$.

7.2 Type II consistency
We now assume that for each relevant variable $X_j$ there is a least favorable point $x^{(1)}$ such that the local efficacy $s^{(1)}_j \equiv \hat\zeta_j(x^{(1)})$ is stochastically smaller than the importance score $s_j = M^{-1}\sum_{a=1}^{M}\hat\zeta_j(x^*_a)$. That is, we assume that for each relevant variable $X_j$ there exist $n_1$ and $x^{(1)}$ such that for $n \ge n_1$,

$$P\big(s^{(1)}_j \le t_n\big) \ge P(s_j \le t_n), \qquad j \le d_1. \qquad (7.11)$$

The existence of such a least favorable $x^{(1)}$ is established by considering the probability limit of

$$\arg\min\big\{\hat\zeta_j(x^*_a) : x^*_a \in \{x^*_1, \ldots, x^*_M\}\big\}.$$
We again assume that $R$ and $C$ are fixed, so that to show type II consistency it is enough to show that, for each $(r, c)$ and $j \le d_1$,

$$P\big(\big|T^{(j)}_{rc}\big| \le t_n\big) \longrightarrow 0.$$

This is because

$$P\left(\sqrt{\sum_r\sum_c \big[T^{(j)}_{rc}\big]^2} \le t_n\right) \le P\Big(\min_{r,c}\big[T^{(j)}_{rc}\big]^2 \le t_n^2\Big) \le RC \max_{r,c} P\big(\big|T^{(j)}_{rc}\big| \le t_n\big).$$
By Theorem 6.2, $\sqrt{n}\big(s^{(1)}_j - \tau_j\big) \to_d N(0, \nu^2)$, $\tau_j > 0$. It follows that if $t_n$ and $d_1$ are bounded above as $n \to \infty$ and $\tau_j > 0$, then $P\big(s^{(1)}_j \le t_n\big) \to 0$; thus type II consistency holds in this case. To examine the more interesting case, where $t_n$ and $d_1$ are not bounded above and $\tau_j$ may tend to zero as $n \to \infty$, we need to center the $T^{(j)}_{rc}$. Thus set

$$\theta^{(j)}_n = \frac{\sqrt{n}\big(p^{(j)}_{rc} - p^{(j)}_{r} p^{(j)}_{c}\big)}{\sqrt{p^{(j)}_{r} p^{(j)}_{c}}}, \qquad (7.12)$$

$$T^{(j)}_n = T^{(j)}_{rc} - \theta^{(j)}_n,$$

where for simplicity the dependence of $T^{(j)}_n$ and $\theta^{(j)}_n$ on $r$ and $c$ is not indicated.
Lemma 7.1.

$$P\Big(\min_{j\le d_1}\big|T^{(j)}_{rc}\big| \le t_n\Big) \le \sum_{j=1}^{d_1} P\Big(\big|T^{(j)}_n\big| \ge \min_{j\le d_1}\big|\theta^{(j)}_n\big| - t_n\Big).$$

Proof.

$$\begin{aligned}
P\Big(\min_{j\le d_1}\big|T^{(j)}_{rc}\big| \le t_n\Big) &= P\Big(\min_{j\le d_1}\big|T^{(j)}_n + \theta^{(j)}_n\big| \le t_n\Big) \\
&\le P\Big(\min_{j\le d_1}\big|\theta^{(j)}_n\big| - \max_{j\le d_1}\big|T^{(j)}_n\big| \le t_n\Big) \\
&= P\Big(\max_{j\le d_1}\big|T^{(j)}_n\big| \ge \min_{j\le d_1}\big|\theta^{(j)}_n\big| - t_n\Big) \\
&\le \sum_{j=1}^{d_1} P\Big(\big|T^{(j)}_n\big| \ge \min_{j\le d_1}\big|\theta^{(j)}_n\big| - t_n\Big). \qquad \square
\end{aligned}$$
Set

$$a_n = \min_{j\le d_1}\big|\theta^{(j)}_n\big| - t_n;$$

then by large deviation theory, if $a_n = o(n^{1/6})$,

$$P\big(\big|T^{(j)}_n\big| \ge a_n\big) \propto \big(a_n/\sqrt{2}\big)^{-1} \exp\{-a_n^2/2\}, \qquad (7.13)$$

where $\propto$ signifies asymptotic order as $n \to \infty$, $a_n \to \infty$. It follows that if (7.13) holds uniformly in $j$, then by Lemma 7.1,

$$P\Big(\min_{j\le d_1}\big|T^{(j)}_{rc}\big| \le t_n\Big) \propto d_1 \big(a_n/\sqrt{2}\big)^{-1} \exp\{-a_n^2/2\}.$$

Thus type II consistency holds when $a_n = o(n^{1/6})$ and

$$d_1 \big(a_n/\sqrt{2}\big)^{-1} \exp\{-a_n^2/2\} \longrightarrow 0. \qquad (7.14)$$

In Section 7.1, $t_n$ is of order $n^r$, $0 < r < 1/6$. For this $t_n$, type II consistency holds if $a_n$ is of order $n^r$, $0 < r < 1/6$, and

$$d_1 n^{-r} \exp\{-n^{2r}/2\} \longrightarrow 0.$$
If $a_n = O(n^k)$ with $k \ge 1/6$, then $P\big(\big|T^{(j)}_n\big| \ge a_n\big)$ tends to zero at a rate faster than in (7.13) and consistency is ensured. For instance, if all relevant variables $X_j$ have $p^{(j)}_{rc}$ fixed as $n \to \infty$, then $\min_{j\le d_1}\big|\theta^{(j)}_n\big|$ is of order $\sqrt{n}$ and $a_n$ is of order $n^{1/2} - t_n$.
7.3 Type I and II consistency

We see from Sections 7.1 and 7.2 that consistency holds under a variety of conditions. One set of conditions is:

Theorem 7.1. Under the regularity conditions of Sections 7.1 and 7.2, if

$$d_0 = c_0 \exp(n^b), \quad t_n = c_1 n^r, \quad \min_{j\le d_1}\theta^{(j)} - t_n = c_2 n^r, \quad d_1 = c_3 \exp(n^b)$$

for positive constants $c_0, c_1, c_2, c_3$ and $0 < \tfrac{1}{2}b < r < \tfrac{1}{6}$, then CATCH is consistent.
Appendix: Proof of Theorem 6.1

(1) Let $N_h = \sum_{i=1}^{n} I\big(X^{(i)} \in N_h(x^{(0)})\big)$. Then $N_h \sim \mathrm{Bi}\big(n, \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx\big)$ and $N_h \to \infty$ almost surely as $n \to \infty$. Write

$$\hat\zeta(x^{(0)}, h) = n^{-1/2}\sqrt{\mathcal{X}^2(x^{(0)}, h)} = (N_h/n)^{1/2}\,(N_h)^{-1/2}\sqrt{\mathcal{X}^2(x^{(0)}, h)}.$$

Since $N_h \sim \mathrm{Bi}\big(n, \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx\big)$, we have

$$N_h/n \longrightarrow \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx. \qquad (7.15)$$

By Lemma 6.1 and Table 1,

$$(N_h)^{-1/2}\sqrt{\mathcal{X}^2(x^{(0)}, h)} \longrightarrow \left(\sum_{r=+,-}\ \sum_{c=1}^{C} \frac{(p_{rc} - p_r q_c)^2}{p_r q_c}\right)^{1/2}, \qquad (7.16)$$

where

$$p_{+c} = \Pr\big(X > x^{(0)}, Y = c \mid X \in N_h(x^{(0)})\big), \qquad p_{-c} = \Pr\big(X \le x^{(0)}, Y = c \mid X \in N_h(x^{(0)})\big)$$

for $r = +, -$ and $c = 1, \ldots, C$, and

$$p_r = \sum_{c=1}^{C} p_{rc}, \qquad q_c = \sum_{r=+,-} p_{rc}.$$

In fact, $p_+ = \Pr\big(X > x^{(0)} \mid X \in N_h(x^{(0)})\big)$, $p_- = \Pr\big(X \le x^{(0)} \mid X \in N_h(x^{(0)})\big)$, and $q_c = \Pr\big(Y = c \mid X \in N_h(x^{(0)})\big)$.

By (7.15) and (7.16),

$$\hat\zeta(x^{(0)}, h) \longrightarrow \left(\int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx\right)^{1/2}\left(\sum_{r=+,-}\ \sum_{c=1}^{C}\frac{(p_{rc} - p_r q_c)^2}{p_r q_c}\right)^{1/2} \equiv \zeta(x^{(0)}, h).$$

(2) Since $X$ and $Y$ are independent, $p_{rc} = p_r q_c$ for $r = +, -$ and $c = 1, \ldots, C$; hence $\zeta(x^{(0)}, h) = 0$ for any $h$, and

$$\zeta(x^{(0)}) = \sup_{h>0}\,\zeta(x^{(0)}, h) = 0.$$

(3) Recall that $p_c(x^{(0)}) = \Pr(Y = c \mid x^{(0)})$. Without loss of generality assume $p'_c(x^{(0)}) > 0$. Then there exists $h$ such that

$$\Pr\big(Y = c \mid x^{(0)} < X < x^{(0)} + h\big) > \Pr\big(Y = c \mid x^{(0)} - h < X < x^{(0)}\big),$$

which is equivalent to $p_{+c}/p_+ > p_{-c}/p_-$. This shows that $\zeta(x^{(0)}) > 0$, since by Lemma 6.1, $\zeta(x^{(0)}) = 0$ would imply $p_{+c}/p_+ = p_{-c}/p_- = p_c$. $\square$
References
[1] Breiman, L., 2001. Random forests. Machine Learning 45, 5–32.

[2] Breiman, L., Friedman, J. H., Olshen, R. A., Stone, C. J., 1984. Classification and Regression Trees. Wadsworth, Belmont.

[3] Doksum, K., Tang, S., Tsui, K.-W., 2008. Nonparametric variable selection: The EARTH algorithm. Journal of the American Statistical Association 103(484), 1609–1620.

[4] Doksum, K. A., Samarov, A., 1995. Nonparametric estimation of global functionals and a measure of the explanatory power of covariates in regression. Annals of Statistics 23, 1443–1473.

[5] Doksum, K. A., Schafer, C., 2006. Powerful choices: Tuning parameter selection based on power. In: Fan, J., Koul, H. L. (Eds.), Frontiers in Statistics: Dedicated to Peter Bickel. Imperial College Press, 113–141.

[6] Friedman, J., 1991. Multivariate adaptive regression splines. The Annals of Statistics 19(1), 1–67.

[7] Gao, J., Gijbels, I., 2008. Bandwidth selection in nonparametric kernel testing. Journal of the American Statistical Association 103(484), 1584–1594.

[8] Hall, P., 1992. The Bootstrap and Edgeworth Expansion. Springer, New York.

[9] Huang, L.-S., Chen, J., 2008. Analysis of variance, coefficient of determination and F-test for local polynomial regression. Annals of Statistics 36, 2085–2109.

[10] Jing, B.-Y., Shao, Q.-M., Wang, Q., 2003. Self-normalized Cramér-type large deviations for independent random variables. Annals of Probability 31, 2167–2215.

[11] Lin, D. Y., Zeng, D., 2011. Correcting for population stratification in genomewide association studies. Journal of the American Statistical Association 106, 997–1008.

[12] Loh, W.-Y., 2009. Improving the precision of classification trees. The Annals of Applied Statistics 3, 1710–1737.

[13] Price, A., Patterson, N., Plenge, R., Weinblatt, M., Shadick, N., Reich, D., 2006. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38(8), 904–909.

[14] Qiao, Z., Zhou, L., Huang, J., 2008. Linear discriminant analysis for high-dimensional, low-sample-size data. Special issue of the World Congress on Engineering 2008.

[15] Quinlan, J. R., 1987. Simplifying decision trees. International Journal of Man-Machine Studies 27(3), 221–234.

[16] Schafer, C., Doksum, K. A., 2009. Selecting local models in multiple regression by maximizing power. Metrika 69, 283–304.

[17] Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267–288.

[18] Wang, L., Shen, X., 2007. On L1-norm multiclass support vector machines: Methodology and theory. Journal of the American Statistical Association 102, 583–594.

[19] Zhang, H. H., Liu, Y., Wu, Y., Zhu, J., 2008. Variable selection for the multicategory SVM via adaptive sup-norm regularization. Electronic Journal of Statistics 2, 149–167.
Figure 2: Sample scatter plots of (x1, x2) and (x3, x4) for data simulated from model (4.1). The left panel includes all pairs (x1, x2) with x5 = 0, while the right panel includes all pairs (x3, x4) with x5 = 1. The larger dots in the two plots are within the tube around a tube center, shown as the star.
We now explain the latter in detail for both methods. In CATCH, the algorithm adaptively weighs the effects of predictors by their importance scores when defining the tubes they are conditioned on; therefore, identifying either of the two variables helps to identify the other, whereas an irrelevant variable, even one strongly associated with a relevant variable, is down-weighted iteratively and does not strongly affect the selection of the relevant variable. This explains the slightly decreasing selection probability of X3 in the presence of the strongly correlated X6, and consequently the overall less frequent selection of X3 and X4 compared with X1 and X2, given the correlation between X1 and X4. In GUIDE the difference between the dependent and independent cases is even smaller, because GUIDE is very powerful in detecting local interactions by explicitly including pairwise interactions when selecting splitting variables.
Marginal CATCH and Random Forest do poorly in this case, mainly because X5 cannot be detected, as explained earlier. More specifically, the dependence between Y and X4 when X5 = 1 is the same as that between Y and X2 when X5 = 0. The linear terms of X1 and X2 have opposite signs in the decision function; therefore the marginal effects of X1 and X4 are obscured when X5 is not identified as an important variable.
4.2 Improving The Performance of SVM and RF Classifiers

The criterion used in the CATCH algorithm is not based on classification accuracy. But if we use CATCH to screen out the irrelevant variables and only use the important
Table 4: Simulation results from Example 4.2 with sample size n = 400. For 1 ≤ i ≤ 6, the ith column gives the estimated probability of selection of variable Xi and the standard error based on 100 simulations, where Xi, i = 1, ..., 5, are relevant variables with cor(X1, X4) = .5. X6 is an irrelevant variable that is correlated with X3, with correlation .9. Column 7 gives the mean of the estimated probability of selection and that of the standard error for the irrelevant variables X7, ..., Xd.

Number of selections (100 simulations):

  METHOD            d    X1          X2          X3          X4          X5          X6          Xi, i >= 7
  CATCH            10    0.76(0.04)  0.77(0.04)  0.73(0.04)  0.68(0.05)  0.99(0.01)  0.34(0.05)  0.07(0.02)
                   50    0.78(0.04)  0.77(0.04)  0.74(0.04)  0.63(0.05)  0.92(0.03)  0.57(0.05)  0.09(0.03)
                  100    0.76(0.04)  0.77(0.04)  0.80(0.04)  0.74(0.04)  0.82(0.04)  0.55(0.05)  0.09(0.03)
  Marginal CATCH   10    0.35(0.05)  0.63(0.05)  0.80(0.04)  0.32(0.05)  0.09(0.03)  0.71(0.05)  0.12(0.03)
                   50    0.34(0.05)  0.71(0.05)  0.70(0.05)  0.30(0.05)  0.10(0.03)  0.60(0.05)  0.11(0.03)
                  100    0.34(0.05)  0.68(0.05)  0.75(0.04)  0.30(0.05)  0.10(0.03)  0.59(0.05)  0.10(0.03)
  Random Forest    10    0.29(0.05)  0.48(0.05)  0.55(0.05)  0.20(0.04)  0.30(0.05)  0.47(0.05)  0.05(0.02)
                   50    0.26(0.04)  0.52(0.05)  0.57(0.05)  0.30(0.05)  0.24(0.04)  0.45(0.05)  0.08(0.03)
                  100    0.36(0.05)  0.61(0.05)  0.66(0.05)  0.37(0.05)  0.32(0.05)  0.46(0.05)  0.12(0.03)
  GUIDE            10    0.96(0.02)  0.96(0.02)  0.94(0.02)  0.95(0.02)  0.96(0.02)  0.84(0.04)  0.30(0.05)
                   50    0.80(0.04)  0.92(0.03)  0.88(0.03)  0.79(0.04)  0.72(0.04)  0.68(0.05)  0.12(0.03)
                  100    0.81(0.04)  0.89(0.03)  0.90(0.03)  0.76(0.04)  0.75(0.04)  0.58(0.05)  0.08(0.03)
variables for classification, the performance of other algorithms, such as Support Vector Machines (SVM) and Random Forest Classifiers (RFC), can be significantly improved. In this section we compare the prediction performance of (1) SVM, (2) RFC, (3) CATCH followed by SVM, and (4) CATCH followed by RFC; we label (3) and (4) CATCH+SVM and CATCH+RFC, respectively. We use the "svm" function in the "e1071" package in R with radial kernel to perform the SVM analysis, and the "randomForest" function in the "randomForest" package to perform the RFC analysis. We use simulation experiments to illustrate the improvement obtained by using CATCH prior to classifying Y.
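A Python analog of this screen-then-classify pipeline could be sketched as follows. This is a hypothetical stand-in written for this transcript: the screening score below is a crude difference-of-class-means, not the CATCH score, and no classifier is fit here; in practice `sklearn.svm.SVC` and `sklearn.ensemble.RandomForestClassifier` would play the roles of the R functions `svm` and `randomForest`.

```python
import math
import random

random.seed(4)

# Simulated data: only X_0 and X_1 are relevant to Y (logistic in x0 + x1).
n, d = 400, 20
train = []
for _ in range(n):
    x = [random.gauss(0.0, 1.0) for _ in range(d)]
    p1 = 1.0 / (1.0 + math.exp(-(x[0] + x[1])))
    train.append((x, 1 if random.random() < p1 else 0))

def score(j):
    # Crude importance score: |difference of class means| for X_j.
    g1 = [x[j] for x, y in train if y == 1]
    g0 = [x[j] for x, y in train if y == 0]
    return abs(sum(g1) / len(g1) - sum(g0) / len(g0))

keep = sorted(range(d), key=score, reverse=True)[:2]   # screening step
print(sorted(keep))   # the relevant variables X_0, X_1 should rank highest
```

A classifier (e.g., an SVM with radial kernel) would then be fit on the screened columns only, mirroring the CATCH+SVM and CATCH+RFC pipelines.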
Let the true model be

$$(X, Y) \sim P, \qquad (4.2)$$
(0)) is stochastically
larger in the right tail than the average Mminus1sumMa=1 ζj(x
lowasta) That is we assume that
there exist t gt 0 and n0 such that for n ge n0
p(sj gt t) le P (s(0)j gt t) j gt d1 (77)
The existence of such a point x(0) is established by considering the probability limit of
arg maxζj(xlowasta) xlowasta isin xlowast1 middot middot middot xlowastM
Note by Lemma (61) s(0)j rarrp 0 as n rarr infin It follows that we have established
24
(74) for t and d0 = d minus d1 that are fixed as n rarr infin To examine the interesting casewhere d0 rarr infin as n rarr infin we need large deviation results which we turn to next Inthe d0 rarrinfin case we consider t equiv tn that tend to infin as nrarrinfin and we use the weakerassumption there exists x(0) and n0 such that
P (sj gt tn) le P (s(0)j gt tn) (78)
for n ge n0 and each tn with limnrarrinfin tn = infin We also assume that the number ofcolumns C and rows R in the contingency table that define the efficacy χ2
jn given by
(64) stays fixed as nrarrinfin In this case P (s(0)j gt tn)rarr 0 provided each term
[T (j)rc ]2 equiv
[nrcErcradicnradicErc
]2in s2j = χ2
jn defined for Xj by (64) satisfies
P(|T (j)rc | gt tn
)minusrarr 0
This is because
P
radicsumr
sumc
[T(j)rc ]2 gt tn
le P(RC max
rc[T (j)rc ]2 gt t2n
)le RC maxP
(|T (j)rc | gt tn
radicRC)
and tn and tnradicRC are of the same order as n rarr infin By large deviation theory Hall
[8] and Jing et al [10] for some constant A
P (|T (j)rc | gt tn) =
2tminus1nradic2πeminust
2n2[1 + At3nn
minus 12 + o
(t3nnminus 1
2
)](79)
provided tn = o(n
16
) If (79) is uniform in j gt d1 then (76) and (79) implies that
type I consistency holds if
d0
(tnradic
2
)minus1eminus
t2n2 rarr 0 (710)
To examine (710) write d0 and tn as d0 = exp(nb)
and tn =radic
2nr By large deviationtheory (710) holds when r lt 1
6 Thus if 0 lt 1
2b lt r lt 1
6 then
d0
(tnradic
2
)minus1eminus
t2n2 = nminusr exp
(nb minus n2r
)rarr 0
25
Thus d0 = exp(nb) with b = 13ε for any ε gt 0 is the largest d0 can be to have consistency
For instance if there are exactly d0 = exp(n
14
)irrelevant variables and we use the
threshold tn =radic
2n17 the probability of selecting any of the irrelevant variables tends
to zero at the ratenminus
14 exp
(n
14 minus n
27
)72 Type II consistency
We now assume that for each relevant variable Xj there is a least favorable point x(1)
such that the local efficacy s(1)j equiv ζ(x(1)) is stochastically smaller than the importance
score sj = Mminus1sumMa=1 ζ(xlowasta) That is we assume that for each relevant variable Xj there
exists n1 and x(1) such that for n ge n1
P(s(1)j le tn
)ge P (sj le tn) j le d1 (711)
The existence of such a least favorable x(1) is established by considering the probabilitylimit of
arg minζj(xlowasta) xlowasta isin xlowast1 middot middot middot xlowastM
We again assume that R and C are fixed so that to show type II consistency it isenough to show that for each (r c) and j le d1
P(|T (j)rc | le tn
)rarr 0
This is because
P
(radicsumr
sumc
[T
(j)rc
]2le tn
)le P (minrc
[T
(j)rc
]2le t2n)
le RC maxrc P(|T (j)rc | le tn
)By Theorem 62
radicn(s(1)j minus τj
)rarrd N(0 ν2) τj gt 0 It follows that if tn and d1
are bounded above as n rarr infin and τj gt 0 then P(s(1)j le tn
)rarr 0 Thus type II
consistency holds in this case To examine the more interesting case where tn and d1are not bounded above and τj may tend to zero as nrarrinfin we will need to center the
T(j)rc Thus set
θ(j)n =
radicn(p(j)rc minus p(j)r p
(j)c
)radicp(j)r p
(j)c
(712)
26
Tn(j) = T (j)rc minus θ(j)n
where for simplicity the dependence of T(j)n and θ
(j)n on r and c is not indicated
Lemma 71
P
(minjled1|T (j)rc | le tn
)le
d1sumj=1
P
(|T (j)n | ge min
jled1θ(j)n minus tn
)Proof
P(
minjled1 |T(j)rc | le tn
)= P
(minjled1 |T
(j)n + θ
(j)n | le tn
)le P
(minjled1 |T
(j)n | minusmaxjled1 |θ
(j)n | le tn
)= P
(maxjled1 |T
(j)n | ge minjled1 |θ
(j)n | minus tn
)lesumd1
j=1 P(|T (j)n | ge minjled1 |θ
(j)n | minus tn
)
Setan = min
jled1
∣∣θ(j)n ∣∣minus tnthen by large deviation theory if an = o
(n
16
)
P (∣∣T (j)n
∣∣ ge an) prop
(anradic
2
)minus1expminusa2n2 (713)
where prop signifies asymptotic order as nrarrinfin an rarrinfin It follows that if (713) holdsuniformly in j then by Lemma 71
P
(minjled1
∣∣T (j)rc
∣∣ le tn
)prop d1
(anradic
2
)minus1expminusa2n2
Thus type II consistency holds when an = o(n
16
)and
d1
(anradic
2
)minus1expminusa2n2 rarr 0 (714)
In Section 71 tn is of order nr 0 lt r lt 16 For this tn type II consistency holds if an
is of order nr 0 lt r lt 16 and
d1nminusr expminusn2r2 rarr 0
27
If an = O(nk) with k ge 16 then P (
∣∣∣T (j)(n)
∣∣∣ ge an) tends to zero at a rate faster than in
(713) and consistency is ensured For instance if all relevant variables Xj have p(j)rc
fixed as nrarrinfin then minjled1
∣∣∣θ(j)n ∣∣∣ is of orderradicn and an is of order n
12 minus tn
73 Type I and II consistency
We see from Sections 71 and 72 that consistency holds under a variety of conditionsOne set of conditions is
Theorem 71 Under the regularization conditions of Sections 71 and 72 if
d0 = c0 exp(nb) tn = c1nr
minjled1 θ(j) minus tn = c2nr d1 = c3 exp
(nb)
for positive constants c0 c1 c2 and c3 and 0 lt 12b lt r lt 1
6 then CATCH is consistent
AppendixProof of Theorem 61
(1) Let Nh =
sumni=1 I(X(i) isin Nh(x
(0))) Then Nh sim Bi(n
int x(0)+hx(0)minush f(x)dx) and
Nh rarras infin as nrarrinfin
ζ(x(0) h) = nminus12radicX 2(x(0) h)
= (Nh n)12(N
h )minus12radicX 2(x(0) h)
Since Nh sim Bi(n
int x(0)+hx(0)minush f(x)dx) we have
Nh nrarr
int x(0)+h
x(0)minushf(x)dx (715)
By Lemma 61 and Table (1)
(Nh )minus12
radicX 2(x(0) h)rarr
( sumr=+minus
Csumc=1
(prc minus prqc)2
prqc
) 12
(716)
where
p+c = Pr(X gt x0 Y = c|X isin Nh(x(0))) pminusc = Pr(X le x0 Y = c|X isin Nh(x
(0)))
28
for r = +minus and c = 1 middot middot middot C
pr =Csumc=1
prc qc =sumr=+minus
prc
Actually p+ = Pr(X gt x0|X isin Nh(x(0))) pminus = Pr(X le x0|X isin Nh(x
(0)))qc = Pr(Y = c|X isin Nh(x
(0)))
By (715) and (716)
ζ(x(0) h)rarr(int x(0)+h
x(0)minushf(x)dx
) 12( sumr=+minus
Csumc=1
(prc minus prqc)2
prqc
) 12
equiv ζ(x(0) h)
(2) Since X and Y are independent prc = prqc for r = +minus and c = 1 middot middot middot C thenζ(x(0) h) = 0 for any h Hence
ζ(x(0)) = suphgt0ζ(x(0) h) = 0
(3) Recall that pc(x(0)) = Pr(Y = c|x(0)) Without loss of generality assume pprimec(x
(0)) gt0 Then there exists h such that
Pr(Y = c|x(0) lt X lt x(0) + h) gt Pr(Y = c|x(0) minus h lt X lt x(0))
which is equivelant to p+cp+ gt pminuscpminus This shows that ζ(x(0)) gt 0 since byLemma 61 ζ(x(0)) = 0 results in p+cp+ = pminuscpminus = pc
References
[1] Breiman L 2001 Random forests Machine Learning 45 5ndash32
[2] Breiman L Friedman J H Olshen R A Stone J 1984 Classification andregression trees Wadsworth Belmont
[3] Doksum K Tang S Tsui K-W 2008 Nonparametric variable selection Theearth algorithm Journal of the American Statistical Association 103(484) 1609ndash1620
[4] Doksum K A Samarov A 1995 Nonparametric estimation of global functionalsand a measure of explanatory power of covariates in regression Annals of Statistics23 1443ndash1473
29
[5] Doksum K A Schafer C 2006 Powerful choices Tuning parameter selectionbased on power Frontiers in Statistics Dedicated to Peter Bickel Editors J Fanand H L Koul Imperial College Press 113ndash141
[6] Friedman J 1991 Multivariate adaptive regression splines The annals of statis-tics 1ndash67
[7] Gao J Gijbels I 2008 Bandwidth selection in nonparametric kernel testingJournal of the American Statistical Association 103 (484) 1584ndash1594
[8] Hall P 1992 The Bootstrap and Edgeworth Expansions Springer New York
[9] Huang L-S Chen J 2008 Analysis of variance coefficient of determinationand f-test for local polynomial regression Annals of Statistics 36 2085ndash2109
[10] Jing B-Y Shao Q-M Wang Q 2003 Self-normalized cramer-type large devi-ations for independent random variables Annals of Probability 31 2167ndash2215
[11] Lin D Y Zeng D 2011 Correcting for population stratification in genomewideassociation studies Journal of the American Statistical Association 106 997ndash1008
[12] Loh W-Y 2009 Improve the precision of classification trees The Annals ofApplied Statistics 3 1710ndash1737
[13] Price A Patterson N Plenge R Weinblatt M Shadick N Reich D 2006Principal components analysis corrects for stratification in genome-wide associationstudies Nature genetics 38 (8) 904ndash909
[14] Qiao Z Zhou L Huang J 2008 Linear discriminant analysis for high dimen-sional Low Sample Size Data special issue of World Congress on Engineering2008
[15] Quinlan J R 1987 Simplifying decision trees International Journal of Man-Machine Studies 27(3) 221ndash234
[16] Schafer C Doksum K A 2009 Selecting local models in multiple regression bymaximizing power Metrika 69 283ndash304
[17] Tibshirani R 1996 Regression shrinkage and selection via the lasso J RoyalStatist Soc B 58 267ndash288
[18] Wang L Shen X 2007 On l1-norm multi-class support vector machinesmethodology and theory Journal of the American Statistical Association 102 583ndash594
[19] Zhang H H Liu Y Wu Y Zhu J 2008 Variable selection for the multicate-gory svm via adaptive sup-norm regularization Electronic Journal of Statistics 2149ndash167
30
Table 4: Simulation results from Example 4.2 with sample size n = 400. For 1 ≤ i ≤ 6, the ith column gives the estimated probability of selection of variable X_i and the standard error, based on 100 simulations, where X_i, i = 1,...,5, are relevant variables with cor(X_1, X_4) = .5. X_6 is an irrelevant variable that is correlated with X_3 with correlation .9. Column 7 gives the mean of the estimated probability of selection and that of the standard error for the irrelevant variables X_7,...,X_d.

Number of Selections (100 simulations)

METHOD             d    X1          X2          X3          X4          X5          X6          Xi, i >= 7
CATCH             10    0.76(0.04)  0.77(0.04)  0.73(0.04)  0.68(0.05)  0.99(0.01)  0.34(0.05)  0.07(0.02)
                  50    0.78(0.04)  0.77(0.04)  0.74(0.04)  0.63(0.05)  0.92(0.03)  0.57(0.05)  0.09(0.03)
                 100    0.76(0.04)  0.77(0.04)  0.80(0.04)  0.74(0.04)  0.82(0.04)  0.55(0.05)  0.09(0.03)
Marginal CATCH    10    0.35(0.05)  0.63(0.05)  0.80(0.04)  0.32(0.05)  0.09(0.03)  0.71(0.05)  0.12(0.03)
                  50    0.34(0.05)  0.71(0.05)  0.70(0.05)  0.30(0.05)  0.10(0.03)  0.60(0.05)  0.11(0.03)
                 100    0.34(0.05)  0.68(0.05)  0.75(0.04)  0.30(0.05)  0.10(0.03)  0.59(0.05)  0.10(0.03)
Random Forest     10    0.29(0.05)  0.48(0.05)  0.55(0.05)  0.20(0.04)  0.30(0.05)  0.47(0.05)  0.05(0.02)
                  50    0.26(0.04)  0.52(0.05)  0.57(0.05)  0.30(0.05)  0.24(0.04)  0.45(0.05)  0.08(0.03)
                 100    0.36(0.05)  0.61(0.05)  0.66(0.05)  0.37(0.05)  0.32(0.05)  0.46(0.05)  0.12(0.03)
GUIDE             10    0.96(0.02)  0.96(0.02)  0.94(0.02)  0.95(0.02)  0.96(0.02)  0.84(0.04)  0.30(0.05)
                  50    0.80(0.04)  0.92(0.03)  0.88(0.03)  0.79(0.04)  0.72(0.04)  0.68(0.05)  0.12(0.03)
                 100    0.81(0.04)  0.89(0.03)  0.90(0.03)  0.76(0.04)  0.75(0.04)  0.58(0.05)  0.08(0.03)
variables for classification, the performance of other algorithms such as Support Vector Machines (SVM) and Random Forest Classifiers (RFC) can be significantly improved. We compare the prediction performance of (1) SVM, (2) RFC, (3) CATCH followed by SVM, and (4) CATCH followed by RFC in this section. We label (3) and (4) CATCH+SVM and CATCH+RFC, respectively. In this section we use the "svm" function in the "e1071" package in R with radial kernel to perform the SVM analysis, and the "randomForest" function in the "randomForest" package to perform the RFC analysis. We use simulation experiments to illustrate the improvement obtained by using CATCH prior to classifying Y.
Let the true model be
$$(X, Y) \sim P, \quad (4.2)$$
where P will be specified later. In each of 100 Monte Carlo experiments, n pairs $(X^{(i)}, Y^{(i)})$, $i = 1, \ldots, n$, are generated from the true model (4.2). Based on the simulated data $(X^{(i)}, Y^{(i)})$, $i = 1, \ldots, n$, the methods (1) SVM, (2) RFC, (3) CATCH+SVM, and (4) CATCH+RFC are used to classify a future $Y^{(0)}$ for which we have available a corresponding $X^{(0)}$.
The integrated misclassification rate (IMR; Friedman [6]), defined below, is used to evaluate the performance of the classification algorithms. Let $F_{\mathbf{X},\mathbf{Y}}(\cdot)$ be the classifier constructed by a classification algorithm based on $\mathbf{X} = (X^{(1)}, \ldots, X^{(n)})$ and $\mathbf{Y} = (Y^{(1)}, \ldots, Y^{(n)})^T$.

Let $(X^{(0)}, Y^{(0)})$ be a future observation with the same distribution as $(X^{(i)}, Y^{(i)})$. We only know $X^{(0)} = x^{(0)}$ and want to classify $Y^{(0)}$. For a possible future observation $(x^{(0)}, y^{(0)})$, define
$$MR(\mathbf{X}, \mathbf{Y}, x^{(0)}, y^{(0)}) = P(F_{\mathbf{X},\mathbf{Y}}(x^{(0)}) \ne y^{(0)} \mid \mathbf{X}, \mathbf{Y}). \quad (4.3)$$
Our criterion is
$$IMR = E\, E_0[MR(\mathbf{X}, \mathbf{Y}, x^{(0)}, y^{(0)})], \quad (4.4)$$
where $E_0$ is with respect to the distribution of the random $(x^{(0)}, y^{(0)})$ and $E$ is with respect to the distribution of $(\mathbf{X}, \mathbf{Y})$.

This double expectation is evaluated using Monte Carlo simulation; the result is denoted IMR*. More precisely, $E_0$ is estimated using 5000 simulated $(x^{(0)}, y^{(0)})$ observations; then IMR* is computed on the basis of 100 simulated samples $(\mathbf{X}_i, \mathbf{Y}_i)$, $i = 1, \ldots, 100$.
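The nested expectation in (4.3)-(4.4) can be approximated by a two-level Monte Carlo loop: fit a classifier on each simulated training sample, estimate its misclassification rate on fresh $(x^{(0)}, y^{(0)})$ draws, and average over training samples. The sketch below is illustrative only: the toy data model `draw` and the threshold classifier `fit` are hypothetical stand-ins for P and for SVM/RFC, and the loop sizes are smaller than the paper's 100 and 5000.

```python
import random

def draw(n, rng):
    # hypothetical data-generating model standing in for P in (4.2)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [1 if x + rng.gauss(0.0, 0.5) > 0 else 0 for x in xs]
    return xs, ys

def fit(xs, ys):
    # trivial stand-in classifier: threshold at the midpoint of the class means
    n1 = max(1, sum(ys))
    n0 = max(1, len(ys) - sum(ys))
    m1 = sum(x for x, y in zip(xs, ys) if y == 1) / n1
    m0 = sum(x for x, y in zip(xs, ys) if y == 0) / n0
    t = 0.5 * (m0 + m1)
    return lambda x: 1 if x > t else 0

def imr_star(n=200, n_train_reps=20, n_test=1000, seed=0):
    rng = random.Random(seed)
    rates = []
    for _ in range(n_train_reps):      # outer expectation E over (X, Y)
        clf = fit(*draw(n, rng))
        x0s, y0s = draw(n_test, rng)   # inner expectation E0 over (x0, y0)
        mr = sum(clf(x) != y for x, y in zip(x0s, y0s)) / n_test
        rates.append(mr)
    return sum(rates) / len(rates)

print(round(imr_star(), 3))
```

Replacing `draw` and `fit` with a real model and a real classifier gives the IMR* curves plotted in Figures 3 and 4.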
Example 4.3. First consider model (4.1) in Section 4.1. Figure 3 shows IMR* versus the number of irrelevant variables for this example. The left panel shows that the IMR*'s of the SVM approach increase dramatically as the number of irrelevant variables increases, while the IMR*'s of CATCH+SVM are stable. Thus CATCH stabilizes the performance of SVM for this complex model. The right panel shows that CATCH also stabilizes the performance of RFC. These results indicate that CATCH can improve the prediction accuracy of other classification algorithms.
[Figure 3: IMR*'s of SVM, RFC, CATCH+SVM, and CATCH+RFC versus the number of predictors d for the simulation study in model (4.1) in Example 4.1. The sample size is n = 400. Left panel: SVM vs. CATCH+SVM; right panel: RFC vs. CATCH+RFC; the vertical axis is the error rate.]
Example 4.4 (Contaminated Logistic Model). Consider a binary response $Y \in \{0, 1\}$ with
$$P(Y = 0 \mid X = x) = \begin{cases} F(x) & \text{if } r = 0, \\ G(x) & \text{if } r = 1, \end{cases} \quad (4.5)$$
where $x = (x_1, \ldots, x_d)$,
$$r \sim \mathrm{Bernoulli}(\gamma), \quad (4.6)$$
$$F(x) = \left(1 + \exp\{-(0.25 + 0.5 x_1 + x_2)\}\right)^{-1}, \quad (4.7)$$
$$G(x) = \begin{cases} 0.1 & \text{if } 0.25 + 0.5 x_1 + x_2 < 0, \\ 0.9 & \text{if } 0.25 + 0.5 x_1 + x_2 \ge 0, \end{cases} \quad (4.8)$$
and $x_1, x_2, \ldots, x_d$ are realizations of independent standard normal random variables. Hence, given $X^{(i)} = x^{(i)}$, $1 \le i \le n$, the joint distribution of $Y^{(1)}, Y^{(2)}, \ldots, Y^{(n)}$ is the contaminated logistic model with contamination parameter $\gamma$:
$$(1 - \gamma) \prod_{i=1}^n [F(x^{(i)})]^{Y^{(i)}} [1 - F(x^{(i)})]^{1 - Y^{(i)}} + \gamma \prod_{i=1}^n [G(x^{(i)})]^{Y^{(i)}} [1 - G(x^{(i)})]^{1 - Y^{(i)}}. \quad (4.9)$$
We perform a simulation study for n = 200, d = 50, and $\gamma$ = 0, 0.2, 0.4, 0.6, 0.8, and 1, where $\gamma = 0$ corresponds to no contamination. Figure 4 shows the IMR*'s versus the different values of $\gamma$ for this example. CATCH improves the performance of the SVM and Random Forest classifiers for this contaminated logistic regression model.
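Note that under (4.9) the contamination indicator r is drawn once per data set: with probability $1 - \gamma$ the whole sample follows the logistic model F, and with probability $\gamma$ it follows G. A minimal generator along these lines is sketched below; the function name `simulate` and the seeding convention are ours, not from the paper.

```python
import math
import random

def simulate(n=200, d=50, gamma=0.2, seed=0):
    """Draw one data set (X, Y) from the contaminated logistic model (4.5)-(4.9)."""
    rng = random.Random(seed)
    r = 1 if rng.random() < gamma else 0        # contamination indicator (4.6)
    X, Y = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        s = 0.25 + 0.5 * x[0] + x[1]
        if r == 0:
            p0 = 1.0 / (1.0 + math.exp(-s))     # F(x) in (4.7), P(Y = 0 | x)
        else:
            p0 = 0.1 if s < 0 else 0.9          # G(x) in (4.8)
        X.append(x)
        Y.append(0 if rng.random() < p0 else 1)
    return X, Y

X, Y = simulate()
print(len(X), len(Y))
```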
[Figure 4: IMR*'s of SVM, RFC, CATCH+SVM, and CATCH+RFC versus different contamination parameters γ ∈ {0, 0.2, 0.4, 0.6, 0.8, 1} for simulation model (4.9) in Example 4.4. Left panel: SVM vs. CATCH+SVM; right panel: RFC vs. CATCH+RFC; the vertical axis is the error rate.]
5 CATCH Analysis of The ANN Data

In this section CATCH is applied to a data set available at the UCI database (http://ftp.ics.uci.edu/pub/machine-learning-databases/thyroid-disease). The data are from a clinical trial to determine whether a patient is hypothyroid (see, e.g., Quinlan [15]). The dataset is split into two parts: a training set with 3772 observations and a testing set with 3428 observations. The response is a categorical variable with three classes: normal (not hypothyroid), hyperfunction, and subnormal functioning. There are 21 predictors, including 15 categorical predictors and 6 numerical predictors. The three classes are very imbalanced, with 92.5% of the patients in the normal class. A good classifier should produce a significant decrease from the 7.5% error rate that can be obtained by classifying all observations to the normal class.

CATCH identifies 7 variables as important, including 4 continuous variables and 3 categorical variables. We consider three classification methods: 1. Support Vector Machine with radial kernel, denoted by SVM(r); 2. Support Vector Machine with polynomial kernel, denoted by SVM(p); 3. RFC. We apply these three methods on the training set to build classification models and use the test set to estimate the misclassification rate of the methods. We apply the methods with CATCH and without CATCH, respectively, where "with CATCH" means that only the 7 important variables identified by CATCH are used to build the model and "without CATCH" means that all 21 variables are used in the analysis.
Table 5: Misclassification rates of three classification methods (SVM(r), SVM(p), RFC) with CATCH and without CATCH.

              SVM(r)   SVM(p)   RFC
w/o CATCH     0.052    0.063    0.024
with CATCH    0.037    0.055    0.015

Table 5 shows that if CATCH is used to select important variables before the classification methods are applied, the misclassification rates decrease. CATCH reduces the misclassification rates by 29%, 13%, and 37% for SVM(r), SVM(p), and RFC, respectively. CATCH+RFC has the best performance. Although CATCH is not designed to improve classification accuracy, it can nonetheless help other classification methods perform better.
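The "with CATCH / without CATCH" comparison can be mimicked in a few lines: screen variables with an importance score, then classify using only the kept coordinates. Everything below is a toy stand-in under stated assumptions: the score is a crude absolute difference of class means rather than the CATCH score, the classifier is nearest-centroid rather than SVM/RFC, and only coordinate 0 of the simulated data is informative.

```python
import random

def gen(n, d, rng):
    # toy data: only coordinate 0 is related to the class label
    ys = [rng.randint(0, 1) for _ in range(n)]
    xs = [[(1.5 if (j == 0 and y == 1) else 0.0) + rng.gauss(0.0, 1.0)
           for j in range(d)] for y in ys]
    return xs, ys

def centroid(xs):
    return [sum(x[j] for x in xs) / len(xs) for j in range(len(xs[0]))]

def nc_error(xtr, ytr, xte, yte, cols):
    # nearest-centroid classifier restricted to the columns in `cols`
    c0 = centroid([x for x, y in zip(xtr, ytr) if y == 0])
    c1 = centroid([x for x, y in zip(xtr, ytr) if y == 1])
    err = 0
    for x, y in zip(xte, yte):
        d0 = sum((x[j] - c0[j]) ** 2 for j in cols)
        d1 = sum((x[j] - c1[j]) ** 2 for j in cols)
        err += (0 if d0 < d1 else 1) != y
    return err / len(yte)

rng = random.Random(1)
d = 100
xtr, ytr = gen(400, d, rng)
xte, yte = gen(2000, d, rng)

def score(j):
    # crude importance score: |difference of class means| (not the CATCH score)
    a = [x[j] for x, y in zip(xtr, ytr) if y == 0]
    b = [x[j] for x, y in zip(xtr, ytr) if y == 1]
    return abs(sum(a) / len(a) - sum(b) / len(b))

kept = [j for j in range(d) if score(j) > 0.5]   # threshold rule on the scores
all_err = nc_error(xtr, ytr, xte, yte, range(d))
sel_err = nc_error(xtr, ytr, xte, yte, kept)
print(kept, round(all_err, 3), round(sel_err, 3))
```

With 99 pure-noise coordinates, the screened classifier typically matches or beats the all-variables classifier, mirroring the pattern in Table 5.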
6 Properties of Local Contingency Efficacy

In this section we investigate the properties of the local contingency efficacy. For a univariate continuous predictor, Theorem 6.1 below shows that the local contingency efficacy is well defined and equals 0 if the response variable is independent of the predictor. Consider an $R \times C$ contingency table. Let $n_{rc}$, $r = 1, \ldots, R$, $c = 1, \ldots, C$, be the observed frequency in the $(r, c)$-cell, and let $p_{rc}$ be the probability that an observation belongs to the $(r, c)$-cell. Let
$$p_r = \sum_{c=1}^C p_{rc}, \qquad q_c = \sum_{r=1}^R p_{rc}, \quad (6.1)$$
and
$$n_{r\cdot} = \sum_{c=1}^C n_{rc}, \quad n_{\cdot c} = \sum_{r=1}^R n_{rc}, \quad n = \sum_{r=1}^R \sum_{c=1}^C n_{rc}, \quad E_{rc} = n_{r\cdot}\, n_{\cdot c} / n. \quad (6.2)$$
For testing that the rows and the columns are independent, i.e.,
$$H_0: \forall r, c,\ p_{rc} = p_r q_c \quad \text{versus} \quad H_1: \exists r, c,\ p_{rc} \ne p_r q_c, \quad (6.3)$$
a standard test statistic is
$$\mathcal{X}^2 = \sum_{r=1}^R \sum_{c=1}^C \frac{(n_{rc} - E_{rc})^2}{E_{rc}}. \quad (6.4)$$
Under $H_0$, $\mathcal{X}^2$ converges in distribution to $\chi^2_{(R-1)(C-1)}$. Moreover, defining $0/0 = 0$, we have the well-known result:

Lemma 6.1. As n tends to infinity, $\mathcal{X}^2/n$ tends in probability to
$$\tau^2 \equiv \sum_{r=1}^R \sum_{c=1}^C \frac{(p_{rc} - p_r q_c)^2}{p_r q_c}.$$
Moreover, if $p_r q_c$ is bounded away from 0 and 1 as $n \to \infty$, and if R and C are fixed as $n \to \infty$, then $\tau^2 = 0$ if and only if $H_0$ is true.
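Lemma 6.1 is easy to check numerically: for counts drawn from a fixed cell distribution, $\mathcal{X}^2/n$ settles down to $\tau^2$. The sketch below computes the statistic (6.4) and simulates a dependent 2x2 table for which a direct calculation gives $\tau^2 = 0.04$; the helper names are ours.

```python
import random

def chi_square(table):
    """Pearson X^2 of (6.4) for an R x C table of counts n_rc."""
    R, C = len(table), len(table[0])
    row = [sum(table[r]) for r in range(R)]
    col = [sum(table[r][c] for r in range(R)) for c in range(C)]
    n = sum(row)
    x2 = 0.0
    for r in range(R):
        for c in range(C):
            e = row[r] * col[c] / n          # E_rc = n_r. * n_.c / n
            x2 += (table[r][c] - e) ** 2 / e
    return x2

# Dependent 2x2 cells: p_00 = p_11 = 0.3, p_01 = p_10 = 0.2, so
# p_r = q_c = 0.5 and tau^2 = 4 * (0.05)^2 / 0.25 = 0.04.
rng = random.Random(0)
n = 50000
table = [[0, 0], [0, 0]]
for _ in range(n):
    r = rng.randint(0, 1)
    c = r if rng.random() < 0.6 else 1 - r   # column depends on row
    table[r][c] += 1
print(round(chi_square(table) / n, 3))       # close to tau^2 = 0.04
```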
Remark 6.1. Lemma 6.1 shows that the contingency efficacy $\zeta$ in (2.8) is well defined. Moreover, $\zeta = 0$ if and only if X and Y are independent.

Theorem 6.1. Consider a categorical response Y and a continuous predictor X with density function f(x) such that, for a fixed h > 0, f(x) > 0 for $x \in (x^{(0)} - h, x^{(0)} + h)$. Then:

(1) For fixed h, the estimated local section efficacy $\hat\zeta(x^{(0)}, h)$ defined in (2.5) converges almost surely to the section efficacy $\zeta(x^{(0)}, h)$ defined in (2.4).

(2) If X and Y are independent, then the local efficacy $\zeta(x^{(0)})$ defined in (2.4) is zero.

(3) If there exists $c \in \{1, \ldots, C\}$ such that $p_c(x) = P(Y = c \mid x)$ has a continuous derivative at $x = x^{(0)}$ with $p_c'(x^{(0)}) \ne 0$, then $\zeta(x^{(0)}) > 0$.
The $\chi^2$ statistic (6.4) can be written in terms of multinomial proportions $\hat p_{rc}$, with $\sqrt{n}(\hat p_{rc} - p_{rc})$, $1 \le r \le R$, $1 \le c \le C$, converging in distribution to a multivariate normal. By Taylor expansion of $T_n \equiv \sqrt{\chi^2/n}$ at $\tau = \mathrm{plim}\, \sqrt{\chi^2/n}$, we find:

Theorem 6.2.
$$\sqrt{n}(T_n - \tau) \to_d N(0, v^2), \quad (6.5)$$
where the asymptotic variance $v^2$, which can be obtained from the Taylor expansion, is not needed in this paper.

Remark 6.2. The result (6.5) applies to the multivariate $\chi^2$-based efficacies $\hat\zeta$, with n replaced by the number of observations that go into the computation of $\hat\zeta$. In this case we use the notation $\chi^2_j$ for the chi-square statistic for the jth variable.
7 Consistency Of CATCH

Let Y denote a variable to be classified into one of the categories $1, \ldots, C$ by using the $X_j$'s in the vector $X = (X_1, \ldots, X_d)$. Let $X_{-j}$ denote X without the $X_j$ component. Assume that the squared Pearson nonparametric correlation (Doksum and Samarov [4]; Huang and Chen [9]) between $X_j$ and $X_{-j}$ is less than 0.5. We call the variable $X_j$ relevant for Y if, conditionally given $X_{-j}$, the law $\mathcal{L}(Y \mid X_j = x)$ is not constant in x, and irrelevant otherwise. Our data are $(X^{(1)}, Y^{(1)}), \ldots, (X^{(n)}, Y^{(n)})$, i.i.d. as (X, Y). A variable selection procedure is consistent if the probability that all the relevant variables are selected and none of the irrelevant variables are selected tends to one as n tends to infinity.

We give a general nonparametric framework in which the variable selection rule based on the CATCH importance scores $s_j$ given in (3.8) is consistent.

When the predictors are not all categorical and CATCH involves tubes, we assume that the number of observations in each tube tends to $\infty$ as $n \to \infty$, and we assume that for those importance scores that require the selection of a section size h, the selection is from a finite set $\{h_j\}$ with $\min_j \{h_j\} > h_0$ for some $h_0 > 0$. It follows that the number of observations $m_{jk}$ in the kth tube used to calculate the importance score $s_j$ for the variable $X_j$ tends to infinity as $n \to \infty$.
The key quantities that determine consistency are the limits $\lambda_j$ of $s_j$. That is,
$$\lambda_j = \tau(\zeta_j, P) = E_P[\zeta_j(X)] = \int \zeta_j(x)\, dP(x), \quad j = 1, \ldots, d,$$
where $\zeta_j(x)$ is the local contingency efficacy defined by (2.4) for numerical X's and by (2.8) for categorical X's, and P is the probability distribution of X. Our estimate $s_j$ of $\tau_j$ is
$$s_j = \tau(\hat\zeta_j, \hat P) = \frac{1}{M} \sum_{k=1}^M \hat\zeta_j(X^*_k),$$
where $\hat\zeta_j$ is defined in Section 3 and $\hat P$ is the empirical distribution of X.

Let $d_0$ denote the number of X's that are irrelevant for Y. Set $d_1 = d - d_0$ and, without loss of generality, reorder the $\lambda_j$ so that
$$\lambda_j > 0 \ \text{if } j = 1, \ldots, d_1; \qquad \lambda_j = 0 \ \text{if } j = d_1 + 1, \ldots, d. \quad (7.1)$$
Our variable selection rule is
$$\text{"keep variable } X_j \text{ if } s_j \ge t\text{" for some threshold } t. \quad (7.2)$$
Rule (7.2) is consistent if and only if
$$P\left(\min_{j \le d_1} s_j \ge t \ \text{and} \ \max_{j > d_1} s_j < t\right) \to 1. \quad (7.3)$$
To establish (7.3), it is enough to show that
$$P\left(\max_{j > d_1} s_j > t\right) \to 0 \quad (7.4)$$
and
$$P\left(\min_{j \le d_1} s_j \le t\right) \to 0. \quad (7.5)$$
We call (7.4) and (7.5) type I and type II consistency, respectively. Type I and II error probabilities refer to the probabilities of any false positives and any false negatives, respectively.
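Rule (7.2) and the consistency event in (7.3) can be illustrated directly: simulate scores that concentrate near $\lambda_j > 0$ for relevant variables and near 0 for irrelevant ones, and count how often a threshold t separates them exactly. The score distributions below are hypothetical placeholders, not the CATCH scores.

```python
import random

def select(scores, t):
    """Rule (7.2): keep X_j iff its importance score s_j >= t."""
    return [j for j, s in enumerate(scores) if s >= t]

rng = random.Random(0)
d1, d0, t, reps = 5, 20, 0.5, 500
ok = 0
for _ in range(reps):
    rel = [1.0 + rng.gauss(0.0, 0.1) for _ in range(d1)]   # lambda_j = 1 > 0
    irr = [abs(rng.gauss(0.0, 0.1)) for _ in range(d0)]    # lambda_j = 0
    kept = select(rel + irr, t)
    ok += kept == list(range(d1))   # all relevant kept, no irrelevant kept
print(ok / reps)
```

As the scores concentrate around their limits, the frequency of the event in (7.3) approaches one, which is what the consistency argument formalizes.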
7.1 Type I consistency

We consider (7.4) first and note that
$$P\left(\max_{j > d_1} s_j > t\right) \le \sum_{j = d_1 + 1}^d P(s_j > t). \quad (7.6)$$
For the jth irrelevant variable $X_j$, $j > d_1$, the results in Section 6 apply to the local efficacy $\hat\zeta_j(x)$ at a fixed point x. The importance score $s_j$ defined in (3.8) is the average of such efficacies over a sample of M random points $x^*_a$. We assume that there is a point $x^{(0)}$ such that, for irrelevant $X_j$'s and for n large enough, $s^{(0)}_j \equiv \hat\zeta_j(x^{(0)})$ is stochastically larger in the right tail than the average $M^{-1} \sum_{a=1}^M \hat\zeta_j(x^*_a)$. That is, we assume that there exist t > 0 and $n_0$ such that for $n \ge n_0$,
$$P(s_j > t) \le P(s^{(0)}_j > t), \quad j > d_1. \quad (7.7)$$
The existence of such a point $x^{(0)}$ is established by considering the probability limit of
$$\arg\max\{\hat\zeta_j(x^*_a) : x^*_a \in \{x^*_1, \ldots, x^*_M\}\}.$$
Note that by Lemma 6.1, $s^{(0)}_j \to_p 0$ as $n \to \infty$. It follows that we have established (7.4) for t and $d_0 = d - d_1$ that are fixed as $n \to \infty$. To examine the interesting case where $d_0 \to \infty$ as $n \to \infty$, we need large deviation results, which we turn to next. In the $d_0 \to \infty$ case we consider $t \equiv t_n$ that tends to $\infty$ as $n \to \infty$, and we use the weaker assumption that there exist $x^{(0)}$ and $n_0$ such that
$$P(s_j > t_n) \le P(s^{(0)}_j > t_n) \quad (7.8)$$
for $n \ge n_0$ and each $t_n$ with $\lim_{n \to \infty} t_n = \infty$. We also assume that the number of columns C and rows R in the contingency table that defines the efficacy $\chi^2_j/n$ given by (6.4) stays fixed as $n \to \infty$. In this case $P(s^{(0)}_j > t_n) \to 0$ provided each term
$$[T^{(j)}_{rc}]^2 \equiv \left[\frac{n_{rc} - E_{rc}}{\sqrt{n}\,\sqrt{E_{rc}}}\right]^2$$
in $(s^{(0)}_j)^2 = \chi^2_j/n$, defined for $X_j$ by (6.4), satisfies
$$P\left(|T^{(j)}_{rc}| > t_n\right) \to 0.$$
This is because
$$P\left(\sqrt{\sum_r \sum_c [T^{(j)}_{rc}]^2} > t_n\right) \le P\left(RC \max_{r,c} [T^{(j)}_{rc}]^2 > t_n^2\right) \le RC \max_{r,c} P\left(|T^{(j)}_{rc}| > t_n/\sqrt{RC}\right),$$
and $t_n$ and $t_n/\sqrt{RC}$ are of the same order as $n \to \infty$. By large deviation theory (Hall [8] and Jing et al. [10]), for some constant A,
$$P\left(|T^{(j)}_{rc}| > t_n\right) = \frac{2 t_n^{-1}}{\sqrt{2\pi}}\, e^{-t_n^2/2}\left[1 + A t_n^3 n^{-1/2} + o\left(t_n^3 n^{-1/2}\right)\right] \quad (7.9)$$
provided $t_n = o(n^{1/6})$. If (7.9) is uniform in $j > d_1$, then (7.6) and (7.9) imply that type I consistency holds if
$$d_0 \left(t_n/\sqrt{2}\right)^{-1} e^{-t_n^2/2} \to 0. \quad (7.10)$$
To examine (7.10), write $d_0 = \exp(n^b)$ and $t_n = \sqrt{2}\, n^r$. By large deviation theory, (7.10) holds when $r < 1/6$. Thus if $0 < b/2 < r < 1/6$, then
$$d_0 \left(t_n/\sqrt{2}\right)^{-1} e^{-t_n^2/2} = n^{-r} \exp\left(n^b - n^{2r}\right) \to 0.$$
Thus $d_0 = \exp(n^b)$ with $b = 1/3 - \varepsilon$, for any $\varepsilon > 0$, is the largest $d_0$ can be to have consistency. For instance, if there are exactly $d_0 = \exp(n^{1/4})$ irrelevant variables and we use the threshold $t_n = \sqrt{2}\, n^{1/7}$, the probability of selecting any of the irrelevant variables tends to zero at the rate
$$n^{-1/7} \exp\left(n^{1/4} - n^{2/7}\right).$$
7.2 Type II consistency
We now assume that for each relevant variable $X_j$ there is a least favorable point $x^{(1)}$ such that the local efficacy $s^{(1)}_j \equiv \hat\zeta_j(x^{(1)})$ is stochastically smaller than the importance score $s_j = M^{-1} \sum_{a=1}^M \hat\zeta_j(x^*_a)$. That is, we assume that for each relevant variable $X_j$ there exist $n_1$ and $x^{(1)}$ such that for $n \ge n_1$,
$$P\left(s^{(1)}_j \le t_n\right) \ge P(s_j \le t_n), \quad j \le d_1. \quad (7.11)$$
The existence of such a least favorable $x^{(1)}$ is established by considering the probability limit of
$$\arg\min\{\hat\zeta_j(x^*_a) : x^*_a \in \{x^*_1, \ldots, x^*_M\}\}.$$
We again assume that R and C are fixed, so that to show type II consistency it is enough to show that for each (r, c) and $j \le d_1$,
$$P\left(|T^{(j)}_{rc}| \le t_n\right) \to 0.$$
This is because
$$P\left(\sqrt{\sum_r \sum_c [T^{(j)}_{rc}]^2} \le t_n\right) \le P\left(\min_{r,c} [T^{(j)}_{rc}]^2 \le t_n^2\right) \le RC \max_{r,c} P\left(|T^{(j)}_{rc}| \le t_n\right).$$
By Theorem 6.2, $\sqrt{n}\left(s^{(1)}_j - \tau_j\right) \to_d N(0, \nu^2)$, $\tau_j > 0$. It follows that if $t_n$ and $d_1$ are bounded above as $n \to \infty$ and $\tau_j > 0$, then $P\left(s^{(1)}_j \le t_n\right) \to 0$. Thus type II consistency holds in this case. To examine the more interesting case where $t_n$ and $d_1$ are not bounded above and $\tau_j$ may tend to zero as $n \to \infty$, we will need to center the $T^{(j)}_{rc}$. Thus set
$$\theta^{(j)}_n = \frac{\sqrt{n}\left(p^{(j)}_{rc} - p^{(j)}_r p^{(j)}_c\right)}{\sqrt{p^{(j)}_r p^{(j)}_c}}, \quad (7.12)$$
$$T^{(j)}_n = T^{(j)}_{rc} - \theta^{(j)}_n,$$
where for simplicity the dependence of $T^{(j)}_n$ and $\theta^{(j)}_n$ on r and c is not indicated.
Lemma 7.1.
$$P\left(\min_{j \le d_1} |T^{(j)}_{rc}| \le t_n\right) \le \sum_{j=1}^{d_1} P\left(|T^{(j)}_n| \ge \min_{j \le d_1} |\theta^{(j)}_n| - t_n\right).$$

Proof.
$$P\left(\min_{j \le d_1} |T^{(j)}_{rc}| \le t_n\right) = P\left(\min_{j \le d_1} |T^{(j)}_n + \theta^{(j)}_n| \le t_n\right)$$
$$\le P\left(\min_{j \le d_1} |\theta^{(j)}_n| - \max_{j \le d_1} |T^{(j)}_n| \le t_n\right)$$
$$= P\left(\max_{j \le d_1} |T^{(j)}_n| \ge \min_{j \le d_1} |\theta^{(j)}_n| - t_n\right)$$
$$\le \sum_{j=1}^{d_1} P\left(|T^{(j)}_n| \ge \min_{j \le d_1} |\theta^{(j)}_n| - t_n\right).$$
Set
$$a_n = \min_{j \le d_1} |\theta^{(j)}_n| - t_n;$$
then by large deviation theory, if $a_n = o(n^{1/6})$,
$$P\left(|T^{(j)}_n| \ge a_n\right) \propto \left(a_n/\sqrt{2}\right)^{-1} \exp\{-a_n^2/2\}, \quad (7.13)$$
where $\propto$ signifies asymptotic order as $n \to \infty$, $a_n \to \infty$. It follows that if (7.13) holds uniformly in j, then by Lemma 7.1,
$$P\left(\min_{j \le d_1} |T^{(j)}_{rc}| \le t_n\right) \propto d_1 \left(a_n/\sqrt{2}\right)^{-1} \exp\{-a_n^2/2\}.$$
Thus type II consistency holds when $a_n = o(n^{1/6})$ and
$$d_1 \left(a_n/\sqrt{2}\right)^{-1} \exp\{-a_n^2/2\} \to 0. \quad (7.14)$$
In Section 7.1, $t_n$ is of order $n^r$, $0 < r < 1/6$. For this $t_n$, type II consistency holds if $a_n$ is of order $n^r$, $0 < r < 1/6$, and
$$d_1\, n^{-r} \exp\{-n^{2r}/2\} \to 0.$$
If $a_n = O(n^k)$ with $k \ge 1/6$, then $P\left(|T^{(j)}_n| \ge a_n\right)$ tends to zero at a rate faster than in (7.13), and consistency is ensured. For instance, if all relevant variables $X_j$ have $p^{(j)}_{rc}$ fixed as $n \to \infty$, then $\min_{j \le d_1} |\theta^{(j)}_n|$ is of order $\sqrt{n}$ and $a_n$ is of order $n^{1/2} - t_n$.
7.3 Type I and II consistency

We see from Sections 7.1 and 7.2 that consistency holds under a variety of conditions. One set of conditions is:

Theorem 7.1. Under the regularity conditions of Sections 7.1 and 7.2, if
$$d_0 = c_0 \exp(n^b), \quad t_n = c_1 n^r, \quad \min_{j \le d_1} |\theta^{(j)}_n| - t_n = c_2 n^r, \quad d_1 = c_3 \exp(n^b)$$
for positive constants $c_0, c_1, c_2, c_3$ and $0 < b/2 < r < 1/6$, then CATCH is consistent.
Appendix: Proof of Theorem 6.1

(1) Let $N_h = \sum_{i=1}^n I(X^{(i)} \in N_h(x^{(0)}))$. Then $N_h \sim \mathrm{Bi}\left(n, \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx\right)$ and $N_h \to_{a.s.} \infty$ as $n \to \infty$. Write
$$\hat\zeta(x^{(0)}, h) = n^{-1/2} \sqrt{\mathcal{X}^2(x^{(0)}, h)} = (N_h/n)^{1/2}\, (N_h)^{-1/2} \sqrt{\mathcal{X}^2(x^{(0)}, h)}.$$
Since $N_h \sim \mathrm{Bi}\left(n, \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx\right)$, we have
$$N_h/n \to \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx. \quad (7.15)$$
By Lemma 6.1 and Table 1,
$$(N_h)^{-1/2} \sqrt{\mathcal{X}^2(x^{(0)}, h)} \to \left(\sum_{r=+,-} \sum_{c=1}^C \frac{(p_{rc} - p_r q_c)^2}{p_r q_c}\right)^{1/2}, \quad (7.16)$$
where
$$p_{+c} = \Pr(X > x^{(0)}, Y = c \mid X \in N_h(x^{(0)})), \quad p_{-c} = \Pr(X \le x^{(0)}, Y = c \mid X \in N_h(x^{(0)}))$$
for $r = +, -$ and $c = 1, \ldots, C$, and
$$p_r = \sum_{c=1}^C p_{rc}, \qquad q_c = \sum_{r=+,-} p_{rc}.$$
Actually $p_+ = \Pr(X > x^{(0)} \mid X \in N_h(x^{(0)}))$, $p_- = \Pr(X \le x^{(0)} \mid X \in N_h(x^{(0)}))$, and $q_c = \Pr(Y = c \mid X \in N_h(x^{(0)}))$.

By (7.15) and (7.16),
$$\hat\zeta(x^{(0)}, h) \to \left(\int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx\right)^{1/2} \left(\sum_{r=+,-} \sum_{c=1}^C \frac{(p_{rc} - p_r q_c)^2}{p_r q_c}\right)^{1/2} \equiv \zeta(x^{(0)}, h).$$

(2) Since X and Y are independent, $p_{rc} = p_r q_c$ for $r = +, -$ and $c = 1, \ldots, C$; then $\zeta(x^{(0)}, h) = 0$ for any h. Hence
$$\zeta(x^{(0)}) = \sup_{h > 0} \zeta(x^{(0)}, h) = 0.$$

(3) Recall that $p_c(x^{(0)}) = \Pr(Y = c \mid x^{(0)})$. Without loss of generality, assume $p_c'(x^{(0)}) > 0$. Then there exists h such that
$$\Pr(Y = c \mid x^{(0)} < X < x^{(0)} + h) > \Pr(Y = c \mid x^{(0)} - h < X < x^{(0)}),$$
which is equivalent to $p_{+c}/p_+ > p_{-c}/p_-$. This shows that $\zeta(x^{(0)}) > 0$, since by Lemma 6.1, $\zeta(x^{(0)}) = 0$ results in $p_{+c}/p_+ = p_{-c}/p_- = p_c$.
References
[1] Breiman L 2001 Random forests Machine Learning 45 5ndash32
[2] Breiman L Friedman J H Olshen R A Stone J 1984 Classification andregression trees Wadsworth Belmont
[3] Doksum K Tang S Tsui K-W 2008 Nonparametric variable selection Theearth algorithm Journal of the American Statistical Association 103(484) 1609ndash1620
[4] Doksum K A Samarov A 1995 Nonparametric estimation of global functionalsand a measure of explanatory power of covariates in regression Annals of Statistics23 1443ndash1473
29
[5] Doksum K A Schafer C 2006 Powerful choices Tuning parameter selectionbased on power Frontiers in Statistics Dedicated to Peter Bickel Editors J Fanand H L Koul Imperial College Press 113ndash141
[6] Friedman J 1991 Multivariate adaptive regression splines The annals of statis-tics 1ndash67
[7] Gao J Gijbels I 2008 Bandwidth selection in nonparametric kernel testingJournal of the American Statistical Association 103 (484) 1584ndash1594
[8] Hall P 1992 The Bootstrap and Edgeworth Expansions Springer New York
[9] Huang L-S Chen J 2008 Analysis of variance coefficient of determinationand f-test for local polynomial regression Annals of Statistics 36 2085ndash2109
[10] Jing B-Y Shao Q-M Wang Q 2003 Self-normalized cramer-type large devi-ations for independent random variables Annals of Probability 31 2167ndash2215
[11] Lin D Y Zeng D 2011 Correcting for population stratification in genomewideassociation studies Journal of the American Statistical Association 106 997ndash1008
[12] Loh W-Y 2009 Improve the precision of classification trees The Annals ofApplied Statistics 3 1710ndash1737
[13] Price A Patterson N Plenge R Weinblatt M Shadick N Reich D 2006Principal components analysis corrects for stratification in genome-wide associationstudies Nature genetics 38 (8) 904ndash909
[14] Qiao Z Zhou L Huang J 2008 Linear discriminant analysis for high dimen-sional Low Sample Size Data special issue of World Congress on Engineering2008
[15] Quinlan J R 1987 Simplifying decision trees International Journal of Man-Machine Studies 27(3) 221ndash234
[16] Schafer C Doksum K A 2009 Selecting local models in multiple regression bymaximizing power Metrika 69 283ndash304
[17] Tibshirani R 1996 Regression shrinkage and selection via the lasso J RoyalStatist Soc B 58 267ndash288
[18] Wang L Shen X 2007 On l1-norm multi-class support vector machinesmethodology and theory Journal of the American Statistical Association 102 583ndash594
[19] Zhang H H Liu Y Wu Y Zhu J 2008 Variable selection for the multicate-gory svm via adaptive sup-norm regularization Electronic Journal of Statistics 2149ndash167
30
- Introduction
- Importance Scores for Univariate Classification
-
- Numerical Predictor Local Contingency Efficacy
- Categorical Predictor Contingency Efficacy
-
- Variable Selection for Classification Problems The CATCH Algorithm
-
- Numerical Predictors
- Numerical and Categorical Predictors
- The Empirical Importance Scores
- Adaptive Tube Distances
- The CATCH Algorithm
-
- Simulation Studies
-
- Variable Selection with Categorical and Numerical Predictors
- Improving The Performance of SVM and RF Classifiers
-
- CATCH Analysis of The ANN Data
- Properties of Local Contingency Efficacy
- Consistency Of CATCH
-
- Type I consistency
- Type II consistency
- Type I and II consistency
-
- Proof of Theorem 61
-
variables for classification, the performance of other algorithms such as the Support Vector Machine (SVM) and the Random Forest Classifier (RFC) can be significantly improved. In this section we compare the prediction performance of (1) SVM, (2) RFC, (3) CATCH followed by SVM, and (4) CATCH followed by RFC, labeling (3) and (4) CATCH+SVM and CATCH+RFC, respectively. We use the "svm" function in the "e1071" package in R with the radial kernel to perform the SVM analysis, and the "randomForest" function in the "randomForest" package to perform the RFC analysis. We use simulation experiments to illustrate the improvement obtained by using CATCH prior to classifying Y.
Let the true model be

(X, Y) \sim P,   (4.2)

where P will be specified later. In each of 100 Monte Carlo experiments, n pairs (X^{(i)}, Y^{(i)}), i = 1, ..., n, are generated from the true model (4.2). Based on the simulated data (X^{(i)}, Y^{(i)}), i = 1, ..., n, the methods (1) SVM, (2) RFC, (3) CATCH+SVM, and (4) CATCH+RFC are used to classify a future Y^{(0)} for which we have available a corresponding X^{(0)}.
The integrated misclassification rate (IMR; Friedman [6]), defined below, is used to evaluate the performance of the classification algorithms. Let F_{X,Y}(\cdot) be the classifier constructed by a classification algorithm based on X = (X^{(1)}, ..., X^{(n)}) and Y = (Y^{(1)}, ..., Y^{(n)})^T. Let (X^{(0)}, Y^{(0)}) be a future observation with the same distribution as (X^{(i)}, Y^{(i)}). We only know X^{(0)} = x^{(0)} and want to classify Y^{(0)}. For a possible future observation (x^{(0)}, y^{(0)}), define

MR(X, Y; x^{(0)}, y^{(0)}) = P(F_{X,Y}(x^{(0)}) \ne y^{(0)} | X, Y).   (4.3)

Our criterion is

IMR = E E_0 [MR(X, Y; x^{(0)}, y^{(0)})],   (4.4)

where E_0 is with respect to the distribution of the random (x^{(0)}, y^{(0)}) and E is with respect to the distribution of (X, Y).
This double expectation is evaluated using Monte Carlo simulation, and the result is denoted IMR*. More precisely, E_0 is estimated using 5000 simulated (x^{(0)}, y^{(0)}) observations; IMR* is then computed on the basis of 100 simulated samples (X_i, Y_i), i = 1, ..., 100.
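The double expectation (4.4) is naturally approximated by an outer Monte Carlo loop over training samples and an inner loop over future observations. The sketch below is illustrative only, not the authors' code: the one-dimensional logistic data-generating model and the plug-in majority rule `train_classifier` are hypothetical stand-ins for the model P and the classifier F_{X,Y}.

```python
import math
import random

def draw(n, rng):
    # Hypothetical data-generating model (my stand-in, not the paper's):
    # X ~ N(0, 1), P(Y = 1 | x) = 1 / (1 + exp(-2x)).
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [1 if rng.random() < 1.0 / (1.0 + math.exp(-2.0 * x)) else 0 for x in xs]
    return xs, ys

def train_classifier(xs, ys):
    # Stand-in classifier F_{X,Y}: majority label on each side of x = 0.
    def majority(side):
        labels = [y for x, y in zip(xs, ys) if (x > 0) == side]
        return int(2 * sum(labels) >= len(labels)) if labels else 0
    left, right = majority(False), majority(True)
    return lambda x: right if x > 0 else left

def imr_star(n=200, n_train_sets=100, n_eval=5000, seed=0):
    # Outer loop: E over training samples (X, Y); inner sum: E_0 over (x0, y0).
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_train_sets):
        classify = train_classifier(*draw(n, rng))
        x0, y0 = draw(n_eval, rng)
        total += sum(classify(x) != y for x, y in zip(x0, y0)) / n_eval
    return total / n_train_sets
```

The call imr_star() then mirrors the IMR* computation described above, with the 100 training samples and 5000 evaluation points adjustable.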
Example 4.3. First consider model (4.1) in Section 4.1. Figure 3 shows IMR* versus the number of irrelevant variables for this example. The left panel shows that the IMR*'s of the SVM approach increase dramatically as the number of irrelevant variables increases, while the IMR*'s of CATCH+SVM are stable. Thus CATCH stabilizes the performance of SVM for this complex model. The right panel shows that CATCH also stabilizes the performance of RFC. These results indicate that CATCH can improve the prediction accuracy of other classification algorithms.
[Figure 3 here: two panels, "SVM" (left) and "Random Forest" (right), plotting error rate against the number of predictors d = 20, ..., 100.]

Figure 3: IMR*'s of SVM, RFC, CATCH+SVM, and CATCH+RFC versus the number of predictors for the simulation study in model (4.1) in Example 4.1. The sample size is n = 400.
Example 4.4 (Contaminated Logistic Model). Consider a binary response Y \in \{0, 1\} with

P(Y = 0 | X = x) = F(x) if r = 0, and G(x) if r = 1,   (4.5)

where x = (x_1, ..., x_d),

r \sim Bernoulli(\gamma),   (4.6)

F(x) = (1 + \exp\{-(0.25 + 0.5 x_1 + x_2)\})^{-1},   (4.7)

G(x) = 0.1 if 0.25 + 0.5 x_1 + x_2 < 0, and 0.9 if 0.25 + 0.5 x_1 + x_2 \ge 0,   (4.8)

and x_1, x_2, ..., x_d are realizations of independent standard normal random variables. Hence, given X^{(i)} = x^{(i)}, 1 \le i \le n, the joint distribution of Y^{(1)}, Y^{(2)}, ..., Y^{(n)} is the contaminated logistic model with contamination parameter \gamma:

(1 - \gamma) \prod_{i=1}^{n} [F(x^{(i)})]^{Y^{(i)}} [1 - F(x^{(i)})]^{1 - Y^{(i)}} + \gamma \prod_{i=1}^{n} [G(x^{(i)})]^{Y^{(i)}} [1 - G(x^{(i)})]^{1 - Y^{(i)}}.   (4.9)
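A draw from model (4.5)-(4.8) can be sketched as follows (an illustrative sketch, not the authors' code; the function name and the use of numpy are my own choices):

```python
import numpy as np

def simulate_contaminated_logistic(n=200, d=50, gamma=0.2, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))        # x_1, ..., x_d iid N(0, 1)
    lin = 0.25 + 0.5 * X[:, 0] + X[:, 1]   # linear predictor in (4.7)-(4.8)
    F = 1.0 / (1.0 + np.exp(-lin))         # logistic component (4.7)
    G = np.where(lin < 0, 0.1, 0.9)        # contaminating component (4.8)
    r = rng.random(n) < gamma              # r ~ Bernoulli(gamma), (4.6)
    p0 = np.where(r, G, F)                 # P(Y = 0 | x), model (4.5)
    Y = (rng.random(n) >= p0).astype(int)  # Y = 0 with probability p0
    return X, Y
```

Setting gamma = 0 reproduces the uncontaminated logistic model, matching the role of \gamma in (4.9).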
We perform a simulation study for n = 200, d = 50, and \gamma = 0, 0.2, 0.4, 0.6, 0.8, and 1, where \gamma = 0 corresponds to no contamination. Figure 4 shows the IMR*'s versus the different values of \gamma for this example. CATCH improves the performance of the SVM and Random Forest classifiers for this contaminated logistic regression model.
[Figure 4 here: two panels, "SVM" (left) and "Random Forest" (right), plotting error rate against \gamma = 0, 0.2, 0.4, 0.6, 0.8, 1.]

Figure 4: IMR*'s of SVM, RFC, CATCH+SVM, and CATCH+RFC versus different contamination parameters \gamma for simulation model (4.9) in Example 4.4.
5 CATCH Analysis of The ANN Data
In this section CATCH is applied to a data set available at the UCI data base (http://ftp.ics.uci.edu/pub/machine-learning-databases/thyroid-disease). The data are from a clinical trial to determine whether a patient is hypothyroid (see, e.g., Quinlan [15]). The dataset is split into two parts: a training set with 3772 observations and a test set with 3428 observations. The response is a categorical variable with three classes: normal (not hypothyroid), hyperfunction, and subnormal functioning. There are 21 predictors, including 15 categorical predictors and 6 numerical predictors. The three classes are very imbalanced, with 92.5% of the patients in the normal class. A good classifier should produce a significant decrease from the 7.5% error rate that can be obtained by classifying all observations to the normal class.
CATCH identifies 7 variables as important, including 4 continuous variables and 3 categorical variables. We consider three classification methods: 1. the Support Vector Machine with radial kernel, denoted by SVM(r); 2. the Support Vector Machine with polynomial kernel, denoted by SVM(p); 3. RFC. We apply these three methods on the training set to build classification models and use the test set to estimate the misclassification rate of each method. We apply the methods with CATCH and without CATCH, respectively, where "with CATCH" means that only the 7 important variables identified by CATCH are used to build the model and "without CATCH" means that all 21 variables are used in the analysis.
Table 5: Misclassification rates of three classification methods (SVM(r), SVM(p), RFC) with CATCH and without CATCH.

              SVM(r)   SVM(p)   RFC
w/o CATCH     0.052    0.063    0.024
with CATCH    0.037    0.055    0.015
Table 5 shows that if CATCH is used to select important variables before the classification methods are applied, the misclassification rates decrease. CATCH reduces the misclassification rates by 29%, 13%, and 37% for SVM(r), SVM(p), and RFC, respectively. CATCH+RFC has the best performance. Although CATCH is not designed for improving classification accuracy, it can nonetheless help other classification methods to perform better.
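The quoted percentage reductions follow from Table 5 by simple arithmetic, e.g. (0.052 - 0.037)/0.052 ≈ 29% for SVM(r). A quick check (my own snippet, not from the paper):

```python
# (rate without CATCH, rate with CATCH) from Table 5
rates = {"SVM(r)": (0.052, 0.037), "SVM(p)": (0.063, 0.055), "RFC": (0.024, 0.015)}
# relative reduction in misclassification rate from using CATCH first
reduction = {m: 100 * (wo - w) / wo for m, (wo, w) in rates.items()}
# SVM(r): ~28.8%, SVM(p): ~12.7%, RFC: 37.5%
```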
6 Properties of Local Contingency Efficacy
In this section we investigate the properties of local contingency efficacy. For a univariate continuous predictor, Theorem 6.1 below shows that local contingency efficacy is well defined and equals 0 if the response variable is independent of the predictor. Consider an R \times C contingency table. Let n_{rc}, r = 1, ..., R, c = 1, ..., C, be the observed frequency in the (r, c)-cell, and let p_{rc} be the probability that an observation belongs to the (r, c)-cell. Let

p_r = \sum_{c=1}^{C} p_{rc},   q_c = \sum_{r=1}^{R} p_{rc},   (6.1)

and

n_{r.} = \sum_{c=1}^{C} n_{rc},   n_{.c} = \sum_{r=1}^{R} n_{rc},   n = \sum_{r=1}^{R} \sum_{c=1}^{C} n_{rc},   E_{rc} = n_{r.} n_{.c} / n.   (6.2)
Consider testing that the rows and the columns are independent, i.e.,

H_0: p_{rc} = p_r q_c for all r, c   versus   H_1: p_{rc} \ne p_r q_c for some r, c.   (6.3)

A standard test statistic is

X^2 = \sum_{r=1}^{R} \sum_{c=1}^{C} (n_{rc} - E_{rc})^2 / E_{rc}.   (6.4)
Under H_0, X^2 converges in distribution to \chi^2_{(R-1)(C-1)}. Moreover, with the convention 0/0 = 0, we have the well-known result:

Lemma 6.1. As n tends to infinity, X^2/n tends in probability to

\tau^2 \equiv \sum_{r=1}^{R} \sum_{c=1}^{C} (p_{rc} - p_r q_c)^2 / (p_r q_c).

Moreover, if p_r q_c is bounded away from 0 and 1 as n \to \infty, and if R and C are fixed as n \to \infty, then \tau^2 = 0 if and only if H_0 is true.
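For concreteness, the statistic (6.4) and the limit \tau^2 of Lemma 6.1 can be computed as follows (a sketch with my own function names; zero cells are handled with the 0/0 = 0 convention):

```python
import numpy as np

def chi2_stat(counts):
    """Pearson chi-square statistic (6.4) for an R x C table of counts n_rc."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    E = np.outer(counts.sum(axis=1), counts.sum(axis=0)) / n  # E_rc = n_r. n_.c / n
    mask = E > 0                                              # 0/0 = 0 convention
    return (((counts - E) ** 2)[mask] / E[mask]).sum()

def tau2(p):
    """Population limit tau^2 of X^2 / n in Lemma 6.1, from cell probabilities p_rc."""
    p = np.asarray(p, dtype=float)
    pq = np.outer(p.sum(axis=1), p.sum(axis=0))               # p_r q_c
    mask = pq > 0
    return (((p - pq) ** 2)[mask] / pq[mask]).sum()
```

Under H_0 (p_{rc} = p_r q_c), tau2 returns 0, and a table whose counts equal their expected values gives chi2_stat equal to 0.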
Remark 6.1. Lemma 6.1 shows that the contingency efficacy \zeta in (2.8) is well defined. Moreover, \zeta = 0 if and only if X and Y are independent.
Theorem 6.1. Consider a categorical response Y and a continuous predictor X with density function f(x) such that, for a fixed h > 0, f(x) > 0 for x \in (x^{(0)} - h, x^{(0)} + h). Then:

(1) For fixed h, the estimated local section efficacy \hat\zeta(x^{(0)}, h) defined in (2.5) converges almost surely to the section efficacy \zeta(x^{(0)}, h) defined in (2.4).

(2) If X and Y are independent, then the local efficacy \zeta(x^{(0)}) defined in (2.4) is zero.

(3) If there exists c \in \{1, ..., C\} such that p_c(x) = P(Y = c | x) has a continuous derivative at x = x^{(0)} with p'_c(x^{(0)}) \ne 0, then \zeta(x^{(0)}) > 0.
The \chi^2 statistic (6.4) can be written in terms of multinomial proportions \hat p_{rc}, with \sqrt{n}(\hat p_{rc} - p_{rc}), 1 \le r \le R, 1 \le c \le C, converging in distribution to a multivariate normal. By a Taylor expansion of T_n \equiv \sqrt{\chi^2/n} at \tau = plim \sqrt{\chi^2/n}, we find that

Theorem 6.2.

\sqrt{n}(T_n - \tau) \to_d N(0, v^2),   (6.5)

where the asymptotic variance v^2, which can be obtained from the Taylor expansion, is not needed in this paper.

Remark 6.2. The result (6.5) applies to the multivariate \chi^2-based efficacies \hat\zeta, with n replaced by the number of observations that go into the computation of \hat\zeta. In this case we use the notation \chi^2_j for the chi-square statistic for the jth variable.
7 Consistency Of CATCH
Let Y denote a variable to be classified into one of the categories 1, ..., C by using the X_j's in the vector X = (X_1, ..., X_d). Let X_{-j} \equiv X - \{X_j\} denote X without the X_j component. Assume that the squared Pearson nonparametric correlation (Doksum and Samarov [4]; Huang and Chen [9]) between X_j and X_{-j} is less than 0.5. We call the variable X_j relevant for Y if, conditionally given X_{-j}, L(Y | X_j = x) is not constant in x, and irrelevant otherwise. Our data are (X^{(1)}, Y^{(1)}), ..., (X^{(n)}, Y^{(n)}), i.i.d. as (X, Y). A variable selection procedure is consistent if the probability that all the relevant variables are selected and none of the irrelevant variables are selected tends to one as n tends to infinity.
We give a general nonparametric framework in which the variable selection rule based on the CATCH importance scores s_j given in (3.8) is consistent. When the predictors are not all categorical and CATCH involves tubes, we assume that the number of observations in each tube tends to \infty as n \to \infty, and we assume that, for those importance scores that require the selection of a section size h, the selection is from a finite set \{h_j\} with \min_j \{h_j\} > h_0 for some h_0 > 0. It follows that the number of observations m_{jk} in the kth tube used to calculate the importance score s_j for the variable X_j tends to infinity as n \to \infty.
The key quantities that determine consistency are the limits \lambda_j of s_j. That is,

\lambda_j = \tau(\zeta_j, P) = E_P[\zeta_j(X)] = \int \zeta_j(x) dP(x),   j = 1, ..., d,

where \zeta_j(x) is the local contingency efficacy defined by (2.4) for numerical X's and by (2.8) for categorical X's, and P is the probability distribution of X. Our estimate s_j of \lambda_j is

s_j = \tau(\hat\zeta_j, \hat P) = (1/M) \sum_{k=1}^{M} \hat\zeta_j(X^*_k),

where \hat\zeta_j is defined in Section 3 and \hat P is the empirical distribution of X.

Let d_0 denote the number of X's that are irrelevant for Y. Set d_1 = d - d_0 and, without loss of generality, reorder the \lambda_j so that

\lambda_j > 0 if j = 1, ..., d_1;   \lambda_j = 0 if j = d_1 + 1, ..., d.   (7.1)
Our variable selection rule is

"keep variable X_j if s_j \ge t" for some threshold t.   (7.2)

Rule (7.2) is consistent if and only if

P(\min_{j \le d_1} s_j \ge t and \max_{j > d_1} s_j < t) \to 1.   (7.3)
To establish (7.3), it is enough to show that

P(\max_{j > d_1} s_j > t) \to 0   (7.4)

and

P(\min_{j \le d_1} s_j \le t) \to 0.   (7.5)

We call (7.4) and (7.5) type I and type II consistency, respectively. Type I and II error probabilities refer to the probabilities of any false positives and any false negatives, respectively.
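In code, the score s_j and rule (7.2) reduce to an average and a threshold. A minimal sketch (the function name and inputs are my own; the local efficacies would come from the tube construction of Section 3):

```python
def catch_select(tube_efficacies, t):
    """tube_efficacies: {variable: list of M local efficacies zeta_j(x*_k)}.
    Returns the scores s_j (their mean, as in Section 7) and the variables
    kept by rule (7.2): keep X_j if s_j >= t."""
    scores = {j: sum(z) / len(z) for j, z in tube_efficacies.items()}
    return scores, sorted(j for j, s in scores.items() if s >= t)
```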
7.1 Type I consistency
We consider (7.4) first and note that

P(\max_{j > d_1} s_j > t) \le \sum_{j = d_1 + 1}^{d} P(s_j > t).   (7.6)
For the jth irrelevant variable X_j, j > d_1, the results in Section 6 apply to the local efficacy \hat\zeta_j(x) at a fixed point x. The importance score s_j defined in (3.8) is the average of such efficacies over a sample of M random points x^*_a. We assume that there is a point x^{(0)} such that, for irrelevant X_j's and for n large enough, \hat\zeta_j(x^{(0)}) is stochastically larger in the right tail than the average M^{-1} \sum_{a=1}^{M} \hat\zeta_j(x^*_a). That is, we assume that there exist t > 0 and n_0 such that, for n \ge n_0,

P(s_j > t) \le P(s^{(0)}_j > t),   j > d_1,   (7.7)

where s^{(0)}_j \equiv \hat\zeta_j(x^{(0)}). The existence of such a point x^{(0)} is established by considering the probability limit of

\arg\max \{\hat\zeta_j(x^*_a) : x^*_a \in \{x^*_1, ..., x^*_M\}\}.

Note that by Lemma 6.1, s^{(0)}_j \to_p 0 as n \to \infty. It follows that we have established (7.4) for t and d_0 = d - d_1 that are fixed as n \to \infty. To examine the interesting case where d_0 \to \infty as n \to \infty, we need large deviation results, to which we turn next. In the d_0 \to \infty case we consider thresholds t \equiv t_n that tend to \infty as n \to \infty, and we use the weaker assumption: there exist x^{(0)} and n_0 such that

P(s_j > t_n) \le P(s^{(0)}_j > t_n)   (7.8)
for n ge n0 and each tn with limnrarrinfin tn = infin We also assume that the number ofcolumns C and rows R in the contingency table that define the efficacy χ2
jn given by
(64) stays fixed as nrarrinfin In this case P (s(0)j gt tn)rarr 0 provided each term
[T (j)rc ]2 equiv
[nrcErcradicnradicErc
]2in s2j = χ2
jn defined for Xj by (64) satisfies
P(|T (j)rc | gt tn
)minusrarr 0
This is because
P
radicsumr
sumc
[T(j)rc ]2 gt tn
le P(RC max
rc[T (j)rc ]2 gt t2n
)le RC maxP
(|T (j)rc | gt tn
radicRC)
and t_n and t_n/\sqrt{RC} are of the same order as n \to \infty. By large deviation theory (Hall [8] and Jing et al. [10]), for some constant A,

P(|T^{(j)}_{rc}| > t_n) = (2/(t_n \sqrt{2\pi})) e^{-t_n^2/2} [1 + A t_n^3 n^{-1/2} + o(t_n^3 n^{-1/2})]   (7.9)

provided t_n = o(n^{1/6}).
If (7.9) is uniform in j > d_1, then (7.6) and (7.9) imply that type I consistency holds if

d_0 (t_n/\sqrt{2})^{-1} e^{-t_n^2/2} \to 0.   (7.10)

To examine (7.10), write d_0 and t_n as d_0 = \exp(n^b) and t_n = \sqrt{2} n^r. By large deviation theory, (7.10) holds when r < 1/6. Thus if 0 < b/2 < r < 1/6, then

d_0 (t_n/\sqrt{2})^{-1} e^{-t_n^2/2} = n^{-r} \exp(n^b - n^{2r}) \to 0.
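As a numerical illustration (my own check, not part of the paper's argument), the bound n^{-r} exp(n^b - n^{2r}) can be evaluated for b = 1/4 and r = 1/7, which satisfy 0 < b/2 < r < 1/6:

```python
import math

def type1_bound(n, b=0.25, r=1 / 7):
    # n^{-r} * exp(n^b - n^{2r}), the expression obtained from (7.10)
    return n ** (-r) * math.exp(n ** b - n ** (2 * r))

# evaluate at n = 10^2, ..., 10^6; the bound shrinks toward zero as n grows
vals = [type1_bound(10 ** k) for k in range(2, 7)]
```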
Thus d_0 = \exp(n^b) with b = 1/3 - \epsilon, for any \epsilon > 0, is the largest d_0 can be to have consistency. For instance, if there are exactly d_0 = \exp(n^{1/4}) irrelevant variables and we use the threshold t_n = \sqrt{2} n^{1/7}, the probability of selecting any of the irrelevant variables tends to zero at the rate n^{-1/7} \exp(n^{1/4} - n^{2/7}).

7.2 Type II consistency
We now assume that for each relevant variable X_j there is a least favorable point x^{(1)} such that the local efficacy s^{(1)}_j \equiv \hat\zeta_j(x^{(1)}) is stochastically smaller than the importance score s_j = M^{-1} \sum_{a=1}^{M} \hat\zeta_j(x^*_a). That is, we assume that for each relevant variable X_j there exist n_1 and x^{(1)} such that, for n \ge n_1,

P(s^{(1)}_j \le t_n) \ge P(s_j \le t_n),   j \le d_1.   (7.11)

The existence of such a least favorable x^{(1)} is established by considering the probability limit of

\arg\min \{\hat\zeta_j(x^*_a) : x^*_a \in \{x^*_1, ..., x^*_M\}\}.
We again assume that R and C are fixed, so that to show type II consistency it is enough to show that, for each (r, c) and j \le d_1,

P(|T^{(j)}_{rc}| \le t_n) \to 0.

This is because

P(\sqrt{\sum_r \sum_c [T^{(j)}_{rc}]^2} \le t_n) \le P(\min_{r,c} [T^{(j)}_{rc}]^2 \le t_n^2) \le RC \max_{r,c} P(|T^{(j)}_{rc}| \le t_n).
By Theorem 6.2, \sqrt{n}(s^{(1)}_j - \tau_j) \to_d N(0, \nu^2), \tau_j > 0. It follows that if t_n and d_1 are bounded above as n \to \infty and \tau_j > 0, then P(s^{(1)}_j \le t_n) \to 0. Thus type II consistency holds in this case. To examine the more interesting case, where t_n and d_1 are not bounded above and \tau_j may tend to zero as n \to \infty, we will need to center the T^{(j)}_{rc}. Thus set

\theta^{(j)}_n = \sqrt{n} (p^{(j)}_{rc} - p^{(j)}_r q^{(j)}_c) / \sqrt{p^{(j)}_r q^{(j)}_c},   (7.12)

T^{(j)}_n = T^{(j)}_{rc} - \theta^{(j)}_n,

where for simplicity the dependence of T^{(j)}_n and \theta^{(j)}_n on r and c is not indicated.
Lemma 7.1.

P(\min_{j \le d_1} |T^{(j)}_{rc}| \le t_n) \le \sum_{j=1}^{d_1} P(|T^{(j)}_n| \ge \min_{j \le d_1} |\theta^{(j)}_n| - t_n).

Proof.

P(\min_{j \le d_1} |T^{(j)}_{rc}| \le t_n) = P(\min_{j \le d_1} |T^{(j)}_n + \theta^{(j)}_n| \le t_n)
\le P(\min_{j \le d_1} |\theta^{(j)}_n| - \max_{j \le d_1} |T^{(j)}_n| \le t_n)
= P(\max_{j \le d_1} |T^{(j)}_n| \ge \min_{j \le d_1} |\theta^{(j)}_n| - t_n)
\le \sum_{j=1}^{d_1} P(|T^{(j)}_n| \ge \min_{j \le d_1} |\theta^{(j)}_n| - t_n),

where the first inequality uses |T^{(j)}_n + \theta^{(j)}_n| \ge |\theta^{(j)}_n| - |T^{(j)}_n|.
Set

a_n = \min_{j \le d_1} |\theta^{(j)}_n| - t_n;

then by large deviation theory, if a_n = o(n^{1/6}),

P(|T^{(j)}_n| \ge a_n) \propto (a_n/\sqrt{2})^{-1} \exp\{-a_n^2/2\},   (7.13)

where \propto signifies asymptotic order as n \to \infty, a_n \to \infty. It follows that if (7.13) holds uniformly in j, then by Lemma 7.1,

P(\min_{j \le d_1} |T^{(j)}_{rc}| \le t_n) \propto d_1 (a_n/\sqrt{2})^{-1} \exp\{-a_n^2/2\}.
Thus type II consistency holds when a_n = o(n^{1/6}) and

d_1 (a_n/\sqrt{2})^{-1} \exp\{-a_n^2/2\} \to 0.   (7.14)

In Section 7.1, t_n is of order n^r, 0 < r < 1/6. For this t_n, type II consistency holds if a_n is of order n^r, 0 < r < 1/6, and

d_1 n^{-r} \exp\{-n^{2r}/2\} \to 0.
If a_n = O(n^k) with k \ge 1/6, then P(|T^{(j)}_n| \ge a_n) tends to zero at a rate faster than in (7.13) and consistency is ensured. For instance, if all relevant variables X_j have p^{(j)}_{rc} fixed as n \to \infty, then \min_{j \le d_1} |\theta^{(j)}_n| is of order \sqrt{n} and a_n is of order n^{1/2} - t_n.
7.3 Type I and II consistency

We see from Sections 7.1 and 7.2 that consistency holds under a variety of conditions. One set of conditions is:

Theorem 7.1. Under the regularity conditions of Sections 7.1 and 7.2, if

d_0 = c_0 \exp(n^b),   t_n = c_1 n^r,   \min_{j \le d_1} |\theta^{(j)}_n| - t_n = c_2 n^r,   d_1 = c_3 \exp(n^b)

for positive constants c_0, c_1, c_2, and c_3, and 0 < b/2 < r < 1/6, then CATCH is consistent.
Appendix: Proof of Theorem 6.1

(1) Let N_h = \sum_{i=1}^{n} I(X^{(i)} \in N_h(x^{(0)})). Then N_h \sim Bin(n, \int_{x^{(0)}-h}^{x^{(0)}+h} f(x) dx) and N_h \to_{a.s.} \infty as n \to \infty. Write

\hat\zeta(x^{(0)}, h) = n^{-1/2} \sqrt{X^2(x^{(0)}, h)} = (N_h/n)^{1/2} (N_h)^{-1/2} \sqrt{X^2(x^{(0)}, h)}.

Since N_h \sim Bin(n, \int_{x^{(0)}-h}^{x^{(0)}+h} f(x) dx), we have

N_h/n \to \int_{x^{(0)}-h}^{x^{(0)}+h} f(x) dx.   (7.15)

By Lemma 6.1 and Table 1,

(N_h)^{-1/2} \sqrt{X^2(x^{(0)}, h)} \to (\sum_{r=+,-} \sum_{c=1}^{C} (p_{rc} - p_r q_c)^2 / (p_r q_c))^{1/2},   (7.16)

where

p_{+c} = Pr(X > x^{(0)}, Y = c | X \in N_h(x^{(0)})),   p_{-c} = Pr(X \le x^{(0)}, Y = c | X \in N_h(x^{(0)}))

for r = +, - and c = 1, ..., C, and

p_r = \sum_{c=1}^{C} p_{rc},   q_c = \sum_{r=+,-} p_{rc}.

In fact, p_+ = Pr(X > x^{(0)} | X \in N_h(x^{(0)})), p_- = Pr(X \le x^{(0)} | X \in N_h(x^{(0)})), and q_c = Pr(Y = c | X \in N_h(x^{(0)})). By (7.15) and (7.16),

\hat\zeta(x^{(0)}, h) \to (\int_{x^{(0)}-h}^{x^{(0)}+h} f(x) dx)^{1/2} (\sum_{r=+,-} \sum_{c=1}^{C} (p_{rc} - p_r q_c)^2 / (p_r q_c))^{1/2} \equiv \zeta(x^{(0)}, h).

(2) Since X and Y are independent, p_{rc} = p_r q_c for r = +, - and c = 1, ..., C; then \zeta(x^{(0)}, h) = 0 for any h. Hence

\zeta(x^{(0)}) = \sup_{h>0} \zeta(x^{(0)}, h) = 0.

(3) Recall that p_c(x^{(0)}) = Pr(Y = c | x^{(0)}). Without loss of generality assume p'_c(x^{(0)}) > 0. Then there exists h such that

Pr(Y = c | x^{(0)} < X < x^{(0)} + h) > Pr(Y = c | x^{(0)} - h < X < x^{(0)}),

which is equivalent to p_{+c}/p_+ > p_{-c}/p_-. This shows that \zeta(x^{(0)}) > 0, since by Lemma 6.1, \zeta(x^{(0)}) = 0 would imply p_{+c}/p_+ = p_{-c}/p_- = p_c.
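The construction in the proof can be mirrored numerically. The sketch below (illustrative only; the data, the helper name, and the use of a discontinuous p_c(x) are my own choices) computes \hat\zeta(x^{(0)}, h) = n^{-1/2} \sqrt{X^2(x^{(0)}, h)} from the 2 x C table of the proof; it is near zero when X and Y are independent, as in part (2), and bounded away from zero when P(Y = c | x) changes at x^{(0)}.

```python
import numpy as np

def local_efficacy(X, Y, x0, h, C):
    """zeta_hat(x0, h) = n^{-1/2} sqrt(X^2(x0, h)): Pearson statistic of the
    2 x C table cross-classifying sign(X - x0) and Y within (x0 - h, x0 + h)."""
    n = len(X)
    w = np.abs(X - x0) < h                        # observations in N_h(x0)
    table = np.zeros((2, C))
    for r, c in zip((X[w] > x0).astype(int), Y[w]):
        table[r, c] += 1
    E = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    mask = E > 0                                  # 0/0 = 0 convention
    X2 = (((table - E) ** 2)[mask] / E[mask]).sum()
    return np.sqrt(X2 / n)

rng = np.random.default_rng(0)
x = rng.standard_normal(4000)
y_indep = rng.integers(0, 2, 4000)   # Y independent of X: efficacy near 0
y_dep = (x > 0).astype(int)          # P(Y = 1 | x) changes at x0 = 0: efficacy large
```

Here local_efficacy(x, y_indep, 0.0, 1.0, 2) is close to zero while local_efficacy(x, y_dep, 0.0, 1.0, 2) is not.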
References

[1] Breiman, L., 2001. Random forests. Machine Learning 45, 5-32.

[2] Breiman, L., Friedman, J. H., Olshen, R. A., Stone, C. J., 1984. Classification and Regression Trees. Wadsworth, Belmont.

[3] Doksum, K., Tang, S., Tsui, K.-W., 2008. Nonparametric variable selection: The EARTH algorithm. Journal of the American Statistical Association 103 (484), 1609-1620.

[4] Doksum, K. A., Samarov, A., 1995. Nonparametric estimation of global functionals and a measure of the explanatory power of covariates in regression. Annals of Statistics 23, 1443-1473.

[5] Doksum, K. A., Schafer, C., 2006. Powerful choices: Tuning parameter selection based on power. In: Frontiers in Statistics: Dedicated to Peter Bickel (J. Fan and H. L. Koul, eds.). Imperial College Press, 113-141.

[6] Friedman, J., 1991. Multivariate adaptive regression splines. The Annals of Statistics, 1-67.

[7] Gao, J., Gijbels, I., 2008. Bandwidth selection in nonparametric kernel testing. Journal of the American Statistical Association 103 (484), 1584-1594.

[8] Hall, P., 1992. The Bootstrap and Edgeworth Expansion. Springer, New York.

[9] Huang, L.-S., Chen, J., 2008. Analysis of variance, coefficient of determination and F-test for local polynomial regression. Annals of Statistics 36, 2085-2109.

[10] Jing, B.-Y., Shao, Q.-M., Wang, Q., 2003. Self-normalized Cramér-type large deviations for independent random variables. Annals of Probability 31, 2167-2215.

[11] Lin, D. Y., Zeng, D., 2011. Correcting for population stratification in genomewide association studies. Journal of the American Statistical Association 106, 997-1008.

[12] Loh, W.-Y., 2009. Improving the precision of classification trees. The Annals of Applied Statistics 3, 1710-1737.

[13] Price, A., Patterson, N., Plenge, R., Weinblatt, M., Shadick, N., Reich, D., 2006. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38 (8), 904-909.

[14] Qiao, Z., Zhou, L., Huang, J., 2008. Linear discriminant analysis for high-dimensional, low-sample-size data. Special issue of World Congress on Engineering 2008.

[15] Quinlan, J. R., 1987. Simplifying decision trees. International Journal of Man-Machine Studies 27 (3), 221-234.

[16] Schafer, C., Doksum, K. A., 2009. Selecting local models in multiple regression by maximizing power. Metrika 69, 283-304.

[17] Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267-288.

[18] Wang, L., Shen, X., 2007. On L1-norm multi-class support vector machines: Methodology and theory. Journal of the American Statistical Association 102, 583-594.

[19] Zhang, H. H., Liu, Y., Wu, Y., Zhu, J., 2008. Variable selection for the multicategory SVM via adaptive sup-norm regularization. Electronic Journal of Statistics 2, 149-167.
performance of SVM for this complex model The right panel shows that CATCH alsostabilizes the performance of RFC These results indicate that CATCH can improve theprediction accuracy of other classification algorithms
20 40 60 80 100
025
030
035
040
045
050
055
060
SVM
d
Err
or R
ate
SVMCATCH+SVM
20 40 60 80 100
025
030
035
040
045
050
055
060
Random Forest
d
RFCCATCH+RFC
Figure 3 IMRlowastrsquos of SVM RFC CATCH+SVM CATCH+RFC versus the number of predictors forthe simulation study in model (41) in example 41 The sample size is n = 400
Example 44 Contaminated Logistic Model Consider a binary response Y isin 0 1with
P (Y = 0|X = x) =
F (x) if r = 0G(x) if r = 1
(45)
where x = (x1 xd)
r sim Bernoulli(γ) (46)
F (x) = (1 + expminus(025 + 05x1 + x2))minus1 (47)
G(x) =
01 if 025 + 05x1 + x2 lt 009 if 025 + 05x1 + x2 ge 0
(48)
and x1 x2 xd are realizations of independent standard normal random variablesHence given X(i) = x(i) 1 le i le n the joint distribution of Y (1) Y (2) Y (n) is thecontaminated logistic model with contamination parameter γ
19
(1minus γ)nprodi=1
[F (x(i))]Y(i)
[1minus F (x(i))]1minusY(i)
+ γ
nprodi=1
[G(x(i))]Y(i)
[1minusG(x(i))]1minusY(i)
(49)
We perform simulation study for n = 200 d = 50 and γ = 0 02 04 06 08 and1 where γ = 0 corresponds to no contamination Figure 4 shows the IMRlowastrsquos versus thedifferent values of γ for this example CATCH improves the performance of the SVMand Random Forest classifiers for this contaminated logistic regression model
01
02
03
04
SVM
γ
Err
or R
ate
0 02 04 06 08 1
SVMCATCH+SVM
01
02
03
04
Random Forest
γ
0 02 04 06 08 1
RFCCATCH+RFC
Figure 4 IMRlowastrsquos of SVM RFC CATCH+SVM CATCH+RFC versus different contamination pa-rameters γ for simulation model (49) in example 44
5 CATCH Analysis of The ANN Data
In this section CATCH is applied to a data set available at the UCI data base(httpftpicsuciedupubmachine-learning-databasesthyroid-disease) The data is from a clinical trial to determine whether a patient is hypothyroid (seeeg Quinlan [15]) The dataset is split into two parts a training set with 3772 ob-servations and a testing set with 3428 observations The response is a categoricalvariable with three classes normal (not hypothyroid) hyperfunction and subnormalfunctioning There are 21 predictors including 15 categorical predictors and 6 numer-ical predictors The three classes are very imbalanced with 925 of normal patientsA good classifier should produce a significant decrease from an error rate 75 whichcan be obtained by classifying all observations to the normal class
20
CATCH identifies 7 variables as important including 4 continuous variables and3 categorical variables We consider three classification methods 1 Support VectorMachine with radial kernel denoted by SVM(r) 2 Support Vector Machine withpolynomial kernel denoted by SVM(p) 3 RFC We apply these three methods onthe training set to build classification models and use the test set to estimate themisclassification rate of the methods We apply the methods with CATCH and withoutCATCH respectively where ldquowith CATCHrdquo means that only the 7 important variablesidentified by CATCH are used to build the model and ldquowithout CATCHrdquo means thatall the 21 variables are used in the analysis
Table 5 Misclassification rates of three classification methods (SVM(r) SVM(p) RFC) with CATCHand without CATCH
SVM(r) SVM(p) RFC
wo CATCH 0052 0063 0024
with CATCH 0037 0055 0015
Table 5 shows that if CATCH is used to select important variables before the clas-sification methods are applied the misclassification rates decrease CATCH reducesthe misclassification rates by 29 13 and 37 for SVM(r) SVM(p) and RFC respec-tively CATCH + RFC has the best performance Although CATCH is not designed forimproving classification accuracy nonetheless it can help other classification methodsto perform better
6 Properties of Local Contingency Efficacy
In this section we investigate the properties of local contingency efficacy For a uni-variate continuous predictor Theorem 61 below shows that local contingency efficacyis well-defined and equals to 0 if the response variable is independent of the predictorConsider an R times C contingency table Let nrc r = 1 middot middot middot R c = 1 middot middot middot C be theobserved frequency in the (r c)-cell and let prc be the probability that an observationbelongs to the (r c)-cell Let
pr =Csumc=1
prc qc =Rsumr=1
prc (61)
21
and
nrmiddot =Csumc=1
nrc nmiddotc =Rsumi=1
nrc n =Rsumi=1
Csumc=1
nrc Erc = nrmiddotnmiddotcn (62)
Consider testing that the rows and the columns are independent ie
H0 forallr c prc = prqc versus H1 existr c prc 6= prqc (63)
a standard test statistic is
X 2 =Rsumr=1
Csumc=1
(nrc minus Erc)2
Erc (64)
Under H0 X 2 converges in distribution to χ2(Rminus1)(Cminus1) Moreover define 00 = 0 we
have the well known result
Lemma 61 As n tends to infinity X 2n tends in probability to
τ 2 equivRsumr=1
Csumc=1
(prc minus prqc)2
prqc
Moreover if prqc is bounded away from 0 and 1 as n rarr infin and if R and C are fixedas nrarrinfin then τ 2 = 0 if and only if H0 is true
Remark 61 Lemma 61 shows that the contingency efficacy ζ in (28) is well definedMoreover ζ = 0 if and only if X and Y are independent
Theorem 61 Consider a categorical response Y and a continuous predictor X withdensity function f(x) such that for a fixed h gt 0 f(x) gt 0 for x isin (x(0) minus h x(0) + h)then
(1) For fixed h the estimated local section efficacy ζ(x(0) h) defined in (25) convergesalmost surly to the section efficacy ζ(x(0) h) defined in (24)
(2) If X and Y are independent then the local efficacy ζ(x(0)) defined in (24) is zero
(3) If there exists c isin 1 middot middot middot C such that pc(x) = P (Y = c|x) has a continuousderivative at x = x(0) with pprimec(x
(0)) 6= 0 then ζ(x(0)) gt 0
The χ2 statistic (64) can be written in terms of multinomial proportions prc withradicn(prc minus prc) 1 le r le R 1 le c le C converging in distribution to a multivariate
normal By Taylor expansion of Tn equivradicχ2n at τ = plim
radicχ2n we find that
22
Theorem 62 radicn(Tn minus τ) minusrarrd N(0 v2) (65)
where the asymptotic variance v2 which can be obtained from the Taylor expansion isnot needed in this paper
Remark 62 The result (65) applies to the multivariate χ2-based efficacies ζ with nreplaced by the number of observations that goes in to the computations of ζ In thiscase we use the notation χ2
j for the chi-square statistic for the jth variable
7 Consistency Of CATCH
Let $Y$ denote a variable to be classified into one of the categories $1, \ldots, C$ by using the $X_j$'s in the vector $X = (X_1, \ldots, X_d)$. Let $X_{-j} \equiv X \setminus X_j$ denote $X$ without the $X_j$ component. Assume that the squared Pearson nonparametric correlation (Doksum and Samarov [4], Huang and Chen [9]) between $X_j$ and $X_{-j}$ is less than 0.5. We call the variable $X_j$ relevant for $Y$ if, conditionally given $X_{-j}$, $\mathcal{L}(Y \mid X_j = x)$ is not constant in $x$, and irrelevant otherwise. Our data are $(X^{(1)}, Y^{(1)}), \ldots, (X^{(n)}, Y^{(n)})$, i.i.d. as $(X, Y)$. A variable selection procedure is consistent if the probability that all the relevant variables are selected and none of the irrelevant variables are selected tends to one as $n$ tends to infinity.

We give a general nonparametric framework where the variable selection rule based on the CATCH importance scores $s_j$ given in (3.8) is consistent.

When the predictors are not all categorical and CATCH involves tubes, we assume that the number of observations in each tube tends to $\infty$ as $n \to \infty$, and we assume that, for those importance scores that require the selection of a section size $h$, the selection is from a finite set $\{h_j\}$ with $\min_j\{h_j\} > h_0$ for some $h_0 > 0$. It follows that the number of observations $m_{jk}$ in the $k$th tube used to calculate the importance score $s_j$ for the variable $X_j$ tends to infinity as $n \to \infty$.
The key quantities that determine consistency are the limits $\lambda_j$ of $s_j$. That is,
\[
\lambda_j = \tau(\zeta_j, P) = E_P[\zeta_j(X)] = \int \zeta_j(x)\, dP(x), \qquad j = 1, \ldots, d,
\]
where $\zeta_j(x)$ is the local contingency efficacy defined by (2.4) for numerical $X$'s and by (2.8) for categorical $X$'s, and $P$ is the probability distribution of $X$. Our estimate $s_j$ of $\lambda_j = \tau(\zeta_j, P)$ is
\[
s_j = \tau(\hat\zeta_j, \hat P) = \frac{1}{M} \sum_{k=1}^{M} \hat\zeta_j(X^*_k),
\]
where $\hat\zeta_j$ is defined in Section 3 and $\hat P$ is the empirical distribution of $X$.

Let $d_0$ denote the number of $X$'s that are irrelevant for $Y$. Set $d_1 = d - d_0$ and, without loss of generality, reorder the $\lambda_j$ so that
\[
\lambda_j > 0 \ \text{ if } j = 1, \ldots, d_1; \qquad \lambda_j = 0 \ \text{ if } j = d_1 + 1, \ldots, d. \tag{7.1}
\]
Our variable selection rule is
\[
\text{``keep variable } X_j \text{ if } s_j \ge t\text{''} \quad \text{for some threshold } t. \tag{7.2}
\]
Rule (7.2) is consistent if and only if
\[
P\bigl(\min_{j \le d_1} s_j \ge t \ \text{ and } \ \max_{j > d_1} s_j < t\bigr) \longrightarrow 1. \tag{7.3}
\]
To establish (7.3) it is enough to show that
\[
P\bigl(\max_{j > d_1} s_j > t\bigr) \longrightarrow 0 \tag{7.4}
\]
and
\[
P\bigl(\min_{j \le d_1} s_j \le t\bigr) \longrightarrow 0. \tag{7.5}
\]
We call (7.4) and (7.5) type I and type II consistency, respectively. Type I and II error probabilities refer to the probabilities of any false positives and any false negatives, respectively.
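As a concrete illustration of rule (7.2) and the two error events, the following sketch (our own, not from the paper; the scores and the relevant set are made-up inputs) applies a threshold to a vector of importance scores:

```python
import numpy as np

def catch_select(scores, t):
    """Rule (7.2): keep variable j if its importance score s_j >= t."""
    return {j for j, s in enumerate(scores) if s >= t}

# Hypothetical importance scores for d = 6 variables; variables 0-2 relevant.
scores = np.array([0.9, 0.7, 0.8, 0.1, 0.05, 0.2])
relevant = {0, 1, 2}
selected = catch_select(scores, t=0.5)

# Type I error event: any irrelevant variable selected (a false positive).
type_I_error = bool(selected - relevant)
# Type II error event: any relevant variable missed (a false negative).
type_II_error = bool(relevant - selected)
print(selected, type_I_error, type_II_error)  # {0, 1, 2} False False
```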
7.1 Type I consistency
We consider (7.4) first and note that
\[
P\bigl(\max_{j > d_1} s_j > t\bigr) \le \sum_{j = d_1 + 1}^{d} P(s_j > t). \tag{7.6}
\]
For the $j$th irrelevant variable $X_j$, $j > d_1$, the results in Section 6 apply to the local efficacy $\hat\zeta_j(x)$ at a fixed point $x$. The importance score $s_j$ defined in (3.8) is the average of such efficacies over a sample of $M$ random points $x^*_a$. We assume that there is a point $x^{(0)}$ such that, for irrelevant $X_j$'s and for $n$ large enough, $s_j^{(0)} \equiv \hat\zeta_j(x^{(0)})$ is stochastically larger in the right tail than the average $M^{-1} \sum_{a=1}^{M} \hat\zeta_j(x^*_a)$. That is, we assume that there exist $t > 0$ and $n_0$ such that for $n \ge n_0$,
\[
P(s_j > t) \le P\bigl(s_j^{(0)} > t\bigr), \qquad j > d_1. \tag{7.7}
\]
The existence of such a point $x^{(0)}$ is established by considering the probability limit of
\[
\arg\max\bigl\{\hat\zeta_j(x^*_a) : x^*_a \in \{x^*_1, \ldots, x^*_M\}\bigr\}.
\]
Note that by Lemma 6.1, $s_j^{(0)} \to_p 0$ as $n \to \infty$. It follows that we have established (7.4) for $t$ and $d_0 = d - d_1$ that are fixed as $n \to \infty$. To examine the interesting case where $d_0 \to \infty$ as $n \to \infty$, we need large deviation results, to which we turn next. In the $d_0 \to \infty$ case we consider $t \equiv t_n$ that tends to $\infty$ as $n \to \infty$, and we use the weaker assumption that there exist $x^{(0)}$ and $n_0$ such that
\[
P(s_j > t_n) \le P\bigl(s_j^{(0)} > t_n\bigr) \tag{7.8}
\]
for $n \ge n_0$ and each $t_n$ with $\lim_{n\to\infty} t_n = \infty$. We also assume that the number of columns $C$ and rows $R$ in the contingency table that defines $\chi^2_{jn}$ given by (6.4) stays fixed as $n \to \infty$. In this case $P\bigl(s_j^{(0)} > t_n\bigr) \to 0$ provided each term
\[
\bigl[T^{(j)}_{rc}\bigr]^2 \equiv \Bigl[\frac{n_{rc} - E_{rc}}{\sqrt{n}\,\sqrt{E_{rc}}}\Bigr]^2
\]
in $s_j^2 = \chi^2_{jn}/n$, defined for $X_j$ by (6.4), satisfies
\[
P\bigl(|T^{(j)}_{rc}| > t_n\bigr) \longrightarrow 0.
\]
This is because
\[
P\Bigl(\sqrt{\textstyle\sum_r \sum_c \bigl[T^{(j)}_{rc}\bigr]^2} > t_n\Bigr)
\le P\Bigl(RC \max_{r,c} \bigl[T^{(j)}_{rc}\bigr]^2 > t_n^2\Bigr)
\le RC \max_{r,c} P\Bigl(|T^{(j)}_{rc}| > t_n/\sqrt{RC}\Bigr),
\]
and $t_n$ and $t_n/\sqrt{RC}$ are of the same order as $n \to \infty$. By large deviation theory (Hall [8] and Jing et al. [10]), for some constant $A$,
\[
P\bigl(|T^{(j)}_{rc}| > t_n\bigr)
= \frac{2\, t_n^{-1}}{\sqrt{2\pi}}\, e^{-t_n^2/2}\Bigl[1 + A\, t_n^3 n^{-1/2} + o\bigl(t_n^3 n^{-1/2}\bigr)\Bigr] \tag{7.9}
\]
provided $t_n = o(n^{1/6})$.
If (7.9) is uniform in $j > d_1$, then (7.6) and (7.9) imply that type I consistency holds if
\[
d_0 \bigl(t_n/\sqrt{2}\bigr)^{-1} e^{-t_n^2/2} \longrightarrow 0. \tag{7.10}
\]
To examine (7.10), write $d_0$ and $t_n$ as $d_0 = \exp(n^b)$ and $t_n = \sqrt{2}\, n^r$. By large deviation theory, (7.10) holds when $r < 1/6$. Thus if $0 < \tfrac{1}{2} b < r < \tfrac{1}{6}$, then
\[
d_0 \bigl(t_n/\sqrt{2}\bigr)^{-1} e^{-t_n^2/2} = n^{-r} \exp\bigl(n^b - n^{2r}\bigr) \longrightarrow 0.
\]
Thus $d_0 = \exp(n^b)$ with $b = 1/3 - \varepsilon$, for any $\varepsilon > 0$, is the largest $d_0$ can be to have consistency. For instance, if there are exactly $d_0 = \exp(n^{1/4})$ irrelevant variables and we use the threshold $t_n = \sqrt{2}\, n^{1/7}$, the probability of selecting any of the irrelevant variables tends to zero at the rate $n^{-1/7} \exp\bigl(n^{1/4} - n^{2/7}\bigr)$.
7.2 Type II consistency
We now assume that for each relevant variable $X_j$ there is a least favorable point $x^{(1)}$ such that the local efficacy $s_j^{(1)} \equiv \hat\zeta_j(x^{(1)})$ is stochastically smaller than the importance score $s_j = M^{-1} \sum_{a=1}^{M} \hat\zeta_j(x^*_a)$. That is, we assume that for each relevant variable $X_j$ there exist $n_1$ and $x^{(1)}$ such that for $n \ge n_1$,
\[
P\bigl(s_j^{(1)} \le t_n\bigr) \ge P(s_j \le t_n), \qquad j \le d_1. \tag{7.11}
\]
The existence of such a least favorable $x^{(1)}$ is established by considering the probability limit of
\[
\arg\min\bigl\{\hat\zeta_j(x^*_a) : x^*_a \in \{x^*_1, \ldots, x^*_M\}\bigr\}.
\]
We again assume that $R$ and $C$ are fixed, so that to show type II consistency it is enough to show that for each $(r, c)$ and $j \le d_1$,
\[
P\bigl(|T^{(j)}_{rc}| \le t_n\bigr) \longrightarrow 0.
\]
This is because
\[
P\Bigl(\sqrt{\textstyle\sum_r \sum_c \bigl[T^{(j)}_{rc}\bigr]^2} \le t_n\Bigr)
\le P\Bigl(\min_{r,c} \bigl[T^{(j)}_{rc}\bigr]^2 \le t_n^2\Bigr)
\le RC \max_{r,c} P\bigl(|T^{(j)}_{rc}| \le t_n\bigr).
\]
By Theorem 6.2, $\sqrt{n}\bigl(s_j^{(1)} - \tau_j\bigr) \to_d N(0, \nu^2)$, $\tau_j > 0$. It follows that if $t_n$ and $d_1$ are bounded above as $n \to \infty$ and $\tau_j > 0$, then $P\bigl(s_j^{(1)} \le t_n\bigr) \to 0$. Thus type II consistency holds in this case. To examine the more interesting case where $t_n$ and $d_1$ are not bounded above and $\tau_j$ may tend to zero as $n \to \infty$, we will need to center the $T^{(j)}_{rc}$. Thus set
\[
\theta_n^{(j)} = \frac{\sqrt{n}\bigl(p^{(j)}_{rc} - p^{(j)}_r p^{(j)}_c\bigr)}{\sqrt{p^{(j)}_r p^{(j)}_c}}, \tag{7.12}
\]
\[
T_n^{(j)} = T^{(j)}_{rc} - \theta_n^{(j)},
\]
where for simplicity the dependence of $T_n^{(j)}$ and $\theta_n^{(j)}$ on $r$ and $c$ is not indicated.
Lemma 7.1.
\[
P\Bigl(\min_{j \le d_1} |T^{(j)}_{rc}| \le t_n\Bigr)
\le \sum_{j=1}^{d_1} P\Bigl(|T_n^{(j)}| \ge \min_{j \le d_1} \bigl|\theta_n^{(j)}\bigr| - t_n\Bigr).
\]
Proof.
\[
\begin{aligned}
P\Bigl(\min_{j \le d_1} |T^{(j)}_{rc}| \le t_n\Bigr)
&= P\Bigl(\min_{j \le d_1} \bigl|T_n^{(j)} + \theta_n^{(j)}\bigr| \le t_n\Bigr) \\
&\le P\Bigl(\min_{j \le d_1} \bigl|\theta_n^{(j)}\bigr| - \max_{j \le d_1} \bigl|T_n^{(j)}\bigr| \le t_n\Bigr) \\
&= P\Bigl(\max_{j \le d_1} \bigl|T_n^{(j)}\bigr| \ge \min_{j \le d_1} \bigl|\theta_n^{(j)}\bigr| - t_n\Bigr) \\
&\le \sum_{j=1}^{d_1} P\Bigl(\bigl|T_n^{(j)}\bigr| \ge \min_{j \le d_1} \bigl|\theta_n^{(j)}\bigr| - t_n\Bigr).
\end{aligned}
\]
Set
\[
a_n = \min_{j \le d_1} \bigl|\theta_n^{(j)}\bigr| - t_n;
\]
then, by large deviation theory, if $a_n = o(n^{1/6})$,
\[
P\bigl(|T_n^{(j)}| \ge a_n\bigr) \propto \bigl(a_n/\sqrt{2}\bigr)^{-1} \exp\{-a_n^2/2\}, \tag{7.13}
\]
where $\propto$ signifies asymptotic order as $n \to \infty$, $a_n \to \infty$. It follows that if (7.13) holds uniformly in $j$, then by Lemma 7.1,
\[
P\Bigl(\min_{j \le d_1} \bigl|T^{(j)}_{rc}\bigr| \le t_n\Bigr)
\propto d_1 \bigl(a_n/\sqrt{2}\bigr)^{-1} \exp\{-a_n^2/2\}.
\]
Thus type II consistency holds when $a_n = o(n^{1/6})$ and
\[
d_1 \bigl(a_n/\sqrt{2}\bigr)^{-1} \exp\{-a_n^2/2\} \longrightarrow 0. \tag{7.14}
\]
In Section 7.1, $t_n$ is of order $n^r$, $0 < r < 1/6$. For this $t_n$, type II consistency holds if $a_n$ is of order $n^r$, $0 < r < 1/6$, and
\[
d_1 n^{-r} \exp\{-n^{2r}/2\} \longrightarrow 0.
\]
If $a_n = O(n^k)$ with $k \ge 1/6$, then $P\bigl(|T_n^{(j)}| \ge a_n\bigr)$ tends to zero at a rate faster than in (7.13) and consistency is ensured. For instance, if all relevant variables $X_j$ have $p^{(j)}_{rc}$ fixed as $n \to \infty$, then $\min_{j \le d_1} \bigl|\theta_n^{(j)}\bigr|$ is of order $\sqrt{n}$ and $a_n$ is of order $n^{1/2} - t_n$.
7.3 Type I and II consistency

We see from Sections 7.1 and 7.2 that consistency holds under a variety of conditions. One set of conditions is:

Theorem 7.1. Under the regularity conditions of Sections 7.1 and 7.2, if
\[
d_0 = c_0 \exp(n^b), \quad t_n = c_1 n^r, \quad
\min_{j \le d_1} \bigl|\theta_n^{(j)}\bigr| - t_n = c_2 n^r, \quad d_1 = c_3 \exp(n^b)
\]
for positive constants $c_0$, $c_1$, $c_2$, and $c_3$, and $0 < \tfrac{1}{2} b < r < \tfrac{1}{6}$, then CATCH is consistent.
Appendix

Proof of Theorem 6.1.

(1) Let $N_h = \sum_{i=1}^{n} I\bigl(X^{(i)} \in N_h(x^{(0)})\bigr)$. Then $N_h \sim \mathrm{Bin}\bigl(n, \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx\bigr)$ and $N_h \to_{a.s.} \infty$ as $n \to \infty$.
\[
\hat\zeta(x^{(0)}, h) = n^{-1/2} \sqrt{\mathcal{X}^2(x^{(0)}, h)}
= (N_h/n)^{1/2}\, (N_h)^{-1/2} \sqrt{\mathcal{X}^2(x^{(0)}, h)}.
\]
Since $N_h \sim \mathrm{Bin}\bigl(n, \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx\bigr)$, we have
\[
N_h/n \longrightarrow \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx. \tag{7.15}
\]
By Lemma 6.1 and Table 1,
\[
(N_h)^{-1/2} \sqrt{\mathcal{X}^2(x^{(0)}, h)}
\longrightarrow \Bigl(\sum_{r=+,-} \sum_{c=1}^{C} \frac{(p_{rc} - p_r q_c)^2}{p_r q_c}\Bigr)^{1/2}, \tag{7.16}
\]
where
\[
p_{+c} = \Pr\bigl(X > x^{(0)},\, Y = c \mid X \in N_h(x^{(0)})\bigr), \qquad
p_{-c} = \Pr\bigl(X \le x^{(0)},\, Y = c \mid X \in N_h(x^{(0)})\bigr)
\]
for $r = +, -$ and $c = 1, \ldots, C$,
\[
p_r = \sum_{c=1}^{C} p_{rc}, \qquad q_c = \sum_{r=+,-} p_{rc}.
\]
Actually $p_+ = \Pr\bigl(X > x^{(0)} \mid X \in N_h(x^{(0)})\bigr)$, $p_- = \Pr\bigl(X \le x^{(0)} \mid X \in N_h(x^{(0)})\bigr)$, and $q_c = \Pr\bigl(Y = c \mid X \in N_h(x^{(0)})\bigr)$.
By (7.15) and (7.16),
\[
\hat\zeta(x^{(0)}, h) \longrightarrow
\Bigl(\int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx\Bigr)^{1/2}
\Bigl(\sum_{r=+,-} \sum_{c=1}^{C} \frac{(p_{rc} - p_r q_c)^2}{p_r q_c}\Bigr)^{1/2}
\equiv \zeta(x^{(0)}, h).
\]
(2) Since $X$ and $Y$ are independent, $p_{rc} = p_r q_c$ for $r = +,-$ and $c = 1, \ldots, C$; hence $\zeta(x^{(0)}, h) = 0$ for any $h$, and
\[
\zeta(x^{(0)}) = \sup_{h > 0}\, \zeta(x^{(0)}, h) = 0.
\]
(3) Recall that $p_c(x^{(0)}) = \Pr(Y = c \mid x^{(0)})$. Without loss of generality assume $p_c'(x^{(0)}) > 0$. Then there exists $h$ such that
\[
\Pr\bigl(Y = c \mid x^{(0)} < X < x^{(0)} + h\bigr) > \Pr\bigl(Y = c \mid x^{(0)} - h < X < x^{(0)}\bigr),
\]
which is equivalent to $p_{+c}/p_+ > p_{-c}/p_-$. This shows that $\zeta(x^{(0)}) > 0$, since by Lemma 6.1, $\zeta(x^{(0)}) = 0$ would imply $p_{+c}/p_+ = p_{-c}/p_- = q_c$.
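The construction in the proof can be mirrored numerically. The sketch below (our own illustration; the logistic model and all parameter values are made-up assumptions) builds the local $2 \times C$ table at a point $x^{(0)}$ with window $h$ and computes $\hat\zeta(x^{(0)}, h) = n^{-1/2}\sqrt{\mathcal{X}^2(x^{(0)}, h)}$:

```python
import numpy as np

rng = np.random.default_rng(0)

def local_efficacy(x, y, x0, h):
    """zeta_hat(x0, h) = n^{-1/2} * sqrt(X^2), where X^2 is the chi-square
    statistic of the 2 x C table (rows: X > x0 vs X <= x0; columns: Y)
    built from the observations falling in the window (x0 - h, x0 + h)."""
    n = len(x)
    inside = (x > x0 - h) & (x < x0 + h)
    xw, yw = x[inside], y[inside]
    rows = (xw > x0).astype(int)              # row index: 0 (<= x0), 1 (> x0)
    classes = np.unique(y)
    table = np.array([[np.sum((rows == r) & (yw == c)) for c in classes]
                      for r in (0, 1)], dtype=float)
    E = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    cells = np.where(E > 0, (table - E) ** 2 / np.where(E > 0, E, 1), 0.0)
    return np.sqrt(cells.sum() / n)           # convention 0/0 = 0

# Illustrative data: Y depends on X through a logistic curve.
n = 20000
x = rng.uniform(-1, 1, n)
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-4 * x))).astype(int)
dep = local_efficacy(x, y, x0=0.0, h=0.3)

# Independent Y gives a local efficacy near zero (Theorem 6.1(2)).
y_ind = rng.integers(0, 2, n)
ind = local_efficacy(x, y_ind, x0=0.0, h=0.3)
print(dep, ind)
```

With dependence the local efficacy stabilizes at a positive value, while under independence it shrinks toward zero at rate $n^{-1/2}$, as parts (1) and (2) of the theorem indicate.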
References

[1] Breiman, L., 2001. Random forests. Machine Learning 45, 5–32.

[2] Breiman, L., Friedman, J. H., Olshen, R. A., Stone, C. J., 1984. Classification and Regression Trees. Wadsworth, Belmont.

[3] Doksum, K., Tang, S., Tsui, K.-W., 2008. Nonparametric variable selection: The EARTH algorithm. Journal of the American Statistical Association 103 (484), 1609–1620.

[4] Doksum, K. A., Samarov, A., 1995. Nonparametric estimation of global functionals and a measure of explanatory power of covariates in regression. Annals of Statistics 23, 1443–1473.

[5] Doksum, K. A., Schafer, C., 2006. Powerful choices: Tuning parameter selection based on power. In: Frontiers in Statistics: Dedicated to Peter Bickel (J. Fan and H. L. Koul, eds.). Imperial College Press, 113–141.

[6] Friedman, J., 1991. Multivariate adaptive regression splines. The Annals of Statistics, 1–67.

[7] Gao, J., Gijbels, I., 2008. Bandwidth selection in nonparametric kernel testing. Journal of the American Statistical Association 103 (484), 1584–1594.

[8] Hall, P., 1992. The Bootstrap and Edgeworth Expansion. Springer, New York.

[9] Huang, L.-S., Chen, J., 2008. Analysis of variance, coefficient of determination and F-test for local polynomial regression. Annals of Statistics 36, 2085–2109.

[10] Jing, B.-Y., Shao, Q.-M., Wang, Q., 2003. Self-normalized Cramér-type large deviations for independent random variables. Annals of Probability 31, 2167–2215.

[11] Lin, D. Y., Zeng, D., 2011. Correcting for population stratification in genomewide association studies. Journal of the American Statistical Association 106, 997–1008.

[12] Loh, W.-Y., 2009. Improving the precision of classification trees. The Annals of Applied Statistics 3, 1710–1737.

[13] Price, A., Patterson, N., Plenge, R., Weinblatt, M., Shadick, N., Reich, D., 2006. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38 (8), 904–909.

[14] Qiao, Z., Zhou, L., Huang, J., 2008. Linear discriminant analysis for high-dimensional, low-sample-size data. Special issue of the World Congress on Engineering 2008.

[15] Quinlan, J. R., 1987. Simplifying decision trees. International Journal of Man-Machine Studies 27 (3), 221–234.

[16] Schafer, C., Doksum, K. A., 2009. Selecting local models in multiple regression by maximizing power. Metrika 69, 283–304.

[17] Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267–288.

[18] Wang, L., Shen, X., 2007. On L1-norm multiclass support vector machines: methodology and theory. Journal of the American Statistical Association 102, 583–594.

[19] Zhang, H. H., Liu, Y., Wu, Y., Zhu, J., 2008. Variable selection for the multicategory SVM via adaptive sup-norm regularization. Electronic Journal of Statistics 2, 149–167.
\[
(1 - \gamma) \prod_{i=1}^{n} \bigl[F(x^{(i)})\bigr]^{Y^{(i)}} \bigl[1 - F(x^{(i)})\bigr]^{1 - Y^{(i)}}
+ \gamma \prod_{i=1}^{n} \bigl[G(x^{(i)})\bigr]^{Y^{(i)}} \bigl[1 - G(x^{(i)})\bigr]^{1 - Y^{(i)}}. \tag{4.9}
\]
We perform a simulation study for $n = 200$, $d = 50$, and $\gamma = 0, 0.2, 0.4, 0.6, 0.8$, and $1$, where $\gamma = 0$ corresponds to no contamination. Figure 4 shows the IMR*'s versus the different values of $\gamma$ for this example. CATCH improves the performance of the SVM and Random Forest classifiers for this contaminated logistic regression model.
[Figure 4: IMR*'s of SVM, RFC, CATCH+SVM, and CATCH+RFC versus different contamination parameters $\gamma$ for simulation model (4.9) in Example 4.4. Left panel: SVM vs. CATCH+SVM; right panel: RFC vs. CATCH+RFC; error rates range over roughly 0.1–0.4.]
5 CATCH Analysis of the ANN Data

In this section CATCH is applied to a data set available at the UCI database (http://ftp.ics.uci.edu/pub/machine-learning-databases/thyroid-disease). The data are from a clinical trial to determine whether a patient is hypothyroid (see, e.g., Quinlan [15]). The dataset is split into two parts: a training set with 3772 observations and a testing set with 3428 observations. The response is a categorical variable with three classes: normal (not hypothyroid), hyperfunction, and subnormal functioning. There are 21 predictors, including 15 categorical predictors and 6 numerical predictors. The three classes are very imbalanced, with 92.5% normal patients. A good classifier should produce a significant decrease from the 7.5% error rate that can be obtained by classifying all observations to the normal class.
CATCH identifies 7 variables as important, including 4 continuous variables and 3 categorical variables. We consider three classification methods: 1. Support Vector Machine with radial kernel, denoted by SVM(r); 2. Support Vector Machine with polynomial kernel, denoted by SVM(p); 3. RFC. We apply these three methods on the training set to build classification models and use the test set to estimate the misclassification rate of the methods. We apply the methods with CATCH and without CATCH, respectively, where "with CATCH" means that only the 7 important variables identified by CATCH are used to build the model, and "without CATCH" means that all 21 variables are used in the analysis.
Table 5: Misclassification rates of three classification methods (SVM(r), SVM(p), RFC) with CATCH and without CATCH.

                SVM(r)   SVM(p)   RFC
  w/o CATCH     0.052    0.063    0.024
  with CATCH    0.037    0.055    0.015
Table 5 shows that if CATCH is used to select important variables before the classification methods are applied, the misclassification rates decrease. CATCH reduces the misclassification rates by 29%, 13%, and 37% for SVM(r), SVM(p), and RFC, respectively. CATCH + RFC has the best performance. Although CATCH is not designed for improving classification accuracy, it can nonetheless help other classification methods to perform better.
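The relative reductions quoted above follow directly from the rates in Table 5; recomputing them (our own check):

```python
# Relative reduction in misclassification rate from Table 5:
# (rate without CATCH - rate with CATCH) / rate without CATCH.
without = {"SVM(r)": 0.052, "SVM(p)": 0.063, "RFC": 0.024}
with_catch = {"SVM(r)": 0.037, "SVM(p)": 0.055, "RFC": 0.015}

reduction = {m: 100 * (without[m] - with_catch[m]) / without[m]
             for m in without}
# Approximately 28.8%, 12.7%, and 37.5%, matching the figures
# quoted in the text up to rounding.
print(reduction)
```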
6 Properties of Local Contingency Efficacy

In this section we investigate the properties of the local contingency efficacy. For a univariate continuous predictor, Theorem 6.1 below shows that the local contingency efficacy is well defined and equals 0 if the response variable is independent of the predictor. Consider an $R \times C$ contingency table. Let $n_{rc}$, $r = 1, \ldots, R$, $c = 1, \ldots, C$, be the observed frequency in the $(r, c)$-cell and let $p_{rc}$ be the probability that an observation belongs to the $(r, c)$-cell. Let
\[
p_r = \sum_{c=1}^{C} p_{rc}, \qquad q_c = \sum_{r=1}^{R} p_{rc}, \tag{6.1}
\]
and
\[
n_{r\cdot} = \sum_{c=1}^{C} n_{rc}, \quad
n_{\cdot c} = \sum_{r=1}^{R} n_{rc}, \quad
n = \sum_{r=1}^{R} \sum_{c=1}^{C} n_{rc}, \quad
E_{rc} = n_{r\cdot}\, n_{\cdot c}/n. \tag{6.2}
\]
To test whether the rows and the columns are independent, i.e.,
\[
H_0: \forall r, c,\ p_{rc} = p_r q_c \quad \text{versus} \quad H_1: \exists r, c,\ p_{rc} \neq p_r q_c, \tag{6.3}
\]
a standard test statistic is
\[
\mathcal{X}^2 = \sum_{r=1}^{R} \sum_{c=1}^{C} \frac{(n_{rc} - E_{rc})^2}{E_{rc}}. \tag{6.4}
\]
Under $H_0$, $\mathcal{X}^2$ converges in distribution to $\chi^2_{(R-1)(C-1)}$. Moreover, defining $0/0 = 0$, we have the well-known result:
Lemma 6.1. As $n$ tends to infinity, $\mathcal{X}^2/n$ tends in probability to
\[
\tau^2 \equiv \sum_{r=1}^{R} \sum_{c=1}^{C} \frac{(p_{rc} - p_r q_c)^2}{p_r q_c}.
\]
Moreover, if $p_r q_c$ is bounded away from 0 and 1 as $n \to \infty$, and if $R$ and $C$ are fixed as $n \to \infty$, then $\tau^2 = 0$ if and only if $H_0$ is true.
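Lemma 6.1 is easy to check numerically. The sketch below (our own illustration, with made-up cell probabilities) draws a sample from a dependent $2 \times 3$ table and compares $\mathcal{X}^2/n$ with $\tau^2$:

```python
import numpy as np

rng = np.random.default_rng(1)

def chi2_stat(table):
    """Pearson chi-square statistic X^2 of (6.4) for an R x C count table."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    E = np.outer(table.sum(axis=1), table.sum(axis=0)) / n  # E_rc = n_r. n_.c / n
    return ((table - E) ** 2 / E).sum()

# Made-up dependent cell probabilities p_rc for a 2 x 3 table.
p = np.array([[0.25, 0.15, 0.10],
              [0.05, 0.15, 0.30]])
p_r, q_c = p.sum(axis=1), p.sum(axis=0)
tau2 = ((p - np.outer(p_r, q_c)) ** 2 / np.outer(p_r, q_c)).sum()

# Draw n multinomial observations and compare X^2 / n with tau^2.
n = 200000
counts = rng.multinomial(n, p.ravel()).reshape(p.shape)
print(chi2_stat(counts) / n, tau2)  # X^2/n is close to tau^2 (Lemma 6.1)
```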
Remark 61 Lemma 61 shows that the contingency efficacy ζ in (28) is well definedMoreover ζ = 0 if and only if X and Y are independent
Theorem 61 Consider a categorical response Y and a continuous predictor X withdensity function f(x) such that for a fixed h gt 0 f(x) gt 0 for x isin (x(0) minus h x(0) + h)then
(1) For fixed h the estimated local section efficacy ζ(x(0) h) defined in (25) convergesalmost surly to the section efficacy ζ(x(0) h) defined in (24)
(2) If X and Y are independent then the local efficacy ζ(x(0)) defined in (24) is zero
(3) If there exists c isin 1 middot middot middot C such that pc(x) = P (Y = c|x) has a continuousderivative at x = x(0) with pprimec(x
(0)) 6= 0 then ζ(x(0)) gt 0
The χ2 statistic (64) can be written in terms of multinomial proportions prc withradicn(prc minus prc) 1 le r le R 1 le c le C converging in distribution to a multivariate
normal By Taylor expansion of Tn equivradicχ2n at τ = plim
radicχ2n we find that
22
Theorem 62 radicn(Tn minus τ) minusrarrd N(0 v2) (65)
where the asymptotic variance v2 which can be obtained from the Taylor expansion isnot needed in this paper
Remark 62 The result (65) applies to the multivariate χ2-based efficacies ζ with nreplaced by the number of observations that goes in to the computations of ζ In thiscase we use the notation χ2
j for the chi-square statistic for the jth variable
7 Consistency Of CATCH
Let Y denote a variable to be classified into one of the categories 1 C by usingthe Xjrsquos in the vector X = (X1 Xd) Let Xminusj equiv X minusXj denote X without the Xj
component Assume that the squared Pearson nonparametric correlation (Doksum andSamarov [4] Huang and Chen [9]) between Xj and Xminusj is less than 05 We call thevariable Xj relevant for Y if conditionally given Xminusj L(Y |Xj = x) is not constant in xand irrelevant otherwise Our data are (X(1) Y (1)) (X(n) Y (n)) iid as (X Y ) Avariable selection procedure is consistent if the probability that all the relevant variablesare selected and none of the irrelevant variables are selected tends to one as n tends toinfinity
We give a general nonparametric framework where the variable selection rule basedon the CATCH importance scores sj given in (38) is consistent
When the predictors are not all categorical and CATCH involves tubes we assumethat the number of observations in each tube tends toinfin as nrarrinfin and we assume thatfor those importance scores that require the selection of a section size h the selection isfrom a finite set hj with minjhj gt h0 for some h0 gt 0 It follows that the numberof observations mjk in the kth tube used to calculate the importance score sj for thevariable Xj tends to infinity as nrarrinfin
The key quantities that determine consistency are the limits λj of sj That is
λj = τ(ζj P ) = EP [ζj(X)] =
intζj(x)dP (x) j = 1 d
where ζj(x) is the local contingency efficacy defined by (24) for numerical Xrsquos and by(28) for categorical Xrsquos and P is the probability distribution of X Our estimate sjof τj is
sj = τ(ζj P ) =1
M
Msumk=1
ζj(Xlowastk)
where ζj is defined in section 3 and P is the empirical distritution of XLet d0 denote the number of Xrsquos that are irrelevant for Y Set d1 = d minus d0 and
without loss of generality reorder the λj so that
23
λj gt 0 if j = 1 d1λj = 0 if j = d1 + 1 d
(71)
Our variable selection rule is
ldquokeep variable Xj if sj ge t for some threshold trdquo (72)
Rule (72) is consistent if and only if
P (minjled1sj ge t and max
jgtd1sj lt t) minusrarr 1 (73)
To establish (73) it is enough to show that
P (maxjgtd1sj gt t) minusrarr 0 (74)
andP (min
jled1sj le t) minusrarr 0 (75)
We call (74) and (75) type I and type II consistency respectively Type I and IIerror probabilities refer to probabilities of any false positives and any false negativesrespectively
71 Type I consistency
We consider (74) first and note that
P (maxjgtd1sj gt t) le
dsumj=d1+1
P (sj gt t) (76)
For the jth irrelevant variable Xj j gt d1 the results in Section 6 apply to the local
efficacy ζj(x) at a fixed point x The importance score sj defined in (38) is the averageof such efficacies over a sample of M random points xlowasta We assume that there is apoint x(0) such that for irrelevant xjrsquos and for n large enough ζj(x
(0)) is stochastically
larger in the right tail than the average Mminus1sumMa=1 ζj(x
lowasta) That is we assume that
there exist t gt 0 and n0 such that for n ge n0
p(sj gt t) le P (s(0)j gt t) j gt d1 (77)
The existence of such a point x(0) is established by considering the probability limit of
arg maxζj(xlowasta) xlowasta isin xlowast1 middot middot middot xlowastM
Note by Lemma (61) s(0)j rarrp 0 as n rarr infin It follows that we have established
24
(74) for t and d0 = d minus d1 that are fixed as n rarr infin To examine the interesting casewhere d0 rarr infin as n rarr infin we need large deviation results which we turn to next Inthe d0 rarrinfin case we consider t equiv tn that tend to infin as nrarrinfin and we use the weakerassumption there exists x(0) and n0 such that
P (sj gt tn) le P (s(0)j gt tn) (78)
for n ge n0 and each tn with limnrarrinfin tn = infin We also assume that the number ofcolumns C and rows R in the contingency table that define the efficacy χ2
jn given by
(64) stays fixed as nrarrinfin In this case P (s(0)j gt tn)rarr 0 provided each term
[T (j)rc ]2 equiv
[nrcErcradicnradicErc
]2in s2j = χ2
jn defined for Xj by (64) satisfies
P(|T (j)rc | gt tn
)minusrarr 0
This is because
P
radicsumr
sumc
[T(j)rc ]2 gt tn
le P(RC max
rc[T (j)rc ]2 gt t2n
)le RC maxP
(|T (j)rc | gt tn
radicRC)
and tn and tnradicRC are of the same order as n rarr infin By large deviation theory Hall
[8] and Jing et al [10] for some constant A
P (|T (j)rc | gt tn) =
2tminus1nradic2πeminust
2n2[1 + At3nn
minus 12 + o
(t3nnminus 1
2
)](79)
provided tn = o(n
16
) If (79) is uniform in j gt d1 then (76) and (79) implies that
type I consistency holds if
d0
(tnradic
2
)minus1eminus
t2n2 rarr 0 (710)
To examine (710) write d0 and tn as d0 = exp(nb)
and tn =radic
2nr By large deviationtheory (710) holds when r lt 1
6 Thus if 0 lt 1
2b lt r lt 1
6 then
d0
(tnradic
2
)minus1eminus
t2n2 = nminusr exp
(nb minus n2r
)rarr 0
25
Thus d0 = exp(nb) with b = 13ε for any ε gt 0 is the largest d0 can be to have consistency
For instance if there are exactly d0 = exp(n
14
)irrelevant variables and we use the
threshold tn =radic
2n17 the probability of selecting any of the irrelevant variables tends
to zero at the ratenminus
14 exp
(n
14 minus n
27
)72 Type II consistency
We now assume that for each relevant variable Xj there is a least favorable point x(1)
such that the local efficacy s(1)j equiv ζ(x(1)) is stochastically smaller than the importance
score sj = Mminus1sumMa=1 ζ(xlowasta) That is we assume that for each relevant variable Xj there
exists n1 and x(1) such that for n ge n1
P(s(1)j le tn
)ge P (sj le tn) j le d1 (711)
The existence of such a least favorable x(1) is established by considering the probabilitylimit of
arg minζj(xlowasta) xlowasta isin xlowast1 middot middot middot xlowastM
We again assume that R and C are fixed so that to show type II consistency it isenough to show that for each (r c) and j le d1
P(|T (j)rc | le tn
)rarr 0
This is because
P
(radicsumr
sumc
[T
(j)rc
]2le tn
)le P (minrc
[T
(j)rc
]2le t2n)
le RC maxrc P(|T (j)rc | le tn
)By Theorem 62
radicn(s(1)j minus τj
)rarrd N(0 ν2) τj gt 0 It follows that if tn and d1
are bounded above as n rarr infin and τj gt 0 then P(s(1)j le tn
)rarr 0 Thus type II
consistency holds in this case To examine the more interesting case where tn and d1are not bounded above and τj may tend to zero as nrarrinfin we will need to center the
T(j)rc Thus set
θ(j)n =
radicn(p(j)rc minus p(j)r p
(j)c
)radicp(j)r p
(j)c
(712)
26
Tn(j) = T (j)rc minus θ(j)n
where for simplicity the dependence of T(j)n and θ
(j)n on r and c is not indicated
Lemma 71
P
(minjled1|T (j)rc | le tn
)le
d1sumj=1
P
(|T (j)n | ge min
jled1θ(j)n minus tn
)Proof
P(
minjled1 |T(j)rc | le tn
)= P
(minjled1 |T
(j)n + θ
(j)n | le tn
)le P
(minjled1 |T
(j)n | minusmaxjled1 |θ
(j)n | le tn
)= P
(maxjled1 |T
(j)n | ge minjled1 |θ
(j)n | minus tn
)lesumd1
j=1 P(|T (j)n | ge minjled1 |θ
(j)n | minus tn
)
Setan = min
jled1
∣∣θ(j)n ∣∣minus tnthen by large deviation theory if an = o
(n
16
)
P (∣∣T (j)n
∣∣ ge an) prop
(anradic
2
)minus1expminusa2n2 (713)
where prop signifies asymptotic order as nrarrinfin an rarrinfin It follows that if (713) holdsuniformly in j then by Lemma 71
P
(minjled1
∣∣T (j)rc
∣∣ le tn
)prop d1
(anradic
2
)minus1expminusa2n2
Thus type II consistency holds when an = o(n
16
)and
d1
(anradic
2
)minus1expminusa2n2 rarr 0 (714)
In Section 71 tn is of order nr 0 lt r lt 16 For this tn type II consistency holds if an
is of order nr 0 lt r lt 16 and
d1nminusr expminusn2r2 rarr 0
27
If an = O(nk) with k ge 16 then P (
∣∣∣T (j)(n)
∣∣∣ ge an) tends to zero at a rate faster than in
(713) and consistency is ensured For instance if all relevant variables Xj have p(j)rc
fixed as nrarrinfin then minjled1
∣∣∣θ(j)n ∣∣∣ is of orderradicn and an is of order n
12 minus tn
73 Type I and II consistency
We see from Sections 71 and 72 that consistency holds under a variety of conditionsOne set of conditions is
Theorem 71 Under the regularization conditions of Sections 71 and 72 if
d0 = c0 exp(nb) tn = c1nr
minjled1 θ(j) minus tn = c2nr d1 = c3 exp
(nb)
for positive constants c0 c1 c2 and c3 and 0 lt 12b lt r lt 1
6 then CATCH is consistent
AppendixProof of Theorem 61
(1) Let Nh =
sumni=1 I(X(i) isin Nh(x
(0))) Then Nh sim Bi(n
int x(0)+hx(0)minush f(x)dx) and
Nh rarras infin as nrarrinfin
ζ(x(0) h) = nminus12radicX 2(x(0) h)
= (Nh n)12(N
h )minus12radicX 2(x(0) h)
Since Nh sim Bi(n
int x(0)+hx(0)minush f(x)dx) we have
Nh nrarr
int x(0)+h
x(0)minushf(x)dx (715)
By Lemma 61 and Table (1)
(Nh )minus12
radicX 2(x(0) h)rarr
( sumr=+minus
Csumc=1
(prc minus prqc)2
prqc
) 12
(716)
where
p+c = Pr(X gt x0 Y = c|X isin Nh(x(0))) pminusc = Pr(X le x0 Y = c|X isin Nh(x
(0)))
28
for r = +minus and c = 1 middot middot middot C
pr =Csumc=1
prc qc =sumr=+minus
prc
Actually p+ = Pr(X gt x0|X isin Nh(x(0))) pminus = Pr(X le x0|X isin Nh(x
(0)))qc = Pr(Y = c|X isin Nh(x
(0)))
By (715) and (716)
ζ(x(0) h)rarr(int x(0)+h
x(0)minushf(x)dx
) 12( sumr=+minus
Csumc=1
(prc minus prqc)2
prqc
) 12
equiv ζ(x(0) h)
(2) Since X and Y are independent prc = prqc for r = +minus and c = 1 middot middot middot C thenζ(x(0) h) = 0 for any h Hence
ζ(x(0)) = suphgt0ζ(x(0) h) = 0
(3) Recall that pc(x(0)) = Pr(Y = c|x(0)) Without loss of generality assume pprimec(x
(0)) gt0 Then there exists h such that
Pr(Y = c|x(0) lt X lt x(0) + h) gt Pr(Y = c|x(0) minus h lt X lt x(0))
which is equivelant to p+cp+ gt pminuscpminus This shows that ζ(x(0)) gt 0 since byLemma 61 ζ(x(0)) = 0 results in p+cp+ = pminuscpminus = pc
References
[1] Breiman L 2001 Random forests Machine Learning 45 5ndash32
[2] Breiman L Friedman J H Olshen R A Stone J 1984 Classification andregression trees Wadsworth Belmont
[3] Doksum K Tang S Tsui K-W 2008 Nonparametric variable selection Theearth algorithm Journal of the American Statistical Association 103(484) 1609ndash1620
[4] Doksum K A Samarov A 1995 Nonparametric estimation of global functionalsand a measure of explanatory power of covariates in regression Annals of Statistics23 1443ndash1473
29
[5] Doksum K A Schafer C 2006 Powerful choices Tuning parameter selectionbased on power Frontiers in Statistics Dedicated to Peter Bickel Editors J Fanand H L Koul Imperial College Press 113ndash141
[6] Friedman J 1991 Multivariate adaptive regression splines The annals of statis-tics 1ndash67
[7] Gao J Gijbels I 2008 Bandwidth selection in nonparametric kernel testingJournal of the American Statistical Association 103 (484) 1584ndash1594
[8] Hall P 1992 The Bootstrap and Edgeworth Expansions Springer New York
[9] Huang L-S Chen J 2008 Analysis of variance coefficient of determinationand f-test for local polynomial regression Annals of Statistics 36 2085ndash2109
[10] Jing B-Y Shao Q-M Wang Q 2003 Self-normalized cramer-type large devi-ations for independent random variables Annals of Probability 31 2167ndash2215
[11] Lin D Y Zeng D 2011 Correcting for population stratification in genomewideassociation studies Journal of the American Statistical Association 106 997ndash1008
[12] Loh W-Y 2009 Improve the precision of classification trees The Annals ofApplied Statistics 3 1710ndash1737
[13] Price A Patterson N Plenge R Weinblatt M Shadick N Reich D 2006Principal components analysis corrects for stratification in genome-wide associationstudies Nature genetics 38 (8) 904ndash909
[14] Qiao Z Zhou L Huang J 2008 Linear discriminant analysis for high dimen-sional Low Sample Size Data special issue of World Congress on Engineering2008
[15] Quinlan J R 1987 Simplifying decision trees International Journal of Man-Machine Studies 27(3) 221ndash234
[16] Schafer C Doksum K A 2009 Selecting local models in multiple regression bymaximizing power Metrika 69 283ndash304
[17] Tibshirani R 1996 Regression shrinkage and selection via the lasso J RoyalStatist Soc B 58 267ndash288
[18] Wang L Shen X 2007 On l1-norm multi-class support vector machinesmethodology and theory Journal of the American Statistical Association 102 583ndash594
[19] Zhang H H Liu Y Wu Y Zhu J 2008 Variable selection for the multicate-gory svm via adaptive sup-norm regularization Electronic Journal of Statistics 2149ndash167
30
- Introduction
- Importance Scores for Univariate Classification
-
- Numerical Predictor Local Contingency Efficacy
- Categorical Predictor Contingency Efficacy
-
- Variable Selection for Classification Problems The CATCH Algorithm
-
- Numerical Predictors
- Numerical and Categorical Predictors
- The Empirical Importance Scores
- Adaptive Tube Distances
- The CATCH Algorithm
-
- Simulation Studies
-
- Variable Selection with Categorical and Numerical Predictors
- Improving The Performance of SVM and RF Classifiers
-
- CATCH Analysis of The ANN Data
- Properties of Local Contingency Efficacy
- Consistency Of CATCH
-
- Type I consistency
- Type II consistency
- Type I and II consistency
-
- Proof of Theorem 61
-
CATCH identifies 7 variables as important including 4 continuous variables and3 categorical variables We consider three classification methods 1 Support VectorMachine with radial kernel denoted by SVM(r) 2 Support Vector Machine withpolynomial kernel denoted by SVM(p) 3 RFC We apply these three methods onthe training set to build classification models and use the test set to estimate themisclassification rate of the methods We apply the methods with CATCH and withoutCATCH respectively where ldquowith CATCHrdquo means that only the 7 important variablesidentified by CATCH are used to build the model and ldquowithout CATCHrdquo means thatall the 21 variables are used in the analysis
Table 5 Misclassification rates of three classification methods (SVM(r) SVM(p) RFC) with CATCHand without CATCH
SVM(r) SVM(p) RFC
wo CATCH 0052 0063 0024
with CATCH 0037 0055 0015
Table 5 shows that if CATCH is used to select important variables before the clas-sification methods are applied the misclassification rates decrease CATCH reducesthe misclassification rates by 29 13 and 37 for SVM(r) SVM(p) and RFC respec-tively CATCH + RFC has the best performance Although CATCH is not designed forimproving classification accuracy nonetheless it can help other classification methodsto perform better
6 Properties of Local Contingency Efficacy
In this section we investigate the properties of the local contingency efficacy. For a univariate continuous predictor, Theorem 6.1 below shows that the local contingency efficacy is well defined and equals 0 if the response variable is independent of the predictor. Consider an R \times C contingency table. Let n_{rc}, r = 1, \cdots, R, c = 1, \cdots, C, be the observed frequency in the (r, c) cell, and let p_{rc} be the probability that an observation belongs to the (r, c) cell. Let

p_r = \sum_{c=1}^{C} p_{rc}, \qquad q_c = \sum_{r=1}^{R} p_{rc},   (6.1)
and

n_{r\cdot} = \sum_{c=1}^{C} n_{rc}, \quad n_{\cdot c} = \sum_{r=1}^{R} n_{rc}, \quad n = \sum_{r=1}^{R} \sum_{c=1}^{C} n_{rc}, \quad E_{rc} = \frac{n_{r\cdot}\, n_{\cdot c}}{n}.   (6.2)
To test that the rows and the columns are independent, i.e.,

H_0: p_{rc} = p_r q_c \ \forall r, c \quad \text{versus} \quad H_1: p_{rc} \ne p_r q_c \ \text{for some}\ r, c,   (6.3)

a standard test statistic is

\mathcal{X}^2 = \sum_{r=1}^{R} \sum_{c=1}^{C} \frac{(n_{rc} - E_{rc})^2}{E_{rc}}.   (6.4)
Under H_0, \mathcal{X}^2 converges in distribution to \chi^2_{(R-1)(C-1)}. Moreover, with the convention 0/0 = 0, we have the well-known result:
Lemma 6.1. As n tends to infinity, \mathcal{X}^2/n tends in probability to

\tau^2 \equiv \sum_{r=1}^{R} \sum_{c=1}^{C} \frac{(p_{rc} - p_r q_c)^2}{p_r q_c}.

Moreover, if p_r q_c is bounded away from 0 and 1 as n \to \infty, and if R and C are fixed as n \to \infty, then \tau^2 = 0 if and only if H_0 is true.

Remark 6.1. Lemma 6.1 shows that the contingency efficacy \zeta in (2.8) is well defined. Moreover, \zeta = 0 if and only if X and Y are independent.
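As a concrete illustration of (6.4) and of Lemma 6.1, the sketch below (our own Python/NumPy code, not part of the paper) computes the Pearson chi-square statistic and the efficacy estimate \sqrt{\mathcal{X}^2/n} from an R \times C table of counts, using the 0/0 = 0 convention noted above:

```python
import numpy as np

def chi2_stat(table):
    """Pearson chi-square statistic X^2 of (6.4) for an R x C table of counts."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    # E_rc = n_r. * n_.c / n, the expected counts under row/column independence
    E = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    with np.errstate(invalid="ignore", divide="ignore"):
        terms = np.where(E > 0, (table - E) ** 2 / E, 0.0)  # convention 0/0 = 0
    return terms.sum()

def contingency_efficacy(table):
    """sqrt(X^2 / n); by Lemma 6.1 this estimates tau, which is 0 if and only if
    rows and columns are independent (for fixed R and C)."""
    n = np.asarray(table, dtype=float).sum()
    return np.sqrt(chi2_stat(table) / n)
```

For a table whose cells factor exactly as n_{rc} = n_{r\cdot} n_{\cdot c}/n the efficacy is 0, while a strongly dependent table such as [[90, 10], [10, 90]] gives a clearly positive value.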
Theorem 6.1. Consider a categorical response Y and a continuous predictor X with density function f(x) such that, for a fixed h > 0, f(x) > 0 for x \in (x^{(0)} - h, x^{(0)} + h). Then:

(1) For fixed h, the estimated local section efficacy \hat{\zeta}(x^{(0)}, h) defined in (2.5) converges almost surely to the section efficacy \zeta(x^{(0)}, h) defined in (2.4).

(2) If X and Y are independent, then the local efficacy \zeta(x^{(0)}) defined in (2.4) is zero.

(3) If there exists c \in \{1, \cdots, C\} such that p_c(x) = P(Y = c \mid x) has a continuous derivative at x = x^{(0)} with p'_c(x^{(0)}) \ne 0, then \zeta(x^{(0)}) > 0.
The \chi^2 statistic (6.4) can be written in terms of multinomial proportions \hat{p}_{rc}, with \sqrt{n}(\hat{p}_{rc} - p_{rc}), 1 \le r \le R, 1 \le c \le C, converging in distribution to a multivariate normal. By Taylor expansion of T_n \equiv \sqrt{\mathcal{X}^2/n} at \tau = \mathrm{plim}\, \sqrt{\mathcal{X}^2/n}, we find that:

Theorem 6.2.

\sqrt{n}(T_n - \tau) \longrightarrow_d N(0, v^2),   (6.5)

where the asymptotic variance v^2, which can be obtained from the Taylor expansion, is not needed in this paper.
Remark 6.2. The result (6.5) applies to the multivariate \chi^2-based efficacies \zeta, with n replaced by the number of observations that go into the computation of \zeta. In this case we use the notation \chi^2_j for the chi-square statistic for the jth variable.
7. Consistency of CATCH
Let Y denote a variable to be classified into one of the categories 1, \ldots, C by using the X_j's in the vector X = (X_1, \ldots, X_d). Let X_{-j} \equiv X - \{X_j\} denote X without the X_j component. Assume that the squared Pearson nonparametric correlation (Doksum and Samarov [4]; Huang and Chen [9]) between X_j and X_{-j} is less than 0.5. We call the variable X_j relevant for Y if, conditionally given X_{-j}, L(Y \mid X_j = x) is not constant in x, and irrelevant otherwise. Our data are (X^{(1)}, Y^{(1)}), \ldots, (X^{(n)}, Y^{(n)}), i.i.d. as (X, Y). A variable selection procedure is consistent if the probability that all the relevant variables are selected, and none of the irrelevant variables are selected, tends to one as n tends to infinity.

We give a general nonparametric framework in which the variable selection rule based on the CATCH importance scores s_j given in (3.8) is consistent.

When the predictors are not all categorical and CATCH involves tubes, we assume that the number of observations in each tube tends to \infty as n \to \infty, and that for those importance scores that require the selection of a section size h, the selection is from a finite set \{h_j\} with \min_j \{h_j\} > h_0 for some h_0 > 0. It follows that the number of observations m_{jk} in the kth tube used to calculate the importance score s_j for the variable X_j tends to infinity as n \to \infty.
The key quantities that determine consistency are the limits \lambda_j of s_j. That is,

\lambda_j = \tau(\zeta_j, P) = E_P[\zeta_j(X)] = \int \zeta_j(x)\, dP(x), \qquad j = 1, \ldots, d,

where \zeta_j(x) is the local contingency efficacy defined by (2.4) for numerical X's and by (2.8) for categorical X's, and P is the probability distribution of X. Our estimate s_j of \lambda_j is

s_j = \tau(\hat{\zeta}_j, \hat{P}) = \frac{1}{M} \sum_{k=1}^{M} \hat{\zeta}_j(X^*_k),

where \hat{\zeta}_j is defined in Section 3 and \hat{P} is the empirical distribution of X.

Let d_0 denote the number of X's that are irrelevant for Y. Set d_1 = d - d_0 and, without loss of generality, reorder the \lambda_j so that

\lambda_j > 0 \ \text{if}\ j = 1, \ldots, d_1; \qquad \lambda_j = 0 \ \text{if}\ j = d_1 + 1, \ldots, d.   (7.1)
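The score s_j can be sketched in a few lines. The code below is our own simplified univariate illustration (Python/NumPy): it ignores tubes and the adaptive distances of Section 3, and computes the local efficacy at each sampled point from a 2 \times C table that splits a fixed window at that point, as in the proof of Theorem 6.1; the window half-width h and the number of points M are arbitrary choices here.

```python
import numpy as np

def local_efficacy(x, y, x0, h, n_classes):
    """Simplified local contingency efficacy at x0: split the window
    (x0 - h, x0 + h) at x0 into rows r = -, + and tabulate the classes."""
    in_win = (x > x0 - h) & (x < x0 + h)
    if in_win.sum() < 2:
        return 0.0
    rows = (x[in_win] > x0).astype(int)
    table = np.zeros((2, n_classes))
    for r, c in zip(rows, y[in_win]):
        table[r, c] += 1
    m = table.sum()
    E = np.outer(table.sum(axis=1), table.sum(axis=0)) / m
    with np.errstate(invalid="ignore", divide="ignore"):
        chi2 = np.where(E > 0, (table - E) ** 2 / E, 0.0).sum()
    return np.sqrt(chi2 / m)

def importance_score(x, y, M=200, h=0.5, n_classes=2, seed=None):
    """s_j: average of local efficacies at M randomly chosen sample points."""
    rng = np.random.default_rng(seed)
    centers = rng.choice(x, size=M, replace=True)
    return float(np.mean([local_efficacy(x, y, c, h, n_classes) for c in centers]))
```

A predictor that shifts the class probabilities receives a visibly larger score than pure noise, which is what the threshold rule (7.2) exploits.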
Our variable selection rule is

"keep variable X_j if s_j \ge t for some threshold t."   (7.2)

Rule (7.2) is consistent if and only if

P\big(\min_{j \le d_1} s_j \ge t \ \text{and}\ \max_{j > d_1} s_j < t\big) \longrightarrow 1.   (7.3)

To establish (7.3), it is enough to show that

P\big(\max_{j > d_1} s_j > t\big) \longrightarrow 0   (7.4)

and

P\big(\min_{j \le d_1} s_j \le t\big) \longrightarrow 0.   (7.5)

We call (7.4) and (7.5) type I and type II consistency, respectively. Type I and II error probabilities refer to the probabilities of any false positives and any false negatives, respectively.
7.1 Type I consistency
We consider (7.4) first and note that

P\big(\max_{j > d_1} s_j > t\big) \le \sum_{j = d_1 + 1}^{d} P(s_j > t).   (7.6)

For the jth irrelevant variable X_j, j > d_1, the results in Section 6 apply to the local efficacy \zeta_j(x) at a fixed point x. The importance score s_j defined in (3.8) is the average of such efficacies over a sample of M random points x^*_a. We assume that there is a point x^{(0)} such that, for irrelevant X_j's and for n large enough, s^{(0)}_j \equiv \hat{\zeta}_j(x^{(0)}) is stochastically larger in the right tail than the average M^{-1} \sum_{a=1}^{M} \hat{\zeta}_j(x^*_a). That is, we assume that there exist t > 0 and n_0 such that, for n \ge n_0,

P(s_j > t) \le P(s^{(0)}_j > t), \qquad j > d_1.   (7.7)

The existence of such a point x^{(0)} is established by considering the probability limit of

\arg\max\{\hat{\zeta}_j(x^*_a) : x^*_a \in \{x^*_1, \cdots, x^*_M\}\}.

Note that by Lemma 6.1, s^{(0)}_j \to_p 0 as n \to \infty. It follows that we have established (7.4) for t and d_0 = d - d_1 that are fixed as n \to \infty. To examine the interesting case where d_0 \to \infty as n \to \infty, we need large deviation results, to which we turn next. In the d_0 \to \infty case we consider t \equiv t_n that tends to \infty as n \to \infty, and we use the weaker assumption: there exist x^{(0)} and n_0 such that

P(s_j > t_n) \le P(s^{(0)}_j > t_n)   (7.8)

for n \ge n_0 and each t_n with \lim_{n \to \infty} t_n = \infty. We also assume that the number of columns C and rows R in the contingency table that defines the efficacy \chi^2_{jn} given by (6.4) stays fixed as n \to \infty. In this case P(s^{(0)}_j > t_n) \to 0 provided each term

[T^{(j)}_{rc}]^2 \equiv \left[ \frac{n_{rc} - E_{rc}}{\sqrt{n}\, \sqrt{E_{rc}}} \right]^2

in s_j^2 = \chi^2_{jn}/n, defined for X_j by (6.4), satisfies

P\big(|T^{(j)}_{rc}| > t_n\big) \longrightarrow 0.

This is because

P\Big(\sqrt{\textstyle\sum_r \sum_c [T^{(j)}_{rc}]^2} > t_n\Big) \le P\big(RC \max_{r,c} [T^{(j)}_{rc}]^2 > t_n^2\big) \le RC \max_{r,c} P\big(|T^{(j)}_{rc}| > t_n/\sqrt{RC}\big),

and t_n and t_n/\sqrt{RC} are of the same order as n \to \infty. By large deviation theory (Hall [8] and Jing et al. [10]), for some constant A,

P\big(|T^{(j)}_{rc}| > t_n\big) = \frac{2 t_n^{-1}}{\sqrt{2\pi}}\, e^{-t_n^2/2} \big[1 + A t_n^3 n^{-1/2} + o\big(t_n^3 n^{-1/2}\big)\big]   (7.9)

provided t_n = o(n^{1/6}). If (7.9) is uniform in j > d_1, then (7.6) and (7.9) imply that type I consistency holds if

d_0 \big(t_n/\sqrt{2}\big)^{-1} e^{-t_n^2/2} \to 0.   (7.10)

To examine (7.10), write d_0 and t_n as d_0 = \exp(n^b) and t_n = \sqrt{2}\, n^r. By large deviation theory, (7.10) holds when r < 1/6. Thus if 0 < \tfrac{1}{2} b < r < \tfrac{1}{6}, then

d_0 \big(t_n/\sqrt{2}\big)^{-1} e^{-t_n^2/2} = n^{-r} \exp\big(n^b - n^{2r}\big) \to 0.
Thus d_0 = \exp(n^b) with b = 1/3 - \varepsilon, for any \varepsilon > 0, is the largest d_0 can be to have consistency. For instance, if there are exactly d_0 = \exp(n^{1/4}) irrelevant variables and we use the threshold t_n = \sqrt{2}\, n^{1/7}, the probability of selecting any of the irrelevant variables tends to zero at the rate n^{-1/7} \exp\big(n^{1/4} - n^{2/7}\big).
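As a quick numerical sanity check (our own illustration, not from the paper): with d_0 = \exp(n^b) and t_n = \sqrt{2}\, n^r, the left-hand side of (7.10) is n^{-r} \exp(n^b - n^{2r}), and for b = 1/4, r = 1/7 it indeed decays to zero:

```python
import math

def type1_bound(n, b=0.25, r=1.0 / 7.0):
    """Bound d0 * (t_n / sqrt(2))**(-1) * exp(-t_n**2 / 2) of (7.10),
    with d0 = exp(n**b) irrelevant variables and threshold t_n = sqrt(2) * n**r;
    this simplifies to n**(-r) * exp(n**b - n**(2r))."""
    return n ** (-r) * math.exp(n ** b - n ** (2.0 * r))
```

Since n^b - n^{2r} is decreasing for n \ge 1 when b < 2r, the bound decreases monotonically in n and vanishes rapidly for large n.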
7.2 Type II consistency
We now assume that for each relevant variable X_j there is a least favorable point x^{(1)} such that the local efficacy s^{(1)}_j \equiv \hat{\zeta}_j(x^{(1)}) is stochastically smaller than the importance score s_j = M^{-1} \sum_{a=1}^{M} \hat{\zeta}_j(x^*_a). That is, we assume that for each relevant variable X_j there exist n_1 and x^{(1)} such that, for n \ge n_1,

P\big(s^{(1)}_j \le t_n\big) \ge P(s_j \le t_n), \qquad j \le d_1.   (7.11)

The existence of such a least favorable x^{(1)} is established by considering the probability limit of

\arg\min\{\hat{\zeta}_j(x^*_a) : x^*_a \in \{x^*_1, \cdots, x^*_M\}\}.

We again assume that R and C are fixed, so that to show type II consistency it is enough to show that, for each (r, c) and j \le d_1,

P\big(|T^{(j)}_{rc}| \le t_n\big) \to 0.

This is because

P\Big(\sqrt{\textstyle\sum_r \sum_c [T^{(j)}_{rc}]^2} \le t_n\Big) \le P\big(\min_{r,c} [T^{(j)}_{rc}]^2 \le t_n^2\big) \le RC \max_{r,c} P\big(|T^{(j)}_{rc}| \le t_n\big).

By Theorem 6.2, \sqrt{n}\big(s^{(1)}_j - \tau_j\big) \to_d N(0, \nu^2), \tau_j > 0. It follows that if t_n and d_1 are bounded above as n \to \infty and \tau_j > 0, then P\big(s^{(1)}_j \le t_n\big) \to 0. Thus type II consistency holds in this case. To examine the more interesting case, where t_n and d_1 are not bounded above and \tau_j may tend to zero as n \to \infty, we will need to center the T^{(j)}_{rc}. Thus set

\theta^{(j)}_n = \frac{\sqrt{n}\big(p^{(j)}_{rc} - p^{(j)}_r p^{(j)}_c\big)}{\sqrt{p^{(j)}_r p^{(j)}_c}},   (7.12)

T^{(j)}_n = T^{(j)}_{rc} - \theta^{(j)}_n,

where for simplicity the dependence of T^{(j)}_n and \theta^{(j)}_n on r and c is not indicated.
Lemma 7.1.

P\Big(\min_{j \le d_1} |T^{(j)}_{rc}| \le t_n\Big) \le \sum_{j=1}^{d_1} P\Big(|T^{(j)}_n| \ge \min_{j \le d_1} |\theta^{(j)}_n| - t_n\Big).

Proof.

P\big(\min_{j \le d_1} |T^{(j)}_{rc}| \le t_n\big) = P\big(\min_{j \le d_1} |T^{(j)}_n + \theta^{(j)}_n| \le t_n\big)
\le P\big(\min_{j \le d_1} |\theta^{(j)}_n| - \max_{j \le d_1} |T^{(j)}_n| \le t_n\big)
= P\big(\max_{j \le d_1} |T^{(j)}_n| \ge \min_{j \le d_1} |\theta^{(j)}_n| - t_n\big)
\le \sum_{j=1}^{d_1} P\big(|T^{(j)}_n| \ge \min_{j \le d_1} |\theta^{(j)}_n| - t_n\big).
Set

a_n = \min_{j \le d_1} |\theta^{(j)}_n| - t_n;

then, by large deviation theory, if a_n = o(n^{1/6}),

P\big(|T^{(j)}_n| \ge a_n\big) \propto \big(a_n/\sqrt{2}\big)^{-1} \exp\{-a_n^2/2\},   (7.13)

where \propto signifies asymptotic order as n \to \infty, a_n \to \infty. It follows that if (7.13) holds uniformly in j, then by Lemma 7.1,

P\Big(\min_{j \le d_1} |T^{(j)}_{rc}| \le t_n\Big) \propto d_1 \big(a_n/\sqrt{2}\big)^{-1} \exp\{-a_n^2/2\}.

Thus type II consistency holds when a_n = o(n^{1/6}) and

d_1 \big(a_n/\sqrt{2}\big)^{-1} \exp\{-a_n^2/2\} \to 0.   (7.14)

In Section 7.1, t_n is of order n^r, 0 < r < 1/6. For this t_n, type II consistency holds if a_n is of order n^r, 0 < r < 1/6, and

d_1 n^{-r} \exp\{-n^{2r}/2\} \to 0.
If a_n = O(n^k) with k \ge 1/6, then P\big(|T^{(j)}_n| \ge a_n\big) tends to zero at a rate faster than in (7.13), and consistency is ensured. For instance, if all relevant variables X_j have p^{(j)}_{rc} fixed as n \to \infty, then \min_{j \le d_1} |\theta^{(j)}_n| is of order \sqrt{n} and a_n is of order n^{1/2} - t_n.
7.3 Type I and II consistency
We see from Sections 7.1 and 7.2 that consistency holds under a variety of conditions. One set of conditions is:

Theorem 7.1. Under the regularity conditions of Sections 7.1 and 7.2, if

d_0 = c_0 \exp(n^b), \quad t_n = c_1 n^r, \quad \min_{j \le d_1} |\theta^{(j)}_n| - t_n = c_2 n^r, \quad d_1 = c_3 \exp(n^b)

for positive constants c_0, c_1, c_2, c_3, and 0 < \tfrac{1}{2} b < r < \tfrac{1}{6}, then CATCH is consistent.
Appendix: Proof of Theorem 6.1

(1) Let N_h = \sum_{i=1}^{n} I\big(X^{(i)} \in N_h(x^{(0)})\big). Then N_h \sim \mathrm{Bi}\big(n, \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx\big) and N_h \to_{a.s.} \infty as n \to \infty. Now

\hat{\zeta}(x^{(0)}, h) = n^{-1/2} \sqrt{\mathcal{X}^2(x^{(0)}, h)} = (N_h/n)^{1/2} (N_h)^{-1/2} \sqrt{\mathcal{X}^2(x^{(0)}, h)}.

Since N_h \sim \mathrm{Bi}\big(n, \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx\big), we have

N_h/n \to \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx.   (7.15)

By Lemma 6.1 and Table 1,

(N_h)^{-1/2} \sqrt{\mathcal{X}^2(x^{(0)}, h)} \to \Big( \sum_{r=+,-} \sum_{c=1}^{C} \frac{(p_{rc} - p_r q_c)^2}{p_r q_c} \Big)^{1/2},   (7.16)

where

p_{+c} = \Pr\big(X > x^{(0)}, Y = c \mid X \in N_h(x^{(0)})\big), \quad p_{-c} = \Pr\big(X \le x^{(0)}, Y = c \mid X \in N_h(x^{(0)})\big),

and, for r = +, - and c = 1, \cdots, C,

p_r = \sum_{c=1}^{C} p_{rc}, \qquad q_c = \sum_{r=+,-} p_{rc}.

In fact, p_+ = \Pr\big(X > x^{(0)} \mid X \in N_h(x^{(0)})\big), p_- = \Pr\big(X \le x^{(0)} \mid X \in N_h(x^{(0)})\big), and q_c = \Pr\big(Y = c \mid X \in N_h(x^{(0)})\big).

By (7.15) and (7.16),

\hat{\zeta}(x^{(0)}, h) \to \Big( \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx \Big)^{1/2} \Big( \sum_{r=+,-} \sum_{c=1}^{C} \frac{(p_{rc} - p_r q_c)^2}{p_r q_c} \Big)^{1/2} \equiv \zeta(x^{(0)}, h).

(2) Since X and Y are independent, p_{rc} = p_r q_c for r = +, - and c = 1, \cdots, C, so \zeta(x^{(0)}, h) = 0 for any h. Hence

\zeta(x^{(0)}) = \sup_{h > 0} \zeta(x^{(0)}, h) = 0.

(3) Recall that p_c(x^{(0)}) = \Pr(Y = c \mid x^{(0)}). Without loss of generality, assume p'_c(x^{(0)}) > 0. Then there exists h such that

\Pr\big(Y = c \mid x^{(0)} < X < x^{(0)} + h\big) > \Pr\big(Y = c \mid x^{(0)} - h < X < x^{(0)}\big),

which is equivalent to p_{+c}/p_+ > p_{-c}/p_-. This shows that \zeta(x^{(0)}) > 0, since by Lemma 6.1, \zeta(x^{(0)}) = 0 would imply p_{+c}/p_+ = p_{-c}/p_- = p_c.
References

[1] Breiman, L., 2001. Random forests. Machine Learning 45, 5–32.

[2] Breiman, L., Friedman, J. H., Olshen, R. A., Stone, C. J., 1984. Classification and Regression Trees. Wadsworth, Belmont.

[3] Doksum, K., Tang, S., Tsui, K.-W., 2008. Nonparametric variable selection: The EARTH algorithm. Journal of the American Statistical Association 103(484), 1609–1620.

[4] Doksum, K. A., Samarov, A., 1995. Nonparametric estimation of global functionals and a measure of the explanatory power of covariates in regression. Annals of Statistics 23, 1443–1473.

[5] Doksum, K. A., Schafer, C., 2006. Powerful choices: Tuning parameter selection based on power. In: Frontiers in Statistics: Dedicated to Peter Bickel (J. Fan and H. L. Koul, eds.), Imperial College Press, 113–141.

[6] Friedman, J., 1991. Multivariate adaptive regression splines. The Annals of Statistics, 1–67.

[7] Gao, J., Gijbels, I., 2008. Bandwidth selection in nonparametric kernel testing. Journal of the American Statistical Association 103(484), 1584–1594.

[8] Hall, P., 1992. The Bootstrap and Edgeworth Expansion. Springer, New York.

[9] Huang, L.-S., Chen, J., 2008. Analysis of variance, coefficient of determination and F-test for local polynomial regression. Annals of Statistics 36, 2085–2109.

[10] Jing, B.-Y., Shao, Q.-M., Wang, Q., 2003. Self-normalized Cramér-type large deviations for independent random variables. Annals of Probability 31, 2167–2215.

[11] Lin, D. Y., Zeng, D., 2011. Correcting for population stratification in genomewide association studies. Journal of the American Statistical Association 106, 997–1008.

[12] Loh, W.-Y., 2009. Improving the precision of classification trees. The Annals of Applied Statistics 3, 1710–1737.

[13] Price, A., Patterson, N., Plenge, R., Weinblatt, M., Shadick, N., Reich, D., 2006. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38(8), 904–909.

[14] Qiao, Z., Zhou, L., Huang, J., 2008. Linear discriminant analysis for high-dimensional, low-sample-size data. Special issue of World Congress on Engineering 2008.

[15] Quinlan, J. R., 1987. Simplifying decision trees. International Journal of Man-Machine Studies 27(3), 221–234.

[16] Schafer, C., Doksum, K. A., 2009. Selecting local models in multiple regression by maximizing power. Metrika 69, 283–304.

[17] Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267–288.

[18] Wang, L., Shen, X., 2007. On L1-norm multi-class support vector machines: methodology and theory. Journal of the American Statistical Association 102, 583–594.

[19] Zhang, H. H., Liu, Y., Wu, Y., Zhu, J., 2008. Variable selection for the multicategory SVM via adaptive sup-norm regularization. Electronic Journal of Statistics 2, 149–167.
- Introduction
- Importance Scores for Univariate Classification
-
- Numerical Predictor Local Contingency Efficacy
- Categorical Predictor Contingency Efficacy
-
- Variable Selection for Classification Problems The CATCH Algorithm
-
- Numerical Predictors
- Numerical and Categorical Predictors
- The Empirical Importance Scores
- Adaptive Tube Distances
- The CATCH Algorithm
-
- Simulation Studies
-
- Variable Selection with Categorical and Numerical Predictors
- Improving The Performance of SVM and RF Classifiers
-
- CATCH Analysis of The ANN Data
- Properties of Local Contingency Efficacy
- Consistency Of CATCH
-
- Type I consistency
- Type II consistency
- Type I and II consistency
-
- Proof of Theorem 61
-
and
nrmiddot =Csumc=1
nrc nmiddotc =Rsumi=1
nrc n =Rsumi=1
Csumc=1
nrc Erc = nrmiddotnmiddotcn (62)
Consider testing that the rows and the columns are independent ie
H0 forallr c prc = prqc versus H1 existr c prc 6= prqc (63)
a standard test statistic is
X 2 =Rsumr=1
Csumc=1
(nrc minus Erc)2
Erc (64)
Under H0 X 2 converges in distribution to χ2(Rminus1)(Cminus1) Moreover define 00 = 0 we
have the well known result
Lemma 61 As n tends to infinity X 2n tends in probability to
τ 2 equivRsumr=1
Csumc=1
(prc minus prqc)2
prqc
Moreover if prqc is bounded away from 0 and 1 as n rarr infin and if R and C are fixedas nrarrinfin then τ 2 = 0 if and only if H0 is true
Remark 61 Lemma 61 shows that the contingency efficacy ζ in (28) is well definedMoreover ζ = 0 if and only if X and Y are independent
Theorem 61 Consider a categorical response Y and a continuous predictor X withdensity function f(x) such that for a fixed h gt 0 f(x) gt 0 for x isin (x(0) minus h x(0) + h)then
(1) For fixed h the estimated local section efficacy ζ(x(0) h) defined in (25) convergesalmost surly to the section efficacy ζ(x(0) h) defined in (24)
(2) If X and Y are independent then the local efficacy ζ(x(0)) defined in (24) is zero
(3) If there exists c isin 1 middot middot middot C such that pc(x) = P (Y = c|x) has a continuousderivative at x = x(0) with pprimec(x
(0)) 6= 0 then ζ(x(0)) gt 0
The χ2 statistic (64) can be written in terms of multinomial proportions prc withradicn(prc minus prc) 1 le r le R 1 le c le C converging in distribution to a multivariate
normal By Taylor expansion of Tn equivradicχ2n at τ = plim
radicχ2n we find that
22
Theorem 62 radicn(Tn minus τ) minusrarrd N(0 v2) (65)
where the asymptotic variance v2 which can be obtained from the Taylor expansion isnot needed in this paper
Remark 62 The result (65) applies to the multivariate χ2-based efficacies ζ with nreplaced by the number of observations that goes in to the computations of ζ In thiscase we use the notation χ2
j for the chi-square statistic for the jth variable
7 Consistency Of CATCH
Let Y denote a variable to be classified into one of the categories 1 C by usingthe Xjrsquos in the vector X = (X1 Xd) Let Xminusj equiv X minusXj denote X without the Xj
component Assume that the squared Pearson nonparametric correlation (Doksum andSamarov [4] Huang and Chen [9]) between Xj and Xminusj is less than 05 We call thevariable Xj relevant for Y if conditionally given Xminusj L(Y |Xj = x) is not constant in xand irrelevant otherwise Our data are (X(1) Y (1)) (X(n) Y (n)) iid as (X Y ) Avariable selection procedure is consistent if the probability that all the relevant variablesare selected and none of the irrelevant variables are selected tends to one as n tends toinfinity
We give a general nonparametric framework where the variable selection rule basedon the CATCH importance scores sj given in (38) is consistent
When the predictors are not all categorical and CATCH involves tubes we assumethat the number of observations in each tube tends toinfin as nrarrinfin and we assume thatfor those importance scores that require the selection of a section size h the selection isfrom a finite set hj with minjhj gt h0 for some h0 gt 0 It follows that the numberof observations mjk in the kth tube used to calculate the importance score sj for thevariable Xj tends to infinity as nrarrinfin
The key quantities that determine consistency are the limits λj of sj That is
λj = τ(ζj P ) = EP [ζj(X)] =
intζj(x)dP (x) j = 1 d
where ζj(x) is the local contingency efficacy defined by (24) for numerical Xrsquos and by(28) for categorical Xrsquos and P is the probability distribution of X Our estimate sjof τj is
sj = τ(ζj P ) =1
M
Msumk=1
ζj(Xlowastk)
where ζj is defined in section 3 and P is the empirical distritution of XLet d0 denote the number of Xrsquos that are irrelevant for Y Set d1 = d minus d0 and
without loss of generality reorder the λj so that
23
λj gt 0 if j = 1 d1λj = 0 if j = d1 + 1 d
(71)
Our variable selection rule is
ldquokeep variable Xj if sj ge t for some threshold trdquo (72)
Rule (72) is consistent if and only if
P (minjled1sj ge t and max
jgtd1sj lt t) minusrarr 1 (73)
To establish (73) it is enough to show that
P (maxjgtd1sj gt t) minusrarr 0 (74)
andP (min
jled1sj le t) minusrarr 0 (75)
We call (74) and (75) type I and type II consistency respectively Type I and IIerror probabilities refer to probabilities of any false positives and any false negativesrespectively
71 Type I consistency
We consider (74) first and note that
P (maxjgtd1sj gt t) le
dsumj=d1+1
P (sj gt t) (76)
For the jth irrelevant variable Xj j gt d1 the results in Section 6 apply to the local
efficacy ζj(x) at a fixed point x The importance score sj defined in (38) is the averageof such efficacies over a sample of M random points xlowasta We assume that there is apoint x(0) such that for irrelevant xjrsquos and for n large enough ζj(x
(0)) is stochastically
larger in the right tail than the average Mminus1sumMa=1 ζj(x
lowasta) That is we assume that
there exist t gt 0 and n0 such that for n ge n0
p(sj gt t) le P (s(0)j gt t) j gt d1 (77)
The existence of such a point x(0) is established by considering the probability limit of
arg maxζj(xlowasta) xlowasta isin xlowast1 middot middot middot xlowastM
Note by Lemma (61) s(0)j rarrp 0 as n rarr infin It follows that we have established
24
(74) for t and d0 = d minus d1 that are fixed as n rarr infin To examine the interesting casewhere d0 rarr infin as n rarr infin we need large deviation results which we turn to next Inthe d0 rarrinfin case we consider t equiv tn that tend to infin as nrarrinfin and we use the weakerassumption there exists x(0) and n0 such that
P (sj gt tn) le P (s(0)j gt tn) (78)
for n ge n0 and each tn with limnrarrinfin tn = infin We also assume that the number ofcolumns C and rows R in the contingency table that define the efficacy χ2
jn given by
(64) stays fixed as nrarrinfin In this case P (s(0)j gt tn)rarr 0 provided each term
[T (j)rc ]2 equiv
[nrcErcradicnradicErc
]2in s2j = χ2
jn defined for Xj by (64) satisfies
P(|T (j)rc | gt tn
)minusrarr 0
This is because
P
radicsumr
sumc
[T(j)rc ]2 gt tn
le P(RC max
rc[T (j)rc ]2 gt t2n
)le RC maxP
(|T (j)rc | gt tn
radicRC)
and tn and tnradicRC are of the same order as n rarr infin By large deviation theory Hall
[8] and Jing et al [10] for some constant A
P (|T (j)rc | gt tn) =
2tminus1nradic2πeminust
2n2[1 + At3nn
minus 12 + o
(t3nnminus 1
2
)](79)
provided tn = o(n
16
) If (79) is uniform in j gt d1 then (76) and (79) implies that
type I consistency holds if
d0
(tnradic
2
)minus1eminus
t2n2 rarr 0 (710)
To examine (710) write d0 and tn as d0 = exp(nb)
and tn =radic
2nr By large deviationtheory (710) holds when r lt 1
6 Thus if 0 lt 1
2b lt r lt 1
6 then
d0
(tnradic
2
)minus1eminus
t2n2 = nminusr exp
(nb minus n2r
)rarr 0
25
Thus d0 = exp(nb) with b = 13ε for any ε gt 0 is the largest d0 can be to have consistency
For instance if there are exactly d0 = exp(n
14
)irrelevant variables and we use the
threshold tn =radic
2n17 the probability of selecting any of the irrelevant variables tends
to zero at the ratenminus
14 exp
(n
14 minus n
27
)72 Type II consistency
We now assume that for each relevant variable Xj there is a least favorable point x(1)
such that the local efficacy s(1)j equiv ζ(x(1)) is stochastically smaller than the importance
score sj = Mminus1sumMa=1 ζ(xlowasta) That is we assume that for each relevant variable Xj there
exists n1 and x(1) such that for n ge n1
P(s(1)j le tn
)ge P (sj le tn) j le d1 (711)
The existence of such a least favorable x(1) is established by considering the probabilitylimit of
arg minζj(xlowasta) xlowasta isin xlowast1 middot middot middot xlowastM
We again assume that R and C are fixed so that to show type II consistency it isenough to show that for each (r c) and j le d1
P(|T (j)rc | le tn
)rarr 0
This is because
P
(radicsumr
sumc
[T
(j)rc
]2le tn
)le P (minrc
[T
(j)rc
]2le t2n)
le RC maxrc P(|T (j)rc | le tn
)By Theorem 62
radicn(s(1)j minus τj
)rarrd N(0 ν2) τj gt 0 It follows that if tn and d1
are bounded above as n rarr infin and τj gt 0 then P(s(1)j le tn
)rarr 0 Thus type II
consistency holds in this case To examine the more interesting case where tn and d1are not bounded above and τj may tend to zero as nrarrinfin we will need to center the
T(j)rc Thus set
θ(j)n =
radicn(p(j)rc minus p(j)r p
(j)c
)radicp(j)r p
(j)c
(712)
26
Tn(j) = T (j)rc minus θ(j)n
where for simplicity the dependence of T(j)n and θ
(j)n on r and c is not indicated
Lemma 71
P
(minjled1|T (j)rc | le tn
)le
d1sumj=1
P
(|T (j)n | ge min
jled1θ(j)n minus tn
)Proof
P(
minjled1 |T(j)rc | le tn
)= P
(minjled1 |T
(j)n + θ
(j)n | le tn
)le P
(minjled1 |T
(j)n | minusmaxjled1 |θ
(j)n | le tn
)= P
(maxjled1 |T
(j)n | ge minjled1 |θ
(j)n | minus tn
)lesumd1
j=1 P(|T (j)n | ge minjled1 |θ
(j)n | minus tn
)
Setan = min
jled1
∣∣θ(j)n ∣∣minus tnthen by large deviation theory if an = o
(n
16
)
P (∣∣T (j)n
∣∣ ge an) prop
(anradic
2
)minus1expminusa2n2 (713)
where prop signifies asymptotic order as nrarrinfin an rarrinfin It follows that if (713) holdsuniformly in j then by Lemma 71
P
(minjled1
∣∣T (j)rc
∣∣ le tn
)prop d1
(anradic
2
)minus1expminusa2n2
Thus type II consistency holds when an = o(n
16
)and
d1
(anradic
2
)minus1expminusa2n2 rarr 0 (714)
In Section 71 tn is of order nr 0 lt r lt 16 For this tn type II consistency holds if an
is of order nr 0 lt r lt 16 and
d1nminusr expminusn2r2 rarr 0
27
If an = O(nk) with k ge 16 then P (
∣∣∣T (j)(n)
∣∣∣ ge an) tends to zero at a rate faster than in
(713) and consistency is ensured For instance if all relevant variables Xj have p(j)rc
fixed as nrarrinfin then minjled1
∣∣∣θ(j)n ∣∣∣ is of orderradicn and an is of order n
12 minus tn
73 Type I and II consistency
We see from Sections 71 and 72 that consistency holds under a variety of conditionsOne set of conditions is
Theorem 71 Under the regularization conditions of Sections 71 and 72 if
d0 = c0 exp(nb) tn = c1nr
minjled1 θ(j) minus tn = c2nr d1 = c3 exp
(nb)
for positive constants c0 c1 c2 and c3 and 0 lt 12b lt r lt 1
6 then CATCH is consistent
AppendixProof of Theorem 61
(1) Let Nh =
sumni=1 I(X(i) isin Nh(x
(0))) Then Nh sim Bi(n
int x(0)+hx(0)minush f(x)dx) and
Nh rarras infin as nrarrinfin
ζ(x(0) h) = nminus12radicX 2(x(0) h)
= (Nh n)12(N
h )minus12radicX 2(x(0) h)
Since Nh sim Bi(n
int x(0)+hx(0)minush f(x)dx) we have
Nh nrarr
int x(0)+h
x(0)minushf(x)dx (715)
By Lemma 61 and Table (1)
(Nh )minus12
radicX 2(x(0) h)rarr
( sumr=+minus
Csumc=1
(prc minus prqc)2
prqc
) 12
(716)
where
p+c = Pr(X gt x0 Y = c|X isin Nh(x(0))) pminusc = Pr(X le x0 Y = c|X isin Nh(x
(0)))
28
for r = +minus and c = 1 middot middot middot C
pr =Csumc=1
prc qc =sumr=+minus
prc
Actually p+ = Pr(X gt x0|X isin Nh(x(0))) pminus = Pr(X le x0|X isin Nh(x
(0)))qc = Pr(Y = c|X isin Nh(x
(0)))
By (715) and (716)
ζ(x(0) h)rarr(int x(0)+h
x(0)minushf(x)dx
) 12( sumr=+minus
Csumc=1
(prc minus prqc)2
prqc
) 12
equiv ζ(x(0) h)
(2) Since X and Y are independent prc = prqc for r = +minus and c = 1 middot middot middot C thenζ(x(0) h) = 0 for any h Hence
ζ(x(0)) = suphgt0ζ(x(0) h) = 0
(3) Recall that pc(x(0)) = Pr(Y = c|x(0)) Without loss of generality assume pprimec(x
(0)) gt0 Then there exists h such that
Pr(Y = c|x(0) lt X lt x(0) + h) gt Pr(Y = c|x(0) minus h lt X lt x(0))
which is equivelant to p+cp+ gt pminuscpminus This shows that ζ(x(0)) gt 0 since byLemma 61 ζ(x(0)) = 0 results in p+cp+ = pminuscpminus = pc
References
[1] Breiman L 2001 Random forests Machine Learning 45 5ndash32
[2] Breiman L Friedman J H Olshen R A Stone J 1984 Classification andregression trees Wadsworth Belmont
[3] Doksum K Tang S Tsui K-W 2008 Nonparametric variable selection Theearth algorithm Journal of the American Statistical Association 103(484) 1609ndash1620
[4] Doksum K A Samarov A 1995 Nonparametric estimation of global functionalsand a measure of explanatory power of covariates in regression Annals of Statistics23 1443ndash1473
29
[5] Doksum K A Schafer C 2006 Powerful choices Tuning parameter selectionbased on power Frontiers in Statistics Dedicated to Peter Bickel Editors J Fanand H L Koul Imperial College Press 113ndash141
[6] Friedman J 1991 Multivariate adaptive regression splines The annals of statis-tics 1ndash67
[7] Gao J Gijbels I 2008 Bandwidth selection in nonparametric kernel testingJournal of the American Statistical Association 103 (484) 1584ndash1594
[8] Hall P 1992 The Bootstrap and Edgeworth Expansions Springer New York
[9] Huang L-S Chen J 2008 Analysis of variance coefficient of determinationand f-test for local polynomial regression Annals of Statistics 36 2085ndash2109
[10] Jing B-Y Shao Q-M Wang Q 2003 Self-normalized cramer-type large devi-ations for independent random variables Annals of Probability 31 2167ndash2215
[11] Lin D Y Zeng D 2011 Correcting for population stratification in genomewideassociation studies Journal of the American Statistical Association 106 997ndash1008
[12] Loh W-Y 2009 Improve the precision of classification trees The Annals ofApplied Statistics 3 1710ndash1737
[13] Price A Patterson N Plenge R Weinblatt M Shadick N Reich D 2006Principal components analysis corrects for stratification in genome-wide associationstudies Nature genetics 38 (8) 904ndash909
[14] Qiao Z Zhou L Huang J 2008 Linear discriminant analysis for high dimen-sional Low Sample Size Data special issue of World Congress on Engineering2008
[15] Quinlan J R 1987 Simplifying decision trees International Journal of Man-Machine Studies 27(3) 221ndash234
[16] Schafer C Doksum K A 2009 Selecting local models in multiple regression bymaximizing power Metrika 69 283ndash304
[17] Tibshirani R 1996 Regression shrinkage and selection via the lasso J RoyalStatist Soc B 58 267ndash288
[18] Wang L Shen X 2007 On l1-norm multi-class support vector machinesmethodology and theory Journal of the American Statistical Association 102 583ndash594
[19] Zhang H H Liu Y Wu Y Zhu J 2008 Variable selection for the multicate-gory svm via adaptive sup-norm regularization Electronic Journal of Statistics 2149ndash167
30
- Introduction
- Importance Scores for Univariate Classification
-
- Numerical Predictor Local Contingency Efficacy
- Categorical Predictor Contingency Efficacy
-
- Variable Selection for Classification Problems The CATCH Algorithm
-
- Numerical Predictors
- Numerical and Categorical Predictors
- The Empirical Importance Scores
- Adaptive Tube Distances
- The CATCH Algorithm
-
- Simulation Studies
-
- Variable Selection with Categorical and Numerical Predictors
- Improving The Performance of SVM and RF Classifiers
-
- CATCH Analysis of The ANN Data
- Properties of Local Contingency Efficacy
- Consistency Of CATCH
-
- Type I consistency
- Type II consistency
- Type I and II consistency
-
- Proof of Theorem 61
-
Theorem 62 radicn(Tn minus τ) minusrarrd N(0 v2) (65)
where the asymptotic variance v2 which can be obtained from the Taylor expansion isnot needed in this paper
Remark 62 The result (65) applies to the multivariate χ2-based efficacies ζ with nreplaced by the number of observations that goes in to the computations of ζ In thiscase we use the notation χ2
j for the chi-square statistic for the jth variable
7 Consistency Of CATCH
Let Y denote a variable to be classified into one of the categories 1 C by usingthe Xjrsquos in the vector X = (X1 Xd) Let Xminusj equiv X minusXj denote X without the Xj
component Assume that the squared Pearson nonparametric correlation (Doksum andSamarov [4] Huang and Chen [9]) between Xj and Xminusj is less than 05 We call thevariable Xj relevant for Y if conditionally given Xminusj L(Y |Xj = x) is not constant in xand irrelevant otherwise Our data are (X(1) Y (1)) (X(n) Y (n)) iid as (X Y ) Avariable selection procedure is consistent if the probability that all the relevant variablesare selected and none of the irrelevant variables are selected tends to one as n tends toinfinity
We give a general nonparametric framework where the variable selection rule basedon the CATCH importance scores sj given in (38) is consistent
When the predictors are not all categorical and CATCH involves tubes we assumethat the number of observations in each tube tends toinfin as nrarrinfin and we assume thatfor those importance scores that require the selection of a section size h the selection isfrom a finite set hj with minjhj gt h0 for some h0 gt 0 It follows that the numberof observations mjk in the kth tube used to calculate the importance score sj for thevariable Xj tends to infinity as nrarrinfin
The key quantities that determine consistency are the limits $\lambda_j$ of $s_j$. That is,
$$\lambda_j = \tau(\zeta_j, P) = E_P[\zeta_j(X)] = \int \zeta_j(x)\,dP(x), \quad j = 1, \dots, d,$$
where $\zeta_j(x)$ is the local contingency efficacy defined by (2.4) for numerical $X$'s and by (2.8) for categorical $X$'s, and $P$ is the probability distribution of $X$. Our estimate $s_j$ of $\tau_j$ is
$$s_j = \tau(\hat\zeta_j, \hat P) = \frac{1}{M} \sum_{k=1}^{M} \hat\zeta_j(X^*_k),$$
where $\hat\zeta_j$ is defined in Section 3 and $\hat P$ is the empirical distribution of $X$.

Let $d_0$ denote the number of $X$'s that are irrelevant for $Y$. Set $d_1 = d - d_0$ and, without loss of generality, reorder the $\lambda_j$ so that
$$\lambda_j > 0 \ \text{if } j = 1, \dots, d_1; \qquad \lambda_j = 0 \ \text{if } j = d_1 + 1, \dots, d. \tag{7.1}$$
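To make the estimator concrete, here is a minimal one-dimensional sketch of the importance score $s_j$ as an average of local chi-square efficacies. This is not the paper's implementation: fixed-width windows around random centers stand in for CATCH's adaptive tubes, and all function names are ours.

```python
import numpy as np

def local_efficacy(x, y, center, h, n_classes):
    """Chi-square efficacy of the 2 x C table (X above/below center, class of Y)
    computed from the observations in the window [center - h, center + h]."""
    w = np.abs(x - center) <= h
    xs, ys = x[w], y[w]
    m = len(xs)
    if m < 10:                                   # too few points to form a table
        return 0.0
    rows = (xs > center).astype(int)
    counts = np.bincount(rows * n_classes + ys, minlength=2 * n_classes)
    counts = counts.reshape(2, n_classes).astype(float)
    E = np.outer(counts.sum(axis=1), counts.sum(axis=0)) / m
    mask = E > 0
    chi2 = ((counts[mask] - E[mask]) ** 2 / E[mask]).sum()
    return np.sqrt(chi2 / m)                     # zeta = (chi^2 / m)^(1/2)

def importance_score(x, y, n_classes, h=0.5, M=200, seed=None):
    """s_j: average of local efficacies over M randomly chosen centers."""
    rng = np.random.default_rng(seed)
    centers = rng.choice(x, size=M)
    return float(np.mean([local_efficacy(x, y, c, h, n_classes) for c in centers]))

rng = np.random.default_rng(0)
n = 2000
x_rel = rng.normal(size=n)
y = (x_rel + 0.5 * rng.normal(size=n) > 0).astype(int)   # Y depends on x_rel
x_irr = rng.normal(size=n)                               # independent of Y
s_rel = importance_score(x_rel, y, 2, seed=1)
s_irr = importance_score(x_irr, y, 2, seed=1)
print(s_rel > s_irr)  # the relevant variable receives the larger score
```

For the irrelevant variable each local chi-square is of order one, so its efficacy $\sqrt{\chi^2/m}$ shrinks with the tube count $m$; for the relevant variable the efficacy stabilizes at a positive value, which is the separation that the thresholding rule below exploits.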
Our variable selection rule is
$$\text{``keep variable } X_j \text{ if } s_j \ge t\text{''} \quad \text{for some threshold } t. \tag{7.2}$$
Rule (7.2) is consistent if and only if
$$P\Big(\min_{j \le d_1} s_j \ge t \ \text{and} \ \max_{j > d_1} s_j < t\Big) \longrightarrow 1. \tag{7.3}$$
To establish (7.3) it is enough to show that
$$P\Big(\max_{j > d_1} s_j > t\Big) \longrightarrow 0 \tag{7.4}$$
and
$$P\Big(\min_{j \le d_1} s_j \le t\Big) \longrightarrow 0. \tag{7.5}$$
We call (7.4) and (7.5) type I and type II consistency, respectively. Type I and II error probabilities refer to the probabilities of any false positives and any false negatives, respectively.
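The reduction of (7.3) to (7.4) and (7.5) is a complement-and-union bound. Treating ties at the threshold $\{s_j = t\}$ as null events (as when the $s_j$ have continuous distributions), the step can be spelled out as:

```latex
\begin{align*}
P\Big(\min_{j\le d_1} s_j \ge t \ \text{and} \ \max_{j>d_1} s_j < t\Big)
  &= 1 - P\Big(\Big\{\min_{j\le d_1} s_j < t\Big\} \cup \Big\{\max_{j>d_1} s_j \ge t\Big\}\Big)\\
  &\ge 1 - P\Big(\min_{j\le d_1} s_j < t\Big) - P\Big(\max_{j>d_1} s_j \ge t\Big),
\end{align*}
```

and the two subtracted probabilities agree with the quantities in (7.5) and (7.4) up to the null boundary events, so both tending to zero forces the left side to one.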
7.1 Type I consistency
We consider (7.4) first and note that
$$P\Big(\max_{j > d_1} s_j > t\Big) \le \sum_{j=d_1+1}^{d} P(s_j > t). \tag{7.6}$$
For the $j$th irrelevant variable $X_j$, $j > d_1$, the results in Section 6 apply to the local efficacy $\hat\zeta_j(x)$ at a fixed point $x$. The importance score $s_j$ defined in (3.8) is the average of such efficacies over a sample of $M$ random points $x^*_a$. We assume that there is a point $x^{(0)}$ such that, for irrelevant $X_j$'s and for $n$ large enough, $\hat\zeta_j(x^{(0)})$ is stochastically larger in the right tail than the average $M^{-1} \sum_{a=1}^{M} \hat\zeta_j(x^*_a)$. That is, we assume that there exist $t > 0$ and $n_0$ such that for $n \ge n_0$,
$$P(s_j > t) \le P\big(s^{(0)}_j > t\big), \quad j > d_1, \tag{7.7}$$
where $s^{(0)}_j \equiv \hat\zeta_j(x^{(0)})$. The existence of such a point $x^{(0)}$ is established by considering the probability limit of
$$\arg\max\big\{\hat\zeta_j(x^*_a) : x^*_a \in \{x^*_1, \dots, x^*_M\}\big\}.$$
Note that by Lemma 6.1, $s^{(0)}_j \to_p 0$ as $n \to \infty$. It follows that we have established
(7.4) for $t$ and $d_0 = d - d_1$ that are fixed as $n \to \infty$. To examine the interesting case where $d_0 \to \infty$ as $n \to \infty$ we need large deviation results, which we turn to next. In the $d_0 \to \infty$ case we consider $t \equiv t_n$ that tend to $\infty$ as $n \to \infty$, and we use the weaker assumption: there exist $x^{(0)}$ and $n_0$ such that
$$P(s_j > t_n) \le P\big(s^{(0)}_j > t_n\big) \tag{7.8}$$
for $n \ge n_0$ and each $t_n$ with $\lim_{n\to\infty} t_n = \infty$. We also assume that the number of columns $C$ and rows $R$ in the contingency table that defines the efficacy $\chi^2_{jn}$ given by (6.4) stays fixed as $n \to \infty$. In this case $P\big(s^{(0)}_j > t_n\big) \to 0$ provided each term
$$\big[T^{(j)}_{rc}\big]^2 \equiv \left[\frac{n_{rc} - E_{rc}}{\sqrt{n}\sqrt{E_{rc}/n}}\right]^2 = \left[\frac{n_{rc} - E_{rc}}{\sqrt{E_{rc}}}\right]^2$$
in $s_j^2 = \chi^2_{jn}$, defined for $X_j$ by (6.4), satisfies
$$P\big(|T^{(j)}_{rc}| > t_n\big) \longrightarrow 0.$$
This is because
$$P\bigg(\sqrt{\sum_r \sum_c \big[T^{(j)}_{rc}\big]^2} > t_n\bigg) \le P\Big(RC \max_{r,c} \big[T^{(j)}_{rc}\big]^2 > t_n^2\Big) \le RC \max_{r,c} P\big(|T^{(j)}_{rc}| > t_n/\sqrt{RC}\big),$$
and $t_n$ and $t_n/\sqrt{RC}$ are of the same order as $n \to \infty$. By large deviation theory (Hall [8], Jing et al. [10]), for some constant $A$,
$$P\big(|T^{(j)}_{rc}| > t_n\big) = \frac{2 t_n^{-1}}{\sqrt{2\pi}}\, e^{-t_n^2/2}\Big[1 + A t_n^3 n^{-1/2} + o\big(t_n^3 n^{-1/2}\big)\Big] \tag{7.9}$$
provided $t_n = o(n^{1/6})$. If (7.9) is uniform in $j > d_1$, then (7.6) and (7.9) imply that type I consistency holds if
$$d_0 \big(t_n/\sqrt{2}\big)^{-1} e^{-t_n^2/2} \to 0. \tag{7.10}$$
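As a quick sanity check on (7.9), its leading term $\frac{2 t_n^{-1}}{\sqrt{2\pi}} e^{-t_n^2/2}$ is the classical Mills-ratio approximation to the two-sided standard normal tail. A short numeric comparison (ours, not from the paper) shows the approximation tightening as $t$ grows:

```python
import math

def tail_exact(t):
    """P(|Z| > t) for a standard normal Z, via the complementary error function."""
    return math.erfc(t / math.sqrt(2))

def tail_approx(t):
    """Leading large-deviation term (2 / (t sqrt(2 pi))) exp(-t^2 / 2), as in (7.9)."""
    return 2.0 / (t * math.sqrt(2 * math.pi)) * math.exp(-t * t / 2)

for t in (2.0, 4.0, 6.0):
    ratio = tail_exact(t) / tail_approx(t)
    print(f"t={t}: ratio exact/approx = {ratio:.4f}")
```

The ratio is below one (the Mills bound overshoots the tail) and increases toward one, which is the sense in which the bracketed correction factor in (7.9) is $1 + o(1)$ when $t_n = o(n^{1/6})$.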
To examine (7.10), write $d_0$ and $t_n$ as $d_0 = \exp(n^b)$ and $t_n = \sqrt{2}\, n^r$. By large deviation theory, (7.10) holds when $r < \tfrac{1}{6}$. Thus if $0 < \tfrac{1}{2} b < r < \tfrac{1}{6}$, then
$$d_0 \big(t_n/\sqrt{2}\big)^{-1} e^{-t_n^2/2} = n^{-r} \exp\big(n^b - n^{2r}\big) \to 0.$$
Thus $d_0 = \exp(n^b)$ with $b = \tfrac{1}{3} - \varepsilon$ for any $\varepsilon > 0$ is the largest $d_0$ can be to have consistency. For instance, if there are exactly $d_0 = \exp(n^{1/4})$ irrelevant variables and we use the threshold $t_n = \sqrt{2}\, n^{1/7}$, the probability of selecting any of the irrelevant variables tends to zero at the rate
$$n^{-1/7} \exp\big(n^{1/4} - n^{2/7}\big).$$

7.2 Type II consistency
We now assume that for each relevant variable $X_j$ there is a least favorable point $x^{(1)}$ such that the local efficacy $s^{(1)}_j \equiv \hat\zeta_j(x^{(1)})$ is stochastically smaller than the importance score $s_j = M^{-1} \sum_{a=1}^{M} \hat\zeta_j(x^*_a)$. That is, we assume that for each relevant variable $X_j$ there exist $n_1$ and $x^{(1)}$ such that for $n \ge n_1$,
$$P\big(s^{(1)}_j \le t_n\big) \ge P(s_j \le t_n), \quad j \le d_1. \tag{7.11}$$
The existence of such a least favorable $x^{(1)}$ is established by considering the probability limit of
$$\arg\min\big\{\hat\zeta_j(x^*_a) : x^*_a \in \{x^*_1, \dots, x^*_M\}\big\}.$$
We again assume that $R$ and $C$ are fixed, so that to show type II consistency it is enough to show that for each $(r, c)$ and $j \le d_1$,
$$P\big(|T^{(j)}_{rc}| \le t_n\big) \to 0.$$
This is because
$$P\bigg(\sqrt{\sum_r \sum_c \big[T^{(j)}_{rc}\big]^2} \le t_n\bigg) \le P\Big(\min_{r,c} \big[T^{(j)}_{rc}\big]^2 \le t_n^2\Big) \le RC \max_{r,c} P\big(|T^{(j)}_{rc}| \le t_n\big).$$
By Theorem 6.2, $\sqrt{n}\big(s^{(1)}_j - \tau_j\big) \to_d N(0, \nu^2)$, $\tau_j > 0$. It follows that if $t_n$ and $d_1$ are bounded above as $n \to \infty$ and $\tau_j > 0$, then $P\big(s^{(1)}_j \le t_n\big) \to 0$. Thus type II consistency holds in this case. To examine the more interesting case where $t_n$ and $d_1$ are not bounded above and $\tau_j$ may tend to zero as $n \to \infty$, we will need to center the $T^{(j)}_{rc}$. Thus set
$$\theta^{(j)}_n = \frac{\sqrt{n}\big(p^{(j)}_{rc} - p^{(j)}_r p^{(j)}_c\big)}{\sqrt{p^{(j)}_r p^{(j)}_c}}, \tag{7.12}$$
$$T^{(j)}_n = T^{(j)}_{rc} - \theta^{(j)}_n,$$
where for simplicity the dependence of $T^{(j)}_n$ and $\theta^{(j)}_n$ on $r$ and $c$ is not indicated.
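The centering in (7.12) can be checked by simulation: with $T_{rc} = (n_{rc} - E_{rc})/\sqrt{E_{rc}}$ computed from multinomial counts and $E_{rc} = n \hat p_r \hat p_c$, the Monte Carlo mean of $T_{rc}$ tracks $\theta_n$. This is a rough numeric sketch under an assumed dependent $2 \times 2$ joint law of our choosing, not the paper's setup:

```python
import math
import numpy as np

rng = np.random.default_rng(0)
# an assumed dependent 2 x 2 joint law for (row r, class c): p_rc != p_r * p_c
p = np.array([[0.30, 0.20],
              [0.10, 0.40]])
p_r, p_c = p.sum(axis=1), p.sum(axis=0)
n, reps = 500, 4000
# theta_n of (7.12) for the (0, 0) cell
theta = math.sqrt(n) * (p[0, 0] - p_r[0] * p_c[0]) / math.sqrt(p_r[0] * p_c[0])

t_vals = []
for _ in range(reps):
    cells = rng.multinomial(n, p.ravel()).reshape(2, 2)
    pr0 = cells[0].sum() / n                    # estimated row marginal
    pc0 = cells[:, 0].sum() / n                 # estimated column marginal
    E = n * pr0 * pc0                           # expected count under independence
    t_vals.append((cells[0, 0] - E) / math.sqrt(E))

print(theta, np.mean(t_vals))  # the Monte Carlo mean is close to theta_n
```

The gap between the Monte Carlo mean and $\theta_n$ is the $O_p(1)$ fluctuation of the centered statistic $T_n = T_{rc} - \theta_n$, which is what Lemma 7.1 controls.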
Lemma 7.1.
$$P\Big(\min_{j \le d_1} |T^{(j)}_{rc}| \le t_n\Big) \le \sum_{j=1}^{d_1} P\Big(|T^{(j)}_n| \ge \min_{j \le d_1} |\theta^{(j)}_n| - t_n\Big).$$
Proof.
$$P\Big(\min_{j \le d_1} |T^{(j)}_{rc}| \le t_n\Big) = P\Big(\min_{j \le d_1} |T^{(j)}_n + \theta^{(j)}_n| \le t_n\Big) \le P\Big(\min_{j \le d_1} |\theta^{(j)}_n| - \max_{j \le d_1} |T^{(j)}_n| \le t_n\Big)$$
$$= P\Big(\max_{j \le d_1} |T^{(j)}_n| \ge \min_{j \le d_1} |\theta^{(j)}_n| - t_n\Big) \le \sum_{j=1}^{d_1} P\Big(|T^{(j)}_n| \ge \min_{j \le d_1} |\theta^{(j)}_n| - t_n\Big). \qquad \Box$$
Set
$$a_n = \min_{j \le d_1} \big|\theta^{(j)}_n\big| - t_n;$$
then by large deviation theory, if $a_n = o(n^{1/6})$,
$$P\big(|T^{(j)}_n| \ge a_n\big) \propto \big(a_n/\sqrt{2}\big)^{-1} \exp\{-a_n^2/2\}, \tag{7.13}$$
where $\propto$ signifies asymptotic order as $n \to \infty$, $a_n \to \infty$. It follows that if (7.13) holds uniformly in $j$, then by Lemma 7.1,
$$P\Big(\min_{j \le d_1} \big|T^{(j)}_{rc}\big| \le t_n\Big) \propto d_1 \big(a_n/\sqrt{2}\big)^{-1} \exp\{-a_n^2/2\}.$$
Thus type II consistency holds when $a_n = o(n^{1/6})$ and
$$d_1 \big(a_n/\sqrt{2}\big)^{-1} \exp\{-a_n^2/2\} \to 0. \tag{7.14}$$
In Section 7.1, $t_n$ is of order $n^r$, $0 < r < \tfrac{1}{6}$. For this $t_n$, type II consistency holds if $a_n$ is of order $n^r$, $0 < r < \tfrac{1}{6}$, and
$$d_1 n^{-r} \exp\{-n^{2r}/2\} \to 0.$$
If $a_n = O(n^k)$ with $k \ge \tfrac{1}{6}$, then $P\big(|T^{(j)}_n| \ge a_n\big)$ tends to zero at a rate faster than in (7.13) and consistency is ensured. For instance, if all relevant variables $X_j$ have $p^{(j)}_{rc}$ fixed as $n \to \infty$, then $\min_{j \le d_1} \big|\theta^{(j)}_n\big|$ is of order $\sqrt{n}$ and $a_n$ is of order $n^{1/2} - t_n$.
7.3 Type I and II consistency

We see from Sections 7.1 and 7.2 that consistency holds under a variety of conditions. One set of conditions is:

Theorem 7.1. Under the regularity conditions of Sections 7.1 and 7.2, if
$$d_0 = c_0 \exp(n^b), \quad t_n = c_1 n^r, \quad \min_{j \le d_1} \big|\theta^{(j)}_n\big| - t_n = c_2 n^r, \quad d_1 = c_3 \exp(n^b)$$
for positive constants $c_0, c_1, c_2, c_3$ and $0 < \tfrac{1}{2} b < r < \tfrac{1}{6}$, then CATCH is consistent.
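A numeric illustration of the rate conditions in Theorem 7.1, with constants of our choosing ($b = 0.2$, $r = 1/7$, which satisfies $0 < b/2 < r < 1/6$): both the type I bound (7.10) and the type II bound (7.14) are driven to zero even though the numbers of irrelevant and relevant variables grow like $\exp(n^b)$.

```python
import math

def type1_bound(n, b, r):
    """d0 (t_n / sqrt(2))^(-1) exp(-t_n^2 / 2) with d0 = exp(n^b), t_n = sqrt(2) n^r,
    which simplifies to n^(-r) exp(n^b - n^(2r))."""
    return n ** (-r) * math.exp(n ** b - n ** (2 * r))

def type2_bound(n, b, r):
    """d1 n^(-r) exp(-n^(2r) / 2) with d1 = exp(n^b) and a_n of exact order n^r."""
    return n ** (-r) * math.exp(n ** b - n ** (2 * r) / 2)

b, r = 0.2, 1 / 7                        # satisfies 0 < b/2 < r < 1/6
ns = (10 ** 3, 10 ** 6, 10 ** 9)
t1 = [type1_bound(n, b, r) for n in ns]
t2 = [type2_bound(n, b, r) for n in ns]
print(t1)
print(t2)
```

The type II bound decays more slowly (its exponent is $n^b - n^{2r}/2$ rather than $n^b - n^{2r}$), which is why the condition $b < 2r$ has to hold with some room to spare before the bound becomes small at practical sample sizes.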
Appendix

Proof of Theorem 6.1.

(1) Let $N_h = \sum_{i=1}^{n} I\big(X^{(i)} \in N_h(x^{(0)})\big)$. Then $N_h \sim \mathrm{Bin}\big(n, \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx\big)$ and $N_h \to_{a.s.} \infty$ as $n \to \infty$. Write
$$\hat\zeta(x^{(0)}, h) = n^{-1/2} \sqrt{\mathcal{X}^2(x^{(0)}, h)} = (N_h/n)^{1/2} (N_h)^{-1/2} \sqrt{\mathcal{X}^2(x^{(0)}, h)}.$$
Since $N_h \sim \mathrm{Bin}\big(n, \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx\big)$, we have
$$N_h/n \to \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx. \tag{7.15}$$
By Lemma 6.1 and Table 1,
$$(N_h)^{-1/2} \sqrt{\mathcal{X}^2(x^{(0)}, h)} \to \bigg(\sum_{r=+,-} \sum_{c=1}^{C} \frac{(p_{rc} - p_r q_c)^2}{p_r q_c}\bigg)^{1/2}, \tag{7.16}$$
where, for $r = +, -$ and $c = 1, \dots, C$,
$$p_{+c} = \Pr\big(X > x^{(0)}, Y = c \mid X \in N_h(x^{(0)})\big), \quad p_{-c} = \Pr\big(X \le x^{(0)}, Y = c \mid X \in N_h(x^{(0)})\big),$$
$$p_r = \sum_{c=1}^{C} p_{rc}, \qquad q_c = \sum_{r=+,-} p_{rc}.$$
That is, $p_+ = \Pr\big(X > x^{(0)} \mid X \in N_h(x^{(0)})\big)$, $p_- = \Pr\big(X \le x^{(0)} \mid X \in N_h(x^{(0)})\big)$, and $q_c = \Pr\big(Y = c \mid X \in N_h(x^{(0)})\big)$. By (7.15) and (7.16),
$$\hat\zeta(x^{(0)}, h) \to \bigg(\int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx\bigg)^{1/2} \bigg(\sum_{r=+,-} \sum_{c=1}^{C} \frac{(p_{rc} - p_r q_c)^2}{p_r q_c}\bigg)^{1/2} \equiv \zeta(x^{(0)}, h).$$

(2) Since $X$ and $Y$ are independent, $p_{rc} = p_r q_c$ for $r = +, -$ and $c = 1, \dots, C$, so $\zeta(x^{(0)}, h) = 0$ for any $h$. Hence
$$\zeta(x^{(0)}) = \sup_{h > 0} \zeta(x^{(0)}, h) = 0.$$

(3) Recall that $p_c(x^{(0)}) = \Pr(Y = c \mid X = x^{(0)})$. Without loss of generality assume $p'_c(x^{(0)}) > 0$. Then there exists $h$ such that
$$\Pr\big(Y = c \mid x^{(0)} < X < x^{(0)} + h\big) > \Pr\big(Y = c \mid x^{(0)} - h < X < x^{(0)}\big),$$
which is equivalent to $p_{+c}/p_+ > p_{-c}/p_-$. This shows that $\zeta(x^{(0)}) > 0$, since by Lemma 6.1, $\zeta(x^{(0)}) = 0$ implies $p_{+c}/p_+ = p_{-c}/p_- = p_c$. $\Box$
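Part (2) can be checked numerically: when $X$ and $Y$ are independent, the efficacy $\hat\zeta(x^{(0)}, h) = n^{-1/2}\sqrt{\mathcal{X}^2(x^{(0)}, h)}$ shrinks at the $n^{-1/2}$ rate because the window chi-square stays of order one. A rough Monte Carlo sketch (our code, with a fixed window half-width $h$ standing in for the paper's neighborhood $N_h$):

```python
import numpy as np

def zeta_hat(x, y, x0, h, n_classes):
    """Efficacy n^(-1/2) sqrt(X^2(x0, h)): chi-square of the 2 x C table
    (X above/below x0, class of Y) restricted to the window [x0 - h, x0 + h]."""
    n = len(x)
    w = np.abs(x - x0) <= h
    rows = (x[w] > x0).astype(int)
    idx = rows * n_classes + y[w]
    counts = np.bincount(idx, minlength=2 * n_classes).reshape(2, n_classes)
    m = counts.sum()
    E = np.outer(counts.sum(axis=1), counts.sum(axis=0)) / m
    mask = E > 0
    chi2 = ((counts[mask] - E[mask]) ** 2 / E[mask]).sum()
    return np.sqrt(chi2 / n)

rng = np.random.default_rng(0)
means = []
for n in (500, 5000, 50000):
    reps = [zeta_hat(rng.normal(size=n), rng.integers(0, 2, size=n), 0.0, 0.5, 2)
            for _ in range(30)]
    means.append(float(np.mean(reps)))
print(means)  # shrinking toward zero as n grows
```

Averaging over 30 replications smooths the chi-square fluctuations so the $n^{-1/2}$ decay predicted by part (2) is visible across the three sample sizes.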
References

[1] Breiman, L., 2001. Random forests. Machine Learning 45, 5-32.

[2] Breiman, L., Friedman, J. H., Olshen, R. A., Stone, C. J., 1984. Classification and Regression Trees. Wadsworth, Belmont.

[3] Doksum, K., Tang, S., Tsui, K.-W., 2008. Nonparametric variable selection: The EARTH algorithm. Journal of the American Statistical Association 103 (484), 1609-1620.

[4] Doksum, K. A., Samarov, A., 1995. Nonparametric estimation of global functionals and a measure of the explanatory power of covariates in regression. Annals of Statistics 23, 1443-1473.

[5] Doksum, K. A., Schafer, C., 2006. Powerful choices: Tuning parameter selection based on power. In: Frontiers in Statistics: Dedicated to Peter Bickel (J. Fan and H. L. Koul, eds.). Imperial College Press, 113-141.

[6] Friedman, J., 1991. Multivariate adaptive regression splines. The Annals of Statistics 19, 1-67.

[7] Gao, J., Gijbels, I., 2008. Bandwidth selection in nonparametric kernel testing. Journal of the American Statistical Association 103 (484), 1584-1594.

[8] Hall, P., 1992. The Bootstrap and Edgeworth Expansion. Springer, New York.

[9] Huang, L.-S., Chen, J., 2008. Analysis of variance, coefficient of determination and F-test for local polynomial regression. Annals of Statistics 36, 2085-2109.

[10] Jing, B.-Y., Shao, Q.-M., Wang, Q., 2003. Self-normalized Cramér-type large deviations for independent random variables. Annals of Probability 31, 2167-2215.

[11] Lin, D. Y., Zeng, D., 2011. Correcting for population stratification in genomewide association studies. Journal of the American Statistical Association 106, 997-1008.

[12] Loh, W.-Y., 2009. Improving the precision of classification trees. The Annals of Applied Statistics 3, 1710-1737.

[13] Price, A., Patterson, N., Plenge, R., Weinblatt, M., Shadick, N., Reich, D., 2006. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38 (8), 904-909.

[14] Qiao, Z., Zhou, L., Huang, J., 2008. Linear discriminant analysis for high-dimensional, low-sample-size data. Special issue of the World Congress on Engineering 2008.

[15] Quinlan, J. R., 1987. Simplifying decision trees. International Journal of Man-Machine Studies 27 (3), 221-234.

[16] Schafer, C., Doksum, K. A., 2009. Selecting local models in multiple regression by maximizing power. Metrika 69, 283-304.

[17] Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267-288.

[18] Wang, L., Shen, X., 2007. On L1-norm multiclass support vector machines: Methodology and theory. Journal of the American Statistical Association 102, 583-594.

[19] Zhang, H. H., Liu, Y., Wu, Y., Zhu, J., 2008. Variable selection for the multicategory SVM via adaptive sup-norm regularization. Electronic Journal of Statistics 2, 149-167.
- Introduction
- Importance Scores for Univariate Classification
  - Numerical Predictor: Local Contingency Efficacy
  - Categorical Predictor: Contingency Efficacy
- Variable Selection for Classification Problems: The CATCH Algorithm
  - Numerical Predictors
  - Numerical and Categorical Predictors
  - The Empirical Importance Scores
  - Adaptive Tube Distances
  - The CATCH Algorithm
- Simulation Studies
  - Variable Selection with Categorical and Numerical Predictors
  - Improving the Performance of SVM and RF Classifiers
- CATCH Analysis of the ANN Data
- Properties of Local Contingency Efficacy
- Consistency of CATCH
  - Type I consistency
  - Type II consistency
  - Type I and II consistency
- Proof of Theorem 6.1
λj gt 0 if j = 1 d1λj = 0 if j = d1 + 1 d
(71)
Our variable selection rule is
ldquokeep variable Xj if sj ge t for some threshold trdquo (72)
Rule (72) is consistent if and only if
P (minjled1sj ge t and max
jgtd1sj lt t) minusrarr 1 (73)
To establish (73) it is enough to show that
P (maxjgtd1sj gt t) minusrarr 0 (74)
andP (min
jled1sj le t) minusrarr 0 (75)
We call (74) and (75) type I and type II consistency respectively Type I and IIerror probabilities refer to probabilities of any false positives and any false negativesrespectively
71 Type I consistency
We consider (74) first and note that
P (maxjgtd1sj gt t) le
dsumj=d1+1
P (sj gt t) (76)
For the jth irrelevant variable Xj j gt d1 the results in Section 6 apply to the local
efficacy ζj(x) at a fixed point x The importance score sj defined in (38) is the averageof such efficacies over a sample of M random points xlowasta We assume that there is apoint x(0) such that for irrelevant xjrsquos and for n large enough ζj(x
(0)) is stochastically
larger in the right tail than the average Mminus1sumMa=1 ζj(x
lowasta) That is we assume that
there exist t gt 0 and n0 such that for n ge n0
p(sj gt t) le P (s(0)j gt t) j gt d1 (77)
The existence of such a point x(0) is established by considering the probability limit of
arg maxζj(xlowasta) xlowasta isin xlowast1 middot middot middot xlowastM
Note by Lemma (61) s(0)j rarrp 0 as n rarr infin It follows that we have established
24
(74) for t and d0 = d minus d1 that are fixed as n rarr infin To examine the interesting casewhere d0 rarr infin as n rarr infin we need large deviation results which we turn to next Inthe d0 rarrinfin case we consider t equiv tn that tend to infin as nrarrinfin and we use the weakerassumption there exists x(0) and n0 such that
P (sj gt tn) le P (s(0)j gt tn) (78)
for n ge n0 and each tn with limnrarrinfin tn = infin We also assume that the number ofcolumns C and rows R in the contingency table that define the efficacy χ2
jn given by
(64) stays fixed as nrarrinfin In this case P (s(0)j gt tn)rarr 0 provided each term
[T (j)rc ]2 equiv
[nrcErcradicnradicErc
]2in s2j = χ2
jn defined for Xj by (64) satisfies
P(|T (j)rc | gt tn
)minusrarr 0
This is because
P
radicsumr
sumc
[T(j)rc ]2 gt tn
le P(RC max
rc[T (j)rc ]2 gt t2n
)le RC maxP
(|T (j)rc | gt tn
radicRC)
and tn and tnradicRC are of the same order as n rarr infin By large deviation theory Hall
[8] and Jing et al [10] for some constant A
P (|T (j)rc | gt tn) =
2tminus1nradic2πeminust
2n2[1 + At3nn
minus 12 + o
(t3nnminus 1
2
)](79)
provided tn = o(n
16
) If (79) is uniform in j gt d1 then (76) and (79) implies that
type I consistency holds if
d0
(tnradic
2
)minus1eminus
t2n2 rarr 0 (710)
To examine (710) write d0 and tn as d0 = exp(nb)
and tn =radic
2nr By large deviationtheory (710) holds when r lt 1
6 Thus if 0 lt 1
2b lt r lt 1
6 then
d0
(tnradic
2
)minus1eminus
t2n2 = nminusr exp
(nb minus n2r
)rarr 0
25
Thus d0 = exp(nb) with b = 13ε for any ε gt 0 is the largest d0 can be to have consistency
For instance if there are exactly d0 = exp(n
14
)irrelevant variables and we use the
threshold tn =radic
2n17 the probability of selecting any of the irrelevant variables tends
to zero at the ratenminus
14 exp
(n
14 minus n
27
)72 Type II consistency
We now assume that for each relevant variable Xj there is a least favorable point x(1)
such that the local efficacy s(1)j equiv ζ(x(1)) is stochastically smaller than the importance
score sj = Mminus1sumMa=1 ζ(xlowasta) That is we assume that for each relevant variable Xj there
exists n1 and x(1) such that for n ge n1
P(s(1)j le tn
)ge P (sj le tn) j le d1 (711)
The existence of such a least favorable x(1) is established by considering the probabilitylimit of
arg minζj(xlowasta) xlowasta isin xlowast1 middot middot middot xlowastM
We again assume that R and C are fixed so that to show type II consistency it isenough to show that for each (r c) and j le d1
P(|T (j)rc | le tn
)rarr 0
This is because
P
(radicsumr
sumc
[T
(j)rc
]2le tn
)le P (minrc
[T
(j)rc
]2le t2n)
le RC maxrc P(|T (j)rc | le tn
)By Theorem 62
radicn(s(1)j minus τj
)rarrd N(0 ν2) τj gt 0 It follows that if tn and d1
are bounded above as n rarr infin and τj gt 0 then P(s(1)j le tn
)rarr 0 Thus type II
consistency holds in this case To examine the more interesting case where tn and d1are not bounded above and τj may tend to zero as nrarrinfin we will need to center the
T(j)rc Thus set
θ(j)n =
radicn(p(j)rc minus p(j)r p
(j)c
)radicp(j)r p
(j)c
(712)
26
Tn(j) = T (j)rc minus θ(j)n
where for simplicity the dependence of T(j)n and θ
(j)n on r and c is not indicated
Lemma 71
P
(minjled1|T (j)rc | le tn
)le
d1sumj=1
P
(|T (j)n | ge min
jled1θ(j)n minus tn
)Proof
P(
minjled1 |T(j)rc | le tn
)= P
(minjled1 |T
(j)n + θ
(j)n | le tn
)le P
(minjled1 |T
(j)n | minusmaxjled1 |θ
(j)n | le tn
)= P
(maxjled1 |T
(j)n | ge minjled1 |θ
(j)n | minus tn
)lesumd1
j=1 P(|T (j)n | ge minjled1 |θ
(j)n | minus tn
)
Setan = min
jled1
∣∣θ(j)n ∣∣minus tnthen by large deviation theory if an = o
(n
16
)
P (∣∣T (j)n
∣∣ ge an) prop
(anradic
2
)minus1expminusa2n2 (713)
where prop signifies asymptotic order as nrarrinfin an rarrinfin It follows that if (713) holdsuniformly in j then by Lemma 71
P
(minjled1
∣∣T (j)rc
∣∣ le tn
)prop d1
(anradic
2
)minus1expminusa2n2
Thus type II consistency holds when an = o(n
16
)and
d1
(anradic
2
)minus1expminusa2n2 rarr 0 (714)
In Section 71 tn is of order nr 0 lt r lt 16 For this tn type II consistency holds if an
is of order nr 0 lt r lt 16 and
d1nminusr expminusn2r2 rarr 0
27
If an = O(nk) with k ge 16 then P (
∣∣∣T (j)(n)
∣∣∣ ge an) tends to zero at a rate faster than in
(713) and consistency is ensured For instance if all relevant variables Xj have p(j)rc
fixed as nrarrinfin then minjled1
∣∣∣θ(j)n ∣∣∣ is of orderradicn and an is of order n
12 minus tn
73 Type I and II consistency
We see from Sections 71 and 72 that consistency holds under a variety of conditionsOne set of conditions is
Theorem 71 Under the regularization conditions of Sections 71 and 72 if
d0 = c0 exp(nb) tn = c1nr
minjled1 θ(j) minus tn = c2nr d1 = c3 exp
(nb)
for positive constants c0 c1 c2 and c3 and 0 lt 12b lt r lt 1
6 then CATCH is consistent
AppendixProof of Theorem 61
(1) Let Nh =
sumni=1 I(X(i) isin Nh(x
(0))) Then Nh sim Bi(n
int x(0)+hx(0)minush f(x)dx) and
Nh rarras infin as nrarrinfin
ζ(x(0) h) = nminus12radicX 2(x(0) h)
= (Nh n)12(N
h )minus12radicX 2(x(0) h)
Since Nh sim Bi(n
int x(0)+hx(0)minush f(x)dx) we have
Nh nrarr
int x(0)+h
x(0)minushf(x)dx (715)
By Lemma 61 and Table (1)
(Nh )minus12
radicX 2(x(0) h)rarr
( sumr=+minus
Csumc=1
(prc minus prqc)2
prqc
) 12
(716)
where
p+c = Pr(X gt x0 Y = c|X isin Nh(x(0))) pminusc = Pr(X le x0 Y = c|X isin Nh(x
(0)))
28
for r = +minus and c = 1 middot middot middot C
pr =Csumc=1
prc qc =sumr=+minus
prc
Actually p+ = Pr(X gt x0|X isin Nh(x(0))) pminus = Pr(X le x0|X isin Nh(x
(0)))qc = Pr(Y = c|X isin Nh(x
(0)))
By (715) and (716)
ζ(x(0) h)rarr(int x(0)+h
x(0)minushf(x)dx
) 12( sumr=+minus
Csumc=1
(prc minus prqc)2
prqc
) 12
equiv ζ(x(0) h)
(2) Since X and Y are independent prc = prqc for r = +minus and c = 1 middot middot middot C thenζ(x(0) h) = 0 for any h Hence
ζ(x(0)) = suphgt0ζ(x(0) h) = 0
(3) Recall that pc(x(0)) = Pr(Y = c|x(0)) Without loss of generality assume pprimec(x
(0)) gt0 Then there exists h such that
Pr(Y = c|x(0) lt X lt x(0) + h) gt Pr(Y = c|x(0) minus h lt X lt x(0))
which is equivelant to p+cp+ gt pminuscpminus This shows that ζ(x(0)) gt 0 since byLemma 61 ζ(x(0)) = 0 results in p+cp+ = pminuscpminus = pc
References
[1] Breiman L 2001 Random forests Machine Learning 45 5ndash32
[2] Breiman L Friedman J H Olshen R A Stone J 1984 Classification andregression trees Wadsworth Belmont
[3] Doksum K Tang S Tsui K-W 2008 Nonparametric variable selection Theearth algorithm Journal of the American Statistical Association 103(484) 1609ndash1620
[4] Doksum K A Samarov A 1995 Nonparametric estimation of global functionalsand a measure of explanatory power of covariates in regression Annals of Statistics23 1443ndash1473
29
[5] Doksum K A Schafer C 2006 Powerful choices Tuning parameter selectionbased on power Frontiers in Statistics Dedicated to Peter Bickel Editors J Fanand H L Koul Imperial College Press 113ndash141
[6] Friedman J 1991 Multivariate adaptive regression splines The annals of statis-tics 1ndash67
[7] Gao J Gijbels I 2008 Bandwidth selection in nonparametric kernel testingJournal of the American Statistical Association 103 (484) 1584ndash1594
[8] Hall P 1992 The Bootstrap and Edgeworth Expansions Springer New York
[9] Huang L-S Chen J 2008 Analysis of variance coefficient of determinationand f-test for local polynomial regression Annals of Statistics 36 2085ndash2109
[10] Jing B-Y Shao Q-M Wang Q 2003 Self-normalized cramer-type large devi-ations for independent random variables Annals of Probability 31 2167ndash2215
[11] Lin D Y Zeng D 2011 Correcting for population stratification in genomewideassociation studies Journal of the American Statistical Association 106 997ndash1008
[12] Loh W-Y 2009 Improve the precision of classification trees The Annals ofApplied Statistics 3 1710ndash1737
[13] Price A Patterson N Plenge R Weinblatt M Shadick N Reich D 2006Principal components analysis corrects for stratification in genome-wide associationstudies Nature genetics 38 (8) 904ndash909
[14] Qiao Z Zhou L Huang J 2008 Linear discriminant analysis for high dimen-sional Low Sample Size Data special issue of World Congress on Engineering2008
[15] Quinlan J R 1987 Simplifying decision trees International Journal of Man-Machine Studies 27(3) 221ndash234
[16] Schafer C Doksum K A 2009 Selecting local models in multiple regression bymaximizing power Metrika 69 283ndash304
[17] Tibshirani R 1996 Regression shrinkage and selection via the lasso J RoyalStatist Soc B 58 267ndash288
[18] Wang L Shen X 2007 On l1-norm multi-class support vector machinesmethodology and theory Journal of the American Statistical Association 102 583ndash594
[19] Zhang H H Liu Y Wu Y Zhu J 2008 Variable selection for the multicate-gory svm via adaptive sup-norm regularization Electronic Journal of Statistics 2149ndash167
30
- Introduction
- Importance Scores for Univariate Classification
-
- Numerical Predictor Local Contingency Efficacy
- Categorical Predictor Contingency Efficacy
-
- Variable Selection for Classification Problems The CATCH Algorithm
-
- Numerical Predictors
- Numerical and Categorical Predictors
- The Empirical Importance Scores
- Adaptive Tube Distances
- The CATCH Algorithm
-
- Simulation Studies
-
- Variable Selection with Categorical and Numerical Predictors
- Improving The Performance of SVM and RF Classifiers
-
- CATCH Analysis of The ANN Data
- Properties of Local Contingency Efficacy
- Consistency Of CATCH
-
- Type I consistency
- Type II consistency
- Type I and II consistency
-
- Proof of Theorem 61
-
(74) for t and d0 = d minus d1 that are fixed as n rarr infin To examine the interesting casewhere d0 rarr infin as n rarr infin we need large deviation results which we turn to next Inthe d0 rarrinfin case we consider t equiv tn that tend to infin as nrarrinfin and we use the weakerassumption there exists x(0) and n0 such that
P (sj gt tn) le P (s(0)j gt tn) (78)
for n ge n0 and each tn with limnrarrinfin tn = infin We also assume that the number ofcolumns C and rows R in the contingency table that define the efficacy χ2
jn given by
(64) stays fixed as nrarrinfin In this case P (s(0)j gt tn)rarr 0 provided each term
[T (j)rc ]2 equiv
[nrcErcradicnradicErc
]2in s2j = χ2
jn defined for Xj by (64) satisfies
P(|T (j)rc | gt tn
)minusrarr 0
This is because
P
radicsumr
sumc
[T(j)rc ]2 gt tn
le P(RC max
rc[T (j)rc ]2 gt t2n
)le RC maxP
(|T (j)rc | gt tn
radicRC)
and tn and tnradicRC are of the same order as n rarr infin By large deviation theory Hall
[8] and Jing et al [10] for some constant A
P (|T (j)rc | gt tn) =
2tminus1nradic2πeminust
2n2[1 + At3nn
minus 12 + o
(t3nnminus 1
2
)](79)
provided tn = o(n
16
) If (79) is uniform in j gt d1 then (76) and (79) implies that
type I consistency holds if
d0
(tnradic
2
)minus1eminus
t2n2 rarr 0 (710)
To examine (710) write d0 and tn as d0 = exp(nb)
and tn =radic
2nr By large deviationtheory (710) holds when r lt 1
6 Thus if 0 lt 1
2b lt r lt 1
6 then
d0
(tnradic
2
)minus1eminus
t2n2 = nminusr exp
(nb minus n2r
)rarr 0
25
Thus d0 = exp(nb) with b = 13ε for any ε gt 0 is the largest d0 can be to have consistency
For instance if there are exactly d0 = exp(n
14
)irrelevant variables and we use the
threshold tn =radic
2n17 the probability of selecting any of the irrelevant variables tends
to zero at the ratenminus
14 exp
(n
14 minus n
27
)72 Type II consistency
We now assume that for each relevant variable Xj there is a least favorable point x(1)
such that the local efficacy s(1)j equiv ζ(x(1)) is stochastically smaller than the importance
score sj = Mminus1sumMa=1 ζ(xlowasta) That is we assume that for each relevant variable Xj there
exists n1 and x(1) such that for n ge n1
P(s(1)j le tn
)ge P (sj le tn) j le d1 (711)
The existence of such a least favorable x(1) is established by considering the probabilitylimit of
arg minζj(xlowasta) xlowasta isin xlowast1 middot middot middot xlowastM
We again assume that R and C are fixed so that to show type II consistency it isenough to show that for each (r c) and j le d1
P(|T (j)rc | le tn
)rarr 0
This is because
P
(radicsumr
sumc
[T
(j)rc
]2le tn
)le P (minrc
[T
(j)rc
]2le t2n)
le RC maxrc P(|T (j)rc | le tn
)By Theorem 62
radicn(s(1)j minus τj
)rarrd N(0 ν2) τj gt 0 It follows that if tn and d1
are bounded above as n rarr infin and τj gt 0 then P(s(1)j le tn
)rarr 0 Thus type II
consistency holds in this case To examine the more interesting case where tn and d1are not bounded above and τj may tend to zero as nrarrinfin we will need to center the
T(j)rc Thus set
θ(j)n =
radicn(p(j)rc minus p(j)r p
(j)c
)radicp(j)r p
(j)c
(712)
26
Tn(j) = T (j)rc minus θ(j)n
where for simplicity the dependence of T(j)n and θ
(j)n on r and c is not indicated
Lemma 71
P
(minjled1|T (j)rc | le tn
)le
d1sumj=1
P
(|T (j)n | ge min
jled1θ(j)n minus tn
)Proof
P(
minjled1 |T(j)rc | le tn
)= P
(minjled1 |T
(j)n + θ
(j)n | le tn
)le P
(minjled1 |T
(j)n | minusmaxjled1 |θ
(j)n | le tn
)= P
(maxjled1 |T
(j)n | ge minjled1 |θ
(j)n | minus tn
)lesumd1
j=1 P(|T (j)n | ge minjled1 |θ
(j)n | minus tn
)
Setan = min
jled1
∣∣θ(j)n ∣∣minus tnthen by large deviation theory if an = o
(n
16
)
P (∣∣T (j)n
∣∣ ge an) prop
(anradic
2
)minus1expminusa2n2 (713)
where prop signifies asymptotic order as nrarrinfin an rarrinfin It follows that if (713) holdsuniformly in j then by Lemma 71
P
(minjled1
∣∣T (j)rc
∣∣ le tn
)prop d1
(anradic
2
)minus1expminusa2n2
Thus type II consistency holds when an = o(n
16
)and
d1
(anradic
2
)minus1expminusa2n2 rarr 0 (714)
In Section 71 tn is of order nr 0 lt r lt 16 For this tn type II consistency holds if an
is of order nr 0 lt r lt 16 and
d1nminusr expminusn2r2 rarr 0
27
If an = O(nk) with k ge 16 then P (
∣∣∣T (j)(n)
∣∣∣ ge an) tends to zero at a rate faster than in
(713) and consistency is ensured For instance if all relevant variables Xj have p(j)rc
fixed as nrarrinfin then minjled1
∣∣∣θ(j)n ∣∣∣ is of orderradicn and an is of order n
12 minus tn
73 Type I and II consistency
We see from Sections 71 and 72 that consistency holds under a variety of conditionsOne set of conditions is
Theorem 71 Under the regularization conditions of Sections 71 and 72 if
d0 = c0 exp(nb) tn = c1nr
minjled1 θ(j) minus tn = c2nr d1 = c3 exp
(nb)
for positive constants c0 c1 c2 and c3 and 0 lt 12b lt r lt 1
6 then CATCH is consistent
AppendixProof of Theorem 61
(1) Let Nh =
sumni=1 I(X(i) isin Nh(x
(0))) Then Nh sim Bi(n
int x(0)+hx(0)minush f(x)dx) and
Nh rarras infin as nrarrinfin
ζ(x(0) h) = nminus12radicX 2(x(0) h)
= (Nh n)12(N
h )minus12radicX 2(x(0) h)
Since Nh sim Bi(n
int x(0)+hx(0)minush f(x)dx) we have
Nh nrarr
int x(0)+h
x(0)minushf(x)dx (715)
By Lemma 61 and Table (1)
(Nh )minus12
radicX 2(x(0) h)rarr
( sumr=+minus
Csumc=1
(prc minus prqc)2
prqc
) 12
(716)
where
p+c = Pr(X gt x0 Y = c|X isin Nh(x(0))) pminusc = Pr(X le x0 Y = c|X isin Nh(x
(0)))
28
for r = +minus and c = 1 middot middot middot C
pr =Csumc=1
prc qc =sumr=+minus
prc
Actually p+ = Pr(X gt x0|X isin Nh(x(0))) pminus = Pr(X le x0|X isin Nh(x
(0)))qc = Pr(Y = c|X isin Nh(x
(0)))
By (715) and (716)
ζ(x(0) h)rarr(int x(0)+h
x(0)minushf(x)dx
) 12( sumr=+minus
Csumc=1
(prc minus prqc)2
prqc
) 12
equiv ζ(x(0) h)
(2) Since X and Y are independent prc = prqc for r = +minus and c = 1 middot middot middot C thenζ(x(0) h) = 0 for any h Hence
ζ(x(0)) = suphgt0ζ(x(0) h) = 0
(3) Recall that pc(x(0)) = Pr(Y = c|x(0)) Without loss of generality assume pprimec(x
(0)) gt0 Then there exists h such that
Pr(Y = c|x(0) lt X lt x(0) + h) gt Pr(Y = c|x(0) minus h lt X lt x(0))
which is equivelant to p+cp+ gt pminuscpminus This shows that ζ(x(0)) gt 0 since byLemma 61 ζ(x(0)) = 0 results in p+cp+ = pminuscpminus = pc
References
[1] Breiman L 2001 Random forests Machine Learning 45 5ndash32
[2] Breiman L Friedman J H Olshen R A Stone J 1984 Classification andregression trees Wadsworth Belmont
[3] Doksum K Tang S Tsui K-W 2008 Nonparametric variable selection Theearth algorithm Journal of the American Statistical Association 103(484) 1609ndash1620
[4] Doksum K A Samarov A 1995 Nonparametric estimation of global functionalsand a measure of explanatory power of covariates in regression Annals of Statistics23 1443ndash1473
29
[5] Doksum K A Schafer C 2006 Powerful choices Tuning parameter selectionbased on power Frontiers in Statistics Dedicated to Peter Bickel Editors J Fanand H L Koul Imperial College Press 113ndash141
[6] Friedman J 1991 Multivariate adaptive regression splines The annals of statis-tics 1ndash67
[7] Gao J Gijbels I 2008 Bandwidth selection in nonparametric kernel testingJournal of the American Statistical Association 103 (484) 1584ndash1594
[8] Hall P 1992 The Bootstrap and Edgeworth Expansions Springer New York
[9] Huang L-S Chen J 2008 Analysis of variance coefficient of determinationand f-test for local polynomial regression Annals of Statistics 36 2085ndash2109
[10] Jing B-Y Shao Q-M Wang Q 2003 Self-normalized cramer-type large devi-ations for independent random variables Annals of Probability 31 2167ndash2215
[11] Lin D Y Zeng D 2011 Correcting for population stratification in genomewideassociation studies Journal of the American Statistical Association 106 997ndash1008
[12] Loh W-Y 2009 Improve the precision of classification trees The Annals ofApplied Statistics 3 1710ndash1737
[13] Price A Patterson N Plenge R Weinblatt M Shadick N Reich D 2006Principal components analysis corrects for stratification in genome-wide associationstudies Nature genetics 38 (8) 904ndash909
[14] Qiao Z Zhou L Huang J 2008 Linear discriminant analysis for high dimen-sional Low Sample Size Data special issue of World Congress on Engineering2008
[15] Quinlan J R 1987 Simplifying decision trees International Journal of Man-Machine Studies 27(3) 221ndash234
[16] Schafer C Doksum K A 2009 Selecting local models in multiple regression bymaximizing power Metrika 69 283ndash304
[17] Tibshirani R 1996 Regression shrinkage and selection via the lasso J RoyalStatist Soc B 58 267ndash288
[18] Wang L Shen X 2007 On l1-norm multi-class support vector machinesmethodology and theory Journal of the American Statistical Association 102 583ndash594
[19] Zhang H H Liu Y Wu Y Zhu J 2008 Variable selection for the multicate-gory svm via adaptive sup-norm regularization Electronic Journal of Statistics 2149ndash167
30
- Introduction
- Importance Scores for Univariate Classification
-
- Numerical Predictor Local Contingency Efficacy
- Categorical Predictor Contingency Efficacy
-
- Variable Selection for Classification Problems The CATCH Algorithm
-
- Numerical Predictors
- Numerical and Categorical Predictors
- The Empirical Importance Scores
- Adaptive Tube Distances
- The CATCH Algorithm
-
- Simulation Studies
-
- Variable Selection with Categorical and Numerical Predictors
- Improving The Performance of SVM and RF Classifiers
-
- CATCH Analysis of The ANN Data
- Properties of Local Contingency Efficacy
- Consistency Of CATCH
-
- Type I consistency
- Type II consistency
- Type I and II consistency
-
- Proof of Theorem 61
-
Thus d0 = exp(nb) with b = 13ε for any ε gt 0 is the largest d0 can be to have consistency
For instance if there are exactly d0 = exp(n
14
)irrelevant variables and we use the
threshold tn =radic
2n17 the probability of selecting any of the irrelevant variables tends
to zero at the ratenminus
14 exp
(n
14 minus n
27
)72 Type II consistency
We now assume that for each relevant variable $X_j$ there is a least favorable point $x^{(1)}$ such that the local efficacy $s^{(1)}_j \equiv \zeta(x^{(1)})$ is stochastically smaller than the importance score $s_j = M^{-1}\sum_{a=1}^{M}\zeta(x^{*}_a)$. That is, we assume that for each relevant variable $X_j$ there exist $n_1$ and $x^{(1)}$ such that for $n \ge n_1$,
\[
P\bigl(s^{(1)}_j \le t_n\bigr) \ge P(s_j \le t_n), \quad j \le d_1. \tag{7.11}
\]
The existence of such a least favorable $x^{(1)}$ is established by considering the probability limit of
\[
\arg\min\bigl\{\zeta_j(x^{*}_a) : x^{*}_a \in \{x^{*}_1, \ldots, x^{*}_M\}\bigr\}.
\]
We again assume that $R$ and $C$ are fixed, so that to show type II consistency it is enough to show that for each $(r, c)$ and $j \le d_1$,
\[
P\bigl(|T^{(j)}_{rc}| \le t_n\bigr) \to 0.
\]
This is because
\[
P\Bigl(\sqrt{\textstyle\sum_r\sum_c \bigl[T^{(j)}_{rc}\bigr]^2} \le t_n\Bigr)
\le P\Bigl(\min_{r,c}\bigl[T^{(j)}_{rc}\bigr]^2 \le t_n^2\Bigr)
\le RC \max_{r,c} P\bigl(|T^{(j)}_{rc}| \le t_n\bigr).
\]
By Theorem 6.2, $\sqrt{n}\bigl(s^{(1)}_j - \tau_j\bigr) \to_d N(0, \nu^2)$, $\tau_j > 0$. It follows that if $t_n$ and $d_1$ are bounded above as $n \to \infty$ and $\tau_j > 0$, then $P\bigl(s^{(1)}_j \le t_n\bigr) \to 0$. Thus type II consistency holds in this case. To examine the more interesting case where $t_n$ and $d_1$ are not bounded above and $\tau_j$ may tend to zero as $n \to \infty$, we will need to center the $T^{(j)}_{rc}$. Thus set
\[
\theta^{(j)}_n = \frac{\sqrt{n}\bigl(p^{(j)}_{rc} - p^{(j)}_r p^{(j)}_c\bigr)}{\sqrt{p^{(j)}_r p^{(j)}_c}}, \tag{7.12}
\]
\[
T^{(j)}_n = T^{(j)}_{rc} - \theta^{(j)}_n,
\]
where for simplicity the dependence of $T^{(j)}_n$ and $\theta^{(j)}_n$ on $r$ and $c$ is not indicated.
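The centering in (7.12) suggests reading $T^{(j)}_{rc}$ as the standardized cell residual of the chi-square table for variable $X_j$, with $\theta^{(j)}_n$ its population counterpart. Under that reading (the helper name `cell_statistics` is ours, not the paper's), a minimal sketch:

```python
import math

def cell_statistics(counts):
    """Standardized cell residuals T_rc = sqrt(n) (p_rc - p_r q_c) / sqrt(p_r q_c)
    for an R x C contingency table given as nested lists of counts.
    Their population counterpart is theta_n in (7.12)."""
    n = sum(sum(row) for row in counts)
    R, C = len(counts), len(counts[0])
    p = [[counts[r][c] / n for c in range(C)] for r in range(R)]
    p_r = [sum(p[r]) for r in range(R)]            # row marginals
    q_c = [sum(p[r][c] for r in range(R)) for c in range(C)]  # column marginals
    return [[math.sqrt(n) * (p[r][c] - p_r[r] * q_c[c]) / math.sqrt(p_r[r] * q_c[c])
             for c in range(C)] for r in range(R)]

# the root of the sum of squared residuals is the root chi-square statistic
# that the bound above compares to t_n
T = cell_statistics([[30, 10], [10, 30]])
root_chisq = math.sqrt(sum(t * t for row in T for t in row))
```

For a table with independent margins, all residuals are (numerically) zero, matching $\theta^{(j)}_n = 0$ for irrelevant variables.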
Lemma 7.1.
\[
P\Bigl(\min_{j \le d_1} |T^{(j)}_{rc}| \le t_n\Bigr) \le \sum_{j=1}^{d_1} P\Bigl(|T^{(j)}_n| \ge \min_{j \le d_1} |\theta^{(j)}_n| - t_n\Bigr).
\]
Proof. Since $|T^{(j)}_n + \theta^{(j)}_n| \ge |\theta^{(j)}_n| - |T^{(j)}_n| \ge \min_{j \le d_1}|\theta^{(j)}_n| - \max_{j \le d_1}|T^{(j)}_n|$ for every $j$,
\[
P\Bigl(\min_{j \le d_1} |T^{(j)}_{rc}| \le t_n\Bigr)
= P\Bigl(\min_{j \le d_1} |T^{(j)}_n + \theta^{(j)}_n| \le t_n\Bigr)
\le P\Bigl(\min_{j \le d_1} |\theta^{(j)}_n| - \max_{j \le d_1} |T^{(j)}_n| \le t_n\Bigr)
\]
\[
= P\Bigl(\max_{j \le d_1} |T^{(j)}_n| \ge \min_{j \le d_1} |\theta^{(j)}_n| - t_n\Bigr)
\le \sum_{j=1}^{d_1} P\Bigl(|T^{(j)}_n| \ge \min_{j \le d_1} |\theta^{(j)}_n| - t_n\Bigr). \qquad \square
\]
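The proof of Lemma 7.1 hinges on the elementary bound $\min_j |T_j + \theta_j| \ge \min_j |\theta_j| - \max_j |T_j|$. A quick numerical check of that inequality (illustrative only, random inputs):

```python
import random

random.seed(0)
# check: min_j |T_j + th_j| >= min_j |th_j| - max_j |T_j|
# for arbitrary real vectors T and th of equal length
for _ in range(1000):
    T = [random.uniform(-3, 3) for _ in range(5)]
    th = [random.uniform(-10, 10) for _ in range(5)]
    lhs = min(abs(t + h) for t, h in zip(T, th))
    rhs = min(abs(h) for h in th) - max(abs(t) for t in T)
    assert lhs >= rhs
```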
Set
\[
a_n = \min_{j \le d_1} \bigl|\theta^{(j)}_n\bigr| - t_n;
\]
then by large deviation theory, if $a_n = o\bigl(n^{1/6}\bigr)$,
\[
P\bigl(|T^{(j)}_n| \ge a_n\bigr) \propto \bigl(a_n\sqrt{2}\bigr)^{-1}\exp\{-a_n^2/2\}, \tag{7.13}
\]
where $\propto$ signifies asymptotic order as $n \to \infty$, $a_n \to \infty$. It follows that if (7.13) holds uniformly in $j$, then by Lemma 7.1,
\[
P\Bigl(\min_{j \le d_1} \bigl|T^{(j)}_{rc}\bigr| \le t_n\Bigr) \propto d_1 \bigl(a_n\sqrt{2}\bigr)^{-1}\exp\{-a_n^2/2\}.
\]
Thus type II consistency holds when $a_n = o\bigl(n^{1/6}\bigr)$ and
\[
d_1 \bigl(a_n\sqrt{2}\bigr)^{-1}\exp\{-a_n^2/2\} \to 0. \tag{7.14}
\]
In Section 7.1, $t_n$ is of order $n^r$, $0 < r < 1/6$. For this $t_n$, type II consistency holds if $a_n$ is of order $n^r$, $0 < r < 1/6$, and
\[
d_1 n^{-r}\exp\{-n^{2r}/2\} \to 0.
\]
If $a_n = O(n^k)$ with $k \ge 1/6$, then $P\bigl(|T^{(j)}_n| \ge a_n\bigr)$ tends to zero at a rate faster than in (7.13) and consistency is ensured. For instance, if all relevant variables $X_j$ have $p^{(j)}_{rc}$ fixed as $n \to \infty$, then $\min_{j \le d_1} \bigl|\theta^{(j)}_n\bigr|$ is of order $\sqrt{n}$ and $a_n$ is of order $n^{1/2} - t_n$.
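The order statement in (7.13) can be checked against the exact Gaussian tail. A small sketch: since $\propto$ absorbs constants, we compare the exact tail $P(|Z| \ge a) = \mathrm{erfc}(a/\sqrt{2})$ with the displayed term $(a\sqrt{2})^{-1}e^{-a^2/2}$; under this normalization their ratio settles at the constant $2/\sqrt{\pi}$, confirming agreement up to asymptotic order:

```python
import math

def exact_tail(a):
    """P(|Z| >= a) for standard normal Z, computed exactly via erfc."""
    return math.erfc(a / math.sqrt(2.0))

def approx_tail(a):
    """The order term (a*sqrt(2))^{-1} exp(-a^2/2) from (7.13); the
    proportionality constant is absorbed by the asymptotic-order relation."""
    return math.exp(-a * a / 2.0) / (a * math.sqrt(2.0))

# the ratio stabilizes (at 2/sqrt(pi)) as a grows, so the two tails
# agree up to asymptotic order, as (7.13) asserts
ratios = [exact_tail(a) / approx_tail(a) for a in (2.0, 4.0, 6.0, 8.0)]
```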
7.3 Type I and II consistency

We see from Sections 7.1 and 7.2 that consistency holds under a variety of conditions. One set of conditions is the following.

Theorem 7.1. Under the regularity conditions of Sections 7.1 and 7.2, if
\[
d_0 = c_0 \exp(n^b), \quad t_n = c_1 n^r, \quad \min_{j \le d_1} \theta^{(j)} - t_n = c_2 n^r, \quad d_1 = c_3 \exp(n^b)
\]
for positive constants $c_0, c_1, c_2, c_3$ and $0 < b/2 < r < 1/6$, then CATCH is consistent.
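The rate condition of Theorem 7.1 depends only on the exponents $b$ and $r$, not on the constants $c_0, \ldots, c_3$. A one-line helper (hypothetical, simply restating the condition) makes this concrete:

```python
def catch_rates_consistent(b, r):
    """Theorem 7.1 rate condition: with d0, d1 growing like exp(n**b) and
    threshold t_n ~ n**r, CATCH is consistent when 0 < b/2 < r < 1/6."""
    return 0.0 < b / 2.0 < r < 1.0 / 6.0

# exponentially many candidates, d0 = exp(n**0.2), with t_n ~ n**0.15
# satisfies the condition ...
ok = catch_rates_consistent(b=0.2, r=0.15)
# ... while a threshold growing too fast (r >= 1/6) does not
bad = catch_rates_consistent(b=0.2, r=0.2)
```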
Appendix: Proof of Theorem 6.1

(1) Let $N_h = \sum_{i=1}^{n} I\bigl(X^{(i)} \in N_h(x^{(0)})\bigr)$. Then $N_h \sim \mathrm{Bi}\bigl(n, \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx\bigr)$ and $N_h \to_{a.s.} \infty$ as $n \to \infty$. Write
\[
\hat\zeta(x^{(0)}, h) = n^{-1/2}\sqrt{\mathcal{X}^2(x^{(0)}, h)} = (N_h/n)^{1/2}\,(N_h)^{-1/2}\sqrt{\mathcal{X}^2(x^{(0)}, h)}.
\]
Since $N_h \sim \mathrm{Bi}\bigl(n, \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx\bigr)$, we have
\[
N_h/n \to \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx. \tag{7.15}
\]
By Lemma 6.1 and Table 1,
\[
(N_h)^{-1/2}\sqrt{\mathcal{X}^2(x^{(0)}, h)} \to \Bigl(\sum_{r=+,-}\sum_{c=1}^{C}\frac{(p_{rc} - p_r q_c)^2}{p_r q_c}\Bigr)^{1/2}, \tag{7.16}
\]
where
\[
p_{+c} = \Pr\bigl(X > x^{(0)}, Y = c \mid X \in N_h(x^{(0)})\bigr), \quad
p_{-c} = \Pr\bigl(X \le x^{(0)}, Y = c \mid X \in N_h(x^{(0)})\bigr)
\]
for $r = +, -$ and $c = 1, \ldots, C$, and
\[
p_r = \sum_{c=1}^{C} p_{rc}, \quad q_c = \sum_{r=+,-} p_{rc}.
\]
That is, $p_+ = \Pr\bigl(X > x^{(0)} \mid X \in N_h(x^{(0)})\bigr)$, $p_- = \Pr\bigl(X \le x^{(0)} \mid X \in N_h(x^{(0)})\bigr)$, and $q_c = \Pr\bigl(Y = c \mid X \in N_h(x^{(0)})\bigr)$.

By (7.15) and (7.16),
\[
\hat\zeta(x^{(0)}, h) \to \Bigl(\int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx\Bigr)^{1/2}\Bigl(\sum_{r=+,-}\sum_{c=1}^{C}\frac{(p_{rc} - p_r q_c)^2}{p_r q_c}\Bigr)^{1/2} \equiv \zeta(x^{(0)}, h).
\]

(2) Since $X$ and $Y$ are independent, $p_{rc} = p_r q_c$ for $r = +, -$ and $c = 1, \ldots, C$; then $\zeta(x^{(0)}, h) = 0$ for any $h$. Hence
\[
\zeta(x^{(0)}) = \sup_{h > 0}\,\zeta(x^{(0)}, h) = 0.
\]

(3) Recall that $p_c(x^{(0)}) = \Pr(Y = c \mid x^{(0)})$. Without loss of generality assume $p'_c(x^{(0)}) > 0$. Then there exists $h$ such that
\[
\Pr\bigl(Y = c \mid x^{(0)} < X < x^{(0)} + h\bigr) > \Pr\bigl(Y = c \mid x^{(0)} - h < X < x^{(0)}\bigr),
\]
which is equivalent to $p_{+c}/p_+ > p_{-c}/p_-$. This shows that $\zeta(x^{(0)}) > 0$, since by Lemma 6.1, $\zeta(x^{(0)}) = 0$ results in $p_{+c}/p_+ = p_{-c}/p_- = p_c$.
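The second factor in the limit (7.16) is what separates parts (2) and (3): it is a root chi-square divergence of the local $2 \times C$ table. A sketch (function name ours) computing it from joint probabilities shows it vanishes under independence and is positive when $p_{+c}/p_+ \ne p_{-c}/p_-$:

```python
import math

def association_factor(p):
    """Second factor in the limit zeta(x0, h): the root chi-square divergence
    sqrt( sum_{r,c} (p_rc - p_r q_c)^2 / (p_r q_c) ) for a 2 x C table of
    joint probabilities p[r][c] = P(side r, Y = c | X in N_h(x0))."""
    R, C = len(p), len(p[0])
    p_r = [sum(row) for row in p]                      # side marginals p_+, p_-
    q_c = [sum(p[r][c] for r in range(R)) for c in range(C)]  # class marginals
    return math.sqrt(sum((p[r][c] - p_r[r] * q_c[c]) ** 2 / (p_r[r] * q_c[c])
                         for r in range(R) for c in range(C)))

# part (2): under independence p_rc = p_r * q_c, so the factor is 0
indep = association_factor([[0.3, 0.2], [0.3, 0.2]])
# part (3): when p_{+c}/p_+ != p_{-c}/p_-, the factor is positive
dep = association_factor([[0.4, 0.1], [0.1, 0.4]])
```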
References
[1] Breiman L 2001 Random forests Machine Learning 45 5ndash32
[2] Breiman L Friedman J H Olshen R A Stone J 1984 Classification andregression trees Wadsworth Belmont
[3] Doksum K Tang S Tsui K-W 2008 Nonparametric variable selection Theearth algorithm Journal of the American Statistical Association 103(484) 1609ndash1620
[4] Doksum K A Samarov A 1995 Nonparametric estimation of global functionalsand a measure of explanatory power of covariates in regression Annals of Statistics23 1443ndash1473
29
[5] Doksum K A Schafer C 2006 Powerful choices Tuning parameter selectionbased on power Frontiers in Statistics Dedicated to Peter Bickel Editors J Fanand H L Koul Imperial College Press 113ndash141
[6] Friedman J 1991 Multivariate adaptive regression splines The annals of statis-tics 1ndash67
[7] Gao J Gijbels I 2008 Bandwidth selection in nonparametric kernel testingJournal of the American Statistical Association 103 (484) 1584ndash1594
[8] Hall P 1992 The Bootstrap and Edgeworth Expansions Springer New York
[9] Huang L-S Chen J 2008 Analysis of variance coefficient of determinationand f-test for local polynomial regression Annals of Statistics 36 2085ndash2109
[10] Jing B-Y Shao Q-M Wang Q 2003 Self-normalized cramer-type large devi-ations for independent random variables Annals of Probability 31 2167ndash2215
[11] Lin D Y Zeng D 2011 Correcting for population stratification in genomewideassociation studies Journal of the American Statistical Association 106 997ndash1008
[12] Loh W-Y 2009 Improve the precision of classification trees The Annals ofApplied Statistics 3 1710ndash1737
[13] Price A Patterson N Plenge R Weinblatt M Shadick N Reich D 2006Principal components analysis corrects for stratification in genome-wide associationstudies Nature genetics 38 (8) 904ndash909
[14] Qiao Z Zhou L Huang J 2008 Linear discriminant analysis for high dimen-sional Low Sample Size Data special issue of World Congress on Engineering2008
[15] Quinlan J R 1987 Simplifying decision trees International Journal of Man-Machine Studies 27(3) 221ndash234
[16] Schafer C Doksum K A 2009 Selecting local models in multiple regression bymaximizing power Metrika 69 283ndash304
[17] Tibshirani R 1996 Regression shrinkage and selection via the lasso J RoyalStatist Soc B 58 267ndash288
[18] Wang L Shen X 2007 On l1-norm multi-class support vector machinesmethodology and theory Journal of the American Statistical Association 102 583ndash594
[19] Zhang H H Liu Y Wu Y Zhu J 2008 Variable selection for the multicate-gory svm via adaptive sup-norm regularization Electronic Journal of Statistics 2149ndash167
30
- Introduction
- Importance Scores for Univariate Classification
-
- Numerical Predictor Local Contingency Efficacy
- Categorical Predictor Contingency Efficacy
-
- Variable Selection for Classification Problems The CATCH Algorithm
-
- Numerical Predictors
- Numerical and Categorical Predictors
- The Empirical Importance Scores
- Adaptive Tube Distances
- The CATCH Algorithm
-
- Simulation Studies
-
- Variable Selection with Categorical and Numerical Predictors
- Improving The Performance of SVM and RF Classifiers
-
- CATCH Analysis of The ANN Data
- Properties of Local Contingency Efficacy
- Consistency Of CATCH
-
- Type I consistency
- Type II consistency
- Type I and II consistency
-
- Proof of Theorem 61
-