
Nonparametric Variable Selection and Classification: The CATCH Algorithm

Shijie Tang (a), Lisha Chen (b,*), Kam-Wah Tsui (c,1), Kjell Doksum (d,1)

(a) Boehringer Ingelheim Pharmaceuticals, Danbury, CT
(b) Department of Statistics, Yale University, New Haven, CT
(c) Department of Statistics, University of Wisconsin, Madison, WI
(d) Department of Statistics, University of Wisconsin, Madison, WI, and Department of Statistics, Columbia University, New York City, NY

* Corresponding and joint first author.
1 Partially supported by NSF grant DMS-0604931.

Abstract

In a nonparametric framework we consider the problem of classifying a categorical response Y whose distribution depends on a vector of predictors X, where the coordinates Xj of X may be continuous, discrete or categorical. To select the variables to be used for classification, we construct an algorithm which for each variable Xj computes an importance score sj to measure the strength of association of Xj with Y. The algorithm deletes Xj if sj falls below a certain threshold. It is shown in Monte Carlo simulations that the algorithm has a high probability of only selecting variables associated with Y. Moreover, when this variable selection rule is used for dimension reduction prior to applying classification procedures, it improves the performance of these procedures. Our approach for computing importance scores is based on root chi-square type statistics computed for randomly selected regions (tubes) of the sample space. The size and shape of the regions are adjusted iteratively and adaptively using the data to enhance the ability of the importance score to detect local relationships between the response and the predictors. These local scores are then averaged over the tubes to form a global importance score sj for variable Xj. When confounding and spurious associations are issues, the nonparametric importance score for variable Xj is computed conditionally by using tubes to restrict the other variables. We call this variable selection procedure CATCH (Categorical Adaptive Tube Covariate Hunting). We establish asymptotic properties, including consistency.

Keywords: Adaptive variable selection; Importance score; Chi-square statistic


1 Introduction

We consider classification problems with a large number of predictors that can be numerical or categorical. We are interested in the case in which many of the predictors may be irrelevant for classification; thus the variables useful for classification need to be selected. In genomic research the classes considered are often cases and controls, so variable selection is the same as the important problem of deciding which variables are associated with disease. With modern technology, data of large dimension arise in many scientific disciplines, including biology, genomics, astronomy, economics and computer science. In particular, the number of variables can be greater than the sample size, which poses a considerable challenge to the statistical analysis. If only some of the variables are useful for classification, over-fitting is a problem for methods that use all variables, and therefore variable selection becomes critical for statistical analysis.

Methods for variable selection in the classification context include methods that incorporate variable selection as part of the classification procedure. This class includes random forest (Breiman [1]), CART (Breiman et al. [2]) and GUIDE (Loh [12]). Random forest assigns an importance score to each of the predictors, and one can drop those variables whose importance scores fall below a certain threshold. CART, after pruning, will choose a subset of optimal splitting variables as the most significant variables. GUIDE is a tree-based method particularly powerful in unbiased variable selection and interaction detection. Other research includes methods that incorporate variable selection by applying shrinkage with L1-norm constraints on the parameters (Tibshirani [17]), which generates sparse vectors of parameter estimates. Wang and Shen [18] and Zhang et al. [19] incorporate variable selection with classification based on support vector machine (SVM) methods. Qiao et al. [14] consider variable selection based on linear discriminant analysis. One limitation of these variable selection methods is that they are not nonparametric, and therefore they may not work well when there is a complex relationship between the prediction variables and the variables to be classified.

We propose a nonparametric method for variable selection called Categorical Adaptive Tube Covariate Hunting (CATCH) that performs well as a variable selection algorithm and can be used to improve the performance of available classification procedures. The idea is to construct a nonparametric measure of the strength of the relationship between each predictor and the categorical response, and to retain those predictors whose relationship to the response is above a certain threshold. The nonparametric measure of importance for each predictor is obtained by first measuring the importance of the predictor using local information and then combining such local importance scores into an overall importance score. The local importance scores are based on root chi-square type statistics for local contingency tables.

In addition to the aforementioned nonparametric feature, the CATCH procedure has another property: it measures the importance of each variable conditioning on all other variables, thereby reducing the confounding that may lead to selection of variables spuriously related to the categorical variable Y. This is accomplished by constraining all predictors but the one we are focusing on. For the case where the number of predictors is huge (d >> n), this can be done by restricting principal components for some types of studies; see Remark 3.4.

Our approach to nonparametric variable selection is related to the EARTH algorithm (Doksum et al. [3]), which applies to nonparametric regression problems with a continuous response variable and continuous predictors. It measures the conditional association between a predictor $X_i$ and the response variable, conditional on all the other predictors $\{X_j : j \ne i\}$, by constraining the $X_j$, $j \ne i$, variables to regions called tubes. The local importance score is based on a local linear or local polynomial regression. The contribution of the current paper is to develop variable selection methods for the classification problem, with a categorical response variable and predictors that can be continuous, discrete or categorical.

Our CATCH algorithm can be used as a variable selection step before classification. Any classification method, preferably nonparametric, can be used after we single out the important variables. In particular, SVM and random forest are statistical classification methods that can be used with CATCH. We show in simulation studies that when the true model is highly nonlinear and there are many irrelevant predictors, using CATCH to screen out irrelevant or weak predictors greatly improves the performance of SVM and random forest.

The CATCH algorithm works with general classification data with both numerical and categorical predictors. Moreover, the CATCH algorithm is robust when the predictors interact with each other, especially when numerical and categorical predictors interact, e.g., in a hierarchical interactive association between the numerical and categorical predictors. We present a model with a reasonable sample size and predictor dimension for which random forest has a relatively low chance of detecting the significance of the categorical predictor. The CART and GUIDE algorithms, with pruning, can find the correct splitting variables, including the categorical one, but yield relatively high classification errors. Moreover, CART and GUIDE have trouble choosing the splitting predictors in the correct order. The CATCH algorithm achieves higher accuracy in the task of variable selection. This is due to the importance scores being conditional, as illustrated in the simulation example in Section 4.1.

The rest of the paper proceeds as follows. In Section 2 we introduce importance scores for univariate predictors. In Sections 3.1-3.4 we extend these scores to multivariate predictors, and in Section 3.5 we introduce the CATCH algorithm. In Section 4 we use simulation studies to show the effectiveness of the CATCH algorithm. A real example is provided in Section 5. In Section 6 we provide some theoretical properties to justify the definition of local contingency efficacy, and in Section 7 we show asymptotic consistency of the algorithm.


2 Importance Scores for Univariate Classification

Let $(X^{(i)}, Y^{(i)})$, $i = 1, \ldots, n$, be independent and identically distributed (i.i.d.) as $(X, Y) \sim P$. We first consider the case of a univariate $X$, then use the methods constructed for the univariate case to construct the multivariate versions.

2.1 Numerical Predictor: Local Contingency Efficacy

Consider the classification problem with a numerical predictor. Let

$$\Pr(Y = c \mid x) = p_c(x), \quad c = 1, \ldots, C, \qquad \sum_{c=1}^{C} p_c(x) \equiv 1. \tag{2.1}$$

We introduce a measure of how strongly $X$ is associated with $Y$ in a neighborhood of a "fixed" point $x^{(0)}$. Points $x^{(0)}$ will be selected at random by our algorithm; local importance scores will be computed for each point, then averaged. For a fixed $h > 0$, define $N_h(x^{(0)}) = \{x : 0 < |x - x^{(0)}| \le h\}$ to be a neighborhood of $x^{(0)}$, and let $n(x^{(0)}, h) = \sum_{i=1}^{n} I(X^{(i)} \in N_h(x^{(0)}))$ denote the number of data points in the neighborhood. For $c = 1, \ldots, C$, let $n_c^-(x^{(0)}, h)$ be the number of observations $(X^{(i)}, Y^{(i)})$ satisfying $x^{(0)} - h \le X^{(i)} < x^{(0)}$ and $Y^{(i)} = c$; let $n_c^+(x^{(0)}, h)$ be the number of $(X^{(i)}, Y^{(i)})$ satisfying $x^{(0)} < X^{(i)} \le x^{(0)} + h$ and $Y^{(i)} = c$. Let $n_c(x^{(0)}, h) = n_c^+(x^{(0)}, h) + n_c^-(x^{(0)}, h)$, $n^-(x^{(0)}, h) = \sum_{c=1}^{C} n_c^-(x^{(0)}, h)$, and $n^+(x^{(0)}, h) = \sum_{c=1}^{C} n_c^+(x^{(0)}, h)$. Table 1 gives the resulting local contingency table.

Table 1: Local contingency table

                                  Y = 1              Y = 2              ...   Y = C              Total
x^(0) - h <= X < x^(0)            n_1^-(x^(0), h)    n_2^-(x^(0), h)    ...   n_C^-(x^(0), h)    n^-(x^(0), h)
x^(0) < X <= x^(0) + h            n_1^+(x^(0), h)    n_2^+(x^(0), h)    ...   n_C^+(x^(0), h)    n^+(x^(0), h)
Total                             n_1(x^(0), h)      n_2(x^(0), h)      ...   n_C(x^(0), h)      n(x^(0), h)

To measure how strongly $Y$ relates to local restrictions on $X$, we consider the chi-square statistic

$$\mathcal{X}^2(x^{(0)}, h) = \sum_{c=1}^{C} \left( \frac{\big(n_c^-(x^{(0)}, h) - E_c^-(x^{(0)}, h)\big)^2}{E_c^-(x^{(0)}, h)} + \frac{\big(n_c^+(x^{(0)}, h) - E_c^+(x^{(0)}, h)\big)^2}{E_c^+(x^{(0)}, h)} \right), \tag{2.2}$$

where $0/0 \equiv 0$ and

$$E_c^-(x^{(0)}, h) = \frac{n^-(x^{(0)}, h)\, n_c(x^{(0)}, h)}{n(x^{(0)}, h)}, \qquad E_c^+(x^{(0)}, h) = \frac{n^+(x^{(0)}, h)\, n_c(x^{(0)}, h)}{n(x^{(0)}, h)}. \tag{2.3}$$

Due to the local restriction on $X$, the local contingency table might have some zero cells. However, this is not an issue for the chi-square statistic: it can detect local dependence between $Y$ and $X$ as long as there exist observations from at least two categories of $Y$ in the neighborhood of $x^{(0)}$. When all observations belong to the same category of $Y$, intuitively no dependence can be detected; in this case the chi-square statistic equals zero, which coincides with our intuition.

We call the neighborhood $N_h(x^{(0)})$ a section and maximize $\mathcal{X}^2(x^{(0)}, h)$ in (2.2) with respect to the section size $h$. Let plim denote limit in probability. Our local measure of association is $\zeta(x^{(0)})$, as given in

Definition 1 (Local Contingency Efficacy, for a numerical predictor). The local contingency section efficacy and the Local Contingency Efficacy (LCE) of $Y$ on a numerical predictor $X$ at the point $x^{(0)}$ are

$$\zeta(x^{(0)}, h) = \operatorname*{plim}_{n \to \infty} n^{-1/2} \sqrt{\mathcal{X}^2(x^{(0)}, h)}, \qquad \zeta(x^{(0)}) = \sup_{h > 0} \zeta(x^{(0)}, h). \tag{2.4}$$

If $p_c(x)$ is constant in $x$ for all $c \in \{1, \ldots, C\}$ for $x$ near $x^{(0)}$, then as $n \to \infty$, $\mathcal{X}^2(x^{(0)}, h)$ converges in distribution to a $\chi^2_{C-1}$ variable, and in this case $\zeta(x^{(0)}, h) = 0$. On the other hand, if $p_c(x)$ is not constant near $x^{(0)}$, then $\zeta(x^{(0)}, h)$ determines the asymptotic power for Pitman alternatives of the test based on $\mathcal{X}^2(x^{(0)}, h)$ for testing whether $X$ is locally independent of $Y$, and is called the efficacy of this test. Moreover, $\zeta^2(x^{(0)}, h)$ is the asymptotic noncentrality parameter of the statistic $n^{-1}\mathcal{X}^2(x^{(0)}, h)$.

The estimators of $\zeta(x^{(0)}, h)$ and $\zeta(x^{(0)})$ are

$$\hat\zeta(x^{(0)}, h) = n^{-1/2} \sqrt{\mathcal{X}^2(x^{(0)}, h)}, \qquad \hat\zeta(x^{(0)}) = \sup_{h > 0} \hat\zeta(x^{(0)}, h). \tag{2.5}$$

We select the section size $h$ as

$$\hat h = \arg\max\{\hat\zeta(x^{(0)}, h) : h \in \{h_1, \ldots, h_g\}\}$$

for some grid $h_1, \ldots, h_g$. This corresponds to selecting $h$ by maximizing estimated power. See Doksum and Schafer [5], Gao and Gijbels [7], Doksum et al. [3], and Schafer and Doksum [16].
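As a concrete illustration of (2.2), (2.3) and (2.5), the following Python sketch computes the estimated local contingency efficacy of a numerical predictor at one center point. It is a minimal sketch for exposition, not the authors' code; the function names and the grid argument are our own.

```python
import numpy as np

def local_chi2(x, y, x0, h, classes):
    """Chi-square statistic (2.2) of the local 2 x C table centered at x0 with half-width h."""
    left = (x >= x0 - h) & (x < x0)                # x0 - h <= X < x0
    right = (x > x0) & (x <= x0 + h)               # x0 < X <= x0 + h
    n_minus = np.array([np.sum(left & (y == c)) for c in classes], dtype=float)
    n_plus = np.array([np.sum(right & (y == c)) for c in classes], dtype=float)
    n_c = n_minus + n_plus
    n_loc = n_c.sum()
    if n_loc == 0:
        return 0.0
    e_minus = n_minus.sum() * n_c / n_loc          # expected counts, formula (2.3)
    e_plus = n_plus.sum() * n_c / n_loc
    chi2 = 0.0
    for obs, exp in ((n_minus, e_minus), (n_plus, e_plus)):
        mask = exp > 0                             # convention 0/0 = 0
        chi2 += np.sum((obs[mask] - exp[mask]) ** 2 / exp[mask])
    return chi2

def local_contingency_efficacy(x, y, x0, h_grid):
    """Estimated LCE (2.5): maximize n^{-1/2} sqrt(X^2(x0, h)) over a grid of section sizes."""
    classes = np.unique(y)
    n = len(y)
    return max(np.sqrt(local_chi2(x, y, x0, h, classes) / n) for h in h_grid)
```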


2.2 Categorical Predictor: Contingency Efficacy

Let $(X^{(i)}, Y^{(i)})$, $i = 1, \ldots, n$, be as before, except that $X \in \{1, \ldots, C'\}$ is a categorical predictor. Let $n(c, c')$ be the number of observations satisfying $Y = c$ and $X = c'$, and let

$$\mathcal{X}^2 = \sum_{c=1}^{C} \sum_{c'=1}^{C'} \frac{\big(n(c, c') - E(c, c')\big)^2}{E(c, c')}, \tag{2.6}$$

where

$$E(c, c') = \frac{\left(\sum_{c=1}^{C} n(c, c')\right)\left(\sum_{c'=1}^{C'} n(c, c')\right)}{n}. \tag{2.7}$$

Definition 2 (Contingency Efficacy, for a categorical predictor). The Contingency Efficacy (CE) of $Y$ on a categorical predictor $X$ is

$$\zeta = \operatorname*{plim}_{n \to \infty} n^{-1/2} \sqrt{\mathcal{X}^2}. \tag{2.8}$$

Under the null hypothesis that $Y$ and $X$ are independent, $\mathcal{X}^2$ converges in distribution to a $\chi^2_{(C-1)(C'-1)}$ variable; in this case $\zeta = 0$. In general, $\zeta^2$ is the asymptotic noncentrality parameter of the statistic $n^{-1}\mathcal{X}^2$. The contingency efficacy $\zeta$ determines the asymptotic power of the test based on $\mathcal{X}^2$ for testing whether $Y$ and $X$ are independent. We use $\zeta$ as an importance score that measures how strongly $Y$ depends on $X$. The estimator of (2.8) is

$$\hat\zeta = n^{-1/2} \sqrt{\mathcal{X}^2}. \tag{2.9}$$
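For a categorical predictor, (2.9) is simply a rescaled Pearson chi-square statistic of the $Y$-by-$X$ contingency table. A minimal sketch (our own helper, in the same spirit as the snippet above):

```python
import numpy as np

def contingency_efficacy(x, y):
    """Estimated CE (2.9): n^{-1/2} times the root of the Pearson chi-square of the Y-by-X table."""
    y_levels, x_levels = np.unique(y), np.unique(x)
    counts = np.array([[np.sum((y == c) & (x == cp)) for cp in x_levels] for c in y_levels],
                      dtype=float)
    n = counts.sum()
    expected = np.outer(counts.sum(axis=1), counts.sum(axis=0)) / n       # (2.7)
    mask = expected > 0
    chi2 = np.sum((counts[mask] - expected[mask]) ** 2 / expected[mask])  # (2.6)
    return np.sqrt(chi2 / n)
```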

3 Variable Selection for Classification Problems: The CATCH Algorithm

Now we consider a multivariate $X = (X_1, \ldots, X_d)$, $d > 1$. To examine whether a predictor $X_l$ is related to $Y$ in the presence of all other variables, we calculate efficacies conditional on these variables, as well as unconditional or marginal efficacies.

3.1 Numerical Predictors

To examine the effect of $X_l$ in the local region of $x^{(0)} = (x_1^{(0)}, \ldots, x_d^{(0)})$, we build a "tube" around the center point $x^{(0)}$ in which data points are within a certain radius $\delta$ of $x^{(0)}$ with respect to all variables except $X_l$. Define $X_{-l} = (X_1, \ldots, X_{l-1}, X_{l+1}, \ldots, X_d)^T$ to be the complementary vector to $X_l$. Let $D(\cdot, \cdot)$ be a distance in $\mathbb{R}^{d-1}$, which we will define later.

Definition 3. A tube $T_l$ of size $\delta$ ($\delta > 0$) for the variable $X_l$ at $x^{(0)}$ is the set

$$T_l \equiv T_l(x^{(0)}, \delta, D) \equiv \{x : D(x_{-l}, x^{(0)}_{-l}) \le \delta\}. \tag{3.1}$$


We adjust the size of the tube so that the number of points in the tube is a preassigned number $k$. We find that $k \ge 50$ is in general a good choice. Note that $k = n$ corresponds to unconditional or marginal nonparametric regression.

Within the tube $T_l(x^{(0)}, \delta, D)$, $X_l$ is unconstrained. To measure the dependence of $Y$ on $X_l$ locally, we consider neighborhoods of $x^{(0)}$ along the direction of $X_l$ within the tube. Let

$$N_{l,h,\delta,D}(x^{(0)}) = \{x = (x_1, \ldots, x_d)^T \in \mathbb{R}^d : x \in T_l(x^{(0)}, \delta, D),\ |x_l - x_l^{(0)}| \le h\} \tag{3.2}$$

be a section of the tube $T_l(x^{(0)}, \delta, D)$. To measure how strongly $X_l$ relates to $Y$ in $N_{l,h,\delta,D}(x^{(0)})$, we consider

$$\mathcal{X}^2_{l,\delta,D}(x^{(0)}, h), \tag{3.3}$$

where (3.3) is defined by (2.2) using only the observations in the tube $T_l(x^{(0)}, \delta, D)$. Analogous to the definitions of local contingency efficacy (2.4), we define

Definition 4. The local contingency tube section efficacy and the local contingency tube efficacy of $Y$ on $X_l$ at the point $x^{(0)}$, and their estimates, are

$$\zeta(x^{(0)}, h, l) = \operatorname*{plim}_{n \to \infty} n^{-1/2} \sqrt{\mathcal{X}^2_{l,\delta,D}(x^{(0)}, h)}, \qquad \zeta_l(x^{(0)}) = \sup_{h > 0} \zeta(x^{(0)}, h, l), \tag{3.4}$$

$$\hat\zeta(x^{(0)}, h, l) = n^{-1/2} \sqrt{\mathcal{X}^2_{l,\delta,D}(x^{(0)}, h)}, \qquad \hat\zeta_l(x^{(0)}) = \sup_{h > 0} \hat\zeta(x^{(0)}, h, l). \tag{3.5}$$

For an illustration of local contingency tube efficacy, consider $Y \in \{1, 2\}$ and suppose $\operatorname{logit} P(Y = 2 \mid x) = (x - \tfrac{1}{2})^2$, $x \in [0, 1]$. Then the local efficacy with $h$ small will be large, while if $h \ge 1$ the efficacy will be zero.
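Putting the pieces together, a tube-restricted local efficacy simply applies the local efficacy estimate to the observations that fall in the tube. A rough sketch, reusing local_contingency_efficacy from the Section 2.1 snippet; the distance function and the tube size m are our own assumptions in the spirit of Section 3.4 and step (2b) of the algorithm:

```python
import numpy as np

def tube_efficacy_numerical(X, y, l, center, dist, m, h_grid):
    """Estimated local contingency tube efficacy (3.5) of Y on the numerical X_l at `center`.

    `dist(u, v)` is a distance on the coordinates other than l; the tube consists of the m
    observations closest to the center in that distance."""
    others = [j for j in range(X.shape[1]) if j != l]
    d_to_center = np.array([dist(row[others], center[others]) for row in X])
    in_tube = np.argsort(d_to_center)[:m]          # the m closest points define the tube
    return local_contingency_efficacy(X[in_tube, l], y[in_tube], center[l], h_grid)
```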

3.2 Numerical and Categorical Predictors

If $X_l$ is categorical but some of the other predictors are numerical, we construct tubes $T_l(x^{(0)}, \delta, D)$ based on the numerical predictors, leaving the categorical variables unconstrained, and we use the statistic defined by (2.6) to measure the strength of the association between $Y$ and $X_l$ in the tube $T_l(x^{(0)}, \delta, D)$, using only the observations in the tube. We denote this version of (2.6) by

$$\mathcal{X}^2_{l,\delta,D}(x^{(0)}). \tag{3.6}$$

Analogous to the definition of contingency efficacy given in (2.8), we define

Definition 5. The local contingency tube efficacy of $Y$ on $X_l$ at $x^{(0)}$ and its estimate are

$$\zeta_l(x^{(0)}) = \operatorname*{plim}_{n \to \infty} n^{-1/2} \sqrt{\mathcal{X}^2_{l,\delta,D}(x^{(0)})}, \qquad \hat\zeta_l(x^{(0)}) = n^{-1/2} \sqrt{\mathcal{X}^2_{l,\delta,D}(x^{(0)})}. \tag{3.7}$$


3.3 The Empirical Importance Scores

The empirical efficacies (3.5) and (3.7) measure how strongly $Y$ depends on $X_l$ near one point $x^{(0)}$. To measure the overall dependency of $Y$ on $X_l$, we randomly sample $M$ bootstrap points $x^*_a$, $a = 1, \ldots, M$, with replacement from $\{x^{(i)}, 1 \le i \le n\}$ and calculate the average of the local tube efficacy estimates at these points.

Definition 6. The CATCH empirical importance score for variable $l$ is

$$s_l = M^{-1} \sum_{a=1}^{M} \hat\zeta_l(x^*_a), \tag{3.8}$$

where $\hat\zeta_l(x^*_a)$ is defined in (3.5) and (3.7) for numerical and categorical predictors, respectively.
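In code, the importance score (3.8) is just an average of tube efficacies over randomly chosen centers. A brief sketch under the same assumptions as above, reusing tube_efficacy_numerical:

```python
import numpy as np

def importance_score(X, y, l, dist, m, h_grid, M, rng):
    """CATCH empirical importance score (3.8): average of M tube efficacies at bootstrap centers."""
    centers = X[rng.integers(0, len(X), size=M)]   # bootstrap tube centers
    return float(np.mean([tube_efficacy_numerical(X, y, l, c, dist, m, h_grid) for c in centers]))
```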

3.4 Adaptive Tube Distances

Doksum et al. [3] used an example to demonstrate that the distance $D$ needs to be adaptive to keep variables strongly associated with $Y$ from obscuring the effect of weaker variables. Suppose $Y$ depends on three variables $X_1$, $X_2$ and $X_3$ among the larger set of predictors. The strength of the dependency ranges from weak to strong and will be estimated by the empirical importance scores. As we examine the effect of $X_2$ on $Y$, we want to define the tube so that there is relatively less variation in $X_3$ than in $X_1$ in the tube, because $X_3$ is more likely to obscure the effect of $X_2$ than $X_1$. To this end we introduce a tube distance for the variable $X_l$ of the form

$$D_l(x_{-l}, x^{(0)}_{-l}) = \sum_{j=1}^{d} w_j |x_j - x_j^{(0)}| I(j \ne l), \tag{3.9}$$

where the weights $w_j$ adjust the contribution of strong variables to the distance so that the strong variables are more constrained.

The weights $w_j$ are proportional to $s_j - s'_j$ and are determined iteratively and adaptively as follows. In the first iteration, set $s_j = s_j^{(1)} = 1$ in (3.9) and compute the importance score $s_j^{(2)}$ using (3.8). For the second iteration we use the weights based on $s_j = s_j^{(2)}$ and the distance

$$D_l(x_{-l}, x^{(0)}_{-l}) = \frac{\sum_{j=1}^{d} (s_j - s'_j)_+ \, |x_j - x_j^{(0)}|\, I(j \ne l)}{\sum_{j=1}^{d} (s_j - s'_j)_+ \, I(j \ne l)}, \tag{3.10}$$

where $0/0 \equiv 1$ and $s'_j$ is a threshold value for $s_j$ under the assumption of no conditional association of $X_j$ with $Y$, computed by a simple Monte Carlo technique. This adjustment is important because some predictors by definition have larger importance scores than others even when they are all independent of $Y$; for example, a categorical predictor with more levels has a larger importance score than one with fewer levels. Therefore the unadjusted scores $s_j$ cannot accurately quantify the relative importance of the predictors in the tube distance calculation. Next we use (3.8) to produce $s_j = s_j^{(3)}$, use $s_j^{(3)}$ in the next iteration, and so on.

In the definition above we need to define $|x_j - x_j^{(0)}|$ appropriately for a categorical variable $X_j$. Because $X_j$ is categorical, we assume that $|x_j - x_j^{(0)}|$ is constant when $x_j$ and $x_j^{(0)}$ take any pair of different categories. One approach is to define $|x_j - x_j^{(0)}| = \infty \cdot I(x_j \ne x_j^{(0)})$, in which case all points in the tube are given the same $X_j$ value as the tube center. This approach is very sensible, as it strictly implements the idea of conditioning on $X_j$ when evaluating the effect of another predictor $X_l$, so that $X_j$ does not contribute to any variation in $Y$. The problem with this definition, though, is that when the number of categories of $X_j$ is relatively large compared to the sample size, there will not be enough points in the tube if we restrict $X_j$ to a single category. In such a case we do not restrict $X_j$, which means that $X_j$ can take any category in the tube, and we define $|x_j - x_j^{(0)}| = k_0\, I(x_j \ne x_j^{(0)})$, where $k_0$ is a normalizing constant. The value of $k_0$ is chosen to achieve a balance between numerical variables and categorical variables. Suppose that $X_1$ is a numerical variable with standard deviation one and $X_2$ is a categorical variable. For $X_2$ we set $k_0 \equiv E(|X_1^{(1)} - X_1^{(2)}|)$, where $X_1^{(1)}$ and $X_1^{(2)}$ are two independent realizations of $X_1$. Note that $k_0 = \sqrt{2/\pi}$ if $X_1$ follows a standard normal distribution. Also, $k_0$ can be based on the empirical distributions of the numerical variables. Based on the preceding discussion of the choice of $k_0$, we use $k_0 = \infty$ in the simulations and $k_0 = \sqrt{2/\pi}$ in the real data example.

Remark 3.1 (choice of $k_0$). $k_0 = \sqrt{2/\pi}$ is used unless the sample size is large enough in the sense explained below. Suppose $k_0 = \infty$; then $D_l(x_{-l}, x^{(0)}_{-l})$ is finite only for those observations with $x_j = x_j^{(0)}$. When there are multiple categorical variables, $D_l$ is finite only for those points, denoted by the set $\Omega^{(0)}$, that satisfy $x_j = x_j^{(0)}$ for all categorical $x_j$. When there are many categorical variables, or when the total sample size is small, the size of this set can be very small or close to zero, in which case the tube cannot be well defined; therefore $k_0 = \sqrt{2/\pi}$ is used, as in the real data example. On the other hand, when the size of $\Omega^{(0)}$ is larger than the specified tube size, we can use $k_0 = \infty$, as in the simulation examples.
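The adaptive distance (3.10), together with the categorical convention just described, can be sketched as follows (our own illustration; `is_cat` flags categorical coordinates, and `k0` is either a finite constant such as sqrt(2/pi) or np.inf):

```python
import numpy as np

def adaptive_tube_distance(x, center, l, s, s_thresh, is_cat, k0=np.sqrt(2 / np.pi)):
    """Weighted tube distance (3.10) between x and the tube center, skipping coordinate l."""
    num = den = 0.0
    for j in range(len(x)):
        if j == l:
            continue
        w = max(s[j] - s_thresh[j], 0.0)             # weight (s_j - s'_j)_+
        if is_cat[j]:
            diff = k0 if x[j] != center[j] else 0.0  # |x_j - x_j^(0)| = k0 * I(x_j != x_j^(0))
        else:
            diff = abs(x[j] - center[j])
        num += w * diff
        den += w
    return num / den if den > 0 else 1.0             # convention 0/0 = 1
```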

3.5 The CATCH Algorithm

The variable selection algorithm for a classification problem is based on importance scores and an iteration procedure.

The Algorithm

(0) Standardizing. Standardize all the numerical predictors to have sample mean 0 and sample standard deviation 1 using linear transformations.

(1) Initializing Importance Scores. Let $S = (s_1, \ldots, s_d)$ be importance scores for the predictors. Let $S' = (s'_1, \ldots, s'_d)$ be such that $s'_l$ is a threshold importance score corresponding to the model where $X_l$ is independent of $Y$ given $X_{-l}$. Initialize $S$ to be $(1, \ldots, 1)$ and $S'$ to be $(0, \ldots, 0)$. Let $M$ be the number of tube centers and $m$ the number of observations within each tube.

(2) Tube Hunting Loop. For $l = 1, \ldots, d$, do (a), (b), (c) and (d) below.

(a) Selecting tube centers. For $0 < b \le 1$, set $M = [n^b]$, where $[\,\cdot\,]$ is the greatest integer function, and randomly sample $M$ bootstrap points $X^*_1, \ldots, X^*_M$ with replacement from the $n$ observations. Set $k = 1$ and do (b).

(b) Constructing tubes. For $0 < a < 1$, set $m = [n^a]$. Using the tube distance $D_l$ defined in (3.10), select the tube size $\delta$ so that there are exactly $m \ge 10$ observations inside the tube for $X_l$ at $X^*_k$.

(c) Calculating efficacy. If $X_l$ is a numerical variable, let $\hat\zeta_l(X^*_k)$ be the estimated local contingency tube efficacy of $Y$ on $X_l$ at $X^*_k$, as defined in (3.5). If $X_l$ is a categorical variable, let $\hat\zeta_l(X^*_k)$ be the estimated contingency tube efficacy of $Y$ on $X_l$ at $X^*_k$, as defined in (3.7).

(d) Thresholding. Let $Y^*$ denote a random variable which is identically distributed as $Y$ but independent of $X$. Let $(Y_0^{(1)}, \ldots, Y_0^{(n)})$ be a random permutation of $(Y^{(1)}, \ldots, Y^{(n)})$; then $Y_0 \equiv (Y_0^{(1)}, \ldots, Y_0^{(n)})$ can be viewed as a realization of $Y^*$. Let $\hat\zeta^*_l(X^*_k)$ be the estimated local contingency tube efficacy of $Y_0$ on $X_l$ at $X^*_k$, calculated as in (c) with $Y_0$ in place of $Y$.

If $k < M$, set $k = k + 1$ and go to (b); otherwise go to (e).

(e) Updating importance scores. Set

$$s_l^{\text{new}} = \frac{1}{M} \sum_{k=1}^{M} \hat\zeta_l(X^*_k), \qquad s'^{\,\text{new}}_l = \frac{1}{M} \sum_{k=1}^{M} \hat\zeta^*_l(X^*_k).$$

Let $S^{\text{new}} = (s_1^{\text{new}}, \ldots, s_d^{\text{new}})$ and $S'^{\,\text{new}} = (s'^{\,\text{new}}_1, \ldots, s'^{\,\text{new}}_d)$ denote the updates of $S$ and $S'$.

(3) Iterations and Stopping Rule. Repeat (2) $I$ times. Stop when the change in $S^{\text{new}}$ is small. Record the last $S$ and $S'$ as $S^{\text{stop}}$ and $S'^{\,\text{stop}}$, respectively.

(4) Deletion step. Using $S^{\text{stop}}$ and $S'^{\,\text{stop}}$, delete $X_l$ if $s_l^{\text{stop}} - s'^{\,\text{stop}}_l < \sqrt{2}\,\mathrm{SD}(s'^{\,\text{stop}})$. If $d$ is small, we can generate several sets of $S'^{\,\text{stop}}$ and use the sample standard deviation of all the values in these sets as $\mathrm{SD}(s'^{\,\text{stop}})$. See Remark 3.3.

End of the algorithm.
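To make the flow of steps (1)-(4) concrete, here is a compact Python sketch of one possible implementation of the loop. It reuses the helpers sketched earlier, treats all predictors as numerical for brevity, uses the global permutation of step (d), and its parameter defaults (a, b, number of iterations, h grid) are our own assumptions, not the authors' settings.

```python
import numpy as np

def catch(X, y, a=0.15, b=1.0, n_iter=5, h_grid=(0.05, 0.1, 0.2, 0.4), seed=0):
    """Sketch of the CATCH loop for numerical predictors.

    Returns (importance scores S, permutation-null scores S', indices of the kept variables)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    M, m = int(n ** b), max(int(n ** a), 10)                 # tube centers and tube size, steps (2a)-(2b)
    s, s_null = np.ones(d), np.zeros(d)                      # step (1): initialize S and S'
    for _ in range(n_iter):                                  # step (3): iterate the tube-hunting loop
        s_new, s_null_new = np.zeros(d), np.zeros(d)
        for l in range(d):                                   # step (2)
            w = np.maximum(np.delete(s, l) - np.delete(s_null, l), 0.0)
            w = w / w.sum() if w.sum() > 0 else np.full(d - 1, 1.0 / (d - 1))
            dist = lambda u, v, w=w: np.sum(w * np.abs(u - v))   # adaptive distance (3.10), numerical case
            y_perm = rng.permutation(y)                      # step (d): global permutation of Y
            s_new[l] = importance_score(X, y, l, dist, m, h_grid, M, rng)
            s_null_new[l] = importance_score(X, y_perm, l, dist, m, h_grid, M, rng)
        s, s_null = s_new, s_null_new
    keep = np.where(s - s_null >= np.sqrt(2) * np.std(s_null))[0]   # step (4): deletion rule
    return s, s_null, keep
```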

Remark 3.2. Step (d) needs to be replaced by (d') below when a global permutation of $Y$ does not result in an appropriate "local null distribution". For example, when the sample sizes of the different classes are extremely unequal, which is known as an unbalanced classification problem, it is very likely that a local region contains a pure class of $Y$ after the global permutation; the threshold $S'$ will then be spuriously low, and irrelevant variables will be falsely selected into the model. The approach in (d') is more nonparametric but more time consuming.

(d') Local Thresholding. Let $Y^*$ denote a random variable which is distributed as $Y$ but independent of $X_l$ given $X_{-l}$. Let $(Y_0^{(1)}, \ldots, Y_0^{(m)})$ be a random permutation of the $Y$'s inside the tube; then $(Y_0^{(1)}, \ldots, Y_0^{(m)})$ can be viewed as a realization of $Y^*$. If $X_l$ is a numerical variable, let $\hat\zeta'_l(X^*_k)$ be the estimated local contingency tube efficacy of $Y^*$ on $X_l$ at $X^*_k$, as defined in (3.5). If $X_l$ is a categorical variable, let $\hat\zeta'_l(X^*_k)$ be the estimated contingency tube efficacy of $Y^*$ on $X_l$ at $X^*_k$, as defined in (3.7).

Remark 3.3 (The threshold). Under the assumption $H_0^{(l)}$ that $Y$ is independent of $X_l$ given $X_{-l}$ in $T_l$, $s_l$ and $s'_l$ are nearly independent and $\mathrm{SD}(s_l - s'_l) \approx \sqrt{2}\,\mathrm{SD}(s'_l)$. Thus the rule that deletes $X_l$ when $s_l^{\text{stop}} - s'^{\,\text{stop}}_l < \sqrt{2}\,\mathrm{SE}(s'^{\,\text{stop}})$ is similar to the one-standard-deviation rule of CART and GUIDE. The choice of threshold is also discussed in Section 7. Alternatively, we calculate a p-value for each covariate by a permutation test: we permute the response variable, obtain the importance scores for the permuted data, and then calculate how extreme $s_l^{\text{stop}}$ is compared to the permuted scores. We then identify a covariate as important if its p-value is less than 0.1.
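A hedged sketch of this permutation p-value variant (our own illustration; `score_fn` stands for a routine returning the vector of importance scores, and `n_perm` and the 0.1 cutoff follow the remark above):

```python
import numpy as np

def permutation_pvalues(X, y, scores, score_fn, n_perm=50, alpha=0.1, seed=1):
    """Permutation p-values for importance scores: p_l = fraction of permuted-data scores >= s_l."""
    rng = np.random.default_rng(seed)
    null_scores = np.array([score_fn(X, rng.permutation(y)) for _ in range(n_perm)])  # (n_perm, d)
    pvals = (null_scores >= np.asarray(scores)).mean(axis=0)
    return pvals, np.where(pvals < alpha)[0]      # p-values and indices of covariates declared important
```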

Remark 3.4 (Large $d$, small $n$). A key idea in this paper is that when determining the importance of the variable $X_j$ we restrict (condition) the vector $X_{-j}$ to a relatively small set. This is done to address the problem of spurious correlation, which is of great concern in genomics (Price et al. [13], Lin and Zeng [11]). In such genomic studies the number of predictors $d$ typically greatly exceeds the sample size $n$. For numerical variables this can be dealt with by replacing $X_{-j}$ with a vector of principal components explaining most of the variation in $X_{-j}$. Although our paper does not deal with the $d \gg n$ case directly, this remark shows that it is also applicable to the genome-wide association studies described in Price et al. [13].

Remark 3.5. We now consider the computational complexity of the algorithm. Recall the notation: $I$ is the number of repetitions, $M$ is the number of tube centers, $m$ is the tube size, $d$ is the dimension and $n$ is the sample size. The number of operations needed by the algorithm is $I \times d \times M \times (\#2b + \#2c + \#2d) + \#2e$, where $\#2b$, $\#2c$, $\#2d$ and $\#2e$ denote the numbers of operations required for those steps. The computational complexities of steps 2b, 2c, 2d and 2e are $O(nd)$, $O(m)$, $O(m)$ and $O(M)$, respectively. Since $nd \gg m$, the computation is dominated by step 2b, and the computational complexity of the algorithm is $O(IMnd^2)$. Solving a least squares problem requires $O(nd^2)$ operations, so our algorithm has a higher computational complexity by a factor of $IM$ compared to least squares. However, the computations for the $M$ tube centers are independent of each other and can be performed in parallel.


4 Simulation Studies

4.1 Variable Selection with Categorical and Numerical Predictors

Example 4.1. Consider $d \ge 5$ predictors $X_1, X_2, \ldots, X_d$ and a categorical response variable $Y$. The fifth predictor $X_5$ is a Bernoulli random variable which takes the two values 0 and 1 with equal probability. All other predictors are uniformly distributed on $[0, 1]$.

When $X_5 = 0$, the dependent variable $Y$ is related to the predictors in the following fashion:

$$Y = \begin{cases} 0 & \text{if } X_1 - X_2 - 0.25\sin(16\pi X_2) \le -0.5, \\ 1 & \text{if } -0.5 < X_1 - X_2 - 0.25\sin(16\pi X_2) \le 0, \\ 2 & \text{if } 0 < X_1 - X_2 - 0.25\sin(16\pi X_2) \le 0.5, \\ 3 & \text{if } X_1 - X_2 - 0.25\sin(16\pi X_2) > 0.5. \end{cases} \tag{4.1}$$

When $X_5 = 1$, $Y$ depends on $X_3$ and $X_4$ in the same way as above, with $X_1$ replaced by $X_3$ and $X_2$ replaced by $X_4$. The other variables $X_6, \ldots, X_d$ are irrelevant noise variables. All $d$ predictors are independent.
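A small generator for model (4.1) as reconstructed above (a sketch for reproducing the simulation setting, not the authors' code):

```python
import numpy as np

def generate_example_41(n=400, d=10, seed=0):
    """Draw (X, Y) from model (4.1): Y depends on (X1, X2) when X5 = 0 and on (X3, X4) when X5 = 1."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n, d))
    X[:, 4] = rng.integers(0, 2, size=n)                     # X5 ~ Bernoulli(1/2)
    u1 = np.where(X[:, 4] == 0, X[:, 0], X[:, 2])            # active pair switches with X5
    u2 = np.where(X[:, 4] == 0, X[:, 1], X[:, 3])
    z = u1 - u2 - 0.25 * np.sin(16 * np.pi * u2)
    y = np.digitize(z, bins=[-0.5, 0.0, 0.5], right=True)    # classes 0, 1, 2, 3 as in (4.1)
    return X, y
```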

In addition to the presence of the noise variables $X_6, \ldots, X_d$, this example is challenging in two respects. First, there is an interaction effect between the continuous predictors $X_1, \ldots, X_4$ and the categorical predictor $X_5$. Such an interaction effect may arise in practice: for example, suppose $Y$ represents the effect resulting from multiple treatments $X_1, \ldots, X_4$ and $X_5$ represents the gender of a patient; our model allows males and females to respond differently to the treatments. Secondly, the classification boundaries for fixed $X_5$ are highly nonlinear, as shown in Figure 1. This design demonstrates that our method can detect nonlinear dependence between the response and the predictors.

Table 2 shows the number of simulations in which the predictors $X_1, \ldots, X_5$ are correctly kept in the model and, in column 6, the average number of simulations in which the $d - 5$ irrelevant predictors are falsely selected into the model, out of 100 simulation trials from model (4.1). In the simulation, $n = 400$ and $d = 10$, 50 and 100 are used. We compare CATCH with other methods, including Marginal CATCH, the Random Forest classifier (RFC) (Breiman [1]) and GUIDE (Loh [12]). For Marginal CATCH we test the association between $Y$ and $X_l$ without conditioning on $X_{-l}$. RFC is well known as a powerful tool for prediction; here we use importance scores from RFC for variable selection. We create an additional 30 noise variables and use their average importance score multiplied by a factor of 2 as a threshold (Doksum et al. [3]) to decide whether a variable should be selected into the model. In particular, we generate the noise variables using uniform and Bernoulli distributions, respectively, for numerical and categorical predictors.

The main goal of GUIDE is to solve the variable selection bias problem in regression and classification tree methods. As an illustration, if a categorical predictor takes $K$ distinct values, then the splitting test exhausts $2^{K-1} - 1$ possibilities, and the predictor is therefore biased toward being selected.

Table 2: Simulation results from Example 4.1 with sample size n = 400. For 1 <= i <= 5, the ith column gives the estimated probability of selection of the relevant variable X_i and the standard error, based on 100 simulations. Column 6 gives the mean of the estimated probability of selection and the standard error for the irrelevant variables X_6, ..., X_d.

Number of Selections (100 simulations)
METHOD            d     X1          X2          X3          X4          X5          Xi, i >= 6
CATCH             10    0.82(0.04)  0.75(0.04)  0.78(0.04)  0.82(0.04)  1(0)        0.05(0.02)
                  50    0.76(0.04)  0.76(0.04)  0.87(0.03)  0.84(0.04)  0.96(0.02)  0.08(0.03)
                  100   0.77(0.04)  0.73(0.04)  0.82(0.04)  0.8(0.04)   0.83(0.04)  0.09(0.03)
Marginal CATCH    10    0.77(0.04)  0.67(0.05)  0.7(0.05)   0.8(0.04)   0.06(0.02)  0.1(0.03)
                  50    0.69(0.05)  0.7(0.05)   0.76(0.04)  0.79(0.04)  0.07(0.03)  0.1(0.03)
                  100   0.78(0.04)  0.76(0.04)  0.73(0.04)  0.73(0.04)  0.12(0.03)  0.1(0.03)
Random Forest     10    0.71(0.05)  0.51(0.05)  0.47(0.05)  0.59(0.05)  0.37(0.05)  0.06(0.02)
                  50    0.64(0.05)  0.61(0.05)  0.65(0.05)  0.58(0.05)  0.28(0.04)  0.08(0.03)
                  100   0.66(0.05)  0.58(0.05)  0.59(0.05)  0.65(0.05)  0.17(0.04)  0.11(0.03)
GUIDE             10    0.96(0.02)  0.98(0.01)  0.93(0.03)  0.99(0.01)  0.92(0.03)  0.33(0.05)
                  50    0.82(0.04)  0.85(0.04)  0.9(0.03)   0.89(0.03)  0.67(0.05)  0.12(0.03)
                  100   0.83(0.04)  0.84(0.04)  0.86(0.03)  0.83(0.04)  0.61(0.05)  0.08(0.03)

[Figure 1: The relationship between Y and X1, X2 for model (4.1). The four areas represent the four values of Y.]

The way that GUIDE alleviates this bias is to employ lack-of-fit tests, i.e., chi-square tests applicable to both numerical and categorical predictors, with the p-values adjusted by a bootstrap procedure. Since GUIDE has negligible selection bias, it can include tests for local pairwise interactions. Furthermore, it achieves sensitivity to local curvature by using a linear model at each node.

The results in Table 2 show that Marginal CATCH does a very poor job of selecting $X_5$; this illustrates the importance of local conditioning. RFC does better than Marginal CATCH but also has difficulties, while CATCH and GUIDE successfully identify $X_5$ along with $X_1$ to $X_4$ with high frequency. When $d = 10$, GUIDE does very well for the relevant variables but selects too many irrelevant variables. Surprisingly, GUIDE selects fewer irrelevant variables as $d$ increases; it becomes more cautious as the dimension $d$ increases. CATCH performs very well for the high dimensional cases ($d = 50$ and $d = 100$). This indicates that the CATCH algorithm is robust to the intricate interaction structures between the predictors.

In model (4.1) above the predictors interact in a hierarchical way. When $X_5 = 0$ the response is related to $X_1$ and $X_2$ as shown in Figure 1, where the two axes are the two relevant predictors and the four areas partitioned by the curves correspond to different values of $Y$. When $X_5 = 1$ the response depends on $X_3$ and $X_4$ in the same way. This type of hierarchical interaction between the categorical and numerical predictors is difficult to detect, even by state-of-the-art classification methods such as the support vector machine (SVM) and RFC. In particular, RFC produces an importance score vector that identifies the four numerical variables $X_1, \ldots, X_4$ as very significant but often leaves the categorical variable $X_5$ unidentified, as we see in Table 2.

The CATCH algorithm, however, performs very well for this complicated model. We now explain why CATCH is able to identify $X_5$ as an important variable. To explore the association between $X_5$ and $Y$, we construct $M$ contingency tables of 2 rows and 4 columns based on $M$ randomly centered tubes, calculate contingency efficacy scores, and then average the scores. Figure 2 (a) and (b) illustrate how one of the contingency efficacy scores is calculated. We generate 1000 data points from model (4.1) and split them into two sets according to $x_5$: panel (a) shows the approximately half of the points with $x_5 = 0$ in the $(x_1, x_2)$ dimensions, and panel (b) shows the remaining points, with $x_5 = 1$, in the $(x_3, x_4)$ dimensions. The data points are colored according to $Y$; this representation lets us clearly see the classification boundaries (grey lines). The randomly selected tube center, for which $x_5$ is zero, is shown as a star in (a). The data points within the tube, i.e., close to the tube center in terms of $X_{-5}$, are highlighted by larger dots. Since the distance defining the tube does not involve $X_5$, the larger dots have $x_5$ equal to 0 or 1 and appear in both (a) and (b). The counts of the larger dots in the different colors in (a) and (b) correspond to the cell counts of the two rows, $x_5 = 0$ and $x_5 = 1$, of the contingency table. As we can see, the larger dots in (a) are mostly red and the larger dots in (b) are mostly blue and orange, which shows that the distribution of $Y$ depends on the value of $X_5$, and the contingency efficacy score is high. We expect the contingency efficacy score to be low only if the tube center has similar values in $(x_1, x_2)$ and $(x_3, x_4)$; such tube centers are very few among the $M$ randomly chosen tube centers. Thus the average of the contingency efficacy scores is high and $X_5$ is identified as an important variable.

To contrast this example with a simpler data generating model, suppose the effects of the predictors are additive, that is, the categorical variable is related to the response as a term added to the effect of the numerical predictors. In this type of simpler model the RFC and GUIDE algorithms can efficiently identify the relevant predictors, but so can the CATCH algorithm. We do not include the simulation results for additive models here.

Remark 4.1. We observe that the CATCH algorithm is not very sensitive to the choice of the number of tube centers $M$ or the tube size $m = [n^a]$, where $0 < a \le 1$ and $[\,\cdot\,]$ is the greatest integer function. For Example 4.1 we perform the simulation using $M = n$ and $M = n/2$ and $a = 0.1, 0.15, 0.2, 0.3, 0.5$. The results are summarized in Table 3. The CATCH algorithm performs similarly for $M = n$ and $M = n/2$, with $M = n$ having a small winning edge. We generally suggest using $M = n$ if computational resources are less of a concern or $n$ is not large ($\le 200$). When computational cost is a concern (Remark 3.5), one can reduce the number of tube centers to $M = n/2$ as a compromise between detection power and computational cost. Table 3 also shows that the tube size works well in the range between 0.1 and 0.3, while the detection power for $X_5$ decreases for the case $a = 0.5$, where the tube is too big to achieve local conditioning; this observation is consistent with the comparison to Marginal CATCH in Table 2. On the other hand, the tube size should not be too small, so that sufficient detection power is achieved. We suggest using $a = 0.1$ or 0.15 for sample sizes $n \ge 400$, or $m = 50$ for smaller sample sizes. In light of these guidelines, we use $M = n/2 = 200$ and $a = 0.15$ for Examples 4.1-4.3, and $M = n = 200$ and $m = 50$ for Example 4.4.

Table 3: Sensitivity study of CATCH for Example 4.1 (n = 400, d = 50) using different numbers of tube centers M and tube sizes m = [n^a]. For 1 <= i <= 5, the ith column gives the number of times X_i was selected. Column 6 gives the percentage of times irrelevant variables were selected.

Number of Selections (100 simulations)
M         a      X1   X2   X3   X4   X5   Xi, i >= 6
M = 200   0.10   75   67   77   69   92   78
          0.15   76   76   87   84   96   82
          0.20   85   84   90   87   93   99
          0.30   89   87   95   89   85   96
          0.50   89   90   96   93   62   99
M = 400   0.10   74   76   82   77   92   72
          0.15   85   82   89   87   94   86
          0.20   87   88   90   86   96   91
          0.30   88   89   91   91   92   94
          0.50   89   91   94   93   61   102

In Example 4.1 all predictors are independent. In a similar setting, we next consider the case where the predictors are dependent.

Example 4.2. We generate standard normal random variables $Z_i$ for $i = 1, \ldots, d$ in such a way that $\mathrm{cor}(Z_1, Z_4) = 0.5$, $\mathrm{cor}(Z_3, Z_6) = 0.9$, and the other $Z_i$'s are independent. We set $X_i = \Phi(Z_i)$ for $i = 1, \ldots, d$, where $\Phi(\cdot)$ is the CDF of a standard normal, so the $X_i$ are marginally uniformly distributed, as in Example 4.1. The correlation structure of the $X_i$'s is similar to that of the $Z_i$'s. The dependent variable $Y$ is generated in the same way as in Example 4.1.

The results of Example 4.2 are given in Table 4. Comparing with the results in Table 2, both CATCH and GUIDE do well in the presence of dependence between the numerical variables. The explanation is two-fold. First, both methods can identify the categorical $X_5$ as an important predictor most of the time, which makes the identification of the other relevant predictors possible. Second, conditioning on $X_5$, $Y$ depends on a pair of variables ("$X_1$ and $X_2$" or "$X_3$ and $X_4$") jointly, and in particular this dependence is nonlinear, which makes both methods less affected by the correlation introduced here.


[Figure 2: Sample scatter plots of (x1, x2) and (x3, x4) for simulated data using model (4.1). Panel (a), X5 = 0, axes X1 vs. X2; panel (b), X5 = 1, axes X3 vs. X4. The left panel includes all the pairs (x1, x2) with x5 = 0, while the right panel includes all the pairs (x3, x4) with x5 = 1. The larger dots in the two plots are within the tube around a tube center, shown as the star.]

We now explain the latter point in detail for both methods. In CATCH, the algorithm adaptively weighs the effects of the predictors by their importance scores when defining the tubes in which they are conditioned on; therefore identifying either of the two relevant variables helps to identify the other, whereas an irrelevant variable, though strongly associated with one of the relevant variables, is down-weighted iteratively and does not strongly affect the selection of the relevant variable. This explains the slightly lower selection probability of $X_3$ in the presence of the strongly correlated $X_6$, and consequently the overall lower selection of $X_3$ and $X_4$ compared to $X_1$ and $X_2$, given the correlation between $X_1$ and $X_4$. In GUIDE the difference between the dependent and independent cases is even smaller, because GUIDE is very powerful in detecting local interactions by literally including pairwise interactions when selecting splitting variables.

Marginal CATCH and Random Forest do poorly in this case, mainly because $X_5$ cannot be detected, as we explained earlier. More specifically, the dependence between $Y$ and $X_4$ in the $X_5 = 1$ case is the same as that between $Y$ and $X_2$ in the $X_5 = 0$ case. The linear terms of $X_1$ and $X_2$ have opposite signs in the decision function; therefore the marginal effects of $X_1$ and $X_4$ are obscured when $X_5$ is not identified as an important variable.

4.2 Improving the Performance of SVM and RF Classifiers

The criterion used in the CATCH algorithm is not based on classification accuracy. But if we use CATCH to screen out the irrelevant variables and only use the important variables for classification, the performance of other algorithms such as the Support Vector Machine (SVM) and the Random Forest Classifier (RFC) can be significantly improved.

Table 4: Simulation results from Example 4.2 with sample size n = 400. For 1 <= i <= 6, the ith column gives the estimated probability of selection of variable X_i and the standard error, based on 100 simulations, where X_i, i = 1, ..., 5, are relevant variables with cor(X1, X4) = 0.5; X6 is an irrelevant variable that is correlated with X3, with correlation 0.9. Column 7 gives the mean of the estimated probability of selection and that of the standard error for the irrelevant variables X_7, ..., X_d.

Number of Selections (100 simulations)
METHOD            d     X1          X2          X3          X4          X5          X6          Xi, i >= 7
CATCH             10    0.76(0.04)  0.77(0.04)  0.73(0.04)  0.68(0.05)  0.99(0.01)  0.34(0.05)  0.07(0.02)
                  50    0.78(0.04)  0.77(0.04)  0.74(0.04)  0.63(0.05)  0.92(0.03)  0.57(0.05)  0.09(0.03)
                  100   0.76(0.04)  0.77(0.04)  0.8(0.04)   0.74(0.04)  0.82(0.04)  0.55(0.05)  0.09(0.03)
Marginal CATCH    10    0.35(0.05)  0.63(0.05)  0.8(0.04)   0.32(0.05)  0.09(0.03)  0.71(0.05)  0.12(0.03)
                  50    0.34(0.05)  0.71(0.05)  0.7(0.05)   0.3(0.05)   0.1(0.03)   0.6(0.05)   0.11(0.03)
                  100   0.34(0.05)  0.68(0.05)  0.75(0.04)  0.3(0.05)   0.1(0.03)   0.59(0.05)  0.1(0.03)
Random Forest     10    0.29(0.05)  0.48(0.05)  0.55(0.05)  0.2(0.04)   0.3(0.05)   0.47(0.05)  0.05(0.02)
                  50    0.26(0.04)  0.52(0.05)  0.57(0.05)  0.3(0.05)   0.24(0.04)  0.45(0.05)  0.08(0.03)
                  100   0.36(0.05)  0.61(0.05)  0.66(0.05)  0.37(0.05)  0.32(0.05)  0.46(0.05)  0.12(0.03)
GUIDE             10    0.96(0.02)  0.96(0.02)  0.94(0.02)  0.95(0.02)  0.96(0.02)  0.84(0.04)  0.3(0.05)
                  50    0.8(0.04)   0.92(0.03)  0.88(0.03)  0.79(0.04)  0.72(0.04)  0.68(0.05)  0.12(0.03)
                  100   0.81(0.04)  0.89(0.03)  0.9(0.03)   0.76(0.04)  0.75(0.04)  0.58(0.05)  0.08(0.03)

In this section we compare the prediction performance of (1) SVM, (2) RFC, (3) CATCH followed by SVM, and (4) CATCH followed by RFC. We label (3) and (4) CATCH+SVM and CATCH+RFC, respectively. We use the "svm" function in the "e1071" package in R with the radial kernel to perform the SVM analysis, and the "randomForest" function in the "randomForest" package to perform the RFC analysis. We use simulation experiments to illustrate the improvement obtained by using CATCH prior to classifying Y.

Let the true model be

$$(X, Y) \sim P, \tag{4.2}$$

where $P$ will be specified later. In each of 100 Monte Carlo experiments, $n$ pairs $(X^{(i)}, Y^{(i)})$, $i = 1, \ldots, n$, are generated from the true model (4.2). Based on the simulated data $(X^{(i)}, Y^{(i)})$, $i = 1, \ldots, n$, the methods (1) SVM, (2) RFC, (3) CATCH+SVM and (4) CATCH+RFC are used to classify a future $Y^{(0)}$ for which we have available a corresponding $X^{(0)}$.

The integrated misclassification rate (IMR; Friedman [6]) defined below is used to evaluate the performance of the classification algorithms. Let $F_{\mathbf{X}, \mathbf{Y}}(\cdot)$ be the classifier constructed by a classification algorithm based on $\mathbf{X} = (X^{(1)}, \ldots, X^{(n)})$ and $\mathbf{Y} = (Y^{(1)}, \ldots, Y^{(n)})^T$.

Let $(X^{(0)}, Y^{(0)})$ be a future observation with the same distribution as $(X^{(i)}, Y^{(i)})$. We only know $X^{(0)} = x^{(0)}$ and want to classify $Y^{(0)}$. For a possible future observation $(x^{(0)}, y^{(0)})$, define

$$\mathrm{MR}(\mathbf{X}, \mathbf{Y}, x^{(0)}, y^{(0)}) = P\big(F_{\mathbf{X}, \mathbf{Y}}(x^{(0)}) \ne y^{(0)} \mid \mathbf{X}, \mathbf{Y}\big). \tag{4.3}$$

Our criterion is

$$\mathrm{IMR} = E\, E_0\big[\mathrm{MR}(\mathbf{X}, \mathbf{Y}, x^{(0)}, y^{(0)})\big], \tag{4.4}$$

where $E_0$ is with respect to the distribution of the random $(x^{(0)}, y^{(0)})$ and $E$ is with respect to the distribution of $(\mathbf{X}, \mathbf{Y})$.

This double expectation is evaluated using Monte Carlo simulation; the result is denoted by IMR*. More precisely, $E_0$ is estimated using 5000 simulated $(x^{(0)}, y^{(0)})$ observations, and then IMR* is computed on the basis of 100 simulated samples $(\mathbf{X}_i, \mathbf{Y}_i)$, $i = 1, \ldots, 100$.
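A sketch of how IMR* can be estimated by Monte Carlo (our own illustration using scikit-learn's SVC; `generate` stands for a simulator of model (4.2) such as the generator for (4.1) above, and `select` is a placeholder for a CATCH-style variable screener):

```python
import numpy as np
from sklearn.svm import SVC

def imr_star(generate, n=400, n_test=5000, n_rep=100, select=None, seed=0):
    """Monte Carlo estimate of IMR* (4.4): test misclassification rate averaged over n_rep training sets."""
    rates = []
    for r in range(n_rep):
        X, y = generate(n=n, seed=seed + r)                 # one simulated training sample (X, Y)
        X0, y0 = generate(n=n_test, seed=10_000 + r)        # future observations used to estimate E_0
        keep = select(X, y) if select is not None else slice(None)   # e.g. the variables kept by CATCH
        clf = SVC(kernel="rbf").fit(X[:, keep], y)
        rates.append(np.mean(clf.predict(X0[:, keep]) != y0))
    return float(np.mean(rates))
```

For instance, `imr_star(generate_example_41)` would estimate the plain-SVM curve at one value of d, while passing a CATCH-based `select` gives the CATCH+SVM counterpart.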

Example 4.3. First consider model (4.1) in Section 4.1. Figure 3 shows IMR* versus the number of irrelevant variables for this example. The left panel shows that the IMR*'s of the SVM approach increase dramatically as the number of irrelevant variables increases, while the IMR*'s of CATCH+SVM are stable. Thus CATCH stabilizes the performance of SVM for this complex model. The right panel shows that CATCH also stabilizes the performance of RFC. These results indicate that CATCH can improve the prediction accuracy of other classification algorithms.

[Figure 3: IMR*'s of SVM, RFC, CATCH+SVM and CATCH+RFC versus the number of predictors d for the simulation study of model (4.1) in Example 4.1; the sample size is n = 400. Left panel: SVM vs. CATCH+SVM; right panel: RFC vs. CATCH+RFC; vertical axis: error rate.]

Example 4.4 (Contaminated Logistic Model). Consider a binary response $Y \in \{0, 1\}$ with

$$P(Y = 0 \mid X = x) = \begin{cases} F(x) & \text{if } r = 0, \\ G(x) & \text{if } r = 1, \end{cases} \tag{4.5}$$

where $x = (x_1, \ldots, x_d)$,

$$r \sim \mathrm{Bernoulli}(\gamma), \tag{4.6}$$

$$F(x) = \big(1 + \exp\{-(0.25 + 0.5 x_1 + x_2)\}\big)^{-1}, \tag{4.7}$$

$$G(x) = \begin{cases} 0.1 & \text{if } 0.25 + 0.5 x_1 + x_2 < 0, \\ 0.9 & \text{if } 0.25 + 0.5 x_1 + x_2 \ge 0, \end{cases} \tag{4.8}$$

and $x_1, x_2, \ldots, x_d$ are realizations of independent standard normal random variables. Hence, given $X^{(i)} = x^{(i)}$, $1 \le i \le n$, the joint distribution of $Y^{(1)}, Y^{(2)}, \ldots, Y^{(n)}$ is the contaminated logistic model with contamination parameter $\gamma$:

$$(1 - \gamma) \prod_{i=1}^{n} [F(x^{(i)})]^{Y^{(i)}} [1 - F(x^{(i)})]^{1 - Y^{(i)}} + \gamma \prod_{i=1}^{n} [G(x^{(i)})]^{Y^{(i)}} [1 - G(x^{(i)})]^{1 - Y^{(i)}}. \tag{4.9}$$

We perform a simulation study for $n = 200$, $d = 50$ and $\gamma = 0, 0.2, 0.4, 0.6, 0.8$ and 1, where $\gamma = 0$ corresponds to no contamination. Figure 4 shows the IMR*'s versus the different values of $\gamma$ for this example. CATCH improves the performance of the SVM and Random Forest classifiers for this contaminated logistic regression model.

[Figure 4: IMR*'s of SVM, RFC, CATCH+SVM and CATCH+RFC versus the contamination parameter γ (0, 0.2, 0.4, 0.6, 0.8, 1) for simulation model (4.9) in Example 4.4. Left panel: SVM vs. CATCH+SVM; right panel: RFC vs. CATCH+RFC; vertical axis: error rate.]

5 CATCH Analysis of The ANN Data

In this section CATCH is applied to a data set available at the UCI data base (http://ftp.ics.uci.edu/pub/machine-learning-databases/thyroid-disease). The data come from a clinical trial to determine whether a patient is hypothyroid (see, e.g., Quinlan [15]). The dataset is split into two parts: a training set with 3772 observations and a testing set with 3428 observations. The response is a categorical variable with three classes: normal (not hypothyroid), hyperfunction and subnormal functioning. There are 21 predictors, including 15 categorical predictors and 6 numerical predictors. The three classes are very imbalanced, with 92.5% of the patients being normal. A good classifier should produce a significant decrease from the 7.5% error rate that can be obtained by classifying all observations into the normal class.


CATCH identifies 7 variables as important, including 4 continuous variables and 3 categorical variables. We consider three classification methods: 1. Support Vector Machine with radial kernel, denoted by SVM(r); 2. Support Vector Machine with polynomial kernel, denoted by SVM(p); 3. RFC. We apply these three methods to the training set to build classification models and use the test set to estimate the misclassification rate of the methods. We apply the methods with CATCH and without CATCH, respectively, where "with CATCH" means that only the 7 important variables identified by CATCH are used to build the model and "without CATCH" means that all 21 variables are used in the analysis.

Table 5: Misclassification rates of three classification methods (SVM(r), SVM(p), RFC) with CATCH and without CATCH.

                SVM(r)   SVM(p)   RFC
w/o CATCH       0.052    0.063    0.024
with CATCH      0.037    0.055    0.015

Table 5 shows that if CATCH is used to select important variables before the classification methods are applied, the misclassification rates decrease. CATCH reduces the misclassification rates by 29%, 13% and 37% for SVM(r), SVM(p) and RFC, respectively. CATCH + RFC has the best performance. Although CATCH is not designed for improving classification accuracy, it nonetheless helps other classification methods to perform better.
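The with/without-CATCH comparison in Table 5 amounts to fitting each classifier twice, once on all predictors and once on the CATCH-selected columns. A scikit-learn sketch (our own illustration; the column indices in `selected` stand in for the 7 variables identified by CATCH and are an assumption here, as is the number of trees):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

def compare_with_without_catch(X_train, y_train, X_test, y_test, selected):
    """Test-set misclassification rates of SVM(r), SVM(p) and RFC with and without CATCH screening."""
    models = {"SVM(r)": SVC(kernel="rbf"),
              "SVM(p)": SVC(kernel="poly"),
              "RFC": RandomForestClassifier(n_estimators=500, random_state=0)}
    rates = {}
    for name, clf in models.items():
        err_all = np.mean(clf.fit(X_train, y_train).predict(X_test) != y_test)
        err_sel = np.mean(clf.fit(X_train[:, selected], y_train).predict(X_test[:, selected]) != y_test)
        rates[name] = {"without CATCH": err_all, "with CATCH": err_sel}
    return rates
```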

6 Properties of Local Contingency Efficacy

In this section we investigate the properties of local contingency efficacy. For a univariate continuous predictor, Theorem 6.1 below shows that the local contingency efficacy is well defined and equals 0 if the response variable is independent of the predictor. Consider an $R \times C$ contingency table. Let $n_{rc}$, $r = 1, \ldots, R$, $c = 1, \ldots, C$, be the observed frequency in the $(r, c)$-cell and let $p_{rc}$ be the probability that an observation belongs to the $(r, c)$-cell. Let

$$p_r = \sum_{c=1}^{C} p_{rc}, \qquad q_c = \sum_{r=1}^{R} p_{rc}, \tag{6.1}$$

and

$$n_{r\cdot} = \sum_{c=1}^{C} n_{rc}, \qquad n_{\cdot c} = \sum_{r=1}^{R} n_{rc}, \qquad n = \sum_{r=1}^{R} \sum_{c=1}^{C} n_{rc}, \qquad E_{rc} = \frac{n_{r\cdot}\, n_{\cdot c}}{n}. \tag{6.2}$$

For testing that the rows and the columns are independent, i.e.,

$$H_0: \forall r, c,\ p_{rc} = p_r q_c \quad \text{versus} \quad H_1: \exists r, c,\ p_{rc} \ne p_r q_c, \tag{6.3}$$

a standard test statistic is

$$\mathcal{X}^2 = \sum_{r=1}^{R} \sum_{c=1}^{C} \frac{(n_{rc} - E_{rc})^2}{E_{rc}}. \tag{6.4}$$

Under $H_0$, $\mathcal{X}^2$ converges in distribution to $\chi^2_{(R-1)(C-1)}$. Moreover, defining $0/0 = 0$, we have the well-known result:

Lemma 6.1. As $n$ tends to infinity, $\mathcal{X}^2/n$ tends in probability to

$$\tau^2 \equiv \sum_{r=1}^{R} \sum_{c=1}^{C} \frac{(p_{rc} - p_r q_c)^2}{p_r q_c}.$$

Moreover, if $p_r q_c$ is bounded away from 0 and 1 as $n \to \infty$, and if $R$ and $C$ are fixed as $n \to \infty$, then $\tau^2 = 0$ if and only if $H_0$ is true.
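A quick numerical check of Lemma 6.1 (our own illustration: for a chosen cell-probability matrix we compare X^2/n from one large simulated table against the limit tau^2 computed from the probabilities):

```python
import numpy as np

def lemma61_check(p_rc, n=200_000, seed=0):
    """Compare X^2/n for one large simulated R x C table with its limit tau^2 from Lemma 6.1."""
    rng = np.random.default_rng(seed)
    R, C = p_rc.shape
    counts = rng.multinomial(n, p_rc.ravel()).reshape(R, C).astype(float)
    expected = np.outer(counts.sum(axis=1), counts.sum(axis=0)) / n
    chi2_over_n = np.sum((counts - expected) ** 2 / expected) / n
    indep = np.outer(p_rc.sum(axis=1), p_rc.sum(axis=0))     # p_r * q_c
    tau2 = np.sum((p_rc - indep) ** 2 / indep)
    return chi2_over_n, tau2

# e.g. lemma61_check(np.array([[0.30, 0.20], [0.15, 0.35]])) returns two values that nearly agree
```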

Remark 6.1. Lemma 6.1 shows that the contingency efficacy $\zeta$ in (2.8) is well defined. Moreover, $\zeta = 0$ if and only if $X$ and $Y$ are independent.

Theorem 6.1. Consider a categorical response $Y$ and a continuous predictor $X$ with density function $f(x)$ such that, for a fixed $h > 0$, $f(x) > 0$ for $x \in (x^{(0)} - h, x^{(0)} + h)$. Then

(1) For fixed $h$, the estimated local section efficacy $\hat\zeta(x^{(0)}, h)$ defined in (2.5) converges almost surely to the section efficacy $\zeta(x^{(0)}, h)$ defined in (2.4).

(2) If $X$ and $Y$ are independent, then the local efficacy $\zeta(x^{(0)})$ defined in (2.4) is zero.

(3) If there exists $c \in \{1, \ldots, C\}$ such that $p_c(x) = P(Y = c \mid x)$ has a continuous derivative at $x = x^{(0)}$ with $p'_c(x^{(0)}) \ne 0$, then $\zeta(x^{(0)}) > 0$.

The $\chi^2$ statistic (6.4) can be written in terms of the multinomial proportions $\hat p_{rc}$, with $\sqrt{n}(\hat p_{rc} - p_{rc})$, $1 \le r \le R$, $1 \le c \le C$, converging in distribution to a multivariate normal. By a Taylor expansion of $T_n \equiv \sqrt{\chi^2/n}$ at $\tau = \operatorname{plim} \sqrt{\chi^2/n}$, we find that

Theorem 6.2.
$$\sqrt{n}(T_n - \tau) \longrightarrow_d N(0, v^2), \tag{6.5}$$

where the asymptotic variance $v^2$, which can be obtained from the Taylor expansion, is not needed in this paper.

Remark 6.2. The result (6.5) applies to the multivariate $\chi^2$-based efficacies $\hat\zeta$ with $n$ replaced by the number of observations that go into the computation of $\hat\zeta$. In this case we use the notation $\chi^2_j$ for the chi-square statistic for the $j$th variable.

7 Consistency of CATCH

Let $Y$ denote a variable to be classified into one of the categories $1, \ldots, C$ by using the $X_j$'s in the vector $X = (X_1, \ldots, X_d)$. Let $X_{-j} \equiv X - \{X_j\}$ denote $X$ without the $X_j$ component. Assume that the squared Pearson nonparametric correlation (Doksum and Samarov [4], Huang and Chen [9]) between $X_j$ and $X_{-j}$ is less than 0.5. We call the variable $X_j$ relevant for $Y$ if, conditionally given $X_{-j}$, $\mathcal{L}(Y \mid X_j = x)$ is not constant in $x$, and irrelevant otherwise. Our data $(X^{(1)}, Y^{(1)}), \ldots, (X^{(n)}, Y^{(n)})$ are i.i.d. as $(X, Y)$. A variable selection procedure is consistent if the probability that all the relevant variables are selected and none of the irrelevant variables are selected tends to one as $n$ tends to infinity.

We give a general nonparametric framework in which the variable selection rule based on the CATCH importance scores $s_j$ given in (3.8) is consistent.

When the predictors are not all categorical and CATCH involves tubes, we assume that the number of observations in each tube tends to $\infty$ as $n \to \infty$, and we assume that for those importance scores that require the selection of a section size $h$, the selection is from a finite set $\{h_j\}$ with $\min_j h_j > h_0$ for some $h_0 > 0$. It follows that the number of observations $m_{jk}$ in the $k$th tube used to calculate the importance score $s_j$ for the variable $X_j$ tends to infinity as $n \to \infty$.

The key quantities that determine consistency are the limits $\lambda_j$ of $s_j$. That is,

$$\lambda_j = \tau(\zeta_j, P) = E_P[\zeta_j(X)] = \int \zeta_j(x)\, dP(x), \quad j = 1, \ldots, d,$$

where $\zeta_j(x)$ is the local contingency efficacy defined by (2.4) for numerical $X$'s and by (2.8) for categorical $X$'s, and $P$ is the probability distribution of $X$. Our estimate $s_j$ of $\tau_j$ is

$$s_j = \tau(\hat\zeta_j, \hat P) = \frac{1}{M} \sum_{k=1}^{M} \hat\zeta_j(X^*_k),$$

where $\hat\zeta_j$ is defined in Section 3 and $\hat P$ is the empirical distribution of $X$.

Let $d_0$ denote the number of $X$'s that are irrelevant for $Y$. Set $d_1 = d - d_0$ and, without loss of generality, reorder the $\lambda_j$ so that

$$\lambda_j > 0 \ \text{ if } j = 1, \ldots, d_1, \qquad \lambda_j = 0 \ \text{ if } j = d_1 + 1, \ldots, d. \tag{7.1}$$

Our variable selection rule is

"keep variable $X_j$ if $s_j \ge t$" for some threshold $t$.  (7.2)

Rule (7.2) is consistent if and only if

$$P\Big(\min_{j \le d_1} s_j \ge t \ \text{ and } \ \max_{j > d_1} s_j < t\Big) \longrightarrow 1. \tag{7.3}$$

To establish (7.3), it is enough to show that

$$P\Big(\max_{j > d_1} s_j > t\Big) \longrightarrow 0 \tag{7.4}$$

and

$$P\Big(\min_{j \le d_1} s_j \le t\Big) \longrightarrow 0. \tag{7.5}$$

We call (7.4) and (7.5) type I and type II consistency, respectively. Type I and type II error probabilities refer to the probabilities of any false positives and any false negatives, respectively.

7.1 Type I consistency

We consider (7.4) first and note that

$$P\Big(\max_{j > d_1} s_j > t\Big) \le \sum_{j = d_1 + 1}^{d} P(s_j > t). \tag{7.6}$$

For the $j$th irrelevant variable $X_j$, $j > d_1$, the results in Section 6 apply to the local efficacy $\hat\zeta_j(x)$ at a fixed point $x$. The importance score $s_j$ defined in (3.8) is the average of such efficacies over a sample of $M$ random points $x^*_a$. We assume that there is a point $x^{(0)}$ such that, for irrelevant $X_j$'s and for $n$ large enough, $\hat\zeta_j(x^{(0)})$ is stochastically larger in the right tail than the average $M^{-1} \sum_{a=1}^{M} \hat\zeta_j(x^*_a)$. That is, writing $s_j^{(0)} \equiv \hat\zeta_j(x^{(0)})$, we assume that there exist $t > 0$ and $n_0$ such that for $n \ge n_0$

$$P(s_j > t) \le P(s_j^{(0)} > t), \quad j > d_1. \tag{7.7}$$

The existence of such a point $x^{(0)}$ is established by considering the probability limit of

$$\arg\max\{\hat\zeta_j(x^*_a) : x^*_a \in \{x^*_1, \ldots, x^*_M\}\}.$$

Note that by Lemma 6.1, s^{(0)}_j \to_p 0 as n → ∞. It follows that we have established (7.4) for t and d_0 = d − d_1 that are fixed as n → ∞. To examine the interesting case where d_0 → ∞ as n → ∞, we need large deviation results, to which we turn next. In the d_0 → ∞ case we consider t ≡ t_n that tends to ∞ as n → ∞, and we use the weaker assumption that there exist x^{(0)} and n_0 such that

P(s_j > t_n) \le P(s^{(0)}_j > t_n)   (7.8)

for n ≥ n_0 and each t_n with lim_{n→∞} t_n = ∞. We also assume that the number of columns C and rows R in the contingency table that defines the efficacy χ²_{jn} given by (6.4) stays fixed as n → ∞. In this case P(s^{(0)}_j > t_n) → 0 provided each term

[T^{(j)}_{rc}]^2 \equiv \left[ \frac{n_{rc} - E_{rc}}{\sqrt{n}\,\sqrt{E_{rc}}} \right]^2

in s_j^2 = \chi^2_{jn}/n, defined for X_j by (6.4), satisfies

P\left( |T^{(j)}_{rc}| > t_n \right) \longrightarrow 0.

This is because

P\left( \sqrt{\sum_r \sum_c [T^{(j)}_{rc}]^2} > t_n \right) \le P\left( RC \max_{r,c} [T^{(j)}_{rc}]^2 > t_n^2 \right) \le RC \max_{r,c} P\left( |T^{(j)}_{rc}| > t_n / \sqrt{RC} \right),

and t_n and t_n/\sqrt{RC} are of the same order as n → ∞. By large deviation theory (Hall [8], Jing et al. [10]), for some constant A,

P(|T^{(j)}_{rc}| > t_n) = \frac{2 t_n^{-1}}{\sqrt{2\pi}} e^{-t_n^2/2} \left[ 1 + A t_n^3 n^{-1/2} + o\left( t_n^3 n^{-1/2} \right) \right]   (7.9)

provided t_n = o(n^{1/6}). If (7.9) is uniform in j > d_1, then (7.6) and (7.9) imply that type I consistency holds if

d_0 \left( t_n \sqrt{2} \right)^{-1} e^{-t_n^2/2} \rightarrow 0.   (7.10)

To examine (7.10), write d_0 and t_n as d_0 = \exp(n^b) and t_n = \sqrt{2}\, n^r. By large deviation theory, (7.10) holds when r < 1/6. Thus if 0 < \frac{1}{2} b < r < \frac{1}{6}, then

d_0 \left( t_n \sqrt{2} \right)^{-1} e^{-t_n^2/2} = n^{-r} \exp\left( n^b - n^{2r} \right) \rightarrow 0.

Thus d_0 = \exp(n^b) with b = 1/3 − ε, for any ε > 0, is the largest d_0 can be to have consistency. For instance, if there are exactly d_0 = \exp(n^{1/4}) irrelevant variables and we use the threshold t_n = \sqrt{2}\, n^{1/7}, the probability of selecting any of the irrelevant variables tends to zero at the rate n^{-1/4} \exp\left( n^{1/4} - n^{2/7} \right).
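For a quick numerical check of this rate, the following R sketch (an illustration added here, not part of the original analysis) evaluates the type I bound d_0 (t_n √2)^{-1} exp(-t_n^2/2) of (7.10) with d_0 = exp(n^{1/4}) and t_n = √2 n^{1/7} over a grid of sample sizes; the log of the bound decreases toward -∞, as claimed.

    # Illustrative check of the type I consistency bound (7.10)
    n  <- 10^seq(2, 8, by = 1)            # sample sizes on a log grid
    tn <- sqrt(2) * n^(1/7)               # threshold sequence t_n
    # log of d0 * (tn*sqrt(2))^(-1) * exp(-tn^2/2) with d0 = exp(n^(1/4))
    log_bound <- n^(1/4) - n^(2/7) - log(tn * sqrt(2))
    print(data.frame(n = n, log_bound = log_bound))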

7.2 Type II consistency

We now assume that for each relevant variable X_j there is a least favorable point x^{(1)} such that the local efficacy s^{(1)}_j ≡ \hat{\zeta}_j(x^{(1)}) is stochastically smaller than the importance score s_j = M^{-1} \sum_{a=1}^{M} \hat{\zeta}_j(x^*_a). That is, we assume that for each relevant variable X_j there exist n_1 and x^{(1)} such that for n ≥ n_1,

P\left( s^{(1)}_j \le t_n \right) \ge P(s_j \le t_n), \quad j \le d_1.   (7.11)

The existence of such a least favorable x^{(1)} is established by considering the probability limit of

\arg\min\{\hat{\zeta}_j(x^*_a) : x^*_a \in \{x^*_1, \cdots, x^*_M\}\}.

We again assume that R and C are fixed, so that to show type II consistency it is enough to show that for each (r, c) and j ≤ d_1,

P\left( |T^{(j)}_{rc}| \le t_n \right) \rightarrow 0.

This is because

P\left( \sqrt{\sum_r \sum_c [T^{(j)}_{rc}]^2} \le t_n \right) \le P\left( \min_{r,c} [T^{(j)}_{rc}]^2 \le t_n^2 \right) \le RC \max_{r,c} P\left( |T^{(j)}_{rc}| \le t_n \right).

By Theorem 6.2, \sqrt{n}(s^{(1)}_j - \tau_j) \rightarrow_d N(0, \nu^2), \tau_j > 0. It follows that if t_n and d_1 are bounded above as n → ∞ and τ_j > 0, then P(s^{(1)}_j \le t_n) \rightarrow 0. Thus type II consistency holds in this case. To examine the more interesting case where t_n and d_1 are not bounded above and τ_j may tend to zero as n → ∞, we will need to center the T^{(j)}_{rc}. Thus set

\theta^{(j)}_n = \frac{\sqrt{n}\left( p^{(j)}_{rc} - p^{(j)}_r p^{(j)}_c \right)}{\sqrt{p^{(j)}_r p^{(j)}_c}},   (7.12)

T^{(j)}_n = T^{(j)}_{rc} - \theta^{(j)}_n,

where for simplicity the dependence of T^{(j)}_n and \theta^{(j)}_n on r and c is not indicated.

Lemma 7.1.

P\left( \min_{j \le d_1} |T^{(j)}_{rc}| \le t_n \right) \le \sum_{j=1}^{d_1} P\left( |T^{(j)}_n| \ge \min_{j \le d_1} |\theta^{(j)}_n| - t_n \right).

Proof. Using |T^{(j)}_n + \theta^{(j)}_n| \ge |\theta^{(j)}_n| - |T^{(j)}_n|,

P\left( \min_{j \le d_1} |T^{(j)}_{rc}| \le t_n \right) = P\left( \min_{j \le d_1} |T^{(j)}_n + \theta^{(j)}_n| \le t_n \right)
\le P\left( \max_{j \le d_1} |T^{(j)}_n| \ge \min_{j \le d_1} |\theta^{(j)}_n| - t_n \right)
\le \sum_{j=1}^{d_1} P\left( |T^{(j)}_n| \ge \min_{j \le d_1} |\theta^{(j)}_n| - t_n \right).

Set

a_n = \min_{j \le d_1} |\theta^{(j)}_n| - t_n;

then by large deviation theory, if a_n = o(n^{1/6}),

P\left( |T^{(j)}_n| \ge a_n \right) \propto \left( a_n \sqrt{2} \right)^{-1} \exp\{-a_n^2/2\},   (7.13)

where ∝ signifies asymptotic order as n → ∞, a_n → ∞. It follows that if (7.13) holds uniformly in j, then by Lemma 7.1,

P\left( \min_{j \le d_1} |T^{(j)}_{rc}| \le t_n \right) \propto d_1 \left( a_n \sqrt{2} \right)^{-1} \exp\{-a_n^2/2\}.

Thus type II consistency holds when a_n = o(n^{1/6}) and

d_1 \left( a_n \sqrt{2} \right)^{-1} \exp\{-a_n^2/2\} \rightarrow 0.   (7.14)

In Section 7.1, t_n is of order n^r, 0 < r < 1/6. For this t_n, type II consistency holds if a_n is of order n^r, 0 < r < 1/6, and

d_1 n^{-r} \exp\{-n^{2r}/2\} \rightarrow 0.

If a_n = O(n^k) with k ≥ 1/6, then P(|T^{(j)}_n| \ge a_n) tends to zero at a rate faster than in (7.13) and consistency is ensured. For instance, if all relevant variables X_j have p^{(j)}_{rc} fixed as n → ∞, then \min_{j \le d_1} |\theta^{(j)}_n| is of order \sqrt{n} and a_n is of order n^{1/2} - t_n.

7.3 Type I and II consistency

We see from Sections 7.1 and 7.2 that consistency holds under a variety of conditions. One set of conditions is the following.

Theorem 7.1. Under the regularity conditions of Sections 7.1 and 7.2, if

d_0 = c_0 \exp(n^b), \quad t_n = c_1 n^r, \quad \min_{j \le d_1} \theta^{(j)}_n - t_n = c_2 n^r, \quad d_1 = c_3 \exp(n^b)

for positive constants c_0, c_1, c_2 and c_3 and 0 < \frac{1}{2} b < r < \frac{1}{6}, then CATCH is consistent.

Appendix: Proof of Theorem 6.1

(1) Let N_h = \sum_{i=1}^{n} I(X^{(i)} \in N_h(x^{(0)})). Then N_h \sim \mathrm{Bi}\left( n, \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx \right) and N_h \rightarrow_{a.s.} \infty as n → ∞.

\hat{\zeta}(x^{(0)}, h) = n^{-1/2} \sqrt{\mathcal{X}^2(x^{(0)}, h)} = (N_h / n)^{1/2} (N_h)^{-1/2} \sqrt{\mathcal{X}^2(x^{(0)}, h)}.

Since N_h \sim \mathrm{Bi}\left( n, \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx \right), we have

N_h / n \rightarrow \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx.   (7.15)

By Lemma 6.1 and Table 1,

(N_h)^{-1/2} \sqrt{\mathcal{X}^2(x^{(0)}, h)} \rightarrow \left( \sum_{r=+,-} \sum_{c=1}^{C} \frac{(p_{rc} - p_r q_c)^2}{p_r q_c} \right)^{1/2},   (7.16)

where

p_{+c} = \Pr(X > x^{(0)}, Y = c \mid X \in N_h(x^{(0)})), \quad p_{-c} = \Pr(X \le x^{(0)}, Y = c \mid X \in N_h(x^{(0)}))

for r = +, - and c = 1, \cdots, C, and

p_r = \sum_{c=1}^{C} p_{rc}, \qquad q_c = \sum_{r=+,-} p_{rc}.

That is, p_+ = \Pr(X > x^{(0)} \mid X \in N_h(x^{(0)})), p_- = \Pr(X \le x^{(0)} \mid X \in N_h(x^{(0)})), and q_c = \Pr(Y = c \mid X \in N_h(x^{(0)})).

By (7.15) and (7.16),

\hat{\zeta}(x^{(0)}, h) \rightarrow \left( \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx \right)^{1/2} \left( \sum_{r=+,-} \sum_{c=1}^{C} \frac{(p_{rc} - p_r q_c)^2}{p_r q_c} \right)^{1/2} \equiv \zeta(x^{(0)}, h).

(2) Since X and Y are independent, p_{rc} = p_r q_c for r = +, - and c = 1, \cdots, C, so ζ(x^{(0)}, h) = 0 for any h. Hence

\zeta(x^{(0)}) = \sup_{h > 0} \zeta(x^{(0)}, h) = 0.

(3) Recall that p_c(x^{(0)}) = \Pr(Y = c \mid x^{(0)}). Without loss of generality assume p'_c(x^{(0)}) > 0. Then there exists h such that

\Pr(Y = c \mid x^{(0)} < X < x^{(0)} + h) > \Pr(Y = c \mid x^{(0)} - h < X < x^{(0)}),

which is equivalent to p_{+c}/p_+ > p_{-c}/p_-. This shows that ζ(x^{(0)}) > 0, since by Lemma 6.1, ζ(x^{(0)}) = 0 implies p_{+c}/p_+ = p_{-c}/p_- = p_c.

References

[1] Breiman, L., 2001. Random forests. Machine Learning 45, 5-32.

[2] Breiman, L., Friedman, J. H., Olshen, R. A., Stone, C. J., 1984. Classification and Regression Trees. Wadsworth, Belmont.

[3] Doksum, K., Tang, S., Tsui, K.-W., 2008. Nonparametric variable selection: The EARTH algorithm. Journal of the American Statistical Association 103(484), 1609-1620.

[4] Doksum, K. A., Samarov, A., 1995. Nonparametric estimation of global functionals and a measure of explanatory power of covariates in regression. Annals of Statistics 23, 1443-1473.

[5] Doksum, K. A., Schafer, C., 2006. Powerful choices: Tuning parameter selection based on power. In: Frontiers in Statistics: Dedicated to Peter Bickel (J. Fan and H. L. Koul, eds.), Imperial College Press, 113-141.

[6] Friedman, J., 1991. Multivariate adaptive regression splines. The Annals of Statistics 19, 1-67.

[7] Gao, J., Gijbels, I., 2008. Bandwidth selection in nonparametric kernel testing. Journal of the American Statistical Association 103(484), 1584-1594.

[8] Hall, P., 1992. The Bootstrap and Edgeworth Expansion. Springer, New York.

[9] Huang, L.-S., Chen, J., 2008. Analysis of variance, coefficient of determination and F-test for local polynomial regression. Annals of Statistics 36, 2085-2109.

[10] Jing, B.-Y., Shao, Q.-M., Wang, Q., 2003. Self-normalized Cramér-type large deviations for independent random variables. Annals of Probability 31, 2167-2215.

[11] Lin, D. Y., Zeng, D., 2011. Correcting for population stratification in genome-wide association studies. Journal of the American Statistical Association 106, 997-1008.

[12] Loh, W.-Y., 2009. Improving the precision of classification trees. The Annals of Applied Statistics 3, 1710-1737.

[13] Price, A., Patterson, N., Plenge, R., Weinblatt, M., Shadick, N., Reich, D., 2006. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38(8), 904-909.

[14] Qiao, Z., Zhou, L., Huang, J., 2008. Linear discriminant analysis for high dimensional, low sample size data. Special issue of the World Congress on Engineering 2008.

[15] Quinlan, J. R., 1987. Simplifying decision trees. International Journal of Man-Machine Studies 27(3), 221-234.

[16] Schafer, C., Doksum, K. A., 2009. Selecting local models in multiple regression by maximizing power. Metrika 69, 283-304.

[17] Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267-288.

[18] Wang, L., Shen, X., 2007. On L1-norm multi-class support vector machines: methodology and theory. Journal of the American Statistical Association 102, 583-594.

[19] Zhang, H. H., Liu, Y., Wu, Y., Zhu, J., 2008. Variable selection for the multicategory SVM via adaptive sup-norm regularization. Electronic Journal of Statistics 2, 149-167.

Contents

• Introduction
• Importance Scores for Univariate Classification
  • Numerical Predictor: Local Contingency Efficacy
  • Categorical Predictor: Contingency Efficacy
• Variable Selection for Classification Problems: The CATCH Algorithm
  • Numerical Predictors
  • Numerical and Categorical Predictors
  • The Empirical Importance Scores
  • Adaptive Tube Distances
  • The CATCH Algorithm
• Simulation Studies
  • Variable Selection with Categorical and Numerical Predictors
  • Improving the Performance of SVM and RF Classifiers
• CATCH Analysis of the ANN Data
• Properties of Local Contingency Efficacy
• Consistency of CATCH
  • Type I consistency
  • Type II consistency
  • Type I and II consistency
• Proof of Theorem 6.1
1 Introduction

Methods for variable selection in the classification context include methods that incorporate variable selection as part of the classification procedure. This class includes random forest (Breiman [1]), CART (Breiman et al. [2]) and GUIDE (Loh [12]). Random forest assigns an importance score to each of the predictors, and one can drop those variables whose importance scores fall below a certain threshold. CART, after pruning, will choose a subset of optimal splitting variables as the most significant variables. GUIDE is a tree-based method particularly powerful in unbiased variable selection and interaction detection. Other research includes methods that incorporate variable selection by applying shrinkage methods with L1-norm constraints on the parameters (Tibshirani [17]) that generate sparse vectors of parameter estimates. Wang and Shen [18] and Zhang et al. [19] incorporate variable selection with classification based on support vector machine (SVM) methods. Qiao et al. [14] consider variable selection based on linear discriminant analysis. One limitation of these variable selection methods is that they are not based on nonparametric methods, and therefore they may not work well when there is a complex relationship between the prediction variables and the variables to be classified.

We propose a nonparametric method for variable selection, called Categorical Adaptive Tube Covariate Hunting (CATCH), that performs well as a variable selection algorithm and can be used to improve the performance of available classification procedures. The idea is to construct a nonparametric measure of the relational strength between each predictor and the categorical response, and to retain those predictors whose relationship to the response is above a certain threshold. The nonparametric measure of importance for each predictor is obtained by first measuring the importance of the predictor using local information and then combining such local importance scores to obtain an overall importance score. The local importance scores are based on root chi-square type statistics for local contingency tables.

In addition to the aforementioned nonparametric feature, the CATCH procedure has another property: it measures the importance of each variable conditioning on all other variables, thereby reducing the confounding that may lead to the selection of variables spuriously related to the categorical variable Y. This is accomplished by constraining all predictors but the one we are focusing on. For the case where the number of predictors is huge (d ≫ n), this can be done by restricting principal components for some types of studies. See Remark 3.4.

Our approach to nonparametric variable selection is related to the EARTH algorithm (Doksum et al. [3]), which applies to nonparametric regression problems with a continuous response variable and continuous predictors. It measures the conditional association between a predictor X_i and the response variable, conditional on all the other predictors {X_j}_{j≠i}, by constraining the {X_j}_{j≠i} variables to regions called tubes. The local importance score is based on a local linear or a local polynomial regression. The contribution of the current paper is to develop variable selection methods for the classification problem, with a categorical response variable and predictors that can be continuous, discrete or categorical.

Our CATCH algorithm can be used as a variable selection step before classification. Any classification method, preferably nonparametric, can be used after we single out the important variables. In particular, SVM and random forest are statistical classification methods that can be used with CATCH. We show in simulation studies that when the true model is highly nonlinear and there are many irrelevant predictors, using CATCH to screen out irrelevant or weak predictors greatly improves the performance of SVM and random forest.

The CATCH algorithm works with general classification data with both numerical and categorical predictors. Moreover, the CATCH algorithm is robust when the predictors interact with each other, especially when numerical and categorical predictors interact, e.g., for hierarchical interactive association between the numerical and categorical predictors. We present a model with a reasonable sample size and predictor dimension for which random forest has a relatively low chance of detecting the significance of the categorical predictor. The CART and GUIDE algorithms, with pruning, can find the correct splitting variables, including the categorical one, but yield relatively high classification errors. Moreover, CART and GUIDE have trouble choosing the splitting predictors in the correct order. The CATCH algorithm achieves higher accuracy in the task of variable selection. This is due to the importance scores being conditional, as illustrated in the simulation example in Section 4.1.

The rest of the paper proceeds as follows. In Section 2 we introduce importance scores for univariate predictors. In Sections 3.1-3.4 we extend these scores to multivariate predictors, and in Section 3.5 we introduce the CATCH algorithm. In Section 4 we use simulation studies to show the effectiveness of the CATCH algorithm. A real example is provided in Section 5. In Section 6 we provide some theoretical properties to justify the definition of local contingency efficacy, and in Section 7 we show asymptotic consistency of the algorithm.


2 Importance Scores for Univariate Classification

Let (X^{(i)}, Y^{(i)}), i = 1, ..., n, be independent and identically distributed (i.i.d.) as (X, Y) ~ P. We first consider the case of a univariate X, then use the methods constructed for the univariate case to construct the multivariate versions.

2.1 Numerical Predictor: Local Contingency Efficacy

Consider the classification problem with a numerical predictor. Let

\Pr(Y = c \mid x) = p_c(x), \quad c = 1, \cdots, C, \qquad \sum_{c=1}^{C} p_c(x) \equiv 1.   (2.1)

We introduce a measure of how strongly X is associated with Y in a neighborhood of a "fixed" point x^{(0)}. Points x^{(0)} will be selected at random by our algorithm; local importance scores will be computed for each point and then averaged. For a fixed h > 0, define N_h(x^{(0)}) = {x : 0 < |x − x^{(0)}| ≤ h} to be a neighborhood of x^{(0)}, and let n(x^{(0)}, h) = \sum_{i=1}^{n} I(X^{(i)} \in N_h(x^{(0)})) denote the number of data points in the neighborhood. For c = 1, ..., C, let n^{-}_c(x^{(0)}, h) be the number of observations (X^{(i)}, Y^{(i)}) satisfying x^{(0)} − h ≤ X^{(i)} < x^{(0)} and Y^{(i)} = c; let n^{+}_c(x^{(0)}, h) be the number of (X^{(i)}, Y^{(i)}) satisfying x^{(0)} < X^{(i)} ≤ x^{(0)} + h and Y^{(i)} = c. Let n_c(x^{(0)}, h) = n^{+}_c(x^{(0)}, h) + n^{-}_c(x^{(0)}, h), n^{-}(x^{(0)}, h) = \sum_{c=1}^{C} n^{-}_c(x^{(0)}, h) and n^{+}(x^{(0)}, h) = \sum_{c=1}^{C} n^{+}_c(x^{(0)}, h). Table 1 gives the resulting local contingency table.

Table 1: Local contingency table.

                                 Y = 1              Y = 2              ...   Y = C              Total
x^{(0)} - h <= X < x^{(0)}       n^-_1(x^{(0)},h)   n^-_2(x^{(0)},h)   ...   n^-_C(x^{(0)},h)   n^-(x^{(0)},h)
x^{(0)} < X <= x^{(0)} + h       n^+_1(x^{(0)},h)   n^+_2(x^{(0)},h)   ...   n^+_C(x^{(0)},h)   n^+(x^{(0)},h)
Total                            n_1(x^{(0)},h)     n_2(x^{(0)},h)     ...   n_C(x^{(0)},h)     n(x^{(0)},h)

To measure how strongly Y relates to local restrictions on X, we consider the chi-square statistic

\mathcal{X}^2(x^{(0)}, h) = \sum_{c=1}^{C} \left( \frac{(n^{-}_c(x^{(0)}, h) - E^{-}_c(x^{(0)}, h))^2}{E^{-}_c(x^{(0)}, h)} + \frac{(n^{+}_c(x^{(0)}, h) - E^{+}_c(x^{(0)}, h))^2}{E^{+}_c(x^{(0)}, h)} \right),   (2.2)

where 0/0 ≡ 0 and

E^{-}_c(x^{(0)}, h) = \frac{n^{-}(x^{(0)}, h)\, n_c(x^{(0)}, h)}{n(x^{(0)}, h)}, \qquad E^{+}_c(x^{(0)}, h) = \frac{n^{+}(x^{(0)}, h)\, n_c(x^{(0)}, h)}{n(x^{(0)}, h)}.   (2.3)

Due to the local restriction on X, the local contingency table might have some zero cells. However, this will not be an issue for the χ² statistic: it can detect local dependence between Y and X as long as there exist observations from at least two categories of Y in the neighborhood of x^{(0)}. When all observations belong to the same category of Y, intuitively no dependence can be detected; in this case the χ² statistic equals zero, which coincides with our intuition.

We call the neighborhood N_h(x^{(0)}) a section and maximize X²(x^{(0)}, h) in (2.2) with respect to the section size h. Let plim denote limit in probability. Our local measure of association is ζ(x^{(0)}), as given in the following definition.

Definition 1 (Local Contingency Efficacy, for a numerical predictor). The local contingency section efficacy and the Local Contingency Efficacy (LCE) of Y on a numerical predictor X at the point x^{(0)} are

\zeta(x^{(0)}, h) = \operatorname{plim}_{n \to \infty} n^{-1/2} \sqrt{\mathcal{X}^2(x^{(0)}, h)}, \qquad \zeta(x^{(0)}) = \sup_{h > 0} \zeta(x^{(0)}, h).   (2.4)

If p_c(x) is constant in x for all c ∈ {1, ..., C} for x near x^{(0)}, then as n → ∞, X²(x^{(0)}, h) converges in distribution to a χ²_{C−1} variable, and in this case ζ(x^{(0)}, h) = 0. On the other hand, if p_c(x) is not constant near x^{(0)}, ζ(x^{(0)}, h) determines the asymptotic power against Pitman alternatives of the test based on X²(x^{(0)}, h) for testing whether X is locally independent of Y, and is called the efficacy of this test. Moreover, ζ²(x^{(0)}, h) is the asymptotic noncentrality parameter of the statistic n^{-1} X²(x^{(0)}, h).

The estimators of ζ(x^{(0)}, h) and ζ(x^{(0)}) are

\hat{\zeta}(x^{(0)}, h) = n^{-1/2} \sqrt{\mathcal{X}^2(x^{(0)}, h)}, \qquad \hat{\zeta}(x^{(0)}) = \sup_{h > 0} \hat{\zeta}(x^{(0)}, h).   (2.5)

We select the section size h as

\hat{h} = \arg\max\{\hat{\zeta}(x^{(0)}, h) : h \in \{h_1, \ldots, h_g\}\}

for some grid h_1, ..., h_g. This corresponds to selecting h by maximizing estimated power; see Doksum and Schafer [5], Gao and Gijbels [7], Doksum et al. [3], and Schafer and Doksum [16].
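For concreteness, the following R sketch computes the estimated local contingency section efficacy (2.5) for a numerical predictor at a point x0 over a small grid of section sizes; the grid is an arbitrary illustrative choice and the function name is ours, not the authors' implementation.

    # Minimal sketch of the estimated local contingency efficacy (2.5)
    local_efficacy <- function(x, y, x0, h_grid = seq(0.1, 0.5, by = 0.1) * stats::sd(x)) {
      zeta_h <- sapply(h_grid, function(h) {
        keep <- abs(x - x0) <= h & abs(x - x0) > 0          # the section N_h(x0)
        if (sum(keep) < 2) return(0)
        side <- factor(ifelse(x[keep] > x0, "+", "-"), levels = c("-", "+"))
        tab  <- table(side, factor(y[keep]))                # local contingency table (Table 1)
        E    <- outer(rowSums(tab), colSums(tab)) / sum(tab)  # expected counts (2.3)
        X2   <- sum(((tab - E)^2 / E)[E > 0])               # chi-square (2.2), with 0/0 = 0
        sqrt(X2 / length(x))                                # n^{-1/2} sqrt(X^2)
      })
      max(zeta_h)                                           # sup over the grid, as in (2.5)
    }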


2.2 Categorical Predictor: Contingency Efficacy

Let (X^{(i)}, Y^{(i)}), i = 1, ..., n, be as before, except that X ∈ {1, ..., C'} is a categorical predictor. Let n(c, c') be the number of observations satisfying Y = c and X = c', and let

\mathcal{X}^2 = \sum_{c=1}^{C} \sum_{c'=1}^{C'} \frac{(n(c, c') - E(c, c'))^2}{E(c, c')},   (2.6)

where

E(c, c') = \frac{\left( \sum_{c=1}^{C} n(c, c') \right) \left( \sum_{c'=1}^{C'} n(c, c') \right)}{n}.   (2.7)

Definition 2 (Contingency Efficacy, for a categorical predictor). The Contingency Efficacy (CE) of Y on a categorical predictor X is

\zeta = \operatorname{plim}_{n \to \infty} n^{-1/2} \sqrt{\mathcal{X}^2}.   (2.8)

Under the null hypothesis that Y and X are independent, X² converges in distribution to a χ²_{(C−1)(C'−1)} variable; in this case ζ = 0. In general, ζ² is the asymptotic noncentrality parameter of the statistic n^{-1} X². The contingency efficacy ζ determines the asymptotic power of the test based on X² for testing whether Y and X are independent. We use ζ as an importance score that measures how strongly Y depends on X. The estimator of (2.8) is

\hat{\zeta} = n^{-1/2} \sqrt{\mathcal{X}^2}.   (2.9)
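A corresponding R sketch of the estimate (2.9) for a categorical predictor is given below; it is an illustration only, and the function name is ours.

    # Minimal sketch of the contingency efficacy estimate (2.9)
    contingency_efficacy <- function(x, y) {
      tab <- table(factor(y), factor(x))                    # C x C' table of counts n(c, c')
      E   <- outer(rowSums(tab), colSums(tab)) / sum(tab)   # expected counts (2.7)
      X2  <- sum(((tab - E)^2 / E)[E > 0])                  # chi-square statistic (2.6)
      sqrt(X2 / length(y))                                  # n^{-1/2} sqrt(X^2)
    }
    # Example: contingency_efficacy(x = sample(letters[1:3], 100, TRUE), y = rbinom(100, 1, 0.5))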

3 Variable Selection for Classification Problems: The CATCH Algorithm

Now we consider a multivariate X = (X_1, ..., X_d), d > 1. To examine whether a predictor X_l is related to Y in the presence of all other variables, we calculate efficacies conditional on these variables as well as unconditional, or marginal, efficacies.

3.1 Numerical Predictors

To examine the effect of X_l in the local region of x^{(0)} = (x^{(0)}_1, ..., x^{(0)}_d), we build a "tube" around the center point x^{(0)} in which data points are within a certain radius δ of x^{(0)} with respect to all variables except X_l. Define X_{−l} = (X_1, ..., X_{l−1}, X_{l+1}, ..., X_d)^T to be the complementary vector to X_l. Let D(·, ·) be a distance in R^{d−1}, which we will define later.

Definition 3. A tube T_l of size δ (δ > 0) for the variable X_l at x^{(0)} is the set

T_l \equiv T_l(x^{(0)}, \delta, D) \equiv \{x : D(x_{-l}, x^{(0)}_{-l}) \le \delta\}.   (3.1)

We adjust the size of the tube so that the number of points in the tube is a preassigned number k. We find that k ≥ 50 is in general a good choice. Note that k = n corresponds to unconditional, or marginal, nonparametric regression.

Within the tube T_l(x^{(0)}, δ, D), X_l is unconstrained. To measure the dependence of Y on X_l locally, we consider neighborhoods of x^{(0)} along the direction of X_l within the tube. Let

N_{l,h,\delta,D}(x^{(0)}) = \{x = (x_1, \cdots, x_d)^T \in R^d : x \in T_l(x^{(0)}, \delta, D), \; |x_l - x^{(0)}_l| \le h\}   (3.2)

be a section of the tube T_l(x^{(0)}, δ, D). To measure how strongly X_l relates to Y in N_{l,h,δ,D}(x^{(0)}), we consider

\mathcal{X}^2_{l,\delta,D}(x^{(0)}, h),   (3.3)

where (3.3) is defined by (2.2) using only the observations in the tube T_l(x^{(0)}, δ, D). Analogous to the definitions of local contingency efficacy (2.4), we define:

Definition 4. The local contingency tube section efficacy and the local contingency tube efficacy of Y on X_l at the point x^{(0)}, and their estimates, are

\zeta(x^{(0)}, h, l) = \operatorname{plim}_{n \to \infty} n^{-1/2} \sqrt{\mathcal{X}^2_{l,\delta,D}(x^{(0)}, h)}, \qquad \zeta_l(x^{(0)}) = \sup_{h > 0} \zeta(x^{(0)}, h, l),   (3.4)

\hat{\zeta}(x^{(0)}, h, l) = n^{-1/2} \sqrt{\mathcal{X}^2_{l,\delta,D}(x^{(0)}, h)}, \qquad \hat{\zeta}_l(x^{(0)}) = \sup_{h > 0} \hat{\zeta}(x^{(0)}, h, l).   (3.5)

For an illustration of local contingency tube efficacy, consider Y ∈ {1, 2} and suppose logit P(Y = 2 | x) = (x − 1/2)², x ∈ [0, 1]. Then the local efficacy with h small will be large, while if h ≥ 1 the efficacy will be zero.
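The following R sketch illustrates how tube membership in Definition 3 can be determined in practice: with a weighted L1 distance on the other coordinates, the tube size δ is implicitly chosen so that exactly m points fall inside. The function name and defaults are ours, intended only as an illustration of the construction.

    # Minimal sketch of tube membership for T_l(x0, delta, D) in Definition 3
    tube_members <- function(X, x0, l, m, w = rep(1, ncol(X))) {
      D  <- as.matrix(X[, -l, drop = FALSE])                     # coordinates other than X_l
      d0 <- sweep(abs(sweep(D, 2, as.numeric(x0[-l]))), 2, w[-l], `*`)
      dist_to_center <- rowSums(d0)                              # weighted L1 distance as in (3.9)
      order(dist_to_center)[seq_len(m)]                          # indices of the m closest points
    }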

3.2 Numerical and Categorical Predictors

If X_l is categorical but some of the other predictors are numerical, we construct tubes T_l(x^{(0)}, δ, D) based on the numerical predictors, leaving the categorical variables unconstrained, and we use the statistic defined by (2.6) to measure the strength of the association between Y and X_l in the tube T_l(x^{(0)}, δ, D), using only the observations in the tube. We denote this version of (2.6) by

\mathcal{X}^2_{l,\delta,D}(x^{(0)}).   (3.6)

Analogous to the definition of contingency efficacy given in (2.8), we define:

Definition 5. The local contingency tube efficacy of Y on X_l at x^{(0)} and its estimate are

\zeta_l(x^{(0)}) = \operatorname{plim}_{n \to \infty} n^{-1/2} \sqrt{\mathcal{X}^2_{l,\delta,D}(x^{(0)})}, \qquad \hat{\zeta}_l(x^{(0)}) = n^{-1/2} \sqrt{\mathcal{X}^2_{l,\delta,D}(x^{(0)})}.   (3.7)

3.3 The Empirical Importance Scores

The empirical efficacies (3.5) and (3.7) measure how strongly Y depends on X_l near one point x^{(0)}. To measure the overall dependency of Y on X_l, we randomly sample M bootstrap points x^*_a, a = 1, ..., M, with replacement from {x^{(i)}, 1 ≤ i ≤ n} and calculate the average of the local tube efficacy estimates at these points.

Definition 6. The CATCH empirical importance score for variable l is

s_l = M^{-1} \sum_{a=1}^{M} \hat{\zeta}_l(x^*_a),   (3.8)

where \hat{\zeta}_l(x^*_a) is defined in (3.5) and (3.7) for numerical and categorical predictors, respectively.
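As an illustration of (3.8), the following R sketch averages local tube efficacies over M bootstrap tube centers, reusing the tube_members and local_efficacy sketches given earlier; the scaling inside the tube is simplified (the tube sample size is used in place of n), so this is an approximation, not the authors' implementation.

    # Minimal sketch of the empirical importance score (3.8) for a numerical X_l
    importance_score <- function(X, y, l, M = nrow(X), m = 50, w = rep(1, ncol(X))) {
      centers <- sample(nrow(X), M, replace = TRUE)              # bootstrap points x*_a
      scores <- sapply(centers, function(a) {
        idx <- tube_members(X, x0 = as.numeric(X[a, ]), l = l, m = m, w = w)
        local_efficacy(x = X[idx, l], y = y[idx], x0 = X[a, l])  # local tube efficacy at x*_a
      })
      mean(scores)                                               # s_l in (3.8)
    }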

3.4 Adaptive Tube Distances

Doksum et al. [3] used an example to demonstrate that the distance D needs to be adaptive, to keep variables strongly associated with Y from obscuring the effect of weaker variables. Suppose Y depends on three variables X_1, X_2 and X_3 among the larger set of predictors, and the strength of the dependency, which will be estimated by the empirical importance scores, ranges from weak to strong. As we examine the effect of X_2 on Y, we want to define the tube so that there is relatively less variation in X_3 than in X_1 in the tube, because X_3 is more likely to obscure the effect of X_2 than X_1. To this end we introduce a tube distance for the variable X_l of the form

D_l(x_{-l}, x^{(0)}_{-l}) = \sum_{j=1}^{d} w_j |x_j - x^{(0)}_j| I(j \ne l),   (3.9)

where the weights w_j adjust the contribution of strong variables to the distance, so that the strong variables are more constrained.

The weights w_j are proportional to s_j − s'_j, which are determined iteratively and adaptively as follows. In the first iteration, set s_j = s^{(1)}_j = 1 in (3.9) and compute the importance score s^{(2)}_j using (3.8). For the second iteration we use the weights s_j = s^{(2)}_j and the distance

D_l(x_{-l}, x^{(0)}_{-l}) = \frac{\sum_{j=1}^{d} (s_j - s'_j)_{+} |x_j - x^{(0)}_j| I(j \ne l)}{\sum_{j=1}^{d} (s_j - s'_j)_{+} I(j \ne l)},   (3.10)

where 0/0 ≡ 1 and s'_j is a threshold value for s_j under the assumption of no conditional association of X_j with Y, computed by a simple Monte Carlo technique. This adjustment is important because some predictors by definition have larger importance scores than others even when they are all independent of Y; for example, a categorical predictor with more levels has a larger importance score than one with fewer levels. Therefore the unadjusted scores s_j cannot accurately quantify the relative importance of the predictors in the tube distance calculation. Next we use (3.8) to produce s_j = s^{(3)}_j, use s^{(3)}_j in the next iteration, and so on.

In the definition above we need to define |x_j − x^{(0)}_j| appropriately for a categorical variable X_j. Because X_j is categorical, we need to assume that |x_j − x^{(0)}_j| is constant when x_j and x^{(0)}_j take any pair of different categories. One approach is to define |x_j − x^{(0)}_j| = ∞ · I(x_j ≠ x^{(0)}_j), in which case all points in the tube are given the same X_j value as the tube center. This approach is a very sensible one, as it strictly implements the idea of conditioning on X_j when evaluating the effect of another predictor X_l, and X_j does not contribute to any variation in Y. The problem with this definition, though, is that when the number of categories of X_j is relatively large compared to the sample size, there will not be enough points in the tube if we restrict X_j to a single category. In such a case we do not restrict X_j, which means that X_j can take any category in the tube. We define |x_j − x^{(0)}_j| = k_0 I(x_j ≠ x^{(0)}_j), where k_0 is a normalizing constant. The value of k_0 is chosen to achieve a balance between numerical variables and categorical variables. Suppose that X_1 is a numerical variable with standard deviation one and X_2 is a categorical variable. For X_2 we set k_0 ≡ E(|X^{(1)}_1 − X^{(2)}_1|), where X^{(1)}_1 and X^{(2)}_1 are two independent realizations of X_1. Note that k_0 = √(2/π) if X_1 follows a standard normal distribution. Also, k_0 can be based on the empirical distributions of the numerical variables. Based on the preceding discussion of the choice of k_0, we use k_0 = ∞ in the simulations and k_0 = √(2/π) in the real data example.

Remark 3.1 (choice of k_0). k_0 = √(2/π) is used unless the sample size is large enough, in the sense explained below. Suppose k_0 = ∞; then D_l(x_{−l}, x^{(0)}_{−l}) is finite only for those observations with x_j = x^{(0)}_j. When there are multiple categorical variables, D_l is finite only for those points, denoted by the set Ω^{(0)}, that satisfy x_j = x^{(0)}_j for all categorical x_j. When there are many categorical variables or when the total sample size is small, the size of this set can be very small or close to zero, in which case the tube cannot be well defined; therefore k_0 = √(2/π) is used, as in the real data example. On the other hand, when the size of Ω^{(0)} is larger than the specified tube size, we can use k_0 = ∞, as in the simulation examples.
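To make the weighting in (3.10) concrete, the following R sketch computes the normalized adaptive weights from a current score vector s and its thresholds s'; the equal-weight fallback when all (s_j − s'_j)_+ vanish is our reading of the 0/0 convention, and the function name is ours.

    # Minimal sketch of the adaptive weights in the tube distance (3.10)
    adaptive_weights <- function(s, s_prime, l) {
      w <- pmax(s - s_prime, 0)                                  # (s_j - s'_j)_+
      w[l] <- 0                                                  # X_l itself is unconstrained
      if (sum(w) == 0) {                                         # 0/0 convention: fall back to equal weights
        w <- rep(1, length(s)); w[l] <- 0
      }
      w / sum(w)
    }
    # Example: adaptive_weights(s = c(1.2, 0.4, 0.9), s_prime = c(0.3, 0.3, 0.3), l = 2)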

3.5 The CATCH Algorithm

The variable selection algorithm for a classification problem is based on importance scores and an iteration procedure.

The Algorithm.
(0) Standardizing. Standardize all the numerical predictors to have sample mean 0 and sample standard deviation 1 using linear transformations.
(1) Initializing importance scores. Let S = (s_1, ..., s_d) be importance scores for the predictors. Let S' = (s'_1, ..., s'_d), where s'_l is a threshold importance score corresponding to the model in which X_l is independent of Y given X_{−l}. Initialize S to be (1, ..., 1) and S' to be (0, ..., 0). Let M be the number of tube centers and m the number of observations within each tube.
(2) Tube hunting loop. For l = 1, ..., d, do (a), (b), (c) and (d) below.

(a) Selecting tube centers. For 0 < b ≤ 1, set M = [n^b], where [ ] is the greatest integer function, and randomly sample M bootstrap points X^*_1, ..., X^*_M with replacement from the n observations. Set k = 1; do (b).

(b) Constructing tubes. For 0 < a < 1, set m = [n^a]. Using the tube distance D_l defined in (3.10), select the tube size δ so that there are exactly m ≥ 10 observations inside the tube for X_l at X^*_k.

(c) Calculating efficacy. If X_l is a numerical variable, let \hat{\zeta}_l(X^*_k) be the estimated local contingency tube efficacy of Y on X_l at X^*_k as defined in (3.5). If X_l is a categorical variable, let \hat{\zeta}_l(X^*_k) be the estimated contingency tube efficacy of Y on X_l at X^*_k as defined in (3.7).

(d) Thresholding. Let Y* denote a random variable which is identically distributed as Y but independent of X. Let (Y^{(1)}_0, ..., Y^{(n)}_0) be a random permutation of (Y^{(1)}, ..., Y^{(n)}); then Y_0 ≡ (Y^{(1)}_0, ..., Y^{(n)}_0) can be viewed as a realization of Y*. Let \hat{\zeta}^*_l(X^*_k) be the estimated local contingency tube efficacy of Y_0 on X_l at X^*_k, calculated as in (c) with Y_0 in place of Y.

If k < M, set k = k + 1 and go to (b); otherwise go to (e).

(e) Updating importance scores. Set

s^{new}_l = \frac{1}{M} \sum_{k=1}^{M} \hat{\zeta}_l(X^*_k), \qquad s'^{new}_l = \frac{1}{M} \sum_{k=1}^{M} \hat{\zeta}^*_l(X^*_k).

Let S^{new} = (s^{new}_1, ..., s^{new}_d) and S'^{new} = (s'^{new}_1, ..., s'^{new}_d) denote the updates of S and S'.
(3) Iterations and stopping rule. Repeat (2) I times. Stop when the change in S^{new} is small. Record the last S and S' as S^{stop} and S'^{stop}, respectively.
(4) Deletion step. Using S^{stop} and S'^{stop}, delete X_l if s^{stop}_l − s'^{stop}_l < √2 SD(S'^{stop}). If d is small, we can generate several sets of S'^{stop} and then use the sample standard deviation of all the values in these sets as SD(S'^{stop}). See Remark 3.3.
End of the algorithm.
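To summarize the control flow of steps (0)-(4), a compact R sketch is given below, assembled from the helper functions sketched earlier (tube_members, local_efficacy, importance_score, adaptive_weights). It is an illustration of the structure of the algorithm for numerical predictors only, with arbitrary default tuning values, and is not the authors' implementation.

    # Compact sketch of the CATCH loop (numerical predictors only)
    catch_select <- function(X, y, I = 5, a = 0.15, b = 0.5) {
      d <- ncol(X); n <- nrow(X)
      X <- scale(X)                                        # step (0): standardize
      s <- rep(1, d); s_prime <- rep(0, d)                 # step (1): initialize S and S'
      M <- floor(n^b); m <- max(10, floor(n^a))
      for (it in seq_len(I)) {                             # step (3): repeat step (2) I times
        for (l in seq_len(d)) {                            # step (2): tube hunting loop
          w <- adaptive_weights(s, s_prime, l)
          s[l]       <- importance_score(X, y, l, M = M, m = m, w = w)
          s_prime[l] <- importance_score(X, sample(y), l, M = M, m = m, w = w)  # permuted-Y threshold, step (d)
        }
      }
      which(s - s_prime >= sqrt(2) * sd(s_prime))          # step (4): variables kept by the deletion rule
    }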

Remark 3.2. Step (d) needs to be replaced by (d') below when a global permutation of Y does not result in an appropriate "local null distribution". For example, when the sample sizes of the different classes are extremely unequal (an unbalanced classification problem), it is very likely to obtain a pure class of Y in a local region after the global permutation, and the threshold S' will be spuriously low; irrelevant variables will therefore be falsely selected into the model. The local thresholding approach below is more nonparametric but more time consuming.

(d') Local thresholding. Let Y* denote a random variable which is distributed as Y but independent of X_l given X_{−l}. Let (Y^{(1)}_0, ..., Y^{(m)}_0) be a random permutation of the Y's inside the tube; then (Y^{(1)}_0, ..., Y^{(m)}_0) can be viewed as a realization of Y*. If X_l is a numerical variable, let \hat{\zeta}'_l(X^*_k) be the estimated local contingency tube efficacy of Y* on X_l at X^*_k as defined in (3.5). If X_l is a categorical variable, let \hat{\zeta}'_l(X^*_k) be the estimated contingency tube efficacy of Y* on X_l at X^*_k as defined in (3.7).

Remark 3.3 (The threshold). Under the assumption H^{(l)}_0 that Y is independent of X_l given X_{−l} in T_l, s_l and s'_l are nearly independent and SD(s_l − s'_l) ≈ √2 SD(s'_l). Thus the rule that deletes X_l when s^{stop}_l − s'^{stop}_l < √2 SE(S'^{stop}) is similar to the one-standard-deviation rule of CART and GUIDE. The choice of threshold is also discussed in Section 7. Alternatively, we calculate p-values for each covariate by permutation tests: more specifically, we permute the response variable, obtain the importance scores for the permuted data, and then calculate how extreme s^{stop}_l is compared to the permuted scores. We then identify a covariate as important if its p-value is less than 0.1.

Remark 3.4 (Large d, small n). A key idea in this paper is that when determining the importance of the variable X_j we restrict (condition) the vector X_{−j} to a relatively small set. This is done to address the problem of spurious correlation, which is of great concern in genomics (Price et al. [13], Lin and Zeng [11]). In such genomic studies the number of predictors d typically greatly exceeds the sample size n. For numerical variables this can be dealt with by replacing X_{−j} with a vector of principal components explaining most of the variation in X_{−j}. Although our paper does not deal with the d ≫ n case directly, this remark shows that it is also applicable to the genome-wide association studies described in Price et al. [13].

Remark 3.5. We now provide the computational complexity of the algorithm. Recall the notation: I is the number of repetitions, M is the number of tube centers, m is the tube size, d is the dimension and n is the sample size. The number of operations needed for the algorithm is I × d × M × (2b + 2c + 2d) + 2e, where 2b, 2c, 2d and 2e denote the numbers of operations required for steps (2b), (2c), (2d) and (2e). The computational complexities of steps (2b), (2c), (2d) and (2e) are O(nd), O(m), O(m) and O(M), respectively. Since nd ≫ m, the required computation is dominated by step (2b), and the computational complexity of the algorithm is O(IMnd²). Solving a least squares problem requires O(nd²) operations, so our algorithm has higher computational complexity by a factor of IM compared to least squares. However, the computations for the M tube centers are independent of each other and can be performed in parallel.


4 Simulation Studies

4.1 Variable Selection with Categorical and Numerical Predictors

Example 4.1. Consider d ≥ 5 predictors X_1, X_2, ..., X_d and a categorical response variable Y. The fifth predictor X_5 is a Bernoulli random variable which takes the two values 0 and 1 with equal probability. All other predictors are uniformly distributed on [0, 1].

When X_5 = 0, the dependent variable Y is related to the predictors in the following fashion:

Y = 0 if X_1 − X_2 − 0.25 sin(16πX_2) ≤ −0.5,
Y = 1 if −0.5 < X_1 − X_2 − 0.25 sin(16πX_2) ≤ 0,
Y = 2 if 0 < X_1 − X_2 − 0.25 sin(16πX_2) ≤ 0.5,
Y = 3 if X_1 − X_2 − 0.25 sin(16πX_2) > 0.5.   (4.1)

When X_5 = 1, Y depends on X_3 and X_4 in the same way as above, with X_1 replaced by X_3 and X_2 replaced by X_4. The other variables X_6, ..., X_d are irrelevant noise variables. All d predictors are independent.
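For readers who wish to reproduce this design, a short R sketch of the data-generating mechanism (4.1) is given below; the function name and the defaults n = 400 and d = 10 are ours.

    # Sketch of data generation for Example 4.1, model (4.1)
    gen_example41 <- function(n = 400, d = 10) {
      X <- matrix(runif(n * d), n, d)
      X[, 5] <- rbinom(n, 1, 0.5)                           # X5 is Bernoulli(0.5)
      u <- ifelse(X[, 5] == 0,
                  X[, 1] - X[, 2] - 0.25 * sin(16 * pi * X[, 2]),
                  X[, 3] - X[, 4] - 0.25 * sin(16 * pi * X[, 4]))
      Y <- cut(u, breaks = c(-Inf, -0.5, 0, 0.5, Inf), labels = 0:3)
      data.frame(Y = Y, X)
    }
    dat <- gen_example41()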

In addition to the presence of the noise variables X_6, ..., X_d, this example is challenging in two respects. First, there is an interaction effect between the continuous predictors X_1, ..., X_4 and the categorical predictor X_5. Such an interaction effect may arise in practice: for example, suppose Y represents the effect resulting from multiple treatments X_1, ..., X_4 and X_5 represents the gender of a patient; our model then allows males and females to respond differently to the treatments. Secondly, the classification boundaries for fixed X_5 are highly nonlinear, as shown in Figure 1. This design demonstrates that our method can detect nonlinear dependence between the response and the predictors.

Table 2 shows, for 100 simulation trials from model (4.1), the number of simulations in which the predictors X_1, ..., X_5 are correctly kept in the model and, in column 6, the average number of simulations in which the d − 5 irrelevant predictors are falsely selected into the model. In the simulation, n = 400 and d = 10, 50 and 100 are used. We compare CATCH with other methods, including Marginal CATCH, the Random Forest classifier (RFC) (Breiman [1]) and GUIDE (Loh [12]). For Marginal CATCH we test the association between Y and X_l without conditioning on X_{−l}. RFC is well known as a powerful tool for prediction; here we use the importance scores from RFC for variable selection. We create an additional 30 noise variables and use their average importance score multiplied by a factor of 2 as a threshold (Doksum et al. [3]) to decide whether a variable should be selected into the model. In particular, we generate the noise variables using uniform and Bernoulli distributions, respectively, for numerical and categorical predictors.

The main goal of GUIDE is to solve the variable selection bias problem in regression and classification tree methods. As an illustration, if a categorical predictor takes K distinct values, then the splitting test exhausts 2^{K-1} − 1 possibilities and the predictor is therefore biased toward being selected.

Table 2: Simulation results from Example 4.1 with sample size n = 400. For 1 ≤ i ≤ 5, the ith column gives the estimated probability of selection of the relevant variable X_i and its standard error, based on 100 simulations. Column 6 gives the mean of the estimated probability of selection and the standard error for the irrelevant variables X_6, ..., X_d.

                               Number of Selections (100 simulations)
METHOD            d     X1           X2           X3           X4           X5           Xi, i >= 6
CATCH             10    0.82 (0.04)  0.75 (0.04)  0.78 (0.04)  0.82 (0.04)  1 (0)        0.05 (0.02)
                  50    0.76 (0.04)  0.76 (0.04)  0.87 (0.03)  0.84 (0.04)  0.96 (0.02)  0.08 (0.03)
                  100   0.77 (0.04)  0.73 (0.04)  0.82 (0.04)  0.80 (0.04)  0.83 (0.04)  0.09 (0.03)
Marginal CATCH    10    0.77 (0.04)  0.67 (0.05)  0.70 (0.05)  0.80 (0.04)  0.06 (0.02)  0.10 (0.03)
                  50    0.69 (0.05)  0.70 (0.05)  0.76 (0.04)  0.79 (0.04)  0.07 (0.03)  0.10 (0.03)
                  100   0.78 (0.04)  0.76 (0.04)  0.73 (0.04)  0.73 (0.04)  0.12 (0.03)  0.10 (0.03)
Random Forest     10    0.71 (0.05)  0.51 (0.05)  0.47 (0.05)  0.59 (0.05)  0.37 (0.05)  0.06 (0.02)
                  50    0.64 (0.05)  0.61 (0.05)  0.65 (0.05)  0.58 (0.05)  0.28 (0.04)  0.08 (0.03)
                  100   0.66 (0.05)  0.58 (0.05)  0.59 (0.05)  0.65 (0.05)  0.17 (0.04)  0.11 (0.03)
GUIDE             10    0.96 (0.02)  0.98 (0.01)  0.93 (0.03)  0.99 (0.01)  0.92 (0.03)  0.33 (0.05)
                  50    0.82 (0.04)  0.85 (0.04)  0.90 (0.03)  0.89 (0.03)  0.67 (0.05)  0.12 (0.03)
                  100   0.83 (0.04)  0.84 (0.04)  0.86 (0.03)  0.83 (0.04)  0.61 (0.05)  0.08 (0.03)

Figure 1: The relationship between Y and (X_1, X_2) for model (4.1). The four areas represent the four values of Y.

The way that GUIDE alleviates this bias is to employ lack-of-fit tests, i.e., χ²-tests applicable to both numerical and categorical predictors, with the p-values adjusted by a bootstrap procedure. Since GUIDE has negligible selection bias, it can include tests for local pairwise interactions. Furthermore, it achieves sensitivity to local curvature by using a linear model at each node.

The results in Table 2 show that Marginal CATCH does a very poor job of selecting X_5, which illustrates the importance of local conditioning. RFC does better than Marginal CATCH but also has difficulties, while CATCH and GUIDE can successfully identify X_5 along with X_1 to X_4 with high frequency. When d = 10, GUIDE does very well for the relevant variables but selects too many irrelevant variables. As d increases, surprisingly, GUIDE selects fewer irrelevant variables; it becomes more cautious as the dimension d increases. CATCH performs very well in the high dimensional cases (d = 50 and d = 100). This indicates that the CATCH algorithm is robust to the intricate interaction structures between the predictors.

In model (4.1) above, the predictors interact in a hierarchical way. When X_5 = 0, the response is related to X_1 and X_2 as shown in Figure 1, where the two axes are the two relevant predictors and the four areas partitioned by the curves correspond to the different values of Y. When X_5 = 1, the response depends on X_3 and X_4 in the same way. This type of hierarchical interaction between the categorical and numerical predictors is difficult to detect even by state-of-the-art classification methods such as the support vector machine (SVM) and RFC. In particular, RFC produces an importance score vector that identifies the four numerical variables X_1, ..., X_4 as very significant but often leaves the categorical variable X_5 unidentified, as we see in Table 2.

The CATCH algorithm however performs very well for this complicated model Wenow explain why CATCH is able to identify X5 as an important variable To explorethe association between X5 and Y we construct M contingency tables of 2 rows and4 columns based on M randomly centered tubes calculate contingency efficacy scoresand then average the scores Figure 2 (a) and (b) illustrate how one of the contingencyefficacy scores is calculated 1000 data points are generated from model (41) and theyare split into two sets according to x5 (a) shows approximately half of the points withx5 = 0 in (x1 x2) dimensions and (b) shows the rest points with x5 = 1 in (x3 x4)dimensions The data points are colored according to Y This representation enablesus to clearly see the classification boundaries (grey lines) The randomly selected tubecenter when x5 is zero is shown as a star in (a) The data points within the tube closeto the tube center in terms of Xminus5 are highlighted by larger dots Since the distancedefining the tube does not involve X5 the larger dots have x5 equal 0 or 1 and arepresent in both (a) and (b) The counts of the larger dots in different colors in (a) and(b) correspond to the cell counts of the two rows x5 = 0 and x5 = 1 in the contingencytable As we can see larger dots in (a) are mostly red and larger dots in (b) are mostlyblue and orange which shows that distribution in Y depends on the value of X5 andthe contingency efficacy score is high We expect the contingency efficacy score wouldbe low only if the tube center has similar values in (x1 x2) and (x3 x4) Such tubecenters would be very few among the M tube centers since they are randomly chosenThus the average contingency efficacy scores is high and X5 is identified as an importantvariables

To contrast this example with a simpler data generating model suppose the effectsof predictors are additive that is the categorical variable is related to the response asan added term to the effect of the numerical predictors In this type of simpler modelsthe RFC and GUIDE algorithm can efficiently identify relevant predictors but so canthe CATCH algorithm We do not include the simulation results for additive modelshere

Remark 4.1. We observe that the CATCH algorithm is not very sensitive to the choice of the number of tube centers M or the tube size m = [n^a], where 0 < a ≤ 1 and [ ] is the greatest integer function. For Example 4.1 we performed the simulation using M = n and M = n/2, and a = 0.1, 0.15, 0.2, 0.3, 0.5. The results are summarized in Table 3. The CATCH algorithm performs similarly for M = n and M = n/2, with M = n having a small winning edge. We generally suggest using M = n if computational resources are less of a concern or n is not large (≤ 200). When computational cost is a concern (Remark 3.5), one can reduce the number of tube centers to M = n/2 as a compromise between detection power and computational cost. Table 3 also shows that the tube size works well in the range between 0.1 and 0.3, while the detection power for X_5 decreases in the case a = 0.5, where the tube size is too big to achieve local conditioning. This observation is consistent with the comparison to Marginal CATCH in Table 2. On the other hand, the tube size should not be too small, so as to retain sufficient detection power. We suggest using a = 0.1 or 0.15 for sample sizes n ≥ 400, or m = 50 for smaller sample sizes. In light of these guidelines, we use M = n/2 = 200 and a = 0.15 for Examples 4.1-4.3, and M = n = 200 and m = 50 for Example 4.4.

Table 3: Sensitivity study of CATCH for Example 4.1 (n = 400, d = 50) using different numbers of tube centers M and tube sizes m = [n^a]. For 1 ≤ i ≤ 5, the ith column gives the number of times X_i was selected. Column 6 gives the percentage of times irrelevant variables were selected.

                    Number of Selections (100 simulations)
M          a       X1    X2    X3    X4    X5    Xi, i >= 6
M = 200    0.10    75    67    77    69    92    7.8
           0.15    76    76    87    84    96    8.2
           0.20    85    84    90    87    93    9.9
           0.30    89    87    95    89    85    9.6
           0.50    89    90    96    93    62    9.9
M = 400    0.10    74    76    82    77    92    7.2
           0.15    85    82    89    87    94    8.6
           0.20    87    88    90    86    96    9.1
           0.30    88    89    91    91    92    9.4
           0.50    89    91    94    93    61    10.2

In Example 4.1 all predictors are independent. In a similar setting, we next consider the case where the predictors are dependent.

Example 4.2. We generate standard normal random variables Z_i, i = 1, ..., d, in such a way that cor(Z_1, Z_4) = 0.5, cor(Z_3, Z_6) = 0.9, and the other Z_i's are independent. We set X_i = Φ(Z_i) for i = 1, ..., d, where Φ(·) is the CDF of a standard normal; therefore the X_i are marginally uniformly distributed, as in Example 4.1. The correlation structure between the X_i's is similar to that of the Z_i's. The dependent variable Y is generated in the same way as in Example 4.1.
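A short R sketch of this dependent design is given below; treating X_5 as I(Z_5 > 0) so that it remains binary as in Example 4.1 is our assumption, since the example does not spell out how X_5 is handled, and the function name and d ≥ 6 default are ours.

    # Sketch of the correlated design of Example 4.2 (Gaussian copula margins)
    library(MASS)
    gen_example42 <- function(n = 400, d = 10) {
      Sigma <- diag(d)
      Sigma[1, 4] <- Sigma[4, 1] <- 0.5                     # cor(Z1, Z4) = 0.5
      Sigma[3, 6] <- Sigma[6, 3] <- 0.9                     # cor(Z3, Z6) = 0.9
      Z <- mvrnorm(n, mu = rep(0, d), Sigma = Sigma)
      X <- pnorm(Z)                                         # X_i = Phi(Z_i), Uniform(0,1) margins
      X[, 5] <- as.numeric(Z[, 5] > 0)                      # assumption: X5 kept binary
      u <- ifelse(X[, 5] == 0,
                  X[, 1] - X[, 2] - 0.25 * sin(16 * pi * X[, 2]),
                  X[, 3] - X[, 4] - 0.25 * sin(16 * pi * X[, 4]))
      Y <- cut(u, c(-Inf, -0.5, 0, 0.5, Inf), labels = 0:3)
      data.frame(Y = Y, X)
    }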

The results of Example 4.2 are given in Table 4. Comparing with the results in Table 2, both CATCH and GUIDE do well in the presence of dependence between the numerical variables. The explanation is two-fold. First, both methods identify the categorical X_5 as an important predictor most of the time, which makes the identification of the other relevant predictors possible. Second, conditioning on X_5, Y depends on a pair of variables ("X_1 and X_2" or "X_3 and X_4") jointly, and in particular this dependence is nonlinear, which makes both methods less affected by the correlation introduced here.

Figure 2: Sample scatter plots of (x_1, x_2) and (x_3, x_4) for simulated data from model (4.1). The left panel (a) includes all the pairs (x_1, x_2) with x_5 = 0, while the right panel (b) includes all the pairs (x_3, x_4) with x_5 = 1. The larger dots in the two plots are within the tube around a tube center, shown as the star.

We now explain the latter in details for both methods In CATCH the algorithmadaptively weighs the effects of predictors by their importance scores in defining thetubes when they are being conditioned on therefore identifying either of the two willhelp to identify the other whereas an irrelevant variable though strongly associated withone of the relevant variables is down-weighted iteratively and does not strongly affectthe selection of the relevant variable This can explain the slight decreasing selectionprobability in X3 in presence of a strongly correlated X6 and consequently overall lessselection of X3 and X4 compared to X1 and X2 given the correlation between X1 andX4 In GUIDE the difference between the dependent and independent cases is evenless because GUIDE is very powerful in detecting local interaction by literally includingpairwise interactions when selecting splitting variables

Marginal CATCH and Random Forest do poorly in this case, mainly because X_5 cannot be detected, as we explained earlier. More specifically, the dependence between Y and X_4 in the X_5 = 1 case is the same as that between Y and X_2 in the X_5 = 0 case. The linear terms of X_1 and X_2 have opposite signs in the decision function; therefore the marginal effects of X_1 and X_4 are obscured when X_5 is not identified as an important variable.

4.2 Improving the Performance of SVM and RF Classifiers

The criterion used in the CATCH algorithm is not based on classification accuracy. But if we use CATCH to screen out the irrelevant variables and only use the important variables for classification, the performance of other algorithms such as the Support Vector Machine (SVM) and Random Forest Classifier (RFC) can be significantly improved.

Table 4: Simulation results from Example 4.2 with sample size n = 400. For 1 ≤ i ≤ 6, the ith column gives the estimated probability of selection of variable X_i and its standard error, based on 100 simulations, where X_i, i = 1, ..., 5, are relevant variables with cor(X_1, X_4) = 0.5. X_6 is an irrelevant variable that is correlated with X_3 with correlation 0.9. Column 7 gives the mean of the estimated probability of selection and that of the standard error for the irrelevant variables X_7, ..., X_d.

                               Number of Selections (100 simulations)
METHOD            d     X1           X2           X3           X4           X5           X6           Xi, i >= 7
CATCH             10    0.76 (0.04)  0.77 (0.04)  0.73 (0.04)  0.68 (0.05)  0.99 (0.01)  0.34 (0.05)  0.07 (0.02)
                  50    0.78 (0.04)  0.77 (0.04)  0.74 (0.04)  0.63 (0.05)  0.92 (0.03)  0.57 (0.05)  0.09 (0.03)
                  100   0.76 (0.04)  0.77 (0.04)  0.80 (0.04)  0.74 (0.04)  0.82 (0.04)  0.55 (0.05)  0.09 (0.03)
Marginal CATCH    10    0.35 (0.05)  0.63 (0.05)  0.80 (0.04)  0.32 (0.05)  0.09 (0.03)  0.71 (0.05)  0.12 (0.03)
                  50    0.34 (0.05)  0.71 (0.05)  0.70 (0.05)  0.30 (0.05)  0.10 (0.03)  0.60 (0.05)  0.11 (0.03)
                  100   0.34 (0.05)  0.68 (0.05)  0.75 (0.04)  0.30 (0.05)  0.10 (0.03)  0.59 (0.05)  0.10 (0.03)
Random Forest     10    0.29 (0.05)  0.48 (0.05)  0.55 (0.05)  0.20 (0.04)  0.30 (0.05)  0.47 (0.05)  0.05 (0.02)
                  50    0.26 (0.04)  0.52 (0.05)  0.57 (0.05)  0.30 (0.05)  0.24 (0.04)  0.45 (0.05)  0.08 (0.03)
                  100   0.36 (0.05)  0.61 (0.05)  0.66 (0.05)  0.37 (0.05)  0.32 (0.05)  0.46 (0.05)  0.12 (0.03)
GUIDE             10    0.96 (0.02)  0.96 (0.02)  0.94 (0.02)  0.95 (0.02)  0.96 (0.02)  0.84 (0.04)  0.30 (0.05)
                  50    0.80 (0.04)  0.92 (0.03)  0.88 (0.03)  0.79 (0.04)  0.72 (0.04)  0.68 (0.05)  0.12 (0.03)
                  100   0.81 (0.04)  0.89 (0.03)  0.90 (0.03)  0.76 (0.04)  0.75 (0.04)  0.58 (0.05)  0.08 (0.03)

We compare the prediction performance of (1) SVM, (2) RFC, (3) CATCH followed by SVM, and (4) CATCH followed by RFC in this section. We label (3) and (4) as CATCH+SVM and CATCH+RFC, respectively. We use the "svm" function in the "e1071" package in R, with a radial kernel, to perform the SVM analysis, and the "randomForest" function in the "randomForest" package to perform the RFC analysis. We use simulation experiments to illustrate the improvement obtained by using CATCH prior to classifying Y.

Let the true model be

(X, Y) \sim P,   (4.2)

where P will be specified later. In each of 100 Monte Carlo experiments, n pairs (X^{(i)}, Y^{(i)}), i = 1, ..., n, are generated from the true model (4.2). Based on the simulated data (X^{(i)}, Y^{(i)}), i = 1, ..., n, the methods (1) SVM, (2) RFC, (3) CATCH+SVM and (4) CATCH+RFC are used to classify a future Y^{(0)} for which we have available a corresponding X^{(0)}.

The integrated misclassification rate (IMR; Friedman [6]), defined below, is used to evaluate the performance of the classification algorithms. Let F_{X,Y}(·) be the classifier constructed by a classification algorithm based on X = (X^{(1)}, ..., X^{(n)}) and Y = (Y^{(1)}, ..., Y^{(n)})^T. Let (X^{(0)}, Y^{(0)}) be a future observation with the same distribution as (X^{(i)}, Y^{(i)}); we only know X^{(0)} = x^{(0)} and want to classify Y^{(0)}. For a possible future observation (x^{(0)}, y^{(0)}), define

MR(X, Y, x^{(0)}, y^{(0)}) = P\left( F_{X,Y}(x^{(0)}) \ne y^{(0)} \mid X, Y \right).   (4.3)

Our criterion is

IMR = E\, E_0\left[ MR(X, Y, x^{(0)}, y^{(0)}) \right],   (4.4)

where E_0 is with respect to the distribution of the random (x^{(0)}, y^{(0)}) and E is with respect to the distribution of (X, Y).

This double expectation is evaluated using Monte Carlo simulation; the result is denoted IMR*. More precisely, E_0 is estimated using 5000 simulated (x^{(0)}, y^{(0)}) observations, and then IMR* is computed on the basis of 100 simulated samples (X_i, Y_i), i = 1, ..., 100.
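The Monte Carlo evaluation of (4.4) can be sketched in a few lines of R. Here 'gen' stands for any data-generating function (for instance gen_example41 above), and the classifier is e1071::svm as used later in this section; the defaults mirror the 5000 future observations and 100 replications described above, but the wrapper itself is ours.

    # Illustrative Monte Carlo estimate of IMR* for an SVM classifier
    library(e1071)
    imr_star <- function(gen, n = 400, n_future = 5000, n_rep = 100) {
      mr <- replicate(n_rep, {
        train <- gen(n)                                   # training sample (X, Y)
        test  <- gen(n_future)                            # future observations (x^(0), y^(0))
        fit   <- svm(Y ~ ., data = train)                 # classifier F_{X,Y}
        mean(predict(fit, newdata = test) != test$Y)      # estimate of E_0[MR] in (4.3)
      })
      mean(mr)                                            # outer expectation over (X, Y) in (4.4)
    }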

Example 4.3. First consider model (4.1) in Section 4.1. Figure 3 shows IMR* versus the number of irrelevant variables for this example. The left panel shows that the IMR*'s of the SVM approach increase dramatically as the number of irrelevant variables increases, while the IMR*'s of CATCH+SVM are stable; thus CATCH stabilizes the performance of SVM for this complex model. The right panel shows that CATCH also stabilizes the performance of RFC. These results indicate that CATCH can improve the prediction accuracy of other classification algorithms.

Figure 3: IMR*'s of SVM, RFC, CATCH+SVM and CATCH+RFC versus the number of predictors d for the simulation study of model (4.1) in Example 4.1. The sample size is n = 400. (Left panel: error rate of SVM and CATCH+SVM; right panel: error rate of RFC and CATCH+RFC.)

Example 4.4 (Contaminated logistic model). Consider a binary response Y ∈ {0, 1} with

P(Y = 0 \mid X = x) = \begin{cases} F(x) & \text{if } r = 0, \\ G(x) & \text{if } r = 1, \end{cases}   (4.5)

where x = (x_1, ..., x_d),

r \sim \mathrm{Bernoulli}(\gamma),   (4.6)

F(x) = \left( 1 + \exp\{-(0.25 + 0.5 x_1 + x_2)\} \right)^{-1},   (4.7)

G(x) = \begin{cases} 0.1 & \text{if } 0.25 + 0.5 x_1 + x_2 < 0, \\ 0.9 & \text{if } 0.25 + 0.5 x_1 + x_2 \ge 0, \end{cases}   (4.8)

and x_1, x_2, ..., x_d are realizations of independent standard normal random variables. Hence, given X^{(i)} = x^{(i)}, 1 ≤ i ≤ n, the joint distribution of Y^{(1)}, Y^{(2)}, ..., Y^{(n)} is the contaminated logistic model with contamination parameter γ:

(1 - \gamma) \prod_{i=1}^{n} [F(x^{(i)})]^{Y^{(i)}} [1 - F(x^{(i)})]^{1 - Y^{(i)}} + \gamma \prod_{i=1}^{n} [G(x^{(i)})]^{Y^{(i)}} [1 - G(x^{(i)})]^{1 - Y^{(i)}}.   (4.9)

We perform a simulation study for n = 200, d = 50 and γ = 0, 0.2, 0.4, 0.6, 0.8 and 1, where γ = 0 corresponds to no contamination. Figure 4 shows the IMR*'s versus the different values of γ for this example. CATCH improves the performance of the SVM and Random Forest classifiers for this contaminated logistic regression model.
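A minimal R sketch of one draw from this contaminated model is given below. Following the mixture form (4.9), the contamination indicator r is drawn once per dataset, and F(x) is used as the success probability as in the likelihood (4.9); the function name and defaults are ours.

    # Sketch of one sample from the contaminated logistic model (4.5)-(4.9)
    gen_example44 <- function(n = 200, d = 50, gamma = 0.2) {
      X  <- matrix(rnorm(n * d), n, d)
      lp <- 0.25 + 0.5 * X[, 1] + X[, 2]
      r  <- rbinom(1, 1, gamma)                            # one contamination indicator per sample, as in (4.9)
      p  <- if (r == 0) 1 / (1 + exp(-lp)) else ifelse(lp < 0, 0.1, 0.9)
      Y  <- rbinom(n, 1, p)
      data.frame(Y = factor(Y), X)
    }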

Figure 4: IMR*'s of SVM, RFC, CATCH+SVM and CATCH+RFC versus the contamination parameter γ ∈ {0, 0.2, 0.4, 0.6, 0.8, 1} for simulation model (4.9) in Example 4.4. (Left panel: SVM and CATCH+SVM; right panel: RFC and CATCH+RFC.)

5 CATCH Analysis of The ANN Data

In this section CATCH is applied to a data set available at the UCI data base(httpftpicsuciedupubmachine-learning-databasesthyroid-disease) The data is from a clinical trial to determine whether a patient is hypothyroid (seeeg Quinlan [15]) The dataset is split into two parts a training set with 3772 ob-servations and a testing set with 3428 observations The response is a categoricalvariable with three classes normal (not hypothyroid) hyperfunction and subnormalfunctioning There are 21 predictors including 15 categorical predictors and 6 numer-ical predictors The three classes are very imbalanced with 925 of normal patientsA good classifier should produce a significant decrease from an error rate 75 whichcan be obtained by classifying all observations to the normal class


CATCH identifies 7 variables as important, including 4 continuous variables and 3 categorical variables. We consider three classification methods: 1. Support Vector Machine with radial kernel, denoted by SVM(r); 2. Support Vector Machine with polynomial kernel, denoted by SVM(p); 3. RFC. We apply these three methods on the training set to build classification models and use the test set to estimate the misclassification rate of the methods. We apply the methods with CATCH and without CATCH, respectively, where "with CATCH" means that only the 7 important variables identified by CATCH are used to build the model and "without CATCH" means that all 21 variables are used in the analysis.
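For concreteness, the "with CATCH" versus "without CATCH" comparison can be coded as in the following R sketch, using the e1071 and randomForest packages mentioned earlier in the paper; the objects train and test (data frames containing a factor response y and the 21 predictors) and catch_vars (the names of the 7 CATCH-selected predictors) are placeholders, not part of the paper. Replacing kernel = "radial" with kernel = "polynomial" gives the SVM(p) variant.

library(e1071)          # svm()
library(randomForest)   # randomForest()

err <- function(pred, truth) mean(pred != truth)   # test-set misclassification rate

fit_all <- svm(y ~ ., data = train, kernel = "radial")                        # without CATCH
fit_sel <- svm(y ~ ., data = train[, c("y", catch_vars)], kernel = "radial")  # with CATCH
err(predict(fit_all, test), test$y)
err(predict(fit_sel, test), test$y)

rf_all <- randomForest(y ~ ., data = train)                      # y must be a factor
rf_sel <- randomForest(y ~ ., data = train[, c("y", catch_vars)])
err(predict(rf_all, test), test$y)
err(predict(rf_sel, test), test$y)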

Table 5: Misclassification rates of three classification methods (SVM(r), SVM(p), RFC) with CATCH and without CATCH.

                 SVM(r)   SVM(p)   RFC
  w/o CATCH      0.052    0.063    0.024
  with CATCH     0.037    0.055    0.015

Table 5 shows that if CATCH is used to select important variables before the classification methods are applied, the misclassification rates decrease. CATCH reduces the misclassification rates by 29%, 13%, and 37% for SVM(r), SVM(p), and RFC, respectively. CATCH + RFC has the best performance. Although CATCH is not designed for improving classification accuracy, it nonetheless can help other classification methods to perform better.

6 Properties of Local Contingency Efficacy

In this section we investigate the properties of local contingency efficacy. For a univariate continuous predictor, Theorem 6.1 below shows that the local contingency efficacy is well defined and equals 0 if the response variable is independent of the predictor. Consider an R × C contingency table. Let n_{rc}, r = 1, ..., R, c = 1, ..., C, be the observed frequency in the (r, c)-cell and let p_{rc} be the probability that an observation belongs to the (r, c)-cell. Let

p_r = \sum_{c=1}^{C} p_{rc}, \qquad q_c = \sum_{r=1}^{R} p_{rc}, \qquad (6.1)

and

n_{r\cdot} = \sum_{c=1}^{C} n_{rc}, \qquad n_{\cdot c} = \sum_{r=1}^{R} n_{rc}, \qquad n = \sum_{r=1}^{R} \sum_{c=1}^{C} n_{rc}, \qquad E_{rc} = \frac{n_{r\cdot}\, n_{\cdot c}}{n}. \qquad (6.2)

Consider testing that the rows and the columns are independent, i.e.,

H_0: \; p_{rc} = p_r q_c \ \text{for all } r, c \quad \text{versus} \quad H_1: \; p_{rc} \neq p_r q_c \ \text{for some } r, c. \qquad (6.3)

A standard test statistic is

\mathcal{X}^2 = \sum_{r=1}^{R} \sum_{c=1}^{C} \frac{(n_{rc} - E_{rc})^2}{E_{rc}}. \qquad (6.4)

Under H_0, \mathcal{X}^2 converges in distribution to \chi^2_{(R-1)(C-1)}. Moreover, defining 0/0 = 0, we have the well-known result:

Lemma 6.1. As n tends to infinity, \mathcal{X}^2 / n tends in probability to

\tau^2 \equiv \sum_{r=1}^{R} \sum_{c=1}^{C} \frac{(p_{rc} - p_r q_c)^2}{p_r q_c}.

Moreover, if p_r q_c is bounded away from 0 and 1 as n → ∞, and if R and C are fixed as n → ∞, then \tau^2 = 0 if and only if H_0 is true.

Remark 6.1. Lemma 6.1 shows that the contingency efficacy ζ in (2.8) is well defined. Moreover, ζ = 0 if and only if X and Y are independent.
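In code, the empirical contingency efficacy is simply the square root of the chi-square statistic (6.4) divided by the number of observations used to build the table, as in (2.9). The R sketch below (the function name contingency_efficacy is ours) computes it directly from two categorical vectors, e.g., the response and a categorical predictor, or the response and the two halves of a local section.

# zeta_hat = sqrt(X^2 / n) computed from the R x C table of counts n_rc.
contingency_efficacy <- function(x, y) {
  tab <- table(x, y)                              # observed counts n_rc
  n   <- sum(tab)
  E   <- outer(rowSums(tab), colSums(tab)) / n    # expected counts E_rc = n_r. n_.c / n
  cells <- (tab - E)^2 / E
  cells[E == 0] <- 0                              # convention 0/0 = 0
  sqrt(sum(cells) / n)
}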

Theorem 6.1. Consider a categorical response Y and a continuous predictor X with density function f(x) such that, for a fixed h > 0, f(x) > 0 for x ∈ (x^{(0)} - h, x^{(0)} + h). Then

(1) For fixed h, the estimated local section efficacy \hat{\zeta}(x^{(0)}, h) defined in (2.5) converges almost surely to the section efficacy ζ(x^{(0)}, h) defined in (2.4).

(2) If X and Y are independent, then the local efficacy ζ(x^{(0)}) defined in (2.4) is zero.

(3) If there exists c ∈ {1, ..., C} such that p_c(x) = P(Y = c | x) has a continuous derivative at x = x^{(0)} with p'_c(x^{(0)}) ≠ 0, then ζ(x^{(0)}) > 0.

The χ^2 statistic (6.4) can be written in terms of multinomial proportions \hat{p}_{rc}, with \sqrt{n}(\hat{p}_{rc} - p_{rc}), 1 ≤ r ≤ R, 1 ≤ c ≤ C, converging in distribution to a multivariate normal. By a Taylor expansion of T_n \equiv \sqrt{\mathcal{X}^2 / n} at \tau = \text{plim}\, \sqrt{\mathcal{X}^2 / n}, we find that


Theorem 6.2.

\sqrt{n}\,(T_n - \tau) \longrightarrow_d N(0, v^2), \qquad (6.5)

where the asymptotic variance v^2, which can be obtained from the Taylor expansion, is not needed in this paper.
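For completeness, here is a sketch (ours, via the standard delta method; the paper does not spell this step out) of how (6.5) follows from the asymptotic normality of \mathcal{X}^2/n when \tau > 0:

\sqrt{n}\left( \tfrac{1}{n}\mathcal{X}^2 - \tau^2 \right) \longrightarrow_d N(0, \sigma^2)
\quad \Longrightarrow \quad
\sqrt{n}\left( \sqrt{\tfrac{1}{n}\mathcal{X}^2} - \tau \right) \longrightarrow_d N\!\left( 0, \frac{\sigma^2}{4\tau^2} \right),

since for g(u) = \sqrt{u} the delta method gives asymptotic variance [g'(\tau^2)]^2 \sigma^2 = \sigma^2 / (4\tau^2); thus v^2 = \sigma^2 / (4\tau^2).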

Remark 6.2. The result (6.5) applies to the multivariate χ^2-based efficacies \hat{\zeta}, with n replaced by the number of observations that go into the computation of \hat{\zeta}. In this case we use the notation \mathcal{X}^2_j for the chi-square statistic for the jth variable.

7 Consistency Of CATCH

Let Y denote a variable to be classified into one of the categories 1, ..., C by using the X_j's in the vector X = (X_1, ..., X_d). Let X_{-j} ≡ X − {X_j} denote X without the X_j component. Assume that the squared Pearson nonparametric correlation (Doksum and Samarov [4], Huang and Chen [9]) between X_j and X_{-j} is less than 0.5. We call the variable X_j relevant for Y if, conditionally given X_{-j}, L(Y | X_j = x) is not constant in x, and irrelevant otherwise. Our data are (X^{(1)}, Y^{(1)}), ..., (X^{(n)}, Y^{(n)}), i.i.d. as (X, Y). A variable selection procedure is consistent if the probability that all the relevant variables are selected and none of the irrelevant variables are selected tends to one as n tends to infinity.

We give a general nonparametric framework in which the variable selection rule based on the CATCH importance scores \hat{s}_j given in (3.8) is consistent.

When the predictors are not all categorical and CATCH involves tubes, we assume that the number of observations in each tube tends to ∞ as n → ∞, and we assume that, for those importance scores that require the selection of a section size h, the selection is from a finite set {h_j} with min_j{h_j} > h_0 for some h_0 > 0. It follows that the number of observations m_{jk} in the kth tube used to calculate the importance score \hat{s}_j for the variable X_j tends to infinity as n → ∞.

The key quantities that determine consistency are the limits λ_j of \hat{s}_j. That is,

\lambda_j = \tau(\zeta_j, P) = E_P[\zeta_j(X)] = \int \zeta_j(x)\, dP(x), \qquad j = 1, \ldots, d,

where ζ_j(x) is the local contingency efficacy defined by (2.4) for numerical X's and by (2.8) for categorical X's, and P is the probability distribution of X. Our estimate \hat{s}_j of τ_j is

\hat{s}_j = \tau(\hat{\zeta}_j, \hat{P}) = \frac{1}{M} \sum_{k=1}^{M} \hat{\zeta}_j(X^*_k),

where \hat{\zeta}_j is defined in Section 3 and \hat{P} is the empirical distribution of X.

Let d_0 denote the number of X's that are irrelevant for Y. Set d_1 = d - d_0 and, without loss of generality, reorder the λ_j so that

\lambda_j > 0 \ \text{if}\ j = 1, \ldots, d_1; \qquad \lambda_j = 0 \ \text{if}\ j = d_1 + 1, \ldots, d. \qquad (7.1)

Our variable selection rule is

"keep variable X_j if \hat{s}_j \ge t, for some threshold t". \qquad (7.2)

Rule (7.2) is consistent if and only if

P\left( \min_{j \le d_1} \hat{s}_j \ge t \ \text{and}\ \max_{j > d_1} \hat{s}_j < t \right) \longrightarrow 1. \qquad (7.3)

To establish (7.3) it is enough to show that

P\left( \max_{j > d_1} \hat{s}_j > t \right) \longrightarrow 0 \qquad (7.4)

and

P\left( \min_{j \le d_1} \hat{s}_j \le t \right) \longrightarrow 0. \qquad (7.5)

We call (7.4) and (7.5) type I and type II consistency, respectively. Type I and II error probabilities refer to the probabilities of any false positives and any false negatives, respectively.
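In code, the importance scores and rule (7.2) amount to the short R sketch below; zeta_hat (an M × d matrix whose (k, j) entry is \hat{\zeta}_j(X^*_k)) and the threshold t are assumed inputs, not quantities computed here.

s_hat    <- colMeans(zeta_hat)    # s_hat[j] = (1/M) * sum_k zeta_hat_j(X*_k)
selected <- which(s_hat >= t)     # indices of the variables kept by rule (7.2)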

7.1 Type I consistency

We consider (7.4) first and note that

P\left( \max_{j > d_1} \hat{s}_j > t \right) \le \sum_{j = d_1 + 1}^{d} P(\hat{s}_j > t). \qquad (7.6)

For the jth irrelevant variable X_j, j > d_1, the results in Section 6 apply to the local efficacy \hat{\zeta}_j(x) at a fixed point x. The importance score \hat{s}_j defined in (3.8) is the average of such efficacies over a sample of M random points x^*_a. We assume that there is a point x^{(0)} such that, for irrelevant X_j's and for n large enough, \hat{\zeta}_j(x^{(0)}) is stochastically larger in the right tail than the average M^{-1} \sum_{a=1}^{M} \hat{\zeta}_j(x^*_a). That is, writing \hat{s}^{(0)}_j \equiv \hat{\zeta}_j(x^{(0)}), we assume that there exist t > 0 and n_0 such that for n \ge n_0,

P(\hat{s}_j > t) \le P(\hat{s}^{(0)}_j > t), \qquad j > d_1. \qquad (7.7)

The existence of such a point x^{(0)} is established by considering the probability limit of

\arg\max\left\{ \hat{\zeta}_j(x^*_a) : x^*_a \in \{x^*_1, \ldots, x^*_M\} \right\}.

Note that, by Lemma 6.1, \hat{s}^{(0)}_j \rightarrow_p 0 as n → ∞. It follows that we have established (7.4) for t and d_0 = d - d_1 that are fixed as n → ∞. To examine the interesting case where d_0 → ∞ as n → ∞, we need large deviation results, to which we turn next. In the d_0 → ∞ case we consider t ≡ t_n that tends to ∞ as n → ∞, and we use the weaker assumption: there exist x^{(0)} and n_0 such that

P(\hat{s}_j > t_n) \le P(\hat{s}^{(0)}_j > t_n) \qquad (7.8)

for n ≥ n_0 and each t_n with \lim_{n\to\infty} t_n = \infty. We also assume that the number of columns C and rows R in the contingency table that defines the efficacy \mathcal{X}^2_{jn} given by (6.4) stays fixed as n → ∞. In this case P(\hat{s}^{(0)}_j > t_n) \to 0 provided each term

[T^{(j)}_{rc}]^2 \equiv \left[ \frac{n_{rc} - E_{rc}}{\sqrt{n}\,\sqrt{E_{rc}}} \right]^2

in \hat{s}_j^2 = \mathcal{X}^2_{jn}, defined for X_j by (6.4), satisfies

P\left( |T^{(j)}_{rc}| > t_n \right) \longrightarrow 0.

This is because

P\left( \sqrt{\sum_r \sum_c [T^{(j)}_{rc}]^2} > t_n \right) \le P\left( RC \max_{r,c} [T^{(j)}_{rc}]^2 > t_n^2 \right) \le RC \max_{r,c} P\left( |T^{(j)}_{rc}| > t_n / \sqrt{RC} \right),

and t_n and t_n / \sqrt{RC} are of the same order as n → ∞. By large deviation theory (Hall [8] and Jing et al. [10]), for some constant A,

P\left( |T^{(j)}_{rc}| > t_n \right) = \frac{2 t_n^{-1}}{\sqrt{2\pi}}\, e^{-t_n^2 / 2} \left[ 1 + A t_n^3 n^{-1/2} + o\!\left( t_n^3 n^{-1/2} \right) \right], \qquad (7.9)

provided t_n = o(n^{1/6}). If (7.9) is uniform in j > d_1, then (7.6) and (7.9) imply that

type I consistency holds if

d_0 \left( t_n / \sqrt{2} \right)^{-1} e^{-t_n^2 / 2} \rightarrow 0. \qquad (7.10)

To examine (7.10), write d_0 and t_n as d_0 = \exp(n^b) and t_n = \sqrt{2}\, n^r. By large deviation theory, (7.10) holds when r < 1/6. Thus if 0 < \tfrac{1}{2} b < r < \tfrac{1}{6}, then

d_0 \left( t_n / \sqrt{2} \right)^{-1} e^{-t_n^2 / 2} = n^{-r} \exp\left( n^b - n^{2r} \right) \rightarrow 0.

Thus d_0 = \exp(n^b) with b = 1/3 - ε, for any ε > 0, is the largest d_0 can be to have consistency. For instance, if there are exactly d_0 = \exp(n^{1/4}) irrelevant variables and we use the threshold t_n = \sqrt{2}\, n^{1/7}, the probability of selecting any of the irrelevant variables tends to zero at the rate

n^{-1/4} \exp\left( n^{1/4} - n^{2/7} \right),

which indeed tends to zero since 2/7 > 1/4.

7.2 Type II consistency

We now assume that for each relevant variable X_j there is a least favorable point x^{(1)} such that the local efficacy \hat{s}^{(1)}_j \equiv \hat{\zeta}_j(x^{(1)}) is stochastically smaller than the importance score \hat{s}_j = M^{-1} \sum_{a=1}^{M} \hat{\zeta}_j(x^*_a). That is, we assume that for each relevant variable X_j there exist n_1 and x^{(1)} such that for n ≥ n_1,

P\left( \hat{s}^{(1)}_j \le t_n \right) \ge P\left( \hat{s}_j \le t_n \right), \qquad j \le d_1. \qquad (7.11)

The existence of such a least favorable x^{(1)} is established by considering the probability limit of

\arg\min\left\{ \hat{\zeta}_j(x^*_a) : x^*_a \in \{x^*_1, \ldots, x^*_M\} \right\}.

We again assume that R and C are fixed, so that to show type II consistency it is enough to show that, for each (r, c) and j ≤ d_1,

P\left( |T^{(j)}_{rc}| \le t_n \right) \rightarrow 0.

This is because

P\left( \sqrt{\sum_r \sum_c [T^{(j)}_{rc}]^2} \le t_n \right) \le P\left( \min_{r,c} [T^{(j)}_{rc}]^2 \le t_n^2 \right) \le RC \max_{r,c} P\left( |T^{(j)}_{rc}| \le t_n \right).

By Theorem 6.2, \sqrt{n}\left( \hat{s}^{(1)}_j - \tau_j \right) \rightarrow_d N(0, \nu^2), \tau_j > 0. It follows that if t_n and d_1 are bounded above as n → ∞ and \tau_j > 0, then P\left( \hat{s}^{(1)}_j \le t_n \right) \rightarrow 0. Thus type II consistency holds in this case. To examine the more interesting case where t_n and d_1 are not bounded above and \tau_j may tend to zero as n → ∞, we will need to center the T^{(j)}_{rc}. Thus set

\theta^{(j)}_n = \frac{\sqrt{n}\left( p^{(j)}_{rc} - p^{(j)}_r p^{(j)}_c \right)}{\sqrt{p^{(j)}_r p^{(j)}_c}}, \qquad (7.12)

T^{(j)}_n = T^{(j)}_{rc} - \theta^{(j)}_n,

where, for simplicity, the dependence of T^{(j)}_n and \theta^{(j)}_n on r and c is not indicated.

Lemma 7.1.

P\left( \min_{j \le d_1} |T^{(j)}_{rc}| \le t_n \right) \le \sum_{j=1}^{d_1} P\left( |T^{(j)}_n| \ge \min_{j \le d_1} |\theta^{(j)}_n| - t_n \right).

Proof.

P\left( \min_{j \le d_1} |T^{(j)}_{rc}| \le t_n \right) = P\left( \min_{j \le d_1} |T^{(j)}_n + \theta^{(j)}_n| \le t_n \right)
\le P\left( \min_{j \le d_1} |\theta^{(j)}_n| - \max_{j \le d_1} |T^{(j)}_n| \le t_n \right)
= P\left( \max_{j \le d_1} |T^{(j)}_n| \ge \min_{j \le d_1} |\theta^{(j)}_n| - t_n \right)
\le \sum_{j=1}^{d_1} P\left( |T^{(j)}_n| \ge \min_{j \le d_1} |\theta^{(j)}_n| - t_n \right).

(The second step uses |T^{(j)}_n + \theta^{(j)}_n| \ge |\theta^{(j)}_n| - |T^{(j)}_n|.)

Set

a_n = \min_{j \le d_1} |\theta^{(j)}_n| - t_n;

then, by large deviation theory, if a_n = o(n^{1/6}),

P\left( |T^{(j)}_n| \ge a_n \right) \propto \left( a_n / \sqrt{2} \right)^{-1} \exp\{-a_n^2 / 2\}, \qquad (7.13)

where \propto signifies asymptotic order as n → ∞, a_n → ∞. It follows that if (7.13) holds uniformly in j, then by Lemma 7.1,

P\left( \min_{j \le d_1} |T^{(j)}_{rc}| \le t_n \right) \propto d_1 \left( a_n / \sqrt{2} \right)^{-1} \exp\{-a_n^2 / 2\}.

Thus type II consistency holds when a_n = o(n^{1/6}) and

d_1 \left( a_n / \sqrt{2} \right)^{-1} \exp\{-a_n^2 / 2\} \rightarrow 0. \qquad (7.14)

In Section 7.1, t_n is of order n^r, 0 < r < 1/6. For this t_n, type II consistency holds if a_n is of order n^r, 0 < r < 1/6, and

d_1 n^{-r} \exp\{-n^{2r} / 2\} \rightarrow 0.

If a_n = O(n^k) with k ≥ 1/6, then P\left( |T^{(j)}_n| \ge a_n \right) tends to zero at a rate faster than in (7.13) and consistency is ensured. For instance, if all relevant variables X_j have p^{(j)}_{rc} fixed as n → ∞, then \min_{j \le d_1} |\theta^{(j)}_n| is of order \sqrt{n} and a_n is of order n^{1/2} - t_n.

7.3 Type I and II consistency

We see from Sections 7.1 and 7.2 that consistency holds under a variety of conditions. One set of conditions is:

Theorem 7.1. Under the regularity conditions of Sections 7.1 and 7.2, if

d_0 = c_0 \exp(n^b), \quad t_n = c_1 n^r, \quad \min_{j \le d_1} |\theta^{(j)}_n| - t_n = c_2 n^r, \quad d_1 = c_3 \exp(n^b)

for positive constants c_0, c_1, c_2, and c_3, and 0 < \tfrac{1}{2} b < r < \tfrac{1}{6}, then CATCH is consistent.

Appendix: Proof of Theorem 6.1

(1) Let N_h = \sum_{i=1}^{n} I\left( X^{(i)} \in N_h(x^{(0)}) \right). Then N_h \sim \text{Bi}\left( n, \int_{x^{(0)} - h}^{x^{(0)} + h} f(x)\, dx \right) and N_h \rightarrow_{a.s.} \infty as n → ∞. Moreover,

\hat{\zeta}(x^{(0)}, h) = n^{-1/2} \sqrt{\mathcal{X}^2(x^{(0)}, h)} = (N_h / n)^{1/2} (N_h)^{-1/2} \sqrt{\mathcal{X}^2(x^{(0)}, h)}.

Since N_h \sim \text{Bi}\left( n, \int_{x^{(0)} - h}^{x^{(0)} + h} f(x)\, dx \right), we have

N_h / n \rightarrow \int_{x^{(0)} - h}^{x^{(0)} + h} f(x)\, dx. \qquad (7.15)

By Lemma 6.1 and Table 1,

(N_h)^{-1/2} \sqrt{\mathcal{X}^2(x^{(0)}, h)} \rightarrow \left( \sum_{r = +,-} \sum_{c=1}^{C} \frac{(p_{rc} - p_r q_c)^2}{p_r q_c} \right)^{1/2}, \qquad (7.16)

where

p_{+c} = \Pr\left( X > x^{(0)}, Y = c \mid X \in N_h(x^{(0)}) \right), \qquad p_{-c} = \Pr\left( X \le x^{(0)}, Y = c \mid X \in N_h(x^{(0)}) \right)

for r = +, - and c = 1, ..., C,

p_r = \sum_{c=1}^{C} p_{rc}, \qquad q_c = \sum_{r = +,-} p_{rc}.

Actually, p_+ = \Pr\left( X > x^{(0)} \mid X \in N_h(x^{(0)}) \right), p_- = \Pr\left( X \le x^{(0)} \mid X \in N_h(x^{(0)}) \right), and q_c = \Pr\left( Y = c \mid X \in N_h(x^{(0)}) \right).

By (7.15) and (7.16),

\hat{\zeta}(x^{(0)}, h) \rightarrow \left( \int_{x^{(0)} - h}^{x^{(0)} + h} f(x)\, dx \right)^{1/2} \left( \sum_{r = +,-} \sum_{c=1}^{C} \frac{(p_{rc} - p_r q_c)^2}{p_r q_c} \right)^{1/2} \equiv \zeta(x^{(0)}, h).

(2) Since X and Y are independent, p_{rc} = p_r q_c for r = +, - and c = 1, ..., C; then ζ(x^{(0)}, h) = 0 for any h. Hence

\zeta(x^{(0)}) = \sup_{h > 0} \zeta(x^{(0)}, h) = 0.

(3) Recall that p_c(x^{(0)}) = \Pr(Y = c \mid x^{(0)}). Without loss of generality assume p'_c(x^{(0)}) > 0. Then there exists h such that

\Pr\left( Y = c \mid x^{(0)} < X < x^{(0)} + h \right) > \Pr\left( Y = c \mid x^{(0)} - h < X < x^{(0)} \right),

which is equivalent to p_{+c} / p_+ > p_{-c} / p_-. This shows that ζ(x^{(0)}) > 0, since by Lemma 6.1, ζ(x^{(0)}) = 0 results in p_{+c} / p_+ = p_{-c} / p_- = p_c.

References

[1] Breiman, L., 2001. Random forests. Machine Learning 45, 5–32.

[2] Breiman, L., Friedman, J. H., Olshen, R. A., Stone, C. J., 1984. Classification and Regression Trees. Wadsworth, Belmont.

[3] Doksum, K., Tang, S., Tsui, K.-W., 2008. Nonparametric variable selection: The EARTH algorithm. Journal of the American Statistical Association 103(484), 1609–1620.

[4] Doksum, K. A., Samarov, A., 1995. Nonparametric estimation of global functionals and a measure of explanatory power of covariates in regression. Annals of Statistics 23, 1443–1473.

[5] Doksum, K. A., Schafer, C., 2006. Powerful choices: Tuning parameter selection based on power. In: Frontiers in Statistics: Dedicated to Peter Bickel (J. Fan and H. L. Koul, eds.), Imperial College Press, 113–141.

[6] Friedman, J., 1991. Multivariate adaptive regression splines. The Annals of Statistics, 1–67.

[7] Gao, J., Gijbels, I., 2008. Bandwidth selection in nonparametric kernel testing. Journal of the American Statistical Association 103(484), 1584–1594.

[8] Hall, P., 1992. The Bootstrap and Edgeworth Expansion. Springer, New York.

[9] Huang, L.-S., Chen, J., 2008. Analysis of variance, coefficient of determination and F-test for local polynomial regression. Annals of Statistics 36, 2085–2109.

[10] Jing, B.-Y., Shao, Q.-M., Wang, Q., 2003. Self-normalized Cramér-type large deviations for independent random variables. Annals of Probability 31, 2167–2215.

[11] Lin, D. Y., Zeng, D., 2011. Correcting for population stratification in genomewide association studies. Journal of the American Statistical Association 106, 997–1008.

[12] Loh, W.-Y., 2009. Improving the precision of classification trees. The Annals of Applied Statistics 3, 1710–1737.

[13] Price, A., Patterson, N., Plenge, R., Weinblatt, M., Shadick, N., Reich, D., 2006. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38(8), 904–909.

[14] Qiao, Z., Zhou, L., Huang, J., 2008. Linear discriminant analysis for high dimensional low sample size data. Special issue of World Congress on Engineering 2008.

[15] Quinlan, J. R., 1987. Simplifying decision trees. International Journal of Man-Machine Studies 27(3), 221–234.

[16] Schafer, C., Doksum, K. A., 2009. Selecting local models in multiple regression by maximizing power. Metrika 69, 283–304.

[17] Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267–288.

[18] Wang, L., Shen, X., 2007. On L1-norm multi-class support vector machines: methodology and theory. Journal of the American Statistical Association 102, 583–594.

[19] Zhang, H. H., Liu, Y., Wu, Y., Zhu, J., 2008. Variable selection for the multicategory SVM via adaptive sup-norm regularization. Electronic Journal of Statistics 2, 149–167.


1 Introduction

We consider classification problems with a large number of predictors that can benumerical or categorical We are interested in the case in which many of the predictorsmay be irrelevant for classification thus the variables useful for classification need to beselected In genomic research the classes considered are often cases and controls Thusvariable selection is the same as the important problem of deciding which variables areassociated with disease With modern technology data of large dimension arise in manyscientific disciplines including biology genomics astronomy economics and computerscience In particular the number of variables can be greater than the sample sizewhich poses a considerable challenge to the statistical analysis If only some of thevariables are useful for classification over-fitting is a problem for methods that use allvariables and therefore variable selection becomes critical for statistical analysis

Methods for variable selection in the classification context include methods that in-corporate variable selection as part of the classification procedure This class includesrandom forest (Breiman [1]) CART (Breiman et al [2]) and GUIDE (Loh [12]) Ran-dom forest assigns an importance score for each of the predictors and one can dropthose variables whose importance score fall below a certain threshold CART afterpruning will choose a subset of optimal splitting variables to be the most significantvariables GUIDE is a tree-based method particularly powerful in unbiased variableselection and interaction detection Other research includes methods that incorporatevariable selection by applying shrinkage methods with L1 norm constraints on the pa-rameters (Tibshirani [17]) that generate sparse vectors of parameter estimates Wangand Shen [18] and Zhang et al [19] incorporate variable selection with classificationbased on support vector machine (SVM) methods Qiao et al [14] consider variable se-lection based on linear discriminant analysis One limitation of these variable selectionmethods is that they are not based on nonparametric methods and therefore they maynot work well when there is a complex relationship between the prediction variablesand the variables to be classified

We propose a nonparametric method for variable selection called Categorical Adap-tive Tube Covariate Hunting (CATCH) that performs well as a variable selection algo-rithm and can be used to improve the performance of available classification proceduresThe idea is to construct a nonparametric measure of the relational strength betweeneach predictor and the categorical response and to retain those predictors whose rela-tionship to the response is above a certain threshold The nonparametric measure ofimportance for each predictor is obtained by first measuring the importance of the pre-dictor using local information and then combing such local importance scores to obtainan overall importance score The local importance scores are based on root chi-squaretype statistics for local contingency tables

In addition to the aforementioned nonparametric feature the CATCH procedurehas another property it measures the importance of each variable conditioning on allother variables thereby reducing the confounding that may lead to selection of variables

1

spuriously related to the categorical variable Y This is accomplished by constraining allpredictors but the one we are focusing on For the case where the number of predictorsis huge (d n) this can be done by restricting principal components for some typesof studies See Remark 34

Our approach to nonparametric variable selection is related to the EARTH algo-rithm (Doksum et al [3]) which applies to nonparametric regression problems with acontinuous response variable and continuous predictors It measures the conditionalassociation between a predictor i and the response variable conditional on all the otherpredictors jj 6=i by constraining the jj 6=i variables to regions called tubes The localimportance score is based on a local linear or a local polynomial regression The contri-bution of the current paper is to develop variable selection methods for the classificationproblem with a categorical response variable and predictors that can be continuous dis-crete or categorical

Our CATCH algorithm can be used as a variable selection step before classificationAny classification method preferably nonparametric can be used after we single out theimportant variables In particular SVM and random forest are statistical classificationmethods that can be used with CATCH We show in simulation studies that when thetrue model is highly nonlinear and there are many irrelevant predictors using CATCHto screen out irrelevant or weak predictors greatly improves the performance of SVMand random forest

The CATCH algorithm works with general classification data with both numericaland categorical predictors Moreover the CATCH algorithm is robust when the pre-dictors interact with each other especially when numerical and categorical predictorsinteract eg for hierarchical interactive association between the numerical and cate-gorical predictors We present a model with a reasonable sample size and predictordimension for which random forest has a relatively low chance of detecting the signifi-cance of the categorical predictor The CART and GUIDE algorithms with pruningcan find the correct splitting variables including the categorical one but yield relativelyhigh classification errors Moreover CART and GUIDE have trouble choosing the split-ting predictors in the correct order The CATCH algorithm achieves higher accuracyin the task of variable selection This is due to the importance scores being conditionalas illustrated in the simulation example in section 41

The rest of the paper will proceed as follows In section 2 we introduce importancescores for univariate predictors In sections 31 - 34 we extend these scores to multi-variate predictors and in section 35 we introduce the CATCH algorithm In section4 we use simulation studies to show the effectiveness of the CATCH algorithm A realexample is provided in section 5 In section 6 we provide some theoretical properties tojustify the definition of local contingency efficacy and in section 7 we show asymptoticconsistency of the algorithm

2

2 Importance Scores for Univariate Classification

Let (X(i) Y (i)) i = 1 middot middot middot n be independent and identically distributed (iid) as(X Y ) sim P We first consider the case of a univariate X then use the methodsconstructed for the univariate case to construct the multivariate versions

21 Numerical Predictor Local Contingency Efficacy

Consider the classification problem with a numerical predictor Let

Pr(Y = c|x) = pc(x) c = 1 middot middot middot CCsumc=1

pc(x) equiv 1 (21)

We introduce a measure of how strongly X is associated with Y in a neighbor-hood of a ldquofixedrdquo point x(0) Points x(0) will be selected at random by our algorithmLocal importance scores will be computed for each point then averaged For a fixedh gt 0 define Nh(x

(0)) = x 0 lt |x minus x(0)| le h to be a neighborhood of x(0) andlet n(x(0) h) =

sumni=1 I(X(i) isin Nh(x

(0))) denote the number of data points in the neigh-borhood For c = 1 middot middot middot C let nminusc (x(0) h) be the number of observations (X(i) Y (i))satisfying x(0)minush le X(i) lt x(0) and Y (i) = c let n+

c (x(0) h) be the number of (X(i) Y (i))satisfying x(0) lt X(i) le x(0) +h and Y (i) = c Let nc(x

(0) h) = n+c (x(0) h) +nminusc (x(0) h)

nminus(x(0) h) =sumC

c=1 nminusc (x(0) h) and n+(x(0) h) =

sumCc=1 n

+c (x(0) h) Table 1 gives the

resulting local contingency table

Table 1 Local contingency table

Y = 1 Y = 2 middot middot middot middot middot middot Y = C Total

x(0) minus h le X lt x(0) nminus1 (x(0) h) nminus2 (x(0) h) middot middot middot middot middot middot nminusC(x(0) h) nminus(x(0) h)

x(0) lt X le x(0) + h n+1 (x(0) h) n+

2 (x(0) h) middot middot middot middot middot middot n+C(x(0) h) n+(x(0) h)

Total n1(x(0) h) n2(x

(0) h) middot middot middot middot middot middot nC(x(0) h) n(x(0) h)

To measure how strongly Y relates to local restrictions on X we consider the chi-square statistic

X 2(x(0) h) =Csumc=1

((nminusc (x(0) h)minus Eminusc (x(0) h))2

Eminusc (x(0) h)+

(n+c (x(0) h)minus E+

c (x(0) h))2

E+c (x(0) h)

) (22)

3

where 00 equiv 0 and

Eminusc (x(0) h) =nminus(x(0) h)nc(x

(0) h)

n(x(0) h) E+

c (x(0) h) =n+(x(0) h)nc(x

(0) h)

n(x(0) h) (23)

Due to the local restriction on X the local contingency table might have some zerocells However this will not be an issue for the χ2 statistic The χ2 statistic can detectlocal dependence between Y and X as long as there exist observations from at leasttwo categories of Y in the neighborhood of x(0) When all observations belong to thesame category of Y intuitively no dependence can be detected In this case χ2 statisticis equal to zero which coincides with our intuition

We call the neighborhood Nh(x(0)) a section and maximize X 2(x(0) h) (22) with

respect to the section size h Let plim denote limit in probability Our local measureof association is ζ(x(0)) as given in

Definition 1 Local Contingency Efficacy (For Numerical Predictor) The local con-tingency section efficacy and the Local Contingency Efficacy (LCE) of Y on a numericalpredictor X at the point x(0) are

ζ(x(0) h) = plimnrarrinfinnminus12

radicX 2(x(0) h) ζ(x(0)) = sup

hgt0ζ(x(0) h) (24)

If pc(x) is constant in x for all c isin 1 middot middot middot C for x near x(0) then as n rarr infinX 2(x(0) h) converges in distribution to a χ2

Cminus1 variable and in this case ζ(x(0) h) = 0On the other hand if pc(x) is not constant near x(0) ζ(x(0) h) determines the asymptoticpower for Pitmanrsquos alternatives of the test based on X 2(x(0) h) for testing whether Xis locally independent of Y and is called the efficacy of this test Moreover ζ2(x(0) h)is the asymptotic noncentrality parameter of the statistic nminus1X 2(x(0) h)

The estimators of ζ(x(0) h) and ζ(x(0)) are

ζ(x(0) h) = nminus12radicX 2(x(0) h) ζ(x(0)) = sup

hgt0ζ(x(0) h) (25)

We selected the section size h as

h = argmaxζ(x(0) h) h isin h1 hg

for some grid h1 hg This corresponds to selecting h by maximizing estimatedpower See Doksum and Schafer [5] Gao and Gijbels [7] Doksum et al [3] Schafer andDoksum [16]

4

22 Categorical Predictor Contingency Efficacy

Let (X(i) Y (i)) i = 1 middot middot middot n be as before except X isin 1 middot middot middot C prime is a categoricalpredictor Let n(c cprime) be the number of observations satisfying Y = c and X = cprime andlet

X 2 =Csumc=1

Cprimesumcprime=1

(n(c cprime)minus E(c cprime))2

E(c cprime)2 (26)

where

E(c cprime) =

sumCc=1 n(c cprime)

sumCprime

cprime=1 n(c cprime)

n (27)

Definition 2 Contingency Efficacy (For Categorical Predictor) The ContingencyEfficacy (CE) of Y on a categorical predictor X is

ζ = plimnrarrinfin nminus12radicX 2 (28)

Under the null hypothesis that Y and X are independent X 2 converges in distri-bution to a χ2

(Cminus1)(Cprimeminus1) variable In this case ζ = 0 In general ζ2 is the asymptotic

noncentrality parameter of the statistic nminus1X 2 The contingency efficacy ζ determinesthe asymptotic power of the test based on X 2 for testing whether Y and X are inde-pendent We use ζ as an importance score that measures how strongly Y depends onX The estimator of (28) is

ζ = nminus12radicX 2 (29)

3 Variable Selection for Classification Problems The CATCH Algorithm

Now we consider a multivariate X = (X1 Xd) d gt 1 To examine whether apredictor Xl is related to Y in the presence of all other variables we calculate efficaciesconditional on these variables as well as unconditional or marginal efficacies

31 Numerical Predictors

To examine the effect of Xl in the local region of x(0) = (x(0)1 middot middot middot x(0)d ) we build a

ldquotuberdquo around the center point x(0) in which data points are within a certain radius δof x(0) with respect to all variables except Xl Define Xminusl = (X1 Xlminus1 Xl+1 Xd)

T

to be the complementary vector to Xl Let D(middot middot) be a distance in Rdminus1 which we willdefine later

Definition 3 A tube Tl of size δ (δ gt 0) for the variable Xl at x(0) is the set

Tl equiv Tl(x(0) δD) equiv x D(xminusl x

(0)minusl ) le δ (31)

5

We adjust the size of the tube so that the number of points in the tube is a preas-signed number k We find that k ge 50 is in general a good choice Note that k = ncorresponds to unconditional or marginal nonparametric regression

Within the tube Tl(x(0) δD) Xl is unconstrained To measure the dependence of

Y on Xl locally we consider neighborhoods of x(0) along the direction of Xl within thetube Let

NlhδD(x(0)) = x = (x1 middot middot middot xd)T isin Rd x isin Tl(x(0) δD) |xl minus x(0)l | le h (32)

be a section of the tube Tl(x(0) δD) To measure how strongly Xl relates to Y in

NlhδD(x(0)) we consider

X 2lδD(x(0) h) (33)

where (33) is defined by (22) using only the observations in the tube Tl(x(0) δD)

Analogous to the definitions of local contingency efficacy (24) we define

Definition 4 The local contingency tube section efficacy and the local contingencytube efficacy of Y on Xl at the point x(0) and their estimates are

ζ(x(0) h l) = plimnrarrinfin nminus12

radicX 2lδD(x(0) h) ζl(x

(0)) = suphgt0ζ(x(0) h l) (34)

ζ(x(0) h l) = nminus12radicX 2lδD(x(0) h) ζl(x

(0)) = suphgt0ζ(x(0) h l) (35)

For an illustration of local contingency tube efficacy consider Y isin 1 2 and suppose

logit P (Y = 2|x) =(xminus 1

2

)2 x isin [0 1] Then the local efficacy with h small will be

large while if h ge 1 the efficacy will be zero

32 Numerical and Categorical PredictorsIf Xl is categorical but some of the other predictors are numerical we construct

tubes Tl(x(0) δD) based on the numerical predictors leaving the categorical variables

unconstrained and we use the statistic as defined by (26) to measure the strength of theassociation between Y and Xl in the tube Tl(x

(0) δD) by using only the observationsin the tube Tl(x

(0) δD) We denote this version of (26) by

X 2lδD(x(0)) (36)

Analogous to the definition of contingency efficacy given in (28) we define

Definition 5 The local contingency tube efficacy of Y on Xl at x(0) and its estimateare

ζl(x(0)) = plimnrarrinfin n

minus12radicX 2lδD(x(0)) ζl(x

(0)) = nminus12radicX 2lδD(x(0)) (37)

6

33 The Empirical Importance Scores

The empirical efficacies (35) and (37) measure how strongly Y depends on Xl nearone point x(0) To measure the overall dependency of Y on Xl we randomly sampleM bootstrap points xlowasta a = 1 M with replacement from x(i) 1 le i le n andcalculate the average of local tube efficacies estimates at these points

Definition 6 The CATCH empirical importance score for variable l is

sl = Mminus1Msuma=1

ζl(xlowasta) (38)

where ζl(x(ia))is defined in (35) and (37) for numerical and categorical predictors

respectively

34 Adaptive Tube Distances

Doksum et al [3] used an example to demonstrate that the distance D needs tobe adaptive to keep variables strongly associated with Y from obscuring the effect ofweaker variables Suppose Y depends on three variable X1 X2 and X3 among the largerset of predictors The strength of the dependency ranges from weak to strong whichwill be estimated by the empirical importance scores As we examine the effect of X2

on Y we want to define the tube so that there is relatively less variation in X3 than inX1 in the tube because X3 is more likely to obscure the effect of X2 than X1 To thisend we introduce a tube distance for the variable Xl of the form

Dl(xminusl x(0)minusl ) =

dsumj=1

wj|xj minus x(0)j |I(j 6= l) (39)

where the weights wj adjust the contribution of strong variables to the distance so thatthe strong variables are more constrained

The weights wj are proportional to sj minus sprimej which are determined iteratively and

adaptively as follows in the first iteration set sj = s(1)j = 1 in (39) and compute the

importance score s(2)j using (38) For the second iteration we use the weights sj = s

(2)j

and the distance

Dl(xminusl x(0)minusl ) =

sumdj=1(sj minus sprimej)+|xj minus x

(0)j |I(j 6= l)sumd

j=1(sj minus sprimej)+I(j 6= l)(310)

where 00 equiv 1 and sprimej is a threshold value for sj under the assumption of no conditionalassociation of Xl with Y computed by a simple Monte Carlo technique This adjust-ment is important because some predictors by definition have larger importance scoresthan others when they are all independent of Y For example a categorical predictor

7

with more levels has a larger importance score than that with fewer levels Thereforethe unadjusted scores sj can not accurately quantify the relative importance of the

predictors in the tube distance calculation Next we use (38) to produce sj = s(3)j and

use s(3)j in the next iteration and so on

In the definition above we need to define |xj minus x(0)j | appropriately for a categorical

variable Xj Because Xj is categorical we need to assume that |xj minus x(0)j | is constant

when xj and x(0)j take any pair of different categories One approach is to define |xj minus

x(0)j | =infinI(xj 6= x

(0)j ) in which case all points in the tube are given the same Xj value

as the tube center The approach is a very sensible one as it strictly implements theidea of conditioning on Xj when evaluating the effect of another predictor Xl and Xj

does not contribute to any variation in Y The problem with this definition though isthat when the number of categories of Xj is relatively large as compared to sample sizethere will not be enough points in the tube if we restrict Xj to a single category Insuch a case we do not restrict Xj which means that Xj can take any category in the

tube We define |xj minus x(0)j | = k0I(xj 6= x(0)j ) where k0 is a normalizing constant The

value of k0 is chosen to achieve a balance between numerical variables and categoricalvariables Suppose that X1 is a numerical variable with standard deviation one andX2 is a categorical variable For X2 we set k0 equiv E(|X(1)

1 minus X(2)1 |) where X

(1)1 and

X(2)1 are two independent realizations of X1 Note that k0 =

radic2π if X1 follows a

standard normal distribution Also k0 can be based on the empirical distributions ofthe numerical variables Based on the proceeding discussion of the choice of k0 we usek0 =infin in the simulations and k0 =

radic2π in the real data example

Remark 31 (choice of k0) k0 =radic

2π is used all the time unless the sample size is

large enough in the sense to be explained below Suppose k0 = infin then Dl

(xminusl x

(0)minusl

)is finite only for those observations with xj = x

(0)j When there are multiple categorical

variables Dl is finite only for those points denoted by set Ω(0) that satisfy xj = x(0)j for

all categorical xj When there are many categorical variables or when the total samplesize is small the size of this set can be very small or close to zero in which case thetube cannot be well defined therefore k0 =

radic2π is used as in the real data example

On the other hand when the size of Ω(0) is larger than the specified tube size we canuse k0 =infin as in the simulation examples

35 The CATCH Algorithm

The variable selection algorithm for a classification problem is based on importancescores and an iteration procedureThe Algorithm(0) Standardizing Standardize all the numerical predictors to have sample mean 0and sample standard deviation 1 using linear transformations

8

(1) Initializing Importance Scores Let S = (s1 middot middot middot sd) be importance scores forthe predictors Let S prime = (sprime1 middot middot middot sprimed) such that sprimel is a threshold importance score whichcorresponds to the model where Xl is independent of Y given Xminusl Initialize S to be(1 middot middot middot 1) and S prime to be (0 middot middot middot 0) Let M be the number of tube centers and m thenumber of observations within each tube(2) Tube Hunting Loop For l = 1 middot middot middot d do (a) (b) (c) and (d) below

(a) Selecting tube centers For 0 lt b le 1 set M = [nb] where [ ] is the great-est integer function and randomly sample M bootstrap points Xlowast1 X

lowastM with

replacement from the n observations Set k = 1 do (b)

(b) Constructing tubes For 0 lt a lt 1 set m = [na] Using the tube distanceDl defined in (310) select the tube size a so that there are exactly m ge 10observations inside the tube for Xl at Xlowastk

(c) Calculating efficacy If Xl is a numerical variable let ζl(Xlowastk) be the estimated

local contingency tube efficacy of Y on Xl at Xlowastk as defined in (35) If Xl is a

categorical variable let ζl(Xlowastk) be the estimated contingency tube efficacy of Y

on Xl at Xlowastk as defined in (37)

(d) Thresholding Let Y lowast denote a random variable which is identically distributed

as Y but independent of X Let (Y(1)0 middot middot middot Y (n)

0 ) be a random permutation of

(Y (1) middot middot middot Y (n)) then Y0 equiv (Y(1)0 middot middot middot Y (n)

0 ) can be viewed as a realization of Y lowastLet ζlowastl (Xlowastk) be the estimated local contingency tube efficacy of Y0 on Xl at Xlowastkcalculated as in (c) with Y0 in place of Y

If k lt M set k = k + 1 go to (b) otherwise go to (e)

(e) Updating importance scores Set

snewl =1

M

Msumk=1

ζl(Xlowastk) s

primenewl =

1

M

Msumk=1

ζlowastl (Xlowastk)

Let Snew = (snew1 middot middot middot snewd ) and Sprimenew = (s

primenew1 middot middot middot sprimenewd ) denote the updates of S

and S prime(3) Iterations and Stopping Rule Repeat (2) I times Stop when the change inSnew is small Record the last S and S prime as Sstop and S primestop respectively(4) Deletion step Using Sstop and S primestop delete Xl if sstopl minus sprimestopl lt

radic2SD(sprimestop)

If d is small we can generate several sets of S primestop and then use the sample standarddeviation of all the values in these sets to be SD(sprimestop) See Remark 33End of the algorithm

Remark 32 Step (d) needs to be replaced by (drsquo) below when a global permutation of Ydoes not result in appropriate ldquolocal null distributionrdquo For example when sample sizes

9

of different classes are extremely unequal which is known as unbalanced classificationproblemsit is very likely to have a pure class of Y in local region after the global per-mutation and the threshold S prime will be spuriously low And therefore irrelevant variableswill be falsely selected into the model This approach is more nonparametric but moretime consuming

(drsquo) Local Thresholding Let Y lowast denote a random variable which is distributed as Y

but independent of Xl given Xminusl Let (Y(1)0 middot middot middot Y (m)

0 ) be a random permutation of the

Y rsquos inside the tube then (Y(1)0 middot middot middot Y (m)

0 ) can be viewed as a realization of Y lowast If Xl

is a numerical variable let ζlprime(Xlowastk) be the estimated local contingency tube efficacy of

Y lowast on Xl at Xlowastk as defined in (35) If Xl is a categorical variable let ζlprime(Xlowastk) be the

estimated contingency tube efficacy of Y lowast on Xl at Xlowastk as defined in (37)

Remark 33 (The threshold) Under the assumption H(l)0 that Y is independent of Xl

given Xminusl in Tl sl and sprimel are nearly independent and SD(sl minus sprimel) asympradic

2SD(sprimel) Thusthe rule that deletes Xl when sstopl minus sprimestopl lt

radic2SE(sprimestop) is similar to the one standard

deviation rule of CART and GUIDE The choice of threshold is also discussed in Section7 Alternatively we calculate the p-values for each covariate by permutation tests Morespecifically we permute the response variable and obtain the important scores for thepermeated data and then calculate how extreme sprimestopl compared to permutated scoresWe then identify a covariate as important if its p-value is less than 01

Remark 34 (Large d small n) A key idea in this paper is that when determining theimportance of the variable Xj we restrict (condition) the vector Xminusj to a relatively smallset This is done to address the problem of the spurious correlation of great concern ingenomics (Price et al [13] Lin and Zeng [11]) In such genomic studies the number ofpredictors d typically greatly exceeds sample size n For numerical variables this can bedealt with by replace Xminusj with a vector of principal components explaining most of thevariation in Xminusj Although our paper does not deal with the d n case directly thisremark shows that it is also applicable to the genome wide association studies describedin Price et al [13]

Remark 35 We now provide computational complexity of the algorithm Recall thenotations I is the number of repetitions M is the number of tube centers m is thetube size d is the dimension and n is the sample size The number of operationsneeded for the algorithm is I times d times M times (2b+ 2c+ 2d) + 2e where 2b 2c 2d 2edenote the number of operations required for those steps The computational complexityrequired for 2b 2c 2d and 2e are O (nd) O (m) O (m)and O (M) Since nd m andthe required computation is dominated by step 2b the computational complexity of thealgorithm is O (IMnd2) Solving least squares problem requires O (nd2) operations Soour algorithm requires higher computational complexity by a factor of IM compared toleast squares However the computations for M tube centers are independent of theother computations and are performed in parallel

10

4 Simulation Studies

41 Variable Selection with Categorical and Numerical Predictors

Example 41 Consider d ge 5 predictors X1 X2 Xd and a categorical responsevariable Y The fifth predictor X5 is a Bernoulli random variable which takes twovalues 0 and 1 with equal probability All other predictors are uniformly distributed on[01]

When X5 = 0 the dependent variable Y is related to the predictors in the followingfashion

Y =

0 if X1 minusX2 minus 025 sin(16πX2) le minus051 if minus05 lt X1 minusX2 minus 025 sin(16πX2) le 02 if 0 lt X1 minusX2 minus 025 sin(16πX2) le 053 if X1 minusX2 minus 025 sin(16πX2) gt 05

(41)

When X5 = 1 Y depends on X3 and X4 in the same way as above with X1 replacedby X3 and X2 replaced by X4 The other variables X6 Xd are irrelevant noisevariables All d predictors are independent

In addition to the presence of noise variables X6 Xd this example is challengingin two aspects First there is an interaction effect between the continuous predictorsX1 X4 and the categorical predictor X5 Such interaction effect may arise in prac-tice For example suppose Y represents the effect resulting from multiple treatmentsX1 X4 and that X5 represents the gender of a patient Our model allows males andfemales to respond differently to treatments Secondly the classification boundaries forfixed X5 are highly nonlinear as shown in Figure 1 This design is to demonstrate theour method can detect nonlinear dependence between the response and the predictors

Table 2 shows the number of simulations where the predictors X1 middot middot middot X5 are cor-rectly kept in the model and in column 6 the average number of simulations wheredminus 5 irrelevant predictors are falsely selected to be in the model out of 100 simulationtrials from model (41) In the simulation n = 400 d = 10 50 and 100 are usedWe compare CATCH with other methods including Marginal CATCH Random Forestclassifier (RFC) (Breiman [1]) and GUIDE (Loh [12]) For Marginal CATCH we testthe association between Y and Xl without conditioning on Xminusl RFC is well-known asa powerful tool for prediction Here we use importance scores from RFC for variableselection We create additional 30 noise variables and use their average importancescores multiplied by a factor of 2 as a threshold (Doksum et al [3]) to decide whether avariable should be selected into the model In particular we generate the noise variablesusing uniform and Bernoulli distributions respectively for numerical and categoricalpredictors

The main goal of GUIDE is to solve the variable selection bias problem in regressionand classification tree methods As an illustration if a categorical predictor takes Kdistinct values then the splitting test exhausts 2K minus 1 possibilities and therefore is

11

Tab

le2

Sim

ula

tion

resu

lts

from

Exam

ple

41

wit

hsa

mp

lesi

zen

=400

For

1leile

5

theit

hC

olu

mn

giv

esth

ees

tim

ate

dp

rob

ab

ilit

yof

sele

ctio

nof

rele

vant

vari

ableX

ian

dth

est

and

ard

erro

rb

ase

don

100

sim

ula

tion

sC

olu

mn

6giv

esth

em

ean

of

the

esti

mate

dp

rob

ab

ilit

yof

sele

ctio

nan

dth

est

and

ard

erro

rfo

rth

eir

rele

vant

vari

ab

lesX

6middotmiddotmiddotX

d

Num

ber

ofSel

ecti

ons(

100

sim

ula

tion

s)

ME

TH

OD

dX

1X

2X

3X

4X

5Xiige

6

CA

TC

H10

082

(00

4)0

75(0

04)

078

(00

4)0

82(0

04)

1(0

)0

05(0

02)

500

76(0

04)

076

(00

4)0

87(0

03)

084

(00

4)0

96(0

02)

008

(00

3)

100

077

(00

4)0

73(0

04)

082

(00

4)0

8(0

04)

083

(00

4)0

09(0

03)

Mar

ginal

CA

TC

H10

077

(00

4)0

67(0

05)

07

(00

5)0

8(0

04)

006

(00

2)0

1(0

03)

500

69(0

05)

07

(00

5)0

76(0

04)

079

(00

4)0

07(0

03)

01

(00

3)

100

078

(00

4)0

76(0

04)

073

(00

4)0

73(0

04)

012

(00

3)0

1(0

03)

Ran

dom

For

est

100

71(0

05)

051

(00

5)0

47(0

05)

059

(00

5)0

37(0

05)

006

(00

2)

500

64(0

05)

061

(00

5)0

65(0

05)

058

(00

5)0

28(0

04)

008

(00

3)

100

066

(00

5)0

58(0

05)

059

(00

5)0

65(0

05)

017

(00

4)0

11(0

03)

GU

IDE

100

96(0

02)

098

(00

1)0

93(0

03)

099

(00

1)0

92(0

03)

033

(00

5)

500

82(0

04)

085

(00

4)0

9(0

03)

089

(00

3)0

67(0

05)

012

(00

3)

100

083

(00

4)0

84(0

04)

086

(00

3)0

83(0

04)

061

(00

5)0

08(0

03)

12

Figure 1 The relationship between Y and X1 X2 for model (41) The four areas represent the fourvalues of Y

biased toward being selected The way that GUIDE alleviate this bias is to employthe lack-of-fit tests ie χ2-test applicable to both numerical and categorical predictorswith the p-values adjusted by the bootstrap procedure Since GUIDE has negligible se-lection bias it can include tests for local pairwise interactions Furthermore it achievessensitivity to local curvature by using linear model at each node

The results in table 2 show that Marginal CATCH does a very poor job of select-ing X5 This illustrates the importance of local conditioning RFC does better thanMarginal CATCH but also has difficulties while CATCH and GUIDE can successfullyidentify X5 along with X1 to X4 with high frequency When d = 10 GUIDE does verywell for the relevant variables but selects too many irrelevant variables as d increasesSurprisingly GUIDE selects fewer irrelevant variables as d increases It becomes morecautious as the dimension d increases CATCH performs very well for the high dimen-sional cases (d = 50 and d = 100) This indicates that the CATCH algorithm is robustto the intricate interaction structures between the predictors

In model (41) above the predictors interact in a hierarchical way When X5 = 0the response is related to X1 and X2 as shown in Figure 1 where the two axes arethe two relevant predictors and the four areas partitioned by the curves correspondto different values of Y When X5 = 1 the response depends on X3 and X4 in thesame way This type of hierarchical interaction between the categorical and numericalpredictors are difficult to detect even by state of the art classification methods such assupport vector machine(SVM) and RFC In particular RFC produces an importance

13

score vector that identifies the four numerical variables X1 X4 as very significantbut often leaves the categorical variable X5 unidentified as we see in table 2

The CATCH algorithm however performs very well for this complicated model Wenow explain why CATCH is able to identify X5 as an important variable To explorethe association between X5 and Y we construct M contingency tables of 2 rows and4 columns based on M randomly centered tubes calculate contingency efficacy scoresand then average the scores Figure 2 (a) and (b) illustrate how one of the contingencyefficacy scores is calculated 1000 data points are generated from model (41) and theyare split into two sets according to x5 (a) shows approximately half of the points withx5 = 0 in (x1 x2) dimensions and (b) shows the rest points with x5 = 1 in (x3 x4)dimensions The data points are colored according to Y This representation enablesus to clearly see the classification boundaries (grey lines) The randomly selected tubecenter when x5 is zero is shown as a star in (a) The data points within the tube closeto the tube center in terms of Xminus5 are highlighted by larger dots Since the distancedefining the tube does not involve X5 the larger dots have x5 equal 0 or 1 and arepresent in both (a) and (b) The counts of the larger dots in different colors in (a) and(b) correspond to the cell counts of the two rows x5 = 0 and x5 = 1 in the contingencytable As we can see larger dots in (a) are mostly red and larger dots in (b) are mostlyblue and orange which shows that distribution in Y depends on the value of X5 andthe contingency efficacy score is high We expect the contingency efficacy score wouldbe low only if the tube center has similar values in (x1 x2) and (x3 x4) Such tubecenters would be very few among the M tube centers since they are randomly chosenThus the average contingency efficacy scores is high and X5 is identified as an importantvariables

To contrast this example with a simpler data generating model suppose the effectsof predictors are additive that is the categorical variable is related to the response asan added term to the effect of the numerical predictors In this type of simpler modelsthe RFC and GUIDE algorithm can efficiently identify relevant predictors but so canthe CATCH algorithm We do not include the simulation results for additive modelshere

Remark 41 We observe that the CATCH algorithm is not very sensitive to the choiceof the number of tube centers M or the tube size m = [na] where 0 lt a le 1 and [ ] isthe greatest integer function For example 41 we perform the simulation using M = nand M = n2 and a = 01 015 02 03 05 The results are summarized in Table 3The CATCH algorithm performs similarly for M = n and M = n2 with M = n havinga small winning edge We generally suggest to use M = n if computation resource isof a less concern or n is not large (le 200) When computational cost is a concern(Remark 35) one can reduce tube size to M = n2 for a compromise between detectionpower and the computational cost Table 3 also shows that the tube size works well inthe range between 01 and 03 while the detection power of X5 decreases for the casea = 05 where the tube size is too big to achieve local conditioning This is a consistent

14

observation with the comparison to Marginal CATCH in Table 2 On the other handthe tube size should not be too small to achieve sufficient detection power We suggestto use a = 01 or 015 for sample sizes n ge 400 or m = 50 for smaller sample sizesIn light of these guidelines we use M = n2 = 200 and a = 015 for Examples 41-43M = n = 200 and m = 50 for Example 44

Table 3 Sensitivity study of CATCH for Example 41 (n = 400 d = 50) using different number oftube centers M and tube size m = [na] For 1 le i le 5 the ith column gives the number of times Xi

was selected Column 6 gives the percentage of times irrelevant variables were selected

Number of Selections(100 simulations)M a X1 X2 X3 X4 X5 Xi i ge 6

M = 200 010 75 67 77 69 92 78015 76 76 87 84 96 82020 85 84 90 87 93 99030 89 87 95 89 85 96050 89 90 96 93 62 99

M = 400 010 74 76 82 77 92 72015 85 82 89 87 94 86020 87 88 90 86 96 91030 88 89 91 91 92 94050 89 91 94 93 61 102

In Example 41 all predictors are independent In a similar setting we next considerthe case where the predictors are dependent

Example 42 We generate standard normal random variables Zi for i = 1 middot middot middot d insuch a way that cor(Z1 Z4) = 05 cor(Z3 Z6) = 09and the other Zirsquos are independentWe set Xi = Φ(Zi) for i = 1 middot middot middot d where Φ(middot) is the CDF of a standard normal andtherefore Xi are uniformly distributed marginally as in Example 41 The correlationstructure between the Xirsquos are similar to that of the Zirsquos The dependent variable Y isgenerated in the same way as in Example 41

The results of Example 42 are given in Table 4 By comparing with results in Table2 both CATCH and GUIDE are doing well in the presence of dependence between thenumerical variables The explanation is two-fold First both methods could identify thecategorical X5 as an important predictor most of the time which make the identificationof other relevant predictors possible Second conditioning on X5 Y depends on a pairof variables (ldquoX1 and X2rdquo or ldquoX3 and X4rdquo) jointly and in particular this dependence isnonlinear which makes both methods less affected by the correlation introduced here

15

00 02 04 06 08 10

00

02

04

06

08

10

(a) X5 = 0

X1

X2

00 02 04 06 08 10

00

02

04

06

08

10

(b) X5 = 1

X3

X4

Figure 2 Sample scatter plots of (x1 x2) and (x3 x4) for simulated data using model (41) The leftpanel includes all the pairs (x1 x2) with x5 = 0 while the right panel includes all the pairs (x3 x4)with x5 = 1 The larger dots in the two plots are within the tube around a tube center shown as thestar

We now explain the latter in details for both methods In CATCH the algorithmadaptively weighs the effects of predictors by their importance scores in defining thetubes when they are being conditioned on therefore identifying either of the two willhelp to identify the other whereas an irrelevant variable though strongly associated withone of the relevant variables is down-weighted iteratively and does not strongly affectthe selection of the relevant variable This can explain the slight decreasing selectionprobability in X3 in presence of a strongly correlated X6 and consequently overall lessselection of X3 and X4 compared to X1 and X2 given the correlation between X1 andX4 In GUIDE the difference between the dependent and independent cases is evenless because GUIDE is very powerful in detecting local interaction by literally includingpairwise interactions when selecting splitting variables

Marginal CATCH and Random Forest do poorly in this case mainly because X5 cannot be detected as we explained earlier More specifically the dependence between Yand X4 in X5 = 1 case is the same as that between Y and X2 in X5 = 0 The linearterms of X1 and X2 have opposite signs in decision function therefore marginal effectsof X1 and X4 are obscured when X5 is not identified as an important variable

Table 4: Simulation results from Example 4.2 with sample size n = 400. For 1 ≤ i ≤ 6, the ith column gives the estimated probability of selection of variable Xi and the standard error based on 100 simulations, where Xi, i = 1, · · · , 5, are relevant variables with cor(X1, X4) = 0.5; X6 is an irrelevant variable that is correlated with X3 with correlation 0.9. Column 7 gives the mean of the estimated probability of selection and that of the standard error for the irrelevant variables X7, · · · , Xd.

                          Number of Selections (100 simulations)
METHOD            d     X1          X2          X3          X4          X5          X6          Xi, i ≥ 7
CATCH             10    0.76(0.04)  0.77(0.04)  0.73(0.04)  0.68(0.05)  0.99(0.01)  0.34(0.05)  0.07(0.02)
                  50    0.78(0.04)  0.77(0.04)  0.74(0.04)  0.63(0.05)  0.92(0.03)  0.57(0.05)  0.09(0.03)
                  100   0.76(0.04)  0.77(0.04)  0.80(0.04)  0.74(0.04)  0.82(0.04)  0.55(0.05)  0.09(0.03)
Marginal CATCH    10    0.35(0.05)  0.63(0.05)  0.80(0.04)  0.32(0.05)  0.09(0.03)  0.71(0.05)  0.12(0.03)
                  50    0.34(0.05)  0.71(0.05)  0.70(0.05)  0.30(0.05)  0.10(0.03)  0.60(0.05)  0.11(0.03)
                  100   0.34(0.05)  0.68(0.05)  0.75(0.04)  0.30(0.05)  0.10(0.03)  0.59(0.05)  0.10(0.03)
Random Forest     10    0.29(0.05)  0.48(0.05)  0.55(0.05)  0.20(0.04)  0.30(0.05)  0.47(0.05)  0.05(0.02)
                  50    0.26(0.04)  0.52(0.05)  0.57(0.05)  0.30(0.05)  0.24(0.04)  0.45(0.05)  0.08(0.03)
                  100   0.36(0.05)  0.61(0.05)  0.66(0.05)  0.37(0.05)  0.32(0.05)  0.46(0.05)  0.12(0.03)
GUIDE             10    0.96(0.02)  0.96(0.02)  0.94(0.02)  0.95(0.02)  0.96(0.02)  0.84(0.04)  0.30(0.05)
                  50    0.80(0.04)  0.92(0.03)  0.88(0.03)  0.79(0.04)  0.72(0.04)  0.68(0.05)  0.12(0.03)
                  100   0.81(0.04)  0.89(0.03)  0.90(0.03)  0.76(0.04)  0.75(0.04)  0.58(0.05)  0.08(0.03)

4.2 Improving The Performance of SVM and RF Classifiers

The criterion used in the CATCH algorithm is not based on classification accuracy. But if we use CATCH to screen out the irrelevant variables and use only the important variables for classification, the performance of other algorithms such as the Support Vector Machine (SVM) and Random Forest Classifier (RFC) can be significantly improved. In this section we compare the prediction performance of (1) SVM, (2) RFC, (3) CATCH followed by SVM, and (4) CATCH followed by RFC. We label (3) and (4) CATCH+SVM and CATCH+RFC, respectively. We use the "svm" function in the "e1071" package in R with the radial kernel to perform the SVM analysis, and the "randomForest" function in the "randomForest" package to perform the RFC analysis. We use simulation experiments to illustrate the improvement obtained by using CATCH prior to classifying Y.

Let the true model be

(X, Y) ∼ P,    (4.2)

where P will be specified later. In each of 100 Monte Carlo experiments, n pairs (X^(i), Y^(i)), i = 1, · · · , n, are generated from the true model (4.2). Based on the simulated data (X^(i), Y^(i)), i = 1, · · · , n, the methods (1) SVM, (2) RFC, (3) CATCH+SVM, and (4) CATCH+RFC are used to classify a future Y^(0) for which we have available a corresponding X^(0).

The integrated misclassification rate (IMR, Friedman [6]) defined below is used to evaluate the performance of the classification algorithms. Let F_{X,Y}(·) be the classifier constructed by a classification algorithm based on X = (X^(1), · · · , X^(n)) and Y = (Y^(1), · · · , Y^(n))^T.

Let (X^(0), Y^(0)) be a future observation with the same distribution as (X^(i), Y^(i)). We only know X^(0) = x^(0) and want to classify Y^(0). For a possible future observation (x^(0), y^(0)), define

MR(X, Y, x^(0), y^(0)) = P( F_{X,Y}(x^(0)) ≠ y^(0) | X, Y ).    (4.3)

Our criterion is

IMR = E E_0 [ MR(X, Y, x^(0), y^(0)) ],    (4.4)

where E_0 is with respect to the distribution of the random (x^(0), y^(0)) and E is with respect to the distribution of (X, Y).

This double expectation is evaluated using Monte Carlo simulation. The result is denoted IMR*. More precisely, E_0 is estimated using 5000 simulated (x^(0), y^(0)) observations; then IMR* is computed on the basis of 100 simulated samples (X_i, Y_i), i = 1, · · · , 100.
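As an illustration of how (4.3)-(4.4) are approximated, here is a minimal R sketch of the Monte Carlo scheme just described. The function name estimate_IMR is ours; catch_select stands in for the CATCH screening step (hypothetical, since CATCH is not distributed as a package); and the "svm" function from the "e1071" package is the one used in this section.

library(e1071)

# IMR* estimate: the outer expectation E uses n_rep training samples; the inner E0
# uses n_test fresh observations per training sample (100 and 5000 in this section).
estimate_IMR <- function(gen_data, n = 400, n_rep = 100, n_test = 5000, screen = NULL) {
  mr <- replicate(n_rep, {
    train <- gen_data(n)
    test  <- gen_data(n_test)
    keep  <- if (is.null(screen)) names(train)[-1] else screen(train)   # variable screening
    fit   <- svm(x = train[, keep, drop = FALSE], y = train$Y, kernel = "radial")
    pred  <- predict(fit, test[, keep, drop = FALSE])
    mean(pred != test$Y)                      # MR in (4.3), averaged over the E0 sample
  })
  mean(mr)                                    # average over training samples: IMR*
}

# estimate_IMR(gen_example42)                              # plain SVM
# estimate_IMR(gen_example42, screen = catch_select)       # CATCH+SVM (hypothetical screen)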

Example 4.3. First consider model (4.1) in Section 4.1. Figure 3 shows IMR* versus the number of irrelevant variables for this example. The left panel shows that the IMR*'s of the SVM approach increase dramatically as the number of irrelevant variables increases, while the IMR*'s of CATCH+SVM are stable. Thus CATCH stabilizes the performance of SVM for this complex model. The right panel shows that CATCH also stabilizes the performance of RFC. These results indicate that CATCH can improve the prediction accuracy of other classification algorithms.

[Figure 3 here: two panels ("SVM" and "Random Forest") plotting Error Rate (0.25 to 0.60) against d (20 to 100); legends SVM vs. CATCH+SVM and RFC vs. CATCH+RFC.]

Figure 3: IMR*'s of SVM, RFC, CATCH+SVM, CATCH+RFC versus the number of predictors for the simulation study in model (4.1) in Example 4.1. The sample size is n = 400.

Example 4.4 (Contaminated Logistic Model). Consider a binary response Y ∈ {0, 1} with

P(Y = 0 | X = x) = F(x) if r = 0;  G(x) if r = 1,    (4.5)

where x = (x1, . . . , xd),

r ∼ Bernoulli(γ),    (4.6)

F(x) = (1 + exp{−(0.25 + 0.5 x1 + x2)})^{−1},    (4.7)

G(x) = 0.1 if 0.25 + 0.5 x1 + x2 < 0;  0.9 if 0.25 + 0.5 x1 + x2 ≥ 0,    (4.8)

and x1, x2, . . . , xd are realizations of independent standard normal random variables. Hence, given X^(i) = x^(i), 1 ≤ i ≤ n, the joint distribution of Y^(1), Y^(2), . . . , Y^(n) is the contaminated logistic model with contamination parameter γ:

(1 − γ) ∏_{i=1}^{n} [F(x^(i))]^{Y^(i)} [1 − F(x^(i))]^{1−Y^(i)} + γ ∏_{i=1}^{n} [G(x^(i))]^{Y^(i)} [1 − G(x^(i))]^{1−Y^(i)}.    (4.9)

We perform a simulation study for n = 200, d = 50, and γ = 0, 0.2, 0.4, 0.6, 0.8, and 1, where γ = 0 corresponds to no contamination. Figure 4 shows the IMR*'s versus the different values of γ for this example. CATCH improves the performance of the SVM and Random Forest classifiers for this contaminated logistic regression model.
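A minimal R sketch of a generator for the contaminated logistic model (4.5)-(4.8) follows; the function name rcontam_logistic is ours.

# Generate (X, Y) from the contaminated logistic model (4.5)-(4.8).
rcontam_logistic <- function(n = 200, d = 50, gamma = 0.2) {
  X   <- matrix(rnorm(n * d), n, d)            # independent standard normal predictors
  r   <- rbinom(n, 1, gamma)                   # contamination indicator, (4.6)
  eta <- 0.25 + 0.5 * X[, 1] + X[, 2]
  F0  <- 1 / (1 + exp(-eta))                   # logistic component F(x), (4.7)
  G0  <- ifelse(eta < 0, 0.1, 0.9)             # contaminating component G(x), (4.8)
  p0  <- ifelse(r == 0, F0, G0)                # P(Y = 0 | X = x), as in (4.5)
  Y   <- rbinom(n, 1, 1 - p0)                  # so that P(Y = 0 | X = x) equals p0
  data.frame(Y = factor(Y), X)
}

Feeding rcontam_logistic into the estimate_IMR sketch above, with gamma varied over 0, 0.2, ..., 1, mimics the comparison summarized in Figure 4.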

[Figure 4 here: two panels ("SVM" and "Random Forest") plotting Error Rate (0.1 to 0.4) against γ = 0, 0.2, 0.4, 0.6, 0.8, 1; legends SVM vs. CATCH+SVM and RFC vs. CATCH+RFC.]

Figure 4: IMR*'s of SVM, RFC, CATCH+SVM, CATCH+RFC versus different contamination parameters γ for simulation model (4.9) in Example 4.4.

5 CATCH Analysis of The ANN Data

In this section CATCH is applied to a data set available at the UCI data base (http://ftp.ics.uci.edu/pub/machine-learning-databases/thyroid-disease). The data are from a clinical trial to determine whether a patient is hypothyroid (see, e.g., Quinlan [15]). The dataset is split into two parts: a training set with 3772 observations and a testing set with 3428 observations. The response is a categorical variable with three classes: normal (not hypothyroid), hyperfunction, and subnormal functioning. There are 21 predictors, including 15 categorical predictors and 6 numerical predictors. The three classes are very imbalanced, with 92.5% of the patients normal. A good classifier should produce a significant decrease from the 7.5% error rate that can be obtained by classifying all observations to the normal class.

CATCH identifies 7 variables as important, including 4 continuous variables and 3 categorical variables. We consider three classification methods: 1. Support Vector Machine with radial kernel, denoted by SVM(r); 2. Support Vector Machine with polynomial kernel, denoted by SVM(p); 3. RFC. We apply these three methods to the training set to build classification models and use the test set to estimate the misclassification rates of the methods. We apply the methods with CATCH and without CATCH, respectively, where "with CATCH" means that only the 7 important variables identified by CATCH are used to build the model and "without CATCH" means that all 21 variables are used in the analysis.
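The comparison can be organized as in the following R sketch; here train, test, the response column class, and the selected-variable list catch_vars are placeholders, not objects defined in the paper or in the UCI files.

library(e1071)
library(randomForest)

err <- function(fit, newdata, truth) mean(predict(fit, newdata) != truth)

# Without CATCH: all 21 predictors.
svm_r_all <- svm(class ~ ., data = train, kernel = "radial")
svm_p_all <- svm(class ~ ., data = train, kernel = "polynomial")
rfc_all   <- randomForest(class ~ ., data = train)

# With CATCH: only the 7 variables selected by CATCH (placeholder name vector).
fmla      <- reformulate(catch_vars, response = "class")
svm_r_sel <- svm(fmla, data = train, kernel = "radial")
svm_p_sel <- svm(fmla, data = train, kernel = "polynomial")
rfc_sel   <- randomForest(fmla, data = train)

rbind(without_CATCH = c(SVM_r = err(svm_r_all, test, test$class),
                        SVM_p = err(svm_p_all, test, test$class),
                        RFC   = err(rfc_all,   test, test$class)),
      with_CATCH    = c(SVM_r = err(svm_r_sel, test, test$class),
                        SVM_p = err(svm_p_sel, test, test$class),
                        RFC   = err(rfc_sel,   test, test$class)))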

Table 5: Misclassification rates of three classification methods (SVM(r), SVM(p), RFC) with CATCH and without CATCH.

                 SVM(r)   SVM(p)   RFC
w/o CATCH        0.052    0.063    0.024
with CATCH       0.037    0.055    0.015

Table 5 shows that if CATCH is used to select important variables before the classification methods are applied, the misclassification rates decrease. CATCH reduces the misclassification rates by 29%, 13%, and 37% for SVM(r), SVM(p), and RFC, respectively. CATCH + RFC has the best performance. Although CATCH is not designed for improving classification accuracy, it can nonetheless help other classification methods to perform better.

6 Properties of Local Contingency Efficacy

In this section we investigate the properties of local contingency efficacy. For a univariate continuous predictor, Theorem 6.1 below shows that the local contingency efficacy is well-defined and equals 0 if the response variable is independent of the predictor. Consider an R × C contingency table. Let n_rc, r = 1, · · · , R, c = 1, · · · , C, be the observed frequency in the (r, c)-cell, and let p_rc be the probability that an observation belongs to the (r, c)-cell. Let

p_r = Σ_{c=1}^{C} p_rc,   q_c = Σ_{r=1}^{R} p_rc,    (6.1)

and

n_{r·} = Σ_{c=1}^{C} n_rc,   n_{·c} = Σ_{r=1}^{R} n_rc,   n = Σ_{r=1}^{R} Σ_{c=1}^{C} n_rc,   E_rc = n_{r·} n_{·c} / n.    (6.2)

Consider testing that the rows and the columns are independent, i.e.,

H0: p_rc = p_r q_c for all r, c   versus   H1: p_rc ≠ p_r q_c for some r, c.    (6.3)

A standard test statistic is

X² = Σ_{r=1}^{R} Σ_{c=1}^{C} (n_rc − E_rc)² / E_rc.    (6.4)

Under H0, X² converges in distribution to χ²_{(R−1)(C−1)}. Moreover, defining 0/0 = 0, we have the following well-known result.

Lemma 6.1. As n tends to infinity, X²/n tends in probability to

τ² ≡ Σ_{r=1}^{R} Σ_{c=1}^{C} (p_rc − p_r q_c)² / (p_r q_c).

Moreover, if p_r q_c is bounded away from 0 and 1 as n → ∞, and if R and C are fixed as n → ∞, then τ² = 0 if and only if H0 is true.

Remark 6.1. Lemma 6.1 shows that the contingency efficacy ζ in (2.8) is well defined. Moreover, ζ = 0 if and only if X and Y are independent.
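To make the quantities in (6.1)-(6.4) and the plug-in efficacy ζ̂ = n^{−1/2}√X² concrete, here is a minimal R sketch; the function name contingency_efficacy is ours.

# Plug-in contingency efficacy for an observed R x C table of counts,
# using E_rc = n_r. n_.c / n from (6.2) and the chi-square statistic (6.4), with 0/0 = 0.
contingency_efficacy <- function(counts) {
  n  <- sum(counts)
  E  <- outer(rowSums(counts), colSums(counts)) / n
  X2 <- sum(ifelse(E > 0, (counts - E)^2 / E, 0))
  sqrt(X2 / n)
}

# Example: a 2 x 3 table with clear row-column association gives an efficacy well above 0;
# under independence the value would be close to 0.
tab <- rbind(c(30, 10, 5), c(8, 25, 22))
contingency_efficacy(tab)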

Theorem 6.1. Consider a categorical response Y and a continuous predictor X with density function f(x) such that, for a fixed h > 0, f(x) > 0 for x ∈ (x^(0) − h, x^(0) + h). Then:

(1) For fixed h, the estimated local section efficacy ζ̂(x^(0), h) defined in (2.5) converges almost surely to the section efficacy ζ(x^(0), h) defined in (2.4).

(2) If X and Y are independent, then the local efficacy ζ(x^(0)) defined in (2.4) is zero.

(3) If there exists c ∈ {1, · · · , C} such that p_c(x) = P(Y = c | x) has a continuous derivative at x = x^(0) with p′_c(x^(0)) ≠ 0, then ζ(x^(0)) > 0.

The χ² statistic (6.4) can be written in terms of multinomial proportions p̂_rc, with √n( p̂_rc − p_rc ), 1 ≤ r ≤ R, 1 ≤ c ≤ C, converging in distribution to a multivariate normal. By a Taylor expansion of T_n ≡ √(X²/n) at τ = plim √(X²/n), we find that

Theorem 6.2.

√n( T_n − τ ) →_d N(0, v²),    (6.5)

where the asymptotic variance v², which can be obtained from the Taylor expansion, is not needed in this paper.

Remark 6.2. The result (6.5) applies to the multivariate χ²-based efficacies ζ̂, with n replaced by the number of observations that go into the computation of ζ̂. In this case we use the notation χ²_j for the chi-square statistic for the jth variable.

7 Consistency Of CATCH

Let Y denote a variable to be classified into one of the categories 1, . . . , C by using the X_j's in the vector X = (X_1, . . . , X_d). Let X_{−j} ≡ X − {X_j} denote X without the X_j component. Assume that the squared Pearson nonparametric correlation (Doksum and Samarov [4]; Huang and Chen [9]) between X_j and X_{−j} is less than 0.5. We call the variable X_j relevant for Y if, conditionally given X_{−j}, L(Y | X_j = x) is not constant in x, and irrelevant otherwise. Our data are (X^(1), Y^(1)), . . . , (X^(n), Y^(n)), i.i.d. as (X, Y). A variable selection procedure is consistent if the probability that all the relevant variables are selected and none of the irrelevant variables are selected tends to one as n tends to infinity.

We give a general nonparametric framework in which the variable selection rule based on the CATCH importance scores s_j given in (3.8) is consistent.

When the predictors are not all categorical and CATCH involves tubes, we assume that the number of observations in each tube tends to ∞ as n → ∞, and we assume that, for those importance scores that require the selection of a section size h, the selection is from a finite set {h_j} with min_j{h_j} > h_0 for some h_0 > 0. It follows that the number of observations m_{jk} in the kth tube used to calculate the importance score s_j for the variable X_j tends to infinity as n → ∞.

The key quantities that determine consistency are the limits λ_j of s_j. That is,

λ_j = τ(ζ_j, P) = E_P[ ζ_j(X) ] = ∫ ζ_j(x) dP(x),   j = 1, . . . , d,

where ζ_j(x) is the local contingency efficacy defined by (2.4) for numerical X's and by (2.8) for categorical X's, and P is the probability distribution of X. Our estimate s_j of λ_j is

s_j = τ( ζ̂_j, P̂ ) = (1/M) Σ_{k=1}^{M} ζ̂_j( X*_k ),

where ζ̂_j is defined in Section 3 and P̂ is the empirical distribution of X. Let d_0 denote the number of X's that are irrelevant for Y. Set d_1 = d − d_0 and, without loss of generality, reorder the λ_j so that

λ_j > 0 if j = 1, . . . , d_1;   λ_j = 0 if j = d_1 + 1, . . . , d.    (7.1)

Our variable selection rule is

"keep variable X_j if s_j ≥ t" for some threshold t.    (7.2)

Rule (7.2) is consistent if and only if

P( min_{j≤d_1} s_j ≥ t and max_{j>d_1} s_j < t ) → 1.    (7.3)

To establish (7.3), it is enough to show that

P( max_{j>d_1} s_j > t ) → 0    (7.4)

and

P( min_{j≤d_1} s_j ≤ t ) → 0.    (7.5)

We call (7.4) and (7.5) type I and type II consistency, respectively. Type I and type II error probabilities refer to the probabilities of any false positives and any false negatives, respectively.

7.1 Type I consistency

We consider (7.4) first and note that

P( max_{j>d_1} s_j > t ) ≤ Σ_{j=d_1+1}^{d} P( s_j > t ).    (7.6)

For the jth irrelevant variable X_j, j > d_1, the results in Section 6 apply to the local efficacy ζ̂_j(x) at a fixed point x. The importance score s_j defined in (3.8) is the average of such efficacies over a sample of M random points x*_a. We assume that there is a point x^(0) such that, for the irrelevant X_j's and for n large enough, ζ̂_j(x^(0)) is stochastically larger in the right tail than the average M^{−1} Σ_{a=1}^{M} ζ̂_j(x*_a). That is, we assume that there exist t > 0 and n_0 such that, for n ≥ n_0,

P( s_j > t ) ≤ P( s_j^(0) > t ),   j > d_1,    (7.7)

where s_j^(0) denotes ζ̂_j(x^(0)). The existence of such a point x^(0) is established by considering the probability limit of

arg max{ ζ̂_j(x*_a) : x*_a ∈ {x*_1, · · · , x*_M} }.

Note that by Lemma 6.1, s_j^(0) →_p 0 as n → ∞. It follows that we have established (7.4) for t and d_0 = d − d_1 that are fixed as n → ∞. To examine the interesting case where d_0 → ∞ as n → ∞, we need large deviation results, to which we turn next. In the d_0 → ∞ case we consider t ≡ t_n that tends to ∞ as n → ∞, and we use the weaker assumption: there exist x^(0) and n_0 such that

P( s_j > t_n ) ≤ P( s_j^(0) > t_n )    (7.8)

for n ≥ n_0 and each t_n with lim_{n→∞} t_n = ∞. We also assume that the number of columns C and rows R in the contingency table that defines the efficacy χ²_{jn} given by (6.4) stays fixed as n → ∞. In this case P( s_j^(0) > t_n ) → 0 provided each term

[T_rc^(j)]² ≡ [ (n_rc − E_rc) / (√n √E_rc) ]²

in s_j² = χ²_{jn}/n, defined for X_j by (6.4), satisfies

P( |T_rc^(j)| > t_n ) → 0.

This is because

P( √( Σ_r Σ_c [T_rc^(j)]² ) > t_n ) ≤ P( RC max_{r,c} [T_rc^(j)]² > t_n² ) ≤ RC max_{r,c} P( |T_rc^(j)| > t_n / √(RC) ),

and t_n and t_n/√(RC) are of the same order as n → ∞. By large deviation theory (Hall [8], Jing et al. [10]), for some constant A,

P( |T_rc^(j)| > t_n ) = (2 t_n^{−1} / √(2π)) e^{−t_n²/2} [ 1 + A t_n³ n^{−1/2} + o( t_n³ n^{−1/2} ) ],    (7.9)

provided t_n = o(n^{1/6}). If (7.9) is uniform in j > d_1, then (7.6) and (7.9) imply that

type I consistency holds if

d_0 ( t_n / √2 )^{−1} e^{−t_n²/2} → 0.    (7.10)

To examine (7.10), write d_0 and t_n as d_0 = exp(n^b) and t_n = √2 n^r. By large deviation theory, (7.10) holds when r < 1/6. Thus if 0 < b/2 < r < 1/6, then

d_0 ( t_n / √2 )^{−1} e^{−t_n²/2} = n^{−r} exp( n^b − n^{2r} ) → 0.

Thus d_0 = exp(n^b) with b = 1/3 − ε, for any ε > 0, is the largest d_0 can be to have consistency. For instance, if there are exactly d_0 = exp(n^{1/4}) irrelevant variables and we use the threshold t_n = √2 n^{1/7}, the probability of selecting any of the irrelevant variables tends to zero at the rate n^{−1/4} exp( n^{1/4} − n^{2/7} ).

7.2 Type II consistency

We now assume that for each relevant variable X_j there is a least favorable point x^(1) such that the local efficacy s_j^(1) ≡ ζ̂_j(x^(1)) is stochastically smaller than the importance score s_j = M^{−1} Σ_{a=1}^{M} ζ̂_j(x*_a). That is, we assume that for each relevant variable X_j there exist n_1 and x^(1) such that, for n ≥ n_1,

P( s_j^(1) ≤ t_n ) ≥ P( s_j ≤ t_n ),   j ≤ d_1.    (7.11)

The existence of such a least favorable x^(1) is established by considering the probability limit of

arg min{ ζ̂_j(x*_a) : x*_a ∈ {x*_1, · · · , x*_M} }.

We again assume that R and C are fixed, so that to show type II consistency it is enough to show that, for each (r, c) and j ≤ d_1,

P( |T_rc^(j)| ≤ t_n ) → 0.

This is because

P( √( Σ_r Σ_c [T_rc^(j)]² ) ≤ t_n ) ≤ P( min_{r,c} [T_rc^(j)]² ≤ t_n² ) ≤ RC max_{r,c} P( |T_rc^(j)| ≤ t_n ).

By Theorem 6.2, √n( s_j^(1) − τ_j ) →_d N(0, ν²), τ_j > 0. It follows that if t_n and d_1 are bounded above as n → ∞ and τ_j > 0, then P( s_j^(1) ≤ t_n ) → 0. Thus type II consistency holds in this case. To examine the more interesting case, where t_n and d_1 are not bounded above and τ_j may tend to zero as n → ∞, we will need to center the T_rc^(j). Thus set

θ_n^(j) = √n ( p_rc^(j) − p_r^(j) p_c^(j) ) / √( p_r^(j) p_c^(j) ),    (7.12)

T_n^(j) = T_rc^(j) − θ_n^(j),

where for simplicity the dependence of T_n^(j) and θ_n^(j) on r and c is not indicated.

Lemma 7.1.

P( min_{j≤d_1} |T_rc^(j)| ≤ t_n ) ≤ Σ_{j=1}^{d_1} P( |T_n^(j)| ≥ min_{j≤d_1} |θ_n^(j)| − t_n ).

Proof.

P( min_{j≤d_1} |T_rc^(j)| ≤ t_n ) = P( min_{j≤d_1} |T_n^(j) + θ_n^(j)| ≤ t_n )
≤ P( min_{j≤d_1} |θ_n^(j)| − max_{j≤d_1} |T_n^(j)| ≤ t_n )
= P( max_{j≤d_1} |T_n^(j)| ≥ min_{j≤d_1} |θ_n^(j)| − t_n )
≤ Σ_{j=1}^{d_1} P( |T_n^(j)| ≥ min_{j≤d_1} |θ_n^(j)| − t_n ).

Set a_n = min_{j≤d_1} |θ_n^(j)| − t_n; then, by large deviation theory, if a_n = o(n^{1/6}),

P( |T_n^(j)| ≥ a_n ) ∝ ( a_n / √2 )^{−1} exp{ −a_n²/2 },    (7.13)

where ∝ signifies asymptotic order as n → ∞, a_n → ∞. It follows that if (7.13) holds uniformly in j, then by Lemma 7.1,

P( min_{j≤d_1} |T_rc^(j)| ≤ t_n ) ∝ d_1 ( a_n / √2 )^{−1} exp{ −a_n²/2 }.

Thus type II consistency holds when a_n = o(n^{1/6}) and

d_1 ( a_n / √2 )^{−1} exp{ −a_n²/2 } → 0.    (7.14)

In Section 7.1, t_n is of order n^r, 0 < r < 1/6. For this t_n, type II consistency holds if a_n is of order n^r, 0 < r < 1/6, and

d_1 n^{−r} exp{ −n^{2r}/2 } → 0.

If a_n = O(n^k) with k ≥ 1/6, then P( |T_n^(j)| ≥ a_n ) tends to zero at a rate faster than in (7.13), and consistency is ensured. For instance, if all relevant variables X_j have p_rc^(j) fixed as n → ∞, then min_{j≤d_1} |θ_n^(j)| is of order √n and a_n is of order n^{1/2} − t_n.

7.3 Type I and II consistency

We see from Sections 7.1 and 7.2 that consistency holds under a variety of conditions. One set of conditions is the following.

Theorem 7.1. Under the regularity conditions of Sections 7.1 and 7.2, if

d_0 = c_0 exp(n^b),   t_n = c_1 n^r,   min_{j≤d_1} θ_n^(j) − t_n = c_2 n^r,   d_1 = c_3 exp(n^b),

for positive constants c_0, c_1, c_2, and c_3, and 0 < b/2 < r < 1/6, then CATCH is consistent.

Appendix: Proof of Theorem 6.1

(1) Let N_h = Σ_{i=1}^{n} I( X^(i) ∈ N_h(x^(0)) ). Then N_h ∼ Bi( n, ∫_{x^(0)−h}^{x^(0)+h} f(x) dx ) and N_h → ∞ almost surely as n → ∞. Write

ζ̂(x^(0), h) = n^{−1/2} √( X²(x^(0), h) ) = ( N_h / n )^{1/2} ( N_h )^{−1/2} √( X²(x^(0), h) ).

Since N_h ∼ Bi( n, ∫_{x^(0)−h}^{x^(0)+h} f(x) dx ), we have

N_h / n → ∫_{x^(0)−h}^{x^(0)+h} f(x) dx.    (7.15)

By Lemma 6.1 and Table 1,

( N_h )^{−1/2} √( X²(x^(0), h) ) → ( Σ_{r=+,−} Σ_{c=1}^{C} (p_rc − p_r q_c)² / (p_r q_c) )^{1/2},    (7.16)

where

p_{+c} = Pr( X > x^(0), Y = c | X ∈ N_h(x^(0)) ),   p_{−c} = Pr( X ≤ x^(0), Y = c | X ∈ N_h(x^(0)) )

for r = +, − and c = 1, · · · , C, and

p_r = Σ_{c=1}^{C} p_rc,   q_c = Σ_{r=+,−} p_rc.

In fact, p_+ = Pr( X > x^(0) | X ∈ N_h(x^(0)) ), p_− = Pr( X ≤ x^(0) | X ∈ N_h(x^(0)) ), and q_c = Pr( Y = c | X ∈ N_h(x^(0)) ).

By (7.15) and (7.16),

ζ̂(x^(0), h) → ( ∫_{x^(0)−h}^{x^(0)+h} f(x) dx )^{1/2} ( Σ_{r=+,−} Σ_{c=1}^{C} (p_rc − p_r q_c)² / (p_r q_c) )^{1/2} ≡ ζ(x^(0), h).

(2) Since X and Y are independent, p_rc = p_r q_c for r = +, − and c = 1, · · · , C; then ζ(x^(0), h) = 0 for any h. Hence

ζ(x^(0)) = sup_{h>0} ζ(x^(0), h) = 0.

(3) Recall that p_c(x^(0)) = Pr( Y = c | x^(0) ). Without loss of generality, assume p′_c(x^(0)) > 0. Then there exists h such that

Pr( Y = c | x^(0) < X < x^(0) + h ) > Pr( Y = c | x^(0) − h < X < x^(0) ),

which is equivalent to p_{+c}/p_+ > p_{−c}/p_−. This shows that ζ(x^(0)) > 0, since by Lemma 6.1, ζ(x^(0)) = 0 would imply p_{+c}/p_+ = p_{−c}/p_− = p_c.

References

[1] Breiman, L., 2001. Random forests. Machine Learning 45, 5–32.
[2] Breiman, L., Friedman, J. H., Olshen, R. A., Stone, J., 1984. Classification and Regression Trees. Wadsworth, Belmont.
[3] Doksum, K., Tang, S., Tsui, K.-W., 2008. Nonparametric variable selection: The EARTH algorithm. Journal of the American Statistical Association 103(484), 1609–1620.
[4] Doksum, K. A., Samarov, A., 1995. Nonparametric estimation of global functionals and a measure of explanatory power of covariates in regression. Annals of Statistics 23, 1443–1473.
[5] Doksum, K. A., Schafer, C., 2006. Powerful choices: Tuning parameter selection based on power. In: Frontiers in Statistics: Dedicated to Peter Bickel (J. Fan and H. L. Koul, eds.). Imperial College Press, 113–141.
[6] Friedman, J., 1991. Multivariate adaptive regression splines. The Annals of Statistics, 1–67.
[7] Gao, J., Gijbels, I., 2008. Bandwidth selection in nonparametric kernel testing. Journal of the American Statistical Association 103(484), 1584–1594.
[8] Hall, P., 1992. The Bootstrap and Edgeworth Expansion. Springer, New York.
[9] Huang, L.-S., Chen, J., 2008. Analysis of variance, coefficient of determination and F-test for local polynomial regression. Annals of Statistics 36, 2085–2109.
[10] Jing, B.-Y., Shao, Q.-M., Wang, Q., 2003. Self-normalized Cramér-type large deviations for independent random variables. Annals of Probability 31, 2167–2215.
[11] Lin, D. Y., Zeng, D., 2011. Correcting for population stratification in genomewide association studies. Journal of the American Statistical Association 106, 997–1008.
[12] Loh, W.-Y., 2009. Improving the precision of classification trees. The Annals of Applied Statistics 3, 1710–1737.
[13] Price, A., Patterson, N., Plenge, R., Weinblatt, M., Shadick, N., Reich, D., 2006. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38(8), 904–909.
[14] Qiao, Z., Zhou, L., Huang, J., 2008. Linear discriminant analysis for high dimensional, low sample size data. Special issue of World Congress on Engineering 2008.
[15] Quinlan, J. R., 1987. Simplifying decision trees. International Journal of Man-Machine Studies 27(3), 221–234.
[16] Schafer, C., Doksum, K. A., 2009. Selecting local models in multiple regression by maximizing power. Metrika 69, 283–304.
[17] Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267–288.
[18] Wang, L., Shen, X., 2007. On L1-norm multi-class support vector machines: methodology and theory. Journal of the American Statistical Association 102, 583–594.
[19] Zhang, H. H., Liu, Y., Wu, Y., Zhu, J., 2008. Variable selection for the multicategory SVM via adaptive sup-norm regularization. Electronic Journal of Statistics 2, 149–167.


spuriously related to the categorical variable Y This is accomplished by constraining allpredictors but the one we are focusing on For the case where the number of predictorsis huge (d n) this can be done by restricting principal components for some typesof studies See Remark 34

Our approach to nonparametric variable selection is related to the EARTH algo-rithm (Doksum et al [3]) which applies to nonparametric regression problems with acontinuous response variable and continuous predictors It measures the conditionalassociation between a predictor i and the response variable conditional on all the otherpredictors jj 6=i by constraining the jj 6=i variables to regions called tubes The localimportance score is based on a local linear or a local polynomial regression The contri-bution of the current paper is to develop variable selection methods for the classificationproblem with a categorical response variable and predictors that can be continuous dis-crete or categorical

Our CATCH algorithm can be used as a variable selection step before classificationAny classification method preferably nonparametric can be used after we single out theimportant variables In particular SVM and random forest are statistical classificationmethods that can be used with CATCH We show in simulation studies that when thetrue model is highly nonlinear and there are many irrelevant predictors using CATCHto screen out irrelevant or weak predictors greatly improves the performance of SVMand random forest

The CATCH algorithm works with general classification data with both numericaland categorical predictors Moreover the CATCH algorithm is robust when the pre-dictors interact with each other especially when numerical and categorical predictorsinteract eg for hierarchical interactive association between the numerical and cate-gorical predictors We present a model with a reasonable sample size and predictordimension for which random forest has a relatively low chance of detecting the signifi-cance of the categorical predictor The CART and GUIDE algorithms with pruningcan find the correct splitting variables including the categorical one but yield relativelyhigh classification errors Moreover CART and GUIDE have trouble choosing the split-ting predictors in the correct order The CATCH algorithm achieves higher accuracyin the task of variable selection This is due to the importance scores being conditionalas illustrated in the simulation example in section 41

The rest of the paper will proceed as follows In section 2 we introduce importancescores for univariate predictors In sections 31 - 34 we extend these scores to multi-variate predictors and in section 35 we introduce the CATCH algorithm In section4 we use simulation studies to show the effectiveness of the CATCH algorithm A realexample is provided in section 5 In section 6 we provide some theoretical properties tojustify the definition of local contingency efficacy and in section 7 we show asymptoticconsistency of the algorithm

2

2 Importance Scores for Univariate Classification

Let (X(i) Y (i)) i = 1 middot middot middot n be independent and identically distributed (iid) as(X Y ) sim P We first consider the case of a univariate X then use the methodsconstructed for the univariate case to construct the multivariate versions

21 Numerical Predictor Local Contingency Efficacy

Consider the classification problem with a numerical predictor Let

Pr(Y = c|x) = pc(x) c = 1 middot middot middot CCsumc=1

pc(x) equiv 1 (21)

We introduce a measure of how strongly X is associated with Y in a neighbor-hood of a ldquofixedrdquo point x(0) Points x(0) will be selected at random by our algorithmLocal importance scores will be computed for each point then averaged For a fixedh gt 0 define Nh(x

(0)) = x 0 lt |x minus x(0)| le h to be a neighborhood of x(0) andlet n(x(0) h) =

sumni=1 I(X(i) isin Nh(x

(0))) denote the number of data points in the neigh-borhood For c = 1 middot middot middot C let nminusc (x(0) h) be the number of observations (X(i) Y (i))satisfying x(0)minush le X(i) lt x(0) and Y (i) = c let n+

c (x(0) h) be the number of (X(i) Y (i))satisfying x(0) lt X(i) le x(0) +h and Y (i) = c Let nc(x

(0) h) = n+c (x(0) h) +nminusc (x(0) h)

nminus(x(0) h) =sumC

c=1 nminusc (x(0) h) and n+(x(0) h) =

sumCc=1 n

+c (x(0) h) Table 1 gives the

resulting local contingency table

Table 1 Local contingency table

Y = 1 Y = 2 middot middot middot middot middot middot Y = C Total

x(0) minus h le X lt x(0) nminus1 (x(0) h) nminus2 (x(0) h) middot middot middot middot middot middot nminusC(x(0) h) nminus(x(0) h)

x(0) lt X le x(0) + h n+1 (x(0) h) n+

2 (x(0) h) middot middot middot middot middot middot n+C(x(0) h) n+(x(0) h)

Total n1(x(0) h) n2(x

(0) h) middot middot middot middot middot middot nC(x(0) h) n(x(0) h)

To measure how strongly Y relates to local restrictions on X we consider the chi-square statistic

X 2(x(0) h) =Csumc=1

((nminusc (x(0) h)minus Eminusc (x(0) h))2

Eminusc (x(0) h)+

(n+c (x(0) h)minus E+

c (x(0) h))2

E+c (x(0) h)

) (22)

3

where 00 equiv 0 and

Eminusc (x(0) h) =nminus(x(0) h)nc(x

(0) h)

n(x(0) h) E+

c (x(0) h) =n+(x(0) h)nc(x

(0) h)

n(x(0) h) (23)

Due to the local restriction on X the local contingency table might have some zerocells However this will not be an issue for the χ2 statistic The χ2 statistic can detectlocal dependence between Y and X as long as there exist observations from at leasttwo categories of Y in the neighborhood of x(0) When all observations belong to thesame category of Y intuitively no dependence can be detected In this case χ2 statisticis equal to zero which coincides with our intuition

We call the neighborhood Nh(x(0)) a section and maximize X 2(x(0) h) (22) with

respect to the section size h Let plim denote limit in probability Our local measureof association is ζ(x(0)) as given in

Definition 1 Local Contingency Efficacy (For Numerical Predictor) The local con-tingency section efficacy and the Local Contingency Efficacy (LCE) of Y on a numericalpredictor X at the point x(0) are

ζ(x(0) h) = plimnrarrinfinnminus12

radicX 2(x(0) h) ζ(x(0)) = sup

hgt0ζ(x(0) h) (24)

If pc(x) is constant in x for all c isin 1 middot middot middot C for x near x(0) then as n rarr infinX 2(x(0) h) converges in distribution to a χ2

Cminus1 variable and in this case ζ(x(0) h) = 0On the other hand if pc(x) is not constant near x(0) ζ(x(0) h) determines the asymptoticpower for Pitmanrsquos alternatives of the test based on X 2(x(0) h) for testing whether Xis locally independent of Y and is called the efficacy of this test Moreover ζ2(x(0) h)is the asymptotic noncentrality parameter of the statistic nminus1X 2(x(0) h)

The estimators of ζ(x(0) h) and ζ(x(0)) are

ζ(x(0) h) = nminus12radicX 2(x(0) h) ζ(x(0)) = sup

hgt0ζ(x(0) h) (25)

We selected the section size h as

h = argmaxζ(x(0) h) h isin h1 hg

for some grid h1 hg This corresponds to selecting h by maximizing estimatedpower See Doksum and Schafer [5] Gao and Gijbels [7] Doksum et al [3] Schafer andDoksum [16]

4

22 Categorical Predictor Contingency Efficacy

Let (X(i) Y (i)) i = 1 middot middot middot n be as before except X isin 1 middot middot middot C prime is a categoricalpredictor Let n(c cprime) be the number of observations satisfying Y = c and X = cprime andlet

X 2 =Csumc=1

Cprimesumcprime=1

(n(c cprime)minus E(c cprime))2

E(c cprime)2 (26)

where

E(c cprime) =

sumCc=1 n(c cprime)

sumCprime

cprime=1 n(c cprime)

n (27)

Definition 2 Contingency Efficacy (For Categorical Predictor) The ContingencyEfficacy (CE) of Y on a categorical predictor X is

ζ = plimnrarrinfin nminus12radicX 2 (28)

Under the null hypothesis that Y and X are independent X 2 converges in distri-bution to a χ2

(Cminus1)(Cprimeminus1) variable In this case ζ = 0 In general ζ2 is the asymptotic

noncentrality parameter of the statistic nminus1X 2 The contingency efficacy ζ determinesthe asymptotic power of the test based on X 2 for testing whether Y and X are inde-pendent We use ζ as an importance score that measures how strongly Y depends onX The estimator of (28) is

ζ = nminus12radicX 2 (29)

3 Variable Selection for Classification Problems The CATCH Algorithm

Now we consider a multivariate X = (X1 Xd) d gt 1 To examine whether apredictor Xl is related to Y in the presence of all other variables we calculate efficaciesconditional on these variables as well as unconditional or marginal efficacies

31 Numerical Predictors

To examine the effect of Xl in the local region of x(0) = (x(0)1 middot middot middot x(0)d ) we build a

ldquotuberdquo around the center point x(0) in which data points are within a certain radius δof x(0) with respect to all variables except Xl Define Xminusl = (X1 Xlminus1 Xl+1 Xd)

T

to be the complementary vector to Xl Let D(middot middot) be a distance in Rdminus1 which we willdefine later

Definition 3 A tube Tl of size δ (δ gt 0) for the variable Xl at x(0) is the set

Tl equiv Tl(x(0) δD) equiv x D(xminusl x

(0)minusl ) le δ (31)

5

We adjust the size of the tube so that the number of points in the tube is a preas-signed number k We find that k ge 50 is in general a good choice Note that k = ncorresponds to unconditional or marginal nonparametric regression

Within the tube Tl(x(0) δD) Xl is unconstrained To measure the dependence of

Y on Xl locally we consider neighborhoods of x(0) along the direction of Xl within thetube Let

NlhδD(x(0)) = x = (x1 middot middot middot xd)T isin Rd x isin Tl(x(0) δD) |xl minus x(0)l | le h (32)

be a section of the tube Tl(x(0) δD) To measure how strongly Xl relates to Y in

NlhδD(x(0)) we consider

X 2lδD(x(0) h) (33)

where (33) is defined by (22) using only the observations in the tube Tl(x(0) δD)

Analogous to the definitions of local contingency efficacy (24) we define

Definition 4 The local contingency tube section efficacy and the local contingencytube efficacy of Y on Xl at the point x(0) and their estimates are

ζ(x(0) h l) = plimnrarrinfin nminus12

radicX 2lδD(x(0) h) ζl(x

(0)) = suphgt0ζ(x(0) h l) (34)

ζ(x(0) h l) = nminus12radicX 2lδD(x(0) h) ζl(x

(0)) = suphgt0ζ(x(0) h l) (35)

For an illustration of local contingency tube efficacy consider Y isin 1 2 and suppose

logit P (Y = 2|x) =(xminus 1

2

)2 x isin [0 1] Then the local efficacy with h small will be

large while if h ge 1 the efficacy will be zero

32 Numerical and Categorical PredictorsIf Xl is categorical but some of the other predictors are numerical we construct

tubes Tl(x(0) δD) based on the numerical predictors leaving the categorical variables

unconstrained and we use the statistic as defined by (26) to measure the strength of theassociation between Y and Xl in the tube Tl(x

(0) δD) by using only the observationsin the tube Tl(x

(0) δD) We denote this version of (26) by

X 2lδD(x(0)) (36)

Analogous to the definition of contingency efficacy given in (28) we define

Definition 5 The local contingency tube efficacy of Y on Xl at x(0) and its estimateare

ζl(x(0)) = plimnrarrinfin n

minus12radicX 2lδD(x(0)) ζl(x

(0)) = nminus12radicX 2lδD(x(0)) (37)

6

33 The Empirical Importance Scores

The empirical efficacies (35) and (37) measure how strongly Y depends on Xl nearone point x(0) To measure the overall dependency of Y on Xl we randomly sampleM bootstrap points xlowasta a = 1 M with replacement from x(i) 1 le i le n andcalculate the average of local tube efficacies estimates at these points

Definition 6 The CATCH empirical importance score for variable l is

sl = Mminus1Msuma=1

ζl(xlowasta) (38)

where ζl(x(ia))is defined in (35) and (37) for numerical and categorical predictors

respectively

34 Adaptive Tube Distances

Doksum et al [3] used an example to demonstrate that the distance D needs tobe adaptive to keep variables strongly associated with Y from obscuring the effect ofweaker variables Suppose Y depends on three variable X1 X2 and X3 among the largerset of predictors The strength of the dependency ranges from weak to strong whichwill be estimated by the empirical importance scores As we examine the effect of X2

on Y we want to define the tube so that there is relatively less variation in X3 than inX1 in the tube because X3 is more likely to obscure the effect of X2 than X1 To thisend we introduce a tube distance for the variable Xl of the form

Dl(xminusl x(0)minusl ) =

dsumj=1

wj|xj minus x(0)j |I(j 6= l) (39)

where the weights wj adjust the contribution of strong variables to the distance so thatthe strong variables are more constrained

The weights wj are proportional to sj minus sprimej which are determined iteratively and

adaptively as follows in the first iteration set sj = s(1)j = 1 in (39) and compute the

importance score s(2)j using (38) For the second iteration we use the weights sj = s

(2)j

and the distance

Dl(xminusl x(0)minusl ) =

sumdj=1(sj minus sprimej)+|xj minus x

(0)j |I(j 6= l)sumd

j=1(sj minus sprimej)+I(j 6= l)(310)

where 00 equiv 1 and sprimej is a threshold value for sj under the assumption of no conditionalassociation of Xl with Y computed by a simple Monte Carlo technique This adjust-ment is important because some predictors by definition have larger importance scoresthan others when they are all independent of Y For example a categorical predictor

7

with more levels has a larger importance score than that with fewer levels Thereforethe unadjusted scores sj can not accurately quantify the relative importance of the

predictors in the tube distance calculation Next we use (38) to produce sj = s(3)j and

use s(3)j in the next iteration and so on

In the definition above we need to define |xj minus x(0)j | appropriately for a categorical

variable Xj Because Xj is categorical we need to assume that |xj minus x(0)j | is constant

when xj and x(0)j take any pair of different categories One approach is to define |xj minus

x(0)j | =infinI(xj 6= x

(0)j ) in which case all points in the tube are given the same Xj value

as the tube center The approach is a very sensible one as it strictly implements theidea of conditioning on Xj when evaluating the effect of another predictor Xl and Xj

does not contribute to any variation in Y The problem with this definition though isthat when the number of categories of Xj is relatively large as compared to sample sizethere will not be enough points in the tube if we restrict Xj to a single category Insuch a case we do not restrict Xj which means that Xj can take any category in the

tube We define |xj minus x(0)j | = k0I(xj 6= x(0)j ) where k0 is a normalizing constant The

value of k0 is chosen to achieve a balance between numerical variables and categoricalvariables Suppose that X1 is a numerical variable with standard deviation one andX2 is a categorical variable For X2 we set k0 equiv E(|X(1)

1 minus X(2)1 |) where X

(1)1 and

X(2)1 are two independent realizations of X1 Note that k0 =

radic2π if X1 follows a

standard normal distribution Also k0 can be based on the empirical distributions ofthe numerical variables Based on the proceeding discussion of the choice of k0 we usek0 =infin in the simulations and k0 =

radic2π in the real data example

Remark 31 (choice of k0) k0 =radic

2π is used all the time unless the sample size is

large enough in the sense to be explained below Suppose k0 = infin then Dl

(xminusl x

(0)minusl

)is finite only for those observations with xj = x

(0)j When there are multiple categorical

variables Dl is finite only for those points denoted by set Ω(0) that satisfy xj = x(0)j for

all categorical xj When there are many categorical variables or when the total samplesize is small the size of this set can be very small or close to zero in which case thetube cannot be well defined therefore k0 =

radic2π is used as in the real data example

On the other hand when the size of Ω(0) is larger than the specified tube size we canuse k0 =infin as in the simulation examples

35 The CATCH Algorithm

The variable selection algorithm for a classification problem is based on importancescores and an iteration procedureThe Algorithm(0) Standardizing Standardize all the numerical predictors to have sample mean 0and sample standard deviation 1 using linear transformations

8

(1) Initializing Importance Scores Let S = (s1 middot middot middot sd) be importance scores forthe predictors Let S prime = (sprime1 middot middot middot sprimed) such that sprimel is a threshold importance score whichcorresponds to the model where Xl is independent of Y given Xminusl Initialize S to be(1 middot middot middot 1) and S prime to be (0 middot middot middot 0) Let M be the number of tube centers and m thenumber of observations within each tube(2) Tube Hunting Loop For l = 1 middot middot middot d do (a) (b) (c) and (d) below

(a) Selecting tube centers For 0 lt b le 1 set M = [nb] where [ ] is the great-est integer function and randomly sample M bootstrap points Xlowast1 X

lowastM with

replacement from the n observations Set k = 1 do (b)

(b) Constructing tubes For 0 lt a lt 1 set m = [na] Using the tube distanceDl defined in (310) select the tube size a so that there are exactly m ge 10observations inside the tube for Xl at Xlowastk

(c) Calculating efficacy If Xl is a numerical variable let ζl(Xlowastk) be the estimated

local contingency tube efficacy of Y on Xl at Xlowastk as defined in (35) If Xl is a

categorical variable let ζl(Xlowastk) be the estimated contingency tube efficacy of Y

on Xl at Xlowastk as defined in (37)

(d) Thresholding Let Y lowast denote a random variable which is identically distributed

as Y but independent of X Let (Y(1)0 middot middot middot Y (n)

0 ) be a random permutation of

(Y (1) middot middot middot Y (n)) then Y0 equiv (Y(1)0 middot middot middot Y (n)

0 ) can be viewed as a realization of Y lowastLet ζlowastl (Xlowastk) be the estimated local contingency tube efficacy of Y0 on Xl at Xlowastkcalculated as in (c) with Y0 in place of Y

If k lt M set k = k + 1 go to (b) otherwise go to (e)

(e) Updating importance scores Set

snewl =1

M

Msumk=1

ζl(Xlowastk) s

primenewl =

1

M

Msumk=1

ζlowastl (Xlowastk)

Let Snew = (snew1 middot middot middot snewd ) and Sprimenew = (s

primenew1 middot middot middot sprimenewd ) denote the updates of S

and S prime(3) Iterations and Stopping Rule Repeat (2) I times Stop when the change inSnew is small Record the last S and S prime as Sstop and S primestop respectively(4) Deletion step Using Sstop and S primestop delete Xl if sstopl minus sprimestopl lt

radic2SD(sprimestop)

If d is small we can generate several sets of S primestop and then use the sample standarddeviation of all the values in these sets to be SD(sprimestop) See Remark 33End of the algorithm

Remark 32 Step (d) needs to be replaced by (drsquo) below when a global permutation of Ydoes not result in appropriate ldquolocal null distributionrdquo For example when sample sizes

9

of different classes are extremely unequal which is known as unbalanced classificationproblemsit is very likely to have a pure class of Y in local region after the global per-mutation and the threshold S prime will be spuriously low And therefore irrelevant variableswill be falsely selected into the model This approach is more nonparametric but moretime consuming

(drsquo) Local Thresholding Let Y lowast denote a random variable which is distributed as Y

but independent of Xl given Xminusl Let (Y(1)0 middot middot middot Y (m)

0 ) be a random permutation of the

Y rsquos inside the tube then (Y(1)0 middot middot middot Y (m)

0 ) can be viewed as a realization of Y lowast If Xl

is a numerical variable let ζlprime(Xlowastk) be the estimated local contingency tube efficacy of

Y lowast on Xl at Xlowastk as defined in (35) If Xl is a categorical variable let ζlprime(Xlowastk) be the

estimated contingency tube efficacy of Y lowast on Xl at Xlowastk as defined in (37)

Remark 33 (The threshold) Under the assumption H(l)0 that Y is independent of Xl

given Xminusl in Tl sl and sprimel are nearly independent and SD(sl minus sprimel) asympradic

2SD(sprimel) Thusthe rule that deletes Xl when sstopl minus sprimestopl lt

radic2SE(sprimestop) is similar to the one standard

deviation rule of CART and GUIDE The choice of threshold is also discussed in Section7 Alternatively we calculate the p-values for each covariate by permutation tests Morespecifically we permute the response variable and obtain the important scores for thepermeated data and then calculate how extreme sprimestopl compared to permutated scoresWe then identify a covariate as important if its p-value is less than 01

Remark 34 (Large d small n) A key idea in this paper is that when determining theimportance of the variable Xj we restrict (condition) the vector Xminusj to a relatively smallset This is done to address the problem of the spurious correlation of great concern ingenomics (Price et al [13] Lin and Zeng [11]) In such genomic studies the number ofpredictors d typically greatly exceeds sample size n For numerical variables this can bedealt with by replace Xminusj with a vector of principal components explaining most of thevariation in Xminusj Although our paper does not deal with the d n case directly thisremark shows that it is also applicable to the genome wide association studies describedin Price et al [13]

Remark 35 We now provide computational complexity of the algorithm Recall thenotations I is the number of repetitions M is the number of tube centers m is thetube size d is the dimension and n is the sample size The number of operationsneeded for the algorithm is I times d times M times (2b+ 2c+ 2d) + 2e where 2b 2c 2d 2edenote the number of operations required for those steps The computational complexityrequired for 2b 2c 2d and 2e are O (nd) O (m) O (m)and O (M) Since nd m andthe required computation is dominated by step 2b the computational complexity of thealgorithm is O (IMnd2) Solving least squares problem requires O (nd2) operations Soour algorithm requires higher computational complexity by a factor of IM compared toleast squares However the computations for M tube centers are independent of theother computations and are performed in parallel

10

4 Simulation Studies

41 Variable Selection with Categorical and Numerical Predictors

Example 41 Consider d ge 5 predictors X1 X2 Xd and a categorical responsevariable Y The fifth predictor X5 is a Bernoulli random variable which takes twovalues 0 and 1 with equal probability All other predictors are uniformly distributed on[01]

When X5 = 0 the dependent variable Y is related to the predictors in the followingfashion

Y =

0 if X1 minusX2 minus 025 sin(16πX2) le minus051 if minus05 lt X1 minusX2 minus 025 sin(16πX2) le 02 if 0 lt X1 minusX2 minus 025 sin(16πX2) le 053 if X1 minusX2 minus 025 sin(16πX2) gt 05

(41)

When X5 = 1 Y depends on X3 and X4 in the same way as above with X1 replacedby X3 and X2 replaced by X4 The other variables X6 Xd are irrelevant noisevariables All d predictors are independent

In addition to the presence of noise variables X6 Xd this example is challengingin two aspects First there is an interaction effect between the continuous predictorsX1 X4 and the categorical predictor X5 Such interaction effect may arise in prac-tice For example suppose Y represents the effect resulting from multiple treatmentsX1 X4 and that X5 represents the gender of a patient Our model allows males andfemales to respond differently to treatments Secondly the classification boundaries forfixed X5 are highly nonlinear as shown in Figure 1 This design is to demonstrate theour method can detect nonlinear dependence between the response and the predictors

Table 2 shows the number of simulations where the predictors X1 middot middot middot X5 are cor-rectly kept in the model and in column 6 the average number of simulations wheredminus 5 irrelevant predictors are falsely selected to be in the model out of 100 simulationtrials from model (41) In the simulation n = 400 d = 10 50 and 100 are usedWe compare CATCH with other methods including Marginal CATCH Random Forestclassifier (RFC) (Breiman [1]) and GUIDE (Loh [12]) For Marginal CATCH we testthe association between Y and Xl without conditioning on Xminusl RFC is well-known asa powerful tool for prediction Here we use importance scores from RFC for variableselection We create additional 30 noise variables and use their average importancescores multiplied by a factor of 2 as a threshold (Doksum et al [3]) to decide whether avariable should be selected into the model In particular we generate the noise variablesusing uniform and Bernoulli distributions respectively for numerical and categoricalpredictors

The main goal of GUIDE is to solve the variable selection bias problem in regressionand classification tree methods As an illustration if a categorical predictor takes Kdistinct values then the splitting test exhausts 2K minus 1 possibilities and therefore is

11

Tab

le2

Sim

ula

tion

resu

lts

from

Exam

ple

41

wit

hsa

mp

lesi

zen

=400

For

1leile

5

theit

hC

olu

mn

giv

esth

ees

tim

ate

dp

rob

ab

ilit

yof

sele

ctio

nof

rele

vant

vari

ableX

ian

dth

est

and

ard

erro

rb

ase

don

100

sim

ula

tion

sC

olu

mn

6giv

esth

em

ean

of

the

esti

mate

dp

rob

ab

ilit

yof

sele

ctio

nan

dth

est

and

ard

erro

rfo

rth

eir

rele

vant

vari

ab

lesX

6middotmiddotmiddotX

d

Num

ber

ofSel

ecti

ons(

100

sim

ula

tion

s)

ME

TH

OD

dX

1X

2X

3X

4X

5Xiige

6

CA

TC

H10

082

(00

4)0

75(0

04)

078

(00

4)0

82(0

04)

1(0

)0

05(0

02)

500

76(0

04)

076

(00

4)0

87(0

03)

084

(00

4)0

96(0

02)

008

(00

3)

100

077

(00

4)0

73(0

04)

082

(00

4)0

8(0

04)

083

(00

4)0

09(0

03)

Mar

ginal

CA

TC

H10

077

(00

4)0

67(0

05)

07

(00

5)0

8(0

04)

006

(00

2)0

1(0

03)

500

69(0

05)

07

(00

5)0

76(0

04)

079

(00

4)0

07(0

03)

01

(00

3)

100

078

(00

4)0

76(0

04)

073

(00

4)0

73(0

04)

012

(00

3)0

1(0

03)

Ran

dom

For

est

100

71(0

05)

051

(00

5)0

47(0

05)

059

(00

5)0

37(0

05)

006

(00

2)

500

64(0

05)

061

(00

5)0

65(0

05)

058

(00

5)0

28(0

04)

008

(00

3)

100

066

(00

5)0

58(0

05)

059

(00

5)0

65(0

05)

017

(00

4)0

11(0

03)

GU

IDE

100

96(0

02)

098

(00

1)0

93(0

03)

099

(00

1)0

92(0

03)

033

(00

5)

500

82(0

04)

085

(00

4)0

9(0

03)

089

(00

3)0

67(0

05)

012

(00

3)

100

083

(00

4)0

84(0

04)

086

(00

3)0

83(0

04)

061

(00

5)0

08(0

03)

12

Figure 1 The relationship between Y and X1 X2 for model (41) The four areas represent the fourvalues of Y

biased toward being selected The way that GUIDE alleviate this bias is to employthe lack-of-fit tests ie χ2-test applicable to both numerical and categorical predictorswith the p-values adjusted by the bootstrap procedure Since GUIDE has negligible se-lection bias it can include tests for local pairwise interactions Furthermore it achievessensitivity to local curvature by using linear model at each node

The results in table 2 show that Marginal CATCH does a very poor job of select-ing X5 This illustrates the importance of local conditioning RFC does better thanMarginal CATCH but also has difficulties while CATCH and GUIDE can successfullyidentify X5 along with X1 to X4 with high frequency When d = 10 GUIDE does verywell for the relevant variables but selects too many irrelevant variables as d increasesSurprisingly GUIDE selects fewer irrelevant variables as d increases It becomes morecautious as the dimension d increases CATCH performs very well for the high dimen-sional cases (d = 50 and d = 100) This indicates that the CATCH algorithm is robustto the intricate interaction structures between the predictors

In model (41) above the predictors interact in a hierarchical way When X5 = 0the response is related to X1 and X2 as shown in Figure 1 where the two axes arethe two relevant predictors and the four areas partitioned by the curves correspondto different values of Y When X5 = 1 the response depends on X3 and X4 in thesame way This type of hierarchical interaction between the categorical and numericalpredictors are difficult to detect even by state of the art classification methods such assupport vector machine(SVM) and RFC In particular RFC produces an importance

13

score vector that identifies the four numerical variables X1 X4 as very significantbut often leaves the categorical variable X5 unidentified as we see in table 2

The CATCH algorithm however performs very well for this complicated model Wenow explain why CATCH is able to identify X5 as an important variable To explorethe association between X5 and Y we construct M contingency tables of 2 rows and4 columns based on M randomly centered tubes calculate contingency efficacy scoresand then average the scores Figure 2 (a) and (b) illustrate how one of the contingencyefficacy scores is calculated 1000 data points are generated from model (41) and theyare split into two sets according to x5 (a) shows approximately half of the points withx5 = 0 in (x1 x2) dimensions and (b) shows the rest points with x5 = 1 in (x3 x4)dimensions The data points are colored according to Y This representation enablesus to clearly see the classification boundaries (grey lines) The randomly selected tubecenter when x5 is zero is shown as a star in (a) The data points within the tube closeto the tube center in terms of Xminus5 are highlighted by larger dots Since the distancedefining the tube does not involve X5 the larger dots have x5 equal 0 or 1 and arepresent in both (a) and (b) The counts of the larger dots in different colors in (a) and(b) correspond to the cell counts of the two rows x5 = 0 and x5 = 1 in the contingencytable As we can see larger dots in (a) are mostly red and larger dots in (b) are mostlyblue and orange which shows that distribution in Y depends on the value of X5 andthe contingency efficacy score is high We expect the contingency efficacy score wouldbe low only if the tube center has similar values in (x1 x2) and (x3 x4) Such tubecenters would be very few among the M tube centers since they are randomly chosenThus the average contingency efficacy scores is high and X5 is identified as an importantvariables

To contrast this example with a simpler data generating model suppose the effectsof predictors are additive that is the categorical variable is related to the response asan added term to the effect of the numerical predictors In this type of simpler modelsthe RFC and GUIDE algorithm can efficiently identify relevant predictors but so canthe CATCH algorithm We do not include the simulation results for additive modelshere

Remark 41 We observe that the CATCH algorithm is not very sensitive to the choiceof the number of tube centers M or the tube size m = [na] where 0 lt a le 1 and [ ] isthe greatest integer function For example 41 we perform the simulation using M = nand M = n2 and a = 01 015 02 03 05 The results are summarized in Table 3The CATCH algorithm performs similarly for M = n and M = n2 with M = n havinga small winning edge We generally suggest to use M = n if computation resource isof a less concern or n is not large (le 200) When computational cost is a concern(Remark 35) one can reduce tube size to M = n2 for a compromise between detectionpower and the computational cost Table 3 also shows that the tube size works well inthe range between 01 and 03 while the detection power of X5 decreases for the casea = 05 where the tube size is too big to achieve local conditioning This is a consistent

14

observation with the comparison to Marginal CATCH in Table 2 On the other handthe tube size should not be too small to achieve sufficient detection power We suggestto use a = 01 or 015 for sample sizes n ge 400 or m = 50 for smaller sample sizesIn light of these guidelines we use M = n2 = 200 and a = 015 for Examples 41-43M = n = 200 and m = 50 for Example 44

Table 3 Sensitivity study of CATCH for Example 41 (n = 400 d = 50) using different number oftube centers M and tube size m = [na] For 1 le i le 5 the ith column gives the number of times Xi

was selected Column 6 gives the percentage of times irrelevant variables were selected

Number of Selections(100 simulations)M a X1 X2 X3 X4 X5 Xi i ge 6

M = 200 010 75 67 77 69 92 78015 76 76 87 84 96 82020 85 84 90 87 93 99030 89 87 95 89 85 96050 89 90 96 93 62 99

M = 400 010 74 76 82 77 92 72015 85 82 89 87 94 86020 87 88 90 86 96 91030 88 89 91 91 92 94050 89 91 94 93 61 102

In Example 41 all predictors are independent In a similar setting we next considerthe case where the predictors are dependent

Example 42 We generate standard normal random variables Zi for i = 1 middot middot middot d insuch a way that cor(Z1 Z4) = 05 cor(Z3 Z6) = 09and the other Zirsquos are independentWe set Xi = Φ(Zi) for i = 1 middot middot middot d where Φ(middot) is the CDF of a standard normal andtherefore Xi are uniformly distributed marginally as in Example 41 The correlationstructure between the Xirsquos are similar to that of the Zirsquos The dependent variable Y isgenerated in the same way as in Example 41

The results of Example 4.2 are given in Table 4. Comparing with the results in Table 2, both CATCH and GUIDE do well in the presence of dependence between the numerical variables. The explanation is two-fold. First, both methods identify the categorical X5 as an important predictor most of the time, which makes the identification of the other relevant predictors possible. Second, conditioning on X5, Y depends on a pair of variables ("X1 and X2" or "X3 and X4") jointly, and in particular this dependence is nonlinear, which makes both methods less affected by the correlation introduced here.


Figure 2: Sample scatter plots of (x1, x2) and (x3, x4) for data simulated from model (4.1). The left panel includes all the pairs (x1, x2) with x5 = 0, while the right panel includes all the pairs (x3, x4) with x5 = 1. The larger dots in the two plots are within the tube around a tube center, shown as the star.

We now explain the latter in detail for both methods. In CATCH, the algorithm adaptively weighs the effects of the predictors by their importance scores when defining the tubes, that is, when they are being conditioned on; therefore identifying either variable of a relevant pair will help to identify the other, whereas an irrelevant variable, though strongly associated with one of the relevant variables, is down-weighted iteratively and does not strongly affect the selection of the relevant variable. This explains the slightly lower selection probability of X3 in the presence of the strongly correlated X6, and consequently the overall lower selection of X3 and X4 compared to X1 and X2, given the correlation between X1 and X4. In GUIDE, the difference between the dependent and independent cases is even smaller, because GUIDE is very powerful in detecting local interactions by explicitly including pairwise interactions when selecting splitting variables.

Marginal CATCH and Random Forest do poorly in this case, mainly because X5 cannot be detected, as we explained earlier. More specifically, the dependence between Y and X4 in the X5 = 1 case is the same as that between Y and X2 in the X5 = 0 case. The linear terms of X1 and X2 have opposite signs in the decision function; therefore the marginal effects of X1 and X4 are obscured when X5 is not identified as an important variable.

4.2 Improving The Performance of SVM and RF Classifiers

The criterion used in the CATCH algorithm is not based on classification accuracy. However, if we use CATCH to screen out the irrelevant variables and use only the important variables for classification, the performance of other algorithms, such as the Support Vector Machine (SVM) and Random Forest Classifier (RFC), can be significantly improved.

Table 4: Simulation results from Example 4.2 with sample size n = 400. For 1 ≤ i ≤ 6, the ith column gives the estimated probability of selection of variable Xi and the standard error based on 100 simulations, where Xi, i = 1, · · · , 5, are relevant variables with cor(X1, X4) = .5; X6 is an irrelevant variable that is correlated with X3 with correlation .9. Column 7 gives the mean of the estimated probability of selection and that of the standard error for the irrelevant variables X7, · · · , Xd.

                                Number of Selections (100 simulations)
   METHOD           d     X1          X2          X3          X4          X5          X6          Xi, i ≥ 7
   CATCH            10    0.76(0.04)  0.77(0.04)  0.73(0.04)  0.68(0.05)  0.99(0.01)  0.34(0.05)  0.07(0.02)
                    50    0.78(0.04)  0.77(0.04)  0.74(0.04)  0.63(0.05)  0.92(0.03)  0.57(0.05)  0.09(0.03)
                    100   0.76(0.04)  0.77(0.04)  0.80(0.04)  0.74(0.04)  0.82(0.04)  0.55(0.05)  0.09(0.03)
   Marginal CATCH   10    0.35(0.05)  0.63(0.05)  0.80(0.04)  0.32(0.05)  0.09(0.03)  0.71(0.05)  0.12(0.03)
                    50    0.34(0.05)  0.71(0.05)  0.70(0.05)  0.30(0.05)  0.10(0.03)  0.60(0.05)  0.11(0.03)
                    100   0.34(0.05)  0.68(0.05)  0.75(0.04)  0.30(0.05)  0.10(0.03)  0.59(0.05)  0.10(0.03)
   Random Forest    10    0.29(0.05)  0.48(0.05)  0.55(0.05)  0.20(0.04)  0.30(0.05)  0.47(0.05)  0.05(0.02)
                    50    0.26(0.04)  0.52(0.05)  0.57(0.05)  0.30(0.05)  0.24(0.04)  0.45(0.05)  0.08(0.03)
                    100   0.36(0.05)  0.61(0.05)  0.66(0.05)  0.37(0.05)  0.32(0.05)  0.46(0.05)  0.12(0.03)
   GUIDE            10    0.96(0.02)  0.96(0.02)  0.94(0.02)  0.95(0.02)  0.96(0.02)  0.84(0.04)  0.30(0.05)
                    50    0.80(0.04)  0.92(0.03)  0.88(0.03)  0.79(0.04)  0.72(0.04)  0.68(0.05)  0.12(0.03)
                    100   0.81(0.04)  0.89(0.03)  0.90(0.03)  0.76(0.04)  0.75(0.04)  0.58(0.05)  0.08(0.03)

We compare the prediction performance of (1) SVM, (2) RFC, (3) CATCH followed by SVM, and (4) CATCH followed by RFC in this section; we label (3) and (4) CATCH+SVM and CATCH+RFC, respectively. We use the "svm" function in the "e1071" package in R with the radial kernel to perform the SVM analysis, and the "randomForest" function in the "randomForest" package to perform the RFC analysis. We use simulation experiments to illustrate the improvement obtained by using CATCH prior to classifying Y.
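A sketch of the four procedures is given below, using the functions named in the text ("svm" from e1071 with the radial kernel, and "randomForest"), under the assumption that the CATCH screening step is available as a hypothetical function catch_select() returning the indices of the retained variables.

    library(e1071)
    library(randomForest)

    fit_and_classify <- function(Xtrain, Ytrain, Xtest, selected = NULL) {
      if (!is.null(selected)) {          # the "CATCH + ..." variants keep only selected columns
        Xtrain <- Xtrain[, selected, drop = FALSE]
        Xtest  <- Xtest[,  selected, drop = FALSE]
      }
      svm_fit <- svm(x = Xtrain, y = Ytrain, kernel = "radial")
      rf_fit  <- randomForest(x = Xtrain, y = Ytrain)
      list(svm = predict(svm_fit, Xtest), rf = predict(rf_fit, Xtest))
    }

    # usage (Ytrain must be a factor for classification):
    # sel   <- catch_select(Xtrain, Ytrain)             # hypothetical CATCH screening step
    # plain <- fit_and_classify(Xtrain, Ytrain, Xtest)  # (1) SVM and (2) RFC on all variables
    # catch <- fit_and_classify(Xtrain, Ytrain, Xtest, selected = sel)  # (3) and (4)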

Let the true model be

(X, Y) ∼ P,    (4.2)

where P will be specified later. In each of 100 Monte Carlo experiments, n pairs (X(i), Y(i)), i = 1, · · · , n, are generated from the true model (4.2). Based on the simulated data (X(i), Y(i)), i = 1, · · · , n, the methods (1) SVM, (2) RFC, (3) CATCH+SVM, and (4) CATCH+RFC are used to classify a future Y(0) for which we have available a corresponding X(0).

The integrated misclassification rate (IMR; Friedman [6]), defined below, is used to evaluate the performance of the classification algorithms. Let F_{X,Y}(·) be the classifier constructed by a classification algorithm based on X = (X(1), · · · , X(n)) and Y = (Y(1), · · · , Y(n))^T.

Let (X(0), Y(0)) be a future observation with the same distribution as (X(i), Y(i)). We only know X(0) = x(0) and want to classify Y(0). For a possible future observation (x(0), y(0)), define

MR(X, Y; x(0), y(0)) = P(F_{X,Y}(x(0)) ≠ y(0) | X, Y).    (4.3)

Our criterion is

IMR = E E_0[MR(X, Y; x(0), y(0))],    (4.4)

where E_0 is with respect to the distribution of the random (x(0), y(0)), and E is with respect to the distribution of (X, Y).

This double expectation is evaluated using Monte Carlo simulation; the result is denoted IMR*. More precisely, E_0 is estimated using 5000 simulated (x(0), y(0)) observations, and IMR* is then computed on the basis of 100 simulated samples (X_i, Y_i), i = 1, · · · , 100.
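The Monte Carlo evaluation just described can be sketched as follows (assumptions: gen_data() is a hypothetical generator for (X, Y) pairs from the true model, and build_classifier() stands for any of the four procedures being compared).

    estimate_IMR <- function(n_train = 400, n_test = 5000, n_rep = 100,
                             gen_data, build_classifier) {
      mr <- numeric(n_rep)
      for (b in seq_len(n_rep)) {
        train <- gen_data(n_train)                   # one training sample (X, Y)
        test  <- gen_data(n_test)                    # 5000 points approximate E_0
        f     <- build_classifier(train$X, train$Y)  # the classifier F_{X,Y}
        yhat  <- predict(f, test$X)
        mr[b] <- mean(yhat != test$Y)                # misclassification rate MR
      }
      mean(mr)                                       # average over the 100 samples: IMR*
    }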

Example 4.3. First consider model (4.1) in Section 4.1. Figure 3 shows IMR* versus the number of irrelevant variables for this example. The left panel shows that the IMR*'s of the SVM approach increase dramatically as the number of irrelevant variables increases, while the IMR*'s of CATCH+SVM are stable. Thus CATCH stabilizes the performance of SVM for this complex model. The right panel shows that CATCH also stabilizes the performance of RFC. These results indicate that CATCH can improve the prediction accuracy of other classification algorithms.


Figure 3: IMR*'s of SVM, RFC, CATCH+SVM, and CATCH+RFC versus the number of predictors for the simulation study of model (4.1) in Example 4.1. The sample size is n = 400.

Example 4.4. Contaminated Logistic Model. Consider a binary response Y ∈ {0, 1} with

P(Y = 0 | X = x) = { F(x) if r = 0;  G(x) if r = 1 },    (4.5)

where x = (x1, · · · , xd),

r ∼ Bernoulli(γ),    (4.6)

F(x) = (1 + exp{−(0.25 + 0.5 x1 + x2)})^{−1},    (4.7)

G(x) = { 0.1 if 0.25 + 0.5 x1 + x2 < 0;  0.9 if 0.25 + 0.5 x1 + x2 ≥ 0 },    (4.8)

and x1, x2, · · · , xd are realizations of independent standard normal random variables. Hence, given X(i) = x(i), 1 ≤ i ≤ n, the joint distribution of Y(1), Y(2), · · · , Y(n) is the contaminated logistic model with contamination parameter γ:

(1 − γ) Π_{i=1}^n [F(x(i))]^{Y(i)} [1 − F(x(i))]^{1−Y(i)} + γ Π_{i=1}^n [G(x(i))]^{Y(i)} [1 − G(x(i))]^{1−Y(i)}.    (4.9)

We perform a simulation study for n = 200, d = 50, and γ = 0, 0.2, 0.4, 0.6, 0.8, and 1, where γ = 0 corresponds to no contamination. Figure 4 shows the IMR*'s versus the different values of γ for this example. CATCH improves the performance of the SVM and Random Forest classifiers for this contaminated logistic regression model.
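For concreteness, a sketch of drawing one sample from this contaminated model is given below (the helper name gen_contaminated is hypothetical; as in (4.9), the contamination indicator r is drawn once for the whole sample).

    gen_contaminated <- function(n = 200, d = 50, gamma = 0.2) {
      X   <- matrix(rnorm(n * d), n, d)
      eta <- 0.25 + 0.5 * X[, 1] + X[, 2]
      F0  <- 1 / (1 + exp(-eta))             # P(Y = 0 | x) under F, eq. (4.7)
      G0  <- ifelse(eta < 0, 0.1, 0.9)       # P(Y = 0 | x) under G, eq. (4.8)
      r   <- rbinom(1, 1, gamma)             # contamination indicator, eq. (4.6)
      p0  <- if (r == 0) F0 else G0          # the whole sample follows F or G, cf. (4.9)
      Y   <- rbinom(n, 1, 1 - p0)            # Y = 0 with probability p0
      list(X = X, Y = factor(Y))
    }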


Figure 4: IMR*'s of SVM, RFC, CATCH+SVM, and CATCH+RFC versus different contamination parameters γ for simulation model (4.9) in Example 4.4.

5 CATCH Analysis of The ANN Data

In this section CATCH is applied to a data set available at the UCI data base (http://ftp.ics.uci.edu/pub/machine-learning-databases/thyroid-disease). The data are from a clinical trial to determine whether a patient is hypothyroid (see, e.g., Quinlan [15]). The dataset is split into two parts: a training set with 3772 observations and a testing set with 3428 observations. The response is a categorical variable with three classes: normal (not hypothyroid), hyperfunction, and subnormal functioning. There are 21 predictors, including 15 categorical predictors and 6 numerical predictors. The three classes are very imbalanced, with 92.5% of the patients being normal. A good classifier should produce a significant decrease from the 7.5% error rate that can be obtained by classifying all observations to the normal class.

CATCH identifies 7 variables as important, including 4 continuous variables and 3 categorical variables. We consider three classification methods: 1. Support Vector Machine with radial kernel, denoted by SVM(r); 2. Support Vector Machine with polynomial kernel, denoted by SVM(p); 3. RFC. We apply these three methods on the training set to build classification models and use the test set to estimate the misclassification rate of the methods. We apply the methods with CATCH and without CATCH, respectively, where "with CATCH" means that only the 7 important variables identified by CATCH are used to build the model, and "without CATCH" means that all 21 variables are used in the analysis.

Table 5: Misclassification rates of three classification methods (SVM(r), SVM(p), RFC) with CATCH and without CATCH.

                  SVM(r)   SVM(p)   RFC
   w/o CATCH      0.052    0.063    0.024
   with CATCH     0.037    0.055    0.015

Table 5 shows that if CATCH is used to select important variables before the classification methods are applied, the misclassification rates decrease. CATCH reduces the misclassification rates by 29%, 13%, and 37% for SVM(r), SVM(p), and RFC, respectively. CATCH+RFC has the best performance. Although CATCH is not designed for improving classification accuracy, it nonetheless can help other classification methods to perform better.
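A sketch of this with/without-CATCH comparison, assuming the training and test sets have been read into data frames train and test whose (factor) response column is named class, and that catch_vars holds the names of the 7 CATCH-selected predictors (these object and column names are placeholders, not part of the original analysis):

    library(e1071)
    library(randomForest)

    err_rate <- function(fit, newdata) mean(predict(fit, newdata) != newdata$class)

    all_vars   <- setdiff(names(train), "class")
    form_all   <- reformulate(all_vars,   response = "class")   # "without CATCH"
    form_catch <- reformulate(catch_vars, response = "class")   # "with CATCH"

    rates <- sapply(list(without_CATCH = form_all, with_CATCH = form_catch), function(f)
      c(SVM_r = err_rate(svm(f, data = train, kernel = "radial"),     test),
        SVM_p = err_rate(svm(f, data = train, kernel = "polynomial"), test),
        RFC   = err_rate(randomForest(f, data = train),               test)))
    rates   # rows: SVM(r), SVM(p), RFC; columns: without/with CATCH (cf. Table 5)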

6 Properties of Local Contingency Efficacy

In this section we investigate the properties of local contingency efficacy. For a univariate continuous predictor, Theorem 6.1 below shows that the local contingency efficacy is well defined and equals 0 if the response variable is independent of the predictor. Consider an R × C contingency table. Let n_rc, r = 1, · · · , R, c = 1, · · · , C, be the observed frequency in the (r, c)-cell, and let p_rc be the probability that an observation belongs to the (r, c)-cell. Let

p_r· = Σ_{c=1}^C p_rc,   q_·c = Σ_{r=1}^R p_rc,    (6.1)

and

n_r· = Σ_{c=1}^C n_rc,   n_·c = Σ_{r=1}^R n_rc,   n = Σ_{r=1}^R Σ_{c=1}^C n_rc,   E_rc = n_r· n_·c / n.    (6.2)

Consider testing that the rows and the columns are independent, i.e.,

H_0: p_rc = p_r· q_·c for all r, c   versus   H_1: p_rc ≠ p_r· q_·c for some r, c;    (6.3)

a standard test statistic is

X² = Σ_{r=1}^R Σ_{c=1}^C (n_rc − E_rc)² / E_rc.    (6.4)

Under H_0, X² converges in distribution to χ²_{(R−1)(C−1)}. Moreover, defining 0/0 = 0, we have the well-known result:

Lemma 6.1. As n tends to infinity, X²/n tends in probability to

τ² ≡ Σ_{r=1}^R Σ_{c=1}^C (p_rc − p_r· q_·c)² / (p_r· q_·c).

Moreover, if p_r· q_·c is bounded away from 0 and 1 as n → ∞, and if R and C are fixed as n → ∞, then τ² = 0 if and only if H_0 is true.

Remark 6.1. Lemma 6.1 shows that the contingency efficacy ζ in (2.8) is well defined. Moreover, ζ = 0 if and only if X and Y are independent.
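The estimated contingency efficacy ζ̂ = √(X²/n) of (2.9) can be computed directly from an R × C table; the short R sketch below does this for a categorical predictor x and response y (the function name contingency_efficacy is a hypothetical helper).

    contingency_efficacy <- function(x, y) {
      tab <- table(x, y)                             # observed counts n_rc
      n   <- sum(tab)
      E   <- outer(rowSums(tab), colSums(tab)) / n   # expected counts E_rc = n_r. n_.c / n
      X2  <- sum(ifelse(E > 0, (tab - E)^2 / E, 0))  # chi-square statistic (6.4), 0/0 := 0
      sqrt(X2 / n)
    }

    # toy check: a predictor independent of the response gives an efficacy near zero
    set.seed(1)
    y <- factor(sample(1:4, 500, replace = TRUE))
    x <- factor(sample(0:1, 500, replace = TRUE))
    contingency_efficacy(x, y)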

Theorem 6.1. Consider a categorical response Y and a continuous predictor X with density function f(x) such that, for a fixed h > 0, f(x) > 0 for x ∈ (x(0) − h, x(0) + h). Then:

(1) For fixed h, the estimated local section efficacy ζ̂(x(0), h) defined in (2.5) converges almost surely to the section efficacy ζ(x(0), h) defined in (2.4).

(2) If X and Y are independent, then the local efficacy ζ(x(0)) defined in (2.4) is zero.

(3) If there exists c ∈ {1, · · · , C} such that p_c(x) = P(Y = c | x) has a continuous derivative at x = x(0) with p′_c(x(0)) ≠ 0, then ζ(x(0)) > 0.

The χ² statistic (6.4) can be written in terms of multinomial proportions p̂_rc, with √n(p̂_rc − p_rc), 1 ≤ r ≤ R, 1 ≤ c ≤ C, converging in distribution to a multivariate normal. By a Taylor expansion of T_n ≡ √(X²/n) at τ = plim √(X²/n), we find that

Theorem 6.2. √n(T_n − τ) →_d N(0, v²),    (6.5)

where the asymptotic variance v², which can be obtained from the Taylor expansion, is not needed in this paper.

Remark 6.2. The result (6.5) applies to the multivariate χ²-based efficacies ζ̂ with n replaced by the number of observations that go into the computation of ζ̂. In this case we use the notation χ²_j for the chi-square statistic for the jth variable.

7 Consistency Of CATCH

Let Y denote a variable to be classified into one of the categories 1, · · · , C by using the X_j's in the vector X = (X1, · · · , Xd). Let X_{−j} ≡ X − {X_j} denote X without the X_j component. Assume that the squared Pearson nonparametric correlation (Doksum and Samarov [4], Huang and Chen [9]) between X_j and X_{−j} is less than 0.5. We call the variable X_j relevant for Y if, conditionally given X_{−j}, L(Y | X_j = x) is not constant in x, and irrelevant otherwise. Our data are (X(1), Y(1)), · · · , (X(n), Y(n)), i.i.d. as (X, Y). A variable selection procedure is consistent if the probability that all the relevant variables are selected and none of the irrelevant variables are selected tends to one as n tends to infinity.

We give a general nonparametric framework in which the variable selection rule based on the CATCH importance scores s_j given in (3.8) is consistent.

When the predictors are not all categorical and CATCH involves tubes, we assume that the number of observations in each tube tends to ∞ as n → ∞, and we assume that, for those importance scores that require the selection of a section size h, the selection is from a finite set {h_j} with min_j{h_j} > h_0 for some h_0 > 0. It follows that the number of observations m_jk in the kth tube used to calculate the importance score s_j for the variable X_j tends to infinity as n → ∞.

The key quantities that determine consistency are the limits λ_j of s_j. That is,

λ_j = τ(ζ_j, P) = E_P[ζ_j(X)] = ∫ ζ_j(x) dP(x),   j = 1, · · · , d,

where ζ_j(x) is the local contingency efficacy defined by (2.4) for numerical X's and by (2.8) for categorical X's, and P is the probability distribution of X. Our estimate s_j of λ_j is

s_j = τ(ζ̂_j, P̂) = (1/M) Σ_{k=1}^M ζ̂_j(X*_k),

where ζ̂_j is defined in Section 3 and P̂ is the empirical distribution of X. Let d_0 denote the number of X's that are irrelevant for Y. Set d_1 = d − d_0 and, without loss of generality, reorder the λ_j so that

λ_j > 0 if j = 1, · · · , d_1;   λ_j = 0 if j = d_1 + 1, · · · , d.    (7.1)

Our variable selection rule is

"keep variable X_j if s_j ≥ t" for some threshold t.    (7.2)

Rule (7.2) is consistent if and only if

P(min_{j≤d_1} s_j ≥ t and max_{j>d_1} s_j < t) → 1.    (7.3)

To establish (7.3), it is enough to show that

P(max_{j>d_1} s_j > t) → 0    (7.4)

and

P(min_{j≤d_1} s_j ≤ t) → 0.    (7.5)

We call (7.4) and (7.5) type I and type II consistency, respectively. Type I and type II error probabilities refer to the probabilities of any false positives and of any false negatives, respectively.

7.1 Type I consistency

We consider (7.4) first and note that

P(max_{j>d_1} s_j > t) ≤ Σ_{j=d_1+1}^d P(s_j > t).    (7.6)

For the jth irrelevant variable X_j, j > d_1, the results in Section 6 apply to the local efficacy ζ̂_j(x) at a fixed point x. The importance score s_j defined in (3.8) is the average of such efficacies over a sample of M random points x*_a. We assume that there is a point x(0) such that, for irrelevant X_j's and for n large enough, ζ̂_j(x(0)) is stochastically larger in the right tail than the average M^{−1} Σ_{a=1}^M ζ̂_j(x*_a). That is, writing s_j^{(0)} ≡ ζ̂_j(x(0)), we assume that there exist t > 0 and n_0 such that for n ≥ n_0,

P(s_j > t) ≤ P(s_j^{(0)} > t),   j > d_1.    (7.7)

The existence of such a point x(0) is established by considering the probability limit of

arg max{ζ̂_j(x*_a) : x*_a ∈ {x*_1, · · · , x*_M}}.

Note that, by Lemma 6.1, s_j^{(0)} →_p 0 as n → ∞. It follows that we have established (7.4) for t and d_0 = d − d_1 that are fixed as n → ∞. To examine the interesting case where d_0 → ∞ as n → ∞, we need large deviation results, which we turn to next. In the d_0 → ∞ case we consider t ≡ t_n tending to ∞ as n → ∞, and we use the weaker assumption: there exist x(0) and n_0 such that

P(s_j > t_n) ≤ P(s_j^{(0)} > t_n)    (7.8)

for n ≥ n_0 and each t_n with lim_{n→∞} t_n = ∞. We also assume that the number of columns C and rows R in the contingency table that defines the efficacy χ²_{j,n} given by (6.4) stays fixed as n → ∞. In this case P(s_j^{(0)} > t_n) → 0 provided each term

[T_rc^{(j)}]² ≡ [(n_rc − E_rc)/√E_rc]² in s_j² = χ²_{j,n}, defined for X_j by (6.4), satisfies

P(|T_rc^{(j)}| > t_n) → 0.

This is because

P(√(Σ_r Σ_c [T_rc^{(j)}]²) > t_n) ≤ P(RC max_{r,c} [T_rc^{(j)}]² > t_n²) ≤ RC max_{r,c} P(|T_rc^{(j)}| > t_n/√(RC)),

and t_n and t_n/√(RC) are of the same order as n → ∞. By large deviation theory (Hall [8] and Jing et al. [10]), for some constant A,

P(|T_rc^{(j)}| > t_n) = (2 t_n^{−1}/√(2π)) e^{−t_n²/2} [1 + A t_n³ n^{−1/2} + o(t_n³ n^{−1/2})],    (7.9)

provided t_n = o(n^{1/6}). If (7.9) is uniform in j > d_1, then (7.6) and (7.9) imply that type I consistency holds if

d_0 (t_n/√2)^{−1} e^{−t_n²/2} → 0.    (7.10)

To examine (7.10), write d_0 and t_n as d_0 = exp(n^b) and t_n = √2 n^r. By large deviation theory, (7.10) holds when r < 1/6. Thus if 0 < b/2 < r < 1/6, then

d_0 (t_n/√2)^{−1} e^{−t_n²/2} = n^{−r} exp(n^b − n^{2r}) → 0.

Thus d_0 = exp(n^b) with b = 1/3 − ε, for any ε > 0, is the largest d_0 can be to have consistency.

For instance, if there are exactly d_0 = exp(n^{1/4}) irrelevant variables and we use the threshold t_n = √2 n^{1/7}, the probability of selecting any of the irrelevant variables tends to zero at the rate n^{−1/4} exp(n^{1/4} − n^{2/7}).

7.2 Type II consistency

We now assume that for each relevant variable X_j there is a least favorable point x(1) such that the local efficacy s_j^{(1)} ≡ ζ̂_j(x(1)) is stochastically smaller than the importance score s_j = M^{−1} Σ_{a=1}^M ζ̂_j(x*_a). That is, we assume that for each relevant variable X_j there exist n_1 and x(1) such that for n ≥ n_1,

P(s_j^{(1)} ≤ t_n) ≥ P(s_j ≤ t_n),   j ≤ d_1.    (7.11)

The existence of such a least favorable x(1) is established by considering the probability limit of

arg min{ζ̂_j(x*_a) : x*_a ∈ {x*_1, · · · , x*_M}}.

We again assume that R and C are fixed, so that to show type II consistency it is enough to show that, for each (r, c) and j ≤ d_1,

P(|T_rc^{(j)}| ≤ t_n) → 0.

This is because

P(√(Σ_r Σ_c [T_rc^{(j)}]²) ≤ t_n) ≤ P(min_{r,c} [T_rc^{(j)}]² ≤ t_n²) ≤ RC max_{r,c} P(|T_rc^{(j)}| ≤ t_n).

By Theorem 6.2, √n(s_j^{(1)} − τ_j) →_d N(0, ν²), τ_j > 0. It follows that if t_n and d_1 are bounded above as n → ∞ and τ_j > 0, then P(s_j^{(1)} ≤ t_n) → 0. Thus type II consistency holds in this case. To examine the more interesting case, where t_n and d_1 are not bounded above and τ_j may tend to zero as n → ∞, we will need to center the

T_rc^{(j)}. Thus set

θ_n^{(j)} = √n (p_rc^{(j)} − p_r^{(j)} p_c^{(j)}) / √(p_r^{(j)} p_c^{(j)}),    (7.12)

T_n^{(j)} = T_rc^{(j)} − θ_n^{(j)},

where, for simplicity, the dependence of T_n^{(j)} and θ_n^{(j)} on r and c is not indicated.

Lemma 7.1.

P(min_{j≤d_1} |T_rc^{(j)}| ≤ t_n) ≤ Σ_{j=1}^{d_1} P(|T_n^{(j)}| ≥ min_{j≤d_1} |θ_n^{(j)}| − t_n).

Proof.

P(min_{j≤d_1} |T_rc^{(j)}| ≤ t_n) = P(min_{j≤d_1} |T_n^{(j)} + θ_n^{(j)}| ≤ t_n)
  ≤ P(min_{j≤d_1} |θ_n^{(j)}| − max_{j≤d_1} |T_n^{(j)}| ≤ t_n)
  = P(max_{j≤d_1} |T_n^{(j)}| ≥ min_{j≤d_1} |θ_n^{(j)}| − t_n)
  ≤ Σ_{j=1}^{d_1} P(|T_n^{(j)}| ≥ min_{j≤d_1} |θ_n^{(j)}| − t_n).

Set a_n = min_{j≤d_1} |θ_n^{(j)}| − t_n; then by large deviation theory, if a_n = o(n^{1/6}),

P(|T_n^{(j)}| ≥ a_n) ∝ (a_n/√2)^{−1} exp{−a_n²/2},    (7.13)

where ∝ signifies asymptotic order as n → ∞, a_n → ∞. It follows that if (7.13) holds uniformly in j, then by Lemma 7.1,

P(min_{j≤d_1} |T_rc^{(j)}| ≤ t_n) ∝ d_1 (a_n/√2)^{−1} exp{−a_n²/2}.

Thus type II consistency holds when a_n = o(n^{1/6}) and

d_1 (a_n/√2)^{−1} exp{−a_n²/2} → 0.    (7.14)

In Section 7.1, t_n is of order n^r, 0 < r < 1/6. For this t_n, type II consistency holds if a_n is of order n^r, 0 < r < 1/6, and

d_1 n^{−r} exp{−n^{2r}/2} → 0.

If a_n = O(n^k) with k ≥ 1/6, then P(|T_n^{(j)}| ≥ a_n) tends to zero at a rate faster than in (7.13) and consistency is ensured. For instance, if all relevant variables X_j have p_rc^{(j)} fixed as n → ∞, then min_{j≤d_1} |θ_n^{(j)}| is of order √n and a_n is of order n^{1/2} − t_n.

7.3 Type I and II consistency

We see from Sections 7.1 and 7.2 that consistency holds under a variety of conditions. One set of conditions is the following.

Theorem 7.1. Under the regularity conditions of Sections 7.1 and 7.2, if

d_0 = c_0 exp(n^b),   t_n = c_1 n^r,   min_{j≤d_1} |θ_n^{(j)}| − t_n = c_2 n^r,   d_1 = c_3 exp(n^b)

for positive constants c_0, c_1, c_2, and c_3, and 0 < b/2 < r < 1/6, then CATCH is consistent.

Appendix: Proof of Theorem 6.1.

(1) Let N_h = Σ_{i=1}^n I(X(i) ∈ N_h(x(0))). Then N_h ∼ Bi(n, ∫_{x(0)−h}^{x(0)+h} f(x)dx), and N_h → ∞ almost surely as n → ∞. Also,

ζ̂(x(0), h) = n^{−1/2} √(X²(x(0), h)) = (N_h/n)^{1/2} (N_h)^{−1/2} √(X²(x(0), h)).

Since N_h ∼ Bi(n, ∫_{x(0)−h}^{x(0)+h} f(x)dx), we have

N_h/n → ∫_{x(0)−h}^{x(0)+h} f(x)dx.    (7.15)

By Lemma 6.1 and Table 1,

(N_h)^{−1/2} √(X²(x(0), h)) → (Σ_{r=+,−} Σ_{c=1}^C (p_rc − p_r· q_·c)² / (p_r· q_·c))^{1/2},    (7.16)

where

p_{+c} = Pr(X > x(0), Y = c | X ∈ N_h(x(0))),   p_{−c} = Pr(X ≤ x(0), Y = c | X ∈ N_h(x(0))),

and, for r = +, − and c = 1, · · · , C,

p_r· = Σ_{c=1}^C p_rc,   q_·c = Σ_{r=+,−} p_rc.

In fact, p_{+·} = Pr(X > x(0) | X ∈ N_h(x(0))), p_{−·} = Pr(X ≤ x(0) | X ∈ N_h(x(0))), and q_·c = Pr(Y = c | X ∈ N_h(x(0))).

By (7.15) and (7.16),

ζ̂(x(0), h) → (∫_{x(0)−h}^{x(0)+h} f(x)dx)^{1/2} (Σ_{r=+,−} Σ_{c=1}^C (p_rc − p_r· q_·c)² / (p_r· q_·c))^{1/2} ≡ ζ(x(0), h).

(2) Since X and Y are independent, p_rc = p_r· q_·c for r = +, − and c = 1, · · · , C, so ζ(x(0), h) = 0 for any h. Hence

ζ(x(0)) = sup_{h>0} ζ(x(0), h) = 0.

(3) Recall that p_c(x(0)) = Pr(Y = c | x(0)). Without loss of generality assume p′_c(x(0)) > 0. Then there exists h such that

Pr(Y = c | x(0) < X < x(0) + h) > Pr(Y = c | x(0) − h < X < x(0)),

which is equivalent to p_{+c}/p_{+·} > p_{−c}/p_{−·}. This shows that ζ(x(0)) > 0, since by Lemma 6.1, ζ(x(0)) = 0 would imply p_{+c}/p_{+·} = p_{−c}/p_{−·} = q_·c.

References

[1] Breiman, L., 2001. Random forests. Machine Learning 45, 5–32.

[2] Breiman, L., Friedman, J. H., Olshen, R. A., Stone, C. J., 1984. Classification and Regression Trees. Wadsworth, Belmont.

[3] Doksum, K., Tang, S., Tsui, K.-W., 2008. Nonparametric variable selection: The EARTH algorithm. Journal of the American Statistical Association 103(484), 1609–1620.

[4] Doksum, K. A., Samarov, A., 1995. Nonparametric estimation of global functionals and a measure of the explanatory power of covariates in regression. Annals of Statistics 23, 1443–1473.

[5] Doksum, K. A., Schafer, C., 2006. Powerful choices: Tuning parameter selection based on power. In: Frontiers in Statistics: Dedicated to Peter Bickel (J. Fan and H. L. Koul, eds.). Imperial College Press, 113–141.

[6] Friedman, J., 1991. Multivariate adaptive regression splines. The Annals of Statistics, 1–67.

[7] Gao, J., Gijbels, I., 2008. Bandwidth selection in nonparametric kernel testing. Journal of the American Statistical Association 103(484), 1584–1594.

[8] Hall, P., 1992. The Bootstrap and Edgeworth Expansion. Springer, New York.

[9] Huang, L.-S., Chen, J., 2008. Analysis of variance, coefficient of determination and F-test for local polynomial regression. Annals of Statistics 36, 2085–2109.

[10] Jing, B.-Y., Shao, Q.-M., Wang, Q., 2003. Self-normalized Cramér-type large deviations for independent random variables. Annals of Probability 31, 2167–2215.

[11] Lin, D. Y., Zeng, D., 2011. Correcting for population stratification in genomewide association studies. Journal of the American Statistical Association 106, 997–1008.

[12] Loh, W.-Y., 2009. Improving the precision of classification trees. The Annals of Applied Statistics 3, 1710–1737.

[13] Price, A., Patterson, N., Plenge, R., Weinblatt, M., Shadick, N., Reich, D., 2006. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38(8), 904–909.

[14] Qiao, Z., Zhou, L., Huang, J., 2008. Linear discriminant analysis for high dimensional, low sample size data. Special issue of World Congress on Engineering 2008.

[15] Quinlan, J. R., 1987. Simplifying decision trees. International Journal of Man-Machine Studies 27(3), 221–234.

[16] Schafer, C., Doksum, K. A., 2009. Selecting local models in multiple regression by maximizing power. Metrika 69, 283–304.

[17] Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267–288.

[18] Wang, L., Shen, X., 2007. On L1-norm multi-class support vector machines: methodology and theory. Journal of the American Statistical Association 102, 583–594.

[19] Zhang, H. H., Liu, Y., Wu, Y., Zhu, J., 2008. Variable selection for the multicategory SVM via adaptive sup-norm regularization. Electronic Journal of Statistics 2, 149–167.


2 Importance Scores for Univariate Classification

Let (X(i) Y (i)) i = 1 middot middot middot n be independent and identically distributed (iid) as(X Y ) sim P We first consider the case of a univariate X then use the methodsconstructed for the univariate case to construct the multivariate versions

21 Numerical Predictor Local Contingency Efficacy

Consider the classification problem with a numerical predictor Let

Pr(Y = c|x) = pc(x) c = 1 middot middot middot CCsumc=1

pc(x) equiv 1 (21)

We introduce a measure of how strongly X is associated with Y in a neighbor-hood of a ldquofixedrdquo point x(0) Points x(0) will be selected at random by our algorithmLocal importance scores will be computed for each point then averaged For a fixedh gt 0 define Nh(x

(0)) = x 0 lt |x minus x(0)| le h to be a neighborhood of x(0) andlet n(x(0) h) =

sumni=1 I(X(i) isin Nh(x

(0))) denote the number of data points in the neigh-borhood For c = 1 middot middot middot C let nminusc (x(0) h) be the number of observations (X(i) Y (i))satisfying x(0)minush le X(i) lt x(0) and Y (i) = c let n+

c (x(0) h) be the number of (X(i) Y (i))satisfying x(0) lt X(i) le x(0) +h and Y (i) = c Let nc(x

(0) h) = n+c (x(0) h) +nminusc (x(0) h)

nminus(x(0) h) =sumC

c=1 nminusc (x(0) h) and n+(x(0) h) =

sumCc=1 n

+c (x(0) h) Table 1 gives the

resulting local contingency table

Table 1 Local contingency table

Y = 1 Y = 2 middot middot middot middot middot middot Y = C Total

x(0) minus h le X lt x(0) nminus1 (x(0) h) nminus2 (x(0) h) middot middot middot middot middot middot nminusC(x(0) h) nminus(x(0) h)

x(0) lt X le x(0) + h n+1 (x(0) h) n+

2 (x(0) h) middot middot middot middot middot middot n+C(x(0) h) n+(x(0) h)

Total n1(x(0) h) n2(x

(0) h) middot middot middot middot middot middot nC(x(0) h) n(x(0) h)

To measure how strongly Y relates to local restrictions on X we consider the chi-square statistic

X 2(x(0) h) =Csumc=1

((nminusc (x(0) h)minus Eminusc (x(0) h))2

Eminusc (x(0) h)+

(n+c (x(0) h)minus E+

c (x(0) h))2

E+c (x(0) h)

) (22)

3

where 00 equiv 0 and

Eminusc (x(0) h) =nminus(x(0) h)nc(x

(0) h)

n(x(0) h) E+

c (x(0) h) =n+(x(0) h)nc(x

(0) h)

n(x(0) h) (23)

Due to the local restriction on X the local contingency table might have some zerocells However this will not be an issue for the χ2 statistic The χ2 statistic can detectlocal dependence between Y and X as long as there exist observations from at leasttwo categories of Y in the neighborhood of x(0) When all observations belong to thesame category of Y intuitively no dependence can be detected In this case χ2 statisticis equal to zero which coincides with our intuition

We call the neighborhood Nh(x(0)) a section and maximize X 2(x(0) h) (22) with

respect to the section size h Let plim denote limit in probability Our local measureof association is ζ(x(0)) as given in

Definition 1 Local Contingency Efficacy (For Numerical Predictor) The local con-tingency section efficacy and the Local Contingency Efficacy (LCE) of Y on a numericalpredictor X at the point x(0) are

ζ(x(0) h) = plimnrarrinfinnminus12

radicX 2(x(0) h) ζ(x(0)) = sup

hgt0ζ(x(0) h) (24)

If pc(x) is constant in x for all c isin 1 middot middot middot C for x near x(0) then as n rarr infinX 2(x(0) h) converges in distribution to a χ2

Cminus1 variable and in this case ζ(x(0) h) = 0On the other hand if pc(x) is not constant near x(0) ζ(x(0) h) determines the asymptoticpower for Pitmanrsquos alternatives of the test based on X 2(x(0) h) for testing whether Xis locally independent of Y and is called the efficacy of this test Moreover ζ2(x(0) h)is the asymptotic noncentrality parameter of the statistic nminus1X 2(x(0) h)

The estimators of ζ(x(0) h) and ζ(x(0)) are

ζ(x(0) h) = nminus12radicX 2(x(0) h) ζ(x(0)) = sup

hgt0ζ(x(0) h) (25)

We selected the section size h as

h = argmaxζ(x(0) h) h isin h1 hg

for some grid h1 hg This corresponds to selecting h by maximizing estimatedpower See Doksum and Schafer [5] Gao and Gijbels [7] Doksum et al [3] Schafer andDoksum [16]

4

22 Categorical Predictor Contingency Efficacy

Let (X(i) Y (i)) i = 1 middot middot middot n be as before except X isin 1 middot middot middot C prime is a categoricalpredictor Let n(c cprime) be the number of observations satisfying Y = c and X = cprime andlet

X 2 =Csumc=1

Cprimesumcprime=1

(n(c cprime)minus E(c cprime))2

E(c cprime)2 (26)

where

E(c cprime) =

sumCc=1 n(c cprime)

sumCprime

cprime=1 n(c cprime)

n (27)

Definition 2 Contingency Efficacy (For Categorical Predictor) The ContingencyEfficacy (CE) of Y on a categorical predictor X is

ζ = plimnrarrinfin nminus12radicX 2 (28)

Under the null hypothesis that Y and X are independent X 2 converges in distri-bution to a χ2

(Cminus1)(Cprimeminus1) variable In this case ζ = 0 In general ζ2 is the asymptotic

noncentrality parameter of the statistic nminus1X 2 The contingency efficacy ζ determinesthe asymptotic power of the test based on X 2 for testing whether Y and X are inde-pendent We use ζ as an importance score that measures how strongly Y depends onX The estimator of (28) is

ζ = nminus12radicX 2 (29)

3 Variable Selection for Classification Problems The CATCH Algorithm

Now we consider a multivariate X = (X1 Xd) d gt 1 To examine whether apredictor Xl is related to Y in the presence of all other variables we calculate efficaciesconditional on these variables as well as unconditional or marginal efficacies

31 Numerical Predictors

To examine the effect of Xl in the local region of x(0) = (x(0)1 middot middot middot x(0)d ) we build a

ldquotuberdquo around the center point x(0) in which data points are within a certain radius δof x(0) with respect to all variables except Xl Define Xminusl = (X1 Xlminus1 Xl+1 Xd)

T

to be the complementary vector to Xl Let D(middot middot) be a distance in Rdminus1 which we willdefine later

Definition 3 A tube Tl of size δ (δ gt 0) for the variable Xl at x(0) is the set

Tl equiv Tl(x(0) δD) equiv x D(xminusl x

(0)minusl ) le δ (31)

5

We adjust the size of the tube so that the number of points in the tube is a preas-signed number k We find that k ge 50 is in general a good choice Note that k = ncorresponds to unconditional or marginal nonparametric regression

Within the tube Tl(x(0) δD) Xl is unconstrained To measure the dependence of

Y on Xl locally we consider neighborhoods of x(0) along the direction of Xl within thetube Let

NlhδD(x(0)) = x = (x1 middot middot middot xd)T isin Rd x isin Tl(x(0) δD) |xl minus x(0)l | le h (32)

be a section of the tube Tl(x(0) δD) To measure how strongly Xl relates to Y in

NlhδD(x(0)) we consider

X 2lδD(x(0) h) (33)

where (33) is defined by (22) using only the observations in the tube Tl(x(0) δD)

Analogous to the definitions of local contingency efficacy (24) we define

Definition 4 The local contingency tube section efficacy and the local contingencytube efficacy of Y on Xl at the point x(0) and their estimates are

ζ(x(0) h l) = plimnrarrinfin nminus12

radicX 2lδD(x(0) h) ζl(x

(0)) = suphgt0ζ(x(0) h l) (34)

ζ(x(0) h l) = nminus12radicX 2lδD(x(0) h) ζl(x

(0)) = suphgt0ζ(x(0) h l) (35)

For an illustration of local contingency tube efficacy consider Y isin 1 2 and suppose

logit P (Y = 2|x) =(xminus 1

2

)2 x isin [0 1] Then the local efficacy with h small will be

large while if h ge 1 the efficacy will be zero

32 Numerical and Categorical PredictorsIf Xl is categorical but some of the other predictors are numerical we construct

tubes Tl(x(0) δD) based on the numerical predictors leaving the categorical variables

unconstrained and we use the statistic as defined by (26) to measure the strength of theassociation between Y and Xl in the tube Tl(x

(0) δD) by using only the observationsin the tube Tl(x

(0) δD) We denote this version of (26) by

X 2lδD(x(0)) (36)

Analogous to the definition of contingency efficacy given in (28) we define

Definition 5 The local contingency tube efficacy of Y on Xl at x(0) and its estimateare

ζl(x(0)) = plimnrarrinfin n

minus12radicX 2lδD(x(0)) ζl(x

(0)) = nminus12radicX 2lδD(x(0)) (37)

6

33 The Empirical Importance Scores

The empirical efficacies (35) and (37) measure how strongly Y depends on Xl nearone point x(0) To measure the overall dependency of Y on Xl we randomly sampleM bootstrap points xlowasta a = 1 M with replacement from x(i) 1 le i le n andcalculate the average of local tube efficacies estimates at these points

Definition 6 The CATCH empirical importance score for variable l is

sl = Mminus1Msuma=1

ζl(xlowasta) (38)

where ζl(x(ia))is defined in (35) and (37) for numerical and categorical predictors

respectively

34 Adaptive Tube Distances

Doksum et al [3] used an example to demonstrate that the distance D needs tobe adaptive to keep variables strongly associated with Y from obscuring the effect ofweaker variables Suppose Y depends on three variable X1 X2 and X3 among the largerset of predictors The strength of the dependency ranges from weak to strong whichwill be estimated by the empirical importance scores As we examine the effect of X2

on Y we want to define the tube so that there is relatively less variation in X3 than inX1 in the tube because X3 is more likely to obscure the effect of X2 than X1 To thisend we introduce a tube distance for the variable Xl of the form

Dl(xminusl x(0)minusl ) =

dsumj=1

wj|xj minus x(0)j |I(j 6= l) (39)

where the weights wj adjust the contribution of strong variables to the distance so thatthe strong variables are more constrained

The weights wj are proportional to sj minus sprimej which are determined iteratively and

adaptively as follows in the first iteration set sj = s(1)j = 1 in (39) and compute the

importance score s(2)j using (38) For the second iteration we use the weights sj = s

(2)j

and the distance

Dl(xminusl x(0)minusl ) =

sumdj=1(sj minus sprimej)+|xj minus x

(0)j |I(j 6= l)sumd

j=1(sj minus sprimej)+I(j 6= l)(310)

where 00 equiv 1 and sprimej is a threshold value for sj under the assumption of no conditionalassociation of Xl with Y computed by a simple Monte Carlo technique This adjust-ment is important because some predictors by definition have larger importance scoresthan others when they are all independent of Y For example a categorical predictor

7

with more levels has a larger importance score than that with fewer levels Thereforethe unadjusted scores sj can not accurately quantify the relative importance of the

predictors in the tube distance calculation Next we use (38) to produce sj = s(3)j and

use s(3)j in the next iteration and so on

In the definition above we need to define |xj minus x(0)j | appropriately for a categorical

variable Xj Because Xj is categorical we need to assume that |xj minus x(0)j | is constant

when xj and x(0)j take any pair of different categories One approach is to define |xj minus

x(0)j | =infinI(xj 6= x

(0)j ) in which case all points in the tube are given the same Xj value

as the tube center The approach is a very sensible one as it strictly implements theidea of conditioning on Xj when evaluating the effect of another predictor Xl and Xj

does not contribute to any variation in Y The problem with this definition though isthat when the number of categories of Xj is relatively large as compared to sample sizethere will not be enough points in the tube if we restrict Xj to a single category Insuch a case we do not restrict Xj which means that Xj can take any category in the

tube We define |xj minus x(0)j | = k0I(xj 6= x(0)j ) where k0 is a normalizing constant The

value of k0 is chosen to achieve a balance between numerical variables and categoricalvariables Suppose that X1 is a numerical variable with standard deviation one andX2 is a categorical variable For X2 we set k0 equiv E(|X(1)

1 minus X(2)1 |) where X

(1)1 and

X(2)1 are two independent realizations of X1 Note that k0 =

radic2π if X1 follows a

standard normal distribution Also k0 can be based on the empirical distributions ofthe numerical variables Based on the proceeding discussion of the choice of k0 we usek0 =infin in the simulations and k0 =

radic2π in the real data example

Remark 31 (choice of k0) k0 =radic

2π is used all the time unless the sample size is

large enough in the sense to be explained below Suppose k0 = infin then Dl

(xminusl x

(0)minusl

)is finite only for those observations with xj = x

(0)j When there are multiple categorical

variables Dl is finite only for those points denoted by set Ω(0) that satisfy xj = x(0)j for

all categorical xj When there are many categorical variables or when the total samplesize is small the size of this set can be very small or close to zero in which case thetube cannot be well defined therefore k0 =

radic2π is used as in the real data example

On the other hand when the size of Ω(0) is larger than the specified tube size we canuse k0 =infin as in the simulation examples

35 The CATCH Algorithm

The variable selection algorithm for a classification problem is based on importancescores and an iteration procedureThe Algorithm(0) Standardizing Standardize all the numerical predictors to have sample mean 0and sample standard deviation 1 using linear transformations

8

(1) Initializing Importance Scores Let S = (s1 middot middot middot sd) be importance scores forthe predictors Let S prime = (sprime1 middot middot middot sprimed) such that sprimel is a threshold importance score whichcorresponds to the model where Xl is independent of Y given Xminusl Initialize S to be(1 middot middot middot 1) and S prime to be (0 middot middot middot 0) Let M be the number of tube centers and m thenumber of observations within each tube(2) Tube Hunting Loop For l = 1 middot middot middot d do (a) (b) (c) and (d) below

(a) Selecting tube centers For 0 lt b le 1 set M = [nb] where [ ] is the great-est integer function and randomly sample M bootstrap points Xlowast1 X

lowastM with

replacement from the n observations Set k = 1 do (b)

(b) Constructing tubes For 0 lt a lt 1 set m = [na] Using the tube distanceDl defined in (310) select the tube size a so that there are exactly m ge 10observations inside the tube for Xl at Xlowastk

(c) Calculating efficacy If Xl is a numerical variable let ζl(Xlowastk) be the estimated

local contingency tube efficacy of Y on Xl at Xlowastk as defined in (35) If Xl is a

categorical variable let ζl(Xlowastk) be the estimated contingency tube efficacy of Y

on Xl at Xlowastk as defined in (37)

(d) Thresholding Let Y lowast denote a random variable which is identically distributed

as Y but independent of X Let (Y(1)0 middot middot middot Y (n)

0 ) be a random permutation of

(Y (1) middot middot middot Y (n)) then Y0 equiv (Y(1)0 middot middot middot Y (n)

0 ) can be viewed as a realization of Y lowastLet ζlowastl (Xlowastk) be the estimated local contingency tube efficacy of Y0 on Xl at Xlowastkcalculated as in (c) with Y0 in place of Y

If k lt M set k = k + 1 go to (b) otherwise go to (e)

(e) Updating importance scores Set

snewl =1

M

Msumk=1

ζl(Xlowastk) s

primenewl =

1

M

Msumk=1

ζlowastl (Xlowastk)

Let Snew = (snew1 middot middot middot snewd ) and Sprimenew = (s

primenew1 middot middot middot sprimenewd ) denote the updates of S

and S prime(3) Iterations and Stopping Rule Repeat (2) I times Stop when the change inSnew is small Record the last S and S prime as Sstop and S primestop respectively(4) Deletion step Using Sstop and S primestop delete Xl if sstopl minus sprimestopl lt

radic2SD(sprimestop)

If d is small we can generate several sets of S primestop and then use the sample standarddeviation of all the values in these sets to be SD(sprimestop) See Remark 33End of the algorithm

Remark 32 Step (d) needs to be replaced by (drsquo) below when a global permutation of Ydoes not result in appropriate ldquolocal null distributionrdquo For example when sample sizes

9

of different classes are extremely unequal which is known as unbalanced classificationproblemsit is very likely to have a pure class of Y in local region after the global per-mutation and the threshold S prime will be spuriously low And therefore irrelevant variableswill be falsely selected into the model This approach is more nonparametric but moretime consuming

(drsquo) Local Thresholding Let Y lowast denote a random variable which is distributed as Y

but independent of Xl given Xminusl Let (Y(1)0 middot middot middot Y (m)

0 ) be a random permutation of the

Y rsquos inside the tube then (Y(1)0 middot middot middot Y (m)

0 ) can be viewed as a realization of Y lowast If Xl

is a numerical variable let ζlprime(Xlowastk) be the estimated local contingency tube efficacy of

Y lowast on Xl at Xlowastk as defined in (35) If Xl is a categorical variable let ζlprime(Xlowastk) be the

estimated contingency tube efficacy of Y lowast on Xl at Xlowastk as defined in (37)

Remark 33 (The threshold) Under the assumption H(l)0 that Y is independent of Xl

given Xminusl in Tl sl and sprimel are nearly independent and SD(sl minus sprimel) asympradic

2SD(sprimel) Thusthe rule that deletes Xl when sstopl minus sprimestopl lt

radic2SE(sprimestop) is similar to the one standard

deviation rule of CART and GUIDE The choice of threshold is also discussed in Section7 Alternatively we calculate the p-values for each covariate by permutation tests Morespecifically we permute the response variable and obtain the important scores for thepermeated data and then calculate how extreme sprimestopl compared to permutated scoresWe then identify a covariate as important if its p-value is less than 01

Remark 34 (Large d small n) A key idea in this paper is that when determining theimportance of the variable Xj we restrict (condition) the vector Xminusj to a relatively smallset This is done to address the problem of the spurious correlation of great concern ingenomics (Price et al [13] Lin and Zeng [11]) In such genomic studies the number ofpredictors d typically greatly exceeds sample size n For numerical variables this can bedealt with by replace Xminusj with a vector of principal components explaining most of thevariation in Xminusj Although our paper does not deal with the d n case directly thisremark shows that it is also applicable to the genome wide association studies describedin Price et al [13]

Remark 35 We now provide computational complexity of the algorithm Recall thenotations I is the number of repetitions M is the number of tube centers m is thetube size d is the dimension and n is the sample size The number of operationsneeded for the algorithm is I times d times M times (2b+ 2c+ 2d) + 2e where 2b 2c 2d 2edenote the number of operations required for those steps The computational complexityrequired for 2b 2c 2d and 2e are O (nd) O (m) O (m)and O (M) Since nd m andthe required computation is dominated by step 2b the computational complexity of thealgorithm is O (IMnd2) Solving least squares problem requires O (nd2) operations Soour algorithm requires higher computational complexity by a factor of IM compared toleast squares However the computations for M tube centers are independent of theother computations and are performed in parallel

10

4 Simulation Studies

41 Variable Selection with Categorical and Numerical Predictors

Example 41 Consider d ge 5 predictors X1 X2 Xd and a categorical responsevariable Y The fifth predictor X5 is a Bernoulli random variable which takes twovalues 0 and 1 with equal probability All other predictors are uniformly distributed on[01]

When X5 = 0 the dependent variable Y is related to the predictors in the followingfashion

Y =

0 if X1 minusX2 minus 025 sin(16πX2) le minus051 if minus05 lt X1 minusX2 minus 025 sin(16πX2) le 02 if 0 lt X1 minusX2 minus 025 sin(16πX2) le 053 if X1 minusX2 minus 025 sin(16πX2) gt 05

(41)

When X5 = 1 Y depends on X3 and X4 in the same way as above with X1 replacedby X3 and X2 replaced by X4 The other variables X6 Xd are irrelevant noisevariables All d predictors are independent

In addition to the presence of noise variables X6 Xd this example is challengingin two aspects First there is an interaction effect between the continuous predictorsX1 X4 and the categorical predictor X5 Such interaction effect may arise in prac-tice For example suppose Y represents the effect resulting from multiple treatmentsX1 X4 and that X5 represents the gender of a patient Our model allows males andfemales to respond differently to treatments Secondly the classification boundaries forfixed X5 are highly nonlinear as shown in Figure 1 This design is to demonstrate theour method can detect nonlinear dependence between the response and the predictors

Table 2 shows the number of simulations where the predictors X1 middot middot middot X5 are cor-rectly kept in the model and in column 6 the average number of simulations wheredminus 5 irrelevant predictors are falsely selected to be in the model out of 100 simulationtrials from model (41) In the simulation n = 400 d = 10 50 and 100 are usedWe compare CATCH with other methods including Marginal CATCH Random Forestclassifier (RFC) (Breiman [1]) and GUIDE (Loh [12]) For Marginal CATCH we testthe association between Y and Xl without conditioning on Xminusl RFC is well-known asa powerful tool for prediction Here we use importance scores from RFC for variableselection We create additional 30 noise variables and use their average importancescores multiplied by a factor of 2 as a threshold (Doksum et al [3]) to decide whether avariable should be selected into the model In particular we generate the noise variablesusing uniform and Bernoulli distributions respectively for numerical and categoricalpredictors

The main goal of GUIDE is to solve the variable selection bias problem in regressionand classification tree methods As an illustration if a categorical predictor takes Kdistinct values then the splitting test exhausts 2K minus 1 possibilities and therefore is

11

Tab

le2

Sim

ula

tion

resu

lts

from

Exam

ple

41

wit

hsa

mp

lesi

zen

=400

For

1leile

5

theit

hC

olu

mn

giv

esth

ees

tim

ate

dp

rob

ab

ilit

yof

sele

ctio

nof

rele

vant

vari

ableX

ian

dth

est

and

ard

erro

rb

ase

don

100

sim

ula

tion

sC

olu

mn

6giv

esth

em

ean

of

the

esti

mate

dp

rob

ab

ilit

yof

sele

ctio

nan

dth

est

and

ard

erro

rfo

rth

eir

rele

vant

vari

ab

lesX

6middotmiddotmiddotX

d

Num

ber

ofSel

ecti

ons(

100

sim

ula

tion

s)

ME

TH

OD

dX

1X

2X

3X

4X

5Xiige

6

CA

TC

H10

082

(00

4)0

75(0

04)

078

(00

4)0

82(0

04)

1(0

)0

05(0

02)

500

76(0

04)

076

(00

4)0

87(0

03)

084

(00

4)0

96(0

02)

008

(00

3)

100

077

(00

4)0

73(0

04)

082

(00

4)0

8(0

04)

083

(00

4)0

09(0

03)

Mar

ginal

CA

TC

H10

077

(00

4)0

67(0

05)

07

(00

5)0

8(0

04)

006

(00

2)0

1(0

03)

500

69(0

05)

07

(00

5)0

76(0

04)

079

(00

4)0

07(0

03)

01

(00

3)

100

078

(00

4)0

76(0

04)

073

(00

4)0

73(0

04)

012

(00

3)0

1(0

03)

Ran

dom

For

est

100

71(0

05)

051

(00

5)0

47(0

05)

059

(00

5)0

37(0

05)

006

(00

2)

500

64(0

05)

061

(00

5)0

65(0

05)

058

(00

5)0

28(0

04)

008

(00

3)

100

066

(00

5)0

58(0

05)

059

(00

5)0

65(0

05)

017

(00

4)0

11(0

03)

GU

IDE

100

96(0

02)

098

(00

1)0

93(0

03)

099

(00

1)0

92(0

03)

033

(00

5)

500

82(0

04)

085

(00

4)0

9(0

03)

089

(00

3)0

67(0

05)

012

(00

3)

100

083

(00

4)0

84(0

04)

086

(00

3)0

83(0

04)

061

(00

5)0

08(0

03)

12

Figure 1 The relationship between Y and X1 X2 for model (41) The four areas represent the fourvalues of Y

biased toward being selected The way that GUIDE alleviate this bias is to employthe lack-of-fit tests ie χ2-test applicable to both numerical and categorical predictorswith the p-values adjusted by the bootstrap procedure Since GUIDE has negligible se-lection bias it can include tests for local pairwise interactions Furthermore it achievessensitivity to local curvature by using linear model at each node

The results in table 2 show that Marginal CATCH does a very poor job of select-ing X5 This illustrates the importance of local conditioning RFC does better thanMarginal CATCH but also has difficulties while CATCH and GUIDE can successfullyidentify X5 along with X1 to X4 with high frequency When d = 10 GUIDE does verywell for the relevant variables but selects too many irrelevant variables as d increasesSurprisingly GUIDE selects fewer irrelevant variables as d increases It becomes morecautious as the dimension d increases CATCH performs very well for the high dimen-sional cases (d = 50 and d = 100) This indicates that the CATCH algorithm is robustto the intricate interaction structures between the predictors

In model (41) above the predictors interact in a hierarchical way When X5 = 0the response is related to X1 and X2 as shown in Figure 1 where the two axes arethe two relevant predictors and the four areas partitioned by the curves correspondto different values of Y When X5 = 1 the response depends on X3 and X4 in thesame way This type of hierarchical interaction between the categorical and numericalpredictors are difficult to detect even by state of the art classification methods such assupport vector machine(SVM) and RFC In particular RFC produces an importance

13

score vector that identifies the four numerical variables X1 X4 as very significantbut often leaves the categorical variable X5 unidentified as we see in table 2

The CATCH algorithm however performs very well for this complicated model Wenow explain why CATCH is able to identify X5 as an important variable To explorethe association between X5 and Y we construct M contingency tables of 2 rows and4 columns based on M randomly centered tubes calculate contingency efficacy scoresand then average the scores Figure 2 (a) and (b) illustrate how one of the contingencyefficacy scores is calculated 1000 data points are generated from model (41) and theyare split into two sets according to x5 (a) shows approximately half of the points withx5 = 0 in (x1 x2) dimensions and (b) shows the rest points with x5 = 1 in (x3 x4)dimensions The data points are colored according to Y This representation enablesus to clearly see the classification boundaries (grey lines) The randomly selected tubecenter when x5 is zero is shown as a star in (a) The data points within the tube closeto the tube center in terms of Xminus5 are highlighted by larger dots Since the distancedefining the tube does not involve X5 the larger dots have x5 equal 0 or 1 and arepresent in both (a) and (b) The counts of the larger dots in different colors in (a) and(b) correspond to the cell counts of the two rows x5 = 0 and x5 = 1 in the contingencytable As we can see larger dots in (a) are mostly red and larger dots in (b) are mostlyblue and orange which shows that distribution in Y depends on the value of X5 andthe contingency efficacy score is high We expect the contingency efficacy score wouldbe low only if the tube center has similar values in (x1 x2) and (x3 x4) Such tubecenters would be very few among the M tube centers since they are randomly chosenThus the average contingency efficacy scores is high and X5 is identified as an importantvariables

To contrast this example with a simpler data generating model suppose the effectsof predictors are additive that is the categorical variable is related to the response asan added term to the effect of the numerical predictors In this type of simpler modelsthe RFC and GUIDE algorithm can efficiently identify relevant predictors but so canthe CATCH algorithm We do not include the simulation results for additive modelshere

Remark 41 We observe that the CATCH algorithm is not very sensitive to the choiceof the number of tube centers M or the tube size m = [na] where 0 lt a le 1 and [ ] isthe greatest integer function For example 41 we perform the simulation using M = nand M = n2 and a = 01 015 02 03 05 The results are summarized in Table 3The CATCH algorithm performs similarly for M = n and M = n2 with M = n havinga small winning edge We generally suggest to use M = n if computation resource isof a less concern or n is not large (le 200) When computational cost is a concern(Remark 35) one can reduce tube size to M = n2 for a compromise between detectionpower and the computational cost Table 3 also shows that the tube size works well inthe range between 01 and 03 while the detection power of X5 decreases for the casea = 05 where the tube size is too big to achieve local conditioning This is a consistent

14

observation with the comparison to Marginal CATCH in Table 2 On the other handthe tube size should not be too small to achieve sufficient detection power We suggestto use a = 01 or 015 for sample sizes n ge 400 or m = 50 for smaller sample sizesIn light of these guidelines we use M = n2 = 200 and a = 015 for Examples 41-43M = n = 200 and m = 50 for Example 44

Table 3 Sensitivity study of CATCH for Example 41 (n = 400 d = 50) using different number oftube centers M and tube size m = [na] For 1 le i le 5 the ith column gives the number of times Xi

was selected Column 6 gives the percentage of times irrelevant variables were selected

Number of Selections(100 simulations)M a X1 X2 X3 X4 X5 Xi i ge 6

M = 200 010 75 67 77 69 92 78015 76 76 87 84 96 82020 85 84 90 87 93 99030 89 87 95 89 85 96050 89 90 96 93 62 99

M = 400 010 74 76 82 77 92 72015 85 82 89 87 94 86020 87 88 90 86 96 91030 88 89 91 91 92 94050 89 91 94 93 61 102

In Example 41 all predictors are independent In a similar setting we next considerthe case where the predictors are dependent

Example 42 We generate standard normal random variables Zi for i = 1 middot middot middot d insuch a way that cor(Z1 Z4) = 05 cor(Z3 Z6) = 09and the other Zirsquos are independentWe set Xi = Φ(Zi) for i = 1 middot middot middot d where Φ(middot) is the CDF of a standard normal andtherefore Xi are uniformly distributed marginally as in Example 41 The correlationstructure between the Xirsquos are similar to that of the Zirsquos The dependent variable Y isgenerated in the same way as in Example 41

The results of Example 4.2 are given in Table 4. Comparing with the results in Table 2, both CATCH and GUIDE do well in the presence of dependence between the numerical variables. The explanation is two-fold. First, both methods can identify the categorical X5 as an important predictor most of the time, which makes the identification of the other relevant predictors possible. Second, conditioning on X5, Y depends on a pair of variables ("X1 and X2" or "X3 and X4") jointly, and in particular this dependence is nonlinear, which makes both methods less affected by the correlation introduced here.


Figure 2: Sample scatter plots of (x1, x2) and (x3, x4) for simulated data using model (4.1). Panel (a) (X5 = 0) includes all the pairs (x1, x2) with x5 = 0, while panel (b) (X5 = 1) includes all the pairs (x3, x4) with x5 = 1; both axes run from 0 to 1. The larger dots in the two plots are within the tube around a tube center shown as the star.

We now explain the latter in detail for both methods. In CATCH, the algorithm adaptively weighs the effects of the predictors by their importance scores in defining the tubes when they are being conditioned on; therefore identifying either variable of a relevant pair helps to identify the other, whereas an irrelevant variable, though strongly associated with one of the relevant variables, is down-weighted iteratively and does not strongly affect the selection of the relevant variable. This explains the slightly decreasing selection probability of X3 in the presence of the strongly correlated X6 and, consequently, the overall less frequent selection of X3 and X4 compared to X1 and X2, given the correlation between X1 and X4. In GUIDE, the difference between the dependent and independent cases is even smaller, because GUIDE is very powerful in detecting local interactions by literally including pairwise interactions when selecting splitting variables.

Marginal CATCH and Random Forest do poorly in this case, mainly because X5 cannot be detected, as we explained earlier. More specifically, the dependence between Y and X4 in the X5 = 1 case is the same as that between Y and X2 in the X5 = 0 case. The linear terms of X1 and X2 have opposite signs in the decision function; therefore the marginal effects of X1 and X4 are obscured when X5 is not identified as an important variable.

4.2 Improving The Performance of SVM and RF Classifiers

The criterion used in the CATCH algorithm is not based on classification accuracy.

Table 4: Simulation results from Example 4.2 with sample size n = 400. For 1 ≤ i ≤ 6, the ith column gives the estimated probability of selection of variable Xi and the standard error based on 100 simulations, where Xi, i = 1, · · · , 5, are relevant variables with cor(X1, X4) = 0.5. X6 is an irrelevant variable that is correlated with X3 with correlation 0.9. Column 7 gives the mean of the estimated probability of selection and that of the standard error for the irrelevant variables X7, · · · , Xd.

                                     Number of Selections (100 simulations)
METHOD            d    X1          X2          X3          X4          X5          X6          Xi, i >= 7
CATCH            10    0.76(0.04)  0.77(0.04)  0.73(0.04)  0.68(0.05)  0.99(0.01)  0.34(0.05)  0.07(0.02)
                 50    0.78(0.04)  0.77(0.04)  0.74(0.04)  0.63(0.05)  0.92(0.03)  0.57(0.05)  0.09(0.03)
                100    0.76(0.04)  0.77(0.04)  0.80(0.04)  0.74(0.04)  0.82(0.04)  0.55(0.05)  0.09(0.03)
Marginal CATCH   10    0.35(0.05)  0.63(0.05)  0.80(0.04)  0.32(0.05)  0.09(0.03)  0.71(0.05)  0.12(0.03)
                 50    0.34(0.05)  0.71(0.05)  0.70(0.05)  0.30(0.05)  0.10(0.03)  0.60(0.05)  0.11(0.03)
                100    0.34(0.05)  0.68(0.05)  0.75(0.04)  0.30(0.05)  0.10(0.03)  0.59(0.05)  0.10(0.03)
Random Forest    10    0.29(0.05)  0.48(0.05)  0.55(0.05)  0.20(0.04)  0.30(0.05)  0.47(0.05)  0.05(0.02)
                 50    0.26(0.04)  0.52(0.05)  0.57(0.05)  0.30(0.05)  0.24(0.04)  0.45(0.05)  0.08(0.03)
                100    0.36(0.05)  0.61(0.05)  0.66(0.05)  0.37(0.05)  0.32(0.05)  0.46(0.05)  0.12(0.03)
GUIDE            10    0.96(0.02)  0.96(0.02)  0.94(0.02)  0.95(0.02)  0.96(0.02)  0.84(0.04)  0.30(0.05)
                 50    0.80(0.04)  0.92(0.03)  0.88(0.03)  0.79(0.04)  0.72(0.04)  0.68(0.05)  0.12(0.03)
                100    0.81(0.04)  0.89(0.03)  0.90(0.03)  0.76(0.04)  0.75(0.04)  0.58(0.05)  0.08(0.03)

But if we use CATCH to screen out the irrelevant variables and use only the important variables for classification, the performance of other algorithms, such as the Support Vector Machine (SVM) and Random Forest Classifiers (RFC), can be significantly improved. We compare the prediction performance of (1) SVM, (2) RFC, (3) CATCH followed by SVM, and (4) CATCH followed by RFC in this section. We label (3) and (4) CATCH+SVM and CATCH+RFC, respectively. In this section we use the "svm" function in the "e1071" package in R with radial kernel to perform the SVM analysis, and the "randomForest" function in the "randomForest" package to perform the RFC analysis. We use simulation experiments to illustrate the improvement obtained by using CATCH prior to classifying Y.
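A minimal sketch of the four pipelines follows. The screening function catch_select is a hypothetical stand-in for the CATCH step; svm() and randomForest() are the actual functions from the cited packages.

library(e1071)
library(randomForest)

fit_pipelines <- function(X, y, keep) {               # keep = indices of variables retained by CATCH
  y <- factor(y)
  list(
    svm_all        = svm(X, y, kernel = "radial"),
    rfc_all        = randomForest(X, y),
    catch_plus_svm = svm(X[, keep, drop = FALSE], y, kernel = "radial"),
    catch_plus_rfc = randomForest(X[, keep, drop = FALSE], y)
  )
}
## Example use (catch_select is hypothetical):
## keep <- catch_select(X, y)
## fits <- fit_pipelines(X, y, keep)
## predict(fits$catch_plus_svm, newX[, keep, drop = FALSE])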

Let the true model be

(X, Y) ~ P,   (4.2)

where P will be specified later. In each of 100 Monte Carlo experiments, n pairs (X(i), Y(i)), i = 1, · · · , n, are generated from the true model (4.2). Based on the simulated data (X(i), Y(i)), i = 1, · · · , n, the methods (1) SVM, (2) RFC, (3) CATCH+SVM, and (4) CATCH+RFC are used to classify a future Y(0) for which we have available a corresponding X(0).

The integrated misclassification rate (IMR, Friedman [6]) defined below is used to evaluate the performance of the classification algorithms. Let F_{X,Y}(·) be the classifier constructed by a classification algorithm based on X = (X(1), · · · , X(n)) and Y = (Y(1), · · · , Y(n))^T.

Let (X(0), Y(0)) be a future observation with the same distribution as (X(i), Y(i)). We only know X(0) = x(0) and want to classify Y(0). For a possible future observation (x(0), y(0)), define

\mathrm{MR}(\mathbf{X}, \mathbf{Y}, x^{(0)}, y^{(0)}) = P\left(F_{\mathbf{X},\mathbf{Y}}(x^{(0)}) \neq y^{(0)} \mid \mathbf{X}, \mathbf{Y}\right).   (4.3)

Our criterion is

\mathrm{IMR} = E\,E_0\left[\mathrm{MR}(\mathbf{X}, \mathbf{Y}, x^{(0)}, y^{(0)})\right],   (4.4)

where E0 is with respect to the distribution of the random (x(0), y(0)) and E is with respect to the distribution of (X, Y).

This double expectation is evaluated using Monte Carlo simulation. The result is denoted IMR*. More precisely, E0 is estimated using 5000 simulated (x(0), y(0)) observations; then IMR* is computed on the basis of the 100 simulated samples (Xi, Yi), i = 1, · · · , 100.
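The following sketch indicates how such a Monte Carlo approximation of IMR* can be organized. The generator functions gen_train and gen_test are hypothetical placeholders for draws from the true model P, and fit_fun is any fitting routine (e.g. svm() or randomForest()) returning an object with a predict() method.

estimate_IMR <- function(fit_fun, n = 400, n_train_sets = 100, n_test = 5000) {
  mean(replicate(n_train_sets, {
    train <- gen_train(n)                  # (X, y) drawn from model (4.2); y a factor
    test  <- gen_test(n_test)              # independent future observations
    fit   <- fit_fun(train$X, train$y)
    yhat  <- predict(fit, test$X)
    mean(yhat != test$y)                   # estimates E0[MR] for this training sample
  }))                                      # averaging over training samples estimates E
}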

Example 4.3. First consider model (4.1) in Section 4.1. Figure 3 shows IMR* versus the number of irrelevant variables for this example. The left panel shows that the IMR*'s of the SVM approach increase dramatically as the number of irrelevant variables increases, while the IMR*'s of CATCH+SVM are stable. Thus CATCH stabilizes the performance of SVM for this complex model. The right panel shows that CATCH also stabilizes the performance of RFC. These results indicate that CATCH can improve the prediction accuracy of other classification algorithms.

Figure 3: IMR*'s of SVM, RFC, CATCH+SVM, and CATCH+RFC versus the number of predictors d for the simulation study of model (4.1) in Example 4.1 (left panel: SVM and CATCH+SVM; right panel: RFC and CATCH+RFC; the vertical axis is the error rate). The sample size is n = 400.

Example 4.4. Contaminated Logistic Model. Consider a binary response Y ∈ {0, 1} with

P(Y = 0 \mid X = x) =
\begin{cases}
F(x) & \text{if } r = 0,\\
G(x) & \text{if } r = 1,
\end{cases}
\qquad (4.5)

where x = (x_1, \ldots, x_d),

r \sim \mathrm{Bernoulli}(\gamma),   (4.6)

F(x) = \left(1 + \exp\{-(0.25 + 0.5x_1 + x_2)\}\right)^{-1},   (4.7)

G(x) =
\begin{cases}
0.1 & \text{if } 0.25 + 0.5x_1 + x_2 < 0,\\
0.9 & \text{if } 0.25 + 0.5x_1 + x_2 \ge 0,
\end{cases}
\qquad (4.8)

and x_1, x_2, \ldots, x_d are realizations of independent standard normal random variables. Hence, given X(i) = x(i), 1 ≤ i ≤ n, the joint distribution of Y(1), Y(2), . . . , Y(n) is the contaminated logistic model with contamination parameter γ:


(1-\gamma)\prod_{i=1}^{n}\left[F(x^{(i)})\right]^{Y^{(i)}}\left[1-F(x^{(i)})\right]^{1-Y^{(i)}}
\;+\;\gamma\prod_{i=1}^{n}\left[G(x^{(i)})\right]^{Y^{(i)}}\left[1-G(x^{(i)})\right]^{1-Y^{(i)}}.   (4.9)

We perform a simulation study for n = 200, d = 50, and γ = 0, 0.2, 0.4, 0.6, 0.8, and 1, where γ = 0 corresponds to no contamination. Figure 4 shows the IMR*'s versus the different values of γ for this example. CATCH improves the performance of the SVM and Random Forest classifiers for this contaminated logistic regression model.
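A sketch of one draw from this contaminated model is given below (our illustration; the function name is ours). The single sample-level contamination indicator follows the mixture form (4.9).

gen_example44 <- function(n = 200, d = 50, gamma = 0.2) {
  X  <- matrix(rnorm(n * d), n, d)                  # independent standard normal predictors
  lp <- 0.25 + 0.5 * X[, 1] + X[, 2]
  r  <- rbinom(1, 1, gamma)                         # one contamination draw per data set, cf. (4.9)
  p0 <- if (r == 0) 1 / (1 + exp(-lp))              # F(x) in (4.7): P(Y = 0 | x)
        else ifelse(lp < 0, 0.1, 0.9)               # G(x) in (4.8)
  y  <- rbinom(n, 1, 1 - p0)                        # since p0 = P(Y = 0 | x)
  list(X = X, y = y)
}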

Figure 4: IMR*'s of SVM, RFC, CATCH+SVM, and CATCH+RFC versus different contamination parameters γ ∈ {0, 0.2, 0.4, 0.6, 0.8, 1} for simulation model (4.9) in Example 4.4 (left panel: SVM and CATCH+SVM; right panel: RFC and CATCH+RFC; the vertical axis is the error rate).

5 CATCH Analysis of The ANN Data

In this section CATCH is applied to a data set available at the UCI data base (http://ftp.ics.uci.edu/pub/machine-learning-databases/thyroid-disease). The data are from a clinical trial to determine whether a patient is hypothyroid (see, e.g., Quinlan [15]). The dataset is split into two parts: a training set with 3772 observations and a testing set with 3428 observations. The response is a categorical variable with three classes: normal (not hypothyroid), hyperfunction, and subnormal functioning. There are 21 predictors, including 15 categorical predictors and 6 numerical predictors. The three classes are very imbalanced, with 92.5% normal patients. A good classifier should produce a significant decrease from the 7.5% error rate, which can be obtained by classifying all observations to the normal class.


CATCH identifies 7 variables as important, including 4 continuous variables and 3 categorical variables. We consider three classification methods: 1. Support Vector Machine with radial kernel, denoted by SVM(r); 2. Support Vector Machine with polynomial kernel, denoted by SVM(p); 3. RFC. We apply these three methods on the training set to build classification models and use the test set to estimate the misclassification rate of the methods. We apply the methods with CATCH and without CATCH, respectively, where "with CATCH" means that only the 7 important variables identified by CATCH are used to build the model, and "without CATCH" means that all 21 variables are used in the analysis.
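The comparison can be organized as in the sketch below. The file names follow the usual UCI "ann" layout (21 whitespace-separated attributes followed by the class label), and the index vector keep7 is a placeholder, since the paper does not list which 7 variables were selected; both are assumptions.

library(e1071)
library(randomForest)

train <- read.table("ann-train.data")   # assumed layout: last column is the class label
test  <- read.table("ann-test.data")
p     <- ncol(train) - 1
y_tr  <- factor(train[, p + 1]); y_te <- factor(test[, p + 1])

err <- function(fit, newx, newy) mean(as.character(predict(fit, newx)) != as.character(newy))

keep7 <- 1:7   # placeholder indices for the 7 CATCH-selected predictors

err(svm(train[, 1:p], y_tr, kernel = "radial"), test[, 1:p], y_te)        # SVM(r) without CATCH
err(svm(train[, keep7], y_tr, kernel = "radial"), test[, keep7], y_te)    # SVM(r) with CATCH
err(randomForest(train[, 1:p], y_tr), test[, 1:p], y_te)                  # RFC without CATCH
err(randomForest(train[, keep7], y_tr), test[, keep7], y_te)              # RFC with CATCH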

Table 5: Misclassification rates of three classification methods (SVM(r), SVM(p), RFC) with CATCH and without CATCH.

                SVM(r)   SVM(p)    RFC
w/o CATCH       0.052    0.063    0.024
with CATCH      0.037    0.055    0.015

Table 5 shows that if CATCH is used to select important variables before the classification methods are applied, the misclassification rates decrease. CATCH reduces the misclassification rates by 29%, 13%, and 37% for SVM(r), SVM(p), and RFC, respectively. CATCH + RFC has the best performance. Although CATCH is not designed for improving classification accuracy, it can nonetheless help other classification methods to perform better.

6 Properties of Local Contingency Efficacy

In this section we investigate the properties of local contingency efficacy. For a univariate continuous predictor, Theorem 6.1 below shows that the local contingency efficacy is well-defined and equals 0 if the response variable is independent of the predictor. Consider an R × C contingency table. Let n_{rc}, r = 1, · · · , R, c = 1, · · · , C, be the observed frequency in the (r, c)-cell, and let p_{rc} be the probability that an observation belongs to the (r, c)-cell. Let

p_r = \sum_{c=1}^{C} p_{rc}, \qquad q_c = \sum_{r=1}^{R} p_{rc},   (6.1)


and

n_{r\cdot} = \sum_{c=1}^{C} n_{rc}, \qquad n_{\cdot c} = \sum_{r=1}^{R} n_{rc}, \qquad n = \sum_{r=1}^{R}\sum_{c=1}^{C} n_{rc}, \qquad E_{rc} = n_{r\cdot}\, n_{\cdot c}/n.   (6.2)

Consider testing that the rows and the columns are independent, i.e.,

H_0: \forall\, r, c,\; p_{rc} = p_r q_c \quad \text{versus} \quad H_1: \exists\, r, c,\; p_{rc} \neq p_r q_c;   (6.3)

a standard test statistic is

\mathcal{X}^2 = \sum_{r=1}^{R}\sum_{c=1}^{C} \frac{(n_{rc} - E_{rc})^2}{E_{rc}}.   (6.4)

Under H0, X² converges in distribution to χ²_{(R−1)(C−1)}. Moreover, defining 0/0 = 0, we have the well-known result:

Lemma 6.1. As n tends to infinity, X²/n tends in probability to

\tau^2 \equiv \sum_{r=1}^{R}\sum_{c=1}^{C} \frac{(p_{rc} - p_r q_c)^2}{p_r q_c}.

Moreover, if the p_r q_c are bounded away from 0 and 1 as n → ∞, and if R and C are fixed as n → ∞, then τ² = 0 if and only if H0 is true.

Remark 6.1. Lemma 6.1 shows that the contingency efficacy ζ in (2.8) is well defined. Moreover, ζ = 0 if and only if X and Y are independent.
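As an illustration (ours; the function and variable names are made up), the estimated contingency efficacy can be computed directly from an R × C table: under independence the estimate drifts toward 0 as n grows, while under dependence it stabilizes near the positive τ of Lemma 6.1.

contingency_efficacy <- function(x, y) {
  tab <- table(x, y)
  E   <- outer(rowSums(tab), colSums(tab)) / sum(tab)   # expected counts E_rc of (6.2)
  X2  <- sum((tab - E)^2 / E)                           # statistic (6.4)
  sqrt(X2 / sum(tab))                                   # estimate of tau
}
set.seed(1)
x <- sample(1:3, 5000, replace = TRUE)
contingency_efficacy(x, sample(1:2, 5000, replace = TRUE))      # independent case: near 0
contingency_efficacy(x, rbinom(5000, 1, c(0.2, 0.5, 0.8)[x]))   # dependent case: bounded away from 0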

Theorem 6.1. Consider a categorical response Y and a continuous predictor X with density function f(x) such that, for a fixed h > 0, f(x) > 0 for x ∈ (x(0) − h, x(0) + h). Then

(1) For fixed h, the estimated local section efficacy ζ̂(x(0), h) defined in (2.5) converges almost surely to the section efficacy ζ(x(0), h) defined in (2.4).

(2) If X and Y are independent, then the local efficacy ζ(x(0)) defined in (2.4) is zero.

(3) If there exists c ∈ {1, · · · , C} such that p_c(x) = P(Y = c | x) has a continuous derivative at x = x(0) with p′_c(x(0)) ≠ 0, then ζ(x(0)) > 0.

The χ² statistic (6.4) can be written in terms of the multinomial proportions p̂_{rc}, with √n(p̂_{rc} − p_{rc}), 1 ≤ r ≤ R, 1 ≤ c ≤ C, converging in distribution to a multivariate normal. By a Taylor expansion of T_n ≡ √(χ²/n) at τ = plim √(χ²/n), we find that


Theorem 6.2.

\sqrt{n}\,(T_n - \tau) \longrightarrow_d N(0, v^2),   (6.5)

where the asymptotic variance v², which can be obtained from the Taylor expansion, is not needed in this paper.
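For completeness, a sketch of the delta-method step behind (6.5), under the multivariate normality of the cell proportions stated above (this reconstruction is ours, not the authors' proof):

% Delta-method sketch for Theorem 6.2, assuming \tau > 0 and
% \sqrt{n}(\chi^2_n/n - \tau^2) \to_d N(0, \sigma^2).
\[
T_n = \sqrt{\chi^2_n/n} = g\!\left(\chi^2_n/n\right), \qquad g(u) = \sqrt{u},
\]
\[
\sqrt{n}\,(T_n - \tau) = g'(\tau^2)\,\sqrt{n}\!\left(\chi^2_n/n - \tau^2\right) + o_p(1)
\;\longrightarrow_d\; N\!\left(0,\; \frac{\sigma^2}{4\tau^2}\right) = N(0, v^2).
\]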

Remark 6.2. The result (6.5) applies to the multivariate χ²-based efficacies ζ̂ with n replaced by the number of observations that go into the computation of ζ̂. In this case we use the notation χ²_j for the chi-square statistic for the jth variable.

7 Consistency Of CATCH

Let Y denote a variable to be classified into one of the categories 1, . . . , C by using the X_j's in the vector X = (X_1, . . . , X_d). Let X_{−j} ≡ X − X_j denote X without the X_j component. Assume that the squared Pearson nonparametric correlation (Doksum and Samarov [4], Huang and Chen [9]) between X_j and X_{−j} is less than 0.5. We call the variable X_j relevant for Y if, conditionally given X_{−j}, L(Y | X_j = x) is not constant in x, and irrelevant otherwise. Our data are (X(1), Y(1)), . . . , (X(n), Y(n)), iid as (X, Y). A variable selection procedure is consistent if the probability that all the relevant variables are selected and none of the irrelevant variables are selected tends to one as n tends to infinity.

We give a general nonparametric framework in which the variable selection rule based on the CATCH importance scores s_j given in (3.8) is consistent.

When the predictors are not all categorical and CATCH involves tubes, we assume that the number of observations in each tube tends to ∞ as n → ∞, and we assume that for those importance scores that require the selection of a section size h, the selection is from a finite set {h_j} with min_j{h_j} > h_0 for some h_0 > 0. It follows that the number of observations m_{jk} in the kth tube used to calculate the importance score s_j for the variable X_j tends to infinity as n → ∞.

The key quantities that determine consistency are the limits λ_j of s_j. That is,

\lambda_j = \tau(\zeta_j, P) = E_P[\zeta_j(X)] = \int \zeta_j(x)\,dP(x), \qquad j = 1, \ldots, d,

where ζ_j(x) is the local contingency efficacy defined by (2.4) for numerical X's and by (2.8) for categorical X's, and P is the probability distribution of X. Our estimate s_j of λ_j is

s_j = \tau(\hat{\zeta}_j, \hat{P}) = \frac{1}{M}\sum_{k=1}^{M}\hat{\zeta}_j(X^*_k),

where ζ̂_j is defined in Section 3 and P̂ is the empirical distribution of X.

Let d_0 denote the number of X's that are irrelevant for Y. Set d_1 = d − d_0 and, without loss of generality, reorder the λ_j so that


\lambda_j > 0 \text{ if } j = 1, \ldots, d_1; \qquad \lambda_j = 0 \text{ if } j = d_1 + 1, \ldots, d.   (7.1)

Our variable selection rule is

"keep variable X_j if s_j ≥ t", for some threshold t.   (7.2)

Rule (7.2) is consistent if and only if

P\left(\min_{j \le d_1} s_j \ge t \;\text{ and }\; \max_{j > d_1} s_j < t\right) \longrightarrow 1.   (7.3)

To establish (7.3) it is enough to show that

P\left(\max_{j > d_1} s_j > t\right) \longrightarrow 0   (7.4)

and

P\left(\min_{j \le d_1} s_j \le t\right) \longrightarrow 0.   (7.5)

We call (7.4) and (7.5) type I and type II consistency, respectively. Type I and type II error probabilities refer to the probabilities of any false positives and any false negatives, respectively.

7.1 Type I consistency

We consider (7.4) first and note that

P\left(\max_{j > d_1} s_j > t\right) \le \sum_{j = d_1 + 1}^{d} P(s_j > t).   (7.6)

For the jth irrelevant variable X_j, j > d_1, the results in Section 6 apply to the local efficacy ζ̂_j(x) at a fixed point x. The importance score s_j defined in (3.8) is the average of such efficacies over a sample of M random points x^*_a. We assume that there is a point x(0) such that, for irrelevant X_j's and for n large enough, s_j^{(0)} ≡ ζ̂_j(x(0)) is stochastically larger in the right tail than the average M^{-1}\sum_{a=1}^{M}\hat{\zeta}_j(x^*_a). That is, we assume that there exist t > 0 and n_0 such that for n ≥ n_0,

P(s_j > t) \le P(s_j^{(0)} > t), \qquad j > d_1.   (7.7)

The existence of such a point x(0) is established by considering the probability limit of

\arg\max\{\hat{\zeta}_j(x^*_a):\; x^*_a \in \{x^*_1, \ldots, x^*_M\}\}.

Note that by Lemma 6.1, s_j^{(0)} →_p 0 as n → ∞. It follows that we have established (7.4) for t and d_0 = d − d_1 that are fixed as n → ∞. To examine the interesting case where d_0 → ∞ as n → ∞, we need large deviation results, which we turn to next. In the d_0 → ∞ case we consider t ≡ t_n that tends to ∞ as n → ∞, and we use the weaker assumption: there exist x(0) and n_0 such that

P(s_j > t_n) \le P(s_j^{(0)} > t_n)   (7.8)

for n ≥ n_0 and each t_n with lim_{n→∞} t_n = ∞. We also assume that the number of columns C and rows R in the contingency table that defines the efficacy of X_j via the chi-square statistic χ²_j in (6.4) stays fixed as n → ∞. In this case P(s_j^{(0)} > t_n) → 0 provided each term

\left[T^{(j)}_{rc}\right]^2 \equiv \left[\frac{n_{rc} - E_{rc}}{\sqrt{n}\sqrt{E_{rc}}}\right]^2 \text{ in } s_j^2 = \chi_j^2/n, defined for X_j by (6.4), satisfies

P\left(\left|T^{(j)}_{rc}\right| > t_n\right) \longrightarrow 0.

This is because

P\left(\sqrt{\sum_r\sum_c \left[T^{(j)}_{rc}\right]^2} > t_n\right) \le P\left(RC\,\max_{r,c}\left[T^{(j)}_{rc}\right]^2 > t_n^2\right) \le RC\,\max_{r,c} P\left(\left|T^{(j)}_{rc}\right| > t_n/\sqrt{RC}\right),

and t_n and t_n/\sqrt{RC} are of the same order as n → ∞. By large deviation theory (Hall [8], Jing et al. [10]), for some constant A,

P\left(\left|T^{(j)}_{rc}\right| > t_n\right) = \frac{2t_n^{-1}}{\sqrt{2\pi}}\,e^{-t_n^2/2}\left[1 + A\,t_n^3 n^{-1/2} + o\!\left(t_n^3 n^{-1/2}\right)\right],   (7.9)

provided t_n = o(n^{1/6}). If (7.9) is uniform in j > d_1, then (7.6) and (7.9) imply that type I consistency holds if

d_0\left(t_n\sqrt{2\pi}\right)^{-1} e^{-t_n^2/2} \longrightarrow 0.   (7.10)

To examine (7.10), write d_0 and t_n as d_0 = \exp(n^b) and t_n = \sqrt{2}\,n^r. By large deviation theory, (7.10) holds when r < 1/6. Thus if 0 < \tfrac{1}{2}b < r < 1/6, then

d_0\left(t_n\sqrt{2\pi}\right)^{-1} e^{-t_n^2/2} = n^{-r}\exp\left(n^b - n^{2r}\right) \longrightarrow 0.

Thus d_0 = \exp(n^b) with b = 1/3 - \varepsilon, for any \varepsilon > 0, is the largest d_0 can be to have consistency. For instance, if there are exactly d_0 = \exp(n^{1/4}) irrelevant variables and we use the threshold t_n = \sqrt{2}\,n^{1/7}, the probability of selecting any of the irrelevant variables tends to zero at the rate

n^{-1/4}\exp\left(n^{1/4} - n^{2/7}\right).

7.2 Type II consistency

We now assume that for each relevant variable X_j there is a least favorable point x(1) such that the local efficacy s_j^{(1)} ≡ ζ̂_j(x(1)) is stochastically smaller than the importance score s_j = M^{-1}\sum_{a=1}^{M}\hat{\zeta}_j(x^*_a). That is, we assume that for each relevant variable X_j there exist n_1 and x(1) such that for n ≥ n_1,

P\left(s_j^{(1)} \le t_n\right) \ge P(s_j \le t_n), \qquad j \le d_1.   (7.11)

The existence of such a least favorable x(1) is established by considering the probability limit of

\arg\min\{\hat{\zeta}_j(x^*_a):\; x^*_a \in \{x^*_1, \ldots, x^*_M\}\}.

We again assume that R and C are fixed, so that to show type II consistency it is enough to show that for each (r, c) and j ≤ d_1,

P\left(\left|T^{(j)}_{rc}\right| \le t_n\right) \longrightarrow 0.

This is because

P\left(\sqrt{\sum_r\sum_c\left[T^{(j)}_{rc}\right]^2} \le t_n\right) \le P\left(\min_{r,c}\left[T^{(j)}_{rc}\right]^2 \le t_n^2\right) \le RC\,\max_{r,c} P\left(\left|T^{(j)}_{rc}\right| \le t_n\right).

By Theorem 6.2, \sqrt{n}\left(s_j^{(1)} - \tau_j\right) \rightarrow_d N(0, \nu^2) with \tau_j > 0. It follows that if t_n and d_1 are bounded above as n → ∞ and τ_j > 0, then P\left(s_j^{(1)} \le t_n\right) \rightarrow 0. Thus type II consistency holds in this case. To examine the more interesting case, where t_n and d_1 are not bounded above and τ_j may tend to zero as n → ∞, we will need to center the T^{(j)}_{rc}. Thus set

T(j)rc Thus set

θ(j)n =

radicn(p(j)rc minus p(j)r p

(j)c

)radicp(j)r p

(j)c

(712)

26

Tn(j) = T (j)rc minus θ(j)n

where for simplicity the dependence of T(j)n and θ

(j)n on r and c is not indicated

Lemma 7.1.

P\left(\min_{j \le d_1}\left|T^{(j)}_{rc}\right| \le t_n\right) \le \sum_{j=1}^{d_1} P\left(\left|T^{(j)}_n\right| \ge \min_{j \le d_1}\theta^{(j)}_n - t_n\right).

Proof.

P\left(\min_{j \le d_1}\left|T^{(j)}_{rc}\right| \le t_n\right) = P\left(\min_{j \le d_1}\left|T^{(j)}_n + \theta^{(j)}_n\right| \le t_n\right)
\le P\left(\min_{j \le d_1}\left|\theta^{(j)}_n\right| - \max_{j \le d_1}\left|T^{(j)}_n\right| \le t_n\right)
= P\left(\max_{j \le d_1}\left|T^{(j)}_n\right| \ge \min_{j \le d_1}\left|\theta^{(j)}_n\right| - t_n\right)
\le \sum_{j=1}^{d_1} P\left(\left|T^{(j)}_n\right| \ge \min_{j \le d_1}\left|\theta^{(j)}_n\right| - t_n\right).

Set

a_n = \min_{j \le d_1}\left|\theta^{(j)}_n\right| - t_n;

then by large deviation theory, if a_n = o(n^{1/6}),

P\left(\left|T^{(j)}_n\right| \ge a_n\right) \propto \left(a_n\sqrt{2\pi}\right)^{-1}\exp\{-a_n^2/2\},   (7.13)

where ∝ signifies asymptotic order as n → ∞, a_n → ∞. It follows that if (7.13) holds uniformly in j, then by Lemma 7.1,

P\left(\min_{j \le d_1}\left|T^{(j)}_{rc}\right| \le t_n\right) \propto d_1\left(a_n\sqrt{2\pi}\right)^{-1}\exp\{-a_n^2/2\}.

Thus type II consistency holds when a_n = o(n^{1/6}) and

d_1\left(a_n\sqrt{2\pi}\right)^{-1}\exp\{-a_n^2/2\} \longrightarrow 0.   (7.14)

In Section 7.1, t_n is of order n^r, 0 < r < 1/6. For this t_n, type II consistency holds if a_n is of order n^r, 0 < r < 1/6, and

d_1\, n^{-r}\exp\{-n^{2r}/2\} \longrightarrow 0.


If a_n = O(n^k) with k ≥ 1/6, then P\left(\left|T^{(j)}_n\right| \ge a_n\right) tends to zero at a rate faster than in (7.13) and consistency is ensured. For instance, if all relevant variables X_j have p^{(j)}_{rc} fixed as n → ∞, then \min_{j \le d_1}\left|\theta^{(j)}_n\right| is of order \sqrt{n} and a_n is of order n^{1/2} - t_n.

7.3 Type I and II consistency

We see from Sections 7.1 and 7.2 that consistency holds under a variety of conditions. One set of conditions is:

Theorem 7.1. Under the regularity conditions of Sections 7.1 and 7.2, if

d_0 = c_0\exp(n^b), \quad t_n = c_1 n^r, \quad \min_{j \le d_1}\theta^{(j)} - t_n = c_2 n^r, \quad d_1 = c_3\exp(n^b),

for positive constants c_0, c_1, c_2, and c_3, and 0 < \tfrac{1}{2}b < r < \tfrac{1}{6}, then CATCH is consistent.

Appendix: Proof of Theorem 6.1

(1) Let N_h = \sum_{i=1}^{n} I\left(X^{(i)} \in N_h(x^{(0)})\right). Then N_h \sim \mathrm{Bi}\left(n, \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx\right) and N_h \rightarrow_{a.s.} \infty as n \rightarrow \infty. We have

\hat{\zeta}(x^{(0)}, h) = n^{-1/2}\sqrt{\mathcal{X}^2(x^{(0)}, h)} = (N_h/n)^{1/2}\,(N_h)^{-1/2}\sqrt{\mathcal{X}^2(x^{(0)}, h)}.

Since N_h \sim \mathrm{Bi}\left(n, \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx\right), we have

N_h/n \rightarrow \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx.   (7.15)

By Lemma 6.1 and Table (1),

(N_h)^{-1/2}\sqrt{\mathcal{X}^2(x^{(0)}, h)} \rightarrow \left(\sum_{r=+,-}\sum_{c=1}^{C}\frac{(p_{rc} - p_r q_c)^2}{p_r q_c}\right)^{1/2},   (7.16)

where

p_{+c} = \Pr\left(X > x^{(0)},\, Y = c \mid X \in N_h(x^{(0)})\right), \qquad p_{-c} = \Pr\left(X \le x^{(0)},\, Y = c \mid X \in N_h(x^{(0)})\right),

and, for r = +, - and c = 1, \ldots, C,

p_r = \sum_{c=1}^{C} p_{rc}, \qquad q_c = \sum_{r=+,-} p_{rc}.

That is, p_+ = \Pr(X > x^{(0)} \mid X \in N_h(x^{(0)})), p_- = \Pr(X \le x^{(0)} \mid X \in N_h(x^{(0)})), and q_c = \Pr(Y = c \mid X \in N_h(x^{(0)})). By (7.15) and (7.16),

\hat{\zeta}(x^{(0)}, h) \rightarrow \left(\int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx\right)^{1/2}\left(\sum_{r=+,-}\sum_{c=1}^{C}\frac{(p_{rc} - p_r q_c)^2}{p_r q_c}\right)^{1/2} \equiv \zeta(x^{(0)}, h).

(2) Since X and Y are independent, p_{rc} = p_r q_c for r = +, - and c = 1, \ldots, C; then \zeta(x^{(0)}, h) = 0 for any h. Hence

\zeta(x^{(0)}) = \sup_{h>0}\zeta(x^{(0)}, h) = 0.

(3) Recall that p_c(x^{(0)}) = \Pr(Y = c \mid x^{(0)}). Without loss of generality assume p'_c(x^{(0)}) > 0. Then there exists h such that

\Pr(Y = c \mid x^{(0)} < X < x^{(0)} + h) > \Pr(Y = c \mid x^{(0)} - h < X < x^{(0)}),

which is equivalent to p_{+c}/p_+ > p_{-c}/p_-. This shows that \zeta(x^{(0)}) > 0, since by Lemma 6.1, \zeta(x^{(0)}) = 0 would result in p_{+c}/p_+ = p_{-c}/p_- = q_c.

References

[1] Breiman, L., 2001. Random forests. Machine Learning 45, 5–32.

[2] Breiman, L., Friedman, J. H., Olshen, R. A., Stone, C. J., 1984. Classification and Regression Trees. Wadsworth, Belmont.

[3] Doksum, K., Tang, S., Tsui, K.-W., 2008. Nonparametric variable selection: The EARTH algorithm. Journal of the American Statistical Association 103(484), 1609–1620.

[4] Doksum, K. A., Samarov, A., 1995. Nonparametric estimation of global functionals and a measure of explanatory power of covariates in regression. Annals of Statistics 23, 1443–1473.

[5] Doksum, K. A., Schafer, C., 2006. Powerful choices: Tuning parameter selection based on power. In: Frontiers in Statistics: Dedicated to Peter Bickel (J. Fan and H. L. Koul, eds.), Imperial College Press, 113–141.

[6] Friedman, J., 1991. Multivariate adaptive regression splines. The Annals of Statistics 19, 1–67.

[7] Gao, J., Gijbels, I., 2008. Bandwidth selection in nonparametric kernel testing. Journal of the American Statistical Association 103(484), 1584–1594.

[8] Hall, P., 1992. The Bootstrap and Edgeworth Expansion. Springer, New York.

[9] Huang, L.-S., Chen, J., 2008. Analysis of variance, coefficient of determination and F-test for local polynomial regression. Annals of Statistics 36, 2085–2109.

[10] Jing, B.-Y., Shao, Q.-M., Wang, Q., 2003. Self-normalized Cramér-type large deviations for independent random variables. Annals of Probability 31, 2167–2215.

[11] Lin, D. Y., Zeng, D., 2011. Correcting for population stratification in genomewide association studies. Journal of the American Statistical Association 106, 997–1008.

[12] Loh, W.-Y., 2009. Improving the precision of classification trees. The Annals of Applied Statistics 3, 1710–1737.

[13] Price, A., Patterson, N., Plenge, R., Weinblatt, M., Shadick, N., Reich, D., 2006. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38(8), 904–909.

[14] Qiao, Z., Zhou, L., Huang, J., 2008. Linear discriminant analysis for high dimensional, low sample size data. Special issue of World Congress on Engineering 2008.

[15] Quinlan, J. R., 1987. Simplifying decision trees. International Journal of Man-Machine Studies 27(3), 221–234.

[16] Schafer, C., Doksum, K. A., 2009. Selecting local models in multiple regression by maximizing power. Metrika 69, 283–304.

[17] Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267–288.

[18] Wang, L., Shen, X., 2007. On L1-norm multi-class support vector machines: methodology and theory. Journal of the American Statistical Association 102, 583–594.

[19] Zhang, H. H., Liu, Y., Wu, Y., Zhu, J., 2008. Variable selection for the multicategory SVM via adaptive sup-norm regularization. Electronic Journal of Statistics 2, 149–167.


The CATCH algorithm however performs very well for this complicated model Wenow explain why CATCH is able to identify X5 as an important variable To explorethe association between X5 and Y we construct M contingency tables of 2 rows and4 columns based on M randomly centered tubes calculate contingency efficacy scoresand then average the scores Figure 2 (a) and (b) illustrate how one of the contingencyefficacy scores is calculated 1000 data points are generated from model (41) and theyare split into two sets according to x5 (a) shows approximately half of the points withx5 = 0 in (x1 x2) dimensions and (b) shows the rest points with x5 = 1 in (x3 x4)dimensions The data points are colored according to Y This representation enablesus to clearly see the classification boundaries (grey lines) The randomly selected tubecenter when x5 is zero is shown as a star in (a) The data points within the tube closeto the tube center in terms of Xminus5 are highlighted by larger dots Since the distancedefining the tube does not involve X5 the larger dots have x5 equal 0 or 1 and arepresent in both (a) and (b) The counts of the larger dots in different colors in (a) and(b) correspond to the cell counts of the two rows x5 = 0 and x5 = 1 in the contingencytable As we can see larger dots in (a) are mostly red and larger dots in (b) are mostlyblue and orange which shows that distribution in Y depends on the value of X5 andthe contingency efficacy score is high We expect the contingency efficacy score wouldbe low only if the tube center has similar values in (x1 x2) and (x3 x4) Such tubecenters would be very few among the M tube centers since they are randomly chosenThus the average contingency efficacy scores is high and X5 is identified as an importantvariables

To contrast this example with a simpler data generating model suppose the effectsof predictors are additive that is the categorical variable is related to the response asan added term to the effect of the numerical predictors In this type of simpler modelsthe RFC and GUIDE algorithm can efficiently identify relevant predictors but so canthe CATCH algorithm We do not include the simulation results for additive modelshere

Remark 41 We observe that the CATCH algorithm is not very sensitive to the choiceof the number of tube centers M or the tube size m = [na] where 0 lt a le 1 and [ ] isthe greatest integer function For example 41 we perform the simulation using M = nand M = n2 and a = 01 015 02 03 05 The results are summarized in Table 3The CATCH algorithm performs similarly for M = n and M = n2 with M = n havinga small winning edge We generally suggest to use M = n if computation resource isof a less concern or n is not large (le 200) When computational cost is a concern(Remark 35) one can reduce tube size to M = n2 for a compromise between detectionpower and the computational cost Table 3 also shows that the tube size works well inthe range between 01 and 03 while the detection power of X5 decreases for the casea = 05 where the tube size is too big to achieve local conditioning This is a consistent

14

observation with the comparison to Marginal CATCH in Table 2 On the other handthe tube size should not be too small to achieve sufficient detection power We suggestto use a = 01 or 015 for sample sizes n ge 400 or m = 50 for smaller sample sizesIn light of these guidelines we use M = n2 = 200 and a = 015 for Examples 41-43M = n = 200 and m = 50 for Example 44

Table 3 Sensitivity study of CATCH for Example 41 (n = 400 d = 50) using different number oftube centers M and tube size m = [na] For 1 le i le 5 the ith column gives the number of times Xi

was selected Column 6 gives the percentage of times irrelevant variables were selected

Number of Selections(100 simulations)M a X1 X2 X3 X4 X5 Xi i ge 6

M = 200 010 75 67 77 69 92 78015 76 76 87 84 96 82020 85 84 90 87 93 99030 89 87 95 89 85 96050 89 90 96 93 62 99

M = 400 010 74 76 82 77 92 72015 85 82 89 87 94 86020 87 88 90 86 96 91030 88 89 91 91 92 94050 89 91 94 93 61 102

In Example 41 all predictors are independent In a similar setting we next considerthe case where the predictors are dependent

Example 42 We generate standard normal random variables Zi for i = 1 middot middot middot d insuch a way that cor(Z1 Z4) = 05 cor(Z3 Z6) = 09and the other Zirsquos are independentWe set Xi = Φ(Zi) for i = 1 middot middot middot d where Φ(middot) is the CDF of a standard normal andtherefore Xi are uniformly distributed marginally as in Example 41 The correlationstructure between the Xirsquos are similar to that of the Zirsquos The dependent variable Y isgenerated in the same way as in Example 41

The results of Example 42 are given in Table 4 By comparing with results in Table2 both CATCH and GUIDE are doing well in the presence of dependence between thenumerical variables The explanation is two-fold First both methods could identify thecategorical X5 as an important predictor most of the time which make the identificationof other relevant predictors possible Second conditioning on X5 Y depends on a pairof variables (ldquoX1 and X2rdquo or ldquoX3 and X4rdquo) jointly and in particular this dependence isnonlinear which makes both methods less affected by the correlation introduced here

15

00 02 04 06 08 10

00

02

04

06

08

10

(a) X5 = 0

X1

X2

00 02 04 06 08 10

00

02

04

06

08

10

(b) X5 = 1

X3

X4

Figure 2 Sample scatter plots of (x1 x2) and (x3 x4) for simulated data using model (41) The leftpanel includes all the pairs (x1 x2) with x5 = 0 while the right panel includes all the pairs (x3 x4)with x5 = 1 The larger dots in the two plots are within the tube around a tube center shown as thestar

We now explain the latter in details for both methods In CATCH the algorithmadaptively weighs the effects of predictors by their importance scores in defining thetubes when they are being conditioned on therefore identifying either of the two willhelp to identify the other whereas an irrelevant variable though strongly associated withone of the relevant variables is down-weighted iteratively and does not strongly affectthe selection of the relevant variable This can explain the slight decreasing selectionprobability in X3 in presence of a strongly correlated X6 and consequently overall lessselection of X3 and X4 compared to X1 and X2 given the correlation between X1 andX4 In GUIDE the difference between the dependent and independent cases is evenless because GUIDE is very powerful in detecting local interaction by literally includingpairwise interactions when selecting splitting variables

Marginal CATCH and Random Forest do poorly in this case, mainly because X5 cannot be detected, as we explained earlier. More specifically, the dependence between Y and X4 in the X5 = 1 case is the same as that between Y and X2 in the X5 = 0 case. The linear terms in X1 and X2 have opposite signs in the decision function; therefore the marginal effects of X1 and X4 are obscured when X5 is not identified as an important variable.

4.2 Improving The Performance of SVM and RF Classifiers

The criterion used in the CATCH algorithm is not based on classification accuracy. But if we use CATCH to screen out the irrelevant variables and use only the important variables for classification, the performance of other algorithms, such as the Support Vector Machine (SVM) and Random Forest Classifier (RFC), can be significantly improved.


Table 4: Simulation results from Example 4.2 with sample size n = 400. For 1 ≤ i ≤ 6, the ith column gives the estimated probability of selection of variable Xi and the standard error based on 100 simulations, where Xi, i = 1, ..., 5, are relevant variables with cor(X1, X4) = 0.5. X6 is an irrelevant variable that is correlated with X3 with correlation 0.9. Column 7 gives the mean of the estimated probability of selection and that of the standard error for the irrelevant variables X7, ..., Xd.

                                Number of Selections (100 simulations)
METHOD            d     X1           X2           X3           X4           X5           X6           Xi, i ≥ 7
CATCH             10    0.76 (0.04)  0.77 (0.04)  0.73 (0.04)  0.68 (0.05)  0.99 (0.01)  0.34 (0.05)  0.07 (0.02)
                  50    0.78 (0.04)  0.77 (0.04)  0.74 (0.04)  0.63 (0.05)  0.92 (0.03)  0.57 (0.05)  0.09 (0.03)
                  100   0.76 (0.04)  0.77 (0.04)  0.80 (0.04)  0.74 (0.04)  0.82 (0.04)  0.55 (0.05)  0.09 (0.03)
Marginal CATCH    10    0.35 (0.05)  0.63 (0.05)  0.80 (0.04)  0.32 (0.05)  0.09 (0.03)  0.71 (0.05)  0.12 (0.03)
                  50    0.34 (0.05)  0.71 (0.05)  0.70 (0.05)  0.30 (0.05)  0.10 (0.03)  0.60 (0.05)  0.11 (0.03)
                  100   0.34 (0.05)  0.68 (0.05)  0.75 (0.04)  0.30 (0.05)  0.10 (0.03)  0.59 (0.05)  0.10 (0.03)
Random Forest     10    0.29 (0.05)  0.48 (0.05)  0.55 (0.05)  0.20 (0.04)  0.30 (0.05)  0.47 (0.05)  0.05 (0.02)
                  50    0.26 (0.04)  0.52 (0.05)  0.57 (0.05)  0.30 (0.05)  0.24 (0.04)  0.45 (0.05)  0.08 (0.03)
                  100   0.36 (0.05)  0.61 (0.05)  0.66 (0.05)  0.37 (0.05)  0.32 (0.05)  0.46 (0.05)  0.12 (0.03)
GUIDE             10    0.96 (0.02)  0.96 (0.02)  0.94 (0.02)  0.95 (0.02)  0.96 (0.02)  0.84 (0.04)  0.30 (0.05)
                  50    0.80 (0.04)  0.92 (0.03)  0.88 (0.03)  0.79 (0.04)  0.72 (0.04)  0.68 (0.05)  0.12 (0.03)
                  100   0.81 (0.04)  0.89 (0.03)  0.90 (0.03)  0.76 (0.04)  0.75 (0.04)  0.58 (0.05)  0.08 (0.03)


We compare the prediction performance of (1) SVM, (2) RFC, (3) CATCH followed by SVM, and (4) CATCH followed by RFC in this section. We label (3) and (4) CATCH+SVM and CATCH+RFC, respectively. We use the "svm" function in the "e1071" package in R with the radial kernel to perform the SVM analysis, and the "randomForest" function in the "randomForest" package to perform the RFC analysis. We use simulation experiments to illustrate the improvement obtained by using CATCH prior to classifying Y.
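The experiments below use the R functions just mentioned. Purely as an illustration of the workflow, an analogous comparison can be sketched in Python with scikit-learn, where `selected` stands for the index set of variables kept by the screening step (the CATCH screening itself is not reproduced here):

```python
import numpy as np
from sklearn.base import clone
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

def misclassification_rates(X_train, y_train, X_test, y_test, selected):
    """Test-set misclassification of SVM and RFC with all variables
    versus only the screened variables in `selected` (an index array)."""
    models = {"SVM": SVC(kernel="rbf"),  # radial kernel, as with svm() in e1071
              "RFC": RandomForestClassifier(n_estimators=500, random_state=0)}
    rates = {}
    for name, model in models.items():
        rates[name] = 1.0 - clone(model).fit(X_train, y_train).score(X_test, y_test)
        rates["CATCH+" + name] = 1.0 - clone(model).fit(
            X_train[:, selected], y_train).score(X_test[:, selected], y_test)
    return rates
```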

Let the true model be

(X, Y) ∼ P,    (4.2)

where P will be specified later. In each of 100 Monte Carlo experiments, n pairs (X^{(i)}, Y^{(i)}), i = 1, ..., n, are generated from the true model (4.2). Based on the simulated data (X^{(i)}, Y^{(i)}), i = 1, ..., n, the methods (1) SVM, (2) RFC, (3) CATCH+SVM, and (4) CATCH+RFC are used to classify a future Y^{(0)} for which we have available a corresponding X^{(0)}.

The integrated misclassification rate (IMR; Friedman [6]) defined below is used to evaluate the performance of the classification algorithms. Let F_{X,Y}(·) be the classifier constructed by a classification algorithm based on X = (X^{(1)}, ..., X^{(n)}) and Y = (Y^{(1)}, ..., Y^{(n)})^T.

Let (X^{(0)}, Y^{(0)}) be a future observation with the same distribution as (X^{(i)}, Y^{(i)}). We only know X^{(0)} = x^{(0)} and want to classify Y^{(0)}. For a possible future observation (x^{(0)}, y^{(0)}), define

MR(X, Y, x^{(0)}, y^{(0)}) = P(F_{X,Y}(x^{(0)}) ≠ y^{(0)} | X, Y).    (4.3)

Our criterion is

IMR = E E_0[MR(X, Y, x^{(0)}, y^{(0)})],    (4.4)

where E_0 is with respect to the distribution of (x^{(0)}, y^{(0)}) and E is with respect to the distribution of (X, Y).

This double expectation is evaluated using Monte Carlo simulation; the result is denoted IMR*. More precisely, E_0 is estimated using 5000 simulated (x^{(0)}, y^{(0)}) observations, and IMR* is then computed on the basis of 100 simulated samples (X_i, Y_i), i = 1, ..., 100.
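Schematically, this Monte Carlo approximation of (4.4) can be written as follows, assuming hypothetical user-supplied routines `simulate(n, rng)` that draws from the true model (4.2) and `fit(X, Y)` that returns a classifier:

```python
import numpy as np

def estimate_imr(simulate, fit, n=400, n_future=5000, n_reps=100, seed=0):
    """Monte Carlo approximation IMR* of (4.4): the inner expectation E0
    uses n_future fresh observations, the outer expectation E averages
    over n_reps independently generated training samples."""
    rng = np.random.default_rng(seed)
    rates = []
    for _ in range(n_reps):
        X, Y = simulate(n, rng)            # training sample from the true model (4.2)
        X0, Y0 = simulate(n_future, rng)   # future observations for E0
        classifier = fit(X, Y)             # returns a function x -> predicted class
        rates.append(np.mean(classifier(X0) != Y0))
    return float(np.mean(rates))
```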

Example 4.3 First consider model (4.1) in Section 4.1. Figure 3 shows IMR* versus the number of irrelevant variables for this example. The left panel shows that the IMR* of the SVM approach increases dramatically as the number of irrelevant variables increases, while the IMR* of CATCH+SVM remains stable. Thus CATCH stabilizes the performance of SVM for this complex model. The right panel shows that CATCH also stabilizes the performance of RFC. These results indicate that CATCH can improve the prediction accuracy of other classification algorithms.

[Figure 3: two panels of error rate versus the number of predictors d (20 to 100). Left panel "SVM": curves for SVM and CATCH+SVM; right panel "Random Forest": curves for RFC and CATCH+RFC. Error rates range from about 0.25 to 0.60. See the caption below.]

Figure 3: IMR* of SVM, RFC, CATCH+SVM, and CATCH+RFC versus the number of predictors for the simulation study of model (4.1) in Example 4.1. The sample size is n = 400.

Example 4.4 (Contaminated Logistic Model) Consider a binary response Y ∈ {0, 1} with

P(Y = 0 | X = x) = F(x) if r = 0,  and  P(Y = 0 | X = x) = G(x) if r = 1,    (4.5)

where x = (x_1, ..., x_d),

r ∼ Bernoulli(γ),    (4.6)

F(x) = (1 + exp{−(0.25 + 0.5 x_1 + x_2)})^{−1},    (4.7)

G(x) = 0.1 if 0.25 + 0.5 x_1 + x_2 < 0,  and  G(x) = 0.9 if 0.25 + 0.5 x_1 + x_2 ≥ 0,    (4.8)

and x_1, x_2, ..., x_d are realizations of independent standard normal random variables. Hence, given X^{(i)} = x^{(i)}, 1 ≤ i ≤ n, the joint distribution of Y^{(1)}, Y^{(2)}, ..., Y^{(n)} is the contaminated logistic model with contamination parameter γ,


(1 − γ) ∏_{i=1}^n [F(x^{(i)})]^{Y^{(i)}} [1 − F(x^{(i)})]^{1−Y^{(i)}} + γ ∏_{i=1}^n [G(x^{(i)})]^{Y^{(i)}} [1 − G(x^{(i)})]^{1−Y^{(i)}}.    (4.9)

We perform a simulation study for n = 200, d = 50, and γ = 0, 0.2, 0.4, 0.6, 0.8, and 1, where γ = 0 corresponds to no contamination. Figure 4 shows IMR* versus the different values of γ for this example. CATCH improves the performance of the SVM and Random Forest classifiers for this contaminated logistic regression model.
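A minimal sketch of drawing one sample from the contaminated model (4.5)-(4.8) is given below; it illustrates the data-generating mechanism only, not the CATCH algorithm itself.

```python
import numpy as np

def simulate_contaminated_logistic(n=200, d=50, gamma=0.2, rng=None):
    """Draw (X, Y) from the contaminated logistic model (4.5)-(4.8)."""
    rng = np.random.default_rng() if rng is None else rng
    X = rng.standard_normal((n, d))
    eta = 0.25 + 0.5 * X[:, 0] + X[:, 1]         # linear predictor in x1, x2
    F = 1.0 / (1.0 + np.exp(-eta))               # (4.7)
    G = np.where(eta < 0, 0.1, 0.9)              # (4.8)
    r = rng.binomial(1, gamma, size=n)           # (4.6): contamination indicators
    p0 = np.where(r == 0, F, G)                  # P(Y = 0 | X = x), from (4.5)
    Y = (rng.uniform(size=n) >= p0).astype(int)  # Y = 0 with probability p0
    return X, Y
```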

[Figure 4: two panels of error rate versus γ ∈ {0, 0.2, 0.4, 0.6, 0.8, 1}. Left panel "SVM": curves for SVM and CATCH+SVM; right panel "Random Forest": curves for RFC and CATCH+RFC. Error rates range from about 0.1 to 0.4. See the caption below.]

Figure 4: IMR* of SVM, RFC, CATCH+SVM, and CATCH+RFC versus different contamination parameters γ for simulation model (4.9) in Example 4.4.

5 CATCH Analysis of The ANN Data

In this section CATCH is applied to a data set available at the UCI data base (http://ftp.ics.uci.edu/pub/machine-learning-databases/thyroid-disease). The data are from a clinical trial to determine whether a patient is hypothyroid (see, e.g., Quinlan [15]). The dataset is split into two parts: a training set with 3772 observations and a testing set with 3428 observations. The response is a categorical variable with three classes: normal (not hypothyroid), hyperfunction, and subnormal functioning. There are 21 predictors, including 15 categorical predictors and 6 numerical predictors. The three classes are very imbalanced, with 92.5% normal patients. A good classifier should produce a significant decrease from the 7.5% error rate that can be obtained by classifying all observations to the normal class.


CATCH identifies 7 variables as important, including 4 continuous variables and 3 categorical variables. We consider three classification methods: 1. Support Vector Machine with radial kernel, denoted by SVM(r); 2. Support Vector Machine with polynomial kernel, denoted by SVM(p); 3. RFC. We apply these three methods to the training set to build classification models and use the test set to estimate the misclassification rate of each method. We apply the methods with CATCH and without CATCH, respectively, where "with CATCH" means that only the 7 important variables identified by CATCH are used to build the model and "without CATCH" means that all 21 variables are used in the analysis.
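For illustration, this evaluation protocol can be sketched as follows in Python. The file names ann-train.data and ann-test.data, and the assumption that the class label is the last column of each file, refer to the UCI files and are not specified in the paper.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

def thyroid_error_rates(train_path="ann-train.data", test_path="ann-test.data",
                        selected=None):
    """Misclassification rates on the ANN thyroid test set, using all 21
    predictors or only the columns in `selected` (e.g. the CATCH-screened set)."""
    train = np.loadtxt(train_path)          # assumption: whitespace-separated numeric file
    test = np.loadtxt(test_path)
    Xtr, ytr = train[:, :-1], train[:, -1]  # assumption: class label in the last column
    Xte, yte = test[:, :-1], test[:, -1]
    if selected is not None:
        Xtr, Xte = Xtr[:, selected], Xte[:, selected]
    rfc = RandomForestClassifier(n_estimators=500, random_state=0)
    return {
        "SVM(r)": 1 - SVC(kernel="rbf").fit(Xtr, ytr).score(Xte, yte),
        "SVM(p)": 1 - SVC(kernel="poly").fit(Xtr, ytr).score(Xte, yte),
        "RFC": 1 - rfc.fit(Xtr, ytr).score(Xte, yte),
    }
```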

Table 5: Misclassification rates of three classification methods (SVM(r), SVM(p), RFC) with CATCH and without CATCH.

                SVM(r)   SVM(p)   RFC
w/o CATCH       0.052    0.063    0.024
with CATCH      0.037    0.055    0.015

Table 5 shows that if CATCH is used to select important variables before the classification methods are applied, the misclassification rates decrease. CATCH reduces the misclassification rates by 29%, 13%, and 37% for SVM(r), SVM(p), and RFC, respectively (e.g., for SVM(r), (0.052 − 0.037)/0.052 ≈ 29%). CATCH + RFC has the best performance. Although CATCH is not designed for improving classification accuracy, it nonetheless can help other classification methods to perform better.

6 Properties of Local Contingency Efficacy

In this section we investigate the properties of local contingency efficacy. For a univariate continuous predictor, Theorem 6.1 below shows that the local contingency efficacy is well defined and equals 0 if the response variable is independent of the predictor. Consider an R × C contingency table. Let n_rc, r = 1, ..., R, c = 1, ..., C, be the observed frequency in the (r, c)-cell, and let p_rc be the probability that an observation belongs to the (r, c)-cell. Let

p_r = Σ_{c=1}^C p_rc,   q_c = Σ_{r=1}^R p_rc,    (6.1)

and

n_{r·} = Σ_{c=1}^C n_rc,   n_{·c} = Σ_{r=1}^R n_rc,   n = Σ_{r=1}^R Σ_{c=1}^C n_rc,   E_rc = n_{r·} n_{·c} / n.    (6.2)

Consider testing that the rows and the columns are independent, i.e.,

H_0: p_rc = p_r q_c for all r, c   versus   H_1: p_rc ≠ p_r q_c for some r, c;    (6.3)

a standard test statistic is

X² = Σ_{r=1}^R Σ_{c=1}^C (n_rc − E_rc)² / E_rc.    (6.4)

Under H_0, X² converges in distribution to χ²_{(R−1)(C−1)}. Moreover, defining 0/0 = 0, we have the well-known result:

Lemma 6.1 As n tends to infinity, X²/n tends in probability to

τ² ≡ Σ_{r=1}^R Σ_{c=1}^C (p_rc − p_r q_c)² / (p_r q_c).

Moreover, if p_r q_c is bounded away from 0 and 1 as n → ∞, and if R and C are fixed as n → ∞, then τ² = 0 if and only if H_0 is true.

Remark 6.1 Lemma 6.1 shows that the contingency efficacy ζ in (2.8) is well defined. Moreover, ζ = 0 if and only if X and Y are independent.
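To make the estimator concrete, the following small Python sketch computes the statistic (6.4) for a categorical response and a categorical predictor and returns the plug-in efficacy estimate ζ̂ = n^{-1/2}√X² used as an importance score; it implements the formula directly and leaves out the tube construction of Section 3.

```python
import numpy as np

def contingency_efficacy(y, x):
    """Plug-in contingency efficacy zeta_hat = sqrt(X^2 / n) for a
    categorical response y and a categorical predictor x, with X^2 as in (6.4)."""
    y = np.asarray(y)
    x = np.asarray(x)
    n = y.size
    _, y_idx = np.unique(y, return_inverse=True)
    _, x_idx = np.unique(x, return_inverse=True)
    counts = np.zeros((y_idx.max() + 1, x_idx.max() + 1))
    np.add.at(counts, (y_idx, x_idx), 1)                              # observed cell counts n_rc
    expected = np.outer(counts.sum(axis=1), counts.sum(axis=0)) / n   # E_rc = n_{r.} n_{.c} / n
    chi2 = np.sum((counts - expected) ** 2 / expected)                # statistic (6.4)
    return np.sqrt(chi2 / n)                                          # zeta_hat
```

In the CATCH algorithm this quantity is evaluated within tubes and averaged over the M tube centers to give the importance score (3.8).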

Theorem 6.1 Consider a categorical response Y and a continuous predictor X with density function f(x) such that, for a fixed h > 0, f(x) > 0 for x ∈ (x^{(0)} − h, x^{(0)} + h). Then

(1) For fixed h, the estimated local section efficacy ζ̂(x^{(0)}, h) defined in (2.5) converges almost surely to the section efficacy ζ(x^{(0)}, h) defined in (2.4).

(2) If X and Y are independent, then the local efficacy ζ(x^{(0)}) defined in (2.4) is zero.

(3) If there exists c ∈ {1, ..., C} such that p_c(x) = P(Y = c | x) has a continuous derivative at x = x^{(0)} with p_c′(x^{(0)}) ≠ 0, then ζ(x^{(0)}) > 0.

The χ² statistic (6.4) can be written in terms of multinomial proportions p̂_rc, with √n(p̂_rc − p_rc), 1 ≤ r ≤ R, 1 ≤ c ≤ C, converging in distribution to a multivariate normal. By a Taylor expansion of T_n ≡ √(χ²/n) at τ = plim √(χ²/n), we find that


Theorem 6.2

√n(T_n − τ) →_d N(0, v²),    (6.5)

where the asymptotic variance v², which can be obtained from the Taylor expansion, is not needed in this paper.

Remark 6.2 The result (6.5) applies to the multivariate χ²-based efficacies ζ̂ with n replaced by the number of observations that go into the computation of ζ̂. In this case we use the notation χ²_j for the chi-square statistic for the jth variable.

7 Consistency Of CATCH

Let Y denote a variable to be classified into one of the categories 1, ..., C by using the Xj's in the vector X = (X1, ..., Xd). Let X_{−j} ≡ X − Xj denote X without the Xj component. Assume that the squared Pearson nonparametric correlation (Doksum and Samarov [4], Huang and Chen [9]) between Xj and X_{−j} is less than 0.5. We call the variable Xj relevant for Y if, conditionally given X_{−j}, L(Y | Xj = x) is not constant in x, and irrelevant otherwise. Our data are (X^{(1)}, Y^{(1)}), ..., (X^{(n)}, Y^{(n)}), i.i.d. as (X, Y). A variable selection procedure is consistent if the probability that all the relevant variables are selected and none of the irrelevant variables are selected tends to one as n tends to infinity.

We give a general nonparametric framework in which the variable selection rule based on the CATCH importance scores sj given in (3.8) is consistent.

When the predictors are not all categorical and CATCH involves tubes, we assume that the number of observations in each tube tends to ∞ as n → ∞, and we assume that, for those importance scores that require the selection of a section size h, the selection is from a finite set {h_j} with min_j{h_j} > h_0 for some h_0 > 0. It follows that the number of observations m_jk in the kth tube used to calculate the importance score s_j for the variable X_j tends to infinity as n → ∞.

The key quantities that determine consistency are the limits λ_j of the s_j. That is,

λ_j = τ(ζ_j, P) = E_P[ζ_j(X)] = ∫ ζ_j(x) dP(x),   j = 1, ..., d,

where ζ_j(x) is the local contingency efficacy defined by (2.4) for numerical X's and by (2.8) for categorical X's, and P is the probability distribution of X. Our estimate s_j of λ_j is

s_j = τ(ζ̂_j, P̂) = (1/M) Σ_{k=1}^M ζ̂_j(X*_k),

where ζ̂_j is defined in Section 3 and P̂ is the empirical distribution of X.

Let d_0 denote the number of X's that are irrelevant for Y. Set d_1 = d − d_0 and, without loss of generality, reorder the λ_j so that

λ_j > 0 if j = 1, ..., d_1;   λ_j = 0 if j = d_1 + 1, ..., d.    (7.1)

Our variable selection rule is

"keep variable X_j if s_j ≥ t for some threshold t".    (7.2)

Rule (7.2) is consistent if and only if

P(min_{j≤d_1} s_j ≥ t and max_{j>d_1} s_j < t) → 1.    (7.3)

To establish (7.3) it is enough to show that

P(max_{j>d_1} s_j > t) → 0    (7.4)

and

P(min_{j≤d_1} s_j ≤ t) → 0.    (7.5)

We call (7.4) and (7.5) type I and type II consistency, respectively. Type I and II error probabilities refer to the probabilities of any false positives and any false negatives, respectively.

7.1 Type I consistency

We consider (7.4) first and note that

P(max_{j>d_1} s_j > t) ≤ Σ_{j=d_1+1}^d P(s_j > t).    (7.6)

For the jth irrelevant variable X_j, j > d_1, the results in Section 6 apply to the local efficacy ζ̂_j(x) at a fixed point x. The importance score s_j defined in (3.8) is the average of such efficacies over a sample of M random points x*_a. We assume that there is a point x^{(0)} such that, for irrelevant X_j's and for n large enough, ζ̂_j(x^{(0)}) is stochastically larger in the right tail than the average M^{-1} Σ_{a=1}^M ζ̂_j(x*_a). That is, writing s_j^{(0)} ≡ ζ̂_j(x^{(0)}), we assume that there exist t > 0 and n_0 such that, for n ≥ n_0,

P(s_j > t) ≤ P(s_j^{(0)} > t),   j > d_1.    (7.7)

The existence of such a point x^{(0)} is established by considering the probability limit of

arg max{ζ̂_j(x*_a) : x*_a ∈ {x*_1, ..., x*_M}}.

Note that by Lemma 6.1, s_j^{(0)} →_p 0 as n → ∞. It follows that we have established (7.4) for t and d_0 = d − d_1 that are fixed as n → ∞. To examine the interesting case where d_0 → ∞ as n → ∞, we need large deviation results, to which we turn next. In the d_0 → ∞ case we consider thresholds t ≡ t_n that tend to ∞ as n → ∞, and we use the weaker assumption that there exist x^{(0)} and n_0 such that

P(s_j > t_n) ≤ P(s_j^{(0)} > t_n)    (7.8)

for n ≥ n_0 and each t_n with lim_{n→∞} t_n = ∞. We also assume that the number of columns C and rows R in the contingency table that defines the efficacy χ²_{jn} given by (6.4) stays fixed as n → ∞. In this case P(s_j^{(0)} > t_n) → 0 provided each term

[T_rc^{(j)}]² ≡ [(n_rc − E_rc) / (√n √E_rc)]²

in s_j² = χ²_{jn}/n, defined for X_j by (6.4), satisfies

P(|T_rc^{(j)}| > t_n) → 0.

This is because

P(√(Σ_r Σ_c [T_rc^{(j)}]²) > t_n) ≤ P(RC max_{r,c} [T_rc^{(j)}]² > t_n²) ≤ RC max_{r,c} P(|T_rc^{(j)}| > t_n/√(RC)),

and t_n and t_n/√(RC) are of the same order as n → ∞. By large deviation theory (Hall [8] and Jing et al. [10]), for some constant A,

P(|T_rc^{(j)}| > t_n) = (2 t_n^{-1} / √(2π)) e^{-t_n²/2} [1 + A t_n³ n^{-1/2} + o(t_n³ n^{-1/2})],    (7.9)

provided t_n = o(n^{1/6}). If (7.9) is uniform in j > d_1, then (7.6) and (7.9) imply that type I consistency holds if

d_0 (t_n √2)^{-1} e^{-t_n²/2} → 0.    (7.10)

To examine (7.10), write d_0 and t_n as d_0 = exp(n^b) and t_n = √2 n^r. By large deviation theory, (7.10) holds when r < 1/6. Thus if 0 < (1/2)b < r < 1/6, then

d_0 (t_n √2)^{-1} e^{-t_n²/2} = n^{-r} exp(n^b − n^{2r}) → 0.


Thus d_0 = exp(n^b) with b = 1/3 − ε, for any ε > 0, is the largest d_0 can be to have consistency. For instance, if there are exactly d_0 = exp(n^{1/4}) irrelevant variables and we use the threshold t_n = √2 n^{1/7}, the probability of selecting any of the irrelevant variables tends to zero at the rate n^{-1/7} exp(n^{1/4} − n^{2/7}).

7.2 Type II consistency

We now assume that for each relevant variable X_j there is a least favorable point x^{(1)} such that the local efficacy s_j^{(1)} ≡ ζ̂_j(x^{(1)}) is stochastically smaller than the importance score s_j = M^{-1} Σ_{a=1}^M ζ̂_j(x*_a). That is, we assume that for each relevant variable X_j there exist n_1 and x^{(1)} such that, for n ≥ n_1,

P(s_j^{(1)} ≤ t_n) ≥ P(s_j ≤ t_n),   j ≤ d_1.    (7.11)

The existence of such a least favorable x^{(1)} is established by considering the probability limit of

arg min{ζ̂_j(x*_a) : x*_a ∈ {x*_1, ..., x*_M}}.

We again assume that R and C are fixed, so that to show type II consistency it is enough to show that, for each (r, c) and j ≤ d_1,

P(|T_rc^{(j)}| ≤ t_n) → 0.

This is because

P(√(Σ_r Σ_c [T_rc^{(j)}]²) ≤ t_n) ≤ P(min_{r,c} [T_rc^{(j)}]² ≤ t_n²) ≤ RC max_{r,c} P(|T_rc^{(j)}| ≤ t_n).

By Theorem 6.2, √n(s_j^{(1)} − τ_j) →_d N(0, ν²), τ_j > 0. It follows that if t_n and d_1 are bounded above as n → ∞ and τ_j > 0, then P(s_j^{(1)} ≤ t_n) → 0. Thus type II consistency holds in this case. To examine the more interesting case, where t_n and d_1 are not bounded above and τ_j may tend to zero as n → ∞, we will need to center the T_rc^{(j)}. Thus set

θ_n^{(j)} = √n (p_rc^{(j)} − p_r^{(j)} p_c^{(j)}) / √(p_r^{(j)} p_c^{(j)}),    (7.12)

T_n^{(j)} = T_rc^{(j)} − θ_n^{(j)},

where, for simplicity, the dependence of T_n^{(j)} and θ_n^{(j)} on r and c is not indicated.

Lemma 7.1

P(min_{j≤d_1} |T_rc^{(j)}| ≤ t_n) ≤ Σ_{j=1}^{d_1} P(|T_n^{(j)}| ≥ min_{j≤d_1} |θ_n^{(j)}| − t_n).

Proof

P(min_{j≤d_1} |T_rc^{(j)}| ≤ t_n) = P(min_{j≤d_1} |T_n^{(j)} + θ_n^{(j)}| ≤ t_n)
  ≤ P(min_{j≤d_1} |θ_n^{(j)}| − max_{j≤d_1} |T_n^{(j)}| ≤ t_n)
  = P(max_{j≤d_1} |T_n^{(j)}| ≥ min_{j≤d_1} |θ_n^{(j)}| − t_n)
  ≤ Σ_{j=1}^{d_1} P(|T_n^{(j)}| ≥ min_{j≤d_1} |θ_n^{(j)}| − t_n).

Set

a_n = min_{j≤d_1} |θ_n^{(j)}| − t_n;

then by large deviation theory, if a_n = o(n^{1/6}),

P(|T_n^{(j)}| ≥ a_n) ∝ (a_n √2)^{-1} exp{−a_n²/2},    (7.13)

where ∝ signifies asymptotic order as n → ∞, a_n → ∞. It follows that if (7.13) holds uniformly in j, then by Lemma 7.1,

P(min_{j≤d_1} |T_rc^{(j)}| ≤ t_n) ∝ d_1 (a_n √2)^{-1} exp{−a_n²/2}.

Thus type II consistency holds when a_n = o(n^{1/6}) and

d_1 (a_n √2)^{-1} exp{−a_n²/2} → 0.    (7.14)

In Section 7.1, t_n is of order n^r, 0 < r < 1/6. For this t_n, type II consistency holds if a_n is of order n^r, 0 < r < 1/6, and

d_1 n^{-r} exp{−n^{2r}/2} → 0.


If a_n = O(n^k) with k ≥ 1/6, then P(|T_n^{(j)}| ≥ a_n) tends to zero at a rate faster than in (7.13) and consistency is ensured. For instance, if all relevant variables X_j have p_rc^{(j)} fixed as n → ∞, then min_{j≤d_1} |θ_n^{(j)}| is of order √n and a_n is of order n^{1/2} − t_n.

7.3 Type I and II consistency

We see from Sections 7.1 and 7.2 that consistency holds under a variety of conditions. One set of conditions is:

Theorem 7.1 Under the regularity conditions of Sections 7.1 and 7.2, if

d_0 = c_0 exp(n^b),   t_n = c_1 n^r,   min_{j≤d_1} |θ_n^{(j)}| − t_n = c_2 n^r,   d_1 = c_3 exp(n^b),

for positive constants c_0, c_1, c_2, and c_3, and 0 < (1/2)b < r < 1/6, then CATCH is consistent.

Appendix: Proof of Theorem 6.1

(1) Let N*_h = Σ_{i=1}^n I(X^{(i)} ∈ N_h(x^{(0)})). Then N*_h ∼ Bi(n, ∫_{x^{(0)}−h}^{x^{(0)}+h} f(x) dx) and N*_h →_{a.s.} ∞ as n → ∞. Write

ζ̂(x^{(0)}, h) = n^{-1/2} √(X²(x^{(0)}, h)) = (N*_h / n)^{1/2} (N*_h)^{-1/2} √(X²(x^{(0)}, h)).

Since N*_h ∼ Bi(n, ∫_{x^{(0)}−h}^{x^{(0)}+h} f(x) dx), we have

N*_h / n → ∫_{x^{(0)}−h}^{x^{(0)}+h} f(x) dx.    (7.15)

By Lemma 6.1 and Table 1,

(N*_h)^{-1/2} √(X²(x^{(0)}, h)) → ( Σ_{r=+,−} Σ_{c=1}^C (p_rc − p_r q_c)² / (p_r q_c) )^{1/2},    (7.16)

where

p_{+c} = Pr(X > x^{(0)}, Y = c | X ∈ N_h(x^{(0)})),   p_{−c} = Pr(X ≤ x^{(0)}, Y = c | X ∈ N_h(x^{(0)}))

for r = +, − and c = 1, ..., C, and

p_r = Σ_{c=1}^C p_rc,   q_c = Σ_{r=+,−} p_rc.

In fact, p_+ = Pr(X > x^{(0)} | X ∈ N_h(x^{(0)})), p_− = Pr(X ≤ x^{(0)} | X ∈ N_h(x^{(0)})), and q_c = Pr(Y = c | X ∈ N_h(x^{(0)})).

By (7.15) and (7.16),

ζ̂(x^{(0)}, h) → ( ∫_{x^{(0)}−h}^{x^{(0)}+h} f(x) dx )^{1/2} ( Σ_{r=+,−} Σ_{c=1}^C (p_rc − p_r q_c)² / (p_r q_c) )^{1/2} ≡ ζ(x^{(0)}, h).

(2) Since X and Y are independent, p_rc = p_r q_c for r = +, − and c = 1, ..., C; then ζ(x^{(0)}, h) = 0 for any h. Hence

ζ(x^{(0)}) = sup_{h>0} ζ(x^{(0)}, h) = 0.

(3) Recall that p_c(x^{(0)}) = Pr(Y = c | x^{(0)}). Without loss of generality, assume p_c′(x^{(0)}) > 0. Then there exists h such that

Pr(Y = c | x^{(0)} < X < x^{(0)} + h) > Pr(Y = c | x^{(0)} − h < X < x^{(0)}),

which is equivalent to p_{+c}/p_+ > p_{−c}/p_−. This shows that ζ(x^{(0)}) > 0, since by Lemma 6.1, ζ(x^{(0)}) = 0 would imply p_{+c}/p_+ = p_{−c}/p_− = p_c.

References

[1] Breiman, L., 2001. Random forests. Machine Learning 45, 5–32.

[2] Breiman, L., Friedman, J. H., Olshen, R. A., Stone, C. J., 1984. Classification and Regression Trees. Wadsworth, Belmont.

[3] Doksum, K., Tang, S., Tsui, K.-W., 2008. Nonparametric variable selection: The EARTH algorithm. Journal of the American Statistical Association 103(484), 1609–1620.

[4] Doksum, K. A., Samarov, A., 1995. Nonparametric estimation of global functionals and a measure of explanatory power of covariates in regression. Annals of Statistics 23, 1443–1473.

[5] Doksum, K. A., Schafer, C., 2006. Powerful choices: Tuning parameter selection based on power. In: Frontiers in Statistics: Dedicated to Peter Bickel (J. Fan and H. L. Koul, eds.), Imperial College Press, 113–141.

[6] Friedman, J., 1991. Multivariate adaptive regression splines. The Annals of Statistics, 1–67.

[7] Gao, J., Gijbels, I., 2008. Bandwidth selection in nonparametric kernel testing. Journal of the American Statistical Association 103(484), 1584–1594.

[8] Hall, P., 1992. The Bootstrap and Edgeworth Expansion. Springer, New York.

[9] Huang, L.-S., Chen, J., 2008. Analysis of variance, coefficient of determination and F-test for local polynomial regression. Annals of Statistics 36, 2085–2109.

[10] Jing, B.-Y., Shao, Q.-M., Wang, Q., 2003. Self-normalized Cramér-type large deviations for independent random variables. Annals of Probability 31, 2167–2215.

[11] Lin, D. Y., Zeng, D., 2011. Correcting for population stratification in genomewide association studies. Journal of the American Statistical Association 106, 997–1008.

[12] Loh, W.-Y., 2009. Improving the precision of classification trees. The Annals of Applied Statistics 3, 1710–1737.

[13] Price, A., Patterson, N., Plenge, R., Weinblatt, M., Shadick, N., Reich, D., 2006. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38(8), 904–909.

[14] Qiao, Z., Zhou, L., Huang, J., 2008. Linear discriminant analysis for high dimensional, low sample size data. Special issue of World Congress on Engineering 2008.

[15] Quinlan, J. R., 1987. Simplifying decision trees. International Journal of Man-Machine Studies 27(3), 221–234.

[16] Schafer, C., Doksum, K. A., 2009. Selecting local models in multiple regression by maximizing power. Metrika 69, 283–304.

[17] Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267–288.

[18] Wang, L., Shen, X., 2007. On L1-norm multi-class support vector machines: methodology and theory. Journal of the American Statistical Association 102, 583–594.

[19] Zhang, H. H., Liu, Y., Wu, Y., Zhu, J., 2008. Variable selection for the multicategory SVM via adaptive sup-norm regularization. Electronic Journal of Statistics 2, 149–167.


20

CATCH identifies 7 variables as important including 4 continuous variables and3 categorical variables We consider three classification methods 1 Support VectorMachine with radial kernel denoted by SVM(r) 2 Support Vector Machine withpolynomial kernel denoted by SVM(p) 3 RFC We apply these three methods onthe training set to build classification models and use the test set to estimate themisclassification rate of the methods We apply the methods with CATCH and withoutCATCH respectively where ldquowith CATCHrdquo means that only the 7 important variablesidentified by CATCH are used to build the model and ldquowithout CATCHrdquo means thatall the 21 variables are used in the analysis

Table 5 Misclassification rates of three classification methods (SVM(r) SVM(p) RFC) with CATCHand without CATCH

SVM(r) SVM(p) RFC

wo CATCH 0052 0063 0024

with CATCH 0037 0055 0015

Table 5 shows that if CATCH is used to select important variables before the clas-sification methods are applied the misclassification rates decrease CATCH reducesthe misclassification rates by 29 13 and 37 for SVM(r) SVM(p) and RFC respec-tively CATCH + RFC has the best performance Although CATCH is not designed forimproving classification accuracy nonetheless it can help other classification methodsto perform better

6 Properties of Local Contingency Efficacy

In this section we investigate the properties of local contingency efficacy For a uni-variate continuous predictor Theorem 61 below shows that local contingency efficacyis well-defined and equals to 0 if the response variable is independent of the predictorConsider an R times C contingency table Let nrc r = 1 middot middot middot R c = 1 middot middot middot C be theobserved frequency in the (r c)-cell and let prc be the probability that an observationbelongs to the (r c)-cell Let

pr =Csumc=1

prc qc =Rsumr=1

prc (61)

21

and

nrmiddot =Csumc=1

nrc nmiddotc =Rsumi=1

nrc n =Rsumi=1

Csumc=1

nrc Erc = nrmiddotnmiddotcn (62)

Consider testing that the rows and the columns are independent ie

H0 forallr c prc = prqc versus H1 existr c prc 6= prqc (63)

a standard test statistic is

X 2 =Rsumr=1

Csumc=1

(nrc minus Erc)2

Erc (64)

Under H0 X 2 converges in distribution to χ2(Rminus1)(Cminus1) Moreover define 00 = 0 we

have the well known result

Lemma 61 As n tends to infinity X 2n tends in probability to

τ 2 equivRsumr=1

Csumc=1

(prc minus prqc)2

prqc

Moreover if prqc is bounded away from 0 and 1 as n rarr infin and if R and C are fixedas nrarrinfin then τ 2 = 0 if and only if H0 is true

Remark 61 Lemma 61 shows that the contingency efficacy ζ in (28) is well definedMoreover ζ = 0 if and only if X and Y are independent

Theorem 61 Consider a categorical response Y and a continuous predictor X withdensity function f(x) such that for a fixed h gt 0 f(x) gt 0 for x isin (x(0) minus h x(0) + h)then

(1) For fixed h the estimated local section efficacy ζ(x(0) h) defined in (25) convergesalmost surly to the section efficacy ζ(x(0) h) defined in (24)

(2) If X and Y are independent then the local efficacy ζ(x(0)) defined in (24) is zero

(3) If there exists c isin 1 middot middot middot C such that pc(x) = P (Y = c|x) has a continuousderivative at x = x(0) with pprimec(x

(0)) 6= 0 then ζ(x(0)) gt 0

The χ2 statistic (64) can be written in terms of multinomial proportions prc withradicn(prc minus prc) 1 le r le R 1 le c le C converging in distribution to a multivariate

normal By Taylor expansion of Tn equivradicχ2n at τ = plim

radicχ2n we find that

22

Theorem 62 radicn(Tn minus τ) minusrarrd N(0 v2) (65)

where the asymptotic variance v2 which can be obtained from the Taylor expansion isnot needed in this paper

Remark 62 The result (65) applies to the multivariate χ2-based efficacies ζ with nreplaced by the number of observations that goes in to the computations of ζ In thiscase we use the notation χ2

j for the chi-square statistic for the jth variable

7 Consistency Of CATCH

Let Y denote a variable to be classified into one of the categories 1 C by usingthe Xjrsquos in the vector X = (X1 Xd) Let Xminusj equiv X minusXj denote X without the Xj

component Assume that the squared Pearson nonparametric correlation (Doksum andSamarov [4] Huang and Chen [9]) between Xj and Xminusj is less than 05 We call thevariable Xj relevant for Y if conditionally given Xminusj L(Y |Xj = x) is not constant in xand irrelevant otherwise Our data are (X(1) Y (1)) (X(n) Y (n)) iid as (X Y ) Avariable selection procedure is consistent if the probability that all the relevant variablesare selected and none of the irrelevant variables are selected tends to one as n tends toinfinity

We give a general nonparametric framework where the variable selection rule basedon the CATCH importance scores sj given in (38) is consistent

When the predictors are not all categorical and CATCH involves tubes we assumethat the number of observations in each tube tends toinfin as nrarrinfin and we assume thatfor those importance scores that require the selection of a section size h the selection isfrom a finite set hj with minjhj gt h0 for some h0 gt 0 It follows that the numberof observations mjk in the kth tube used to calculate the importance score sj for thevariable Xj tends to infinity as nrarrinfin

The key quantities that determine consistency are the limits λj of sj That is

λj = τ(ζj P ) = EP [ζj(X)] =

intζj(x)dP (x) j = 1 d

where ζj(x) is the local contingency efficacy defined by (24) for numerical Xrsquos and by(28) for categorical Xrsquos and P is the probability distribution of X Our estimate sjof τj is

sj = τ(ζj P ) =1

M

Msumk=1

ζj(Xlowastk)

where ζj is defined in section 3 and P is the empirical distritution of XLet d0 denote the number of Xrsquos that are irrelevant for Y Set d1 = d minus d0 and

without loss of generality reorder the λj so that

23

λj gt 0 if j = 1 d1λj = 0 if j = d1 + 1 d

(71)

Our variable selection rule is

ldquokeep variable Xj if sj ge t for some threshold trdquo (72)

Rule (72) is consistent if and only if

P (minjled1sj ge t and max

jgtd1sj lt t) minusrarr 1 (73)

To establish (73) it is enough to show that

P (maxjgtd1sj gt t) minusrarr 0 (74)

andP (min

jled1sj le t) minusrarr 0 (75)

We call (74) and (75) type I and type II consistency respectively Type I and IIerror probabilities refer to probabilities of any false positives and any false negativesrespectively

71 Type I consistency

We consider (74) first and note that

P (maxjgtd1sj gt t) le

dsumj=d1+1

P (sj gt t) (76)

For the jth irrelevant variable Xj j gt d1 the results in Section 6 apply to the local

efficacy ζj(x) at a fixed point x The importance score sj defined in (38) is the averageof such efficacies over a sample of M random points xlowasta We assume that there is apoint x(0) such that for irrelevant xjrsquos and for n large enough ζj(x

(0)) is stochastically

larger in the right tail than the average Mminus1sumMa=1 ζj(x

lowasta) That is we assume that

there exist t gt 0 and n0 such that for n ge n0

p(sj gt t) le P (s(0)j gt t) j gt d1 (77)

The existence of such a point x(0) is established by considering the probability limit of

arg maxζj(xlowasta) xlowasta isin xlowast1 middot middot middot xlowastM

Note by Lemma (61) s(0)j rarrp 0 as n rarr infin It follows that we have established

24

(74) for t and d0 = d minus d1 that are fixed as n rarr infin To examine the interesting casewhere d0 rarr infin as n rarr infin we need large deviation results which we turn to next Inthe d0 rarrinfin case we consider t equiv tn that tend to infin as nrarrinfin and we use the weakerassumption there exists x(0) and n0 such that

P (sj gt tn) le P (s(0)j gt tn) (78)

for n ge n0 and each tn with limnrarrinfin tn = infin We also assume that the number ofcolumns C and rows R in the contingency table that define the efficacy χ2

jn given by

(64) stays fixed as nrarrinfin In this case P (s(0)j gt tn)rarr 0 provided each term

[T (j)rc ]2 equiv

[nrcErcradicnradicErc

]2in s2j = χ2

jn defined for Xj by (64) satisfies

P(|T (j)rc | gt tn

)minusrarr 0

This is because

P

radicsumr

sumc

[T(j)rc ]2 gt tn

le P(RC max

rc[T (j)rc ]2 gt t2n

)le RC maxP

(|T (j)rc | gt tn

radicRC)

and tn and tnradicRC are of the same order as n rarr infin By large deviation theory Hall

[8] and Jing et al [10] for some constant A

P (|T (j)rc | gt tn) =

2tminus1nradic2πeminust

2n2[1 + At3nn

minus 12 + o

(t3nnminus 1

2

)](79)

provided tn = o(n

16

) If (79) is uniform in j gt d1 then (76) and (79) implies that

type I consistency holds if

d0

(tnradic

2

)minus1eminus

t2n2 rarr 0 (710)

To examine (710) write d0 and tn as d0 = exp(nb)

and tn =radic

2nr By large deviationtheory (710) holds when r lt 1

6 Thus if 0 lt 1

2b lt r lt 1

6 then

d0

(tnradic

2

)minus1eminus

t2n2 = nminusr exp

(nb minus n2r

)rarr 0

25

Thus d0 = exp(nb) with b = 13ε for any ε gt 0 is the largest d0 can be to have consistency

For instance if there are exactly d0 = exp(n

14

)irrelevant variables and we use the

threshold tn =radic

2n17 the probability of selecting any of the irrelevant variables tends

to zero at the ratenminus

14 exp

(n

14 minus n

27

)72 Type II consistency

We now assume that for each relevant variable Xj there is a least favorable point x(1)

such that the local efficacy s(1)j equiv ζ(x(1)) is stochastically smaller than the importance

score sj = Mminus1sumMa=1 ζ(xlowasta) That is we assume that for each relevant variable Xj there

exists n1 and x(1) such that for n ge n1

P(s(1)j le tn

)ge P (sj le tn) j le d1 (711)

The existence of such a least favorable x(1) is established by considering the probabilitylimit of

arg minζj(xlowasta) xlowasta isin xlowast1 middot middot middot xlowastM

We again assume that R and C are fixed so that to show type II consistency it isenough to show that for each (r c) and j le d1

P(|T (j)rc | le tn

)rarr 0

This is because

P

(radicsumr

sumc

[T

(j)rc

]2le tn

)le P (minrc

[T

(j)rc

]2le t2n)

le RC maxrc P(|T (j)rc | le tn

)By Theorem 62

radicn(s(1)j minus τj

)rarrd N(0 ν2) τj gt 0 It follows that if tn and d1

are bounded above as n rarr infin and τj gt 0 then P(s(1)j le tn

)rarr 0 Thus type II

consistency holds in this case To examine the more interesting case where tn and d1are not bounded above and τj may tend to zero as nrarrinfin we will need to center the

T(j)rc Thus set

θ(j)n =

radicn(p(j)rc minus p(j)r p

(j)c

)radicp(j)r p

(j)c

(712)

26

Tn(j) = T (j)rc minus θ(j)n

where for simplicity the dependence of T(j)n and θ

(j)n on r and c is not indicated

Lemma 71

P

(minjled1|T (j)rc | le tn

)le

d1sumj=1

P

(|T (j)n | ge min

jled1θ(j)n minus tn

)Proof

P(

minjled1 |T(j)rc | le tn

)= P

(minjled1 |T

(j)n + θ

(j)n | le tn

)le P

(minjled1 |T

(j)n | minusmaxjled1 |θ

(j)n | le tn

)= P

(maxjled1 |T

(j)n | ge minjled1 |θ

(j)n | minus tn

)lesumd1

j=1 P(|T (j)n | ge minjled1 |θ

(j)n | minus tn

)

Setan = min

jled1

∣∣θ(j)n ∣∣minus tnthen by large deviation theory if an = o

(n

16

)

P (∣∣T (j)n

∣∣ ge an) prop

(anradic

2

)minus1expminusa2n2 (713)

where prop signifies asymptotic order as nrarrinfin an rarrinfin It follows that if (713) holdsuniformly in j then by Lemma 71

P

(minjled1

∣∣T (j)rc

∣∣ le tn

)prop d1

(anradic

2

)minus1expminusa2n2

Thus type II consistency holds when an = o(n

16

)and

d1

(anradic

2

)minus1expminusa2n2 rarr 0 (714)

In Section 71 tn is of order nr 0 lt r lt 16 For this tn type II consistency holds if an

is of order nr 0 lt r lt 16 and

d1nminusr expminusn2r2 rarr 0

27

If an = O(nk) with k ge 16 then P (

∣∣∣T (j)(n)

∣∣∣ ge an) tends to zero at a rate faster than in

(713) and consistency is ensured For instance if all relevant variables Xj have p(j)rc

fixed as nrarrinfin then minjled1

∣∣∣θ(j)n ∣∣∣ is of orderradicn and an is of order n

12 minus tn

73 Type I and II consistency

We see from Sections 71 and 72 that consistency holds under a variety of conditionsOne set of conditions is

Theorem 71 Under the regularization conditions of Sections 71 and 72 if

d0 = c0 exp(nb) tn = c1nr

minjled1 θ(j) minus tn = c2nr d1 = c3 exp

(nb)

for positive constants c0 c1 c2 and c3 and 0 lt 12b lt r lt 1

6 then CATCH is consistent

AppendixProof of Theorem 61

(1) Let Nh =

sumni=1 I(X(i) isin Nh(x

(0))) Then Nh sim Bi(n

int x(0)+hx(0)minush f(x)dx) and

Nh rarras infin as nrarrinfin

ζ(x(0) h) = nminus12radicX 2(x(0) h)

= (Nh n)12(N

h )minus12radicX 2(x(0) h)

Since Nh sim Bi(n

int x(0)+hx(0)minush f(x)dx) we have

Nh nrarr

int x(0)+h

x(0)minushf(x)dx (715)

By Lemma 61 and Table (1)

(Nh )minus12

radicX 2(x(0) h)rarr

( sumr=+minus

Csumc=1

(prc minus prqc)2

prqc

) 12

(716)

where

p+c = Pr(X gt x0 Y = c|X isin Nh(x(0))) pminusc = Pr(X le x0 Y = c|X isin Nh(x

(0)))

28

for r = +minus and c = 1 middot middot middot C

pr =Csumc=1

prc qc =sumr=+minus

prc

Actually p+ = Pr(X gt x0|X isin Nh(x(0))) pminus = Pr(X le x0|X isin Nh(x

(0)))qc = Pr(Y = c|X isin Nh(x

(0)))

By (715) and (716)

ζ(x(0) h)rarr(int x(0)+h

x(0)minushf(x)dx

) 12( sumr=+minus

Csumc=1

(prc minus prqc)2

prqc

) 12

equiv ζ(x(0) h)

(2) Since X and Y are independent prc = prqc for r = +minus and c = 1 middot middot middot C thenζ(x(0) h) = 0 for any h Hence

ζ(x(0)) = suphgt0ζ(x(0) h) = 0

(3) Recall that pc(x(0)) = Pr(Y = c|x(0)) Without loss of generality assume pprimec(x

(0)) gt0 Then there exists h such that

Pr(Y = c|x(0) lt X lt x(0) + h) gt Pr(Y = c|x(0) minus h lt X lt x(0))

which is equivelant to p+cp+ gt pminuscpminus This shows that ζ(x(0)) gt 0 since byLemma 61 ζ(x(0)) = 0 results in p+cp+ = pminuscpminus = pc

References

[1] Breiman L 2001 Random forests Machine Learning 45 5ndash32

[2] Breiman L Friedman J H Olshen R A Stone J 1984 Classification andregression trees Wadsworth Belmont

[3] Doksum K Tang S Tsui K-W 2008 Nonparametric variable selection Theearth algorithm Journal of the American Statistical Association 103(484) 1609ndash1620

[4] Doksum K A Samarov A 1995 Nonparametric estimation of global functionalsand a measure of explanatory power of covariates in regression Annals of Statistics23 1443ndash1473

29

[5] Doksum K A Schafer C 2006 Powerful choices Tuning parameter selectionbased on power Frontiers in Statistics Dedicated to Peter Bickel Editors J Fanand H L Koul Imperial College Press 113ndash141

[6] Friedman J 1991 Multivariate adaptive regression splines The annals of statis-tics 1ndash67

[7] Gao J Gijbels I 2008 Bandwidth selection in nonparametric kernel testingJournal of the American Statistical Association 103 (484) 1584ndash1594

[8] Hall P 1992 The Bootstrap and Edgeworth Expansions Springer New York

[9] Huang L-S Chen J 2008 Analysis of variance coefficient of determinationand f-test for local polynomial regression Annals of Statistics 36 2085ndash2109

[10] Jing B-Y Shao Q-M Wang Q 2003 Self-normalized cramer-type large devi-ations for independent random variables Annals of Probability 31 2167ndash2215

[11] Lin D Y Zeng D 2011 Correcting for population stratification in genomewideassociation studies Journal of the American Statistical Association 106 997ndash1008

[12] Loh W-Y 2009 Improve the precision of classification trees The Annals ofApplied Statistics 3 1710ndash1737

[13] Price A Patterson N Plenge R Weinblatt M Shadick N Reich D 2006Principal components analysis corrects for stratification in genome-wide associationstudies Nature genetics 38 (8) 904ndash909

[14] Qiao Z Zhou L Huang J 2008 Linear discriminant analysis for high dimen-sional Low Sample Size Data special issue of World Congress on Engineering2008

[15] Quinlan J R 1987 Simplifying decision trees International Journal of Man-Machine Studies 27(3) 221ndash234

[16] Schafer C Doksum K A 2009 Selecting local models in multiple regression bymaximizing power Metrika 69 283ndash304

[17] Tibshirani R 1996 Regression shrinkage and selection via the lasso J RoyalStatist Soc B 58 267ndash288

[18] Wang L Shen X 2007 On l1-norm multi-class support vector machinesmethodology and theory Journal of the American Statistical Association 102 583ndash594

[19] Zhang H H Liu Y Wu Y Zhu J 2008 Variable selection for the multicate-gory svm via adaptive sup-norm regularization Electronic Journal of Statistics 2149ndash167

30

  • Introduction
  • Importance Scores for Univariate Classification
    • Numerical Predictor Local Contingency Efficacy
    • Categorical Predictor Contingency Efficacy
      • Variable Selection for Classification Problems The CATCH Algorithm
        • Numerical Predictors
        • Numerical and Categorical Predictors
        • The Empirical Importance Scores
        • Adaptive Tube Distances
        • The CATCH Algorithm
          • Simulation Studies
            • Variable Selection with Categorical and Numerical Predictors
            • Improving The Performance of SVM and RF Classifiers
              • CATCH Analysis of The ANN Data
              • Properties of Local Contingency Efficacy
              • Consistency Of CATCH
                • Type I consistency
                • Type II consistency
                • Type I and II consistency
                  • Proof of Theorem 61
Page 8: Nonparametric Variable Selection and Classi cation: The ...lc436/Tang_Chen_etal_2012_CSDA.pdf · pruning, will choose a subset of optimal splitting variables to be the most signi

We adjust the size of the tube so that the number of points in the tube is a preas-signed number k We find that k ge 50 is in general a good choice Note that k = ncorresponds to unconditional or marginal nonparametric regression

Within the tube Tl(x(0) δD) Xl is unconstrained To measure the dependence of

Y on Xl locally we consider neighborhoods of x(0) along the direction of Xl within thetube Let

NlhδD(x(0)) = x = (x1 middot middot middot xd)T isin Rd x isin Tl(x(0) δD) |xl minus x(0)l | le h (32)

be a section of the tube Tl(x(0) δD) To measure how strongly Xl relates to Y in

NlhδD(x(0)) we consider

X 2lδD(x(0) h) (33)

where (33) is defined by (22) using only the observations in the tube Tl(x(0) δD)

Analogous to the definitions of local contingency efficacy (24) we define

Definition 4 The local contingency tube section efficacy and the local contingencytube efficacy of Y on Xl at the point x(0) and their estimates are

ζ(x(0) h l) = plimnrarrinfin nminus12

radicX 2lδD(x(0) h) ζl(x

(0)) = suphgt0ζ(x(0) h l) (34)

ζ(x(0) h l) = nminus12radicX 2lδD(x(0) h) ζl(x

(0)) = suphgt0ζ(x(0) h l) (35)

For an illustration of local contingency tube efficacy consider Y isin 1 2 and suppose

logit P (Y = 2|x) =(xminus 1

2

)2 x isin [0 1] Then the local efficacy with h small will be

large while if h ge 1 the efficacy will be zero

32 Numerical and Categorical PredictorsIf Xl is categorical but some of the other predictors are numerical we construct

tubes Tl(x(0) δD) based on the numerical predictors leaving the categorical variables

unconstrained and we use the statistic as defined by (26) to measure the strength of theassociation between Y and Xl in the tube Tl(x

(0) δD) by using only the observationsin the tube Tl(x

(0) δD) We denote this version of (26) by

X 2lδD(x(0)) (36)

Analogous to the definition of contingency efficacy given in (28) we define

Definition 5 The local contingency tube efficacy of Y on Xl at x(0) and its estimateare

ζl(x(0)) = plimnrarrinfin n

minus12radicX 2lδD(x(0)) ζl(x

(0)) = nminus12radicX 2lδD(x(0)) (37)

6

33 The Empirical Importance Scores

The empirical efficacies (35) and (37) measure how strongly Y depends on Xl nearone point x(0) To measure the overall dependency of Y on Xl we randomly sampleM bootstrap points xlowasta a = 1 M with replacement from x(i) 1 le i le n andcalculate the average of local tube efficacies estimates at these points

Definition 6 The CATCH empirical importance score for variable l is

sl = Mminus1Msuma=1

ζl(xlowasta) (38)

where ζl(x(ia))is defined in (35) and (37) for numerical and categorical predictors

respectively

34 Adaptive Tube Distances

Doksum et al [3] used an example to demonstrate that the distance D needs tobe adaptive to keep variables strongly associated with Y from obscuring the effect ofweaker variables Suppose Y depends on three variable X1 X2 and X3 among the largerset of predictors The strength of the dependency ranges from weak to strong whichwill be estimated by the empirical importance scores As we examine the effect of X2

on Y we want to define the tube so that there is relatively less variation in X3 than inX1 in the tube because X3 is more likely to obscure the effect of X2 than X1 To thisend we introduce a tube distance for the variable Xl of the form

Dl(xminusl x(0)minusl ) =

dsumj=1

wj|xj minus x(0)j |I(j 6= l) (39)

where the weights wj adjust the contribution of strong variables to the distance so thatthe strong variables are more constrained

The weights wj are proportional to sj minus sprimej which are determined iteratively and

adaptively as follows in the first iteration set sj = s(1)j = 1 in (39) and compute the

importance score s(2)j using (38) For the second iteration we use the weights sj = s

(2)j

and the distance

Dl(xminusl x(0)minusl ) =

sumdj=1(sj minus sprimej)+|xj minus x

(0)j |I(j 6= l)sumd

j=1(sj minus sprimej)+I(j 6= l)(310)

where 00 equiv 1 and sprimej is a threshold value for sj under the assumption of no conditionalassociation of Xl with Y computed by a simple Monte Carlo technique This adjust-ment is important because some predictors by definition have larger importance scoresthan others when they are all independent of Y For example a categorical predictor

7

with more levels has a larger importance score than that with fewer levels Thereforethe unadjusted scores sj can not accurately quantify the relative importance of the

predictors in the tube distance calculation Next we use (38) to produce sj = s(3)j and

use s(3)j in the next iteration and so on

In the definition above we need to define |xj minus x(0)j | appropriately for a categorical

variable Xj Because Xj is categorical we need to assume that |xj minus x(0)j | is constant

when xj and x(0)j take any pair of different categories One approach is to define |xj minus

x(0)j | =infinI(xj 6= x

(0)j ) in which case all points in the tube are given the same Xj value

as the tube center The approach is a very sensible one as it strictly implements theidea of conditioning on Xj when evaluating the effect of another predictor Xl and Xj

does not contribute to any variation in Y The problem with this definition though isthat when the number of categories of Xj is relatively large as compared to sample sizethere will not be enough points in the tube if we restrict Xj to a single category Insuch a case we do not restrict Xj which means that Xj can take any category in the

tube We define |xj minus x(0)j | = k0I(xj 6= x(0)j ) where k0 is a normalizing constant The

value of k0 is chosen to achieve a balance between numerical variables and categoricalvariables Suppose that X1 is a numerical variable with standard deviation one andX2 is a categorical variable For X2 we set k0 equiv E(|X(1)

1 minus X(2)1 |) where X

(1)1 and

X(2)1 are two independent realizations of X1 Note that k0 =

radic2π if X1 follows a

standard normal distribution Also k0 can be based on the empirical distributions ofthe numerical variables Based on the proceeding discussion of the choice of k0 we usek0 =infin in the simulations and k0 =

radic2π in the real data example

Remark 31 (choice of k0) k0 =radic

2π is used all the time unless the sample size is

large enough in the sense to be explained below Suppose k0 = infin then Dl

(xminusl x

(0)minusl

)is finite only for those observations with xj = x

(0)j When there are multiple categorical

variables Dl is finite only for those points denoted by set Ω(0) that satisfy xj = x(0)j for

all categorical xj When there are many categorical variables or when the total samplesize is small the size of this set can be very small or close to zero in which case thetube cannot be well defined therefore k0 =

radic2π is used as in the real data example

On the other hand when the size of Ω(0) is larger than the specified tube size we canuse k0 =infin as in the simulation examples

35 The CATCH Algorithm

The variable selection algorithm for a classification problem is based on importancescores and an iteration procedureThe Algorithm(0) Standardizing Standardize all the numerical predictors to have sample mean 0and sample standard deviation 1 using linear transformations

8

(1) Initializing Importance Scores Let S = (s1 middot middot middot sd) be importance scores forthe predictors Let S prime = (sprime1 middot middot middot sprimed) such that sprimel is a threshold importance score whichcorresponds to the model where Xl is independent of Y given Xminusl Initialize S to be(1 middot middot middot 1) and S prime to be (0 middot middot middot 0) Let M be the number of tube centers and m thenumber of observations within each tube(2) Tube Hunting Loop For l = 1 middot middot middot d do (a) (b) (c) and (d) below

(a) Selecting tube centers For 0 lt b le 1 set M = [nb] where [ ] is the great-est integer function and randomly sample M bootstrap points Xlowast1 X

lowastM with

replacement from the n observations Set k = 1 do (b)

(b) Constructing tubes For 0 lt a lt 1 set m = [na] Using the tube distanceDl defined in (310) select the tube size a so that there are exactly m ge 10observations inside the tube for Xl at Xlowastk

(c) Calculating efficacy If Xl is a numerical variable let ζl(Xlowastk) be the estimated

local contingency tube efficacy of Y on Xl at Xlowastk as defined in (35) If Xl is a

categorical variable let ζl(Xlowastk) be the estimated contingency tube efficacy of Y

on Xl at Xlowastk as defined in (37)

(d) Thresholding Let Y lowast denote a random variable which is identically distributed

as Y but independent of X Let (Y(1)0 middot middot middot Y (n)

0 ) be a random permutation of

(Y (1) middot middot middot Y (n)) then Y0 equiv (Y(1)0 middot middot middot Y (n)

0 ) can be viewed as a realization of Y lowastLet ζlowastl (Xlowastk) be the estimated local contingency tube efficacy of Y0 on Xl at Xlowastkcalculated as in (c) with Y0 in place of Y

If k lt M set k = k + 1 go to (b) otherwise go to (e)

(e) Updating importance scores Set

snewl =1

M

Msumk=1

ζl(Xlowastk) s

primenewl =

1

M

Msumk=1

ζlowastl (Xlowastk)

Let Snew = (snew1 middot middot middot snewd ) and Sprimenew = (s

primenew1 middot middot middot sprimenewd ) denote the updates of S

and S prime(3) Iterations and Stopping Rule Repeat (2) I times Stop when the change inSnew is small Record the last S and S prime as Sstop and S primestop respectively(4) Deletion step Using Sstop and S primestop delete Xl if sstopl minus sprimestopl lt

radic2SD(sprimestop)

If d is small we can generate several sets of S primestop and then use the sample standarddeviation of all the values in these sets to be SD(sprimestop) See Remark 33End of the algorithm

Remark 32 Step (d) needs to be replaced by (drsquo) below when a global permutation of Ydoes not result in appropriate ldquolocal null distributionrdquo For example when sample sizes

9

of different classes are extremely unequal which is known as unbalanced classificationproblemsit is very likely to have a pure class of Y in local region after the global per-mutation and the threshold S prime will be spuriously low And therefore irrelevant variableswill be falsely selected into the model This approach is more nonparametric but moretime consuming

(drsquo) Local Thresholding Let Y lowast denote a random variable which is distributed as Y

but independent of Xl given Xminusl Let (Y(1)0 middot middot middot Y (m)

0 ) be a random permutation of the

Y rsquos inside the tube then (Y(1)0 middot middot middot Y (m)

0 ) can be viewed as a realization of Y lowast If Xl

is a numerical variable let ζlprime(Xlowastk) be the estimated local contingency tube efficacy of

Y lowast on Xl at Xlowastk as defined in (35) If Xl is a categorical variable let ζlprime(Xlowastk) be the

estimated contingency tube efficacy of Y lowast on Xl at Xlowastk as defined in (37)

Remark 33 (The threshold) Under the assumption H(l)0 that Y is independent of Xl

given Xminusl in Tl sl and sprimel are nearly independent and SD(sl minus sprimel) asympradic

2SD(sprimel) Thusthe rule that deletes Xl when sstopl minus sprimestopl lt

radic2SE(sprimestop) is similar to the one standard

deviation rule of CART and GUIDE The choice of threshold is also discussed in Section7 Alternatively we calculate the p-values for each covariate by permutation tests Morespecifically we permute the response variable and obtain the important scores for thepermeated data and then calculate how extreme sprimestopl compared to permutated scoresWe then identify a covariate as important if its p-value is less than 01

Remark 34 (Large d small n) A key idea in this paper is that when determining theimportance of the variable Xj we restrict (condition) the vector Xminusj to a relatively smallset This is done to address the problem of the spurious correlation of great concern ingenomics (Price et al [13] Lin and Zeng [11]) In such genomic studies the number ofpredictors d typically greatly exceeds sample size n For numerical variables this can bedealt with by replace Xminusj with a vector of principal components explaining most of thevariation in Xminusj Although our paper does not deal with the d n case directly thisremark shows that it is also applicable to the genome wide association studies describedin Price et al [13]

Remark 35 We now provide computational complexity of the algorithm Recall thenotations I is the number of repetitions M is the number of tube centers m is thetube size d is the dimension and n is the sample size The number of operationsneeded for the algorithm is I times d times M times (2b+ 2c+ 2d) + 2e where 2b 2c 2d 2edenote the number of operations required for those steps The computational complexityrequired for 2b 2c 2d and 2e are O (nd) O (m) O (m)and O (M) Since nd m andthe required computation is dominated by step 2b the computational complexity of thealgorithm is O (IMnd2) Solving least squares problem requires O (nd2) operations Soour algorithm requires higher computational complexity by a factor of IM compared toleast squares However the computations for M tube centers are independent of theother computations and are performed in parallel

10

4 Simulation Studies

41 Variable Selection with Categorical and Numerical Predictors

Example 41 Consider d ge 5 predictors X1 X2 Xd and a categorical responsevariable Y The fifth predictor X5 is a Bernoulli random variable which takes twovalues 0 and 1 with equal probability All other predictors are uniformly distributed on[01]

When X5 = 0 the dependent variable Y is related to the predictors in the followingfashion

Y =

0 if X1 minusX2 minus 025 sin(16πX2) le minus051 if minus05 lt X1 minusX2 minus 025 sin(16πX2) le 02 if 0 lt X1 minusX2 minus 025 sin(16πX2) le 053 if X1 minusX2 minus 025 sin(16πX2) gt 05

(41)

When X5 = 1 Y depends on X3 and X4 in the same way as above with X1 replacedby X3 and X2 replaced by X4 The other variables X6 Xd are irrelevant noisevariables All d predictors are independent

In addition to the presence of noise variables X6 Xd this example is challengingin two aspects First there is an interaction effect between the continuous predictorsX1 X4 and the categorical predictor X5 Such interaction effect may arise in prac-tice For example suppose Y represents the effect resulting from multiple treatmentsX1 X4 and that X5 represents the gender of a patient Our model allows males andfemales to respond differently to treatments Secondly the classification boundaries forfixed X5 are highly nonlinear as shown in Figure 1 This design is to demonstrate theour method can detect nonlinear dependence between the response and the predictors

Table 2 shows the number of simulations where the predictors X1 middot middot middot X5 are cor-rectly kept in the model and in column 6 the average number of simulations wheredminus 5 irrelevant predictors are falsely selected to be in the model out of 100 simulationtrials from model (41) In the simulation n = 400 d = 10 50 and 100 are usedWe compare CATCH with other methods including Marginal CATCH Random Forestclassifier (RFC) (Breiman [1]) and GUIDE (Loh [12]) For Marginal CATCH we testthe association between Y and Xl without conditioning on Xminusl RFC is well-known asa powerful tool for prediction Here we use importance scores from RFC for variableselection We create additional 30 noise variables and use their average importancescores multiplied by a factor of 2 as a threshold (Doksum et al [3]) to decide whether avariable should be selected into the model In particular we generate the noise variablesusing uniform and Bernoulli distributions respectively for numerical and categoricalpredictors

The main goal of GUIDE is to solve the variable selection bias problem in regressionand classification tree methods As an illustration if a categorical predictor takes Kdistinct values then the splitting test exhausts 2K minus 1 possibilities and therefore is

11

Tab

le2

Sim

ula

tion

resu

lts

from

Exam

ple

41

wit

hsa

mp

lesi

zen

=400

For

1leile

5

theit

hC

olu

mn

giv

esth

ees

tim

ate

dp

rob

ab

ilit

yof

sele

ctio

nof

rele

vant

vari

ableX

ian

dth

est

and

ard

erro

rb

ase

don

100

sim

ula

tion

sC

olu

mn

6giv

esth

em

ean

of

the

esti

mate

dp

rob

ab

ilit

yof

sele

ctio

nan

dth

est

and

ard

erro

rfo

rth

eir

rele

vant

vari

ab

lesX

6middotmiddotmiddotX

d

Num

ber

ofSel

ecti

ons(

100

sim

ula

tion

s)

ME

TH

OD

dX

1X

2X

3X

4X

5Xiige

6

CA

TC

H10

082

(00

4)0

75(0

04)

078

(00

4)0

82(0

04)

1(0

)0

05(0

02)

500

76(0

04)

076

(00

4)0

87(0

03)

084

(00

4)0

96(0

02)

008

(00

3)

100

077

(00

4)0

73(0

04)

082

(00

4)0

8(0

04)

083

(00

4)0

09(0

03)

Mar

ginal

CA

TC

H10

077

(00

4)0

67(0

05)

07

(00

5)0

8(0

04)

006

(00

2)0

1(0

03)

500

69(0

05)

07

(00

5)0

76(0

04)

079

(00

4)0

07(0

03)

01

(00

3)

100

078

(00

4)0

76(0

04)

073

(00

4)0

73(0

04)

012

(00

3)0

1(0

03)

Ran

dom

For

est

100

71(0

05)

051

(00

5)0

47(0

05)

059

(00

5)0

37(0

05)

006

(00

2)

500

64(0

05)

061

(00

5)0

65(0

05)

058

(00

5)0

28(0

04)

008

(00

3)

100

066

(00

5)0

58(0

05)

059

(00

5)0

65(0

05)

017

(00

4)0

11(0

03)

GU

IDE

100

96(0

02)

098

(00

1)0

93(0

03)

099

(00

1)0

92(0

03)

033

(00

5)

500

82(0

04)

085

(00

4)0

9(0

03)

089

(00

3)0

67(0

05)

012

(00

3)

100

083

(00

4)0

84(0

04)

086

(00

3)0

83(0

04)

061

(00

5)0

08(0

03)

12

Figure 1 The relationship between Y and X1 X2 for model (41) The four areas represent the fourvalues of Y

biased toward being selected The way that GUIDE alleviate this bias is to employthe lack-of-fit tests ie χ2-test applicable to both numerical and categorical predictorswith the p-values adjusted by the bootstrap procedure Since GUIDE has negligible se-lection bias it can include tests for local pairwise interactions Furthermore it achievessensitivity to local curvature by using linear model at each node

The results in table 2 show that Marginal CATCH does a very poor job of select-ing X5 This illustrates the importance of local conditioning RFC does better thanMarginal CATCH but also has difficulties while CATCH and GUIDE can successfullyidentify X5 along with X1 to X4 with high frequency When d = 10 GUIDE does verywell for the relevant variables but selects too many irrelevant variables as d increasesSurprisingly GUIDE selects fewer irrelevant variables as d increases It becomes morecautious as the dimension d increases CATCH performs very well for the high dimen-sional cases (d = 50 and d = 100) This indicates that the CATCH algorithm is robustto the intricate interaction structures between the predictors

In model (41) above the predictors interact in a hierarchical way When X5 = 0the response is related to X1 and X2 as shown in Figure 1 where the two axes arethe two relevant predictors and the four areas partitioned by the curves correspondto different values of Y When X5 = 1 the response depends on X3 and X4 in thesame way This type of hierarchical interaction between the categorical and numericalpredictors are difficult to detect even by state of the art classification methods such assupport vector machine(SVM) and RFC In particular RFC produces an importance

13

score vector that identifies the four numerical variables X1 X4 as very significantbut often leaves the categorical variable X5 unidentified as we see in table 2

The CATCH algorithm however performs very well for this complicated model Wenow explain why CATCH is able to identify X5 as an important variable To explorethe association between X5 and Y we construct M contingency tables of 2 rows and4 columns based on M randomly centered tubes calculate contingency efficacy scoresand then average the scores Figure 2 (a) and (b) illustrate how one of the contingencyefficacy scores is calculated 1000 data points are generated from model (41) and theyare split into two sets according to x5 (a) shows approximately half of the points withx5 = 0 in (x1 x2) dimensions and (b) shows the rest points with x5 = 1 in (x3 x4)dimensions The data points are colored according to Y This representation enablesus to clearly see the classification boundaries (grey lines) The randomly selected tubecenter when x5 is zero is shown as a star in (a) The data points within the tube closeto the tube center in terms of Xminus5 are highlighted by larger dots Since the distancedefining the tube does not involve X5 the larger dots have x5 equal 0 or 1 and arepresent in both (a) and (b) The counts of the larger dots in different colors in (a) and(b) correspond to the cell counts of the two rows x5 = 0 and x5 = 1 in the contingencytable As we can see larger dots in (a) are mostly red and larger dots in (b) are mostlyblue and orange which shows that distribution in Y depends on the value of X5 andthe contingency efficacy score is high We expect the contingency efficacy score wouldbe low only if the tube center has similar values in (x1 x2) and (x3 x4) Such tubecenters would be very few among the M tube centers since they are randomly chosenThus the average contingency efficacy scores is high and X5 is identified as an importantvariables

To contrast this example with a simpler data generating model suppose the effectsof predictors are additive that is the categorical variable is related to the response asan added term to the effect of the numerical predictors In this type of simpler modelsthe RFC and GUIDE algorithm can efficiently identify relevant predictors but so canthe CATCH algorithm We do not include the simulation results for additive modelshere

Remark 41 We observe that the CATCH algorithm is not very sensitive to the choiceof the number of tube centers M or the tube size m = [na] where 0 lt a le 1 and [ ] isthe greatest integer function For example 41 we perform the simulation using M = nand M = n2 and a = 01 015 02 03 05 The results are summarized in Table 3The CATCH algorithm performs similarly for M = n and M = n2 with M = n havinga small winning edge We generally suggest to use M = n if computation resource isof a less concern or n is not large (le 200) When computational cost is a concern(Remark 35) one can reduce tube size to M = n2 for a compromise between detectionpower and the computational cost Table 3 also shows that the tube size works well inthe range between 01 and 03 while the detection power of X5 decreases for the casea = 05 where the tube size is too big to achieve local conditioning This is a consistent

14

observation with the comparison to Marginal CATCH in Table 2 On the other handthe tube size should not be too small to achieve sufficient detection power We suggestto use a = 01 or 015 for sample sizes n ge 400 or m = 50 for smaller sample sizesIn light of these guidelines we use M = n2 = 200 and a = 015 for Examples 41-43M = n = 200 and m = 50 for Example 44

Table 3 Sensitivity study of CATCH for Example 41 (n = 400 d = 50) using different number oftube centers M and tube size m = [na] For 1 le i le 5 the ith column gives the number of times Xi

was selected Column 6 gives the percentage of times irrelevant variables were selected

Number of Selections(100 simulations)M a X1 X2 X3 X4 X5 Xi i ge 6

M = 200 010 75 67 77 69 92 78015 76 76 87 84 96 82020 85 84 90 87 93 99030 89 87 95 89 85 96050 89 90 96 93 62 99

M = 400 010 74 76 82 77 92 72015 85 82 89 87 94 86020 87 88 90 86 96 91030 88 89 91 91 92 94050 89 91 94 93 61 102

In Example 41 all predictors are independent In a similar setting we next considerthe case where the predictors are dependent

Example 42 We generate standard normal random variables Zi for i = 1 middot middot middot d insuch a way that cor(Z1 Z4) = 05 cor(Z3 Z6) = 09and the other Zirsquos are independentWe set Xi = Φ(Zi) for i = 1 middot middot middot d where Φ(middot) is the CDF of a standard normal andtherefore Xi are uniformly distributed marginally as in Example 41 The correlationstructure between the Xirsquos are similar to that of the Zirsquos The dependent variable Y isgenerated in the same way as in Example 41

The results of Example 42 are given in Table 4 By comparing with results in Table2 both CATCH and GUIDE are doing well in the presence of dependence between thenumerical variables The explanation is two-fold First both methods could identify thecategorical X5 as an important predictor most of the time which make the identificationof other relevant predictors possible Second conditioning on X5 Y depends on a pairof variables (ldquoX1 and X2rdquo or ldquoX3 and X4rdquo) jointly and in particular this dependence isnonlinear which makes both methods less affected by the correlation introduced here

15

00 02 04 06 08 10

00

02

04

06

08

10

(a) X5 = 0

X1

X2

00 02 04 06 08 10

00

02

04

06

08

10

(b) X5 = 1

X3

X4

Figure 2 Sample scatter plots of (x1 x2) and (x3 x4) for simulated data using model (41) The leftpanel includes all the pairs (x1 x2) with x5 = 0 while the right panel includes all the pairs (x3 x4)with x5 = 1 The larger dots in the two plots are within the tube around a tube center shown as thestar

We now explain the latter in details for both methods In CATCH the algorithmadaptively weighs the effects of predictors by their importance scores in defining thetubes when they are being conditioned on therefore identifying either of the two willhelp to identify the other whereas an irrelevant variable though strongly associated withone of the relevant variables is down-weighted iteratively and does not strongly affectthe selection of the relevant variable This can explain the slight decreasing selectionprobability in X3 in presence of a strongly correlated X6 and consequently overall lessselection of X3 and X4 compared to X1 and X2 given the correlation between X1 andX4 In GUIDE the difference between the dependent and independent cases is evenless because GUIDE is very powerful in detecting local interaction by literally includingpairwise interactions when selecting splitting variables

Marginal CATCH and Random Forest do poorly in this case mainly because X5 cannot be detected as we explained earlier More specifically the dependence between Yand X4 in X5 = 1 case is the same as that between Y and X2 in X5 = 0 The linearterms of X1 and X2 have opposite signs in decision function therefore marginal effectsof X1 and X4 are obscured when X5 is not identified as an important variable

42 Improving The Performance of SVM and RF Classifiers

The criterion used in the CATCH algorithm is not based on classification accuracyBut if we use CATCH to screen out the irrelevant variables and only use the important

16

Tab

le4

Sim

ula

tion

resu

lts

from

Exam

ple

42

wit

hsa

mp

lesi

zen

=400

For

1leile

6

theit

hC

olu

mn

giv

esth

ees

tim

ate

dp

rob

ab

ilit

yof

sele

ctio

nof

vari

ableX

ian

dth

est

and

ard

erro

rb

ased

on

100

sim

ula

tion

sw

her

eX

ii

=1middotmiddotmiddot

5are

rele

vant

vari

ab

les

wit

hcor(X

1X

4)

=5

X

6is

anir

rele

vant

vari

able

that

isco

rrel

ated

wit

hX

3w

ith

corr

elati

on

9

Colu

mn

7giv

esth

em

ean

of

the

esti

mate

dpro

bab

ilit

yof

sele

ctio

nan

dth

atof

the

stan

dar

der

ror

for

the

irre

leva

nt

vari

ab

lesX

7middotmiddotmiddotX

d

Num

ber

ofSel

ecti

ons(

100

sim

ula

tion

s)

ME

TH

OD

dX

1X

2X

3X

4X

5X

6Xiige

7

CA

TC

H10

076

(00

4)0

77(0

04)

073

(00

4)0

68(0

05)

099

(00

1)0

34(0

05)

007

(00

2)

500

78(0

04)

077

(00

4)0

74(0

04)

063

(00

5)0

92(0

03)

057

(00

5)0

09(0

03)

100

076

(00

4)0

77(0

04)

08

(00

4)0

74(0

04)

082

(00

4)0

55(0

05)

009

(00

3)

Mar

ginal

CA

TC

H10

035

(00

5)0

63(0

05)

08

(00

4)0

32(0

05)

009

(00

3)0

71(0

05)

012

(00

3)

500

34(0

05)

071

(00

5)0

7(0

05)

03

(00

5)0

1(0

03)

06

(00

5)0

11(0

03)

100

034

(00

5)0

68(0

05)

075

(00

4)0

3(0

05)

01

(00

3)0

59(0

05)

01

(00

3)

Ran

dom

For

est

100

29(0

05)

048

(00

5)0

55(0

05)

02

(00

4)0

3(0

05)

047

(00

5)0

05(0

02)

500

26(0

04)

052

(00

5)0

57(0

05)

03

(00

5)0

24(0

04)

045

(00

5)0

08(0

03)

100

036

(00

5)0

61(0

05)

066

(00

5)0

37(0

05)

032

(00

5)0

46(0

05)

012

(00

3)

GU

IDE

100

96(0

02)

096

(00

2)0

94(0

02)

095

(00

2)0

96(0

02)

084

(00

4)0

3(0

05)

500

8(0

04)

092

(00

3)0

88(0

03)

079

(00

4)0

72(0

04)

068

(00

5)0

12(0

03)

100

081

(00

4)0

89(0

03)

09

(00

3)0

76(0

04)

075

(00

4)0

58(0

05)

008

(00

3)

17

variables for classification the performance of other algorithms such as Support VectorMachine (SVM) and Random Forest Classifiers (RFC) can be significantly improvedWe compare the prediction performance of (1) SVM (2) RFC (3) CATCH followedby SVM and (4) CATCH followed by RFC in this section We label (3) and (4)with CATCH+SVM and CATCH+RFC respectively In this section we use the ldquosvmrdquofunction in the ldquoe1071rdquo package in R with radial kernel to perform SVM analysisand use the ldquorandomForestrdquo function in the ldquorandomForestrdquo package to perform RFCanalysis We use simulation experiments to illustrate the improvement obtained byusing CATCH prior to classifying Y

Let the true model be

(X Y ) sim P (42)

where P will be specified later In each of 100 Monte Carlo experiments n pairs of(X(i) Y (i)) i = 1 middot middot middot n are generated from the true model (42) Based on the simu-lated data (X(i) Y (i)) i = 1 middot middot middot n the methods (1) SVM (2) RFC (3) CATCH+SVM(4) CATCH+RFC are used to classify a future Y (0) for which we have available a cor-responding X(0)

The integrated-misclassification rate (IMR Friedman [6]) defined below is used toevaluate the performance of the classification algorithms Let FXY(middot) be the classifierconstructed by a classification algorithm based on X = (X(1) middot middot middot X(n)) and Y =(Y (1) middot middot middot Y (n))T

Let (X(0) Y (0)) be a future observation with the same distribution as (X(i) Y (i))We only know X(0) = x(0) and want to classify Y (0) For a possible future observation(x(0) y(0)) define

MR(XYx(0) y(0)) = P (FXY(x(0)) 6= y(0)|XY) (43)

Our criterion is

IMR = EE0[MR(XYx(0) y(0))] (44)

where E0 is with respect to the distribution of (x(0) y(0)) random and E is with respectto the distribution of (XY)

This double expectation is evaluated using Monte Carlo simulation The result isdenoted as IMRlowast More precisely E0 is estimated using 5000 simulated (x(0) y(0)) obser-vations then IMRlowast is computed on the basis of 100 simulated (XiYi) i = 1 middot middot middot 100samples

Example 43 First consider model (41) in section 41 Figure 3 shows IMRlowast versusthe number of irrelevant variables for this example The left panel shows that theIMRlowastrsquos of the SVM approach increases dramatically as the number of irrelevant variablesincreases while the IMRlowastrsquos of CATCH+SVM are stable Thus CATCH stabilizes the

18

performance of SVM for this complex model The right panel shows that CATCH alsostabilizes the performance of RFC These results indicate that CATCH can improve theprediction accuracy of other classification algorithms

20 40 60 80 100

025

030

035

040

045

050

055

060

SVM

d

Err

or R

ate

SVMCATCH+SVM

20 40 60 80 100

025

030

035

040

045

050

055

060

Random Forest

d

RFCCATCH+RFC

Figure 3 IMRlowastrsquos of SVM RFC CATCH+SVM CATCH+RFC versus the number of predictors forthe simulation study in model (41) in example 41 The sample size is n = 400

Example 44 Contaminated Logistic Model Consider a binary response Y isin 0 1with

P (Y = 0|X = x) =

F (x) if r = 0G(x) if r = 1

(45)

where x = (x1 xd)

r sim Bernoulli(γ) (46)

F (x) = (1 + expminus(025 + 05x1 + x2))minus1 (47)

G(x) =

01 if 025 + 05x1 + x2 lt 009 if 025 + 05x1 + x2 ge 0

(48)

and x1 x2 xd are realizations of independent standard normal random variablesHence given X(i) = x(i) 1 le i le n the joint distribution of Y (1) Y (2) Y (n) is thecontaminated logistic model with contamination parameter γ

19

(1minus γ)nprodi=1

[F (x(i))]Y(i)

[1minus F (x(i))]1minusY(i)

+ γ

nprodi=1

[G(x(i))]Y(i)

[1minusG(x(i))]1minusY(i)

(49)

We perform simulation study for n = 200 d = 50 and γ = 0 02 04 06 08 and1 where γ = 0 corresponds to no contamination Figure 4 shows the IMRlowastrsquos versus thedifferent values of γ for this example CATCH improves the performance of the SVMand Random Forest classifiers for this contaminated logistic regression model

01

02

03

04

SVM

γ

Err

or R

ate

0 02 04 06 08 1

SVMCATCH+SVM

01

02

03

04

Random Forest

γ

0 02 04 06 08 1

RFCCATCH+RFC

Figure 4 IMRlowastrsquos of SVM RFC CATCH+SVM CATCH+RFC versus different contamination pa-rameters γ for simulation model (49) in example 44

5 CATCH Analysis of The ANN Data

In this section CATCH is applied to a data set available at the UCI data base(httpftpicsuciedupubmachine-learning-databasesthyroid-disease) The data is from a clinical trial to determine whether a patient is hypothyroid (seeeg Quinlan [15]) The dataset is split into two parts a training set with 3772 ob-servations and a testing set with 3428 observations The response is a categoricalvariable with three classes normal (not hypothyroid) hyperfunction and subnormalfunctioning There are 21 predictors including 15 categorical predictors and 6 numer-ical predictors The three classes are very imbalanced with 925 of normal patientsA good classifier should produce a significant decrease from an error rate 75 whichcan be obtained by classifying all observations to the normal class

20

CATCH identifies 7 variables as important including 4 continuous variables and3 categorical variables We consider three classification methods 1 Support VectorMachine with radial kernel denoted by SVM(r) 2 Support Vector Machine withpolynomial kernel denoted by SVM(p) 3 RFC We apply these three methods onthe training set to build classification models and use the test set to estimate themisclassification rate of the methods We apply the methods with CATCH and withoutCATCH respectively where ldquowith CATCHrdquo means that only the 7 important variablesidentified by CATCH are used to build the model and ldquowithout CATCHrdquo means thatall the 21 variables are used in the analysis

Table 5 Misclassification rates of three classification methods (SVM(r) SVM(p) RFC) with CATCHand without CATCH

SVM(r) SVM(p) RFC

wo CATCH 0052 0063 0024

with CATCH 0037 0055 0015

Table 5 shows that if CATCH is used to select important variables before the clas-sification methods are applied the misclassification rates decrease CATCH reducesthe misclassification rates by 29 13 and 37 for SVM(r) SVM(p) and RFC respec-tively CATCH + RFC has the best performance Although CATCH is not designed forimproving classification accuracy nonetheless it can help other classification methodsto perform better

6 Properties of Local Contingency Efficacy

In this section we investigate the properties of local contingency efficacy For a uni-variate continuous predictor Theorem 61 below shows that local contingency efficacyis well-defined and equals to 0 if the response variable is independent of the predictorConsider an R times C contingency table Let nrc r = 1 middot middot middot R c = 1 middot middot middot C be theobserved frequency in the (r c)-cell and let prc be the probability that an observationbelongs to the (r c)-cell Let

$$p_r = \sum_{c=1}^{C} p_{rc}, \qquad q_c = \sum_{r=1}^{R} p_{rc}, \qquad (6.1)$$

and

$$n_{r\cdot} = \sum_{c=1}^{C} n_{rc}, \qquad n_{\cdot c} = \sum_{r=1}^{R} n_{rc}, \qquad n = \sum_{r=1}^{R}\sum_{c=1}^{C} n_{rc}, \qquad E_{rc} = \frac{n_{r\cdot}\, n_{\cdot c}}{n}. \qquad (6.2)$$

Consider testing that the rows and the columns are independent, i.e.,

$$H_0: p_{rc} = p_r q_c \ \ \forall\, r, c \qquad \text{versus} \qquad H_1: p_{rc} \ne p_r q_c \ \text{ for some } r, c; \qquad (6.3)$$

a standard test statistic is

$$\mathcal{X}^2 = \sum_{r=1}^{R}\sum_{c=1}^{C} \frac{(n_{rc} - E_{rc})^2}{E_{rc}}. \qquad (6.4)$$

Under $H_0$, $\mathcal{X}^2$ converges in distribution to $\chi^2_{(R-1)(C-1)}$. Moreover, defining $0/0 = 0$, we have the following well-known result.

Lemma 6.1. As $n$ tends to infinity, $\mathcal{X}^2/n$ tends in probability to

$$\tau^2 \equiv \sum_{r=1}^{R}\sum_{c=1}^{C} \frac{(p_{rc} - p_r q_c)^2}{p_r q_c}.$$

Moreover, if $p_r q_c$ is bounded away from 0 and 1 as $n \to \infty$, and if $R$ and $C$ are fixed as $n \to \infty$, then $\tau^2 = 0$ if and only if $H_0$ is true.

Remark 6.1. Lemma 6.1 shows that the contingency efficacy $\zeta$ in (2.8) is well defined. Moreover, $\zeta = 0$ if and only if $X$ and $Y$ are independent.
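To make the $\chi^2$-based efficacy concrete, the small R sketch below (ours, not the paper's code) computes the statistic (6.4) for a table of counts and the plug-in quantity $\sqrt{\mathcal{X}^2/n}$ from Lemma 6.1. The last function builds, for a numerical predictor, the local $2 \times C$ table that splits the $h$-neighborhood of a point $x_0$ at $x_0$, in the spirit of the local contingency efficacy.

    # Chi-square statistic (6.4) for a matrix or table of counts.
    chisq_stat <- function(tab) {
      n <- sum(tab)
      E <- outer(rowSums(tab), colSums(tab)) / n   # E_rc = n_r. * n_.c / n
      sum((tab - E)^2 / E)
    }

    # Plug-in efficacy sqrt(X^2 / n); by Lemma 6.1 its square estimates tau^2,
    # which is 0 exactly when rows and columns are independent.
    efficacy <- function(tab) sqrt(chisq_stat(tab) / sum(tab))

    # Local version for a numerical predictor x and categorical response y:
    # the two rows are {x <= x0} and {x > x0} within the window (x0 - h, x0 + h).
    local_efficacy <- function(x, y, x0, h) {
      inside <- abs(x - x0) < h
      tab <- table(x[inside] > x0, droplevels(factor(y[inside])))
      if (nrow(tab) < 2 || ncol(tab) < 2) return(0)   # no split, or a pure class
      efficacy(tab)
    }

    set.seed(2)
    x <- runif(500)
    y <- factor(rbinom(500, 1, plogis(10 * x - 5)))   # Y depends on x
    local_efficacy(x, y, x0 = 0.5, h = 0.1)           # typically well above ...
    local_efficacy(x, sample(y), x0 = 0.5, h = 0.1)   # ... the permuted, no-association value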

Theorem 6.1. Consider a categorical response $Y$ and a continuous predictor $X$ with density function $f(x)$ such that, for a fixed $h > 0$, $f(x) > 0$ for $x \in (x^{(0)} - h, x^{(0)} + h)$. Then:

(1) For fixed $h$, the estimated local section efficacy $\hat\zeta(x^{(0)}, h)$ defined in (2.5) converges almost surely to the section efficacy $\zeta(x^{(0)}, h)$ defined in (2.4).

(2) If $X$ and $Y$ are independent, then the local efficacy $\zeta(x^{(0)})$ defined in (2.4) is zero.

(3) If there exists $c \in \{1, \cdots, C\}$ such that $p_c(x) = P(Y = c \mid x)$ has a continuous derivative at $x = x^{(0)}$ with $p'_c(x^{(0)}) \ne 0$, then $\zeta(x^{(0)}) > 0$.

The $\chi^2$ statistic (6.4) can be written in terms of multinomial proportions $\hat p_{rc}$, with $\sqrt{n}(\hat p_{rc} - p_{rc})$, $1 \le r \le R$, $1 \le c \le C$, converging in distribution to a multivariate normal. By a Taylor expansion of $T_n \equiv \sqrt{\mathcal{X}^2/n}$ at $\tau = \operatorname{plim}\sqrt{\mathcal{X}^2/n}$, we find that

Theorem 6.2.

$$\sqrt{n}\,(T_n - \tau) \stackrel{d}{\longrightarrow} N(0, v^2), \qquad (6.5)$$

where the asymptotic variance $v^2$, which can be obtained from the Taylor expansion, is not needed in this paper.
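For completeness, here is a brief sketch of the Taylor (delta-method) step behind Theorem 6.2, under the assumption that $\tau > 0$; the map $g$ and the covariance $\Sigma$ are our notation and are not used elsewhere in the paper. Write $T_n = g(\hat p)$ with

$$g(p) = \Bigl(\sum_{r=1}^{R}\sum_{c=1}^{C} \frac{(p_{rc} - p_r q_c)^2}{p_r q_c}\Bigr)^{1/2}, \qquad \tau = g(p).$$

Since $\sqrt{n}(\hat p - p) \stackrel{d}{\longrightarrow} N(0, \Sigma)$ for the multinomial proportions, a first-order expansion of $g$ at $p$ (valid when $\tau = g(p) > 0$, so that $g$ is differentiable there) gives

$$\sqrt{n}\,(T_n - \tau) = \nabla g(p)^{\top}\,\sqrt{n}(\hat p - p) + o_P(1) \stackrel{d}{\longrightarrow} N(0, v^2), \qquad v^2 = \nabla g(p)^{\top}\,\Sigma\,\nabla g(p),$$

which is (6.5).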

Remark 6.2. The result (6.5) applies to the multivariate $\chi^2$-based efficacies $\hat\zeta$ with $n$ replaced by the number of observations that go into the computation of $\hat\zeta$. In this case we use the notation $\chi^2_j$ for the chi-square statistic for the $j$th variable.

7 Consistency Of CATCH

Let $Y$ denote a variable to be classified into one of the categories $1, \ldots, C$ by using the $X_j$'s in the vector $X = (X_1, \ldots, X_d)$. Let $X_{-j} \equiv X - X_j$ denote $X$ without the $X_j$ component. Assume that the squared Pearson nonparametric correlation (Doksum and Samarov [4], Huang and Chen [9]) between $X_j$ and $X_{-j}$ is less than 0.5. We call the variable $X_j$ relevant for $Y$ if, conditionally given $X_{-j}$, $\mathcal{L}(Y \mid X_j = x)$ is not constant in $x$, and irrelevant otherwise. Our data are $(X^{(1)}, Y^{(1)}), \ldots, (X^{(n)}, Y^{(n)})$, i.i.d. as $(X, Y)$. A variable selection procedure is consistent if the probability that all the relevant variables are selected and none of the irrelevant variables are selected tends to one as $n$ tends to infinity.

We give a general nonparametric framework where the variable selection rule based on the CATCH importance scores $s_j$ given in (3.8) is consistent.

When the predictors are not all categorical and CATCH involves tubes, we assume that the number of observations in each tube tends to $\infty$ as $n \to \infty$, and we assume that, for those importance scores that require the selection of a section size $h$, the selection is from a finite set $\{h_j\}$ with $\min_j\{h_j\} > h_0$ for some $h_0 > 0$. It follows that the number of observations $m_{jk}$ in the $k$th tube used to calculate the importance score $s_j$ for the variable $X_j$ tends to infinity as $n \to \infty$.

The key quantities that determine consistency are the limits $\lambda_j$ of $s_j$. That is,

$$\lambda_j = \tau(\zeta_j, P) = E_P[\zeta_j(X)] = \int \zeta_j(x)\, dP(x), \qquad j = 1, \ldots, d,$$

where $\zeta_j(x)$ is the local contingency efficacy defined by (2.4) for numerical $X$'s and by (2.8) for categorical $X$'s, and $P$ is the probability distribution of $X$. Our estimate $s_j$ of $\tau_j$ is

$$s_j = \tau(\hat\zeta_j, \hat P) = \frac{1}{M}\sum_{k=1}^{M} \hat\zeta_j(X^*_k),$$

where $\hat\zeta_j$ is defined in Section 3 and $\hat P$ is the empirical distribution of $X$.

Let $d_0$ denote the number of $X$'s that are irrelevant for $Y$. Set $d_1 = d - d_0$ and, without loss of generality, reorder the $\lambda_j$ so that

$$\lambda_j > 0 \ \text{ if } \ j = 1, \ldots, d_1; \qquad \lambda_j = 0 \ \text{ if } \ j = d_1 + 1, \ldots, d. \qquad (7.1)$$

Our variable selection rule is

"keep variable $X_j$ if $s_j \ge t$ for some threshold $t$."  (7.2)
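To connect the score $s_j$ and rule (7.2) back to the algorithm, the sketch below continues the R example given after Remark 6.1 (again ours, not the paper's code): it approximates the importance score of a single numerical predictor by averaging local efficacies over $M$ bootstrap tube centers, as in (3.8), and then compares the score with permutation scores in the spirit of the thresholding and deletion steps of Section 3.5.

    # Importance score in the spirit of (3.8) for one numerical predictor:
    # average the local efficacy over M bootstrap tube centers (no adaptive
    # tube distance is needed, since there is only one predictor here).
    importance_score <- function(x, y, M = length(x), h = 0.1) {
      centers <- sample(x, M, replace = TRUE)
      mean(vapply(centers, function(x0) local_efficacy(x, y, x0, h), numeric(1)))
    }

    s_obs  <- importance_score(x, y)                          # x, y from the earlier sketch
    s_null <- replicate(20, importance_score(x, sample(y)))   # permutation (no-association) scores
    # Keep the predictor when its score clears the permutation-based threshold,
    # in the spirit of the deletion step of the CATCH algorithm (Section 3.5).
    s_obs - mean(s_null) >= sqrt(2) * sd(s_null)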

Rule (7.2) is consistent if and only if

$$P\Bigl(\min_{j \le d_1} s_j \ge t \ \text{ and } \ \max_{j > d_1} s_j < t\Bigr) \longrightarrow 1. \qquad (7.3)$$

To establish (7.3), it is enough to show that

$$P\Bigl(\max_{j > d_1} s_j > t\Bigr) \longrightarrow 0 \qquad (7.4)$$

and

$$P\Bigl(\min_{j \le d_1} s_j \le t\Bigr) \longrightarrow 0. \qquad (7.5)$$

We call (7.4) and (7.5) type I and type II consistency, respectively. Type I and II error probabilities refer to the probabilities of any false positives and any false negatives, respectively.

7.1 Type I consistency

We consider (7.4) first and note that

$$P\Bigl(\max_{j > d_1} s_j > t\Bigr) \le \sum_{j = d_1 + 1}^{d} P(s_j > t). \qquad (7.6)$$

For the $j$th irrelevant variable $X_j$, $j > d_1$, the results in Section 6 apply to the local efficacy $\hat\zeta_j(x)$ at a fixed point $x$. The importance score $s_j$ defined in (3.8) is the average of such efficacies over a sample of $M$ random points $x^*_a$. We assume that there is a point $x^{(0)}$ such that, for irrelevant $X_j$'s and for $n$ large enough, $\hat\zeta_j(x^{(0)})$ is stochastically larger in the right tail than the average $M^{-1}\sum_{a=1}^{M}\hat\zeta_j(x^*_a)$. That is, writing $s^{(0)}_j \equiv \hat\zeta_j(x^{(0)})$, we assume that there exist $t > 0$ and $n_0$ such that, for $n \ge n_0$,

$$P(s_j > t) \le P(s^{(0)}_j > t), \qquad j > d_1. \qquad (7.7)$$

The existence of such a point $x^{(0)}$ is established by considering the probability limit of

$$\arg\max\{\hat\zeta_j(x^*_a) : x^*_a \in \{x^*_1, \cdots, x^*_M\}\}.$$

Note by Lemma 6.1 that $s^{(0)}_j \to_p 0$ as $n \to \infty$. It follows that we have established (7.4) for $t$ and $d_0 = d - d_1$ that are fixed as $n \to \infty$. To examine the interesting case where $d_0 \to \infty$ as $n \to \infty$, we need large deviation results, which we turn to next. In the $d_0 \to \infty$ case we consider $t \equiv t_n$ that tends to $\infty$ as $n \to \infty$, and we use the weaker assumption: there exist $x^{(0)}$ and $n_0$ such that

$$P(s_j > t_n) \le P(s^{(0)}_j > t_n) \qquad (7.8)$$

for $n \ge n_0$ and each $t_n$ with $\lim_{n\to\infty} t_n = \infty$. We also assume that the number of columns $C$ and rows $R$ in the contingency table that define the efficacy $\chi^2_{jn}$ given by (6.4) stays fixed as $n \to \infty$. In this case $P(s^{(0)}_j > t_n) \to 0$ provided each term

$$[T^{(j)}_{rc}]^2 \equiv \left[\frac{n_{rc} - E_{rc}}{\sqrt{n}\,\sqrt{E_{rc}}}\right]^2$$

in $s_j^2 = \chi^2_{jn}/n$, defined for $X_j$ by (6.4), satisfies

$$P\bigl(|T^{(j)}_{rc}| > t_n\bigr) \longrightarrow 0.$$

This is because

$$P\Bigl(\sqrt{\textstyle\sum_r \sum_c [T^{(j)}_{rc}]^2} > t_n\Bigr) \le P\Bigl(RC \max_{r,c} [T^{(j)}_{rc}]^2 > t_n^2\Bigr) \le RC \max_{r,c} P\bigl(|T^{(j)}_{rc}| > t_n/\sqrt{RC}\bigr),$$

and $t_n$ and $t_n/\sqrt{RC}$ are of the same order as $n \to \infty$. By large deviation theory (Hall [8] and Jing et al. [10]), for some constant $A$,

$$P\bigl(|T^{(j)}_{rc}| > t_n\bigr) = \frac{2 t_n^{-1}}{\sqrt{2\pi}}\, e^{-t_n^2/2}\Bigl[1 + A t_n^3 n^{-1/2} + o\bigl(t_n^3 n^{-1/2}\bigr)\Bigr], \qquad (7.9)$$

provided $t_n = o(n^{1/6})$. If (7.9) is uniform in $j > d_1$, then (7.6) and (7.9) imply that type I consistency holds if

$$d_0\, \bigl(t_n/\sqrt{2}\bigr)^{-1} e^{-t_n^2/2} \to 0. \qquad (7.10)$$

To examine (7.10), write $d_0$ and $t_n$ as $d_0 = \exp(n^b)$ and $t_n = \sqrt{2}\, n^r$. By large deviation theory, (7.10) holds when $r < \tfrac{1}{6}$. Thus if $0 < \tfrac{1}{2} b < r < \tfrac{1}{6}$, then

$$d_0\, \bigl(t_n/\sqrt{2}\bigr)^{-1} e^{-t_n^2/2} = n^{-r} \exp\bigl(n^b - n^{2r}\bigr) \to 0.$$

Thus $d_0 = \exp(n^b)$ with $b = \tfrac{1}{3} - \varepsilon$, for any $\varepsilon > 0$, is the largest $d_0$ can be to have consistency.

For instance, if there are exactly $d_0 = \exp\bigl(n^{1/4}\bigr)$ irrelevant variables and we use the threshold $t_n = \sqrt{2}\, n^{1/7}$, the probability of selecting any of the irrelevant variables tends to zero at the rate

$$n^{-1/7} \exp\bigl(n^{1/4} - n^{2/7}\bigr).$$

7.2 Type II consistency

We now assume that for each relevant variable $X_j$ there is a least favorable point $x^{(1)}$ such that the local efficacy $s^{(1)}_j \equiv \hat\zeta_j(x^{(1)})$ is stochastically smaller than the importance score $s_j = M^{-1}\sum_{a=1}^{M}\hat\zeta_j(x^*_a)$. That is, we assume that for each relevant variable $X_j$ there exist $n_1$ and $x^{(1)}$ such that, for $n \ge n_1$,

$$P\bigl(s^{(1)}_j \le t_n\bigr) \ge P(s_j \le t_n), \qquad j \le d_1. \qquad (7.11)$$

The existence of such a least favorable $x^{(1)}$ is established by considering the probability limit of

$$\arg\min\{\hat\zeta_j(x^*_a) : x^*_a \in \{x^*_1, \cdots, x^*_M\}\}.$$

We again assume that $R$ and $C$ are fixed, so that to show type II consistency it is enough to show that, for each $(r, c)$ and $j \le d_1$,

$$P\bigl(|T^{(j)}_{rc}| \le t_n\bigr) \to 0.$$

This is because

$$P\Bigl(\sqrt{\textstyle\sum_r \sum_c [T^{(j)}_{rc}]^2} \le t_n\Bigr) \le P\Bigl(\min_{r,c} [T^{(j)}_{rc}]^2 \le t_n^2\Bigr) \le RC \max_{r,c} P\bigl(|T^{(j)}_{rc}| \le t_n\bigr).$$

By Theorem 6.2, $\sqrt{n}\bigl(s^{(1)}_j - \tau_j\bigr) \to_d N(0, \nu^2)$, $\tau_j > 0$. It follows that if $t_n$ and $d_1$ are bounded above as $n \to \infty$ and $\tau_j > 0$, then $P\bigl(s^{(1)}_j \le t_n\bigr) \to 0$. Thus type II consistency holds in this case. To examine the more interesting case, where $t_n$ and $d_1$ are not bounded above and $\tau_j$ may tend to zero as $n \to \infty$, we will need to center the $T^{(j)}_{rc}$. Thus set

$$\theta^{(j)}_n = \frac{\sqrt{n}\bigl(p^{(j)}_{rc} - p^{(j)}_r p^{(j)}_c\bigr)}{\sqrt{p^{(j)}_r p^{(j)}_c}}, \qquad (7.12)$$

$$T^{(j)}_n = T^{(j)}_{rc} - \theta^{(j)}_n,$$

where, for simplicity, the dependence of $T^{(j)}_n$ and $\theta^{(j)}_n$ on $r$ and $c$ is not indicated.

Lemma 7.1.

$$P\Bigl(\min_{j\le d_1} |T^{(j)}_{rc}| \le t_n\Bigr) \le \sum_{j=1}^{d_1} P\Bigl(|T^{(j)}_n| \ge \min_{j\le d_1} |\theta^{(j)}_n| - t_n\Bigr).$$

Proof.

$$P\Bigl(\min_{j\le d_1} |T^{(j)}_{rc}| \le t_n\Bigr) = P\Bigl(\min_{j\le d_1} |T^{(j)}_n + \theta^{(j)}_n| \le t_n\Bigr) \le P\Bigl(\min_{j\le d_1} |\theta^{(j)}_n| - \max_{j\le d_1} |T^{(j)}_n| \le t_n\Bigr)$$
$$= P\Bigl(\max_{j\le d_1} |T^{(j)}_n| \ge \min_{j\le d_1} |\theta^{(j)}_n| - t_n\Bigr) \le \sum_{j=1}^{d_1} P\Bigl(|T^{(j)}_n| \ge \min_{j\le d_1} |\theta^{(j)}_n| - t_n\Bigr).$$

Set

$$a_n = \min_{j \le d_1} \bigl|\theta^{(j)}_n\bigr| - t_n;$$

then by large deviation theory, if $a_n = o(n^{1/6})$,

$$P\bigl(|T^{(j)}_n| \ge a_n\bigr) \propto \bigl(a_n/\sqrt{2}\bigr)^{-1} \exp\{-a_n^2/2\}, \qquad (7.13)$$

where $\propto$ signifies asymptotic order as $n \to \infty$, $a_n \to \infty$. It follows that if (7.13) holds uniformly in $j$, then by Lemma 7.1,

$$P\Bigl(\min_{j\le d_1} |T^{(j)}_{rc}| \le t_n\Bigr) \propto d_1 \bigl(a_n/\sqrt{2}\bigr)^{-1}\exp\{-a_n^2/2\}.$$

Thus type II consistency holds when $a_n = o(n^{1/6})$ and

$$d_1 \bigl(a_n/\sqrt{2}\bigr)^{-1} \exp\{-a_n^2/2\} \to 0. \qquad (7.14)$$

In Section 7.1, $t_n$ is of order $n^r$, $0 < r < 1/6$. For this $t_n$, type II consistency holds if $a_n$ is of order $n^r$, $0 < r < 1/6$, and

$$d_1 n^{-r} \exp\{-n^{2r}/2\} \to 0.$$

If $a_n = O(n^k)$ with $k \ge 1/6$, then $P\bigl(|T^{(j)}_n| \ge a_n\bigr)$ tends to zero at a rate faster than in (7.13) and consistency is ensured. For instance, if all relevant variables $X_j$ have $p^{(j)}_{rc}$ fixed as $n \to \infty$, then $\min_{j\le d_1}|\theta^{(j)}_n|$ is of order $\sqrt{n}$ and $a_n$ is of order $n^{1/2} - t_n$.

7.3 Type I and II consistency

We see from Sections 7.1 and 7.2 that consistency holds under a variety of conditions. One set of conditions is the following.

Theorem 7.1. Under the regularity conditions of Sections 7.1 and 7.2, if

$$d_0 = c_0 \exp(n^b), \qquad t_n = c_1 n^r, \qquad \min_{j\le d_1} \bigl|\theta^{(j)}_n\bigr| - t_n = c_2 n^r, \qquad d_1 = c_3 \exp(n^b)$$

for positive constants $c_0, c_1, c_2$, and $c_3$ and $0 < \tfrac{1}{2} b < r < \tfrac{1}{6}$, then CATCH is consistent.

Appendix

Proof of Theorem 6.1.

(1) Let $N_h^* = \sum_{i=1}^{n} I\bigl(X^{(i)} \in N_h(x^{(0)})\bigr)$. Then $N_h^* \sim \mathrm{Bi}\bigl(n, \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx\bigr)$ and $N_h^* \to_{a.s.} \infty$ as $n \to \infty$. We have

$$\hat\zeta(x^{(0)}, h) = n^{-1/2}\sqrt{\mathcal{X}^2(x^{(0)}, h)} = (N_h^*/n)^{1/2}\,(N_h^*)^{-1/2}\sqrt{\mathcal{X}^2(x^{(0)}, h)}.$$

Since $N_h^* \sim \mathrm{Bi}\bigl(n, \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx\bigr)$, we have

$$N_h^*/n \to \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx. \qquad (7.15)$$

By Lemma 6.1 and Table 1,

$$(N_h^*)^{-1/2}\sqrt{\mathcal{X}^2(x^{(0)}, h)} \to \Bigl(\sum_{r=+,-}\sum_{c=1}^{C} \frac{(p_{rc} - p_r q_c)^2}{p_r q_c}\Bigr)^{1/2}, \qquad (7.16)$$

where

$$p_{+c} = \Pr\bigl(X > x^{(0)}, Y = c \mid X \in N_h(x^{(0)})\bigr), \qquad p_{-c} = \Pr\bigl(X \le x^{(0)}, Y = c \mid X \in N_h(x^{(0)})\bigr),$$

and, for $r = +, -$ and $c = 1, \cdots, C$,

$$p_r = \sum_{c=1}^{C} p_{rc}, \qquad q_c = \sum_{r=+,-} p_{rc}.$$

In fact, $p_+ = \Pr\bigl(X > x^{(0)} \mid X \in N_h(x^{(0)})\bigr)$, $p_- = \Pr\bigl(X \le x^{(0)} \mid X \in N_h(x^{(0)})\bigr)$, and $q_c = \Pr\bigl(Y = c \mid X \in N_h(x^{(0)})\bigr)$.

By (7.15) and (7.16),

$$\hat\zeta(x^{(0)}, h) \to \Bigl(\int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx\Bigr)^{1/2} \Bigl(\sum_{r=+,-}\sum_{c=1}^{C} \frac{(p_{rc} - p_r q_c)^2}{p_r q_c}\Bigr)^{1/2} \equiv \zeta(x^{(0)}, h).$$

(2) Since $X$ and $Y$ are independent, $p_{rc} = p_r q_c$ for $r = +, -$ and $c = 1, \cdots, C$, so $\zeta(x^{(0)}, h) = 0$ for any $h$. Hence

$$\zeta(x^{(0)}) = \sup_{h>0} \zeta(x^{(0)}, h) = 0.$$

(3) Recall that $p_c(x^{(0)}) = \Pr(Y = c \mid x^{(0)})$. Without loss of generality assume $p'_c(x^{(0)}) > 0$. Then there exists $h$ such that

$$\Pr\bigl(Y = c \mid x^{(0)} < X < x^{(0)} + h\bigr) > \Pr\bigl(Y = c \mid x^{(0)} - h < X < x^{(0)}\bigr),$$

which is equivalent to $p_{+c}/p_+ > p_{-c}/p_-$. This shows that $\zeta(x^{(0)}) > 0$, since by Lemma 6.1, $\zeta(x^{(0)}) = 0$ results in $p_{+c}/p_+ = p_{-c}/p_- = p_c$.

References

[1] Breiman, L., 2001. Random forests. Machine Learning 45, 5-32.

[2] Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J., 1984. Classification and Regression Trees. Wadsworth, Belmont.

[3] Doksum, K., Tang, S., Tsui, K.-W., 2008. Nonparametric variable selection: the EARTH algorithm. Journal of the American Statistical Association 103(484), 1609-1620.

[4] Doksum, K.A., Samarov, A., 1995. Nonparametric estimation of global functionals and a measure of the explanatory power of covariates in regression. Annals of Statistics 23, 1443-1473.

[5] Doksum, K.A., Schafer, C., 2006. Powerful choices: tuning parameter selection based on power. In: Frontiers in Statistics: Dedicated to Peter Bickel (J. Fan and H.L. Koul, eds.), Imperial College Press, 113-141.

[6] Friedman, J., 1991. Multivariate adaptive regression splines. The Annals of Statistics, 1-67.

[7] Gao, J., Gijbels, I., 2008. Bandwidth selection in nonparametric kernel testing. Journal of the American Statistical Association 103(484), 1584-1594.

[8] Hall, P., 1992. The Bootstrap and Edgeworth Expansion. Springer, New York.

[9] Huang, L.-S., Chen, J., 2008. Analysis of variance, coefficient of determination and F-test for local polynomial regression. Annals of Statistics 36, 2085-2109.

[10] Jing, B.-Y., Shao, Q.-M., Wang, Q., 2003. Self-normalized Cramér-type large deviations for independent random variables. Annals of Probability 31, 2167-2215.

[11] Lin, D.Y., Zeng, D., 2011. Correcting for population stratification in genomewide association studies. Journal of the American Statistical Association 106, 997-1008.

[12] Loh, W.-Y., 2009. Improving the precision of classification trees. The Annals of Applied Statistics 3, 1710-1737.

[13] Price, A., Patterson, N., Plenge, R., Weinblatt, M., Shadick, N., Reich, D., 2006. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38(8), 904-909.

[14] Qiao, Z., Zhou, L., Huang, J., 2008. Linear discriminant analysis for high dimensional, low sample size data. Special issue of World Congress on Engineering 2008.

[15] Quinlan, J.R., 1987. Simplifying decision trees. International Journal of Man-Machine Studies 27(3), 221-234.

[16] Schafer, C., Doksum, K.A., 2009. Selecting local models in multiple regression by maximizing power. Metrika 69, 283-304.

[17] Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267-288.

[18] Wang, L., Shen, X., 2007. On L1-norm multi-class support vector machines: methodology and theory. Journal of the American Statistical Association 102, 583-594.

[19] Zhang, H.H., Liu, Y., Wu, Y., Zhu, J., 2008. Variable selection for the multicategory SVM via adaptive sup-norm regularization. Electronic Journal of Statistics 2, 149-167.




and x1 x2 xd are realizations of independent standard normal random variablesHence given X(i) = x(i) 1 le i le n the joint distribution of Y (1) Y (2) Y (n) is thecontaminated logistic model with contamination parameter γ

19

(1minus γ)nprodi=1

[F (x(i))]Y(i)

[1minus F (x(i))]1minusY(i)

+ γ

nprodi=1

[G(x(i))]Y(i)

[1minusG(x(i))]1minusY(i)

(49)

We perform simulation study for n = 200 d = 50 and γ = 0 02 04 06 08 and1 where γ = 0 corresponds to no contamination Figure 4 shows the IMRlowastrsquos versus thedifferent values of γ for this example CATCH improves the performance of the SVMand Random Forest classifiers for this contaminated logistic regression model

01

02

03

04

SVM

γ

Err

or R

ate

0 02 04 06 08 1

SVMCATCH+SVM

01

02

03

04

Random Forest

γ

0 02 04 06 08 1

RFCCATCH+RFC

Figure 4 IMRlowastrsquos of SVM RFC CATCH+SVM CATCH+RFC versus different contamination pa-rameters γ for simulation model (49) in example 44

5 CATCH Analysis of The ANN Data

In this section CATCH is applied to a data set available at the UCI data base(httpftpicsuciedupubmachine-learning-databasesthyroid-disease) The data is from a clinical trial to determine whether a patient is hypothyroid (seeeg Quinlan [15]) The dataset is split into two parts a training set with 3772 ob-servations and a testing set with 3428 observations The response is a categoricalvariable with three classes normal (not hypothyroid) hyperfunction and subnormalfunctioning There are 21 predictors including 15 categorical predictors and 6 numer-ical predictors The three classes are very imbalanced with 925 of normal patientsA good classifier should produce a significant decrease from an error rate 75 whichcan be obtained by classifying all observations to the normal class

20

CATCH identifies 7 variables as important including 4 continuous variables and3 categorical variables We consider three classification methods 1 Support VectorMachine with radial kernel denoted by SVM(r) 2 Support Vector Machine withpolynomial kernel denoted by SVM(p) 3 RFC We apply these three methods onthe training set to build classification models and use the test set to estimate themisclassification rate of the methods We apply the methods with CATCH and withoutCATCH respectively where ldquowith CATCHrdquo means that only the 7 important variablesidentified by CATCH are used to build the model and ldquowithout CATCHrdquo means thatall the 21 variables are used in the analysis

Table 5 Misclassification rates of three classification methods (SVM(r) SVM(p) RFC) with CATCHand without CATCH

SVM(r) SVM(p) RFC

wo CATCH 0052 0063 0024

with CATCH 0037 0055 0015

Table 5 shows that if CATCH is used to select important variables before the clas-sification methods are applied the misclassification rates decrease CATCH reducesthe misclassification rates by 29 13 and 37 for SVM(r) SVM(p) and RFC respec-tively CATCH + RFC has the best performance Although CATCH is not designed forimproving classification accuracy nonetheless it can help other classification methodsto perform better

6 Properties of Local Contingency Efficacy

In this section we investigate the properties of local contingency efficacy For a uni-variate continuous predictor Theorem 61 below shows that local contingency efficacyis well-defined and equals to 0 if the response variable is independent of the predictorConsider an R times C contingency table Let nrc r = 1 middot middot middot R c = 1 middot middot middot C be theobserved frequency in the (r c)-cell and let prc be the probability that an observationbelongs to the (r c)-cell Let

pr =Csumc=1

prc qc =Rsumr=1

prc (61)

21

and

nrmiddot =Csumc=1

nrc nmiddotc =Rsumi=1

nrc n =Rsumi=1

Csumc=1

nrc Erc = nrmiddotnmiddotcn (62)

Consider testing that the rows and the columns are independent ie

H0 forallr c prc = prqc versus H1 existr c prc 6= prqc (63)

a standard test statistic is

X 2 =Rsumr=1

Csumc=1

(nrc minus Erc)2

Erc (64)

Under H0 X 2 converges in distribution to χ2(Rminus1)(Cminus1) Moreover define 00 = 0 we

have the well known result

Lemma 61 As n tends to infinity X 2n tends in probability to

τ 2 equivRsumr=1

Csumc=1

(prc minus prqc)2

prqc

Moreover if prqc is bounded away from 0 and 1 as n rarr infin and if R and C are fixedas nrarrinfin then τ 2 = 0 if and only if H0 is true

Remark 61 Lemma 61 shows that the contingency efficacy ζ in (28) is well definedMoreover ζ = 0 if and only if X and Y are independent

Theorem 61 Consider a categorical response Y and a continuous predictor X withdensity function f(x) such that for a fixed h gt 0 f(x) gt 0 for x isin (x(0) minus h x(0) + h)then

(1) For fixed h the estimated local section efficacy ζ(x(0) h) defined in (25) convergesalmost surly to the section efficacy ζ(x(0) h) defined in (24)

(2) If X and Y are independent then the local efficacy ζ(x(0)) defined in (24) is zero

(3) If there exists c isin 1 middot middot middot C such that pc(x) = P (Y = c|x) has a continuousderivative at x = x(0) with pprimec(x

(0)) 6= 0 then ζ(x(0)) gt 0

The χ2 statistic (64) can be written in terms of multinomial proportions prc withradicn(prc minus prc) 1 le r le R 1 le c le C converging in distribution to a multivariate

normal By Taylor expansion of Tn equivradicχ2n at τ = plim

radicχ2n we find that

22

Theorem 62 radicn(Tn minus τ) minusrarrd N(0 v2) (65)

where the asymptotic variance v2 which can be obtained from the Taylor expansion isnot needed in this paper

Remark 62 The result (65) applies to the multivariate χ2-based efficacies ζ with nreplaced by the number of observations that goes in to the computations of ζ In thiscase we use the notation χ2

j for the chi-square statistic for the jth variable

7 Consistency Of CATCH

Let Y denote a variable to be classified into one of the categories 1 C by usingthe Xjrsquos in the vector X = (X1 Xd) Let Xminusj equiv X minusXj denote X without the Xj

component Assume that the squared Pearson nonparametric correlation (Doksum andSamarov [4] Huang and Chen [9]) between Xj and Xminusj is less than 05 We call thevariable Xj relevant for Y if conditionally given Xminusj L(Y |Xj = x) is not constant in xand irrelevant otherwise Our data are (X(1) Y (1)) (X(n) Y (n)) iid as (X Y ) Avariable selection procedure is consistent if the probability that all the relevant variablesare selected and none of the irrelevant variables are selected tends to one as n tends toinfinity

We give a general nonparametric framework where the variable selection rule basedon the CATCH importance scores sj given in (38) is consistent

When the predictors are not all categorical and CATCH involves tubes we assumethat the number of observations in each tube tends toinfin as nrarrinfin and we assume thatfor those importance scores that require the selection of a section size h the selection isfrom a finite set hj with minjhj gt h0 for some h0 gt 0 It follows that the numberof observations mjk in the kth tube used to calculate the importance score sj for thevariable Xj tends to infinity as nrarrinfin

The key quantities that determine consistency are the limits λj of sj That is

λj = τ(ζj P ) = EP [ζj(X)] =

intζj(x)dP (x) j = 1 d

where ζj(x) is the local contingency efficacy defined by (24) for numerical Xrsquos and by(28) for categorical Xrsquos and P is the probability distribution of X Our estimate sjof τj is

sj = τ(ζj P ) =1

M

Msumk=1

ζj(Xlowastk)

where ζj is defined in section 3 and P is the empirical distritution of XLet d0 denote the number of Xrsquos that are irrelevant for Y Set d1 = d minus d0 and

without loss of generality reorder the λj so that

23

λj gt 0 if j = 1 d1λj = 0 if j = d1 + 1 d

(71)

Our variable selection rule is

ldquokeep variable Xj if sj ge t for some threshold trdquo (72)

Rule (72) is consistent if and only if

P (minjled1sj ge t and max

jgtd1sj lt t) minusrarr 1 (73)

To establish (73) it is enough to show that

P (maxjgtd1sj gt t) minusrarr 0 (74)

andP (min

jled1sj le t) minusrarr 0 (75)

We call (74) and (75) type I and type II consistency respectively Type I and IIerror probabilities refer to probabilities of any false positives and any false negativesrespectively

71 Type I consistency

We consider (74) first and note that

P (maxjgtd1sj gt t) le

dsumj=d1+1

P (sj gt t) (76)

For the jth irrelevant variable Xj j gt d1 the results in Section 6 apply to the local

efficacy ζj(x) at a fixed point x The importance score sj defined in (38) is the averageof such efficacies over a sample of M random points xlowasta We assume that there is apoint x(0) such that for irrelevant xjrsquos and for n large enough ζj(x

(0)) is stochastically

larger in the right tail than the average Mminus1sumMa=1 ζj(x

lowasta) That is we assume that

there exist t gt 0 and n0 such that for n ge n0

p(sj gt t) le P (s(0)j gt t) j gt d1 (77)

The existence of such a point x(0) is established by considering the probability limit of

arg maxζj(xlowasta) xlowasta isin xlowast1 middot middot middot xlowastM

Note by Lemma (61) s(0)j rarrp 0 as n rarr infin It follows that we have established

24

(74) for t and d0 = d minus d1 that are fixed as n rarr infin To examine the interesting casewhere d0 rarr infin as n rarr infin we need large deviation results which we turn to next Inthe d0 rarrinfin case we consider t equiv tn that tend to infin as nrarrinfin and we use the weakerassumption there exists x(0) and n0 such that

P (sj gt tn) le P (s(0)j gt tn) (78)

for n ge n0 and each tn with limnrarrinfin tn = infin We also assume that the number ofcolumns C and rows R in the contingency table that define the efficacy χ2

jn given by

(64) stays fixed as nrarrinfin In this case P (s(0)j gt tn)rarr 0 provided each term

[T (j)rc ]2 equiv

[nrcErcradicnradicErc

]2in s2j = χ2

jn defined for Xj by (64) satisfies

P(|T (j)rc | gt tn

)minusrarr 0

This is because

P

radicsumr

sumc

[T(j)rc ]2 gt tn

le P(RC max

rc[T (j)rc ]2 gt t2n

)le RC maxP

(|T (j)rc | gt tn

radicRC)

and tn and tnradicRC are of the same order as n rarr infin By large deviation theory Hall

[8] and Jing et al [10] for some constant A

P (|T (j)rc | gt tn) =

2tminus1nradic2πeminust

2n2[1 + At3nn

minus 12 + o

(t3nnminus 1

2

)](79)

provided tn = o(n

16

) If (79) is uniform in j gt d1 then (76) and (79) implies that

type I consistency holds if

d0

(tnradic

2

)minus1eminus

t2n2 rarr 0 (710)

To examine (710) write d0 and tn as d0 = exp(nb)

and tn =radic

2nr By large deviationtheory (710) holds when r lt 1

6 Thus if 0 lt 1

2b lt r lt 1

6 then

d0

(tnradic

2

)minus1eminus

t2n2 = nminusr exp

(nb minus n2r

)rarr 0

25

Thus d0 = exp(nb) with b = 13ε for any ε gt 0 is the largest d0 can be to have consistency

For instance if there are exactly d0 = exp(n

14

)irrelevant variables and we use the

threshold tn =radic

2n17 the probability of selecting any of the irrelevant variables tends

to zero at the ratenminus

14 exp

(n

14 minus n

27

)72 Type II consistency

We now assume that for each relevant variable $X_j$ there is a least favorable point $x^{(1)}$ such that the local efficacy $s_j^{(1)} \equiv \hat\zeta_j(x^{(1)})$ is stochastically smaller than the importance score $s_j = M^{-1}\sum_{a=1}^{M}\hat\zeta_j(x^*_a)$. That is, we assume that for each relevant variable $X_j$ there exist $n_1$ and $x^{(1)}$ such that for $n \ge n_1$,
$$P\left(s_j^{(1)} \le t_n\right) \ge P(s_j \le t_n), \quad j \le d_1. \tag{7.11}$$
The existence of such a least favorable $x^{(1)}$ is established by considering the probability limit of
$$\arg\min\left\{\hat\zeta_j(x^*_a) : x^*_a \in \{x^*_1, \ldots, x^*_M\}\right\}.$$

We again assume that $R$ and $C$ are fixed, so that to show type II consistency it is enough to show that, for each $(r, c)$ and $j \le d_1$,
$$P\left(\left|T^{(j)}_{rc}\right| \le t_n\right) \rightarrow 0.$$
This is because
$$P\left(\sqrt{\sum_r \sum_c \left[T^{(j)}_{rc}\right]^2} \le t_n\right) \le P\left(\min_{r,c}\left[T^{(j)}_{rc}\right]^2 \le t_n^2\right) \le RC\,\max_{r,c} P\left(\left|T^{(j)}_{rc}\right| \le t_n\right).$$

By Theorem 6.2, $\sqrt{n}\left(s_j^{(1)} - \tau_j\right) \rightarrow_d N(0, \nu^2)$, $\tau_j > 0$. It follows that if $t_n$ and $d_1$ are bounded above as $n \rightarrow \infty$ and $\tau_j > 0$, then $P\left(s_j^{(1)} \le t_n\right) \rightarrow 0$. Thus type II consistency holds in this case. To examine the more interesting case where $t_n$ and $d_1$ are not bounded above and $\tau_j$ may tend to zero as $n \rightarrow \infty$, we will need to center the $T^{(j)}_{rc}$. Thus set
$$\theta^{(j)}_n = \frac{\sqrt{n}\left(p^{(j)}_{rc} - p^{(j)}_r p^{(j)}_c\right)}{\sqrt{p^{(j)}_r p^{(j)}_c}}, \tag{7.12}$$
$$T^{(j)}_n = T^{(j)}_{rc} - \theta^{(j)}_n,$$
where, for simplicity, the dependence of $T^{(j)}_n$ and $\theta^{(j)}_n$ on $r$ and $c$ is not indicated.

Lemma 7.1.
$$P\left(\min_{j \le d_1}\left|T^{(j)}_{rc}\right| \le t_n\right) \le \sum_{j=1}^{d_1} P\left(\left|T^{(j)}_n\right| \ge \min_{j \le d_1}\left|\theta^{(j)}_n\right| - t_n\right).$$

Proof.
$$P\left(\min_{j \le d_1}\left|T^{(j)}_{rc}\right| \le t_n\right) = P\left(\min_{j \le d_1}\left|T^{(j)}_n + \theta^{(j)}_n\right| \le t_n\right)$$
$$\le P\left(\min_{j \le d_1}\left|\theta^{(j)}_n\right| - \max_{j \le d_1}\left|T^{(j)}_n\right| \le t_n\right)$$
$$= P\left(\max_{j \le d_1}\left|T^{(j)}_n\right| \ge \min_{j \le d_1}\left|\theta^{(j)}_n\right| - t_n\right)$$
$$\le \sum_{j=1}^{d_1} P\left(\left|T^{(j)}_n\right| \ge \min_{j \le d_1}\left|\theta^{(j)}_n\right| - t_n\right).$$

Set
$$a_n = \min_{j \le d_1}\left|\theta^{(j)}_n\right| - t_n;$$
then, by large deviation theory, if $a_n = o(n^{1/6})$,
$$P\left(\left|T^{(j)}_n\right| \ge a_n\right) \propto \left(a_n/\sqrt{2}\right)^{-1}\exp\{-a_n^2/2\}, \tag{7.13}$$
where $\propto$ signifies asymptotic order as $n \rightarrow \infty$, $a_n \rightarrow \infty$. It follows that if (7.13) holds uniformly in $j$, then by Lemma 7.1,
$$P\left(\min_{j \le d_1}\left|T^{(j)}_{rc}\right| \le t_n\right) \propto d_1\left(a_n/\sqrt{2}\right)^{-1}\exp\{-a_n^2/2\}.$$
Thus type II consistency holds when $a_n = o(n^{1/6})$ and
$$d_1\left(a_n/\sqrt{2}\right)^{-1}\exp\{-a_n^2/2\} \rightarrow 0. \tag{7.14}$$
In Section 7.1, $t_n$ is of order $n^r$, $0 < r < \frac{1}{6}$. For this $t_n$, type II consistency holds if $a_n$ is of order $n^r$, $0 < r < \frac{1}{6}$, and
$$d_1 n^{-r}\exp\{-n^{2r}/2\} \rightarrow 0.$$
If $a_n = O(n^k)$ with $k \ge \frac{1}{6}$, then $P\left(\left|T^{(j)}_n\right| \ge a_n\right)$ tends to zero at a rate faster than in (7.13) and consistency is ensured. For instance, if all relevant variables $X_j$ have $p^{(j)}_{rc}$ fixed as $n \rightarrow \infty$, then $\min_{j \le d_1}\left|\theta^{(j)}_n\right|$ is of order $\sqrt{n}$ and $a_n$ is of order $n^{1/2} - t_n$.

7.3 Type I and II consistency

We see from Sections 7.1 and 7.2 that consistency holds under a variety of conditions. One set of conditions is the following.

Theorem 7.1. Under the regularity conditions of Sections 7.1 and 7.2, if
$$d_0 = c_0\exp(n^b), \quad t_n = c_1 n^r, \quad \min_{j \le d_1}\left|\theta^{(j)}_n\right| - t_n = c_2 n^r, \quad d_1 = c_3\exp(n^b)$$
for positive constants $c_0$, $c_1$, $c_2$ and $c_3$, and $0 < \frac{1}{2}b < r < \frac{1}{6}$, then CATCH is consistent.

Appendix
Proof of Theorem 6.1

(1) Let $N_h = \sum_{i=1}^{n} I\left(X^{(i)} \in N_h(x^{(0)})\right)$. Then $N_h \sim \mathrm{Bi}\left(n, \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx\right)$ and $N_h \rightarrow_{a.s.} \infty$ as $n \rightarrow \infty$. Moreover,
$$\hat\zeta(x^{(0)}, h) = n^{-1/2}\sqrt{\mathcal{X}^2(x^{(0)}, h)} = (N_h/n)^{1/2}\,(N_h)^{-1/2}\sqrt{\mathcal{X}^2(x^{(0)}, h)}.$$
Since $N_h \sim \mathrm{Bi}\left(n, \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx\right)$, we have
$$N_h/n \rightarrow \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx. \tag{7.15}$$
By Lemma 6.1 and Table 1,
$$(N_h)^{-1/2}\sqrt{\mathcal{X}^2(x^{(0)}, h)} \rightarrow \left(\sum_{r=+,-}\sum_{c=1}^{C}\frac{(p_{rc} - p_r q_c)^2}{p_r q_c}\right)^{1/2}, \tag{7.16}$$
where
$$p_{+c} = \Pr\left(X > x^{(0)}, Y = c \mid X \in N_h(x^{(0)})\right), \quad p_{-c} = \Pr\left(X \le x^{(0)}, Y = c \mid X \in N_h(x^{(0)})\right)$$
for $r = +, -$ and $c = 1, \ldots, C$, and
$$p_r = \sum_{c=1}^{C} p_{rc}, \qquad q_c = \sum_{r=+,-} p_{rc}.$$
In fact, $p_+ = \Pr\left(X > x^{(0)} \mid X \in N_h(x^{(0)})\right)$, $p_- = \Pr\left(X \le x^{(0)} \mid X \in N_h(x^{(0)})\right)$, and $q_c = \Pr\left(Y = c \mid X \in N_h(x^{(0)})\right)$.
By (7.15) and (7.16),
$$\hat\zeta(x^{(0)}, h) \rightarrow \left(\int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx\right)^{1/2}\left(\sum_{r=+,-}\sum_{c=1}^{C}\frac{(p_{rc} - p_r q_c)^2}{p_r q_c}\right)^{1/2} \equiv \zeta(x^{(0)}, h).$$

(2) Since $X$ and $Y$ are independent, $p_{rc} = p_r q_c$ for $r = +, -$ and $c = 1, \ldots, C$, so $\zeta(x^{(0)}, h) = 0$ for any $h$. Hence
$$\zeta(x^{(0)}) = \sup_{h>0}\zeta(x^{(0)}, h) = 0.$$

(3) Recall that $p_c(x^{(0)}) = \Pr(Y = c \mid x^{(0)})$. Without loss of generality assume $p'_c(x^{(0)}) > 0$. Then there exists $h$ such that
$$\Pr\left(Y = c \mid x^{(0)} < X < x^{(0)} + h\right) > \Pr\left(Y = c \mid x^{(0)} - h < X < x^{(0)}\right),$$
which is equivalent to $p_{+c}/p_+ > p_{-c}/p_-$. This shows that $\zeta(x^{(0)}) > 0$, since by Lemma 6.1, $\zeta(x^{(0)}) = 0$ would imply $p_{+c}/p_+ = p_{-c}/p_- = q_c$.

References

[1] Breiman, L., 2001. Random forests. Machine Learning 45, 5–32.

[2] Breiman, L., Friedman, J. H., Olshen, R. A., Stone, C. J., 1984. Classification and Regression Trees. Wadsworth, Belmont.

[3] Doksum, K., Tang, S., Tsui, K.-W., 2008. Nonparametric variable selection: The EARTH algorithm. Journal of the American Statistical Association 103 (484), 1609–1620.

[4] Doksum, K. A., Samarov, A., 1995. Nonparametric estimation of global functionals and a measure of the explanatory power of covariates in regression. Annals of Statistics 23, 1443–1473.

[5] Doksum, K. A., Schafer, C., 2006. Powerful choices: Tuning parameter selection based on power. In: Frontiers in Statistics: Dedicated to Peter Bickel (J. Fan and H. L. Koul, eds.), Imperial College Press, 113–141.

[6] Friedman, J., 1991. Multivariate adaptive regression splines. The Annals of Statistics 19, 1–67.

[7] Gao, J., Gijbels, I., 2008. Bandwidth selection in nonparametric kernel testing. Journal of the American Statistical Association 103 (484), 1584–1594.

[8] Hall, P., 1992. The Bootstrap and Edgeworth Expansion. Springer, New York.

[9] Huang, L.-S., Chen, J., 2008. Analysis of variance, coefficient of determination and F-test for local polynomial regression. Annals of Statistics 36, 2085–2109.

[10] Jing, B.-Y., Shao, Q.-M., Wang, Q., 2003. Self-normalized Cramér-type large deviations for independent random variables. Annals of Probability 31, 2167–2215.

[11] Lin, D. Y., Zeng, D., 2011. Correcting for population stratification in genomewide association studies. Journal of the American Statistical Association 106, 997–1008.

[12] Loh, W.-Y., 2009. Improving the precision of classification trees. The Annals of Applied Statistics 3, 1710–1737.

[13] Price, A., Patterson, N., Plenge, R., Weinblatt, M., Shadick, N., Reich, D., 2006. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38 (8), 904–909.

[14] Qiao, Z., Zhou, L., Huang, J., 2008. Linear discriminant analysis for high dimensional, low sample size data. Special issue of World Congress on Engineering 2008.

[15] Quinlan, J. R., 1987. Simplifying decision trees. International Journal of Man-Machine Studies 27 (3), 221–234.

[16] Schafer, C., Doksum, K. A., 2009. Selecting local models in multiple regression by maximizing power. Metrika 69, 283–304.

[17] Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267–288.

[18] Wang, L., Shen, X., 2007. On L1-norm multi-class support vector machines: methodology and theory. Journal of the American Statistical Association 102, 583–594.

[19] Zhang, H. H., Liu, Y., Wu, Y., Zhu, J., 2008. Variable selection for the multicategory SVM via adaptive sup-norm regularization. Electronic Journal of Statistics 2, 149–167.


with more levels has a larger importance score than that with fewer levels. Therefore the unadjusted scores $s_j$ cannot accurately quantify the relative importance of the predictors in the tube distance calculation. Next we use (3.8) to produce $s_j = s_j^{(3)}$ and use $s_j^{(3)}$ in the next iteration, and so on.

In the definition above we need to define $|x_j - x^{(0)}_j|$ appropriately for a categorical variable $X_j$. Because $X_j$ is categorical, we need to assume that $|x_j - x^{(0)}_j|$ is constant when $x_j$ and $x^{(0)}_j$ take any pair of different categories. One approach is to define $|x_j - x^{(0)}_j| = \infty \cdot I(x_j \ne x^{(0)}_j)$, in which case all points in the tube are given the same $X_j$ value as the tube center. This approach is a very sensible one, as it strictly implements the idea of conditioning on $X_j$ when evaluating the effect of another predictor $X_l$, and $X_j$ does not contribute to any variation in $Y$. The problem with this definition, though, is that when the number of categories of $X_j$ is relatively large compared to the sample size, there will not be enough points in the tube if we restrict $X_j$ to a single category. In such a case we do not restrict $X_j$, which means that $X_j$ can take any category in the tube. We define $|x_j - x^{(0)}_j| = k_0\, I(x_j \ne x^{(0)}_j)$, where $k_0$ is a normalizing constant. The value of $k_0$ is chosen to achieve a balance between numerical variables and categorical variables. Suppose that $X_1$ is a numerical variable with standard deviation one and $X_2$ is a categorical variable. For $X_2$ we set $k_0 \equiv E\left(|X^{(1)}_1 - X^{(2)}_1|\right)$, where $X^{(1)}_1$ and $X^{(2)}_1$ are two independent realizations of $X_1$. Note that $k_0 = 2/\sqrt{\pi}$ if $X_1$ follows a standard normal distribution. Also, $k_0$ can be based on the empirical distributions of the numerical variables. Based on the preceding discussion of the choice of $k_0$, we use $k_0 = \infty$ in the simulations and $k_0 = 2/\sqrt{\pi}$ in the real data example.
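As a rough numerical check of this choice of $k_0$ (our own sketch, not part of the paper; the helper name is hypothetical), the expected absolute difference between two independent copies of a standardized predictor can be estimated directly from data, and under standard normality it equals $2/\sqrt{\pi} \approx 1.13$:

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_k0(x, n_pairs=100_000):
    """Estimate k0 = E|X' - X''| from the empirical distribution of a
    standardized numerical predictor x, using two independent draws."""
    a = rng.choice(x, size=n_pairs, replace=True)
    b = rng.choice(x, size=n_pairs, replace=True)
    return np.mean(np.abs(a - b))

x1 = rng.standard_normal(5000)        # a standardized numerical predictor
print(empirical_k0(x1))               # close to 2/sqrt(pi) ~= 1.128
print(2 / np.sqrt(np.pi))
```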

Remark 3.1 (choice of $k_0$). $k_0 = 2/\sqrt{\pi}$ is used all the time unless the sample size is large enough in the sense to be explained below. Suppose $k_0 = \infty$; then $D_l\left(x_{-l}, x^{(0)}_{-l}\right)$ is finite only for those observations with $x_j = x^{(0)}_j$. When there are multiple categorical variables, $D_l$ is finite only for those points, denoted by the set $\Omega^{(0)}$, that satisfy $x_j = x^{(0)}_j$ for all categorical $x_j$. When there are many categorical variables, or when the total sample size is small, the size of this set can be very small or close to zero, in which case the tube cannot be well defined; therefore $k_0 = 2/\sqrt{\pi}$ is used, as in the real data example. On the other hand, when the size of $\Omega^{(0)}$ is larger than the specified tube size, we can use $k_0 = \infty$, as in the simulation examples.

3.5 The CATCH Algorithm

The variable selection algorithm for a classification problem is based on importance scores and an iteration procedure.

The Algorithm
(0) Standardizing. Standardize all the numerical predictors to have sample mean 0 and sample standard deviation 1 using linear transformations.

(1) Initializing Importance Scores. Let $S = (s_1, \ldots, s_d)$ be importance scores for the predictors. Let $S' = (s'_1, \ldots, s'_d)$ be such that $s'_l$ is a threshold importance score which corresponds to the model where $X_l$ is independent of $Y$ given $X_{-l}$. Initialize $S$ to be $(1, \ldots, 1)$ and $S'$ to be $(0, \ldots, 0)$. Let $M$ be the number of tube centers and $m$ the number of observations within each tube.

(2) Tube Hunting Loop. For $l = 1, \ldots, d$, do (a), (b), (c) and (d) below.

(a) Selecting tube centers. For $0 < b \le 1$, set $M = [n^b]$, where $[\,\cdot\,]$ is the greatest integer function, and randomly sample $M$ bootstrap points $X^*_1, \ldots, X^*_M$ with replacement from the $n$ observations. Set $k = 1$; do (b).

(b) Constructing tubes. For $0 < a < 1$, set $m = [n^a]$. Using the tube distance $D_l$ defined in (3.10), select the tube size $a$ so that there are exactly $m \ge 10$ observations inside the tube for $X_l$ at $X^*_k$.

(c) Calculating efficacy. If $X_l$ is a numerical variable, let $\hat\zeta_l(X^*_k)$ be the estimated local contingency tube efficacy of $Y$ on $X_l$ at $X^*_k$ as defined in (3.5). If $X_l$ is a categorical variable, let $\hat\zeta_l(X^*_k)$ be the estimated contingency tube efficacy of $Y$ on $X_l$ at $X^*_k$ as defined in (3.7).

(d) Thresholding. Let $Y^*$ denote a random variable which is identically distributed as $Y$ but independent of $X$. Let $(Y^{(1)}_0, \ldots, Y^{(n)}_0)$ be a random permutation of $(Y^{(1)}, \ldots, Y^{(n)})$; then $Y_0 \equiv (Y^{(1)}_0, \ldots, Y^{(n)}_0)$ can be viewed as a realization of $Y^*$. Let $\hat\zeta^*_l(X^*_k)$ be the estimated local contingency tube efficacy of $Y_0$ on $X_l$ at $X^*_k$, calculated as in (c) with $Y_0$ in place of $Y$.

If $k < M$, set $k = k + 1$ and go to (b); otherwise go to (e).

(e) Updating importance scores. Set
$$s^{new}_l = \frac{1}{M}\sum_{k=1}^{M}\hat\zeta_l(X^*_k), \qquad s'^{new}_l = \frac{1}{M}\sum_{k=1}^{M}\hat\zeta^*_l(X^*_k).$$
Let $S^{new} = (s^{new}_1, \ldots, s^{new}_d)$ and $S'^{new} = (s'^{new}_1, \ldots, s'^{new}_d)$ denote the updates of $S$ and $S'$.

(3) Iterations and Stopping Rule. Repeat (2) $I$ times. Stop when the change in $S^{new}$ is small. Record the last $S$ and $S'$ as $S^{stop}$ and $S'^{stop}$, respectively.

(4) Deletion step. Using $S^{stop}$ and $S'^{stop}$, delete $X_l$ if $s^{stop}_l - s'^{stop}_l < \sqrt{2}\,SD(s'^{stop})$. If $d$ is small, we can generate several sets of $S'^{stop}$ and then use the sample standard deviation of all the values in these sets as $SD(s'^{stop})$; see Remark 3.3.

End of the algorithm.
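To make the control flow of steps (2a)–(2e) and the deletion rule of step (4) concrete, here is a compact, self-contained sketch in Python. It handles numerical predictors only, uses a simple importance-weighted L1 tube distance and a two-row contingency efficacy, and runs a single pass on toy data; it is an illustration written for this rewrite (all helper names and the toy data are ours), not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def chi2_stat(table):
    """Pearson chi-square for an R x C count table, with 0/0 treated as 0."""
    n = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(expected > 0, (table - expected) ** 2 / expected, 0.0)
    return terms.sum()

def local_efficacy(xl_tube, y_tube, center, n_classes):
    """2 x C contingency efficacy of y on a numerical x_l inside one tube:
    rows are {x_l <= center, x_l > center}, columns are the classes of y."""
    m = len(y_tube)
    rows = (xl_tube > center).astype(int)
    table = np.zeros((2, n_classes))
    for r, c in zip(rows, y_tube):
        table[r, c] += 1
    return np.sqrt(chi2_stat(table) / m)

def catch_pass(X, y, scores, m=50, M=None):
    """One tube-hunting pass (steps 2a-2e) for numerical predictors only.
    `scores` are the current importance scores weighting the tube distance."""
    n, d = X.shape
    M = n if M is None else M
    n_classes = int(y.max()) + 1
    new_s, new_s0 = np.zeros(d), np.zeros(d)
    y_perm = rng.permutation(y)                  # global permutation for the threshold
    for l in range(d):
        w = np.delete(scores, l)                 # weights for X_{-l}
        centers = rng.integers(0, n, size=M)     # bootstrap tube centers
        for k in centers:
            dist = np.abs(np.delete(X, l, axis=1) - np.delete(X[k], l)) @ w
            tube = np.argsort(dist)[:m]          # m nearest points in the tube metric
            new_s[l] += local_efficacy(X[tube, l], y[tube], X[k, l], n_classes)
            new_s0[l] += local_efficacy(X[tube, l], y_perm[tube], X[k, l], n_classes)
    return new_s / M, new_s0 / M

# toy run: Y depends on X1 only
n, d = 400, 6
X = rng.uniform(size=(n, d))
y = (X[:, 0] > 0.5).astype(int)
s, s0 = catch_pass(X, y, scores=np.ones(d), m=50, M=100)
keep = s - s0 > np.sqrt(2) * s0.std()            # deletion rule of step (4)
print(np.round(s, 2), np.round(s0, 2), keep)
```

In the full algorithm this pass would be repeated $I$ times with the updated scores fed back into the tube distance, as step (3) describes.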

Remark 3.2. Step (d) needs to be replaced by (d') below when a global permutation of $Y$ does not result in an appropriate "local null distribution". For example, when the sample sizes of the different classes are extremely unequal, which is known as an unbalanced classification problem, it is very likely that a local region contains a pure class of $Y$ after the global permutation; the threshold $S'$ will then be spuriously low, and therefore irrelevant variables will be falsely selected into the model. The local approach (d') is more nonparametric but more time consuming.

(d') Local Thresholding. Let $Y^*$ denote a random variable which is distributed as $Y$ but independent of $X_l$ given $X_{-l}$. Let $(Y^{(1)}_0, \ldots, Y^{(m)}_0)$ be a random permutation of the $Y$'s inside the tube; then $(Y^{(1)}_0, \ldots, Y^{(m)}_0)$ can be viewed as a realization of $Y^*$. If $X_l$ is a numerical variable, let $\hat\zeta'_l(X^*_k)$ be the estimated local contingency tube efficacy of $Y^*$ on $X_l$ at $X^*_k$ as defined in (3.5). If $X_l$ is a categorical variable, let $\hat\zeta'_l(X^*_k)$ be the estimated contingency tube efficacy of $Y^*$ on $X_l$ at $X^*_k$ as defined in (3.7).

Remark 3.3 (The threshold). Under the assumption $H^{(l)}_0$ that $Y$ is independent of $X_l$ given $X_{-l}$ in $T_l$, $s_l$ and $s'_l$ are nearly independent and $SD(s_l - s'_l) \approx \sqrt{2}\,SD(s'_l)$. Thus the rule that deletes $X_l$ when $s^{stop}_l - s'^{stop}_l < \sqrt{2}\,SE(s'^{stop})$ is similar to the one-standard-deviation rule of CART and GUIDE. The choice of threshold is also discussed in Section 7. Alternatively, we calculate a p-value for each covariate by a permutation test. More specifically, we permute the response variable, obtain the importance scores for the permuted data, and then calculate how extreme $s^{stop}_l$ is compared to the permuted scores. We then identify a covariate as important if its p-value is less than 0.1.
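The permutation-test alternative mentioned above can be sketched as follows (our own minimal version; `importance_scores` is a placeholder for whatever score function is used, and the stand-in score in the demo is not the CATCH score itself):

```python
import numpy as np

rng = np.random.default_rng(2)

def permutation_pvalues(X, y, importance_scores, n_perm=200):
    """Permutation p-value per covariate: how often a score computed on a
    permuted response reaches the observed score."""
    s_obs = importance_scores(X, y)
    exceed = np.zeros_like(s_obs)
    for _ in range(n_perm):
        exceed += (importance_scores(X, rng.permutation(y)) >= s_obs)
    return (exceed + 1) / (n_perm + 1)

# stand-in score: marginal correlation magnitude (illustration only)
toy_score = lambda X, y: np.abs(np.corrcoef(X.T, y)[-1, :-1])
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(float)
print(permutation_pvalues(X, y, toy_score) < 0.1)   # the 0.1 cutoff of Remark 3.3
```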

Remark 3.4 (Large $d$, small $n$). A key idea in this paper is that, when determining the importance of the variable $X_j$, we restrict (condition) the vector $X_{-j}$ to a relatively small set. This is done to address the problem of spurious correlation, which is of great concern in genomics (Price et al. [13], Lin and Zeng [11]). In such genomic studies the number of predictors $d$ typically greatly exceeds the sample size $n$. For numerical variables this can be dealt with by replacing $X_{-j}$ with a vector of principal components explaining most of the variation in $X_{-j}$. Although our paper does not deal with the $d \gg n$ case directly, this remark shows that it is also applicable to the genome-wide association studies described in Price et al. [13].

Remark 3.5. We now provide the computational complexity of the algorithm. Recall the notation: $I$ is the number of repetitions, $M$ is the number of tube centers, $m$ is the tube size, $d$ is the dimension, and $n$ is the sample size. The number of operations needed for the algorithm is $I \times d \times M \times (2b + 2c + 2d) + 2e$, where $2b$, $2c$, $2d$, $2e$ denote the numbers of operations required for steps 2(b), 2(c), 2(d) and 2(e). The computational complexities required for 2(b), 2(c), 2(d) and 2(e) are $O(nd)$, $O(m)$, $O(m)$ and $O(M)$. Since $nd \gg m$, the required computation is dominated by step 2(b), and the computational complexity of the algorithm is $O(IMnd^2)$. Solving a least squares problem requires $O(nd^2)$ operations, so our algorithm requires higher computational complexity by a factor of $IM$ compared to least squares. However, the computations for the $M$ tube centers are independent of each other and can be performed in parallel.

4 Simulation Studies

4.1 Variable Selection with Categorical and Numerical Predictors

Example 4.1. Consider $d \ge 5$ predictors $X_1, X_2, \ldots, X_d$ and a categorical response variable $Y$. The fifth predictor $X_5$ is a Bernoulli random variable which takes the two values 0 and 1 with equal probability. All other predictors are uniformly distributed on [0, 1].

When $X_5 = 0$, the dependent variable $Y$ is related to the predictors in the following fashion:
$$Y = \begin{cases} 0 & \text{if } X_1 - X_2 - 0.25\sin(1.6\pi X_2) \le -0.5 \\ 1 & \text{if } -0.5 < X_1 - X_2 - 0.25\sin(1.6\pi X_2) \le 0 \\ 2 & \text{if } 0 < X_1 - X_2 - 0.25\sin(1.6\pi X_2) \le 0.5 \\ 3 & \text{if } X_1 - X_2 - 0.25\sin(1.6\pi X_2) > 0.5 \end{cases} \tag{4.1}$$

When $X_5 = 1$, $Y$ depends on $X_3$ and $X_4$ in the same way as above, with $X_1$ replaced by $X_3$ and $X_2$ replaced by $X_4$. The other variables $X_6, \ldots, X_d$ are irrelevant noise variables. All $d$ predictors are independent.
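A small generator for this design (our own sketch, written to make the interaction structure explicit; the decimal points in (4.1) are garbled in the source, and the frequency $1.6\pi$ follows our reading of that display):

```python
import numpy as np

rng = np.random.default_rng(3)

def boundary(u, v):
    """The nonlinear split used in model (4.1): u - v - 0.25*sin(1.6*pi*v)."""
    return u - v - 0.25 * np.sin(1.6 * np.pi * v)

def generate_example_41(n, d):
    """Draw (X, Y) from the hierarchical-interaction model of Example 4.1."""
    X = rng.uniform(size=(n, d))
    X[:, 4] = rng.integers(0, 2, size=n)          # X5 is Bernoulli(1/2)
    # when X5 = 0 the boundary uses (X1, X2); when X5 = 1 it uses (X3, X4)
    z = np.where(X[:, 4] == 0,
                 boundary(X[:, 0], X[:, 1]),
                 boundary(X[:, 2], X[:, 3]))
    y = np.digitize(z, bins=[-0.5, 0.0, 0.5])     # four classes 0, 1, 2, 3
    return X, y

X, y = generate_example_41(n=400, d=10)
print(np.bincount(y))
```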

In addition to the presence of the noise variables $X_6, \ldots, X_d$, this example is challenging in two respects. First, there is an interaction effect between the continuous predictors $X_1, \ldots, X_4$ and the categorical predictor $X_5$. Such an interaction effect may arise in practice: for example, suppose $Y$ represents the effect resulting from multiple treatments $X_1, \ldots, X_4$ and that $X_5$ represents the gender of a patient; our model allows males and females to respond differently to treatments. Second, the classification boundaries for fixed $X_5$ are highly nonlinear, as shown in Figure 1. This design demonstrates that our method can detect nonlinear dependence between the response and the predictors.

Table 2 shows the number of simulations where the predictors $X_1, \ldots, X_5$ are correctly kept in the model and, in column 6, the average number of simulations where the $d - 5$ irrelevant predictors are falsely selected to be in the model, out of 100 simulation trials from model (4.1). In the simulation, $n = 400$ and $d = 10, 50$ and 100 are used. We compare CATCH with other methods, including Marginal CATCH, the Random Forest classifier (RFC) (Breiman [1]) and GUIDE (Loh [12]). For Marginal CATCH we test the association between $Y$ and $X_l$ without conditioning on $X_{-l}$. RFC is well known as a powerful tool for prediction; here we use importance scores from RFC for variable selection. We create an additional 30 noise variables and use their average importance score multiplied by a factor of 2 as a threshold (Doksum et al. [3]) to decide whether a variable should be selected into the model. In particular, we generate the noise variables using uniform and Bernoulli distributions, respectively, for numerical and categorical predictors.
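The noise-augmentation threshold used for RFC above can be sketched with scikit-learn as follows (a rough sketch under our own choices of forest size and toy data; the factor of 2 follows the description in the text):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(4)

def rf_select(X, y, n_noise=30, factor=2.0):
    """Keep variables whose RF importance exceeds `factor` times the average
    importance of appended pure-noise columns (numerical noise only here)."""
    n, d = X.shape
    noise = rng.uniform(size=(n, n_noise))
    rf = RandomForestClassifier(n_estimators=500, random_state=0)
    rf.fit(np.hstack([X, noise]), y)
    imp = rf.feature_importances_
    return np.where(imp[:d] > factor * imp[d:].mean())[0]

X = rng.uniform(size=(400, 10))
y = (X[:, 0] + X[:, 1] > 1).astype(int)
print(rf_select(X, y))
```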

The main goal of GUIDE is to solve the variable selection bias problem in regression and classification tree methods. As an illustration, if a categorical predictor takes $K$ distinct values, then the splitting test exhausts $2^{K-1} - 1$ possibilities and therefore is

Table 2: Simulation results from Example 4.1 with sample size n = 400. For 1 ≤ i ≤ 5, the ith column gives the estimated probability of selection of relevant variable Xi and the standard error based on 100 simulations. Column 6 gives the mean of the estimated probability of selection and the standard error for the irrelevant variables X6, ..., Xd.

Number of Selections (100 simulations)
METHOD            d     X1          X2          X3          X4          X5          Xi, i ≥ 6
CATCH             10    0.82(0.04)  0.75(0.04)  0.78(0.04)  0.82(0.04)  1(0)        0.05(0.02)
                  50    0.76(0.04)  0.76(0.04)  0.87(0.03)  0.84(0.04)  0.96(0.02)  0.08(0.03)
                  100   0.77(0.04)  0.73(0.04)  0.82(0.04)  0.8(0.04)   0.83(0.04)  0.09(0.03)
Marginal CATCH    10    0.77(0.04)  0.67(0.05)  0.7(0.05)   0.8(0.04)   0.06(0.02)  0.1(0.03)
                  50    0.69(0.05)  0.7(0.05)   0.76(0.04)  0.79(0.04)  0.07(0.03)  0.1(0.03)
                  100   0.78(0.04)  0.76(0.04)  0.73(0.04)  0.73(0.04)  0.12(0.03)  0.1(0.03)
Random Forest     10    0.71(0.05)  0.51(0.05)  0.47(0.05)  0.59(0.05)  0.37(0.05)  0.06(0.02)
                  50    0.64(0.05)  0.61(0.05)  0.65(0.05)  0.58(0.05)  0.28(0.04)  0.08(0.03)
                  100   0.66(0.05)  0.58(0.05)  0.59(0.05)  0.65(0.05)  0.17(0.04)  0.11(0.03)
GUIDE             10    0.96(0.02)  0.98(0.01)  0.93(0.03)  0.99(0.01)  0.92(0.03)  0.33(0.05)
                  50    0.82(0.04)  0.85(0.04)  0.9(0.03)   0.89(0.03)  0.67(0.05)  0.12(0.03)
                  100   0.83(0.04)  0.84(0.04)  0.86(0.03)  0.83(0.04)  0.61(0.05)  0.08(0.03)

Figure 1: The relationship between $Y$ and $(X_1, X_2)$ for model (4.1). The four areas represent the four values of $Y$.

biased toward being selected. The way that GUIDE alleviates this bias is to employ lack-of-fit tests, i.e., $\chi^2$-tests applicable to both numerical and categorical predictors, with the p-values adjusted by a bootstrap procedure. Since GUIDE has negligible selection bias, it can include tests for local pairwise interactions. Furthermore, it achieves sensitivity to local curvature by using a linear model at each node.

The results in Table 2 show that Marginal CATCH does a very poor job of selecting $X_5$. This illustrates the importance of local conditioning. RFC does better than Marginal CATCH but also has difficulties, while CATCH and GUIDE can successfully identify $X_5$ along with $X_1$ to $X_4$ with high frequency. When $d = 10$, GUIDE does very well for the relevant variables but selects too many irrelevant variables. Surprisingly, GUIDE selects fewer irrelevant variables as $d$ increases; it becomes more cautious as the dimension $d$ increases. CATCH performs very well in the high dimensional cases ($d = 50$ and $d = 100$). This indicates that the CATCH algorithm is robust to the intricate interaction structure between the predictors.

In model (4.1) above, the predictors interact in a hierarchical way. When $X_5 = 0$, the response is related to $X_1$ and $X_2$ as shown in Figure 1, where the two axes are the two relevant predictors and the four areas partitioned by the curves correspond to different values of $Y$. When $X_5 = 1$, the response depends on $X_3$ and $X_4$ in the same way. This type of hierarchical interaction between categorical and numerical predictors is difficult to detect even by state-of-the-art classification methods such as the support vector machine (SVM) and RFC. In particular, RFC produces an importance score vector that identifies the four numerical variables $X_1, \ldots, X_4$ as very significant but often leaves the categorical variable $X_5$ unidentified, as we see in Table 2.

The CATCH algorithm, however, performs very well for this complicated model. We now explain why CATCH is able to identify $X_5$ as an important variable. To explore the association between $X_5$ and $Y$, we construct $M$ contingency tables of 2 rows and 4 columns based on $M$ randomly centered tubes, calculate contingency efficacy scores, and then average the scores. Figure 2 (a) and (b) illustrate how one of the contingency efficacy scores is calculated. 1000 data points are generated from model (4.1) and split into two sets according to $x_5$: (a) shows approximately half of the points, those with $x_5 = 0$, in the $(x_1, x_2)$ dimensions, and (b) shows the rest of the points, with $x_5 = 1$, in the $(x_3, x_4)$ dimensions. The data points are colored according to $Y$. This representation enables us to clearly see the classification boundaries (grey lines). A randomly selected tube center with $x_5 = 0$ is shown as a star in (a). The data points within the tube, i.e., those close to the tube center in terms of $X_{-5}$, are highlighted by larger dots. Since the distance defining the tube does not involve $X_5$, the larger dots have $x_5$ equal to 0 or 1 and are present in both (a) and (b). The counts of the larger dots in different colors in (a) and (b) correspond to the cell counts of the two rows, $x_5 = 0$ and $x_5 = 1$, in the contingency table. As we can see, the larger dots in (a) are mostly red and the larger dots in (b) are mostly blue and orange, which shows that the distribution of $Y$ depends on the value of $X_5$, so the contingency efficacy score is high. We expect the contingency efficacy score to be low only if the tube center has similar values in $(x_1, x_2)$ and $(x_3, x_4)$. Such tube centers are very few among the $M$ tube centers since they are randomly chosen. Thus the average contingency efficacy score is high and $X_5$ is identified as an important variable.
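To make the construction behind Figure 2 concrete, the following sketch (ours, with a simplified tube distance and toy data in place of the paper's exact settings) forms the $2 \times 4$ table for $X_5$ inside one tube, rows indexed by $X_5 = 0, 1$ and columns by the four classes, and converts it to an efficacy score:

```python
import numpy as np

rng = np.random.default_rng(5)

n, d = 1000, 10
X = rng.uniform(size=(n, d))
X[:, 4] = rng.integers(0, 2, size=n)                     # categorical X5
z = np.where(X[:, 4] == 0,
             X[:, 0] - X[:, 1] - 0.25 * np.sin(1.6 * np.pi * X[:, 1]),
             X[:, 2] - X[:, 3] - 0.25 * np.sin(1.6 * np.pi * X[:, 3]))
y = np.digitize(z, bins=[-0.5, 0.0, 0.5])                # four classes

def efficacy_for_x5(center_idx, m=100):
    """2 x 4 contingency efficacy of Y on X5 in one tube: the tube distance
    ignores X5, so both categories of X5 appear among the m tube points."""
    others = [j for j in range(d) if j != 4]
    dist = np.abs(X[:, others] - X[center_idx, others]).sum(axis=1)
    tube = np.argsort(dist)[:m]
    table = np.zeros((2, 4))
    for i in tube:
        table[int(X[i, 4]), y[i]] += 1
    expected = np.outer(table.sum(1), table.sum(0)) / m
    with np.errstate(divide="ignore", invalid="ignore"):
        chi2 = np.where(expected > 0, (table - expected) ** 2 / expected, 0.0).sum()
    return np.sqrt(chi2 / m)

scores = [efficacy_for_x5(k) for k in rng.integers(0, n, size=200)]
print(np.mean(scores))      # a high average score flags X5 as relevant
```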

To contrast this example with a simpler data-generating model, suppose the effects of the predictors are additive, that is, the categorical variable is related to the response as a term added to the effect of the numerical predictors. For this type of simpler model, the RFC and GUIDE algorithms can efficiently identify relevant predictors, but so can the CATCH algorithm. We do not include the simulation results for additive models here.

Remark 4.1. We observe that the CATCH algorithm is not very sensitive to the choice of the number of tube centers $M$ or the tube size $m = [n^a]$, where $0 < a \le 1$ and $[\,\cdot\,]$ is the greatest integer function. For Example 4.1, we perform the simulation using $M = n$ and $M = n/2$, and $a = 0.1, 0.15, 0.2, 0.3, 0.5$. The results are summarized in Table 3. The CATCH algorithm performs similarly for $M = n$ and $M = n/2$, with $M = n$ having a small winning edge. We generally suggest using $M = n$ if computational resources are less of a concern or $n$ is not large ($\le 200$). When computational cost is a concern (Remark 3.5), one can reduce the number of tube centers to $M = n/2$ as a compromise between detection power and computational cost. Table 3 also shows that the tube size works well in the range between 0.1 and 0.3, while the detection power for $X_5$ decreases in the case $a = 0.5$, where the tube is too big to achieve local conditioning. This observation is consistent with the comparison to Marginal CATCH in Table 2. On the other hand, the tube size should not be too small, so as to retain sufficient detection power. We suggest using $a = 0.1$ or 0.15 for sample sizes $n \ge 400$, or $m = 50$ for smaller sample sizes. In light of these guidelines, we use $M = n/2 = 200$ and $a = 0.15$ for Examples 4.1–4.3, and $M = n = 200$ and $m = 50$ for Example 4.4.

Table 3: Sensitivity study of CATCH for Example 4.1 (n = 400, d = 50) using different numbers of tube centers M and tube sizes m = [n^a]. For 1 ≤ i ≤ 5, the ith column gives the number of times Xi was selected. Column 6 gives the percentage of times irrelevant variables were selected.

Number of Selections (100 simulations)
M         a      X1   X2   X3   X4   X5   Xi, i ≥ 6
M = 200   0.10   75   67   77   69   92    7.8
          0.15   76   76   87   84   96    8.2
          0.20   85   84   90   87   93    9.9
          0.30   89   87   95   89   85    9.6
          0.50   89   90   96   93   62    9.9
M = 400   0.10   74   76   82   77   92    7.2
          0.15   85   82   89   87   94    8.6
          0.20   87   88   90   86   96    9.1
          0.30   88   89   91   91   92    9.4
          0.50   89   91   94   93   61   10.2

In Example 4.1 all predictors are independent. In a similar setting, we next consider the case where the predictors are dependent.

Example 4.2. We generate standard normal random variables $Z_i$ for $i = 1, \ldots, d$ in such a way that $\mathrm{cor}(Z_1, Z_4) = 0.5$, $\mathrm{cor}(Z_3, Z_6) = 0.9$, and the other $Z_i$'s are independent. We set $X_i = \Phi(Z_i)$ for $i = 1, \ldots, d$, where $\Phi(\cdot)$ is the CDF of a standard normal, and therefore the $X_i$ are uniformly distributed marginally, as in Example 4.1. The correlation structure between the $X_i$'s is similar to that of the $Z_i$'s. The dependent variable $Y$ is generated in the same way as in Example 4.1.
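The correlated design of Example 4.2 can be generated as below (our sketch): the predictors are Gaussian-copula transforms of a normal vector carrying the two stated nonzero correlations. The recoding of $X_5$ to Bernoulli is our assumption, made so that model (4.1) can be applied; the paper does not spell that step out.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)

def generate_example_42(n, d):
    """X_i = Phi(Z_i) with cor(Z1, Z4) = 0.5 and cor(Z3, Z6) = 0.9."""
    cov = np.eye(d)
    cov[0, 3] = cov[3, 0] = 0.5
    cov[2, 5] = cov[5, 2] = 0.9
    Z = rng.multivariate_normal(np.zeros(d), cov, size=n)
    X = norm.cdf(Z)                              # uniform marginals
    X[:, 4] = (X[:, 4] > 0.5).astype(float)      # X5 recoded as Bernoulli(1/2) -- our assumption
    return X

X = generate_example_42(400, 10)
print(np.corrcoef(X[:, 0], X[:, 3])[0, 1])       # close to 0.48 (the copula shrinks 0.5 slightly)
```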

The results of Example 4.2 are given in Table 4. Comparing with the results in Table 2, both CATCH and GUIDE do well in the presence of dependence between the numerical variables. The explanation is two-fold. First, both methods identify the categorical $X_5$ as an important predictor most of the time, which makes the identification of the other relevant predictors possible. Second, conditioning on $X_5$, $Y$ depends on a pair of variables ("$X_1$ and $X_2$" or "$X_3$ and $X_4$") jointly, and in particular this dependence is nonlinear, which makes both methods less affected by the correlation introduced here.

Figure 2: Sample scatter plots of $(x_1, x_2)$ and $(x_3, x_4)$ for simulated data from model (4.1). Panel (a) ($X_5 = 0$; axes $X_1$, $X_2$) includes all the pairs $(x_1, x_2)$ with $x_5 = 0$, while panel (b) ($X_5 = 1$; axes $X_3$, $X_4$) includes all the pairs $(x_3, x_4)$ with $x_5 = 1$. The larger dots in the two plots are within the tube around a tube center, shown as the star.

We now explain the latter point in detail for both methods. In CATCH, the algorithm adaptively weighs the effects of the predictors by their importance scores when defining the tubes in which they are conditioned on; therefore identifying either of the two will help to identify the other, whereas an irrelevant variable, though strongly associated with one of the relevant variables, is down-weighted iteratively and does not strongly affect the selection of the relevant variable. This can explain the slight decrease in the selection probability of $X_3$ in the presence of the strongly correlated $X_6$ and, consequently, the overall lower selection of $X_3$ and $X_4$ compared to $X_1$ and $X_2$, given the correlation between $X_1$ and $X_4$. In GUIDE, the difference between the dependent and independent cases is even smaller, because GUIDE is very powerful in detecting local interactions by literally including pairwise interactions when selecting splitting variables.

Marginal CATCH and Random Forest do poorly in this case, mainly because $X_5$ cannot be detected, as we explained earlier. More specifically, the dependence between $Y$ and $X_4$ in the $X_5 = 1$ case is the same as that between $Y$ and $X_2$ in the $X_5 = 0$ case. The linear terms of $X_1$ and $X_2$ have opposite signs in the decision function; therefore the marginal effects of $X_1$ and $X_4$ are obscured when $X_5$ is not identified as an important variable.

4.2 Improving The Performance of SVM and RF Classifiers

The criterion used in the CATCH algorithm is not based on classification accuracy. But if we use CATCH to screen out the irrelevant variables and only use the important

Table 4: Simulation results from Example 4.2 with sample size n = 400. For 1 ≤ i ≤ 6, the ith column gives the estimated probability of selection of variable Xi and the standard error based on 100 simulations, where Xi, i = 1, ..., 5, are relevant variables with cor(X1, X4) = .5; X6 is an irrelevant variable that is correlated with X3 with correlation .9. Column 7 gives the mean of the estimated probability of selection and that of the standard error for the irrelevant variables X7, ..., Xd.

Number of Selections (100 simulations)
METHOD            d     X1          X2          X3          X4          X5          X6          Xi, i ≥ 7
CATCH             10    0.76(0.04)  0.77(0.04)  0.73(0.04)  0.68(0.05)  0.99(0.01)  0.34(0.05)  0.07(0.02)
                  50    0.78(0.04)  0.77(0.04)  0.74(0.04)  0.63(0.05)  0.92(0.03)  0.57(0.05)  0.09(0.03)
                  100   0.76(0.04)  0.77(0.04)  0.8(0.04)   0.74(0.04)  0.82(0.04)  0.55(0.05)  0.09(0.03)
Marginal CATCH    10    0.35(0.05)  0.63(0.05)  0.8(0.04)   0.32(0.05)  0.09(0.03)  0.71(0.05)  0.12(0.03)
                  50    0.34(0.05)  0.71(0.05)  0.7(0.05)   0.3(0.05)   0.1(0.03)   0.6(0.05)   0.11(0.03)
                  100   0.34(0.05)  0.68(0.05)  0.75(0.04)  0.3(0.05)   0.1(0.03)   0.59(0.05)  0.1(0.03)
Random Forest     10    0.29(0.05)  0.48(0.05)  0.55(0.05)  0.2(0.04)   0.3(0.05)   0.47(0.05)  0.05(0.02)
                  50    0.26(0.04)  0.52(0.05)  0.57(0.05)  0.3(0.05)   0.24(0.04)  0.45(0.05)  0.08(0.03)
                  100   0.36(0.05)  0.61(0.05)  0.66(0.05)  0.37(0.05)  0.32(0.05)  0.46(0.05)  0.12(0.03)
GUIDE             10    0.96(0.02)  0.96(0.02)  0.94(0.02)  0.95(0.02)  0.96(0.02)  0.84(0.04)  0.3(0.05)
                  50    0.8(0.04)   0.92(0.03)  0.88(0.03)  0.79(0.04)  0.72(0.04)  0.68(0.05)  0.12(0.03)
                  100   0.81(0.04)  0.89(0.03)  0.9(0.03)   0.76(0.04)  0.75(0.04)  0.58(0.05)  0.08(0.03)

variables for classification, the performance of other algorithms, such as the Support Vector Machine (SVM) and Random Forest Classifiers (RFC), can be significantly improved. We compare the prediction performance of (1) SVM, (2) RFC, (3) CATCH followed by SVM, and (4) CATCH followed by RFC in this section. We label (3) and (4) as CATCH+SVM and CATCH+RFC, respectively. In this section we use the "svm" function in the "e1071" package in R with radial kernel to perform the SVM analysis, and the "randomForest" function in the "randomForest" package to perform the RFC analysis. We use simulation experiments to illustrate the improvement obtained by using CATCH prior to classifying $Y$.

Let the true model be
$$(X, Y) \sim P, \tag{4.2}$$
where $P$ will be specified later. In each of 100 Monte Carlo experiments, $n$ pairs $(X^{(i)}, Y^{(i)})$, $i = 1, \ldots, n$, are generated from the true model (4.2). Based on the simulated data $(X^{(i)}, Y^{(i)})$, $i = 1, \ldots, n$, the methods (1) SVM, (2) RFC, (3) CATCH+SVM, (4) CATCH+RFC are used to classify a future $Y^{(0)}$ for which we have available a corresponding $X^{(0)}$.

The integrated misclassification rate (IMR; Friedman [6]), defined below, is used to evaluate the performance of the classification algorithms. Let $F_{\mathbf{X},\mathbf{Y}}(\cdot)$ be the classifier constructed by a classification algorithm based on $\mathbf{X} = (X^{(1)}, \ldots, X^{(n)})$ and $\mathbf{Y} = (Y^{(1)}, \ldots, Y^{(n)})^T$.

Let $(X^{(0)}, Y^{(0)})$ be a future observation with the same distribution as $(X^{(i)}, Y^{(i)})$. We only know $X^{(0)} = x^{(0)}$ and want to classify $Y^{(0)}$. For a possible future observation $(x^{(0)}, y^{(0)})$, define
$$MR(\mathbf{X}, \mathbf{Y}; x^{(0)}, y^{(0)}) = P\left(F_{\mathbf{X},\mathbf{Y}}(x^{(0)}) \ne y^{(0)} \mid \mathbf{X}, \mathbf{Y}\right). \tag{4.3}$$
Our criterion is
$$IMR = E\,E_0\left[MR(\mathbf{X}, \mathbf{Y}; x^{(0)}, y^{(0)})\right], \tag{4.4}$$
where $E_0$ is with respect to the distribution of the random $(x^{(0)}, y^{(0)})$ and $E$ is with respect to the distribution of $(\mathbf{X}, \mathbf{Y})$.

This double expectation is evaluated using Monte Carlo simulation; the result is denoted IMR*. More precisely, $E_0$ is estimated using 5000 simulated $(x^{(0)}, y^{(0)})$ observations; then IMR* is computed on the basis of 100 simulated samples $(\mathbf{X}_i, \mathbf{Y}_i)$, $i = 1, \ldots, 100$.
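The double expectation in (4.4) translates into a simple nested Monte Carlo loop, as in the sketch below (our own minimal version; `generate` is a stand-in for the true model $P$, and smaller loop sizes than the paper's 5000/100 can be used while prototyping):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(7)

def generate(n, d=10):
    """Stand-in for the true model P: replace with model (4.1) or (4.9)."""
    X = rng.uniform(size=(n, d))
    y = (X[:, 0] + X[:, 1] > 1).astype(int)
    return X, y

def imr_star(make_classifier, n=400, n_train_sets=100, n_test=5000):
    """Estimate IMR* = E E0[MR]: outer expectation over training sets,
    inner expectation over a large test sample from the same distribution."""
    rates = []
    for _ in range(n_train_sets):
        X, y = generate(n)
        X0, y0 = generate(n_test)
        clf = make_classifier().fit(X, y)
        rates.append(np.mean(clf.predict(X0) != y0))
    return np.mean(rates)

print(imr_star(lambda: SVC(kernel="rbf")))
```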

Example 4.3. First consider model (4.1) in Section 4.1. Figure 3 shows IMR* versus the number of irrelevant variables for this example. The left panel shows that the IMR*'s of the SVM approach increase dramatically as the number of irrelevant variables increases, while the IMR*'s of CATCH+SVM are stable. Thus CATCH stabilizes the performance of SVM for this complex model. The right panel shows that CATCH also stabilizes the performance of RFC. These results indicate that CATCH can improve the prediction accuracy of other classification algorithms.

Figure 3: IMR*'s of SVM, RFC, CATCH+SVM and CATCH+RFC versus the number of predictors $d$ for the simulation study of model (4.1) in Example 4.1 (left panel: SVM vs. CATCH+SVM; right panel: RFC vs. CATCH+RFC; vertical axis: error rate). The sample size is $n = 400$.

Example 4.4 (Contaminated Logistic Model). Consider a binary response $Y \in \{0, 1\}$ with
$$P(Y = 0 \mid X = x) = \begin{cases} F(x) & \text{if } r = 0 \\ G(x) & \text{if } r = 1 \end{cases} \tag{4.5}$$
where $x = (x_1, \ldots, x_d)$,
$$r \sim \mathrm{Bernoulli}(\gamma), \tag{4.6}$$
$$F(x) = \left(1 + \exp\{-(0.25 + 0.5x_1 + x_2)\}\right)^{-1}, \tag{4.7}$$
$$G(x) = \begin{cases} 0.1 & \text{if } 0.25 + 0.5x_1 + x_2 < 0 \\ 0.9 & \text{if } 0.25 + 0.5x_1 + x_2 \ge 0 \end{cases} \tag{4.8}$$
and $x_1, x_2, \ldots, x_d$ are realizations of independent standard normal random variables. Hence, given $X^{(i)} = x^{(i)}$, $1 \le i \le n$, the joint distribution of $Y^{(1)}, Y^{(2)}, \ldots, Y^{(n)}$ is the contaminated logistic model with contamination parameter $\gamma$:

$$(1 - \gamma)\prod_{i=1}^{n}\left[F(x^{(i)})\right]^{Y^{(i)}}\left[1 - F(x^{(i)})\right]^{1 - Y^{(i)}} + \gamma\prod_{i=1}^{n}\left[G(x^{(i)})\right]^{Y^{(i)}}\left[1 - G(x^{(i)})\right]^{1 - Y^{(i)}}. \tag{4.9}$$

We perform a simulation study for $n = 200$, $d = 50$, and $\gamma = 0, 0.2, 0.4, 0.6, 0.8$ and 1, where $\gamma = 0$ corresponds to no contamination. Figure 4 shows the IMR*'s versus the different values of $\gamma$ for this example. CATCH improves the performance of the SVM and Random Forest classifiers for this contaminated logistic regression model.
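A generator for the contaminated logistic model, written by us directly from the displayed definitions (4.5)–(4.8), can be sketched as follows:

```python
import numpy as np

rng = np.random.default_rng(8)

def generate_contaminated(n, d=50, gamma=0.4):
    """Draw (X, Y) from (4.5)-(4.8): with prob. 1-gamma, P(Y=0|x)=F(x) (logistic);
    with prob. gamma, P(Y=0|x)=G(x) (0.1 or 0.9 depending on the sign of the index)."""
    X = rng.standard_normal(size=(n, d))
    index = 0.25 + 0.5 * X[:, 0] + X[:, 1]
    F = 1.0 / (1.0 + np.exp(-index))
    G = np.where(index < 0, 0.1, 0.9)
    r = rng.binomial(1, gamma, size=n)
    p_y0 = np.where(r == 0, F, G)
    y = (rng.uniform(size=n) > p_y0).astype(int)   # Y = 0 with probability p_y0
    return X, y

X, y = generate_contaminated(200)
print(np.bincount(y))
```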

Figure 4: IMR*'s of SVM, RFC, CATCH+SVM and CATCH+RFC versus the contamination parameter $\gamma \in \{0, 0.2, 0.4, 0.6, 0.8, 1\}$ for simulation model (4.9) in Example 4.4 (left panel: SVM vs. CATCH+SVM; right panel: RFC vs. CATCH+RFC; vertical axis: error rate).

5 CATCH Analysis of The ANN Data

In this section CATCH is applied to a data set available at the UCI data base (http://ftp.ics.uci.edu/pub/machine-learning-databases/thyroid-disease). The data are from a clinical trial to determine whether a patient is hypothyroid (see, e.g., Quinlan [15]). The dataset is split into two parts: a training set with 3772 observations and a testing set with 3428 observations. The response is a categorical variable with three classes: normal (not hypothyroid), hyperfunction, and subnormal functioning. There are 21 predictors, including 15 categorical predictors and 6 numerical predictors. The three classes are very imbalanced, with 92.5% normal patients. A good classifier should produce a significant decrease from the 7.5% error rate that can be obtained by classifying all observations to the normal class.

CATCH identifies 7 variables as important, including 4 continuous variables and 3 categorical variables. We consider three classification methods: 1. Support Vector Machine with radial kernel, denoted by SVM(r); 2. Support Vector Machine with polynomial kernel, denoted by SVM(p); 3. RFC. We apply these three methods on the training set to build classification models and use the test set to estimate the misclassification rate of the methods. We apply the methods with CATCH and without CATCH, respectively, where "with CATCH" means that only the 7 important variables identified by CATCH are used to build the model and "without CATCH" means that all 21 variables are used in the analysis.

Table 5: Misclassification rates of three classification methods (SVM(r), SVM(p), RFC) with CATCH and without CATCH.

                 SVM(r)   SVM(p)   RFC
w/o CATCH        0.052    0.063    0.024
with CATCH       0.037    0.055    0.015

Table 5 shows that if CATCH is used to select important variables before the classification methods are applied, the misclassification rates decrease. CATCH reduces the misclassification rates by 29%, 13% and 37% for SVM(r), SVM(p) and RFC, respectively. CATCH + RFC has the best performance. Although CATCH is not designed for improving classification accuracy, nonetheless it can help other classification methods perform better.
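The with/without-CATCH comparison can be reproduced schematically as below (our sketch with scikit-learn in place of the R functions named above; `selected` stands for the indices of the 7 variables CATCH keeps, which we do not list, and the tiny synthetic data at the end exists only so the sketch runs):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

def test_error(clf, X_tr, y_tr, X_te, y_te, cols=None):
    """Misclassification rate on the test set, optionally restricted to
    the CATCH-selected columns."""
    if cols is not None:
        X_tr, X_te = X_tr[:, cols], X_te[:, cols]
    return np.mean(clf.fit(X_tr, y_tr).predict(X_te) != y_te)

def compare(X_train, y_train, X_test, y_test, selected):
    # X_*, y_*: the ANN thyroid split (3772 / 3428 rows) with categorical
    # predictors already encoded numerically; selected: CATCH-kept columns.
    for name, clf in [("SVM(r)", SVC(kernel="rbf")),
                      ("SVM(p)", SVC(kernel="poly")),
                      ("RFC", RandomForestClassifier(n_estimators=500, random_state=0))]:
        full = test_error(clf, X_train, y_train, X_test, y_test)
        sub = test_error(clf, X_train, y_train, X_test, y_test, cols=selected)
        print(f"{name}: without CATCH {full:.3f}, with CATCH {sub:.3f}")

# tiny synthetic demo so the sketch runs end to end (placeholder data)
rng = np.random.default_rng(9)
Xtr, Xte = rng.normal(size=(300, 21)), rng.normal(size=(200, 21))
ytr, yte = (Xtr[:, 0] > 0).astype(int), (Xte[:, 0] > 0).astype(int)
compare(Xtr, ytr, Xte, yte, selected=[0, 1, 2, 3, 4, 5, 6])
```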

6 Properties of Local Contingency Efficacy

In this section we investigate the properties of the local contingency efficacy. For a univariate continuous predictor, Theorem 6.1 below shows that the local contingency efficacy is well-defined and equals 0 if the response variable is independent of the predictor. Consider an $R \times C$ contingency table. Let $n_{rc}$, $r = 1, \ldots, R$, $c = 1, \ldots, C$, be the observed frequency in the $(r, c)$-cell, and let $p_{rc}$ be the probability that an observation belongs to the $(r, c)$-cell. Let
$$p_r = \sum_{c=1}^{C} p_{rc}, \qquad q_c = \sum_{r=1}^{R} p_{rc}, \tag{6.1}$$
and
$$n_{r\cdot} = \sum_{c=1}^{C} n_{rc}, \quad n_{\cdot c} = \sum_{r=1}^{R} n_{rc}, \quad n = \sum_{r=1}^{R}\sum_{c=1}^{C} n_{rc}, \quad E_{rc} = \frac{n_{r\cdot}\, n_{\cdot c}}{n}. \tag{6.2}$$

To test whether the rows and the columns are independent, i.e.,
$$H_0: \forall\, r, c,\ p_{rc} = p_r q_c \quad \text{versus} \quad H_1: \exists\, r, c,\ p_{rc} \ne p_r q_c, \tag{6.3}$$
a standard test statistic is
$$\mathcal{X}^2 = \sum_{r=1}^{R}\sum_{c=1}^{C}\frac{(n_{rc} - E_{rc})^2}{E_{rc}}. \tag{6.4}$$

Under $H_0$, $\mathcal{X}^2$ converges in distribution to $\chi^2_{(R-1)(C-1)}$. Moreover, defining $0/0 = 0$, we have the well known result:

Lemma 6.1. As $n$ tends to infinity, $\mathcal{X}^2/n$ tends in probability to
$$\tau^2 \equiv \sum_{r=1}^{R}\sum_{c=1}^{C}\frac{(p_{rc} - p_r q_c)^2}{p_r q_c}.$$
Moreover, if $p_r q_c$ is bounded away from 0 and 1 as $n \rightarrow \infty$, and if $R$ and $C$ are fixed as $n \rightarrow \infty$, then $\tau^2 = 0$ if and only if $H_0$ is true.

Remark 6.1. Lemma 6.1 shows that the contingency efficacy $\zeta$ in (2.8) is well defined. Moreover, $\zeta = 0$ if and only if $X$ and $Y$ are independent.
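The quantities in (6.4) and Lemma 6.1 translate directly into code; the sketch below (ours, with an arbitrary example table) computes $\mathcal{X}^2$ and the plug-in estimate $\mathcal{X}^2/n$ of $\tau^2$ for an $R \times C$ count table, using the $0/0 = 0$ convention:

```python
import numpy as np

def chi2_and_tau2(counts):
    """Pearson X^2 from (6.4) and the plug-in estimate X^2/n of tau^2 in Lemma 6.1."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    E = np.outer(counts.sum(axis=1), counts.sum(axis=0)) / n   # E_rc = n_r. n_.c / n
    with np.errstate(divide="ignore", invalid="ignore"):
        x2 = np.where(E > 0, (counts - E) ** 2 / E, 0.0).sum()  # 0/0 := 0
    return x2, x2 / n

# a 2 x 3 table with mild row-column dependence
x2, tau2_hat = chi2_and_tau2([[30, 10, 10], [10, 20, 20]])
print(round(x2, 2), round(tau2_hat, 3))
```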

Theorem 6.1. Consider a categorical response $Y$ and a continuous predictor $X$ with density function $f(x)$ such that, for a fixed $h > 0$, $f(x) > 0$ for $x \in (x^{(0)} - h, x^{(0)} + h)$. Then:

(1) For fixed $h$, the estimated local section efficacy $\hat\zeta(x^{(0)}, h)$ defined in (2.5) converges almost surely to the section efficacy $\zeta(x^{(0)}, h)$ defined in (2.4).

(2) If $X$ and $Y$ are independent, then the local efficacy $\zeta(x^{(0)})$ defined in (2.4) is zero.

(3) If there exists $c \in \{1, \ldots, C\}$ such that $p_c(x) = P(Y = c \mid x)$ has a continuous derivative at $x = x^{(0)}$ with $p'_c(x^{(0)}) \ne 0$, then $\zeta(x^{(0)}) > 0$.

The $\chi^2$ statistic (6.4) can be written in terms of multinomial proportions $\hat p_{rc}$, with $\sqrt{n}(\hat p_{rc} - p_{rc})$, $1 \le r \le R$, $1 \le c \le C$, converging in distribution to a multivariate normal. By Taylor expansion of $T_n \equiv \sqrt{\chi^2/n}$ at $\tau = \mathrm{plim}\sqrt{\chi^2/n}$, we find:

Theorem 6.2.
$$\sqrt{n}\,(T_n - \tau) \longrightarrow_d N(0, v^2), \tag{6.5}$$
where the asymptotic variance $v^2$, which can be obtained from the Taylor expansion, is not needed in this paper.

Remark 6.2. The result (6.5) applies to the multivariate $\chi^2$-based efficacies $\zeta$ with $n$ replaced by the number of observations that goes into the computation of $\zeta$. In this case we use the notation $\chi^2_j$ for the chi-square statistic for the $j$th variable.

7 Consistency Of CATCH

Let $Y$ denote a variable to be classified into one of the categories $1, \ldots, C$ by using the $X_j$'s in the vector $X = (X_1, \ldots, X_d)$. Let $X_{-j} \equiv X - \{X_j\}$ denote $X$ without the $X_j$ component. Assume that the squared Pearson nonparametric correlation (Doksum and Samarov [4], Huang and Chen [9]) between $X_j$ and $X_{-j}$ is less than 0.5. We call the variable $X_j$ relevant for $Y$ if, conditionally given $X_{-j}$, $\mathcal{L}(Y \mid X_j = x)$ is not constant in $x$, and irrelevant otherwise. Our data are $(X^{(1)}, Y^{(1)}), \ldots, (X^{(n)}, Y^{(n)})$, iid as $(X, Y)$. A variable selection procedure is consistent if the probability that all the relevant variables are selected and none of the irrelevant variables are selected tends to one as $n$ tends to infinity.

We give a general nonparametric framework in which the variable selection rule based on the CATCH importance scores $s_j$ given in (3.8) is consistent.

When the predictors are not all categorical and CATCH involves tubes, we assume that the number of observations in each tube tends to $\infty$ as $n \rightarrow \infty$, and we assume that, for those importance scores that require the selection of a section size $h$, the selection is from a finite set $\{h_j\}$ with $\min_j h_j > h_0$ for some $h_0 > 0$. It follows that the number of observations $m_{jk}$ in the $k$th tube used to calculate the importance score $s_j$ for the variable $X_j$ tends to infinity as $n \rightarrow \infty$.

The key quantities that determine consistency are the limits $\lambda_j$ of $s_j$. That is,
$$\lambda_j = \tau(\zeta_j, P) = E_P[\zeta_j(X)] = \int \zeta_j(x)\,dP(x), \quad j = 1, \ldots, d,$$
where $\zeta_j(x)$ is the local contingency efficacy defined by (2.4) for numerical $X$'s and by (2.8) for categorical $X$'s, and $P$ is the probability distribution of $X$. Our estimate $s_j$ of $\tau_j$ is
$$s_j = \tau(\hat\zeta_j, \hat P) = \frac{1}{M}\sum_{k=1}^{M}\hat\zeta_j(X^*_k),$$
where $\hat\zeta_j$ is defined in Section 3 and $\hat P$ is the empirical distribution of $X$.

Let $d_0$ denote the number of $X$'s that are irrelevant for $Y$. Set $d_1 = d - d_0$ and, without loss of generality, reorder the $\lambda_j$ so that
$$\lambda_j > 0 \ \text{ if } j = 1, \ldots, d_1; \qquad \lambda_j = 0 \ \text{ if } j = d_1 + 1, \ldots, d. \tag{7.1}$$
Our variable selection rule is:
$$\text{"keep variable } X_j \text{ if } s_j \ge t \text{ for some threshold } t\text{".} \tag{7.2}$$

Rule (72) is consistent if and only if

P (minjled1sj ge t and max

jgtd1sj lt t) minusrarr 1 (73)

To establish (73) it is enough to show that

P (maxjgtd1sj gt t) minusrarr 0 (74)

andP (min

jled1sj le t) minusrarr 0 (75)

We call (74) and (75) type I and type II consistency respectively Type I and IIerror probabilities refer to probabilities of any false positives and any false negativesrespectively

71 Type I consistency

We consider (74) first and note that

P (maxjgtd1sj gt t) le

dsumj=d1+1

P (sj gt t) (76)

For the jth irrelevant variable Xj j gt d1 the results in Section 6 apply to the local

efficacy ζj(x) at a fixed point x The importance score sj defined in (38) is the averageof such efficacies over a sample of M random points xlowasta We assume that there is apoint x(0) such that for irrelevant xjrsquos and for n large enough ζj(x

(0)) is stochastically

larger in the right tail than the average Mminus1sumMa=1 ζj(x

lowasta) That is we assume that

there exist t gt 0 and n0 such that for n ge n0

p(sj gt t) le P (s(0)j gt t) j gt d1 (77)

The existence of such a point x(0) is established by considering the probability limit of

arg maxζj(xlowasta) xlowasta isin xlowast1 middot middot middot xlowastM

Note by Lemma (61) s(0)j rarrp 0 as n rarr infin It follows that we have established

24

(74) for t and d0 = d minus d1 that are fixed as n rarr infin To examine the interesting casewhere d0 rarr infin as n rarr infin we need large deviation results which we turn to next Inthe d0 rarrinfin case we consider t equiv tn that tend to infin as nrarrinfin and we use the weakerassumption there exists x(0) and n0 such that

P (sj gt tn) le P (s(0)j gt tn) (78)

for n ge n0 and each tn with limnrarrinfin tn = infin We also assume that the number ofcolumns C and rows R in the contingency table that define the efficacy χ2

jn given by

(64) stays fixed as nrarrinfin In this case P (s(0)j gt tn)rarr 0 provided each term

[T (j)rc ]2 equiv

[nrcErcradicnradicErc

]2in s2j = χ2

jn defined for Xj by (64) satisfies

P(|T (j)rc | gt tn

)minusrarr 0

This is because

P

radicsumr

sumc

[T(j)rc ]2 gt tn

le P(RC max

rc[T (j)rc ]2 gt t2n

)le RC maxP

(|T (j)rc | gt tn

radicRC)

and tn and tnradicRC are of the same order as n rarr infin By large deviation theory Hall

[8] and Jing et al [10] for some constant A

P (|T (j)rc | gt tn) =

2tminus1nradic2πeminust

2n2[1 + At3nn

minus 12 + o

(t3nnminus 1

2

)](79)

provided tn = o(n

16

) If (79) is uniform in j gt d1 then (76) and (79) implies that

type I consistency holds if

d0

(tnradic

2

)minus1eminus

t2n2 rarr 0 (710)

To examine (710) write d0 and tn as d0 = exp(nb)

and tn =radic

2nr By large deviationtheory (710) holds when r lt 1

6 Thus if 0 lt 1

2b lt r lt 1

6 then

d0

(tnradic

2

)minus1eminus

t2n2 = nminusr exp

(nb minus n2r

)rarr 0

25

Thus d0 = exp(nb) with b = 13ε for any ε gt 0 is the largest d0 can be to have consistency

For instance if there are exactly d0 = exp(n

14

)irrelevant variables and we use the

threshold tn =radic

2n17 the probability of selecting any of the irrelevant variables tends

to zero at the ratenminus

14 exp

(n

14 minus n

27

)72 Type II consistency

We now assume that for each relevant variable Xj there is a least favorable point x(1)

such that the local efficacy s(1)j equiv ζ(x(1)) is stochastically smaller than the importance

score sj = Mminus1sumMa=1 ζ(xlowasta) That is we assume that for each relevant variable Xj there

exists n1 and x(1) such that for n ge n1

P(s(1)j le tn

)ge P (sj le tn) j le d1 (711)

The existence of such a least favorable x(1) is established by considering the probabilitylimit of

arg minζj(xlowasta) xlowasta isin xlowast1 middot middot middot xlowastM

We again assume that R and C are fixed so that to show type II consistency it isenough to show that for each (r c) and j le d1

P(|T (j)rc | le tn

)rarr 0

This is because

P

(radicsumr

sumc

[T

(j)rc

]2le tn

)le P (minrc

[T

(j)rc

]2le t2n)

le RC maxrc P(|T (j)rc | le tn

)By Theorem 62

radicn(s(1)j minus τj

)rarrd N(0 ν2) τj gt 0 It follows that if tn and d1

are bounded above as n rarr infin and τj gt 0 then P(s(1)j le tn

)rarr 0 Thus type II

consistency holds in this case To examine the more interesting case where tn and d1are not bounded above and τj may tend to zero as nrarrinfin we will need to center the

T(j)rc Thus set

θ(j)n =

radicn(p(j)rc minus p(j)r p

(j)c

)radicp(j)r p

(j)c

(712)

26

Tn(j) = T (j)rc minus θ(j)n

where for simplicity the dependence of T(j)n and θ

(j)n on r and c is not indicated

Lemma 71

P

(minjled1|T (j)rc | le tn

)le

d1sumj=1

P

(|T (j)n | ge min

jled1θ(j)n minus tn

)Proof

P(

minjled1 |T(j)rc | le tn

)= P

(minjled1 |T

(j)n + θ

(j)n | le tn

)le P

(minjled1 |T

(j)n | minusmaxjled1 |θ

(j)n | le tn

)= P

(maxjled1 |T

(j)n | ge minjled1 |θ

(j)n | minus tn

)lesumd1

j=1 P(|T (j)n | ge minjled1 |θ

(j)n | minus tn

)

Setan = min

jled1

∣∣θ(j)n ∣∣minus tnthen by large deviation theory if an = o

(n

16

)

P (∣∣T (j)n

∣∣ ge an) prop

(anradic

2

)minus1expminusa2n2 (713)

where prop signifies asymptotic order as nrarrinfin an rarrinfin It follows that if (713) holdsuniformly in j then by Lemma 71

P

(minjled1

∣∣T (j)rc

∣∣ le tn

)prop d1

(anradic

2

)minus1expminusa2n2

Thus type II consistency holds when an = o(n

16

)and

d1

(anradic

2

)minus1expminusa2n2 rarr 0 (714)

In Section 71 tn is of order nr 0 lt r lt 16 For this tn type II consistency holds if an

is of order nr 0 lt r lt 16 and

d1nminusr expminusn2r2 rarr 0

27

If an = O(nk) with k ge 16 then P (

∣∣∣T (j)(n)

∣∣∣ ge an) tends to zero at a rate faster than in

(713) and consistency is ensured For instance if all relevant variables Xj have p(j)rc

fixed as nrarrinfin then minjled1

∣∣∣θ(j)n ∣∣∣ is of orderradicn and an is of order n

12 minus tn

73 Type I and II consistency

We see from Sections 71 and 72 that consistency holds under a variety of conditionsOne set of conditions is

Theorem 71 Under the regularization conditions of Sections 71 and 72 if

d0 = c0 exp(nb) tn = c1nr

minjled1 θ(j) minus tn = c2nr d1 = c3 exp

(nb)

for positive constants c0 c1 c2 and c3 and 0 lt 12b lt r lt 1

6 then CATCH is consistent

AppendixProof of Theorem 61

(1) Let Nh =

sumni=1 I(X(i) isin Nh(x

(0))) Then Nh sim Bi(n

int x(0)+hx(0)minush f(x)dx) and

Nh rarras infin as nrarrinfin

ζ(x(0) h) = nminus12radicX 2(x(0) h)

= (Nh n)12(N

h )minus12radicX 2(x(0) h)

Since Nh sim Bi(n

int x(0)+hx(0)minush f(x)dx) we have

Nh nrarr

int x(0)+h

x(0)minushf(x)dx (715)

By Lemma 61 and Table (1)

(Nh )minus12

radicX 2(x(0) h)rarr

( sumr=+minus

Csumc=1

(prc minus prqc)2

prqc

) 12

(716)

where

p+c = Pr(X gt x0 Y = c|X isin Nh(x(0))) pminusc = Pr(X le x0 Y = c|X isin Nh(x

(0)))

28

for r = +minus and c = 1 middot middot middot C

pr =Csumc=1

prc qc =sumr=+minus

prc

Actually p+ = Pr(X gt x0|X isin Nh(x(0))) pminus = Pr(X le x0|X isin Nh(x

(0)))qc = Pr(Y = c|X isin Nh(x

(0)))

By (715) and (716)

ζ(x(0) h)rarr(int x(0)+h

x(0)minushf(x)dx

) 12( sumr=+minus

Csumc=1

(prc minus prqc)2

prqc

) 12

equiv ζ(x(0) h)

(2) Since X and Y are independent prc = prqc for r = +minus and c = 1 middot middot middot C thenζ(x(0) h) = 0 for any h Hence

ζ(x(0)) = suphgt0ζ(x(0) h) = 0

(3) Recall that pc(x(0)) = Pr(Y = c|x(0)) Without loss of generality assume pprimec(x

(0)) gt0 Then there exists h such that

Pr(Y = c|x(0) lt X lt x(0) + h) gt Pr(Y = c|x(0) minus h lt X lt x(0))

which is equivelant to p+cp+ gt pminuscpminus This shows that ζ(x(0)) gt 0 since byLemma 61 ζ(x(0)) = 0 results in p+cp+ = pminuscpminus = pc

References

[1] Breiman L 2001 Random forests Machine Learning 45 5ndash32

[2] Breiman L Friedman J H Olshen R A Stone J 1984 Classification andregression trees Wadsworth Belmont

[3] Doksum K Tang S Tsui K-W 2008 Nonparametric variable selection Theearth algorithm Journal of the American Statistical Association 103(484) 1609ndash1620

[4] Doksum K A Samarov A 1995 Nonparametric estimation of global functionalsand a measure of explanatory power of covariates in regression Annals of Statistics23 1443ndash1473

29

[5] Doksum K A Schafer C 2006 Powerful choices Tuning parameter selectionbased on power Frontiers in Statistics Dedicated to Peter Bickel Editors J Fanand H L Koul Imperial College Press 113ndash141

[6] Friedman J 1991 Multivariate adaptive regression splines The annals of statis-tics 1ndash67

[7] Gao J Gijbels I 2008 Bandwidth selection in nonparametric kernel testingJournal of the American Statistical Association 103 (484) 1584ndash1594

[8] Hall P 1992 The Bootstrap and Edgeworth Expansions Springer New York

[9] Huang L-S Chen J 2008 Analysis of variance coefficient of determinationand f-test for local polynomial regression Annals of Statistics 36 2085ndash2109

[10] Jing B-Y Shao Q-M Wang Q 2003 Self-normalized cramer-type large devi-ations for independent random variables Annals of Probability 31 2167ndash2215

[11] Lin D Y Zeng D 2011 Correcting for population stratification in genomewideassociation studies Journal of the American Statistical Association 106 997ndash1008

[12] Loh W-Y 2009 Improve the precision of classification trees The Annals ofApplied Statistics 3 1710ndash1737

[13] Price A Patterson N Plenge R Weinblatt M Shadick N Reich D 2006Principal components analysis corrects for stratification in genome-wide associationstudies Nature genetics 38 (8) 904ndash909

[14] Qiao Z Zhou L Huang J 2008 Linear discriminant analysis for high dimen-sional Low Sample Size Data special issue of World Congress on Engineering2008

[15] Quinlan J R 1987 Simplifying decision trees International Journal of Man-Machine Studies 27(3) 221ndash234

[16] Schafer C Doksum K A 2009 Selecting local models in multiple regression bymaximizing power Metrika 69 283ndash304

[17] Tibshirani R 1996 Regression shrinkage and selection via the lasso J RoyalStatist Soc B 58 267ndash288

[18] Wang L Shen X 2007 On l1-norm multi-class support vector machinesmethodology and theory Journal of the American Statistical Association 102 583ndash594

[19] Zhang H H Liu Y Wu Y Zhu J 2008 Variable selection for the multicate-gory svm via adaptive sup-norm regularization Electronic Journal of Statistics 2149ndash167

30

  • Introduction
  • Importance Scores for Univariate Classification
    • Numerical Predictor Local Contingency Efficacy
    • Categorical Predictor Contingency Efficacy
      • Variable Selection for Classification Problems The CATCH Algorithm
        • Numerical Predictors
        • Numerical and Categorical Predictors
        • The Empirical Importance Scores
        • Adaptive Tube Distances
        • The CATCH Algorithm
          • Simulation Studies
            • Variable Selection with Categorical and Numerical Predictors
            • Improving The Performance of SVM and RF Classifiers
              • CATCH Analysis of The ANN Data
              • Properties of Local Contingency Efficacy
              • Consistency Of CATCH
                • Type I consistency
                • Type II consistency
                • Type I and II consistency
                  • Proof of Theorem 61
Page 11: Nonparametric Variable Selection and Classi cation: The ...lc436/Tang_Chen_etal_2012_CSDA.pdf · pruning, will choose a subset of optimal splitting variables to be the most signi

(1) Initializing Importance Scores. Let S = (s_1, ..., s_d) be importance scores for the predictors. Let S' = (s'_1, ..., s'_d), where s'_l is a threshold importance score corresponding to the model in which X_l is independent of Y given X_{-l}. Initialize S to (1, ..., 1) and S' to (0, ..., 0). Let M be the number of tube centers and m the number of observations within each tube.

(2) Tube Hunting Loop. For l = 1, ..., d, do (a), (b), (c) and (d) below.

(a) Selecting tube centers. For 0 < b ≤ 1, set M = [n^b], where [·] is the greatest integer function, and randomly sample M bootstrap points X*_1, ..., X*_M with replacement from the n observations. Set k = 1 and do (b).

(b) Constructing tubes. For 0 < a < 1, set m = [n^a]. Using the tube distance D_l defined in (3.10), select the tube size so that there are exactly m ≥ 10 observations inside the tube for X_l at X*_k.

(c) Calculating efficacy. If X_l is a numerical variable, let ζ_l(X*_k) be the estimated local contingency tube efficacy of Y on X_l at X*_k as defined in (3.5). If X_l is a categorical variable, let ζ_l(X*_k) be the estimated contingency tube efficacy of Y on X_l at X*_k as defined in (3.7).

(d) Thresholding. Let Y* denote a random variable which is identically distributed as Y but independent of X. Let (Y_0^(1), ..., Y_0^(n)) be a random permutation of (Y^(1), ..., Y^(n)); then Y_0 ≡ (Y_0^(1), ..., Y_0^(n)) can be viewed as a realization of Y*. Let ζ*_l(X*_k) be the estimated local contingency tube efficacy of Y_0 on X_l at X*_k, calculated as in (c) with Y_0 in place of Y.

If k < M, set k = k + 1 and go to (b); otherwise go to (e).

(e) Updating importance scores. Set

    s_l^new = (1/M) Σ_{k=1}^M ζ_l(X*_k),    s'_l^new = (1/M) Σ_{k=1}^M ζ*_l(X*_k).

Let S^new = (s_1^new, ..., s_d^new) and S'^new = (s'_1^new, ..., s'_d^new) denote the updates of S and S'.

(3) Iterations and Stopping Rule. Repeat (2) I times. Stop when the change in S^new is small. Record the last S and S' as S^stop and S'^stop, respectively.

(4) Deletion step. Using S^stop and S'^stop, delete X_l if s_l^stop − s'_l^stop < √2 SD(s'^stop). If d is small, we can generate several sets of S'^stop and then use the sample standard deviation of all the values in these sets as SD(s'^stop). See Remark 3.3.

End of the algorithm.
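To make the loop above concrete, here is a compressed R sketch of one pass of steps (2)(a)-(e), under simplifying assumptions stated in the comments: `tube_efficacy(x, y, l, center, m)` stands for the local contingency tube efficacy of (3.5)/(3.7) and is supplied by the user (it is not a packaged function), and the adaptive tube distance D_l is left to that function. This illustrates the structure of the algorithm only; it is not the authors' implementation.

```r
# One pass of the tube-hunting loop (step (2)), returning s_l and s'_l for every variable.
tube_hunting_pass <- function(x, y, tube_efficacy, a = 0.15, b = 1) {
  n <- nrow(x); d <- ncol(x)
  M <- floor(n^b)                                   # number of tube centers, step (a)
  m <- max(10, floor(n^a))                          # tube size, step (b)
  s <- s_prime <- numeric(d)
  for (l in seq_len(d)) {
    centers <- x[sample(n, M, replace = TRUE), , drop = FALSE]   # bootstrap tube centers
    y_perm <- sample(y)                             # global permutation for the threshold score, step (d)
    zeta      <- numeric(M)
    zeta_star <- numeric(M)
    for (k in seq_len(M)) {
      zeta[k]      <- tube_efficacy(x, y,      l, centers[k, ], m)   # step (c)
      zeta_star[k] <- tube_efficacy(x, y_perm, l, centers[k, ], m)   # step (d)
    }
    s[l]       <- mean(zeta)                        # step (e): s_l^new
    s_prime[l] <- mean(zeta_star)                   #           s'_l^new
  }
  list(s = s, s_prime = s_prime)
}
# Deletion rule (step (4)): drop X_l when s[l] - s_prime[l] < sqrt(2) * sd(s_prime).
```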

Remark 3.2. Step (d) needs to be replaced by (d′) below when a global permutation of Y does not result in an appropriate "local null distribution". For example, when the sample sizes of the different classes are extremely unequal, which is known as the unbalanced classification problem, it is very likely to obtain a pure class of Y in a local region after the global permutation, so the threshold S' will be spuriously low and irrelevant variables will be falsely selected into the model. The approach in (d′) is more nonparametric but more time consuming.

(d′) Local Thresholding. Let Y* denote a random variable which is distributed as Y but independent of X_l given X_{-l}. Let (Y_0^(1), ..., Y_0^(m)) be a random permutation of the Y's inside the tube; then (Y_0^(1), ..., Y_0^(m)) can be viewed as a realization of Y*. If X_l is a numerical variable, let ζ'_l(X*_k) be the estimated local contingency tube efficacy of Y* on X_l at X*_k as defined in (3.5). If X_l is a categorical variable, let ζ'_l(X*_k) be the estimated contingency tube efficacy of Y* on X_l at X*_k as defined in (3.7).

Remark 3.3 (The threshold). Under the assumption H_0^(l) that Y is independent of X_l given X_{-l} in the tube T_l, s_l and s'_l are nearly independent and SD(s_l − s'_l) ≈ √2 SD(s'_l). Thus the rule that deletes X_l when s_l^stop − s'_l^stop < √2 SD(s'^stop) is similar to the one standard deviation rule of CART and GUIDE. The choice of threshold is also discussed in Section 7. Alternatively, we calculate a p-value for each covariate by a permutation test: we permute the response variable, obtain the importance scores for the permuted data, and then calculate how extreme s_l^stop is compared to the permuted scores. We then identify a covariate as important if its p-value is less than 0.1.
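The permutation p-value in Remark 3.3 can be computed directly from the importance scores. The R sketch below is illustrative only: `catch_score(x, y, l)` stands for a user-supplied function returning the importance score s_l of variable X_l (it is not part of any package), and B permutations of the response are used.

```r
# Permutation p-value for covariate l (Remark 3.3), given a scoring function catch_score.
perm_pvalue <- function(x, y, l, catch_score, B = 200) {
  s_obs <- catch_score(x, y, l)                      # observed importance score
  s_perm <- replicate(B, {
    y_perm <- sample(y)                              # permute the response
    catch_score(x, y_perm, l)                        # score under the null
  })
  mean(s_perm >= s_obs)                              # proportion of permuted scores at least as large
}
# A covariate would be flagged as important when perm_pvalue(...) < 0.1.
```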

Remark 3.4 (Large d, small n). A key idea in this paper is that when determining the importance of the variable X_j, we restrict (condition) the vector X_{-j} to a relatively small set. This is done to address the problem of spurious correlation, which is of great concern in genomics (Price et al. [13], Lin and Zeng [11]). In such genomic studies the number of predictors d typically greatly exceeds the sample size n. For numerical variables this can be dealt with by replacing X_{-j} with a vector of principal components explaining most of the variation in X_{-j}. Although our paper does not deal with the d ≫ n case directly, this remark shows that the method is also applicable to the genome-wide association studies described in Price et al. [13].

Remark 3.5. We now give the computational complexity of the algorithm. Recall the notation: I is the number of repetitions, M is the number of tube centers, m is the tube size, d is the dimension, and n is the sample size. The number of operations needed for the algorithm is I × d × M × (2b + 2c + 2d) + 2e, where 2b, 2c, 2d and 2e denote the numbers of operations required for steps 2(b), 2(c), 2(d) and 2(e). The computational complexities of steps 2(b), 2(c), 2(d) and 2(e) are O(nd), O(m), O(m) and O(M), respectively. Since nd ≫ m, the required computation is dominated by step 2(b), and the computational complexity of the algorithm is O(IMnd²). Solving a least squares problem requires O(nd²) operations, so our algorithm has higher computational complexity by a factor of IM compared to least squares. However, the computations for the M tube centers are independent of each other and can be performed in parallel.


4 Simulation Studies

4.1 Variable Selection with Categorical and Numerical Predictors

Example 4.1. Consider d ≥ 5 predictors X1, X2, ..., Xd and a categorical response variable Y. The fifth predictor X5 is a Bernoulli random variable which takes the two values 0 and 1 with equal probability. All other predictors are uniformly distributed on [0, 1].

When X5 = 0, the dependent variable Y is related to the predictors in the following fashion:

    Y = 0  if X1 − X2 − 0.25 sin(1.6πX2) ≤ −0.5,
        1  if −0.5 < X1 − X2 − 0.25 sin(1.6πX2) ≤ 0,
        2  if 0 < X1 − X2 − 0.25 sin(1.6πX2) ≤ 0.5,
        3  if X1 − X2 − 0.25 sin(1.6πX2) > 0.5.        (4.1)

When X5 = 1, Y depends on X3 and X4 in the same way as above, with X1 replaced by X3 and X2 replaced by X4. The other variables X6, ..., Xd are irrelevant noise variables. All d predictors are independent.
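For concreteness, the following R sketch generates one data set from model (4.1). It is only an illustration of the data-generating mechanism described above, using the cut points and the factor 1.6π as reconstructed in (4.1); it is not code from the paper.

```r
# Simulate n observations from model (4.1) with d predictors (d >= 5).
sim_model_41 <- function(n = 400, d = 10) {
  x <- matrix(runif(n * d), n, d)            # predictors uniform on [0, 1]
  x[, 5] <- rbinom(n, 1, 0.5)                # X5 is Bernoulli(1/2)
  # active pair depends on X5: (X1, X2) when X5 = 0, (X3, X4) when X5 = 1
  u <- ifelse(x[, 5] == 0, x[, 1], x[, 3])
  v <- ifelse(x[, 5] == 0, x[, 2], x[, 4])
  z <- u - v - 0.25 * sin(1.6 * pi * v)      # boundary function in (4.1)
  y <- cut(z, breaks = c(-Inf, -0.5, 0, 0.5, Inf), labels = 0:3)
  list(x = x, y = factor(y))
}
dat <- sim_model_41(n = 400, d = 50)
table(dat$y)                                  # class frequencies
```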

In addition to the presence of the noise variables X6, ..., Xd, this example is challenging in two respects. First, there is an interaction effect between the continuous predictors X1, ..., X4 and the categorical predictor X5. Such an interaction effect may arise in practice: for example, suppose Y represents the effect resulting from multiple treatments X1, ..., X4 and X5 represents the gender of a patient; our model then allows males and females to respond differently to treatments. Secondly, the classification boundaries for fixed X5 are highly nonlinear, as shown in Figure 1. This design demonstrates that our method can detect nonlinear dependence between the response and the predictors.

Table 2 shows the number of simulations in which the predictors X1, ..., X5 are correctly kept in the model and, in column 6, the average number of simulations in which the d − 5 irrelevant predictors are falsely selected into the model, out of 100 simulation trials from model (4.1). In the simulation, n = 400 and d = 10, 50 and 100 are used. We compare CATCH with other methods, including Marginal CATCH, the Random Forest classifier (RFC) (Breiman [1]) and GUIDE (Loh [12]). For Marginal CATCH we test the association between Y and X_l without conditioning on X_{-l}. RFC is well known as a powerful tool for prediction; here we use the importance scores from RFC for variable selection. We create an additional 30 noise variables and use their average importance score multiplied by a factor of 2 as a threshold (Doksum et al. [3]) to decide whether a variable should be selected into the model. The noise variables are generated from uniform and Bernoulli distributions for numerical and categorical predictors, respectively; this thresholding rule is sketched in code below.
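A minimal R sketch of the Random-Forest-based selection rule just described, assuming a predictor matrix `x` and a factor response `y`; the `randomForest` package with permutation importance (MeanDecreaseAccuracy) is used, and the threshold of twice the mean noise-variable importance follows Doksum et al. [3]. For simplicity only the uniform noise variables used for numerical predictors are shown.

```r
library(randomForest)

# Variable selection via RFC importance scores with 30 added noise variables.
rfc_select <- function(x, y, n_noise = 30) {
  if (is.null(colnames(x))) colnames(x) <- paste0("X", seq_len(ncol(x)))
  n <- nrow(x)
  noise <- matrix(runif(n * n_noise), n, n_noise)     # noise columns for numerical predictors
  colnames(noise) <- paste0("noise", seq_len(n_noise))
  rf <- randomForest(cbind(x, noise), y, importance = TRUE)
  imp <- importance(rf, type = 1)[, 1]                # permutation importance scores
  threshold <- 2 * mean(imp[colnames(noise)])         # twice the average noise importance
  names(imp[colnames(x)][imp[colnames(x)] > threshold])
}
```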

The main goal of GUIDE is to solve the variable selection bias problem in regression and classification tree methods. As an illustration, if a categorical predictor takes K distinct values, then an exhaustive splitting search considers 2^(K−1) − 1 possible splits, and the predictor is therefore


Table 2: Simulation results from Example 4.1 with sample size n = 400. For 1 ≤ i ≤ 5, the ith column gives the estimated probability of selection of the relevant variable Xi and the standard error, based on 100 simulations. Column 6 gives the mean of the estimated probability of selection and the standard error for the irrelevant variables X6, ..., Xd.

                               Number of Selections (100 simulations)
METHOD            d     X1           X2           X3           X4           X5           Xi, i ≥ 6
CATCH             10    0.82 (0.04)  0.75 (0.04)  0.78 (0.04)  0.82 (0.04)  1 (0)        0.05 (0.02)
                  50    0.76 (0.04)  0.76 (0.04)  0.87 (0.03)  0.84 (0.04)  0.96 (0.02)  0.08 (0.03)
                  100   0.77 (0.04)  0.73 (0.04)  0.82 (0.04)  0.8 (0.04)   0.83 (0.04)  0.09 (0.03)
Marginal CATCH    10    0.77 (0.04)  0.67 (0.05)  0.7 (0.05)   0.8 (0.04)   0.06 (0.02)  0.1 (0.03)
                  50    0.69 (0.05)  0.7 (0.05)   0.76 (0.04)  0.79 (0.04)  0.07 (0.03)  0.1 (0.03)
                  100   0.78 (0.04)  0.76 (0.04)  0.73 (0.04)  0.73 (0.04)  0.12 (0.03)  0.1 (0.03)
Random Forest     10    0.71 (0.05)  0.51 (0.05)  0.47 (0.05)  0.59 (0.05)  0.37 (0.05)  0.06 (0.02)
                  50    0.64 (0.05)  0.61 (0.05)  0.65 (0.05)  0.58 (0.05)  0.28 (0.04)  0.08 (0.03)
                  100   0.66 (0.05)  0.58 (0.05)  0.59 (0.05)  0.65 (0.05)  0.17 (0.04)  0.11 (0.03)
GUIDE             10    0.96 (0.02)  0.98 (0.01)  0.93 (0.03)  0.99 (0.01)  0.92 (0.03)  0.33 (0.05)
                  50    0.82 (0.04)  0.85 (0.04)  0.9 (0.03)   0.89 (0.03)  0.67 (0.05)  0.12 (0.03)
                  100   0.83 (0.04)  0.84 (0.04)  0.86 (0.03)  0.83 (0.04)  0.61 (0.05)  0.08 (0.03)

Figure 1: The relationship between Y and X1, X2 for model (4.1). The four areas represent the four values of Y.

biased toward being selected. The way GUIDE alleviates this bias is to employ lack-of-fit tests, i.e., χ²-tests applicable to both numerical and categorical predictors, with the p-values adjusted by a bootstrap procedure. Since GUIDE has negligible selection bias, it can include tests for local pairwise interactions. Furthermore, it achieves sensitivity to local curvature by fitting a linear model at each node.

The results in Table 2 show that Marginal CATCH does a very poor job of selecting X5. This illustrates the importance of local conditioning. RFC does better than Marginal CATCH but also has difficulties, while CATCH and GUIDE successfully identify X5 along with X1 to X4 with high frequency. When d = 10, GUIDE does very well for the relevant variables but selects too many irrelevant variables. As d increases, surprisingly, GUIDE selects fewer irrelevant variables; it becomes more cautious as the dimension d increases. CATCH performs very well in the high dimensional cases (d = 50 and d = 100). This indicates that the CATCH algorithm is robust to the intricate interaction structure between the predictors.

In model (4.1) above the predictors interact in a hierarchical way. When X5 = 0, the response is related to X1 and X2 as shown in Figure 1, where the two axes are the two relevant predictors and the four areas partitioned by the curves correspond to different values of Y. When X5 = 1, the response depends on X3 and X4 in the same way. This type of hierarchical interaction between categorical and numerical predictors is difficult to detect even by state-of-the-art classification methods such as support vector machines (SVM) and RFC. In particular, RFC produces an importance score vector that identifies the four numerical variables X1, ..., X4 as very significant but often leaves the categorical variable X5 unidentified, as we see in Table 2.

The CATCH algorithm, however, performs very well for this complicated model. We now explain why CATCH is able to identify X5 as an important variable. To explore the association between X5 and Y, we construct M contingency tables of 2 rows and 4 columns based on M randomly centered tubes, calculate contingency efficacy scores, and then average the scores. Figure 2 (a) and (b) illustrate how one of the contingency efficacy scores is calculated. 1000 data points are generated from model (4.1) and split into two sets according to x5: (a) shows approximately half of the points, those with x5 = 0, in the (x1, x2) dimensions, and (b) shows the remaining points, those with x5 = 1, in the (x3, x4) dimensions. The data points are colored according to Y. This representation enables us to clearly see the classification boundaries (grey lines). A randomly selected tube center with x5 = 0 is shown as a star in (a). The data points within the tube, i.e., close to the tube center in terms of X_{-5}, are highlighted by larger dots. Since the distance defining the tube does not involve X5, the larger dots have x5 equal to 0 or 1 and are present in both (a) and (b). The counts of the larger dots in different colors in (a) and (b) correspond to the cell counts of the two rows x5 = 0 and x5 = 1 in the contingency table. As we can see, the larger dots in (a) are mostly red and the larger dots in (b) are mostly blue and orange, which shows that the distribution of Y depends on the value of X5, and the contingency efficacy score is high. We expect the contingency efficacy score to be low only if the tube center has similar values in (x1, x2) and (x3, x4). Such tube centers would be very few among the M tube centers since they are randomly chosen. Thus the average contingency efficacy score is high and X5 is identified as an important variable.
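The following R sketch mirrors this construction for a single tube: it finds the m points nearest to a tube center in the coordinates other than X5, cross-tabulates X5 against Y inside the tube, and computes a chi-square-based efficacy. The unweighted Euclidean tube distance is an assumption made here for simplicity (in CATCH the distance weights are adapted iteratively), and `sim_model_41` refers to the earlier sketch after model (4.1).

```r
# One-tube contingency efficacy for the categorical predictor X5 (illustrative).
tube_efficacy_x5 <- function(x, y, center, m = 50) {
  d5 <- sweep(x[, -5, drop = FALSE], 2, center[-5])     # differences in the X_{-5} coordinates
  dist <- sqrt(rowSums(d5^2))                           # unweighted tube distance (assumption)
  inside <- order(dist)[seq_len(m)]                     # the m nearest observations form the tube
  tab <- table(x[inside, 5], y[inside])                 # 2 x C contingency table within the tube
  # assumes both values of X5 occur inside the tube
  chisq <- suppressWarnings(chisq.test(tab)$statistic)  # Pearson chi-square statistic
  sqrt(chisq / m)                                       # efficacy-type score: sqrt(X^2 / tube size)
}
dat <- sim_model_41(n = 1000, d = 10)
center <- dat$x[sample(nrow(dat$x), 1), ]               # a random tube center
tube_efficacy_x5(dat$x, dat$y, center)
```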

To contrast this example with a simpler data-generating model, suppose the effects of the predictors are additive, that is, the categorical variable is related to the response as a term added to the effect of the numerical predictors. For this type of simpler model, the RFC and GUIDE algorithms can efficiently identify the relevant predictors, but so can the CATCH algorithm. We do not include the simulation results for additive models here.

Remark 4.1. We observe that the CATCH algorithm is not very sensitive to the choice of the number of tube centers M or the tube size m = [n^a], where 0 < a ≤ 1 and [·] is the greatest integer function. For Example 4.1 we perform the simulation using M = n and M = n/2, and a = 0.1, 0.15, 0.2, 0.3, 0.5. The results are summarized in Table 3. The CATCH algorithm performs similarly for M = n and M = n/2, with M = n having a small winning edge. We generally suggest using M = n if computational resources are not a major concern or if n is not large (≤ 200). When computational cost is a concern (Remark 3.5), one can reduce the number of tube centers to M = n/2 as a compromise between detection power and computational cost. Table 3 also shows that the tube size parameter a works well in the range between 0.1 and 0.3, while the detection power for X5 decreases for a = 0.5, where the tube is too big to achieve local conditioning; this observation is consistent with the comparison to Marginal CATCH in Table 2. On the other hand, the tube size should not be so small that the detection power becomes insufficient. We suggest using a = 0.1 or 0.15 for sample sizes n ≥ 400, or m = 50 for smaller sample sizes. In light of these guidelines we use M = n/2 = 200 and a = 0.15 for Examples 4.1-4.3, and M = n = 200 and m = 50 for Example 4.4.

Table 3: Sensitivity study of CATCH for Example 4.1 (n = 400, d = 50) using different numbers of tube centers M and tube sizes m = [n^a]. For 1 ≤ i ≤ 5, the ith column gives the number of times Xi was selected. Column 6 gives the percentage of times irrelevant variables were selected.

                 Number of Selections (100 simulations)
M          a      X1   X2   X3   X4   X5   Xi, i ≥ 6
M = 200    0.10   75   67   77   69   92    7.8
           0.15   76   76   87   84   96    8.2
           0.20   85   84   90   87   93    9.9
           0.30   89   87   95   89   85    9.6
           0.50   89   90   96   93   62    9.9
M = 400    0.10   74   76   82   77   92    7.2
           0.15   85   82   89   87   94    8.6
           0.20   87   88   90   86   96    9.1
           0.30   88   89   91   91   92    9.4
           0.50   89   91   94   93   61   10.2

In Example 4.1 all predictors are independent. In a similar setting, we next consider the case where the predictors are dependent.

Example 4.2. We generate standard normal random variables Zi, i = 1, ..., d, in such a way that cor(Z1, Z4) = 0.5, cor(Z3, Z6) = 0.9, and the other Zi's are independent. We set Xi = Φ(Zi) for i = 1, ..., d, where Φ(·) is the CDF of a standard normal; therefore the Xi are marginally uniformly distributed, as in Example 4.1. The correlation structure of the Xi's is similar to that of the Zi's. The dependent variable Y is generated in the same way as in Example 4.1.
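A short R sketch of the predictor-generating mechanism in Example 4.2, assuming that only the pairs (Z1, Z4) and (Z3, Z6) are correlated. How X5 is made binary is an assumption of this sketch (the paper maps all Zi through Φ): here Z5 is thresholded at 0 so that X5 is Bernoulli(1/2) as in Example 4.1. The response would then be produced exactly as in model (4.1).

```r
# Generate predictors for Example 4.2: correlated normals mapped through pnorm().
sim_predictors_42 <- function(n = 400, d = 50) {
  z <- matrix(rnorm(n * d), n, d)
  # induce cor(Z1, Z4) = 0.5 and cor(Z3, Z6) = 0.9; other columns stay independent
  z[, 4] <- 0.5 * z[, 1] + sqrt(1 - 0.5^2) * z[, 4]
  z[, 6] <- 0.9 * z[, 3] + sqrt(1 - 0.9^2) * z[, 6]
  x <- pnorm(z)                        # Xi = Phi(Zi) is marginally Uniform(0, 1)
  x[, 5] <- as.numeric(z[, 5] > 0)     # binary X5 (assumption for this sketch)
  x
}
round(cor(sim_predictors_42())[c(1, 3), c(4, 6)], 2)   # check the induced correlations
```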

The results of Example 4.2 are given in Table 4. Comparing with the results in Table 2, both CATCH and GUIDE do well in the presence of dependence between the numerical variables. The explanation is two-fold. First, both methods identify the categorical X5 as an important predictor most of the time, which makes the identification of the other relevant predictors possible. Second, conditioning on X5, Y depends on a pair of variables ("X1 and X2" or "X3 and X4") jointly, and in particular this dependence is nonlinear, which makes both methods less affected by the correlation introduced here.


Figure 2: Sample scatter plots of (x1, x2) and (x3, x4) for simulated data using model (4.1), both on [0, 1] × [0, 1]. The left panel (a) X5 = 0 includes all the pairs (x1, x2) with x5 = 0, while the right panel (b) X5 = 1 includes all the pairs (x3, x4) with x5 = 1. The larger dots in the two plots are within the tube around a tube center, shown as the star.

We now explain the latter in detail for both methods. In CATCH, the algorithm adaptively weighs the effects of the predictors by their importance scores when defining the tubes on which they are conditioned; therefore identifying either of the two variables in a pair helps to identify the other, whereas an irrelevant variable, though strongly associated with one of the relevant variables, is down-weighted iteratively and does not strongly affect the selection of the relevant variable. This explains the slightly decreasing selection probability for X3 in the presence of the strongly correlated X6, and consequently the overall lower selection of X3 and X4 compared to X1 and X2, given the correlation between X1 and X4. In GUIDE, the difference between the dependent and independent cases is even smaller, because GUIDE is very powerful in detecting local interactions by literally including pairwise interactions when selecting splitting variables.

Marginal CATCH and Random Forest do poorly in this case, mainly because X5 cannot be detected, as we explained earlier. More specifically, the dependence between Y and X4 in the X5 = 1 case is the same as that between Y and X2 in the X5 = 0 case. The linear terms of X1 and X2 have opposite signs in the decision function; therefore the marginal effects of X1 and X4 are obscured when X5 is not identified as an important variable.

4.2 Improving The Performance of SVM and RF Classifiers

The criterion used in the CATCH algorithm is not based on classification accuracy. But if we use CATCH to screen out the irrelevant variables and only use the important

Table 4: Simulation results from Example 4.2 with sample size n = 400. For 1 ≤ i ≤ 6, the ith column gives the estimated probability of selection of variable Xi and the standard error based on 100 simulations, where Xi, i = 1, ..., 5, are relevant variables with cor(X1, X4) = .5; X6 is an irrelevant variable that is correlated with X3 with correlation .9. Column 7 gives the mean of the estimated probability of selection and that of the standard error for the irrelevant variables X7, ..., Xd.

                               Number of Selections (100 simulations)
METHOD            d     X1           X2           X3           X4           X5           X6           Xi, i ≥ 7
CATCH             10    0.76 (0.04)  0.77 (0.04)  0.73 (0.04)  0.68 (0.05)  0.99 (0.01)  0.34 (0.05)  0.07 (0.02)
                  50    0.78 (0.04)  0.77 (0.04)  0.74 (0.04)  0.63 (0.05)  0.92 (0.03)  0.57 (0.05)  0.09 (0.03)
                  100   0.76 (0.04)  0.77 (0.04)  0.8 (0.04)   0.74 (0.04)  0.82 (0.04)  0.55 (0.05)  0.09 (0.03)
Marginal CATCH    10    0.35 (0.05)  0.63 (0.05)  0.8 (0.04)   0.32 (0.05)  0.09 (0.03)  0.71 (0.05)  0.12 (0.03)
                  50    0.34 (0.05)  0.71 (0.05)  0.7 (0.05)   0.3 (0.05)   0.1 (0.03)   0.6 (0.05)   0.11 (0.03)
                  100   0.34 (0.05)  0.68 (0.05)  0.75 (0.04)  0.3 (0.05)   0.1 (0.03)   0.59 (0.05)  0.1 (0.03)
Random Forest     10    0.29 (0.05)  0.48 (0.05)  0.55 (0.05)  0.2 (0.04)   0.3 (0.05)   0.47 (0.05)  0.05 (0.02)
                  50    0.26 (0.04)  0.52 (0.05)  0.57 (0.05)  0.3 (0.05)   0.24 (0.04)  0.45 (0.05)  0.08 (0.03)
                  100   0.36 (0.05)  0.61 (0.05)  0.66 (0.05)  0.37 (0.05)  0.32 (0.05)  0.46 (0.05)  0.12 (0.03)
GUIDE             10    0.96 (0.02)  0.96 (0.02)  0.94 (0.02)  0.95 (0.02)  0.96 (0.02)  0.84 (0.04)  0.3 (0.05)
                  50    0.8 (0.04)   0.92 (0.03)  0.88 (0.03)  0.79 (0.04)  0.72 (0.04)  0.68 (0.05)  0.12 (0.03)
                  100   0.81 (0.04)  0.89 (0.03)  0.9 (0.03)   0.76 (0.04)  0.75 (0.04)  0.58 (0.05)  0.08 (0.03)

variables for classification, the performance of other algorithms such as the Support Vector Machine (SVM) and Random Forest Classifier (RFC) can be significantly improved. In this section we compare the prediction performance of (1) SVM, (2) RFC, (3) CATCH followed by SVM, and (4) CATCH followed by RFC. We label (3) and (4) CATCH+SVM and CATCH+RFC, respectively. In this section we use the "svm" function in the "e1071" package in R with a radial kernel to perform the SVM analysis, and the "randomForest" function in the "randomForest" package to perform the RFC analysis. We use simulation experiments to illustrate the improvement obtained by using CATCH prior to classifying Y.
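The comparison can be set up in R as sketched below. This is only an illustration under stated assumptions: `catch_select(x, y)` stands for a variable-selection step returning the indices of the variables kept by CATCH (it is not a function in any package), while the `e1071` and `randomForest` calls are the standard ones named in the text.

```r
library(e1071)
library(randomForest)

# Fit each classifier with and without a preliminary screening step; report test error rates.
compare_methods <- function(x, y, x_test, y_test, catch_select) {
  keep <- catch_select(x, y)                           # indices of variables kept by screening
  fits <- list(
    SVM       = svm(x, y, kernel = "radial"),
    RFC       = randomForest(x, y),
    CATCH_SVM = svm(x[, keep, drop = FALSE], y, kernel = "radial"),
    CATCH_RFC = randomForest(x[, keep, drop = FALSE], y)
  )
  err <- function(fit, newx) mean(predict(fit, newx) != y_test)
  c(SVM       = err(fits$SVM, x_test),
    RFC       = err(fits$RFC, x_test),
    CATCH_SVM = err(fits$CATCH_SVM, x_test[, keep, drop = FALSE]),
    CATCH_RFC = err(fits$CATCH_RFC, x_test[, keep, drop = FALSE]))
}
```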

Let the true model be

    (X, Y) ∼ P,    (4.2)

where P will be specified later. In each of 100 Monte Carlo experiments, n pairs (X^(i), Y^(i)), i = 1, ..., n, are generated from the true model (4.2). Based on the simulated data (X^(i), Y^(i)), i = 1, ..., n, the methods (1) SVM, (2) RFC, (3) CATCH+SVM and (4) CATCH+RFC are used to classify a future Y^(0) for which we have available a corresponding X^(0).

The integrated misclassification rate (IMR; Friedman [6]) defined below is used to evaluate the performance of the classification algorithms. Let F_{X,Y}(·) be the classifier constructed by a classification algorithm based on X = (X^(1), ..., X^(n)) and Y = (Y^(1), ..., Y^(n))^T. Let (X^(0), Y^(0)) be a future observation with the same distribution as (X^(i), Y^(i)). We only know X^(0) = x^(0) and want to classify Y^(0). For a possible future observation (x^(0), y^(0)), define

    MR(X, Y; x^(0), y^(0)) = P(F_{X,Y}(x^(0)) ≠ y^(0) | X, Y).    (4.3)

Our criterion is

    IMR = E E_0[MR(X, Y; x^(0), y^(0))],    (4.4)

where E_0 is with respect to the distribution of (x^(0), y^(0)) and E is with respect to the distribution of (X, Y).

This double expectation is evaluated using Monte Carlo simulation; the result is denoted by IMR*. More precisely, E_0 is estimated using 5000 simulated (x^(0), y^(0)) observations, and IMR* is then computed on the basis of 100 simulated samples (X_i, Y_i), i = 1, ..., 100.
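A compact R sketch of this double Monte Carlo loop, assuming a data generator `simulate_P(n)` for the model P (for instance `sim_model_41` above) and a function `fit_and_err(train, test)` that fits the classifiers and returns their test misclassification rates (for example, a wrapper around `compare_methods`); the counts of 100 training samples and 5000 test points follow the text.

```r
# Monte Carlo estimate of IMR* for each method.
estimate_imr <- function(simulate_P, fit_and_err, n = 400, n_rep = 100, n_test = 5000) {
  reps <- replicate(n_rep, {
    train <- simulate_P(n)               # one simulated training sample (X, Y)
    test  <- simulate_P(n_test)          # large sample approximating E_0
    fit_and_err(train, test)             # vector of test misclassification rates, one per method
  })
  rowMeans(reps)                         # average over the training samples: IMR*
}
```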

Example 4.3. First consider model (4.1) in Section 4.1. Figure 3 shows IMR* versus the number of irrelevant variables for this example. The left panel shows that the IMR*'s of the SVM approach increase dramatically as the number of irrelevant variables increases, while the IMR*'s of CATCH+SVM are stable. Thus CATCH stabilizes the performance of SVM for this complex model. The right panel shows that CATCH also stabilizes the performance of RFC. These results indicate that CATCH can improve the prediction accuracy of other classification algorithms.

Figure 3: IMR*'s of SVM, RFC, CATCH+SVM and CATCH+RFC versus the number of predictors d for the simulation study of model (4.1) in Example 4.1. The left panel compares SVM with CATCH+SVM and the right panel compares RFC with CATCH+RFC; the error rate (vertical axis, 0.25 to 0.60) is plotted against d (20 to 100). The sample size is n = 400.

Example 4.4 (Contaminated Logistic Model). Consider a binary response Y ∈ {0, 1} with

    P(Y = 0 | X = x) = F(x) if r = 0,  G(x) if r = 1,    (4.5)

where x = (x1, ..., xd),

    r ∼ Bernoulli(γ),    (4.6)

    F(x) = (1 + exp{−(0.25 + 0.5 x1 + x2)})^{−1},    (4.7)

    G(x) = 0.1 if 0.25 + 0.5 x1 + x2 < 0,  0.9 if 0.25 + 0.5 x1 + x2 ≥ 0,    (4.8)

and x1, x2, ..., xd are realizations of independent standard normal random variables. Hence, given X^(i) = x^(i), 1 ≤ i ≤ n, the joint distribution of Y^(1), Y^(2), ..., Y^(n) is the contaminated logistic model with contamination parameter γ:

    (1 − γ) ∏_{i=1}^n [F(x^(i))]^{Y^(i)} [1 − F(x^(i))]^{1−Y^(i)} + γ ∏_{i=1}^n [G(x^(i))]^{Y^(i)} [1 − G(x^(i))]^{1−Y^(i)}.    (4.9)

We perform a simulation study for n = 200, d = 50, and γ = 0, 0.2, 0.4, 0.6, 0.8 and 1, where γ = 0 corresponds to no contamination. Figure 4 shows the IMR*'s versus the different values of γ for this example. CATCH improves the performance of the SVM and Random Forest classifiers for this contaminated logistic regression model.
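An R sketch of the data-generating mechanism in (4.5)-(4.9), with the contamination indicator r drawn once per simulated data set as the mixture in (4.9) suggests. Reading the fitted probability as P(Y = 1 | x), matching the exponents in (4.9), is an interpretation made here for the sketch.

```r
# Simulate one data set from the contaminated logistic model (4.5)-(4.9).
sim_contaminated <- function(n = 200, d = 50, gamma = 0.4) {
  x  <- matrix(rnorm(n * d), n, d)                      # independent standard normal predictors
  lp <- 0.25 + 0.5 * x[, 1] + x[, 2]                    # linear predictor in (4.7)-(4.8)
  r  <- rbinom(1, 1, gamma)                             # contamination indicator, one draw per sample
  p  <- if (r == 0) plogis(lp) else ifelse(lp < 0, 0.1, 0.9)
  y  <- rbinom(n, 1, p)                                 # response generated from F or G
  list(x = x, y = factor(y))
}
```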

Figure 4: IMR*'s of SVM, RFC, CATCH+SVM and CATCH+RFC versus the contamination parameter γ (0, 0.2, 0.4, 0.6, 0.8, 1) for simulation model (4.9) in Example 4.4. The left panel compares SVM with CATCH+SVM and the right panel compares RFC with CATCH+RFC; the error rate (vertical axis) ranges from about 0.1 to 0.4.

5 CATCH Analysis of The ANN Data

In this section CATCH is applied to a data set available in the UCI database (http://ftp.ics.uci.edu/pub/machine-learning-databases/thyroid-disease). The data are from a clinical trial to determine whether a patient is hypothyroid (see, e.g., Quinlan [15]). The data set is split into two parts: a training set with 3772 observations and a test set with 3428 observations. The response is a categorical variable with three classes: normal (not hypothyroid), hyperfunction, and subnormal functioning. There are 21 predictors, including 15 categorical predictors and 6 numerical predictors. The three classes are very imbalanced, with 92.5% normal patients. A good classifier should produce a significant decrease from the error rate of 7.5% that can be obtained by classifying all observations to the normal class.


CATCH identifies 7 variables as important, including 4 continuous variables and 3 categorical variables. We consider three classification methods: (1) the Support Vector Machine with radial kernel, denoted by SVM(r); (2) the Support Vector Machine with polynomial kernel, denoted by SVM(p); (3) RFC. We apply these three methods to the training set to build classification models and use the test set to estimate the misclassification rate of each method. We apply the methods with CATCH and without CATCH, where "with CATCH" means that only the 7 important variables identified by CATCH are used to build the model and "without CATCH" means that all 21 variables are used in the analysis.

Table 5: Misclassification rates of three classification methods (SVM(r), SVM(p), RFC) with CATCH and without CATCH.

                 SVM(r)   SVM(p)   RFC
w/o CATCH        0.052    0.063    0.024
with CATCH       0.037    0.055    0.015

Table 5 shows that if CATCH is used to select important variables before the classification methods are applied, the misclassification rates decrease. CATCH reduces the misclassification rates by 29%, 13% and 37% for SVM(r), SVM(p) and RFC, respectively. CATCH+RFC has the best performance. Although CATCH is not designed for improving classification accuracy, it nonetheless can help other classification methods to perform better.

6 Properties of Local Contingency Efficacy

In this section we investigate the properties of the local contingency efficacy. For a univariate continuous predictor, Theorem 6.1 below shows that the local contingency efficacy is well defined and equals 0 if the response variable is independent of the predictor. Consider an R × C contingency table. Let n_rc, r = 1, ..., R, c = 1, ..., C, be the observed frequency in the (r, c)-cell and let p_rc be the probability that an observation belongs to the (r, c)-cell. Let

    p_r = Σ_{c=1}^C p_rc,    q_c = Σ_{r=1}^R p_rc,    (6.1)

and

    n_{r·} = Σ_{c=1}^C n_rc,    n_{·c} = Σ_{r=1}^R n_rc,    n = Σ_{r=1}^R Σ_{c=1}^C n_rc,    E_rc = n_{r·} n_{·c} / n.    (6.2)

For testing that the rows and the columns are independent, i.e.,

    H_0: p_rc = p_r q_c for all r, c    versus    H_1: p_rc ≠ p_r q_c for some r, c,    (6.3)

a standard test statistic is

    X² = Σ_{r=1}^R Σ_{c=1}^C (n_rc − E_rc)² / E_rc.    (6.4)

Under H_0, X² converges in distribution to χ²_{(R−1)(C−1)}. Moreover, defining 0/0 = 0, we have the well-known result:

Lemma 6.1. As n tends to infinity, X²/n tends in probability to

    τ² ≡ Σ_{r=1}^R Σ_{c=1}^C (p_rc − p_r q_c)² / (p_r q_c).

Moreover, if p_r q_c is bounded away from 0 and 1 as n → ∞, and if R and C are fixed as n → ∞, then τ² = 0 if and only if H_0 is true.

Remark 6.1. Lemma 6.1 shows that the contingency efficacy ζ in (2.8) is well defined. Moreover, ζ = 0 if and only if X and Y are independent.
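In R, the plug-in versions of these quantities can be computed directly from a contingency table. The sketch below returns the Pearson statistic X² of (6.4) and the efficacy-type quantity √(X²/n) that estimates τ of Lemma 6.1; it is an illustration, not code from the paper.

```r
# Pearson chi-square statistic (6.4) and the estimate of tau = sqrt(X^2 / n).
contingency_efficacy <- function(tab) {
  n  <- sum(tab)
  E  <- outer(rowSums(tab), colSums(tab)) / n     # expected counts E_rc = n_{r.} n_{.c} / n
  X2 <- sum((tab - E)^2 / E)                      # equation (6.4)
  c(X2 = X2, efficacy = sqrt(X2 / n))
}
tab <- table(sample(1:2, 200, TRUE), sample(1:4, 200, TRUE))  # a 2 x 4 example table
contingency_efficacy(tab)
```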

Theorem 6.1. Consider a categorical response Y and a continuous predictor X with density function f(x) such that, for a fixed h > 0, f(x) > 0 for x ∈ (x^(0) − h, x^(0) + h). Then:

(1) For fixed h, the estimated local section efficacy ζ̂(x^(0), h) defined in (2.5) converges almost surely to the section efficacy ζ(x^(0), h) defined in (2.4).

(2) If X and Y are independent, then the local efficacy ζ(x^(0)) defined in (2.4) is zero.

(3) If there exists c ∈ {1, ..., C} such that p_c(x) = P(Y = c | x) has a continuous derivative at x = x^(0) with p'_c(x^(0)) ≠ 0, then ζ(x^(0)) > 0.

The χ² statistic (6.4) can be written in terms of the multinomial proportions p̂_rc, with √n(p̂_rc − p_rc), 1 ≤ r ≤ R, 1 ≤ c ≤ C, converging in distribution to a multivariate normal. By a Taylor expansion of T_n ≡ √(χ²/n) at τ = plim √(χ²/n), we find that

Theorem 6.2.
    √n(T_n − τ) →_d N(0, v²),    (6.5)
where the asymptotic variance v², which can be obtained from the Taylor expansion, is not needed in this paper.

Remark 6.2. The result (6.5) applies to the multivariate χ²-based efficacies ζ, with n replaced by the number of observations that go into the computation of ζ. In this case we use the notation χ²_j for the chi-square statistic for the jth variable.

7 Consistency Of CATCH

Let Y denote a variable to be classified into one of the categories 1, ..., C by using the Xj's in the vector X = (X1, ..., Xd). Let X_{-j} denote X without the Xj component. Assume that the squared Pearson nonparametric correlation (Doksum and Samarov [4], Huang and Chen [9]) between Xj and X_{-j} is less than 0.5. We call the variable Xj relevant for Y if, conditionally given X_{-j}, L(Y | Xj = x) is not constant in x, and irrelevant otherwise. Our data (X^(1), Y^(1)), ..., (X^(n), Y^(n)) are i.i.d. as (X, Y). A variable selection procedure is consistent if the probability that all the relevant variables are selected and none of the irrelevant variables are selected tends to one as n tends to infinity.

We give a general nonparametric framework in which the variable selection rule based on the CATCH importance scores s_j given in (3.8) is consistent.

When the predictors are not all categorical and CATCH involves tubes, we assume that the number of observations in each tube tends to ∞ as n → ∞, and we assume that for those importance scores that require the selection of a section size h, the selection is from a finite set {h_j} with min_j{h_j} > h_0 for some h_0 > 0. It follows that the number of observations m_jk in the kth tube used to calculate the importance score s_j for the variable X_j tends to infinity as n → ∞.

The key quantities that determine consistency are the limits λ_j of s_j. That is,

    λ_j = τ(ζ_j, P) = E_P[ζ_j(X)] = ∫ ζ_j(x) dP(x),    j = 1, ..., d,

where ζ_j(x) is the local contingency efficacy defined by (2.4) for numerical X's and by (2.8) for categorical X's, and P is the probability distribution of X. Our estimate s_j of τ_j is

    s_j = τ(ζ_j, P̂) = (1/M) Σ_{k=1}^M ζ_j(X*_k),

where ζ_j is the estimate defined in Section 3 and P̂ is the empirical distribution of X.

Let d_0 denote the number of X's that are irrelevant for Y. Set d_1 = d − d_0 and, without loss of generality, reorder the λ_j so that

    λ_j > 0 if j = 1, ..., d_1;    λ_j = 0 if j = d_1 + 1, ..., d.    (7.1)

Our variable selection rule is:

    "keep variable X_j if s_j ≥ t" for some threshold t.    (7.2)

Rule (7.2) is consistent if and only if

    P(min_{j≤d_1} s_j ≥ t and max_{j>d_1} s_j < t) → 1.    (7.3)

To establish (7.3) it is enough to show that

    P(max_{j>d_1} s_j > t) → 0    (7.4)

and

    P(min_{j≤d_1} s_j ≤ t) → 0.    (7.5)

We call (7.4) and (7.5) type I and type II consistency, respectively. Type I and II error probabilities refer to the probabilities of any false positives and any false negatives, respectively.

7.1 Type I consistency

We consider (7.4) first and note that

    P(max_{j>d_1} s_j > t) ≤ Σ_{j=d_1+1}^d P(s_j > t).    (7.6)

For the jth irrelevant variable X_j, j > d_1, the results in Section 6 apply to the local efficacy ζ_j(x) at a fixed point x. The importance score s_j defined in (3.8) is the average of such efficacies over a sample of M random points x*_a. We assume that there is a point x^(0) such that, for irrelevant X_j's and for n large enough, s_j^(0) ≡ ζ_j(x^(0)) is stochastically larger in the right tail than the average M^{-1} Σ_{a=1}^M ζ_j(x*_a). That is, we assume that there exist t > 0 and n_0 such that for n ≥ n_0,

    P(s_j > t) ≤ P(s_j^(0) > t),    j > d_1.    (7.7)

The existence of such a point x^(0) is established by considering the probability limit of

    arg max{ζ_j(x*_a) : x*_a ∈ {x*_1, ..., x*_M}}.

Note that by Lemma 6.1, s_j^(0) →_p 0 as n → ∞. It follows that we have established (7.4) for t and d_0 = d − d_1 that are fixed as n → ∞. To examine the interesting case where d_0 → ∞ as n → ∞, we need large deviation results, which we turn to next. In the d_0 → ∞ case we consider thresholds t ≡ t_n that tend to ∞ as n → ∞, and we use the weaker assumption that there exist x^(0) and n_0 such that

    P(s_j > t_n) ≤ P(s_j^(0) > t_n)    (7.8)

for n ≥ n_0 and each t_n with lim_{n→∞} t_n = ∞. We also assume that the number of columns C and rows R in the contingency table that defines the efficacy χ²_{j,n} given by (6.4) stays fixed as n → ∞. In this case P(s_j^(0) > t_n) → 0 provided each term

    [T_rc^(j)]² ≡ [(n_rc − E_rc) / (√n √E_rc)]²

in s_j² = χ²_{j,n}/n, defined for X_j by (6.4), satisfies P(|T_rc^(j)| > t_n) → 0.

This is because

    P(√(Σ_r Σ_c [T_rc^(j)]²) > t_n) ≤ P(RC max_{r,c} [T_rc^(j)]² > t_n²) ≤ RC max_{r,c} P(|T_rc^(j)| > t_n/√(RC)),

and t_n and t_n/√(RC) are of the same order as n → ∞. By large deviation theory (Hall [8], Jing et al. [10]), for some constant A,

    P(|T_rc^(j)| > t_n) = (2 t_n^{-1}/√(2π)) e^{-t_n²/2} [1 + A t_n³ n^{-1/2} + o(t_n³ n^{-1/2})]    (7.9)

provided t_n = o(n^{1/6}). If (7.9) is uniform in j > d_1, then (7.6) and (7.9) imply that type I consistency holds if

    d_0 (t_n/√2)^{-1} e^{-t_n²/2} → 0.    (7.10)

To examine (7.10), write d_0 and t_n as d_0 = exp(n^b) and t_n = √2 n^r. By large deviation theory, (7.10) holds when r < 1/6. Thus if 0 < (1/2)b < r < 1/6, then

    d_0 (t_n/√2)^{-1} e^{-t_n²/2} = n^{-r} exp(n^b − n^{2r}) → 0.


Thus d_0 = exp(n^b) with b = 1/3 − ε, for any ε > 0, is the largest d_0 can be to have consistency. For instance, if there are exactly d_0 = exp(n^{1/4}) irrelevant variables and we use the threshold t_n = √2 n^{1/7}, the probability of selecting any of the irrelevant variables tends to zero at the rate n^{-1/7} exp(n^{1/4} − n^{2/7}).
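To see how quickly this bound decays, one can evaluate it numerically; the short R check below computes n^{-1/7} exp(n^{1/4} − n^{2/7}) for a few sample sizes. This is only an illustration of the rate, not part of the paper's theory.

```r
# Numerical check of the type I bound n^(-1/7) * exp(n^(1/4) - n^(2/7)).
n <- c(1e3, 1e4, 1e5, 1e6)
bound <- n^(-1/7) * exp(n^(1/4) - n^(2/7))
signif(bound, 3)   # decreases toward 0 as n grows
```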

7.2 Type II consistency

We now assume that for each relevant variable X_j there is a least favorable point x^(1) such that the local efficacy s_j^(1) ≡ ζ_j(x^(1)) is stochastically smaller than the importance score s_j = M^{-1} Σ_{a=1}^M ζ_j(x*_a). That is, we assume that for each relevant variable X_j there exist n_1 and x^(1) such that for n ≥ n_1,

    P(s_j^(1) ≤ t_n) ≥ P(s_j ≤ t_n),    j ≤ d_1.    (7.11)

The existence of such a least favorable x^(1) is established by considering the probability limit of

    arg min{ζ_j(x*_a) : x*_a ∈ {x*_1, ..., x*_M}}.

We again assume that R and C are fixed, so that to show type II consistency it is enough to show that for each (r, c) and j ≤ d_1,

    P(|T_rc^(j)| ≤ t_n) → 0.

This is because

    P(√(Σ_r Σ_c [T_rc^(j)]²) ≤ t_n) ≤ P(min_{r,c} [T_rc^(j)]² ≤ t_n²) ≤ RC max_{r,c} P(|T_rc^(j)| ≤ t_n).

By Theorem 6.2, √n(s_j^(1) − τ_j) →_d N(0, ν²), with τ_j > 0. It follows that if t_n and d_1 are bounded above as n → ∞ and τ_j > 0, then P(s_j^(1) ≤ t_n) → 0. Thus type II consistency holds in this case. To examine the more interesting case where t_n and d_1 are not bounded above and τ_j may tend to zero as n → ∞, we will need to center the T_rc^(j). Thus set

    θ_n^(j) = √n (p_rc^(j) − p_r^(j) p_c^(j)) / √(p_r^(j) p_c^(j)),    (7.12)

    T_n^(j) = T_rc^(j) − θ_n^(j),

where for simplicity the dependence of T_n^(j) and θ_n^(j) on r and c is not indicated.

Lemma 7.1.

    P(min_{j≤d_1} |T_rc^(j)| ≤ t_n) ≤ Σ_{j=1}^{d_1} P(|T_n^(j)| ≥ min_{j≤d_1} |θ_n^(j)| − t_n).

Proof.

    P(min_{j≤d_1} |T_rc^(j)| ≤ t_n) = P(min_{j≤d_1} |T_n^(j) + θ_n^(j)| ≤ t_n)
    ≤ P(min_{j≤d_1} |θ_n^(j)| − max_{j≤d_1} |T_n^(j)| ≤ t_n)
    = P(max_{j≤d_1} |T_n^(j)| ≥ min_{j≤d_1} |θ_n^(j)| − t_n)
    ≤ Σ_{j=1}^{d_1} P(|T_n^(j)| ≥ min_{j≤d_1} |θ_n^(j)| − t_n).

Set a_n = min_{j≤d_1} |θ_n^(j)| − t_n; then by large deviation theory, if a_n = o(n^{1/6}),

    P(|T_n^(j)| ≥ a_n) ∝ (a_n/√2)^{-1} exp{−a_n²/2},    (7.13)

where ∝ signifies asymptotic order as n → ∞, a_n → ∞. It follows that if (7.13) holds uniformly in j, then by Lemma 7.1,

    P(min_{j≤d_1} |T_rc^(j)| ≤ t_n) ∝ d_1 (a_n/√2)^{-1} exp{−a_n²/2}.

Thus type II consistency holds when a_n = o(n^{1/6}) and

    d_1 (a_n/√2)^{-1} exp{−a_n²/2} → 0.    (7.14)

In Section 7.1, t_n is of order n^r, 0 < r < 1/6. For this t_n, type II consistency holds if a_n is of order n^r, 0 < r < 1/6, and

    d_1 n^{-r} exp{−n^{2r}/2} → 0.


If a_n = O(n^k) with k ≥ 1/6, then P(|T_n^(j)| ≥ a_n) tends to zero at a rate faster than in (7.13) and consistency is ensured. For instance, if all the relevant variables X_j have p_rc^(j) fixed as n → ∞, then min_{j≤d_1} |θ_n^(j)| is of order √n and a_n is of order n^{1/2} − t_n.

7.3 Type I and II consistency

We see from Sections 7.1 and 7.2 that consistency holds under a variety of conditions. One set of conditions is the following.

Theorem 7.1. Under the regularity conditions of Sections 7.1 and 7.2, if

    d_0 = c_0 exp(n^b),    t_n = c_1 n^r,    min_{j≤d_1} |θ_n^(j)| − t_n = c_2 n^r,    d_1 = c_3 exp(n^b)

for positive constants c_0, c_1, c_2 and c_3, and 0 < (1/2)b < r < 1/6, then CATCH is consistent.

Appendix: Proof of Theorem 6.1

(1) Let N_h = Σ_{i=1}^n I(X^(i) ∈ N_h(x^(0))). Then N_h ∼ Bi(n, ∫_{x^(0)−h}^{x^(0)+h} f(x) dx) and N_h → ∞ a.s. as n → ∞. We have

    ζ̂(x^(0), h) = n^{-1/2} √(X²(x^(0), h)) = (N_h/n)^{1/2} (N_h)^{-1/2} √(X²(x^(0), h)).

Since N_h ∼ Bi(n, ∫_{x^(0)−h}^{x^(0)+h} f(x) dx), we have

    N_h/n → ∫_{x^(0)−h}^{x^(0)+h} f(x) dx.    (7.15)

By Lemma 6.1 and Table 1,

    (N_h)^{-1/2} √(X²(x^(0), h)) → (Σ_{r=+,−} Σ_{c=1}^C (p_rc − p_r q_c)² / (p_r q_c))^{1/2},    (7.16)

where

    p_{+c} = Pr(X > x^(0), Y = c | X ∈ N_h(x^(0))),    p_{−c} = Pr(X ≤ x^(0), Y = c | X ∈ N_h(x^(0))),

and, for r = +, − and c = 1, ..., C,

    p_r = Σ_{c=1}^C p_rc,    q_c = Σ_{r=+,−} p_rc.

In fact, p_+ = Pr(X > x^(0) | X ∈ N_h(x^(0))), p_− = Pr(X ≤ x^(0) | X ∈ N_h(x^(0))) and q_c = Pr(Y = c | X ∈ N_h(x^(0))).

By (7.15) and (7.16),

    ζ̂(x^(0), h) → (∫_{x^(0)−h}^{x^(0)+h} f(x) dx)^{1/2} (Σ_{r=+,−} Σ_{c=1}^C (p_rc − p_r q_c)² / (p_r q_c))^{1/2} ≡ ζ(x^(0), h).

(2) Since X and Y are independent, p_rc = p_r q_c for r = +, − and c = 1, ..., C; then ζ(x^(0), h) = 0 for any h. Hence

    ζ(x^(0)) = sup_{h>0} ζ(x^(0), h) = 0.

(3) Recall that p_c(x^(0)) = Pr(Y = c | x^(0)). Without loss of generality assume p'_c(x^(0)) > 0. Then there exists h such that

    Pr(Y = c | x^(0) < X < x^(0) + h) > Pr(Y = c | x^(0) − h < X < x^(0)),

which is equivalent to p_{+c}/p_+ > p_{−c}/p_−. This shows that ζ(x^(0)) > 0, since by Lemma 6.1, ζ(x^(0)) = 0 would imply p_{+c}/p_+ = p_{−c}/p_− = p_c.

References

[1] Breiman, L., 2001. Random forests. Machine Learning 45, 5-32.

[2] Breiman, L., Friedman, J. H., Olshen, R. A., Stone, C. J., 1984. Classification and Regression Trees. Wadsworth, Belmont.

[3] Doksum, K., Tang, S., Tsui, K.-W., 2008. Nonparametric variable selection: The EARTH algorithm. Journal of the American Statistical Association 103(484), 1609-1620.

[4] Doksum, K. A., Samarov, A., 1995. Nonparametric estimation of global functionals and a measure of explanatory power of covariates in regression. Annals of Statistics 23, 1443-1473.

[5] Doksum, K. A., Schafer, C., 2006. Powerful choices: Tuning parameter selection based on power. In: Frontiers in Statistics: Dedicated to Peter Bickel (J. Fan and H. L. Koul, eds.), Imperial College Press, 113-141.

[6] Friedman, J., 1991. Multivariate adaptive regression splines. The Annals of Statistics 19, 1-67.

[7] Gao, J., Gijbels, I., 2008. Bandwidth selection in nonparametric kernel testing. Journal of the American Statistical Association 103(484), 1584-1594.

[8] Hall, P., 1992. The Bootstrap and Edgeworth Expansion. Springer, New York.

[9] Huang, L.-S., Chen, J., 2008. Analysis of variance, coefficient of determination and F-test for local polynomial regression. Annals of Statistics 36, 2085-2109.

[10] Jing, B.-Y., Shao, Q.-M., Wang, Q., 2003. Self-normalized Cramér-type large deviations for independent random variables. Annals of Probability 31, 2167-2215.

[11] Lin, D. Y., Zeng, D., 2011. Correcting for population stratification in genomewide association studies. Journal of the American Statistical Association 106, 997-1008.

[12] Loh, W.-Y., 2009. Improving the precision of classification trees. The Annals of Applied Statistics 3, 1710-1737.

[13] Price, A., Patterson, N., Plenge, R., Weinblatt, M., Shadick, N., Reich, D., 2006. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38(8), 904-909.

[14] Qiao, Z., Zhou, L., Huang, J., 2008. Linear discriminant analysis for high dimensional, low sample size data. Special issue of World Congress on Engineering 2008.

[15] Quinlan, J. R., 1987. Simplifying decision trees. International Journal of Man-Machine Studies 27(3), 221-234.

[16] Schafer, C., Doksum, K. A., 2009. Selecting local models in multiple regression by maximizing power. Metrika 69, 283-304.

[17] Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267-288.

[18] Wang, L., Shen, X., 2007. On L1-norm multi-class support vector machines: methodology and theory. Journal of the American Statistical Association 102, 583-594.

[19] Zhang, H. H., Liu, Y., Wu, Y., Zhu, J., 2008. Variable selection for the multicategory SVM via adaptive sup-norm regularization. Electronic Journal of Statistics 2, 149-167.

30

  • Introduction
  • Importance Scores for Univariate Classification
    • Numerical Predictor Local Contingency Efficacy
    • Categorical Predictor Contingency Efficacy
      • Variable Selection for Classification Problems The CATCH Algorithm
        • Numerical Predictors
        • Numerical and Categorical Predictors
        • The Empirical Importance Scores
        • Adaptive Tube Distances
        • The CATCH Algorithm
          • Simulation Studies
            • Variable Selection with Categorical and Numerical Predictors
            • Improving The Performance of SVM and RF Classifiers
              • CATCH Analysis of The ANN Data
              • Properties of Local Contingency Efficacy
              • Consistency Of CATCH
                • Type I consistency
                • Type II consistency
                • Type I and II consistency
                  • Proof of Theorem 61
Page 12: Nonparametric Variable Selection and Classi cation: The ...lc436/Tang_Chen_etal_2012_CSDA.pdf · pruning, will choose a subset of optimal splitting variables to be the most signi

of different classes are extremely unequal which is known as unbalanced classificationproblemsit is very likely to have a pure class of Y in local region after the global per-mutation and the threshold S prime will be spuriously low And therefore irrelevant variableswill be falsely selected into the model This approach is more nonparametric but moretime consuming

(drsquo) Local Thresholding Let Y lowast denote a random variable which is distributed as Y

but independent of Xl given Xminusl Let (Y(1)0 middot middot middot Y (m)

0 ) be a random permutation of the

Y rsquos inside the tube then (Y(1)0 middot middot middot Y (m)

0 ) can be viewed as a realization of Y lowast If Xl

is a numerical variable let ζlprime(Xlowastk) be the estimated local contingency tube efficacy of

Y lowast on Xl at Xlowastk as defined in (35) If Xl is a categorical variable let ζlprime(Xlowastk) be the

estimated contingency tube efficacy of Y lowast on Xl at Xlowastk as defined in (37)

Remark 33 (The threshold) Under the assumption H(l)0 that Y is independent of Xl

given Xminusl in Tl sl and sprimel are nearly independent and SD(sl minus sprimel) asympradic

2SD(sprimel) Thusthe rule that deletes Xl when sstopl minus sprimestopl lt

radic2SE(sprimestop) is similar to the one standard

deviation rule of CART and GUIDE The choice of threshold is also discussed in Section7 Alternatively we calculate the p-values for each covariate by permutation tests Morespecifically we permute the response variable and obtain the important scores for thepermeated data and then calculate how extreme sprimestopl compared to permutated scoresWe then identify a covariate as important if its p-value is less than 01

Remark 34 (Large d small n) A key idea in this paper is that when determining theimportance of the variable Xj we restrict (condition) the vector Xminusj to a relatively smallset This is done to address the problem of the spurious correlation of great concern ingenomics (Price et al [13] Lin and Zeng [11]) In such genomic studies the number ofpredictors d typically greatly exceeds sample size n For numerical variables this can bedealt with by replace Xminusj with a vector of principal components explaining most of thevariation in Xminusj Although our paper does not deal with the d n case directly thisremark shows that it is also applicable to the genome wide association studies describedin Price et al [13]

Remark 35 We now provide computational complexity of the algorithm Recall thenotations I is the number of repetitions M is the number of tube centers m is thetube size d is the dimension and n is the sample size The number of operationsneeded for the algorithm is I times d times M times (2b+ 2c+ 2d) + 2e where 2b 2c 2d 2edenote the number of operations required for those steps The computational complexityrequired for 2b 2c 2d and 2e are O (nd) O (m) O (m)and O (M) Since nd m andthe required computation is dominated by step 2b the computational complexity of thealgorithm is O (IMnd2) Solving least squares problem requires O (nd2) operations Soour algorithm requires higher computational complexity by a factor of IM compared toleast squares However the computations for M tube centers are independent of theother computations and are performed in parallel

10

4 Simulation Studies

41 Variable Selection with Categorical and Numerical Predictors

Example 41 Consider d ge 5 predictors X1 X2 Xd and a categorical responsevariable Y The fifth predictor X5 is a Bernoulli random variable which takes twovalues 0 and 1 with equal probability All other predictors are uniformly distributed on[01]

When X5 = 0 the dependent variable Y is related to the predictors in the followingfashion

Y =

0 if X1 minusX2 minus 025 sin(16πX2) le minus051 if minus05 lt X1 minusX2 minus 025 sin(16πX2) le 02 if 0 lt X1 minusX2 minus 025 sin(16πX2) le 053 if X1 minusX2 minus 025 sin(16πX2) gt 05

(41)

When X5 = 1 Y depends on X3 and X4 in the same way as above with X1 replacedby X3 and X2 replaced by X4 The other variables X6 Xd are irrelevant noisevariables All d predictors are independent

In addition to the presence of noise variables X6 Xd this example is challengingin two aspects First there is an interaction effect between the continuous predictorsX1 X4 and the categorical predictor X5 Such interaction effect may arise in prac-tice For example suppose Y represents the effect resulting from multiple treatmentsX1 X4 and that X5 represents the gender of a patient Our model allows males andfemales to respond differently to treatments Secondly the classification boundaries forfixed X5 are highly nonlinear as shown in Figure 1 This design is to demonstrate theour method can detect nonlinear dependence between the response and the predictors

Table 2 shows the number of simulations where the predictors X1 middot middot middot X5 are cor-rectly kept in the model and in column 6 the average number of simulations wheredminus 5 irrelevant predictors are falsely selected to be in the model out of 100 simulationtrials from model (41) In the simulation n = 400 d = 10 50 and 100 are usedWe compare CATCH with other methods including Marginal CATCH Random Forestclassifier (RFC) (Breiman [1]) and GUIDE (Loh [12]) For Marginal CATCH we testthe association between Y and Xl without conditioning on Xminusl RFC is well-known asa powerful tool for prediction Here we use importance scores from RFC for variableselection We create additional 30 noise variables and use their average importancescores multiplied by a factor of 2 as a threshold (Doksum et al [3]) to decide whether avariable should be selected into the model In particular we generate the noise variablesusing uniform and Bernoulli distributions respectively for numerical and categoricalpredictors

The main goal of GUIDE is to solve the variable selection bias problem in regressionand classification tree methods As an illustration if a categorical predictor takes Kdistinct values then the splitting test exhausts 2K minus 1 possibilities and therefore is

11

Tab

le2

Sim

ula

tion

resu

lts

from

Exam

ple

41

wit

hsa

mp

lesi

zen

=400

For

1leile

5

theit

hC

olu

mn

giv

esth

ees

tim

ate

dp

rob

ab

ilit

yof

sele

ctio

nof

rele

vant

vari

ableX

ian

dth

est

and

ard

erro

rb

ase

don

100

sim

ula

tion

sC

olu

mn

6giv

esth

em

ean

of

the

esti

mate

dp

rob

ab

ilit

yof

sele

ctio

nan

dth

est

and

ard

erro

rfo

rth

eir

rele

vant

vari

ab

lesX

6middotmiddotmiddotX

d

Num

ber

ofSel

ecti

ons(

100

sim

ula

tion

s)

ME

TH

OD

dX

1X

2X

3X

4X

5Xiige

6

CA

TC

H10

082

(00

4)0

75(0

04)

078

(00

4)0

82(0

04)

1(0

)0

05(0

02)

500

76(0

04)

076

(00

4)0

87(0

03)

084

(00

4)0

96(0

02)

008

(00

3)

100

077

(00

4)0

73(0

04)

082

(00

4)0

8(0

04)

083

(00

4)0

09(0

03)

Mar

ginal

CA

TC

H10

077

(00

4)0

67(0

05)

07

(00

5)0

8(0

04)

006

(00

2)0

1(0

03)

500

69(0

05)

07

(00

5)0

76(0

04)

079

(00

4)0

07(0

03)

01

(00

3)

100

078

(00

4)0

76(0

04)

073

(00

4)0

73(0

04)

012

(00

3)0

1(0

03)

Ran

dom

For

est

100

71(0

05)

051

(00

5)0

47(0

05)

059

(00

5)0

37(0

05)

006

(00

2)

500

64(0

05)

061

(00

5)0

65(0

05)

058

(00

5)0

28(0

04)

008

(00

3)

100

066

(00

5)0

58(0

05)

059

(00

5)0

65(0

05)

017

(00

4)0

11(0

03)

GU

IDE

100

96(0

02)

098

(00

1)0

93(0

03)

099

(00

1)0

92(0

03)

033

(00

5)

500

82(0

04)

085

(00

4)0

9(0

03)

089

(00

3)0

67(0

05)

012

(00

3)

100

083

(00

4)0

84(0

04)

086

(00

3)0

83(0

04)

061

(00

5)0

08(0

03)

12

Figure 1 The relationship between Y and X1 X2 for model (41) The four areas represent the fourvalues of Y

biased toward being selected The way that GUIDE alleviate this bias is to employthe lack-of-fit tests ie χ2-test applicable to both numerical and categorical predictorswith the p-values adjusted by the bootstrap procedure Since GUIDE has negligible se-lection bias it can include tests for local pairwise interactions Furthermore it achievessensitivity to local curvature by using linear model at each node

The results in table 2 show that Marginal CATCH does a very poor job of select-ing X5 This illustrates the importance of local conditioning RFC does better thanMarginal CATCH but also has difficulties while CATCH and GUIDE can successfullyidentify X5 along with X1 to X4 with high frequency When d = 10 GUIDE does verywell for the relevant variables but selects too many irrelevant variables as d increasesSurprisingly GUIDE selects fewer irrelevant variables as d increases It becomes morecautious as the dimension d increases CATCH performs very well for the high dimen-sional cases (d = 50 and d = 100) This indicates that the CATCH algorithm is robustto the intricate interaction structures between the predictors

In model (41) above the predictors interact in a hierarchical way When X5 = 0the response is related to X1 and X2 as shown in Figure 1 where the two axes arethe two relevant predictors and the four areas partitioned by the curves correspondto different values of Y When X5 = 1 the response depends on X3 and X4 in thesame way This type of hierarchical interaction between the categorical and numericalpredictors are difficult to detect even by state of the art classification methods such assupport vector machine(SVM) and RFC In particular RFC produces an importance

13

score vector that identifies the four numerical variables X1 X4 as very significantbut often leaves the categorical variable X5 unidentified as we see in table 2

The CATCH algorithm however performs very well for this complicated model Wenow explain why CATCH is able to identify X5 as an important variable To explorethe association between X5 and Y we construct M contingency tables of 2 rows and4 columns based on M randomly centered tubes calculate contingency efficacy scoresand then average the scores Figure 2 (a) and (b) illustrate how one of the contingencyefficacy scores is calculated 1000 data points are generated from model (41) and theyare split into two sets according to x5 (a) shows approximately half of the points withx5 = 0 in (x1 x2) dimensions and (b) shows the rest points with x5 = 1 in (x3 x4)dimensions The data points are colored according to Y This representation enablesus to clearly see the classification boundaries (grey lines) The randomly selected tubecenter when x5 is zero is shown as a star in (a) The data points within the tube closeto the tube center in terms of Xminus5 are highlighted by larger dots Since the distancedefining the tube does not involve X5 the larger dots have x5 equal 0 or 1 and arepresent in both (a) and (b) The counts of the larger dots in different colors in (a) and(b) correspond to the cell counts of the two rows x5 = 0 and x5 = 1 in the contingencytable As we can see larger dots in (a) are mostly red and larger dots in (b) are mostlyblue and orange which shows that distribution in Y depends on the value of X5 andthe contingency efficacy score is high We expect the contingency efficacy score wouldbe low only if the tube center has similar values in (x1 x2) and (x3 x4) Such tubecenters would be very few among the M tube centers since they are randomly chosenThus the average contingency efficacy scores is high and X5 is identified as an importantvariables

To contrast this example with a simpler data generating model suppose the effectsof predictors are additive that is the categorical variable is related to the response asan added term to the effect of the numerical predictors In this type of simpler modelsthe RFC and GUIDE algorithm can efficiently identify relevant predictors but so canthe CATCH algorithm We do not include the simulation results for additive modelshere

Remark 41 We observe that the CATCH algorithm is not very sensitive to the choiceof the number of tube centers M or the tube size m = [na] where 0 lt a le 1 and [ ] isthe greatest integer function For example 41 we perform the simulation using M = nand M = n2 and a = 01 015 02 03 05 The results are summarized in Table 3The CATCH algorithm performs similarly for M = n and M = n2 with M = n havinga small winning edge We generally suggest to use M = n if computation resource isof a less concern or n is not large (le 200) When computational cost is a concern(Remark 35) one can reduce tube size to M = n2 for a compromise between detectionpower and the computational cost Table 3 also shows that the tube size works well inthe range between 01 and 03 while the detection power of X5 decreases for the casea = 05 where the tube size is too big to achieve local conditioning This is a consistent

14

observation with the comparison to Marginal CATCH in Table 2 On the other handthe tube size should not be too small to achieve sufficient detection power We suggestto use a = 01 or 015 for sample sizes n ge 400 or m = 50 for smaller sample sizesIn light of these guidelines we use M = n2 = 200 and a = 015 for Examples 41-43M = n = 200 and m = 50 for Example 44

Table 3 Sensitivity study of CATCH for Example 41 (n = 400 d = 50) using different number oftube centers M and tube size m = [na] For 1 le i le 5 the ith column gives the number of times Xi

was selected Column 6 gives the percentage of times irrelevant variables were selected

Number of Selections(100 simulations)M a X1 X2 X3 X4 X5 Xi i ge 6

M = 200 010 75 67 77 69 92 78015 76 76 87 84 96 82020 85 84 90 87 93 99030 89 87 95 89 85 96050 89 90 96 93 62 99

M = 400 010 74 76 82 77 92 72015 85 82 89 87 94 86020 87 88 90 86 96 91030 88 89 91 91 92 94050 89 91 94 93 61 102

In Example 41 all predictors are independent In a similar setting we next considerthe case where the predictors are dependent

Example 42 We generate standard normal random variables Zi for i = 1 middot middot middot d insuch a way that cor(Z1 Z4) = 05 cor(Z3 Z6) = 09and the other Zirsquos are independentWe set Xi = Φ(Zi) for i = 1 middot middot middot d where Φ(middot) is the CDF of a standard normal andtherefore Xi are uniformly distributed marginally as in Example 41 The correlationstructure between the Xirsquos are similar to that of the Zirsquos The dependent variable Y isgenerated in the same way as in Example 41

The results of Example 42 are given in Table 4 By comparing with results in Table2 both CATCH and GUIDE are doing well in the presence of dependence between thenumerical variables The explanation is two-fold First both methods could identify thecategorical X5 as an important predictor most of the time which make the identificationof other relevant predictors possible Second conditioning on X5 Y depends on a pairof variables (ldquoX1 and X2rdquo or ldquoX3 and X4rdquo) jointly and in particular this dependence isnonlinear which makes both methods less affected by the correlation introduced here

15

00 02 04 06 08 10

00

02

04

06

08

10

(a) X5 = 0

X1

X2

00 02 04 06 08 10

00

02

04

06

08

10

(b) X5 = 1

X3

X4

Figure 2 Sample scatter plots of (x1 x2) and (x3 x4) for simulated data using model (41) The leftpanel includes all the pairs (x1 x2) with x5 = 0 while the right panel includes all the pairs (x3 x4)with x5 = 1 The larger dots in the two plots are within the tube around a tube center shown as thestar

We now explain the latter in detail for both methods. In CATCH, the algorithm adaptively weighs the effects of predictors by their importance scores in defining the tubes when they are being conditioned on; therefore identifying either of the two will help to identify the other, whereas an irrelevant variable, though strongly associated with one of the relevant variables, is down-weighted iteratively and does not strongly affect the selection of the relevant variable. This can explain the slightly decreasing selection probability of X3 in the presence of a strongly correlated X6, and consequently the overall lower selection of X3 and X4 compared to X1 and X2, given the correlation between X1 and X4. In GUIDE, the difference between the dependent and independent cases is even smaller, because GUIDE is very powerful in detecting local interactions by directly including pairwise interactions when selecting splitting variables.

Marginal CATCH and Random Forest do poorly in this case, mainly because X5 cannot be detected, as we explained earlier. More specifically, the dependence between Y and X4 in the X5 = 1 case is the same as that between Y and X2 in the X5 = 0 case. The linear terms of X1 and X2 have opposite signs in the decision function; therefore the marginal effects of X1 and X4 are obscured when X5 is not identified as an important variable.

4.2 Improving The Performance of SVM and RF Classifiers

The criterion used in the CATCH algorithm is not based on classification accuracy. But if we use CATCH to screen out the irrelevant variables and only use the important

Table 4: Simulation results from Example 4.2 with sample size n = 400. For 1 ≤ i ≤ 6, the ith column gives the estimated probability of selection of variable Xi and the standard error based on 100 simulations, where Xi, i = 1, ..., 5, are relevant variables with cor(X1, X4) = 0.5; X6 is an irrelevant variable that is correlated with X3 with correlation 0.9. Column 7 gives the mean of the estimated probability of selection and that of the standard error for the irrelevant variables X7, ..., Xd.

                                 Number of Selections (100 simulations)
    METHOD            d    X1           X2           X3           X4           X5           X6           Xi, i ≥ 7
    CATCH             10   0.76 (0.04)  0.77 (0.04)  0.73 (0.04)  0.68 (0.05)  0.99 (0.01)  0.34 (0.05)  0.07 (0.02)
                      50   0.78 (0.04)  0.77 (0.04)  0.74 (0.04)  0.63 (0.05)  0.92 (0.03)  0.57 (0.05)  0.09 (0.03)
                      100  0.76 (0.04)  0.77 (0.04)  0.80 (0.04)  0.74 (0.04)  0.82 (0.04)  0.55 (0.05)  0.09 (0.03)
    Marginal CATCH    10   0.35 (0.05)  0.63 (0.05)  0.80 (0.04)  0.32 (0.05)  0.09 (0.03)  0.71 (0.05)  0.12 (0.03)
                      50   0.34 (0.05)  0.71 (0.05)  0.70 (0.05)  0.30 (0.05)  0.10 (0.03)  0.60 (0.05)  0.11 (0.03)
                      100  0.34 (0.05)  0.68 (0.05)  0.75 (0.04)  0.30 (0.05)  0.10 (0.03)  0.59 (0.05)  0.10 (0.03)
    Random Forest     10   0.29 (0.05)  0.48 (0.05)  0.55 (0.05)  0.20 (0.04)  0.30 (0.05)  0.47 (0.05)  0.05 (0.02)
                      50   0.26 (0.04)  0.52 (0.05)  0.57 (0.05)  0.30 (0.05)  0.24 (0.04)  0.45 (0.05)  0.08 (0.03)
                      100  0.36 (0.05)  0.61 (0.05)  0.66 (0.05)  0.37 (0.05)  0.32 (0.05)  0.46 (0.05)  0.12 (0.03)
    GUIDE             10   0.96 (0.02)  0.96 (0.02)  0.94 (0.02)  0.95 (0.02)  0.96 (0.02)  0.84 (0.04)  0.30 (0.05)
                      50   0.80 (0.04)  0.92 (0.03)  0.88 (0.03)  0.79 (0.04)  0.72 (0.04)  0.68 (0.05)  0.12 (0.03)
                      100  0.81 (0.04)  0.89 (0.03)  0.90 (0.03)  0.76 (0.04)  0.75 (0.04)  0.58 (0.05)  0.08 (0.03)

variables for classification, the performance of other algorithms, such as the Support Vector Machine (SVM) and the Random Forest Classifier (RFC), can be significantly improved. In this section we compare the prediction performance of (1) SVM, (2) RFC, (3) CATCH followed by SVM, and (4) CATCH followed by RFC. We label (3) and (4) CATCH+SVM and CATCH+RFC, respectively. We use the "svm" function in the "e1071" package in R with the radial kernel to perform the SVM analysis, and the "randomForest" function in the "randomForest" package to perform the RFC analysis. We use simulation experiments to illustrate the improvement obtained by using CATCH prior to classifying Y.
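A minimal R sketch of the four classifiers compared in this section is given below. Here catch_select() is a hypothetical placeholder for the CATCH screening step (CATCH is not a public package); it is assumed to return the indices of the variables whose importance score exceeds the threshold.

    library(e1071)          # svm()
    library(randomForest)   # randomForest()

    fit_all <- function(x, y, keep = seq_len(ncol(x))) {
      x <- as.data.frame(x)
      y <- factor(y)
      list(
        svm       = svm(x, y, kernel = "radial"),                  # (1) SVM
        rfc       = randomForest(x, y),                            # (2) RFC
        catch_svm = svm(x[, keep, drop = FALSE], y,
                        kernel = "radial"),                        # (3) CATCH+SVM
        catch_rfc = randomForest(x[, keep, drop = FALSE], y)       # (4) CATCH+RFC
      )
    }

    ## usage (hypothetical screening step):
    ## keep <- catch_select(x, y)   # indices of the variables kept by CATCH
    ## fits <- fit_all(x, y, keep)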

Let the true model be

(X, Y) ~ P,    (4.2)

where P will be specified later. In each of 100 Monte Carlo experiments, n pairs (X^(i), Y^(i)), i = 1, ..., n, are generated from the true model (4.2). Based on the simulated data (X^(i), Y^(i)), i = 1, ..., n, the methods (1) SVM, (2) RFC, (3) CATCH+SVM, and (4) CATCH+RFC are used to classify a future Y^(0) for which we have available a corresponding X^(0).

The integrated misclassification rate (IMR; Friedman [6]), defined below, is used to evaluate the performance of the classification algorithms. Let F_{X,Y}(·) be the classifier constructed by a classification algorithm based on X = (X^(1), ..., X^(n)) and Y = (Y^(1), ..., Y^(n))^T.

Let (X^(0), Y^(0)) be a future observation with the same distribution as (X^(i), Y^(i)). We only know X^(0) = x^(0) and want to classify Y^(0). For a possible future observation (x^(0), y^(0)), define

MR(X, Y; x^(0), y^(0)) = P(F_{X,Y}(x^(0)) ≠ y^(0) | X, Y).    (4.3)

Our criterion is

IMR = E E_0[MR(X, Y; x^(0), y^(0))],    (4.4)

where E_0 is with respect to the distribution of the random (x^(0), y^(0)) and E is with respect to the distribution of (X, Y).

This double expectation is evaluated using Monte Carlo simulation, and the result is denoted IMR*. More precisely, E_0 is estimated using 5000 simulated (x^(0), y^(0)) observations, and IMR* is then computed on the basis of 100 simulated samples (X_i, Y_i), i = 1, ..., 100.
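The Monte Carlo approximation of IMR* can be sketched in R as follows; gen_data(n) and fit_fun(x, y) are placeholders (our assumptions, not the paper's code) for a generator drawing (X, Y) from P and for a fitted classifier with a predict() method.

    imr_star <- function(fit_fun, gen_data, n = 400,
                         n_train_sets = 100, n_test = 5000) {
      rates <- replicate(n_train_sets, {
        train <- gen_data(n)
        test  <- gen_data(n_test)        # approximates the inner expectation E0
        fit   <- fit_fun(train$x, train$y)
        yhat  <- predict(fit, test$x)
        mean(yhat != test$y)             # MR averaged over the 5000 test points
      })
      mean(rates)                        # average over the 100 training samples
    }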

Example 4.3. First consider model (4.1) in Section 4.1. Figure 3 shows IMR* versus the number of irrelevant variables for this example. The left panel shows that the IMR*'s of the SVM approach increase dramatically as the number of irrelevant variables increases, while the IMR*'s of CATCH+SVM are stable. Thus CATCH stabilizes the performance of SVM for this complex model. The right panel shows that CATCH also stabilizes the performance of RFC. These results indicate that CATCH can improve the prediction accuracy of other classification algorithms.

[Figure 3: two panels, "SVM" and "Random Forest"; horizontal axis d = 20 to 100, vertical axis Error Rate 0.25 to 0.60; curves SVM vs. CATCH+SVM and RFC vs. CATCH+RFC.]
Figure 3. IMR*'s of SVM, RFC, CATCH+SVM, and CATCH+RFC versus the number of predictors for the simulation study of model (4.1) in Example 4.1. The sample size is n = 400.

Example 4.4 (Contaminated Logistic Model). Consider a binary response Y ∈ {0, 1} with

P(Y = 0 | X = x) = F(x) if r = 0,  G(x) if r = 1,    (4.5)

where x = (x1, ..., xd),

r ~ Bernoulli(γ),    (4.6)

F(x) = (1 + exp{−(0.25 + 0.5 x1 + x2)})^(−1),    (4.7)

G(x) = 0.1 if 0.25 + 0.5 x1 + x2 < 0,  0.9 if 0.25 + 0.5 x1 + x2 ≥ 0,    (4.8)

and x1, x2, ..., xd are realizations of independent standard normal random variables. Hence, given X^(i) = x^(i), 1 ≤ i ≤ n, the joint distribution of Y^(1), Y^(2), ..., Y^(n) is the contaminated logistic model with contamination parameter γ:

(1 − γ) ∏_{i=1}^{n} [F(x^(i))]^{Y^(i)} [1 − F(x^(i))]^{1−Y^(i)} + γ ∏_{i=1}^{n} [G(x^(i))]^{Y^(i)} [1 − G(x^(i))]^{1−Y^(i)}.    (4.9)

We perform a simulation study for n = 200, d = 50, and γ = 0, 0.2, 0.4, 0.6, 0.8, and 1, where γ = 0 corresponds to no contamination. Figure 4 shows the IMR*'s versus the different values of γ for this example. CATCH improves the performance of the SVM and Random Forest classifiers for this contaminated logistic regression model.
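An R sketch of the contaminated logistic generator (our own function name; following (4.9), a single contamination indicator r is drawn for the whole sample, so with probability γ all responses follow G and otherwise all follow F):

    gen_contaminated <- function(n = 200, d = 50, gamma = 0.2) {
      x  <- matrix(rnorm(n * d), n, d)
      lp <- 0.25 + 0.5 * x[, 1] + x[, 2]
      F0 <- 1 / (1 + exp(-lp))               # F(x), eq. (4.7)
      G0 <- ifelse(lp < 0, 0.1, 0.9)         # G(x), eq. (4.8)
      r  <- rbinom(1, 1, gamma)              # contamination indicator, eq. (4.6)
      p0 <- if (r == 0) F0 else G0           # P(Y = 0 | X = x), eq. (4.5)
      y  <- rbinom(n, 1, 1 - p0)             # Y = 1 with probability 1 - p0
      list(x = x, y = y)
    }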

[Figure 4: two panels, "SVM" and "Random Forest"; horizontal axis γ = 0, 0.2, ..., 1, vertical axis Error Rate 0.1 to 0.4; curves SVM vs. CATCH+SVM and RFC vs. CATCH+RFC.]
Figure 4. IMR*'s of SVM, RFC, CATCH+SVM, and CATCH+RFC versus different contamination parameters γ for simulation model (4.9) in Example 4.4.

5 CATCH Analysis of The ANN Data

In this section CATCH is applied to a data set available at the UCI database (http://ftp.ics.uci.edu/pub/machine-learning-databases/thyroid-disease/). The data are from a clinical trial to determine whether a patient is hypothyroid (see, e.g., Quinlan [15]). The dataset is split into two parts: a training set with 3772 observations and a testing set with 3428 observations. The response is a categorical variable with three classes: normal (not hypothyroid), hyperfunction, and subnormal functioning. There are 21 predictors, including 15 categorical predictors and 6 numerical predictors. The three classes are very imbalanced, with 92.5% normal patients. A good classifier should produce a significant decrease from the 7.5% error rate that can be obtained by classifying all observations to the normal class.

CATCH identifies 7 variables as important, including 4 continuous variables and 3 categorical variables. We consider three classification methods: (1) Support Vector Machine with radial kernel, denoted by SVM(r); (2) Support Vector Machine with polynomial kernel, denoted by SVM(p); (3) RFC. We apply these three methods on the training set to build classification models, and use the test set to estimate the misclassification rate of the methods. We apply the methods with CATCH and without CATCH, where "with CATCH" means that only the 7 important variables identified by CATCH are used to build the model and "without CATCH" means that all 21 variables are used in the analysis.

Table 5: Misclassification rates of three classification methods (SVM(r), SVM(p), RFC) with CATCH and without CATCH.

                   SVM(r)   SVM(p)   RFC
    w/o CATCH      0.052    0.063    0.024
    with CATCH     0.037    0.055    0.015

Table 5 shows that if CATCH is used to select important variables before the classification methods are applied, the misclassification rates decrease. CATCH reduces the misclassification rates by 29%, 13%, and 37% for SVM(r), SVM(p), and RFC, respectively. CATCH + RFC has the best performance. Although CATCH is not designed for improving classification accuracy, it can nonetheless help other classification methods to perform better.
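A sketch of this "with CATCH" / "without CATCH" comparison in R is shown below. The exact preprocessing of the thyroid data is not reproduced here; the sketch assumes data frames train and test with the response in a factor column Class, and a character vector keep holding the names of the 7 CATCH-selected predictors (all of these are assumptions for illustration).

    library(e1071)
    library(randomForest)

    test_error <- function(fit, test) mean(predict(fit, test) != test$Class)

    eval_with_without <- function(train, test, keep) {
      f_all  <- Class ~ .
      f_keep <- reformulate(keep, response = "Class")
      c(svm_r_all   = test_error(svm(f_all,  data = train, kernel = "radial"), test),
        svm_r_catch = test_error(svm(f_keep, data = train, kernel = "radial"), test),
        rfc_all     = test_error(randomForest(f_all,  data = train), test),
        rfc_catch   = test_error(randomForest(f_keep, data = train), test))
    }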

6 Properties of Local Contingency Efficacy

In this section we investigate the properties of local contingency efficacy. For a univariate continuous predictor, Theorem 6.1 below shows that local contingency efficacy is well-defined and equals 0 if the response variable is independent of the predictor. Consider an R × C contingency table. Let n_rc, r = 1, ..., R, c = 1, ..., C, be the observed frequency in the (r, c)-cell, and let p_rc be the probability that an observation belongs to the (r, c)-cell. Let

p_r = Σ_{c=1}^{C} p_rc,   q_c = Σ_{r=1}^{R} p_rc,    (6.1)

and

n_{r·} = Σ_{c=1}^{C} n_rc,   n_{·c} = Σ_{r=1}^{R} n_rc,   n = Σ_{r=1}^{R} Σ_{c=1}^{C} n_rc,   E_rc = n_{r·} n_{·c} / n.    (6.2)

For testing that the rows and the columns are independent, i.e.,

H0: p_rc = p_r q_c for all r, c   versus   H1: p_rc ≠ p_r q_c for some r, c,    (6.3)

a standard test statistic is

X² = Σ_{r=1}^{R} Σ_{c=1}^{C} (n_rc − E_rc)² / E_rc.    (6.4)

Under H0, X² converges in distribution to χ²_{(R−1)(C−1)}. Moreover, defining 0/0 = 0, we have the following well-known result.

Lemma 6.1. As n tends to infinity, X²/n tends in probability to

τ² ≡ Σ_{r=1}^{R} Σ_{c=1}^{C} (p_rc − p_r q_c)² / (p_r q_c).

Moreover, if p_r q_c is bounded away from 0 and 1 as n → ∞, and if R and C are fixed as n → ∞, then τ² = 0 if and only if H0 is true.

Remark 6.1. Lemma 6.1 shows that the contingency efficacy ζ in (2.8) is well defined. Moreover, ζ = 0 if and only if X and Y are independent.
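For illustration, the quantity in Lemma 6.1 can be computed in R from a discrete predictor x and the response y as follows (a minimal sketch; the formal definition of the contingency efficacy is (2.8) in Section 2):

    ## sqrt(X^2 / n) with X^2 as in (6.4); estimates tau of Lemma 6.1.
    contingency_efficacy <- function(x, y) {
      tab <- table(x, y)                              # R x C table of counts n_rc
      n   <- sum(tab)
      E   <- outer(rowSums(tab), colSums(tab)) / n    # E_rc = n_r. n_.c / n
      X2  <- sum((tab - E)^2 / E)                     # eq. (6.4); cf. chisq.test(tab, correct = FALSE)
      sqrt(X2 / n)
    }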

Theorem 6.1. Consider a categorical response Y and a continuous predictor X with density function f(x) such that, for a fixed h > 0, f(x) > 0 for x ∈ (x^(0) − h, x^(0) + h). Then:

(1) For fixed h, the estimated local section efficacy ζ̂(x^(0), h) defined in (2.5) converges almost surely to the section efficacy ζ(x^(0), h) defined in (2.4).

(2) If X and Y are independent, then the local efficacy ζ(x^(0)) defined in (2.4) is zero.

(3) If there exists c ∈ {1, ..., C} such that p_c(x) = P(Y = c | x) has a continuous derivative at x = x^(0) with p_c'(x^(0)) ≠ 0, then ζ(x^(0)) > 0.
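A sketch of the estimated local section efficacy for a continuous predictor is given below in R; it follows the construction used in the proof of Theorem 6.1 (window around x^(0), split at x^(0), crossed with Y), with our own function name, and is only an illustration of the quantity defined formally in (2.5).

    local_efficacy <- function(x, y, x0, h) {
      inside <- abs(x - x0) < h                                 # window (x0 - h, x0 + h)
      if (sum(inside) < 2) return(0)
      side <- factor(x[inside] > x0, levels = c(FALSE, TRUE))   # two rows: left / right of x0
      tab  <- table(side, y[inside])                            # 2 x C local table
      E    <- outer(rowSums(tab), colSums(tab)) / sum(tab)
      X2   <- sum(ifelse(E > 0, (tab - E)^2 / E, 0))            # local chi-square, 0/0 := 0
      sqrt(X2 / length(x))                                      # zeta_hat = n^(-1/2) sqrt(X^2)
    }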

The χ² statistic (6.4) can be written in terms of multinomial proportions p̂_rc, with √n(p̂_rc − p_rc), 1 ≤ r ≤ R, 1 ≤ c ≤ C, converging in distribution to a multivariate normal. By a Taylor expansion of T_n ≡ √(X²/n) at τ = plim √(X²/n), we find the following.

Theorem 6.2.

√n (T_n − τ) →_d N(0, v²),    (6.5)

where the asymptotic variance v², which can be obtained from the Taylor expansion, is not needed in this paper.

Remark 6.2. The result (6.5) applies to the multivariate χ²-based efficacies ζ̂, with n replaced by the number of observations that go into the computation of ζ̂. In this case we use the notation χ²_j for the chi-square statistic for the jth variable.

7 Consistency Of CATCH

Let Y denote a variable to be classified into one of the categories 1, ..., C by using the Xj's in the vector X = (X1, ..., Xd). Let X_{−j} ≡ X − {Xj} denote X without the Xj component. Assume that the squared Pearson nonparametric correlation (Doksum and Samarov [4]; Huang and Chen [9]) between Xj and X_{−j} is less than 0.5. We call the variable Xj relevant for Y if, conditionally given X_{−j}, L(Y | Xj = x) is not constant in x, and irrelevant otherwise. Our data are (X^(1), Y^(1)), ..., (X^(n), Y^(n)), i.i.d. as (X, Y). A variable selection procedure is consistent if the probability that all the relevant variables are selected and none of the irrelevant variables are selected tends to one as n tends to infinity.

We give a general nonparametric framework in which the variable selection rule based on the CATCH importance scores s_j given in (3.8) is consistent.

When the predictors are not all categorical and CATCH involves tubes, we assume that the number of observations in each tube tends to ∞ as n → ∞, and we assume that, for those importance scores that require the selection of a section size h, the selection is from a finite set {h_j} with min_j{h_j} > h_0 for some h_0 > 0. It follows that the number of observations m_jk in the kth tube used to calculate the importance score s_j for the variable X_j tends to infinity as n → ∞.

The key quantities that determine consistency are the limits λ_j of s_j. That is,

λ_j = τ(ζ_j, P) = E_P[ζ_j(X)] = ∫ ζ_j(x) dP(x),   j = 1, ..., d,

where ζ_j(x) is the local contingency efficacy defined by (2.4) for numerical X's and by (2.8) for categorical X's, and P is the probability distribution of X. Our estimate s_j of τ_j ≡ λ_j is

s_j = τ(ζ̂_j, P̂) = (1/M) Σ_{k=1}^{M} ζ̂_j(X*_k),

where ζ̂_j is defined in Section 3 and P̂ is the empirical distribution of X.

Let d_0 denote the number of X's that are irrelevant for Y. Set d_1 = d − d_0 and, without loss of generality, reorder the λ_j so that

λ_j > 0 if j = 1, ..., d_1;   λ_j = 0 if j = d_1 + 1, ..., d.    (7.1)
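To make the estimate s_j concrete, the following R sketch averages local efficacies over M random centers, reusing the local_efficacy sketch given after Theorem 6.1. It is a deliberate simplification: the "tube" here is a plain one-dimensional window with no adaptive conditioning on the other predictors, so it only illustrates the averaging formula for s_j, not the full CATCH score.

    importance_score <- function(x, y, M = length(x), a = 0.15) {
      n <- length(x)
      m <- max(2, floor(n * a))                # tube size m = [na]
      centers <- sample(x, M, replace = TRUE)  # random tube centers X*_k
      mean(sapply(centers, function(x0) {
        h <- sort(abs(x - x0))[m]              # half-width reaching the m nearest points
        local_efficacy(x, y, x0, h)
      }))
    }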

Our variable selection rule is

"keep variable X_j if s_j ≥ t for some threshold t".    (7.2)

Rule (7.2) is consistent if and only if

P(min_{j≤d_1} s_j ≥ t and max_{j>d_1} s_j < t) → 1.    (7.3)

To establish (7.3) it is enough to show that

P(max_{j>d_1} s_j > t) → 0    (7.4)

and

P(min_{j≤d_1} s_j ≤ t) → 0.    (7.5)

We call (7.4) and (7.5) type I and type II consistency, respectively. Type I and II error probabilities refer to the probabilities of any false positives and any false negatives, respectively.

7.1 Type I consistency

We consider (7.4) first and note that

P(max_{j>d_1} s_j > t) ≤ Σ_{j=d_1+1}^{d} P(s_j > t).    (7.6)

For the jth irrelevant variable X_j, j > d_1, the results in Section 6 apply to the local efficacy ζ̂_j(x) at a fixed point x. The importance score s_j defined in (3.8) is the average of such efficacies over a sample of M random points x*_a. We assume that there is a point x^(0) such that, for irrelevant X_j's and for n large enough, s_j^(0) ≡ ζ̂_j(x^(0)) is stochastically larger in the right tail than the average M^{−1} Σ_{a=1}^{M} ζ̂_j(x*_a). That is, we assume that there exist t > 0 and n_0 such that, for n ≥ n_0,

P(s_j > t) ≤ P(s_j^(0) > t),   j > d_1.    (7.7)

The existence of such a point x^(0) is established by considering the probability limit of

arg max{ζ̂_j(x*_a) : x*_a ∈ {x*_1, ..., x*_M}}.

Note that by Lemma 6.1, s_j^(0) →_p 0 as n → ∞. It follows that we have established (7.4) for t and d_0 = d − d_1 that are fixed as n → ∞. To examine the interesting case where d_0 → ∞ as n → ∞ we need large deviation results, to which we turn next. In the d_0 → ∞ case we consider t ≡ t_n tending to ∞ as n → ∞, and we use the weaker assumption: there exist x^(0) and n_0 such that

P(s_j > t_n) ≤ P(s_j^(0) > t_n)    (7.8)

for n ≥ n_0 and each t_n with lim_{n→∞} t_n = ∞. We also assume that the number of columns C and rows R in the contingency table that defines the efficacy χ²_{jn} given by (6.4) stays fixed as n → ∞. In this case P(s_j^(0) > t_n) → 0 provided each term

[T_rc^(j)]² ≡ [(n_rc − E_rc) / (√n √E_rc)]²

in s_j² = χ²_{jn}, defined for X_j by (6.4), satisfies

P(|T_rc^(j)| > t_n) → 0.

This is because

P(√(Σ_r Σ_c [T_rc^(j)]²) > t_n) ≤ P(RC max_{r,c} [T_rc^(j)]² > t_n²) ≤ RC max_{r,c} P(|T_rc^(j)| > t_n / √(RC)),

and t_n and t_n/√(RC) are of the same order as n → ∞. By large deviation theory (Hall [8] and Jing et al. [10]), for some constant A,

P(|T_rc^(j)| > t_n) = (2 t_n^{−1} / √(2π)) e^{−t_n²/2} [1 + A t_n³ n^{−1/2} + o(t_n³ n^{−1/2})],    (7.9)

provided t_n = o(n^{1/6}). If (7.9) is uniform in j > d_1, then (7.6) and (7.9) imply that type I consistency holds if

d_0 (t_n / √2)^{−1} e^{−t_n²/2} → 0.    (7.10)

To examine (7.10), write d_0 and t_n as d_0 = exp(n^b) and t_n = √2 n^r. By large deviation theory, (7.10) holds when r < 1/6. Thus if 0 < b/2 < r < 1/6, then

d_0 (t_n / √2)^{−1} e^{−t_n²/2} = n^{−r} exp(n^b − n^{2r}) → 0.

Thus d_0 = exp(n^b) with b = 1/3 − ε, for any ε > 0, is the largest d_0 can be to have consistency. For instance, if there are exactly d_0 = exp(n^{1/4}) irrelevant variables and we use the threshold t_n = √2 n^{1/7}, the probability of selecting any of the irrelevant variables tends to zero at the rate n^{−1/4} exp(n^{1/4} − n^{2/7}).

7.2 Type II consistency

We now assume that for each relevant variable X_j there is a least favorable point x^(1) such that the local efficacy s_j^(1) ≡ ζ̂_j(x^(1)) is stochastically smaller than the importance score s_j = M^{−1} Σ_{a=1}^{M} ζ̂_j(x*_a). That is, we assume that for each relevant variable X_j there exist n_1 and x^(1) such that, for n ≥ n_1,

P(s_j^(1) ≤ t_n) ≥ P(s_j ≤ t_n),   j ≤ d_1.    (7.11)

The existence of such a least favorable x^(1) is established by considering the probability limit of

arg min{ζ̂_j(x*_a) : x*_a ∈ {x*_1, ..., x*_M}}.

We again assume that R and C are fixed, so that to show type II consistency it is enough to show that, for each (r, c) and j ≤ d_1,

P(|T_rc^(j)| ≤ t_n) → 0.

This is because

P(√(Σ_r Σ_c [T_rc^(j)]²) ≤ t_n) ≤ P(min_{r,c} [T_rc^(j)]² ≤ t_n²) ≤ RC max_{r,c} P(|T_rc^(j)| ≤ t_n).

By Theorem 6.2, √n (s_j^(1) − τ_j) →_d N(0, ν²), τ_j > 0. It follows that if t_n and d_1 are bounded above as n → ∞ and τ_j > 0, then P(s_j^(1) ≤ t_n) → 0. Thus type II consistency holds in this case. To examine the more interesting case, where t_n and d_1 are not bounded above and τ_j may tend to zero as n → ∞, we will need to center the T_rc^(j). Thus set

θ_n^(j) = √n (p_rc^(j) − p_r^(j) p_c^(j)) / √(p_r^(j) p_c^(j)),    (7.12)

T_n^(j) = T_rc^(j) − θ_n^(j),

where for simplicity the dependence of T_n^(j) and θ_n^(j) on r and c is not indicated.

Lemma 7.1.

P(min_{j≤d_1} |T_rc^(j)| ≤ t_n) ≤ Σ_{j=1}^{d_1} P(|T_n^(j)| ≥ min_{j≤d_1} |θ_n^(j)| − t_n).

Proof.

P(min_{j≤d_1} |T_rc^(j)| ≤ t_n) = P(min_{j≤d_1} |T_n^(j) + θ_n^(j)| ≤ t_n)
  ≤ P(min_{j≤d_1} |θ_n^(j)| − max_{j≤d_1} |T_n^(j)| ≤ t_n)
  = P(max_{j≤d_1} |T_n^(j)| ≥ min_{j≤d_1} |θ_n^(j)| − t_n)
  ≤ Σ_{j=1}^{d_1} P(|T_n^(j)| ≥ min_{j≤d_1} |θ_n^(j)| − t_n).

Set

a_n = min_{j≤d_1} |θ_n^(j)| − t_n;

then, by large deviation theory, if a_n = o(n^{1/6}),

P(|T_n^(j)| ≥ a_n) ∝ (a_n / √2)^{−1} exp{−a_n²/2},    (7.13)

where ∝ signifies asymptotic order as n → ∞, a_n → ∞. It follows that if (7.13) holds uniformly in j, then by Lemma 7.1,

P(min_{j≤d_1} |T_rc^(j)| ≤ t_n) ∝ d_1 (a_n / √2)^{−1} exp{−a_n²/2}.

Thus type II consistency holds when a_n = o(n^{1/6}) and

d_1 (a_n / √2)^{−1} exp{−a_n²/2} → 0.    (7.14)

In Section 7.1, t_n is of order n^r, 0 < r < 1/6. For this t_n, type II consistency holds if a_n is of order n^r, 0 < r < 1/6, and

d_1 n^{−r} exp{−n^{2r}/2} → 0.

If a_n = O(n^k) with k ≥ 1/6, then P(|T_n^(j)| ≥ a_n) tends to zero at a rate faster than in (7.13) and consistency is ensured. For instance, if all relevant variables X_j have p_rc^(j) fixed as n → ∞, then min_{j≤d_1} |θ_n^(j)| is of order √n and a_n is of order n^{1/2} − t_n.

7.3 Type I and II consistency

We see from Sections 7.1 and 7.2 that consistency holds under a variety of conditions. One set of conditions is the following.

Theorem 7.1. Under the regularity conditions of Sections 7.1 and 7.2, if

d_0 = c_0 exp(n^b),   t_n = c_1 n^r,   min_{j≤d_1} |θ_n^(j)| − t_n = c_2 n^r,   d_1 = c_3 exp(n^b)

for positive constants c_0, c_1, c_2, and c_3, and 0 < b/2 < r < 1/6, then CATCH is consistent.

Appendix: Proof of Theorem 6.1

(1) Let N_h = Σ_{i=1}^{n} I(X^(i) ∈ N_h(x^(0))). Then N_h ~ Bi(n, ∫_{x^(0)−h}^{x^(0)+h} f(x) dx) and N_h → ∞ almost surely as n → ∞. We have

ζ̂(x^(0), h) = n^{−1/2} √(X²(x^(0), h)) = (N_h / n)^{1/2} (N_h)^{−1/2} √(X²(x^(0), h)).

Since N_h ~ Bi(n, ∫_{x^(0)−h}^{x^(0)+h} f(x) dx), we have

N_h / n → ∫_{x^(0)−h}^{x^(0)+h} f(x) dx.    (7.15)

By Lemma 6.1 and Table 1,

(N_h)^{−1/2} √(X²(x^(0), h)) → (Σ_{r=+,−} Σ_{c=1}^{C} (p_rc − p_r q_c)² / (p_r q_c))^{1/2},    (7.16)

where

p_{+c} = Pr(X > x^(0), Y = c | X ∈ N_h(x^(0))),   p_{−c} = Pr(X ≤ x^(0), Y = c | X ∈ N_h(x^(0)))

for r = +, − and c = 1, ..., C, and

p_r = Σ_{c=1}^{C} p_rc,   q_c = Σ_{r=+,−} p_rc.

In fact, p_+ = Pr(X > x^(0) | X ∈ N_h(x^(0))), p_− = Pr(X ≤ x^(0) | X ∈ N_h(x^(0))), and q_c = Pr(Y = c | X ∈ N_h(x^(0))).

By (7.15) and (7.16),

ζ̂(x^(0), h) → (∫_{x^(0)−h}^{x^(0)+h} f(x) dx)^{1/2} (Σ_{r=+,−} Σ_{c=1}^{C} (p_rc − p_r q_c)² / (p_r q_c))^{1/2} ≡ ζ(x^(0), h).

(2) Since X and Y are independent, p_rc = p_r q_c for r = +, − and c = 1, ..., C; then ζ(x^(0), h) = 0 for any h. Hence

ζ(x^(0)) = sup_{h>0} ζ(x^(0), h) = 0.

(3) Recall that p_c(x^(0)) = Pr(Y = c | x^(0)). Without loss of generality, assume p_c'(x^(0)) > 0. Then there exists h such that

Pr(Y = c | x^(0) < X < x^(0) + h) > Pr(Y = c | x^(0) − h < X < x^(0)),

which is equivalent to p_{+c}/p_+ > p_{−c}/p_−. This shows that ζ(x^(0)) > 0, since by Lemma 6.1, ζ(x^(0)) = 0 would imply p_{+c}/p_+ = p_{−c}/p_− = p_c.

References

[1] Breiman, L., 2001. Random forests. Machine Learning 45, 5–32.

[2] Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J., 1984. Classification and Regression Trees. Wadsworth, Belmont.

[3] Doksum, K., Tang, S., Tsui, K.-W., 2008. Nonparametric variable selection: The EARTH algorithm. Journal of the American Statistical Association 103(484), 1609–1620.

[4] Doksum, K.A., Samarov, A., 1995. Nonparametric estimation of global functionals and a measure of explanatory power of covariates in regression. Annals of Statistics 23, 1443–1473.

[5] Doksum, K.A., Schafer, C., 2006. Powerful choices: Tuning parameter selection based on power. In: Frontiers in Statistics: Dedicated to Peter Bickel (J. Fan and H.L. Koul, eds.). Imperial College Press, 113–141.

[6] Friedman, J., 1991. Multivariate adaptive regression splines. The Annals of Statistics, 1–67.

[7] Gao, J., Gijbels, I., 2008. Bandwidth selection in nonparametric kernel testing. Journal of the American Statistical Association 103(484), 1584–1594.

[8] Hall, P., 1992. The Bootstrap and Edgeworth Expansion. Springer, New York.

[9] Huang, L.-S., Chen, J., 2008. Analysis of variance, coefficient of determination and F-test for local polynomial regression. Annals of Statistics 36, 2085–2109.

[10] Jing, B.-Y., Shao, Q.-M., Wang, Q., 2003. Self-normalized Cramér-type large deviations for independent random variables. Annals of Probability 31, 2167–2215.

[11] Lin, D.Y., Zeng, D., 2011. Correcting for population stratification in genomewide association studies. Journal of the American Statistical Association 106, 997–1008.

[12] Loh, W.-Y., 2009. Improving the precision of classification trees. The Annals of Applied Statistics 3, 1710–1737.

[13] Price, A., Patterson, N., Plenge, R., Weinblatt, M., Shadick, N., Reich, D., 2006. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38(8), 904–909.

[14] Qiao, Z., Zhou, L., Huang, J., 2008. Linear discriminant analysis for high-dimensional, low sample size data. Special issue of World Congress on Engineering 2008.

[15] Quinlan, J.R., 1987. Simplifying decision trees. International Journal of Man-Machine Studies 27(3), 221–234.

[16] Schafer, C., Doksum, K.A., 2009. Selecting local models in multiple regression by maximizing power. Metrika 69, 283–304.

[17] Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267–288.

[18] Wang, L., Shen, X., 2007. On L1-norm multi-class support vector machines: methodology and theory. Journal of the American Statistical Association 102, 583–594.

[19] Zhang, H.H., Liu, Y., Wu, Y., Zhu, J., 2008. Variable selection for the multicategory SVM via adaptive sup-norm regularization. Electronic Journal of Statistics 2, 149–167.



To contrast this example with a simpler data generating model suppose the effectsof predictors are additive that is the categorical variable is related to the response asan added term to the effect of the numerical predictors In this type of simpler modelsthe RFC and GUIDE algorithm can efficiently identify relevant predictors but so canthe CATCH algorithm We do not include the simulation results for additive modelshere

Remark 41 We observe that the CATCH algorithm is not very sensitive to the choiceof the number of tube centers M or the tube size m = [na] where 0 lt a le 1 and [ ] isthe greatest integer function For example 41 we perform the simulation using M = nand M = n2 and a = 01 015 02 03 05 The results are summarized in Table 3The CATCH algorithm performs similarly for M = n and M = n2 with M = n havinga small winning edge We generally suggest to use M = n if computation resource isof a less concern or n is not large (le 200) When computational cost is a concern(Remark 35) one can reduce tube size to M = n2 for a compromise between detectionpower and the computational cost Table 3 also shows that the tube size works well inthe range between 01 and 03 while the detection power of X5 decreases for the casea = 05 where the tube size is too big to achieve local conditioning This is a consistent

14

observation with the comparison to Marginal CATCH in Table 2 On the other handthe tube size should not be too small to achieve sufficient detection power We suggestto use a = 01 or 015 for sample sizes n ge 400 or m = 50 for smaller sample sizesIn light of these guidelines we use M = n2 = 200 and a = 015 for Examples 41-43M = n = 200 and m = 50 for Example 44

Table 3 Sensitivity study of CATCH for Example 41 (n = 400 d = 50) using different number oftube centers M and tube size m = [na] For 1 le i le 5 the ith column gives the number of times Xi

was selected Column 6 gives the percentage of times irrelevant variables were selected

Number of Selections(100 simulations)M a X1 X2 X3 X4 X5 Xi i ge 6

M = 200 010 75 67 77 69 92 78015 76 76 87 84 96 82020 85 84 90 87 93 99030 89 87 95 89 85 96050 89 90 96 93 62 99

M = 400 010 74 76 82 77 92 72015 85 82 89 87 94 86020 87 88 90 86 96 91030 88 89 91 91 92 94050 89 91 94 93 61 102

In Example 41 all predictors are independent In a similar setting we next considerthe case where the predictors are dependent

Example 42 We generate standard normal random variables Zi for i = 1 middot middot middot d insuch a way that cor(Z1 Z4) = 05 cor(Z3 Z6) = 09and the other Zirsquos are independentWe set Xi = Φ(Zi) for i = 1 middot middot middot d where Φ(middot) is the CDF of a standard normal andtherefore Xi are uniformly distributed marginally as in Example 41 The correlationstructure between the Xirsquos are similar to that of the Zirsquos The dependent variable Y isgenerated in the same way as in Example 41

The results of Example 42 are given in Table 4 By comparing with results in Table2 both CATCH and GUIDE are doing well in the presence of dependence between thenumerical variables The explanation is two-fold First both methods could identify thecategorical X5 as an important predictor most of the time which make the identificationof other relevant predictors possible Second conditioning on X5 Y depends on a pairof variables (ldquoX1 and X2rdquo or ldquoX3 and X4rdquo) jointly and in particular this dependence isnonlinear which makes both methods less affected by the correlation introduced here

15

00 02 04 06 08 10

00

02

04

06

08

10

(a) X5 = 0

X1

X2

00 02 04 06 08 10

00

02

04

06

08

10

(b) X5 = 1

X3

X4

Figure 2 Sample scatter plots of (x1 x2) and (x3 x4) for simulated data using model (41) The leftpanel includes all the pairs (x1 x2) with x5 = 0 while the right panel includes all the pairs (x3 x4)with x5 = 1 The larger dots in the two plots are within the tube around a tube center shown as thestar

We now explain the latter in details for both methods In CATCH the algorithmadaptively weighs the effects of predictors by their importance scores in defining thetubes when they are being conditioned on therefore identifying either of the two willhelp to identify the other whereas an irrelevant variable though strongly associated withone of the relevant variables is down-weighted iteratively and does not strongly affectthe selection of the relevant variable This can explain the slight decreasing selectionprobability in X3 in presence of a strongly correlated X6 and consequently overall lessselection of X3 and X4 compared to X1 and X2 given the correlation between X1 andX4 In GUIDE the difference between the dependent and independent cases is evenless because GUIDE is very powerful in detecting local interaction by literally includingpairwise interactions when selecting splitting variables

Marginal CATCH and Random Forest do poorly in this case mainly because X5 cannot be detected as we explained earlier More specifically the dependence between Yand X4 in X5 = 1 case is the same as that between Y and X2 in X5 = 0 The linearterms of X1 and X2 have opposite signs in decision function therefore marginal effectsof X1 and X4 are obscured when X5 is not identified as an important variable

42 Improving The Performance of SVM and RF Classifiers

The criterion used in the CATCH algorithm is not based on classification accuracyBut if we use CATCH to screen out the irrelevant variables and only use the important

16

Tab

le4

Sim

ula

tion

resu

lts

from

Exam

ple

42

wit

hsa

mp

lesi

zen

=400

For

1leile

6

theit

hC

olu

mn

giv

esth

ees

tim

ate

dp

rob

ab

ilit

yof

sele

ctio

nof

vari

ableX

ian

dth

est

and

ard

erro

rb

ased

on

100

sim

ula

tion

sw

her

eX

ii

=1middotmiddotmiddot

5are

rele

vant

vari

ab

les

wit

hcor(X

1X

4)

=5

X

6is

anir

rele

vant

vari

able

that

isco

rrel

ated

wit

hX

3w

ith

corr

elati

on

9

Colu

mn

7giv

esth

em

ean

of

the

esti

mate

dpro

bab

ilit

yof

sele

ctio

nan

dth

atof

the

stan

dar

der

ror

for

the

irre

leva

nt

vari

ab

lesX

7middotmiddotmiddotX

d

Num

ber

ofSel

ecti

ons(

100

sim

ula

tion

s)

ME

TH

OD

dX

1X

2X

3X

4X

5X

6Xiige

7

CA

TC

H10

076

(00

4)0

77(0

04)

073

(00

4)0

68(0

05)

099

(00

1)0

34(0

05)

007

(00

2)

500

78(0

04)

077

(00

4)0

74(0

04)

063

(00

5)0

92(0

03)

057

(00

5)0

09(0

03)

100

076

(00

4)0

77(0

04)

08

(00

4)0

74(0

04)

082

(00

4)0

55(0

05)

009

(00

3)

Mar

ginal

CA

TC

H10

035

(00

5)0

63(0

05)

08

(00

4)0

32(0

05)

009

(00

3)0

71(0

05)

012

(00

3)

500

34(0

05)

071

(00

5)0

7(0

05)

03

(00

5)0

1(0

03)

06

(00

5)0

11(0

03)

100

034

(00

5)0

68(0

05)

075

(00

4)0

3(0

05)

01

(00

3)0

59(0

05)

01

(00

3)

Ran

dom

For

est

100

29(0

05)

048

(00

5)0

55(0

05)

02

(00

4)0

3(0

05)

047

(00

5)0

05(0

02)

500

26(0

04)

052

(00

5)0

57(0

05)

03

(00

5)0

24(0

04)

045

(00

5)0

08(0

03)

100

036

(00

5)0

61(0

05)

066

(00

5)0

37(0

05)

032

(00

5)0

46(0

05)

012

(00

3)

GU

IDE

100

96(0

02)

096

(00

2)0

94(0

02)

095

(00

2)0

96(0

02)

084

(00

4)0

3(0

05)

500

8(0

04)

092

(00

3)0

88(0

03)

079

(00

4)0

72(0

04)

068

(00

5)0

12(0

03)

100

081

(00

4)0

89(0

03)

09

(00

3)0

76(0

04)

075

(00

4)0

58(0

05)

008

(00

3)

17

variables for classification the performance of other algorithms such as Support VectorMachine (SVM) and Random Forest Classifiers (RFC) can be significantly improvedWe compare the prediction performance of (1) SVM (2) RFC (3) CATCH followedby SVM and (4) CATCH followed by RFC in this section We label (3) and (4)with CATCH+SVM and CATCH+RFC respectively In this section we use the ldquosvmrdquofunction in the ldquoe1071rdquo package in R with radial kernel to perform SVM analysisand use the ldquorandomForestrdquo function in the ldquorandomForestrdquo package to perform RFCanalysis We use simulation experiments to illustrate the improvement obtained byusing CATCH prior to classifying Y

Let the true model be

(X Y ) sim P (42)

where P will be specified later In each of 100 Monte Carlo experiments n pairs of(X(i) Y (i)) i = 1 middot middot middot n are generated from the true model (42) Based on the simu-lated data (X(i) Y (i)) i = 1 middot middot middot n the methods (1) SVM (2) RFC (3) CATCH+SVM(4) CATCH+RFC are used to classify a future Y (0) for which we have available a cor-responding X(0)

The integrated-misclassification rate (IMR Friedman [6]) defined below is used toevaluate the performance of the classification algorithms Let FXY(middot) be the classifierconstructed by a classification algorithm based on X = (X(1) middot middot middot X(n)) and Y =(Y (1) middot middot middot Y (n))T

Let (X(0) Y (0)) be a future observation with the same distribution as (X(i) Y (i))We only know X(0) = x(0) and want to classify Y (0) For a possible future observation(x(0) y(0)) define

MR(XYx(0) y(0)) = P (FXY(x(0)) 6= y(0)|XY) (43)

Our criterion is

IMR = EE0[MR(XYx(0) y(0))] (44)

where E0 is with respect to the distribution of (x(0) y(0)) random and E is with respectto the distribution of (XY)

This double expectation is evaluated using Monte Carlo simulation The result isdenoted as IMRlowast More precisely E0 is estimated using 5000 simulated (x(0) y(0)) obser-vations then IMRlowast is computed on the basis of 100 simulated (XiYi) i = 1 middot middot middot 100samples

Example 43 First consider model (41) in section 41 Figure 3 shows IMRlowast versusthe number of irrelevant variables for this example The left panel shows that theIMRlowastrsquos of the SVM approach increases dramatically as the number of irrelevant variablesincreases while the IMRlowastrsquos of CATCH+SVM are stable Thus CATCH stabilizes the

18

performance of SVM for this complex model The right panel shows that CATCH alsostabilizes the performance of RFC These results indicate that CATCH can improve theprediction accuracy of other classification algorithms

20 40 60 80 100

025

030

035

040

045

050

055

060

SVM

d

Err

or R

ate

SVMCATCH+SVM

20 40 60 80 100

025

030

035

040

045

050

055

060

Random Forest

d

RFCCATCH+RFC

Figure 3 IMRlowastrsquos of SVM RFC CATCH+SVM CATCH+RFC versus the number of predictors forthe simulation study in model (41) in example 41 The sample size is n = 400

Example 44 Contaminated Logistic Model Consider a binary response Y isin 0 1with

P (Y = 0|X = x) =

F (x) if r = 0G(x) if r = 1

(45)

where x = (x1 xd)

r sim Bernoulli(γ) (46)

F (x) = (1 + expminus(025 + 05x1 + x2))minus1 (47)

G(x) =

01 if 025 + 05x1 + x2 lt 009 if 025 + 05x1 + x2 ge 0

(48)

and x1 x2 xd are realizations of independent standard normal random variablesHence given X(i) = x(i) 1 le i le n the joint distribution of Y (1) Y (2) Y (n) is thecontaminated logistic model with contamination parameter γ

19

(1minus γ)nprodi=1

[F (x(i))]Y(i)

[1minus F (x(i))]1minusY(i)

+ γ

nprodi=1

[G(x(i))]Y(i)

[1minusG(x(i))]1minusY(i)

(49)

We perform simulation study for n = 200 d = 50 and γ = 0 02 04 06 08 and1 where γ = 0 corresponds to no contamination Figure 4 shows the IMRlowastrsquos versus thedifferent values of γ for this example CATCH improves the performance of the SVMand Random Forest classifiers for this contaminated logistic regression model

01

02

03

04

SVM

γ

Err

or R

ate

0 02 04 06 08 1

SVMCATCH+SVM

01

02

03

04

Random Forest

γ

0 02 04 06 08 1

RFCCATCH+RFC

Figure 4 IMRlowastrsquos of SVM RFC CATCH+SVM CATCH+RFC versus different contamination pa-rameters γ for simulation model (49) in example 44

5 CATCH Analysis of The ANN Data

In this section CATCH is applied to a data set available at the UCI data base(httpftpicsuciedupubmachine-learning-databasesthyroid-disease) The data is from a clinical trial to determine whether a patient is hypothyroid (seeeg Quinlan [15]) The dataset is split into two parts a training set with 3772 ob-servations and a testing set with 3428 observations The response is a categoricalvariable with three classes normal (not hypothyroid) hyperfunction and subnormalfunctioning There are 21 predictors including 15 categorical predictors and 6 numer-ical predictors The three classes are very imbalanced with 925 of normal patientsA good classifier should produce a significant decrease from an error rate 75 whichcan be obtained by classifying all observations to the normal class

20

CATCH identifies 7 variables as important including 4 continuous variables and3 categorical variables We consider three classification methods 1 Support VectorMachine with radial kernel denoted by SVM(r) 2 Support Vector Machine withpolynomial kernel denoted by SVM(p) 3 RFC We apply these three methods onthe training set to build classification models and use the test set to estimate themisclassification rate of the methods We apply the methods with CATCH and withoutCATCH respectively where ldquowith CATCHrdquo means that only the 7 important variablesidentified by CATCH are used to build the model and ldquowithout CATCHrdquo means thatall the 21 variables are used in the analysis

Table 5 Misclassification rates of three classification methods (SVM(r) SVM(p) RFC) with CATCHand without CATCH

SVM(r) SVM(p) RFC

wo CATCH 0052 0063 0024

with CATCH 0037 0055 0015

Table 5 shows that if CATCH is used to select important variables before the clas-sification methods are applied the misclassification rates decrease CATCH reducesthe misclassification rates by 29 13 and 37 for SVM(r) SVM(p) and RFC respec-tively CATCH + RFC has the best performance Although CATCH is not designed forimproving classification accuracy nonetheless it can help other classification methodsto perform better

6 Properties of Local Contingency Efficacy

In this section we investigate the properties of local contingency efficacy For a uni-variate continuous predictor Theorem 61 below shows that local contingency efficacyis well-defined and equals to 0 if the response variable is independent of the predictorConsider an R times C contingency table Let nrc r = 1 middot middot middot R c = 1 middot middot middot C be theobserved frequency in the (r c)-cell and let prc be the probability that an observationbelongs to the (r c)-cell Let

pr =Csumc=1

prc qc =Rsumr=1

prc (61)

21

and

nrmiddot =Csumc=1

nrc nmiddotc =Rsumi=1

nrc n =Rsumi=1

Csumc=1

nrc Erc = nrmiddotnmiddotcn (62)

Consider testing that the rows and the columns are independent ie

H0 forallr c prc = prqc versus H1 existr c prc 6= prqc (63)

a standard test statistic is

X 2 =Rsumr=1

Csumc=1

(nrc minus Erc)2

Erc (64)

Under H0 X 2 converges in distribution to χ2(Rminus1)(Cminus1) Moreover define 00 = 0 we

have the well known result

Lemma 61 As n tends to infinity X 2n tends in probability to

τ 2 equivRsumr=1

Csumc=1

(prc minus prqc)2

prqc

Moreover if prqc is bounded away from 0 and 1 as n rarr infin and if R and C are fixedas nrarrinfin then τ 2 = 0 if and only if H0 is true

Remark 61 Lemma 61 shows that the contingency efficacy ζ in (28) is well definedMoreover ζ = 0 if and only if X and Y are independent

Theorem 61 Consider a categorical response Y and a continuous predictor X withdensity function f(x) such that for a fixed h gt 0 f(x) gt 0 for x isin (x(0) minus h x(0) + h)then

(1) For fixed h the estimated local section efficacy ζ(x(0) h) defined in (25) convergesalmost surly to the section efficacy ζ(x(0) h) defined in (24)

(2) If X and Y are independent then the local efficacy ζ(x(0)) defined in (24) is zero

(3) If there exists c isin 1 middot middot middot C such that pc(x) = P (Y = c|x) has a continuousderivative at x = x(0) with pprimec(x

(0)) 6= 0 then ζ(x(0)) gt 0

The χ2 statistic (64) can be written in terms of multinomial proportions prc withradicn(prc minus prc) 1 le r le R 1 le c le C converging in distribution to a multivariate

normal By Taylor expansion of Tn equivradicχ2n at τ = plim

radicχ2n we find that

22

Theorem 62 radicn(Tn minus τ) minusrarrd N(0 v2) (65)

where the asymptotic variance v2 which can be obtained from the Taylor expansion isnot needed in this paper

Remark 62 The result (65) applies to the multivariate χ2-based efficacies ζ with nreplaced by the number of observations that goes in to the computations of ζ In thiscase we use the notation χ2

j for the chi-square statistic for the jth variable

7 Consistency Of CATCH

Let Y denote a variable to be classified into one of the categories 1 C by usingthe Xjrsquos in the vector X = (X1 Xd) Let Xminusj equiv X minusXj denote X without the Xj

component Assume that the squared Pearson nonparametric correlation (Doksum andSamarov [4] Huang and Chen [9]) between Xj and Xminusj is less than 05 We call thevariable Xj relevant for Y if conditionally given Xminusj L(Y |Xj = x) is not constant in xand irrelevant otherwise Our data are (X(1) Y (1)) (X(n) Y (n)) iid as (X Y ) Avariable selection procedure is consistent if the probability that all the relevant variablesare selected and none of the irrelevant variables are selected tends to one as n tends toinfinity

We give a general nonparametric framework where the variable selection rule basedon the CATCH importance scores sj given in (38) is consistent

When the predictors are not all categorical and CATCH involves tubes we assumethat the number of observations in each tube tends toinfin as nrarrinfin and we assume thatfor those importance scores that require the selection of a section size h the selection isfrom a finite set hj with minjhj gt h0 for some h0 gt 0 It follows that the numberof observations mjk in the kth tube used to calculate the importance score sj for thevariable Xj tends to infinity as nrarrinfin

The key quantities that determine consistency are the limits λj of sj That is

λj = τ(ζj P ) = EP [ζj(X)] =

intζj(x)dP (x) j = 1 d

where ζj(x) is the local contingency efficacy defined by (24) for numerical Xrsquos and by(28) for categorical Xrsquos and P is the probability distribution of X Our estimate sjof τj is

sj = τ(ζj P ) =1

M

Msumk=1

ζj(Xlowastk)

where ζj is defined in section 3 and P is the empirical distritution of XLet d0 denote the number of Xrsquos that are irrelevant for Y Set d1 = d minus d0 and

without loss of generality reorder the λj so that

23

λj gt 0 if j = 1 d1λj = 0 if j = d1 + 1 d

(71)

Our variable selection rule is

ldquokeep variable Xj if sj ge t for some threshold trdquo (72)

Rule (72) is consistent if and only if

P (minjled1sj ge t and max

jgtd1sj lt t) minusrarr 1 (73)

To establish (73) it is enough to show that

P (maxjgtd1sj gt t) minusrarr 0 (74)

andP (min

jled1sj le t) minusrarr 0 (75)

We call (74) and (75) type I and type II consistency respectively Type I and IIerror probabilities refer to probabilities of any false positives and any false negativesrespectively

71 Type I consistency

We consider (74) first and note that

P (maxjgtd1sj gt t) le

dsumj=d1+1

P (sj gt t) (76)

For the jth irrelevant variable Xj j gt d1 the results in Section 6 apply to the local

efficacy ζj(x) at a fixed point x The importance score sj defined in (38) is the averageof such efficacies over a sample of M random points xlowasta We assume that there is apoint x(0) such that for irrelevant xjrsquos and for n large enough ζj(x

(0)) is stochastically

larger in the right tail than the average Mminus1sumMa=1 ζj(x

lowasta) That is we assume that

there exist t gt 0 and n0 such that for n ge n0

p(sj gt t) le P (s(0)j gt t) j gt d1 (77)

The existence of such a point x(0) is established by considering the probability limit of

arg maxζj(xlowasta) xlowasta isin xlowast1 middot middot middot xlowastM

Note by Lemma (61) s(0)j rarrp 0 as n rarr infin It follows that we have established

24

(74) for t and d0 = d minus d1 that are fixed as n rarr infin To examine the interesting casewhere d0 rarr infin as n rarr infin we need large deviation results which we turn to next Inthe d0 rarrinfin case we consider t equiv tn that tend to infin as nrarrinfin and we use the weakerassumption there exists x(0) and n0 such that

P (sj gt tn) le P (s(0)j gt tn) (78)

for n ge n0 and each tn with limnrarrinfin tn = infin We also assume that the number ofcolumns C and rows R in the contingency table that define the efficacy χ2

jn given by

(64) stays fixed as nrarrinfin In this case P (s(0)j gt tn)rarr 0 provided each term

[T (j)rc ]2 equiv

[nrcErcradicnradicErc

]2in s2j = χ2

jn defined for Xj by (64) satisfies

P(|T (j)rc | gt tn

)minusrarr 0

This is because

P

radicsumr

sumc

[T(j)rc ]2 gt tn

le P(RC max

rc[T (j)rc ]2 gt t2n

)le RC maxP

(|T (j)rc | gt tn

radicRC)

and tn and tnradicRC are of the same order as n rarr infin By large deviation theory Hall

[8] and Jing et al [10] for some constant A

P (|T (j)rc | gt tn) =

2tminus1nradic2πeminust

2n2[1 + At3nn

minus 12 + o

(t3nnminus 1

2

)](79)

provided tn = o(n

16

) If (79) is uniform in j gt d1 then (76) and (79) implies that

type I consistency holds if

d0

(tnradic

2

)minus1eminus

t2n2 rarr 0 (710)

To examine (710) write d0 and tn as d0 = exp(nb)

and tn =radic

2nr By large deviationtheory (710) holds when r lt 1

6 Thus if 0 lt 1

2b lt r lt 1

6 then

d0

(tnradic

2

)minus1eminus

t2n2 = nminusr exp

(nb minus n2r

)rarr 0

25

Thus d0 = exp(nb) with b = 13ε for any ε gt 0 is the largest d0 can be to have consistency

For instance if there are exactly d0 = exp(n

14

)irrelevant variables and we use the

threshold tn =radic

2n17 the probability of selecting any of the irrelevant variables tends

to zero at the ratenminus

14 exp

(n

14 minus n

27

)72 Type II consistency

We now assume that for each relevant variable $X_j$ there is a least favorable point $x^{(1)}$ such that the local efficacy $s_j^{(1)} \equiv \zeta(x^{(1)})$ is stochastically smaller than the importance score $s_j = M^{-1}\sum_{a=1}^{M} \zeta(x^*_a)$. That is, we assume that for each relevant variable $X_j$ there exist $n_1$ and $x^{(1)}$ such that for $n \ge n_1$,

$$P\big(s_j^{(1)} \le t_n\big) \ge P(s_j \le t_n), \quad j \le d_1. \qquad (7.11)$$

The existence of such a least favorable $x^{(1)}$ is established by considering the probability limit of

$$\arg\min\{\zeta_j(x^*_a) : x^*_a \in \{x^*_1, \cdots, x^*_M\}\}.$$

We again assume that $R$ and $C$ are fixed, so that to show type II consistency it is enough to show that for each $(r, c)$ and $j \le d_1$,

$$P\big(|T^{(j)}_{rc}| \le t_n\big) \to 0.$$

This is because

$$P\Big(\sqrt{\textstyle\sum_r \sum_c [T^{(j)}_{rc}]^2} \le t_n\Big) \le P\Big(\min_{r,c} [T^{(j)}_{rc}]^2 \le t_n^2\Big) \le RC \max_{r,c} P\big(|T^{(j)}_{rc}| \le t_n\big).$$

By Theorem 6.2, $\sqrt{n}\big(s_j^{(1)} - \tau_j\big) \to_d N(0, \nu^2)$, $\tau_j > 0$. It follows that if $t_n$ and $d_1$ are bounded above as $n \to \infty$ and $\tau_j > 0$, then $P\big(s_j^{(1)} \le t_n\big) \to 0$. Thus type II consistency holds in this case. To examine the more interesting case, where $t_n$ and $d_1$ are not bounded above and $\tau_j$ may tend to zero as $n \to \infty$, we need to center the $T^{(j)}_{rc}$. Thus set

$$\theta_n^{(j)} = \frac{\sqrt{n}\big(p^{(j)}_{rc} - p^{(j)}_{r}\, p^{(j)}_{c}\big)}{\sqrt{p^{(j)}_{r}\, p^{(j)}_{c}}}, \qquad (7.12)$$

$$T_n^{(j)} = T^{(j)}_{rc} - \theta_n^{(j)},$$

where, for simplicity, the dependence of $T_n^{(j)}$ and $\theta_n^{(j)}$ on $r$ and $c$ is not indicated.

Lemma 7.1.

$$P\Big(\min_{j \le d_1} |T^{(j)}_{rc}| \le t_n\Big) \le \sum_{j=1}^{d_1} P\Big(|T_n^{(j)}| \ge \min_{j \le d_1} |\theta_n^{(j)}| - t_n\Big).$$

Proof.

$$P\Big(\min_{j \le d_1} |T^{(j)}_{rc}| \le t_n\Big) = P\Big(\min_{j \le d_1} |T_n^{(j)} + \theta_n^{(j)}| \le t_n\Big) \le P\Big(\min_{j \le d_1} |\theta_n^{(j)}| - \max_{j \le d_1} |T_n^{(j)}| \le t_n\Big)$$
$$= P\Big(\max_{j \le d_1} |T_n^{(j)}| \ge \min_{j \le d_1} |\theta_n^{(j)}| - t_n\Big) \le \sum_{j=1}^{d_1} P\Big(|T_n^{(j)}| \ge \min_{j \le d_1} |\theta_n^{(j)}| - t_n\Big).$$

Set

$$a_n = \min_{j \le d_1} \big|\theta_n^{(j)}\big| - t_n;$$

then, by large deviation theory, if $a_n = o(n^{1/6})$,

$$P\big(|T_n^{(j)}| \ge a_n\big) \propto \big(a_n \sqrt{2}\big)^{-1} \exp\{-a_n^2/2\}, \qquad (7.13)$$

where $\propto$ signifies asymptotic order as $n \to \infty$, $a_n \to \infty$. It follows that if (7.13) holds uniformly in $j$, then by Lemma 7.1,

$$P\Big(\min_{j \le d_1} \big|T^{(j)}_{rc}\big| \le t_n\Big) \propto d_1 \big(a_n \sqrt{2}\big)^{-1} \exp\{-a_n^2/2\}.$$

Thus type II consistency holds when $a_n = o(n^{1/6})$ and

$$d_1 \big(a_n \sqrt{2}\big)^{-1} \exp\{-a_n^2/2\} \to 0. \qquad (7.14)$$

In Section 7.1, $t_n$ is of order $n^r$, $0 < r < 1/6$. For this $t_n$, type II consistency holds if $a_n$ is of order $n^r$, $0 < r < 1/6$, and

$$d_1 n^{-r} \exp\{-n^{2r}/2\} \to 0.$$

If $a_n = O(n^k)$ with $k \ge 1/6$, then $P\big(|T_n^{(j)}| \ge a_n\big)$ tends to zero at a rate faster than in (7.13) and consistency is ensured. For instance, if all relevant variables $X_j$ have $p^{(j)}_{rc}$ fixed as $n \to \infty$, then $\min_{j \le d_1} \big|\theta_n^{(j)}\big|$ is of order $\sqrt{n}$ and $a_n$ is of order $n^{1/2} - t_n$.
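A small numeric illustration of the last remark (with made-up cell and marginal probabilities): when $p^{(j)}_{rc}$ is fixed, $\theta_n^{(j)}$ grows like $\sqrt{n}$, so $a_n = \min_j|\theta_n^{(j)}| - t_n$ is dominated by the $\sqrt{n}$ term for $t_n = \sqrt{2}\,n^r$ with $r < 1/6$.

```python
# Illustration only: theta_n grows like sqrt(n) when the cell probabilities are fixed,
# so a_n = theta_n - t_n is of order sqrt(n) for t_n = sqrt(2) n^r, r < 1/6.
import numpy as np

p_rc, p_r, p_c = 0.30, 0.50, 0.40          # hypothetical fixed cell and marginal probabilities
r = 1 / 7                                   # threshold exponent from Section 7.1

for n in [10**3, 10**5, 10**7]:
    theta_n = np.sqrt(n) * (p_rc - p_r * p_c) / np.sqrt(p_r * p_c)
    t_n = np.sqrt(2.0) * n ** r
    print(n, round(theta_n, 1), round(t_n, 1), round(theta_n - t_n, 1))   # a_n ~ sqrt(n)
```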

7.3 Type I and II consistency

We see from Sections 7.1 and 7.2 that consistency holds under a variety of conditions. One set of conditions is the following.

Theorem 7.1. Under the regularity conditions of Sections 7.1 and 7.2, if

$$d_0 = c_0 \exp(n^b), \quad t_n = c_1 n^r, \quad \min_{j \le d_1} \big|\theta_n^{(j)}\big| - t_n = c_2 n^r, \quad d_1 = c_3 \exp(n^b)$$

for positive constants $c_0$, $c_1$, $c_2$ and $c_3$, and $0 < \tfrac{1}{2} b < r < \tfrac{1}{6}$, then CATCH is consistent.
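To make the selection rule (7.2) concrete, the sketch below (illustrative only) averages already-computed per-tube local efficacies $\zeta_j(x^*_a)$ into importance scores $s_j$ and keeps the variables whose score clears a threshold of the form $t_n = c_1 n^r$ from Theorem 7.1; the array `local_eff` and the constants are assumptions made for the example, not part of the paper's code.

```python
# Sketch of the CATCH variable selection rule (7.2): keep X_j if s_j >= t_n.
# `local_eff` is an (M x d) array whose (a, j) entry is the local contingency
# efficacy zeta_j(x*_a) of variable j at tube center a, computed earlier in the algorithm.
import numpy as np

def catch_select(local_eff, n, c1=1.0, r=1 / 7):
    """Return (kept indices, scores s_j, threshold t_n) for the rule (7.2)."""
    s = local_eff.mean(axis=0)          # s_j = M^{-1} sum_a zeta_j(x*_a), cf. (3.8)
    t_n = c1 * n ** r                   # threshold t_n = c_1 n^r as in Theorem 7.1
    return np.flatnonzero(s >= t_n), s, t_n

# toy usage: random numbers stand in for the efficacies
rng = np.random.default_rng(0)
M, d, n = 200, 10, 400
fake_eff = rng.uniform(0.0, 0.5, size=(M, d))
fake_eff[:, :3] += 3.0                  # pretend the first three variables are relevant
kept, scores, t_n = catch_select(fake_eff, n)
print(kept, t_n)                        # the three inflated columns exceed t_n
```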

Appendix

Proof of Theorem 6.1.

(1) Let $N_h = \sum_{i=1}^{n} I\big(X^{(i)} \in N_h(x^{(0)})\big)$. Then $N_h \sim \mathrm{Bi}\big(n, \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx\big)$ and $N_h \to_{a.s.} \infty$ as $n \to \infty$. Moreover,

$$\hat\zeta(x^{(0)}, h) = n^{-1/2}\sqrt{\mathcal{X}^2(x^{(0)}, h)} = (N_h/n)^{1/2}\, (N_h)^{-1/2} \sqrt{\mathcal{X}^2(x^{(0)}, h)}.$$

Since $N_h \sim \mathrm{Bi}\big(n, \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx\big)$, we have

$$N_h/n \to \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx. \qquad (7.15)$$

By Lemma 6.1 and Table 1,

$$(N_h)^{-1/2} \sqrt{\mathcal{X}^2(x^{(0)}, h)} \to \left(\sum_{r=+,-} \sum_{c=1}^{C} \frac{(p_{rc} - p_r q_c)^2}{p_r q_c}\right)^{1/2}, \qquad (7.16)$$

where

$$p_{+c} = \Pr\big(X > x^{(0)}, Y = c \mid X \in N_h(x^{(0)})\big), \quad p_{-c} = \Pr\big(X \le x^{(0)}, Y = c \mid X \in N_h(x^{(0)})\big)$$

for $r = +, -$ and $c = 1, \cdots, C$, and

$$p_r = \sum_{c=1}^{C} p_{rc}, \qquad q_c = \sum_{r=+,-} p_{rc}.$$

Actually, $p_+ = \Pr\big(X > x^{(0)} \mid X \in N_h(x^{(0)})\big)$, $p_- = \Pr\big(X \le x^{(0)} \mid X \in N_h(x^{(0)})\big)$ and $q_c = \Pr\big(Y = c \mid X \in N_h(x^{(0)})\big)$. By (7.15) and (7.16),

$$\hat\zeta(x^{(0)}, h) \to \left(\int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx\right)^{1/2} \left(\sum_{r=+,-} \sum_{c=1}^{C} \frac{(p_{rc} - p_r q_c)^2}{p_r q_c}\right)^{1/2} \equiv \zeta(x^{(0)}, h).$$

(2) Since $X$ and $Y$ are independent, $p_{rc} = p_r q_c$ for $r = +, -$ and $c = 1, \cdots, C$, so $\zeta(x^{(0)}, h) = 0$ for any $h$. Hence

$$\zeta(x^{(0)}) = \sup_{h > 0} \zeta(x^{(0)}, h) = 0.$$

(3) Recall that $p_c(x^{(0)}) = \Pr(Y = c \mid x^{(0)})$. Without loss of generality assume $p'_c(x^{(0)}) > 0$. Then there exists $h$ such that

$$\Pr\big(Y = c \mid x^{(0)} < X < x^{(0)} + h\big) > \Pr\big(Y = c \mid x^{(0)} - h < X < x^{(0)}\big),$$

which is equivalent to $p_{+c}/p_+ > p_{-c}/p_-$. This shows that $\zeta(x^{(0)}) > 0$, since by Lemma 6.1, $\zeta(x^{(0)}) = 0$ would imply $p_{+c}/p_+ = p_{-c}/p_- = p_c$.
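The quantities in part (1) translate directly into a computation. The sketch below (illustrative, not the paper's code; the toy data and window width are made up) forms the local $2 \times C$ table at a point $x^{(0)}$, with rows $X \le x^{(0)}$ versus $X > x^{(0)}$ within the window $N_h(x^{(0)})$ and columns the classes of $Y$, and returns $\hat\zeta(x^{(0)}, h) = n^{-1/2}\sqrt{\mathcal{X}^2(x^{(0)}, h)}$, where $n$ is the full sample size as above.

```python
# Sketch: estimated local contingency efficacy for one numerical predictor.
import numpy as np

def local_efficacy(x, y, x0, h, classes):
    """zeta_hat(x0, h) = n^{-1/2} sqrt(chi^2) computed on the local 2 x C table."""
    n = len(x)
    in_window = np.abs(x - x0) <= h                 # observations in N_h(x0)
    rows = (x[in_window] > x0).astype(int)          # 0: X <= x0, 1: X > x0
    counts = np.zeros((2, len(classes)))
    for c_idx, c in enumerate(classes):
        for r in (0, 1):
            counts[r, c_idx] = np.sum((rows == r) & (y[in_window] == c))
    Nh = counts.sum()
    if Nh == 0:
        return 0.0
    E = counts.sum(axis=1, keepdims=True) @ counts.sum(axis=0, keepdims=True) / Nh
    with np.errstate(invalid="ignore", divide="ignore"):
        chi2 = np.nansum((counts - E) ** 2 / E)     # empty rows/columns contribute 0
    return np.sqrt(chi2) / np.sqrt(n)

# toy usage: Y depends on the sign of X, so the efficacy at x0 = 0 is large
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=500)
y = np.where(x > 0, rng.choice([1, 2], 500, p=[0.8, 0.2]), rng.choice([1, 2], 500, p=[0.2, 0.8]))
print(local_efficacy(x, y, x0=0.0, h=0.5, classes=[1, 2]))
```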

References

[1] Breiman, L., 2001. Random forests. Machine Learning 45, 5-32.

[2] Breiman, L., Friedman, J. H., Olshen, R. A., Stone, C. J., 1984. Classification and Regression Trees. Wadsworth, Belmont.

[3] Doksum, K., Tang, S., Tsui, K.-W., 2008. Nonparametric variable selection: The EARTH algorithm. Journal of the American Statistical Association 103 (484), 1609-1620.

[4] Doksum, K. A., Samarov, A., 1995. Nonparametric estimation of global functionals and a measure of the explanatory power of covariates in regression. Annals of Statistics 23, 1443-1473.

[5] Doksum, K. A., Schafer, C., 2006. Powerful choices: Tuning parameter selection based on power. In: Frontiers in Statistics: Dedicated to Peter Bickel (J. Fan and H. L. Koul, eds.). Imperial College Press, 113-141.

[6] Friedman, J., 1991. Multivariate adaptive regression splines. The Annals of Statistics 19, 1-67.

[7] Gao, J., Gijbels, I., 2008. Bandwidth selection in nonparametric kernel testing. Journal of the American Statistical Association 103 (484), 1584-1594.

[8] Hall, P., 1992. The Bootstrap and Edgeworth Expansion. Springer, New York.

[9] Huang, L.-S., Chen, J., 2008. Analysis of variance, coefficient of determination and F-test for local polynomial regression. Annals of Statistics 36, 2085-2109.

[10] Jing, B.-Y., Shao, Q.-M., Wang, Q., 2003. Self-normalized Cramér-type large deviations for independent random variables. Annals of Probability 31, 2167-2215.

[11] Lin, D. Y., Zeng, D., 2011. Correcting for population stratification in genomewide association studies. Journal of the American Statistical Association 106, 997-1008.

[12] Loh, W.-Y., 2009. Improving the precision of classification trees. The Annals of Applied Statistics 3, 1710-1737.

[13] Price, A., Patterson, N., Plenge, R., Weinblatt, M., Shadick, N., Reich, D., 2006. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38 (8), 904-909.

[14] Qiao, Z., Zhou, L., Huang, J., 2008. Linear discriminant analysis for high dimensional low sample size data. Special issue of the World Congress on Engineering 2008.

[15] Quinlan, J. R., 1987. Simplifying decision trees. International Journal of Man-Machine Studies 27 (3), 221-234.

[16] Schafer, C., Doksum, K. A., 2009. Selecting local models in multiple regression by maximizing power. Metrika 69, 283-304.

[17] Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267-288.

[18] Wang, L., Shen, X., 2007. On L1-norm multi-class support vector machines: methodology and theory. Journal of the American Statistical Association 102, 583-594.

[19] Zhang, H. H., Liu, Y., Wu, Y., Zhu, J., 2008. Variable selection for the multicategory SVM via adaptive sup-norm regularization. Electronic Journal of Statistics 2, 149-167.

  • Introduction
  • Importance Scores for Univariate Classification
    • Numerical Predictor: Local Contingency Efficacy
    • Categorical Predictor: Contingency Efficacy
  • Variable Selection for Classification Problems: The CATCH Algorithm
    • Numerical Predictors
    • Numerical and Categorical Predictors
    • The Empirical Importance Scores
    • Adaptive Tube Distances
    • The CATCH Algorithm
  • Simulation Studies
    • Variable Selection with Categorical and Numerical Predictors
    • Improving The Performance of SVM and RF Classifiers
  • CATCH Analysis of The ANN Data
  • Properties of Local Contingency Efficacy
  • Consistency Of CATCH
    • Type I consistency
    • Type II consistency
    • Type I and II consistency
  • Proof of Theorem 6.1
Page 14: Nonparametric Variable Selection and Classi cation: The ...lc436/Tang_Chen_etal_2012_CSDA.pdf · pruning, will choose a subset of optimal splitting variables to be the most signi

Tab

le2

Sim

ula

tion

resu

lts

from

Exam

ple

41

wit

hsa

mp

lesi

zen

=400

For

1leile

5

theit

hC

olu

mn

giv

esth

ees

tim

ate

dp

rob

ab

ilit

yof

sele

ctio

nof

rele

vant

vari

ableX

ian

dth

est

and

ard

erro

rb

ase

don

100

sim

ula

tion

sC

olu

mn

6giv

esth

em

ean

of

the

esti

mate

dp

rob

ab

ilit

yof

sele

ctio

nan

dth

est

and

ard

erro

rfo

rth

eir

rele

vant

vari

ab

lesX

6middotmiddotmiddotX

d

Num

ber

ofSel

ecti

ons(

100

sim

ula

tion

s)

ME

TH

OD

dX

1X

2X

3X

4X

5Xiige

6

CA

TC

H10

082

(00

4)0

75(0

04)

078

(00

4)0

82(0

04)

1(0

)0

05(0

02)

500

76(0

04)

076

(00

4)0

87(0

03)

084

(00

4)0

96(0

02)

008

(00

3)

100

077

(00

4)0

73(0

04)

082

(00

4)0

8(0

04)

083

(00

4)0

09(0

03)

Mar

ginal

CA

TC

H10

077

(00

4)0

67(0

05)

07

(00

5)0

8(0

04)

006

(00

2)0

1(0

03)

500

69(0

05)

07

(00

5)0

76(0

04)

079

(00

4)0

07(0

03)

01

(00

3)

100

078

(00

4)0

76(0

04)

073

(00

4)0

73(0

04)

012

(00

3)0

1(0

03)

Ran

dom

For

est

100

71(0

05)

051

(00

5)0

47(0

05)

059

(00

5)0

37(0

05)

006

(00

2)

500

64(0

05)

061

(00

5)0

65(0

05)

058

(00

5)0

28(0

04)

008

(00

3)

100

066

(00

5)0

58(0

05)

059

(00

5)0

65(0

05)

017

(00

4)0

11(0

03)

GU

IDE

100

96(0

02)

098

(00

1)0

93(0

03)

099

(00

1)0

92(0

03)

033

(00

5)

500

82(0

04)

085

(00

4)0

9(0

03)

089

(00

3)0

67(0

05)

012

(00

3)

100

083

(00

4)0

84(0

04)

086

(00

3)0

83(0

04)

061

(00

5)0

08(0

03)

12

Figure 1 The relationship between Y and X1 X2 for model (41) The four areas represent the fourvalues of Y

biased toward being selected The way that GUIDE alleviate this bias is to employthe lack-of-fit tests ie χ2-test applicable to both numerical and categorical predictorswith the p-values adjusted by the bootstrap procedure Since GUIDE has negligible se-lection bias it can include tests for local pairwise interactions Furthermore it achievessensitivity to local curvature by using linear model at each node

The results in table 2 show that Marginal CATCH does a very poor job of select-ing X5 This illustrates the importance of local conditioning RFC does better thanMarginal CATCH but also has difficulties while CATCH and GUIDE can successfullyidentify X5 along with X1 to X4 with high frequency When d = 10 GUIDE does verywell for the relevant variables but selects too many irrelevant variables as d increasesSurprisingly GUIDE selects fewer irrelevant variables as d increases It becomes morecautious as the dimension d increases CATCH performs very well for the high dimen-sional cases (d = 50 and d = 100) This indicates that the CATCH algorithm is robustto the intricate interaction structures between the predictors

In model (41) above the predictors interact in a hierarchical way When X5 = 0the response is related to X1 and X2 as shown in Figure 1 where the two axes arethe two relevant predictors and the four areas partitioned by the curves correspondto different values of Y When X5 = 1 the response depends on X3 and X4 in thesame way This type of hierarchical interaction between the categorical and numericalpredictors are difficult to detect even by state of the art classification methods such assupport vector machine(SVM) and RFC In particular RFC produces an importance

13

score vector that identifies the four numerical variables X1 X4 as very significantbut often leaves the categorical variable X5 unidentified as we see in table 2

The CATCH algorithm however performs very well for this complicated model Wenow explain why CATCH is able to identify X5 as an important variable To explorethe association between X5 and Y we construct M contingency tables of 2 rows and4 columns based on M randomly centered tubes calculate contingency efficacy scoresand then average the scores Figure 2 (a) and (b) illustrate how one of the contingencyefficacy scores is calculated 1000 data points are generated from model (41) and theyare split into two sets according to x5 (a) shows approximately half of the points withx5 = 0 in (x1 x2) dimensions and (b) shows the rest points with x5 = 1 in (x3 x4)dimensions The data points are colored according to Y This representation enablesus to clearly see the classification boundaries (grey lines) The randomly selected tubecenter when x5 is zero is shown as a star in (a) The data points within the tube closeto the tube center in terms of Xminus5 are highlighted by larger dots Since the distancedefining the tube does not involve X5 the larger dots have x5 equal 0 or 1 and arepresent in both (a) and (b) The counts of the larger dots in different colors in (a) and(b) correspond to the cell counts of the two rows x5 = 0 and x5 = 1 in the contingencytable As we can see larger dots in (a) are mostly red and larger dots in (b) are mostlyblue and orange which shows that distribution in Y depends on the value of X5 andthe contingency efficacy score is high We expect the contingency efficacy score wouldbe low only if the tube center has similar values in (x1 x2) and (x3 x4) Such tubecenters would be very few among the M tube centers since they are randomly chosenThus the average contingency efficacy scores is high and X5 is identified as an importantvariables

To contrast this example with a simpler data generating model suppose the effectsof predictors are additive that is the categorical variable is related to the response asan added term to the effect of the numerical predictors In this type of simpler modelsthe RFC and GUIDE algorithm can efficiently identify relevant predictors but so canthe CATCH algorithm We do not include the simulation results for additive modelshere

Remark 41 We observe that the CATCH algorithm is not very sensitive to the choiceof the number of tube centers M or the tube size m = [na] where 0 lt a le 1 and [ ] isthe greatest integer function For example 41 we perform the simulation using M = nand M = n2 and a = 01 015 02 03 05 The results are summarized in Table 3The CATCH algorithm performs similarly for M = n and M = n2 with M = n havinga small winning edge We generally suggest to use M = n if computation resource isof a less concern or n is not large (le 200) When computational cost is a concern(Remark 35) one can reduce tube size to M = n2 for a compromise between detectionpower and the computational cost Table 3 also shows that the tube size works well inthe range between 01 and 03 while the detection power of X5 decreases for the casea = 05 where the tube size is too big to achieve local conditioning This is a consistent

14

observation with the comparison to Marginal CATCH in Table 2 On the other handthe tube size should not be too small to achieve sufficient detection power We suggestto use a = 01 or 015 for sample sizes n ge 400 or m = 50 for smaller sample sizesIn light of these guidelines we use M = n2 = 200 and a = 015 for Examples 41-43M = n = 200 and m = 50 for Example 44

Table 3 Sensitivity study of CATCH for Example 41 (n = 400 d = 50) using different number oftube centers M and tube size m = [na] For 1 le i le 5 the ith column gives the number of times Xi

was selected Column 6 gives the percentage of times irrelevant variables were selected

Number of Selections(100 simulations)M a X1 X2 X3 X4 X5 Xi i ge 6

M = 200 010 75 67 77 69 92 78015 76 76 87 84 96 82020 85 84 90 87 93 99030 89 87 95 89 85 96050 89 90 96 93 62 99

M = 400 010 74 76 82 77 92 72015 85 82 89 87 94 86020 87 88 90 86 96 91030 88 89 91 91 92 94050 89 91 94 93 61 102

In Example 41 all predictors are independent In a similar setting we next considerthe case where the predictors are dependent

Example 42 We generate standard normal random variables Zi for i = 1 middot middot middot d insuch a way that cor(Z1 Z4) = 05 cor(Z3 Z6) = 09and the other Zirsquos are independentWe set Xi = Φ(Zi) for i = 1 middot middot middot d where Φ(middot) is the CDF of a standard normal andtherefore Xi are uniformly distributed marginally as in Example 41 The correlationstructure between the Xirsquos are similar to that of the Zirsquos The dependent variable Y isgenerated in the same way as in Example 41

The results of Example 42 are given in Table 4 By comparing with results in Table2 both CATCH and GUIDE are doing well in the presence of dependence between thenumerical variables The explanation is two-fold First both methods could identify thecategorical X5 as an important predictor most of the time which make the identificationof other relevant predictors possible Second conditioning on X5 Y depends on a pairof variables (ldquoX1 and X2rdquo or ldquoX3 and X4rdquo) jointly and in particular this dependence isnonlinear which makes both methods less affected by the correlation introduced here

15

00 02 04 06 08 10

00

02

04

06

08

10

(a) X5 = 0

X1

X2

00 02 04 06 08 10

00

02

04

06

08

10

(b) X5 = 1

X3

X4

Figure 2 Sample scatter plots of (x1 x2) and (x3 x4) for simulated data using model (41) The leftpanel includes all the pairs (x1 x2) with x5 = 0 while the right panel includes all the pairs (x3 x4)with x5 = 1 The larger dots in the two plots are within the tube around a tube center shown as thestar

We now explain the latter in details for both methods In CATCH the algorithmadaptively weighs the effects of predictors by their importance scores in defining thetubes when they are being conditioned on therefore identifying either of the two willhelp to identify the other whereas an irrelevant variable though strongly associated withone of the relevant variables is down-weighted iteratively and does not strongly affectthe selection of the relevant variable This can explain the slight decreasing selectionprobability in X3 in presence of a strongly correlated X6 and consequently overall lessselection of X3 and X4 compared to X1 and X2 given the correlation between X1 andX4 In GUIDE the difference between the dependent and independent cases is evenless because GUIDE is very powerful in detecting local interaction by literally includingpairwise interactions when selecting splitting variables

Marginal CATCH and Random Forest do poorly in this case mainly because X5 cannot be detected as we explained earlier More specifically the dependence between Yand X4 in X5 = 1 case is the same as that between Y and X2 in X5 = 0 The linearterms of X1 and X2 have opposite signs in decision function therefore marginal effectsof X1 and X4 are obscured when X5 is not identified as an important variable

42 Improving The Performance of SVM and RF Classifiers

The criterion used in the CATCH algorithm is not based on classification accuracyBut if we use CATCH to screen out the irrelevant variables and only use the important

16

Tab

le4

Sim

ula

tion

resu

lts

from

Exam

ple

42

wit

hsa

mp

lesi

zen

=400

For

1leile

6

theit

hC

olu

mn

giv

esth

ees

tim

ate

dp

rob

ab

ilit

yof

sele

ctio

nof

vari

ableX

ian

dth

est

and

ard

erro

rb

ased

on

100

sim

ula

tion

sw

her

eX

ii

=1middotmiddotmiddot

5are

rele

vant

vari

ab

les

wit

hcor(X

1X

4)

=5

X

6is

anir

rele

vant

vari

able

that

isco

rrel

ated

wit

hX

3w

ith

corr

elati

on

9

Colu

mn

7giv

esth

em

ean

of

the

esti

mate

dpro

bab

ilit

yof

sele

ctio

nan

dth

atof

the

stan

dar

der

ror

for

the

irre

leva

nt

vari

ab

lesX

7middotmiddotmiddotX

d

Num

ber

ofSel

ecti

ons(

100

sim

ula

tion

s)

ME

TH

OD

dX

1X

2X

3X

4X

5X

6Xiige

7

CA

TC

H10

076

(00

4)0

77(0

04)

073

(00

4)0

68(0

05)

099

(00

1)0

34(0

05)

007

(00

2)

500

78(0

04)

077

(00

4)0

74(0

04)

063

(00

5)0

92(0

03)

057

(00

5)0

09(0

03)

100

076

(00

4)0

77(0

04)

08

(00

4)0

74(0

04)

082

(00

4)0

55(0

05)

009

(00

3)

Mar

ginal

CA

TC

H10

035

(00

5)0

63(0

05)

08

(00

4)0

32(0

05)

009

(00

3)0

71(0

05)

012

(00

3)

500

34(0

05)

071

(00

5)0

7(0

05)

03

(00

5)0

1(0

03)

06

(00

5)0

11(0

03)

100

034

(00

5)0

68(0

05)

075

(00

4)0

3(0

05)

01

(00

3)0

59(0

05)

01

(00

3)

Ran

dom

For

est

100

29(0

05)

048

(00

5)0

55(0

05)

02

(00

4)0

3(0

05)

047

(00

5)0

05(0

02)

500

26(0

04)

052

(00

5)0

57(0

05)

03

(00

5)0

24(0

04)

045

(00

5)0

08(0

03)

100

036

(00

5)0

61(0

05)

066

(00

5)0

37(0

05)

032

(00

5)0

46(0

05)

012

(00

3)

GU

IDE

100

96(0

02)

096

(00

2)0

94(0

02)

095

(00

2)0

96(0

02)

084

(00

4)0

3(0

05)

500

8(0

04)

092

(00

3)0

88(0

03)

079

(00

4)0

72(0

04)

068

(00

5)0

12(0

03)

100

081

(00

4)0

89(0

03)

09

(00

3)0

76(0

04)

075

(00

4)0

58(0

05)

008

(00

3)

17

variables for classification the performance of other algorithms such as Support VectorMachine (SVM) and Random Forest Classifiers (RFC) can be significantly improvedWe compare the prediction performance of (1) SVM (2) RFC (3) CATCH followedby SVM and (4) CATCH followed by RFC in this section We label (3) and (4)with CATCH+SVM and CATCH+RFC respectively In this section we use the ldquosvmrdquofunction in the ldquoe1071rdquo package in R with radial kernel to perform SVM analysisand use the ldquorandomForestrdquo function in the ldquorandomForestrdquo package to perform RFCanalysis We use simulation experiments to illustrate the improvement obtained byusing CATCH prior to classifying Y

Let the true model be

(X Y ) sim P (42)

where P will be specified later In each of 100 Monte Carlo experiments n pairs of(X(i) Y (i)) i = 1 middot middot middot n are generated from the true model (42) Based on the simu-lated data (X(i) Y (i)) i = 1 middot middot middot n the methods (1) SVM (2) RFC (3) CATCH+SVM(4) CATCH+RFC are used to classify a future Y (0) for which we have available a cor-responding X(0)

The integrated-misclassification rate (IMR Friedman [6]) defined below is used toevaluate the performance of the classification algorithms Let FXY(middot) be the classifierconstructed by a classification algorithm based on X = (X(1) middot middot middot X(n)) and Y =(Y (1) middot middot middot Y (n))T

Let (X(0) Y (0)) be a future observation with the same distribution as (X(i) Y (i))We only know X(0) = x(0) and want to classify Y (0) For a possible future observation(x(0) y(0)) define

MR(XYx(0) y(0)) = P (FXY(x(0)) 6= y(0)|XY) (43)

Our criterion is

IMR = EE0[MR(XYx(0) y(0))] (44)

where E0 is with respect to the distribution of (x(0) y(0)) random and E is with respectto the distribution of (XY)

This double expectation is evaluated using Monte Carlo simulation The result isdenoted as IMRlowast More precisely E0 is estimated using 5000 simulated (x(0) y(0)) obser-vations then IMRlowast is computed on the basis of 100 simulated (XiYi) i = 1 middot middot middot 100samples

Example 43 First consider model (41) in section 41 Figure 3 shows IMRlowast versusthe number of irrelevant variables for this example The left panel shows that theIMRlowastrsquos of the SVM approach increases dramatically as the number of irrelevant variablesincreases while the IMRlowastrsquos of CATCH+SVM are stable Thus CATCH stabilizes the

18

performance of SVM for this complex model The right panel shows that CATCH alsostabilizes the performance of RFC These results indicate that CATCH can improve theprediction accuracy of other classification algorithms

20 40 60 80 100

025

030

035

040

045

050

055

060

SVM

d

Err

or R

ate

SVMCATCH+SVM

20 40 60 80 100

025

030

035

040

045

050

055

060

Random Forest

d

RFCCATCH+RFC

Figure 3 IMRlowastrsquos of SVM RFC CATCH+SVM CATCH+RFC versus the number of predictors forthe simulation study in model (41) in example 41 The sample size is n = 400

Example 44 Contaminated Logistic Model Consider a binary response Y isin 0 1with

P (Y = 0|X = x) =

F (x) if r = 0G(x) if r = 1

(45)

where x = (x1 xd)

r sim Bernoulli(γ) (46)

F (x) = (1 + expminus(025 + 05x1 + x2))minus1 (47)

G(x) =

01 if 025 + 05x1 + x2 lt 009 if 025 + 05x1 + x2 ge 0

(48)

and x1 x2 xd are realizations of independent standard normal random variablesHence given X(i) = x(i) 1 le i le n the joint distribution of Y (1) Y (2) Y (n) is thecontaminated logistic model with contamination parameter γ

19

(1minus γ)nprodi=1

[F (x(i))]Y(i)

[1minus F (x(i))]1minusY(i)

+ γ

nprodi=1

[G(x(i))]Y(i)

[1minusG(x(i))]1minusY(i)

(49)

We perform simulation study for n = 200 d = 50 and γ = 0 02 04 06 08 and1 where γ = 0 corresponds to no contamination Figure 4 shows the IMRlowastrsquos versus thedifferent values of γ for this example CATCH improves the performance of the SVMand Random Forest classifiers for this contaminated logistic regression model

01

02

03

04

SVM

γ

Err

or R

ate

0 02 04 06 08 1

SVMCATCH+SVM

01

02

03

04

Random Forest

γ

0 02 04 06 08 1

RFCCATCH+RFC

Figure 4 IMRlowastrsquos of SVM RFC CATCH+SVM CATCH+RFC versus different contamination pa-rameters γ for simulation model (49) in example 44

5 CATCH Analysis of The ANN Data

In this section CATCH is applied to a data set available at the UCI data base(httpftpicsuciedupubmachine-learning-databasesthyroid-disease) The data is from a clinical trial to determine whether a patient is hypothyroid (seeeg Quinlan [15]) The dataset is split into two parts a training set with 3772 ob-servations and a testing set with 3428 observations The response is a categoricalvariable with three classes normal (not hypothyroid) hyperfunction and subnormalfunctioning There are 21 predictors including 15 categorical predictors and 6 numer-ical predictors The three classes are very imbalanced with 925 of normal patientsA good classifier should produce a significant decrease from an error rate 75 whichcan be obtained by classifying all observations to the normal class

20

CATCH identifies 7 variables as important including 4 continuous variables and3 categorical variables We consider three classification methods 1 Support VectorMachine with radial kernel denoted by SVM(r) 2 Support Vector Machine withpolynomial kernel denoted by SVM(p) 3 RFC We apply these three methods onthe training set to build classification models and use the test set to estimate themisclassification rate of the methods We apply the methods with CATCH and withoutCATCH respectively where ldquowith CATCHrdquo means that only the 7 important variablesidentified by CATCH are used to build the model and ldquowithout CATCHrdquo means thatall the 21 variables are used in the analysis

Table 5 Misclassification rates of three classification methods (SVM(r) SVM(p) RFC) with CATCHand without CATCH

SVM(r) SVM(p) RFC

wo CATCH 0052 0063 0024

with CATCH 0037 0055 0015

Table 5 shows that if CATCH is used to select important variables before the clas-sification methods are applied the misclassification rates decrease CATCH reducesthe misclassification rates by 29 13 and 37 for SVM(r) SVM(p) and RFC respec-tively CATCH + RFC has the best performance Although CATCH is not designed forimproving classification accuracy nonetheless it can help other classification methodsto perform better

6 Properties of Local Contingency Efficacy

In this section we investigate the properties of local contingency efficacy For a uni-variate continuous predictor Theorem 61 below shows that local contingency efficacyis well-defined and equals to 0 if the response variable is independent of the predictorConsider an R times C contingency table Let nrc r = 1 middot middot middot R c = 1 middot middot middot C be theobserved frequency in the (r c)-cell and let prc be the probability that an observationbelongs to the (r c)-cell Let

pr =Csumc=1

prc qc =Rsumr=1

prc (61)

21

and

nrmiddot =Csumc=1

nrc nmiddotc =Rsumi=1

nrc n =Rsumi=1

Csumc=1

nrc Erc = nrmiddotnmiddotcn (62)

Consider testing that the rows and the columns are independent ie

H0 forallr c prc = prqc versus H1 existr c prc 6= prqc (63)

a standard test statistic is

X 2 =Rsumr=1

Csumc=1

(nrc minus Erc)2

Erc (64)

Under H0 X 2 converges in distribution to χ2(Rminus1)(Cminus1) Moreover define 00 = 0 we

have the well known result

Lemma 61 As n tends to infinity X 2n tends in probability to

τ 2 equivRsumr=1

Csumc=1

(prc minus prqc)2

prqc

Moreover if prqc is bounded away from 0 and 1 as n rarr infin and if R and C are fixedas nrarrinfin then τ 2 = 0 if and only if H0 is true

Remark 61 Lemma 61 shows that the contingency efficacy ζ in (28) is well definedMoreover ζ = 0 if and only if X and Y are independent

Theorem 61 Consider a categorical response Y and a continuous predictor X withdensity function f(x) such that for a fixed h gt 0 f(x) gt 0 for x isin (x(0) minus h x(0) + h)then

(1) For fixed h the estimated local section efficacy ζ(x(0) h) defined in (25) convergesalmost surly to the section efficacy ζ(x(0) h) defined in (24)

(2) If X and Y are independent then the local efficacy ζ(x(0)) defined in (24) is zero

(3) If there exists c isin 1 middot middot middot C such that pc(x) = P (Y = c|x) has a continuousderivative at x = x(0) with pprimec(x

(0)) 6= 0 then ζ(x(0)) gt 0

The χ2 statistic (64) can be written in terms of multinomial proportions prc withradicn(prc minus prc) 1 le r le R 1 le c le C converging in distribution to a multivariate

normal By Taylor expansion of Tn equivradicχ2n at τ = plim

radicχ2n we find that

22

Theorem 62 radicn(Tn minus τ) minusrarrd N(0 v2) (65)

where the asymptotic variance v2 which can be obtained from the Taylor expansion isnot needed in this paper

Remark 62 The result (65) applies to the multivariate χ2-based efficacies ζ with nreplaced by the number of observations that goes in to the computations of ζ In thiscase we use the notation χ2

j for the chi-square statistic for the jth variable

7 Consistency Of CATCH

Let Y denote a variable to be classified into one of the categories 1 C by usingthe Xjrsquos in the vector X = (X1 Xd) Let Xminusj equiv X minusXj denote X without the Xj

component Assume that the squared Pearson nonparametric correlation (Doksum andSamarov [4] Huang and Chen [9]) between Xj and Xminusj is less than 05 We call thevariable Xj relevant for Y if conditionally given Xminusj L(Y |Xj = x) is not constant in xand irrelevant otherwise Our data are (X(1) Y (1)) (X(n) Y (n)) iid as (X Y ) Avariable selection procedure is consistent if the probability that all the relevant variablesare selected and none of the irrelevant variables are selected tends to one as n tends toinfinity

We give a general nonparametric framework where the variable selection rule basedon the CATCH importance scores sj given in (38) is consistent

When the predictors are not all categorical and CATCH involves tubes we assumethat the number of observations in each tube tends toinfin as nrarrinfin and we assume thatfor those importance scores that require the selection of a section size h the selection isfrom a finite set hj with minjhj gt h0 for some h0 gt 0 It follows that the numberof observations mjk in the kth tube used to calculate the importance score sj for thevariable Xj tends to infinity as nrarrinfin

The key quantities that determine consistency are the limits λj of sj That is

λj = τ(ζj P ) = EP [ζj(X)] =

intζj(x)dP (x) j = 1 d

where ζj(x) is the local contingency efficacy defined by (24) for numerical Xrsquos and by(28) for categorical Xrsquos and P is the probability distribution of X Our estimate sjof τj is

sj = τ(ζj P ) =1

M

Msumk=1

ζj(Xlowastk)

where ζj is defined in section 3 and P is the empirical distritution of XLet d0 denote the number of Xrsquos that are irrelevant for Y Set d1 = d minus d0 and

without loss of generality reorder the λj so that

23

λj gt 0 if j = 1 d1λj = 0 if j = d1 + 1 d

(71)

Our variable selection rule is

ldquokeep variable Xj if sj ge t for some threshold trdquo (72)

Rule (72) is consistent if and only if

P (minjled1sj ge t and max

jgtd1sj lt t) minusrarr 1 (73)

To establish (73) it is enough to show that

P (maxjgtd1sj gt t) minusrarr 0 (74)

andP (min

jled1sj le t) minusrarr 0 (75)

We call (74) and (75) type I and type II consistency respectively Type I and IIerror probabilities refer to probabilities of any false positives and any false negativesrespectively

71 Type I consistency

We consider (74) first and note that

P (maxjgtd1sj gt t) le

dsumj=d1+1

P (sj gt t) (76)

For the jth irrelevant variable Xj j gt d1 the results in Section 6 apply to the local

efficacy ζj(x) at a fixed point x The importance score sj defined in (38) is the averageof such efficacies over a sample of M random points xlowasta We assume that there is apoint x(0) such that for irrelevant xjrsquos and for n large enough ζj(x

(0)) is stochastically

larger in the right tail than the average Mminus1sumMa=1 ζj(x

lowasta) That is we assume that

there exist t gt 0 and n0 such that for n ge n0

p(sj gt t) le P (s(0)j gt t) j gt d1 (77)

The existence of such a point x(0) is established by considering the probability limit of

arg maxζj(xlowasta) xlowasta isin xlowast1 middot middot middot xlowastM

Note by Lemma (61) s(0)j rarrp 0 as n rarr infin It follows that we have established

24

(74) for t and d0 = d minus d1 that are fixed as n rarr infin To examine the interesting casewhere d0 rarr infin as n rarr infin we need large deviation results which we turn to next Inthe d0 rarrinfin case we consider t equiv tn that tend to infin as nrarrinfin and we use the weakerassumption there exists x(0) and n0 such that

P (sj gt tn) le P (s(0)j gt tn) (78)

for n ge n0 and each tn with limnrarrinfin tn = infin We also assume that the number ofcolumns C and rows R in the contingency table that define the efficacy χ2

jn given by

(64) stays fixed as nrarrinfin In this case P (s(0)j gt tn)rarr 0 provided each term

[T (j)rc ]2 equiv

[nrcErcradicnradicErc

]2in s2j = χ2

jn defined for Xj by (64) satisfies

P(|T (j)rc | gt tn

)minusrarr 0

This is because

P

radicsumr

sumc

[T(j)rc ]2 gt tn

le P(RC max

rc[T (j)rc ]2 gt t2n

)le RC maxP

(|T (j)rc | gt tn

radicRC)

and tn and tnradicRC are of the same order as n rarr infin By large deviation theory Hall

[8] and Jing et al [10] for some constant A

P (|T (j)rc | gt tn) =

2tminus1nradic2πeminust

2n2[1 + At3nn

minus 12 + o

(t3nnminus 1

2

)](79)

provided tn = o(n

16

) If (79) is uniform in j gt d1 then (76) and (79) implies that

type I consistency holds if

d0

(tnradic

2

)minus1eminus

t2n2 rarr 0 (710)

To examine (710) write d0 and tn as d0 = exp(nb)

and tn =radic

2nr By large deviationtheory (710) holds when r lt 1

6 Thus if 0 lt 1

2b lt r lt 1

6 then

d0

(tnradic

2

)minus1eminus

t2n2 = nminusr exp

(nb minus n2r

)rarr 0

25

Thus d0 = exp(nb) with b = 13ε for any ε gt 0 is the largest d0 can be to have consistency

For instance if there are exactly d0 = exp(n

14

)irrelevant variables and we use the

threshold tn =radic

2n17 the probability of selecting any of the irrelevant variables tends

to zero at the ratenminus

14 exp

(n

14 minus n

27

)72 Type II consistency

We now assume that for each relevant variable Xj there is a least favorable point x(1)

such that the local efficacy s(1)j equiv ζ(x(1)) is stochastically smaller than the importance

score sj = Mminus1sumMa=1 ζ(xlowasta) That is we assume that for each relevant variable Xj there

exists n1 and x(1) such that for n ge n1

P(s(1)j le tn

)ge P (sj le tn) j le d1 (711)

The existence of such a least favorable x(1) is established by considering the probabilitylimit of

arg minζj(xlowasta) xlowasta isin xlowast1 middot middot middot xlowastM

We again assume that R and C are fixed so that to show type II consistency it isenough to show that for each (r c) and j le d1

P(|T (j)rc | le tn

)rarr 0

This is because

P

(radicsumr

sumc

[T

(j)rc

]2le tn

)le P (minrc

[T

(j)rc

]2le t2n)

le RC maxrc P(|T (j)rc | le tn

)By Theorem 62

radicn(s(1)j minus τj

)rarrd N(0 ν2) τj gt 0 It follows that if tn and d1

are bounded above as n rarr infin and τj gt 0 then P(s(1)j le tn

)rarr 0 Thus type II

consistency holds in this case To examine the more interesting case where tn and d1are not bounded above and τj may tend to zero as nrarrinfin we will need to center the

T(j)rc Thus set

θ(j)n =

radicn(p(j)rc minus p(j)r p

(j)c

)radicp(j)r p

(j)c

(712)

26

Tn(j) = T (j)rc minus θ(j)n

where for simplicity the dependence of T(j)n and θ

(j)n on r and c is not indicated

Lemma 71

P

(minjled1|T (j)rc | le tn

)le

d1sumj=1

P

(|T (j)n | ge min

jled1θ(j)n minus tn

)Proof

P(

minjled1 |T(j)rc | le tn

)= P

(minjled1 |T

(j)n + θ

(j)n | le tn

)le P

(minjled1 |T

(j)n | minusmaxjled1 |θ

(j)n | le tn

)= P

(maxjled1 |T

(j)n | ge minjled1 |θ

(j)n | minus tn

)lesumd1

j=1 P(|T (j)n | ge minjled1 |θ

(j)n | minus tn

)

Setan = min

jled1

∣∣θ(j)n ∣∣minus tnthen by large deviation theory if an = o

(n

16

)

P (∣∣T (j)n

∣∣ ge an) prop

(anradic

2

)minus1expminusa2n2 (713)

where prop signifies asymptotic order as nrarrinfin an rarrinfin It follows that if (713) holdsuniformly in j then by Lemma 71

P

(minjled1

∣∣T (j)rc

∣∣ le tn

)prop d1

(anradic

2

)minus1expminusa2n2

Thus type II consistency holds when an = o(n

16

)and

d1

(anradic

2

)minus1expminusa2n2 rarr 0 (714)

In Section 71 tn is of order nr 0 lt r lt 16 For this tn type II consistency holds if an

is of order nr 0 lt r lt 16 and

d1nminusr expminusn2r2 rarr 0

27

If an = O(nk) with k ge 16 then P (

∣∣∣T (j)(n)

∣∣∣ ge an) tends to zero at a rate faster than in

(713) and consistency is ensured For instance if all relevant variables Xj have p(j)rc

fixed as nrarrinfin then minjled1

∣∣∣θ(j)n ∣∣∣ is of orderradicn and an is of order n

12 minus tn

73 Type I and II consistency

We see from Sections 71 and 72 that consistency holds under a variety of conditionsOne set of conditions is

Theorem 71 Under the regularization conditions of Sections 71 and 72 if

d0 = c0 exp(nb) tn = c1nr

minjled1 θ(j) minus tn = c2nr d1 = c3 exp

(nb)

for positive constants c0 c1 c2 and c3 and 0 lt 12b lt r lt 1

6 then CATCH is consistent

AppendixProof of Theorem 61

(1) Let Nh =

sumni=1 I(X(i) isin Nh(x

(0))) Then Nh sim Bi(n

int x(0)+hx(0)minush f(x)dx) and

Nh rarras infin as nrarrinfin

ζ(x(0) h) = nminus12radicX 2(x(0) h)

= (Nh n)12(N

h )minus12radicX 2(x(0) h)

Since Nh sim Bi(n

int x(0)+hx(0)minush f(x)dx) we have

Nh nrarr

int x(0)+h

x(0)minushf(x)dx (715)

By Lemma 61 and Table (1)

(Nh )minus12

radicX 2(x(0) h)rarr

( sumr=+minus

Csumc=1

(prc minus prqc)2

prqc

) 12

(716)

where

p+c = Pr(X gt x0 Y = c|X isin Nh(x(0))) pminusc = Pr(X le x0 Y = c|X isin Nh(x

(0)))

28

for r = +minus and c = 1 middot middot middot C

pr =Csumc=1

prc qc =sumr=+minus

prc

Actually p+ = Pr(X gt x0|X isin Nh(x(0))) pminus = Pr(X le x0|X isin Nh(x

(0)))qc = Pr(Y = c|X isin Nh(x

(0)))

By (715) and (716)

ζ(x(0) h)rarr(int x(0)+h

x(0)minushf(x)dx

) 12( sumr=+minus

Csumc=1

(prc minus prqc)2

prqc

) 12

equiv ζ(x(0) h)

(2) Since X and Y are independent prc = prqc for r = +minus and c = 1 middot middot middot C thenζ(x(0) h) = 0 for any h Hence

ζ(x(0)) = suphgt0ζ(x(0) h) = 0

(3) Recall that pc(x(0)) = Pr(Y = c|x(0)) Without loss of generality assume pprimec(x

(0)) gt0 Then there exists h such that

Pr(Y = c|x(0) lt X lt x(0) + h) gt Pr(Y = c|x(0) minus h lt X lt x(0))

which is equivelant to p+cp+ gt pminuscpminus This shows that ζ(x(0)) gt 0 since byLemma 61 ζ(x(0)) = 0 results in p+cp+ = pminuscpminus = pc

References

[1] Breiman L 2001 Random forests Machine Learning 45 5ndash32

[2] Breiman L Friedman J H Olshen R A Stone J 1984 Classification andregression trees Wadsworth Belmont

[3] Doksum K Tang S Tsui K-W 2008 Nonparametric variable selection Theearth algorithm Journal of the American Statistical Association 103(484) 1609ndash1620

[4] Doksum K A Samarov A 1995 Nonparametric estimation of global functionalsand a measure of explanatory power of covariates in regression Annals of Statistics23 1443ndash1473

29

[5] Doksum K A Schafer C 2006 Powerful choices Tuning parameter selectionbased on power Frontiers in Statistics Dedicated to Peter Bickel Editors J Fanand H L Koul Imperial College Press 113ndash141

[6] Friedman J 1991 Multivariate adaptive regression splines The annals of statis-tics 1ndash67

[7] Gao J Gijbels I 2008 Bandwidth selection in nonparametric kernel testingJournal of the American Statistical Association 103 (484) 1584ndash1594

[8] Hall P 1992 The Bootstrap and Edgeworth Expansions Springer New York

[9] Huang L-S Chen J 2008 Analysis of variance coefficient of determinationand f-test for local polynomial regression Annals of Statistics 36 2085ndash2109

[10] Jing B-Y Shao Q-M Wang Q 2003 Self-normalized cramer-type large devi-ations for independent random variables Annals of Probability 31 2167ndash2215

[11] Lin D Y Zeng D 2011 Correcting for population stratification in genomewideassociation studies Journal of the American Statistical Association 106 997ndash1008

[12] Loh W-Y 2009 Improve the precision of classification trees The Annals ofApplied Statistics 3 1710ndash1737

[13] Price A Patterson N Plenge R Weinblatt M Shadick N Reich D 2006Principal components analysis corrects for stratification in genome-wide associationstudies Nature genetics 38 (8) 904ndash909

[14] Qiao Z Zhou L Huang J 2008 Linear discriminant analysis for high dimen-sional Low Sample Size Data special issue of World Congress on Engineering2008

[15] Quinlan J R 1987 Simplifying decision trees International Journal of Man-Machine Studies 27(3) 221ndash234

[16] Schafer C Doksum K A 2009 Selecting local models in multiple regression bymaximizing power Metrika 69 283ndash304

[17] Tibshirani R 1996 Regression shrinkage and selection via the lasso J RoyalStatist Soc B 58 267ndash288

[18] Wang L Shen X 2007 On l1-norm multi-class support vector machinesmethodology and theory Journal of the American Statistical Association 102 583ndash594

[19] Zhang H H Liu Y Wu Y Zhu J 2008 Variable selection for the multicate-gory svm via adaptive sup-norm regularization Electronic Journal of Statistics 2149ndash167

30

  • Introduction
  • Importance Scores for Univariate Classification
    • Numerical Predictor Local Contingency Efficacy
    • Categorical Predictor Contingency Efficacy
      • Variable Selection for Classification Problems The CATCH Algorithm
        • Numerical Predictors
        • Numerical and Categorical Predictors
        • The Empirical Importance Scores
        • Adaptive Tube Distances
        • The CATCH Algorithm
          • Simulation Studies
            • Variable Selection with Categorical and Numerical Predictors
            • Improving The Performance of SVM and RF Classifiers
              • CATCH Analysis of The ANN Data
              • Properties of Local Contingency Efficacy
              • Consistency Of CATCH
                • Type I consistency
                • Type II consistency
                • Type I and II consistency
                  • Proof of Theorem 61
Page 15: Nonparametric Variable Selection and Classi cation: The ...lc436/Tang_Chen_etal_2012_CSDA.pdf · pruning, will choose a subset of optimal splitting variables to be the most signi

Figure 1 The relationship between Y and X1 X2 for model (41) The four areas represent the fourvalues of Y

biased toward being selected The way that GUIDE alleviate this bias is to employthe lack-of-fit tests ie χ2-test applicable to both numerical and categorical predictorswith the p-values adjusted by the bootstrap procedure Since GUIDE has negligible se-lection bias it can include tests for local pairwise interactions Furthermore it achievessensitivity to local curvature by using linear model at each node

The results in table 2 show that Marginal CATCH does a very poor job of select-ing X5 This illustrates the importance of local conditioning RFC does better thanMarginal CATCH but also has difficulties while CATCH and GUIDE can successfullyidentify X5 along with X1 to X4 with high frequency When d = 10 GUIDE does verywell for the relevant variables but selects too many irrelevant variables as d increasesSurprisingly GUIDE selects fewer irrelevant variables as d increases It becomes morecautious as the dimension d increases CATCH performs very well for the high dimen-sional cases (d = 50 and d = 100) This indicates that the CATCH algorithm is robustto the intricate interaction structures between the predictors

In model (41) above the predictors interact in a hierarchical way When X5 = 0the response is related to X1 and X2 as shown in Figure 1 where the two axes arethe two relevant predictors and the four areas partitioned by the curves correspondto different values of Y When X5 = 1 the response depends on X3 and X4 in thesame way This type of hierarchical interaction between the categorical and numericalpredictors are difficult to detect even by state of the art classification methods such assupport vector machine(SVM) and RFC In particular RFC produces an importance

13

score vector that identifies the four numerical variables X1 X4 as very significantbut often leaves the categorical variable X5 unidentified as we see in table 2

The CATCH algorithm however performs very well for this complicated model Wenow explain why CATCH is able to identify X5 as an important variable To explorethe association between X5 and Y we construct M contingency tables of 2 rows and4 columns based on M randomly centered tubes calculate contingency efficacy scoresand then average the scores Figure 2 (a) and (b) illustrate how one of the contingencyefficacy scores is calculated 1000 data points are generated from model (41) and theyare split into two sets according to x5 (a) shows approximately half of the points withx5 = 0 in (x1 x2) dimensions and (b) shows the rest points with x5 = 1 in (x3 x4)dimensions The data points are colored according to Y This representation enablesus to clearly see the classification boundaries (grey lines) The randomly selected tubecenter when x5 is zero is shown as a star in (a) The data points within the tube closeto the tube center in terms of Xminus5 are highlighted by larger dots Since the distancedefining the tube does not involve X5 the larger dots have x5 equal 0 or 1 and arepresent in both (a) and (b) The counts of the larger dots in different colors in (a) and(b) correspond to the cell counts of the two rows x5 = 0 and x5 = 1 in the contingencytable As we can see larger dots in (a) are mostly red and larger dots in (b) are mostlyblue and orange which shows that distribution in Y depends on the value of X5 andthe contingency efficacy score is high We expect the contingency efficacy score wouldbe low only if the tube center has similar values in (x1 x2) and (x3 x4) Such tubecenters would be very few among the M tube centers since they are randomly chosenThus the average contingency efficacy scores is high and X5 is identified as an importantvariables

To contrast this example with a simpler data generating model suppose the effectsof predictors are additive that is the categorical variable is related to the response asan added term to the effect of the numerical predictors In this type of simpler modelsthe RFC and GUIDE algorithm can efficiently identify relevant predictors but so canthe CATCH algorithm We do not include the simulation results for additive modelshere

Remark 41 We observe that the CATCH algorithm is not very sensitive to the choiceof the number of tube centers M or the tube size m = [na] where 0 lt a le 1 and [ ] isthe greatest integer function For example 41 we perform the simulation using M = nand M = n2 and a = 01 015 02 03 05 The results are summarized in Table 3The CATCH algorithm performs similarly for M = n and M = n2 with M = n havinga small winning edge We generally suggest to use M = n if computation resource isof a less concern or n is not large (le 200) When computational cost is a concern(Remark 35) one can reduce tube size to M = n2 for a compromise between detectionpower and the computational cost Table 3 also shows that the tube size works well inthe range between 01 and 03 while the detection power of X5 decreases for the casea = 05 where the tube size is too big to achieve local conditioning This is a consistent

14

observation with the comparison to Marginal CATCH in Table 2 On the other handthe tube size should not be too small to achieve sufficient detection power We suggestto use a = 01 or 015 for sample sizes n ge 400 or m = 50 for smaller sample sizesIn light of these guidelines we use M = n2 = 200 and a = 015 for Examples 41-43M = n = 200 and m = 50 for Example 44

Table 3 Sensitivity study of CATCH for Example 41 (n = 400 d = 50) using different number oftube centers M and tube size m = [na] For 1 le i le 5 the ith column gives the number of times Xi

was selected Column 6 gives the percentage of times irrelevant variables were selected

Number of Selections(100 simulations)M a X1 X2 X3 X4 X5 Xi i ge 6

M = 200 010 75 67 77 69 92 78015 76 76 87 84 96 82020 85 84 90 87 93 99030 89 87 95 89 85 96050 89 90 96 93 62 99

M = 400 010 74 76 82 77 92 72015 85 82 89 87 94 86020 87 88 90 86 96 91030 88 89 91 91 92 94050 89 91 94 93 61 102

In Example 41 all predictors are independent In a similar setting we next considerthe case where the predictors are dependent

Example 42 We generate standard normal random variables Zi for i = 1 middot middot middot d insuch a way that cor(Z1 Z4) = 05 cor(Z3 Z6) = 09and the other Zirsquos are independentWe set Xi = Φ(Zi) for i = 1 middot middot middot d where Φ(middot) is the CDF of a standard normal andtherefore Xi are uniformly distributed marginally as in Example 41 The correlationstructure between the Xirsquos are similar to that of the Zirsquos The dependent variable Y isgenerated in the same way as in Example 41

The results of Example 42 are given in Table 4 By comparing with results in Table2 both CATCH and GUIDE are doing well in the presence of dependence between thenumerical variables The explanation is two-fold First both methods could identify thecategorical X5 as an important predictor most of the time which make the identificationof other relevant predictors possible Second conditioning on X5 Y depends on a pairof variables (ldquoX1 and X2rdquo or ldquoX3 and X4rdquo) jointly and in particular this dependence isnonlinear which makes both methods less affected by the correlation introduced here

15

00 02 04 06 08 10

00

02

04

06

08

10

(a) X5 = 0

X1

X2

00 02 04 06 08 10

00

02

04

06

08

10

(b) X5 = 1

X3

X4

Figure 2 Sample scatter plots of (x1 x2) and (x3 x4) for simulated data using model (41) The leftpanel includes all the pairs (x1 x2) with x5 = 0 while the right panel includes all the pairs (x3 x4)with x5 = 1 The larger dots in the two plots are within the tube around a tube center shown as thestar

We now explain the latter in details for both methods In CATCH the algorithmadaptively weighs the effects of predictors by their importance scores in defining thetubes when they are being conditioned on therefore identifying either of the two willhelp to identify the other whereas an irrelevant variable though strongly associated withone of the relevant variables is down-weighted iteratively and does not strongly affectthe selection of the relevant variable This can explain the slight decreasing selectionprobability in X3 in presence of a strongly correlated X6 and consequently overall lessselection of X3 and X4 compared to X1 and X2 given the correlation between X1 andX4 In GUIDE the difference between the dependent and independent cases is evenless because GUIDE is very powerful in detecting local interaction by literally includingpairwise interactions when selecting splitting variables

Marginal CATCH and Random Forest do poorly in this case mainly because X5 cannot be detected as we explained earlier More specifically the dependence between Yand X4 in X5 = 1 case is the same as that between Y and X2 in X5 = 0 The linearterms of X1 and X2 have opposite signs in decision function therefore marginal effectsof X1 and X4 are obscured when X5 is not identified as an important variable

42 Improving The Performance of SVM and RF Classifiers

The criterion used in the CATCH algorithm is not based on classification accuracyBut if we use CATCH to screen out the irrelevant variables and only use the important

16

Tab

le4

Sim

ula

tion

resu

lts

from

Exam

ple

42

wit

hsa

mp

lesi

zen

=400

For

1leile

6

theit

hC

olu

mn

giv

esth

ees

tim

ate

dp

rob

ab

ilit

yof

sele

ctio

nof

vari

ableX

ian

dth

est

and

ard

erro

rb

ased

on

100

sim

ula

tion

sw

her

eX

ii

=1middotmiddotmiddot

5are

rele

vant

vari

ab

les

wit

hcor(X

1X

4)

=5

X

6is

anir

rele

vant

vari

able

that

isco

rrel

ated

wit

hX

3w

ith

corr

elati

on

9

Colu

mn

7giv

esth

em

ean

of

the

esti

mate

dpro

bab

ilit

yof

sele

ctio

nan

dth

atof

the

stan

dar

der

ror

for

the

irre

leva

nt

vari

ab

lesX

7middotmiddotmiddotX

d

Num

ber

ofSel

ecti

ons(

100

sim

ula

tion

s)

ME

TH

OD

dX

1X

2X

3X

4X

5X

6Xiige

7

CA

TC

H10

076

(00

4)0

77(0

04)

073

(00

4)0

68(0

05)

099

(00

1)0

34(0

05)

007

(00

2)

500

78(0

04)

077

(00

4)0

74(0

04)

063

(00

5)0

92(0

03)

057

(00

5)0

09(0

03)

100

076

(00

4)0

77(0

04)

08

(00

4)0

74(0

04)

082

(00

4)0

55(0

05)

009

(00

3)

Mar

ginal

CA

TC

H10

035

(00

5)0

63(0

05)

08

(00

4)0

32(0

05)

009

(00

3)0

71(0

05)

012

(00

3)

500

34(0

05)

071

(00

5)0

7(0

05)

03

(00

5)0

1(0

03)

06

(00

5)0

11(0

03)

100

034

(00

5)0

68(0

05)

075

(00

4)0

3(0

05)

01

(00

3)0

59(0

05)

01

(00

3)

Ran

dom

For

est

100

29(0

05)

048

(00

5)0

55(0

05)

02

(00

4)0

3(0

05)

047

(00

5)0

05(0

02)

500

26(0

04)

052

(00

5)0

57(0

05)

03

(00

5)0

24(0

04)

045

(00

5)0

08(0

03)

100

036

(00

5)0

61(0

05)

066

(00

5)0

37(0

05)

032

(00

5)0

46(0

05)

012

(00

3)

GU

IDE

100

96(0

02)

096

(00

2)0

94(0

02)

095

(00

2)0

96(0

02)

084

(00

4)0

3(0

05)

500

8(0

04)

092

(00

3)0

88(0

03)

079

(00

4)0

72(0

04)

068

(00

5)0

12(0

03)

100

081

(00

4)0

89(0

03)

09

(00

3)0

76(0

04)

075

(00

4)0

58(0

05)

008

(00

3)

17

variables for classification the performance of other algorithms such as Support VectorMachine (SVM) and Random Forest Classifiers (RFC) can be significantly improvedWe compare the prediction performance of (1) SVM (2) RFC (3) CATCH followedby SVM and (4) CATCH followed by RFC in this section We label (3) and (4)with CATCH+SVM and CATCH+RFC respectively In this section we use the ldquosvmrdquofunction in the ldquoe1071rdquo package in R with radial kernel to perform SVM analysisand use the ldquorandomForestrdquo function in the ldquorandomForestrdquo package to perform RFCanalysis We use simulation experiments to illustrate the improvement obtained byusing CATCH prior to classifying Y

Let the true model be

(X Y ) sim P (42)

where P will be specified later In each of 100 Monte Carlo experiments n pairs of(X(i) Y (i)) i = 1 middot middot middot n are generated from the true model (42) Based on the simu-lated data (X(i) Y (i)) i = 1 middot middot middot n the methods (1) SVM (2) RFC (3) CATCH+SVM(4) CATCH+RFC are used to classify a future Y (0) for which we have available a cor-responding X(0)

The integrated-misclassification rate (IMR Friedman [6]) defined below is used toevaluate the performance of the classification algorithms Let FXY(middot) be the classifierconstructed by a classification algorithm based on X = (X(1) middot middot middot X(n)) and Y =(Y (1) middot middot middot Y (n))T

Let (X(0) Y (0)) be a future observation with the same distribution as (X(i) Y (i))We only know X(0) = x(0) and want to classify Y (0) For a possible future observation(x(0) y(0)) define

MR(XYx(0) y(0)) = P (FXY(x(0)) 6= y(0)|XY) (43)

Our criterion is

IMR = EE0[MR(XYx(0) y(0))] (44)

where E0 is with respect to the distribution of (x(0) y(0)) random and E is with respectto the distribution of (XY)

This double expectation is evaluated using Monte Carlo simulation The result isdenoted as IMRlowast More precisely E0 is estimated using 5000 simulated (x(0) y(0)) obser-vations then IMRlowast is computed on the basis of 100 simulated (XiYi) i = 1 middot middot middot 100samples

Example 43 First consider model (41) in section 41 Figure 3 shows IMRlowast versusthe number of irrelevant variables for this example The left panel shows that theIMRlowastrsquos of the SVM approach increases dramatically as the number of irrelevant variablesincreases while the IMRlowastrsquos of CATCH+SVM are stable Thus CATCH stabilizes the

18

performance of SVM for this complex model The right panel shows that CATCH alsostabilizes the performance of RFC These results indicate that CATCH can improve theprediction accuracy of other classification algorithms

20 40 60 80 100

025

030

035

040

045

050

055

060

SVM

d

Err

or R

ate

SVMCATCH+SVM

20 40 60 80 100

025

030

035

040

045

050

055

060

Random Forest

d

RFCCATCH+RFC

Figure 3 IMRlowastrsquos of SVM RFC CATCH+SVM CATCH+RFC versus the number of predictors forthe simulation study in model (41) in example 41 The sample size is n = 400

Example 44 Contaminated Logistic Model Consider a binary response Y isin 0 1with

P (Y = 0|X = x) =

F (x) if r = 0G(x) if r = 1

(45)

where x = (x1 xd)

r sim Bernoulli(γ) (46)

F (x) = (1 + expminus(025 + 05x1 + x2))minus1 (47)

G(x) =

01 if 025 + 05x1 + x2 lt 009 if 025 + 05x1 + x2 ge 0

(48)

and x1 x2 xd are realizations of independent standard normal random variablesHence given X(i) = x(i) 1 le i le n the joint distribution of Y (1) Y (2) Y (n) is thecontaminated logistic model with contamination parameter γ

$$(1-\gamma)\prod_{i=1}^{n}\big[F(x^{(i)})\big]^{Y^{(i)}}\big[1 - F(x^{(i)})\big]^{1-Y^{(i)}} \;+\; \gamma\prod_{i=1}^{n}\big[G(x^{(i)})\big]^{Y^{(i)}}\big[1 - G(x^{(i)})\big]^{1-Y^{(i)}}. \quad (4.9)$$

We perform a simulation study for n = 200, d = 50, and γ = 0, 0.2, 0.4, 0.6, 0.8, and 1, where γ = 0 corresponds to no contamination. Figure 4 shows the IMR*'s versus the different values of γ for this example. CATCH improves the performance of the SVM and Random Forest classifiers for this contaminated logistic regression model.
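For reference, the contaminated logistic model can be simulated directly from its definition; the R sketch below follows (4.5)-(4.8) and uses the settings n = 200, d = 50 of this example.

```r
# Draw n observations from the contaminated logistic model (4.5)-(4.8).
gen_contaminated <- function(n = 200, d = 50, gamma = 0.4) {
  x  <- matrix(rnorm(n * d), n, d)        # independent standard normal predictors
  u  <- 0.25 + 0.5 * x[, 1] + x[, 2]      # linear index shared by F and G
  r  <- rbinom(n, 1, gamma)               # contamination indicator, (4.6)
  F0 <- 1 / (1 + exp(-u))                 # F(x), (4.7)
  G0 <- ifelse(u < 0, 0.1, 0.9)           # G(x), (4.8)
  p0 <- ifelse(r == 0, F0, G0)            # P(Y = 0 | X = x), (4.5)
  y  <- rbinom(n, 1, 1 - p0)              # so that P(Y = 0 | X = x) = p0
  list(x = x, y = y)
}
```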

[Figure 4 here: two panels titled "SVM" and "Random Forest"; horizontal axis γ (0 to 1), vertical axis Error Rate (0.1 to 0.4); legends SVM vs. CATCH+SVM and RFC vs. CATCH+RFC.]

Figure 4: IMR*'s of SVM, RFC, CATCH+SVM, CATCH+RFC versus different contamination parameters γ for simulation model (4.9) in Example 4.4.

5 CATCH Analysis of The ANN Data

In this section CATCH is applied to a data set available at the UCI data base (http://ftp.ics.uci.edu/pub/machine-learning-databases/thyroid-disease). The data is from a clinical trial to determine whether a patient is hypothyroid (see, e.g., Quinlan [15]). The dataset is split into two parts: a training set with 3772 observations and a testing set with 3428 observations. The response is a categorical variable with three classes: normal (not hypothyroid), hyperfunction, and subnormal functioning. There are 21 predictors, including 15 categorical predictors and 6 numerical predictors. The three classes are very imbalanced, with 92.5% normal patients. A good classifier should produce a significant decrease from the 7.5% error rate which can be obtained by classifying all observations to the normal class.

CATCH identifies 7 variables as important, including 4 continuous variables and 3 categorical variables. We consider three classification methods: 1. Support Vector Machine with radial kernel, denoted by SVM(r); 2. Support Vector Machine with polynomial kernel, denoted by SVM(p); 3. RFC. We apply these three methods on the training set to build classification models and use the test set to estimate the misclassification rate of the methods. We apply the methods with CATCH and without CATCH, respectively, where "with CATCH" means that only the 7 important variables identified by CATCH are used to build the model and "without CATCH" means that all 21 variables are used in the analysis.
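The comparison reported in Table 5 amounts to fitting each classifier twice, once on all 21 predictors and once on the CATCH-selected subset, and scoring both on the test set. A minimal R sketch of this protocol is given below; the file names and column layout are assumptions about the UCI files, and catch_select() is again a hypothetical placeholder whose output should be the 7 variables identified by CATCH.

```r
library(e1071)

# Assumed file names and layout: 21 predictors followed by the class label.
train <- read.table("ann-train.data")
test  <- read.table("ann-test.data")
d   <- 21
xtr <- train[, 1:d]; ytr <- factor(train[, d + 1])
xte <- test[,  1:d]; yte <- factor(test[,  d + 1])

test_error <- function(fit, newx) mean(predict(fit, newx) != yte)

# without CATCH: all 21 predictors
err_all <- test_error(svm(xtr, ytr, kernel = "radial"), xte)

# with CATCH: only the selected predictors
keep    <- catch_select(xtr, ytr)           # hypothetical CATCH screening step
err_sel <- test_error(svm(xtr[, keep], ytr, kernel = "radial"), xte[, keep])
```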

Table 5: Misclassification rates of three classification methods (SVM(r), SVM(p), RFC) with CATCH and without CATCH.

              SVM(r)   SVM(p)   RFC
w/o CATCH     0.052    0.063    0.024
with CATCH    0.037    0.055    0.015

Table 5 shows that if CATCH is used to select important variables before the classification methods are applied, the misclassification rates decrease. CATCH reduces the misclassification rates by 29%, 13%, and 37% for SVM(r), SVM(p), and RFC, respectively. CATCH+RFC has the best performance. Although CATCH is not designed for improving classification accuracy, it nonetheless helps other classification methods to perform better.

6 Properties of Local Contingency Efficacy

In this section we investigate the properties of local contingency efficacy. For a univariate continuous predictor, Theorem 6.1 below shows that local contingency efficacy is well defined and equals 0 if the response variable is independent of the predictor. Consider an $R \times C$ contingency table. Let $n_{rc}$, $r = 1, \ldots, R$, $c = 1, \ldots, C$, be the observed frequency in the $(r, c)$-cell and let $p_{rc}$ be the probability that an observation belongs to the $(r, c)$-cell. Let

$$p_r = \sum_{c=1}^{C} p_{rc}, \qquad q_c = \sum_{r=1}^{R} p_{rc}, \quad (6.1)$$
and
$$n_{r\cdot} = \sum_{c=1}^{C} n_{rc}, \qquad n_{\cdot c} = \sum_{r=1}^{R} n_{rc}, \qquad n = \sum_{r=1}^{R}\sum_{c=1}^{C} n_{rc}, \qquad E_{rc} = n_{r\cdot}\,n_{\cdot c}/n. \quad (6.2)$$

Consider testing that the rows and the columns are independent, i.e.,
$$H_0: \forall r, c,\; p_{rc} = p_r q_c \quad \text{versus} \quad H_1: \exists r, c,\; p_{rc} \neq p_r q_c; \quad (6.3)$$
a standard test statistic is
$$\mathcal{X}^2 = \sum_{r=1}^{R}\sum_{c=1}^{C} \frac{(n_{rc} - E_{rc})^2}{E_{rc}}. \quad (6.4)$$

Under $H_0$, $\mathcal{X}^2$ converges in distribution to $\chi^2_{(R-1)(C-1)}$. Moreover, defining $0/0 = 0$, we have the well-known result:

Lemma 6.1 As n tends to infinity, $\mathcal{X}^2/n$ tends in probability to
$$\tau^2 \equiv \sum_{r=1}^{R}\sum_{c=1}^{C} \frac{(p_{rc} - p_r q_c)^2}{p_r q_c}.$$
Moreover, if $p_r q_c$ is bounded away from 0 and 1 as $n \to \infty$, and if R and C are fixed as $n \to \infty$, then $\tau^2 = 0$ if and only if $H_0$ is true.

Remark 6.1 Lemma 6.1 shows that the contingency efficacy ζ in (2.8) is well defined. Moreover, ζ = 0 if and only if X and Y are independent.

Theorem 6.1 Consider a categorical response Y and a continuous predictor X with density function f(x) such that, for a fixed h > 0, f(x) > 0 for $x \in (x^{(0)} - h, x^{(0)} + h)$. Then:

(1) For fixed h, the estimated local section efficacy $\hat\zeta(x^{(0)}, h)$ defined in (2.5) converges almost surely to the section efficacy $\zeta(x^{(0)}, h)$ defined in (2.4).

(2) If X and Y are independent, then the local efficacy $\zeta(x^{(0)})$ defined in (2.4) is zero.

(3) If there exists $c \in \{1, \ldots, C\}$ such that $p_c(x) = P(Y = c \mid x)$ has a continuous derivative at $x = x^{(0)}$ with $p'_c(x^{(0)}) \neq 0$, then $\zeta(x^{(0)}) > 0$.
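To make the quantities in Theorem 6.1 concrete, the following R sketch computes the estimated local contingency efficacy $\hat\zeta(x^{(0)}, h)$ for a numerical predictor: the observations in the window $(x^{(0)} - h, x^{(0)} + h)$ are split at $x^{(0)}$ into the two rows of a 2 × C table and the χ² statistic (6.4) is rescaled by the overall sample size, as in the Appendix. This is our reading of (2.4)-(2.5), offered as an illustration rather than the authors' implementation.

```r
# Estimated local contingency efficacy zeta_hat(x0, h) for numeric x, categorical y.
local_efficacy <- function(x, y, x0, h) {
  n   <- length(x)
  win <- which(x > x0 - h & x < x0 + h)                 # window N_h(x0)
  if (length(win) < 2) return(0)
  rows <- factor(x[win] > x0, levels = c(FALSE, TRUE))  # split the window at x0 (2 rows)
  tab  <- table(rows, factor(y[win]))                   # 2 x C contingency table
  E    <- outer(rowSums(tab), colSums(tab)) / sum(tab)  # expected counts E_rc
  X2   <- sum(ifelse(E > 0, (tab - E)^2 / E, 0))        # chi-square statistic (6.4), with 0/0 = 0
  sqrt(X2 / n)                                          # zeta_hat = n^{-1/2} sqrt(X^2), cf. Appendix
}
```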

The χ² statistic (6.4) can be written in terms of multinomial proportions $\hat p_{rc}$, with $\sqrt{n}(\hat p_{rc} - p_{rc})$, $1 \le r \le R$, $1 \le c \le C$, converging in distribution to a multivariate normal. By a Taylor expansion of $T_n \equiv \sqrt{\chi^2_n/n}$ at $\tau = \operatorname{plim}\sqrt{\chi^2_n/n}$, we find that:

Theorem 6.2
$$\sqrt{n}(T_n - \tau) \longrightarrow_d N(0, v^2), \quad (6.5)$$
where the asymptotic variance $v^2$, which can be obtained from the Taylor expansion, is not needed in this paper.

Remark 6.2 The result (6.5) applies to the multivariate χ²-based efficacies $\hat\zeta$, with n replaced by the number of observations that go into the computation of $\hat\zeta$. In this case we use the notation $\chi^2_j$ for the chi-square statistic for the jth variable.

7 Consistency Of CATCH

Let Y denote a variable to be classified into one of the categories 1, ..., C by using the $X_j$'s in the vector $X = (X_1, \ldots, X_d)$. Let $X_{-j} \equiv X - X_j$ denote X without the $X_j$ component. Assume that the squared Pearson nonparametric correlation (Doksum and Samarov [4]; Huang and Chen [9]) between $X_j$ and $X_{-j}$ is less than 0.5. We call the variable $X_j$ relevant for Y if, conditionally given $X_{-j}$, $\mathcal{L}(Y \mid X_j = x)$ is not constant in x, and irrelevant otherwise. Our data are $(X^{(1)}, Y^{(1)}), \ldots, (X^{(n)}, Y^{(n)})$, i.i.d. as (X, Y). A variable selection procedure is consistent if the probability that all the relevant variables are selected and none of the irrelevant variables are selected tends to one as n tends to infinity.

We give a general nonparametric framework where the variable selection rule based on the CATCH importance scores $\hat s_j$ given in (3.8) is consistent.

When the predictors are not all categorical and CATCH involves tubes, we assume that the number of observations in each tube tends to ∞ as $n \to \infty$, and we assume that, for those importance scores that require the selection of a section size h, the selection is from a finite set $\{h_j\}$ with $\min_j\{h_j\} > h_0$ for some $h_0 > 0$. It follows that the number of observations $m_{jk}$ in the kth tube used to calculate the importance score $\hat s_j$ for the variable $X_j$ tends to infinity as $n \to \infty$.

The key quantities that determine consistency are the limits $\lambda_j$ of $\hat s_j$. That is,
$$\lambda_j = \tau(\zeta_j, P) = E_P[\zeta_j(X)] = \int \zeta_j(x)\,dP(x), \quad j = 1, \ldots, d,$$

where $\zeta_j(x)$ is the local contingency efficacy defined by (2.4) for numerical X's and by (2.8) for categorical X's, and P is the probability distribution of X. Our estimate $\hat s_j$ of $\tau_j$ is
$$\hat s_j = \tau(\hat\zeta_j, \hat P) = \frac{1}{M}\sum_{k=1}^{M}\hat\zeta_j(X^*_k),$$
where $\hat\zeta_j$ is defined in Section 3 and $\hat P$ is the empirical distribution of X.

Let $d_0$ denote the number of X's that are irrelevant for Y. Set $d_1 = d - d_0$ and,

without loss of generality, reorder the $\lambda_j$ so that
$$\lambda_j > 0 \;\text{ if } j = 1, \ldots, d_1; \qquad \lambda_j = 0 \;\text{ if } j = d_1 + 1, \ldots, d. \quad (7.1)$$

Our variable selection rule is
"keep variable $X_j$ if $\hat s_j \ge t$" for some threshold t. (7.2)
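In code, the selection rule (7.2) is a simple threshold on the empirical importance scores; the sketch below averages local efficacies over M randomly chosen centers, with tube_efficacy() a placeholder for the tube-based local efficacy $\hat\zeta_j$ of Section 3 and t a user-chosen threshold.

```r
# Threshold rule (7.2): keep variable j if its importance score s_hat[j] >= t,
# where s_hat[j] is the average of local efficacies over M random tube centers.
select_variables <- function(x, y, t, M = nrow(x)) {
  centers <- x[sample(nrow(x), M, replace = TRUE), , drop = FALSE]
  s_hat <- sapply(seq_len(ncol(x)), function(j) {
    mean(sapply(seq_len(M), function(k) tube_efficacy(x, y, j, centers[k, ])))
  })
  which(s_hat >= t)    # indices of the variables kept
}
```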

Rule (7.2) is consistent if and only if
$$P\Big(\min_{j \le d_1}\hat s_j \ge t \;\text{ and }\; \max_{j > d_1}\hat s_j < t\Big) \longrightarrow 1. \quad (7.3)$$
To establish (7.3) it is enough to show that
$$P\Big(\max_{j > d_1}\hat s_j > t\Big) \longrightarrow 0 \quad (7.4)$$
and
$$P\Big(\min_{j \le d_1}\hat s_j \le t\Big) \longrightarrow 0. \quad (7.5)$$

We call (7.4) and (7.5) type I and type II consistency, respectively. Type I and II error probabilities refer to the probabilities of any false positives and any false negatives, respectively.

7.1 Type I consistency

We consider (7.4) first and note that
$$P\Big(\max_{j > d_1}\hat s_j > t\Big) \le \sum_{j = d_1 + 1}^{d} P(\hat s_j > t). \quad (7.6)$$

For the jth irrelevant variable $X_j$, $j > d_1$, the results in Section 6 apply to the local efficacy $\hat\zeta_j(x)$ at a fixed point x. The importance score $\hat s_j$ defined in (3.8) is the average of such efficacies over a sample of M random points $x^*_a$. We assume that there is a point $x^{(0)}$ such that, for irrelevant $X_j$'s and for n large enough, $\hat\zeta_j(x^{(0)})$ is stochastically larger in the right tail than the average $M^{-1}\sum_{a=1}^{M}\hat\zeta_j(x^*_a)$. That is, we assume that there exist $t > 0$ and $n_0$ such that, for $n \ge n_0$,
$$P(\hat s_j > t) \le P(\hat s^{(0)}_j > t), \quad j > d_1, \quad (7.7)$$
where $\hat s^{(0)}_j \equiv \hat\zeta_j(x^{(0)})$. The existence of such a point $x^{(0)}$ is established by considering the probability limit of
$$\arg\max\{\hat\zeta_j(x^*_a): x^*_a \in \{x^*_1, \ldots, x^*_M\}\}.$$

Note that, by Lemma 6.1, $\hat s^{(0)}_j \to_p 0$ as $n \to \infty$. It follows that we have established (7.4) for t and $d_0 = d - d_1$ that are fixed as $n \to \infty$. To examine the interesting case where $d_0 \to \infty$ as $n \to \infty$, we need large deviation results, which we turn to next. In the $d_0 \to \infty$ case we consider $t \equiv t_n$ that tends to ∞ as $n \to \infty$, and we use the weaker assumption: there exist $x^{(0)}$ and $n_0$ such that

$$P(\hat s_j > t_n) \le P(\hat s^{(0)}_j > t_n) \quad (7.8)$$
for $n \ge n_0$ and each $t_n$ with $\lim_{n\to\infty} t_n = \infty$. We also assume that the number of columns C and rows R in the contingency table that defines the efficacy $\chi^2_{jn}$ given by (6.4) stays fixed as $n \to \infty$. In this case $P(\hat s^{(0)}_j > t_n) \to 0$ provided each term
$$[T^{(j)}_{rc}]^2 \equiv \left[\frac{n_{rc} - E_{rc}}{\sqrt{n}\sqrt{E_{rc}}}\right]^2$$
in $\hat s^2_j = \chi^2_{jn}/n$, defined for $X_j$ by (6.4), satisfies
$$P\big(|T^{(j)}_{rc}| > t_n\big) \longrightarrow 0.$$

This is because
$$P\left(\sqrt{\sum_r\sum_c [T^{(j)}_{rc}]^2} > t_n\right) \le P\left(RC\max_{r,c}[T^{(j)}_{rc}]^2 > t^2_n\right) \le RC\,\max_{r,c} P\big(|T^{(j)}_{rc}| > t_n/\sqrt{RC}\big),$$

and $t_n$ and $t_n/\sqrt{RC}$ are of the same order as $n \to \infty$. By large deviation theory (Hall [8] and Jing et al. [10]), for some constant A,
$$P\big(|T^{(j)}_{rc}| > t_n\big) = \frac{2t^{-1}_n}{\sqrt{2\pi}}\,e^{-t^2_n/2}\left[1 + At^3_n n^{-1/2} + o\big(t^3_n n^{-1/2}\big)\right], \quad (7.9)$$

provided $t_n = o(n^{1/6})$. If (7.9) is uniform in $j > d_1$, then (7.6) and (7.9) imply that type I consistency holds if
$$d_0\big(t_n/\sqrt{2}\big)^{-1} e^{-t^2_n/2} \to 0. \quad (7.10)$$

To examine (7.10), write $d_0$ and $t_n$ as $d_0 = \exp(n^b)$ and $t_n = \sqrt{2}\,n^r$. By large deviation theory, (7.10) holds when $r < 1/6$. Thus if $0 < \tfrac{1}{2}b < r < \tfrac{1}{6}$, then
$$d_0\big(t_n/\sqrt{2}\big)^{-1} e^{-t^2_n/2} = n^{-r}\exp\big(n^b - n^{2r}\big) \to 0.$$

Thus $d_0 = \exp(n^b)$ with $b = 1/3 - \varepsilon$, for any $\varepsilon > 0$, is the largest $d_0$ can be to have consistency. For instance, if there are exactly $d_0 = \exp(n^{1/4})$ irrelevant variables and we use the threshold $t_n = \sqrt{2}\,n^{1/7}$, the probability of selecting any of the irrelevant variables tends to zero at the rate $n^{-1/7}\exp\big(n^{1/4} - n^{2/7}\big).$

7.2 Type II consistency

We now assume that for each relevant variable $X_j$ there is a least favorable point $x^{(1)}$ such that the local efficacy $\hat s^{(1)}_j \equiv \hat\zeta_j(x^{(1)})$ is stochastically smaller than the importance score $\hat s_j = M^{-1}\sum_{a=1}^{M}\hat\zeta_j(x^*_a)$. That is, we assume that for each relevant variable $X_j$ there exist $n_1$ and $x^{(1)}$ such that, for $n \ge n_1$,
$$P\big(\hat s^{(1)}_j \le t_n\big) \ge P(\hat s_j \le t_n), \quad j \le d_1. \quad (7.11)$$
The existence of such a least favorable $x^{(1)}$ is established by considering the probability limit of
$$\arg\min\{\hat\zeta_j(x^*_a): x^*_a \in \{x^*_1, \ldots, x^*_M\}\}.$$

We again assume that R and C are fixed, so that to show type II consistency it is enough to show that, for each (r, c) and $j \le d_1$,
$$P\big(|T^{(j)}_{rc}| \le t_n\big) \to 0.$$

This is because
$$P\left(\sqrt{\sum_r\sum_c[T^{(j)}_{rc}]^2} \le t_n\right) \le P\left(\min_{r,c}[T^{(j)}_{rc}]^2 \le t^2_n\right) \le RC\,\max_{r,c} P\big(|T^{(j)}_{rc}| \le t_n\big).$$

By Theorem 6.2, $\sqrt{n}\big(\hat s^{(1)}_j - \tau_j\big) \to_d N(0, \nu^2)$, $\tau_j > 0$. It follows that if $t_n$ and $d_1$ are bounded above as $n \to \infty$ and $\tau_j > 0$, then $P\big(\hat s^{(1)}_j \le t_n\big) \to 0$. Thus type II consistency holds in this case. To examine the more interesting case, where $t_n$ and $d_1$ are not bounded above and $\tau_j$ may tend to zero as $n \to \infty$, we will need to center the $T^{(j)}_{rc}$. Thus set
$$\theta^{(j)}_n = \frac{\sqrt{n}\big(p^{(j)}_{rc} - p^{(j)}_r p^{(j)}_c\big)}{\sqrt{p^{(j)}_r p^{(j)}_c}}, \quad (7.12)$$
$$T^{(j)}_n = T^{(j)}_{rc} - \theta^{(j)}_n,$$
where, for simplicity, the dependence of $T^{(j)}_n$ and $\theta^{(j)}_n$ on r and c is not indicated.

Lemma 7.1
$$P\Big(\min_{j \le d_1}|T^{(j)}_{rc}| \le t_n\Big) \le \sum_{j=1}^{d_1} P\Big(|T^{(j)}_n| \ge \min_{j \le d_1}|\theta^{(j)}_n| - t_n\Big).$$

Proof.
$$\begin{aligned}
P\Big(\min_{j \le d_1}|T^{(j)}_{rc}| \le t_n\Big) &= P\Big(\min_{j \le d_1}|T^{(j)}_n + \theta^{(j)}_n| \le t_n\Big)\\
&\le P\Big(\min_{j \le d_1}|\theta^{(j)}_n| - \max_{j \le d_1}|T^{(j)}_n| \le t_n\Big)\\
&= P\Big(\max_{j \le d_1}|T^{(j)}_n| \ge \min_{j \le d_1}|\theta^{(j)}_n| - t_n\Big)\\
&\le \sum_{j=1}^{d_1} P\Big(|T^{(j)}_n| \ge \min_{j \le d_1}|\theta^{(j)}_n| - t_n\Big).
\end{aligned}$$

Set
$$a_n = \min_{j \le d_1}\big|\theta^{(j)}_n\big| - t_n;$$
then, by large deviation theory, if $a_n = o(n^{1/6})$,
$$P\big(|T^{(j)}_n| \ge a_n\big) \propto \big(a_n/\sqrt{2}\big)^{-1}\exp\{-a^2_n/2\}, \quad (7.13)$$
where $\propto$ signifies asymptotic order as $n \to \infty$, $a_n \to \infty$. It follows that if (7.13) holds uniformly in j, then by Lemma 7.1,
$$P\Big(\min_{j \le d_1}\big|T^{(j)}_{rc}\big| \le t_n\Big) \propto d_1\big(a_n/\sqrt{2}\big)^{-1}\exp\{-a^2_n/2\}.$$

Thus type II consistency holds when $a_n = o(n^{1/6})$ and
$$d_1\big(a_n/\sqrt{2}\big)^{-1}\exp\{-a^2_n/2\} \to 0. \quad (7.14)$$
In Section 7.1, $t_n$ is of order $n^r$, $0 < r < 1/6$. For this $t_n$, type II consistency holds if $a_n$ is of order $n^r$, $0 < r < 1/6$, and
$$d_1 n^{-r}\exp\{-n^{2r}/2\} \to 0.$$

If $a_n = O(n^k)$ with $k \ge 1/6$, then $P\big(|T^{(j)}_n| \ge a_n\big)$ tends to zero at a rate faster than in (7.13) and consistency is ensured. For instance, if all relevant variables $X_j$ have $p^{(j)}_{rc}$ fixed as $n \to \infty$, then $\min_{j \le d_1}\big|\theta^{(j)}_n\big|$ is of order $\sqrt{n}$ and $a_n$ is of order $n^{1/2} - t_n$.

7.3 Type I and II consistency

We see from Sections 7.1 and 7.2 that consistency holds under a variety of conditions. One set of conditions is:

Theorem 7.1 Under the regularity conditions of Sections 7.1 and 7.2, if
$$d_0 = c_0\exp(n^b), \quad t_n = c_1 n^r, \quad \min_{j \le d_1}\big|\theta^{(j)}_n\big| - t_n = c_2 n^r, \quad d_1 = c_3\exp(n^b)$$
for positive constants $c_0$, $c_1$, $c_2$, $c_3$ and $0 < \tfrac{1}{2}b < r < \tfrac{1}{6}$, then CATCH is consistent.

Appendix: Proof of Theorem 6.1

(1) Let $N_h = \sum_{i=1}^{n} I\big(X^{(i)} \in N_h(x^{(0)})\big)$. Then $N_h \sim \mathrm{Bi}\big(n, \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx\big)$ and $N_h \to_{a.s.} \infty$ as $n \to \infty$. Moreover,
$$\hat\zeta(x^{(0)}, h) = n^{-1/2}\sqrt{\mathcal{X}^2(x^{(0)}, h)} = (N_h/n)^{1/2}(N_h)^{-1/2}\sqrt{\mathcal{X}^2(x^{(0)}, h)}.$$
Since $N_h \sim \mathrm{Bi}\big(n, \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx\big)$, we have
$$N_h/n \to \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx. \quad (7.15)$$
By Lemma 6.1 and Table 1,
$$(N_h)^{-1/2}\sqrt{\mathcal{X}^2(x^{(0)}, h)} \to \left(\sum_{r=+,-}\sum_{c=1}^{C}\frac{(p_{rc} - p_r q_c)^2}{p_r q_c}\right)^{1/2}, \quad (7.16)$$
where
$$p_{+c} = \Pr\big(X > x^{(0)}, Y = c \mid X \in N_h(x^{(0)})\big), \qquad p_{-c} = \Pr\big(X \le x^{(0)}, Y = c \mid X \in N_h(x^{(0)})\big)$$

for $r = +, -$ and $c = 1, \ldots, C$;
$$p_r = \sum_{c=1}^{C} p_{rc}, \qquad q_c = \sum_{r=+,-} p_{rc}.$$
Actually, $p_+ = \Pr\big(X > x^{(0)} \mid X \in N_h(x^{(0)})\big)$, $p_- = \Pr\big(X \le x^{(0)} \mid X \in N_h(x^{(0)})\big)$, and $q_c = \Pr\big(Y = c \mid X \in N_h(x^{(0)})\big)$.

By (7.15) and (7.16),
$$\hat\zeta(x^{(0)}, h) \to \left(\int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx\right)^{1/2}\left(\sum_{r=+,-}\sum_{c=1}^{C}\frac{(p_{rc} - p_r q_c)^2}{p_r q_c}\right)^{1/2} \equiv \zeta(x^{(0)}, h).$$

(2) Since X and Y are independent, $p_{rc} = p_r q_c$ for $r = +, -$ and $c = 1, \ldots, C$; then $\zeta(x^{(0)}, h) = 0$ for any h. Hence
$$\zeta(x^{(0)}) = \sup_{h>0}\zeta(x^{(0)}, h) = 0.$$

(3) Recall that $p_c(x^{(0)}) = \Pr(Y = c \mid x^{(0)})$. Without loss of generality assume $p'_c(x^{(0)}) > 0$. Then there exists h such that
$$\Pr\big(Y = c \mid x^{(0)} < X < x^{(0)} + h\big) > \Pr\big(Y = c \mid x^{(0)} - h < X < x^{(0)}\big),$$
which is equivalent to $p_{+c}/p_+ > p_{-c}/p_-$. This shows that $\zeta(x^{(0)}) > 0$, since by Lemma 6.1, $\zeta(x^{(0)}) = 0$ results in $p_{+c}/p_+ = p_{-c}/p_- = p_c$.

References

[1] Breiman, L., 2001. Random forests. Machine Learning 45, 5–32.

[2] Breiman, L., Friedman, J. H., Olshen, R. A., Stone, C. J., 1984. Classification and Regression Trees. Wadsworth, Belmont.

[3] Doksum, K., Tang, S., Tsui, K.-W., 2008. Nonparametric variable selection: The EARTH algorithm. Journal of the American Statistical Association 103(484), 1609–1620.

[4] Doksum, K. A., Samarov, A., 1995. Nonparametric estimation of global functionals and a measure of explanatory power of covariates in regression. Annals of Statistics 23, 1443–1473.

[5] Doksum, K. A., Schafer, C., 2006. Powerful choices: Tuning parameter selection based on power. In: Frontiers in Statistics: Dedicated to Peter Bickel (J. Fan and H. L. Koul, eds.). Imperial College Press, 113–141.

[6] Friedman, J., 1991. Multivariate adaptive regression splines. The Annals of Statistics 19(1), 1–67.

[7] Gao, J., Gijbels, I., 2008. Bandwidth selection in nonparametric kernel testing. Journal of the American Statistical Association 103(484), 1584–1594.

[8] Hall, P., 1992. The Bootstrap and Edgeworth Expansion. Springer, New York.

[9] Huang, L.-S., Chen, J., 2008. Analysis of variance, coefficient of determination and F-test for local polynomial regression. Annals of Statistics 36, 2085–2109.

[10] Jing, B.-Y., Shao, Q.-M., Wang, Q., 2003. Self-normalized Cramér-type large deviations for independent random variables. Annals of Probability 31, 2167–2215.

[11] Lin, D. Y., Zeng, D., 2011. Correcting for population stratification in genomewide association studies. Journal of the American Statistical Association 106, 997–1008.

[12] Loh, W.-Y., 2009. Improving the precision of classification trees. The Annals of Applied Statistics 3, 1710–1737.

[13] Price, A., Patterson, N., Plenge, R., Weinblatt, M., Shadick, N., Reich, D., 2006. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38(8), 904–909.

[14] Qiao, Z., Zhou, L., Huang, J., 2008. Linear discriminant analysis for high-dimensional, low-sample-size data. Special issue of World Congress on Engineering 2008.

[15] Quinlan, J. R., 1987. Simplifying decision trees. International Journal of Man-Machine Studies 27(3), 221–234.

[16] Schafer, C., Doksum, K. A., 2009. Selecting local models in multiple regression by maximizing power. Metrika 69, 283–304.

[17] Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267–288.

[18] Wang, L., Shen, X., 2007. On L1-norm multi-class support vector machines: methodology and theory. Journal of the American Statistical Association 102, 583–594.

[19] Zhang, H. H., Liu, Y., Wu, Y., Zhu, J., 2008. Variable selection for the multicategory SVM via adaptive sup-norm regularization. Electronic Journal of Statistics 2, 149–167.


• Introduction
• Importance Scores for Univariate Classification
  • Numerical Predictor: Local Contingency Efficacy
  • Categorical Predictor: Contingency Efficacy
• Variable Selection for Classification Problems: The CATCH Algorithm
  • Numerical Predictors
  • Numerical and Categorical Predictors
  • The Empirical Importance Scores
  • Adaptive Tube Distances
  • The CATCH Algorithm
• Simulation Studies
  • Variable Selection with Categorical and Numerical Predictors
  • Improving The Performance of SVM and RF Classifiers
• CATCH Analysis of The ANN Data
• Properties of Local Contingency Efficacy
• Consistency Of CATCH
  • Type I consistency
  • Type II consistency
  • Type I and II consistency
• Appendix: Proof of Theorem 6.1
Page 17: Nonparametric Variable Selection and Classi cation: The ...lc436/Tang_Chen_etal_2012_CSDA.pdf · pruning, will choose a subset of optimal splitting variables to be the most signi

observation with the comparison to Marginal CATCH in Table 2 On the other handthe tube size should not be too small to achieve sufficient detection power We suggestto use a = 01 or 015 for sample sizes n ge 400 or m = 50 for smaller sample sizesIn light of these guidelines we use M = n2 = 200 and a = 015 for Examples 41-43M = n = 200 and m = 50 for Example 44

Table 3 Sensitivity study of CATCH for Example 41 (n = 400 d = 50) using different number oftube centers M and tube size m = [na] For 1 le i le 5 the ith column gives the number of times Xi

was selected Column 6 gives the percentage of times irrelevant variables were selected

Number of Selections(100 simulations)M a X1 X2 X3 X4 X5 Xi i ge 6

M = 200 010 75 67 77 69 92 78015 76 76 87 84 96 82020 85 84 90 87 93 99030 89 87 95 89 85 96050 89 90 96 93 62 99

M = 400 010 74 76 82 77 92 72015 85 82 89 87 94 86020 87 88 90 86 96 91030 88 89 91 91 92 94050 89 91 94 93 61 102

In Example 41 all predictors are independent In a similar setting we next considerthe case where the predictors are dependent

Example 42 We generate standard normal random variables Zi for i = 1 middot middot middot d insuch a way that cor(Z1 Z4) = 05 cor(Z3 Z6) = 09and the other Zirsquos are independentWe set Xi = Φ(Zi) for i = 1 middot middot middot d where Φ(middot) is the CDF of a standard normal andtherefore Xi are uniformly distributed marginally as in Example 41 The correlationstructure between the Xirsquos are similar to that of the Zirsquos The dependent variable Y isgenerated in the same way as in Example 41

The results of Example 42 are given in Table 4 By comparing with results in Table2 both CATCH and GUIDE are doing well in the presence of dependence between thenumerical variables The explanation is two-fold First both methods could identify thecategorical X5 as an important predictor most of the time which make the identificationof other relevant predictors possible Second conditioning on X5 Y depends on a pairof variables (ldquoX1 and X2rdquo or ldquoX3 and X4rdquo) jointly and in particular this dependence isnonlinear which makes both methods less affected by the correlation introduced here

15

00 02 04 06 08 10

00

02

04

06

08

10

(a) X5 = 0

X1

X2

00 02 04 06 08 10

00

02

04

06

08

10

(b) X5 = 1

X3

X4

Figure 2 Sample scatter plots of (x1 x2) and (x3 x4) for simulated data using model (41) The leftpanel includes all the pairs (x1 x2) with x5 = 0 while the right panel includes all the pairs (x3 x4)with x5 = 1 The larger dots in the two plots are within the tube around a tube center shown as thestar

We now explain the latter in details for both methods In CATCH the algorithmadaptively weighs the effects of predictors by their importance scores in defining thetubes when they are being conditioned on therefore identifying either of the two willhelp to identify the other whereas an irrelevant variable though strongly associated withone of the relevant variables is down-weighted iteratively and does not strongly affectthe selection of the relevant variable This can explain the slight decreasing selectionprobability in X3 in presence of a strongly correlated X6 and consequently overall lessselection of X3 and X4 compared to X1 and X2 given the correlation between X1 andX4 In GUIDE the difference between the dependent and independent cases is evenless because GUIDE is very powerful in detecting local interaction by literally includingpairwise interactions when selecting splitting variables

Marginal CATCH and Random Forest do poorly in this case mainly because X5 cannot be detected as we explained earlier More specifically the dependence between Yand X4 in X5 = 1 case is the same as that between Y and X2 in X5 = 0 The linearterms of X1 and X2 have opposite signs in decision function therefore marginal effectsof X1 and X4 are obscured when X5 is not identified as an important variable

42 Improving The Performance of SVM and RF Classifiers

The criterion used in the CATCH algorithm is not based on classification accuracy. But if we use CATCH to screen out the irrelevant variables and use only the important variables for classification, the performance of other algorithms, such as the Support Vector Machine (SVM) and Random Forest Classifiers (RFC), can be significantly improved. In this section we compare the prediction performance of (1) SVM, (2) RFC, (3) CATCH followed by SVM, and (4) CATCH followed by RFC. We label (3) and (4) CATCH+SVM and CATCH+RFC, respectively. We use the "svm" function in the "e1071" package in R with the radial kernel to perform the SVM analysis, and the "randomForest" function in the "randomForest" package to perform the RFC analysis. We use simulation experiments to illustrate the improvement obtained by using CATCH prior to classifying Y.

Table 4: Simulation results from Example 4.2 with sample size n = 400. For 1 ≤ i ≤ 6, the ith column gives the estimated probability of selection of variable Xi and the standard error based on 100 simulations, where Xi, i = 1, ..., 5, are relevant variables with cor(X1, X4) = .5; X6 is an irrelevant variable that is correlated with X3 with correlation .9. Column 7 gives the mean of the estimated probability of selection and that of the standard error for the irrelevant variables X7, ..., Xd.

Number of Selections (100 simulations)

METHOD           d     X1           X2           X3           X4           X5           X6           Xi, i >= 7
CATCH            10    0.76 (0.04)  0.77 (0.04)  0.73 (0.04)  0.68 (0.05)  0.99 (0.01)  0.34 (0.05)  0.07 (0.02)
                 50    0.78 (0.04)  0.77 (0.04)  0.74 (0.04)  0.63 (0.05)  0.92 (0.03)  0.57 (0.05)  0.09 (0.03)
                 100   0.76 (0.04)  0.77 (0.04)  0.80 (0.04)  0.74 (0.04)  0.82 (0.04)  0.55 (0.05)  0.09 (0.03)
Marginal CATCH   10    0.35 (0.05)  0.63 (0.05)  0.80 (0.04)  0.32 (0.05)  0.09 (0.03)  0.71 (0.05)  0.12 (0.03)
                 50    0.34 (0.05)  0.71 (0.05)  0.70 (0.05)  0.30 (0.05)  0.10 (0.03)  0.60 (0.05)  0.11 (0.03)
                 100   0.34 (0.05)  0.68 (0.05)  0.75 (0.04)  0.30 (0.05)  0.10 (0.03)  0.59 (0.05)  0.10 (0.03)
Random Forest    10    0.29 (0.05)  0.48 (0.05)  0.55 (0.05)  0.20 (0.04)  0.30 (0.05)  0.47 (0.05)  0.05 (0.02)
                 50    0.26 (0.04)  0.52 (0.05)  0.57 (0.05)  0.30 (0.05)  0.24 (0.04)  0.45 (0.05)  0.08 (0.03)
                 100   0.36 (0.05)  0.61 (0.05)  0.66 (0.05)  0.37 (0.05)  0.32 (0.05)  0.46 (0.05)  0.12 (0.03)
GUIDE            10    0.96 (0.02)  0.96 (0.02)  0.94 (0.02)  0.95 (0.02)  0.96 (0.02)  0.84 (0.04)  0.30 (0.05)
                 50    0.80 (0.04)  0.92 (0.03)  0.88 (0.03)  0.79 (0.04)  0.72 (0.04)  0.68 (0.05)  0.12 (0.03)
                 100   0.81 (0.04)  0.89 (0.03)  0.90 (0.03)  0.76 (0.04)  0.75 (0.04)  0.58 (0.05)  0.08 (0.03)

Let the true model be

(X, Y) ∼ P,    (4.2)

where P will be specified later. In each of 100 Monte Carlo experiments, n pairs (X^{(i)}, Y^{(i)}), i = 1, ..., n, are generated from the true model (4.2). Based on the simulated data (X^{(i)}, Y^{(i)}), i = 1, ..., n, the methods (1) SVM, (2) RFC, (3) CATCH+SVM, and (4) CATCH+RFC are used to classify a future Y^{(0)} for which we have available a corresponding X^{(0)}.

The integrated misclassification rate (IMR; Friedman [6]) defined below is used to evaluate the performance of the classification algorithms. Let F_{X,Y}(·) be the classifier constructed by a classification algorithm based on X = (X^{(1)}, ..., X^{(n)}) and Y = (Y^{(1)}, ..., Y^{(n)})^T.

Let (X^{(0)}, Y^{(0)}) be a future observation with the same distribution as (X^{(i)}, Y^{(i)}). We only know X^{(0)} = x^{(0)} and want to classify Y^{(0)}. For a possible future observation (x^{(0)}, y^{(0)}), define

MR(X, Y; x^{(0)}, y^{(0)}) = P( F_{X,Y}(x^{(0)}) ≠ y^{(0)} | X, Y ).    (4.3)

Our criterion is

IMR = E E_0 [ MR(X, Y; x^{(0)}, y^{(0)}) ],    (4.4)

where E_0 is with respect to the distribution of the random (x^{(0)}, y^{(0)}) and E is with respect to the distribution of (X, Y).

This double expectation is evaluated using Monte Carlo simulation; the result is denoted IMR*. More precisely, E_0 is estimated using 5000 simulated (x^{(0)}, y^{(0)}) observations; then IMR* is computed on the basis of 100 simulated samples (X_i, Y_i), i = 1, ..., 100.
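To make the protocol concrete, here is a minimal R sketch of this double Monte Carlo loop for the SVM classifier. The generator gen_data(n) is hypothetical and stands in for the true model P; the sketch illustrates the evaluation scheme, not the authors' simulation code.

    library(e1071)

    # gen_data(n) is an assumed data generator returning a data frame with
    # predictor columns and a factor response Y drawn from the true model P.
    estimate_IMR <- function(gen_data, n_train = 400, n_test = 5000, n_rep = 100) {
      mr <- numeric(n_rep)
      for (b in seq_len(n_rep)) {
        train <- gen_data(n_train)                     # one simulated sample (X, Y)
        test  <- gen_data(n_test)                      # fresh draws approximating E_0
        fit   <- svm(Y ~ ., data = train, kernel = "radial")
        mr[b] <- mean(predict(fit, newdata = test) != test$Y)  # MR for this sample
      }
      mean(mr)                                         # outer average over samples: IMR*
    }

For CATCH+SVM the same loop would be run after restricting the training and test data to the variables selected by CATCH on the training sample; the RFC versions replace the svm call by randomForest.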

Example 4.3. First consider model (4.1) in Section 4.1. Figure 3 shows IMR* versus the number of irrelevant variables for this example. The left panel shows that the IMR*'s of the SVM approach increase dramatically as the number of irrelevant variables increases, while the IMR*'s of CATCH+SVM are stable. Thus CATCH stabilizes the performance of SVM for this complex model. The right panel shows that CATCH also stabilizes the performance of RFC. These results indicate that CATCH can improve the prediction accuracy of other classification algorithms.

Figure 3: IMR*'s of SVM, RFC, CATCH+SVM, and CATCH+RFC versus the number of predictors d for the simulation study of model (4.1) in Example 4.1. The left panel compares SVM and CATCH+SVM, the right panel RFC and CATCH+RFC; the vertical axis is the error rate. The sample size is n = 400.

Example 4.4 (Contaminated Logistic Model). Consider a binary response Y ∈ {0, 1} with

P(Y = 0 | X = x) = { F(x) if r = 0;  G(x) if r = 1 },    (4.5)

where x = (x1, ..., xd),

r ∼ Bernoulli(γ),    (4.6)

F(x) = ( 1 + exp{ −(0.25 + 0.5 x1 + x2) } )^{−1},    (4.7)

G(x) = { 0.1 if 0.25 + 0.5 x1 + x2 < 0;  0.9 if 0.25 + 0.5 x1 + x2 ≥ 0 },    (4.8)

and x1, x2, ..., xd are realizations of independent standard normal random variables. Hence, given X^{(i)} = x^{(i)}, 1 ≤ i ≤ n, the joint distribution of Y^{(1)}, Y^{(2)}, ..., Y^{(n)} is the contaminated logistic model with contamination parameter γ:

(1 − γ) ∏_{i=1}^{n} [F(x^{(i)})]^{Y^{(i)}} [1 − F(x^{(i)})]^{1−Y^{(i)}} + γ ∏_{i=1}^{n} [G(x^{(i)})]^{Y^{(i)}} [1 − G(x^{(i)})]^{1−Y^{(i)}}.    (4.9)

We perform a simulation study for n = 200, d = 50, and γ = 0, 0.2, 0.4, 0.6, 0.8, and 1, where γ = 0 corresponds to no contamination. Figure 4 shows the IMR*'s versus the different values of γ for this example. CATCH improves the performance of the SVM and Random Forest classifiers for this contaminated logistic regression model.
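For reference, the following is a minimal R sketch of how one data set can be drawn from model (4.5)-(4.9) under the settings above; the function name and the choice to return Y as a factor are ours, not the paper's.

    # Simulate one data set from the contaminated logistic model (4.5)-(4.9).
    sim_contaminated <- function(n = 200, d = 50, gamma = 0.4) {
      X   <- matrix(rnorm(n * d), n, d)          # x1, ..., xd: independent standard normals
      eta <- 0.25 + 0.5 * X[, 1] + X[, 2]        # linear term shared by F and G
      F0  <- 1 / (1 + exp(-eta))                 # F(x) in (4.7): P(Y = 0 | x) when r = 0
      G0  <- ifelse(eta < 0, 0.1, 0.9)           # G(x) in (4.8): P(Y = 0 | x) when r = 1
      r   <- rbinom(n, 1, gamma)                 # contamination indicator (4.6)
      p0  <- ifelse(r == 0, F0, G0)              # P(Y = 0 | X = x) as in (4.5)
      Y   <- rbinom(n, 1, 1 - p0)                # Y = 1 with probability 1 - P(Y = 0 | x)
      data.frame(Y = factor(Y), X)
    }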

Figure 4: IMR*'s of SVM, RFC, CATCH+SVM, and CATCH+RFC versus different contamination parameters γ for simulation model (4.9) in Example 4.4. The left panel compares SVM and CATCH+SVM, the right panel RFC and CATCH+RFC; the vertical axis is the error rate.

5 CATCH Analysis of The ANN Data

In this section CATCH is applied to a data set available at the UCI data base (http://ftp.ics.uci.edu/pub/machine-learning-databases/thyroid-disease). The data are from a clinical trial to determine whether a patient is hypothyroid (see, e.g., Quinlan [15]). The dataset is split into two parts: a training set with 3772 observations and a testing set with 3428 observations. The response is a categorical variable with three classes: normal (not hypothyroid), hyperfunction, and subnormal functioning. There are 21 predictors, including 15 categorical predictors and 6 numerical predictors. The three classes are very imbalanced, with 92.5% of the patients in the normal class. A good classifier should produce a significant decrease from the 7.5% error rate, which can be obtained by classifying all observations to the normal class.


CATCH identifies 7 variables as important, including 4 continuous variables and 3 categorical variables. We consider three classification methods: 1. Support Vector Machine with radial kernel, denoted by SVM(r); 2. Support Vector Machine with polynomial kernel, denoted by SVM(p); 3. RFC. We apply these three methods on the training set to build classification models and use the test set to estimate the misclassification rate of the methods. We apply the methods with CATCH and without CATCH, respectively, where "with CATCH" means that only the 7 important variables identified by CATCH are used to build the model and "without CATCH" means that all 21 variables are used in the analysis.
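A schematic R sketch of this comparison is given below; train and test stand for the training and testing sets with a factor response class, and catch_vars is a hypothetical character vector holding the names of the 7 variables selected by CATCH (the actual variable names are not listed in this section). The polynomial-kernel fit is analogous with kernel = "polynomial".

    library(e1071)
    library(randomForest)

    # Test-set misclassification rate of a fitted classifier.
    test_error <- function(fit, data) mean(predict(fit, newdata = data) != data$class)

    # Without CATCH: all 21 predictors.
    svm_all <- svm(class ~ ., data = train, kernel = "radial")
    rfc_all <- randomForest(class ~ ., data = train)

    # With CATCH: only the selected variables (catch_vars is assumed, not from the paper).
    form_sel <- reformulate(catch_vars, response = "class")
    svm_sel  <- svm(form_sel, data = train, kernel = "radial")
    rfc_sel  <- randomForest(form_sel, data = train)

    c(SVMr = test_error(svm_all, test),  SVMr_CATCH = test_error(svm_sel, test),
      RFC  = test_error(rfc_all, test),  RFC_CATCH  = test_error(rfc_sel, test))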

Table 5: Misclassification rates of three classification methods (SVM(r), SVM(p), RFC) with CATCH and without CATCH.

              SVM(r)   SVM(p)   RFC
w/o CATCH     0.052    0.063    0.024
with CATCH    0.037    0.055    0.015

Table 5 shows that if CATCH is used to select important variables before the classification methods are applied, the misclassification rates decrease. CATCH reduces the misclassification rates by 29%, 13%, and 37% for SVM(r), SVM(p), and RFC, respectively. CATCH+RFC has the best performance. Although CATCH is not designed for improving classification accuracy, it nonetheless can help other classification methods to perform better.
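(The reductions quoted here are relative: for example, for SVM(r) the reduction is (0.052 − 0.037)/0.052 ≈ 29%; similarly (0.063 − 0.055)/0.063 ≈ 13% for SVM(p) and (0.024 − 0.015)/0.024 ≈ 37% for RFC.)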

6 Properties of Local Contingency Efficacy

In this section we investigate the properties of local contingency efficacy. For a univariate continuous predictor, Theorem 6.1 below shows that local contingency efficacy is well defined and equals 0 if the response variable is independent of the predictor. Consider an R × C contingency table. Let n_{rc}, r = 1, ..., R, c = 1, ..., C, be the observed frequency in the (r, c)-cell and let p_{rc} be the probability that an observation belongs to the (r, c)-cell. Let

p_r = Σ_{c=1}^{C} p_{rc},    q_c = Σ_{r=1}^{R} p_{rc},    (6.1)


and

n_{r·} = Σ_{c=1}^{C} n_{rc},    n_{·c} = Σ_{r=1}^{R} n_{rc},    n = Σ_{r=1}^{R} Σ_{c=1}^{C} n_{rc},    E_{rc} = n_{r·} n_{·c} / n.    (6.2)

Consider testing that the rows and the columns are independent, i.e.,

H_0: p_{rc} = p_r q_c for all r, c    versus    H_1: p_{rc} ≠ p_r q_c for some r, c;    (6.3)

a standard test statistic is

X² = Σ_{r=1}^{R} Σ_{c=1}^{C} (n_{rc} − E_{rc})² / E_{rc}.    (6.4)

Under H_0, X² converges in distribution to χ²_{(R−1)(C−1)}. Moreover, defining 0/0 = 0, we have the following well-known result.

Lemma 6.1. As n tends to infinity, X²/n tends in probability to

τ² ≡ Σ_{r=1}^{R} Σ_{c=1}^{C} (p_{rc} − p_r q_c)² / (p_r q_c).

Moreover, if p_r q_c is bounded away from 0 and 1 as n → ∞, and if R and C are fixed as n → ∞, then τ² = 0 if and only if H_0 is true.

Remark 6.1. Lemma 6.1 shows that the contingency efficacy ζ in (2.8) is well defined. Moreover, ζ = 0 if and only if X and Y are independent.
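As an illustration of the quantity behind Lemma 6.1 and Remark 6.1, the R sketch below computes a chi-square-based efficacy √(X²/n) from a contingency table, and a local version obtained by splitting a window around x^{(0)} at x^{(0)} itself, as in the Appendix. It is only a schematic stand-in for the estimators (2.5) and (2.8) defined earlier in the paper.

    # Chi-square-based efficacy of an R x C contingency table: sqrt(X^2 / n),
    # using the convention 0/0 = 0 adopted in the text.
    contingency_efficacy <- function(tab) {
      n  <- sum(tab)
      E  <- outer(rowSums(tab), colSums(tab)) / n          # E_rc = n_r. * n_.c / n
      X2 <- sum(ifelse(E > 0, (tab - E)^2 / E, 0))         # Pearson statistic (6.4)
      sqrt(X2 / n)
    }

    # Local version for a numerical predictor x at a point x0 with half-width h:
    # cross-tabulate the sign of (x - x0) against y inside (x0 - h, x0 + h) and
    # rescale by the window frequency, mirroring the factorization in the Appendix.
    local_efficacy <- function(x, y, x0, h) {
      in_win <- x > x0 - h & x < x0 + h
      if (!any(in_win)) return(0)
      side <- factor(x[in_win] > x0, levels = c(FALSE, TRUE), labels = c("-", "+"))
      tab  <- table(side, y[in_win])
      sqrt(mean(in_win)) * contingency_efficacy(tab)       # = n^{-1/2} sqrt(X^2(x0, h))
    }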

Theorem 6.1. Consider a categorical response Y and a continuous predictor X with density function f(x) such that, for a fixed h > 0, f(x) > 0 for x ∈ (x^{(0)} − h, x^{(0)} + h). Then:

(1) For fixed h, the estimated local section efficacy ζ̂(x^{(0)}, h) defined in (2.5) converges almost surely to the section efficacy ζ(x^{(0)}, h) defined in (2.4).

(2) If X and Y are independent, then the local efficacy ζ(x^{(0)}) defined in (2.4) is zero.

(3) If there exists c ∈ {1, ..., C} such that p_c(x) = P(Y = c | x) has a continuous derivative at x = x^{(0)} with p′_c(x^{(0)}) ≠ 0, then ζ(x^{(0)}) > 0.

The χ² statistic (6.4) can be written in terms of multinomial proportions p̂_{rc}, with √n(p̂_{rc} − p_{rc}), 1 ≤ r ≤ R, 1 ≤ c ≤ C, converging in distribution to a multivariate normal. By a Taylor expansion of T_n ≡ √(X²/n) at τ = plim √(X²/n), we find the following.


Theorem 6.2.

√n (T_n − τ) →_d N(0, v²),    (6.5)

where the asymptotic variance v², which can be obtained from the Taylor expansion, is not needed in this paper.

Remark 6.2. The result (6.5) applies to the multivariate χ²-based efficacies ζ̂, with n replaced by the number of observations that go into the computation of ζ̂. In this case we use the notation χ²_j for the chi-square statistic for the jth variable.

7 Consistency Of CATCH

Let Y denote a variable to be classified into one of the categories 1, ..., C by using the X_j's in the vector X = (X_1, ..., X_d). Let X_{−j} ≡ X − X_j denote X without the X_j component. Assume that the squared Pearson nonparametric correlation (Doksum and Samarov [4]; Huang and Chen [9]) between X_j and X_{−j} is less than 0.5. We call the variable X_j relevant for Y if, conditionally given X_{−j}, L(Y | X_j = x) is not constant in x, and irrelevant otherwise. Our data are (X^{(1)}, Y^{(1)}), ..., (X^{(n)}, Y^{(n)}), i.i.d. as (X, Y). A variable selection procedure is consistent if the probability that all the relevant variables are selected and none of the irrelevant variables are selected tends to one as n tends to infinity.

We give a general nonparametric framework in which the variable selection rule based on the CATCH importance scores s_j given in (3.8) is consistent.

When the predictors are not all categorical and CATCH involves tubes, we assume that the number of observations in each tube tends to ∞ as n → ∞, and we assume that, for those importance scores that require the selection of a section size h, the selection is from a finite set {h_j} with min_j{h_j} > h_0 for some h_0 > 0. It follows that the number of observations m_{jk} in the kth tube used to calculate the importance score s_j for the variable X_j tends to infinity as n → ∞.

The key quantities that determine consistency are the limits λ_j of s_j. That is,

λ_j = τ(ζ_j, P) = E_P[ζ_j(X)] = ∫ ζ_j(x) dP(x),    j = 1, ..., d,

where ζ_j(x) is the local contingency efficacy defined by (2.4) for numerical X's and by (2.8) for categorical X's, and P is the probability distribution of X. Our estimate s_j of τ_j is

s_j = τ(ζ̂_j, P̂) = (1/M) Σ_{k=1}^{M} ζ̂_j(X*_k),

where ζ̂_j is defined in Section 3 and P̂ is the empirical distribution of X. Let d_0 denote the number of X's that are irrelevant for Y. Set d_1 = d − d_0 and, without loss of generality, reorder the λ_j so that


λ_j > 0 if j = 1, ..., d_1;    λ_j = 0 if j = d_1 + 1, ..., d.    (7.1)

Our variable selection rule is

"keep variable X_j if s_j ≥ t for some threshold t".    (7.2)

Rule (7.2) is consistent if and only if

P( min_{j≤d_1} s_j ≥ t and max_{j>d_1} s_j < t ) → 1.    (7.3)

To establish (7.3) it is enough to show that

P( max_{j>d_1} s_j > t ) → 0    (7.4)

and

P( min_{j≤d_1} s_j ≤ t ) → 0.    (7.5)

We call (7.4) and (7.5) type I and type II consistency, respectively. Type I and II error probabilities refer to the probabilities of any false positives and any false negatives, respectively.
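To fix ideas, here is a schematic R sketch of the importance score s_j and the thresholding rule (7.2), using the illustrative local_efficacy function from Section 6. The adaptive, iteratively reweighted tubes of Section 3, which condition on the other predictors, are deliberately omitted, so this is a caricature closer to marginal CATCH than to the full algorithm.

    # Schematic importance scores: average the local efficacy of X_j over M random
    # evaluation points, then keep the variables whose score exceeds a threshold t.
    importance_scores <- function(X, y, M = 50, h = 0.2) {
      apply(X, 2, function(xj) {
        centers <- sample(xj, M, replace = TRUE)            # random points x*_1, ..., x*_M
        mean(sapply(centers, function(x0) local_efficacy(xj, y, x0, h)))
      })
    }

    select_variables <- function(X, y, t, ...) {
      s <- importance_scores(X, y, ...)
      which(s >= t)                                         # rule (7.2): keep X_j if s_j >= t
    }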

7.1 Type I consistency

We consider (7.4) first and note that

P( max_{j>d_1} s_j > t ) ≤ Σ_{j=d_1+1}^{d} P( s_j > t ).    (7.6)

For the jth irrelevant variable X_j, j > d_1, the results in Section 6 apply to the local efficacy ζ̂_j(x) at a fixed point x. The importance score s_j defined in (3.8) is the average of such efficacies over a sample of M random points x*_a. We assume that there is a point x^{(0)} such that, for the irrelevant X_j's and for n large enough, s_j^{(0)} ≡ ζ̂_j(x^{(0)}) is stochastically larger in the right tail than the average M^{−1} Σ_{a=1}^{M} ζ̂_j(x*_a). That is, we assume that there exist t > 0 and n_0 such that for n ≥ n_0,

P( s_j > t ) ≤ P( s_j^{(0)} > t ),    j > d_1.    (7.7)

The existence of such a point x^{(0)} is established by considering the probability limit of

arg max{ ζ̂_j(x*_a) : x*_a ∈ {x*_1, ..., x*_M} }.

Note that by Lemma 6.1, s_j^{(0)} →_p 0 as n → ∞. It follows that we have established (7.4) for t and d_0 = d − d_1 that are fixed as n → ∞. To examine the interesting case where d_0 → ∞ as n → ∞, we need large deviation results, to which we turn next. In the d_0 → ∞ case we consider t ≡ t_n that tends to ∞ as n → ∞, and we use the weaker assumption: there exist x^{(0)} and n_0 such that

P( s_j > t_n ) ≤ P( s_j^{(0)} > t_n )    (7.8)

for n ≥ n_0 and each t_n with lim_{n→∞} t_n = ∞. We also assume that the number of columns C and rows R in the contingency table that defines the efficacy χ²_{jn} given by (6.4) stays fixed as n → ∞. In this case P( s_j^{(0)} > t_n ) → 0 provided each term

[T_{rc}^{(j)}]² ≡ [ (n_{rc} − E_{rc}) / (√n √E_{rc}) ]²

in s_j² = χ²_{jn}/n, defined for X_j by (6.4), satisfies

P( |T_{rc}^{(j)}| > t_n ) → 0.

This is because

P( √( Σ_r Σ_c [T_{rc}^{(j)}]² ) > t_n ) ≤ P( RC max_{r,c} [T_{rc}^{(j)}]² > t_n² ) ≤ RC max_{r,c} P( |T_{rc}^{(j)}| > t_n / √(RC) ),

and t_n and t_n/√(RC) are of the same order as n → ∞. By large deviation theory (Hall [8]; Jing et al. [10]), for some constant A,

P( |T_{rc}^{(j)}| > t_n ) = ( 2 t_n^{−1} / √(2π) ) e^{−t_n²/2} [ 1 + A t_n³ n^{−1/2} + o( t_n³ n^{−1/2} ) ],    (7.9)

provided t_n = o(n^{1/6}). If (7.9) is uniform in j > d_1, then (7.6) and (7.9) imply that type I consistency holds if

d_0 ( t_n/√2 )^{−1} e^{−t_n²/2} → 0.    (7.10)

To examine (7.10), write d_0 and t_n as d_0 = exp(n^b) and t_n = √2 n^r. By large deviation theory, (7.10) holds when r < 1/6. Thus if 0 < (1/2)b < r < 1/6, then

d_0 ( t_n/√2 )^{−1} e^{−t_n²/2} = n^{−r} exp( n^b − n^{2r} ) → 0.


Thus d_0 = exp(n^b) with b = 1/3 − ε, for any ε > 0, is the largest d_0 can be to have consistency. For instance, if there are exactly d_0 = exp(n^{1/4}) irrelevant variables and we use the threshold t_n = √2 n^{1/7}, the probability of selecting any of the irrelevant variables tends to zero at the rate

n^{−1/4} exp( n^{1/4} − n^{2/7} ).

7.2 Type II consistency

We now assume that for each relevant variable X_j there is a least favorable point x^{(1)} such that the local efficacy s_j^{(1)} ≡ ζ̂_j(x^{(1)}) is stochastically smaller than the importance score s_j = M^{−1} Σ_{a=1}^{M} ζ̂_j(x*_a). That is, we assume that for each relevant variable X_j there exist n_1 and x^{(1)} such that for n ≥ n_1,

P( s_j^{(1)} ≤ t_n ) ≥ P( s_j ≤ t_n ),    j ≤ d_1.    (7.11)

The existence of such a least favorable x^{(1)} is established by considering the probability limit of

arg min{ ζ̂_j(x*_a) : x*_a ∈ {x*_1, ..., x*_M} }.

We again assume that R and C are fixed, so that to show type II consistency it is enough to show that, for each (r, c) and j ≤ d_1,

P( |T_{rc}^{(j)}| ≤ t_n ) → 0.

This is because

P( √( Σ_r Σ_c [T_{rc}^{(j)}]² ) ≤ t_n ) ≤ P( min_{r,c} [T_{rc}^{(j)}]² ≤ t_n² ) ≤ RC max_{r,c} P( |T_{rc}^{(j)}| ≤ t_n ).

By Theorem 6.2, √n( s_j^{(1)} − τ_j ) →_d N(0, ν²), τ_j > 0. It follows that if t_n and d_1 are bounded above as n → ∞ and τ_j > 0, then P( s_j^{(1)} ≤ t_n ) → 0. Thus type II consistency holds in this case. To examine the more interesting case where t_n and d_1 are not bounded above and τ_j may tend to zero as n → ∞, we will need to center the T_{rc}^{(j)}. Thus set

θ_n^{(j)} = √n ( p_{rc}^{(j)} − p_r^{(j)} p_c^{(j)} ) / √( p_r^{(j)} p_c^{(j)} ),    (7.12)

T_n^{(j)} = T_{rc}^{(j)} − θ_n^{(j)},

where for simplicity the dependence of T_n^{(j)} and θ_n^{(j)} on r and c is not indicated.

Lemma 7.1.

P( min_{j≤d_1} |T_{rc}^{(j)}| ≤ t_n ) ≤ Σ_{j=1}^{d_1} P( |T_n^{(j)}| ≥ min_{j≤d_1} |θ_n^{(j)}| − t_n ).

Proof.

P( min_{j≤d_1} |T_{rc}^{(j)}| ≤ t_n ) = P( min_{j≤d_1} |T_n^{(j)} + θ_n^{(j)}| ≤ t_n )
≤ P( min_{j≤d_1} |θ_n^{(j)}| − max_{j≤d_1} |T_n^{(j)}| ≤ t_n )
= P( max_{j≤d_1} |T_n^{(j)}| ≥ min_{j≤d_1} |θ_n^{(j)}| − t_n )
≤ Σ_{j=1}^{d_1} P( |T_n^{(j)}| ≥ min_{j≤d_1} |θ_n^{(j)}| − t_n ).

Set

a_n = min_{j≤d_1} |θ_n^{(j)}| − t_n;

then by large deviation theory, if a_n = o(n^{1/6}),

∣∣ ge an) prop

(anradic

2

)minus1expminusa2n2 (713)

where ∝ signifies asymptotic order as n → ∞, a_n → ∞. It follows that if (7.13) holds uniformly in j, then by Lemma 7.1,

P( min_{j≤d_1} |T_{rc}^{(j)}| ≤ t_n ) ∝ d_1 ( a_n/√2 )^{−1} exp{ −a_n²/2 }.

Thus type II consistency holds when a_n = o(n^{1/6}) and

d_1 ( a_n/√2 )^{−1} exp{ −a_n²/2 } → 0.    (7.14)

In Section 7.1, t_n is of order n^r, 0 < r < 1/6. For this t_n, type II consistency holds if a_n is of order n^r, 0 < r < 1/6, and

d_1 n^{−r} exp{ −n^{2r}/2 } → 0.


If a_n = O(n^k) with k ≥ 1/6, then P( |T_n^{(j)}| ≥ a_n ) tends to zero at a rate faster than in (7.13), and consistency is ensured. For instance, if all relevant variables X_j have p_{rc}^{(j)} fixed as n → ∞, then min_{j≤d_1} |θ_n^{(j)}| is of order √n and a_n is of order n^{1/2} − t_n.

7.3 Type I and II consistency

We see from Sections 7.1 and 7.2 that consistency holds under a variety of conditions. One set of conditions is the following.

Theorem 7.1. Under the regularity conditions of Sections 7.1 and 7.2, if

d_0 = c_0 exp(n^b),    t_n = c_1 n^r,    min_{j≤d_1} |θ_n^{(j)}| − t_n = c_2 n^r,    d_1 = c_3 exp(n^b)

for positive constants c_0, c_1, c_2, and c_3, and 0 < (1/2)b < r < 1/6, then CATCH is consistent.

Appendix: Proof of Theorem 6.1

(1) Let N_h = Σ_{i=1}^{n} I( X^{(i)} ∈ N_h(x^{(0)}) ). Then N_h ∼ Bi( n, ∫_{x^{(0)}−h}^{x^{(0)}+h} f(x) dx ), and N_h →_{a.s.} ∞ as n → ∞. Also,

ζ̂(x^{(0)}, h) = n^{−1/2} √( X²(x^{(0)}, h) ) = ( N_h/n )^{1/2} ( N_h )^{−1/2} √( X²(x^{(0)}, h) ).

Since N_h ∼ Bi( n, ∫_{x^{(0)}−h}^{x^{(0)}+h} f(x) dx ), we have

N_h / n → ∫_{x^{(0)}−h}^{x^{(0)}+h} f(x) dx.    (7.15)

By Lemma 6.1 and Table 1,

( N_h )^{−1/2} √( X²(x^{(0)}, h) ) → ( Σ_{r=+,−} Σ_{c=1}^{C} (p_{rc} − p_r q_c)² / (p_r q_c) )^{1/2},    (7.16)

where

p_{+c} = Pr( X > x^{(0)}, Y = c | X ∈ N_h(x^{(0)}) ),    p_{−c} = Pr( X ≤ x^{(0)}, Y = c | X ∈ N_h(x^{(0)}) ),


for r = +, − and c = 1, ..., C,

p_r = Σ_{c=1}^{C} p_{rc},    q_c = Σ_{r=+,−} p_{rc}.

Actually, p_+ = Pr( X > x^{(0)} | X ∈ N_h(x^{(0)}) ), p_− = Pr( X ≤ x^{(0)} | X ∈ N_h(x^{(0)}) ), and q_c = Pr( Y = c | X ∈ N_h(x^{(0)}) ).

By (7.15) and (7.16),

ζ̂(x^{(0)}, h) → ( ∫_{x^{(0)}−h}^{x^{(0)}+h} f(x) dx )^{1/2} ( Σ_{r=+,−} Σ_{c=1}^{C} (p_{rc} − p_r q_c)² / (p_r q_c) )^{1/2} ≡ ζ(x^{(0)}, h).

(2) Since X and Y are independent, p_{rc} = p_r q_c for r = +, − and c = 1, ..., C; then ζ(x^{(0)}, h) = 0 for any h. Hence

ζ(x^{(0)}) = sup_{h>0} ζ(x^{(0)}, h) = 0.

(3) Recall that p_c(x^{(0)}) = Pr(Y = c | x^{(0)}). Without loss of generality, assume p′_c(x^{(0)}) > 0. Then there exists h such that

Pr( Y = c | x^{(0)} < X < x^{(0)} + h ) > Pr( Y = c | x^{(0)} − h < X < x^{(0)} ),

which is equivalent to p_{+c}/p_+ > p_{−c}/p_−. This shows that ζ(x^{(0)}) > 0, since by Lemma 6.1, ζ(x^{(0)}) = 0 would imply p_{+c}/p_+ = p_{−c}/p_− = p_c.

References

[1] Breiman, L., 2001. Random forests. Machine Learning 45, 5–32.

[2] Breiman, L., Friedman, J. H., Olshen, R. A., Stone, J., 1984. Classification and Regression Trees. Wadsworth, Belmont.

[3] Doksum, K., Tang, S., Tsui, K.-W., 2008. Nonparametric variable selection: The EARTH algorithm. Journal of the American Statistical Association 103 (484), 1609–1620.

[4] Doksum, K. A., Samarov, A., 1995. Nonparametric estimation of global functionals and a measure of explanatory power of covariates in regression. Annals of Statistics 23, 1443–1473.

[5] Doksum, K. A., Schafer, C., 2006. Powerful choices: Tuning parameter selection based on power. In: Frontiers in Statistics: Dedicated to Peter Bickel (J. Fan and H. L. Koul, eds.). Imperial College Press, 113–141.

[6] Friedman, J., 1991. Multivariate adaptive regression splines. The Annals of Statistics, 1–67.

[7] Gao, J., Gijbels, I., 2008. Bandwidth selection in nonparametric kernel testing. Journal of the American Statistical Association 103 (484), 1584–1594.

[8] Hall, P., 1992. The Bootstrap and Edgeworth Expansions. Springer, New York.

[9] Huang, L.-S., Chen, J., 2008. Analysis of variance, coefficient of determination and F-test for local polynomial regression. Annals of Statistics 36, 2085–2109.

[10] Jing, B.-Y., Shao, Q.-M., Wang, Q., 2003. Self-normalized Cramér-type large deviations for independent random variables. Annals of Probability 31, 2167–2215.

[11] Lin, D. Y., Zeng, D., 2011. Correcting for population stratification in genomewide association studies. Journal of the American Statistical Association 106, 997–1008.

[12] Loh, W.-Y., 2009. Improving the precision of classification trees. The Annals of Applied Statistics 3, 1710–1737.

[13] Price, A., Patterson, N., Plenge, R., Weinblatt, M., Shadick, N., Reich, D., 2006. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38 (8), 904–909.

[14] Qiao, Z., Zhou, L., Huang, J., 2008. Linear discriminant analysis for high dimensional, low sample size data. Special issue of World Congress on Engineering 2008.

[15] Quinlan, J. R., 1987. Simplifying decision trees. International Journal of Man-Machine Studies 27 (3), 221–234.

[16] Schafer, C., Doksum, K. A., 2009. Selecting local models in multiple regression by maximizing power. Metrika 69, 283–304.

[17] Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267–288.

[18] Wang, L., Shen, X., 2007. On L1-norm multi-class support vector machines: methodology and theory. Journal of the American Statistical Association 102, 583–594.

[19] Zhang, H. H., Liu, Y., Wu, Y., Zhu, J., 2008. Variable selection for the multicategory SVM via adaptive sup-norm regularization. Electronic Journal of Statistics 2, 149–167.

30

  • Introduction
  • Importance Scores for Univariate Classification
    • Numerical Predictor Local Contingency Efficacy
    • Categorical Predictor Contingency Efficacy
      • Variable Selection for Classification Problems The CATCH Algorithm
        • Numerical Predictors
        • Numerical and Categorical Predictors
        • The Empirical Importance Scores
        • Adaptive Tube Distances
        • The CATCH Algorithm
          • Simulation Studies
            • Variable Selection with Categorical and Numerical Predictors
            • Improving The Performance of SVM and RF Classifiers
              • CATCH Analysis of The ANN Data
              • Properties of Local Contingency Efficacy
              • Consistency Of CATCH
                • Type I consistency
                • Type II consistency
                • Type I and II consistency
                  • Proof of Theorem 61
Page 18: Nonparametric Variable Selection and Classi cation: The ...lc436/Tang_Chen_etal_2012_CSDA.pdf · pruning, will choose a subset of optimal splitting variables to be the most signi

00 02 04 06 08 10

00

02

04

06

08

10

(a) X5 = 0

X1

X2

00 02 04 06 08 10

00

02

04

06

08

10

(b) X5 = 1

X3

X4

Figure 2 Sample scatter plots of (x1 x2) and (x3 x4) for simulated data using model (41) The leftpanel includes all the pairs (x1 x2) with x5 = 0 while the right panel includes all the pairs (x3 x4)with x5 = 1 The larger dots in the two plots are within the tube around a tube center shown as thestar

We now explain the latter in details for both methods In CATCH the algorithmadaptively weighs the effects of predictors by their importance scores in defining thetubes when they are being conditioned on therefore identifying either of the two willhelp to identify the other whereas an irrelevant variable though strongly associated withone of the relevant variables is down-weighted iteratively and does not strongly affectthe selection of the relevant variable This can explain the slight decreasing selectionprobability in X3 in presence of a strongly correlated X6 and consequently overall lessselection of X3 and X4 compared to X1 and X2 given the correlation between X1 andX4 In GUIDE the difference between the dependent and independent cases is evenless because GUIDE is very powerful in detecting local interaction by literally includingpairwise interactions when selecting splitting variables

Marginal CATCH and Random Forest do poorly in this case mainly because X5 cannot be detected as we explained earlier More specifically the dependence between Yand X4 in X5 = 1 case is the same as that between Y and X2 in X5 = 0 The linearterms of X1 and X2 have opposite signs in decision function therefore marginal effectsof X1 and X4 are obscured when X5 is not identified as an important variable

42 Improving The Performance of SVM and RF Classifiers

The criterion used in the CATCH algorithm is not based on classification accuracyBut if we use CATCH to screen out the irrelevant variables and only use the important

16

Tab

le4

Sim

ula

tion

resu

lts

from

Exam

ple

42

wit

hsa

mp

lesi

zen

=400

For

1leile

6

theit

hC

olu

mn

giv

esth

ees

tim

ate

dp

rob

ab

ilit

yof

sele

ctio

nof

vari

ableX

ian

dth

est

and

ard

erro

rb

ased

on

100

sim

ula

tion

sw

her

eX

ii

=1middotmiddotmiddot

5are

rele

vant

vari

ab

les

wit

hcor(X

1X

4)

=5

X

6is

anir

rele

vant

vari

able

that

isco

rrel

ated

wit

hX

3w

ith

corr

elati

on

9

Colu

mn

7giv

esth

em

ean

of

the

esti

mate

dpro

bab

ilit

yof

sele

ctio

nan

dth

atof

the

stan

dar

der

ror

for

the

irre

leva

nt

vari

ab

lesX

7middotmiddotmiddotX

d

Num

ber

ofSel

ecti

ons(

100

sim

ula

tion

s)

ME

TH

OD

dX

1X

2X

3X

4X

5X

6Xiige

7

CA

TC

H10

076

(00

4)0

77(0

04)

073

(00

4)0

68(0

05)

099

(00

1)0

34(0

05)

007

(00

2)

500

78(0

04)

077

(00

4)0

74(0

04)

063

(00

5)0

92(0

03)

057

(00

5)0

09(0

03)

100

076

(00

4)0

77(0

04)

08

(00

4)0

74(0

04)

082

(00

4)0

55(0

05)

009

(00

3)

Mar

ginal

CA

TC

H10

035

(00

5)0

63(0

05)

08

(00

4)0

32(0

05)

009

(00

3)0

71(0

05)

012

(00

3)

500

34(0

05)

071

(00

5)0

7(0

05)

03

(00

5)0

1(0

03)

06

(00

5)0

11(0

03)

100

034

(00

5)0

68(0

05)

075

(00

4)0

3(0

05)

01

(00

3)0

59(0

05)

01

(00

3)

Ran

dom

For

est

100

29(0

05)

048

(00

5)0

55(0

05)

02

(00

4)0

3(0

05)

047

(00

5)0

05(0

02)

500

26(0

04)

052

(00

5)0

57(0

05)

03

(00

5)0

24(0

04)

045

(00

5)0

08(0

03)

100

036

(00

5)0

61(0

05)

066

(00

5)0

37(0

05)

032

(00

5)0

46(0

05)

012

(00

3)

GU

IDE

100

96(0

02)

096

(00

2)0

94(0

02)

095

(00

2)0

96(0

02)

084

(00

4)0

3(0

05)

500

8(0

04)

092

(00

3)0

88(0

03)

079

(00

4)0

72(0

04)

068

(00

5)0

12(0

03)

100

081

(00

4)0

89(0

03)

09

(00

3)0

76(0

04)

075

(00

4)0

58(0

05)

008

(00

3)

17

variables for classification the performance of other algorithms such as Support VectorMachine (SVM) and Random Forest Classifiers (RFC) can be significantly improvedWe compare the prediction performance of (1) SVM (2) RFC (3) CATCH followedby SVM and (4) CATCH followed by RFC in this section We label (3) and (4)with CATCH+SVM and CATCH+RFC respectively In this section we use the ldquosvmrdquofunction in the ldquoe1071rdquo package in R with radial kernel to perform SVM analysisand use the ldquorandomForestrdquo function in the ldquorandomForestrdquo package to perform RFCanalysis We use simulation experiments to illustrate the improvement obtained byusing CATCH prior to classifying Y

Let the true model be

(X Y ) sim P (42)

where P will be specified later In each of 100 Monte Carlo experiments n pairs of(X(i) Y (i)) i = 1 middot middot middot n are generated from the true model (42) Based on the simu-lated data (X(i) Y (i)) i = 1 middot middot middot n the methods (1) SVM (2) RFC (3) CATCH+SVM(4) CATCH+RFC are used to classify a future Y (0) for which we have available a cor-responding X(0)

The integrated-misclassification rate (IMR Friedman [6]) defined below is used toevaluate the performance of the classification algorithms Let FXY(middot) be the classifierconstructed by a classification algorithm based on X = (X(1) middot middot middot X(n)) and Y =(Y (1) middot middot middot Y (n))T

Let (X(0) Y (0)) be a future observation with the same distribution as (X(i) Y (i))We only know X(0) = x(0) and want to classify Y (0) For a possible future observation(x(0) y(0)) define

MR(XYx(0) y(0)) = P (FXY(x(0)) 6= y(0)|XY) (43)

Our criterion is

IMR = EE0[MR(XYx(0) y(0))] (44)

where E0 is with respect to the distribution of (x(0) y(0)) random and E is with respectto the distribution of (XY)

This double expectation is evaluated using Monte Carlo simulation The result isdenoted as IMRlowast More precisely E0 is estimated using 5000 simulated (x(0) y(0)) obser-vations then IMRlowast is computed on the basis of 100 simulated (XiYi) i = 1 middot middot middot 100samples

Example 43 First consider model (41) in section 41 Figure 3 shows IMRlowast versusthe number of irrelevant variables for this example The left panel shows that theIMRlowastrsquos of the SVM approach increases dramatically as the number of irrelevant variablesincreases while the IMRlowastrsquos of CATCH+SVM are stable Thus CATCH stabilizes the

18

performance of SVM for this complex model The right panel shows that CATCH alsostabilizes the performance of RFC These results indicate that CATCH can improve theprediction accuracy of other classification algorithms

20 40 60 80 100

025

030

035

040

045

050

055

060

SVM

d

Err

or R

ate

SVMCATCH+SVM

20 40 60 80 100

025

030

035

040

045

050

055

060

Random Forest

d

RFCCATCH+RFC

Figure 3 IMRlowastrsquos of SVM RFC CATCH+SVM CATCH+RFC versus the number of predictors forthe simulation study in model (41) in example 41 The sample size is n = 400

Example 44 Contaminated Logistic Model Consider a binary response Y isin 0 1with

P (Y = 0|X = x) =

F (x) if r = 0G(x) if r = 1

(45)

where x = (x1 xd)

r sim Bernoulli(γ) (46)

F (x) = (1 + expminus(025 + 05x1 + x2))minus1 (47)

G(x) =

01 if 025 + 05x1 + x2 lt 009 if 025 + 05x1 + x2 ge 0

(48)

and x1 x2 xd are realizations of independent standard normal random variablesHence given X(i) = x(i) 1 le i le n the joint distribution of Y (1) Y (2) Y (n) is thecontaminated logistic model with contamination parameter γ

19

(1minus γ)nprodi=1

[F (x(i))]Y(i)

[1minus F (x(i))]1minusY(i)

+ γ

nprodi=1

[G(x(i))]Y(i)

[1minusG(x(i))]1minusY(i)

(49)

We perform simulation study for n = 200 d = 50 and γ = 0 02 04 06 08 and1 where γ = 0 corresponds to no contamination Figure 4 shows the IMRlowastrsquos versus thedifferent values of γ for this example CATCH improves the performance of the SVMand Random Forest classifiers for this contaminated logistic regression model

01

02

03

04

SVM

γ

Err

or R

ate

0 02 04 06 08 1

SVMCATCH+SVM

01

02

03

04

Random Forest

γ

0 02 04 06 08 1

RFCCATCH+RFC

Figure 4 IMRlowastrsquos of SVM RFC CATCH+SVM CATCH+RFC versus different contamination pa-rameters γ for simulation model (49) in example 44

5 CATCH Analysis of The ANN Data

In this section CATCH is applied to a data set available at the UCI data base(httpftpicsuciedupubmachine-learning-databasesthyroid-disease) The data is from a clinical trial to determine whether a patient is hypothyroid (seeeg Quinlan [15]) The dataset is split into two parts a training set with 3772 ob-servations and a testing set with 3428 observations The response is a categoricalvariable with three classes normal (not hypothyroid) hyperfunction and subnormalfunctioning There are 21 predictors including 15 categorical predictors and 6 numer-ical predictors The three classes are very imbalanced with 925 of normal patientsA good classifier should produce a significant decrease from an error rate 75 whichcan be obtained by classifying all observations to the normal class

20

CATCH identifies 7 variables as important including 4 continuous variables and3 categorical variables We consider three classification methods 1 Support VectorMachine with radial kernel denoted by SVM(r) 2 Support Vector Machine withpolynomial kernel denoted by SVM(p) 3 RFC We apply these three methods onthe training set to build classification models and use the test set to estimate themisclassification rate of the methods We apply the methods with CATCH and withoutCATCH respectively where ldquowith CATCHrdquo means that only the 7 important variablesidentified by CATCH are used to build the model and ldquowithout CATCHrdquo means thatall the 21 variables are used in the analysis

Table 5 Misclassification rates of three classification methods (SVM(r) SVM(p) RFC) with CATCHand without CATCH

SVM(r) SVM(p) RFC

wo CATCH 0052 0063 0024

with CATCH 0037 0055 0015

Table 5 shows that if CATCH is used to select important variables before the clas-sification methods are applied the misclassification rates decrease CATCH reducesthe misclassification rates by 29 13 and 37 for SVM(r) SVM(p) and RFC respec-tively CATCH + RFC has the best performance Although CATCH is not designed forimproving classification accuracy nonetheless it can help other classification methodsto perform better

6 Properties of Local Contingency Efficacy

In this section we investigate the properties of local contingency efficacy For a uni-variate continuous predictor Theorem 61 below shows that local contingency efficacyis well-defined and equals to 0 if the response variable is independent of the predictorConsider an R times C contingency table Let nrc r = 1 middot middot middot R c = 1 middot middot middot C be theobserved frequency in the (r c)-cell and let prc be the probability that an observationbelongs to the (r c)-cell Let

pr =Csumc=1

prc qc =Rsumr=1

prc (61)

21

and

nrmiddot =Csumc=1

nrc nmiddotc =Rsumi=1

nrc n =Rsumi=1

Csumc=1

nrc Erc = nrmiddotnmiddotcn (62)

Consider testing that the rows and the columns are independent ie

H0 forallr c prc = prqc versus H1 existr c prc 6= prqc (63)

a standard test statistic is

X 2 =Rsumr=1

Csumc=1

(nrc minus Erc)2

Erc (64)

Under H0 X 2 converges in distribution to χ2(Rminus1)(Cminus1) Moreover define 00 = 0 we

have the well known result

Lemma 61 As n tends to infinity X 2n tends in probability to

τ 2 equivRsumr=1

Csumc=1

(prc minus prqc)2

prqc

Moreover if prqc is bounded away from 0 and 1 as n rarr infin and if R and C are fixedas nrarrinfin then τ 2 = 0 if and only if H0 is true

Remark 61 Lemma 61 shows that the contingency efficacy ζ in (28) is well definedMoreover ζ = 0 if and only if X and Y are independent

Theorem 61 Consider a categorical response Y and a continuous predictor X withdensity function f(x) such that for a fixed h gt 0 f(x) gt 0 for x isin (x(0) minus h x(0) + h)then

(1) For fixed h the estimated local section efficacy ζ(x(0) h) defined in (25) convergesalmost surly to the section efficacy ζ(x(0) h) defined in (24)

(2) If X and Y are independent then the local efficacy ζ(x(0)) defined in (24) is zero

(3) If there exists c isin 1 middot middot middot C such that pc(x) = P (Y = c|x) has a continuousderivative at x = x(0) with pprimec(x

(0)) 6= 0 then ζ(x(0)) gt 0

The χ2 statistic (64) can be written in terms of multinomial proportions prc withradicn(prc minus prc) 1 le r le R 1 le c le C converging in distribution to a multivariate

normal By Taylor expansion of Tn equivradicχ2n at τ = plim

radicχ2n we find that

22

Theorem 62 radicn(Tn minus τ) minusrarrd N(0 v2) (65)

where the asymptotic variance v2 which can be obtained from the Taylor expansion isnot needed in this paper

Remark 62 The result (65) applies to the multivariate χ2-based efficacies ζ with nreplaced by the number of observations that goes in to the computations of ζ In thiscase we use the notation χ2

j for the chi-square statistic for the jth variable

7 Consistency Of CATCH

Let Y denote a variable to be classified into one of the categories 1 C by usingthe Xjrsquos in the vector X = (X1 Xd) Let Xminusj equiv X minusXj denote X without the Xj

component Assume that the squared Pearson nonparametric correlation (Doksum andSamarov [4] Huang and Chen [9]) between Xj and Xminusj is less than 05 We call thevariable Xj relevant for Y if conditionally given Xminusj L(Y |Xj = x) is not constant in xand irrelevant otherwise Our data are (X(1) Y (1)) (X(n) Y (n)) iid as (X Y ) Avariable selection procedure is consistent if the probability that all the relevant variablesare selected and none of the irrelevant variables are selected tends to one as n tends toinfinity

We give a general nonparametric framework where the variable selection rule basedon the CATCH importance scores sj given in (38) is consistent

When the predictors are not all categorical and CATCH involves tubes we assumethat the number of observations in each tube tends toinfin as nrarrinfin and we assume thatfor those importance scores that require the selection of a section size h the selection isfrom a finite set hj with minjhj gt h0 for some h0 gt 0 It follows that the numberof observations mjk in the kth tube used to calculate the importance score sj for thevariable Xj tends to infinity as nrarrinfin

The key quantities that determine consistency are the limits λj of sj That is

λj = τ(ζj P ) = EP [ζj(X)] =

intζj(x)dP (x) j = 1 d

where ζj(x) is the local contingency efficacy defined by (24) for numerical Xrsquos and by(28) for categorical Xrsquos and P is the probability distribution of X Our estimate sjof τj is

sj = τ(ζj P ) =1

M

Msumk=1

ζj(Xlowastk)

where ζj is defined in section 3 and P is the empirical distritution of XLet d0 denote the number of Xrsquos that are irrelevant for Y Set d1 = d minus d0 and

without loss of generality reorder the λj so that

23

λj gt 0 if j = 1 d1λj = 0 if j = d1 + 1 d

(71)

Our variable selection rule is

ldquokeep variable Xj if sj ge t for some threshold trdquo (72)

Rule (72) is consistent if and only if

P (minjled1sj ge t and max

jgtd1sj lt t) minusrarr 1 (73)

To establish (73) it is enough to show that

P (maxjgtd1sj gt t) minusrarr 0 (74)

andP (min

jled1sj le t) minusrarr 0 (75)

We call (74) and (75) type I and type II consistency respectively Type I and IIerror probabilities refer to probabilities of any false positives and any false negativesrespectively

71 Type I consistency

We consider (74) first and note that

P (maxjgtd1sj gt t) le

dsumj=d1+1

P (sj gt t) (76)

For the jth irrelevant variable Xj j gt d1 the results in Section 6 apply to the local

efficacy ζj(x) at a fixed point x The importance score sj defined in (38) is the averageof such efficacies over a sample of M random points xlowasta We assume that there is apoint x(0) such that for irrelevant xjrsquos and for n large enough ζj(x

(0)) is stochastically

larger in the right tail than the average Mminus1sumMa=1 ζj(x

lowasta) That is we assume that

there exist t gt 0 and n0 such that for n ge n0

p(sj gt t) le P (s(0)j gt t) j gt d1 (77)

The existence of such a point x(0) is established by considering the probability limit of

arg maxζj(xlowasta) xlowasta isin xlowast1 middot middot middot xlowastM

Note by Lemma (61) s(0)j rarrp 0 as n rarr infin It follows that we have established

24

(74) for t and d0 = d minus d1 that are fixed as n rarr infin To examine the interesting casewhere d0 rarr infin as n rarr infin we need large deviation results which we turn to next Inthe d0 rarrinfin case we consider t equiv tn that tend to infin as nrarrinfin and we use the weakerassumption there exists x(0) and n0 such that

P (sj gt tn) le P (s(0)j gt tn) (78)

for n ge n0 and each tn with limnrarrinfin tn = infin We also assume that the number ofcolumns C and rows R in the contingency table that define the efficacy χ2

jn given by

(64) stays fixed as nrarrinfin In this case P (s(0)j gt tn)rarr 0 provided each term

[T (j)rc ]2 equiv

[nrcErcradicnradicErc

]2in s2j = χ2

jn defined for Xj by (64) satisfies

P(|T (j)rc | gt tn

)minusrarr 0

This is because

P

radicsumr

sumc

[T(j)rc ]2 gt tn

le P(RC max

rc[T (j)rc ]2 gt t2n

)le RC maxP

(|T (j)rc | gt tn

radicRC)

and tn and tnradicRC are of the same order as n rarr infin By large deviation theory Hall

[8] and Jing et al [10] for some constant A

P (|T (j)rc | gt tn) =

2tminus1nradic2πeminust

2n2[1 + At3nn

minus 12 + o

(t3nnminus 1

2

)](79)

provided tn = o(n

16

) If (79) is uniform in j gt d1 then (76) and (79) implies that

type I consistency holds if

d0

(tnradic

2

)minus1eminus

t2n2 rarr 0 (710)

To examine (710) write d0 and tn as d0 = exp(nb)

and tn =radic

2nr By large deviationtheory (710) holds when r lt 1

6 Thus if 0 lt 1

2b lt r lt 1

6 then

d0

(tnradic

2

)minus1eminus

t2n2 = nminusr exp

(nb minus n2r

)rarr 0

25

Thus d0 = exp(nb) with b = 13ε for any ε gt 0 is the largest d0 can be to have consistency

For instance if there are exactly d0 = exp(n

14

)irrelevant variables and we use the

threshold tn =radic

2n17 the probability of selecting any of the irrelevant variables tends

to zero at the ratenminus

14 exp

(n

14 minus n

27

)72 Type II consistency

We now assume that for each relevant variable Xj there is a least favorable point x(1)

such that the local efficacy s(1)j equiv ζ(x(1)) is stochastically smaller than the importance

score sj = Mminus1sumMa=1 ζ(xlowasta) That is we assume that for each relevant variable Xj there

exists n1 and x(1) such that for n ge n1

P(s(1)j le tn

)ge P (sj le tn) j le d1 (711)

The existence of such a least favorable x(1) is established by considering the probabilitylimit of

arg minζj(xlowasta) xlowasta isin xlowast1 middot middot middot xlowastM

We again assume that R and C are fixed so that to show type II consistency it isenough to show that for each (r c) and j le d1

P(|T (j)rc | le tn

)rarr 0

This is because

P

(radicsumr

sumc

[T

(j)rc

]2le tn

)le P (minrc

[T

(j)rc

]2le t2n)

le RC maxrc P(|T (j)rc | le tn

)By Theorem 62

radicn(s(1)j minus τj

)rarrd N(0 ν2) τj gt 0 It follows that if tn and d1

are bounded above as n rarr infin and τj gt 0 then P(s(1)j le tn

)rarr 0 Thus type II

consistency holds in this case To examine the more interesting case where tn and d1are not bounded above and τj may tend to zero as nrarrinfin we will need to center the

T(j)rc Thus set

θ(j)n =

radicn(p(j)rc minus p(j)r p

(j)c

)radicp(j)r p

(j)c

(712)

26

Tn(j) = T (j)rc minus θ(j)n

where for simplicity the dependence of T(j)n and θ

(j)n on r and c is not indicated

Lemma 71

P

(minjled1|T (j)rc | le tn

)le

d1sumj=1

P

(|T (j)n | ge min

jled1θ(j)n minus tn

)Proof

P(

minjled1 |T(j)rc | le tn

)= P

(minjled1 |T

(j)n + θ

(j)n | le tn

)le P

(minjled1 |T

(j)n | minusmaxjled1 |θ

(j)n | le tn

)= P

(maxjled1 |T

(j)n | ge minjled1 |θ

(j)n | minus tn

)lesumd1

j=1 P(|T (j)n | ge minjled1 |θ

(j)n | minus tn

)

Setan = min

jled1

∣∣θ(j)n ∣∣minus tnthen by large deviation theory if an = o

(n

16

)

P (∣∣T (j)n

∣∣ ge an) prop

(anradic

2

)minus1expminusa2n2 (713)

where prop signifies asymptotic order as nrarrinfin an rarrinfin It follows that if (713) holdsuniformly in j then by Lemma 71

P

(minjled1

∣∣T (j)rc

∣∣ le tn

)prop d1

(anradic

2

)minus1expminusa2n2

Thus type II consistency holds when an = o(n

16

)and

d1

(anradic

2

)minus1expminusa2n2 rarr 0 (714)

In Section 71 tn is of order nr 0 lt r lt 16 For this tn type II consistency holds if an

is of order nr 0 lt r lt 16 and

d1nminusr expminusn2r2 rarr 0

27

If an = O(nk) with k ge 16 then P (

∣∣∣T (j)(n)

∣∣∣ ge an) tends to zero at a rate faster than in

(713) and consistency is ensured For instance if all relevant variables Xj have p(j)rc

fixed as nrarrinfin then minjled1

∣∣∣θ(j)n ∣∣∣ is of orderradicn and an is of order n

12 minus tn

73 Type I and II consistency

We see from Sections 71 and 72 that consistency holds under a variety of conditionsOne set of conditions is

Theorem 71 Under the regularization conditions of Sections 71 and 72 if

d0 = c0 exp(nb) tn = c1nr

minjled1 θ(j) minus tn = c2nr d1 = c3 exp

(nb)

for positive constants c0 c1 c2 and c3 and 0 lt 12b lt r lt 1

6 then CATCH is consistent

AppendixProof of Theorem 61

(1) Let Nh =

sumni=1 I(X(i) isin Nh(x

(0))) Then Nh sim Bi(n

int x(0)+hx(0)minush f(x)dx) and

Nh rarras infin as nrarrinfin

ζ(x(0) h) = nminus12radicX 2(x(0) h)

= (Nh n)12(N

h )minus12radicX 2(x(0) h)

Since Nh sim Bi(n

int x(0)+hx(0)minush f(x)dx) we have

Nh nrarr

int x(0)+h

x(0)minushf(x)dx (715)

By Lemma 61 and Table (1)

(Nh )minus12

radicX 2(x(0) h)rarr

( sumr=+minus

Csumc=1

(prc minus prqc)2

prqc

) 12

(716)

where

p+c = Pr(X gt x0 Y = c|X isin Nh(x(0))) pminusc = Pr(X le x0 Y = c|X isin Nh(x

(0)))

28

for r = +minus and c = 1 middot middot middot C

pr =Csumc=1

prc qc =sumr=+minus

prc

Actually p+ = Pr(X gt x0|X isin Nh(x(0))) pminus = Pr(X le x0|X isin Nh(x

(0)))qc = Pr(Y = c|X isin Nh(x

(0)))

By (715) and (716)

ζ(x(0) h)rarr(int x(0)+h

x(0)minushf(x)dx

) 12( sumr=+minus

Csumc=1

(prc minus prqc)2

prqc

) 12

equiv ζ(x(0) h)

(2) Since X and Y are independent prc = prqc for r = +minus and c = 1 middot middot middot C thenζ(x(0) h) = 0 for any h Hence

ζ(x(0)) = suphgt0ζ(x(0) h) = 0

(3) Recall that pc(x(0)) = Pr(Y = c|x(0)) Without loss of generality assume pprimec(x

(0)) gt0 Then there exists h such that

Pr(Y = c|x(0) lt X lt x(0) + h) gt Pr(Y = c|x(0) minus h lt X lt x(0))

which is equivelant to p+cp+ gt pminuscpminus This shows that ζ(x(0)) gt 0 since byLemma 61 ζ(x(0)) = 0 results in p+cp+ = pminuscpminus = pc

References

[1] Breiman L 2001 Random forests Machine Learning 45 5ndash32

[2] Breiman L Friedman J H Olshen R A Stone J 1984 Classification andregression trees Wadsworth Belmont

[3] Doksum K Tang S Tsui K-W 2008 Nonparametric variable selection Theearth algorithm Journal of the American Statistical Association 103(484) 1609ndash1620

[4] Doksum K A Samarov A 1995 Nonparametric estimation of global functionalsand a measure of explanatory power of covariates in regression Annals of Statistics23 1443ndash1473

29

[5] Doksum K A Schafer C 2006 Powerful choices Tuning parameter selectionbased on power Frontiers in Statistics Dedicated to Peter Bickel Editors J Fanand H L Koul Imperial College Press 113ndash141

[6] Friedman J 1991 Multivariate adaptive regression splines The annals of statis-tics 1ndash67

[7] Gao J Gijbels I 2008 Bandwidth selection in nonparametric kernel testingJournal of the American Statistical Association 103 (484) 1584ndash1594

[8] Hall P 1992 The Bootstrap and Edgeworth Expansions Springer New York

[9] Huang L-S Chen J 2008 Analysis of variance coefficient of determinationand f-test for local polynomial regression Annals of Statistics 36 2085ndash2109

[10] Jing B-Y Shao Q-M Wang Q 2003 Self-normalized cramer-type large devi-ations for independent random variables Annals of Probability 31 2167ndash2215

[11] Lin D Y Zeng D 2011 Correcting for population stratification in genomewideassociation studies Journal of the American Statistical Association 106 997ndash1008

[12] Loh W-Y 2009 Improve the precision of classification trees The Annals ofApplied Statistics 3 1710ndash1737

[13] Price A Patterson N Plenge R Weinblatt M Shadick N Reich D 2006Principal components analysis corrects for stratification in genome-wide associationstudies Nature genetics 38 (8) 904ndash909

[14] Qiao Z Zhou L Huang J 2008 Linear discriminant analysis for high dimen-sional Low Sample Size Data special issue of World Congress on Engineering2008

[15] Quinlan J R 1987 Simplifying decision trees International Journal of Man-Machine Studies 27(3) 221ndash234

[16] Schafer C Doksum K A 2009 Selecting local models in multiple regression bymaximizing power Metrika 69 283ndash304

[17] Tibshirani R 1996 Regression shrinkage and selection via the lasso J RoyalStatist Soc B 58 267ndash288

[18] Wang L Shen X 2007 On l1-norm multi-class support vector machinesmethodology and theory Journal of the American Statistical Association 102 583ndash594

[19] Zhang H H Liu Y Wu Y Zhu J 2008 Variable selection for the multicate-gory svm via adaptive sup-norm regularization Electronic Journal of Statistics 2149ndash167

30

  • Introduction
  • Importance Scores for Univariate Classification
    • Numerical Predictor Local Contingency Efficacy
    • Categorical Predictor Contingency Efficacy
      • Variable Selection for Classification Problems The CATCH Algorithm
        • Numerical Predictors
        • Numerical and Categorical Predictors
        • The Empirical Importance Scores
        • Adaptive Tube Distances
        • The CATCH Algorithm
          • Simulation Studies
            • Variable Selection with Categorical and Numerical Predictors
            • Improving The Performance of SVM and RF Classifiers
              • CATCH Analysis of The ANN Data
              • Properties of Local Contingency Efficacy
              • Consistency Of CATCH
                • Type I consistency
                • Type II consistency
                • Type I and II consistency
                  • Proof of Theorem 61
Page 19: Nonparametric Variable Selection and Classi cation: The ...lc436/Tang_Chen_etal_2012_CSDA.pdf · pruning, will choose a subset of optimal splitting variables to be the most signi

Tab

le4

Sim

ula

tion

resu

lts

from

Exam

ple

42

wit

hsa

mp

lesi

zen

=400

For

1leile

6

theit

hC

olu

mn

giv

esth

ees

tim

ate

dp

rob

ab

ilit

yof

sele

ctio

nof

vari

ableX

ian

dth

est

and

ard

erro

rb

ased

on

100

sim

ula

tion

sw

her

eX

ii

=1middotmiddotmiddot

5are

rele

vant

vari

ab

les

wit

hcor(X

1X

4)

=5

X

6is

anir

rele

vant

vari

able

that

isco

rrel

ated

wit

hX

3w

ith

corr

elati

on

9

Colu

mn

7giv

esth

em

ean

of

the

esti

mate

dpro

bab

ilit

yof

sele

ctio

nan

dth

atof

the

stan

dar

der

ror

for

the

irre

leva

nt

vari

ab

lesX

7middotmiddotmiddotX

d

Num

ber

ofSel

ecti

ons(

100

sim

ula

tion

s)

ME

TH

OD

dX

1X

2X

3X

4X

5X

6Xiige

7

CA

TC

H10

076

(00

4)0

77(0

04)

073

(00

4)0

68(0

05)

099

(00

1)0

34(0

05)

007

(00

2)

500

78(0

04)

077

(00

4)0

74(0

04)

063

(00

5)0

92(0

03)

057

(00

5)0

09(0

03)

100

076

(00

4)0

77(0

04)

08

(00

4)0

74(0

04)

082

(00

4)0

55(0

05)

009

(00

3)

Mar

ginal

CA

TC

H10

035

(00

5)0

63(0

05)

08

(00

4)0

32(0

05)

009

(00

3)0

71(0

05)

012

(00

3)

500

34(0

05)

071

(00

5)0

7(0

05)

03

(00

5)0

1(0

03)

06

(00

5)0

11(0

03)

100

034

(00

5)0

68(0

05)

075

(00

4)0

3(0

05)

01

(00

3)0

59(0

05)

01

(00

3)

Ran

dom

For

est

100

29(0

05)

048

(00

5)0

55(0

05)

02

(00

4)0

3(0

05)

047

(00

5)0

05(0

02)

500

26(0

04)

052

(00

5)0

57(0

05)

03

(00

5)0

24(0

04)

045

(00

5)0

08(0

03)

100

036

(00

5)0

61(0

05)

066

(00

5)0

37(0

05)

032

(00

5)0

46(0

05)

012

(00

3)

GU

IDE

100

96(0

02)

096

(00

2)0

94(0

02)

095

(00

2)0

96(0

02)

084

(00

4)0

3(0

05)

500

8(0

04)

092

(00

3)0

88(0

03)

079

(00

4)0

72(0

04)

068

(00

5)0

12(0

03)

100

081

(00

4)0

89(0

03)

09

(00

3)0

76(0

04)

075

(00

4)0

58(0

05)

008

(00

3)

17

variables for classification the performance of other algorithms such as Support VectorMachine (SVM) and Random Forest Classifiers (RFC) can be significantly improvedWe compare the prediction performance of (1) SVM (2) RFC (3) CATCH followedby SVM and (4) CATCH followed by RFC in this section We label (3) and (4)with CATCH+SVM and CATCH+RFC respectively In this section we use the ldquosvmrdquofunction in the ldquoe1071rdquo package in R with radial kernel to perform SVM analysisand use the ldquorandomForestrdquo function in the ldquorandomForestrdquo package to perform RFCanalysis We use simulation experiments to illustrate the improvement obtained byusing CATCH prior to classifying Y

Let the true model be

(X Y ) sim P (42)

where P will be specified later In each of 100 Monte Carlo experiments n pairs of(X(i) Y (i)) i = 1 middot middot middot n are generated from the true model (42) Based on the simu-lated data (X(i) Y (i)) i = 1 middot middot middot n the methods (1) SVM (2) RFC (3) CATCH+SVM(4) CATCH+RFC are used to classify a future Y (0) for which we have available a cor-responding X(0)

The integrated-misclassification rate (IMR Friedman [6]) defined below is used toevaluate the performance of the classification algorithms Let FXY(middot) be the classifierconstructed by a classification algorithm based on X = (X(1) middot middot middot X(n)) and Y =(Y (1) middot middot middot Y (n))T

Let (X(0) Y (0)) be a future observation with the same distribution as (X(i) Y (i))We only know X(0) = x(0) and want to classify Y (0) For a possible future observation(x(0) y(0)) define

MR(XYx(0) y(0)) = P (FXY(x(0)) 6= y(0)|XY) (43)

Our criterion is

IMR = EE0[MR(XYx(0) y(0))] (44)

where E0 is with respect to the distribution of (x(0) y(0)) random and E is with respectto the distribution of (XY)

This double expectation is evaluated using Monte Carlo simulation The result isdenoted as IMRlowast More precisely E0 is estimated using 5000 simulated (x(0) y(0)) obser-vations then IMRlowast is computed on the basis of 100 simulated (XiYi) i = 1 middot middot middot 100samples

Example 43 First consider model (41) in section 41 Figure 3 shows IMRlowast versusthe number of irrelevant variables for this example The left panel shows that theIMRlowastrsquos of the SVM approach increases dramatically as the number of irrelevant variablesincreases while the IMRlowastrsquos of CATCH+SVM are stable Thus CATCH stabilizes the

18

performance of SVM for this complex model The right panel shows that CATCH alsostabilizes the performance of RFC These results indicate that CATCH can improve theprediction accuracy of other classification algorithms

20 40 60 80 100

025

030

035

040

045

050

055

060

SVM

d

Err

or R

ate

SVMCATCH+SVM

20 40 60 80 100

025

030

035

040

045

050

055

060

Random Forest

d

RFCCATCH+RFC

Figure 3 IMRlowastrsquos of SVM RFC CATCH+SVM CATCH+RFC versus the number of predictors forthe simulation study in model (41) in example 41 The sample size is n = 400

Example 44 Contaminated Logistic Model Consider a binary response Y isin 0 1with

P (Y = 0|X = x) =

F (x) if r = 0G(x) if r = 1

(45)

where x = (x1 xd)

r sim Bernoulli(γ) (46)

F (x) = (1 + expminus(025 + 05x1 + x2))minus1 (47)

G(x) =

01 if 025 + 05x1 + x2 lt 009 if 025 + 05x1 + x2 ge 0

(48)

and x1 x2 xd are realizations of independent standard normal random variablesHence given X(i) = x(i) 1 le i le n the joint distribution of Y (1) Y (2) Y (n) is thecontaminated logistic model with contamination parameter γ

19

(1minus γ)nprodi=1

[F (x(i))]Y(i)

[1minus F (x(i))]1minusY(i)

+ γ

nprodi=1

[G(x(i))]Y(i)

[1minusG(x(i))]1minusY(i)

(49)

We perform simulation study for n = 200 d = 50 and γ = 0 02 04 06 08 and1 where γ = 0 corresponds to no contamination Figure 4 shows the IMRlowastrsquos versus thedifferent values of γ for this example CATCH improves the performance of the SVMand Random Forest classifiers for this contaminated logistic regression model

01

02

03

04

SVM

γ

Err

or R

ate

0 02 04 06 08 1

SVMCATCH+SVM

01

02

03

04

Random Forest

γ

0 02 04 06 08 1

RFCCATCH+RFC

Figure 4 IMRlowastrsquos of SVM RFC CATCH+SVM CATCH+RFC versus different contamination pa-rameters γ for simulation model (49) in example 44

5 CATCH Analysis of The ANN Data

In this section CATCH is applied to a data set available at the UCI data base(httpftpicsuciedupubmachine-learning-databasesthyroid-disease) The data is from a clinical trial to determine whether a patient is hypothyroid (seeeg Quinlan [15]) The dataset is split into two parts a training set with 3772 ob-servations and a testing set with 3428 observations The response is a categoricalvariable with three classes normal (not hypothyroid) hyperfunction and subnormalfunctioning There are 21 predictors including 15 categorical predictors and 6 numer-ical predictors The three classes are very imbalanced with 925 of normal patientsA good classifier should produce a significant decrease from an error rate 75 whichcan be obtained by classifying all observations to the normal class

20

CATCH identifies 7 variables as important including 4 continuous variables and3 categorical variables We consider three classification methods 1 Support VectorMachine with radial kernel denoted by SVM(r) 2 Support Vector Machine withpolynomial kernel denoted by SVM(p) 3 RFC We apply these three methods onthe training set to build classification models and use the test set to estimate themisclassification rate of the methods We apply the methods with CATCH and withoutCATCH respectively where ldquowith CATCHrdquo means that only the 7 important variablesidentified by CATCH are used to build the model and ldquowithout CATCHrdquo means thatall the 21 variables are used in the analysis

Table 5: Misclassification rates of three classification methods (SVM(r), SVM(p), RFC) with CATCH and without CATCH.

              SVM(r)   SVM(p)   RFC
w/o CATCH     0.052    0.063    0.024
with CATCH    0.037    0.055    0.015

Table 5 shows that if CATCH is used to select important variables before the classification methods are applied, the misclassification rates decrease. CATCH reduces the misclassification rates by 29%, 13%, and 37% for SVM(r), SVM(p), and RFC, respectively. CATCH + RFC has the best performance. Although CATCH is not designed for improving classification accuracy, it nonetheless can help other classification methods to perform better.
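A minimal R sketch of the "with CATCH" versus "without CATCH" comparison is given below. It assumes preprocessed data frames train and test whose response has been converted to a factor column y, and a character vector catch_vars holding the 7 variable names selected by CATCH; the selection step itself is not shown, and these names are our own placeholders.

```r
library(e1071)          # svm()
library(randomForest)   # randomForest()

## Misclassification rate of a fitted classifier on the test set.
err <- function(fit, newdata) mean(predict(fit, newdata) != newdata$y)

## Without CATCH: all 21 predictors.
svm_all <- svm(y ~ ., data = train, kernel = "radial")
rf_all  <- randomForest(y ~ ., data = train)

## With CATCH: only the variables kept by the selection rule.
form_catch <- reformulate(catch_vars, response = "y")
svm_catch  <- svm(form_catch, data = train, kernel = "radial")
rf_catch   <- randomForest(form_catch, data = train)

c(svm_all = err(svm_all, test),  svm_catch = err(svm_catch, test),
  rf_all  = err(rf_all, test),   rf_catch  = err(rf_catch, test))
```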

6 Properties of Local Contingency Efficacy

In this section we investigate the properties of local contingency efficacy. For a univariate continuous predictor, Theorem 6.1 below shows that local contingency efficacy is well-defined and equals 0 if the response variable is independent of the predictor. Consider an $R \times C$ contingency table. Let $n_{rc}$, $r = 1, \ldots, R$, $c = 1, \ldots, C$, be the observed frequency in the $(r, c)$-cell, and let $p_{rc}$ be the probability that an observation belongs to the $(r, c)$-cell. Let

$$p_r = \sum_{c=1}^{C} p_{rc}, \qquad q_c = \sum_{r=1}^{R} p_{rc}, \qquad (6.1)$$

and

$$n_{r\cdot} = \sum_{c=1}^{C} n_{rc}, \qquad n_{\cdot c} = \sum_{r=1}^{R} n_{rc}, \qquad n = \sum_{r=1}^{R}\sum_{c=1}^{C} n_{rc}, \qquad E_{rc} = n_{r\cdot}\, n_{\cdot c}/n. \qquad (6.2)$$

Consider testing that the rows and the columns are independent, i.e.,

$$H_0: \forall\, r, c,\; p_{rc} = p_r q_c \quad \text{versus} \quad H_1: \exists\, r, c,\; p_{rc} \ne p_r q_c; \qquad (6.3)$$

a standard test statistic is

$$\mathcal{X}^2 = \sum_{r=1}^{R}\sum_{c=1}^{C} \frac{(n_{rc} - E_{rc})^2}{E_{rc}}. \qquad (6.4)$$

Under $H_0$, $\mathcal{X}^2$ converges in distribution to $\chi^2_{(R-1)(C-1)}$. Moreover, defining $0/0 = 0$, we have the following well-known result.

Lemma 6.1. As $n$ tends to infinity, $\mathcal{X}^2/n$ tends in probability to

$$\tau^2 \equiv \sum_{r=1}^{R}\sum_{c=1}^{C} \frac{(p_{rc} - p_r q_c)^2}{p_r q_c}.$$

Moreover, if $p_r q_c$ is bounded away from 0 and 1 as $n \to \infty$, and if $R$ and $C$ are fixed as $n \to \infty$, then $\tau^2 = 0$ if and only if $H_0$ is true.

Remark 6.1. Lemma 6.1 shows that the contingency efficacy $\zeta$ in (2.8) is well defined. Moreover, $\zeta = 0$ if and only if $X$ and $Y$ are independent.
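The quantities in (6.1)-(6.4) and the plug-in estimate $\sqrt{\mathcal{X}^2/n}$, which by Lemma 6.1 is consistent for $\tau$, are straightforward to compute from a contingency table. The following R helper (our own illustration, not part of the CATCH software) does exactly that.

```r
## Chi-square statistic (6.4) and estimated contingency efficacy sqrt(X^2 / n)
## for a contingency table of a categorical predictor versus the class Y.
contingency_efficacy <- function(tab) {
  n  <- sum(tab)
  E  <- outer(rowSums(tab), colSums(tab)) / n   # E_rc = n_r. * n_.c / n, as in (6.2)
  X2 <- sum((tab - E)^2 / E)                    # statistic (6.4)
  sqrt(X2 / n)                                  # consistent for tau of Lemma 6.1
}

## Example: a categorical predictor with 3 levels and a 2-class response.
tab <- matrix(c(30, 10, 20, 20, 10, 30), nrow = 3, byrow = TRUE)
contingency_efficacy(tab)
```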

Theorem 6.1. Consider a categorical response $Y$ and a continuous predictor $X$ with density function $f(x)$ such that, for a fixed $h > 0$, $f(x) > 0$ for $x \in (x^{(0)} - h, x^{(0)} + h)$. Then:

(1) For fixed $h$, the estimated local section efficacy $\hat\zeta(x^{(0)}, h)$ defined in (2.5) converges almost surely to the section efficacy $\zeta(x^{(0)}, h)$ defined in (2.4).

(2) If $X$ and $Y$ are independent, then the local efficacy $\zeta(x^{(0)})$ defined in (2.4) is zero.

(3) If there exists $c \in \{1, \ldots, C\}$ such that $p_c(x) = P(Y = c \mid x)$ has a continuous derivative at $x = x^{(0)}$ with $p_c'(x^{(0)}) \ne 0$, then $\zeta(x^{(0)}) > 0$.
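For a numerical predictor, the estimated local efficacy can be sketched in R by restricting to the window $(x^{(0)} - h, x^{(0)} + h)$ and cross-tabulating the side of $x^{(0)}$ on which $X$ falls against the class $Y$, which is the $2 \times C$ table used in the proof of Theorem 6.1 in the Appendix. This is an illustrative sketch of that construction under a fixed window, not the full CATCH implementation; the precise definitions are (2.4)-(2.5).

```r
## Estimated local contingency efficacy at x0 with window half-width h:
## zeta_hat(x0, h) = sqrt(X^2(x0, h) / n), with X^2 computed from the 2 x C table
## (X <= x0 vs. X > x0) by class, using only observations inside the window.
local_efficacy <- function(x, y, x0, h) {
  n      <- length(x)
  in_win <- abs(x - x0) < h
  side   <- factor(x[in_win] > x0, levels = c(FALSE, TRUE))   # rows: below / above x0
  tab    <- table(side, factor(y[in_win]))                    # 2 x C contingency table
  E      <- outer(rowSums(tab), colSums(tab)) / sum(tab)
  X2     <- sum(ifelse(E > 0, (tab - E)^2 / E, 0))            # 0/0 treated as 0
  sqrt(X2 / n)
}

## Example with a predictor whose effect is concentrated near x0 = 0.5.
set.seed(2)
x <- rnorm(400)
y <- rbinom(400, 1, ifelse(x > 0.5, 0.8, 0.4))
local_efficacy(x, y, x0 = 0.5, h = 0.5)
```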

The $\chi^2$ statistic (6.4) can be written in terms of multinomial proportions $\hat p_{rc}$, with $\sqrt{n}(\hat p_{rc} - p_{rc})$, $1 \le r \le R$, $1 \le c \le C$, converging in distribution to a multivariate normal. By Taylor expansion of $T_n \equiv \sqrt{\mathcal{X}^2/n}$ at $\tau = \operatorname{plim}\sqrt{\mathcal{X}^2/n}$, we find that


Theorem 6.2.
$$\sqrt{n}(T_n - \tau) \longrightarrow_d N(0, v^2), \qquad (6.5)$$
where the asymptotic variance $v^2$, which can be obtained from the Taylor expansion, is not needed in this paper.

Remark 6.2. The result (6.5) applies to the multivariate $\chi^2$-based efficacies $\zeta$ with $n$ replaced by the number of observations that go into the computation of $\zeta$. In this case we use the notation $\chi^2_j$ for the chi-square statistic for the $j$th variable.

7 Consistency Of CATCH

Let $Y$ denote a variable to be classified into one of the categories $1, \ldots, C$ by using the $X_j$'s in the vector $X = (X_1, \ldots, X_d)$. Let $X_{-j} \equiv X - X_j$ denote $X$ without the $X_j$ component. Assume that the squared Pearson nonparametric correlation (Doksum and Samarov [4]; Huang and Chen [9]) between $X_j$ and $X_{-j}$ is less than 0.5. We call the variable $X_j$ relevant for $Y$ if, conditionally given $X_{-j}$, $\mathcal{L}(Y \mid X_j = x)$ is not constant in $x$, and irrelevant otherwise. Our data are $(X^{(1)}, Y^{(1)}), \ldots, (X^{(n)}, Y^{(n)})$, i.i.d. as $(X, Y)$. A variable selection procedure is consistent if the probability that all the relevant variables are selected and none of the irrelevant variables are selected tends to one as $n$ tends to infinity.

We give a general nonparametric framework where the variable selection rule based on the CATCH importance scores $s_j$ given in (3.8) is consistent.

When the predictors are not all categorical and CATCH involves tubes, we assume that the number of observations in each tube tends to $\infty$ as $n \to \infty$, and we assume that for those importance scores that require the selection of a section size $h$, the selection is from a finite set $\{h_j\}$ with $\min_j\{h_j\} > h_0$ for some $h_0 > 0$. It follows that the number of observations $m_{jk}$ in the $k$th tube used to calculate the importance score $s_j$ for the variable $X_j$ tends to infinity as $n \to \infty$.

The key quantities that determine consistency are the limits $\lambda_j$ of $s_j$. That is,

$$\lambda_j = \tau(\zeta_j, P) = E_P[\zeta_j(X)] = \int \zeta_j(x)\, dP(x), \qquad j = 1, \ldots, d,$$

where $\zeta_j(x)$ is the local contingency efficacy defined by (2.4) for numerical $X$'s and by (2.8) for categorical $X$'s, and $P$ is the probability distribution of $X$. Our estimate $s_j$ of $\lambda_j$ is

$$s_j = \tau(\hat\zeta_j, \hat P) = \frac{1}{M}\sum_{k=1}^{M} \hat\zeta_j(X^*_k),$$

where $\hat\zeta_j$ is defined in Section 3 and $\hat P$ is the empirical distribution of $X$.

Let $d_0$ denote the number of $X$'s that are irrelevant for $Y$. Set $d_1 = d - d_0$ and, without loss of generality, reorder the $\lambda_j$ so that

$$\lambda_j > 0 \ \text{ if } j = 1, \ldots, d_1; \qquad \lambda_j = 0 \ \text{ if } j = d_1 + 1, \ldots, d. \qquad (7.1)$$

Our variable selection rule is

"keep variable $X_j$ if $s_j \ge t$" for some threshold $t$. \qquad (7.2)

Rule (7.2) is consistent if and only if

$$P\left(\min_{j \le d_1} s_j \ge t \ \text{ and } \ \max_{j > d_1} s_j < t\right) \longrightarrow 1. \qquad (7.3)$$

To establish (7.3) it is enough to show that

$$P\left(\max_{j > d_1} s_j > t\right) \longrightarrow 0 \qquad (7.4)$$

and

$$P\left(\min_{j \le d_1} s_j \le t\right) \longrightarrow 0. \qquad (7.5)$$

We call (7.4) and (7.5) type I and type II consistency, respectively. Type I and II error probabilities refer to the probabilities of any false positives and any false negatives, respectively.
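To connect the score $s_j$ with rule (7.2) concretely, the R sketch below computes, for each numerical predictor, the average of local efficacies over $M$ randomly chosen evaluation points and then applies the threshold rule. It is a simplified univariate stand-in for the tube-based scores of Section 3 (no tubes, fixed window width $h$), and the threshold used at the end is an ad hoc choice for illustration only.

```r
## Simplified importance scores: average local efficacy over M random points,
## then keep the variables whose score is at least the threshold t (rule (7.2)).
## This ignores the tube construction of Section 3 and is only an illustration.
importance_scores <- function(X, y, M = 50, h = 0.5) {
  apply(X, 2, function(xj) {
    pts <- sample(xj, M, replace = TRUE)          # random evaluation points X*_1,...,X*_M
    mean(sapply(pts, function(x0) {
      w   <- abs(xj - x0) < h                     # observations in the local window
      tab <- table(factor(xj[w] > x0, levels = c(FALSE, TRUE)), factor(y[w]))
      E   <- outer(rowSums(tab), colSums(tab)) / sum(tab)
      sqrt(sum(ifelse(E > 0, (tab - E)^2 / E, 0)) / length(xj))
    }))
  })
}

set.seed(3)
X <- matrix(rnorm(400 * 10), ncol = 10)           # only X1 and X2 are relevant below
colnames(X) <- paste0("X", 1:10)
y <- rbinom(400, 1, plogis(0.25 + 0.5 * X[, 1] + X[, 2]))
s <- importance_scores(X, y)
t <- 2 * median(s)                                # an ad hoc threshold for illustration
names(which(s >= t))                              # variables kept by rule (7.2)
```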

7.1 Type I consistency

We consider (7.4) first and note that

$$P\left(\max_{j > d_1} s_j > t\right) \le \sum_{j = d_1 + 1}^{d} P(s_j > t). \qquad (7.6)$$

For the $j$th irrelevant variable $X_j$, $j > d_1$, the results in Section 6 apply to the local efficacy $\hat\zeta_j(x)$ at a fixed point $x$. The importance score $s_j$ defined in (3.8) is the average of such efficacies over a sample of $M$ random points $x^*_a$. We assume that there is a point $x^{(0)}$ such that, for irrelevant $X_j$'s and for $n$ large enough, $s_j^{(0)} \equiv \hat\zeta_j(x^{(0)})$ is stochastically larger in the right tail than the average $M^{-1}\sum_{a=1}^{M}\hat\zeta_j(x^*_a)$. That is, we assume that there exist $t > 0$ and $n_0$ such that for $n \ge n_0$,

$$P(s_j > t) \le P(s_j^{(0)} > t), \qquad j > d_1. \qquad (7.7)$$

The existence of such a point $x^{(0)}$ is established by considering the probability limit of

$$\arg\max\{\hat\zeta_j(x^*_a) : x^*_a \in \{x^*_1, \ldots, x^*_M\}\}.$$

Note that by Lemma 6.1, $s_j^{(0)} \to_p 0$ as $n \to \infty$. It follows that we have established (7.4) for $t$ and $d_0 = d - d_1$ that are fixed as $n \to \infty$. To examine the interesting case where $d_0 \to \infty$ as $n \to \infty$, we need large deviation results, which we turn to next. In the $d_0 \to \infty$ case we consider $t \equiv t_n$ that tends to $\infty$ as $n \to \infty$, and we use the weaker assumption: there exist $x^{(0)}$ and $n_0$ such that

$$P(s_j > t_n) \le P(s_j^{(0)} > t_n) \qquad (7.8)$$

for $n \ge n_0$ and each $t_n$ with $\lim_{n\to\infty} t_n = \infty$. We also assume that the number of columns $C$ and rows $R$ in the contingency table that defines the efficacy $\chi^2_{jn}$ given by (6.4) stays fixed as $n \to \infty$. In this case $P(s_j^{(0)} > t_n) \to 0$ provided each term

$$[T^{(j)}_{rc}]^2 \equiv \left[\frac{n_{rc} - E_{rc}}{\sqrt{n}\sqrt{E_{rc}}}\right]^2$$

in $s_j^{(0)2} = \chi^2_{jn}/n$, defined for $X_j$ by (6.4), satisfies

$$P\left(|T^{(j)}_{rc}| > t_n\right) \longrightarrow 0.$$

This is because

$$P\left(\sqrt{\sum_r\sum_c [T^{(j)}_{rc}]^2} > t_n\right) \le P\left(RC\,\max_{r,c}[T^{(j)}_{rc}]^2 > t_n^2\right) \le RC\,\max_{r,c} P\left(|T^{(j)}_{rc}| > t_n/\sqrt{RC}\right),$$

and $t_n$ and $t_n/\sqrt{RC}$ are of the same order as $n \to \infty$. By large deviation theory (Hall [8]; Jing et al. [10]), for some constant $A$,

$$P\left(|T^{(j)}_{rc}| > t_n\right) = \frac{2\, t_n^{-1}}{\sqrt{2\pi}}\, e^{-t_n^2/2}\left[1 + A\, t_n^3 n^{-1/2} + o\left(t_n^3 n^{-1/2}\right)\right], \qquad (7.9)$$

provided $t_n = o(n^{1/6})$. If (7.9) is uniform in $j > d_1$, then (7.6) and (7.9) imply that type I consistency holds if

$$d_0\left(t_n\sqrt{2}\right)^{-1} e^{-t_n^2/2} \longrightarrow 0. \qquad (7.10)$$

To examine (7.10), write $d_0$ and $t_n$ as $d_0 = \exp(n^b)$ and $t_n = \sqrt{2}\, n^r$. By large deviation theory, (7.10) holds when $r < 1/6$. Thus if $0 < \tfrac{1}{2}b < r < 1/6$, then

$$d_0\left(t_n\sqrt{2}\right)^{-1} e^{-t_n^2/2} = n^{-r}\exp\left(n^b - n^{2r}\right) \longrightarrow 0.$$


Thus $d_0 = \exp(n^b)$ with $b = 1/3 - \varepsilon$ for any $\varepsilon > 0$ is the largest $d_0$ can be to have consistency. For instance, if there are exactly $d_0 = \exp(n^{1/4})$ irrelevant variables and we use the threshold $t_n = \sqrt{2}\, n^{1/7}$, the probability of selecting any of the irrelevant variables tends to zero at the rate $n^{-1/4}\exp\left(n^{1/4} - n^{2/7}\right)$.

7.2 Type II consistency

We now assume that for each relevant variable $X_j$ there is a least favorable point $x^{(1)}$ such that the local efficacy $s_j^{(1)} \equiv \hat\zeta_j(x^{(1)})$ is stochastically smaller than the importance score $s_j = M^{-1}\sum_{a=1}^{M}\hat\zeta_j(x^*_a)$. That is, we assume that for each relevant variable $X_j$ there exist $n_1$ and $x^{(1)}$ such that for $n \ge n_1$,

$$P\left(s_j^{(1)} \le t_n\right) \ge P(s_j \le t_n), \qquad j \le d_1. \qquad (7.11)$$

The existence of such a least favorable $x^{(1)}$ is established by considering the probability limit of

$$\arg\min\{\hat\zeta_j(x^*_a) : x^*_a \in \{x^*_1, \ldots, x^*_M\}\}.$$

We again assume that $R$ and $C$ are fixed, so that to show type II consistency it is enough to show that for each $(r, c)$ and $j \le d_1$,

$$P\left(|T^{(j)}_{rc}| \le t_n\right) \longrightarrow 0.$$

This is because

$$P\left(\sqrt{\sum_r\sum_c [T^{(j)}_{rc}]^2} \le t_n\right) \le P\left(\min_{r,c}[T^{(j)}_{rc}]^2 \le t_n^2\right) \le RC\,\max_{r,c} P\left(|T^{(j)}_{rc}| \le t_n\right).$$

By Theorem 6.2, $\sqrt{n}\left(s_j^{(1)} - \tau_j\right) \to_d N(0, \nu^2)$, $\tau_j > 0$. It follows that if $t_n$ and $d_1$ are bounded above as $n \to \infty$ and $\tau_j > 0$, then $P\left(s_j^{(1)} \le t_n\right) \to 0$. Thus type II consistency holds in this case. To examine the more interesting case where $t_n$ and $d_1$ are not bounded above and $\tau_j$ may tend to zero as $n \to \infty$, we will need to center the $T^{(j)}_{rc}$. Thus set

$$\theta_n^{(j)} = \frac{\sqrt{n}\left(p^{(j)}_{rc} - p^{(j)}_r p^{(j)}_c\right)}{\sqrt{p^{(j)}_r p^{(j)}_c}}, \qquad (7.12)$$

$$T_n^{(j)} = T^{(j)}_{rc} - \theta_n^{(j)},$$

where for simplicity the dependence of $T_n^{(j)}$ and $\theta_n^{(j)}$ on $r$ and $c$ is not indicated.

Lemma 7.1.

$$P\left(\min_{j \le d_1}|T^{(j)}_{rc}| \le t_n\right) \le \sum_{j=1}^{d_1} P\left(|T_n^{(j)}| \ge \min_{j \le d_1}|\theta_n^{(j)}| - t_n\right).$$

Proof.

$$P\left(\min_{j \le d_1}|T^{(j)}_{rc}| \le t_n\right) = P\left(\min_{j \le d_1}|T_n^{(j)} + \theta_n^{(j)}| \le t_n\right) \le P\left(\min_{j \le d_1}|\theta_n^{(j)}| - \max_{j \le d_1}|T_n^{(j)}| \le t_n\right)$$
$$= P\left(\max_{j \le d_1}|T_n^{(j)}| \ge \min_{j \le d_1}|\theta_n^{(j)}| - t_n\right) \le \sum_{j=1}^{d_1} P\left(|T_n^{(j)}| \ge \min_{j \le d_1}|\theta_n^{(j)}| - t_n\right).$$

Set

$$a_n = \min_{j \le d_1}\left|\theta_n^{(j)}\right| - t_n;$$

then by large deviation theory, if $a_n = o\left(n^{1/6}\right)$,

$$P\left(\left|T_n^{(j)}\right| \ge a_n\right) \propto \left(a_n\sqrt{2}\right)^{-1}\exp\{-a_n^2/2\}, \qquad (7.13)$$

where $\propto$ signifies asymptotic order as $n \to \infty$, $a_n \to \infty$. It follows that if (7.13) holds uniformly in $j$, then by Lemma 7.1,

$$P\left(\min_{j \le d_1}\left|T^{(j)}_{rc}\right| \le t_n\right) \propto d_1\left(a_n\sqrt{2}\right)^{-1}\exp\{-a_n^2/2\}.$$

Thus type II consistency holds when $a_n = o\left(n^{1/6}\right)$ and

$$d_1\left(a_n\sqrt{2}\right)^{-1}\exp\{-a_n^2/2\} \longrightarrow 0. \qquad (7.14)$$

In Section 7.1, $t_n$ is of order $n^r$, $0 < r < 1/6$. For this $t_n$, type II consistency holds if $a_n$ is of order $n^r$, $0 < r < 1/6$, and

$$d_1\, n^{-r}\exp\{-n^{2r}/2\} \longrightarrow 0.$$


If $a_n = O(n^k)$ with $k \ge 1/6$, then $P\left(\left|T_n^{(j)}\right| \ge a_n\right)$ tends to zero at a rate faster than in (7.13), and consistency is ensured. For instance, if all relevant variables $X_j$ have $p^{(j)}_{rc}$ fixed as $n \to \infty$, then $\min_{j \le d_1}\left|\theta_n^{(j)}\right|$ is of order $\sqrt{n}$ and $a_n$ is of order $n^{1/2} - t_n$.

7.3 Type I and II consistency

We see from Sections 7.1 and 7.2 that consistency holds under a variety of conditions. One set of conditions is:

Theorem 7.1. Under the regularity conditions of Sections 7.1 and 7.2, if

$$d_0 = c_0\exp(n^b), \qquad t_n = c_1 n^r, \qquad \min_{j \le d_1}\left|\theta_n^{(j)}\right| - t_n = c_2 n^r, \qquad d_1 = c_3\exp(n^b)$$

for positive constants $c_0$, $c_1$, $c_2$, and $c_3$, and $0 < \tfrac{1}{2}b < r < \tfrac{1}{6}$, then CATCH is consistent.
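As a quick numerical sanity check of the rates in Theorem 7.1, the type I bound $d_0\,(t_n\sqrt{2})^{-1} e^{-t_n^2/2}$ from (7.10) can be evaluated for the choices $d_0 = \exp(n^{1/4})$ and $t_n = \sqrt{2}\,n^{1/7}$ used as the example in Section 7.1. The R lines below are our own illustration, not part of the paper's results.

```r
## Type I bound d0 * (tn * sqrt(2))^{-1} * exp(-tn^2 / 2) from (7.10),
## with d0 = exp(n^{1/4}) and tn = sqrt(2) * n^{1/7}; computed on the log scale
## to avoid overflow for large n.
n  <- c(1e3, 1e4, 1e5, 1e6, 1e7)
tn <- sqrt(2) * n^(1/7)
log_bound <- n^(1/4) - log(tn * sqrt(2)) - tn^2 / 2   # log of the bound
exp(log_bound)
```

For these values of $n$ the bound is already well below 1 and keeps shrinking, in line with the requirement $0 < \tfrac{1}{2}b < r < \tfrac{1}{6}$ (here $b = 1/4$, $r = 1/7$).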

Appendix: Proof of Theorem 6.1

(1) Let $N_h = \sum_{i=1}^{n} I\left(X^{(i)} \in N_h(x^{(0)})\right)$. Then $N_h \sim \text{Bi}\left(n, \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx\right)$ and $N_h \to_{a.s.} \infty$ as $n \to \infty$. Write

$$\hat\zeta(x^{(0)}, h) = n^{-1/2}\sqrt{\mathcal{X}^2(x^{(0)}, h)} = (N_h/n)^{1/2}\,(N_h)^{-1/2}\sqrt{\mathcal{X}^2(x^{(0)}, h)}.$$

Since $N_h \sim \text{Bi}\left(n, \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx\right)$, we have

$$N_h/n \longrightarrow \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx. \qquad (7.15)$$

By Lemma 6.1 and Table 1,

$$(N_h)^{-1/2}\sqrt{\mathcal{X}^2(x^{(0)}, h)} \longrightarrow \left(\sum_{r=+,-}\sum_{c=1}^{C}\frac{(p_{rc} - p_r q_c)^2}{p_r q_c}\right)^{1/2}, \qquad (7.16)$$

where

$$p_{+c} = \Pr\left(X > x^{(0)}, Y = c \mid X \in N_h(x^{(0)})\right), \qquad p_{-c} = \Pr\left(X \le x^{(0)}, Y = c \mid X \in N_h(x^{(0)})\right)$$

for $r = +, -$ and $c = 1, \ldots, C$,

$$p_r = \sum_{c=1}^{C} p_{rc}, \qquad q_c = \sum_{r=+,-} p_{rc}.$$

In fact, $p_+ = \Pr\left(X > x^{(0)} \mid X \in N_h(x^{(0)})\right)$, $p_- = \Pr\left(X \le x^{(0)} \mid X \in N_h(x^{(0)})\right)$, and $q_c = \Pr\left(Y = c \mid X \in N_h(x^{(0)})\right)$.

By (7.15) and (7.16),

$$\hat\zeta(x^{(0)}, h) \longrightarrow \left(\int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx\right)^{1/2}\left(\sum_{r=+,-}\sum_{c=1}^{C}\frac{(p_{rc} - p_r q_c)^2}{p_r q_c}\right)^{1/2} \equiv \zeta(x^{(0)}, h).$$

(2) Since $X$ and $Y$ are independent, $p_{rc} = p_r q_c$ for $r = +, -$ and $c = 1, \ldots, C$; then $\zeta(x^{(0)}, h) = 0$ for any $h$. Hence

$$\zeta(x^{(0)}) = \sup_{h > 0}\zeta(x^{(0)}, h) = 0.$$

(3) Recall that $p_c(x^{(0)}) = \Pr(Y = c \mid x^{(0)})$. Without loss of generality assume $p_c'(x^{(0)}) > 0$. Then there exists $h$ such that

$$\Pr\left(Y = c \mid x^{(0)} < X < x^{(0)} + h\right) > \Pr\left(Y = c \mid x^{(0)} - h < X < x^{(0)}\right),$$

which is equivalent to $p_{+c}/p_+ > p_{-c}/p_-$. This shows that $\zeta(x^{(0)}) > 0$, since by Lemma 6.1, $\zeta(x^{(0)}) = 0$ would imply $p_{+c}/p_+ = p_{-c}/p_- = p_c$.

References

[1] Breiman, L., 2001. Random forests. Machine Learning 45, 5–32.

[2] Breiman, L., Friedman, J. H., Olshen, R. A., Stone, C. J., 1984. Classification and Regression Trees. Wadsworth, Belmont.

[3] Doksum, K., Tang, S., Tsui, K.-W., 2008. Nonparametric variable selection: The EARTH algorithm. Journal of the American Statistical Association 103 (484), 1609–1620.

[4] Doksum, K. A., Samarov, A., 1995. Nonparametric estimation of global functionals and a measure of explanatory power of covariates in regression. Annals of Statistics 23, 1443–1473.

[5] Doksum, K. A., Schafer, C., 2006. Powerful choices: Tuning parameter selection based on power. In: Frontiers in Statistics: Dedicated to Peter Bickel (J. Fan and H. L. Koul, eds.), Imperial College Press, 113–141.

[6] Friedman, J., 1991. Multivariate adaptive regression splines. The Annals of Statistics 19, 1–67.

[7] Gao, J., Gijbels, I., 2008. Bandwidth selection in nonparametric kernel testing. Journal of the American Statistical Association 103 (484), 1584–1594.

[8] Hall, P., 1992. The Bootstrap and Edgeworth Expansion. Springer, New York.

[9] Huang, L.-S., Chen, J., 2008. Analysis of variance, coefficient of determination and F-test for local polynomial regression. Annals of Statistics 36, 2085–2109.

[10] Jing, B.-Y., Shao, Q.-M., Wang, Q., 2003. Self-normalized Cramér-type large deviations for independent random variables. Annals of Probability 31, 2167–2215.

[11] Lin, D. Y., Zeng, D., 2011. Correcting for population stratification in genomewide association studies. Journal of the American Statistical Association 106, 997–1008.

[12] Loh, W.-Y., 2009. Improving the precision of classification trees. The Annals of Applied Statistics 3, 1710–1737.

[13] Price, A., Patterson, N., Plenge, R., Weinblatt, M., Shadick, N., Reich, D., 2006. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38 (8), 904–909.

[14] Qiao, Z., Zhou, L., Huang, J., 2008. Linear discriminant analysis for high-dimensional, low-sample-size data. Special issue of World Congress on Engineering 2008.

[15] Quinlan, J. R., 1987. Simplifying decision trees. International Journal of Man-Machine Studies 27 (3), 221–234.

[16] Schafer, C., Doksum, K. A., 2009. Selecting local models in multiple regression by maximizing power. Metrika 69, 283–304.

[17] Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267–288.

[18] Wang, L., Shen, X., 2007. On L1-norm multiclass support vector machines: Methodology and theory. Journal of the American Statistical Association 102, 583–594.

[19] Zhang, H. H., Liu, Y., Wu, Y., Zhu, J., 2008. Variable selection for the multicategory SVM via adaptive sup-norm regularization. Electronic Journal of Statistics 2, 149–167.


Page 20: Nonparametric Variable Selection and Classi cation: The ...lc436/Tang_Chen_etal_2012_CSDA.pdf · pruning, will choose a subset of optimal splitting variables to be the most signi

variables for classification the performance of other algorithms such as Support VectorMachine (SVM) and Random Forest Classifiers (RFC) can be significantly improvedWe compare the prediction performance of (1) SVM (2) RFC (3) CATCH followedby SVM and (4) CATCH followed by RFC in this section We label (3) and (4)with CATCH+SVM and CATCH+RFC respectively In this section we use the ldquosvmrdquofunction in the ldquoe1071rdquo package in R with radial kernel to perform SVM analysisand use the ldquorandomForestrdquo function in the ldquorandomForestrdquo package to perform RFCanalysis We use simulation experiments to illustrate the improvement obtained byusing CATCH prior to classifying Y

Let the true model be

(X Y ) sim P (42)

where P will be specified later In each of 100 Monte Carlo experiments n pairs of(X(i) Y (i)) i = 1 middot middot middot n are generated from the true model (42) Based on the simu-lated data (X(i) Y (i)) i = 1 middot middot middot n the methods (1) SVM (2) RFC (3) CATCH+SVM(4) CATCH+RFC are used to classify a future Y (0) for which we have available a cor-responding X(0)

The integrated-misclassification rate (IMR Friedman [6]) defined below is used toevaluate the performance of the classification algorithms Let FXY(middot) be the classifierconstructed by a classification algorithm based on X = (X(1) middot middot middot X(n)) and Y =(Y (1) middot middot middot Y (n))T

Let (X(0) Y (0)) be a future observation with the same distribution as (X(i) Y (i))We only know X(0) = x(0) and want to classify Y (0) For a possible future observation(x(0) y(0)) define

MR(XYx(0) y(0)) = P (FXY(x(0)) 6= y(0)|XY) (43)

Our criterion is

IMR = EE0[MR(XYx(0) y(0))] (44)

where E0 is with respect to the distribution of (x(0) y(0)) random and E is with respectto the distribution of (XY)

This double expectation is evaluated using Monte Carlo simulation The result isdenoted as IMRlowast More precisely E0 is estimated using 5000 simulated (x(0) y(0)) obser-vations then IMRlowast is computed on the basis of 100 simulated (XiYi) i = 1 middot middot middot 100samples

Example 43 First consider model (41) in section 41 Figure 3 shows IMRlowast versusthe number of irrelevant variables for this example The left panel shows that theIMRlowastrsquos of the SVM approach increases dramatically as the number of irrelevant variablesincreases while the IMRlowastrsquos of CATCH+SVM are stable Thus CATCH stabilizes the

18

performance of SVM for this complex model The right panel shows that CATCH alsostabilizes the performance of RFC These results indicate that CATCH can improve theprediction accuracy of other classification algorithms

20 40 60 80 100

025

030

035

040

045

050

055

060

SVM

d

Err

or R

ate

SVMCATCH+SVM

20 40 60 80 100

025

030

035

040

045

050

055

060

Random Forest

d

RFCCATCH+RFC

Figure 3 IMRlowastrsquos of SVM RFC CATCH+SVM CATCH+RFC versus the number of predictors forthe simulation study in model (41) in example 41 The sample size is n = 400

Example 44 Contaminated Logistic Model Consider a binary response Y isin 0 1with

P (Y = 0|X = x) =

F (x) if r = 0G(x) if r = 1

(45)

where x = (x1 xd)

r sim Bernoulli(γ) (46)

F (x) = (1 + expminus(025 + 05x1 + x2))minus1 (47)

G(x) =

01 if 025 + 05x1 + x2 lt 009 if 025 + 05x1 + x2 ge 0

(48)

and x1 x2 xd are realizations of independent standard normal random variablesHence given X(i) = x(i) 1 le i le n the joint distribution of Y (1) Y (2) Y (n) is thecontaminated logistic model with contamination parameter γ

19

(1minus γ)nprodi=1

[F (x(i))]Y(i)

[1minus F (x(i))]1minusY(i)

+ γ

nprodi=1

[G(x(i))]Y(i)

[1minusG(x(i))]1minusY(i)

(49)

We perform simulation study for n = 200 d = 50 and γ = 0 02 04 06 08 and1 where γ = 0 corresponds to no contamination Figure 4 shows the IMRlowastrsquos versus thedifferent values of γ for this example CATCH improves the performance of the SVMand Random Forest classifiers for this contaminated logistic regression model

01

02

03

04

SVM

γ

Err

or R

ate

0 02 04 06 08 1

SVMCATCH+SVM

01

02

03

04

Random Forest

γ

0 02 04 06 08 1

RFCCATCH+RFC

Figure 4 IMRlowastrsquos of SVM RFC CATCH+SVM CATCH+RFC versus different contamination pa-rameters γ for simulation model (49) in example 44

5 CATCH Analysis of The ANN Data

In this section CATCH is applied to a data set available at the UCI data base(httpftpicsuciedupubmachine-learning-databasesthyroid-disease) The data is from a clinical trial to determine whether a patient is hypothyroid (seeeg Quinlan [15]) The dataset is split into two parts a training set with 3772 ob-servations and a testing set with 3428 observations The response is a categoricalvariable with three classes normal (not hypothyroid) hyperfunction and subnormalfunctioning There are 21 predictors including 15 categorical predictors and 6 numer-ical predictors The three classes are very imbalanced with 925 of normal patientsA good classifier should produce a significant decrease from an error rate 75 whichcan be obtained by classifying all observations to the normal class

20

CATCH identifies 7 variables as important including 4 continuous variables and3 categorical variables We consider three classification methods 1 Support VectorMachine with radial kernel denoted by SVM(r) 2 Support Vector Machine withpolynomial kernel denoted by SVM(p) 3 RFC We apply these three methods onthe training set to build classification models and use the test set to estimate themisclassification rate of the methods We apply the methods with CATCH and withoutCATCH respectively where ldquowith CATCHrdquo means that only the 7 important variablesidentified by CATCH are used to build the model and ldquowithout CATCHrdquo means thatall the 21 variables are used in the analysis

Table 5 Misclassification rates of three classification methods (SVM(r) SVM(p) RFC) with CATCHand without CATCH

SVM(r) SVM(p) RFC

wo CATCH 0052 0063 0024

with CATCH 0037 0055 0015

Table 5 shows that if CATCH is used to select important variables before the clas-sification methods are applied the misclassification rates decrease CATCH reducesthe misclassification rates by 29 13 and 37 for SVM(r) SVM(p) and RFC respec-tively CATCH + RFC has the best performance Although CATCH is not designed forimproving classification accuracy nonetheless it can help other classification methodsto perform better

6 Properties of Local Contingency Efficacy

In this section we investigate the properties of local contingency efficacy For a uni-variate continuous predictor Theorem 61 below shows that local contingency efficacyis well-defined and equals to 0 if the response variable is independent of the predictorConsider an R times C contingency table Let nrc r = 1 middot middot middot R c = 1 middot middot middot C be theobserved frequency in the (r c)-cell and let prc be the probability that an observationbelongs to the (r c)-cell Let

pr =Csumc=1

prc qc =Rsumr=1

prc (61)

21

and

nrmiddot =Csumc=1

nrc nmiddotc =Rsumi=1

nrc n =Rsumi=1

Csumc=1

nrc Erc = nrmiddotnmiddotcn (62)

Consider testing that the rows and the columns are independent ie

H0 forallr c prc = prqc versus H1 existr c prc 6= prqc (63)

a standard test statistic is

X 2 =Rsumr=1

Csumc=1

(nrc minus Erc)2

Erc (64)

Under H0 X 2 converges in distribution to χ2(Rminus1)(Cminus1) Moreover define 00 = 0 we

have the well known result

Lemma 61 As n tends to infinity X 2n tends in probability to

τ 2 equivRsumr=1

Csumc=1

(prc minus prqc)2

prqc

Moreover if prqc is bounded away from 0 and 1 as n rarr infin and if R and C are fixedas nrarrinfin then τ 2 = 0 if and only if H0 is true

Remark 61 Lemma 61 shows that the contingency efficacy ζ in (28) is well definedMoreover ζ = 0 if and only if X and Y are independent

Theorem 61 Consider a categorical response Y and a continuous predictor X withdensity function f(x) such that for a fixed h gt 0 f(x) gt 0 for x isin (x(0) minus h x(0) + h)then

(1) For fixed h the estimated local section efficacy ζ(x(0) h) defined in (25) convergesalmost surly to the section efficacy ζ(x(0) h) defined in (24)

(2) If X and Y are independent then the local efficacy ζ(x(0)) defined in (24) is zero

(3) If there exists c isin 1 middot middot middot C such that pc(x) = P (Y = c|x) has a continuousderivative at x = x(0) with pprimec(x

(0)) 6= 0 then ζ(x(0)) gt 0

The χ2 statistic (64) can be written in terms of multinomial proportions prc withradicn(prc minus prc) 1 le r le R 1 le c le C converging in distribution to a multivariate

normal By Taylor expansion of Tn equivradicχ2n at τ = plim

radicχ2n we find that

22

Theorem 62 radicn(Tn minus τ) minusrarrd N(0 v2) (65)

where the asymptotic variance v2 which can be obtained from the Taylor expansion isnot needed in this paper

Remark 62 The result (65) applies to the multivariate χ2-based efficacies ζ with nreplaced by the number of observations that goes in to the computations of ζ In thiscase we use the notation χ2

j for the chi-square statistic for the jth variable

7 Consistency Of CATCH

Let Y denote a variable to be classified into one of the categories 1 C by usingthe Xjrsquos in the vector X = (X1 Xd) Let Xminusj equiv X minusXj denote X without the Xj

component Assume that the squared Pearson nonparametric correlation (Doksum andSamarov [4] Huang and Chen [9]) between Xj and Xminusj is less than 05 We call thevariable Xj relevant for Y if conditionally given Xminusj L(Y |Xj = x) is not constant in xand irrelevant otherwise Our data are (X(1) Y (1)) (X(n) Y (n)) iid as (X Y ) Avariable selection procedure is consistent if the probability that all the relevant variablesare selected and none of the irrelevant variables are selected tends to one as n tends toinfinity

We give a general nonparametric framework where the variable selection rule basedon the CATCH importance scores sj given in (38) is consistent

When the predictors are not all categorical and CATCH involves tubes we assumethat the number of observations in each tube tends toinfin as nrarrinfin and we assume thatfor those importance scores that require the selection of a section size h the selection isfrom a finite set hj with minjhj gt h0 for some h0 gt 0 It follows that the numberof observations mjk in the kth tube used to calculate the importance score sj for thevariable Xj tends to infinity as nrarrinfin

The key quantities that determine consistency are the limits λj of sj That is

λj = τ(ζj P ) = EP [ζj(X)] =

intζj(x)dP (x) j = 1 d

where ζj(x) is the local contingency efficacy defined by (24) for numerical Xrsquos and by(28) for categorical Xrsquos and P is the probability distribution of X Our estimate sjof τj is

sj = τ(ζj P ) =1

M

Msumk=1

ζj(Xlowastk)

where ζj is defined in section 3 and P is the empirical distritution of XLet d0 denote the number of Xrsquos that are irrelevant for Y Set d1 = d minus d0 and

without loss of generality reorder the λj so that

23

λj gt 0 if j = 1 d1λj = 0 if j = d1 + 1 d

(71)

Our variable selection rule is

ldquokeep variable Xj if sj ge t for some threshold trdquo (72)

Rule (72) is consistent if and only if

P (minjled1sj ge t and max

jgtd1sj lt t) minusrarr 1 (73)

To establish (73) it is enough to show that

P (maxjgtd1sj gt t) minusrarr 0 (74)

andP (min

jled1sj le t) minusrarr 0 (75)

We call (74) and (75) type I and type II consistency respectively Type I and IIerror probabilities refer to probabilities of any false positives and any false negativesrespectively

71 Type I consistency

We consider (74) first and note that

P (maxjgtd1sj gt t) le

dsumj=d1+1

P (sj gt t) (76)

For the jth irrelevant variable Xj j gt d1 the results in Section 6 apply to the local

efficacy ζj(x) at a fixed point x The importance score sj defined in (38) is the averageof such efficacies over a sample of M random points xlowasta We assume that there is apoint x(0) such that for irrelevant xjrsquos and for n large enough ζj(x

(0)) is stochastically

larger in the right tail than the average Mminus1sumMa=1 ζj(x

lowasta) That is we assume that

there exist t gt 0 and n0 such that for n ge n0

p(sj gt t) le P (s(0)j gt t) j gt d1 (77)

The existence of such a point x(0) is established by considering the probability limit of

arg maxζj(xlowasta) xlowasta isin xlowast1 middot middot middot xlowastM

Note by Lemma (61) s(0)j rarrp 0 as n rarr infin It follows that we have established

24

(74) for t and d0 = d minus d1 that are fixed as n rarr infin To examine the interesting casewhere d0 rarr infin as n rarr infin we need large deviation results which we turn to next Inthe d0 rarrinfin case we consider t equiv tn that tend to infin as nrarrinfin and we use the weakerassumption there exists x(0) and n0 such that

P (sj gt tn) le P (s(0)j gt tn) (78)

for n ge n0 and each tn with limnrarrinfin tn = infin We also assume that the number ofcolumns C and rows R in the contingency table that define the efficacy χ2

jn given by

(64) stays fixed as nrarrinfin In this case P (s(0)j gt tn)rarr 0 provided each term

[T (j)rc ]2 equiv

[nrcErcradicnradicErc

]2in s2j = χ2

jn defined for Xj by (64) satisfies

P(|T (j)rc | gt tn

)minusrarr 0

This is because

P

radicsumr

sumc

[T(j)rc ]2 gt tn

le P(RC max

rc[T (j)rc ]2 gt t2n

)le RC maxP

(|T (j)rc | gt tn

radicRC)

and tn and tnradicRC are of the same order as n rarr infin By large deviation theory Hall

[8] and Jing et al [10] for some constant A

P (|T (j)rc | gt tn) =

2tminus1nradic2πeminust

2n2[1 + At3nn

minus 12 + o

(t3nnminus 1

2

)](79)

provided tn = o(n

16

) If (79) is uniform in j gt d1 then (76) and (79) implies that

type I consistency holds if

d0

(tnradic

2

)minus1eminus

t2n2 rarr 0 (710)

To examine (710) write d0 and tn as d0 = exp(nb)

and tn =radic

2nr By large deviationtheory (710) holds when r lt 1

6 Thus if 0 lt 1

2b lt r lt 1

6 then

d0

(tnradic

2

)minus1eminus

t2n2 = nminusr exp

(nb minus n2r

)rarr 0

25

Thus d0 = exp(nb) with b = 13ε for any ε gt 0 is the largest d0 can be to have consistency

For instance if there are exactly d0 = exp(n

14

)irrelevant variables and we use the

threshold tn =radic

2n17 the probability of selecting any of the irrelevant variables tends

to zero at the ratenminus

14 exp

(n

14 minus n

27

)72 Type II consistency

We now assume that for each relevant variable Xj there is a least favorable point x(1)

such that the local efficacy s(1)j equiv ζ(x(1)) is stochastically smaller than the importance

score sj = Mminus1sumMa=1 ζ(xlowasta) That is we assume that for each relevant variable Xj there

exists n1 and x(1) such that for n ge n1

P(s(1)j le tn

)ge P (sj le tn) j le d1 (711)

The existence of such a least favorable x(1) is established by considering the probabilitylimit of

arg minζj(xlowasta) xlowasta isin xlowast1 middot middot middot xlowastM

We again assume that R and C are fixed so that to show type II consistency it isenough to show that for each (r c) and j le d1

P(|T (j)rc | le tn

)rarr 0

This is because

P

(radicsumr

sumc

[T

(j)rc

]2le tn

)le P (minrc

[T

(j)rc

]2le t2n)

le RC maxrc P(|T (j)rc | le tn

)By Theorem 62

radicn(s(1)j minus τj

)rarrd N(0 ν2) τj gt 0 It follows that if tn and d1

are bounded above as n rarr infin and τj gt 0 then P(s(1)j le tn

)rarr 0 Thus type II

consistency holds in this case To examine the more interesting case where tn and d1are not bounded above and τj may tend to zero as nrarrinfin we will need to center the

T(j)rc Thus set

θ(j)n =

radicn(p(j)rc minus p(j)r p

(j)c

)radicp(j)r p

(j)c

(712)

26

Tn(j) = T (j)rc minus θ(j)n

where for simplicity the dependence of T(j)n and θ

(j)n on r and c is not indicated

Lemma 71

P

(minjled1|T (j)rc | le tn

)le

d1sumj=1

P

(|T (j)n | ge min

jled1θ(j)n minus tn

)Proof

P(

minjled1 |T(j)rc | le tn

)= P

(minjled1 |T

(j)n + θ

(j)n | le tn

)le P

(minjled1 |T

(j)n | minusmaxjled1 |θ

(j)n | le tn

)= P

(maxjled1 |T

(j)n | ge minjled1 |θ

(j)n | minus tn

)lesumd1

j=1 P(|T (j)n | ge minjled1 |θ

(j)n | minus tn

)

Setan = min

jled1

∣∣θ(j)n ∣∣minus tnthen by large deviation theory if an = o

(n

16

)

P (∣∣T (j)n

∣∣ ge an) prop

(anradic

2

)minus1expminusa2n2 (713)

where prop signifies asymptotic order as nrarrinfin an rarrinfin It follows that if (713) holdsuniformly in j then by Lemma 71

P

(minjled1

∣∣T (j)rc

∣∣ le tn

)prop d1

(anradic

2

)minus1expminusa2n2

Thus type II consistency holds when an = o(n

16

)and

d1

(anradic

2

)minus1expminusa2n2 rarr 0 (714)

In Section 71 tn is of order nr 0 lt r lt 16 For this tn type II consistency holds if an

is of order nr 0 lt r lt 16 and

d1nminusr expminusn2r2 rarr 0

27

If an = O(nk) with k ge 16 then P (

∣∣∣T (j)(n)

∣∣∣ ge an) tends to zero at a rate faster than in

(713) and consistency is ensured For instance if all relevant variables Xj have p(j)rc

fixed as nrarrinfin then minjled1

∣∣∣θ(j)n ∣∣∣ is of orderradicn and an is of order n

12 minus tn

73 Type I and II consistency

We see from Sections 71 and 72 that consistency holds under a variety of conditionsOne set of conditions is

Theorem 71 Under the regularization conditions of Sections 71 and 72 if

d0 = c0 exp(nb) tn = c1nr

minjled1 θ(j) minus tn = c2nr d1 = c3 exp

(nb)

for positive constants c0 c1 c2 and c3 and 0 lt 12b lt r lt 1

6 then CATCH is consistent

AppendixProof of Theorem 61

(1) Let Nh =

sumni=1 I(X(i) isin Nh(x

(0))) Then Nh sim Bi(n

int x(0)+hx(0)minush f(x)dx) and

Nh rarras infin as nrarrinfin

ζ(x(0) h) = nminus12radicX 2(x(0) h)

= (Nh n)12(N

h )minus12radicX 2(x(0) h)

Since Nh sim Bi(n

int x(0)+hx(0)minush f(x)dx) we have

Nh nrarr

int x(0)+h

x(0)minushf(x)dx (715)

By Lemma 61 and Table (1)

(Nh )minus12

radicX 2(x(0) h)rarr

( sumr=+minus

Csumc=1

(prc minus prqc)2

prqc

) 12

(716)

where

p+c = Pr(X gt x0 Y = c|X isin Nh(x(0))) pminusc = Pr(X le x0 Y = c|X isin Nh(x

(0)))

28

for r = +minus and c = 1 middot middot middot C

pr =Csumc=1

prc qc =sumr=+minus

prc

Actually p+ = Pr(X gt x0|X isin Nh(x(0))) pminus = Pr(X le x0|X isin Nh(x

(0)))qc = Pr(Y = c|X isin Nh(x

(0)))

By (715) and (716)

ζ(x(0) h)rarr(int x(0)+h

x(0)minushf(x)dx

) 12( sumr=+minus

Csumc=1

(prc minus prqc)2

prqc

) 12

equiv ζ(x(0) h)

(2) Since X and Y are independent prc = prqc for r = +minus and c = 1 middot middot middot C thenζ(x(0) h) = 0 for any h Hence

ζ(x(0)) = suphgt0ζ(x(0) h) = 0

(3) Recall that pc(x(0)) = Pr(Y = c|x(0)) Without loss of generality assume pprimec(x

(0)) gt0 Then there exists h such that

Pr(Y = c|x(0) lt X lt x(0) + h) gt Pr(Y = c|x(0) minus h lt X lt x(0))

which is equivelant to p+cp+ gt pminuscpminus This shows that ζ(x(0)) gt 0 since byLemma 61 ζ(x(0)) = 0 results in p+cp+ = pminuscpminus = pc

References

[1] Breiman L 2001 Random forests Machine Learning 45 5ndash32

[2] Breiman L Friedman J H Olshen R A Stone J 1984 Classification andregression trees Wadsworth Belmont

[3] Doksum K Tang S Tsui K-W 2008 Nonparametric variable selection Theearth algorithm Journal of the American Statistical Association 103(484) 1609ndash1620

[4] Doksum K A Samarov A 1995 Nonparametric estimation of global functionalsand a measure of explanatory power of covariates in regression Annals of Statistics23 1443ndash1473

29

[5] Doksum K A Schafer C 2006 Powerful choices Tuning parameter selectionbased on power Frontiers in Statistics Dedicated to Peter Bickel Editors J Fanand H L Koul Imperial College Press 113ndash141

[6] Friedman J 1991 Multivariate adaptive regression splines The annals of statis-tics 1ndash67

[7] Gao J Gijbels I 2008 Bandwidth selection in nonparametric kernel testingJournal of the American Statistical Association 103 (484) 1584ndash1594

[8] Hall P 1992 The Bootstrap and Edgeworth Expansions Springer New York

[9] Huang L-S Chen J 2008 Analysis of variance coefficient of determinationand f-test for local polynomial regression Annals of Statistics 36 2085ndash2109

[10] Jing B-Y Shao Q-M Wang Q 2003 Self-normalized cramer-type large devi-ations for independent random variables Annals of Probability 31 2167ndash2215

[11] Lin D Y Zeng D 2011 Correcting for population stratification in genomewideassociation studies Journal of the American Statistical Association 106 997ndash1008

[12] Loh W-Y 2009 Improve the precision of classification trees The Annals ofApplied Statistics 3 1710ndash1737

[13] Price A Patterson N Plenge R Weinblatt M Shadick N Reich D 2006Principal components analysis corrects for stratification in genome-wide associationstudies Nature genetics 38 (8) 904ndash909

[14] Qiao Z Zhou L Huang J 2008 Linear discriminant analysis for high dimen-sional Low Sample Size Data special issue of World Congress on Engineering2008

[15] Quinlan J R 1987 Simplifying decision trees International Journal of Man-Machine Studies 27(3) 221ndash234

[16] Schafer C Doksum K A 2009 Selecting local models in multiple regression bymaximizing power Metrika 69 283ndash304

[17] Tibshirani R 1996 Regression shrinkage and selection via the lasso J RoyalStatist Soc B 58 267ndash288

[18] Wang L Shen X 2007 On l1-norm multi-class support vector machinesmethodology and theory Journal of the American Statistical Association 102 583ndash594

[19] Zhang H H Liu Y Wu Y Zhu J 2008 Variable selection for the multicate-gory svm via adaptive sup-norm regularization Electronic Journal of Statistics 2149ndash167

30

  • Introduction
  • Importance Scores for Univariate Classification
    • Numerical Predictor Local Contingency Efficacy
    • Categorical Predictor Contingency Efficacy
      • Variable Selection for Classification Problems The CATCH Algorithm
        • Numerical Predictors
        • Numerical and Categorical Predictors
        • The Empirical Importance Scores
        • Adaptive Tube Distances
        • The CATCH Algorithm
          • Simulation Studies
            • Variable Selection with Categorical and Numerical Predictors
            • Improving The Performance of SVM and RF Classifiers
              • CATCH Analysis of The ANN Data
              • Properties of Local Contingency Efficacy
              • Consistency Of CATCH
                • Type I consistency
                • Type II consistency
                • Type I and II consistency
                  • Proof of Theorem 61
Page 21: Nonparametric Variable Selection and Classi cation: The ...lc436/Tang_Chen_etal_2012_CSDA.pdf · pruning, will choose a subset of optimal splitting variables to be the most signi

performance of SVM for this complex model The right panel shows that CATCH alsostabilizes the performance of RFC These results indicate that CATCH can improve theprediction accuracy of other classification algorithms

20 40 60 80 100

025

030

035

040

045

050

055

060

SVM

d

Err

or R

ate

SVMCATCH+SVM

20 40 60 80 100

025

030

035

040

045

050

055

060

Random Forest

d

RFCCATCH+RFC

Figure 3 IMRlowastrsquos of SVM RFC CATCH+SVM CATCH+RFC versus the number of predictors forthe simulation study in model (41) in example 41 The sample size is n = 400

Example 44 Contaminated Logistic Model Consider a binary response Y isin 0 1with

P (Y = 0|X = x) =

F (x) if r = 0G(x) if r = 1

(45)

where x = (x1 xd)

r sim Bernoulli(γ) (46)

F (x) = (1 + expminus(025 + 05x1 + x2))minus1 (47)

G(x) =

01 if 025 + 05x1 + x2 lt 009 if 025 + 05x1 + x2 ge 0

(48)

and x1 x2 xd are realizations of independent standard normal random variablesHence given X(i) = x(i) 1 le i le n the joint distribution of Y (1) Y (2) Y (n) is thecontaminated logistic model with contamination parameter γ

19

(1minus γ)nprodi=1

[F (x(i))]Y(i)

[1minus F (x(i))]1minusY(i)

+ γ

nprodi=1

[G(x(i))]Y(i)

[1minusG(x(i))]1minusY(i)

(49)

We perform simulation study for n = 200 d = 50 and γ = 0 02 04 06 08 and1 where γ = 0 corresponds to no contamination Figure 4 shows the IMRlowastrsquos versus thedifferent values of γ for this example CATCH improves the performance of the SVMand Random Forest classifiers for this contaminated logistic regression model

01

02

03

04

SVM

γ

Err

or R

ate

0 02 04 06 08 1

SVMCATCH+SVM

01

02

03

04

Random Forest

γ

0 02 04 06 08 1

RFCCATCH+RFC

Figure 4 IMRlowastrsquos of SVM RFC CATCH+SVM CATCH+RFC versus different contamination pa-rameters γ for simulation model (49) in example 44

5 CATCH Analysis of The ANN Data

In this section CATCH is applied to a data set available at the UCI data base(httpftpicsuciedupubmachine-learning-databasesthyroid-disease) The data is from a clinical trial to determine whether a patient is hypothyroid (seeeg Quinlan [15]) The dataset is split into two parts a training set with 3772 ob-servations and a testing set with 3428 observations The response is a categoricalvariable with three classes normal (not hypothyroid) hyperfunction and subnormalfunctioning There are 21 predictors including 15 categorical predictors and 6 numer-ical predictors The three classes are very imbalanced with 925 of normal patientsA good classifier should produce a significant decrease from an error rate 75 whichcan be obtained by classifying all observations to the normal class

20

CATCH identifies 7 variables as important including 4 continuous variables and3 categorical variables We consider three classification methods 1 Support VectorMachine with radial kernel denoted by SVM(r) 2 Support Vector Machine withpolynomial kernel denoted by SVM(p) 3 RFC We apply these three methods onthe training set to build classification models and use the test set to estimate themisclassification rate of the methods We apply the methods with CATCH and withoutCATCH respectively where ldquowith CATCHrdquo means that only the 7 important variablesidentified by CATCH are used to build the model and ldquowithout CATCHrdquo means thatall the 21 variables are used in the analysis

Table 5 Misclassification rates of three classification methods (SVM(r) SVM(p) RFC) with CATCHand without CATCH

SVM(r) SVM(p) RFC

wo CATCH 0052 0063 0024

with CATCH 0037 0055 0015

Table 5 shows that if CATCH is used to select important variables before the clas-sification methods are applied the misclassification rates decrease CATCH reducesthe misclassification rates by 29 13 and 37 for SVM(r) SVM(p) and RFC respec-tively CATCH + RFC has the best performance Although CATCH is not designed forimproving classification accuracy nonetheless it can help other classification methodsto perform better

6 Properties of Local Contingency Efficacy

In this section we investigate the properties of local contingency efficacy For a uni-variate continuous predictor Theorem 61 below shows that local contingency efficacyis well-defined and equals to 0 if the response variable is independent of the predictorConsider an R times C contingency table Let nrc r = 1 middot middot middot R c = 1 middot middot middot C be theobserved frequency in the (r c)-cell and let prc be the probability that an observationbelongs to the (r c)-cell Let

pr =Csumc=1

prc qc =Rsumr=1

prc (61)

21

and

nrmiddot =Csumc=1

nrc nmiddotc =Rsumi=1

nrc n =Rsumi=1

Csumc=1

nrc Erc = nrmiddotnmiddotcn (62)

Consider testing that the rows and the columns are independent ie

H0 forallr c prc = prqc versus H1 existr c prc 6= prqc (63)

a standard test statistic is

X 2 =Rsumr=1

Csumc=1

(nrc minus Erc)2

Erc (64)

Under H0 X 2 converges in distribution to χ2(Rminus1)(Cminus1) Moreover define 00 = 0 we

have the well known result

Lemma 61 As n tends to infinity X 2n tends in probability to

τ 2 equivRsumr=1

Csumc=1

(prc minus prqc)2

prqc

Moreover if prqc is bounded away from 0 and 1 as n rarr infin and if R and C are fixedas nrarrinfin then τ 2 = 0 if and only if H0 is true

Remark 61 Lemma 61 shows that the contingency efficacy ζ in (28) is well definedMoreover ζ = 0 if and only if X and Y are independent

Theorem 61 Consider a categorical response Y and a continuous predictor X withdensity function f(x) such that for a fixed h gt 0 f(x) gt 0 for x isin (x(0) minus h x(0) + h)then

(1) For fixed h the estimated local section efficacy ζ(x(0) h) defined in (25) convergesalmost surly to the section efficacy ζ(x(0) h) defined in (24)

(2) If X and Y are independent then the local efficacy ζ(x(0)) defined in (24) is zero

(3) If there exists c isin 1 middot middot middot C such that pc(x) = P (Y = c|x) has a continuousderivative at x = x(0) with pprimec(x

(0)) 6= 0 then ζ(x(0)) gt 0

The χ2 statistic (64) can be written in terms of multinomial proportions prc withradicn(prc minus prc) 1 le r le R 1 le c le C converging in distribution to a multivariate

normal By Taylor expansion of Tn equivradicχ2n at τ = plim

radicχ2n we find that

22

Theorem 62 radicn(Tn minus τ) minusrarrd N(0 v2) (65)

where the asymptotic variance v2 which can be obtained from the Taylor expansion isnot needed in this paper

Remark 62 The result (65) applies to the multivariate χ2-based efficacies ζ with nreplaced by the number of observations that goes in to the computations of ζ In thiscase we use the notation χ2

j for the chi-square statistic for the jth variable

7 Consistency Of CATCH

Let Y denote a variable to be classified into one of the categories 1 C by usingthe Xjrsquos in the vector X = (X1 Xd) Let Xminusj equiv X minusXj denote X without the Xj

component Assume that the squared Pearson nonparametric correlation (Doksum andSamarov [4] Huang and Chen [9]) between Xj and Xminusj is less than 05 We call thevariable Xj relevant for Y if conditionally given Xminusj L(Y |Xj = x) is not constant in xand irrelevant otherwise Our data are (X(1) Y (1)) (X(n) Y (n)) iid as (X Y ) Avariable selection procedure is consistent if the probability that all the relevant variablesare selected and none of the irrelevant variables are selected tends to one as n tends toinfinity

We give a general nonparametric framework where the variable selection rule basedon the CATCH importance scores sj given in (38) is consistent

When the predictors are not all categorical and CATCH involves tubes we assumethat the number of observations in each tube tends toinfin as nrarrinfin and we assume thatfor those importance scores that require the selection of a section size h the selection isfrom a finite set hj with minjhj gt h0 for some h0 gt 0 It follows that the numberof observations mjk in the kth tube used to calculate the importance score sj for thevariable Xj tends to infinity as nrarrinfin

The key quantities that determine consistency are the limits λj of sj That is

λj = τ(ζj P ) = EP [ζj(X)] =

intζj(x)dP (x) j = 1 d

where ζj(x) is the local contingency efficacy defined by (24) for numerical Xrsquos and by(28) for categorical Xrsquos and P is the probability distribution of X Our estimate sjof τj is

sj = τ(ζj P ) =1

M

Msumk=1

ζj(Xlowastk)

where ζj is defined in section 3 and P is the empirical distritution of XLet d0 denote the number of Xrsquos that are irrelevant for Y Set d1 = d minus d0 and

without loss of generality reorder the λj so that

23

λj gt 0 if j = 1 d1λj = 0 if j = d1 + 1 d

(71)

Our variable selection rule is

ldquokeep variable Xj if sj ge t for some threshold trdquo (72)

Rule (72) is consistent if and only if

P (minjled1sj ge t and max

jgtd1sj lt t) minusrarr 1 (73)

To establish (73) it is enough to show that

P (maxjgtd1sj gt t) minusrarr 0 (74)

andP (min

jled1sj le t) minusrarr 0 (75)

We call (74) and (75) type I and type II consistency respectively Type I and IIerror probabilities refer to probabilities of any false positives and any false negativesrespectively

71 Type I consistency

We consider (74) first and note that

P (maxjgtd1sj gt t) le

dsumj=d1+1

P (sj gt t) (76)

For the jth irrelevant variable Xj j gt d1 the results in Section 6 apply to the local

efficacy ζj(x) at a fixed point x The importance score sj defined in (38) is the averageof such efficacies over a sample of M random points xlowasta We assume that there is apoint x(0) such that for irrelevant xjrsquos and for n large enough ζj(x

(0)) is stochastically

larger in the right tail than the average Mminus1sumMa=1 ζj(x

lowasta) That is we assume that

there exist t gt 0 and n0 such that for n ge n0

p(sj gt t) le P (s(0)j gt t) j gt d1 (77)

The existence of such a point x(0) is established by considering the probability limit of

arg maxζj(xlowasta) xlowasta isin xlowast1 middot middot middot xlowastM

Note by Lemma (61) s(0)j rarrp 0 as n rarr infin It follows that we have established

24

(74) for t and d0 = d minus d1 that are fixed as n rarr infin To examine the interesting casewhere d0 rarr infin as n rarr infin we need large deviation results which we turn to next Inthe d0 rarrinfin case we consider t equiv tn that tend to infin as nrarrinfin and we use the weakerassumption there exists x(0) and n0 such that

P (sj gt tn) le P (s(0)j gt tn) (78)

for n ge n0 and each tn with limnrarrinfin tn = infin We also assume that the number ofcolumns C and rows R in the contingency table that define the efficacy χ2

jn given by

(64) stays fixed as nrarrinfin In this case P (s(0)j gt tn)rarr 0 provided each term

[T (j)rc ]2 equiv

[nrcErcradicnradicErc

]2in s2j = χ2

jn defined for Xj by (64) satisfies

P(|T (j)rc | gt tn

)minusrarr 0

This is because

P

radicsumr

sumc

[T(j)rc ]2 gt tn

le P(RC max

rc[T (j)rc ]2 gt t2n

)le RC maxP

(|T (j)rc | gt tn

radicRC)

and tn and tnradicRC are of the same order as n rarr infin By large deviation theory Hall

[8] and Jing et al [10] for some constant A

P (|T (j)rc | gt tn) =

2tminus1nradic2πeminust

2n2[1 + At3nn

minus 12 + o

(t3nnminus 1

2

)](79)

provided tn = o(n

16

) If (79) is uniform in j gt d1 then (76) and (79) implies that

type I consistency holds if

d0

(tnradic

2

)minus1eminus

t2n2 rarr 0 (710)

To examine (710) write d0 and tn as d0 = exp(nb)

and tn =radic

2nr By large deviationtheory (710) holds when r lt 1

6 Thus if 0 lt 1

2b lt r lt 1

6 then

d0

(tnradic

2

)minus1eminus

t2n2 = nminusr exp

(nb minus n2r

)rarr 0

25

Thus d0 = exp(nb) with b = 13ε for any ε gt 0 is the largest d0 can be to have consistency

For instance if there are exactly d0 = exp(n

14

)irrelevant variables and we use the

threshold tn =radic

2n17 the probability of selecting any of the irrelevant variables tends

to zero at the ratenminus

14 exp

(n

14 minus n

27

)72 Type II consistency

We now assume that for each relevant variable Xj there is a least favorable point x(1)

such that the local efficacy s(1)j equiv ζ(x(1)) is stochastically smaller than the importance

score sj = Mminus1sumMa=1 ζ(xlowasta) That is we assume that for each relevant variable Xj there

exists n1 and x(1) such that for n ge n1

P(s(1)j le tn

)ge P (sj le tn) j le d1 (711)

The existence of such a least favorable x(1) is established by considering the probabilitylimit of

arg minζj(xlowasta) xlowasta isin xlowast1 middot middot middot xlowastM

We again assume that R and C are fixed so that to show type II consistency it isenough to show that for each (r c) and j le d1

P(|T (j)rc | le tn

)rarr 0

This is because

P

(radicsumr

sumc

[T

(j)rc

]2le tn

)le P (minrc

[T

(j)rc

]2le t2n)

le RC maxrc P(|T (j)rc | le tn

)By Theorem 62

radicn(s(1)j minus τj

)rarrd N(0 ν2) τj gt 0 It follows that if tn and d1

are bounded above as n rarr infin and τj gt 0 then P(s(1)j le tn

)rarr 0 Thus type II

consistency holds in this case To examine the more interesting case where tn and d1are not bounded above and τj may tend to zero as nrarrinfin we will need to center the

T(j)rc Thus set

θ(j)n =

radicn(p(j)rc minus p(j)r p

(j)c

)radicp(j)r p

(j)c

(712)

26

Tn(j) = T (j)rc minus θ(j)n

where for simplicity the dependence of T(j)n and θ

(j)n on r and c is not indicated

Lemma 71

P

(minjled1|T (j)rc | le tn

)le

d1sumj=1

P

(|T (j)n | ge min

jled1θ(j)n minus tn

)Proof

P(

minjled1 |T(j)rc | le tn

)= P

(minjled1 |T

(j)n + θ

(j)n | le tn

)le P

(minjled1 |T

(j)n | minusmaxjled1 |θ

(j)n | le tn

)= P

(maxjled1 |T

(j)n | ge minjled1 |θ

(j)n | minus tn

)lesumd1

j=1 P(|T (j)n | ge minjled1 |θ

(j)n | minus tn

)

Setan = min

jled1

∣∣θ(j)n ∣∣minus tnthen by large deviation theory if an = o

(n

16

)

P (∣∣T (j)n

∣∣ ge an) prop

(anradic

2

)minus1expminusa2n2 (713)

where prop signifies asymptotic order as nrarrinfin an rarrinfin It follows that if (713) holdsuniformly in j then by Lemma 71

P

(minjled1

∣∣T (j)rc

∣∣ le tn

)prop d1

(anradic

2

)minus1expminusa2n2

Thus type II consistency holds when an = o(n

16

)and

d1

(anradic

2

)minus1expminusa2n2 rarr 0 (714)

In Section 71 tn is of order nr 0 lt r lt 16 For this tn type II consistency holds if an

is of order nr 0 lt r lt 16 and

d1nminusr expminusn2r2 rarr 0

27

If an = O(nk) with k ge 16 then P (

∣∣∣T (j)(n)

∣∣∣ ge an) tends to zero at a rate faster than in

(713) and consistency is ensured For instance if all relevant variables Xj have p(j)rc

fixed as nrarrinfin then minjled1

∣∣∣θ(j)n ∣∣∣ is of orderradicn and an is of order n

12 minus tn

73 Type I and II consistency

We see from Sections 71 and 72 that consistency holds under a variety of conditionsOne set of conditions is

Theorem 71 Under the regularization conditions of Sections 71 and 72 if

d0 = c0 exp(nb) tn = c1nr

minjled1 θ(j) minus tn = c2nr d1 = c3 exp

(nb)

for positive constants c0 c1 c2 and c3 and 0 lt 12b lt r lt 1

6 then CATCH is consistent

AppendixProof of Theorem 61

(1) Let Nh =

sumni=1 I(X(i) isin Nh(x

(0))) Then Nh sim Bi(n

int x(0)+hx(0)minush f(x)dx) and

Nh rarras infin as nrarrinfin

ζ(x(0) h) = nminus12radicX 2(x(0) h)

= (Nh n)12(N

h )minus12radicX 2(x(0) h)

Since Nh sim Bi(n

int x(0)+hx(0)minush f(x)dx) we have

Nh nrarr

int x(0)+h

x(0)minushf(x)dx (715)

By Lemma 61 and Table (1)

(Nh )minus12

radicX 2(x(0) h)rarr

( sumr=+minus

Csumc=1

(prc minus prqc)2

prqc

) 12

(716)

where

p+c = Pr(X gt x0 Y = c|X isin Nh(x(0))) pminusc = Pr(X le x0 Y = c|X isin Nh(x

(0)))

28

for r = +minus and c = 1 middot middot middot C

pr =Csumc=1

prc qc =sumr=+minus

prc

Actually p+ = Pr(X gt x0|X isin Nh(x(0))) pminus = Pr(X le x0|X isin Nh(x

(0)))qc = Pr(Y = c|X isin Nh(x

(0)))

By (715) and (716)

ζ(x(0) h)rarr(int x(0)+h

x(0)minushf(x)dx

) 12( sumr=+minus

Csumc=1

(prc minus prqc)2

prqc

) 12

equiv ζ(x(0) h)

(2) Since X and Y are independent prc = prqc for r = +minus and c = 1 middot middot middot C thenζ(x(0) h) = 0 for any h Hence

ζ(x(0)) = suphgt0ζ(x(0) h) = 0

(3) Recall that pc(x(0)) = Pr(Y = c|x(0)) Without loss of generality assume pprimec(x

(0)) gt0 Then there exists h such that

Pr(Y = c|x(0) lt X lt x(0) + h) gt Pr(Y = c|x(0) minus h lt X lt x(0))

which is equivelant to p+cp+ gt pminuscpminus This shows that ζ(x(0)) gt 0 since byLemma 61 ζ(x(0)) = 0 results in p+cp+ = pminuscpminus = pc

References

[1] Breiman, L., 2001. Random forests. Machine Learning 45, 5–32.

[2] Breiman, L., Friedman, J. H., Olshen, R. A., Stone, C. J., 1984. Classification and Regression Trees. Wadsworth, Belmont.

[3] Doksum, K., Tang, S., Tsui, K.-W., 2008. Nonparametric variable selection: The EARTH algorithm. Journal of the American Statistical Association 103 (484), 1609–1620.

[4] Doksum, K. A., Samarov, A., 1995. Nonparametric estimation of global functionals and a measure of the explanatory power of covariates in regression. Annals of Statistics 23, 1443–1473.

[5] Doksum, K. A., Schafer, C., 2006. Powerful choices: Tuning parameter selection based on power. In: Fan, J., Koul, H. L. (Eds.), Frontiers in Statistics: Dedicated to Peter Bickel. Imperial College Press, 113–141.

[6] Friedman, J., 1991. Multivariate adaptive regression splines. The Annals of Statistics 19, 1–67.

[7] Gao, J., Gijbels, I., 2008. Bandwidth selection in nonparametric kernel testing. Journal of the American Statistical Association 103 (484), 1584–1594.

[8] Hall, P., 1992. The Bootstrap and Edgeworth Expansion. Springer, New York.

[9] Huang, L.-S., Chen, J., 2008. Analysis of variance, coefficient of determination and F-test for local polynomial regression. Annals of Statistics 36, 2085–2109.

[10] Jing, B.-Y., Shao, Q.-M., Wang, Q., 2003. Self-normalized Cramér-type large deviations for independent random variables. Annals of Probability 31, 2167–2215.

[11] Lin, D. Y., Zeng, D., 2011. Correcting for population stratification in genomewide association studies. Journal of the American Statistical Association 106, 997–1008.

[12] Loh, W.-Y., 2009. Improving the precision of classification trees. The Annals of Applied Statistics 3, 1710–1737.

[13] Price, A., Patterson, N., Plenge, R., Weinblatt, M., Shadick, N., Reich, D., 2006. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38 (8), 904–909.

[14] Qiao, Z., Zhou, L., Huang, J., 2008. Linear discriminant analysis for high dimensional, low sample size data. Special issue of World Congress on Engineering 2008.

[15] Quinlan, J. R., 1987. Simplifying decision trees. International Journal of Man-Machine Studies 27 (3), 221–234.

[16] Schafer, C., Doksum, K. A., 2009. Selecting local models in multiple regression by maximizing power. Metrika 69, 283–304.

[17] Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267–288.

[18] Wang, L., Shen, X., 2007. On L1-norm multi-class support vector machines: methodology and theory. Journal of the American Statistical Association 102, 583–594.

[19] Zhang, H. H., Liu, Y., Wu, Y., Zhu, J., 2008. Variable selection for the multicategory SVM via adaptive sup-norm regularization. Electronic Journal of Statistics 2, 149–167.

Page 23: Nonparametric Variable Selection and Classi cation: The ...lc436/Tang_Chen_etal_2012_CSDA.pdf · pruning, will choose a subset of optimal splitting variables to be the most signi

CATCH identifies 7 variables as important including 4 continuous variables and3 categorical variables We consider three classification methods 1 Support VectorMachine with radial kernel denoted by SVM(r) 2 Support Vector Machine withpolynomial kernel denoted by SVM(p) 3 RFC We apply these three methods onthe training set to build classification models and use the test set to estimate themisclassification rate of the methods We apply the methods with CATCH and withoutCATCH respectively where ldquowith CATCHrdquo means that only the 7 important variablesidentified by CATCH are used to build the model and ldquowithout CATCHrdquo means thatall the 21 variables are used in the analysis

Table 5 Misclassification rates of three classification methods (SVM(r) SVM(p) RFC) with CATCHand without CATCH

SVM(r) SVM(p) RFC

wo CATCH 0052 0063 0024

with CATCH 0037 0055 0015

Table 5 shows that if CATCH is used to select important variables before the clas-sification methods are applied the misclassification rates decrease CATCH reducesthe misclassification rates by 29 13 and 37 for SVM(r) SVM(p) and RFC respec-tively CATCH + RFC has the best performance Although CATCH is not designed forimproving classification accuracy nonetheless it can help other classification methodsto perform better

6 Properties of Local Contingency Efficacy

In this section we investigate the properties of local contingency efficacy For a uni-variate continuous predictor Theorem 61 below shows that local contingency efficacyis well-defined and equals to 0 if the response variable is independent of the predictorConsider an R times C contingency table Let nrc r = 1 middot middot middot R c = 1 middot middot middot C be theobserved frequency in the (r c)-cell and let prc be the probability that an observationbelongs to the (r c)-cell Let

pr =Csumc=1

prc qc =Rsumr=1

prc (61)

21

and

nrmiddot =Csumc=1

nrc nmiddotc =Rsumi=1

nrc n =Rsumi=1

Csumc=1

nrc Erc = nrmiddotnmiddotcn (62)

Consider testing that the rows and the columns are independent ie

H0 forallr c prc = prqc versus H1 existr c prc 6= prqc (63)

a standard test statistic is

X 2 =Rsumr=1

Csumc=1

(nrc minus Erc)2

Erc (64)

Under H0 X 2 converges in distribution to χ2(Rminus1)(Cminus1) Moreover define 00 = 0 we

have the well known result

Lemma 61 As n tends to infinity X 2n tends in probability to

τ 2 equivRsumr=1

Csumc=1

(prc minus prqc)2

prqc

Moreover if prqc is bounded away from 0 and 1 as n rarr infin and if R and C are fixedas nrarrinfin then τ 2 = 0 if and only if H0 is true

Remark 61 Lemma 61 shows that the contingency efficacy ζ in (28) is well definedMoreover ζ = 0 if and only if X and Y are independent

Theorem 61 Consider a categorical response Y and a continuous predictor X withdensity function f(x) such that for a fixed h gt 0 f(x) gt 0 for x isin (x(0) minus h x(0) + h)then

(1) For fixed h the estimated local section efficacy ζ(x(0) h) defined in (25) convergesalmost surly to the section efficacy ζ(x(0) h) defined in (24)

(2) If X and Y are independent then the local efficacy ζ(x(0)) defined in (24) is zero

(3) If there exists c isin 1 middot middot middot C such that pc(x) = P (Y = c|x) has a continuousderivative at x = x(0) with pprimec(x

(0)) 6= 0 then ζ(x(0)) gt 0

The χ2 statistic (64) can be written in terms of multinomial proportions prc withradicn(prc minus prc) 1 le r le R 1 le c le C converging in distribution to a multivariate

normal By Taylor expansion of Tn equivradicχ2n at τ = plim

radicχ2n we find that

22

Theorem 62 radicn(Tn minus τ) minusrarrd N(0 v2) (65)

where the asymptotic variance v2 which can be obtained from the Taylor expansion isnot needed in this paper

Remark 62 The result (65) applies to the multivariate χ2-based efficacies ζ with nreplaced by the number of observations that goes in to the computations of ζ In thiscase we use the notation χ2

j for the chi-square statistic for the jth variable

7 Consistency Of CATCH

Let Y denote a variable to be classified into one of the categories 1 C by usingthe Xjrsquos in the vector X = (X1 Xd) Let Xminusj equiv X minusXj denote X without the Xj

component Assume that the squared Pearson nonparametric correlation (Doksum andSamarov [4] Huang and Chen [9]) between Xj and Xminusj is less than 05 We call thevariable Xj relevant for Y if conditionally given Xminusj L(Y |Xj = x) is not constant in xand irrelevant otherwise Our data are (X(1) Y (1)) (X(n) Y (n)) iid as (X Y ) Avariable selection procedure is consistent if the probability that all the relevant variablesare selected and none of the irrelevant variables are selected tends to one as n tends toinfinity

We give a general nonparametric framework where the variable selection rule basedon the CATCH importance scores sj given in (38) is consistent

When the predictors are not all categorical and CATCH involves tubes we assumethat the number of observations in each tube tends toinfin as nrarrinfin and we assume thatfor those importance scores that require the selection of a section size h the selection isfrom a finite set hj with minjhj gt h0 for some h0 gt 0 It follows that the numberof observations mjk in the kth tube used to calculate the importance score sj for thevariable Xj tends to infinity as nrarrinfin

The key quantities that determine consistency are the limits λj of sj That is

λj = τ(ζj P ) = EP [ζj(X)] =

intζj(x)dP (x) j = 1 d

where ζj(x) is the local contingency efficacy defined by (24) for numerical Xrsquos and by(28) for categorical Xrsquos and P is the probability distribution of X Our estimate sjof τj is

sj = τ(ζj P ) =1

M

Msumk=1

ζj(Xlowastk)

where ζj is defined in section 3 and P is the empirical distritution of XLet d0 denote the number of Xrsquos that are irrelevant for Y Set d1 = d minus d0 and

without loss of generality reorder the λj so that

23

λj gt 0 if j = 1 d1λj = 0 if j = d1 + 1 d

(71)

Our variable selection rule is

ldquokeep variable Xj if sj ge t for some threshold trdquo (72)

Rule (72) is consistent if and only if

P (minjled1sj ge t and max

jgtd1sj lt t) minusrarr 1 (73)

To establish (73) it is enough to show that

P (maxjgtd1sj gt t) minusrarr 0 (74)

andP (min

jled1sj le t) minusrarr 0 (75)

We call (74) and (75) type I and type II consistency respectively Type I and IIerror probabilities refer to probabilities of any false positives and any false negativesrespectively

71 Type I consistency

We consider (74) first and note that

P (maxjgtd1sj gt t) le

dsumj=d1+1

P (sj gt t) (76)

For the jth irrelevant variable Xj j gt d1 the results in Section 6 apply to the local

efficacy ζj(x) at a fixed point x The importance score sj defined in (38) is the averageof such efficacies over a sample of M random points xlowasta We assume that there is apoint x(0) such that for irrelevant xjrsquos and for n large enough ζj(x

(0)) is stochastically

larger in the right tail than the average Mminus1sumMa=1 ζj(x

lowasta) That is we assume that

there exist t gt 0 and n0 such that for n ge n0

p(sj gt t) le P (s(0)j gt t) j gt d1 (77)

The existence of such a point x(0) is established by considering the probability limit of

arg maxζj(xlowasta) xlowasta isin xlowast1 middot middot middot xlowastM

Note by Lemma (61) s(0)j rarrp 0 as n rarr infin It follows that we have established

24

(74) for t and d0 = d minus d1 that are fixed as n rarr infin To examine the interesting casewhere d0 rarr infin as n rarr infin we need large deviation results which we turn to next Inthe d0 rarrinfin case we consider t equiv tn that tend to infin as nrarrinfin and we use the weakerassumption there exists x(0) and n0 such that

P (sj gt tn) le P (s(0)j gt tn) (78)

for n ge n0 and each tn with limnrarrinfin tn = infin We also assume that the number ofcolumns C and rows R in the contingency table that define the efficacy χ2

jn given by

(64) stays fixed as nrarrinfin In this case P (s(0)j gt tn)rarr 0 provided each term

[T (j)rc ]2 equiv

[nrcErcradicnradicErc

]2in s2j = χ2

jn defined for Xj by (64) satisfies

P(|T (j)rc | gt tn

)minusrarr 0

This is because

P

radicsumr

sumc

[T(j)rc ]2 gt tn

le P(RC max

rc[T (j)rc ]2 gt t2n

)le RC maxP

(|T (j)rc | gt tn

radicRC)

and tn and tnradicRC are of the same order as n rarr infin By large deviation theory Hall

[8] and Jing et al [10] for some constant A

P (|T (j)rc | gt tn) =

2tminus1nradic2πeminust

2n2[1 + At3nn

minus 12 + o

(t3nnminus 1

2

)](79)

provided tn = o(n

16

) If (79) is uniform in j gt d1 then (76) and (79) implies that

type I consistency holds if

d0

(tnradic

2

)minus1eminus

t2n2 rarr 0 (710)

To examine (710) write d0 and tn as d0 = exp(nb)

and tn =radic

2nr By large deviationtheory (710) holds when r lt 1

6 Thus if 0 lt 1

2b lt r lt 1

6 then

d0

(tnradic

2

)minus1eminus

t2n2 = nminusr exp

(nb minus n2r

)rarr 0

25

Thus d0 = exp(nb) with b = 13ε for any ε gt 0 is the largest d0 can be to have consistency

For instance if there are exactly d0 = exp(n

14

)irrelevant variables and we use the

threshold tn =radic

2n17 the probability of selecting any of the irrelevant variables tends

to zero at the ratenminus

14 exp

(n

14 minus n

27

)72 Type II consistency

We now assume that for each relevant variable Xj there is a least favorable point x(1)

such that the local efficacy s(1)j equiv ζ(x(1)) is stochastically smaller than the importance

score sj = Mminus1sumMa=1 ζ(xlowasta) That is we assume that for each relevant variable Xj there

exists n1 and x(1) such that for n ge n1

P(s(1)j le tn

)ge P (sj le tn) j le d1 (711)

The existence of such a least favorable x(1) is established by considering the probabilitylimit of

arg minζj(xlowasta) xlowasta isin xlowast1 middot middot middot xlowastM

We again assume that R and C are fixed so that to show type II consistency it isenough to show that for each (r c) and j le d1

P(|T (j)rc | le tn

)rarr 0

This is because

P

(radicsumr

sumc

[T

(j)rc

]2le tn

)le P (minrc

[T

(j)rc

]2le t2n)

le RC maxrc P(|T (j)rc | le tn

)By Theorem 62

radicn(s(1)j minus τj

)rarrd N(0 ν2) τj gt 0 It follows that if tn and d1

are bounded above as n rarr infin and τj gt 0 then P(s(1)j le tn

)rarr 0 Thus type II

consistency holds in this case To examine the more interesting case where tn and d1are not bounded above and τj may tend to zero as nrarrinfin we will need to center the

T(j)rc Thus set

θ(j)n =

radicn(p(j)rc minus p(j)r p

(j)c

)radicp(j)r p

(j)c

(712)

26

Tn(j) = T (j)rc minus θ(j)n

where for simplicity the dependence of T(j)n and θ

(j)n on r and c is not indicated

Lemma 71

P

(minjled1|T (j)rc | le tn

)le

d1sumj=1

P

(|T (j)n | ge min

jled1θ(j)n minus tn

)Proof

P(

minjled1 |T(j)rc | le tn

)= P

(minjled1 |T

(j)n + θ

(j)n | le tn

)le P

(minjled1 |T

(j)n | minusmaxjled1 |θ

(j)n | le tn

)= P

(maxjled1 |T

(j)n | ge minjled1 |θ

(j)n | minus tn

)lesumd1

j=1 P(|T (j)n | ge minjled1 |θ

(j)n | minus tn

)

Setan = min

jled1

∣∣θ(j)n ∣∣minus tnthen by large deviation theory if an = o

(n

16

)

P (∣∣T (j)n

∣∣ ge an) prop

(anradic

2

)minus1expminusa2n2 (713)

where prop signifies asymptotic order as nrarrinfin an rarrinfin It follows that if (713) holdsuniformly in j then by Lemma 71

P

(minjled1

∣∣T (j)rc

∣∣ le tn

)prop d1

(anradic

2

)minus1expminusa2n2

Thus type II consistency holds when an = o(n

16

)and

d1

(anradic

2

)minus1expminusa2n2 rarr 0 (714)

In Section 71 tn is of order nr 0 lt r lt 16 For this tn type II consistency holds if an

is of order nr 0 lt r lt 16 and

d1nminusr expminusn2r2 rarr 0

27

If an = O(nk) with k ge 16 then P (

∣∣∣T (j)(n)

∣∣∣ ge an) tends to zero at a rate faster than in

(713) and consistency is ensured For instance if all relevant variables Xj have p(j)rc

fixed as nrarrinfin then minjled1

∣∣∣θ(j)n ∣∣∣ is of orderradicn and an is of order n

12 minus tn

73 Type I and II consistency

We see from Sections 71 and 72 that consistency holds under a variety of conditionsOne set of conditions is

Theorem 71 Under the regularization conditions of Sections 71 and 72 if

d0 = c0 exp(nb) tn = c1nr

minjled1 θ(j) minus tn = c2nr d1 = c3 exp

(nb)

for positive constants c0 c1 c2 and c3 and 0 lt 12b lt r lt 1

6 then CATCH is consistent

AppendixProof of Theorem 61

(1) Let Nh =

sumni=1 I(X(i) isin Nh(x

(0))) Then Nh sim Bi(n

int x(0)+hx(0)minush f(x)dx) and

Nh rarras infin as nrarrinfin

ζ(x(0) h) = nminus12radicX 2(x(0) h)

= (Nh n)12(N

h )minus12radicX 2(x(0) h)

Since Nh sim Bi(n

int x(0)+hx(0)minush f(x)dx) we have

Nh nrarr

int x(0)+h

x(0)minushf(x)dx (715)

By Lemma 61 and Table (1)

(Nh )minus12

radicX 2(x(0) h)rarr

( sumr=+minus

Csumc=1

(prc minus prqc)2

prqc

) 12

(716)

where

p+c = Pr(X gt x0 Y = c|X isin Nh(x(0))) pminusc = Pr(X le x0 Y = c|X isin Nh(x

(0)))

28

for r = +minus and c = 1 middot middot middot C

pr =Csumc=1

prc qc =sumr=+minus

prc

Actually p+ = Pr(X gt x0|X isin Nh(x(0))) pminus = Pr(X le x0|X isin Nh(x

(0)))qc = Pr(Y = c|X isin Nh(x

(0)))

By (715) and (716)

ζ(x(0) h)rarr(int x(0)+h

x(0)minushf(x)dx

) 12( sumr=+minus

Csumc=1

(prc minus prqc)2

prqc

) 12

equiv ζ(x(0) h)

(2) Since X and Y are independent prc = prqc for r = +minus and c = 1 middot middot middot C thenζ(x(0) h) = 0 for any h Hence

ζ(x(0)) = suphgt0ζ(x(0) h) = 0

(3) Recall that pc(x(0)) = Pr(Y = c|x(0)) Without loss of generality assume pprimec(x

(0)) gt0 Then there exists h such that

Pr(Y = c|x(0) lt X lt x(0) + h) gt Pr(Y = c|x(0) minus h lt X lt x(0))

which is equivelant to p+cp+ gt pminuscpminus This shows that ζ(x(0)) gt 0 since byLemma 61 ζ(x(0)) = 0 results in p+cp+ = pminuscpminus = pc

References

[1] Breiman L 2001 Random forests Machine Learning 45 5ndash32

[2] Breiman L Friedman J H Olshen R A Stone J 1984 Classification andregression trees Wadsworth Belmont

[3] Doksum K Tang S Tsui K-W 2008 Nonparametric variable selection Theearth algorithm Journal of the American Statistical Association 103(484) 1609ndash1620

[4] Doksum K A Samarov A 1995 Nonparametric estimation of global functionalsand a measure of explanatory power of covariates in regression Annals of Statistics23 1443ndash1473

29

[5] Doksum K A Schafer C 2006 Powerful choices Tuning parameter selectionbased on power Frontiers in Statistics Dedicated to Peter Bickel Editors J Fanand H L Koul Imperial College Press 113ndash141

[6] Friedman J 1991 Multivariate adaptive regression splines The annals of statis-tics 1ndash67

[7] Gao J Gijbels I 2008 Bandwidth selection in nonparametric kernel testingJournal of the American Statistical Association 103 (484) 1584ndash1594

[8] Hall P 1992 The Bootstrap and Edgeworth Expansions Springer New York

[9] Huang L-S Chen J 2008 Analysis of variance coefficient of determinationand f-test for local polynomial regression Annals of Statistics 36 2085ndash2109

[10] Jing B-Y Shao Q-M Wang Q 2003 Self-normalized cramer-type large devi-ations for independent random variables Annals of Probability 31 2167ndash2215

[11] Lin D Y Zeng D 2011 Correcting for population stratification in genomewideassociation studies Journal of the American Statistical Association 106 997ndash1008

[12] Loh W-Y 2009 Improve the precision of classification trees The Annals ofApplied Statistics 3 1710ndash1737

[13] Price A Patterson N Plenge R Weinblatt M Shadick N Reich D 2006Principal components analysis corrects for stratification in genome-wide associationstudies Nature genetics 38 (8) 904ndash909

[14] Qiao Z Zhou L Huang J 2008 Linear discriminant analysis for high dimen-sional Low Sample Size Data special issue of World Congress on Engineering2008

[15] Quinlan J R 1987 Simplifying decision trees International Journal of Man-Machine Studies 27(3) 221ndash234

[16] Schafer C Doksum K A 2009 Selecting local models in multiple regression bymaximizing power Metrika 69 283ndash304

[17] Tibshirani R 1996 Regression shrinkage and selection via the lasso J RoyalStatist Soc B 58 267ndash288

[18] Wang L Shen X 2007 On l1-norm multi-class support vector machinesmethodology and theory Journal of the American Statistical Association 102 583ndash594

[19] Zhang H H Liu Y Wu Y Zhu J 2008 Variable selection for the multicate-gory svm via adaptive sup-norm regularization Electronic Journal of Statistics 2149ndash167

30

  • Introduction
  • Importance Scores for Univariate Classification
    • Numerical Predictor Local Contingency Efficacy
    • Categorical Predictor Contingency Efficacy
      • Variable Selection for Classification Problems The CATCH Algorithm
        • Numerical Predictors
        • Numerical and Categorical Predictors
        • The Empirical Importance Scores
        • Adaptive Tube Distances
        • The CATCH Algorithm
          • Simulation Studies
            • Variable Selection with Categorical and Numerical Predictors
            • Improving The Performance of SVM and RF Classifiers
              • CATCH Analysis of The ANN Data
              • Properties of Local Contingency Efficacy
              • Consistency Of CATCH
                • Type I consistency
                • Type II consistency
                • Type I and II consistency
                  • Proof of Theorem 61
Page 24: Nonparametric Variable Selection and Classi cation: The ...lc436/Tang_Chen_etal_2012_CSDA.pdf · pruning, will choose a subset of optimal splitting variables to be the most signi

and

nrmiddot =Csumc=1

nrc nmiddotc =Rsumi=1

nrc n =Rsumi=1

Csumc=1

nrc Erc = nrmiddotnmiddotcn (62)

Consider testing that the rows and the columns are independent ie

H0 forallr c prc = prqc versus H1 existr c prc 6= prqc (63)

a standard test statistic is

X 2 =Rsumr=1

Csumc=1

(nrc minus Erc)2

Erc (64)

Under H0 X 2 converges in distribution to χ2(Rminus1)(Cminus1) Moreover define 00 = 0 we

have the well known result

Lemma 61 As n tends to infinity X 2n tends in probability to

τ 2 equivRsumr=1

Csumc=1

(prc minus prqc)2

prqc

Moreover if prqc is bounded away from 0 and 1 as n rarr infin and if R and C are fixedas nrarrinfin then τ 2 = 0 if and only if H0 is true

Remark 61 Lemma 61 shows that the contingency efficacy ζ in (28) is well definedMoreover ζ = 0 if and only if X and Y are independent

Theorem 61 Consider a categorical response Y and a continuous predictor X withdensity function f(x) such that for a fixed h gt 0 f(x) gt 0 for x isin (x(0) minus h x(0) + h)then

(1) For fixed h the estimated local section efficacy ζ(x(0) h) defined in (25) convergesalmost surly to the section efficacy ζ(x(0) h) defined in (24)

(2) If X and Y are independent then the local efficacy ζ(x(0)) defined in (24) is zero

(3) If there exists c isin 1 middot middot middot C such that pc(x) = P (Y = c|x) has a continuousderivative at x = x(0) with pprimec(x

(0)) 6= 0 then ζ(x(0)) gt 0

The χ2 statistic (64) can be written in terms of multinomial proportions prc withradicn(prc minus prc) 1 le r le R 1 le c le C converging in distribution to a multivariate

normal By Taylor expansion of Tn equivradicχ2n at τ = plim

radicχ2n we find that

22

Theorem 62 radicn(Tn minus τ) minusrarrd N(0 v2) (65)

where the asymptotic variance v2 which can be obtained from the Taylor expansion isnot needed in this paper

Remark 62 The result (65) applies to the multivariate χ2-based efficacies ζ with nreplaced by the number of observations that goes in to the computations of ζ In thiscase we use the notation χ2

j for the chi-square statistic for the jth variable

7 Consistency Of CATCH

Let Y denote a variable to be classified into one of the categories 1 C by usingthe Xjrsquos in the vector X = (X1 Xd) Let Xminusj equiv X minusXj denote X without the Xj

component Assume that the squared Pearson nonparametric correlation (Doksum andSamarov [4] Huang and Chen [9]) between Xj and Xminusj is less than 05 We call thevariable Xj relevant for Y if conditionally given Xminusj L(Y |Xj = x) is not constant in xand irrelevant otherwise Our data are (X(1) Y (1)) (X(n) Y (n)) iid as (X Y ) Avariable selection procedure is consistent if the probability that all the relevant variablesare selected and none of the irrelevant variables are selected tends to one as n tends toinfinity

We give a general nonparametric framework where the variable selection rule basedon the CATCH importance scores sj given in (38) is consistent

When the predictors are not all categorical and CATCH involves tubes we assumethat the number of observations in each tube tends toinfin as nrarrinfin and we assume thatfor those importance scores that require the selection of a section size h the selection isfrom a finite set hj with minjhj gt h0 for some h0 gt 0 It follows that the numberof observations mjk in the kth tube used to calculate the importance score sj for thevariable Xj tends to infinity as nrarrinfin

The key quantities that determine consistency are the limits λj of sj That is

λj = τ(ζj P ) = EP [ζj(X)] =

intζj(x)dP (x) j = 1 d

where ζj(x) is the local contingency efficacy defined by (24) for numerical Xrsquos and by(28) for categorical Xrsquos and P is the probability distribution of X Our estimate sjof τj is

sj = τ(ζj P ) =1

M

Msumk=1

ζj(Xlowastk)

where ζj is defined in section 3 and P is the empirical distritution of XLet d0 denote the number of Xrsquos that are irrelevant for Y Set d1 = d minus d0 and

without loss of generality reorder the λj so that

23

λj gt 0 if j = 1 d1λj = 0 if j = d1 + 1 d

(71)

Our variable selection rule is

ldquokeep variable Xj if sj ge t for some threshold trdquo (72)

Rule (72) is consistent if and only if

P (minjled1sj ge t and max

jgtd1sj lt t) minusrarr 1 (73)

To establish (73) it is enough to show that

P (maxjgtd1sj gt t) minusrarr 0 (74)

andP (min

jled1sj le t) minusrarr 0 (75)

We call (74) and (75) type I and type II consistency respectively Type I and IIerror probabilities refer to probabilities of any false positives and any false negativesrespectively

71 Type I consistency

We consider (74) first and note that

P (maxjgtd1sj gt t) le

dsumj=d1+1

P (sj gt t) (76)

For the jth irrelevant variable Xj j gt d1 the results in Section 6 apply to the local

efficacy ζj(x) at a fixed point x The importance score sj defined in (38) is the averageof such efficacies over a sample of M random points xlowasta We assume that there is apoint x(0) such that for irrelevant xjrsquos and for n large enough ζj(x

(0)) is stochastically

larger in the right tail than the average Mminus1sumMa=1 ζj(x

lowasta) That is we assume that

there exist t gt 0 and n0 such that for n ge n0

p(sj gt t) le P (s(0)j gt t) j gt d1 (77)

The existence of such a point x(0) is established by considering the probability limit of

arg maxζj(xlowasta) xlowasta isin xlowast1 middot middot middot xlowastM

Note by Lemma (61) s(0)j rarrp 0 as n rarr infin It follows that we have established

24

(74) for t and d0 = d minus d1 that are fixed as n rarr infin To examine the interesting casewhere d0 rarr infin as n rarr infin we need large deviation results which we turn to next Inthe d0 rarrinfin case we consider t equiv tn that tend to infin as nrarrinfin and we use the weakerassumption there exists x(0) and n0 such that

P (sj gt tn) le P (s(0)j gt tn) (78)

for n ge n0 and each tn with limnrarrinfin tn = infin We also assume that the number ofcolumns C and rows R in the contingency table that define the efficacy χ2

jn given by

(64) stays fixed as nrarrinfin In this case P (s(0)j gt tn)rarr 0 provided each term

[T (j)rc ]2 equiv

[nrcErcradicnradicErc

]2in s2j = χ2

jn defined for Xj by (64) satisfies

P(|T (j)rc | gt tn

)minusrarr 0

This is because

P

radicsumr

sumc

[T(j)rc ]2 gt tn

le P(RC max

rc[T (j)rc ]2 gt t2n

)le RC maxP

(|T (j)rc | gt tn

radicRC)

and tn and tnradicRC are of the same order as n rarr infin By large deviation theory Hall

[8] and Jing et al [10] for some constant A

P (|T (j)rc | gt tn) =

2tminus1nradic2πeminust

2n2[1 + At3nn

minus 12 + o

(t3nnminus 1

2

)](79)

provided tn = o(n

16

) If (79) is uniform in j gt d1 then (76) and (79) implies that

type I consistency holds if

d0

(tnradic

2

)minus1eminus

t2n2 rarr 0 (710)

To examine (710) write d0 and tn as d0 = exp(nb)

and tn =radic

2nr By large deviationtheory (710) holds when r lt 1

6 Thus if 0 lt 1

2b lt r lt 1

6 then

d0

(tnradic

2

)minus1eminus

t2n2 = nminusr exp

(nb minus n2r

)rarr 0

25

Thus d0 = exp(nb) with b = 13ε for any ε gt 0 is the largest d0 can be to have consistency

For instance if there are exactly d0 = exp(n

14

)irrelevant variables and we use the

threshold tn =radic

2n17 the probability of selecting any of the irrelevant variables tends

to zero at the ratenminus

14 exp

(n

14 minus n

27

)72 Type II consistency

We now assume that for each relevant variable Xj there is a least favorable point x(1)

such that the local efficacy s(1)j equiv ζ(x(1)) is stochastically smaller than the importance

score sj = Mminus1sumMa=1 ζ(xlowasta) That is we assume that for each relevant variable Xj there

exists n1 and x(1) such that for n ge n1

P(s(1)j le tn

)ge P (sj le tn) j le d1 (711)

The existence of such a least favorable x(1) is established by considering the probabilitylimit of

arg minζj(xlowasta) xlowasta isin xlowast1 middot middot middot xlowastM

We again assume that R and C are fixed so that to show type II consistency it isenough to show that for each (r c) and j le d1

P(|T (j)rc | le tn

)rarr 0

This is because

P

(radicsumr

sumc

[T

(j)rc

]2le tn

)le P (minrc

[T

(j)rc

]2le t2n)

le RC maxrc P(|T (j)rc | le tn

)By Theorem 62

radicn(s(1)j minus τj

)rarrd N(0 ν2) τj gt 0 It follows that if tn and d1

are bounded above as n rarr infin and τj gt 0 then P(s(1)j le tn

)rarr 0 Thus type II

consistency holds in this case To examine the more interesting case where tn and d1are not bounded above and τj may tend to zero as nrarrinfin we will need to center the

T(j)rc Thus set

θ(j)n =

radicn(p(j)rc minus p(j)r p

(j)c

)radicp(j)r p

(j)c

(712)

26

Tn(j) = T (j)rc minus θ(j)n

where for simplicity the dependence of T(j)n and θ

(j)n on r and c is not indicated

Lemma 71

P

(minjled1|T (j)rc | le tn

)le

d1sumj=1

P

(|T (j)n | ge min

jled1θ(j)n minus tn

)Proof

P(

minjled1 |T(j)rc | le tn

)= P

(minjled1 |T

(j)n + θ

(j)n | le tn

)le P

(minjled1 |T

(j)n | minusmaxjled1 |θ

(j)n | le tn

)= P

(maxjled1 |T

(j)n | ge minjled1 |θ

(j)n | minus tn

)lesumd1

j=1 P(|T (j)n | ge minjled1 |θ

(j)n | minus tn

)

Setan = min

jled1

∣∣θ(j)n ∣∣minus tnthen by large deviation theory if an = o

(n

16

)

P (∣∣T (j)n

∣∣ ge an) prop

(anradic

2

)minus1expminusa2n2 (713)

where prop signifies asymptotic order as nrarrinfin an rarrinfin It follows that if (713) holdsuniformly in j then by Lemma 71

P

(minjled1

∣∣T (j)rc

∣∣ le tn

)prop d1

(anradic

2

)minus1expminusa2n2

Thus type II consistency holds when an = o(n

16

)and

d1

(anradic

2

)minus1expminusa2n2 rarr 0 (714)

In Section 71 tn is of order nr 0 lt r lt 16 For this tn type II consistency holds if an

is of order nr 0 lt r lt 16 and

d1nminusr expminusn2r2 rarr 0

27

If an = O(nk) with k ge 16 then P (

∣∣∣T (j)(n)

∣∣∣ ge an) tends to zero at a rate faster than in

(713) and consistency is ensured For instance if all relevant variables Xj have p(j)rc

fixed as nrarrinfin then minjled1

∣∣∣θ(j)n ∣∣∣ is of orderradicn and an is of order n

12 minus tn

73 Type I and II consistency

We see from Sections 71 and 72 that consistency holds under a variety of conditionsOne set of conditions is

Theorem 71 Under the regularization conditions of Sections 71 and 72 if

d0 = c0 exp(nb) tn = c1nr

minjled1 θ(j) minus tn = c2nr d1 = c3 exp

(nb)

for positive constants c0 c1 c2 and c3 and 0 lt 12b lt r lt 1

6 then CATCH is consistent

AppendixProof of Theorem 61

(1) Let Nh =

sumni=1 I(X(i) isin Nh(x

(0))) Then Nh sim Bi(n

int x(0)+hx(0)minush f(x)dx) and

Nh rarras infin as nrarrinfin

ζ(x(0) h) = nminus12radicX 2(x(0) h)

= (Nh n)12(N

h )minus12radicX 2(x(0) h)

Since Nh sim Bi(n

int x(0)+hx(0)minush f(x)dx) we have

Nh nrarr

int x(0)+h

x(0)minushf(x)dx (715)

By Lemma 61 and Table (1)

(Nh )minus12

radicX 2(x(0) h)rarr

( sumr=+minus

Csumc=1

(prc minus prqc)2

prqc

) 12

(716)

where

p+c = Pr(X gt x0 Y = c|X isin Nh(x(0))) pminusc = Pr(X le x0 Y = c|X isin Nh(x

(0)))

28

for r = +minus and c = 1 middot middot middot C

pr =Csumc=1

prc qc =sumr=+minus

prc

Actually p+ = Pr(X gt x0|X isin Nh(x(0))) pminus = Pr(X le x0|X isin Nh(x

(0)))qc = Pr(Y = c|X isin Nh(x

(0)))

By (715) and (716)

ζ(x(0) h)rarr(int x(0)+h

x(0)minushf(x)dx

) 12( sumr=+minus

Csumc=1

(prc minus prqc)2

prqc

) 12

equiv ζ(x(0) h)

(2) Since X and Y are independent prc = prqc for r = +minus and c = 1 middot middot middot C thenζ(x(0) h) = 0 for any h Hence

ζ(x(0)) = suphgt0ζ(x(0) h) = 0

(3) Recall that pc(x(0)) = Pr(Y = c|x(0)) Without loss of generality assume pprimec(x

(0)) gt0 Then there exists h such that

Pr(Y = c|x(0) lt X lt x(0) + h) gt Pr(Y = c|x(0) minus h lt X lt x(0))

which is equivelant to p+cp+ gt pminuscpminus This shows that ζ(x(0)) gt 0 since byLemma 61 ζ(x(0)) = 0 results in p+cp+ = pminuscpminus = pc

References

[1] Breiman L 2001 Random forests Machine Learning 45 5ndash32

[2] Breiman L Friedman J H Olshen R A Stone J 1984 Classification andregression trees Wadsworth Belmont

[3] Doksum K Tang S Tsui K-W 2008 Nonparametric variable selection Theearth algorithm Journal of the American Statistical Association 103(484) 1609ndash1620

[4] Doksum K A Samarov A 1995 Nonparametric estimation of global functionalsand a measure of explanatory power of covariates in regression Annals of Statistics23 1443ndash1473

29

[5] Doksum K A Schafer C 2006 Powerful choices Tuning parameter selectionbased on power Frontiers in Statistics Dedicated to Peter Bickel Editors J Fanand H L Koul Imperial College Press 113ndash141

[6] Friedman J 1991 Multivariate adaptive regression splines The annals of statis-tics 1ndash67

[7] Gao J Gijbels I 2008 Bandwidth selection in nonparametric kernel testingJournal of the American Statistical Association 103 (484) 1584ndash1594

[8] Hall P 1992 The Bootstrap and Edgeworth Expansions Springer New York

[9] Huang L-S Chen J 2008 Analysis of variance coefficient of determinationand f-test for local polynomial regression Annals of Statistics 36 2085ndash2109

[10] Jing B-Y Shao Q-M Wang Q 2003 Self-normalized cramer-type large devi-ations for independent random variables Annals of Probability 31 2167ndash2215

[11] Lin D Y Zeng D 2011 Correcting for population stratification in genomewideassociation studies Journal of the American Statistical Association 106 997ndash1008

[12] Loh W-Y 2009 Improve the precision of classification trees The Annals ofApplied Statistics 3 1710ndash1737

[13] Price A Patterson N Plenge R Weinblatt M Shadick N Reich D 2006Principal components analysis corrects for stratification in genome-wide associationstudies Nature genetics 38 (8) 904ndash909

[14] Qiao Z Zhou L Huang J 2008 Linear discriminant analysis for high dimen-sional Low Sample Size Data special issue of World Congress on Engineering2008

[15] Quinlan J R 1987 Simplifying decision trees International Journal of Man-Machine Studies 27(3) 221ndash234

[16] Schafer C Doksum K A 2009 Selecting local models in multiple regression bymaximizing power Metrika 69 283ndash304

[17] Tibshirani R 1996 Regression shrinkage and selection via the lasso J RoyalStatist Soc B 58 267ndash288

[18] Wang L Shen X 2007 On l1-norm multi-class support vector machinesmethodology and theory Journal of the American Statistical Association 102 583ndash594

[19] Zhang H H Liu Y Wu Y Zhu J 2008 Variable selection for the multicate-gory svm via adaptive sup-norm regularization Electronic Journal of Statistics 2149ndash167

30

  • Introduction
  • Importance Scores for Univariate Classification
    • Numerical Predictor Local Contingency Efficacy
    • Categorical Predictor Contingency Efficacy
      • Variable Selection for Classification Problems The CATCH Algorithm
        • Numerical Predictors
        • Numerical and Categorical Predictors
        • The Empirical Importance Scores
        • Adaptive Tube Distances
        • The CATCH Algorithm
          • Simulation Studies
            • Variable Selection with Categorical and Numerical Predictors
            • Improving The Performance of SVM and RF Classifiers
              • CATCH Analysis of The ANN Data
              • Properties of Local Contingency Efficacy
              • Consistency Of CATCH
                • Type I consistency
                • Type II consistency
                • Type I and II consistency
                  • Proof of Theorem 61
Page 25: Nonparametric Variable Selection and Classi cation: The ...lc436/Tang_Chen_etal_2012_CSDA.pdf · pruning, will choose a subset of optimal splitting variables to be the most signi

Theorem 62 radicn(Tn minus τ) minusrarrd N(0 v2) (65)

where the asymptotic variance v2 which can be obtained from the Taylor expansion isnot needed in this paper

Remark 62 The result (65) applies to the multivariate χ2-based efficacies ζ with nreplaced by the number of observations that goes in to the computations of ζ In thiscase we use the notation χ2

j for the chi-square statistic for the jth variable

7 Consistency Of CATCH

Let Y denote a variable to be classified into one of the categories 1 C by usingthe Xjrsquos in the vector X = (X1 Xd) Let Xminusj equiv X minusXj denote X without the Xj

component Assume that the squared Pearson nonparametric correlation (Doksum andSamarov [4] Huang and Chen [9]) between Xj and Xminusj is less than 05 We call thevariable Xj relevant for Y if conditionally given Xminusj L(Y |Xj = x) is not constant in xand irrelevant otherwise Our data are (X(1) Y (1)) (X(n) Y (n)) iid as (X Y ) Avariable selection procedure is consistent if the probability that all the relevant variablesare selected and none of the irrelevant variables are selected tends to one as n tends toinfinity

We give a general nonparametric framework where the variable selection rule basedon the CATCH importance scores sj given in (38) is consistent

When the predictors are not all categorical and CATCH involves tubes we assumethat the number of observations in each tube tends toinfin as nrarrinfin and we assume thatfor those importance scores that require the selection of a section size h the selection isfrom a finite set hj with minjhj gt h0 for some h0 gt 0 It follows that the numberof observations mjk in the kth tube used to calculate the importance score sj for thevariable Xj tends to infinity as nrarrinfin

The key quantities that determine consistency are the limits λj of sj That is

λj = τ(ζj P ) = EP [ζj(X)] =

intζj(x)dP (x) j = 1 d

where ζj(x) is the local contingency efficacy defined by (24) for numerical Xrsquos and by(28) for categorical Xrsquos and P is the probability distribution of X Our estimate sjof τj is

sj = τ(ζj P ) =1

M

Msumk=1

ζj(Xlowastk)

where ζj is defined in section 3 and P is the empirical distritution of XLet d0 denote the number of Xrsquos that are irrelevant for Y Set d1 = d minus d0 and

without loss of generality reorder the λj so that

23

λj gt 0 if j = 1 d1λj = 0 if j = d1 + 1 d

(71)

Our variable selection rule is

ldquokeep variable Xj if sj ge t for some threshold trdquo (72)

Rule (72) is consistent if and only if

P (minjled1sj ge t and max

jgtd1sj lt t) minusrarr 1 (73)

To establish (73) it is enough to show that

P (maxjgtd1sj gt t) minusrarr 0 (74)

andP (min

jled1sj le t) minusrarr 0 (75)

We call (74) and (75) type I and type II consistency respectively Type I and IIerror probabilities refer to probabilities of any false positives and any false negativesrespectively

7.1 Type I consistency

We consider (7.4) first and note that
$$P\left(\max_{j > d_1} s_j > t\right) \le \sum_{j = d_1 + 1}^{d} P(s_j > t). \qquad (7.6)$$

For the $j$th irrelevant variable $X_j$, $j > d_1$, the results in Section 6 apply to the local efficacy $\zeta_j(x)$ at a fixed point $x$. The importance score $s_j$ defined in (3.8) is the average of such efficacies over a sample of $M$ random points $x^*_a$. We assume that there is a point $x^{(0)}$ such that, for irrelevant $X_j$'s and for $n$ large enough, $\hat{\zeta}_j(x^{(0)}) \equiv s^{(0)}_j$ is stochastically larger in the right tail than the average $M^{-1}\sum_{a=1}^{M} \hat{\zeta}_j(x^*_a)$. That is, we assume that there exist $t > 0$ and $n_0$ such that for $n \ge n_0$,
$$P(s_j > t) \le P(s^{(0)}_j > t), \qquad j > d_1. \qquad (7.7)$$
The existence of such a point $x^{(0)}$ is established by considering the probability limit of
$$\arg\max\{\hat{\zeta}_j(x^*_a) : x^*_a \in \{x^*_1, \cdots, x^*_M\}\}.$$

Note that, by Lemma 6.1, $s^{(0)}_j \rightarrow_p 0$ as $n \rightarrow \infty$. It follows that we have established (7.4) for $t$ and $d_0 = d - d_1$ that are fixed as $n \rightarrow \infty$. To examine the interesting case where $d_0 \rightarrow \infty$ as $n \rightarrow \infty$, we need large deviation results, to which we turn next. In the $d_0 \rightarrow \infty$ case we consider $t \equiv t_n$ tending to $\infty$ as $n \rightarrow \infty$, and we use the weaker assumption that there exist $x^{(0)}$ and $n_0$ such that
$$P(s_j > t_n) \le P(s^{(0)}_j > t_n) \qquad (7.8)$$

for $n \ge n_0$ and each $t_n$ with $\lim_{n \rightarrow \infty} t_n = \infty$. We also assume that the number of columns $C$ and rows $R$ in the contingency table that defines the efficacy $\chi^2_{jn}$ given by (6.4) stays fixed as $n \rightarrow \infty$. In this case $P(s^{(0)}_j > t_n) \rightarrow 0$ provided each term
$$[T^{(j)}_{rc}]^2 \equiv \left[\frac{n_{rc} - E_{rc}}{\sqrt{n}\,\sqrt{E_{rc}}}\right]^2$$
in $s_j^2 = \chi^2_{jn}$, defined for $X_j$ by (6.4), satisfies
$$P\left(|T^{(j)}_{rc}| > t_n\right) \longrightarrow 0.$$
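As a quick numerical check (a sketch with a randomly generated count table, not data from the paper), the cell terms $T^{(j)}_{rc}$ defined above satisfy $\sum_{r,c} [T^{(j)}_{rc}]^2 = \mathcal{X}^2/n$, where $\mathcal{X}^2$ is the Pearson chi-square statistic of the $R \times C$ table:

```python
import numpy as np

rng = np.random.default_rng(1)
table = rng.integers(5, 40, size=(2, 3)).astype(float)    # an arbitrary R x C count table
n = table.sum()
E = np.outer(table.sum(axis=1), table.sum(axis=0)) / n     # expected counts E_rc = n_{r.} n_{.c} / n

T = (table - E) / (np.sqrt(n) * np.sqrt(E))                # T_rc = (n_rc - E_rc) / (sqrt(n) sqrt(E_rc))
pearson_chi2 = ((table - E) ** 2 / E).sum()                # Pearson chi-square statistic

# The standardized cell terms recover the chi-square statistic scaled by n:
assert np.isclose((T ** 2).sum(), pearson_chi2 / n)
```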

This is because
$$P\left(\sqrt{\textstyle\sum_r \sum_c [T^{(j)}_{rc}]^2} > t_n\right) \le P\left(RC \max_{r,c} [T^{(j)}_{rc}]^2 > t_n^2\right) \le RC \max_{r,c} P\left(|T^{(j)}_{rc}| > t_n/\sqrt{RC}\right)$$

and $t_n$ and $t_n/\sqrt{RC}$ are of the same order as $n \rightarrow \infty$. By large deviation theory (Hall [8], Jing et al. [10]), for some constant $A$,
$$P\left(|T^{(j)}_{rc}| > t_n\right) = \frac{2 t_n^{-1}}{\sqrt{2\pi}}\, e^{-t_n^2/2}\left[1 + A t_n^3 n^{-\frac{1}{2}} + o\left(t_n^3 n^{-\frac{1}{2}}\right)\right] \qquad (7.9)$$
provided $t_n = o(n^{\frac{1}{6}})$.
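The leading factor in (7.9) is the Mills-ratio approximation to a standard normal tail, $P(|Z| > t) \approx \frac{2}{t\sqrt{2\pi}} e^{-t^2/2}$; the short check below (a sketch using the normal survival function) shows how quickly the two agree as $t$ grows:

```python
import numpy as np
from scipy.stats import norm

for t in (2.0, 3.0, 4.0, 5.0):
    exact = 2 * norm.sf(t)                                     # P(|Z| > t) for Z ~ N(0, 1)
    approx = 2 / (t * np.sqrt(2 * np.pi)) * np.exp(-t**2 / 2)  # leading term of (7.9)
    print(f"t = {t:.0f}:  2 P(Z > t) = {exact:.3e},  approximation = {approx:.3e}")
```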

If (7.9) is uniform in $j > d_1$, then (7.6) and (7.9) imply that type I consistency holds if
$$d_0 \left(t_n/\sqrt{2}\right)^{-1} e^{-t_n^2/2} \rightarrow 0. \qquad (7.10)$$

To examine (7.10), write $d_0$ and $t_n$ as $d_0 = \exp(n^b)$ and $t_n = \sqrt{2}\, n^r$. By large deviation theory, (7.10) holds when $r < \frac{1}{6}$. Thus if $0 < \frac{1}{2} b < r < \frac{1}{6}$, then
$$d_0 \left(t_n/\sqrt{2}\right)^{-1} e^{-t_n^2/2} = n^{-r} \exp\left(n^b - n^{2r}\right) \rightarrow 0.$$

Thus $d_0 = \exp(n^b)$ with $b = \frac{1}{3} - \varepsilon$, for any $\varepsilon > 0$, is the largest $d_0$ can be to have consistency. For instance, if there are exactly $d_0 = \exp(n^{\frac{1}{4}})$ irrelevant variables and we use the threshold $t_n = \sqrt{2}\, n^{\frac{1}{7}}$, the probability of selecting any of the irrelevant variables tends to zero at the rate $n^{-\frac{1}{7}} \exp\left(n^{\frac{1}{4}} - n^{\frac{2}{7}}\right)$.
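As a numerical illustration of this rate (a sketch that ignores the constants implicit in (7.10)), the bound $n^{-r}\exp(n^b - n^{2r})$ with $b = 1/4$ and $r = 1/7$ can be evaluated directly to see that it decreases, albeit slowly:

```python
import numpy as np

b, r = 1/4, 1/7
for n in (1e2, 1e4, 1e6, 1e8):
    # d_0 (t_n / sqrt(2))^{-1} exp(-t_n^2 / 2) with d_0 = exp(n^b), t_n = sqrt(2) n^r
    bound = n**(-r) * np.exp(n**b - n**(2 * r))
    print(f"n = {n:.0e}:  bound = {bound:.3e}")
```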

7.2 Type II consistency

We now assume that for each relevant variable $X_j$ there is a least favorable point $x^{(1)}$ such that the local efficacy $s^{(1)}_j \equiv \hat{\zeta}_j(x^{(1)})$ is stochastically smaller than the importance score $s_j = M^{-1}\sum_{a=1}^{M} \hat{\zeta}_j(x^*_a)$. That is, we assume that for each relevant variable $X_j$ there exist $n_1$ and $x^{(1)}$ such that for $n \ge n_1$,
$$P\left(s^{(1)}_j \le t_n\right) \ge P(s_j \le t_n), \qquad j \le d_1. \qquad (7.11)$$
The existence of such a least favorable $x^{(1)}$ is established by considering the probability limit of
$$\arg\min\{\hat{\zeta}_j(x^*_a) : x^*_a \in \{x^*_1, \cdots, x^*_M\}\}.$$

We again assume that $R$ and $C$ are fixed, so that to show type II consistency it is enough to show that for each $(r, c)$ and $j \le d_1$,
$$P\left(|T^{(j)}_{rc}| \le t_n\right) \rightarrow 0.$$
This is because
$$P\left(\sqrt{\textstyle\sum_r \sum_c [T^{(j)}_{rc}]^2} \le t_n\right) \le P\left(\min_{r,c} [T^{(j)}_{rc}]^2 \le t_n^2\right) \le RC \max_{r,c} P\left(|T^{(j)}_{rc}| \le t_n\right).$$

By Theorem 6.2, $\sqrt{n}\left(s^{(1)}_j - \tau_j\right) \rightarrow_d N(0, \nu^2)$, $\tau_j > 0$. It follows that if $t_n$ and $d_1$ are bounded above as $n \rightarrow \infty$ and $\tau_j > 0$, then $P\left(s^{(1)}_j \le t_n\right) \rightarrow 0$. Thus type II consistency holds in this case. To examine the more interesting case where $t_n$ and $d_1$ are not bounded above and $\tau_j$ may tend to zero as $n \rightarrow \infty$, we will need to center the

T(j)rc Thus set

θ(j)n =

radicn(p(j)rc minus p(j)r p

(j)c

)radicp(j)r p

(j)c

(712)

26

Tn(j) = T (j)rc minus θ(j)n

where for simplicity the dependence of T(j)n and θ

(j)n on r and c is not indicated

Lemma 7.1.
$$P\left(\min_{j \le d_1} |T^{(j)}_{rc}| \le t_n\right) \le \sum_{j=1}^{d_1} P\left(|T^{(j)}_n| \ge \min_{j \le d_1} |\theta^{(j)}_n| - t_n\right).$$

Proof.
$$P\left(\min_{j \le d_1} |T^{(j)}_{rc}| \le t_n\right) = P\left(\min_{j \le d_1} |T^{(j)}_n + \theta^{(j)}_n| \le t_n\right) \le P\left(\min_{j \le d_1} |\theta^{(j)}_n| - \max_{j \le d_1} |T^{(j)}_n| \le t_n\right)$$
$$= P\left(\max_{j \le d_1} |T^{(j)}_n| \ge \min_{j \le d_1} |\theta^{(j)}_n| - t_n\right) \le \sum_{j=1}^{d_1} P\left(|T^{(j)}_n| \ge \min_{j \le d_1} |\theta^{(j)}_n| - t_n\right).$$

Set
$$a_n = \min_{j \le d_1} \left|\theta^{(j)}_n\right| - t_n;$$
then by large deviation theory, if $a_n = o\left(n^{\frac{1}{6}}\right)$,
$$P\left(\left|T^{(j)}_n\right| \ge a_n\right) \propto \left(a_n/\sqrt{2}\right)^{-1} \exp\{-a_n^2/2\}, \qquad (7.13)$$

where $\propto$ signifies asymptotic order as $n \rightarrow \infty$, $a_n \rightarrow \infty$. It follows that if (7.13) holds uniformly in $j$, then by Lemma 7.1,
$$P\left(\min_{j \le d_1} \left|T^{(j)}_{rc}\right| \le t_n\right) \propto d_1 \left(a_n/\sqrt{2}\right)^{-1} \exp\{-a_n^2/2\}.$$
Thus type II consistency holds when $a_n = o\left(n^{\frac{1}{6}}\right)$ and
$$d_1 \left(a_n/\sqrt{2}\right)^{-1} \exp\{-a_n^2/2\} \rightarrow 0. \qquad (7.14)$$

In Section 7.1, $t_n$ is of order $n^r$, $0 < r < 1/6$. For this $t_n$, type II consistency holds if $a_n$ is of order $n^r$, $0 < r < 1/6$, and
$$d_1 n^{-r} \exp\{-n^{2r}/2\} \rightarrow 0.$$

If $a_n = O(n^k)$ with $k \ge 1/6$, then $P\left(\left|T^{(j)}_n\right| \ge a_n\right)$ tends to zero at a rate faster than in (7.13) and consistency is ensured. For instance, if all relevant variables $X_j$ have $p^{(j)}_{rc}$ fixed as $n \rightarrow \infty$, then $\min_{j \le d_1} \left|\theta^{(j)}_n\right|$ is of order $\sqrt{n}$ and $a_n$ is of order $n^{\frac{1}{2}} - t_n$.

7.3 Type I and II consistency

We see from Sections 7.1 and 7.2 that consistency holds under a variety of conditions. One set of conditions is:

Theorem 7.1. Under the regularity conditions of Sections 7.1 and 7.2, if
$$d_0 = c_0 \exp(n^b), \quad t_n = c_1 n^r, \quad \min_{j \le d_1} \theta^{(j)}_n - t_n = c_2 n^r, \quad d_1 = c_3 \exp(n^b)$$
for positive constants $c_0, c_1, c_2$ and $c_3$, and $0 < \frac{1}{2} b < r < \frac{1}{6}$, then CATCH is consistent.

Appendix: Proof of Theorem 6.1

(1) Let $N_h = \sum_{i=1}^{n} I\left(X^{(i)} \in N_h(x^{(0)})\right)$. Then $N_h \sim \mathrm{Bi}\left(n, \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx\right)$ and $N_h \rightarrow_{a.s.} \infty$ as $n \rightarrow \infty$. Moreover,
$$\hat{\zeta}(x^{(0)}, h) = n^{-1/2}\sqrt{\mathcal{X}^2(x^{(0)}, h)} = (N_h/n)^{1/2}\, (N_h)^{-1/2} \sqrt{\mathcal{X}^2(x^{(0)}, h)}.$$
Since $N_h \sim \mathrm{Bi}\left(n, \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx\right)$, we have
$$N_h / n \rightarrow \int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx. \qquad (7.15)$$
By Lemma 6.1 and Table 1,
$$(N_h)^{-1/2}\sqrt{\mathcal{X}^2(x^{(0)}, h)} \rightarrow \left(\sum_{r=+,-} \sum_{c=1}^{C} \frac{(p_{rc} - p_r q_c)^2}{p_r q_c}\right)^{\frac{1}{2}}, \qquad (7.16)$$
where
$$p_{+c} = \Pr\left(X > x^{(0)}, Y = c \mid X \in N_h(x^{(0)})\right), \quad p_{-c} = \Pr\left(X \le x^{(0)}, Y = c \mid X \in N_h(x^{(0)})\right)$$

for $r = +, -$ and $c = 1, \cdots, C$;
$$p_r = \sum_{c=1}^{C} p_{rc}, \qquad q_c = \sum_{r=+,-} p_{rc}.$$
Actually $p_+ = \Pr\left(X > x^{(0)} \mid X \in N_h(x^{(0)})\right)$, $p_- = \Pr\left(X \le x^{(0)} \mid X \in N_h(x^{(0)})\right)$, and $q_c = \Pr\left(Y = c \mid X \in N_h(x^{(0)})\right)$.
By (7.15) and (7.16),
$$\hat{\zeta}(x^{(0)}, h) \rightarrow \left(\int_{x^{(0)}-h}^{x^{(0)}+h} f(x)\,dx\right)^{\frac{1}{2}} \left(\sum_{r=+,-} \sum_{c=1}^{C} \frac{(p_{rc} - p_r q_c)^2}{p_r q_c}\right)^{\frac{1}{2}} \equiv \zeta(x^{(0)}, h).$$

(2) Since $X$ and $Y$ are independent, $p_{rc} = p_r q_c$ for $r = +, -$ and $c = 1, \cdots, C$, so $\zeta(x^{(0)}, h) = 0$ for any $h$. Hence
$$\zeta(x^{(0)}) = \sup_{h > 0} \zeta(x^{(0)}, h) = 0.$$

(3) Recall that $p_c(x^{(0)}) = \Pr(Y = c \mid x^{(0)})$. Without loss of generality assume $p'_c(x^{(0)}) > 0$. Then there exists $h$ such that
$$\Pr\left(Y = c \mid x^{(0)} < X < x^{(0)} + h\right) > \Pr\left(Y = c \mid x^{(0)} - h < X < x^{(0)}\right),$$
which is equivalent to $p_{+c}/p_+ > p_{-c}/p_-$. This shows that $\zeta(x^{(0)}) > 0$, since by Lemma 6.1, $\zeta(x^{(0)}) = 0$ results in $p_{+c}/p_+ = p_{-c}/p_- = p_c$.
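To illustrate the three parts of Theorem 6.1 numerically, the following sketch (simulated data, with illustrative sample size, split point, and bandwidth) computes $\hat{\zeta}(x^{(0)}, h) = n^{-1/2}\sqrt{\mathcal{X}^2(x^{(0)}, h)}$ for an independent pair $(X, Y)$ and for a pair in which $\Pr(Y = 1 \mid X = x)$ increases at $x^{(0)}$; the former estimate is near zero while the latter stabilizes at a positive value, as the theorem asserts:

```python
import numpy as np

def zeta_hat(x, y, x0, h, n_classes=2):
    # zeta_hat(x0, h) = n^{-1/2} sqrt(X^2(x0, h)), where X^2 is the Pearson
    # chi-square of the 2 x C table formed in N_h(x0) by splitting at x0.
    near = np.abs(x - x0) <= h
    rows = (x[near] > x0).astype(int)
    table = np.zeros((2, n_classes))
    np.add.at(table, (rows, y[near]), 1)
    if (table.sum(axis=1) == 0).any() or (table.sum(axis=0) == 0).any():
        return 0.0
    E = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    chi2 = ((table - E) ** 2 / E).sum()
    return np.sqrt(chi2 / x.size)

rng = np.random.default_rng(0)
n, x0, h = 20000, 0.0, 0.5
x = rng.uniform(-1, 1, size=n)

y_indep = rng.integers(0, 2, size=n)                              # Y independent of X: part (2), near 0
y_dep = (rng.uniform(size=n) < 0.3 + 0.4 * (x > x0)).astype(int)  # Pr(Y=1|x) larger above x0: part (3)

print(zeta_hat(x, y_indep, x0, h))   # small, tends to 0 as n grows
print(zeta_hat(x, y_dep, x0, h))     # stabilizes at a positive limit
```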

References

[1] Breiman, L., 2001. Random forests. Machine Learning 45, 5–32.

[2] Breiman, L., Friedman, J. H., Olshen, R. A., Stone, C. J., 1984. Classification and Regression Trees. Wadsworth, Belmont.

[3] Doksum, K., Tang, S., Tsui, K.-W., 2008. Nonparametric variable selection: The EARTH algorithm. Journal of the American Statistical Association 103 (484), 1609–1620.

[4] Doksum, K. A., Samarov, A., 1995. Nonparametric estimation of global functionals and a measure of the explanatory power of covariates in regression. Annals of Statistics 23, 1443–1473.

[5] Doksum, K. A., Schafer, C., 2006. Powerful choices: Tuning parameter selection based on power. In: Fan, J., Koul, H. L. (Eds.), Frontiers in Statistics: Dedicated to Peter Bickel. Imperial College Press, 113–141.

[6] Friedman, J., 1991. Multivariate adaptive regression splines. The Annals of Statistics 19, 1–67.

[7] Gao, J., Gijbels, I., 2008. Bandwidth selection in nonparametric kernel testing. Journal of the American Statistical Association 103 (484), 1584–1594.

[8] Hall, P., 1992. The Bootstrap and Edgeworth Expansion. Springer, New York.

[9] Huang, L.-S., Chen, J., 2008. Analysis of variance, coefficient of determination and F-test for local polynomial regression. Annals of Statistics 36, 2085–2109.

[10] Jing, B.-Y., Shao, Q.-M., Wang, Q., 2003. Self-normalized Cramér-type large deviations for independent random variables. Annals of Probability 31, 2167–2215.

[11] Lin, D. Y., Zeng, D., 2011. Correcting for population stratification in genomewide association studies. Journal of the American Statistical Association 106, 997–1008.

[12] Loh, W.-Y., 2009. Improving the precision of classification trees. The Annals of Applied Statistics 3, 1710–1737.

[13] Price, A., Patterson, N., Plenge, R., Weinblatt, M., Shadick, N., Reich, D., 2006. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38 (8), 904–909.

[14] Qiao, Z., Zhou, L., Huang, J., 2008. Linear discriminant analysis for high dimensional, low sample size data. Special issue of World Congress on Engineering 2008.

[15] Quinlan, J. R., 1987. Simplifying decision trees. International Journal of Man-Machine Studies 27 (3), 221–234.

[16] Schafer, C., Doksum, K. A., 2009. Selecting local models in multiple regression by maximizing power. Metrika 69, 283–304.

[17] Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267–288.

[18] Wang, L., Shen, X., 2007. On L1-norm multi-class support vector machines: methodology and theory. Journal of the American Statistical Association 102, 583–594.

[19] Zhang, H. H., Liu, Y., Wu, Y., Zhu, J., 2008. Variable selection for the multicategory SVM via adaptive sup-norm regularization. Electronic Journal of Statistics 2, 149–167.

  • Introduction
  • Importance Scores for Univariate Classification
    • Numerical Predictor Local Contingency Efficacy
    • Categorical Predictor Contingency Efficacy
      • Variable Selection for Classification Problems The CATCH Algorithm
        • Numerical Predictors
        • Numerical and Categorical Predictors
        • The Empirical Importance Scores
        • Adaptive Tube Distances
        • The CATCH Algorithm
          • Simulation Studies
            • Variable Selection with Categorical and Numerical Predictors
            • Improving The Performance of SVM and RF Classifiers
              • CATCH Analysis of The ANN Data
              • Properties of Local Contingency Efficacy
              • Consistency Of CATCH
                • Type I consistency
                • Type II consistency
                • Type I and II consistency
                  • Proof of Theorem 61
Page 26: Nonparametric Variable Selection and Classi cation: The ...lc436/Tang_Chen_etal_2012_CSDA.pdf · pruning, will choose a subset of optimal splitting variables to be the most signi

λj gt 0 if j = 1 d1λj = 0 if j = d1 + 1 d

(71)

Our variable selection rule is

ldquokeep variable Xj if sj ge t for some threshold trdquo (72)

Rule (72) is consistent if and only if

P (minjled1sj ge t and max

jgtd1sj lt t) minusrarr 1 (73)

To establish (73) it is enough to show that

P (maxjgtd1sj gt t) minusrarr 0 (74)

andP (min

jled1sj le t) minusrarr 0 (75)

We call (74) and (75) type I and type II consistency respectively Type I and IIerror probabilities refer to probabilities of any false positives and any false negativesrespectively

71 Type I consistency

We consider (74) first and note that

P (maxjgtd1sj gt t) le

dsumj=d1+1

P (sj gt t) (76)

For the jth irrelevant variable Xj j gt d1 the results in Section 6 apply to the local

efficacy ζj(x) at a fixed point x The importance score sj defined in (38) is the averageof such efficacies over a sample of M random points xlowasta We assume that there is apoint x(0) such that for irrelevant xjrsquos and for n large enough ζj(x

(0)) is stochastically

larger in the right tail than the average Mminus1sumMa=1 ζj(x

lowasta) That is we assume that

there exist t gt 0 and n0 such that for n ge n0

p(sj gt t) le P (s(0)j gt t) j gt d1 (77)

The existence of such a point x(0) is established by considering the probability limit of

arg maxζj(xlowasta) xlowasta isin xlowast1 middot middot middot xlowastM

Note by Lemma (61) s(0)j rarrp 0 as n rarr infin It follows that we have established

24

(74) for t and d0 = d minus d1 that are fixed as n rarr infin To examine the interesting casewhere d0 rarr infin as n rarr infin we need large deviation results which we turn to next Inthe d0 rarrinfin case we consider t equiv tn that tend to infin as nrarrinfin and we use the weakerassumption there exists x(0) and n0 such that

P (sj gt tn) le P (s(0)j gt tn) (78)

for n ge n0 and each tn with limnrarrinfin tn = infin We also assume that the number ofcolumns C and rows R in the contingency table that define the efficacy χ2

jn given by

(64) stays fixed as nrarrinfin In this case P (s(0)j gt tn)rarr 0 provided each term

[T (j)rc ]2 equiv

[nrcErcradicnradicErc

]2in s2j = χ2

jn defined for Xj by (64) satisfies

P(|T (j)rc | gt tn

)minusrarr 0

This is because

P

radicsumr

sumc

[T(j)rc ]2 gt tn

le P(RC max

rc[T (j)rc ]2 gt t2n

)le RC maxP

(|T (j)rc | gt tn

radicRC)

and tn and tnradicRC are of the same order as n rarr infin By large deviation theory Hall

[8] and Jing et al [10] for some constant A

P (|T (j)rc | gt tn) =

2tminus1nradic2πeminust

2n2[1 + At3nn

minus 12 + o

(t3nnminus 1

2

)](79)

provided tn = o(n

16

) If (79) is uniform in j gt d1 then (76) and (79) implies that

type I consistency holds if

d0

(tnradic

2

)minus1eminus

t2n2 rarr 0 (710)

To examine (710) write d0 and tn as d0 = exp(nb)

and tn =radic

2nr By large deviationtheory (710) holds when r lt 1

6 Thus if 0 lt 1

2b lt r lt 1

6 then

d0

(tnradic

2

)minus1eminus

t2n2 = nminusr exp

(nb minus n2r

)rarr 0

25

Thus d0 = exp(nb) with b = 13ε for any ε gt 0 is the largest d0 can be to have consistency

For instance if there are exactly d0 = exp(n

14

)irrelevant variables and we use the

threshold tn =radic

2n17 the probability of selecting any of the irrelevant variables tends

to zero at the ratenminus

14 exp

(n

14 minus n

27

)72 Type II consistency

We now assume that for each relevant variable Xj there is a least favorable point x(1)

such that the local efficacy s(1)j equiv ζ(x(1)) is stochastically smaller than the importance

score sj = Mminus1sumMa=1 ζ(xlowasta) That is we assume that for each relevant variable Xj there

exists n1 and x(1) such that for n ge n1

P(s(1)j le tn

)ge P (sj le tn) j le d1 (711)

The existence of such a least favorable x(1) is established by considering the probabilitylimit of

arg minζj(xlowasta) xlowasta isin xlowast1 middot middot middot xlowastM

We again assume that R and C are fixed so that to show type II consistency it isenough to show that for each (r c) and j le d1

P(|T (j)rc | le tn

)rarr 0

This is because

P

(radicsumr

sumc

[T

(j)rc

]2le tn

)le P (minrc

[T

(j)rc

]2le t2n)

le RC maxrc P(|T (j)rc | le tn

)By Theorem 62

radicn(s(1)j minus τj

)rarrd N(0 ν2) τj gt 0 It follows that if tn and d1

are bounded above as n rarr infin and τj gt 0 then P(s(1)j le tn

)rarr 0 Thus type II

consistency holds in this case To examine the more interesting case where tn and d1are not bounded above and τj may tend to zero as nrarrinfin we will need to center the

T(j)rc Thus set

θ(j)n =

radicn(p(j)rc minus p(j)r p

(j)c

)radicp(j)r p

(j)c

(712)

26

Tn(j) = T (j)rc minus θ(j)n

where for simplicity the dependence of T(j)n and θ

(j)n on r and c is not indicated

Lemma 71

P

(minjled1|T (j)rc | le tn

)le

d1sumj=1

P

(|T (j)n | ge min

jled1θ(j)n minus tn

)Proof

P(

minjled1 |T(j)rc | le tn

)= P

(minjled1 |T

(j)n + θ

(j)n | le tn

)le P

(minjled1 |T

(j)n | minusmaxjled1 |θ

(j)n | le tn

)= P

(maxjled1 |T

(j)n | ge minjled1 |θ

(j)n | minus tn

)lesumd1

j=1 P(|T (j)n | ge minjled1 |θ

(j)n | minus tn

)

Setan = min

jled1

∣∣θ(j)n ∣∣minus tnthen by large deviation theory if an = o

(n

16

)

P (∣∣T (j)n

∣∣ ge an) prop

(anradic

2

)minus1expminusa2n2 (713)

where prop signifies asymptotic order as nrarrinfin an rarrinfin It follows that if (713) holdsuniformly in j then by Lemma 71

P

(minjled1

∣∣T (j)rc

∣∣ le tn

)prop d1

(anradic

2

)minus1expminusa2n2

Thus type II consistency holds when an = o(n

16

)and

d1

(anradic

2

)minus1expminusa2n2 rarr 0 (714)

In Section 71 tn is of order nr 0 lt r lt 16 For this tn type II consistency holds if an

is of order nr 0 lt r lt 16 and

d1nminusr expminusn2r2 rarr 0

27

If an = O(nk) with k ge 16 then P (

∣∣∣T (j)(n)

∣∣∣ ge an) tends to zero at a rate faster than in

(713) and consistency is ensured For instance if all relevant variables Xj have p(j)rc

fixed as nrarrinfin then minjled1

∣∣∣θ(j)n ∣∣∣ is of orderradicn and an is of order n

12 minus tn

73 Type I and II consistency

We see from Sections 71 and 72 that consistency holds under a variety of conditionsOne set of conditions is

Theorem 71 Under the regularization conditions of Sections 71 and 72 if

d0 = c0 exp(nb) tn = c1nr

minjled1 θ(j) minus tn = c2nr d1 = c3 exp

(nb)

for positive constants c0 c1 c2 and c3 and 0 lt 12b lt r lt 1

6 then CATCH is consistent

AppendixProof of Theorem 61

(1) Let Nh =

sumni=1 I(X(i) isin Nh(x

(0))) Then Nh sim Bi(n

int x(0)+hx(0)minush f(x)dx) and

Nh rarras infin as nrarrinfin

ζ(x(0) h) = nminus12radicX 2(x(0) h)

= (Nh n)12(N

h )minus12radicX 2(x(0) h)

Since Nh sim Bi(n

int x(0)+hx(0)minush f(x)dx) we have

Nh nrarr

int x(0)+h

x(0)minushf(x)dx (715)

By Lemma 61 and Table (1)

(Nh )minus12

radicX 2(x(0) h)rarr

( sumr=+minus

Csumc=1

(prc minus prqc)2

prqc

) 12

(716)

where

p+c = Pr(X gt x0 Y = c|X isin Nh(x(0))) pminusc = Pr(X le x0 Y = c|X isin Nh(x

(0)))

28

for r = +minus and c = 1 middot middot middot C

pr =Csumc=1

prc qc =sumr=+minus

prc

Actually p+ = Pr(X gt x0|X isin Nh(x(0))) pminus = Pr(X le x0|X isin Nh(x

(0)))qc = Pr(Y = c|X isin Nh(x

(0)))

By (715) and (716)

ζ(x(0) h)rarr(int x(0)+h

x(0)minushf(x)dx

) 12( sumr=+minus

Csumc=1

(prc minus prqc)2

prqc

) 12

equiv ζ(x(0) h)

(2) Since X and Y are independent prc = prqc for r = +minus and c = 1 middot middot middot C thenζ(x(0) h) = 0 for any h Hence

ζ(x(0)) = suphgt0ζ(x(0) h) = 0

(3) Recall that pc(x(0)) = Pr(Y = c|x(0)) Without loss of generality assume pprimec(x

(0)) gt0 Then there exists h such that

Pr(Y = c|x(0) lt X lt x(0) + h) gt Pr(Y = c|x(0) minus h lt X lt x(0))

which is equivelant to p+cp+ gt pminuscpminus This shows that ζ(x(0)) gt 0 since byLemma 61 ζ(x(0)) = 0 results in p+cp+ = pminuscpminus = pc

References

[1] Breiman L 2001 Random forests Machine Learning 45 5ndash32

[2] Breiman L Friedman J H Olshen R A Stone J 1984 Classification andregression trees Wadsworth Belmont

[3] Doksum K Tang S Tsui K-W 2008 Nonparametric variable selection Theearth algorithm Journal of the American Statistical Association 103(484) 1609ndash1620

[4] Doksum K A Samarov A 1995 Nonparametric estimation of global functionalsand a measure of explanatory power of covariates in regression Annals of Statistics23 1443ndash1473

29

[5] Doksum K A Schafer C 2006 Powerful choices Tuning parameter selectionbased on power Frontiers in Statistics Dedicated to Peter Bickel Editors J Fanand H L Koul Imperial College Press 113ndash141

[6] Friedman J 1991 Multivariate adaptive regression splines The annals of statis-tics 1ndash67

[7] Gao J Gijbels I 2008 Bandwidth selection in nonparametric kernel testingJournal of the American Statistical Association 103 (484) 1584ndash1594

[8] Hall P 1992 The Bootstrap and Edgeworth Expansions Springer New York

[9] Huang L-S Chen J 2008 Analysis of variance coefficient of determinationand f-test for local polynomial regression Annals of Statistics 36 2085ndash2109

[10] Jing B-Y Shao Q-M Wang Q 2003 Self-normalized cramer-type large devi-ations for independent random variables Annals of Probability 31 2167ndash2215

[11] Lin D Y Zeng D 2011 Correcting for population stratification in genomewideassociation studies Journal of the American Statistical Association 106 997ndash1008

[12] Loh W-Y 2009 Improve the precision of classification trees The Annals ofApplied Statistics 3 1710ndash1737

[13] Price A Patterson N Plenge R Weinblatt M Shadick N Reich D 2006Principal components analysis corrects for stratification in genome-wide associationstudies Nature genetics 38 (8) 904ndash909

[14] Qiao Z Zhou L Huang J 2008 Linear discriminant analysis for high dimen-sional Low Sample Size Data special issue of World Congress on Engineering2008

[15] Quinlan J R 1987 Simplifying decision trees International Journal of Man-Machine Studies 27(3) 221ndash234

[16] Schafer C Doksum K A 2009 Selecting local models in multiple regression bymaximizing power Metrika 69 283ndash304

[17] Tibshirani R 1996 Regression shrinkage and selection via the lasso J RoyalStatist Soc B 58 267ndash288

[18] Wang L Shen X 2007 On l1-norm multi-class support vector machinesmethodology and theory Journal of the American Statistical Association 102 583ndash594

[19] Zhang H H Liu Y Wu Y Zhu J 2008 Variable selection for the multicate-gory svm via adaptive sup-norm regularization Electronic Journal of Statistics 2149ndash167

30

  • Introduction
  • Importance Scores for Univariate Classification
    • Numerical Predictor Local Contingency Efficacy
    • Categorical Predictor Contingency Efficacy
      • Variable Selection for Classification Problems The CATCH Algorithm
        • Numerical Predictors
        • Numerical and Categorical Predictors
        • The Empirical Importance Scores
        • Adaptive Tube Distances
        • The CATCH Algorithm
          • Simulation Studies
            • Variable Selection with Categorical and Numerical Predictors
            • Improving The Performance of SVM and RF Classifiers
              • CATCH Analysis of The ANN Data
              • Properties of Local Contingency Efficacy
              • Consistency Of CATCH
                • Type I consistency
                • Type II consistency
                • Type I and II consistency
                  • Proof of Theorem 61
Page 27: Nonparametric Variable Selection and Classi cation: The ...lc436/Tang_Chen_etal_2012_CSDA.pdf · pruning, will choose a subset of optimal splitting variables to be the most signi

(74) for t and d0 = d minus d1 that are fixed as n rarr infin To examine the interesting casewhere d0 rarr infin as n rarr infin we need large deviation results which we turn to next Inthe d0 rarrinfin case we consider t equiv tn that tend to infin as nrarrinfin and we use the weakerassumption there exists x(0) and n0 such that

P (sj gt tn) le P (s(0)j gt tn) (78)

for n ge n0 and each tn with limnrarrinfin tn = infin We also assume that the number ofcolumns C and rows R in the contingency table that define the efficacy χ2

jn given by

(64) stays fixed as nrarrinfin In this case P (s(0)j gt tn)rarr 0 provided each term

[T (j)rc ]2 equiv

[nrcErcradicnradicErc

]2in s2j = χ2

jn defined for Xj by (64) satisfies

P(|T (j)rc | gt tn

)minusrarr 0

This is because

P

radicsumr

sumc

[T(j)rc ]2 gt tn

le P(RC max

rc[T (j)rc ]2 gt t2n

)le RC maxP

(|T (j)rc | gt tn

radicRC)

and tn and tnradicRC are of the same order as n rarr infin By large deviation theory Hall

[8] and Jing et al [10] for some constant A

P (|T (j)rc | gt tn) =

2tminus1nradic2πeminust

2n2[1 + At3nn

minus 12 + o

(t3nnminus 1

2

)](79)

provided tn = o(n

16

) If (79) is uniform in j gt d1 then (76) and (79) implies that

type I consistency holds if

d0

(tnradic

2

)minus1eminus

t2n2 rarr 0 (710)

To examine (710) write d0 and tn as d0 = exp(nb)

and tn =radic

2nr By large deviationtheory (710) holds when r lt 1

6 Thus if 0 lt 1

2b lt r lt 1

6 then

d0

(tnradic

2

)minus1eminus

t2n2 = nminusr exp

(nb minus n2r

)rarr 0

25

Thus d0 = exp(nb) with b = 13ε for any ε gt 0 is the largest d0 can be to have consistency

For instance if there are exactly d0 = exp(n

14

)irrelevant variables and we use the

threshold tn =radic

2n17 the probability of selecting any of the irrelevant variables tends

to zero at the ratenminus

14 exp

(n

14 minus n

27

)72 Type II consistency

We now assume that for each relevant variable Xj there is a least favorable point x(1)

such that the local efficacy s(1)j equiv ζ(x(1)) is stochastically smaller than the importance

score sj = Mminus1sumMa=1 ζ(xlowasta) That is we assume that for each relevant variable Xj there

exists n1 and x(1) such that for n ge n1

P(s(1)j le tn

)ge P (sj le tn) j le d1 (711)

The existence of such a least favorable x(1) is established by considering the probabilitylimit of

arg minζj(xlowasta) xlowasta isin xlowast1 middot middot middot xlowastM

We again assume that R and C are fixed so that to show type II consistency it isenough to show that for each (r c) and j le d1

P(|T (j)rc | le tn

)rarr 0

This is because

P

(radicsumr

sumc

[T

(j)rc

]2le tn

)le P (minrc

[T

(j)rc

]2le t2n)

le RC maxrc P(|T (j)rc | le tn

)By Theorem 62

radicn(s(1)j minus τj

)rarrd N(0 ν2) τj gt 0 It follows that if tn and d1

are bounded above as n rarr infin and τj gt 0 then P(s(1)j le tn

)rarr 0 Thus type II

consistency holds in this case To examine the more interesting case where tn and d1are not bounded above and τj may tend to zero as nrarrinfin we will need to center the

T(j)rc Thus set

θ(j)n =

radicn(p(j)rc minus p(j)r p

(j)c

)radicp(j)r p

(j)c

(712)

26

Tn(j) = T (j)rc minus θ(j)n

where for simplicity the dependence of T(j)n and θ

(j)n on r and c is not indicated

Lemma 71

P

(minjled1|T (j)rc | le tn

)le

d1sumj=1

P

(|T (j)n | ge min

jled1θ(j)n minus tn

)Proof

P(

minjled1 |T(j)rc | le tn

)= P

(minjled1 |T

(j)n + θ

(j)n | le tn

)le P

(minjled1 |T

(j)n | minusmaxjled1 |θ

(j)n | le tn

)= P

(maxjled1 |T

(j)n | ge minjled1 |θ

(j)n | minus tn

)lesumd1

j=1 P(|T (j)n | ge minjled1 |θ

(j)n | minus tn

)

Setan = min

jled1

∣∣θ(j)n ∣∣minus tnthen by large deviation theory if an = o

(n

16

)

P (∣∣T (j)n

∣∣ ge an) prop

(anradic

2

)minus1expminusa2n2 (713)

where prop signifies asymptotic order as nrarrinfin an rarrinfin It follows that if (713) holdsuniformly in j then by Lemma 71

P

(minjled1

∣∣T (j)rc

∣∣ le tn

)prop d1

(anradic

2

)minus1expminusa2n2

Thus type II consistency holds when an = o(n

16

)and

d1

(anradic

2

)minus1expminusa2n2 rarr 0 (714)

In Section 71 tn is of order nr 0 lt r lt 16 For this tn type II consistency holds if an

is of order nr 0 lt r lt 16 and

d1nminusr expminusn2r2 rarr 0

27

If an = O(nk) with k ge 16 then P (

∣∣∣T (j)(n)

∣∣∣ ge an) tends to zero at a rate faster than in

(713) and consistency is ensured For instance if all relevant variables Xj have p(j)rc

fixed as nrarrinfin then minjled1

∣∣∣θ(j)n ∣∣∣ is of orderradicn and an is of order n

12 minus tn

73 Type I and II consistency

We see from Sections 71 and 72 that consistency holds under a variety of conditionsOne set of conditions is

Theorem 71 Under the regularization conditions of Sections 71 and 72 if

d0 = c0 exp(nb) tn = c1nr

minjled1 θ(j) minus tn = c2nr d1 = c3 exp

(nb)

for positive constants c0 c1 c2 and c3 and 0 lt 12b lt r lt 1

6 then CATCH is consistent

AppendixProof of Theorem 61

(1) Let Nh =

sumni=1 I(X(i) isin Nh(x

(0))) Then Nh sim Bi(n

int x(0)+hx(0)minush f(x)dx) and

Nh rarras infin as nrarrinfin

ζ(x(0) h) = nminus12radicX 2(x(0) h)

= (Nh n)12(N

h )minus12radicX 2(x(0) h)

Since Nh sim Bi(n

int x(0)+hx(0)minush f(x)dx) we have

Nh nrarr

int x(0)+h

x(0)minushf(x)dx (715)

By Lemma 61 and Table (1)

(Nh )minus12

radicX 2(x(0) h)rarr

( sumr=+minus

Csumc=1

(prc minus prqc)2

prqc

) 12

(716)

where

p+c = Pr(X gt x0 Y = c|X isin Nh(x(0))) pminusc = Pr(X le x0 Y = c|X isin Nh(x

(0)))

28

for r = +minus and c = 1 middot middot middot C

pr =Csumc=1

prc qc =sumr=+minus

prc

Actually p+ = Pr(X gt x0|X isin Nh(x(0))) pminus = Pr(X le x0|X isin Nh(x

(0)))qc = Pr(Y = c|X isin Nh(x

(0)))

By (715) and (716)

ζ(x(0) h)rarr(int x(0)+h

x(0)minushf(x)dx

) 12( sumr=+minus

Csumc=1

(prc minus prqc)2

prqc

) 12

equiv ζ(x(0) h)

(2) Since X and Y are independent prc = prqc for r = +minus and c = 1 middot middot middot C thenζ(x(0) h) = 0 for any h Hence

ζ(x(0)) = suphgt0ζ(x(0) h) = 0

(3) Recall that pc(x(0)) = Pr(Y = c|x(0)) Without loss of generality assume pprimec(x

(0)) gt0 Then there exists h such that

Pr(Y = c|x(0) lt X lt x(0) + h) gt Pr(Y = c|x(0) minus h lt X lt x(0))

which is equivelant to p+cp+ gt pminuscpminus This shows that ζ(x(0)) gt 0 since byLemma 61 ζ(x(0)) = 0 results in p+cp+ = pminuscpminus = pc

References

[1] Breiman L 2001 Random forests Machine Learning 45 5ndash32

[2] Breiman L Friedman J H Olshen R A Stone J 1984 Classification andregression trees Wadsworth Belmont

[3] Doksum K Tang S Tsui K-W 2008 Nonparametric variable selection Theearth algorithm Journal of the American Statistical Association 103(484) 1609ndash1620

[4] Doksum K A Samarov A 1995 Nonparametric estimation of global functionalsand a measure of explanatory power of covariates in regression Annals of Statistics23 1443ndash1473

29

[5] Doksum K A Schafer C 2006 Powerful choices Tuning parameter selectionbased on power Frontiers in Statistics Dedicated to Peter Bickel Editors J Fanand H L Koul Imperial College Press 113ndash141

[6] Friedman J 1991 Multivariate adaptive regression splines The annals of statis-tics 1ndash67

[7] Gao J Gijbels I 2008 Bandwidth selection in nonparametric kernel testingJournal of the American Statistical Association 103 (484) 1584ndash1594

[8] Hall P 1992 The Bootstrap and Edgeworth Expansions Springer New York

[9] Huang L-S Chen J 2008 Analysis of variance coefficient of determinationand f-test for local polynomial regression Annals of Statistics 36 2085ndash2109

[10] Jing B-Y Shao Q-M Wang Q 2003 Self-normalized cramer-type large devi-ations for independent random variables Annals of Probability 31 2167ndash2215

[11] Lin D Y Zeng D 2011 Correcting for population stratification in genomewideassociation studies Journal of the American Statistical Association 106 997ndash1008

[12] Loh W-Y 2009 Improve the precision of classification trees The Annals ofApplied Statistics 3 1710ndash1737

[13] Price A Patterson N Plenge R Weinblatt M Shadick N Reich D 2006Principal components analysis corrects for stratification in genome-wide associationstudies Nature genetics 38 (8) 904ndash909

[14] Qiao Z Zhou L Huang J 2008 Linear discriminant analysis for high dimen-sional Low Sample Size Data special issue of World Congress on Engineering2008

[15] Quinlan J R 1987 Simplifying decision trees International Journal of Man-Machine Studies 27(3) 221ndash234

[16] Schafer C Doksum K A 2009 Selecting local models in multiple regression bymaximizing power Metrika 69 283ndash304

[17] Tibshirani R 1996 Regression shrinkage and selection via the lasso J RoyalStatist Soc B 58 267ndash288

[18] Wang L Shen X 2007 On l1-norm multi-class support vector machinesmethodology and theory Journal of the American Statistical Association 102 583ndash594

[19] Zhang H H Liu Y Wu Y Zhu J 2008 Variable selection for the multicate-gory svm via adaptive sup-norm regularization Electronic Journal of Statistics 2149ndash167

30

  • Introduction
  • Importance Scores for Univariate Classification
    • Numerical Predictor Local Contingency Efficacy
    • Categorical Predictor Contingency Efficacy
      • Variable Selection for Classification Problems The CATCH Algorithm
        • Numerical Predictors
        • Numerical and Categorical Predictors
        • The Empirical Importance Scores
        • Adaptive Tube Distances
        • The CATCH Algorithm
          • Simulation Studies
            • Variable Selection with Categorical and Numerical Predictors
            • Improving The Performance of SVM and RF Classifiers
              • CATCH Analysis of The ANN Data
              • Properties of Local Contingency Efficacy
              • Consistency Of CATCH
                • Type I consistency
                • Type II consistency
                • Type I and II consistency
                  • Proof of Theorem 61
Page 28: Nonparametric Variable Selection and Classi cation: The ...lc436/Tang_Chen_etal_2012_CSDA.pdf · pruning, will choose a subset of optimal splitting variables to be the most signi

Thus d0 = exp(nb) with b = 13ε for any ε gt 0 is the largest d0 can be to have consistency

For instance if there are exactly d0 = exp(n

14

)irrelevant variables and we use the

threshold tn =radic

2n17 the probability of selecting any of the irrelevant variables tends

to zero at the ratenminus

14 exp

(n

14 minus n

27

)72 Type II consistency

We now assume that for each relevant variable Xj there is a least favorable point x(1)

such that the local efficacy s(1)j equiv ζ(x(1)) is stochastically smaller than the importance

score sj = Mminus1sumMa=1 ζ(xlowasta) That is we assume that for each relevant variable Xj there

exists n1 and x(1) such that for n ge n1

P(s(1)j le tn

)ge P (sj le tn) j le d1 (711)

The existence of such a least favorable x(1) is established by considering the probabilitylimit of

arg minζj(xlowasta) xlowasta isin xlowast1 middot middot middot xlowastM

We again assume that R and C are fixed so that to show type II consistency it isenough to show that for each (r c) and j le d1

P(|T (j)rc | le tn

)rarr 0

This is because

P

(radicsumr

sumc

[T

(j)rc

]2le tn

)le P (minrc

[T

(j)rc

]2le t2n)

le RC maxrc P(|T (j)rc | le tn

)By Theorem 62

radicn(s(1)j minus τj

)rarrd N(0 ν2) τj gt 0 It follows that if tn and d1

are bounded above as n rarr infin and τj gt 0 then P(s(1)j le tn

)rarr 0 Thus type II

consistency holds in this case To examine the more interesting case where tn and d1are not bounded above and τj may tend to zero as nrarrinfin we will need to center the

T(j)rc Thus set

θ(j)n =

radicn(p(j)rc minus p(j)r p

(j)c

)radicp(j)r p

(j)c

(712)

26

Tn(j) = T (j)rc minus θ(j)n

where for simplicity the dependence of T(j)n and θ

(j)n on r and c is not indicated

Lemma 71

P

(minjled1|T (j)rc | le tn

)le

d1sumj=1

P

(|T (j)n | ge min

jled1θ(j)n minus tn

)Proof

P(

minjled1 |T(j)rc | le tn

)= P

(minjled1 |T

(j)n + θ

(j)n | le tn

)le P

(minjled1 |T

(j)n | minusmaxjled1 |θ

(j)n | le tn

)= P

(maxjled1 |T

(j)n | ge minjled1 |θ

(j)n | minus tn

)lesumd1

j=1 P(|T (j)n | ge minjled1 |θ

(j)n | minus tn

)

Setan = min

jled1

∣∣θ(j)n ∣∣minus tnthen by large deviation theory if an = o

(n

16

)

P (∣∣T (j)n

∣∣ ge an) prop

(anradic

2

)minus1expminusa2n2 (713)

where prop signifies asymptotic order as nrarrinfin an rarrinfin It follows that if (713) holdsuniformly in j then by Lemma 71

P

(minjled1

∣∣T (j)rc

∣∣ le tn

)prop d1

(anradic

2

)minus1expminusa2n2

Thus type II consistency holds when an = o(n

16

)and

d1

(anradic

2

)minus1expminusa2n2 rarr 0 (714)

In Section 71 tn is of order nr 0 lt r lt 16 For this tn type II consistency holds if an

is of order nr 0 lt r lt 16 and

d1nminusr expminusn2r2 rarr 0

27

If an = O(nk) with k ge 16 then P (

∣∣∣T (j)(n)

∣∣∣ ge an) tends to zero at a rate faster than in

(713) and consistency is ensured For instance if all relevant variables Xj have p(j)rc

fixed as nrarrinfin then minjled1

∣∣∣θ(j)n ∣∣∣ is of orderradicn and an is of order n

12 minus tn

73 Type I and II consistency

We see from Sections 71 and 72 that consistency holds under a variety of conditionsOne set of conditions is

Theorem 71 Under the regularization conditions of Sections 71 and 72 if

d0 = c0 exp(nb) tn = c1nr

minjled1 θ(j) minus tn = c2nr d1 = c3 exp

(nb)

for positive constants c0 c1 c2 and c3 and 0 lt 12b lt r lt 1

6 then CATCH is consistent

AppendixProof of Theorem 61

(1) Let Nh =

sumni=1 I(X(i) isin Nh(x

(0))) Then Nh sim Bi(n

int x(0)+hx(0)minush f(x)dx) and

Nh rarras infin as nrarrinfin

ζ(x(0) h) = nminus12radicX 2(x(0) h)

= (Nh n)12(N

h )minus12radicX 2(x(0) h)

Since Nh sim Bi(n

int x(0)+hx(0)minush f(x)dx) we have

Nh nrarr

int x(0)+h

x(0)minushf(x)dx (715)

By Lemma 61 and Table (1)

(Nh )minus12

radicX 2(x(0) h)rarr

( sumr=+minus

Csumc=1

(prc minus prqc)2

prqc

) 12

(716)

where

p+c = Pr(X gt x0 Y = c|X isin Nh(x(0))) pminusc = Pr(X le x0 Y = c|X isin Nh(x

(0)))

28

for r = +minus and c = 1 middot middot middot C

pr =Csumc=1

prc qc =sumr=+minus

prc

Actually p+ = Pr(X gt x0|X isin Nh(x(0))) pminus = Pr(X le x0|X isin Nh(x

(0)))qc = Pr(Y = c|X isin Nh(x

(0)))

By (715) and (716)

ζ(x(0) h)rarr(int x(0)+h

x(0)minushf(x)dx

) 12( sumr=+minus

Csumc=1

(prc minus prqc)2

prqc

) 12

equiv ζ(x(0) h)

(2) Since X and Y are independent prc = prqc for r = +minus and c = 1 middot middot middot C thenζ(x(0) h) = 0 for any h Hence

ζ(x(0)) = suphgt0ζ(x(0) h) = 0

(3) Recall that pc(x(0)) = Pr(Y = c|x(0)) Without loss of generality assume pprimec(x

(0)) gt0 Then there exists h such that

Pr(Y = c|x(0) lt X lt x(0) + h) gt Pr(Y = c|x(0) minus h lt X lt x(0))

which is equivelant to p+cp+ gt pminuscpminus This shows that ζ(x(0)) gt 0 since byLemma 61 ζ(x(0)) = 0 results in p+cp+ = pminuscpminus = pc

References

[1] Breiman L 2001 Random forests Machine Learning 45 5ndash32

[2] Breiman L Friedman J H Olshen R A Stone J 1984 Classification andregression trees Wadsworth Belmont

[3] Doksum K Tang S Tsui K-W 2008 Nonparametric variable selection Theearth algorithm Journal of the American Statistical Association 103(484) 1609ndash1620

[4] Doksum K A Samarov A 1995 Nonparametric estimation of global functionalsand a measure of explanatory power of covariates in regression Annals of Statistics23 1443ndash1473

29

[5] Doksum K A Schafer C 2006 Powerful choices Tuning parameter selectionbased on power Frontiers in Statistics Dedicated to Peter Bickel Editors J Fanand H L Koul Imperial College Press 113ndash141

[6] Friedman J 1991 Multivariate adaptive regression splines The annals of statis-tics 1ndash67

[7] Gao J Gijbels I 2008 Bandwidth selection in nonparametric kernel testingJournal of the American Statistical Association 103 (484) 1584ndash1594

[8] Hall P 1992 The Bootstrap and Edgeworth Expansions Springer New York

[9] Huang L-S Chen J 2008 Analysis of variance coefficient of determinationand f-test for local polynomial regression Annals of Statistics 36 2085ndash2109

[10] Jing B-Y Shao Q-M Wang Q 2003 Self-normalized cramer-type large devi-ations for independent random variables Annals of Probability 31 2167ndash2215

[11] Lin D Y Zeng D 2011 Correcting for population stratification in genomewideassociation studies Journal of the American Statistical Association 106 997ndash1008

[12] Loh W-Y 2009 Improve the precision of classification trees The Annals ofApplied Statistics 3 1710ndash1737

[13] Price A Patterson N Plenge R Weinblatt M Shadick N Reich D 2006Principal components analysis corrects for stratification in genome-wide associationstudies Nature genetics 38 (8) 904ndash909

[14] Qiao Z Zhou L Huang J 2008 Linear discriminant analysis for high dimen-sional Low Sample Size Data special issue of World Congress on Engineering2008

[15] Quinlan J R 1987 Simplifying decision trees International Journal of Man-Machine Studies 27(3) 221ndash234

[16] Schafer C Doksum K A 2009 Selecting local models in multiple regression bymaximizing power Metrika 69 283ndash304

[17] Tibshirani R 1996 Regression shrinkage and selection via the lasso J RoyalStatist Soc B 58 267ndash288

[18] Wang L Shen X 2007 On l1-norm multi-class support vector machinesmethodology and theory Journal of the American Statistical Association 102 583ndash594

[19] Zhang H H Liu Y Wu Y Zhu J 2008 Variable selection for the multicate-gory svm via adaptive sup-norm regularization Electronic Journal of Statistics 2149ndash167

30

  • Introduction
  • Importance Scores for Univariate Classification
    • Numerical Predictor Local Contingency Efficacy
    • Categorical Predictor Contingency Efficacy
      • Variable Selection for Classification Problems The CATCH Algorithm
        • Numerical Predictors
        • Numerical and Categorical Predictors
        • The Empirical Importance Scores
        • Adaptive Tube Distances
        • The CATCH Algorithm
          • Simulation Studies
            • Variable Selection with Categorical and Numerical Predictors
            • Improving The Performance of SVM and RF Classifiers
              • CATCH Analysis of The ANN Data
              • Properties of Local Contingency Efficacy
              • Consistency Of CATCH
                • Type I consistency
                • Type II consistency
                • Type I and II consistency
                  • Proof of Theorem 61
Page 29: Nonparametric Variable Selection and Classi cation: The ...lc436/Tang_Chen_etal_2012_CSDA.pdf · pruning, will choose a subset of optimal splitting variables to be the most signi

Tn(j) = T (j)rc minus θ(j)n

where for simplicity the dependence of T(j)n and θ

(j)n on r and c is not indicated

Lemma 71

P

(minjled1|T (j)rc | le tn

)le

d1sumj=1

P

(|T (j)n | ge min

jled1θ(j)n minus tn

)Proof

P(

minjled1 |T(j)rc | le tn

)= P

(minjled1 |T

(j)n + θ

(j)n | le tn

)le P

(minjled1 |T

(j)n | minusmaxjled1 |θ

(j)n | le tn

)= P

(maxjled1 |T

(j)n | ge minjled1 |θ

(j)n | minus tn

)lesumd1

j=1 P(|T (j)n | ge minjled1 |θ

(j)n | minus tn

)

Setan = min

jled1

∣∣θ(j)n ∣∣minus tnthen by large deviation theory if an = o

(n

16

)

P (∣∣T (j)n

∣∣ ge an) prop

(anradic

2

)minus1expminusa2n2 (713)

where prop signifies asymptotic order as nrarrinfin an rarrinfin It follows that if (713) holdsuniformly in j then by Lemma 71

P

(minjled1

∣∣T (j)rc

∣∣ le tn

)prop d1

(anradic

2

)minus1expminusa2n2

Thus type II consistency holds when an = o(n

16

)and

d1

(anradic

2

)minus1expminusa2n2 rarr 0 (714)

In Section 71 tn is of order nr 0 lt r lt 16 For this tn type II consistency holds if an

is of order nr 0 lt r lt 16 and

d1nminusr expminusn2r2 rarr 0

27

If an = O(nk) with k ge 16 then P (

∣∣∣T (j)(n)

∣∣∣ ge an) tends to zero at a rate faster than in

(713) and consistency is ensured For instance if all relevant variables Xj have p(j)rc

fixed as nrarrinfin then minjled1

∣∣∣θ(j)n ∣∣∣ is of orderradicn and an is of order n

12 minus tn

73 Type I and II consistency

We see from Sections 71 and 72 that consistency holds under a variety of conditionsOne set of conditions is

Theorem 71 Under the regularization conditions of Sections 71 and 72 if

d0 = c0 exp(nb) tn = c1nr

minjled1 θ(j) minus tn = c2nr d1 = c3 exp

(nb)

for positive constants c0 c1 c2 and c3 and 0 lt 12b lt r lt 1

6 then CATCH is consistent

AppendixProof of Theorem 61

(1) Let Nh =

sumni=1 I(X(i) isin Nh(x

(0))) Then Nh sim Bi(n

int x(0)+hx(0)minush f(x)dx) and

Nh rarras infin as nrarrinfin

ζ(x(0) h) = nminus12radicX 2(x(0) h)

= (Nh n)12(N

h )minus12radicX 2(x(0) h)

Since Nh sim Bi(n

int x(0)+hx(0)minush f(x)dx) we have

Nh nrarr

int x(0)+h

x(0)minushf(x)dx (715)

By Lemma 61 and Table (1)

(Nh )minus12

radicX 2(x(0) h)rarr

( sumr=+minus

Csumc=1

(prc minus prqc)2

prqc

) 12

(716)

where

p+c = Pr(X gt x0 Y = c|X isin Nh(x(0))) pminusc = Pr(X le x0 Y = c|X isin Nh(x

(0)))

28

for r = +minus and c = 1 middot middot middot C

pr =Csumc=1

prc qc =sumr=+minus

prc

Actually p+ = Pr(X gt x0|X isin Nh(x(0))) pminus = Pr(X le x0|X isin Nh(x

(0)))qc = Pr(Y = c|X isin Nh(x

(0)))

By (715) and (716)

ζ(x(0) h)rarr(int x(0)+h

x(0)minushf(x)dx

) 12( sumr=+minus

Csumc=1

(prc minus prqc)2

prqc

) 12

equiv ζ(x(0) h)

(2) Since X and Y are independent prc = prqc for r = +minus and c = 1 middot middot middot C thenζ(x(0) h) = 0 for any h Hence

ζ(x(0)) = suphgt0ζ(x(0) h) = 0

(3) Recall that pc(x(0)) = Pr(Y = c|x(0)) Without loss of generality assume pprimec(x

(0)) gt0 Then there exists h such that

Pr(Y = c|x(0) lt X lt x(0) + h) gt Pr(Y = c|x(0) minus h lt X lt x(0))

which is equivelant to p+cp+ gt pminuscpminus This shows that ζ(x(0)) gt 0 since byLemma 61 ζ(x(0)) = 0 results in p+cp+ = pminuscpminus = pc

References

[1] Breiman L 2001 Random forests Machine Learning 45 5ndash32

[2] Breiman L Friedman J H Olshen R A Stone J 1984 Classification andregression trees Wadsworth Belmont

[3] Doksum K Tang S Tsui K-W 2008 Nonparametric variable selection Theearth algorithm Journal of the American Statistical Association 103(484) 1609ndash1620

[4] Doksum K A Samarov A 1995 Nonparametric estimation of global functionalsand a measure of explanatory power of covariates in regression Annals of Statistics23 1443ndash1473

29

[5] Doksum K A Schafer C 2006 Powerful choices Tuning parameter selectionbased on power Frontiers in Statistics Dedicated to Peter Bickel Editors J Fanand H L Koul Imperial College Press 113ndash141

[6] Friedman J 1991 Multivariate adaptive regression splines The annals of statis-tics 1ndash67

[7] Gao J Gijbels I 2008 Bandwidth selection in nonparametric kernel testingJournal of the American Statistical Association 103 (484) 1584ndash1594

[8] Hall P 1992 The Bootstrap and Edgeworth Expansions Springer New York

[9] Huang L-S Chen J 2008 Analysis of variance coefficient of determinationand f-test for local polynomial regression Annals of Statistics 36 2085ndash2109

[10] Jing B-Y Shao Q-M Wang Q 2003 Self-normalized cramer-type large devi-ations for independent random variables Annals of Probability 31 2167ndash2215

[11] Lin D Y Zeng D 2011 Correcting for population stratification in genomewideassociation studies Journal of the American Statistical Association 106 997ndash1008

[12] Loh W-Y 2009 Improve the precision of classification trees The Annals ofApplied Statistics 3 1710ndash1737

[13] Price A Patterson N Plenge R Weinblatt M Shadick N Reich D 2006Principal components analysis corrects for stratification in genome-wide associationstudies Nature genetics 38 (8) 904ndash909

[14] Qiao Z Zhou L Huang J 2008 Linear discriminant analysis for high dimen-sional Low Sample Size Data special issue of World Congress on Engineering2008

[15] Quinlan J R 1987 Simplifying decision trees International Journal of Man-Machine Studies 27(3) 221ndash234

[16] Schafer C Doksum K A 2009 Selecting local models in multiple regression bymaximizing power Metrika 69 283ndash304

[17] Tibshirani R 1996 Regression shrinkage and selection via the lasso J RoyalStatist Soc B 58 267ndash288

[18] Wang L Shen X 2007 On l1-norm multi-class support vector machinesmethodology and theory Journal of the American Statistical Association 102 583ndash594

[19] Zhang H H Liu Y Wu Y Zhu J 2008 Variable selection for the multicate-gory svm via adaptive sup-norm regularization Electronic Journal of Statistics 2149ndash167

30

  • Introduction
  • Importance Scores for Univariate Classification
    • Numerical Predictor Local Contingency Efficacy
    • Categorical Predictor Contingency Efficacy
      • Variable Selection for Classification Problems The CATCH Algorithm
        • Numerical Predictors
        • Numerical and Categorical Predictors
        • The Empirical Importance Scores
        • Adaptive Tube Distances
        • The CATCH Algorithm
          • Simulation Studies
            • Variable Selection with Categorical and Numerical Predictors
            • Improving The Performance of SVM and RF Classifiers
              • CATCH Analysis of The ANN Data
              • Properties of Local Contingency Efficacy
              • Consistency Of CATCH
                • Type I consistency
                • Type II consistency
                • Type I and II consistency
                  • Proof of Theorem 61
Page 30: Nonparametric Variable Selection and Classi cation: The ...lc436/Tang_Chen_etal_2012_CSDA.pdf · pruning, will choose a subset of optimal splitting variables to be the most signi

If an = O(nk) with k ge 16 then P (

∣∣∣T (j)(n)

∣∣∣ ge an) tends to zero at a rate faster than in

(713) and consistency is ensured For instance if all relevant variables Xj have p(j)rc

fixed as nrarrinfin then minjled1

∣∣∣θ(j)n ∣∣∣ is of orderradicn and an is of order n

12 minus tn

73 Type I and II consistency

We see from Sections 71 and 72 that consistency holds under a variety of conditionsOne set of conditions is

Theorem 71 Under the regularization conditions of Sections 71 and 72 if

d0 = c0 exp(nb) tn = c1nr

minjled1 θ(j) minus tn = c2nr d1 = c3 exp

(nb)

for positive constants c0 c1 c2 and c3 and 0 lt 12b lt r lt 1

6 then CATCH is consistent

AppendixProof of Theorem 61

(1) Let Nh =

sumni=1 I(X(i) isin Nh(x

(0))) Then Nh sim Bi(n

int x(0)+hx(0)minush f(x)dx) and

Nh rarras infin as nrarrinfin

ζ(x(0) h) = nminus12radicX 2(x(0) h)

= (Nh n)12(N

h )minus12radicX 2(x(0) h)

Since Nh sim Bi(n

int x(0)+hx(0)minush f(x)dx) we have

Nh nrarr

int x(0)+h

x(0)minushf(x)dx (715)

By Lemma 61 and Table (1)

(Nh )minus12

radicX 2(x(0) h)rarr

( sumr=+minus

Csumc=1

(prc minus prqc)2

prqc

) 12

(716)

where

p+c = Pr(X gt x0 Y = c|X isin Nh(x(0))) pminusc = Pr(X le x0 Y = c|X isin Nh(x

(0)))

28

for r = +minus and c = 1 middot middot middot C

pr =Csumc=1

prc qc =sumr=+minus

prc

Actually p+ = Pr(X gt x0|X isin Nh(x(0))) pminus = Pr(X le x0|X isin Nh(x

(0)))qc = Pr(Y = c|X isin Nh(x

(0)))

By (715) and (716)

ζ(x(0) h)rarr(int x(0)+h

x(0)minushf(x)dx

) 12( sumr=+minus

Csumc=1

(prc minus prqc)2

prqc

) 12

equiv ζ(x(0) h)

(2) Since X and Y are independent prc = prqc for r = +minus and c = 1 middot middot middot C thenζ(x(0) h) = 0 for any h Hence

ζ(x(0)) = suphgt0ζ(x(0) h) = 0

(3) Recall that pc(x(0)) = Pr(Y = c|x(0)) Without loss of generality assume pprimec(x

(0)) gt0 Then there exists h such that

Pr(Y = c|x(0) lt X lt x(0) + h) gt Pr(Y = c|x(0) minus h lt X lt x(0))

which is equivelant to p+cp+ gt pminuscpminus This shows that ζ(x(0)) gt 0 since byLemma 61 ζ(x(0)) = 0 results in p+cp+ = pminuscpminus = pc

References

[1] Breiman L 2001 Random forests Machine Learning 45 5ndash32

[2] Breiman L Friedman J H Olshen R A Stone J 1984 Classification andregression trees Wadsworth Belmont

[3] Doksum K Tang S Tsui K-W 2008 Nonparametric variable selection Theearth algorithm Journal of the American Statistical Association 103(484) 1609ndash1620

[4] Doksum K A Samarov A 1995 Nonparametric estimation of global functionalsand a measure of explanatory power of covariates in regression Annals of Statistics23 1443ndash1473

29

[5] Doksum K A Schafer C 2006 Powerful choices Tuning parameter selectionbased on power Frontiers in Statistics Dedicated to Peter Bickel Editors J Fanand H L Koul Imperial College Press 113ndash141

[6] Friedman J 1991 Multivariate adaptive regression splines The annals of statis-tics 1ndash67

[7] Gao J Gijbels I 2008 Bandwidth selection in nonparametric kernel testingJournal of the American Statistical Association 103 (484) 1584ndash1594

[8] Hall P 1992 The Bootstrap and Edgeworth Expansions Springer New York

[9] Huang L-S Chen J 2008 Analysis of variance coefficient of determinationand f-test for local polynomial regression Annals of Statistics 36 2085ndash2109

[10] Jing B-Y Shao Q-M Wang Q 2003 Self-normalized cramer-type large devi-ations for independent random variables Annals of Probability 31 2167ndash2215

[11] Lin D Y Zeng D 2011 Correcting for population stratification in genomewideassociation studies Journal of the American Statistical Association 106 997ndash1008

[12] Loh W-Y 2009 Improve the precision of classification trees The Annals ofApplied Statistics 3 1710ndash1737

[13] Price A Patterson N Plenge R Weinblatt M Shadick N Reich D 2006Principal components analysis corrects for stratification in genome-wide associationstudies Nature genetics 38 (8) 904ndash909

[14] Qiao Z Zhou L Huang J 2008 Linear discriminant analysis for high dimen-sional Low Sample Size Data special issue of World Congress on Engineering2008

[15] Quinlan J R 1987 Simplifying decision trees International Journal of Man-Machine Studies 27(3) 221ndash234

[16] Schafer C Doksum K A 2009 Selecting local models in multiple regression bymaximizing power Metrika 69 283ndash304

[17] Tibshirani R 1996 Regression shrinkage and selection via the lasso J RoyalStatist Soc B 58 267ndash288

[18] Wang L Shen X 2007 On l1-norm multi-class support vector machinesmethodology and theory Journal of the American Statistical Association 102 583ndash594

[19] Zhang H H Liu Y Wu Y Zhu J 2008 Variable selection for the multicate-gory svm via adaptive sup-norm regularization Electronic Journal of Statistics 2149ndash167

30

  • Introduction
  • Importance Scores for Univariate Classification
    • Numerical Predictor Local Contingency Efficacy
    • Categorical Predictor Contingency Efficacy
      • Variable Selection for Classification Problems The CATCH Algorithm
        • Numerical Predictors
        • Numerical and Categorical Predictors
        • The Empirical Importance Scores
        • Adaptive Tube Distances
        • The CATCH Algorithm
          • Simulation Studies
            • Variable Selection with Categorical and Numerical Predictors
            • Improving The Performance of SVM and RF Classifiers
              • CATCH Analysis of The ANN Data
              • Properties of Local Contingency Efficacy
              • Consistency Of CATCH
                • Type I consistency
                • Type II consistency
                • Type I and II consistency
                  • Proof of Theorem 61
Page 31: Nonparametric Variable Selection and Classi cation: The ...lc436/Tang_Chen_etal_2012_CSDA.pdf · pruning, will choose a subset of optimal splitting variables to be the most signi

for r = +minus and c = 1 middot middot middot C

pr =Csumc=1

prc qc =sumr=+minus

prc

Actually p+ = Pr(X gt x0|X isin Nh(x(0))) pminus = Pr(X le x0|X isin Nh(x

(0)))qc = Pr(Y = c|X isin Nh(x

(0)))

By (715) and (716)

ζ(x(0) h)rarr(int x(0)+h

x(0)minushf(x)dx

) 12( sumr=+minus

Csumc=1

(prc minus prqc)2

prqc

) 12

equiv ζ(x(0) h)

(2) Since X and Y are independent prc = prqc for r = +minus and c = 1 middot middot middot C thenζ(x(0) h) = 0 for any h Hence

ζ(x(0)) = suphgt0ζ(x(0) h) = 0

(3) Recall that pc(x(0)) = Pr(Y = c|x(0)) Without loss of generality assume pprimec(x

(0)) gt0 Then there exists h such that

Pr(Y = c|x(0) lt X lt x(0) + h) gt Pr(Y = c|x(0) minus h lt X lt x(0))

which is equivelant to p+cp+ gt pminuscpminus This shows that ζ(x(0)) gt 0 since byLemma 61 ζ(x(0)) = 0 results in p+cp+ = pminuscpminus = pc

References

[1] Breiman, L., 2001. Random forests. Machine Learning 45, 5–32.

[2] Breiman, L., Friedman, J. H., Olshen, R. A., Stone, C. J., 1984. Classification and Regression Trees. Wadsworth, Belmont.

[3] Doksum, K., Tang, S., Tsui, K.-W., 2008. Nonparametric variable selection: The EARTH algorithm. Journal of the American Statistical Association 103(484), 1609–1620.

[4] Doksum, K. A., Samarov, A., 1995. Nonparametric estimation of global functionals and a measure of the explanatory power of covariates in regression. Annals of Statistics 23, 1443–1473.

[5] Doksum, K. A., Schafer, C., 2006. Powerful choices: Tuning parameter selection based on power. In: Frontiers in Statistics: Dedicated to Peter Bickel (J. Fan and H. L. Koul, eds.). Imperial College Press, 113–141.

[6] Friedman, J., 1991. Multivariate adaptive regression splines. The Annals of Statistics 19, 1–67.

[7] Gao, J., Gijbels, I., 2008. Bandwidth selection in nonparametric kernel testing. Journal of the American Statistical Association 103(484), 1584–1594.

[8] Hall, P., 1992. The Bootstrap and Edgeworth Expansion. Springer, New York.

[9] Huang, L.-S., Chen, J., 2008. Analysis of variance, coefficient of determination and F-test for local polynomial regression. Annals of Statistics 36, 2085–2109.

[10] Jing, B.-Y., Shao, Q.-M., Wang, Q., 2003. Self-normalized Cramér-type large deviations for independent random variables. Annals of Probability 31, 2167–2215.

[11] Lin, D. Y., Zeng, D., 2011. Correcting for population stratification in genomewide association studies. Journal of the American Statistical Association 106, 997–1008.

[12] Loh, W.-Y., 2009. Improving the precision of classification trees. The Annals of Applied Statistics 3, 1710–1737.

[13] Price, A., Patterson, N., Plenge, R., Weinblatt, M., Shadick, N., Reich, D., 2006. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38(8), 904–909.

[14] Qiao, Z., Zhou, L., Huang, J., 2008. Linear discriminant analysis for high-dimensional, low sample size data. Special issue of World Congress on Engineering 2008.

[15] Quinlan, J. R., 1987. Simplifying decision trees. International Journal of Man-Machine Studies 27(3), 221–234.

[16] Schafer, C., Doksum, K. A., 2009. Selecting local models in multiple regression by maximizing power. Metrika 69, 283–304.

[17] Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267–288.

[18] Wang, L., Shen, X., 2007. On L1-norm multi-class support vector machines: methodology and theory. Journal of the American Statistical Association 102, 583–594.

[19] Zhang, H. H., Liu, Y., Wu, Y., Zhu, J., 2008. Variable selection for the multicategory SVM via adaptive sup-norm regularization. Electronic Journal of Statistics 2, 149–167.
