model selection criteria based on kullback information measures for nonlinear regression

Journal of Statistical Planning andInference 134 (2005) 332–349

www.elsevier.com/locate/jspi

Model selection criteria based on Kullbackinformation measures for nonlinear regression

Hyun-Joo Kima, Joseph E. Cavanaughb,∗aDepartment of Mathematics and Computer Science, Truman State University, USA

bDepartment of Biostatistics, C22-GH, College of Public Health, University of Iowa, 200 Hawkins Drive, IowaCity, IA 52242-1009, USA

Received 1 March 2002; accepted 6 May 2004Available online 22 July 2004

Abstract

In statistical modeling, selecting an optimal model from a class of candidates is a critical issue.During the past three decades, a number of model selection criteria have been proposed based onestimating Kullback’s (Information Theory and Statistics, Dover, Mineola, NY, 1968, p. 5) directeddivergence between the model generating the data and a fitted candidate model. The Akaike (SecondInternational Symposium on Information Theory, Akadémia Kiadó, Budapest, Hungary, 1973, pp.267–281; IEEE Trans. Automat. Control AC-19 (1974) 716) information criterion, AIC, was the firstof these. AIC is justified in a very general framework, and as a result, offers a crude estimator of thedirected divergence: one which exhibits a potentially high degree of negative bias in small-sampleapplications (Biometrika 76 (1989) 297). The “corrected” Akaike information criterion (Biometrika76 (1989) 297), AICc, adjusts for this bias, and consequently often outperforms AIC as a selectioncriterion. However, AICc is less broadly applicable than AIC since its justification depends upon thestructure of the candidate model. AICI (Biometrika 77 (1990) 709) is an “improved” version of AICfeaturing a simulated bias correction.Recently, model selection criteria have been proposed based on estimating Kullback’s (Information

Theory and Statistics, Dover, Mineola, NY, 1968, p. 6) symmetric divergence between the generatingmodel and a fitted candidate model (Statist. Probab. Lett. 42 (1999) 333; Austral. New Zealand J.Statist. 46 (2004) 257). KIC, KICc, and KICI are criteria devised to target the symmetric divergencein the same manner that AIC, AICc, and AICI target the directed divergence.AICc has been justified for the nonlinear regression framework by Hurvich and Tsai (Biometrika

76 (1989) 297). In this paper, we justify KICc for this framework, and propose versions of AICI andKICI suitable for nonlinear regression applications. We evaluate the selection performance of AIC,

∗ Corresponding author. Tel.: +1-319-394-5024; fax: +1-319-384-5018.E-mail address:[email protected](J.E. Cavanaugh).

0378-3758/$ - see front matter © 2004 Elsevier B.V. All rights reserved.doi:10.1016/j.jspi.2004.05.002

http://www.elsevier.com/locate/jspi

mailto:[email protected]

H.-J. Kim, J.E. Cavanaugh / Journal of Statistical Planning and Inference 134 (2005) 332–349333

AICc, AICI , KIC, KICc, and KICI in a simulation study. Our results generally indicate that the“improved” criteria outperform the “corrected” criteria, which in turn outperform the non-adjustedcriteria. Moreover, the KIC family performs favorably against the AIC family.© 2004 Elsevier B.V. All rights reserved.

MSC:62J02; 62F07; 62B10

Keywords:AIC; Akaike information criterion;I-divergence;J-divergence; Kullback–Leibler information;Nonlinear regression

1. Introduction

In statistical modeling, one of the main objectives is to select a suitable model from acandidate class to characterize the underlying data. Model selection criteria provide a usefultool in this regard. A selection criterion assesses whether a fitted model offers an optimalbalance between goodness-of-fit and parsimony. Ideally, a criterion will identify candidatemodels which are either too simplistic to accommodate the data or unnecessarily complex.The first model selection criterion to gain widespread acceptance was theAkaike (1973,

1974)information criterion, (AIC). AIC is applicable in a broad array of modeling frame-works, since its large-sample justification only requires conventional asymptotic propertiesof maximum likelihood estimators. However, in settings where the sample size is small,AIC tends to favor inappropriately high-dimensional candidate models (Hurvich and Tsai,1989); this limits its effectiveness as a model selection criterion.AIC serves as an estimator ofKullback’s (1968, p. 5)directed divergence between the

generating or “true” model (i.e., the model which presumably gave rise to the data) and afitted candidatemodel. The “corrected” AIC, AICc, is an adjusted version of AIC originallyproposed for linear regression with normal errors (Sugiura, 1978; Hurvich and Tsai, 1989).For fitted models in the candidate class which are correctly specified or overfit, AIC isasymptotically unbiased and AICc is exactly unbiased as an estimator of its target measure.In small-sample applications, AICc often dramatically outperforms AIC as a selection

criterion. Since the basic form of AICc is similar to that of AIC, the improvement inselection performance comes without an increase in computational cost. However, AICcis less broadly applicable than AIC since its justification relies upon the structure of thecandidate model.Another adjusted variant of AIC is AICI , an “improved” version of AIC proposed by

Hurvich et al. (1990)for Gaussian autoregressive model selection. The derivation of AICIproceeds by decomposing the expected directed divergence into two terms. The first termsuggests that the empirical log likelihood can be used to form a biased estimator of thedirected divergence; the second term provides the bias adjustment. Exact computation ofthe bias adjustment requires the values of the truemodel parameters, which are inaccessiblein practical applications. Yet for fitted models in the candidate class which are correctlyspecified or overfit, the adjustment is asymptotically independent of the true parameters.Thus, for large-sample applications, the adjustment may be approximated via Monte Carlosimulation using arbitrary values for the parameters.

334 H.-J. Kim, J.E. Cavanaugh / Journal of Statistical Planning and Inference 134 (2005) 332–349

The directed divergence, also known as theKullback–Leibler (1951)information or theI-divergence, accesses the dissimilarity between two statistical models. It is an asymmetricmeasure, meaning that an alternative directed divergence may be obtained by reversing theroles of the two models in the definition of the measure. The sum of the two directed diver-gences isKullback’s (1968, p. 6)symmetric divergence, also known as theJ-divergence.When used to evaluate fitted candidate models, the directed divergence which serves as thebasis for AIC is arguably less sensitive than the symmetric divergence towards detectingimproperly specified models. This premise has been used to justify the development of anew family of selection criteria (Cavanaugh, 1999, 2004). KIC, KICc, and KICI are criteriaconstructed to target the symmetric divergence in the same manner that AIC, AICc, andAICI target the directed divergence. KIC has been justified under the same general condi-tions as AIC (Cavanaugh, 1999); however, KICc has only been justified for linear regressionwith normal errors (Cavanaugh, 2004). As with AICI , KICI must be formulated based uponthe structure of the candidate modeling framework.Hurvich and Tsai (1989)established that AICc serves as an approximately unbiased

estimator of Kullback’s directed divergence for nonlinear regression candidate models withnormal errors. In this paper, we justify KICc in the same framework. We also proposeversions of AICI and KICI suitable for nonlinear regression applications. We evaluate theselection performance of AIC, AICc, AICI , KIC, KICc, and KICI in a simulation study. Ourresults generally indicate that the “improved” criteria outperform the “corrected” criteria,which in turn outperform the non-adjusted criteria. Moreover, the KIC family performsfavorably against the AIC family.In Section 2, we propose and discuss the criteria. Our simulation study is presented and

summarized in Section 3. The formal justification of KICc for nonlinear regression appearsin the appendix.

2. Selection criteria based on Kullback information measures

Thenonlinear regressionmodel is frequently used inmanyareasof thephysical, chemical,engineering, and biological sciences. The traditional regression model assumes that themean structure is linear in the model coefficients: i.e.,E[y] = X′�, wherey is the responsevariable,X is a vector of regressor variables, and� is anunknownparameter vector.However,oneoftenexpectsanonlinear relationshipbetweenE[y]andX, perhapsbecauseof the theorywhich supports the underlying phenomenon. Many nonlinear models fall into categoriesthat are designed for certain situations: thus, there are various families of nonlinear modelscorresponding to specific functional forms of the mean response.Assume a collection of dataY has been generated according to an unknown parametric

densityf (Y | �o), one which corresponds to the normal regression model

Y = ho(�o,Xo) + �, � ∼ N(0,�2oI ). (2.1)

Suppose that the candidate model postulated for the data is of the form

Y = h(�,X) + �, � ∼ N(0,�2I ). (2.2)


Here,Y is ann × 1 response vector,�o and� arepo × 1 andp × 1 parameter vectors, andXo andX aren × so andn × s design matrices with rowsXoi andXi . The mean vectorsare assumed to have the layouts

ho(�o,Xo) = (go(�o, Xo1), . . . , go(�o, Xon))′ and

h(�,X) = (g(�, X1), . . . , g(�, Xn)).′

To establish asymptotic inferential results, the mean response functiong is generally re-quired to be twice continuously differentiable in�. For the sake of brevity, we will subse-quently writeh(�,X) ash(�) andho(�o,Xo) asho(�o).Define the parameter vectors�o and�k as�o = (�′

o,�2o)

′ and�k = (�′,�2)′. The subscriptk on�k will refer to the dimension of the vector; i.e.,k = p + 1.Let f (Y | �k) denote the likelihood under model (2.2). We will use� and�2 to represent

the maximum likelihood estimators (MLEs) of� and�2, which are generally obtainedusing the Gauss–Newton method or some other iterative procedure. We will let�k denote

the MLE for �k; i.e., �k = (�′, �2)′. Accordingly,f (Y | �k) will represent the empirical

likelihood corresponding tof (Y | �k).LetF(k) = {

f (Y | �k)|�k ∈ �(k)}denote thek-dimensional parametric family of den-

sities corresponding to candidate models (2.2) of a particular size. Suppose our goal is tosearch among a collection of families{F(k1),F(k2), . . . , F(kL)} for the fitted modelf (Y | �k), k ∈ {k1, k2, . . . , kL}, which serves as the “best” approximation tof (Y | �o). Wenote that in many applications, some of the families in the candidate collection may havethe same dimension and yet be different; e.g., for some families of linear regressionmodels,the design matrices may have the same rank and yet different column spaces. For ease ofnotation, we do not include an index to delineate between such families.If f (Y | �o) ∈ F(k), andF(k) is such that no smaller family will containf (Y | �o), we

refer tof (Y | �k) ascorrectly specified. If f (Y | �o) ∈ F(k), yetF(k) is such that familiessmaller thanF(k)also containf (Y | �o), we refer tof (Y | �k)asoverfit. If f (Y | �o) /∈F(k),we refer tof (Y | �k) asunderfit.To determine which of the fitted models{f (Y | �k1), f (Y | �k2), . . . , f (Y | �kL

)} best re-semblesf (Y | �o), we require ameasure which provides a suitable reflection of the disparitybetween the true modelf (Y | �o) and a candidate modelf (Y | �k). Kullback’s directed andsymmetric divergence both fulfill this objective.For two arbitrary parametric densitiesf (Y | �) andf (Y | �∗), Kullback’sdirected diver-

gencebetweenf (Y | �) andf (Y | �∗) with respect tof (Y | �) is defined as

I (�, �∗) = E�

[ln

{f (Y | �)

f (Y | �∗)

}], (2.3)

and Kullback’ssymmetric divergencebetweenf (Y | �) andf (Y | �∗) is defined as

J (�, �∗) = E�

[ln

{f (Y | �)

f (Y | �∗)

}]+ E�∗

[ln

{f (Y | �∗)f (Y | �)

}]. (2.4)

Here,E� denotes the expectation underf (Y | �). Note thatJ (�, �∗) is symmetric in itsarguments whereasI (�, �∗) is not. Thus, an alternate directed divergence,I (�∗, �), may


be obtained by switching the roles off (Y | �) andf (Y | �∗) in (2.3). The sum of the twodirected divergences yields the symmetric divergence:J (�, �∗) = I (�, �∗) + I (�∗, �).For the purpose of assessing the proximity between a certain fitted candidate model

f (Y | �k) and the true modelf (Y | �o), we consider the measures

I (�o, �k) = I (�o, �k)|�k=�kand J (�o, �k) = J (�o, �k)|�k=�k

.

Of these two,Cavanaugh (1999, 2004)conjectures thatJ (�o, �k)may be preferred, since itcombinesI (�o, �k)with its counterpartI (�k, �o), a measure which serves a related yet dis-tinct function.Togauge thedisparitybetweenf (Y | �k)andf (Y | �o),I (�o, �k)assesseshowwell samples generated under the truemodelf (Y | �o) conform to the fitted candidatemodelf (Y | �k), whereasI (�k, �o) assesses howwell samples generated under the fitted candidatemodelf (Y | �k) conform to the true modelf (Y | �o). As a result of these contrasting roles,I (�o, �k) tends to be more sensitive towards reflecting overfit models, whereasI (�k, �o)

tends to be more sensitive towards reflecting underfit models. Accordingly,J (�o, �k) maybe more adept at detecting misspecification than either of its components.In what follows, we will show how the AIC family of model selection criteria arises

through estimating a variant ofI (�o, �k). We will then show how an alternate family ofselection criteria, the KIC family, arises through estimating a variant ofJ (�o, �k).For two arbitrary parametric densitiesf (Y | �) andf (Y | �∗), let

d(�, �∗) = E�[−2 ln f (Y | �∗)]. (2.5)

From (2.3) and (2.5), note that we can write

2I (�o, �k) = d(�o, �k) − d(�o, �o). (2.6)

Sinced(�o, �o) does not depend on�k, any ranking of a set of candidate models corre-sponding to values ofI (�o, �k) would be identical to a ranking corresponding to values ofd(�o, �k). Hence, for the purpose at hand,d(�o, �k) serves as a valid substitute forI (�o, �k).Now for a given set of MLEs�k,

d(�o, �k) = d(�o, �k)|�k=�k

would provide a meaningful measure of separation between the true model and a fittedcandidate model. Evaluatingd(�o, �k) is not possible since doing so requires knowledgeof �o. However, the work ofAkaike (1973, 1974)suggests that−2 ln f (Y | �k) serves asa biased estimator ofd(�o, �k), and that in many applications (including those beyond thescope of nonlinear regression models), the bias adjustment

B1(k, �o) = E�o[d(�o, �k)] − E�o

[−2 ln f (Y | �k)] (2.7)

can be asymptotically estimated by twice the dimension of�k. Specifically, if we assumethat �k satisfies the conventional large-sample properties of MLEs, and thatf (Y | �k) iseither correctly specified or overfit (f (Y | �o) ∈ F(k)), it can be shown that

B1(k, �o) � 2k. (2.8)


(See, for instance, Cavanaugh, 1997, p. 204.) With this motivation, we define the criterion

AIC = −2 ln f (Y | �k) + 2k.

As the sample size increases, the difference between the expected value of AIC and theexpected value ofd(�o, �k) should tend to zero. Accordingly, if we define

�(k, �o)=E�o[d(�o, �k)]

= E�o[−2 lnf (Y | �k)] + B1(k, �o), (2.9)

we may regard AIC as an asymptotically unbiased estimator of�(k, �o).Whenn is large andk is comparatively small, the degree of bias incurred in estimating

�(k, �o)with AIC is negligible. However, whenn is small andk is relatively large (e.g.,k �n/2), 2k is often much smaller thanB1(k, �o), making AIC substantially negatively biasedas an estimator of�(k, �o). If AIC severely underestimates�(k, �o) for high-dimensionalfitted models in the candidate class, the criterion may favor these models even though theymay correspond to large values ofd(�o, �k) (Hurvich and Tsai, 1989).AICc and AICI were proposed to serve as estimators of�(k, �o) which are less biased

in small-sample applications than traditional AIC (Hurvich and Tsai, 1989; Hurvich et al.,1990). However, since the justification of AICc and the computation of AICI are contingentupon the structure of the candidate modeling framework, these criteria are less generallyapplicable than AIC.AICcwas originally proposed bySugiura (1978)in the setting of linear regressionmodels

with normal errors. In this framework, the bias adjustment (2.7) can be evaluated exactlyfor correctly specified and overfit models. Wherep represents the rank of the design matrixfor the candidate model, it can be shown that whenf (Y | �o) ∈ F(k),

B1(k, �o) = 2n(p + 1)(n − p − 2) . (2.10)

(See Cavanaugh, 1997, pp. 204–205.) Thus, an exactly unbiased estimator of�(k, �o) isgiven by

AICc= −2 ln f (Y | �k) + 2n(p + 1)(n − p − 2) .

Although relation (2.10) does not hold precisely in the normal nonlinear regression frame-work, the arguments and results ofHurvich and Tsai (1989)suggest thatB1(k, �o) is wellapproximated by{2n(p + 1)}/(n − p − 2) even for relatively smalln.AICI was originally proposed for Gaussian autoregressive models byHurvich et al.

(1990). In this framework, relation (2.10) only holds approximately. However, whenf (Y | �o) ∈ F(k), the bias adjustmentB1(k, �o) is asymptotically independent of the truemodel parameters�o. Thus, for largen, B1(k, �o) may be approximated via Monte Carlosimulation after setting�o equal to a conveniently chosen vector.


To propose an AICI for the normal nonlinear regression framework, we utilize the fol-lowing results:

−2 ln f (Y | �k)=n ln �2+ {Y − h(�)}′{Y − h(�)}�2

,

−2 ln f (Y | �k)=n(ln �2+ 1), (2.11)

E�o[d(�o, �k)]=E�o

[n ln �2+ n�2o

�2+ {ho(�o) − h(�)}′{ho(�o) − h(�)}

�2

].

(2.12)

(In the preceding relations and throughout our development, we have neglected the additiveconstantn ln 2�.) Note that by using (2.11) and (2.12) in conjunction with (2.7) and (2.9),we may write

�(k, �o)=E�o[d(�o, �k)]

= E�o[−2 ln f (Y | �k)]

+ E�o

[n�2o�2

+ {ho(�o) − h(�)}′{ho(�o) − h(�)}�2

− n

]. (2.13)

The form of AICI is suggested by (2.13). To evaluate AICI , we set the parameters�2o and�o at conveniently chosen values, generateR samples according to model (2.1), solve fortheR sets of corresponding MLEs{(�2(1), �(1)), . . . , (�2(R), �(R))} under model (2.2),and compute the criterion via

AICI= − 2 ln f (Y | �k) + 1

R

R∑j=1

[n�2o

�2(j)

+ {ho(�o) − h(�(j))}′{ho(�o) − h(�(j))}�2(j)

− n

].

In frameworks where it is not possible to evaluateB1(k, �o) exactly, AICI may estimate�(k, �o) with less bias than AICc, and may outperform AICc as a selection criterion(Hurvich et al., 1990).Next, we propose selection criteria devised to targetJ (�o, �k) in the same manner that

AIC, AICc, and AICI targetI (�o, �k).Similar to (2.6), using (2.4) and (2.5), we can write

2J (�o, �k) = {d(�o, �k) − d(�o, �o)} + {d(�k, �o) − d(�k, �k)}.Discarding the constantd(�o, �o) from the preceding yields

K(�o, �k) = d(�o, �k) + {d(�k, �o) − d(�k, �k)}.For the purpose of discriminating among various candidate models,K(�o, �k) is equivalentto J (�o, �k). Measures such asK(�o, �k), J (�o, �k), d(�o, �k), andI (�o, �k) are oftencalleddiscrepancies. (See Linhart and Zucchini, 1986, pp. 11–12.)


Now consider estimating

K(�o, �k) = K(�o, �k)|�k=�k.

If −2 ln f (Y | �k) is regarded as a platform for an estimator of this measure, the challengeis then to correct for the bias. The bias adjustment may be expressed as

E�o[K(�o, �k)] − E�o

[−2 ln f (Y | �k)]= E�o

[d(�o, �k)] − E�o[−2 ln f (Y | �k)] (2.14)

+ E�o[d(�k, �o)] − E�o

[d(�k, �k)]. (2.15)

Note that the difference (2.14) is the same asB1(k, �o), the bias adjustment ford(�o, �k)

expressed in (2.7). For difference (2.15), define

B2(k, �o) = E�o[d(�k, �o)] − E�o

[d(�k, �k)]. (2.16)

The penalty terms of AIC, AICc, and AICI provide us with estimators ofB1(k, �o); ourgoal is to seek similar estimators ofB2(k, �o). This will lead us to a set of criteria which areanalogous to AIC, AICc, and AICI, targetingK(�o, �k) in the same way that the AIC-typecriteria targetd(�o, �k).First, we propose an analogue of AIC based on estimatingB2(k, �o) in the same manner

that the penalty term of AIC estimatesB1(k, �o). If we assume thatf (Y | �k) is eithercorrectly specified or overfit (f (Y | �o) ∈ F(k)), it can be shown that for largen,

B2(k, �o) � k. (2.17)

(See Cavanaugh, 1999, pp. 337–338.) As with (2.8), the preceding applies to any modelingframework in which�k satisfies the conventional properties of MLEs. Motivated by thelarge-sample approximations (2.8) and (2.17), we define the criterion

KIC = −2 ln f (Y | �k) + 3k.

As the sample size increases, the difference between the expected value of KIC and theexpected value ofK(�o, �k) should tend to zero. Accordingly, if we define

�(k, �o)=E�o[K(�o, �k)]

= E�o[−2 ln f (Y | �k)] + B1(k, �o) + B2(k, �o), (2.18)

we may regard KIC as an asymptotically unbiased estimator of�(k, �o).Cavanaugh (2004)proposed an analogue of AICc for normal linear regression models

based on estimatingB2(k, �o) in the same manner that the penalty term of AICc estimatesB1(k, �o). In the normal linear regression framework, the bias adjustment (2.16) can beevaluated exactly for correctly specified and overfit models. Whenf (Y | �o) ∈ F(k), it canbe shown that

B2(k, �o) = n ln(n

2

)− n

(n − p

2

), (2.19)


where(·) denotes thepsi or digammafunction. Although(·) does not have a closedform representation, an accurate substitute for (2.19) is suggested by the large-sampleapproximation{

n ln(n

2

)− n

(n − p

2

)}� n ln

(n

n − p

)+ n

n − p. (2.20)

(See Kotz and Johnson, 1982, p. 373.) Based on (2.18), (2.10), (2.19), and (2.20), we definethe criterion

KICc= −2 ln f (Y | �k) + n ln

(n

n − p

)+ n {(n − p)(2p + 3) − 2}

(n − p − 2)(n − p).

For the normal nonlinear regression framework, the justification of KICc as an approxi-mately unbiased estimator ofK(�o, �k) is provided in the appendix.Finally, we introduce KICI as an analogue of AICI based on augmenting AICI with a

simulated approximation toB2(k, �o). We utilize the following results:

E�o[d(�k, �k)]=E�o

[n(ln �2+ 1)], (2.21)

E�o[d(�k, �o)]=E�o

[n ln �2o + n�2

�2o+ {h(�) − ho(�o)}′{h(�) − ho(�o)}

�2o

]. (2.22)

Note that by using (2.11), (2.12), (2.21), and (2.22) in conjunction with (2.7), (2.16), and(2.18), we may write

�(k, �o)=E�o[K(�o, �k)]

= E�o[−2 ln f (Y | �k)]

+ E�o

[n ln

(�2o�2

)+ n�2o

�2+ n�2

�2o

+(1

�2+ 1

�2o

){ho(�o) − h(�)}′{ho(�o) − h(�)} − 2n

]. (2.23)

The form of KICI is suggested by (2.23). To evaluate KICI , we set the parameters�2o and�o at conveniently chosen values, generateR samples according to model (2.1), solve fortheR sets of corresponding MLEs{(�2(1), �(1)), . . . , (�2(R), �(R))} under model (2.2),and compute the criterion via

KICI= − 2 ln f (Y | �k)

+ 1

R

R∑j=1

[n ln

{�2o

�2(j)

}+ n�2o

�2(j)+ n�2(j)

�2o

+{

1

�2(j)+ 1

�2o

}{ho(�o) − h(�(j))}′{ho(�o) − h(�(j))} − 2n

].

Having now presented AIC, AICc, AICI , KIC, KICc, and KICI as selection criteria fornonlinear regression applications, we evaluate the selection performance of these criteriain a simulation study.


3. Simulations

3.1. Simulation sets based on nested candidate models

Consider a setting where a sample of sizen is generated from an exponential regressionmodel of the form (2.1) with

ho(�o,Xo) = (o exp(X′o1�o), . . . , o exp(X

′on�o))

′; (3.1)

i.e.,go(�o, Xoi)=o exp(X′oi�o) and�o = (o,�

′o)

′. Here,Xo is ann× so covariate matrixof rankso with rowsXoi , o is a scale parameter, and�o is anso × 1 regression parametervector. Suppose our objective is to search among a candidate collection of nested familiesfor the fitted model which serves as the best approximation to (2.1).Assumeour candidatemodels are of form the (2.2)with anexponential response function:

h(�,X) = ( exp(X′1�), . . . , exp(X′

n�))′; (3.2)

i.e.,g(�, Xi) = exp(X′i�) and� = (,�′)′. Here,X is ann × s covariate matrix of ranks

with rowsXi , is a scale parameter, and� is ans × 1 regression parameter vector. Let

and� denote the MLEs of and�, and let� = (, �′)′ denote the MLE of�.

In fitting candidate models to the data, we will consider nested design matricesX ofrankss = 1,2, . . . , S. We will assume that the design matrix of rankso (1< so < S) iscorrectly specified. Hence, fitted models for which 1�s < so are underfit, and those forwhich so < s�S are overfit. We will refer tos as theorder of the model and toso as thetrue order.The dimension of the model is given byk = s + 2, and the size of the parametervector� is p = s + 1.We examine the behavior of AIC, AICc, AICI , KIC, KICc, and KICI as order selection

criteria. For the simulated bias adjustments of AICI and KICI, recall that the true modelparametersmay be set to convenient values. The values chosen for the parameters areo=1,�o = (0,0, . . . ,0)′, and�2o = 1. With these specifications, (2.1) and (3.1) imply that theresponse vectorYconsists of independent, identically distributed standard normal variates.Additionally, AICI and KICI may be defined as follows:

AICI=n(ln �2+ 1)

+ 1

R

R∑j=1

[n

�2(j)+ {h(�(j))′h(�(j))}

�2(j)− n

],

KICI=n(ln �2+ 1)

+ 1

R

R∑j=1

[−n ln �2(j) + n

�2(j)+ n�2(j)

+{

1

�2(j)+ 1

} {h(�(j))′h(�(j))

}− 2n

].

(Two R routines are available from the authors which will provide the simulated biasadjustments for AICI and KICI for nonlinear regression applications.)


Table 1Order selections, minimum MSEP selections, and average MSEP for selections; generating model (3.3)

Set S Selections Criterion

so AIC AICc AICI KIC KICc KICIn

1 7 Underfit 1 1 2 2 2 33 Correctly specified 676 811 814 848 908 91150 Overfit 323 188 184 150 90 86

Minimum MSEP 616 741 744 778 838 840Average MSEP 0.0605 0.0526 0.0517 0.0508 0.0476 0.0469



3 10 Underfit 0 0 0 0 0 03 Correctly specified 694 768 792 857 889 912

100 Overfit 306 232 208 143 111 88Minimum MSEP 650 724 747 811 843 866Average MSEP 0.0308 0.0278 0.0271 0.0249 0.0237 0.0232

In the initial six simulation sets, 1000 samples are generated from a true model whereo = 1 and�o = (1,1, . . . ,1)′. In the first three of these sets,so = 3 and�2o = 1. Thus, inscalar form, the true model can be written as

yi = exp(x1i + x2i + x3i ) + ei, ei ∼ iid N(0,1). (3.3)

In the next three sets,so = 5 and�2o = 4, meaning that the scalar representation of the truemodel is

yi = exp(x1i + x2i + x3i + x4i + x5i ) + ei, ei ∼ iid N(0,4). (3.4)

The regressors for all models are produced using a Uniform(−1, 1) distribution. For eachof the true models, three different sample sizesn are employed: 50, 75, and 100. For thesets withn = 50 and 75, candidate models of orders 1–7 are considered; for the sets withn = 100, orders 1–10 are entertained. The simulated bias adjustments for AICI and KICIare based onR = 200 replications. For every sample in a set, the fitted model favored byeach criterion is recorded. Over the 1000 samples, the order selections are tabulated andsummarized.The order selection results for the three sets corresponding to model (3.3) are featured in

Table 1; those corresponding to model (3.4) appear inTable 2. Note that eachJ-divergencecriterion obtainsmore correct selections than itsI-divergence counterpart. Also, within eachfamily of criteria, the “improved” criterion outperforms the “corrected” criterion, which inturn outperforms the non-adjusted criterion. In each of the sets, KICI obtains the mostcorrect selections, followed by KICc. KIC ranks third in every set except the fourth.


Table 2Order selections, minimum MSEP selections, and average MSEP for selections; generating model (3.4)

Set S Selections Criterion

so AIC AICc AICI KIC KICc KICIn





6 10 Underfit 0 0 0 0 0 05 Correctly specified 695 786 796 855 893 897

100 Overfit 305 214 204 145 107 103Minimum MSEP 623 711 721 779 815 819Average MSEP 0.1643 0.1518 0.1487 0.1419 0.1357 0.1348

Order

1 2 3 4 5 6 7

56

58

60

62

64

66

68

70

DeltaAICAICcAICi

Fig. 1. Criterion averages and simulated�(k, �o) (Table 1, Set 1).

Figs. 1and2 help to explain the order selection behaviors of the criteria. These figuresare based on the results of simulation set 1 inTable 1. In Fig. 1, the criterion averagesfor the AIC family and the simulated expected discrepancy�(k, �o) are plotted versus the


Order

1 2 3 4 5 6 7

56

58

60

62

64

66

68

70

OmegaKICKICcKICi

Fig. 2. Criterion averages and simulated�(k, �o) (Table 1, Set 1).

model order; inFig. 2, the criterion averages for the KIC family and simulated expecteddiscrepancy�(k, �o) are plotted versus the model order. The following conclusions can bedrawn.

• Past the truemodel order, the curves for�(k, �o)and theKIC family (Fig. 2) exhibitmoreextreme slopes than the corresponding curves for�(k, �o) and the AIC family (Fig. 1).As a result, theJ-divergence criteria are less likely than theirI-divergence counterpartsto choose overfit models.

• AIC andKIC tend to underestimate�(k, �o) and�(k, �o) for overfit models. As a result,these criteria often select models of an inappropriately high order.

• AICc and KICc also tend to underestimate�(k, �o) and�(k, �o) for overfit models,although the degree of underestimation is much less than that exhibited by AIC andKIC.

• The AICI and KICI curves track the�(k, �o) and�(k, �o) curves very closely. Thus, the“improved” criteria tend to estimate the expected discrepancies with the least amount ofbias.

In addition to evaluating the criteria on the basis of order selections, it is also of interestto investigate whether the criteria choose the fitted model which most accurately predictsthe response. We define the mean squared error of prediction (MSEP) as

MSEP= 1

n

n∑i=1

(yi − E�o[yi])2


(cf. Myers, 1990, p. 180). For every sample in a set, MSEP is computed for each of the fittedcandidate models. Over the 1000 samples, the number of times that each criterion choosesthe fitted model corresponding to the smallest MSEP is recorded. The average MSEP forthe models selected by each criterion is also recorded.The MSEP results for the six sets are featured inTables 1and 2. Within each family

of criteria, the “improved” criterion produces the most minimum MSEP selections, fol-lowed respectively by the “corrected” criterion and the non-adjusted criterion. Also, eachJ-divergence criterion obtains more minimumMSEP selections than itsI-divergence coun-terpart.The results for the average MSEP of the selected models are less straightforward to

characterize. Sets 1, 2, 3, and 6 exhibit the previous pattern. Within each family of crite-ria, the “improved” criterion yields the smallest average MSEP, followed respectively bythe “corrected” criterion and the non-adjusted criterion. Also, eachJ-divergence criterionproduces a smaller average MSEP than itsI-divergence counterpart. This pattern does nothold in sets 4 and 5, however. In these sets, the KIC family does not perform as favorablyrelative to the AIC family.For the sets based on model (3.4), the configuration is more conducive to underfitting

than for the sets based on model (3.3). Although the propensity of the criteria to choose anunderfit model is attenuated as the sample size is increased, the sample sizes in sets 4 and 5are small enough to result in under-specified selections. These selections often correspondto large values of MSEP, which tend to inflate the average MSEP.

3.2. Simulation sets where the true model is vague

Consider the same simulation setting as that described in the previous subsection. In thefour sets to follow, however, we define each true model so there is no unambiguous optimalorder for the class of fitted candidate models.In each set, 1000 samples of size 50 are generated from a true model whereo = 1 and

�2o=1.The�o vectors consideredare�o=(1,1,1,0.05,0.01)′,�o=(1,0.9,0.7,0.1,0.05)′,�o = (1,0.5,0.2,0.1,0.05)′, and�o = (1,0.7,0.45,0.25,0.1,0.05,0.01)′. Thus, in scalarform, the true models can be written as follows:

yi = exp(x1i + x2i + x3i + 0.05x4i + 0.01x5i ) + ei, (3.5)

yi = exp(x1i + 0.9x2i + 0.7x3i + 0.1x4i + 0.05x5i ) + ei, (3.6)

yi = exp(x1i + 0.5x2i + 0.2x3i + 0.1x4i + 0.05x5i ) + ei, (3.7)

yi=exp(x1i + 0.7x2i + 0.45x3i + 0.25x4i + 0.1x5i

+ 0.05x6i + 0.01x7i ) + ei, (3.8)

with ei ∼ iid N(0,1).Candidate models of orders 1–7 are fit to the data. However, a clearly defined optimal

order does not exist for the fitted models in the candidate class. Consider, for instance, thegenerating model (3.5). Although the order of this model is 5, it is questionable whether theinclusion of the fourth or fifth regressor justifies the cost of estimating the correspondingparameter. Thus, whether the optimal model order is 3, 4, or 5 is uncertain.


Table 3Minimum MSEP selections and average MSEP for selections

Generating model Selections Criterion

AIC AICc AICI KIC KICc KICI

(3.5) Minimum MSEP 381 484 487 523 586 587Average MSEP 0.0616 0.0545 0.0537 0.0526 0.0481 0.0476




We refer to models (3.5) through (3.8) asvague. The parameter values are configuredso that each consecutive model is more vague than its predecessor. For these models, it ispointless to investigate how often the criteria choose the fitted model with the same order asthe true model. However, it is still meaningful to explore whether the criteria tend to choosea fitted model which accurate predicts the response.As with the simulation results compiled for the previous subsection, for every sample in a

set, MSEP is computed for each of the fitted candidate models. Over the 1000 samples, thenumber of times that each criterion chooses the fitted model corresponding to the smallestMSEP is recorded. The average MSEP for the models selected by each criterion is alsorecorded. The results for the four sets are featured inTable 3.For the first two sets (based on models (3.5) and (3.6)), the results for the minimum

MSEP selections are consistent with those reported inTables 1and2. Moreover, the resultsfor the average MSEP are congruous with those reported inTable 1and in set 6 ofTable 2.However, as the generating model becomes increasingly vague, the AIC family of criteriabegins to marginally outperform the KIC family. In the last two sets (based on models (3.7)and (3.8)), all of the criteria exhibit difficulty in identifying the fitted model correspondingto the minimumMSEP. However, the average MSEPs, which are similar across all criteria,are comparable to the average MSEPs reported in the first two sets, as well as in set 1 ofTable 1(where the same values ofn and�2o are employed). Thus, although no criterionconsistently chooses the fitted model corresponding to the minimum MSEP, no criterionpersistently selects a fitted model with a large MSEP.

3.3. Conclusions

An extensive collection of simulation sets not featured here reflect selection patternssimilar to those in the sets reported. In most settings, the “improved” criteria outperformthe “corrected” criteria, which in turn outperform the non-adjusted criteria. Moreover, theKIC family performs favorably against the AIC family.


Our results support two conclusions regardingmodel selection criteria based onKullbackinformation measures. First, the performance of a criterion appears to be largely dictatedby how well its penalty term approximates the corresponding bias adjustment. Second,for the purpose of delineating between correctly specified and misspecified models, thesymmetric divergence may serve as a more sensitive discrepancy measure than the directeddivergence.

Acknowledgements

We wish to extend our sincere appreciation to the associate editor and to two refereesfor carefully reading the original version of this manuscript, and for preparing helpful andconstructive critiques which served to greatly improve the exposition and content. We alsoextend our thanks to the editor, Professor Subir Ghosh.

Appendix A. Justification of KICc for nonlinear regression

In what follows, we will require thatf (Y | �o) ∈ F(k); i.e., that the fitted model iscorrectly specified or overfit. Under this assumption, the truemodel and the candidatemodelcan be written using the same response function, the same design matrix, and parametervectors of a common dimension:

Y=h(�o,X) + �, � ∼ N(0,�2oI ), (A.1)

Y=h(�,X) + �, � ∼ N(0,�2I ). (A.2)

Here, the common design matrixX isn×p. The parameter vector for the true model (A.1),�o, and for the candidate model (A.2),�, are both of dimensionp × 1.The ith row of the design matrix will be denoted byXi . The mean vectors are assumed

to have the layouts

h(�o,X) = (g(�o, X1), . . . , g(�o, Xn))′ and

h(�,X) = (g(�, X1), . . . , g(�, Xn)),′

where the common mean response functiong is twice continuously differentiable in�. Asbefore, for brevity, we will writeh(�,X) ash(�) andh(�o,X) ash(�o).We will assume that the final(p − po) components of�o are zero(po �p). This is

permissible since we are requiring that the candidate model is either correctly or overspecified.As before, we will let�o = (�′

o,�2o)

′ and�k = (�′,�2)′, and usef (Y | �o) andf (Y | �k)

to denote the densities corresponding to models (A.1) and (A.2).Now recall (2.18) from Section 2:

�(k, �o) = E�o[−2 ln f (Y | �k)] + B1(k, �o) + B2(k, �o). (A.3)


The first of the three terms on the right-hand side of (A.3) suggests that−2 ln f (Y | �k)

servesasabiasedestimator for�(k, �o). As shown inHurvichandTsai (1989, pp. 299–300),for largen, B1(k, �o) is approximately equal to the penalty term of AICc; i.e.,

B1(k, �o) � 2n(p + 1)(n − p − 2) . (A.4)

By (2.16), (2.21), and (2.22), the bias adjustmentB2(k, �o) can be written as

B2(k, �o)=E�o

[n ln

(�2o�2

)+ n�2

�2o+ {h(�) − h(�o)}′{h(�) − h(�o)}

�2o− n

].

(A.5)

To simplify (A.5), we will use the following large-sample results for MLEs in the normalnonlinear regression framework (Gallant, 1987, p. 17). The linear expansion ofh(�)at�=�o

is given byh(�) � h(�o)+V(�−�o),whereV=�h(�)/�� evaluated at�=�o. Under thetruemodel, the difference(�−�o) is approximatelymultivariate normal,N(0,�2o(V

′V)−1),the ratio (n�2/�2o) is approximately distributed as a�

2 random variable with(n−p) degrees

of freedom, and� and�2 are approximately independent.Using the preceding, we have

E�o

[n�2

�2o

]� n − p (A.6)

and

E�o

[{h(�) − h(�o)}′{h(�) − h(�o)}

�2o

]� E�o

[(� − �o)

′V ′V(� − �o)

�2o

]� p.

(A.7)

Using (A.6) and (A.7) in conjunction with (A.5), we may argue

B2(k, �o) � E�o

[n ln

(�2o�2

)]. (A.8)

Now recall that the expectation of the log of a random variable with a central�2 distribu-tion havingdfdegrees of freedom is ln 2+(df /2), where(·) denotes thepsiordigammafunction. (See, for instance, McQuarrie and Tsai, 1998, p. 67.) The termE�o

[ln(n�2/�2o)

]has the form of the expectation of the log of central�2 random variable with (n−p) degrees


of freedom. This fact along with (A.3), (A.4), and (A.8) yields

�(k, �o) � E�o[f (Y | �k)] + 2n(p + 1)

(n − p − 2) + n ln(n

2

)− n

(n − p

2

). (A.9)

An accurate substitute for{n ln(n/2) − n((n − p)/2)} is provided by the large-sampleapproximation (2.20). Employing this approximation in (A.9) justifies

KICc= −2 ln f (Y | �k) + n ln

(n

n − p

)+ n{(n − p)(2p + 3) − 2}

(n − p − 2)(n − p)

as an approximately unbiased estimator of�(k, �o).

References

Akaike, H., 1973. Information theory and an extension of the maximum likelihood principle. In: Petrov, B.N.,Csáki, F. (Eds.),Second InternationalSymposiumon InformationTheory.AkadémiaKiadó,Budapest,Hungary,pp. 267–281.

Akaike, H., 1974. A new look at the statistical model identification. IEEE Trans. Automat. Control AC-19,716–723.

Cavanaugh, J.E., 1997. Unifying the derivations of the Akaike and corrected Akaike information criteria. Statist.Probab. Lett. 33, 201–208.

Cavanaugh, J.E., 1999. A large-sample model selection criterion based on Kullback’s symmetric divergence.Statist. Probab. Lett. 42, 333–343.

Cavanaugh, J.E., 2004. Criteria for linear model selection based on Kullback’s symmetric divergence. Austral.New Zealand J. Statist., 46, 257–274.

Gallant, A.R., 1987. Nonlinear Statistical Models. Wiley, New York, NY.Hurvich, C.M., Tsai, C.-L., 1989. Regression and time series model selection in small samples. Biometrika 76,297–307.

Hurvich, C.M., Shumway, R.H., Tsai, C.-L., 1990. Improved estimators of Kullback–Leibler information forautoregressive model selection in small samples. Biometrika 77, 709–719.

Kotz, S., Johnson, N.L. (Eds.), 1982. Encyclopedia of Statistical Sciences, vol 2. Wiley, New York, NY.Kullback, S., 1968. Information Theory and Statistics. Dover, Mineola, NY.Kullback, S., Leibler, R.A., 1951. On information and sufficiency. Ann. Math. Statist. 22, 76–86.Linhart, H., Zucchini, W., 1986. Model Selection. Wiley, New York, NY.McQuarrie, A.D.R., Tsai, C.-L., 1998. Regression and Time SeriesModel Selection.World Scientific, River Edge,NJ.

Myers, R.H., 1990. Classical and Modern Regression with Applications. 2nd Edition. Duxbury, Pacific Grove,CA.

Sugiura, N., 1978. Further analysis of the data by Akaike’s information criterion and the finite corrections. Comm.Statist. A7, 13–26.

model selection criteria based on kullback information measures for nonlinear regression

Documents