Copula information criterion for model selection
with two-stage maximum likelihood estimation
Vinnie Ko∗, Nils Lid Hjort
Department of Mathematics, University of Oslo
PB 1053, Blindern, NO-0316 Oslo, Norway
February 2018
Abstract
In parametric copula setups, where both the margins and copula have parametric forms, two-stage
maximum likelihood estimation, often referred to as inference functions for margins, is used as an at-
tractive alternative to the full maximum likelihood estimation strategy. Exploiting basic results derived
earlier by the present authors, we develop a copula information criterion (CIC) for model selection. The
CIC is defined as $\mathrm{CIC} = 2\ell_n(\tilde\eta) - 2\tilde p^*_\eta$, where $\ell_n(\tilde\eta)$ is the maximized log-likelihood under the two-stage
maximum likelihood estimation scheme, with $\eta$ the full parameter vector for the candidate model in
question, and $\tilde p^*_\eta$ a certain penalization factor. In a nutshell, CIC aims for the model that minimizes
the Kullback–Leibler divergence from the real data generating mechanism. CIC does not assume that
the chosen parametric model captures this true model, unlike what is assumed for AIC. In this sense
CIC is analogous to the Takeuchi Information Criterion (TIC), which is defined for the full maximum
likelihood. If we make an additional assumption that a candidate model is correctly specified, then CIC
for that model simplifies to AIC. Further, since both CIC and TIC are estimating the same part of the
Kullback–Leibler divergence, they are compatible, in the sense that they can be used to compare the
performance of full maximum-likelihood and two-stage maximum likelihood for a given model. Addi-
tionally, we show that CIC can easily be extended to the conditional copula setup where covariates are
parametrically linked to the copula model.
As a numerical illustration, we perform a simulation and find that CIC outperforms AIC in terms of
prediction performance from the selected models. However, as sample size grows, the difference between
CIC and AIC becomes minimal because the log-likelihood part outgrows the bias correction part. Further,
we learn from the simulation that $\tilde p^*_\eta$, the bias correction term of CIC, has a strong positive relationship
with the prediction error of the model. Thus, a model with poor prediction performance is penalized more by CIC.
Keywords: copula, Akaike information criterion, copula information criterion, model robust, two-stage maximum likelihood, inference functions for margins
∗Corresponding author. E-mail addresses: [email protected] (V. Ko), [email protected] (N.L. Hjort)
1 Introduction and copula models
One of the main practical issues in copula modeling is model selection. In the full parametric setup, where
both the copula and margins are assumed to have a parametric form, one often has multiple candidates for
both the copula and margins. As the dimension of the model increases, a list of possible combinations of
margins and copula grows rapidly. Hence, there is a need for a model selection criterion that can evaluate
each model systematically according to certain philosophy or criteria and assign a score to each model. In
the end, one would choose the model with the best score.
Throughout this paper, we consider the full parametric setup. In this setup, one can simultaneously
estimate all parameters of the model (i.e. both copula parameters and margin parameters) by using maximum
likelihood (ML) estimation. In this ML estimation framework, one can for instance use AICML (Akaike, 1974)
or TIC, also known as model-robust AIC (Takeuchi, 1976), as the model selection criterion and select the
model with the best score. (Note that we denote the AIC under ML estimation as AICML to distinguish it
from the two-stage ML based AIC2ML, which we will derive in Section 2.3.) However, when the dimension
of the copula model gets high, the number of parameters increases quickly and the ML estimation is not
always feasible in terms of speed and numerical stability. Two-stage maximum likelihood (two-stage ML)
estimation, also often referred to as inference functions for margins (IFM), is a popular alternative estimation
strategy that is designed to overcome these drawbacks of the ML estimation. In stage 1 of the two-stage ML
estimation, the parameter vectors of each marginal distribution are estimated separately by ML. In stage
2, the estimates from stage 1 are plugged into the log-likelihood of the model. Then, the parameters of the
copula, which are now the only unknown parameters, are estimated by using ML estimation again. One of
the advantages of this multi-stage approach is that it is computation-wise much faster than estimating all
parameters simultaneously, because it does not have to search for the global maximum in high-dimensional
space. A drawback of the two-stage ML estimation method, however, is that we cannot use the classical
results based on ML estimation, which include model selection criteria such as TIC and BIC.
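The two-stage recipe just described can be sketched in a few lines of code. The following is our own minimal illustration, not the authors' implementation: it assumes a Clayton copula with exponential margins (both hypothetical choices, made because their formulae are short), fits each margin by ML in stage 1, and maximizes the copula log-likelihood over $\theta$ by a crude grid search in stage 2.

```python
import math
import random

def rclayton_exp(n, theta, lam1, lam2, seed=1):
    """Sample (y1, y2) from a Clayton copula with exponential margins."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        u, w = rng.random(), rng.random()
        # Conditional inversion method for the Clayton copula.
        v = (u ** (-theta) * (w ** (-theta / (1.0 + theta)) - 1.0) + 1.0) ** (-1.0 / theta)
        # Inverse exponential CDFs map (u, v) to the data scale.
        data.append((-math.log(1.0 - u) / lam1, -math.log(1.0 - v) / lam2))
    return data

def log_clayton(u, v, theta):
    """Clayton copula log-density."""
    return (math.log(1.0 + theta) - (theta + 1.0) * (math.log(u) + math.log(v))
            - (2.0 + 1.0 / theta) * math.log(u ** (-theta) + v ** (-theta) - 1.0))

def two_stage_ml(data):
    # Stage 1: ML for each exponential margin (closed form: rate = 1 / mean).
    lam1 = len(data) / sum(y1 for y1, _ in data)
    lam2 = len(data) / sum(y2 for _, y2 in data)
    # Plug the fitted marginal CDFs in, turning observations into uniforms.
    uv = [(1.0 - math.exp(-lam1 * y1), 1.0 - math.exp(-lam2 * y2)) for y1, y2 in data]
    # Stage 2: maximize the copula log-likelihood over theta (grid search for brevity).
    grid = [0.05 * k for k in range(1, 161)]  # theta in (0, 8]
    theta = max(grid, key=lambda t: sum(log_clayton(u, v, t) for u, v in uv))
    return lam1, lam2, theta

data = rclayton_exp(2000, theta=2.0, lam1=1.0, lam2=0.5)
lam1, lam2, theta = two_stage_ml(data)
```

A real implementation would use a numerical optimizer instead of a grid, but the division of labor between the two stages is the same: the high-dimensional search is replaced by several small, separate ones.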
In practice, different sorts of goodness-of-fit testing are often used as substitutes, to choose the best
model (Genest & Favre, 2007). Another often used model selection strategy for the two-stage ML is that
one first evaluates candidates of each marginal distribution with AICML and consequently chooses the best
distribution for each margin. Once the margins are chosen, one fits different copulae and evaluates the copula
part with AICML. However, this piecewise model evaluation cannot evaluate the model as a whole.
In this paper, we develop the copula information criterion (CIC) for two-stage ML estimation, which has
the form
$$\mathrm{CIC} = 2\ell_n(\tilde\eta) - 2\tilde p^*_\eta.$$
Here $\ell_n(\tilde\eta)$ is the maximized log-likelihood under the two-stage ML estimation method, in terms of the full
parameter vector $\eta$ of the model in question, and $\tilde p^*_\eta$ is a suitable penalization factor, worked out in Section 2.2.
The main advantage of CIC is that it can evaluate a parametric copula with parametric margins as a whole.
CIC is also a model-robust model selection criterion which means that it does not assume that the candidate
model contains the true model. As the overlap of the name already suggests, our CIC is analogous to CIC
from Grønneberg & Hjort (2014), which is designed for copulae estimated with pseudo maximum likelihood
(PML). In PML framework, margins are estimated empirically, while two-stage ML assumes parametric
forms of margins.
Our technical setting, identical to Ko & Hjort (2018), is as follows. Let $(Y_1,\dots,Y_d)^{\rm T}$ be a $d$-variate
continuous stochastic variable originating from a joint density $g(y_1,\dots,y_d)$, and let $y_i = (y_{i,1},\dots,y_{i,d})^{\rm T}$,
for $i = 1,\dots,n$, be independent observations of this variable. The true joint distribution $g$ is typically
unknown. Let $f(y_1,\dots,y_d,\eta)$ be our parametric approximation of $g$, with the parameter vector $\eta$ belonging
to some connected subset of the appropriate Euclidean space. In addition, $G$ and $F(\cdot,\eta)$ denote the cumulative
distribution functions corresponding to $g$ and $f(\cdot,\eta)$, respectively. Here $G_j(y_j)$ and $F_j(y_j,\alpha_j)$ denote the $j$-th
marginal distribution functions corresponding to $G$ and $F(\cdot,\eta)$, respectively, with $\alpha_j$ the parameter vector
belonging to margin component $j$.
According to Sklar's theorem (Sklar, 1959), there always exists a copula $C(u_1,\dots,u_d,\theta)$ that satisfies
$$F(y_1,\dots,y_d,\eta) = C\bigl(F_1(y_1,\alpha_1),\dots,F_d(y_d,\alpha_d),\theta\bigr),$$
where the full parameter vector $\eta$ is now blocked as
$$\eta = (\alpha^{\rm T},\theta^{\rm T})^{\rm T} = (\alpha_1^{\rm T},\dots,\alpha_d^{\rm T},\theta^{\rm T})^{\rm T}.$$
By assuming the regularity conditions from Ko & Hjort (2018), $C(\cdot,\theta)$ can be differentiated, giving
$$f(y_1,\dots,y_d,\eta) = c\bigl(F_1(y_1,\alpha_1),\dots,F_d(y_d,\alpha_d),\theta\bigr)\,\prod_{j=1}^{d} f_j(y_j,\alpha_j),$$
where $c(u_1,\dots,u_d,\theta) = \partial^d C(u_1,\dots,u_d,\theta)/\partial u_1\cdots\partial u_d$ and $f_j(y_j,\alpha_j) = \partial F_j(y_j,\alpha_j)/\partial y_j$. For further details of
copula modeling, see Joe (1997) and Nelsen (2006). Analogously, the true density $g$ can be decomposed
into marginal densities and the copula density,
$$g(y_1,\dots,y_d) = c_0\bigl(G_1(y_1),\dots,G_d(y_d)\bigr)\,\prod_{j=1}^{d} g_j(y_j),$$
with $c_0(\cdot)$ the true copula.
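The factorization of $f$ into a copula density times marginal densities can be verified numerically. The sketch below is our own illustration, with the bivariate standard normal as a hypothetical model choice: its copula is the Gaussian copula, which has a closed form, so the product $c(F_1, F_2)\,f_1\,f_2$ can be checked directly against the joint density.

```python
import math
from statistics import NormalDist

STD = NormalDist()  # standard normal: provides cdf, inv_cdf and pdf

def biv_normal_pdf(x, y, rho):
    """Bivariate standard normal density with correlation rho."""
    q = (x * x - 2.0 * rho * x * y + y * y) / (1.0 - rho * rho)
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(1.0 - rho * rho))

def gauss_copula_density(u, v, rho):
    """Gaussian copula density c(u, v, rho) on the uniform scale."""
    x, y = STD.inv_cdf(u), STD.inv_cdf(v)
    q = (rho * rho * (x * x + y * y) - 2.0 * rho * x * y) / (2.0 * (1.0 - rho * rho))
    return math.exp(-q) / math.sqrt(1.0 - rho * rho)

rho = 0.6
x, y = 0.7, -0.3
u, v = STD.cdf(x), STD.cdf(y)
lhs = biv_normal_pdf(x, y, rho)                                  # f(y1, y2, eta)
rhs = gauss_copula_density(u, v, rho) * STD.pdf(x) * STD.pdf(y)  # c(F1, F2) * f1 * f2
```

Here `lhs` and `rhs` agree up to floating-point error, which is exactly Sklar's decomposition for this model.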
The further structure of this paper is as follows. In Section 2.1, we briefly explain Kullback–Leibler
divergence and its relationship to TIC and AICML. In Section 2.2, we derive and define our copula information
criterion. In Section 2.3, we prove that the AIC2ML formula holds under the two-stage ML estimation. In
Section 2.4, we summarize the relationship between TIC, CIC, AICML and AIC2ML. In Section 2.5, we
illustrate what CIC looks like in the two-dimensional setting and show how CIC can easily be extended
to the conditional copula setting. In Section 3, we study the numerical behavior of those model selection
criteria. In our final Section 4, we offer a few concluding remarks and suggestions for future research.
2 The copula information criterion for two-stage maximum likelihood estimation
2.1 Kullback–Leibler divergence
The Kullback–Leibler (KL) divergence from $g$ to $f$ measures how the density $f$ diverges from $g$ (Kullback
& Leibler, 1951) and is defined as
$$\mathrm{KL}(g,f) = \int g(y)\log\frac{g(y)}{f(y)}\,{\rm d}y = \int g(y)\log g(y)\,{\rm d}y - \int g(y)\log f(y)\,{\rm d}y.$$
It is well known that the ML estimator aims for parameter values that minimize the Kullback–Leibler
divergence (Akaike, 1998).
Consider a case where one has competing models for certain data and the parameters are estimated by
ML. An arbitrary candidate model with ML parameter estimates can be denoted as $f(y,\hat\eta)$. (Throughout
this paper, we use a hat, as in $\hat\eta$, to indicate that a quantity is estimated with ML, and a tilde, as in $\tilde\eta$, to
indicate that a quantity is estimated with two-stage ML.) Since $g(y)$ is the same across all candidate models,
minimizing the KL divergence is equivalent to maximizing $Q(\hat\eta) = \int g(y)\log f(y,\hat\eta)\,{\rm d}y$. This quantity is,
however, not directly observable, since $g(y)$ is unknown. As an alternative, one may use the empirical
equivalent of $Q(\hat\eta)$:
$$\bar Q(\hat\eta) = \frac{1}{n}\ell(\hat\eta) = \frac{1}{n}\sum_{i=1}^{n}\Bigl\{\log f_1(y_{i,1},\hat\alpha_1) + \cdots + \log f_d(y_{i,d},\hat\alpha_d) + \log c\bigl(F_1(y_{i,1},\hat\alpha_1),\dots,F_d(y_{i,d},\hat\alpha_d),\hat\theta\bigr)\Bigr\}.$$
Yet, this estimator of $Q(\hat\eta)$ is biased. By identifying and subtracting the bias, we obtain the
unbiased estimator of $Q(\hat\eta)$:
$$\hat Q^*(\hat\eta) = \frac{1}{n}\ell(\hat\eta) - \frac{1}{n}\,{\rm tr}\bigl(\hat I_\eta^{-1}\hat K_\eta\bigr),$$
where $\hat I_\eta$ is the observed information matrix and $\hat K_\eta$ is the estimated covariance matrix of the score vector.
The TIC (Takeuchi, 1976) aims for the model that maximizes $\hat Q^*(\hat\eta)$ and is defined as
$$\mathrm{TIC} = 2\ell(\hat\eta) - 2\hat p^*_{\eta,\mathrm{TIC}}, \qquad (1)$$
where $\hat p^*_{\eta,\mathrm{TIC}} = {\rm tr}\bigl(\hat I_\eta^{-1}\hat K_\eta\bigr)$. This shows that TIC is essentially a scaled version of $\hat Q^*(\hat\eta)$.
When one boldly assumes that the candidate model is correct, i.e. that it contains the true data
generating mechanism, TIC simplifies to AICML, possibly the best known model selection criterion in
statistics, which is defined as
$$\mathrm{AIC}_{\mathrm{ML}} = 2\ell(\hat\eta) - 2p_\eta, \qquad (2)$$
where $p_\eta$ is the length of the parameter vector $\eta$. Note that the formulae of TIC and AICML are only valid
if the parameters are estimated by the ML estimator. For more details about TIC and AICML, see Chapter
2 of Claeskens & Hjort (2008).
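The TIC penalty ${\rm tr}(\hat I_\eta^{-1}\hat K_\eta)$ can be computed by hand for a one-parameter model. The sketch below is our own illustration, not from the paper: for a hypothetically chosen exponential model, $\log f = \log\lambda - \lambda y$, the score is $1/\lambda - y$ and the observed information is $1/\hat\lambda^2$, so the plug-in penalty reduces to $\hat\lambda^2$ times the empirical variance. When the model is correct this is close to $1 = \dim(\eta)$, the AIC penalty; under misspecification it drifts away from 1.

```python
import random

def tic_penalty_exponential(y):
    """Plug-in TIC penalty tr(I^-1 K) for an ML-fitted exponential model.

    MLE: lam = 1 / mean(y).  Observed information: I = 1 / lam^2.
    K = average squared score, with score(y_i) = 1/lam - y_i.
    """
    n = len(y)
    lam = n / sum(y)
    k_hat = sum((1.0 / lam - yi) ** 2 for yi in y) / n   # empirical variance of y
    i_hat = 1.0 / lam ** 2
    return k_hat / i_hat                                  # tr(I^-1 K); scalar here

rng = random.Random(7)
exp_data = [rng.expovariate(1.0) for _ in range(20000)]
# Gamma(shape = 2) data, simulated as a sum of two independent exponentials.
gamma_data = [rng.expovariate(1.0) + rng.expovariate(1.0) for _ in range(20000)]

p_correct = tic_penalty_exponential(exp_data)    # close to 1, the AIC penalty
p_wrong = tic_penalty_exponential(gamma_data)    # close to var/mean^2 = 1/2 here
```

The exponential model fitted to gamma data is misspecified, and the penalty moves toward $1/2$ instead of $1$, illustrating how TIC adapts where AIC cannot.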
2.2 Derivation of the copula information criterion
When the copula model is estimated with the two-stage ML, the bias correction term of TIC, i.e. $\hat p^*_{\eta,\mathrm{TIC}}$,
is not valid. We derive the copula information criterion (CIC), which is analogous to TIC and is made for
copula models estimated with the two-stage ML.
When the copula model is estimated with the two-stage ML, the non-constant part of the KL divergence
is
$$Q(\tilde\eta) = \int g(y)\Bigl\{\log f_1(y_1,\tilde\alpha_1) + \cdots + \log f_d(y_d,\tilde\alpha_d) + \log c\bigl(F_1(y_1,\tilde\alpha_1),\dots,F_d(y_d,\tilde\alpha_d),\tilde\theta\bigr)\Bigr\}\,{\rm d}y.$$
The empirical equivalent is
$$\bar Q(\tilde\eta) = \frac{1}{n}\sum_{i=1}^{n}\Bigl\{\log f_1(y_{i,1},\tilde\alpha_1) + \cdots + \log f_d(y_{i,d},\tilde\alpha_d) + \log c\bigl(F_1(y_{i,1},\tilde\alpha_1),\dots,F_d(y_{i,d},\tilde\alpha_d),\tilde\theta\bigr)\Bigr\}.$$
Now we check the bias of $\bar Q(\tilde\eta)$:
$$\begin{aligned}
{\rm E}_G\bigl[\bar Q(\tilde\eta)\bigr] - Q(\tilde\eta)
&= {\rm E}_{G_1}\Bigl[\frac{1}{n}\sum_{i=1}^{n}\log f_1(y_{i,1},\tilde\alpha_1)\Bigr] - \int g(y_1)\log f_1(y_1,\tilde\alpha_1)\,{\rm d}y_1 \\
&\quad + \cdots \\
&\quad + {\rm E}_{G_d}\Bigl[\frac{1}{n}\sum_{i=1}^{n}\log f_d(y_{i,d},\tilde\alpha_d)\Bigr] - \int g(y_d)\log f_d(y_d,\tilde\alpha_d)\,{\rm d}y_d \\
&\quad + {\rm E}_{G}\Bigl[\frac{1}{n}\sum_{i=1}^{n}\log c\bigl(F_1(y_{i,1},\tilde\alpha_1),\dots,F_d(y_{i,d},\tilde\alpha_d),\tilde\theta\bigr)\Bigr] \\
&\quad - \int g(y)\log c\bigl(F_1(y_1,\tilde\alpha_1),\dots,F_d(y_d,\tilde\alpha_d),\tilde\theta\bigr)\,{\rm d}y.
\end{aligned}$$
Since the parameters of the marginals from stage 1 (the $\tilde\alpha_j$s) are obtained with ML estimation, we can directly use
the results from the derivation of TIC and obtain
$${\rm E}_{G_j}\Bigl[\frac{1}{n}\sum_{i=1}^{n}\log f_j(y_{i,j},\tilde\alpha_j)\Bigr] - \int g(y_j)\log f_j(y_j,\tilde\alpha_j)\,{\rm d}y_j = \frac{1}{n}\,{\rm tr}\bigl(I_{\alpha_j}^{-1}K_{\alpha_j}\bigr) + o(n^{-1}) = \frac{1}{n}\,p^*_{\alpha_j} + o(n^{-1}),$$
with $I_{\alpha_j}$ and $K_{\alpha_j}$ as defined in Lemma 1 of Ko & Hjort (2018). In a nutshell, $I_{\alpha_j}$ is the Fisher information
of the $j$-th margin and $K_{\alpha_j}$ is the covariance matrix of the score vector that belongs to the $j$-th margin.
Further, let
$$Q_c(\tilde\eta) = \int g(y)\log c\bigl(F_1(y_1,\tilde\alpha_1),\dots,F_d(y_d,\tilde\alpha_d),\tilde\theta\bigr)\,{\rm d}y$$
and
$$\bar Q_c(\tilde\eta) = \frac{1}{n}\sum_{i=1}^{n}\log c\bigl(F_1(y_{i,1},\tilde\alpha_1),\dots,F_d(y_{i,d},\tilde\alpha_d),\tilde\theta\bigr).$$
So, we can write
$${\rm E}_G\bigl[\bar Q(\tilde\eta)\bigr] - Q(\tilde\eta) = \frac{1}{n}\sum_{j=1}^{d}{\rm tr}\bigl(I_{\alpha_j}^{-1}K_{\alpha_j}\bigr) + {\rm E}_G\bigl[\bar Q_c(\tilde\eta)\bigr] - Q_c(\tilde\eta) + o(n^{-1}).$$
Now, ${\rm E}_G[\bar Q_c(\tilde\eta)] - Q_c(\tilde\eta)$ is the only element that remains to be evaluated. Let
$$\begin{aligned}
Q_c(\eta_0) &= \int g(y)\log c\bigl(F_1(y_1,\alpha_{0,1}),\dots,F_d(y_d,\alpha_{0,d}),\theta_0\bigr)\,{\rm d}y, \\
Z_i &= \log c\bigl(F_1(y_{i,1},\alpha_{0,1}),\dots,F_d(y_{i,d},\alpha_{0,d}),\theta_0\bigr) - Q_c(\eta_0), \\
A_\eta &= \sqrt{n}\,(\tilde\eta - \eta_0) = \sqrt{n}\,I_\eta^{-1}\begin{pmatrix} U_{n,\alpha}(\alpha_0) \\ U_{n,\theta}(\alpha_0,\theta_0) - I_{\alpha,\theta}^{\rm T}\,I_\alpha^{-1}\,U_{n,\alpha}(\alpha_0) \end{pmatrix} + \begin{pmatrix} o_p(1) \\ o_p(1) \end{pmatrix},
\end{aligned}$$
which stems from Proposition 1 in Ko & Hjort (2018), and furthermore
$$\begin{aligned}
U_{n,\eta}(\eta) &= \frac{1}{n}\sum_{i=1}^{n}U_\eta(y_i,\eta) = \frac{1}{n}\sum_{i=1}^{n}\frac{\partial \log c\bigl(F_1(y_{i,1},\alpha_1),\dots,F_d(y_{i,d},\alpha_d),\theta\bigr)}{\partial\eta}, \\
H_{n,\eta}(\eta) &= \frac{1}{n}\sum_{i=1}^{n}H_\eta(y_i,\eta) = \frac{1}{n}\sum_{i=1}^{n}\frac{\partial^2 \log c\bigl(F_1(y_{i,1},\alpha_1),\dots,F_d(y_{i,d},\alpha_d),\theta\bigr)}{\partial\eta\,\partial\eta^{\rm T}}, \\
I^*_\eta &= -{\rm E}_G\bigl[H_\eta(y,\eta_0)\bigr] = -\int g(y)\,H_\eta(y,\eta_0)\,{\rm d}y,
\end{aligned}$$
which, incidentally, should not be confused with $I_\eta$ in Proposition 1 of Ko & Hjort (2018). Then we have
${\rm E}[\bar Z_n] = \frac{1}{n}\sum_{i=1}^{n}{\rm E}[Z_i] = 0$, along with
$$\begin{aligned}
\bar Q_c(\tilde\eta) &= \frac{1}{n}\sum_{i=1}^{n}\log c(y_i,\tilde\eta) \\
&= \frac{1}{n}\sum_{i=1}^{n}\Bigl\{\log c(y_i,\eta_0) - Q_c(\eta_0) + Q_c(\eta_0) + (\tilde\eta-\eta_0)^{\rm T}U_\eta(y_i,\eta_0) + \tfrac{1}{2}(\tilde\eta-\eta_0)^{\rm T}H_\eta(y_i,\eta_0)(\tilde\eta-\eta_0)\Bigr\} + o_p(n^{-1}) \\
&= \frac{1}{n}\sum_{i=1}^{n}Z_i + Q_c(\eta_0) + (\tilde\eta-\eta_0)^{\rm T}U_{n,\eta}(\eta_0) + \tfrac{1}{2}(\tilde\eta-\eta_0)^{\rm T}H_{n,\eta}(\eta_0)(\tilde\eta-\eta_0) + o_p(n^{-1}) \\
&= Q_c(\eta_0) + \bar Z_n + \tfrac{1}{\sqrt{n}}A_\eta^{\rm T}U_{n,\eta}(\eta_0) + \tfrac{1}{2n}A_\eta^{\rm T}H_{n,\eta}(\eta_0)A_\eta + o_p(n^{-1}),
\end{aligned}$$
$$\begin{aligned}
Q_c(\tilde\eta) &= \int g(y)\log c\bigl(F_1(y_1,\tilde\alpha_1),\dots,F_d(y_d,\tilde\alpha_d),\tilde\theta\bigr)\,{\rm d}y \\
&= \int g(y)\Bigl\{\log c(y,\eta_0) + (\tilde\eta-\eta_0)^{\rm T}U_\eta(y,\eta_0) + \tfrac{1}{2}(\tilde\eta-\eta_0)^{\rm T}H_\eta(y,\eta_0)(\tilde\eta-\eta_0)\Bigr\}\,{\rm d}y + o_p(n^{-1}) \\
&= \int g(y)\log c(y,\eta_0)\,{\rm d}y + (\tilde\eta-\eta_0)^{\rm T}\!\int g(y)\,U_\eta(y,\eta_0)\,{\rm d}y + \tfrac{1}{2}(\tilde\eta-\eta_0)^{\rm T}\!\int g(y)\,H_\eta(y,\eta_0)\,{\rm d}y\;(\tilde\eta-\eta_0) + o_p(n^{-1}) \\
&= Q_c(\eta_0) + (\tilde\eta-\eta_0)^{\rm T}\cdot 0 + \tfrac{1}{2n}\,\sqrt{n}(\tilde\eta-\eta_0)^{\rm T}\!\int g(y)\,H_\eta(y,\eta_0)\,{\rm d}y\;\sqrt{n}(\tilde\eta-\eta_0) + o_p(n^{-1}) \\
&= Q_c(\eta_0) - \tfrac{1}{2n}A_\eta^{\rm T}I^*_\eta A_\eta + o_p(n^{-1})
\end{aligned}$$
and
$$n\bigl\{\bar Q_c(\tilde\eta) - Q_c(\tilde\eta)\bigr\} = n\bar Z_n + \sqrt{n}\,A_\eta^{\rm T}U_{n,\eta}(\eta_0) + \tfrac{1}{2}A_\eta^{\rm T}H_{n,\eta}(\eta_0)A_\eta + \tfrac{1}{2}A_\eta^{\rm T}I^*_\eta A_\eta + o_p(1).$$
Further, note that
$$U_{n,\eta}(\eta_0) = \frac{1}{n}\sum_{i=1}^{n}\frac{\partial \log c\bigl(F_1(y_{i,1},\alpha_{0,1}),\dots,F_d(y_{i,d},\alpha_{0,d}),\theta_0\bigr)}{\partial\eta}$$
has the following convergence, by the central limit theorem:
$$\sqrt{n}\,U_{n,\eta}(\eta_0) = \sqrt{n}\,\bigl\{U_{n,\eta}(\eta_0) - {\rm E}[U_\eta(Y,\eta_0)]\bigr\} \;\overset{d}{\to}\; \Lambda^*_\eta \sim N\bigl(0, K^*_\eta\bigr),$$
where $K^*_\eta = {\rm Var}\bigl(U_\eta(y,\eta_0)\bigr) = {\rm E}\bigl[U_\eta(y,\eta_0)U_\eta(y,\eta_0)^{\rm T}\bigr]$. (Here $\Lambda^*_\eta$ and $K^*_\eta$ should not be confused with $\Lambda_\eta$
and $K_\eta$ in Proposition 1 of Ko & Hjort (2018).)
Now we evaluate ${\rm E}_G[n\{\bar Q_c(\tilde\eta) - Q_c(\tilde\eta)\}]$:
$$\begin{aligned}
{\rm E}_G\bigl[n\bigl\{\bar Q_c(\tilde\eta) - Q_c(\tilde\eta)\bigr\}\bigr]
&= {\rm E}_G\Bigl[n\bar Z_n + \sqrt{n}\,A_\eta^{\rm T}U_{n,\eta}(\eta_0) + \tfrac{1}{2}A_\eta^{\rm T}H_{n,\eta}(\eta_0)A_\eta + \tfrac{1}{2}A_\eta^{\rm T}I^*_\eta A_\eta\Bigr] + o(1) \\
&\to {\rm E}_G\bigl[(I_\eta^{-1}L\Lambda_\eta)^{\rm T}\Lambda^*_\eta\bigr] \\
&= {\rm E}_G\bigl[{\rm tr}\bigl(I_\eta^{-1}L\Lambda_\eta(\Lambda^*_\eta)^{\rm T}\bigr)\bigr] \\
&= {\rm tr}\bigl(I_\eta^{-1}L\,{\rm E}_G\bigl[\Lambda_\eta(\Lambda^*_\eta)^{\rm T}\bigr]\bigr) \\
&= {\rm tr}\bigl(I_\eta^{-1}L K^\circ_\eta\bigr) = p^*_\theta,
\end{aligned}$$
where
$$K^\circ_\eta = {\rm E}_G\bigl[\Lambda_\eta(\Lambda^*_\eta)^{\rm T}\bigr] = {\rm E}_G\begin{pmatrix} \Lambda_\alpha(\Lambda^*_\alpha)^{\rm T} & \Lambda_\alpha\Lambda_\theta^{\rm T} \\ \Lambda_\theta(\Lambda^*_\alpha)^{\rm T} & \Lambda_\theta\Lambda_\theta^{\rm T} \end{pmatrix} = \begin{pmatrix} K^\circ_\alpha & K_{\alpha,\theta} \\ (K^*_{\alpha,\theta})^{\rm T} & K_\theta \end{pmatrix}.$$
It is practical to note that
$$\begin{aligned}
{\rm tr}\bigl(I_\eta^{-1}L K^\circ_\eta\bigr)
&= {\rm tr}\left\{\begin{pmatrix} I_\alpha^{-1} & 0 \\ 0 & I_\theta^{-1} \end{pmatrix}\begin{pmatrix} I & 0 \\ -I_{\alpha,\theta}^{\rm T}I_\alpha^{-1} & I \end{pmatrix}\begin{pmatrix} K^\circ_\alpha & K_{\alpha,\theta} \\ (K^*_{\alpha,\theta})^{\rm T} & K_\theta \end{pmatrix}\right\} \\
&= {\rm tr}\begin{pmatrix} I_\alpha^{-1}K^\circ_\alpha & I_\alpha^{-1}K_{\alpha,\theta} \\ -I_\theta^{-1}I_{\alpha,\theta}^{\rm T}I_\alpha^{-1}K^\circ_\alpha + I_\theta^{-1}(K^*_{\alpha,\theta})^{\rm T} & -I_\theta^{-1}I_{\alpha,\theta}^{\rm T}I_\alpha^{-1}K_{\alpha,\theta} + I_\theta^{-1}K_\theta \end{pmatrix} \\
&= {\rm tr}\begin{pmatrix} I_\alpha^{-1}K^\circ_\alpha & 0 \\ 0 & -I_\theta^{-1}I_{\alpha,\theta}^{\rm T}I_\alpha^{-1}K_{\alpha,\theta} + I_\theta^{-1}K_\theta \end{pmatrix}.
\end{aligned}$$
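This block-trace identity is pure matrix algebra and can be sanity-checked numerically. The sketch below is our own check with arbitrary made-up blocks ($\alpha$ of dimension 2, $\theta$ of dimension 1, all numbers invented for the test); note in particular that the lower-left block $(K^*_{\alpha,\theta})^{\rm T}$ enters the matrix product but drops out of the trace.

```python
def matmul(a, b):
    """Multiply matrices represented as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def trace(a):
    return sum(a[i][i] for i in range(len(a)))

def inv2(a):
    """Inverse of a 2x2 matrix."""
    (p, q), (r, s) = a
    det = p * s - q * r
    return [[s / det, -q / det], [-r / det, p / det]]

# Arbitrary test blocks: dim(alpha) = 2, dim(theta) = 1.
I_a = [[2.0, 0.3], [0.3, 1.5]]   # I_alpha
I_t = 1.2                        # I_theta (scalar)
I_at = [[0.4], [0.7]]            # I_{alpha,theta}, 2x1
K_a = [[1.1, 0.2], [0.2, 0.9]]   # K_alpha^circ
K_at = [[0.5], [0.1]]            # K_{alpha,theta}, 2x1
K_star = [[0.8, -0.6]]           # (K*_{alpha,theta})^T, 1x2
K_t = 0.7                        # K_theta

I_a_inv = inv2(I_a)
# Assemble the full 3x3 versions of I_eta^{-1}, L and K_eta^circ.
I_inv = [[I_a_inv[0][0], I_a_inv[0][1], 0.0],
         [I_a_inv[1][0], I_a_inv[1][1], 0.0],
         [0.0, 0.0, 1.0 / I_t]]
row = [-(I_at[0][0] * I_a_inv[0][0] + I_at[1][0] * I_a_inv[1][0]),
       -(I_at[0][0] * I_a_inv[0][1] + I_at[1][0] * I_a_inv[1][1])]
L = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [row[0], row[1], 1.0]]
K = [[K_a[0][0], K_a[0][1], K_at[0][0]],
     [K_a[1][0], K_a[1][1], K_at[1][0]],
     [K_star[0][0], K_star[0][1], K_t]]

full = trace(matmul(matmul(I_inv, L), K))
# Reduced form: tr(I_a^{-1} K_a^circ) + (K_theta - I_at^T I_a^{-1} K_at) / I_theta.
part_a = trace(matmul(I_a_inv, K_a))
quad = sum(I_at[i][0] * sum(I_a_inv[i][j] * K_at[j][0] for j in range(2)) for i in range(2))
reduced = part_a + (K_t - quad) / I_t
```

Here `full` and `reduced` coincide up to floating-point error, confirming that only the two diagonal blocks contribute to $p^*_\theta$.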
Consistent estimators for $I_{\alpha_j}^{-1}$, $K_{\alpha_j}$, $I_\eta^{-1}$, $L$ and $K^\circ_\eta$ can be obtained by using plug-in sample averages.
The consequent estimators are denoted $\tilde I_{\alpha_j}^{-1}$, $\tilde K_{\alpha_j}$, $\tilde I_\eta^{-1}$, $\tilde L$ and $\tilde K^\circ_\eta$. For regularity conditions that ensure
convergence in probability, see Jullum & Hjort (2017).
Thus, the unbiased estimator of $Q(\tilde\eta)$ is
$$\tilde Q^*(\tilde\eta) = \frac{1}{n}\ell_n(\tilde\eta) - \frac{1}{n}\Bigl\{\sum_{j=1}^{d}{\rm tr}\bigl(\tilde I_{\alpha_j}^{-1}\tilde K_{\alpha_j}\bigr) + {\rm tr}\bigl(\tilde I_\eta^{-1}\tilde L\tilde K^\circ_\eta\bigr)\Bigr\}.$$
By defining $\tilde p^*_{\alpha_j} = {\rm tr}(\tilde I_{\alpha_j}^{-1}\tilde K_{\alpha_j})$, $\tilde p^*_\theta = {\rm tr}(\tilde I_\eta^{-1}\tilde L\tilde K^\circ_\eta)$ and $\tilde p^*_\eta = \sum_{j=1}^{d}\tilde p^*_{\alpha_j} + \tilde p^*_\theta$, and scaling $\tilde Q^*(\tilde\eta)$, we can finally
define the copula information criterion as
$$\mathrm{CIC} = 2\ell_n(\tilde\eta) - 2\tilde p^*_\eta. \qquad (3)$$
2.3 AIC for two-stage maximum likelihood estimator
The CIC, derived in Section 2.2, is a model robust model selection criterion. This means that the CIC does
not assume that the parametric model includes the true model that generated the data. In this section we show
that, if we do make such a true model assumption, the CIC simplifies to the AIC2ML. To our knowledge,
this is the first time that the validity of the AIC2ML formula is proven for the two-stage ML estimator.
Lemma 1. Under the assumption that the margins and copula are correctly specified, it holds that $K^\circ_\alpha = I_\alpha - K_\alpha$.
Proof. Assume that the candidate model worked with contains the true data generating mechanism, i.e. that
$f = g$ at the relevant parameter point. Then
$$\begin{aligned}
0 &= \frac{\partial\,{\rm E}_G\bigl[U_\alpha(y,\alpha_0)\bigr]^{\rm T}}{\partial\alpha_0}
= \frac{\partial}{\partial\alpha_0}\int f\,\frac{\partial\sum_{j=1}^{d}\log f_j}{\partial\alpha_0^{\rm T}}\,{\rm d}y \\
&= \int \frac{\partial f}{\partial\alpha_0}\,\frac{\partial\sum_{j=1}^{d}\log f_j}{\partial\alpha_0^{\rm T}}\,{\rm d}y + \int f\,\frac{\partial^2\sum_{j=1}^{d}\log f_j}{\partial\alpha_0\,\partial\alpha_0^{\rm T}}\,{\rm d}y \\
&= \int f\,\frac{\partial\log f}{\partial\alpha_0}\,\frac{\partial\sum_{j=1}^{d}\log f_j}{\partial\alpha_0^{\rm T}}\,{\rm d}y + \int f\,\frac{\partial^2\sum_{j=1}^{d}\log f_j}{\partial\alpha_0\,\partial\alpha_0^{\rm T}}\,{\rm d}y \\
&= \int f\Bigl(\frac{\partial\sum_{j=1}^{d}\log f_j}{\partial\alpha_0} + \frac{\partial\log c}{\partial\alpha_0}\Bigr)\frac{\partial\sum_{j=1}^{d}\log f_j}{\partial\alpha_0^{\rm T}}\,{\rm d}y + \int f\,\frac{\partial^2\sum_{j=1}^{d}\log f_j}{\partial\alpha_0\,\partial\alpha_0^{\rm T}}\,{\rm d}y \\
&= \int f\,\frac{\partial\sum_{j=1}^{d}\log f_j}{\partial\alpha_0}\,\frac{\partial\sum_{j=1}^{d}\log f_j}{\partial\alpha_0^{\rm T}}\,{\rm d}y + \int f\,\frac{\partial\log c}{\partial\alpha_0}\,\frac{\partial\sum_{j=1}^{d}\log f_j}{\partial\alpha_0^{\rm T}}\,{\rm d}y + \int f\,\frac{\partial^2\sum_{j=1}^{d}\log f_j}{\partial\alpha_0\,\partial\alpha_0^{\rm T}}\,{\rm d}y \\
&= {\rm E}_G\bigl[U_\alpha(y,\alpha_0)U_\alpha(y,\alpha_0)^{\rm T}\bigr] + {\rm E}_G\bigl[U^*_\alpha(y,\alpha_0)U_\alpha(y,\alpha_0)^{\rm T}\bigr] - {\rm E}_G\bigl[-H_\alpha(y,\alpha_0)\bigr] \\
&= K_\alpha + K^\circ_\alpha - I_\alpha. \qquad\square
\end{aligned}$$
If we make the true model assumption (i.e. $f = g$), we have from the classical results of maximum
likelihood theory that $I_{\alpha_j} = K_{\alpha_j}$. This implies that the bias correction term in the marginal
parameters becomes $p^*_{\alpha_j} = {\rm tr}(I_{\alpha_j}^{-1}K_{\alpha_j}) = \dim(\alpha_j)$, for each $j$.
For the bias correction term in the copula parameter part, the true model assumption results in Lemma 1.
Combining Lemma 1 with Lemma 3 and Lemma 5 from Ko & Hjort (2018) gives
$$\begin{aligned}
{\rm tr}\bigl(I_\eta^{-1}L K^\circ_\eta\bigr)
&= {\rm tr}\begin{pmatrix} I_\alpha^{-1}K^\circ_\alpha & 0 \\ 0 & -I_\theta^{-1}I_{\alpha,\theta}^{\rm T}I_\alpha^{-1}K_{\alpha,\theta} + I_\theta^{-1}K_\theta \end{pmatrix} \\
&= {\rm tr}\begin{pmatrix} I - I_\alpha^{-1}K_\alpha & 0 \\ 0 & I \end{pmatrix} \\
&= {\rm tr}\begin{pmatrix} 0 & 0 \\ 0 & I \end{pmatrix} = \dim(\theta).
\end{aligned}$$
Thus, the unbiased estimator $\tilde Q^*(\tilde\eta)$ can be simplified as
$$\tilde Q^*(\tilde\eta) = \frac{1}{n}\ell_n(\tilde\eta) - \frac{1}{n}\Bigl\{\sum_{j=1}^{d}\dim(\alpha_j) + \dim(\theta)\Bigr\}.$$
By defining $p_\eta = \sum_{j=1}^{d}\dim(\alpha_j) + \dim(\theta) = \dim(\eta)$ and scaling $\tilde Q^*(\tilde\eta)$, we obtain AIC2ML for two-stage
ML estimated copula models:
$$\mathrm{AIC}_{\mathrm{2ML}} = 2\ell_n(\tilde\eta) - 2p_\eta. \qquad (4)$$
Compared to AICML from Section 2.1, the only difference is that the log-likelihood is now maximized under
the two-stage ML scheme instead of ML, i.e. we use $\tilde\eta$ instead of $\hat\eta$.
2.4 Relationship between the model selection criteria
So far, we have discussed four model selection criteria for copula models. Table 1 shows an overview of the
relationship between them. When the model (i.e. both copula and margins) is correctly specified, CIC and
AIC2ML become equal, and the same happens between TIC and AICML. Thus, one can compare CIC and
AIC2ML (or TIC and AICML in case of ML estimation) to check whether the model is correctly specified.
Further, since both CIC and TIC estimate the same part of the KL divergence under the same
model robust environment, they are compatible. This implies that one can compare CIC to TIC to measure
how much one loses in terms of KL divergence by using two-stage ML estimation instead of ML estimation.
The same can be done by comparing AIC2ML to AICML when one believes in the model. However, one
should not compare CIC with AICML (or TIC with AIC2ML), since they rest on two different model
beliefs (i.e. the presence or absence of the true model assumption). Figuratively speaking, one would then be
comparing apples with pears.
                Model robust
Estimation      Yes     No
  ML            TIC     AICML
  2ML           CIC     AIC2ML

Table 1: An overview of the relationship between the model selection criteria discussed in this paper.
2.5 Illustration for the two-dimensional case and extension to the conditional copula regression
To make things more concrete, we now give an example of the two-dimensional case with $(Y_1, Y_2)$ from the
unknown $g(y)$. As candidate margins we choose a two-parameter distribution (e.g. normal, gamma, Weibull)
for both $F_1$ and $F_2$. For the copula part, we choose a one-parameter copula (e.g. Gumbel, Frank,
Clayton). The candidate model then has the form
$$f(y_1,y_2,\eta) = c\bigl(F_1(y_1,\alpha_1),F_2(y_2,\alpha_2),\theta\bigr)\cdot f_1(y_1,\alpha_1)\cdot f_2(y_2,\alpha_2),$$
where $\eta = (\alpha_{1,1},\alpha_{1,2},\alpha_{2,1},\alpha_{2,2},\theta)^{\rm T}$. The ingredients for the CIC can then be written down by using the
block matrix form, in line with Ko & Hjort (2018):
$$K_\eta = \begin{pmatrix} K_{\alpha_1} & K_{\alpha_1,\alpha_2} & K_{\alpha_1,\theta} \\ K_{\alpha_2,\alpha_1} & K_{\alpha_2} & K_{\alpha_2,\theta} \\ K_{\alpha_1,\theta}^{\rm T} & K_{\alpha_2,\theta}^{\rm T} & K_\theta \end{pmatrix}, \qquad I_\eta = \begin{pmatrix} I_{\alpha_1} & 0 & 0 \\ 0 & I_{\alpha_2} & 0 \\ 0 & 0 & I_\theta \end{pmatrix},$$
$$L = \begin{pmatrix} I_{2\times 2} & 0 & 0 \\ 0 & I_{2\times 2} & 0 \\ -I_{\alpha_1,\theta}^{\rm T}I_{\alpha_1}^{-1} & -I_{\alpha_2,\theta}^{\rm T}I_{\alpha_2}^{-1} & I_{1\times 1} \end{pmatrix}, \qquad K^\circ_\eta = \begin{pmatrix} K^\circ_{\alpha_1} & K^\circ_{\alpha_1,\alpha_2} & K_{\alpha_1,\theta} \\ K^\circ_{\alpha_2,\alpha_1} & K^\circ_{\alpha_2} & K_{\alpha_2,\theta} \\ (K^*_{\alpha_1,\theta})^{\rm T} & (K^*_{\alpha_2,\theta})^{\rm T} & K_\theta \end{pmatrix}.$$
We consequently have
$$\begin{aligned}
p^*_\eta &= p^*_{\alpha_1} + p^*_{\alpha_2} + p^*_\theta \\
&= {\rm tr}\bigl(I_{\alpha_1}^{-1}K_{\alpha_1}\bigr) + {\rm tr}\bigl(I_{\alpha_2}^{-1}K_{\alpha_2}\bigr) + {\rm tr}\bigl(I_\eta^{-1}L K^\circ_\eta\bigr) \\
&= {\rm tr}\bigl(I_{\alpha_1}^{-1}K_{\alpha_1}\bigr) + {\rm tr}\bigl(I_{\alpha_2}^{-1}K_{\alpha_2}\bigr) + {\rm tr}\bigl(I_{\alpha_1}^{-1}K^\circ_{\alpha_1}\bigr) + {\rm tr}\bigl(I_{\alpha_2}^{-1}K^\circ_{\alpha_2}\bigr) \\
&\quad + {\rm tr}\bigl(-I_\theta^{-1}I_{\alpha_1,\theta}^{\rm T}I_{\alpha_1}^{-1}K_{\alpha_1,\theta} - I_\theta^{-1}I_{\alpha_2,\theta}^{\rm T}I_{\alpha_2}^{-1}K_{\alpha_2,\theta} + I_\theta^{-1}K_\theta\bigr).
\end{aligned}$$
More generally, we can consider conditional copula regression, where all marginal distributions and the
copula are conditioned parametrically on the $k$-variate covariate $X$. Patton (2002) extends the existing
theory of copulae to the conditional copula setting, including a conditional version of Sklar's theorem,
which gives the conditional copula density
$$f(y_1,y_2,\eta\,|\,x) = c\bigl(F_1(y_1,\alpha_1\,|\,x), F_2(y_2,\alpha_2\,|\,x),\theta\,|\,x\bigr)\cdot f_1(y_1,\alpha_1\,|\,x)\cdot f_2(y_2,\alpha_2\,|\,x).$$
For simplicity, we consider the case where the copula parameter $\theta$ is modeled by the linear calibration
function $\theta = X\beta$, where $\beta = (\beta_0,\dots,\beta_k)^{\rm T}$ is a $(k+1)$-dimensional parameter vector. We can then consider $\beta$
as the copula parameter instead of $\theta$, which results in $\eta = (\alpha_{1,1},\alpha_{1,2},\alpha_{2,1},\alpha_{2,2},\beta_0,\dots,\beta_k)^{\rm T}$. One can easily
define the matrices $K_\eta$, $I_\eta$, $L$ and $K^\circ_\eta$ accordingly. For details about conditional copula regression, see Patton
(2006), Acar et al. (2013) and Palaro & Hotta (2006).
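The linear calibration idea can be sketched concretely. The example below is our own illustration, not from the paper: a Clayton copula is chosen hypothetically because its density is short, the margins are taken as known uniforms so that only the copula stage is shown, and $\theta_i = \beta_0 + \beta_1 x_i$ is fitted by a grid search over $(\beta_0, \beta_1)$.

```python
import math
import random

def log_clayton(u, v, theta):
    """Clayton copula log-density."""
    return (math.log(1.0 + theta) - (theta + 1.0) * (math.log(u) + math.log(v))
            - (2.0 + 1.0 / theta) * math.log(u ** (-theta) + v ** (-theta) - 1.0))

rng = random.Random(3)
b0_true, b1_true = 2.0, 1.0
obs = []
for _ in range(1500):
    x = rng.random()                  # covariate in (0, 1)
    theta = b0_true + b1_true * x     # linear calibration: theta_i in (2, 3)
    u, w = rng.random(), rng.random()
    # Conditional inversion method for the Clayton copula.
    v = (u ** (-theta) * (w ** (-theta / (1.0 + theta)) - 1.0) + 1.0) ** (-1.0 / theta)
    obs.append((x, u, v))

def loglik(b0, b1):
    """Copula log-likelihood with the covariate-dependent parameter."""
    return sum(log_clayton(u, v, b0 + b1 * x) for x, u, v in obs)

# Maximize over (b0, b1) by grid search; theta stays positive on this grid.
grid = [(1.0 + 0.1 * i, 0.1 * j) for i in range(21) for j in range(21)]
b0_hat, b1_hat = max(grid, key=lambda b: loglik(*b))
```

In a full analysis the margins would also carry covariates and be fitted in stage 1; the point here is only that $\beta$ replaces $\theta$ as the copula parameter without changing the structure of the estimation.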
3 Simulation study
To study the behavior of CIC, we have performed a simulation study.
3.1 Simulation 1
In simulation 1, we generated datasets of three sizes ($n$ = 100, 1000, 10000) with the data generating model
described in Table 2. For each dataset, we then fitted 18 different copula models, based on the possible
combinations of the candidate copulae and margins described in Table 3. We repeated this process 1000
times; the averaged results from the fitted models can be found in Table 7, Table 8 and Table 9 in the
Appendix.
Table 2: Description of the data generating model used in simulation 1.

                        Copula     Margin 1           Margin 2
Data generating model   Gumbel     Weibull            Gamma
                        θ = 3      α1 = (1.5, 4)^T    α2 = (2, 1)^T
                                   (shape, scale)     (shape, rate)
Table 3: List of the candidate copulae and margins used in simulation 1
Candidates
Copula Gumbel, Gaussian
Margin 1 Weibull, Gamma, Log-normal
Margin 2 Weibull, Gamma, Log-normal
From all three tables, we can see that CIC and AIC2ML result in similar model ranks. This is expected,
since both aim for the model that minimizes the KL divergence. When the model is correctly specified
(model 1), the estimated values of CIC and AIC2ML are essentially the same. This confirms that CIC and
AIC2ML are equal under the true model assumption (proven analytically in Section 2.3). Further, we can
observe that 'better' models according to CIC and AIC2ML in general have a smaller value of MSE($\tilde P$). To
clarify, MSE($\tilde P$) denotes the mean squared error of the two-stage ML estimate of $P(b_{0.8} < y)$. Here $P(b_{0.8} < y)$ is
the joint probability that each marginal variable has a larger value than its 0.8-quantile value, defined by each
marginal model. Similarly, MSE($\hat P$) is the mean squared error of the ML estimate of $P(b_{0.8} < y)$.
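On the uniform scale, $P(b_{0.8} < y)$ is a joint exceedance probability, which in two dimensions can be read off the copula directly by inclusion–exclusion. A small sketch of this computation (our own illustration, using the Gumbel copula of the data generating model; the two-dimensional case is a simplification of the five-margin setting used in simulation 2):

```python
import math

def gumbel_cdf(u, v, theta):
    """Gumbel copula C(u, v, theta)."""
    s = (-math.log(u)) ** theta + (-math.log(v)) ** theta
    return math.exp(-s ** (1.0 / theta))

def joint_exceedance(u, theta):
    """P(U1 > u, U2 > u) = 1 - 2u + C(u, u) by inclusion-exclusion."""
    return 1.0 - 2.0 * u + gumbel_cdf(u, u, theta)

p_dep = joint_exceedance(0.8, theta=3.0)   # under the simulation-1 copula
p_indep = 0.2 * 0.2                        # the same probability under independence
```

For continuous margins, $P(b_{0.8} < y)$ on the data scale equals this copula-scale probability, since each $F_j(Y_j)$ is uniform; the positive dependence of the Gumbel copula makes `p_dep` substantially larger than the independence value 0.04.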
When the model is correctly specified (model 1), $\tilde p^*_\eta$ is virtually equal to $p_\eta$, the length of the parameter
vector $\eta$. When the model is misspecified, the CIC value is penalized more as $\tilde p^*_\eta$ increases. However,
this is not the case for AIC2ML, since $p_\eta = 5$ across all models. Thus, CIC has a higher chance of choosing
a less wrong model. As $n$ increases, the influence of $\tilde p^*_\eta$ on CIC decreases, since the absolute value of the log-
likelihood grows much faster than the penalty term. This is observable in Table 4. When $n = 100$, the best
models chosen by the model robust model selection criteria (TIC and CIC) result in smaller MSE values.
However, when $n = 10000$, the penalty term of these model selection criteria is very small compared to the
log-likelihood value. Consequently, CIC and TIC choose the same models as their model non-robust variants
(AIC2ML and AICML) do. This results in the same MSE performance.
Table 4: Result from simulation 1. For each dataset, the best model was chosen among the 18 candidate models
by using CIC, TIC, AIC2ML or AICML. Then, $P(b_{0.8} < y)$ was computed from the best models. The table
contains the mean squared error of the estimated $P(b_{0.8} < y)$, multiplied by $10^5$.

n        TIC      AICML    CIC      AIC2ML
100      3.7871   3.8723   3.9079   4.0122
1000     0.2859   0.2859   0.2858   0.2858
10000    0.0251   0.0251   0.0251   0.0251
3.2 Simulation 2
In simulation 2, we generated datasets of size $n = 1000$ with the data generating model described in Table 5.
We then fitted 486 different copula models, based on the possible combinations of the candidate
copulae and margins described in Table 6. We repeated this process 100 times and averaged the results.
As in simulation 1, $P(b_{0.8} < y)$ was computed from every fitted model. Figure 1 displays the relationship
between the mean squared error of the estimated $P(b_{0.8} < y)$ and CIC and AIC2ML. We can see that both model
selection criteria evaluate the models that have lower mean squared error as better models. The difference
between CIC and AIC2ML in this perspective is minimal. This is because the log-likelihood, the element that
is shared by both model selection criteria, has much bigger absolute value than the bias correction term.
Table 5: Description of the data generating model used in simulation 2.

                  Copula    Margin 1          Margin 2         Margin 3         Margin 4         Margin 5
Data generating   Gumbel    Weibull           Weibull          Gamma            Gamma            Gamma
model             θ = 3     α1 = (1.5, 4)^T   α2 = (2, 3)^T    α3 = (2, 1)^T    α4 = (3, 1)^T    α5 = (4, 2)^T
                            (shape, scale)    (shape, scale)   (shape, rate)    (shape, rate)    (shape, rate)
Table 6: List of the candidate copulae and margins used in simulation 2
Candidates
Copula Gumbel, Gaussian
Margin 1 Weibull, Gamma, Log-normal
Margin 2 Weibull, Gamma, Normal
Margin 3 Weibull, Gamma, Log-normal
Margin 4 Weibull, Gamma, Log-normal
Margin 5 Weibull, Gamma, Log-normal
Figure 2 plots the same as Figure 1, but the x-axis is now the bias correction term ($\tilde p^*_\eta$ for CIC and
$p_\eta$ for AIC2ML). The difference between CIC and AIC2ML is now clearer. While $p_\eta$ (the dimension of the
parameter vector $\eta$) is fixed at 11 across all models for AIC2ML, $\tilde p^*_\eta$ tends to penalize misspecified models
more and has a strong relationship with the mean squared error of the estimated $P(b_{0.8} < y)$.
Figure 1: Result from simulation 2. On the x-axis is the value of CIC or AIC2ML. The 486 different copula
models, defined by using the candidate copulae and margins described in Table 6, are fitted to the dataset
generated from the data generating model described in Table 5. The y-axis is the mean squared error of
$\tilde P(b_{0.8} < y)$, the two-stage ML estimate of the joint probability that each marginal variable has a
larger value than its 0.8-quantile value, defined by each marginal model. Since different choices of copulae
lead to a big difference in model selection score, the result is displayed separately for each copula. The left
plot contains the 243 models with the Gumbel copula and the right plot contains the 243 models with the Gaussian
copula. The data generating model had the Gumbel copula.
As mentioned in Section 2.4, one can compare CIC to TIC, or AIC2ML to AICML, to measure how
much one loses in terms of KL divergence by performing two-stage ML estimation instead of ML estimation.
Figure 3 shows that TIC − CIC and AICML − AIC2ML are related to the loss in MSE of $P(b_{0.8} < y)$
caused by two-stage ML estimation. However, this relationship seems weaker when the copula is misspecified
(right panel). We tried to identify a sub-pattern in the plot that could cause this, for example by plotting
only subsets of models that share specific margins, yet we were not able to detect any such sub-pattern.
Figure 2: Result from simulation 2. The 486 different copula models, defined by using the candidate copulae
and margins described in Table 6, are fitted to the dataset generated from the data generating model described
in Table 5. The y-axis is the mean squared error of $\tilde P(b_{0.8} < y)$. The x-axis is the value of the bias correction
term in the model selection criteria ($\tilde p^*_\eta$ for CIC and $p_\eta$ for AIC2ML). As for Figure 1, the result is
displayed separately for each copula. The data generating model had the Gumbel copula.
Figure 3: Result from simulation 2. The 486 different copula models, defined by using the candidate
copulae and margins described in Table 6, are fitted to the dataset generated from the data generating model
described in Table 5. The y-axis is the difference between MSE($\hat P$) (the MSE of $P(b_{0.8} < y)$ estimated by ML)
and MSE($\tilde P$) (the MSE of $P(b_{0.8} < y)$ estimated by two-stage ML). As in Figure 1 and Figure 2, the result is
displayed separately for each copula. The data generating model had the Gumbel copula.
4 Conclusions and further research
In this paper, we have developed the copula information criterion (CIC), which is a TIC-like model robust
model selection criterion for two-stage ML estimated copulae. When we make the assumption that the
parametric candidate model contains the true model, CIC becomes equal to AIC2ML. This validates the
use of AIC2ML for two-stage ML estimated copula models. To our knowledge, this is the first time that the
AIC2ML formula has been analytically justified. Further, since both TIC and CIC estimate the same part of
the KL divergence, without the presence of the true model assumption, they are compatible with each other
and can be used to check possible disadvantages caused by the two-stage ML estimation. The same can be
done by comparing AICML and AIC2ML, when one believes in the model.
Regarding the assumption that a candidate model is correct, one can compare $\tilde p^*_\eta$ (the bias correction term of
CIC) and $p_\eta$ (the bias correction term of AIC2ML) to check whether the model severely diverges from the data
generating model, i.e. as a separate goodness-of-fit check. It may be noted that the job of the CIC is to rank
models according to a sensible criterion, and to identify the best ones; doing well in this ranking is not
the same as claiming that the model passes goodness-of-fit tests. In other words, the winning model
according to the CIC may still not be a perfect model, perhaps because the list of candidate models was not the
best.
We performed a simulation study. For relatively small sample sizes, CIC outperforms AIC2ML in terms
of prediction performance from the selected models. When the sample size is large, the log-likelihood term
grows much faster than the bias correction term and the difference between CIC and AIC2ML is minimal.
Naturally, for large $n$, the best models are those with high values of the maximized (two-stage) log-likelihood,
which means richly parametrised models.
From our simulation, cf. Figure 2, we can see that $\tilde p^*_\eta$ alone has a strong correlation with the prediction
performance (measured in MSE). So, one can consider using $\tilde p^*_\eta$ (without the log-likelihood part) to judge
the model. In addition, TIC − CIC and AICML − AIC2ML turn out to have a high correlation with the loss
of prediction performance (measured as a difference in MSE) caused by the switch from ML estimation to
two-stage ML estimation.
The results from the simulation study hold largely both when the copula is correctly specified and
when it is misspecified. Yet, in Figure 3, although the overall tendency is similar, we observe that the result for the
misspecified Gaussian copula seems to consist of two different sub-patterns. However, we were not able to
find a possible cause of this.
Because of the large number of possible models in a high-dimensional setting, the number of situations
that we could examine was limited. (In the case of a 5-dimensional copula model with 2 candidate copulae and
3 candidate families for each margin, there are 2 · 3⁵ = 486 models to test, and each model has to be fitted
by 2 different estimation schemes.) Another limitation was that we could not try all copulae and margins on
the simulated data, since fitting a heavily misspecified copula or margin would cause numerical problems.
A further large-scale simulation study examining the behavior of different types of copulae in a variety of
situations would be fruitful.
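The combinatorial count in the parenthesis above can be sketched as follows (a toy bookkeeping calculation; the variable names are ours):

```python
# Toy bookkeeping for the model-search space described above (variable names are ours).
n_copulae = 2           # candidate copula families
n_margin_families = 3   # candidate parametric families per margin
dim = 5                 # number of margins

n_models = n_copulae * n_margin_families ** dim  # 2 * 3^5
n_fits = n_models * 2                            # each model fitted by full ML and two-stage ML
print(n_models, n_fits)  # -> 486 972
```

The multiplicative growth in the number of margins is what makes an exhaustive study infeasible in higher dimensions.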
Furthermore, CIC is computationally expensive, mainly because K̂°α, K̂α,θ and K̂θ require the score functions
for every data point separately. For Î⁻¹α and Îθ, one can avoid this by swapping the order of
summation and differentiation, but for K̂°α, K̂α,θ and K̂θ this is not possible. A numerical technique that
makes CIC less computationally expensive would be a valuable contribution.
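To illustrate why the K̂-type matrices are the bottleneck, the following sketch (ours, not the paper's implementation; a Gaussian log-density stands in for the actual copula and margin likelihoods) computes a K̂-type matrix from per-observation numerical scores, requiring one gradient evaluation per data point:

```python
import numpy as np

def loglik_i(theta, x):
    """Per-observation log-density; a Gaussian stand-in for the model's log-likelihood."""
    mu, sigma = theta
    return -0.5 * np.log(2 * np.pi) - np.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2

def score_i(theta, x, h=1e-5):
    """Numerical score for one observation; this is what K-hat needs for every data point."""
    g = np.zeros(len(theta))
    for j in range(len(theta)):
        tp = theta.copy(); tp[j] += h
        tm = theta.copy(); tm[j] -= h
        g[j] = (loglik_i(tp, x) - loglik_i(tm, x)) / (2 * h)
    return g

rng = np.random.default_rng(1)
x = rng.normal(1.0, 2.0, size=500)
theta_hat = np.array([x.mean(), x.std()])  # ML estimates for the Gaussian stand-in

# K-hat = (1/n) * sum_i s_i s_i^T : one score evaluation per observation,
# which cannot be replaced by differentiating the summed log-likelihood once.
S = np.vstack([score_i(theta_hat, xi) for xi in x])
K_hat = S.T @ S / len(x)
```

By contrast, an Î-type matrix can be obtained from derivatives of the summed log-likelihood, so its cost does not scale with per-observation differentiation in the same way.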
Although CIC performs well in selecting the model that fits the data best in terms of KL
divergence, there are situations where one is interested in a model suited to a specific task,
for example estimating tail probabilities, the mean, or the median. A model selection
criterion for copula models under the two-stage ML scheme that takes such a focus into account would be useful.
Acknowledgments
The authors would like to thank Ingrid Hobæk Haff and Steffen Grønneberg for their valuable comments
and fruitful discussions. The authors also acknowledge partial funding from the Norwegian Research Council-supported
research group FocuStat: Focus Driven Statistical Inference With Complex Data, and from the
Department of Mathematics at the University of Oslo.
References
Acar, E. F., Craiu, R. V., Yao, F. et al. (2013). Statistical testing of covariate effects in conditional copula
models. Electronic Journal of Statistics 7, 2822–2850.
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control
19, 716–723.
Akaike, H. (1998). Information theory and an extension of the maximum likelihood principle. In Selected
Papers of Hirotugu Akaike. Springer, pp. 199–213.
Claeskens, G. & Hjort, N. L. (2008). Model Selection and Model Averaging, vol. 330. Cambridge University
Press, Cambridge.
Genest, C. & Favre, A.-C. (2007). Everything you always wanted to know about copula modeling but were
afraid to ask. Journal of Hydrologic Engineering 12, 347–368.
Grønneberg, S. & Hjort, N. L. (2014). The copula information criteria. Scandinavian Journal of Statistics
41, 436–459.
Joe, H. (1997). Multivariate Models and Multivariate Dependence Concepts. CRC Press.
Jullum, M. & Hjort, N. L. (2017). Parametric or nonparametric: the FIC approach. Statistica Sinica.
Ko, V. & Hjort, N. L. (2018). Model robust inference for copulae via two-stage maximum likelihood estima-
tion. Submitted to Journal of Multivariate Analysis.
Kullback, S. & Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics
22, 79–86.
Nelsen, R. B. (2006). An Introduction to Copulas. Springer Science & Business Media.
Palaro, H. & Hotta, L. (2006). Using conditional copula to estimate value at risk. Journal of Data Science
4, 93–115.
Patton, A. J. (2002). Applications of copula theory in financial econometrics. Ph.D. thesis, University of
California, San Diego.
Patton, A. J. (2006). Modelling asymmetric exchange rate dependence. International Economic Review 47,
527–556.
Sklar, A. (1959). Fonctions de répartition à n dimensions et leurs marges. Publ. Inst. Statist. Univ. Paris
8, 229–231.
Takeuchi, K. (1976). The distribution of information statistics and the criterion of goodness of fit of models.
Mathematical Science 153, 12–18.
Appendix
Model no. TIC AICML CIC AIC2ML p̂*η,TIC p̂*η pη 10⁵·MSE(P̂) (ML) 10⁵·MSE(P̂) (2ML) Copula Margin 1 Margin 2
n = 100
1 -609.92 -610.10 -610.77 -610.93 4.91 4.92 5 2.6695 2.8444 Gumbel Weibull Gamma
2 -613.04 -612.93 -614.75 -614.26 5.05 5.24 5 2.6092 2.9347 Gumbel Weibull Weibull
4 -613.64 -613.54 -615.04 -614.93 5.05 5.05 5 2.7188 2.8276 Gumbel Gamma Gamma
11 -621.56 -620.75 -621.61 -621.12 5.41 5.25 5 27.3463 22.8859 Gaussian Weibull Weibull
10 -622.25 -621.41 -622.33 -621.69 5.42 5.32 5 29.5786 26.3657 Gaussian Weibull Gamma
5 -619.30 -618.63 -623.51 -622.01 5.34 5.75 5 2.7269 3.3693 Gumbel Gamma Weibull
6 -624.39 -623.56 -627.26 -626.22 5.41 5.52 5 2.5957 2.7652 Gumbel Gamma Log-normal
13 -628.24 -627.03 -628.26 -627.05 5.60 5.60 5 31.3167 30.7712 Gaussian Gamma Gamma
14 -629.62 -628.39 -629.99 -628.63 5.62 5.68 5 26.9486 28.9915 Gaussian Gamma Weibull
3 -627.54 -626.46 -634.34 -632.53 5.54 5.90 5 2.6429 3.8486 Gumbel Weibull Log-normal
9 -640.53 -638.38 -643.09 -640.44 6.07 6.32 5 2.7582 2.7121 Gumbel Log-normal Log-normal
12 -646.70 -643.93 -646.92 -644.09 6.39 6.41 5 41.5681 43.6130 Gaussian Weibull Log-normal
15 -647.74 -644.76 -647.75 -644.77 6.49 6.49 5 43.4360 43.2919 Gaussian Gamma Log-normal
7 -647.85 -645.24 -654.37 -651.88 6.31 6.24 5 3.0737 6.1935 Gumbel Log-normal Gamma
8 -654.09 -651.41 -665.48 -661.68 6.34 6.90 5 2.5910 9.1325 Gumbel Log-normal Weibull
16 -666.52 -662.25 -666.53 -662.25 7.14 7.14 5 57.9316 58.2482 Gaussian Log-normal Gamma
17 -671.15 -666.76 -672.01 -667.17 7.19 7.42 5 51.5845 61.5621 Gaussian Log-normal Weibull
18 -674.16 -668.48 -674.16 -668.48 7.84 7.84 5 56.7535 56.7548 Gaussian Log-normal Log-normal
Table 7: Result of simulation 1 with n = 100. The simulation was repeated 1000 times and the results were averaged.
Model no. TIC AICML CIC AIC2ML p̂*η,TIC p̂*η pη 10⁵·MSE(P̂) (ML) 10⁵·MSE(P̂) (2ML) Copula Margin 1 Margin 2
n = 1000
1 -6060.83 -6060.84 -6061.78 -6061.79 4.99 4.99 5 0.2861 0.2996 Gumbel Weibull Gamma
2 -6093.59 -6092.60 -6102.19 -6101.23 5.50 5.48 5 0.2981 0.3218 Gumbel Weibull Weibull
4 -6095.45 -6095.16 -6105.59 -6105.31 5.15 5.14 5 0.3009 0.3331 Gumbel Gamma Gamma
11 -6171.40 -6170.13 -6174.20 -6173.32 5.63 5.44 5 23.3046 18.7508 Gaussian Weibull Weibull
10 -6178.38 -6177.18 -6179.74 -6178.74 5.60 5.50 5 25.5380 22.1326 Gaussian Weibull Gamma
5 -6151.67 -6150.39 -6183.69 -6181.42 5.64 6.13 5 0.3913 0.7139 Gumbel Gamma Weibull
6 -6204.66 -6203.44 -6230.19 -6228.67 5.61 5.76 5 0.2753 0.3321 Gumbel Gamma Log-normal
13 -6235.93 -6234.28 -6236.12 -6234.49 5.82 5.82 5 27.2102 26.6318 Gaussian Gamma Gamma
14 -6250.52 -6248.66 -6252.09 -6250.10 5.93 6.00 5 22.6996 24.9377 Gaussian Gamma Weibull
3 -6237.75 -6236.14 -6300.84 -6298.28 5.80 6.28 5 0.3561 1.3135 Gumbel Weibull Log-normal
9 -6355.65 -6352.79 -6369.91 -6366.28 6.43 6.81 5 0.5153 0.4199 Gumbel Log-normal Log-normal
12 -6425.80 -6421.62 -6426.35 -6422.09 7.09 7.13 5 38.1316 40.5651 Gaussian Weibull Log-normal
15 -6432.74 -6428.42 -6432.78 -6428.47 7.16 7.16 5 39.9678 39.8287 Gaussian Gamma Log-normal
7 -6427.95 -6424.60 -6495.95 -6492.60 6.68 6.67 5 0.6189 3.5215 Gumbel Log-normal Gamma
8 -6490.98 -6487.46 -6597.27 -6592.10 6.76 7.58 5 0.2722 6.3816 Gumbel Log-normal Weibull
16 -6612.77 -6606.21 -6612.80 -6606.24 8.28 8.28 5 55.3695 55.7180 Gaussian Log-normal Gamma
17 -6660.51 -6653.56 -6664.39 -6656.88 8.47 8.76 5 48.5595 59.4736 Gaussian Log-normal Weibull
18 -6686.70 -6678.29 -6686.70 -6678.29 9.20 9.20 5 52.9131 52.9139 Gaussian Log-normal Log-normal
Table 8: Result of simulation 1 with n = 1000. The simulation was repeated 1000 times and the results were averaged.
Model no. TIC AICML CIC AIC2ML p̂*η,TIC p̂*η pη 10⁵·MSE(P̂) (ML) 10⁵·MSE(P̂) (2ML) Copula Margin 1 Margin 2
n = 10000
1 -60528.25 -60528.25 -60529.19 -60529.18 5.00 5.00 5 0.0251 0.0268 Gumbel Weibull Gamma
2 -60848.99 -60847.94 -60929.30 -60928.31 5.52 5.49 5 0.0381 0.0433 Gumbel Weibull Weibull
4 -60869.96 -60869.66 -60966.13 -60965.85 5.15 5.14 5 0.0419 0.0699 Gumbel Gamma Gamma
11 -61624.69 -61623.39 -61655.65 -61654.74 5.65 5.45 5 22.8801 18.3281 Gaussian Weibull Weibull
10 -61696.56 -61695.34 -61710.64 -61709.62 5.61 5.51 5 25.1114 21.7084 Gaussian Weibull Gamma
5 -61427.65 -61426.32 -61734.40 -61732.07 5.67 6.17 5 0.1282 0.4364 Gumbel Gamma Weibull
6 -61969.26 -61967.99 -62223.80 -62222.22 5.63 5.79 5 0.0292 0.0867 Gumbel Gamma Log-normal
13 -62265.03 -62263.36 -62266.99 -62265.33 5.83 5.83 5 26.7925 26.2132 Gaussian Gamma Gamma
14 -62409.19 -62407.28 -62422.48 -62420.44 5.95 6.02 5 22.2643 24.5218 Gaussian Gamma Weibull
3 -62301.64 -62299.98 -62929.03 -62926.39 5.83 6.32 5 0.1043 1.0768 Gumbel Weibull Log-normal
9 -63458.15 -63455.22 -63582.75 -63579.03 6.47 6.86 5 0.2704 0.1838 Gumbel Log-normal Log-normal
12 -64168.16 -64163.79 -64171.83 -64167.38 7.18 7.23 5 37.7631 40.2916 Gaussian Weibull Log-normal
15 -64230.67 -64226.18 -64231.09 -64226.61 7.24 7.24 5 39.6239 39.4905 Gaussian Gamma Log-normal
7 -64177.36 -64173.93 -64854.87 -64851.44 6.71 6.71 5 0.3752 3.2669 Gumbel Log-normal Gamma
8 -64807.87 -64804.26 -65853.68 -65848.39 6.80 7.64 5 0.0245 6.1275 Gumbel Log-normal Weibull
16 -65997.87 -65991.08 -65998.13 -65991.33 8.40 8.40 5 54.9718 55.3255 Gaussian Log-normal Gamma
17 -66475.00 -66467.78 -66507.88 -66500.08 8.61 8.90 5 48.1140 59.1384 Gaussian Log-normal Weibull
18 -66730.29 -66721.58 -66730.29 -66721.58 9.36 9.36 5 52.3035 52.3041 Gaussian Log-normal Log-normal
Table 9: Result of simulation 1 with n = 10000. The simulation was repeated 1000 times and the results were averaged.