Copula information criterion for model selection
with two-stage maximum likelihood estimation
Vinnie Ko∗, Nils Lid Hjort
Department of Mathematics, University of Oslo
PB 1053, Blindern, NO-0316 Oslo, Norway
February 2018
Abstract
In parametric copula setups, where both the margins and copula have parametric forms, two-stage
maximum likelihood estimation, often referred to as inference functions for margins, is used as an at-
tractive alternative to the full maximum likelihood estimation strategy. Exploiting basic results derived
earlier by the present authors, we develop a copula information criterion (CIC) for model selection. The
CIC is defined as $\mathrm{CIC} = 2\ell_n(\tilde\eta) - 2\tilde p^*_\eta$, where $\ell_n(\tilde\eta)$ is the maximized log-likelihood under the two-stage
maximum likelihood estimation scheme, with $\eta$ the full parameter vector for the candidate model in
question, and $\tilde p^*_\eta$ a certain penalization factor. In a nutshell, CIC aims for the model that minimizes
the Kullback–Leibler divergence from the real data generating mechanism. CIC does not assume that
the chosen parametric model captures this true model, unlike what is assumed for AIC. In this sense
CIC is analogous to the Takeuchi Information Criterion (TIC), which is defined for the full maximum
likelihood. If we make an additional assumption that a candidate model is correctly specified, then CIC
for that model simplifies to AIC. Further, since both CIC and TIC are estimating the same part of the
Kullback–Leibler divergence, they are compatible, in the sense that they can be used to compare the
performance of full maximum-likelihood and two-stage maximum likelihood for a given model. Addi-
tionally, we show that CIC can easily be extended to the conditional copula setup where covariates are
parametrically linked to the copula model.
As a numerical illustration, we perform a simulation and find that CIC outperforms AIC in terms of
prediction performance from the selected models. However, as sample size grows, the difference between
CIC and AIC becomes minimal because the log-likelihood part outgrows the bias correction part. Further,
we learn from the simulation that $\tilde p^*_\eta$, the bias correction term of CIC, has a strong positive relationship
with the prediction error of the model. Thus, a model with poor prediction performance is penalized more by CIC.
Keywords: copula, Akaike information criterion, copula information criterion, model robust, two-stage maximum likelihood, inference functions for margins
∗Corresponding author. E-mail addresses: [email protected] (V. Ko), [email protected] (N.L. Hjort)
1 Introduction and copula models
One of the main practical issues in copula modeling is model selection. In the full parametric setup, where
both the copula and margins are assumed to have a parametric form, one often has multiple candidates for
both the copula and margins. As the dimension of the model increases, a list of possible combinations of
margins and copula grows rapidly. Hence, there is a need for a model selection criterion that can evaluate
each model systematically according to certain philosophy or criteria and assign a score to each model. In
the end, one would choose the model with the best score.
Throughout this paper, we consider the full parametric setup. In this setup, one can simultaneously
estimate all parameters of the model (i.e. both copula parameters and margin parameters) by using maximum
likelihood (ML) estimation. In this ML estimation framework, one can for instance use AICML (Akaike, 1974)
or TIC, also known as model-robust AIC (Takeuchi, 1976), as the model selection criterion and select the
model with the best score. (Note that we denote the AIC under ML estimation as AICML to distinguish it
from the two-stage ML based AIC2ML, which we will derive in Section 2.3.) However, when the dimension
of the copula model gets high, the number of parameters increases quickly and the ML estimation is not
always feasible in terms of speed and numerical stability. Two-stage maximum likelihood (two-stage ML)
estimation, also often referred to as inference functions for margins (IFM), is a popular alternative estimation
strategy that is designed to overcome these drawbacks of the ML estimation. In stage 1 of the two-stage ML
estimation, the parameter vectors of each marginal distribution are estimated separately by ML. In stage
2, the estimates from stage 1 are plugged into the log-likelihood of the model. Then, the parameters of the
copula, which are now the only unknown parameters, are estimated by using ML estimation again. One of
the advantages of this multi-stage approach is that it is computation-wise much faster than estimating all
parameters simultaneously, because it does not have to search for the global maximum in high-dimensional
space. A drawback of the two-stage ML estimation method, however, is that we cannot use the classical
results based on ML estimation, which include model selection criteria such as TIC and BIC.
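The two-stage recipe just described can be sketched in a few lines of code. The following is our own minimal illustration, not the authors' implementation: it assumes a Clayton copula with exponential margins (both hypothetical choices, made because their formulae are short), fits each margin by ML in stage 1, and maximizes the copula log-likelihood over $\theta$ by a crude grid search in stage 2.

```python
import math
import random

def rclayton_exp(n, theta, lam1, lam2, seed=1):
    """Sample (y1, y2) from a Clayton copula with exponential margins."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        u, w = rng.random(), rng.random()
        # Conditional inversion method for the Clayton copula.
        v = (u ** (-theta) * (w ** (-theta / (1.0 + theta)) - 1.0) + 1.0) ** (-1.0 / theta)
        # Inverse exponential CDFs map (u, v) to the data scale.
        data.append((-math.log(1.0 - u) / lam1, -math.log(1.0 - v) / lam2))
    return data

def log_clayton(u, v, theta):
    """Clayton copula log-density."""
    return (math.log(1.0 + theta) - (theta + 1.0) * (math.log(u) + math.log(v))
            - (2.0 + 1.0 / theta) * math.log(u ** (-theta) + v ** (-theta) - 1.0))

def two_stage_ml(data):
    # Stage 1: ML for each exponential margin (closed form: rate = 1 / mean).
    lam1 = len(data) / sum(y1 for y1, _ in data)
    lam2 = len(data) / sum(y2 for _, y2 in data)
    # Plug the fitted marginal CDFs in, turning observations into uniforms.
    uv = [(1.0 - math.exp(-lam1 * y1), 1.0 - math.exp(-lam2 * y2)) for y1, y2 in data]
    # Stage 2: maximize the copula log-likelihood over theta (grid search for brevity).
    grid = [0.05 * k for k in range(1, 161)]  # theta in (0, 8]
    theta = max(grid, key=lambda t: sum(log_clayton(u, v, t) for u, v in uv))
    return lam1, lam2, theta

data = rclayton_exp(2000, theta=2.0, lam1=1.0, lam2=0.5)
lam1, lam2, theta = two_stage_ml(data)
```

A real implementation would use a numerical optimizer instead of a grid, but the division of labor between the two stages is the same: the high-dimensional search is replaced by several small, separate ones.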
In practice, different sorts of goodness-of-fit testing are often used as substitutes, to choose the best
model (Genest & Favre, 2007). Another often used model selection strategy for the two-stage ML is that
one first evaluates candidates of each marginal distribution with AICML and consequently chooses the best
distribution for each margin. Once the margins are chosen, one fits different copulae and evaluates the copula
part with AICML. However, this piecewise model evaluation cannot evaluate the model as a whole.
In this paper, we develop the copula information criterion (CIC) for two-stage ML estimation, which has
the form
$$\mathrm{CIC} = 2\ell_n(\tilde\eta) - 2\tilde p^*_\eta.$$
Here $\ell_n(\tilde\eta)$ is the maximized log-likelihood under the two-stage ML estimation method, in terms of the full
parameter vector $\eta$ of the model in question, and $\tilde p^*_\eta$ is a suitable penalization factor, worked out in Section 2.2.
The main advantage of CIC is that it can evaluate a parametric copula with parametric margins as a whole.
CIC is also a model-robust model selection criterion which means that it does not assume that the candidate
model contains the true model. As the overlap of the name already suggests, our CIC is analogous to CIC
from Grønneberg & Hjort (2014), which is designed for copulae estimated with pseudo maximum likelihood
(PML). In PML framework, margins are estimated empirically, while two-stage ML assumes parametric
forms of margins.
Our technical setting, identical to Ko & Hjort (2018), is as follows. Let $(Y_1,\dots,Y_d)^{\rm T}$ be a $d$-variate
continuous stochastic variable originating from a joint density $g(y_1,\dots,y_d)$, and let $y_i = (y_{i,1},\dots,y_{i,d})^{\rm T}$,
for $i = 1,\dots,n$, be independent observations of this variable. The true joint distribution $g$ is typically
unknown. Let $f(y_1,\dots,y_d,\eta)$ be our parametric approximation of $g$, with the parameter vector $\eta$ belonging
to some connected subset of the appropriate Euclidean space. In addition, $G$ and $F(\cdot,\eta)$ denote the cumulative
distribution functions corresponding to $g$ and $f(\cdot,\eta)$, respectively. Here $G_j(y_j)$ and $F_j(y_j,\alpha_j)$ denote the $j$-th
marginal distribution functions corresponding to $G$ and $F(\cdot,\eta)$, respectively, with $\alpha_j$ the parameter vector
belonging to margin component $j$.
According to Sklar's theorem (Sklar, 1959), there always exists a copula $C(u_1,\dots,u_d,\theta)$ that satisfies
$$F(y_1,\dots,y_d,\eta) = C\bigl(F_1(y_1,\alpha_1),\dots,F_d(y_d,\alpha_d),\theta\bigr),$$
where the full parameter vector $\eta$ is now blocked as
$$\eta = (\alpha^{\rm T},\theta^{\rm T})^{\rm T} = (\alpha_1^{\rm T},\dots,\alpha_d^{\rm T},\theta^{\rm T})^{\rm T}.$$
By assuming the regularity conditions from Ko & Hjort (2018), $C(\cdot,\theta)$ can be differentiated, giving
$$f(y_1,\dots,y_d,\eta) = c\bigl(F_1(y_1,\alpha_1),\dots,F_d(y_d,\alpha_d),\theta\bigr)\,\prod_{j=1}^{d} f_j(y_j,\alpha_j),$$
where $c(u_1,\dots,u_d,\theta) = \partial^d C(u_1,\dots,u_d,\theta)/\partial u_1\cdots\partial u_d$ and $f_j(y_j,\alpha_j) = \partial F_j(y_j,\alpha_j)/\partial y_j$. For further details of
copula modeling, see Joe (1997) and Nelsen (2006). Analogously, the true density $g$ can be decomposed
into marginal densities and the copula density,
$$g(y_1,\dots,y_d) = c_0\bigl(G_1(y_1),\dots,G_d(y_d)\bigr)\,\prod_{j=1}^{d} g_j(y_j),$$
with $c_0(\cdot)$ the true copula.
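The factorization of $f$ into a copula density times marginal densities can be verified numerically. The sketch below is our own illustration, with the bivariate standard normal as a hypothetical model choice: its copula is the Gaussian copula, which has a closed form, so the product $c(F_1, F_2)\,f_1\,f_2$ can be checked directly against the joint density.

```python
import math
from statistics import NormalDist

STD = NormalDist()  # standard normal: provides cdf, inv_cdf and pdf

def biv_normal_pdf(x, y, rho):
    """Bivariate standard normal density with correlation rho."""
    q = (x * x - 2.0 * rho * x * y + y * y) / (1.0 - rho * rho)
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(1.0 - rho * rho))

def gauss_copula_density(u, v, rho):
    """Gaussian copula density c(u, v, rho) on the uniform scale."""
    x, y = STD.inv_cdf(u), STD.inv_cdf(v)
    q = (rho * rho * (x * x + y * y) - 2.0 * rho * x * y) / (2.0 * (1.0 - rho * rho))
    return math.exp(-q) / math.sqrt(1.0 - rho * rho)

rho = 0.6
x, y = 0.7, -0.3
u, v = STD.cdf(x), STD.cdf(y)
lhs = biv_normal_pdf(x, y, rho)                                  # f(y1, y2, eta)
rhs = gauss_copula_density(u, v, rho) * STD.pdf(x) * STD.pdf(y)  # c(F1, F2) * f1 * f2
```

Here `lhs` and `rhs` agree up to floating-point error, which is exactly Sklar's decomposition for this model.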
The further structure of this paper is as follows. In Section 2.1, we briefly explain Kullback–Leibler
divergence and its relationship to TIC and AICML. In Section 2.2, we derive and define our copula information
criterion. In Section 2.3, we prove that the AIC2ML formula holds under the two-stage ML estimation. In
Section 2.4, we summarize the relationship between TIC, CIC, AICML and AIC2ML. In Section 2.5, we
illustrate what CIC looks like in the two-dimensional setting and show how CIC can easily be extended
to the conditional copula setting. In Section 3, we study the numerical behavior of those model selection
criteria. In our final Section 4, we offer a few concluding remarks and suggestions for future research.
2 The copula information criterion for two-stage maximum likelihood estimation
2.1 Kullback–Leibler divergence
The Kullback–Leibler (KL) divergence from $g$ to $f$ measures how the density $f$ diverges from $g$ (Kullback
& Leibler, 1951) and is defined as
$$\mathrm{KL}(g,f) = \int g(y)\log\frac{g(y)}{f(y)}\,{\rm d}y = \int g(y)\log g(y)\,{\rm d}y - \int g(y)\log f(y)\,{\rm d}y.$$
It is well known that the ML estimator aims for parameter values that minimize the Kullback–Leibler
divergence (Akaike, 1998).
Consider a case where one has competing models for certain data and the parameters are estimated by
ML. An arbitrary candidate model with ML parameter estimates can be denoted as $f(y,\hat\eta)$. (Throughout
this paper, we use a hat, as in $\hat\eta$, to indicate that a quantity is estimated with ML, and a tilde, as in $\tilde\eta$, to
indicate that a quantity is estimated with two-stage ML.) Since $g(y)$ is the same across all candidate models,
minimizing the KL divergence is equivalent to maximizing $Q(\hat\eta) = \int g(y)\log f(y,\hat\eta)\,{\rm d}y$. This quantity is,
however, not directly observable, since $g(y)$ is unknown. As an alternative, one may use the empirical
equivalent of $Q(\hat\eta)$:
$$\bar Q(\hat\eta) = \frac{1}{n}\ell(\hat\eta) = \frac{1}{n}\sum_{i=1}^{n}\Bigl\{\log f_1(y_{i,1},\hat\alpha_1) + \cdots + \log f_d(y_{i,d},\hat\alpha_d) + \log c\bigl(F_1(y_{i,1},\hat\alpha_1),\dots,F_d(y_{i,d},\hat\alpha_d),\hat\theta\bigr)\Bigr\}.$$
Yet, this estimator of $Q(\hat\eta)$ is biased. By identifying and subtracting the bias, we obtain the
unbiased estimator of $Q(\hat\eta)$:
$$\hat Q^*(\hat\eta) = \frac{1}{n}\ell(\hat\eta) - \frac{1}{n}\,{\rm tr}\bigl(\hat I_\eta^{-1}\hat K_\eta\bigr),$$
where $\hat I_\eta$ is the observed information matrix and $\hat K_\eta$ is the estimated covariance matrix of the score vector.
The TIC (Takeuchi, 1976) aims for the model that maximizes $\hat Q^*(\hat\eta)$ and is defined as
$$\mathrm{TIC} = 2\ell(\hat\eta) - 2\hat p^*_{\eta,\mathrm{TIC}}, \qquad (1)$$
where $\hat p^*_{\eta,\mathrm{TIC}} = {\rm tr}\bigl(\hat I_\eta^{-1}\hat K_\eta\bigr)$. This shows that TIC is essentially a scaled version of $\hat Q^*(\hat\eta)$.
When one boldly assumes that the candidate model is correct, i.e. that it contains the true data
generating mechanism, TIC simplifies to AICML, possibly the best known model selection criterion in
statistics, which is defined as
$$\mathrm{AIC}_{\mathrm{ML}} = 2\ell(\hat\eta) - 2p_\eta, \qquad (2)$$
where $p_\eta$ is the length of the parameter vector $\eta$. Note that the formulae of TIC and AICML are only valid
if the parameters are estimated by the ML estimator. For more details about TIC and AICML, see Chapter
2 of Claeskens & Hjort (2008).
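The TIC penalty ${\rm tr}(\hat I_\eta^{-1}\hat K_\eta)$ can be computed by hand for a one-parameter model. The sketch below is our own illustration, not from the paper: for a hypothetically chosen exponential model, $\log f = \log\lambda - \lambda y$, the score is $1/\lambda - y$ and the observed information is $1/\hat\lambda^2$, so the plug-in penalty reduces to $\hat\lambda^2$ times the empirical variance. When the model is correct this is close to $1 = \dim(\eta)$, the AIC penalty; under misspecification it drifts away from 1.

```python
import random

def tic_penalty_exponential(y):
    """Plug-in TIC penalty tr(I^-1 K) for an ML-fitted exponential model.

    MLE: lam = 1 / mean(y).  Observed information: I = 1 / lam^2.
    K = average squared score, with score(y_i) = 1/lam - y_i.
    """
    n = len(y)
    lam = n / sum(y)
    k_hat = sum((1.0 / lam - yi) ** 2 for yi in y) / n   # empirical variance of y
    i_hat = 1.0 / lam ** 2
    return k_hat / i_hat                                  # tr(I^-1 K); scalar here

rng = random.Random(7)
exp_data = [rng.expovariate(1.0) for _ in range(20000)]
# Gamma(shape = 2) data, simulated as a sum of two independent exponentials.
gamma_data = [rng.expovariate(1.0) + rng.expovariate(1.0) for _ in range(20000)]

p_correct = tic_penalty_exponential(exp_data)    # close to 1, the AIC penalty
p_wrong = tic_penalty_exponential(gamma_data)    # close to var/mean^2 = 1/2 here
```

The exponential model fitted to gamma data is misspecified, and the penalty moves toward $1/2$ instead of $1$, illustrating how TIC adapts where AIC cannot.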
2.2 Derivation of the copula information criterion
When the copula model is estimated with the two-stage ML, the bias correction term of TIC, i.e. $\hat p^*_{\eta,\mathrm{TIC}}$,
is not valid. We derive the copula information criterion (CIC), which is analogous to TIC and is made for
copula models estimated with the two-stage ML.
When the copula model is estimated with the two-stage ML, the non-constant part of the KL divergence
is
$$Q(\tilde\eta) = \int g(y)\Bigl\{\log f_1(y_1,\tilde\alpha_1) + \cdots + \log f_d(y_d,\tilde\alpha_d) + \log c\bigl(F_1(y_1,\tilde\alpha_1),\dots,F_d(y_d,\tilde\alpha_d),\tilde\theta\bigr)\Bigr\}\,{\rm d}y.$$
The empirical equivalent is
$$\bar Q(\tilde\eta) = \frac{1}{n}\sum_{i=1}^{n}\Bigl\{\log f_1(y_{i,1},\tilde\alpha_1) + \cdots + \log f_d(y_{i,d},\tilde\alpha_d) + \log c\bigl(F_1(y_{i,1},\tilde\alpha_1),\dots,F_d(y_{i,d},\tilde\alpha_d),\tilde\theta\bigr)\Bigr\}.$$
Now we check the bias of $\bar Q(\tilde\eta)$:
$$\begin{aligned}
{\rm E}_G\bigl[\bar Q(\tilde\eta)\bigr] - Q(\tilde\eta)
&= {\rm E}_{G_1}\Bigl[\frac{1}{n}\sum_{i=1}^{n}\log f_1(y_{i,1},\tilde\alpha_1)\Bigr] - \int g(y_1)\log f_1(y_1,\tilde\alpha_1)\,{\rm d}y_1 \\
&\quad + \cdots \\
&\quad + {\rm E}_{G_d}\Bigl[\frac{1}{n}\sum_{i=1}^{n}\log f_d(y_{i,d},\tilde\alpha_d)\Bigr] - \int g(y_d)\log f_d(y_d,\tilde\alpha_d)\,{\rm d}y_d \\
&\quad + {\rm E}_{G}\Bigl[\frac{1}{n}\sum_{i=1}^{n}\log c\bigl(F_1(y_{i,1},\tilde\alpha_1),\dots,F_d(y_{i,d},\tilde\alpha_d),\tilde\theta\bigr)\Bigr] \\
&\quad - \int g(y)\log c\bigl(F_1(y_1,\tilde\alpha_1),\dots,F_d(y_d,\tilde\alpha_d),\tilde\theta\bigr)\,{\rm d}y.
\end{aligned}$$
Since the parameters of the marginals from stage 1 (the $\tilde\alpha_j$s) are obtained with ML estimation, we can directly use
the results from the derivation of TIC and obtain
$${\rm E}_{G_j}\Bigl[\frac{1}{n}\sum_{i=1}^{n}\log f_j(y_{i,j},\tilde\alpha_j)\Bigr] - \int g(y_j)\log f_j(y_j,\tilde\alpha_j)\,{\rm d}y_j = \frac{1}{n}\,{\rm tr}\bigl(I_{\alpha_j}^{-1}K_{\alpha_j}\bigr) + o(n^{-1}) = \frac{1}{n}\,p^*_{\alpha_j} + o(n^{-1}),$$
with $I_{\alpha_j}$ and $K_{\alpha_j}$ as defined in Lemma 1 of Ko & Hjort (2018). In a nutshell, $I_{\alpha_j}$ is the Fisher information
of the $j$-th margin and $K_{\alpha_j}$ is the covariance matrix of the score vector that belongs to the $j$-th margin.
Further, let
$$Q_c(\tilde\eta) = \int g(y)\log c\bigl(F_1(y_1,\tilde\alpha_1),\dots,F_d(y_d,\tilde\alpha_d),\tilde\theta\bigr)\,{\rm d}y$$
and
$$\bar Q_c(\tilde\eta) = \frac{1}{n}\sum_{i=1}^{n}\log c\bigl(F_1(y_{i,1},\tilde\alpha_1),\dots,F_d(y_{i,d},\tilde\alpha_d),\tilde\theta\bigr).$$
So, we can write
$${\rm E}_G\bigl[\bar Q(\tilde\eta)\bigr] - Q(\tilde\eta) = \frac{1}{n}\sum_{j=1}^{d}{\rm tr}\bigl(I_{\alpha_j}^{-1}K_{\alpha_j}\bigr) + {\rm E}_G\bigl[\bar Q_c(\tilde\eta)\bigr] - Q_c(\tilde\eta) + o(n^{-1}).$$
Now, ${\rm E}_G[\bar Q_c(\tilde\eta)] - Q_c(\tilde\eta)$ is the only element that remains to be evaluated. Let
$$\begin{aligned}
Q_c(\eta_0) &= \int g(y)\log c\bigl(F_1(y_1,\alpha_{0,1}),\dots,F_d(y_d,\alpha_{0,d}),\theta_0\bigr)\,{\rm d}y, \\
Z_i &= \log c\bigl(F_1(y_{i,1},\alpha_{0,1}),\dots,F_d(y_{i,d},\alpha_{0,d}),\theta_0\bigr) - Q_c(\eta_0), \\
A_\eta &= \sqrt{n}\,(\tilde\eta - \eta_0) = \sqrt{n}\,I_\eta^{-1}\begin{pmatrix} U_{n,\alpha}(\alpha_0) \\ U_{n,\theta}(\alpha_0,\theta_0) - I_{\alpha,\theta}^{\rm T}\,I_\alpha^{-1}\,U_{n,\alpha}(\alpha_0) \end{pmatrix} + \begin{pmatrix} o_p(1) \\ o_p(1) \end{pmatrix},
\end{aligned}$$
which stems from Proposition 1 in Ko & Hjort (2018), and furthermore
$$\begin{aligned}
U_{n,\eta}(\eta) &= \frac{1}{n}\sum_{i=1}^{n}U_\eta(y_i,\eta) = \frac{1}{n}\sum_{i=1}^{n}\frac{\partial \log c\bigl(F_1(y_{i,1},\alpha_1),\dots,F_d(y_{i,d},\alpha_d),\theta\bigr)}{\partial\eta}, \\
H_{n,\eta}(\eta) &= \frac{1}{n}\sum_{i=1}^{n}H_\eta(y_i,\eta) = \frac{1}{n}\sum_{i=1}^{n}\frac{\partial^2 \log c\bigl(F_1(y_{i,1},\alpha_1),\dots,F_d(y_{i,d},\alpha_d),\theta\bigr)}{\partial\eta\,\partial\eta^{\rm T}}, \\
I^*_\eta &= -{\rm E}_G\bigl[H_\eta(y,\eta_0)\bigr] = -\int g(y)\,H_\eta(y,\eta_0)\,{\rm d}y,
\end{aligned}$$
which, incidentally, should not be confused with $I_\eta$ in Proposition 1 of Ko & Hjort (2018). Then we have
${\rm E}[\bar Z_n] = \frac{1}{n}\sum_{i=1}^{n}{\rm E}[Z_i] = 0$, along with
$$\begin{aligned}
\bar Q_c(\tilde\eta) &= \frac{1}{n}\sum_{i=1}^{n}\log c(y_i,\tilde\eta) \\
&= \frac{1}{n}\sum_{i=1}^{n}\Bigl\{\log c(y_i,\eta_0) - Q_c(\eta_0) + Q_c(\eta_0) + (\tilde\eta-\eta_0)^{\rm T}U_\eta(y_i,\eta_0) + \tfrac{1}{2}(\tilde\eta-\eta_0)^{\rm T}H_\eta(y_i,\eta_0)(\tilde\eta-\eta_0)\Bigr\} + o_p(n^{-1}) \\
&= \frac{1}{n}\sum_{i=1}^{n}Z_i + Q_c(\eta_0) + (\tilde\eta-\eta_0)^{\rm T}U_{n,\eta}(\eta_0) + \tfrac{1}{2}(\tilde\eta-\eta_0)^{\rm T}H_{n,\eta}(\eta_0)(\tilde\eta-\eta_0) + o_p(n^{-1}) \\
&= Q_c(\eta_0) + \bar Z_n + \tfrac{1}{\sqrt{n}}A_\eta^{\rm T}U_{n,\eta}(\eta_0) + \tfrac{1}{2n}A_\eta^{\rm T}H_{n,\eta}(\eta_0)A_\eta + o_p(n^{-1}),
\end{aligned}$$
$$\begin{aligned}
Q_c(\tilde\eta) &= \int g(y)\log c\bigl(F_1(y_1,\tilde\alpha_1),\dots,F_d(y_d,\tilde\alpha_d),\tilde\theta\bigr)\,{\rm d}y \\
&= \int g(y)\Bigl\{\log c(y,\eta_0) + (\tilde\eta-\eta_0)^{\rm T}U_\eta(y,\eta_0) + \tfrac{1}{2}(\tilde\eta-\eta_0)^{\rm T}H_\eta(y,\eta_0)(\tilde\eta-\eta_0)\Bigr\}\,{\rm d}y + o_p(n^{-1}) \\
&= \int g(y)\log c(y,\eta_0)\,{\rm d}y + (\tilde\eta-\eta_0)^{\rm T}\!\int g(y)\,U_\eta(y,\eta_0)\,{\rm d}y + \tfrac{1}{2}(\tilde\eta-\eta_0)^{\rm T}\!\int g(y)\,H_\eta(y,\eta_0)\,{\rm d}y\;(\tilde\eta-\eta_0) + o_p(n^{-1}) \\
&= Q_c(\eta_0) + (\tilde\eta-\eta_0)^{\rm T}\cdot 0 + \tfrac{1}{2n}\,\sqrt{n}(\tilde\eta-\eta_0)^{\rm T}\!\int g(y)\,H_\eta(y,\eta_0)\,{\rm d}y\;\sqrt{n}(\tilde\eta-\eta_0) + o_p(n^{-1}) \\
&= Q_c(\eta_0) - \tfrac{1}{2n}A_\eta^{\rm T}I^*_\eta A_\eta + o_p(n^{-1})
\end{aligned}$$
and
$$n\bigl\{\bar Q_c(\tilde\eta) - Q_c(\tilde\eta)\bigr\} = n\bar Z_n + \sqrt{n}\,A_\eta^{\rm T}U_{n,\eta}(\eta_0) + \tfrac{1}{2}A_\eta^{\rm T}H_{n,\eta}(\eta_0)A_\eta + \tfrac{1}{2}A_\eta^{\rm T}I^*_\eta A_\eta + o_p(1).$$
Further, note that
$$U_{n,\eta}(\eta_0) = \frac{1}{n}\sum_{i=1}^{n}\frac{\partial \log c\bigl(F_1(y_{i,1},\alpha_{0,1}),\dots,F_d(y_{i,d},\alpha_{0,d}),\theta_0\bigr)}{\partial\eta}$$
has the following convergence, by the central limit theorem:
$$\sqrt{n}\,U_{n,\eta}(\eta_0) = \sqrt{n}\,\bigl\{U_{n,\eta}(\eta_0) - {\rm E}[U_\eta(Y,\eta_0)]\bigr\} \;\overset{d}{\to}\; \Lambda^*_\eta \sim N\bigl(0, K^*_\eta\bigr),$$
where $K^*_\eta = {\rm Var}\bigl(U_\eta(y,\eta_0)\bigr) = {\rm E}\bigl[U_\eta(y,\eta_0)U_\eta(y,\eta_0)^{\rm T}\bigr]$. (Here $\Lambda^*_\eta$ and $K^*_\eta$ should not be confused with $\Lambda_\eta$
and $K_\eta$ in Proposition 1 of Ko & Hjort (2018).)
Now we evaluate ${\rm E}_G[n\{\bar Q_c(\tilde\eta) - Q_c(\tilde\eta)\}]$:
$$\begin{aligned}
{\rm E}_G\bigl[n\bigl\{\bar Q_c(\tilde\eta) - Q_c(\tilde\eta)\bigr\}\bigr]
&= {\rm E}_G\Bigl[n\bar Z_n + \sqrt{n}\,A_\eta^{\rm T}U_{n,\eta}(\eta_0) + \tfrac{1}{2}A_\eta^{\rm T}H_{n,\eta}(\eta_0)A_\eta + \tfrac{1}{2}A_\eta^{\rm T}I^*_\eta A_\eta\Bigr] + o(1) \\
&\to {\rm E}_G\bigl[(I_\eta^{-1}L\Lambda_\eta)^{\rm T}\Lambda^*_\eta\bigr] \\
&= {\rm E}_G\bigl[{\rm tr}\bigl(I_\eta^{-1}L\Lambda_\eta(\Lambda^*_\eta)^{\rm T}\bigr)\bigr] \\
&= {\rm tr}\bigl(I_\eta^{-1}L\,{\rm E}_G\bigl[\Lambda_\eta(\Lambda^*_\eta)^{\rm T}\bigr]\bigr) \\
&= {\rm tr}\bigl(I_\eta^{-1}L K^\circ_\eta\bigr) = p^*_\theta,
\end{aligned}$$
where
$$K^\circ_\eta = {\rm E}_G\bigl[\Lambda_\eta(\Lambda^*_\eta)^{\rm T}\bigr] = {\rm E}_G\begin{pmatrix} \Lambda_\alpha(\Lambda^*_\alpha)^{\rm T} & \Lambda_\alpha\Lambda_\theta^{\rm T} \\ \Lambda_\theta(\Lambda^*_\alpha)^{\rm T} & \Lambda_\theta\Lambda_\theta^{\rm T} \end{pmatrix} = \begin{pmatrix} K^\circ_\alpha & K_{\alpha,\theta} \\ (K^*_{\alpha,\theta})^{\rm T} & K_\theta \end{pmatrix}.$$
It is practical to note that
$$\begin{aligned}
{\rm tr}\bigl(I_\eta^{-1}L K^\circ_\eta\bigr)
&= {\rm tr}\left\{\begin{pmatrix} I_\alpha^{-1} & 0 \\ 0 & I_\theta^{-1} \end{pmatrix}\begin{pmatrix} I & 0 \\ -I_{\alpha,\theta}^{\rm T}I_\alpha^{-1} & I \end{pmatrix}\begin{pmatrix} K^\circ_\alpha & K_{\alpha,\theta} \\ (K^*_{\alpha,\theta})^{\rm T} & K_\theta \end{pmatrix}\right\} \\
&= {\rm tr}\begin{pmatrix} I_\alpha^{-1}K^\circ_\alpha & I_\alpha^{-1}K_{\alpha,\theta} \\ -I_\theta^{-1}I_{\alpha,\theta}^{\rm T}I_\alpha^{-1}K^\circ_\alpha + I_\theta^{-1}(K^*_{\alpha,\theta})^{\rm T} & -I_\theta^{-1}I_{\alpha,\theta}^{\rm T}I_\alpha^{-1}K_{\alpha,\theta} + I_\theta^{-1}K_\theta \end{pmatrix} \\
&= {\rm tr}\begin{pmatrix} I_\alpha^{-1}K^\circ_\alpha & 0 \\ 0 & -I_\theta^{-1}I_{\alpha,\theta}^{\rm T}I_\alpha^{-1}K_{\alpha,\theta} + I_\theta^{-1}K_\theta \end{pmatrix}.
\end{aligned}$$
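This block-trace identity is pure matrix algebra and can be sanity-checked numerically. The sketch below is our own check with arbitrary made-up blocks ($\alpha$ of dimension 2, $\theta$ of dimension 1, all numbers invented for the test); note in particular that the lower-left block $(K^*_{\alpha,\theta})^{\rm T}$ enters the matrix product but drops out of the trace.

```python
def matmul(a, b):
    """Multiply matrices represented as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def trace(a):
    return sum(a[i][i] for i in range(len(a)))

def inv2(a):
    """Inverse of a 2x2 matrix."""
    (p, q), (r, s) = a
    det = p * s - q * r
    return [[s / det, -q / det], [-r / det, p / det]]

# Arbitrary test blocks: dim(alpha) = 2, dim(theta) = 1.
I_a = [[2.0, 0.3], [0.3, 1.5]]   # I_alpha
I_t = 1.2                        # I_theta (scalar)
I_at = [[0.4], [0.7]]            # I_{alpha,theta}, 2x1
K_a = [[1.1, 0.2], [0.2, 0.9]]   # K_alpha^circ
K_at = [[0.5], [0.1]]            # K_{alpha,theta}, 2x1
K_star = [[0.8, -0.6]]           # (K*_{alpha,theta})^T, 1x2
K_t = 0.7                        # K_theta

I_a_inv = inv2(I_a)
# Assemble the full 3x3 versions of I_eta^{-1}, L and K_eta^circ.
I_inv = [[I_a_inv[0][0], I_a_inv[0][1], 0.0],
         [I_a_inv[1][0], I_a_inv[1][1], 0.0],
         [0.0, 0.0, 1.0 / I_t]]
row = [-(I_at[0][0] * I_a_inv[0][0] + I_at[1][0] * I_a_inv[1][0]),
       -(I_at[0][0] * I_a_inv[0][1] + I_at[1][0] * I_a_inv[1][1])]
L = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [row[0], row[1], 1.0]]
K = [[K_a[0][0], K_a[0][1], K_at[0][0]],
     [K_a[1][0], K_a[1][1], K_at[1][0]],
     [K_star[0][0], K_star[0][1], K_t]]

full = trace(matmul(matmul(I_inv, L), K))
# Reduced form: tr(I_a^{-1} K_a^circ) + (K_theta - I_at^T I_a^{-1} K_at) / I_theta.
part_a = trace(matmul(I_a_inv, K_a))
quad = sum(I_at[i][0] * sum(I_a_inv[i][j] * K_at[j][0] for j in range(2)) for i in range(2))
reduced = part_a + (K_t - quad) / I_t
```

Here `full` and `reduced` coincide up to floating-point error, confirming that only the two diagonal blocks contribute to $p^*_\theta$.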
Consistent estimators for $I_{\alpha_j}^{-1}$, $K_{\alpha_j}$, $I_\eta^{-1}$, $L$ and $K^\circ_\eta$ can be obtained by using plug-in sample averages.
The consequent estimators are denoted $\tilde I_{\alpha_j}^{-1}$, $\tilde K_{\alpha_j}$, $\tilde I_\eta^{-1}$, $\tilde L$ and $\tilde K^\circ_\eta$. For regularity conditions that ensure
convergence in probability, see Jullum & Hjort (2017).
Thus, the unbiased estimator of $Q(\tilde\eta)$ is
$$\tilde Q^*(\tilde\eta) = \frac{1}{n}\ell_n(\tilde\eta) - \frac{1}{n}\Bigl\{\sum_{j=1}^{d}{\rm tr}\bigl(\tilde I_{\alpha_j}^{-1}\tilde K_{\alpha_j}\bigr) + {\rm tr}\bigl(\tilde I_\eta^{-1}\tilde L\tilde K^\circ_\eta\bigr)\Bigr\}.$$
By defining $\tilde p^*_{\alpha_j} = {\rm tr}(\tilde I_{\alpha_j}^{-1}\tilde K_{\alpha_j})$, $\tilde p^*_\theta = {\rm tr}(\tilde I_\eta^{-1}\tilde L\tilde K^\circ_\eta)$ and $\tilde p^*_\eta = \sum_{j=1}^{d}\tilde p^*_{\alpha_j} + \tilde p^*_\theta$, and scaling $\tilde Q^*(\tilde\eta)$, we can finally
define the copula information criterion as
$$\mathrm{CIC} = 2\ell_n(\tilde\eta) - 2\tilde p^*_\eta. \qquad (3)$$
2.3 AIC for two-stage maximum likelihood estimator
The CIC, derived in Section 2.2, is a model robust model selection criterion. This means that the CIC does
not assume that the parametric model includes the true model that generated the data. In this section we show
that, if we do make such a true model assumption, the CIC simplifies to the AIC2ML. To our knowledge,
this is the first time that the validity of the AIC2ML formula is proven for the two-stage ML estimator.
Lemma 1. Under the assumption that the margins and copula are correctly specified, it holds that $K^\circ_\alpha = I_\alpha - K_\alpha$.
Proof. Assume that the candidate model worked with contains the true data generating mechanism, i.e. that
$f = g$ at the relevant parameter point. Then
$$\begin{aligned}
0 &= \frac{\partial\,{\rm E}_G\bigl[U_\alpha(y,\alpha_0)\bigr]^{\rm T}}{\partial\alpha_0}
= \frac{\partial}{\partial\alpha_0}\int f\,\frac{\partial\sum_{j=1}^{d}\log f_j}{\partial\alpha_0^{\rm T}}\,{\rm d}y \\
&= \int \frac{\partial f}{\partial\alpha_0}\,\frac{\partial\sum_{j=1}^{d}\log f_j}{\partial\alpha_0^{\rm T}}\,{\rm d}y + \int f\,\frac{\partial^2\sum_{j=1}^{d}\log f_j}{\partial\alpha_0\,\partial\alpha_0^{\rm T}}\,{\rm d}y \\
&= \int f\,\frac{\partial\log f}{\partial\alpha_0}\,\frac{\partial\sum_{j=1}^{d}\log f_j}{\partial\alpha_0^{\rm T}}\,{\rm d}y + \int f\,\frac{\partial^2\sum_{j=1}^{d}\log f_j}{\partial\alpha_0\,\partial\alpha_0^{\rm T}}\,{\rm d}y \\
&= \int f\Bigl(\frac{\partial\sum_{j=1}^{d}\log f_j}{\partial\alpha_0} + \frac{\partial\log c}{\partial\alpha_0}\Bigr)\frac{\partial\sum_{j=1}^{d}\log f_j}{\partial\alpha_0^{\rm T}}\,{\rm d}y + \int f\,\frac{\partial^2\sum_{j=1}^{d}\log f_j}{\partial\alpha_0\,\partial\alpha_0^{\rm T}}\,{\rm d}y \\
&= \int f\,\frac{\partial\sum_{j=1}^{d}\log f_j}{\partial\alpha_0}\,\frac{\partial\sum_{j=1}^{d}\log f_j}{\partial\alpha_0^{\rm T}}\,{\rm d}y + \int f\,\frac{\partial\log c}{\partial\alpha_0}\,\frac{\partial\sum_{j=1}^{d}\log f_j}{\partial\alpha_0^{\rm T}}\,{\rm d}y + \int f\,\frac{\partial^2\sum_{j=1}^{d}\log f_j}{\partial\alpha_0\,\partial\alpha_0^{\rm T}}\,{\rm d}y \\
&= {\rm E}_G\bigl[U_\alpha(y,\alpha_0)U_\alpha(y,\alpha_0)^{\rm T}\bigr] + {\rm E}_G\bigl[U^*_\alpha(y,\alpha_0)U_\alpha(y,\alpha_0)^{\rm T}\bigr] - {\rm E}_G\bigl[-H_\alpha(y,\alpha_0)\bigr] \\
&= K_\alpha + K^\circ_\alpha - I_\alpha. \qquad\square
\end{aligned}$$
If we make the true model assumption (i.e. $f = g$), we have from the classical results of maximum
likelihood theory that $I_{\alpha_j} = K_{\alpha_j}$. This implies that the bias correction term in the marginal
parameters becomes $p^*_{\alpha_j} = {\rm tr}(I_{\alpha_j}^{-1}K_{\alpha_j}) = \dim(\alpha_j)$, for each $j$.
For the bias correction term in the copula parameter part, the true model assumption results in Lemma 1.
Combining Lemma 1 with Lemma 3 and Lemma 5 from Ko & Hjort (2018) gives
$$\begin{aligned}
{\rm tr}\bigl(I_\eta^{-1}L K^\circ_\eta\bigr)
&= {\rm tr}\begin{pmatrix} I_\alpha^{-1}K^\circ_\alpha & 0 \\ 0 & -I_\theta^{-1}I_{\alpha,\theta}^{\rm T}I_\alpha^{-1}K_{\alpha,\theta} + I_\theta^{-1}K_\theta \end{pmatrix} \\
&= {\rm tr}\begin{pmatrix} I - I_\alpha^{-1}K_\alpha & 0 \\ 0 & I \end{pmatrix} \\
&= {\rm tr}\begin{pmatrix} 0 & 0 \\ 0 & I \end{pmatrix} = \dim(\theta).
\end{aligned}$$
Thus, the unbiased estimator $\tilde Q^*(\tilde\eta)$ can be simplified as
$$\tilde Q^*(\tilde\eta) = \frac{1}{n}\ell_n(\tilde\eta) - \frac{1}{n}\Bigl\{\sum_{j=1}^{d}\dim(\alpha_j) + \dim(\theta)\Bigr\}.$$
By defining $p_\eta = \sum_{j=1}^{d}\dim(\alpha_j) + \dim(\theta) = \dim(\eta)$ and scaling $\tilde Q^*(\tilde\eta)$, we obtain AIC2ML for two-stage
ML estimated copula models:
$$\mathrm{AIC}_{\mathrm{2ML}} = 2\ell_n(\tilde\eta) - 2p_\eta. \qquad (4)$$
Compared to AICML from Section 2.1, the only difference is that the log-likelihood is now maximized under
the two-stage ML scheme instead of ML, i.e. we use $\tilde\eta$ instead of $\hat\eta$.
2.4 Relationship between the model selection criteria
So far, we have discussed four model selection criteria for copula models. Table 1 shows an overview of the
relationship between them. When the model (i.e. both copula and margins) is correctly specified, CIC and
AIC2ML become equal, and the same happens between TIC and AICML. Thus, one can compare CIC and
AIC2ML (or TIC and AICML in case of ML estimation) to check whether the model is correctly specified.
Further, since both CIC and TIC estimate the same part of the KL divergence under the same
model robust environment, they are compatible. This implies that one can compare CIC to TIC to measure
how much one loses in terms of KL divergence by using two-stage ML estimation instead of ML estimation.
The same can be done by comparing AIC2ML to AICML when one believes in the model. However, one
should not compare CIC with AICML (or TIC with AIC2ML), since they rest on two different model
beliefs (i.e. the presence or absence of the true model assumption). Figuratively speaking, one would then be
comparing apples with pears.
                Model robust
Estimation      Yes     No
  ML            TIC     AICML
  2ML           CIC     AIC2ML

Table 1: An overview of the relationship between the model selection criteria discussed in this paper.
2.5 Illustration for the two-dimensional case and extension to the conditional copula regression
To make things more concrete, we now give an example of the two-dimensional case with $(Y_1, Y_2)$ from the
unknown $g(y)$. As candidate margins we choose a two-parameter distribution (e.g. normal, gamma, Weibull)
for both $F_1$ and $F_2$. For the copula part, we choose a one-parameter copula (e.g. Gumbel, Frank,
Clayton). The candidate model then has the form
$$f(y_1,y_2,\eta) = c\bigl(F_1(y_1,\alpha_1),F_2(y_2,\alpha_2),\theta\bigr)\cdot f_1(y_1,\alpha_1)\cdot f_2(y_2,\alpha_2),$$
where $\eta = (\alpha_{1,1},\alpha_{1,2},\alpha_{2,1},\alpha_{2,2},\theta)^{\rm T}$. The ingredients for the CIC can then be written down by using the
block matrix form, in line with Ko & Hjort (2018):
$$K_\eta = \begin{pmatrix} K_{\alpha_1} & K_{\alpha_1,\alpha_2} & K_{\alpha_1,\theta} \\ K_{\alpha_2,\alpha_1} & K_{\alpha_2} & K_{\alpha_2,\theta} \\ K_{\alpha_1,\theta}^{\rm T} & K_{\alpha_2,\theta}^{\rm T} & K_\theta \end{pmatrix}, \qquad I_\eta = \begin{pmatrix} I_{\alpha_1} & 0 & 0 \\ 0 & I_{\alpha_2} & 0 \\ 0 & 0 & I_\theta \end{pmatrix},$$
$$L = \begin{pmatrix} I_{2\times 2} & 0 & 0 \\ 0 & I_{2\times 2} & 0 \\ -I_{\alpha_1,\theta}^{\rm T}I_{\alpha_1}^{-1} & -I_{\alpha_2,\theta}^{\rm T}I_{\alpha_2}^{-1} & I_{1\times 1} \end{pmatrix}, \qquad K^\circ_\eta = \begin{pmatrix} K^\circ_{\alpha_1} & K^\circ_{\alpha_1,\alpha_2} & K_{\alpha_1,\theta} \\ K^\circ_{\alpha_2,\alpha_1} & K^\circ_{\alpha_2} & K_{\alpha_2,\theta} \\ (K^*_{\alpha_1,\theta})^{\rm T} & (K^*_{\alpha_2,\theta})^{\rm T} & K_\theta \end{pmatrix}.$$
We consequently have
$$\begin{aligned}
p^*_\eta &= p^*_{\alpha_1} + p^*_{\alpha_2} + p^*_\theta \\
&= {\rm tr}\bigl(I_{\alpha_1}^{-1}K_{\alpha_1}\bigr) + {\rm tr}\bigl(I_{\alpha_2}^{-1}K_{\alpha_2}\bigr) + {\rm tr}\bigl(I_\eta^{-1}L K^\circ_\eta\bigr) \\
&= {\rm tr}\bigl(I_{\alpha_1}^{-1}K_{\alpha_1}\bigr) + {\rm tr}\bigl(I_{\alpha_2}^{-1}K_{\alpha_2}\bigr) + {\rm tr}\bigl(I_{\alpha_1}^{-1}K^\circ_{\alpha_1}\bigr) + {\rm tr}\bigl(I_{\alpha_2}^{-1}K^\circ_{\alpha_2}\bigr) \\
&\quad + {\rm tr}\bigl(-I_\theta^{-1}I_{\alpha_1,\theta}^{\rm T}I_{\alpha_1}^{-1}K_{\alpha_1,\theta} - I_\theta^{-1}I_{\alpha_2,\theta}^{\rm T}I_{\alpha_2}^{-1}K_{\alpha_2,\theta} + I_\theta^{-1}K_\theta\bigr).
\end{aligned}$$
More generally, we can consider conditional copula regression, where all marginal distributions and the
copula are conditioned parametrically on the $k$-variate covariate $X$. Patton (2002) extends the existing
theory of copulae to the conditional copula setting, including a conditional version of Sklar's theorem,
which gives the conditional copula density
$$f(y_1,y_2,\eta\,|\,x) = c\bigl(F_1(y_1,\alpha_1\,|\,x), F_2(y_2,\alpha_2\,|\,x),\theta\,|\,x\bigr)\cdot f_1(y_1,\alpha_1\,|\,x)\cdot f_2(y_2,\alpha_2\,|\,x).$$
For simplicity, we consider the case where the copula parameter $\theta$ is modeled by the linear calibration
function $\theta = X\beta$, where $\beta = (\beta_0,\dots,\beta_k)^{\rm T}$ is a $(k+1)$-dimensional parameter vector. We can then consider $\beta$
as the copula parameter instead of $\theta$, which results in $\eta = (\alpha_{1,1},\alpha_{1,2},\alpha_{2,1},\alpha_{2,2},\beta_0,\dots,\beta_k)^{\rm T}$. One can easily
define the matrices $K_\eta$, $I_\eta$, $L$ and $K^\circ_\eta$ accordingly. For details about conditional copula regression, see Patton
(2006), Acar et al. (2013) and Palaro & Hotta (2006).
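The linear calibration idea can be sketched concretely. The example below is our own illustration, not from the paper: a Clayton copula is chosen hypothetically because its density is short, the margins are taken as known uniforms so that only the copula stage is shown, and $\theta_i = \beta_0 + \beta_1 x_i$ is fitted by a grid search over $(\beta_0, \beta_1)$.

```python
import math
import random

def log_clayton(u, v, theta):
    """Clayton copula log-density."""
    return (math.log(1.0 + theta) - (theta + 1.0) * (math.log(u) + math.log(v))
            - (2.0 + 1.0 / theta) * math.log(u ** (-theta) + v ** (-theta) - 1.0))

rng = random.Random(3)
b0_true, b1_true = 2.0, 1.0
obs = []
for _ in range(1500):
    x = rng.random()                  # covariate in (0, 1)
    theta = b0_true + b1_true * x     # linear calibration: theta_i in (2, 3)
    u, w = rng.random(), rng.random()
    # Conditional inversion method for the Clayton copula.
    v = (u ** (-theta) * (w ** (-theta / (1.0 + theta)) - 1.0) + 1.0) ** (-1.0 / theta)
    obs.append((x, u, v))

def loglik(b0, b1):
    """Copula log-likelihood with the covariate-dependent parameter."""
    return sum(log_clayton(u, v, b0 + b1 * x) for x, u, v in obs)

# Maximize over (b0, b1) by grid search; theta stays positive on this grid.
grid = [(1.0 + 0.1 * i, 0.1 * j) for i in range(21) for j in range(21)]
b0_hat, b1_hat = max(grid, key=lambda b: loglik(*b))
```

In a full analysis the margins would also carry covariates and be fitted in stage 1; the point here is only that $\beta$ replaces $\theta$ as the copula parameter without changing the structure of the estimation.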
3 Simulation study
To study the behavior of CIC, we have performed a simulation study.
3.1 Simulation 1
In simulation 1, we generated datasets of three sizes ($n$ = 100, 1000, 10000) with the data generating model
described in Table 2. For each dataset, we then fitted 18 different copula models, based on the possible
combinations of the candidate copulae and margins described in Table 3. We repeated this process 1000
times; the averaged results from the fitted models can be found in Table 7, Table 8 and Table 9 in the
Appendix.
Table 2: Description of the data generating model used in simulation 1.

                        Copula     Margin 1           Margin 2
Data generating model   Gumbel     Weibull            Gamma
                        θ = 3      α1 = (1.5, 4)^T    α2 = (2, 1)^T
                                   (shape, scale)     (shape, rate)
Table 3: List of the candidate copulae and margins used in simulation 1
Candidates
Copula Gumbel, Gaussian
Margin 1 Weibull, Gamma, Log-normal
Margin 2 Weibull, Gamma, Log-normal
From all three tables, we can see that CIC and AIC2ML result in similar model ranks. This is expected,
since both aim for the model that minimizes the KL divergence. When the model is correctly specified
(model 1), the estimated values of CIC and AIC2ML are essentially the same. This confirms that CIC and
AIC2ML are equal under the true model assumption (proven analytically in Section 2.3). Further, we can
observe that 'better' models according to CIC and AIC2ML in general have a smaller value of MSE($\tilde P$). To
clarify, MSE($\tilde P$) denotes the mean squared error of the two-stage ML estimate of $P(b_{0.8} < y)$. Here $P(b_{0.8} < y)$ is
the joint probability that each marginal variable has a larger value than its 0.8-quantile value, defined by each
marginal model. Similarly, MSE($\hat P$) is the mean squared error of the ML estimate of $P(b_{0.8} < y)$.
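On the uniform scale, $P(b_{0.8} < y)$ is a joint exceedance probability, which in two dimensions can be read off the copula directly by inclusion–exclusion. A small sketch of this computation (our own illustration, using the Gumbel copula of the data generating model; the two-dimensional case is a simplification of the five-margin setting used in simulation 2):

```python
import math

def gumbel_cdf(u, v, theta):
    """Gumbel copula C(u, v, theta)."""
    s = (-math.log(u)) ** theta + (-math.log(v)) ** theta
    return math.exp(-s ** (1.0 / theta))

def joint_exceedance(u, theta):
    """P(U1 > u, U2 > u) = 1 - 2u + C(u, u) by inclusion-exclusion."""
    return 1.0 - 2.0 * u + gumbel_cdf(u, u, theta)

p_dep = joint_exceedance(0.8, theta=3.0)   # under the simulation-1 copula
p_indep = 0.2 * 0.2                        # the same probability under independence
```

For continuous margins, $P(b_{0.8} < y)$ on the data scale equals this copula-scale probability, since each $F_j(Y_j)$ is uniform; the positive dependence of the Gumbel copula makes `p_dep` substantially larger than the independence value 0.04.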
When the model is correctly specified (model 1), $\tilde p^*_\eta$ is virtually equal to $p_\eta$, the length of the parameter
vector $\eta$. When the model is misspecified, the CIC value is penalized more as $\tilde p^*_\eta$ increases. However,
this is not the case for AIC2ML, since $p_\eta = 5$ across all models. Thus, CIC has a higher chance of choosing
a less wrong model. As $n$ increases, the influence of $\tilde p^*_\eta$ on CIC decreases, since the absolute value of the log-
likelihood grows much faster than the penalty term. This is observable in Table 4. When $n = 100$, the best
models chosen by the model robust model selection criteria (TIC and CIC) result in smaller MSE values.
However, when $n = 10000$, the penalty term of these model selection criteria is very small compared to the
log-likelihood value. Consequently, CIC and TIC choose the same models as their model non-robust variants
(AIC2ML and AICML) do. This results in the same MSE performance.
Table 4: Result from simulation 1. For each dataset, the best model was chosen among the 18 candidate models
by using CIC, TIC, AIC2ML or AICML. Then, $P(b_{0.8} < y)$ was computed from the best models. The table
contains the mean squared error of the estimated $P(b_{0.8} < y)$, multiplied by $10^5$.

n        TIC      AICML    CIC      AIC2ML
100      3.7871   3.8723   3.9079   4.0122
1000     0.2859   0.2859   0.2858   0.2858
10000    0.0251   0.0251   0.0251   0.0251
3.2 Simulation 2
In simulation 2, we generated datasets of size $n = 1000$ with the data generating model described in Table 5.
We then fitted 486 different copula models, based on the possible combinations of the candidate
copulae and margins described in Table 6. We repeated this process 100 times and averaged the results.
As in simulation 1, $P(b_{0.8} < y)$ was computed from every fitted model. Figure 1 displays the relationship
between the mean squared error of the estimated $P(b_{0.8} < y)$ and CIC and AIC2ML. We can see that both model
selection criteria evaluate the models that have lower mean squared error as better models. The difference
between CIC and AIC2ML in this perspective is minimal. This is because the log-likelihood, the element that
is shared by both model selection criteria, has much bigger absolute value than the bias correction term.
Table 5: Description of the data generating model used in simulation 2.

                  Copula    Margin 1          Margin 2         Margin 3         Margin 4         Margin 5
Data generating   Gumbel    Weibull           Weibull          Gamma            Gamma            Gamma
model             θ = 3     α1 = (1.5, 4)^T   α2 = (2, 3)^T    α3 = (2, 1)^T    α4 = (3, 1)^T    α5 = (4, 2)^T
                            (shape, scale)    (shape, scale)   (shape, rate)    (shape, rate)    (shape, rate)
Table 6: List of the candidate copulae and margins used in simulation 2
Candidates
Copula Gumbel, Gaussian
Margin 1 Weibull, Gamma, Log-normal
Margin 2 Weibull, Gamma, Normal
Margin 3 Weibull, Gamma, Log-normal
Margin 4 Weibull, Gamma, Log-normal
Margin 5 Weibull, Gamma, Log-normal
Figure 2 plots the same as Figure 1, but the x-axis is now the bias correction term ($\tilde p^*_\eta$ for CIC and
$p_\eta$ for AIC2ML). The difference between CIC and AIC2ML is now clearer. While $p_\eta$ (the dimension of the
parameter vector $\eta$) is fixed at 11 across all models for AIC2ML, $\tilde p^*_\eta$ tends to penalize misspecified models
more and has a strong relationship with the mean squared error of the estimated $P(b_{0.8} < y)$.
Figure 1: Result from simulation 2. On the x-axis is the value of CIC or AIC2ML. The 486 different copula
models, defined by using the candidate copulae and margins described in Table 6, are fitted to the dataset
generated from the data generating model described in Table 5. The y-axis is the mean squared error of
$\tilde P(b_{0.8} < y)$, the two-stage ML estimate of the joint probability that each marginal variable has a
larger value than its 0.8-quantile value, defined by each marginal model. Since different choices of copulae
lead to a big difference in model selection score, the result is displayed separately for each copula. The left
plot contains the 243 models with the Gumbel copula and the right plot contains the 243 models with the Gaussian
copula. The data generating model had the Gumbel copula.
As mentioned in Section 2.4, one can compare CIC to TIC, or AIC2ML to AICML, to measure how
much one loses in terms of KL divergence by performing two-stage ML estimation instead of ML estimation.
Figure 3 shows that TIC − CIC and AICML − AIC2ML are related to the loss in MSE of $P(b_{0.8} < y)$
caused by two-stage ML estimation. However, this relationship seems weaker when the copula is misspecified
(right panel). We tried to identify a sub-pattern in the plot that could cause this, for example by plotting
only subsets of models that share specific margins, yet we were not able to detect any such sub-pattern.
Figure 2: Result from simulation 2. The 486 different copula models, defined by using the candidate copulae
and margins described in Table 6, are fitted to the dataset generated from the data generating model described
in Table 5. The y-axis is the mean squared error of $\tilde P(b_{0.8} < y)$. The x-axis is the value of the bias correction
term in the model selection criteria ($\tilde p^*_\eta$ for CIC and $p_\eta$ for AIC2ML). As for Figure 1, the result is
displayed separately for each copula. The data generating model had the Gumbel copula.
Figure 3: Result from simulation 2. The 486 different copula models, defined by using the candidate
copulae and margins described in Table 6, are fitted to the dataset generated from the data generating model
described in Table 5. The y-axis is the difference between MSE($\hat P$) (the MSE of $P(b_{0.8} < y)$ estimated by ML)
and MSE($\tilde P$) (the MSE of $P(b_{0.8} < y)$ estimated by two-stage ML). As in Figure 1 and Figure 2, the result is
displayed separately for each copula. The data generating model had the Gumbel copula.
4 Conclusions and further research
In this paper, we have developed the copula information criterion (CIC), which is a TIC-like model robust
model selection criterion for two-stage ML estimated copulae. When we make the assumption that the
parametric candidate model contains the true model, CIC becomes equal to AIC2ML. This validates the
use of AIC2ML for two-stage ML estimated copula models. To our knowledge, this is the first time that the
AIC2ML formula has been analytically justified. Further, since both TIC and CIC estimate the same part of
the KL divergence, without the presence of the true model assumption, they are compatible with each other
and can be used to check possible disadvantages caused by the two-stage ML estimation. The same can be
done by comparing AICML and AIC2ML, when one believes in the model.
Regarding the assumption that a candidate model is correct, one can compare $\tilde p^*_\eta$ (the bias correction term of
CIC) and $p_\eta$ (the bias correction term of AIC2ML) to check whether the model severely diverges from the data
generating model, i.e. as a separate goodness-of-fit check. It may be noted that the job of the CIC is to rank
models according to a sensible criterion, and to identify the best ones; doing well in this ranking is not
the same as claiming that the model passes goodness-of-fit tests. In other words, the winning model
according to the CIC may still not be a perfect model, perhaps because the list of candidate models was not the
best.
We performed a simulation study. For relatively small sample sizes, CIC outperforms AIC2ML in terms
of prediction performance from the selected models. When the sample size is large, the log-likelihood term
grows much faster than the bias correction term and the difference between CIC and AIC2ML is minimal.
Naturally, for large $n$, the best models are those with high values of the maximized (two-stage) log-likelihood,
which means richly parametrised models.
From our simulation, cf. Figure 2, we can see that $\tilde p^*_\eta$ alone has a strong correlation with the prediction
performance (measured in MSE). So, one can consider using $\tilde p^*_\eta$ (without the log-likelihood part) to judge
the model. In addition, TIC − CIC and AICML − AIC2ML turn out to have a high correlation with the loss
of prediction performance (measured as a difference in MSE) caused by the switch from ML estimation to
two-stage ML estimation.
The results from the simulation study hold largely both when the copula is correctly specified and
when it is misspecified. Yet, in Figure 3, although the overall tendency is similar, we observe that the result for the
misspecified Gaussian copula seems to consist of two different sub-patterns. However, we were not able to
find a possible cause of this.
Because of the large number of possible models in a high-dimensional setting, the number of situations
that we could examine was limited. (In the case of a 5-dimensional copula model with 2 candidate copulae and
3 candidate families for each margin, there are 2 · 3⁵ = 486 models to test, and each model has to be fitted
by 2 different estimation schemes.) Another limitation was that we could not try all copulae and margins on
the simulated data, since fitting a heavily misspecified copula or margin would cause numerical problems.
A further large-scale simulation study examining the behavior of different types of copulae in a variety of
situations would be fruitful.
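The combinatorial count in the parenthesis above can be sketched as follows (a toy bookkeeping calculation; the variable names are ours):

```python
# Toy bookkeeping for the model-search space described above (variable names are ours).
n_copulae = 2           # candidate copula families
n_margin_families = 3   # candidate parametric families per margin
dim = 5                 # number of margins

n_models = n_copulae * n_margin_families ** dim  # 2 * 3^5
n_fits = n_models * 2                            # each model fitted by full ML and two-stage ML
print(n_models, n_fits)  # -> 486 972
```

The multiplicative growth in the number of margins is what makes an exhaustive study infeasible in higher dimensions.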
Furthermore, CIC is computationally expensive, mainly because K̂°α, K̂α,θ and K̂θ require the score functions
for every data point separately. For Î⁻¹α and Îθ, one can avoid this by swapping the order of
summation and differentiation, but for K̂°α, K̂α,θ and K̂θ this is not possible. A numerical technique that
makes CIC less computationally expensive would be a valuable contribution.
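To illustrate why the K̂-type matrices are the bottleneck, the following sketch (ours, not the paper's implementation; a Gaussian log-density stands in for the actual copula and margin likelihoods) computes a K̂-type matrix from per-observation numerical scores, requiring one gradient evaluation per data point:

```python
import numpy as np

def loglik_i(theta, x):
    """Per-observation log-density; a Gaussian stand-in for the model's log-likelihood."""
    mu, sigma = theta
    return -0.5 * np.log(2 * np.pi) - np.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2

def score_i(theta, x, h=1e-5):
    """Numerical score for one observation; this is what K-hat needs for every data point."""
    g = np.zeros(len(theta))
    for j in range(len(theta)):
        tp = theta.copy(); tp[j] += h
        tm = theta.copy(); tm[j] -= h
        g[j] = (loglik_i(tp, x) - loglik_i(tm, x)) / (2 * h)
    return g

rng = np.random.default_rng(1)
x = rng.normal(1.0, 2.0, size=500)
theta_hat = np.array([x.mean(), x.std()])  # ML estimates for the Gaussian stand-in

# K-hat = (1/n) * sum_i s_i s_i^T : one score evaluation per observation,
# which cannot be replaced by differentiating the summed log-likelihood once.
S = np.vstack([score_i(theta_hat, xi) for xi in x])
K_hat = S.T @ S / len(x)
```

By contrast, an Î-type matrix can be obtained from derivatives of the summed log-likelihood, so its cost does not scale with per-observation differentiation in the same way.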
Although CIC performs well in selecting the model that fits the data best in terms of KL
divergence, there are situations where one is interested in a model suited to a specific task,
for example estimating tail probabilities, the mean, or the median. A model selection
criterion for copula models under the two-stage ML scheme that takes such a focus into account would be useful.
Acknowledgments
The authors would like to thank Ingrid Hobæk Haff and Steffen Grønneberg for their valuable comments
and fruitful discussions. The authors also acknowledge partial funding from the Norwegian Research Council-supported
research group FocuStat: Focus Driven Statistical Inference With Complex Data, and from the
Department of Mathematics at the University of Oslo.
References
Acar, E. F., Craiu, R. V., Yao, F. et al. (2013). Statistical testing of covariate effects in conditional copula
models. Electronic Journal of Statistics 7, 2822–2850.
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control
19, 716–723.
Akaike, H. (1998). Information theory and an extension of the maximum likelihood principle. In Selected
Papers of Hirotugu Akaike. Springer, pp. 199–213.
Claeskens, G. & Hjort, N. L. (2008). Model Selection and Model Averaging, vol. 330. Cambridge University
Press, Cambridge.
Genest, C. & Favre, A.-C. (2007). Everything you always wanted to know about copula modeling but were
afraid to ask. Journal of Hydrologic Engineering 12, 347–368.
Grønneberg, S. & Hjort, N. L. (2014). The copula information criteria. Scandinavian Journal of Statistics
41, 436–459.
Joe, H. (1997). Multivariate Models and Multivariate Dependence Concepts. CRC Press.
Jullum, M. & Hjort, N. L. (2017). Parametric or nonparametric: the FIC approach. Statistica Sinica.
Ko, V. & Hjort, N. L. (2018). Model robust inference for copulae via two-stage maximum likelihood estima-
tion. Submitted to Journal of Multivariate Analysis.
Kullback, S. & Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics
22, 79–86.
Nelsen, R. B. (2006). An Introduction to Copulas. Springer Science & Business Media.
Palaro, H. & Hotta, L. (2006). Using conditional copula to estimate value at risk. Journal of Data Science
4, 93–115.
Patton, A. J. (2002). Applications of copula theory in financial econometrics. Ph.D. thesis, University of
California, San Diego.
Patton, A. J. (2006). Modelling asymmetric exchange rate dependence. International Economic Review 47,
527–556.
Sklar, A. (1959). Fonctions de répartition à n dimensions et leurs marges. Publ. Inst. Statist. Univ. Paris
8, 229–231.
Takeuchi, K. (1976). The distribution of information statistics and the criterion of goodness of fit of models.
Mathematical Science 153, 12–18.
Appendix
Model no. TIC AICML CIC AIC2ML p̂*η,TIC p̂*η pη 10⁵·MSE(P̂) (ML) 10⁵·MSE(P̂) (2ML) Copula Margin 1 Margin 2
n = 100
1 -609.92 -610.10 -610.77 -610.93 4.91 4.92 5 2.6695 2.8444 Gumbel Weibull Gamma
2 -613.04 -612.93 -614.75 -614.26 5.05 5.24 5 2.6092 2.9347 Gumbel Weibull Weibull
4 -613.64 -613.54 -615.04 -614.93 5.05 5.05 5 2.7188 2.8276 Gumbel Gamma Gamma
11 -621.56 -620.75 -621.61 -621.12 5.41 5.25 5 27.3463 22.8859 Gaussian Weibull Weibull
10 -622.25 -621.41 -622.33 -621.69 5.42 5.32 5 29.5786 26.3657 Gaussian Weibull Gamma
5 -619.30 -618.63 -623.51 -622.01 5.34 5.75 5 2.7269 3.3693 Gumbel Gamma Weibull
6 -624.39 -623.56 -627.26 -626.22 5.41 5.52 5 2.5957 2.7652 Gumbel Gamma Log-normal
13 -628.24 -627.03 -628.26 -627.05 5.60 5.60 5 31.3167 30.7712 Gaussian Gamma Gamma
14 -629.62 -628.39 -629.99 -628.63 5.62 5.68 5 26.9486 28.9915 Gaussian Gamma Weibull
3 -627.54 -626.46 -634.34 -632.53 5.54 5.90 5 2.6429 3.8486 Gumbel Weibull Log-normal
9 -640.53 -638.38 -643.09 -640.44 6.07 6.32 5 2.7582 2.7121 Gumbel Log-normal Log-normal
12 -646.70 -643.93 -646.92 -644.09 6.39 6.41 5 41.5681 43.6130 Gaussian Weibull Log-normal
15 -647.74 -644.76 -647.75 -644.77 6.49 6.49 5 43.4360 43.2919 Gaussian Gamma Log-normal
7 -647.85 -645.24 -654.37 -651.88 6.31 6.24 5 3.0737 6.1935 Gumbel Log-normal Gamma
8 -654.09 -651.41 -665.48 -661.68 6.34 6.90 5 2.5910 9.1325 Gumbel Log-normal Weibull
16 -666.52 -662.25 -666.53 -662.25 7.14 7.14 5 57.9316 58.2482 Gaussian Log-normal Gamma
17 -671.15 -666.76 -672.01 -667.17 7.19 7.42 5 51.5845 61.5621 Gaussian Log-normal Weibull
18 -674.16 -668.48 -674.16 -668.48 7.84 7.84 5 56.7535 56.7548 Gaussian Log-normal Log-normal
Table 7: Result of simulation 1 with n = 100. The simulation was repeated 1000 times and the results were averaged.
Model no. TIC AICML CIC AIC2ML p̂*η,TIC p̂*η pη 10⁵·MSE(P̂) (ML) 10⁵·MSE(P̂) (2ML) Copula Margin 1 Margin 2
n = 1000
1 -6060.83 -6060.84 -6061.78 -6061.79 4.99 4.99 5 0.2861 0.2996 Gumbel Weibull Gamma
2 -6093.59 -6092.60 -6102.19 -6101.23 5.50 5.48 5 0.2981 0.3218 Gumbel Weibull Weibull
4 -6095.45 -6095.16 -6105.59 -6105.31 5.15 5.14 5 0.3009 0.3331 Gumbel Gamma Gamma
11 -6171.40 -6170.13 -6174.20 -6173.32 5.63 5.44 5 23.3046 18.7508 Gaussian Weibull Weibull
10 -6178.38 -6177.18 -6179.74 -6178.74 5.60 5.50 5 25.5380 22.1326 Gaussian Weibull Gamma
5 -6151.67 -6150.39 -6183.69 -6181.42 5.64 6.13 5 0.3913 0.7139 Gumbel Gamma Weibull
6 -6204.66 -6203.44 -6230.19 -6228.67 5.61 5.76 5 0.2753 0.3321 Gumbel Gamma Log-normal
13 -6235.93 -6234.28 -6236.12 -6234.49 5.82 5.82 5 27.2102 26.6318 Gaussian Gamma Gamma
14 -6250.52 -6248.66 -6252.09 -6250.10 5.93 6.00 5 22.6996 24.9377 Gaussian Gamma Weibull
3 -6237.75 -6236.14 -6300.84 -6298.28 5.80 6.28 5 0.3561 1.3135 Gumbel Weibull Log-normal
9 -6355.65 -6352.79 -6369.91 -6366.28 6.43 6.81 5 0.5153 0.4199 Gumbel Log-normal Log-normal
12 -6425.80 -6421.62 -6426.35 -6422.09 7.09 7.13 5 38.1316 40.5651 Gaussian Weibull Log-normal
15 -6432.74 -6428.42 -6432.78 -6428.47 7.16 7.16 5 39.9678 39.8287 Gaussian Gamma Log-normal
7 -6427.95 -6424.60 -6495.95 -6492.60 6.68 6.67 5 0.6189 3.5215 Gumbel Log-normal Gamma
8 -6490.98 -6487.46 -6597.27 -6592.10 6.76 7.58 5 0.2722 6.3816 Gumbel Log-normal Weibull
16 -6612.77 -6606.21 -6612.80 -6606.24 8.28 8.28 5 55.3695 55.7180 Gaussian Log-normal Gamma
17 -6660.51 -6653.56 -6664.39 -6656.88 8.47 8.76 5 48.5595 59.4736 Gaussian Log-normal Weibull
18 -6686.70 -6678.29 -6686.70 -6678.29 9.20 9.20 5 52.9131 52.9139 Gaussian Log-normal Log-normal
Table 8: Result of simulation 1 with n = 1000. The simulation was repeated 1000 times and the results were averaged.
Model no. TIC AICML CIC AIC2ML p̂*η,TIC p̂*η pη 10⁵·MSE(P̂) (ML) 10⁵·MSE(P̂) (2ML) Copula Margin 1 Margin 2
n = 10000
1 -60528.25 -60528.25 -60529.19 -60529.18 5.00 5.00 5 0.0251 0.0268 Gumbel Weibull Gamma
2 -60848.99 -60847.94 -60929.30 -60928.31 5.52 5.49 5 0.0381 0.0433 Gumbel Weibull Weibull
4 -60869.96 -60869.66 -60966.13 -60965.85 5.15 5.14 5 0.0419 0.0699 Gumbel Gamma Gamma
11 -61624.69 -61623.39 -61655.65 -61654.74 5.65 5.45 5 22.8801 18.3281 Gaussian Weibull Weibull
10 -61696.56 -61695.34 -61710.64 -61709.62 5.61 5.51 5 25.1114 21.7084 Gaussian Weibull Gamma
5 -61427.65 -61426.32 -61734.40 -61732.07 5.67 6.17 5 0.1282 0.4364 Gumbel Gamma Weibull
6 -61969.26 -61967.99 -62223.80 -62222.22 5.63 5.79 5 0.0292 0.0867 Gumbel Gamma Log-normal
13 -62265.03 -62263.36 -62266.99 -62265.33 5.83 5.83 5 26.7925 26.2132 Gaussian Gamma Gamma
14 -62409.19 -62407.28 -62422.48 -62420.44 5.95 6.02 5 22.2643 24.5218 Gaussian Gamma Weibull
3 -62301.64 -62299.98 -62929.03 -62926.39 5.83 6.32 5 0.1043 1.0768 Gumbel Weibull Log-normal
9 -63458.15 -63455.22 -63582.75 -63579.03 6.47 6.86 5 0.2704 0.1838 Gumbel Log-normal Log-normal
12 -64168.16 -64163.79 -64171.83 -64167.38 7.18 7.23 5 37.7631 40.2916 Gaussian Weibull Log-normal
15 -64230.67 -64226.18 -64231.09 -64226.61 7.24 7.24 5 39.6239 39.4905 Gaussian Gamma Log-normal
7 -64177.36 -64173.93 -64854.87 -64851.44 6.71 6.71 5 0.3752 3.2669 Gumbel Log-normal Gamma
8 -64807.87 -64804.26 -65853.68 -65848.39 6.80 7.64 5 0.0245 6.1275 Gumbel Log-normal Weibull
16 -65997.87 -65991.08 -65998.13 -65991.33 8.40 8.40 5 54.9718 55.3255 Gaussian Log-normal Gamma
17 -66475.00 -66467.78 -66507.88 -66500.08 8.61 8.90 5 48.1140 59.1384 Gaussian Log-normal Weibull
18 -66730.29 -66721.58 -66730.29 -66721.58 9.36 9.36 5 52.3035 52.3041 Gaussian Log-normal Log-normal
Table 9: Result of simulation 1 with n = 10000. The simulation was repeated 1000 times and the results were averaged.