Robust Estimates of Insurance Misrepresentation
through Kernel Quantile Regression Mixtures
Hong Li∗, Qifan Song†, Jianxi Su‡
September 16, 2020
Abstract
This paper pertains to a class of non-parametric methods for studying the misrepresentation issue in insurance
applications. For this purpose, mixture models based on quantile regression in reproducing kernel Hilbert spaces
are employed. Compared to existing parametric approaches, the proposed framework features a more flexible
statistical structure, which alleviates the risk of model misspecification and is at the same time more robust
to outliers in the data. The proposed framework can not only estimate the prevalence of misrepresentation in
the data, but also help identify the most suspicious individuals for validation purposes. By embedding
state-of-the-art machine learning techniques, we present a novel statistical procedure to efficiently estimate the
proposed misrepresentation model in the presence of massive data. The proposed methodology is applied to study
the Medical Expenditure Panel Survey (MEPS) data, and a significant degree of misrepresentation activity is
found in the self-reported insurance status.
Key words and phrases: Big data, insurance claim models, misrepresenter identification, misrepresentation risk
assessment, non-parametric regression mixtures.
∗Asper School of Business, University of Manitoba, Winnipeg, Canada. Email: [email protected]
†Department of Statistics, Purdue University, West Lafayette, U.S.A. Email: [email protected]
‡Department of Statistics, Purdue University, West Lafayette, U.S.A. Email: [email protected]
Introduction
Insurance companies collect policyholders’ information – summarized by the so-called insurance rating factors
– in order to calculate risk-adjusted premiums. In real-world practice, it often happens that policyholders
intentionally make untrue statements on some key rating factors so as to alter their insurance eligibility and/or lower
their insurance premiums. In actuarial parlance, this type of fraudulent behavior is referred to as insurance
misrepresentation. Rating factors subject to misrepresentation are typically self-reported. Examples include
smoking status in health insurance, and the mileage and use of vehicles in auto insurance. Misrepresentation phenomena
can also be found in insurance-related survey data when revealing a particular piece of true information carries
a cost, such as a social-desirability or financial cost. Turning a blind eye to misrepresentation activities may
degrade the quality of actuarial models, leading to infelicitous business decisions and/or unfair premium schemes.
Misrepresentation risk is also of particular interest to insurance regulators, who attempt to understand how the
presence of fraudulent behavior by a group of insured individuals might affect the welfare of the others (Gabaldon
et al., 2014). Hence, assessing and managing misrepresentation risk is an indispensable component of modern
insurance practice.
There are two strands of literature on misrepresentation risk. The first focuses on
qualitative research, attempting to propose superior policy designs and proactive oversight
processes for deterring misrepresentation behaviors (see Hamilton, 2009, for a comprehensive review). The other
strand focuses on the quantitative aspect of misrepresentation management and aims to quantify the
extent of misrepresentation risk using statistical models. Our paper sits squarely in the latter strand.
Arguably, modeling insurance misrepresentation is a challenging task. What complicates the task, from the
statistical standpoint, is the unobservable nature of fraudulent activities. Namely, whether misrepresentation
actually occurred cannot be discovered until a formal investigation is undertaken. Thereby, traditional
statistical methods (e.g., discriminant analysis and logistic regression), which require access to a sample frame
containing the random variable of concern (i.e., the misrepresentation status at the individual level), may not be
directly used to study misrepresentation. Consequently, a new set of statistical tools is naturally called upon. In
particular, we have a keen interest in developing a rigorous statistical framework in order to deliver scientifically sound
answers to the following two questions, which are of fundamental importance in quantitative misrepresentation risk
management:
Q1. Based on a given set of insurance claims data, how to assess the level of misrepresentation activities in the
data, before actually observing the fraudulent behaviors?
Q2. If a significant level of misrepresentation activities are discovered, how to select the most suspicious individuals
for the validation purpose?
Despite its practical importance, modeling misrepresentation did not receive much attention from insurance
researchers until recently, and the related literature is thus considerably limited. To the best of our knowledge,
there are only a few existing works on misrepresentation modeling, including Akakpo et al. (2019); Xia (2018); Xia
and Gustafson (2016, 2018); Xia et al. (2018). In particular, one of the most recent attempts, made in Akakpo
et al. (2019), significantly inspires our undertaking. In that paper, the authors built on the theoretical groundwork
established by Xia and Gustafson (2016) and proposed a mixture of parametric regressions to model insurance
misrepresentation. Before placing the present paper into perspective, the following paragraphs give a coarse overview of
the misrepresentation model in Akakpo et al. (2019). Some standard notation for describing the misrepresentation
problem of interest is needed beforehand. We denote the data set by {y_i, x_i, z∗_i}_{i=1}^n, where y_i ∈ R is the response
(e.g., log-transformed insurance claims), x_i ∈ R^d is the d-dimensional correctly measured covariate (e.g., various
rating factors and demographic information), and z∗_i ∈ {0, 1} is the self-reported and potentially misrepresented
classification label, with z∗_i = 0 representing a negative risk status. Admittedly, the aforementioned setup may be
restrictive in the sense that only one potentially misrepresented variable is considered and it is binary. However,
this type of misrepresentation problem is among the most common in real-world practice. It is also worth
mentioning that insurance misrepresentation usually occurs in one direction only, benefiting the respondent
financially. For example, in a health insurance plan with a smoking surcharge, a non-smoking policyholder has no
incentive to misrepresent as a smoker and thereby increase the insurance premium. Correspondingly, the true unknown label,
denoted by z_i, is partially missing and satisfies z_i = z∗_i if z∗_i = 1.
Next, we present a slight generalization of the misrepresentation model introduced by Akakpo et al. (2019).¹
Given the correctly measured covariate x_i and the true label z_i of the potentially misrepresented variable, assume
that the conditional distribution of the log-transformed insurance claims can be captured by a Gaussian linear regression
(or, equivalently, a log-normal linear regression on the original claims):
y_i − (β0 + x_i′ β + z_i β_z) ∼iid N(0, σ²),   i = 1, . . . , n,   (1)

where β0, β = (β1, . . . , βd)′ and β_z are the regression coefficients, σ > 0 is the standard deviation parameter of the
normal distribution, and “∼iid” designates random variables that are independently and identically distributed.
Recall that z_i is partially missing, so (y_i | x_i, z∗_i = 1) =ᵈ (y_i | x_i, z_i = 1), where “=ᵈ” signifies equality in
distribution. If z∗_i = 0, then (y_i | x_i, z∗_i) admits a two-component mixture:

F(y_i | x_i, z∗_i = 0) = (1 − π) F(y_i | x_i, z_i = 0) + π F(y_i | x_i, z_i = 1),   (2)
¹Multiple correctly measured covariates are allowed in the parametric misrepresentation model presented here, while the one in Akakpo et al. (2019) only considered the potentially misrepresented variable.
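To fix ideas, the parametric data-generating process in Equations (1) and (2) can be simulated directly. The sketch below uses hypothetical coefficient values and a hypothetical 25% misreporting rate, and checks that the empirical fraction of true positives among the z∗ = 0 records matches π = P[z = 1 | z∗ = 0]; it is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
beta0, beta, beta_z, sigma = 1.0, 0.5, 2.0, 1.0   # hypothetical coefficients

x = rng.normal(size=n)
z = rng.binomial(1, 0.4, size=n)                  # true (latent) risk status
# a z = 1 individual misreports z* = 0 with probability 0.25 (hypothetical);
# misrepresentation is one-directional, so z = 0 is always reported truthfully
misreport = (z == 1) & (rng.random(n) < 0.25)
z_star = np.where(misreport, 0, z)

# claims under the Gaussian linear model (1)
y = beta0 + beta * x + beta_z * z + sigma * rng.normal(size=n)

# empirical misrepresentation probability pi = P[z = 1 | z* = 0]
pi_hat = z[z_star == 0].mean()
```

Here the theoretical value is π = (0.4 × 0.25) / (0.6 + 0.4 × 0.25) ≈ 0.143, and `pi_hat` should land close to it for large n.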
where F (·) denotes the conditional cumulative distribution function (CDF) of y given (x, z∗) or (x, z), and π =
P[z = 1 | z∗ = 0] represents the misrepresentation probability. To model misrepresentation, the statistical task is to
make inference on the model’s parameters, (β0,β, βz, σ), as well as the misrepresentation probability π. In Akakpo
et al. (2019), the authors proposed to use the Expectation-Maximization (EM) algorithm to estimate the model.
The Gaussian linear model in (1) is both useful in a significant range of practical situations and elegant enough
for theoretical analysis. However, when it is used for modeling misrepresentation, a fundamental issue concerning
the soundness of the estimation result arises: misspecifying the Gaussian linear model for the claim data
may contaminate the mixture structure in Equation (2). Moreover, owing to the lack of validated data on
misrepresentation in practice, it is impossible to verify whether the mixture structure in Equation (2) is truly
due to fraudulent behavior, or simply a consequence of the misspecification of the parametric model. Theoretically,
it is rather challenging to quantify the bias in the estimation of the misrepresentation probability when the claim
model is misspecified. Misrepresentation models based on other parametric distributions, such as the gamma, Pareto
and Weibull, are also considered in Xia (2018); Xia and Gustafson (2016); Xia et al. (2018). That said, these
modifications still rely on strong parametric assumptions on the claim distribution and hence cannot yield a
satisfactory resolution of the robustness issue noted above.
Another potential cause of untrustworthy estimates of misrepresentation under the Gaussian linear model (1) is the
presence of outliers, owing to the use of least-squares-based procedures. For instance, if a non-misrepresenter case
(i.e., z_i = z∗_i = 0) with a relatively high response value y_i appears in the data set, then this observation may induce
over-estimation of the regression function under z = 0. If fitting regression functions to data were the sole statistical
goal, then a straightforward way to deal with outliers would be to identify and remove them from the estimation
process. However, this rather naive approach does not suit a misrepresentation analysis, because the outliers may
actually correspond to the misrepresented individuals.
Beyond the choice of a suitable statistical modeling method, another important practical issue for modern
insurance applications is the computational bottleneck encountered with massive insurance data. While the
computational complexity of restrictive parametric methods, such as linear models, usually grows linearly with the
sample size, flexible statistical methods, such as non-parametric models, suffer much heavier computational burdens
as the sample grows. Therefore, special attention must be paid to how to efficiently carry out the appropriate
analysis and answer the fundamental questions Q1 and Q2, without sacrificing statistical accuracy.
In light of the discussion thus far, the major object of interest in this paper is straightforward: a non-parametric
approach for robust estimation of insurance misrepresentation. The primary tool in our study
is kernel quantile regression (KQR). The rationale behind this model choice is two-fold. First, the embedded kernel
method provides a flexible statistical structure to capture the non-linear and interaction effects in
insurance data. As a consequence, the application of KQR should greatly moderate the practical concern about model
misspecification in studying misrepresentation. Second, compared to mean regression, which is more familiar in
insurance circles, quantile regression is known to be more robust to the presence of outliers. Furthermore, in order
to ease the computational difficulties posed by large insurance data sets, we combine the state-of-the-art random sketch
technique with the conventional estimation method for KQR, which significantly reduces computing time
and memory requirements.
An innovative learning algorithm is introduced to implement the KQR-based misrepresentation model. The
algorithm not only finds non-parametric estimates of misrepresentation probability (which answers question Q1) as
well as the regression structure between insurance claims and relevant covariates, but also helps identify potential
misrepresenters (which answers question Q2). Based on extensive simulation studies, we show that the proposed
framework is able to accurately estimate the misrepresentation probability under various data generating processes
(DGPs), ranging from univariate linear models to multivariate non-linear models. Moreover, misrepresenters
are successfully detected with very low error rates. The proposed misrepresentation framework therefore delivers
a satisfactory performance under a variety of challenging situations. Finally, the proposed model is
applied to study the Medical Expenditure Panel Survey (MEPS) data. In line with the results of Akakpo et al.
(2019), our model suggests a significant percentage of respondents misrepresented on the self-reported insurance
status in 2014, in order to avoid the potential tax penalty. However, our estimated percentage is substantially lower
than that reported in Akakpo et al. (2019). This discrepancy may be attributed to the fact that the proposed KQR
method could better cope with the non-linear dependence structure inherent in the MEPS data, and thus leads to
a more reliable estimation of the misrepresentation probability.
The quantile regression based misrepresentation model
We cope with the loss modeling component of misrepresentation analysis using the notion of quantile regression,
championed by Koenker (see Koenker, 2005, for a comprehensive review). Different from most regression methodologies,
which center on the conditional mean, quantile regression supplements the statistical toolbox
with a general technique for estimating families of conditional quantile functions. Because quantile regression can
be applied to study the functional dependence of data across the lower, middle and upper tails of the response
distribution, it is capable of providing more distributional information than classical mean regression. In recent
years, quantile regression has become a very popular toolkit for applied actuaries and risk analysts to consider the
effect of covariates on the distribution of the dependent variable (Baione and Biancalana, 2019; Kudryavtsev, 2009;
Perez-Marın et al., 2019; Xia, 2014).
Misrepresentation model’s specification
We are now in a position to lay down the quantile regression based misrepresentation model of interest. Though
elementary, we recall that the quantile function associated with CDF G is defined as
G−1(τ) = inf{t ∈ R : G(t) ≥ τ}, with τ ∈ (0, 1).
If G is continuous, then the quantile function boils down to the ordinary inverse of CDF (Embrechts and Hofert,
2013).
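As a quick illustration of the generalized-inverse definition above, the following sketch evaluates G^{−1}(τ) = inf{t ∈ R : G(t) ≥ τ} against an empirical CDF; the helper name `quantile_inf` is ours, not the paper's.

```python
import numpy as np

def quantile_inf(sample, tau):
    """G^{-1}(tau) = inf{t : G(t) >= tau}, with G the empirical CDF of `sample`."""
    s = np.sort(np.asarray(sample, dtype=float))
    n = len(s)
    # the empirical CDF jumps to i/n at the i-th order statistic, so the
    # generalized inverse is the smallest order statistic with i/n >= tau
    k = int(np.ceil(tau * n)) - 1
    return s[max(k, 0)]

sample = [3.0, 1.0, 2.0, 4.0]
quantile_inf(sample, 0.5)   # -> 2.0: the empirical CDF first reaches 0.5 at t = 2
```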
We have already defined respectively zi and z∗i , i = 1, . . . , n, as the true and self-reported labels of the potentially
misrepresented rating factor. Given the true label zi = r with r ∈ {0, 1}, we model the functional dependence
between insurance claims and the correctly measured rating factors by
y_i − [b_r + h_r(x_i)] = ξ_i(τ) ∼iid G,   where b_r ∈ R, h_r(·) : R^d → R, and G^{−1}(τ) = 0 for some τ ∈ (0, 1).   (3)
Equivalently, the function fr := br+hr is the 100τ% quantile of the conditional distribution of y given (x, z = r) for
r ∈ {0, 1}. Recall that only z∗i is observable but not zi. We employ the hybrid structure proposed by Akakpo et al.
(2019) in order to address the unobservable nature of misrepresentation. Precisely, we model data (y_i, x_i, z∗_i = 1) by
y_i = f1(x_i) + ξ_i(τ), because no misrepresentation can occur when z∗_i = 1, and (y_i, x_i, z∗_i = 0) by the two-component
mixture

y_i = f0(x_i) + ξ_i(τ),   with probability 1 − π;
y_i = f1(x_i) + ξ_i(τ),   with probability π.   (4)
The mixture structure in Equation (4) is reminiscent of the parametric misrepresentation model in Equation (2).
In this proposed framework, the core mechanism for quantifying misrepresentation is to implement a mixture
structure on data with z∗i = 0. To make the proposed misrepresentation model well-defined, one crucial problem
that needs to be addressed is the identifiability of the embedded mixture structure. Under the Gaussian linear
regression model without considering other covariates except z, Akakpo et al. (2019) showed that the associated
misrepresentation model is identifiable (also see the theoretical discussion in Xia and Gustafson, 2016). In the
ensuing assertion, we extend the identifiability result to the more general quantile regression based misrepresentation
models, which may further involve an arbitrary number of correctly measured covariates.
Proposition 1. The proposed misrepresentation model is identifiable.
Proof. We begin by noting that the proposed misrepresentation model equivalently defines the following conditional
distribution of the data:

y | (x, z∗ = 1) =ᵈ ξ + f1(x),   with ξ ∼ G;
y | (x, z∗ = 0) =ᵈ ξ + f0(x) + ω [f1(x) − f0(x)],   with ξ ∼ G, ω ∼ Bernoulli(π), and ξ ⊥⊥ ω.   (5)
Thus, it suffices to prove that there does not exist a different choice of (G, f0, f1, π), under which (5) represents the
same conditional distribution.
We proceed by contradiction. Assume that such a choice exists, and denote it by (G̃, f̃0, f̃1, π̃).
By the first equation of (5), we have ξ + f1(x) =ᵈ ξ̃ + f̃1(x), where ξ̃ ∼ G̃. Since the 100τ% quantiles of both G
and G̃ are 0, we can claim that f1(x) = f̃1(x), which further implies G = G̃. Conditional on x and z∗ = 0, the
characteristic function of y satisfies

∫ e^{itu} dG(u) × [ (1 − π) e^{i f0(x) t} + π e^{i f1(x) t} ] = ∫ e^{itu} dG̃(u) × [ (1 − π̃) e^{i f̃0(x) t} + π̃ e^{i f̃1(x) t} ]

for all t ∈ R, where i denotes the imaginary unit. Since the characteristic function ∫ e^{itu} dG(u) is non-vanishing
in a region around 0, we must have

(1 − π) e^{i f0(x) t} + π e^{i f1(x) t} = (1 − π̃) e^{i f̃0(x) t} + π̃ e^{i f̃1(x) t},   for all t around 0.

Equivalently,

π e^{i [f1(x) − f0(x)] t} − π = π̃ e^{i [f̃1(x) − f̃0(x)] t} − π̃,   for all t around 0.   (6)

In order to match the imaginary and real parts of both sides of Equation (6), we must have π = π̃ and f0(x) = f̃0(x). We
have thus established the identifiability property of the proposed model, and the proof is complete.
Remark 1. Although Proposition 1 is established for quantile regression, the result can also be applied to
study the identifiability of misrepresentation models constructed via mean regression. To see this, simply
choose τ such that the 100τ% quantile of the residual distribution is equal to its mean.
Regression functions in reproducing kernel Hilbert spaces
We have not yet specified h_r, r ∈ {0, 1}, in claim model (3). One trivial way of enforcing the structure in (3) is to
restrict h_r to be linear, corresponding to the linear quantile regression (Koenker and Bassett, 1978) widely used
in the classical statistics literature. However, real data often exhibit non-linear functional dependence, and thus
relying solely on linear quantile regression may not alleviate the practical concern about model misspecification
mentioned in the Introduction. Preferably, we seek a data-driven approach for specifying h_r.
Estimating the regression function h_r from data seems reasonable, but in order to make it feasible, we need at least to
assume that the functions admit some regularity properties. Bearing in mind the call for model flexibility in studying
misrepresentation, the notion of kernel smoothing regression (Aronszajn, 1950; Gu, 2013; Wahba, 1990) is natural to
evoke. To be more specific, we assume the underlying regression function h_r belongs to some reproducing kernel
Hilbert space (RKHS) HK, which is generated by a reproducing kernel (RK) K : Rd × Rd → R. The reproducing
kernel K is symmetric and satisfies Σ_{i=1}^L Σ_{j=1}^L c_i c_j K(u_i, u_j) ≥ 0 for any positive integer L, c_i ∈ R and u_i ∈ R^d,
i = 1, . . . , L. The induced RKHS is defined as the closure of the function set

H_K = { h : h(·) = Σ_{i=1}^L θ_i K(·, u_i),  L ∈ Z+, u_i ∈ R^d, θ_i ∈ R, i = 1, . . . , L }.
Under a reasonable choice of RK, the associated RKHS covers a rich range of smooth functions that can be
non-linear, and thus serves as a better candidate model than linear quantile regression or other pre-specified parametric
structures. For instance, the RKHS induced by the p-degree polynomial kernel K(u, v) = (1 + ⟨u, v⟩)^p, for some integer
p ≥ 1 and u, v ∈ R^d, includes the entire family of smoothing spline, additive spline, and interaction spline models
(Wahba, 1990). Here, and elsewhere, we use ⟨·, ·⟩ to denote the inner product operator. Some other popular choices
of RK include, for any u = (u1, . . . , ud) ∈ R^d, v = (v1, . . . , vd) ∈ R^d, and a bandwidth parameter ϱ > 0:

ANOVA radial basis kernel:  K(u, v) = ( Σ_{i=1}^d exp{ −(u_i − v_i)² / ϱ } )^d;
Gaussian kernel:  K(u, v) = exp{ −‖u − v‖² / ϱ };
Exponential kernel:  K(u, v) = exp{ ⟨u, v⟩ }.
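The three kernels listed above translate directly into code; the sketch below (function names ours, bandwidth ϱ written as `rho`) evaluates each at a pair of points.

```python
import numpy as np

def gaussian_kernel(u, v, rho=1.0):
    # K(u, v) = exp{-||u - v||^2 / rho}
    return np.exp(-np.sum((u - v) ** 2) / rho)

def anova_rbf_kernel(u, v, rho=1.0):
    # K(u, v) = (sum_i exp{-(u_i - v_i)^2 / rho})^d
    d = len(u)
    return np.sum(np.exp(-(u - v) ** 2 / rho)) ** d

def exponential_kernel(u, v):
    # K(u, v) = exp{<u, v>}
    return np.exp(np.dot(u, v))

u = np.array([0.0, 1.0])
v = np.array([0.0, 1.0])
gaussian_kernel(u, v)   # -> 1.0, since u = v
```

Symmetry of each kernel, K(u, v) = K(v, u), follows immediately from the formulas.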
More to the point, the Gaussian kernel mentioned above possesses the universal approximation property, meaning
that the induced RKHS is dense in the space of all continuous functions with compact support (see Micchelli
et al., 2006, for a more rigorous discussion). Therefore, at least in principle, the RKHS defined by the Gaussian
kernel is capable of approximating any continuous regression surface, or any regression surface that can be approximated
by continuous functions. For this reason, along with its mathematical tractability, the Gaussian kernel is
widely accepted as “the default” option across a broad spectrum of non-parametric analyses when there is no prior
knowledge guiding the kernel choice.
We close this section with an open discussion of the key differences between
the proposed KQR misrepresentation model and the Gaussian linear misrepresentation model considered in Akakpo
et al. (2019). First, we use quantile regression instead of traditional mean regression to model insurance
misrepresentation, a first attempt in the related literature. Second, the kernel method embedded in
the proposed misrepresentation model relaxes the linear constraint on the regression structure in Akakpo et al.
(2019), and thus our claim model is likely to suit a wider range of misrepresentation analysis tasks. Third, the
model of Akakpo et al. (2019) assumes that the potentially misrepresented variable affects the response distribution
through a parallel shift only, whereas in real-life data this assumption might not hold. In our misrepresentation model,
the regression structure can be distinct between z = 0 and z = 1 (see Equation (3)), yielding a more flexible
and realistic modeling framework. All in all, the aforementioned features of the proposed KQR framework
keep model risk at a minimum and contribute to a robust estimate of misrepresentation that is resistant
to outliers.
Estimation procedure
To implement the KQR-based misrepresentation model, we seek a workable scheme to estimate the misrepresentation
probability π, as well as regression functions f0 and f1 belonging to some given RKHS HK. If the true label zi is
known, then a natural approach for estimating f_r, r ∈ {0, 1}, is through regularization. Specifically, just as
minimizing the ℓ1 loss for a location estimator yields the median, Koenker and Bassett (1978) generalized
this idea to obtain a regression estimate for any quantile through the pinball loss function: for any τ ∈ (0, 1) and
t ∈ R,

ρτ(t) = τ |t|,  if t ≥ 0;   ρτ(t) = (1 − τ) |t|,  if t < 0.   (7)
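The pinball loss in Equation (7) is straightforward to implement, and its defining property, that the constant minimizing the empirical pinball risk is the τ-quantile, can be checked numerically; the grid search below is purely illustrative.

```python
import numpy as np

def pinball_loss(t, tau):
    """rho_tau(t) = tau * |t| if t >= 0, and (1 - tau) * |t| if t < 0."""
    t = np.asarray(t, dtype=float)
    return np.where(t >= 0, tau * np.abs(t), (1 - tau) * np.abs(t))

# minimizing the mean pinball loss over a constant recovers the tau-quantile:
rng = np.random.default_rng(1)
sample = rng.normal(size=10_000)
grid = np.linspace(-3.0, 3.0, 601)
risks = [pinball_loss(sample - c, 0.9).mean() for c in grid]
c_star = grid[int(np.argmin(risks))]   # close to the 90% normal quantile (~1.28)
```

The asymmetric weights τ and 1 − τ are what tilt the minimizer away from the median toward the desired quantile.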
The pinball loss function spells out the key difference between the traditional mean regression and the quantile regression
considered herein. To be specific, a mean regression estimate is obtained by minimizing the expected squared loss,
while for a quantile regression estimate, the minimization problem is based on the weighted absolute loss defined
in Equation (7). Thereby, quantile regression is known to be more robust to the presence of outliers than mean
regression. Furthermore, denote the admissible function set in the RKHS by F_K = {f = b + h : b ∈ R, h ∈ H_K},
where b is the intercept of the regression function, and H_K is an RKHS over x_i, i = 1, . . . , n, generated by RK K.
Then, quantile regression f_r ∈ F_K, r ∈ {0, 1}, can be solved by minimizing the empirical risk plus a regularizer, such
that

f̂_r = argmin_{f ∈ F_K} { (1 / Σ_{i=1}^n 1{z_i = r}) Σ_{i=1}^n 1{z_i = r} ρτ(y_i − f(x_i)) + λ ‖h‖²_{H_K} },   (8)
where 1{·} denotes the indicator function and λ > 0 is a hyper-parameter corresponding to the Hilbert-space norm
‖·‖_{H_K} (Li et al., 2007; Takeuchi et al., 2006). This expression represents a trade-off between fidelity to the data,
as represented by the weighted sum of absolute residuals, and plausibility of the solution, as represented by the
norm of the function in the Hilbert space.
However, in the context of misrepresentation modeling, we only have (y_i, x_i, z∗_i) at our disposal, and the true label
z_i is unknown. Hence, the optimization in Equation (8) cannot be directly solved to obtain an estimate of f_r,
r ∈ {0, 1}. In what follows, we treat misrepresentation modeling as a missing data problem and
develop an iterative algorithm to circumvent this difficulty.
Expectation-Smoothing algorithm
When dealing with missing data in a parametric setting, the Expectation-Maximization (EM) algorithm is frequently
adopted to compute maximum likelihood estimates. The iterative algorithm which we are going to introduce for
estimating the KQR misrepresentation model is a non-parametric analogue of the EM algorithm. Our idea is
motivated by Wu and Yao (2016), wherein the estimation of linear quantile regression mixtures is explored. We
extend the method of Wu and Yao (2016) to the context of non-parametric quantile regression mixtures. Similar
to the EM algorithm, the proposed algorithm for KQR mixtures iterates between an expectation step (E-step)
and a smoothing step (S-step), which handle the missing information and the model learning, respectively. For this
reason, we term the algorithm of interest the Expectation-Smoothing (ES) algorithm.
For notational convenience, denote by Φ = (π, f0, f1) the set of parameters to be estimated. With some
reasonable initialization, the iterative two-step procedure of ES algorithm is described formally in the sequel.
1. E-step: For s ∈ N, let Φ^{(s)} = (π^{(s)}, f0^{(s)}, f1^{(s)}) be the updated estimate of Φ obtained from the s-th iteration,
and let e^{(s)}_{i,r} = y_i − f_r^{(s)}(x_i), i = 1, . . . , n, r ∈ {0, 1}, be the associated residuals. The aim of the E-step is to
compute the posterior mean of the guess of the true label z_i. That is, for any i ∈ {l : z∗_l = 0}, we calculate

p^{(s+1)}_{i,1} = E[z_i ; Φ^{(s)}] = P[z_i = 1 ; Φ^{(s)}] = π^{(s)} g^{(s)}(e^{(s)}_{i,1}) / [ (1 − π^{(s)}) g^{(s)}(e^{(s)}_{i,0}) + π^{(s)} g^{(s)}(e^{(s)}_{i,1}) ]   (9)

and

p^{(s+1)}_{i,0} = P[z_i = 0 ; Φ^{(s)}] = 1 − p^{(s+1)}_{i,1},   (10)
where g denotes the density function of residual distribution G, and g(s) is the corresponding estimate. For
any i ∈ {l : z∗l = 1}, because misrepresentation does not occur in this instance, we have pi,1 = 1 and pi,0 = 0.
We do not intend to assign a parametric form to density function g, and it can be estimated from residuals
e(s)i,r via constrained weighted kernel density estimation (Wu and Yao, 2016). Specifically, let K : R → R
be some kernel function with unbounded support (e.g., Gaussian kernel), and define the shorthand notation
K(t ; η) = η−1K(t/η), where t ∈ R, and η > 0 is a bandwidth parameter. The constrained kernel density
estimate of g can be implemented via
g^{(s)}(t) = α^{(s)} Σ_{r=0,1} Σ_{i=1}^n 1{e^{(s)}_{i,r} ≤ 0} p^{(s)}_{i,r} K(t − e^{(s)}_{i,r} ; η) + γ^{(s)} Σ_{r=0,1} Σ_{i=1}^n 1{e^{(s)}_{i,r} > 0} p^{(s)}_{i,r} K(t − e^{(s)}_{i,r} ; η),   t ∈ R,   (11)

in which the constants α^{(s)} and γ^{(s)} are chosen such that

∫_{−∞}^{∞} g^{(s)}(t) dt = 1   and   ∫_{−∞}^{0} g^{(s)}(t) dt = τ.

Equivalently, the constants α^{(s)} and γ^{(s)} satisfy the system of linear equations

α^{(s)} Σ_{r,i} 1{e^{(s)}_{i,r} ≤ 0} p^{(s)}_{i,r} + γ^{(s)} Σ_{r,i} 1{e^{(s)}_{i,r} > 0} p^{(s)}_{i,r} = 1,
α^{(s)} Σ_{r,i} 1{e^{(s)}_{i,r} ≤ 0} p^{(s)}_{i,r} ν^{(s)}_{i,r} + γ^{(s)} Σ_{r,i} 1{e^{(s)}_{i,r} > 0} p^{(s)}_{i,r} ν^{(s)}_{i,r} = τ,

where ν^{(s)}_{i,r} = ∫_{−∞}^{0} K(t − e^{(s)}_{i,r} ; η) dt for i = 1, . . . , n and r ∈ {0, 1}.
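A compact sketch of the E-step, combining Equations (9)–(11), is given below. For illustration, the previous-step weights p^{(s)}_{i,r} are replaced by a flat initialization, the smoothing kernel K(·; η) is Gaussian (so that ν_{i,r} has the closed form Φ(−e_{i,r}/η)), and the bandwidth η is a hypothetical choice; this is a sketch under those assumptions, not the authors' implementation.

```python
import numpy as np
from math import erf

def gauss_kernel(t, eta):
    # K(t; eta) = eta^{-1} K(t / eta) with a standard Gaussian K
    return np.exp(-0.5 * (t / eta) ** 2) / (eta * np.sqrt(2.0 * np.pi))

def std_normal_cdf(u):
    return 0.5 * (1.0 + np.vectorize(erf)(u / np.sqrt(2.0)))

def e_step(y, x, z_star, f0, f1, pi, tau, eta=0.3):
    """One E-step: posterior weights (9)-(10) with the constrained KDE (11).
    f0, f1 are the current quantile-regression estimates, passed as callables."""
    e = np.stack([y - f0(x), y - f1(x)], axis=1)   # residuals e_{i,r}, r = 0, 1
    p = np.full_like(e, 0.5)                       # flat stand-in for p^{(s)}_{i,r}
    p[z_star == 1] = [0.0, 1.0]                    # no misrepresentation if z* = 1

    neg, pos = (e <= 0), (e > 0)
    # nu_{i,r} = integral of K(t - e_{i,r}; eta) over t in (-inf, 0]
    nu = std_normal_cdf(-e / eta)
    # alpha, gamma chosen so that the KDE integrates to 1 and has tau-quantile 0
    A = np.array([[(p * neg).sum(),       (p * pos).sum()],
                  [(p * neg * nu).sum(),  (p * pos * nu).sum()]])
    alpha, gamma = np.linalg.solve(A, np.array([1.0, tau]))

    def g_hat(t):                                  # Equation (11)
        k = gauss_kernel(t - e, eta)
        return alpha * (p * neg * k).sum() + gamma * (p * pos * k).sum()

    g0 = np.array([g_hat(t) for t in e[:, 0]])
    g1 = np.array([g_hat(t) for t in e[:, 1]])
    p1 = np.where(z_star == 0,
                  pi * g1 / ((1.0 - pi) * g0 + pi * g1),   # Equation (9)
                  1.0)
    return p1, g_hat
```

By construction, the returned density estimate integrates to one and places mass τ below zero, exactly the two constraints on α^{(s)} and γ^{(s)}.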
2. S-step: This step serves a role similar to that of the M-step in the EM algorithm, which aims to obtain the most
plausible estimate given the current posterior distribution of the missing data. Specifically, the estimate of the
misrepresentation probability π is updated via

π^{(s+1)} = Σ_{i=1}^n 1{z∗_i = 0} p^{(s+1)}_{i,1} / Σ_{i=1}^n 1{z∗_i = 0}.   (12)

The estimate of the quantile regression function f_r, r ∈ {0, 1}, is obtained through

f^{(s+1)}_r = argmin_{f ∈ F_K} { (1 / (n p̄^{(s+1)}_r)) Σ_{i=1}^n p^{(s+1)}_{i,r} ρτ(y_i − f(x_i)) + λ ‖h‖²_{H_K} },   (13)

where p̄^{(s+1)}_r = Σ_{i=1}^n p^{(s+1)}_{i,r} / n. The optimization problem (13) can be viewed as a weighted variant of
the penalized RKHS regression considered in Equation (8). By solving the optimization problem (13), the
regression models are trained according to weighted empirical risk functions in which the posterior weights
p_{i,r}, i = 1, . . . , n, r ∈ {0, 1}, reflect how likely misrepresentation is to have occurred in each observation.
At first sight, solving the optimization problem (13) seems rather difficult in its current form. But
thanks to the celebrated representer theorem (Kimeldorf and Wahba, 1971), the solution to Equation (13)
can be written as f^{(s+1)}_r(·) = b^{(s+1)}_r + h^{(s+1)}_r(·) = b^{(s+1)}_r + Σ_{i=1}^n θ^{(s+1)}_{i,r} K(·, x_i) for r ∈ {0, 1}, such that

(b^{(s+1)}_r, θ^{(s+1)}_r)⊤ = argmin_{(b,θ)⊤ ∈ R^{n+1}} { (1 / (n p̄^{(s+1)}_r)) p^{(s+1)}_r⊤ ρτ(y − b1 − Kθ) + λ θ⊤Kθ },   (14)

where θ^{(s+1)}_r = (θ^{(s+1)}_{1,r}, . . . , θ^{(s+1)}_{n,r})⊤, p^{(s+1)}_r = (p^{(s+1)}_{1,r}, . . . , p^{(s+1)}_{n,r})⊤, K ∈ R^{n×n} has elements k_{i,j} =
K(x_i, x_j), y = (y1, . . . , yn)⊤, 1 = (1, . . . , 1)⊤, and, by a slight abuse of notation, ρτ(u) = (ρτ(u1), . . . , ρτ(un))⊤
denotes the element-wise application of the function ρτ to a vector u = (u1, . . . , un)⊤ ∈ R^n. Now, we can apply either
the pseudo-data method (Nychka et al., 1995; Yuan, 2006) or general-purpose numerical optimizers to solve
the optimization in Equation (14). A more detailed discussion of the pseudo-data method is relegated to
Appendix A in order to facilitate the reading herein.
It is noteworthy that the ES algorithm under consideration is not designed to optimize a specific objective
function, so we declare convergence of the ES algorithm when the relative changes in the parameter estimates between two
consecutive steps fall below some user-specified thresholds.
Application of random sketch for big data analysis
The evolutionary success of actuarial and statistical methods is to a significant degree determined by considerations
of computational convenience. The proposed KQR-based misrepresentation model provides an appreciable improvement
over the Gaussian linear misrepresentation model of Akakpo et al. (2019) in terms of statistical flexibility and
robustness, however at the expense of intense computation. In particular, if the pseudo-data method is used, then
solving the optimization problem (14) involves repeatedly inverting an n-by-n matrix
(see Equation (25) in Appendix A). In the current big data era, insurance records may contain millions of
observations, which makes a naive implementation of the method prohibitive. In a similar vein, applications
of general-purpose numerical optimizers to the optimization problem (14) suffer from the same scalability
issue.
To overcome the computational issue, we will take advantage of the state-of-the-art statistical learning technique,
random sketch (Halko et al., 2011; Mahoney, 2011; Pilanci and Wainwright, 2016; Raskutti and Mahoney, 2016;
Yang et al., 2017, etc.). The key idea of random sketching is to construct a small “sketch” of the whole data set via
random sampling or random projection, and then use this sketch as a surrogate for the full data set in the computations
of interest. In the context of KQR, the random sketch approach can be applied
on the data kernel matrix K to perform dimension reduction. To be more specific, we consider a low-rank
approximation to the optimizer of (14) in the S-step, restricting the coefficient vector in (14) to take the form
S^⊤ θ_r^{(s+1)} for some randomly generated projection matrix S ∈ R^{m×n} with m ≪ n and
θ_r = (θ_{1,r}, …, θ_{m,r})^⊤ ∈ R^m. In other words, the coefficient vector lies in the m-dimensional subspace
spanned by the rows of S. Under this imposed low-rank restriction, the optimization (14) therefore reduces to

(b_r^{(s+1)}, θ_r^{(s+1)})^⊤ = argmin_{(b,θ) ∈ R^{m+1}} { (1/(n p̄_r^{(s+1)})) (p_r^{(s+1)})^⊤ ρ_τ(y − b1 − K S^⊤ θ) + λ θ^⊤ S K S^⊤ θ }.   (15)
The computational expense of (15) mostly lies in inverting the m-by-m sketched kernel matrix S K S^⊤; hence we reduce
the big data problem to a small-scale optimization problem.
An appropriate choice of the random sketch matrix S (e.g., sub-Gaussian sketch, randomized orthogonal system
sketch, or sub-sampling sketch) is capable of preserving as much of the data information as possible, such that
the difference between the solutions of (14) and (15) is small. At the same time, the application of random sketch
reduces the regression estimation in Equation (15) to an m-dimensional convex programming problem, which can be
solved much faster than that of Equation (14). As shown by Yang et al. (2017), the projection dimension m can be
set as small as m ∝ (log n)^{3/2} for RKHS-based regression. Thereby, the computational burden involved in the
proposed KQR misrepresentation model is much relieved for massive data applications.
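To illustrate the dimension-reduction mechanics, the sketch below (all names are ours) replaces the pinball loss with a squared loss so that both the full and the sketched problems admit closed-form solutions; the roles of S and of the sketched matrices K S^⊤ and S K S^⊤ are exactly as described above, but this is only an illustrative stand-in, not the paper's quantile-loss estimator.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a smooth univariate signal plus noise.
n, lam = 500, 1e-3
m = 47                                   # roughly ceil(3 * (log n)^(3/2)), per Yang et al. (2017)
x = rng.uniform(1, 10, n)
y = 20 * x / (2 + x) + rng.standard_normal(n)

K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 5.0)   # Gaussian kernel matrix
S = rng.standard_normal((m, n)) / np.sqrt(m)        # sub-Gaussian sketch

# Full problem: minimize ||y - K a||^2 + n*lam * a' K a  ->  an n x n solve.
a_full = np.linalg.solve(K + n * lam * np.eye(n), y)
f_full = K @ a_full

# Sketched problem: restrict a = S' theta  ->  only an m x m solve.
KS = K @ S.T                                        # n x m
A = KS.T @ KS + n * lam * (S @ KS)                  # m x m
theta = np.linalg.solve(A, KS.T @ y)
f_sketch = KS @ theta

rel_err = np.linalg.norm(f_full - f_sketch) / np.linalg.norm(f_full)
print(f"relative difference between full and sketched fits: {rel_err:.3f}")
```

With m chosen on the (log n)^{3/2} scale, the sketched fit is close to the full fit while the dominant linear solve shrinks from n x n to m x m.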
Hyper-parameter tuning
Thus far, we have solved the estimation problem for a prefixed regularization parameter λ. As in any smoothing
problem, the choice of the regularization parameter λ plays a critical role in the successful implementation of the KQR-based
misrepresentation model. For quantile regression, a common criterion for tuning λ is based on the robust
cross-validation score (i.e., leave-one-out cross-validation with the pinball loss function (6) as the error measure; see Li
et al., 2007; Nychka et al., 1995; Yuan, 2006). To account for the mixture structure embedded in the KQR-based
misrepresentation model, we consider a conditional counterpart of the robust cross-validation score, termed the
conditional cross-validation (CCV) score, formulated as
CCV(λ) = (1/(n p̄_0)) Σ_{i=1}^n p_{i,0} ρ_τ(y_i − f_0^{[−i]}(x_i)) + (1/(n p̄_1)) Σ_{i=1}^n p_{i,1} ρ_τ(y_i − f_1^{[−i]}(x_i)),   (16)

for given p_{i,r} ∈ [0, 1], i = 1, …, n, r ∈ {0, 1}, and p̄_r = Σ_{i=1}^n p_{i,r}/n. Given the weights p_r = (p_{1,r}, …, p_{n,r}) for
r ∈ {0, 1}, the regression function f_r^{[−i]} is estimated in the same manner as f_r except that the i-th sample is excluded.
The posterior weights p_r can be computed in the final iteration of the proposed ES algorithm. The rationale
behind the CCV score is that a reasonable choice of λ should stimulate small prediction errors in the group to
which observations tend to belong (i.e., z = 0 or 1), and the ratios 1/(n p̄_r), r ∈ {0, 1}, are used to normalize the
corresponding quantities.
Despite its intuitive interpretation, evaluating the CCV score is onerous because, for every candidate in a pre-specified
finite set of values for λ, we have to estimate n pairs of quantile regressions, i.e., f_r^{[−i]}, i = 1, …, n,
r ∈ {0, 1}. In order to reduce this computational burden, we adopt the approximation idea proposed
by Yuan (2006) and extend it to suit the CCV score (16). Throughout the rest of this section, we briefly describe
the approximation method for evaluating the CCV score while relegating the details to Appendix B, so that our major focus
on misrepresentation modeling can be maintained.
Our route for evaluating (16) hinges on the conditional leave-one-out lemma (see Lemma 2 in Appendix B),
which implies the approximation:

f_r(x_i) − f_r^{[−i]}(x_i) ≈ (∂f_r(x_i)/∂y_i) [y_i − f_r^{[−i]}(x_i)], for i = 1, …, n and r ∈ {0, 1}.
Then approximating the CCV score (16) by a first-order Taylor polynomial yields

(1/(n p̄_0)) Σ_{i=1}^n p_{i,0} ρ_τ(y_i − f_0^{[−i]}(x_i)) + (1/(n p̄_1)) Σ_{i=1}^n p_{i,1} ρ_τ(y_i − f_1^{[−i]}(x_i))
    ≈ (1/(n p̄_0)) Σ_{i=1}^n p_{i,0} ρ_τ(y_i − f_0(x_i)) / (1 − ψ_{i,0}) + (1/(n p̄_1)) Σ_{i=1}^n p_{i,1} ρ_τ(y_i − f_1(x_i)) / (1 − ψ_{i,1}),
where ψ_{i,r} = ∂f_r(x_i)/∂y_i for i = 1, …, n, r ∈ {0, 1}. The partial derivatives ψ_{i,r} can be computed via the diagonal
elements of the hat matrix associated with the weighted quantile smoothing (see Equation (29) in Appendix B).
The expression above is much cheaper to compute than its original form (16). However, Yuan (2006) argued
that the first-order Taylor polynomial approximation may perform poorly when ∂f_r(x_i)/∂y_i is around zero, and
thus suggested replacing the individual partial derivatives by their weighted average ψ̄_r = Σ_{i=1}^n p_{i,r} ψ_{i,r} / (n p̄_r),
r ∈ {0, 1}. Invoking this idea leads us to the conditional approximate cross-validation (CACV)
score:
CACV(λ) = (1/(n p̄_0 (1 − ψ̄_0))) Σ_{i=1}^n p_{i,0} ρ_τ(y_i − f_0(x_i)) + (1/(n p̄_1 (1 − ψ̄_1))) Σ_{i=1}^n p_{i,1} ρ_τ(y_i − f_1(x_i)).   (17)
To identify the most reasonable λ within a pre-specified set of possible values, we search for the candidate value
that leads to the lowest CACV score.
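Given the fitted quantities, a CACV-style score in the spirit of Equation (17) can be evaluated as follows; the function and argument names are ours, and f0, f1 stand in for the fitted KQR estimates evaluated at the observed x_i:

```python
import numpy as np

def pinball(u, tau):
    """Element-wise pinball loss."""
    return u * (tau - (u < 0).astype(float))

def cacv_score(y, f0, f1, p0, p1, psi_bar0, psi_bar1, tau=0.5):
    """CACV-style score.

    y          : responses
    f0, f1     : fitted quantile regressions evaluated at the x_i
    p0, p1     : posterior weights p_{i,0} and p_{i,1}
    psi_bar0/1 : weighted averages of the partial derivatives d f_r(x_i) / d y_i
    """
    n = len(y)
    pbar0, pbar1 = p0.mean(), p1.mean()
    term0 = (p0 * pinball(y - f0, tau)).sum() / (n * pbar0 * (1 - psi_bar0))
    term1 = (p1 * pinball(y - f1, tau)).sum() / (n * pbar1 * (1 - psi_bar1))
    return term0 + term1

# Hypothetical tuning loop: pick the candidate lambda with the lowest score.
# best_lam = min(candidates, key=lambda lam: cacv_score(*fit_for(lam)))
```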
Recommendations on misrepresentation validation
Algorithm 1 summarizes all the statistical considerations discussed thus far for estimating the KQR-based
misrepresentation model. Let us remark that, upon convergence of the ES algorithm, we are able to obtain the
parameter estimate Φ = (π, f_0, f_1) = (π^{(∞)}, f_0^{(∞)}, f_1^{(∞)}), the posterior probability vectors p_0^{(∞)} and p_1^{(∞)}, the residuals
e_0^{(∞)} = (e_{1,0}^{(∞)}, …, e_{n,0}^{(∞)}) and e_1^{(∞)} = (e_{1,1}^{(∞)}, …, e_{n,1}^{(∞)}), and the residual distribution estimate g^{(∞)}. The estimate π can
be used to answer question Q1 posed in the Introduction.
Data: n-dimensional vectors of the response y and the reported status z*; an n × d characteristics matrix X; an initial guess of the misrepresentation probability π^{(0)}; l ∈ N candidates of the regularization parameter λ_1, …, λ_l; a stopping criterion ε
Result: Estimated misrepresentation probability π and quantile regressions f_r, r ∈ {0, 1}
1   Specify the kernel matrix K and generate the random sketch matrix S;
2   for j ← 1 to l do
3       Set the regularization parameter λ = λ_j;
4       Initialize the regression estimates f_0^{(0)} and f_1^{(0)} pretending that no misrepresentation occurs (i.e., set z_i = z*_i);
5       Set s ← 0;
6       repeat
7           Calculate the posterior probabilities p_0^{(s+1)} and p_1^{(s+1)} based on Equations (9) and (10);
8           Update the misrepresentation probability estimate π^{(s+1)} using Equation (12);
9           Update the regression estimates f_0^{(s+1)} and f_1^{(s+1)} using Equation (15);
10          s ← s + 1;
11      until |π^{(s+1)} − π^{(s)}| < ε;
12      Compute CACV(λ_j) according to Equation (17);
13  end
14  Find λ* ∈ {λ_1, …, λ_l} corresponding to the lowest CACV score;
15  return the misrepresentation model estimate (π^{(s+1)}, f_0^{(s+1)}, f_1^{(s+1)}) associated with λ*
Algorithm 1: Algorithm for estimating the proposed KQR-based misrepresentation model.
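The control flow of Algorithm 1 can be illustrated with a deliberately simplified, self-contained ES loop. In this sketch the two KQR fits are replaced by weighted medians (constant "regression curves") and the kernel density estimate g by a fixed Laplace density; these stand-ins are ours, so the code shows the E-step/S-step structure rather than the paper's actual estimator.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data: all individuals report z* = 0, but a latent fraction
# true_pi actually has z = 1 (a higher location).
n, true_pi = 2000, 0.15
z = (rng.uniform(size=n) < true_pi).astype(int)
y = np.where(z == 0, 5.0, 10.0) + rng.laplace(scale=1.0, size=n)

def g(u):                          # residual density, assumed known here
    return 0.5 * np.exp(-np.abs(u))

def weighted_median(y, w):
    order = np.argsort(y)
    cw = np.cumsum(w[order]) / w.sum()
    return y[order][np.searchsorted(cw, 0.5)]

pi, f0, f1, eps = 0.05, np.median(y), np.median(y) + 1.0, 1e-6
for _ in range(500):
    # E-step analogue (roles of Equations (9) and (10)): posterior of z = 1.
    num1 = pi * g(y - f1)
    p1 = num1 / ((1 - pi) * g(y - f0) + num1)
    # S-step analogue (roles of Equations (12) and (15)).
    pi_new = p1.mean()
    f0, f1 = weighted_median(y, 1 - p1), weighted_median(y, p1)
    converged = abs(pi_new - pi) < eps
    pi = pi_new
    if converged:
        break

print(f"estimated misrepresentation probability: {pi:.3f}")  # close to 0.15
```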
In order to answer the remaining question Q2, we introduce a probability-based approach and a distance-based
approach to identify the v = a π Σ_{i=1}^n 1{z*_i = 0} most doubtful individuals for the validation purpose. We term
the magnitude of v the validation size in the misrepresentation investigation. The parameter a > 0 in the above
expression of v controls the validation size according to the risk attitude of the analyst, and it should be determined
subjectively. To identify potential misrepresenters, we look for observations that reported a negative status on z but
behave very differently from the others in the corresponding group. Accordingly, among the observations with
z* = 0, the probability-based approach selects the most doubtful individuals according to the v largest posterior
probabilities p_1^{(∞)}, whereas the distance-based approach adopts the v largest residuals e_0^{(∞)}.
Note that the task of misrepresenter identification is akin to conducting a set of independent hypothesis tests on
every sample with z∗ = 0 as to whether or not misrepresentation occurs. For this reason, we appeal to the language
of statistical testing for classifying the two types of error that will be encountered in the misrepresenter identification
process. The type I error corresponds to the instance wherein an honest individual is falsely labeled as a doubtful
misrepresenter, while the type II error occurs when a misrepresenter goes undetected. For the sake of prudence,
an ideal misrepresenter identification method should minimize the type II error rate while keeping the validation
size and the type I error rate at reasonable levels. Concerning the type II error, the aforementioned probability-based
and distance-based approaches each have their own drawbacks. For the probability-based approach, recall that
the calculation of the posterior probability hinges on the estimated residual distribution g^{(∞)} (see Equation (9)).
The kernel density method used to obtain g^{(∞)} may perform unsatisfactorily in the tail portions. Hence, when
the residuals e_{i,0}^{(∞)} and e_{i,1}^{(∞)} are extreme and fall in the tails of the residual distribution, relying on posterior
probabilities for assessing misrepresentation may not be reliable. On the other hand, the distance-based approach
is better at identifying the misrepresenters that are far away from the regression curves/surfaces, but may overlook
doubtful individuals when the two regression curves/surfaces f_0^{(∞)} and f_1^{(∞)} are close, so that the magnitudes
of e_{i,0}^{(∞)} and e_{i,1}^{(∞)} are both small.
The following toy example, adopted from Example 3 in the Simulation section, demonstrates the
respective drawbacks of the probability-based and distance-based approaches in an illuminating manner. Assume
that the underlying data generating process (DGP) follows

y_i = (1 − z_i)[20 x_i (2 + x_i)^{−1}] + z_i[40 x_i (4 + x_i)^{−1}] + ξ_i,

where the x_i's are independent realizations of Uniform[1, 10] and the ξ_i's are independently distributed Student's t variables
with 5 degrees of freedom. Furthermore, we assume that 90% of the individuals report a negative risk status on
z, among whom 15% misrepresent. We generate 1,000 samples and pretend that the true model as well as the true
label z are unknown. Invoking Algorithm 1 on the simulated data, which contain {y_i, x_i, z*_i}, i = 1, …, 1,000, yields
an estimate of the misrepresentation probability π = 14.3%. In this toy example, we simply set the control parameter
a = 1 in determining the validation size. The estimated residual distribution and misrepresenter identification are
displayed in Figures 1 and 2, respectively.
As shown in Figure 1, the kernel density estimate g^{(∞)} successfully recovers the overall shape of the residual
distribution g, except that there exist some minor yet still noticeable discrepancies in the tails of the distribution.
Such a deficiency inherent in the kernel density estimate is caused by the scarcity of data falling in
the tail regions. Consequently, the associated posterior probabilities may mislead the misrepresenter identification
when observations are far away from both regression curves (e.g., see the left panel of Figure 2 for the
undetected misrepresenters above the regression curve of z = 1). The right panel of Figure 2 shows that
the distance-based approach does not suffer from this drawback of the probability-based approach,
but relying on the v largest residuals may overlook misrepresenters when the two regression curves are close
(e.g., see the right panel of Figure 2 for the undetected misrepresenters around x = 1 and 2).
For a working resolution of the aforementioned problems, we suggest in the following section to consider the
union of doubtful individuals identified by either the probability-based approach or the distance-based approach. This
Figure 1: Density comparison (left panel, kernel density estimate versus true density function) and QQ-plot (right panel, kernel-based empirical quantiles versus theoretical quantiles) between g^{(∞)} and g, when the KQR-based misrepresentation model is fitted to 1,000 simulated data according to Example 3.
union method helps lower the type II error rate at the risk of slightly increasing the validation size and the type
I error rate. More sophisticated misrepresentation discovery methods which can better balance the type I and II
errors will be a very interesting direction for future research.
Simulation Study
Several simulation examples are presented in this section to illustrate the effectiveness of the ES algorithm
for implementing the KQR-based misrepresentation model. We consider DGPs consisting of linear and
non-linear regression structures. We start with univariate cases in Examples 1 to 4, which allow us to visually
inspect the proposed model's performance and limitations; we then consider a more general
multivariate case in Example 5. Although these examples do not replicate real-life insurance data, they are
designed to showcase the usefulness of the KQR-based misrepresentation model even when non-linear and
interactive dependencies are involved in the DGP.
The simulations in the subsequent examples are based on the probability of reported negative risk status
P(z* = 0) = 0.9 and the misrepresentation probability P(z = 1 | z* = 0) = 0.15. For the residual distribution, we assume the central
Student's t-distribution with dispersion parameter σ = 1 and 5 degrees of freedom, in order to emulate the
heavy-tailed phenomena that often exist in insurance data. The five examples are given as follows.
Figure 2: The scatter plots of 1,000 simulated data according to Example 3, superimposed by the true (solid) and fitted (dashed) regression curves for z = 0 and z = 1. The potential misrepresenters are identified respectively according to the probability-based approach (left panel) and the distance-based approach (right panel). The signs "+", "◦" and "∗" represent the observations having truly positive risk status, truly negative risk status, and misrepresented risk status on z, respectively. Observations marked with the sign "□" are honest individuals falsely identified as doubtful misrepresenters (i.e., the type I error). Misrepresenters marked with the sign "△" are undetected (i.e., the type II error), and are successfully discovered otherwise.
Example 1. This is a univariate example with a linear regression setting:
yi = (1− zi) (5 + xi) + zi (10 + xi) + ξi, i = 1, . . . , n,
where xi’s are independent samples of Uniform[1 , 10].
The setup in Example 1 complies with the misrepresentation model in Akakpo et al. (2019), except that a more
flexible Student's t-distribution is assumed for the residual errors. The next example is a slight extension of Example
1, in which the slope coefficients of the regression models for z = 0 and z = 1 differ, so the effect of z
on y goes beyond a parallel shift.
Example 2. Consider
yi = (1− zi) (5 + xi) + zi (10 + 2xi) + ξi, i = 1, . . . , n,
where xi’s are independent samples of Uniform[1 , 10].
Estimating the misrepresentation probability under the linear setting in Example 2 already exceeds the capacity
of the misrepresentation model considered in Akakpo et al. (2019). However, it is natural to expect that the
proposed KQR misrepresentation model can handle this case well. Next, we proceed to examples involving
non-linear dependence in the regression structures.
Example 3. Consider that the regression structure of the DGP admits a polynomial ratio form:

y_i = (1 − z_i)[20 x_i (2 + x_i)^{−1}] + z_i[40 x_i (4 + x_i)^{−1}] + ξ_i, i = 1, …, n,

where the x_i's are independent samples of Uniform[1, 10].
Example 4. Consider that the regression structure of the DGP has the form of an exponential-polynomial product:

y_i = 5 + (1 − z_i)[4 exp(0.125 x_i) x_i^{−1}] + z_i[8 exp(0.25 x_i) x_i^{−1}] + ξ_i, i = 1, …, n,

where the x_i's are independent samples of Uniform[1, 10].
Our final example considers the most general setup which contains multiple covariates and complex functional
dependence.
Example 5. Assume that the DGP possesses a regression structure in which non-linear, interactive, and additive
dependencies are mixed as follows:

y_i = 10 x_{1i} (2 + x_{1i})^{−1} + 3 exp{0.125 (x_{2i} + 0.5 x_{3i})} x_{2i}^{−1} + ξ_i, if z_i = 0,
y_i = 20 x_{1i} (4 + x_{1i})^{−1} + 6 exp{0.25 (x_{2i} + 0.5 x_{3i})} x_{2i}^{−1} + ξ_i, if z_i = 1,

where the x_{ji}'s (j = 1, 2, 3; i = 1, …, n) are independent samples of Uniform[1, 10].
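A sketch of the simulation design for Example 5 follows; it assumes, as is implicit in the setup above, that respondents reporting z* = 1 are truthful, and the function name is ours.

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate_example5(n, sigma=1.0, p_neg=0.9, p_mis=0.15):
    """Simulate Example 5 with P(z* = 0) = p_neg and P(z = 1 | z* = 0) = p_mis."""
    X = rng.uniform(1, 10, size=(n, 3))
    z_star = (rng.uniform(size=n) > p_neg).astype(int)        # reported status
    z = np.where(z_star == 1, 1,                              # z* = 1 reporters truthful
                 (rng.uniform(size=n) < p_mis).astype(int))   # some z* = 0 misrepresent
    x1, x2, x3 = X.T
    f0 = 10 * x1 / (2 + x1) + 3 * np.exp(0.125 * (x2 + 0.5 * x3)) / x2
    f1 = 20 * x1 / (4 + x1) + 6 * np.exp(0.25 * (x2 + 0.5 * x3)) / x2
    xi = sigma * rng.standard_t(df=5, size=n)                 # heavy-tailed residuals
    y = np.where(z == 0, f0, f1) + xi
    return X, y, z, z_star

X, y, z, z_star = simulate_example5(5000)
print((z_star == 0).mean(), z[z_star == 0].mean())  # near 0.9 and 0.15
```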
To fit the KQR-based misrepresentation model to the aforementioned examples, we deliberately choose to
focus on the median regression (i.e., τ = 0.5), which can be viewed as a robust alternative to the parametric
mean regression frequently used in the related literature. Recall that the Hilbert space induced by the Gaussian kernel
possesses the universal approximation property, so we use the Gaussian kernel for the regression estimation.
The bandwidth parameter ϱ > 0 in the Gaussian kernel is set to the average of the 10% and 90% quantiles of
‖x_i − x_j‖^2 over i ≠ j ∈ {1, …, n}, as suggested by Takeuchi et al. (2006). Moreover, we use the Gaussian
kernel to implement the constrained weighted density estimation in Equation (11) with bandwidth η = 1.06 n^{−1/5} φ,
where φ denotes the standard deviation of the residual errors (Silverman, 1986). In the application of random sketch,
we use sub-Gaussian sketching and set the projection dimension to m = ⌈3 (log n)^{3/2}⌉, as
suggested by the theoretical analysis in Yang et al. (2017).
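These three tuning heuristics are straightforward to code; a sketch (function names ours):

```python
import numpy as np

def gaussian_kernel_bandwidth(X):
    """Takeuchi et al. (2006) heuristic: average of the 10% and 90% quantiles
    of the pairwise squared distances ||x_i - x_j||^2, i != j."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    off = d2[~np.eye(len(X), dtype=bool)]            # drop the diagonal
    return 0.5 * (np.quantile(off, 0.10) + np.quantile(off, 0.90))

def silverman_bandwidth(resid):
    """Silverman (1986) rule of thumb: 1.06 * n^(-1/5) * std(residuals)."""
    return 1.06 * len(resid) ** (-1 / 5) * np.std(resid)

def sketch_dimension(n):
    """Projection dimension m = ceil(3 * (log n)^(3/2)), per Yang et al. (2017)."""
    return int(np.ceil(3 * np.log(n) ** 1.5))

print(sketch_dimension(1000))  # 55
```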
Invoking the proposed ES algorithm, Figure 3 depicts the fitted regression curves for Examples 1 to 4 based
on a single trial of simulation. Within Figure 3, we set the validation size control parameter a = 1 and use the union
of the probability-based and distance-based methods to identify potential misrepresenters. We can see that the
KQR misrepresentation model produces a fairly satisfactory goodness of fit to the simulated data for all the examples
reported, including the linear examples with a parallel shift (Example 1) and a non-parallel shift (Example 2), as well
as the non-linear examples (Examples 3 and 4). In particular, only a few type I and type II errors are observed,
indicating that most of the misrepresenters are successfully discovered by the suggested union method. The
misrepresenters who cannot be detected by the algorithm have responses standing in the region where
the realizations of the two regression models overlap significantly. In such a circumstance, the misrepresenters
behave much like honest individuals, and it is therefore statistically difficult to identify them. The same argument
applies to the observations committing the type I error. Accordingly, we observe from Figure 3 that
more severe type I and II errors occur in Example 1 than in Example 2, because the parallel-shift construction of
Example 1 results in a larger overlapping area between the two regression models. For the same reason, within
each of Examples 2, 3 and 4, fewer type I and II errors occur in the right tail portion,
where the distance between the two regression curves is larger. For Example 5, which involves multiple predictor
variables, the model's goodness of fit and misrepresenter identification can no longer be visually displayed. Instead,
the estimation results for Example 5 can be found in Figure 5 as well as Tables 1 and 2.
The results presented in Figure 3 are based on a single trial of simulation only. To account for the stochastic
variation in simulation, we now repeat the same experiment for each of Examples 1 - 5 multiple times. Furthermore,
we vary the sample size, residual standard deviation, and validation size in order to understand
the associated impacts on the proposed model's performance. Some additional notions for quantifying the model
performance are needed beforehand. Define

Type I error rate = (the number of samples committing the type I error) / (the number of individuals having a negative status on z: Σ_{i=1}^n 1{z_i = 0}),   (18)

which calculates the percentage of individuals with truly negative status on z that are falsely labeled as misrepresenters.
Meanwhile, define

Type II error rate = (the number of samples committing the type II error) / (the true number of misrepresenters: Σ_{i=1}^n 1{z*_i = 0 ∩ z_i = 1}),   (19)

which measures the percentage of misrepresenters that are undetected. Moreover, we define

Validation rate = (the validation size v) / (the true number of misrepresenters: Σ_{i=1}^n 1{z*_i = 0 ∩ z_i = 1}).   (20)
Figure 3: The scatter plots of 1,000 simulated data according to Example 1 (top left), Example 2 (top right), Example 3 (bottom left), and Example 4 (bottom right), superimposed by the true (solid) and fitted (dashed) regression curves for z = 0 and z = 1, together with the misrepresenter identification. The signs "+", "◦", "∗", "□" and "△" are defined in the same manner as in Figure 2. The estimated misrepresentation probabilities are 14.30%, 15.21%, 16.16%, and 15.10%, respectively.
The rationale underlying the definition of validation rate is that, if the ratio is close to 100%, then the validation
size is approximately equal to the true number of misrepresenters. On the one hand, a 100% validation rate means
the misrepresenter identification procedure behaves as an oracle who has the knowledge of the unknown number,
say s, of misrepresenters and deliberately examines the top s suspicious subjects. On the other hand, it also means
the misrepresenter identification procedure can detect most of the misrepresentation cases if the type II error rate
is also low.
Now, we evaluate the accuracy of the misrepresentation probability estimation and misrepresenter detection
with different sample sizes n ∈ {1,000, 3,000, 5,000, 7,000}. For every fixed n, we run 150 independent trials of
simulations for all five examples. The box plots describing the distributions of π and of the type I and type II
error rates for the five examples are displayed in Figures 4 and 5. For each box plot, the middle line is the median
of the distribution, the lower and upper limits of the box are the first and third quartiles, respectively, the
upper (resp. lower) whisker extends up to 1.5 times the interquartile range from the top (resp. bottom) of the
box, and the points are the outliers beyond the whiskers. For all the examples, we observe that as the sample
size increases, the estimators' distributions become more concentrated, leading to more accurate estimation results.
Moreover, the median of the estimated misrepresentation probability π is very close to the true value of 15%. When
it comes to the performance of misrepresenter detection, both the type I and II error rates are quite low. Across
Examples 1 - 5, the median type I error rates range from about 0.2% to 3%, and the median type II error rates
from 0.9% to 9%. Among these examples, Example 1 has the highest type I and II error rates. This is because the
construction of Example 1 involves two parallel regression curves, resulting in a relatively large region in
which the realizations of the regression models for z = 0 and z = 1 overlap significantly. As explained
earlier, observations falling into such overlapping areas are statistically difficult to distinguish between honest
individuals and misrepresenters. The realizations from the two regression models in Example 2 (see Figure 3) seem
to overlap the least, and thus Example 2 has the lowest error rates among the five examples.
After analyzing the impact of sample size, we turn to study how data volatility may influence the proposed
model's performance. It is natural to conjecture that more volatile data may harm the model's performance
and increase the error rates, because realizations of the two regression models for z = 0 and z = 1 are more likely
to overlap. In order to verify this conjecture, we fix n = 1,000, vary the dispersion parameter of the
residual distribution among σ ∈ {0.75, 1, 1.25}, and examine the validation rate as well as the type I and II error
rates by running 150 independent trials of simulations. The results summarized in Table 1 confirm the surmise:
both the type I and type II error rates increase with the dispersion parameter σ. Among the five
examples, larger data volatility influences Example 1 the most, with a 25% increase in σ doubling
the type II error rate. This is again due to the parallel-shift construction of Example 1, which makes the increase
in data volatility deteriorate the performance of the misrepresenter identification algorithm over the whole range
of the data. In contrast, for the other examples, the effect of increasing σ is concentrated in the left tail portion of
the DGP, where the two regression curves are close (see Figure 3). Although the accuracy of the misrepresenter
identification algorithm declines with noisier data, the validation rate seems less sensitive to
changes in σ. More importantly, when the suggested union method is used, all the validation rates are about 100%,
meaning that the validation size suggested by the algorithm is rather close to the true number of misrepresenters.
To ease the negative impact of noisy data on misrepresenter identification, one working approach is to increase
the validation size. Table 2 collects the validation rate and the type I and type II error rates for a ∈ {1, 1.1, 1.2}
Figure 4: The box plots of the misrepresentation probability estimates π (left column), the type I error rate (middle column), and the type II error rate (right column) for Examples 1 - 4 (from top to bottom), across sample sizes n ∈ {1,000, 3,000, 5,000, 7,000}. For the box plots of π, the horizontal line marks the true misrepresentation probability.
Figure 5: The box plots of the misrepresentation probability estimate π (left column), the type I error rate (middle column), and the type II error rate (right column) for Example 5, across sample sizes n ∈ {1,000, 3,000, 5,000, 7,000}. For the box plot of π, the horizontal line marks the true misrepresentation probability.
              Validation Rate          Type I Error Rate          Type II Error Rate
σ             Prob./Dist.   Union      Prob.    Dist.    Union    Prob.    Dist.    Union
Example 1
0.75          98.73%        100.58%    0.52%    0.45%    0.56%    4.27%    3.85%    2.62%
1             94.86%        98.09%     1.10%    1.02%    1.18%    11.29%   10.84%   8.03%
1.25          89.52%        93.73%     1.80%    1.72%    1.97%    20.55%   20.05%   17.27%
Example 2
0.75          98.37%        99.93%     0.03%    0.05%    0.06%    1.81%    1.92%    0.42%
1             97.53%        99.57%     0.13%    0.18%    0.24%    2.37%    2.66%    0.91%
1.25          98.45%        101.11%    0.31%    0.41%    0.51%    3.32%    3.88%    1.78%
Example 3
0.75          98.36%        103.40%    0.55%    0.96%    1.27%    4.77%    7.06%    3.78%
1             99.17%        105.85%    0.99%    1.45%    1.93%    7.64%    10.22%   6.30%
1.25          98.44%        106.17%    1.49%    1.98%    2.59%    10.07%   12.80%   8.57%
Example 4
0.75          98.50%        101.76%    0.69%    0.74%    0.92%    5.35%    5.62%    3.38%
1             98.30%        103.65%    1.41%    1.49%    1.81%    12.18%   12.66%   9.21%
1.25          92.94%        100.49%    2.27%    2.35%    2.86%    20.00%   20.42%   15.78%
Example 5
0.75          96.20%        99.48%     0.11%    0.11%    0.18%    4.46%    4.45%    1.58%
1             98.59%        102.43%    0.33%    0.36%    0.52%    5.54%    5.71%    2.79%
1.25          96.24%        101.10%    0.64%    0.73%    0.99%    7.44%    7.94%    4.56%

Table 1: The validation rate, type I and type II error rates for Examples 1 - 5. The reported figures are based on 150 independent trials of simulations with n = 1,000, a = 1 and σ ∈ {0.75, 1, 1.25}.
with σ = 1. The following observations can be made. First, increasing a leads to a larger validation size
by construction. In particular, the validation rate under the union method increases from around 100% when
a = 1 to between 117.93% and 135.07% when a = 1.2. Second, because of the larger validation size, the type I error
rate also increases with a, from below 2% when a = 1 to between 2.17% and 6.50% when a = 1.2. Finally, we observe
appreciable improvements in the type II error rate as the validation size parameter a increases. Specifically, when
we choose a = 1.2, the type II error rates in all five examples drop below 4% (the one in Example 2 is
even as small as 0.07%). In summary, increasing the validation size helps lower the type II error rate, but at the
expense of raising the type I error rate. Such a trade-off between the type I and type II errors is akin to
that in hypothesis testing, underscoring how accuracy and prudence are two sides of the same coin. In future
research, it will be interesting to investigate the optimal choice of validation size that best balances the type I and
type II errors.
              Validation Rate          Type I Error Rate          Type II Error Rate
a             Prob./Dist.   Union      Prob.    Dist.    Union    Prob.    Dist.    Union
Example 1
1             94.86%        98.09%     1.10%    1.02%    1.18%    11.29%   10.84%   8.03%
1.1           105.55%       108.17%    1.97%    1.91%    2.16%    6.66%    6.31%    5.14%
1.2           114.98%       117.93%    3.26%    3.21%    3.62%    4.45%    4.20%    3.54%
Example 2
1             97.53%        99.57%     0.13%    0.18%    0.24%    2.37%    2.66%    0.91%
1.1           110.36%       114.92%    1.49%    1.50%    2.27%    0.18%    0.25%    0.11%
1.2           118.97%       129.38%    3.24%    3.24%    5.07%    0.11%    0.12%    0.07%
Example 3
1             99.17%        105.85%    0.99%    1.45%    1.93%    7.64%    10.22%   6.30%
1.1           107.85%       118.66%    2.16%    2.74%    3.97%    4.22%    7.50%    3.55%
1.2           118.74%       135.07%    3.68%    4.23%    6.50%    2.89%    5.98%    2.42%
Example 4
1             98.30%        103.65%    1.41%    1.49%    1.81%    12.18%   12.66%   9.21%
1.1           105.15%       110.59%    2.30%    2.40%    2.97%    7.29%    7.90%    5.71%
1.2           115.86%       122.84%    3.49%    3.63%    4.55%    4.56%    5.36%    3.66%
Example 5
1             98.59%        102.43%    0.33%    0.36%    0.52%    5.54%    5.71%    2.79%
1.1           105.53%       110.77%    1.33%    1.44%    2.14%    1.43%    2.05%    0.83%
1.2           116.27%       126.26%    2.89%    2.97%    4.57%    0.69%    1.16%    0.39%

Table 2: The validation rate, type I and type II error rates for Examples 1 - 5. The reported figures are based on 150 independent trials of simulations with n = 1,000, σ = 1 and a ∈ {1, 1.1, 1.2}.
All in all, these simulation studies confirm that the proposed ES algorithm is able to generate valid inference on misrepresentation, even when the true DGP is highly non-linear. Moreover, as illustrated in Tables 1 and 2, the suggested union method shows an advantage over the probability-based and distance-based methods in that it lowers the type II error rate noticeably while only slightly increasing the validation size and the type I error rate.
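As a small illustration of the union rule, the sketch below combines the individuals flagged by a probability-based score and a distance-based score. The score arrays and the per-method budget k are hypothetical placeholders here, since the two scoring rules themselves are defined in the "Recommendations on misrepresentation validation" section rather than in this passage:

```python
import numpy as np

def union_flag(prob_scores, dist_scores, k):
    """Flag the union of the k most suspicious individuals under a
    probability-based score and a distance-based score (both arrays are
    placeholders; higher values are taken to mean more suspicious)."""
    top_prob = set(np.argsort(prob_scores)[-k:].tolist())
    top_dist = set(np.argsort(dist_scores)[-k:].tolist())
    return top_prob | top_dist  # union contains between k and 2k indices
```

Because the two flagged sets typically overlap only partially, the union validates somewhat more individuals than either method alone, which is consistent with the validation-rate and error-rate pattern reported in Table 2.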
Application to real data
In this section, we apply the KQR-based misrepresentation model to assess the misrepresentation risk in the Medical Expenditure Panel Survey (MEPS) data. The Health Insurance Mandate of the Patient Protection and Affordable Care Act features a tax penalty, enacted in 2014, for individuals who did not have health insurance during the year. Therefore, the earlier study of Akakpo et al. (2019) argued that there are financial incentives for MEPS respondents to misrepresent their self-reported insurance status. To prepare for the succeeding discussion, let us denote by y the logarithm of the annual claim amount of each observation, and by z the unobserved, true insurance status, with z = 0 being insured and z = 1 being uninsured. The observed, self-reported insurance status is denoted by z∗. Hence, the probability π = P[z = 1 | z∗ = 0] corresponds to the percentage of MEPS respondents who misrepresented their insurance status so as to avoid the potential tax penalty. Akakpo et al. (2019) employed a Gaussian linear regression model with the insurance status as the only covariate to study the misrepresentation issue. Specifically, they assumed that the conditional distribution of y is Gaussian with mean µ = β0 + β1 z, where (β0, β1) are regression coefficients. Based on the Gaussian linear regression, Akakpo et al. (2019) found significant statistical evidence of misrepresentation of the self-reported insurance status in year 2014. The probability of misrepresentation was estimated to be about 18% among the respondents who reported themselves as insured.
Armed with the KQR-based misrepresentation model proposed in this paper, we investigate the misrepresentation issue in the MEPS data under a non-parametric environment. Moreover, extra relevant covariates are included in our analysis, which helps lower the volatility of the residual errors in the misrepresentation model. As illustrated in the simulation analyses, less volatile residuals yield a more credible assessment of misrepresentation. The additional covariates considered in our subsequent analysis include age, gender, and body mass index (BMI, computed by dividing a person's weight in kilograms by the square of height in meters). These variables are selected because they are respondents' biographic characteristics which can be observed and verified easily. Hence, it is reasonable to assume that these variables are correctly measured.

A concise description of the data used in our analysis is provided in what follows. We focus on the year 2014 MEPS data, and consider all adult respondents aged from 18 to 60 who have positive annual medical expenditures and valid entries in all covariates considered. As a result, our data set consists of about 12,000 valid samples. Table 3 outlines the definitions and descriptive statistics of the variables considered. According to the summary statistics, a typical respondent in the data is a female, about 40 years old, with a BMI of 29. Notice that the mean values of the variables outlined in Table 3 are very similar between the insured and uninsured groups, except for the expenditure variable. This noticeable dissimilarity in medical expenditures between the insured and uninsured groups enables us to utilize the individual claim information for identifying potential misrepresentation activities. As a side note, the BMI has been widely accepted as a measure of obesity and is one of the key variables in analyzing medical expenditure (Bhattacharya and Bundorf, 2009; Finkelstein et al., 2003). A plethora of empirical literature has documented that BMI possesses a non-linear relationship with medical expenditure (Cawley et al., 2015; Laxy et al., 2017) and is dependent on age and gender (Nevill and Metsios, 2015). Consequently, adopting flexible non-linear models, such as the proposed KQR method, is crucial in our study of misrepresentation risk in the MEPS data.
To implement the KQR-based misrepresentation model for the MEPS data, we choose the same configuration as that of the simulation studies. Specifically, we set τ = 0.5, corresponding to median regression, use the Gaussian kernel for the non-parametric learning of both the regression functions and the residual distribution, and choose $m = \lceil 3 \log(n)^{3/2} \rceil = 87$ to be the projection dimension in the deployment of the random sketch technique. The kernel regression smoothing parameter λ is tuned by using the CACV index. Figure 6 displays the CACV index for varying values of λ, and it suggests the "best" smoothing parameter value to be λ = exp(−8.68). It is worth noting that, given the true label z, the regression function considered in the misrepresentation model of Akakpo et al. (2019) is constant. Since constant functions are nested within the admissible function set FK with zero RKHS norm, we argue that our estimated KQR will perform at least as well as the constant regression model in Akakpo et al. (2019), at least as far as the mean absolute error is concerned.
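For concreteness, the projection dimension above can be computed as follows. The Gaussian sketch matrix is one common choice of random projection and is an assumption of this sketch, not a prescription from the paper; the sample size 12,019 is the total of the two groups in Table 3:

```python
import math
import numpy as np

def sketch_dim(n):
    """Projection dimension m = ceil(3 * log(n)**(3/2))."""
    return math.ceil(3 * math.log(n) ** 1.5)

def gaussian_sketch(m, n, seed=0):
    """An m-by-n Gaussian sketch matrix S with N(0, 1/m) entries,
    a standard (assumed) choice for random sketching."""
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, 1.0 / math.sqrt(m), size=(m, n))
```

With the 12,019 retained MEPS samples, sketch_dim returns 87, matching the value used in the analysis.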
Invoking the estimation procedure described in Algorithm 1, we obtain an estimated misrepresentation probability of 9.21%. We remark that our non-parametric estimate of the misrepresentation probability is much lower than the parametric estimate of 18% obtained in Akakpo et al. (2019). A potential explanation for the discrepancy is that our proposed KQR method captures the subtle non-linear dependence structure inherent in the MEPS data, while the Gaussian model utilized by Akakpo et al. (2019) overlooks the effect of covariates, which may inflate the residuals and flag more misrepresenters.
What we have reported so far is a point estimate of misrepresentation. In the next stage of the analysis, we aim
to validate whether or not the estimate of misrepresentation is statistically significant. Formally, we are interested
in testing the null hypothesis, H0 : π = 0, against the alternative hypothesis, H1 : π > 0. To this end, we consider
a non-parametric bootstrap method to conduct the hypothesis testing. Its detailed steps are elaborated below:
1. Assume that the null hypothesis is true, i.e., no misrepresentation exists. In this case, we can fit the KQR model to the data by directly solving Equation (8) with zi = z∗i for all i = 1, . . . , n:
$$ f^{\text{null}}_r = \arg\min_{f \in \mathcal{F}_K} \left( \frac{1}{\sum_{i=1}^n \mathbf{1}\{z^*_i = r\}} \sum_{i=1}^n \mathbf{1}\{z^*_i = r\}\, \rho_\tau\big(y_i - f(x_i)\big) + \lambda \,\|h\|^2_{\mathcal{H}_K} \right), \quad \text{for } r \in \{0, 1\}. $$
2. Compute the residual errors under the null model estimated from Step 1:
$$ e^{\text{null}}_i = y_i - \left[ \mathbf{1}\{z^*_i = 0\}\, f^{\text{null}}_0(x_i) + \mathbf{1}\{z^*_i = 1\}\, f^{\text{null}}_1(x_i) \right], \quad \text{for } i = 1, \ldots, n. $$
Figure 6: Plot of the CACV index with varying values of log(λ); the dashed line marks the location of the minimum of the CACV curve.
3. Construct the bootstrap data {y^{bs}_i, x_i, z∗_i}_{i=1}^n with
$$ y^{\text{bs}}_i = \left[ \mathbf{1}\{z^*_i = 0\}\, f^{\text{null}}_0(x_i) + \mathbf{1}\{z^*_i = 1\}\, f^{\text{null}}_1(x_i) \right] + e^{\text{bs}}_i, \quad \text{for } i = 1, \ldots, n, $$
where $e^{\text{bs}} = (e^{\text{bs}}_1, \ldots, e^{\text{bs}}_n)$ denotes a bootstrap sample of $e^{\text{null}} = (e^{\text{null}}_1, \ldots, e^{\text{null}}_n)$ drawn with replacement.
4. Apply the ES algorithm on the bootstrap data obtained in Step 3 to estimate the misrepresentation probability.
5. Repeat Steps 3 and 4 w ∈ ℕ times to obtain $\pi^{\text{bs}}_1, \ldots, \pi^{\text{bs}}_w$, where $\pi^{\text{bs}}_j$ is the estimated misrepresentation probability based on the j-th bootstrap data set, j = 1, . . . , w.

6. Approximate the p-value of the test statistic by $p^{\text{bs}} = w^{-1} \sum_{j=1}^w \mathbf{1}\{\pi < \pi^{\text{bs}}_j\}$, where π is the estimated misrepresentation probability based on the original data.
7. Reject the null hypothesis if pbs falls below a user-specified significance level.
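The resampling loop in Steps 2 - 6 can be sketched as follows. Here `estimate_pi` is a placeholder for the ES algorithm of Algorithm 1, and `fitted_null` collects the null fitted values from Step 1; both names are assumptions of this illustration:

```python
import numpy as np

def bootstrap_p_value(y, fitted_null, estimate_pi, w=200, seed=0):
    """Residual-bootstrap test of H0: pi = 0 (Steps 2-6).

    y           : observed responses, shape (n,)
    fitted_null : fitted values under the null model, shape (n,)
    estimate_pi : callable mapping a response vector to an estimated
                  misrepresentation probability (ES-algorithm stand-in)
    """
    rng = np.random.default_rng(seed)
    resid = y - fitted_null                   # Step 2: null residuals
    pi_hat = estimate_pi(y)                   # estimate on original data
    pi_bs = np.empty(w)
    for j in range(w):                        # Steps 3-5: w bootstrap rounds
        e_bs = rng.choice(resid, size=resid.size, replace=True)
        pi_bs[j] = estimate_pi(fitted_null + e_bs)
    p_value = float(np.mean(pi_bs > pi_hat))  # Step 6: right-tail p-value
    return pi_hat, p_value
```

A small `p_value` then leads to rejecting H0 in Step 7.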
The intuition behind the above test is as follows. Under the null hypothesis there is no misrepresentation, and therefore a non-zero estimated misrepresentation probability π is caused purely by the randomness in the samples. In this case, $\pi^{\text{bs}}_1, \ldots, \pi^{\text{bs}}_w$ form the empirical null distribution of the misrepresentation probability estimator. If the estimated misrepresentation probability based on the observed data is very different from the estimates under the null hypothesis (i.e., falls on the right tail of the estimator's null distribution), then the null hypothesis should be rejected. Accordingly, $p^{\text{bs}}$ approximates the probability, under the null hypothesis, of observing an estimate at least as large as π, i.e., the probability that the null hypothesis would be falsely rejected. Such arguments are standard in the non-parametric bootstrap hypothesis testing literature (Dwass, 1957; see also Edgington, 2007 and Good, 2000 for comprehensive treatments of the topic).
In our analysis, we run w = 3,000 sets of bootstrap data in order to ensure that the estimated p-value is credible. We find that none of the bootstrap data sets yields an estimate of the misrepresentation probability higher than 9.21%. So the estimated p-value of the test statistic is 0%, and the null hypothesis H0 should be rejected at any practical significance level. Collectively, the bootstrap hypothesis test supports the conclusion that there is a significant level of misrepresentation in the MEPS data. This conclusion is consistent with the one in Akakpo et al. (2019), although our estimate of the misrepresentation probability is much lower. Last but not least, for the purpose of misrepresentation validation, one can apply the methods described in the "Recommendations on misrepresentation validation" section to identify the doubtful individuals.
Conclusions

In this paper, we proposed a class of non-parametric misrepresentation models constructed via quantile regression in reproducing kernel Hilbert spaces. The core mechanism of the proposed misrepresentation model is a mixture structure that caters for the potential occurrence of misrepresentation. We proved the identifiability of the proposed misrepresentation model. A novel algorithm, suitable for big data applications, was proposed to accommodate the model learning. Our extensive simulation study has shown that the proposed methodology is capable of providing valid inference on misrepresentation, even when the underlying DGP is highly non-linear and complex. We applied the proposed model to the Medical Expenditure Panel Survey data to study potential misrepresentation behaviors in the respondents' self-reported insurance status. We found strong statistical evidence that misrepresentation exists, which confirms the earlier conclusion drawn from the Gaussian linear regression model (Akakpo et al., 2019).

There are a handful of directions for future research. For example, one could explore the possibility of incorporating other statistical learning methods, such as support vector machines, tree-based models, or neural networks, in studying insurance misrepresentation. The mixture machinery used in this paper will serve as the building block for this research direction, with the embedded KQR substituted by one of the methods outlined above. It will be interesting to compare the performance of these methods in modeling misrepresentation. Another important research direction is to augment the proposed misrepresentation model to handle the frequency component of insurance claims data, which also contains statistical information about misrepresentation.
References
Akakpo, R., Xia, M., and Polansky, A. (2019). Frequentist inference in insurance ratemaking models adjusting for
misrepresentation. ASTIN Bulletin, 49(1):117–146.
Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society,
68(3):337–404.
Baione, F. and Biancalana, D. (2019). An individual risk model for premium calculation based on quantile: A com-
parison between generalized linear models and quantile regression. North American Actuarial Journal, 23(4):573–
590.
Bhattacharya, J. and Bundorf, M. K. (2009). The incidence of the healthcare costs of obesity. Journal of Health
Economics, 28(3):649–658.
Cawley, J., Meyerhoefer, C., Biener, A., Hammer, M., and Wintfeld, N. (2015). Savings in medical expenditures
associated with reductions in body mass index among US adults with obesity, by diabetes status. Pharmacoeco-
nomics, 33(7):707–722.
Dwass, M. (1957). Modified randomization tests for nonparametric hypotheses. The Annals of Mathematical
Statistics, 28(1):181–187.
Edgington, E. S. (2007). Randomization Tests. Chapman and Hall, New York.
Embrechts, P. and Hofert, M. (2013). A note on generalized inverses. Mathematical Methods of Operations Research,
77(3):423–432.
Finkelstein, E. A., Fiebelkorn, I. C., and Wang, G. (2003). National medical spending attributable to overweight
and obesity: how much, and who’s paying? Health Affairs, 22:219–226.
Gabaldon, I. M., Vazquez Hernandez, F. J., and Watt, R. (2014). The effect of contract type on insurance fraud.
Journal of Insurance Regulation, 33(8):197–230.
Good, P. (2000). Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. Springer,
New York.
Gu, C. (2013). Smoothing Spline ANOVA Models. Springer, New York.
Halko, N., Martinsson, P. G., and Tropp, J. A. (2011). Finding structure with randomness: Probabilistic algorithms
for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288.
Hamilton, J. M. (2009). Misrepresentation in the Life, Health, and Disability Insurance Application Process: A
National Survey. Minnesota State Bar Association, Minneapolis.
Kimeldorf, G. and Wahba, G. (1971). Some results on Tchebycheffian spline functions. Journal of Mathematical
Analysis and Applications, 33(1):82–95.
Koenker, R. (2005). Quantile Regression. Cambridge University Press, Cambridge.
Koenker, R. and Bassett, G. (1978). Regression quantiles. Econometrica, 46(1):33–50.
Kudryavtsev, A. A. (2009). Using quantile regression for rate-making. Insurance: Mathematics and Economics,
45(2):296–304.
Laxy, M., Stark, R., Peters, A., Hauner, H., Holle, R., and Teuner, C. M. (2017). The non-linear relationship
between BMI and health care costs and the resulting cost fraction attributable to obesity. International Journal
of Environmental Research and Public Health, 14:1–6.
Li, Y., Liu, Y., and Zhu, J. (2007). Quantile regression in reproducing kernel Hilbert spaces. Journal of the
American Statistical Association, 102(477):255–268.
Mahoney, M. W. (2011). Randomized algorithms for matrices and data. Foundations and Trends in Machine
Learning, 3(2):123–224.
Micchelli, C. A., Xu, Y., and Zhang, H. (2006). Universal kernels. Journal of Machine Learning Research, 7:2651–
2667.
Nevill, A. M. and Metsios, G. S. (2015). The need to redefine age- and gender-specific overweight and obese body
mass index cutoff points. Nutrition and Diabetes, 5:1–2.
Nychka, D., Gray, G., Haaland, P., Martin, D., and O’Connell, M. (1995). A nonparametric regression approach to
syringe grading for quality improvement. Journal of the American Statistical Association, 90(432):1171–1178.
Pérez-Marín, A. M., Guillén, M., Alcañiz, M., and Bermúdez, L. (2019). Quantile regression with telematics
information to assess the risk of driving above the posted speed limit. Risks, 7(3):80.
Pilanci, M. and Wainwright, M. J. (2016). Iterative Hessian sketch: Fast and accurate solution approximation for
constrained least-squares. The Journal of Machine Learning Research, 17(1):1842–1879.
Raskutti, G. and Mahoney, M. W. (2016). A statistical perspective on randomized sketching for ordinary least-
squares. The Journal of Machine Learning Research, 17(1):7508–7538.
Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall, London.
Takeuchi, I., Le, Q. V., Sears, T. D., and Smola, A. J. (2006). Nonparametric quantile estimation. Journal of
Machine Learning Research, 7:1231–1264.
Wahba, G. (1990). Spline Models for Observational Data. Society for Industrial and Applied Mathematics, Philadel-
phia.
Wu, Q. and Yao, W. (2016). Mixtures of quantile regressions. Computational Statistics and Data Analysis, 93:162–
176.
Xia, M. (2014). Risk segmentation using Bayesian quantile regression with natural cubic splines. Austin Statistics,
1(1):1–7.
Xia, M. (2018). Bayesian adjustment for insurance misrepresentation in heavy-tailed loss regression. Risks, 6(3):83.
Xia, M. and Gustafson, P. (2016). Bayesian regression models adjusting for unidirectional covariate misclassification.
Canadian Journal of Statistics, 44(2):198–218.
Xia, M. and Gustafson, P. (2018). Bayesian inference for unidirectional misclassification of a binary response trait.
Statistics in Medicine, 37(6):933–947.
Xia, M., Hua, L., and Vadnais, G. (2018). Embedded predictive analysis of misrepresentation risk in GLM ratemak-
ing models. Variance, 12(1):39–58.
Yang, Y., Pilanci, M., and Wainwright, M. J. (2017). Randomized sketches for kernels: Fast and optimal nonpara-
metric regression. The Annals of Statistics, 45(3):991–1023.
Yuan, M. (2006). GACV for quantile smoothing splines. Computational Statistics and Data Analysis, 50(3):813–829.
Appendix A Pseudo-data method for weighted quantile smoothing
We focus on the optimization problem in Equation (15), which is more general than that of Equation (14); if S is set to be the identity matrix of dimension n, the two problems coincide. For brevity, let us also suppress the indexes "r" and "s" in Equation (15) and simply write
$$ (b, \theta)^\top = \arg\min_{(b,\theta) \in \mathbb{R}^{m+1}} \left( \frac{1}{n \bar p}\, p^\top \rho_\tau\big(y - b\,\mathbf{1} - K S^\top \theta\big) + \lambda\, \theta^\top S K S^\top \theta \right), \qquad (21) $$
for a given weight vector $p = (p_1, \ldots, p_n)^\top$ with $\bar p = \sum_{i=1}^n p_i / n$, where $\rho_\tau(\cdot)$ is applied elementwise. The major difficulty in solving the optimization problem above is that the pinball loss function (7) is not differentiable at zero. To get around the non-differentiability issue, Nychka et al. (1995) suggested approximating the pinball loss by a differentiable function:
$$ \rho_{\tau,\delta}(t) = \begin{cases} \rho_\tau(t), & \text{if } |t| \ge \delta; \\ \tau\, t^2/\delta, & \text{if } 0 \le t \le \delta; \\ (1-\tau)\, t^2/\delta, & \text{if } -\delta \le t < 0. \end{cases} \qquad (22) $$
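The smoothed loss (22) can be transcribed directly, with the pinball loss ρτ written out for completeness; the helper names below are ours:

```python
import numpy as np

def pinball(t, tau):
    """Pinball (check) loss rho_tau(t)."""
    t = np.asarray(t, dtype=float)
    return np.where(t >= 0, tau * t, (tau - 1.0) * t)

def pinball_smooth(t, tau, delta):
    """Differentiable approximation rho_{tau, delta} of Equation (22):
    quadratic on [-delta, delta], pinball loss elsewhere."""
    t = np.asarray(t, dtype=float)
    quad = np.where(t >= 0, tau, 1.0 - tau) * t ** 2 / delta
    return np.where(np.abs(t) >= delta, pinball(t, tau), quad)
```

At |t| = δ the two branches agree (e.g. τδ²/δ = τδ), so the approximation is continuous, and the quadratic pieces remove the kink at zero.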
Here, δ > 0 is an approximation threshold parameter. Denote by
$$ (\zeta, \vartheta)^\top = \arg\min_{(\zeta,\vartheta) \in \mathbb{R}^{m+1}} \left( \frac{1}{n \bar p}\, p^\top \rho_{\tau,\delta}\big(y - \zeta\,\mathbf{1} - K S^\top \vartheta\big) + \lambda\, \vartheta^\top S K S^\top \vartheta \right) \qquad (23) $$
the approximate solution of (21), obtained by substituting the pinball loss function (7) with its differentiable counterpart $\rho_{\tau,\delta}$. Applying the first-order condition to the objective function in Equation (23) yields that the solution must satisfy
$$ \sum_{i=1}^n w_i \big( y_i - \zeta - k_i S^\top \vartheta \big) = 0 $$
and
$$ \sum_{i=1}^n w_i \big( y_i - \zeta - k_i S^\top \vartheta \big) \big( -k_i S^\top \big)_j + \lambda \big( S K S^\top \vartheta \big)_j = 0, \quad \forall j \in \{1, \ldots, m\}, $$
where $k_i$ is the i-th row of K and $w_i = \big[ 2 n \bar p \,( y_i - \zeta - k_i S^\top \vartheta ) \big]^{-1} p_i\, \rho'_{\tau,\delta}( y_i - \zeta - k_i S^\top \vartheta )$ for i = 1, . . . , n. A careful inspection of the above linear system reveals that the (approximated) weighted quantile smoothing in Equation (23) can be viewed as the analogue of traditional non-parametric smoothing based on the penalized squared loss:
$$ (y - \zeta\,\mathbf{1} - K S^\top \vartheta)^\top W \,(y - \zeta\,\mathbf{1} - K S^\top \vartheta) + \lambda\, \vartheta^\top S K S^\top \vartheta, \qquad (24) $$
where $W = \operatorname{diag}(w)$, $w = (w_1, \ldots, w_n)$, and the operator $\operatorname{diag}: \mathbb{R}^n \to \mathbb{R}^{n \times n}$ maps an n-tuple to the corresponding diagonal matrix.
Following the proposal in Nychka et al. (1995), we compute the solution to Equation (23) in an iterative manner. To be specific, given the current estimate $(\zeta^{(s)}, \vartheta^{(s)})^\top$, in the (s + 1)-th iteration we solve the weighted smoothing quantile regression (23) based on the weights
$$ w^{(s)}_i = \big[ 2 n \bar p \,( y_i - \zeta^{(s)} - k_i S^\top \vartheta^{(s)} ) \big]^{-1} p_i\, \rho'_{\tau,\delta}( y_i - \zeta^{(s)} - k_i S^\top \vartheta^{(s)} ), \quad \text{for } i = 1, \ldots, n. $$
Let $W^{(s)} = \operatorname{diag}(w^{(s)})$, $w^{(s)} = (w^{(s)}_1, \ldots, w^{(s)}_n)$, and define
$$ A = S K B K S^\top + \operatorname{Tr}(W^{(s)}) \big( S K W^{(s)} K S^\top + \lambda\, S K S^\top \big), \qquad B = W^{(s)} \mathbf{1} \mathbf{1}^\top W^{(s)}. $$
In light of the equivalence between (23) and (24), the updated estimate can be computed explicitly via
$$ \vartheta^{(s+1)} = D^{(s)}\, y, \qquad D^{(s)} = A^{-1} \big[ \operatorname{Tr}(W^{(s)})\, S K W^{(s)} - S K B \big], \qquad (25) $$
and
$$ \zeta^{(s+1)} = c^{(s)}\, y, \qquad c^{(s)} = \frac{1}{\operatorname{Tr}(W^{(s)})} \big[ \mathbf{1}^\top W^{(s)} - \mathbf{1}^\top W^{(s)} K S^\top D^{(s)} \big]. \qquad (26) $$
The iteration is repeated until some convergence criterion is met, upon which we obtain the solution to Equation (23). By choosing a small enough approximation threshold δ, the estimate at convergence, $(\zeta^{(\infty)}, \vartheta^{(\infty)})^\top$, provides a good approximation of $(b, \theta)^\top$ in Equation (21).
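The pseudo-data iteration above amounts to iteratively reweighted least squares with weights proportional to $\rho'_{\tau,\delta}(r)/(2r)$. The following is a simplified sketch of the same idea for a plain linear model, i.e., without the kernel matrix K or the sketch S; the interface and the ridge-style penalty are assumptions of this illustration, not the exact updates (25)-(26):

```python
import numpy as np

def irls_smoothed_quantile(X, y, tau=0.5, delta=1e-3, lam=0.0, iters=50):
    """Minimize sum_i rho_{tau,delta}(y_i - x_i^T beta) + lam * |beta|^2
    by iteratively reweighted least squares. The weight of a residual r
    is rho_{tau,delta}(r) / r^2 = (tau or 1 - tau) / max(|r|, delta)."""
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(iters):
        r = y - X @ beta
        w = np.where(r >= 0, tau, 1.0 - tau) / np.maximum(np.abs(r), delta)
        XtW = X.T * w                     # X^T W, rows of X scaled by w
        beta = np.linalg.solve(XtW @ X + lam * np.eye(d), XtW @ y)
    return beta
```

For τ = 0.5 this converges to the (smoothed) median regression fit, illustrating the robustness of quantile-type losses to outliers.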
Appendix B Conditional approximate cross-validation
The approximation method for evaluating the CCV score (16) is elaborated here. Recall that the differentiable function $\rho_{\tau,\delta}(\cdot)$ defined in Equation (22) approximates the pinball loss function $\rho_\tau(\cdot)$ when δ > 0 is chosen to be small. First, we substitute $\rho_\tau(\cdot)$ with $\rho_{\tau,\delta}(\cdot)$ so that the following Taylor-series-based approximation holds:
$$ \sum_{i=1}^n p_{i,r}\, \rho_\tau\big( y_i - f^{[-i]}_r(x_i) \big) \approx \sum_{i=1}^n p_{i,r}\, \rho_\tau\big( y_i - f_r(x_i) \big) + \sum_{i=1}^n p_{i,r}\, \rho'_{\tau,\delta}\big( y_i - f_r(x_i) \big) \big[ f_r(x_i) - f^{[-i]}_r(x_i) \big], \quad r = 0, 1. \qquad (27) $$
Now, we need a feasible way of computing $f_r(x_i) - f^{[-i]}_r(x_i)$. The following lemma is of auxiliary importance; its proof is motivated significantly by Lemma 3.1 in Yuan (2006).

Lemma 2. Given the regularization parameter λ > 0 and weights $p_{i,r} \in [0, 1]$, i = 1, . . . , n, r = 0, 1, define $f^{[i]}_r(\cdot) = b^{[i]}_r + h^{[i]}_r(\cdot)$ in the same manner as $f_r(\cdot) = b_r + h_r(\cdot)$, except that the i-th observation $y_i$ is replaced by $f^{[-i]}_r(x_i)$. Then we have $f^{[i]}_r = f^{[-i]}_r$ for i = 1, . . . , n and r = 0, 1.
Proof. For r = 0, 1 and any fixed i = 1, . . . , n, the following string of relationships holds by the definitions of $f^{[i]}_r$ and $f^{[-i]}_r$:
$$
\begin{aligned}
& \frac{1}{n \bar p_r} \left[ \sum_{j=1, j \ne i}^{n} p_{j,r}\, \rho_\tau\big( y_j - f^{[i]}_r(x_j) \big) + p_{i,r}\, \rho_\tau\big( f^{[-i]}_r(x_i) - f^{[i]}_r(x_i) \big) \right] + \lambda \| h^{[i]}_r \|^2_{\mathcal{H}_K} \\
\ge\; & \frac{1}{n \bar p_r} \sum_{j=1, j \ne i}^{n} p_{j,r}\, \rho_\tau\big( y_j - f^{[i]}_r(x_j) \big) + \lambda \| h^{[i]}_r \|^2_{\mathcal{H}_K} \\
\ge\; & \frac{1}{n \bar p_r} \sum_{j=1, j \ne i}^{n} p_{j,r}\, \rho_\tau\big( y_j - f^{[-i]}_r(x_j) \big) + \lambda \| h^{[-i]}_r \|^2_{\mathcal{H}_K} \\
=\; & \frac{1}{n \bar p_r} \left[ \sum_{j=1, j \ne i}^{n} p_{j,r}\, \rho_\tau\big( y_j - f^{[-i]}_r(x_j) \big) + p_{i,r}\, \rho_\tau\big( f^{[-i]}_r(x_i) - f^{[-i]}_r(x_i) \big) \right] + \lambda \| h^{[-i]}_r \|^2_{\mathcal{H}_K} \\
\ge\; & \frac{1}{n \bar p_r} \left[ \sum_{j=1, j \ne i}^{n} p_{j,r}\, \rho_\tau\big( y_j - f^{[i]}_r(x_j) \big) + p_{i,r}\, \rho_\tau\big( f^{[-i]}_r(x_i) - f^{[i]}_r(x_i) \big) \right] + \lambda \| h^{[i]}_r \|^2_{\mathcal{H}_K}.
\end{aligned}
$$
Then we can conclude that
$$
\begin{aligned}
& \frac{1}{n \bar p_r} \left[ \sum_{j=1, j \ne i}^{n} p_{j,r}\, \rho_\tau\big( y_j - f^{[i]}_r(x_j) \big) + p_{i,r}\, \rho_\tau\big( f^{[-i]}_r(x_i) - f^{[i]}_r(x_i) \big) \right] + \lambda \| h^{[i]}_r \|^2_{\mathcal{H}_K} \\
=\; & \frac{1}{n \bar p_r} \left[ \sum_{j=1, j \ne i}^{n} p_{j,r}\, \rho_\tau\big( y_j - f^{[-i]}_r(x_j) \big) + p_{i,r}\, \rho_\tau\big( f^{[-i]}_r(x_i) - f^{[-i]}_r(x_i) \big) \right] + \lambda \| h^{[-i]}_r \|^2_{\mathcal{H}_K}.
\end{aligned}
$$
Given the weight parameters $p_{i,r}$, i = 1, . . . , n and r = 0, 1, it is straightforward to check that the optimization problems underlying $f^{[i]}_r$ and $f^{[-i]}_r$ are convex. Thereby, we must have $f^{[i]}_r = f^{[-i]}_r$. This completes the proof.
Lemma 2 yields, for i = 1, . . . , n and r = 0, 1,
$$ f_r(x_i) - f^{[-i]}_r(x_i) = f_r(x_i) - f^{[i]}_r(x_i) = \left[ \frac{\partial f_r(x_i)}{\partial y_i} \big( y_i - f^{[-i]}_r(x_i) \big) \right] \big( 1 + o(1) \big). \qquad (28) $$
Recall the notation $\psi_{i,r} = \partial f_r(x_i) / \partial y_i$, i = 1, . . . , n and r = 0, 1. Adopting the same argument as in Equations (3.5)-(3.6) of Yuan (2006), we get
$$ \rho'_{\tau,\delta}\big( y_i - f_r(x_i) \big) \big[ f_r(x_i) - f^{[-i]}_r(x_i) \big] \approx \rho_\tau\big( y_i - f_r(x_i) \big) \frac{\psi_{i,r}}{1 - \psi_{i,r}}, $$
and so (27) can be approximated as
$$ \sum_{i=1}^n p_{i,r}\, \rho_\tau\big( y_i - f^{[-i]}_r(x_i) \big) \approx \sum_{i=1}^n p_{i,r}\, \rho_\tau\big( y_i - f_r(x_i) \big) \frac{1}{1 - \psi_{i,r}}. $$
It remains to compute the partial derivative terms $\psi_{i,r}$. To this end, let
$$ A_r = \mathbf{1}\, c^{(\infty)}_r + K S^\top D^{(\infty)}_r, \qquad r \in \{0, 1\}, \qquad (29) $$
where $c^{(\infty)}_r$ and $D^{(\infty)}_r$ are the matrices defined in (26) and (25) upon convergence of the pseudo-data algorithm for estimating $f_r$. In fact, $A_r$ is the hat matrix associated with the weighted quantile smoothing, in the sense that $y_r = A_r\, y$, where $y_r = \big( f_r(x_1), \ldots, f_r(x_n) \big)^\top$. Thereby, $\psi_{i,r}$ is the i-th diagonal element of $A_r$, i = 1, . . . , n.
Finally, note that when $\partial f_r(x_i)/\partial y_i$ is close to zero, the second-order term of the Taylor series, which has been omitted in (28), may dominate the first-order term, causing the approximation to behave poorly. Following the resolution proposed by Yuan (2006), which applies an averaging trick to the individual partial derivative terms, the conditional approximate cross-validation (CACV) score in (17) is obtained and is ready for implementation.
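Putting the pieces together, the approximate leave-one-out score can be evaluated from the fitted values and the hat-matrix diagonals alone. The function below is a sketch with an assumed interface (the hat matrix, weights, and quantile level passed in directly), and it leaves out the averaging trick applied to $\psi_{i,r}$:

```python
import numpy as np

def pinball(t, tau):
    """Pinball loss rho_tau(t)."""
    return np.where(t >= 0, tau * t, (tau - 1.0) * t)

def approx_loo_score(y, A, p, tau):
    """Approximate leave-one-out pinball loss for one component:
    sum_i p_i * rho_tau(y_i - f(x_i)) / (1 - psi_i), where f(x) = A y
    and psi_i is the i-th diagonal element of the hat matrix A."""
    yhat = A @ y
    psi = np.diag(A)
    return float(np.sum(p * pinball(y - yhat, tau) / (1.0 - psi)))
```

Since $0 \le \psi_i < 1$ for a well-behaved smoother, the correction $1/(1 - \psi_i)$ inflates the in-sample loss, penalizing overly flexible fits in the same spirit as GACV.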