Robust Estimates of Insurance Misrepresentation
through Kernel Quantile Regression Mixtures
Hong Li∗, Qifan Song†, Jianxi Su‡
September 16, 2020
Abstract
This paper pertains to a class of non-parametric methods for studying the misrepresentation issue in insurance
applications. For this purpose, mixture models based on quantile regression in reproducing kernel Hilbert spaces
are employed. Compared to existing parametric approaches, the proposed framework features a more flexible
statistical structure, which alleviates the risk of model misspecification and is at the same time more robust
to outliers in the data. The proposed framework can not only estimate the prevalence of misrepresentation in
the data, but also help identify the most suspicious individuals for validation purposes. By embedding
state-of-the-art machine learning techniques, we present a novel statistical procedure to efficiently estimate the
proposed misrepresentation model in the presence of massive data. The proposed methodology is applied to study
the Medical Expenditure Panel Survey (MEPS) data, and a significant degree of misrepresentation activity is
found in the self-reported insurance status.
Key words and phrases: Big data, insurance claim models, misrepresenter identification, misrepresentation risk
assessment, non-parametric regression mixtures.
∗Asper School of Business, University of Manitoba, Winnipeg, Canada. Email: [email protected]
†Department of Statistics, Purdue University, West Lafayette, U.S.A. Email: [email protected]
‡Department of Statistics, Purdue University, West Lafayette, U.S.A. Email: [email protected]
Introduction
Insurance companies collect policyholders’ information – summarized by the so-called insurance rating factors
– in order to calculate risk-adjusted premiums. In real-world practice, it often happens that policyholders
intentionally make untrue statements on some key rating factors so as to alter their insurance eligibility and/or lower
their insurance premiums. In actuarial parlance, this type of fraudulent behavior is referred to as insurance
misrepresentation. Rating factors subject to misrepresentation are typically self-reported. Examples include
smoking status in health insurance, and the mileage and use of vehicles in auto insurance. Misrepresentation phenomena
can also be found in insurance-related survey data when revealing a particular piece of true information carries
a cost, such as a social-desirability or financial cost. Turning a blind eye to misrepresentation activities may
degrade the quality of actuarial models, leading to infelicitous business decisions and/or unfair premium schemes.
Misrepresentation risk is also of particular interest to insurance regulators, who attempt to understand how the
presence of fraudulent behavior by a group of insured individuals might affect the welfare of the others (Gabaldon
et al., 2014). Hence, assessing and managing misrepresentation risk is an indispensable component of modern
insurance practice.
There are two strands of literature on misrepresentation risk. The first focuses on
qualitative research, attempting to propose superior policy designs and proactive oversight
processes for deterring misrepresentation behaviors (see Hamilton, 2009, for a comprehensive review). The other
strand focuses on the quantitative aspect of misrepresentation management and aims to quantify the
extent of misrepresentation risk using statistical models. Our paper sits squarely in the latter strand.
Arguably, modeling insurance misrepresentation is a challenging task. What complicates the task, from the
statistical standpoint, is the unobservable nature of fraudulent activities. Namely, whether misrepresentation
actually occurred cannot be discovered until a formal investigation is undertaken. Thereby, traditional
statistical methods (e.g., discriminant analysis and logistic regression), which require access to a sample frame
containing the random variable of concern (i.e., the misrepresentation status at the individual level), may not be
directly used to study misrepresentation. Consequently, a new set of statistical tools is naturally called upon. In
particular, we have a keen interest in developing a rigorous statistical framework in order to deliver scientifically sound
answers to the following two questions, which are of fundamental importance in quantitative misrepresentation risk
management:
Q1. Based on a given set of insurance claims data, how to assess the level of misrepresentation activities in the
data, before actually observing the fraudulent behaviors?
Q2. If a significant level of misrepresentation activities are discovered, how to select the most suspicious individuals
for the validation purpose?
Despite its practical importance, modeling misrepresentation did not receive much attention from insurance
researchers until recently, and the related literature is thus considerably limited. To the best of our knowledge,
there are only a few existing works on misrepresentation modeling, including Akakpo et al. (2019); Xia (2018); Xia
and Gustafson (2016, 2018); Xia et al. (2018). In particular, one of the most recent attempts, made in Akakpo
et al. (2019), significantly inspires our undertaking. In that paper, the authors built on the theoretical groundwork
established by Xia and Gustafson (2016) and proposed a mixture of parametric regressions to model insurance
misrepresentation. Before placing the present paper into perspective, the following paragraphs give a coarse overview of
the misrepresentation model in Akakpo et al. (2019). Some standard notation for describing the misrepresentation
problem of interest is needed beforehand. We denote the data set by {y_i, x_i, z∗_i}_{i=1}^n, where y_i ∈ R is the response
(e.g., log-transformed insurance claims), x_i ∈ R^d is the d-dimensional correctly measured covariate (e.g., various
rating factors and demographic information), and z∗_i ∈ {0, 1} is the self-reported and potentially misrepresented
classification label, with z∗_i = 0 representing a negative risk status. Admittedly, the aforementioned setup may be
restrictive in the sense that only one potentially misrepresented variable is considered and it is binary. However,
this type of misrepresentation problem is among the most common in real-world practice. It is also worth
mentioning that insurance misrepresentation usually occurs in one direction only, benefiting the respondent
financially. For example, in a health insurance plan with a smoking surcharge, a non-smoking policyholder has no
incentive to misrepresent as a smoker and thereby increase the insurance premium. Correspondingly, the true unknown label,
denoted by z_i, is partially missing and satisfies z_i = z∗_i if z∗_i = 1.
Next, we present a slight generalization of the misrepresentation model introduced by Akakpo et al. (2019).¹
Given the correctly measured covariate x_i and the true label z_i of the potentially misrepresented variable, assume
that the conditional distribution of the log-transformed insurance claims can be captured by a Gaussian linear regression
(or, equivalently, a log-normal linear regression on the original claims):
y_i − (β0 + x_i′ β + z_i β_z) ∼iid N(0, σ²),   i = 1, . . . , n,   (1)

where β0, β = (β1, . . . , βd)′ and β_z are the regression coefficients, σ > 0 is the standard deviation parameter of the
normal distribution, and “∼iid” designates random variables that are independently and identically distributed.
Recall that z_i is partially missing, so (y_i | x_i, z∗_i = 1) =ᵈ (y_i | x_i, z_i = 1), where “=ᵈ” signifies equality in
distribution. If z∗_i = 0, then (y_i | x_i, z∗_i) admits a two-component mixture:

F(y_i | x_i, z∗_i = 0) = (1 − π) F(y_i | x_i, z_i = 0) + π F(y_i | x_i, z_i = 1),   (2)
¹Multiple correctly measured covariates are allowed in the parametric misrepresentation model presented here, while the one in Akakpo et al. (2019) only considered the potentially misrepresented variable.
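To fix ideas, the parametric data-generating process in Equations (1) and (2) can be simulated directly. The sketch below uses hypothetical coefficient values and a hypothetical 25% misreporting rate, and checks that the empirical fraction of true positives among the z∗ = 0 records matches π = P[z = 1 | z∗ = 0]; it is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
beta0, beta, beta_z, sigma = 1.0, 0.5, 2.0, 1.0   # hypothetical coefficients

x = rng.normal(size=n)
z = rng.binomial(1, 0.4, size=n)                  # true (latent) risk status
# a z = 1 individual misreports z* = 0 with probability 0.25 (hypothetical);
# misrepresentation is one-directional, so z = 0 is always reported truthfully
misreport = (z == 1) & (rng.random(n) < 0.25)
z_star = np.where(misreport, 0, z)

# claims under the Gaussian linear model (1)
y = beta0 + beta * x + beta_z * z + sigma * rng.normal(size=n)

# empirical misrepresentation probability pi = P[z = 1 | z* = 0]
pi_hat = z[z_star == 0].mean()
```

Here the theoretical value is π = (0.4 × 0.25) / (0.6 + 0.4 × 0.25) ≈ 0.143, and `pi_hat` should land close to it for large n.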
where F (·) denotes the conditional cumulative distribution function (CDF) of y given (x, z∗) or (x, z), and π =
P[z = 1 | z∗ = 0] represents the misrepresentation probability. To model misrepresentation, the statistical task is to
make inference on the model’s parameters, (β0,β, βz, σ), as well as the misrepresentation probability π. In Akakpo
et al. (2019), the authors proposed to use the Expectation-Maximization (EM) algorithm to estimate the model.
The Gaussian linear model in (1) is both useful in a significant range of practical situations and elegant enough
for theoretical analysis. However, when it is used for modeling misrepresentation, a fundamental issue concerning
the soundness of the estimation result arises: misspecifying the Gaussian linear model for the claim data
may contaminate the mixture structure in Equation (2). Moreover, owing to the lack of validated data on
misrepresentation in practice, it is impossible to verify whether the mixture structure in Equation (2) is truly
due to fraudulent behavior, or simply a consequence of the misspecification of the parametric model. Theoretically,
it is rather challenging to quantify the bias in the estimation of the misrepresentation probability when the claim
model is misspecified. Misrepresentation models based on other parametric distributions, such as the gamma, Pareto
and Weibull, are also considered in Xia (2018); Xia and Gustafson (2016); Xia et al. (2018). That said, these
modifications still rely on strong parametric assumptions on the claim distribution and hence cannot yield a
satisfactory resolution of the robustness issue noted above.
Another potential cause of untrustworthy estimates of misrepresentation under the Gaussian linear model (1) is the
presence of outliers, owing to the use of least-squares-based procedures. For instance, if a non-misrepresenter case
(i.e., z_i = z∗_i = 0) with a relatively high response value y_i appears in the data set, then this observation may induce
over-estimation of the regression function under z = 0. If fitting regression functions to data were the sole statistical
goal, then a straightforward way to deal with outliers would be to identify and remove them from the estimation
process. However, this rather naive approach does not suit a misrepresentation analysis, because the outliers may
actually correspond to the misrepresented individuals.
Beyond the choice of a suitable statistical modeling method, another important practical issue for modern
insurance applications is the computational bottleneck encountered with massive insurance data. While the
computational complexity of restrictive parametric methods, such as linear models, usually grows linearly with the
sample size, flexible statistical methods, such as non-parametric models, suffer much heavier computational burdens
as the sample grows. Therefore, special attention must be paid to how to efficiently carry out the appropriate
analysis and answer the fundamental questions Q1 and Q2, without sacrificing statistical accuracy.
In light of the discussion thus far, the major object of interest in this paper is straightforward: a non-parametric
approach for robust estimation of insurance misrepresentation. The primary tool in our study
is kernel quantile regression (KQR). The rationale behind this model choice is two-fold. First, the embedded kernel
method provides a flexible statistical structure to capture the non-linear and interaction effects in
insurance data. As a consequence, the application of KQR should greatly moderate the practical concern about model
misspecification in studying misrepresentation. Second, compared to mean regression, which is more familiar in
insurance circles, quantile regression is known to be more robust to the presence of outliers. Furthermore, in order
to ease the computational difficulties posed by large insurance data sets, we combine the state-of-the-art random sketch
technique with the conventional estimation method for KQR, which significantly reduces computing time
and memory requirements.
An innovative learning algorithm is introduced to implement the KQR-based misrepresentation model. The
algorithm not only finds non-parametric estimates of misrepresentation probability (which answers question Q1) as
well as the regression structure between insurance claims and relevant covariates, but also helps identify potential
misrepresenters (which answers question Q2). Based on extensive simulation studies, we show that the proposed
framework is able to accurately estimate the misrepresentation probability under various data generating processes
(DGPs), ranging from univariate linear models to multivariate non-linear models. Moreover, misrepresenters
are successfully detected with very low error rates. The proposed misrepresentation framework therefore delivers
a satisfactory performance under a variety of challenging situations. Finally, the proposed model is
applied to study the Medical Expenditure Panel Survey (MEPS) data. In line with the results of Akakpo et al.
(2019), our model suggests a significant percentage of respondents misrepresented on the self-reported insurance
status in 2014, in order to avoid the potential tax penalty. However, our estimated percentage is substantially lower
than that reported in Akakpo et al. (2019). This discrepancy may be attributed to the fact that the proposed KQR
method could better cope with the non-linear dependence structure inherent in the MEPS data, and thus leads to
a more reliable estimation of the misrepresentation probability.
The quantile regression based misrepresentation model
We cope with the loss modeling component of misrepresentation analysis using the notion of quantile regression,
championed by Koenker (see Koenker, 2005, for a comprehensive review). Different from most regression methodologies,
which center on the conditional mean, quantile regression supplements the statistical toolbox
with a general technique for estimating families of conditional quantile functions. Because quantile regression can
be applied to study the functional dependence of data across the lower, middle and upper tails of the response
distribution, it is capable of providing more distributional information than classical mean regression. In recent
years, quantile regression has become a very popular toolkit for applied actuaries and risk analysts to consider the
effect of covariates on the distribution of the dependent variable (Baione and Biancalana, 2019; Kudryavtsev, 2009;
Perez-Marın et al., 2019; Xia, 2014).
Misrepresentation model’s specification
We are now in a position to lay down the quantile regression based misrepresentation model of interest. Though
elementary, we recall that the quantile function associated with CDF G is defined as
G−1(τ) = inf{t ∈ R : G(t) ≥ τ}, with τ ∈ (0, 1).
If G is continuous, then the quantile function boils down to the ordinary inverse of CDF (Embrechts and Hofert,
2013).
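As a quick illustration of the generalized-inverse definition above, the following sketch evaluates G^{−1}(τ) = inf{t ∈ R : G(t) ≥ τ} against an empirical CDF; the helper name `quantile_inf` is ours, not the paper's.

```python
import numpy as np

def quantile_inf(sample, tau):
    """G^{-1}(tau) = inf{t : G(t) >= tau}, with G the empirical CDF of `sample`."""
    s = np.sort(np.asarray(sample, dtype=float))
    n = len(s)
    # the empirical CDF jumps to i/n at the i-th order statistic, so the
    # generalized inverse is the smallest order statistic with i/n >= tau
    k = int(np.ceil(tau * n)) - 1
    return s[max(k, 0)]

sample = [3.0, 1.0, 2.0, 4.0]
quantile_inf(sample, 0.5)   # -> 2.0: the empirical CDF first reaches 0.5 at t = 2
```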
We have already defined respectively zi and z∗i , i = 1, . . . , n, as the true and self-reported labels of the potentially
misrepresented rating factor. Given the true label zi = r with r ∈ {0, 1}, we model the functional dependence
between insurance claims and the correctly measured rating factors by
y_i − [b_r + h_r(x_i)] = ξ_i(τ) ∼iid G,   where b_r ∈ R, h_r(·) : R^d → R, and G^{−1}(τ) = 0 for some τ ∈ (0, 1).   (3)
Equivalently, the function fr := br+hr is the 100τ% quantile of the conditional distribution of y given (x, z = r) for
r ∈ {0, 1}. Recall that only z∗i is observable but not zi. We employ the hybrid structure proposed by Akakpo et al.
(2019) in order to address the unobservable nature of misrepresentation. Precisely, we model data (y_i, x_i, z∗_i = 1) by
y_i = f1(x_i) + ξ_i(τ), because no misrepresentation can occur when z∗_i = 1, and (y_i, x_i, z∗_i = 0) by the two-component
mixture

y_i = f0(x_i) + ξ_i(τ),   with probability 1 − π;
y_i = f1(x_i) + ξ_i(τ),   with probability π.   (4)
The mixture structure in Equation (4) is reminiscent of the parametric misrepresentation model in Equation (2).
In this proposed framework, the core mechanism for quantifying misrepresentation is to implement a mixture
structure on data with z∗i = 0. To make the proposed misrepresentation model well-defined, one crucial problem
that needs to be addressed is the identifiability of the embedded mixture structure. Under the Gaussian linear
regression model without considering other covariates except z, Akakpo et al. (2019) showed that the associated
misrepresentation model is identifiable (also see the theoretical discussion in Xia and Gustafson, 2016). In the
ensuing assertion, we extend the identifiability result to the more general quantile regression based misrepresentation
models, which may further involve an arbitrary number of correctly measured covariates.
Proposition 1. The proposed misrepresentation model is identifiable.
Proof. We begin by noting that the proposed misrepresentation model equivalently defines the following conditional
distribution of the data:

y | (x, z∗ = 1) =ᵈ ξ + f1(x),   with ξ ∼ G;
y | (x, z∗ = 0) =ᵈ ξ + f0(x) + ω [f1(x) − f0(x)],   with ξ ∼ G, ω ∼ Bernoulli(π), and ξ ⊥⊥ ω.   (5)
Thus, it suffices to prove that there does not exist a different choice of (G, f0, f1, π), under which (5) represents the
same conditional distribution.
We proceed by contradiction. Assume that such a choice exists, and denote it by (G̃, f̃0, f̃1, π̃).
By the first equation of (5), we have ξ + f1(x) =ᵈ ξ̃ + f̃1(x), where ξ̃ ∼ G̃. Since the 100τ% quantiles of both G
and G̃ are 0, we can claim that f1(x) = f̃1(x), which further implies G = G̃. Conditional on x and z∗ = 0, the
characteristic function of y satisfies

∫ e^{itu} dG(u) × [ (1 − π) e^{i f0(x) t} + π e^{i f1(x) t} ] = ∫ e^{itu} dG̃(u) × [ (1 − π̃) e^{i f̃0(x) t} + π̃ e^{i f̃1(x) t} ]

for all t ∈ R, where i denotes the imaginary unit. Since the characteristic function ∫ e^{itu} dG(u) is non-vanishing
in a region around 0, we must have

(1 − π) e^{i f0(x) t} + π e^{i f1(x) t} = (1 − π̃) e^{i f̃0(x) t} + π̃ e^{i f̃1(x) t},   for all t around 0.

Equivalently,

π e^{i [f1(x) − f0(x)] t} − π = π̃ e^{i [f̃1(x) − f̃0(x)] t} − π̃,   for all t around 0.   (6)

In order to match the imaginary and real parts of both sides of Equation (6), we must have π = π̃ and f0(x) = f̃0(x). We
have thus established the identifiability property of the proposed model, and the proof is complete.
Remark 1. Although Proposition 1 is established for quantile regression, the result can also be applied to
study the identifiability of misrepresentation models constructed via mean regression. To see this, simply
choose τ such that the 100τ% quantile of the residual distribution is equal to its mean.
Regression functions in reproducing kernel Hilbert spaces
We have not yet specified h_r, r ∈ {0, 1}, in claim model (3). One trivial way of enforcing the structure in (3) is to
restrict h_r to be linear, corresponding to the linear quantile regression (Koenker and Bassett, 1978) widely used
in the classical statistics literature. However, real data often exhibit non-linear functional dependence, and thus
relying solely on linear quantile regression may not alleviate the practical concern about model misspecification
mentioned in the Introduction. Preferably, we seek a data-driven approach for specifying h_r.
Estimating the regression function h_r from data seems reasonable, but in order to make it feasible, we need at least to
assume that the functions admit some regularity properties. Bearing in mind the call for model flexibility in studying
misrepresentation, the notion of kernel smoothing regression (Aronszajn, 1950; Gu, 2013; Wahba, 1990) is natural to
evoke. To be more specific, we assume the underlying regression function h_r belongs to some reproducing kernel
Hilbert space (RKHS) HK, which is generated by a reproducing kernel (RK) K : Rd × Rd → R. The reproducing
kernel K is symmetric and satisfies Σ_{i=1}^L Σ_{j=1}^L c_i c_j K(u_i, u_j) ≥ 0 for any positive integer L, c_i ∈ R and u_i ∈ R^d,
i = 1, . . . , L. The induced RKHS is defined as the closure of the function set

H_K = { h : h(·) = Σ_{i=1}^L θ_i K(·, u_i),  L ∈ Z+, u_i ∈ R^d, θ_i ∈ R, i = 1, . . . , L }.
Under a reasonable choice of RK, the associated RKHS covers a rich range of smooth functions that can be
non-linear, and thus serves as a better candidate model than linear quantile regression or other pre-specified parametric
structures. For instance, the RKHS induced by the p-degree polynomial kernel K(u, v) = (1 + ⟨u, v⟩)^p, for some integer
p ≥ 1 and u, v ∈ R^d, includes the entire family of smoothing spline, additive spline, and interaction spline models
(Wahba, 1990). Here, and elsewhere, we use ⟨·, ·⟩ to denote the inner product operator. Some other popular choices
of RK include, for any u = (u1, . . . , ud) ∈ R^d, v = (v1, . . . , vd) ∈ R^d, and a bandwidth parameter ϱ > 0:

ANOVA radial basis kernel:  K(u, v) = ( Σ_{i=1}^d exp{ −(u_i − v_i)² / ϱ } )^d;
Gaussian kernel:  K(u, v) = exp{ −‖u − v‖² / ϱ };
Exponential kernel:  K(u, v) = exp{ ⟨u, v⟩ }.
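The three kernels listed above translate directly into code; the sketch below (function names ours, bandwidth ϱ written as `rho`) evaluates each at a pair of points.

```python
import numpy as np

def gaussian_kernel(u, v, rho=1.0):
    # K(u, v) = exp{-||u - v||^2 / rho}
    return np.exp(-np.sum((u - v) ** 2) / rho)

def anova_rbf_kernel(u, v, rho=1.0):
    # K(u, v) = (sum_i exp{-(u_i - v_i)^2 / rho})^d
    d = len(u)
    return np.sum(np.exp(-(u - v) ** 2 / rho)) ** d

def exponential_kernel(u, v):
    # K(u, v) = exp{<u, v>}
    return np.exp(np.dot(u, v))

u = np.array([0.0, 1.0])
v = np.array([0.0, 1.0])
gaussian_kernel(u, v)   # -> 1.0, since u = v
```

Symmetry of each kernel, K(u, v) = K(v, u), follows immediately from the formulas.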
More to the point, the Gaussian kernel mentioned above possesses the universal approximation property, meaning
that the induced RKHS is dense in the space of all continuous functions with compact support (see Micchelli
et al., 2006, for a more rigorous discussion). Therefore, at least in principle, the RKHS defined by the Gaussian
kernel is capable of approximating any continuous regression surface, or any regression surface that can be approximated
by continuous functions. For this reason, along with its mathematical tractability, the Gaussian kernel is
widely accepted as “the default” option across a broad spectrum of non-parametric analyses when there is no prior
knowledge guiding the kernel choice.
We close this section with an open discussion of the key differences between
the proposed KQR misrepresentation model and the Gaussian linear misrepresentation model considered in Akakpo
et al. (2019). First, we use quantile regression instead of traditional mean regression to model insurance
misrepresentation, a first attempt in the related literature. Second, the kernel method embedded in
the proposed misrepresentation model relaxes the linear constraint on the regression structure in Akakpo et al.
(2019), and thus our claim model is likely to suit a wider range of misrepresentation analysis tasks. Third, the
model of Akakpo et al. (2019) assumes that the potentially misrepresented variable affects the response distribution
through a parallel shift only, whereas in real-life data this assumption might not hold. In our misrepresentation model,
the regression structure can be distinct between z = 0 and z = 1 (see Equation (3)), yielding a more flexible
and realistic modeling framework. All in all, the aforementioned features of the proposed KQR framework
keep model risk at a minimum and contribute to a robust estimate of misrepresentation that is resistant
to outliers.
Estimation procedure
To implement the KQR-based misrepresentation model, we seek a workable scheme to estimate the misrepresentation
probability π, as well as regression functions f0 and f1 belonging to some given RKHS HK. If the true label zi is
known, then a natural approach for estimating f_r, r ∈ {0, 1}, is through regularization. Specifically, just as
minimizing the ℓ1 loss for a location estimator yields the median, Koenker and Bassett (1978) generalized
this idea to obtain a regression estimate for any quantile through the pinball loss function: for any τ ∈ (0, 1) and
t ∈ R,

ρτ(t) = τ |t|,  if t ≥ 0;   ρτ(t) = (1 − τ) |t|,  if t < 0.   (7)
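The pinball loss in Equation (7) is straightforward to implement, and its defining property, that the constant minimizing the empirical pinball risk is the τ-quantile, can be checked numerically; the grid search below is purely illustrative.

```python
import numpy as np

def pinball_loss(t, tau):
    """rho_tau(t) = tau * |t| if t >= 0, and (1 - tau) * |t| if t < 0."""
    t = np.asarray(t, dtype=float)
    return np.where(t >= 0, tau * np.abs(t), (1 - tau) * np.abs(t))

# minimizing the mean pinball loss over a constant recovers the tau-quantile:
rng = np.random.default_rng(1)
sample = rng.normal(size=10_000)
grid = np.linspace(-3.0, 3.0, 601)
risks = [pinball_loss(sample - c, 0.9).mean() for c in grid]
c_star = grid[int(np.argmin(risks))]   # close to the 90% normal quantile (~1.28)
```

The asymmetric weights τ and 1 − τ are what tilt the minimizer away from the median toward the desired quantile.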
The pinball loss function spells out the key difference between the traditional mean regression and the quantile regression
considered herein. To be specific, a mean regression estimate is obtained by minimizing the expected squared loss,
while for a quantile regression estimate, the minimization problem is based on the weighted absolute loss defined
in Equation (7). Thereby, quantile regression is known to be more robust to the presence of outliers than mean
regression. Furthermore, denote the admissible function set in the RKHS by F_K = {f = b + h : b ∈ R, h ∈ H_K},
where b is the intercept of the regression function, and H_K is an RKHS over x_i, i = 1, . . . , n, generated by RK K.
Then, quantile regression f_r ∈ F_K, r ∈ {0, 1}, can be solved by minimizing the empirical risk plus a regularizer, such
that

f̂_r = argmin_{f ∈ F_K} { (1 / Σ_{i=1}^n 1{z_i = r}) Σ_{i=1}^n 1{z_i = r} ρτ(y_i − f(x_i)) + λ ‖h‖²_{H_K} },   (8)
where 1{·} denotes the indicator function and λ > 0 is a hyper-parameter corresponding to the Hilbert-space norm
‖·‖_{H_K} (Li et al., 2007; Takeuchi et al., 2006). This expression represents a trade-off between fidelity to the data,
as represented by the weighted sum of absolute residuals, and plausibility of the solution, as represented by the
norm of the function in the Hilbert space.
However, in the context of misrepresentation modeling, we only have (y_i, x_i, z∗_i) at our disposal, and the true label
z_i is unknown. Hence, the optimization in Equation (8) cannot be directly solved to obtain an estimate of f_r,
r ∈ {0, 1}. In what follows, we treat misrepresentation modeling as a missing data problem and
develop an iterative algorithm to circumvent this difficulty.
Expectation-Smoothing algorithm
When dealing with missing data in a parametric setting, the Expectation-Maximization (EM) algorithm is frequently
adopted to compute maximum likelihood estimates. The iterative algorithm which we are going to introduce for
estimating the KQR misrepresentation model is a non-parametric analogue of the EM algorithm. Our idea is
motivated by Wu and Yao (2016), wherein the estimation of linear quantile regression mixtures is explored. We
extend the method of Wu and Yao (2016) to the context of non-parametric quantile regression mixtures. Similar
to the EM algorithm, the proposed algorithm for KQR mixtures iterates between an expectation step (E-step)
and a smoothing step (S-step), which handle the missing information and the model learning, respectively. For this
reason, we term the algorithm of interest the Expectation-Smoothing (ES) algorithm.
For notational convenience, denote by Φ = (π, f0, f1) the set of parameters to be estimated. With some
reasonable initialization, the iterative two-step procedure of ES algorithm is described formally in the sequel.
1. E-step: For s ∈ N, let Φ^{(s)} = (π^{(s)}, f0^{(s)}, f1^{(s)}) be the updated estimate of Φ obtained from the s-th iteration,
and let e^{(s)}_{i,r} = y_i − f_r^{(s)}(x_i), i = 1, . . . , n, r ∈ {0, 1}, be the associated residuals. The aim of the E-step is to
compute the posterior mean of the guess of the true label z_i. That is, for any i ∈ {l : z∗_l = 0}, we calculate

p^{(s+1)}_{i,1} = E[z_i ; Φ^{(s)}] = P[z_i = 1 ; Φ^{(s)}] = π^{(s)} g^{(s)}(e^{(s)}_{i,1}) / [ (1 − π^{(s)}) g^{(s)}(e^{(s)}_{i,0}) + π^{(s)} g^{(s)}(e^{(s)}_{i,1}) ]   (9)

and

p^{(s+1)}_{i,0} = P[z_i = 0 ; Φ^{(s)}] = 1 − p^{(s+1)}_{i,1},   (10)
where g denotes the density function of residual distribution G, and g(s) is the corresponding estimate. For
any i ∈ {l : z∗l = 1}, because misrepresentation does not occur in this instance, we have pi,1 = 1 and pi,0 = 0.
We do not intend to assign a parametric form to density function g, and it can be estimated from residuals
e(s)i,r via constrained weighted kernel density estimation (Wu and Yao, 2016). Specifically, let K : R → R
be some kernel function with unbounded support (e.g., Gaussian kernel), and define the shorthand notation
K(t ; η) = η−1K(t/η), where t ∈ R, and η > 0 is a bandwidth parameter. The constrained kernel density
estimate of g can be implemented via
g^{(s)}(t) = α^{(s)} Σ_{r=0,1} Σ_{i=1}^n 1{e^{(s)}_{i,r} ≤ 0} p^{(s)}_{i,r} K(t − e^{(s)}_{i,r} ; η) + γ^{(s)} Σ_{r=0,1} Σ_{i=1}^n 1{e^{(s)}_{i,r} > 0} p^{(s)}_{i,r} K(t − e^{(s)}_{i,r} ; η),   t ∈ R,   (11)

in which the constants α^{(s)} and γ^{(s)} are chosen such that

∫_{−∞}^{∞} g^{(s)}(t) dt = 1   and   ∫_{−∞}^{0} g^{(s)}(t) dt = τ.

Equivalently, the constants α^{(s)} and γ^{(s)} satisfy the system of linear equations

α^{(s)} Σ_{r,i} 1{e^{(s)}_{i,r} ≤ 0} p^{(s)}_{i,r} + γ^{(s)} Σ_{r,i} 1{e^{(s)}_{i,r} > 0} p^{(s)}_{i,r} = 1,
α^{(s)} Σ_{r,i} 1{e^{(s)}_{i,r} ≤ 0} p^{(s)}_{i,r} ν^{(s)}_{i,r} + γ^{(s)} Σ_{r,i} 1{e^{(s)}_{i,r} > 0} p^{(s)}_{i,r} ν^{(s)}_{i,r} = τ,

where ν^{(s)}_{i,r} = ∫_{−∞}^{0} K(t − e^{(s)}_{i,r} ; η) dt for i = 1, . . . , n and r ∈ {0, 1}.
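A compact sketch of the E-step, combining Equations (9)–(11), is given below. For illustration, the previous-step weights p^{(s)}_{i,r} are replaced by a flat initialization, the smoothing kernel K(·; η) is Gaussian (so that ν_{i,r} has the closed form Φ(−e_{i,r}/η)), and the bandwidth η is a hypothetical choice; this is a sketch under those assumptions, not the authors' implementation.

```python
import numpy as np
from math import erf

def gauss_kernel(t, eta):
    # K(t; eta) = eta^{-1} K(t / eta) with a standard Gaussian K
    return np.exp(-0.5 * (t / eta) ** 2) / (eta * np.sqrt(2.0 * np.pi))

def std_normal_cdf(u):
    return 0.5 * (1.0 + np.vectorize(erf)(u / np.sqrt(2.0)))

def e_step(y, x, z_star, f0, f1, pi, tau, eta=0.3):
    """One E-step: posterior weights (9)-(10) with the constrained KDE (11).
    f0, f1 are the current quantile-regression estimates, passed as callables."""
    e = np.stack([y - f0(x), y - f1(x)], axis=1)   # residuals e_{i,r}, r = 0, 1
    p = np.full_like(e, 0.5)                       # flat stand-in for p^{(s)}_{i,r}
    p[z_star == 1] = [0.0, 1.0]                    # no misrepresentation if z* = 1

    neg, pos = (e <= 0), (e > 0)
    # nu_{i,r} = integral of K(t - e_{i,r}; eta) over t in (-inf, 0]
    nu = std_normal_cdf(-e / eta)
    # alpha, gamma chosen so that the KDE integrates to 1 and has tau-quantile 0
    A = np.array([[(p * neg).sum(),       (p * pos).sum()],
                  [(p * neg * nu).sum(),  (p * pos * nu).sum()]])
    alpha, gamma = np.linalg.solve(A, np.array([1.0, tau]))

    def g_hat(t):                                  # Equation (11)
        k = gauss_kernel(t - e, eta)
        return alpha * (p * neg * k).sum() + gamma * (p * pos * k).sum()

    g0 = np.array([g_hat(t) for t in e[:, 0]])
    g1 = np.array([g_hat(t) for t in e[:, 1]])
    p1 = np.where(z_star == 0,
                  pi * g1 / ((1.0 - pi) * g0 + pi * g1),   # Equation (9)
                  1.0)
    return p1, g_hat
```

By construction, the returned density estimate integrates to one and places mass τ below zero, exactly the two constraints on α^{(s)} and γ^{(s)}.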
2. S-step: This step serves a role similar to that of the M-step in the EM algorithm, which aims to obtain the most
plausible estimate given the current posterior distribution of the missing data. Specifically, the estimate of the
misrepresentation probability π is updated via

π^{(s+1)} = Σ_{i=1}^n 1{z∗_i = 0} p^{(s+1)}_{i,1} / Σ_{i=1}^n 1{z∗_i = 0}.   (12)

The estimate of the quantile regression function f_r, r ∈ {0, 1}, is obtained through

f^{(s+1)}_r = argmin_{f ∈ F_K} { (1 / (n p̄^{(s+1)}_r)) Σ_{i=1}^n p^{(s+1)}_{i,r} ρτ(y_i − f(x_i)) + λ ‖h‖²_{H_K} },   (13)

where p̄^{(s+1)}_r = Σ_{i=1}^n p^{(s+1)}_{i,r} / n. The optimization problem (13) can be viewed as a weighted variant of
the penalized RKHS regression considered in Equation (8). By solving the optimization problem (13), the
regression models are trained according to weighted empirical risk functions in which the posterior weights
p_{i,r}, i = 1, . . . , n, r ∈ {0, 1}, reflect how likely misrepresentation is to have occurred in each observation.
At first sight, solving the optimization problem (13) seems rather difficult in its current form. But
thanks to the celebrated representer theorem (Kimeldorf and Wahba, 1971), the solution to Equation (13)
can be written as f^{(s+1)}_r(·) = b^{(s+1)}_r + h^{(s+1)}_r(·) = b^{(s+1)}_r + Σ_{i=1}^n θ^{(s+1)}_{i,r} K(·, x_i) for r ∈ {0, 1}, such that

(b^{(s+1)}_r, θ^{(s+1)}_r)⊤ = argmin_{(b,θ)⊤ ∈ R^{n+1}} { (1 / (n p̄^{(s+1)}_r)) p^{(s+1)}_r⊤ ρτ(y − b1 − Kθ) + λ θ⊤Kθ },   (14)

where θ^{(s+1)}_r = (θ^{(s+1)}_{1,r}, . . . , θ^{(s+1)}_{n,r})⊤, p^{(s+1)}_r = (p^{(s+1)}_{1,r}, . . . , p^{(s+1)}_{n,r})⊤, K ∈ R^{n×n} has elements k_{i,j} =
K(x_i, x_j), y = (y1, . . . , yn)⊤, 1 = (1, . . . , 1)⊤, and, by a slight abuse of notation, ρτ(u) = (ρτ(u1), . . . , ρτ(un))⊤
denotes the element-wise application of the function ρτ to a vector u = (u1, . . . , un)⊤ ∈ R^n. Now, we can apply either
the pseudo-data method (Nychka et al., 1995; Yuan, 2006) or general-purpose numerical optimizers to solve
the optimization in Equation (14). A more detailed discussion of the pseudo-data method is relegated to
Appendix A in order to facilitate the reading herein.
It is noteworthy that the ES algorithm under consideration is not designed to optimize a specific objective
function, so we declare convergence of the ES algorithm when the relative changes in the parameter estimates between two
consecutive steps fall below some user-specified thresholds.
Application of random sketch for big data analysis
The evolutionary success of actuarial and statistical methods is to a significant degree determined by considerations
of computational convenience. The proposed KQR-based misrepresentation model provides an appreciable improvement
over the Gaussian linear misrepresentation model of Akakpo et al. (2019) in terms of statistical flexibility and
robustness, however at the expense of intense computation. In particular, if the pseudo-data method is used, then
solving the optimization problem (14) involves repeatedly inverting an n-by-n matrix
(see Equation (25) in Appendix A). In the current big data era, insurance records may contain millions of
observations, which makes a naive implementation of the method prohibitive. In a similar vein, applications
of general-purpose numerical optimizers to the optimization problem (14) suffer from the same scalability
issue.
To overcome the computational issue, we will take advantage of the state-of-the-art statistical learning technique,
random sketch (Halko et al., 2011; Mahoney, 2011; Pilanci and Wainwright, 2016; Raskutti and Mahoney, 2016;
Yang et al., 2017, etc.). The key idea of random sketching is to construct a small “sketch” of the whole data set via
random sampling or random projection, and then use this sketch as a surrogate for the full data set in the computations
of interest. In the context of KQR, the random sketch approach can be applied
on the data kernel matrix K to perform dimension reduction. To be more specific, we consider a low-rank
approximation to the optimizer of (14) in the S-step, restricting the coefficient vector in (14) to take the form
S^⊤ θ_r^{(s+1)} for some randomly generated projection matrix S ∈ R^{m×n} with m ≪ n and
θ_r = (θ_{1,r}, …, θ_{m,r})^⊤ ∈ R^m. In other words, the coefficient vector lies in the m-dimensional subspace
spanned by the rows of S. Under this imposed low-rank restriction, the optimization (14) therefore reduces to

(b_r^{(s+1)}, θ_r^{(s+1)})^⊤ = argmin_{(b,θ) ∈ R^{m+1}} { (1/(n p̄_r^{(s+1)})) (p_r^{(s+1)})^⊤ ρ_τ(y − b1 − K S^⊤ θ) + λ θ^⊤ S K S^⊤ θ }.   (15)
The computational expense of (15) mostly lies in inverting the m-by-m sketched kernel matrix S K S^⊤; hence we reduce
the big data problem to a small-scale optimization problem.
An appropriate choice of the random sketch matrix S (e.g., sub-Gaussian sketch, randomized orthogonal system
sketch, or sub-sampling sketch) is capable of preserving as much of the data information as possible, such that
the difference between the solutions of (14) and (15) is small. At the same time, the application of random sketch
reduces the regression estimation in Equation (15) to an m-dimensional convex programming problem, which can be
solved much faster than that of Equation (14). As shown by Yang et al. (2017), the projection dimension m can be
set as small as m ∝ (log n)^{3/2} for RKHS-based regression. Thereby, the computational burden involved in the
proposed KQR misrepresentation model is much relieved for massive data applications.
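To illustrate the dimension-reduction mechanics, the sketch below (all names are ours) replaces the pinball loss with a squared loss so that both the full and the sketched problems admit closed-form solutions; the roles of S and of the sketched matrices K S^⊤ and S K S^⊤ are exactly as described above, but this is only an illustrative stand-in, not the paper's quantile-loss estimator.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a smooth univariate signal plus noise.
n, lam = 500, 1e-3
m = 47                                   # roughly ceil(3 * (log n)^(3/2)), per Yang et al. (2017)
x = rng.uniform(1, 10, n)
y = 20 * x / (2 + x) + rng.standard_normal(n)

K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 5.0)   # Gaussian kernel matrix
S = rng.standard_normal((m, n)) / np.sqrt(m)        # sub-Gaussian sketch

# Full problem: minimize ||y - K a||^2 + n*lam * a' K a  ->  an n x n solve.
a_full = np.linalg.solve(K + n * lam * np.eye(n), y)
f_full = K @ a_full

# Sketched problem: restrict a = S' theta  ->  only an m x m solve.
KS = K @ S.T                                        # n x m
A = KS.T @ KS + n * lam * (S @ KS)                  # m x m
theta = np.linalg.solve(A, KS.T @ y)
f_sketch = KS @ theta

rel_err = np.linalg.norm(f_full - f_sketch) / np.linalg.norm(f_full)
print(f"relative difference between full and sketched fits: {rel_err:.3f}")
```

With m chosen on the (log n)^{3/2} scale, the sketched fit is close to the full fit while the dominant linear solve shrinks from n x n to m x m.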
Hyper-parameter tuning
Thus far, we have solved the estimation problem for a prefixed regularization parameter λ. As in any smoothing
problem, the choice of the regularization parameter λ plays a critical role in the successful implementation of the KQR-based
misrepresentation model. For quantile regression, a common criterion for tuning λ is based on the robust
cross-validation score (i.e., leave-one-out cross-validation with the pinball loss function (6) as the error measure; see Li
et al., 2007; Nychka et al., 1995; Yuan, 2006). To account for the mixture structure embedded in the KQR-based
misrepresentation model, we consider a conditional counterpart of the robust cross-validation score, termed the
conditional cross-validation (CCV) score, formulated as
CCV(λ) = (1/(n p̄_0)) Σ_{i=1}^n p_{i,0} ρ_τ(y_i − f_0^{[−i]}(x_i)) + (1/(n p̄_1)) Σ_{i=1}^n p_{i,1} ρ_τ(y_i − f_1^{[−i]}(x_i)),   (16)

for given p_{i,r} ∈ [0, 1], i = 1, …, n, r ∈ {0, 1}, and p̄_r = Σ_{i=1}^n p_{i,r}/n. Given the weights p_r = (p_{1,r}, …, p_{n,r}) for
r ∈ {0, 1}, the regression function f_r^{[−i]} is estimated in the same manner as f_r except that the i-th sample is excluded.
The posterior weights p_r can be computed in the final iteration of the proposed ES algorithm. The rationale
behind the CCV score is that a reasonable choice of λ should stimulate small prediction errors in the group to
which observations tend to belong (i.e., z = 0 or 1), and the ratios 1/(n p̄_r), r ∈ {0, 1}, are used to normalize the
corresponding quantities.
Despite its intuitive interpretation, evaluating the CCV score is onerous because, for every candidate in a pre-specified
finite set of values for λ, we have to estimate n pairs of quantile regressions, i.e., f_r^{[−i]}, i = 1, …, n,
r ∈ {0, 1}. In order to reduce this computational burden, we adopt the approximation idea proposed
by Yuan (2006) and extend it to suit the CCV score (16). Throughout the rest of this section, we briefly describe
the approximation method for evaluating the CCV score while relegating the details to Appendix B, so that our major focus
on misrepresentation modeling can be maintained.
Our route for evaluating (16) hinges on the conditional leave-one-out lemma (see Lemma 2 in Appendix B),
which implies the approximation:

f_r(x_i) − f_r^{[−i]}(x_i) ≈ (∂f_r(x_i)/∂y_i) [y_i − f_r^{[−i]}(x_i)], for i = 1, …, n and r ∈ {0, 1}.
Then approximating the CCV score (16) by a first-order Taylor polynomial yields

(1/(n p̄_0)) Σ_{i=1}^n p_{i,0} ρ_τ(y_i − f_0^{[−i]}(x_i)) + (1/(n p̄_1)) Σ_{i=1}^n p_{i,1} ρ_τ(y_i − f_1^{[−i]}(x_i))
    ≈ (1/(n p̄_0)) Σ_{i=1}^n p_{i,0} ρ_τ(y_i − f_0(x_i)) / (1 − ψ_{i,0}) + (1/(n p̄_1)) Σ_{i=1}^n p_{i,1} ρ_τ(y_i − f_1(x_i)) / (1 − ψ_{i,1}),
where ψ_{i,r} = ∂f_r(x_i)/∂y_i for i = 1, …, n, r ∈ {0, 1}. The partial derivatives ψ_{i,r} can be computed via the diagonal
elements of the hat matrix associated with the weighted quantile smoothing (see Equation (29) in Appendix B).
The expression above is much cheaper to compute than its original form (16). However, Yuan (2006) argued
that the first-order Taylor polynomial approximation may perform poorly when ∂f_r(x_i)/∂y_i is around zero, and
thus suggested replacing the individual partial derivatives by their weighted average ψ̄_r = Σ_{i=1}^n p_{i,r} ψ_{i,r} / (n p̄_r),
r ∈ {0, 1}. Invoking this idea leads us to the conditional approximate cross-validation (CACV)
score:
CACV(λ) = (1/(n p̄_0 (1 − ψ̄_0))) Σ_{i=1}^n p_{i,0} ρ_τ(y_i − f_0(x_i)) + (1/(n p̄_1 (1 − ψ̄_1))) Σ_{i=1}^n p_{i,1} ρ_τ(y_i − f_1(x_i)).   (17)
To identify the most reasonable λ within a pre-specified set of possible values, we search for the candidate value
that leads to the lowest CACV score.
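Given the fitted quantities, a CACV-style score in the spirit of Equation (17) can be evaluated as follows; the function and argument names are ours, and f0, f1 stand in for the fitted KQR estimates evaluated at the observed x_i:

```python
import numpy as np

def pinball(u, tau):
    """Element-wise pinball loss."""
    return u * (tau - (u < 0).astype(float))

def cacv_score(y, f0, f1, p0, p1, psi_bar0, psi_bar1, tau=0.5):
    """CACV-style score.

    y          : responses
    f0, f1     : fitted quantile regressions evaluated at the x_i
    p0, p1     : posterior weights p_{i,0} and p_{i,1}
    psi_bar0/1 : weighted averages of the partial derivatives d f_r(x_i) / d y_i
    """
    n = len(y)
    pbar0, pbar1 = p0.mean(), p1.mean()
    term0 = (p0 * pinball(y - f0, tau)).sum() / (n * pbar0 * (1 - psi_bar0))
    term1 = (p1 * pinball(y - f1, tau)).sum() / (n * pbar1 * (1 - psi_bar1))
    return term0 + term1

# Hypothetical tuning loop: pick the candidate lambda with the lowest score.
# best_lam = min(candidates, key=lambda lam: cacv_score(*fit_for(lam)))
```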
Recommendations on misrepresentation validation
Algorithm 1 summarizes all the statistical considerations discussed thus far for estimating the KQR-based
misrepresentation model. Let us remark that, upon convergence of the ES algorithm, we are able to obtain the
parameter estimate Φ = (π, f_0, f_1) = (π^{(∞)}, f_0^{(∞)}, f_1^{(∞)}), the posterior probability vectors p_0^{(∞)} and p_1^{(∞)}, the residuals
e_0^{(∞)} = (e_{1,0}^{(∞)}, …, e_{n,0}^{(∞)}) and e_1^{(∞)} = (e_{1,1}^{(∞)}, …, e_{n,1}^{(∞)}), and the residual distribution estimate g^{(∞)}. The estimate π can
be used to answer question Q1 posed in the Introduction.
Data: n-dimensional vectors of the response y and the reported status z*; an n × d characteristics matrix X; an initial guess of the misrepresentation probability π^{(0)}; l ∈ N candidates of the regularization parameter λ_1, …, λ_l; a stopping criterion ε
Result: Estimated misrepresentation probability π and quantile regressions f_r, r ∈ {0, 1}
1   Specify the kernel matrix K and generate the random sketch matrix S;
2   for j ← 1 to l do
3       Set the regularization parameter λ = λ_j;
4       Initialize the regression estimates f_0^{(0)} and f_1^{(0)} pretending that no misrepresentation occurs (i.e., set z_i = z*_i);
5       Set s ← 0;
6       repeat
7           Calculate the posterior probabilities p_0^{(s+1)} and p_1^{(s+1)} based on Equations (9) and (10);
8           Update the misrepresentation probability estimate π^{(s+1)} using Equation (12);
9           Update the regression estimates f_0^{(s+1)} and f_1^{(s+1)} using Equation (15);
10          s ← s + 1;
11      until |π^{(s+1)} − π^{(s)}| < ε;
12      Compute CACV(λ_j) according to Equation (17);
13  end
14  Find λ* ∈ {λ_1, …, λ_l} corresponding to the lowest CACV score;
15  return the misrepresentation model estimate (π^{(s+1)}, f_0^{(s+1)}, f_1^{(s+1)}) associated with λ*
Algorithm 1: Algorithm for estimating the proposed KQR-based misrepresentation model.
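The control flow of Algorithm 1 can be illustrated with a deliberately simplified, self-contained ES loop. In this sketch the two KQR fits are replaced by weighted medians (constant "regression curves") and the kernel density estimate g by a fixed Laplace density; these stand-ins are ours, so the code shows the E-step/S-step structure rather than the paper's actual estimator.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data: all individuals report z* = 0, but a latent fraction
# true_pi actually has z = 1 (a higher location).
n, true_pi = 2000, 0.15
z = (rng.uniform(size=n) < true_pi).astype(int)
y = np.where(z == 0, 5.0, 10.0) + rng.laplace(scale=1.0, size=n)

def g(u):                          # residual density, assumed known here
    return 0.5 * np.exp(-np.abs(u))

def weighted_median(y, w):
    order = np.argsort(y)
    cw = np.cumsum(w[order]) / w.sum()
    return y[order][np.searchsorted(cw, 0.5)]

pi, f0, f1, eps = 0.05, np.median(y), np.median(y) + 1.0, 1e-6
for _ in range(500):
    # E-step analogue (roles of Equations (9) and (10)): posterior of z = 1.
    num1 = pi * g(y - f1)
    p1 = num1 / ((1 - pi) * g(y - f0) + num1)
    # S-step analogue (roles of Equations (12) and (15)).
    pi_new = p1.mean()
    f0, f1 = weighted_median(y, 1 - p1), weighted_median(y, p1)
    converged = abs(pi_new - pi) < eps
    pi = pi_new
    if converged:
        break

print(f"estimated misrepresentation probability: {pi:.3f}")  # close to 0.15
```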
In order to answer the remaining question Q2, we introduce a probability-based approach and a distance-based
approach to identify the v = a π Σ_{i=1}^n 1{z*_i = 0} most doubtful individuals for the validation purpose. We term
the magnitude of v the validation size in the misrepresentation investigation. The parameter a > 0 in the above
expression of v controls the validation size according to the risk attitude of the analyst, and it should be determined
subjectively. To identify potential misrepresenters, we look for observations that reported a negative status on z but
behave very differently from the others in the corresponding group. Accordingly, among the observations with
z* = 0, the probability-based approach selects the most doubtful individuals according to the v largest posterior
probabilities p_1^{(∞)}, whereas the distance-based approach adopts the v largest residuals e_0^{(∞)}.
Note that the task of misrepresenter identification is akin to conducting a set of independent hypothesis tests on
every sample with z∗ = 0 as to whether or not misrepresentation occurs. For this reason, we appeal to the language
of statistical testing for classifying the two types of error that will be encountered in the misrepresenter identification
process. The type I error corresponds to the instance wherein an honest individual is falsely labeled as a doubtful
misrepresenter, while the type II error occurs when a misrepresenter goes undetected. For the sake of prudence,
an ideal misrepresenter identification method should minimize the type II error rate while keeping the validation
size and the type I error rate at reasonable levels. Concerning the type II error, the aforementioned probability-based
and distance-based approaches each have their own drawbacks. For the probability-based approach, recall that
the calculation of the posterior probability hinges on the estimated residual distribution g^{(∞)} (see Equation (9)).
The kernel density method used to obtain g^{(∞)} may perform unsatisfactorily in the tail portions. Hence, when
the residuals e_{i,0}^{(∞)} and e_{i,1}^{(∞)} are extreme and fall in the tails of the residual distribution, relying on posterior
probabilities for assessing misrepresentation may not be reliable. On the other hand, the distance-based approach
is better at identifying the misrepresenters that are far away from the regression curves/surfaces, but may overlook
doubtful individuals when the two regression curves/surfaces f_0^{(∞)} and f_1^{(∞)} are close, so that the magnitudes
of e_{i,0}^{(∞)} and e_{i,1}^{(∞)} are both small.
The following toy example, adopted from Example 3 in the Simulation section, demonstrates the
respective drawbacks of the probability-based and distance-based approaches in an illuminating manner. Assume
that the underlying data generating process (DGP) follows

y_i = (1 − z_i)[20 x_i (2 + x_i)^{−1}] + z_i[40 x_i (4 + x_i)^{−1}] + ξ_i,

where the x_i's are independent realizations of Uniform[1, 10] and the ξ_i's are independently distributed Student's t variables
with 5 degrees of freedom. Furthermore, we assume that 90% of the individuals report a negative risk status on
z, among whom 15% misrepresent. We generate 1,000 samples and pretend that the true model as well as the true
label z are unknown. Invoking Algorithm 1 on the simulated data, which contain {y_i, x_i, z*_i}, i = 1, …, 1,000, yields
an estimate of the misrepresentation probability π = 14.3%. In this toy example, we simply set the control parameter
a = 1 in determining the validation size. The estimated residual distribution and misrepresenter identification are
displayed in Figures 1 and 2, respectively.
As shown in Figure 1, the kernel density estimate g^{(∞)} successfully recovers the overall shape of the residual
distribution g, except that there exist some minor yet still noticeable discrepancies in the tails of the distribution.
Such a deficiency inherent in the kernel density estimate is caused by the scarcity of data falling in
the tail regions. Consequently, the associated posterior probabilities may mislead the misrepresenter identification
when observations are far away from both regression curves (e.g., see the left panel of Figure 2 for the
undetected misrepresenters above the regression curve of z = 1). The right panel of Figure 2 shows that
the distance-based approach does not suffer from this drawback of the probability-based approach,
but relying on the v largest residuals may overlook misrepresenters when the two regression curves are close
(e.g., see the right panel of Figure 2 for the undetected misrepresenters around x = 1 and 2).
For a working resolution of the aforementioned problems, we suggest in the following section to consider the
union of doubtful individuals identified by either the probability-based approach or the distance-based approach. This
Figure 1: Density comparison (left panel, kernel density estimate versus true density function) and QQ-plot (right panel, kernel-based empirical quantiles versus theoretical quantiles) between g^{(∞)} and g, when the KQR-based misrepresentation model is fitted to 1,000 simulated data according to Example 3.
union method helps lower the type II error rate at the risk of slightly increasing the validation size and the type
I error rate. More sophisticated misrepresentation discovery methods which can better balance the type I and II
errors will be a very interesting direction for future research.
Simulation Study
Several simulation examples are presented in this section to illustrate the effectiveness of the ES algorithm
for implementing the KQR-based misrepresentation model. We consider DGPs consisting of linear and
non-linear regression structures. We start with univariate cases in Examples 1 to 4, which allow us to visually
inspect the proposed model's performance and limitations; we then consider a more general
multivariate case in Example 5. Although these examples do not replicate real-life insurance data, they are
designed to showcase the usefulness of the KQR-based misrepresentation model even when non-linear and
interactive dependencies are involved in the DGP.
The simulations in the subsequent examples are based on the probability of reported negative risk status
P(z* = 0) = 0.9 and the misrepresentation probability P(z = 1 | z* = 0) = 0.15. For the residual distribution, we assume the central
Student's t-distribution with dispersion parameter σ = 1 and 5 degrees of freedom, in order to emulate the
heavy-tailed phenomena that often exist in insurance data. The five examples are given as follows.
Figure 2: The scatter plots of 1,000 simulated data according to Example 3, superimposed by the true (solid) and fitted (dashed) regression curves for z = 0 and z = 1. The potential misrepresenters are identified respectively according to the probability-based approach (left panel) and the distance-based approach (right panel). The signs "+", "◦" and "∗" represent the observations having truly positive risk status, truly negative risk status, and misrepresented risk status on z, respectively. Observations marked with the sign "□" are honest individuals falsely identified as doubtful misrepresenters (i.e., the type I error). Misrepresenters marked with the sign "△" are undetected (i.e., the type II error), and are successfully discovered otherwise.
Example 1. This is a univariate example with a linear regression setting:
yi = (1− zi) (5 + xi) + zi (10 + xi) + ξi, i = 1, . . . , n,
where xi’s are independent samples of Uniform[1 , 10].
The setup in Example 1 complies with the misrepresentation model in Akakpo et al. (2019), except that a more
flexible Student's t-distribution is assumed for the residual errors. The next example is a slight extension of Example
1, in which the slope coefficients of the regression models for z = 0 and z = 1 differ, so the effect of z
on y goes beyond a parallel shift.
Example 2. Consider
yi = (1− zi) (5 + xi) + zi (10 + 2xi) + ξi, i = 1, . . . , n,
where xi’s are independent samples of Uniform[1 , 10].
Estimating the misrepresentation probability under the linear setting in Example 2 already exceeds the capacity
of the misrepresentation model considered in Akakpo et al. (2019). However, it is natural to expect that the
proposed KQR misrepresentation model can handle this case well. Next, we proceed to examples involving
non-linear dependence in the regression structures.
Example 3. Consider that the regression structure of the DGP admits a polynomial ratio form:

y_i = (1 − z_i)[20 x_i (2 + x_i)^{−1}] + z_i[40 x_i (4 + x_i)^{−1}] + ξ_i, i = 1, …, n,

where the x_i's are independent samples of Uniform[1, 10].
Example 4. Consider that the regression structure of the DGP has the form of an exponential-polynomial product:

y_i = 5 + (1 − z_i)[4 exp(0.125 x_i) x_i^{−1}] + z_i[8 exp(0.25 x_i) x_i^{−1}] + ξ_i, i = 1, …, n,

where the x_i's are independent samples of Uniform[1, 10].
Our final example considers the most general setup which contains multiple covariates and complex functional
dependence.
Example 5. Assume that the DGP possesses a regression structure in which non-linear, interactive, and additive
dependencies are mixed as follows:

y_i = 10 x_{1i} (2 + x_{1i})^{−1} + 3 exp{0.125 (x_{2i} + 0.5 x_{3i})} x_{2i}^{−1} + ξ_i, if z_i = 0,
y_i = 20 x_{1i} (4 + x_{1i})^{−1} + 6 exp{0.25 (x_{2i} + 0.5 x_{3i})} x_{2i}^{−1} + ξ_i, if z_i = 1,

where the x_{ji}'s (j = 1, 2, 3; i = 1, …, n) are independent samples of Uniform[1, 10].
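A sketch of the simulation design for Example 5 follows; it assumes, as is implicit in the setup above, that respondents reporting z* = 1 are truthful, and the function name is ours.

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate_example5(n, sigma=1.0, p_neg=0.9, p_mis=0.15):
    """Simulate Example 5 with P(z* = 0) = p_neg and P(z = 1 | z* = 0) = p_mis."""
    X = rng.uniform(1, 10, size=(n, 3))
    z_star = (rng.uniform(size=n) > p_neg).astype(int)        # reported status
    z = np.where(z_star == 1, 1,                              # z* = 1 reporters truthful
                 (rng.uniform(size=n) < p_mis).astype(int))   # some z* = 0 misrepresent
    x1, x2, x3 = X.T
    f0 = 10 * x1 / (2 + x1) + 3 * np.exp(0.125 * (x2 + 0.5 * x3)) / x2
    f1 = 20 * x1 / (4 + x1) + 6 * np.exp(0.25 * (x2 + 0.5 * x3)) / x2
    xi = sigma * rng.standard_t(df=5, size=n)                 # heavy-tailed residuals
    y = np.where(z == 0, f0, f1) + xi
    return X, y, z, z_star

X, y, z, z_star = simulate_example5(5000)
print((z_star == 0).mean(), z[z_star == 0].mean())  # near 0.9 and 0.15
```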
To fit the KQR-based misrepresentation model to the aforementioned examples, we deliberately choose to
focus on the median regression (i.e., τ = 0.5), which can be viewed as a robust alternative to the parametric
mean regression frequently used in the related literature. Recall that the Hilbert space induced by the Gaussian kernel
possesses the universal approximation property, so we use the Gaussian kernel for the regression estimation.
The bandwidth parameter ϱ > 0 in the Gaussian kernel is set to the average of the 10% and 90% quantiles of
‖x_i − x_j‖^2 over i ≠ j ∈ {1, …, n}, as suggested by Takeuchi et al. (2006). Moreover, we use the Gaussian
kernel to implement the constrained weighted density estimation in Equation (11) with bandwidth η = 1.06 n^{−1/5} φ,
where φ denotes the standard deviation of the residual errors (Silverman, 1986). In the application of random sketch,
we use sub-Gaussian sketching and set the projection dimension to m = ⌈3 (log n)^{3/2}⌉, as
suggested by the theoretical analysis in Yang et al. (2017).
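These three tuning heuristics are straightforward to code; a sketch (function names ours):

```python
import numpy as np

def gaussian_kernel_bandwidth(X):
    """Takeuchi et al. (2006) heuristic: average of the 10% and 90% quantiles
    of the pairwise squared distances ||x_i - x_j||^2, i != j."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    off = d2[~np.eye(len(X), dtype=bool)]            # drop the diagonal
    return 0.5 * (np.quantile(off, 0.10) + np.quantile(off, 0.90))

def silverman_bandwidth(resid):
    """Silverman (1986) rule of thumb: 1.06 * n^(-1/5) * std(residuals)."""
    return 1.06 * len(resid) ** (-1 / 5) * np.std(resid)

def sketch_dimension(n):
    """Projection dimension m = ceil(3 * (log n)^(3/2)), per Yang et al. (2017)."""
    return int(np.ceil(3 * np.log(n) ** 1.5))

print(sketch_dimension(1000))  # 55
```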
Invoking the proposed ES algorithm, Figure 3 depicts the fitted regression curves for Examples 1 to 4 based
on a single trial of simulation. Within Figure 3, we set the validation size control parameter a = 1 and use the union
of the probability-based and distance-based methods to identify potential misrepresenters. We can see that the
KQR misrepresentation model produces a fairly satisfactory goodness of fit to the simulated data for all the examples
reported, including the linear examples with a parallel shift (Example 1) and a non-parallel shift (Example 2), as well
as the non-linear examples (Examples 3 and 4). In particular, only a few type I and type II errors are observed,
indicating that most of the misrepresenters are successfully discovered by the suggested union method. The
misrepresenters who cannot be detected by the algorithm have responses standing in the region where
the realizations of the two regression models overlap significantly. In such a circumstance, the misrepresenters
behave much like honest individuals, and it is therefore statistically difficult to identify them. The same argument
applies to the observations committing the type I error. Accordingly, we observe from Figure 3 that
more severe type I and II errors occur in Example 1 than in Example 2, because the parallel-shift construction of
Example 1 results in a larger overlapping area between the two regression models. For the same reason, within
each of Examples 2, 3 and 4, fewer type I and II errors occur in the right tail portion,
where the distance between the two regression curves is larger. For Example 5, which involves multiple predictor
variables, the model's goodness of fit and misrepresenter identification can no longer be visually displayed. Instead,
the estimation results for Example 5 can be found in Figure 5 as well as Tables 1 and 2.
The results presented in Figure 3 are based on a single trial of simulation only. To account for the stochastic
variation in simulation, we now repeat the same experiment for each of Examples 1 - 5 multiple times. Furthermore,
we vary the sample size, residual standard deviation, and validation size in order to understand
the associated impacts on the proposed model's performance. Some additional notions for quantifying the model
performance are needed beforehand. Define

Type I error rate = (the number of samples committing the type I error) / (the number of individuals having a negative status on z: Σ_{i=1}^n 1{z_i = 0}),   (18)

which calculates the percentage of individuals with truly negative status on z that are falsely labeled as misrepresenters.
Meanwhile, define

Type II error rate = (the number of samples committing the type II error) / (the true number of misrepresenters: Σ_{i=1}^n 1{z*_i = 0 ∩ z_i = 1}),   (19)

which measures the percentage of misrepresenters that are undetected. Moreover, we define

Validation rate = (the validation size v) / (the true number of misrepresenters: Σ_{i=1}^n 1{z*_i = 0 ∩ z_i = 1}).   (20)
Figure 3: The scatter plots of 1,000 simulated data according to Example 1 (top left), Example 2 (top right), Example 3 (bottom left), and Example 4 (bottom right), superimposed by the true (solid) and fitted (dashed) regression curves for z = 0 and z = 1, together with the misrepresenter identification. The signs "+", "◦", "∗", "□" and "△" are defined in the same manner as in Figure 2. The estimated misrepresentation probabilities are 14.30%, 15.21%, 16.16%, and 15.10%, respectively.
The rationale underlying the definition of validation rate is that, if the ratio is close to 100%, then the validation
size is approximately equal to the true number of misrepresenters. On the one hand, a 100% validation rate means
the misrepresenter identification procedure behaves as an oracle who has the knowledge of the unknown number,
say s, of misrepresenters and deliberately examines the top s suspicious subjects. On the other hand, it also means
the misrepresenter identification procedure can detect most of the misrepresentation cases if the type II error rate
is also low.
Now, we evaluate the accuracy of the misrepresentation probability estimation and misrepresenter detection
with different sample sizes n ∈ {1,000, 3,000, 5,000, 7,000}. For every fixed n, we run 150 independent trials of
simulations for all five examples. The box plots describing the distributions of π and of the type I and type II
error rates for the five examples are displayed in Figures 4 and 5. For each box plot, the middle line is the median
of the distribution, the lower and upper limits of the box are the first and third quartiles, respectively, the
upper (resp. lower) whisker extends up to 1.5 times the interquartile range from the top (resp. bottom) of the
box, and the points are the outliers beyond the whiskers. For all the examples, we observe that as the sample
size increases, the estimators' distributions become more concentrated, leading to more accurate estimation results.
Moreover, the median of the estimated misrepresentation probability π is very close to the true value of 15%. When
it comes to the performance of misrepresenter detection, both the type I and II error rates are quite low. Across
Examples 1 - 5, the median type I error rates range from about 0.2% to 3%, and the median type II error rates
from 0.9% to 9%. Among these examples, Example 1 has the highest type I and II error rates. This is because the
construction of Example 1 involves two parallel regression curves, resulting in a relatively large region in
which the realizations of the regression models for z = 0 and z = 1 overlap significantly. As explained
earlier, observations falling into such overlapping areas are statistically difficult to distinguish between honest
individuals and misrepresenters. The realizations from the two regression models in Example 2 (see Figure 3) seem
to overlap the least, and thus Example 2 has the lowest error rates among the five examples.
After analyzing the impact of sample size, we turn to study how data volatility may influence the proposed
model's performance. It is natural to conjecture that more volatile data may harm the model's performance
and increase the error rates, because realizations of the two regression models for z = 0 and z = 1 are more likely
to overlap. In order to verify this conjecture, we fix n = 1,000, vary the dispersion parameter of the
residual distribution among σ ∈ {0.75, 1, 1.25}, and examine the validation rate as well as the type I and II error
rates by running 150 independent trials of simulations. The results summarized in Table 1 confirm the surmise:
both the type I and type II error rates increase with the dispersion parameter σ. Among the five
examples, larger data volatility influences Example 1 the most, with a 25% increase in σ doubling
the type II error rate. This is again due to the parallel-shift construction of Example 1, which makes the increase
in data volatility deteriorate the performance of the misrepresenter identification algorithm over the whole range
of the data. In contrast, for the other examples, the effect of increasing σ is concentrated in the left tail portion of
the DGP, where the two regression curves are close (see Figure 3). Although the accuracy of the misrepresenter
identification algorithm declines with noisier data, the validation rate seems less sensitive to
changes in σ. More importantly, when the suggested union method is used, all the validation rates are about 100%,
meaning that the validation size suggested by the algorithm is rather close to the true number of misrepresenters.
To ease the negative impact of noisy data on misrepresenter identification, one working approach is to increase
the validation size. Table 2 collects the validation rate and the type I and type II error rates for a ∈ {1, 1.1, 1.2}
Figure 4: The box plots of the misrepresentation probability estimates π (left column), the type I error rate (middle column), and the type II error rate (right column) for Examples 1 - 4 (from top to bottom), across sample sizes n ∈ {1,000, 3,000, 5,000, 7,000}. For the box plots of π, the horizontal line marks the true misrepresentation probability.
Figure 5: The box plots of the misrepresentation probability estimate π (left column), the type I error rate (middle column), and the type II error rate (right column) for Example 5, across sample sizes n ∈ {1,000, 3,000, 5,000, 7,000}. For the box plot of π, the horizontal line marks the true misrepresentation probability.
              Validation Rate          Type I Error Rate          Type II Error Rate
σ             Prob./Dist.   Union      Prob.    Dist.    Union    Prob.    Dist.    Union
Example 1
0.75          98.73%        100.58%    0.52%    0.45%    0.56%    4.27%    3.85%    2.62%
1             94.86%        98.09%     1.10%    1.02%    1.18%    11.29%   10.84%   8.03%
1.25          89.52%        93.73%     1.80%    1.72%    1.97%    20.55%   20.05%   17.27%
Example 2
0.75          98.37%        99.93%     0.03%    0.05%    0.06%    1.81%    1.92%    0.42%
1             97.53%        99.57%     0.13%    0.18%    0.24%    2.37%    2.66%    0.91%
1.25          98.45%        101.11%    0.31%    0.41%    0.51%    3.32%    3.88%    1.78%
Example 3
0.75          98.36%        103.40%    0.55%    0.96%    1.27%    4.77%    7.06%    3.78%
1             99.17%        105.85%    0.99%    1.45%    1.93%    7.64%    10.22%   6.30%
1.25          98.44%        106.17%    1.49%    1.98%    2.59%    10.07%   12.80%   8.57%
Example 4
0.75          98.50%        101.76%    0.69%    0.74%    0.92%    5.35%    5.62%    3.38%
1             98.30%        103.65%    1.41%    1.49%    1.81%    12.18%   12.66%   9.21%
1.25          92.94%        100.49%    2.27%    2.35%    2.86%    20.00%   20.42%   15.78%
Example 5
0.75          96.20%        99.48%     0.11%    0.11%    0.18%    4.46%    4.45%    1.58%
1             98.59%        102.43%    0.33%    0.36%    0.52%    5.54%    5.71%    2.79%
1.25          96.24%        101.10%    0.64%    0.73%    0.99%    7.44%    7.94%    4.56%

Table 1: The validation rate, type I and type II error rates for Examples 1 - 5. The reported figures are based on 150 independent trials of simulations with n = 1,000, a = 1 and σ ∈ {0.75, 1, 1.25}.
with σ = 1. The following observations can be made. First, increasing a leads to a larger validation size
by construction. In particular, the validation rate under the union method increases from around 100% when
a = 1 to between 117.93% and 135.07% when a = 1.2. Second, because of the larger validation size, the type I error
rate also increases with a, from below 2% when a = 1 to between 2.17% and 6.50% when a = 1.2. Finally, we observe
appreciable improvements in the type II error rate as the validation size parameter a increases. Specifically, when
we choose a = 1.2, the type II error rates in all five examples drop below 4% (the one in Example 2 is
even as small as 0.07%). In summary, increasing the validation size helps lower the type II error rate, but at the
expense of raising the type I error rate. Such a trade-off between the type I and type II errors is akin to
that in hypothesis testing, underscoring how accuracy and prudence are two sides of the same coin. In future
research, it will be interesting to investigate the optimal choice of validation size that best balances the type I and
type II errors.
              Validation Rate          Type I Error Rate          Type II Error Rate
a             Prob./Dist.   Union      Prob.    Dist.    Union    Prob.    Dist.    Union
Example 1
1             94.86%        98.09%     1.10%    1.02%    1.18%    11.29%   10.84%   8.03%
1.1           105.55%       108.17%    1.97%    1.91%    2.16%    6.66%    6.31%    5.14%
1.2           114.98%       117.93%    3.26%    3.21%    3.62%    4.45%    4.20%    3.54%
Example 2
1             97.53%        99.57%     0.13%    0.18%    0.24%    2.37%    2.66%    0.91%
1.1           110.36%       114.92%    1.49%    1.50%    2.27%    0.18%    0.25%    0.11%
1.2           118.97%       129.38%    3.24%    3.24%    5.07%    0.11%    0.12%    0.07%
Example 3
1             99.17%        105.85%    0.99%    1.45%    1.93%    7.64%    10.22%   6.30%
1.1           107.85%       118.66%    2.16%    2.74%    3.97%    4.22%    7.50%    3.55%
1.2           118.74%       135.07%    3.68%    4.23%    6.50%    2.89%    5.98%    2.42%
Example 4
1             98.30%        103.65%    1.41%    1.49%    1.81%    12.18%   12.66%   9.21%
1.1           105.15%       110.59%    2.30%    2.40%    2.97%    7.29%    7.90%    5.71%
1.2           115.86%       122.84%    3.49%    3.63%    4.55%    4.56%    5.36%    3.66%
Example 5
1             98.59%        102.43%    0.33%    0.36%    0.52%    5.54%    5.71%    2.79%
1.1           105.53%       110.77%    1.33%    1.44%    2.14%    1.43%    2.05%    0.83%
1.2           116.27%       126.26%    2.89%    2.97%    4.57%    0.69%    1.16%    0.39%

Table 2: The validation rate, type I and type II error rates for Examples 1 - 5. The reported figures are based on 150 independent trials of simulations with n = 1,000, σ = 1 and a ∈ {1, 1.1, 1.2}.
All in all, these simulation studies confirm that the proposed ES algorithm is able to generate valid inference on misrepresentation, even when the true DGP is highly non-linear. Moreover, as illustrated in Tables 1 and 2, the suggested union method shows an advantage over the probability-based and distance-based methods in that it lowers the type II error rate noticeably while only slightly increasing the validation size and the type I error rate.
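As a small illustration of the union rule, the sketch below combines the individuals flagged by a probability-based score and a distance-based score. The score arrays and the per-method budget k are hypothetical placeholders here, since the two scoring rules themselves are defined in the "Recommendations on misrepresentation validation" section rather than in this passage:

```python
import numpy as np

def union_flag(prob_scores, dist_scores, k):
    """Flag the union of the k most suspicious individuals under a
    probability-based score and a distance-based score (both arrays are
    placeholders; higher values are taken to mean more suspicious)."""
    top_prob = set(np.argsort(prob_scores)[-k:].tolist())
    top_dist = set(np.argsort(dist_scores)[-k:].tolist())
    return top_prob | top_dist  # union contains between k and 2k indices
```

Because the two flagged sets typically overlap only partially, the union validates somewhat more individuals than either method alone, which is consistent with the validation-rate and error-rate pattern reported in Table 2.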
Application to real data
In this section, we apply the KQR-based misrepresentation model to assess the misrepresentation risk in the Medical Expenditure Panel Survey (MEPS) data. The Health Insurance Mandate of the Patient Protection and Affordable Care Act features a tax penalty, enacted in 2014, for individuals who did not have health insurance during the year. Therefore, the earlier study of Akakpo et al. (2019) argued that there are financial incentives for MEPS respondents to misrepresent their self-reported insurance status. To prepare for the succeeding discussion, let us denote by y the logarithm of the annual claim amount of each observation, and by z the unobserved, true insurance status, with z = 0 being insured and z = 1 being uninsured. The observed, self-reported insurance status is denoted by z∗. Hence, the probability π = P[z = 1 | z∗ = 0] corresponds to the percentage of MEPS respondents who misrepresented their insurance status so as to avoid the potential tax penalty. Akakpo et al. (2019) employed a Gaussian linear regression model with the insurance status as the only covariate to study the misrepresentation issue. Specifically, they assumed that the conditional distribution of y is Gaussian with mean µ = β0 + β1 z, where (β0, β1) are regression coefficients. Based on the Gaussian linear regression, Akakpo et al. (2019) found significant statistical evidence of misrepresentation of the self-reported insurance status in year 2014. The probability of misrepresentation was estimated to be about 18% among the respondents who reported themselves as insured.
Armed with the KQR-based misrepresentation model proposed in this paper, we investigate the misrepresentation issue in the MEPS data under a non-parametric environment. Moreover, extra relevant covariates are included in our analysis, which helps lower the volatility of the residual errors in the misrepresentation model. As illustrated in the simulation analyses, less volatile residuals yield a more credible assessment of misrepresentation. The additional covariates considered in our subsequent analysis include age, gender, and body mass index (BMI, computed by dividing a person's weight in kilograms by the square of height in meters). These variables are selected because they are respondents' biographic characteristics which can be observed and verified easily. Hence, it is reasonable to assume that these variables are correctly measured.

A concise description of the data used in our analysis is provided in what follows. We focus on the year 2014 MEPS data, and consider all adult respondents aged from 18 to 60 who have positive annual medical expenditures and valid entries in all covariates considered. As a result, our data set consists of about 12,000 valid samples. Table 3 outlines the definitions and descriptive statistics of the variables considered. According to the summary statistics, a typical respondent in the data is a female, about 40 years old, with a BMI of 29. Notice that the mean values of the variables outlined in Table 3 are very similar between the insured and uninsured groups, except for the expenditure variable. This noticeable dissimilarity in medical expenditures between the insured and uninsured groups enables us to utilize the individual claim information for identifying potential misrepresentation activities. As a side note, the BMI has been widely accepted as a measure of obesity and is one of the key variables in analyzing medical expenditure (Bhattacharya and Bundorf, 2009; Finkelstein et al., 2003). A plethora of empirical literature has documented that BMI possesses a non-linear relationship with medical expenditure (Cawley et al., 2015; Laxy et al., 2017) and is dependent on age and gender (Nevill and Metsios, 2015). Consequently, adopting flexible non-linear models, such as the proposed KQR method, is crucial in our study of misrepresentation risk in the MEPS data.
To implement the KQR-based misrepresentation model for the MEPS data, we choose the same configuration as that of the simulation studies. Specifically, we set τ = 0.5, corresponding to median regression, use the Gaussian kernel for the non-parametric learning of both the regression functions and the residual distribution, and choose $m = \lceil 3 \log(n)^{3/2} \rceil = 87$ to be the projection dimension in the deployment of the random sketch technique. The kernel regression smoothing parameter λ is tuned by using the CACV index. Figure 6 displays the CACV index for varying values of λ, and it suggests the "best" smoothing parameter value to be λ = exp(−8.68). It is worth noting that, given the true label z, the regression function considered in the misrepresentation model of Akakpo et al. (2019) is constant. Since constant functions are nested within the admissible function set FK with zero RKHS norm, we argue that our estimated KQR will perform at least as well as the constant regression model in Akakpo et al. (2019), at least as far as the mean absolute error is concerned.
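For concreteness, the projection dimension above can be computed as follows. The Gaussian sketch matrix is one common choice of random projection and is an assumption of this sketch, not a prescription from the paper; the sample size 12,019 is the total of the two groups in Table 3:

```python
import math
import numpy as np

def sketch_dim(n):
    """Projection dimension m = ceil(3 * log(n)**(3/2))."""
    return math.ceil(3 * math.log(n) ** 1.5)

def gaussian_sketch(m, n, seed=0):
    """An m-by-n Gaussian sketch matrix S with N(0, 1/m) entries,
    a standard (assumed) choice for random sketching."""
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, 1.0 / math.sqrt(m), size=(m, n))
```

With the 12,019 retained MEPS samples, sketch_dim returns 87, matching the value used in the analysis.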
Invoking the estimation procedure described in Algorithm 1, we obtain an estimated misrepresentation probability of 9.21%. We remark that our non-parametric estimate of the misrepresentation probability is much lower than the parametric estimate of 18% obtained in Akakpo et al. (2019). A potential explanation for the discrepancy is that our proposed KQR method captures the subtle non-linear dependence structure inherent in the MEPS data, while the Gaussian model utilized by Akakpo et al. (2019) overlooks the effect of covariates, which may inflate the residuals and flag more misrepresenters.
What we have reported so far is a point estimate of misrepresentation. In the next stage of the analysis, we aim
to validate whether or not the estimate of misrepresentation is statistically significant. Formally, we are interested
in testing the null hypothesis, H0 : π = 0, against the alternative hypothesis, H1 : π > 0. To this end, we consider
a non-parametric bootstrap method to conduct the hypothesis testing. Its detailed steps are elaborated below:
1. Assume that the null hypothesis is true, i.e., no misrepresentation exists. In this case, we can fit the KQR model to the data by directly solving Equation (8) with zi = z∗i for all i = 1, . . . , n:
$$ f^{\text{null}}_r = \arg\min_{f \in \mathcal{F}_K} \left( \frac{1}{\sum_{i=1}^n \mathbf{1}\{z^*_i = r\}} \sum_{i=1}^n \mathbf{1}\{z^*_i = r\}\, \rho_\tau\big(y_i - f(x_i)\big) + \lambda \,\|h\|^2_{\mathcal{H}_K} \right), \quad \text{for } r \in \{0, 1\}. $$
2. Compute the residual errors under the null model estimated from Step 1:
$$ e^{\text{null}}_i = y_i - \left[ \mathbf{1}\{z^*_i = 0\}\, f^{\text{null}}_0(x_i) + \mathbf{1}\{z^*_i = 1\}\, f^{\text{null}}_1(x_i) \right], \quad \text{for } i = 1, \ldots, n. $$
Figure 6: Plot of the CACV index with varying values of log(λ); the dashed line marks the location of the minimum of the CACV curve.
3. Construct the bootstrap data {y^{bs}_i, x_i, z∗_i}_{i=1}^n with
$$ y^{\text{bs}}_i = \left[ \mathbf{1}\{z^*_i = 0\}\, f^{\text{null}}_0(x_i) + \mathbf{1}\{z^*_i = 1\}\, f^{\text{null}}_1(x_i) \right] + e^{\text{bs}}_i, \quad \text{for } i = 1, \ldots, n, $$
where $e^{\text{bs}} = (e^{\text{bs}}_1, \ldots, e^{\text{bs}}_n)$ denotes a bootstrap sample of $e^{\text{null}} = (e^{\text{null}}_1, \ldots, e^{\text{null}}_n)$ drawn with replacement.
4. Apply the ES algorithm on the bootstrap data obtained in Step 3 to estimate the misrepresentation probability.
5. Repeat Steps 3 and 4 w ∈ ℕ times to obtain $\pi^{\text{bs}}_1, \ldots, \pi^{\text{bs}}_w$, where $\pi^{\text{bs}}_j$ is the estimated misrepresentation probability based on the j-th bootstrap data set, j = 1, . . . , w.

6. Approximate the p-value of the test statistic by $p^{\text{bs}} = w^{-1} \sum_{j=1}^w \mathbf{1}\{\pi < \pi^{\text{bs}}_j\}$, where π is the estimated misrepresentation probability based on the original data.
7. Reject the null hypothesis if pbs falls below a user-specified significance level.
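The resampling loop in Steps 2 - 6 can be sketched as follows. Here `estimate_pi` is a placeholder for the ES algorithm of Algorithm 1, and `fitted_null` collects the null fitted values from Step 1; both names are assumptions of this illustration:

```python
import numpy as np

def bootstrap_p_value(y, fitted_null, estimate_pi, w=200, seed=0):
    """Residual-bootstrap test of H0: pi = 0 (Steps 2-6).

    y           : observed responses, shape (n,)
    fitted_null : fitted values under the null model, shape (n,)
    estimate_pi : callable mapping a response vector to an estimated
                  misrepresentation probability (ES-algorithm stand-in)
    """
    rng = np.random.default_rng(seed)
    resid = y - fitted_null                   # Step 2: null residuals
    pi_hat = estimate_pi(y)                   # estimate on original data
    pi_bs = np.empty(w)
    for j in range(w):                        # Steps 3-5: w bootstrap rounds
        e_bs = rng.choice(resid, size=resid.size, replace=True)
        pi_bs[j] = estimate_pi(fitted_null + e_bs)
    p_value = float(np.mean(pi_bs > pi_hat))  # Step 6: right-tail p-value
    return pi_hat, p_value
```

A small `p_value` then leads to rejecting H0 in Step 7.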
The intuition behind the above test is as follows. Under the null hypothesis there is no misrepresentation, and therefore a non-zero estimated misrepresentation probability π is caused purely by the randomness in the samples. In this case, $\pi^{\text{bs}}_1, \ldots, \pi^{\text{bs}}_w$ form the empirical null distribution of the misrepresentation probability estimator. If the estimated misrepresentation probability based on the observed data is very different from the estimates under the null hypothesis (i.e., falls on the right tail of the estimator's null distribution), then the null hypothesis should be rejected. Accordingly, $p^{\text{bs}}$ approximates the probability, under the null hypothesis, of observing an estimate at least as large as π, i.e., the probability that the null hypothesis would be falsely rejected. Such arguments are standard in the non-parametric bootstrap hypothesis testing literature (Dwass, 1957; see also Edgington, 2007 and Good, 2000 for comprehensive treatments of the topic).
In our analysis, we run w = 3,000 sets of bootstrap data in order to ensure that the estimated p-value is credible. We find that none of the bootstrap data sets yields an estimate of the misrepresentation probability higher than 9.21%. So the estimated p-value of the test statistic is 0%, and the null hypothesis H0 should be rejected at any practical significance level. Collectively, the bootstrap hypothesis test supports the conclusion that there is a significant level of misrepresentation in the MEPS data. This conclusion is consistent with the one in Akakpo et al. (2019), although our estimate of the misrepresentation probability is much lower. Last but not least, for the purpose of misrepresentation validation, one can apply the methods described in the "Recommendations on misrepresentation validation" section to identify the doubtful individuals.
Conclusions

In this paper, we proposed a class of non-parametric misrepresentation models constructed via quantile regression in reproducing kernel Hilbert spaces. The core mechanism of the proposed misrepresentation model is a mixture structure that caters for the potential occurrence of misrepresentation. We proved the identifiability of the proposed misrepresentation model. A novel algorithm, suitable for big data applications, was proposed to accommodate the model learning. Our extensive simulation study has shown that the proposed methodology is capable of providing valid inference on misrepresentation, even when the underlying DGP is highly non-linear and complex. We applied the proposed model to the Medical Expenditure Panel Survey data to study potential misrepresentation behaviors in the respondents' self-reported insurance status. We found strong statistical evidence that misrepresentation exists, which confirms the earlier conclusion drawn from the Gaussian linear regression model (Akakpo et al., 2019).

There are a handful of directions for future research. For example, one could explore the possibility of incorporating other statistical learning methods, such as support vector machines, tree-based models, or neural networks, in studying insurance misrepresentation. The mixture machinery used in this paper will serve as the building block for this research direction, with the embedded KQR substituted by one of the methods outlined above. It will be interesting to compare the performance of these methods in modeling misrepresentation. Another important research direction is to augment the proposed misrepresentation model to handle the frequency component of insurance claims data, which also contains statistical information about misrepresentation.
References
Akakpo, R., Xia, M., and Polansky, A. (2019). Frequentist inference in insurance ratemaking models adjusting for
misrepresentation. ASTIN Bulletin, 49(1):117–146.
Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society,
68(3):337–404.
Baione, F. and Biancalana, D. (2019). An individual risk model for premium calculation based on quantile: A com-
parison between generalized linear models and quantile regression. North American Actuarial Journal, 23(4):573–
590.
Bhattacharya, J. and Bundorf, M. K. (2009). The incidence of the healthcare costs of obesity. Journal of Health
Economics, 28(3):649–658.
Cawley, J., Meyerhoefer, C., Biener, A., Hammer, M., and Wintfeld, N. (2015). Savings in medical expenditures
associated with reductions in body mass index among US adults with obesity, by diabetes status. Pharmacoeco-
nomics, 33(7):707–722.
Dwass, M. (1957). Modified randomization tests for nonparametric hypotheses. The Annals of Mathematical
Statistics, 28(1):181–187.
Edgington, E. S. (2007). Randomization Tests. Chapman and Hall, New York.
Embrechts, P. and Hofert, M. (2013). A note on generalized inverses. Mathematical Methods of Operations Research,
77(3):423–432.
Finkelstein, E. A., Fiebelkorn, I. C., and Wang, G. (2003). National medical spending attributable to overweight
and obesity: how much, and who’s paying? Health Affairs, 22:219–226.
Gabaldon, I. M., Vazquez Hernandez, F. J., and Watt, R. (2014). The effect of contract type on insurance fraud.
Journal of Insurance Regulation, 33(8):197–230.
Good, P. (2000). Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. Springer,
New York.
Gu, C. (2013). Smoothing Spline ANOVA Models. Springer, New York.
Halko, N., Martinsson, P. G., and Tropp, J. A. (2011). Finding structure with randomness: Probabilistic algorithms
for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288.
Hamilton, J. M. (2009). Misrepresentation in the Life, Health, and Disability Insurance Application Process: A
National Survey. Minnesota State Bar Association, Minneapolis.
Kimeldorf, G. and Wahba, G. (1971). Some results on Tchebycheffian spline functions. Journal of Mathematical
Analysis and Applications, 33(1):82–95.
Koenker, R. (2005). Quantile Regression. Cambridge University Press, Cambridge.
Koenker, R. and Bassett, G. (1978). Regression quantiles. Econometrica, 46(1):33–50.
Kudryavtsev, A. A. (2009). Using quantile regression for rate-making. Insurance: Mathematics and Economics,
45(2):296–304.
Laxy, M., Stark, R., Peters, A., Hauner, H., Holle, R., and Teuner, C. M. (2017). The non-linear relationship
between BMI and health care costs and the resulting cost fraction attributable to obesity. International Journal
of Environmental Research and Public Health, 14:1–6.
Li, Y., Liu, Y., and Zhu, J. (2007). Quantile regression in reproducing kernel Hilbert spaces. Journal of the
American Statistical Association, 102(477):255–268.
Mahoney, M. W. (2011). Randomized algorithms for matrices and data. Foundations and Trends in Machine
Learning, 3(2):123–224.
Micchelli, C. A., Xu, Y., and Zhang, H. (2006). Universal kernels. Journal of Machine Learning Research, 7:2651–
2667.
Nevill, A. M. and Metsios, G. S. (2015). The need to redefine age- and gender-specific overweight and obese body
mass index cutoff points. Nutrition and Diabetes, 5:1–2.
Nychka, D., Gray, G., Haaland, P., Martin, D., and O’Connell, M. (1995). A nonparametric regression approach to
syringe grading for quality improvement. Journal of the American Statistical Association, 90(432):1171–1178.
Pérez-Marín, A. M., Guillén, M., Alcañiz, M., and Bermúdez, L. (2019). Quantile regression with telematics
information to assess the risk of driving above the posted speed limit. Risks, 7(3):80.
Pilanci, M. and Wainwright, M. J. (2016). Iterative Hessian sketch: Fast and accurate solution approximation for
constrained least-squares. The Journal of Machine Learning Research, 17(1):1842–1879.
Raskutti, G. and Mahoney, M. W. (2016). A statistical perspective on randomized sketching for ordinary least-
squares. The Journal of Machine Learning Research, 17(1):7508–7538.
Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall, London.
Takeuchi, I., Le, Q. V., Sears, T. D., and Smola, A. J. (2006). Nonparametric quantile estimation. Journal of
Machine Learning Research, 7:1231–1264.
Wahba, G. (1990). Spline Models for Observational Data. Society for Industrial and Applied Mathematics, Philadel-
phia.
Wu, Q. and Yao, W. (2016). Mixtures of quantile regressions. Computational Statistics and Data Analysis, 93:162–
176.
Xia, M. (2014). Risk segmentation using Bayesian quantile regression with natural cubic splines. Austin Statistics,
1(1):1–7.
Xia, M. (2018). Bayesian adjustment for insurance misrepresentation in heavy-tailed loss regression. Risks, 6(3):83.
Xia, M. and Gustafson, P. (2016). Bayesian regression models adjusting for unidirectional covariate misclassification.
Canadian Journal of Statistics, 44(2):198–218.
Xia, M. and Gustafson, P. (2018). Bayesian inference for unidirectional misclassification of a binary response trait.
Statistics in Medicine, 37(6):933–947.
Xia, M., Hua, L., and Vadnais, G. (2018). Embedded predictive analysis of misrepresentation risk in GLM ratemak-
ing models. Variance, 12(1):39–58.
Yang, Y., Pilanci, M., and Wainwright, M. J. (2017). Randomized sketches for kernels: Fast and optimal nonpara-
metric regression. The Annals of Statistics, 45(3):991–1023.
Yuan, M. (2006). GACV for quantile smoothing splines. Computational Statistics and Data Analysis, 50(3):813–829.
Appendix A Pseudo-data method for weighted quantile smoothing
We focus on the optimization problem in Equation (15), which is more general than that of Equation (14); if S is set to be the identity matrix of dimension n, the two problems coincide. For brevity, let us also suppress the indexes "r" and "s" in Equation (15) and simply write
$$ (b, \theta)^\top = \arg\min_{(b,\theta) \in \mathbb{R}^{m+1}} \left( \frac{1}{n \bar p}\, p^\top \rho_\tau\big(y - b\,\mathbf{1} - K S^\top \theta\big) + \lambda\, \theta^\top S K S^\top \theta \right), \qquad (21) $$
for a given weight vector $p = (p_1, \ldots, p_n)^\top$ with $\bar p = \sum_{i=1}^n p_i / n$, where $\rho_\tau(\cdot)$ is applied elementwise. The major difficulty in solving the optimization problem above is that the pinball loss function (7) is not differentiable at zero. To get around the non-differentiability issue, Nychka et al. (1995) suggested approximating the pinball loss by a differentiable function:
$$ \rho_{\tau,\delta}(t) = \begin{cases} \rho_\tau(t), & \text{if } |t| \ge \delta; \\ \tau\, t^2/\delta, & \text{if } 0 \le t \le \delta; \\ (1-\tau)\, t^2/\delta, & \text{if } -\delta \le t < 0. \end{cases} \qquad (22) $$
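The smoothed loss (22) can be transcribed directly, with the pinball loss ρτ written out for completeness; the helper names below are ours:

```python
import numpy as np

def pinball(t, tau):
    """Pinball (check) loss rho_tau(t)."""
    t = np.asarray(t, dtype=float)
    return np.where(t >= 0, tau * t, (tau - 1.0) * t)

def pinball_smooth(t, tau, delta):
    """Differentiable approximation rho_{tau, delta} of Equation (22):
    quadratic on [-delta, delta], pinball loss elsewhere."""
    t = np.asarray(t, dtype=float)
    quad = np.where(t >= 0, tau, 1.0 - tau) * t ** 2 / delta
    return np.where(np.abs(t) >= delta, pinball(t, tau), quad)
```

At |t| = δ the two branches agree (e.g. τδ²/δ = τδ), so the approximation is continuous, and the quadratic pieces remove the kink at zero.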
Here, δ > 0 is an approximation threshold parameter. Denote by
$$ (\zeta, \vartheta)^\top = \arg\min_{(\zeta,\vartheta) \in \mathbb{R}^{m+1}} \left( \frac{1}{n \bar p}\, p^\top \rho_{\tau,\delta}\big(y - \zeta\,\mathbf{1} - K S^\top \vartheta\big) + \lambda\, \vartheta^\top S K S^\top \vartheta \right) \qquad (23) $$
the approximate solution of (21), obtained by substituting the pinball loss function (7) with its differentiable counterpart $\rho_{\tau,\delta}$. Applying the first-order condition to the objective function in Equation (23) yields that the solution must satisfy
$$ \sum_{i=1}^n w_i \big( y_i - \zeta - k_i S^\top \vartheta \big) = 0 $$
and
$$ \sum_{i=1}^n w_i \big( y_i - \zeta - k_i S^\top \vartheta \big) \big( -k_i S^\top \big)_j + \lambda \big( S K S^\top \vartheta \big)_j = 0, \quad \forall j \in \{1, \ldots, m\}, $$
where $k_i$ is the i-th row of K and $w_i = \big[ 2 n \bar p \,( y_i - \zeta - k_i S^\top \vartheta ) \big]^{-1} p_i\, \rho'_{\tau,\delta}( y_i - \zeta - k_i S^\top \vartheta )$ for i = 1, . . . , n. A careful inspection of the above linear system reveals that the (approximated) weighted quantile smoothing in Equation (23) can be viewed as the analogue of traditional non-parametric smoothing based on the penalized squared loss:
$$ (y - \zeta\,\mathbf{1} - K S^\top \vartheta)^\top W \,(y - \zeta\,\mathbf{1} - K S^\top \vartheta) + \lambda\, \vartheta^\top S K S^\top \vartheta, \qquad (24) $$
where $W = \operatorname{diag}(w)$, $w = (w_1, \ldots, w_n)$, and the operator $\operatorname{diag}: \mathbb{R}^n \to \mathbb{R}^{n \times n}$ maps an n-tuple to the corresponding diagonal matrix.
Following the proposal in Nychka et al. (1995), we compute the solution to Equation (23) in an iterative manner. To be specific, given the current estimate $(\zeta^{(s)}, \vartheta^{(s)})^\top$, in the (s + 1)-th iteration we solve the weighted smoothing quantile regression (23) based on the weights
$$ w^{(s)}_i = \big[ 2 n \bar p \,( y_i - \zeta^{(s)} - k_i S^\top \vartheta^{(s)} ) \big]^{-1} p_i\, \rho'_{\tau,\delta}( y_i - \zeta^{(s)} - k_i S^\top \vartheta^{(s)} ), \quad \text{for } i = 1, \ldots, n. $$
Let $W^{(s)} = \operatorname{diag}(w^{(s)})$, $w^{(s)} = (w^{(s)}_1, \ldots, w^{(s)}_n)$, and define
$$ A = S K B K S^\top + \operatorname{Tr}(W^{(s)}) \big( S K W^{(s)} K S^\top + \lambda\, S K S^\top \big), \qquad B = W^{(s)} \mathbf{1} \mathbf{1}^\top W^{(s)}. $$
In light of the equivalence between (23) and (24), the updated estimate can be computed explicitly via
$$ \vartheta^{(s+1)} = D^{(s)}\, y, \qquad D^{(s)} = A^{-1} \big[ \operatorname{Tr}(W^{(s)})\, S K W^{(s)} - S K B \big], \qquad (25) $$
and
$$ \zeta^{(s+1)} = c^{(s)}\, y, \qquad c^{(s)} = \frac{1}{\operatorname{Tr}(W^{(s)})} \big[ \mathbf{1}^\top W^{(s)} - \mathbf{1}^\top W^{(s)} K S^\top D^{(s)} \big]. \qquad (26) $$
The iteration is repeated until some convergence criterion is met, upon which we obtain the solution to Equation (23). By choosing a small enough approximation threshold δ, the estimate at convergence, $(\zeta^{(\infty)}, \vartheta^{(\infty)})^\top$, provides a good approximation of $(b, \theta)^\top$ in Equation (21).
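The pseudo-data iteration above amounts to iteratively reweighted least squares with weights proportional to $\rho'_{\tau,\delta}(r)/(2r)$. The following is a simplified sketch of the same idea for a plain linear model, i.e., without the kernel matrix K or the sketch S; the interface and the ridge-style penalty are assumptions of this illustration, not the exact updates (25)-(26):

```python
import numpy as np

def irls_smoothed_quantile(X, y, tau=0.5, delta=1e-3, lam=0.0, iters=50):
    """Minimize sum_i rho_{tau,delta}(y_i - x_i^T beta) + lam * |beta|^2
    by iteratively reweighted least squares. The weight of a residual r
    is rho_{tau,delta}(r) / r^2 = (tau or 1 - tau) / max(|r|, delta)."""
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(iters):
        r = y - X @ beta
        w = np.where(r >= 0, tau, 1.0 - tau) / np.maximum(np.abs(r), delta)
        XtW = X.T * w                     # X^T W, rows of X scaled by w
        beta = np.linalg.solve(XtW @ X + lam * np.eye(d), XtW @ y)
    return beta
```

For τ = 0.5 this converges to the (smoothed) median regression fit, illustrating the robustness of quantile-type losses to outliers.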
Appendix B Conditional approximate cross-validation
The approximation method for evaluating the CCV score (16) is elaborated here. Recall that the differentiable function $\rho_{\tau,\delta}(\cdot)$ defined in Equation (22) approximates the pinball loss function $\rho_\tau(\cdot)$ when δ > 0 is chosen to be small. First, we substitute $\rho_\tau(\cdot)$ with $\rho_{\tau,\delta}(\cdot)$ so that the following Taylor-series-based approximation holds:
$$ \sum_{i=1}^n p_{i,r}\, \rho_\tau\big( y_i - f^{[-i]}_r(x_i) \big) \approx \sum_{i=1}^n p_{i,r}\, \rho_\tau\big( y_i - f_r(x_i) \big) + \sum_{i=1}^n p_{i,r}\, \rho'_{\tau,\delta}\big( y_i - f_r(x_i) \big) \big[ f_r(x_i) - f^{[-i]}_r(x_i) \big], \quad r = 0, 1. \qquad (27) $$
Now, we need a feasible way of computing $f_r(x_i) - f^{[-i]}_r(x_i)$. The following lemma is of auxiliary importance; its proof is motivated significantly by Lemma 3.1 in Yuan (2006).

Lemma 2. Given the regularization parameter λ > 0 and weights $p_{i,r} \in [0, 1]$, i = 1, . . . , n, r = 0, 1, define $f^{[i]}_r(\cdot) = b^{[i]}_r + h^{[i]}_r(\cdot)$ in the same manner as $f_r(\cdot) = b_r + h_r(\cdot)$, except that the i-th observation $y_i$ is replaced by $f^{[-i]}_r(x_i)$. Then we have $f^{[i]}_r = f^{[-i]}_r$ for i = 1, . . . , n and r = 0, 1.
Proof. For r = 0, 1 and any fixed i = 1, . . . , n, the following string of relationships holds by the definitions of $f^{[i]}_r$ and $f^{[-i]}_r$:
$$
\begin{aligned}
& \frac{1}{n \bar p_r} \left[ \sum_{j=1, j \ne i}^{n} p_{j,r}\, \rho_\tau\big( y_j - f^{[i]}_r(x_j) \big) + p_{i,r}\, \rho_\tau\big( f^{[-i]}_r(x_i) - f^{[i]}_r(x_i) \big) \right] + \lambda \| h^{[i]}_r \|^2_{\mathcal{H}_K} \\
\ge\; & \frac{1}{n \bar p_r} \sum_{j=1, j \ne i}^{n} p_{j,r}\, \rho_\tau\big( y_j - f^{[i]}_r(x_j) \big) + \lambda \| h^{[i]}_r \|^2_{\mathcal{H}_K} \\
\ge\; & \frac{1}{n \bar p_r} \sum_{j=1, j \ne i}^{n} p_{j,r}\, \rho_\tau\big( y_j - f^{[-i]}_r(x_j) \big) + \lambda \| h^{[-i]}_r \|^2_{\mathcal{H}_K} \\
=\; & \frac{1}{n \bar p_r} \left[ \sum_{j=1, j \ne i}^{n} p_{j,r}\, \rho_\tau\big( y_j - f^{[-i]}_r(x_j) \big) + p_{i,r}\, \rho_\tau\big( f^{[-i]}_r(x_i) - f^{[-i]}_r(x_i) \big) \right] + \lambda \| h^{[-i]}_r \|^2_{\mathcal{H}_K} \\
\ge\; & \frac{1}{n \bar p_r} \left[ \sum_{j=1, j \ne i}^{n} p_{j,r}\, \rho_\tau\big( y_j - f^{[i]}_r(x_j) \big) + p_{i,r}\, \rho_\tau\big( f^{[-i]}_r(x_i) - f^{[i]}_r(x_i) \big) \right] + \lambda \| h^{[i]}_r \|^2_{\mathcal{H}_K}.
\end{aligned}
$$
Then we can conclude that
$$
\begin{aligned}
& \frac{1}{n \bar p_r} \left[ \sum_{j=1, j \ne i}^{n} p_{j,r}\, \rho_\tau\big( y_j - f^{[i]}_r(x_j) \big) + p_{i,r}\, \rho_\tau\big( f^{[-i]}_r(x_i) - f^{[i]}_r(x_i) \big) \right] + \lambda \| h^{[i]}_r \|^2_{\mathcal{H}_K} \\
=\; & \frac{1}{n \bar p_r} \left[ \sum_{j=1, j \ne i}^{n} p_{j,r}\, \rho_\tau\big( y_j - f^{[-i]}_r(x_j) \big) + p_{i,r}\, \rho_\tau\big( f^{[-i]}_r(x_i) - f^{[-i]}_r(x_i) \big) \right] + \lambda \| h^{[-i]}_r \|^2_{\mathcal{H}_K}.
\end{aligned}
$$
Given the weight parameters $p_{i,r}$, i = 1, . . . , n and r = 0, 1, it is straightforward to check that the optimization problems underlying $f^{[i]}_r$ and $f^{[-i]}_r$ are convex. Thereby, we must have $f^{[i]}_r = f^{[-i]}_r$. This completes the proof.
Lemma 2 yields, for i = 1, . . . , n and r = 0, 1,
$$ f_r(x_i) - f^{[-i]}_r(x_i) = f_r(x_i) - f^{[i]}_r(x_i) = \left[ \frac{\partial f_r(x_i)}{\partial y_i} \big( y_i - f^{[-i]}_r(x_i) \big) \right] \big( 1 + o(1) \big). \qquad (28) $$
Recall the notation $\psi_{i,r} = \partial f_r(x_i) / \partial y_i$, i = 1, . . . , n and r = 0, 1. Adopting the same argument as in Equations (3.5)-(3.6) of Yuan (2006), we get
$$ \rho'_{\tau,\delta}\big( y_i - f_r(x_i) \big) \big[ f_r(x_i) - f^{[-i]}_r(x_i) \big] \approx \rho_\tau\big( y_i - f_r(x_i) \big) \frac{\psi_{i,r}}{1 - \psi_{i,r}}, $$
and so (27) can be approximated as
$$ \sum_{i=1}^n p_{i,r}\, \rho_\tau\big( y_i - f^{[-i]}_r(x_i) \big) \approx \sum_{i=1}^n p_{i,r}\, \rho_\tau\big( y_i - f_r(x_i) \big) \frac{1}{1 - \psi_{i,r}}. $$
It remains to compute the partial derivative terms $\psi_{i,r}$. To this end, let
$$ A_r = \mathbf{1}\, c^{(\infty)}_r + K S^\top D^{(\infty)}_r, \qquad r \in \{0, 1\}, \qquad (29) $$
where $c^{(\infty)}_r$ and $D^{(\infty)}_r$ are the matrices defined in (26) and (25) upon convergence of the pseudo-data algorithm for estimating $f_r$. In fact, $A_r$ is the hat matrix associated with the weighted quantile smoothing, in the sense that $y_r = A_r\, y$, where $y_r = \big( f_r(x_1), \ldots, f_r(x_n) \big)^\top$. Thereby, $\psi_{i,r}$ is the i-th diagonal element of $A_r$, i = 1, . . . , n.
Finally, note that when $\partial f_r(x_i)/\partial y_i$ is close to zero, the second-order term of the Taylor series, which has been omitted in (28), may dominate the first-order term, causing the approximation to behave poorly. Following the resolution proposed by Yuan (2006), which applies an averaging trick to the individual partial derivative terms, the conditional approximate cross-validation (CACV) score in (17) is obtained and is ready for implementation.
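Putting the pieces together, the approximate leave-one-out score can be evaluated from the fitted values and the hat-matrix diagonals alone. The function below is a sketch with an assumed interface (the hat matrix, weights, and quantile level passed in directly), and it leaves out the averaging trick applied to $\psi_{i,r}$:

```python
import numpy as np

def pinball(t, tau):
    """Pinball loss rho_tau(t)."""
    return np.where(t >= 0, tau * t, (tau - 1.0) * t)

def approx_loo_score(y, A, p, tau):
    """Approximate leave-one-out pinball loss for one component:
    sum_i p_i * rho_tau(y_i - f(x_i)) / (1 - psi_i), where f(x) = A y
    and psi_i is the i-th diagonal element of the hat matrix A."""
    yhat = A @ y
    psi = np.diag(A)
    return float(np.sum(p * pinball(y - yhat, tau) / (1.0 - psi)))
```

Since $0 \le \psi_i < 1$ for a well-behaved smoother, the correction $1/(1 - \psi_i)$ inflates the in-sample loss, penalizing overly flexible fits in the same spirit as GACV.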