On Foundations of Parameter Estimation for Generalized Partial Linear Models with B-Splines and Continuous Optimization
DESCRIPTION
The presentation by Gerhard-Wilhelm Weber, Pakize Taylan and Lian Liu.
TRANSCRIPT
On Foundations of Parameter Estimation for
Generalized Partial Linear Models
with B–Splines and Continuous Optimization
Gerhard-Wilhelm WEBER
Institute of Applied Mathematics, METU, Ankara, Turkey
Faculty of Economics, Business and Law, University of Siegen, Germany
Center for Research on Optimization and Control, University of Aveiro, Portugal
Universiti Teknologi Malaysia, Skudai, Malaysia
Pakize TAYLAN
Department of Mathematics, Dicle University, Diyarbakır, Turkey
Lian LIU
Roche Pharma Development Center in Asia Pacific, Shanghai, China
5th International Summer School
Achievements and Applications of Contemporary Informatics,
Mathematics and Physics
National Technical University of Ukraine
Kiev, Ukraine, August 3-15, 2010
Outline
• Introduction
• Estimation for Generalized Linear Models
• Generalized Partial Linear Model (GPLM)
• Newton-Raphson and Scoring Methods
• Penalized Maximum Likelihood
• Penalized Iteratively Reweighted Least Squares (P-IRLS)
• An Alternative Solution for (P-IRLS) with CQP
• Solution Methods
• Linear Model + MARS, and Robust CMARS
• Conclusion
Introduction
The class of Generalized Linear Models (GLMs) has gained popularity as a statistical modeling tool.
This popularity is due to:
• the flexibility of GLMs in addressing a variety of statistical problems, and
• the availability of software (Stata, SAS, S-PLUS, R) to fit the models.
The class of GLMs extends traditional linear models by allowing:
• the mean of a dependent variable to depend on a linear predictor through a nonlinear link function, and
• the probability distribution of the response to be any member of an exponential family of distributions.
Many widely used statistical models belong to the GLM class:
o linear models with normal errors,
o logistic and probit models for binary data,
o log-linear models for multinomial data.
Many other useful statistical models, such as those with
• Poisson, binomial,
• gamma or normal distributions,
can be formulated as GLMs by selecting an appropriate link function and response probability distribution.
A GLM looks as follows:

  H(μ_i) = η_i := x_i^T β (i = 1, 2, ..., n), where μ_i := E(Y_i);

• μ_i: expected value of the response variable Y_i,
• H: smooth monotonic link function,
• x_i: observed value of the explanatory variables for the i-th case,
• β: vector of unknown parameters.
• Assumptions: the Y_i (i = 1, 2, ..., n) are independent and can have any distribution from the exponential family, with density

  Y_i ~ f_{Y_i}(y_i, θ_i, φ) = exp( (y_i θ_i − b(θ_i)) / a_i(φ) + c(y_i, φ) ).

• The a_i(φ) are arbitrary "scale" parameters, and θ_i is called a natural parameter.
• General expressions for the mean and variance of the dependent variable Y_i:

  E(Y_i) = μ_i = b'(θ_i),
  Var(Y_i) = V(μ_i) a_i(φ),

where V(μ_i) := b''(θ_i) and a_i(φ) := φ/ω_i.
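The identities E(Y_i) = b'(θ_i) and V(μ_i) = b''(θ_i) can be checked numerically. A minimal sketch for the Poisson case (our own illustrative example, not from the slides), where b(θ) = exp(θ) and the variance equals the mean:

```python
import numpy as np

# Numerical check of E(Y) = b'(theta) and V(mu) = b''(theta) for the
# Poisson family, where b(theta) = exp(theta). The parameter value and
# step size are illustrative assumptions.

def b(theta):
    return np.exp(theta)  # cumulant function of the Poisson family

theta = 0.7
eps = 1e-5
b1 = (b(theta + eps) - b(theta - eps)) / (2 * eps)              # b'(theta)
b2 = (b(theta + eps) - 2 * b(theta) + b(theta - eps)) / eps**2  # b''(theta)

mu = np.exp(theta)  # Poisson mean; its variance equals the mean
print(b1, b2, mu)   # all three values coincide up to discretization error
```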
Estimation for GLM
• Estimation and inference for GLMs are based on the theory of
• maximum likelihood estimation,
• the least-squares approach.
• Log-likelihood:

  l(β) := Σ_{i=1}^n ( (y_i θ_i − b(θ_i)) / a_i(φ) + c(y_i, φ) ).

• The dependence of the right-hand side on β is solely through the dependence of the θ_i on β.
• Score equations:

  Σ_{i=1}^n x_{ij} V_i^{-1} (y_i − μ_i) (∂μ_i/∂η_i) = 0,
  where η_i = x_i^T β and ∂η_i/∂β_j = x_{ij} (i = 1, 2, ..., n; j = 0, 1, ..., m).

• The solution of the score equations is given by the Fisher scoring procedure, which is based on the Newton-Raphson algorithm.
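A minimal Fisher-scoring (IRLS) sketch for solving the score equations, under our own illustrative assumptions (Poisson response, canonical log link, simulated data):

```python
import numpy as np

# Fisher scoring / IRLS for a Poisson GLM with log link. With the canonical
# link, the score equations reduce to sum_i x_ij (y_i - mu_i) = 0.
# Data, tolerances and iteration count are illustrative assumptions.

rng = np.random.default_rng(0)
n, m = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, m - 1))])
beta_true = np.array([0.5, 0.3, -0.2])
y = rng.poisson(np.exp(X @ beta_true))

beta = np.zeros(m)
for _ in range(50):
    eta = X @ beta
    mu = np.exp(eta)                 # inverse link
    W = mu                           # Fisher weights for the log link
    z = eta + (y - mu) / mu          # adjusted dependent variable
    beta_new = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
    if np.max(np.abs(beta_new - beta)) < 1e-10:
        beta = beta_new
        break
    beta = beta_new

score = X.T @ (y - np.exp(X @ beta))  # vanishes at the MLE
print(beta, np.max(np.abs(score)))
```

Each iteration is a weighted least-squares solve; at convergence the score equations hold up to numerical precision.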
Generalized Partial Linear Models (GPLMs)
• Particular semiparametric models are the Generalized Partial Linear Models (GPLMs):
they extend the GLMs in that the usual parametric terms are augmented by a single nonparametric component:

  E(Y | X, T) = G( X^T β + γ(T) ).

• β is a vector of parameters, and
  γ is a smooth function,
which we try to estimate by splines.
• Assumption: X is an m-dimensional random vector which represents (typically discrete) covariates, and T is a q-dimensional random vector of continuous covariates,
which comes from a decomposition of the explanatory variables.
• Other interpretations of T: role of the environment,
  expert opinions,
  Wiener processes, etc.
Newton-Raphson and Scoring Methods
The Newton-Raphson algorithm is based on a quadratic Taylor series approximation:

  l(θ, y) ≈ l(θ^0, y) + (θ − θ^0)^T (∂l(θ^0, y)/∂θ) + (1/2) (θ − θ^0)^T (∂²l(θ^0, y)/∂θ∂θ^T) (θ − θ^0),

with θ^0 a starting value.
• An important statistical application of the Newton-Raphson algorithm is given by maximum likelihood estimation:
• l(θ, y) = log L(θ, y): log-likelihood function of θ, based on the observed data y = (y_1, y_2, ..., y_n)^T.
• Next, determine the new iterate θ^1 from ∂l(θ, y)/∂θ = 0:

  θ^1 := θ^0 − C^{-1} r, where r := ∂l(θ^0, y)/∂θ and C := ∂²l(θ^0, y)/∂θ∂θ^T.

• Fisher's scoring method replaces C by its expectation E(C).
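As a small numerical sketch of the update θ^1 := θ^0 − C^{-1} r (our own one-parameter example, not from the slides): for a Poisson sample parameterized by the natural parameter θ, the log-likelihood is l(θ) = Σ_i (y_i θ − e^θ), so r = Σ_i y_i − n e^θ and C = −n e^θ:

```python
import numpy as np

# Newton-Raphson iteration theta_1 = theta_0 - C^{-1} r for the MLE of the
# natural parameter of a Poisson sample. The data are an illustrative
# assumption; the MLE is log of the sample mean.

y = np.array([2.0, 4, 3, 5, 1, 3, 2, 4])
n = len(y)

theta = 0.0                              # starting value theta^0
for _ in range(25):
    r = y.sum() - n * np.exp(theta)      # score  dl/dtheta
    C = -n * np.exp(theta)               # Hessian d^2 l / dtheta^2
    step = -r / C                        # -C^{-1} r
    theta += step
    if abs(step) < 1e-12:
        break

print(theta, np.log(y.mean()))           # both equal log of the sample mean
```

In this model C does not depend on the data given θ, so Fisher scoring (replacing C by E(C)) coincides with Newton-Raphson.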
Penalized Maximum Likelihood
• Penalized maximum likelihood criterion for GPLM:

  j(β, γ) := l(β, γ; y) − (1/2) μ ∫_a^b (γ''(t))² dt.

• l: log-likelihood of the linear predictor η; the second term penalizes the integrated squared curvature of γ over the given interval [a, b].
• μ: smoothing parameter controlling the trade-off between accuracy of the data fitting and its smoothness (stability, robustness or regularity).
• Maximization of j(β, γ) is carried out with B-splines through the local scoring algorithm.
For this, we write γ as a degree-k B-spline with knots at the values t_i (i = 1, 2, ..., n):

  γ(t) = Σ_{j=1}^v λ_j B_{j,k}(t),

where the λ_j are coefficients and the B_{j,k} are degree-k B-spline basis functions.
• Degree-0 and degree-k B-spline bases are defined recursively by

  B_{j,0}(t) = 1 if t_j ≤ t < t_{j+1}, and 0 otherwise,

  B_{j,k}(t) = ((t − t_j)/(t_{j+k} − t_j)) B_{j,k−1}(t) + ((t_{j+k+1} − t)/(t_{j+k+1} − t_{j+1})) B_{j+1,k−1}(t).

• We write t := (t_1, ..., t_n)^T and define an n×v matrix B by B_{ij} := B_j(t_i); then

  γ = (γ(t_1), ..., γ(t_n))^T = Bλ, with λ := (λ_1, λ_2, ..., λ_v)^T.

• Further, define a v×v matrix K by

  K_{kl} := ∫_a^b B''_k(t) B''_l(t) dt.
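The recursion above can be coded directly. A minimal sketch (the clamped cubic knot vector is our own illustrative assumption):

```python
# Cox-de Boor recursion for the B-spline bases B_{j,0} and B_{j,k}
# defined above; the clamped cubic knot vector is an illustrative choice.

def bspline_basis(j, k, knots, t):
    """Evaluate B_{j,k}(t) by the Cox-de Boor recursion."""
    if k == 0:
        return 1.0 if knots[j] <= t < knots[j + 1] else 0.0
    val = 0.0
    denom = knots[j + k] - knots[j]
    if denom > 0:
        val += (t - knots[j]) / denom * bspline_basis(j, k - 1, knots, t)
    denom = knots[j + k + 1] - knots[j + 1]
    if denom > 0:
        val += (knots[j + k + 1] - t) / denom * bspline_basis(j + 1, k - 1, knots, t)
    return val

knots = [0, 0, 0, 0, 1, 2, 3, 3, 3, 3]   # clamped, degree k = 3
k = 3
v = len(knots) - k - 1                   # v = 6 basis functions
row = [bspline_basis(j, k, knots, 1.5) for j in range(v)]
print(row, sum(row))                      # nonnegative; sums to 1 inside [0, 3)
```

Rows of the matrix B are obtained by evaluating all v basis functions at each knot value t_i, as above.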
• Then, the criterion can be written as

  j(β, λ) = l(β, λ; y) − (1/2) μ λ^T K λ.

• If we insert the least-squares estimation λ̂ = (B^T B)^{-1} B^T γ(t), we get

  j(β, γ) = l(β, γ; y) − (1/2) μ γ^T B* γ, where B* := B (B^T B)^{-1} K (B^T B)^{-1} B^T.

• Writing H(μ) = η(X, t) = Xβ + γ(t) =: g_1 + g_2, with g_1 := Xβ and g_2 := γ(t),
• we will now find g_1 and g_2 to solve the optimization problem of maximizing j(β, γ).
• To maximize j(β, γ) with respect to g_1 and g_2, we solve the following system of equations:

  ∂j/∂g_1 = ∂l(g_1, g_2; y)/∂g_1 = 0,
  ∂j/∂g_2 = ∂l(g_1, g_2; y)/∂g_2 − μ B* g_2 = 0,

which we treat by the Newton-Raphson method.
• These system equations are nonlinear in g_1 and g_2.
We linearize them around a current guess η^0 = g_1^0 + g_2^0 by

  ∂l(η; y)/∂η ≈ ∂l(η^0; y)/∂η + (∂²l(η^0; y)/∂η∂η^T)(η − η^0) = 0.
• We use this linearization in the system of equations:

  C g_1 + C g_2 = C g_1^0 + C g_2^0 + r,
  C g_1 + (C + μB*) g_2 = C g_1^0 + C g_2^0 + r,

where h := g_1^0 + g_2^0 + C^{-1} r is a Newton-Raphson step, and r := ∂l(η^0, y)/∂η and C := −∂²l(η^0, y)/∂η∂η^T are evaluated at η^0 = g_1^0 + g_2^0.
• In more simple form:

  g_1 + g_2 = h,
  g_2 = S_B (h − g_1), where S_B := (C + μB*)^{-1} C,

which can be resolved for g_1 and g_2.
• β̂ and ĝ_2 can be found explicitly without iteration (inner-loop backfitting):

  β̂ = { X^T C (I − S_B) X }^{-1} X^T C (I − S_B) h,
  ĝ_2 = S_B (h − X β̂).

• Here, X represents the regression matrix for the input data x_i,
  S_B computes a weighted B-spline smoothing on the variable t_i,
  with weights given by C = −∂²l(η, y)/∂η∂η^T,
  and h is the adjusted dependent variable.
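The closed-form inner loop can be sketched in a few lines. All inputs below are illustrative stand-ins (S_B would normally be a weighted B-spline smoother matrix; here a zero matrix is used so the formula can be checked against plain weighted least squares):

```python
import numpy as np

# Sketch of the explicit inner loop:
#   beta = {X^T C (I - S_B) X}^{-1} X^T C (I - S_B) h,   g2 = S_B (h - X beta).
# X, C, S_B and h are illustrative stand-ins, not data from the slides.

rng = np.random.default_rng(1)
n, m = 40, 2
X = rng.normal(size=(n, m))
C = np.diag(rng.uniform(0.5, 2.0, size=n))   # weight matrix
h = rng.normal(size=n)                        # adjusted dependent variable
S_B = np.zeros((n, n))                        # "no smoothing" stand-in

I = np.eye(n)
A = X.T @ C @ (I - S_B)
beta = np.linalg.solve(A @ X, A @ h)
g2 = S_B @ (h - X @ beta)

# With S_B = 0 the formula reduces to ordinary weighted least squares:
beta_wls = np.linalg.solve(X.T @ C @ X, X.T @ C @ h)
print(beta, beta_wls)
```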
• From the updated β̂ and ĝ_2, the outer loop must be iterated to update h and, hence, C and r;
then, the loop is repeated until sufficient convergence is achieved.
• Step-size optimization is performed via

  η(α) := α η^1 + (1 − α) η^0,

and we turn to maximize j(η(α)) over the step size α.
• Standard results on the Newton-Raphson procedure ensure local convergence.
• Asymptotic properties of the estimates follow from

  η̂ = R_B (Ĉ^{-1} r̂ + η̂), with r̂ := ∂l(η̂, y)/∂η,

where R_B is the weighted additive fit operator, depending on ĥ and Ĉ.
• If we replace ĥ, R_B and Ĉ by their asymptotic versions h_0, R_B0 and C_0, then we get the covariance matrix for the estimates.
• Here, h_0 has mean η and variance C_0^{-1}, and R_B0 is the matrix that produces ĝ from h based on B-splines. Then,

  Cov(η̂) ≈ R_B0 C_0^{-1} R_B0^T,
  Cov(ĝ_s) ≈ R_Bs C_0^{-1} R_Bs^T (s = 1, 2),

where "≈" means: asymptotically.
• Furthermore, ĝ ≈ R_B0 h holds asymptotically, so that ĝ is asymptotically normally distributed with the above covariance matrices.
Penalized Iteratively Reweighted Least Squares (P-IRLS)
The penalized likelihood is maximized by penalized iteratively reweighted least squares (P-IRLS) to find the estimate η^[p+1] of the linear predictor η at the (p+1)-st iteration, which is given by minimizing

  ||C^[p] (h^[p] − η)||_2² + μ g_2^T B* g_2,

where h^[p] is the iteratively adjusted dependent variable, given by

  h_i^[p] := η_i^[p] + H'(μ_i^[p]) (y_i − μ_i^[p]);

here, H' represents the derivative of H with respect to μ, and C^[p] is a diagonal weight matrix with entries

  C_ii^[p] := [ V(μ_i^[p]) (H'(μ_i^[p]))² ]^{-1/2},

where V(μ_i^[p]) is proportional to the variance of Y_i according to the current estimate μ_i^[p].
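One P-IRLS update is a penalized weighted least-squares solve. A sketch under our own illustrative stand-ins (A plays the role of the design C^[p](X, B); K is a curvature-type penalty):

```python
import numpy as np

# One P-IRLS update sketch: minimize ||C(h - A theta)||^2 + mu theta^T K theta
# over theta; the normal equations give
#   (A^T C^2 A + mu K) theta = A^T C^2 h.
# A, Cdiag, h, K and mu are illustrative stand-ins, not quantities from the slides.

rng = np.random.default_rng(2)
n, d = 60, 5
A = rng.normal(size=(n, d))
Cdiag = rng.uniform(0.5, 1.5, size=n)         # diagonal of the weight matrix
h = rng.normal(size=n)
D = np.diff(np.eye(d), 2, axis=0)             # second-difference matrix
K = D.T @ D                                   # curvature-type penalty
mu = 0.8

C2 = Cdiag**2
theta = np.linalg.solve(A.T @ (C2[:, None] * A) + mu * K,
                        A.T @ (C2 * h))

grad = A.T @ (C2 * (A @ theta - h)) + mu * (K @ theta)  # stationarity check
print(theta, np.max(np.abs(grad)))
```

In the full algorithm, Cdiag and h are recomputed from the current μ^[p] before each such solve.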
• If we use η = Xβ + Bλ in the objective, we can rewrite it as

  ||C^[p] (h^[p] − Xβ − Bλ)||_2² + μ λ^T K λ.

• With Green and Yandell (1985), we suppose that K is of rank v − z.
Two matrices J and T can be formed such that

  J^T K J = I, T^T K T = 0 and J^T K T = 0,

where J and T have v rows and full column ranks v − z and z, respectively.
Then, rewriting λ as

  λ = J δ + T ζ,

with vectors δ and ζ of dimensions v − z and z, respectively, the objective becomes

  ||C^[p] (h^[p] − Xβ − BTζ − BJδ)||_2² + μ δ^T δ.
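One concrete way to obtain J and T with the stated properties is an eigendecomposition of the symmetric penalty K (our own route for illustration; the concrete K below is an assumption):

```python
import numpy as np

# Constructing J and T with J^T K J = I, T^T K T = 0 and J^T K T = 0 from an
# eigendecomposition of a symmetric positive semidefinite penalty K.
# The second-difference penalty below (rank v - 2, so z = 2) is illustrative.

v = 8
D = np.diff(np.eye(v), 2, axis=0)    # second-difference matrix
K = D.T @ D                          # rank v - 2
w, V = np.linalg.eigh(K)

tol = 1e-10
pos = w > tol
J = V[:, pos] / np.sqrt(w[pos])      # scaled range eigenvectors: J^T K J = I
T = V[:, ~pos]                       # null-space eigenvectors:   T^T K T = 0

print(np.allclose(J.T @ K @ J, np.eye(int(pos.sum()))),
      np.allclose(T.T @ K @ T, 0),
      np.allclose(J.T @ K @ T, 0))
```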
• Using a Householder decomposition

  C^[p] (X, BT) = Q (R^T, 0^T)^T, Q = (Q_1, Q_2),

where Q is orthogonal and R is nonsingular, upper triangular and of full rank, the minimization can be split by separating the solution with respect to δ from the one with respect to (β, ζ).
Then, we get the bilevel minimization problem of

  E*_upper := ||Q_1^T C^[p] h^[p] − R (β^T, ζ^T)^T − Q_1^T C^[p] BJδ||_2² (upper level)

with respect to (β, ζ), given δ based on minimizing

  E*_lower := ||Q_2^T C^[p] h^[p] − Q_2^T C^[p] BJδ||_2² + μ δ^T δ (lower level).
• The term E*_upper can be set to 0, since R is invertible.
• If we put

  V := Q_2^T C^[p] h^[p], H := Q_2^T C^[p] BJ

(this H is a matrix, not to be confused with the link function), E*_lower becomes the problem of minimizing

  ||V − Hδ||_2² + μ δ^T δ,

which is a ridge regression problem. The solution is

  δ̂ = (H^T H + μI)^{-1} H^T V.

The other parameters can be found as

  (β̂^T, ζ̂^T)^T = R^{-1} Q_1^T C^[p] (h^[p] − BJδ̂).

• Now, we can compute λ̂ using λ = Jδ + Tζ and, finally, η^[p+1] = Xβ̂ + Bλ̂.
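The ridge step above in code, with H and V as illustrative random stand-ins for Q_2^T C^[p] BJ and Q_2^T C^[p] h^[p]:

```python
import numpy as np

# Ridge regression step: delta = (H^T H + mu I)^{-1} H^T V.
# H, V and mu are illustrative stand-ins, not quantities from the slides.

rng = np.random.default_rng(3)
H = rng.normal(size=(30, 6))
V = rng.normal(size=30)
mu = 0.5

delta = np.linalg.solve(H.T @ H + mu * np.eye(6), H.T @ V)

# Stationarity of the ridge objective ||V - H delta||^2 + mu ||delta||^2
# (up to the constant factor 2):
grad = H.T @ (H @ delta - V) + mu * delta
print(delta, np.max(np.abs(grad)))
```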
An Alternative Solution for (P-IRLS) with CQP
• Both the penalized maximum likelihood and the P-IRLS method contain a smoothing parameter μ. This parameter can be estimated by
o Generalized Cross Validation (GCV),
o minimization of an UnBiased Risk Estimator (UBRE).
• A different method solves P-IRLS by Conic Quadratic Programming (CQP).
• Use the Cholesky decomposition of the penalty matrix, K = U^T U, so that λ^T K λ = ||Uλ||_2². Then, with

  W := C^[p] (X, B), w := C^[p] h^[p], v := (β^T, λ^T)^T

(U extended by zero columns so that it acts on the λ part of v only), the regression problem can be reinterpreted as

  min_v G(v) := ||W v − w||_2²,
  where g(v) := M² − ||U v||_2² ≥ 0,

for a prescribed bound M ≥ 0 on the penalty term.
• Then, our optimization problem is equivalent to

  min_{t,v} t,
  where ||W v − w||_2 ≤ t,
        ||U v||_2 ≤ M.

• Here, W and U are n×(m+v) and v×(m+v) matrices, and w and v are vectors of dimension n and m+v, respectively.
• This means:

  min_{t,v} t,
  where ||W v − w||_2² ≤ t², t ≥ 0,
        ||U v||_2² ≤ M².
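Before handing this to a CQP solver, the norm-constrained problem min ||Wv − w||_2 subject to ||Uv||_2 ≤ M can also be solved by bisecting on the Lagrange multiplier μ of the equivalent ridge system (W^T W + μ U^T U) v = W^T w. A self-contained sketch with illustrative random data (an interior-point CQP solver is the route the slides pursue):

```python
import numpy as np

# Solve min ||W v - w||_2 s.t. ||U v||_2 <= M by bisection on the Lagrange
# multiplier mu of the ridge system (W^T W + mu U^T U) v = W^T w.
# W, w, U and M are illustrative stand-ins, not data from the slides.

rng = np.random.default_rng(4)
n, d = 50, 6
W = rng.normal(size=(n, d))
w = rng.normal(size=n)
U = np.diff(np.eye(d), 1, axis=0)   # simple first-difference stand-in for U
M = 0.1

v = np.linalg.lstsq(W, w, rcond=None)[0]    # unconstrained solution
if np.linalg.norm(U @ v) > M:               # constraint active: bisect on mu
    lo, hi = 0.0, 1e8
    for _ in range(200):
        mu = 0.5 * (lo + hi)
        v = np.linalg.solve(W.T @ W + mu * (U.T @ U), W.T @ w)
        if np.linalg.norm(U @ v) > M:
            lo = mu
        else:
            hi = mu
    v = np.linalg.solve(W.T @ W + hi * (U.T @ U), W.T @ w)

print(np.linalg.norm(U @ v))                # feasible: <= M
```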
• A Conic Quadratic Programming (CQP) problem has the general form

  min_x c^T x, where ||D_i x − d_i||_2 ≤ p_i^T x − q_i (i = 1, 2, ..., k).

• Our problem is from CQP with

  c = (1, 0^T)^T, x = (t, v^T)^T,
  D_1 = (0, W), d_1 = w, p_1 = (1, 0, ..., 0)^T, q_1 = 0,
  D_2 = (0, U), d_2 = 0, p_2 = 0, q_2 = −M; k = 2.

• We first reformulate our problem as a primal problem:

  min_{t,v} t,
  such that χ_1 := (W v − w, t) ∈ L^{n+1},
            χ_2 := (U v, M) ∈ L^{v+1},
with ice-cream (or second-order, or Lorentz) cones:

  L^l := { x = (x_1, ..., x_l)^T ∈ R^l : x_l ≥ sqrt(x_1² + ... + x_{l-1}²) } (l ≥ 2).

• The corresponding dual problem is

  max w^T ω_1 − M τ_2,
  such that W^T ω_1 + U^T ω_2 = 0,
            τ_1 = 1,
            (ω_1, τ_1) ∈ L^{n+1}, (ω_2, τ_2) ∈ L^{v+1}.
Solution Methods
• Polynomial-time algorithms are requested.
– Usually, only local information on the objective and the constraints is given.
– Such an algorithm cannot utilize a priori knowledge of the problem's structure.
– CQPs belong to the well-structured convex problems.
• Interior point algorithms:
– use the structure of the problem,
– yield better complexity bounds,
– exhibit much better practical performance.
Outlook
Important new class of GPLMs:

  E(Y | X, T) = G( X^T β + γ(T) ), i.e., GPLM = LM + MARS,

where the parametric part X^T β is a linear model (LM) and the nonparametric part γ is estimated by MARS, with piecewise linear basis functions, e.g.,

  c^+(x, τ) = [+(x − τ)]_+, c^−(x, τ) = [−(x − τ)]_+.

[figure: the pair of truncated linear basis functions c^+ and c^−]
CMARS
Robust CMARS (RCMARS):
[figure: scattered data with confidence intervals (their semi-lengths indicated); outliers lie outside the intervals]
References
[1] Aster, A., Borchers, B., and Thurber, C., Parameter Estimation and Inverse Problems, Academic
Press, 2004.
[2] Craven, P., and Wahba, G., Smoothing noisy data with spline functions, Numer. Math. 31
(1979), 377-403.
[3] De Boor, C., Practical Guide to Splines, Springer Verlag, 2001.
[4] Dongarra, J.J., Bunch, J.R., Moler, C.B., and Stewart, G.W., Linpack User’s Guide, Philadelphia,
SIAM, 1979.
[5] Friedman, J.H., Multivariate adaptive regression splines, The Annals of Statistics
19, 1 (1991), 1-141.
[6] Green, P.J., and Yandell, B.S., Semi-Parametric Generalized Linear Models, Lecture Notes in
Statistics, 32 (1985).
[7] Hastie, T.J., and Tibshirani, R.J., Generalized Additive Models, New York, Chapman and Hall,
1990.
[8] Kincaid, D., and Cheney, W., Numerical Analysis: Mathematics of Scientific computing, Pacific
Grove, 2002.
[9] Müller, M., Estimation and testing in generalized partial linear models – a comparative study,
Statistics and Computing 11 (2001), 299-309.
[10] Nelder, J.A., and Wedderburn, R.W.M., Generalized linear models, Journal of the Royal Statistical
Society A 135 (1972), 370-384.
[11] Nemirovski, A., Lectures on modern convex optimization, Israel Institute of Technology
http://iew3.technion.ac.il/Labs/Opt/opt/LN/Final.pdf.
[12] Nesterov, Y.E., and Nemirovskii, A.S., Interior Point Methods in Convex Programming,
SIAM, 1993.
[13] Ortega, J.M., and Rheinboldt, W.C., Iterative Solution of Nonlinear Equations in Several
Variables, Academic Press, New York, 1970.
[14] Renegar, J., Mathematical View of Interior Point Methods in Convex Programming, SIAM,
2000.
[15] Scheid, F., Numerical Analysis, McGraw-Hill Book Company, New York, 1968.
[16] Taylan, P., Weber, G.-W., and Beck, A., New approaches to regression by generalized
additive and continuous optimization for modern applications in finance, science and
technology, Optimization 56, 5-6 (2007), pp. 1-24.
[17] Taylan, P., Weber, G.-W., and Liu, L., On foundations of parameter estimation for
generalized partial linear models with B-splines and continuous optimization, in the
proceedings of PCO 2010, 3rd Global Conference on Power Control and Optimization,
February 2-4, 2010, Gold Coast, Queensland, Australia.
[18] Weber, G.-W., Akteke-Öztürk, B., İşcanoğlu, A., Özöğür, S., and Taylan, P., Data Mining:
Clustering, Classification and Regression, four lectures given at the Graduate Summer
School on New Advances in Statistics, Middle East Technical University, Ankara, Turkey,
August 11-24, 2007 (http://www.statsummer.com/).
[19] Wood, S.N., Generalized Additive Models, An Introduction with R, New York, Chapman
and Hall, 2006.
Thank you very much for your attention!
http://www3.iam.metu.edu.tr/iam/images/7/73/Willi-CV.pdf