On Foundations of Parameter Estimation for Generalized Partial Linear Models with B-Splines and Continuous Optimization
DESCRIPTION
The presentation by Gerhard-Wilhelm Weber, Pakize Taylan and Lian Liu.
TRANSCRIPT
On Foundations of Parameter Estimation for
Generalized Partial Linear Models
with B–Splines and Continuous Optimization
Gerhard-Wilhelm WEBER
Institute of Applied Mathematics, METU, Ankara, Turkey
Faculty of Economics, Business and Law, University of Siegen, Germany
Center for Research on Optimization and Control, University of Aveiro, Portugal
Universiti Teknologi Malaysia, Skudai, Malaysia
Pakize TAYLAN
Department of Mathematics, Dicle University, Diyarbakır, Turkey
Lian LIU
Roche Pharma Development Center in Asia Pacific, Shanghai, China
5th International Summer School
Achievements and Applications of Contemporary Informatics,
Mathematics and Physics
National Technical University of Ukraine
Kiev, Ukraine, August 3-15, 2010
Outline
• Introduction
• Estimation for Generalized Linear Models
• Generalized Partial Linear Model (GPLM)
• Newton-Raphson and Scoring Methods
• Penalized Maximum Likelihood
• Penalized Iteratively Reweighted Least Squares (P-IRLS)
• An Alternative Solution for (P-IRLS) with CQP
• Solution Methods
• Linear Model + MARS, and Robust CMARS
• Conclusion
Introduction
The class of Generalized Linear Models (GLMs) has gained popularity as a statistical modeling tool.
This popularity is due to:
• the flexibility of GLMs in addressing a variety of statistical problems, and
• the availability of software (Stata, SAS, S-PLUS, R) to fit the models.
The class of GLMs extends traditional linear models by allowing:
• the mean of a dependent variable to depend on a linear predictor through a nonlinear link function, and
• the probability distribution of the response to be any member of an exponential family of distributions.
Many widely used statistical models belong to the GLM class:
o linear models with normal errors,
o logistic and probit models for binary data,
o log-linear models for multinomial data.
Many other useful statistical models, such as those with
• Poisson, binomial,
• gamma or normal distributions,
can be formulated as GLMs by selecting an appropriate link function and response probability distribution.
A GLM looks as follows:

  H(μ_i) = η_i := x_i^T β (i = 1, 2, ..., n), where μ_i := E(Y_i);

• μ_i: expected value of the response variable Y_i,
• H: smooth monotonic link function,
• x_i: observed value of the explanatory variables for the i-th case,
• β: vector of unknown parameters.
• Assumptions: the Y_i (i = 1, 2, ..., n) are independent and can have any distribution from the exponential family, with density

  Y_i ~ f_{Y_i}(y_i, θ_i, φ) = exp( (y_i θ_i − b(θ_i)) / a_i(φ) + c(y_i, φ) ).

• The a_i(φ) are arbitrary "scale" parameters, and θ_i is called a natural parameter.
• General expressions for the mean and variance of the dependent variable Y_i:

  E(Y_i) = μ_i = b'(θ_i),
  Var(Y_i) = V(μ_i) a_i(φ),

where V(μ_i) := b''(θ_i) and a_i(φ) := φ/ω_i.
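The identities E(Y_i) = b'(θ_i) and V(μ_i) = b''(θ_i) can be checked numerically. A minimal sketch for the Poisson case (our own illustrative example, not from the slides), where b(θ) = exp(θ) and the variance equals the mean:

```python
import numpy as np

# Numerical check of E(Y) = b'(theta) and V(mu) = b''(theta) for the
# Poisson family, where b(theta) = exp(theta). The parameter value and
# step size are illustrative assumptions.

def b(theta):
    return np.exp(theta)  # cumulant function of the Poisson family

theta = 0.7
eps = 1e-5
b1 = (b(theta + eps) - b(theta - eps)) / (2 * eps)              # b'(theta)
b2 = (b(theta + eps) - 2 * b(theta) + b(theta - eps)) / eps**2  # b''(theta)

mu = np.exp(theta)  # Poisson mean; its variance equals the mean
print(b1, b2, mu)   # all three values coincide up to discretization error
```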
Estimation for GLM
• Estimation and inference for GLMs are based on the theory of
• maximum likelihood estimation,
• the least-squares approach.
• Log-likelihood:

  l(β) := Σ_{i=1}^n ( (y_i θ_i − b(θ_i)) / a_i(φ) + c(y_i, φ) ).

• The dependence of the right-hand side on β is solely through the dependence of the θ_i on β.
• Score equations:

  Σ_{i=1}^n x_{ij} V_i^{-1} (y_i − μ_i) (∂μ_i/∂η_i) = 0,
  where η_i = x_i^T β and ∂η_i/∂β_j = x_{ij} (i = 1, 2, ..., n; j = 0, 1, ..., m).

• The solution of the score equations is given by the Fisher scoring procedure, which is based on the Newton-Raphson algorithm.
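A minimal Fisher-scoring (IRLS) sketch for solving the score equations, under our own illustrative assumptions (Poisson response, canonical log link, simulated data):

```python
import numpy as np

# Fisher scoring / IRLS for a Poisson GLM with log link. With the canonical
# link, the score equations reduce to sum_i x_ij (y_i - mu_i) = 0.
# Data, tolerances and iteration count are illustrative assumptions.

rng = np.random.default_rng(0)
n, m = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, m - 1))])
beta_true = np.array([0.5, 0.3, -0.2])
y = rng.poisson(np.exp(X @ beta_true))

beta = np.zeros(m)
for _ in range(50):
    eta = X @ beta
    mu = np.exp(eta)                 # inverse link
    W = mu                           # Fisher weights for the log link
    z = eta + (y - mu) / mu          # adjusted dependent variable
    beta_new = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
    if np.max(np.abs(beta_new - beta)) < 1e-10:
        beta = beta_new
        break
    beta = beta_new

score = X.T @ (y - np.exp(X @ beta))  # vanishes at the MLE
print(beta, np.max(np.abs(score)))
```

Each iteration is a weighted least-squares solve; at convergence the score equations hold up to numerical precision.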
Generalized Partial Linear Models (GPLMs)
• Particular semiparametric models are the Generalized Partial Linear Models (GPLMs):
they extend the GLMs in that the usual parametric terms are augmented by a single nonparametric component:

  E(Y | X, T) = G( X^T β + γ(T) ).

• β is a vector of parameters, and
  γ is a smooth function,
which we try to estimate by splines.
• Assumption: X is an m-dimensional random vector which represents (typically discrete) covariates, and T is a q-dimensional random vector of continuous covariates,
which comes from a decomposition of the explanatory variables.
• Other interpretations of T: role of the environment,
  expert opinions,
  Wiener processes, etc.
Newton-Raphson and Scoring Methods
The Newton-Raphson algorithm is based on a quadratic Taylor series approximation:

  l(θ, y) ≈ l(θ^0, y) + (θ − θ^0)^T (∂l(θ^0, y)/∂θ) + (1/2) (θ − θ^0)^T (∂²l(θ^0, y)/∂θ∂θ^T) (θ − θ^0),

with θ^0 a starting value.
• An important statistical application of the Newton-Raphson algorithm is given by maximum likelihood estimation:
• l(θ, y) = log L(θ, y): log-likelihood function of θ, based on the observed data y = (y_1, y_2, ..., y_n)^T.
• Next, determine the new iterate θ^1 from ∂l(θ, y)/∂θ = 0:

  θ^1 := θ^0 − C^{-1} r, where r := ∂l(θ^0, y)/∂θ and C := ∂²l(θ^0, y)/∂θ∂θ^T.

• Fisher's scoring method replaces C by its expectation E(C).
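As a small numerical sketch of the update θ^1 := θ^0 − C^{-1} r (our own one-parameter example, not from the slides): for a Poisson sample parameterized by the natural parameter θ, the log-likelihood is l(θ) = Σ_i (y_i θ − e^θ), so r = Σ_i y_i − n e^θ and C = −n e^θ:

```python
import numpy as np

# Newton-Raphson iteration theta_1 = theta_0 - C^{-1} r for the MLE of the
# natural parameter of a Poisson sample. The data are an illustrative
# assumption; the MLE is log of the sample mean.

y = np.array([2.0, 4, 3, 5, 1, 3, 2, 4])
n = len(y)

theta = 0.0                              # starting value theta^0
for _ in range(25):
    r = y.sum() - n * np.exp(theta)      # score  dl/dtheta
    C = -n * np.exp(theta)               # Hessian d^2 l / dtheta^2
    step = -r / C                        # -C^{-1} r
    theta += step
    if abs(step) < 1e-12:
        break

print(theta, np.log(y.mean()))           # both equal log of the sample mean
```

In this model C does not depend on the data given θ, so Fisher scoring (replacing C by E(C)) coincides with Newton-Raphson.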
Penalized Maximum Likelihood
• Penalized maximum likelihood criterion for GPLM:

  j(β, γ) := l(β, γ; y) − (1/2) μ ∫_a^b (γ''(t))² dt.

• l: log-likelihood of the linear predictor η; the second term penalizes the integrated squared curvature of γ over the given interval [a, b].
• μ: smoothing parameter controlling the trade-off between accuracy of the data fitting and its smoothness (stability, robustness or regularity).
• Maximization of j(β, γ) is carried out with B-splines through the local scoring algorithm.
For this, we write γ as a degree-k B-spline with knots at the values t_i (i = 1, 2, ..., n):

  γ(t) = Σ_{j=1}^v λ_j B_{j,k}(t),

where the λ_j are coefficients and the B_{j,k} are degree-k B-spline basis functions.
• Degree-0 and degree-k B-spline bases are defined recursively by

  B_{j,0}(t) = 1 if t_j ≤ t < t_{j+1}, and 0 otherwise,

  B_{j,k}(t) = ((t − t_j)/(t_{j+k} − t_j)) B_{j,k−1}(t) + ((t_{j+k+1} − t)/(t_{j+k+1} − t_{j+1})) B_{j+1,k−1}(t).

• We write t := (t_1, ..., t_n)^T and define an n×v matrix B by B_{ij} := B_j(t_i); then

  γ = (γ(t_1), ..., γ(t_n))^T = Bλ, with λ := (λ_1, λ_2, ..., λ_v)^T.

• Further, define a v×v matrix K by

  K_{kl} := ∫_a^b B''_k(t) B''_l(t) dt.
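The recursion above can be coded directly. A minimal sketch (the clamped cubic knot vector is our own illustrative assumption):

```python
# Cox-de Boor recursion for the B-spline bases B_{j,0} and B_{j,k}
# defined above; the clamped cubic knot vector is an illustrative choice.

def bspline_basis(j, k, knots, t):
    """Evaluate B_{j,k}(t) by the Cox-de Boor recursion."""
    if k == 0:
        return 1.0 if knots[j] <= t < knots[j + 1] else 0.0
    val = 0.0
    denom = knots[j + k] - knots[j]
    if denom > 0:
        val += (t - knots[j]) / denom * bspline_basis(j, k - 1, knots, t)
    denom = knots[j + k + 1] - knots[j + 1]
    if denom > 0:
        val += (knots[j + k + 1] - t) / denom * bspline_basis(j + 1, k - 1, knots, t)
    return val

knots = [0, 0, 0, 0, 1, 2, 3, 3, 3, 3]   # clamped, degree k = 3
k = 3
v = len(knots) - k - 1                   # v = 6 basis functions
row = [bspline_basis(j, k, knots, 1.5) for j in range(v)]
print(row, sum(row))                      # nonnegative; sums to 1 inside [0, 3)
```

Rows of the matrix B are obtained by evaluating all v basis functions at each knot value t_i, as above.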
• Then, the criterion can be written as

  j(β, λ) = l(β, λ; y) − (1/2) μ λ^T K λ.

• If we insert the least-squares estimation λ̂ = (B^T B)^{-1} B^T γ(t), we get

  j(β, γ) = l(β, γ; y) − (1/2) μ γ^T B* γ, where B* := B (B^T B)^{-1} K (B^T B)^{-1} B^T.

• Writing H(μ) = η(X, t) = Xβ + γ(t) =: g_1 + g_2, with g_1 := Xβ and g_2 := γ(t),
• we will now find g_1 and g_2 to solve the optimization problem of maximizing j(β, γ).
• To maximize j(β, γ) with respect to g_1 and g_2, we solve the following system of equations:

  ∂j/∂g_1 = ∂l(g_1, g_2; y)/∂g_1 = 0,
  ∂j/∂g_2 = ∂l(g_1, g_2; y)/∂g_2 − μ B* g_2 = 0,

which we treat by the Newton-Raphson method.
• These system equations are nonlinear in g_1 and g_2.
We linearize them around a current guess η^0 = g_1^0 + g_2^0 by

  ∂l(η; y)/∂η ≈ ∂l(η^0; y)/∂η + (∂²l(η^0; y)/∂η∂η^T)(η − η^0) = 0.
• We use this linearization in the system of equations:

  C g_1 + C g_2 = C g_1^0 + C g_2^0 + r,
  C g_1 + (C + μB*) g_2 = C g_1^0 + C g_2^0 + r,

where h := g_1^0 + g_2^0 + C^{-1} r is a Newton-Raphson step, and r := ∂l(η^0, y)/∂η and C := −∂²l(η^0, y)/∂η∂η^T are evaluated at η^0 = g_1^0 + g_2^0.
• In more simple form:

  g_1 + g_2 = h,
  g_2 = S_B (h − g_1), where S_B := (C + μB*)^{-1} C,

which can be resolved for g_1 and g_2.
• β̂ and ĝ_2 can be found explicitly without iteration (inner-loop backfitting):

  β̂ = { X^T C (I − S_B) X }^{-1} X^T C (I − S_B) h,
  ĝ_2 = S_B (h − X β̂).

• Here, X represents the regression matrix for the input data x_i,
  S_B computes a weighted B-spline smoothing on the variable t_i,
  with weights given by C = −∂²l(η, y)/∂η∂η^T,
  and h is the adjusted dependent variable.
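The closed-form inner loop can be sketched in a few lines. All inputs below are illustrative stand-ins (S_B would normally be a weighted B-spline smoother matrix; here a zero matrix is used so the formula can be checked against plain weighted least squares):

```python
import numpy as np

# Sketch of the explicit inner loop:
#   beta = {X^T C (I - S_B) X}^{-1} X^T C (I - S_B) h,   g2 = S_B (h - X beta).
# X, C, S_B and h are illustrative stand-ins, not data from the slides.

rng = np.random.default_rng(1)
n, m = 40, 2
X = rng.normal(size=(n, m))
C = np.diag(rng.uniform(0.5, 2.0, size=n))   # weight matrix
h = rng.normal(size=n)                        # adjusted dependent variable
S_B = np.zeros((n, n))                        # "no smoothing" stand-in

I = np.eye(n)
A = X.T @ C @ (I - S_B)
beta = np.linalg.solve(A @ X, A @ h)
g2 = S_B @ (h - X @ beta)

# With S_B = 0 the formula reduces to ordinary weighted least squares:
beta_wls = np.linalg.solve(X.T @ C @ X, X.T @ C @ h)
print(beta, beta_wls)
```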
• From the updated β̂ and ĝ_2, the outer loop must be iterated to update h and, hence, C and r;
then, the loop is repeated until sufficient convergence is achieved.
• Step-size optimization is performed via

  η(α) := α η^1 + (1 − α) η^0,

and we turn to maximize j(η(α)) over the step size α.
• Standard results on the Newton-Raphson procedure ensure local convergence.
• Asymptotic properties of the estimates follow from

  η̂ = R_B (Ĉ^{-1} r̂ + η̂), with r̂ := ∂l(η̂, y)/∂η,

where R_B is the weighted additive fit operator, depending on ĥ and Ĉ.
• If we replace ĥ, R_B and Ĉ by their asymptotic versions h_0, R_B0 and C_0, then we get the covariance matrix for the estimates.
• Here, h_0 has mean η and variance C_0^{-1}, and R_B0 is the matrix that produces ĝ from h based on B-splines. Then,

  Cov(η̂) ≈ R_B0 C_0^{-1} R_B0^T,
  Cov(ĝ_s) ≈ R_Bs C_0^{-1} R_Bs^T (s = 1, 2),

where "≈" means: asymptotically.
• Furthermore, ĝ ≈ R_B0 h holds asymptotically, so that ĝ is asymptotically normally distributed with the above covariance matrices.
Penalized Iteratively Reweighted Least Squares (P-IRLS)
The penalized likelihood is maximized by penalized iteratively reweighted least squares (P-IRLS) to find the estimate η^[p+1] of the linear predictor η at the (p+1)-st iteration, which is given by minimizing

  ||C^[p] (h^[p] − η)||_2² + μ g_2^T B* g_2,

where h^[p] is the iteratively adjusted dependent variable, given by

  h_i^[p] := η_i^[p] + H'(μ_i^[p]) (y_i − μ_i^[p]);

here, H' represents the derivative of H with respect to μ, and C^[p] is a diagonal weight matrix with entries

  C_ii^[p] := [ V(μ_i^[p]) (H'(μ_i^[p]))² ]^{-1/2},

where V(μ_i^[p]) is proportional to the variance of Y_i according to the current estimate μ_i^[p].
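One P-IRLS update is a penalized weighted least-squares solve. A sketch under our own illustrative stand-ins (A plays the role of the design C^[p](X, B); K is a curvature-type penalty):

```python
import numpy as np

# One P-IRLS update sketch: minimize ||C(h - A theta)||^2 + mu theta^T K theta
# over theta; the normal equations give
#   (A^T C^2 A + mu K) theta = A^T C^2 h.
# A, Cdiag, h, K and mu are illustrative stand-ins, not quantities from the slides.

rng = np.random.default_rng(2)
n, d = 60, 5
A = rng.normal(size=(n, d))
Cdiag = rng.uniform(0.5, 1.5, size=n)         # diagonal of the weight matrix
h = rng.normal(size=n)
D = np.diff(np.eye(d), 2, axis=0)             # second-difference matrix
K = D.T @ D                                   # curvature-type penalty
mu = 0.8

C2 = Cdiag**2
theta = np.linalg.solve(A.T @ (C2[:, None] * A) + mu * K,
                        A.T @ (C2 * h))

grad = A.T @ (C2 * (A @ theta - h)) + mu * (K @ theta)  # stationarity check
print(theta, np.max(np.abs(grad)))
```

In the full algorithm, Cdiag and h are recomputed from the current μ^[p] before each such solve.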
• If we use η = Xβ + Bλ in the objective, we can rewrite it as

  ||C^[p] (h^[p] − Xβ − Bλ)||_2² + μ λ^T K λ.

• With Green and Yandell (1985), we suppose that K is of rank v − z.
Two matrices J and T can be formed such that

  J^T K J = I, T^T K T = 0 and J^T K T = 0,

where J and T have v rows and full column ranks v − z and z, respectively.
Then, rewriting λ as

  λ = J δ + T ζ,

with vectors δ and ζ of dimensions v − z and z, respectively, the objective becomes

  ||C^[p] (h^[p] − Xβ − BTζ − BJδ)||_2² + μ δ^T δ.
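One concrete way to obtain J and T with the stated properties is an eigendecomposition of the symmetric penalty K (our own route for illustration; the concrete K below is an assumption):

```python
import numpy as np

# Constructing J and T with J^T K J = I, T^T K T = 0 and J^T K T = 0 from an
# eigendecomposition of a symmetric positive semidefinite penalty K.
# The second-difference penalty below (rank v - 2, so z = 2) is illustrative.

v = 8
D = np.diff(np.eye(v), 2, axis=0)    # second-difference matrix
K = D.T @ D                          # rank v - 2
w, V = np.linalg.eigh(K)

tol = 1e-10
pos = w > tol
J = V[:, pos] / np.sqrt(w[pos])      # scaled range eigenvectors: J^T K J = I
T = V[:, ~pos]                       # null-space eigenvectors:   T^T K T = 0

print(np.allclose(J.T @ K @ J, np.eye(int(pos.sum()))),
      np.allclose(T.T @ K @ T, 0),
      np.allclose(J.T @ K @ T, 0))
```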
• Using a Householder decomposition

  C^[p] (X, BT) = Q (R^T, 0^T)^T, Q = (Q_1, Q_2),

where Q is orthogonal and R is nonsingular, upper triangular and of full rank, the minimization can be split by separating the solution with respect to δ from the one with respect to (β, ζ).
Then, we get the bilevel minimization problem of

  E*_upper := ||Q_1^T C^[p] h^[p] − R (β^T, ζ^T)^T − Q_1^T C^[p] BJδ||_2² (upper level)

with respect to (β, ζ), given δ based on minimizing

  E*_lower := ||Q_2^T C^[p] h^[p] − Q_2^T C^[p] BJδ||_2² + μ δ^T δ (lower level).
• The term E*_upper can be set to 0, since R is invertible.
• If we put

  V := Q_2^T C^[p] h^[p], H := Q_2^T C^[p] BJ

(this H is a matrix, not to be confused with the link function), E*_lower becomes the problem of minimizing

  ||V − Hδ||_2² + μ δ^T δ,

which is a ridge regression problem. The solution is

  δ̂ = (H^T H + μI)^{-1} H^T V.

The other parameters can be found as

  (β̂^T, ζ̂^T)^T = R^{-1} Q_1^T C^[p] (h^[p] − BJδ̂).

• Now, we can compute λ̂ using λ = Jδ + Tζ and, finally, η^[p+1] = Xβ̂ + Bλ̂.
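The ridge step above in code, with H and V as illustrative random stand-ins for Q_2^T C^[p] BJ and Q_2^T C^[p] h^[p]:

```python
import numpy as np

# Ridge regression step: delta = (H^T H + mu I)^{-1} H^T V.
# H, V and mu are illustrative stand-ins, not quantities from the slides.

rng = np.random.default_rng(3)
H = rng.normal(size=(30, 6))
V = rng.normal(size=30)
mu = 0.5

delta = np.linalg.solve(H.T @ H + mu * np.eye(6), H.T @ V)

# Stationarity of the ridge objective ||V - H delta||^2 + mu ||delta||^2
# (up to the constant factor 2):
grad = H.T @ (H @ delta - V) + mu * delta
print(delta, np.max(np.abs(grad)))
```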
An Alternative Solution for (P-IRLS) with CQP
• Both the penalized maximum likelihood and the P-IRLS method contain a smoothing parameter μ. This parameter can be estimated by
o Generalized Cross Validation (GCV),
o minimization of an UnBiased Risk Estimator (UBRE).
• A different method solves P-IRLS by Conic Quadratic Programming (CQP).
• Use the Cholesky decomposition of the penalty matrix, K = U^T U, so that λ^T K λ = ||Uλ||_2². Then, with

  W := C^[p] (X, B), w := C^[p] h^[p], v := (β^T, λ^T)^T

(U extended by zero columns so that it acts on the λ part of v only), the regression problem can be reinterpreted as

  min_v G(v) := ||W v − w||_2²,
  where g(v) := M² − ||U v||_2² ≥ 0,

for a prescribed bound M ≥ 0 on the penalty term.
• Then, our optimization problem is equivalent to

  min_{t,v} t,
  where ||W v − w||_2 ≤ t,
        ||U v||_2 ≤ M.

• Here, W and U are n×(m+v) and v×(m+v) matrices, and w and v are vectors of dimension n and m+v, respectively.
• This means:

  min_{t,v} t,
  where ||W v − w||_2² ≤ t², t ≥ 0,
        ||U v||_2² ≤ M².
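Before handing this to a CQP solver, the norm-constrained problem min ||Wv − w||_2 subject to ||Uv||_2 ≤ M can also be solved by bisecting on the Lagrange multiplier μ of the equivalent ridge system (W^T W + μ U^T U) v = W^T w. A self-contained sketch with illustrative random data (an interior-point CQP solver is the route the slides pursue):

```python
import numpy as np

# Solve min ||W v - w||_2 s.t. ||U v||_2 <= M by bisection on the Lagrange
# multiplier mu of the ridge system (W^T W + mu U^T U) v = W^T w.
# W, w, U and M are illustrative stand-ins, not data from the slides.

rng = np.random.default_rng(4)
n, d = 50, 6
W = rng.normal(size=(n, d))
w = rng.normal(size=n)
U = np.diff(np.eye(d), 1, axis=0)   # simple first-difference stand-in for U
M = 0.1

v = np.linalg.lstsq(W, w, rcond=None)[0]    # unconstrained solution
if np.linalg.norm(U @ v) > M:               # constraint active: bisect on mu
    lo, hi = 0.0, 1e8
    for _ in range(200):
        mu = 0.5 * (lo + hi)
        v = np.linalg.solve(W.T @ W + mu * (U.T @ U), W.T @ w)
        if np.linalg.norm(U @ v) > M:
            lo = mu
        else:
            hi = mu
    v = np.linalg.solve(W.T @ W + hi * (U.T @ U), W.T @ w)

print(np.linalg.norm(U @ v))                # feasible: <= M
```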
• A Conic Quadratic Programming (CQP) problem has the general form

  min_x c^T x, where ||D_i x − d_i||_2 ≤ p_i^T x − q_i (i = 1, 2, ..., k).

• Our problem is from CQP with

  c = (1, 0^T)^T, x = (t, v^T)^T,
  D_1 = (0, W), d_1 = w, p_1 = (1, 0, ..., 0)^T, q_1 = 0,
  D_2 = (0, U), d_2 = 0, p_2 = 0, q_2 = −M; k = 2.

• We first reformulate our problem as a primal problem:

  min_{t,v} t,
  such that χ_1 := (W v − w, t) ∈ L^{n+1},
            χ_2 := (U v, M) ∈ L^{v+1},
with ice-cream (or second-order, or Lorentz) cones:

  L^l := { x = (x_1, ..., x_l)^T ∈ R^l : x_l ≥ sqrt(x_1² + ... + x_{l-1}²) } (l ≥ 2).

• The corresponding dual problem is

  max w^T ω_1 − M τ_2,
  such that W^T ω_1 + U^T ω_2 = 0,
            τ_1 = 1,
            (ω_1, τ_1) ∈ L^{n+1}, (ω_2, τ_2) ∈ L^{v+1}.
Solution Methods
• Polynomial-time algorithms are requested.
– Usually, only local information on the objective and the constraints is given.
– Such an algorithm cannot utilize a priori knowledge of the problem's structure.
– CQPs belong to the well-structured convex problems.
• Interior point algorithms:
– use the structure of the problem,
– yield better complexity bounds,
– exhibit much better practical performance.
Outlook
Important new class of GPLMs:

  E(Y | X, T) = G( X^T β + γ(T) ), i.e., GPLM = LM + MARS,

where the parametric part X^T β is a linear model (LM) and the nonparametric part γ is estimated by MARS, with piecewise linear basis functions, e.g.,

  c^+(x, τ) = [+(x − τ)]_+, c^−(x, τ) = [−(x − τ)]_+.

[figure: the pair of truncated linear basis functions c^+ and c^−]
CMARS
Robust CMARS (RCMARS):
[figure: scattered data with confidence intervals (their semi-lengths indicated); outliers lie outside the intervals]
References
[1] Aster, A., Borchers, B., and Thurber, C., Parameter Estimation and Inverse Problems, Academic
Press, 2004.
[2] Craven, P., and Wahba, G., Smoothing noisy data with spline functions, Numer. Math. 31
(1979), 377-403.
[3] De Boor, C., Practical Guide to Splines, Springer Verlag, 2001.
[4] Dongarra, J.J., Bunch, J.R., Moler, C.B., and Stewart, G.W., Linpack User’s Guide, Philadelphia,
SIAM, 1979.
[5] Friedman, J.H., Multivariate adaptive regression splines, The Annals of Statistics
19, 1 (1991), 1-141.
[6] Green, P.J., and Yandell, B.S., Semi-Parametric Generalized Linear Models, Lecture Notes in
Statistics, 32 (1985).
[7] Hastie, T.J., and Tibshirani, R.J., Generalized Additive Models, New York, Chapman and Hall,
1990.
[8] Kincaid, D., and Cheney, W., Numerical Analysis: Mathematics of Scientific computing, Pacific
Grove, 2002.
[9] Müller, M., Estimation and testing in generalized partial linear models – a comparative study,
Statistics and Computing 11 (2001), 299-309.
[10] Nelder, J.A., and Wedderburn, R.W.M., Generalized linear models, Journal of the Royal Statistical
Society A 135 (1972), 370-384.
[11] Nemirovski, A., Lectures on modern convex optimization, Israel Institute of Technology
http://iew3.technion.ac.il/Labs/Opt/opt/LN/Final.pdf.
[12] Nesterov, Y.E., and Nemirovskii, A.S., Interior Point Methods in Convex Programming,
SIAM, 1993.
[13] Ortega, J.M., and Rheinboldt, W.C., Iterative Solution of Nonlinear Equations in Several
Variables, Academic Press, New York, 1970.
[14] Renegar, J., Mathematical View of Interior Point Methods in Convex Programming, SIAM,
2000.
[15] Scheid, F., Numerical Analysis, McGraw-Hill Book Company, New York, 1968.
[16] Taylan, P., Weber, G.-W., and Beck, A., New approaches to regression by generalized
additive and continuous optimization for modern applications in finance, science and
technology, Optimization 56, 5-6 (2007), pp. 1-24.
[17] Taylan, P., Weber, G.-W., and Liu, L., On foundations of parameter estimation for
generalized partial linear models with B-splines and continuous optimization, in the
proceedings of PCO 2010, 3rd Global Conference on Power Control and Optimization,
February 2-4, 2010, Gold Coast, Queensland, Australia.
[18] Weber, G.-W., Akteke-Öztürk, B., İşcanoğlu, A., Özöğür, S., and Taylan, P., Data Mining:
Clustering, Classification and Regression, four lectures given at the Graduate Summer
School on New Advances in Statistics, Middle East Technical University, Ankara, Turkey,
August 11-24, 2007 (http://www.statsummer.com/).
[19] Wood, S.N., Generalized Additive Models, An Introduction with R, New York, Chapman
and Hall, 2006.
Thank you very much for your attention!
http://www3.iam.metu.edu.tr/iam/images/7/73/Willi-CV.pdf