
Page 1: Non-Linear Models - Further Statistical Methods

Non-Linear Models - Further Statistical Methods

Sarah Filippi, University of Oxford

Hilary Term 2016

Page 2: Non-Linear Models - Further Statistical Methods

Information

Webpage: http://www.stats.ox.ac.uk/~filippi/

Lectures:

• Week 5: Wednesday at 11

• Week 6: Wednesday at 11, Thursday at 12

• Week 7: Thursday at 12

Practical:

• Week 7 – Friday

• Not assessed

Page 3: Non-Linear Models - Further Statistical Methods

References

A thorough treatment of non-linear regression is given in:

• Bates and Watts (1988). Nonlinear Regression Analysis and Its Applications

• Seber and Wild (1989). Nonlinear Regression

• Gallant (1987). Nonlinear Statistical Models

• Ratkowsky (1983). Nonlinear Regression Modeling: A Unified Practical Approach

• Ross (1990). Nonlinear Estimation

• Venables and Ripley (2002). Modern Applied Statistics with S

In the preparation of this course, I have also used

• Hastie et al. (2009). The Elements of Statistical Learning

• Clark (2013). Generalized Additive Models

Page 4: Non-Linear Models - Further Statistical Methods

Linear models

There are several specific aspects to linear models as used in regression:

(a) There is a linear predictor which is a linear function of the predictors x1, . . . , xp, and perhaps one of these predictors is constant. So conventionally we write

η = β0 + β1x1 + · · · + βpxp

Note that this is linear in the parameters: the predictors might be non-linear functions (such as polynomials or splines) of the independent variables.

(b) The linear predictor models the mean of the distribution of the dependent variable.

(c) Fitting is by least squares, which is maximum likelihood estimation (for β) for iid Normal errors.

Page 5: Non-Linear Models - Further Statistical Methods

You have seen linear predictors used in other contexts:

• logistic regression and log-linear models,

• accelerated-life and proportional hazard survival models,

• hierarchical models.

These differ in points (b) and (c): the linear predictor models some other aspect of the distribution (the log-mean, probabilities on a logistic scale, . . . ), and fitting is by maximum likelihood or something similar (maximum partial likelihood, REML).

A lot of the special properties of linear models come from the combination of a linear predictor and fitting by least squares.

Page 6: Non-Linear Models - Further Statistical Methods

The log-likelihood is concave (indeed quadratic) in the parameters β. This has several consequences:

• There is a unique local maximum to the likelihood, which is the global maximum.

• The least-squares/maximum likelihood solution can be found analytically.

• The asymptotic theory for MLEs is exact, since the log-likelihood is exactly quadratic.

Note that if we need to estimate other parameters (e.g. the error variance σ²) the inference task may not be quite as simple.
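For reference, "found analytically" refers to the familiar closed form: with design matrix X and response vector y, the least-squares/maximum likelihood estimate is

β̂ = (XᵀX)⁻¹ Xᵀ y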

Page 7: Non-Linear Models - Further Statistical Methods

This section of the course is about what may happen if we relax some of these assumptions:

• fitting by something other than least squares

• or predicting by a non-linear function of the parameters.

Page 8: Non-Linear Models - Further Statistical Methods

In general, numerical optimization methods have to be used to estimate the model parameters. These have some common characteristics:

• They find a local maximum/minimum of the fit criterion. There may well be multiple local optima.

• They need a starting point, and that often needs to be a set of parameters for which the model fits quite well, to ensure that an appropriate local optimum is found.

• The only distribution theory for the parameter estimates is asymptotic, but simulation-based methods can be used.

• They usually need complete identifiability.

Page 9: Non-Linear Models - Further Statistical Methods

Outline

1 Non-linear least squares

2 Basis expansion and regularisation

3 Kernel smoothing methods

4 Additive and generalized additive models

5 Tree methods

Page 10: Non-Linear Models - Further Statistical Methods

1. Non-linear least squares

Page 11: Non-Linear Models - Further Statistical Methods

Subject-specific theory may suggest that the response varies non-linearly with the independent variables. Perhaps the simplest example is exponential growth or decay with time t:

E[Y] = a exp(±b t)

One could argue that this has a linear predictor (log(a) ± b t), but that does not predict the mean (it might predict the log-mean of a lognormal distribution).

Note that the parameter b has an interpretation as a rate of growth/decay, and it is typical of specific non-linear models that they are specified with some interpretable parameters.

Page 12: Non-Linear Models - Further Statistical Methods

Non-linear regression

The general form of a non-linear regression model is

y = E(Y | X = x) + ε = m(x, θ) + ε

where

• m(x, θ) is called the kernel mean function,

• ε is an independent error term, centred, with an unknown variance σ² to be estimated.

Page 13: Non-Linear Models - Further Statistical Methods

Parameter estimation for non-linear regression

Given a set of observations {(x_i, y_i)}_{i=1}^{N}, the parameter θ can be estimated by minimising the residual sum of squares

S(θ) = ∑_{i=1}^{N} (y_i − m(x_i, θ))²

• Contrary to the linear least-squares problem, there is no general closed-form solution.

• Instead an iterative procedure is used, starting with an initial guess for the parameters.

• The R function nls estimates the parameters of such models: it requires the model definition, the dataset and initial values for all the model parameters.

Page 14: Non-Linear Models - Further Statistical Methods

Example 1: Weight loss

Data frame wtloss in package MASS gives the weight, in kilograms, of an obese patient at 52 time points over an 8-month period of a weight rehabilitation programme.

[Figure: scatter plot of the patient's Weight (kg) against Days, 0–250.]

Page 15: Non-Linear Models - Further Statistical Methods

An exponential decay to a non-zero asymptote looks a suitable model

weight = b0 + b1 · 2^(−Days/th) + ε

and the curve was fitted by

nls(Weight ~ b0 + b1*2^(-Days/th),

data = wtloss, start = list(b0 = 90, b1 = 95, th = 120))

The starting values were chosen based on their interpretation:

• the initial weight is b0 + b1,

• the target weight is b0 and

• the ‘half-life’ is th.

Page 16: Non-Linear Models - Further Statistical Methods

The model summary is

Formula: Weight ~ b0 + b1 * 2^(-Days/th)

Parameters:

Estimate Std. Error t value Pr(>|t|)

b0 81.374 2.269 35.86 <2e-16

b1 102.684 2.083 49.30 <2e-16

th 141.911 5.295 26.80 <2e-16

Residual standard error: 0.8949 on 49 degrees of freedom

Number of iterations to convergence: 3

Achieved convergence tolerance: 4.39e-06
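The fit can also be inspected visually. A minimal sketch, assuming only the wtloss data and the fit above (the names wtloss.fm and days are illustrative):

library(MASS)                        # provides the wtloss data
wtloss.fm <- nls(Weight ~ b0 + b1*2^(-Days/th), data = wtloss,
                 start = list(b0 = 90, b1 = 95, th = 120))
plot(Weight ~ Days, data = wtloss, ylab = "Weight (kg)")  # observed weights
days <- seq(0, 250, length.out = 200)                     # prediction grid
lines(days, predict(wtloss.fm, newdata = data.frame(Days = days)))  # fitted decay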

Page 17: Non-Linear Models - Further Statistical Methods

Self-starting non-linear regression

For some functions, including exponential decays, starting values for non-linear regressions can be calculated by an automatic procedure.

Example: a logistic growth model

y = θ1 / (1 + exp[−(θ2 + θ3 x)]) + ε

A procedure to obtain initial parameter values from a set of observations {(x_i, y_i)}_{i=1}^{N} is:

1 Rewrite the model as

log[ (y/θ1) / (1 − y/θ1) ] ≈ θ2 + θ3 x

2 Since θ1 is the asymptote, take as a starting value for θ1 some value larger than any response value in the data

3 Estimate starting values for θ2 and θ3 by fitting a linear regression on the logit scale, using the functions lm and logit (see the sketch below)
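A minimal sketch of this recipe, assuming numeric vectors x and y hold the observations with positive responses (all names are illustrative; qlogis computes the logit):

theta1_0 <- 1.05 * max(y)             # asymptote: a value above all responses
z  <- qlogis(y / theta1_0)            # logit of responses scaled by the asymptote
fm <- lm(z ~ x)                       # linear fit on the logit scale
start <- list(theta1 = theta1_0,
              theta2 = coef(fm)[[1]], # intercept gives theta2
              theta3 = coef(fm)[[2]]) # slope gives theta3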

Page 18: Non-Linear Models - Further Statistical Methods

Non-linear models are often associated with families of solutions to simple differential equations: a prominent field of application is the pharmaceutical industry, with PK (pharmacokinetics) and PD (pharmacodynamics) models.

The R package stats contains several specific non-linear models for which self-starting (SS) procedures are supplied:

• functions SSasymp, SSasympOff and SSasympOrig fit various forms of exponential decay

• SSbiexp fits the sum of two exponentials

• SSweibull fits a Weibull curve of the form a − b exp(−c x^ρ) (exponential growth/decay in the ρth power of the explanatory variable)

We can use an SS model for the weight loss example by

nls(Weight ~ SSasymp(Days, target, start, lograte),
    data = wtloss)
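SSasymp parameterises the decay as Asym + (R0 − Asym) exp(−exp(lrc) x), so target corresponds to b0 and start to b0 + b1. A short sketch recovering the earlier half-life parameter (the name wtloss.ss is illustrative):

wtloss.ss <- nls(Weight ~ SSasymp(Days, target, start, lograte), data = wtloss)
coef(wtloss.ss)                            # target = b0, start = b0 + b1
log(2) / exp(coef(wtloss.ss)["lograte"])   # half-life: th = log(2)/rate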

Page 19: Non-Linear Models - Further Statistical Methods

Confidence intervals and hypothesis testing

Exact confidence interval procedures or exact hypothesis tests are generally not available for non-linear models.

Asymptotically, the estimated p-dimensional parameter vector satisfies θ̂ ∼ N(θ, V(θ)), where

V(θ) = σ² (F(θ)ᵀ F(θ))⁻¹

and F(θ) is the n × p matrix, n being the number of observations, with entries

F(θ)_{ij} = ∂m(x_i, θ) / ∂θ_j .

In practice, V(θ) is estimated by V(θ̂), using σ̂² = S(θ̂)/(n − p).

The square roots of the diagonal of V(θ̂) are the standard errors returned by the summary function, and the t-distribution (with n − p degrees of freedom) can be used to construct confidence intervals or to test for significance of the parameters.
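These quantities are all accessible from a fitted nls object. A minimal sketch using the weight loss fit (wtloss.fm as above):

se  <- sqrt(diag(vcov(wtloss.fm)))     # square roots of diag of V(theta-hat)
est <- coef(wtloss.fm)
dfr <- df.residual(wtloss.fm)          # n - p
cbind(lower = est + qt(0.025, dfr) * se,   # 95% t-based confidence intervals
      upper = est + qt(0.975, dfr) * se)
confint(wtloss.fm)                     # alternative: profile-likelihood intervals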

Page 20: Non-Linear Models - Further Statistical Methods

More precise confidence intervals

To test a null hypothesis θ = θ* for the whole parameter vector, or θ_j = θ*_j for an individual component, we can use an F-test for model comparison, as in linear regression.

Under the null hypothesis θ = θ*, the test statistic

F(θ*) = ((n − p)/p) · (S(θ*) − S(θ̂)) / S(θ̂) ∼ F_{p, n−p} .

A confidence region is obtained as

{ θ : S(θ) ≤ S(θ̂) (1 + (p/(n − p)) q) }

where q is the (1 − α) quantile of the F distribution with p and n − p degrees of freedom.

Page 21: Non-Linear Models - Further Statistical Methods

Example 2: Stormer viscosity data

The Stormer viscometer measures the viscosity of a fluid by measuring the time taken for an inner cylinder in the mechanism to perform a fixed number of revolutions in response to an actuating weight. The viscometer is calibrated by measuring the time taken with varying weights while the mechanism is suspended in fluids of accurately known viscosity.

The dataset stormer in package MASS comes from such a calibration, and theoretical considerations suggest a non-linear relationship between time T, weight w and viscosity v of the form

T = β1 v / (w − β2) + ε

where β1 and β2 are unknown parameters to be estimated.

This model is partially linear: given β2, we can fit β1 by linear regression.
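The partial linearity can be exploited directly in R. A sketch using nls's "plinear" algorithm, under which only β2 needs a starting value and the linear coefficient β1 is profiled out and reported as .lin (the starting value here is illustrative):

storm.plin <- nls(Time ~ Viscosity/(Wt - b2), data = stormer,
                  start = list(b2 = 2), algorithm = "plinear")
coef(storm.plin)    # b2 and .lin, the estimate of b1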

Page 22: Non-Linear Models - Further Statistical Methods

Williams (1959) suggested finding initial values of the parameters by fitting the rewritten model

wT = β1 v + β2 T + (w − β2) ε

by least squares.

Doing so and then fitting the non-linear model gives

> fm0 <- lm(Wt*Time ~ Viscosity + Time - 1, data = stormer)

> b0 <- coef(fm0); names(b0) <- c("b1", "b2"); b0

b1 b2

28.876 2.8437

> storm.fm <- nls(Time ~ b1*Viscosity/(Wt-b2),

data = stormer, start = b0)

> storm.fm

b1 b2

29.401 2.218

residual sum-of-squares: 825.1

Page 23: Non-Linear Models - Further Statistical Methods

Confidence region for the regression parameters

Denote by (β̂1, β̂2) the parameters minimising the residual sum of squares

S(β1, β2) = ∑_{i=1}^{N} ( T_i − β1 v_i / (w_i − β2) )²

If β1, β2 are the true parameters, the extra sum of squares

F(β1, β2) = ( (S(β1, β2) − S(β̂1, β̂2)) / 2 ) / ( S(β̂1, β̂2) / (N − 2) )

is approximately distributed as F_{p, N−p} with p = 2.
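Evaluating this statistic over a grid of (β1, β2) values produces the surface shown on the next slide; note that qf(0.95, 2, 21) = 3.4668 gives the 95% contour. A minimal sketch, assuming the storm.fm fit above (grid limits and contour levels are illustrative):

Shat <- deviance(storm.fm)      # S(b1hat, b2hat)
dfr  <- df.residual(storm.fm)   # N - 2 = 21
b1 <- seq(27, 32, length.out = 101)
b2 <- seq(1, 4, length.out = 101)
rss <- function(a, b)           # S(b1, b2) at a single grid point
  sum((stormer$Time - a * stormer$Viscosity/(stormer$Wt - b))^2)
Fsurf <- outer(b1, b2, function(a, b) (mapply(rss, a, b) - Shat)/2 / (Shat/dfr))
contour(b1, b2, Fsurf, levels = c(1, 2, 5, 7, 10, 15, 20), xlab = "b1", ylab = "b2")
contour(b1, b2, Fsurf, levels = qf(0.95, 2, dfr), add = TRUE, lwd = 2)  # 95% region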

Page 24: Non-Linear Models - Further Statistical Methods

The F statistic surface and a confidence region for the regression parameters:

[Figure: contour plot of F(b1, b2), b1 roughly 27–32 and b2 roughly 1–4, with contours at levels 1, 2, 5, 7, 10, 15 and 20; the 95% confidence region is bounded by the contour at 3.4668.]

Page 25: Non-Linear Models - Further Statistical Methods

Example 3: Enzyme kinetics

The classic Michaelis–Menten model for enzyme kinetics has two parameters Km, Vmax and relates the reaction rate v to the concentration of substrate S by

v = Vmax S / (Km + S)

Variations with a competitive inhibition

v = Vmax S / (Km (1 + I/Ki) + S)

and non-competitive inhibition

v = Vmax S / ((S + Km)(1 + I/Ki))

where I is the concentration of an inhibitor.

Page 26: Non-Linear Models - Further Statistical Methods

We consider a dataset from CRAN package nlstools containing observations with an inhibitor.

data(vmkmki, package = "nlstools")

We start with a simple model to find starting values:

fm0 <- nls(v ~ SSmicmen(S, Vm, Km), data = vmkmki)

and then start the inhibitor models with a large value of Ki (a small inhibitor effect):

fmc <- nls(v ~ S/(S + Km * (1 + I/Ki)) * Vm,

list(Vm=16, Km=21, Ki=1000), data = vmkmki)

fmnc <- nls(v ~ S/((S + Km) * (1 + I/Ki)) * Vm,

list(Vm=16, Km=21, Ki=1000), data = vmkmki)

We have three models, with RSS 406.5 (70 df), 177.3 and 55.0 (69 df).

Page 27: Non-Linear Models - Further Statistical Methods

The first model is nested within each of the others, so we could use an (approximate) F test:

> anova(fm0, fmc)

Analysis of Variance Table

Model 1: v ~ SSmicmen(S, Vm, Km)

Model 2: v ~ S/(S + Km * (1 + I/Ki)) * Vm

Res.Df Res.Sum Sq Df Sum Sq F value Pr(>F)

1 70 406.47

2 69 177.25 1 229.21 89.227 4.633e-14

> anova(fm0,fmnc)

Analysis of Variance Table

Model 1: v ~ SSmicmen(S, Vm, Km)

Model 2: v ~ S/((S + Km) * (1 + I/Ki)) * Vm

Res.Df Res.Sum Sq Df Sum Sq F value Pr(>F)

1 70 406.47

2 69 54.97 1 351.5 441.24 < 2.2e-16 ***

but there is no need for a formal test: the non-competitive inhibition model is clearly the best.

Page 28: Non-Linear Models - Further Statistical Methods

• We can do most of the regression diagnostics we would use for a linear model for mildly non-linear fits such as these.

• The one thing that is harder (at best) is drop-one-observation diagnostics such as studentized residuals and Cook's distance.

• We can also use simulation-based methods to explore the sampling distribution of the parameters or other quantities of interest, including bootstrapping and the jackknife.

• There are some functions to do so in package nlstools; see the sketch below.
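A minimal sketch of the simulation-based route, applied to the stormer fit from Example 2 (nlsBoot and nlsJack are the package's residual-bootstrap and leave-one-out jackknife helpers; the iteration count is illustrative):

library(nlstools)
storm.boot <- nlsBoot(storm.fm, niter = 999)  # resample residuals and refit
summary(storm.boot)                           # bootstrap estimates and percentile CIs
storm.jack <- nlsJack(storm.fm)               # refit leaving out each observation
summary(storm.jack)                           # jackknife estimates and influence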