STATGRAPHICS – Rev. 9/16/2013
© 2013 by StatPoint Technologies, Inc.
Nonlinear Regression
Summary
Analysis Summary
Plot of Fitted Model
Response Surface Plots
Analysis Options
Reports
Correlation Matrix
Observed versus Predicted
Residual Plots
Unusual Residuals
Influential Points
Save Results
Calculations
Summary
The Nonlinear Regression procedure fits a user-specified function relating a single dependent
variable Y to one or more independent variables X. The model is estimated using nonlinear least
squares. The fitted model may be plotted, forecasts generated from it, and unusual residuals
identified.
Sample StatFolio: nonlinear reg.sgp
Sample Data
The file nonlin.sgd contains data on the amount of available chlorine in samples of a product as a
function of the number of weeks since it was produced. The data, from Draper and Smith (1998),
consists of n = 44 samples, a portion of which are shown below:
Weeks Chlorine
8 0.49
8 0.49
10 0.48
10 0.47
10 0.48
10 0.47
12 0.46
12 0.46
12 0.45
12 0.43
… …
It is desired to fit the following model to the data:
$$\text{chlorine} = a + (0.49 - a)\,e^{-b(\text{weeks} - 8)} \qquad (1)$$
This model, suggested by a subject matter expert, contains two unknowns: a, the asymptotic
baseline value reached at large values of weeks, and b, the exponential rate of decay.
Data Input
The first of two data input dialog boxes requests the name of the dependent variable and the
model to be fit:
Dependent Variable: numeric column containing the n values of Y.
Function: a STATGRAPHICS expression representing the function to be fit. It must include
one or more names of numeric columns, representing the independent variables. It may also
include functions such as SQRT or EXP. Any unrecognized names are considered to
represent model parameters that need to be estimated.
Weight: an optional numeric column containing weights to be applied to the squared
residuals when performing a weighted least squares fit.
Select: subset selection.
The second dialog box requests initial estimates for each of the unknown model parameters:
Enter an initial estimate for each parameter. The program will begin with the initial estimates and
perform a numerical search to find estimates that minimize the residual sum of squares.
Depending upon the complexity of the model, poor estimates may or may not lead to an optimal
solution. In all but the simplest cases, intelligent selection of initial estimates can greatly improve
the chances of obtaining a good solution. Typically, it is important to at least give estimates with
the proper sign (positive or negative), since the search procedure might otherwise move in an
entirely wrong direction.
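To make concrete what the search minimizes, the following Python sketch (not part of STATGRAPHICS) evaluates the residual sum of squares for the chlorine model on the ten rows displayed in the sample data table. It compares the starting values a = 0.1, b = 0.1 with the estimates reported in the Analysis Summary. The function names are illustrative assumptions, and because only 10 of the 44 rows are used, the totals will not match the printed Analysis of Variance table.

```python
import math

# The ten rows of nonlin.sgd displayed in the sample data table
# (the full file contains 44 rows, which are not reproduced here).
weeks = [8, 8, 10, 10, 10, 10, 12, 12, 12, 12]
chlorine = [0.49, 0.49, 0.48, 0.47, 0.48, 0.47, 0.46, 0.46, 0.45, 0.43]


def model(w, a, b):
    """Function being fit: a + (0.49 - a) * exp(-b * (weeks - 8))."""
    return a + (0.49 - a) * math.exp(-b * (w - 8))


def rss(a, b):
    """Residual sum of squares over the displayed rows."""
    return sum((y - model(w, a, b)) ** 2 for w, y in zip(weeks, chlorine))


rss_initial = rss(0.1, 0.1)            # starting values entered in the dialog box
rss_fitted = rss(0.390144, 0.101644)   # estimates reported in the Analysis Summary

print(f"RSS at initial estimates: {rss_initial:.6f}")
print(f"RSS at fitted estimates:  {rss_fitted:.6f}")
```

The numerical search moves from the first value toward the second; a starting point with a much larger residual sum of squares, or with the wrong sign, leaves the search more ground to cover.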
Analysis Summary
The Analysis Summary shows the results of the fit.
Nonlinear Regression - chlorine
Dependent variable: chlorine
Independent variables:
weeks
Function to be estimated: a+(0.49-a)*exp(-b*(weeks-8))
Initial parameter estimates:
a = 0.1
b = 0.1
Number of observations: 44
Estimation method: Marquardt
Estimation stopped due to convergence of residual sum of squares.
Number of iterations: 4
Number of function calls: 14
Estimation Results
                         Asymptotic        Asymptotic 95.0% Confidence Interval
Parameter   Estimate     Standard Error    Lower         Upper
a           0.390144     0.00501534        0.380022      0.400265
b           0.101644     0.0133628         0.0746763     0.128611
Analysis of Variance
Source Sum of Squares Df Mean Square
Model 7.982 2 3.991
Residual 0.00500168 42 0.000119088
Total 7.987 44
Total (Corr.) 0.0395 43
R-Squared = 87.3375 percent
R-Squared (adjusted for d.f.) = 87.036 percent
Standard Error of Est. = 0.0109127
Mean absolute error = 0.00769665
Durbin-Watson statistic = 1.98378
Lag 1 residual autocorrelation = 0.00702451
Residual Analysis
Estimation Validation
n 44
MSE 0.000119088
MAE 0.00769665
MAPE 1.82283
ME -0.000097621
MPE -0.0826224
Included in the output are:
Data Summary: a summary of the input data.
Function to be Estimated: the function to be estimated and the initial parameter estimates.
Estimation Statistics: the method of estimation used and the number of iterations and
function calls performed.
Parameter Estimates: the estimated parameters with approximate confidence intervals.
Confidence intervals that do not contain 0 indicate that the model parameter is statistically
significant at the stated confidence level.
Analysis of Variance: decomposition of the variability of the dependent variable Y into a
model sum of squares and a residual or error sum of squares.
Statistics: summary statistics for the fitted model, including:
R-squared - represents the percentage of the variability in Y which has been explained by the
fitted regression model, ranging from 0% to 100%. For the sample data, the regression has
accounted for about 87.3% of the variability amongst the observed chlorine concentrations.
Adjusted R-Squared – the R-squared statistic, adjusted for the number of coefficients in the
model. This value is often used to compare models with different numbers of coefficients.
Standard Error of Est. – the estimated standard deviation of the residuals (the deviations
around the model). This value is used to create prediction limits for new observations.
Mean Absolute Error – the average absolute value of the residuals.
Durbin-Watson Statistic – a measure of serial correlation in the residuals. If the residuals
vary randomly, this value should be close to 2. A small P-value indicates a non-random
pattern in the residuals. For data recorded over time, a small P-value could indicate that some
trend over time has not been accounted for.
Lag 1 Residual Autocorrelation – the estimated correlation between consecutive residuals, on
a scale of –1 to 1. Values far from 0 indicate that significant structure remains unaccounted
for by the model.
Residual Analysis – if a subset of the rows in the datasheet have been excluded from the
analysis using the Select field on the data input dialog box, the fitted model is used to make
predictions of the Y values for those rows. This table shows statistics on the prediction
errors, defined by
$$e_i = y_i - \hat{y}_i \qquad (2)$$
Included are the mean squared error (MSE), the mean absolute error (MAE), the mean
absolute percentage error (MAPE), the mean error (ME), and the mean percentage error
(MPE). These validation statistics can be compared to the statistics for the fitted model to
determine how well that model predicts observations outside of the data used to fit it.
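Several of the statistics described above can be checked by hand from the printed output. The Python sketch below is an illustration, not the STATGRAPHICS code. It assumes the asymptotic intervals are formed as estimate ± t × standard error with 42 residual degrees of freedom, and it hard-codes the 97.5% Student's t quantile (2.018082) from standard tables as an assumption. The relation between the Durbin-Watson statistic and the lag 1 autocorrelation is only approximate.

```python
import math

# Assumed tabulated value: 97.5% quantile of Student's t with 42 d.f.
T_975_42 = 2.018082

# (estimate, asymptotic standard error) from the Estimation Results table
estimates = {"a": (0.390144, 0.00501534), "b": (0.101644, 0.0133628)}

# Asymptotic 95% confidence intervals: estimate +/- t * standard error
intervals = {
    name: (est - T_975_42 * se, est + T_975_42 * se)
    for name, (est, se) in estimates.items()
}

# Values from the Analysis of Variance table
sse, df_resid = 0.00500168, 42      # residual sum of squares and its d.f.
sst_corr, df_total = 0.0395, 43     # corrected total sum of squares and its d.f.

r_squared = 100 * (1 - sse / sst_corr)
adj_r_squared = 100 * (1 - (sse / df_resid) / (sst_corr / df_total))
std_error_est = math.sqrt(sse / df_resid)

# Durbin-Watson is approximately 2 * (1 - lag 1 residual autocorrelation)
dw_approx = 2 * (1 - 0.00702451)

for name, (lo, hi) in intervals.items():
    print(f"{name}: [{lo:.6f}, {hi:.6f}]")
print(f"R-squared = {r_squared:.4f}%, adjusted = {adj_r_squared:.3f}%")
print(f"Standard error of est. = {std_error_est:.7f}")
print(f"Approximate Durbin-Watson = {dw_approx:.3f}")
```

Neither interval contains 0, which is the basis for calling both parameters statistically significant at the 95% level.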
For the sample data, the fitted model is
chlorine = 0.390144 + (0.49-0.390144)exp(-0.101644(weeks-8)) (3)
The model begins with chlorine = 0.49 at weeks = 8 and drops exponentially to a baseline at
approximately 0.39 as weeks increase.
Plot of Fitted Model
The Plot of Fitted Model pane plots the fitted model versus any one of the independent
variables, with the other variables set equal to values specified on the Pane Options dialog box.
[Figure: Plot of Fitted Model, showing chlorine (vertical axis) versus weeks (horizontal axis).]
Pane Options
Select any one variable to plot on the horizontal axis, together with its range. For the other
variables, enter values to be substituted into the fitted model.
Response Surface Plots
If more than one independent variable is included in the model, surface and contour plots can be
created. For example, Draper and Smith (1998) report on an experiment in which the fraction of
material Y remaining after a chemical reaction was described by the model
$$Y = \exp\left\{-\theta_1 X_1 \exp\left[-\theta_2\left(\frac{1}{X_2} - \frac{1}{620}\right)\right]\right\} \qquad (4)$$
where X1 was the reaction time in minutes and X2 was the reaction temperature in degrees
Kelvin. The data is saved in the file nlreact.sgd and the analysis in nlreact.sgp. A surface plot of
the fitted model is shown below:
[Figure: Estimated Response Surface, showing material (vertical axis) as a function of time and temperature.]
In a surface plot, the height of the surface represents the predicted value of Y. The second option
labeled Response Surface Plots on the Graphical Options menu creates a contour plot:
[Figure: Contours of Estimated Response Surface, showing contour lines of material from 0.1 to 0.9 over time (horizontal axis) and temperature (vertical axis).]
In a contour plot of the above form, each line represents combinations of X1 and X2 that result in
the same predicted value for Y.
Various other formats are available using Pane Options.
Pane Options
Type: choose from a 3-D Surface Plot, where the height of the surface represents the value
of Y versus any two independent variables; a 2-D Contour Plot, where lines or colored
regions represent the value of Y as a function of any two independent variables; a 2-D Square
Plot, where the predicted value of Y is shown at different combinations of 2 independent
variables; or a 3-D Cube Plot, in which the predicted value of Y is shown at different
combinations of 3 independent variables.
Contours: the limits and spacing of the contour lines or regions. The contours may be drawn
as solid Lines representing a single value of Y, Painted Regions representing intervals, or
using a Continuous range of colors.
Resolution: the number of divisions along each axis at which the value of Y is plotted.
Increasing the resolution may improve the quality of the plot, but it can also increase the
length of time required to draw it.
Surface: for a surface plot, the number of divisions along each axis between the lines used to
draw the surface. The surface may be drawn as a Wire Frame (transparent mesh), as a solid
colored surface, or contoured (colored according to values of Y). Contours below puts a
contour plot in the bottom of the cube. Show Points plots the observations with lines drawn
to the surface.
Factors: press this button to select the factors to be plotted. A dialog box similar to that
described for the Plot of Fitted Model will be displayed.
Example – Contour Plot with Continuous Colors
[Figure: Contours of Estimated Response Surface, drawn with a continuous color scale for material from 0.0 to 1.0 over time (horizontal axis) and temperature (vertical axis).]
Example – Surface Plot with Contour Below and Show Points Selected
[Figure: Estimated Response Surface for material versus time and temperature, with a contour plot below the surface and the observed points shown.]
Analysis Options
The Analysis Options dialog box controls the algorithm used to fit the model:
Method: method used to estimate the model parameters. The Gauss-Newton method uses a
linearization technique that fits a sequence of linear regression models to locate the minimum
residual sum of squares. The Steepest-Descent method follows the gradient of the residual
sum of squares surface. Marquardt’s method, the default, is a fast and reliable compromise
between the other two.
Stopping Criterion 1: The algorithm is assumed to have converged when the relative
change in the residual sum of squares from one iteration to the next is less than this value.
Stopping Criterion 2: The algorithm is assumed to have converged when the relative
change in all parameter estimates from one iteration to the next is less than this value.
Maximum Iterations: Estimation stops if convergence is not achieved within this many
iterations.
Maximum Function Calls: Estimation stops if convergence is not achieved when the
function being fit has been evaluated this many times. Multiple function evaluations are done
during each iteration.
Marquardt Parameter: The magnitude of the Marquardt parameter controls the extent to
which the other two methods are traded off against each other. For details on the Marquardt
algorithm, see Box, Jenkins and Reinsel (1994).
Confidence Level: the percentage used to calculate the asymptotic confidence intervals for
the model coefficients.
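The trade-off among the three methods and the role of the two stopping criteria can be illustrated with a minimal Python sketch. It is not the STATGRAPHICS implementation: the damping schedule, the tolerance values, and the function names are assumptions. The data are synthetic values generated without noise from the function a+(0.49-a)*exp(-b*(weeks-8)) at the reported estimates, not the original measurements.

```python
import math


def model(w, a, b):
    return a + (0.49 - a) * math.exp(-b * (w - 8))


def rss(xs, ys, a, b):
    return sum((y - model(x, a, b)) ** 2 for x, y in zip(xs, ys))


def marquardt_fit(xs, ys, a, b, lam=0.01, crit1=1e-12, crit2=1e-8, max_iter=100):
    """Damped Gauss-Newton search for the two parameters a and b.

    lam near 0 gives Gauss-Newton steps; a large lam gives short steps
    close to the (scaled) steepest-descent direction.
    """
    current = rss(xs, ys, a, b)
    for iteration in range(1, max_iter + 1):
        # Accumulate J'J and J'r from the partial derivatives of the model.
        s11 = s12 = s22 = g1 = g2 = 0.0
        for x, y in zip(xs, ys):
            decay = math.exp(-b * (x - 8))
            j1 = 1.0 - decay                       # d(model)/da
            j2 = -(0.49 - a) * (x - 8) * decay     # d(model)/db
            r = y - model(x, a, b)
            s11 += j1 * j1
            s12 += j1 * j2
            s22 += j2 * j2
            g1 += j1 * r
            g2 += j2 * r
        accepted = False
        for _ in range(40):                        # raise lam until RSS does not increase
            d11, d22 = s11 * (1.0 + lam), s22 * (1.0 + lam)
            det = d11 * d22 - s12 * s12
            if det > 0.0:
                da = (d22 * g1 - s12 * g2) / det
                db = (d11 * g2 - s12 * g1) / det
                trial = rss(xs, ys, a + da, b + db)
                if trial <= current:
                    accepted = True
                    break
            lam *= 10.0
        if not accepted:
            return a, b, iteration
        rel_rss = (current - trial) / current if current > 0.0 else 0.0
        rel_par = max(abs(da) / max(abs(a), 1e-12), abs(db) / max(abs(b), 1e-12))
        a, b, current = a + da, b + db, trial
        lam = max(lam / 10.0, 1e-12)               # trust Gauss-Newton more after a success
        if rel_rss < crit1 or rel_par < crit2:     # stopping criteria 1 and 2
            return a, b, iteration
    return a, b, max_iter


# Synthetic, noise-free data generated from the reported estimates.
xs = list(range(8, 44, 2))
ys = [model(x, 0.390144, 0.101644) for x in xs]

a_hat, b_hat, n_iter = marquardt_fit(xs, ys, a=0.1, b=0.1)
print(f"a = {a_hat:.6f}, b = {b_hat:.6f}, iterations = {n_iter}")
```

Raising the damping value only when a step fails to reduce the residual sum of squares is one common way to blend the two extremes; other schedules are possible.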
Reports
The Reports pane creates predictions using the fitted model. By default, the table includes a line
for each row in the datasheet that has complete information on the X variables and a missing
value for the Y variable. This allows you to add rows to the bottom of the datasheet
corresponding to levels at which you want predictions without affecting the fitted model.
For example, suppose a prediction is desired at Weeks = 50 (admittedly an extrapolation of the
model). In row #45 of the datasheet, the value 50 would be added to the Weeks column but the
Chlorine column would be left blank. The resulting table is shown below:
Regression Results for chlorine
       Fitted      Stnd. Error     Lower 95.0% CL   Upper 95.0% CL   Lower 95.0% CL   Upper 95.0% CL
Row    Value       for Forecast    for Forecast     for Forecast     for Mean         for Mean
45     0.392467    0.0115998       0.369057         0.415876         0.38453          0.400403
Included in the table are:
Row - the row number in the data sheet containing the values of the independent
variables.
Fitted Value - the predicted value of the dependent variable using the fitted model.
Standard Error for Forecast - the estimated standard error for predicting a single new
observation.
Confidence Limits for Forecast - prediction limits for new observations.
Confidence Limits for Mean - confidence limits for the mean value of Y at the settings
of the independent variables.
For row #45, the predicted chlorine level is approximately 0.392. A new sample at Weeks = 50
would be expected to be between 0.369 and 0.416 with 95% confidence (provided the
extrapolation held). The mean chlorine level at 50 weeks is estimated to be somewhere between
0.385 and 0.400.
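The two sets of limits in the table can be reproduced from the printed values. The Python sketch below is an illustration only. It assumes the usual relationship in which the forecast variance equals the variance of the estimated mean plus the residual mean square, and it hard-codes the 97.5% Student's t quantile for 42 degrees of freedom (2.018082) from standard tables.

```python
import math

T_975_42 = 2.018082       # assumed tabulated t quantile, 42 d.f.

fitted = 0.392467         # Fitted Value for row 45
se_forecast = 0.0115998   # Stnd. Error for Forecast for row 45
mse = 0.000119088         # residual mean square from the Analysis of Variance

# Assumed decomposition: se_forecast^2 = se_mean^2 + MSE
se_mean = math.sqrt(se_forecast ** 2 - mse)

forecast_limits = (fitted - T_975_42 * se_forecast, fitted + T_975_42 * se_forecast)
mean_limits = (fitted - T_975_42 * se_mean, fitted + T_975_42 * se_mean)

print(f"Limits for forecast: {forecast_limits[0]:.6f} to {forecast_limits[1]:.6f}")
print(f"Limits for mean:     {mean_limits[0]:.6f} to {mean_limits[1]:.6f}")
```

The limits for a single new observation are wider than those for the mean because they include the variability of individual samples around the fitted curve.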
Using Pane Options, additional information about the predicted values and residuals for the data
used to fit the model can also be included in the table.
Pane Options
You may include:
Observed Y – the observed values of the dependent variable.
Fitted Y – the predicted values from the fitted model.
Residuals – the ordinary residuals (observed minus predicted).
Studentized Residuals – the Studentized deleted residuals, as described under Residual Plots.
Standard Errors for Forecasts – the standard errors for new observations at values of the
independent variables corresponding to each row of the datasheet.
Confidence Limits for Individual Forecasts – confidence intervals for new observations.
Confidence Limits for Forecast Means – confidence intervals for the mean value of Y at
values of the independent variables corresponding to each row of the datasheet.
Correlation Matrix
The Correlation Matrix displays estimates of the correlation between the estimated coefficients.
Asymptotic correlation matrix for coefficient estimates
a b
a 1.0000 0.8864
b 0.8864 1.0000
This table can be helpful in determining how well the effects of different independent variables
have been separated from each other.
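If a covariance between the two estimates is needed, it can be recovered from the correlation and the asymptotic standard errors. The short Python sketch below uses only the printed values; the variable names are illustrative.

```python
# Asymptotic standard errors from the Estimation Results table
se_a = 0.00501534
se_b = 0.0133628

# Off-diagonal entry of the asymptotic correlation matrix
corr_ab = 0.8864

# covariance = correlation * (standard error of a) * (standard error of b)
cov_ab = corr_ab * se_a * se_b

print(f"Approximate cov(a, b) = {cov_ab:.4e}")
```

A correlation this close to 1 indicates that an increase in the estimate of a can be largely offset by an increase in the estimate of b with little change in the fit.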
Observed versus Predicted
The Observed versus Predicted plot shows the observed values of Y on the vertical axis and the
predicted values of Y on the horizontal axis.
[Figure: Plot of chlorine, showing observed values (vertical axis) versus predicted values (horizontal axis).]
If the model fits well, the points should be randomly scattered around the diagonal line. It is
sometimes possible to see curvature in this plot, which would indicate the need for a curvilinear
model rather than a linear model. Any change in variability from low values of Y to high values
of Y might also indicate the need to transform the dependent variable before fitting a model to
the data.
Residual Plots
As with all statistical models, it is good practice to examine the residuals. In a regression, the
residuals are defined by
$$e_i = y_i - \hat{y}_i \qquad (5)$$
i.e., the residuals are the differences between the observed data values and the fitted model.
The Nonlinear Regression procedure creates various types of residual plots, depending on Pane
Options.
Scatterplot versus X
This plot is helpful in visualizing any need for a different model.
[Figure: Residual Plot, showing Studentized residuals (vertical axis) versus predicted chlorine (horizontal axis).]
Normal Probability Plot
This plot can be used to determine whether or not the deviations around the line follow a normal
distribution, which is the assumption used to form the prediction intervals.
[Figure: Normal Probability Plot for chlorine, showing percentage (vertical axis) versus Studentized residual (horizontal axis).]
If the deviations follow a normal distribution, they should fall approximately along a straight
line. In the above plot, the data deviate quite a bit from the straight line, indicating that the
deviations follow a distribution with longer tails than that of a normal distribution.
Residual Autocorrelations
This plot calculates the autocorrelation between residuals as a function of the number of rows
between them in the datasheet.
[Figure: Residual Autocorrelations for chlorine, showing autocorrelation (vertical axis) versus lag (horizontal axis).]
It is only relevant if the data have been collected sequentially. Any bars extending beyond the
probability limits would indicate significant dependence between residuals separated by the
indicated “lag”, which would violate the assumption of independence made when fitting the
regression model.
Pane Options
Plot: the type of residuals to plot:
1. Residuals – the residuals from the least squares fit.
2. Studentized residuals – the difference between the observed values $y_i$ and the predicted
values $\hat{y}_i$ when the model is fit using all observations except the i-th, divided by the
estimated standard error. These residuals are sometimes called externally deleted
residuals, since they measure how far each value is from the fitted model when that
model is fit using all of the data except the point being considered. This is important,
since a large outlier might otherwise affect the model so much that it would not appear to
be unusually far away from the fitted model.
Type: the type of plot to be created. A Scatterplot is used to test for curvature. A Normal
Probability Plot is used to determine whether the model residuals come from a normal
distribution. An Autocorrelation Function is used to test for dependence between consecutive
residuals.
Plot Versus: for a Scatterplot, the quantity to plot on the horizontal axis.
Number of Lags: for an Autocorrelation Function, the maximum number of lags. For small
data sets, the number of lags plotted may be less than this value.
Confidence Level: for an Autocorrelation Function, the level used to create the probability
limits.
Unusual Residuals
Once the model has been fit, it is useful to study the residuals to determine whether any outliers
exist that should be removed from the data. The Unusual Residuals pane lists all observations
that have Studentized residuals of 2.0 or greater in absolute value.
Unusual Residuals for chlorine
                   Predicted                    Studentized
Row    Y           Y             Residual       Residual
10     0.43        0.456641      -0.0266407     -2.67
17     0.46        0.42628       0.0337201      3.59
18     0.45        0.42628       0.0237201      2.35
35     0.38        0.400815      -0.0208151     -2.02
Studentized residuals greater than 3 in absolute value correspond to points more than 3 standard
deviations from the fitted model, which is a rare event for a normal distribution. Row #17 is
more than 3.5 standard deviations from the fitted model, which is a very rare event if the
deviations follow a normal distribution.
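How rare such values are can be quantified with the standard normal distribution. The Python sketch below is an illustration using the standard library; it computes the two-sided tail probability for a threshold of 3 and for each Studentized residual in the table, treating the normal distribution as the reference, as the text does.

```python
from statistics import NormalDist


def two_sided_tail(z):
    """Probability that a standard normal value exceeds |z| in absolute value."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# (row, Studentized residual) from the Unusual Residuals table
table = [(10, -2.67), (17, 3.59), (18, 2.35), (35, -2.02)]
tail_probs = {row: two_sided_tail(z) for row, z in table}

print(f"P(|Z| > 3) = {two_sided_tail(3.0):.5f}")
for row, z in table:
    print(f"row {row}: residual {z:+.2f}, tail probability {tail_probs[row]:.5f}")
```

With 44 observations, one or two values beyond 2 in absolute value would not be surprising by chance alone, whereas a value beyond 3.5 would be.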
Note: Points can be removed from the fit while examining the Plot of the Fitted Model by
clicking on a point and then pressing the Exclude/Include button on the analysis toolbar.
Excluded points are marked with an X.
Influential Points
In fitting a regression model, not all observations have an equal influence on the parameter
estimates in the fitted model. In a simple regression, points located at very low or very high
values of X have greater influence than those located nearer to the mean of X. The Influential
Points pane displays any observations that have high influence on the fitted model:
Influential Points for chlorine
                     Mahalanobis                  Cook's
Row    Leverage      Distance       DFITS         Distance
10     0.0407876     0.80918        -0.550164     0.00516818
17     0.051007      1.2807         0.833184      0.0130882
18     0.051007      1.2807         0.544379      0.00647643
40     0.0752918     2.44299        -0.440596     0.00654216
Average leverage of single data point = 0.0454545
Points are placed on this list for one of the following reasons:
Leverage – measures how distant an observation is from the mean of all n observations in
the space of the independent variables. The higher the leverage, the greater the impact of the
point on the fitted values $\hat{y}$. Points are placed on the list if their leverage is more than 3 times
that of an average data point.
Mahalanobis Distance – measures the distance of a point from the center of the collection of
points in the multivariate space of the independent variables. Since this distance is related to
leverage, it is not used to select points for the table.
DFITS – measures the difference between the predicted values $\hat{y}_i$ when the model is fit with
and without the i-th data point. Points are placed on the list if the absolute value of DFITS
exceeds $2\sqrt{p/n}$, where p is the number of coefficients in the fitted model.
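The selection rules can be applied to the printed table as a check. The Python sketch below is an illustration, not STATGRAPHICS code. It assumes the average leverage equals p/n, which reproduces the printed value of 0.0454545 for p = 2 and n = 44, and it uses a DFITS cutoff of 2 times the square root of p/n.

```python
import math

n, p = 44, 2   # observations and estimated coefficients (a and b)

# (row, leverage, DFITS) from the Influential Points table
rows = [
    (10, 0.0407876, -0.550164),
    (17, 0.051007, 0.833184),
    (18, 0.051007, 0.544379),
    (40, 0.0752918, -0.440596),
]

avg_leverage = p / n                  # assumed formula; matches the printed average
leverage_cutoff = 3 * avg_leverage    # more than 3 times an average point
dfits_cutoff = 2 * math.sqrt(p / n)   # DFITS rule

reasons = {}
for row, leverage, dfits in rows:
    why = []
    if leverage > leverage_cutoff:
        why.append("leverage")
    if abs(dfits) > dfits_cutoff:
        why.append("DFITS")
    reasons[row] = why

print(f"average leverage = {avg_leverage:.7f}")
print(f"leverage cutoff  = {leverage_cutoff:.6f}")
print(f"DFITS cutoff     = {dfits_cutoff:.6f}")
for row, why in reasons.items():
    print(f"row {row}: listed because of {', '.join(why)}")
```

Under these assumptions all four rows are listed because of DFITS, and none has leverage above the cutoff.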
Save Results
The following results may be saved to the datasheet:
1. Predicted Values – the predicted value of Y corresponding to each of the n observations.
2. Standard Errors of Predictions - the standard errors for the n predicted values.
3. Lower Limits for Predictions – the lower prediction limits for each predicted value.
4. Upper Limits for Predictions – the upper prediction limits for each predicted value.
5. Standard Errors of Means - the standard errors for the mean value of Y at each of the n
values of X.
6. Lower Limits for Forecast Means – the lower confidence limits for the mean value of Y
at each of the n values of X.
7. Upper Limits for Forecast Means – the upper confidence limits for the mean value of Y at
each of the n values of X.
8. Residuals – the n residuals.
9. Studentized Residuals – the n Studentized residuals.
10. Leverages – the leverage values corresponding to the n values of X.
11. DFITS Statistics – the value of the DFITS statistic corresponding to the n values of X.
12. Mahalanobis Distances – the Mahalanobis distance corresponding to the n values of X.
13. Coefficients – the estimated model coefficients.
14. Function – a text string containing the STATGRAPHICS expression for the function that
was fit.
Calculations
Parameter estimates are found by numerically minimizing the residual sum of squares. The
variance-covariance matrix of the coefficients is estimated from the partial derivatives in the
neighborhood of the least squares solution.