Identifying Influential Observations in Nonlinear Regression – a focus on parameter estimates and the score test

Karin Stål
Academic dissertation for the Degree of Doctor of Philosophy in Statistics at Stockholm University, to be publicly defended on Tuesday 14 April 2015 at 10.00 in De Geersalen, Geovetenskapens hus, Svante Arrhenius väg 14.
Stockholm 2015
http://urn.kb.se/resolve?urn=urn:nbn:se:su:diva-114324
ISBN 978-91-7649-115-7
Department of Statistics
Stockholm University, 106 91 Stockholm
Identifying Influential Observations in Nonlinear Regression
a focus on parameter estimates and the score test

Karin Stål
Abstract
This thesis contributes to influence analysis in nonlinear regression and in particular the detection of influential observations. The focus is on a regression model with a known mean function, which is nonlinear in its parameters and where the function is chosen according to the knowledge about the process generating the data. The error term in the regression model is assumed to be additive.
The main goal of this thesis is to work out diagnostic measures for assessing the influence of observations on various results from a nonlinear regression analysis. The obtained results comprise diagnostic tools for detecting observations that, individually or jointly with some other observations, are influential on the parameter estimates. Moreover, assessing conditional influence, i.e. the influence of an observation conditional on the deletion of another observation, is of interest. This can help to identify influential observations which could be missed due to complex relationships among the observations. Novelties of the proposed diagnostic tools include the possibility to assess the influence of observations on a specific parameter estimate and to assess the influence of multiple observations.
A further emphasis of this thesis is on the observations' influence on the outcome of a hypothesis testing procedure based on Rao's score test. An innovative solution obtained in this thesis to the problem of visual identification of influential observations regarding the score test statistic is the so-called added parameter plot. As a complement to the added parameter plot, new diagnostic measures are derived for assessing the influence of single and multiple observations on the score test statistic.
Keywords: Added parameter plot, differentiation approach, influential observation, nonlinear regression, score test
© Karin Stål, Stockholm 2015
ISBN 978-91-7649-115-7
Printed in Sweden by Publit, Stockholm 2015
Distributor: Department of Statistics, Stockholm University
This thesis is dedicated to my loving family.
Contents
Abstract
List of Figures
List of Tables
Acknowledgments
1 Introduction
 1.1 Francis Galton, linear regression and correlation
 1.2 Introduction to influence analysis in linear regression
 1.3 Introduction to influence analysis in nonlinear regression
 1.4 Influence analysis regarding a test statistic
 1.5 Aims of the dissertation
2 Regression models
 2.1 Linear regression models and least squares estimation
 2.2 Nonlinear regression models and estimation
  2.2.1 Geometry of nonlinear regression
 2.3 Score testing in regression analysis
  2.3.1 The score test in linear regression
  2.3.2 The score test in nonlinear regression
3 Influence analysis in regression
4 Graphical displays
 4.1 Graphical displays in linear regression
  4.1.1 The added variable plot
 4.2 Graphical displays in nonlinear regression
  4.2.1 The added parameter plot
  4.2.2 Numerical example
5 Assessment of influence on parameter estimates
 5.1 Assessment of influence of a single observation
  5.1.1 The influence measure EIC in linear regression, derived via the differentiation approach
  5.1.2 The influence measure DIM, for use in nonlinear regression
  5.1.3 A note on DIM_θ,k and DIM_θj,k
  5.1.4 Numerical example: Influence analysis using DIM_θ,k
  5.1.5 Numerical example: Influence analysis using DIM_θj,k
 5.2 Assessment of influence of multiple observations
  5.2.1 Joint influence in linear regression
  5.2.2 Joint influence in nonlinear regression
  5.2.3 Conditional influence in linear regression
  5.2.4 Conditional influence in nonlinear regression
 5.3 Summary of Chapter 5
6 Assessment of influence on the score test statistic
 6.1 Assessment of influence of a single observation
  6.1.1 Linear regression
  6.1.2 Nonlinear regression
  6.1.3 Numerical example
 6.2 Assessment of influence of multiple observations
  6.2.1 Numerical example
7 Concluding remarks and further research
Sammanfattning
References
List of Figures
2.1 The Michaelis-Menten curve, where y is the initial velocity and x is the substrate concentration. The parameter values are θ1 = 0.9 and θ2 = 0.2. The dashed line represents the value of θ1, the dotted horizontal line represents the value of y that is half of θ1, and the dotted vertical line represents the value of θ2.
2.2 A growth curve, where f(t) is the size of the population and t is time. The solid line represents the tangent line at the point of inflection, represented by the filled circle. The slope of the tangent line is equal to µm = 12/e. The lag time, λ = 5/3, is the intercept of the tangent line. The dotted line represents the asymptote, A = 20.
4.1 Added variable plot for the explanatory variable "RGF", using the data on jet fighters presented in Cook and Weisberg (1982).
4.2 The added parameter plot for θ4, consisting of the scatter plot of y, the residuals resulting from regressing y on F1, against x, the residuals resulting from regressing F2 on F1, and the estimated regression line with slope α.
5.1 Plot of the data given in Table 5.1, where y = initial velocity and x = substrate concentration, together with the estimated curve. Observation 40 is contaminated.
5.2 The joint-parameter influence measure DIM_θ,k defined in (5.4), for each observation in Table 5.1. Observations within the dashed lines represent 75 percent of the data. Observe that DIM_θ,k = (DIM_θ1,k, DIM_θ2,k).
5.3 The influence measures DIM_θ1,k and DIM_θ2,k calculated for each observation in Table 5.1.
5.4 Standardized residuals and leverages computed using the data in Table 5.1.
5.5 Plot of the data given in Table 5.1, where observation 9 is contaminated and observation 40 is uncontaminated.
5.6 The marginal influence measure DIM_θj,k, for j = 1, 2 and k = 1, …, 49, when the 9th observation is contaminated. 75 percent of the data are within the dashed lines.
5.7 Marginal leverages of observations k = 1, …, 49 when the 9th observation is contaminated. (a) describes the marginal leverages when θ1 is under consideration and (b) describes the marginal leverages when θ2 is under consideration.
6.1 A plot of DIMS_k against the observation number, where DIMS_k is the diagnostic measure for assessing the influence of the observations on the score test statistic, given in Definition 6.1.2. The data used are presented in Table 4.1.
List of Tables
4.1 Data from Bates and Watts (1988), used to fit the Michaelis-Menten model with expectation functions (4.22) and (4.23).
5.1 Simulated data according to the model given in (5.26).
Acknowledgments
Finishing this thesis was a very stressful task, where dreams in the night about theorems and proofs were haunting me and the day didn't seem to have enough hours. However, being a Ph.D. student has been a true experience, and it warms my heart to think about the people who have supported me and been there for me, in good times and bad.

First and foremost, my deepest gratitude goes to my supervisor, Associate Professor Tatjana von Rosen. You have been a solid ground with your guidance, infinite knowledge and positive attitude. Your constant willingness to help is remarkable and you are never too tired to make an effort. Moreover, I really appreciate your sense of humor and I would like to thank you for the laughs we had.

To my assistant supervisor Professor Dietrich von Rosen I would like to say thank you for all the interesting discussions and for your insightful comments. During the final stage, your help has been invaluable and I appreciate that you always find time to read and answer any questions.

Ellinor Fackle-Fornius, my assistant supervisor, thank you for all your help with my research and for the fantastic times we have had in the past nine years. Some are truly memorable, for instance the way we finished our UPC-course.

Thank you, all fellow Ph.D. students, former and present, at the Department of Statistics. Together, we have had a lot of fun and I take good memories with me. I would like to send a special "thank you" to Olivia and Yuli. Your support has been invaluable and coming to work is much more fun when you are there.

Moreover, thanks to all my colleagues at the Department of Statistics, and especially to Dan Hedlin, Richard Hager and Michael Carlson for being helpful and willing to listen when problems arise. I would also like to thank Professor Hans Nyquist for introducing me to the topic and for providing me with good ideas.

To my wonderful parents, mamma Berit and pappa Erik: what would I do without you? Better parents and supporters cannot be found. Whenever I need you, you are there for me and I count myself lucky having you.

Daniel Bruce, I am sincere when I say that this thesis would not have been written if it wasn't for you. Your dedication to our family and the way you prioritized me and my finalizing of the thesis is extraordinary. You are a wonderful, wonderful man and I love you.

The best things in my life are my sons, Elliott and Elmer. With you, there is never a dull moment. Thank you for being you and for being part of my life.

Karin Stål
Stockholm, March 9, 2015
1. Introduction
1.1 Francis Galton, linear regression and correlation
In 1889 the English polymath Francis Galton was taking a country walk at Naworth Castle near Carlisle, Northern England. A rainstorm was sweeping the country, and as Galton took shelter from the rainstorm, an idea flashed across him, namely the idea of correlation analysis. This was the beginning of correlation and regression analysis, according to Barnes (1998), where an amusing story about the history of statistics in general, and regression analysis in particular, is told. Whether this story is true or not is debated, see for instance Stigler (1986). However, assuming it is true, the idea of correlation that flashed across Galton that day in the rain did not appear in a vacuum. It was a concluding step in a 20-year research project. He first observed reversion towards the mean in the late 1870's when he conducted experiments on seed size in successive generations of sweet peas. In the 1880's Galton was investigating the heights of parents and their offspring, see Bulmer (2003). Galton found that tall parents, or taller than mediocrity as he called it, had children who were shorter than themselves and that parents who were shorter than mediocrity had children taller than themselves. This led him to call the phenomenon "regression toward mediocrity." According to Sen and Srivastava (1990), the phenomenon of regression did not start with Galton. There were other mathematicians who were doing what we could call regression prior to Galton. What was interesting with Galton's work was that he connected regression and correlation. According to Stigler (1989), in the late 1880's Galton was simultaneously pursuing two unrelated investigations, one in anthropology and one in forensic science. In anthropology, the question was as follows: If a single thigh bone is recovered from an ancient grave and measured, what can the measurement of the bone tell us about the total height of the individual to whom it belonged? The other question was related: For the purpose of criminal identification, what can be said about the relationship between measurements taken from different parts of the same person? What dawned on Galton was that these new problems were identical to the old one on kinship and that all three of them were no more than special cases of a much more general problem, namely that of correlation. Not only did he describe the relationships between variables through regression toward mediocrity, but he also found a way to measure the strength of this relationship through the correlation coefficient. Moreover, Galton realized that the variation of one variable around the regression line could be divided into two parts, one part that could be explained by the other variable and one that could not.
Galton's ideas about correlation and regression are not far from our conception of them. Correlation is used to study the linear association between two variables. The correlation coefficient for assessing the strength of this association ranges between -1 and 1, where the sign indicates the direction of the association. In linear regression the linear association between the variables is described using a linear function. Moreover, as Galton realized, the dependent variable depends on some unobservable error, often assumed to be a normally distributed random variable with expectation zero and constant variance. When the unknown parameters in the linear function are estimated, a fitted linear regression model is obtained. Correlation and regression are connected, as the estimate of the slope parameter in the linear function and the correlation coefficient are functionally related.
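In the simple linear regression of y on x, this functional relation is the classical identity b̂₁ = r·s_y/s_x, where r is the correlation coefficient and s_x, s_y are the sample standard deviations. A small numerical check of the identity, using made-up data and numpy (an illustration, not taken from the thesis):

```python
import numpy as np

# Made-up bivariate data with a clear positive association.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 5.9])

# Least squares slope of the simple linear regression of y on x.
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# The correlation coefficient and the relation slope = r * s_y / s_x.
r = np.corrcoef(x, y)[0, 1]
slope_from_r = r * y.std(ddof=1) / x.std(ddof=1)
assert abs(slope - slope_from_r) < 1e-9
```

The identity holds algebraically for any data set, since both expressions reduce to the ratio of the sample covariance to the sample variance of x.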
Galton rightly foresaw that the methods of regression and correlation would have a prominent place in many applications, see Bulmer (2003). The linear regression model is widely used in e.g. business, the social and behavioral sciences, the biological sciences and many other disciplines.
1.2 Introduction to influence analysis in linear regression
It is well understood that not all observations in the data set play an equal role when fitting a regression model. Some observations might have more impact on, for instance, the estimation process than others. Observations that significantly influence certain results from the regression analysis are called influential observations. The study of the data and how different parts of it influence the inference is called influence analysis. Influence analysis of the inference in linear regression models is a well-established area of research. Andrews and Pregibon (1978) highlighted that we need to find the outliers that matter. What is meant by this is that not all outliers need to be harmful in the sense that they have an undue influence on, for instance, the estimation of the parameters in the regression model. If not all outliers matter, examining the residuals alone might not lead us to the detection of aberrant or unusual observations. Thus, other ways of finding influential observations are needed. Hoaglin and Welsch (1978) discussed the importance of the projection matrix in linear regression, where the projection matrix is the matrix that projects onto the regression space. They argued that the diagonal elements of the projection matrix are important ingredients in influence analysis. The diagonal elements are referred to as leverages, since they can be thought of as the amount of leverage of the response value on the corresponding predicted response value. Perhaps the most well-known influence measure was proposed by Cook (1977), referred to as Cook's distance. Cook's distance is used for assessing the influence of the observations on the estimated parameter vector in the linear regression model. It is widely used by practitioners for detecting influential observations, and it is included in most statistical computer programs. There exists a wide range of other influence measures to use in linear regression analysis for assessing the influence of the observations on various results of the regression analysis. For example, Andrews and Pregibon (1978) derived a measure of the influence of an observation on the estimated parameters. This measure, the Andrews-Pregibon statistic, is based on the change in volume of confidence ellipsoids with and without a particular observation. Moreover, Belsley et al. (1980) suggested an influence measure for assessing the influence of an observation on the variance of the estimated parameters in the linear regression model, known as COVRATIO. Besides the influence measures mentioned here there exist many more, see e.g. Chatterjee and Hadi (1986) and Hadi (1992) for excellent overviews of influence measures.
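The two quantities just discussed, leverages and Cook's distance, follow directly from their standard textbook definitions. The sketch below computes both for a toy straight-line data set with one contaminated response; the data and the use of numpy are illustrative choices of mine, not taken from the thesis:

```python
import numpy as np

def leverages_and_cooks_d(X, y):
    """Leverages and Cook's distances for the linear model y = X*beta + eps.

    h_i are the diagonal elements of the projection (hat) matrix
    H = X (X^T X)^{-1} X^T, and Cook's distance is
    D_i = (e_i^2 / (p * s^2)) * h_i / (1 - h_i)^2,
    where p is the number of columns of X and s^2 the residual mean square.
    """
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)
    e = y - H @ y                     # ordinary residuals
    s2 = e @ e / (n - p)              # residual mean square
    D = (e**2 / (p * s2)) * h / (1.0 - h) ** 2
    return h, D

# Toy straight-line data with one grossly contaminated response.
x = np.arange(1.0, 11.0)
X = np.column_stack([np.ones_like(x), x])
y = 2.0 + 0.5 * x
y[-1] += 5.0                          # contaminate the last observation
h, D = leverages_and_cooks_d(X, y)
# The contaminated high-leverage point has the largest Cook's distance.
assert np.argmax(D) == 9
```

Note how the factor h_i/(1 − h_i)² makes a given residual count for more at a high-leverage design point, which is exactly the "outliers that matter" idea above.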
Graphical investigation of data is a powerful tool in explorative analysis. It can be used to examine relationships between variables and to discover observations deviating from the others. Hence, influential observations can also be detected using graphical tools. Mosteller and Tukey (1977) introduced the added variable plot, which is used for graphically detecting observations that have a large influence on the parameter estimates. For details concerning the added variable plot, such as its construction and properties, see e.g. Belsley et al. (1980), where the plot is referred to as the partial regression leverage plot, and Cook and Weisberg (1982). Other results on graphical tools in influence analysis are provided by e.g. Atkinson (1982) and Johnson and McCulloch (1987). It is important to note that the graphical tools used in influence analysis are not conclusive, but rather suggestive.
From the previous paragraphs we can see that the 1970's and the 1980's were the decades when most research results on influence analysis in linear regression came to light. However, influence analysis in linear regression is still an active research area. Nurunnabi et al. (2014) proposed a modification of Cook's distance. This modification enables the identification of multiple influential observations. Furthermore, Beyaztas and Alin (2014) used a combined bootstrap and jackknife algorithm to detect influential observations.
In applied data analysis, there is an increasing availability of data sets containing a large number of variables. When such data are in the hands of the researcher, sparse regression can be implemented, which is another field of research active today. In sparse regression, a penalty term on the regression parameters is added, which shrinks the number of parameters. Common approaches to estimating the parameters in sparse regression are, however, sensitive to influential observations and new methods are needed. Alfons et al. (2013) and Park et al. (2014) proposed robust estimation methods, where influential observations are not harmful to the resulting estimates.
1.3 Introduction to influence analysis in nonlinear regression
In this thesis, new tools for conducting influence analysis in nonlinear regression are proposed. The nonlinear regression model referred to in this thesis is a model where the relationship between the variables is a function that is nonlinear in its parameters. We assume that the error term enters the model linearly. The motivation for using the nonlinear regression model arises from the need to describe real-life phenomena with a meaningful and realistic model. This meaning might be biological, chemical or physical (Bates and Watts, 1988). Thus, the function used to describe the relationship between the variables is often known and it is chosen according to the knowledge about the process generating the data. The Michaelis-Menten model (Michaelis and Menten, 1913) will be frequently used as an example of a nonlinear regression model throughout this thesis. The model is used, for instance, in studying enzyme-catalyzed reactions, called enzyme kinetics. The motivation for using it is that the behavior of the enzymatic reaction's velocity (dependent variable) when adding different substrate concentrations (independent variable) to the process is known to be well described by the Michaelis-Menten model. Moreover, the parameters in the Michaelis-Menten model have chemically meaningful interpretations. A more detailed discussion of the Michaelis-Menten model will be given in Chapter 2, where the estimation of the parameters is also discussed.
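For concreteness, the Michaelis-Menten expectation function and the interpretation of its two parameters can be written down directly. The parameter values below are those used in Figure 2.1 (θ1 = 0.9, θ2 = 0.2); the code itself is only an illustration:

```python
# Michaelis-Menten mean function: f(x, theta) = theta1*x / (theta2 + x),
# where x is the substrate concentration, theta1 is the maximum velocity
# (the horizontal asymptote) and theta2 is the substrate concentration at
# which the velocity reaches half of theta1.
def michaelis_menten(x, theta1, theta2):
    return theta1 * x / (theta2 + x)

# Parameter values used in Figure 2.1.
theta1, theta2 = 0.9, 0.2

# At x = theta2 the expected velocity is exactly theta1 / 2 ...
assert abs(michaelis_menten(theta2, theta1, theta2) - theta1 / 2) < 1e-12
# ... and for large x the curve approaches the asymptote theta1.
assert abs(michaelis_menten(1e6, theta1, theta2) - theta1) < 1e-3
```

These two checks are precisely the chemical interpretations mentioned above: θ1 is the limiting velocity and θ2 the half-saturation concentration.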
The existing literature on influence analysis in nonlinear regression is not as extensive as for linear regression. One reason for this can be that there do not generally exist closed form estimators for the parameters in the nonlinear regression model. Detection of influential observations on the fit of the nonlinear regression model is discussed by Cook and Weisberg (1982) and St. Laurent and Cook (1993). Cook and Weisberg (1982) developed a nonlinear version of Cook's distance, and St. Laurent and Cook (1993) proposed an approach for assessing the influence of the observations on the fitted values and on the estimate of the variance in a nonlinear regression model. These diagnostic tools will be more thoroughly discussed in Chapter 3. For more recent research results, see Galea et al. (2005) and Vanegas and Cysneiros (2010). Moreover, for a discussion of influence analysis concerning a specific nonlinear regression model, see Lemonte and Patriota (2011) and Vanegas et al. (2012).
Two graphical tools for identifying observations that are influential on the parameter estimates in a nonlinear regression model are presented in Cook (1987). One of these plots is referred to as the first-order extension of an added variable plot. This plot will be discussed in detail in Chapter 4.
1.4 Influence analysis regarding a test statistic
Testing of hypotheses is an important part of regression analysis. There are several testing procedures available for linear and nonlinear regression. In this thesis, the focus is on Rao's score test (Rao, 1948).
Several authors have presented work on the sensitivity of the score test statistic. Lee et al. (2004) used the score test to test for zero-inflation in count data. The null hypothesis under consideration was that the Poisson distribution fits the observed data well. However, for some applications there might be a large number of zeros in the data. In this case a more appropriate model could be a zero-inflated Poisson model. The alternative hypothesis is thus that the data follow a zero-inflated Poisson distribution. Another score test is also of interest, namely to test the null hypothesis that the data follow a zero-inflated Poisson model against the alternative that the zero-inflated negative binomial is a better model. When deriving the influence diagnostic, Lee et al. (2004) used the local influence approach, proposed by Cook (1986), which will be discussed in Chapter 3.
Lustbader and Moolgavkar (1985) derived an expression for the change in the score test statistic when deleting observations. This expression is derived for linear regression models, but is discussed in detail for deletion of entire risk sets in matched case-control studies and survival studies. Matched case-control studies are retrospective, observational studies where one seeks to determine the relationship between a risk factor and, for instance, a disease, using a particular matching variable to produce groups. With case-control data, it is natural to consider the change in the score test for deletion of entire risk sets, i.e. the number of subjects at risk of experiencing a certain event. In survival analysis it is more desirable to compute the diagnostic for each individual.
Chen (1985) discussed the robustness of score tests for generalized linear regression models. The robustness referred to here is the robustness against the functional form chosen under the alternative hypothesis. Chen also discussed how the score test statistic can be made more robust against possible extreme observations. Moreover, Li (2001) discussed the sensitivity of the score test, the Wald test and the likelihood ratio test in relation to nuisance parameters, i.e. how the corresponding test statistics are affected by changes in the values of the nuisance parameters. Furthermore, see Vanegas et al. (2012, 2013) for a discussion of influence analysis concerning the F-test.
1.5 Aims of the dissertation
The general purpose of this thesis is to develop diagnostic tools for nonlinear regression models with additive error terms. More specifically, five aims can be outlined.
The first aim of this dissertation is to propose a new approach for detecting single observations with high influence on the parameter estimates in a nonlinear regression model. There is a lack of existing approaches for finding observations with high influence on a specific parameter estimate in nonlinear regression models, and an aim is that our new approach should have this property.
In real-life studies data sets seldom contain only one influential observation, and therefore methods for finding multiple influential observations are needed. Multiple influential observations have not yet been discussed in the literature on influence analysis in nonlinear regression. A second aim of this thesis is therefore to extend the approach for finding single influential observations to detecting multiple influential observations.
The third aim of the thesis is to study conditional influence, i.e. the influence of an observation on the parameter estimates given that another observation is deleted first. By using the conditional influence approach, influential observations can be revealed that might go unnoticed when "unconditional" methods are used. Moreover, the use of the conditional influence approach can provide a more intimate knowledge of the data, since hidden dependence among certain observations in the data set can be revealed.
A further aim of the present thesis is to evaluate how results from testing hypotheses about the parameters in a nonlinear regression model are affected by individual observations. The focus is on a particular test, namely Rao's score test. This test has the advantage, over other tests such as the likelihood ratio test, that only quantities evaluated at the parameter estimates under the null hypothesis need to be considered when constructing the test statistic. Hence, the derivation of diagnostic tools might be less complicated compared to tests where parameter estimates under both the null and the alternative were to be considered.
Graphical exploration of the data is of the utmost importance. Furthermore, graphical inspection of the observations' contribution to a test statistic is also of great interest. Hence, the fourth aim is to construct a plot that allows for visual identification of observations influential on the score test statistic. However, a graphical tool is used for explorative purposes and does not quantify the influence of the observations. To add more information to the influence analysis concerning the test statistic, a fifth aim is to propose influence measures that can be used to assess the influence of observations on the score test statistic.
To summarize, the aims of this thesis are as follows:
• To propose a new approach to assessing the influence of a single observation on the parameter estimates in nonlinear regression models.
• To extend the influence approach concerning single observations to assessing the influence of multiple observations.
• To propose an approach for assessing conditional influence, i.e. influence of an observation conditional on the deletion of another observation.
• To develop a graphical tool for explorative data analysis, where observations with high influence on the score test statistic can be identified.
• To propose influence measures for assessing the influence of observations on the score test statistic.
Chapters 1-4 present the appropriate background for a discussion of the above listed aims, whereas Chapters 5-7 include a more explicit discussion of the aims.
2. Regression models
This chapter gives a brief overview of existing results concerning estimation and testing in regression models. In particular we focus on least squares estimation in nonlinear regression models, which are central to this thesis. Another focus in this chapter is Rao's score test (Rao, 1948), since one of the new results obtained in this thesis is closely connected to the score test.
2.1 Linear regression models and least squares estimation
In regression, the relationship between a response variable and explanatory variables is often represented by a functional relationship, f, and an additive error term. The function f is called the expectation function. When f is linear in its parameters, we may write the model
y = Xβ + ε, (2.1)
where y : n×1 is a response vector, X : n×p is the matrix of p explanatory variables, β : p×1 is a vector of unknown regression parameters and ε : n×1 is random error, ε ∼ N(0ₙ, σ²Iₙ). Here 0ₙ : n×1 is a vector of zeros and Iₙ : n×n is the identity matrix.
The parameters in (2.1) are often estimated by the method of least squares or maximum likelihood. Both methods yield the following estimator of β in (2.1):
βββ = (XXXTXXX)−1XXXTyyy,
assuming that the rank of XXX equals p and where T denotes the transpose of thematrix. Moreover, the maximum likelihood estimator of σ2 is
nσ2 =
(yyy−XXXβββ
)T (yyy−XXXβββ
).
Linear regression models are fairly flexible since even a nonlinear behavior of the data can be modeled by introducing nonlinear explanatory variables. An example of a linear regression model with nonlinear explanatory variables is the polynomial regression model

y_i = β_0 + β_1x_i + β_2x_i² + ε_i,  i = 1, …, n.
However, there is a limit to what can adequately be approximated by a linear model. Moreover, it may be difficult to interpret the results. If a linear regression model does not seem to fit the data well, an alternative solution might be to use a model that is not linear in its parameters, i.e. a nonlinear regression model.
2.2 Nonlinear regression models and estimation
In this section we introduce the nonlinear regression model and the estimation process, provide some examples of different applications of nonlinear regression models and briefly discuss the geometry of nonlinear regression.
Nonlinear regression models are widely used in many areas, such as economics, agriculture and biology. The decision to use a nonlinear regression model can be made on the basis of the theoretical knowledge about the problem at hand and the process generating the data. The function f is usually entirely known except for the parameters in the model. The parameters are often meaningful to the researcher or scientist, where the meaning can be for example graphical, physical, biological or chemical.
In this thesis we assume that a nonlinear regression model has a known f, and that this function is chosen due to the knowledge about the process generating the data. The regression model is not linear in its parameters (it can be partly linear) and the error term is assumed to be additive. The general form of the model is

y = f(X, θ) + ε,  (2.2)

where y : n×1 is a response vector, X : n×p is the matrix of explanatory variables, θ : q×1 is a vector of unknown parameters and ε : n×1 is the random error, ε ∼ N(0_n, σ²I_n).
Nonlinear models have many applications to real life problems, and next we consider two examples of nonlinear regression models.
Example 2.1. The Michaelis-Menten model in enzyme kinetics
Consider a scientist who will study enzyme-catalyzed reactions. The scientist knows that the initial velocity of an enzymatic reaction follows Michaelis-Menten kinetics. That is, the relationship between y, the initial velocity of the enzymatic reaction, and x, substrate concentration, is modeled by the Michaelis-Menten equation
f(x) = V_max x / (K_max + x),  (2.3)

where V_max and K_max are unknown parameters, which will be explained later. In this case, due to knowledge about chemical reactions, the function f is known to the scientist and there is no need to search for the correct functional relationship between y and x. For more details on the theoretical basis of the Michaelis-Menten equation see Briggs and Haldane (1925).
The Michaelis-Menten equation (2.3) can be used to formulate a nonlinear regression model by assuming an additive error term:

y_i = θ_1x_i/(θ_2 + x_i) + ε_i,  i = 1, …, n,  (2.4)

where θ_1 = V_max, θ_2 = K_max and the ε_i are i.i.d. N(0, σ²). Interested readers may see Richie and Prvan (1996), Pasaribu (1999) and Dette and Kunert (2014) for statistical analysis of enzyme kinetics data using the Michaelis-Menten equation.
It can be seen from (2.4) that the parameter θ_1 enters the model linearly but the parameter θ_2 enters nonlinearly, and thus the relationship between y and x is nonlinear. In the model (2.4), the parameters θ_1 and θ_2 have physical interpretations. The parameter θ_1 is the maximum initial velocity, which is theoretically attained when the enzyme has been saturated with respect to the concentration of a substrate. The second parameter, θ_2, is the Michaelis parameter, which equals the concentration of substrate at "half-maximum" initial velocity. When the parameters in the model are estimated they are dependent, since a change in θ_1 results in a change in θ_2 as well. The Michaelis-Menten curve, with θ_1 = 0.9 and θ_2 = 0.2, is given in Figure 2.1.
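As a concrete illustration of model (2.4), the sketch below fits the Michaelis-Menten curve of Figure 2.1 to simulated data with standard nonlinear least squares software. The data, the noise level and the starting value p0 are hypothetical choices, not taken from the thesis:

```python
import numpy as np
from scipy.optimize import curve_fit

def michaelis_menten(x, theta1, theta2):
    # Expectation function of model (2.4): f(x) = theta1 * x / (theta2 + x)
    return theta1 * x / (theta2 + x)

# Simulated substrate concentrations and initial velocities,
# generated from the curve in Figure 2.1 (theta1 = 0.9, theta2 = 0.2)
rng = np.random.default_rng(1)
x = np.linspace(0.05, 1.5, 25)
y = michaelis_menten(x, 0.9, 0.2) + rng.normal(0.0, 0.01, size=x.size)

# Nonlinear least squares fit; p0 is a starting value (cf. Section 2.2)
theta_hat, _ = curve_fit(michaelis_menten, x, y, p0=[1.0, 0.3])
```

With this low noise level the estimates land close to the generating values (0.9, 0.2), and the dependence between the two estimates mentioned above can be seen in the off-diagonal of the returned covariance matrix.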
Figure 2.1: The Michaelis-Menten curve where y is the initial velocity and x isthe substrate concentration. The parameter values are θ1 = 0.9 and θ2 = 0.2. Thedashed line represents the value of θ1, the dotted horizontal line represents thevalue of y that is half of θ1, and the dotted vertical line represents the value of θ2.
In the next example we will describe another interesting nonlinear regression model.
Example 2.2. The Gompertz Growth Curve Model
In microbiology, models are often used to describe the behavior or growth of microorganisms under different physical or chemical conditions. In order to build these models, growth is measured and modeled. For this purpose, it is common to use a type of nonlinear models called growth curve models. The Gompertz growth curve model is of particular interest, and it is defined as
f(t) = k exp(−exp(a − bt)),  (2.5)

where f(t) is the size of the population at time t, and k, a and b are unknown parameters. See e.g. Zwietering et al. (1990) for a discussion of modeling bacterial growth with growth curve models and Chakraborty et al. (2014) for statistical analysis of the Gompertz growth curve model.
A common feature of the Gompertz growth curve models is that they have two asymptotes: the curve approaches zero as t → −∞ and a positive constant as t → ∞, and the growth rate accelerates to a maximum value, after which it declines. The point (t, f) where the growth rate is maximum is called the point of inflection. For the Gompertz growth curve model (2.5), the curve approaches k as t → ∞. Another property of the Gompertz growth curve model is that it is not symmetric around the point of inflection.
Zwietering et al. (1990) analyzed growth data of Lactobacillus plantarum. Bacterial growth often shows a phase in which the specific growth rate starts at a value of zero. The growth rate then accelerates to a maximal value in a certain period of time, resulting in a so-called lag time. Thereafter, the growth curve enters a final phase in which the growth rate decreases and finally reaches zero. The size of the population of bacteria then approaches an asymptote.
Different growth curve models can be used to describe the behavior of bacterial growth. However, the most suitable model should contain parameters that are microbiologically relevant. One of the candidate models that has been used is the modified Gompertz growth curve model, which is given by

f(t) = A exp[−exp[(μ_m e/A)(λ − t) + 1]],

where A is the asymptote as t → ∞, λ is the lag time and μ_m is the maximum growth rate in a certain period of time. An alternative definition of λ and μ_m is given by considering a tangent line at the point of inflection. The parameter μ_m is defined as the slope of the tangent line and the parameter λ is the intercept of the tangent line. An example of a growth curve is given in Figure 2.2, where A = 20, λ = 5/3 and μ_m = 12/e.
By assuming an additive error term, the following nonlinear regression model can be formulated:

y_i = A exp[−exp[(μ_m e/A)(λ − t_i) + 1]] + ε_i,  i = 1, …, n,

where the ε_i are i.i.d. N(0, σ²).
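The parameter interpretations above can be checked numerically. The sketch below evaluates the modified Gompertz curve with the parameter values of Figure 2.2 and verifies that the curve approaches the asymptote A, and that the tangent at the point of inflection has slope μ_m and crosses zero at t = λ:

```python
import numpy as np

def gompertz(t, A, mu_m, lam):
    # Modified Gompertz curve: f(t) = A exp(-exp((mu_m e / A)(lam - t) + 1))
    return A * np.exp(-np.exp((mu_m * np.e / A) * (lam - t) + 1.0))

A, lam, mu_m = 20.0, 5.0 / 3.0, 12.0 / np.e   # values used in Figure 2.2

# The inner exponent vanishes at the point of inflection:
# (mu_m e / A)(lam - t) + 1 = 0  =>  t* = lam + A / (mu_m e)
t_star = lam + A / (mu_m * np.e)

# Numerical derivative at t*: should equal the maximum growth rate mu_m
h = 1e-6
slope = (gompertz(t_star + h, A, mu_m, lam)
         - gompertz(t_star - h, A, mu_m, lam)) / (2.0 * h)

# The tangent line at t* crosses zero at the lag time lambda
t_zero = t_star - gompertz(t_star, A, mu_m, lam) / slope
```

For these values t* = 10/3, the slope equals 12/e ≈ 4.41, and the tangent crosses zero at 5/3, in agreement with the geometric definitions of μ_m and λ.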
To estimate parameters in nonlinear regression models, least squares or maximum likelihood methods are often used. These methods of estimation yield the same mean parameter estimates when the errors in the nonlinear regression model are independent, normally distributed and have constant variance, i.e. ε ∼ N_n(0, σ²I). In contrast to linear regression, analytical solutions for the least squares and maximum likelihood estimators can generally not be found. Instead, numerical algorithms are required. Perhaps the most well-known algorithms are the Gauss-Newton (GN) algorithm and the Newton (N) algorithm. The GN algorithm is a modification of the N algorithm, proposed by Gauss in 1809. Though the theory behind these algorithms is old, they are
Figure 2.2: A growth curve where f (t) is the size of the population and t is time.The solid line represents the tangent line at the point of inflection, representedby the filled circle. The slope of the tangent line is equal to µm = 12/e. The lagtime, λ = 5/3, is the intercept of the tangent line. The dotted line represents theasymptote, A = 20.
still very useful. However, nowadays there are numerous modifications that can make them more reliable. Examples of such modifications are the quasi-Newton method, Hartley's method and the Levenberg-Marquardt method. See for instance Nocedal and Wright (2006) for a thorough discussion of numerical optimization, and Seber and Wild (2003) for a detailed description of the algorithms mentioned above and others. Next the unmodified GN algorithm will be illustrated, since this algorithm forms the basis of a number of least squares problems.
The problem is to find θ that minimizes

(y − f(X, θ))ᵀ(y − f(X, θ))  (2.6)

in (2.2). In the GN algorithm, f(X, θ) is expanded in a Taylor series around an initial value, θ^(0), called the starting value. The Taylor expansion of f(X, θ) around the starting value results in the linear model

f(X, θ) ≈ f(X, θ^(0)) + F^(0)ᵀ(θ − θ^(0)),  (2.7)

where F^(0) = d f(X, θ)/dθ |_{θ=θ^(0)} and the derivative is defined in Appendix A. Using the approximation (2.7) in (2.6), the minimization problem is converted to a linear least squares problem, namely

(r^(0) − F^(0)ᵀ(θ − θ^(0)))ᵀ(r^(0) − F^(0)ᵀ(θ − θ^(0))),  (2.8)

where r^(0) = y − f(X, θ^(0)). Now, minimizing (2.8) yields

θ − θ^(0) = (F^(0)F^(0)ᵀ)⁻¹F^(0)r^(0),

assuming that the matrix inverse exists, leading to the GN algorithm

θ^(1) = θ^(0) + δ^(0),

where δ^(0) = (F^(0)F^(0)ᵀ)⁻¹F^(0)r^(0) is referred to as the Gauss increment.
The process of updating θ is repeated until the increment, δ, is so small that there is no useful change in the elements of the parameter vector, and the process results in the final estimate, θ̂. The GN algorithm is convergent, i.e. the iterated values tend to the least squares estimate of θ as the number of iterations tends to infinity, provided that the starting value is close enough to the true θ. Moreover, there are some restrictions that need to be fulfilled in order for the algorithm to provide the least squares estimate, see Seber and Wild (2003).
In order to ensure a successful nonlinear regression analysis, one should prioritize the task of obtaining good starting values. One approach for finding starting values is to interpret the behavior of the expectation function in terms of the parameters, analytically or graphically. Other approaches are to use information available from previous, or related, experiments, or to transform the expectation function into a form that can be easily estimated. Examples of how starting values can be obtained are listed in Bates and Watts (1988).
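The unmodified GN iteration described above can be sketched in a few lines of Python. The example applies it to the Michaelis-Menten model (2.4) with simulated noise-free data; the starting value θ^(0) = (1.0, 0.3)ᵀ is a hypothetical choice:

```python
import numpy as np

def gauss_newton(f, jac, y, theta0, tol=1e-10, max_iter=100):
    # Unmodified GN: theta^(i+1) = theta^(i) + (F F^T)^{-1} F r (Gauss increment)
    theta = np.array(theta0, dtype=float)
    for _ in range(max_iter):
        r = y - f(theta)                          # current residuals
        F = jac(theta)                            # q x n derivative matrix
        delta = np.linalg.solve(F @ F.T, F @ r)   # Gauss increment delta
        theta = theta + delta
        if np.linalg.norm(delta) < tol:
            break
    return theta

# Michaelis-Menten model (2.4) with noise-free simulated data
x = np.linspace(0.05, 1.5, 20)
y = 0.9 * x / (0.2 + x)

f = lambda th: th[0] * x / (th[1] + x)
jac = lambda th: np.vstack([x / (th[1] + x),                 # df/dtheta1
                            -th[0] * x / (th[1] + x) ** 2])  # df/dtheta2
theta_hat = gauss_newton(f, jac, y, [1.0, 0.3])
```

Since the data are noise-free, the iteration recovers θ = (0.9, 0.2)ᵀ; with a poorer starting value, the unmodified algorithm may diverge, which is precisely why the modifications mentioned above exist.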
2.2.1 Geometry of nonlinear regression
To get a flavor of the challenges that arise when working with nonlinear regression models, we briefly discuss the geometry of nonlinear least squares.
Let Θ denote the subset of R^q consisting of all possible parameter values θ. Moreover, define M to be the surface in R^n that contains f(X, θ) for all θ in the parameter space, i.e.

M = {f(X, θ) : θ ∈ Θ} ⊂ R^n.
The set M is called the expectation surface. If the function f(X, θ) is linear in θ, the expectation surface is a plane, i.e. a linear surface. However, for nonlinear regression models the expectation surface is not a linear surface. The nonlinearity of the expectation surface results in challenges when analyzing nonlinear regression models, and techniques used for linear regression models must be extended, which introduces considerable complexity.
To overcome these challenges, one idea is to use a linear approximation of the expectation surface through the tangent plane. The tangent plane of the expectation surface at the point θ̂ is given by the equations

f(X, θ) = f(X, θ̂) + Fᵀ(θ̂)(θ − θ̂),  (2.9)

where

F(θ̂) = (F_1(θ̂), …, F_n(θ̂)) = d f(X, θ)/dθ |_{θ=θ̂}.
The tangent plane, defined by the equations (2.9), is the space spanned by the columns of Fᵀ(θ̂), and it is a linear, local approximation to the expectation surface in a neighborhood of θ̂. This approximation will be appropriate if f(X, θ) is reasonably flat in the region near θ̂. There exist techniques to evaluate whether f(X, θ) is reasonably flat, but the discussion of these techniques will be omitted here and we refer interested readers to e.g. Bates and Watts (1988) and Seber and Wild (2003).
In this thesis, we will utilize the linear approximation to the expectation surface via the tangent plane repeatedly. For instance, we will make use of the matrix P_F, defined to be Fᵀ(θ̂)(F(θ̂)Fᵀ(θ̂))⁻¹F(θ̂), which is the matrix that projects onto the tangent plane. The projection matrix P_F will be more thoroughly described in Chapter 3. In Chapter 5 we will demonstrate the role of P_F in influence analysis.
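The defining properties of P_F — symmetry and idempotence — are easy to illustrate numerically. In the sketch below, an arbitrary full-row-rank q×n matrix stands in for the derivative matrix F(θ̂):

```python
import numpy as np

rng = np.random.default_rng(0)
F = rng.normal(size=(2, 10))   # stand-in for F(theta_hat), q = 2, n = 10

# P_F = F^T (F F^T)^{-1} F projects onto the tangent plane
P_F = F.T @ np.linalg.solve(F @ F.T, F)
```

Being a projection onto a q-dimensional plane, P_F is symmetric, idempotent and has trace equal to q.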
2.3 Score testing in regression analysis
There are several techniques available for testing hypotheses about the parameters in a nonlinear regression model, discussed by, for instance, Gallant (1987) and Seber and Wild (2003). Hypothesis testing procedures addressed in most textbooks on nonlinear regression models comprise three classical tests: the likelihood ratio test (Neyman and Pearson, 1928), the Wald test (Wald, 1943) and Rao's score test (Rao, 1948). Various modifications of them are also discussed, such as the efficient score test considered by Hamilton (1986) and a re-scaling of the score test considered by Gallant (1987). Other examples of modifications of these tests are given in Hamilton and Wiens (1987), where corrections of the likelihood ratio test and the efficient score test are made due to the nonlinearity of the expectation surface. Moreover, Markatou and Manos (1996) discussed robust tests in nonlinear regression, and extensions of the Wald test and the score test in particular. A comparison in power between the three classical tests is done by Gallant (1987), where it is found that the likelihood ratio test has slightly better power than the other tests. However, the Wald test and the score test only require the estimates of the parameters under the alternative and the null hypothesis, respectively, and they are therefore less computationally demanding than the likelihood ratio test. The focus in this thesis is on the score test, since one of the new results obtained in this thesis is closely connected to this test.
2.3.1 The score test in linear regression
In this section the score test will be derived for linear regression models, which will be followed by the derivation of the score test for nonlinear regression models in Section 2.3.2.
Consider the linear regression model (2.1) and, without loss of generality, consider testing a single parameter in the model. For the derivation of the score test where restrictions on several parameters are specified in the null hypothesis, see Chen (1983). We let Ψ = (βᵀ, σ²)ᵀ be the parameter vector and the null hypothesis of interest is

H_0 : Ψ = Ψ_0,  (2.10)

where Ψ_0 = (β_1, …, β_{p−1}, 0, σ²)ᵀ.
The score test is based on the score function, which is the partial derivative of the log likelihood function with respect to the parameters. The likelihood function for y in (2.1) is

L(Ψ, y) = (1/√(2πσ²))^n exp{−(1/(2σ²))(y − Xβ)ᵀ(y − Xβ)},  (2.11)

the log likelihood function, ℓ = ln L(Ψ, y), is the following:

ℓ = −(n/2) ln(2πσ²) − (1/(2σ²))(y − Xβ)ᵀ(y − Xβ),  (2.12)

and the score vector is defined as

U(Ψ) = dℓ/dΨ.
Now, let Ψ̂ = (β̂_1, …, β̂_{p−1}, 0, σ̂²)ᵀ denote the maximum likelihood estimate of Ψ under the null hypothesis (2.10). The score test statistic for the hypothesis in (2.10) is given by

S(Ψ̂) = Uᵀ(Ψ̂)I⁻¹(Ψ̂)U(Ψ̂),  (2.13)
where U(Ψ̂) and I(Ψ̂) are the score vector and the Fisher information matrix, respectively, both evaluated at the parameter estimates under the null hypothesis. The Fisher information matrix is defined to be

I(Ψ̂) = E[U(Ψ)Uᵀ(Ψ)]_{Ψ=Ψ̂}

      = [ E[U(β)Uᵀ(β)]    E[U(β)U(σ²)]
          E[U(σ²)Uᵀ(β)]   E[U(σ²)U(σ²)] ]_{Ψ=Ψ̂}

      = [ E[U(β)Uᵀ(β)]    0_p
          0_pᵀ            E[U(σ²)U(σ²)] ]_{Ψ=Ψ̂},  (2.14)

since the odd (first and third) central moments of the normal distribution are zero.
If the score vector is evaluated at the estimates under the null hypothesis we get

U(Ψ̂) = [ U(β̂)
          U(σ̂²) ],

where

U(β̂) = dℓ/dβ |_{Ψ=Ψ̂} = (1/σ̂²)Xᵀ(y − Xβ̂),

and

U(σ̂²) = dℓ/dσ² |_{Ψ=Ψ̂} = −n/(2σ̂²) + (1/(2σ̂⁴))(y − Xβ̂)ᵀ(y − Xβ̂) = 0,

since σ̂² = (1/n)(y − Xβ̂)ᵀ(y − Xβ̂) is the maximum likelihood estimate of σ². Therefore

U(Ψ̂) = [ U(β̂)
          0 ].  (2.15)
Inserting (2.14) and (2.15) in (2.13) we get

S(Ψ̂) = Uᵀ(β̂)I_{ββ}⁻¹U(β̂),

where

I_{ββ} = E[U(β)Uᵀ(β)]_{Ψ=Ψ̂} = (1/σ̂²)XᵀX.

Using the results above, the score test statistic in (2.13) can be simplified, resulting in the explicit expression

S(β̂) = (1/σ̂²)(y − Xβ̂)ᵀX(XᵀX)⁻¹Xᵀ(y − Xβ̂).  (2.16)
We will now show that, under the null hypothesis (2.10), the score test statistic (2.16) asymptotically has a χ²-distribution with one degree of freedom.
Use the partition X = (X_1 ⋮ x_p), where X_1 : n×(p−1) with rank p−1 and x_p : n×1. Observe that

y − Xβ̂ = (I − P_{X_1})y,

where P_{X_1} = X_1(X_1ᵀX_1)⁻¹X_1ᵀ, and that

S(β̂) = (1/σ̂²)(y − Xβ̂)ᵀX(XᵀX)⁻¹Xᵀ(y − Xβ̂)
      = (1/σ̂²)yᵀ(I − P_{X_1})P_X(I − P_{X_1})y
      = (1/σ̂²)yᵀ(P_X − P_{X_1})y.
We will need the following property of the projection matrix P_X.
Proposition 2.3.1. The projection matrix P_X can be written as a sum of projection matrices such that

P_X = P_{X_1} + P_{x*}  (2.17)
    = P_{X_1} + (I − P_{X_1})x_p x_pᵀ(I − P_{X_1}) / (x_pᵀ(I − P_{X_1})x_p).
Proof. We want to prove that P_X = P_{X_1} + P_{x*}, where x* = (I − P_{X_1})x_p. Using the partitioned form of X = (X_1 ⋮ x_p) in the expression of P_X yields

P_X = (X_1 ⋮ x_p) [ X_1ᵀX_1   X_1ᵀx_p
                    x_pᵀX_1   x_pᵀx_p ]⁻¹ [ X_1ᵀ
                                            x_pᵀ ].  (2.18)
In the continuation of the proof we use well-known rules for the inversion of a partitioned matrix, see e.g. Chatterjee and Hadi (1988, p. 15). Let M : q×q be partitioned into the block form

M = [ A    b
      bᵀ   c ],

where A : (q−1)×(q−1) is an invertible matrix, b : (q−1)×1 and c is a scalar. Observe that when X is partitioned, the matrix XᵀX can be written in the same form as M. The inverse of M equals

M⁻¹ = [ A⁻¹ + (1/k)A⁻¹bbᵀA⁻¹   −(1/k)A⁻¹b
        −(1/k)bᵀA⁻¹             1/k ],  (2.19)

where k = c − bᵀA⁻¹b.
Applying the inversion rule for a partitioned matrix given in (2.19), the matrix (XᵀX)⁻¹ can be expressed as

[ (X_1ᵀX_1)⁻¹ + (1/k)(X_1ᵀX_1)⁻¹X_1ᵀx_p x_pᵀX_1(X_1ᵀX_1)⁻¹    −(1/k)(X_1ᵀX_1)⁻¹X_1ᵀx_p
  −(1/k)x_pᵀX_1(X_1ᵀX_1)⁻¹                                      1/k ],  (2.20)

where k = x_pᵀx_p − x_pᵀP_{X_1}x_p = x_pᵀ(I − P_{X_1})x_p.
Inserting (2.20) in (2.18) yields

P_X = P_{X_1} + k⁻¹(P_{X_1}x_p x_pᵀP_{X_1} − P_{X_1}x_p x_pᵀ − x_p x_pᵀP_{X_1} + x_p x_pᵀ)
    = P_{X_1} + k⁻¹((I − P_{X_1})x_p x_pᵀ(I − P_{X_1})).

Let us evaluate the second term on the right hand side of the expression above. Since k is a scalar, we can write

k⁻¹((I − P_{X_1})x_p x_pᵀ(I − P_{X_1})) = (I − P_{X_1})x_p (x_pᵀ(I − P_{X_1})x_p)⁻¹ x_pᵀ(I − P_{X_1}),

and since (I − P_{X_1}) is idempotent, (x_pᵀ(I − P_{X_1})x_p) can be written as (x_pᵀ(I − P_{X_1})(I − P_{X_1})x_p). Hence, we can identify

(I − P_{X_1})x_p (x_pᵀ(I − P_{X_1})(I − P_{X_1})x_p)⁻¹ x_pᵀ(I − P_{X_1})

as the projection matrix onto (I − P_{X_1})x_p, which equals x*. Hence,

P_X = P_{X_1} + P_{x*},

and the proof is complete. □
Using (2.17), the score test statistic can be written

S(β̂) = (1/σ̂²)yᵀ(P_X − P_{X_1})y = (1/σ̂²)yᵀP_{x*}y.  (2.21)
Let us look at the distribution of yᵀP_{x*}y. Under the null hypothesis we have that

β̂ = (β̂_1ᵀ, 0)ᵀ,

where β̂_1 = (β̂_1, …, β̂_{p−1})ᵀ. The expected value of P_{x*}y is

E[P_{x*}y] = P_{x*}E[y] = P_{x*}X_1β_1 = (I − P_{X_1})x_p x_pᵀ(I − P_{X_1})X_1β_1 / (x_pᵀ(I − P_{X_1})x_p) = 0,

since (I − P_{X_1})X_1β_1 = 0.

Therefore, since P_{x*} is idempotent and rank(x*) = 1,

(1/σ²)yᵀP_{x*}y ∼ χ²(1).
Since σ̂² converges to σ², according to Cramér's theorem (see Gut, 1995),

(1/σ̂²)yᵀP_{x*}y → χ²(1) in distribution.

Indeed, one may note that (2.21) is a monotone function of an exactly F-distributed statistic.
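The algebra above is easy to check numerically. The sketch below simulates data under H_0 and computes S(β̂) both from the explicit expression (2.16) and from the projection form (2.21); the two must coincide:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
X1, xp = X[:, :p - 1], X[:, p - 1]        # partition X = (X1 : x_p)
y = X1 @ np.array([1.0, 0.5]) + rng.normal(size=n)   # data under H0: beta_p = 0

# restricted MLE under H0 and the residuals under the null
b1 = np.linalg.solve(X1.T @ X1, X1.T @ y)
r = y - X1 @ b1
sigma2 = r @ r / n                         # MLE of sigma^2 under H0

# explicit expression (2.16)
S = (r @ X @ np.linalg.solve(X.T @ X, X.T @ r)) / sigma2

# projection form (2.21): S = y^T P_{x*} y / sigma2 with x* = (I - P_{X1}) x_p
P1 = X1 @ np.linalg.solve(X1.T @ X1, X1.T)
xs = xp - P1 @ xp
S_proj = (y @ xs) ** 2 / (xs @ xs) / sigma2
```

For this simulated data set the two computations agree to machine precision, and over repeated simulations under H_0 the statistic behaves like a χ²(1) variable.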
2.3.2 The score test in nonlinear regression
We will now derive the explicit expression of the score test statistic when testing a hypothesis about a parameter in a nonlinear regression model. Consider the nonlinear model (2.2) and let Ψ = (θᵀ, σ²)ᵀ, with θ = (θ_1, …, θ_q)ᵀ, be the parameter vector. Without loss of generality, consider the following hypothesis about a single parameter:

H_0 : Ψ = Ψ_0,  (2.22)

where Ψ_0 = (θ_1, …, θ_{q−1}, 0, σ²)ᵀ. Let Ψ̂ = (θ̂ᵀ, σ̂²)ᵀ be the maximum likelihood estimate of Ψ under the null hypothesis (2.22).
The score test statistic for testing (2.22) is equal to the expression in (2.13), i.e.

S(Ψ̂) = Uᵀ(Ψ̂)I⁻¹(Ψ̂)U(Ψ̂).
The likelihood function and the log likelihood function for the nonlinear regression model are equal to (2.11) and (2.12), respectively, with the function Xβ replaced by f(X, θ). Therefore, the score vector for θ is equal to

U(θ̂) = dℓ/dθ |_{Ψ=Ψ̂} = (1/σ̂²)F(θ̂)r,

where

r = (r_k) = y − f(X, θ̂)  (2.23)

are the residuals under the null hypothesis, and F(θ̂) : q×n is the matrix such that

F(θ̂) = (F_1(θ̂), …, F_n(θ̂)) = d f(X, θ)/dθ |_{θ=θ̂}.  (2.24)
As in the linear regression case, U(σ̂²) = 0, since

σ̂² = (1/n)(y − f(X, θ̂))ᵀ(y − f(X, θ̂))

is the maximum likelihood estimate of σ². Hence, we have that

U(Ψ̂) = [ U(θ̂)
          0 ] = [ (1/σ̂²)F(θ̂)r
                  0 ].  (2.25)
The information matrix, I(Ψ̂), is given by

I(Ψ̂) = [ E[U(θ)Uᵀ(θ)]   0_q
          0_qᵀ            E[U(σ²)Uᵀ(σ²)] ]_{Ψ=Ψ̂},  (2.26)

for the same reason as in the linear regression case. If (2.25) and (2.26) are inserted in the expression for S(Ψ̂) given in (2.13), we get

S(Ψ̂) = Uᵀ(θ̂)I_{θθ}⁻¹U(θ̂),

where

I_{θθ} = E[U(θ)Uᵀ(θ)]_{Ψ=Ψ̂}
       = ((1/σ⁴)F(θ)E[(y − f(X, θ))(y − f(X, θ))ᵀ]Fᵀ(θ))_{Ψ=Ψ̂}
       = (1/σ̂²)F(θ̂)Fᵀ(θ̂).
Using the results above we find that the score test statistic for testing the hypothesis (2.22) can be written

S(θ̂) = (1/σ̂²)rᵀFᵀ(θ̂)(F(θ̂)Fᵀ(θ̂))⁻¹F(θ̂)r,  (2.27)

where r and F(θ̂) are defined in (2.23) and (2.24), respectively.

Under the null hypothesis (2.22), the score test statistic in (2.27) asymptotically has a χ² distribution with 1 degree of freedom, see Seber and Wild (2003).
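As an illustrative sketch (the model and data are hypothetical, not from the thesis), consider the exponential model y_i = θ_1 exp(θ_2 x_i) + ε_i and the hypothesis H_0: θ_2 = 0, under which the restricted MLE is θ̂ = (ȳ, 0)ᵀ. The statistic (2.27) is then computed directly from r and F(θ̂):

```python
import numpy as np

def score_stat(r, F, sigma2):
    # S = r^T F^T (F F^T)^{-1} F r / sigma2, eq. (2.27)
    Fr = F @ r
    return float(Fr @ np.linalg.solve(F @ F.T, Fr)) / sigma2

rng = np.random.default_rng(3)
n = 100
x = np.linspace(0.0, 1.0, n)
y = 2.0 + rng.normal(size=n)       # data generated under H0: theta2 = 0

# restricted MLE under H0: theta1 = mean(y), theta2 = 0
theta1_hat = y.mean()
r = y - theta1_hat                 # residuals under the null, eq. (2.23)

# F(theta_hat): rows are df/dtheta1 = exp(theta2 x) = 1
# and df/dtheta2 = theta1 x exp(theta2 x) = theta1_hat * x
F = np.vstack([np.ones(n), theta1_hat * x])
sigma2 = r @ r / n                 # MLE of sigma^2 under H0
S = score_stat(r, F, sigma2)
```

In this particular model the statistic reduces to n times the squared sample correlation between x and y, which provides a convenient check of the computation.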
3. Influence analysis in regression
It is well understood that not all observations in a data set play an equal role for inference about a statistical model. For instance, in regression analysis the character of the regression line may be determined by only a few observations while most of the data is somewhat ignored. Such observations, that substantially influence the results of the inference and/or the data analysis, are called influential observations. The study of the effect they have on the inference is called influence analysis or sensitivity analysis.
Detection of influential observations is an important part of the statistical paradigm. Examination of the data, and the ability to find influential observations, can be beneficial in several ways. It can
• help reveal spurious observations that might be a result of errors during the collection or the processing of the data. Examples of such errors are measurement errors and keypunching errors.
• make the researcher aware of the possibility that some part of the data might come from another regime, or subpopulation, that has very different features compared to the population of study.
• give a hint of what properties data should have, if additional data collection is relevant, for instance in order to produce a more stable model and insensitive estimates.
• give the researcher an increased confidence in the results and intimate knowledge about the data.
Influence analysis in regression has been a very active area of research. Nowadays, there are many strategies available for detecting influential observations, for instance graphical displays. A popular graphical tool for detection of observations with a substantial influence on the parameter estimates in linear regression models is the added variable plot, proposed by Mosteller and Tukey (1977). In Cook (1987), a similar plot is proposed for use in nonlinear regression. A detailed discussion of diagnostic plots is given in Chapter 4.
Other approaches to identifying influential observations use influence measures that enable a quantification of the observations' influence on various aspects of the regression analysis. The most well-known influence measure is Cook's distance, proposed by Cook (1977) and widely used in linear regression. This influence measure is used for assessing the influence of the individual observations on the vector of estimated regression parameters. The explicit expression of Cook's distance for model (2.1) is given by

C_k = (β̂ − β̂_(k))ᵀXᵀX(β̂ − β̂_(k)) / (pσ̂²),  k = 1, …, n,  (3.1)

where p is the number of explanatory variables in the model and β̂_(k) is the estimate of β when the kth observation is excluded from the calculations.
A version of Cook’s distance for assessing the influence of the observationson the vector of estimated parameters in the nonlinear regression model (2.2)is proposed by Cook and Weisberg (1982). The explicit expression of thismeasure is given by(
θθθ − θθθ (k)
)TFFF(θθθ)FFFT (θθθ)
(θθθ − θθθ (k)
)qσ2 , k = 1, . . . ,n, (3.2)
where q is the number of parameters in the model, θθθ (k) is the estimate of θθθ
when the kth observation is excluded from the calculations, and FFF(θθθ) : q×n isa matrix of derivatives such that
FFF(θθθ) =(
FFF1(θθθ), . . . ,FFFn(θθθ))=
ddθθθ
fff (XXX ,θθθ)
∣∣∣∣θθθ=θθθ
, (3.3)
where the derivative is defined in Appendix A.
Some other influence measures available for use in linear and nonlinear regression analysis will be further discussed in Chapter 5. Moreover, in Chapter 5, the new influence measures proposed in this thesis will be derived in detail.
The idea behind the construction of the new influence measures proposed in this thesis is to perform small perturbations of the model formulation. When the perturbations are imposed, the resulting model is referred to as the perturbed model. The perturbed model used in this thesis is defined to be

y_ω = f(X, θ) + ε_ω,  (3.4)

where ε_ω ∼ N_n(0, σ²W⁻¹(ω)) and W(ω) : n×n is a diagonal weight matrix. The expectation function f(X, θ) can be both linear and nonlinear in its parameters.

Using the diagonal weight matrix W(ω), we perturb the error variance in the regression model. We can choose to perturb the error variance of the kth observation by introducing the weight 0 < ω_k ≤ 1 as the kth diagonal element of W, the other diagonal elements being equal to one. We can also choose to perturb the error variance for multiple observations simultaneously. If we perturb the error variance for all n observations in the data set, we let the diagonal elements of W be equal to the vector ω = (ω_1, …, ω_n)ᵀ, where 0 < ω_k ≤ 1, k = 1, …, n.
The perturbed model (3.4) is used for e.g. estimation of the parameters. The estimates of the parameters are functions of the imposed perturbation weight, denoted θ̂(ω). The idea is to study the rate of change in the estimates as the weight approaches one. This approach to influence analysis is called the differentiation approach, see e.g. Chatterjee and Hadi (1988), and will be discussed in detail in Chapter 5. Moreover, we assume that there exists some null perturbation weight ω_0, so that θ̂(ω_0) = θ̂ is the estimate from the unperturbed model, i.e. the nonlinear regression model (2.2). In our case ω_0 = 1.
The structure of the imposed perturbations is called a perturbation scheme. There exist other perturbation schemes than the one described above, where the error variance is perturbed.
We will now describe the case-weighted perturbation scheme. Let 0 ≤ ω_k ≤ 1 and let W(ω_k) = diag(1, …, 1, ω_k, 1, …, 1). For the linear regression model (2.1), define the perturbed model as

y_ω = X_ωβ + ε,

where y_ωᵀ = yᵀW(ω_k) and X_ωᵀ = XᵀW(ω_k). The estimator of β in the perturbed model is a function of ω_k and is equal to

β̂(ω_k) = (X_ωᵀX_ω)⁻¹X_ωᵀy_ω.
If ω_k = 1 then β̂(ω_k) = β̂, the estimator of β in the unperturbed linear regression model (2.1). Moreover, if ω_k = 0, then β̂(ω_k) = β̂_(k), the estimator of β in the unperturbed linear regression model (2.1) when the kth observation is excluded from the calculations. Using the case-weighted perturbation scheme with ω_k = 0 is also referred to as the case-deletion approach. Examples of diagnostic measures constructed using case-deletion are Cook's distance defined in (3.1) and the nonlinear version defined in (3.2). The approach of using case-deletion is also referred to as global influence.
According to Ross (1987), rather than using the case-weighted perturbation scheme, an alternative way of studying case-deletion in nonlinear regression is to define the case-deletion model

y = f(X, θ) + d_iγ + ε,

where d_i is the ith column of the identity matrix of size n and γ is an unknown parameter. Adding d_iγ to the model deletes the ith observation when the model is fitted. This approach can also be used for linear regression models, see e.g. Chatterjee and Hadi (1988).
Besides case-deletion, another approach widely used in influence analysis is the local influence approach. This approach was proposed by Cook (1986) and has had, and still has, a great impact on the research area of influence analysis. In the local influence approach the weights are not restricted to be zero, as in the case-deletion approach. Rather, they can vary between zero and one. This approach relies on a well-behaved likelihood, since a central concept is to use the likelihood displacement, LD, in an influence graph.
The LD measures the amount by which the maximum likelihood estimates, MLEs, of the parameters from the perturbed model are displaced from the MLEs of the parameters from the unperturbed model. The LD for the perturbed model (3.4) using a single perturbation weight, ω_k, is defined as

LD(ω_k) = 2(ln L(θ̂) − ln L(θ̂(ω_k))),

where ln L(θ̂) and ln L(θ̂(ω_k)) are the log-likelihood functions for the unperturbed model and the perturbed model, respectively.
An influence graph is the graph of a statistic, which is a function of the perturbation weight, versus the perturbation weight. As an example, a graph of LD(ω_k) versus 0 < ω_k ≤ 1 is an influence graph in R². If we decide to use perturbation weights for all n observations, the resulting influence graph is a graph in R^{n+1}.
A central influence measure in the local influence approach is the curvature C of the influence graph in a neighborhood of the null perturbation weight, ω_0 = 1. If n observations are perturbed, another central diagnostic is a vector ℓ in R^n that describes the direction of perturbation. Let L(θ) be the likelihood of an unperturbed model, not necessarily the nonlinear regression model (2.2). Let θ : q×1 be a vector of unknown parameters and ω : n×1 the perturbation weights introduced into the unperturbed model, for instance the perturbations of the error variance as described above. Now, denote the curvature of the influence graph at the null perturbation in the direction ℓ by C_ℓ. Then

C_ℓ = 2|ℓᵀΔᵀL⁻¹Δℓ|,

where

L = d/dθ (d ln L(θ)/dθ) |_{θ=θ̂},   Δ = d/dθ (d ln L(θ(ω))/dω) |_{θ=θ̂, ω=ω_0},

and where the matrix derivative is defined in Appendix A.
Cook (1986) suggested the use of $\ell_{max}$, the direction causing the maximum curvature, $C_{max}$, as an influence measure. The vector $\ell_{max}$ indicates how to perturb the model in order to obtain the greatest local change in the likelihood displacement. For instance, assume perturbing all observations in the data set simultaneously and suppose that the $k$th element of $\ell_{max}$ is found to be relatively large. This indicates that perturbations in the weight $\omega_k$ of the $k$th observation may lead to substantial changes in the results of the analysis and that the $k$th observation is relatively influential.
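To make the role of $\ell_{max}$ concrete, the sketch below computes it as the eigenvector belonging to the eigenvalue largest in magnitude of the matrix $\Delta^T \ddot{L}^{-1} \Delta$ of the quadratic form $C_\ell$. The matrices `Delta` and `Lddot` here are random stand-ins with the right structure ($\ddot{L}$ negative definite, as a Hessian at a maximum), used purely for illustration.

```python
import numpy as np

def lmax_direction(Delta, Lddot):
    """Return l_max, the unit direction extremizing |C_l| with
    C_l = 2 l^T Delta^T Lddot^{-1} Delta l, and that extreme C_l."""
    B = Delta.T @ np.linalg.solve(Lddot, Delta)   # n x n matrix of the quadratic form
    B = (B + B.T) / 2                             # symmetrize against round-off
    vals, vecs = np.linalg.eigh(B)
    i = np.argmax(np.abs(vals))                   # eigenvalue largest in magnitude
    return vecs[:, i], 2 * vals[i]

# toy inputs: q = 3 parameters, n = 8 perturbed observations (illustration only)
rng = np.random.default_rng(0)
Delta = rng.normal(size=(3, 8))
A = rng.normal(size=(3, 3))
Lddot = -(A @ A.T + 3 * np.eye(3))                # negative definite, like a Hessian
l_max, C_max = lmax_direction(Delta, Lddot)
```

Relatively large components of `l_max` then point to the cases whose perturbation changes the likelihood displacement the most.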
The approach of local influence has several benefits. It is appealing as it allows for measuring the influence of a single observation as well as the assessment of the influence of multiple observations, which was a new idea at the time the article was written. It is further discussed in Cook (1986) how to assess the influence of the observations on subsets of parameters in the linear regression model (2.1). Moreover, the local influence approach is not restricted to linear regression models: it can be used for a variety of problems. In Cook (1986) the local influence approach is discussed, not only for linear regression models, but also for generalized linear models.
The local influence approach is extended to nonlinear regression models by St. Laurent and Cook (1993). The interest is in assessing the influence of the
observations on the fitted values and the estimate of the error variance when all $n$ observations in the data set are perturbed. They also discussed the possibility of assessing the influence of a single observation on the fitted values and the estimate of the error variance.
Influential observations are closely connected to high-leverage observations and outliers. A deeper understanding of the diagnostic measures used to detect influential observations can be achieved when they are analyzed in terms of high-leverage observations and outliers. Therefore, we will devote a few paragraphs to defining and discussing high-leverage observations and outliers.
According to Hoaglin and Welsch (1978), a high-leverage observation in linear regression analysis is an outlying observation in the $X$-space. The $k$th diagonal element of the projection matrix, defined as
$$P_X = X(X^TX)^{-1}X^T, \qquad (3.5)$$
is a measure of leverage for the $k$th observation, and it is denoted
$$p_{kk} = x_k^T(X^TX)^{-1}x_k.$$
The diagonal elements of the matrix $P_X$ are called leverages since they can be thought of as the amount of leverage of the response value on the corresponding predicted value, i.e.
$$P_X y = \hat{y} = X\hat{\beta}. \qquad (3.6)$$
The matrix $P_X$ is also known as the hat matrix or the prediction matrix. This is due to the fact that when $P_X$ is post-multiplied by $y$ it "puts a hat" on $y$ and creates the predictions, as seen from (3.6).
It is worth noting that the values of the diagonal elements of $P_X$ are between zero and one, i.e. $0 \le p_{kk} \le 1$. Moreover, $\mathrm{rank}(P_X) = \mathrm{trace}(P_X) = p$, where $p$ is the number of columns of $X$. As a consequence, the average of the diagonal elements of $P_X$ is $p/n$. Experience suggests that a reasonable rule of thumb for large values of $p_{kk}$ is $p_{kk} > 2p/n$. To read more about the projection matrix in linear regression, see for instance Hoaglin and Welsch (1978).
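As a small numerical illustration (synthetic data, hypothetical names), the leverages can be read off a QR decomposition of $X$, and the $2p/n$ rule of thumb applied to flag high-leverage cases:

```python
import numpy as np

def leverages(X):
    """Diagonal of the projection (hat) matrix P_X = X (X^T X)^{-1} X^T,
    computed stably as the squared row norms of Q from X = QR."""
    Q, _ = np.linalg.qr(X)
    return np.sum(Q**2, axis=1)

rng = np.random.default_rng(1)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
X[0, 1] = 8.0                             # push case 0 far out in the X-space

pkk = leverages(X)
high = np.flatnonzero(pkk > 2 * p / n)    # rule of thumb: p_kk > 2p/n
```

Here case 0 is flagged because its covariate value lies far from the bulk of the data, regardless of its response value.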
Deriving measures of leverage for observations in a nonlinear regression model is more complex than in the linear regression case, since the expectation surface is not a linear subspace. St. Laurent and Cook (1992) define two types
of leverages. One type of leverage is the tangent plane leverage. Consider the matrix of derivatives $\hat{F} = F(\hat{\theta})$ defined in (3.3). When $\hat{F}$ is of full column rank we make use of the matrix that projects onto the tangent plane, i.e. $P_{\hat{F}} = \hat{F}^T(\hat{F}\hat{F}^T)^{-1}\hat{F}$. A measure of the tangent plane leverage for the $k$th observation is the $k$th diagonal element of $P_{\hat{F}}$, given by
$$\hat{p}_{kk} = \hat{F}_k^T(\hat{F}\hat{F}^T)^{-1}\hat{F}_k, \qquad (3.7)$$
where $\hat{F}_k^T$ is the $k$th row of $\hat{F}^T$. The matrix $P_{\hat{F}}$ is referred to as the tangent plane leverage matrix (St. Laurent and Cook, 1992).
Another measure of leverage in nonlinear regression models is referred to by St. Laurent and Cook (1992) as the Jacobian leverage. The Jacobian leverage matrix is given by
$$\hat{J} = \hat{F}^T\left(\hat{F}\hat{F}^T - \hat{G}(\hat{r}\otimes I_q)\right)^{-1}\hat{F},$$
where $\hat{r} = (\hat{r}_k) = y - f(X,\hat{\theta})$ is the $n$-vector of residuals, $\hat{F}$ is defined in (3.3) and $\hat{G} = G(\hat{\theta})$ is a $q\times nq$ matrix of derivatives such that
$$G(\hat{\theta}) = \frac{d}{d\theta}\left(\frac{df(X,\theta)}{d\theta}\right)\Bigg|_{\theta=\hat{\theta}} = \frac{dF(\theta)}{d\theta}\Bigg|_{\theta=\hat{\theta}}.$$
St. Laurent and Cook (1992) argued that $P_{\hat{F}}$ and $\hat{J}$ can differ dramatically if the tangent plane is not an adequate approximation of the nonlinear expectation surface (near $\hat{\theta}$). By inspection of $P_{\hat{F}}$ and $\hat{J}$, we see that they will be different if the components of $\hat{G}(\hat{r}\otimes I_q)$ differ dramatically from zero. However, St. Laurent and Cook (1992) suggested using $P_{\hat{F}}$ whenever possible, since the computations are easier to conduct and the interpretation of the results is similar to linear regression. For example, the diagonal elements of $P_{\hat{F}}$ have the following properties: $0 \le \hat{p}_{kk} \le 1$ and $\sum_{k=1}^{n}\hat{p}_{kk} = q$, where $q$ is the number of columns of $\hat{F}^T$. These properties generally do not hold for the diagonal elements of the Jacobian leverage matrix, see St. Laurent and Cook (1992). In accordance with the properties of the projection matrix $P_X$ for the linear regression model, we can in this thesis define a large value of the tangent plane leverage as $\hat{p}_{kk} > 2q/n$.
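The tangent plane leverages of (3.7) are straightforward to compute once $\hat{F}$ is available. The sketch below does so for a Michaelis-Menten mean function; the design points and the parameter values standing in for the MLEs are illustrative choices only.

```python
import numpy as np

# Michaelis-Menten mean function f(x, theta) = theta1 * x / (theta2 + x)
x = np.array([0.02, 0.06, 0.11, 0.22, 0.56, 1.10])
theta1, theta2 = 200.0, 0.06         # stand-ins for the MLEs (illustration only)

# F(theta-hat): q x n matrix of derivatives d f / d theta, one column per case
F = np.vstack([
    x / (theta2 + x),                  # d f / d theta1
    -theta1 * x / (theta2 + x) ** 2,   # d f / d theta2
])
q, n = F.shape

# tangent plane leverages: diagonal of P_F = F^T (F F^T)^{-1} F
PF = F.T @ np.linalg.solve(F @ F.T, F)
pkk = np.diag(PF)
high = np.flatnonzero(pkk > 2 * q / n)     # the 2q/n rule of thumb
```

The diagonal sums to $q$ and each element lies in $[0, 1]$, in line with the properties stated above.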
We often define an outlier to be an observation for which the residual is large in magnitude compared to the other observations. Outliers can be detected by analyzing the ordinary residuals $\hat{r} = (\hat{r}_k) = y - f(X,\hat{\theta})$, $k = 1,\dots,n$, where the function $f(X,\theta)$ can be either linear or nonlinear in its parameters.
It is important to note the following: Outliers do not need to be influential observations, and influential observations do not need to be outliers. For examples see Andrews and Pregibon (1978) and Chatterjee and Hadi (1986). The
same applies for high-leverage observations, which do not need to be influential observations. However, plots of residuals and the diagonal elements of the projection matrix provide a good basis for influence measures. They can provide more understanding concerning why an observation is influential. To illustrate the relationship between an influence measure and high-leverage observations and outliers, rewrite Cook's distance, defined in (3.1), as follows:
$$C_k = \frac{1}{p}\,\frac{p_{kk}}{1-p_{kk}}\,r_{k,stud}^2,$$
where $r_{k,stud} = \hat{r}_k/(\hat{\sigma}\sqrt{1-p_{kk}})$ is the $k$th studentized residual and $\hat{\sigma}^2 = (y-X\hat{\beta})^T(y-X\hat{\beta})/(n-p)$, see Chatterjee and Hadi (1988). From the equation above, we see that a high-leverage observation and/or an outlier will increase the value of Cook's distance through $p_{kk}$ and $r_{k,stud}$, respectively.
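The decomposition above can be checked numerically: the leverage/residual form of Cook's distance agrees exactly with the case-deletion form $(\hat{\beta}-\hat{\beta}_{(k)})^T X^TX(\hat{\beta}-\hat{\beta}_{(k)})/(p\hat{\sigma}^2)$. The data and names below are a synthetic sketch.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 25, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=n)

beta = np.linalg.lstsq(X, y, rcond=None)[0]
r = y - X @ beta
s2 = r @ r / (n - p)                              # sigma-hat^2
pkk = np.sum(np.linalg.qr(X)[0] ** 2, axis=1)     # hat-matrix diagonal
r_stud = r / np.sqrt(s2 * (1 - pkk))              # studentized residuals

cooks = (pkk / (1 - pkk)) * r_stud**2 / p         # leverage/residual form

# cross-check one case against the deletion form of Cook's distance
k = 3
Xk, yk = np.delete(X, k, axis=0), np.delete(y, k)
beta_k = np.linalg.lstsq(Xk, yk, rcond=None)[0]
d = beta - beta_k
cook_del = d @ (X.T @ X) @ d / (p * s2)
```

The two expressions are algebraically identical, so the cross-check holds to floating-point precision.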
What one should do when having identified an influential observation is a question with no clear answer. However, Cook and Weisberg (1982) provide some advice. If unusual and influential observations are a consequence of, for instance, a mistake in the data collecting process, these points could simply be removed or, if possible, corrected. Collecting more data or reporting the results of separate analyses, with and without the observations in question, are two additional possibilities. Moreover, if predictions are important, it may be possible to partially circumvent the effects of the influential observations by isolating stable regions, or regions where the influence is minimal or unimportant. The emphasis of this thesis is, however, on identifying influential observations rather than how to address them once they are found.
4. Graphical displays
Graphical displays are widely used as diagnostic methods in linear regression and have a long history. We can go back almost a century and find methods that are still in use today: for instance, the partial residual plot proposed by Ezekiel (1924). However, a significant amount of work has been done to improve existing graphical diagnostic tools and to develop new ones. Among the achievements in the area of regression graphics worth mentioning is the evolution in the 1970s from the use of ordinary residuals in various scatter plots towards the use of different standardized residuals, see e.g. Behnken and Draper (1972) and Andrews and Pregibon (1978). Mosteller and Tukey (1977) proposed the added variable plot, a diagnostic tool that can be used in multiple linear regression. This plot will be described more thoroughly in the next section. For more references concerning regression graphics, see Cook (1998).
Research on graphical tools in nonlinear regression has not been as extensive as in the linear regression case. New thoughts and innovative ideas were introduced into the area by Cook (1987), where a plot similar to the added variable plot was proposed for use in nonlinear regression. We will provide a deeper discussion of Cook's results on the plot in Section 4.2.
This chapter will provide an overview of existing graphical methods together with new results obtained in this thesis, which contribute to influence analysis in nonlinear regression. The chapter is divided into two parts. Section 4.1 gives an overview of existing graphical methods for the linear regression model, and the added variable plot is described in detail. Section 4.2 is devoted to graphical displays in nonlinear regression. In this section we will describe the added parameter plot, which is one of the main results obtained in this thesis. In Section 4.2.2, the construction and interpretation of the added parameter plot will be illustrated with a numerical example.
4.1 Graphical displays in linear regression
It is well known that various scatter plots of residuals are fundamental for valid interpretation of the results obtained from regression analysis. Residual plots are of utmost importance for validating the model assumptions. We can, for instance, construct a quantile-quantile plot of the standardized residuals to check the assumption of normality, and we can plot the residuals versus the fitted values to examine whether the variance of the error terms seems to be homoscedastic.
Diagnostic plots involving the residuals, such as the added variable plot, the partial residuals plot and the augmented partial residuals plot, can be used to assess the effect of an additional explanatory variable in the regression model. Moreover, the added variable plot can also be used for finding observations with high influence on the parameter estimates. These plots, and several others, are described by Chatterjee and Hadi (1988), where the partial residuals plot is referred to as the components-plus-residuals plot. In the next section the added variable plot will be described in detail.
4.1.1 The added variable plot
The added variable plot, AVP, is a diagnostic tool that is used in multiple linear regression. It displays the effect of including an extra explanatory variable in the regression, when the other explanatory variables are already taken into account. The plot is helpful for detecting influential observations (see e.g. Belsley et al., 1980) and is referred to as the partial-regression leverage plot by Belsley et al. (1980).
For the linear regression model (2.1), the AVP for the explanatory variable $X_p$ is constructed in two steps. Firstly, partition $X: n\times p = (X_1 \,\vdots\, x_p)$ and $\beta^T = (\beta_1^T \,\vdots\, \beta_p)$. The matrix $X_1$ is given by $X_1 = (x_1, \dots, x_{p-1})$ and the vector $\beta_1$ is given by $\beta_1 = (\beta_1, \dots, \beta_{p-1})^T$.
Using the partitioned $X$ and $\beta$, model (2.1) can be expressed as
$$y = X_1\beta_1 + x_p\beta_p + \varepsilon, \qquad (4.1)$$
where $y$ and $\varepsilon$ are vectors of the response variable and the error term, respectively. It is assumed that $\varepsilon \sim N(0_n, \sigma^2 I_n)$.
Secondly, define the projection matrix $P_{X_1} = X_1(X_1^TX_1)^{-1}X_1^T$. The AVP for the explanatory variable $X_p$ is defined to be the scatter plot of
$$\tilde{y} = (I - P_{X_1})y \qquad (4.2)$$
against
$$\tilde{x} = (I - P_{X_1})x_p, \qquad (4.3)$$
along with their estimated regression line.
As a motivation for using the residuals $\tilde{y}$ and $\tilde{x}$ in the AVP (see e.g. Chatterjee and Hadi, 1986), let us pre-multiply (4.1) by $(I - P_{X_1})$ and take the expectation. We obtain
$$E\left((I-P_{X_1})y\right) = (I-P_{X_1})X_1\beta_1 + (I-P_{X_1})x_p\beta_p. \qquad (4.4)$$
Observe that $(I-P_{X_1})X_1\beta_1 = 0_n$, so that (4.4) becomes
$$E(\tilde{y}) = \tilde{x}\alpha, \qquad (4.5)$$
where $\tilde{y} = (I-P_{X_1})y$ and $\tilde{x} = (I-P_{X_1})x_p$ are both $n$-vectors and $\alpha = \beta_p$. This suggests that the residuals $\tilde{y}$ and $\tilde{x}$ in the AVP display the effect of introducing the variable $X_p$ to the regression model $y = X_1\beta_1 + \varepsilon$. Furthermore, a linear trend in the AVP indicates that the explanatory variable $X_p$ should be included in the model.
Belsley et al. (1980) refer to the same plot as the partial-regression leverage plot for the parameter estimate $\hat{\beta}_p$, and not the explanatory variable $X_p$. This is due to the fact that the slope of the regression line in the plot, i.e. $\alpha$ in model (4.5), is equal to the estimate of $\beta_p$ in (4.1). From this point of view, we can say that a strong linear trend in the plot indicates that the parameter $\beta_p$ might significantly differ from zero.
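The slope property can be verified numerically. The simulated data below are purely illustrative; the equality between the AVP slope and $\hat{\beta}_p$ from the full regression is exact (a Frisch-Waugh-type identity), so the check holds to floating-point precision.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])   # first p-1 columns
xp = rng.normal(size=n)                                  # the added variable x_p
y = X1 @ np.array([1.0, -1.0]) + 0.5 * xp + rng.normal(scale=0.3, size=n)
X = np.column_stack([X1, xp])

# AVP residuals of (4.2) and (4.3)
P1 = X1 @ np.linalg.solve(X1.T @ X1, X1.T)
y_t = y - P1 @ y          # y-tilde
x_t = xp - P1 @ xp        # x-tilde

alpha_hat = (x_t @ y_t) / (x_t @ x_t)            # slope of the AVP line
beta_full = np.linalg.lstsq(X, y, rcond=None)[0] # full regression of y on X
```

Plotting `y_t` against `x_t` and drawing the line with slope `alpha_hat` reproduces the AVP described above.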
The AVP is also used to detect observations that are influential on the parameter estimate in question, and Example 4.1 illustrates the utilization of the AVP for this purpose.
Example 4.1. Added variable plot in linear regression
An example of an AVP is given in Figure 4.1, where we use data from Stanley and Miller (1979). A detailed analysis of the data is provided by Cook
[Scatter plot omitted: $\tilde{y}$ versus $\tilde{x}$ with observations labeled 1-22.]
Figure 4.1: Added variable plot for explanatory variable "RGF" using the data presented in Cook and Weisberg (1982) on jet fighters.
and Weisberg (1982). By inspection of the observations in the plot, observation numbers 16 and 22 attract our attention since they are separated from the rest. The solid line in Figure 4.1 represents the estimated regression line when all observations are included in the analysis. The dashed line in Figure 4.1 represents the estimated regression line when observations 16 and 22 are deleted. We can clearly see that the presence of these observations pulls the regression line upwards. Thus, these observations might be influential and further analysis is needed.
Belsley et al. (1980) present different features of the plot. Firstly, as already mentioned above, the slope of the regression line through the origin in the AVP is equal to $\hat{\beta}_p$ in the regression of $y$ on $X$. Secondly, the residuals that result from regressing $\tilde{y}$ on $\tilde{x}$ are equal to the residuals from the regression of $y$ on $X$. Thirdly, the correlation between $\tilde{y}$ and $\tilde{x}$ is equal to the partial correlation between $y$ and $x_p$ in the multiple regression of $y$ on $X$.
In the remainder of this section, we want to highlight another interesting feature of the AVP. We will demonstrate that the AVP is connected to the score test statistic for testing $H_0: \beta_p = 0$ in model (2.1).
The score test statistic was defined in (2.16) and is given by
$$S(\tilde{\beta}) = \frac{1}{\tilde{\sigma}^2}\left(y - X\tilde{\beta}\right)^T X(X^TX)^{-1}X^T\left(y - X\tilde{\beta}\right), \qquad (4.6)$$
where $\tilde{\beta} = (\tilde{\beta}_1, \dots, \tilde{\beta}_{p-1}, 0)^T$ and $\tilde{\sigma}^2$ are the maximum likelihood estimates of $\beta$ and $\sigma^2$ under the null hypothesis, i.e. $\tilde{\sigma}^2 = \frac{1}{n}(y - X\tilde{\beta})^T(y - X\tilde{\beta})$.
Now, let SSR denote the sum of squares due to regression in the regression of $\tilde{y}$ on $\tilde{x}$. In Proposition 4.1.1 we will show in detail that SSR is proportional to the numerator in the score test statistic.

Proposition 4.1.1. The score test statistic, given in (4.6), for testing $H_0: \beta_p = 0$ in model (2.1) is proportional to SSR in the regression of $\tilde{y}$ on $\tilde{x}$, defined in (4.2) and (4.3), respectively.
Proof. Observe that SSR in the regression of $\tilde{y}$ on $\tilde{x}$ can be written $\tilde{y}^TP_{\tilde{x}}\tilde{y}$, where $P_{\tilde{x}} = \tilde{x}(\tilde{x}^T\tilde{x})^{-1}\tilde{x}^T$. Using the definition of the residuals in (4.2) we get
$$\tilde{y}^TP_{\tilde{x}}\tilde{y} = y^T(I-P_{X_1})P_{\tilde{x}}(I-P_{X_1})y. \qquad (4.7)$$
Since $X$ is a partitioned matrix, the projection matrix $P_X$ can be decomposed into a sum of projection matrices, see the proof of Proposition 2.3.1. In fact,
$$P_X = P_{X_1} + P_{\tilde{x}}. \qquad (4.8)$$
Inserting (4.8) in (4.7) results in
$$\begin{aligned}
\tilde{y}^TP_{\tilde{x}}\tilde{y} &= y^T(I-P_{X_1})(P_X - P_{X_1})(I-P_{X_1})y \\
&= y^T(I-P_{X_1})(P_X - P_{X_1} - P_XP_{X_1} + P_{X_1}P_{X_1})y \\
&= y^T(I-P_{X_1})(P_X - P_{X_1})y \\
&= y^T(I-P_{X_1})P_X(I-P_{X_1})y,
\end{aligned}$$
since $P_XP_{X_1} = P_{X_1}$ and $P_{X_1}$ is idempotent.
Now, observe that the score test statistic in (4.6) can be written as
$$S(\tilde{\beta}) = \frac{1}{\tilde{\sigma}^2}\left(y - X\tilde{\beta}\right)^TP_X\left(y - X\tilde{\beta}\right),$$
and that $y - X\tilde{\beta}$ are the residuals under the null hypothesis, i.e. $(I-P_{X_1})y$.
Thus,
$$SSR = \tilde{y}^TP_{\tilde{x}}\tilde{y} = y^T(I-P_{X_1})P_X(I-P_{X_1})y = \left(y - X\tilde{\beta}\right)^TP_X\left(y - X\tilde{\beta}\right) \propto S(\tilde{\beta}),$$
and this completes the proof. □
Due to Proposition 4.1.1, the AVP can be viewed as a graphical representation of the score test for testing whether the parameter corresponding to the added variable is zero. An apparent linearity in the AVP for the explanatory variable $X_p$ suggests a high value of the SSR, and thus a high value of the score test statistic for testing $H_0: \beta_p = 0$ in model (2.1). This property makes the AVP useful for detecting observations that are influential not only on the parameter estimate but also on the score test statistic.
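Proposition 4.1.1 can be verified numerically: with $\tilde{\sigma}^2$ the null MLE, the identity $SSR = \tilde{\sigma}^2 S(\tilde{\beta})$ is exact. The simulated data and names below are a sketch, not the thesis's computations.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 40
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
xp = rng.normal(size=n)
X = np.column_stack([X1, xp])
y = X1 @ np.array([2.0, 1.0]) + 0.8 * xp + rng.normal(size=n)

# restricted fit under H0: beta_p = 0
b1 = np.linalg.lstsq(X1, y, rcond=None)[0]
res0 = y - X1 @ b1                 # residuals under H0, i.e. (I - P_X1) y
sig2 = res0 @ res0 / n             # MLE of sigma^2 under H0

# score statistic (4.6)
PX = X @ np.linalg.solve(X.T @ X, X.T)
S = res0 @ PX @ res0 / sig2

# SSR from regressing y-tilde on x-tilde
P1 = X1 @ np.linalg.solve(X1.T @ X1, X1.T)
x_t = xp - P1 @ xp
SSR = (x_t @ res0) ** 2 / (x_t @ x_t)    # y-tilde^T P_x-tilde y-tilde
```

Since `res0` equals $\tilde{y}$, `SSR / sig2` reproduces the score statistic exactly.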
4.2 Graphical displays in nonlinear regression
The ideas behind the AVP described in Section 4.1.1 can be extended to nonlinear regression models. Cook (1987) described a plot similar to the AVP which is referred to as a first-order extension of an AVP. A further discussion of this plot will be given in the next section, where we also derive one of the main results of this thesis, the added parameter plot, APP, along with the features that separate the APP from the first-order extension of an AVP.
Before embarking on the detailed discussion of the APP, the following remark is important. Recall that the AVP is designed to display information available for assessing the significance of a specific explanatory variable for the linear regression model. Moreover, the AVP is used to detect observations that have a substantial influence on the parameter estimate corresponding to the added variable. The relation between the added variable $X_p$ and the parameter estimate $\hat{\beta}_p$ is possible due to the one-to-one correspondence between the variables and the parameters in the model. However, it is worth noticing that this one-to-one correspondence does not necessarily exist in nonlinear regression models. Thus, for the plots in nonlinear regression similar to the AVP, the fundamental objective is to display information relevant for inference about a selected parameter rather than a selected variable.
4.2.1 The added parameter plot
When testing the significance of parameters in a nonlinear regression model, the score test described in Section 2.3 can be used. From the point of view of explorative data analysis, it would be helpful to be able to graphically display data points that lead to a high value of the score test statistic. Previously we have shown that the AVP, used in multiple linear regression, can be considered as a graphical representation of the score test and can help visualize observations with high influence on the score test statistic. A similar plot would certainly be desirable for the nonlinear regression model (2.2).
The goal of this section is to construct a new plot, the added parameter plot, APP, and to show that the APP has the property of being a graphical representation of the score test statistic for testing the hypothesis
$$H_0: \theta = \theta_0, \qquad (4.9)$$
described in Section 2.3, where $\theta_0 = (\theta_1, \dots, \theta_{q-1}, 0)^T$. In deriving the plot we will utilize the results on the AVP obtained in linear regression and the results of Cook (1987), where a plot similar to the APP has been proposed.
Cook (1987) describes a plot, referred to as the first-order extension of an AVP. It is created in the same way as the AVP, letting the derivative of the expectation function with respect to the parameters take the role of the matrix $X$ in model (2.1). Let us illustrate the construction of this plot for the parameter estimate of $\theta_q$ in the nonlinear regression model (2.2).
Let $y = f(X,\theta) + \varepsilon$ and consider the linear expansion of $f(X,\theta)$ around the maximum likelihood estimate, $\hat{\theta}$,
$$f(X,\theta) \approx f(X,\hat{\theta}) + \hat{F}^T(\theta - \hat{\theta}),$$
where $\hat{F} = F(\hat{\theta}): q\times n$ is a matrix of derivatives such that
$$F(\hat{\theta}) = \left(F_1(\hat{\theta}), \dots, F_n(\hat{\theta})\right) = \frac{d}{d\theta}f(X,\theta)\Bigg|_{\theta=\hat{\theta}},$$
and where the derivative is defined in Appendix A.
Rewriting the model $y = f(X,\theta) + \varepsilon$, defined in (2.2), by replacing $f(X,\theta)$ with its linear expansion around $\hat{\theta}$ and rearranging terms suggests the linear model
$$\hat{r} = \hat{F}^T\delta + \varepsilon^*, \qquad (4.10)$$
where $\hat{r} = y - f(X,\hat{\theta})$ is an $n$-vector of ordinary residuals, $\delta = (\theta - \hat{\theta})$ is a $q$-vector and $\varepsilon^*$ represents the error.
Now, partition $\hat{F}$ so that $\hat{F}^T = (\hat{F}_1 \,\vdots\, \hat{F}_2)$ and $\delta^T = (\delta_1^T \,\vdots\, \delta_2)$, where $\delta_2$ corresponds to the parameter of interest, $\theta_q$. The partitioned model becomes
$$\hat{r} = \hat{F}_1\delta_1 + \hat{F}_2\delta_2 + \varepsilon^*. \qquad (4.11)$$
Define $P_{\hat{F}_1} = \hat{F}_1(\hat{F}_1^T\hat{F}_1)^{-1}\hat{F}_1^T$ and pre-multiply (4.11) by $(I - P_{\hat{F}_1})$, which yields
$$(I-P_{\hat{F}_1})\hat{r} = (I-P_{\hat{F}_1})\hat{F}_1\delta_1 + (I-P_{\hat{F}_1})\hat{F}_2\delta_2 + (I-P_{\hat{F}_1})\varepsilon^*, \qquad (4.12)$$
and observe that $(I-P_{\hat{F}_1})\hat{F}_1\delta_1 = 0$, i.e. the effect of $\hat{F}_1$ is removed from the model.
In order to arrive at the first-order extension of an AVP, we want to show that $(I-P_{\hat{F}_1})\hat{r}$ in (4.12) is equal to $\hat{r} = y - f(X,\hat{\theta})$, the ordinary residuals when estimating the parameters in (2.2) via the maximum likelihood approach.

To show that $(I-P_{\hat{F}_1})\hat{r} = \hat{r}$ we will start by showing that $\hat{r} = (I-P_{\hat{F}})\hat{r}$, where $P_{\hat{F}} = \hat{F}^T(\hat{F}\hat{F}^T)^{-1}\hat{F}$. Observe that $F(\hat{\theta})\hat{r} = F(\hat{\theta})(y - f(X,\hat{\theta})) = 0$, since $\hat{\theta}$ is found by using the normal equations, given by $F(\hat{\theta})(y - f(X,\hat{\theta})) = 0$. Thus,
$$(I-P_{\hat{F}})\hat{r} = \left(I - \hat{F}^T(\hat{F}\hat{F}^T)^{-1}\hat{F}\right)\hat{r} = \hat{r}. \qquad (4.13)$$
Inserting $\hat{r} = (I-P_{\hat{F}})\hat{r}$, obtained in (4.13), into $(I-P_{\hat{F}_1})\hat{r}$ yields
$$(I-P_{\hat{F}_1})\hat{r} = (I-P_{\hat{F}_1})(I-P_{\hat{F}})\hat{r} = (I-P_{\hat{F}})\hat{r} = \hat{r},$$
since $P_{\hat{F}}P_{\hat{F}_1} = P_{\hat{F}_1}$.

We have now shown that $(I-P_{\hat{F}_1})\hat{r}$ in (4.12) is equal to $\hat{r} = y - f(X,\hat{\theta})$, and (4.12) can thus be written as
$$\hat{r} = (I-P_{\hat{F}_1})\hat{F}_2\delta_2 + (I-P_{\hat{F}_1})\varepsilon^*. \qquad (4.14)$$
Therefore, the first-order extension of an AVP is defined as the scatter plot of
$$\tilde{y}^* = \hat{r} \quad \text{against} \quad \tilde{x}^* = (I-P_{\hat{F}_1})\hat{F}_2. \qquad (4.15)$$
It is suggested that plotting $\tilde{y}^*$ against $\tilde{x}^*$ displays the effects that contribute to the estimate of $\delta_2$, since it depends on the constructed linear model defined in (4.10). If the constructed linear model in (4.10) is a valid approximation of (2.2), the plot of the residuals $\tilde{y}^*$ against $\tilde{x}^*$ is a valuable diagnostic tool for assessing the influence of the observations on the parameter estimate of $\theta_q$, see Cook (1987).
The novel idea behind the APP proposed in this thesis is to modify the first-order extension of an AVP so that it visualizes the score test. The same approach is used as described in Cook (1987), with $F(\tilde{\theta})$ taking the role of $X$, where $\tilde{F} = F(\tilde{\theta}): q\times n$ is a matrix defined as
$$F(\tilde{\theta}) = \left(F_1(\tilde{\theta}), \dots, F_n(\tilde{\theta})\right) = \frac{d}{d\theta}f(X,\theta)\Bigg|_{\theta=\tilde{\theta}}, \qquad (4.16)$$
and $\tilde{\theta} = (\tilde{\theta}_1, \dots, \tilde{\theta}_{q-1}, 0)^T$ is the vector of maximum likelihood estimates of $\theta$ under the null hypothesis (4.9). Hereafter, $\tilde{F}$ is used to denote $F(\tilde{\theta})$.
The derivation of the plot that is a graphical representation of the score test consists of two steps. Firstly, we partition the matrix $\tilde{F}: q\times n$ as $\tilde{F}^T = (\tilde{F}_1 \,\vdots\, \tilde{F}_2)$, where $\tilde{F}_2$ is the partial derivative of $f(X,\theta)$ with respect to $\theta_q$ evaluated at $\tilde{\theta}$, and define $P_{\tilde{F}_1} = \tilde{F}_1(\tilde{F}_1^T\tilde{F}_1)^{-1}\tilde{F}_1^T$. Secondly, we construct two sets of residuals using the partition of $\tilde{F}^T$. The APP is defined as follows.

Definition 4.2.1. The scatter plot of
$$\tilde{y} = (I-P_{\tilde{F}_1})y \quad \text{and} \quad \tilde{x} = (I-P_{\tilde{F}_1})\tilde{F}_2, \qquad (4.17)$$
along with the least squares estimated regression line resulting from regressing $\tilde{y}$ on $\tilde{x}$, is defined to be the APP for $\theta_q$.
As a motivation for using the residuals $\tilde{y}$ and $\tilde{x}$ in the APP, replace $f(X,\theta)$ in the nonlinear regression model (2.2) with its linear expansion around $\tilde{\theta}$ and rearrange the terms. This yields the model
$$y - f(X,\tilde{\theta}) = \tilde{F}^T(\theta - \tilde{\theta}) + \varepsilon^*. \qquad (4.18)$$
Let $\delta = (\theta - \tilde{\theta})$ and partition the model (4.18) so that it can be written
$$y - f(X,\tilde{\theta}) = \tilde{F}_1\delta_1 + \tilde{F}_2\delta_2 + \varepsilon^*, \qquad (4.19)$$
where $\delta_2 = (\theta_q - \tilde{\theta}_q)$ corresponds to the parameter of interest. Applying the method described in Section 4.1.1 to the partitioned model (4.19) gives
$$(I-P_{\tilde{F}_1})\tilde{r} = (I-P_{\tilde{F}_1})\tilde{F}_1\delta_1 + (I-P_{\tilde{F}_1})\tilde{F}_2\delta_2 + (I-P_{\tilde{F}_1})\varepsilon^*, \qquad (4.20)$$
where $\tilde{r} = y - f(X,\tilde{\theta})$.
Observe that in (4.20), $(I-P_{\tilde{F}_1})\tilde{F}_1\delta_1 = 0$ and the effect of $\tilde{F}_1$ is removed from the model. Moreover, observe that, if the constructed linear model (4.18) is a valid approximation of the nonlinear regression model under the null hypothesis (4.9), then $\tilde{r} = (I-P_{\tilde{F}_1})y$. Thus, the vector of responses in (4.20) becomes
$$(I-P_{\tilde{F}_1})\tilde{r} = (I-P_{\tilde{F}_1})(I-P_{\tilde{F}_1})y = (I-P_{\tilde{F}_1})y.$$
It is worth noting that the residuals $\tilde{y}$ and $\tilde{x}$ are created in the same way as the residuals $\tilde{y}^*$ and $\tilde{x}^*$ defined in (4.15). However, an important distinction is that the sets of residuals in (4.17) are created when the matrix of derivatives is evaluated at the parameter estimates under the null hypothesis, $H_0: \theta_q = 0$.
The APP is based on ideas similar to those behind the AVP in linear regression, and it has properties similar to those of the AVP. From Section 4.1.1 we know that the slope of the regression line in the AVP is equal to the estimate of the parameter corresponding to the added variable, when all other explanatory variables are included in the regression model. The following theorem shows that a similar property holds for the APP.
Theorem 4.2.1. The least squares estimate, $\hat{\alpha}$, of the slope, $\alpha$, resulting from regressing $\tilde{y}$ on $\tilde{x}$ is equal to the updated parameter estimate, $\theta_q^{(1)}$, after one iteration of the Gauss-Newton algorithm when $\tilde{\theta}$ is used as starting value.
Proof. The least squares estimate of the slope $\alpha$ resulting from regressing $\tilde{y}$ on $\tilde{x}$ is given by
$$\hat{\alpha} = (\tilde{x}^T\tilde{x})^{-1}\tilde{x}^T\tilde{y}.$$
When using the definition of $\tilde{y}$ and $\tilde{x}$ in (4.17) we get
$$\begin{aligned}
\hat{\alpha} &= \left(\tilde{F}_2^T(I-P_{\tilde{F}_1})(I-P_{\tilde{F}_1})\tilde{F}_2\right)^{-1}\tilde{F}_2^T(I-P_{\tilde{F}_1})(I-P_{\tilde{F}_1})y \\
&= \left(\tilde{F}_2^T\tilde{F}_2 - \tilde{F}_2^TP_{\tilde{F}_1}\tilde{F}_2\right)^{-1}\tilde{F}_2^T(I-P_{\tilde{F}_1})y.
\end{aligned}$$
We will now show that the updated parameter estimate, $\theta_q^{(1)}$, after one iteration of the Gauss-Newton algorithm when $\tilde{\theta}$ is used as starting value is equal to $\left(\tilde{F}_2^T\tilde{F}_2 - \tilde{F}_2^TP_{\tilde{F}_1}\tilde{F}_2\right)^{-1}\tilde{F}_2^T(I-P_{\tilde{F}_1})y$.
The Gauss-Newton algorithm was discussed in Chapter 2. Recall that when using the Gauss-Newton method we rewrite the nonlinear regression model (2.2), utilizing the linear expansion of $f(X,\theta)$ around the starting value, here $\tilde{\theta}$. Rearranging terms results in the following linear model
$$\tilde{r} = \tilde{F}^T\delta + \varepsilon,$$
where $\delta = (\theta - \tilde{\theta})$ is a $q$-vector.

Minimizing the sum of squared residuals for the constructed linear model yields the least squares estimator $\hat{\delta} = (\tilde{F}\tilde{F}^T)^{-1}\tilde{F}\tilde{r}$, and an update of the starting value, $\tilde{\theta}$, is given by
$$\theta^{(1)} = \tilde{\theta} + (\tilde{F}\tilde{F}^T)^{-1}\tilde{F}\tilde{r}.$$
Using the partition of $\tilde{F}^T$ presented in (4.19) we obtain
$$\theta^{(1)} = \tilde{\theta} + \begin{pmatrix} \tilde{F}_1^T\tilde{F}_1 & \tilde{F}_1^T\tilde{F}_2 \\ \tilde{F}_2^T\tilde{F}_1 & \tilde{F}_2^T\tilde{F}_2 \end{pmatrix}^{-1}\begin{pmatrix} \tilde{F}_1^T \\ \tilde{F}_2^T \end{pmatrix}\tilde{r}.$$
In the rest of the proof, the rules for inversion of a partitioned matrix are used, which are presented in Chapter 2, Section 2.3.1.
Since we are interested in finding the explicit expression for $\theta_q^{(1)}$, we can focus on the last element of $\theta^{(1)}$ and, similarly, on the last row of the matrix $(\tilde{F}\tilde{F}^T)^{-1}$. Moreover, observe that the last element of $\tilde{\theta}$, $\tilde{\theta}_q$, equals zero due to the null hypothesis.

Applying (2.19) to the second row of $(\tilde{F}\tilde{F}^T)^{-1}$, using $A = \tilde{F}_1^T\tilde{F}_1$, $b = \tilde{F}_1^T\tilde{F}_2$, $c = \tilde{F}_2^T\tilde{F}_2$ and $k = \tilde{F}_2^T\tilde{F}_2 - \tilde{F}_2^TP_{\tilde{F}_1}\tilde{F}_2$, we obtain
$$\begin{aligned}
\theta_q^{(1)} &= \left(-\frac{\tilde{F}_2^T\tilde{F}_1(\tilde{F}_1^T\tilde{F}_1)^{-1}}{\tilde{F}_2^T\tilde{F}_2 - \tilde{F}_2^TP_{\tilde{F}_1}\tilde{F}_2} \quad \frac{1}{\tilde{F}_2^T\tilde{F}_2 - \tilde{F}_2^TP_{\tilde{F}_1}\tilde{F}_2}\right)\begin{pmatrix} \tilde{F}_1^T \\ \tilde{F}_2^T \end{pmatrix}\tilde{r} \\
&= \left(-\frac{\tilde{F}_2^TP_{\tilde{F}_1}}{\tilde{F}_2^T\tilde{F}_2 - \tilde{F}_2^TP_{\tilde{F}_1}\tilde{F}_2} + \frac{\tilde{F}_2^T}{\tilde{F}_2^T\tilde{F}_2 - \tilde{F}_2^TP_{\tilde{F}_1}\tilde{F}_2}\right)\tilde{r}.
\end{aligned}$$
Continuing by writing $\tilde{r} = (I-P_{\tilde{F}_1})y$ yields
$$\begin{aligned}
\theta_q^{(1)} &= \frac{\tilde{F}_2^T}{\tilde{F}_2^T\tilde{F}_2 - \tilde{F}_2^TP_{\tilde{F}_1}\tilde{F}_2}(I-P_{\tilde{F}_1})(I-P_{\tilde{F}_1})y \\
&= \left(\tilde{F}_2^T\tilde{F}_2 - \tilde{F}_2^TP_{\tilde{F}_1}\tilde{F}_2\right)^{-1}\tilde{F}_2^T(I-P_{\tilde{F}_1})y,
\end{aligned}$$
and this completes the proof. □
The results obtained in Theorem 4.2.1 give us an idea of how the APP can be used for assessing the effect of introducing the parameter $\theta_q$ to the model. A steep slope of the regression line in the APP indicates that the updated estimate of $\theta$ differs considerably from the starting value, $\tilde{\theta}$. This in turn indicates that the parameter $\theta_q$ might have a contributing effect on the regression model. If there is no linear trend in the APP, the slope is close to zero, and hence there is no change in the updated estimate of $\theta$ when $\tilde{\theta}$ is used as a starting value.
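Theorem 4.2.1 can be checked numerically. The sketch below uses a hypothetical model $f(x,\theta) = \theta_1 x/(\theta_2 + x) + \theta_3 x$ with $H_0: \theta_3 = 0$ and simulated data; all names and values are illustrative. Because the null mean function is linear in $\theta_1$, the agreement between the APP slope and the one-step Gauss-Newton update is exact here (up to the convergence tolerance of the null fit).

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.repeat([0.02, 0.06, 0.11, 0.22, 0.56, 1.10], 2)
y = 200.0 * x / (0.08 + x) + rng.normal(scale=2.0, size=x.size)

def F_rows(t1, t2):
    """Rows of F for the hypothetical model f = t1*x/(t2+x) + t3*x,
    evaluated at (t1, t2, t3 = 0)."""
    return np.vstack([x / (t2 + x),
                      -t1 * x / (t2 + x) ** 2,
                      x])

# fit the null model (theta3 = 0) by Gauss-Newton
t1, t2 = 180.0, 0.1
for _ in range(50):
    r = y - t1 * x / (t2 + x)
    F1 = F_rows(t1, t2)[:2]
    step = np.linalg.solve(F1 @ F1.T, F1 @ r)
    t1, t2 = t1 + step[0], t2 + step[1]

theta_t = np.array([t1, t2, 0.0])            # theta-tilde under H0
r_t = y - t1 * x / (t2 + x)

# APP residuals (4.17) and the slope alpha-hat
Ft = F_rows(t1, t2)
F1t = Ft[:2].T                               # n x (q-1)
P1 = F1t @ np.linalg.solve(F1t.T @ F1t, F1t.T)
y_t = y - P1 @ y                             # y-tilde
x_t = Ft[2] - P1 @ Ft[2]                     # x-tilde
alpha_hat = (x_t @ y_t) / (x_t @ x_t)

# one Gauss-Newton step on the full model, started at theta-tilde
theta_step = theta_t + np.linalg.solve(Ft @ Ft.T, Ft @ r_t)
```

The last component of `theta_step` is $\theta_q^{(1)}$, which matches `alpha_hat`.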
The next lemma provides results that will help in proving Theorem 4.2.2 and Theorem 4.2.3.
Lemma 4.2.1. The projection matrix $P_{\tilde{x}}$ is equal to $P_{\tilde{F}} - P_{\tilde{F}_1}$.

Proof. We want to prove that $P_{\tilde{F}} = P_{\tilde{F}_1} + P_{\tilde{x}}$. Observe that $\tilde{F}^T = (\tilde{F}_1 \,\vdots\, \tilde{F}_2)$ is a partitioned matrix and that $\tilde{x} = (I-P_{\tilde{F}_1})\tilde{F}_2$ by Definition 4.2.1. We can now use the proof of Proposition 2.3.1, letting $\tilde{F}^T = X$, $\tilde{F}_1 = X_1$ and $\tilde{x} = \tilde{x}^*$. From the proof it follows that $P_{\tilde{F}} = P_{\tilde{F}_1} + P_{\tilde{x}}$. □
In Section 4.1.1 it was outlined that the residuals obtained from regressing $\tilde{y}$ on $\tilde{x}$ are equal to the residuals obtained from regressing $y$ on $X$, i.e. on all the variables contained in $X$. The APP has a similar feature, as stated in the next theorem.
Theorem 4.2.2. The residual vector $u = \tilde{y} - \hat{\alpha}\tilde{x}$, resulting from estimating the regression line in the added parameter plot, is equal to $(I-P_{\tilde{F}})y$, i.e. the residuals when $y$ is regressed on all the columns of $\tilde{F}^T$.
Proof. The residuals obtained from regressing $\tilde{y}$ on $\tilde{x}$ are equal to $(I-P_{\tilde{x}})\tilde{y}$.
Since $\tilde{y} = (I-P_{\tilde{F}_1})y$, the residuals can be written
$$u = (I-P_{\tilde{x}})\tilde{y} = (I-P_{\tilde{x}})(I-P_{\tilde{F}_1})y.$$
From the result in Lemma 4.2.1, we know that $P_{\tilde{x}} = P_{\tilde{F}} - P_{\tilde{F}_1}$. Using this property we get
$$u = (I-P_{\tilde{x}})\tilde{y} = (I-P_{\tilde{F}} + P_{\tilde{F}_1})(I-P_{\tilde{F}_1})y = (I-P_{\tilde{F}})y,$$
since $P_{\tilde{F}}P_{\tilde{F}_1} = P_{\tilde{F}_1}$. This completes the proof. □
In the next theorem it will be shown that, similar to the AVP, the APP can be considered to be a graphical representation of the score test.
Theorem 4.2.3. When regressing $\tilde{y}$ on $\tilde{x}$, the resulting $SSR = \tilde{y}^TP_{\tilde{x}}\tilde{y}$ is proportional to the score test statistic for testing the hypothesis
$$H_0: \theta_q = 0,$$
where the score test statistic is defined in (2.27) and given by
$$S(\tilde{\theta}) = \frac{1}{\tilde{\sigma}^2}\tilde{r}^T\tilde{F}^T\left(\tilde{F}\tilde{F}^T\right)^{-1}\tilde{F}\tilde{r}, \qquad (4.21)$$
with $\tilde{r} = y - f(X,\tilde{\theta})$ and $\tilde{F}$ given by (4.16).
Proof. We want to show that $\tilde{y}^TP_{\tilde{x}}\tilde{y} = \tilde{\sigma}^2 S(\tilde{\theta})$, where $S(\tilde{\theta})$ is defined in (4.21). Observe that
$$SSR = \tilde{y}^TP_{\tilde{x}}\tilde{y} = \left((I-P_{\tilde{F}_1})y\right)^TP_{\tilde{x}}\left((I-P_{\tilde{F}_1})y\right).$$
Using the result in Lemma 4.2.1,
$$\begin{aligned}
SSR &= y^T(I-P_{\tilde{F}_1})\left(P_{\tilde{F}} - P_{\tilde{F}_1}\right)(I-P_{\tilde{F}_1})y \\
&= y^T\left(P_{\tilde{F}} - P_{\tilde{F}_1} - P_{\tilde{F}_1}P_{\tilde{F}} + P_{\tilde{F}_1}\right)(I-P_{\tilde{F}_1})y,
\end{aligned}$$
and using the fact that $P_{\tilde{F}_1}P_{\tilde{F}} = P_{\tilde{F}_1}$ we get
$$SSR = y^T\left(P_{\tilde{F}} - P_{\tilde{F}_1}\right)(I-P_{\tilde{F}_1})y = y^T\left(I-P_{\tilde{F}_1}\right)P_{\tilde{F}}(I-P_{\tilde{F}_1})y.$$
Since $(I-P_{\tilde{F}_1})y$ are the residuals under the null hypothesis, $SSR = \tilde{y}^TP_{\tilde{x}}\tilde{y}$ can
be written as
$$SSR = \left(y - f(X,\tilde{\theta})\right)^T\tilde{F}^T\left(\tilde{F}\tilde{F}^T\right)^{-1}\tilde{F}\left(y - f(X,\tilde{\theta})\right) = \tilde{r}^T\tilde{F}^T\left(\tilde{F}\tilde{F}^T\right)^{-1}\tilde{F}\tilde{r} = \tilde{\sigma}^2 S(\tilde{\theta}).$$
Thus, $SSR = \tilde{y}^TP_{\tilde{x}}\tilde{y} = \tilde{\sigma}^2 S(\tilde{\theta})$. The proof is complete. □
From the proof of Theorem 4.2.3 it follows that the score test statistic is proportional to the SSR resulting from regressing $\tilde{y}$ on $\tilde{x}$. Thus, plotting $\tilde{y}$ against $\tilde{x}$ and observing the data points' locations can contribute to the knowledge of which observations strongly influence the value of the score test statistic.
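Theorem 4.2.3 can also be checked numerically. The sketch below reuses a hypothetical model $f(x,\theta) = \theta_1 x/(\theta_2 + x) + \theta_3 x$, fits the null model ($\theta_3 = 0$) by Gauss-Newton on simulated data, and compares the APP regression sum of squares with $\tilde{\sigma}^2 S(\tilde{\theta})$; all names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
x = np.repeat([0.02, 0.06, 0.11, 0.22, 0.56, 1.10], 2)
y = 200.0 * x / (0.08 + x) + rng.normal(scale=2.0, size=x.size)

# null-model fit (theta3 = 0) by Gauss-Newton
t1, t2 = 180.0, 0.1
for _ in range(50):
    r = y - t1 * x / (t2 + x)
    F1 = np.vstack([x / (t2 + x), -t1 * x / (t2 + x) ** 2])
    step = np.linalg.solve(F1 @ F1.T, F1 @ r)
    t1, t2 = t1 + step[0], t2 + step[1]

r_t = y - t1 * x / (t2 + x)                                 # residuals under H0
Ft = np.vstack([x / (t2 + x), -t1 * x / (t2 + x) ** 2, x])  # F-tilde, q x n
sig2 = r_t @ r_t / y.size                                   # MLE of sigma^2 under H0

# score statistic (4.21)
S = r_t @ Ft.T @ np.linalg.solve(Ft @ Ft.T, Ft @ r_t) / sig2

# SSR from the APP regression of y-tilde on x-tilde
F1t = Ft[:2].T
P1 = F1t @ np.linalg.solve(F1t.T @ F1t, F1t.T)
y_t = y - P1 @ y
x_t = Ft[2] - P1 @ Ft[2]
SSR = (x_t @ y_t) ** 2 / (x_t @ x_t)
```

In this setup the null mean function is linear in $\theta_1$, so the relation $SSR = \tilde{\sigma}^2 S(\tilde{\theta})$ holds exactly up to the convergence tolerance of the null fit.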
In the next section a numerical example will be used to illustrate the construction of an APP. The example also contains a discussion of how the individual observations contribute to the value of the score test statistic.
4.2.2 Numerical example
Data from Bates and Watts (1988), given in Table 4.1, will be used to fit the Michaelis-Menten model (2.4) given by
$$y = \frac{\theta_1 x}{\theta_2 + x} + \varepsilon,$$
where $y$ is the initial velocity of the enzymatic reaction and $x$ is the substrate concentration. The parameter $\theta_1$ is the maximum initial velocity that is theoretically attained when the enzyme has been saturated by an infinite concentration of substrate. The second parameter, $\theta_2$, is the Michaelis parameter, which is numerically equal to the concentration of substrate for "half-maximum" initial velocity.
In the example presented in Bates and Watts (1988), two blocks of experiments were run. In one block, the enzyme was treated with puromycin, and in the other the enzyme was untreated. It was hypothesized that the puromycin should affect the maximum velocity parameter $\theta_1$, but not the half-velocity parameter $\theta_2$. An indicator variable $x_2$ was introduced so that $x_2 = 1$ if the enzyme is treated and
$$f(x,\boldsymbol\theta) = \frac{(\theta_1 + \theta_3 x_2)\,x_1}{\theta_2 + x_1}. \qquad (4.22)$$
The Michaelis-Menten model is thus modified, now including $\theta_3$ to account for the effect of puromycin on the asymptotic velocity, $\theta_1$.
      y      x1    x2         y      x1    x2
   76.00    0.02  1.00     159.00   0.22  1.00
   67.00    0.02  0.00     131.00   0.22  0.00
   47.00    0.02  1.00     152.00   0.22  1.00
   51.00    0.02  0.00     124.00   0.22  0.00
   97.00    0.06  1.00     191.00   0.56  1.00
   84.00    0.06  0.00     144.00   0.56  0.00
  107.00    0.06  1.00     201.00   0.56  1.00
   86.00    0.06  0.00     158.00   0.56  0.00
  123.00    0.11  1.00     207.00   1.10  1.00
   98.00    0.11  0.00     160.00   1.10  0.00
  139.00    0.11  1.00     200.00   1.10  1.00
  115.00    0.11  0.00

Table 4.1: Data from Bates and Watts (1988), used to fit the Michaelis-Menten model with expectation functions (4.22) and (4.23).
A second modification of the model entails including $\theta_4$, a parameter for the potential effect of puromycin on $\theta_2$. The expectation function is now written as
$$f(x,\boldsymbol\theta) = \frac{(\theta_1 + \theta_3 x_2)\,x_1}{(\theta_2 + \theta_4 x_2) + x_1}. \qquad (4.23)$$
The score test can be used to test whether the model should include different half-velocity parameters depending on whether the enzyme is treated or not. In this case the hypotheses are
$$H_0 : \theta_4 = 0, \qquad H_A : \theta_4 \neq 0. \qquad (4.24)$$
To conduct the score test, first fit the model with $f(x,\boldsymbol\theta)$ defined in (4.22) to the data in order to retrieve the estimates under the null hypothesis. This yields $\hat{\boldsymbol\theta} = (166.60,\ 0.06,\ 42.03,\ 0)^T$ and $\hat\sigma^2 = 97.43$. The score test statistic for testing (4.24) is given by
$$S(\hat{\boldsymbol\theta}) = \frac{1}{\hat\sigma^2}\,\mathbf{r}^T \mathbf{F}^T \big(\mathbf{F}\mathbf{F}^T\big)^{-1} \mathbf{F}\,\mathbf{r},$$
and using the estimates under the null hypothesis we get $S(\hat{\boldsymbol\theta}) = 1.67$. The $p$-value for this test is 0.20, and the null hypothesis that the half-velocity parameter is unchanged by the puromycin treatment cannot be rejected.
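The computation can be sketched as follows, assuming SciPy is available; the fit, the ML variance estimate $\hat\sigma^2 = SSE/n$, and the derivative matrix are reconstructed from the formulas above, so the resulting values should be close to the reported $S(\hat{\boldsymbol\theta}) = 1.67$ and $p = 0.20$, up to convergence of the optimizer:

```python
# Sketch of the score test for H0: theta4 = 0 on the puromycin data of
# Table 4.1 (reported values: theta-hat = (166.60, 0.06, 42.03, 0),
# sigma2-hat = 97.43, S = 1.67, p = 0.20).
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import chi2

y = np.array([76, 67, 47, 51, 97, 84, 107, 86, 123, 98, 139, 115,
              159, 131, 152, 124, 191, 144, 201, 158, 207, 160, 200.0])
x1 = np.repeat([0.02, 0.06, 0.11, 0.22, 0.56, 1.10], [4, 4, 4, 4, 4, 3])
x2 = np.array([1, 0] * 11 + [1.0])          # treatment indicator

def f0(X, t1, t2, t3):                      # model (4.22), i.e. theta4 = 0
    x1, x2 = X
    return (t1 + t3 * x2) * x1 / (t2 + x1)

(t1, t2, t3), _ = curve_fit(f0, (x1, x2), y, p0=(160.0, 0.05, 50.0))
r = y - f0((x1, x2), t1, t2, t3)            # residuals under H0
s2 = r @ r / len(y)                         # ML estimate of sigma^2

# Rows of F: derivatives of f in (4.23) w.r.t. theta1..theta4 at the
# H0 estimate (theta4 = 0)
F = np.vstack([x1 / (t2 + x1),
               -(t1 + t3 * x2) * x1 / (t2 + x1) ** 2,
               x2 * x1 / (t2 + x1),
               -(t1 + t3 * x2) * x1 * x2 / (t2 + x1) ** 2])

S = r @ F.T @ np.linalg.solve(F @ F.T, F @ r) / s2
p = chi2.sf(S, df=1)                        # chi-square with 1 df
```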
To visualize the score test, an APP for $\theta_4$ is constructed. First let the columns of $\mathbf{F}^T$ act as independent variables, where
$$\mathbf{F}^T = \big(\mathbf{F}(\theta_1), \mathbf{F}(\theta_2), \mathbf{F}(\theta_3), \mathbf{F}(\theta_4)\big) = \left(\frac{d\mathbf{f}(\mathbf{X},\boldsymbol\theta)}{d\theta_1}\bigg|_{\boldsymbol\theta=\hat{\boldsymbol\theta}}, \ldots, \frac{d\mathbf{f}(\mathbf{X},\boldsymbol\theta)}{d\theta_4}\bigg|_{\boldsymbol\theta=\hat{\boldsymbol\theta}}\right).$$
The first three columns of $\mathbf{F}^T$ form the matrix $\mathbf{F}_1$ and the last column forms the vector $\mathbf{F}_2$. Next, $\tilde{\mathbf{y}}$ is constructed as the residuals when $\mathbf{y}$ is regressed on $\mathbf{F}_1$, and $\tilde{\mathbf{x}}$ is constructed as the residuals when $\mathbf{F}_2$ is regressed on $\mathbf{F}_1$. Now, plotting $\tilde{\mathbf{y}}$ against $\tilde{\mathbf{x}}$ and estimating the regression line resulting from regressing $\tilde{\mathbf{y}}$ on $\tilde{\mathbf{x}}$ yields the APP for $\theta_4$, which is given in Figure 4.2.
[Scatter plot omitted: 23 labelled points with $\tilde{x}$ on the horizontal axis (approximately $-300$ to $300$) and $\tilde{y}$ on the vertical axis (approximately $-10$ to $20$), together with the fitted regression line.]

Figure 4.2: The added parameter plot for $\theta_4$, consisting of the scatter plot of $\tilde{\mathbf{y}}$, the residuals resulting from regressing $\mathbf{y}$ on $\mathbf{F}_1$, against $\tilde{\mathbf{x}}$, the residuals resulting from regressing $\mathbf{F}_2$ on $\mathbf{F}_1$, and the estimated regression line with slope $\hat\alpha$.
The estimate of the slope of the regression line in Figure 4.2 is $\hat\alpha = 0.02$. Moreover, the updated estimate of $\boldsymbol\theta$ using a single iteration of the Gauss-Newton method is
$$\hat{\boldsymbol\theta}^{(1)} = (160.90,\ 0.05,\ 51.30,\ 0.02)^T. \qquad (4.25)$$
Thus, we see that $\hat\alpha = 0.02 = \hat\theta_4^{(1)}$, which illustrates Theorem 4.2.1.

According to Theorem 4.2.3, the value of the score test statistic is equal to the ratio of the $SSR$, resulting from regressing $\tilde{\mathbf{y}}$ on $\tilde{\mathbf{x}}$, and $\hat\sigma^2$. Here $SSR = 162.28$, $\hat\sigma^2 = 97.43$ and $S(\hat{\boldsymbol\theta}) = 162.28/97.43 = 1.67$, which corresponds to the value of the score test statistic obtained above.
The APP in Figure 4.2 can be studied more thoroughly, searching for observations with substantial influence on the score test statistic. Firstly, we note that there is no strong linear trend in the scatter of the observations in the plot, which is consistent with not rejecting the null hypothesis. Secondly, we note that the 1st and 2nd observations are separated from the rest of the data points and could be influential observations.

A deeper analysis of the data yields that when the 1st observation is removed from the calculations, the $SSR$ increases together with the value of the score test statistic, and the slope of the regression line also changes. The new values of the $SSR$, the score test statistic and the estimated slope of the regression line are 329.80, 4.32 and 0.02, respectively. Thus, observation 1 is influencing the score test statistic, decreasing its value. The $p$-value corresponding to a score test statistic of 4.32 is 0.04. In fact, when the 1st observation is excluded from the data, the null hypothesis would be rejected at the 5 percent significance level. When observation 2 is removed from the calculations, the new values of the $SSR$, the score test statistic and the slope are 29.91, 0.41 and 0.01, respectively. Thus, the presence of observation 2 is increasing the score test statistic.
5. Assessment of influence on parameter estimates
Assessment of the influence of the observations on the parameter estimates is an important part of influence analysis, and there are many challenging issues to consider. For instance,

• parameter estimates can be highly influenced by single observations.

• multiple observations can also have a large influence on the parameter estimates:

  ◦ several observations can simultaneously influence parameter estimates, hence their joint influence should be assessed.

  ◦ due to hidden, general dependence among observations, there can be observations that influence parameter estimates only when one or several observations are removed from the data set.
In Section 5.1, we will present another important result of this thesis, an influence measure that is used to assess the influence of a single observation on the parameter estimates in a nonlinear regression model. The section will also contain a brief description of the corresponding influence measure in linear regression.

Section 5.2 is devoted to assessing the influence of multiple observations on the parameter estimates. The section is divided into two main parts, where one part concerns the simultaneous influence of several observations on the parameter estimates. This type of influence will be referred to as joint influence. The other part treats the influence that the kth observation has on the parameter estimates after another observation, say observation i, has been deleted. The type of influence that the kth observation has on the parameter estimates after the deletion of the ith observation is called conditional influence. Joint and conditional influence will be discussed for both linear and nonlinear regression models, and two new diagnostic measures are worked out for use in nonlinear regression.
5.1 Assessment of influence of a single observation
The 1970's and the 1980's were the decades when a significant amount of research on influence analysis in linear regression was conducted. Statisticians were not content with using the regression model and simply accepting the data at hand as given. They were now seeking to investigate the data quality. The pioneering work of Cook (1977) in this area resulted in the diagnostic measure referred to as Cook's distance. Cook's distance is given in (3.1) and is used to measure the influence of a single observation on the parameter estimates in linear regression models. Earlier attempts to protect against influential or outlying observations came through the concept of robustness and robust regression. The use of robust regression was motivated by the fact that the least squares estimator was not robust, but rather sensitive, against outlying observations. This means that a single observation, being extremely influential, could cause the least squares estimation to produce an incorrect result. Interested readers are referred to Huber (1972) and Hampel (1974).

Belsley et al. (1980) came out with a book on regression diagnostics and, in particular, identification of influential observations. New diagnostic tools, e.g. DFFIT and DFBETA, were proposed as summaries of parameter changes by deletion of an observation. These two diagnostic tools, together with Cook's distance, are the most commonly used diagnostic measures for conducting influence analysis in linear regression, and they are implemented in most statistical software packages. Later, Cook and Weisberg (1982) and Chatterjee and Hadi (1988) discussed the role of different residuals and diagnostic measures in influence analysis.

Deleting observations and studying the change in the parameter estimates due to deletion is a popular approach to influence analysis. This approach is known as case-deletion. Cook's distance, DFFIT and DFBETA are all examples of measures where this strategy is adopted. Cook and Weisberg (1982) extended the ideas of case deletion and proposed a diagnostic measure, similar to Cook's distance, for assessing the influence of observations on parameter estimates in the nonlinear regression model. Later, Ross (1987) studied the geometry of case deletion in nonlinear regression models and discussed the adequacy of using diagnostic measures based on case-deletion in nonlinear regression.
In 1986, Cook proposed a new approach to influence analysis, which is referred to as local influence. In this approach weights are introduced into the model, which does not necessarily need to be a linear regression model, by attaching them to the observations. One novelty of the local influence approach was that the weights were allowed to vary between zero and one, rather than being zero, as in the case-deletion approach. The article by Cook (1986) contained new diagnostic measures for conducting influence analysis about the parameters in the linear regression model. Moreover, the proposal to use the local influence approach stimulated research on influence analysis in nonlinear regression, since St. Laurent and Cook (1993) discussed the relation between leverage and local influence in nonlinear regression.
Influence analysis in nonlinear regression is not widely explored, and the results obtained in this thesis will make a certain contribution to this research area. One of the main results is the new influence measure for assessing the influence of single observations on the parameter estimates in a nonlinear regression model. This measure is denoted $DIM_{\hat\theta,k}$. The abbreviation stands for Differentiation approach & Influence Measure, since the influence measure is derived using the differentiation approach, described by Belsley et al. (1980), Chatterjee and Hadi (1988) and Cook and Weisberg (1982). The differentiation approach is used in linear regression to construct the influence measure $EIC_{\hat\beta,k}$, where EIC stands for Empirical Influence Curve. Originally this measure was derived from the influence curve, a theoretical concept introduced by Hampel (1974). In Section 5.1.1 we will derive $EIC_{\hat\beta,k}$ from the differentiation approach to demonstrate the idea behind its construction. Then, borrowing ideas from linear regression, we derive the nonlinear influence measure $DIM_{\hat\theta,k}$ in Section 5.1.2.
5.1.1 The influence measure EIC in linear regression, derived via the differentiation approach

An approach for assessing the influence of an observation on the parameter estimates in the linear regression model (2.1), described in Belsley et al. (1980), Chatterjee and Hadi (1988) and Cook and Weisberg (1982), is called the differentiation approach. The resulting influence measure is denoted $EIC_{\hat\beta,k}$, and next we will demonstrate the derivation of $EIC_{\hat\beta,k}$ in linear regression. The idea behind it will later be extended to nonlinear regression models as well.
Let us consider the following perturbed linear regression model
$$\mathbf{y}_\omega = \mathbf{X}\boldsymbol\beta + \boldsymbol\varepsilon_\omega, \qquad (5.1)$$
where $\boldsymbol\varepsilon_\omega \sim N_n(\mathbf{0},\sigma^2\mathbf{W}^{-1}(\omega_k))$, $\omega_k$ is a weight such that $0 < \omega_k \le 1$, and the weight matrix $\mathbf{W}(\omega_k)$ is the diagonal matrix
$$\mathbf{W}(\omega_k) = \mathrm{diag}(1,\ldots,\omega_k,\ldots,1).$$
The weighted least squares estimator for $\boldsymbol\beta$ in (5.1) is given by
$$\hat{\boldsymbol\beta}(\omega_k) = \big(\mathbf{X}^T\mathbf{W}(\omega_k)\mathbf{X}\big)^{-1}\mathbf{X}^T\mathbf{W}(\omega_k)\mathbf{y}.$$

Definition 5.1.1. The influence measure for assessing the influence of the kth observation on $\hat{\boldsymbol\beta}$, denoted $EIC_{\hat\beta,k}$, is defined as the derivative of $\hat{\boldsymbol\beta}(\omega_k)$ with respect to $\omega_k$ evaluated at $\omega_k = 1$:
$$EIC_{\hat\beta,k} = \frac{d}{d\omega_k}\hat{\boldsymbol\beta}(\omega_k)\bigg|_{\omega_k=1}.$$

The influence measure $EIC_{\hat\beta,k}$ in Definition 5.1.1 describes how the calculated estimate changes in the area near $\omega_k = 1$, i.e. as the kth observation is given full weight. For instance, a value of $EIC_{\hat\beta,k}$ close to zero corresponds to no change, or a minor change, in the estimate if the kth observation is included in the calculations of $\hat{\boldsymbol\beta}$. In this case, the kth observation is not an influential observation. On the other hand, a value of $EIC_{\hat\beta,k}$ substantially different from zero means that the inclusion of the kth observation in the calculations of $\hat{\boldsymbol\beta}$ substantially changes the result of the estimation. A diagnostic measure similar to $EIC_{\hat\beta,k}$ is given by taking the derivative of $\hat{\boldsymbol\beta}(\omega_k)$ with respect to $\omega_k$, evaluated as $\omega_k \to 0$. It is denoted $EIC_{\hat\beta,(k)}$ by Chatterjee and Hadi (1988), and it measures how the estimate of $\boldsymbol\beta$ changes when the kth observation is deleted from the data.
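The two evaluation points of the derivative can be illustrated numerically. The following sketch (synthetic data, NumPy assumed; not from the thesis) shows that the weighted estimate $\hat{\boldsymbol\beta}(\omega_k)$ connects the full-data fit at $\omega_k = 1$ with the case-deletion fit as $\omega_k \to 0$:

```python
# The weight path beta-hat(omega_k): full weight recovers ordinary
# least squares, while omega_k -> 0 recovers the leave-one-out fit.
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 15, 3, 0                         # k indexes the perturbed obs
X = rng.standard_normal((n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(n)

def beta_w(w):
    """Weighted LS estimate with weight w on observation k."""
    W = np.eye(n)
    W[k, k] = w
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

b_full = beta_w(1.0)                       # ordinary least squares
b_del = np.linalg.lstsq(np.delete(X, k, 0),
                        np.delete(y, k), rcond=None)[0]
```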
In the following theorem we will derive the expression of $EIC_{\hat\beta,k}$ in Definition 5.1.1, using the differentiation approach.

Theorem 5.1.1. Let $EIC_{\hat\beta,k}$ be given in Definition 5.1.1. Then
$$EIC_{\hat\beta,k} = r_k\,\mathbf{x}_k^T(\mathbf{X}^T\mathbf{X})^{-1}, \qquad (5.2)$$
where $r_k$ is the kth component of the vector of residuals $\mathbf{r} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol\beta}$ and $\mathbf{x}_k^T$ is the kth row of the matrix $\mathbf{X}$.

Proof. Let us evaluate the derivative in Definition 5.1.1 using the product rule (see Appendix A for details of how to apply the derivative):
$$\frac{d}{d\omega_k}\hat{\boldsymbol\beta}(\omega_k) = \frac{d\big(\mathbf{X}^T\mathbf{W}\mathbf{X}\big)^{-1}\mathbf{X}^T\mathbf{W}\mathbf{y}}{d\omega_k} = \frac{d\mathbf{W}}{d\omega_k}(\mathbf{y}\otimes\mathbf{X})\big(\mathbf{X}^T\mathbf{W}\mathbf{X}\big)^{-1} - \frac{d\mathbf{W}}{d\omega_k}(\mathbf{X}\otimes\mathbf{X})\Big(\big(\mathbf{X}^T\mathbf{W}\mathbf{X}\big)^{-1}\otimes\big(\mathbf{X}^T\mathbf{W}\mathbf{X}\big)^{-1}\Big)\big(\mathbf{X}^T\mathbf{W}\mathbf{y}\otimes\mathbf{I}_p\big),$$
where $\otimes$ is the Kronecker product (see Kollo and von Rosen, 2010), defined as follows. Let $\mathbf{A} = (a_{ij})$ be a $p\times q$ matrix and $\mathbf{B} = (b_{ij})$ an $r\times s$ matrix. Then the $pr\times qs$ matrix $\mathbf{A}\otimes\mathbf{B}$ is the Kronecker product of the matrices $\mathbf{A}$ and $\mathbf{B}$ if
$$\mathbf{A}\otimes\mathbf{B} = [a_{ij}\mathbf{B}], \quad i = 1,\ldots,p;\ j = 1,\ldots,q, \qquad \text{where} \qquad a_{ij}\mathbf{B} = \begin{pmatrix} a_{ij}b_{11} & \cdots & a_{ij}b_{1s} \\ \vdots & & \vdots \\ a_{ij}b_{r1} & \cdots & a_{ij}b_{rs} \end{pmatrix}.$$
Due to linearity of $\mathbf{W}$ the following expression is obtained:
$$\frac{d\mathbf{W}}{d\omega_k} = \mathbf{d}_k^T\otimes\mathbf{d}_k^T,$$
where $\mathbf{d}_k$ is the kth column of the identity matrix of size $n$. Evaluating the expression above at $\omega_k = 1$ we get
$$\frac{d}{d\omega_k}\hat{\boldsymbol\beta}(\omega_k)\bigg|_{\omega_k=1} = (\mathbf{d}_k^T\otimes\mathbf{d}_k^T)\Big[(\mathbf{y}\otimes\mathbf{X})\big(\mathbf{X}^T\mathbf{X}\big)^{-1} - (\mathbf{X}\otimes\mathbf{X})\Big(\big(\mathbf{X}^T\mathbf{X}\big)^{-1}\otimes\big(\mathbf{X}^T\mathbf{X}\big)^{-1}\Big)\big(\mathbf{X}^T\mathbf{y}\otimes\mathbf{I}_p\big)\Big]$$
$$= (\mathbf{d}_k^T\otimes\mathbf{d}_k^T)\Big[(\mathbf{y}\otimes\mathbf{X})\big(\mathbf{X}^T\mathbf{X}\big)^{-1} - \big(\mathbf{X}\hat{\boldsymbol\beta}\otimes\mathbf{X}\big)\big(\mathbf{X}^T\mathbf{X}\big)^{-1}\Big] = (\mathbf{d}_k^T\otimes\mathbf{d}_k^T)\big((\mathbf{y}-\mathbf{X}\hat{\boldsymbol\beta})\otimes\mathbf{X}\big)\big(\mathbf{X}^T\mathbf{X}\big)^{-1} = r_k\mathbf{x}_k^T\big(\mathbf{X}^T\mathbf{X}\big)^{-1}.$$
Thus, the final expression for $EIC_{\hat\beta,k}$, derived using the differentiation approach, is
$$EIC_{\hat\beta,k} = r_k\mathbf{x}_k^T\big(\mathbf{X}^T\mathbf{X}\big)^{-1},$$
and the proof is complete. $\square$
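The closed form in Theorem 5.1.1 can be verified against a finite-difference derivative of the weighted estimate; the sketch below (synthetic data, illustrative only) does exactly that:

```python
# Numeric check of Theorem 5.1.1: the analytic EIC in (5.2) equals the
# derivative of beta-hat(omega_k) at omega_k = 1, here approximated by
# a central finite difference.
import numpy as np

rng = np.random.default_rng(2)
n, p, k = 12, 2, 3                          # k indexes the perturbed obs
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

def beta_w(w):
    W = np.eye(n)
    W[k, k] = w
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

b = beta_w(1.0)
r = y - X @ b                               # ordinary residuals
EIC = r[k] * X[k] @ np.linalg.inv(X.T @ X)  # formula (5.2)

h = 1e-6                                    # central difference step
fd = (beta_w(1 + h) - beta_w(1 - h)) / (2 * h)
```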
In the next section, we extend the idea of using the differentiation approach for measuring the influence of an observation on the parameter estimates to nonlinear regression models. The influence measure denoted $DIM_{\hat\theta,k}$ is derived for assessing the influence of the kth observation on the parameter estimates in the nonlinear regression model (2.2).
5.1.2 The influence measure DIM, for use in nonlinear regression

In this section two new influence measures for the parameter estimates in the nonlinear regression model (2.2) will be derived: $DIM_{\hat\theta,k}$ and $DIM_{\hat\theta_j,k}$. The first diagnostic measure, $DIM_{\hat\theta,k}$, is used to assess the influence of a single observation on all parameter estimates in the model simultaneously. It is constructed when all parameters are estimated from a perturbed model, presented in (5.3) later on, and it is referred to as the joint-parameter influence measure.

The $DIM_{\hat\theta_j,k}$, on the other hand, is used to assess the influence of a single observation on the jth parameter estimate in the model. When constructing $DIM_{\hat\theta_j,k}$, only the jth parameter is estimated from the perturbed model, later defined in (5.3); the other parameters are estimated from an unperturbed model and regarded as known. The $DIM_{\hat\theta_j,k}$ is referred to as the marginal-parameter influence measure.
We will now start with the definition of $DIM_{\hat\theta,k}$. Consider the following perturbed nonlinear model
$$\mathbf{y}_\omega = \mathbf{f}(\mathbf{X},\boldsymbol\theta) + \boldsymbol\varepsilon_\omega, \qquad (5.3)$$
where $\boldsymbol\varepsilon_\omega \sim N_n(\mathbf{0},\sigma^2\mathbf{W}^{-1}(\omega_k))$, $\omega_k$ is a weight such that $0 < \omega_k \le 1$, and the weight matrix $\mathbf{W}(\omega_k) = \mathrm{diag}(1,\ldots,\omega_k,\ldots,1)$.

Definition 5.1.2. The influence measure for assessing the influence of the kth observation on $\hat{\boldsymbol\theta}$ is defined as the following derivative
$$DIM_{\hat\theta,k} = \frac{d}{d\omega_k}\hat{\boldsymbol\theta}(\omega_k)\bigg|_{\omega_k=1}, \qquad (5.4)$$
where $\hat{\boldsymbol\theta}(\omega_k)$ is the weighted least squares estimate of $\boldsymbol\theta$ in the perturbed model (5.3).

Observe that, in Definition 5.1.2, if $\omega_k \to 1$, then $\hat{\boldsymbol\theta}(\omega_k) \to \hat{\boldsymbol\theta}$, the unweighted least squares estimate.

To calculate $DIM_{\hat\theta,k}$ in (5.4) we need an estimator for $\boldsymbol\theta$ in the perturbed model (5.3). Using the method of weighted least squares, which is equivalent to the maximum likelihood approach, we have to find the $\boldsymbol\theta$ that minimizes
$$Q(\omega_k) = (\mathbf{y}-\mathbf{f}(\mathbf{X},\boldsymbol\theta))^T\,\mathbf{W}(\omega_k)\,(\mathbf{y}-\mathbf{f}(\mathbf{X},\boldsymbol\theta)).$$
Differentiating $Q(\omega_k)$ with respect to $\boldsymbol\theta$ one gets the following normal equations
$$\frac{d\mathbf{f}(\mathbf{X},\boldsymbol\theta)}{d\boldsymbol\theta}\,\mathbf{W}(\omega_k)\,(\mathbf{y}-\mathbf{f}(\mathbf{X},\boldsymbol\theta)) = \mathbf{0}, \qquad (5.5)$$
where the derivative $\frac{d\mathbf{f}}{d\boldsymbol\theta}$ is defined in Appendix A.

The normal equations in (5.5) are solved for $\boldsymbol\theta$ using iterative methods such as the Gauss-Newton method. The obtained least squares estimate of $\boldsymbol\theta$ is a function of $\omega_k$.

In the next theorem, the explicit expression of $DIM_{\hat\theta,k}$, defined in (5.4), will be presented.
Theorem 5.1.2. Let $DIM_{\hat\theta,k}$ be given in Definition 5.1.2. Then
$$DIM_{\hat\theta,k} = r_k\,\mathbf{F}_k^T(\hat{\boldsymbol\theta})\,\big(\mathbf{F}(\hat{\boldsymbol\theta})\mathbf{F}^T(\hat{\boldsymbol\theta}) - \mathbf{G}(\hat{\boldsymbol\theta})(\mathbf{r}\otimes\mathbf{I}_q)\big)^{-1},$$
provided that the inverse exists, where
$$\mathbf{r} = (r_k) = \mathbf{y} - \mathbf{f}(\mathbf{X},\hat{\boldsymbol\theta}), \qquad (5.6)$$
$$\mathbf{F}(\hat{\boldsymbol\theta}) = \big(\mathbf{F}_1(\hat{\boldsymbol\theta}),\ldots,\mathbf{F}_n(\hat{\boldsymbol\theta})\big) = \frac{d\mathbf{f}(\mathbf{X},\boldsymbol\theta)}{d\boldsymbol\theta}\bigg|_{\boldsymbol\theta=\hat{\boldsymbol\theta}}, \quad q\times n, \qquad (5.7)$$
and
$$\mathbf{G}(\hat{\boldsymbol\theta}) = \left(\frac{d}{d\boldsymbol\theta}\left(\frac{d\mathbf{f}(\mathbf{X},\boldsymbol\theta)}{d\boldsymbol\theta}\right)\right)_{\boldsymbol\theta=\hat{\boldsymbol\theta}} = \frac{d\mathbf{F}(\boldsymbol\theta)}{d\boldsymbol\theta}, \quad q\times nq. \qquad (5.8)$$
Proof. Consider inserting the weighted least squares estimate $\hat{\boldsymbol\theta}(\omega_k)$ in the normal equations (5.5),
$$\frac{d\mathbf{f}(\mathbf{X},\boldsymbol\theta)}{d\boldsymbol\theta}\bigg|_{\boldsymbol\theta=\hat{\boldsymbol\theta}(\omega_k)}\mathbf{W}(\omega_k)\big(\mathbf{y}-\mathbf{f}(\mathbf{X},\hat{\boldsymbol\theta}(\omega_k))\big) = \mathbf{0}, \qquad (5.9)$$
and letting $\hat{\mathbf{F}} = \mathbf{F}(\hat{\boldsymbol\theta}(\omega_k))$, $\mathbf{W} = \mathbf{W}(\omega_k)$ and $\mathbf{e} = \mathbf{y}-\mathbf{f}(\mathbf{X}) = \mathbf{y}-\mathbf{f}(\mathbf{X},\hat{\boldsymbol\theta}(\omega_k))$. The influence measure $DIM_{\hat\theta,k}$ can be obtained by differentiating (5.9) with respect to $\omega_k$ on both sides, i.e.
$$\frac{d}{d\omega_k}\hat{\mathbf{F}}\mathbf{W}\mathbf{e} = \mathbf{0}. \qquad (5.10)$$
The product rule, defined in Appendix A, shows that (5.10) equals
$$\frac{d}{d\omega_k}\hat{\mathbf{F}}\mathbf{W}\mathbf{e} = \frac{d\hat{\mathbf{F}}}{d\omega_k}(\mathbf{W}\mathbf{e}\otimes\mathbf{I}_q) + \frac{d\mathbf{W}}{d\omega_k}(\mathbf{e}\otimes\hat{\mathbf{F}}^T) + \frac{d\mathbf{e}}{d\omega_k}\mathbf{W}\hat{\mathbf{F}}^T. \qquad (5.11)$$
Now
$$\frac{d\mathbf{e}}{d\omega_k} = -\frac{d\mathbf{f}(\mathbf{X})}{d\omega_k} \qquad \text{and} \qquad \frac{d\mathbf{W}}{d\omega_k} = \mathbf{d}_k^T\otimes\mathbf{d}_k^T,$$
where $\mathbf{d}_k$ is the kth column of the identity matrix of size $n$. Moreover, applying the chain rule, see Appendix A, to (5.11) implies that (5.10) is identical to
$$\frac{d\hat{\boldsymbol\theta}(\omega_k)}{d\omega_k}\frac{d\hat{\mathbf{F}}}{d\hat{\boldsymbol\theta}(\omega_k)}(\mathbf{W}\mathbf{e}\otimes\mathbf{I}_q) + \mathbf{d}_k^T\mathbf{e}\otimes\mathbf{d}_k^T\hat{\mathbf{F}}^T - \frac{d\hat{\boldsymbol\theta}(\omega_k)}{d\omega_k}\frac{d\mathbf{f}(\mathbf{X})}{d\hat{\boldsymbol\theta}(\omega_k)}\mathbf{W}\hat{\mathbf{F}}^T = \mathbf{0},$$
which after rearrangement of terms yields
$$\mathbf{d}_k^T\mathbf{e}\otimes\mathbf{d}_k^T\hat{\mathbf{F}}^T = \frac{d\hat{\boldsymbol\theta}(\omega_k)}{d\omega_k}\left(\frac{d\mathbf{f}(\mathbf{X})}{d\hat{\boldsymbol\theta}(\omega_k)}\mathbf{W}\hat{\mathbf{F}}^T - \frac{d\hat{\mathbf{F}}}{d\hat{\boldsymbol\theta}(\omega_k)}(\mathbf{W}\mathbf{e}\otimes\mathbf{I}_q)\right). \qquad (5.12)$$
Evaluating the derivatives in (5.12) at $\omega_k = 1$, together with (5.6)-(5.8) and Definition 5.1.2, implies
$$\mathbf{d}_k^T\mathbf{r}\otimes\mathbf{d}_k^T\mathbf{F}^T(\hat{\boldsymbol\theta}) = \frac{d\hat{\boldsymbol\theta}(\omega_k)}{d\omega_k}\bigg|_{\omega_k=1}\big(\mathbf{F}(\hat{\boldsymbol\theta})\mathbf{F}^T(\hat{\boldsymbol\theta}) - \mathbf{G}(\hat{\boldsymbol\theta})(\mathbf{r}\otimes\mathbf{I}_q)\big).$$
Thus,
$$r_k\mathbf{F}_k^T(\hat{\boldsymbol\theta}) = DIM_{\hat\theta,k}\big(\mathbf{F}(\hat{\boldsymbol\theta})\mathbf{F}^T(\hat{\boldsymbol\theta}) - \mathbf{G}(\hat{\boldsymbol\theta})(\mathbf{r}\otimes\mathbf{I}_q)\big),$$
and
$$DIM_{\hat\theta,k} = r_k\mathbf{F}_k^T(\hat{\boldsymbol\theta})\big(\mathbf{F}(\hat{\boldsymbol\theta})\mathbf{F}^T(\hat{\boldsymbol\theta}) - \mathbf{G}(\hat{\boldsymbol\theta})(\mathbf{r}\otimes\mathbf{I}_q)\big)^{-1}.$$
This completes the proof. $\square$
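The theorem can be checked numerically on a small Michaelis-Menten data set. The sketch below (SciPy assumed; the data and starting values are illustrative, not from the thesis) computes the analytic $DIM_{\hat\theta,k}$ and compares it with a finite-difference derivative of the weighted estimate, where $\mathbf{G}(\hat{\boldsymbol\theta})(\mathbf{r}\otimes\mathbf{I}_q)$ is evaluated as $\sum_i r_i\mathbf{H}_i$ with $\mathbf{H}_i$ the Hessian of $f_i$:

```python
# Sketch of Theorem 5.1.2: the analytic DIM equals a finite-difference
# derivative of the weighted least squares estimate theta-hat(omega_k).
import numpy as np
from scipy.optimize import least_squares

x = np.array([0.02, 0.06, 0.11, 0.22, 0.56, 1.1])
y = np.array([50.0, 85.0, 110.0, 130.0, 155.0, 165.0])
k = 2                                       # observation under study

def theta_w(w):
    """Weighted LS fit of the Michaelis-Menten model, weight w on obs k."""
    wts = np.ones_like(x)
    wts[k] = w
    res = lambda t: np.sqrt(wts) * (y - t[0] * x / (t[1] + x))
    return least_squares(res, x0=(170.0, 0.05), xtol=1e-14, ftol=1e-14).x

t1, t2 = theta_w(1.0)
r = y - t1 * x / (t2 + x)                   # unweighted residuals
F = np.vstack([x / (t2 + x), -t1 * x / (t2 + x) ** 2])     # 2 x n

# Hessian of f_i w.r.t. (theta1, theta2); G (r kron I_q) = sum_i r_i H_i
H = np.array([[[0.0, -xi / (t2 + xi) ** 2],
               [-xi / (t2 + xi) ** 2, 2 * t1 * xi / (t2 + xi) ** 3]]
              for xi in x])
M = F @ F.T - np.einsum('i,ijk->jk', r, H)  # symmetric 2 x 2
DIM = r[k] * np.linalg.solve(M, F[:, k])    # analytic influence measure

h = 1e-4                                    # central finite difference
fd = (theta_w(1 + h) - theta_w(1 - h)) / (2 * h)
```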
The $DIM_{\hat\theta,k}$ derived in Theorem 5.1.2 measures the influence of the kth observation on all the parameter estimates in model (2.2) simultaneously. Therefore, $DIM_{\hat\theta,k}$ is regarded as a joint-parameter influence measure. However, it can be useful to measure the influence of the kth observation on a particular parameter estimate of the model. In order to assess the influence of the kth observation on the jth parameter estimate, $\hat\theta_j$, a marginal-parameter influence measure will be defined and its explicit expression will be derived.

Consider the perturbed model (5.3). Let $\hat{\boldsymbol\theta} = (\hat{\boldsymbol\theta}_1, \hat\theta_j)$ be a vector of parameter estimates, where $\hat{\boldsymbol\theta}_1 = (\hat\theta_1,\ldots,\hat\theta_{j-1},\hat\theta_{j+1},\ldots,\hat\theta_q)^T$ are the maximum likelihood estimates in the unperturbed model (2.2), and $\hat\theta_j$ is estimated from the perturbed model (5.3), with the parameter estimates $\hat{\boldsymbol\theta}_1$ inserted and regarded as known.
Definition 5.1.3. The marginal influence measure for assessing the influence of the kth observation on the parameter estimate $\hat\theta_j$ is defined as the following derivative
$$DIM_{\hat\theta_j,k} = \frac{d}{d\omega_k}\hat\theta_j(\omega_k)\bigg|_{\omega_k=1}, \qquad (5.13)$$
where $\hat\theta_j(\omega_k)$ is the weighted least squares estimate of $\theta_j$, given $\hat{\boldsymbol\theta}_1 = (\hat\theta_1,\ldots,\hat\theta_{j-1},\hat\theta_{j+1},\ldots,\hat\theta_q)^T$.

Observe that, in Definition 5.1.3, if $\omega_k \to 1$, then $\hat\theta_j(\omega_k) \to \hat\theta_j$, the unweighted least squares estimate.
The weighted least squares criterion and the normal equation for the case when a single parameter is estimated from the perturbed model are given by
$$Q(\omega_k) = \big(\mathbf{y}-\mathbf{f}(\mathbf{X},\hat{\boldsymbol\theta}_1,\theta_j)\big)^T\mathbf{W}(\omega_k)\big(\mathbf{y}-\mathbf{f}(\mathbf{X},\hat{\boldsymbol\theta}_1,\theta_j)\big),$$
and
$$\frac{d\mathbf{f}(\mathbf{X},\hat{\boldsymbol\theta}_1,\theta_j)}{d\theta_j}\mathbf{W}(\omega_k)\big(\mathbf{y}-\mathbf{f}(\mathbf{X},\hat{\boldsymbol\theta}_1,\hat\theta_j(\omega_k))\big) = 0. \qquad (5.14)$$
In the next theorem, an explicit expression of the marginal-parameter influence diagnostic $DIM_{\hat\theta_j,k}$ defined in (5.13) will be provided.
Theorem 5.1.3. Let $DIM_{\hat\theta_j,k}$ be given in Definition 5.1.3. Then
$$DIM_{\hat\theta_j,k} = r_k\,F_k(\hat\theta_j)\big(\mathbf{F}(\hat\theta_j)\mathbf{F}^T(\hat\theta_j) - \mathbf{G}(\hat\theta_j)\mathbf{r}\big)^{-1},$$
provided that the inverse exists, where $\mathbf{r} = (r_k) = \mathbf{y}-\mathbf{f}(\mathbf{X},\hat{\boldsymbol\theta}_1,\hat\theta_j)$,
$$\mathbf{F}(\hat\theta_j) = \big(F_1(\hat\theta_j),\ldots,F_n(\hat\theta_j)\big) = \frac{d\mathbf{f}(\mathbf{X},\hat{\boldsymbol\theta}_1,\theta_j)}{d\theta_j}\bigg|_{\theta_j=\hat\theta_j}, \quad 1\times n, \qquad (5.15)$$
$$\mathbf{G}(\hat\theta_j) = \frac{d\mathbf{F}(\theta_j)}{d\theta_j}\bigg|_{\theta_j=\hat\theta_j} = \frac{d^2\mathbf{f}(\mathbf{X},\hat{\boldsymbol\theta}_1,\theta_j)}{d\theta_j^2}\bigg|_{\theta_j=\hat\theta_j}, \quad 1\times n. \qquad (5.16)$$

Proof. The proof is very similar to the proof of Theorem 5.1.2. Consider inserting the weighted least squares estimate of $\theta_j$ in the normal equation (5.14),
$$\frac{d\mathbf{f}(\mathbf{X},\hat{\boldsymbol\theta}_1,\theta_j)}{d\theta_j}\bigg|_{\theta_j=\hat\theta_j(\omega_k)}\mathbf{W}(\omega_k)\big(\mathbf{y}-\mathbf{f}(\mathbf{X},\hat{\boldsymbol\theta}_1,\hat\theta_j(\omega_k))\big) = 0, \qquad (5.17)$$
and letting $\hat{\mathbf{F}} = \mathbf{F}(\hat\theta_j(\omega_k))$, $\mathbf{W} = \mathbf{W}(\omega_k)$ and $\mathbf{e} = \mathbf{y}-\mathbf{f}(\mathbf{X}) = \mathbf{y}-\mathbf{f}(\mathbf{X},\hat{\boldsymbol\theta}_1,\hat\theta_j(\omega_k))$. For the kth observation, $DIM_{\hat\theta_j,k}$ can be obtained by differentiating (5.17) on both sides with respect to $\omega_k$, i.e.
$$\frac{d}{d\omega_k}\hat{\mathbf{F}}\mathbf{W}\mathbf{e} = 0. \qquad (5.18)$$
Now, the product rule, defined in Appendix A, is used to calculate the derivative in (5.18):
$$\frac{d}{d\omega_k}\hat{\mathbf{F}}\mathbf{W}\mathbf{e} = \frac{d\hat{\mathbf{F}}}{d\omega_k}\mathbf{W}\mathbf{e} + \frac{d\mathbf{W}}{d\omega_k}(\mathbf{e}\otimes\hat{\mathbf{F}}^T) + \frac{d\mathbf{e}}{d\omega_k}\mathbf{W}\hat{\mathbf{F}}^T. \qquad (5.19)$$
Moreover, applying the chain rule, see Appendix A, to (5.19) gives
$$\frac{d\hat\theta_j(\omega_k)}{d\omega_k}\frac{d\hat{\mathbf{F}}}{d\hat\theta_j(\omega_k)}\mathbf{W}\mathbf{e} + \mathbf{d}_k^T\mathbf{e}\otimes\mathbf{d}_k^T\hat{\mathbf{F}}^T - \frac{d\hat\theta_j(\omega_k)}{d\omega_k}\frac{d\mathbf{f}(\mathbf{X})}{d\hat\theta_j(\omega_k)}\mathbf{W}\hat{\mathbf{F}}^T = 0,$$
and then rearranging terms yields
$$\mathbf{d}_k^T\mathbf{e}\otimes\mathbf{d}_k^T\hat{\mathbf{F}}^T = \frac{d\hat\theta_j(\omega_k)}{d\omega_k}\left(\frac{d\mathbf{f}(\mathbf{X})}{d\hat\theta_j(\omega_k)}\mathbf{W}\hat{\mathbf{F}}^T - \frac{d\hat{\mathbf{F}}}{d\hat\theta_j(\omega_k)}\mathbf{W}\mathbf{e}\right).$$
As previously mentioned, evaluating the derivative at $\omega_k = 1$ gives $\hat\theta_j = \hat\theta_j(\omega_k=1)$, and denoting $\mathbf{r} = \mathbf{y}-\mathbf{f}(\mathbf{X},\hat{\boldsymbol\theta}_1,\hat\theta_j(\omega_k=1))$ implies
$$\mathbf{d}_k^T\mathbf{r}\otimes\mathbf{d}_k^T\mathbf{F}^T(\hat\theta_j) = \frac{d\hat\theta_j(\omega_k)}{d\omega_k}\bigg|_{\omega_k=1}\big(\mathbf{F}(\hat\theta_j)\mathbf{F}^T(\hat\theta_j) - \mathbf{G}(\hat\theta_j)\mathbf{r}\big),$$
$$r_k F_k(\hat\theta_j) = DIM_{\hat\theta_j,k}\big(\mathbf{F}(\hat\theta_j)\mathbf{F}^T(\hat\theta_j) - \mathbf{G}(\hat\theta_j)\mathbf{r}\big).$$
Thus, the final expression for $DIM_{\hat\theta_j,k}$ is
$$DIM_{\hat\theta_j,k} = r_k F_k(\hat\theta_j)\big(\mathbf{F}(\hat\theta_j)\mathbf{F}^T(\hat\theta_j) - \mathbf{G}(\hat\theta_j)\mathbf{r}\big)^{-1}.$$
The proof is complete. $\square$
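A particularly simple check of Theorem 5.1.3 is available for the Michaelis-Menten model with $j = 1$: since $f$ is linear in $\theta_1$, we have $\mathbf{G}(\hat\theta_1) = \mathbf{0}$ and the measure reduces to $r_k F_k/(\mathbf{F}\mathbf{F}^T)$. The sketch below (illustrative data, $\hat\theta_2$ treated as known) compares this with the derivative of the closed-form weighted estimate of $\theta_1$:

```python
# Marginal measure DIM_{theta1,k} for the Michaelis-Menten model.
# f is linear in theta1, so the second derivative G(theta1) vanishes.
import numpy as np

x = np.array([0.02, 0.06, 0.11, 0.22, 0.56, 1.1])
y = np.array([50.0, 85.0, 110.0, 130.0, 155.0, 165.0])
t2 = 0.05                                   # theta2 regarded as known
z = x / (t2 + x)                            # F(theta1), a 1 x n vector
k = 1                                       # observation under study

def t1_w(w):
    """Closed-form weighted LS estimate of theta1, weight w on obs k."""
    wts = np.ones_like(x)
    wts[k] = w
    return (wts * z) @ y / ((wts * z) @ z)

t1 = t1_w(1.0)
r = y - t1 * z                              # residuals given theta2
DIM = r[k] * z[k] / (z @ z)                 # Theorem 5.1.3 with G = 0

h = 1e-7                                    # central finite difference
fd = (t1_w(1 + h) - t1_w(1 - h)) / (2 * h)
```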
5.1.3 A note on $DIM_{\hat\theta,k}$ and $DIM_{\hat\theta_j,k}$
When deriving the influence measures and studying the single observations' influence on the parameter estimates, we observe some interesting aspects of influence analysis in nonlinear regression.

A benefit of using the differentiation approach, where we compute derivatives of various quantities with respect to $\omega_k$ and evaluate the derivatives at $\omega_k = 1$, is that no additional iterations for computing the parameter estimates are needed. As was discussed in Section 5.1.1, an alternative way of using the differentiation approach is to evaluate the same derivatives as $\omega_k \to 0$. If this approach were used instead, the explicit expressions of $DIM_{\hat\theta,k}$ and $DIM_{\hat\theta_j,k}$ would be functions of the parameter estimates with weights attached. As an example, consider the following derivative
$$\frac{d\mathbf{f}(\mathbf{X},\hat{\boldsymbol\theta}(\omega_k))}{d\omega_k}\bigg|_{\omega_k\to 0} = \mathbf{F}(\hat{\boldsymbol\theta}(\omega_k)),$$
where $\omega_k \to 0$. This means that we would need to compute a parameter estimate for each $k$, and additional iterations are needed. On the contrary, with the new method proposed in this thesis,
$$\frac{d\mathbf{f}(\mathbf{X},\hat{\boldsymbol\theta}(\omega_k))}{d\omega_k}\bigg|_{\omega_k=1} = \mathbf{F}(\hat{\boldsymbol\theta}),$$
which is the derivative of the expectation function from the unperturbed model (2.2), and hence no additional iterations are needed.
We can further make a comparison between the proposed measure, $DIM_{\hat\theta,k}$, and the nonlinear version of Cook's distance, discussed in Chapter 3 and given by
$$\frac{\big(\hat{\boldsymbol\theta}-\hat{\boldsymbol\theta}_{(k)}\big)^T\mathbf{F}(\hat{\boldsymbol\theta})\mathbf{F}^T(\hat{\boldsymbol\theta})\big(\hat{\boldsymbol\theta}-\hat{\boldsymbol\theta}_{(k)}\big)}{q\hat\sigma^2}, \quad k = 1,\ldots,n,$$
where $q$ is the number of parameters in the model, $\hat{\boldsymbol\theta}_{(k)}$ is the estimate of $\boldsymbol\theta$ when the kth observation is excluded from the calculations, and $\mathbf{F}(\hat{\boldsymbol\theta})$ is defined in (5.7). The nonlinear version of Cook's distance is based on case-deletion. A consequence of this is that re-estimation of the parameters is needed for every observation we are interested in. Thus, the nonlinear version of Cook's distance demands additional iterations when estimating the parameters, which is avoided using our measure $DIM_{\hat\theta,k}$.
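The computational point is visible in a direct implementation: the case-deletion measure needs one nonlinear re-fit per observation. The sketch below (SciPy assumed; data and the variance convention $\hat\sigma^2 = SSE/(n-q)$ are illustrative choices, not the thesis') computes the nonlinear Cook's distance by brute force:

```python
# Case-deletion sketch: nonlinear Cook's distance requires a re-fit for
# every deleted observation, which the derivative-based DIM avoids.
import numpy as np
from scipy.optimize import least_squares

x = np.array([0.02, 0.06, 0.11, 0.22, 0.56, 1.1])
y = np.array([50.0, 85.0, 110.0, 130.0, 155.0, 165.0])

def fit(xs, ys):
    """Least squares fit of the Michaelis-Menten model."""
    res = lambda t: ys - t[0] * xs / (t[1] + xs)
    return least_squares(res, x0=(170.0, 0.05)).x

t = fit(x, y)                               # full-data estimate
r = y - t[0] * x / (t[1] + x)
s2 = r @ r / (len(x) - 2)                   # one common variance choice
F = np.vstack([x / (t[1] + x), -t[0] * x / (t[1] + x) ** 2])

cook = np.empty(len(x))
for k in range(len(x)):                     # one re-fit per observation
    d = t - fit(np.delete(x, k), np.delete(y, k))
    cook[k] = d @ (F @ F.T) @ d / (2 * s2)  # q = 2 parameters
```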
The joint-parameter influence measure $DIM_{\hat\theta,k}$ is a $1\times q$ vector. The computation of $DIM_{\hat\theta,k}$ will result in $q$ values, one for each parameter estimate, which indicate whether the kth observation is influential on the specific parameter estimate. However, it is worth noting that the influence measure $DIM_{\hat\theta,k}$ is affected by the dependencies among the estimated parameters, due to the fact that they are estimated jointly. For instance, for the modified Gompertz growth curve model (2.6) described in Chapter 2, the parameter $\mu_m$ is defined as the slope of the tangent line at the point of inflection and the parameter $\lambda$ is defined as the intercept of the tangent line. There is a dependence between $\lambda$ and $\mu_m$ when estimated values are used. If an observation has a strong influencing effect on $\hat\mu_m$, this effect will be partly transmitted to the value of the influence measure regarding $\hat\lambda$ as well. In the case of using the Michaelis-Menten regression model (2.4) for enzyme kinetics, the parameter $\theta_1$ is defined as the maximum initial velocity, which is attained when the enzyme has been saturated by an infinite concentration of substrate. The parameter $\theta_2$ is defined as the value of substrate corresponding to half the maximum velocity. For inference we remark that observations that are highly influential on $\hat\theta_1$ will probably show impact on $\hat\theta_2$ as well.
If one wants to be "certain" of what effect the observations have on a particular parameter estimate, one should use $DIM_{\hat\theta_j,k}$. This measure is constructed when only the jth parameter is estimated from the perturbed model and the other parameters in the model are assumed to be known, i.e. their estimates from the unperturbed model are regarded as the true parameter values. The fact that we are able to assess the influence of observations on a specific parameter estimate is clearly beneficial over the previously proposed approaches to influence analysis in nonlinear regression. The nonlinear version of Cook's distance can be used when assessing the influence of an observation on the whole vector of parameter estimates is of interest. The extension of the local influence approach from linear regression models to nonlinear regression models is considered by St. Laurent and Cook (1993). Their results concern the assessment of the influence of the observations on the fitted values, and there is no suggestion of how the local influence approach can be extended to assessing the influence on a specific parameter estimate.
It is worth noting that the values of $DIM_{\hat\theta,k}$ and $DIM_{\hat\theta_j,k}$ can be positive or negative. A positive value of the influence measure for a given observation means that the presence of this observation increases the value of the corresponding parameter estimate. In a similar way, a negative value for a given observation means that the presence of that observation decreases the parameter estimate.
Nonlinear regression models can differ greatly, and it is important to know the shape of the expectation function used in the real-life problem. Some observations might be more influential if they are located in an area that is important for the estimation of the parameters. These areas are of course different for different nonlinear regression models. Both the Michaelis-Menten model and the modified Gompertz growth curve model contain a parameter representing an asymptotic value. Observations located in the area near this asymptote are expected to be more important in the estimation process. These observations will thus be more influential on the parameter estimates than other observations not located in this area. This fact certainly provides more information to the analysis, but it does not necessarily mean that an observation with a high absolute value of the diagnostic measure is influential in the sense that the observation is spurious. This aspect of influence analysis in nonlinear regression will be discussed further in the numerical example in Section 5.1.4.
Inspecting the explicit expression of the influence measure $DIM_{\hat\theta,k}$ given in Theorem 5.1.2, we observe that it is a function of the kth residual and that it is related to the kth diagonal element of the tangent plane leverage matrix discussed in Chapter 3. To see this, first consider the result from the inverse binomial theorem (see e.g. Kollo and von Rosen, 2010):
$$(\mathbf{A}+\mathbf{U}\mathbf{B}\mathbf{V})^{-1} = \mathbf{A}^{-1} - \mathbf{A}^{-1}\mathbf{U}\mathbf{B}\big(\mathbf{B}+\mathbf{B}\mathbf{V}\mathbf{A}^{-1}\mathbf{U}\mathbf{B}\big)^{-1}\mathbf{B}\mathbf{V}\mathbf{A}^{-1},$$
provided that $\mathbf{A}$ and $\mathbf{B}+\mathbf{B}\mathbf{V}\mathbf{A}^{-1}\mathbf{U}\mathbf{B}$ are nonsingular. If we want to invert $(\mathbf{A}-\mathbf{U}\mathbf{B}\mathbf{V})$ we have that
$$(\mathbf{A}-\mathbf{U}\mathbf{B}\mathbf{V})^{-1} = \mathbf{A}^{-1} - \mathbf{A}^{-1}\mathbf{U}\mathbf{B}\big(\mathbf{B}\mathbf{V}\mathbf{A}^{-1}\mathbf{U}\mathbf{B}-\mathbf{B}\big)^{-1}\mathbf{B}\mathbf{V}\mathbf{A}^{-1}. \qquad (5.20)$$
In the expression of $DIM_{\hat\theta,k}$, let $\mathbf{F}(\hat{\boldsymbol\theta}) = \mathbf{F}$ and $\mathbf{G}(\hat{\boldsymbol\theta}) = \mathbf{G}$. Applying (5.20),
$$DIM_{\hat\theta,k} = r_k\mathbf{F}_k^T\big(\mathbf{F}\mathbf{F}^T-\mathbf{G}(\mathbf{r}\otimes\mathbf{I}_q)\big)^{-1} = r_k\mathbf{F}_k^T(\mathbf{F}\mathbf{F}^T)^{-1} \qquad (5.21)$$
$$\qquad - r_k\mathbf{F}_k^T\Big((\mathbf{F}\mathbf{F}^T)^{-1}\mathbf{G}\big((\mathbf{r}\otimes\mathbf{I})(\mathbf{F}\mathbf{F}^T)^{-1}\mathbf{G}-\mathbf{I}\big)^{-1}(\mathbf{r}\otimes\mathbf{I})(\mathbf{F}\mathbf{F}^T)^{-1}\Big).$$
Now, the kth diagonal element of the tangent plane leverage matrix, see (3.7), is given by
$$\mathbf{F}_k^T(\mathbf{F}\mathbf{F}^T)^{-1}\mathbf{F}_k,$$
and we see that the first term of the expression on the right hand side of (5.21) is a function of the kth residual and related to the kth diagonal element of the tangent plane leverage matrix. A high value of the residual and/or a high leverage value for the kth observation might result in an influential observation. Thus, the investigation of the residuals and the leverages of the observations can contribute to a deeper understanding of why an observation is influential or not.
In a similar manner, we observe that the explicit expression of the marginal influence measure $DIM_{\hat\theta_j,k}$, given in Theorem 5.1.3, is a function of
$$\mathbf{F}_k^T(\hat\theta_j)\big(\mathbf{F}(\hat\theta_j)\mathbf{F}^T(\hat\theta_j)\big)^{-1}\mathbf{F}_k(\hat\theta_j). \qquad (5.22)$$
The quantity in (5.22) is the kth diagonal element of the (projection) matrix
$$\mathbf{F}^T(\hat\theta_j)\big(\mathbf{F}(\hat\theta_j)\mathbf{F}^T(\hat\theta_j)\big)^{-1}\mathbf{F}(\hat\theta_j). \qquad (5.23)$$
Since the derivative of $\mathbf{f}(\mathbf{X},\boldsymbol\theta)$ with respect to $\theta_j$ is considered exclusively in (5.23), we denote the quantity in (5.22) as the marginal leverage. When studying the influence of single observations on the specific parameter estimate $\hat\theta_j$, an investigation of the residuals and the marginal leverages forms a good basis.
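Since $\mathbf{F}(\hat\theta_j)$ is $1\times n$, the matrix in (5.23) is a rank-one projection: the marginal leverages are nonnegative and sum to one. A small sketch for $\theta_2$ in the Michaelis-Menten model (the parameter values are illustrative, not estimates from data):

```python
# Marginal leverages (5.22) for theta2 in the Michaelis-Menten model:
# diagonal of a rank-one projection, hence nonnegative with sum one.
import numpy as np

x = np.array([0.02, 0.06, 0.11, 0.22, 0.56, 1.1])
t1, t2 = 170.0, 0.05                       # illustrative estimates
Fj = -t1 * x / (t2 + x) ** 2               # F(theta2), a 1 x n vector
lev = Fj ** 2 / (Fj @ Fj)                  # marginal leverage of each obs
```

Inspecting `lev` shows which substrate concentrations carry the most information about $\theta_2$ under this parametrization.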
The discussion about leverages has led us to suggest a modification of the influence measure $DIM_{\hat\theta,k}$. Consider post-multiplying $DIM_{\hat\theta,k}$ by $\mathbf{F}_k$:
$$DIM^{*}_{\hat\theta,k} = r_k\mathbf{F}_k^T\big(\mathbf{F}\mathbf{F}^T-\mathbf{G}(\mathbf{r}\otimes\mathbf{I}_q)\big)^{-1}\mathbf{F}_k.$$
Applying (5.21),
$$DIM^{*}_{\hat\theta,k} = r_k\mathbf{F}_k^T(\mathbf{F}\mathbf{F}^T)^{-1}\mathbf{F}_k - r_k\mathbf{F}_k^T\Big((\mathbf{F}\mathbf{F}^T)^{-1}\mathbf{G}\big((\mathbf{r}\otimes\mathbf{I})(\mathbf{F}\mathbf{F}^T)^{-1}\mathbf{G}-\mathbf{I}\big)^{-1}(\mathbf{r}\otimes\mathbf{I})(\mathbf{F}\mathbf{F}^T)^{-1}\Big)\mathbf{F}_k,$$
and we observe that the first term of $DIM^{*}_{\hat\theta,k}$ consists of the kth residual and the leverage of the kth observation. Moreover, $DIM^{*}_{\hat\theta,k}$ is a scalar, and this measure can be regarded as a collective influence measure for all parameters in the model.
Next, the components of the diagnostic measures $\mathrm{DIM}_{\hat{\boldsymbol\theta},k}$ and $\mathrm{DIM}_{\hat\theta_j,k}$ will be illustrated.
Example 5.1 Illustration of the structures of $\mathrm{DIM}_{\hat{\boldsymbol\theta},k}$ and $\mathrm{DIM}_{\hat\theta_j,k}$

Consider the Michaelis-Menten model,
$$
y_i = \frac{\theta_1 x_i}{\theta_2 + x_i} + \varepsilon_i, \quad i = 1,2,3,
$$
where $\boldsymbol\theta = (\theta_1,\theta_2)^T$ and $\boldsymbol\varepsilon \sim N_3(\mathbf{0},\sigma^2\mathbf{I})$. We observe that in a practical situation more than three observations are needed in order to estimate the model. However, since this is merely a demonstration of the explicit expressions of $\mathrm{DIM}_{\hat{\boldsymbol\theta},k}$ and $\mathrm{DIM}_{\hat\theta_j,k}$, we use three observations in order to simplify the expressions.
Let us assume that we want to assess the influence of the 2nd observation on the vector of parameter estimates, $\hat{\boldsymbol\theta}$. The diagnostic measure to use is
$$
\mathrm{DIM}_{\hat{\boldsymbol\theta},2} = r_2 \mathbf{F}_2^T(\hat{\boldsymbol\theta})\left(\mathbf{F}(\hat{\boldsymbol\theta})\mathbf{F}^T(\hat{\boldsymbol\theta})-\mathbf{G}(\hat{\boldsymbol\theta})(\mathbf{r}\otimes\mathbf{I}_2)\right)^{-1}, \qquad (5.24)
$$
where
$$
\mathbf{F}(\hat{\boldsymbol\theta}) = \frac{d\mathbf{f}(\mathbf{X},\hat{\boldsymbol\theta})}{d\hat{\boldsymbol\theta}}
=
\begin{pmatrix}
\frac{df_1(\hat{\boldsymbol\theta})}{d\theta_1} & \frac{df_2(\hat{\boldsymbol\theta})}{d\theta_1} & \frac{df_3(\hat{\boldsymbol\theta})}{d\theta_1}\\[4pt]
\frac{df_1(\hat{\boldsymbol\theta})}{d\theta_2} & \frac{df_2(\hat{\boldsymbol\theta})}{d\theta_2} & \frac{df_3(\hat{\boldsymbol\theta})}{d\theta_2}
\end{pmatrix}
=
\begin{pmatrix}
\frac{x_1}{\hat\theta_2+x_1} & \frac{x_2}{\hat\theta_2+x_2} & \frac{x_3}{\hat\theta_2+x_3}\\[4pt]
\frac{-\hat\theta_1 x_1}{(\hat\theta_2+x_1)^2} & \frac{-\hat\theta_1 x_2}{(\hat\theta_2+x_2)^2} & \frac{-\hat\theta_1 x_3}{(\hat\theta_2+x_3)^2}
\end{pmatrix}
$$
and
$$
\begin{aligned}
\mathbf{G}(\hat{\boldsymbol\theta}) &= \frac{d\mathbf{F}(\hat{\boldsymbol\theta})}{d\hat{\boldsymbol\theta}}
=
\begin{pmatrix}
\frac{d^2 f_1(\hat{\boldsymbol\theta})}{d\theta_1^2} & \frac{d^2 f_1(\hat{\boldsymbol\theta})}{d\theta_1 d\theta_2} & \frac{d^2 f_2(\hat{\boldsymbol\theta})}{d\theta_1^2} & \frac{d^2 f_2(\hat{\boldsymbol\theta})}{d\theta_1 d\theta_2} & \frac{d^2 f_3(\hat{\boldsymbol\theta})}{d\theta_1^2} & \frac{d^2 f_3(\hat{\boldsymbol\theta})}{d\theta_1 d\theta_2}\\[4pt]
\frac{d^2 f_1(\hat{\boldsymbol\theta})}{d\theta_2 d\theta_1} & \frac{d^2 f_1(\hat{\boldsymbol\theta})}{d\theta_2^2} & \frac{d^2 f_2(\hat{\boldsymbol\theta})}{d\theta_2 d\theta_1} & \frac{d^2 f_2(\hat{\boldsymbol\theta})}{d\theta_2^2} & \frac{d^2 f_3(\hat{\boldsymbol\theta})}{d\theta_2 d\theta_1} & \frac{d^2 f_3(\hat{\boldsymbol\theta})}{d\theta_2^2}
\end{pmatrix}\\[6pt]
&=
\begin{pmatrix}
0 & \frac{-x_1}{(\hat\theta_2+x_1)^2} & 0 & \frac{-x_2}{(\hat\theta_2+x_2)^2} & 0 & \frac{-x_3}{(\hat\theta_2+x_3)^2}\\[4pt]
\frac{-x_1}{(\hat\theta_2+x_1)^2} & \frac{2\hat\theta_1 x_1}{(\hat\theta_2+x_1)^3} & \frac{-x_2}{(\hat\theta_2+x_2)^2} & \frac{2\hat\theta_1 x_2}{(\hat\theta_2+x_2)^3} & \frac{-x_3}{(\hat\theta_2+x_3)^2} & \frac{2\hat\theta_1 x_3}{(\hat\theta_2+x_3)^3}
\end{pmatrix}.
\end{aligned}
$$
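The entries of $\mathbf{F}(\hat{\boldsymbol\theta})$ and $\mathbf{G}(\hat{\boldsymbol\theta})$ for the Michaelis-Menten mean function can be spot-checked against finite differences; a small sketch (the evaluation point $\theta_1=\theta_2=1$, $x=0.75$ is an illustrative assumption):

```python
# FD check of the analytic first and second derivatives that make up
# F(theta) and G(theta) for f(x, theta) = theta1*x/(theta2 + x).

def f(t1, t2, x):
    return t1 * x / (t2 + x)

t1, t2, x, h = 1.0, 1.0, 0.75, 1e-4

d1 = x / (t2 + x)                      # df/dtheta1
d2 = -t1 * x / (t2 + x) ** 2           # df/dtheta2
d22 = 2 * t1 * x / (t2 + x) ** 3       # d2f/dtheta2^2
d12 = -x / (t2 + x) ** 2               # d2f/dtheta1 dtheta2 (d2f/dtheta1^2 = 0)

fd1 = (f(t1 + h, t2, x) - f(t1 - h, t2, x)) / (2 * h)
fd2 = (f(t1, t2 + h, x) - f(t1, t2 - h, x)) / (2 * h)
fd22 = (f(t1, t2 + h, x) - 2 * f(t1, t2, x) + f(t1, t2 - h, x)) / h ** 2
fd12 = (f(t1 + h, t2 + h, x) - f(t1 + h, t2 - h, x)
        - f(t1 - h, t2 + h, x) + f(t1 - h, t2 - h, x)) / (4 * h * h)
```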
Now, we investigate the matrix $\left(\mathbf{F}(\hat{\boldsymbol\theta})\mathbf{F}^T(\hat{\boldsymbol\theta})-\mathbf{G}(\hat{\boldsymbol\theta})(\mathbf{r}\otimes\mathbf{I}_2)\right)$, whose inverse should be used for the calculation of $\mathrm{DIM}_{\hat{\boldsymbol\theta},2}$ in (5.24). First,
$$
\mathbf{F}(\hat{\boldsymbol\theta})\mathbf{F}^T(\hat{\boldsymbol\theta}) =
\begin{pmatrix}
\sum_{i=1}^{3}\left(\frac{df_i(\hat{\boldsymbol\theta})}{d\theta_1}\right)^2 & \sum_{i=1}^{3}\frac{df_i(\hat{\boldsymbol\theta})}{d\theta_1}\frac{df_i(\hat{\boldsymbol\theta})}{d\theta_2}\\[6pt]
\sum_{i=1}^{3}\frac{df_i(\hat{\boldsymbol\theta})}{d\theta_1}\frac{df_i(\hat{\boldsymbol\theta})}{d\theta_2} & \sum_{i=1}^{3}\left(\frac{df_i(\hat{\boldsymbol\theta})}{d\theta_2}\right)^2
\end{pmatrix}
=
\begin{pmatrix}
\sum_{i=1}^{3}\left(\frac{x_i}{\hat\theta_2+x_i}\right)^2 & \sum_{i=1}^{3}\frac{-\hat\theta_1 x_i^2}{(\hat\theta_2+x_i)^3}\\[6pt]
\sum_{i=1}^{3}\frac{-\hat\theta_1 x_i^2}{(\hat\theta_2+x_i)^3} & \sum_{i=1}^{3}\left(\frac{-\hat\theta_1 x_i}{(\hat\theta_2+x_i)^2}\right)^2
\end{pmatrix}.
$$
Secondly, we calculate
$$
\mathbf{G}(\hat{\boldsymbol\theta})(\mathbf{r}\otimes\mathbf{I}_2) =
\begin{pmatrix}
\sum_{i=1}^{3}\frac{d^2 f(x_i,\hat{\boldsymbol\theta})}{d\theta_1^2}r_i & \sum_{i=1}^{3}\frac{d^2 f(x_i,\hat{\boldsymbol\theta})}{d\theta_1 d\theta_2}r_i\\[6pt]
\sum_{i=1}^{3}\frac{d^2 f(x_i,\hat{\boldsymbol\theta})}{d\theta_2 d\theta_1}r_i & \sum_{i=1}^{3}\frac{d^2 f(x_i,\hat{\boldsymbol\theta})}{d\theta_2^2}r_i
\end{pmatrix}
=
\begin{pmatrix}
0 & \sum_{i=1}^{3}\frac{-x_i r_i}{(\hat\theta_2+x_i)^2}\\[6pt]
\sum_{i=1}^{3}\frac{-x_i r_i}{(\hat\theta_2+x_i)^2} & \sum_{i=1}^{3}\frac{2\hat\theta_1 x_i r_i}{(\hat\theta_2+x_i)^3}
\end{pmatrix}.
$$
Thus,
$$
\mathrm{DIM}_{\hat{\boldsymbol\theta},2} =
\left(\frac{x_2 r_2}{\hat\theta_2+x_2},\; \frac{-\hat\theta_1 x_2 r_2}{(\hat\theta_2+x_2)^2}\right)
\left[
\begin{pmatrix}
\sum_{i=1}^{3}\left(\frac{x_i}{\hat\theta_2+x_i}\right)^2 & \sum_{i=1}^{3}\frac{-\hat\theta_1 x_i^2}{(\hat\theta_2+x_i)^3}\\[6pt]
\sum_{i=1}^{3}\frac{-\hat\theta_1 x_i^2}{(\hat\theta_2+x_i)^3} & \sum_{i=1}^{3}\left(\frac{-\hat\theta_1 x_i}{(\hat\theta_2+x_i)^2}\right)^2
\end{pmatrix}
-
\begin{pmatrix}
0 & \sum_{i=1}^{3}\frac{-x_i r_i}{(\hat\theta_2+x_i)^2}\\[6pt]
\sum_{i=1}^{3}\frac{-x_i r_i}{(\hat\theta_2+x_i)^2} & \sum_{i=1}^{3}\frac{2\hat\theta_1 x_i r_i}{(\hat\theta_2+x_i)^3}
\end{pmatrix}
\right]^{-1}.
$$
Next, the marginal influence of the 2nd observation on the parameter estimate $\hat\theta_2$ will be studied, and the diagnostic measure to use is
$$
\mathrm{DIM}_{\hat\theta_2,2} = r_2 F_2(\hat\theta_2)\left(\mathbf{F}(\hat\theta_2)\mathbf{F}^T(\hat\theta_2)-\mathbf{G}(\hat\theta_2)\mathbf{r}\right)^{-1}. \qquad (5.25)
$$
In (5.25) the vector of the first derivatives of $\mathbf{f}(\mathbf{X},\boldsymbol\theta)$ is given by
$$
\mathbf{F}(\hat\theta_2) = \left(\frac{-\hat\theta_1 x_1}{(\hat\theta_2+x_1)^2},\; \frac{-\hat\theta_1 x_2}{(\hat\theta_2+x_2)^2},\; \frac{-\hat\theta_1 x_3}{(\hat\theta_2+x_3)^2}\right),
$$
and the vector of second derivatives is the following:
$$
\mathbf{G}(\hat\theta_2) = \left(\frac{2\hat\theta_1 x_1}{(\hat\theta_2+x_1)^3},\; \frac{2\hat\theta_1 x_2}{(\hat\theta_2+x_2)^3},\; \frac{2\hat\theta_1 x_3}{(\hat\theta_2+x_3)^3}\right).
$$
The matrix $\left(\mathbf{F}(\hat\theta_2)\mathbf{F}^T(\hat\theta_2)-\mathbf{G}(\hat\theta_2)\mathbf{r}\right)$ is of interest, because its inverse will be used for the calculation of $\mathrm{DIM}_{\hat\theta_2,2}$. Observe that
$$
\mathbf{F}(\hat\theta_2)\mathbf{F}^T(\hat\theta_2) = \sum_{i=1}^{3}\left(\frac{df_i(\hat\theta_2)}{d\theta_2}\right)^2 = \sum_{i=1}^{3}\left(\frac{-\hat\theta_1 x_i}{(\hat\theta_2+x_i)^2}\right)^2 = \sum_{i=1}^{3}\frac{(\hat\theta_1 x_i)^2}{(\hat\theta_2+x_i)^4},
$$
and
$$
\mathbf{G}(\hat\theta_2)\mathbf{r} = \sum_{i=1}^{3}\frac{d^2 f_i(\hat\theta_2)}{d\theta_2^2}\, r_i = \sum_{i=1}^{3}\frac{2\hat\theta_1 x_i}{(\hat\theta_2+x_i)^3}\left(y_i-\frac{\hat\theta_1 x_i}{\hat\theta_2+x_i}\right) = \sum_{i=1}^{3}\left(\frac{2\hat\theta_1 x_i y_i}{(\hat\theta_2+x_i)^3}-\frac{2(\hat\theta_1 x_i)^2}{(\hat\theta_2+x_i)^4}\right).
$$
Since $r_2 F_2 = \frac{-\hat\theta_1 x_2 r_2}{(\hat\theta_2+x_2)^2}$, we have that
$$
\mathrm{DIM}_{\hat\theta_2,2} = \frac{-\hat\theta_1 x_2 r_2}{(\hat\theta_2+x_2)^2}\left(\sum_{i=1}^{3}\frac{(\hat\theta_1 x_i)^2}{(\hat\theta_2+x_i)^4}-\sum_{i=1}^{3}\left(\frac{2\hat\theta_1 x_i y_i}{(\hat\theta_2+x_i)^3}-\frac{2(\hat\theta_1 x_i)^2}{(\hat\theta_2+x_i)^4}\right)\right)^{-1}.
$$
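Because $\mathrm{DIM}_{\hat\theta_2,k}$ is defined as a derivative of the weighted least squares estimate with respect to $\omega_k$, the closed form can be checked against a finite-difference quotient. A Python sketch (assumed data; $\theta_1$ is held fixed at 1 so that only $\theta_2$ is estimated, matching the marginal setting):

```python
# Check: closed-form DIM_{theta2,k} = r_k F_k (F F^T - G r)^{-1} versus
# a finite difference of the weighted LS estimate theta2(omega).

THETA1 = 1.0
XS = [0.25, 0.5, 0.75, 1.0, 1.5]       # assumed design points
YS = [0.22, 0.31, 0.46, 0.48, 0.62]    # assumed responses

def fit_theta2(w, theta2=1.0):
    """Newton iteration on the weighted normal equation for theta2."""
    for _ in range(100):
        g = gp = 0.0
        for wi, x, y in zip(w, XS, YS):
            F = -THETA1 * x / (theta2 + x) ** 2      # df/dtheta2
            G = 2 * THETA1 * x / (theta2 + x) ** 3   # d2f/dtheta2^2
            r = y - THETA1 * x / (theta2 + x)
            g += wi * F * r
            gp += wi * (G * r - F * F)
        step = g / gp
        theta2 -= step
        if abs(step) < 1e-13:
            break
    return theta2

theta2_hat = fit_theta2([1.0] * len(XS))
k = 1                                   # assess the 2nd observation
F_k = -THETA1 * XS[k] / (theta2_hat + XS[k]) ** 2
r_k = YS[k] - THETA1 * XS[k] / (theta2_hat + XS[k])
den = sum((-THETA1 * x / (theta2_hat + x) ** 2) ** 2
          - 2 * THETA1 * x / (theta2_hat + x) ** 3
          * (y - THETA1 * x / (theta2_hat + x))
          for x, y in zip(XS, YS))
dim = r_k * F_k / den

# finite-difference derivative of theta2(omega) w.r.t. omega_k at omega = 1
eps = 1e-6
w_hi = [1.0] * len(XS); w_hi[k] += eps
w_lo = [1.0] * len(XS); w_lo[k] -= eps
fd = (fit_theta2(w_hi) - fit_theta2(w_lo)) / (2 * eps)
```

The two quantities agree to finite-difference accuracy, which is exactly what the differentiation approach promises.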
As an illustration of how $\mathrm{DIM}_{\hat{\boldsymbol\theta},k}$ and $\mathrm{DIM}_{\hat\theta_j,k}$ can be used in a practical situation, we will present two numerical examples using simulated data in Sections 5.1.4 and 5.1.5.
5.1.4 Numerical example: Influence analysis using $\mathrm{DIM}_{\hat{\boldsymbol\theta},k}$

The Michaelis-Menten model is used for studying enzyme kinetics, and it relates the initial velocity, $y$, of an enzymatic reaction to the substrate concentration, $x$, through the equation
$$
f(x,\boldsymbol\theta) = \frac{\theta_1 x}{\theta_2 + x}.
$$
In this numerical example we fit the Michaelis-Menten model using simulated data. The data are simulated using an approach similar to that of Atkins and Nimmo (1975). First, a set of 'perfect', i.e. error-free, data is formed with $\theta_1 = 1$ and $\theta_2 = 1$. The values of substrate concentration are
$$
x = (0.25, 0.50, 0.75, 1.00, 1.25, 1.50, 1.75).
$$
We make seven replicates of each $x$-value, and a data set of 49 observations is created. Then the $y$-values are simulated from the perfect set using normally distributed errors with a mean of zero and a standard deviation equal to 0.1. Thus, we let
$$
y_{ij} = \frac{\theta_1 x_i}{\theta_2 + x_i} + \varepsilon_{ij}, \qquad (5.26)
$$
where the $\varepsilon_{ij} \overset{\text{i.i.d.}}{\sim} N(0, 0.1^2)$ for $i, j = 1, \ldots, 7$.
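The data-generating step in (5.26) can be sketched as follows (the random seed is an arbitrary assumption; the draws behind Table 5.1 were of course different):

```python
# Sketch of the simulation in (5.26): 7 x-levels, each replicated 7
# times, with additive N(0, 0.1^2) errors. Seed is an assumption.
import random

random.seed(2015)
theta1, theta2, sigma = 1.0, 1.0, 0.1
x_levels = [0.25, 0.50, 0.75, 1.00, 1.25, 1.50, 1.75]

data = [(x, theta1 * x / (theta2 + x) + random.gauss(0.0, sigma))
        for x in x_levels for _ in range(7)]
```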
In order to verify whether our suggested influence measure $\mathrm{DIM}_{\hat{\boldsymbol\theta},k}$ can detect an influential observation, we contaminate the 40th observation by increasing its $y$-value. The reason for choosing observation 40 is that this observation is located in the area which is expected to be important for the estimation of $\theta_1$. By increasing its $y$-value we expect that this observation will be declared the most influential when using $\mathrm{DIM}_{\hat{\boldsymbol\theta},k}$ as influence measure. Moreover, there is a strong dependence between $\hat\theta_1$ and $\hat\theta_2$, and we expect that this observation will stand out as influential on $\hat\theta_2$ as well. The data are presented and plotted in Table 5.1 and Figure 5.1, respectively.
y_ij   x_i     y_ij   x_i     y_ij   x_i
0.27   0.25    0.46   0.75    0.37   1.25
0.08   0.25    0.46   0.75    0.75   1.25
0.37   0.25    0.59   0.75    0.42   1.50
0.18   0.25    0.63   0.75    0.60   1.50
0.21   0.25    0.56   1.00    0.74   1.50
0.25   0.25    0.48   1.00    0.59   1.50
0.36   0.25    0.36   1.00    1.20   1.50
0.37   0.50    0.48   1.00    0.62   1.50
0.52   0.50    0.58   1.00    0.68   1.50
0.12   0.50    0.44   1.00    0.49   1.75
0.42   0.50    0.43   1.00    0.64   1.75
0.32   0.50    0.58   1.25    0.68   1.75
0.43   0.50    0.64   1.25    0.81   1.75
0.29   0.50    0.53   1.25    0.32   1.75
0.35   0.75    0.57   1.25    0.75   1.75
0.44   0.75    0.70   1.25    0.64   1.75
0.47   0.75

Table 5.1: Simulated data according to the model given in (5.26)
The calculated $\mathrm{DIM}_{\hat{\boldsymbol\theta},k}$, $k = 1, \ldots, 49$, are presented graphically in two figures: the values of the influence measure corresponding to $\hat\theta_1$ are given in Figure 5.2a and the values corresponding to $\hat\theta_2$ are given in Figure 5.2b.

[Figure 5.1: Plot of the data given in Table 5.1, where y = initial velocity and x = substrate concentration, together with the estimated curve. Observation 40 is contaminated.]

In Figure 5.2a we can see that $\mathrm{DIM}_{\hat{\boldsymbol\theta},k}$ identifies the 40th observation as the most influential observation on $\hat\theta_1$, with $\mathrm{DIM}_{\hat\theta_1,40} = 0.09$. The second largest value of the influence measure, in magnitude, corresponds to observation 47, where $\mathrm{DIM}_{\hat\theta_1,47} = -0.08$. Seventy-five percent of the observations lie within the dashed lines, and observations 40 and 47 are well separated from these 75 percent.

[Figure 5.2: The joint-parameter influence measure $\mathrm{DIM}_{\hat{\boldsymbol\theta},k}$, defined in (5.4), for each observation in Table 5.1: (a) $\mathrm{DIM}_{\hat\theta_1,k}$; (b) $\mathrm{DIM}_{\hat\theta_2,k}$. Observations within the dashed lines represent 75 percent of the data. Observe that $\mathrm{DIM}_{\hat{\boldsymbol\theta},k} = (\mathrm{DIM}_{\hat\theta_1,k}, \mathrm{DIM}_{\hat\theta_2,k})$.]
The 40th observation is the most influential observation on $\hat\theta_2$ as well, as can be seen in Figure 5.2b, where $\mathrm{DIM}_{\hat\theta_2,40} = 0.13$. The second largest absolute value corresponds to observation 47, where $\mathrm{DIM}_{\hat\theta_2,47} = -0.12$. Seventy-five percent of the data lie within the dashed lines, and observations 40 and 47 are isolated from this most common 75 percent of the data. Moreover, the 10th observation has a large value of the influence measure and is separated from the rest of the data.
We can present the results of the calculation of $\mathrm{DIM}_{\hat{\boldsymbol\theta},k}$ in one figure by plotting $\mathrm{DIM}_{\hat\theta_1,k}$ against $\mathrm{DIM}_{\hat\theta_2,k}$, see Figure 5.3. From this figure it is clear that the 40th and 47th observations are the most influential on both $\hat\theta_1$ and $\hat\theta_2$.

[Figure 5.3: The influence measures $\mathrm{DIM}_{\hat\theta_1,k}$ and $\mathrm{DIM}_{\hat\theta_2,k}$ calculated for each observation in Table 5.1.]
Summarizing, $\mathrm{DIM}_{\hat{\boldsymbol\theta},k}$ successfully identifies the 40th observation as an influential observation. Observation 47, and in some sense observation 10, have high influence on the parameter estimates $\hat{\boldsymbol\theta}$ but are not contaminated. The reason for their large values of $\mathrm{DIM}_{\hat{\boldsymbol\theta},k}$ will be commented on below.
In Section 5.1.3 we saw that $\mathrm{DIM}_{\hat{\boldsymbol\theta},k}$ is a function of the residuals and is closely related to the tangent plane leverages. Therefore, to get a deeper understanding of the influence of the observations, we investigate the standardized residuals and the leverages, presented in Figure 5.4a and Figure 5.4b, respectively. From the figures it can be seen that the 40th observation has a large standardized residual, which is outside the normal range of $-2$ to $2$, and that its leverage is medium high. Thus, observation 40 has both a relatively large leverage value and a large standardized residual, which explains its high value of $\mathrm{DIM}_{\hat{\boldsymbol\theta},k}$.
Inspecting the figures, we see that observation 47 has a large standardized residual, in magnitude. This is expected by chance, and this observation is not spurious, since it is generated from the correct model. Moreover, the 47th observation belongs to the group of observations with the largest values of leverage, and it is located in the area which is expected to be important for the estimation of $\theta_1$. A large standardized residual and the location of the observation are the reasons for the large value, in magnitude, of $\mathrm{DIM}_{\hat{\boldsymbol\theta},47}$.
[Figure 5.4: (a) Standardized residuals; (b) leverage for each observation, computed using the data in Table 5.1.]
From Figure 5.4a and Figure 5.4b we see that the standardized residual corresponding to the 10th observation is within the normal range and that the leverage for observation 10 is the second largest value. Moreover, the 10th observation is located in the area which is expected to be important for the estimation of $\theta_2$. The large leverage value and the location of the 10th observation explain its high value of $\mathrm{DIM}_{\hat{\boldsymbol\theta},k}$.
5.1.5 Numerical example: Influence analysis using $\mathrm{DIM}_{\hat\theta_j,k}$

In order to verify whether the marginal influence measure $\mathrm{DIM}_{\hat\theta_j,k}$ can successfully identify an influential observation, we will use the same simulated data set as in the previous example in Section 5.1.4. Now, instead of contaminating the 40th observation, we contaminate the 9th observation, increasing its $y$-value, see Figure 5.5. Since observation 9 is located in the area which is expected to be important for the estimation of $\theta_2$, we expect that the marginal influence of this observation on $\hat\theta_2$ will be larger than its marginal influence on $\hat\theta_1$.
[Figure 5.5: Plot of the data given in Table 5.1, where observation 9 is contaminated and observation 40 is uncontaminated.]
The results from the calculations of the marginal influence of the observations on $\hat\theta_1$ and $\hat\theta_2$ are presented in Figure 5.6a and Figure 5.6b, respectively. As expected, the 9th observation has the largest marginal influence on $\hat\theta_2$, where $\mathrm{DIM}_{\hat\theta_2,9} = -0.03$. However, $\mathrm{DIM}_{\hat\theta_1,9} = 0.011$ is not the largest influence measure in magnitude: since $\mathrm{DIM}_{\hat\theta_1,47} = -0.013$, the 47th observation has more influence on $\hat\theta_1$ than the 9th observation.
Besides studying $\mathrm{DIM}_{\hat\theta_j,k}$, we can also analyze the residuals and the marginal leverages defined in (5.22). The standardized residuals for the 9th and 47th observations are outside the $-2$ to $2$ range, where the standardized residual for the 9th observation is the largest in magnitude with a value of 3.33. The figure of standardized residuals is not shown here. In Figure 5.7a and Figure 5.7b the values of the marginal leverages for $\theta_1$ and for $\theta_2$ are plotted, respectively.
[Figure 5.6: The marginal influence measures $\mathrm{DIM}_{\hat\theta_1,k}$ (a) and $\mathrm{DIM}_{\hat\theta_2,k}$ (b), for $k = 1, \ldots, 49$, when the 9th observation is contaminated. Seventy-five percent of the data are within the dashed lines.]

[Figure 5.7: Marginal leverages $\mathbf{F}_k^T(\hat\theta_j)\left(\mathbf{F}(\hat\theta_j)\mathbf{F}^T(\hat\theta_j)\right)^{-1}\mathbf{F}_k(\hat\theta_j)$ of observations $k = 1, \ldots, 49$ when the 9th observation is contaminated: (a) the marginal leverages when $\theta_1$ is under consideration; (b) the marginal leverages when $\theta_2$ is under consideration.]

In Figure 5.7a we observe that the marginal leverage for the 47th observation is much larger than for the 9th observation. In fact, observation 47 belongs to the group of observations with the largest values of marginal leverage. On the other hand, inspecting Figure 5.7b shows that the marginal leverage of the 9th observation is much larger than that of the 47th observation. Thus, the explanation for our result, that observation 9 is the most influential on $\hat\theta_2$ and observation 47 the most influential on $\hat\theta_1$, is found by studying the marginal leverages for these two observations.
5.2 Assessment of influence of multiple observations
Thus far, we have discussed the differentiation approach to the detection of single influential observations. However, in practice it is likely that a data set contains more than one influential observation. Influence analysis concerning multiple observations is a more challenging problem, since multiple influential observations can be more difficult to detect.
Chatterjee and Hadi (1988) discussed the influence of a subset of observations on the estimated regression parameters; they argued that the problem is threefold. Firstly, it can be difficult to determine the size of the subset of observations whose influence on the parameter estimates should be investigated, i.e. should we investigate all pairs or triplets of observations? If the size of the subset of interest is unknown, sequential methods can be used. For example, Belsley et al. (1980) suggested a procedure where one starts with a subset of two observations and analyzes every pair of observations in the data set. At the next step, one continues with examining every group of three and four, and so on. The challenge with a sequential method is to identify a meaningful stopping rule.
Secondly, there can be computational problems when searching for multiple influential observations. For example, if we are interested in examining every pair of observations in a data set consisting of 50 observations, there will be 1225 combinations to examine. If we are also interested in examining every triplet of observations, then we need to add another 19 600 combinations. For practitioners, this can be an overwhelming task. Moreover, the graphical identification of multiple influential observations is more complicated than for single influential observations. For more discussion concerning the identification of multiple influential observations we refer to, for instance, Atkinson (1986) and Peña and Yohai (1995).
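The subset counts quoted above are binomial coefficients and are easy to reproduce:

```python
# Number of pairs and triplets of observations in a data set of n = 50.
import math

n = 50
pairs = math.comb(n, 2)      # all pairs of observations
triplets = math.comb(n, 3)   # all triplets of observations
```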
Thirdly, multiple influential observations can cause so-called masking and swamping effects. Swamping occurs when observations without substantial influence on the parameter estimates are identified as influential observations due to the presence of another observation, which is highly influential. In the statistical literature several ways of defining masking have been discussed. Lawrence (1995) argued that two approaches to defining masking have emerged. He illustrates these two approaches by using four quotations. Atkinson (1985):
...this structure would not be revealed by the calculation of single deletion diagnostic measures for each observation in turn, although it might well be detected by multiple deletion measures. This effect, which has been called 'masking'...
Chatterjee and Hadi (1988):
There may exist situations in which observations are jointly but not individually influential, or the other way around... This situation is sometimes referred to as the masking effect...
Both quotations focus on a joint aspect of masking, where the masking effectconcerns all the observations in one group, simultaneously. The observationsin a group or subset are masking each other.
There are alternative ways of defining masking; e.g. Rousseeuw and Leroy (1987) and Atkinson (1985) argued that:
...masking effect means that, after the deletion of one or more influential points, another observation may emerge as extremely influential, which was not visible at first...

...the importance of a particular observation may not be apparent until some other observation has been deleted... In the presence of such masking effect...
These two quotations elucidate a conditional nature of masking, i.e. the observation being masked cannot be identified as an influential observation unless the observation masking it is deleted from the data.
The two different sides of masking resulted in two distinct influence measures (Lawrence, 1995), both based on Cook's distance: the joint and the conditional influence measures. These influence measures will be presented in Section 5.2.1 and Section 5.2.3, respectively. In line with Lawrence (1995), we will use the terms joint and conditional influence when deriving measures for assessing the influence of multiple observations on the parameter estimates in a nonlinear regression model.
This section is divided into several parts. In the first two parts we discuss the joint influence of multiple observations in linear and nonlinear regression analysis. In Section 5.2.1, we give a brief overview of the discussion of this topic in Belsley et al. (1980). Then, borrowing ideas from Belsley et al. (1980), we derive an influence measure based on the differentiation approach to assess the influence of multiple observations simultaneously on the parameter estimates in a nonlinear regression.

The other two parts concern conditional influence of observations in linear and nonlinear regression. In Section 5.2.3, we exemplify the assessment of the conditional influence of observations on parameter estimates in a linear regression model. In Section 5.2.4, we propose a diagnostic measure based on the differentiation approach for assessing the conditional influence of observations on the parameter estimates in a nonlinear regression model. This diagnostic measure will be referred to as $\mathrm{DIM}_{\hat\theta(i),k}$ and assesses the influence of the $k$th observation given that the $i$th observation is deleted.
5.2.1 Joint influence in linear regression

One approach for simultaneously assessing the influence of several observations on the parameter estimates in the linear regression model is to use an extended version of Cook's distance. Cook's distance, when the $k$th and $l$th observations are deleted, is given by
$$
C_{kl} = \frac{(\hat{\boldsymbol\beta}-\hat{\boldsymbol\beta}_{(k,l)})^T \mathbf{X}^T\mathbf{X}\, (\hat{\boldsymbol\beta}-\hat{\boldsymbol\beta}_{(k,l)})}{p\hat\sigma^2},
$$
where $\hat{\boldsymbol\beta}_{(k,l)}$ is the estimate of $\boldsymbol\beta$ when the $k$th and $l$th observations are excluded from the calculations. This diagnostic measure is discussed more generally for a group of $m > 2$ observations by Cook and Weisberg (1980, 1982).
Belsley et al. (1980) suggested perturbing several observations simultaneously using weights $\boldsymbol\omega = (\omega_1, \ldots, \omega_n)^T$, where $0 < \omega_k \le 1$, $k = 1, \ldots, n$. The vector $\boldsymbol\omega$ is then used as the diagonal in the weight matrix $\mathbf{W}$. Consider the following perturbed model
$$
\mathbf{y}_\omega = \mathbf{X}\boldsymbol\beta + \boldsymbol\varepsilon_\omega,
$$
where $\boldsymbol\varepsilon_\omega \sim N_n(\mathbf{0}, \sigma^2\mathbf{W}^{-1}(\boldsymbol\omega))$ and $\mathbf{W}(\boldsymbol\omega)$ is the diagonal weight matrix with diagonal elements equal to $\boldsymbol\omega$. Let $K$ be a subset containing the indices of the observations whose influence on the parameter estimates we want to evaluate.
Belsley et al. (1980) also suggested using the following "directional derivative",
$$
\ell^T \frac{d\hat{\boldsymbol\beta}(\boldsymbol\omega)}{d\boldsymbol\omega}\bigg|_{\boldsymbol\omega=\mathbf{1}_n}, \qquad (5.27)
$$
to identify subsets of observations with a significant influence on $\hat{\boldsymbol\beta}$. Here, $\hat{\boldsymbol\beta}(\boldsymbol\omega)$ is the weighted least squares estimate of $\boldsymbol\beta$, which is a function of the weights $\boldsymbol\omega$. Notice that if $\boldsymbol\omega = \mathbf{1}_n$ then $\hat{\boldsymbol\beta}(\boldsymbol\omega) = \hat{\boldsymbol\beta}$, the unweighted least squares estimate of $\boldsymbol\beta$. In (5.27), $\ell : n\times 1$ is a vector with nonzero components in the rows corresponding to indices in the subset $K$.

Using the derivative (5.27) we are interested in the rate of change of the function $\hat{\boldsymbol\beta}(\boldsymbol\omega)$ as the weights vary simultaneously. However, there are several ways to allow for this. If we are interested in letting the weights vary equally fast, we could use the direction defined by the vector with ones in the rows with indices in the set $K$ and zeros elsewhere. However, there are many vectors pointing in this direction: we could replace the ones with twos, and this vector would point in the same direction. Therefore, the unit vector is often used to give the desired direction, hence $\|\ell\| = \sqrt{\ell^T\ell} = 1$.
We will now borrow the idea of using the "directional" derivative and define the influence measure $\mathrm{DIM}_{\hat\beta,K}$ for assessing the influence of multiple observations on $\hat{\boldsymbol\beta}$.
Definition 5.2.1. The diagnostic measure for assessing the influence of the observations with indices specified in the subset $K$ on $\hat{\boldsymbol\beta}$ is defined as
$$
\mathrm{DIM}_{\hat\beta,K} = \ell^T \frac{d\hat{\boldsymbol\beta}(\boldsymbol\omega)}{d\boldsymbol\omega}\bigg|_{\boldsymbol\omega=\mathbf{1}}, \qquad (5.28)
$$
where $\ell : n\times 1$ is a vector with nonzero entries in the rows corresponding to indices in $K$ and $\ell^T\ell = 1$.
In the following proposition we present the explicit expression of $\mathrm{DIM}_{\hat\beta,K}$.
Proposition 5.2.1. Let $\mathrm{DIM}_{\hat\beta,K}$ be given in Definition 5.2.1. Then
$$
\mathrm{DIM}_{\hat\beta,K} = \ell^T\mathbf{D}_r\mathbf{X}\left(\mathbf{X}^T\mathbf{X}\right)^{-1},
$$
where $\mathbf{D}_r : n\times n$ is a diagonal matrix with diagonal elements $r_1, \ldots, r_n$ and $r_i = y_i - \mathbf{x}_i^T\hat{\boldsymbol\beta}$, $i = 1, \ldots, n$.
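Proposition 5.2.1 can be verified numerically: each row of $\mathbf{D}_r\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}$ should equal the derivative of the weighted least squares estimate $\hat{\boldsymbol\beta}(\boldsymbol\omega)$ with respect to the corresponding weight. A sketch for a straight-line model with assumed data (not from the thesis):

```python
# Check: rows of D_r X (X^T X)^{-1} versus finite differences of the
# weighted LS estimate beta(omega) for y = b0 + b1*x. Data assumed.

XS = [0.0, 1.0, 2.0, 3.0, 4.0]
YS = [0.1, 1.3, 1.9, 3.2, 3.8]
N = len(XS)

def wls(w):
    """Closed-form weighted least squares for the two-parameter line."""
    a = sum(w)
    b = sum(wi * x for wi, x in zip(w, XS))
    d = sum(wi * x * x for wi, x in zip(w, XS))
    s0 = sum(wi * y for wi, y in zip(w, YS))
    s1 = sum(wi * x * y for wi, x, y in zip(w, XS, YS))
    det = a * d - b * b
    return ((d * s0 - b * s1) / det, (-b * s0 + a * s1) / det)

beta = wls([1.0] * N)
resid = [y - beta[0] - beta[1] * x for x, y in zip(XS, YS)]

# (X^T X)^{-1} for the unweighted design with rows x_k = (1, x_k)
a, b, d = N, sum(XS), sum(x * x for x in XS)
det = a * d - b * b
XtX_inv = ((d / det, -b / det), (-b / det, a / det))

def eic(k):
    """k-th row of D_r X (X^T X)^{-1}, i.e. EIC_{beta,k}."""
    rk, xk = resid[k], XS[k]
    return (rk * (XtX_inv[0][0] + xk * XtX_inv[1][0]),
            rk * (XtX_inv[0][1] + xk * XtX_inv[1][1]))

def fd(k, eps=1e-6):
    w_hi = [1.0] * N; w_hi[k] += eps
    w_lo = [1.0] * N; w_lo[k] -= eps
    hi, lo = wls(w_hi), wls(w_lo)
    return ((hi[0] - lo[0]) / (2 * eps), (hi[1] - lo[1]) / (2 * eps))
```

For $K = \{1,2\}$ and $\ell^T = (\ell_1, \ell_2, 0, \ldots, 0)$, the measure $\mathrm{DIM}_{\hat\beta,K}$ is then the combination $\ell_1\,\texttt{eic(0)} + \ell_2\,\texttt{eic(1)}$, componentwise.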
Proof. Let $\mathbf{W} = \mathbf{W}(\boldsymbol\omega)$. The derivative
$$
\frac{d\hat{\boldsymbol\beta}(\boldsymbol\omega)}{d\boldsymbol\omega}\bigg|_{\boldsymbol\omega=\mathbf{1}_n} = \frac{d(\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{W}\mathbf{y}}{d\boldsymbol\omega}\bigg|_{\boldsymbol\omega=\mathbf{1}_n}
$$
is calculated using the product rule as well as the chain rule and the expression for the derivative of an inverse (see Appendix A):
$$
\frac{d\hat{\boldsymbol\beta}(\boldsymbol\omega)}{d\boldsymbol\omega} = \frac{d\mathbf{W}}{d\boldsymbol\omega}(\mathbf{y}\otimes\mathbf{X})\left(\mathbf{X}^T\mathbf{W}\mathbf{X}\right)^{-1} - \frac{d\mathbf{W}}{d\boldsymbol\omega}(\mathbf{X}\otimes\mathbf{X})\left(\left(\mathbf{X}^T\mathbf{W}\mathbf{X}\right)^{-1}\otimes\left(\mathbf{X}^T\mathbf{W}\mathbf{X}\right)^{-1}\right)\left(\mathbf{X}^T\mathbf{W}\mathbf{y}\otimes\mathbf{I}_p\right). \qquad (5.29)
$$
In the expression above
$$
\frac{d\mathbf{W}}{d\boldsymbol\omega} = \mathbf{U}^* = (\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_n)^T,
$$
where $\mathbf{u}_i = \mathbf{d}_i\otimes\mathbf{d}_i$ and $\mathbf{d}_i$ is the $i$th column of the identity matrix of size $n$.

Evaluating (5.29) at $\boldsymbol\omega = \mathbf{1}_n$ implies that $\hat{\boldsymbol\beta}(\boldsymbol\omega = \mathbf{1}_n) = \hat{\boldsymbol\beta}$, i.e. the estimate of $\boldsymbol\beta$ from the unperturbed model (2.1), and $\mathbf{y}-\mathbf{X}\hat{\boldsymbol\beta}(\boldsymbol\omega = \mathbf{1}_n)$ is denoted $\mathbf{r}$, the residuals from the unperturbed model. Now
$$
\begin{aligned}
\frac{d\hat{\boldsymbol\beta}(\boldsymbol\omega)}{d\boldsymbol\omega}\bigg|_{\boldsymbol\omega=\mathbf{1}_n}
&= \mathbf{U}^*\left[(\mathbf{y}\otimes\mathbf{X})\left(\mathbf{X}^T\mathbf{X}\right)^{-1} - (\mathbf{X}\otimes\mathbf{X})\left(\left(\mathbf{X}^T\mathbf{X}\right)^{-1}\otimes\left(\mathbf{X}^T\mathbf{X}\right)^{-1}\right)\left(\mathbf{X}^T\mathbf{y}\otimes\mathbf{I}_p\right)\right]\\
&= \mathbf{U}^*\left[\left(\mathbf{y}\otimes\mathbf{X}\left(\mathbf{X}^T\mathbf{X}\right)^{-1}\right) - \left(\mathbf{X}\left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{y}\otimes\mathbf{X}\left(\mathbf{X}^T\mathbf{X}\right)^{-1}\right)\right]\\
&= \mathbf{U}^*\left[\left(\mathbf{y}\otimes\mathbf{X}\left(\mathbf{X}^T\mathbf{X}\right)^{-1}\right) - \left(\mathbf{X}\hat{\boldsymbol\beta}\otimes\mathbf{X}\left(\mathbf{X}^T\mathbf{X}\right)^{-1}\right)\right]\\
&= \mathbf{U}^*\left[(\mathbf{y}\otimes\mathbf{I}_n) - \left(\mathbf{X}\hat{\boldsymbol\beta}\otimes\mathbf{I}_n\right)\right]\mathbf{X}\left(\mathbf{X}^T\mathbf{X}\right)^{-1}\\
&= \mathbf{U}^*\left[\left(\mathbf{y}-\mathbf{X}\hat{\boldsymbol\beta}\right)\otimes\mathbf{I}_n\right]\mathbf{X}\left(\mathbf{X}^T\mathbf{X}\right)^{-1},
\end{aligned}
$$
where $\mathbf{U}^*\left[(\mathbf{y}\otimes\mathbf{I}_n)-\left(\mathbf{X}\hat{\boldsymbol\beta}\otimes\mathbf{I}_n\right)\right] = \mathbf{D}_r$ and where $\mathbf{D}_r$ is defined in the statement of the proposition. Thus, we get
$$
\ell^T \frac{d\hat{\boldsymbol\beta}(\boldsymbol\omega)}{d\boldsymbol\omega}\bigg|_{\boldsymbol\omega=\mathbf{1}} = \ell^T\mathbf{D}_r\mathbf{X}\left(\mathbf{X}^T\mathbf{X}\right)^{-1},
$$
and the proposition is established. $\square$
Corollary 5.2.1. The joint influence measure $\mathrm{DIM}_{\hat\beta,K}$ is a linear combination of the influence measures $\mathrm{EIC}_{\hat\beta,k}$, for $k\in K$, defined in (5.2).

Proof. Observe that
$$
\frac{d\hat{\boldsymbol\beta}(\boldsymbol\omega)}{d\boldsymbol\omega}\bigg|_{\boldsymbol\omega=\mathbf{1}} = \mathbf{D}_r\mathbf{X}\left(\mathbf{X}^T\mathbf{X}\right)^{-1} =
\begin{pmatrix}
r_1 x_{11} & \ldots & r_1 x_{1p}\\
\vdots & \ddots & \vdots\\
r_n x_{n1} & \ldots & r_n x_{np}
\end{pmatrix}
\left(\mathbf{X}^T\mathbf{X}\right)^{-1}
$$
is a matrix of $n$ partial derivatives. The $k$th row of this matrix can be written as
$$
r_k \mathbf{x}_k^T\left(\mathbf{X}^T\mathbf{X}\right)^{-1},
$$
which is equal to $\mathrm{EIC}_{\hat\beta,k}$ defined in (5.2). Therefore
$$
\ell^T \frac{d\hat{\boldsymbol\beta}(\boldsymbol\omega)}{d\boldsymbol\omega}\bigg|_{\boldsymbol\omega=\mathbf{1}} = \ell^T\left(\mathrm{EIC}^T_{\hat\beta,1}, \ldots, \mathrm{EIC}^T_{\hat\beta,n}\right)^T,
$$
which shows the corollary. $\square$
As an example of the above corollary, let the observations contained in the subset $K$ have indices 1 and 2. Then $\ell^T = (\ell_1, \ell_2, 0, \ldots, 0)$ and
$$
\ell^T \frac{d\hat{\boldsymbol\beta}(\boldsymbol\omega)}{d\boldsymbol\omega}\bigg|_{\boldsymbol\omega=\mathbf{1}} = \ell_1\mathrm{EIC}_{\hat\beta,1} + \ell_2\mathrm{EIC}_{\hat\beta,2}.
$$
It is natural to think of (5.28) as a summary measure used to assess joint influence, i.e. the influence of multiple observations simultaneously on the parameter estimates.
In the next section we will extend these ideas and derive a joint influence measure for assessing the influence of multiple observations on the parameter estimates in nonlinear regression models.
5.2.2 Joint influence in nonlinear regression
In this section we present another new result of the thesis: a diagnostic measure for assessing the influence of multiple observations simultaneously on the parameter estimates of a nonlinear regression model.
94
Consider the following perturbed nonlinear model
$$
\mathbf{y}_\omega = \mathbf{f}(\mathbf{X},\boldsymbol\theta) + \boldsymbol\varepsilon_\omega, \qquad (5.30)
$$
where $\boldsymbol\varepsilon_\omega \sim N_n(\mathbf{0}, \sigma^2\mathbf{W}^{-1}(\boldsymbol\omega))$, $\mathbf{W}(\boldsymbol\omega) : n\times n$ is a diagonal weight matrix with diagonal elements $\boldsymbol\omega = (\omega_1, \ldots, \omega_n)^T$, and where $0 < \omega_k \le 1$ for $k = 1, \ldots, n$. Also, let $K$ be the subset containing the indices of the observations for which we would like to assess influence.
In correspondence with Definition 5.2.1 we present the following definition.

Definition 5.2.2. The diagnostic measure for assessing the influence of the observations with indices specified in the subset $K$ on the parameter estimate $\hat{\boldsymbol\theta}$ is defined as the derivative
$$
\mathrm{DIM}_{\hat\theta,K} = \ell^T \frac{d\hat{\boldsymbol\theta}(\boldsymbol\omega)}{d\boldsymbol\omega}\bigg|_{\boldsymbol\omega=\mathbf{1}_n}, \qquad (5.31)
$$
where $\ell : n\times 1$ is a vector with nonzero entries in the rows with indices in $K$, where $\|\ell\| = \sqrt{\ell^T\ell} = 1$, and where $\hat{\boldsymbol\theta}(\boldsymbol\omega)$ is the weighted least squares estimate of $\boldsymbol\theta$, which is a function of the weights $\boldsymbol\omega$.
If $\boldsymbol\omega \to \mathbf{1}_n$, then $\hat{\boldsymbol\theta}(\boldsymbol\omega) \to \hat{\boldsymbol\theta}$, the unweighted least squares estimate.
To derive $\mathrm{DIM}_{\hat\theta,K}$ for assessing the influence of multiple observations simultaneously on the parameter estimates, we need the weighted least squares estimate of $\boldsymbol\theta$ in (5.30). The weighted least squares criterion, which should be minimized, is given by
$$
Q(\boldsymbol\omega) = (\mathbf{y}-\mathbf{f}(\mathbf{X},\boldsymbol\theta))^T \mathbf{W}(\boldsymbol\omega)\, (\mathbf{y}-\mathbf{f}(\mathbf{X},\boldsymbol\theta)).
$$
The normal equations are then given by
$$
\left(\frac{d\mathbf{f}(\mathbf{X},\boldsymbol\theta)}{d\boldsymbol\theta}\right)\mathbf{W}(\boldsymbol\omega)\, (\mathbf{y}-\mathbf{f}(\mathbf{X},\boldsymbol\theta)) = \mathbf{0}. \qquad (5.32)
$$
Since $\mathbf{f}(\mathbf{X},\boldsymbol\theta)$ is a nonlinear function, there is generally no explicit solution to the normal equations, and iterative methods are used to find an estimate. The obtained estimate of $\boldsymbol\theta$ is a function of the weights $\boldsymbol\omega$ and is denoted $\hat{\boldsymbol\theta}(\boldsymbol\omega)$.
The next theorem provides an explicit expression of $\mathrm{DIM}_{\hat\theta,K}$ defined in (5.31).
Theorem 5.2.1. Let $\mathrm{DIM}_{\hat\theta,K}$ be given in Definition 5.2.2. Then
$$
\mathrm{DIM}_{\hat\theta,K} = \ell^T\mathbf{U}^*\left(\mathbf{r}\otimes\mathbf{F}^T(\hat{\boldsymbol\theta})\right)\left(\mathbf{F}(\hat{\boldsymbol\theta})\mathbf{F}^T(\hat{\boldsymbol\theta})-\mathbf{G}(\hat{\boldsymbol\theta})(\mathbf{r}\otimes\mathbf{I}_q)\right)^{-1},
$$
provided that the inverse exists.

In the expression above, $\mathbf{U}^* : n\times n^2$ is a matrix with row vectors $\mathbf{u}_i^T$,
$$
\mathbf{u}_i = \mathbf{d}_i\otimes\mathbf{d}_i, \quad i = 1, \ldots, n, \qquad (5.33)
$$
where $\mathbf{d}_i$ is the $i$th column of the identity matrix of size $n$. The quantities $\mathbf{r}$, $\mathbf{F}(\hat{\boldsymbol\theta})$ and $\mathbf{G}(\hat{\boldsymbol\theta})$ are defined in (5.6), (5.7) and (5.8), respectively.
Proof. Consider inserting $\hat{\boldsymbol\theta}(\boldsymbol\omega)$ in the normal equations (5.32),
$$
\frac{d\mathbf{f}(\mathbf{X},\boldsymbol\theta)}{d\boldsymbol\theta}\bigg|_{\boldsymbol\theta=\hat{\boldsymbol\theta}(\boldsymbol\omega)}\mathbf{W}(\boldsymbol\omega)\left(\mathbf{y}-\mathbf{f}(\mathbf{X},\hat{\boldsymbol\theta}(\boldsymbol\omega))\right) = \mathbf{0}, \qquad (5.34)
$$
and letting $\mathbf{F} = \mathbf{F}(\hat{\boldsymbol\theta}(\boldsymbol\omega))$, $\mathbf{W} = \mathbf{W}(\boldsymbol\omega)$ and $\mathbf{e} = \mathbf{y}-\mathbf{f}(\mathbf{X}) = \mathbf{y}-\mathbf{f}(\mathbf{X},\hat{\boldsymbol\theta}(\boldsymbol\omega))$.

To study the influence of multiple observations on $\hat{\boldsymbol\theta}$, differentiate $\mathbf{F}\mathbf{W}\mathbf{e} = \mathbf{0}$, given in (5.34), on both sides with respect to $\boldsymbol\omega$:
$$
\frac{d}{d\boldsymbol\omega}\mathbf{F}\mathbf{W}\mathbf{e} = \mathbf{0}. \qquad (5.35)
$$
To calculate the derivative in (5.35), the product rule, defined in Appendix A, is applied:
$$
\frac{d}{d\boldsymbol\omega}\mathbf{F}\mathbf{W}\mathbf{e} = \frac{d\mathbf{F}}{d\boldsymbol\omega}(\mathbf{W}\mathbf{e}\otimes\mathbf{I}_q) + \frac{d\mathbf{W}}{d\boldsymbol\omega}(\mathbf{e}\otimes\mathbf{F}^T) + \frac{d\mathbf{e}}{d\boldsymbol\omega}\mathbf{W}\mathbf{F}^T. \qquad (5.36)
$$
In the expression above
$$
\frac{d\mathbf{e}}{d\boldsymbol\omega} = -\frac{d\mathbf{f}(\mathbf{X})}{d\boldsymbol\omega}, \qquad \frac{d\mathbf{W}}{d\boldsymbol\omega} = \mathbf{U}^*.
$$
Applying the chain rule, see Appendix A, to (5.36) gives
$$
\frac{d\hat{\boldsymbol\theta}(\boldsymbol\omega)}{d\boldsymbol\omega}\frac{d\mathbf{F}}{d\hat{\boldsymbol\theta}(\boldsymbol\omega)}(\mathbf{W}\mathbf{e}\otimes\mathbf{I}_q) + \mathbf{U}^*(\mathbf{e}\otimes\mathbf{F}^T) - \frac{d\hat{\boldsymbol\theta}(\boldsymbol\omega)}{d\boldsymbol\omega}\frac{d\mathbf{f}(\mathbf{X})}{d\hat{\boldsymbol\theta}(\boldsymbol\omega)}\mathbf{W}\mathbf{F}^T = \mathbf{0},
$$
which after rearrangement of terms yields
$$
\mathbf{U}^*(\mathbf{e}\otimes\mathbf{F}^T) = \frac{d\hat{\boldsymbol\theta}(\boldsymbol\omega)}{d\boldsymbol\omega}\left(\frac{d\mathbf{f}(\mathbf{X})}{d\hat{\boldsymbol\theta}(\boldsymbol\omega)}\mathbf{W}\mathbf{F}^T - \frac{d\mathbf{F}}{d\hat{\boldsymbol\theta}(\boldsymbol\omega)}(\mathbf{W}\mathbf{e}\otimes\mathbf{I}_q)\right). \qquad (5.37)
$$
Evaluating the derivative in (5.37) at $\boldsymbol\omega = \mathbf{1}_n$ implies that $\hat{\boldsymbol\theta}(\boldsymbol\omega = \mathbf{1}_n) = \hat{\boldsymbol\theta}$, the estimate of $\boldsymbol\theta$ for the unperturbed model (2.2), and $\mathbf{y}-\mathbf{f}(\mathbf{X},\hat{\boldsymbol\theta}(\boldsymbol\omega = \mathbf{1}_n)) = \mathbf{r}$, the residuals for the unperturbed model. Further,
$$
\frac{d\mathbf{f}(\mathbf{X})}{d\hat{\boldsymbol\theta}(\boldsymbol\omega)}\bigg|_{\boldsymbol\omega=\mathbf{1}_n} = \frac{d\mathbf{f}(\mathbf{X},\hat{\boldsymbol\theta}(\boldsymbol\omega))}{d\hat{\boldsymbol\theta}(\boldsymbol\omega)}\bigg|_{\boldsymbol\omega=\mathbf{1}_n} = \mathbf{F}(\hat{\boldsymbol\theta}(\boldsymbol\omega))\Big|_{\boldsymbol\omega=\mathbf{1}_n} = \mathbf{F}(\hat{\boldsymbol\theta}),
$$
i.e. the matrix of derivatives of the expectation function from the unperturbed model. Moreover,
$$
\frac{d\mathbf{F}}{d\hat{\boldsymbol\theta}(\boldsymbol\omega)}\bigg|_{\boldsymbol\omega=\mathbf{1}_n} = \frac{d\mathbf{F}(\hat{\boldsymbol\theta}(\boldsymbol\omega))}{d\hat{\boldsymbol\theta}(\boldsymbol\omega)}\bigg|_{\boldsymbol\omega=\mathbf{1}_n} = \mathbf{G}(\hat{\boldsymbol\theta}(\boldsymbol\omega))\Big|_{\boldsymbol\omega=\mathbf{1}_n} = \mathbf{G}(\hat{\boldsymbol\theta}),
$$
i.e. the matrix of second derivatives of the expectation function from the unperturbed model. Thus, (5.37) becomes
$$
\mathbf{U}^*\left(\mathbf{r}\otimes\mathbf{F}^T(\hat{\boldsymbol\theta})\right) = \frac{d\hat{\boldsymbol\theta}(\boldsymbol\omega)}{d\boldsymbol\omega}\bigg|_{\boldsymbol\omega=\mathbf{1}}\left(\mathbf{F}(\hat{\boldsymbol\theta})\mathbf{F}^T(\hat{\boldsymbol\theta})-\mathbf{G}(\hat{\boldsymbol\theta})(\mathbf{r}\otimes\mathbf{I}_q)\right).
$$
Rearranging terms yields
$$
\frac{d\hat{\boldsymbol\theta}(\boldsymbol\omega)}{d\boldsymbol\omega}\bigg|_{\boldsymbol\omega=\mathbf{1}} = \mathbf{U}^*\left(\mathbf{r}\otimes\mathbf{F}^T(\hat{\boldsymbol\theta})\right)\left(\mathbf{F}(\hat{\boldsymbol\theta})\mathbf{F}^T(\hat{\boldsymbol\theta})-\mathbf{G}(\hat{\boldsymbol\theta})(\mathbf{r}\otimes\mathbf{I}_q)\right)^{-1}, \qquad (5.38)
$$
and
$$
\ell^T\frac{d\hat{\boldsymbol\theta}(\boldsymbol\omega)}{d\boldsymbol\omega}\bigg|_{\boldsymbol\omega=\mathbf{1}} = \ell^T\mathbf{U}^*\left(\mathbf{r}\otimes\mathbf{F}^T(\hat{\boldsymbol\theta})\right)\left(\mathbf{F}(\hat{\boldsymbol\theta})\mathbf{F}^T(\hat{\boldsymbol\theta})-\mathbf{G}(\hat{\boldsymbol\theta})(\mathbf{r}\otimes\mathbf{I}_q)\right)^{-1}.
$$
The proof is complete. $\square$
Corollary 5.2.2. The joint influence measure $\mathrm{DIM}_{\hat\theta,K}$ is a linear combination of the influence measures $\mathrm{DIM}_{\hat{\boldsymbol\theta},k}$, for $k\in K$, defined in (5.4), where $\mathrm{DIM}_{\hat{\boldsymbol\theta},k}$ measures the influence of a single observation on $\hat{\boldsymbol\theta}$.
Proof. Observe that $\mathbf{U}^*\left(\mathbf{r}\otimes\mathbf{F}^T(\hat{\boldsymbol\theta})\right)$ in (5.38) equals
$$
\mathbf{U}^*\left(\mathbf{r}\otimes\mathbf{F}^T(\hat{\boldsymbol\theta})\right) = (r_1\mathbf{F}_1, r_2\mathbf{F}_2, \ldots, r_n\mathbf{F}_n)^T.
$$
It follows that
$$
\frac{d\hat{\boldsymbol\theta}(\boldsymbol\omega)}{d\boldsymbol\omega}\bigg|_{\boldsymbol\omega=\mathbf{1}} =
\begin{pmatrix}
r_1\mathbf{F}_1^T\\ r_2\mathbf{F}_2^T\\ \vdots\\ r_n\mathbf{F}_n^T
\end{pmatrix}
\left(\mathbf{F}(\hat{\boldsymbol\theta})\mathbf{F}^T(\hat{\boldsymbol\theta})-\mathbf{G}(\hat{\boldsymbol\theta})(\mathbf{r}\otimes\mathbf{I}_q)\right)^{-1} =
\begin{pmatrix}
\mathrm{DIM}_{\hat{\boldsymbol\theta},1}\\ \vdots\\ \mathrm{DIM}_{\hat{\boldsymbol\theta},n}
\end{pmatrix},
$$
where
$$
\mathrm{DIM}_{\hat{\boldsymbol\theta},k} = \left(\mathrm{DIM}_{\hat\theta_1,k}, \ldots, \mathrm{DIM}_{\hat\theta_q,k}\right) = r_k\mathbf{F}_k^T(\hat{\boldsymbol\theta})\left(\mathbf{F}(\hat{\boldsymbol\theta})\mathbf{F}^T(\hat{\boldsymbol\theta})-\mathbf{G}(\hat{\boldsymbol\theta})(\mathbf{r}\otimes\mathbf{I}_q)\right)^{-1}.
$$
Thus,
$$
\mathrm{DIM}_{\hat\theta,K} = \ell^T\mathbf{U}^*\left(\mathbf{r}\otimes\mathbf{F}^T(\hat{\boldsymbol\theta})\right)\left(\mathbf{F}(\hat{\boldsymbol\theta})\mathbf{F}^T(\hat{\boldsymbol\theta})-\mathbf{G}(\hat{\boldsymbol\theta})(\mathbf{r}\otimes\mathbf{I}_q)\right)^{-1} = \ell^T\left(\mathrm{DIM}^T_{\hat{\boldsymbol\theta},1}, \ldots, \mathrm{DIM}^T_{\hat{\boldsymbol\theta},n}\right)^T,
$$
which is a linear combination of $\mathrm{DIM}_{\hat{\boldsymbol\theta},k}$ for all $k\in K$, and this establishes the corollary. $\square$
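The rows $\mathrm{DIM}_{\hat{\boldsymbol\theta},k}$ appearing in the proof can be checked numerically against finite differences of the weighted least squares estimate, in the spirit of Theorem 5.2.1. A Python sketch for the two-parameter Michaelis-Menten model with assumed data; a Newton iteration with a finite-difference Hessian stands in for whatever optimizer one prefers:

```python
# Check: the k-th row r_k F_k^T (F F^T - G (r x I_2))^{-1} versus a
# finite difference of theta(omega) w.r.t. omega_k. Data assumed.

XS = [0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75]
YS = [0.22, 0.31, 0.46, 0.48, 0.53, 0.62, 0.64]

def model(t1, t2, x):
    return t1 * x / (t2 + x)

def grad_Q(t1, t2, w):
    """Gradient of the weighted LS criterion Q(theta)."""
    g1 = g2 = 0.0
    for wi, x, y in zip(w, XS, YS):
        res = y - model(t1, t2, x)
        g1 += -2 * wi * res * x / (t2 + x)
        g2 += -2 * wi * res * (-t1 * x / (t2 + x) ** 2)
    return g1, g2

def fit(w, t1=1.0, t2=1.0):
    """Newton iteration with a finite-difference Hessian of Q."""
    h = 1e-6
    for _ in range(200):
        g1, g2 = grad_Q(t1, t2, w)
        a = (grad_Q(t1 + h, t2, w)[0] - g1) / h
        b = (grad_Q(t1, t2 + h, w)[0] - g1) / h
        c = (grad_Q(t1 + h, t2, w)[1] - g2) / h
        d = (grad_Q(t1, t2 + h, w)[1] - g2) / h
        det = a * d - b * c
        s1 = (d * g1 - b * g2) / det
        s2 = (-c * g1 + a * g2) / det
        t1, t2 = t1 - s1, t2 - s2
        if abs(s1) + abs(s2) < 1e-12:
            break
    return t1, t2

t1, t2 = fit([1.0] * len(XS))
r = [y - model(t1, t2, x) for x, y in zip(XS, YS)]
F = [(x / (t2 + x), -t1 * x / (t2 + x) ** 2) for x in XS]
A = [[sum(f[i] * f[j] for f in F) for j in range(2)] for i in range(2)]
B = [[0.0, 0.0], [0.0, 0.0]]          # G (r x I_2): d2f/dtheta1^2 = 0
for x, ri in zip(XS, r):
    B[0][1] += ri * (-x / (t2 + x) ** 2)
    B[1][0] += ri * (-x / (t2 + x) ** 2)
    B[1][1] += ri * (2 * t1 * x / (t2 + x) ** 3)
M = [[A[i][j] - B[i][j] for j in range(2)] for i in range(2)]
det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
Minv = [[M[1][1] / det, -M[0][1] / det], [-M[1][0] / det, M[0][0] / det]]

def dim(k):
    f1, f2 = F[k]
    return (r[k] * (f1 * Minv[0][0] + f2 * Minv[1][0]),
            r[k] * (f1 * Minv[0][1] + f2 * Minv[1][1]))

def fd(k, eps=1e-5):
    w_hi = [1.0] * len(XS); w_hi[k] += eps
    w_lo = [1.0] * len(XS); w_lo[k] -= eps
    hi, lo = fit(w_hi, t1, t2), fit(w_lo, t1, t2)
    return ((hi[0] - lo[0]) / (2 * eps), (hi[1] - lo[1]) / (2 * eps))
```

By Corollary 5.2.2, $\mathrm{DIM}_{\hat\theta,K}$ is then the $\ell$-weighted combination of the rows `dim(k)` for $k \in K$.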
$\mathrm{DIM}_{\hat\theta,K}$ is a diagnostic measure for assessing the simultaneous influence of several observations on the parameter estimates $\hat{\boldsymbol\theta}$. Since all parameters in the model are estimated from the perturbed model, $\mathrm{DIM}_{\hat\theta,K}$ is regarded as a joint-parameter influence measure. It can be of interest to assess the influence of multiple observations on a particular parameter estimate, $\hat\theta_j$, in model (2.2). If this is the case, we use the same methodology as above and obtain a marginal-parameter influence measure.
Let $\hat{\boldsymbol\theta} = \left(\hat{\boldsymbol\theta}_1, \hat\theta_j\right)$ be a vector of parameter estimates, where
$$
\hat{\boldsymbol\theta}_1 = \left(\hat\theta_1, \ldots, \hat\theta_{j-1}, \hat\theta_{j+1}, \ldots, \hat\theta_q\right)
$$
are the maximum likelihood estimates from the unperturbed model (2.2), and $\hat\theta_j$ is estimated from the perturbed model (5.30).
Definition 5.2.3. The marginal influence measure for assessing the influence of the observations with indices specified in $K$ on the parameter estimate $\hat\theta_j$ is defined as the derivative
$$
\mathrm{DIM}_{\hat\theta_j,K} = \ell^T\frac{d\hat\theta_j(\boldsymbol\omega)}{d\boldsymbol\omega}\bigg|_{\boldsymbol\omega=\mathbf{1}_n}, \qquad (5.39)
$$
where $\ell : n\times 1$ is a vector that has nonzero entries in rows with indices in $K$, $\ell^T\ell = 1$, and $\hat\theta_j(\boldsymbol\omega)$ is the weighted least squares estimate of $\theta_j$, which is a function of the weights $\boldsymbol\omega$.
To get a practically applicable expression for (5.39), we need to look at the weighted least squares estimate of $\theta_j$ in model (5.30). The weighted least squares criterion
$$
Q(\boldsymbol\omega) = \left(\mathbf{y}-\mathbf{f}(\mathbf{X},\hat{\boldsymbol\theta}_1,\theta_j)\right)^T\mathbf{W}(\boldsymbol\omega)\left(\mathbf{y}-\mathbf{f}(\mathbf{X},\hat{\boldsymbol\theta}_1,\theta_j)\right)
$$
is minimized via the normal equation
$$
\left(\frac{d\mathbf{f}(\mathbf{X},\hat{\boldsymbol\theta}_1,\theta_j)}{d\theta_j}\right)\mathbf{W}(\boldsymbol\omega)\left(\mathbf{y}-\mathbf{f}(\mathbf{X},\hat{\boldsymbol\theta}_1,\theta_j)\right) = 0. \qquad (5.40)
$$
The solution of (5.40) is the weighted least squares estimate $\hat\theta_j(\boldsymbol\omega)$.
The next theorem provides an explicit expression of the marginal-parameter influence measure $\mathrm{DIM}_{\hat\theta_j,K}$ defined in (5.39).
Theorem 5.2.2. Let $\mathrm{DIM}_{\hat\theta_j,K}$ be given in Definition 5.2.3. Then
$$
\mathrm{DIM}_{\hat\theta_j,K} = \ell^T\mathbf{U}^*\left(\mathbf{r}\otimes\mathbf{F}^T(\hat\theta_j)\right)\left(\mathbf{F}(\hat\theta_j)\mathbf{F}^T(\hat\theta_j)-\mathbf{G}(\hat\theta_j)\mathbf{r}\right)^{-1}, \qquad (5.41)
$$
provided that the inverse exists.

In (5.41), $\mathbf{r} = \mathbf{y}-\mathbf{f}(\mathbf{X},\hat{\boldsymbol\theta}_1,\hat\theta_j)$, $\mathbf{U}^* : n\times n^2$ is defined in (5.33), and $\mathbf{F}^T(\hat\theta_j) : n\times 1$ and $\mathbf{G}^T(\hat\theta_j) : n\times 1$ are defined in (5.15) and (5.16), respectively.
Proof. Consider inserting the weighted least squares estimate of $\theta_j$ in the normal equation (5.40),
$$
\frac{d\mathbf{f}(\mathbf{X},\hat{\boldsymbol\theta}_1,\theta_j)}{d\theta_j}\bigg|_{\theta_j=\hat\theta_j(\boldsymbol\omega)}\mathbf{W}(\boldsymbol\omega)\left(\mathbf{y}-\mathbf{f}(\mathbf{X},\hat{\boldsymbol\theta}_1,\hat\theta_j(\boldsymbol\omega))\right) = 0, \qquad (5.42)
$$
and letting $\mathbf{F} = \mathbf{F}(\hat\theta_j(\boldsymbol\omega))$, $\mathbf{W} = \mathbf{W}(\boldsymbol\omega)$ and $\mathbf{e} = \mathbf{y}-\mathbf{f}(\mathbf{X}) = \mathbf{y}-\mathbf{f}(\mathbf{X},\hat{\boldsymbol\theta}_1,\hat\theta_j(\boldsymbol\omega))$.
For the subset of observations with indices in K, DIMθ j,K
can be obtained bydifferentiation of (5.42) with respect to ωωω on both sides, i.e.
ddωωω
FWe = 0. (5.43)
Using the product rule (see Appendix A) to calculate the derivative in (5.43)we get
ddωωω
FFFWeee =dFFFdωωω
Weee+dWdωωω
(eee⊗FFFT )+ deee
dωωωWFFFT . (5.44)
Recall from Theorem 5.2.1 that
deeedωωω
=−d fff (XXX)
dωωω,
dWdωωω
=UUU∗.
Next, applying the chain rule, defined in Appendix A, to (5.44) gives
dθ j(ωωω)
dωωω
dFFF
dθ j(ωωω)We+UUU∗
(eee ⊗ FFFT )− dθ j(ωωω)
dωωω
d fff (XXX)
dθ j(ωωω)WFFFT = 0,
and a rearrangement of terms yields
UUU∗(eee ⊗ FFFT )= dθ j(ωωω)
dωωω
(d fff (XXX)
dθ j(ωωω)WFFFT − dFFF
dθ j(ωωω)Weee
).
As previously mentioned, evaluating the derivative at ωωω = 111n gives
UUU∗(
rrr ⊗ FFFT (θ j))
=dθ j(ωωω)
dωωω
∣∣∣∣∣ωωω=111n
(FFF(θ j)FFFT (θ j)−GGG(θ j)rrr
)where θ j(ωωω = 111n) = θ j, the estimate of θ j from the unperturbed model andwhere yyy− fff (XXX ,θθθ 1, θ j(ωωω =111n)) = rrr, the residuals from the unperturbed model.
Again, rearranging terms yields
dθ j(ωωω)
dωωω
∣∣∣∣∣ωωω=111n
= UUU∗(
rrr ⊗ FFFT (θ j))(
FFF(θ j)FFFT (θ j)−GGG(θ j)rrr)−1
,
and this completes the proof. �
100
Corollary 5.2.3. The influence measure $DIM_{\hat{\theta}_j,K}$ in (5.39), for assessing the influence of multiple observations with indices specified in $K$, is a linear combination of the individual influence measures $DIM_{\hat{\theta}_j,k}$ for all $k$ specified in $K$.

Proof. When only the $j$th parameter is estimated from the perturbed model, we have that
$$\ell^T \left. \frac{d\hat{\theta}_j(\boldsymbol{\omega})}{d\boldsymbol{\omega}} \right|_{\boldsymbol{\omega} = \mathbf{1}_n} = \ell^T \mathbf{U}^* \left( \mathbf{r} \otimes \mathbf{F}^T(\hat{\theta}_j) \right) \left( \mathbf{F}(\hat{\theta}_j)\mathbf{F}^T(\hat{\theta}_j) - \mathbf{G}(\hat{\theta}_j)\mathbf{r} \right)^{-1} = \ell^T \left( DIM_{\hat{\theta}_j,1}, \ldots, DIM_{\hat{\theta}_j,n} \right)^T,$$
where $DIM_{\hat{\theta}_j,k}$ is defined in (5.13), which establishes the corollary. □
As an illustration of Corollary 5.2.3, let us assume that we are interested in assessing the joint influence of the 1st and 2nd observations, i.e. $K = \{1,2\}$ and $\ell^T = (\ell_1, \ell_2, 0, \ldots, 0)$. What values to assign to $\ell_1$ and $\ell_2$ depends on which direction we want to use. If we want to find the rate of change of $\hat{\theta}_j(\boldsymbol{\omega})$ when $\omega_1$ and $\omega_2$ vary equally fast, we can use the direction where $\ell_1 = \ell_2 = 1/\sqrt{2}$. In this case the simultaneous influence of the 1st and 2nd observations on $\hat{\theta}_j$ is simply the weighted sum of the individual influence measures, i.e. $\frac{1}{\sqrt{2}}\left( DIM_{\hat{\theta}_j,1} + DIM_{\hat{\theta}_j,2} \right)$.
Note that, since $DIM_{\hat{\boldsymbol{\theta}},K}$ and $DIM_{\hat{\theta}_j,K}$ are linear functions of the individual diagnostic measures, $DIM_{\hat{\boldsymbol{\theta}},k}$ and $DIM_{\hat{\theta}_j,k}$, the discussion in Section 5.1.2 also applies here. For instance, $DIM_{\hat{\boldsymbol{\theta}},K}$ is affected by the dependence between the estimated parameters in the model, due to the fact that they are estimated simultaneously. If one wants to be "certain" of how the observations in the subset $K$ influence a particular parameter estimate, the marginal diagnostic measure $DIM_{\hat{\theta}_j,K}$ should be used, since it is constructed when only the $j$th parameter is estimated from the perturbed model. Moreover, the individual influence measures can be positive or negative. For example, the values of $DIM_{\hat{\boldsymbol{\theta}},K}$ and $DIM_{\hat{\theta}_j,K}$ will be close to zero if the individual influence measures of the observations in the subset $K$ are of opposite signs and similar magnitude. Noteworthy is that the joint influence of the observations in the subset $K$ will be large in magnitude if the individual influence measures of these observations have the same sign.
Moreover, no suggestions have yet been made in the literature regarding how to study the influence of multiple observations on the parameter estimates. In his article about local influence, Cook (1986) discussed the perturbation of a subset of $q$ observations in the linear regression case, where $1 < q < n$. A similar perturbation scheme, where subsets of observations are under consideration, might be possible in the nonlinear regression case. Using the perturbation scheme with $q$ perturbation weights would allow for local influence assessment of, for instance, pairs or triplets of observations. However, this possibility is not discussed explicitly by St. Laurent and Cook (1993), who extended the local influence approach from linear to nonlinear regression.
5.2.3 Conditional influence in linear regression
In practice, there might be situations where an observation is not identified as influential unless another observation is deleted first. The opposite can also occur: an observation labeled as influential no longer appears to be so after the deletion of another observation. To be able to handle such situations when analyzing data, we need tools to evaluate the influence of an observation conditionally on the deletion of another observation in the data set.
Lawrence (1995) was the first to introduce the term conditional influence and suggested calculating Cook's distance for an observation before and after the deletion of another observation. The conditional influence measure defined as Cook's distance of the $k$th observation after the deletion of the $i$th observation is given by
$$C_{k,(i)} = \frac{ \left( \hat{\boldsymbol{\beta}}_{(k,i)} - \hat{\boldsymbol{\beta}}_{(i)} \right)^T \mathbf{X}_{(i)}^T\mathbf{X}_{(i)} \left( \hat{\boldsymbol{\beta}}_{(k,i)} - \hat{\boldsymbol{\beta}}_{(i)} \right) }{ p\hat{\sigma}^2 },$$
where $\hat{\boldsymbol{\beta}}_{(i)}$ and $\hat{\boldsymbol{\beta}}_{(k,i)}$ are the estimates of $\boldsymbol{\beta}$ when the $i$th observation is excluded from the calculations and when both the $k$th and the $i$th observations are excluded, respectively. For other references concerning conditional influence, see e.g. Wang and Critchley (2000) and Poon and Poon (2001).
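Lawrence's measure is straightforward to compute by two case deletions. The sketch below assumes, since the text does not specify it, that $\hat{\sigma}^2$ is the usual residual mean square of the full-data fit; the function name `cond_cooks_distance` is a made-up helper and the data are illustrative:

```python
import numpy as np

def ols(X, y):
    """Ordinary least squares estimate via the normal equations."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def cond_cooks_distance(X, y, k, i):
    """Cook's distance of observation k after deleting observation i
    (Lawrence, 1995).  Indices are 0-based.  Assumption: sigma^2 is
    estimated by the residual mean square of the full-data model."""
    n, p = X.shape
    beta = ols(X, y)
    sigma2 = np.sum((y - X @ beta) ** 2) / (n - p)
    keep_i = [j for j in range(n) if j != i]
    keep_ki = [j for j in range(n) if j not in (i, k)]
    X_i, y_i = X[keep_i], y[keep_i]
    b_i = ols(X_i, y_i)                   # beta with observation i deleted
    b_ki = ols(X[keep_ki], y[keep_ki])    # beta with both i and k deleted
    d = b_ki - b_i
    return float(d @ (X_i.T @ X_i) @ d / (p * sigma2))

# Small illustrative data set (hypothetical numbers).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(10), rng.normal(size=10)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=10)
print(cond_cooks_distance(X, y, k=3, i=7))
```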
In this section we will derive an influence measure for assessing the influence of the $k$th observation on the estimate of the parameter vector in the linear regression model (2.1), conditional on the deletion of the $i$th observation. The ideas for deriving the conditional influence measure will then be extended to nonlinear regression models in Section 5.2.4.
Let us consider the perturbed linear model
$$\mathbf{y}_\omega = \mathbf{X}\boldsymbol{\beta} + \mathbf{d}_i\gamma + \boldsymbol{\varepsilon}_\omega, \qquad (5.45)$$
where $\mathbf{d}_i : n \times 1$ is the $i$th column of the identity matrix of size $n$, $\gamma$ is an unknown parameter, $\boldsymbol{\varepsilon}_\omega \sim N_n(\mathbf{0}, \sigma^2\mathbf{W}^{-1}(\omega_k))$ and $\mathbf{W}(\omega_k) = \mathrm{diag}(1, \ldots, 1, \omega_k, 1, \ldots, 1)$. Adding the component $\mathbf{d}_i\gamma$ and fitting the model deletes the $i$th observation in the estimates (Chatterjee and Hadi, 1988). Thus, using model (5.45) we perturb the error variance of the $k$th observation and, when the model is fitted, the $i$th observation is deleted. In the next definition the perturbed linear model will be utilized.

Definition 5.2.4. The influence measure for assessing the influence of the $k$th observation on $\hat{\boldsymbol{\beta}}$, conditional on the deletion of the $i$th observation, is defined as
$$DIM_{\hat{\boldsymbol{\beta}}_{(i)},k} = \left. \frac{d\hat{\boldsymbol{\beta}}_{(i)}(\omega_k)}{d\omega_k} \right|_{\omega_k = 1},$$
for $i,k = 1, \ldots, n$ and $i \neq k$, where $\hat{\boldsymbol{\beta}}_{(i)}(\omega_k)$ is the weighted least squares estimate of $\boldsymbol{\beta}$ in the perturbed model (5.45), i.e. the estimate when the $i$th observation is excluded from the calculations.
In Definition 5.2.4 the weighted least squares estimator of $\boldsymbol{\beta}$ is needed. In order to derive the estimator we start by estimating $\gamma$ in the perturbed model (5.45). Differentiating
$$Q_{(i)}(\omega_k) = \left( \mathbf{y} - \mathbf{X}\boldsymbol{\beta} - \mathbf{d}_i\gamma \right)^T \mathbf{W}(\omega_k) \left( \mathbf{y} - \mathbf{X}\boldsymbol{\beta} - \mathbf{d}_i\gamma \right) \qquad (5.46)$$
yields the following normal equation for $\gamma$:
$$\frac{dQ_{(i)}(\omega_k)}{d\gamma} = -2\left( y_i - \mathbf{x}_i^T\boldsymbol{\beta} - \gamma \right) = 0. \qquad (5.47)$$
Utilizing (5.47), the estimator is given by
$$\hat{\gamma} = y_i - \mathbf{x}_i^T\boldsymbol{\beta}.$$
Now, inserting $\hat{\gamma}$ in (5.46) deletes the $i$th observation from the expression, and (5.46) results in
$$Q_{(i)}(\omega_k) = \left( \mathbf{y}_{(i)} - \mathbf{X}_{(i)}\boldsymbol{\beta} \right)^T \mathbf{W}_{(i)} \left( \mathbf{y}_{(i)} - \mathbf{X}_{(i)}\boldsymbol{\beta} \right), \qquad (5.48)$$
where $\mathbf{X}_{(i)} : (n-1) \times p$ is the matrix of explanatory variables with the $i$th row omitted, $\mathbf{y}_{(i)} : (n-1) \times 1$ is the response vector with the $i$th response omitted and $\mathbf{W}_{(i)} = \mathbf{W}_{(i)}(\omega_k)$ is the weight matrix of order $(n-1)$ with the $i$th row and the $i$th column omitted.
Minimizing (5.48), the following normal equations for $\boldsymbol{\beta}$ are obtained:
$$\frac{dQ_{(i)}(\omega_k)}{d\boldsymbol{\beta}} = -2\mathbf{X}_{(i)}^T\mathbf{W}_{(i)}\left( \mathbf{y}_{(i)} - \mathbf{X}_{(i)}\boldsymbol{\beta} \right) = \mathbf{0}. \qquad (5.49)$$
Utilizing (5.49), the weighted least squares estimator of $\boldsymbol{\beta}$ is
$$\hat{\boldsymbol{\beta}}_{(i)}(\omega_k) = \left( \mathbf{X}_{(i)}^T\mathbf{W}_{(i)}\mathbf{X}_{(i)} \right)^{-1}\mathbf{X}_{(i)}^T\mathbf{W}_{(i)}\mathbf{y}_{(i)}. \qquad (5.50)$$
In the next theorem we obtain $DIM_{\hat{\boldsymbol{\beta}}_{(i)},k}$ utilizing (5.50). Observe that the theorem contains two explicit expressions of $DIM_{\hat{\boldsymbol{\beta}}_{(i)},k}$, one for the case where $i < k$ and one for the case where $i > k$. When $i < k$, the position of the $k$th observation in the response vector and the matrix of explanatory variables is affected by the deletion of the $i$th observation; the $k$th observation will in this case be denoted $k-1$. On the other hand, when $i > k$ the position of the $k$th observation is not affected by the deletion of the $i$th observation.
Theorem 5.2.3. Let $DIM_{\hat{\boldsymbol{\beta}}_{(i)},k}$ be given in Definition 5.2.4. Then, if $i > k$,
$$DIM_{\hat{\boldsymbol{\beta}}_{(i)},k} = \left( y_{k,(i)} - \mathbf{x}_{k,(i)}^T\hat{\boldsymbol{\beta}}_{(i)} \right)\mathbf{x}_{k,(i)}^T\left( \mathbf{X}_{(i)}^T\mathbf{X}_{(i)} \right)^{-1}.$$
Moreover, if $i < k$,
$$DIM_{\hat{\boldsymbol{\beta}}_{(i)},k} = \left( y_{k-1,(i)} - \mathbf{x}_{k-1,(i)}^T\hat{\boldsymbol{\beta}}_{(i)} \right)\mathbf{x}_{k-1,(i)}^T\left( \mathbf{X}_{(i)}^T\mathbf{X}_{(i)} \right)^{-1}.$$
In the expressions above, $\mathbf{y}_{(i)} = (y_{1,(i)}, \ldots, y_{k-1,(i)}, y_{k,(i)}, \ldots, y_{n-1,(i)})^T$ is the vector of responses with the $i$th observation excluded, $\mathbf{X}_{(i)} : (n-1) \times p$ is the matrix of explanatory variables with the $i$th row excluded, and $\mathbf{x}_{k,(i)}^T$ is the $k$th row of $\mathbf{X}_{(i)}$.
Proof. The explicit expression of $DIM_{\hat{\boldsymbol{\beta}}_{(i)},k}$ is found by first differentiating the estimator of $\boldsymbol{\beta}$ in (5.45) and then evaluating the derivative at $\omega_k = 1$.

Using the product rule and the rule for differentiating a matrix inverse, defined in Appendix A, the derivative equals
$$\frac{d\hat{\boldsymbol{\beta}}_{(i)}(\omega_k)}{d\omega_k} = \frac{d\left( \mathbf{X}_{(i)}^T\mathbf{W}_{(i)}\mathbf{X}_{(i)} \right)^{-1}\mathbf{X}_{(i)}^T\mathbf{W}_{(i)}\mathbf{y}_{(i)}}{d\omega_k} \qquad (5.51)$$
$$= \frac{d\mathbf{W}_{(i)}}{d\omega_k}\left( \mathbf{y}_{(i)} \otimes \mathbf{X}_{(i)} \right)\left( \mathbf{X}_{(i)}^T\mathbf{W}_{(i)}\mathbf{X}_{(i)} \right)^{-1} - \frac{d\mathbf{W}_{(i)}}{d\omega_k}\left( \mathbf{X}_{(i)} \otimes \mathbf{X}_{(i)} \right)\left( \left( \mathbf{X}_{(i)}^T\mathbf{W}_{(i)}\mathbf{X}_{(i)} \right)^{-1} \otimes \left( \mathbf{X}_{(i)}^T\mathbf{W}_{(i)}\mathbf{X}_{(i)} \right)^{-1} \right)\left( \mathbf{X}_{(i)}^T\mathbf{W}_{(i)}\mathbf{y}_{(i)} \otimes \mathbf{I}_p \right).$$
Due to the linearity of $\mathbf{W}_{(i)}$ we obtain, if $i > k$,
$$\frac{d\mathbf{W}_{(i)}}{d\omega_k} = \mathbf{d}_k^T \otimes \mathbf{d}_k^T,$$
where $\mathbf{d}_k$ is the $k$th column of the identity matrix of size $n-1$. If $i < k$, we obtain
$$\frac{d\mathbf{W}_{(i)}}{d\omega_k} = \mathbf{d}_{k-1}^T \otimes \mathbf{d}_{k-1}^T,$$
where $\mathbf{d}_{k-1}$ is the $(k-1)$th column of the identity matrix of size $n-1$.

Now, if $i > k$, evaluating the derived expression (5.51) at $\omega_k = 1$ we get
$$\left. \frac{d}{d\omega_k}\hat{\boldsymbol{\beta}}_{(i)}(\omega_k) \right|_{\omega_k = 1} = \left( \mathbf{d}_k^T \otimes \mathbf{d}_k^T \right)\left[ \left( \mathbf{y}_{(i)} \otimes \mathbf{X}_{(i)} \right)\left( \mathbf{X}_{(i)}^T\mathbf{X}_{(i)} \right)^{-1} - \left( \mathbf{X}_{(i)} \otimes \mathbf{X}_{(i)} \right)\left( \left( \mathbf{X}_{(i)}^T\mathbf{X}_{(i)} \right)^{-1} \otimes \left( \mathbf{X}_{(i)}^T\mathbf{X}_{(i)} \right)^{-1} \right)\left( \mathbf{X}_{(i)}^T\mathbf{y}_{(i)} \otimes \mathbf{I}_p \right) \right]$$
$$= \left( \mathbf{d}_k^T \otimes \mathbf{d}_k^T \right)\left[ \left( \mathbf{y}_{(i)} \otimes \mathbf{X}_{(i)} \right)\left( \mathbf{X}_{(i)}^T\mathbf{X}_{(i)} \right)^{-1} - \left( \mathbf{X}_{(i)}\hat{\boldsymbol{\beta}}_{(i)} \otimes \mathbf{X}_{(i)} \right)\left( \mathbf{X}_{(i)}^T\mathbf{X}_{(i)} \right)^{-1} \right]$$
$$= \left( \mathbf{d}_k^T \otimes \mathbf{d}_k^T \right)\left( \left( \mathbf{y}_{(i)} - \mathbf{X}_{(i)}\hat{\boldsymbol{\beta}}_{(i)} \right) \otimes \mathbf{X}_{(i)} \right)\left( \mathbf{X}_{(i)}^T\mathbf{X}_{(i)} \right)^{-1}$$
$$= \left( \mathbf{d}_k^T\left( \mathbf{y}_{(i)} - \mathbf{X}_{(i)}\hat{\boldsymbol{\beta}}_{(i)} \right) \otimes \mathbf{d}_k^T\mathbf{X}_{(i)} \right)\left( \mathbf{X}_{(i)}^T\mathbf{X}_{(i)} \right)^{-1} \qquad (5.52)$$
$$= \left( y_{k,(i)} - \mathbf{x}_{k,(i)}^T\hat{\boldsymbol{\beta}}_{(i)} \right)\mathbf{x}_{k,(i)}^T\left( \mathbf{X}_{(i)}^T\mathbf{X}_{(i)} \right)^{-1},$$
and hence the final expression for $DIM_{\hat{\boldsymbol{\beta}}_{(i)},k}$ is
$$DIM_{\hat{\boldsymbol{\beta}}_{(i)},k} = \left( y_{k,(i)} - \mathbf{x}_{k,(i)}^T\hat{\boldsymbol{\beta}}_{(i)} \right)\mathbf{x}_{k,(i)}^T\left( \mathbf{X}_{(i)}^T\mathbf{X}_{(i)} \right)^{-1}.$$
On the other hand, if $i < k$, $\mathbf{d}_k^T \otimes \mathbf{d}_k^T$ is replaced with $\mathbf{d}_{k-1}^T \otimes \mathbf{d}_{k-1}^T$ in (5.52), resulting in
$$\left. \frac{d}{d\omega_k}\hat{\boldsymbol{\beta}}_{(i)}(\omega_k) \right|_{\omega_k = 1} = \left( \mathbf{d}_{k-1}^T\left( \mathbf{y}_{(i)} - \mathbf{X}_{(i)}\hat{\boldsymbol{\beta}}_{(i)} \right) \otimes \mathbf{d}_{k-1}^T\mathbf{X}_{(i)} \right)\left( \mathbf{X}_{(i)}^T\mathbf{X}_{(i)} \right)^{-1} \qquad (5.53)$$
$$= \left( y_{k-1,(i)} - \mathbf{x}_{k-1,(i)}^T\hat{\boldsymbol{\beta}}_{(i)} \right)\mathbf{x}_{k-1,(i)}^T\left( \mathbf{X}_{(i)}^T\mathbf{X}_{(i)} \right)^{-1}.$$
Thus, the final expression for $DIM_{\hat{\boldsymbol{\beta}}_{(i)},k}$ is
$$DIM_{\hat{\boldsymbol{\beta}}_{(i)},k} = \left( y_{k-1,(i)} - \mathbf{x}_{k-1,(i)}^T\hat{\boldsymbol{\beta}}_{(i)} \right)\mathbf{x}_{k-1,(i)}^T\left( \mathbf{X}_{(i)}^T\mathbf{X}_{(i)} \right)^{-1},$$
and the proof is complete. □
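The closed-form expressions of Theorem 5.2.3 are easy to implement; the index shift for $i < k$ is handled automatically when rows are re-indexed after deletion. The sketch below uses made-up data and the hypothetical helper name `dim_cond_linear` (0-based indexing, unlike the 1-based notation of the text):

```python
import numpy as np

def dim_cond_linear(X, y, k, i):
    """Conditional influence measure DIM_{beta_(i),k} of Theorem 5.2.3.
    0-based indices; the k-1 shift for i < k is the re-indexing of
    observation k after row i has been removed."""
    assert i != k
    n = X.shape[0]
    mask = np.arange(n) != i
    X_i, y_i = X[mask], y[mask]
    beta_i = np.linalg.solve(X_i.T @ X_i, X_i.T @ y_i)
    kk = k if k < i else k - 1          # position of k after deleting i
    resid_k = y_i[kk] - X_i[kk] @ beta_i
    return resid_k * X_i[kk] @ np.linalg.inv(X_i.T @ X_i)

# Illustrative data (hypothetical numbers).
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(12), rng.normal(size=12)])
y = X @ np.array([1.0, 0.5]) + rng.normal(scale=0.3, size=12)
print(dim_cond_linear(X, y, k=2, i=9))   # i > k: no index shift
print(dim_cond_linear(X, y, k=9, i=2))   # i < k: k maps to position k-1
```

By Definition 5.2.4 this should match the derivative of the weighted estimate $\hat{\boldsymbol{\beta}}_{(i)}(\omega_k)$ at $\omega_k = 1$, which can be verified with a central difference.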
To clarify why two expressions of $DIM_{\hat{\boldsymbol{\beta}}_{(i)},k}$ are needed, we present a small example.

Example 5.2. Illustration of the structure of $DIM_{\hat{\boldsymbol{\beta}}_{(i)},k}$

Let the linear regression model be
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon},$$
where $\mathbf{y} : 4 \times 1$, $\mathbf{X} : 4 \times 2$, $\boldsymbol{\beta} : 2 \times 1$ and $\boldsymbol{\varepsilon} : 4 \times 1$. Assume that we want to study the influence of the 2nd observation conditional on the deletion of the 1st observation. In this case $i < k$.

In the proof of Theorem 5.2.3 we see that we need to consider the derivative of $\mathbf{W}_{(i)}$ with respect to $\omega_k$. First, observe that $\mathbf{W}$ equals
$$\mathbf{W} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & \omega_2 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}.$$
Secondly, $\mathbf{W}_{(i)}$, in this example denoted $\mathbf{W}_{(1)}$, equals
$$\mathbf{W}_{(1)} = \begin{pmatrix} \omega_2 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix},$$
with derivative
$$\frac{d\mathbf{W}_{(1)}}{d\omega_k} = (1, 0, \ldots, 0) = \mathbf{d}_{2-1}^T \otimes \mathbf{d}_{2-1}^T,$$
where $\mathbf{d}_{2-1} = \mathbf{d}_1$ is the first column of the identity matrix of size 3.

Consider (5.53) in the proof of Theorem 5.2.3. For this simple example we have that
$$\mathbf{y}_{(1)} = \begin{pmatrix} y_2 \\ y_3 \\ y_4 \end{pmatrix}, \qquad \mathbf{X}_{(1)} = \begin{pmatrix} x_{21} & x_{22} \\ x_{31} & x_{32} \\ x_{41} & x_{42} \end{pmatrix}, \qquad \hat{\boldsymbol{\beta}}_{(1)} = \begin{pmatrix} \hat{\beta}_{1,(1)} \\ \hat{\beta}_{2,(1)} \end{pmatrix},$$
so that $y_{k-1,(i)}$ in (5.53) equals $y_2$, the first component of the vector $\mathbf{y}_{(1)}$. In a similar way, we observe that $\mathbf{x}_{k-1,(i)}^T$ in (5.53) equals $\mathbf{x}_2^T$, the first row of the matrix $\mathbf{X}_{(1)}$, and finally the expression of $DIM_{\hat{\boldsymbol{\beta}}_{(1)},2}$ is
$$DIM_{\hat{\boldsymbol{\beta}}_{(1)},2} = \left( y_2 - \mathbf{x}_2^T\hat{\boldsymbol{\beta}}_{(1)} \right)\mathbf{x}_2^T\left( \mathbf{X}_{(1)}^T\mathbf{X}_{(1)} \right)^{-1}.$$
If, on the other hand, $i = 2$ and $k = 1$, we have that $i > k$ and the position of the weight $\omega_k$ is not affected by the deletion of the $i$th observation. To see this, we consider the weight matrix $\mathbf{W}$ and the weight matrix $\mathbf{W}_{(2)}$:
$$\mathbf{W} = \begin{pmatrix} \omega_1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \qquad \mathbf{W}_{(2)} = \begin{pmatrix} \omega_1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}.$$
Hence, in the case where $i > k$, the position of $\omega_k$ is unchanged in the matrix $\mathbf{W}_{(i)}$ and it follows that $k$ can remain unchanged in the explicit expression of $DIM_{\hat{\boldsymbol{\beta}}_{(i)},k}$ and throughout the proof.
In the next section the ideas from this section will be used in order to derive an influence measure for assessing the conditional influence of the observations on the parameter estimates in the nonlinear regression model (2.2).

5.2.4 Conditional influence in nonlinear regression

In this section, we will define and derive an influence measure for use in nonlinear regression analysis, similar to the $DIM_{\hat{\boldsymbol{\beta}}_{(i)},k}$ discussed in the previous section. This measure is denoted $DIM_{\hat{\boldsymbol{\theta}}_{(i)},k}$ and is used for assessing the influence of the $k$th observation on the parameter estimate of $\boldsymbol{\theta}$ in (2.2), conditional on the deletion of the $i$th observation.
Now, consider the following perturbed nonlinear model
$$\mathbf{y}_\omega = \mathbf{f}(\mathbf{X}, \boldsymbol{\theta}) + \mathbf{d}_i\gamma + \boldsymbol{\varepsilon}_\omega, \qquad (5.54)$$
where $\mathbf{d}_i$ is the $i$th column of the identity matrix of size $n$ and $\gamma$ is an unknown parameter. Adding $\mathbf{d}_i\gamma$ to the perturbed model deletes the $i$th observation when the model is fitted; see Ross (1987). Moreover, in (5.54), $\boldsymbol{\varepsilon}_\omega \sim N_n\left( \mathbf{0}, \sigma^2\mathbf{W}^{-1}(\omega_k) \right)$, where the weight matrix $\mathbf{W}(\omega_k)$ equals
$$\mathbf{W}(\omega_k) = \mathrm{diag}(1, \ldots, 1, \omega_k, 1, \ldots, 1).$$
Thus, in (5.54) we perturb the error variance of the $k$th observation and, when the model is fitted, the $i$th observation is deleted. In the next definition the perturbed nonlinear model will be utilized.

Definition 5.2.5. The influence measure for assessing the influence of the $k$th observation on the parameter estimates in the nonlinear regression model (2.2), conditional on the deletion of the $i$th observation, is defined as
$$DIM_{\hat{\boldsymbol{\theta}}_{(i)},k} = \left. \frac{d\hat{\boldsymbol{\theta}}_{(i)}(\omega_k)}{d\omega_k} \right|_{\omega_k = 1},$$
where $\hat{\boldsymbol{\theta}}_{(i)}(\omega_k)$, for $i,k = 1, \ldots, n$ and $i \neq k$, is the weighted least squares estimate of $\boldsymbol{\theta}$ from the perturbed model (5.54), i.e. the estimate when the $i$th observation is excluded from the calculations.

Observe that if $\omega_k \to 1$, then $\hat{\boldsymbol{\theta}}_{(i)}(\omega_k) \to \hat{\boldsymbol{\theta}}_{(i)}$, the unweighted least squares estimate of $\boldsymbol{\theta}$ calculated without the $i$th observation.
To estimate $\boldsymbol{\theta}$ in the perturbed nonlinear model (5.54) the method of weighted least squares will be used. As in Section 5.2.3, we start by finding the estimator for $\gamma$, which for any $\boldsymbol{\theta}$ is given by
$$\hat{\gamma} = y_i - f_i(\mathbf{X}, \boldsymbol{\theta}),$$
where $f_i(\mathbf{X}, \boldsymbol{\theta})$ is the $i$th entry of the vector $\mathbf{f}(\mathbf{X}, \boldsymbol{\theta})$.

Using $\hat{\gamma}$ in the estimation process deletes the $i$th observation, and the normal equations obtained by minimizing the weighted sum of squares are the following:
$$\left( \frac{d\mathbf{f}(\mathbf{X}_{(i)}, \boldsymbol{\theta})}{d\boldsymbol{\theta}} \right) \mathbf{W}_{(i)}\left( \mathbf{y}_{(i)} - \mathbf{f}(\mathbf{X}_{(i)}, \boldsymbol{\theta}) \right) = \mathbf{0}, \qquad (5.55)$$
where $\mathbf{X}_{(i)}$ and $\mathbf{y}_{(i)}$ are the design matrix and the response vector, respectively, with the $i$th row excluded. Moreover, $\mathbf{W}_{(i)} = \mathbf{W}_{(i)}(\omega_k)$ is the weight matrix of order $(n-1)$ with the $i$th row and the $i$th column excluded.

The normal equations in (5.55) are solved for $\boldsymbol{\theta}$ using iterative methods, with the $i$th observation excluded from the calculations. The obtained weighted least squares estimate is denoted $\hat{\boldsymbol{\theta}}_{(i)}(\omega_k)$.
In the next theorem, an explicit expression of $DIM_{\hat{\boldsymbol{\theta}}_{(i)},k}$ is presented. Observe that, as in Theorem 5.2.3, the next theorem contains two expressions of $DIM_{\hat{\boldsymbol{\theta}}_{(i)},k}$, one corresponding to the case where $i > k$ and one corresponding to the case where $i < k$, for the same reason as for the linear regression model; see the previous section.

Theorem 5.2.4. Let $DIM_{\hat{\boldsymbol{\theta}}_{(i)},k}$ be given in Definition 5.2.5. Then, if $i > k$,
$$DIM_{\hat{\boldsymbol{\theta}}_{(i)},k} = r_{k,(i)}\mathbf{F}_{k,(i)}^T\left( \mathbf{F}_{(i)}\mathbf{F}_{(i)}^T - \mathbf{G}_{(i)}\left( \mathbf{r}_{(i)} \otimes \mathbf{I}_q \right) \right)^{-1},$$
provided that the matrix inverse exists. Moreover, if $i < k$,
$$DIM_{\hat{\boldsymbol{\theta}}_{(i)},k} = r_{k-1,(i)}\mathbf{F}_{k-1,(i)}^T\left( \mathbf{F}_{(i)}\mathbf{F}_{(i)}^T - \mathbf{G}_{(i)}\left( \mathbf{r}_{(i)} \otimes \mathbf{I}_q \right) \right)^{-1},$$
provided that the matrix inverse exists.

In the expressions above, $\mathbf{F}_{(i)} = \mathbf{F}_{(i)}(\hat{\boldsymbol{\theta}}_{(i)})$ is the $q \times (n-1)$ matrix of derivatives
$$\mathbf{F}_{(i)} = \left. \frac{d\mathbf{f}(\mathbf{X}_{(i)}, \boldsymbol{\theta})}{d\boldsymbol{\theta}} \right|_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}_{(i)}},$$
where $\mathbf{F}_{k-1,(i)}^T$ and $\mathbf{F}_{k,(i)}^T$ are the $(k-1)$th and $k$th rows of $\mathbf{F}_{(i)}^T$, respectively. The matrix $\mathbf{G}_{(i)} = \mathbf{G}_{(i)}(\hat{\boldsymbol{\theta}}_{(i)})$ is the $q \times q(n-1)$ matrix of derivatives
$$\mathbf{G}_{(i)} = \frac{d\mathbf{F}_{(i)}}{d\hat{\boldsymbol{\theta}}_{(i)}}.$$
Moreover, $\mathbf{r}_{(i)} = (r_{1,(i)}, \ldots, r_{k-1,(i)}, r_{k,(i)}, \ldots, r_{n-1,(i)})^T = \mathbf{y}_{(i)} - \mathbf{f}(\mathbf{X}_{(i)}, \hat{\boldsymbol{\theta}}_{(i)})$.
Proof. Consider inserting $\hat{\boldsymbol{\theta}}_{(i)}(\omega_k)$ in (5.55),
$$\left. \frac{d\mathbf{f}(\mathbf{X}_{(i)}, \boldsymbol{\theta})}{d\boldsymbol{\theta}} \right|_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}_{(i)}(\omega_k)} \mathbf{W}_{(i)}\left( \mathbf{y}_{(i)} - \mathbf{f}(\mathbf{X}_{(i)}, \hat{\boldsymbol{\theta}}_{(i)}(\omega_k)) \right) = \mathbf{0}, \qquad (5.56)$$
and let $\mathbf{e}_{(i)} = \mathbf{y}_{(i)} - \mathbf{f}(\mathbf{X}_{(i)}) = \mathbf{y}_{(i)} - \mathbf{f}(\mathbf{X}_{(i)}, \hat{\boldsymbol{\theta}}_{(i)}(\omega_k))$, $\mathbf{W}_{(i)} = \mathbf{W}_{(i)}(\omega_k)$ and $\mathbf{F}_{(i)} = \mathbf{F}_{(i)}(\hat{\boldsymbol{\theta}}_{(i)}(\omega_k))$. To derive the explicit expression of $DIM_{\hat{\boldsymbol{\theta}}_{(i)},k}$, the following derivative of the normal equations (5.56) will be utilized:
$$\frac{d}{d\omega_k}\mathbf{F}_{(i)}\mathbf{W}_{(i)}\mathbf{e}_{(i)} = \mathbf{0}. \qquad (5.57)$$
To calculate the derivative in (5.57), the product rule defined in Appendix A is applied:
$$\frac{d}{d\omega_k}\left( \mathbf{F}_{(i)}\mathbf{W}_{(i)}\mathbf{e}_{(i)} \right) = \frac{d\mathbf{F}_{(i)}}{d\omega_k}\left( \mathbf{W}_{(i)}\mathbf{e}_{(i)} \otimes \mathbf{I}_q \right) + \frac{d\mathbf{W}_{(i)}}{d\omega_k}\left( \mathbf{e}_{(i)} \otimes \mathbf{F}_{(i)}^T \right) + \frac{d\mathbf{e}_{(i)}}{d\omega_k}\mathbf{W}_{(i)}\mathbf{F}_{(i)}^T. \qquad (5.58)$$
In (5.58),
$$\frac{d\mathbf{e}_{(i)}}{d\omega_k} = -\frac{d\mathbf{f}(\mathbf{X}_{(i)})}{d\omega_k},$$
and due to the linearity of $\mathbf{W}_{(i)}$ the following expressions are obtained:
$$\frac{d\mathbf{W}_{(i)}}{d\omega_k} = \mathbf{d}_k^T \otimes \mathbf{d}_k^T, \quad i > k, \qquad \frac{d\mathbf{W}_{(i)}}{d\omega_k} = \mathbf{d}_{k-1}^T \otimes \mathbf{d}_{k-1}^T, \quad i < k,$$
where $\mathbf{d}_k$ and $\mathbf{d}_{k-1}$ are the $k$th and the $(k-1)$th columns of the identity matrix of size $n-1$, respectively.

If $i > k$, continuing from (5.57) to (5.58), applying the chain rule, defined in Appendix A, to (5.58) and rearranging terms give
$$\frac{d\hat{\boldsymbol{\theta}}_{(i)}(\omega_k)}{d\omega_k}\left( \frac{d\mathbf{f}(\mathbf{X}_{(i)})}{d\hat{\boldsymbol{\theta}}_{(i)}(\omega_k)}\mathbf{W}_{(i)}\mathbf{F}_{(i)}^T - \frac{d\mathbf{F}_{(i)}}{d\hat{\boldsymbol{\theta}}_{(i)}(\omega_k)}\left( \mathbf{W}_{(i)}\mathbf{e}_{(i)} \otimes \mathbf{I}_q \right) \right) = \mathbf{d}_k^T\mathbf{e}_{(i)} \otimes \mathbf{d}_k^T\mathbf{F}_{(i)}^T. \qquad (5.59)$$
Evaluating the derivative (5.59) at $\omega_k = 1$ yields
$$\left. \frac{d\mathbf{f}(\mathbf{X}_{(i)})}{d\hat{\boldsymbol{\theta}}_{(i)}(\omega_k)} \right|_{\omega_k = 1} = \frac{d\mathbf{f}(\mathbf{X}_{(i)}, \hat{\boldsymbol{\theta}}_{(i)})}{d\hat{\boldsymbol{\theta}}_{(i)}} = \mathbf{F}_{(i)}(\hat{\boldsymbol{\theta}}_{(i)}), \qquad \left. \frac{d\mathbf{F}_{(i)}}{d\hat{\boldsymbol{\theta}}_{(i)}(\omega_k)} \right|_{\omega_k = 1} = \frac{d\mathbf{F}_{(i)}(\hat{\boldsymbol{\theta}}_{(i)})}{d\hat{\boldsymbol{\theta}}_{(i)}} = \mathbf{G}_{(i)}(\hat{\boldsymbol{\theta}}_{(i)}),$$
and
$$\mathbf{d}_k^T\mathbf{r}_{(i)} \otimes \mathbf{d}_k^T\mathbf{F}_{(i)}^T(\hat{\boldsymbol{\theta}}_{(i)}) = \left. \frac{d\hat{\boldsymbol{\theta}}_{(i)}(\omega_k)}{d\omega_k} \right|_{\omega_k = 1}\left( \mathbf{F}_{(i)}(\hat{\boldsymbol{\theta}}_{(i)})\mathbf{F}_{(i)}^T(\hat{\boldsymbol{\theta}}_{(i)}) - \mathbf{G}_{(i)}(\hat{\boldsymbol{\theta}}_{(i)})\left( \mathbf{r}_{(i)} \otimes \mathbf{I}_q \right) \right),$$
i.e.
$$r_{k,(i)}\mathbf{F}_{k,(i)}^T(\hat{\boldsymbol{\theta}}_{(i)}) = DIM_{\hat{\boldsymbol{\theta}}_{(i)},k}\left( \mathbf{F}_{(i)}(\hat{\boldsymbol{\theta}}_{(i)})\mathbf{F}_{(i)}^T(\hat{\boldsymbol{\theta}}_{(i)}) - \mathbf{G}_{(i)}(\hat{\boldsymbol{\theta}}_{(i)})\left( \mathbf{r}_{(i)} \otimes \mathbf{I}_q \right) \right).$$
Hence, the final expression of $DIM_{\hat{\boldsymbol{\theta}}_{(i)},k}$ is given by
$$DIM_{\hat{\boldsymbol{\theta}}_{(i)},k} = r_{k,(i)}\mathbf{F}_{k,(i)}^T(\hat{\boldsymbol{\theta}}_{(i)})\left( \mathbf{F}_{(i)}(\hat{\boldsymbol{\theta}}_{(i)})\mathbf{F}_{(i)}^T(\hat{\boldsymbol{\theta}}_{(i)}) - \mathbf{G}_{(i)}(\hat{\boldsymbol{\theta}}_{(i)})\left( \mathbf{r}_{(i)} \otimes \mathbf{I}_q \right) \right)^{-1}.$$
On the other hand, if $i < k$ we have that
$$\mathbf{d}_{k-1}^T\mathbf{r}_{(i)} \otimes \mathbf{d}_{k-1}^T\mathbf{F}_{(i)}^T(\hat{\boldsymbol{\theta}}_{(i)}) = r_{k-1,(i)}\mathbf{F}_{k-1,(i)}^T(\hat{\boldsymbol{\theta}}_{(i)}),$$
resulting in
$$DIM_{\hat{\boldsymbol{\theta}}_{(i)},k} = r_{k-1,(i)}\mathbf{F}_{k-1,(i)}^T(\hat{\boldsymbol{\theta}}_{(i)})\left( \mathbf{F}_{(i)}(\hat{\boldsymbol{\theta}}_{(i)})\mathbf{F}_{(i)}^T(\hat{\boldsymbol{\theta}}_{(i)}) - \mathbf{G}_{(i)}(\hat{\boldsymbol{\theta}}_{(i)})\left( \mathbf{r}_{(i)} \otimes \mathbf{I}_q \right) \right)^{-1},$$
provided that the inverse exists, and the proof is complete. □
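In practice one can also obtain $DIM_{\hat{\boldsymbol{\theta}}_{(i)},k}$ as a numerical derivative of Definition 5.2.5 rather than from the closed form, which avoids building $\mathbf{F}_{(i)}$ and $\mathbf{G}_{(i)}$ explicitly. The sketch below does this for the Michaelis-Menten model via a weighted Gauss-Newton fit and a central difference at $\omega_k = 1$; the helper names (`mm`, `wls_fit`, `dim_cond_nonlinear`) and the data are illustrative assumptions, not from the thesis:

```python
import numpy as np

def mm(x, th):                       # Michaelis-Menten mean function (2.4)
    return th[0] * x / (th[1] + x)

def mm_jac(x, th):                   # n x 2 Jacobian of f w.r.t. theta
    den = th[1] + x
    return np.column_stack([x / den, -th[0] * x / den ** 2])

def wls_fit(x, y, w, th0, iters=50):
    """Weighted Gauss-Newton for the Michaelis-Menten model."""
    th = np.asarray(th0, float)
    for _ in range(iters):
        J = mm_jac(x, th)
        r = y - mm(x, th)
        step = np.linalg.solve(J.T @ (w[:, None] * J), J.T @ (w * r))
        th = th + step
        if np.max(np.abs(step)) < 1e-12:
            break
    return th

def dim_cond_nonlinear(x, y, k, i, th0, eps=1e-6):
    """Numerical DIM_{theta_(i),k}: central-difference derivative of the
    weighted estimate theta_(i)(omega_k) at omega_k = 1 (Definition 5.2.5).
    0-based indices; observation i is deleted, observation k is perturbed."""
    mask = np.arange(len(y)) != i
    x_i, y_i = x[mask], y[mask]
    kk = k if k < i else k - 1       # position of k after deleting i
    def fit(wk):
        w = np.ones(len(y_i))
        w[kk] = wk
        return wls_fit(x_i, y_i, w, th0)
    return (fit(1 + eps) - fit(1 - eps)) / (2 * eps)

# Illustrative Michaelis-Menten data (hypothetical numbers).
x = np.array([0.02, 0.06, 0.11, 0.22, 0.56, 1.1, 2.2, 3.4])
theta_true = np.array([2.0, 0.3])
rng = np.random.default_rng(3)
y = mm(x, theta_true) + rng.normal(scale=0.03, size=x.size)
print(dim_cond_nonlinear(x, y, k=7, i=0, th0=[2.0, 0.3]))
```

Note that this numerical route still requires one iterative re-fit per deleted observation $i$, which is exactly the computational cost discussed below.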
We say that hidden dependencies among the observations can be revealed when studying the conditional influence. To clarify this statement and to explain why conditional influence might be of interest, consider the following. Some observations are expected to be more important than others when it comes to estimating the parameters, due to the functional form of the particular expectation function. For instance, when studying the Michaelis-Menten model (2.4) we know that observations with high substrate concentration (i.e. high values of $X$) are important for the estimation of $\theta_1$. It is expected that some of the observations with high substrate concentration will stand out from the rest when studying the influence on $\hat{\theta}_1$. If such an observation, located in the area important for estimation and with a high value of the corresponding influence measure, were to be deleted, it would not be surprising at all if another observation located in the same area emerged as influential. However, if an observation outside the area that is important for the estimation of $\theta_1$ emerged as influential when deleting the influential observation with high substrate concentration, we would become suspicious. This would indicate some kind of dependence between the deleted observation and the influential observation that we were not aware of. Further investigation of these two observations is thus needed if one wants to understand the reason for this dependence.
Using the conditional influence approach can also give the researcher information about the observations that are deleted. If one observation becomes conditionally influential upon the deletion of another observation, we gain information about the deleted observation as well. For instance, we know that the deleted observation exerts considerable influence on the other observations, since it has strong enough influence to hide another influential observation.
Studying the conditional influence between observations can certainly be interesting and useful. However, one should keep in mind that this approach involves case deletion, which means that additional iterations for parameter estimation are needed: for every $i$ we study, i.e. for every "deleted observation", the parameters must be re-estimated iteratively.
5.3 Summary of Chapter 5
This chapter contains numerous new influence measures to use in various situations. We now summarize the proposed measures together with a short description of when to use each.
• $DIM_{\hat{\boldsymbol{\theta}},k}$ - an influence measure to be used when we are interested in assessing the influence of the $k$th observation on all parameter estimates in the model, i.e. when all parameter estimates are of equal interest; see Section 5.1.2.

• $DIM_{\hat{\theta}_j,k}$ - a marginal influence measure for assessing the influence of the $k$th observation on a specific parameter estimate, $\hat{\theta}_j$. The other parameters in the model are regarded as known; see Section 5.1.2.

• $DIM_{\hat{\boldsymbol{\theta}},K}$ - a measure to be used when we are interested in assessing the joint influence of observations, i.e. the influence of observations considered simultaneously on the vector of parameter estimates. The indices of the observations for which we want to assess influence are contained in the subset $K$; see Section 5.2.2.

• $DIM_{\hat{\theta}_j,K}$ - a marginal influence measure for assessing the influence of multiple observations, jointly, on a specific parameter estimate, $\hat{\theta}_j$. As above, the indices of the observations for which we want to assess influence are contained in the subset $K$; see Section 5.2.2.

• $DIM_{\hat{\boldsymbol{\theta}}_{(i)},k}$ - a measure to be used when we are interested in assessing the conditional influence of observations, i.e. the influence of the $k$th observation, conditional on the deletion of the $i$th observation, on the vector of parameter estimates; see Section 5.2.4.
6. Assessment of influence on the score test statistic

The existing influence measures in regression analysis are constructed to measure the impact of observations on the parameter estimates or the fitted values. However, it is of interest to assess the observations' influence on other aspects of the statistical inference as well, for instance the testing of a hypothesis. Some observations have a stronger impact on the outcome of hypothesis testing than others. In fact, the result of the hypothesis testing procedure can become significant or non-significant due to the influence of a single observation. If the data contain such an influential observation, it is beneficial for the analyst to be able to detect it, since this observation may carry a lot of additional information.
In Chapter 4 we showed that the added variable plot can be considered as a graphical representation of the score test, and we derived a nonlinear analogue, the added parameter plot, that has the same feature. Both of these plots can be used for data examination and the search for observations that are influential on the score test statistic. In this chapter, we will continue the work on assessing the influence of the observations on the score test statistic. In Section 6.1, we will derive a diagnostic measure for assessing the influence of single observations on the score test statistic, both in linear and nonlinear regression. This diagnostic measure is derived using the differentiation approach and is referred to as DIMS$_k$, an abbreviation for Differentiation approach, Influence Measure, Score test. In Section 6.2, DIMS$_K$, which measures the joint influence of multiple observations on the score test statistic in nonlinear regression, is proposed.
6.1 Assessment of influence of a single observation

In this section, an important aspect of the influence analysis is in focus, namely how individual observations contribute to the decision making when testing a hypothesis. One possibility is to approach this question through explorative analysis using the added parameter plot (see Section 4.2.1). In addition to the graphical tool, we propose in this section a formal influence measure for the score test statistic. This measure can be used to quantify the influence of the individual observations on the score test statistic, in order to pinpoint the influential observations and add more information to the analysis.
We will start by deriving the expression of the influence measure for the linearregression model in Section 6.1.1, and continue with the nonlinear regressionmodel in Section 6.1.2.
6.1.1 Linear regression

In this section we will derive a diagnostic measure for assessing the influence of the observations on the score test statistic for testing the hypothesis $H_0: \beta_p = 0$, where $\beta_p$ is a regression parameter in the linear regression model (2.1). Moreover, when deriving the measure the same ideas as presented in Chapter 5 are used.

Consider the perturbed linear regression model discussed in Section 5.1.1,
$$\mathbf{y}_\omega = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}_\omega, \qquad (6.1)$$
where $\boldsymbol{\varepsilon}_\omega \sim N_n(\mathbf{0}, \sigma^2\mathbf{W}^{-1}(\omega_k))$ and the weight matrix $\mathbf{W}(\omega_k)$ is the diagonal matrix $\mathbf{W}(\omega_k) = \mathrm{diag}(1, \ldots, \omega_k, \ldots, 1)$.

Suppose that we want to use the score test to test
$$H_0: \beta_p = 0 \quad \text{against} \quad H_A: \beta_p \neq 0. \qquad (6.2)$$
We will now derive the score test statistic for testing (6.2) when the parameter estimates from the perturbed model are used. As a first step we will describe the parameter estimates from the perturbed model under the restriction that $\beta_p = 0$. Partitioning the perturbed model (6.1) yields
$$\mathbf{y}_\omega = \mathbf{X}_1\boldsymbol{\beta}_1 + \mathbf{X}_2\beta_2 + \boldsymbol{\varepsilon}_\omega,$$
so that $\boldsymbol{\beta}_1 = (\beta_0, \beta_1, \ldots, \beta_{p-1})^T$, $\beta_2 = \beta_p$ and $\mathbf{X} = (\mathbf{X}_1 \,\vdots\, \mathbf{X}_2)$ is a corresponding partition of $\mathbf{X}$. Let $\mathbf{W} = \mathbf{W}(\omega_k)$.
The estimators for $\boldsymbol{\beta}$ and $\sigma^2$ in the perturbed model, under the restriction that $\beta_p = 0$, are given by
$$\hat{\boldsymbol{\beta}}(\omega_k) = \left( \left( (\mathbf{X}_1^T\mathbf{W}\mathbf{X}_1)^{-1}\mathbf{X}_1^T\mathbf{W}\mathbf{y} \right)^T, 0 \right)^T = \left( \hat{\beta}_0(\omega_k), \hat{\beta}_1(\omega_k), \ldots, \hat{\beta}_{p-1}(\omega_k), 0 \right)^T,$$
$$\hat{\sigma}^2(\omega_k) = \frac{1}{n}(\mathbf{y}_\omega - \mathbf{X}\hat{\boldsymbol{\beta}}(\omega_k))^T\mathbf{W}(\mathbf{y}_\omega - \mathbf{X}\hat{\boldsymbol{\beta}}(\omega_k)).$$
In Section 2.3 it was seen that the score test statistic for testing (6.2) is a function of the score vector and the information matrix, both evaluated at the parameter estimates under the null hypothesis. Here the score test statistic equals
$$S(\hat{\boldsymbol{\Psi}}(\omega_k)) = \mathbf{U}^T(\hat{\boldsymbol{\Psi}}(\omega_k))\mathbf{I}^{-1}(\hat{\boldsymbol{\Psi}}(\omega_k))\mathbf{U}(\hat{\boldsymbol{\Psi}}(\omega_k)), \qquad (6.3)$$
where $\hat{\boldsymbol{\Psi}}(\omega_k) = (\hat{\boldsymbol{\beta}}^T(\omega_k), \hat{\sigma}^2(\omega_k))^T$ is the vector of parameter estimates from the perturbed model under the restriction that $\beta_p = 0$. Moreover, $\mathbf{U}(\hat{\boldsymbol{\Psi}}(\omega_k))$ and $\mathbf{I}(\hat{\boldsymbol{\Psi}}(\omega_k))$ are the score vector and the information matrix, both evaluated at the parameter estimates under the null hypothesis.
Now, as a second step in deriving the expression of the score test statistic corresponding to testing (6.2) when the parameter estimates from the perturbed model are used, we will evaluate $\mathbf{U}(\hat{\boldsymbol{\Psi}}(\omega_k))$ and $\mathbf{I}(\hat{\boldsymbol{\Psi}}(\omega_k))$. The score vector is given by
$$\mathbf{U}(\hat{\boldsymbol{\Psi}}(\omega_k)) = \begin{pmatrix} \mathbf{U}(\hat{\boldsymbol{\beta}}(\omega_k)) \\ U(\hat{\sigma}^2(\omega_k)) \end{pmatrix},$$
where
$$\mathbf{U}(\hat{\boldsymbol{\beta}}(\omega_k)) = \left. \frac{d\ell_\omega}{d\boldsymbol{\beta}} \right|_{\boldsymbol{\Psi} = \hat{\boldsymbol{\Psi}}(\omega_k)} = \frac{1}{\hat{\sigma}^2(\omega_k)}\mathbf{X}^T\mathbf{W}(\mathbf{y}_\omega - \mathbf{X}\hat{\boldsymbol{\beta}}(\omega_k)), \qquad (6.4)$$
and where the derivative is defined in Appendix A. In (6.4) the log-likelihood, $\ell_\omega$, equals
$$\ell_\omega = -\frac{n}{2}\ln(2\pi\sigma^2) + \frac{1}{2}\ln|\mathbf{W}| - \frac{1}{2\sigma^2}(\mathbf{y}_\omega - \mathbf{X}\boldsymbol{\beta})^T\mathbf{W}(\mathbf{y}_\omega - \mathbf{X}\boldsymbol{\beta}),$$
where $|\cdot|$ denotes the determinant. Observe that $U(\hat{\sigma}^2(\omega_k)) = 0$, since
$$U(\hat{\sigma}^2(\omega_k)) = \left. \frac{d\ell_\omega}{d\sigma^2} \right|_{\boldsymbol{\Psi} = \hat{\boldsymbol{\Psi}}(\omega_k)} = 0.$$
Thus, we have that
$$\mathbf{U}(\hat{\boldsymbol{\Psi}}(\omega_k)) = \begin{pmatrix} \mathbf{U}(\hat{\boldsymbol{\beta}}(\omega_k)) \\ 0 \end{pmatrix}. \qquad (6.5)$$
From Section 2.3 we know that the information matrix evaluated at the parameter estimates under the null hypothesis is block diagonal, i.e.
$$\mathbf{I}(\hat{\boldsymbol{\Psi}}(\omega_k)) = \begin{pmatrix} \mathbf{I}(\hat{\boldsymbol{\beta}}(\omega_k)) & \mathbf{0}_p \\ \mathbf{0}_p^T & I(\hat{\sigma}^2(\omega_k)) \end{pmatrix}. \qquad (6.6)$$
Using (6.5) and (6.6) in (6.3) we get
$$S(\hat{\boldsymbol{\Psi}}(\omega_k)) = \mathbf{U}^T(\hat{\boldsymbol{\beta}}(\omega_k))\mathbf{I}^{-1}(\hat{\boldsymbol{\beta}}(\omega_k))\mathbf{U}(\hat{\boldsymbol{\beta}}(\omega_k)).$$
The information matrix evaluated at the parameter estimates from the perturbed model, under the restriction that $\beta_p = 0$, equals
$$\mathbf{I}(\hat{\boldsymbol{\beta}}(\omega_k)) = E\left[ \mathbf{U}(\boldsymbol{\beta})\mathbf{U}^T(\boldsymbol{\beta}) \right]_{\boldsymbol{\Psi} = \hat{\boldsymbol{\Psi}}(\omega_k)} = E\left[ \frac{1}{\sigma^4}\mathbf{X}^T\mathbf{W}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T\mathbf{W}\mathbf{X} \right]_{\boldsymbol{\Psi} = \hat{\boldsymbol{\Psi}}(\omega_k)}$$
$$= \left( \frac{1}{\sigma^4}\mathbf{X}^T\mathbf{W}E\left[ (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T \right]\mathbf{W}\mathbf{X} \right)_{\boldsymbol{\Psi} = \hat{\boldsymbol{\Psi}}(\omega_k)} = \left( \frac{1}{\sigma^4}\mathbf{X}^T\mathbf{W}\sigma^2\mathbf{W}^{-1}\mathbf{W}\mathbf{X} \right)_{\boldsymbol{\Psi} = \hat{\boldsymbol{\Psi}}(\omega_k)} = \frac{1}{\hat{\sigma}^2(\omega_k)}\mathbf{X}^T\mathbf{W}\mathbf{X}.$$
Thus, the score test statistic for testing $H_0: \beta_p = 0$, using the estimates of the parameters in the perturbed linear regression model, is given by
$$S(\hat{\boldsymbol{\beta}}(\omega_k)) = \mathbf{U}^T(\hat{\boldsymbol{\beta}}(\omega_k))\mathbf{I}^{-1}(\hat{\boldsymbol{\beta}}(\omega_k))\mathbf{U}(\hat{\boldsymbol{\beta}}(\omega_k)) = \frac{1}{\hat{\sigma}^2(\omega_k)}(\mathbf{y}_\omega - \mathbf{X}\hat{\boldsymbol{\beta}}(\omega_k))^T\mathbf{W}\mathbf{X}\left( \mathbf{X}^T\mathbf{W}\mathbf{X} \right)^{-1}\mathbf{X}^T\mathbf{W}(\mathbf{y}_\omega - \mathbf{X}\hat{\boldsymbol{\beta}}(\omega_k)). \qquad (6.7)$$
The score test statistic (6.7) will now be used to define the influence measure DIMS$_k$.
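At $\omega_k = 1$ the statistic (6.7) reduces to $S(\hat{\boldsymbol{\beta}}) = \mathbf{r}^T\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{r}/\hat{\sigma}^2$, with $\mathbf{r}$ and $\hat{\sigma}^2$ computed under the restricted model. A minimal sketch, assuming the tested parameter corresponds to the last column of $\mathbf{X}$ and using the made-up helper name `score_stat_last_beta` on illustrative data:

```python
import numpy as np

def score_stat_last_beta(X, y):
    """Score test statistic for H0: beta_p = 0 in the linear model, i.e.
    (6.7) evaluated at omega_k = 1 with the restricted ML estimates."""
    n = X.shape[0]
    X1 = X[:, :-1]                          # model without the last column
    b1 = np.linalg.solve(X1.T @ X1, X1.T @ y)
    r = y - X1 @ b1                         # residuals under H0
    sigma2 = r @ r / n                      # ML estimate of sigma^2 under H0
    P = X @ np.linalg.solve(X.T @ X, X.T)   # projection onto the full model
    return float(r @ P @ r / sigma2)

# Illustrative data generated with beta_p = 0, so H0 holds here.
rng = np.random.default_rng(4)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X[:, :2] @ np.array([1.0, 0.5]) + rng.normal(size=n)
print(score_stat_last_beta(X, y))
```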
Definition 6.1.1. The diagnostic measure DIMS$_k$ for assessing the influence of the $k$th observation on the score test statistic is defined as
$$\mathrm{DIMS}_k = \left. \frac{dS(\hat{\boldsymbol{\beta}}(\omega_k))}{d\omega_k} \right|_{\omega_k = 1},$$
where $S(\hat{\boldsymbol{\beta}}(\omega_k))$ is defined in (6.7).

In Definition 6.1.1, observe that when $\omega_k \to 1$, $S(\hat{\boldsymbol{\beta}}(\omega_k)) \to S(\hat{\boldsymbol{\beta}})$, i.e. the score test statistic using the parameter estimates from the unperturbed linear regression model (2.1) under the restriction that $\beta_p = 0$.
The next theorem provides an explicit expression of DIMS$_k$.

Theorem 6.1.1. Let DIMS$_k$ be given in Definition 6.1.1. Then
$$\mathrm{DIMS}_k = \frac{1}{\hat{\sigma}^2}\left[ 2r_k\mathbf{x}_k^T\mathbf{g} - \left( \mathbf{x}_k^T\mathbf{g} \right)^2 - \frac{S(\hat{\boldsymbol{\beta}})r_k^2}{n} \right],$$
where $\mathbf{r} = (r_k) = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}$, $\mathbf{x}_k^T : 1 \times p$ is the $k$th row of $\mathbf{X}$, $S(\hat{\boldsymbol{\beta}})$ is the score test statistic and $\mathbf{g} = \left( \mathbf{X}^T\mathbf{X} \right)^{-1}\mathbf{X}^T\mathbf{r} : p \times 1$.
Proof. To find DIMSk we want to take the derivative of S(βββ (ωk)) with re-spect to ωk. Rewrite S(βββ (ωk)) as a(ωk)
b(ωk), where b(ωk) = σ2 and a(ωk) =
b(ωk)DIMSk. When differentiating, the quotient rule will be used. Hence,
dS(βββ (ωk))
dωk=
a′(ωk)b(ωk)−a(ωk)b′(ωk)
b2(ωk)
=a′(ωk)−S(βββ (ωk))b′(ωk)
b(ωk).
First, the derivative of a(ωk) is considered. Let D = XXXT WXXX andC =XXXT W(yyy−XXXβββ (ωk)) =XXXT We. Then,
da(ωk)
dωk=
dCT
d(ωk)D−1C+
dD−1
d(ωk)(C⊗C)+
dCd(ωk)
D−1C
= 2(
dCd(ωk)
D−1C)+
dD−1
d(ωk)(C⊗C) .
The derivative of C with respect to ωk equals
dCdωk
=d
dωkXXXT We =
dWdωk
(e⊗XXX)+de
dωkWXXX . (6.8)
119
Applying the chain rule, defined in Appendix A, to (6.8) yields
dCdωk
= dddTk eee ⊗ dddT
k XXX− dβββ (ωk)
dωkXXXT WXXX , (6.9)
sincedWdωk
= dddTk ⊗dddT
k ,
where dddk is the k−th column of the identity matrix of size n, and
deeedωk
= −dβββ (ωk)
dωk
dXXXβββ (ωk)
dβββ (ωk)WXXX =−βββ (ωk)
dωkXXXT WXXX .
Next, the derivative of D with respect to ωk is considered. Using the rule fordifferentiation of a matrix inverse (see Appendix A) the derivative of D−1 withrespect to ωk is given by
dD−1
dωk=− dD
dωk
(D−1⊗D−1) .
Now,
dDdωk
=d
dωk
(XXXT WXXX
)=
dWdωk
(XXX⊗XXX) ,
and
dD−1
dωk= −dW
dωk(XXX⊗XXX)
((XXXT WXXX
)−1⊗(XXXT WXXX
)−1)
= −(dddT
k XXX⊗dddTk XXX)((
XXXT WXXX)−1⊗
(XXXT WXXX
)−1). (6.10)
Finally, βββ (ωk = 1) = βββ and let yyy−XXXβββ (ωk = 1) = rrr. The derivative in (6.9),evaluated at ωk = 1, becomes
ddωk
XXXT We∣∣∣∣ωk=1
=
(dddT
k eee ⊗ dddTk XXX− βββ (ωk)
dωkXXXT WXXX
)ωk=1
= rkxxxTk −
βββ (ωk)
dωk
∣∣∣∣∣ωk=1
(XXXTXXX
)= rkxxxT
k −EICβββ ,k
(XXXTXXX
),
120
since we define
EICβββ ,k =
βββ (ωk)
dωk
∣∣∣∣∣ωk=1
=
(dβ1(ωk)
dωk
∣∣∣∣ωk=1
, . . . ,dβp−1(ωk)
dωk
∣∣∣∣ωk=1
, 0).
The derivative in (6.10), evaluated at $\omega_k = 1$, can be written

$$\frac{d(\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}}{d\omega_k}\bigg|_{\omega_k=1} = -\big(\mathbf{d}_k^T\mathbf{X}\otimes\mathbf{d}_k^T\mathbf{X}\big)\Big(\big(\mathbf{X}^T\mathbf{X}\big)^{-1}\otimes\big(\mathbf{X}^T\mathbf{X}\big)^{-1}\Big) = -\big(\mathbf{x}_k^T\otimes\mathbf{x}_k^T\big)\Big(\big(\mathbf{X}^T\mathbf{X}\big)^{-1}\otimes\big(\mathbf{X}^T\mathbf{X}\big)^{-1}\Big).$$

Let $\mathbf{g} = \big(\mathbf{X}^T\mathbf{X}\big)^{-1}\mathbf{X}^T\mathbf{r}$. Then,

$$\begin{aligned}\frac{da(\omega_k)}{d\omega_k}\bigg|_{\omega_k=1} &= 2\big(r_k\mathbf{x}_k^T - \mathrm{EIC}_{\hat{\boldsymbol{\beta}},k}(\mathbf{X}^T\mathbf{X})\big)\big(\mathbf{X}^T\mathbf{X}\big)^{-1}\mathbf{X}^T\mathbf{r} - \big(\mathbf{x}_k^T\otimes\mathbf{x}_k^T\big)\Big(\big(\mathbf{X}^T\mathbf{X}\big)^{-1}\otimes\big(\mathbf{X}^T\mathbf{X}\big)^{-1}\Big)\big(\mathbf{X}^T\mathbf{r}\otimes\mathbf{X}^T\mathbf{r}\big)\\ &= 2\big(r_k\mathbf{x}_k^T - \mathrm{EIC}_{\hat{\boldsymbol{\beta}},k}(\mathbf{X}^T\mathbf{X})\big)\mathbf{g} - \big(\mathbf{x}_k^T\otimes\mathbf{x}_k^T\big)(\mathbf{g}\otimes\mathbf{g})\\ &= 2\big(r_k\mathbf{x}_k^T\mathbf{g} - \mathrm{EIC}_{\hat{\boldsymbol{\beta}},k}\mathbf{X}^T\mathbf{r}\big) - \big(\mathbf{x}_k^T\mathbf{g}\big)^2. \end{aligned}\tag{6.11}$$

The expression above can be simplified. Observe that the first $p-1$ elements of $\mathbf{X}^T\mathbf{r}$ equal zero due to the normal equations. Moreover, observe that the last element of $\mathrm{EIC}_{\hat{\boldsymbol{\beta}},k}$ is equal to zero, and we get that $\mathrm{EIC}_{\hat{\boldsymbol{\beta}},k}\mathbf{X}^T\mathbf{r} = 0$. Now, (6.11) equals

$$\frac{da(\omega_k)}{d\omega_k}\bigg|_{\omega_k=1} = 2r_k\mathbf{x}_k^T\mathbf{g} - \big(\mathbf{x}_k^T\mathbf{g}\big)^2.$$
Moreover, the derivative of the variance term needs to be calculated. The maximum likelihood estimator of $\hat\sigma^2(\omega_k)$ under the null hypothesis of (6.2) satisfies

$$\hat\sigma^2(\omega_k) = \frac{1}{n}\big(\mathbf{e}^T\mathbf{W}\mathbf{e}\big).$$

Using the product and the chain rule, defined in Appendix A, the derivative of $\hat\sigma^2(\omega_k)$ with respect to $\omega_k$ is the following

$$\begin{aligned}\frac{d\hat\sigma^2(\omega_k)}{d\omega_k} &= \frac{1}{n}\frac{d}{d\omega_k}\mathbf{e}^T\mathbf{W}\mathbf{e} = \frac{1}{n}\Big(2\frac{d\mathbf{e}}{d\omega_k}\mathbf{W}\mathbf{e} + \frac{d\mathbf{W}}{d\omega_k}(\mathbf{e}\otimes\mathbf{e})\Big)\\ &= \frac{1}{n}\Big(\frac{d\mathbf{W}}{d\omega_k}(\mathbf{e}\otimes\mathbf{e}) - 2\frac{d\mathbf{X}\hat{\boldsymbol{\beta}}(\omega_k)}{d\omega_k}\mathbf{W}\mathbf{e}\Big)\\ &= \frac{1}{n}\Big(\big(\mathbf{d}_k^T\mathbf{e}\otimes\mathbf{d}_k^T\mathbf{e}\big) - 2\frac{d\hat{\boldsymbol{\beta}}(\omega_k)}{d\omega_k}\mathbf{X}^T\mathbf{W}\mathbf{e}\Big)\\ &= \frac{1}{n}\Big(e_k^2 - 2\frac{d\hat{\boldsymbol{\beta}}(\omega_k)}{d\omega_k}\mathbf{X}^T\mathbf{W}\mathbf{e}\Big).\end{aligned}$$

Evaluated at $\omega_k = 1$, $\hat{\boldsymbol{\beta}}(\omega_k = 1) = \hat{\boldsymbol{\beta}}$, $\mathbf{e} = \mathbf{r}$ and hence

$$\frac{d\hat\sigma^2(\omega_k)}{d\omega_k}\bigg|_{\omega_k=1} = \frac{1}{n}\big(r_k^2 - 2\,\mathrm{EIC}_{\hat{\boldsymbol{\beta}},k}\mathbf{X}^T\mathbf{r}\big) = \frac{1}{n}r_k^2.$$

The expression above is simplified since $\mathrm{EIC}_{\hat{\boldsymbol{\beta}},k}\mathbf{X}^T\mathbf{r} = 0$.
Finally, the expression for $\mathrm{DIMS}_k$ is given by

$$\mathrm{DIMS}_k = \frac{1}{\hat\sigma^2}\Big(2r_k\mathbf{x}_k^T\mathbf{g} - \big(\mathbf{x}_k^T\mathbf{g}\big)^2 - S(\hat{\boldsymbol{\beta}})\frac{r_k^2}{n}\Big),$$

and the proof is complete. □
Remark

In Theorem 6.1.1 we can observe that $\mathbf{x}_k^T\mathbf{g} = \sum_{j=1}^n p_{kj}r_j$ and $\big(\mathbf{x}_k^T\mathbf{g}\big)^2 = \big(\sum_{j=1}^n p_{kj}r_j\big)^2$, where $p_{kj}$ denotes the element in the $k$th row and the $j$th column of the projection matrix $\mathbf{P}_\mathbf{X}$ defined in (3.5). Hence, $\mathrm{DIMS}_k$ can be rewritten as

$$\mathrm{DIMS}_k = \frac{1}{\hat\sigma^2}\Bigg(2r_k\sum_{j=1}^n p_{kj}r_j - \Big(\sum_{j=1}^n p_{kj}r_j\Big)^2 - S(\hat{\boldsymbol{\beta}})\frac{r_k^2}{n}\Bigg). \tag{6.12}$$

We see from (6.12) that $\mathrm{DIMS}_k$ is a function of the residuals under the null hypothesis, the score test statistic and the leverages for the $k$th observation, since the leverage, $p_{kk}$, is a part of $\sum_{j=1}^n p_{kj}$.
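The rewritten form (6.12) is straightforward to compute. A minimal numerical sketch follows, with simulated data and hypothetical sizes, which also cross-checks the closed form against a central finite difference of the perturbed score statistic at $\omega_k = 1$; only the construction of this section (restricted weighted least squares under the null, full projection matrix $\mathbf{P}_\mathbf{X}$) is assumed.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 23, 3                                  # hypothetical sizes; the p-th coefficient is tested
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X[:, : p - 1] @ np.array([1.0, 0.5]) + rng.normal(scale=0.3, size=n)

def restricted_residuals(w):
    """Residuals of the weighted fit with the last coefficient fixed at zero."""
    Xr = X[:, :-1]
    beta_r = np.linalg.solve(Xr.T @ (w[:, None] * Xr), Xr.T @ (w * y))
    return y - Xr @ beta_r

def score_stat(w):
    """Score statistic for H0: beta_p = 0 under case weights w (linear-model analogue of (6.17))."""
    e = restricted_residuals(w)
    sigma2 = (w * e) @ e / n
    XWe = X.T @ (w * e)
    return XWe @ np.linalg.solve(X.T @ (w[:, None] * X), XWe) / sigma2

# Closed form (6.12): residuals, projection matrix and S under the null.
r = restricted_residuals(np.ones(n))
sigma2 = r @ r / n
Pr = X @ np.linalg.solve(X.T @ X, X.T @ r)    # P_X r, so Pr[k] = sum_j p_kj r_j
S = score_stat(np.ones(n))
DIMS = (2 * r * Pr - Pr**2 - S * r**2 / n) / sigma2

# Cross-check DIMS_k against a central finite difference of S at omega_k = 1.
k, h = 0, 1e-6
wp = np.ones(n); wp[k] += h
wm = np.ones(n); wm[k] -= h
fd = (score_stat(wp) - score_stat(wm)) / (2 * h)
print(abs(DIMS[k] - fd))                      # should be near zero
```

The finite difference agrees with the closed form because (6.12) is, by construction, the derivative of the perturbed score statistic evaluated at the unperturbed model.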
In the next section we will derive one of the main results in this thesis, an explicit expression of the influence measure $\mathrm{DIMS}_k$ for use in nonlinear regression analysis.
6.1.2 Nonlinear regression
Similar ideas and techniques from the previous section, Section 6.1.1, will be utilized when deriving the influence measure $\mathrm{DIMS}_k$ for assessing the influence of the observations on the score test statistic, given in (2.27), when testing

$$H_0 : \theta_q = 0 \qquad\text{against}\qquad H_A : \theta_q \neq 0, \tag{6.13}$$

where $\theta_q$ is the last element of the vector of parameters for the nonlinear regression model (2.2).

First, we consider the perturbed nonlinear model, discussed in Section 5.1.2, defined as

$$\mathbf{y}_\omega = \mathbf{f}(\mathbf{X},\boldsymbol{\theta}) + \boldsymbol{\varepsilon}_\omega, \tag{6.14}$$

where $\boldsymbol{\varepsilon}_\omega \sim N_n\big(\mathbf{0},\sigma^2\mathbf{W}^{-1}(\omega_k)\big)$ and the weight matrix $\mathbf{W}(\omega_k)$ is the following diagonal matrix

$$\mathbf{W}(\omega_k) = \mathrm{diag}(1,\ldots,\omega_k,\ldots,1).$$
We will now derive the score test statistic when testing (6.13) using the parameter estimates from the perturbed model (6.14).

Let $\boldsymbol{\Psi} = (\boldsymbol{\theta}^T,\sigma^2)^T$ be the vector of parameters and let $\hat{\boldsymbol{\Psi}}(\omega_k) = \big(\hat{\boldsymbol{\theta}}^T(\omega_k),\ \hat\sigma^2(\omega_k)\big)^T$ be the maximum likelihood estimates from the perturbed model, under the restriction that $\theta_q = 0$. Recall from Section 6.1.1 that the score test statistic is a function of the score vector and the information matrix, both evaluated with the plug-in parameter estimates under the null hypothesis, i.e.

$$S(\hat{\boldsymbol{\Psi}}(\omega_k)) = \mathbf{U}(\hat{\boldsymbol{\Psi}}(\omega_k))^T\,\mathbf{I}^{-1}(\hat{\boldsymbol{\Psi}}(\omega_k))\,\mathbf{U}(\hat{\boldsymbol{\Psi}}(\omega_k)).$$
The score vector is given by

$$\mathbf{U}(\hat{\boldsymbol{\Psi}}(\omega_k)) = \begin{pmatrix}\mathbf{U}(\hat{\boldsymbol{\theta}}(\omega_k))\\ \mathbf{U}(\hat\sigma^2(\omega_k))\end{pmatrix}.$$

As in the linear regression case, $\mathbf{U}(\hat\sigma^2(\omega_k)) = 0$ since $\hat\sigma^2(\omega_k)$ is the maximum likelihood estimate of $\sigma^2$, and

$$\mathbf{U}(\hat{\boldsymbol{\theta}}(\omega_k)) = \frac{d\ell_\omega}{d\boldsymbol{\theta}}\bigg|_{\boldsymbol{\Psi}=\hat{\boldsymbol{\Psi}}(\omega_k)} = \frac{1}{\hat\sigma^2(\omega_k)}\mathbf{F}(\hat{\boldsymbol{\theta}}(\omega_k))\mathbf{W}\big(\mathbf{y}_\omega - \mathbf{f}(\mathbf{X},\hat{\boldsymbol{\theta}}(\omega_k))\big),$$
where

$$\ell_\omega = -\frac{n}{2}\ln\big(2\pi\sigma^2\big) + \frac{1}{2}\ln|\mathbf{W}| - \frac{1}{2\sigma^2}\big(\mathbf{y}_\omega - \mathbf{f}(\mathbf{X},\boldsymbol{\theta})\big)^T\mathbf{W}\big(\mathbf{y}_\omega - \mathbf{f}(\mathbf{X},\boldsymbol{\theta})\big),$$

$\mathbf{W} = \mathbf{W}(\omega_k)$ and $\mathbf{F}(\hat{\boldsymbol{\theta}}(\omega_k)) : q\times n$ is the matrix such that

$$\mathbf{F}(\hat{\boldsymbol{\theta}}(\omega_k)) = \big(\mathbf{F}_1(\hat{\boldsymbol{\theta}}(\omega_k)),\ \ldots,\ \mathbf{F}_n(\hat{\boldsymbol{\theta}}(\omega_k))\big) = \frac{d}{d\boldsymbol{\theta}}\mathbf{f}(\mathbf{X},\boldsymbol{\theta})\bigg|_{\boldsymbol{\theta}=\hat{\boldsymbol{\theta}}(\omega_k)}. \tag{6.15}$$

Moreover, the information matrix is block diagonal, see Section 2.3 for details, such that

$$\mathbf{I}(\hat{\boldsymbol{\Psi}}(\omega_k)) = \begin{pmatrix}\mathbf{I}(\hat{\boldsymbol{\theta}}(\omega_k)) & \mathbf{0}_q\\ \mathbf{0}_q^T & \mathbf{I}(\hat\sigma^2(\omega_k))\end{pmatrix}.$$
Using the results from deriving the score vector and using the fact that the information matrix is block diagonal, the score test statistic equals

$$S(\hat{\boldsymbol{\theta}}(\omega_k)) = \mathbf{U}^T(\hat{\boldsymbol{\theta}}(\omega_k))\,\mathbf{I}^{-1}(\hat{\boldsymbol{\theta}}(\omega_k))\,\mathbf{U}(\hat{\boldsymbol{\theta}}(\omega_k)). \tag{6.16}$$

The information matrix in (6.16) is defined as

$$\begin{aligned}\mathbf{I}(\hat{\boldsymbol{\theta}}(\omega_k)) &= E\big[\mathbf{U}(\boldsymbol{\theta})\mathbf{U}^T(\boldsymbol{\theta})\big]_{\boldsymbol{\Psi}=\hat{\boldsymbol{\Psi}}(\omega_k)} = E\Bigg[\frac{d\ell_\omega}{d\boldsymbol{\theta}}\Big(\frac{d\ell_\omega}{d\boldsymbol{\theta}}\Big)^T\Bigg]_{\boldsymbol{\Psi}=\hat{\boldsymbol{\Psi}}(\omega_k)}\\ &= E\Big[\frac{1}{\sigma^4}\mathbf{F}(\boldsymbol{\theta})\mathbf{W}\big(\mathbf{y}_\omega - \mathbf{f}(\mathbf{X},\boldsymbol{\theta})\big)\big(\mathbf{y}_\omega - \mathbf{f}(\mathbf{X},\boldsymbol{\theta})\big)^T\mathbf{W}\mathbf{F}^T(\boldsymbol{\theta})\Big]_{\boldsymbol{\Psi}=\hat{\boldsymbol{\Psi}}(\omega_k)}\\ &= \frac{1}{\hat\sigma^2(\omega_k)}\mathbf{F}(\hat{\boldsymbol{\theta}}(\omega_k))\mathbf{W}\mathbf{F}^T(\hat{\boldsymbol{\theta}}(\omega_k)),\end{aligned}$$

where in the second row it was used that

$$E\big[(\mathbf{y}_\omega - \mathbf{f}(\mathbf{X},\boldsymbol{\theta}))(\mathbf{y}_\omega - \mathbf{f}(\mathbf{X},\boldsymbol{\theta}))^T\big] = \sigma^2\mathbf{W}^{-1}.$$

Thus, the score test statistic for testing (6.13), derived from the perturbed nonlinear model (6.14), is as follows

$$S(\hat{\boldsymbol{\theta}}(\omega_k)) = \frac{1}{\hat\sigma^2(\omega_k)}\mathbf{e}^T\mathbf{W}\mathbf{F}^T\big(\mathbf{F}\mathbf{W}\mathbf{F}^T\big)^{-1}\mathbf{F}\mathbf{W}\mathbf{e}, \tag{6.17}$$

where $\mathbf{F} = \mathbf{F}(\hat{\boldsymbol{\theta}}(\omega_k))$ and $\mathbf{e} = \mathbf{y} - \mathbf{f}(\mathbf{X},\hat{\boldsymbol{\theta}}(\omega_k))$.
The score test statistic in (6.17) will now be used in the following definition of the influence measure $\mathrm{DIMS}_k$.

Definition 6.1.2. The diagnostic measure $\mathrm{DIMS}_k$ for assessing the influence of the $k$th observation on the score test statistic is defined as

$$\mathrm{DIMS}_k = \frac{dS(\hat{\boldsymbol{\theta}}(\omega_k))}{d\omega_k}\bigg|_{\omega_k=1},$$

where $S(\hat{\boldsymbol{\theta}}(\omega_k))$ is defined in (6.17).

Note that when $\omega_k \to 1$ we observe that $S(\hat{\boldsymbol{\theta}}(\omega_k)) \to S(\hat{\boldsymbol{\theta}})$, i.e. the score test statistic using the parameter estimates from the unperturbed nonlinear regression model (2.2) under the restriction that $\theta_q = 0$.
Before presenting the explicit expression of $\mathrm{DIMS}_k$ in a theorem, we will state the definition of the influence measure $\mathrm{DIM}_{\hat{\boldsymbol{\theta}},k}$, similar to the influence measure presented in Definition 5.1.2, since $\mathrm{DIMS}_k$ is a function of $\mathrm{DIM}_{\hat{\boldsymbol{\theta}},k}$.

Definition 6.1.3. Let $\hat{\boldsymbol{\theta}} = (\hat\theta_1,\ldots,\hat\theta_{q-1},0)^T$ be the parameter estimates under the null hypothesis, $H_0 : \theta_q = 0$. The diagnostic measure for assessing the influence of the $k$th observation on $\hat{\boldsymbol{\theta}}$ is defined as

$$\mathrm{DIM}_{\hat{\boldsymbol{\theta}},k} = \big(\mathrm{DIM}_{\tilde{\boldsymbol{\theta}},k},\ 0\big),$$

where $\mathrm{DIM}_{\tilde{\boldsymbol{\theta}},k}$ is given in Definition 5.1.2, and $\tilde{\boldsymbol{\theta}} = (\hat\theta_1,\ldots,\hat\theta_{q-1})^T$ are the parameter estimates for the restricted model, i.e. the model under the null hypothesis.

The next theorem provides an explicit expression of $\mathrm{DIMS}_k$ for use in nonlinear regression analysis.
Theorem 6.1.2. Let $\mathrm{DIMS}_k$ be given in Definition 6.1.2. Then

$$\begin{aligned}\mathrm{DIMS}_k = \frac{1}{\hat\sigma^2}\Big[&2\big(r_k\mathbf{F}_k^T\mathbf{g} + \mathrm{DIM}_{\hat{\boldsymbol{\theta}},k}\mathbf{G}(\mathbf{r}\otimes\mathbf{g})\big)\\ &- \mathrm{DIM}_{\hat{\boldsymbol{\theta}},k}\big(\mathbf{G}(\mathbf{F}^T\mathbf{g}\otimes\mathbf{g}) + \mathbf{G}^*(\mathbf{g}\otimes\mathbf{F}^T\mathbf{g})\big) - \big(\mathbf{g}^T\mathbf{F}_k\mathbf{F}_k^T\mathbf{g}\big) - S(\hat{\boldsymbol{\theta}})\frac{r_k^2}{n}\Big],\end{aligned}$$

where $\mathbf{r} = \mathbf{y} - \mathbf{f}(\mathbf{X},\hat{\boldsymbol{\theta}})$ and $\mathbf{F} = \mathbf{F}(\hat{\boldsymbol{\theta}})$ is defined in (6.15). Moreover, $\mathbf{G}$ and $\mathbf{G}^*$ are defined as

$$\mathbf{G} = \mathbf{G}(\hat{\boldsymbol{\theta}}) = \frac{d\mathbf{F}(\hat{\boldsymbol{\theta}})}{d\hat{\boldsymbol{\theta}}}, \qquad \mathbf{G}^* = \mathbf{G}^*(\hat{\boldsymbol{\theta}}) = \frac{d\mathbf{F}^T(\hat{\boldsymbol{\theta}})}{d\hat{\boldsymbol{\theta}}}, \tag{6.18}$$

respectively, and $\mathbf{g} = \big(\mathbf{F}\mathbf{F}^T\big)^{-1}\mathbf{F}\mathbf{r}$.
Proof. Let $\mathbf{F} = \mathbf{F}(\hat{\boldsymbol{\theta}}(\omega_k))$, $\mathbf{W} = \mathbf{W}(\omega_k)$ and $\mathbf{e} = \mathbf{y} - \mathbf{f}(\mathbf{X}) = \mathbf{y} - \mathbf{f}(\mathbf{X},\hat{\boldsymbol{\theta}}(\omega_k))$. In (6.17), let

$$a(\omega_k) = \mathbf{e}^T\mathbf{W}\mathbf{F}^T\big(\mathbf{F}\mathbf{W}\mathbf{F}^T\big)^{-1}\mathbf{F}\mathbf{W}\mathbf{e}$$

and

$$b(\omega_k) = \hat\sigma^2(\omega_k).$$

When differentiating $S(\hat{\boldsymbol{\theta}}(\omega_k))$ the quotient rule is used. Hence,

$$\frac{dS(\hat{\boldsymbol{\theta}}(\omega_k))}{d\omega_k} = \frac{a'(\omega_k)b(\omega_k) - a(\omega_k)b'(\omega_k)}{b^2(\omega_k)} = \frac{a'(\omega_k) - S(\hat{\boldsymbol{\theta}}(\omega_k))b'(\omega_k)}{b(\omega_k)},$$

where

$$a'(\omega_k) = \frac{da(\omega_k)}{d\omega_k}, \qquad b'(\omega_k) = \frac{db(\omega_k)}{d\omega_k}.$$

First, the derivative of $a(\omega_k)$ is considered. Let $\mathbf{C} = \mathbf{F}\mathbf{W}\mathbf{e}$ and $\mathbf{D} = \mathbf{F}\mathbf{W}\mathbf{F}^T$, then

$$\frac{da(\omega_k)}{d\omega_k} = \frac{d\mathbf{C}^T}{d\omega_k}\mathbf{D}^{-1}\mathbf{C} + \frac{d\mathbf{D}^{-1}}{d\omega_k}(\mathbf{C}\otimes\mathbf{C}) + \frac{d\mathbf{C}}{d\omega_k}\mathbf{D}^{-1}\mathbf{C} = 2\Big(\frac{d\mathbf{C}}{d\omega_k}\mathbf{D}^{-1}\mathbf{C}\Big) + \frac{d\mathbf{D}^{-1}}{d\omega_k}(\mathbf{C}\otimes\mathbf{C}).$$
The derivative of $\mathbf{C}$ with respect to $\omega_k$ is

$$\frac{d\mathbf{C}}{d\omega_k} = \frac{d}{d\omega_k}\mathbf{F}\mathbf{W}\mathbf{e} = \frac{d\mathbf{F}}{d\omega_k}(\mathbf{W}\mathbf{e}\otimes\mathbf{I}_q) + \frac{d\mathbf{W}}{d\omega_k}(\mathbf{e}\otimes\mathbf{F}^T) + \frac{d\mathbf{e}}{d\omega_k}\mathbf{W}\mathbf{F}^T. \tag{6.19}$$

Applying the chain rule, defined in Appendix A, to (6.19) gives

$$\frac{d\hat{\boldsymbol{\theta}}(\omega_k)}{d\omega_k}\Big(\frac{d\mathbf{F}}{d\hat{\boldsymbol{\theta}}(\omega_k)}(\mathbf{W}\mathbf{e}\otimes\mathbf{I}_q) - \frac{d\mathbf{f}(\mathbf{X})}{d\hat{\boldsymbol{\theta}}(\omega_k)}\mathbf{W}\mathbf{F}^T\Big) + \mathbf{d}_k^T\mathbf{e}\otimes\mathbf{d}_k^T\mathbf{F}^T, \tag{6.20}$$

since

$$\frac{d\mathbf{W}}{d\omega_k} = \mathbf{d}_k^T\otimes\mathbf{d}_k^T,$$

where $\mathbf{d}_k$ is the $k$th column of the identity matrix of size $n$.
Next, the derivative of $\mathbf{D}$ with respect to $\omega_k$ is considered. Using both the chain rule and the rule for differentiation of a matrix inverse (see Appendix A), the derivative of $\mathbf{D}^{-1}$ with respect to $\omega_k$ is

$$\frac{d\mathbf{D}^{-1}}{d\omega_k} = -\frac{d\mathbf{D}}{d\omega_k}\big(\mathbf{D}^{-1}\otimes\mathbf{D}^{-1}\big).$$

Now,

$$\begin{aligned}\frac{d\mathbf{D}}{d\omega_k} &= \frac{d}{d\omega_k}\big(\mathbf{F}\mathbf{W}\mathbf{F}^T\big) = \frac{d\mathbf{F}}{d\omega_k}\big(\mathbf{W}\mathbf{F}^T\otimes\mathbf{I}_q\big) + \frac{d\mathbf{W}}{d\omega_k}\big(\mathbf{F}^T\otimes\mathbf{F}^T\big) + \frac{d\mathbf{F}^T}{d\omega_k}\big(\mathbf{I}_q\otimes\mathbf{W}\mathbf{F}^T\big)\\ &= \frac{d\hat{\boldsymbol{\theta}}(\omega_k)}{d\omega_k}\Big(\frac{d\mathbf{F}}{d\hat{\boldsymbol{\theta}}(\omega_k)}\big(\mathbf{W}\mathbf{F}^T\otimes\mathbf{I}_q\big) + \frac{d\mathbf{F}^T}{d\hat{\boldsymbol{\theta}}(\omega_k)}\big(\mathbf{I}_q\otimes\mathbf{W}\mathbf{F}^T\big)\Big) + \frac{d\mathbf{W}}{d\omega_k}\big(\mathbf{F}^T\otimes\mathbf{F}^T\big),\end{aligned}$$

and

$$\begin{aligned}\frac{d\mathbf{D}^{-1}}{d\omega_k} = -\Big[&\frac{d\hat{\boldsymbol{\theta}}(\omega_k)}{d\omega_k}\Big(\frac{d\mathbf{F}}{d\hat{\boldsymbol{\theta}}(\omega_k)}\big(\mathbf{W}\mathbf{F}^T\otimes\mathbf{I}_q\big) + \frac{d\mathbf{F}^T}{d\hat{\boldsymbol{\theta}}(\omega_k)}\big(\mathbf{I}_q\otimes\mathbf{W}\mathbf{F}^T\big)\Big)\\ &+ \frac{d\mathbf{W}}{d\omega_k}\big(\mathbf{F}^T\otimes\mathbf{F}^T\big)\Big]\Big(\big(\mathbf{F}\mathbf{W}\mathbf{F}^T\big)^{-1}\otimes\big(\mathbf{F}\mathbf{W}\mathbf{F}^T\big)^{-1}\Big). \tag{6.21}\end{aligned}$$
Consider evaluation at $\omega_k = 1$. We get that

$$\hat{\boldsymbol{\theta}}(\omega_k = 1) = \hat{\boldsymbol{\theta}}, \qquad \mathbf{y} - \mathbf{f}(\mathbf{X},\hat{\boldsymbol{\theta}}(\omega_k = 1)) = \mathbf{r},$$

the parameter estimates and the residuals from the unperturbed model (2.2), respectively. Moreover,

$$\mathbf{F} = \mathbf{F}(\hat{\boldsymbol{\theta}}(\omega_k = 1)) = \mathbf{F}(\hat{\boldsymbol{\theta}}), \quad \mathbf{G} = \mathbf{G}(\hat{\boldsymbol{\theta}}(\omega_k = 1)) = \mathbf{G}(\hat{\boldsymbol{\theta}}), \quad \mathbf{G}^* = \mathbf{G}^*(\hat{\boldsymbol{\theta}}(\omega_k = 1)) = \mathbf{G}^*(\hat{\boldsymbol{\theta}}),$$

are the matrices of derivatives evaluated for the parameter estimates from the unperturbed model (2.2).

The derivative in (6.20) evaluated at $\omega_k = 1$ equals

$$\frac{d}{d\omega_k}\mathbf{F}\mathbf{W}\mathbf{e}\bigg|_{\omega_k=1} = r_k\mathbf{F}_k^T - \mathrm{DIM}_{\hat{\boldsymbol{\theta}},k}\big(\mathbf{F}\mathbf{F}^T - \mathbf{G}(\mathbf{r}\otimes\mathbf{I}_q)\big),$$

since

$$\frac{d\hat{\boldsymbol{\theta}}(\omega_k)}{d\omega_k}\bigg|_{\omega_k=1} = \mathrm{DIM}_{\hat{\boldsymbol{\theta}},k},$$
and the derivative in (6.21) evaluated at $\omega_k = 1$ equals

$$\frac{d(\mathbf{F}\mathbf{W}\mathbf{F}^T)^{-1}}{d\omega_k}\bigg|_{\omega_k=1} = -\Big[\mathrm{DIM}_{\hat{\boldsymbol{\theta}},k}\big(\mathbf{G}(\mathbf{F}^T\otimes\mathbf{I}_q) + \mathbf{G}^*(\mathbf{I}_q\otimes\mathbf{F}^T)\big) + \mathbf{d}_k^T\mathbf{F}^T\otimes\mathbf{d}_k^T\mathbf{F}^T\Big]\Big(\big(\mathbf{F}\mathbf{F}^T\big)^{-1}\otimes\big(\mathbf{F}\mathbf{F}^T\big)^{-1}\Big),$$

since

$$\frac{d\mathbf{F}}{d\hat{\boldsymbol{\theta}}(\omega_k)}\bigg|_{\omega_k=1} = \mathbf{G}, \qquad \frac{d\mathbf{F}^T}{d\hat{\boldsymbol{\theta}}(\omega_k)}\bigg|_{\omega_k=1} = \mathbf{G}^*.$$
Let $\mathbf{g} = \big(\mathbf{F}\mathbf{F}^T\big)^{-1}\mathbf{F}\mathbf{r}$, then

$$\begin{aligned}\frac{da(\omega_k)}{d\omega_k}\bigg|_{\omega_k=1} = {}& 2\big(r_k\mathbf{F}_k^T - \mathrm{DIM}_{\hat{\boldsymbol{\theta}},k}(\mathbf{F}\mathbf{F}^T - \mathbf{G}(\mathbf{r}\otimes\mathbf{I}_q))\big)\mathbf{g}\\ &- \Big[\mathrm{DIM}_{\hat{\boldsymbol{\theta}},k}\big(\mathbf{G}(\mathbf{F}^T\otimes\mathbf{I}_q) + \mathbf{G}^*(\mathbf{I}_q\otimes\mathbf{F}^T)\big) + \mathbf{d}_k^T\mathbf{F}^T\otimes\mathbf{d}_k^T\mathbf{F}^T\Big]\Big(\big(\mathbf{F}\mathbf{F}^T\big)^{-1}\otimes\big(\mathbf{F}\mathbf{F}^T\big)^{-1}\Big)(\mathbf{F}\mathbf{r}\otimes\mathbf{F}\mathbf{r})\\ = {}& 2\big(r_k\mathbf{F}_k^T + \mathrm{DIM}_{\hat{\boldsymbol{\theta}},k}\mathbf{G}(\mathbf{r}\otimes\mathbf{I}_q)\big)\mathbf{g}\\ &- \Big[\mathrm{DIM}_{\hat{\boldsymbol{\theta}},k}\big(\mathbf{G}(\mathbf{F}^T\otimes\mathbf{I}_q) + \mathbf{G}^*(\mathbf{I}_q\otimes\mathbf{F}^T)\big) + \mathbf{d}_k^T\mathbf{F}^T\otimes\mathbf{d}_k^T\mathbf{F}^T\Big](\mathbf{g}\otimes\mathbf{g}).\end{aligned}$$

In the expression above, $\mathrm{DIM}_{\hat{\boldsymbol{\theta}},k}\mathbf{F}\mathbf{F}^T\mathbf{g} = 0$. This is due to the fact that the normal equations for estimating $\boldsymbol{\theta}$ under the restriction that $\theta_q = 0$ set the first $q-1$ elements of $\mathbf{F}\mathbf{r}$ to zero, and the last element in $\mathrm{DIM}_{\hat{\boldsymbol{\theta}},k}$ is equal to zero, so that

$$\mathrm{DIM}_{\hat{\boldsymbol{\theta}},k}\mathbf{F}\mathbf{F}^T\big(\mathbf{F}\mathbf{F}^T\big)^{-1}\mathbf{F}\mathbf{r} = \mathrm{DIM}_{\hat{\boldsymbol{\theta}},k}\mathbf{F}\mathbf{r} = 0.$$
Now the derivative of the variance term needs to be calculated. The maximum likelihood estimator of $\hat\sigma^2(\omega_k)$ under the restriction that $\theta_q = 0$ is

$$\hat\sigma^2(\omega_k) = \frac{1}{n}\big(\mathbf{e}^T\mathbf{W}\mathbf{e}\big).$$

Using the product rule and the chain rule (see Appendix A), the derivative of $\hat\sigma^2(\omega_k)$ with respect to $\omega_k$ is the following

$$\begin{aligned}\frac{d\hat\sigma^2(\omega_k)}{d\omega_k} &= \frac{1}{n}\frac{d}{d\omega_k}\mathbf{e}^T\mathbf{W}\mathbf{e} = \frac{1}{n}\Big(2\frac{d\mathbf{e}}{d\omega_k}\mathbf{W}\mathbf{e} + \frac{d\mathbf{W}}{d\omega_k}(\mathbf{e}\otimes\mathbf{e})\Big)\\ &= \frac{1}{n}\Big(\frac{d\mathbf{W}}{d\omega_k}(\mathbf{e}\otimes\mathbf{e}) - 2\frac{d\mathbf{f}(\mathbf{X},\hat{\boldsymbol{\theta}}(\omega_k))}{d\omega_k}\mathbf{W}\mathbf{e}\Big)\\ &= \frac{1}{n}\big((\mathbf{d}_k^T\otimes\mathbf{d}_k^T)(\mathbf{e}\otimes\mathbf{e})\big) - \frac{1}{n}\Bigg(2\frac{d\hat{\boldsymbol{\theta}}(\omega_k)}{d\omega_k}\Big(\frac{d\mathbf{f}(\mathbf{X},\hat{\boldsymbol{\theta}}(\omega_k))}{d\hat{\boldsymbol{\theta}}(\omega_k)}\mathbf{W}\mathbf{e}\Big)\Bigg)\\ &= \frac{1}{n}\Bigg(e_k^2 - 2\frac{d\hat{\boldsymbol{\theta}}(\omega_k)}{d\omega_k}\Big(\frac{d\mathbf{f}(\mathbf{X},\hat{\boldsymbol{\theta}}(\omega_k))}{d\hat{\boldsymbol{\theta}}(\omega_k)}\mathbf{W}\mathbf{e}\Big)\Bigg). \tag{6.22}\end{aligned}$$
Evaluating (6.22) at $\omega_k = 1$ we get

$$\frac{d\hat\sigma^2(\omega_k)}{d\omega_k}\bigg|_{\omega_k=1} = \frac{1}{n}\big(r_k^2 - 2\,\mathrm{DIM}_{\hat{\boldsymbol{\theta}},k}\mathbf{F}\mathbf{r}\big) = \frac{r_k^2}{n},$$

since $\mathrm{DIM}_{\hat{\boldsymbol{\theta}},k}\mathbf{F}\mathbf{r} = 0$.
Finally, the expression for $\mathrm{DIMS}_k$ is given by

$$\begin{aligned}\mathrm{DIMS}_k &= \frac{1}{\hat\sigma^2}\Big[2\big(r_k\mathbf{F}_k^T + \mathrm{DIM}_{\hat{\boldsymbol{\theta}},k}\mathbf{G}(\mathbf{r}\otimes\mathbf{I}_q)\big)\mathbf{g}\\ &\qquad - \Big(\mathrm{DIM}_{\hat{\boldsymbol{\theta}},k}\big(\mathbf{G}(\mathbf{F}^T\otimes\mathbf{I}_q) + \mathbf{G}^*(\mathbf{I}_q\otimes\mathbf{F}^T)\big) + \mathbf{d}_k^T\mathbf{F}^T\otimes\mathbf{d}_k^T\mathbf{F}^T\Big)(\mathbf{g}\otimes\mathbf{g}) - S(\hat{\boldsymbol{\theta}})\frac{r_k^2}{n}\Big]\\ &= \frac{1}{\hat\sigma^2}\Big[2\big(r_k\mathbf{F}_k^T\mathbf{g} + \mathrm{DIM}_{\hat{\boldsymbol{\theta}},k}\mathbf{G}(\mathbf{r}\otimes\mathbf{g})\big) - \mathrm{DIM}_{\hat{\boldsymbol{\theta}},k}\big(\mathbf{G}(\mathbf{F}^T\mathbf{g}\otimes\mathbf{g}) + \mathbf{G}^*(\mathbf{g}\otimes\mathbf{F}^T\mathbf{g})\big) - \mathbf{g}^T\mathbf{F}_k\mathbf{F}_k^T\mathbf{g} - S(\hat{\boldsymbol{\theta}})\frac{r_k^2}{n}\Big].\end{aligned}$$

□
In Theorem 6.1.2 we observe that $\mathbf{F}_k^T\mathbf{g} = \sum_{j=1}^n p_{kj}r_j$ and that $\big(\mathbf{F}_k^T\mathbf{g}\big)^2 = \big(\sum_{j=1}^n p_{kj}r_j\big)^2$, where $p_{kj}$ is the element in the $k$th row and the $j$th column of the tangent plane projection matrix $\mathbf{P}_\mathbf{F} = \mathbf{F}^T\big(\mathbf{F}\mathbf{F}^T\big)^{-1}\mathbf{F}$, and where $p_{kk}$ is defined in (3.7). As in the linear regression case, $\mathrm{DIMS}_k$ is a function of the residuals under the null hypothesis, the score test statistic and the leverages of the $k$th observation, since the leverage of the $k$th observation is $p_{kk}$. However, the expression is more complicated due to the fact that we have to consider the second derivative of the expectation function. Also, the influence measure (DIM) for the parameter estimates under the null hypothesis has a more complicated expression in the nonlinear regression case, compared to the linear regression case.
As discussed in Section 5.1.3, an apparent benefit of the approach used to construct $\mathrm{DIMS}_k$ is that when differentiating various quantities with respect to $\omega_k$ we evaluate at $\omega_k = 1$. As a consequence, the resulting quantities in the expression of $\mathrm{DIMS}_k$, first obtained from the perturbed model, are now independent of the weight, $\omega_k$, and equal the quantities for the unperturbed model. If we were to evaluate these derivatives at any value other than one, e.g. at $\omega_k \to 0$, the expression of $\mathrm{DIMS}_k$ would become more complicated. Moreover, for each $k$ of interest we would need to re-estimate the model, since then the parameter estimates would be functions of the weight, e.g. $\hat{\boldsymbol{\theta}}(\omega_k \to 0)$.
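The derivative in Definition 6.1.2 can also be approximated by brute force: refit the restricted model under slightly perturbed case weights and difference the score statistic numerically. This incurs exactly the re-estimation burden that the closed-form expression avoids, but it is useful as a cross-check. A minimal sketch with simulated data and a hypothetical model $y = e^{\theta_1 x} + \theta_2 x + \varepsilon$, testing $H_0 : \theta_2 = 0$ (all data, sizes and tolerances are assumptions, not taken from the thesis):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0.1, 2.0, 15)
y = np.exp(0.8 * x) + rng.normal(scale=0.2, size=x.size)  # simulated data

def fit_null(w):
    """Gauss-Newton fit of the restricted model y = exp(theta1*x) + eps with case weights w."""
    t = 0.8
    for _ in range(100):
        fx = np.exp(t * x)
        J = x * fx                                   # d f / d theta1
        step = (w * J) @ (y - fx) / ((w * J) @ J)
        t += step
        if abs(step) < 1e-14:
            break
    return t

def score_stat(w):
    """Score statistic for H0: theta2 = 0, as in (6.17), with diagonal weight matrix W."""
    t = fit_null(w)
    e = y - np.exp(t * x)
    F = np.vstack([x * np.exp(t * x), x])            # q x n matrix F(theta(w_k))
    FW = F * w                                       # F W for diagonal W
    sigma2 = (w * e) @ e / x.size
    return (FW @ e) @ np.linalg.solve(FW @ F.T, FW @ e) / sigma2

# DIMS_k approximated by a central difference of S(theta(w_k)) at w_k = 1.
h = 1e-5
DIMS = np.empty(x.size)
for k in range(x.size):
    wp = np.ones(x.size); wp[k] += h
    wm = np.ones(x.size); wm[k] -= h
    DIMS[k] = (score_stat(wp) - score_stat(wm)) / (2 * h)
```

Note that each of the $n$ finite differences requires two refits of the restricted model, whereas the analytic $\mathrm{DIMS}_k$ uses only the single fit of the unperturbed model.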
The signs of the values of $\mathrm{DIMS}_k$ provide important information. A positive value of $\mathrm{DIMS}_k$ means that the $k$th observation has a positive influence on the score test statistic, i.e. the presence of this observation increases the value of the score test statistic. Similarly, the $k$th observation exercises a negative influence on the score test statistic if the value of $\mathrm{DIMS}_k$ is negative. This means that the presence of the $k$th observation reduces the score test statistic.
To illustrate the components of $\mathrm{DIMS}_k$ given in Theorem 6.1.2 we will give a small technical example.

Example 6.1: An illustration of the components of $\mathrm{DIMS}_k$

Consider the same model as in Example 5.1, where

$$\mathbf{f}(\mathbf{X},\boldsymbol{\theta}) = \Big(\frac{\theta_1 x_1}{\theta_2 + x_1},\ \frac{\theta_1 x_2}{\theta_2 + x_2},\ \frac{\theta_1 x_3}{\theta_2 + x_3}\Big)^T.$$

Let the hypothesis of interest be $H_0 : \theta_2 = 0$ and the vector of estimated parameters under the null hypothesis be $\hat{\boldsymbol{\theta}} = \big(\hat\theta_1, 0\big)^T$. Of course, there is no practical interest in testing the hypothesis $H_0 : \theta_2 = 0$, since under the null hypothesis the expectation function equals a constant. However, this example is constructed to display the components of $\mathrm{DIMS}_k$, and for this purpose the example works well.
From Theorem 6.1.2 we can see that the components of $\mathrm{DIMS}_k$ are $\mathbf{F}$, $\mathbf{r}$, $\mathbf{G}$, $\mathbf{G}^*$ and $\mathrm{DIM}_{\hat{\boldsymbol{\theta}},k}$. Now, let us describe these matrices.

For this particular test and model, the vector of residuals that results from estimating the model $\mathbf{y} = \mathbf{f}(\mathbf{X},\boldsymbol{\theta}) + \boldsymbol{\varepsilon}$ under the null hypothesis is given by

$$\mathbf{r} = \mathbf{y} - \hat\theta_1\mathbf{1}_3.$$
The first row of the $2\times 3$ matrix $\mathbf{F}(\hat{\boldsymbol{\theta}})$ is the following

$$\frac{d}{d\theta_1}\mathbf{f}(\mathbf{X},\boldsymbol{\theta})\bigg|_{\boldsymbol{\theta}=\hat{\boldsymbol{\theta}}} = \mathbf{1}_3^T,$$

its second row equals

$$\frac{d}{d\theta_2}\mathbf{f}(\mathbf{X},\boldsymbol{\theta})\bigg|_{\boldsymbol{\theta}=\hat{\boldsymbol{\theta}}} = \Big(-\frac{\hat\theta_1}{x_1},\ -\frac{\hat\theta_1}{x_2},\ -\frac{\hat\theta_1}{x_3}\Big),$$

and

$$\mathbf{F}_k^T(\hat{\boldsymbol{\theta}}) = \Big(\frac{df_k(\hat{\boldsymbol{\theta}})}{d\theta_1},\ \frac{df_k(\hat{\boldsymbol{\theta}})}{d\theta_2}\Big) = \Big(1,\ -\frac{\hat\theta_1}{x_k}\Big).$$
The influence measure $\mathrm{DIM}_{\hat{\boldsymbol{\theta}},k}$ for the parameter estimates under the null hypothesis is given by

$$\mathrm{DIM}_{\hat{\boldsymbol{\theta}},k} = \big(\mathrm{DIM}_{\hat\theta_1,k},\ \mathrm{DIM}_{\hat\theta_2,k}\big) = \big(\mathrm{DIM}_{\hat\theta_1,k},\ 0\big).$$
The $q\times nq$ matrix $\mathbf{G}(\hat{\boldsymbol{\theta}}) = \frac{d\mathbf{F}(\boldsymbol{\theta})}{d\boldsymbol{\theta}}\Big|_{\boldsymbol{\theta}=\hat{\boldsymbol{\theta}}}$ is here

$$\mathbf{G}(\hat{\boldsymbol{\theta}}) = \begin{pmatrix} \dfrac{d^2 f_1(\hat{\boldsymbol{\theta}})}{d\theta_1^2} & \dfrac{d^2 f_1(\hat{\boldsymbol{\theta}})}{d\theta_1 d\theta_2} & \dfrac{d^2 f_2(\hat{\boldsymbol{\theta}})}{d\theta_1^2} & \dfrac{d^2 f_2(\hat{\boldsymbol{\theta}})}{d\theta_1 d\theta_2} & \dfrac{d^2 f_3(\hat{\boldsymbol{\theta}})}{d\theta_1^2} & \dfrac{d^2 f_3(\hat{\boldsymbol{\theta}})}{d\theta_1 d\theta_2}\\[8pt] \dfrac{d^2 f_1(\hat{\boldsymbol{\theta}})}{d\theta_2 d\theta_1} & \dfrac{d^2 f_1(\hat{\boldsymbol{\theta}})}{d\theta_2^2} & \dfrac{d^2 f_2(\hat{\boldsymbol{\theta}})}{d\theta_2 d\theta_1} & \dfrac{d^2 f_2(\hat{\boldsymbol{\theta}})}{d\theta_2^2} & \dfrac{d^2 f_3(\hat{\boldsymbol{\theta}})}{d\theta_2 d\theta_1} & \dfrac{d^2 f_3(\hat{\boldsymbol{\theta}})}{d\theta_2^2} \end{pmatrix}.$$

In the matrix $\mathbf{G}(\hat{\boldsymbol{\theta}})$,

$$\frac{d^2 f_i(\hat{\boldsymbol{\theta}})}{d\theta_1^2} = 0, \qquad \frac{d^2 f_i(\hat{\boldsymbol{\theta}})}{d\theta_2^2} = \frac{2\hat\theta_1}{x_i^2},$$

and

$$\frac{d^2 f_i(\hat{\boldsymbol{\theta}})}{d\theta_1 d\theta_2} = \frac{d^2 f_i(\hat{\boldsymbol{\theta}})}{d\theta_2 d\theta_1} = -\frac{x_i}{(\hat\theta_2 + x_i)^2} = -\frac{1}{x_i},$$
so that

$$\mathbf{G}(\hat{\boldsymbol{\theta}}) = \begin{pmatrix} 0 & -\dfrac{1}{x_1} & 0 & -\dfrac{1}{x_2} & 0 & -\dfrac{1}{x_3}\\[8pt] -\dfrac{1}{x_1} & \dfrac{2\hat\theta_1}{x_1^2} & -\dfrac{1}{x_2} & \dfrac{2\hat\theta_1}{x_2^2} & -\dfrac{1}{x_3} & \dfrac{2\hat\theta_1}{x_3^2} \end{pmatrix}.$$
Similarly, the $q\times nq$ matrix $\mathbf{G}^*(\hat{\boldsymbol{\theta}}) = \frac{d\mathbf{F}^T(\boldsymbol{\theta})}{d\boldsymbol{\theta}}\Big|_{\boldsymbol{\theta}=\hat{\boldsymbol{\theta}}}$ and thus

$$\mathbf{G}^*(\hat{\boldsymbol{\theta}}) = \begin{pmatrix} \dfrac{d^2 f_1(\hat{\boldsymbol{\theta}})}{d\theta_1^2} & \dfrac{d^2 f_2(\hat{\boldsymbol{\theta}})}{d\theta_1^2} & \dfrac{d^2 f_3(\hat{\boldsymbol{\theta}})}{d\theta_1^2} & \dfrac{d^2 f_1(\hat{\boldsymbol{\theta}})}{d\theta_1 d\theta_2} & \dfrac{d^2 f_2(\hat{\boldsymbol{\theta}})}{d\theta_1 d\theta_2} & \dfrac{d^2 f_3(\hat{\boldsymbol{\theta}})}{d\theta_1 d\theta_2}\\[8pt] \dfrac{d^2 f_1(\hat{\boldsymbol{\theta}})}{d\theta_2 d\theta_1} & \dfrac{d^2 f_2(\hat{\boldsymbol{\theta}})}{d\theta_2 d\theta_1} & \dfrac{d^2 f_3(\hat{\boldsymbol{\theta}})}{d\theta_2 d\theta_1} & \dfrac{d^2 f_1(\hat{\boldsymbol{\theta}})}{d\theta_2^2} & \dfrac{d^2 f_2(\hat{\boldsymbol{\theta}})}{d\theta_2^2} & \dfrac{d^2 f_3(\hat{\boldsymbol{\theta}})}{d\theta_2^2} \end{pmatrix} = \begin{pmatrix} 0 & 0 & 0 & -\dfrac{1}{x_1} & -\dfrac{1}{x_2} & -\dfrac{1}{x_3}\\[8pt] -\dfrac{1}{x_1} & -\dfrac{1}{x_2} & -\dfrac{1}{x_3} & \dfrac{2\hat\theta_1}{x_1^2} & \dfrac{2\hat\theta_1}{x_2^2} & \dfrac{2\hat\theta_1}{x_3^2} \end{pmatrix}.$$
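The closed-form first and second derivatives displayed above can be verified numerically by finite differences of the expectation function. A short sketch, using hypothetical design points $x$ and a hypothetical estimate $\hat\theta_1$ (neither is taken from the thesis):

```python
import numpy as np

x = np.array([0.5, 1.0, 2.0])        # hypothetical design points
th1, th2 = 1.5, 0.0                  # estimates under H0: theta2 = 0

f = lambda t1, t2: t1 * x / (t2 + x)

h = 1e-5
# First derivatives: the two rows of F(theta_hat) in Example 6.1.
dF1 = (f(th1 + h, th2) - f(th1 - h, th2)) / (2 * h)   # d f / d theta1 -> 1_3
dF2 = (f(th1, th2 + h) - f(th1, th2 - h)) / (2 * h)   # d f / d theta2 -> -th1/x

# Second derivatives: the entries of G(theta_hat) and G*(theta_hat).
d2_11 = (f(th1 + h, th2) - 2 * f(th1, th2) + f(th1 - h, th2)) / h**2   # -> 0
d2_22 = (f(th1, th2 + h) - 2 * f(th1, th2) + f(th1, th2 - h)) / h**2   # -> 2*th1/x^2
d2_12 = (f(th1 + h, th2 + h) - f(th1 + h, th2 - h)
         - f(th1 - h, th2 + h) + f(th1 - h, th2 - h)) / (4 * h**2)     # -> -1/x
```

The numerical values agree (up to finite-difference error) with the closed forms $\mathbf{1}_3^T$, $-\hat\theta_1/x_i$, $0$, $2\hat\theta_1/x_i^2$ and $-1/x_i$ derived above.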
In the next section a continuation of the numerical example in Section 4.2.2 will be given. In this example we illustrate how the influence diagnostic, $\mathrm{DIMS}_k$, can be used together with the added parameter plot.
6.1.3 Numerical example
This numerical example illustrates how the influence diagnostic $\mathrm{DIMS}_k$ can be used in a practical situation. We continue the numerical example in Section 4.2.2, where we fitted the modified Michaelis-Menten model (4.23) under the null hypothesis, $H_0 : \theta_4 = 0$. The added parameter plot for $\theta_4$ is presented in Figure 4.2. By inspection of the scatter in the plot we concluded that the 1st and 2nd observations were a bit far from the rest and that these observations could be influential. We will now assess the influence of all the observations in the data set by using the influence measure $\mathrm{DIMS}_k$.

The values of $\mathrm{DIMS}_k$ are computed for $k = 1,\ldots,23$ and the results are presented in Figure 6.1.
[Figure 6.1 appears here: a scatter plot of $\mathrm{DIMS}_k$ against the observation number $k = 1,\ldots,23$, with values ranging between roughly $-1.5$ and $1.5$.]

Figure 6.1: A plot of $\mathrm{DIMS}_k$ against the observation number, where $\mathrm{DIMS}_k$ is the diagnostic measure for assessing the influence of the observations on the score test statistic, given in Definition 6.1.2. The data used are presented in Table 4.1.
The 1st and 2nd observations have the largest absolute values of $\mathrm{DIMS}_k$. All observations, the 1st and 2nd excluded, have values of the influence measure within $\pm 0.72$, whereas $\mathrm{DIMS}_1 = -1.68$ and $\mathrm{DIMS}_2 = 1.32$. Relative to the other observations, the 1st and 2nd observations clearly have more influence on the outcome of the score testing procedure.

The signs of the values of $\mathrm{DIMS}_k$ can also give us some additional information. A negative value of $\mathrm{DIMS}_k$ tells us that the presence of the $k$th observation decreases the value of the score test statistic. Recall that we noted in the numerical example of Section 4.2.2 that the score test statistic was equal to 1.67, with a corresponding p-value of 0.20, when all observations were included in the analysis. When the 1st observation is removed, the value increases to 4.32 with an accompanying p-value of 0.04. Thus the presence of the 1st observation decreases the value of the score test statistic, which is also depicted by a negative value of $\mathrm{DIMS}_k$. For this particular example, the 1st observation is very influential on the score test statistic: if this observation were not present in the analysis, we would actually reject the null hypothesis $H_0 : \theta_4 = 0$ at a 5 percent level of significance.
A positive value of $\mathrm{DIMS}_k$ means that the presence of the $k$th observation increases the value of the score test statistic. Observation 2 is thus contributing to the value of the score test statistic, making it larger. If the 2nd observation were removed from the analysis, the value of the score test statistic would decrease to 0.41.
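The sign interpretation can be checked directly against the deletion results quoted in this section; the following small sketch only restates those reported numbers:

```python
# Values reported in Sections 4.2.2 and 6.1.3 for the data in Table 4.1.
S_full = 1.67                      # score statistic with all 23 observations
S_deleted = {1: 4.32, 2: 0.41}     # score statistic after deleting observation k
DIMS = {1: -1.68, 2: 1.32}

for k in (1, 2):
    # A negative DIMS_k (the observation pulls S down) corresponds to the
    # statistic increasing upon deletion of that observation, and vice versa.
    assert (DIMS[k] < 0) == (S_deleted[k] > S_full)
```

Note that $\mathrm{DIMS}_k$ is a local (derivative-based) measure, so this agreement with the finite deletion effect is expected but not guaranteed in general.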
The results of this numerical example will be further discussed in the next section, where we will assess the influence of multiple observations on the score test statistic.
6.2 Assessment of influence of multiple observations
In this section we present the obtained results concerning the assessment of influence from multiple observations on the score test statistic, given in (2.27). A diagnostic measure is proposed, which is a generalization of the measure $\mathrm{DIMS}_k$ derived in the previous section, Section 6.1.2.

Let us assume that $K$ is the subset containing the indices of the observations for which we would like to assess influence. In order to derive the measure $\mathrm{DIMS}_K$, for assessing the influence of multiple observations on the score test statistic, consider the nonlinear regression model (2.2) and the same null hypothesis (6.13) as in Section 6.1.2, i.e. $H_0 : \theta_q = 0$.

Moreover, consider the perturbed nonlinear model, also given in (5.30),

$$\mathbf{y}_\omega = \mathbf{f}(\mathbf{X},\boldsymbol{\theta}) + \boldsymbol{\varepsilon}_\omega, \tag{6.23}$$

where $\boldsymbol{\varepsilon}_\omega \sim N_n\big(\mathbf{0},\sigma^2\mathbf{W}^{-1}(\boldsymbol{\omega})\big)$, $\mathbf{W}(\boldsymbol{\omega}) : n\times n$ is a diagonal weight matrix with diagonal elements $\boldsymbol{\omega} = (\omega_1,\ldots,\omega_n)^T$, and where $0 < \omega_k \leq 1$ for $k = 1,\ldots,n$.
Similar to the previous section, we utilize the score test statistic evaluated for the estimates from the perturbed model (6.23) and define $\mathrm{DIMS}_K$ as follows.

Definition 6.2.1. The $\mathrm{DIMS}_K$, that measures the influence of the observations with indices in the subset $K$ on the score test statistic, is defined as the following derivative

$$\mathrm{DIMS}_K = \boldsymbol{\ell}^T\,\frac{dS(\hat{\boldsymbol{\theta}}(\boldsymbol{\omega}))}{d\boldsymbol{\omega}}\bigg|_{\boldsymbol{\omega}=\mathbf{1}_n},$$

where $\boldsymbol{\ell} : n\times 1$ is a vector with nonzero components in the rows with indices in $K$ and where $S(\hat{\boldsymbol{\theta}}(\boldsymbol{\omega}))$ is the score test statistic evaluated for the estimates from the perturbed model (6.23) under the restriction $\theta_q = 0$.

With the same reasoning as in the previous section, replacing $\omega_k$ with $\boldsymbol{\omega}$, the score test statistic, evaluated for the parameter estimates from the perturbed nonlinear model (6.23), equals

$$S(\hat{\boldsymbol{\theta}}(\boldsymbol{\omega})) = \frac{1}{\hat\sigma^2(\boldsymbol{\omega})}\mathbf{e}^T\mathbf{W}\mathbf{F}^T\big(\mathbf{F}\mathbf{W}\mathbf{F}^T\big)^{-1}\mathbf{F}\mathbf{W}\mathbf{e}, \tag{6.24}$$

where $\hat{\boldsymbol{\theta}}(\boldsymbol{\omega})$ is the estimate of $\boldsymbol{\theta}$ from the perturbed model (6.23) under the restriction that $\theta_q = 0$. Also, $\mathbf{F} = \mathbf{F}(\hat{\boldsymbol{\theta}}(\boldsymbol{\omega}))$, $\mathbf{W} = \mathbf{W}(\boldsymbol{\omega})$, $\mathbf{e} = \mathbf{y} - \mathbf{f}(\mathbf{X},\hat{\boldsymbol{\theta}}(\boldsymbol{\omega}))$ and $\hat\sigma^2(\boldsymbol{\omega}) = \frac{1}{n}\big(\mathbf{e}^T\mathbf{W}\mathbf{e}\big)$.
The next theorem provides an explicit expression of the $\mathrm{DIMS}_K$, which characterizes the influence of multiple observations on the score test statistic.

Theorem 6.2.1. Let $\mathrm{DIMS}_K$ be given in Definition 6.2.1. Then

$$\begin{aligned}\mathrm{DIMS}_K = \frac{1}{\hat\sigma^2}\Big[&2\big(\mathbf{U}^*(\mathbf{r}\otimes\mathbf{F}^T) + \mathrm{DIM}_{\hat{\boldsymbol{\theta}}}\mathbf{G}(\mathbf{r}\otimes\mathbf{I}_q)\big)\mathbf{g}\\ &- \Big(\mathrm{DIM}_{\hat{\boldsymbol{\theta}}}\big(\mathbf{G}(\mathbf{F}^T\otimes\mathbf{I}_q) + \mathbf{G}^*(\mathbf{I}_q\otimes\mathbf{F}^T)\big) + \mathbf{U}^*\big(\mathbf{F}^T\otimes\mathbf{F}^T\big)\Big)(\mathbf{g}\otimes\mathbf{g}) - \frac{S(\hat{\boldsymbol{\theta}})}{n}\mathbf{U}^*(\mathbf{r}\otimes\mathbf{r})\Big],\end{aligned}$$

where $\mathbf{r} = \mathbf{y} - \mathbf{f}(\mathbf{X},\hat{\boldsymbol{\theta}})$ and $\mathbf{F} = \mathbf{F}(\hat{\boldsymbol{\theta}})$ is defined in (6.15). The matrices $\mathbf{G}$ and $\mathbf{G}^*$ are defined in (6.18) and the $q$-vector $\mathbf{g} = \big(\mathbf{F}\mathbf{F}^T\big)^{-1}\mathbf{F}\mathbf{r}$.

Moreover, $\mathbf{U}^* : n\times n^2$ has row vectors $\mathbf{u}_i^T$ such that

$$\mathbf{u}_i = \mathbf{d}_i\otimes\mathbf{d}_i \tag{6.25}$$

for $i = 1,\ldots,n$, where $\mathbf{d}_i$ is the $i$th column of the identity matrix of size $n$, and $\mathrm{DIM}_{\hat{\boldsymbol{\theta}}} : n\times q$ is defined as

$$\mathrm{DIM}_{\hat{\boldsymbol{\theta}}} = \begin{pmatrix}\mathrm{DIM}_{\hat{\boldsymbol{\theta}},1}\\ \mathrm{DIM}_{\hat{\boldsymbol{\theta}},2}\\ \vdots\\ \mathrm{DIM}_{\hat{\boldsymbol{\theta}},n}\end{pmatrix},$$

where $\mathrm{DIM}_{\hat{\boldsymbol{\theta}},k}$ is given in Definition 6.1.3. The last column of $\mathrm{DIM}_{\hat{\boldsymbol{\theta}}}$ has all elements equal to zero.
Proof. Let $\mathbf{F} = \mathbf{F}(\hat{\boldsymbol{\theta}}(\boldsymbol{\omega}))$, $\mathbf{W} = \mathbf{W}(\boldsymbol{\omega})$ and $\mathbf{e} = \mathbf{y} - \mathbf{f}(\mathbf{X}) = \mathbf{y} - \mathbf{f}(\mathbf{X},\hat{\boldsymbol{\theta}}(\boldsymbol{\omega}))$. In (6.24), let

$$a(\boldsymbol{\omega}) = \mathbf{e}^T\mathbf{W}\mathbf{F}^T\big(\mathbf{F}\mathbf{W}\mathbf{F}^T\big)^{-1}\mathbf{F}\mathbf{W}\mathbf{e}$$

and

$$b(\boldsymbol{\omega}) = \hat\sigma^2(\boldsymbol{\omega}).$$

When differentiating $S(\hat{\boldsymbol{\theta}}(\boldsymbol{\omega}))$, the quotient rule is used, hence

$$\frac{dS(\hat{\boldsymbol{\theta}}(\boldsymbol{\omega}))}{d\boldsymbol{\omega}} = \frac{a'(\boldsymbol{\omega})b(\boldsymbol{\omega}) - a(\boldsymbol{\omega})b'(\boldsymbol{\omega})}{b^2(\boldsymbol{\omega})} = \frac{a'(\boldsymbol{\omega}) - S(\hat{\boldsymbol{\theta}}(\boldsymbol{\omega}))b'(\boldsymbol{\omega})}{b(\boldsymbol{\omega})}, \tag{6.26}$$

where

$$a'(\boldsymbol{\omega}) = \frac{da(\boldsymbol{\omega})}{d\boldsymbol{\omega}}, \qquad b'(\boldsymbol{\omega}) = \frac{db(\boldsymbol{\omega})}{d\boldsymbol{\omega}}.$$

First, the derivative of $a(\boldsymbol{\omega})$ is considered. Let $\mathbf{C} = \mathbf{F}\mathbf{W}\mathbf{e}$ and $\mathbf{D} = \mathbf{F}\mathbf{W}\mathbf{F}^T$, then

$$\frac{da(\boldsymbol{\omega})}{d\boldsymbol{\omega}} = \frac{d\mathbf{C}^T}{d\boldsymbol{\omega}}\mathbf{D}^{-1}\mathbf{C} + \frac{d\mathbf{D}^{-1}}{d\boldsymbol{\omega}}(\mathbf{C}\otimes\mathbf{C}) + \frac{d\mathbf{C}}{d\boldsymbol{\omega}}\mathbf{D}^{-1}\mathbf{C} = 2\Big(\frac{d\mathbf{C}}{d\boldsymbol{\omega}}\mathbf{D}^{-1}\mathbf{C}\Big) + \frac{d\mathbf{D}^{-1}}{d\boldsymbol{\omega}}(\mathbf{C}\otimes\mathbf{C}).$$
The derivative of $\mathbf{C}$ with respect to $\boldsymbol{\omega}$ equals

$$\frac{d\mathbf{C}}{d\boldsymbol{\omega}} = \frac{d}{d\boldsymbol{\omega}}\mathbf{F}\mathbf{W}\mathbf{e} = \frac{d\mathbf{F}}{d\boldsymbol{\omega}}(\mathbf{W}\mathbf{e}\otimes\mathbf{I}_q) + \frac{d\mathbf{W}}{d\boldsymbol{\omega}}(\mathbf{e}\otimes\mathbf{F}^T) + \frac{d\mathbf{e}}{d\boldsymbol{\omega}}\mathbf{W}\mathbf{F}^T. \tag{6.27}$$

Applying the chain rule, defined in Appendix A, to (6.27) gives

$$\frac{d\hat{\boldsymbol{\theta}}(\boldsymbol{\omega})}{d\boldsymbol{\omega}}\Big(\frac{d\mathbf{F}}{d\hat{\boldsymbol{\theta}}(\boldsymbol{\omega})}(\mathbf{W}\mathbf{e}\otimes\mathbf{I}_q) - \frac{d\mathbf{f}(\mathbf{X})}{d\hat{\boldsymbol{\theta}}(\boldsymbol{\omega})}\mathbf{W}\mathbf{F}^T\Big) + \mathbf{U}^*\big(\mathbf{e}\otimes\mathbf{F}^T\big), \tag{6.28}$$

since

$$\frac{d\mathbf{W}}{d\boldsymbol{\omega}} = \mathbf{U}^*,$$

where $\mathbf{U}^*$ is defined in (6.25).
Next, the derivative of $\mathbf{D}$ with respect to $\boldsymbol{\omega}$ is considered. Using the rule for differentiation of a matrix inverse (see Appendix A), the derivative of $\mathbf{D}^{-1}$ with respect to $\boldsymbol{\omega}$ is given by

$$\frac{d\mathbf{D}^{-1}}{d\boldsymbol{\omega}} = -\frac{d\mathbf{D}}{d\boldsymbol{\omega}}\big(\mathbf{D}^{-1}\otimes\mathbf{D}^{-1}\big).$$

Now,

$$\begin{aligned}\frac{d\mathbf{D}}{d\boldsymbol{\omega}} &= \frac{d}{d\boldsymbol{\omega}}\big(\mathbf{F}\mathbf{W}\mathbf{F}^T\big) = \frac{d\mathbf{F}}{d\boldsymbol{\omega}}\big(\mathbf{W}\mathbf{F}^T\otimes\mathbf{I}_q\big) + \frac{d\mathbf{W}}{d\boldsymbol{\omega}}\big(\mathbf{F}^T\otimes\mathbf{F}^T\big) + \frac{d\mathbf{F}^T}{d\boldsymbol{\omega}}\big(\mathbf{I}_q\otimes\mathbf{W}\mathbf{F}^T\big)\\ &= \frac{d\hat{\boldsymbol{\theta}}(\boldsymbol{\omega})}{d\boldsymbol{\omega}}\Big(\frac{d\mathbf{F}}{d\hat{\boldsymbol{\theta}}(\boldsymbol{\omega})}\big(\mathbf{W}\mathbf{F}^T\otimes\mathbf{I}_q\big) + \frac{d\mathbf{F}^T}{d\hat{\boldsymbol{\theta}}(\boldsymbol{\omega})}\big(\mathbf{I}_q\otimes\mathbf{W}\mathbf{F}^T\big)\Big) + \frac{d\mathbf{W}}{d\boldsymbol{\omega}}\big(\mathbf{F}^T\otimes\mathbf{F}^T\big),\end{aligned}$$

and

$$\begin{aligned}\frac{d\mathbf{D}^{-1}}{d\boldsymbol{\omega}} = -\Big[&\frac{d\hat{\boldsymbol{\theta}}(\boldsymbol{\omega})}{d\boldsymbol{\omega}}\Big(\frac{d\mathbf{F}}{d\hat{\boldsymbol{\theta}}(\boldsymbol{\omega})}\big(\mathbf{W}\mathbf{F}^T\otimes\mathbf{I}_q\big) + \frac{d\mathbf{F}^T}{d\hat{\boldsymbol{\theta}}(\boldsymbol{\omega})}\big(\mathbf{I}_q\otimes\mathbf{W}\mathbf{F}^T\big)\Big)\\ &+ \frac{d\mathbf{W}}{d\boldsymbol{\omega}}\big(\mathbf{F}^T\otimes\mathbf{F}^T\big)\Big]\Big(\big(\mathbf{F}\mathbf{W}\mathbf{F}^T\big)^{-1}\otimes\big(\mathbf{F}\mathbf{W}\mathbf{F}^T\big)^{-1}\Big). \tag{6.29}\end{aligned}$$
Now, $\hat{\boldsymbol{\theta}}(\boldsymbol{\omega} = \mathbf{1}_n) = \hat{\boldsymbol{\theta}}$ and $\mathbf{y} - \mathbf{f}(\mathbf{X},\hat{\boldsymbol{\theta}}(\boldsymbol{\omega} = \mathbf{1}_n)) = \mathbf{r}$, $\mathbf{F} = \mathbf{F}(\hat{\boldsymbol{\theta}})$, $\mathbf{G} = \mathbf{G}(\hat{\boldsymbol{\theta}})$ and $\mathbf{G}^* = \mathbf{G}^*(\hat{\boldsymbol{\theta}})$. The derivatives in (6.28) and (6.29) evaluated at $\boldsymbol{\omega} = \mathbf{1}_n$ become

$$\frac{d}{d\boldsymbol{\omega}}\mathbf{F}\mathbf{W}\mathbf{e}\bigg|_{\boldsymbol{\omega}=\mathbf{1}_n} = \mathbf{U}^*(\mathbf{r}\otimes\mathbf{F}^T) - \frac{d\hat{\boldsymbol{\theta}}(\boldsymbol{\omega})}{d\boldsymbol{\omega}}\bigg|_{\boldsymbol{\omega}=\mathbf{1}_n}\big(\mathbf{F}\mathbf{F}^T - \mathbf{G}(\mathbf{r}\otimes\mathbf{I}_q)\big),$$

$$\frac{d(\mathbf{F}\mathbf{W}\mathbf{F}^T)^{-1}}{d\boldsymbol{\omega}}\bigg|_{\boldsymbol{\omega}=\mathbf{1}_n} = -\Bigg[\frac{d\hat{\boldsymbol{\theta}}(\boldsymbol{\omega})}{d\boldsymbol{\omega}}\bigg|_{\boldsymbol{\omega}=\mathbf{1}_n}\big(\mathbf{G}(\mathbf{F}^T\otimes\mathbf{I}_q) + \mathbf{G}^*(\mathbf{I}_q\otimes\mathbf{F}^T)\big) + \mathbf{U}^*(\mathbf{F}^T\otimes\mathbf{F}^T)\Bigg]\Big(\big(\mathbf{F}\mathbf{F}^T\big)^{-1}\otimes\big(\mathbf{F}\mathbf{F}^T\big)^{-1}\Big).$$
We know from the proof of Corollary 5.2.2 that

$$\frac{d\hat{\boldsymbol{\theta}}(\boldsymbol{\omega})}{d\boldsymbol{\omega}}\bigg|_{\boldsymbol{\omega}=\mathbf{1}_n} = \mathrm{DIM}_{\hat{\boldsymbol{\theta}}} = \begin{pmatrix}\mathrm{DIM}_{\hat{\boldsymbol{\theta}},1}\\ \mathrm{DIM}_{\hat{\boldsymbol{\theta}},2}\\ \vdots\\ \mathrm{DIM}_{\hat{\boldsymbol{\theta}},n}\end{pmatrix},$$

where $\mathrm{DIM}_{\hat{\boldsymbol{\theta}},k}$ is given in Definition 6.1.3 and thus, the last column of $\mathrm{DIM}_{\hat{\boldsymbol{\theta}}}$ has all elements equal to zero.
Now, let $\mathbf{g} = \big(\mathbf{F}\mathbf{F}^T\big)^{-1}\mathbf{F}\mathbf{r}$. Then,

$$\begin{aligned}\frac{da(\boldsymbol{\omega})}{d\boldsymbol{\omega}}\bigg|_{\boldsymbol{\omega}=\mathbf{1}_n} = {}& 2\big(\mathbf{U}^*(\mathbf{r}\otimes\mathbf{F}^T) - \mathrm{DIM}_{\hat{\boldsymbol{\theta}}}(\mathbf{F}\mathbf{F}^T - \mathbf{G}(\mathbf{r}\otimes\mathbf{I}_q))\big)\mathbf{g}\\ &- \Big[\mathrm{DIM}_{\hat{\boldsymbol{\theta}}}\big(\mathbf{G}(\mathbf{F}^T\otimes\mathbf{I}_q) + \mathbf{G}^*(\mathbf{I}_q\otimes\mathbf{F}^T)\big) + \mathbf{U}^*(\mathbf{F}^T\otimes\mathbf{F}^T)\Big](\mathbf{g}\otimes\mathbf{g})\\ = {}& 2\big(\mathbf{U}^*(\mathbf{r}\otimes\mathbf{F}^T) + \mathrm{DIM}_{\hat{\boldsymbol{\theta}}}\mathbf{G}(\mathbf{r}\otimes\mathbf{I}_q)\big)\mathbf{g}\\ &- \Big[\mathrm{DIM}_{\hat{\boldsymbol{\theta}}}\big(\mathbf{G}(\mathbf{F}^T\otimes\mathbf{I}_q) + \mathbf{G}^*(\mathbf{I}_q\otimes\mathbf{F}^T)\big) + \mathbf{U}^*(\mathbf{F}^T\otimes\mathbf{F}^T)\Big](\mathbf{g}\otimes\mathbf{g}). \tag{6.30}\end{aligned}$$

In the expression above, $\mathrm{DIM}_{\hat{\boldsymbol{\theta}}}\mathbf{F}\mathbf{F}^T\mathbf{g} = \mathbf{0}$. This is due to the fact that the normal equations for estimating $\boldsymbol{\theta}$ under the restriction that $\theta_q = 0$ set the first $q-1$ elements of $\mathbf{F}\mathbf{r}$ to zero and that the last column of $\mathrm{DIM}_{\hat{\boldsymbol{\theta}}}$ has all elements equal to zero. Thus,

$$\mathrm{DIM}_{\hat{\boldsymbol{\theta}}}\mathbf{F}\mathbf{F}^T\big(\mathbf{F}\mathbf{F}^T\big)^{-1}\mathbf{F}\mathbf{r} = \mathrm{DIM}_{\hat{\boldsymbol{\theta}}}\mathbf{F}\mathbf{r} = \mathbf{0}.$$
Now the derivative of the variance term needs to be calculated. The maximum likelihood estimator of $\hat\sigma^2(\boldsymbol{\omega})$ under the null hypothesis, $H_0 : \theta_q = 0$, equals

$$\hat\sigma^2(\boldsymbol{\omega}) = \frac{1}{n}\big(\mathbf{e}^T\mathbf{W}\mathbf{e}\big).$$

Using the product and the chain rule, see Appendix A, the derivative of $\hat\sigma^2(\boldsymbol{\omega})$ with respect to $\boldsymbol{\omega}$ is the following

$$\begin{aligned}\frac{d\hat\sigma^2(\boldsymbol{\omega})}{d\boldsymbol{\omega}} &= \frac{1}{n}\frac{d}{d\boldsymbol{\omega}}\mathbf{e}^T\mathbf{W}\mathbf{e} = \frac{1}{n}\Big(2\frac{d\mathbf{e}}{d\boldsymbol{\omega}}\mathbf{W}\mathbf{e} + \frac{d\mathbf{W}}{d\boldsymbol{\omega}}(\mathbf{e}\otimes\mathbf{e})\Big)\\ &= \frac{1}{n}\Big(\frac{d\mathbf{W}}{d\boldsymbol{\omega}}(\mathbf{e}\otimes\mathbf{e}) - 2\frac{d\mathbf{f}(\mathbf{X},\hat{\boldsymbol{\theta}}(\boldsymbol{\omega}))}{d\boldsymbol{\omega}}\mathbf{W}\mathbf{e}\Big)\\ &= \frac{1}{n}\Bigg(\mathbf{U}^*(\mathbf{e}\otimes\mathbf{e}) - 2\frac{d\hat{\boldsymbol{\theta}}(\boldsymbol{\omega})}{d\boldsymbol{\omega}}\Big(\frac{d\mathbf{f}(\mathbf{X},\hat{\boldsymbol{\theta}}(\boldsymbol{\omega}))}{d\hat{\boldsymbol{\theta}}(\boldsymbol{\omega})}\mathbf{W}\mathbf{e}\Big)\Bigg).\end{aligned}$$
Evaluated at $\boldsymbol{\omega} = \mathbf{1}_n$, $\hat{\boldsymbol{\theta}}(\boldsymbol{\omega} = \mathbf{1}_n) = \hat{\boldsymbol{\theta}}$, $\mathbf{y} - \mathbf{f}(\mathbf{X},\hat{\boldsymbol{\theta}}(\boldsymbol{\omega} = \mathbf{1}_n)) = \mathbf{r}$ and hence

$$\frac{d\hat\sigma^2(\boldsymbol{\omega})}{d\boldsymbol{\omega}}\bigg|_{\boldsymbol{\omega}=\mathbf{1}_n} = \frac{1}{n}\big(\mathbf{U}^*(\mathbf{r}\otimes\mathbf{r}) - 2\,\mathrm{DIM}_{\hat{\boldsymbol{\theta}}}\mathbf{F}\mathbf{r}\big) = \frac{1}{n}\mathbf{U}^*(\mathbf{r}\otimes\mathbf{r}), \tag{6.31}$$

since $\mathrm{DIM}_{\hat{\boldsymbol{\theta}}}\mathbf{F}\mathbf{r} = \mathbf{0}$.
Now, inserting (6.30) and (6.31) in (6.26) we get

$$\begin{aligned}\frac{dS(\hat{\boldsymbol{\theta}}(\boldsymbol{\omega}))}{d\boldsymbol{\omega}}\bigg|_{\boldsymbol{\omega}=\mathbf{1}_n} = \frac{1}{\hat\sigma^2}\Big[&2\big(\mathbf{U}^*(\mathbf{r}\otimes\mathbf{F}^T) + \mathrm{DIM}_{\hat{\boldsymbol{\theta}}}\mathbf{G}(\mathbf{r}\otimes\mathbf{I}_q)\big)\mathbf{g}\\ &- \Big(\mathrm{DIM}_{\hat{\boldsymbol{\theta}}}\big(\mathbf{G}(\mathbf{F}^T\otimes\mathbf{I}_q) + \mathbf{G}^*(\mathbf{I}_q\otimes\mathbf{F}^T)\big) + \mathbf{U}^*\big(\mathbf{F}^T\otimes\mathbf{F}^T\big)\Big)(\mathbf{g}\otimes\mathbf{g})\\ &- \frac{S(\hat{\boldsymbol{\theta}})}{n}\mathbf{U}^*(\mathbf{r}\otimes\mathbf{r})\Big], \tag{6.32}\end{aligned}$$

and $\boldsymbol{\ell}^T\,\frac{dS(\hat{\boldsymbol{\theta}}(\boldsymbol{\omega}))}{d\boldsymbol{\omega}}\Big|_{\boldsymbol{\omega}=\mathbf{1}_n}$ equals the expression in Theorem 6.2.1.

This completes the proof. □
Corollary 6.2.1. The influence measure $\mathrm{DIMS}_K$ is a linear combination of the influence measures $\mathrm{DIMS}_k$, given in Definition 6.1.2, for all $k$ contained in the subset $K$.

Proof. Now, consider

$$\boldsymbol{\ell}^T\,\frac{dS(\hat{\boldsymbol{\theta}}(\boldsymbol{\omega}))}{d\boldsymbol{\omega}}\bigg|_{\boldsymbol{\omega}=\mathbf{1}_n},$$

where $\boldsymbol{\ell} : n\times 1$ is a vector with nonzero entries in the rows with indices in $K$.
Pre-multiplying (6.32) by $\boldsymbol{\ell}^T$, we need to consider the following terms

$$\begin{aligned}&\boldsymbol{\ell}^T\mathbf{U}^*(\mathbf{r}\otimes\mathbf{F}^T), \qquad \boldsymbol{\ell}^T\mathrm{DIM}_{\hat{\boldsymbol{\theta}}}\mathbf{G}(\mathbf{r}\otimes\mathbf{I}_q), \qquad \boldsymbol{\ell}^T\mathrm{DIM}_{\hat{\boldsymbol{\theta}}}\mathbf{G}\big(\mathbf{F}^T\otimes\mathbf{I}_q\big),\\ &\boldsymbol{\ell}^T\mathrm{DIM}_{\hat{\boldsymbol{\theta}}}\mathbf{G}^*(\mathbf{I}_q\otimes\mathbf{F}^T), \qquad \boldsymbol{\ell}^T\mathbf{U}^*(\mathbf{F}^T\otimes\mathbf{F}^T), \qquad \boldsymbol{\ell}^T\mathbf{U}^*(\mathbf{r}\otimes\mathbf{r}).\end{aligned}$$

Firstly, we evaluate the terms containing $\boldsymbol{\ell}^T\mathbf{U}^* : 1\times n^2$. We have that

$$\boldsymbol{\ell}^T\mathbf{U}^* = \big(\mathbf{d}_1^T \,|\, \mathbf{d}_2^T \,|\, \ldots \,|\, \mathbf{d}_n^T\big),$$

where $\mathbf{d}_i$ is the $i$th column of the identity matrix of size $n$, for all $i$ contained in the subset $K$, and $\mathbf{d}_i = \mathbf{0}_n$ for all $i$ not contained in $K$.

From this, it follows that

$$\boldsymbol{\ell}^T\mathbf{U}^*(\mathbf{r}\otimes\mathbf{F}^T) = \sum_{i\in K}\ell_i r_i\mathbf{F}_i^T, \tag{6.33}$$

$$\boldsymbol{\ell}^T\mathbf{U}^*(\mathbf{F}^T\otimes\mathbf{F}^T) = \sum_{i\in K}\ell_i\mathbf{F}_i\mathbf{F}_i^T, \tag{6.34}$$

$$\boldsymbol{\ell}^T\mathbf{U}^*(\mathbf{r}\otimes\mathbf{r}) = \sum_{i\in K}\ell_i r_i^2. \tag{6.35}$$
Secondly, we observe that

$$\boldsymbol{\ell}^T\mathrm{DIM}_{\hat{\boldsymbol{\theta}}} = \boldsymbol{\ell}^T\begin{pmatrix}\mathrm{DIM}_{\hat{\boldsymbol{\theta}},1}\\ \mathrm{DIM}_{\hat{\boldsymbol{\theta}},2}\\ \vdots\\ \mathrm{DIM}_{\hat{\boldsymbol{\theta}},n}\end{pmatrix} = \sum_{i\in K}\ell_i\,\mathrm{DIM}_{\hat{\boldsymbol{\theta}},i}, \tag{6.36}$$

where $\mathrm{DIM}_{\hat{\boldsymbol{\theta}},i} : 1\times q$ is given in Definition 6.1.3.
Assuming, without loss of generality, that $K = \{k\}$, then we have that

$$\begin{aligned}\boldsymbol{\ell}^T\mathbf{U}^*(\mathbf{r}\otimes\mathbf{F}^T) &= \ell_k r_k\mathbf{F}_k^T,\\ \boldsymbol{\ell}^T\mathrm{DIM}_{\hat{\boldsymbol{\theta}}}\mathbf{G}(\mathbf{r}\otimes\mathbf{I}_q) &= \ell_k\,\mathrm{DIM}_{\hat{\boldsymbol{\theta}},k}\mathbf{G}(\mathbf{r}\otimes\mathbf{I}_q),\\ \boldsymbol{\ell}^T\mathrm{DIM}_{\hat{\boldsymbol{\theta}}}\mathbf{G}\big(\mathbf{F}^T\otimes\mathbf{I}_q\big) &= \ell_k\,\mathrm{DIM}_{\hat{\boldsymbol{\theta}},k}\mathbf{G}\big(\mathbf{F}^T\otimes\mathbf{I}_q\big),\\ \boldsymbol{\ell}^T\mathrm{DIM}_{\hat{\boldsymbol{\theta}}}\mathbf{G}^*(\mathbf{I}_q\otimes\mathbf{F}^T) &= \ell_k\,\mathrm{DIM}_{\hat{\boldsymbol{\theta}},k}\mathbf{G}^*(\mathbf{I}_q\otimes\mathbf{F}^T),\\ \boldsymbol{\ell}^T\mathbf{U}^*(\mathbf{F}^T\otimes\mathbf{F}^T) &= \ell_k\mathbf{F}_k\mathbf{F}_k^T,\\ \boldsymbol{\ell}^T\mathbf{U}^*(\mathbf{r}\otimes\mathbf{r}) &= \ell_k r_k^2.\end{aligned}$$
Inserting the equalities above in the expression for $\mathrm{DIMS}_K$ we arrive at

$$\begin{aligned}\mathrm{DIMS}_K = \frac{1}{\hat\sigma^2}\Big[&2\big(\ell_k r_k\mathbf{F}_k^T\mathbf{g} + \ell_k\,\mathrm{DIM}_{\hat{\boldsymbol{\theta}},k}\mathbf{G}(\mathbf{r}\otimes\mathbf{I}_q)\mathbf{g}\big)\\ &- \Big(\ell_k\,\mathrm{DIM}_{\hat{\boldsymbol{\theta}},k}\big(\mathbf{G}(\mathbf{F}^T\otimes\mathbf{I}_q) + \mathbf{G}^*(\mathbf{I}_q\otimes\mathbf{F}^T)\big) + \ell_k\mathbf{F}_k\mathbf{F}_k^T\Big)(\mathbf{g}\otimes\mathbf{g}) - \frac{S(\hat{\boldsymbol{\theta}})}{n}\ell_k r_k^2\Big],\end{aligned}$$

which, if $\ell_k = 1$, equals the expression in Theorem 6.1.2. Since the equalities in (6.33)-(6.36) are sums over all observations contained in the subset $K$, we observe that, for a general subset $K$, the expression of $\mathrm{DIMS}_K$ is a linear combination of the $\mathrm{DIMS}_k$, given in Definition 6.1.2, for all $k$ contained in the subset $K$. This completes the proof. □
From Corollary 6.2.1 we see that $\mathrm{DIMS}_K$ is a linear combination of $\mathrm{DIMS}_k$, the influence measure used to assess the influence of single observations on the score test statistic, for all $k \in K$. Thus, in order to assess the joint influence of multiple observations on the score test statistic we only need to consider the measure given in Theorem 6.1.2.
If observations with equal signs of the values of $\mathrm{DIMS}_k$ are considered together, the joint influence that these observations exercise on the score test statistic will be more extensive than when they are considered separately. For instance, two observations with a positive influence on the score test statistic will result in a much larger positive influence when they are considered jointly. Likewise, two observations with a negative influence on the score test statistic will result in a larger negative influence when they are considered jointly. It is important to remember that two observations with unequal signs of $\mathrm{DIMS}_k$ will even out the value of $\mathrm{DIMS}_K$, resulting in a value closer to zero, i.e. no joint influence.
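Because $\mathrm{DIMS}_K$ is a linear combination of the single-observation measures, the joint influence can be read off directly from the individual values. A tiny sketch using the values read from Figure 6.1 (with every nonzero component of $\boldsymbol{\ell}$ set to one; any small discrepancy with the values reported in Section 6.2.1 stems from rounding of the plotted values):

```python
# DIMS_k values read from Figure 6.1 for the Table 4.1 data (Section 6.1.3).
DIMS = {1: -1.68, 2: 1.32, 10: -0.72}

def dims_K(K):
    """Joint influence via Corollary 6.2.1: DIMS_K is a sum of the DIMS_k, k in K."""
    return sum(DIMS[k] for k in K)

print(round(dims_K({1, 10}), 2))   # equal signs reinforce each other
print(round(dims_K({1, 2}), 2))    # opposite signs nearly cancel
```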
In the next section, a numerical example illustrates how multiple observations can influence the score test statistic.
6.2.1 Numerical example
We will continue with the numerical example given in Section 6.1.3 and assess the influence of multiple observations on the score test statistic using the diagnostic $\mathrm{DIMS}_K$. We know that $\mathrm{DIMS}_K$ is a linear combination of the diagnostic $\mathrm{DIMS}_k$; therefore we only need to consider Figure 6.1 in Section 6.1.3 to assess the joint influence of the observations.

In Figure 6.1 we can see that the 1st and 10th observations are the observations with the largest negative influence on the score test statistic. Jointly, they exercise quite a large negative influence on the score test statistic. The value of $\mathrm{DIMS}_1$ is $-1.68$ and the value of $\mathrm{DIMS}_{10}$ is $-0.72$; hence, $\mathrm{DIMS}_K = -2.40$ when $K = \{1,10\}$. The presence of the 1st and 10th observations decreases the value of the score test statistic. If both these observations are removed from the analysis, the score test statistic equals 5.82 with a corresponding p-value of 0.02.

If we look at the scenario when the 1st and 2nd observations are considered jointly, we can expect that the joint influence will even out, since these observations have unequal signs of $\mathrm{DIMS}_k$. The value of the influence measure corresponding to the 2nd observation is $\mathrm{DIMS}_2 = 1.32$ and the resulting $\mathrm{DIMS}_K = -0.37$ when $K = \{1,2\}$. Separately, these two observations exercise quite a large influence on the score test statistic, but when considered jointly the joint influence is almost zero. The result of the testing procedure would not change dramatically if these two observations were removed from the analysis. In fact, the score test statistic then equals 2.58 with a corresponding p-value of 0.11. This is a small increase from 1.67, i.e. the value of the score test statistic when all observations are present in the analysis.
7. Concluding remarks and further research

It is well known that not all observations play an equal role in determining the various results from a regression analysis. For instance, the character of the regression line may be determined by only a few observations, while most of the data is somewhat ignored. Observations that highly influence the results of the analysis are called influential observations. It is beneficial, for many reasons, to be able to detect influential observations, see Chapter 3. For the linear regression model there is a vast collection of diagnostic tools to use for identifying influential observations. The amount of literature and research on influence analysis for nonlinear regression models is not as extensive as in the linear regression case. With this dissertation we want to make a contribution to influence analysis concerning various results of nonlinear regression analysis. In particular, we focus on the task of identifying observations with substantial influence on the parameter estimates of a nonlinear regression model and on the score test statistic, when testing the hypothesis that a specific parameter in the nonlinear regression model equals zero.
The main contributions of this thesis are as follows:
• Two different diagnostic measures for assessing the influence of single observations on the parameter estimates in the nonlinear regression model (2.2) are proposed. The first measure, $\mathrm{DIM}_{\hat{\boldsymbol{\theta}},k}$, is to be used when we are interested in assessing the influence of an observation on the whole vector of parameter estimates. The explicit expression of this measure is given in Theorem 5.1.2. The second measure, for assessing the influence of an observation on a specific parameter estimate, is denoted $\mathrm{DIM}_{\hat\theta_j,k}$ and the explicit expression of the measure is given in Theorem 5.1.3 (Aim 1).
• We extend the ideas and techniques used to assess the influence of single observations on the parameter estimates in the nonlinear regression model (2.2) to multiple observations. In correspondence with the first contribution, we present two measures: one measure for assessing the influence of multiple observations on the whole vector of parameter estimates, $\mathrm{DIM}_{\hat{\boldsymbol{\theta}},K}$, and one measure for assessing the influence of multiple observations on a specific parameter estimate, $\mathrm{DIM}_{\hat\theta_j,K}$. These measures are presented in Theorems 5.2.1 and 5.2.2, respectively. The influence that multiple observations exercise on the parameter estimates in this case is referred to as joint influence, since we consider the observations simultaneously when assessing the influence (Aim 2).
• As opposed to joint influence, multiple observations can exercise what we refer to as conditional influence on the parameter estimates. Conditional influence arises if an observation is not identified as influential unless another observation is deleted first. Thus, an influence measure for assessing the influence of the $k$th observation, given that the $i$th observation is deleted, is proposed. The measure is denoted $\mathrm{DIM}_{\hat{\boldsymbol{\theta}}(i),k}$ and its explicit expression is given in Theorem 5.2.4 (Aim 3).
• We develop a graphical tool for visually identifying observations that are influential on the score test statistic, when testing the null hypothesis $H_0 : \theta_q = 0$, where $\theta_q$ is a parameter in the nonlinear regression model (2.2). This graphical tool is referred to as the added parameter plot and it is presented in Definition 4.2.1 (Aim 4).
• The added parameter plot is for exploratory purposes only. In order to quantify the influence of the observations on the score test statistic, we propose two influence measures. The first measure, denoted DIMS_k, is to be used when assessing the influence of a single observation on the score test statistic; its explicit expression is given in Theorem 6.1.2. Moreover, we propose a measure for assessing the influence of multiple observations, jointly, on the score test statistic, denoted DIMS_K, presented in Theorem 6.2.1 (Aim 5).
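The score test that these measures are built around can be illustrated with a small numerical sketch. This is a hedged example under standard nonlinear least-squares assumptions: the exponential-plus-linear mean function, the parameter names, and the variance estimator are illustrative choices of ours, not the setup used in the thesis.

```python
# Sketch of Rao's score test for H0: t3 = 0 in a nonlinear regression
# y = t1*(1 - exp(-t2*x)) + t3*x + error (illustrative model, not the
# thesis's). Only the restricted model (t3 = 0) needs to be fitted.
import numpy as np
from scipy.optimize import curve_fit

def full_mean(x, t1, t2, t3):
    return t1 * (1 - np.exp(-t2 * x)) + t3 * x

rng = np.random.default_rng(3)
x = np.linspace(0.1, 5, 40)
y = full_mean(x, 2.0, 1.0, 0.3) + rng.normal(0, 0.1, x.size)

# 1. Fit the restricted model with t3 fixed at zero.
(t1, t2), _ = curve_fit(lambda x, a, b: full_mean(x, a, b, 0.0),
                        x, y, p0=[1.0, 1.0])
e = y - full_mean(x, t1, t2, 0.0)        # restricted residuals

# 2. Jacobian of the full mean function, evaluated at the restricted fit.
F = np.column_stack([1 - np.exp(-t2 * x),        # d/dt1
                     t1 * x * np.exp(-t2 * x),   # d/dt2
                     x])                         # d/dt3

# 3. Score statistic S = e'F (F'F)^{-1} F'e / sigma2_hat, with the
#    restricted ML variance estimate; asymptotically chi^2(1) under H0.
sigma2 = e @ e / x.size
S = e @ F @ np.linalg.solve(F.T @ F, F.T @ e) / sigma2
```

A large value of S relative to the χ²(1) quantile indicates evidence against H0; the influence question studied in Chapter 6 is how much individual observations move this value.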
In general, we are proud to propose our new measures and diagnostic tools, since they add to the research on influence analysis in nonlinear regression. With this thesis, we give practitioners more approaches to choose from when conducting influence analysis, and hence more flexibility. However, we want to highlight some of the contributions that we feel particularly strongly about. Firstly, the use of the proposed marginal influence measures, DIM_{θ_j,k} and DIM_{θ_j,K}, provides the opportunity to assess the influence of observations on a specific parameter estimate. There exist diagnostic measures for assessing the influence of observations on a specific parameter estimate in the linear regression model, but this has not yet been done for parameter estimates in the nonlinear regression model. Secondly, estimating the parameters in nonlinear regression models is complicated, since there is generally no closed form for the estimators. Instead, iterative methods must be used to find the estimates. To make a comparison, consider adopting the case-deletion approach for assessing the influence of observations on the parameter estimates (or other statistics which are functions of the parameter estimates). With this method we need to, iteratively, find the estimates for each observation that is deleted. This can become an overwhelming task. Our proposed approach to influence analysis reduces the burden of additional iterations, since we only need to find the estimates of the parameters once. Moreover, after scrutinizing the literature on influence analysis in nonlinear regression, we have not yet seen any research results on how one can identify observations that are influential on the outcome of a hypothesis testing procedure. With the proposed results, the added parameter plot and the diagnostic measures DIMS_k and DIMS_K, we give another view of influence analysis in nonlinear regression, since most research is focused on the parameter estimates.
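The computational point can be made concrete with a toy comparison. The sketch below is our own hedged illustration (not the thesis's DIM measures): exact case deletion refits a Michaelis-Menten model once per deleted observation, while a standard one-step, derivative-based approximation reuses the single full-data fit.

```python
# Case-deletion influence (n extra iterative fits) versus a one-step
# derivative-based approximation (one fit only). Illustrative sketch
# with a Michaelis-Menten mean function; not the thesis's DIM measures.
import numpy as np
from scipy.optimize import curve_fit

def mm(x, vmax, km):
    return vmax * x / (km + x)

rng = np.random.default_rng(1)
x = np.linspace(0.5, 10, 20)
y = mm(x, 2.0, 1.5) + rng.normal(0, 0.05, x.size)

theta, _ = curve_fit(mm, x, y, p0=[1.0, 1.0])   # single full-data fit
resid = y - mm(x, *theta)

# Jacobian F of the mean function at the full-data estimate.
F = np.column_stack([x / (theta[1] + x),                    # d/d vmax
                     -theta[0] * x / (theta[1] + x) ** 2])  # d/d km

# One-step approximation: dtheta/d(case weight k) ~ (F'F)^{-1} f_k' e_k.
one_step = np.linalg.solve(F.T @ F, (F * resid[:, None]).T)  # 2 x n

# Exact case deletion: n further iterative fits, one per observation.
exact = np.empty((2, x.size))
for k in range(x.size):
    m = np.arange(x.size) != k
    th_k, _ = curve_fit(mm, x[m], y[m], p0=theta)
    exact[:, k] = theta - th_k
```

The columns of `one_step` come from quantities already available after the single fit, while every column of `exact` costs a fresh iterative optimization; on a large data set, or with a hard-to-fit model, the difference becomes substantial.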
Of course, we are only able to cover a fraction of all there is to discover in the area of influence analysis in nonlinear regression. There are still many things one can do to extend the work done in this thesis. The following are three examples of directions for future work:
• In this thesis we do not discuss what constitutes a substantially influential observation. We rather put the results of the computed influence measures in relation to each other and rely on the judgment of the researcher or practitioner. A further task could be to develop cut-offs, or thresholds, that determine when an observation is substantially, or significantly, influential. One idea is to use the bootstrap method to accomplish this.
• We are aware that Rao's score test statistic (see Chapter 2) is only asymptotically χ²-distributed, so that larger samples are needed in order to get reliable p-values. An intriguing task would be to examine how the influence analysis, and the use of the proposed methods, are affected as the sample size grows.
• Since nonlinear regression models can differ greatly, a future task could be to customize the results obtained in this thesis to a specific nonlinear regression model, such as the Michaelis-Menten model.
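The bootstrap idea in the first bullet above could be sketched roughly as follows. This is a hypothetical illustration of ours, not a method from the thesis: residuals are resampled under the fitted model to learn the null spread of a simple deletion-based influence statistic, and an upper quantile of that spread serves as a cut-off.

```python
# Hypothetical bootstrap cut-off for a deletion-based influence measure
# in a Michaelis-Menten fit (illustrative only; not from the thesis).
import numpy as np
from scipy.optimize import curve_fit

def mm(x, vmax, km):
    return vmax * x / (km + x)

def deletion_influence(x, y, p0):
    """Fit the model, then measure the parameter shift per deleted case."""
    theta, _ = curve_fit(mm, x, y, p0=p0)
    shifts = []
    for k in range(x.size):
        m = np.arange(x.size) != k
        th_k, _ = curve_fit(mm, x[m], y[m], p0=theta)
        shifts.append(np.linalg.norm(theta - th_k))
    return theta, np.array(shifts)

rng = np.random.default_rng(7)
x = np.linspace(0.5, 10, 15)
y = mm(x, 2.0, 1.5) + rng.normal(0, 0.05, x.size)

theta, infl = deletion_influence(x, y, [1.0, 1.0])
fitted = mm(x, *theta)
resid = y - fitted

# Resample residuals under the fitted model and record the largest
# influence value each time; its 95% quantile is the proposed cut-off.
max_infl = []
for _ in range(50):
    y_b = fitted + rng.choice(resid, size=resid.size, replace=True)
    _, infl_b = deletion_influence(x, y_b, theta)
    max_infl.append(infl_b.max())

cutoff = np.quantile(max_infl, 0.95)
flagged = np.where(infl > cutoff)[0]   # observations above the cut-off
```

Whether such a residual-bootstrap null is the right reference distribution for a given influence measure is exactly the kind of question this future-work item would need to settle.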
Appendix A
Matrix derivative
In this section rules for matrix differentiation are presented.
Definition. Let the elements of Y ∈ R^{r×s} be functions of X ∈ R^{p×q}. The matrix dY/dX ∈ R^{pq×rs} is called the matrix derivative of Y by X in a set A, if the partial derivatives dy_{kl}/dx_{ij} exist and are continuous in A, and

    dY/dX = (d/dX) vec^T Y,

where

    d/dX = (d/dx_{11}, ..., d/dx_{p1}, d/dx_{12}, ..., d/dx_{p2}, ..., d/dx_{1q}, ..., d/dx_{pq})^T.

Properties of the matrix derivative in the definition are presented in the following table, where Z : s×t is a function of Y, and where A and B are matrices of constants of proper size.

Differentiated function        Derivative
Z = Z(Y), Y = Y(X)             dZ/dX = (dY/dX)(dZ/dY)
Y = AXB                        dY/dX = B ⊗ A^T
Z = AYB                        dZ/dX = (dY/dX)(B ⊗ A^T)
W = YZ                         dW/dX = (dY/dX)(Z ⊗ I_r) + (dZ/dX)(I_t ⊗ Y^T)
W = RYZ, R ∈ R^{p×r}           dW/dX = (dR/dX)(YZ ⊗ I_p) + (dY/dX)(Z ⊗ R^T) + (dZ/dX)(I_t ⊗ (RY)^T)
Y^{-1}                         dY^{-1}/dX = −(dY/dX)(Y^{-1} ⊗ Y^{-1})
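As a sanity check, the rule for Y = AXB can be verified numerically against a finite-difference approximation of the matrix derivative defined above. The sizes and random matrices below are arbitrary choices for illustration.

```python
# Numerical check of the table's rule dY/dX = B (x) A^T for Y = AXB,
# in the vec'-based pq x rs layout of the definition above.
import numpy as np

rng = np.random.default_rng(0)
p, q, r, s = 3, 2, 4, 5          # X: p x q,  A: r x p,  B: q x s
A = rng.normal(size=(r, p))
B = rng.normal(size=(q, s))
X = rng.normal(size=(p, q))

def vec(M):
    """Column-wise vectorisation."""
    return M.reshape(-1, order="F")

# Finite-difference matrix derivative dY/dX of size pq x rs: row i holds
# the derivative of vec(Y) with respect to the i-th element of vec(X).
eps = 1e-6
D = np.empty((p * q, r * s))
for i in range(p * q):
    E = np.zeros(p * q)
    E[i] = eps
    Xp = X + E.reshape(p, q, order="F")
    D[i] = (vec(A @ Xp @ B) - vec(A @ X @ B)) / eps

# Y = AXB is linear in X, so the match is exact up to rounding.
assert np.allclose(D, np.kron(B, A.T), atol=1e-4)
```

The same finite-difference scheme can be used to spot-check the other rules in the table before relying on them in a derivation.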
Sammanfattning (Summary in Swedish)
Not all observations are equally important to the results of a regression analysis. In the most extreme cases, one or two observations can completely determine, for example, the values of the parameter estimates, while the rest of the data is largely ignored. Such observations, which have a large influence on the inference, are called influential, and being able to identify influential observations is of great importance.
This thesis offers methods for identifying influential observations when working with a nonlinear regression model. This is achieved by constructing measures which quantify the influence of one or more observations on the parameter estimates. The method used to construct these measures is taken from influence analysis in linear regression and is called the differentiation approach.
Hypothesis testing is an important part of statistical inference, and the outcome of a hypothesis test can also be heavily affected by one or a few important observations. An interesting aspect of influence analysis is therefore how the individual observations affect the outcome of a hypothesis test. The thesis provides several methods for identifying observations that are influential on the test statistic when the score test is used with the null hypothesis that a parameter in the nonlinear regression model equals zero. Using the differentiation approach, we construct measures that quantify the influence of one or more observations, so that influential observations can be identified. In addition, we construct a graphical tool that can be used to visually identify observations that have a large influence on the score test statistic.
References
[1] Alfons, A., Croux, C. & Gelper, S. (2013). Sparse least trimmed squares regression for analyzing high-dimensional large data sets. The Annals of Applied Statistics, 7, 226-248.
[2] Andrews, D.F. & Pregibon, D. (1978). Finding the outliers that matter. Journal of the Royal Statistical Society. Series B, 40, 85-93.
[3] Atkins, G.L. & Nimmo, I.A. (1975). A comparison of seven methods for fitting the Michaelis-Menten equation. Biochemical Journal, 149, 775-777.
[4] Atkinson, A.C. (1982). Regression diagnostics, transformations and constructed variables. Journal of the Royal Statistical Society. Series B, 44, 1-36.
[5] Atkinson, A.C. (1985). Plots, Transformations and Regression. Clarendon, Oxford.
[6] Atkinson, A.C. (1986). Masking unmasked. Biometrika, 73, 533-541.
[7] Barnes, T.J. (1998). The history of regression: actors, networks, machines and numbers. Environment and Planning A, 30, 203-223.
[8] Bates, D.M. & Watts, D.G. (1988). Nonlinear Regression Analysis and Its Applications. Wiley, New Jersey.
[9] Behnken, D.W. & Draper, N.R. (1972). Residuals and their variance patterns. Technometrics, 14, 101-111.
[10] Belsley, D.A., Kuh, E. & Welsch, R.E. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. Wiley, New Jersey.
[11] Beyaztas, U. & Alin, A. (2014). Sufficient jackknife-after-bootstrap method for detection of influential observations in linear regression models. Statistical Papers, 55, 1001-1018.
[12] Briggs, G.E. & Haldane, J.B.S. (1925). A note on the kinetics of enzyme action. Biochemical Journal, 19, 338-339.
[13] Bulmer, M. (2003). Francis Galton: Pioneer of Heredity and Biometry. Johns Hopkins University Press, Baltimore.
[14] Chakraborty, B., Bhattacharya, S., Basu, A., Bandyopadhyay, S. & Bhattacharjee, A. (2014). Goodness-of-fit testing for the Gompertz growth curve model. Metron, 72, 45-64.
[15] Chatterjee, S. & Hadi, A.S. (1986). Influential observations, high leverage points, and outliers in linear regression. Statistical Science, 1, 379-393.
[16] Chatterjee, S. & Hadi, A.S. (1988). Sensitivity Analysis in Regression. Wiley, New York.
[17] Chen, C-F. (1983). Score tests for regression models. Journal of the American Statistical Association, 78, 158-161.
[18] Chen, C-F. (1985). Robustness aspects of score tests for generalized linear and partially linear regression models. Technometrics, 27, 277-283.
[19] Cook, R.D. (1977). Detection of influential observation in linear regression. Technometrics, 19, 16-18.
[20] Cook, R.D. (1986). Assessment of local influence. Journal of the Royal Statistical Society. Series B, 48, 133-169.
[21] Cook, R.D. (1987). Parameter plots in nonlinear regression. Biometrika, 74, 669-677.
[22] Cook, R.D. (1998). Regression Graphics. Wiley, New York.
[23] Cook, R.D. & Weisberg, S. (1980). Characterizations of an empirical influence function for detecting influential cases in regression. Technometrics, 22, 495-508.
[24] Cook, R.D. & Weisberg, S. (1982). Residuals and Influence in Regression. Chapman & Hall, New York.
[25] Dette, H. & Kunert, J. (2014). Optimal designs for the Michaelis-Menten model with correlated observations. Statistics: A Journal of Theoretical and Applied Statistics, 48, 1254-1267.
[26] Ezekiel, M. (1924). A method for handling curvilinear correlation for any number of variables. Journal of the American Statistical Association, 19, 431-453.
[27] Galea, M., Paula, G.A. & Cysneiros, F.J.A. (2005). On diagnostics in symmetrical nonlinear models. Statistics and Probability Letters, 73, 459-467.
[28] Gallant, A.R. (1987). Nonlinear Statistical Models. Wiley, New York.
[29] Gut, A. (1995). An Intermediate Course in Probability. Springer, New York.
[30] Hadi, A.S. (1992). A new measure of overall potential influence in linear regression. Computational Statistics and Data Analysis, 14, 1-27.
[31] Hamilton, D. (1986). Confidence regions for parameter subsets in nonlinear regression. Biometrika, 73, 57-64.
[32] Hamilton, D. & Wiens, D. (1987). Correction factors for F ratios in nonlinear regression. Biometrika, 74, 423-425.
[33] Hampel, F.R. (1974). The influence curve and its role in robust estimation. Journal of the American Statistical Association, 69, 383-393.
[34] Hoaglin, D.C. & Welsch, R.E. (1978). The hat matrix in regression and ANOVA. The American Statistician, 32, 17-22.
[35] Huber, P.J. (1972). The 1972 Wald lecture. Robust statistics: A review. The Annals of Mathematical Statistics, 43, 1041-1067.
[36] Johnson, B.W. & McCulloch, R.E. (1987). Added-variable plots in linear regression. Technometrics, 29, 427-433.
[37] Kollo, T. & von Rosen, D. (2010). Advanced Multivariate Statistics with Matrices. Springer, Dordrecht.
[38] Lawrence, A.J. (1995). Deletion influence and masking in regression. Journal of the Royal Statistical Society. Series B (Methodological), 57, 181-189.
[39] Lee, A.H., Xiang, L. & Fung, W.K. (2004). Sensitivity of score tests for zero-inflation in count data. Statistics in Medicine, 23, 2757-2769.
[40] Lemonte, A.J. & Patriota, A.G. (2011). Influence diagnostics in Birnbaum-Saunders nonlinear regression models. Journal of Applied Statistics, 38, 871-884.
[41] Li, B. (2001). Sensitivity of Rao's score test, the Wald test and the likelihood ratio test to nuisance parameters. Journal of Statistical Planning and Inference, 97, 57-66.
[42] Lustbader, E.D. & Moolgavkar, S.H. (1985). A diagnostic statistic for the score test. Journal of the American Statistical Association, 80, 375-379.
[43] Markatou, M. & Manos, G. (1996). Robust tests in nonlinear regression models. Journal of Statistical Planning and Inference, 55, 205-217.
[44] Michaelis, L. & Menten, M.L. (1913). Die Kinetik der Invertinwirkung. Biochemische Zeitschrift, 49, 333-369.
[45] Mosteller, F. & Tukey, J.W. (1977). Data Analysis and Regression. Addison-Wesley, Reading.
[46] Neyman, J. & Pearson, E.S. (1928). On the use and interpretation of certain test criteria. Biometrika, 20A, 175-240, 263-294.
[47] Nocedal, J. & Wright, S.J. (2006). Numerical Optimization. Springer, New York.
[48] Nurunnabi, A.A.M., Hadi, A.S. & Imon, A.H.M.R. (2014). Procedures for the identification of multiple influential observations in linear regression. Journal of Applied Statistics, 41, 1315-1331.
[49] Park, H., Sakaori, F. & Konishi, S. (2014). Robust sparse regression and tuning parameter selection via the efficient bootstrap information criteria. Journal of Statistical Computation and Simulation, 84, 1596-1607.
[50] Pasaribu, U.S. (1999). Statistical assumptions underlying the fitting of the Michaelis-Menten equation. Journal of Applied Statistics, 26, 327-341.
[51] Peña, D. & Yohai, V.J. (1995). The detection of influential subsets in linear regression by using an influence matrix. Journal of the Royal Statistical Society. Series B, 57, 145-156.
[52] Poon, W. & Poon, Y.S. (2001). Conditional local influence in case-weights linear regression. British Journal of Mathematical and Statistical Psychology, 54, 177-191.
[53] Rao, C.R. (1948). Large sample tests of statistical hypotheses concerning several parameters with applications to problems of estimation. Mathematical Proceedings of the Cambridge Philosophical Society, 44, 50-57.
[54] Ritchie, R.J. & Prvan, T. (1996). Current statistical methods for estimating the Km and Vmax of Michaelis-Menten kinetics. Biochemical Education, 24, 196-206.
[55] Ross, W.H. (1987). The geometry of case deletion and the assessment of influence in nonlinear regression. The Canadian Journal of Statistics, 15, 91-103.
[56] Rousseeuw, P.J. & Leroy, A.M. (1987). Robust Regression and Outlier Detection. Wiley, New York.
[57] Seber, G.A.F. & Wild, C.J. (2003). Nonlinear Regression. Wiley, New Jersey.
[58] Sen, A. & Srivastava, M. (1990). Regression Analysis: Theory, Methods, and Applications. Springer, New York.
[59] Stanley, W. & Miller, M. (1979). Measuring technological change in jet fighter aircraft. Report No. R-2249-AF, Rand Corp., Santa Monica, CA.
[60] Stigler, S.M. (1986). The History of Statistics: The Measurement of Uncertainty Before 1900. Harvard University Press, Cambridge.
[61] Stigler, S.M. (1989). Francis Galton's account of the invention of correlation. Statistical Science, 4, 73-79.
[62] St. Laurent, R.T. & Cook, R.D. (1992). Leverage and superleverage in nonlinear regression. Journal of the American Statistical Association, 87, 985-990.
[63] St. Laurent, R.T. & Cook, R.D. (1993). Leverage, local influence and curvature in nonlinear regression. Biometrika, 80, 99-106.
[64] Vanegas, L.H. & Cysneiros, F.J.A. (2010). Assessment of diagnostic procedures in symmetrical nonlinear regression models. Computational Statistics and Data Analysis, 54, 1002-1016.
[65] Vanegas, L.H., Rondón, L.M. & Cysneiros, F.J.A. (2012). Diagnostic procedures in Birnbaum-Saunders nonlinear regression models. Computational Statistics and Data Analysis, 56, 1662-1680.
[66] Vanegas, L.H., Rondón, L.M. & Cysneiros, F.J.A. (2013). Assessing robustness of inference in symmetrical nonlinear regression models. Communications in Statistics - Theory and Methods, 42, 1692-1711.
[67] Wald, A. (1943). Tests of statistical hypotheses concerning several parameters when the number of observations is large. Transactions of the American Mathematical Society, 54, 426-482.
[68] Wang, D.Q. & Critchley, F. (2000). Multiple deletion measures and conditional influence in regression model. Communications in Statistics - Theory and Methods, 29, 2391-2404.
[69] Zwietering, M.H., Jongenburger, I., Rombouts, F.M. & van't Riet, K. (1990). Modeling of the bacterial growth curve. Applied and Environmental Microbiology, 56, 1875-1881.
[69] Zwietering, M.H., Jongenburger, I., Rombouts, F.M. & van’t Riet, K. (1990).Modeling of the bacterial growth curve. Applied and Environmental Microbiol-ogy, 56, 1875-1881.