IOWA STATE UNIVERSITY Department of Animal Science
Use of Proc GLM to Analyze Experimental Data
Animal Science 500
Lecture No.
October , 2010
PROC GLM
- The GLM procedure uses the method of least squares to fit general linear models.
- Among the statistical methods available in PROC GLM are:
  - regression,
  - analysis of variance,
  - analysis of covariance,
  - multivariate analysis of variance (MANOVA), and
  - partial correlation.
SAS/STAT(R) 9.22 User's Guide
PROC GLM
- PROC GLM analyzes data within the framework of general linear models.
- PROC GLM handles models relating one or several continuous dependent variables to one or several independent variables.
  - The independent variables can be either classification variables, which divide the observations into discrete groups, or continuous variables.
SAS/STAT(R) 9.22 User's Guide
PROC GLM
- Thus, the GLM procedure can be used for many different analyses, including the following:
  - simple regression
  - multiple regression
  - analysis of variance (ANOVA), especially for unbalanced data
  - analysis of covariance
  - response surface models
  - weighted regression
  - polynomial regression
  - partial correlation
  - multivariate analysis of variance (MANOVA)
  - repeated measures analysis of variance
SAS/STAT(R) 9.22 User's Guide
PROC GLM
- PROC GLM enables you to specify any degree of interaction (crossed effects) and nested effects.
  - It also provides for polynomial, continuous-by-class, and continuous-nesting-class effects.
- Through the concept of estimability, the GLM procedure can provide tests of hypotheses for the effects of a linear model regardless of the number of missing cells or the extent of confounding.
- PROC GLM displays the sum of squares (SS) associated with each hypothesis tested and, upon request, the form of the estimable functions employed in the test. PROC GLM can produce the general form of all estimable functions.
SAS/STAT(R) 9.22 User's Guide
PROC GLM
- The REPEATED statement enables you to specify effects in the model that represent repeated measurements on the same experimental unit for the same response, providing both univariate and multivariate tests of hypotheses.
- The RANDOM statement enables you to specify random effects in the model; expected mean squares are produced for each Type I, Type II, Type III, Type IV, and contrast mean square used in the analysis. Upon request, tests that use appropriate mean squares or linear combinations of mean squares as error terms are performed.
SAS/STAT(R) 9.22 User's Guide
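As a rough illustration of these two statements, the sketch below assumes two hypothetical data sets that are not part of the lecture: `wide`, with one row per animal, a treatment variable `trt`, and three repeated weights `wt1`, `wt2`, `wt3`; and `blocks`, with a response `gain`, a treatment `trt`, and a blocking factor `block` treated as random. All of these names are assumptions made for the example.

```sas
/* Hypothetical data: one row per animal, three repeated weights */
proc glm data=wide;
  class trt;
  model wt1 wt2 wt3 = trt / nouni;  /* NOUNI suppresses the separate univariate analyses */
  repeated time 3;                  /* "time" names the within-animal factor (3 levels)  */
run;
quit;

/* Hypothetical data: treatments applied within random blocks */
proc glm data=blocks;
  class trt block;
  model gain = trt block;
  random block / test;              /* TEST requests tests built from expected mean squares */
run;
quit;
```

Without the TEST option, the RANDOM statement only prints the expected mean squares; the F tests from the MODEL statement still treat every effect as fixed.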
PROC GLM
- The ESTIMATE statement enables you to specify an L vector for estimating a linear function of the parameters, Lβ.
- The CONTRAST statement enables you to specify a contrast vector or matrix for testing the hypothesis that Lβ = 0. When specified, the contrasts are also incorporated into analyses that use the MANOVA and REPEATED statements.
- The MANOVA statement enables you to specify both the hypothesis effects and the error effect to use for a multivariate analysis of variance.
SAS/STAT(R) 9.22 User's Guide
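A minimal sketch of the ESTIMATE and CONTRAST statements follows. It assumes a hypothetical data set `trial` with a response `gain` and a three-level CLASS variable `trt` whose sorted levels are A, B, and C; none of these names come from the lecture. Coefficients are matched to the levels in their sorted order.

```sas
/* Hypothetical data set "trial": response gain, treatment trt with levels A, B, C */
proc glm data=trial;
  class trt;
  model gain = trt;
  /* L vector (1 -1 0): estimates the difference A - B */
  estimate 'A minus B' trt 1 -1 0;
  /* Single-row contrast: tests A against the average of B and C */
  contrast 'A vs mean(B,C)' trt 2 -1 -1;
  /* Two-row contrast matrix (rows separated by a comma): tests A = B = C */
  contrast 'any trt difference' trt 1 -1 0,
                                trt 0 1 -1;
run;
quit;
```

ESTIMATE reports the estimate with its standard error and t test, while CONTRAST reports a sum of squares and an F test, which is why a multi-row hypothesis needs CONTRAST.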
PROC GLM
- PROC GLM can create an output data set containing the input data set in addition to predicted values, residuals, and other diagnostic measures.
- PROC GLM can be used interactively. After you specify and fit a model, you can execute a variety of statements without recomputing the model parameters or sums of squares.
SAS/STAT(R) 9.22 User's Guide
PROC GLM
- For analysis involving multiple dependent variables but not the MANOVA or REPEATED statements, a missing value in one dependent variable does not eliminate the observation from the analysis for other dependent variables. PROC GLM automatically groups together those variables that have the same pattern of missing values within the data set or within a BY group. This ensures that the analysis for each dependent variable brings into use all possible observations.
SAS/STAT(R) 9.22 User's Guide
Estimable Function
- SAS output often flags a result as "Non-est" (non-estimable).
- What does this mean?
Estimability
- Generalized inverses are used to obtain solutions for effects in general linear models.
  - There are many generalized inverses.
  - Many different sets of solutions are possible.
- Estimable functions are unique and do not depend on the generalized inverse used to obtain solutions.
- To analyze data properly, that is, to answer the hypothesis being tested, the scientist should know which functions of the parameters in the model are being estimated.
Estimability
- The hypothesis being tested is NOT about the absolute value for a level of a factor in the model.
- Usually we are asking or hypothesizing that two means are different, or that some treatment is different from a control.
- Hence the differences are the estimable functions, NOT the values (solutions) for the individual levels.
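One way to see this in practice is sketched below, again assuming a hypothetical data set `trial` with response `gain` and a CLASS variable `trt` with sorted levels A, B, and C (names invented for illustration). The E option on the MODEL statement asks PROC GLM to print the general form of the estimable functions.

```sas
/* Hypothetical data set "trial": response gain, treatment trt with levels A, B, C */
proc glm data=trial;
  class trt;
  model gain = trt / solution e;               /* E: general form of estimable functions  */
  estimate 'A alone'   trt 1 0 0;              /* one level's effect by itself:           */
                                               /* expected to be flagged non-estimable    */
  estimate 'A minus B' trt 1 -1 0;             /* difference between levels: estimable    */
  estimate 'mean of A' intercept 1 trt 1 0 0;  /* mu + effect of A: estimable             */
run;
quit;
```

The first ESTIMATE asks for a quantity that changes with the generalized inverse, so SAS should decline to report it; the other two are the kinds of functions the slides describe as estimable.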
The General Linear Model
- The main effects general linear model can be parameterized as
Y_ij = µ + α_i + b_j + ε_ij
where
Y_ij is the observation for the ith level of α and the jth level of b,
µ is the overall mean (unknown fixed parameter),
α_i is the effect of the ith level of α (its deviation from µ),
b_j is the effect of the jth level of b (its deviation from µ), and
ε_ij is the experimental error, distributed N(0, σ²).
The General Linear Model
- In matrix terminology, the general linear model may be expressed as
Y = Xβ + ε
where
Y is the observed data vector,
X is the design matrix,
β is a vector of unknown fixed effect parameters, and
ε is the vector of errors.
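To connect this notation back to the estimability slides, the following is a standard textbook sketch (it is not taken from the slides). Here b denotes any solution to the normal equations, the superscript minus sign denotes a generalized inverse, and K is an arbitrary matrix introduced only for this illustration.

```latex
\begin{aligned}
X^{\top}X\,b &= X^{\top}Y, \qquad b = (X^{\top}X)^{-}X^{\top}Y,\\
L\beta \text{ is estimable} &\iff L = KX \text{ for some matrix } K,\\
Lb &= K\,X(X^{\top}X)^{-}X^{\top}Y .
\end{aligned}
```

Because the matrix X(XᵀX)⁻Xᵀ is the same for every choice of generalized inverse, Lb is unique whenever Lβ is estimable, even though b itself is not.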
Programming the General Linear Model
- In the GLM procedure, an OUTPUT statement saves the input data plus the residuals, predicted values, and studentized residuals in a data set, here called resdat.
PROC GLM;
  class machine operator;
  model yield = machine|operator;   /* main effects plus their interaction */
  /* r= residuals, p= predicted values, student= studentized residuals,    */
  /* rstudent= deleted studentized residuals, cookd= Cook's D, h= leverage */
  output out=resdat r=resid p=pred
         student=stdres rstudent=rstud
         cookd=cksd h=lev;
run;
quit;
Assumptions of the general linear model
- E(ε) = 0
- var(ε) = σ²I
- var(Y) = σ²I
- E(Y) = Xβ
Assumptions of the Linear Regression Model
1. Linear functional form
2. Fixed independent variables
3. Independent observations
4. Representative sample and proper specification of the model (no omitted variables)
5. Normality of the residuals or errors
6. Equality of variance of the errors (homogeneity of residual variance)
7. No multicollinearity
8. No autocorrelation of the errors
9. No outlier distortion
Explanation of the Assumptions
1. Linear functional form
  - A linear model does not detect curvilinear relationships.
2. The observations are independent
  - Representative sample from some larger population.
  - If the observations are not independent, the resulting autocorrelation inflates the t, r, and F statistics, which in turn distorts the significance tests.
3. Normality of the residuals
  - Permits proper significance testing, as in ANOVA and other statistical procedures.
4. Equal variance (no heterogeneous variance)
  - Heteroskedasticity precludes generalization and external validity.
  - This too distorts the significance tests being used.
5. No multicollinearity (many of the traits exhibit collinearity)
  - Distorts parameter estimation (the estimates become unstable).
  - Can prevent the analysis from running or converging (getting your answers).
6. Severe or several outliers will distort the results and may bias them.
  - If outliers have high influence and the sample is not large enough, they may seriously bias the parameter estimates.
SAS test for residual normality
Proc univariate data=resdat normal plot;
var resid;
Run;
Quit;
Graphically examining residuals for homogeneity
Proc gplot data=resdat;
plot resid * pred;
Run;
Quit;
Examine the plot for a lack of pattern.
Testing for outliers
Proc freq data=resdat;
tables stdres cksd;
Run;
Quit;
1. Look for standardized residuals greater than 3.5 or less than -3.5.
2. Look for a high Cook's D (greater than 4*p/(n-p-1)).
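With many observations, a frequency table of every residual can be long. The sketch below flags only the observations that exceed the two cutoffs above. It assumes the `resdat` data set and the `stdres` and `cksd` variables created by the earlier OUTPUT statement; the macro variables `n` and `p` and their values are placeholders that must be set to the actual number of observations and model parameters.

```sas
%let n = 24;   /* placeholder: number of observations in the analysis    */
%let p = 4;    /* placeholder: number of parameters in the fitted model  */

data flagged;
  set resdat;
  cook_cut   = 4 * &p / (&n - &p - 1);   /* Cook's D cutoff from the slide */
  flag_resid = (abs(stdres) > 3.5);      /* 1 if |studentized residual| > 3.5 */
  flag_cook  = (cksd > cook_cut);        /* 1 if Cook's D exceeds the cutoff  */
run;

proc print data=flagged;
  where flag_resid = 1 or flag_cook = 1; /* list only the flagged observations */
run;
```

Observations with missing residuals are not flagged, because a comparison with a missing value evaluates to false in these expressions.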
Class Statement
- Variables included in the CLASS statement are referred to as class variables.
- The CLASS statement specifies the variables whose values define the subgroup combinations for the analysis.
  - These represent the various levels of some factors or effects:
    - Treatment (1, ..., n)
    - Season (spring, summer, fall, and winter coded 1 through 4)
    - Breed
    - Color
    - Sex
    - Line
    - Day
    - Laboratory
Evaluating outliers
1. Check coding to spot typos.
2. Correct typos.
3. If the outlying observation is correct:
  - Examine the DFFITS statistic to determine how much influence the outlier has on the fit statistics.
  - This shows the standardized influence of the observation on the fit.
  - If the influence of the outlier is severe, then consider removing it by making it a missing observation ( . ).
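In PROC GLM, DFFITS can be requested with the DFFITS= keyword of the OUTPUT statement. The sketch below reuses the machine and operator model from the earlier slide; the data set name `mydata` and the output names `infl`, `dfit`, and `abs_dfit` are placeholders chosen for this example.

```sas
proc glm data=mydata;                 /* mydata is a placeholder data set name */
  class machine operator;
  model yield = machine|operator;
  output out=infl r=resid p=pred dffits=dfit cookd=cksd;
run;
quit;

data infl;
  set infl;
  abs_dfit = abs(dfit);               /* size of the influence, ignoring sign */
run;

proc sort data=infl;
  by descending abs_dfit;             /* most influential observations first */
run;

proc print data=infl(obs=10);         /* inspect the ten largest values */
  var machine operator yield resid dfit cksd;
run;
```

Sorting by the absolute value keeps large negative and large positive influences together at the top of the listing.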
PROC GLM Syntax
PROC GLM <options> ;
CLASS variables </ option> ;
MODEL dependent-variables=independent-effects </ options> ;
Positional Requirements for PROC GLM Statements
Statement | Must Precede... | Must Follow...
ABSORB | First RUN statement |
BY | First RUN statement |
CLASS | MODEL statement |
CONTRAST | MANOVA, REPEATED, or RANDOM statement | MODEL statement
ESTIMATE | | MODEL statement
FREQ | First RUN statement |
ID | First RUN statement |
LSMEANS | | MODEL statement
MANOVA | | CONTRAST or MODEL statement
MEANS | | MODEL statement
MODEL | CONTRAST, ESTIMATE, LSMEANS, or MEANS statement | CLASS statement
OUTPUT | | MODEL statement
RANDOM | | CONTRAST or MODEL statement
REPEATED | | CONTRAST, MODEL, or TEST statement
TEST | MANOVA or REPEATED statement | MODEL statement
WEIGHT | First RUN statement |
Statements in the GLM Procedure
Statement   Description
ABSORB      Absorbs classification effects in a model
BY          Specifies variables to define subgroups for the analysis
CLASS       Declares classification variables
CONTRAST    Constructs and tests linear functions of the parameters
ESTIMATE    Estimates linear functions of the parameters
FREQ        Specifies a frequency variable
ID          Identifies observations on output
LSMEANS     Computes least squares (marginal) means
MANOVA      Performs a multivariate analysis of variance
MEANS       Computes and optionally compares arithmetic means
MODEL       Defines the model to be fit
OUTPUT      Requests an output data set containing diagnostics for each observation
RANDOM      Declares certain effects to be random and computes expected mean squares
REPEATED    Performs multivariate and univariate repeated measures analysis of variance
STORE       Requests that the procedure save the context and results of the statistical analysis into an item store
TEST        Constructs tests that use the sums of squares for effects and the error term you specify
WEIGHT      Specifies a variable for weighting observations
Class Variables
- Are usually things you would like to account for in your model
- Can be numeric or character
- Can be continuous values
- They are generally not used in regression analyses
  - What meaning would they have?
Class Statement Options
- Ascending: sorts the class variable in ascending order
- Descending: sorts the class variable in descending order
Other CLASS statement options generally depend on the procedure (PROC) being used, so they are not all covered here.
Discrete Variables
- A discrete variable is one that cannot take on all values within the limits of the variable.
  - Limited to whole numbers
  - For example, responses to a five-point rating scale can only take on the values 1, 2, 3, 4, and 5.
  - The variable cannot have the value 1.7. In contrast, a variable such as a person's height can take on any value.
Discrete variables also are of two types:
1. unorderable (also called nominal variables)
2. orderable (also called ordinal variables)
Discrete Variables
- Data are sometimes called categorical because the observations may fall into one of a number of categories, for example:
  - Any trait where you score the value
    - Lameness scores
    - Body condition scores
    - Soundness scoring
      - Reproductive
      - Feet and leg
    - Behavioral traits
      - Fear test
      - Back test
      - Vocal scores
    - Body lesion scores
Discrete Variables
- When do discrete variables become continuous, or do they?
- Is a trait like number born alive considered discrete or continuous?
Example Variables
Data:
The dependent variable (what is being measured) is aerial biomass, and there are five substrate measurements (these are the independent variables):
1. Salinity,
2. Acidity,
3. Potassium,
4. Sodium, and
5. Zinc.
Covariates
- A covariate is an independent variable that contributes variation to the dependent variable of interest.
- The researcher wants to account for the covariate differences that occur for each observation.
- A covariate may be of direct interest, or it may be a confounding or interacting type of variable.
Covariates
- Examples
  - Weight of animal at measurement
  - Age of animal at measurement
  - Age of animal at weaning
  - Parity of sow for number born alive and weaning weight
  - Days of lactation for milk weight
Covariates
- A covariate may influence the dependent variable in the following ways:
  - Linear covariate
  - Quadratic covariate
  - Cubic covariate
Covariates
- Check to be sure your covariate is significant.
- If the linear term is significant, test the quadratic.
- If the linear and quadratic terms are significant sources of variation, test the cubic.
- How do you do that?
Covariates
- How do you do that?
  - Linear: include the variable name in the model, but do not list it in the CLASS statement.
    - Example: weight
  - Quadratic: include the variable name as weight*weight
  - Cubic: include the variable name as weight*weight*weight
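A minimal sketch of this sequence of tests is given below. It assumes a hypothetical data set `pigs` with a response `backfat`, a CLASS variable `sex`, and the continuous covariate `weight`; only the `weight` terms follow the slide, the other names are invented. Requesting Type I (sequential) sums of squares with SS1 tests each polynomial term after the terms listed before it.

```sas
/* Hypothetical data set "pigs": backfat measured at varying live weights */
proc glm data=pigs;
  class sex;                          /* weight is NOT listed here, so it is a covariate */
  model backfat = sex
                  weight                   /* linear    */
                  weight*weight            /* quadratic */
                  weight*weight*weight     /* cubic     */
                  / ss1 solution;
run;
quit;
```

If the cubic line of the Type I table is not significant, the term can be dropped and the model refit with the linear and quadratic terms only.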
Covariates
- A covariate may influence the dependent variable in the following ways:
  - Linear covariate
    - The independent covariate affects the dependent variable in a linear manner.
  - Quadratic covariate
    - The independent covariate affects the dependent variable in a linear and quadratic manner.
    - Indicates there is one turning point (and only one).
  - Cubic covariate
    - The independent covariate affects the dependent variable in a linear and cubic manner.
    - Indicates there can be two turning points.
Covariates
- A covariate may influence the dependent variable in the following ways:
  - Linear covariate
    - The independent covariate affects the dependent variable in a linear manner.
      - The dependent variable increases or decreases at a constant rate.
Covariates
- A covariate may influence the dependent variable in the following ways:
  - Quadratic covariate
    - The independent covariate affects the dependent variable in a linear and quadratic manner.
    - Indicates there is one turning point (and only one).
      - The dependent variable increases (or decreases) to some point and then either increases at an increasing rate (or decreases at an increasing rate) or increases at a decreasing rate (or decreases at a decreasing rate).
      - Or there could be a directional change: up to some point the dependent variable increases, and after that point the response decreases, or vice versa.
Covariates
  - Cubic covariate
    - The independent covariate affects the dependent variable in a linear and cubic manner.
    - Indicates there can be two turning points.
      - Essentially the same as the quadratic, but the changes can occur at an additional point.
      - The dependent variable increases (or decreases) to some point and then either increases at an increasing rate (or decreases at an increasing rate) or increases at a decreasing rate (or decreases at a decreasing rate).
      - Or there could be a directional change: up to some point the dependent variable increases, and after another point the response decreases, or vice versa.
Model Development and Selection of Variables
Example:
The general problem addressed is to identify important soil characteristics influencing aerial biomass production of marsh grass, Spartina alterniflora.
Example Data Origination (Dr. P. J. Berger)
Data: The data were published as an exercise by Rawlings (1988) and originally appeared as a study by Dr. Rick Linthurst, North Carolina State University (1979). The purpose of his research was to identify the important soil characteristics influencing aerial biomass production of the marsh grass, Spartina alterniflora in the Cape Fear Estuary of North Carolina. The design for collecting data was such that there were three types of Spartina vegetation, in each of three locations, and five random sites within each location vegetation type.
Example Data
- Objective:
- Find the substrate variable, or combination of variables, showing the strongest relationship to biomass.
Or,
- From the list of five independent variables of salinity, acidity, potassium, sodium, and zinc, find the combination of one or more variables that has the strongest relationship with aerial biomass.
- Find the independent variables that can be used to predict aerial biomass.
Example Data
- CLASS vegetative_type location site;
  - Recall 3 vegetative types evaluated
  - Recall 3 locations where tests occurred
  - Recall 5 sites within each location
- MODEL biomass = vegetative_type location site(location) vegetative_type*location salinity acidity potassium sodium zinc;
Example Data
- MODEL biomass = vegetative_type location site(location) vegetative_type*location salinity acidity potassium sodium zinc;
- Assuming each linear effect was significant, the higher-order terms would then need to be examined:
  - salinity*salinity
  - salinity*salinity*salinity
  - acidity*acidity
  - acidity*acidity*acidity
  - etc.
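Put together as a program, the model on this slide might look like the sketch below. The data set name `marsh` is an assumption, as is the choice to add quadratic terms for salinity and acidity only; the variable names follow the slide.

```sas
/* "marsh" is a placeholder name for the Spartina biomass data set */
proc glm data=marsh;
  class vegetative_type location site;
  model biomass = vegetative_type location site(location)
                  vegetative_type*location
                  salinity salinity*salinity   /* linear and quadratic salinity */
                  acidity  acidity*acidity     /* linear and quadratic acidity  */
                  potassium sodium zinc
                  / ss1 ss3;
run;
quit;
```

Each covariate would normally be extended one order at a time, and only after its lower-order term has been shown to be significant.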
PROC GLM Example
- Example: Strawberry yield is modeled as a function of strawberry variety, type of fertilizer, and their interaction.
PROC GLM DATA=berry;
CLASS fertiliz variety;
MODEL yield=fertiliz variety fertiliz*variety / SOLUTION;
LSMEANS fertiliz variety;
Run;
Quit;
- The SOLUTION option is useful for showing the relative effect sizes.
PROC GLM Example Output
General Linear Models Procedure
Class Level Information
FERTILIZ 2 K N
VARIETY 2 Red Sweet
Number of observations in data set = 24

This section lets us verify that we have two fertilizers and two varieties of interest, and that there are 24 observations in the data. Information about missing observations is also printed here, if applicable.
PROC GLM Example Output
Dependent Variable: YIELD

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              3   0.87166667       0.29055556    2.59      0.0816
Error             20   2.24666667       0.11233333
Corrected Total   23   3.11833333

R-Square   C.V.       Root MSE    YIELD Mean
0.279530   3.790707   0.3351617   8.8416667

This section shows the ANOVA table, with degrees of freedom (DF), sums of squares, and an F value which tests whether any of the terms in the model are significant. The C.V. (coefficient of variation) is (root MSE/mean yield)(100%). R-Square is the model sum of squares divided by total sum of squares. This is commonly used to evaluate how well the model fits the data, but it should not be the only criterion of fit that you examine.
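As a quick check of the definitions in this paragraph, the summary statistics can be recomputed from the numbers printed in the ANOVA table:

```latex
\begin{aligned}
F &= \frac{MS_{\text{Model}}}{MS_{\text{Error}}} = \frac{0.29055556}{0.11233333} \approx 2.59,\\
R^2 &= \frac{SS_{\text{Model}}}{SS_{\text{Total}}} = \frac{0.87166667}{3.11833333} \approx 0.2795,\\
\text{Root MSE} &= \sqrt{MS_{\text{Error}}} = \sqrt{0.11233333} \approx 0.3352,\\
\text{C.V.} &= 100\% \times \frac{\text{Root MSE}}{\text{YIELD Mean}} = 100\% \times \frac{0.3351617}{8.8416667} \approx 3.79\%.
\end{aligned}
```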
PROC GLM Example Output
Source     DF   Type I SS     Mean Square   F Value   Pr > F
FERTILIZ    1   0.37500000    0.37500000    3.34      0.0826
VARIETY     1   0.48166667    0.48166667    4.29      0.0515
FERT*VAR    1   0.01500000    0.01500000    0.13      0.7186

Source     DF   Type III SS   Mean Square   F Value   Pr > F
FERTILIZ    1   0.37500000    0.37500000    3.34      0.0826
VARIETY     1   0.48166667    0.48166667    4.29      0.0515
FERT*VAR    1   0.01500000    0.01500000    0.13      0.7186

SAS presents Type I and Type III sums of squares and F statistics for their significance under a particular set of assumptions; namely, that fertilizer and variety should be modeled with fixed effects, and that the random error terms satisfy their requirements.
The F test statistics shown here are not always the proper results to interpret! This depends on the design of the experiment.
PROC GLM Example Output
- The Type I sums of squares are also called sequential sums of squares. Here, they test:
  1. Whether fertilizer is a significant predictor
  2. Whether variety is significant when considered in addition to fertilizer
  3. Whether the interaction is significant when considered in addition to both fertilizer and variety.
- The Type III sums of squares are also called partial sums of squares. Here, they test:
  1. Assuming that the combinations of fertilizers and varieties are different from each other, do they show consistent trends for fertilizers to be different from each other?
  2. Assuming that the combinations of fertilizers and varieties are different from each other, do they show consistent trends for varieties to be different from each other?
  3. Knowing that fertilizers and varieties could be different from each other, is the difference between fertilizers the same for both varieties?
PROC GLM Example Output
- Because the experiment is balanced, the Type I and Type III sums of squares are identical.
- Usually, the Type III sums of squares are used for inference, although the Type I sums of squares are used in specific situations.
- SAS can calculate Type II and Type IV sums of squares as well.
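The additional types are requested with options on the MODEL statement. A minimal sketch for the strawberry example, using the `berry` data set and variable names from the earlier slide:

```sas
proc glm data=berry;
  class fertiliz variety;
  /* SS1-SS4 request Type I, II, III, and IV sums of squares, respectively */
  model yield = fertiliz variety fertiliz*variety / ss1 ss2 ss3 ss4;
run;
quit;
```

For balanced data with no empty cells, such as this example, all four tables should agree; the types only diverge for unbalanced data or missing cells.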
PROC GLM Example Output
- SOLUTION option used on the MODEL statement (i.e., / solution;)

Parameter             Estimate   T for H0: Parameter=0   Prob > |T|   Std. Error of Estimate
INTERCEPT              9.13 B    66.75                   0.001        0.137
FERTILIZ   K          -0.30 B    -1.55                   0.137        0.194
           N           0.00 B    .                       .            .
Variety    Red        -0.33 B
           Sweet       0.00 B    .                       .            .
Fert x Var K Red       0.10 B    0.37                    0.719        0.274
           K Sweet     0.00 B    .                       .            .
           N Red       0.00 B    .                       .            .
           N Sweet     0.00 B    .                       .            .
PROC GLM Example Output
- There are many ways to estimate effects in a linear model with categorical predictors (fixed effects).
- SAS chooses to do so by alphabetizing the levels of each factor, then assigning an effect size of zero to the last alphabetically ordered level of each factor and its interactions.
- To predict the response for, say, Fertilizer K for the Red variety, use the equation (Intercept) + (K effect) + (Red effect) + (K*Red interaction effect), or 9.13 - 0.30 - 0.33 + 0.10 = 8.60.
- The t-test values listed on the right can be used to test whether certain parameters are significantly different from zero.
  - In this case, they compare the levels of each factor to the last alphabetically ordered level (which is forced to be zero).
- The SOLUTION option is useful for determining how treatment effects can be contrasted or estimated within PROC GLM.
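The hand calculation above can also be requested from PROC GLM directly. The sketch below uses the `berry` data set and variables from the example; the ESTIMATE labels are arbitrary. Interaction coefficients follow the order K Red, K Sweet, N Red, N Sweet, which is an assumption based on the alphabetical level ordering described above.

```sas
proc glm data=berry;
  class fertiliz variety;
  model yield = fertiliz variety fertiliz*variety / solution;
  /* Cell mean for fertilizer K, variety Red:                */
  /* intercept + K effect + Red effect + K*Red interaction   */
  estimate 'K Red cell mean' intercept 1
                             fertiliz 1 0
                             variety 1 0
                             fertiliz*variety 1 0 0 0;
  /* K minus N, averaged over the two varieties */
  estimate 'K vs N' fertiliz 1 -1
                    fertiliz*variety 0.5 0.5 -0.5 -0.5;
  lsmeans fertiliz*variety;   /* all four cell means, for comparison */
run;
quit;
```

The first estimate should be close to the 8.60 obtained by hand; small differences would come from rounding in the printed solutions.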
PROC GLM Example: Examining the Error Values
- An analysis of a general linear model should include a check of the assumptions about the random error terms.
- To do this in PROC GLM, you must use an OUTPUT statement.
- The following statements show how to produce a residual plot for the model above.