Page 1: comment upon such capabilities more than - sasCommunity Imrey.pdf · This paper will review the capabilities of ... classification of several dimensions of 1006 ... of log-linear

SAS SOFTWARE FOR LOG-LINEAR MODELS

Peter B. Imrey, University of Illinois

1. Introduction

This paper reviews the capabilities of current SAS* software for log-linear model analyses of counted data and discusses extensively, using a variety of examples, the enhancements to these capabilities incorporated in the SAS Version 5.0 procedure CATMOD, which replaces FUNCAT from Version 4 releases. The relationships among the various procedures available from SAS Institute and, to some extent, those from other vendors, are remarked upon. Capabilities of PROC CATMOD and aspects of its syntax that seem important, but are not highlighted in the forthcoming documentation, receive attention here, as do limitations of the new software. Although the software mentioned here provides capabilities far more general than log-linear model analysis, no attempt is made to comment upon such capabilities more than peripherally or for contextual purposes, nor to provide a comprehensive treatise on categorical data analysis using SAS software. It is hoped that this paper will inform the reader wishing to apply log-linear model analytic techniques to SAS data sets using SAS software, and that portions of it will serve as useful adjuncts to SAS Institute documentation.

2. General Framework for Log-linear Modeling

It is assumed that available data may be structured as a contingency table of counts and underlying probabilities, as in Table 1. Rows represent different physical or conceptual populations from which sampling is reasonably modeled as product-multinomial, implying response independence for different subjects sampled from the same population and independent sampling of subjects from different populations. Columns represent categories into exactly one of which each response must be classified. The population by response tabulation has s populations and r responses, with $n_{ij}$ representing the observed count of response j in population i, and $\pi_{ij}$ symbolizing the probability of a random individual from population i exhibiting response category j.

The s populations may relate to one another through structure imposed by the nature and levels of defining variables. For instance, they may correspond to a factorial cross-classification of levels of several experimental factors, possibly all of substantive interest, or some of which may represent strata defined by other variables to be controlled for in the analysis. Populations may correspond to ordered or scaled levels of experimental factors, such as doses of a test compound in a bioassay. Similarly, the r response categories may correspond to a cross-classification of several dimensions of response, on which sampled units are jointly observed. These response variables may be nominal, ordinal, or scaled, or the set of variables may contain some of each type. When multiple dimensions of the variables defining the r response categories correspond to observation of the same response under several conditions or at different times, the term "repeated measurement" is used to describe the array of responses. When a repeated measurement response array exists in each of several populations differentiated by levels of one or more experimental or observational factors, the s × r table is called a "split-plot" categorical data set. These terms are entirely consistent statistically with their counterparts in the terminology of classical analysis-of-variance of continuous measurement data.

Categorical data analysis software varies in the degree of flexibility in handling the various types of structure that may exist in the s populations and/or the r responses of the underlying table. Programs which are able to incorporate scaling in the population structure may provide less opportunity for employing scaling of response categories efficiently in an analysis. Programs which easily handle factorially-structured response categories involving several conceptually different variables may not be equipped to address the specific scientific and, hence, analytic issues that arise when the cross-classified response dimensions represent different levels of underlying split-plot factors.

Table 2 represents an example of a data array incorporating several types of structure. Six populations, corresponding to a 2 x 3 cross-classification of Age range by Town, are represented. Town is a nominal variable, while Age is ordinal and subject to reasonable scaling in analysis. The response is a sixteen category (r = 16) indicator structured as a $2^4$ cross-classification corresponding to the presence of any root caries in each of four quadrants of the mouth of each subject under study. Thus, this is a split-plot categorical data array. The comparison of quadrants within the mouth is of interest, especially as to whether frequency of root caries varies between mandibular and maxillary tooth surfaces. Of primary interest is any community effect, which might be attributable to the substantial difference in natural water fluoride level between towns. While this example is one which would not usually be addressed by log-linear model analysis, we will return to it in that context subsequently.

For purposes of this paper, a mathematical framework for log-linear modeling is now presented. A log-linear model is defined as a structural equation

$$\boldsymbol{\pi} = \mathbf{D}^{-1} \exp(\mathbf{X}\boldsymbol{\beta}),$$

where $\boldsymbol{\pi}$ is a vector of probabilities (such as all those in Table 1); $\boldsymbol{\beta}$ is a vector of unknown model parameters; $\mathbf{X}$ is a known "model matrix" expressing the model subspace through a selected coordinate system (parametrization); $\exp$ is the element-wise exponentiation operator, mapping each element $x_{ij}$ to $e^{x_{ij}}$; and $\mathbf{D}^{-1}$ is a diagonal normalizing matrix incorporating restrictions on sums of probabilities, with the diagonal of $\mathbf{D}$ specified by $\mathbf{B}\exp(\mathbf{X}\boldsymbol{\beta})$ for a matrix $\mathbf{B}$ containing zeroes and ones.

Three special cases of this model have been found especially useful:

Classical set-up: $\boldsymbol{\pi}$ is a "strung-out" vector of probabilities internal to a multi-way contingency table with s populations and r response categories, such as Table 1. $\mathbf{B}$ is the block diagonal matrix $\mathbf{1}_r\mathbf{1}_r' \otimes \mathbf{I}_s$. This formulation incorporates all of what are conventionally termed log-linear models in the current literature, including the factorial models of Bishop, Fienberg and Holland (1975) and the ordinal models discussed by Goodman (1983) and Agresti (1984).

Proportional (cumulative) odds: $\boldsymbol{\pi}$ is a vector of 2(r - 1) "stairstep" cumulative probabilities and their complements, from each of s populations. $\mathbf{B} = \mathbf{1}_2\mathbf{1}_2' \otimes \mathbf{I}_{s(r-1)}$.

Split-plot analysis: $\boldsymbol{\pi}$ is a vector of k marginal distributions of a repeated response, repetitions of which constitute dimensions of a multiway contingency table, within each of s populations. $\mathbf{B} = \mathbf{1}_r\mathbf{1}_r' \otimes \mathbf{I}_{sk}$.
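As a minimal numerical sketch of the classical set-up, the following Python fragment evaluates the normalized model for a small hypothetical case. The model matrix `X`, the parameter values, and the function name `loglinear_probs` are illustrative assumptions, not taken from the paper. Cells are assumed to be ordered population by population, so that under NumPy's Kronecker convention the block diagonal matrix of zeroes and ones is `np.kron(np.eye(s), np.ones((r, r)))`.

```python
import numpy as np

def loglinear_probs(X, beta, s, r):
    """Return pi = D^{-1} exp(X beta) for s populations with r responses each."""
    m = np.exp(X @ beta)                      # element-wise exponentiation
    B = np.kron(np.eye(s), np.ones((r, r)))   # zeroes and ones, block diagonal
    d = B @ m                                 # within-population total, repeated per cell
    return m / d                              # equivalent to inv(diag(d)) @ m

# Hypothetical example: s = 2 populations, r = 3 responses, with a separate
# pair of log-odds parameters (versus the last response) in each population.
s, r = 2, 3
X = np.kron(np.eye(s), np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]))
beta = np.array([0.5, -0.2, 1.0, 0.3])        # assumed values, for illustration only
pi = loglinear_probs(X, beta, s, r)
print(pi.reshape(s, r))                       # each row sums to one
```

Dividing by the within-population totals is what the diagonal normalizing matrix accomplishes; the log ratio of any two probabilities in the same population is then linear in the parameters.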

3. Current SAS Software for Log-linear Modeling

Prior to the release of Version 5.0, SAS software for log-linear modeling consisted of the SAS/BASE product PROC FUNCAT [Sall (1982)], the author-supported supplemental library PROC LOGIST [Harrell (1983)], and the PROC MATRIX macro CATMAX (Stokes and Koch) described in the 1982 volume of the Proceedings. This listing excludes user-generated programs which may have been locally produced using the IPF facility of PROC MATRIX. Each of the three major facilities for log-linear modeling has had substantial limitations from the perspective of the devotee of such analyses.

PROC FUNCAT, a general functional modeling program for grouped data, incorporates as one branch of its capabilities a limited capacity for log-linear model fitting. In particular, PROC FUNCAT provides weighted (generalized) least-squares (WLS/GLS) or maximum likelihood (ML) analyses of "multiple logit" models for which the same across-population model matrix applies, separately, to each $\ln(\pi_j/\pi_r)$ for all j. Thus, the design is in essence nested in the levels specified by j of the log (conditional odds of j vs. r) response function $\ln(\pi_j/\pi_r)$. Each $\ln(\pi_j/\pi_r)$ is fitted with its own set of parameters. No modeling of the conditional odds as they depend on the value of j is specifiable. The permissible models form a subset of classical hierarchical log-linear models, the nature of which has been clarified in some detail by Bishop (1969). Use of FUNCAT in this capacity has been somewhat hampered by two problems. First, the notation of the mathematical documentation is at variance with the literature, employing conventional symbols to represent different matrices than do the research articles underlying the analyses. Second, errors in the descriptive documentation confuse the design matrix for all responses with that for a single conditional odds, thus suggesting that the program has wider capabilities than it actually possesses.

Beyond the analyses just described, however, FUNCAT is capable of providing WLS fits and test statistics for other log-linear models for which the same model matrix applies to a subset of the collection of $\ln(\pi_j/\pi_r)$, and to proportional odds analogues of all such analyses. Such latter models would involve, as responses, some subset of the functions

$$\ln\Big(\sum_{\ell \ge j} \pi_\ell \Big/ \sum_{\ell < j} \pi_\ell\Big)$$

for a selected set of cutpoints j, modeled in parallel in terms of the population structure. Whereas for the classical log-linear model subset described earlier, FUNCAT makes available ML fitted parameters and cell probabilities along with a likelihood ratio test of fit [obtained via a Newton-Raphson (iterative WLS) computing algorithm with product-multinomial likelihood assumption], for the models described in this paragraph only WLS solutions are available.
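The cumulative log-odds response functions can be illustrated with a short Python sketch. The function name `cumulative_logits` and the probability vector are hypothetical; the orientation (upper categories in the numerator) is an assumption chosen to match the expression above.

```python
import numpy as np

def cumulative_logits(p):
    """Log odds of falling at or above each cutpoint versus below it."""
    p = np.asarray(p, dtype=float)
    upper = np.cumsum(p[::-1])[::-1]   # upper[j] = sum of p[l] for l >= j
    lower = np.cumsum(p)               # lower[j] = sum of p[l] for l <= j
    # Cutpoints j = 1, ..., r-1 (0-based): categories l >= j versus l < j
    return np.log(upper[1:] / lower[:-1])

p = [0.2, 0.3, 0.5]                    # hypothetical response probabilities
print(cumulative_logits(p))
```

An r category response yields r - 1 such functions per population, one for each cutpoint.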

PROC LOGIST provides, for grouped or ungrouped data: i) multiple logistic regression analyses for binary data; and ii) proportional odds analyses for ordered polytomies. In contrast to the models specifiable using PROC FUNCAT, the log cumulative odds from models for polytomies generated by PROC LOGIST share the same parameter sets. The model matrix consists of blocks, each specifying the model for the log cumulative odds relative to a single cutpoint, stacked vertically to model several response functions, rather than diagonally as in FUNCAT. For each type of model, PROC LOGIST generates ML estimates of parameters and an omnibus likelihood ratio test of significance for the model as a whole. Stepwise model selection is available. Significance of model parameters is evaluated using Wald tests comparing each to its standard error, estimated on the assumption of model validity. For automated selection of variables for entry into a model, efficient score statistics are used, evaluated at the current model fitted values. Like FUNCAT, LOGIST uses a Newton-Raphson algorithm to obtain its ML fits.

For dependent dichotomies to be related to multiple explanatory variables, LOGIST and FUNCAT are both powerful tools, with LOGIST possessing greater flexibility in terms of its stepwise fitting capabilities and its orientation to ungrouped data. For dependent polytomies, LOGIST fits proportional odds models only, having no capability for classical log-linear modeling. Thus, LOGIST in no way attempts to be a general tool for the fitting of log-linear models to complex categorical data arrays.

However, such a general tool does exist within present SAS software, that tool being the PROC MATRIX macro CATMAX. CATMAX fits any classical log-linear model to any suitable grouped categorical array. Its scope does not include proportional odds or repeated measurement analyses, but it does produce both WLS and ML solutions for classical models (using Newton-Raphson, like the other programs, to obtain the ML fit). CATMAX provides the omnibus likelihood-ratio test of model significance, and Wald tests of individual parameters and arbitrary sets of linear functions of them. A major and, for many uses and users, critical debility of CATMAX is its lack of a user-friendly front end. The data analyst must construct and input the appropriate model matrix, and any matrices specifying linear functions to be tested, using PROC MATRIX manipulations outside of CATMAX, and then arrange to move these matrices into the CATMAX code stream. Alternatively, matrices can be generated and entered by hand. This inconvenience has made CATMAX, despite its generality, a tool for experts rather than a general purpose facility.

Thus, while it is clear that the three SAS programs available are all useful for the fitting of log-linear models, the most convenient and accessible program for fitting hierarchical log-linear models has, for many if not most SAS users, been PROC BMDP. While BMDP4F was not listed with SAS software above, its range of available classical models, extensive model selection diagnostics and automated sequencing aids, and ease of model specification have made it superior, as a general purpose log-linear fitting tool, to the available software internal to SAS [Brown (1981)]. More generally, when SAS software is compared to its primary competitors in this area, BMDP4F and SPSSX LOGLINEAR, the following summary conclusions are apparent: i) FUNCAT is much more general for categorical data modeling on the whole, but rather less general and less automated for log-linear modeling specifically, than either major competitor; ii) LOGIST is a logistic regression and proportional cumulative odds program that does not attempt to be, and is not fairly evaluable as, a general log-linear modeling program; and iii) CATMAX is a more general program for grouped data than the alternatives, but does not support the user with a front end for simple model specification.


Version 5.0 of SAS/BASE incorporates PROC CATMOD, a replacement for and enhancement of PROC FUNCAT that serves as a general categorical data utility for grouped or ungrouped data, with full log-linear modeling capacity. Along with the power of FUNCAT, CATMOD incorporates the capabilities of CATMAX and many, but not all, of the functions of LOGIST. CATMOD lets the user fit hierarchical and non-hierarchical log-linear models, multiple logistic models, quasi-independence models, quasi-symmetry models, a variety of models for ordinal data, proportional odds models, and log-linear models for marginal probabilities generated by split-plot categorical data arrays. It incorporates a powerful, parsimonious syntax for easy specification of all hierarchical and many non-hierarchical models. CATMOD allows the user to select among different parametrizations which may be appropriate for a log-linear model with one or more equivalent specifications as a multiple logistic model. Within CATMOD, WLS or ML solutions or both may be generated for the classical log-linear model set-up, as selected by the user. Like the previous SAS programs, CATMOD uses a Newton-Raphson approach to obtain ML solutions. For proportional odds and split-plot or repeated measurement models, only WLS solutions are available. The log-linear modeling capabilities of CATMOD are fairly cleanly meshed with the general WLS functional modeling approach of Grizzle, Starmer and Koch (1969) and colleagues, forming within one program a tool for categorical analysis with superior flexibility and user-friendliness. Nevertheless, CATMOD and its documentation do have limitations, and the user attempting to learn this program may encounter these. In the hope of smoothing the way for some, a series of illustrative examples is provided below.
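To make the Newton-Raphson idea concrete, here is a minimal Python sketch of an ML fit of a log-linear model to counts. It is not CATMOD's implementation: it assumes the equivalent Poisson form with an intercept column, a least-squares starting value, and a fixed number of iterations, and the function name `fit_loglinear_ml`, the deviation coding, and the counts are all hypothetical.

```python
import numpy as np

def fit_loglinear_ml(X, n, iterations=25):
    """Newton-Raphson ML fit of log(m) = X @ beta, treating counts as Poisson."""
    n = np.asarray(n, dtype=float)
    # Starting value: least-squares fit to the (slightly smoothed) log counts
    beta = np.linalg.lstsq(X, np.log(n + 0.5), rcond=None)[0]
    for _ in range(iterations):
        m = np.exp(X @ beta)                 # current fitted counts
        score = X.T @ (n - m)                # gradient of the log likelihood
        info = X.T @ (m[:, None] * X)        # information matrix X' diag(m) X
        beta = beta + np.linalg.solve(info, score)
    return beta, np.exp(X @ beta)

# Hypothetical 3 x 2 table, main effects (independence) model, deviation coding
A = np.array([1.0, -1.0])                              # two-level column variable
T = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])   # three-level row variable
X = np.array([[1.0, A[a], *T[t]] for t in range(3) for a in range(2)])
n = np.array([20, 30, 25, 25, 15, 35])                 # assumed counts
beta, fitted = fit_loglinear_ml(X, n)
print(fitted.reshape(3, 2))
```

For this independence model the fitted counts can be checked against the closed form (row total times column total over grand total), which is a convenient sanity test of any iterative fit.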

4. Examples

4.1. A 3 x 2 table from a clinical trial of respiratory therapies

Initially, a statistically trivial example is used to illustrate aspects of the relation of CATMOD to FUNCAT, and the difference in how CATMOD views the same model in its log-linear vs. its logistic formulation. Table 3 indicates presence or absence of atelectasis, a common complication of surgery, in patients 24 hours post-abdominal surgery who each received one of three respiratory therapies designed to promote rapid recovery of lung function: cough and deep breathing exercises (CDB), continuous positive airway pressure (CPAP), or incentive spirometry (IS). Are atelectasis at 24 hours and mode of therapy significantly associated? Exclusive of data entry and title statements, which will generally be omitted from presentations of code that follow, the CATMOD syntax for a full logit analysis is:

    PROC CATMOD;
    MODEL ATELECT=TRTMNT/ONEWAY FREQ PROB XPX COV COVB CORRB ML PRED=FREQ;
    CONTRAST 'CDB VS. CPAP' TRTMNT 1 -1;
    CONTRAST 'CDB VS. IS' TRTMNT 2 1;
    CONTRAST 'CPAP VS. IS' TRTMNT 1 2;


With the exception of the procedure name, these commands are identical to those of PROC FUNCAT. CATMOD interprets the data as a 3 population, 2 response contingency table and generates, for each population by default, the logit of absence of atelectasis, along with the full rank model matrix corresponding to the deviation parameterization of each therapy's variation from the mean of the three logits. The response logits and the model matrix are shown in Table 3. The CONTRAST statements specify pairwise comparisons of treatments by Wald chi-square statistics, in terms of the deviation parameters in the model matrix which represent, respectively, CDB and CPAP vs. all treatments. As the model is saturated, CATMOD produces identical WLS and ML estimates which perfectly fit the data. A Wald statistic QW = 1.41, D.F. = 2, P = .49, is reported for the TRTMNT effect.
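The CONTRAST coefficients follow from the deviation parameterization, in which the IS deviation is the negative sum of the CDB and CPAP deviations. The Python sketch below checks this arithmetic; the logit values are hypothetical and are not those of Table 3.

```python
import numpy as np

# Deviation ("sum-to-zero") model matrix for three logits: CDB, CPAP, IS.
# Columns: mean, CDB deviation, CPAP deviation; IS deviation = -(CDB + CPAP).
X = np.array([[1.0,  1.0,  0.0],
              [1.0,  0.0,  1.0],
              [1.0, -1.0, -1.0]])

logits = np.array([1.2, 0.9, 0.7])     # hypothetical values, for illustration
mean, a_cdb, a_cpap = np.linalg.solve(X, logits)

# CONTRAST coefficients applied to the two TRTMNT deviation parameters
contrasts = {"CDB VS. CPAP": (1, -1), "CDB VS. IS": (2, 1), "CPAP VS. IS": (1, 2)}
diffs = {name: c1 * a_cdb + c2 * a_cpap for name, (c1, c2) in contrasts.items()}
for name, value in diffs.items():
    print(name, round(value, 4))       # each equals a difference of two logits
```

For example, CDB minus IS is a_cdb - (-a_cdb - a_cpap) = 2 a_cdb + a_cpap, which is why the second statement uses coefficients 2 and 1.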

CATMOD may alternatively fit this model in its equivalent log-linear form. Substitute code for this approach follows.

    MODEL TRTMNT*ATELECT=_RESPONSE_ /ONEWAY FREQ PROB XPX COV COVB CORRB ML PRED=FREQ;
    REPEATED/ _RESPONSE_=ATELECT|TRTMNT;
    CONTRAST 'IS MAIN EFFECT U-TERM' _RESPONSE_ 0 -1 -1;
    CONTRAST 'IS BY ATELECT U-TERM' _RESPONSE_ 0 0 0 -1 -1;
    CONTRAST 'CDB VS. CPAP EFFECT' _RESPONSE_ 0 0 0 1 -1;
    CONTRAST 'CDB VS. IS EFFECT' _RESPONSE_ 0 0 0 2 1;
    CONTRAST 'CPAP VS. IS EFFECT' _RESPONSE_ 0 0 0 1 2;

This syntax makes evident that CATMOD is operated in log-linear model mode through the combined use of the MODEL statement and a new REPEATED statement, coupled with a new keyword, _RESPONSE_. In this example, TRTMNT*ATELECT on the left-hand side of the model equation specifies that r = 6 response categories are to be constructed from all combinations of TRTMNT and ATELECT exhibited by the data. Since there are no other variables, CATMOD reads the data as a one-population (s = 1), 6 response table. By default, in the absence of a RESPONSE statement, five log ratios of counts are generated from this table, comparing each of the first five counts to the sixth. The use of the keyword _RESPONSE_ on the right-hand side of the model equation indicates that some of the variables used to define the r = 6 response categories on the left-hand side will also be used to produce columns of the model matrix fit to these five response log ratios. Which variables are to be used to construct which columns is specified by the _RESPONSE_= phrase following the slash in the subsequent REPEATED statement. Here, _RESPONSE_=ATELECT|TRTMNT specifies a model matrix with the ATELECT*TRTMNT interaction, as well as all lower order effects contained within it, viz. the ATELECT and TRTMNT main effects. The impact is to produce the saturated log-linear model for these data, equivalent to the logit model specified earlier.


However, this formulation models five log ratios with five independent parameters, rather than three logits with three independent parameters as in the logit formulation. The responses modeled, and the (full rank) model matrix constructed by CATMOD, are shown in Table 4. The matrix columns specify, respectively, an ATELECT main effect u-term, two TRTMNT main effect u-terms (cols. 2 and 3), and two interaction u-terms (cols. 4 and 5). Remaining u-terms are dependent upon these, and are specified by the first two CONTRAST statements. The effect of TRTMNT on ATELECT is represented by the interaction deviation contrast u-terms, and pairwise comparisons among treatments are generated from these by the remaining three CONTRAST statements. The identical Wald chi-square QW = 1.41 labeled as TRTMNT main effect in the logit formulation is here reported as the TRTMNT*ATELECT interaction test in the printed "ANOVA" table. The pairwise contrast statistics from the log-linear formulation are also identical to those from the logit analysis, viz. 0.38, 1.41 and 0.37 for CDB vs. CPAP, CDB vs. IS and CPAP vs. IS, each with D.F. = 1 and clearly non-significant.
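The correspondence between the two formulations can be sketched numerically. The Python fragment below builds a five-column model matrix for the five log ratios under an assumed cell ordering (TRTMNT outermost, ATELECT innermost) and assumed deviation coding; it is an illustration, not a reproduction of Table 4, and the u-term values are hypothetical.

```python
import numpy as np

# Cells ordered with TRTMNT outermost (CDB, CPAP, IS) and ATELECT innermost
# (two levels); this ordering and coding are assumptions for illustration.
A = np.array([1.0, -1.0])                              # ATELECT deviation code
T = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])   # TRTMNT deviation codes

rows = []
for t in range(3):
    for a in range(2):
        # Columns: ATELECT, TRTMNT (2), ATELECT*TRTMNT interaction (2)
        rows.append(np.concatenate(([A[a]], T[t], A[a] * T[t])))
full = np.array(rows)                  # 6 cells x 5 u-term columns

# Responses are log ratios of cells 1-5 to cell 6, so each row of the
# model matrix is the difference from the last cell's row.
X = full[:5] - full[5]
print(np.linalg.matrix_rank(X))        # full rank: five independent parameters

u = np.array([0.4, 0.1, -0.3, 0.25, -0.05])   # hypothetical u-terms
log_m = full @ u                               # log cell means up to a constant
logit = log_m[0::2] - log_m[1::2]              # within-treatment logits
print(logit)
```

Under this coding each within-treatment logit equals twice the sum of the ATELECT u-term and that treatment's interaction u-term, so the TRTMNT main effect u-terms cancel and treatment differences in the logits are carried entirely by the interaction columns.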

Specification of different hierarchical analysis-of-variance models is simple using CATMOD syntax. As mentioned above, vertical bars between effects specify the corresponding interaction and all lower order effects it contains. To specify the interaction term without lower order effects, * replaces the |. Thus, TRTMNT|ATELECT is equivalent to TRTMNT ATELECT TRTMNT*ATELECT. TRTMNT*ATELECT alone would define a model with a general mean column and two interaction columns, with main effects constrained to zero. The syntax

    MODEL TRTMNT*ATELECT=_RESPONSE_ /ONEWAY FREQ PROB XPX COV COVB CORRB ML PRED=FREQ;
    REPEATED/ _RESPONSE_=ATELECT TRTMNT;

specifies a main effects (independence) model of treatment and clinical response. The lack-of-fit test for this model is the test of treatment effect. The WLS lack-of-fit test will be identical to the Wald statistic obtained from the saturated model but, since ML is requested, the likelihood ratio lack-of-fit statistic QL = 1.44, D.F. = 2, P = .49 is also generated. Predicted values for the five modeled log ratios, and for the six cell counts, are listed or may optionally be output in a SAS data set for model assessment. These fitted log ratios are shown in Table 4, and the fitted counts appear in Table 3.
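For a two-way table, the ML fitted counts under the independence model and the likelihood ratio lack-of-fit statistic have closed forms. The Python sketch below computes both for a hypothetical 3 x 2 table; the counts are invented for illustration and are not those of Table 3, and the function name `independence_fit` is an assumption. All counts are assumed positive so that the logarithms are defined.

```python
import numpy as np

def independence_fit(counts):
    """ML fitted counts and likelihood ratio statistic for independence."""
    n = np.asarray(counts, dtype=float)
    fitted = np.outer(n.sum(axis=1), n.sum(axis=0)) / n.sum()
    g2 = 2.0 * np.sum(n * np.log(n / fitted))     # likelihood ratio statistic
    df = (n.shape[0] - 1) * (n.shape[1] - 1)      # 2 for a 3 x 2 table
    return fitted, g2, df

counts = [[20, 30], [25, 25], [15, 35]]           # hypothetical counts
fitted, g2, df = independence_fit(counts)
print(fitted)
print(round(g2, 3), df)
```

The fitted counts reproduce both sets of observed marginal totals, which is the defining property of the ML fit for this model.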

Before moving to more complex illustrations of CATMOD, some comments on syntax are appropriate. The code presented above repeats an array of options which generate useful documentation and check output but may, of course, be deleted when unnecessary. The REPEATED statement and _RESPONSE_ keyword are new SAS terms for which some explanation is helpful. The REPEATED statement appears not only in Version 5.0 CATMOD, but also in ANOVA and GLM. Its primary purpose is to designate and label situations where response dimensions themselves represent levels of factors to be included in an analysis, such as in repeated measures or split-plot situations. In these cases, model matrix columns are generated corresponding to comparisons amongst dependent responses arising from within the same populations. Since log-linear modeling involves the construction of analogous model matrix columns for multiple dependent log ratios, CATMOD uses the REPEATED statement to specify the form of a log-linear model even when, as in most cases, the data and model do not involve repeated measurement or split-plot data. A technical statistical analogy has here been converted into an unfortunate terminologic red herring. The REPEATED statement is most easily understood as a code which fulfills two quite separate functions: i) the formulation of a repeated measurement or split-plot model; and ii) specification of a log-linear model structure, whether or not there is any repeated measurement aspect to the situation.

The _RESPONSE_ keyword is much easier to understand intuitively. It replaces, on the right-hand side of the MODEL statement, any variable or combination of variables used both in designating response categories and in forming the model matrix. Whenever such variables exist, the manner in which they contribute to the model is specified on the right-hand side of a _RESPONSE_= equation within a REPEATED statement, while _RESPONSE_ in the MODEL statement represents all parameters they determine.

_RESPONSE_ may appear only on the right-hand side of the MODEL statement. It may appear with other variables, and may be included in interactions with them or nested within their levels. Other variables may not be nested in levels of _RESPONSE_.

4.2. A $2^4$ table classifying motor vehicle accidents with serious driver injury

Table 5 is a cross-classification of North Carolina motor vehicle accidents involving serious driver injury in 1973 or 1974 by Year and dichotomies of Speed, Time (of day), and Place (urban or rural). Exploration of this table is of interest because of changes in the driving context that took place in the 1973-74 period, including gasoline shortages and enactment of a national 55 mph speed limit, that might have selectively affected frequencies of certain types of accidents. A basic CATMOD request for a maximum-likelihood log-linear analysis of these data, on initial exploration for a suitable model, is:

    MODEL SPEED*TIME*PLACE*YEAR=_RESPONSE_ /ONEWAY NOPARM NOGLS ML;

This statement requests only an ML analysis of the $2^4$ table, with no printout of tests of individual fitted parameter values, but with marginal one-dimensional distributions reported as a data check. The resulting output includes these distributions, a report on the iterations of parameter estimates to convergence (so that the fitted parameters themselves are reported as the last stage iteration), and an "ANOVA" chi-square table for the effects incorporated in the model. These effects are specified on a subsequent REPEATED statement, as desired. Thus,

    REPEATED/ _RESPONSE_=SPEED|TIME|PLACE|YEAR;

yields the saturated model, results of which suggest various approaches to model reduction. Several tried were:

    REPEATED/ _RESPONSE_=SPEED|TIME|PLACE SPEED|TIME|YEAR;

    REPEATED/ _RESPONSE_=SPEED|TIME|PLACE SPEED|YEAR TIME|YEAR PLACE|YEAR;

    REPEATED/ _RESPONSE_=SPEED|TIME|PLACE SPEED|YEAR TIME|YEAR;

Note that, in accord with other programs for fitting hierarchical ANOVA log-linear models, each model is designated by listing only the sufficient statistics which specify the model, using an appropriate notation. Thus, in CATMOD, SPEED|TIME|PLACE represents the SPEED by TIME by PLACE three-way observed marginal distribution, which might be represented as STP, S*T*P, SPEED BY TIME BY PLACE or otherwise in another program. CATMOD allows the flexibility, however, of fitting non-hierarchical models by replacing vertical bars with asterisks and thereby incorporating only selected sets of main effects and interactions.
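The bar notation can be unpacked mechanically. The following Python sketch is an illustration only, not SAS code; the function names are hypothetical and are not part of CATMOD. It expands each generating term into every main effect and interaction it implies and, assuming all factors are dichotomous so that each effect carries one parameter, counts the lack-of-fit degrees of freedom.

```python
from itertools import combinations

def expand_bar_terms(spec):
    """Expand terms such as 'SPEED|TIME|PLACE' into all implied effects."""
    effects = set()
    for term in spec.split():
        factors = term.split("|")
        for size in range(1, len(factors) + 1):
            for combo in combinations(factors, size):
                effects.add(frozenset(combo))
    return effects

def residual_df(spec, n_factors):
    """Lack-of-fit degrees of freedom, assuming every factor is dichotomous."""
    saturated = 2 ** n_factors - 1      # one parameter per effect in the full model
    return saturated - len(expand_bar_terms(spec))

# Last of the reduced models tried for the accident data.
final_model = "SPEED|TIME|PLACE SPEED|YEAR TIME|YEAR"
effects = expand_bar_terms(final_model)
print(sorted("*".join(sorted(e)) for e in effects))
print(len(effects), residual_df(final_model, 4))
```

Under these assumptions the last reduced model carries ten effects, leaving five degrees of freedom for lack of fit out of the fifteen available in a 2^4 table.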

For these accident data, the last model fit was regarded as acceptable (QL = 4.69, D.F. = 5, p = .45 for lack-of-fit), with all parameters significant at p < .015. Fitted counts from this model are shown in Table 5, and the analysis of individual parameters produced by CATMOD is displayed in Table 6. The SPEED*YEAR parameter reflects the overall 24% reduction in high-speed accidents with serious driver injury from 1973 to 1974, as compared to only a 10% reduction in lower speed accidents with serious driver injury across that period. The TIME*YEAR effect reflects the overall 21% reduction in daytime accidents of this type as compared to only a 3% reduction in nighttime accidents. These findings are compatible with hypotheses involving reduced high-speed driving overall, and increased car-pooling and use of public transport leading to reduced daytime exposure in routine commuting. Other explanations are possible, of course, and nothing conclusive can be said in the absence of denominator data for this selected group of accidents involving serious driver injury.

4.3. A 2^3 repeated measures drug comparison

Table 7 gives data from a drug comparison trial in which 46 subjects received each of Drugs A, B and C under similar circumstances, with their joint responses noted. These data have been analysed by many authors including Koch, Imrey et al. (1976) and Koch, Landis et al. (1977). Initially, a model with no three-way interaction is easily fit by CATMOD using the statements:

MODEL DRUGA*DRUGB*DRUGC=_RESPONSE_;
REPEATED/_RESPONSE_=DRUGA|DRUGB DRUGA|DRUGC DRUGB|DRUGC;

Since ML is not specified, the fit defaults to WLS, with the goodness-of-fit statistics QW = 0.08, D.F. = 1, p = .78 reported by other authors. The "ANOVA" table for this model shows identical Wald chi-squares for DRUGA*DRUGC and DRUGB*DRUGC of 0.45, D.F. = 1, p = .50, but a DRUGA*DRUGB chi-square of 7.94,


p = .005. These results support a model in which response to Drug C is independent of responses to Drug A and Drug B, and in which responses to these latter are associated. Examination of Table 7 reveals that the data are completely symmetric with respect to Drugs A and B. Thus, an appropriate model might incorporate terms due to a Drug C main effect, with terms dependent on the number of responses (0, 1 or 2) to Drugs A and B, regardless of which of these latter drugs generates a response. Such a non-hierarchical log-linear model, with effects not specifiable as obvious main effects, interactions or nested effects of the dependent variable factors, may be designated by directly entering the appropriate model matrix. The following MODEL statement will do the job:

MODEL DRUGA*DRUGB*DRUGC=(2 2 0, 0 2 0, 2 2 -2, 0 2 -2, 2 2 -2, 0 2 -2, 2 0 0)
      (1='C', 2='A OR B', 3='A AND B')
      /ONEWAY FREQ PROB XPX COV COVB CORRB ML PRED=FREQ;

This statement enters directly a full-rank model matrix corresponding to the seven log ratios of each observed count to the last (U U U), where the first parameter is an increment due to positive response to C, the second is an increment due to positive response to at least one of A and B, and the third is an additional increment due to concordant responses to both A and B. The parenthesized information after the matrix literal instructs CATMOD to compute Wald tests of significance for each parameter, and label these tests respectively C, A OR B, A AND B. Note that the model matrix entered is derived from a corresponding matrix for the full set of eight log probabilities by subtraction of the last row from each of the first seven; the initial matrix to which this operation is applied is

$$
X' = \begin{bmatrix}
1 & -1 & 1 & -1 & 1 & -1 & 1 & -1 \\
1 & 1 & 1 & 1 & 1 & 1 & -1 & -1 \\
1 & 1 & -1 & -1 & -1 & -1 & 1 & 1
\end{bmatrix},
$$

and the likelihood ratio lack-of-fit test for this model is QL = 1.75, D.F. = 4, p = .78. The ML-fitted counts are shown in Table 7.
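The subtraction that produces the matrix entered in the MODEL statement can be checked directly. In the Python sketch below (an illustration; the variable and function names are assumptions, not CATMOD terminology), each row of `full_matrix` holds the three effect codes for one of the eight cells, so that its columns are the three rows of the matrix displayed above.

```python
# One row per cell of the DRUGA*DRUGB*DRUGC table; the last row is the (U U U) cell.
# Columns: C, A OR B, A AND B (concordance of the responses to A and B).
full_matrix = [
    [ 1,  1,  1],
    [-1,  1,  1],
    [ 1,  1, -1],
    [-1,  1, -1],
    [ 1,  1, -1],
    [-1,  1, -1],
    [ 1, -1,  1],
    [-1, -1,  1],
]

def reduce_to_log_ratios(matrix):
    """Subtract the last row from each of the earlier rows."""
    last = matrix[-1]
    return [[x - y for x, y in zip(row, last)] for row in matrix[:-1]]

reduced = reduce_to_log_ratios(full_matrix)
for row in reduced:
    print(row)
```

The seven rows printed agree with the matrix literal in the MODEL statement, and the difference between seven log ratios and three parameters accounts for the four degrees of freedom of the lack-of-fit test.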

Although the overall response rates in this study are 61% for both Drugs A and B, and only 35% for Drug C, this model suggests (see fitted counts) that once either Drug A or Drug B has failed, the probability of success with Drug C is at least as great as that for the remaining drug.

4.4. Outlying cells and quasi-independence in an 18 x 6 table

Table 8 is a slightly abridged version of data presented by Casjens (1974), and analysed for outliers by Mosteller and Parunak (1985, in press). The latter authors explore various methods of searching for extreme departures from independence in such a table, in the hope that such departures will be informative. Here, CATMOD is used to implement a rather conventional outlier search using residuals standardized by division by the root of the predicted values (the Poisson-based standard deviation estimate). For an initial run, one may use:


PROC CATMOD;
   POPULATION ARTIFAC;
   MODEL DIST= /NODESIGN NOPARM NOGLS NOPROFILE ML PRED=FREQ;
   RESPONSE OUT=ARCHOUT;
DATA OUTCHEC;
   SET ARCHOUT (KEEP=_TYPE_ _NUMBER_ _RESID_ _PRED_);
   IF _TYPE_='FREQ';
   STDRESD=(_RESID_/SQRT(_PRED_));
   KEEP _NUMBER_ STDRESD;
PROC SORT; BY DESCENDING STDRESD;
PROC PRINT;

The POPULATION statement forces separate treatment of each type of artifact as a population, whether or not ARTIFAC appears as a variable in the MODEL statement. An independence model is iteratively fit to the table, and the likelihood ratio chi-square goodness-of-fit test obtained (QL = 180.49, D.F. = 85, p < .0001). An output data set is produced containing the predicted values and residuals from each cell. Standardized residuals are created as described above. The data are sorted by descending standardized residual and printed. The analysis identifies two standardized residuals with absolute values 5.09 and 4.06, as compared to all others substantially below three. The large standardized residuals correspond to excesses of grinding stones in the immediate vicinity of water, and Humboldt projectile points 1-3 miles from water. Each excess has a plausible substantive explanation. To search for additional outliers, we fit a quasi-independence model to the remainder of the table by excluding the cells generating the identified outliers. Equivalently, these cells are being treated as structural zeroes for the continuing analysis.
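The residual screen just described is simple enough to reproduce by hand. The Python sketch below is illustrative only: the 3 x 3 table is hypothetical (it is not the Casjens data), the function names are assumptions, and, unlike the PROC SORT step above, cells are ordered by absolute value so that large deficits surface alongside large excesses.

```python
from math import sqrt

def independence_fit(table):
    """Expected counts: row total x column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    return [[r * c / grand for c in col_totals] for r in row_totals]

def standardized_residuals(table):
    """(observed - expected) / sqrt(expected), largest absolute value first."""
    expected = independence_fit(table)
    cells = []
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = expected[i][j]
            cells.append(((obs - exp) / sqrt(exp), i, j))
    return sorted(cells, key=lambda c: abs(c[0]), reverse=True)

# Hypothetical table used only to exercise the functions.
toy = [[20, 5, 5], [10, 10, 10], [5, 5, 30]]
for resid, i, j in standardized_residuals(toy)[:3]:
    print(f"row {i}, column {j}: {resid:.2f}")
```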

The two outlying cell counts are deleted from entry in the DATA step. For the earlier MODEL statement, we substitute:

MODEL ARTIFAC*DIST=_RESPONSE_/NODESIGN NOPARM NOGLS NOPROFILE ML PRED=FREQ;
REPEATED/_RESPONSE_=ARTIFAC DIST;

Care is necessary here because of the manner in which CATMOD handles zeroes and missing cells or, equivalently, random zero counts vs. structural zeroes. We wish to treat two cells as structural zeroes while retaining several random zeroes in the data table. In a multi-population context such as was defined in the initial outlier screen, CATMOD treats all missing cells or input zero counts as random unless a response category is missing or unfilled in all populations simultaneously, in which case the category is disregarded entirely. In a single-population set-up, a zero or missing cell in that single population is, by definition, zero or missing in all populations, and its category is thus analogously discarded by CATMOD. Thus, in single-population problems all zero or missing cells are treated as structural zeroes and are not modeled, unless special measures are taken to identify to the software those which are truly random and should be modeled. This is awkward; however, one cannot effectively define structural zeroes in the multi-population set-up since missing cells are automatically treated as random zeroes. Thus, the quasi-independence model is fit in full log-linear rather than logistic form, so that a distinction between


structural and random zeroes may be drawn. This is accomplished by transforming all random zeroes to small positive numbers in the DATA step. CATMOD documentation recommends replacement of all random zeroes by 1E-20 as a routine procedure whenever any log-linear model is to be fit. For this quasi-independence model such treatment is essential.
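The data preparation implied here can be sketched as follows. This Python fragment is an illustration under stated assumptions: the function name, the record layout and the toy table are hypothetical, and only the replacement value 1E-20 comes from the text. Cells flagged as structural are omitted entirely, while random zeroes are retained with a negligible positive count.

```python
def prepare_cells(table, structural, tiny=1e-20):
    """Return (row, column, count) records for a single-population log-linear fit."""
    records = []
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            if (i, j) in structural:
                continue                      # structural zero: cell is not entered
            records.append((i, j, tiny if count == 0 else count))
    return records

# Hypothetical 2 x 2 table: cell (0, 0) is a random zero, cell (1, 1) is structural.
example = prepare_cells([[0, 3], [4, 7]], structural={(1, 1)})
print(example)
```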

Once the data are so presented to CATMOD, the quasi-independence model yields a likelihood ratio lack-of-fit statistic of QL = 145.16, D.F. = 83, p < .0001, so that quasi-independence is a poor fit, as was independence to the original table. Nevertheless, the standardized residuals from quasi-independence remain between ±2.7, and are fairly symmetrically and unimodally distributed, suggesting that further departure from independence is not due to one or a few additional outlying cells.
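A quasi-independence fit of this kind can also be obtained by iterative proportional fitting (IPF) over the cells that are not structural zeroes. The Python sketch below is illustrative only: the function name, the toy table and the stopping rule are assumptions, and it is not the algorithm used by CATMOD, nor a transcription of any BMDP or PROC MATRIX routine.

```python
def ipf_quasi_independence(table, structural, tol=1e-8, max_iter=1000):
    """Fit row and column margins over the cells not flagged as structural zeroes."""
    n_rows, n_cols = len(table), len(table[0])
    keep = [[(i, j) not in structural for j in range(n_cols)] for i in range(n_rows)]
    fit = [[1.0 if keep[i][j] else 0.0 for j in range(n_cols)] for i in range(n_rows)]
    row_obs = [sum(table[i][j] for j in range(n_cols) if keep[i][j]) for i in range(n_rows)]
    col_obs = [sum(table[i][j] for i in range(n_rows) if keep[i][j]) for j in range(n_cols)]
    for _ in range(max_iter):
        for i in range(n_rows):                       # match the row margins
            total = sum(fit[i])
            if total > 0:
                fit[i] = [v * row_obs[i] / total for v in fit[i]]
        worst = 0.0
        for j in range(n_cols):                       # match the column margins
            total = sum(fit[i][j] for i in range(n_rows))
            if total > 0:
                scale = col_obs[j] / total
                worst = max(worst, abs(scale - 1.0))
                for i in range(n_rows):
                    fit[i][j] *= scale
        if worst < tol:                               # column step no longer changes the fit
            break
    return fit

# Hypothetical table; cell (0, 0) is treated as a structural zero.
toy = [[50, 5, 5], [10, 10, 10], [5, 5, 30]]
fit = ipf_quasi_independence(toy, structural={(0, 0)})
for row in fit:
    print([round(v, 2) for v in row])
```

With no structural zeroes the same routine reproduces the closed-form independence fit (row total x column total/grand total) after a single cycle.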

It must be noted that the above analyses, that is, the independence and quasi-independence fits, required respectively six and five iterations of the Newton-Raphson computing algorithm, which is very inefficient and time-consuming for problems of this size relative to the simple computation of expected values under independence (row total x column total/grand total), or to the use of iterative proportional fitting (IPF) for the quasi-independence model. For sufficiently large problems, limitations of computer resources may suggest the use of IPF through PROC BMDP and BMDP4F, or through the IPF PROC MATRIX command.

4.5. Stratified clinical trial data with an ordinal response

Table 9 displays results of a clinical trial of an experimental agent vs. placebo in treatment of pain from a chronic joint disease at one of two possible anatomic sites. The trial was conducted at two clinical centers, and response was classified into one of three ordinal categories. Two models of interest for smoothing and analysing data of this type are an equal adjacent odds-ratio model [Andrich (1979), a classical log-linear model], and the proportional odds model obtained by applying a uniform association structure similar to the former to cumulative logits rather than to log ratios of probabilities of individual categories.

The equal adjacent odds-ratio model specifies that the conditional odds-ratio of Good to Medium response for Active relative to Placebo drug is equal to that for Medium to Poor response, for each anatomic site and clinical center combination. This single odds-ratio then measures the Drug effect. The model which incorporates this assumption with main effects for anatomic site and clinical center may be fit using CATMOD by:

POPULATION SITE CENTER DRUG;
MODEL PAIN=(1 0 0 0 0, 0 1 0 0 0,
            1 0 2 0 0, 0 1 1 0 0,
            1 0 0 2 0, 0 1 0 1 0,
            1 0 2 2 0, 0 1 1 1 0,
            1 0 0 0 2, 0 1 0 0 1,
            1 0 2 0 2, 0 1 1 0 1,
            1 0 0 2 2, 0 1 0 1 1,
            1 0 2 2 2, 0 1 1 1 1)
           (3='DRUG EFFECT', 4='CENTER EFFECT', 5='SITE EFFECT')
           /ONEWAY FREQ PROB XPX COVB CORRB ML PRED=FREQ NOGLS;

The POPULATION statement tells CATMOD that data are to be arranged in populations corresponding


to all distinct combinations of the variables Site, Center and Drug. Because the MODEL statement directly inputs the model matrix, instead of directing its construction using names of these factors, the POPULATION statement is required to ensure that the data are not treated as a single population. Treated as eight populations, sixteen log ratios of probabilities are modeled by default, and the model matrix is 16 x 5. The same model might have been fit without the POPULATION statement, by entering SITE*CENTER*DRUG*PAIN on the left-hand side of the MODEL statement equation. However, CATMOD would then have expected a larger (and more complex) 23 x 12 model matrix, corresponding to the 23 log ratios of probabilities it would create, by default, under the single-population, 24-response-category assumption. For these data, with two random zeroes, failure to specify populations would have led CATMOD to treat the problem as a 22-response-category single population, and the desired model would not have been possible to fit without first replacing the sampling zeroes by negligible numbers, as discussed in the previous example.
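The structure of the 16 x 5 matrix entered above becomes clearer when it is generated rather than typed. The Python sketch below is an illustration; the function name is hypothetical, and the 0/1 coding of the levels of Site, Center and Drug, as well as the ordering of the response categories (Good, Medium, Poor), are assumptions inferred from the matrix literal rather than stated in the text.

```python
from itertools import product

def adjacent_odds_matrix(scores=(2, 1)):
    """Two rows (log ratios to the last category) for each of the 8 populations."""
    rows = []
    # Populations ordered by SITE, then CENTER, then DRUG (DRUG changes fastest).
    for site, center, drug in product((0, 1), repeat=3):
        # Columns: intercept 1, intercept 2, DRUG, CENTER, SITE.
        for intercepts, score in zip(((1, 0), (0, 1)), scores):
            rows.append([*intercepts, score * drug, score * center, score * site])
    return rows

matrix = adjacent_odds_matrix()
for row in matrix:
    print(row)
```

Scores of 2 and 1 on the two log ratios to the last category make the two adjacent log odds ratios equal: the first minus the second leaves one unit of each effect, as does the second alone.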

For the equal adjacent odds-ratio model, the likelihood ratio lack-of-fit chi-square is 14.98 with D.F. = 11, p = .18, indicating an adequate fit. The Site, Center and Drug parameters, standard errors, and Wald chi-square statistics are reported in Table 10. The Drug parameter is essentially at the 5% level of significance.

An analogous proportional odds model may be fit by adding to the code:

RESPONSE 1 -1 0 0/0 0 1 -1 LOG 1 0 0/0 1 1/1 1 0/0 0 1;

which generates log ratios of cumulative odds. Uniform association across cumulative logits is imposed by changing all 2's to 1's in the previous model matrix. For this model, maximum likelihood analysis is not available from CATMOD (though PROC LOGIST will provide it); ML must be dropped as an option in the MODEL statement. Further, the WLS fitting procedure uses the observed response functions which, for this data set, are undefined due to the placement of the two sampling zeroes. To remedy this, these zeroes are replaced by 0.5 in accordance with conventional practice using WLS categorical data analysis [Grizzle, Starmer and Koch (1969)], and the analysis is carried through for illustration. The model fits adequately (QW for lack-of-fit = 9.60, D.F. = 11, p = .57), and its Site, Center and Drug parameters and related statistics are included in Table 10. In terms of parameter values, these results agree closely with those of the ML fit obtained from PROC LOGIST [Koch, Imrey, Singer, Atkinson and Stokes (1985)]. With regard to formal inference, all methods confirm the existence of Site and Center effects (p < .05), while only the ML fit of the proportional odds model yields a p-value for Drug below 5% (chi-square = 3.95, p = .047). However, of the three sets of results, the WLS fit has the least favorable asymptotics, and is likely too conservative.
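The RESPONSE statement for the proportional odds model composes a linear map, a log, and a second linear map. The Python sketch below illustrates that composition for a single population with three ordered categories; the matrix names `A1` and `A2`, the function names and the counts are assumptions for illustration, and the replacement of zeroes by 0.5 follows the convention described above.

```python
from math import log

A1 = [[1, 0, 0], [0, 1, 1], [1, 1, 0], [0, 0, 1]]   # sums of probabilities
A2 = [[1, -1, 0, 0], [0, 0, 1, -1]]                  # differences of their logs

def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def cumulative_logits(counts, zero_replacement=0.5):
    """Cumulative logits A2 * log(A1 * p) for one population."""
    counts = [zero_replacement if c == 0 else c for c in counts]
    total = sum(counts)
    probs = [c / total for c in counts]
    return matvec(A2, [log(x) for x in matvec(A1, probs)])

# Hypothetical counts for the three ordered response categories.
print(cumulative_logits([10, 20, 30]))   # log(10/50) and log(30/30)
print(cumulative_logits([0, 5, 5]))      # first count replaced by 0.5
```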


4.6. Induced tumor regression in rat carcinogenesis

Frequently it is desirable to fit log-linear models to counts arising from sampling schemes more complex than the product-multinomial or Poisson models assumed by the WLS or ML algorithms of CATMOD and related software. Sometimes this may be accomplished through CATMOD if the conventional sampling model applies to some set of sampling units from which the observed counts are derived, and if the mode of derivation of these counts can be explicitly formulated within the class of response transformations supported by CATMOD. For instance, Table 11 shows artificial data on total palpated tumors and total tumors regressing in a common rodent model of breast carcinogenesis. A regressed tumor is one which was palpable repeatedly but was subsequently not found on palpation or necropsy. The data reported are for two strains of rats fed an anti-tumor agent, and their controls. It is of interest to compare the observed proportions of tumors which regress in these groups, viz. 48% and 52% for treated rats of Strain A and B respectively, vs. 22% and 27% for their corresponding control groups. Log-linear model analysis might be used to do this, but would be inappropriate if applied to the 2 x 2 x 2 table of Strain x Treatment x Regression, because of the clustered nature of tumor sampling, which violates the product-multinomial assumption since outcomes for multiple tumors in the same animal are undoubtedly associated biologically and statistically.

However, a product-multinomial model does apply to Table 11, which uses rats rather than tumors as the unit of analysis. Thus, log-linear models may be fit by deriving the regression proportions as response functions from Table 11. Appropriate code for the saturated model is:

MODEL TUMTOT*TUMREG=TRTMNT|STRAIN/ONEWAY FREQ PROB XPX COV COVB CORRB;
RESPONSE 1 -1 LOG 0 1 0 1 2 0 1 2 3/1 0 2 1 0 3 2 1 0;

Other models may be obtained by changing TRTMNT|STRAIN, e.g. to TRTMNT STRAIN for the main effects model. The RESPONSE statement computes, for each population formed by combinations of TRTMNT and STRAIN, the log of the ratio of regressed to non-regressed tumors, which is equal to the logit of the proportion of tumors regressing in that population. This is the same response function that would be formed if the data were input on a per tumor basis and the default response was used. However, analysis in that situation would use the wrong likelihood or, for WLS, moments based on the wrong probability model. Formulation of these responses through the RESPONSE statement based on data input on a per subject basis allows the software to create the correct estimated moments for a WLS analysis based on an appropriate probability model. The Wald statistics for the saturated and main effects models are shown in Table 12; the WLS analysis shows no interaction or Strain effect, and a significant (p = .0003) effect of the experimental drug in enhancing tumor regression.
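The arithmetic behind this RESPONSE statement can be traced on a small example. In the Python sketch below the two score vectors are taken from the RESPONSE matrix, read as regressed and non-regressed tumor counts for each (total, regressed) response category; the numbers of rats per category are hypothetical (they are not the Table 11 data) and the function name is an assumption.

```python
from math import log

# Response categories (total tumors, regressed tumors), in the column order implied
# by the RESPONSE matrix: (1,0) (1,1) (2,0) (2,1) (2,2) (3,0) (3,1) (3,2) (3,3).
REGRESSED     = [0, 1, 0, 1, 2, 0, 1, 2, 3]
NOT_REGRESSED = [1, 0, 2, 1, 0, 3, 2, 1, 0]

def regression_logit(rats_per_category):
    """log(mean regressed per rat) - log(mean non-regressed per rat)."""
    n_rats = sum(rats_per_category)
    probs = [n / n_rats for n in rats_per_category]
    mean_regressed = sum(a * p for a, p in zip(REGRESSED, probs))
    mean_not = sum(a * p for a, p in zip(NOT_REGRESSED, probs))
    return log(mean_regressed) - log(mean_not)

# Hypothetical population of 20 rats: 16 regressed and 25 non-regressed tumors in all.
rats = [4, 2, 3, 3, 1, 2, 2, 2, 1]
print(regression_logit(rats), log(16 / 25))
```

The function of per-rat category proportions equals the logit of the pooled proportion of tumors regressing, while the sampling unit remains the rat.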


No ML analysis is available from CATMOD for this model, as CATMOD is capable of ML fits only when the product-multinomial likelihood applies to counts in the initially entered population by response table. Also, CATMOD is not capable of accepting and applying WLS modeling to an observed vector and covariance matrix generated externally to CATMOD, such as from a TYPE=CORR SAS data set.

4.7. A split-plot marginal log-linear model for root caries prevalence data

As a final example, we return to the data of Table 2 relating presence of any root caries in each of four quadrants of the mouth to Town of residence and Age, where subjects were drawn from two towns of widely different water fluoride levels. Constructing variable names using MAX and MAND to represent maxillary (upper) and mandibular (lower) levels of teeth, and L and R prefixes to designate left and right sides of the mouth, the following CATMOD code fits a saturated log-linear model to the marginal prevalence proportions:

RESPONSE LOGIT;
MODEL LMAND*RMAND*LMAX*RMAX=TOWN|AGECAT|_RESPONSE_;
REPEATED LEVEL 2 SIDE 2/_RESPONSE_=LEVEL|SIDE;

The command RESPONSE LOGIT; defines the responses for analysis as the marginal logits of each variable listed on the left-hand side of the MODEL statement equation. That equation defines the model matrix as consisting of all main effects and interactions included in the interaction TOWN*AGECAT crossed with effects which are functions of the dimensions of the four-way dependent variable response table. The syntax REPEATED LEVEL 2 SIDE 2/ indicates that the dimensions of the LMAND*RMAND*LMAX*RMAX table correspond to the cross-classification of two repeated measures (split-plot) factors, LEVEL and SIDE, each with two values, and that the values of SIDE, the second listed variable, change most rapidly in the listing of response dimensions in the MODEL statement. This is sufficient to specify the prefixes L and R as designating the two categories of SIDE, and MAND and MAX the two categories of LEVEL.

_RESPONSE_=LEVEL|SIDE; puts LEVEL, SIDE and their interaction into the model, and these are crossed with the whole-plot factors TOWN and AGECAT as a result of the |_RESPONSE_ portion of the MODEL statement.
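The marginal logits that RESPONSE LOGIT forms can be illustrated with one population of Table 2. The Python sketch below uses the sixteen counts for ages 30-49 in Stratford, in the column order in which Table 2 displays them (which differs from the variable order in the MODEL statement); the function and variable names are assumptions, and the sign convention of the logit (presence relative to absence) is chosen for illustration and may be the reverse of what CATMOD reports.

```python
from itertools import product
from math import log

# First row of Table 2 (ages 30-49, Stratford). Cells are ordered with Left Maxillary
# changing slowest and Right Mandibular changing fastest; 0 = N, 1 = Y.
counts = [139, 3, 1, 0, 8, 3, 2, 2, 3, 1, 0, 0, 4, 0, 1, 1]
quadrants = ["LMAX", "RMAX", "LMAND", "RMAND"]
cells = list(product((0, 1), repeat=4))

def marginal_logits(counts):
    """Subjects with caries and the marginal logit of prevalence, per quadrant."""
    total = sum(counts)
    result = {}
    for k, name in enumerate(quadrants):
        present = sum(n for cell, n in zip(cells, counts) if cell[k] == 1)
        result[name] = (present, log(present / (total - present)))
    return result

for name, (present, logit) in marginal_logits(counts).items():
    print(f"{name}: {present} of {sum(counts)} subjects, logit {logit:.3f}")
```

Four response functions per population are thus derived from a sixteen-category response, which is what allows LEVEL and SIDE to be treated as repeated measures factors.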

The "ANOVA" table resulting from this analysis shows both whole-plot factors significant at p < .0001, and no other terms with p < .05. Clearly the model may be reduced. To specify the main effects model, one may simply remove all vertical bars from the MODEL and REPEATED statements. This model also shows no significant effects of the split-plot factors LEVEL and SIDE, while TOWN and AGECAT remain significant at p < .0001. If models without parameters representing differences due to LEVEL and SIDE are to be fit, the REPEATED statement may be dropped. To simultaneously fit separate main effects models of TOWN and AGECAT to each of the four quadrants, use


MODEL LMANDIND*RMANDIND*LMAXIND*RMAXIND= TOWN AGECAT;

To treat all four quadrants identically by fitting the same main effects model to them with one set of parameters, so that the fitted values are the same for each quadrant, use

MODEL LMANDIND*RMANDIND*LMAXIND*RMAXIND= TOWN AGECAT/AVERAGED PRED;

Here, the option AVERAGED directs CATMOD to construct a model matrix with identical rows corresponding to the elements of each set of multiple response functions within the same population, so that the resultant parameters apply to all response functions simultaneously. Since the fitted parameters are (weighted) averages of those that would have been fitted to each response function separately, the usage is justified. The "ANOVA" results for this model are given in Table 13. It is evident from examination of the parameters that this model might be further reduced to include only a pseudo-linear term for the effect across age categories. To do this, the variable AGEINC, with values -1, 0 and 1, must be created from AGECAT in the DATA step. Replacing AGECAT by AGEINC in the MODEL statement, and prefacing that statement by DIRECT AGEINC, would then fit the desired model. The added statement places AGEINC directly into the model matrix, rather than using its values to define a three-level categorical factor as was AGECAT. Finally, note that AVERAGED also allows modeling of differences between response functions using appropriate model specifications.
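The shape of the model matrix under AVERAGED with a direct AGEINC term can be sketched as follows. This Python fragment is illustrative only: the function name, the +1/-1 effect coding of TOWN and the ordering of populations are assumptions, and the matrix CATMOD actually constructs may differ in coding and row order.

```python
from itertools import product

def averaged_design(n_functions=4):
    """Identical rows for the response functions within each TOWN x AGEINC population."""
    rows = []
    for town, ageinc in product((1, -1), (-1, 0, 1)):
        row = [1, town, ageinc]      # intercept, TOWN effect, linear AGEINC term
        rows.extend([list(row) for _ in range(n_functions)])
    return rows

design = averaged_design()
print(len(design), "rows x", len(design[0]), "columns")
for row in design[:8]:
    print(row)
```

Six populations with four marginal logits each give 24 rows, but only three parameters, since the four quadrants within a population share one row pattern.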

5. Comparison of PROC CATMOD with PROC FREQ

In Version 5.0 of SAS/BASE, PROC CATMOD and PROC FREQ will form complementary tools for the analysis of categorical data. A brief (and somewhat oversimplified) comparison is appropriate here. Note that PROC FREQ in Version 5.0 has been available as PROC TFREQ in earlier versions. PROC FREQ produces a variety of contingency table measures of association, with Cochran-Mantel-Haenszel generalized average partial association tests. Major differences in approach between PROC FREQ and PROC CATMOD are enumerated below.

i) FREQ executes randomization model-based analyses allowing formal statistical inference only to the subjects generating the data under study. CATMOD incorporates random sampling assumptions which, if valid, allow generalization to broader target populations. On the other hand, CATMOD analyses are invalid if these assumptions are grossly violated. Depending on the nature of the model and sample sizes available, CATMOD may rely upon stringent but untestable assumptions. FREQ makes very weak assumptions, at sacrifice of scope of inference.

ii) FREQ generates a variety of standard descriptive statistics which are independent of any structural model. CATMOD generates only descriptive statistics calculated through its RESPONSE statement by the user, or based on structural models for such response functions.

iii) FREQ does average partial association testing only, with summary partial association measures available under particular circumstances. FREQ is poor at describing complex multivariate association structures which its test procedures may adjudge to be non-random. CATMOD, on the other hand, allows full log-linear or other structural modeling and general estimation of effects within any structural model which it can test.

iv) FREQ produces some exact testing, but relies mainly on asymptotic procedures. CATMOD uses asymptotic methods only.

Thus, FREQ and CATMOD have essentially disjoint capabilities. However, frequently each will be valuable in generating analyses which will provide complementary insights into the same data set, addressing similar questions under different assumptions or at different levels of generality.

6. SASware Ballot 1985, CATMOD Section

To conclude, a number of recommendations are provided by which CATMOD might be made even more flexible and attractive to its users.

A. The REPEATED statement should be split into the two rather different statements of which it currently forms a hybrid. REPEATED should be retained for genuine repeated measures or split-plot analyses and a new statement, for instance, LOGLIN, added to incorporate its current function of log-linear model specification.

B. Provision should be made for direct input of a vector of response functions and associated covariance matrix, possibly produced as an output data set from another SAS PROC, for direct WLS modeling. This capability would allow for the modeling of data sets from complex sample surveys and other situations in which the standard probability models do not apply. The programming needed to add such a capability would seem to be minimal, but may not be.

C. Allow the concatenation of response statements, so that more general response functions may be more easily constructed, for instance by taking ratios of mean scores by matrix operations on a vector of means, where the means are specifiable by key word.

D. Incorporate predicted probabilities under general functional WLS modeling, such as marginal log-linear modeling or proportional odds modeling. Where invertible functions are modeled [See Dunn (1985) in this volume], back-transformation may be used. Otherwise, minimum Neyman chi-square estimates are available under the model [Koch, Imrey, Singer, Atkinson and Stokes (1985)].

E. Provide a focussed model reduction capability within a single model statement, such as by specification of a selected set of null effects, parameters, or contrasts among parameters.

F. Introduce limited automated model building capabilities, offering certain commonly used sequences of model reduction.

G. Allow front-end selection of individual degrees of freedom in multiple degree of freedom effects which are not nested. This would make it possible, for instance, to


construct a model incorporating a linear trend only among three equally-spaced populations, without having to specify the linear contrast through a direct statement.

H. In a related vein, allow the user to choose effect parametrizations of certain common types, e.g. orthogonal polynomial contrasts, comparisons with control contrasts, etc., within CATMOD automatically, as is done in SPSSX LOGLINEAR.

I. Since the Newton-Raphson algorithm is highly inefficient for large models and large tables, allow an iterative proportional fitting option for hierarchical models for large tables.

J. Improve or clarify the handling of random vs. structural zeroes. It seems foolish and inelegant for the user to have to distinguish random zeroes to a log-linear model program by converting them to small positive numbers. Perhaps a specific entry symbol for a structural zero might be designated.

K. Improve the labeling of parameters in general, especially those incorporated with the _RESPONSE_ effect.

L. Provide an option to allow printing of the model matrix when the combination of NOGLS and ML options is used. These options will frequently be used together by those who do not wish the WLS output, but are fitting a moderate sized model to an array of grouped categorical data. They would find the model matrix useful, are not doing logistic regression, are not otherwise at risk of generating a monstrous output, and should be able to see the design.

M. Allow the user to print the non-full rank design matrix for the log-linear model, rather than the reduced design matrix, as the former is easier for most users to check and interpret.

Acknowledgements

The author is grateful to Sandra Emerson of SAS Institute, and to Beth Richardson, Vicki Dingler and Joan Alster of the University of Illinois Computing Services Office, for making available a test version of SAS 5.0 for exploration of PROC CATMOD during the preparation of this paper. William Stanish, developer of CATMOD at the Institute, served extensively as a consultant with regard to CATMOD's inner workings and those of PROC FUNCAT. Katherine Council and Andy Littleton of the Institute, and Gary Koch of the University of North Carolina, provided unusual editorial cooperation to allow production of the manuscript in time to appear in these Proceedings. Ann Thomas of the University of North Carolina at Chapel Hill typed the manuscript rapidly and efficiently. Partial support for the activities at the Department of Biostatistics, University of North Carolina was provided through Joint Statistical Agreement JSA 84-5 with the U.S. Bureau of the Census. Fred Mosteller introduced the author to the data in problems of Section 4.4. J. S. Stamm and D. W. Banting kindly permitted use of the data from the Stratford-Woodstock Root Caries Studies in Section 4.7. William Stanish kindly reviewed the manuscript but bears no responsibility for any remaining errors.


References

Agresti, A. [1984]. Analysis of Ordinal Categorical Data. New York: Wiley.

Andrich, D. [1979]. Biometrics 35, 403-415.

Bishop, Y.M.M. [1969]. Biometrics 25, 383-400.

Bishop, Y.M.M., Fienberg, S.E. and Holland, P.W. [1975]. Discrete Multivariate Analysis. Cambridge, MA: MIT Press.

Brown, M.B. [1981]. In BMDP Statistical Software, Eds. W. J. Dixon, et al., 143-208. Los Angeles: University of California Press.

Casjens, L. [1974]. The Prehistoric Human Ecology of Southern Ruby Valley, Nevada. Doctoral Dissertation. Harvard University, Department of Anthropology.

Dunn, J.E. [1985]. In SUGI-SAS Users Group 10th Conference Proceedings, 989-998.

Fienberg, S.E. [1980]. The Analysis of Cross-Classified Categorical Data. 2nd Ed. Cambridge, MA: MIT Press.

Goodman, L.A. [1983]. Biometrics 39, 149-160.

Grizzle, J.E., Starmer, C.F. and Koch, G.G. [1969]. Biometrics 25, 489-504.

Harrell, F.E., Jr. [1983]. In SUGI Supplemental Library User's Guide, Ed. S. P. Joyner, 181-202. Cary, NC: SAS Institute, Inc.

Koch, G.G., Imrey, P.B., Freeman, D.H., Jr. and Tolley, H.D. [1976]. In Proc. 9th Int. Biometric Conf. I, 317-336. Raleigh, NC: The Biometric Society.

Koch, G.G., Imrey, P.B., Singer, J.S., Atkinson, S. and Stokes, M.E. [1985]. Lecture Notes on Categorical Data Analysis. Montreal: University of Montreal.

Koch, G.G., Landis, J.R., Freeman, J.L., Freeman, D.H., Jr. and Lehnen, R.G. [1977]. Biometrics 33, 133-158.

Mosteller, F. and Parunak, A. [1985]. In Exploring Data Tables, Trends, and Shapes, Eds. D. C. Hoaglin, F. Mosteller and J. Tukey, Ch. 5. New York: Wiley, in press.

Sall, J.P. [1982]. In SAS User's Guide: Statistics, Ed. A. A. Ray, 257-286. Cary, NC: SAS Institute, Inc.

SPSS, Inc. [1983]. SPSSX User's Guide, 541-570. New York: McGraw Hill.

Stock, M.C., Downs, J.B., Gauer, P.K., Alster, J.M. and Imrey, P.B. [1985]. Chest 87, 151-157.

Stokes, M.E. and Koch, G.G. [1983]. In SUGI-SAS Users Group 8th Conference Proceedings, 795-800.

*SAS is the registered trademark of SAS Institute, Inc., Cary, NC, USA. SPSSX is the registered trademark of SPSS, Inc., Chicago, IL, USA.

TABLE 1. CANONICAL DATA ARRAY

                            Responses
Populations      1       2       3      ...      r

     1          n11     n12     n13     ...     n1r
     2          n21     n22     n23     ...     n2r
    ...         ...     ...     ...     ...     ...
     s          ns1     ns2     ns3     ...     nsr

TABLE 2. ROOT CARIES IN QUADRANTS OF THE MOUTH, BY AGE AND TOWN: STRATFORD-WOODSTOCK CARIES STUDY

Quadrant
Left Maxillary      N N N N N N N N Y Y Y Y Y Y Y Y
Right Maxillary     N N N N Y Y Y Y N N N N Y Y Y Y
Left Mandibular     N N Y Y N N Y Y N N Y Y N N Y Y
Right Mandibular    N Y N Y N Y N Y N Y N Y N Y N Y

Age    Town         Counts, in the column order of the quadrant patterns above
30-49  Stratford    139 3 1 0 8 3 2 2 3 1 0 0 4 0 1 1
30-49  Woodstock    95 10 7 6 3 3 0 1 6 2 1 1 4 0 5 5
50-59  Stratford    61 5 5 2 7 2 0 0 3 2 1 0 0 0 3
50-59  Woodstock    43 2 7 3 5 2 0 1 7 0 3 4 2 2 4
60+    Stratford    31 5 1 2 5 0 0 2 3 0 1 0 2 0 3 5
60+    Woodstock    28 5 6 3 5 3 2 3 1 2 5 4 3 6 3 4

TABLE 3. ATELECTASIS 24 HOURS AFTER ABDOMINAL SURGERY, BY RESPIRATORY THERAPY: OBSERVED COUNTS AND LOGITS BY TREATMENT, WITH FITTED COUNTS UNDER INDEPENDENCE AND LOGIT DESIGN MATRIX*

                    Atelect Counts
              Observed            Fitted           Observed    Logit Model Matrix
Treatment   Absent  Present    Absent  Present      Logits     Mean    Treatment

CDB           13       6        11.2     7.8         .773        1       1    0
CPAP          13       9        12.9     9.1         .368        1       0    1
IS            11      11        12.9     9.1         .000        1      -1   -1

*From Stock, M.C. et al., Chest 87, 151-157; 1985.

TABLE 4. LOG RATIOS OF ALL COUNTS TO LAST COUNT FROM ATELECTASIS DATA: OBSERVED VALUES, SATURATED FULL-RANK MODEL MATRIX FOR U-TERMS, AND FITTED VALUES UNDER INDEPENDENCE

                        Log Ratio to IS-Present      Saturated Log-Linear Model Matrix for U-terms
                                   Fitted Under                              Atelect*   Atelect*
Treatment   Atelect    Observed    Independence      Atelect   CDB   CPAP      CDB        CPAP

CDB         Absent       .167          .206             2       2      1         0         -1
CDB         Present     -.606         -.147             0       2      1        -2         -1
CPAP        Absent       .167          .353             2       1      2        -1          0
CPAP        Present     -.201          .000             0       1      2        -1         -2
IS          Absent       .000          .353             2       0      0        -2         -2
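
The entries of Tables 3 and 4 can be checked by direct arithmetic. The following is a minimal sketch in Python rather than SAS; it is not part of the original analysis, and all variable and function names are illustrative assumptions. It recomputes the observed logits, the fitted counts under independence (row total times column total, divided by the grand total), and the log ratios of every count to the IS-Present count.

```python
import math

# Observed counts from Table 3: treatment -> (atelectasis absent, present)
observed = {"CDB": (13, 6), "CPAP": (13, 9), "IS": (11, 11)}

n = sum(a + p for a, p in observed.values())          # grand total
total_absent = sum(a for a, _ in observed.values())   # column total, absent
total_present = n - total_absent                      # column total, present

# Observed logit for each treatment: log(absent / present)
logits = {t: math.log(a / p) for t, (a, p) in observed.items()}

# Fitted counts under independence of treatment and atelectasis
fitted = {
    t: ((a + p) * total_absent / n, (a + p) * total_present / n)
    for t, (a, p) in observed.items()
}


def log_ratios(counts):
    """Log ratio of every cell to the last cell (IS, Present), as in Table 4."""
    reference = counts["IS"][1]
    return {
        (t, status): math.log(c / reference)
        for t, cells in counts.items()
        for status, c in zip(("Absent", "Present"), cells)
    }


observed_ratios = log_ratios(observed)
fitted_ratios = log_ratios(fitted)

for t in observed:
    print(t, round(logits[t], 3), [round(f, 1) for f in fitted[t]])
for cell in observed_ratios:
    print(cell, round(observed_ratios[cell], 3), round(fitted_ratios[cell], 3))
```

Rounded to the precision of the tables, these values agree with the printed entries, for example .773 for the CDB logit and .206 for the fitted CDB-Absent log ratio.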

TABLE 5. NORTH CAROLINA MOTOR VEHICLE ACCIDENTS YIELDING SERIOUS DRIVER INJURY, BY SPEED, TIME OF DAY, PLACE AND YEAR: OBSERVED COUNTS AND LOG-LINEAR MODEL FITTED COUNTS*

                                       Year
                                 1973                  1m
Speed (MPH)   Time    Place      Observed   Fitted     Observed   Fitted

≤ 55          Night   Urban        121       120         125       126
≤ 55          Night   Rural**      374       369         383       388
≤ 55          Day     Urban        232       236         197       193
≤ 55          Day     Rural        697       699         575       573
> 55          Night   Urban         27        22          13        18
> 55          Night   Rural        278       289         252       241
> 55          Day     Urban          5         5           4         4
> 55          Day     Rural        200       193         119       126

*Compiled by Highway Safety Research Center, University of North Carolina at Chapel Hill.
**Includes both all rural locations and urban interstate highways.

TABLE 6. ANALYSIS OF INDIVIDUAL PARAMETERS (U-TERMS) FROM FINAL LOG-LINEAR MODEL FOR ACCIDENT DATA

Effect               Estimate    S.E.    Chi-square    p-value

Speed                  0.933     .048      371.33      <.0001
Time                  -0.121     .048        6.30       .0121
Speed*Time             0.388     .048       64.28      <.0001
Place                  1.045     .048      467.45      <.0001
Speed*Place           -0.493     .048      103.81      <.0001
Time*Place             0.118     .048        5.99       .0144
Speed*Time*Place      -0.128     .048        6.97       .0083
Year                   0.096     .019       24.24      <.0001
Speed*Year            -0.058     .020        8.52       .0035
Time*Year              0.062     .017       12.92       .0003

TABLE 7. JOINT RESPONSES OF 46 SUBJECTS TO ADMINISTRATION OF DRUGS A, B AND C: OBSERVED COUNTS AND FITTED COUNTS UNDER LOG-LINEAR MODEL WITH A AND B SYMMETRIC, C INDEPENDENT OF A AND B (F = FAVORABLE, U = UNFAVORABLE)

Response Pattern
      Drug            Observed    Fitted
 A     B     C         Count      Count

 F     F     F            6        7.65
 F     F     U           16       14.35
 F     U     F            2        2.09
 F     U     U            4        3.91
 U     F     F            2        2.09
 U     F     U            4        3.91
 U     U     F            6        4.17
 U     U     U            6        7.83
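
The fitted column of Table 7 can be reproduced in closed form. The sketch below is in Python rather than SAS and is offered only as an arithmetic check, not as the procedure used in the paper; the names are illustrative assumptions. It assumes that, under the stated model, the fitted count for each cell is the symmetrized (A, B) margin times the C margin, divided by the total of 46 subjects.

```python
# Observed counts from Table 7, keyed by the responses to drugs (A, B, C)
observed = {
    ("F", "F", "F"): 6, ("F", "F", "U"): 16,
    ("F", "U", "F"): 2, ("F", "U", "U"): 4,
    ("U", "F", "F"): 2, ("U", "F", "U"): 4,
    ("U", "U", "F"): 6, ("U", "U", "U"): 6,
}
n = sum(observed.values())  # 46 subjects

# Marginal counts for the (A, B) pair and for C alone
ab_margin = {}
c_margin = {}
for (a, b, c), count in observed.items():
    ab_margin[(a, b)] = ab_margin.get((a, b), 0) + count
    c_margin[c] = c_margin.get(c, 0) + count

# Symmetry of A and B: average each (a, b) margin with its mirror (b, a)
ab_symmetric = {
    (a, b): (ab_margin[(a, b)] + ab_margin[(b, a)]) / 2
    for (a, b) in ab_margin
}

# Independence of C from (A, B): product of the two margins over n
fitted = {
    (a, b, c): ab_symmetric[(a, b)] * c_margin[c] / n
    for (a, b, c) in observed
}

for pattern, value in fitted.items():
    print(" ".join(pattern), observed[pattern], round(value, 2))
```

In these data the observed (F, U) and (U, F) margins are both 6, so the symmetry constraint leaves the (A, B) margin unchanged and the lack of fit comes entirely from the independence of C.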

TABLE 8. NUMBER OF ARTIFACTS BY DISTANCE TO PERMANENT WATER

Distance to Permanent Water (j):
  1 = Immediate vicinity       4 = 1/2 to 1 mile
  2 = Within 1/4 mile(a)       5 = 1 to 3 miles
  3 = 1/4 to 1/2 mile          6 = Over 3 miles

    Artifact                                       j =  1     2     3     4     5     6

 1  Specialized unifaces                               20   102    54    38    29     3
 2  Unifaces 2 or more edges retouched                 33   136    86    58    56     7
 3  Unifaces 1 edge retouched                          27   122    68    51    53     0
 4  Limited bifacial retouched                          2    10     8     5     4     0
 5  Large, heavy tools                                 11    82    34    35    30     2
 6  Whole biface                                       10    53    25    17    17     3
 7  Round biface, base snapped, side notched           39   185    88   100    58    13
 8  Pointed biface, base snapped, side notched         34   179    70    78    60    11
 9  Rectangular biface, base snapped, side notched     26    78    24    26    14     6
10  Biface midsection                                  24    88    32    41    26     3
11  Humboldt Pinto Northern (projectile points)         8    44    16    28    39     3
12  Elko Gypsum (projectile points)                    15    75    30    35    27     8
13  Eastgate and Rose Spring (projectile points)       11    32     5    11    21     2
14  Cottonwood Desert, side notched                    12    28     5    18     7     0
15  Drills                                              2    10     4     2     6     0
16  Pots                                                3     8     4     6     8     0
17  Grinding stones                                    13     5     3     9     7     0
18  Point fragments                                    20    36    19    20    28     1

(a) These counts exclude those in the Immediate vicinity column.

TABLE 9. TABULATION OF ANALGESIA RESPONSES OF PATIENTS WITH PAIN FROM CHRONIC JOINT DISEASE TO ONE OF TWO DRUG PREPARATIONS, BY SITE OF PAIN AND TREATMENT CENTER

Pain                             Clinical Response
Site    Center    Drug        Good    Medium    Poor    Total

I       1         Active        5       20        3       28
I       1         Placebo       8       14       11       33
I       2         Active        0       12       12       24
I       2         Placebo       0       10       11       21
II      1         Active       12       14        3       29
II      1         Placebo       5       13        6       24
II      2         Active        4        9        3       16
II      2         Placebo       3        9        6       18

Total                          37      101       55      193

TABLE 10. ESTIMATED PARAMETERS, STANDARD ERRORS AND WALD CHI-SQUARE TESTS FOR EQUAL ADJACENT ODDS AND PROPORTIONAL ODDS MODELS FOR THE CHRONIC JOINT PAIN CLINICAL TRIAL DATA

               ML Fit of Equal Adjacent Odds              WLS Fit of Proportional Odds
                                 Chi-square                               Chi-square
Parameter    Estimate    S.E.     (1 df)       p       Estimate    S.E.     (1 df)       p

Drug          -.446      .229      3.79      .0517      -.504      .285      3.13      .0768
Center        -.899      .243     13.63      .0002      -.918      .296      9.59      .0020
Site           .675      .234      8.34      .0039       .725      .287      6.37      .0116
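
The Wald chi-square statistics in Table 10 are, up to rounding of the printed estimates and standard errors, the squared ratios of estimate to standard error, each referred to a chi-square distribution with 1 degree of freedom. The Python sketch below is an illustrative check and not output from CATMOD; the function names are assumptions introduced here.

```python
import math


def wald_chi_square(estimate, std_error):
    """Wald statistic with 1 df: squared ratio of estimate to standard error."""
    return (estimate / std_error) ** 2


def p_value_1df(chi_square):
    """Upper-tail probability of a chi-square variate with 1 df."""
    return math.erfc(math.sqrt(chi_square / 2.0))


# (estimate, S.E., reported chi-square) for each parameter in Table 10
ml_equal_adjacent_odds = {
    "Drug": (-0.446, 0.229, 3.79),
    "Center": (-0.899, 0.243, 13.63),
    "Site": (0.675, 0.234, 8.34),
}
wls_proportional_odds = {
    "Drug": (-0.504, 0.285, 3.13),
    "Center": (-0.918, 0.296, 9.59),
    "Site": (0.725, 0.287, 6.37),
}

for label, fit in (("ML", ml_equal_adjacent_odds), ("WLS", wls_proportional_odds)):
    for name, (estimate, std_error, reported) in fit.items():
        recomputed = wald_chi_square(estimate, std_error)
        print(label, name, round(recomputed, 2), reported,
              round(p_value_1df(reported), 4))
```

The recomputed statistics differ from the printed ones only in the second decimal place, which is what one would expect when the inputs are rounded to three decimals.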

TABLE 11. ARTIFICIAL DATA SHOWING REGRESSED BREAST TUMORS VS. TOTAL PALPATED BREAST TUMORS IN TWO STRAINS OF RATS TREATED WITH DIMETHYLBENZANTHRACENE, FOR CONTROLS AND RATS FED A NEW ANTI-TUMOR AGENT, AMONG RATS WITH PALPATABLE TUMORS

                          Total Palpated       Regressed Tumors (Tumreg)
Group           Strain    Tumors (Tumtot)      0 1 2 3

Experimental    A         1                    4 5
Experimental    A         2                    3 6 6
Experimental    A         3                    4 4 3
Control         A         1                    6 2
Control         A         2                    13 3 2
Control         A         3                    10 1 2 2
Experimental    B         1                    8 4
Experimental    B         2                    4 4 5
Experimental    B         3                    3 2 4 4
Control         B         1                    9 0
Control         B         2                    8 3 3
Control         B         3                    10 1 2 3

TABLE 12. ESTIMATED PARAMETERS AND WALD TESTS OF SIGNIFICANCE FOR ARTIFICIAL RAT TUMOR REGRESSION DATA

                             Saturated Model                    Main Effects Model
                                   Chi-square                         Chi-square
Effect                 Estimate     (1 df)       P        Estimate     (1 df)       P

Strain                  -.1012       0.42       .52        -.0958       0.40       .53
Treatment               -.5570      12.82       .0003      -.5576      12.86       .0003
Strain by Treatment     -.0222       0.02       .89
Lack-of-fit                                                              .02       .89

TABLE 13. RESULTS OF AVERAGED MAIN EFFECTS MODEL FOR ROOT CARIES PREVALENCE DATA

Analysis of Variance Table

Source        DF    Chi-Square    Prob

Intercept      1      358.57      .0001
Town           1       18.45      .0001
Agecat         2       37.35      .0001
Residual      20       27.56      .1202

Effect       Parameter    Estimate    S.E.    Chi-Square    Prob

Intercept        1         -1.446     .076      358.57      .0001
Town             2          -.328     .076       18.45      .0001
Agecat           3          -.502     .103       23.99      .0001
Agecat           4          -.062     .107        0.34      .5607

                         Predicted Marginal Quadrant
Town         Agecat      Caries Prevalence

Stratford    30-49            .0931
Stratford    50-59            .1375
Stratford    60+              .2298
Woodstock    30-49            .1248
Woodstock    50-59            .1813
Woodstock    60+              .2920