Transcript
Page 1: Publishing Nutrition Research: A Review of Multivariate Techniques—Part 1


RESEARCH

Review

Publishing Nutrition Research: A Review of Multivariate Techniques—Part 1

PATRICIA M. SHEEAN, PhD, RD; BARBARA BRUEMMER, PhD, RD; PHILLIP GLEASON, PhD; JEFFREY HARRIS, DrPH, RD, LDN; CAROL BOUSHEY, PhD, MPH, RD; LINDA VAN HORN, PhD, RD


ABSTRACT
This article is the seventh in a series reviewing the importance of research design, analyses, and epidemiology in the conduct, interpretation, and publication of nutrition research. Although there are a variety of factors to consider before conducting nutrition research, the techniques used to conduct the statistical analysis are fundamental for translating raw data into interpretable findings. The statistical approach must be considered during the design phase of any study and often involves the use of multivariate analytical techniques. Multivariate analytical techniques represent a variety of mathematical models used to measure and quantify an exposure–disease or an exposure–outcome association, taking into account important factors that can influence this relationship. The primary purpose of this review is to introduce the more commonly used multivariate techniques, including linear and logistic regression (simple and multiple), and survival analyses (Kaplan Meier plots and Cox regression). These techniques are described in detail, providing basic definitions and practical examples with nutrition relevancy. An appreciation for the general principles within and presented previously in this article series is vital for enhancing the rigor with which nutrition-related research is implemented, reviewed, and published.
J Am Diet Assoc. 2011;111:103-110.

P. M. Sheean is an assistant professor, and L. Van Horn is a professor, Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL. B. Bruemmer is a senior lecturer, Graduate Program in Nutrition Sciences, and director, Graduate Coordinated Program in Dietetics, University of Washington, Seattle. P. Gleason is a senior researcher, Mathematica Policy Institute, Inc, Geneva, NY. J. Harris is a professor and Didactic Program director, Department of Health, West Chester University of Pennsylvania, West Chester, PA. C. Boushey is an associate professor and director, Coordinated Program in Dietetics, Department of Foods and Nutrition, Purdue University, West Lafayette, IN.

Address correspondence to: Patricia M. Sheean, PhD, RD, Northwestern University, Feinberg School of Medicine, Department of Preventive Medicine, 680 N Lake Shore Dr, Ste 1400, Chicago, IL 60611. E-mail: [email protected]. Accepted: July 2, 2010.
Copyright © 2011 by the American Dietetic Association. 0002-8223/$36.00
doi: 10.1016/j.jada.2010.10.010

This article is the seventh in a series reviewing the importance of research design, analyses, and epidemiology in the conduct, interpretation, and publication of nutrition research. Other articles in this series focused on topics including study design and hypotheses development (1); sampling techniques, sample size, and critical elements of manuscript preparation (2); nonparametric procedures (3); qualitative research (4); epidemiologic methods (5); and, most recently, measurement and interpretation of nutrition-related outcomes and diagnostic tools (6). Collectively, the aim of this series is to provide the Journal readership with tools to enhance general understanding of key concepts inherent in high-quality nutrition research by providing relevant examples and additional resources. These articles are intended to serve as a review for more experienced researchers and also to offer practical, simple explanations for those who are new to the field. An appreciation for the general principles outlined in each of these articles is vital for enhancing the rigor with which nutrition-related research is implemented, reviewed, and published.

Although there are a variety of factors to consider before conducting nutrition research, the techniques used to conduct statistical analysis are fundamental for translating raw data into interpretable findings. The statistical approach must be considered during the design phase of any study and often involves the use of multivariate analytical techniques. By definition, a multivariate analysis (or multivariate modeling) is an efficient analytical tool used to control for confounding effects, to assess effect modification, and to summarize the association of several predictor variables with some outcome variable of interest (7). More simply, multivariate analytical techniques represent a variety of mathematical models often used in epidemiologic research to classically measure and quantify an exposure–disease or, more practically, an exposure–outcome association, taking into account important factors that can influence this relationship. For instance, in most nutrition studies involving human participants, multivariate techniques must be employed to adjust or control for the effects of basic demographic factors (ie, age, race/ethnicity, and sex) on the outcome of interest. Failure to control for these variables, in addition to others, can lead to spurious results and inappropriate inferences regarding the true exposure–disease or exposure–outcome relationship. Essentially, multivariate techniques allow for the analysis of the relationship between more than one independent variable and one or more dependent variables. Thus, the primary purpose of this review is to introduce the more commonly used multivariate techniques, including linear and logistic regression, and survival analyses. These techniques are highly prevalent in the literature and are described in detail, providing basic definitions (Figure 1) and practical examples with nutrition relevancy. Future articles will address other less common, but equally important, multivariate techniques.

Journal of the AMERICAN DIETETIC ASSOCIATION 103

LINEAR REGRESSION
Linear regression is a commonly applied statistical technique used to assess the relationship between two or more variables where the dependent variable is quantitative. In general, the assumption for linear regression is that the independent variables (x) and the dependent variable (y) are linearly related (Figure 2). Estimating the parameters of a linear regression model is typically done using the least squares method, and linear regression models are often used to assess how well a given set of covariates (or x values) can predict the outcome of interest (or y). For example, suppose an investigator would like to explore the relationship between admission blood glucose and hospital length of stay (LOS). The researcher might hypothesize that patients with higher blood glucose on admission will have longer hospital LOS. To investigate this hypothesis, the researcher collects data from a sample of hospital patients on their admission blood glucose levels and their LOS. A simplistic approach to analyzing these data would be to estimate Pearson linear correlation coefficients between the two continuous variables. However, correlation coefficients can only evaluate the strength of the linear relationship between two variables. In this example, they cannot be used for prediction or to tell us what LOS to expect for someone with a given blood glucose level. To estimate how much longer patients remain in the hospital for every unit increase in admission blood glucose, it is essential to find a formula for the best straight line through the observed data points. The results from such a regression can also be used to predict the LOS for a patient with a given set of characteristics including admission blood glucose.

Figure 1. Basic terminology and definitions associated with multivariate statistical techniques. References: (a) Last JM. A Dictionary of Epidemiology. 4th ed. New York, NY: Oxford University Press; 2001. (b) Riegelman RD, Hirsch RP. Studying a Study and Testing a Test. 3rd ed. Boston, MA: Little, Brown and Co; 1996.

Figure 2. Graphic representation of linear regression.

In a simple regression model, a linear relationship between one independent variable (x) and a dependent variable (y) is expressed as $y = \beta_0 + \beta_1 x$, where $\beta_0$ is the y-intercept (ie, the estimated value of y when $x = 0$), and $\beta_1$ is the regression coefficient (the estimated increase in the dependent variable for every unit increase in the independent variable). Using the previous example, y is the hospital LOS, x is the admission blood glucose, and $\beta_1$ represents the additional time spent in the hospital for every 1-unit increase in admission blood glucose. This type of modeling is used to depict simple relationships between two linearly related variables. Clearly, there would be other clinical factors to consider in this relation (eg, age, diagnoses, and comorbidities), which would require more sophisticated modeling.

Multiple linear regression modeling is considered an expansion of the simple linear regression model. In this type of modeling, additional independent variables (or covariates) are added to the regression equation (eg, $x_1$, $x_2$, and $x_3$) to assess the strength of the association between one dependent variable and one independent variable (eg, admission blood glucose), to improve its predictive abilities for the outcome variable, and to better fit the data to a straight line. (It should be noted that multiple regression refers to multiple independent variables for a single outcome variable and multivariate regression refers to the analysis of multiple outcome variables in the same model.) Suppose that a simple linear regression reveals a positive relationship between admission blood glucose and the hospital LOS. Such a relationship could conceivably be explained by some other, third variable (eg, age, body mass index [BMI], or diagnosis). For example, patients with high admission blood glucose may also have high BMI values, and it may be their BMI rather than admission blood glucose that helps to explain the longer LOS. In this case, a simple regression model might show a positive relationship between admission blood glucose and LOS, but a multiple linear regression that controlled for BMI and other factors might show no relationship between admission blood glucose and LOS.

To explore this relationship further, BMI may be changed from a continuous independent variable to a categorical independent variable, and this is often accomplished through the creation of indicator variables (previously referred to as dummy variables). In this example, BMI could be a two-level categorical variable, such as obese vs not obese, and this is simple to code (eg, 1=obese, 0=nonobese) and to interpret in your regression model. However, if there is an interest in looking at a variety of levels of BMI, the investigator could create $k-1$ indicator variables, where $k$ is the total number of levels of BMI. In most nutrition research using BMI, there are generally four categorizations used; "normal" is most often the reference category to which all other indicator variables are coded and compared. Referring back to our example, indicator variables representing "obese," "overweight," or "underweight" could be created and inserted into the regression model each as an independent variable to compare the influence of these individual BMI categories on the blood glucose and hospital LOS relationship.
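The indicator coding described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the category labels, the `bmi_` prefix, and the function name are hypothetical and not taken from the article, and "normal" is treated as the reference level as the text suggests.

```python
# Hypothetical BMI category labels; "normal" is the reference category.
BMI_LEVELS = ["underweight", "normal", "overweight", "obese"]
REFERENCE = "normal"


def indicator_variables(category, levels=BMI_LEVELS, reference=REFERENCE):
    """Return k-1 indicator (0/1) variables for one observation.

    The reference level gets no variable of its own: an observation in the
    reference category is coded 0 on every indicator.
    """
    if category not in levels:
        raise ValueError(f"unknown category: {category!r}")
    return {
        f"bmi_{level}": int(category == level)
        for level in levels
        if level != reference
    }


# Four illustrative patients, one per BMI category.
patients = ["normal", "obese", "overweight", "underweight"]
coded = [indicator_variables(c) for c in patients]
for label, row in zip(patients, coded):
    print(label, row)
```

With four levels the sketch produces three indicators, so each coefficient in a regression would be read as a contrast against the "normal" group.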

To enhance understanding of multiple regression modeling, consider a well-known example: the Harris-Benedict equation. Based on experiments with indirect calorimetry, multiple regression models were developed to estimate resting energy expenditure (REE) in adult men and women as predicted by a set of characteristics of 239 men and women (8). For illustrative purposes, we will consider the following equation for women, where resting energy expenditure is the dependent variable (y):

$\mathrm{REE} = 655 + 9.6 \times (\text{weight in kg}) + 1.8 \times (\text{height in cm}) - 4.7 \times (\text{age in years})$

The regression coefficients in the Harris-Benedict equation can be interpreted as follows:

- $\beta_0$ (the intercept) is equal to 655, corresponding to the estimate of REE when an individual has 0 weight, 0 height, and is 0 years of age. The intercept, as in this case, is often biologically implausible and simply reflects a mathematical extrapolation with no meaningful interpretation.
- The regression coefficient for weight ($\beta_1 = 9.6$) reflects the estimated average increase in a person's REE for each 1 kg increase in their weight, while holding height and age constant.
- The regression coefficient for height ($\beta_2 = 1.8$) represents the estimated average increase in REE for each 1 cm increase in a person's height, while holding weight and age constant.
- The regression coefficient for age ($\beta_3 = -4.7$) has a slightly different interpretation due to the negative sign, reflecting the estimated average decrease in REE for each 1-year increase in age, while holding weight and height constant.
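As a small worked illustration of these interpretations, the equation for women can be written as a Python function using the rounded coefficients quoted above. The function name and the example person are hypothetical; the output unit is assumed to be kcal/day, the unit conventionally used for the Harris-Benedict equation.

```python
def ree_women(weight_kg, height_cm, age_years):
    """Estimated resting energy expenditure for women (assumed kcal/day),
    using the rounded Harris-Benedict coefficients quoted in the text."""
    return 655 + 9.6 * weight_kg + 1.8 * height_cm - 4.7 * age_years


# Hypothetical example: a 30-year-old woman, 60 kg, 165 cm.
baseline = ree_women(60, 165, 30)
print(f"Estimated REE: {baseline:.0f}")

# Each coefficient is the change in REE for a 1-unit change in one
# covariate while the other two are held constant.
per_kg = ree_women(61, 165, 30) - baseline
per_cm = ree_women(60, 166, 30) - baseline
per_year = ree_women(60, 165, 31) - baseline
print(f"+1 kg: {per_kg:+.1f}, +1 cm: {per_cm:+.1f}, +1 year: {per_year:+.1f}")
```

Changing one argument at a time and differencing is simply the "holding the others constant" reading of each coefficient made explicit.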

An important assumption of the Harris-Benedict equation is that there is no effect modification (or interaction) between the covariates in the model (ie, weight×height, weight×age, or weight×sex). That is to say, changes in REE per unit change in weight are assumed to be constant for individuals of all heights, ages, and sex. If this assumption is incorrect, then regression models based on the effect modifier (ie, weight, height, and sex) must be conducted and presented. For example, a researcher could estimate separate regression models for individuals with different values of the variable that is believed to modify the effect (ie, separate models for men and women). Alternatively, the researcher could estimate a single regression model, but interact the independent variable of interest (eg, weight) with the covariate believed to be the effect modifier (eg, sex). Considering there are sex-specific Harris-Benedict equations, it is likely the REE varied significantly by sex; thus, two separate formulae were developed to better reflect more precise estimates of the indirect calorimetry data.

Although simple and multiple linear regression modeling techniques are used extensively in nutrition research, it is important to remember that these estimates are subject to random error and uncertainty. This uncertainty is recognized in regression equations in a couple of different ways. First, researchers should estimate the standard error of each of the regression coefficients and use the standard errors to calculate the confidence intervals (CIs) around these regression coefficient point estimates (7). The standard error and associated CI reflect the uncertainty of the estimation process by providing information on the extent to which the true value of the regression coefficient could differ from the point estimate and still remain consistent with the data used to estimate the model. Given this uncertainty in the point estimate of a regression coefficient, researchers often conduct tests of the statistical significance of a regression coefficient to determine whether a claim can be confidently made that the regression model provides evidence of a true relationship, positive or negative, between the covariate and outcome. If a regression coefficient is positive and statistically significant, the researcher can claim evidence of a positive relationship between the covariate and outcome, even after controlling for the other covariates in the model.

Second, researchers should consider the coefficient of determination, or the $R^2$, in linear regression modeling. The $R^2$ will relay how well the regression line approximates the real data points; the higher the $R^2$, the better the agreement between the observed and modeled values. An $R^2$ value of 1.0 indicates perfect agreement; an $R^2$ value of 0.0 indicates no agreement. Another way of interpreting $R^2$ values is that they indicate the proportion of the variation in the dependent variable explained collectively by the independent variables in the model. In the previous example, an $R^2$ value of 0.50 would indicate that half of the variation in adults' REE can be explained by differences or variation in their heights, weights, and ages. When additional covariates are added to a linear regression model (eg, $x_1$ and $x_2$), the $R^2$ value will either remain constant or more likely increase, since additional information should help predict values in the dependent variable, even if only by chance. Thus, in addition to the simple $R^2$ value, adjusted $R^2$ values are generated in multiple linear regression models. These values will increase only if the new covariate improves the model more than would be expected by chance.
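To make the $R^2$ and adjusted $R^2$ discussion concrete, the sketch below fits a simple least squares line and computes both quantities in plain Python. The admission glucose and LOS values are invented for illustration only, the function names are hypothetical, and the adjusted $R^2$ uses the common textbook formula $1 - (1 - R^2)(n - 1)/(n - p - 1)$, which the article itself does not spell out.

```python
def fit_simple_ols(x, y):
    """Least squares estimates (b0, b1) for the line y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


def r_squared(x, y, b0, b1):
    """Proportion of the variation in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 for n observations and p covariates (assumed formula)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Invented data: admission blood glucose (mg/dL) and hospital LOS (days).
glucose = [90, 110, 130, 150, 170, 190]
los = [3, 4, 4, 6, 7, 8]

b0, b1 = fit_simple_ols(glucose, los)
r2 = r_squared(glucose, los, b0, b1)
adj_r2 = adjusted_r_squared(r2, n=len(los), p=1)
print(f"slope = {b1:.4f} days per mg/dL, intercept = {b0:.2f}")
print(f"R^2 = {r2:.3f}, adjusted R^2 = {adj_r2:.3f}")
```

The adjusted value is never larger than the unadjusted one, which reflects the penalty for covariates that add little beyond chance.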

There are also additional considerations that are important in conducting and reporting results from multiple linear regression models. First, when interpreting the results of any regression equation, it is vital to report the unit to which the regression function corresponds. For example, if an investigator were interested in examining the relationship between sodium intake and diastolic blood pressure, it would be meaningless to report the independent variable in 1-mg sodium increments. Increments of perhaps 1,000 mg sodium would likely be more interpretable. Second, it is also important to consider the context of the regression equation. For example, the Harris-Benedict equation was developed nearly a century ago and may not reflect the body weight and racial diversity prevalent in society today, thereby limiting its applicability for certain clinical populations. As with any statistical tool, the reader should be well informed of its inherent and contextual limitations.
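The point about reporting units can be shown with simple arithmetic: because a linear regression coefficient is a slope, re-expressing it per 1,000 mg rather than per 1 mg just multiplies it by 1,000. The coefficient below is a made-up number used only to show the rescaling, and the function name is hypothetical.

```python
def rescale_coefficient(beta_per_unit, units_per_increment):
    """Convert a slope per 1 unit of x into a slope per larger increment of x."""
    return beta_per_unit * units_per_increment


# Hypothetical slope: mm Hg of diastolic blood pressure per 1 mg of sodium.
beta_per_mg = 0.002
beta_per_1000_mg = rescale_coefficient(beta_per_mg, 1000)
print(f"{beta_per_mg} mm Hg per mg = {beta_per_1000_mg:.1f} mm Hg per 1,000 mg")
```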

LOGISTIC REGRESSION
Although linear regression is appropriate when the dependent variable is quantitative, logistic regression modeling is used in epidemiologic studies or other nutrition-related research where the dependent variable is binary (eg, 0 and 1) or dichotomous (eg, yes/no or alive/deceased). For example, the probability that an individual will develop coronary heart disease by a certain age might be predicted by family history, serum cholesterol, body mass index, and diet quality. The basic logistic regression model can be expressed mathematically as follows:

$\log\left(\frac{P}{1-P}\right) = \text{log odds} = \beta_0 + \beta_1 X$,

where P denotes the probability of the outcome, the intercept ($\beta_0$) is an estimate of the log odds of the outcome when $X = 0$, and the coefficient ($\beta_1$) reflects the estimated increase in log odds of the outcome per 1-unit increase in X (7). A graphic depiction of this mathematical formulation is presented in Figure 3. Fundamentally, it is critical to understand that this type of modeling is designed to describe the probability of an outcome given a set of numerical or categorical risk factors. The logistic function, or (P/1−P), describes a probability and, therefore, has a limited range between 0 and 1. The end value reflects the natural logarithm of the odds of the outcome, also called the logit or the log odds. This type of modeling may be used in case-control, cross-sectional, and prospective studies and is considered the most popular modeling procedure (9). This is due, in part, to the ease and interpretability of its regression estimates, which can easily translate into an odds ratio (OR). The OR expresses the odds of having the exposure among those with the outcome (cases) divided by the odds of having the exposure among those without the outcome, given a set of risk factors (5). An OR should always be accompanied by a 95% CI. The 95% CI is similar but distinctly different from the coefficient of determination used in linear regression. Specifically, the 95% CI relays the reliability of the point estimate and is calculated using information from the variance-covariance matrix. Although this derivation is beyond the scope of this review, it is important to understand that a narrow CI (eg, 2.2 to 2.8) implies high precision and a wide confidence interval (eg, 2.2 to 34.8) implies poor precision.

Figure 3. Graphic and mathematic equivalent formulations of the logistic regression function.

To better illustrate logistic regression modeling, suppose an investigator wanted to examine the relationship between the presence or absence of infection and parenteral nutrition administration in a cohort of patients with cancer. Here, the dependent variable is infection (eg, yes/no) and parenteral nutrition is the independent variable, or the exposure variable. This is simply presented as:

$\text{log odds} = \beta_0 + \beta_1(\text{parenteral nutrition})$,

where parenteral nutrition = 1 or 0. Inherent in this example is the need for a non-exposed group, or a control group where parenteral nutrition = 0. This group should have similar baseline clinical characteristics to the other group, but did not receive parenteral nutrition. Limiting the analyses to this simple relationship reveals a univariate logistic regression model, which is often referred to as the crude model. However, most analytical approaches require several other characteristics of the participants be considered to help increase the predictive capabilities of the model and to control for specific confounders (ie, factors known to influence the exposure–disease relationship). Similar to multivariate linear regression, this is known as multivariate logistic regression modeling. There are many approaches to model building, all of which include practical and mathematical decisions and some consideration of additional characteristics of the participants. For example, depending on the research question or population under study, most models control for age and sex. Some statistical programs can provide the best model when a set of given variables are provided; however, these best models should be reviewed carefully because they may not reflect a model that includes all expected variables. Often, some variables (eg, age, sex, and race/ethnicity) will need to be forced into the model based on convention, and the best approach may not take this into consideration.

To illustrate an approach to fitting a multivariate logistic regression model, we will continue with the previous example using the output shown in the Table. Each row of the table represents a different logistic regression model reflecting the odds of infection and a given set of independent variables; parenteral nutrition is considered the main exposure variable. The $\beta$, or intercept-only, model (Model 1) is often conducted to assess the background odds of the outcome. In this example, any patient admitted for this type of cancer treatment has a 40% likelihood of developing an infection, regardless of the exposure to parenteral nutrition. Model 2, or the crude model, reflects the odds of infection when only the exposure variable is added (when $X_1 = 1$). The crude model is often thought of as the comparative model because it serves as the baseline for which all other variables are considered for inclusion. In this model, patients admitted for this type of cancer treatment who receive parenteral nutrition are approximately 2.2 times more likely to develop an infection than patients who do not receive parenteral nutrition, not taking into account other clinical characteristics (ie, $\beta_2$, $\beta_3$, $\beta_4$, $\beta_5$, and $\beta_6$). In Models 3 through 5, we observe very little movement of the OR and 95% CI when age, sex, and race/ethnicity are included independently, respectively, or when added together (Model 8). In Models 6 and 7, there is a slight decrease in the OR and 95% CI when diagnosis or treatment, respectively, are added to the crude model. These changes reflect a confounding association of these independent variables on the exposure–disease relationship. In Models 9 and 10, we continue to observe a slight alteration in the OR and 95% CI when diagnosis and then treatment are included along with age, sex, and race/ethnicity. To interpret either of these models, one can say that patients admitted to the hospital for this cancer treatment who receive parenteral nutrition are about two times more likely to develop an infection after controlling for age, sex, race/ethnicity, diagnosis (Model 9), and treatment (Model 10).

However, in model building it is always critical to assess if there is evidence of effect modification (or interaction). This is necessary when the investigator suspects that the outcome may be significantly altered by the presence of another variable in the model. Typically, there is a rationale or biological plausibility underlying why the two independent variables may be interacting to greatly alter the point estimate. Testing for interaction entails adding product terms to your multivariate logistic regression model. For example, in the previous example, perhaps the investigator suspected that the risk of infection was significantly altered by diagnosis or a specific treatment modality. Using multivariate logistic regression, one can test these effects by inserting the product terms of parenteral nutrition×diagnosis and parenteral nutrition×treatment into the regression equation. As seen in Models 11 and 12, significant decreases in the ORs and accompanying 95% CIs occur. In this cohort a particular diagnosis confers a specific treatment (coded as 0 or 1); therefore, a more logical approach to presenting these data would be to present the results separately for each treatment regimen. Based on the models presented in the Table, the researcher can conclude that the odds of infection increase with parenteral nutrition administration in this cohort of patients with cancer and that this relationship is modified by which particular cancer treatment the patient receives and is largely unaffected by age, sex, or race/ethnicity. Additional modeling and analysis would then be required for each treatment stratum to best optimize the model reflecting the exposure–disease relationship of interest.
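The link between a logistic regression coefficient and its OR can be sketched with the crude model (Model 2) values reported in the article's Table: an exposure coefficient of .7687 with a standard error of .2192. The sketch assumes a Wald-type interval, exp(β ± 1.96 × SE); the article does not state how its intervals were computed, so small differences from the published limits are to be expected. The function name is hypothetical.

```python
import math


def odds_ratio_ci(beta, se, z=1.96):
    """Odds ratio and Wald-type CI limits from a log-odds coefficient.

    Assumes the interval is exp(beta - z*se) to exp(beta + z*se), with
    z = 1.96 for an approximate 95% interval.
    """
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


# Exposure coefficient and SE for the crude model (Model 2) in the Table.
or_pn, lower, upper = odds_ratio_ci(0.7687, 0.2192)
print(f"OR = {or_pn:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

Under these assumptions the result lands close to the 2.16 (1.40-3.30) reported for Model 2, which is the sense in which a coefficient "translates into" an OR.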

SURVIVAL ANALYSIS
Cohort studies allow us to follow an exposure, such as a specific diet or behavior, forward to an outcome such as death, hospitalization, or having a body mass index ≥30. The most powerful analyses of this type of study examine factors that influence the length of time until the outcome occurs. So in this context, survival is remaining in the disease-free state or being free from whatever physical condition (eg, obesity) is being analyzed.

Survival analysis has many of the same parameters asreviously described for logistic regression but adds ad-itional information to that approach on the time to thevent. Using our previous example of parenteral nutri-ion exposure and infection, we could use survival anal-sis to examine time until first infection in a group ofatients who received parenteral nutrition compared to aroup who did not receive parenteral nutrition. The in-estigator could test the hypothesis that infections wouldccur more rapidly in the parenteral nutrition exposedroup based on the higher incidence of hyperglycemiabserved in parenteral nutrition recipients. This type ofnalysis enhances the previous logistic regression find-ngs further by reinforcing the strength of the associationn a different context and provides a foundation to sup-ort a cause and effect premise. However, the difficulty inhis example is the varying time points of actual exposureince the non–parenteral nutrition group did not actuallyeceive parenteral nutrition. To determine these tempo-al changes in infections, standardized time framesould need to be created to allow for adequate compari-

ons for the groups across time.Another more straightforward example of survival

nalyses could be an investigator who wants to examinehe association between dietary cholesterol and heartisease in individuals over a 10-year period. Logistic re-ression could be used to test the association between theietary data, including dietary cholesterol intake, and thencidence of myocardial infarction and produce an OR toummarize the estimated risk. However, survival analy-is would take advantage of additional information thats not captured in the logistic regression model. The timeo the event may contribute important information on theifference between two diets (eg, defined Western dietnd low [total and saturated] fat, low cholesterol diet). Inheory, such research could conclude that individuals on

Table. Sample logistic regression models, parameter estimates, staexamining the associations of infection (the outcome) and parentera

Modela Intercept�SE Exposure�SE �2�SE

Model 1: Intercept .3336�.1073Model 2: PN �.0904�.1608 .7687�.2192Model 3: Age �.1366�.4548) .7712�.2204 .000921�.Model 4: Sex �.0255�.1835 .7749�.2196 �.1627�.Model 5: Race/ethnicity �.0916�.1951 .7689�.2197 .00234�.Model 6: Diagnosis �.3641�.2076 .7090�.2219 .4849�.Model 7: Treatment �.2353�.1731 .7186�.2215 .5761�.Model 8: Age�sex�race �.0673�.4959 .7777�.2216 .000705�.Model 9: Age�sex�race�diagnosis �1.0521�.6508 .7356�.2238 .0111�.Model 10: Age�sex�

race�diagnosis�treatment �1.2433�.6583 0.7115�.2252 0.0131�.Model 11: PN�diagnosis�

PN�Diagnosis �.1769�.2435 .3105�.3553 .1539�.Model 12: PN�treatment�

PN�treatment �.1035�.1859 0.4640�.2581 .0522�.

aVariable definitions: age is continuous; sex is coded as 1�woman, 0�man; race is codtreatment is coded as 1�aggressive chemotherapy, 0�non-aggressive chemotherapy.

ne diet had a longer disease-free period or time until a T

08 January 2011 Volume 111 Number 1

ardiovascular event compared to the other diet group.urvival analysis is also conducted using a regressionodel, but the model always includes a factor representing

he timing of the outcome (10). The actual statistical test ishe Cox regression analysis test and is represented as:

Log(t)��0(t)��1xi

In this mathematical formula t�time, �0 is the inter-ept or constant, and �1 is the beta or slope. The inclusionf a time factor allows for the examination of the risk ofhe outcome based on estimates at any one point in timeollowing initial enrollment. The risk is actually evalu-ted in a cumulative fashion over the time of the study.he test yields a hazard ratio that is a relative riskstimate for the association between the exposure or riskactor and the health outcome of interest (11).

ensored Valueso conduct a survival analysis, the investigator createswo variables, one variable for the endpoint that indicateshether the event or condition ever occurs and a secondariable for the time factor. The endpoint in this exampleould be one of two options: either participants are diag-osed with a myocardial infarction (endpoint�1) duringhe follow-up period, or they are not diagnosed with ayocardial infarction (endpoint�0). Of those who do notave the endpoint, some participants may be lost to fol-

ow-up or some may die from another cause. These areermed censored outcomes because from that time for-ard the participant is no longer at risk of the outcomeyocardial infarction and should be essentially removed

rom further analysis. The time variable is generallyounted from the day of enrollment in the study and issually recorded in days. In this example the potentialay of the endpoint would then be between 0 and 3,652epresenting the 10 years of follow-up. An extension ofur example, including the endpoint and time factors isn individual who has a myocardial infarction at day 456.

[Table fragment: ... standard errors (SEs), odds ratios (ORs), and 95% confidence intervals (CIs) ... nutrition (PN) administration (the exposure). Column headings: β3±SE, β4±SE, β5±SE, β6±SE, OR (95% CI). Coding note: ... 1=white, 0=other; diagnosis is coded as 1=multiple myeloma, 0=other diagnosis.]

The endpoint would be one and the time would be 456.


For a participant lost to follow-up during the sixth year, the endpoint would be zero and the time would be the day of last contact, such as day 2,242. At day 2,243 this person is censored from the analysis because this person can no longer contribute information on the association between dietary cholesterol and the incidence of a myocardial infarction. Those who complete the study without a myocardial infarction would have an endpoint of zero with a time of 3,650. This regression analysis provides the overall estimate of the association between the dietary groups and the outcome with consideration for the time to the event and the use of censoring to account for nonevents during the study. This is a more refined analysis and makes use of all of the information on the individuals. The Cox regression analysis provides an estimate of survival, or hazard, over time (12).
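The coding of the endpoint and time variables described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' procedure: the `Participant` record, its field names, and the `code_survival` helper are hypothetical, and the 3,650-day study end is taken from the example in the text.

```python
from dataclasses import dataclass
from typing import Optional

FOLLOW_UP_DAYS = 3650  # study end used in the text's 10-year example


@dataclass
class Participant:
    """Hypothetical follow-up record; field names are illustrative only."""
    mi_day: Optional[int] = None            # day of myocardial infarction, if any
    last_contact_day: Optional[int] = None  # day lost to follow-up or died of another cause


def code_survival(p: Participant) -> tuple[int, int]:
    """Return (endpoint, time): endpoint 1 = event, 0 = censored."""
    if p.mi_day is not None and p.mi_day <= FOLLOW_UP_DAYS:
        return 1, p.mi_day                  # event observed during follow-up
    if p.last_contact_day is not None and p.last_contact_day < FOLLOW_UP_DAYS:
        return 0, p.last_contact_day        # censored at the day of last contact
    return 0, FOLLOW_UP_DAYS                # completed the study without the event


# The three situations described in the text
examples = [
    Participant(mi_day=456),              # myocardial infarction at day 456
    Participant(last_contact_day=2242),   # lost to follow-up at day 2,242
    Participant(),                        # event-free through the end of the study
]
coded = [code_survival(p) for p in examples]
print(coded)
```

Each participant ends up with exactly one (endpoint, time) pair, which is the form that survival routines in most statistical packages expect.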

Kaplan-Meier Plots
In addition to the regression analysis, a survival analysis may include a Kaplan-Meier plot, which is an illustration of the change in the survival curve from the initial time point of enrollment at Day 0 to the closure date of the endpoint at 10 years. This curve is a very useful visual of the relationship. An example of a Kaplan-Meier curve is presented in Figure 4. To orient to the plot, consider the x axis a time continuum from the start of the study (Day 0) to the last day of the study (Day 3,650). The y axis represents the proportion of participants still free of the endpoint (and therefore still at risk) throughout the study. At enrollment, all data points are at the upper left of the graph, showing all participants free of the endpoint. Over the course of the study, as events occur (a myocardial infarction) or participants are censored (lost to follow-up or death from another cause), the curves will slope down. However, if a participant has been censored, the slope does not drop at the date when the individual was censored but at the next date when there is an event. Therefore, a drop in the curve always indicates that there has been at least one event, but it may also reflect any censored participants since the last event. At the end of the study, the final position of the line represents both loss to events and to censored participants and reflects only those individuals still at risk for the outcome at the end of the study. If there were no events (a myocardial infarction) or censored observations, both curves would be straight across. Every drop in the slope represents one or more events (occurrences of the outcome among sample members) at that point in time. In the example where the question compares two diet groups, there would be one curve representing the low-cholesterol diet and one for the control diet. In addition to the graphic presentation, the Kaplan-Meier method is accompanied by a log-rank test, which compares the two survival curves over the course of the study. A commonly used form of the log-rank test is the Mantel-Haenszel test, which also compares two samples with a nonparametric distribution. In the example of the Kaplan-Meier method the data are nonparametric because the distribution is right censored as opposed to a normal curve. (See Figure 4 to confirm the shape of the distributions.) The actual log-rank test that was conducted on this hypothetical data produced a P value of <0.001. The interpretation of the information presented on the plot, combined with the finding of the log-rank test, could be summarized as: There was a significant difference between the two diets in this study in the time to a myocardial infarction, with the individuals on the low cholesterol diet remaining free of a myocardial infarction longer than the individuals on the defined Western diet.

Figure 4. Kaplan-Meier Plot using a hypothetical example reflecting a 10-year follow-up period.

January 2011 ● Journal of the AMERICAN DIETETIC ASSOCIATION 109
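How a Kaplan-Meier curve steps down at events, and how censoring changes the size of later steps, can be illustrated with a short Python sketch. The `kaplan_meier` function below is a plain product-limit calculation written for this illustration, and the five participant records are invented; they are not the data behind Figure 4.

```python
from collections import Counter


def kaplan_meier(records):
    """Return [(time, estimated survival)] at each event time.

    records: iterable of (time, endpoint) pairs, endpoint 1 = event, 0 = censored.
    """
    records = list(records)
    events = Counter(t for t, endpoint in records if endpoint == 1)
    leaving = Counter(t for t, _ in records)  # events plus censored at each time
    at_risk = len(records)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = events[t]  # Counter returns 0 when no event occurred at t
        if d > 0:
            # The curve drops only when at least one event occurs
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        # Events and censored participants both leave the risk set after t
        at_risk -= leaving[t]
    return curve


# Invented records: (day, endpoint)
records = [(456, 1), (900, 0), (1500, 1), (2242, 0), (3650, 0)]
curve = kaplan_meier(records)
for day, s in curve:
    print(f"day {day}: estimated survival {s:.3f}")
```

In this made-up sample the curve does not move at day 900, when a participant is censored. That participant has left the risk set, however, so the event at day 1,500 removes one of three remaining participants rather than one of four, and the drop is larger. This matches the behavior described in the text.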

MULTIVARIATE COX REGRESSION ANALYSIS
The examples above were univariate approaches in which the estimates were made with only the exposure (low [total and saturated] fat, low cholesterol diet and the control diet [Western diet]) and the outcome (myocardial infarction or no myocardial infarction) in the model. As described in the earlier section, there may be extraneous factors that influence the basic relationship of interest (ie, confounding). Therefore, it may be necessary to include possible covariates in these analyses. However, although Cox regression analysis is well suited to the inclusion of covariates, the Kaplan-Meier plot and the log-rank test will only test the univariate model. For this reason, the combination of the two approaches provides an easily interpretable graphic of the overall shape of the differences over time and a test result on the relationship, which may include confounding factors (10). In this hypothetical example, the actual multivariate Cox regression analysis, including adjustment for age and sex, yielded a hazard ratio of 0.22 (95% CI of 0.16 to 0.29; P<0.0001). The conclusion is that there is a significant protective effect in the time to a myocardial infarction with the low (total and saturated) fat, low cholesterol diet compared to the time until myocardial infarction with the defined Western diet after adjustment for age and sex.

Survival analysis methods have been used in many large prospective studies involving diet and health such as the Nurses' Health Study, the Women's Health Initiative, and the Framingham Heart Study. Therefore, it is very useful for nutrition researchers and dietetics practitioners to have the skills to interpret these statistical tests.
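As an aid to reading a reported hazard ratio such as the one in the hypothetical diet example, the short Python sketch below converts it back to the log scale on which the Cox model is fitted. It is an illustration only: the standard error is back-calculated on the assumption that the 95% CI is a symmetric Wald interval on the log scale, which the text does not state.

```python
import math

# Values reported in the text for the hypothetical diet example
hr, ci_low, ci_high = 0.22, 0.16, 0.29

# The Cox coefficient is the natural log of the hazard ratio
beta = math.log(hr)

# Approximate standard error, ASSUMING a symmetric Wald 95% CI on the log scale
se = (math.log(ci_high) - math.log(ci_low)) / (2 * 1.96)

# Percentage reduction in hazard relative to the control (Western) diet
percent_lower = (1 - hr) * 100

# A 95% CI that excludes 1 corresponds to significance at the 0.05 level
ci_excludes_one = ci_high < 1 or ci_low > 1

print(f"beta = {beta:.3f}, SE = {se:.3f}")
print(f"hazard about {percent_lower:.0f}% lower; CI excludes 1: {ci_excludes_one}")
```

Read this way, a hazard ratio of 0.22 means that, at any point during follow-up, the hazard of myocardial infarction in the low fat, low cholesterol group is roughly one fifth of that in the control group after adjustment for age and sex.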

CONCLUSIONS
Before embarking on the aforementioned statistical procedures, it is assumed that careful attention has been paid to all data collection methods. Although the topic of data management is not addressed in this review, it is important to highlight that the statistical analyses, no matter how rigorous, will be fraught with inherent errors and lead to meaningless or flawed results when recorded data are not reflective of the study conditions. Quality assurance techniques and standardized approaches to data gathering and data management should be sufficiently detailed within the methods section of every article before the description of the statistical approach. This review has provided a basic explanation of commonly used multivariate techniques in nutrition-related research: simple and multiple linear and logistic regression, as well as Kaplan-Meier and Cox Proportional Hazards regression. Understanding the constructs and interpretation of these techniques can help considerably when designing nutrition research, and also when analyzing, interpreting, and publishing study findings. Consultation and collaboration with a biostatistician are highly recommended because these types of advanced analyses involve sophisticated statistical programming and statistical expertise. Future articles in this series will introduce other multivariate statistical techniques encountered when conducting and publishing nutrition-related research.

110 January 2011 Volume 111 Number 1

STATEMENT OF POTENTIAL CONFLICT OF INTEREST:
No potential conflict of interest was reported by the authors.

References
1. Boushey C, Harris J, Bruemmer B, Archer S, Van Horn L. Publishing nutrition research: A review of study design, statistical analysis, and other key elements of manuscript preparation, Part 1. J Am Diet Assoc. 2006;106:89-96.

2. Boushey C, Harris J, Bruemmer B, Archer S. Publishing nutrition research: A review of sampling, sample size, statistical analysis, and other key elements of manuscript preparation, Part 2. J Am Diet Assoc. 2008;108:679-688.

3. Harris J, Boushey C, Bruemmer B, Archer S. Publishing nutrition research: A review of nonparametric methods, Part 3. J Am Diet Assoc. 2008;108:1488-1496.

4. Harris J, Gleason P, Sheean P, Boushey C, Beto J, Bruemmer B. An introduction to qualitative research for food and nutrition professionals. J Am Diet Assoc. 2009;109:80-90.

5. Bruemmer B, Harris J, Gleason P, Boushey C, Sheean P, Van Horn L. Publishing nutrition research: A review of epidemiological methods. J Am Diet Assoc. 2009;109:1728-1737.

6. Gleason PM, Harris JE, Sheean PM, Boushey CJ, Bruemmer B. Publishing nutrition research: Validity, reliability, and diagnostic test assessment in nutrition-related research. J Am Diet Assoc. 2010;110:409-419.

7. Szklo M, Nieto FJ. Stratification and adjustment: Multivariate analysis in Epidemiology. In: Epidemiology Beyond the Basics. 1st ed. Gaithersburg, MD: Aspen Publishers; 2000.

8. Harris JA, Benedict FG. A Biometric Study of Basal Metabolism in Man. Washington, DC: Carnegie Institute; 1919. Publication No. 279.

9. Kleinbaum DG. Introduction to logistic regression. In: Dietz K, Gail M, Krickeberg K, Singer B, eds. Logistic Regression: A Self-Learning Text. 1st ed. New York, NY: Springer-Verlag; 1994.

10. Royston P, Parmar MKB, Altman DG. Visualizing length of survival in time-to-event studies: A complement to Kaplan-Meier plots. J Natl Cancer Inst. 2008;100:92-97.

11. Dekker FW, de Mutsert R, van Dijk PC, Zoccali C, Jager KJ. Survival analysis: Time-dependent effects and time-varying risk factors. Kidney Int. 2008;74:994-997.

12. van Dijk PC, Jager KJ, Zwinderman AH, Zoccali C, Dekker FW. The analysis of survival data in nephrology: Basic concepts and methods of Cox regression. Kidney Int. 2008;74:705-709.
