


Avoiding lost discoveries by using modern, robust statistical methods: An illustration based on

a lifestyle intervention study

Running head: Robust methods

Word total: 4,709

Rand Wilcox1, Mike Carlson2, Stan Azen3, and Florence Clark4

1 Department of Psychology, University of Southern California, USA

2, 4 Division of Occupational Science and Occupational Therapy, University of Southern California, USA

3 Department of Preventive Medicine, University of Southern California, USA

This work was supported by National Institute on Aging, R01 AG021108.

Corresponding Author: Rand Wilcox, Department of Psychology, University of Southern California. 618 Seeley Mudd Building, University Park Campus, Los Angeles, CA 90089-1061, USA. Email: [email protected]


Abstract

Background When analyzing clinical trial data, it is desirable to take advantage of major

advances in statistical techniques for assessing central tendency and measures of association.

Although numerous new procedures that have considerable practical value have been developed

in the last quarter century, such advances remain underutilized.

Purpose This article has two purposes. The first is to review common problems associated with

standard methodologies (low power, lack of control over Type I errors, incorrect assessments of

strength of association), and then summarize some modern methods that can be used to

circumvent such problems. The second purpose is to illustrate the practical utility of modern

robust methods using data from a recently conducted intervention study.

Methods To examine the value of robust statistical methodology, data from the Well Elderly 2

randomized controlled trial were analyzed to document the participant characteristics that are

associated with pre-to-post change over the course of a lifestyle intervention for older adults.

Results In multiple instances, robust methods uncovered differences among groups and

associations among variables that were not detected by classic techniques. In particular, the

results demonstrated that details of the nature and strength of association were sometimes

overlooked when using ordinary least squares regression and Pearson's correlation.

Conclusions Modern robust methods can make a practical difference in detecting and

describing differences between groups and associations between variables. Such procedures

should be applied more frequently when analyzing trial-based data.

Keywords: robust methods, violation of assumptions, outliers, heteroscedasticity, lifestyle intervention, occupational therapy, quality of life


Introduction

During the last quarter of a century, many new and improved methods for comparing groups

and studying associations have been derived that can document important

statistical relationships likely to be missed when using standard techniques.1-7 Although

awareness of these modern procedures is growing, their practical utility

remains underappreciated and they remain underutilized. The goal of this paper is to

briefly review why the new methods are important and to illustrate their value using data from a

recently conducted randomized controlled trial (RCT).

Problems with Standard Methodology

Appreciating the importance of recently developed robust statistical procedures requires an

understanding of modern insights into when and why traditional methods based on means can

be highly unsatisfactory. In the case of detecting differences among groups, there are three

fundamental problems that have important implications in controlled trials. The first

goal of this article is to briefly review these issues. After this overview, we will then examine

the viability of various robust procedures as a means of solving the problems associated with

standard procedures.

Problem 1: Heavy-tailed distributions


The first insight has to do with the realization that small departures from normality toward a

more heavy-tailed distribution can have an undesirable impact on estimation of the population

variance and can greatly affect power.1, 7 Moreover, in a seminal paper, Tukey argued that

heavier-than-normal tails are to be expected.8 Random samples from such distributions are

prone to having outliers, and experience with modern outlier detection methods supports

Tukey's claim. This is not to suggest that heavy-tailed distributions always occur, but to

document that it is ill advised to ignore problems that can result from heavy-tailed distributions.

The classic illustration of the practical consequences of heavy-tailed distributions is based on a

mixed (contaminated) normal distribution where with probability 0.9 an observation is sampled

from a standard normal, otherwise an observation is sampled from a normal distribution with

mean 0 and standard deviation 10. Despite the apparent similarity between the two

distributions, as shown in Figure 1, the variance of the mixed normal is 10.9, which greatly

reduces power. As an illustration, consider two normal distributions with variance 1 and means

0 and 1, which are shown in the left panel of Figure 2. When testing at the 0.05 level with a

sample size of 25 per group, Student's T test has power=0.9. The right panel shows the mixed

normal as well as another mixed normal distribution that has been shifted to have mean 1. Now

Student's T test has power=0.28.
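
To make the illustration concrete, the following minimal R sketch approximates both power values by simulation. This is our own illustrative code, not part of the original analyses, and the helper names rcnorm and power.sim are ours.

set.seed(1)
rcnorm <- function(n, p = 0.1, sd2 = 10) {
  # contaminated normal: N(0, 1) with probability 1 - p, N(0, sd2^2) otherwise
  ifelse(runif(n) < p, rnorm(n, sd = sd2), rnorm(n))
}
power.sim <- function(rgen, shift = 1, n = 25, nrep = 10000, alpha = 0.05) {
  # estimated power of Student's T for a shift alternative
  mean(replicate(nrep, {
    t.test(rgen(n), rgen(n) + shift, var.equal = TRUE)$p.value < alpha
  }))
}
power.sim(rnorm)   # close to 0.9 under normality
power.sim(rcnorm)  # roughly 0.28 under the contaminated normal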

For two distributions with means µ1 and µ2 and common variance σ2, a well-known measure of

effect size, popularized by Cohen9, is

δ = (μ1 − μ2)/σ. (1)

The usual estimate of δ, d, is obtained by estimating the means and the assumed common

variance in the standard way. Although the importance of a given value for δ depends on the


situation, Cohen suggested that as a general guide a large effect size is one that is visible to the

naked eye. Based on this criterion, he concluded that for normal distributions,

δ =0.2, 0.5, and 0.8 correspond to small, medium and large effect sizes, respectively. In the left

panel of Figure 2, δ =1.0, which from Cohen's perspective would be judged to be quite large. In

the right panel, δ = 0.30, which would be judged to be a small effect size from a numerical point

of view. But from a graphical point of view, the difference between means in the right panel

corresponds to what Cohen describes as a large effect size. This illustrates that δ can be highly

sensitive to small changes in the tails of a given distribution and that it can characterize

an effect size as small when in fact it is large. The upshot of this result is that in

the context of analyzing RCTs, the use of Cohen’s d can oftentimes lead to a spuriously low

estimate of effect magnitude.

Similar anomalies can result when using Pearson's correlation. The left panel of Figure 3 shows

a bivariate normal distribution with correlation ρ =0.8, and the middle panel shows a bivariate

normal distribution with correlation ρ=0.2. However, in the right panel, although ρ=0.2, the

bivariate distribution appears to be similar to the bivariate normal distribution with ρ =0.8. One

of the marginal distributions in the third panel in Figure 3 is normal, but the other is a mixed

normal, which illustrates that even a small change in one of the marginal distributions can have a

large impact on ρ.
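
The same phenomenon is easy to reproduce numerically. In the sketch below (our own construction), the error term of a regression model is contaminated in the manner of Figure 1, and Pearson's correlation drops sharply even though the bulk of the points are unchanged.

set.seed(2)
n <- 10000
x <- rnorm(n)
y <- 0.8 * x + sqrt(1 - 0.8^2) * rnorm(n)    # bivariate normal, rho = 0.8
cor(x, y)                                    # approximately 0.80
e <- ifelse(runif(n) < 0.1, rnorm(n, sd = 10), rnorm(n))
y2 <- 0.8 * x + sqrt(1 - 0.8^2) * e          # contaminated error term
cor(x, y2)                                   # drops to roughly 0.37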

Problem 2: Assuming normality via the central limit theorem

Many introductory statistics books claim that with a sample size of 30 or more, when testing

H0: μ = μ0 (μ0 given), normality can be assumed.10-12 Although this claim is not based on wild


speculations, two things were missed. First, when studying the sampling distribution of the

mean, X̄, early efforts focused on relatively light-tailed distributions for which outliers are

relatively rare. However, in moving from skewed, light-tailed distributions to skewed, heavy-

tailed distributions, larger than anticipated sample sizes are needed for the sampling distribution

of X̄ to be approximately normal. Second, and seemingly more serious, is the implicit

assumption that if the distribution of X̄ is approximately normal,

T = √n(X̄ − μ)/s

will have, approximately, a Student's t distribution with n-1 degrees of freedom, where n is the

sample size. To illustrate that this is not necessarily the case, imagine that 25 observations are

randomly sampled from a lognormal distribution. In such a case the sampling distribution of the

mean is approximately normal, but the distribution of T is poorly approximated by a Student's t

distribution with 24 degrees of freedom. Figure 4 shows an estimate of the actual distribution of T

based on a simulation with 10,000 replications. If T has a Student's t distribution, then

P(T ≤ −2.086) = 0.025. But when sampling from a lognormal distribution, the actual probability is

approximately 0.12. Moreover, P(T ≥ 2.086) is approximately 0.001 and E(T) = −0.54.

As a result, Student's t is severely biased. If the goal is to have an actual Type I error probability

between 0.025 and 0.075 when testing at the 0.05 level, a sample size of 200 or more is

required. For skewed distributions having heavier tails, roughly meaning the expected

proportion of outliers is higher, a sample size in excess of 300 can be required.
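
The tail probabilities just quoted can be checked directly by simulation; the following brief R sketch (ours) mirrors the setup in the text.

set.seed(3)
n <- 25
mu <- exp(0.5)           # the mean of the standard lognormal distribution
Tvals <- replicate(10000, {
  x <- rlnorm(n)
  sqrt(n) * (mean(x) - mu) / sd(x)
})
mean(Tvals <= -2.086)    # roughly 0.12 rather than the nominal 0.025
mean(Tvals >= 2.086)     # roughly 0.001
mean(Tvals)              # roughly -0.54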

A possible criticism of the illustrations just given is that they are based on hypothetical

distributions and that in actual studies such concerns may be less severe. But at least in some

situations, problems with achieving accurate probability coverage in actual studies have been


found to be worse.1 In particular, based on bootstrap samples, the true distribution of T can

differ from an assumed Student's t distribution even more than indicated here.1

Turning to the two-sample case, it should be noted that early simulation studies dealing with

non-normality focused on identical distributions. If the sample sizes are equal, the sampling

distribution of the difference between the sample means will be symmetric even when the

individual distributions are asymmetric but have the same amount of skewness. In the one-

sample case, when sampling from a symmetric, heavy-tailed distribution, the actual probability

level of Student's T is generally less than the nominal level, which helps explain why early

simulation studies regarding the two-sample case concluded that Student's T is robust in terms

of controlling the probability of a Type I error. But more recent studies have indicated that

when the two distributions being compared differ in skewness, the actual Type I error

probability can exceed the nominal level by a substantial amount.1,13

Problem 3: Heteroscedasticity

The third fundamental insight is that violating the usual homoscedasticity assumption (i.e. the

assumption that all groups have a common variance) is much more serious than

once thought. Both relatively poor power and inaccurate confidence intervals can result. Cressie

and Whitford14 established general conditions under which Student's T is not even

asymptotically correct under random sampling. When comparing the means of two independent

groups, some software packages now default to the heteroscedastic method derived by Welch15,

which is asymptotically correct. Welch's method can produce more accurate confidence

intervals than Student's T, but serious concerns persist in terms of both Type I errors and power.


Indeed, all methods based on means can have relatively low power under arbitrarily small

departures from normality.1 Also, when comparing more than two groups, commonly used

software employs the homoscedastic ANOVA F test, and typically no heteroscedastic option is

provided. This is a concern because with more than two groups, problems are exacerbated in

terms of both Type I errors and power.1 Similar concerns arise when dealing with regression.
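
In R, for example, the two-sample t.test function defaults to Welch's heteroscedastic method, and oneway.test provides a heteroscedastic analog of the ANOVA F test; the following short sketch (ours) illustrates both.

set.seed(4)
x <- rnorm(25)
y <- rnorm(40, mean = 0.5, sd = 3)
t.test(x, y)                     # Welch's heteroscedastic test (the default)
t.test(x, y, var.equal = TRUE)   # the classic homoscedastic Student's T
g <- factor(rep(c("a", "b", "c"), each = 25))
v <- c(rnorm(25), rnorm(25, 0.5), rnorm(25, 1, 3))
oneway.test(v ~ g)               # heteroscedastic (Welch) analog of the F test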

Robust Solutions

As suggested above, the application of standard methodology for comparing means can

seriously bias the conclusions drawn from clinical trial data. Fortunately, a wide array of

procedures has been developed that produce more accurate, robust results when the

assumptions that underlie standard procedures are violated. We describe some of these methods

below.

Alternate Measures of Location

One way of dealing with outliers is to replace the mean with the median. Two centuries ago,

Laplace was aware that for heavy-tailed distributions, the usual sample median can have a

smaller standard error than the mean.16 But a criticism is that under normality the median is

rather unsatisfactory in terms of power.

Note that the median belongs to the class of trimmed means. By definition, a trimmed mean is

the average after a prespecified proportion of the largest and smallest values is deleted. A crude

description of why methods based on the median can have unsatisfactory power is that they


trim all but one or two values. A way of dealing with this problem is to trim less, and based on

efficiency, Rosenberger and Gasko17 conclude that a 20% trimmed mean is a good choice for

general use. That is, its standard error compares well to the standard error of the mean under

normality, but unlike the mean, it provides a reasonable degree of protection against the

deleterious effects of outliers. Heteroscedastic methods for comparing trimmed means, when

dealing with a range of commonly used designs, have also been derived.1

Another general approach when dealing with outliers is to use a measure of location that first

empirically identifies values that are extreme, after which these extreme values are down

weighted or deleted. Robust M-estimators are based in part on this strategy.1,2,5,7 Methods for

testing hypotheses based on robust M-estimators and trimmed means have been studied

extensively. The strategy of applying standard hypothesis testing methods after outliers are

removed is technically unsound and can yield highly inaccurate results due to using an invalid

estimate of the standard error. Theoretically sound methods have been derived.1-7 In both

theoretical and simulation studies, when testing hypotheses about skewed distributions, it seems

that a 20% trimmed mean has an advantage over a robust M-estimator,1 although no single

method dominates.
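
As a brief illustration (our own sketch, with yuen taken from the WRS2 package, which is assumed to be installed), a 20% trimmed mean resists outliers that distort the ordinary mean, and Yuen's heteroscedastic test compares two trimmed means.

library(WRS2)                  # provides yuen()
set.seed(5)
x <- c(rnorm(30), 15, 20)      # a sample with two gross outliers
mean(x)                        # pulled upward by the outliers
mean(x, trim = 0.2)            # the 20% trimmed mean resists them
y <- rnorm(30, mean = 0.5)
dat <- data.frame(score = c(x, y),
                  grp = factor(rep(c("a", "b"), c(length(x), length(y)))))
yuen(score ~ grp, data = dat)  # Yuen's test based on 20% trimmed means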

Transformations

A simple way of dealing with skewness is to transform the data. Well-known possibilities are

taking logarithms and using a Box-Cox transformation.18, 19 One serious limitation is that simple

transformations do not deal effectively with outliers, leading Doksum and Wong20 to conclude

that some amount of trimming remains beneficial even when transformations are used. Another


concern is that after data are transformed, the resulting distribution can remain highly skewed.

However, both theory and simulations indicate that trimmed means reduce practical problems

associated with skewness, particularly when used in conjunction with certain bootstrap

methods.1, 13
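
For completeness, a short sketch (ours) of the two transformations mentioned above, using the boxcox function from the MASS package:

library(MASS)                    # provides boxcox()
set.seed(6)
x <- rlnorm(100)
logx <- log(x)                   # a log transform symmetrizes a lognormal sample
bc <- boxcox(lm(x ~ 1), plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]  # maximum-likelihood Box-Cox exponent
# Note: the transformed sample can still contain outliers and remain skewed,
# which is why some trimming remains beneficial.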

Nonparametric regression

There are well-known parametric methods for dealing with regression lines that are nonlinear.

But parametric models do not always provide a sufficiently flexible approach, particularly when

dealing with more than one predictor. There is a vast literature on nonparametric regression

estimators, sometimes called smoothers, for addressing this problem. Robust versions have

been developed.1

To provide a crude indication of the strategy used by smoothers, imagine that in a regression

situation the goal is to estimate the mean of Y, given that X=6, based on n pairs of observations.

The strategy is to focus on the observed X values close to 6 and use the corresponding Y values

to estimate the mean of Y. Typically, smoothers give more weight to Y values for which the

corresponding X values are close to 6. For pairs of points for which the X value is far from 6,

the corresponding Y values are ignored. The general problem, then, is determining which values

of X are close to 6 and how much weight the corresponding Y values should be given. Special

methods for handling robust measures of variation have also been derived.1
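
The basic strategy can be demonstrated with lowess, base R's classic robust local-regression smoother (a sketch of the general idea, not of the particular smoothers used later in this article).

set.seed(7)
x <- runif(200, 0, 12)
y <- sin(x) + rnorm(200, sd = 0.3)
fit <- lowess(x, y, f = 2/3)      # f is the span: it determines how "close" is defined
plot(x, y)
lines(fit, lwd = 2)               # the estimated regression line
approx(fit$x, fit$y, xout = 6)$y  # smoothed estimate of E(Y | X = 6)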

Robust measures of association and effect size


There are two general approaches to robustly measuring the strength of association between

two variables. The first is to use some analog of Pearson's correlation that removes or down

weights outliers. The other is to fit a regression line and measure the strength of the association

based on this fit. Doksum and Samarov studied an approach based on this latter strategy using

what they call explanatory power. Let Y be some outcome variable of interest, let X be some

predictor variable, and let Ŷ be the predicted value of Y based on some fit, which might be

based on either a robust linear model or a nonparametric regression estimator that provides a

flexible approach to curvature. Then explanatory power is

ρe2 = σ2(Ŷ)/σ2(Y),

where σ2(Y) is the variance of Y. When Ŷ is obtained via ordinary least squares, ρe2 is equal to ρ2,

where ρ is Pearson's correlation. A robust version of explanatory power is obtained simply by

replacing the variance, σ2, with some robust measure of variation.1 Here the biweight

midvariance is used with the understanding that several other choices are reasonable. Roughly,

the biweight midvariance empirically determines whether any values among the marginal

distributions are unusually large or small. If such values are found, they are down weighted.

This measure of variation has good efficiency among the many robust measures of variation

that have been proposed. It should be noted, however, that the choice of smoother matters

when estimating explanatory power, with various smoothers performing rather poorly in terms

of mean squared error and bias, at least with small to moderate sample sizes.21 One smoother

that performs relatively well in simulations is the robust version of Cleveland's smoother.22
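
To make these definitions concrete, the sketch below implements the biweight midvariance from its standard formula and uses it, with a lowess fit standing in for a smoother, to estimate robust explanatory power. This is our own illustrative code, not the set of functions used for the analyses reported below.

bwmv <- function(x) {
  # biweight midvariance: values far from the median are downweighted
  M <- median(x)
  u <- (x - M) / (9 * mad(x, constant = 1))  # raw (unscaled) MAD
  a <- abs(u) < 1                            # indicator: not flagged as extreme
  n <- length(x)
  n * sum(a * (x - M)^2 * (1 - u^2)^4) /
    (sum(a * (1 - u^2) * (1 - 5 * u^2)))^2
}
set.seed(8)
x <- runif(200)
y <- 4 * x + rnorm(200)
fit <- lowess(x, y)
yhat <- approx(fit$x, fit$y, xout = x)$y     # fitted values at each observed x
bwmv(yhat) / bwmv(y)                         # robust explanatory power estimate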

When comparing measures of location corresponding to two independent groups, Algina,

Keselman, and Penfield23 suggest a robust version of Cohen's measure of effect size that

replaces the means and the assumed common variance with a 20% trimmed mean and 20%


Winsorized variance. They also rescale their measure of effect size so that under normality, it is

equal to δ. The resulting estimate is denoted by dt. It is noted that a heteroscedastic analog of δ

can be derived using a robust version of explanatory power.24 Briefly, it reflects the variation of

the measures of location divided by some robust measure of variation of the pooled data, which

has been rescaled so that it estimates the population variance under normality. Here, we label

this version of explanatory power ξ2, and ξ is called a heteroscedastic measure of effect size. It

can be shown that for normal distributions with a common variance, δ =0.2, 0.5 and 0.8

correspond to ξ =0.15, 0.35 and 0.5, respectively. This measure of effect size is readily

extended to more than two groups.

Consider again the two mixed normal distributions in the right panel of Figure 2, only now the

second distribution is shifted to have a mean equal to 0.8. Then Cohen's d is 0.24, suggesting a

small effect size, even though from a graphical perspective we have what Cohen characterized

as a large effect size. The Algina et al.23 measure of effect size, dt, is 0.71, and ξ = 0.46,

approximately, both of which suggest a large effect size.
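
As a concrete sketch of how d and dt can be computed (our own implementation, not the paper's code; the 0.642 rescaling constant for 20% trimming follows Algina et al.23, and the pooling of the Winsorized variances is one reasonable choice):

winvar <- function(x, tr = 0.2) {
  # 20% Winsorized variance: pull the g smallest and g largest values inward
  xs <- sort(x); n <- length(xs); g <- floor(tr * n)
  if (g > 0) {
    xs[1:g] <- xs[g + 1]
    xs[(n - g + 1):n] <- xs[n - g]
  }
  var(xs)
}
cohen.d <- function(x, y) {
  sp2 <- ((length(x) - 1) * var(x) + (length(y) - 1) * var(y)) /
    (length(x) + length(y) - 2)          # pooled variance
  (mean(x) - mean(y)) / sqrt(sp2)
}
algina.dt <- function(x, y, tr = 0.2) {
  sw2 <- ((length(x) - 1) * winvar(x, tr) + (length(y) - 1) * winvar(y, tr)) /
    (length(x) + length(y) - 2)          # pooled Winsorized variance
  0.642 * (mean(x, trim = tr) - mean(y, trim = tr)) / sqrt(sw2)
}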

Comments on Software

A practical issue is how to apply robust methods. This can be done using one of the most important

software developments during the last thirty years: the free software R, which can be

downloaded from http://www.r-project.org/. The illustrations used here are based on a library of

over 900 R functions1, which are available at http://www-rcf.usc.edu/~rwilcox/ .

An R package WRS is available as well and can be installed with the R command


install.packages("WRS", repos="http://R-Forge.R-project.org"), assuming that the most recent

version of R is being used.

Practical illustration of robust methods: Analysis of a lifestyle intervention for older

adults

Design

To illustrate the practical benefits of using recently developed robust methods, we report some

results stemming from a recent RCT of a lifestyle intervention designed to promote the health

and well-being of community-dwelling older adults. Details regarding the wider RCT are

summarized by Clark et al.25 Briefly, this trial was conducted to compare a six-month lifestyle

intervention to a no treatment control condition. The design utilized a cross-over component,

such that the intervention was administered to experimental participants during the first six

study months and to control participants during the second six months. The results reported

here are based on a pooled sample of n=364 participants who completed assessments both prior

to and following receipt of the intervention.

In assessing the potential importance of using robust statistical methods, our goal was to assess

the association between number of hours of treatment and various outcome variables. Outcome

variables included: (a) eight indices of health-related quality of life, based on the SF-36

(physical function, bodily pain, general health, vitality, social function, mental health, physical

component scores, and mental component scores)26; (b) depression, based on the Center for

Epidemiologic Studies Depression Scale27; and (c) life satisfaction, based on the Life


Satisfaction Index – Z Scale21. In each instance, the outcome variable consisted of signed post-

treatment minus pre-treatment change scores. Preliminary analyses revealed that all outcome

variables had outliers based on boxplots, raising the possibility that modern

robust methods might make a practical difference in the conclusions reached.

Prior to discussing the primary results, it is informative to first examine plots of the regression

lines estimated via the nonparametric quantile regression method derived by Koenker and Ng28.

Figure 5 shows the estimated regression line (based on the R function qsmcobs) when

predicting the median value of the variable physical function given the number of hours of

individual sessions28. Notice that the regression line is approximately horizontal from 0 to 5

hours, but then an association appears. Pearson's correlation over all session hours is 0.178

(p=0.001), indicating an association. However, Pearson’s correlation is potentially misleading

because the association appears to be nonlinear. The assumption of a straight regression line

can be tested with the R function qrchk, which yields p=0.006. The method used by this

function stems from results derived by He and Zhu29, which is based on a nonparametric

method for estimating the median of a given outcome variable of interest. As can be seen, it

appears that with 5 hours or less, the treatment has little or no association with physical

function, but at or above 5 hours, an association appears. In fitting a straight regression line via

the Theil and Sen estimator, for hours greater than or equal to 5, the robust explanatory measure

of association is 0.347. And the hypothesis of a zero slope, tested with the R function regci,

yields p=0.015. For hours less than 5, the Theil-Sen estimate of the slope is 0. In summary,

Pearson’s correlation indicates an association, but it is misleading in the sense that for less than

5 hours there appears to be little or no association, and for more than 5 hours the association

appears to be much stronger than indicated by Pearson’s correlation.


Note also that after about 5 hours, Figure 5 suggests that the association is strictly increasing.

However, the use of a running-interval smoother with bootstrap bagging, which was applied

with the R function rplotsm, yields the plot in Figure 6. Again, there is little or no association

initially and around 5-7 hours an association appears, but unlike Figure 5, Figure 6 suggests that

the regression line levels off at about 9 or 10 hours. The main point here is that smoothers can

play an important role in terms of detecting and describing an association. Indeed, no

association is found after 9 hours, although this might be due to low power.
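
For reference, the analyses in this section follow the pattern sketched below, assuming Wilcox's library of R functions has been sourced (the file name varies by version); hours and pf are hypothetical names for the session-hours and physical-function vectors.

source("Rallfun-v43.txt")          # Wilcox's function library; version-specific file name
qsmcobs(hours, pf)                 # plot the quantile (median) regression line
qrchk(hours, pf)                   # test the hypothesis of a straight regression line
regci(hours, pf, regfun = tsreg)   # test/CI for the Theil-Sen slope
rplotsm(hours, pf)                 # running-interval smoother with bootstrap bagging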

Similar issues regarding nonlinear associations occurred when analyzing other outcome

variables. Figure 7 shows the median regression line when predicting physical composite based

on individual session hours. Pearson's correlation rejects the null hypothesis of no association

(r=.2, p=0.015), but again a smoother indicates that there is little or no association from about 0

to 5 hours, after which an association appears. For 0 to 5 hours, r=-0.071 (p=0.257), and for 5

hours or more, r=0.25 (p=0.045). But the 20% Winsorized correlation for 5 hours or more is

rw=0.358 (p=0.0045). (Winsorizing means that the more extreme values are “pulled in”, which

reduces the deleterious effects of outliers). The explanatory measure of effect size, ρe, is

estimated to be 0.47.

A cautionary remark should be made. When using a smoother, the ends of the regression line

can be misleading. In Figure 8, as the number of session hours gets close to the maximum

observed value, the regression appears to be monotonic decreasing. But this could be due to an

inherent bias when dealing with predictor values that are close to the maximum value. For

example, if the goal is to predict some outcome variable Y when X=8, the method gives more

weight to pairs of observations for which the X values are close to 8. But if there are many X


values less than 8, and only a few larger than 8, this can have a negative impact on the accuracy

of the predicted Y value.

Figure 8 shows the estimate of the regression line based on a running-interval smoother with

bootstrap bagging. When the number of session hours is relatively high, again there is some

indication of a monotonic decreasing association, but it is less pronounced, suggesting that it

might be an artifact of the estimators used. One concern is that there are very few points close

to the maximum number of session hours, 16.5. Consequently, the precision of the estimates of

physical composite, given that the number of session hours is close to 16.5, would be expected

to be relatively low.

Table 1 summarizes measures of association between session hours and the 10 outcome

variables. Some degree of curvature was found for all of the variables listed. Shown are

Pearson's r; the Winsorized correlation, rw, which measures the strength of a linear association

while guarding against outliers among the marginal distributions; and the robust explanatory

measure of the strength of association based on Cleveland's smoother, re. (Note that there is no

p-value reported for re. Methods for determining a p-value are available but the efficacy of

these methods, in the context of Type I error probabilities, has not been investigated.)

As can be seen, when testing at the 0.05 level, if Pearson's correlation rejects, the Winsorized

correlation rejects as well. There are, however, four instances where the Winsorized correlation

rejects and Pearson's correlation does not: vitality, mental composite, depression, and life

satisfaction. Particularly striking are the results for depression. Pearson's correlation is -0.022

(p=0.694), whereas the Winsorized correlation is -0.132 (p=0.018). This illustrates the basic

principle that outliers can mask a strong association among the bulk of the points.
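
A small simulated sketch (ours, with wincor taken from the WRS2 package) shows the masking effect directly.

library(WRS2)           # provides wincor()
set.seed(11)
x <- rnorm(100)
y <- -0.4 * x + rnorm(100, sd = 0.8)
y[1:5] <- y[1:5] + 15   # a few gross outliers
cor(x, y)               # Pearson's r, attenuated toward zero
wincor(x, y, tr = 0.2)  # the 20% Winsorized correlation recovers the association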


Note that re values are always positive. This is because re measures the strength of the

association without assuming that the regression line is monotonic. That is, it allows for the

possibility that the regression line is increasing over some region of the predictor values, but

decreasing otherwise.

Another portion of the study dealt with comparing the change scores of two groups of

participants. The variables were measures of physical function, bodily pain, physical composite

and a cognitive score. For the first group there was an ethnic match between the participant and

therapist; in the other group there was no ethnic match. The total sample size was n=205. Table

2 summarizes the results when comparing means with Welch's test and 20% trimmed means

with Yuen's test, which was applied with the R function yuenv2. (A theoretically correct

estimate of the standard error of a trimmed mean is based in part on the 20% Winsorized

variance, which was used here.) As can be seen, Welch's test rejects at the 0.05 level for three

of the four outcome variables. In contrast, Yuen's test rejects in all four situations. Also shown

are the three measures of effect size previously described. Note that following Cohen's

suggestion, d is relatively small for the first outcome variable and a medium effect size is

indicated in the other three cases. The Algina et al.23 measure of effect size, dt, is a bit larger. Of

particular interest is the heteroscedastic measure of effect size, ξ, which effectively deals with

outliers. Now the results indicate a medium to large effect size for three of the outcome

variables. (Formal hypothesis testing methods for comparing the population analogs of d and ξ

have not been derived.)

Discussion


There are many important applications of robust statistical methodology that extend beyond

those described in this paper. For example, the violation of assumptions in more complex

experimental designs is associated with especially dire consequences, and robust methods for

dealing with these problems have been derived.1

A point worth stressing is that no single method is always best. The optimal method in a given

situation depends in part on the magnitude of true differences or associations, and the nature of

the distributions being studied, which are often unknown. Robust methods are designed to

perform relatively well under normality and homoscedasticity, while continuing to perform

well when these assumptions are violated. It is possible that classic methods offer an advantage

in some situations. For example, for skewed distributions, the difference between the means

might be larger than the difference between the 20% trimmed means, suggesting that power

might be higher using means. However, there is no guarantee that means have more power

because it is often the case that a 20% trimmed mean has a substantially smaller standard error.

Of course, classic methods perform well when standard assumptions are met, but modern robust

methods are designed to perform nearly as well for this particular situation. In general,

complete reliance on classic techniques seems highly questionable in that hundreds of papers

published during the last quarter century underscore the strong practical value of modern

procedures. All indications are that classic, routinely used methods can be highly unsatisfactory

under general conditions. Moreover, access to modern methods is now available via a vast array

of R functions. The main message here is that modern technology offers the opportunity for

analyzing data in a much more flexible and informative manner, which would enable

researchers to learn more from their data, thereby augmenting the accuracy and practical utility

of clinical trial results.


Table 1: Measures of association between hours of treatment and the variables listed in column 1 (n=364)

                       Pearson's r      p        rw*       p       re**
PHYSICAL FUNCTION         0.178       0.001     0.135    0.016    0.048
BODILY PAIN               0.170       0.002     0.156    0.005    0.198
GENERAL HEALTH            0.209       0.0001    0.130    0.012    0.111
VITALITY                  0.099       0.075     0.139    0.012    0.241
SOCIAL FUNCTION           0.112       0.043     0.157    0.005    0.228
MENTAL HEALTH             0.141       0.011     0.167    0.003    0.071
PHYSICAL COMPOSITE        0.200       0.0002    0.136    0.015    0.255
MENTAL COMPOSITE          0.095       0.087     0.149    0.007    0.028
DEPRESSION               -0.022       0.694    -0.132    0.018    0.134
LIFE SATISFACTION         0.086       0.125     0.118    0.035    0.119

*rw = 20% Winsorized correlation

**re = robust explanatory measure of association


Table 2: P-values and measures of effect size when comparing ethnically matched participants to a non-matched group

                       Welch's test   Yuen's test      d        dt       ξ
                         p-value        p-value
Physical Function        0.1445        0.0469        0.212    0.310    0.252
Bodily Pain              0.01397       <0.0001       0.591    0.666    0.501
Physical Composite       <0.0001       0.0002        0.420    0.503    0.391
Cognition                0.0332        0.0091        0.415    0.408    0.308

The measures of effect size are d (Cohen's d), dt (a robust measure of effect size that assumes homoscedasticity), and ξ (a robust measure of effect size that allows heteroscedasticity). Under normality and homoscedasticity, d = 0.2, 0.5 and 0.8 approximately correspond to ξ = 0.15, 0.35 and 0.5, respectively.


Figure 1: Despite the obvious similarity between the standard normal and contaminated normal distributions, the standard normal has variance 1 and the contaminated normal has variance 10.9.

Figure 2: Left panel, d=1, power=0.96. Right panel, d=.3, power=.28.

Figure 3: Illustration of the sensitivity of Pearson's Correlation to contaminated (heavy-tailed) distributions.

Figure 4: The solid line is the distribution of Student's T, n = 30, when sampling from a lognormal distribution. The dashed line is the distribution of T when sampling from a normal distribution.

Figure 5: The median regression line for predicting physical function based on the number of session hours.

Figure 6: The running-interval smoother based on bootstrap bagging, using the same data as in Figure 5.

Figure 7: The median regression line for predicting physical composite based on the number of session hours.

Figure 8: The estimated regression line for predicting physical composite based on the number of session hours. The estimate is based on the running-interval smoother with bootstrap bagging.


References

1. Wilcox RR. Introduction to robust estimation and hypothesis testing. 2nd Ed. Elsevier

Academic Press, Burlington, MA, 2005.

2. Hampel FR, Ronchetti EM, Rousseeuw PJ, and Stahel WA. Robust statistics: the approach

based on influence functions. Wiley, New York, 1986.

3. Hoaglin DC, Mosteller F, and Tukey, JW. Understanding robust and exploratory data

analysis. Wiley, New York, 1983.

4. Hoaglin DC, Mosteller F, and Tukey JW. Exploring data tables, trends, and shapes. Wiley,

New York, 1985.

5. Huber P and Ronchetti EM. Robust statistics. 2nd Ed. Wiley, New York, 2009.

6. Maronna RA, Martin MA and Yohai VJ. Robust statistics: theory and methods. Wiley, New

York, 2006.

7. Staudte RG and Sheather SJ. Robust estimation and testing. Wiley, New York, 1990.

8. Tukey JW. A survey of sampling from contaminated normal distributions. In: I. Olkin et al.

(Eds.) Contributions to probability and statistics. Stanford University Press, Stanford, CA,

1960, pp. 448-485.

9. Cohen, J. Statistical Power Analysis for the Behavioral Sciences. 2nd Ed. Academic Press,

New York, 1988.

10. Shavelson RJ. Statistical reasoning for the behavioral sciences. Needham Heights, MA:

Allyn and Bacon, 1988, p. 266.

11. Goldman RN and Weinberg JS. Statistics: an introduction. Englewood Cliffs, NJ: Prentice-

Hall, 1985, p. 252.


12. Huntsberger DV and Billingsley P. Elements of statistical inference. 5th ed. Boston, MA:

Allyn and Bacon, 1981.

13. Cribbie RA, Fiksenbaum L, Keselman HJ and Wilcox RR. Effects of

nonnormality on test statistics for one-way independent groups designs. British Journal of

Mathematical and Statistical Psychology, in press.

14. Cressie NA and Whitford HJ. How to use the two sample t-test. Biometrical Journal

1986; 28: 131-148.

15. Welch BL. The significance of the difference between two means when the population

variances are unequal. Biometrika 1938; 29: 350-362.

16. Hald A. A history of mathematical statistics from 1750 to 1930. Wiley, New York, 1998.

17. Rosenberger JL and Gasko M. Comparing location estimators: Trimmed means, medians,

and trimean. In: D. Hoaglin, F. Mosteller, J. Tukey (Eds.) Understanding robust and

exploratory data analysis. Wiley, New York, 1983, pp. 297-336.

18. Box GE and Cox DR. An analysis of transformations. Journal of the Royal Statistical Society B 1964; 26: 211-252.

19. Sakia R M. The Box-Cox transformation: A review. The Statistician 1992; 41:169-178.

20. Doksum KA and Wong C-W. Statistical tests based on transformed data. Journal of the

American Statistical Association 1983; 78: 411-417.

21. Wood V, Wylie ML and Sheafor B. An analysis of a short self-report measure of life

satisfaction: Correlation with rater judgments. Journal of Gerontology 1969; 24: 465-469.

22. Cleveland WS. Robust locally-weighted regression and smoothing scatterplots.

Journal of the American Statistical Association 1979; 74: 829-836.


23. Algina J, Keselman HJ and Penfield RD. An alternative to Cohen's standardized mean

difference effect size: A robust parameter and confidence interval in the two independent

groups case. Psychol Methods 2005; 10: 317-328.

24. Wilcox RR and Tian T. Measuring effect size: A robust heteroscedastic approach for two

or more groups. Journal of Applied Statistics, in press.

25. Clark F, Jackson J, Mandel D, Blanchard J, Carlson M, Azen S, et al. Confronting

challenges in intervention research with ethnically diverse older adults: the USC Well Elderly II

trial. Clin Trials 2009; 6: 90-101.

26. Ware JE, Kosinski M and Dewey JE. How to score version two of the SF-36 Health Survey.

QualityMetric, Lincoln, RI, 2000.

27. Radloff LS. The CES-D Scale: A self-report depression scale for research in the general

population. Applied Psychological Measurement 1977; 1:385-401.

28. Koenker R and Ng P. Inequality constrained quantile regression. Sankhyā: The Indian Journal of Statistics 2005; 67: 418-440.

29. He X and Zhu L-X. A lack-of-fit test for quantile regression. Journal of the American

Statistical Association 2003; 98:1013-1022.