TRANSCRIPT
9 February 2007 SSP Core Facility 1
Department of Statistics
Power (and Precision) Planning for Effective Research Design: Grant Proposals & More
Walt Stroup, Ph.D.
Professor & Chair, Department of Statistics
University of Nebraska, Lincoln
Outline for Talk
I. What is “Power Analysis”? Why should I do it?
II. Essential Background
III. A Word about Software
IV. Decisions that Affect Power – several examples
V. Latest Thinking
VI. Final Thoughts
Power and Precision Defined
- Precision, a.k.a. "margin of error": in most cases, the standard error of the relevant estimate
- Power: Prob{ reject H0 given H0 false } = Prob{ research hypothesis statistically significant }
- Power analysis: essentially, "If I do the study this way, power = ?"
- Sample size estimation: how many observations are required to achieve a given power?
What’s involved in Power Analysis
WHAT IT'S NOT: "painting by numbers..."
IF IT'S DONE RIGHT, power analysis should be
- a comprehensive conversation to plan the study
- a "dress rehearsal" for the statistical analysis once the data are collected
Why do a Power Analysis?
- For an NIH grant proposal: because it's required
- For many other grant proposals: because it gives you a competitive edge
- Other reasons:
  - practical: increases the chance of success; reduces the "we don't have time to do it right, but lots of time to do it over" syndrome
  - ethical
Ethical???
- Last Ph.D. in the U.S. Senate; irritant to the doctrinaire left and right
- Keynote address to the 1997 American Statistical Association meeting: "...we can continue to make policy based on 'data-free ideology' or we can inform policy where possible by competent inquiry..."
- the late U.S. Senator Daniel Patrick Moynihan
Ethical
Results of your study may affect policy.
Well-conceived research means
- better information
- greater chance of sound decisions
Poorly-conceived research
- lost opportunity
- deprives policy-makers of information that might have been useful
- or worse: bad information misinforms or misleads the public
What affects Power & Precision?
A short statistics lesson
1. What goes into computing test statistics
2. What test statistics are supposed to tell us
3. A bit about the distribution of test statistics
4. Central and non-central t, F, and chi-square (mostly F)
What goes into a test statistic?
Research hypothesis – the motivation for the study
Assumed not true unless the data show compelling evidence otherwise
Research hypothesis: HA; opposite: H0

                    H0 true        HA true
Fail to reject H0   (correct)      Type II error
Reject H0           Type I error   Power
What goes into a test statistic?
Visualize using F, but the same basic principles hold for t, chi-square, etc.
F is the ratio of variation attributable to the factor under study vs. variation attributable to noise:

    F ≈ (N of obs × effect size²) / (variance of noise, i.e. among obs)
When H0 True – i.e. no trt effect

    F ~ F( numerator (trt) d.f., denominator (noise/error) d.f. )
When H0 false (i.e. Research HA true)
    F ~ F( num. d.f., den. (error) d.f., φ )

where the "non-centrality parameter" is

    φ ≈ (N of obs × effect size²) / (variance of noise)
What affects Power?
Increase the "non-centrality parameter" → increase power:

    φ ≈ (N of obs × effect size²) / (variance of noise, i.e. among obs)
What should be in a conversation about Power?
Increase the "non-centrality parameter" φ ≈ (N of obs × effect size²) / (variance of noise) → increase power.

- Effect size: what is the minimum that matters?
- Variance: how much "noise" in the response variable? (range? distribution? count? pct?)
- Practical constraints
- Design: the same N can produce varying power
About Software (part I)
Canned software
- lots of it
- Xiang and Zhou working on a report
- "painting by numbers"
Simulation
- most accurate; not constrained by canned scenarios
- you can see what will happen if you actually do this...
"Exemplary data set" + modeling software
- nearly as accurate as simulation
- "dress rehearsal" for the actual analysis
- MIXED, GLIMMIX, NLMIXED: if you can model it, you can do power analysis
Design Decisions – Some Examples
Main idea: for the same amount of effort, or $$$, or # of observations, power and precision can be quite different.
Power analysis objective: work smarter, not harder.
Simple example: design of a regression study
- from a STAT 412 exercise
Treatment Design Exercise
Class was asked to predict Bounce Height of basketball from Drop Height and to see if relationship changes depending on floor surface
Decision: What drop heights to use???
Objectives and Operating Definitions
Recall the objective: does the drop:bounce height relationship change with floor surface?

Model (fit separately for each surface, C and T):

    y = β0C + β1C·X        y = β0T + β1T·X

Operating definition: "relationship changes" means β1C ≠ β1T
Consequences of Drop Height Decisions
Should we use fewer drop heights & more obs per drop height, or vice versa?
[table from the Stat 412 Avery archive]
Simulation
- CRD example: 3 treatments, 5 reps / treatment
- Suspected effect size: 6-10% relative to control, whose mean is known to be ~100
- Standard deviation: 10 considered "reasonable"
- Simulate 1000 experiments
- Reject H0 (equal trt means) 228 times → power = 0.228 at alpha = 0.05
- Control mean ranked correctly 820 times (intermediate mean ranked correctly 589 times)
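The simulation just described can be sketched in a few lines. This is a stdlib Python illustration, not the speaker's code; the critical value F(0.95; 2, 12) = 3.88529 is taken from the output slide later in the talk.

```python
import random

random.seed(7)

def one_way_f(groups):
    """One-way ANOVA F statistic for equal-sized groups."""
    k, n = len(groups), len(groups[0])
    grand = sum(x for g in groups for x in g) / (k * n)
    means = [sum(g) / n for g in groups]
    msb = n * sum((m - grand) ** 2 for m in means) / (k - 1)
    msw = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g) / (k * (n - 1))
    return msb / msw

fcrit = 3.88529  # F(0.95; 2, 12), quoted on the talk's output slide
trt_means, sd, reps, sims = (100, 94, 90), 10, 5, 1000
rejects = sum(
    one_way_f([[random.gauss(m, sd) for _ in range(reps)] for m in trt_means]) > fcrit
    for _ in range(sims)
)
print(rejects / sims)  # Monte Carlo power; the talk's own simulation gave 0.228
```

With 1000 simulated experiments the estimate carries Monte Carlo error of roughly ±0.013, so values in the low 0.2s are expected.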
"Exemplary Data"
Many software packages exist for power & sample size
- e.g. SAS PROC POWER
- for FIXED effect models only
"Exemplary data" is more general, especially (but not only) when "mixed model issues" arise:
- random effects
- split-plot structure
- errors potentially correlated: longitudinal or spatial data
- any other non-standard model structure
Methods use PROC MIXED or GLIMMIX
- adapted from Stroup (2002, JABES)
- Chapter 12, SAS for Mixed Models (Littell et al., 2006)
"Exemplary Data" - Computing Power using SAS
1. Create a data set like the proposed design.
2. Run PROC GLIMMIX (or MIXED) with the variance fixed.
3. Non-centrality parameter: φ = (F computed by GLIMMIX) × rank(K) [or chi-square with a GLM].
4. Compute the critical F: Fcrit is the value such that P{ F(rank(K), ν, 0) > Fcrit } = α [or chi-square].
5. Power = P{ F(rank(K), ν, φ) > Fcrit }.
SAS functions can compute Fcrit & power.
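Steps 3-5 can be cross-checked without SAS. The sketch below (a stdlib Python illustration, not part of the talk) approximates the noncentral-F tail probability by simulation, plugging in the CRD values reported on the later output slide (φ = 2.53333, Fcrit = 3.88529, df = 2 and 12).

```python
import random

random.seed(1)

def noncentral_f(df1, df2, ncp):
    """One draw from F(df1, df2, ncp): a noncentral chi-square over a
    central chi-square, each divided by its degrees of freedom."""
    # shifting one standard normal by sqrt(ncp) gives the noncentral part
    num = (random.gauss(0, 1) + ncp ** 0.5) ** 2
    num += sum(random.gauss(0, 1) ** 2 for _ in range(df1 - 1))
    den = sum(random.gauss(0, 1) ** 2 for _ in range(df2))
    return (num / df1) / (den / df2)

# CRD values reported on the later output slide
df1, df2, ncp, fcrit = 2, 12, 2.53333, 3.88529
sims = 20000
power = sum(noncentral_f(df1, df2, ncp) > fcrit for _ in range(sims)) / sims
print(round(power, 3))  # should land near the 0.224 the GLIMMIX run reports
```

In SAS the same quantity comes directly from `1 - PROBF(fcrit, df1, df2, ncp)`; the simulation just makes the definition of "power under the noncentral distribution" concrete.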
Compute Power with GLIMMIX – CRD example

/* step 1 - create data set with same structure as proposed design;
   use MU (expected mean) instead of observed Y_ij values */
/* this example shows power for 5, 10, and 15 e.u. per trt */
data crdpwrx1;
 input trt mu;
 do n=5 to 15 by 5;
  do eu=1 to n;
   output;
  end;
 end;
cards;
1 100
2 94
3 90
;
Compute Power with GLIMMIX – CRD example
/* step 2 - use PROC GLIMMIX to compute non-centrality parameters
   for ANOVA tests & contrasts;
   ODS statements output them to new data sets */
proc sort data=crdpwrx1;
 by n;
proc glimmix data=crdpwrx1;
 by n;
 class trt;
 model mu=trt;
 parms (100) / hold=1;
 contrast 'et1 v et2' trt 0 1 -1;
 contrast 'c vs et' trt 2 -1 -1;
 ods output tests3=b;
 ods output contrasts=c;
run;
/* step 3: combine ANOVA & contrast n-c parameter data sets;
   use SAS functions PROBF and FINV to compute power */
data power;
 set b c;
 alpha=0.05;
 ncparm=numdf*fvalue;
 fcrit=finv(1-alpha,numdf,dendf,0);
 power=1-probf(fcrit,numdf,dendf,ncparm);
proc print;

Obs  Effect  Label      DF  DenDF  alpha  ncparm   fcrit    power
 1   trt                 2    12   0.05   2.53333  3.88529  0.22361
 2           et1 v et2   1    12   0.05   0.40000  4.74723  0.08980
 3           c vs et     1    12   0.05   2.13333  4.74723  0.26978

Type III Tests of Fixed Effects
Effect     Num DF  Den DF  F Value  Pr > F
trt             2      12     1.27  0.3169

Contrasts
Label      Num DF  Den DF  F Value  Pr > F
et1 v et2       1      12     0.40  0.5390
c vs et         1      12     2.13  0.1698

Note the close agreement of the simulated power (0.228) and the "exemplary data" power (0.224).
More Advanced Example
- Plots in an 8 x 3 grid
- Main variation along the 8 "rows"
- 3 x 2 treatment design
- Alternative designs:
  - randomized complete block (4 blocks, size 6)
  - incomplete block (8 blocks, size 3)
  - split plot
- RCBD "easy" but ignores natural variation
Picture the 8 x 3 Grid
[diagram: gradient running across the 8 rows; e.g. 8 schools, gradient is "SES", 3 classrooms each]
SAS Programs to Compare 8 x 3 Designs

Split-Plot:
data a;
 input bloc trtmnt @@;
 do s_plot=1 to 3;
  input dose @@;
  mu=trtmnt*(0*(dose=1)+4*(dose=2)+8*(dose=3));
  output;
 end;
cards;
1 1 1 2 3
1 2 1 2 3
2 1 1 2 3
2 2 1 2 3
3 1 1 2 3
3 2 1 2 3
4 1 1 2 3
4 2 1 2 3
;
proc glimmix data=a noprofile;
 class bloc trtmnt dose;
 model mu=bloc trtmnt|dose;
 random trtmnt/subject=bloc;
 parms (4) (6) / hold=1,2;
 lsmeans trtmnt*dose / diff;
 contrast 'trt x lin' trtmnt*dose 1 0 -1 -1 0 1;
 ods output diffs=b;
 ods output contrasts=c;
run;
8 x 3 – Incomplete Block:
data a;
 input bloc @@;
 do eu=1 to 3;
  input trtmnt dose @@;
  mu=trtmnt*(0*(dose=1)+4*(dose=2)+8*(dose=3));
  output;
 end;
cards;
1 1 1 1 2 1 3
2 1 1 1 2 2 2
3 1 1 1 3 2 3
4 1 1 2 1 2 2
5 1 2 1 3 2 2
6 1 2 2 1 2 3
7 1 3 2 1 2 3
8 2 1 2 2 2 3
;
proc glimmix data=a noprofile;
 class bloc trtmnt dose;
 model mu=trtmnt|dose;
 random intercept / subject=bloc;
 parms (4) (6) / hold=1,2;
 lsmeans trtmnt*dose / diff;
 contrast 'trt x lin' trtmnt*dose 1 0 -1 -1 0 1;
 ods output diffs=b;
 ods output contrasts=c;
run;
8 x 3 Example – RCBD:
data a;
 input trtmnt dose @@;
 do bloc=1 to 4;
  mu=trtmnt*(0*(dose=1)+4*(dose=2)+8*(dose=3));
  output;
 end;
cards;
1 1 1 2 1 3 2 1 2 2 2 3
;
proc glimmix data=a noprofile;
 class bloc trtmnt dose;
 model mu=bloc trtmnt|dose;
 parms (10) / hold=1;
 lsmeans trtmnt*dose / diff;
 contrast 'trt x lin' trtmnt*dose 1 0 -1 -1 0 1;
 ods output diffs=b;
 ods output contrasts=c;
run;
How did designs compare?
Suppose the main objective is to compare the regression over the 3 dose levels: do the slopes differ by treatment? (Similar to the basketball experiment.)
The operating definition is thus H0: dose regression coefficients are equal.
- Power for randomized complete block: 0.66
- Power for incomplete block: 0.85
- Power for split-plot: 0.85
Same # of observations - you can work smarter.
But what if I don’t know Trt Effect Size or Variance?
"How can I do a power analysis? If I knew the effect size and the variance, I wouldn't have to do the study."
What the trt effect size is NOT: it is NOT the effect size you are going to observe.
It is somewhere between
- what current knowledge suggests is a reasonable expectation
- the minimum difference that would be considered "important" or "meaningful"
And Variance??
Know thy relevant background / do thy homework:
- Literature search: what have others working with similar subjects reported as variance?
- Pilot study
- Educated guess:
  - the range you'd expect 95% of likely obs to cover? divide it by 4
  - the most extreme values you can plausibly imagine? divide the range by 6
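A hypothetical worked example of the two rules of thumb (the ranges below are invented for illustration, not from the talk):

```python
# guess sigma from anticipated ranges of the response
range_95pct = 40      # expect ~95% of obs between, say, 80 and 120
range_extreme = 60    # most extreme plausible values: 70 to 130
sd_from_95 = range_95pct / 4          # ~95% of a normal lies within +/- 2 sd
sd_from_extremes = range_extreme / 6  # nearly all lies within +/- 3 sd
print(sd_from_95, sd_from_extremes)   # → 10.0 10.0
```

Both rules come from normal-distribution coverage: a 95% range spans about 4 standard deviations, and the full plausible range about 6.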
Hierarchical Linear Models
From the Bovaird (10-27-2006) seminar:
- 2 treatments
- 20 classrooms / trt
- 25 students / classroom
- 4 years
- reasonable ideas of classroom(trt), student(classroom*trt), and within-student variances, as well as effect size
Implement via exemplary data + GLIMMIX.
Categorical Data?
Example: binary data
- "Standard" has a success probability of 0.25
- "New & Improved": hope to increase it to 0.30
- Have N subjects at each of L locations
For the sake of argument, suppose we have
- 900 subjects / location
- 10 locations
Power for GLMs
- 2 treatments; P{favorable outcome} for trt 1: p = 0.30; for trt 2: p = 0.25
- power if n1 = 300, n2 = 600

data a;
 input trt y n;  /* exemplary data */
datalines;
1 90 300
2 150 600
;
proc glimmix;
 class trt;
 model y/n=trt / chisq;
 ods output tests3=pwr;
run;
data power;
 set pwr;
 alpha=0.05;
 ncparm=numdf*chisq;
 crit=cinv(1-alpha,numdf,0);
 power=1-probchi(crit,numdf,ncparm);
proc print;
run;
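The same fixed-effect calculation can be reproduced outside SAS. This sketch is an illustration, not from the talk, and it uses the Pearson chi-square as the noncentrality parameter, so it will differ slightly from GLIMMIX's statistic; the 1-df noncentral chi-square tail is evaluated in closed form via the normal CDF.

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# exemplary data: trt 1 -> 90/300 (p = 0.30), trt 2 -> 150/600 (p = 0.25)
n1, p1, n2, p2 = 300, 0.30, 600, 0.25
pbar = (n1 * p1 + n2 * p2) / (n1 + n2)
ncp = (p1 - p2) ** 2 / (pbar * (1 - pbar) * (1 / n1 + 1 / n2))  # Pearson chi-square
crit = 3.841459  # chi-square(0.95; 1 df)
# P{ noncentral chi-square(1, ncp) > crit }, via its normal representation:
# chi-square(1, ncp) is the square of N(sqrt(ncp), 1)
root, shift = math.sqrt(crit), math.sqrt(ncp)
power = normal_cdf(shift - root) + normal_cdf(-shift - root)
print(round(ncp, 3), round(power, 3))
```

This mirrors the SAS step `power = 1 - probchi(crit, 1, ncparm)` with `crit = cinv(0.95, 1)`.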
Power for GLMM
- Same trt and sample size per location as before
- 10 locations
- Var(Location) = 0.25; Var(Trt*Loc) = 0.125
- Variance components: variation in log(odds ratio)
- Power?

data a;
 input trt y n;
 do loc=1 to 10;
  output;
 end;
datalines;
1 90 300
2 150 600
;
proc glimmix data=a initglm;
 class trt loc;
 model y/n = trt / oddsratio;
 random intercept trt / subject=loc;
 random _residual_;
 parms (0.25) (0.125) (1) / hold=1,2,3;
 ods output tests3=pwr;
run;
GLMM Power Analysis Results
Obs Effect NumDF DenDF alpha ncparm fcrit power
1 trt 1 9 0.05 2.29868 5.11736 0.27370
Odds Ratio Estimates
trt  _trt  Estimate  DF  95% Confidence Limits
  1     2     1.286   9  0.884  1.871

- Gives you the expected confidence limits for the # of locations & N / location contemplated
- Gives you the power of the test of the trt effect on prob(favorable)
GLMM Power: Impact of Sample Size?
- N of subjects per trt per location?
- N of locations?
Three cases:
1. n = 300/600, 10 locations
2. n = 600/1200, 10 locations
3. n = 300/600, 20 locations

data a; input trt y n; do loc=1 to 10; output; end;
datalines;
1 90 300
2 150 600
;
data a; input trt y n; do loc=1 to 10; output; end;
datalines;
1 180 600
2 300 1200
;
data a; input trt y n; do loc=1 to 20; output; end;
datalines;
1 90 300
2 150 600
;
GLMM Power: Impact of Sample Size?
Recall: for 10 locations, N = 300/600, the CI for the odds ratio was (0.884, 1.871); power was 0.274.

For 10 locations, N = 600/1200:

Odds Ratio Estimates
trt  _trt  Estimate  DF  95% Confidence Limits
  1     2     1.286   9  0.891  1.855

Obs  Effect  NumDF  DenDF  alpha  ncparm   fcrit    power
  1  trt         1      9   0.05  2.40715  5.11736  0.28421

For 20 locations, N = 300/600:

Odds Ratio Estimates
trt  _trt  Estimate  DF  95% Confidence Limits
  1     2     1.286  19  1.006  1.643

Obs  Effect  NumDF  DenDF  alpha  ncparm   fcrit    power
  1  trt         1     19   0.05  4.59736  4.38075  0.53003

N alone has almost no impact.
Recent developments
Continue the binary example. Power analysis shows:

α-level     0.10  0.05  0.05  0.01  0.05  0.01
Power       0.80  0.80  0.90  0.80  0.95  0.90
Locations     27    38    46    53    57    68

What do you do?
More Information
Consider studies directed toward improving a success rate similar to that proposed in the study.
- A literature search yields 95 such studies.
- 29 have reported statistically significant gains of p1 − p2 > 0.05 (or, alternatively, significant odds ratios of [(30/70)/(25/75)] = 1.28 or greater).
If this holds, the "prior" prob(desired effect size) is approx 0.3.
An Intro Stat Result
Pr{ desired effect size | reject H0 }

  = Pr{ reject | D.E.S. } Pr{ D.E.S. }
    / ( Pr{ reject | D.E.S. } Pr{ D.E.S. } + Pr{ reject | not D.E.S. } Pr{ not D.E.S. } )

For α = 0.10, power = 0.8:

  = (0.8 × 0.3) / (0.8 × 0.3 + 0.1 × 0.7) = 0.77

The real Pr{ type I error } is more like 0.23 than 0.10!!!
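The arithmetic above, as a small function (an illustration, not from the slides):

```python
def pr_des_given_reject(power, alpha, prior):
    """Bayes rule: Pr{desired effect size | reject H0}."""
    return power * prior / (power * prior + alpha * (1 - prior))

p = pr_des_given_reject(power=0.8, alpha=0.10, prior=0.3)
print(round(p, 2))      # → 0.77
print(round(1 - p, 2))  # → 0.23  (the "real" Pr{type I error})
```

The same function reproduces the other table entries by substituting each scenario's alpha and power.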
Returning to All Scenarios
α-level               0.10  0.05  0.05  0.01  0.05  0.01
Power                 0.80  0.80  0.90  0.80  0.95  0.90
Locations               27    38    46    53    57    68
Pr{DES | reject H0}   0.77  0.87  0.89  0.97  0.89  0.97

NOTE the dramatic impact of the alpha-level when the "prior" Pr{DES} is relatively low.
Power's role increases as Pr{DES} increases.
Closing Comments
In case it's not obvious:
- I'm not a fan of "painting by numbers"
- the role of power analysis is misunderstood & underappreciated
MOST of ALL, it is an opportunity to explore and rehearse the study design & planned analysis.
Engage a statistician as a participating member of the research team.
Give it the TIME it REQUIRES.