ap statistics - denton isd...ap statistics 1 the only statistics you can trust are those you...


Post on 24-Jan-2021


Page 1: Re-expressing Data (Part 2)

Chapter 9

AP Statistics

The only statistics you can trust are those you falsified yourself.

Sir Winston Churchill (1874 - 1965) (Attribution to Churchill is ironically falsified)

Page 2: Goal of Re-expression

Make the distribution of a variable more symmetric:

A symmetric distribution can be analyzed much more easily than a skewed distribution.
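As a rough illustration of this goal, the sketch below measures skewness before and after a log re-expression. The data values and the `skewness` helper are hypothetical and are not taken from the slides; the helper uses the simple "average cubed z-score" definition.

```python
import math

def skewness(values):
    """Skewness as the average cubed z-score (population form)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / sd) ** 3 for v in values) / n

# Hypothetical right-skewed data (not from the slides): each value doubles.
raw = [1, 2, 4, 8, 16, 32, 64, 128, 256]
logged = [math.log10(v) for v in raw]

print(skewness(raw))     # clearly positive: a long right tail
print(skewness(logged))  # essentially zero: the logs are evenly spaced
```

Values that grow by a constant factor become evenly spaced after taking logs, which is why the re-expressed distribution is symmetric here.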

Page 3: Goal of Re-expression

Make the spread of several groups more alike:

With similar spreads, distributions are easier to compare.

Page 4: Goal of Re-expression

Make the form of a scatterplot more linear:

Linear regression is easy – non-linear regression is not!

Page 5: Goal of Re-expression

Make the scatter in a scatterplot spread out evenly rather than following a fan shape:

An even scatter is a necessary condition for analyses we will learn about later.

Page 6: What Transformation?

Ladder of Powers (see p 237)

Power | Name | Comment
2 | Square of data values | Try with unimodal distributions that are skewed to the left.
1 | Raw data | Data with positive and negative values and no bounds are less likely to benefit from re-expression.
1/2 | Square root of data values | Counts often benefit from a square root re-expression.
"0" | We'll use logarithms here | Measurements that cannot be negative often benefit from a log re-expression. When in doubt, start here.
-1/2 | Reciprocal square root | An uncommon re-expression, but sometimes useful.
-1 | The reciprocal of the data | Ratios of two quantities (e.g., mph) often benefit from a reciprocal.
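One way to read the ladder is as a single function of the data value and the power, with the "0" rung standing in for the logarithm. The sketch below is only an illustration: the name `re_express` is hypothetical, base-10 logs are an assumption, and values are assumed to be positive.

```python
import math

def re_express(value, power):
    """Re-express one positive data value at a given rung of the ladder of powers."""
    if value <= 0:
        raise ValueError("this sketch assumes positive data values")
    if power == 0:
        return math.log10(value)  # the "0" rung stands for the logarithm
    return value ** power         # 2, 1, 1/2, -1/2, -1 are ordinary powers

# Walk down the ladder for one example value.
for power in [2, 1, 0.5, 0, -0.5, -1]:
    print(power, re_express(100, power))
```

Moving down the ladder pulls in a long right tail more and more strongly; moving up does the opposite.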

Page 7: Important Models

Exponential Model: $\widehat{\log y} = b_0 + b_1 x$

[Figure: Original Data (y vs. x) and Transformed Data (log(y) vs. x)]

This is the zero power on the ladder. It is useful for values that grow (or shrink) by percentages.
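To make the "grow by percentages" idea concrete, the sketch below fits a straight line to log(y) versus x and reads the growth factor back off the slope. The data are hypothetical (exact 20% growth, no noise), the `fit_line` helper is not from the slides, and base-10 logs are assumed.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares line ys ≈ b0 + b1 * xs; returns (b0, b1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

# Hypothetical data (not from the slides): y is 5 at x = 0 and grows 20% per unit of x.
xs = list(range(1, 11))
ys = [5 * 1.2 ** x for x in xs]

b0, b1 = fit_line(xs, [math.log10(y) for y in ys])
growth_factor = 10 ** b1  # multiplicative change in y per one-unit increase in x
start_value = 10 ** b0    # predicted y at x = 0
print(round(growth_factor, 3), round(start_value, 3))
```

Because the data grow by a constant percentage, log(y) is exactly linear in x and the fit recovers the growth factor and starting value.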

Page 8: Important Models

Logarithmic Model: $\hat{y} = b_0 + b_1 \log x$

Useful for data with a wide range of x-values, or with a scatterplot that is very steep at the left and levels out towards the right.

[Figure: Original Data (y2 vs. x2) and Transformed Data (y2 vs. log(x))]

Page 9: Important Models

Power Model: $\widehat{\log y} = b_0 + b_1 \log x$

The authors of the textbook call this one the Goldilocks Model – when steps on the ladder are either too big or too small.

[Figure: Original Data (y3 vs. x) and Transformed Data (log(y3) vs. log(x))]
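For reference, the three models can be rewritten in the original units of y. This is a sketch that assumes base-10 logarithms; with natural logs, replace 10 by e.

$$
\begin{aligned}
\text{Exponential:}\quad & \widehat{\log y} = b_0 + b_1 x && \Longrightarrow\quad \hat{y} = 10^{b_0}\,\bigl(10^{b_1}\bigr)^{x} \\
\text{Logarithmic:}\quad & \hat{y} = b_0 + b_1 \log x && \text{(already in the units of } y\text{)} \\
\text{Power:}\quad & \widehat{\log y} = b_0 + b_1 \log x && \Longrightarrow\quad \hat{y} = 10^{b_0}\, x^{b_1}
\end{aligned}
$$

In the exponential model the slope becomes a growth factor per unit of x, while in the power model the slope becomes the exponent on x.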

Page 10: Example

Below are data from 12 perch caught in a lake in Finland (length in cm and weight in grams).

Length (cm)  Weight (g)  Length (cm)  Weight (g)
        8.8         5.9         28.7       300.0
       19.2       100.0         30.1       300.0
       22.5       110.0         39.0       685.0
       23.5       120.0         41.4       650.0
       24.0       150.0         42.5       820.0
       25.5       145.0         46.6      1000.0

Page 11: Example

In order to create a model to predict weight from length, start by looking at the data:

There is a fairly strong, positive, and nonlinear association between weight and length.

Page 12: Example

Page 13: Example

We need to transform the data (one or both variables) to achieve a more linear relationship. In the biological sciences, power models are fairly common, so we’ll start there.

Take the logarithm of both variables (base-10 or base-e; either works, as long as you use the same base throughout).

The association between the logs of the variables is quite linear.

Page 14: Example

Create a linear model, and then check the residuals to determine if the model may be reasonable. Note – you can’t use either R or R-squared to determine if your model is reasonable. These statistics are only useful after you assess the model fit.

Regression Analysis: log(W) versus log(L)

The regression equation is
log(W) = - 2.06 + 3.05 log(L)

Predictor     Coef  SE Coef       T      P
Constant   -2.0596   0.1498  -13.75  0.000
log(L)      3.0538   0.1037   29.44  0.000

S = 0.0680088   R-Sq = 98.9%   R-Sq(adj) = 98.7%
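The fit can be reproduced by hand from the 12 perch measurements. The sketch below is not the software used for the slides; it assumes base-10 logs and plain least squares, and also computes the residuals that the slide says to check.

```python
import math

# Perch data from the table in this example: (length in cm, weight in g).
perch = [
    (8.8, 5.9), (19.2, 100.0), (22.5, 110.0), (23.5, 120.0),
    (24.0, 150.0), (25.5, 145.0), (28.7, 300.0), (30.1, 300.0),
    (39.0, 685.0), (41.4, 650.0), (42.5, 820.0), (46.6, 1000.0),
]

log_l = [math.log10(length) for length, _ in perch]
log_w = [math.log10(weight) for _, weight in perch]

n = len(perch)
mean_x = sum(log_l) / n
mean_y = sum(log_w) / n
sxx = sum((x - mean_x) ** 2 for x in log_l)
syy = sum((y - mean_y) ** 2 for y in log_w)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_l, log_w))

slope = sxy / sxx                      # coefficient on log(L)
intercept = mean_y - slope * mean_x    # the "Constant" row
r_squared = sxy ** 2 / (sxx * syy)     # R-Sq as a proportion

# Residuals on the log scale; plot these against log_l to look for leftover pattern.
residuals = [y - (intercept + slope * x) for x, y in zip(log_l, log_w)]

print(round(intercept, 4), round(slope, 4), round(r_squared, 3))
```

The printed values should agree with the Constant, log(L), and R-Sq entries in the output above, up to rounding.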

Page 15: Example

Linear Model – remember your calculator doesn’t know you are using log-transformed data when it produces the equation.

$\widehat{\log(\text{weight})} = -2.06 + 3.05\,\log(\text{length})$

The residuals appear to be fairly random, so this linear model is reasonably appropriate.

Page 16: Example

Describe what the slope represents:

$\widehat{\log(\text{weight})} = -2.06 + 3.05\,\log(\text{length})$

For every one-unit increase in the log of length, the log of weight increases by about 3.05.
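The slope can also be read in the original units. The rewriting below is a sketch that assumes base-10 logs; it is not part of the original slides.

$$
\widehat{\log(\text{weight})} = -2.06 + 3.05\,\log(\text{length})
\quad\Longrightarrow\quad
\widehat{\text{weight}} = 10^{-2.06}\,\text{length}^{3.05} \approx 0.0087\,\text{length}^{3.05}
$$

A one-unit increase in log(length) means the length is multiplied by 10, which multiplies the predicted weight by $10^{3.05} \approx 1122$. On a more natural scale, doubling the length multiplies the predicted weight by $2^{3.05} \approx 8.3$.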

Page 17: Example

Describe what the correlation represents:

R-squared is 98.9%, so the correlation is its square root, about 0.994 (positive because the slope is positive). This indicates a very strong, positive, linear relationship between the logs of weight and length.


Page 18: Example

Describe what R-squared represents:

About 98.9% of the variability in the log of weight is accounted for by the regression with the log of length.


Page 19: Example

Use the model to predict the weight of a perch that is 35 cm long.

$\widehat{\log(\text{weight})} = -2.06 + 3.05\,\log(\text{length})$

$\widehat{\log(\text{weight})} = -2.06 + 3.05\,\log(35)$

$\widehat{\log(\text{weight})} \approx 2.649$

$\widehat{\text{weight}} = 10^{2.649}$

$\widehat{\text{weight}} \approx 445.66$

The predicted weight for a 35 cm perch is about 446 grams.
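The same prediction can be sketched in code. The function name `predict_weight` is hypothetical; the default coefficients are the rounded values from the regression equation, and base-10 logs are assumed.

```python
import math

def predict_weight(length_cm, intercept=-2.06, slope=3.05):
    """Predict perch weight in grams from length in cm using the log-log model."""
    log_weight = intercept + slope * math.log10(length_cm)
    return 10 ** log_weight  # undo the base-10 log to return to grams

print(round(predict_weight(35)))  # about 446
```

Carrying full precision through the calculation gives a value a few tenths of a gram above the 445.66 obtained by rounding the log to 2.649 first; both round to about 446 grams.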

Page 20: What Can Go Wrong?

• Don’t expect the re-expressed model to be perfect.

• Don’t use R or R-squared to decide which is the best model.

• A transformation won’t make a multimodal distribution unimodal.

• You can’t transform data into a linear form if the scatterplot rises and falls in a cyclical manner.

• If your data include zeros or negative values, some transformations can’t be done (logs, for example). Sometimes, if those values are close to zero, you can add a small constant (1/2 and 1/6 are common) to all data values to make them all positive.

• If you have data that are dates (years), pick a reference year to be zero, and count years from that point forward.
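The last two points can be sketched as small helpers. The function names, the default shift of 1/2, and the sample values are all hypothetical choices for illustration.

```python
import math

def log_with_shift(values, shift=0.5):
    """Add a small constant before taking base-10 logs so zeros become usable."""
    shifted = [v + shift for v in values]
    if min(shifted) <= 0:
        raise ValueError("shift is too small: some values are still not positive")
    return [math.log10(v) for v in shifted]

def years_since(years, reference_year):
    """Re-express calendar years as years elapsed since a chosen reference year."""
    return [year - reference_year for year in years]

counts = [0, 1, 3, 10]  # hypothetical counts that include a zero
print(log_with_shift(counts))
print(years_since([1990, 2000, 2010], 1990))  # [0, 10, 20]
```

The shift must be applied to every value, not only the zeros, so that the ordering and relative spacing of the data are preserved.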

Page 21: What Can Go Wrong?

• Keep the model simple – avoid making multiple transformations on the same variable, or mixing quite different transformations on both variables.

• Stay close to the ladder of powers.

Page 22: Assignment

Read Chapter 9

Exercises #15, 17-20, 25
