AP Statistics - Denton ISD
Re-expressing Data (Part 2)
Chapter 9
AP Statistics
The only statistics you can trust are those you falsified yourself.
Sir Winston Churchill (1874-1965) (attribution to Churchill is, ironically, falsified)
Goal of Re-expression
Make the distribution of a variable more symmetric:
A symmetric distribution can be analyzed much more easily than a skewed distribution.
Goal of Re-expression
Make the spread of several groups more alike:
With similar spreads, distributions are easier to compare.
Goal of Re-expression
Make the form of a scatterplot more linear:
Linear regression is easy – non-linear regression is not!
Goal of Re-expression
Make the scatter in a scatterplot spread out evenly rather than following a fan-shape:
An even scatter is a necessary condition for analysis we will learn about later.
What Transformation?
Ladder of Powers (see p. 237)
Power   Name                          Comment
2       Square of data values         Try with unimodal distributions that are skewed to the left.
1       Raw data                      Data with positive and negative values and no bounds are less likely to benefit from re-expression.
1/2     Square root of data values    Counts often benefit from a square root re-expression.
"0"     We’ll use logarithms here     Measurements that cannot be negative often benefit from a log re-expression. When in doubt, start here.
-1/2    Reciprocal square root        An uncommon re-expression, but sometimes useful.
-1      The reciprocal of the data    Ratios of two quantities (e.g., mph) often benefit from a reciprocal.
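The rungs of the ladder can be sketched as a single helper. This is a minimal illustration, not part of the slides: the function name `re_express` and the sample data are hypothetical, and power 0 is taken to mean the base-10 logarithm, following the ladder's convention.

```python
import math

def re_express(values, power):
    """Re-express data using one rung of the ladder of powers.

    power == 0 stands for the base-10 logarithm (the ladder's "0" rung);
    any other power p means value ** p.
    """
    if power == 0:
        return [math.log10(v) for v in values]
    return [v ** power for v in values]

# Hypothetical positive data, made up for illustration.
data = [1, 4, 9, 100]

print(re_express(data, 0.5))  # square roots
print(re_express(data, 0))    # base-10 logs
print(re_express(data, -1))   # reciprocals
```

Moving down the ladder (from 1 toward -1) pulls in a long right tail more and more strongly, which is why it helps to try neighboring rungs and compare the resulting plots.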
Important Models
Exponential Model: $\log(\hat{y}) = b_0 + b_1 x$
[Figure: scatterplot of the original data (y vs. x) beside the transformed data (log(y) vs. x)]
This is the zero power on the ladder. It is useful for values that grow (or shrink) by percentages.
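To see why percentage growth calls for the log rung, here is a small sketch with made-up data (a quantity starting at 100 and growing 50% per step; these numbers are assumptions, not from the slides). Equal percentage steps in y become equal additive steps in log(y), which is exactly a straight line against x.

```python
import math

# Hypothetical quantity that starts at 100 and grows 50% per step.
x = [0, 1, 2, 3, 4, 5]
y = [100 * 1.5 ** xi for xi in x]

log_y = [math.log10(yi) for yi in y]

# Successive differences of log(y): constant when y grows by a fixed percentage.
steps = [b - a for a, b in zip(log_y, log_y[1:])]
print(steps)
```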
Important Models
Logarithmic Model: $\hat{y} = b_0 + b_1 \log(x)$
Useful for data with a wide range of x-values, or when the scatterplot is very steep at the left and levels out toward the right.
[Figure: scatterplot of the original data (y2 vs. x2) beside the transformed data (y2 vs. log(x))]
Important Models
Power Model: $\log(\hat{y}) = b_0 + b_1 \log(x)$
The authors of the textbook call this one the Goldilocks Model: use it when the neighboring steps on the ladder are either too big or too small.
[Figure: scatterplot of the original data (y3 vs. x) beside the transformed data (log(y3) vs. log(x))]
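Solving the logged models for $\hat{y}$ shows what curve each one describes in the original units. The lines below are a short worked derivation, assuming base-10 logs (with natural logs, replace 10 by e):

```latex
% Exponential model: a constant multiplied by a fixed factor for each unit of x
\log(\hat{y}) = b_0 + b_1 x
  \;\Longrightarrow\;
  \hat{y} = 10^{\,b_0 + b_1 x} = 10^{b_0}\,\bigl(10^{b_1}\bigr)^{x}

% Power model: a constant times x raised to the slope
\log(\hat{y}) = b_0 + b_1 \log(x)
  \;\Longrightarrow\;
  \hat{y} = 10^{\,b_0}\cdot 10^{\,b_1 \log(x)} = 10^{b_0}\, x^{\,b_1}
```

So the slope of the straightened exponential model sets the growth factor, while the slope of the straightened power model becomes the exponent on x.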
Example
Below are data from 12 perch caught in a lake in Finland (length in cm and weight in grams).
Length (cm)   Weight (g)   Length (cm)   Weight (g)
8.8           5.9          28.7          300.0
19.2          100.0        30.1          300.0
22.5          110.0        39.0          685.0
23.5          120.0        41.4          650.0
24.0          150.0        42.5          820.0
25.5          145.0        46.6          1000.0
Example
In order to create a model to predict weight from length, start by looking at the data:
There is a fairly strong, positive, and nonlinear association between weight and length.
Example
We need to transform the data (one or both variables) to achieve a more linear relationship. In the biological sciences, power models are fairly common, so we’ll start there.
Take the logarithm of both variables (either base-10 or base-e logs; it doesn’t matter which).
The association between the logs of the variables is quite linear.
Example
Create a linear model, and then check the residuals to determine if the model may be reasonable. Note – you can’t use either R or R-squared to determine if your model is reasonable. These statistics are only useful after you assess the model fit.
Regression Analysis: log(W) versus log(L)
The regression equation is
log(W) = - 2.06 + 3.05 log(L)
Predictor   Coef      SE Coef   T        P
Constant    -2.0596   0.1498    -13.75   0.000
log(L)      3.0538    0.1037    29.44    0.000
S = 0.0680088   R-Sq = 98.9%   R-Sq(adj) = 98.7%
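The coefficients in this output can be reproduced by hand with ordinary least squares on the logged perch data. The sketch below assumes base-10 logs (consistent with the back-transformation 10^2.649 used in the prediction later on); the variable names are illustrative only.

```python
import math

# Perch data from the slide: length (cm) and weight (g).
length = [8.8, 19.2, 22.5, 23.5, 24.0, 25.5, 28.7, 30.1, 39.0, 41.4, 42.5, 46.6]
weight = [5.9, 100.0, 110.0, 120.0, 150.0, 145.0, 300.0, 300.0, 685.0, 650.0, 820.0, 1000.0]

# Re-express both variables with base-10 logs.
x = [math.log10(v) for v in length]
y = [math.log10(v) for v in weight]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of squares and cross-products around the means.
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

slope = sxy / sxx
intercept = y_bar - slope * x_bar
r_squared = sxy ** 2 / (sxx * syy)

print(f"log(W) = {intercept:.4f} + {slope:.4f} log(L)")
print(f"R-Sq = {100 * r_squared:.1f}%")
```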
Example
Linear model (remember that your calculator doesn’t know you are using log-transformed data when it produces the equation):
$\log(\widehat{\text{weight}}) = -2.06 + 3.05\,\log(\text{length})$
The residuals appear to be fairly random, so this linear model is reasonably appropriate.
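The residual check can be sketched numerically as well. This illustration plugs in the coefficients reported in the regression output (rather than refitting) and again assumes base-10 logs; a residual is the observed log(weight) minus the predicted log(weight).

```python
import math

# Perch data from the slide: length (cm) and weight (g).
length = [8.8, 19.2, 22.5, 23.5, 24.0, 25.5, 28.7, 30.1, 39.0, 41.4, 42.5, 46.6]
weight = [5.9, 100.0, 110.0, 120.0, 150.0, 145.0, 300.0, 300.0, 685.0, 650.0, 820.0, 1000.0]

# Coefficients as reported in the regression output.
b0, b1 = -2.0596, 3.0538

# Residual = observed log10(weight) - predicted log10(weight).
residuals = [
    math.log10(w) - (b0 + b1 * math.log10(l))
    for l, w in zip(length, weight)
]

for l, r in zip(length, residuals):
    print(f"{l:5.1f} cm  residual = {r:+.3f}")
```

Plotting these residuals against log(length) or against the fitted values is what reveals whether any curved pattern remains; the printed list only shows their signs and sizes.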
Example
Describe what the slope represents:
$\log(\widehat{\text{weight}}) = -2.06 + 3.05\,\log(\text{length})$
For every one-unit increase in the log of length, the predicted log of weight increases by about 3.05.
Example
Describe what the correlation represents:
The correlation is the square root of R-squared, taking the sign of the slope: the square root of 0.989 is about 0.994. This indicates there is a very strong, positive, linear relationship between the logs of weight and length.
Example
Describe what R-squared represents:
About 98.9% of the variability in the log of weight is accounted for by the regression with the log of length.
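Example
Use the model to predict the weight of a perch that is 35 cm long.
$\log(\widehat{\text{weight}}) = -2.06 + 3.05\,\log(\text{length})$
$\log(\widehat{\text{weight}}) = -2.06 + 3.05\,\log(35)$
$\log(\widehat{\text{weight}}) \approx 2.649$
$\widehat{\text{weight}} \approx 10^{2.649} \approx 445.66$
The predicted weight for a 35 cm perch is about 446 grams.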
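The same prediction can be checked in a few lines. This sketch assumes base-10 logs and uses the rounded coefficients from the equation above; the helper name `predict_weight` is hypothetical.

```python
import math

def predict_weight(length_cm, b0=-2.06, b1=3.05):
    """Predict perch weight (g) from length (cm) with the log-log model."""
    log_weight = b0 + b1 * math.log10(length_cm)  # prediction on the log scale
    return 10 ** log_weight                       # undo the base-10 log

predicted = predict_weight(35)
print(round(predicted))  # about 446 grams

# Rounding the exponent to 2.649 first, as the worked steps do, gives a value near 445.66.
print(10 ** 2.649)
```

The key step is the last one inside the function: the model predicts log(weight), so the answer must be raised as a power of 10 before it is reported in grams.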
What Can Go Wrong?
• Don’t expect the re-expressed model to be perfect.
• Don’t use R or R-squared to decide which is the best model.
• A transformation won’t make a multimodal distribution unimodal.
• You can’t transform data into a linear form if the scatterplot rises and falls in a cyclical manner.
• If your data include zeros or negative values, some transformations can’t be done (logs, for example). Sometimes, if the negative data are close to zero, you can add a very small constant (1/2 and 1/6 are common) to all data values to make them all positive.
• If you have data that are dates (years), pick a reference year to be zero, and look at years from that point forward.
What Can Go Wrong?
• Keep the model simple – avoid making multiple transformations on the same variable, or mixing quite different transformations on both variables.
• Stay close to the ladder of powers.
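Two of the cautions above, zeros before taking logs and calendar years, can be sketched as follows. The data values and the choice of 1990 as the reference year are made up for illustration; the added constant of 1/2 is one of the common choices mentioned in the list.

```python
import math

# Hypothetical counts that include a zero, so log10 cannot be applied directly.
counts = [0, 3, 12, 40]

# Add the same small constant (1/2 here) to every value before taking logs.
shifted_logs = [math.log10(c + 0.5) for c in counts]

# Hypothetical calendar years, re-expressed as years since a reference year.
years = [1990, 1995, 2000, 2010]
reference_year = 1990
years_since = [yr - reference_year for yr in years]

print(shifted_logs)
print(years_since)
```

The constant must be added to all values, not just the zeros, so that the ordering and relative spacing of the data are preserved.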
Assignment
Read Chapter 9
Exercises #15, 17-20, 25