Post on 17-Jan-2016
2.1 Relationships
What is an association between variables?
Explanatory and response variables
Key characteristics of a data set
Association between a pair of variables
Association: Some values of one variable tend to occur more often with certain values of the other variable
Both variables measured on same set of individuals
Examples:
◦ Height and weight of same individual
◦ Smoking habits and life expectancy
◦ Age and bone-density of individuals
Causation?
Caution: Often there are other, lurking variables in the background producing spurious associations:
◦ Shorter women have lower risk of heart attack
◦ Countries with more TV sets have better life expectancy rates
◦ More deaths occur when ice cream sales peak
Just explore association or investigate a causal relationship?
Key Characteristics of a Data Set
Certain characteristics of a data set are key to exploring the relationship between two variables. These should include the following:
◦ Cases: Identify the cases and how many there are in the data set.
◦ Label: Identify what is used as a label variable if one is present.
◦ Categorical or quantitative: Classify each variable as categorical or quantitative.
◦ Values: Identify the possible values for each variable.
◦ Explanatory or response: If appropriate, classify each variable as explanatory or response.
2.2 Scatterplots
Scatterplots
Interpreting scatterplots
Categorical variables in scatterplots
Examples
Does fidgeting keep you slim? Some people don’t gain weight even when they overeat. Perhaps fidgeting and other “nonexercise activity” (NEA) explains why. Here is the data:
We want to plot Y vs. X:
◦ Which is Y?
◦ Which is X?
Things to look for on a scatterplot:
◦ Form (linear, curved, exponential, parabolic)
◦ Direction:
  ◦ Positive association: Y increases as X increases
  ◦ Negative association: Y decreases as X increases
◦ Strength: Do the points follow the form closely, or are they widely scattered about it?
◦ Outliers: deviations from the overall relationship
Let’s look again…
Example: State mean SAT math score plotted against the percent of seniors taking the exam
Adding a categorical variable or grouping
May enhance understanding of the data
The categorical variable is region:
◦ “e” is for northeastern states
◦ “m” is for midwestern states
◦ All other states excluded
Example: Adding categorical variable
Other things:
◦ Plotting different categories with different symbols may throw light on the data
◦ Read Examples 2.7–2.9 for more examples of scatterplots
Existence of a relationship does not imply causation:
◦ SAT math and SAT verbal scores have a strong relationship
◦ But a person’s intelligence may be causing both
The relationship does not have to hold true for every subject; it is a statistical tendency, not a deterministic rule.
2.3 Correlation
The correlation coefficient r
Properties of r
Influential points
Correlation
◦ Linear relationships are quite common
◦ Correlation coefficient r measures strength and direction of a linear relationship between two quantitative variables X and Y
◦ Data structure: (X, Y) pairs measured on n individuals
  ◦ Weight and blood pressure
  ◦ Age and bone-density
We say a linear relationship is strong if the points lie close to a straight line and weak if they are widely scattered about a line. The following facts about r help us further interpret the strength of the linear relationship.
Properties of Correlation
r is always a number between –1 and 1.
r > 0 indicates a positive association.
r < 0 indicates a negative association.
Values of r near 0 indicate a very weak linear relationship.
The strength of the linear relationship increases as r moves away from 0 toward –1 or 1.
The extreme values r = –1 and r = 1 occur only in the case of a perfect linear relationship.
Measuring Linear Association
Formula:
$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)$$
◦ Calculate the means and standard deviations of the data
◦ Standardize X and Y: subtract the respective mean, divide by the corresponding standard deviation
◦ Take the product of standardized X and standardized Y for each subject
◦ Add up and divide by n – 1
◦ Or just ask your calculator nicely!
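The steps above can be sketched in plain Python. This is a hypothetical helper for illustration (the `correlation` function name and the sample points are not from the slides):

```python
def correlation(xs, ys):
    """r = average (over n-1) of products of standardized x and y values."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n                       # means
    sx = (sum((x - mx) ** 2 for x in xs) / (n - 1)) ** 0.5  # SD of x
    sy = (sum((y - my) ** 2 for y in ys) / (n - 1)) ** 0.5  # SD of y
    # product of standardized values for each subject, summed, divided by n-1
    return sum(((x - mx) / sx) * ((y - my) / sy)
               for x, y in zip(xs, ys)) / (n - 1)

# Points lying perfectly on y = 2x + 3 give r = 1 exactly
xs = [1, 2, 3, 4, 5]
ys = [2 * x + 3 for x in xs]
print(round(correlation(xs, ys), 4))  # → 1.0
```

The same answer comes from any statistics package; the slides' SAS route is proc corr.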
Issues:
◦ r is affected by outliers
◦ r captures only the strength of the “linear” relationship: it could be true that Y and X have a very strong non-linear relationship but r is close to zero
◦ r = +1 or –1 only when the points lie perfectly on a straight line (e.g., Y = 2X + 3)
◦ SAS program: correlation.doc; proc corr is the procedure
Correlation Examples
For each graph, estimate the correlation r and interpret it in context.
(Four example scatterplots were shown.)
2.4 Least-Squares Regression
Regression lines
Least-squares regression line
Predictions
Facts about least-squares regression
Correlation and regression
Regression Line
◦ Straight line which best describes how the response variable y changes when the explanatory variable x changes
◦ We do distinguish between Y and X; we cannot switch their roles
Equation of the straight line: ŷ = b0 + b1x
◦ ŷ is the predicted value (the line at a given x value)
◦ b0 is the intercept (where the line crosses the y-axis)
◦ b1 is the slope (rate of change)
Procedure:
◦ Calculate the best b0 and b1 for your data, i.e., find the line that best fits your data
◦ Use this line to predict y for different values of x
Example: Regression line for NEA data. We can predict the mean fat gain at 400 calories
Prediction and Extrapolation
Fitted line for NEA data:
Pred. fat gain = 3.505 – 0.00344(NEA)
Prediction at 400 calories:
Pred. fat gain = 3.505 – 0.00344*400 = 2.13 kg
So when a person’s NEA increases by 400 calories while overeating, their predicted fat gain is 2.13 kilograms.
Prediction and Extrapolation
Warning: Extrapolation--predicting beyond the range of the data--is dangerous!
Prediction at 1500 calories
Pred. fat gain = 3.505 – 0.00344*1500 = -1.66 kg
So for a 1500-calorie NEA increase when overeating, the prediction is a weight loss of 1.66 kilograms.
◦ Not trustworthy: this is far outside the range of the data
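A minimal sketch of the two predictions from the fitted line on the slides (the function name is hypothetical):

```python
# Fitted line from the slides: predicted fat gain (kg) = 3.505 - 0.00344 * NEA (cal)
def predicted_fat_gain(nea_cal):
    return 3.505 - 0.00344 * nea_cal

print(round(predicted_fat_gain(400), 3))   # → 2.129, i.e., about 2.13 kg
print(round(predicted_fat_gain(1500), 3))  # → -1.655, the slide's -1.66 kg after rounding
```

The arithmetic is the same either way; the point is that the 1500-calorie prediction is an extrapolation far beyond the observed NEA values and should not be trusted.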
Least Squares Regression (LSR) Line
The line which makes the sum of squares of the vertical distances of the data points from the line as small as possible
◦ y is the observed (actual) response
◦ ŷ is the predicted response by using the line
◦ Residuals: error in prediction, y – ŷ
Formula for Least Squares Regression line
Calculate: $\bar{X},\ \bar{Y},\ s_x,\ s_y,\ r$
slope: $b_1 = r\,\dfrac{s_y}{s_x}$
intercept: $b_0 = \bar{Y} - b_1\bar{X}$
line: $\hat{y} = b_0 + b_1 x$
Example: (NEA data)
$\bar{X} = 324.8$ cal, $s_x = 257.66$ cal
$\bar{Y} = 2.388$ kg, $s_y = 1.1389$ kg
$r = -0.7786$
Plug these into $b_1 = r\,s_y/s_x$ and $b_0 = \bar{Y} - b_1\bar{X}$.
Using the formula:
Slope: b1 = -.7786 * 1.1389/257.66 = -0.00344
Intercept: b0 = (mean of y) – slope * (mean of x) = 2.388 – (-0.00344)*324.8 = 3.505
Regression line: Predicted fat gain = 3.505 – 0.00344*cal
ŷ = 3.505 – 0.00344x
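The slope and intercept calculation above can be reproduced directly from the summary statistics. A sketch in Python (variable names are illustrative):

```python
# Summary statistics read off the NEA example
x_bar, s_x = 324.8, 257.66    # mean and SD of NEA (calories)
y_bar, s_y = 2.388, 1.1389    # mean and SD of fat gain (kg)
r = -0.7786                   # correlation

b1 = r * s_y / s_x            # slope: b1 = r * sy / sx
b0 = y_bar - b1 * x_bar       # intercept: b0 = ybar - b1 * xbar
print(round(b1, 5))           # → -0.00344
print(round(b0, 3))           # → 3.506 (the slide's 3.505 uses the rounded slope)
```

Carrying the unrounded slope gives an intercept of about 3.506; the slide's 3.505 comes from plugging in the rounded value -0.00344, so the tiny discrepancy is rounding, not an error.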
Example: Predicted values and Residuals
Predicted fat gain for observation 2 (-57 cal):
ŷ2 = 3.505 – 0.00344*(-57) = 3.70108 kg
Observed fat gain: y2 = 3.0 kg
Residual or error in prediction = y2 - ŷ2 = 3.0 – 3.70108 = -0.70108 kg
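The observation-2 calculation, sketched as code:

```python
b0, b1 = 3.505, -0.00344      # fitted line from the slides
x2, y2 = -57, 3.0             # observation 2: NEA change (cal), observed fat gain (kg)

y_hat2 = b0 + b1 * x2         # predicted fat gain
residual = y2 - y_hat2        # error in prediction: observed minus predicted
print(round(y_hat2, 5))       # → 3.70108
print(round(residual, 5))     # → -0.70108
```

A negative residual means the line over-predicted this subject's fat gain.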
Residual practice
Residual is yi – ŷi
For the NEA data, observation 14 has NEA = x14 = 580
◦ Find the predicted value, ŷ14
◦ Find the residual, y14 – ŷ14
Properties of the regression line
◦ Cannot switch Y and X
◦ Passes through the mean of x and the mean of y
Physical interpretation of the slope b1:
◦ With one unit increase in X, how much does Y change on average?
◦ Example: NEA data: with a 1-calorie increase in NEA, fat gain changes by -0.00344 kg
◦ How about a 100-calorie increase in NEA?
Using software
◦ SAS will evaluate the least squares regression line, but you have to know where to find the results in the output!
◦ Residuals and predicted values are also printed
◦ SAS program: regression.doc; the regression procedure is proc reg
We will do a deeper analysis of regression in chapter 10
Correlation and Regression
◦ In correlation, X and Y are interchangeable; NOT so in regression.
◦ The slope (b1) depends on the correlation (r)
R2—Coefficient of Determination
◦ Square of the correlation
◦ Fraction of the variation in y explained by the LSR line
◦ Higher R2 suggests a better fit
◦ Example: R2 = 0.6062 for the NEA data means that 60.62% of the variation in fat gain is explained by the fitted regression line with x = NEA.
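Since R² is just the square of r, the NEA value is a one-line check:

```python
r = -0.7786                   # correlation for the NEA data (from the slides)
r_squared = r ** 2            # coefficient of determination R^2
print(round(r_squared, 4))    # → 0.6062: about 60.6% of the variation in fat gain
                              #   is explained by the least-squares line
```

Note that squaring discards the sign, so R² alone does not tell you whether the association is positive or negative.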
R2—another example
◦ R2 = 0.989: less spread, tight fit
◦ R2 = 0.849: more scatter, more error in prediction
R2 measures the part of the variation of y which comes from the linear relationship between y and x; in this case, between Height and Age.
2.5 Cautions About Correlation and Regression
Residuals and residual plots
Outliers and influential observations
Lurking variables
Correlation and causation
Residuals
A regression line describes the overall pattern of a linear relationship between an explanatory variable and a response variable. Deviations from the overall pattern are also important. The vertical distances between the points and the least-squares regression line are called residuals.
A residual is the difference between an observed value of the response variable and the value predicted by the regression line:
residual = observed y – predicted y = y – ŷ
Residuals add up to zero and have a mean of zero
Thus, a fit is considered good if the residual plot shows a random spread of points about the zero line, with no definitive pattern
Residual plot
◦ Scatterplot of residuals against the explanatory variable
◦ Helps assess the fit of the regression line
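The fact that least-squares residuals sum to zero can be verified directly. A minimal sketch with a hypothetical data set (the `least_squares` helper and the numbers are illustrative, not from the slides):

```python
def least_squares(xs, ys):
    """Intercept b0 and slope b1 of the least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    b0 = my - b1 * mx
    return b0, b1

# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = least_squares(xs, ys)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
print(abs(sum(residuals)) < 1e-9)  # → True: residuals sum to (essentially) zero
```

This is why a good residual plot scatters randomly about the zero line: the positives and negatives must balance by construction.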
Outliers and influential observations
◦ Outliers: lie outside the pattern of the other observations
  ◦ Y-outliers: large residual
  ◦ X-outliers: often influential in regression
◦ Influential points: deleting such a point changes your statistical analysis drastically; influential points pull the regression line towards themselves
◦ Least squares regression is NOT robust to the presence of outliers
Example: Gesell data
r = 0.4819
◦ Subject 15: Y-outlier, far from line, high residual
◦ Subject 18: X-outlier, close to line, small residual
Example: Gesell data
r = 0.4819
◦ Drop 15: r = 0.5684
◦ Drop 18: r = 0.3837
Both have some influence, but neither seems excessive
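How strongly a single X-outlier can distort r is easy to demonstrate. The data below are hypothetical (not the Gesell data, which the transcript does not reproduce):

```python
def corr(xs, ys):
    """Correlation r, the standardize-and-average formula from section 2.3."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = (sum((x - mx) ** 2 for x in xs) / (n - 1)) ** 0.5
    sy = (sum((y - my) ** 2 for y in ys) / (n - 1)) ** 0.5
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / ((n - 1) * sx * sy)

# Hypothetical data: five points near a line, plus one X-outlier off the trend
xs = [1, 2, 3, 4, 5, 20]
ys = [1.1, 2.0, 2.9, 4.2, 5.1, 2.0]
print(round(corr(xs, ys), 2))            # with the outlier: r is near zero
print(round(corr(xs[:-1], ys[:-1]), 2))  # outlier dropped: r is nearly 1
```

In the Gesell example both candidate points move r by less than 0.1, which is why neither is judged excessively influential; here a single point drags r from almost 1 down to roughly 0.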
Association, however strong, does NOT imply causation.
Some possible explanations for an observed association
The dashed lines show an association. The solid arrows show a cause-and-effect link. x is explanatory, y is response, and z is a lurking variable.
Explaining Association: Causation
It appears that lung cancer is associated with smoking.
How do we know that both of these variables are not being affected by an unobserved third (lurking) variable?
For instance, what if there is a genetic predisposition that causes people to both get lung cancer and become addicted to smoking, but the smoking itself doesn’t CAUSE lung cancer?
Establishing Causation
We can evaluate the association using the following criteria:
1. The association is strong.
2. The association is consistent.
3. Higher doses are associated with stronger responses.
4. The alleged cause precedes the effect.
5. The alleged cause is plausible.