Advanced Research Methods: Regression Analysis Theory and Modeling

By Erlan Bakiev, Ph.D.

Faculty and Text

• Textbooks:

  – Keith, T. Z. (2006). Multiple Regression and Beyond. Pearson.

  – Montgomery, Peck, and Vining (2006). Introduction to Linear Regression Analysis, 4th ed.

Lecture Outline

• Overview of Regression Analysis

• Example

• Guidelines for Group Project (due week 5 class)

Goals of Today’s Lecture

• What is regression analysis?

• Introduction to the most widely used statistical models

  – Linear regression

  – Logistic regression

• How these models are used to analyze data and inform decisions

  – When different models are appropriate

  – How to fit, interpret, and assess different models

• Practice with different data sets

A Note on SPSS

• SPSS stands for Statistical Package for the Social Sciences, and it is one of the most widely used programs for statistical analysis in the social sciences.

• Widely used by market researchers, health researchers, survey companies, government, education researchers, marketing organizations and others

  – User friendly

• You will use a basic set of commands for model fitting

• Current version of SPSS is 18

Regression Analysis

• Regression analysis is a statistical tool for the investigation of relationships between variables

• The investigator seeks to understand the causal effect of one variable upon another, for example

  – the effect of a price increase upon demand

  – the effect of changes in the money supply upon the inflation rate

Types of Regression Models

• Two types of regression models

  – Linear regression

    • Simple linear regression (continuous Y, one X)

    • Multiple linear regression (continuous Y, several Xs)

  – Logistic regression

    • Binary Y, several Xs

• Linear regression forms the basis for understanding most regression techniques

Steps in a Regression Analysis

Explore the Data Graphically

Select a Tentative Structure

Estimate the structure and its uncertainty

Assess the plausibility of the tentative structure

Use the estimated structure for your inferences (suitably qualified by the estimated uncertainty)

Simple Regression

• Regression analysis with a single explanatory variable is termed “simple regression.”

• Ex: the relationship between education and income

• I = α + βE + ε, where

  – I = income and E = years of education;

  – α = a constant amount (what one earns with zero education);

  – β = the effect in dollars of an additional year of schooling on income, hypothesized to be positive; and

  – ε = the “noise” term reflecting other factors that influence earnings.
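To make the roles of α, β, and ε concrete, here is a minimal Python sketch that generates incomes from the education model. The parameter values, the noise spread, and the function name `simulate_income` are assumptions made up for illustration; they do not come from the lecture or from any data set.

```python
import random

# Hypothetical parameter values, chosen only for illustration.
ALPHA = 10_000.0   # income with zero years of education (dollars)
BETA = 3_000.0     # effect of one additional year of schooling (dollars)

def simulate_income(years_of_education, rng):
    """Draw one income from I = ALPHA + BETA * E + epsilon."""
    epsilon = rng.gauss(0.0, 5_000.0)  # "noise": other factors affecting earnings
    return ALPHA + BETA * years_of_education + epsilon

rng = random.Random(0)  # fixed seed so the run is reproducible
education = [8, 10, 12, 14, 16, 18]
income = [simulate_income(e, rng) for e in education]

for e, i in zip(education, income):
    systematic = ALPHA + BETA * e
    print(f"E = {e:2d}  systematic part = {systematic:8.0f}  "
          f"observed I = {i:9.0f}  noise = {i - systematic:8.0f}")
```

Each observed income is the systematic part α + βE plus a noise draw. In a real analysis α and β are unknown and have to be estimated from the observed (E, I) pairs.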

Multiple Regression

• Multiple regression is a technique that allows additional factors to enter the analysis separately so that the effect of each can be estimated.

• Several independent variables can influence a dependent variable simultaneously.

• Ex: the influence of education and experience on income

• The model is as follows: I = α + βE + γX + ε, where X is experience and γ is expected to be positive
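As a rough illustration of how the effects of education and experience can be estimated separately, the following Python sketch solves the least-squares normal equations for I = α + βE + γX. The data are hypothetical and built without noise so the result can be checked by eye; the helper names `solve_linear_system` and `fit_multiple_regression` are assumptions for this example, and in practice SPSS or a statistics library would do this step.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build an augmented matrix so the inputs are not modified.
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x

def fit_multiple_regression(rows, y):
    """Least-squares coefficients [b0, b1, ...] from the normal equations."""
    design = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y_i for d, y_i in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Hypothetical (education, experience) pairs, both in years.
predictors = [(8, 20), (10, 5), (12, 15), (14, 2), (16, 10), (18, 7)]
# Income built exactly from I = 5000 + 2500*E + 800*X (no noise).
income = [5000 + 2500 * e + 800 * x for e, x in predictors]

alpha, beta, gamma = fit_multiple_regression(predictors, income)
print(f"alpha = {alpha:.1f}, beta = {beta:.1f}, gamma = {gamma:.1f}")
```

Because the incomes were generated without noise, the estimates recover the coefficients used to build the data, up to rounding error.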

• Data sets consist of a set of observations

  – Each observation contains

    • One dependent variable (“Y”)

    • One or more independent variables (“Xs”)

• We want to determine if there is a structure to the data set

  – Relationship between the response and one or more predictors

Model-Based Data Analysis

Response = structure(predictors) + “error”

• Structure

  – Varies from observation to observation

    • Function of predictors

  – Systematic, deterministic

• “Error”

  – The error of a sample is the deviation of the sample from the (unobservable) true function value.

Modeling Process

• Fitting a model then requires:

  – Selecting a tentative structure

  – Estimating the structure and its uncertainty

    • Typically as statistics and their sampling distributions

  – Assessing plausibility of the choice of structure

    • Model diagnostics

Characterizing the Relationship Between Two Variables

• Outline

  – Types of variables

  – Correlation

  – Graphical Methods

  – Linear Regression

Independent and Dependent Variables

• Each observation has two parts: a dependent Y and one or more independent Xs

• We are interested in the relationship of X and Y

  – At a minimum, X and Y vary together

  – X and Y are associated

• Statistical relationship does not imply causation

  Gujarati, D. N. (2003). Basic Econometrics, International Edition, 4th ed. McGraw-Hill Higher Education. pp. 22-24. ISBN 0-07-112342-3.

Examples

• X: income, Y: consumption

• X: parents’ SES, Y: education level

• X: education level, Y: income

• X: range to target, Y: probability of hit/kill

• The statistical methods we will study establish association

• Association does not entail causality

Types of Variables

• Continuous: variable can take any value in a (possibly infinite) range

  – Money, height, blood pressure, weight

• Discrete: variable takes on a countable set of numerical values (often finite and small)

  – People in a queue, hits on a target

• Ordinal: variable has a finite set of non-numeric, but ordered values (categorical)

  – Level of schooling, rating

• Nominal: variable has a finite set of non-numeric, non-ordered values

  – Religion, gender

Methods of Examining Relationships

Method                 X                       Y

Simple Regression      continuous              continuous

Multiple Regression    continuous, discrete    continuous

Logistic Regression    continuous, discrete    nominal

• Other combinations are possible

• These methods form the foundation of most other methods

Continuous X, Continuous Y

• Simplest case

• Correlation

  – Closely related to simple regression

  – Widely used in some substantive areas to characterize association

• Graphical methods

  – Essential exploratory and diagnostic tool

A Graphic Illustration

[Figure: scatterplot of x1 and y1, both ranging from -2 to 2. Correlation 0.8]

Correlation

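• Definition

$$\operatorname{Corr}(X, Y) = \frac{E\big[(X - E X)(Y - E Y)\big]}{\big(\operatorname{Var}(X)\,\operatorname{Var}(Y)\big)^{1/2}} = \frac{\operatorname{Cov}(X, Y)}{\big(\operatorname{Var}(X)\,\operatorname{Var}(Y)\big)^{1/2}}$$

• Attributes

  – Measures how X and Y vary together linearly

  – Scale-free

  – Ranges from –1 to +1

  – Zero correlation is not necessarily independence

• Note that this is pure association, no specification of response/dependent and predictor/independent variables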
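The definition and the listed attributes can be checked numerically. The Python sketch below computes the sample version of the correlation (covariance divided by the square root of the product of the variances) on small made-up data sets; the function name `correlation` and the data are assumptions for illustration only.

```python
from math import sqrt

def correlation(x, y):
    """Sample correlation: Cov(x, y) / sqrt(Var(x) * Var(y))."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    return cov / sqrt(var_x * var_y)

x = [-2, -1, 0, 1, 2]
r_up = correlation(x, [2 * a + 1 for a in x])      # exact increasing line
r_down = correlation(x, [-3 * a for a in x])       # exact decreasing line
r_curve = correlation(x, [a ** 2 for a in x])      # y is a function of x
# Scale-free: rescaling x (for example, a change of units) should not matter.
r_scaled = correlation([1000 * a for a in x], [2 * a + 1 for a in x])

print(r_up, r_down, r_curve, r_scaled)
```

The third case shows why zero correlation is not independence: y = x² is completely determined by x, yet the linear association is zero because the relationship is symmetric around x = 0.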

Examples of Correlation

[Figure: six scatterplots of x1 and y1 with correlations -0.9, -0.5, 0.0, 0.1, 0.5, and 0.8]

Problems with Correlation

• Correlation is a single number summary of association

• But

  – Outliers can dominate the value (it is not robust)

  – Zero correlation does not mean no relation

• By itself, it does not allow the prediction of Y from X

  – This is usually of interest

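Graphical Methods: Two-Way Scatterplot (1993 Car Data)

[Figure: scatterplot of Engine Size (l), 1 to 6, against Weight (lbs), 1500 to 4000]

* Stata: read engine size and weight from the raw data file, skipping the unused variables
infile _skip(11) EngSiz _skip(12) Wt _skip(1) using 93cars.dat
* Plot engine size (vertical axis) against weight (horizontal axis)
scatter EngSiz Wt, ytitle("Engine Size (l)") xtitle("Weight (lbs)")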

Linear Regression

• In simple linear regression, we use the following model for the expected value of y given x:

  – E(y|x) = β0 + β1x

• Relationship characterized by two numbers

  – Straight line

• Wide applicability in practical situations

  – Many relationships approximately linear (or can be made so)

  – Forms the basis for more sophisticated analysis

The “Best” Straight Line

• How do we construct the best line?

• Some ideas

  – Minimize sum of distances from each Y to the line

  – Minimize sum of absolute values of distances from each Y to the line

  – Minimize sum of squared distances from Y to the line,

$$\sum_{i=1}^{N} (y_i - \beta_0 - \beta_1 x_i)^2$$
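The three candidate criteria can be compared on a small example. The Python sketch below uses made-up data and the standard closed-form least-squares solution (slope = Sxy / Sxx, intercept = mean of y minus slope times mean of x); the function names `least_squares_line` and `criteria` are assumptions for illustration.

```python
def least_squares_line(x, y):
    """Return (b0, b1) minimizing the sum of squared vertical distances."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

def criteria(x, y, b0, b1):
    """Return (sum of signed, sum of absolute, sum of squared) distances."""
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    return (sum(residuals),
            sum(abs(r) for r in residuals),
            sum(r * r for r in residuals))

# Hypothetical data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = least_squares_line(x, y)
fit_signed, fit_abs, fit_sq = criteria(x, y, b0, b1)
# Competing line: horizontal at the mean of y.
flat_signed, flat_abs, flat_sq = criteria(x, y, sum(y) / len(y), 0.0)

print(f"least-squares line: y = {b0:.2f} + {b1:.2f} x")
print(f"signed sums:  fit {fit_signed:.2f}, flat {flat_signed:.2f}")
print(f"squared sums: fit {fit_sq:.2f}, flat {flat_sq:.2f}")
```

The sum of signed distances is zero for both lines, since positive and negative distances cancel for any line through the point of means, so the first idea cannot pick a best line. The squared criterion does distinguish them and is smallest for the least-squares line.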

Regression Line for Car Data

[Figure: scatterplot of EngSiz, Engine Size (l), against Weight (lbs), 1500 to 4000, with the fitted values drawn as a line]

EngSiz = -1.90 + 0.0015 Weight

* Stata: fit the regression, store the fitted values, and overlay them on the scatterplot
regress EngSiz Wt
predict pengsiz
scatter EngSiz Wt, ytitle("Engine Size (l)") xtitle("Weight (lbs)") || line pengsiz Wt, clc(black)

Interpretation of Coefficients

• In many applications we are interested in the coefficients directly

  – β1 is the change in expected value of y for a unit change in x (slope)

    • β1 = 0 means that x is not related to change in mean of y

  – β0 is the value of E(y) for x=0 (intercept)

    • Very often of little interest because it reflects choice of origin

    • Often outside range of x data values

Fit a Regression Line

We Can Graph the Result

[Figure: scatterplot of avg_scr (600 to 700) against str (15 to 25) with fitted values]

What We Need to Have to Do Inference

• Construct 95% confidence interval for the slope

• Get p-values for the t-statistic

• This will allow us to state whether there is a statistically significant association between the independent and dependent variables

  – Test the hypothesis that β1 = 0

• SPSS, of course, does the arithmetic

• In statistical significance testing, the p-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true. The lower the p-value, the less likely the result is if the null hypothesis is true, and consequently the more "significant" the result is, in the sense of statistical significance.
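The arithmetic that the software performs for the slope can be sketched by hand. The Python example below uses hypothetical data with five observations, computes the standard error of the slope, the t-statistic for the hypothesis β1 = 0, and a 95% confidence interval. The critical value 3.182 is the two-sided 95% value of the t distribution with 3 degrees of freedom, taken from a standard t table; the function name `slope_inference` is an assumption for this example.

```python
from math import sqrt

def slope_inference(x, y, t_critical):
    """Return the slope, its standard error, t-statistic, and confidence interval."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    # Residual variance uses n - 2 degrees of freedom (two estimated coefficients).
    ssr = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    se_b1 = sqrt(ssr / (n - 2) / sxx)
    t_stat = b1 / se_b1  # test statistic for H0: beta1 = 0
    ci = (b1 - t_critical * se_b1, b1 + t_critical * se_b1)
    return b1, se_b1, t_stat, ci

# Hypothetical data, n = 5, so the t distribution has 3 degrees of freedom.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
T_CRIT_95_DF3 = 3.182  # two-sided 95% critical value for 3 df, from a t table

b1, se_b1, t_stat, ci = slope_inference(x, y, T_CRIT_95_DF3)
print(f"slope = {b1:.3f}, SE = {se_b1:.3f}, t = {t_stat:.3f}")
print(f"95% CI for the slope: ({ci[0]:.3f}, {ci[1]:.3f})")
```

With these five points the interval contains zero and the absolute t-statistic is below the critical value, so the hypothesis β1 = 0 would not be rejected at the 5% level, even though the estimated slope is positive.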

What does it mean to “explain” variation?

• General idea

  – Want to assess whether the regression line “explains” the data better than simpler alternative models

  – Variation left after model fit is “unexplained”

• R² indicates what percent of variability in Y is accounted for by the regression model

• Some properties of R²

  – 0 ≤ R² ≤ 1

  – R² = 0: Regression line is horizontal

  – R² = 1: Data fit perfectly on regression line
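As a small numerical check of these properties, the Python sketch below computes R² as one minus the ratio of unexplained variation to total variation. The data are hypothetical, and the fitted line ŷ = 2.2 + 0.6x used here is the least-squares line for exactly these five points; the function name `r_squared` is an assumption for this example.

```python
def r_squared(y, fitted):
    """R^2 = 1 - (unexplained variation) / (total variation around the mean of y)."""
    mean_y = sum(y) / len(y)
    unexplained = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    total = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - unexplained / total

# Hypothetical data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# Fitted values from the least-squares line for these data, y-hat = 2.2 + 0.6 x.
fitted_line = [2.2 + 0.6 * xi for xi in x]
# The simpler alternative model: predict the mean of y everywhere (a horizontal line).
fitted_mean = [sum(y) / len(y)] * len(y)

r2_line = r_squared(y, fitted_line)
r2_mean = r_squared(y, fitted_mean)
r2_perfect = r_squared(y, y)  # fitted values equal the data exactly

print(f"R^2 of fitted line:     {r2_line:.2f}")
print(f"R^2 of horizontal line: {r2_mean:.2f}")
print(f"R^2 of a perfect fit:   {r2_perfect:.2f}")
```

The horizontal line at the mean leaves all of the variation unexplained, a perfect fit leaves none, and the least-squares line for these points falls in between.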

Multiple Regression Models

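• Extension of simple regression:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$$

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon$$

• Together the βi are called the regression coefficients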

In Policy Analysis Applications Independent Variables Are Very Important

• Independent / Categorical / Dummy Variables

• Often at least part of our data is qualitative

  – Subject is male or female

  – Location is urban or rural

  – Others include ethnic group, insured vs. uninsured, high-school vs. post-high school

  – Etc.

• In much of the modeling in health most of the variables are qualitative

• We need to generate the 0,1 indicators
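Generating the 0,1 indicators can be sketched in a few lines of Python. The helper name `make_indicators`, the category labels, and the choice of baseline are assumptions for illustration; statistical packages offer their own recoding tools for this step.

```python
def make_indicators(values, baseline):
    """Return {category: [0/1, ...]} for every category except the baseline.

    A variable with k categories needs only k - 1 indicators, because the
    baseline category is identified by all indicators being 0.
    """
    categories = sorted(set(values) - {baseline})
    return {c: [1 if v == c else 0 for v in values] for c in categories}

# Two categories: a single indicator is enough.
location = ["urban", "rural", "urban", "urban", "rural"]
location_dummies = make_indicators(location, baseline="rural")
print(location_dummies)

# Three hypothetical categories: two indicators, with "north" as the baseline.
region = ["north", "south", "west", "south", "north"]
region_dummies = make_indicators(region, baseline="north")
print(region_dummies)
```

The resulting 0,1 columns can then enter a regression like any other x. Each coefficient is read as the difference in expected y between that category and the baseline.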

Will LM Approach Work for Other Types of Data?

• Suppose we have 0,1 data

  – Success/failure, die/live, leave military/stay, ...

  – Suppose we want to link p, the probability of a 1, with covariates (age, income, disease, etc.)?

• Least squares won’t work nicely … let’s take a look!

Example

• We want to know whether the presence of coronary heart disease (CHD) is related to age.

  – If we take a group of people of a given age

    • What is the fraction that have CHD?

    • Equivalently, what is the probability that a person of a given age has CHD?

  – We expect the probability to depend on age.

Example: Coronary Heart Disease

[Figure: scatterplot of chd (0 to 1) against age (20 to 70)]

Linear Regression on 0,1 CHD Data

[Figure: chd and fitted values (0 to 1) plotted against age (20 to 70)]

Fitted values are out-of-range
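The out-of-range problem is easy to reproduce. The Python sketch below fits an ordinary least-squares line to a small set of made-up 0,1 outcomes. The ages and outcomes are hypothetical and are not the CHD data shown in the lecture; the function name `fit_line` is an assumption for this example.

```python
def fit_line(x, y):
    """Ordinary least-squares intercept and slope for one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1

# Hypothetical ages and 0/1 outcomes (1 = has the condition).
ages = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70]
chd = [0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

b0, b1 = fit_line(ages, chd)
fitted = [b0 + b1 * a for a in ages]

for a, f in zip(ages, fitted):
    flag = "" if 0.0 <= f <= 1.0 else "  <- not a valid probability"
    print(f"age {a}: fitted value {f:6.3f}{flag}")
```

At the youngest age the fitted value is negative and at the oldest it exceeds 1, so the straight line cannot be read as a probability across the whole age range.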

Why A Different Type of Regression?

• Logistic regression is the most commonly used generalization of multiple linear regression

• Output data is categorical with 2 categories

  – Categorical: no metric, no order

  – Usually coded as 0/1

  – Terminology: failure/success

  – Typical examples: dies/lives, does not/does have condition, does not/does marry, etc.

• As we’ve seen, linear regression can be inappropriate

Interpretation

• A one-year increase in age means that the odds of having CHD are 11% higher than for someone a year younger

• But the probability that you have CHD is much different depending on age

• Logistic regression is often used to see how much additional risk is contributed by some risk factor (e.g. smoking)

  – The coefficient shows how much the risk factor increases your odds of having some condition

  – But the probability may still be small
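The difference between a constant effect on the odds and a varying effect on the probability can be sketched in Python. Both coefficients below are hypothetical: the intercept is an arbitrary illustrative value, and the age coefficient is set to log(1.11) so that each additional year multiplies the odds by 1.11. Neither is the fitted value from the lecture's CHD data.

```python
from math import exp, log

INTERCEPT = -5.0        # hypothetical, chosen only for illustration
SLOPE_AGE = log(1.11)   # hypothetical, gives an odds ratio of 1.11 per year

def probability_of_chd(age):
    """Logistic model: p = 1 / (1 + exp(-(b0 + b1 * age)))."""
    return 1.0 / (1.0 + exp(-(INTERCEPT + SLOPE_AGE * age)))

def odds(p):
    """Convert a probability into odds, p / (1 - p)."""
    return p / (1.0 - p)

for age in (30, 60):
    p_now = probability_of_chd(age)
    p_next = probability_of_chd(age + 1)
    ratio = odds(p_next) / odds(p_now)
    print(f"age {age} -> {age + 1}: odds ratio {ratio:.2f}, "
          f"probability {p_now:.3f} -> {p_next:.3f}")
```

Under these assumed coefficients the odds ratio for one extra year is the same at every age, while the probability itself, and the change in probability from one extra year, depend on the starting age.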

Group Project

• Each of you will form a group to perform and report on a regression analysis of a set of data you select

  – Multiple linear regression

    • At least 100 observations (i.e., 100 y’s and 100 x1’s, x2’s, …, xk’s)

    • No more than 1000 observations

    • At least 5 x’s

  – Data set and analysis goal must have my approval

  – Analysis report due March 15, 2011

  – Teams of up to three students

  – Must do model fitting, interpretation, and write up results

• The paper is 10 pages maximum

  – This includes all text, tables, graphics

• Write as memo to explain

  – What you are trying to do

  – What you found out

• Content will depend on data set and what you find

SPSS Commands for the Group Project

• Menu clear

• Analyze

• Regression

• Linear

• Move dependent and independent variables into boxes

• Statistics

• Descriptives

• Part and Partial Correlations

• Plots

• Histogram

• Normal probability plot

• Continue

• Stepwise

• OK

Resources to help you learn and use SPSS

http://www.ats.ucla.edu/stat/spss
