here, pal! regress this!

Here, pal!

Regress this!

presented by

Miles Hamby, PhD

Principal, Ariel Training Consultants

MilesFlight.20megsfree.com

[email protected]

Part 2

The Equation

MODEL 3

IV            B (Slope)
(Constant)    35.577
Age           -.117
Gender        -.110
Married       -4.05E-02
Black         .439
Native Am     .719
Asian         -.553
Hispanic      -.830
Unknown       .531
Alien         -.618
GPA           -.277
Transfer Cr   4.285E-02
Undergrad     -3.259
Tutoring      -4.71E-07
Accounting    2.638
Business      2.651

Y = a + bAge + bGen + bMar +bBlk

+ bNA + bAsn + bHis + bUnk + bAln

+ bGPA + bXfer + bUndergrad

+ bTutor + bAcc + bBus

Y = 35.57 + (-.11)Age + (-.11)Gen

+ (-.04)Mar + (.43)Black

+ (.71)NatAm + (-.55)Asian

+ (-.83)Hisp + (.53)Unk + (-.61)Alien

+ (-.27)GPA + (.04)Xfer + (-3.25)Under

+ (-4.71E-07)Tutor + (2.63)Acc + (2.65)Bus

Let’s Predict!

What is the predicted Quarters to completion for:

Age 36, Male, Single, Black, US citizen, 3.5 GPA, 35 Transfer credits, Undergraduate, no Tutoring, Business major

Y = 35.57 - (.11)Age - (.11)Gen - (.04)Mar + (.43)Black + (.71)NatAm - (.55)Asian - (.83)Hisp + (.53)Unk - (.61)Alien - (.27)GPA + (.04)Xfer - (3.25)Under - (4.71E-07)Tutor + (2.63)Acc + (2.65)Bus

Y = 35.57 - (.11)(36) - (.11)(0) - (.04)(0) + (.43)(1) + (.71)(0) - (.55)(0) - (.83)(0) + (.53)(0) - (.61)(0) - (.27)(3.5) + (.04)(35) - (3.25)(1) - (4.71E-07)(0) + (2.63)(0) + (2.65)(1)

31.90 = 35.57 - 3.96 - 0 - 0 + .43 + 0 - 0 - 0 + 0 - 0 - .94 + 1.4 - 3.25 - 0 + 0 + 2.65

What is the predicted Quarters to completion for:

Age 45, Female, Married, White, Alien, 3.0 GPA, No Transfer credits, Undergraduate, Tutored, Computer major

Y = 35.57 - (.11)Age - (.11)Gen - (.04)Mar + (.43)Black + (.71)NatAm - (.55)Asian - (.83)Hisp + (.53)Unk - (.61)Alien - (.27)GPA + (.04)Xfer - (3.25)Under - (4.71E-07)Tutor + (2.63)Acc + (2.65)Bus

Y = 35.57 - (.11)(45) - (.11)(1) - (.04)(1) + (.43)(0) + (.71)(0) - (.55)(0) - (.83)(0) + (.53)(0) - (.61)(1) - (.27)(3.0) + (.04)(0) - (3.25)(1) - (4.71E-07)(1) + (2.63)(0) + (2.65)(0)

25.80 = 35.57 - 4.95 - .11 - .04 + 0 + 0 - 0 - 0 + 0 - .61 - .81 + 0 - 3.25 - 0 + 0 + 0
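The two hand calculations above can be sketched in a few lines of Python. This is only an illustration: the slopes are copied from the Model 3 table at full precision, while the dictionary keys, the function name `predict_quarters`, and the 0/1 coding of Gender (female = 1) and Tutoring are assumptions inferred from the worked examples. Because the slides truncate each slope to two decimals, the hand totals come out a few tenths higher than the full-precision results.

```python
# Model 3 unstandardized slopes (B), copied from the coefficient table.
INTERCEPT = 35.577
SLOPES = {
    "age": -0.117,
    "gender": -0.110,       # assumed coding: 1 = female, 0 = male
    "married": -0.0405,
    "black": 0.439,
    "native_am": 0.719,
    "asian": -0.553,
    "hispanic": -0.830,
    "unknown": 0.531,
    "alien": -0.618,
    "gpa": -0.277,
    "transfer_cr": 0.04285,
    "undergrad": -3.259,
    "tutoring": -4.71e-07,  # assumed coding: 1 = tutored, 0 = not tutored
    "accounting": 2.638,
    "business": 2.651,
}


def predict_quarters(profile):
    """Return a + sum(b_i * x_i); variables missing from the profile count as 0."""
    return INTERCEPT + sum(b * profile.get(name, 0) for name, b in SLOPES.items())


# Age 36, male, single, Black, US citizen, 3.5 GPA, 35 transfer credits,
# undergraduate, no tutoring, Business major.
profile_1 = {"age": 36, "black": 1, "gpa": 3.5, "transfer_cr": 35,
             "undergrad": 1, "business": 1}

# Age 45, female, married, White, alien, 3.0 GPA, no transfer credits,
# undergraduate, tutored, Computer major (the reference major).
profile_2 = {"age": 45, "gender": 1, "married": 1, "alien": 1, "gpa": 3.0,
             "undergrad": 1, "tutoring": 1}

print(round(predict_quarters(profile_1), 2))
print(round(predict_quarters(profile_2), 2))
```

Leaving a variable out of a profile is the same as coding it 0, which is why White and Computer major (the held-out categories) need no entry.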

Example Profiles

Excel

Variation in the DV

Each successive Model explains more of the variation (R2) in the DV (Time to Completion)

Model Summary

Model  R      R Square  Adjusted R Square  Std. Error of the Estimate  R Square Change  F Change  df1  df2   Sig. F Change
1      .179a  .032      .031               6.242                       .032             22.926    9    6253  .000
2      .343b  .118      .116               5.960                       .086             152.204   4    6249  .000
3      .392c  .154      .152               5.838                       .036             133.252   2    6247  .000

(R Square Change through Sig. F Change are the Change Statistics.)

a. Predictors: (Constant), Alien, Black, Marital Status, Unkn, Gender, AGE, Asian, Hisp, Native American

b. Predictors: (Constant), Alien, Black, Marital Status, Unkn, Gender, AGE, Asian, Hisp, Native American, Tutoring Session Date, XFER CR, GPA, Undergrad Status

c. Predictors: (Constant), Alien, Black, Marital Status, Unkn, Gender, AGE, Asian, Hisp, Native American, Tutoring Session Date, XFER CR, GPA, Undergrad Status, Accounting, Business

All three Models are significant (Sig. F Change < .05)

But, 84.6% or more of the variation is still unexplained

Possible factors?

Worklife, children, personal goals, financial aid, company sponsorship
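As a rough check on the Model Summary, the F Change column can be approximated from the other columns with the standard formula F = (R Square Change / df1) / ((1 - R Square) / df2). The sketch below uses the rounded values printed in the table, so it matches the SPSS figures only approximately; the function name `f_change` is illustrative.

```python
def f_change(r2_change, r2_model, df1, df2):
    """F statistic for the gain in R Square when a block of predictors is added."""
    return (r2_change / df1) / ((1.0 - r2_model) / df2)


# (model, R Square, R Square Change, df1, df2, F Change reported by SPSS)
models = [
    (1, 0.032, 0.032, 9, 6253, 22.926),
    (2, 0.118, 0.086, 4, 6249, 152.204),
    (3, 0.154, 0.036, 2, 6247, 133.252),
]

for model, r2, r2_change, df1, df2, reported in models:
    f = f_change(r2_change, r2, df1, df2)
    unexplained = 1.0 - r2
    print(f"Model {model}: F change ~ {f:.1f} (SPSS: {reported}), "
          f"unexplained variation = {unexplained:.1%}")
```

The last column of the printout is simply 1 - R Square, which is where the 84.6% unexplained figure for Model 3 comes from.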


The point is: with R2 of only .154, other factors out there are contributing more to Time to Completion, and we need to find them!

Coefficients(a)

Model 3                 B          Std. Error  Beta   t        Sig.
(Constant)              35.577     .968               36.768   .000
Age                     -.117      .009        -.162  -13.256  .000
Gender                  -.110      .157        -.009  -.701    .483
Marital Status          -4.05E-02  .040        -.012  -1.017   .309
Black                   .439       .176        .036   2.497    .013
Native American         .719       .641        .028   1.120    .263
Asian                   -.553      .243        -.037  -2.275   .023
Hisp                    -.830      .351        -.041  -2.366   .018
Unkn                    .531       .254        .032   2.092    .036
Alien                   -.618      .216        -.038  -2.867   .004
GPA                     -.277      .221        -.016  -1.254   .210
XFER CR                 4.285E-02  .002        .283   20.762   .000
Undergrad Status        -3.259     .218        -.237  -14.959  .000
Tutoring Session Date   -4.71E-07  .000        -.007  -.566    .571
Accounting              2.638      .240        .135   10.970   .000
Business                2.651      .181        .191   14.686   .000

B and Std. Error are Unstandardized Coefficients; Beta is the Standardized Coefficient.

a. Dependent Variable: Quarters to Completion

Variation in the Slopes

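Cannot tell by the slopes: that would be comparing apples to oranges, because each slope is in the units of its own variable

Is the slope of Age (-.117) more or less than the slope of GPA (-.277)?

Apples to apples, i.e., use the Standardized ‘Beta’

Beta for Age (-.162) is larger in magnitude than Beta for GPA (-.016); i.e., a one-standard-deviation change in Age results in a greater change in Time to Completion than a one-standard-deviation change in GPA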
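A minimal sketch of why Beta is comparable across variables while B is not. It uses a simple (one-predictor) regression and invented data, so it is a simplification of the multiple regression in the slides; the names `slope` and `standardized_beta` are illustrative. Rescaling Age from years to months changes the slope B by a factor of 12 but leaves Beta untouched, because Beta = B x (std. dev. of X / std. dev. of Y).

```python
from statistics import mean, stdev


def slope(xs, ys):
    """Unstandardized least-squares slope B for a single predictor."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def standardized_beta(xs, ys):
    """Beta = B * sd(x) / sd(y): the slope in standard-deviation units."""
    return slope(xs, ys) * stdev(xs) / stdev(ys)


# Hypothetical data, invented for illustration only.
age_years = [22, 25, 31, 36, 40, 45, 52]
quarters = [20, 19, 18, 16, 17, 14, 13]
age_months = [12 * a for a in age_years]  # same information, different unit

print("B    (years): ", slope(age_years, quarters))
print("B    (months):", slope(age_months, quarters))
print("Beta (years): ", standardized_beta(age_years, quarters))
print("Beta (months):", standardized_beta(age_months, quarters))
```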

Drawing Conclusions

Summarize the correlations (Pearson’s R)

Summarize the effects (coefficient B)

Summarize the variation (R2)

“There is a statistically significant association between the set of variables and Time to Completion.”

“Academic major, transfer credits, and Undergraduate status seem to have the greatest effects.”

“However, about 85% of the variation in Time to Completion is still unexplained.”

Suggest what’s next

“Data on worklife, income, finances, and company sponsorship should be collected and analyzed.”

In Summary

• Regression measures the strength of association (correlation) for all variables considered at the same time

• Regression can predict the outcome of any given profile

• Regression measures the amount of effect (slope) of each variable on the dependent variable, adjusted for (i.e., controlling for) all other variables

Regress it, Pal!

It’s where it’s at!

Tests of Significance

t-test for dichotomous variable (two categories)

eg – Is there a difference in GPA between men and women?

F-test - One-way ANOVA for polychotomous variable (more than two categories)

eg -- Is there a difference in GPA between African-American, Hispanic, Anglo, and Native American students?

Purpose – determine if there is a significant difference between means of the categories of the nominal variable


Shortcoming of t-test and F-test

They do not predict.

eg -

Can we predict the GPA of a student based on gender?

Can we predict the level of satisfaction with a course based on gender?

Can we predict the likelihood of graduation of a student based on gender?

Regression predicts!

Examples - Means of t-test and F

Dichotomous - Find the mean GPA of males and that of females and compare them with a t-test.

Polychotomous - Find the mean GPA for African-Americans, Hispanics, and Anglos and compare them with a one-way ANOVA
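The two comparisons above can be sketched as follows. The GPAs are invented for illustration, the function names are hypothetical, and only the test statistics are computed (no p-values). With exactly two groups the one-way ANOVA F equals the square of the pooled t, which is why the two tests agree in the dichotomous case.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(group_a, group_b):
    """Two-sample t statistic assuming equal variances (pooled)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))


def one_way_f(groups):
    """One-way ANOVA F: between-group mean square / within-group mean square."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical GPAs, invented for illustration only.
male_gpa = [2.4, 2.7, 2.9, 2.5]
female_gpa = [3.2, 3.8, 3.6, 3.9]

t_stat = pooled_t(male_gpa, female_gpa)
f_stat = one_way_f([male_gpa, female_gpa])
print("t =", round(t_stat, 3), " t squared =", round(t_stat ** 2, 3))
print("F =", round(f_stat, 3))
```

`one_way_f` accepts any number of groups, so the same call covers the polychotomous case by passing three or more lists.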

Example 1 Data

Arbitrarily code ‘gender’ (a nominal variable)

Female = 1

Male = 0

ID       SAT  GPA  Gender
Stud 1   5    3.2  F (1)
Stud 11  3    2.7  M (0)

(a) Correlation r (SPSS ‘R’) = .846

Interpretation: Satisfaction (SAT) is strongly associated with gender type

Model Summary

Model  R      R Square  Adjusted R Square  Std. Error of the Estimate  R Square Change  F Change  df1  df2  Sig. F Change
1      .846a  .716      .700               .76                         .716             45.343    1    18   .000

(R Square Change through Sig. F Change are the Change Statistics.)

a. Predictors: (Constant), Gender

Example 1

(b) Significance of difference in means of Satisfaction by gender: ANOVA Sig. of F < 0.05

Interpretation - reject Ho, i.e., there is a statistically significant difference in Satisfaction according to gender

ANOVA(b)

Model 1     Sum of Squares  df  Mean Square  F       Sig.
Regression  26.450          1   26.450       45.343  .000a
Residual    10.500          18  .583
Total       36.950          19

b. Dependent Variable: Satisfaction

(c) Regression model (y=a+bx)

Example 1

SAT = 1.9 + 2.3 (gender code)

Interpretation: Male SAT is 1.9, female SAT is 1.9 + 2.3 = 4.2; i.e., mean female Satisfaction is higher than mean male Satisfaction

Coefficients(a)

Model 1     B      Std. Error  Beta  t      Sig.
(Constant)  1.900  .242              7.867  .000
Gender      2.300  .342        .846  6.734  .000

B and Std. Error are Unstandardized Coefficients; Beta is the Standardized Coefficient.

a. Dependent Variable: Satisfaction

Example 1 (GPA as the dependent variable)

(a) Correlation r (SPSS ‘R’) = .837

Interpretation: GPA is strongly associated with gender type

Model Summary

Model  R      R Square  Adjusted R Square  Std. Error of the Estimate
1      .837a  .701      .685               .3544

Example 1

(b) Significance of difference in means of GPA by gender: ANOVA Sig. of F < 0.05

Interpretation - reject Ho, i.e., there is a statistically significant difference in GPA according to gender

ANOVA(b)

Model 1     Sum of Squares  df  Mean Square  F       Sig.
Regression  5.305           1   5.305        42.230  .000a
Residual    2.261           18  .126
Total       7.566           19

b. Dependent Variable: GPA

(c) Regression model (y=a+bx)

Example 1

GPA = 2.62 + 1.03 (gender code)

Interpretation: Male GPA is 2.62, female GPA is 2.62 + 1.03 = 3.65; i.e., mean female GPA is higher than mean male GPA

Coefficients(a)

Model 1     B      Std. Error  Beta  t       Sig.
(Constant)  2.620  .112              23.377  .000
Gender      1.030  .158        .837  6.498   .000

B and Std. Error are Unstandardized Coefficients; Beta is the Standardized Coefficient.

a. Dependent Variable: GPA
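The pattern in Example 1 (intercept = mean of the group coded 0, intercept + slope = mean of the group coded 1) holds for any 0/1 dummy. A minimal sketch with invented GPAs follows; the full 20-student data set is not reproduced in the slides, so these numbers are hypothetical and will not match the SPSS output, and `fit_line` is an illustrative name.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y = a + b*x."""
    mx, my = mean(xs), mean(ys)
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b


# Hypothetical GPAs, invented for illustration only.
male_gpa = [2.4, 2.7, 2.9, 2.5]     # coded 0
female_gpa = [3.2, 3.8, 3.6, 3.9]   # coded 1

gender = [0] * len(male_gpa) + [1] * len(female_gpa)
gpa = male_gpa + female_gpa

a, b = fit_line(gender, gpa)
print("intercept a     =", round(a, 3), " mean male GPA   =", round(mean(male_gpa), 3))
print("a + b (slope b) =", round(a + b, 3), " mean female GPA =", round(mean(female_gpa), 3))
```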

Nonsense coding: arbitrarily assigning a number to each category of a nominal variable

eg –

Male = 1, Female = 2

Hispanic = 35, African-American = 72, Anglo = 87

For a dichotomous variable, the coding numbers are not important: regardless of the numbers assigned, the strength of association, i.e., r (correlation) and r2 (coef. of determination), is unaffected (only the sign of r can flip)

BUT slopes (B) and intercepts from nonsense codes are difficult to interpret unless the categories are coded ‘0’ and ‘1’
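A small sketch of this point, using invented GPAs and three hypothetical codings of the same dichotomous variable. The r2 is identical under every coding, but only the 0/1 coding gives an intercept that is a group mean and a slope that is the difference between group means. The function name `fit_and_r2` is illustrative.

```python
from statistics import mean


def fit_and_r2(xs, ys):
    """Return (intercept, slope, r squared) for a one-predictor regression."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope, sxy ** 2 / (sxx * syy)


# Hypothetical data, invented for illustration only.
sexes = ["M", "M", "M", "M", "F", "F", "F", "F"]
gpa = [2.4, 2.7, 2.9, 2.5, 3.2, 3.8, 3.6, 3.9]

codings = {
    "0/1": {"M": 0, "F": 1},
    "1/2": {"M": 1, "F": 2},
    "35/87": {"M": 35, "F": 87},  # nonsense codes
}

results = {}
for label, codes in codings.items():
    xs = [codes[s] for s in sexes]
    results[label] = fit_and_r2(xs, gpa)
    intercept, slope, r2 = results[label]
    print(f"{label:>5}: intercept = {intercept:.4f}, slope = {slope:.4f}, r2 = {r2:.4f}")
```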

eg -

Mean GPA for Male (Ym) = 2.62,

Mean GPA for Female (Yf) = 3.65

$$B = \frac{Y_m - Y_f}{X_m - X_f} = \frac{2.62 - 3.65}{0 - 1} = \frac{-1.03}{-1} = 1.03$$

Result - the mean GPA of the category coded 0 = the Y-intercept

[Figure: GPA (Y) against gender code (X); 0 (Male) at 2.62, 1 (Female) at 3.65; slope B = 1.03]

Interpretation –

Female GPAs tend to be predictably higher than Male GPAs

Recode Male = 1, Female = 0:

$$B = \frac{Y_m - Y_f}{X_m - X_f} = \frac{2.62 - 3.65}{1 - 0} = \frac{-1.03}{1} = -1.03$$

[Figure: GPA (Y) against gender code (X); 0 (Female) at 3.65, 1 (Male) at 2.62; slope B = -1.03]

Interpretation – same result

Female GPAs tend to be predictably higher than Male GPAs

We can predict GPA based on male or female

Thus, the regression equation is:

Y = 3.65 - 1.03X

With one category coded 0 (e.g., female), the Y-intercept is the mean of that category, and the intercept plus the slope predicts the other category

Fine – but what about polychotomous variables?

Cannot use single dummy variable for more than two categories.

Why? This would assume the nominal categories were actually interval, i.e., that one category was ‘more’ than another.

e.g., if the ethnic variable were coded thus:

Hispanic = 1, African-Am = 2, Anglo = 3,

the regression would assume that Anglo is 2 units greater than Hispanic, etc.

Regression also interprets a dichotomous variable (e.g., male=0, female=1) as female being 1 unit more than male.

But, with a dichotomous variable, this is harmless: the mean score of code ‘0’ is the intercept, and the mean score of code ‘1’ is the intercept + the slope.

However, with more than two categories, this is not true.

eg – Ethnic Category

Therefore, must treat each category as a unique variable, a ‘Dummy’ variable

Code each category/variable as:

1 = ‘presence of characteristic’ or

0 = ‘absence of characteristic’

          Has it  Doesn’t have it
Hispanic  1       0
Afr-Am    1       0
Anglo     1       0

For each case/subject, code each category as either ‘having it’ or ‘not’

Coding Polychotomous Nominal Variables As Dummy Variables

Case ID  Hispanic  Afr-Am  Anglo
Stud 1   0         1       0
Stud 2   1         0       0
Stud 3   0         0       1

(Depicts 3 students – one in each ethnic category)

eg –

Student 1 is an African-American

Student 2 is an Hispanic

Student 3 is an Anglo

Regression equation would look like:

Y = a + bH + bAA + bAn

Problem – ‘perfect multi-collinearity’

i.e., the sum of the three dummies (H + AA + An) for each case always equals ‘1’.

Stud 1 (Afr-Am) ~ 0 + 1 + 0 = 1

Stud 2 (Hispanic) ~ 1 + 0 + 0 = 1

Stud 3 (Anglo) ~ 0 + 0 + 1 = 1


The resulting regression equation would return a confusing Y-intercept (a):

Y = a + bH + bAA + bAn


i.e., what is the reference point from which to determine the actual means of the other variables?
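One way to see why the intercept becomes confusing: when every case's dummies sum to 1, any constant can be moved out of the intercept and into all of the slopes without changing a single prediction, so there is no unique answer for ‘a’. The sketch below uses hypothetical coefficients (not SPSS output); `predict` and `shift` are illustrative names.

```python
# Dummy codes (Hispanic, Afr-Am, Anglo) for the three students in the table.
students = {
    "Stud 1": (0, 1, 0),  # African-American
    "Stud 2": (1, 0, 0),  # Hispanic
    "Stud 3": (0, 0, 1),  # Anglo
}


def predict(coefs, dummies):
    """Y = a + bH*H + bAA*AA + bAn*An."""
    a, b_h, b_aa, b_an = coefs
    h, aa, an = dummies
    return a + b_h * h + b_aa * aa + b_an * an


def shift(coefs, c):
    """Move the constant c out of the intercept and into every dummy slope."""
    a, b_h, b_aa, b_an = coefs
    return (a - c, b_h + c, b_aa + c, b_an + c)


base = (3.0, 0.4, 0.2, 0.0)  # hypothetical coefficients, for illustration only
alternatives = [base, shift(base, 1.0), shift(base, 3.0)]

for coefs in alternatives:
    predictions = [round(predict(coefs, d), 2) for d in students.values()]
    print(coefs, "->", predictions)
```

All three coefficient sets print the same predictions, which is exactly the ambiguity that perfect multi-collinearity creates.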

What to do -

drop one category from the regression

i.e., use only g – 1 dummies

eg, -

Y = a + bH + bAA (bAn dropped for all cases)

Reference group – the category/group chosen to be dropped

Choosing the Reference group – the group that has the most normative support

By leaving out a group, not all cases will sum to ‘1’, and therefore:

the regression equation predicts the mean Y for the group to which the case/student belongs, in reference to the Y-intercept.

Student 1 (African-American): $Y_{AA} = a + b_H(0) + b_{AA}(1) = a + b_{AA}$

Student 2 (Hispanic): $Y_H = a + b_H(1) + b_{AA}(0) = a + b_H$

Student 3 (Anglo): $Y_{An} = a + b_H(0) + b_{AA}(0) = a$

Case ID  Hispanic  Afr-Am  Anglo
Stud 1   0         1       0
Stud 2   1         0       0
Stud 3   0         0       1

i.e., Mean Y of reference group ‘Anglo’ is the intercept ‘a’; All other groups are then compared to ‘Anglo’
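The g - 1 coding rule can be sketched as a small helper. The function name `dummy_code` and the group labels are illustrative; the reference group simply gets no dummy of its own, so a member of that group is coded 0 on every remaining dummy.

```python
def dummy_code(group, groups, reference):
    """Return g - 1 dummies (1 = has the characteristic) for every group
    except the reference group."""
    if group not in groups or reference not in groups:
        raise ValueError("group and reference must both be listed in groups")
    return {g: int(g == group) for g in groups if g != reference}


groups = ["Hispanic", "Afr-Am", "Anglo"]

for student, group in [("Stud 1", "Afr-Am"), ("Stud 2", "Hispanic"), ("Stud 3", "Anglo")]:
    print(student, dummy_code(group, groups, reference="Anglo"))
```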


Example – Satisfaction with a course

Satisfaction coding:

5 Very satisfied
4 Satisfied
3 Ambivalent
2 Dissatisfied
1 Very Dissatisfied

Case ID  Satisfaction  Hispanic  Afr-Am  Anglo
Stud 1   4             0         1       0
Stud 2   2             1         0       0
Stud 3   5             0         0       1

Assume that a multiple regression of the effect of ethnicity on satisfaction returned a Y-intercept of 3.0 with slopes H = .4, AA = .2 (An held out as reference group)

i.e., Y = 3.0 + .4H + .2AA

Predicted satisfaction level for any Anglo student is the intercept ‘a’ = 3.0

Thus, predicted satisfaction level for any Hispanic student would be ~ Y = 3.0 + 0.4*1 + 0.2*0 = 3.4

Likewise, predicted satisfaction level for any African-American student would be ~ Y = 3.0 + 0.4*0 + 0.2*1 = 3.2

Slopes – indicate difference between a specific category/group and the reference group.

i.e., Y = 3.0 + 0.4H + 0.2AA

i.e., the 0.4 slope for Hispanic indicates Hispanic satisfaction is 0.4 more than Anglo.

Likewise, African-American satisfaction is 0.2 more than Anglo.

Also, relatively, African-American satisfaction is 0.2 less than Hispanic.

Note - this does not predict the satisfaction level (or GPA, etc) of a unique individual student – only one of a particular ethnic background.

Because there is no ‘degree’ of the characteristic ‘ethnicity’

i.e., you are either Anglo, or you are not.

Example (re Example Data)

Satisfaction and Ethnic Group – Anglo as reference group

Coefficients

Model 1           B      Std. Error  Beta   t      Sig.
(Constant)        3.000  .588               5.106  .000
African-American  -.333  .831        -.112  -.401  .693
Hispanic          .375   .777        .135   .482   .636

B and Std. Error are Unstandardized Coefficients; Beta is the Standardized Coefficient.

Regression Model ~ Y = 3.0 - .333AA + .375H

Interpretation - Mean Anglo satisfaction level is 3.0, mean Afr-Am level is 2.667, mean Hispanic level is 3.375

Effect of Multi-collinearity in SPSS – If SPSS detects perfect multi-collinearity within the selected IVs, it drops one IV.

Coefficients

Model 1           B      Std. Error  Beta   t      Sig.
(Constant)        3.375  .509               6.633  .000
African-American  -.708  .777        -.239  -.911  .375
Anglo             -.375  .777        -.126  -.482  .636

B and Std. Error are Unstandardized Coefficients; Beta is the Standardized Coefficient.

Excluded Variables(b)

Model 1   Beta In  t  Sig.  Partial Correlation  Collinearity Statistics: Tolerance
Hispanic  .a       .  .     .                    .000

a. Predictors in the Model: (Constant), Anglo, African-American

b. Dependent Variable: Satisfaction
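The two outputs describe the same group means; only the reference group differs. A short sketch, using the coefficients from the Anglo-reference output, recovers the group means and then re-expresses them with Hispanic as the reference, which reproduces the coefficients SPSS reported after it excluded Hispanic. The function names `group_means` and `rereference` are illustrative.

```python
def group_means(intercept, slopes, reference):
    """Group means implied by a dummy regression with the given reference group."""
    means = {reference: intercept}
    for group, b in slopes.items():
        means[group] = intercept + b
    return means


def rereference(means, new_reference):
    """Intercept and slopes for the same means with a different reference group."""
    intercept = means[new_reference]
    slopes = {g: m - intercept for g, m in means.items() if g != new_reference}
    return intercept, slopes


# Coefficients from the Anglo-as-reference output.
means = group_means(3.000, {"African-American": -0.333, "Hispanic": 0.375},
                    reference="Anglo")
print({g: round(m, 3) for g, m in means.items()})

intercept, slopes = rereference(means, "Hispanic")
print(round(intercept, 3), {g: round(b, 3) for g, b in slopes.items()})
```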

Adding Other Variables

To make a prediction more ‘individually unique’, add other variables

e.g., gender (nominal), age (ratio), time spent on homework (ratio)

Y(GPA) = a + [b*H + b*AA] + b*Age + b*Homework

Case ID  GPA  Hispanic  Afr-Am  Anglo  Age  Hours on Homework
Stud 1   3.9  0         1       0      19   14
Stud 2   3.1  1         0       0      28   5
Stud 3   3.2  0         0       1      23   8

Example – given above data, the regression prediction model would be:

Y(GPA) = a + [b*H + b*AA] + b*Age + b*Homework

However - things now change

Intercept - the new ‘a’ intercept is no longer the mean score for Anglo; it is now the predicted score for an Anglo (reference-group) individual with ‘0’ age and ‘0’ hours on homework

Slopes – each ethnic slope now indicates the difference between that ethnic group and the reference group for individuals who do not differ in ‘age’ or ‘homework’.

Applications

Research Question -

Do gender, culture, or age of a student have an effect on the student’s perception of his/her learning?

Student Opinion Polls

at Strayer University

That is, can we predict a student’s perception of his/her learning based on his/her gender, culture, and age?

And if so, which variable has the greatest effect?

Applications

Collect data from a survey asking students to indicate their perception of satisfaction and instructor effectiveness and how they perceived their instructor.

Methodology

Survey must be designed for a regression,

i.e., must have DV and IV.

Dependent Variables:

Satisfaction – Scale data, 4 through 1

How satisfying was this course?
VERY SATISFYING / SATISFYING / NOT SATISFYING / DISAPPOINTING

Instructor Effectiveness - Scale data, 4 through 1

How effective do you feel your instructor was?
VERY EFFECTIVE / EFFECTIVE / SOMEWHAT / NOT EFFECTIVE

Independent Variables – nominal, four descriptors

Which one of the following describes your instructor’s teaching technique?
FREE DISCUSSION / LECTURE BASED / THEORY BASED / ACTIVITY BASED

Which one of the following describes your instructor’s involvement with students?
STUDENT CENTERED / LITTLE INVOLVEMENT / GAVE TIME TO THINK ALONE / ACTIVE PARTICIPATION

Which one of the following describes your instructor’s method of teaching?
GOT US INVOLVED / MOSTLY INSTRUCTIONS / MOSTLY WRITTEN / MOSTLY ACTIONS

Which one of the following best describes your instructor?
LISTENER / DIRECTOR / INTERPRETER / COACH

SPSS Regression Output – Correlation

Instructor Descriptor on Satisfaction (all included)

Model Summary

Model  R      R Square  Adjusted R Square  Std. Error of the Estimate  R Square Change  F Change  df1  df2  Sig. F Change
1      .175a  .031      .019               .68                         .031             2.625     4    333  .035

(R Square Change through Sig. F Change are the Change Statistics.)

a. Predictors: (Constant), Coach, Listener, Interpreter, Director

Interpretation –

As all descriptors were included, the correlation (multiple R) is difficult to interpret.

SPSS Regression Output – Means & Slopes

Instructor Descriptor on Satisfaction (all included)

Coefficients

Model 1      B      Std. Error  Beta  t       Sig.
(Constant)   3.039  .134              22.710  .000
Listener     .388   .146        .214  2.662   .008
Director     .219   .141        .156  1.547   .123
Interpreter  .293   .145        .186  2.013   .045
Coach        .378   .146        .208  2.596   .010

B and Std. Error are Unstandardized Coefficients; Beta is the Standardized Coefficient.

Interpretation ~

As all four descriptors were included, means and slopes are difficult to interpret. However, because it ran, perfect multi-collinearity must not exist, i.e., at least one of the records has no ‘1’ score on any of the four descriptors.

SPSS Regression Output – Instructor Descriptor on Satisfaction

Coach as Reference

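Model Summary

Model  R      R Square  Adjusted R Square  Std. Error of the Estimate  R Square Change  F Change  df1  df2  Sig. F Change
1      .105a  .011      .002               .69                         .011             1.233     3    334  .298

(R Square Change through Sig. F Change are the Change Statistics.)

a. Predictors: (Constant), Interpreter, Listener, Director

Interpretation –

With Coach as reference, descriptors depict a weak correlation (R = .105) to Satisfaction and explain only 1.1% (R2 = .011) of the variation in Satisfaction; the model is not statistically significant (Sig. F Change = .298).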

SPSS Regression Output – Means and Slopes

Instructor Descriptor on Satisfaction (Coach as Reference)

Coefficients

Model 1      B          Std. Error  Beta   t       Sig.
(Constant)   3.313      .083               39.890  .000
Listener     .161       .118        .089   1.371   .171
Director     -4.15E-02  .101        -.030  -.413   .680
Interpreter  3.835E-02  .108        .024   .354    .724

B and Std. Error are Unstandardized Coefficients; Beta is the Standardized Coefficient.

Interpretation –

With Coach as reference, the mean score for Coach is 3.313, Listener is 3.474 (slightly higher than Coach), Director is 3.272 (slightly lower than Coach), and Interpreter is 3.351 (slightly higher than Coach). The relatively small slopes suggest that the respective descriptors have relatively little effect on Satisfaction.

Excel Outputs

Examples of same regression analyses in MS Excel

(Refer to handouts)

In Summary

• Regression, as a primary tool for prediction, requires quantitative data.

• Qualitative variables are widely used in social research

• Convert these qualitative variables to quantitative variables by ‘dummy’ coding them ‘1’ – presence of quality, or ‘0’ – absence of quality.

• By so doing, the correlations, means, and slopes become meaningful.

