TRANSCRIPT
Crash Course in Statistics
Data Analysis (with SPSS)
July 2014
Dr. Jürg Schwarz [email protected]
Neuroscience Center Zurich
Slide 2
Part 1: Program 9 July 2014: Morning Lessons (09.00 – 12.00)
◦ Some notes about …
- Types of Scales
- Distributions & Transformation of data
- Data trimming
◦ Exercises
- Self study about Boxplots
- Data transformation
- Check of Dataset
Slide 3
Part 2: Program 10 July 2014: Morning Lessons (09.00 – 12.00)
◦ Multivariate Analysis (Regression, ANOVA)
- Introduction to Regression Analysis
General Purpose
Key Steps
Testing of Requirements
Simple Example
Example of Multiple Regression
- Introduction to Analysis of Variance (ANOVA)
Types of ANOVA
Simple Example: One-Way ANOVA
Example of Two-Way ANOVA
Requirements
Slide 4
Part 2: Program 10 July 2014: Afternoon Lessons (13.00 – 16.00)
◦ Introduction to other multivariate methods (categorical/categorical – metric/metric)
- Methods
- Choice of method
- Example of discriminant analysis
◦ Exercises
- Regression Analysis
- Analysis of Variance (ANOVA)
◦ Remainder of the course
- Evaluation (a feedback form will be handed out and collected afterwards)
- Certificate of participation will be issued; Christof Luchsinger will attend at 15.30
Slide 5
Table of Contents
Some notes about … ______________________________________________________________________________________ 9
Types of Scales ...................................................................................................................................................................................................... 9
Nominal scale ............................................................................................................................................................................................................................ 10
Ordinal scale .............................................................................................................................................................................................................................. 11
Metric scales (interval and ratio scales) .................................................................................................................................................................................... 12
Hierarchy of scales .................................................................................................................................................................................................................... 13
Properties of scales ................................................................................................................................................................................................................... 14
Summary: Type of scales .......................................................................................................................................................................................................... 15
Exercise in class: Scales ...................................................................................................................................................................................... 16
Distributions ......................................................................................................................................................................................................... 17
Measure of the shape of a distribution ...................................................................................................................................................................................... 18
Transformation of data ......................................................................................................................................................................................... 20
Why transform data? ................................................................................................................................................................................................................. 20
Type of transformation ............................................................................................................................................................................................................... 20
Linear transformation ................................................................................................................................................................................................................. 21
Logarithmic transformation ........................................................................................................................................................................................................ 22
Summary: Data transformation .................................................................................................................................................................................................. 25
Data trimming ....................................................................................................................................................................................................... 26
Finding outliers and extremes ................................................................................................................................................................................................... 26
Boxplot ....................................................................................................................................................................................................................................... 27
Boxplot and error bars ............................................................................................................................................................................................................... 28
Q-Q plot ..................................................................................................................................................................................................................................... 29
Example ..................................................................................................................................................................................................................................... 33
Exercises 01: Log Transformation & Data Trimming ___________________________________________________________ 34
Slide 6
Linear Regression _______________________________________________________________________________________ 35
Example ............................................................................................................................................................................................................... 35
General purpose of regression ............................................................................................................................................................................. 38
Key Steps in Regression Analysis ........................................................................................................................................................................ 39
Regression model ................................................................................................................................................................................................ 40
Mathematical model .................................................................................................................................................................................................................. 40
Stochastic model ....................................................................................................................................................................................................................... 40
Gauss-Markov Theorem, Independence and Normal Distribution ......................................................................................................................... 42
Regression analysis with SPSS: Some examples ................................................................................................................................................ 43
Simple example (EXAMPLE02)................................................................................................................................................................................................. 43
Step 1: Formulation of the model .............................................................................................................................................................................................. 43
Step 2: Estimation of the model ................................................................................................................................................................................................ 44
Step 3: Verification of the model................................................................................................................................................................................................ 45
Step 3: Verification of the model – t-tests .................................................................................................................................................................................. 46
Step 6. Interpretation of the model ............................................................................................................................................................................................ 47
Back to Step 3: Verification of the model .................................................................................................................................................................................. 48
Step 5: Testing of assumptions ................................................................................................................................................................................................. 50
Violation of the homoscedasticity assumption ........................................................................................................................................................................... 53
Multiple regression ............................................................................................................................................................................................... 54
Many similarities with simple Regression Analysis from above ................................................................................................................................................ 54
What is new? ............................................................................................................................................................................................................................. 54
Multicollinearity ..................................................................................................................................................................................................... 55
Outline ....................................................................................................................................................................................................................................... 55
How to identify multicollinearity ................................................................................................................................................................................................. 56
Slide 7
Multiple regression analysis with SPSS: Some detailed examples ....................................................................................................................... 57
Example of multiple regression (EXAMPLE04) ......................................................................................................................................................................... 57
Step 1: Formulation of the model .............................................................................................................................................................................................. 57
Step 3: Verification of the model (without dummy for gender) .................................................................................................................................................. 58
SPSS Output regression analysis (EXAMPLE04) ..................................................................................................................................................................... 58
Dummy coding of categorical variables ..................................................................................................................................................................................... 60
Gender as dummy variable ....................................................................................................................................................................................................... 61
Step 1: Formulation of the model (with dummy for gender) ...................................................................................................................................................... 61
Step 3: Verification of the model (with dummy for gender) ....................................................................................................................................................... 62
SPSS Output regression analysis (EXAMPLE04) ..................................................................................................................................................................... 62
Example of multicollinearity ....................................................................................................................................................................................................... 63
Step 1: Formulation of the model .............................................................................................................................................................................................. 63
SPSS Output regression analysis (Example of multicollinearity) I ............................................................................................................................................ 64
Exercises 02: Regression_________________________________________________________________________________ 66
Analysis of Variance (ANOVA) _____________________________________________________________________________ 67
Example ............................................................................................................................................................................................................... 67
Key steps in analysis of variance .......................................................................................................................................................................... 71
Designs of ANOVA ............................................................................................................................................................................................... 72
Sum of Squares .................................................................................................................................................................................................... 73
Step by step ............................................................................................................................................................................................................................... 73
Basic idea of ANOVA ................................................................................................................................................................................................................ 74
Significance testing of the model ............................................................................................................................................................................................... 75
ANOVA with SPSS: A detailed example ............................................................................................................................................................... 76
Example of one-way ANOVA: Survey of nurse salaries (EXAMPLE05) ................................................................................................................................... 76
SPSS Output ANOVA (EXAMPLE05) – Tests of Between-Subjects Effects I .......................................................................................................................... 77
Partial Eta Squared (partial η2) .................................................................................................................................................................................................. 79
Two-Way ANOVA ...................................................................................................................................................................................................................... 80
Main effects ............................................................................................................................................................................................................................... 81
Interaction effects ...................................................................................................................................................................................................................... 82
Example of two-way ANOVA: Survey of nurse salary (EXAMPLE06) ...................................................................................................................................... 85
Interaction .................................................................................................................................................................................................................................. 86
Requirements of ANOVA ...................................................................................................................................................................................... 89
Slide 8
Exercises 03: ANOVA ____________________________________________________________________________________ 90
Other multivariate Methods _______________________________________________________________________________ 91
Type of Multivariate Statistical Analysis ................................................................................................................................................................ 91
Methods for identifying structures / Methods for discovering structures ........................................................... 91
Choice of Method ...................................................................................................................................................................................................................... 92
Tree of methods (also www.ats.ucla.edu/stat/mult_pkg/whatstat/default.htm) ......................................................................................................................... 93
Example of multivariate Methods (categorical / metric) ......................................................................................................................................... 94
Linear discriminant analysis ...................................................................................................................................................................................................... 94
Example of linear discriminant analysis .................................................................................................................................................................................... 95
Very short introduction to linear discriminant analysis .............................................................................................................................................................. 96
SPSS Output Discriminant analysis (EXAMPLE07) I ................................................................................................................................................................ 99
Slide 9
Some notes about …
Types of Scales
Items measure the value of attributes using a scale.
There are four scale types that are used to capture the attributes of measurement objects
(e.g., people): nominal, ordinal, interval, and ratio scales.
Example from a health survey:
Stevens S.S. (1946): On the Theory of Scales of Measurement; Science, Volume 103, Issue 2684, pp. 677-680
Measurement object | Attribute of object | Value of attribute | Type of scale
Person | Sex | Male / Female | Nominal (categorical; SPSS: Nominal)
Person | Attitude to health | 1 to 5 | Ordinal (categorical; SPSS: Ordinal)
Person | Body temperature in °C | Real number | Interval (metric; SPSS: Scale)
Person | Income in US $ | Real number | Ratio (metric; SPSS: Scale)
Slide 10
Nominal scale
◦ Consists of "names" (categories). Names have no specific order.
◦ Must be measured with a unique (statistical) procedure.
◦ Each category is assigned a number (the code can be arbitrary but must be unique).
Examples from the Health Survey
◦ Sex is either male or female.
◦ Ethnic group
Slide 11
Ordinal scale
◦ Consists of a series of values
◦ Each category is associated with a number which represents the category's order.
◦ The Likert scale (rating scale) is a special kind of ordinal scale.
Example from the Health Survey
◦ I've been feeling optimistic about the future: None of the time, Rarely, Some of the time, …
Slide 12
Metric scales (interval and ratio scales)
◦ Measures the exact value
◦ The actual measured value is assigned
◦ In SPSS metric scales are called "Scale".
Example from the Health Survey for England 2003
◦ Age (in years)
Slide 13
Hierarchy of scales
The nominal scale is the "lowest" while the ratio scale is the "highest".
A scale from a higher level can be used as the scale for a lower level, but not vice versa.
(Example: Based on age in years (ratio scale), a binary variable can be generated to capture
whether a respondent is a minor (nominal scale), but not vice versa.)
Scale | Possible statements | Example

Categorical
◦ Nominal: equality, inequality (=, ≠)
Sex (male = 0, female = 1): male ≠ female
◦ Ordinal: in addition, relations larger (>) and smaller (<)
Self-perception of health (1 = "very bad", … 5 = "very good"): 1 < 2 < 3 < 4 < 5.
But "very good" is neither five times better than "very bad" nor does "very good" have a distance of 4 to "very bad".

Metric (SPSS: "Scale")
◦ Interval: in addition, comparison of differences
Temperature in °C: the difference between 20° and 15° = the difference between 10° and 15°. But a temperature of 10° is not twice as warm as 5°. Compare with the Fahrenheit scale: 10 °C = 50 °F, 5 °C = 41 °F.
◦ Ratio: in addition, comparison of ratios
Income: $ 8,000 is twice as large as $ 4,000. There is a true zero point on this scale: $ 0. Division by 1000.
Slide 14
Properties of scales
Level | Determination of ... | Statistics
Nominal (categorical) | equality or inequality (=, ≠) | Mode
Ordinal (categorical) | greater, equal or less (>, <, =) | Median
Interval (metric) | equality of differences ((x1 - x2) ≠ (x3 - x4)) | Arithmetic mean
Ratio (metric) | equality of ratios ((x1 / x2) ≠ (x3 / x4)) | Geometric mean

Level | Possible transformation
Nominal | one-to-one substitution: x1 ~ x2 <=> f(x1) ~ f(x2)
Ordinal | monotonic increasing: x1 > x2 <=> f(x1) > f(x2)
Interval | positive linear: φ' = aφ + b with a > 0
Ratio | positive proportional: φ' = aφ with a > 0
Slide 15
Summary: Type of scales
Statistical analysis assumes that the variables have specific levels of measurement.
Variables that are measured on a nominal or ordinal scale are also called categorical variables.
Exact measurements on a metric scale are statistically preferable.
Why does it matter whether a variable is categorical or metric?
For example, it would not make sense to compute an average for gender.
In short, an average requires a variable to be metric.
Sometimes variables are "in between" ordinal and metric.
Example:
A Likert scale with "strongly agree", "agree", "neutral", "disagree" and "strongly disagree".
If it is unclear whether or not the intervals between each of these five values are the same, then
it is an ordinal and not a metric variable.
In order to calculate statistics, it is often assumed that the intervals are equally spaced.
Many circumstances require metric data to be grouped into categories.
Such ordinal categories are sometimes easier to understand than exact metric measurements.
In this process, however, valuable exact information is lost.
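The point about scale-appropriate statistics can be sketched in Python (an illustration only; the course itself uses SPSS, and the small data lists below are invented for this demo):

```python
# Illustrative sketch: why the scale type limits which statistics make sense.
from statistics import mode, median, mean

# Nominal: sex coded 0 = male, 1 = female (codes are arbitrary labels)
sex = [0, 1, 1, 0, 1]
print(mode(sex))       # the mode is the only meaningful "average" here

# Ordinal: self-rated health, 1 = "very bad" ... 5 = "very good"
health = [2, 3, 3, 4, 5]
print(median(health))  # order is meaningful, so the median works

# Metric (ratio): income in US $ supports the arithmetic mean
income = [4000, 8000, 6000]
print(mean(income))
```

Computing `mean(sex)` would run without error, but the result has no interpretation, which is exactly the point made above.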
Slide 16
Exercise in class: Scales
1. Read "Summary: Type of Scales" above.
2. Which type of scale?
◦ Where do you live? north / south / east / west
◦ Size of T-shirt (XS, S, M, L, XL, XXL)
◦ How much did you spend on food this week? _____ $
◦ Size of shoe in Europe
◦ Rating item with boxes 1 to 5 (please mark one box ⌧ per question):
2.01 Compared with the health of others my age, my health is … (very bad to very good)
Slide 17
Distributions
Get a visual impression. Source: http://en.wikipedia.org (Date of access: July, 2014)
Normal
Widely used in statistics (statistical inference).
Poisson
Law of rare events (origin 1898: number of soldiers killed by horse-kicks each year).
Exponential
Queuing model (e.g. average time spent in a queue).
Pareto
Allocation of wealth among individuals of a society ("80-20 rule").
Slide 18
Measure of the shape of a distribution
Skewness (German: Schiefe)
A distribution is symmetric if it looks the same to the
left and right of the center point.
Skewness is a measure of the lack of symmetry.
Range of skewness
Negative values for the skewness indicate a distribution that is skewed left.
Positive values for the skewness indicate a distribution that is skewed right.
Kurtosis (German: Wölbung)
Kurtosis is a measure of how the distribution is shaped relative to a normal distribution.
A distribution with high kurtosis tends to have a distinct peak near the mean.
A distribution with low kurtosis tends to have a flat top near the mean.
Range of kurtosis
The standard normal distribution has a kurtosis of zero.
Positive values for the kurtosis indicate a "peaked" distribution.
Negative values for the kurtosis indicate a "flat" distribution.
Analyze > Descriptive Statistics > Frequencies...
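As a rough illustration of what such output reports, the moment-based formulas for skewness and excess kurtosis can be sketched in Python. Note that SPSS uses bias-corrected small-sample estimators, so its values differ slightly from these simple formulas; the data lists are invented:

```python
# Sketch of moment-based skewness and excess kurtosis (population formulas).
import math

def skewness(xs):
    n = len(xs)
    m = sum(xs) / n
    s = math.sqrt(sum((x - m) ** 2 for x in xs) / n)
    return sum(((x - m) / s) ** 3 for x in xs) / n

def excess_kurtosis(xs):
    n = len(xs)
    m = sum(xs) / n
    s = math.sqrt(sum((x - m) ** 2 for x in xs) / n)
    return sum(((x - m) / s) ** 4 for x in xs) / n - 3  # 0 for a normal

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 10]
print(skewness(symmetric))         # 0 for a symmetric sample
print(skewness(right_skewed) > 0)  # positive => skewed right
```

The `- 3` in the kurtosis makes it an *excess* kurtosis, matching the convention above where the normal distribution has kurtosis zero.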
Slide 19
Example
Dataset "Data_07.sav" (Chernobyl fallout of radioactivity, measured in becquerel)
Distribution of original data is skewed right.
BQ has skewness 2.588 and kurtosis 7.552
Distinct peak near zero.
Logarithmic transformation
Compute lnbq = ln(bq).
freq bq lnbq.
Log transformed data is slightly skewed right.
LNBQ has skewness .224 and kurtosis -.778
More likely to show normal distribution.
Statistics              BQ      LNBQ
N (Valid)               23      23
N (Missing)             0       0
Skewness                2.588   .224
Std. Error of Skewness  .481    .481
Kurtosis                7.552   -.778
Std. Error of Kurtosis  .935    .935
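The same check can be sketched in Python, assuming invented becquerel values (the actual Data_07.sav values are not reproduced here): the log transformation should pull the skewness down.

```python
# Sketch: a log transformation pulls in the long right tail, so the
# skewness of ln(x) is smaller than that of x. Data are invented.
import math

def skewness(xs):
    n = len(xs)
    m = sum(xs) / n
    s = math.sqrt(sum((x - m) ** 2 for x in xs) / n)
    return sum(((x - m) / s) ** 3 for x in xs) / n

bq = [12, 15, 18, 22, 25, 30, 40, 55, 80, 150, 400]  # skewed right
lnbq = [math.log(x) for x in bq]                     # COMPUTE lnbq = ln(bq)

print(skewness(bq))    # strongly positive (right skew)
print(skewness(lnbq))  # noticeably smaller after the log
```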
Slide 20
Transformation of data
Why transform data?
1. Many statistical models require that the variables (in fact: the errors) are approximately normally distributed.
2. Linear least squares regression assumes that the relationship between two variables is linear. Often we can "straighten" a non-linear relationship by transforming the variables.
3. In some cases it can help you better examine a distribution.
When transformations fail to remedy these problems, another option is to use
nonparametric methods, which make fewer assumptions about the data.
Type of transformation
◦ Linear Transformation
Does not change shape of distribution.
◦ Non-linear Transformation
Changes shape of distribution.
Slide 21
Linear transformation
A very useful linear transformation is standardization.
(z-transformation, also called "converting to z-scores" or "taking z-scores")
Transformation rule
z_i = (x_i - µ̂) / σ̂
µ̂ = mean of the sample
σ̂ = standard deviation of the sample
Original distribution will be transformed to one in which
the mean becomes 0 and
the standard deviation becomes 1
A z-score quantifies the original score in terms of
the number of standard deviations that the score is
from the mean of the distribution.
=> For example, use z-scores to filter outliers.
Analyze > Descriptive Statistics > Descriptives...
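The z-transformation can be sketched in Python as follows (a minimal illustration, not the SPSS implementation; the data list is invented):

```python
# Sketch of the z-transformation: z = (x - mean) / sd gives a
# distribution with mean 0 and standard deviation 1.
import math

def z_scores(xs):
    n = len(xs)
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)
    return [(x - mu) / sigma for x in xs]

data = [50, 60, 70, 80, 90, 300]  # 300 looks like an outlier
z = z_scores(data)

# Filter outliers: values more than 2 standard deviations from the mean
outliers = [x for x, zi in zip(data, z) if abs(zi) > 2]
print(outliers)
```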
Slide 22
Logarithmic transformation
Works for data that are skewed right.
Works for data where residuals get bigger for bigger values of the dependent variable.
Such trends in the residuals occur often, if the error in the value of an
outcome variable is a percent of the value rather than an absolute value.
For the same percent error, a bigger value of the variable means a bigger absolute error,
so residuals are bigger too.
Taking logs "pulls in" the residuals for the bigger values.
log(Y*error) = log(Y) + log(error)
Transformation rule
f(x) = log(x), x ≥ 1
f(x) = log(x + 1), x ≥ 0
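A minimal Python sketch of the two-branch rule above, combined into one helper for illustration and assuming base-10 logs (the base is a convention; natural logs work just as well):

```python
# Sketch of the log-transformation rule: log(x) for x >= 1,
# log(x + 1) otherwise, so that data containing zeros stay defined.
import math

def log_transform(x):
    return math.log10(x) if x >= 1 else math.log10(x + 1)

print(log_transform(100))  # 2.0
print(log_transform(0))    # 0.0
```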
[Scatterplot: body size (in cm, 150 to 200) against weight (in kg, 40 to 100)]
Example: Body size against weight
Slide 23
Logarithmic transformation I
Symmetry
A logarithmic transformation reduces positive skewness because it compresses the upper tail of the distribution while stretching out the lower tail. This is because the distances between 0.1 and 1, 1 and 10, 10 and 100, and 100 and 1000 are the same on the logarithmic scale.
This is illustrated by the histogram of data simulated with salary (hourly wages) in a sample of nurses*. In the original scale, the data are long-tailed to the right, but after a logarithmic transformation is applied, the distribution is symmetric. The lines between the two histograms connect original values with their logarithms to demonstrate the compression of the upper tail and the stretching of the lower tail.
*More to come in the chapter "ANOVA".
Histogram of original data
Histogram of transformed data
Slide 24
Logarithmic transformation II
skewed right
Histogram of original data
Histogram of transformed data
Transformation y = log10(x)
nearly normally distributed
Slide 25
Summary: Data transformation
Linear transformation and logarithmic transformation as discussed above.
Other transformations
Root functions
f(x) = x^(1/2), x^(1/3); x ≥ 0
usable for right-skewed distributions
Hyperbola function
f(x) = x^(-1); x ≥ 1
usable for right-skewed distributions
Box-Cox transformation
f(x) = x^λ; λ > 1
usable for left-skewed distributions
Probit & logit functions (cf. logistic regression)
f(p) = ln(p / (1 - p)); p ∈ (0, 1)
usable for proportions and percentages
Interpretation and usage
Interpretation is not always easy.
Transformation can influence results significantly.
Look at your data and decide if it makes sense in the context of your study.
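As an illustration, the logit transformation from the summary above can be sketched in Python (natural logs assumed; p must lie strictly between 0 and 1, since ln(p / (1 - p)) is undefined at the endpoints):

```python
# Sketch of the logit transformation for proportions.
import math

def logit(p):
    assert 0 < p < 1, "logit is only defined strictly between 0 and 1"
    return math.log(p / (1 - p))

print(logit(0.5))  # 0.0: the transformed scale is symmetric around p = 0.5
print(logit(0.9))  # positive
print(logit(0.1))  # negative, mirroring logit(0.9)
```

Mapping proportions onto the whole real line is what makes them usable in models that assume unbounded, roughly normal data.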
Slide 26
Data trimming
Data trimming deals with
◦ Finding outliers and extremes in a data set.
◦ Dealing with outliers: Correction, deletion, discussion, robust estimation
◦ Dealing with missing values: Correction, treatment (SPSS), (also imputation)
◦ Transforming data if necessary (see chapter above).
Finding outliers and extremes
Get an overview over the dataset!
◦ What does the distribution look like?
◦ Are there any values that are not expected?
Methods?
◦ Use basic statistics: <Analyze> with <Frequencies>, <Explore> and <Descriptives>.
Outliers => e.g. z-scores above/below 2 standard deviations; extremes => above/below 3 standard deviations
◦ Use graphical techniques: histogram, boxplot, Q-Q plot, …
Outliers => e.g. as indicated in boxplot
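The z-score rule above can be sketched in Python. The data values are hypothetical; the thresholds (2 and 3 standard deviations) are the ones from the slide:

```python
# Minimal sketch of the z-score rule: flag values beyond 2 standard
# deviations as outliers and beyond 3 as extremes (data are hypothetical).
import numpy as np

data = np.array([70.0, 72.0, 68.0, 71.0, 69.0, 73.0, 70.5, 69.5, 95.0, 71.5])
z = (data - data.mean()) / data.std()

outliers = data[np.abs(z) > 2]   # candidate outliers
extremes = data[np.abs(z) > 3]   # candidate extremes
print("outliers:", outliers, "extremes:", extremes)
```

Here the value 95.0 is flagged as an outlier (|z| > 2) but not as an extreme.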
Slide 27
Boxplot
A Boxplot displays the center (median), spread and outliers of a distribution.
See exercise for more details about whiskers, outliers etc.
Boxplots are an excellent tool for detecting
and illustrating location and variation
changes between different groups of data.
[Boxplot of income: the box marks the middle 50% of the dataset; the line inside the box is the median; whiskers extend from the box; outliers are labeled with their case numbers in the dataset (196, 88, 83, 92)]
[Boxplots of income by education (educ levels 2–7); outliers labeled with case numbers 196, 191, 83, 65, 168, 88, 190, 92]
Slide 28
Boxplot and error bars
Boxplot – keyword "median": overview and illustration of the data distribution (range, skewness, outliers).
Error bars – keyword "mean": overview of the mean and its confidence interval or standard error.
[Left: boxplots of income by educ with outlier case numbers. Right: error bars showing the 95% CI of income by educ]
Slide 29
Q-Q plot
The quantile-quantile (q-q) plot is a graphical technique for deciding if two samples come from
populations with the same distribution.
Quantile: the fraction (or percent) of data points below a given value.
For example, the 0.5 (or 50%) quantile is the value below which 50% of the data fall
and above which 50% fall. In fact, the 50% quantile is the median.
[Histograms: sample distribution (simulated data) and normal distribution, each with the 50% quantile marked]
Slide 30
In the q-q plot, quantiles of the first sample are set against the quantiles of the second sample.
If the two sets come from a population with the same distribution, the points should fall
approximately along a 45-degree reference line.
The greater the displacement from this reference line, the greater the evidence for the
conclusion that the two data sets have come from populations with different distributions.
Some advantages of the q-q plot are:
The sample sizes do not need to be equal.
Many distributional aspects can be simultaneously tested.
Difference between Q-Q plot and P-P plot
A q-q plot is better when assessing the goodness of fit in the tail of the distributions.
The normal q-q plot is more sensitive to deviances from normality in the tails of the distribution,
whereas the normal p-p plot is more sensitive to deviances near the mean of the distribution.
Q-Q plot: Plots the quantiles of a variable's distribution against the quantiles of any of a number of test distributions.
P-P plot: Plots a variable's cumulative proportions against the cumulative proportions of any of a number of test distributions.
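The q-q idea can be sketched numerically without plotting: compare the quantiles of two samples and see how close the pairs lie to the identity (45-degree) line. The samples here are simulated under assumed parameters; note that unequal sample sizes are no problem, as stated above:

```python
# Sketch of the q-q idea: quantile pairs of two samples from the same
# distribution lie close to the 45-degree (identity) line.
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(loc=5.0, scale=1.0, size=500)
b = rng.normal(loc=5.0, scale=1.0, size=800)  # unequal sizes are fine

probs = np.linspace(0.05, 0.95, 19)
qa = np.quantile(a, probs)
qb = np.quantile(b, probs)

# Same underlying distribution -> quantile pairs hug the identity line.
max_gap = float(np.max(np.abs(qa - qb)))
print("largest quantile gap:", round(max_gap, 3))
```

The 50% quantile computed this way coincides with the median, as defined above.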
Slide 31
Quantiles of the first sample are set against the quantiles of the second sample.
[Q-Q plots: quantiles of the standard normal distribution against the sample distribution (simulated data) and against a normal distribution]
Slide 32
Example of q-q plot with simulated data
Normal vs. Standard Normal Sample Distribution vs. Standard Normal
[Two q-q plots with histograms of the simulated data. Axis labels (translated from German): frequency ("Häufigkeit"), observed value ("Beobachteter Wert"), expected normal value ("Erwarteter Wert von Normal"). Test distribution (SPSS): Standard Normal.]
Slide 33
Example
Dataset "Data_07.sav" (Chernobyl radioactive fallout)
Distribution of original data Distribution of log transformed data
Slide 34
Exercises 01: Log Transformation & Data Trimming
Resources => www.schwarzpartners.ch/ZNZ_2012 => Exercises Analysis => Exercise 01
Slide 35
Linear Regression
Example
Medical research: Dependence of systolic blood pressure on age
[Scatterplot: systolic blood pressure [mm Hg] vs. age [years]]
Dataset (EXAMPLE01.SAV)
Sample of n = 10 men
Variables for
◦ age (age)
◦ systolic blood pressure (pressure)
Typical questions
Is there a linear relation between
age and systolic blood pressure?
What is the predicted mean blood
pressure for men aged 67?
Slide 36
The questions
Question in everyday language:
Is there a linear relation between age and systolic blood pressure?
Research question:
What is the relation between age and systolic blood pressure?
What kind of model is best for showing the relation? Is regression analysis the right model?
Statistical question:
Forming hypothesis
H0: "No model" (= No overall model and no significant coefficients)
HA: "Model" (= Overall model and significant coefficients)
Can we reject H0?
The solution
Linear regression equation of age on systolic blood pressure
pressure = β0 + β1 · age + u
pressure = dependent variable
age = independent variable
β0, β1 = coefficients
u = error term
Slide 37
"How-to" in SPSS
Scales
Dependent variable: metric
Independent variable: metric
SPSS
Analyze → Regression → Linear...
Result
Significant linear model
Significant coefficient
pressure = 135.2 + 0.956 · age
Predicted mean blood pressure
199.2 = 135.2 + 0.956 · 67
Typical statistical statement in a paper:
There is a linear relation between age and systolic blood pressure.
(Regression: F = 102.763, R2 = .93, p = .000).
[Scatterplot with fitted regression line: systolic blood pressure [mm Hg] vs. age [years]]
Slide 38
General purpose of regression
◦ Cause analysis
State a relationship between independent variables and the dependent variable.
Example
Is there a model that describes the dependence between blood pressure and age, or do these two variables just form a random pattern?
◦ Impact analysis
Assess the impact of the independent variable on the dependent variable.
Example
If age increases, blood pressure also increases: How strong is the impact? By how much will pressure increase with each additional year?
◦ Prediction
Predict the values of a dependent variable using new values for the independent variable.
Example
What is the predicted mean systolic blood pressure for men aged 67?
Slide 39
Key Steps in Regression Analysis
1. Formulation of the model
◦ Common sense (remember the example with storks and babies)
◦ Is a linear relationship plausible?
◦ Not too many variables (Principle of parsimony: Simplest solution to a problem)
2. Estimation of the model
◦ Estimation of the model by means of OLS estimation (ordinary least squares)
◦ Decision on procedure: Enter, stepwise regression
3. Verification of the model
◦ Is the model as a whole significant? (i.e. are the coefficients significant as a group?) → F-test
◦ Are the regression coefficients significant? → t-tests (should be performed only if F-test is significant)
◦ How much variation does the regression equation explain? → Coefficient of determination (adjusted R-squared)
4. Considering other aspects (for example, multicollinearity)
5. Testing of assumptions (Gauss-Markov, independence and normal distribution)
6. Interpretation of the model and reporting
Text in italics: Only important in the case of multiple regression – see next chapter.
Slide 40
Regression model
Mathematical model
The linear model describes y as a function of x
y = β0 + β1 · x (equation of a straight line)
The variable y is a linear function of the variable x.
β0 (intercept, constant)
The point where the regression line crosses the Y-axis.
The value of the dependent variable when all of the independent variables = 0.
β1 (regression coefficient)
The increase in the dependent variable per unit change in the
independent variable (also known as "the rise over the run", slope)
Stochastic model
y = β0 + β1 · x + u
The error term u comprises all factors (other than x) that affect y.
These factors are treated as being unobservable.
→ u stands for "unobserved"
More details about mathematics in Christof Luchsinger's part
[Diagram: regression line with slope β1 = Δy/Δx]
Slide 41
Stochastic model – Assumptions related to the error term
The error term u is (must be):
◦ independent of the explanatory variable x
◦ normally distributed with mean 0 and variance σ2: u ~ N(0,σ2)
E(y) = β0 + β1 · x
[Figure: regression line with normal conditional distributions of y (standard deviation σ) around it. Source: Wooldridge, Jeffrey (2011): Introductory econometrics. 5th Edition. [S.l.]: South-Western.]
Slide 42
Gauss-Markov Theorem, Independence and Normal Distribution
Under the 5 Gauss-Markov assumptions the OLS estimator is the best, linear, unbiased estima-
tor of the true parameters βi, given the present sample.
→ The OLS estimator is BLUE
1. Linear in coefficients y = β0 + β1 ⋅ x + u
2. Random sample of n observations {(xi, yi): i = 1, ..., n}
3. Zero conditional mean:
The error u has an expected value of 0,
given any values of the explanatory variable
E(u|x) = 0
4. Sample variation in explanatory variables.
The xi’s are not constant and not all the same.
x ≠ const (not all xi are equal)
5. Homoscedasticity:
The error u has the same variance given any value of the
explanatory variable.
Var(u|x) = σ²
Independence and normal distribution of error: u ~ Normal(0, σ²)
These assumptions need to be tested – among other things, by analyzing the residuals.
Based on: Wooldridge J. (2005). Introductory Econometrics: A Modern Approach. 3rd edition, South-Western.
Slide 43
Regression analysis with SPSS: Some examples
Simple example (EXAMPLE02)
Dataset: Sample of 99 men by body height and weight
Step 1: Formulation of the model
Regression equation of weight on height
weight = β0 + β1 · height + u
weight = dependent variable
height = independent variable
β0, β1 = coefficients
u = error term
The scatterplot confirms that there could be a
linear relationship between weight and height.
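The estimation step can be sketched in Python instead of SPSS. The data are simulated under an assumed model (the real EXAMPLE02 dataset is not available here), but the OLS mechanics are the same:

```python
# Hedged sketch: OLS estimation of weight on height with simulated data
# (99 men, as in EXAMPLE02; the true coefficients below are assumptions).
import numpy as np

rng = np.random.default_rng(1)
height = rng.normal(178.0, 7.0, size=99)                    # cm
weight = -120.0 + 1.1 * height + rng.normal(0, 5, size=99)  # kg, assumed model

b1, b0 = np.polyfit(height, weight, deg=1)   # OLS slope and intercept
predicted = b0 + b1 * 180.0                  # prediction for a 180 cm man
print(f"weight = {b0:.1f} + {b1:.3f} * height; predicted at 180 cm: {predicted:.1f} kg")
```

`np.polyfit` with degree 1 computes exactly the ordinary least squares line; prediction is then plugging a new x-value into the fitted equation.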
Slide 44
Step 2: Estimation of the model
SPSS: Analyze → Regression → Linear...
Slide 45
Step 3: Verification of the model
SPSS Output (EXAMPLE02) – F-test
The null hypothesis (H0) is that there is no effect of height.
The alternative hypothesis (HA) is that this is not the case.
H0: β1 = 0 (Multiple Regression => H0: β1 = β2 = ... = βp = 0)
HA: β1 ≠ 0 (Multiple Regression => HA: βj ≠ 0 for at least one value of j)
Empirical F-value and the appropriate p-value ("Sig.") are computed by SPSS.
In the example, we can reject H0 in favor of HA (Sig. < 0.05).
The overall model is significant (F(1,97) = 116.530, p = .000).
The estimated model is not only a theoretical construct but one that exists in a statistical sense.
Slide 46
Step 3: Verification of the model – t-tests
SPSS Output (EXAMPLE02) – t-test
The Coefficients table provides significance tests for the coefficients.
The significance test evaluates the null hypothesis that the regression coefficient is zero
H0: βi = 0
HA: βi ≠ 0
The t statistic for the height variable (β1) is associated with a p-value of .000 ("Sig.").
This indicates that the null hypothesis can be rejected.
Thus, the coefficient is significantly different from zero.
This holds also for the constant (β0) with Sig. = .000.
Slide 47
Step 6. Interpretation of the model
SPSS Output (EXAMPLE02) – Regression coefficients
weight_i = β0 + β1 · height_i
weight_i = −120.375 + 1.086 · height_i
Unstandardized coefficients show absolute
change of the dependent variable if the
independent variable increases by one unit.
If height increases by 1 cm,
weight increases by 1.086 kg.
Note: The constant -120.375 has no specific
meaning. It's just the intersection with the Y-axis.
Slide 48
Back to Step 3: Verification of the model
SPSS Output (EXAMPLE02) – Coefficient of determination
[Figure: decomposition of the deviation of a data point from the sample mean.
y_i = data point, ŷ_i = estimation (model), ȳ = sample mean.
Total gap = regression part (ŷ_i − ȳ) + error part (y_i − ŷ_i).
The error is also called residual.]
Slide 49
SPSS Output (EXAMPLE02) – Coefficient of determination I
Summing up squared distances gives sums of squares (SS):
SS_Total = SS_Regression + SS_Error
Σ_{i=1..n} (y_i − ȳ)² = Σ_{i=1..n} (ŷ_i − ȳ)² + Σ_{i=1..n} (y_i − ŷ_i)²
R Square = SS_Regression / SS_Total, with 0 ≤ R Square ≤ 1
R Square, the coefficient of determination, is .546.
In the example, about half the variation of weight is explained by the model (R2 = 54.6%).
In bivariate regression, R² is equal to the squared value of the correlation coefficient of the two
variables (rxy = .739, rxy² = .546).
The higher R Square, the better the fit.
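The sum-of-squares identity and the equality R² = r² can be checked numerically. This is a sketch on simulated data with assumed parameters:

```python
# Sketch: verify SS_Total = SS_Regression + SS_Error and R² = r²
# on simulated data (parameters are assumptions).
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(0, 1, 200)
y = 2.0 + 0.8 * x + rng.normal(0, 1, 200)

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

ss_total = np.sum((y - y.mean()) ** 2)
ss_reg = np.sum((y_hat - y.mean()) ** 2)
ss_err = np.sum((y - y_hat) ** 2)

r_square = ss_reg / ss_total
r_xy = np.corrcoef(x, y)[0, 1]
print(f"R² = {r_square:.3f}, r² = {r_xy**2:.3f}")
```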
Slide 50
Step 5: Testing of assumptions
In the example, are the requirements of the Gauss-Markov theorem as well as the other assumptions met?
1. Is the model linear in coefficients? Yes, decision for regression model.
2. Is it a random sample? Yes, clinical study.
3. Do the residuals have an expected value of 0
for all values of x? (zero conditional mean)
→ Scatterplot of residuals
4. Is there variation in the explanatory variable? Yes, clinical study.
5. Do the residuals have constant variance
for all values of x? (homoscedasticity)
→ Scatterplot of residuals
Are the residuals independent from one another?
Are the residuals normally distributed?
→ Scatterplot of residuals
→ (consider Durbin-Watson)
→ Histogram
Slide 51
Scatterplot of standardized predicted values of y vs. standardized residuals
3. Zero conditional mean: The mean values of the residuals do not differ visibly from 0 across
the range of standardized estimated values. → OK
5. Homoscedasticity: Residual plot trumpet-shaped; residuals do not have constant variance.
This Gauss-Markov requirement is violated. → There is heteroscedasticity.
Independence: There is no obvious pattern that indicates that the residuals would be influencing
one another (for example a "wavelike" pattern). → OK
Slide 52
Histogram of standardized residuals
Normal distribution of residuals:
Distribution of the standardized residuals is more or less normal. → OK
Slide 53
Violation of the homoscedasticity assumption
How to diagnose heteroscedasticity
Informal methods:
◦ Look at the scatterplot of standardized predicted y-values vs. standardized residuals.
◦ Graph the data and look for patterns.
Formal methods (not pursued further in this course):
◦ Breusch-Pagan test / Cook-Weisberg test
◦ White test
Corrections
◦ Transformation of the variable: Possible correction in the case of this example is a log transformation of variable weight
◦ Use of robust standard errors (not implemented in SPSS)
◦ Use of Generalized Least Squares (GLS): The estimator is provided with information about the variance and covariance of the errors.
(The last two options are not pursued further in this course.)
Slide 54
Multiple regression
Many similarities with simple Regression Analysis from above
◦ Key steps in regression analysis
◦ General purpose of regression
◦ Mathematical model and stochastic model
◦ Ordinary least squares (OLS) estimates and Gauss-Markov theorem as well as independence and normal distribution of error
All of these concepts also apply to multiple regression analysis.
What is new?
◦ Concept of multicollinearity
◦ Concept of stepwise conduction of regression analysis
◦ Dummy coding of categorical variables
◦ Standardized regression coefficients
◦ Adjustment of the coefficient of determination ("Adjusted R Square")
Slide 55
Multicollinearity
Outline
Multicollinearity means there is a strong correlation between independent variables.
Perfect collinearity means a variable is a linear combination of other variables.
=> Unique estimate of coefficients not possible because of infinite number of combinations.
Perfect collinearity is rare in real-life data (unless you have made a mistake, e.g., entered the same variable twice).
However, correlations or even strong correlations between variables are unavoidable.
Symptoms of multicollinearity
When correlation is strong, the standard errors of the parameters become large,
and thus t-tests and confidence intervals become inaccurate.
◦ The probability is increased that a good predictor will be found non-significant and rejected.
◦ In stepwise regression coefficient estimation is subject to large changes.
◦ There might be coefficients with sign opposite of that expected.
Multicollinearity is ...
◦ a severe problem when the research purpose includes causal modelling.
◦ less important where the research purpose is prediction, since the predicted values remain stable relative to each other.
Slide 56
How to identify multicollinearity
If the correlation coefficients between pairs of variables are greater than |0.80|, the variables
should not be used in the same model.
An indicator for multicollinearity reported by SPSS is Tolerance.
◦ Tolerance reflects the percentage of unexplained variance in a variable, given the other independent variables. Tolerance informs about the degree of independence of an independent variable.
◦ Tolerance ranges from 0 (= multicollinear) to 1 (= independent).
◦ Rule of thumb (O'Brien 2007): Tolerance less than .10 → problem with multicollinearity
In addition, SPSS reports the Variance Inflation Factor (VIF) which is simply the inverse of the
Tolerance (1/Tolerance). VIF has a range 1 to infinity.
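Tolerance and VIF can be computed by hand: regress each independent variable on the others and take 1 − R² of that auxiliary regression. The sketch below uses simulated data with one nearly collinear pair (all names and parameters are assumptions):

```python
# Sketch of Tolerance and VIF: Tolerance = 1 - R² of a predictor regressed
# on the other predictors; VIF = 1 / Tolerance.
import numpy as np

def tolerance(target, others):
    """1 - R² of 'target' regressed (with intercept) on 'others'."""
    X = np.column_stack([np.ones(len(target))] + list(others))
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ beta
    r2 = 1 - resid.var() / target.var()
    return 1 - r2

rng = np.random.default_rng(3)
x1 = rng.normal(0, 1, 300)
x2 = x1 + rng.normal(0, 0.1, 300)   # nearly collinear with x1
x3 = rng.normal(0, 1, 300)          # independent

tol_x1 = tolerance(x1, [x2, x3])
tol_x3 = tolerance(x3, [x1, x2])
print("Tolerance x1:", round(tol_x1, 3), "VIF:", round(1 / tol_x1, 1))
print("Tolerance x3:", round(tol_x3, 3))
```

The collinear variable falls far below the .10 rule-of-thumb threshold, while the independent one stays near 1.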
Slide 57
Multiple regression analysis with SPSS: Some detailed examples
Example of multiple regression (EXAMPLE04)
Dataset: Sample of 198 men and women based on body height and weight and age
Step 1: Formulation of the model
Regression of weight on height and age
weight = β0 + β1 · size + β2 · age + u
weight = dependent variable
size = independent variable
age = independent variable
β0, β1, β2 = coefficients
u = error term
Slide 58
Step 3: Verification of the model (without dummy for gender)
SPSS Output regression analysis (EXAMPLE04)
Overall F-test: OK (F(2, 195) = 487.569, p = .000) (table not shown here)
weight = β0 + β1 · height + β2 · age + u
weight = −85.933 + .812 · height + .356 · age
The unstandardized B coefficients show the absolute change of the dependent variable weight
if the respective independent variable, height or age, changes by one unit.
The Beta coefficients are the standardized regression coefficients.
Their magnitudes reflect their relative importance in predicting weight.
Beta coefficients are only comparable within a model, not between. Moreover, they are highly
influenced by misspecification of the model.
Adding or leaving out variables in the equation will affect the size of the beta coefficients.
Slide 59
SPSS Output regression analysis (EXAMPLE04) I
R Square is influenced by the number of independent variables.
R Square increases with increasing number of variables.
Adjusted R Square = R Square − [m · (1 − R Square)] / (n − m − 1)
n = number of observations
m = number of independent variables
n − m − 1 = degrees of freedom (df)
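The adjustment formula is easy to apply directly. The R² value plugged in below is an assumption chosen for illustration (n = 198 and m = 2 match EXAMPLE04; the exact R² from the SPSS output is not reproduced on this slide):

```python
# Sketch of the Adjusted R Square formula:
# Adjusted R² = R² - m * (1 - R²) / (n - m - 1)
def adjusted_r_square(r_square, n, m):
    """n = number of observations, m = number of independent variables."""
    return r_square - m * (1 - r_square) / (n - m - 1)

# Illustrative values (R² = .834 is an assumption, not from the output).
adj = adjusted_r_square(0.834, n=198, m=2)
print(round(adj, 3))
```

The adjusted value is always slightly below R², and the penalty grows with the number of independent variables m.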
Slide 60
Dummy coding of categorical variables
In regression analysis, a dummy variable (also called indicator or binary variable) is one that
takes the values 0 or 1 to indicate the absence or presence of some categorical effect that may
be expected to shift the outcome.
For example, seasonal effects may be captured by creating dummy variables for each of the
seasons. Also gender effects may be treated with dummy coding.
The number of dummy variables is always one less than the number of categories.
Categorical variable → Dummy variables
season                  season_1  season_2  season_3  season_4
If season = 1 (spring)     1         0         0         0
If season = 2 (summer)     0         1         0         0
If season = 3 (fall)       0         0         1         0
If season = 4 (winter)     0         0         0         1

Categorical variable → Dummy variables
gender                  gender_1  gender_2
If gender = 1 (male)       1         0
If gender = 2 (female)     0         1

SPSS syntax:
recode gender (1 = 1) (2 = 0) into gender_d.
Slide 61
Gender as dummy variable
Step 1: Formulation of the model (with dummy for gender)
Women and men have different
mean levels of height and weight.
→ Introduce gender as independent dummy variable
=> Syntax: RECODE gender (1 = 0) (2 = 1) INTO female.
Mean     Height   Weight
Men      181.19   76.32
Women    170.08   63.95
Total    175.64   70.14
Slide 62
Step 3: Verification of the model (with dummy for gender)
SPSS Output regression analysis (EXAMPLE04)
Overall F-test: OK (F(3, 194) = 553.586, p = .000) (table not shown here)
weight = −16.949 + .417 · size + .476 · age − 8.345 · female
Switching from male (female = 0) to female (female = 1) lowers weight by 8.345 kg.
Model fits better (Adjusted R square .894 vs. .832) due to "separation" of gender.
Slide 63
Example of multicollinearity
Human resources research in hospitals: Survey of nurse satisfaction and commitment
Dataset Sample of n = 198 nurses
Step 1: Formulation of the model
Regression model
salary = β0 + β1 · age + β2 · education + β3 · experience + β4 · experience² + u
Why a new variable experience2?
The experience effect on salary is disproportional for younger and older people.
The disproportionality can be described by a quadratic term.
"experience" and "experience²" are highly correlated!
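This collinearity of a variable and its square is easy to demonstrate. The "years of experience" values below are simulated under assumed parameters:

```python
# Why "experience" and "experience²" are highly correlated: over a range of
# positive values, x and x² move together almost linearly.
import numpy as np

rng = np.random.default_rng(4)
experience = rng.uniform(1, 40, 500)   # hypothetical years of experience
experience_sq = experience ** 2

r = np.corrcoef(experience, experience_sq)[0, 1]
print("correlation:", round(r, 3))
```

Centering experience before squaring is a common remedy, since it breaks much of this correlation.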
Slide 64
SPSS Output regression analysis (Example of multicollinearity) I
Tolerance is very low for "experience" and "experience²".
One of the two variables might be eliminated from the model.
=> Use stepwise regression? Unfortunately, SPSS does not take multicollinearity into account.
Slide 65
SPSS Output regression analysis (Example of multicollinearity) II
Prefer this model, because a non-significant constant is difficult to handle.
Slide 66
Exercises 02: Regression
Resources => www.schwarzpartners.ch/ZNZ_2012 => Exercises Analysis => Exercise 02
Slide 67
Analysis of Variance (ANOVA)
Example
Research in human resource management: Survey of nurse salaries in hospitals
Data (EXAMPLE05.sav)
Subsample of n = 96 nurses
Among other variables: work experience (3 levels), salary (hourly wage in CHF/h)
Typical questions
Does experience have an effect on the level of salary?
Are the results only due to chance?
What is the relation between work experience and salary?
Nurse Salary [CHF/h] by Level of Experience
Level of Experience    1     2     3     All
All                   36.-  38.-  42.-  39.-  (grand mean)
Slide 68
Boxplot
The boxplot indicates that salary may differ significantly depending on levels of experience.
- - - grand mean
Slide 69
Questions
Question in everyday language:
Does work experience have an effect on salary?
Research question:
Is there a relation between work experience and salary?
What kind of model is suitable for the relation?
Is analysis of variance the right model?
Statistical question:
Forming hypothesis
H0: "No model" (= Not significant factors)
HA: "Model" (= Significant factors)
Can we reject H0?
Solution
Linear model with salary as the dependent variable (ygk = wage of nurse k in group g)
y_gk = ȳ + α_g + ε_gk
ȳ = grand mean
α_g = effect of group g
ε_gk = random term
Slide 70
"How-to" in SPSS
Scales
Dependent Variable: metric
Independent Variable(s): categorical, part of them metric (called covariates)
SPSS
Analyze → General Linear Model → Univariate...
Results
Overall model significant ("Corrected Model": F(2, 93) = 46.193, p = .000).
experien significant → example interpretation:
There is a main effect of experience (levels 1, 2, 3) on salary, F(2, 93) = 46.193, p = .000. The
value of Adjusted R Squared = .488 shows that 48.8% of the variance in salary around the
grand mean can be predicted by the model (here by experien).
Slide 71
Key steps in analysis of variance
1. Design of experiments
◦ ANOVA is typically used for analyzing the findings of experiments
◦ One-way ANOVA, repeated measures ANOVA, multi-factorial ANOVA (two or more factor analysis of variance)
2. Calculating differences and sum of squares
◦ Differences between group means, individual values and grand mean are squared and summed up. This leads to the fundamental equation of ANOVA.
◦ Test statistics for significance test is calculated from the means of the sums of squares.
3. Prerequisites
◦ Data are independent
◦ Normally distributed variables
◦ Homogeneity of variance between groups
4. Verification of the model and the factors
◦ Is the overall model significant? (F-test)? Are the factors significant?
◦ Are prerequisites met?
5. Checking measures
◦ Adjusted R squared / partial Eta squared
Slide 72
Designs of ANOVA
◦ One-way ANOVA: one factor analysis of variance
1 dependent variable and 1 independent factor
◦ Multi-factorial ANOVA: two or more factor analysis of variance
1 dependent variable and 2 or more independent factors
◦ MANOVA: multivariate analysis of variance
Extension of ANOVA used to include more than one dependent variable
◦ Repeated measures ANOVA
1 independent variable but measured repeatedly under different conditions
◦ ANCOVA: analysis of COVariance
Model includes a so called covariate (metric variable)
◦ MANCOVA: multivariate analysis of COVariances
◦ Mixed-design ANOVA possible (e.g. two-way ANOVA with repeated measures)
Slide 73
Sum of Squares
Step by step
Survey on hospital nurse salary: Salaries differ by level of experience.
Guess: What if ȳ1 ≈ ȳ2 ≈ ȳ3?
[Chart: individual nurse salaries [CHF/h] by level of experience (1, 2, 3).
Labels: ȳ = mean of all nurses' salaries (38.6); group means ȳ1 = 35.9, and 41.6, 42.7 for the higher levels; ȳ3 = mean of experience level 3; y_3i = salary of i-th nurse with experience level 3.
A = part of variation due to experience level; B = random part of variation; A + B = total variation from the mean of all nurses.]
Slide 74
Basic idea of ANOVA
Total sum of squared variance of differences SStotal is separated into two parts
(SS is short for Sum of Squares)
◦ SSbetween Part of sum of squared difference due to groups ("between groups", treatments) (here: between levels of experience)
◦ SSwithin Part of sum of squared difference due to randomness ("within groups", also SSerror) (here: within each experience group)
Fundamental equation of ANOVA:
SS_total = SS_between + SS_within
Σ_{g=1..G} Σ_{k=1..K_g} (y_gk − ȳ)² = Σ_{g=1..G} K_g · (ȳ_g − ȳ)² + Σ_{g=1..G} Σ_{k=1..K_g} (y_gk − ȳ_g)²
g: index for groups from 1 to G (here: G = 3 levels of experience)
k: index for individuals within each group from 1 to K_g (here: K1 = K2 = K3 = 32, K_total = K1 + K2 + K3 = 96 nurses)
If ȳ1 ≈ ȳ2 ≈ ȳ3, then SS_between ≪ SS_within.
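The fundamental equation can be verified numerically. The three groups below are simulated with assumed means near the slide's salary table (32 nurses per group, as in EXAMPLE05):

```python
# Sketch of the fundamental ANOVA equation on simulated group data:
# SS_total = SS_between + SS_within (group means are assumptions).
import numpy as np

rng = np.random.default_rng(5)
groups = [rng.normal(mu, 2.0, 32) for mu in (36.0, 38.0, 42.0)]  # CHF/h

all_y = np.concatenate(groups)
grand_mean = all_y.mean()

ss_total = np.sum((all_y - grand_mean) ** 2)
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(np.sum((g - g.mean()) ** 2) for g in groups)

print(round(ss_total, 2), "=", round(ss_between + ss_within, 2))
```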
Slide 75
Significance testing of the model
The test statistic F for significance testing is computed from the ratio of mean sums of squares:
MS_total = SS_total / (K_total − 1)
MS_b = SS_b / (G − 1)
MS_w = SS_w / (K_total − G)
Calculating the test statistic F and significance testing for the global model:
F = MS_b / MS_w
The F-test verifies the hypothesis that the group means are equal:
H0: ȳ1 = ȳ2 = ȳ3
HA: ȳi ≠ ȳj for at least one pair (i, j)
F follows an F-distribution with (G − 1) and (K_total − G) degrees of freedom.
If ȳ1 ≈ ȳ2 ≈ ȳ3, then MS_b ≪ MS_w.
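The mean squares and F statistic can be computed from the sums of squares directly. Again the groups are simulated with assumed means (three groups of 32, matching the df = (2, 93) reported on the slides):

```python
# Sketch: mean squares and the F statistic for one-way ANOVA on
# simulated group data (group means are assumptions).
import numpy as np

rng = np.random.default_rng(5)
groups = [rng.normal(mu, 2.0, 32) for mu in (36.0, 38.0, 42.0)]
G, K_total = len(groups), sum(len(g) for g in groups)

all_y = np.concatenate(groups)
ss_between = sum(len(g) * (g.mean() - all_y.mean()) ** 2 for g in groups)
ss_within = sum(np.sum((g - g.mean()) ** 2) for g in groups)

ms_between = ss_between / (G - 1)        # df = G - 1
ms_within = ss_within / (K_total - G)    # df = K_total - G
F = ms_between / ms_within

print(f"F({G - 1}, {K_total - G}) = {F:.2f}")
```

With clearly separated group means, MS_between dwarfs MS_within and F is large, which is exactly the pattern the slide's significant F-test reflects.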
Slide 76
ANOVA with SPSS: A detailed example
Example of one-way ANOVA: Survey of nurse salaries (EXAMPLE05)
SPSS: Analyze → General Linear Model → Univariate...
Slide 77
SPSS Output ANOVA (EXAMPLE05) – Tests of Between-Subjects Effects I
Significant overall model (called "Corrected Model")
Significant constant (called "Intercept")
Significant variable experien
Example interpretation for the main effect of experien:
There is a main effect of experience (levels 1, 2, 3) on salary, F(2, 93) = 46.193, p = .000.
The value of Adjusted R Squared (.488) shows that 48.8% of the variance in salary around the
grand mean can be predicted by the model (here: variable experien).
Slide 78
SPSS Output ANOVA (EXAMPLE05) – Tests of Between-Subjects Effects II
Allocation of sum of squares to terms in the SPSS output
SSbetween reflects the sum of squares of all factors in the model.
In this case (one-way analysis), SS_between corresponds to experien.
[Diagram: SS_total around the grand mean split into SS_between and SS_within (= SS_error)]
Slide 79
Partial Eta Squared (partial ηηηη2)
Partial Eta Squared compares the amount of variation explained by a particular factor (all other
variables fixed) to the amount of variation that is not explained by any other factor in the model.
This means, we are only considering variation that is not explained by other variables in the
model. Partial η2 indicates what percentage of this variation is explained by a variable.
Partial η² = SS_Effect / (SS_Effect + SS_Error)
Example: Experience explains 49.8% of the previously unexplained variation.
Note: The values of partial η2 do not sum up to 100%! (↔ "partial")
In case of one-way ANOVA:
Partial η2 is the proportion of the corrected total variation
that is explained by the model (= R2).
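The formula is a one-liner; the sums of squares below are illustrative values chosen to reproduce the 49.8% figure quoted above, not the actual SPSS output:

```python
# Sketch of partial eta squared = SS_effect / (SS_effect + SS_error).
def partial_eta_squared(ss_effect, ss_error):
    return ss_effect / (ss_effect + ss_error)

# Illustrative SS values (assumptions chosen to give roughly .498).
eta2 = partial_eta_squared(ss_effect=598.0, ss_error=603.0)
print(round(eta2, 3))
```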
Slide 80
Two-Way ANOVA
Research in human resource management: Survey of nurse salary
Now two factors are in the design
◦ Work experience (Level of experience 1-3): experien
◦ Work position (Position in office or hospital): position
Typical questions
Do work position and experience have an effect on salary? (→ main effects)
What "interaction" exists between work position and experience? (→ interaction effects)
Nurse Salary [CHF/h]
Position    Level of Experience
             1     2     3     All
Office      35.-  37.-  39.-  37.-
Hospital    37.-  40.-  44.-  40.-
All         36.-  38.-  42.-  39.-
Slide 81
Main effects
The direct effect of an independent variable on the dependent variable is called main effect.
In the example:
◦ The main effect of experien reveals that the nurses' salaries depend on their level of professional experience.
◦ The main effect of position reveals that the nurses′ salaries depend on whether they work in the office or the hospital.
Profile plots are used as visualization:
Main effect experien Main effect position
If the profile plot shows a (nearly) horizontal line, the main effect in question is presumably not
significant. (Attention: SPSS cuts off lower area of graph, Y-axis often does not start at 0!)
[Profile plots: mean salary by experien (levels 1–3) and by position (office, hospital)]
Slide 82
Interaction effects
An interaction between experience and position means there is dependency between the two
variables.
The independent variables have a complex influence on the dependent variable.
The factors do not just function additively but act together in a different manner.
An interaction means that the effect of one factor depends on the value of another factor.
[Diagram: experience (factor A) and position (factor B) each affect salary directly; the interaction (factor A × B) represents their joint effect]
Slide 83
Interaction effects
In the example: The interaction between experien and position means ...
◦ that the effect of work experience on salary is not the same for nurses who work in offices and for nurses who work in the hospital.
◦ that the difference in salary between nurses working in the hospital and nurses working in the office depends on the level of experience.
Profile plots:
Separate lines for position Separate lines for experien
If there is an interaction, the lines are not parallel.
The more the lines deviate from being parallel, the more likely an interaction is.
If there is no interaction, the lines are parallel.
[Profile plots: salary by experien with separate lines for position, and salary by position with separate lines for experien]
Slide 84
Sum of Squares (with interaction)
Again SStotal = SSbetween + SSwithin
With SSbetween = SSExperience + SSPosition + SSExperience x Position
Follows SStotal = (SSExperience + SSPosition + SSExperience x Position) + SSwithin
Where SSExperience x Position is the interaction of both factors simultaneously
Slide 85
Example of two-way ANOVA: Survey of nurse salary (EXAMPLE06)
SPSS: Analyze → General Linear Model → Univariate...
Slide 86
Interaction
The interaction term between fixed factors is calculated by default in ANOVA.
Example interpretation (among other outputs):
There is also an interaction of experience and position on salary, F(2, 90) = 18.991, p = .000,
partial η2 = .297.
The interaction term experien * position explains 29.7% of the previously unexplained variance.
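Partial η² can be recovered from the reported F statistic and its degrees of freedom, since partial η² = SS_effect / (SS_effect + SS_error) = (F·df1) / (F·df1 + df2):

```python
def partial_eta_squared(f, df1, df2):
    # partial eta^2 = SS_effect / (SS_effect + SS_error)
    #              = (F * df1) / (F * df1 + df2)
    return (f * df1) / (f * df1 + df2)

# Interaction experien * position as reported above: F(2, 90) = 18.991.
print(round(partial_eta_squared(18.991, 2, 90), 3))  # 0.297
```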
Slide 87
Interaction I
Do different levels of experience change the impact of position?
Yes: at experience levels 2 and 3, the influence of position is increased.
Simplified: the lines are not parallel.
Interpretation: Experience is more important in hospitals than in offices.
Slide 88
More on interaction
[Six profile plots of salary vs. experien, illustrating the possible combinations of: main effect of experien present/absent, main effect of position present/absent, interaction present/absent]
Slide 89
Requirements of ANOVA
0. Robustness
ANOVA is relatively robust against violations of its assumptions.
1. Sampling
Random sample, no treatment effects
A well-designed study avoids violation of this assumption.
2. Distribution of residuals
Residuals (= errors) are normally distributed
Correction → transformation
3. Homogeneity of variances
Residuals (= errors) have constant variance
Correction → weight variances
4. Balanced design
Same sample size in all groups
Correction → weighted means
SPSS automatically corrects unbalanced designs with Type III sums of squares. Syntax: /METHOD = SSTYPE(3)
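Requirements 3 and 4 can be screened mechanically; a rough Python sketch with invented data (a screen only, not a substitute for SPSS's formal tests such as Levene's):

```python
def check_balance(groups):
    # Balanced design: same sample size in every group.
    sizes = [len(g) for g in groups]
    return len(set(sizes)) == 1

def variance_ratio(groups):
    # Rule-of-thumb screen for homogeneity: largest / smallest group variance.
    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    vs = [var(g) for g in groups]
    return max(vs) / min(vs)

groups = [[19, 21, 20], [24, 26, 25], [32, 34, 33]]   # invented data
print(check_balance(groups), variance_ratio(groups))
```

A balanced design with a variance ratio near 1 is unproblematic; a large ratio suggests heterogeneous variances.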
Slide 90
Exercises 03: ANOVA
Resources => www.schwarzpartners.ch/ZNZ_2012 => Exercises Analysis => Exercise 03
Slide 91
Other Multivariate Methods
Types of Multivariate Statistical Analysis
With regard to practical application, multivariate methods can be divided into two main groups:
Methods for identifying structures / Methods for discovering structures
[Diagrams: left (identifying structures), the independent variables (IV) price of product, quality of products, and quality of customer service point to the dependent variable (DV) customer satisfaction; right (discovering structures), mutual dependencies among customer satisfaction, employee satisfaction, and motivation of employee]
Also called dependence analysis because methods are used to test direct dependencies between variables. Variables are divided into independent variables and dependent variable(s).
Also called interdependence analysis because methods are used to discover dependencies between variables. This is especially the case with exploratory data analysis (EDA).
Slide 92
Choice of Method
Methods for identifying structures
(Dependence Analysis)
Regression Analysis
Analysis of Variance (ANOVA)
Discriminant Analysis
Contingency Analysis
(Conjoint Analysis)
Methods for discovering structures
(Interdependence Analysis)
Factor Analysis
Cluster Analysis
Multidimensional Scaling (MDS)
Choice of dependence-analysis method by scale of the variables:
DV metric, IV metric → Regression analysis
DV metric, IV categorical → Analysis of Variance (ANOVA)
DV categorical, IV metric → Discriminant analysis
DV categorical, IV categorical → Contingency analysis
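The choice of dependence-analysis method can be expressed directly as a lookup; a small Python sketch (method names as on this slide):

```python
# Choice of dependence-analysis method by scale of DV and IV.
METHOD = {
    ("metric", "metric"): "Regression analysis",
    ("metric", "categorical"): "Analysis of Variance (ANOVA)",
    ("categorical", "metric"): "Discriminant analysis",
    ("categorical", "categorical"): "Contingency analysis",
}

def choose_method(dv, iv):
    """Return the method for a given (DV scale, IV scale) combination."""
    return METHOD[(dv, iv)]

print(choose_method("categorical", "metric"))  # Discriminant analysis
```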
Slide 93
Tree of methods (also www.ats.ucla.edu/stat/mult_pkg/whatstat/default.htm)
(See also www.methodenberatung.uzh.ch (in German))
Data Analysis
◦ Descriptive (univariate, bivariate)
◦ Inductive
- Univariate: t-test, χ2 adjustment test
- Bivariate: correlation, t-test, χ2 independence test
- Multivariate
Dependence:
DV metric, IV metric → Regression
DV metric, IV not metric → ANOVA
DV not metric, IV metric → Discriminant
DV not metric, IV not metric → Contingency
(Conjoint)
Interdependence (metric / not metric): Factor, Cluster, MDS
DV = dependent variable, IV = independent variable
Slide 94
Example of multivariate Methods (categorical / metric)
Linear discriminant analysis
Linear discriminant analysis (LDA) is used to find the linear combination of features which
best separates two or more groups in a sample.
The resulting combination may be used to classify cases into the groups.
(Example: Credit card debt, debt to income ratio, income => predict bankrupt risk of clients)
LDA is closely related to ANOVA and logistic regression analysis, which also attempt to express
one dependent variable as a linear combination of other variables.
LDA is an alternative to logistic regression, which is frequently used in its place. Logistic
regression is preferred when the data are not normally distributed or the group sizes
are very unequal.
Slide 95
Example of linear discriminant analysis
Data from measures of body length of
two subspecies of puma (South & North America)
[Scatterplot: x2 [cm] (100–140) vs. x1 [cm] (150–250) for the two subspecies]
Species x1 x2
1 191 131
1 185 134
1 200 137
1 173 127
1 171 118
1 160 118
1 188 134
1 186 129
1 174 131
1 163 115
2 186 107
2 211 122
2 201 114
2 242 131
2 184 108
2 211 118
2 217 122
2 223 127
2 208 125
2 199 124
Species 1 = North America, 2 = South America
x1 body length: nose to top of tail
x2 body length: nose to root of tail
Other names for puma
cougar
mountain lion
catamount
panther
Slide 96
Very short introduction to linear discriminant analysis
Dependent Variable (also called discriminant variable): categorical
◦ Puma example: species (the two subspecies of puma)
Independent Variables: metric
◦ Puma example: x1 & x2 (different measures of body length)
Goal
Discrimination between groups
◦ Puma example: discrimination between the two subspecies
Estimate a function for discriminating between groups:
Yi = α + β1×xi,1 + β2×xi,2 + ui
Yi          discriminant variable
α, β1, β2   coefficients
xi,1, xi,2  measurements of body length
ui          error term
Sketch of LDA
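The estimation can be sketched in plain Python on the puma data from Slide 95: compute the group means and the pooled within-group covariance, take w = S⁻¹(m2 − m1), and classify by the midpoint of the projected group means. (This is standard two-group LDA; it reproduces the SPSS raw coefficients only up to a scaling factor.)

```python
# Puma body-length data from Slide 95: (x1, x2) per subspecies.
g1 = [(191, 131), (185, 134), (200, 137), (173, 127), (171, 118),
      (160, 118), (188, 134), (186, 129), (174, 131), (163, 115)]  # North America
g2 = [(186, 107), (211, 122), (201, 114), (242, 131), (184, 108),
      (211, 118), (217, 122), (223, 127), (208, 125), (199, 124)]  # South America

def mean2(pts):
    n = len(pts)
    return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)

def scatter(pts, m):
    # 2x2 scatter matrix of deviations around the group mean m.
    s = [[0.0, 0.0], [0.0, 0.0]]
    for x1, x2 in pts:
        d = (x1 - m[0], x2 - m[1])
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

m1, m2 = mean2(g1), mean2(g2)
s1, s2 = scatter(g1, m1), scatter(g2, m2)

# Pooled within-group covariance matrix S.
n = len(g1) + len(g2)
S = [[(s1[i][j] + s2[i][j]) / (n - 2) for j in range(2)] for i in range(2)]

# Discriminant direction w = S^-1 (m2 - m1), via the closed-form 2x2 inverse.
det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
d = (m2[0] - m1[0], m2[1] - m1[1])
w = ((S[1][1] * d[0] - S[0][1] * d[1]) / det,
     (-S[1][0] * d[0] + S[0][0] * d[1]) / det)

def score(p):
    return w[0] * p[0] + w[1] * p[1]

# Midpoint of the projected group means as cutoff (equal group sizes).
cut = (score(m1) + score(m2)) / 2
correct = sum(score(p) < cut for p in g1) + sum(score(p) > cut for p in g2)
print(f"{correct}/20 correctly classified")  # 20/20 correctly classified
```

The direction w has the same sign pattern as the SPSS raw coefficients (positive for x1, negative for x2), and all 20 training cases are classified correctly, which matches the 100% classification reported on Slide 100.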
Slide 97
Data from measurement of body-length of two subspecies of puma
[Two scatterplots: x2 [cm] (100–140) vs. x1 [cm] (150–250), illustrating the separation of the two subspecies]
Slide 98
SPSS-Example of linear discriminant analysis (EXAMPLE07)
DISCRIMINANT
/GROUPS=species(1 2)
/VARIABLES=x1 x2
/ANALYSIS ALL
/PRIORS SIZE
/STATISTICS=MEAN STDDEV UNIVF BOXM COEFF RAW TABLE
/CLASSIFY=NONMISSING POOLED MEANSUB .
Slide 99
SPSS Output Discriminant analysis (EXAMPLE07) I
Both coefficients significant
Yi = α + β1×xi,1 + β2×xi,2 + εi
Yi = 4.588 + 0.131×xi,1 - 0.243×xi,2 + εi
Slide 100
The two subspecies of pumas can be completely classified (100%).
See also the plot above, which is generated with
Yi = 4.588 + 0.131×xi,1 - 0.243×xi,2 + εi
[Plot: discriminant scores Y (−5 to +5) for the 20 pumas; subspecies-1 scores are negative, subspecies-2 scores positive; the two "found" pumas A and B fall among subspecies 1 and 2, respectively]
"Found" two pumas A & B:
x1 x2
A 175 120
B 200 110
What subspecies are they?
Use
Yi = 4.588 + 0.131×xi,1 - 0.243×xi,2
to determine their subspecies.
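Plugging the two "found" pumas into the estimated function (a plain Python sketch; the error term is omitted for prediction):

```python
def discriminant(x1, x2):
    # Estimated discriminant function from Slide 99 (error term omitted).
    return 4.588 + 0.131 * x1 - 0.243 * x2

y_a = discriminant(175, 120)   # puma A
y_b = discriminant(200, 110)   # puma B
print(round(y_a, 3), round(y_b, 3))  # -1.647 4.058
```

Y is negative for A, like the subspecies-1 pumas (North America), and positive for B, like the subspecies-2 pumas (South America), consistent with the plot on Slide 100.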