regression analysis using spss - stat modeller

Regression Analysis using SPSS

Delivered by Hiren Kakkad | CEO & Co-founder

Stat Modeller, Vadodara | www.statmodeller.com

Correlation

• In real life, we frequently find that two or more variables move together.

• When they move together, we say they are correlated.

• The variables may be (Y1, Y2), (X1, X2) or (X, Y)

Some Examples

Independent Variables: Years of Experience, Credit Score, Equipment Life

Dependent Variables: Salary, Loan Amount, Breakdown

Types of Correlation

Curvilinear relationship

Correlation

• If we find a relationship between the x and y variables, we can develop a model that predicts y when x is known.

• Building such a prediction model is known as Regression Analysis.

Regression Analysis

Whether a relationship exists

• Determine whether the independent variables explain a significant variation in the dependent variable

Strength of the relationship

• Determine how much of the variation in the dependent variable can be explained by the independent variables

Mathematical equation

• Determine the structure or form of the relationship

Prediction of new values

• Predict the values of the dependent variable

Regression Analysis

Simple Linear Regression

• Only one dependent and one independent variable

• Predict CO2 emissions from engine size

• Independent variable (x): Engine size

• Dependent variable (y): CO2 emissions

Multiple Linear Regression

• One dependent variable and multiple independent variables

• Predict CO2 emissions from engine size and cylinders

• Independent variables (xs): Engine size, Cylinders

• Dependent variable (y): CO2 emissions

Simple Linear Regression

Assumptions of Regression Analysis

• The random error 𝜀 is normally distributed

• The correlation between the dependent variable y and the independent variable x should be very high

• The data must be collected randomly

Regression Analysis Understanding


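$y = \beta_0 + \beta_1 x_1 + \varepsilon$

where $y$ is the dependent variable, $\beta_0$ is the intercept, $\beta_1$ is the coefficient, $x_1$ is the independent variable, and $\varepsilon$ is the error.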

Ice-cream Sales

Temp.

𝐼𝑐𝑒 𝐶𝑟𝑒𝑎𝑚 𝑆𝑎𝑙𝑒𝑠 = 𝛽0 + 𝛽1 𝑇𝑒𝑚𝑝 + 𝜀

Data of Ice Cream Sales

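| Temperature °C ($x$) | Ice Cream Sales ($y$) | $x_i - \bar{x}$ | $y_i - \bar{y}$ | $(x_i - \bar{x})(y_i - \bar{y})$ | $(x_i - \bar{x})^2$ | $(y_i - \bar{y})^2$ |
|---|---|---|---|---|---|---|
| 14.2 | 210 | -4.5 | -187.5 | 839.1 | 20.0 | 35156.3 |
| 16.4 | 320 | -2.3 | -77.5 | 176.3 | 5.2 | 6006.3 |
| 11.9 | 180 | -6.8 | -217.5 | 1473.6 | 45.9 | 47306.3 |
| 15.2 | 327 | -3.5 | -70.5 | 245.0 | 12.1 | 4970.3 |
| 18.5 | 401 | -0.2 | 3.5 | -0.6 | 0.0 | 12.3 |
| 22.1 | 518 | 3.4 | 120.5 | 412.7 | 11.7 | 14520.3 |
| 19.4 | 407 | 0.7 | 9.5 | 6.9 | 0.5 | 90.3 |
| 25.1 | 609 | 6.4 | 211.5 | 1358.9 | 41.3 | 44732.3 |
| 23.4 | 539 | 4.7 | 141.5 | 668.6 | 22.3 | 20022.3 |
| 18.1 | 416 | -0.6 | 18.5 | -10.6 | 0.3 | 342.3 |
| 22.6 | 440 | 3.9 | 42.5 | 166.8 | 15.4 | 1806.3 |
| 17.2 | 403 | -1.5 | 5.5 | -8.1 | 2.2 | 30.3 |
| $\bar{x} = 18.675$ | $\bar{y} = 397.5$ | | | Sum: 5328.5 | Sum: 177.0 | Sum: 174995.0 |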

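$\beta_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$

$\beta_1 = \frac{5328.5}{177.0}$

$\beta_1 = 30.10$

$\beta_0 = \bar{y} - \beta_1 \bar{x}$

$\beta_0 = 397.5 - 30.10 \times 18.675$

$\beta_0 = -164.70$ (computed with the unrounded slope $5328.5 / 177.0$)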

𝐼𝑐𝑒 𝐶𝑟𝑒𝑎𝑚 𝑆𝑎𝑙𝑒𝑠 = −164.70 + 30.10 𝑇𝑒𝑚𝑝
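The hand calculation above can be reproduced in a few lines of code. The following is a minimal sketch in plain Python (no SPSS involved); the function name `fit_simple_linear` and the variable names are illustrative choices, not part of the original material. The data are the twelve temperature and sales pairs from the ice cream table.

```python
# Temperature (°C) and ice cream sales, copied from the ice cream table.
temps = [14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2]
sales = [210, 320, 180, 327, 401, 518, 407, 609, 539, 416, 440, 403]


def fit_simple_linear(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-products and sum of squared deviations of x.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


b0, b1 = fit_simple_linear(temps, sales)
print(f"slope = {b1:.2f}, intercept = {b0:.2f}")
```

The printed values should land close to the slide's 30.10 and -164.70. Small differences in the second decimal place are expected, because the slide divides sums that were already rounded to one decimal.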

Case Study

• This dataset contains a subset of the fuel economy data that the

EPA (Environmental Protection Agency) makes available

on http://fueleconomy.gov.

• It contains only models which had a new release every year

between 1999 and 2008.


Simple Linear Regression

ENGINESIZE CO2EMISSIONS

0 2.0 196

1 2.4 221

2 1.5 136

3 3.5 255

4 3.5 244

5 3.5 230

6 3.5 232

7 3.7 255

8 3.7 267

9 2.4 ???

Using the data above, can we predict this value of CO2 emissions?

Independent variable: ENGINESIZE. Dependent variable: CO2EMISSIONS.

𝐶𝑜2 𝐸𝑚𝑖𝑠𝑠𝑖𝑜𝑛𝑠 = 𝛽0 + 𝛽1 𝐸𝑛𝑔𝑖𝑛𝑒 𝑆𝑖𝑧𝑒
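One way to answer the question for row 9 is to fit the equation on the nine complete rows and then plug in the engine size 2.4. The sketch below does this in plain Python, assuming ordinary least squares as in the ice cream example; the variable names are illustrative, and SPSS may round or report the coefficients slightly differently.

```python
# Rows 0 to 8 of the table: engine size and observed CO2 emissions.
engine_size = [2.0, 2.4, 1.5, 3.5, 3.5, 3.5, 3.5, 3.7, 3.7]
co2 = [196, 221, 136, 255, 244, 230, 232, 255, 267]

n = len(engine_size)
x_bar = sum(engine_size) / n
y_bar = sum(co2) / n

# Least-squares estimates of the slope (beta_1) and intercept (beta_0).
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(engine_size, co2))
s_xx = sum((x - x_bar) ** 2 for x in engine_size)
beta_1 = s_xy / s_xx
beta_0 = y_bar - beta_1 * x_bar

# Prediction for row 9, whose engine size is 2.4.
predicted_co2 = beta_0 + beta_1 * 2.4
print(f"CO2 = {beta_0:.2f} + {beta_1:.2f} * EngineSize")
print(f"Predicted CO2 for engine size 2.4: {predicted_co2:.1f}")
```

With only nine rows the fit is purely illustrative; the full dataset would give different coefficients.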

Let’s Make a Scatter Plot


Benefits of Linear Regression

• Very Fast

• No Parameter Tuning (unlike setting k in K-NN algorithm)

• Easy to understand, highly interpretable

Steps for the Simple Linear Regression

Check the X and Y Relationship

• Using a scatter plot

• Using the correlation coefficient

Check for the Assumptions

• Data is independent of order (runs test)

• Error term is normally distributed (K-S test or Shapiro test)

Develop a model

• Estimate the values of $\beta_0$ and $\beta_1$

Check Model Accuracy

• Derive the $r^2$ value of the model

• See the p-values of the model and the coefficients

Use for Prediction

• Predict values using the model
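Two of the steps above, checking the relationship with the correlation coefficient and deriving the r² value, can be sketched in plain Python. The helper names `pearson_r` and `r_squared` are illustrative. The assumption tests and p-values are left to SPSS, since they need statistical distributions that are not covered here. The data are the nine complete rows of the engine size table.

```python
import math

engine_size = [2.0, 2.4, 1.5, 3.5, 3.5, 3.5, 3.5, 3.7, 3.7]
co2 = [196, 221, 136, 255, 244, 230, 232, 255, 267]


def mean(values):
    return sum(values) / len(values)


def pearson_r(x, y):
    """Pearson correlation coefficient between x and y."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_yy = sum((b - y_bar) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


def r_squared(x, y):
    """Share of the variation in y explained by the fitted line."""
    x_bar, y_bar = mean(x), mean(y)
    slope = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
             / sum((a - x_bar) ** 2 for a in x))
    intercept = y_bar - slope * x_bar
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    sst = sum((b - y_bar) ** 2 for b in y)
    return 1 - sse / sst


r = pearson_r(engine_size, co2)
r2 = r_squared(engine_size, co2)
print(f"r = {r:.3f}, r^2 = {r2:.3f}")
```

For simple linear regression, r² from the fitted model equals the square of the correlation coefficient, so the two printed values are consistent with each other.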

Multiple Linear Regression


ENGINESIZE CYLINDERS FUELCONSUMPTION_COMB CO2EMISSIONS

0 2.0 4 8.5 196

1 2.4 4 9.6 221

2 1.5 4 5.9 136

3 3.5 6 11.1 255

4 3.5 6 10.6 244

5 3.5 6 10.0 230

6 3.5 6 10.1 232

7 3.7 6 11.1 255

8 3.7 6 11.6 267

9 2.4 4 9.2 ???


Using the data above, can we predict this value of CO2 emissions?


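$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$

where $y$ is the dependent variable, $\beta_0$ is the intercept, $\beta_1, \ldots, \beta_k$ are the coefficients, $x_1, \ldots, x_k$ are the independent variables, and $\varepsilon$ is the error.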

𝐶𝑜2 𝐸𝑚𝑖𝑠𝑠𝑖𝑜𝑛 = 𝛽0 + 𝛽1 𝐸𝑛𝑔𝑖𝑛𝑒 𝑠𝑖𝑧𝑒 + 𝛽2 𝐶𝑦𝑙𝑖𝑛𝑑𝑒𝑟𝑠 + 𝛽3 𝐹𝑢𝑒𝑙 𝐶𝑜𝑛𝑠𝑢𝑚𝑝𝑡𝑖𝑜𝑛 𝐶𝑜𝑚𝑏 + 𝜀
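The coefficients of this equation can be estimated by least squares, which amounts to solving the normal equations (X'X)b = X'y. The sketch below is a plain-Python illustration under that assumption; the helper names `solve_linear_system`, `fit_multiple_linear` and `predict` are hypothetical, and SPSS uses its own numerical routines. The data are the nine complete rows of the table above.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b] so row operations act on both sides.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_multiple_linear(rows, y):
    """Return [b0, b1, ..., bk] from the normal equations (X'X) b = X'y."""
    design = [[1.0] + list(r) for r in rows]  # leading 1 for the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coeffs, row):
    return coeffs[0] + sum(b * v for b, v in zip(coeffs[1:], row))


# (ENGINESIZE, CYLINDERS, FUELCONSUMPTION_COMB) for rows 0 to 8.
features = [(2.0, 4, 8.5), (2.4, 4, 9.6), (1.5, 4, 5.9), (3.5, 6, 11.1),
            (3.5, 6, 10.6), (3.5, 6, 10.0), (3.5, 6, 10.1), (3.7, 6, 11.1),
            (3.7, 6, 11.6)]
co2 = [196, 221, 136, 255, 244, 230, 232, 255, 267]

coeffs = fit_multiple_linear(features, co2)
print("coefficients:", [round(c, 2) for c in coeffs])
print("prediction for row 9:", round(predict(coeffs, (2.4, 4, 9.2)), 1))
```

Nine rows and three predictors leave very little data per coefficient, so the individual coefficients from this small sample should not be interpreted on their own.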

Assumptions of Multiple Linear Regression

Source: http://www.restore.ac.uk/srme/www/fac/soc/wie/research-new/srme/modules/mod3/3/index.html

1. Linear Relationship

• The model is a roughly linear one. This is slightly different from simple linear

regression as we have multiple explanatory variables. This time we want the

outcome variable to have a roughly linear relationship with each of the

explanatory variables, taking into account the other explanatory variables in

the model.

2. Homoscedasticity

• Homoscedasticity

assumption means that the

variance around

the regression line is the same for

all values of the predictor variable

(X).

• We can check this by plotting the

standardized residuals (error)

against the predicted values.
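The numbers behind such a plot can be produced as follows. This is a minimal sketch that reuses the ice cream data and assumes one common definition of a standardized residual: the residual divided by the residual standard error, with n - 2 degrees of freedom for a model with one predictor. SPSS's own definition may differ in detail.

```python
import math

temps = [14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2]
sales = [210, 320, 180, 327, 401, 518, 407, 609, 539, 416, 440, 403]

n = len(temps)
x_bar = sum(temps) / n
y_bar = sum(sales) / n
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(temps, sales))
         / sum((x - x_bar) ** 2 for x in temps))
intercept = y_bar - slope * x_bar

predicted = [intercept + slope * x for x in temps]
residuals = [y - p for y, p in zip(sales, predicted)]

# Residual standard error: two coefficients were estimated, hence n - 2.
std_error = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
standardized = [e / std_error for e in residuals]

# Pairs to plot: predicted value on the x-axis, standardized residual on the y-axis.
for p, z in sorted(zip(predicted, standardized)):
    print(f"predicted = {p:7.1f}   standardized residual = {z:6.2f}")
```

If the spread of the standardized residuals looks roughly the same across the range of predicted values, the homoscedasticity assumption is plausible; a funnel shape would suggest it is violated.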

3. Outliers/influential cases

• As with simple linear regression, it is important to look out for

cases which may have a disproportionate influence over your

regression model.

4. Multicollinearity

• Multicollinearity exists when two or more

of the explanatory variables are highly

correlated.

• It also suggests that the two variables

may actually represent the same

underlying factor.

• It can also be checked using variance inflation factor (VIF) values. A VIF

above 10 indicates a problem of

multicollinearity.
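For a model with exactly two explanatory variables, the VIF of each one reduces to 1 / (1 - r²), where r is the correlation between the two. The sketch below uses that special case on ENGINESIZE and CYLINDERS from the nine complete rows of the case study table; with more predictors, each variable has to be regressed on all the others instead. The function name `vif_two_predictors` is illustrative.

```python
engine_size = [2.0, 2.4, 1.5, 3.5, 3.5, 3.5, 3.5, 3.7, 3.7]
cylinders = [4, 4, 4, 6, 6, 6, 6, 6, 6]


def vif_two_predictors(x1, x2):
    """VIF shared by both predictors when the model has exactly two of them."""
    n = len(x1)
    m1 = sum(x1) / n
    m2 = sum(x2) / n
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    r_squared = s12 ** 2 / (s11 * s22)  # squared correlation between x1 and x2
    return 1 / (1 - r_squared)


vif = vif_two_predictors(engine_size, cylinders)
print(f"VIF = {vif:.1f}")
```

On this small sample the value comes out above 10, which is consistent with engine size and number of cylinders carrying largely the same information.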

5. Normally distributed residuals

• The error $y - \hat{y}$ (observed minus predicted value) should be normally

distributed

6. Autocorrelation

• Autocorrelation refers to the degree of

correlation between the values of the

same variables across different

observations in the data. ... In a

regression analysis, autocorrelation of

the regression residuals can also occur if

the model is incorrectly specified.

• A common method of testing for

autocorrelation is the Durbin-Watson test
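The Durbin-Watson statistic is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. It lies between 0 and 4, and values near 2 suggest no first-order autocorrelation. The sketch below computes it in plain Python; the residual values are hypothetical and only illustrate the calculation.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals listed in observation order."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals that drift slowly, which signals positive autocorrelation.
example_residuals = [1.2, 0.8, 0.5, -0.3, -0.9, -1.1, -0.4, 0.6]
dw = durbin_watson(example_residuals)
print(f"Durbin-Watson = {dw:.2f}")
```

Values well below 2 point to positive autocorrelation and values well above 2 to negative autocorrelation; the exact cut-offs depend on the sample size and the number of predictors.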

Let’s run this in SPSS

Stat Modeller

ROBUST KIT OF SOLUTIONS

About Us

Stat Modeller was formed in 2019 and provides training and consultancy services in Operational Excellence and in the application of statistical and data science tools to solve problems across various segments.

We have a team of experts with vast experience in academia, industry, research, and consulting.

Why Stat Modeller

Data analysis is an immense part of any problem solving or research. In industry as well as in research, data plays a vital role. Data is collected in large quantities, but the challenge is deciding which technique to use and how.

To overcome these challenges, Stat Modeller provides solutions to industries, organizations, institutes, universities and individuals who want their data analyzed with the right techniques. Stat Modeller has a team of experts with vast experience in industry, academia, research, and consulting who are committed to providing reliable and quick service to our valued clients. Client satisfaction is our ultimate objective.

Services Domain

• Data Science

• Business Transformation in Industries

• Research Projects

• Training in Institutes/Universities

Services

Data Science

• Machine Learning

• R and R Studio

• Python

• SAS

• SPSS

• Minitab

• Excel and Advance Excel

Business Transformation

• Six Sigma

• Lean

• 5-S

• Kaizen

• Kanban

• QMS

• SPC and SQC and many more

Research Projects

• Research Projects

• Survey Analysis

• Marketing Research etc.

Institutes/ Universities

• Workshops

• Trainings

• Certification Course for Students etc.

Our Expertise

Clients in Various Domains

Agro Economics

Agro Business Management

Dairy Economics

Home Science

Mechanical Engineering

Pharmaceutical Sciences

Financial Management

Management Studies

Business Studies

Marketing Management

Library Science

Workshop on Basics of SPSS at

BVM College of Engineering, Vallabh Vidyanagar

Workshop on Role of SPSS in Research at

DDU, Nadiad

3 Days Workshop on Basics of Python at

Department of Statistics, Sardar Patel University

2 + 1 Days Workshop on BASE SAS at

Department of Statistics, Sardar Patel University

Training on R at

Mumbai University

Training on R at

AERC, Vallabh Vidyanagar

Training on R at

FDP, SPU

Training on R at

HRDC, Gujarat University

Training on R at

Charusat University, Changa

Training on SPSS at

Charusat University, Changa

Mr. Hiren Kakkad

o CEO & Co-founder of Stat Modeller

o More than 8 years of industrial experience

o Certified Lean Six Sigma Black Belt

o Certified Auditor for ISO 9001

o Trained 3500+ participants

o Guided 100+ Improvement projects

o Assisted 25+ Research Projects

o Trainer for R, Python, SPSS, Minitab, Power BI, Excel, Advanced Excel

Mr. Mehul Gandhi

o Business Associate of Stat Modeller

o More than 7 years of industrial experience

o Trained Lean Six Sigma Black Belt

o Certified Auditor for ISO 9001

o Trained 150+ participants

o Guided 5+ Improvement projects & Process Time Study

o Trainer - Excel, Advanced Excel, 5S, Kaizen, Quality Tools, ISO 9001 and many more.

Contact us

D-503, Sharnam Happy Homes,

Sayaji Township Road,

Sayajipura,Vadodara - 390019

+91 9898233268

[email protected]

www.statmodeller.com

You can register on https://statmodeller.com/events/

Thank you

Any Questions?
