regression analysis using spss - stat modeller
Regression Analysis using SPSS
Delivered by Hiren Kakkad | CEO & Co-founder
Stat Modeller, Vadodara
www.statmodeller.com
Correlation
• In real life, we frequently find that a group of two or more
variables move together
• When they move together, we can say they are
correlated.
• The variables may be (Y1, Y2), (X1, X2) or (X, Y)
Some Examples
Independent Variables: Year of Experience, Credit Score, Equipment Life
Dependent Variables: Salary, Loan Amount, Breakdown
Types of Correlation
Curvilinear relationship
Correlation
• If we find a relationship between the x and y variables, we can develop a model that predicts y when x is known.
• Building that prediction model is known as Regression Analysis.
Regression Analysis
• Whether a relationship exists: determine whether the independent variables explain a significant variation in the dependent variable
• Strength of the relationship: determine how much of the variation in the dependent variable can be explained by the independent variables
• Mathematical equation: determine the structure or form of the relationship
• Prediction of new values: predict the values of the dependent variable
Regression Analysis
Simple Linear Regression
• Only one dependent and one independent variable
• Predict CO2 vs. Engine Size
• Independent variable (x): Engine size
• Dependent variable (y): CO2 Emissions
Multiple Linear Regression
• One dependent variable and multiple independent variables
• Predict CO2 vs. Engine Size and Cylinders
• Independent variables (xs): Engine size, Cylinders
• Dependent variable (y): CO2 Emissions
Simple Linear Regression
Assumptions of Regression Analysis
• The random error $\varepsilon$ is normally distributed
• The correlation between the dependent variable y and the independent variable x should be very high
• The data must be collected randomly
Regression Analysis Understanding
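$y = \beta_0 + \beta_1 x_1 + \varepsilon$
where $y$ is the dependent variable, $\beta_0$ the intercept, $\beta_1$ the coefficient, $x_1$ the independent variable, and $\varepsilon$ the error.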
Example: Ice Cream Sales (y) vs. Temperature (x)
$\text{Ice Cream Sales} = \beta_0 + \beta_1 \cdot \text{Temp} + \varepsilon$
Data of Ice Cream Sales
Temperature °C (x)  Ice Cream Sales (y)  xi − x̄  yi − ȳ  (xi − x̄)(yi − ȳ)  (xi − x̄)²  (yi − ȳ)²
14.2                210                  -4.5    -187.5  839.1             20.0       35156.3
16.4                320                  -2.3    -77.5   176.3             5.2        6006.3
11.9                180                  -6.8    -217.5  1473.6            45.9       47306.3
15.2                327                  -3.5    -70.5   245.0             12.1       4970.3
18.5                401                  -0.2    3.5     -0.6              0.0        12.3
22.1                518                  3.4     120.5   412.7             11.7       14520.3
19.4                407                  0.7     9.5     6.9               0.5        90.3
25.1                609                  6.4     211.5   1358.9            41.3       44732.3
23.4                539                  4.7     141.5   668.6             22.3       20022.3
18.1                416                  -0.6    18.5    -10.6             0.3        342.3
22.6                440                  3.9     42.5    166.8             15.4       1806.3
17.2                403                  -1.5    5.5     -8.1              2.2        30.3
x̄ = 18.675          ȳ = 397.5                            Σ = 5328.5        Σ = 177.0  Σ = 174995.0
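$\beta_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$
$\beta_1 = \frac{5328.5}{177.0}$
$\beta_1 = 30.10$
$\beta_0 = \bar{y} - \beta_1 \bar{x}$
$\beta_0 = 397.5 - 30.10 \times 18.675$
$\beta_0 = -164.70$ (computed with the unrounded slope $5328.5 / 177.0 \approx 30.1045$)
$\text{Ice Cream Sales} = -164.70 + 30.10 \cdot \text{Temp}$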
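The hand calculation above can be reproduced in a few lines of Python. This is a minimal sketch using only the standard library. The variable names (`temps`, `sales`, `s_xy`, `s_xx`) are illustrative and do not come from the slides. Because the code works from unrounded sums, its results may differ from the slide values in the second decimal place.

```python
# Ice cream data from the table above: temperature (°C) and sales.
temps = [14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2]
sales = [210, 320, 180, 327, 401, 518, 407, 609, 539, 416, 440, 403]

n = len(temps)
x_bar = sum(temps) / n
y_bar = sum(sales) / n

# Sum of cross-products and sum of squared deviations of x.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(temps, sales))
s_xx = sum((x - x_bar) ** 2 for x in temps)

beta1 = s_xy / s_xx            # slope
beta0 = y_bar - beta1 * x_bar  # intercept

print(f"x_bar = {x_bar:.3f}, y_bar = {y_bar:.1f}")
print(f"beta1 = {beta1:.2f}, beta0 = {beta0:.2f}")
```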
Case Study
• This dataset contains a subset of the fuel economy data that the
EPA (Environmental Protection Agency) makes available
on http://fueleconomy.gov.
• It contains only models which had a new release every year
between 1999 and 2008.
Simple Linear Regression
ENGINESIZE CO2EMISSIONS
0 2.0 196
1 2.4 221
2 1.5 136
3 3.5 255
4 3.5 244
5 3.5 230
6 3.5 232
7 3.7 255
8 3.7 267
9 2.4 ???
Using the above data, can we predict the missing value of CO2 emissions?
Independent Variable: ENGINESIZE | Dependent Variable: CO2EMISSIONS
$\text{CO2 Emissions} = \beta_0 + \beta_1 \cdot \text{Engine Size}$
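One way to answer the question is to fit the line to rows 0 to 8 and evaluate it at the engine size of row 9. The sketch below does this in plain Python. The helper name `fit_simple_linear` is hypothetical, and with only nine rows the result is an illustration rather than a reliable estimate.

```python
def fit_simple_linear(xs, ys):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


# Rows 0-8 of the table above (row 9 is the value to predict).
engine_size = [2.0, 2.4, 1.5, 3.5, 3.5, 3.5, 3.5, 3.7, 3.7]
co2 = [196, 221, 136, 255, 244, 230, 232, 255, 267]

b0, b1 = fit_simple_linear(engine_size, co2)
predicted = b0 + b1 * 2.4  # engine size of row 9

print(f"CO2 = {b0:.1f} + {b1:.1f} * EngineSize")
print(f"Predicted CO2 for engine size 2.4: {predicted:.1f}")
```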
Let’s Make Scatter Plot
Benefits of Linear Regression
• Very Fast
• No Parameter Tuning (unlike setting k in K-NN algorithm)
• Easy to understand, highly interpretable
Steps for the Simple Linear Regression
1. Check the X and Y Relationship
• Using a scatter plot
• Using the correlation coefficient
2. Check for the Assumptions
• Data is independent of order (runs test)
• Error term is normally distributed (K-S test or Shapiro test)
3. Develop a model
• Estimate the values of $\beta_0$ and $\beta_1$
4. Check Model Accuracy
• Derive the $r^2$ value of the model
• See the p-values of the model and coefficients
5. Use for Prediction
• Predict values using the model
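Two of the steps above, checking the correlation coefficient and deriving r², can be sketched in plain Python on the engine-size data from the case study. The variable names are illustrative. The p-values are left out because they need a t-distribution, which the standard library does not provide.

```python
import math

# Rows 0-8 of the simple linear regression table.
engine_size = [2.0, 2.4, 1.5, 3.5, 3.5, 3.5, 3.5, 3.7, 3.7]
co2 = [196, 221, 136, 255, 244, 230, 232, 255, 267]

n = len(engine_size)
x_bar = sum(engine_size) / n
y_bar = sum(co2) / n

s_xx = sum((x - x_bar) ** 2 for x in engine_size)
s_yy = sum((y - y_bar) ** 2 for y in co2)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(engine_size, co2))

# Step 1: Pearson correlation coefficient between x and y.
r = s_xy / math.sqrt(s_xx * s_yy)

# Step 3: estimate the coefficients.
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Step 4: r squared = 1 - (residual sum of squares / total sum of squares).
ss_res = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(engine_size, co2))
r_squared = 1 - ss_res / s_yy

print(f"r = {r:.3f}, r^2 = {r_squared:.3f}")
```

In simple linear regression, r² equals the square of the correlation coefficient, so the two printed values are consistent with each other.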
Multiple Linear Regression
ENGINESIZE CYLINDERS FUELCONSUMPTION_COMB CO2EMISSIONS
0 2.0 4 8.5 196
1 2.4 4 9.6 221
2 1.5 4 5.9 136
3 3.5 6 11.1 255
4 3.5 6 10.6 244
5 3.5 6 10.0 230
6 3.5 6 10.1 232
7 3.7 6 11.1 255
8 3.7 6 11.6 267
9 2.4 4 9.2 ???
Using the above data, can we predict the missing value of CO2 emissions?
Regression Analysis Understanding
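$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$
where $y$ is the dependent variable, $\beta_0$ the intercept, $\beta_1, \ldots, \beta_k$ the coefficients, $x_1, \ldots, x_k$ the independent variables, and $\varepsilon$ the error.
$\text{CO2 Emission} = \beta_0 + \beta_1 \cdot \text{Engine size} + \beta_2 \cdot \text{Cylinders} + \beta_3 \cdot \text{Fuel Consumption Comb} + \varepsilon$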
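The coefficients of this model can be estimated by solving the least-squares normal equations (XᵀX)β = Xᵀy. The sketch below does so in plain Python on rows 0 to 8 of the table and then predicts row 9. The helper `solve_linear_system` is a hypothetical name, and this is not how SPSS computes the fit internally; it only illustrates the idea on a very small sample.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b] so row operations apply to both sides.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


# (ENGINESIZE, CYLINDERS, FUELCONSUMPTION_COMB, CO2EMISSIONS), rows 0-8.
rows = [
    (2.0, 4, 8.5, 196), (2.4, 4, 9.6, 221), (1.5, 4, 5.9, 136),
    (3.5, 6, 11.1, 255), (3.5, 6, 10.6, 244), (3.5, 6, 10.0, 230),
    (3.5, 6, 10.1, 232), (3.7, 6, 11.1, 255), (3.7, 6, 11.6, 267),
]
X = [[1.0, e, c, f] for e, c, f, _ in rows]  # leading 1.0 is the intercept term
y = [co2 for *_, co2 in rows]

k = len(X[0])
xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
beta = solve_linear_system(xtx, xty)

new_car = [1.0, 2.4, 4, 9.2]  # row 9: intercept term plus the three predictors
predicted = sum(b * v for b, v in zip(beta, new_car))

print("beta:", [round(b, 3) for b in beta])
print(f"Predicted CO2 for row 9: {predicted:.1f}")
```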
Assumptions of Multiple Linear Regression
Source: http://www.restore.ac.uk/srme/www/fac/soc/wie/research-new/srme/modules/mod3/3/index.html
1. Linear Relationship
• The model is a roughly linear one. This is slightly different from simple linear
regression as we have multiple explanatory variables. This time we want the
outcome variable to have a roughly linear relationship with each of the
explanatory variables, taking into account the other explanatory variables in
the model.
2. Homoscedasticity
• The homoscedasticity assumption means that the variance around the regression line is the same for all values of the predictor variable (X).
• We can check this by plotting the standardized residuals (errors) against the predicted values.
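The pairs needed for such a plot can be computed as in the sketch below, which uses the single-predictor engine-size model for brevity. It assumes one common definition of a standardized residual: the raw residual divided by the standard error of the estimate, sqrt(SSE / (n − 2)). Other definitions exist, so treat the scaling as an assumption.

```python
import math

engine_size = [2.0, 2.4, 1.5, 3.5, 3.5, 3.5, 3.5, 3.7, 3.7]
co2 = [196, 221, 136, 255, 244, 230, 232, 255, 267]

n = len(engine_size)
x_bar = sum(engine_size) / n
y_bar = sum(co2) / n
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(engine_size, co2))
      / sum((x - x_bar) ** 2 for x in engine_size))
b0 = y_bar - b1 * x_bar

predicted = [b0 + b1 * x for x in engine_size]
residuals = [y - p for y, p in zip(co2, predicted)]

# Standard error of the estimate for a model with one predictor.
std_error = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
standardized = [e / std_error for e in residuals]

# Each printed pair is one point of the residual plot.
for p, z in zip(predicted, standardized):
    print(f"predicted = {p:6.1f}   standardized residual = {z:5.2f}")
```

If the spread of the standardized residuals looks roughly constant across the predicted values, the assumption is plausible; a funnel shape suggests it is violated.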
3. Outliers/influential cases
• As with simple linear regression, it is important to look out for
cases which may have a disproportionate influence over your
regression model.
4. Multicollinearity
• Multicollinearity exists when two or more
of the explanatory variables are highly
correlated.
• It also suggests that the two variables
may actually represent the same
underlying factor.
• It can also be checked using VIF (variance inflation factor) values. A VIF > 10 indicates a multicollinearity problem.
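The VIF of a predictor is 1 / (1 − R²), where R² comes from regressing that predictor on the other predictors. With only two predictors, R² is the squared correlation between them. The sketch below uses this two-predictor shortcut for ENGINESIZE and CYLINDERS. Leaving out FUELCONSUMPTION_COMB is a simplification made here, so the full three-predictor model would give different VIF values.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Rows 0-8 of the multiple linear regression table.
engine_size = [2.0, 2.4, 1.5, 3.5, 3.5, 3.5, 3.5, 3.7, 3.7]
cylinders = [4, 4, 4, 6, 6, 6, 6, 6, 6]

r = pearson_r(engine_size, cylinders)
vif = 1 / (1 - r ** 2)  # two-predictor special case of 1 / (1 - R^2)

print(f"r = {r:.3f}, VIF = {vif:.2f}")
```

On these nine rows the VIF comes out above 10, which matches the rule of thumb for a multicollinearity problem between engine size and cylinders.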
5. Normally distributed residuals
• The error $y - \hat{y}$ should be normally distributed
6. Autocorrelation
• Autocorrelation refers to the degree of
correlation between the values of the
same variables across different
observations in the data. ... In a
regression analysis, autocorrelation of
the regression residuals can also occur if
the model is incorrectly specified.
• A common method of testing for
autocorrelation is the Durbin-Watson test
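The Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. It ranges from 0 to 4. Values near 2 suggest no first-order autocorrelation, values toward 0 suggest positive autocorrelation, and values toward 4 suggest negative autocorrelation. The sketch below computes it on two made-up residual sequences that are not from the slides and serve only to show the two directions.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for a sequence of regression residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals, chosen only for illustration.
alternating = [1.0, -1.0, 1.0, -1.0]  # sign flips every step
clustered = [1.0, 1.0, -1.0, -1.0]    # neighbouring residuals tend to agree

print(durbin_watson(alternating))  # above 2: negative autocorrelation
print(durbin_watson(clustered))    # below 2: positive autocorrelation
```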
Let’s run this in SPSS
Stat Modeller
ROBUST KIT OF SOLUTIONS
About Us
Stat Modeller was formed in 2019 and provides training and consultancy services for Operational Excellence and the application of statistical and data science tools to solve problems in various segments.
We have a team of experts with vast experience in academia, industry, research and consulting.
Why Stat Modeller
Data analysis is an essential part of any problem solving or research. In industry as well as in research, data plays a vital role and is collected in large quantities. The challenge is deciding which technique to use and how to apply it.
To overcome these challenges, Stat Modeller provides solutions to industries, organizations, institutes, universities and individuals who want their data analyzed with the right techniques. Stat Modeller has a team of experts with vast experience in industry, academia, research and consulting, who are committed to providing reliable and quick service to our valued clients. Client satisfaction is our ultimate objective.
Services
Domain
• Data Science
• Business Transformation in Industries
• Research Projects
• Training in Institutes/Universities
Services
Data Science
• Machine Learning
• R and R Studio
• Python
• SAS
• SPSS
• Minitab
• Excel and Advanced Excel
Business Transformation
• Six Sigma
• Lean
• 5-S
• Kaizen
• Kanban
• QMS
• SPC and SQC and many more
Research Projects
• Research Projects
• Survey Analysis
• Marketing Research etc.
Institutes/ Universities
• Workshops
• Trainings
• Certification Course for Students etc.
Our Expertise
Clients in various Domain
Agro Economics
Agro Business Management
Dairy Economics
Home Science
Mechanical Engineering
Pharmaceutical Sciences
Financial Management
Management Studies
Business Studies
Marketing Management
Library Science
Workshop on Basics of SPSS at
BVM College of Engineering, Vallabh Vidyanagar
Workshop on Role of SPSS in Research at
DDU, Nadiad
3 Days Workshop on Basics of Python at
Department of Statistics, Sardar Patel University
2 + 1 Days Workshop on BASE SAS at
Department of Statistics, Sardar Patel University
Training on R at
Mumbai University
Training on R at
AERC, Vallabh Vidyanagar
Training on R at
FDP, SPU
Training on R at
HRDC, Gujarat University
Training on R at
Charusat University, Changa
Training on SPSS at
Charusat University, Changa
Mr. Hiren Kakkad
o CEO & Co-founder of Stat Modeller
o More than 8 years of industrial experience
o Certified Lean Six Sigma Black Belt
o Certified Auditor for ISO 9001
o Trained 3500+ participants
o Guided 100+ Improvement projects
o Assisted 25+ Research Projects
o Trainer for R, Python, SPSS, Minitab, Power BI, Excel, Advanced Excel
Mr. Mehul Gandhi
o Business Associate of Stat Modeller
o More than 7 years of industrial experience
o Trained Lean Six Sigma Black Belt
o Certified Auditor for ISO 9001
o Trained 150+ participants
o Guided 5+ Improvement projects & Process Time Study
o Trainer - Excel, Advanced Excel, 5S, Kaizen, Quality Tools, ISO 9001 and many more.
Contact us
D-503, Sharnam Happy Homes,
Sayaji Township Road,
Sayajipura,Vadodara - 390019
+91 9898233268
www.statmodeller.com
You can register on https://statmodeller.com/events/
Thank you
Any Questions?