
Data Mining Linear and Logistic Regression

Michael Li

CSCU9T6 Information Systems – © University of Stirling 2017

Regression

• In statistical modelling, regression analysis is a statistical process for estimating the relationships among variables.

• Regression models are built from data to predict the average value you would expect one variable to take, given the values of one or more others.

• Simple linear regression maps one variable onto the mean value of another.



Example: weight-height relation

[Scatter plot: Weight against Height, with Height on the horizontal axis and Weight on the vertical axis, showing a fitted straight line]

The model being fitted is $y_i = a + b x_i + \varepsilon_i$, where $\varepsilon_i$ is the error term.

Simple Linear Regression

• To find the best values for a and b, simple linear regression uses a method known as ordinary least squares (OLS)

• Least squares means that the sum of the squared distance between each data point and its associated prediction is minimised

• That is, it minimises

$\sum_{i=1}^{n} \left( y_i - (a + b x_i) \right)^2$


Finding a and b

• In the case of simple linear regression, a and b can be calculated as follows:

$b = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$

$a = \bar{y} - b\bar{x}$
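As a minimal sketch (not part of the module materials), the two formulas above can be computed directly, for example in Python with NumPy; the height and weight numbers here are made up purely for illustration:

import numpy as np

def simple_linear_regression(x, y):
    # Closed-form OLS estimates for y = a + b*x
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    b = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    a = y_bar - b * x_bar
    return a, b

# Made-up height/weight values, for illustration only
heights = [150, 160, 165, 175, 185]
weights = [52, 58, 63, 70, 79]
a, b = simple_linear_regression(heights, weights)
print(a, b)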

Multiple Regression

• With multiple inputs, the general form of linear regression is

$y_i = b_0 + b_1 x_{1i} + b_2 x_{2i} + b_3 x_{3i} + \dots + \varepsilon_i$, or in matrix form $Y = Xb$

• The parameters in b are calculated as

$b = (X^T X)^{-1} X^T Y$
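A minimal sketch of the matrix formula, again assuming Python with NumPy and made-up data (two input columns, one output):

import numpy as np

# Made-up data, for illustration only
X_raw = np.array([[65.0, 30.0],
                  [70.0, 35.0],
                  [72.0, 40.0],
                  [68.0, 45.0]])
Y = np.array([60.0, 66.0, 70.0, 69.0])

# Prepend a column of ones so b[0] plays the role of the intercept b0
X = np.column_stack([np.ones(len(X_raw)), X_raw])

# b = (X^T X)^-1 X^T Y; np.linalg.lstsq(X, Y, rcond=None) is numerically safer,
# but this line mirrors the formula on the slide
b = np.linalg.inv(X.T @ X) @ X.T @ Y
print(b)   # [b0, b1, b2]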


Stats Packages

• Many statistics packages (such as SPSS) offer multiple regression

• Multiple regression assumes there is a linear relationship between the inputs and the output

• Widely used in many fields

– Trend line

– Risk of investment


Logistic Regression

• But what if one of the variables is a class, rather than a number?

• For example, let’s say we have data describing height and gender

• When we want to predict height from gender, it is easy – just calculate the average height of males and that of females, and that is it

• What if you want to predict gender from height?



Logistic Regression

• There is no ‘average’ gender for a given height

• Better to predict the probability of being male (or female) given a height value

• One way to do this is to recode the classes, for example Male = 0 and Female = 1

• Then you can do a regression


Linear Class Regression

• Problems

– Probability values go outside [0,1]

– Violates other assumptions made by linear regression

[Scatter plot: Gender Code (0 or 1) against Height, with the fitted linear trend line y = -0.0277x + 2.1686, which falls below 0 and rises above 1 at the extremes]

$P(c|x) = ax + b$


There is a Better Way

• Leave the class labels as they are (Male, Female, in this case)

• Calculate a probability based on log odds


Odds

• The odds of an event (being male, for example) are

$\dfrac{P(c)}{1 - P(c)}$

• 0.5/0.5 = 1

• 0.75/0.25 = 3

• So odds mean ‘times as probable’


Odds and Probability

• Odds lack a desirable symmetry, as the odds of being male are not simply the opposite of the odds of being female

[Plot: Probability (vertical axis, 0 to 1) against Odds (horizontal axis, 0 to 12)]

Log Odds

• Note that ln(x) = -ln(1/x)

• So we take the log odds and get a function known as the logit

[Plot: Probability (vertical axis, 0 to 1) against Log Odds (Logit) (horizontal axis, -6 to 6), giving an S-shaped curve]

$\ln \dfrac{P(c)}{1 - P(c)}$
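As a small illustrative sketch in Python (not from the slides), computing the logit for a few probabilities shows the symmetry noted above:

import math

def logit(p):
    # Log odds of a probability p, for 0 < p < 1
    return math.log(p / (1 - p))

for p in (0.1, 0.5, 0.9):
    print(p, round(logit(p), 3))   # -2.197, 0.0, 2.197: symmetric about p = 0.5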


Logistic Regression

• Instead of trying to predict P(c|x)=ax + b

• We can predict the log odds given x

$\ln \dfrac{P(c|x)}{1 - P(c|x)} = ax + b$

• Solving this equation (later ...) gives us the logistic regression curve we need

Logit to Probability

• OK, but if I say “The logit of x being male is 0.8”, you may not know what I mean

• We can get back to probabilities:

$\ln \dfrac{P(c|x)}{1 - P(c|x)} = ax + b$

$\dfrac{P(c|x)}{1 - P(c|x)} = e^{ax + b}$

$P(c|x) = \dfrac{e^{ax + b}}{1 + e^{ax + b}} = \dfrac{1}{1 + e^{-(ax + b)}}$
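A tiny Python sketch (illustrative only, not from the slides) of converting a logit back to a probability with the last formula, using the slide's example value of 0.8:

import math

def logit_to_probability(logit_value):
    # p = 1 / (1 + e^-logit)
    return 1.0 / (1.0 + math.exp(-logit_value))

print(round(logit_to_probability(0.8), 3))   # 0.69, so a logit of 0.8 is roughly a 69% probability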


Finding a and b

• All we need to do now is solve the set of equations that result from plugging our data into

$P(c|x) = \dfrac{1}{1 + e^{-(ax + b)}}$

• But there is a problem

• For a given x (height) we don’t have a probability measure, we have a 1 or 0

Maximum Likelihood

• Let’s say we want to guess a parameter that predicts a probability (which, in this case we do, but this is more general ...)

• We can test a candidate value for the parameter using Maximum Likelihood

• Likelihood is the reverse of a conditional probability:

$L(x|y) = P(y|x)$


Maximum Likelihood

• Tossing a coin

• Probability distribution of tossing this coin

• Assume that we observed 40 heads in 100 tosses: what is the probability of a head?


Maximum Likelihood

$L(x|y) = P(y|x)$

• 40 heads in 100 tosses

• Which candidate is the better estimate, $P(head) = 0.5$ or $P(head) = 0.4$? The observed data are more likely under $P(head) = 0.4$
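A minimal Python sketch (my illustration, not the module's code) comparing the likelihood of the two candidate values for the coin example:

from math import comb

def likelihood_of_p(p, heads=40, tosses=100):
    # Probability of seeing exactly `heads` heads in `tosses` tosses if P(head) = p
    return comb(tosses, heads) * p**heads * (1 - p)**(tosses - heads)

print(likelihood_of_p(0.5))   # about 0.011
print(likelihood_of_p(0.4))   # about 0.081, higher, so 0.4 is the maximum likelihood estimate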


Likelihood of a Model

• Call our data set D and imagine we want to estimate a single parameter, a

• The likelihood of the parameter, given the data, is

$L(a|D) = P(D|a)$

• The probability of the data is

$P(D|a) = \prod_{d \in D} p(d|a)$

Likelihood of a Model

• The likelihood of a model is a measure of how well the parameters guess at the true distribution, without ever needing to know the true distribution

• Note that P(c|x) does not appear in the formula, and we don’t need to know it

• P(d|a) is the estimate by the model of the probability of each data point



Maximum Likelihood Logistic Regression

1. Pick a value for a and b

2. Plug those values into the formula below for every value of x in the data

3. Find the product of all of these values by multiplying them together

4. Record that value as the likelihood

5. Choose better values for a and b and repeat (a minimal sketch of this search follows below)

$P(x) = \dfrac{1}{1 + e^{-(ax + b)}}$
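An illustrative Python sketch of steps 1 to 5 (not the module's own code), using made-up height/class data and a crude grid search in place of a real optimiser; treating class-0 cases as contributing 1 - P(x) to the product is my assumption of the standard Bernoulli likelihood, which the steps above do not spell out:

import math

# Made-up (height in cm, class) data, for illustration only: 1 = male, 0 = female
data = [(160, 0), (165, 0), (170, 0), (175, 1), (180, 1), (185, 1)]

def p_of_x(x, a, b):
    # Step 2: P(class = 1 | x) = 1 / (1 + e^-(ax + b))
    return 1.0 / (1.0 + math.exp(-(a * x + b)))

def likelihood(a, b):
    # Steps 3 and 4: product over the data; class-0 cases contribute 1 - P(x)
    result = 1.0
    for x, c in data:
        p = p_of_x(x, a, b)
        result *= p if c == 1 else 1.0 - p
    return result

# Steps 1 and 5, done crudely: try candidate (a, b) pairs on a grid and keep the most likely
best_ab, best_like = None, -1.0
for a10 in range(1, 21):            # a from 0.1 to 2.0
    a = a10 / 10.0
    for b in range(-400, 1, 5):     # b from -400 to 0
        like = likelihood(a, b)
        if like > best_like:
            best_ab, best_like = (a, b), like
print(best_ab, best_like)

With perfectly separable data like this made-up set, the likelihood keeps improving as the curve gets steeper, which is one reason real implementations use a proper optimiser and can fail to converge (see the final slide).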

Log Likelihood

• One more problem to fix ...

• Multiplying many small probabilities together soon suffers from arithmetic underflow – the number is too small to represent or compare

• The solution is to take logs and sum because

$\ln(ab) = \ln(a) + \ln(b)$
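Continuing the sketch above (again my illustration, not the module's code), the log-likelihood version replaces the product with a sum of logs:

import math

data = [(160, 0), (165, 0), (170, 0), (175, 1), (180, 1), (185, 1)]   # same made-up data as above

def p_of_x(x, a, b):
    return 1.0 / (1.0 + math.exp(-(a * x + b)))

def log_likelihood(a, b):
    # Sum of log probabilities instead of a product, to avoid underflow
    total = 0.0
    for x, c in data:
        p = min(max(p_of_x(x, a, b), 1e-12), 1.0 - 1e-12)   # keep log() away from 0 and 1
        total += math.log(p if c == 1 else 1.0 - p)
    return total

print(log_likelihood(0.5, -86))   # compare candidate (a, b) pairs by the larger (less negative) value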


An example of using logistic regression

• Can I get a mortgage with my credit rating?


Credit score    Result
85              1
75              1
73              0
64              0
69              1

[Plot: P(Fail) on the vertical axis against Credit score on the horizontal axis, with the fitted logistic curve labelled P(Fail | score)]
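A minimal sketch of fitting a logistic regression to these five points, assuming Python with scikit-learn (the module itself points to packages such as SPSS and Weka); the meaning of Result = 1 is not stated on the slide, so the code simply models P(Result = 1 | score):

import numpy as np
from sklearn.linear_model import LogisticRegression

scores = np.array([[85], [75], [73], [64], [69]])   # credit scores from the table above
results = np.array([1, 1, 0, 0, 1])                 # the Result column

model = LogisticRegression(C=1e6)   # large C means almost no regularisation
model.fit(scores, results)

a, b = model.coef_[0][0], model.intercept_[0]
print(a, b)                          # fitted slope and intercept of the logit
print(model.predict_proba([[70]]))   # [P(Result = 0 | 70), P(Result = 1 | 70)]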

Logistic Regression

• “Rule of Ten”: a widely used rule of thumb states that logistic regression models give stable estimates for the explanatory variables when there are at least about 10 events per explanatory variable.

• Sampling: as a rule of thumb, sampling controls at about five times the number of cases produces sufficient control data.

• Convergence: in some instances the model may not reach convergence (for example, when the classes are completely separable).



In Weka

WEKA tutorial: http://www.cs.ccsu.edu/~markov/weka-tutorial.pdf