
Data Mining Linear and Logistic Regression

Michael Li

CSCU9T6 Information Systems – © University of Stirling 2017

Regression

• In statistical modelling, regression analysis is a statistical process for estimating the relationships among variables.

• Regression models are built from data to predict the average value you would expect one variable to take, given the values of one or more others.

• Simple linear regression maps one variable onto the mean value of another.



Example: weight-height relation

[Scatter plot: Weight against Height, with Height on the horizontal axis and Weight on the vertical axis, showing a fitted straight line]

The model being fitted is $y_i = a + b x_i + \varepsilon_i$, where $\varepsilon_i$ is the error term.

Simple Linear Regression

• To find the best values for a and b, simple linear regression uses a method known as ordinary least squares (OLS)

• Least squares means that the sum of the squared distance between each data point and its associated prediction is minimised

• That is, it minimises

$\sum_{i=1}^{n} \left( y_i - (a + b x_i) \right)^2$


Finding a and b

• In the case of simple linear regression, a and b can be calculated as follows:

$b = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$

$a = \bar{y} - b\bar{x}$
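As a minimal sketch (not part of the module materials), the two formulas above can be computed directly, for example in Python with NumPy; the height and weight numbers here are made up purely for illustration:

import numpy as np

def simple_linear_regression(x, y):
    # Closed-form OLS estimates for y = a + b*x
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    b = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    a = y_bar - b * x_bar
    return a, b

# Made-up height/weight values, for illustration only
heights = [150, 160, 165, 175, 185]
weights = [52, 58, 63, 70, 79]
a, b = simple_linear_regression(heights, weights)
print(a, b)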

Multiple Regression

• With multiple inputs, the general form of linear regression is

$y_i = b_0 + b_1 x_{1i} + b_2 x_{2i} + b_3 x_{3i} + \dots + \varepsilon_i$, or in matrix form $Y = Xb$

• The parameters in b are calculated as

$b = (X^T X)^{-1} X^T Y$
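A minimal sketch of the matrix formula, again assuming Python with NumPy and made-up data (two input columns, one output):

import numpy as np

# Made-up data, for illustration only
X_raw = np.array([[65.0, 30.0],
                  [70.0, 35.0],
                  [72.0, 40.0],
                  [68.0, 45.0]])
Y = np.array([60.0, 66.0, 70.0, 69.0])

# Prepend a column of ones so b[0] plays the role of the intercept b0
X = np.column_stack([np.ones(len(X_raw)), X_raw])

# b = (X^T X)^-1 X^T Y; np.linalg.lstsq(X, Y, rcond=None) is numerically safer,
# but this line mirrors the formula on the slide
b = np.linalg.inv(X.T @ X) @ X.T @ Y
print(b)   # [b0, b1, b2]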


Stats Packages

• Many statistics packages (such as SPSS) offer multiple regression

• Multiple regression assumes there is a linear relationship between the inputs and the output

• Widely used in many fields

– Trend line

– Risk of investment


Logistic Regression

• But what if one of the variables is a class, rather than a number?

• For example, let’s say we have data describing height and gender

• When we want to predict height from gender, it is easy – just calculate the average height of males and that of females, and that is it

• What if you want to predict gender from height?



Logistic Regression

• There is no ‘average’ gender for a given height

• Better to predict the probability of being male (or female) given a height value

• One way to do this is to recode the classes, for example Male = 0 and Female = 1

• Then you can do a regression


Linear Class Regression

• Problems

– Probability values go outside [0,1]

– Violates other assumptions made by linear regression

[Scatter plot: Gender Code (0 or 1) against Height, with the fitted linear trend line y = -0.0277x + 2.1686, which falls below 0 and rises above 1 at the extremes]

$P(c|x) = ax + b$


There is a Better Way

• Leave the class labels as they are (Male, Female, in this case)

• Calculate a probability based on log odds


Odds

• The odds of an event (being male, for example) are

$\dfrac{P(c)}{1 - P(c)}$

• 0.5/0.5 = 1

• 0.75/0.25 = 3

• So odds mean ‘times as probable’


Odds and Probability

• Odds lack a desirable symmetry, as the odds of being male are not simply the opposite of the odds of being female

[Plot: Probability (vertical axis, 0 to 1) against Odds (horizontal axis, 0 to 12)]

Log Odds

• Note that ln(x) = -ln(1/x)

• So we take the log odds and get a function known as the logit

[Plot: Probability (vertical axis, 0 to 1) against Log Odds (Logit) (horizontal axis, -6 to 6), giving an S-shaped curve]

$\ln \dfrac{P(c)}{1 - P(c)}$
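As a small illustrative sketch in Python (not from the slides), computing the logit for a few probabilities shows the symmetry noted above:

import math

def logit(p):
    # Log odds of a probability p, for 0 < p < 1
    return math.log(p / (1 - p))

for p in (0.1, 0.5, 0.9):
    print(p, round(logit(p), 3))   # -2.197, 0.0, 2.197: symmetric about p = 0.5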


Logistic Regression

• Instead of trying to predict P(c|x)=ax + b

• We can predict the log odds given x

$\ln \dfrac{P(c|x)}{1 - P(c|x)} = ax + b$

• Solving this equation (later ...) gives us the logistic regression curve we need

Logit to Probability

• OK, but if I say “The logit of x being male is 0.8”, you may not know what I mean

• We can get back to probabilities:

$\ln \dfrac{P(c|x)}{1 - P(c|x)} = ax + b$

$\dfrac{P(c|x)}{1 - P(c|x)} = e^{ax + b}$

$P(c|x) = \dfrac{e^{ax + b}}{1 + e^{ax + b}} = \dfrac{1}{1 + e^{-(ax + b)}}$
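A tiny Python sketch (illustrative only, not from the slides) of converting a logit back to a probability with the last formula, using the slide's example value of 0.8:

import math

def logit_to_probability(logit_value):
    # p = 1 / (1 + e^-logit)
    return 1.0 / (1.0 + math.exp(-logit_value))

print(round(logit_to_probability(0.8), 3))   # 0.69, so a logit of 0.8 is roughly a 69% probability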


Finding a and b

• All we need to do now is solve the set of equations that result from plugging our data into

$P(c|x) = \dfrac{1}{1 + e^{-(ax + b)}}$

• But there is a problem

• For a given x (height) we don’t have a probability measure, we have a 1 or 0

Maximum Likelihood

• Let’s say we want to guess a parameter that predicts a probability (which, in this case we do, but this is more general ...)

• We can test a candidate value for the parameter using Maximum Likelihood

• Likelihood is the reverse of a conditional probability:

$L(x|y) = P(y|x)$


Maximum Likelihood

• Tossing a coin

• Probability distribution of tossing this coin

• Assume that we observed 40 heads in 100 tosses: what is the probability of a head?


Maximum Likelihood

$L(x|y) = P(y|x)$

• 40 heads in 100 tosses

• Which candidate is the better estimate, $P(head) = 0.5$ or $P(head) = 0.4$? The observed data are more likely under $P(head) = 0.4$
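A minimal Python sketch (my illustration, not the module's code) comparing the likelihood of the two candidate values for the coin example:

from math import comb

def likelihood_of_p(p, heads=40, tosses=100):
    # Probability of seeing exactly `heads` heads in `tosses` tosses if P(head) = p
    return comb(tosses, heads) * p**heads * (1 - p)**(tosses - heads)

print(likelihood_of_p(0.5))   # about 0.011
print(likelihood_of_p(0.4))   # about 0.081, higher, so 0.4 is the maximum likelihood estimate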


Likelihood of a Model

• Call our data set D and imagine we want to estimate a single parameter, a

• The likelihood of the parameter, given the data, is

$L(a|D) = P(D|a)$

• The probability of the data is

$P(D|a) = \prod_{d \in D} p(d|a)$

Likelihood of a Model

• The likelihood of a model is a measure of how well the parameters guess at the true distribution, without ever needing to know the true distribution

• Note that P(c|x) does not appear in the formula, and we don’t need to know it

• P(d|a) is the estimate by the model of the probability of each data point



Maximum Likelihood Logistic Regression

1. Pick a value for a and b

2. Plug those values into the formula below for every value of x in the data

3. Find the product of all of these values by multiplying them together

4. Record that value as the likelihood

5. Choose better values for a and b and repeat (a minimal sketch of this search follows below)

$P(x) = \dfrac{1}{1 + e^{-(ax + b)}}$
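An illustrative Python sketch of steps 1 to 5 (not the module's own code), using made-up height/class data and a crude grid search in place of a real optimiser; treating class-0 cases as contributing 1 - P(x) to the product is my assumption of the standard Bernoulli likelihood, which the steps above do not spell out:

import math

# Made-up (height in cm, class) data, for illustration only: 1 = male, 0 = female
data = [(160, 0), (165, 0), (170, 0), (175, 1), (180, 1), (185, 1)]

def p_of_x(x, a, b):
    # Step 2: P(class = 1 | x) = 1 / (1 + e^-(ax + b))
    return 1.0 / (1.0 + math.exp(-(a * x + b)))

def likelihood(a, b):
    # Steps 3 and 4: product over the data; class-0 cases contribute 1 - P(x)
    result = 1.0
    for x, c in data:
        p = p_of_x(x, a, b)
        result *= p if c == 1 else 1.0 - p
    return result

# Steps 1 and 5, done crudely: try candidate (a, b) pairs on a grid and keep the most likely
best_ab, best_like = None, -1.0
for a10 in range(1, 21):            # a from 0.1 to 2.0
    a = a10 / 10.0
    for b in range(-400, 1, 5):     # b from -400 to 0
        like = likelihood(a, b)
        if like > best_like:
            best_ab, best_like = (a, b), like
print(best_ab, best_like)

With perfectly separable data like this made-up set, the likelihood keeps improving as the curve gets steeper, which is one reason real implementations use a proper optimiser and can fail to converge (see the final slide).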

Log Likelihood

• One more problem to fix ...

• Multiplying many small probabilities together soon suffers from arithmetic underflow – the number is too small to represent or compare

• The solution is to take logs and sum because

$\ln(ab) = \ln(a) + \ln(b)$
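Continuing the sketch above (again my illustration, not the module's code), the log-likelihood version replaces the product with a sum of logs:

import math

data = [(160, 0), (165, 0), (170, 0), (175, 1), (180, 1), (185, 1)]   # same made-up data as above

def p_of_x(x, a, b):
    return 1.0 / (1.0 + math.exp(-(a * x + b)))

def log_likelihood(a, b):
    # Sum of log probabilities instead of a product, to avoid underflow
    total = 0.0
    for x, c in data:
        p = min(max(p_of_x(x, a, b), 1e-12), 1.0 - 1e-12)   # keep log() away from 0 and 1
        total += math.log(p if c == 1 else 1.0 - p)
    return total

print(log_likelihood(0.5, -86))   # compare candidate (a, b) pairs by the larger (less negative) value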


An example of using logistic regression

• Can I get a mortgage with my credit rating?


Credit score    Result
85              1
75              1
73              0
64              0
69              1

[Plot: P(Fail) on the vertical axis against Credit score on the horizontal axis, with the fitted logistic curve labelled P(Fail | score)]
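A minimal sketch of fitting a logistic regression to these five points, assuming Python with scikit-learn (the module itself points to packages such as SPSS and Weka); the meaning of Result = 1 is not stated on the slide, so the code simply models P(Result = 1 | score):

import numpy as np
from sklearn.linear_model import LogisticRegression

scores = np.array([[85], [75], [73], [64], [69]])   # credit scores from the table above
results = np.array([1, 1, 0, 0, 1])                 # the Result column

model = LogisticRegression(C=1e6)   # large C means almost no regularisation
model.fit(scores, results)

a, b = model.coef_[0][0], model.intercept_[0]
print(a, b)                          # fitted slope and intercept of the logit
print(model.predict_proba([[70]]))   # [P(Result = 0 | 70), P(Result = 1 | 70)]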

Logistic Regression

• “Rule of Ten”: a widely used rule of thumb states that logistic regression models give stable estimates for the explanatory variables when there are at least about 10 events per explanatory variable.

• Sampling: as a rule of thumb, sampling controls at about five times the number of cases produces sufficient control data.

• Convergence: in some instances the model may not reach convergence (for example, when the classes are completely separable).



In Weka

WEKA tutorial: http://www.cs.ccsu.edu/~markov/weka-tutorial.pdf