
Page 1: Statistics 202: Data Mining, Week 6

© Jonathan Taylor
Based in part on slides from the textbook and on slides of Susan Holmes
October 29, 2012

Page 2: Part I

Other classification techniques

Page 3: Rule based classifiers

Examples:

if Refund == No and Marital_Status == Married ⇒ Cheat = No
if Blood_Type == Warm and Lay_Eggs == Yes ⇒ Class = Birds

LHS = rule antecedent, or condition
RHS = rule consequent
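These examples suggest a simple representation. A minimal sketch (my own illustration, not course code): a rule is an antecedent, a list of attribute tests, together with a consequent.

```python
# One way to represent
# "if Refund == No and Marital_Status == Married => Cheat = No" in code.

def covers(rule, record):
    """A rule covers a record if every antecedent condition holds."""
    return all(record.get(attr) == val for attr, val in rule["antecedent"])

rule = {
    "antecedent": [("Refund", "No"), ("Marital_Status", "Married")],
    "consequent": ("Cheat", "No"),
}

record = {"Refund": "No", "Marital_Status": "Married", "Income": 90000}
print(covers(rule, record))  # True: both conditions hold
```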

Page 4: Rule-based Classifier (Example)

(slide from Tan, Steinbach, Kumar, Introduction to Data Mining)

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Page 5: Rule based classifiers

A rule covers a case, or instance, if the case's features satisfy the antecedent (condition).
The coverage of a rule on a dataset is the fraction of cases the rule covers.
The accuracy of a rule is the fraction of the cases it covers for which the consequent agrees with the label.
Ideal rules have both high coverage and high accuracy.
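A sketch of these two definitions in code (function and variable names are mine). The example data reproduce the numbers on the next page: coverage 40%, accuracy 50%.

```python
# Coverage and accuracy of a rule over a list of labelled records.

def covers(rule, record):
    return all(record.get(a) == v for a, v in rule["antecedent"])

def coverage_and_accuracy(rule, records, label_key):
    covered = [r for r in records if covers(rule, r)]
    coverage = len(covered) / len(records)
    _, predicted = rule["consequent"]
    accuracy = (sum(r[label_key] == predicted for r in covered) / len(covered)
                if covered else 0.0)
    return coverage, accuracy

rule = {"antecedent": [("Status", "Single")], "consequent": ("Cheat", "No")}
records = [
    {"Status": "Single",   "Cheat": "No"},
    {"Status": "Single",   "Cheat": "Yes"},
    {"Status": "Married",  "Cheat": "No"},
    {"Status": "Married",  "Cheat": "No"},
    {"Status": "Divorced", "Cheat": "No"},
]
print(coverage_and_accuracy(rule, records, "Cheat"))  # (0.4, 0.5)
```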

Page 6: Rule Coverage and Accuracy

(slide from Tan, Steinbach, Kumar, Introduction to Data Mining)

Coverage of a rule: the fraction of records that satisfy the antecedent of the rule.
Accuracy of a rule: among the records that satisfy the antecedent, the fraction that also satisfy the consequent.

Example rule: (Status = Single) → No
Coverage = 40%, Accuracy = 50%

Page 7: How does a Rule-based Classifier Work?

(slide from Tan, Steinbach, Kumar, Introduction to Data Mining)

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

A lemur triggers rule R3, so it is classified as a mammal.
A turtle triggers both R4 and R5.
A dogfish shark triggers none of the rules.
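The turtle and dogfish cases show why an ordering and a default class are needed (taken up on page 12). A sketch of first-match classification, assuming the rules are tried in the order listed:

```python
# First-match rule application over the R1..R5 rule set above.

RULES = [
    ([("Give Birth", "no"),  ("Can Fly", "yes")],       "Birds"),
    ([("Give Birth", "no"),  ("Live in Water", "yes")], "Fishes"),
    ([("Give Birth", "yes"), ("Blood Type", "warm")],   "Mammals"),
    ([("Give Birth", "no"),  ("Can Fly", "no")],        "Reptiles"),
    ([("Live in Water", "sometimes")],                  "Amphibians"),
]

def classify(record, rules=RULES, default="Unknown"):
    for antecedent, label in rules:
        if all(record.get(a) == v for a, v in antecedent):
            return label   # first matching rule decides
    return default         # no rule fires: fall back to a default class

lemur = {"Give Birth": "yes", "Blood Type": "warm", "Can Fly": "no"}
print(classify(lemur))  # Mammals, via R3
```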

Page 8: Rule based classifiers

A rule-based classifier is determined by a set of rules.
The rules are mutually exclusive if each case is covered by one and only one rule.
The rules are exhaustive if each case is covered by at least one rule.
A decision tree is an example of a rule-based classifier.

Page 9: Rule based classifiers

[Figure.]

Page 10: From Decision Trees To Rules

(slide from Tan, Steinbach, Kumar, Introduction to Data Mining)

Rules extracted from the tree are mutually exclusive and exhaustive.
The rule set contains as much information as the tree.

Page 11: Rules Can Be Simplified

(slide from Tan, Steinbach, Kumar, Introduction to Data Mining)

Initial rule: (Refund = No) ∧ (Status = Married) → No
Simplified rule: (Status = Married) → No

We can simplify the rule because the Refund condition turns out to be redundant.

Page 12: Rule based classifiers

Possible issues:

Rules may not be mutually exclusive: order the rules, or use a majority vote among the rules that fire.
Rules may not be exhaustive: use a default class for cases no rule covers.

Finding rules:

Extract them directly from the data (see PRIM in The Elements of Statistical Learning).
Extract them from a decision tree; some tree software, such as C4.5, does this (a sketch follows below).
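A hedged sketch of the tree-to-rules route. The slide names C4.5; scikit-learn is used here instead, where export_text prints every root-to-leaf path of a fitted tree, and each path reads as a rule.

```python
# Extracting rules from a fitted decision tree (illustration only).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each printed root-to-leaf path is a rule:
# if (condition) and (condition) ... => class
print(export_text(tree, feature_names=iris.feature_names))
```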

Page 13: Learning a rule

PRIM (Patient Rule Induction Method)

[Figure.]

Page 14: Removing a discovered rule to find additional rules

[Figure.]
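The title describes the sequential covering idea: learn one good rule, remove the cases it covers, and repeat on what remains. A hedged sketch, where best_rule is a hypothetical stand-in for any single-rule learner (e.g. pick the candidate rule with the highest accuracy):

```python
# Sequential covering, sketched: keep the best remaining rule,
# drop the cases it covers, repeat.

def covers(rule, record):
    return all(record.get(a) == v for a, v in rule["antecedent"])

def sequential_covering(records, best_rule, max_rules=10):
    rules, remaining = [], list(records)
    while remaining and len(rules) < max_rules:
        rule = best_rule(remaining)
        if rule is None:       # no acceptable rule left
            break
        rules.append(rule)
        remaining = [r for r in remaining if not covers(rule, r)]
    return rules
```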

Page 15: Evaluating a rule

Write n(R) for the number of cases a rule R covers, and n_c(R) for the number of those it classifies correctly.

Accuracy(R) = 1 − Misclassification Error(R) = n_c(R) / n(R)

Laplace(R) = (n_c(R) + 1) / (n(R) + k), where k is the number of classes.

M-estimate(R) = (n_c(R) + α · p_{i(R)}) / (n(R) + α), where i(R) is the majority class among the covered cases, p_i is a set of prior probabilities on each of the k classes, and α > 0 is some prior weight.

The Laplace rule is effectively a posterior estimate of the probability of an instance belonging to a class, based on a prior probability: it is the M-estimate with uniform prior p_i = 1/k and weight α = k.
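The three measures in code, with a check that Laplace and the M-estimate coincide for α = k under a uniform prior (the example numbers are invented):

```python
# Rule-evaluation measures from this page.

def rule_accuracy(nc, n):
    return nc / n

def laplace(nc, n, k):
    return (nc + 1) / (n + k)

def m_estimate(nc, n, alpha, p_majority):
    return (nc + alpha * p_majority) / (n + alpha)

# A rule covering n = 10 cases, nc = 8 correctly, with k = 2 classes:
print(rule_accuracy(8, 10))       # 0.8
print(laplace(8, 10, 2))          # 0.75
print(m_estimate(8, 10, 2, 0.5))  # 0.75, matching Laplace (alpha=k, p=1/k)
```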

Page 16: Properties of rule-based classifiers

Flexible like decision trees (even more flexible).
As expressive as decision trees.
Easy to classify new cases (once an ordering is established).
Can be extracted from decision trees.
Disadvantage: there is no clear loss function that they are optimizing.

Page 17: Nearest-neighbour classifiers

For each x, find its k nearest neighbours in the data matrix X of cases.
Let k*(x) denote the majority class among those k nearest neighbours.
Assign f(x) = k*(x).
Easy to describe, but computing nearest neighbours for many cases can be expensive.
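A sketch of this procedure using scikit-learn (the course materials reference R; this Python version is only an illustration):

```python
# k-nearest-neighbour classification on the iris data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=3)  # k = 3
knn.fit(X_train, y_train)          # "training" just stores the cases
print(knn.score(X_test, y_test))   # accuracy on held-out cases
```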

Page 18: Nearest Neighbor Classifiers

(slide from Tan, Steinbach, Kumar, Introduction to Data Mining)

Basic idea: if it walks like a duck and quacks like a duck, then it's probably a duck.

[Figure: training records and a test record.] Compute the distance from the test record to the training records, then choose the k "nearest" records.

Page 19: Choice of k

(slide from Tan, Steinbach, Kumar, Introduction to Data Mining)

Choosing the value of k:
If k is too small, the classifier is sensitive to noise points.
If k is too large, the neighbourhood may include points from other classes.
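One standard way to pick k, not spelled out on the slide, is cross-validation. A sketch over the k values shown on the following pages:

```python
# Cross-validated accuracy for several choices of k.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
for k in (1, 3, 10, 20):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(k, round(scores.mean(), 3))
```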

Page 20: Nearest neighbour classifier, k = 1

[Figure.]

Page 21: Nearest neighbour classifier, k = 3

[Figure.]

Page 22: Nearest neighbour classifier, k = 10

[Figure.]

Page 23: Nearest neighbour classifier, k = 20

[Figure.]

Page 24: Linear Discriminant Analysis using (sepal.width, sepal.length)

[Figure: LDA decision regions for the iris data over the two sepal features; axis ranges roughly 3.5 to 8.0 and 2.0 to 5.0.]
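A sketch of the fit presumably behind this figure (my reconstruction, not the course's code): LDA on the iris data restricted to the two sepal features.

```python
# LDA on iris using only sepal length and sepal width.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = load_iris()
X = iris.data[:, :2]   # sepal length, sepal width
y = iris.target

lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.score(X, y))  # training accuracy with just these two features
```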

Page 25: Bayesian classifiers

In LDA/QDA, we used Bayes' Theorem to classify.

Formally, given a partition E_1, ..., E_k, Bayes' Theorem says

P(E_i | A) = P(A | E_i) P(E_i) / P(A)
           = P(A | E_i) P(E_i) / ( Σ_{j=1}^k P(A | E_j) P(E_j) )

We applied this to the events E_i = {Y = c} and A = {X = x}, yielding

P(Y = c | X = x) = P(X = x | Y = c) P(Y = c) / ( Σ_{j=1}^k P(X = x | Y = j) P(Y = j) )
                 ∝ P(X = x | Y = c) P(Y = c)

In LDA/QDA, the conditional distribution of X given Y = c has density φ_{μ_c, Σ_c}(x), a multivariate Gaussian density (with a common Σ across classes in LDA).
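A quick numeric illustration of the posterior formula, with invented priors and likelihood values:

```python
# Bayes' Theorem numerically: two classes, A and B.
priors = {"A": 0.3, "B": 0.7}        # P(Y = c)
likelihood = {"A": 0.10, "B": 0.02}  # P(X = x | Y = c) at a fixed x

evidence = sum(likelihood[c] * priors[c] for c in priors)  # P(X = x)
posterior = {c: likelihood[c] * priors[c] / evidence for c in priors}
print(posterior)  # A: ~0.68, B: ~0.32, so x would be classified as A
```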

Page 26: Naive Bayes classifiers

A Naive Bayes classifier applies Bayes' Theorem with several features x = (x_1, ..., x_p):

P(Y = c | X_1 = x_1, ..., X_p = x_p) ∝ P(X_1 = x_1, ..., X_p = x_p | Y = c) P(Y = c)

To use the classifier, we need to estimate the right-hand side above.

Page 27: Naive Bayes classifiers

The Naive Bayes classifier assumes that the features are independent given the class label. Therefore,

P(Y = c | X_1 = x_1, ..., X_p = x_p) ∝ ( Π_{l=1}^p P(X_l = x_l | Y = c) ) · P(Y = c)

This fits a discriminant model within each feature separately.
For continuous features, a 1-dimensional QDA model is typically used (i.e. a Gaussian within each class).
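Gaussian Naive Bayes in scikit-learn matches this description: it fits a per-class mean and variance for each feature separately. A sketch:

```python
# Gaussian Naive Bayes on the iris data.
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
nb = GaussianNB().fit(X, y)
print(nb.theta_[0])    # per-feature means for class 0
print(nb.score(X, y))  # training accuracy
```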

Page 28: Laplace smoothing

For discrete features, Naive Bayes uses the empirical estimate

P(X_j = l | Y = c) = #{i : X_ij = l, Y_i = c} / #{i : Y_i = c}.

These estimates have the problem that if no training instance of class c had feature X_j = l, then any new instance x with x_j = l can never be classified to class c.

One solution: use the Laplace smoothed probabilities

P(X_j = l | Y = c) = ( #{i : X_ij = l, Y_i = c} + α ) / ( #{i : Y_i = c} + α · k ),

where k here is the number of levels of the feature X_j, so the smoothed probabilities still sum to 1.
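The smoothed estimate in code; alpha plays the slide's α (for a library version, scikit-learn's CategoricalNB exposes the same smoothing through its alpha parameter):

```python
# Laplace-smoothed estimate of P(X_j = level | Y = cls) for one
# discrete feature; k is the number of feature levels.
from collections import Counter

def smoothed_conditional(values, labels, level, cls, alpha=1.0):
    k = len(set(values))
    in_class = [v for v, y in zip(values, labels) if y == cls]
    count = Counter(in_class)[level]
    return (count + alpha) / (len(in_class) + alpha * k)

xj = ["a", "a", "b", "b", "b"]
y  = [ 0,   0,   0,   1,   1 ]
# "a" never occurs with class 1, yet its smoothed probability is positive:
print(smoothed_conditional(xj, y, "a", 1))  # 0.25 rather than 0.0
```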

Page 29: Naive Bayes classifier

[Figure.]

Page 30: Quadratic Discriminant Analysis using (sepal.width, sepal.length)

[Figure: QDA decision regions for the iris data over the two sepal features; axis ranges roughly 3.5 to 8.0 and 2.0 to 5.0.]
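The QDA analogue of the earlier LDA sketch, again my reconstruction of how such a figure could be produced; the only modelling change is a separate covariance per class.

```python
# QDA on iris using only the two sepal features.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

iris = load_iris()
X, y = iris.data[:, :2], iris.target
qda = QuadraticDiscriminantAnalysis().fit(X, y)
print(qda.score(X, y))
```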

Page 31: Naive Bayes classifiers

Properties:

The features may not really be conditionally independent, which may introduce some bias.
Continuous features may not be Gaussian (LDA suffers from this also).
Can be slow to predict new instances (at least in e1071's implementation).