
240-650 Principles of Pattern Recognition
Chapter 2: Bayesian Decision Theory

Montri Karnjanadecha
[email protected]
http://fivedots.coe.psu.ac.th/~montri



Statistical Approach to Pattern Recognition


A Simple Example

• Suppose that we are given two classes $\omega_1$ and $\omega_2$
  – $P(\omega_1) = 0.7$
  – $P(\omega_2) = 0.3$
  – No measurement is given
• Guessing
  – What shall we do to recognize a given input?
  – What is the best we can do statistically? Why? (See the sketch below.)
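A minimal sketch of this guessing rule in Python: with no measurement available, always deciding the class with the larger prior is the best fixed rule, and its error rate is the prior of the other class (0.3 here). The class labels are notational assumptions for illustration.

# With no measurement, always choose the class with the largest prior.
priors = {"w1": 0.7, "w2": 0.3}
decision = max(priors, key=priors.get)

# The error rate of this rule is the probability mass of the other class.
error_rate = 1.0 - priors[decision]
print(decision, error_rate)  # w1 0.3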


A More Complicated Example

• Suppose that we are given two classes
  – A single measurement x
  – $P(\omega_1|x)$ and $P(\omega_2|x)$ are given graphically


A Bayesian Example

• Suppose that we are given two classes
  – A single measurement x
  – We are given $p(x|\omega_1)$ and $p(x|\omega_2)$ this time


A Bayesian Example – cont.


Bayesian Decision Theory

• Bayes formula

  $$p(\omega_j, x) = P(\omega_j|x)\,p(x) = p(x|\omega_j)\,P(\omega_j)$$

  $$P(\omega_j|x) = \frac{p(x|\omega_j)\,P(\omega_j)}{p(x)}$$

• In case of two categories

  $$p(x) = \sum_{j=1}^{2} p(x|\omega_j)\,P(\omega_j)$$

• In English, it can be expressed as

  $$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}$$
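A small sketch of Bayes' formula in code: the posteriors are the likelihood-prior products normalized by the evidence. The Gaussian class-conditional densities and all numeric values are illustrative assumptions, not values from the slides.

import numpy as np
from scipy.stats import norm

def posteriors(x, likelihoods, priors):
    """P(w_j|x) = p(x|w_j) P(w_j) / p(x), with the evidence p(x) as normalizer."""
    joint = np.array([lik(x) * pr for lik, pr in zip(likelihoods, priors)])
    evidence = joint.sum()   # p(x) = sum_j p(x|w_j) P(w_j)
    return joint / evidence  # posteriors sum to one

# Assumed class-conditional densities: N(0, 1) for w1 and N(2, 1) for w2.
liks = [norm(0.0, 1.0).pdf, norm(2.0, 1.0).pdf]
print(posteriors(1.2, liks, [0.7, 0.3]))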


Bayesian Decision Theory – cont.

• A posterior probability
  – The probability $P(\omega_j|x)$ of the state of nature being $\omega_j$ given that feature value x has been measured
• Likelihood
  – $p(x|\omega_j)$ is the likelihood of $\omega_j$ with respect to x
• Evidence
  – The evidence factor $p(x)$ can be viewed as a scaling factor that guarantees that the posterior probabilities sum to one


Bayesian Decision Theory – cont.

• Whenever we observe a particular x, the probability of error is

  $$P(error|x) = \begin{cases} P(\omega_1|x) & \text{if we decide } \omega_2 \\ P(\omega_2|x) & \text{if we decide } \omega_1 \end{cases}$$

• The average probability of error is given by

  $$P(error) = \int P(error, x)\,dx = \int P(error|x)\,p(x)\,dx$$


Bayesian Decision Theory – cont.

• Bayes decision rule
  Decide $\omega_1$ if $P(\omega_1|x) > P(\omega_2|x)$; otherwise decide $\omega_2$
• Probability of error
  $P(error|x) = \min[P(\omega_1|x), P(\omega_2|x)]$
• If we ignore the "evidence", the decision rule becomes:
  Decide $\omega_1$ if $p(x|\omega_1)\,P(\omega_1) > p(x|\omega_2)\,P(\omega_2)$; otherwise decide $\omega_2$
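The rule without the evidence term can be coded directly, since the normalizer cancels on both sides. The densities and priors below are the same illustrative assumptions as in the earlier sketch.

from scipy.stats import norm

# Decide w1 if p(x|w1) P(w1) > p(x|w2) P(w2); the evidence p(x) cancels out.
def decide(x, lik1, lik2, p1, p2):
    return "w1" if lik1(x) * p1 > lik2(x) * p2 else "w2"

lik1, lik2 = norm(0.0, 1.0).pdf, norm(2.0, 1.0).pdf
print(decide(0.5, lik1, lik2, 0.7, 0.3))  # w1 for this x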


Bayesian Decision Theory – Continuous Features

• Feature space
  – In general, an input can be represented by a vector, a point in a d-dimensional Euclidean space $R^d$
• Loss function
  – The loss function states exactly how costly each action is and is used to convert a probability determination into a decision
  – Written as $\lambda(\alpha_i|\omega_j)$


Loss Function

• $\lambda(\alpha_i|\omega_j)$ describes the loss incurred for taking action $\alpha_i$ when the state of nature is $\omega_j$


Conditional Risk

• Suppose we observe a particular x
• We take action $\alpha_i$
• If the true state of nature is $\omega_j$, by definition we will incur the loss $\lambda(\alpha_i|\omega_j)$
• We can minimize our expected loss by selecting the action that minimizes the conditional risk, $R(\alpha_i|x)$

  $$R(\alpha_i|x) = \sum_{j=1}^{c} \lambda(\alpha_i|\omega_j)\,P(\omega_j|x)$$


Bayesian Decision Theory

• Suppose that there are c categories $\{\omega_1, \omega_2, \ldots, \omega_c\}$
• Conditional risk

  $$R(\alpha_i|x) = \sum_{j=1}^{c} \lambda(\alpha_i|\omega_j)\,P(\omega_j|x)$$

• Risk is the average expected loss

  $$R = \int R(\alpha(x)|x)\,p(x)\,dx$$
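A sketch of the conditional-risk computation: with a loss matrix and a vector of posteriors, $R(\alpha_i|x)$ is a matrix-vector product and the Bayes action is its argmin. The loss values and posteriors are illustrative assumptions.

import numpy as np

loss = np.array([[0.0, 2.0],   # lambda(a1|w1), lambda(a1|w2)
                 [1.0, 0.0]])  # lambda(a2|w1), lambda(a2|w2)
post = np.array([0.6, 0.4])    # P(w1|x), P(w2|x) for some observed x

cond_risk = loss @ post             # R(a_i|x) for each action
best_action = np.argmin(cond_risk)  # Bayes rule: minimum conditional risk
print(cond_risk, best_action)       # [0.8 0.6] -> action index 1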


Bayesian Decision Theory

• Bayes decision rule
  – For a given x, select the action $\alpha_i$ for which the conditional risk is minimum
  – The resulting minimum overall risk is called the Bayes risk, denoted $R^*$, which is the best performance that can be achieved

  $$R^* = \min_i R(\alpha_i|x)$$


Two-Category Classification

• Let $\lambda_{ij} = \lambda(\alpha_i|\omega_j)$
• Conditional risk

  $$R(\alpha_1|x) = \lambda_{11} P(\omega_1|x) + \lambda_{12} P(\omega_2|x)$$
  $$R(\alpha_2|x) = \lambda_{21} P(\omega_1|x) + \lambda_{22} P(\omega_2|x)$$

• Fundamental decision rule
  Decide $\omega_1$ if $R(\alpha_1|x) < R(\alpha_2|x)$


Two-Category Classification – cont.

• The decision rule can be written in several ways
  – Decide $\omega_1$ if one of the following is true (these rules are equivalent):

  $$(\lambda_{21} - \lambda_{11})\,P(\omega_1|x) > (\lambda_{12} - \lambda_{22})\,P(\omega_2|x)$$

  $$(\lambda_{21} - \lambda_{11})\,p(x|\omega_1)\,P(\omega_1) > (\lambda_{12} - \lambda_{22})\,p(x|\omega_2)\,P(\omega_2)$$

  $$\frac{p(x|\omega_1)}{p(x|\omega_2)} > \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{P(\omega_2)}{P(\omega_1)} \quad \text{(likelihood ratio)}$$
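A sketch of the likelihood-ratio form: the losses and priors fold into a single fixed threshold. All parameter values and the Gaussian densities are illustrative assumptions.

from scipy.stats import norm

l11, l12, l21, l22 = 0.0, 2.0, 1.0, 0.0  # assumed losses lambda_ij
p1, p2 = 0.7, 0.3                        # assumed priors
lik1, lik2 = norm(0.0, 1.0).pdf, norm(2.0, 1.0).pdf

# Decide w1 when the likelihood ratio exceeds this fixed threshold.
threshold = ((l12 - l22) / (l21 - l11)) * (p2 / p1)

def decide(x):
    return "w1" if lik1(x) / lik2(x) > threshold else "w2"

print(decide(0.8), decide(1.8))  # w1 w2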


Minimum-Error-Rate Classification

• A special case of the Bayes decision rule with the following zero-one loss function

  $$\lambda(\alpha_i|\omega_j) = \begin{cases} 0 & \text{if } i = j \\ 1 & \text{if } i \neq j \end{cases}$$

  – Assigns no loss to a correct decision
  – Assigns unit loss to any error
  – All errors are equally costly


Minimum-Error-Rate Classification

• Conditional risk

  $$R(\alpha_i|x) = \sum_{j=1}^{c} \lambda(\alpha_i|\omega_j)\,P(\omega_j|x) = \sum_{j \neq i} P(\omega_j|x) = 1 - P(\omega_i|x)$$


Minimum-Error-Rate Classification

• We should select the $\omega_i$ that maximizes the posterior probability $P(\omega_i|x)$
• For minimum error rate:

  Decide $\omega_i$ if $P(\omega_i|x) > P(\omega_j|x)$ for all $j \neq i$



Classifiers, Discriminant Functions, and Decision Surfaces

• There are many ways to represent pattern classifiers
• One of the most useful is in terms of a set of discriminant functions $g_i(x)$, $i = 1, \ldots, c$
• The classifier assigns a feature vector x to class $\omega_i$ if

  $$g_i(x) > g_j(x) \quad \text{for all } j \neq i$$
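A generic sketch of this representation: classification reduces to taking the argmax over a list of discriminant functions. The two linear discriminants below are arbitrary assumptions for a scalar feature.

import numpy as np

def classify(x, discriminants):
    """Assign x to the class whose discriminant g_i(x) is largest."""
    return int(np.argmax([g(x) for g in discriminants]))

# Two assumed linear discriminants for a scalar feature x.
gs = [lambda x: 1.0 * x - 0.5,
      lambda x: -0.5 * x + 0.2]
print(classify(0.9, gs))  # class 0 here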


The Multicategory Classifier


Classifiers, Discriminant Functions, and Decision Surfaces

• There are many equivalent discriminant functions
  – i.e., the classification results will be the same even though the functions themselves differ
  – For example, if f is a monotonically increasing function, then $g_i(x)$ can be replaced by

  $$g_i(x) \leftarrow f(g_i(x))$$


Classifiers, Discriminant Functions, and Decision Surfaces

• Some discriminant functions are easier to understand or to compute


Decision Regions

• The effect of any decision rule is to divide the feature space into c decision regions, $R_1, \ldots, R_c$
  – The regions are separated by decision boundaries, where ties occur among the largest discriminant functions

  $$\text{If } g_i(x) > g_j(x) \text{ for all } j \neq i, \text{ then } x \in R_i$$


Decision Regions – cont.


Two-Category Case (Dichotomizer)

• The two-category case is a special case
  – Instead of two discriminant functions, a single one can be used:

  $$g(x) = g_1(x) - g_2(x)$$

  $$g(x) = P(\omega_1|x) - P(\omega_2|x)$$

  $$g(x) = \ln\frac{p(x|\omega_1)}{p(x|\omega_2)} + \ln\frac{P(\omega_1)}{P(\omega_2)}$$

  – Decide $\omega_1$ if $g(x) > 0$; otherwise decide $\omega_2$


The Normal Density

• Univariate Gaussian density

  $$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right]$$

• Mean

  $$\mu = E[x] = \int_{-\infty}^{\infty} x\,p(x)\,dx$$

• Variance

  $$\sigma^2 = E[(x-\mu)^2] = \int_{-\infty}^{\infty} (x-\mu)^2\,p(x)\,dx$$
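A sketch of the univariate density coded directly from the formula, with a Monte Carlo sanity check of the mean and variance; the parameter values and sample size are arbitrary choices.

import numpy as np

def gaussian_pdf(x, mu, sigma):
    # p(x) = (1 / (sqrt(2 pi) sigma)) exp(-0.5 ((x - mu) / sigma)^2)
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

samples = np.random.default_rng(0).normal(loc=1.0, scale=2.0, size=100_000)
print(gaussian_pdf(1.0, 1.0, 2.0))    # density at the mean, about 0.199
print(samples.mean(), samples.var())  # approx. mu = 1.0 and sigma^2 = 4.0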



The Normal Density

• Central Limit Theorem

– The aggregate effect of the sum of a large number of small, independent random disturbances will lead to a Gaussian distribution

– Gaussian is often a good model for the actual probability distribution


The Multivariate Normal Density

• Multivariate density (in d dimensions)

  $$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}\,|\boldsymbol{\Sigma}|^{1/2}} \exp\left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^t \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})\right]$$

• Abbreviation

  $$p(\mathbf{x}) \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$$
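A direct sketch of the d-dimensional formula; the values of $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ are illustrative assumptions.

import numpy as np

def mvn_pdf(x, mu, sigma):
    d = len(mu)
    diff = x - mu
    maha2 = diff @ np.linalg.inv(sigma) @ diff  # squared Mahalanobis distance
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * maha2) / norm_const

mu = np.array([0.0, 0.0])
sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
print(mvn_pdf(np.array([0.5, -0.2]), mu, sigma))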


The Multivariate Normal Density

• Mean

  $$\boldsymbol{\mu} = E[\mathbf{x}] = \int \mathbf{x}\,p(\mathbf{x})\,d\mathbf{x}$$

• Covariance matrix

  $$\boldsymbol{\Sigma} = E[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^t] = \int (\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^t\,p(\mathbf{x})\,d\mathbf{x}$$

• The ijth component of $\boldsymbol{\Sigma}$

  $$\sigma_{ij} = E[(x_i-\mu_i)(x_j-\mu_j)]$$


Statistical Independence

• If $x_i$ and $x_j$ are statistically independent, then $\sigma_{ij} = 0$
• The covariance matrix becomes a diagonal matrix whose off-diagonal elements are all zero


Whitening Transform

  $$\mathbf{A}_w = \boldsymbol{\Phi}\boldsymbol{\Lambda}^{-1/2}$$

  where $\boldsymbol{\Phi}$ is the matrix whose columns are the orthonormal eigenvectors of $\boldsymbol{\Sigma}$, and $\boldsymbol{\Lambda}$ is the diagonal matrix of the corresponding eigenvalues of $\boldsymbol{\Sigma}$
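A sketch of the transform via eigendecomposition: after whitening, the covariance of the transformed data is the identity. The covariance matrix below is an illustrative assumption.

import numpy as np

sigma = np.array([[2.0, 0.3], [0.3, 1.0]])  # assumed covariance matrix

eigvals, eigvecs = np.linalg.eigh(sigma)    # Lambda (values), Phi (columns)
A_w = eigvecs @ np.diag(eigvals ** -0.5)    # A_w = Phi Lambda^(-1/2)

# Check: A_w^t Sigma A_w should equal the identity matrix.
print(np.round(A_w.T @ sigma @ A_w, 6))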



Squared Mahalanobis Distance

• The squared Mahalanobis distance from $\mathbf{x}$ to $\boldsymbol{\mu}$ is

  $$r^2 = (\mathbf{x}-\boldsymbol{\mu})^t \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})$$

• Contours of constant density are hyperellipsoids of constant Mahalanobis distance
• Principal axes of the hyperellipsoids are given by the eigenvectors of $\boldsymbol{\Sigma}$; the lengths of the axes are determined by the eigenvalues of $\boldsymbol{\Sigma}$
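A sketch of the squared Mahalanobis distance, reusing the assumed $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ from above; after the whitening transform it reduces to an ordinary squared Euclidean distance.

import numpy as np

mu = np.array([0.0, 0.0])
sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
x = np.array([1.0, 1.0])

diff = x - mu
r2 = diff @ np.linalg.inv(sigma) @ diff  # r^2 = (x-mu)^t Sigma^-1 (x-mu)
print(r2)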


Discriminant Functions for the Normal Density

• Minimum-error-rate classification can be achieved using the discriminant functions

  $$g_i(\mathbf{x}) = \ln p(\mathbf{x}|\omega_i) + \ln P(\omega_i)$$

• If the densities are multivariate normal
  – i.e., if $p(\mathbf{x}|\omega_i) \sim N(\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$

  then we have:

  $$g_i(\mathbf{x}) = -\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^t \boldsymbol{\Sigma}_i^{-1}(\mathbf{x}-\boldsymbol{\mu}_i) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\boldsymbol{\Sigma}_i| + \ln P(\omega_i)$$


Discriminant Functions for the Normal Density

• Case 1: $\boldsymbol{\Sigma}_i = \sigma^2\mathbf{I}$
  – Features are statistically independent and each feature has the same variance, $\sigma^2$

  $$g_i(\mathbf{x}) = -\frac{\|\mathbf{x}-\boldsymbol{\mu}_i\|^2}{2\sigma^2} + \ln P(\omega_i)$$

  – where $\|\cdot\|$ denotes the Euclidean norm:

  $$\|\mathbf{x}-\boldsymbol{\mu}_i\|^2 = (\mathbf{x}-\boldsymbol{\mu}_i)^t(\mathbf{x}-\boldsymbol{\mu}_i)$$


Case 1: $\boldsymbol{\Sigma}_i = \sigma^2\mathbf{I}$


Linear Discriminant Function

• It is not necessary to compute distances
  – Expanding the quadratic form $(\mathbf{x}-\boldsymbol{\mu}_i)^t(\mathbf{x}-\boldsymbol{\mu}_i)$ yields

  $$g_i(\mathbf{x}) = -\frac{1}{2\sigma^2}\left[\mathbf{x}^t\mathbf{x} - 2\boldsymbol{\mu}_i^t\mathbf{x} + \boldsymbol{\mu}_i^t\boldsymbol{\mu}_i\right] + \ln P(\omega_i)$$

  – The term $\mathbf{x}^t\mathbf{x}$ is the same for all i, so it can be dropped
  – We have the following linear discriminant function:

  $$g_i(\mathbf{x}) = \mathbf{w}_i^t\mathbf{x} + w_{i0}$$


Linear Discriminant Function

where

  $$\mathbf{w}_i = \frac{1}{\sigma^2}\boldsymbol{\mu}_i$$

and

  $$w_{i0} = -\frac{1}{2\sigma^2}\boldsymbol{\mu}_i^t\boldsymbol{\mu}_i + \ln P(\omega_i)$$

$w_{i0}$ is called the threshold or bias for the ith category
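A sketch of the Case 1 linear machine built from these weights; the class means, shared variance, and priors are illustrative assumptions.

import numpy as np

mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]  # assumed class means
priors = [0.7, 0.3]                                 # assumed priors
sigma2 = 1.0                                        # shared variance

def g(x, mu, prior):
    w = mu / sigma2                                 # w_i = mu_i / sigma^2
    w0 = -(mu @ mu) / (2 * sigma2) + np.log(prior)  # threshold/bias w_i0
    return w @ x + w0

x = np.array([1.0, 1.2])
print(np.argmax([g(x, m, p) for m, p in zip(mus, priors)]) + 1)  # class 1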


Linear Machine

• A classifier that uses linear discriminant functions is called a linear machine
• Its decision surfaces are pieces of hyperplanes defined by the linear equations

  $$g_i(\mathbf{x}) = g_j(\mathbf{x})$$

  for the two categories with the highest posterior probabilities. For our case this equation can be written as

  $$\mathbf{w}^t(\mathbf{x}-\mathbf{x}_0) = 0$$


Linear Machine

where

  $$\mathbf{w} = \boldsymbol{\mu}_i - \boldsymbol{\mu}_j$$

and

  $$\mathbf{x}_0 = \frac{1}{2}(\boldsymbol{\mu}_i + \boldsymbol{\mu}_j) - \frac{\sigma^2}{\|\boldsymbol{\mu}_i - \boldsymbol{\mu}_j\|^2}\ln\frac{P(\omega_i)}{P(\omega_j)}\,(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)$$

If $P(\omega_i) = P(\omega_j)$, then the second term vanishes and the classifier is called a minimum-distance classifier

Priors change -> decision boundaries shift


Case 2: $\boldsymbol{\Sigma}_i = \boldsymbol{\Sigma}$

• Covariance matrices for all of the classes are identical but otherwise arbitrary
• The cluster for the ith class is centered about $\boldsymbol{\mu}_i$
• Discriminant function:

  $$g_i(\mathbf{x}) = -\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^t\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}_i) + \ln P(\omega_i)$$

  – The $\ln P(\omega_i)$ term can be ignored if the prior probabilities are the same for all classes


Case 2: Discriminant function

  $$g_i(\mathbf{x}) = \mathbf{w}_i^t\mathbf{x} + w_{i0}$$

where

  $$\mathbf{w}_i = \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_i$$

and

  $$w_{i0} = -\frac{1}{2}\boldsymbol{\mu}_i^t\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_i + \ln P(\omega_i)$$
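A sketch of the Case 2 linear discriminant with a shared covariance matrix; all parameter values are illustrative assumptions.

import numpy as np

sigma = np.array([[1.0, 0.2], [0.2, 2.0]])  # assumed shared covariance
sigma_inv = np.linalg.inv(sigma)
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
priors = [0.5, 0.5]

def g(x, mu, prior):
    w = sigma_inv @ mu                               # w_i = Sigma^-1 mu_i
    w0 = -0.5 * mu @ sigma_inv @ mu + np.log(prior)  # w_i0
    return w @ x + w0

x = np.array([1.5, 0.4])
print(np.argmax([g(x, m, p) for m, p in zip(mus, priors)]) + 1)  # class 2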


For the Two-Category Case

• If $R_i$ and $R_j$ are contiguous, the boundary between them has the equation

  $$\mathbf{w}^t(\mathbf{x}-\mathbf{x}_0) = 0$$

where

  $$\mathbf{w} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)$$

and

  $$\mathbf{x}_0 = \frac{1}{2}(\boldsymbol{\mu}_i + \boldsymbol{\mu}_j) - \frac{\ln[P(\omega_i)/P(\omega_j)]}{(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)^t\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)}\,(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)$$


Case 3: $\boldsymbol{\Sigma}_i$ = arbitrary

• In general, the covariance matrices are different for each category
• The only term that can be dropped from $g_i(\mathbf{x})$ is the $(d/2)\ln 2\pi$ term


Case 3: $\boldsymbol{\Sigma}_i$ = arbitrary

The discriminant functions are

  $$g_i(\mathbf{x}) = \mathbf{x}^t\mathbf{W}_i\mathbf{x} + \mathbf{w}_i^t\mathbf{x} + w_{i0}$$

where

  $$\mathbf{W}_i = -\frac{1}{2}\boldsymbol{\Sigma}_i^{-1}$$

  $$\mathbf{w}_i = \boldsymbol{\Sigma}_i^{-1}\boldsymbol{\mu}_i$$

and

  $$w_{i0} = -\frac{1}{2}\boldsymbol{\mu}_i^t\boldsymbol{\Sigma}_i^{-1}\boldsymbol{\mu}_i - \frac{1}{2}\ln|\boldsymbol{\Sigma}_i| + \ln P(\omega_i)$$
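A sketch of the Case 3 quadratic discriminant with per-class covariance matrices; all parameter values are illustrative assumptions.

import numpy as np

# (mean, covariance, prior) per class -- assumed values.
params = [
    (np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]]), 0.5),
    (np.array([2.0, 2.0]), np.array([[2.0, 0.5], [0.5, 1.0]]), 0.5),
]

def g(x, mu, sigma, prior):
    sigma_inv = np.linalg.inv(sigma)
    W = -0.5 * sigma_inv                      # W_i = -1/2 Sigma_i^-1
    w = sigma_inv @ mu                        # w_i = Sigma_i^-1 mu_i
    w0 = (-0.5 * mu @ sigma_inv @ mu
          - 0.5 * np.log(np.linalg.det(sigma))
          + np.log(prior))                    # w_i0
    return x @ W @ x + w @ x + w0

x = np.array([1.0, 1.5])
print(np.argmax([g(x, *p) for p in params]) + 1)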


Two-category case

• The decision surfaces are hyperquadrics (hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, …)


Example