
240-650 Principles of Pattern Recognition
Chapter 2: Bayesian Decision Theory

Montri Karnjanadecha
[email protected]
http://fivedots.coe.psu.ac.th/~montri



Statistical Approach to Pattern Recognition


A Simple Example

• Suppose that we are given two classes $\omega_1$ and $\omega_2$
  – $P(\omega_1) = 0.7$
  – $P(\omega_2) = 0.3$
  – No measurement is given
• Guessing
  – What shall we do to recognize a given input?
  – What is the best we can do statistically? Why? (See the sketch below.)
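A minimal sketch of this guessing rule in Python: with no measurement available, always deciding the class with the larger prior is the best fixed rule, and its error rate is the prior of the other class (0.3 here). The class labels are notational assumptions for illustration.

# With no measurement, always choose the class with the largest prior.
priors = {"w1": 0.7, "w2": 0.3}
decision = max(priors, key=priors.get)

# The error rate of this rule is the probability mass of the other class.
error_rate = 1.0 - priors[decision]
print(decision, error_rate)  # w1 0.3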


A More Complicated Example

• Suppose that we are given two classes
  – A single measurement x
  – $P(\omega_1|x)$ and $P(\omega_2|x)$ are given graphically


A Bayesian Example

• Suppose that we are given two classes
  – A single measurement x
  – We are given $p(x|\omega_1)$ and $p(x|\omega_2)$ this time


A Bayesian Example – cont.


Bayesian Decision Theory

• Bayes formula

  $$p(\omega_j, x) = P(\omega_j|x)\,p(x) = p(x|\omega_j)\,P(\omega_j)$$

  $$P(\omega_j|x) = \frac{p(x|\omega_j)\,P(\omega_j)}{p(x)}$$

• In case of two categories

  $$p(x) = \sum_{j=1}^{2} p(x|\omega_j)\,P(\omega_j)$$

• In English, it can be expressed as

  $$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}$$
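A small sketch of Bayes' formula in code: the posteriors are the likelihood-prior products normalized by the evidence. The Gaussian class-conditional densities and all numeric values are illustrative assumptions, not values from the slides.

import numpy as np
from scipy.stats import norm

def posteriors(x, likelihoods, priors):
    """P(w_j|x) = p(x|w_j) P(w_j) / p(x), with the evidence p(x) as normalizer."""
    joint = np.array([lik(x) * pr for lik, pr in zip(likelihoods, priors)])
    evidence = joint.sum()   # p(x) = sum_j p(x|w_j) P(w_j)
    return joint / evidence  # posteriors sum to one

# Assumed class-conditional densities: N(0, 1) for w1 and N(2, 1) for w2.
liks = [norm(0.0, 1.0).pdf, norm(2.0, 1.0).pdf]
print(posteriors(1.2, liks, [0.7, 0.3]))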


Bayesian Decision Theory – cont.

• A posterior probability
  – The probability $P(\omega_j|x)$ of the state of nature being $\omega_j$ given that feature value x has been measured
• Likelihood
  – $p(x|\omega_j)$ is the likelihood of $\omega_j$ with respect to x
• Evidence
  – The evidence factor $p(x)$ can be viewed as a scaling factor that guarantees that the posterior probabilities sum to one


Bayesian Decision Theory – cont.

• Whenever we observe a particular x, the probability of error is

  $$P(error|x) = \begin{cases} P(\omega_1|x) & \text{if we decide } \omega_2 \\ P(\omega_2|x) & \text{if we decide } \omega_1 \end{cases}$$

• The average probability of error is given by

  $$P(error) = \int P(error, x)\,dx = \int P(error|x)\,p(x)\,dx$$


Bayesian Decision Theory – cont.

• Bayes decision rule
  Decide $\omega_1$ if $P(\omega_1|x) > P(\omega_2|x)$; otherwise decide $\omega_2$
• Probability of error
  $P(error|x) = \min[P(\omega_1|x), P(\omega_2|x)]$
• If we ignore the "evidence", the decision rule becomes:
  Decide $\omega_1$ if $p(x|\omega_1)\,P(\omega_1) > p(x|\omega_2)\,P(\omega_2)$; otherwise decide $\omega_2$
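The rule without the evidence term can be coded directly, since the normalizer cancels on both sides. The densities and priors below are the same illustrative assumptions as in the earlier sketch.

from scipy.stats import norm

# Decide w1 if p(x|w1) P(w1) > p(x|w2) P(w2); the evidence p(x) cancels out.
def decide(x, lik1, lik2, p1, p2):
    return "w1" if lik1(x) * p1 > lik2(x) * p2 else "w2"

lik1, lik2 = norm(0.0, 1.0).pdf, norm(2.0, 1.0).pdf
print(decide(0.5, lik1, lik2, 0.7, 0.3))  # w1 for this x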


Bayesian Decision Theory – Continuous Features

• Feature space
  – In general, an input can be represented by a vector, a point in a d-dimensional Euclidean space $R^d$
• Loss function
  – The loss function states exactly how costly each action is and is used to convert a probability determination into a decision
  – Written as $\lambda(\alpha_i|\omega_j)$


Loss Function

• $\lambda(\alpha_i|\omega_j)$ describes the loss incurred for taking action $\alpha_i$ when the state of nature is $\omega_j$


Conditional Risk

• Suppose we observe a particular x
• We take action $\alpha_i$
• If the true state of nature is $\omega_j$, by definition we will incur the loss $\lambda(\alpha_i|\omega_j)$
• We can minimize our expected loss by selecting the action that minimizes the conditional risk, $R(\alpha_i|x)$

  $$R(\alpha_i|x) = \sum_{j=1}^{c} \lambda(\alpha_i|\omega_j)\,P(\omega_j|x)$$


Bayesian Decision Theory

• Suppose that there are c categories $\{\omega_1, \omega_2, \ldots, \omega_c\}$
• Conditional risk

  $$R(\alpha_i|x) = \sum_{j=1}^{c} \lambda(\alpha_i|\omega_j)\,P(\omega_j|x)$$

• Risk is the average expected loss

  $$R = \int R(\alpha(x)|x)\,p(x)\,dx$$
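A sketch of the conditional-risk computation: with a loss matrix and a vector of posteriors, $R(\alpha_i|x)$ is a matrix-vector product and the Bayes action is its argmin. The loss values and posteriors are illustrative assumptions.

import numpy as np

loss = np.array([[0.0, 2.0],   # lambda(a1|w1), lambda(a1|w2)
                 [1.0, 0.0]])  # lambda(a2|w1), lambda(a2|w2)
post = np.array([0.6, 0.4])    # P(w1|x), P(w2|x) for some observed x

cond_risk = loss @ post             # R(a_i|x) for each action
best_action = np.argmin(cond_risk)  # Bayes rule: minimum conditional risk
print(cond_risk, best_action)       # [0.8 0.6] -> action index 1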


Bayesian Decision Theory

• Bayes decision rule
  – For a given x, select the action $\alpha_i$ for which the conditional risk is minimum
  – The resulting minimum overall risk is called the Bayes risk, denoted $R^*$, which is the best performance that can be achieved

  $$R^* = \min_i R(\alpha_i|x)$$


Two-Category Classification

• Let $\lambda_{ij} = \lambda(\alpha_i|\omega_j)$
• Conditional risk

  $$R(\alpha_1|x) = \lambda_{11} P(\omega_1|x) + \lambda_{12} P(\omega_2|x)$$
  $$R(\alpha_2|x) = \lambda_{21} P(\omega_1|x) + \lambda_{22} P(\omega_2|x)$$

• Fundamental decision rule
  Decide $\omega_1$ if $R(\alpha_1|x) < R(\alpha_2|x)$


Two-Category Classification – cont.

• The decision rule can be written in several ways
  – Decide $\omega_1$ if one of the following is true (these rules are equivalent):

  $$(\lambda_{21} - \lambda_{11})\,P(\omega_1|x) > (\lambda_{12} - \lambda_{22})\,P(\omega_2|x)$$

  $$(\lambda_{21} - \lambda_{11})\,p(x|\omega_1)\,P(\omega_1) > (\lambda_{12} - \lambda_{22})\,p(x|\omega_2)\,P(\omega_2)$$

  $$\frac{p(x|\omega_1)}{p(x|\omega_2)} > \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{P(\omega_2)}{P(\omega_1)} \quad \text{(likelihood ratio)}$$
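A sketch of the likelihood-ratio form: the losses and priors fold into a single fixed threshold. All parameter values and the Gaussian densities are illustrative assumptions.

from scipy.stats import norm

l11, l12, l21, l22 = 0.0, 2.0, 1.0, 0.0  # assumed losses lambda_ij
p1, p2 = 0.7, 0.3                        # assumed priors
lik1, lik2 = norm(0.0, 1.0).pdf, norm(2.0, 1.0).pdf

# Decide w1 when the likelihood ratio exceeds this fixed threshold.
threshold = ((l12 - l22) / (l21 - l11)) * (p2 / p1)

def decide(x):
    return "w1" if lik1(x) / lik2(x) > threshold else "w2"

print(decide(0.8), decide(1.8))  # w1 w2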


Minimum-Error-Rate Classification

• A special case of the Bayes decision rule with the following zero-one loss function

  $$\lambda(\alpha_i|\omega_j) = \begin{cases} 0 & \text{if } i = j \\ 1 & \text{if } i \neq j \end{cases}$$

  – Assigns no loss to a correct decision
  – Assigns unit loss to any error
  – All errors are equally costly


Minimum-Error-Rate Classification

• Conditional risk

  $$R(\alpha_i|x) = \sum_{j=1}^{c} \lambda(\alpha_i|\omega_j)\,P(\omega_j|x) = \sum_{j \neq i} P(\omega_j|x) = 1 - P(\omega_i|x)$$


Minimum-Error-Rate Classification

• We should select the $\omega_i$ that maximizes the posterior probability $P(\omega_i|x)$
• For minimum error rate:

  Decide $\omega_i$ if $P(\omega_i|x) > P(\omega_j|x)$ for all $j \neq i$



Classifiers, Discriminant Functions, and Decision Surfaces

• There are many ways to represent pattern classifiers
• One of the most useful is in terms of a set of discriminant functions $g_i(x)$, $i = 1, \ldots, c$
• The classifier assigns a feature vector x to class $\omega_i$ if

  $$g_i(x) > g_j(x) \quad \text{for all } j \neq i$$
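A generic sketch of this representation: classification reduces to taking the argmax over a list of discriminant functions. The two linear discriminants below are arbitrary assumptions for a scalar feature.

import numpy as np

def classify(x, discriminants):
    """Assign x to the class whose discriminant g_i(x) is largest."""
    return int(np.argmax([g(x) for g in discriminants]))

# Two assumed linear discriminants for a scalar feature x.
gs = [lambda x: 1.0 * x - 0.5,
      lambda x: -0.5 * x + 0.2]
print(classify(0.9, gs))  # class 0 here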


The Multicategory Classifier


Classifiers, Discriminant Functions, and Decision Surfaces

• There are many equivalent discriminant functions
  – i.e., the classification results will be the same even though the functions themselves differ
  – For example, if f is a monotonically increasing function, then $g_i(x)$ can be replaced by

  $$g_i(x) \leftarrow f(g_i(x))$$


Classifiers, Discriminant Functions, and Decision Surfaces

• Some discriminant functions are easier to understand or to compute


Decision Regions

• The effect of any decision rule is to divide the feature space into c decision regions, $R_1, \ldots, R_c$
  – The regions are separated by decision boundaries, where ties occur among the largest discriminant functions

  $$\text{If } g_i(x) > g_j(x) \text{ for all } j \neq i, \text{ then } x \in R_i$$


Decision Regions – cont.


Two-Category Case (Dichotomizer)

• The two-category case is a special case
  – Instead of two discriminant functions, a single one can be used:

  $$g(x) = g_1(x) - g_2(x)$$

  $$g(x) = P(\omega_1|x) - P(\omega_2|x)$$

  $$g(x) = \ln\frac{p(x|\omega_1)}{p(x|\omega_2)} + \ln\frac{P(\omega_1)}{P(\omega_2)}$$

  – Decide $\omega_1$ if $g(x) > 0$; otherwise decide $\omega_2$


The Normal Density

• Univariate Gaussian density

  $$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right]$$

• Mean

  $$\mu = E[x] = \int_{-\infty}^{\infty} x\,p(x)\,dx$$

• Variance

  $$\sigma^2 = E[(x-\mu)^2] = \int_{-\infty}^{\infty} (x-\mu)^2\,p(x)\,dx$$
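A sketch of the univariate density coded directly from the formula, with a Monte Carlo sanity check of the mean and variance; the parameter values and sample size are arbitrary choices.

import numpy as np

def gaussian_pdf(x, mu, sigma):
    # p(x) = (1 / (sqrt(2 pi) sigma)) exp(-0.5 ((x - mu) / sigma)^2)
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

samples = np.random.default_rng(0).normal(loc=1.0, scale=2.0, size=100_000)
print(gaussian_pdf(1.0, 1.0, 2.0))    # density at the mean, about 0.199
print(samples.mean(), samples.var())  # approx. mu = 1.0 and sigma^2 = 4.0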



The Normal Density

• Central Limit Theorem

– The aggregate effect of the sum of a large number of small, independent random disturbances will lead to a Gaussian distribution

– Gaussian is often a good model for the actual probability distribution


The Multivariate Normal Density

• Multivariate density (in d dimensions)

  $$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}\,|\boldsymbol{\Sigma}|^{1/2}} \exp\left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^t \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})\right]$$

• Abbreviation

  $$p(\mathbf{x}) \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$$
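A direct sketch of the d-dimensional formula; the values of $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ are illustrative assumptions.

import numpy as np

def mvn_pdf(x, mu, sigma):
    d = len(mu)
    diff = x - mu
    maha2 = diff @ np.linalg.inv(sigma) @ diff  # squared Mahalanobis distance
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * maha2) / norm_const

mu = np.array([0.0, 0.0])
sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
print(mvn_pdf(np.array([0.5, -0.2]), mu, sigma))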


The Multivariate Normal Density

• Mean

  $$\boldsymbol{\mu} = E[\mathbf{x}] = \int \mathbf{x}\,p(\mathbf{x})\,d\mathbf{x}$$

• Covariance matrix

  $$\boldsymbol{\Sigma} = E[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^t] = \int (\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^t\,p(\mathbf{x})\,d\mathbf{x}$$

• The ijth component of $\boldsymbol{\Sigma}$

  $$\sigma_{ij} = E[(x_i-\mu_i)(x_j-\mu_j)]$$


Statistical Independence

• If $x_i$ and $x_j$ are statistically independent, then $\sigma_{ij} = 0$
• The covariance matrix becomes a diagonal matrix whose off-diagonal elements are all zero


Whitening Transform

  $$\mathbf{A}_w = \boldsymbol{\Phi}\boldsymbol{\Lambda}^{-1/2}$$

  where $\boldsymbol{\Phi}$ is the matrix whose columns are the orthonormal eigenvectors of $\boldsymbol{\Sigma}$, and $\boldsymbol{\Lambda}$ is the diagonal matrix of the corresponding eigenvalues of $\boldsymbol{\Sigma}$
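A sketch of the transform via eigendecomposition: after whitening, the covariance of the transformed data is the identity. The covariance matrix below is an illustrative assumption.

import numpy as np

sigma = np.array([[2.0, 0.3], [0.3, 1.0]])  # assumed covariance matrix

eigvals, eigvecs = np.linalg.eigh(sigma)    # Lambda (values), Phi (columns)
A_w = eigvecs @ np.diag(eigvals ** -0.5)    # A_w = Phi Lambda^(-1/2)

# Check: A_w^t Sigma A_w should equal the identity matrix.
print(np.round(A_w.T @ sigma @ A_w, 6))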



Squared Mahalanobis Distance

• The squared Mahalanobis distance from $\mathbf{x}$ to $\boldsymbol{\mu}$ is

  $$r^2 = (\mathbf{x}-\boldsymbol{\mu})^t \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})$$

• Contours of constant density are hyperellipsoids of constant Mahalanobis distance
• Principal axes of the hyperellipsoids are given by the eigenvectors of $\boldsymbol{\Sigma}$; the lengths of the axes are determined by the eigenvalues of $\boldsymbol{\Sigma}$
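A sketch of the squared Mahalanobis distance, reusing the assumed $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ from above; after the whitening transform it reduces to an ordinary squared Euclidean distance.

import numpy as np

mu = np.array([0.0, 0.0])
sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
x = np.array([1.0, 1.0])

diff = x - mu
r2 = diff @ np.linalg.inv(sigma) @ diff  # r^2 = (x-mu)^t Sigma^-1 (x-mu)
print(r2)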


Discriminant Functions for the Normal Density

• Minimum-error-rate classification can be achieved using the discriminant functions

  $$g_i(\mathbf{x}) = \ln p(\mathbf{x}|\omega_i) + \ln P(\omega_i)$$

• If the densities are multivariate normal
  – i.e., if $p(\mathbf{x}|\omega_i) \sim N(\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$

  then we have:

  $$g_i(\mathbf{x}) = -\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^t \boldsymbol{\Sigma}_i^{-1}(\mathbf{x}-\boldsymbol{\mu}_i) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\boldsymbol{\Sigma}_i| + \ln P(\omega_i)$$


Discriminant Functions for the Normal Density

• Case 1: $\boldsymbol{\Sigma}_i = \sigma^2\mathbf{I}$
  – Features are statistically independent and each feature has the same variance, $\sigma^2$

  $$g_i(\mathbf{x}) = -\frac{\|\mathbf{x}-\boldsymbol{\mu}_i\|^2}{2\sigma^2} + \ln P(\omega_i)$$

  – where $\|\cdot\|$ denotes the Euclidean norm:

  $$\|\mathbf{x}-\boldsymbol{\mu}_i\|^2 = (\mathbf{x}-\boldsymbol{\mu}_i)^t(\mathbf{x}-\boldsymbol{\mu}_i)$$


Case 1: $\boldsymbol{\Sigma}_i = \sigma^2\mathbf{I}$


Linear Discriminant Function

• It is not necessary to compute distances
  – Expanding the quadratic form $(\mathbf{x}-\boldsymbol{\mu}_i)^t(\mathbf{x}-\boldsymbol{\mu}_i)$ yields

  $$g_i(\mathbf{x}) = -\frac{1}{2\sigma^2}\left[\mathbf{x}^t\mathbf{x} - 2\boldsymbol{\mu}_i^t\mathbf{x} + \boldsymbol{\mu}_i^t\boldsymbol{\mu}_i\right] + \ln P(\omega_i)$$

  – The term $\mathbf{x}^t\mathbf{x}$ is the same for all i, so it can be dropped
  – We have the following linear discriminant function:

  $$g_i(\mathbf{x}) = \mathbf{w}_i^t\mathbf{x} + w_{i0}$$


Linear Discriminant Function

where

  $$\mathbf{w}_i = \frac{1}{\sigma^2}\boldsymbol{\mu}_i$$

and

  $$w_{i0} = -\frac{1}{2\sigma^2}\boldsymbol{\mu}_i^t\boldsymbol{\mu}_i + \ln P(\omega_i)$$

$w_{i0}$ is called the threshold or bias for the ith category
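A sketch of the Case 1 linear machine built from these weights; the class means, shared variance, and priors are illustrative assumptions.

import numpy as np

mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]  # assumed class means
priors = [0.7, 0.3]                                 # assumed priors
sigma2 = 1.0                                        # shared variance

def g(x, mu, prior):
    w = mu / sigma2                                 # w_i = mu_i / sigma^2
    w0 = -(mu @ mu) / (2 * sigma2) + np.log(prior)  # threshold/bias w_i0
    return w @ x + w0

x = np.array([1.0, 1.2])
print(np.argmax([g(x, m, p) for m, p in zip(mus, priors)]) + 1)  # class 1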


Linear Machine

• A classifier that uses linear discriminant functions is called a linear machine
• Its decision surfaces are pieces of hyperplanes defined by the linear equations

  $$g_i(\mathbf{x}) = g_j(\mathbf{x})$$

  for the two categories with the highest posterior probabilities. For our case this equation can be written as

  $$\mathbf{w}^t(\mathbf{x}-\mathbf{x}_0) = 0$$


Linear Machine

where

  $$\mathbf{w} = \boldsymbol{\mu}_i - \boldsymbol{\mu}_j$$

and

  $$\mathbf{x}_0 = \frac{1}{2}(\boldsymbol{\mu}_i + \boldsymbol{\mu}_j) - \frac{\sigma^2}{\|\boldsymbol{\mu}_i - \boldsymbol{\mu}_j\|^2}\ln\frac{P(\omega_i)}{P(\omega_j)}\,(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)$$

If $P(\omega_i) = P(\omega_j)$, then the second term vanishes and the classifier is called a minimum-distance classifier

Priors change -> decision boundaries shift


Case 2: $\boldsymbol{\Sigma}_i = \boldsymbol{\Sigma}$

• Covariance matrices for all of the classes are identical but otherwise arbitrary
• The cluster for the ith class is centered about $\boldsymbol{\mu}_i$
• Discriminant function:

  $$g_i(\mathbf{x}) = -\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^t\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}_i) + \ln P(\omega_i)$$

  – The $\ln P(\omega_i)$ term can be ignored if the prior probabilities are the same for all classes


Case 2: Discriminant function

  $$g_i(\mathbf{x}) = \mathbf{w}_i^t\mathbf{x} + w_{i0}$$

where

  $$\mathbf{w}_i = \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_i$$

and

  $$w_{i0} = -\frac{1}{2}\boldsymbol{\mu}_i^t\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_i + \ln P(\omega_i)$$
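A sketch of the Case 2 linear discriminant with a shared covariance matrix; all parameter values are illustrative assumptions.

import numpy as np

sigma = np.array([[1.0, 0.2], [0.2, 2.0]])  # assumed shared covariance
sigma_inv = np.linalg.inv(sigma)
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
priors = [0.5, 0.5]

def g(x, mu, prior):
    w = sigma_inv @ mu                               # w_i = Sigma^-1 mu_i
    w0 = -0.5 * mu @ sigma_inv @ mu + np.log(prior)  # w_i0
    return w @ x + w0

x = np.array([1.5, 0.4])
print(np.argmax([g(x, m, p) for m, p in zip(mus, priors)]) + 1)  # class 2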


For the Two-Category Case

• If $R_i$ and $R_j$ are contiguous, the boundary between them has the equation

  $$\mathbf{w}^t(\mathbf{x}-\mathbf{x}_0) = 0$$

where

  $$\mathbf{w} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)$$

and

  $$\mathbf{x}_0 = \frac{1}{2}(\boldsymbol{\mu}_i + \boldsymbol{\mu}_j) - \frac{\ln[P(\omega_i)/P(\omega_j)]}{(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)^t\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)}\,(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)$$


Case 3: $\boldsymbol{\Sigma}_i$ = arbitrary

• In general, the covariance matrices are different for each category
• The only term that can be dropped from $g_i(\mathbf{x})$ is the $(d/2)\ln 2\pi$ term


Case 3: $\boldsymbol{\Sigma}_i$ = arbitrary

The discriminant functions are

  $$g_i(\mathbf{x}) = \mathbf{x}^t\mathbf{W}_i\mathbf{x} + \mathbf{w}_i^t\mathbf{x} + w_{i0}$$

where

  $$\mathbf{W}_i = -\frac{1}{2}\boldsymbol{\Sigma}_i^{-1}$$

  $$\mathbf{w}_i = \boldsymbol{\Sigma}_i^{-1}\boldsymbol{\mu}_i$$

and

  $$w_{i0} = -\frac{1}{2}\boldsymbol{\mu}_i^t\boldsymbol{\Sigma}_i^{-1}\boldsymbol{\mu}_i - \frac{1}{2}\ln|\boldsymbol{\Sigma}_i| + \ln P(\omega_i)$$
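A sketch of the Case 3 quadratic discriminant with per-class covariance matrices; all parameter values are illustrative assumptions.

import numpy as np

# (mean, covariance, prior) per class -- assumed values.
params = [
    (np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]]), 0.5),
    (np.array([2.0, 2.0]), np.array([[2.0, 0.5], [0.5, 1.0]]), 0.5),
]

def g(x, mu, sigma, prior):
    sigma_inv = np.linalg.inv(sigma)
    W = -0.5 * sigma_inv                      # W_i = -1/2 Sigma_i^-1
    w = sigma_inv @ mu                        # w_i = Sigma_i^-1 mu_i
    w0 = (-0.5 * mu @ sigma_inv @ mu
          - 0.5 * np.log(np.linalg.det(sigma))
          + np.log(prior))                    # w_i0
    return x @ W @ x + w @ x + w0

x = np.array([1.0, 1.5])
print(np.argmax([g(x, *p) for p in params]) + 1)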


Two-category case

• The decision surfaces are hyperquadrics (hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, …)


Example