Computational Intelligence: Methods and Applications
Lecture 12
Bayesian decisions: foundation of learning
Włodzisław Duch
Dept. of Informatics, UMK
Google: W Duch
Learning

• Learning from data requires a model of data.
• Traditionally, parametric models of different phenomena were developed in science and engineering; parametric models are easy to interpret, so if the phenomenon is simple enough and a theory exists, construct a model and use an algorithmic approach.
• Empirical non-parametric modeling is data-driven and goal-oriented. It dominates in biological systems. Learn from data!
• Given some examples = training data, create a model of the data that answers a specific question, estimating those characteristics of the data that may be useful to make future predictions.
• Learning = estimate parameters of the (non-parametric) model; paradoxically, non-parametric models have a lot of parameters.
• Many other approaches to learning exist, but no time to explore ...
Probability

To talk about prediction errors, probability concepts are needed.
Samples X are divided into K categories, called classes ω_1 ... ω_K.
More generally, ω_i is a state of nature that we would like to predict.
P_k = P(ω_k), the a priori (unconditional) probability of observing X from class ω_k:
$$P(\omega_k) = \frac{N_k}{N}; \qquad \sum_{k=1}^{K} P_k = 1$$
If nothing else is known, then one should predict that a new sample X belongs to the majority class:

$$X \rightarrow \omega_c; \qquad c = \arg\max_k P_k$$
Majority classifier: assigns all new X to the most frequent class.
Example: weather prediction system – the weather tomorrow will be the same as yesterday (high accuracy prediction!).
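A minimal sketch of this majority classifier (the weather labels and counts below are made up for illustration): estimate the priors P_k from class frequencies and always return the most frequent class.

```python
from collections import Counter

def majority_classifier(train_labels):
    """Estimate priors P_k = N_k / N and return a rule that always
    predicts the a priori most frequent class, plus the priors."""
    counts = Counter(train_labels)
    n = len(train_labels)
    priors = {cls: cnt / n for cls, cnt in counts.items()}
    majority = max(priors, key=priors.get)      # c = arg max_k P_k
    return (lambda x: majority), priors

# Toy data: 2/3 of the training days were 'rain', 1/3 'sun'
predict, priors = majority_classifier(['rain', 'rain', 'sun', 'rain', 'sun', 'rain'])
print(priors)               # {'rain': 0.67, 'sun': 0.33} (approximately)
print(predict('new day'))   # always 'rain', regardless of the input
```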
Conditional probability

Predictions should never be worse than those of the majority classifier! Usually the class-conditional probability is also known or may easily be measured; the condition here is that X is drawn from class ω_k:

$$P(X|\omega_k) = P_k(X)$$
Joint probability of observing X from ω_k:

$$P(\omega_k, X) = P(X|\omega_k)\, P(\omega_k)$$

Is the knowledge of conditional probability sufficient to make predictions?
No! We are interested in the posterior probability:

$$P(\omega_k|X) = P(\omega_k, X)\,/\,P(X)$$

Fig. 2.1, from Duda, Hart, Stork, Pattern Classification (Wiley).
Bayes rule

Posterior conditional probabilities are normalized:

$$\sum_{k=1}^{K} P(\omega_k|X) = 1$$

Since $P(\omega_i, X) = P(\omega_i|X)\, P(X) = P(X|\omega_i)\, P(\omega_i)$, Bayes rule for 2 classes is derived from this:

$$P(\omega_i|X) = \frac{P(\omega_i, X)}{P(X)} = \frac{P(X|\omega_i)\, P(\omega_i)}{P(X)}$$

P(X) is an unconditional probability of selecting sample X; usually it is just 1/n, where n = number of all samples.

For P_1 = 2/3 and P_2 = 1/3 the previous figure becomes:
Fig. 2.2, from Duda, Hart, Stork, Pattern Classification (Wiley).
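A small sketch of Bayes rule in code, assuming hypothetical 1D Gaussian class-conditional densities and the priors P_1 = 2/3, P_2 = 1/3 used in the figure:

```python
from math import exp, pi, sqrt

def gauss(mu, sigma):
    """1D Gaussian density, standing in for a class-conditional P(x|w_i)."""
    return lambda x: exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

def posteriors(x, likelihoods, priors):
    """Bayes rule: P(w_i|x) = P(x|w_i) P(w_i) / P(x),
    with P(x) = sum_j P(x|w_j) P(w_j) normalizing the result to 1."""
    joint = [lik(x) * p for lik, p in zip(likelihoods, priors)]
    px = sum(joint)                              # unconditional P(x)
    return [j / px for j in joint]

# Hypothetical class-conditional densities and the priors from the figure
post = posteriors(0.5, [gauss(0.0, 1.0), gauss(2.0, 1.0)], priors=[2/3, 1/3])
print(post, sum(post))   # two posteriors P(w_i|x), summing to 1
```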
Bayes decisions

Bayes decision: given a sample X, select class ω_1 if:

$$P(\omega_1|X) > P(\omega_2|X)$$

Probability of an error is:

$$P(\varepsilon|X) = \min\{P(\omega_1|X),\, P(\omega_2|X)\}$$

Average error is:

$$P(E) = \int P(\varepsilon|X)\, P(X)\, dX$$

The Bayes decision rule minimizes the average error by selecting the smallest P(ε|X).

Using Bayes rule and multiplying both sides by P(X), select ω_1 if:

$$P(X|\omega_1)\, P(\omega_1) > P(X|\omega_2)\, P(\omega_2)$$
Likelihood

Bayesian approach to learning: use data to model probabilities.

Bayes decision depends on the likelihood ratio: select ω_1 if

$$P(X|\omega_1)\, P(\omega_1) > P(X|\omega_2)\, P(\omega_2), \qquad \text{i.e.} \qquad \frac{P(X|\omega_1)}{P(X|\omega_2)} > \frac{P(\omega_2)}{P(\omega_1)}$$

For equal a priori probabilities, class-conditional probabilities decide.

On a finite data sample given for training the error is:

$$P(E) = \sum_{X} P(\varepsilon|X)\, P(X)$$

The assumption here is that P(X) is reflected in the frequency of samples for different X.
Fig. 2.3, from Duda, Hart, Stork, Pattern Classification (Wiley).
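A sketch of the likelihood-ratio decision with purely illustrative numbers; it selects ω_1 when P(X|ω_1)/P(X|ω_2) exceeds the prior ratio P(ω_2)/P(ω_1):

```python
def bayes_decision(lik1, lik2, p1, p2):
    """Select w_1 if the likelihood ratio P(x|w_1)/P(x|w_2)
    exceeds the prior ratio P(w_2)/P(w_1), otherwise select w_2."""
    return 1 if lik1 / lik2 > p2 / p1 else 2

# Equal priors: the class-conditional values alone decide -> class 1
print(bayes_decision(lik1=0.30, lik2=0.10, p1=0.5, p2=0.5))
# A strong prior for w_2 overturns the same likelihoods -> class 2
print(bayes_decision(lik1=0.30, lik2=0.10, p1=0.1, p2=0.9))
```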
2D decision regions

For Gaussian distributions of the class-conditional probabilities:

Decision boundaries in 2D are hyperbolic, and decision region R2 is disconnected. The ellipses show high constant values of P_k(X).
Fig. 2.6, from Duda, Hart, Stork, Pattern Classification (Wiley).
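A rough numerical illustration of such regions, assuming two hypothetical 2D Gaussian class-conditional densities (means and covariances made up): each point is assigned to the class with the larger P_k(X) P(ω_k), and probing a few points along the x-axis shows region R2 appearing on both sides of R1.

```python
import numpy as np

def gauss2d(mean, cov):
    """2D Gaussian class-conditional density P_k(X)."""
    mean, cov = np.asarray(mean, float), np.asarray(cov, float)
    inv, det = np.linalg.inv(cov), np.linalg.det(cov)
    return lambda x: (np.exp(-0.5 * (np.asarray(x) - mean) @ inv @ (np.asarray(x) - mean))
                      / (2 * np.pi * np.sqrt(det)))

# Hypothetical densities with very different covariances
p1 = gauss2d([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
p2 = gauss2d([2.0, 0.0], [[4.0, 0.0], [0.0, 0.25]])
P1, P2 = 0.5, 0.5                      # assumed equal priors

def decide(x):
    """Bayes decision: class with the larger P_k(X) P(w_k)."""
    return 1 if p1(x) * P1 > p2(x) * P2 else 2

# Probing points along the x-axis: class 2 shows up on both sides of class 1,
# so decision region R2 is disconnected.
for x in [(-3, 0), (0, 0), (2, 0), (6, 0)]:
    print(x, '->', decide(x))
```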
Example

Let ω_1 be the state of nature called "dengue",
and ω_2 the opposite, no dengue.
Let the prior probability for people in Singapore be P(ω_1) = 0.1%.
Let test T be 99% accurate, so that the probability of a positive outcome for people with dengue is P(T=+|ω_1) = 0.99, and of a negative outcome for healthy people is also P(T=−|ω_2) = 0.99.
What is the chance that you have dengue if your test is positive?
What is the probability P(ω_1|T=+)?
P(T=+) = P(ω_1, T=+) + P(ω_2, T=+) = 0.99·0.001 + 0.01·0.999 ≈ 0.011
P(ω_1|T=+) = P(T=+|ω_1) P(ω_1) / P(T=+) = 0.99·0.001 / 0.011 ≈ 0.09, or 9%
Use this calculator to check:
http://StatPages.org/bayes.html
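The same calculation, using the numbers from the slide:

```python
p_dengue = 0.001             # prior P(w1) = 0.1%
p_healthy = 1 - p_dengue     # prior P(w2) = 99.9%
p_pos_if_dengue = 0.99       # P(T=+|w1), test sensitivity
p_pos_if_healthy = 0.01      # 1 - P(T=-|w2), false-positive rate

# Unconditional probability of a positive test
p_pos = p_pos_if_dengue * p_dengue + p_pos_if_healthy * p_healthy
# Posterior from Bayes rule
p_dengue_if_pos = p_pos_if_dengue * p_dengue / p_pos

print(round(p_pos, 3))            # ~0.011
print(round(p_dengue_if_pos, 2))  # ~0.09, i.e. about 9%
```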
Statistical theory of decisions

Decisions carry risk, costs and losses.
Consider general decision procedure:
{ω_1 .. ω_K}, states of nature
{α_1 .. α_a}, actions that may be taken
λ(ω_i, α_j), the cost, loss or risk associated with action α_j in state ω_i
Example: classification decisions
Ĉ: X → {ω_1 .. ω_K, D, O},
Action α_i is assigning to vector X a class label ω_1 .. ω_K, or
D – no confidence in classification, reject/leave sample as unclassified
O – outlier, untypical case, perhaps a new class (used rarely).
Errors and losses

Unconditional probability of a wrong (non-optimal) action (decision), if the true state is ω_k and the prediction was wrong:

$$P_{\varepsilon k} = P\left(\hat{C}(X) \in \{C_1 .. C_K\} \setminus C_k \;\middle|\; X \in \omega_k\right)$$

No action (no decision), or rejection of sample X if the true class is ω_k, has probability:

$$P_{Dk} = P\left(\hat{C}(X) = C_D \;\middle|\; X \in \omega_k\right)$$

Assume the simplest 0/1 loss function: no cost if the optimal decision is taken, identical costs for all errors, and some cost for no action (rejection):

$$\lambda(\omega_k, \alpha_l) = \lambda_{kl} = \begin{cases} 0 & \text{if } l = k \\ 1 & \text{if } l \neq k,\ l = 1..K \\ \lambda_d & \text{if } l = D \end{cases}$$
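A small sketch of this 0/1 loss with a rejection option; the rejection label 'D' and the value of λ_d below are just placeholders:

```python
def zero_one_loss(true_k, action_l, lambda_d=0.5, reject='D'):
    """lambda(w_k, a_l): 0 for the correct class, 1 for any wrong class,
    lambda_d for the rejection (no decision) action D."""
    if action_l == reject:
        return lambda_d
    return 0.0 if action_l == true_k else 1.0

print(zero_one_loss(2, 2))     # 0.0  -> correct decision
print(zero_one_loss(2, 3))     # 1.0  -> misclassification
print(zero_one_loss(2, 'D'))   # 0.5  -> rejection cost lambda_d
```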
... and risks

Risk of the decision-making procedure Ĉ when the true state is ω_k, with l = K+1 denoting D:

$$R(\hat{C}, \omega_k) = E\left[\lambda(\hat{C}(X), \omega_k)\right] = \sum_{l=1}^{K+1} \lambda(\omega_k, \alpha_l)\, P_{kl} = \sum_{l \neq k}^{K} P_{kl} + \lambda_d P_{kD}$$

where P_kl are elements of the confusion matrix P:

$$P_{kl} = P\left(\hat{C}(X) = C_l \;\middle|\; \omega_k\right), \qquad \mathbf{P} = \begin{pmatrix} P_{11} & P_{12} & \cdots & P_{1,K+1} \\ P_{21} & P_{22} & \cdots & P_{2,K+1} \\ \vdots & \vdots & \ddots & \vdots \\ P_{K1} & P_{K2} & \cdots & P_{K,K+1} \end{pmatrix}$$
Note that rows of P correspond to the true classes ω_k, and columns to the predicted classes l = 1..K+1, i.e. the classifier's decisions.
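A sketch of the per-class risk R(Ĉ, ω_k) computed from such a confusion matrix under the 0/1 loss; the K = 2 matrix values and the rejection cost are made up, with rows = true classes and the last column holding rejections (D):

```python
import numpy as np

# Hypothetical K = 2 confusion matrix: rows = true classes,
# columns = predicted class 1, predicted class 2, rejection D
P = np.array([[0.90, 0.05, 0.05],
              [0.10, 0.85, 0.05]])
lambda_d = 0.5                 # assumed rejection cost

def class_risk(P, k, lambda_d):
    """R(C_hat, w_k) = sum_{l != k} P_kl + lambda_d * P_kD under the 0/1 loss."""
    K = P.shape[0]
    wrong = P[k, :K].sum() - P[k, k]     # probability of a wrong class label
    return wrong + lambda_d * P[k, K]    # last column holds the rejections

for k in range(P.shape[0]):
    print(f"R(C, w_{k+1}) = {class_risk(P, k, lambda_d):.3f}")
```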
... and more risks

Trace of the confusion matrix:

$$A = \mathrm{Tr}\,\mathbf{P} = \sum_{i=1}^{K} P_{ii}$$

is equal to the accuracy of the classifier, ignoring costs of mistakes.

Total risk of the decision procedure Ĉ:

$$R(\hat{C}) = \sum_{k=1}^{K} P_k\, R(\hat{C}, \omega_k) = \sum_{k=1}^{K} \sum_{l=1}^{K+1} P_k\, \lambda_{kl}\, P_{kl} = \sum_{k=1}^{K} P_k \left( \sum_{l \neq k}^{K} P_{kl} + \lambda_d P_{kD} \right)$$

Conditional risk of assigning sample X to class ω_k is:

$$R(\omega_k|X) = \sum_{j=1}^{K+1} \lambda_{kj}\, P(\omega_j|X)$$

For the special case of cost of mistakes λ = 1 and rejection cost λ_d ...
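Continuing the sketch above (same hypothetical P, class_risk and λ_d), a rough illustration of accuracy, total risk, and conditional risk; since the rows of P here are conditional on the true class, the diagonal is weighted by assumed priors P_k to get an overall accuracy:

```python
# Continues the previous sketch: reuses P, class_risk and lambda_d
priors = np.array([2/3, 1/3])   # assumed priors P_k

K = P.shape[0]
# Diagonal of P = per-class accuracies; weight by the priors for overall accuracy
accuracy = sum(priors[k] * P[k, k] for k in range(K))
# Total risk R(C) = sum_k P_k R(C, w_k)
total_risk = sum(priors[k] * class_risk(P, k, lambda_d) for k in range(K))
print(f"accuracy   = {accuracy:.3f}")
print(f"total risk = {total_risk:.3f}")

def conditional_risk(post, k):
    """R(w_k|X) = sum_{j != k} P(w_j|X) = 1 - P(w_k|X) for the 0/1 loss;
    rejecting X instead costs lambda_d, so reject when min_k R(w_k|X) > lambda_d."""
    return 1.0 - post[k]

print(conditional_risk([0.7, 0.3], k=0))   # 0.3 = risk of assigning X to w_1
```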