VU Pattern Recognition II, University of Salzburg (helmut/Teaching/PatternRecognition/prII.pdf)
TRANSCRIPT
VU Pattern Recognition II
Helmut A. Mayer
Department of Computer Sciences, University of Salzburg
WS 13/14
Outline
1 Introduction
2 Statistical Classifiers: Bayesian Decision Theory
3 Nonparametric Techniques: Density Estimation, k-Nearest-Neighbor Estimation
4 Linear Discriminant Functions: Decision Surfaces
5 Neural Networks
6 Nonmetric Methods: Classification and Regression Trees
7 Stochastic Methods: Simulated Annealing
8 Projects
Introduction
Human vs. Machine
Human Perception
Senses to neural patterns
Machine Perception
Sensors to value patterns
Patterns are everywhere...
Images, Time Series, Medical Diagnosis, Customer Analysis (only a few examples)
Features build Model
Fish Example
Salmon or Sea Bass
FIGURE 1.1. The objects to be classified are first sensed by a transducer (camera), whose signals are preprocessed. Next the features are extracted and finally the classification is emitted, here either "salmon" or "sea bass." Although the information flow is often chosen to be from the source to the classifier, some systems employ information flow in which earlier levels of processing can be altered based on the tentative or preliminary response in later levels (gray arrows). Yet others combine two or more stages into a unified step, such as simultaneous segmentation and feature extraction. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
Fish Length Histogram
[Figure: histograms of the length feature (x-axis: length, y-axis: count) for salmon and sea bass, with the optimal threshold l* marked.]
FIGURE 1.2. Histograms for the length feature for the two categories. No single threshold value of the length will serve to unambiguously discriminate between the two categories; using length alone, we will have some errors. The value marked l* will lead to the smallest number of errors, on average. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
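The search for l* can be sketched directly: scan candidate thresholds and count the misclassifications on each side. The Gaussian length samples below are synthetic assumptions for illustration, not the book's measurements.

```python
import numpy as np

# Hypothetical length samples for the two fish classes (illustrative only).
rng = np.random.default_rng(0)
salmon = rng.normal(10.0, 2.0, 200)    # salmon tend to be shorter
sea_bass = rng.normal(16.0, 3.0, 200)  # sea bass tend to be longer

def best_threshold(a, b):
    """Classify x < t as class a, x >= t as class b; return the
    threshold with the fewest training errors (the l* of Fig. 1.2)."""
    candidates = np.sort(np.concatenate([a, b]))
    best_t, best_err = None, np.inf
    for t in candidates:
        errors = np.sum(a >= t) + np.sum(b < t)
        if errors < best_err:
            best_t, best_err = t, errors
    return best_t, best_err

l_star, errs = best_threshold(salmon, sea_bass)
print(f"l* = {l_star:.2f}, training errors = {errs} of {len(salmon) + len(sea_bass)}")
```

Because the histograms overlap, even the best threshold leaves some errors, as the caption states.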
Fish Lightness Histogram
[Figure: histograms of the lightness feature (x-axis: lightness, y-axis: count) for salmon and sea bass, with the decision boundary x* marked.]
FIGURE 1.3. Histograms for the lightness feature for the two categories. No single threshold value x* (decision boundary) will serve to unambiguously discriminate between the two categories; using lightness alone, we will have some errors. The value x* marked will lead to the smallest number of errors, on average. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
Decision Theory
Cost of an Error?
Salmon tastes better... ;-)
Minimization of cost (risk)
Decision Rule/Boundary
Improving Recognition
Feature Vector ~x = (lightness, width)^T
2D Decision Boundary
2D Feature Space
[Figure: scatter plot of the two features (x-axis: lightness, y-axis: width) for salmon and sea bass, with a linear decision boundary drawn between the classes.]
FIGURE 1.4. The two features of lightness and width for sea bass and salmon. The dark line could serve as a decision boundary of our classifier. Overall classification error on the data shown is lower than if we use only one feature as in Fig. 1.3, but there will still be some errors. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
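One simple linear boundary in this two-feature space is the perpendicular bisector of the segment joining the two class means. The (lightness, width) samples below are illustrative assumptions, not the figure's data.

```python
import numpy as np

# Hypothetical (lightness, width) samples for the two classes.
rng = np.random.default_rng(1)
salmon = rng.normal([4.0, 16.0], 1.0, size=(100, 2))
sea_bass = rng.normal([7.0, 19.0], 1.0, size=(100, 2))

# Linear boundary w.x + b = 0: the perpendicular bisector of the class means.
m1, m2 = salmon.mean(axis=0), sea_bass.mean(axis=0)
w = m2 - m1
b = -w @ ((m1 + m2) / 2)

def classify(x):
    # Positive side of the boundary points toward the sea-bass mean.
    return "sea bass" if w @ x + b > 0 else "salmon"

acc = (sum(classify(x) == "salmon" for x in salmon)
       + sum(classify(x) == "sea bass" for x in sea_bass)) / 200
print(f"training accuracy: {acc:.2f}")
```

With two features the classes separate much better than with lightness alone, though some errors remain where the clouds overlap.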
Overfitting
[Figure: the same lightness-width scatter plot with an overly complex, winding decision boundary that separates the training samples perfectly; a novel test point is marked "?".]
FIGURE 1.5. Overly complex models for the fish will lead to decision boundaries that are complicated. While such a decision may lead to perfect classification of our training samples, it would lead to poor performance on future patterns. The novel test point marked ? is evidently most likely a salmon, whereas the complex decision boundary shown leads it to be classified as a sea bass. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
Generalization
[Figure: the lightness-width scatter plot with a smooth, slightly curved decision boundary that trades training accuracy against simplicity.]
FIGURE 1.6. The decision boundary shown might represent the optimal tradeoff between performance on the training set and simplicity of classifier, thereby giving the highest accuracy on new patterns. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
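The tradeoff in Figs. 1.5 and 1.6 can be illustrated numerically by comparing a model that memorizes the training set (1-nearest-neighbor, a very complex boundary) with a much simpler nearest-class-mean rule (a linear boundary) on held-out data. All data below is a synthetic assumption.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample(n):
    # Two overlapping hypothetical classes in (lightness, width) space.
    X = np.vstack([rng.normal([4, 16], 1.5, (n, 2)),
                   rng.normal([6, 18], 1.5, (n, 2))])
    y = np.array([0] * n + [1] * n)
    return X, y

Xtr, ytr = sample(100)  # training set
Xte, yte = sample(100)  # held-out test set

# Complex model: 1-nearest neighbor (fits the training data perfectly).
def knn1(X):
    d = ((X[:, None, :] - Xtr[None, :, :]) ** 2).sum(-1)
    return ytr[d.argmin(1)]

# Simple model: nearest class mean (a linear decision boundary).
means = np.array([Xtr[ytr == k].mean(0) for k in (0, 1)])
def nearest_mean(X):
    return ((X[:, None, :] - means[None]) ** 2).sum(-1).argmin(1)

for name, f in [("1-NN", knn1), ("nearest mean", nearest_mean)]:
    print(name, "train:", (f(Xtr) == ytr).mean(), "test:", (f(Xte) == yte).mean())
```

The memorizing model scores perfectly on the training set but typically loses that advantage on new patterns, which is the point of Fig. 1.6.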
Related Fields
Statistical Hypothesis Testing
Image Processing
Regression (age ↔ weight)
Interpolation
Density Estimation
Pattern Recognition Systems
[Diagram: pipeline from input through sensing, segmentation, feature extraction, classification, and post-processing to a decision, with feedback for adjustments for missing features, adjustments for context, and costs.]
FIGURE 1.7. Many pattern recognition systems can be partitioned into components such as the ones shown here. A sensor converts images or sounds or other physical inputs into signal data. The segmentor isolates sensed objects from the background or from other objects. A feature extractor measures object properties that are useful for classification. The classifier uses these features to assign the sensed object to a category. Finally, a post processor can take account of other considerations, such as the effects of context and the costs of errors, to decide on the appropriate action. Although this description stresses a one-way or "bottom-up" flow of data, some systems employ feedback from higher levels back down to lower levels (gray arrows). From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
Feature Extraction
Features ↔ Classification
Invariant Features (translation, rotation, scale)
Deformation (e.g. Cropping)
Feature Selection (Filter, Wrapper)
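Translation and scale invariance can be sketched by normalizing a point set with respect to its centroid and spread; rotation invariance would need more machinery (e.g., moment magnitudes). The toy shape below is a made-up example.

```python
import numpy as np

# A toy "shape" represented as a set of 2D points (hypothetical data).
shape = np.array([[0., 0.], [2., 0.], [2., 1.], [0., 1.]])

def invariant_features(pts):
    # Translation invariance: subtract the centroid.
    centered = pts - pts.mean(axis=0)
    # Scale invariance: divide by the mean distance to the centroid.
    return centered / np.linalg.norm(centered, axis=1).mean()

f1 = invariant_features(shape)
f2 = invariant_features(3.0 * shape + np.array([5.0, -2.0]))  # translated and scaled copy
print(np.allclose(f1, f2))  # → True: features unchanged under the transformation
```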
Post Processing
Error Rate, Risk (weighted error)
Context (IC* *IN)
Multiple Classifiers (subspaces, fusion)
Design Cycle
[Diagram: design cycle from start through collect data, choose features, choose model, train classifier, and evaluate classifier to end; prior knowledge (e.g., invariances) informs the feature choice, and evaluation can loop back to earlier steps.]
FIGURE 1.8. The design of a pattern recognition system involves a design cycle similar to the one shown here. Data must be collected, both to train and to test the system. The characteristics of the data impact both the choice of appropriate discriminating features and the choice of models for the different categories. The training process uses some or all of the data to determine the system parameters. The results of evaluation may call for repetition of various steps in this process in order to obtain satisfactory results. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
Learning and Adaptation
Learning is Parameter Tuning
Supervised Learning (teacher)
Reinforcement Learning (critic)
Unsupervised Learning (clustering)
Statistical Classifiers
Bayesian Decision Theory
Probabilities
State of Nature ω = ω1 (class)
A Priori Probability P(ω1) (prior)
Decision Rule P(ω1) > P(ω2) → ω1
Class–Conditional Probability Density Function p(x |ω)
Class–Conditional Probability Density
[Figure: two hypothetical density curves p(x|ω1) and p(x|ω2) plotted over the feature x, roughly x = 9 to 15.]
FIGURE 2.1. Hypothetical class-conditional probability density functions show the probability density of measuring a particular feature value x given the pattern is in category ωi. If x represents the lightness of a fish, the two curves might describe the difference in lightness of populations of two types of fish. Density functions are normalized, and thus the area under each curve is 1.0. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
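The normalization property in the caption can be checked numerically for a pair of hypothetical Gaussian class-conditional densities; the means and spreads below are assumptions, not values read off the book's figure.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    # Univariate normal density N(mu, sigma^2).
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Two hypothetical class-conditional densities p(x|ω1) and p(x|ω2).
x = np.linspace(5, 19, 1401)   # grid wide enough to capture both tails
dx = x[1] - x[0]
p1 = gaussian_pdf(x, 11.0, 1.0)
p2 = gaussian_pdf(x, 13.0, 1.0)

# Densities are normalized: each area is (numerically) 1.0.
area1, area2 = p1.sum() * dx, p2.sum() * dx
print(round(area1, 3), round(area2, 3))
```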
Bayes Decision Rule
Joint Probability Density p(ωj, x) = P(ωj|x) p(x) = p(x|ωj) P(ωj)
Bayes Formula P(ωj|x) = p(x|ωj) P(ωj) / p(x)
Decision Rule P(ω1|x) > P(ω2|x) → ω1
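A minimal numeric sketch of the formula, using the priors P(ω1) = 2/3 and P(ω2) = 1/3 and hypothetical Gaussian class-conditional densities (the means and spreads are assumptions):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

P1, P2 = 2 / 3, 1 / 3  # priors P(ω1), P(ω2)

def posteriors(x):
    px_w1 = gaussian_pdf(x, 11.0, 1.0)  # p(x|ω1), hypothetical
    px_w2 = gaussian_pdf(x, 13.0, 1.0)  # p(x|ω2), hypothetical
    evidence = px_w1 * P1 + px_w2 * P2  # p(x), the normalizer
    return px_w1 * P1 / evidence, px_w2 * P2 / evidence

post1, post2 = posteriors(12.0)
print(post1 + post2)  # the posteriors sum to 1 at every x
print("decide ω1" if post1 > post2 else "decide ω2")
```

Note that the evidence p(x) only rescales: the decision depends on comparing p(x|ωj) P(ωj) across classes.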
Posterior Probabilities
[Figure: posterior curves P(ω1|x) and P(ω2|x) plotted over x, roughly 9 to 15, each between 0 and 1.]
FIGURE 2.2. Posterior probabilities for the particular priors P(ω1) = 2/3 and P(ω2) = 1/3 for the class-conditional probability densities shown in Fig. 2.1. Thus in this case, given that a pattern is measured to have feature value x = 14, the probability it is in category ω2 is roughly 0.08, and that it is in ω1 is 0.92. At every x, the posteriors sum to 1.0. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
Error Probabilities
Error P(error|x) = P(ω1|x) if we decide ω2, P(ω2|x) if we decide ω1
Average Error Probability P(error) = ∫ p(error, x) dx = ∫ P(error|x) p(x) dx  (integrated over −∞ < x < ∞)
Bayes Rule minimizes P(error)
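Deciding by the larger posterior means the error contribution at each x is the smaller of the two joint densities p(x|ωj) P(ωj); integrating that minimum gives the Bayes error. A numerical sketch with hypothetical Gaussian class-conditionals and equal priors (all parameters are assumptions):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

P1 = P2 = 0.5                      # equal priors, for simplicity
x = np.linspace(2, 22, 4001)       # grid covering both densities
dx = x[1] - x[0]
p1 = gaussian_pdf(x, 11.0, 1.0)    # p(x|ω1)
p2 = gaussian_pdf(x, 13.0, 1.0)    # p(x|ω2)

# Under the Bayes rule, P(error|x) p(x) = min_j p(x|ωj) P(ωj).
error_density = np.minimum(p1 * P1, p2 * P2)
bayes_error = error_density.sum() * dx
print(f"Bayes error ~ {bayes_error:.4f}")
```

No other decision rule can push P(error) below this value for the given densities and priors.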
Generalized Bayes Rule
Feature vector $\vec{x} \in \mathbb{R}^d$

Classes $\omega_1, \ldots, \omega_c$

Bayes formula: $P(\omega_j \mid \vec{x}) = \frac{p(\vec{x} \mid \omega_j)\, P(\omega_j)}{p(\vec{x})}$
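The formula can be evaluated directly once the class-conditional densities and priors are fixed; the evidence p(x) is just the normalizer. A minimal sketch with two hypothetical Gaussian likelihoods:

```python
import math

def gauss(x, mu, sigma):
    """Univariate normal density used here as a class-conditional p(x|w_j)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)

def posteriors(x, likelihoods, priors):
    """Bayes formula: P(w_j|x) = p(x|w_j) P(w_j) / p(x)."""
    joint = [lik(x) * P for lik, P in zip(likelihoods, priors)]
    evidence = sum(joint)  # p(x) = sum_j p(x|w_j) P(w_j)
    return [j / evidence for j in joint]

# Hypothetical two-class problem with priors 2/3 and 1/3
liks = [lambda x: gauss(x, 11.0, 1.0), lambda x: gauss(x, 14.0, 1.0)]
post = posteriors(14.0, liks, [2 / 3, 1 / 3])
print(post)  # at every x the posteriors sum to 1
```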
VU Pattern Recognition II
Statistical Classifiers
Bayesian Decision Theory
Dichotomizer
FIGURE 2.6. In this two-dimensional two-category classifier, the probability densities are Gaussian, the decision boundary consists of two hyperbolas, and thus the decision region R2 is not simply connected. The ellipses mark where the density is 1/e times that at the peak of the distribution. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Statistical Classifiers
Bayesian Decision Theory
The Normal Density
Randomized prototype vectors with mean $\vec{\mu}$ → normal distribution

Expected value:
$E[f(x)] = \int_{-\infty}^{\infty} f(x)\, p(x)\, dx$ (continuous)
$E[f(x)] = \sum_{x \in D} f(x)\, P(x)$ (discrete)

Univariate normal density:
$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$

$E[x] = \mu$, $E[(x-\mu)^2] = \sigma^2$
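The two moment identities are easy to check by simulation; a sketch using Python's `random.gauss` (the sample size and parameters are arbitrary choices):

```python
import random

random.seed(0)  # fixed seed so the run is reproducible
mu, sigma = 2.0, 1.5
samples = [random.gauss(mu, sigma) for _ in range(200_000)]

# Sample estimates of E[x] and E[(x - mu)^2]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(round(mean, 2), round(var, 2))  # close to mu = 2.0 and sigma^2 = 2.25
```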
VU Pattern Recognition II
Statistical Classifiers
Bayesian Decision Theory
Normal Distribution
FIGURE 2.7. A univariate normal distribution has roughly 95% of its area in the range |x − µ| ≤ 2σ, as shown. The peak of the distribution has value p(µ) = 1/(√(2π) σ). From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Statistical Classifiers
Bayesian Decision Theory
Multivariate Density
Multivariate normal density:
$p(\vec{x}) = \frac{1}{(2\pi)^{d/2}\, |\Sigma|^{1/2}}\, e^{-\frac{1}{2}(\vec{x}-\vec{\mu})^t \Sigma^{-1} (\vec{x}-\vec{\mu})}$

Covariance matrix $\Sigma$ ($d \times d$):
$E[\vec{x}] = \vec{\mu}$, $E[(\vec{x}-\vec{\mu})(\vec{x}-\vec{\mu})^t] = \Sigma$
$E[x_i] = \mu_i$, $E[(x_i-\mu_i)(x_j-\mu_j)] = \sigma_{ij}$
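For d = 2 the determinant and inverse of Σ have closed forms, so the density can be written out without a linear-algebra library. A sketch (the covariance values below are arbitrary):

```python
import math

def mvn_pdf_2d(x, mu, Sigma):
    """Bivariate normal density with an explicit 2x2 determinant and inverse."""
    (a, b), (c, d) = Sigma
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - mu[0], x[1] - mu[1]]
    # Squared Mahalanobis distance (x - mu)^t Sigma^{-1} (x - mu)
    m = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
         + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return math.exp(-0.5 * m) / (2 * math.pi * math.sqrt(det))

Sigma = [[2.0, 0.5], [0.5, 1.0]]
peak = mvn_pdf_2d([0.0, 0.0], [0.0, 0.0], Sigma)  # peak value 1/(2 pi |Sigma|^{1/2})
print(round(peak, 4))
```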
VU Pattern Recognition II
Statistical Classifiers
Bayesian Decision Theory
2D Gaussian
FIGURE 2.9. Samples drawn from a two-dimensional Gaussian lie in a cloud centered on the mean µ. The ellipses show lines of equal probability density of the Gaussian. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Statistical Classifiers
Bayesian Decision Theory
Four Categories
FIGURE 2.16. The decision regions for four normal distributions. Even with such a low number of categories, the shapes of the boundary regions can be rather complex. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Statistical Classifiers
Bayesian Decision Theory
Classification Errors
Bayes Error: overlapping densities; an inherent property of the problem

Model Error: an incorrect model

Estimation Error: only a finite sample of training data
VU Pattern Recognition II
Statistical Classifiers
Bayesian Decision Theory
Bayes Error and Dimensionality
FIGURE 3.3. Two three-dimensional distributions have nonoverlapping densities, and thus in three dimensions the Bayes error vanishes. When projected to a subspace (here, the two-dimensional x1-x2 subspace or a one-dimensional x1 subspace), there can be greater overlap of the projected distributions, and hence greater Bayes error. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Nonparametric Techniques
Density Estimation
Unknown Densities
Real problems: multi-modal; parametric densities: uni-modal → estimate the densities directly from the data

Probability $P$ that a pattern $\vec{x}$ falls in region $R$: $P = \int_R p(\vec{x})\, d\vec{x}$

Given $n$ patterns, the probability that exactly $k$ of them fall in $R$:
$P_k = \binom{n}{k} P^k (1-P)^{n-k}$, with $E[k] = nP$

Assuming a small region $R$ → $p(\vec{x}) \simeq \text{const}$ → $\int_R p(\vec{x})\, d\vec{x} \simeq p(\vec{x})\, V$ → $p(\vec{x}) \simeq \frac{k}{nV}$
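The relation p(x) ≈ k/(nV) can be tried out directly: draw samples from a known density, count how many land in a small region around a point, and compare. A sketch for a standard normal at x = 0 (the sample size and window width are arbitrary choices):

```python
import random

random.seed(1)
n = 100_000
samples = [random.gauss(0.0, 1.0) for _ in range(n)]

h = 0.1                     # half-width of the region R = [-h, h]
V = 2 * h                   # its length, the "volume" in one dimension
k = sum(1 for s in samples if -h <= s <= h)
estimate = k / (n * V)      # p(0) is approximately k/(nV)
print(round(estimate, 3))   # the true density at 0 is 1/sqrt(2 pi), about 0.399
```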
VU Pattern Recognition II
Nonparametric Techniques
Density Estimation
Relative Probability
FIGURE 4.1. The relative probability that an estimate given by Eq. 4 will yield a particular value for the probability density, where the true probability was chosen to be 0.7. Each curve is labeled by the total number of patterns n sampled (n = 20, 50, 100) and is scaled to give the same maximum (at the true probability). The form of each curve is binomial, as given by Eq. 2. For large n, such binomials peak strongly at the true probability. In the limit n → ∞, the curve approaches a delta function, and we are guaranteed that our estimate will give the true probability. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Nonparametric Techniques
Density Estimation
Sample Size
The estimate $p(\vec{x}) \simeq \frac{k}{nV}$ depends on the size of $V$

If $V \to 0$, $p(\vec{x})$ would be exact, but no samples would remain in $V$

Assume an infinite pattern set with decreasing volumes $V_n$:
$n$-th estimate $p_n(\vec{x}) = \frac{k_n}{n V_n}$

For convergence $p_n(\vec{x}) \to p(\vec{x})$:
$\lim_{n\to\infty} V_n = 0$, $\lim_{n\to\infty} k_n = \infty$, $\lim_{n\to\infty} \frac{k_n}{n} = 0$

Decreasing $V_n$, e.g., $V_n = \frac{1}{\sqrt{n}}$ → Parzen windows
Increasing $k_n$, e.g., $k_n = \sqrt{n}$ → $k_n$-nearest neighbors
VU Pattern Recognition II
Nonparametric Techniques
Density Estimation
Point Density Estimation
FIGURE 4.2. There are two leading methods for estimating the density at a point, here at the center of each square (shown for n = 1, 4, 9, 16, 100). The one shown in the top row is to start with a large volume centered on the test point and shrink it according to a function such as Vn = 1/√n. The other method, shown in the bottom row, is to decrease the volume in a data-dependent way, for instance letting the volume enclose some number kn = √n of sample points. The sequences in both cases represent random variables that generally converge and allow the true density at the test point to be calculated. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Nonparametric Techniques
k–Nearest–Neighbor Estimation

Prototype Estimation

Estimate the density at an arbitrary $\vec{x}$ from its $k_n$ nearest neighbors (the neighbors are training patterns):
$p_n(\vec{x}) = \frac{k_n}{n V_n}$

Dense neighbors → small $V_n$ → good resolution
Sparse neighbors → large $V_n$ → bad resolution

Problem: often $\int p_n(\vec{x})\, d\vec{x} > 1$
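A one-dimensional sketch of this estimator: grow an interval around x until it contains the k nearest samples, take V as its length, and return k/(nV). The sample size and the choice k ≈ √n below are arbitrary:

```python
import random

def knn_density_1d(x, samples, k):
    """k-NN density estimate: V is the interval reaching the k-th nearest sample."""
    dists = sorted(abs(s - x) for s in samples)
    r = dists[k - 1]        # distance to the k-th nearest neighbor
    V = 2 * r               # 1D "volume": the interval [x - r, x + r]
    return k / (len(samples) * V)

random.seed(2)
samples = [random.gauss(0.0, 1.0) for _ in range(5000)]
est = knn_density_1d(0.0, samples, k=70)  # k roughly sqrt(n), as on the slide
print(round(est, 3))  # near the true value 1/sqrt(2 pi), about 0.399
```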
VU Pattern Recognition II
Nonparametric Techniques
k–Nearest–Neighbor Estimation

1D kNN Estimate

FIGURE 4.10. Eight points in one dimension and the k-nearest-neighbor density estimates, for k = 3 and 5. Note especially that the discontinuities in the slopes of the estimates generally lie away from the positions of the prototype points. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Nonparametric Techniques
k–Nearest–Neighbor Estimation

2D kNN Estimate

FIGURE 4.11. The k-nearest-neighbor estimate of a two-dimensional density for k = 5. Notice how such a finite n estimate can be quite "jagged," and notice that discontinuities in the slopes generally occur along lines away from the positions of the points themselves. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Nonparametric Techniques
k–Nearest–Neighbor Estimation

Unimodal and Bimodal 1D kNN Estimates

FIGURE 4.12. Several k-nearest-neighbor estimates of two unidimensional densities, a Gaussian and a bimodal distribution, for n = 1, 16, 256, ∞ with kn = 1, 4, 16, ∞. Notice how the finite n estimates can be quite "spiky." From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Nonparametric Techniques
k–Nearest–Neighbor Estimation

Estimation of A Posteriori Probabilities

Given samples of different classes, what is $P(\omega_i \mid \vec{x})$?

Estimate of the joint density (in an arbitrary volume $V$ containing $k$ samples, $k_i$ of them in class $\omega_i$): $p_n(\vec{x}, \omega_i) = \frac{k_i}{nV}$

Estimate of the posterior: $P_n(\omega_i \mid \vec{x}) = \frac{p_n(\vec{x}, \omega_i)}{\sum_{j=1}^{c} p_n(\vec{x}, \omega_j)} = \frac{k_i}{k}$

With $n \to \infty$ and the Bayes rule: optimal performance (Parzen and kNN)
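In practice the estimate P(ωi|x) = ki/k is just the fraction of class-i labels among the k nearest prototypes. A minimal sketch with hypothetical one-dimensional prototypes:

```python
from collections import Counter

def knn_posteriors(x, prototypes, k):
    """Estimate P(w_i|x) = k_i/k from the labels of the k nearest prototypes."""
    nearest = sorted(prototypes, key=lambda p: abs(p[0] - x))[:k]
    counts = Counter(label for _, label in nearest)
    return {label: count / k for label, count in counts.items()}

# Hypothetical prototypes as (feature value, class label) pairs
protos = [(0.1, "w1"), (0.3, "w1"), (0.5, "w1"),
          (0.9, "w2"), (1.1, "w2"), (2.0, "w2")]
post = knn_posteriors(0.55, protos, k=3)
print(post)  # two of the three nearest prototypes are in class w1
```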
VU Pattern Recognition II
Nonparametric Techniques
k–Nearest–Neighbor Estimation

Nearest Neighbor Rule

The single nearest neighbor is $\vec{x}\,'$ ($k = 1$)
The class label of $\vec{x}\,'$ is $\theta'$ (a random variable)
$P(\theta' = \omega_i) = P(\omega_i \mid \vec{x}\,') \simeq P(\omega_i \mid \vec{x})$ (for large $n$)

Assumption of 1NN: $P(\omega_i \mid \vec{x}\,')$ is the largest posterior
If true (e.g., $P \simeq 1$ or $P \simeq \frac{1}{c}$), then 1NN is close to the Bayes error

Average error probability $P(e) = \int P(e \mid \vec{x})\, p(\vec{x})\, d\vec{x}$
$P(e \mid \vec{x}) = 1 - P(\omega_i \mid \vec{x}\,')$; its minimum is $P^*(e \mid \vec{x})$
$P^*(e) = \int P^*(e \mid \vec{x})\, p(\vec{x})\, d\vec{x}$

1NN error $P = \lim_{n\to\infty} P_n(e)$:
$P^* \le P \le P^*\left(2 - \frac{c}{c-1}\, P^*\right)$
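The upper bound is a simple function of the Bayes error P* and the number of classes c; evaluating it shows the often-quoted rule of thumb that the asymptotic 1NN error is at most about twice the Bayes rate:

```python
def one_nn_upper_bound(p_star, c):
    """Asymptotic 1NN bound: P* <= P <= P* (2 - c/(c-1) * P*)."""
    return p_star * (2 - (c / (c - 1)) * p_star)

# For small P* the bound is close to 2 P*
for p_star in (0.01, 0.1, 0.25):
    print(p_star, round(one_nn_upper_bound(p_star, c=2), 4))
```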
VU Pattern Recognition II
Nonparametric Techniques
k–Nearest–Neighbor Estimation

Voronoi Tessellation

FIGURE 4.13. In two dimensions, the nearest-neighbor algorithm leads to a partitioning of the input space into Voronoi cells, each labeled by the category of the training point it contains. In three dimensions, the cells are three-dimensional, and the decision boundary resembles the surface of a crystal. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Nonparametric Techniques
k–Nearest–Neighbor Estimation

1NN Error Rate Bounds

FIGURE 4.14. Bounds on the nearest-neighbor error rate P in a c-category problem given infinite training data, where P* is the Bayes error (Eq. 52). The bounds run from P = P* up to P = P*(2 − (c/(c−1))P*); at low error rates, the nearest-neighbor error rate is bounded above by twice the Bayes rate. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Nonparametric Techniques
k–Nearest–Neighbor Estimation

k–Nearest–Neighbor Rule

Straightforward extension: k neighbors

Majority voting: decide $\omega_m$ if most of the $k$ nearest prototypes belong to class $m$, i.e., the estimated $P(\omega_m \mid \vec{x})$ is largest

If $k \to \infty$ (with $k/n \to 0$), the k–NN rule becomes optimal
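The majority vote itself is only a few lines; a sketch of the k-NN decision rule with hypothetical two-dimensional prototypes and a Euclidean metric:

```python
import math
from collections import Counter

def knn_classify(x, prototypes, k):
    """Label x by majority vote among its k nearest prototypes (Euclidean metric)."""
    nearest = sorted(prototypes, key=lambda p: math.dist(p[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical prototypes as (feature vector, class label) pairs
protos = [((0, 0), "w1"), ((1, 0), "w1"), ((0, 1), "w1"),
          ((5, 5), "w2"), ((6, 5), "w2"), ((5, 6), "w2")]
print(knn_classify((0.5, 0.5), protos, k=3))  # all three nearest votes are w1
```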
VU Pattern Recognition II
Nonparametric Techniques
k–Nearest–Neighbor Estimation

5NN in 2D

FIGURE 4.15. The k-nearest-neighbor query starts at the test point x and grows a spherical region until it encloses k training samples, and it labels the test point by a majority vote of these samples. In this k = 5 case, the test point x would be labeled the category of the black points. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Nonparametric Techniques
k–Nearest–Neigbor Estimation
kNN Error Rate Bounds
FIGURE 4.16. The error rate for the k-nearest-neighbor rule for a two-category problem is bounded by Ck(P*) in Eq. 54. Each curve is labeled by k; when k = ∞, the estimated probabilities match the true probabilities and thus the error rate is equal to the Bayes rate, that is, P = P*. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. © 2001 by John Wiley & Sons, Inc.
Metrics
What is a distance?
Properties of Metrics
Nonnegativity: D(~a,~b) ≥ 0
Reflexivity: D(~a,~b) = 0 iff ~a = ~b
Symmetry: D(~a,~b) = D(~b,~a)
Triangle inequality: D(~a,~b) + D(~b,~c) ≥ D(~a,~c)
Scaling of feature values equivalent to changing the metric
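The scaling remark can be made concrete: rescaling one axis changes Euclidean distances and can flip which prototype is nearest. A small sketch with made-up points:

```python
import numpy as np

# Two prototypes and a test point, chosen so that rescaling the
# x1 axis flips which prototype is nearest under the Euclidean metric.
a = np.array([2.0, 0.0])   # prototype A
b = np.array([0.0, 1.5])   # prototype B
x = np.array([0.0, 0.0])   # test point

def nearest(x, protos):
    d = [np.linalg.norm(x - p) for p in protos]
    return int(np.argmin(d))

print(nearest(x, [a, b]))              # B is closer (1.5 < 2.0) → 1
s = np.array([1 / 3, 1.0])             # rescale the x1 axis by 1/3
print(nearest(x * s, [a * s, b * s]))  # now A is closer (2/3 < 1.5) → 0
```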
Scaling is Change of Metric
FIGURE 4.18. Scaling the coordinates of a feature space can change the distance relationships computed by the Euclidean metric, and hence the behavior of a nearest-neighbor classifier. Consider the test point x and its nearest neighbor. In the original space (left), the black prototype is closest. In the figure at the right, the x1 axis has been rescaled by a factor 1/3; now the nearest prototype is the red one. If there is a large disparity in the ranges of the full data in each dimension, a common procedure is to rescale all the data to equalize such ranges, and this is equivalent to changing the metric in the original space. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. © 2001 by John Wiley & Sons, Inc.
Class of Metrics
Minkowski Metric (Lk Norm): Lk(~a,~b) = (∑_{i=1}^{d} |ai − bi|^k)^{1/k}
L1 Norm: Manhattan distance
L2 Norm: Euclidean distance
L∞ Norm: Maximum of projected distances
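A minimal sketch of the Lk norm (the function `minkowski` and the test vectors are hypothetical):

```python
import numpy as np

def minkowski(a, b, k):
    """L_k distance: (sum_i |a_i - b_i|^k)^(1/k)."""
    return np.sum(np.abs(a - b) ** k) ** (1.0 / k)

a, b = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(minkowski(a, b, 1))      # L1, Manhattan distance: 7.0
print(minkowski(a, b, 2))      # L2, Euclidean distance: 5.0
print(np.max(np.abs(a - b)))   # L∞, maximum projected distance: 4.0
```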
Minkowski Metric
FIGURE 4.19. Each colored surface consists of points a distance 1.0 from the origin, measured using different values for k in the Minkowski metric (k is printed in red). The white surfaces correspond to the L1 norm (Manhattan distance), the light gray sphere corresponds to the L2 norm (Euclidean distance), the dark gray ones correspond to the L4 norm, and the pink box corresponds to the L∞ norm. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. © 2001 by John Wiley & Sons, Inc.
Linear Discriminant Functions
Decision Surfaces
Discriminant Functions
Assumption: we know the form of the discriminant functions (not the probability densities)
Problem: determine the parameters of the discriminant functions
Method: gradient descent on criterion functions (based on the training set)
Linear Classifier
FIGURE 5.1. A simple linear classifier having d input units, each corresponding to the values of the components of an input vector. Each input feature value xi is multiplied by its corresponding weight wi; the effective input at the output unit is the sum of all these products, ∑ wi xi. Each unit shows its effective input–output function: each of the d input units is linear, emitting exactly the value of its corresponding feature; the single bias unit always emits the constant value 1.0; the single output unit emits +1 if w^t x + w0 > 0 and −1 otherwise. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. © 2001 by John Wiley & Sons, Inc.
Linear Discriminant Functions
Linear discriminant function g(~x) = ~w^t~x + w0 (weight vector ~w, bias w0)
Two classes: g(~x) > 0 → ω1, else ω2; equivalently, ~w^t~x > −w0
Decision surface is a hyperplane: for ~x1, ~x2 on the boundary, ~w^t~x1 + w0 = ~w^t~x2 + w0 → ~w^t(~x1 − ~x2) = 0 (~w is a normal vector of the hyperplane)
Hyperplane H divides the space into two half–spaces: R1 is the positive side (g(~x) > 0), R2 the negative side (g(~x) < 0)
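A tiny numeric sketch of the two-class rule, with made-up weights; the last line uses the standard fact that g(~x)/||~w|| is the signed distance of ~x from the hyperplane:

```python
import numpy as np

def g(x, w, w0):
    """Linear discriminant g(x) = w^t x + w0 (hypothetical parameters)."""
    return w @ x + w0

w, w0 = np.array([1.0, -1.0]), 0.5
x = np.array([2.0, 1.0])
print("omega_1" if g(x, w, w0) > 0 else "omega_2")  # g = 1.5 > 0 → omega_1
print(g(x, w, w0) / np.linalg.norm(w))  # signed distance of x from H
```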
Multiple Classes
Variant: c dichotomizers (ωi vs. not ωi)
Variant: c(c−1)/2 dichotomizers (all class pairs)
Variant: linear machine with discriminant functions gi(~x), i = 1, . . . , c
Decision boundary Hij: gi(~x) = gj(~x) → (~wi − ~wj)^t~x + (wi0 − wj0) = 0
(~wi − ~wj) ⊥ Hij, distance r = (gi(~x) − gj(~x)) / ||~wi − ~wj||
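The linear-machine variant can be sketched as an argmax over the c discriminants (all weights below are made up):

```python
import numpy as np

# Linear machine: one discriminant g_i(x) = w_i^t x + w_i0 per class;
# x is assigned to the class with the largest g_i.
W = np.array([[ 1.0,  0.0],
              [-1.0,  0.0],
              [ 0.0,  1.0]])      # rows are w_1, w_2, w_3
w0 = np.array([0.0, 0.0, -0.5])   # biases w_10, w_20, w_30

def linear_machine(x):
    return int(np.argmax(W @ x + w0))  # index of the winning class

print(linear_machine(np.array([2.0, 0.0])))  # g = (2, -2, -0.5) → class 0
print(linear_machine(np.array([0.0, 3.0])))  # g = (0, 0, 2.5)   → class 2
```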
Dichotomizers in a Four–class Problem
FIGURE 5.3. Linear decision boundaries for a four-class problem. The top figure shows ωi / not ωi dichotomies, while the bottom figure shows ωi / ωj dichotomies and the corresponding decision boundaries Hij. The pink regions have ambiguous category assignments. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. © 2001 by John Wiley & Sons, Inc.
Linear Machines in Multi–class Problems
FIGURE 5.4. Decision boundaries produced by a linear machine for a three-class problem and a five-class problem. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. © 2001 by John Wiley & Sons, Inc.
Generalized Linear Discriminant Functions
More complex decision boundaries, e.g., the quadratic discriminant g(~x) = w0 + ∑_{i=1}^{d} wi xi + ∑_{i=1}^{d} ∑_{j=1}^{d} wij xi xj
Generalized LDF: g(~x) = ∑_{i=1}^{d̂} ai yi(~x) = ~a^t~y; the d̂ functions yi(~x) map points from the d–dimensional ~x–space to the d̂–dimensional ~y–space
Example: g(x) = a1 + a2 x + a3 x², ~y = (1, x, x²)^t
Decision boundary is linear in ~y–space; the transformed density in ~y–space is degenerate (concentrated on a d–dimensional surface); if d̂ is large, there is a huge number of parameters (requires a large training data set)
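The 1D example can be run directly; the weights `a` below are hypothetical and chosen so that g(x) = x² − 1, giving a decision region R1 that is not simply connected in x:

```python
import numpy as np

def to_y(x):
    """The example mapping y = (1, x, x^2)^t: g(x) = a1 + a2 x + a3 x^2
    becomes the linear function a^t y in y-space."""
    return np.array([1.0, x, x * x])

a = np.array([-1.0, 0.0, 1.0])   # hypothetical weights: g(x) = x^2 - 1
for x in (-2.0, 0.0, 2.0):
    print(x, a @ to_y(x) > 0)    # region R1 (g > 0) is |x| > 1
```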
From 1D to 3D
FIGURE 5.5. The mapping ~y = (1, x, x²)^t takes a line and transforms it to a parabola in three dimensions. A plane splits the resulting ~y–space into regions corresponding to two categories, and this in turn gives a nonsimply connected decision region in the one-dimensional x–space. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. © 2001 by John Wiley & Sons, Inc.
From 2D to 3D
FIGURE 5.6. The two-dimensional input space ~x is mapped through a polynomial function to ~y; here the mapping is y1 = x1, y2 = x2, and y3 ∝ x1 x2. A linear discriminant in this transformed space is a hyperplane, which cuts the surface. Points on the positive side of the hyperplane H correspond to category ω1, and those beneath it to category ω2. In terms of the ~x–space, R1 is not simply connected. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. © 2001 by John Wiley & Sons, Inc.
Linearly Separable Dichotomy
Two classes, samples ~yi: ~a^t~yi > 0 → ω1, ~a^t~yi < 0 → ω2
”Normalization” of ω2: replace ~yi by −~yi, so that ~a^t~yi > 0 ∀~yi
Solution region contains all possible solution vectors ~a: the intersection of n half–spaces (with boundaries ~a^t~yi = 0)
Margin b > 0: require ~a^t~yi ≥ b; the new solution region lies at distance b/||~yi|| from the old boundaries
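A sketch of the normalization trick with made-up samples: after negating the ω2 samples, a solution vector must make all inner products positive:

```python
import numpy as np

Y1 = np.array([[1.0, 2.0], [2.0, 1.0]])      # class omega_1 samples
Y2 = np.array([[-1.0, -2.0], [-2.0, -1.0]])  # class omega_2 samples
Y = np.vstack([Y1, -Y2])                     # "normalize": flip omega_2 signs

a = np.array([1.0, 1.0])   # a candidate solution vector
print(np.all(Y @ a > 0))   # → True: a lies in the solution region
```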
Solution Region and Normalization
FIGURE 5.8. Four training samples (black for ω1, red for ω2) and the solution region in feature space. The figure on the left shows the raw data; the solution vector leads to a plane that separates the patterns from the two categories. In the figure on the right, the red points have been “normalized”, that is, changed in sign. Now the solution vector leads to a plane that places all “normalized” points on the same side. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. © 2001 by John Wiley & Sons, Inc.
Solution Region with Margins
FIGURE 5.9. The effect of the margin on the solution region. At the left is the case of no margin (b = 0), equivalent to the case shown at the left in Fig. 5.8. At the right is the case b > 0, shrinking the solution region by margins b/||~yi||. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. © 2001 by John Wiley & Sons, Inc.
Gradient Descent Solutions
Set of linear inequalities ~a^t~yi > 0: define a criterion function J(~a) that is minimized by a solution vector ~a*
Minimize the scalar function J(~a) by gradient descent: ~a(k + 1) = ~a(k) − η(k) ∇J(~a(k))
Second–order expansion: J(~a) ≃ J(~a(k)) + ∇J^t (~a − ~a(k)) + ½ (~a − ~a(k))^t H (~a − ~a(k)), where H is the Hessian matrix
Minimizing J(~a(k + 1)) in this expansion yields the optimal rate η(k) = ||∇J||² / (∇J^t H ∇J)
If J(~a) is quadratic in ~a, then H = const. → η = const.
Minimizing the second–order expansion directly with respect to ~a(k + 1) gives Newton descent: ~a(k + 1) = ~a(k) − H⁻¹ ∇J (expensive: H must be inverted)
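Both update rules can be sketched on a toy quadratic criterion, where the optimal-rate formula and the one-step Newton property are easy to verify (H and the starting point are made up):

```python
import numpy as np

# Quadratic criterion J(a) = 1/2 a^t H a, so the Hessian H is constant.
H = np.array([[2.0, 0.0],
              [0.0, 10.0]])

def grad_J(a):
    return H @ a  # gradient of J(a) = 1/2 a^t H a

# Gradient descent with the optimal rate eta = ||grad J||^2 / (grad^t H grad)
a_gd = np.array([4.0, 1.0])
for _ in range(100):
    g = grad_J(a_gd)
    eta = (g @ g) / (g @ H @ g)
    a_gd = a_gd - eta * g

# Newton descent reaches the minimum of a quadratic in a single step
a_nt = np.array([4.0, 1.0])
a_nt = a_nt - np.linalg.inv(H) @ grad_J(a_nt)

print(a_gd, a_nt)  # both at (or numerically near) the minimizer (0, 0)
```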
Gradient and Newton Descent
a1
a2
J(a)
FIGURE 5.10. The sequence of weight vectors given by a simple gradient descent method (red) and by Newton's (second order) algorithm (black). Newton's method typically leads to greater improvement per step, even when using optimal learning rates for both methods. However, the added computational burden of inverting the Hessian matrix used in Newton's method is not always justified, and simple gradient descent may suffice. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
Perceptron Criterion Function

"Normalized" inequalities a^t y_i > 0
Perceptron criterion J_p(a) = Σ_{y∈Y} (−a^t y), where Y is the set of misclassified patterns
Gradient ∇J_p = Σ_{y∈Y} (−y)
Update rule a(k+1) = a(k) + η(k) Σ_{y∈Y_k} y
Batch vs. single–sample correction
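The batch update rule can be sketched directly (the augmented, "normalized" training vectors below are assumed toy data: class-2 patterns are negated so that a^t y > 0 holds for every correctly classified pattern):

```python
import numpy as np

# Assumed toy data: augmented patterns, class-2 rows already negated
Y = np.array([[ 1.0, 2.0,  1.0],   # class-1 patterns
              [ 1.0, 1.5,  2.0],
              [-1.0, 0.5, -0.5],   # class-2 patterns, negated
              [-1.0, 1.0, -1.0]])

a = np.zeros(3)
eta = 1.0                           # fixed learning rate for simplicity
for _ in range(100):
    mis = Y[Y @ a <= 0]             # Y_k: currently misclassified patterns
    if len(mis) == 0:
        break                       # a^T y > 0 for all y: solution found
    a = a + eta * mis.sum(axis=0)   # a(k+1) = a(k) + eta(k) * sum over Y_k
```

Single-sample correction would instead add one misclassified y per step; for linearly separable data both variants terminate after finitely many corrections.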
Minimum Squared–Error Procedures

Set of equalities a^t y_i = b_i, where the b_i > 0 are arbitrary positive constants
Solve Y a = b, where Y is the n × (d + 1) matrix containing all training vectors
If Y were nonsingular, a = Y⁻¹ b; however, Y is mostly rectangular!
Minimizing the error e = Y a − b (in the squared–error sense) leads to Y^t Y a = Y^t b → a = (Y^t Y)⁻¹ Y^t b = Y† b, where Y† is the (d + 1) × n pseudoinverse
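The pseudoinverse solution is a one-liner with numpy (the training matrix below is assumed toy data; `np.linalg.lstsq` gives the same minimum-squared-error solution without forming Y† explicitly):

```python
import numpy as np

# Assumed toy data: n = 4 augmented training vectors, d + 1 = 3
Y = np.array([[ 1.0, 2.0,  1.0],
              [ 1.0, 1.5,  2.0],
              [-1.0, 0.5, -0.5],
              [-1.0, 1.0, -1.0]])
b = np.ones(4)                     # arbitrary positive margins b_i > 0

a = np.linalg.pinv(Y) @ b          # a = Y† b = (Y^T Y)^(-1) Y^T b  (Y^T Y nonsingular)
a_lstsq = np.linalg.lstsq(Y, b, rcond=None)[0]   # same least-squares solution
```

The solution satisfies the normal equations Y^t Y a = Y^t b even when Y a = b itself has no exact solution.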
Support Vector Machines

Transform patterns to a (much) higher dimension via a nonlinear mapping ϕ(·)
Linear discriminant g(y) = a^t y
Distance of y_k to the hyperplane H: z_k g(y_k) / ||a|| ≥ b
z_k = ±1 (class labels), b is the margin
Maximize b under the normalization b ||a|| = 1 → minimize ||a|| subject to the inequality constraints
Kuhn–Tucker theorem: optimization with inequality constraints, a generalization of Lagrange multipliers
Maximal Margin SVM

Maximize margin b using the Kuhn–Tucker functional L(a, α) = (1/2)||a||² − Σ_{k=1}^n α_k [z_k a^t y_k − 1]
Resulting in the dual problem (quadratic optimization) L(α) = Σ_{k=1}^n α_k − (1/2) Σ_{k,j}^n α_k α_j z_k z_j y_j^t y_k
with constraints Σ_{k=1}^n z_k α_k = 0 and α_k ≥ 0
Then a* = Σ_{i=1}^n z_i α_i* y_i (a non–zero α_i indicates a support vector)
Maximal margin b* = (Σ_{i=1}^n α_i*)^(−1/2)
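For a two-point toy problem the dual can be solved by hand and the formulas above checked numerically (the two augmented, equal-norm patterns are assumed data; with two points, the constraint Σ z_k α_k = 0 forces α_1 = α_2 = α, and maximizing L(α) = 2α − (α²/2)||y_1 − y_2||² gives α in closed form):

```python
import numpy as np

# Assumed toy data: augmented patterns of equal norm
y1 = np.array([1.0,  2.0,  1.0])   # z_1 = +1
y2 = np.array([1.0, -2.0, -1.0])   # z_2 = -1

alpha = 2.0 / np.dot(y1 - y2, y1 - y2)   # maximizer of the two-point dual

a = alpha * (y1 - y2)              # a* = sum_i z_i alpha_i* y_i
bstar = (2 * alpha) ** -0.5        # b* = (sum_i alpha_i*)^(-1/2)

# Both alphas are non-zero, so both patterns are support vectors:
# z_k a*^T y_k = 1 holds, and their distance to the hyperplane equals b*
```

The checks below confirm that the Kuhn–Tucker conditions z_k a^t y_k = 1 hold at the support vectors and that b* equals the geometric margin |a^t y_k| / ||a||.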
Maximal Margin Hyperplane

(Figure: regions R1 and R2 in y1–y2 space, separated by the optimal hyperplane with maximum margin b)
FIGURE 5.19. Training a support vector machine consists of finding the optimal hyperplane, that is, the one with the maximum distance from the nearest training patterns. The support vectors are those (nearest) patterns, a distance b from the hyperplane. The three support vectors are shown as solid dots. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
Soft Margin SVM

The maximal margin SVM is sensitive to outliers and demands linear separability for a solution
The soft margin SVM introduces slack variables ξ_k: z_k g(y_k) ≥ b − ξ_k (relaxed margin)
Maximize the relaxed margin b with the Kuhn–Tucker functional L(a, α, ξ) = (1/2)||a||² + (C/2) Σ_{k=1}^n ξ_k² − Σ_{k=1}^n α_k [z_k a^t y_k − 1 + ξ_k]
Again a* = Σ_{i=1}^n z_i α_i* y_i
Maximal margin b* = (Σ_{i=1}^n (α_i* − (1/C)|α_i*|²))^(−1/2)
Depends on the parameter C!
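The dependence on C can be illustrated on a symmetric two-point toy problem (assumed data). This sketch additionally assumes the standard identity — not derived on the slides — that the quadratic slack penalty turns the dual into the hard-margin dual with 1/C added to the Gram-matrix diagonal, so that for two symmetric points α = 2 / (||y_1 − y_2||² + 2/C):

```python
import numpy as np

# Assumed toy data: augmented patterns of equal norm, z_1 = +1, z_2 = -1
y1 = np.array([1.0,  2.0,  1.0])
y2 = np.array([1.0, -2.0, -1.0])
d2 = np.dot(y1 - y2, y1 - y2)

def soft_margin(C):
    # alpha_1 = alpha_2 = alpha for the symmetric two-point dual (assumed
    # quadratic-slack identity: add 1/C to the Gram diagonal)
    alpha = 2.0 / (d2 + 2.0 / C)
    # b* = (sum_i (alpha_i* - |alpha_i*|^2 / C))^(-1/2)
    return (2 * alpha - 2 * alpha**2 / C) ** -0.5

hard = (2 * (2.0 / d2)) ** -0.5    # hard-margin b* for comparison
```

A small C relaxes the margin (larger b*); as C grows, the relaxed margin approaches the hard-margin value, which is exactly the "depends on C" sensitivity noted above.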
Neural Networks

Multilayer Neural Networks

For real-world problems, a linear discriminant is often not sufficient
NNs also implement a nonlinear mapping to a higher dimension
Learning finds the mapping AND the linear discriminant
Error backpropagation is a least-squares fit to the Bayes discriminant functions
NNs are motivated by biology, but can be explained without it
XOR Net

(Figure: the three-layer XOR network with inputs x1, x2, bias, hidden units y1, y2, output z, and the decision regions R1, R2 in x1–x2 space)
FIGURE 6.1. The two-bit parity or exclusive-OR problem can be solved by a three-layer network. At the bottom is the two-dimensional feature x1x2-space, along with the four patterns to be classified. The three-layer network is shown in the middle. The input units are linear and merely distribute their feature values through multiplicative weights to the hidden units. The hidden and output units here are linear threshold units, each of which forms the linear sum of its inputs times their associated weight to yield net, and emits a +1 if this net is greater than or equal to 0, and −1 otherwise, as shown by the graphs. Positive or "excitatory" weights are denoted by solid lines, negative or "inhibitory" weights by dashed lines; each weight magnitude is indicated by the line's thickness, and is labeled. The single output unit sums the weighted signals from the hidden units and bias to form its net, and emits a +1 if its net is greater than or equal to 0 and emits a −1 otherwise. Within each unit we show a graph of its input-output or activation function, f(net) versus net. This function is linear for the input units, a constant for the bias, and a step or sign function elsewhere. We say that this network has a 2-2-1 fully connected topology, describing the number of units (other than the bias) in successive layers. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
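The 2-2-1 network of Figure 6.1 is small enough to write out directly, using the weight values labeled in the figure (input-to-hidden weights 1, hidden bias weights 0.5 and −1.5, output weights 0.7 and −0.4, output bias weight −1; the threshold convention "+1 if net ≥ 0" follows the caption):

```python
# Linear threshold unit: emits +1 if net >= 0, -1 otherwise
sign = lambda net: 1 if net >= 0 else -1

def xor_net(x1, x2):
    y1 = sign(1 * x1 + 1 * x2 + 0.5)          # hidden unit y1
    y2 = sign(1 * x1 + 1 * x2 - 1.5)          # hidden unit y2
    return sign(0.7 * y1 - 0.4 * y2 - 1)      # output unit z

# Two-bit parity: z = -1 when the inputs agree, +1 when they differ
results = {(x1, x2): xor_net(x1, x2)
           for x1 in (-1, 1) for x2 in (-1, 1)}
```

Tracing one case: for x = (1, 1), both hidden nets are positive, so y1 = y2 = +1 and the output net is 0.7 − 0.4 − 1 = −0.7, giving z = −1 as required for equal inputs.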
Network Components

Neurons and synaptic connections (weights)
Net activation net_j = Σ_{i=1}^d x_i w_ji + w_j0 = Σ_{i=0}^d x_i w_ji ≡ w_j^t x
Neuron output z_k = f(net_k), with activation function f
A common class of activation functions is the sigmoid, e.g., f(x) = 1 / (1 + e^(−cx))
Basic topologies: feed–forward and recurrent
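A single neuron's forward computation follows directly from the definitions above (the feature vector and weights below are assumed toy values; note that folding the bias w_j0 into an augmented input with x_0 = 1 gives the same net activation):

```python
import numpy as np

def sigmoid(x, c=1.0):
    # f(x) = 1 / (1 + e^(-c x))
    return 1.0 / (1.0 + np.exp(-c * x))

# Assumed toy values
x = np.array([0.5, -1.0])          # feature vector (d = 2)
w = np.array([0.2, 0.8, -0.3])     # w_j0 (bias), then w_j1, w_j2

net = w[1:] @ x + w[0]                       # net_j = sum_i x_i w_ji + w_j0
net_aug = w @ np.concatenate(([1.0], x))     # augmented form w_j^T x, same value
z = sigmoid(net)                             # neuron output z = f(net)
```

The sigmoid squashes any net activation into (0, 1) and is differentiable everywhere, which matters for the gradient-based learning discussed next.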
A 2–4–1 Network

(Figure: network with inputs x1, x2, hidden units y1–y4, output z1, and the response functions at the units)
FIGURE 6.2. A 2-4-1 network (with bias) along with the response functions at different units; each hidden and output unit has a sigmoidal activation function f(·). In the case shown, the hidden unit outputs are paired in opposition, thereby producing a "bump" at the output unit. Given a sufficiently large number of hidden units, any continuous function from input to output can be approximated arbitrarily well by such a network. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
NN Decision Boundaries

(Figure: decision regions R1 and R2 in x1–x2 space realized by two-layer and three-layer networks)
FIGURE 6.3. Whereas a two-layer network classifier can only implement a linear decision boundary, given an adequate number of hidden units, three-, four- and higher-layer networks can implement arbitrary decision boundaries. The decision regions need not be convex or simply connected. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
Network Learning

Learning as minimization (of the network error)
The error is a function of the network parameters
Gradient descent methods reduce the error
Problem with hidden layers: no direct error signal is available for them
Backpropagation = iterative local gradient descent; Werbos (1974), Rumelhart, Hinton, Williams (1986)
Error backpropagation: the output error is transmitted backwards as weighted error, and the network weights are updated locally
Weight update Δw_ji = η δ_j a_i, with generalized error term δ
Common transfer functions: differentiable, nonlinear, monotonic, with an easily computable derivative
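Error backpropagation can be sketched for a small 2-2-1 sigmoid network (the XOR training data, initialization, learning rate, and iteration count are all assumed for illustration; the update is Δw = η δ a with the output and hidden δ terms computed layer by layer):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])   # XOR inputs
T = np.array([[0.], [1.], [1.], [0.]])                   # targets

W1 = rng.normal(0, 1, (2, 2)); b1 = np.zeros(2)          # input -> hidden
W2 = rng.normal(0, 1, (2, 1)); b2 = np.zeros(1)          # hidden -> output
f = lambda x: 1 / (1 + np.exp(-x))                       # sigmoid, f' = f(1 - f)
eta = 0.5                                                # assumed learning rate

def loss():
    return float(((f(f(X @ W1 + b1) @ W2 + b2) - T) ** 2).sum())

loss0 = loss()
for _ in range(5000):
    H = f(X @ W1 + b1)                  # forward pass: hidden activations
    Z = f(H @ W2 + b2)                  # forward pass: network output
    dZ = (Z - T) * Z * (1 - Z)          # output error term delta_k
    dH = (dZ @ W2.T) * H * (1 - H)      # backpropagated hidden error delta_j
    W2 -= eta * H.T @ dZ; b2 -= eta * dZ.sum(axis=0)   # local update dw = eta delta a
    W1 -= eta * X.T @ dH; b1 -= eta * dH.sum(axis=0)
```

The hidden error term is exactly the output error transmitted backwards through the weights W2 and scaled by the derivative of the sigmoid, which is why the transfer function must be differentiable.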
![Page 179: VU Pattern Recognition II - Uni Salzburghelmut/Teaching/PatternRecognition/prII.pdf · VU Pattern Recognition II Outline 1 Introduction 2 Statistical Classi ers Bayesian Decision](https://reader036.vdocuments.net/reader036/viewer/2022070716/5edb8ce1ad6a402d6665d22a/html5/thumbnails/179.jpg)
VU Pattern Recognition II
Neural Networks
Network Learning
Learning as minimization (of network error)
Error is a function of network parameters
Gradient descent methods reduce error
Problem with hidden layers
Backpropagation = Iterative Local Gradient DescentWerbos (1974), Rumelhart, Hinton, Williams (1986)
Error–Backpropagation, output error is transmitted backwardsas weighted error, network weights are updated locally
Weight update ∆wj ,i = ηδjaiGeneralized error term δ
Common transfer functions: differentiable, nonlinear,monotonous, easily computable differentiation
Error–Backpropagation I
[Network diagram: a 2–2–2 feedforward network with inputs x_1, x_2; input-to-hidden weights v_{j,i}; hidden net inputs H_j with activations y_j; hidden-to-output weights w_{k,j}; output net inputs I_k with activations z_1, z_2]

H_j = Σ_{i=1}^{n} v_{j,i} x_i,  I_k = Σ_{j=1}^{h} w_{k,j} y_j,  y_j = f(H_j),  z_k = f(I_k)
Error E^{(p)} = (1/2) Σ_{k=1}^{m} (t_k^{(p)} − z_k^{(p)})²
Output layer: ∆w_{k,j} = −η ∂E/∂w_{k,j}
∂E/∂w_{k,j} = (∂E/∂I_k)(∂I_k/∂w_{k,j}) = (∂E/∂I_k) y_j
∂E/∂I_k = (∂E/∂z_k)(∂z_k/∂I_k) = −(t_k − z_k) f′(I_k)
∂E/∂w_{k,j} = −(t_k − z_k) f′(I_k) y_j, with δ_k = (t_k − z_k) f′(I_k)
∆w_{k,j} = η δ_k y_j
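The output-layer rule above can be illustrated for a single output unit; a minimal sketch assuming a logistic transfer function (so f′ = f(1 − f)) and illustrative activation and weight values:

```python
import math

def f(a):
    """Logistic transfer function."""
    return 1.0 / (1.0 + math.exp(-a))

def f_prime(a):
    """f'(a) = f(a) * (1 - f(a)) for the logistic function."""
    fa = f(a)
    return fa * (1.0 - fa)

# One output unit k with hidden activations y_j and weights w_{k,j}
y = [0.3, 0.9]        # hidden activations (assumed values)
w_k = [0.5, -0.4]     # weights into output unit k (assumed values)
t_k, eta = 1.0, 0.5

I_k = sum(wkj * yj for wkj, yj in zip(w_k, y))   # net input I_k
z_k = f(I_k)                                     # output z_k = f(I_k)
delta_k = (t_k - z_k) * f_prime(I_k)             # generalized error term
dw = [eta * delta_k * yj for yj in y]            # Delta w_{k,j} = eta delta_k y_j
```

Note that each weight change is proportional to the activation y_j feeding that weight, which is what makes the rule local.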
Error–Backpropagation II
Hidden layer: ∆v_{j,i} = −η ∂E/∂v_{j,i}
∂E/∂v_{j,i} = (∂E/∂H_j)(∂H_j/∂v_{j,i}) = (∂E/∂H_j) x_i
∂E/∂H_j = (∂E/∂y_j)(∂y_j/∂H_j) = (∂E/∂y_j) f′(H_j)
∂E/∂y_j = (1/2) Σ_{k=1}^{m} ∂(t_k − f(I_k))²/∂y_j = −Σ_{k=1}^{m} (t_k − z_k) f′(I_k) w_{k,j}
With δ_j = f′(H_j) Σ_{k=1}^{m} δ_k w_{k,j}:
∆v_{j,i} = η δ_j x_i
Local update rules propagate the error from output to input
Presenting all p patterns of the training set = 1 epoch (a complete training run takes e.g. 1,000 epochs)
Batch learning (off-line): accumulate the weight changes for all patterns, then update the weights
On-line learning: update the weights after each pattern
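The output-layer and hidden-layer rules combine into one on-line update per pattern. A minimal sketch for a single hidden layer, assuming a logistic transfer function (so f′ = f(1 − f)); the function name and list-of-lists weight layout are illustrative:

```python
import math

def f(a):
    """Logistic transfer function."""
    return 1.0 / (1.0 + math.exp(-a))

def backprop_step(x, t, v, w, eta=0.5):
    """One on-line backpropagation update; v[j][i] are input-to-hidden
    weights, w[k][j] hidden-to-output weights. Returns E^(p) before update."""
    # Forward pass
    H = [sum(vj[i] * x[i] for i in range(len(x))) for vj in v]
    y = [f(Hj) for Hj in H]
    I = [sum(wk[j] * y[j] for j in range(len(y))) for wk in w]
    z = [f(Ik) for Ik in I]
    # Output deltas: delta_k = (t_k - z_k) f'(I_k), with f' = z (1 - z)
    d_out = [(t[k] - z[k]) * z[k] * (1.0 - z[k]) for k in range(len(z))]
    # Hidden deltas: delta_j = f'(H_j) * sum_k delta_k w_{k,j}
    d_hid = [y[j] * (1.0 - y[j]) * sum(d_out[k] * w[k][j] for k in range(len(w)))
             for j in range(len(y))]
    # Local weight updates
    for k in range(len(w)):
        for j in range(len(y)):
            w[k][j] += eta * d_out[k] * y[j]      # Delta w_{k,j} = eta delta_k y_j
    for j in range(len(v)):
        for i in range(len(x)):
            v[j][i] += eta * d_hid[j] * x[i]      # Delta v_{j,i} = eta delta_j x_i
    return 0.5 * sum((t[k] - z[k]) ** 2 for k in range(len(z)))
```

Calling this once per pattern is on-line learning; summing the weight changes over all patterns before applying them would give the batch variant.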
Learning Curves
[Figure: learning curves plotting the average error per pattern J/n against epochs for the training, validation, and test sets]

FIGURE 6.6. A learning curve shows the criterion function as a function of the amount of training, typically indicated by the number of epochs or presentations of the full training set. We plot the average error per pattern, that is, (1/n) Σ_{p=1}^{n} J_p. The validation error and the test or generalization error per pattern are virtually always higher than the training error. In some protocols, training is stopped at the first minimum of the validation set. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
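The stopping protocol mentioned in the caption can be sketched as a scan for the first minimum of the validation-error curve; a hypothetical helper, not from the slides:

```python
def first_validation_minimum(errors):
    """Return the epoch index of the first local minimum of the
    validation error; fall back to the last epoch if none exists."""
    for e in range(1, len(errors) - 1):
        # a minimum: no higher than the previous epoch, lower than the next
        if errors[e] <= errors[e - 1] and errors[e] < errors[e + 1]:
            return e
    return len(errors) - 1
```

In practice the validation curve is noisy, so real early-stopping rules usually add a patience window rather than stopping at the very first uptick.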
XOR Learning Details
[Figure: the 2-2-1 XOR network with bias, the trajectories of the hidden representations in y_1y_2-space over epochs 1–60 with the final decision boundary, and the learning curves J for the total error and the error on individual patterns]

FIGURE 6.10. A 2-2-1 backpropagation network with bias and the four patterns of the XOR problem are shown at the top. The middle figure shows the outputs of the hidden units for each of the four patterns; these outputs move across the y_1y_2-space as the network learns. In this space, early in training (epoch 1) the two categories are not linearly separable. As the input-to-hidden weights learn, as marked by the number of epochs, the categories become linearly separable. The dashed line is the linear decision boundary determined by the hidden-to-output weights at the end of learning; indeed the patterns of the two classes are separated by this boundary. The bottom graph shows the learning curves: the error on individual patterns and the total error as a function of epoch. Note that, as frequently happens, the total training error decreases monotonically, even though this is not the case for the error on each individual pattern. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
Backpropagation Variants I
Standard Backpropagation: w_t = w_{t−1} − η ∇E
Gradient Reuse: reuse ∇E as long as the error drops
BP with variable step size (learning rate) η
BP with momentum: ∆w_t = −η ∇E + α ∆w_{t−1}
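The momentum variant can be sketched componentwise; a minimal illustration with hypothetical names, where the previous weight change is blended into the current step:

```python
def momentum_update(w, grad, prev_dw, eta=0.1, alpha=0.9):
    """Componentwise BP-with-momentum step:
    Delta w_t = -eta * gradE + alpha * Delta w_{t-1}."""
    dw = [-eta * g + alpha * d for g, d in zip(grad, prev_dw)]
    return [wi + d for wi, d in zip(w, dw)], dw

# First step (no previous change): reduces to a plain gradient step
w1, dw1 = momentum_update([1.0], [2.0], [0.0])   # w: 1.0 -> 0.8, dw: -0.2
```

With α > 0, consecutive steps in the same direction accumulate, which speeds up travel along shallow, consistent gradients and damps oscillation across steep ones.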
Nonmetric Methods
Decision Trees
Real problems: nominal data, e.g., car ∈ {green, red, blue}
Rule-based or syntactic methods
Decision tree (DT): a series of questions (nodes) leads to an answer at a leaf (category)
A DT is interpretable (decisions and categories)
Monothetic Decision Tree
[Figure: a monothetic decision tree for fruit classification; the root (level 0) asks Color?, deeper levels ask Size?, Shape?, and Taste?, and the leaves carry the category labels Watermelon, Apple, Grape, Grapefruit, Lemon, Banana, and Cherry]

FIGURE 8.1. Classification in a basic decision tree proceeds from top to bottom. The questions asked at each node concern a particular property of the pattern, and the downward links correspond to the possible values. Successive nodes are visited until a terminal or leaf node is reached, where the category label is read. Note that the same question, Size?, appears in different places in the tree and that different questions can have different numbers of branches. Moreover, different leaf nodes, shown in pink, can be labeled by the same category (e.g., Apple). From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
Classification and Regression Trees
CART
Goal: construct pure nodes (ideally, all leaf nodes are pure)
A pure leaf node contains only patterns of a single category

Design Issues
Branching factor = number of splits?
Which query (property) at which node?
Termination (leaf node)?
Pruning (simplification)?
Missing data?
Monothetic Decision Boundaries
[Figure: axis-parallel decision regions R_1 and R_2 produced by monothetic decision trees, shown in a two-dimensional (x_1, x_2) and a three-dimensional (x_1, x_2, x_3) feature space]

FIGURE 8.3. Monothetic decision trees create decision boundaries with portions perpendicular to the feature axes. The decision regions are marked R1 and R2 in these two-dimensional and three-dimensional two-category examples. With a sufficiently large tree, any decision boundary can be approximated arbitrarily well in this way. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
Entropy Impurity
Each non-binary tree can be transformed into a binary tree
Monothetic (single feature per node) and polythetic (multiple features per node) trees
Any query at a node should gain maximal purity (or minimal impurity)
Entropy impurity of a node N with class "probabilities" P(ω_j): i(N) = −Σ_j P(ω_j) ld P(ω_j)
i(N) = 0 → pure node
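The entropy impurity can be sketched directly from the formula (ld = log₂; the function name is illustrative, and the term 0 · ld 0 is taken as 0):

```python
import math

def entropy_impurity(probs):
    """i(N) = -sum_j P(w_j) ld P(w_j), with ld = log base 2."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

# A pure node has impurity 0; a uniform two-class node has impurity 1 bit
```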
Other Impurity Measures
Gini impurity (a generalization of the variance impurity): i(N) = Σ_{i≠j} P(ω_i) P(ω_j) = (1/2)[1 − Σ_j P²(ω_j)]
Expected error rate at N (if the pattern's label is selected from the class distribution at N)
Misclassification impurity (its discontinuous derivative may cause problems): i(N) = 1 − max_j P(ω_j)
Minimal probability of a misclassified pattern at N
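Both measures are one-liners over the class probabilities; a minimal sketch with illustrative function names, following the slide's 1/2 convention for the Gini impurity:

```python
def gini_impurity(probs):
    """i(N) = sum_{i != j} P(w_i) P(w_j) = 1/2 [1 - sum_j P(w_j)^2]."""
    return 0.5 * (1.0 - sum(p * p for p in probs))

def misclassification_impurity(probs):
    """i(N) = 1 - max_j P(w_j)."""
    return 1.0 - max(probs)
```

Like the entropy impurity, both vanish for a pure node and peak at the uniform class distribution; they differ mainly in curvature, which is why the choice rarely matters much in practice.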
Greedy Query Search
Select the query with the largest impurity decrease from N to N_L (left child) and N_R (right child)
∆i(N) = i(N) − P_L i(N_L) − (1 − P_L) i(N_R)

Nominal features (exhaustive search), continuous features (gradient descent)

The specific choice of impurity measure is uncritical; more important are stop splitting and pruning methods

Multiway splits (B > 2): the simple impurity decrease favors large splits, so the impurity decrease is scaled, giving the Gain Ratio Impurity
∆i′(N,B) = ∆i(N,B) / (−∑_k P_k ld P_k)   (ld = log₂)
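A sketch of the greedy query search for a binary split on one continuous feature, using exhaustive threshold search rather than gradient descent (the helper names `gini` and `best_threshold` and the toy data are illustrative assumptions):

```python
# Illustrative sketch: try thresholds midway between sorted feature values
# and keep the one with the largest impurity decrease
#   delta_i(N) = i(N) - P_L * i(N_L) - (1 - P_L) * i(N_R).

def gini(labels):
    n = len(labels)
    return 0.5 * (1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels)))

def best_threshold(xs, ys):
    parent = gini(ys)
    pairs = sorted(zip(xs, ys))
    best = (0.0, None)  # (impurity decrease, threshold)
    for k in range(1, len(pairs)):
        t = 0.5 * (pairs[k - 1][0] + pairs[k][0])
        left = [y for x, y in pairs if x < t]
        right = [y for x, y in pairs if x >= t]
        if not left or not right:
            continue
        p_l = len(left) / len(pairs)
        delta = parent - p_l * gini(left) - (1.0 - p_l) * gini(right)
        if delta > best[0]:
            best = (delta, t)
    return best

# Two classes perfectly separated at x = 0.5: the split removes all impurity.
delta, t = best_threshold([0.1, 0.2, 0.3, 0.7, 0.8, 0.9], [0, 0, 0, 1, 1, 1])
print(delta, t)  # 0.25 0.5
```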
Stop Splitting Methods
Naive stop: each leaf node has impurity 0 (perfect overfitting); the tree may degenerate to a look–up table (a leaf node for each pattern)

Measure split performance with a separate validation set (minimal error on the validation set)

Impurity threshold ∆i(N) ≤ β: unbalanced trees, and how to choose β?

Pattern threshold: stop when a node represents a certain (small) number (percentage) of patterns

Minimum Description Length (regularization reduces complexity)
J(DT) = α #N + ∑_{LN} i(LN)   (LN = leaf nodes)

Statistical significance of impurity reduction (distribution of ∆i)
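The MDL criterion above can be sketched as follows (α and the node counts are illustrative assumptions; a split is kept only if it lowers J(DT)):

```python
# Illustrative sketch of the MDL stopping criterion
#   J(DT) = alpha * #N + sum over leaf nodes LN of i(LN):
# the #N term penalizes tree size, the impurity term penalizes bad fits.

def mdl_cost(n_nodes, leaf_impurities, alpha=0.1):
    return alpha * n_nodes + sum(leaf_impurities)

# A 3-node stump with two fairly pure leaves...
small = mdl_cost(3, [0.1, 0.1])
# ...versus a 7-node tree with four perfectly pure leaves: with this alpha,
# the extra nodes cost more than the removed impurity, so splitting stops.
large = mdl_cost(7, [0.0, 0.0, 0.0, 0.0])
print(small < large)  # True
```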
Pruning
Stop splitting: insufficient look–ahead (horizon effect) due to greedy search

Pruning: merge nodes; starts at leaf nodes, but any node is possible

Uses the complete data set, hence huge cost with large data sets

Rule pruning: construct and simplify rules (conjunctions) for each leaf
Context pruning: prune specific rules for specific patterns
Improved interpretability
Feature Extraction
FIGURE 8.5. If the class of node decisions does not match the form of the training data, a very complicated decision tree will result, as shown at the top. Here decisions are parallel to the axes while in fact the data is better split by boundaries along another direction. If, however, "proper" decision forms are used (here, linear combinations of the features), the tree can be quite simple, as shown at the bottom. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
Potential Improvements
At each node train a linear classifier: arbitrary linear decision boundaries

Long training, (again) fast recall

Integrate priors and/or costs by weights

Weighted Gini impurity with costs λ_ij
i(N) = ∑_{ij} λ_ij P(ω_i) P(ω_j)
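A sketch of the weighted Gini impurity, assuming a cost matrix λ with λ_ii = 0 (the concrete cost values are illustrative):

```python
# Illustrative sketch: Gini impurity weighted by misclassification costs
# lambda_ij, i(N) = sum_{ij} lambda_ij * P(w_i) * P(w_j), so that costly
# confusions contribute more impurity and splits avoid them first.

def weighted_gini(p, lam):
    return sum(lam[i][j] * p[i] * p[j]
               for i in range(len(p)) for j in range(len(p)))

p = [0.5, 0.5]
uniform = [[0, 1], [1, 0]]  # unit costs: recovers the unweighted case
asym = [[0, 5], [1, 0]]     # mistaking class 1 for class 2 is 5x worse
print(weighted_gini(p, uniform))  # 0.5
print(weighted_gini(p, asym))     # 1.5
```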
Multivariate Decision Trees
FIGURE 8.6. One form of multivariate tree employs general linear decisions at each node, giving splits along arbitrary directions in the feature space. In virtually all interesting cases the training data are not linearly separable, and thus the LMS algorithm is more useful than methods that require the data to be linearly separable, even though LMS need not yield a minimum in classification error (Chapter 5). The tree at the bottom can be simplified by methods outlined in Section 8.4.2. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
Missing Attributes
Naive approach: use only non–deficient patterns

Better: use only non–deficient attributes

Works for training, but how to classify a deficient pattern?

Surrogate splits: find alternative splits using different features having maximal predictive association (correlation)

Virtual values, e.g., the mean of the non–deficient values of a feature
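The virtual-values strategy can be sketched as simple column-wise mean imputation (the function name and data are illustrative assumptions; missing entries are marked `None`):

```python
# Illustrative sketch of "virtual values": each deficient feature value is
# replaced by the mean of the non-deficient values of that feature.

def impute_means(patterns):
    """Column-wise mean imputation; missing entries are None."""
    n_feats = len(patterns[0])
    means = []
    for j in range(n_feats):
        vals = [p[j] for p in patterns if p[j] is not None]
        means.append(sum(vals) / len(vals))
    return [[means[j] if p[j] is None else p[j] for j in range(n_feats)]
            for p in patterns]

data = [[1.0, 2.0], [3.0, None], [5.0, 4.0]]
print(impute_means(data))  # [[1.0, 2.0], [3.0, 3.0], [5.0, 4.0]]
```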
ID3
The name ID3 stems from its being the third Iterative Dichotomizer

Nominal features (real-valued features are binned)

Branching factor equals the number of values of the selected attribute

Train until all leaf nodes are pure or no features remain

Results in a tree depth equal to the number of features
No pruning
C4.5
Refinement of ID3
B > 2 with nominal features, B = 2 with real features
Pruning based on statistical significance of splits
Missing features: sample all subtrees of the missing feature using training data
Additional rule pruning, can prune any node (see Figure 8.6)
Stochastic Methods
Stochastic Search
Analytical methods are problematic in high dimensions or with complex models

A large number of local optima makes gradient descent very costly

Stochastic methods try to localize promising search regions

Pure random search is often not sufficient

Simulated Annealing and Boltzmann Learning are motivated by statistical mechanics

Evolutionary Computation is motivated by evolutionary principles from biology
![Page 263: VU Pattern Recognition II - Uni Salzburghelmut/Teaching/PatternRecognition/prII.pdf · VU Pattern Recognition II Outline 1 Introduction 2 Statistical Classi ers Bayesian Decision](https://reader036.vdocuments.net/reader036/viewer/2022070716/5edb8ce1ad6a402d6665d22a/html5/thumbnails/263.jpg)
VU Pattern Recognition II
Stochastic Methods
Stochastic Search
Analytical methods problematic in high dimensions or withcomplex models
Large number of local optima makes gradient descent verycostly
Stochastic methods try to localize promising search regions
Pure random search is often not sufficient
Simulated Annealing and Boltzmann Learning motivated bystatistical mechanics
Evolutionary Computation motivated by evolutionaryprinciples from biology
VU Pattern Recognition II
Stochastic Methods
Energy Minimization
Example: minimizing (model) energy in a (Hopfield) network
Energy E = −(1/2) ∑_{i,j=1}^{N} w_ij s_i s_j with s_i = ±1
Minimize energy of spin–glass model
Probability of energy state, Boltzmann factor
P(γ) = e^(−E_γ/T) / Z(T)
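Both formulas can be checked directly for a tiny network. The sketch below is illustrative only (the weight matrix and temperature are made-up values); the partition function Z(T) is computed by brute-force enumeration over all 2^N spin configurations, which is feasible only for small N:

```python
import itertools
import math
import random

def energy(w, s):
    """E = -(1/2) * sum_{i,j} w_ij s_i s_j for spins s_i in {-1, +1}."""
    n = len(s)
    return -0.5 * sum(w[i][j] * s[i] * s[j] for i in range(n) for j in range(n))

def boltzmann(w, n, T):
    """Boltzmann probabilities P(gamma) = exp(-E_gamma / T) / Z(T),
    with Z(T) obtained by enumerating all 2^n configurations."""
    configs = list(itertools.product([-1, +1], repeat=n))
    factors = [math.exp(-energy(w, s) / T) for s in configs]
    Z = sum(factors)  # partition function Z(T)
    return configs, [f / Z for f in factors]

# Small random symmetric weight matrix (zero diagonal), spin-glass style.
random.seed(0)
N = 4
w = [[0.0] * N for _ in range(N)]
for i in range(N):
    for j in range(i + 1, N):
        w[i][j] = w[j][i] = random.uniform(-1.0, 1.0)

configs, probs = boltzmann(w, N, T=1.0)
```

Since E is invariant under the global flip s_i ↔ −s_i, the minimum-energy configuration always comes in a pair with equal probability.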
Stochastic Methods
Recurrent Net

[Figure 7.1: network of 17 binary nodes s1–s17 connected by bidirectional weights w_ij; visible and hidden node groups indexed by α and β]
FIGURE 7.1. The class of optimization problems of Eq. 1 can be viewed in terms of a network of nodes or units, each of which can be in the s_i = +1 or s_i = −1 state. Every pair of nodes i and j is connected by bi-directional weights w_ij; if a weight between two nodes is zero, then no connection is drawn. (Because the networks we shall discuss can have an arbitrary interconnection, there is no notion of layers as in multilayer neural networks.) The optimization problem is to find a configuration (i.e., assignment of all s_i) that minimizes the energy described by Eq. 1. While our convention was to show functions inside each node's circle, our convention in so-called Boltzmann networks is to indicate the state of each node. The configuration of the full network is indexed by an integer γ, and because here there are 17 binary nodes, γ is bounded 0 ≤ γ < 2^17. When such a network is used for pattern recognition, the input and output nodes are said to be visible, and the remaining nodes are said to be hidden. The states of the visible nodes and hidden nodes are indexed by α and β, respectively, and in this case are bounded 0 ≤ α < 2^10 and 0 ≤ β < 2^7. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
![Page 273: VU Pattern Recognition II - Uni Salzburghelmut/Teaching/PatternRecognition/prII.pdf · VU Pattern Recognition II Outline 1 Introduction 2 Statistical Classi ers Bayesian Decision](https://reader036.vdocuments.net/reader036/viewer/2022070716/5edb8ce1ad6a402d6665d22a/html5/thumbnails/273.jpg)
Stochastic Methods
Energy Landscape
[Figure 7.2: two energy landscapes E over (x1, x2)]
FIGURE 7.2. The energy function or energy "landscape" on the left is meant to suggest the types of optimization problems addressed by simulated annealing. The method uses randomness, governed by a control parameter or "temperature" T, to avoid getting stuck in local energy minima and thus to find the global minimum, like a small ball rolling in the landscape as it is shaken. The pathological "golf course" landscape at the right is, generally speaking, not amenable to solution via simulated annealing because the region of lowest energy is so small and is surrounded by energetically unfavorable configurations. The configuration spaces of the problems we shall address are discrete and are more accurately displayed in Fig. 7.6. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
Stochastic Methods
Simulated Annealing
Simulated Annealing Basics
Stochastic search for state of lower energy
Basic idea: occasionally go to higher energy to possiblyescape local minima
After a random change of parameter s_i, compute ∆E_ab = E_b − E_a

accept E_b if E_b < E_a, or

accept E_b with P = e^(−∆E_ab/T) otherwise
Annealing schedule, e.g., T(k + 1) = cT(k) with 0 < c < 1, typically 0.8 < c < 0.99
High initial temperature, large c, and large k_max (number of iterations) lead to good results (but also high computational cost)
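The acceptance rule and geometric cooling schedule above fit in a few lines. This is a minimal sketch, not Algorithm 1 from the book; the function names, default parameters, and the toy one-dimensional objective are illustrative assumptions:

```python
import math
import random

def simulated_annealing(energy, neighbor, s0, T0=10.0, c=0.9, k_max=2000, seed=None):
    """Always accept a downhill move; accept an uphill move with
    probability exp(-dE / T); cool geometrically via T(k+1) = c * T(k)."""
    rng = random.Random(seed)
    s, T = s0, T0
    e = energy(s)
    best_s, best_e = s, e
    for _ in range(k_max):
        s_new = neighbor(s, rng)
        e_new = energy(s_new)
        dE = e_new - e
        if dE < 0 or rng.random() < math.exp(-dE / T):
            s, e = s_new, e_new
            if e < best_e:
                best_s, best_e = s, e
        T *= c  # annealing schedule
    return best_s, best_e

# Toy objective: minimize (x - 3)^2 with Gaussian perturbations.
x_best, e_best = simulated_annealing(
    energy=lambda x: (x - 3.0) ** 2,
    neighbor=lambda x, rng: x + rng.gauss(0.0, 0.5),
    s0=0.0,
    seed=42,
)
```

Early, at high T, nearly every move is accepted (a random walk); as T shrinks, the search becomes almost purely greedy, mirroring the trajectory in Fig. 7.3.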
Stochastic Methods
Simulated Annealing
Simulated Annealing Experiment
[Figure 7.3: annealing schedule T(k), trajectory through the 64 configurations of s1…s6 (indexed by γ), and energy E(k) from begin to end]
FIGURE 7.3. Stochastic simulated annealing (Algorithm 1) uses randomness, governed by a control parameter or "temperature" T(k), to search through a discrete space for a minimum of an energy function. In this example there are N = 6 variables; the 2^6 = 64 configurations are shown along the bottom as a column of + and − symbols. The plot shows the associated energy of each configuration given by Eq. 1 for randomly chosen weights. Every transition corresponds to the change of just a single s_i. (The configurations have been arranged so that adjacent ones differ by the state of just a single node; nevertheless, most transitions corresponding to a single node appear far apart in this ordering.) Because the system energy is invariant with respect to a global interchange s_i ↔ −s_i, there are two "global" minima. The graph at the upper left shows the annealing schedule: the decreasing temperature versus iteration number k. The middle portion shows the configuration versus iteration number generated by Algorithm 1. The trajectory through the configuration space is colored red for transitions that increase the energy and black for those that decrease the energy. Such energetically unfavorable (red) transitions become rarer later in the anneal. The graph at the right shows the full energy E(k), which decreases to the global minimum. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
Stochastic Methods
Simulated Annealing
Empirical Energy States
[Figure 7.4: estimated P(γ) over the 64 configurations of s1…s6 at four temperatures during the anneal, with T(k) and E[E] versus k]
FIGURE 7.4. An estimate of the probability P(γ) of being in a configuration denoted by γ is shown for four temperatures during a slow anneal. (These estimates, based on a large number of runs, are nearly the theoretical values e^(−E_γ/T).) Early, at high T, each configuration is roughly equal in probability, while late, at low T, the probability is strongly concentrated at the global minima. The expected value of the energy, E[E] (i.e., averaged at temperature T), decreases gradually during the anneal. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
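The gradual drop of E[E] as T falls can be reproduced by brute force for a small network; the standard identity dE[E]/dT = Var(E)/T² says the Boltzmann-averaged energy can only fall as the temperature decreases. The weights below are random illustrative values, not the ones used for the figure:

```python
import itertools
import math
import random

def expected_energy(w, T):
    """Boltzmann-averaged energy E[E] at temperature T, enumerating all
    2^N configurations of E = -(1/2) sum w_ij s_i s_j."""
    n = len(w)
    es = []
    for s in itertools.product([-1, 1], repeat=n):
        es.append(-0.5 * sum(w[i][j] * s[i] * s[j]
                             for i in range(n) for j in range(n)))
    e_min = min(es)  # shift energies so the exponentials cannot overflow
    wts = [math.exp(-(e - e_min) / T) for e in es]
    Z = sum(wts)
    return sum(wt * e for wt, e in zip(wts, es)) / Z

# Random symmetric weights for a 5-node toy network.
random.seed(1)
N = 5
w = [[0.0] * N for _ in range(N)]
for i in range(N):
    for j in range(i + 1, N):
        w[i][j] = w[j][i] = random.uniform(-1.0, 1.0)

cooling = [4.0, 2.0, 1.0, 0.5, 0.25]  # temperatures sampled along an anneal
averages = [expected_energy(w, T) for T in cooling]
```

As T → 0 the average approaches the ground-state energy, matching the concentration of P(γ) at the global minima in the figure.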
Projects
Project Teams
2 students form a team
Implementation of a pattern classification method
Use existing (free) software (e.g., WEKA)
Data import, pre–processing, results (graphics, tables)
Methods (and their parameters!) must be understood
Project report (February 10, 2014)
Projects
Project Topics
k-NN Classifier (different metrics): Kauba, Mayer

Artificial Neural Networks (Boone): Reissig, DiStolfo

Support Vector Machine: Linortner, N.N.

Decision Tree (C4.5)

Simulated Annealing (meta)

Genetic Algorithm (meta, JEvolution): Auracher, Herzog, Kirchgasser
Genetic Programming (optional, JEvolution)
Projects
Project Data Sets
Data Sets
UCI Machine Learning Archive: http://www.ics.uci.edu/~mlearn/MLRepository.html

Ionosphere: Radar Signals

Semeion Handwritten Digit: Digit Recognition

Wine Quality: Wine Critic
Leave–one–out validation (common partitioning)
Confusion matrix
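Leave-one-out validation and the confusion matrix fit together naturally; the sketch below uses a 1-NN classifier on made-up two-class data as an illustrative stand-in for whatever method a team implements:

```python
def leave_one_out(xs, ys, classify, labels):
    """Leave-one-out validation: hold out each sample in turn, train on the
    rest, and tally true-vs-predicted labels in a confusion matrix."""
    cm = {t: {p: 0 for p in labels} for t in labels}
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        cm[ys[i]][classify(train_x, train_y, xs[i])] += 1
    return cm

def nn_classify(train_x, train_y, x):
    """1-NN with squared Euclidean distance (one of the metrics to vary)."""
    d = [sum((a - b) ** 2 for a, b in zip(t, x)) for t in train_x]
    return train_y[d.index(min(d))]

# Two well-separated toy classes.
xs = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
ys = ["a", "a", "a", "b", "b", "b"]
cm = leave_one_out(xs, ys, nn_classify, ["a", "b"])
```

Rows of `cm` are true classes, columns predicted classes; off-diagonal counts expose which classes a method confuses.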