
Introduction to Machine Learning
Lecture 3

Chaohui Wang

October 28, 2019

Main Supervised Learning Approaches

• Discriminative Approaches
  • Linear Discriminant Functions
  • Support Vector Machines (SVMs)
  • Ensemble Methods & Boosting
  • Randomized Trees, Forests & Ferns
  • etc.

• Generative Models
  • Bayesian Networks
  • Markov Random Fields
  • etc.

• Deep Models


Outline of This Lecture

Linear Discriminant Functions
  Discriminant Functions
  Linear Discriminant Functions and Their Learning
  Generalizations of Linear Discriminants

SVMs

Ensemble Methods & Boosting
  Ensembles of Classifiers
  Constructing Ensembles
  Combining Classifiers

Random Forests
  Preliminary: Decision Trees
  Random Forests


Discriminant Functions

• Recap: Bayesian Decision Theory
  • Starting point: Bayes' Theorem:

      p(Ck|x) = p(x|Ck) p(Ck) / p(x) ∝ p(x|Ck) p(Ck)

  • Model conditional probability densities p(x|Ck) as well as priors p(Ck)
  • Minimize the probability of misclassification by maximizing p(Ck|x)


Discriminant Functions

• Now let us study: discriminant functions
  • Directly encode the decision boundary
  • Without explicit modeling of probability densities
  • Minimize the misclassification probability directly

• Key idea: formulate classification in terms of comparisons
  • Discriminant functions: y1(x), . . . , yK(x)
  • Classify x as class Ck if yk(x) > yj(x), ∀j ≠ k
  • Particular case (K = 2): y1(x) > y2(x) ⇔ y1(x) − y2(x) > 0
    → directly model y(x) = y1(x) − y2(x)

• Key problem: how to learn such discriminant functions


Discriminant Functions

• Examples (connection with Bayesian Decision Theory):

      yk(x) = p(Ck|x)
      yk(x) = p(x|Ck) p(Ck)
      yk(x) = log p(x|Ck) + log p(Ck)

  → when K = 2:

      y(x) = p(C1|x) − p(C2|x)
      y(x) = p(x|C1) p(C1) − p(x|C2) p(C2)
      y(x) = log [p(x|C1) / p(x|C2)] + log [p(C1) / p(C2)]


Discriminant Functions

• For a general classification problem
  • Goal: take an input x and assign it to one of K classes Ck
  • Setting of supervised learning: training set X = {x1, . . . , xN} and target values T = {t1, . . . , tN}
  → Learn discriminant function(s) to perform the classification

• How to define the target variable and the target domain?
  • 2-class problem: binary target values, e.g., tn ∈ {0, 1}
  • General K-class problem: 1-of-K coding scheme, e.g., tn = (0, 0, 1, 0)ᵀ

• What's the most fundamental discriminant?
  → Linear discriminant functions
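
A minimal NumPy sketch of the 1-of-K coding scheme above (the helper name one_hot and the integer-label convention are illustrative assumptions):

    import numpy as np

    def one_hot(labels, K):
        """Encode integer class labels 0..K-1 as 1-of-K target vectors t_n."""
        T = np.zeros((len(labels), K))
        T[np.arange(len(labels)), labels] = 1.0
        return T

    # e.g., class index 2 out of K = 4 classes -> (0, 0, 1, 0)
    print(one_hot([2], 4))   # [[0. 0. 1. 0.]]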



Linear Discriminant Functions

• Linear discriminant functions:

      y(x) = wᵀx + w0

  • w: weight vector
  • w0: "bias"
  • For illustration, here we consider the 2-class case
    → if y(x) > 0, decide for class C1, else for C2

• Decision boundary: a hyperplane wᵀx + w0 = 0

• A dataset is linearly separable if it can be perfectly classified by a linear discriminant
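
A minimal NumPy sketch of this 2-class decision rule, assuming the weight vector w and bias w0 are already given (e.g., learned as described later):

    import numpy as np

    def linear_discriminant(x, w, w0):
        """y(x) = w^T x + w0."""
        return w @ x + w0

    def classify_2class(x, w, w0):
        """Decide C1 if y(x) > 0, else C2."""
        return "C1" if linear_discriminant(x, w, w0) > 0 else "C2"

    # toy boundary x1 + x2 - 1 = 0
    w, w0 = np.array([1.0, 1.0]), -1.0
    print(classify_2class(np.array([0.8, 0.9]), w, w0))  # C1
    print(classify_2class(np.array([0.1, 0.2]), w, w0))  # C2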


Linear Discriminant Functions

• Notation

      x = [x1, x2, . . . , xD]ᵀ
      w = [w1, w2, . . . , wD]ᵀ
      D: number of dimensions

      y(x) = wᵀx + w0
           = ∑_{d=1}^{D} wd xd + w0
           = ∑_{d=0}^{D} wd xd,   with x0 = 1 constant


Extension to Multiple Classes

• Two simple strategies: one-versus-the-rest (K − 1 classifiers, each separating one class Ck from all others) and one-versus-one (K(K − 1)/2 pairwise classifiers)

• What difficulties do those strategies have?
  → Both strategies result in regions where the pure classification result (yk > 0) is ambiguous


Extension to Multiple Classes

• One solution:
  • Take K linear functions of the form yk(x) = wkᵀx + wk0
  • Define the decision boundaries directly by deciding for Ck iff yk > yj, ∀j ≠ k
  → This corresponds to a 1-of-K coding scheme, e.g., tn = (0, 0, 1, 0)ᵀ


Extension to Multiple Classes

• K-class discriminant
  • Combination of K linear functions: yk(x) = wkᵀx + wk0
  • Resulting decision hyperplanes:

      (wk − wj)ᵀx + (wk0 − wj0) = 0

  → It can be shown that the decision regions of such a discriminant are always singly connected and convex
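
A minimal NumPy sketch of this K-class rule, assuming the weight vectors are stacked into a K × D matrix W and a length-K bias vector w0 (layout and names are illustrative):

    import numpy as np

    def predict_class(x, W, w0):
        """Evaluate yk(x) = wk^T x + wk0 for all k and decide for the largest."""
        scores = W @ x + w0            # shape (K,)
        return int(np.argmax(scores))  # index of the winning class Ck

    # toy example: K = 3 classes in D = 2 dimensions
    W = np.array([[ 1.0,  0.0],
                  [ 0.0,  1.0],
                  [-1.0, -1.0]])
    w0 = np.zeros(3)
    print(predict_class(np.array([2.0, 0.5]), W, w0))  # 0, i.e., C1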


Matrix Formulation of the Classification Model

• Consider K classes described by linear discriminant functions

      yk(x) = wkᵀx + wk0,   k ∈ {1, . . . , K}

• Group all those functions using vector notation

      y(x) = W̃ᵀx̃

  where W̃ = [w̃1, . . . , w̃K] is the (D + 1) × K matrix whose k-th column is the augmented weight vector w̃k = [wk0, wk1, . . . , wkD]ᵀ, and x̃ = [1, xᵀ]ᵀ is the augmented input

• We directly compare the output y to the target value in the 1-of-K coding scheme, t = [t1, . . . , tK]ᵀ (e.g., [0, 0, 1, 0]ᵀ)
  → But how do we learn the model?


Learning of the Classification Model

• For the entire dataset, we can write

      Y(X̃) = [y(x1), . . . , y(xN)]ᵀ = X̃W̃

  where W̃ = [w̃1, . . . , w̃K] and X̃ = [x̃1, . . . , x̃N]ᵀ

• On the other hand, we can write the target matrix as

      T = [t1, . . . , tN]ᵀ

• Learning principle: choose W̃ such that the difference between Y(X̃) = X̃W̃ and T is minimal
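
A minimal NumPy sketch of assembling X̃ and T from N training pairs (the function name, the row-wise layout, and the integer-label convention are illustrative assumptions):

    import numpy as np

    def build_design_matrices(X, labels, K):
        """X: (N, D) inputs; labels: length-N integer classes 0..K-1.
        Returns X_tilde of shape (N, D+1) with a leading column of ones (x0 = 1)
        and the 1-of-K target matrix T of shape (N, K)."""
        N = X.shape[0]
        X_tilde = np.hstack([np.ones((N, 1)), X])
        T = np.zeros((N, K))
        T[np.arange(N), labels] = 1.0
        return X_tilde, T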


Learning with Least Squares

• A fundamental learning approach: Least Squares
  • Aim: minimize the sum-of-squares error

      sum(sum((X̃W̃ − T).^2))   (in Matlab)

  • Leads to an exact, closed-form solution (set the derivative to 0):

      W̃ = (X̃ᵀX̃)⁻¹X̃ᵀT = X̃†T

    where X̃† = (X̃ᵀX̃)⁻¹X̃ᵀ is the pseudo-inverse
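
A minimal NumPy sketch of this closed-form fit, with X_tilde and T laid out as in the previous sketch; using np.linalg.pinv for the pseudo-inverse X̃† is an implementation choice for numerical robustness:

    import numpy as np

    def fit_least_squares(X_tilde, T):
        """W_tilde = pinv(X_tilde) @ T minimizes the sum-of-squares error."""
        return np.linalg.pinv(X_tilde) @ T      # shape (D+1, K)

    def predict(X_tilde, W_tilde):
        """Pick the class with the largest discriminant value for each row."""
        Y = X_tilde @ W_tilde                   # shape (N, K)
        return np.argmax(Y, axis=1)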


Problems with Least Squares

• Least-squares is very sensitive to outliers!
  → The squared error over-penalizes them, including points that are "too correct"


Problems with Least Squares

• Is least-squares good for any linearly separable problem?
  • Example: 3 classes (red, green, blue)
  → Most green points are misclassified by least-squares!

• Reason for the failure (from the viewpoint of statistics)?
  • Least-squares corresponds to Maximum Likelihood under the assumption of a Gaussian conditional distribution
  • However, it is difficult to linearly transform all the input points X̃ via the same W̃, i.e., Y(X̃) = X̃W̃, such that the distribution of the obtained points of class k is a Gaussian with mean tk



Generalizations of Linear Discriminants

(Recap) Linear discriminant: y(x) = W̃ᵀx̃

→ Generalization in two main aspects:

• Generalized linear models: y(x) = g(W̃ᵀx̃)
• Generalized linear discriminants: y(x) = W̃ᵀφ(x̃)


Generalized linear models

• Generalized linear models: y(x) = g(W̃ᵀx̃)
  • x̃ is first transformed linearly by the matrix W̃ᵀ
  • and then transformed by g(·)
  → g(·) is called the activation function and can be nonlinear
  → If g is monotonic (which is typically the case), the resulting decision boundaries are still linear functions of x


Generalized linear models

• Consider 2 classes:

      p(C1|x) = p(x|C1) p(C1) / [p(x|C1) p(C1) + p(x|C2) p(C2)]
              = 1 / (1 + p(x|C2) p(C2) / (p(x|C1) p(C1)))
              = 1 / (1 + exp(−a)),   where a = ln [p(x|C1) p(C1) / (p(x|C2) p(C2))]

  → Logistic sigmoid function: g(a) = 1 / (1 + exp(−a))
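
A minimal NumPy sketch of the logistic sigmoid applied to the log-odds a (the example inputs are arbitrary):

    import numpy as np

    def sigmoid(a):
        """Logistic sigmoid g(a) = 1 / (1 + exp(-a))."""
        return 1.0 / (1.0 + np.exp(-np.asarray(a, dtype=float)))

    # a = ln[p(x|C1)p(C1) / (p(x|C2)p(C2))]; g(a) recovers p(C1|x)
    print(sigmoid(np.array([-2.0, 0.0, 2.0])))  # ~[0.119 0.5   0.881]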


Generalized linear models

• Consider K > 2 classes:

      p(Ck|x) = p(x|Ck) p(Ck) / ∑j p(x|Cj) p(Cj)
              = exp(ak) / ∑j exp(aj),   where ak = ln [p(x|Ck) p(Ck)]

  → The normalized exponential or softmax function:

      gk(a) = exp(ak) / ∑j exp(aj)

  → Can be regarded as a multiclass generalization of the logistic sigmoid
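
A minimal NumPy sketch of the softmax; subtracting max(a) before exponentiating is a standard numerical-stability choice, not something prescribed by the slide:

    import numpy as np

    def softmax(a):
        """gk(a) = exp(ak) / sum_j exp(aj), shifted by max(a) to avoid overflow."""
        a = np.asarray(a, dtype=float)
        e = np.exp(a - np.max(a))
        return e / np.sum(e)

    print(softmax(np.array([2.0, 1.0, 0.1])))  # ~[0.659 0.242 0.099]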


Other Motivation for Nonlinearity

• Recall least-squares classification
  • One problem: data points that are "too correct" have a strong influence on the decision boundary under a squared-error criterion
  • Reason: the output y(x; w) can grow arbitrarily large for some xn:

      y(x) = wᵀx + w0

  • By choosing a suitable nonlinearity (e.g., a sigmoid), we can limit those influences:

      y(x) = g(wᵀx + w0)


Discussion: Generalized Linear Models

• Advantages:
  • The nonlinearity → more flexibility
  • Can be used to limit the effect of outliers
  • Choice of a sigmoid → a nice probabilistic interpretation

• Disadvantage:
  • Parameter learning: in general, there is no longer a closed-form analytical solution (even with least squares)
  → iterative methods + gradient descent

→ Next: Generalized Linear Discriminants


Generalized Linear Discriminants

• Generalized linear discriminants: y(x) = W̃ᵀφ(x̃)
  • Basis functions: φ(·) = [φ0(·), φ1(·), . . . , φM(·)]ᵀ with φ0 ≡ 1
  • yk(x) = ∑_{m=1}^{M} wkm φm(x̃) + wk0

  → Allow non-linear decision boundaries
  → By choosing the right φ, every continuous function yk(·) can (in principle) be approximated with arbitrary accuracy

• A simple sequential learning approach is available for parameter estimation using gradient descent (see the sketch below)

• Better 2nd-order gradient-descent approaches are available (e.g., Newton-Raphson)
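
A minimal sketch of the sequential gradient-descent idea for a generalized linear discriminant under a squared-error criterion; the quadratic basis, learning rate, epoch count, and toy data are illustrative assumptions rather than the lecture's prescription:

    import numpy as np

    def poly_basis(x):
        """Illustrative basis phi(x) = [1, x, x^2] for scalar x (phi0 = 1)."""
        return np.array([1.0, x, x * x])

    def sequential_train(xs, ts, lr=0.005, epochs=500):
        """Sequential (stochastic) gradient descent on the per-sample squared error
        E_n = 0.5 * (w^T phi(x_n) - t_n)^2, whose gradient is (w^T phi - t) * phi."""
        w = np.zeros(3)
        for _ in range(epochs):
            for x, t in zip(xs, ts):
                phi = poly_basis(x)
                w -= lr * (w @ phi - t) * phi
        return w

    # toy 1-D problem: not separable by a linear function of x,
    # but separable with the quadratic basis (targets +1 / -1)
    xs = np.array([-2.0, -1.5, -0.5, 0.5, 1.5, 2.0])
    ts = np.array([ 1.0,  1.0, -1.0, -1.0, 1.0, 1.0])
    w = sequential_train(xs, ts)
    print(np.sign([w @ poly_basis(x) for x in xs]))  # expected: [ 1.  1. -1. -1.  1.  1.]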


Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Generalized Linear Discriminants

• Limitations/Caveats:
  • In general, the parameters can no longer be learned in closed form
  • The flexibility of the model is limited by the curse of dimensionality
  • Overfitting needs to be avoided; e.g., the linearly separable case often leads to overfitting

Chaohui Wang Introduction to Machine Learning 29 / 73


Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Outline of This Lecture

Linear Discriminant Functions
  Discriminant Functions
  Linear Discriminant Functions and Its Learning
  Generalizations of Linear Discriminants

SVMs

Ensemble Methods & Boosting
  Ensembles of Classifiers
  Constructing Ensembles
  Combining Classifiers

Random Forests
  Preliminary: Decision Trees
  Random Forests

Chaohui Wang Introduction to Machine Learning 30 / 73

Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Generalization and Overfitting

• Goal of classification: predict the class labels of new observations (generalization)
• The classification model is trained on a limited training set
• The further we optimize the model/parameters, the more the training error decreases
• However, at some point the test error goes up again
→ Overfitting to the training set

Chaohui Wang Introduction to Machine Learning 31 / 73


Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Typical Example: Linearly Separable Data

• Overfitting: a common problem with linearly separable data
• All candidate decision boundaries have zero error on the training set
→ Which of the many possible decision boundaries is correct?
• However, in general they perform differently on novel test data
→ Different generalization performance

Chaohui Wang Introduction to Machine Learning 32 / 73


Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Typical Example: Linearly Separable Data

• How to select the classifier with the best generalization performance?
• Intuitively, select the classifier that leaves maximal “safety room” for future data points
→ Refer to Statistical Learning Theory for a theoretical analysis
• This can be obtained by maximizing the margin between positive and negative data points

Chaohui Wang Introduction to Machine Learning 33 / 73


Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Support Vector Machines (SVMs)

• The Support Vector Machine (SVM) builds on the aforementioned idea:
  • Search for the classifier with maximum margin
  • Formulate the training as a convex optimization problem
→ Possible to find the globally optimal solution!

Chaohui Wang Introduction to Machine Learning 34 / 73


Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Support Vector Machines (SVMs)

• Notation:
  • Given $N$ training examples $\{(x_n, t_n)\}_{n=1}^{N}$ with target values $t_n \in \{-1, 1\}$
  • The hyperplane separating the data: $w^T x + b = 0$
• In the case of linearly separable data, the SVM looks for the hyperplane satisfying
  $\arg\min_{w,b}\; \tfrac{1}{2}\|w\|^2 \quad \text{subject to} \quad t_n(w^T x_n + b) \ge 1, \;\forall n$
• Quadratic programming problem with linear constraints
• Formulated using Lagrange multipliers
→ Globally optimal solution (a minimal sketch follows below)

Chaohui Wang Introduction to Machine Learning 35 / 73
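A minimal sketch, assuming scikit-learn (which the slides do not mention): the hard-margin problem above is approximated by a linear soft-margin SVC with a very large penalty C, on linearly separable toy data.

```python
# Sketch only: (approximately) hard-margin linear SVM on separable toy data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 2], 0.5, (50, 2)),      # class t = +1
               rng.normal([-2, -2], 0.5, (50, 2))])   # class t = -1
t = np.array([1] * 50 + [-1] * 50)

clf = SVC(kernel="linear", C=1e6)   # very large C ~ hard-margin behaviour
clf.fit(X, t)

w = clf.coef_[0]        # normal vector of the separating hyperplane
b = clf.intercept_[0]   # offset, i.e. the hyperplane is w^T x + b = 0
print("w =", w, " b =", b)
print("margin width 2/||w|| =", 2 / np.linalg.norm(w))
print("support vectors:", len(clf.support_vectors_))
```

Only the few points closest to the boundary end up as support vectors; all constraints $t_n(w^T x_n + b) \ge 1$ hold on this data.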


Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

SVMs for Non-Separable Data

Non-Separable Data: How to address it?

• Relax the hard constraints $t_n(w^T x_n + b) \ge 1$ for each training data point
→ e.g., via the hinge loss: $\max\big(0,\, 1 - t_n(w^T x_n + b)\big)$
• Jointly optimized together with $w$, e.g.:

  $\tilde{w}, \tilde{b} = \arg\min_{w,b}\Big\{ \frac{1}{N}\sum_{n=1}^{N} \max\big(0,\, 1 - t_n(w^T x_n + b)\big) + \lambda\,\frac{1}{2}\|w\|^2 \Big\}$

(a minimal optimization sketch follows below)

Chaohui Wang Introduction to Machine Learning 36 / 73
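A minimal optimization sketch (not the slides' algorithm), assuming NumPy and arbitrary illustrative constants: full-batch subgradient descent on the regularized hinge-loss objective above.

```python
# Sketch only: subgradient descent on
#   (1/N) * sum_n max(0, 1 - t_n (w^T x_n + b)) + (lambda/2) * ||w||^2
import numpy as np

rng = np.random.default_rng(1)
N = 200
X = np.vstack([rng.normal([1.5, 1.5], 1.0, (N // 2, 2)),
               rng.normal([-1.5, -1.5], 1.0, (N // 2, 2))])  # overlapping classes
t = np.array([1.0] * (N // 2) + [-1.0] * (N // 2))

lam, eta, steps = 0.01, 0.1, 500   # illustrative hyper-parameters
w, b = np.zeros(2), 0.0

for _ in range(steps):
    margins = t * (X @ w + b)
    active = margins < 1                                   # points with non-zero hinge loss
    grad_w = lam * w - (t[active, None] * X[active]).sum(axis=0) / N
    grad_b = -t[active].sum() / N
    w -= eta * grad_w
    b -= eta * grad_b

hinge = np.maximum(0.0, 1.0 - t * (X @ w + b)).mean()
print("objective:", hinge + 0.5 * lam * (w @ w))
print("training accuracy:", np.mean(np.sign(X @ w + b) == t))
```

Because the classes overlap, some hinge terms remain positive at the optimum; the solution trades those violations against the margin term.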


Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Other Extensions

• Multi-class SVM
  • Common approach: reduce the multi-class problem into multiple binary classification problems
  • More advanced methods exist
• Nonlinear SVMs
  • Basic idea: map the input data to some higher-dimensional feature space where the training set is separable
  • Implemented with the Kernel Trick
• Structured SVM: the label space is structured
• ...
(a one-vs-rest / kernel sketch follows below)

Chaohui Wang Introduction to Machine Learning 37 / 73
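A minimal sketch, assuming scikit-learn: a nonlinear, multi-class SVM obtained by the one-vs-rest reduction with an RBF kernel on a toy 3-class problem.

```python
# Sketch only: one-vs-rest reduction with an RBF-kernel SVM per class.
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.8, (60, 2))             # three Gaussian blobs
               for c in ([0, 3], [3, -2], [-3, -2])])
y = np.repeat([0, 1, 2], 60)

clf = OneVsRestClassifier(SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X, y)

print("binary sub-classifiers:", len(clf.estimators_))   # one per class
print("training accuracy:", clf.score(X, y))
print("predictions:", clf.predict([[0.0, 2.5], [-2.5, -2.0]]))
```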


Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Main Supervised Learning Approaches

• Discriminative Approaches
  • Linear Discriminant Functions
  • Support Vector Machines
  • Ensemble Methods & Boosting
  • (Randomized) Decision Trees & Random Forests
  • etc.
• Generative Models
  • Bayesian Networks
  • Markov Random Fields
  • etc.
• Deep Models

Chaohui Wang Introduction to Machine Learning 38 / 73

Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Outline of This Lecture

Linear Discriminant Functions
  Discriminant Functions
  Linear Discriminant Functions and Its Learning
  Generalizations of Linear Discriminants

SVMs

Ensemble Methods & Boosting
  Ensembles of Classifiers
  Constructing Ensembles
  Combining Classifiers

Random Forests
  Preliminary: Decision Trees
  Random Forests

Chaohui Wang Introduction to Machine Learning 39 / 73

Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Up to now...

• We have seen a variety of different classifiers:
  • k-NN
  • Bayes classifiers
  • Linear discriminants
  • SVMs
• Each of them has its strengths and weaknesses...
→ Can we improve performance by combining them?
→ One approach: via Ensemble Methods

Chaohui Wang Introduction to Machine Learning 40 / 73


Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Outline of This Lecture

Linear Discriminant Functions
  Discriminant Functions
  Linear Discriminant Functions and Its Learning
  Generalizations of Linear Discriminants

SVMs

Ensemble Methods & Boosting
  Ensembles of Classifiers
  Constructing Ensembles
  Combining Classifiers

Random Forests
  Preliminary: Decision Trees
  Random Forests

Chaohui Wang Introduction to Machine Learning 41 / 73

Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Ensembles of Classifiers

• Intuition
  • Assume we have L classifiers
  • They are independent (i.e., their errors are uncorrelated)
  • Each one has an error probability p < 0.5 on training data
→ Then a simple majority vote of all classifiers should have a lower error than each individual classifier...

Chaohui Wang Introduction to Machine Learning 42 / 73


Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Ensembles of Classifiers

• Example
  • L classifiers, each with error probability p = 0.3
  • Probability that exactly L′ of the L classifiers make an error: $\binom{L}{L'}\, p^{L'} (1-p)^{L-L'}$
  • For L = 21, the probability that 11 or more classifiers make an error (i.e., that the majority vote is wrong) is about 0.026 (a quick numeric check follows below)

Chaohui Wang Introduction to Machine Learning 43 / 73
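A quick numeric check, assuming SciPy: with L = 21 independent classifiers, each wrong with probability p = 0.3, the majority vote is wrong only when 11 or more of them err simultaneously.

```python
# Sketch only: probability that the majority vote of L independent classifiers is wrong.
from scipy.stats import binom

L, p = 21, 0.3
k_majority = L // 2 + 1                        # 11 errors needed to flip the vote
p_vote_wrong = binom.sf(k_majority - 1, L, p)  # P(X >= 11) for X ~ Binomial(L, p)
print(f"P(majority vote wrong) = {p_vote_wrong:.3f}")   # ~ 0.026
```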


Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Outline of This Lecture

Linear Discriminant Functions
  Discriminant Functions
  Linear Discriminant Functions and Its Learning
  Generalizations of Linear Discriminants

SVMs

Ensemble Methods & Boosting
  Ensembles of Classifiers
  Constructing Ensembles
  Combining Classifiers

Random Forests
  Preliminary: Decision Trees
  Random Forests

Chaohui Wang Introduction to Machine Learning 44 / 73

Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Constructing Ensembles

• How do we get different classifiers?
  • Simplest case: train the same classifier on different data
  • But... where shall we get this additional data from?
  → Recall: training data is very expensive!
• Idea: Subsample the training data
  • Reuse the same training algorithm several times on different subsets of the training data

Chaohui Wang Introduction to Machine Learning 45 / 73


Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Cross-validation

• Cross-Validation: a model validation technique for assessing how the results of a statistical analysis will generalize to an independent dataset
• K-fold cross-validation:
  ▶ Partition the available data into K roughly equal subsets
  ▶ In each run, train a classifier based on K − 1 subsets
  ▶ Estimate the generalization error on the remaining validation set
→ E.g., 5-fold cross-validation (a sketch follows below)

Chaohui Wang Introduction to Machine Learning 46 / 73
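A minimal sketch, assuming scikit-learn: estimating the generalization accuracy of a classifier with 5-fold cross-validation on toy data (the model and data are illustrative choices).

```python
# Sketch only: 5-fold cross-validation.
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([1, 1], 1.0, (100, 2)),
               rng.normal([-1, -1], 1.0, (100, 2))])
y = np.array([1] * 100 + [0] * 100)

kf = KFold(n_splits=5, shuffle=True, random_state=0)         # K roughly equal subsets
scores = cross_val_score(LogisticRegression(), X, y, cv=kf)  # train on K-1 folds, test on the rest

print("per-fold accuracy:", np.round(scores, 3))
print("estimated generalization accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```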


Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Bagging

• Bagging = “Bootstrap aggregating” (Breiman 1996): a special Bayesian model averaging approach to improving the model by combining multiple models trained on a set of randomly generated training sets
• In each run of the training algorithm, randomly select N′ samples (with replacement) from the full set of N training data points
• If N′ = N then, on average, 63.2% of the distinct training points will be represented; the remaining draws are duplicates (a bootstrap sketch follows below)

Chaohui Wang Introduction to Machine Learning 47 / 73
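A minimal sketch, assuming scikit-learn decision stumps as the base models (an illustrative choice, not Breiman's exact setup): each model is trained on a bootstrap sample of size N and the ensemble predicts by majority vote.

```python
# Sketch only: bagging with bootstrap samples and majority voting.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
N = 300
X = np.vstack([rng.normal([1, 1], 1.2, (N // 2, 2)),
               rng.normal([-1, -1], 1.2, (N // 2, 2))])
y = np.array([1] * (N // 2) + [-1] * (N // 2))

L = 25                       # number of bagged models (odd, so no vote ties)
models, coverage = [], []
for _ in range(L):
    idx = rng.integers(0, N, size=N)           # N draws with replacement (N' = N)
    coverage.append(len(np.unique(idx)) / N)   # fraction of distinct points represented
    models.append(DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx]))

votes = np.array([m.predict(X) for m in models]).sum(axis=0)
print("mean fraction of distinct points per bootstrap sample: %.3f" % np.mean(coverage))  # ~ 0.632
print("ensemble training accuracy:", np.mean(np.sign(votes) == y))
```

The ~63.2% figure is just $1 - (1 - 1/N)^N \approx 1 - e^{-1}$, which the `coverage` values reproduce.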


Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Others

• Injecting randomness
  • Many (iterative) learning algorithms need a random initialization (e.g., k-means, EM)
  → Perform multiple runs of the learning algorithm with different random initializations (a sketch follows below)

Chaohui Wang Introduction to Machine Learning 48 / 73
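A minimal sketch, assuming scikit-learn's KMeans: the same algorithm is run with several random initializations, and the run with the lowest within-cluster sum of squares (inertia) is kept.

```python
# Sketch only: multiple k-means runs with different random initializations.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.6, (80, 2)) for c in ([0, 0], [4, 0], [2, 3])])

runs = [KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(X)
        for seed in range(10)]                 # 10 runs, one random init each
best = min(runs, key=lambda km: km.inertia_)   # keep the best local optimum

print("inertia per run:", [round(km.inertia_, 1) for km in runs])
print("best inertia:", round(best.inertia_, 1))
```

For an ensemble, one could also keep all runs and combine them instead of selecting a single one.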


Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Outline of This Lecture

Linear Discriminant Functions
  Discriminant Functions
  Linear Discriminant Functions and Its Learning
  Generalizations of Linear Discriminants

SVMs

Ensemble Methods & Boosting
  Ensembles of Classifiers
  Constructing Ensembles
  Combining Classifiers

Random Forests
  Preliminary: Decision Trees
  Random Forests

Chaohui Wang Introduction to Machine Learning 49 / 73

Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Stacking

• Idea
  • Learn L classifiers (based on the training data)
  • Find a meta-classifier that takes as input the outputs of the L first-level classifiers
• An example
  • Learn L classifiers with leave-one-out cross-validation
  • Interpret the predictions of the L classifiers as an L-dimensional feature vector
  • Learn a “level-2” classifier based on the examples generated this way (a sketch follows below)

Chaohui Wang Introduction to Machine Learning 50 / 73
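A minimal sketch, assuming scikit-learn and with 5-fold out-of-fold predictions standing in for the slides' leave-one-out setup: the predictions of L = 3 first-level classifiers become a 3-dimensional feature vector on which a “level-2” logistic regression is trained.

```python
# Sketch only: stacking with out-of-fold predictions as meta-features.
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([1, 1], 1.3, (150, 2)),
               rng.normal([-1, -1], 1.3, (150, 2))])
y = np.array([1] * 150 + [0] * 150)

level1 = [KNeighborsClassifier(5), GaussianNB(), SVC(kernel="linear")]

# Each column = one classifier's out-of-fold predictions -> L-dimensional feature vector.
meta_features = np.column_stack([cross_val_predict(clf, X, y, cv=5) for clf in level1])

level2 = LogisticRegression().fit(meta_features, y)   # the "level-2" classifier
print("level-2 training accuracy:", level2.score(meta_features, y))

# At test time: refit the level-1 models on all data, feed their outputs to level 2.
x_new = np.array([[0.3, -0.2]])
meta_new = np.column_stack([clf.fit(X, y).predict(x_new) for clf in level1])
print("stacked prediction:", level2.predict(meta_new))
```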


Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Stacking

• Why can this be useful?
  • Simplicity: we may already have several existing classifiers available
  → No need to re-train those; they can just be combined with the additional new ones
  • Correlation between classifiers: the combined classifier can exploit such correlation
  • Feature combination: e.g., we can integrate information from multi-modal data (video, audio, subtitles, etc.) via the following scheme:
    ▶ First train each of the L classifiers on its own input data
    ▶ Then train the combination classifier on the combined input

Chaohui Wang Introduction to Machine Learning 51 / 73


Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Model Averaging: Expected Error

• Combine L predictors $y_l(x)$ of the target output $h(x)$
• The committee prediction is given by
  $y_{COM}(x) = \frac{1}{L}\sum_{l=1}^{L} y_l(x)$
• Each output can be written as the true value plus an error term:
  $y_l(x) = h(x) + \epsilon_l(x)$
• Thus, the average sum-of-squares error of model $l$ takes the form:
  $\mathbb{E}_x[\epsilon_l(x)^2] = \mathbb{E}_x[\{y_l(x) - h(x)\}^2]$

Chaohui Wang Introduction to Machine Learning 52 / 73


Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Model Averaging: Expected Error

• Average error of individual models
  $E_{AV} = \frac{1}{L}\sum_{l=1}^{L} \mathbb{E}_x\big[\epsilon_l(x)^2\big]$
• Average error of committee
  $E_{COM} = \mathbb{E}_x\Big[\Big\{\frac{1}{L}\sum_{l=1}^{L} y_l(x) - h(x)\Big\}^2\Big] = \mathbb{E}_x\Big[\frac{1}{L^2}\Big\{\sum_{l=1}^{L}\big(y_l(x) - h(x)\big)\Big\}^2\Big] = \mathbb{E}_x\Big[\frac{1}{L^2}\Big\{\sum_{l=1}^{L}\epsilon_l(x)\Big\}^2\Big]$

Chaohui Wang Introduction to Machine Learning 53 / 73


Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Model Averaging: Expected Error

• Average error of individual models
  $E_{AV} = \frac{1}{L}\sum_{l=1}^{L} \mathbb{E}_x\big[\epsilon_l(x)^2\big]$
• Average error of committee
  $E_{COM} = \mathbb{E}_x\Big[\frac{1}{L^2}\Big\{\sum_{l=1}^{L}\epsilon_l(x)\Big\}^2\Big]$
• Assumptions
  • Errors have zero mean: $\mathbb{E}_x[\epsilon_l(x)] = 0$
  • Errors are uncorrelated: $\mathbb{E}_x[\epsilon_l(x)\epsilon_{l'}(x)] = 0$ for $l \neq l'$
→ Then:
  $E_{COM} = \frac{1}{L} E_{AV}$

Chaohui Wang Introduction to Machine Learning 54 / 73


Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Model Averaging: Expected Error

• Average error of committee?
  $E_{COM} = \frac{1}{L} E_{AV}$
• This suggests that the average error of a model can be reduced by a factor of L simply by averaging L versions of the model!
→ This sounds almost too good to be true...
• Can you see where the problem is?
• Unfortunately, this result depends on the assumption that the errors are all uncorrelated
• In practice, they will typically be highly correlated
• Still, it can be shown (from the convexity of the squared error, i.e., Jensen's inequality) that
  $E_{COM} \le E_{AV}$
  (a small simulation follows below)

Chaohui Wang Introduction to Machine Learning 55 / 73
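A small simulation sketch, assuming NumPy, of the two statements above: with uncorrelated zero-mean errors the committee error is close to $E_{AV}/L$, while with strongly correlated errors only the weaker bound $E_{COM} \le E_{AV}$ remains.

```python
# Sketch only: committee error vs. average individual error.
import numpy as np

rng = np.random.default_rng(0)
L, n_points = 10, 100_000

def committee_vs_average(errors):
    """errors: (L, n_points) array of epsilon_l(x). Returns (E_AV, E_COM)."""
    e_av = np.mean(errors ** 2)                 # average squared error of the members
    e_com = np.mean(errors.mean(axis=0) ** 2)   # squared error of the averaged prediction
    return e_av, e_com

# Case 1: independent, zero-mean errors.
indep = rng.normal(0.0, 1.0, size=(L, n_points))
e_av, e_com = committee_vs_average(indep)
print("uncorrelated: E_AV=%.3f  E_COM=%.3f  E_AV/L=%.3f" % (e_av, e_com, e_av / L))

# Case 2: highly correlated errors (a shared component dominates).
shared = rng.normal(0.0, 1.0, size=n_points)
corr = 0.9 * shared + 0.1 * rng.normal(0.0, 1.0, size=(L, n_points))
e_av, e_com = committee_vs_average(corr)
print("correlated:   E_AV=%.3f  E_COM=%.3f  (still E_COM <= E_AV)" % (e_av, e_com))
```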


Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Boosting

• Main idea
  • Sequential classifier selection: successively train component classifiers, each on a subset of the training data
  → Select the subset that is most informative given the current set of classifiers

Chaohui Wang Introduction to Machine Learning 56 / 73


Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Boosting

• Algorithm (3-component classifier) [Schapire 1989]
  1. Sample N1 < N training examples (without replacement) from the training set D to get set D1
     → Train weak classifier C1 on D1
  2. Sample N2 < N training examples (without replacement), half of which were misclassified by C1, to get set D2
     → Train weak classifier C2 on D2
  3. Choose all data in D on which C1 and C2 disagree to get set D3
     → Train weak classifier C3 on D3
  4. Get the final classifier output by majority voting of C1, C2, and C3
→ Problem: How should we choose the number of samples N1?
(a minimal sketch of this procedure follows below)

Chaohui Wang Introduction to Machine Learning 57 / 73
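A minimal sketch of the 3-component procedure above (the subset sizes and sampling details are simplified illustrative choices, not Schapire's exact construction), with scikit-learn decision stumps as the weak classifiers C1, C2, C3.

```python
# Sketch only: Schapire-style boosting with three weak classifiers.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
N = 600
X = np.vstack([rng.normal([1, 1], 1.5, (N // 2, 2)),
               rng.normal([-1, -1], 1.5, (N // 2, 2))])   # overlapping classes
t = np.array([1] * (N // 2) + [-1] * (N // 2))

def weak():
    return DecisionTreeClassifier(max_depth=1)   # decision stump

# 1) Train C1 on a random subset D1 (N1 < N, without replacement).
N1 = N // 3
idx1 = rng.choice(N, size=N1, replace=False)
C1 = weak().fit(X[idx1], t[idx1])

# 2) Build D2 so that roughly half of it was misclassified by C1; train C2.
wrong = np.flatnonzero(C1.predict(X) != t)
right = np.flatnonzero(C1.predict(X) == t)
half = min(len(wrong), len(right), N1 // 2)
idx2 = np.concatenate([rng.choice(wrong, half, replace=False),
                       rng.choice(right, half, replace=False)])
C2 = weak().fit(X[idx2], t[idx2])

# 3) D3 = all points on which C1 and C2 disagree; train C3 on it.
idx3 = np.flatnonzero(C1.predict(X) != C2.predict(X))
C3 = weak().fit(X[idx3], t[idx3]) if len(idx3) > 0 else C1

# 4) Final output: majority vote of C1, C2 and C3.
votes = C1.predict(X) + C2.predict(X) + C3.predict(X)
print("C1 alone, training accuracy:", C1.score(X, t))
print("majority vote, training accuracy:", np.mean(np.sign(votes) == t))
```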

Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Boosting

• How should we choose the number of samples N1?
• Ideally, the number of samples should be roughly equal in all 3 component classifiers
• Reasonable first guess: N1 ≈ N/3
• However, if the problem is very simple
  → C1 will explain most of the data
  → N2 and N3 will be very small
  → Not all of the data will be used effectively
• Similarly, if the problem is extremely hard
  → C1 will explain only a small part of the data
  → N2 may be unacceptably large
• In practice, we may need to run the boosting procedure a few times and adjust N1 in order to explore the full training set

Chaohui Wang Introduction to Machine Learning 58 / 73

Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

AdaBoost: Idea

• AdaBoost: “Adaptive Boosting” [Freund & Schapire, 1996]
• Main idea: Reweight misclassified training examples, instead of resampling as in the original boosting algorithm
  • Increase the chance of being selected in a sampled training set
  • Or: increase the misclassification cost when training on the full set

Chaohui Wang Introduction to Machine Learning 59 / 73

Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

AdaBoost: Formulation

• Formulation: Construct a strong classifier as a thresholded linear combination of the weighted weak classifiers:

  $H(x) = \mathrm{sign}\!\left(\sum_{l=1}^{L} \alpha_l h_l(x)\right)$

• Notations:
  • $H(x)$: strong classifier (final classifier)
  • $h_l(x)$ ($l \in \{1, \ldots, L\}$): weak classifier (base classifier)
    → Condition: < 50% training error over any distribution
    → Why? Discover the reason after studying the algorithm on the next slide
  • $\alpha_l$ ($l \in \{1, \ldots, L\}$): weight for weak classifier $l$
    → Also learned during the training process

Chaohui Wang Introduction to Machine Learning 60 / 73

Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

AdaBoost: Algorithm

• Initialization: Set $w_n^{(1)} = 1/N$ ($n = 1, \ldots, N$)
• For $l = 1, \ldots, L$, do iterations:
  1. Train a new weak classifier $h_l(x)$ using the current weights $W^{(l)}$ by minimizing the weighted error function
     (where $[\cdot]$ equals 1 if the condition holds and 0 otherwise):

     $J_l = \sum_{n=1}^{N} w_n^{(l)} \, [h_l(x_n) \neq t_n]$

  2. Estimate the weighted error of this classifier on all data:

     $\varepsilon_l = \dfrac{\sum_{n=1}^{N} w_n^{(l)} \, [h_l(x_n) \neq t_n]}{\sum_{n=1}^{N} w_n^{(l)}}$

  3. Calculate the weight for classifier $h_l(x)$: $\alpha_l = \ln \dfrac{1 - \varepsilon_l}{\varepsilon_l}$
  4. Update the weighting coefficients of all training samples:

     $w_n^{(l+1)} = w_n^{(l)} \exp\!\left(\alpha_l \, [h_l(x_n) \neq t_n]\right)$

Chaohui Wang Introduction to Machine Learning 61 / 73
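
As a complement to the two preceding slides, here is a minimal sketch of the AdaBoost training loop and of the strong classifier $H(x) = \mathrm{sign}(\sum_l \alpha_l h_l(x))$. Assumptions not fixed by the slides: binary labels $t_n \in \{-1, +1\}$, scikit-learn decision stumps as the weak classifiers, the weighted error minimized via the sample_weight argument, and a small clipping of the error to avoid division by zero.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, t, L=50):
    N = len(t)
    w = np.full(N, 1.0 / N)               # w_n^(1) = 1/N
    classifiers, alphas = [], []
    for _ in range(L):
        # 1) Train h_l on the weighted data (minimizes the weighted error J_l)
        h = DecisionTreeClassifier(max_depth=1).fit(X, t, sample_weight=w)
        miss = (h.predict(X) != t)        # indicator [h_l(x_n) != t_n]
        # 2) Weighted error eps_l of h_l on all data
        eps = np.clip(np.dot(w, miss) / np.sum(w), 1e-10, 1 - 1e-10)
        # 3) Classifier weight alpha_l = ln((1 - eps_l) / eps_l)
        alpha = np.log((1 - eps) / eps)
        # 4) Increase the weights of the misclassified samples only
        w = w * np.exp(alpha * miss)
        classifiers.append(h)
        alphas.append(alpha)
    return classifiers, alphas

def adaboost_predict(classifiers, alphas, X):
    scores = sum(a * h.predict(X) for h, a in zip(classifiers, alphas))
    return np.sign(scores)                # strong classifier H(x)
```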

Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Summary: AdaBoost

• Properties
  • Simple combination of multiple classifiers
  • Easy to implement
  • Can be used with many different types of classifiers
    → None of them needs to be too good on its own; they only have to be slightly better than chance
  • Commonly used in many areas
  • Empirically good generalization capabilities
• Limitations
  • Original AdaBoost is sensitive to misclassified training data points (e.g., noisy or mislabeled samples, which keep receiving larger weights)
    → Improvements exist, e.g., GentleBoost
  • Binary classifier, although multiclass extensions are available

Chaohui Wang Introduction to Machine Learning 62 / 73

Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Applications: Viola-Jones Face Detector

• Viola-Jones Face Detector [Viola & Jones 2004]

Chaohui Wang Introduction to Machine Learning 63 / 73

Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Outline of This Lecture

Linear Discriminant FunctionsDiscriminant FunctionsLinear Discriminant Functions and Its LearningGeneralizations of Linear Discriminants

SVMs

Ensemble Methods & BoostingEnsembles of ClassifiersConstructing EnsemblesCombining Classifiers

Random ForestsPreliminary: Decision TreesRandom Forests

Chaohui Wang Introduction to Machine Learning 64 / 73

Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Outline of This Lecture

Linear Discriminant FunctionsDiscriminant FunctionsLinear Discriminant Functions and Its LearningGeneralizations of Linear Discriminants

SVMs

Ensemble Methods & BoostingEnsembles of ClassifiersConstructing EnsemblesCombining Classifiers

Random ForestsPreliminary: Decision TreesRandom Forests

Chaohui Wang Introduction to Machine Learning 65 / 73

Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Decision Trees

• A tree
  • Each node specifies a test on some attribute
  • Each branch corresponds to a possible (attribute) value
  → Example: is Sunday suitable for playing tennis?
• A tree ⇒ a set of if-then rules ⇒ a logical expression
  → For the question above:
    (Outlook = Sunny ∩ Humidity = Normal) ∪ (Outlook = Overcast) ∪ (Outlook = Rain ∩ Wind = Weak)

Chaohui Wang Introduction to Machine Learning 66 / 73
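
As a tiny illustration of “a tree ⇒ a set of if-then rules ⇒ a logical expression”, the expression above can be written directly as code; the attribute names follow the classic play-tennis example and are only illustrative.

```python
def play_tennis(outlook: str, humidity: str, wind: str) -> bool:
    # (Outlook = Sunny AND Humidity = Normal) OR (Outlook = Overcast)
    # OR (Outlook = Rain AND Wind = Weak)
    return ((outlook == "Sunny" and humidity == "Normal")
            or outlook == "Overcast"
            or (outlook == "Rain" and wind == "Weak"))

print(play_tennis("Sunny", "High", "Weak"))       # False
print(play_tennis("Overcast", "High", "Strong"))  # True
```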

Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Decision Trees: Training

• Common procedure: Greedy top-down growing
  1. Start at the root node
  2. Progressively split the training data into smaller and smaller subsets
  3. In each step, pick the best attribute to split the data
  4. If the resulting subsets are pure (only one label) or if no further attribute can be found that splits them, terminate the tree
  5. Else, recursively apply the procedure to the subsets
• Study & formalization of the different design choices
  → E.g., the CART framework (Classification And Regression Trees) [Breiman et al. 1984]

Chaohui Wang Introduction to Machine Learning 67 / 73
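
Below is a minimal sketch of this greedy top-down procedure for categorical attributes, using entropy / information gain as the “best attribute” criterion (one common choice; the slide leaves the criterion open) and no pruning. The data layout, function names and stopping rules are assumptions for illustration.

```python
import numpy as np
from collections import Counter

def entropy(y):
    p = np.array(list(Counter(y).values()), dtype=float)
    p /= p.sum()
    return -np.sum(p * np.log2(p))

def grow_tree(X, y, attributes):
    """X: list of tuples of categorical values, y: labels,
    attributes: list of attribute indices still available for splitting."""
    # Terminate on pure subsets or when no further attribute is available
    if len(set(y)) == 1 or not attributes:
        return Counter(y).most_common(1)[0][0]        # leaf: majority label

    def gain(a):                                      # information gain of attribute a
        rem = 0.0
        for v in {row[a] for row in X}:
            sub = [yi for row, yi in zip(X, y) if row[a] == v]
            rem += len(sub) / len(y) * entropy(sub)
        return entropy(y) - rem

    best = max(attributes, key=gain)                  # pick the best attribute
    node = {best: {}}
    for v in {row[best] for row in X}:                # recurse on each subset
        idx = [i for i, row in enumerate(X) if row[best] == v]
        node[best][v] = grow_tree([X[i] for i in idx], [y[i] for i in idx],
                                  [a for a in attributes if a != best])
    return node

# Usage: tree = grow_tree(X, y, attributes=list(range(len(X[0]))))
```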

Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Decision Trees: Issues

• Finding the optimal decision tree is NP-hard, so trees are grown greedily in practice; the resulting trees are:
  • Prone to overfitting (the learned model fits the training data very well, but generalizes poorly to unseen data at the testing stage)
  • Of high complexity

→ Next: Randomized decision trees & Random forests

Chaohui Wang Introduction to Machine Learning 68 / 73

Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Outline of This Lecture

Linear Discriminant FunctionsDiscriminant FunctionsLinear Discriminant Functions and Its LearningGeneralizations of Linear Discriminants

SVMs

Ensemble Methods & BoostingEnsembles of ClassifiersConstructing EnsemblesCombining Classifiers

Random ForestsPreliminary: Decision TreesRandom Forests

Chaohui Wang Introduction to Machine Learning 69 / 73

Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Randomized Decision Trees

• Basic idea: randomize attribute selection
  • No longer look for the globally optimal split
  • Instead, randomly choose a subset of K attributes based on which to perform the split
  • Choose the best splitting attribute among them, e.g., by maximizing the information gain (i.e., reducing entropy)

→ On its own, the obtained tree is not as powerful a classifier . . .
→ Regard it as a weak classifier & build multiple ones (“Random Forest”)
→ What technique can be used here? “Ensemble Methods”

Chaohui Wang Introduction to Machine Learning 70 / 73
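
A minimal sketch of the randomized split selection described above: instead of scanning all attributes, draw a random subset of K attribute indices and keep the best one according to some scoring function (e.g., the information gain from the previous sketch). The signature, K and the rng seed are assumptions for illustration.

```python
import numpy as np

def random_split_attribute(X, y, n_attributes, score, K=3, rng=np.random.default_rng(0)):
    """Pick the best of K randomly chosen attributes, ranked by score(a, X, y)."""
    candidates = rng.choice(n_attributes, size=min(K, n_attributes), replace=False)
    return max(candidates, key=lambda a: score(a, X, y))
```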

Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Random Forest

• Random forest: composed of a set of L randomized decision trees
  • Internal node: a random test on an input feature vector
  • Leaf node: stores a histogram $H = (h_1, \ldots, h_K)$ (K: the number of classes)
    → obtained during the training phase by counting the number of labeled feature vectors that arrive at this leaf

Chaohui Wang Introduction to Machine Learning 71 / 73

Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Random Forest

• Random forest: composed of a set of L randomized decision trees
  • In the testing phase, a feature vector is dropped into each decision tree $l$ and reaches a leaf $\tau_l$
  • The probabilities of all the trees are averaged to obtain the probability over the forest:

    $p(H \mid x) = \frac{1}{L} \sum_{l=1}^{L} p_l(H \mid x)$

  → $p_l(H \mid x)$ is the normalized histogram at the leaf $\tau_l$ of tree $l$

Chaohui Wang Introduction to Machine Learning 72 / 73
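
A minimal sketch of this prediction rule: each trained tree is viewed as a function mapping a feature vector to the class histogram stored at the leaf it reaches; the forest normalizes these histograms and averages them over the L trees. The list-of-callables representation and the hard-coded example histograms are illustrative assumptions (scikit-learn's RandomForestClassifier.predict_proba performs essentially the same averaging over its trees).

```python
import numpy as np

def forest_predict_proba(trees, x):
    probs = []
    for tree in trees:
        hist = np.asarray(tree(x), dtype=float)   # leaf histogram H = (h_1, ..., h_K)
        probs.append(hist / hist.sum())           # p_l(H | x): normalized histogram
    return np.mean(probs, axis=0)                 # p(H | x) = (1/L) * sum_l p_l(H | x)

# Example with two hypothetical 3-class trees returning fixed leaf histograms:
trees = [lambda x: [8, 1, 1], lambda x: [2, 6, 2]]
print(forest_predict_proba(trees, x=None))        # -> [0.5, 0.35, 0.15]
```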

Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Random Forests: Properties

• Advantages:
  1. Very simple algorithm
  2. Straightforward to deal with multiple classes
  3. Resistant to overfitting - generalizes well to new data
  4. Fast training
• Limitations:
  1. Memory consumption: decision tree construction uses much more memory
  2. Well-suited for problems with little training data, but offers little performance gain when the training data is really large

Chaohui Wang Introduction to Machine Learning 73 / 73
