
Introduction to Machine Learning
Lecture 3

Chaohui Wang

October 28, 2019

Main Supervised Learning Approaches

• Discriminative Approaches
  • Linear Discriminant Functions
  • Support Vector Machines (SVMs)
  • Ensemble Methods & Boosting
  • Randomized Trees, Forests & Ferns
  • etc.

• Generative Models
  • Bayesian Networks
  • Markov Random Fields
  • etc.

• Deep Models


Outline of This Lecture

Linear Discriminant Functions
  Discriminant Functions
  Linear Discriminant Functions and Their Learning
  Generalizations of Linear Discriminants

SVMs

Ensemble Methods & Boosting
  Ensembles of Classifiers
  Constructing Ensembles
  Combining Classifiers

Random Forests
  Preliminary: Decision Trees
  Random Forests


Discriminant Functions

• Recap: Bayesian Decision Theory
  • Starting point: Bayes' Theorem:

      p(Ck|x) = p(x|Ck) p(Ck) / p(x) ∝ p(x|Ck) p(Ck)

  • Model conditional probability densities p(x|Ck) as well as priors p(Ck)
  • Minimize the probability of misclassification by maximizing p(Ck|x)


Discriminant Functions

• Now let us study: discriminant functions
  • Directly encode the decision boundary
  • Without explicit modeling of probability densities
  • Minimize the misclassification probability directly

• Key idea: formulate classification in terms of comparisons
  • Discriminant functions: y1(x), . . . , yK(x)
  • Classify x as class Ck if yk(x) > yj(x), ∀j ≠ k
  • Particular case (K = 2): y1(x) > y2(x) ⇔ y1(x) − y2(x) > 0
    → directly model y(x) = y1(x) − y2(x)

• Key problem: how to learn such discriminant functions


Discriminant Functions

• Examples (connection with Bayesian Decision Theory):

      yk(x) = p(Ck|x)
      yk(x) = p(x|Ck) p(Ck)
      yk(x) = log p(x|Ck) + log p(Ck)

  → when K = 2:

      y(x) = p(C1|x) − p(C2|x)
      y(x) = p(x|C1) p(C1) − p(x|C2) p(C2)
      y(x) = log [p(x|C1) / p(x|C2)] + log [p(C1) / p(C2)]


Discriminant Functions

• For a general classification problem
  • Goal: take an input x and assign it to one of K classes Ck
  • Setting of supervised learning: training set X = {x1, . . . , xN} and target values T = {t1, . . . , tN}
  → Learn discriminant function(s) to perform the classification

• How to define the target variable and the target domain?
  • 2-class problem: binary target values, e.g., tn ∈ {0, 1}
  • General K-class problem: 1-of-K coding scheme, e.g., tn = (0, 0, 1, 0)ᵀ

• What's the most fundamental discriminant?
  → Linear discriminant functions
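
A minimal NumPy sketch of the 1-of-K coding scheme above (the helper name one_hot and the integer-label convention are illustrative assumptions):

    import numpy as np

    def one_hot(labels, K):
        """Encode integer class labels 0..K-1 as 1-of-K target vectors t_n."""
        T = np.zeros((len(labels), K))
        T[np.arange(len(labels)), labels] = 1.0
        return T

    # e.g., class index 2 out of K = 4 classes -> (0, 0, 1, 0)
    print(one_hot([2], 4))   # [[0. 0. 1. 0.]]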



Linear Discriminant Functions

• Linear discriminant functions:

      y(x) = wᵀx + w0

  • w: weight vector
  • w0: "bias"
  • For illustration, here we consider the 2-class case
    → if y(x) > 0, decide for class C1, else for C2

• Decision boundary: a hyperplane wᵀx + w0 = 0

• A dataset is linearly separable if it can be perfectly classified by a linear discriminant
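
A minimal NumPy sketch of this 2-class decision rule, assuming the weight vector w and bias w0 are already given (e.g., learned as described later):

    import numpy as np

    def linear_discriminant(x, w, w0):
        """y(x) = w^T x + w0."""
        return w @ x + w0

    def classify_2class(x, w, w0):
        """Decide C1 if y(x) > 0, else C2."""
        return "C1" if linear_discriminant(x, w, w0) > 0 else "C2"

    # toy boundary x1 + x2 - 1 = 0
    w, w0 = np.array([1.0, 1.0]), -1.0
    print(classify_2class(np.array([0.8, 0.9]), w, w0))  # C1
    print(classify_2class(np.array([0.1, 0.2]), w, w0))  # C2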


Linear Discriminant Functions

• Notation

      x = [x1, x2, . . . , xD]ᵀ
      w = [w1, w2, . . . , wD]ᵀ
      D: number of dimensions

      y(x) = wᵀx + w0
           = ∑_{d=1}^{D} wd xd + w0
           = ∑_{d=0}^{D} wd xd,   with x0 = 1 constant


Extension to Multiple Classes

• Two simple strategies: one-versus-the-rest (K − 1 classifiers, each separating one class Ck from all others) and one-versus-one (K(K − 1)/2 pairwise classifiers)

• What difficulties do those strategies have?
  → Both strategies result in regions where the pure classification result (yk > 0) is ambiguous


Extension to Multiple Classes

• One solution:
  • Take K linear functions of the form yk(x) = wkᵀx + wk0
  • Define the decision boundaries directly by deciding for Ck iff yk > yj, ∀j ≠ k
  → This corresponds to a 1-of-K coding scheme, e.g., tn = (0, 0, 1, 0)ᵀ


Extension to Multiple Classes

• K-class discriminant
  • Combination of K linear functions: yk(x) = wkᵀx + wk0
  • Resulting decision hyperplanes:

      (wk − wj)ᵀx + (wk0 − wj0) = 0

  → It can be shown that the decision regions of such a discriminant are always singly connected and convex
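
A minimal NumPy sketch of this K-class rule, assuming the weight vectors are stacked into a K × D matrix W and a length-K bias vector w0 (layout and names are illustrative):

    import numpy as np

    def predict_class(x, W, w0):
        """Evaluate yk(x) = wk^T x + wk0 for all k and decide for the largest."""
        scores = W @ x + w0            # shape (K,)
        return int(np.argmax(scores))  # index of the winning class Ck

    # toy example: K = 3 classes in D = 2 dimensions
    W = np.array([[ 1.0,  0.0],
                  [ 0.0,  1.0],
                  [-1.0, -1.0]])
    w0 = np.zeros(3)
    print(predict_class(np.array([2.0, 0.5]), W, w0))  # 0, i.e., C1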


Matrix Formulation of the Classification Model

• Consider K classes described by linear discriminant functions

      yk(x) = wkᵀx + wk0,   k ∈ {1, . . . , K}

• Group all those functions using vector notation

      y(x) = W̃ᵀx̃

  where W̃ = [w̃1, . . . , w̃K] is the (D + 1) × K matrix whose k-th column is the augmented weight vector w̃k = [wk0, wk1, . . . , wkD]ᵀ, and x̃ = [1, xᵀ]ᵀ is the augmented input

• We directly compare the output y to the target value in the 1-of-K coding scheme, t = [t1, . . . , tK]ᵀ (e.g., [0, 0, 1, 0]ᵀ)
  → But how do we learn the model?


Learning of the Classification Model

• For the entire dataset, we can write

      Y(X̃) = [y(x1), . . . , y(xN)]ᵀ = X̃W̃

  where W̃ = [w̃1, . . . , w̃K] and X̃ = [x̃1, . . . , x̃N]ᵀ

• On the other hand, we can write the target matrix as

      T = [t1, . . . , tN]ᵀ

• Learning principle: choose W̃ such that the difference between Y(X̃) = X̃W̃ and T is minimal
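
A minimal NumPy sketch of assembling X̃ and T from N training pairs (the function name, the row-wise layout, and the integer-label convention are illustrative assumptions):

    import numpy as np

    def build_design_matrices(X, labels, K):
        """X: (N, D) inputs; labels: length-N integer classes 0..K-1.
        Returns X_tilde of shape (N, D+1) with a leading column of ones (x0 = 1)
        and the 1-of-K target matrix T of shape (N, K)."""
        N = X.shape[0]
        X_tilde = np.hstack([np.ones((N, 1)), X])
        T = np.zeros((N, K))
        T[np.arange(N), labels] = 1.0
        return X_tilde, T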


Learning with Least Squares

• A fundamental learning approach: Least Squares
  • Aim: minimize the sum-of-squares error

      sum(sum((X̃W̃ − T).^2))   (in Matlab)

  • Leads to an exact, closed-form solution (set the derivative to 0):

      W̃ = (X̃ᵀX̃)⁻¹X̃ᵀT = X̃†T

    where X̃† = (X̃ᵀX̃)⁻¹X̃ᵀ is the pseudo-inverse
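
A minimal NumPy sketch of this closed-form fit, with X_tilde and T laid out as in the previous sketch; using np.linalg.pinv for the pseudo-inverse X̃† is an implementation choice for numerical robustness:

    import numpy as np

    def fit_least_squares(X_tilde, T):
        """W_tilde = pinv(X_tilde) @ T minimizes the sum-of-squares error."""
        return np.linalg.pinv(X_tilde) @ T      # shape (D+1, K)

    def predict(X_tilde, W_tilde):
        """Pick the class with the largest discriminant value for each row."""
        Y = X_tilde @ W_tilde                   # shape (N, K)
        return np.argmax(Y, axis=1)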


Problems with Least Squares

• Least-squares is very sensitive to outliers!
  → The squared error over-penalizes them, including points that are "too correct"


Problems with Least Squares

• Is least-squares good for any linearly separable problem?
  • Example: 3 classes (red, green, blue)
  → Most green points are misclassified by least-squares!

• Reason for the failure (from the viewpoint of statistics)?
  • Least-squares corresponds to Maximum Likelihood under the assumption of a Gaussian conditional distribution
  • However, it is difficult to linearly transform all the input points X̃ via the same W̃, i.e., Y(X̃) = X̃W̃, such that the distribution of the obtained points of class k is a Gaussian with mean tk



Generalizations of Linear Discriminants

(Recap) Linear discriminant: y(x) = W̃ᵀx̃

→ Generalization in two main aspects:

• Generalized linear models: y(x) = g(W̃ᵀx̃)
• Generalized linear discriminants: y(x) = W̃ᵀφ(x̃)


Generalized linear models

• Generalized linear models: y(x) = g(W̃ᵀx̃)
  • x̃ is first transformed linearly by the matrix W̃ᵀ
  • and then transformed by g(·)
  → g(·) is called the activation function and can be nonlinear
  → If g is monotonic (which is typically the case), the resulting decision boundaries are still linear functions of x


Generalized linear models

• Consider 2 classes:

      p(C1|x) = p(x|C1) p(C1) / [p(x|C1) p(C1) + p(x|C2) p(C2)]
              = 1 / (1 + p(x|C2) p(C2) / (p(x|C1) p(C1)))
              = 1 / (1 + exp(−a)),   where a = ln [p(x|C1) p(C1) / (p(x|C2) p(C2))]

  → Logistic sigmoid function: g(a) = 1 / (1 + exp(−a))
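
A minimal NumPy sketch of the logistic sigmoid applied to the log-odds a (the example inputs are arbitrary):

    import numpy as np

    def sigmoid(a):
        """Logistic sigmoid g(a) = 1 / (1 + exp(-a))."""
        return 1.0 / (1.0 + np.exp(-np.asarray(a, dtype=float)))

    # a = ln[p(x|C1)p(C1) / (p(x|C2)p(C2))]; g(a) recovers p(C1|x)
    print(sigmoid(np.array([-2.0, 0.0, 2.0])))  # ~[0.119 0.5   0.881]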


Generalized linear models

• Consider K > 2 classes:

      p(Ck|x) = p(x|Ck) p(Ck) / ∑j p(x|Cj) p(Cj)
              = exp(ak) / ∑j exp(aj),   where ak = ln [p(x|Ck) p(Ck)]

  → The normalized exponential or softmax function:

      gk(a) = exp(ak) / ∑j exp(aj)

  → Can be regarded as a multiclass generalization of the logistic sigmoid
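
A minimal NumPy sketch of the softmax; subtracting max(a) before exponentiating is a standard numerical-stability choice, not something prescribed by the slide:

    import numpy as np

    def softmax(a):
        """gk(a) = exp(ak) / sum_j exp(aj), shifted by max(a) to avoid overflow."""
        a = np.asarray(a, dtype=float)
        e = np.exp(a - np.max(a))
        return e / np.sum(e)

    print(softmax(np.array([2.0, 1.0, 0.1])))  # ~[0.659 0.242 0.099]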


Other Motivation for Nonlinearity

• Recall least-squares classification
  • One problem: data points that are "too correct" have a strong influence on the decision boundary under a squared-error criterion
  • Reason: the output y(x; w) can grow arbitrarily large for some xn:

      y(x) = wᵀx + w0

  • By choosing a suitable nonlinearity (e.g., a sigmoid), we can limit those influences:

      y(x) = g(wᵀx + w0)


Discussion: Generalized Linear Models

• Advantages:
  • The nonlinearity → more flexibility
  • Can be used to limit the effect of outliers
  • Choice of a sigmoid → a nice probabilistic interpretation

• Disadvantage:
  • Parameter learning: in general, there is no longer a closed-form analytical solution (even with least squares)
  → iterative methods + gradient descent

→ Next: Generalized Linear Discriminants


Generalized Linear Discriminants

• Generalized linear discriminants: y(x) = W̃ᵀφ(x̃)
  • Basis functions: φ(·) = [φ0(·), φ1(·), . . . , φM(·)]ᵀ with φ0 ≡ 1
  • yk(x) = ∑_{m=1}^{M} wkm φm(x̃) + wk0

  → Allow non-linear decision boundaries
  → By choosing the right φ, every continuous function yk(·) can (in principle) be approximated with arbitrary accuracy

• A simple sequential learning approach is available for parameter estimation using gradient descent (see the sketch below)

• Better 2nd-order gradient-descent approaches are available (e.g., Newton-Raphson)
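
A minimal sketch of the sequential gradient-descent idea for a generalized linear discriminant under a squared-error criterion; the quadratic basis, learning rate, epoch count, and toy data are illustrative assumptions rather than the lecture's prescription:

    import numpy as np

    def poly_basis(x):
        """Illustrative basis phi(x) = [1, x, x^2] for scalar x (phi0 = 1)."""
        return np.array([1.0, x, x * x])

    def sequential_train(xs, ts, lr=0.005, epochs=500):
        """Sequential (stochastic) gradient descent on the per-sample squared error
        E_n = 0.5 * (w^T phi(x_n) - t_n)^2, whose gradient is (w^T phi - t) * phi."""
        w = np.zeros(3)
        for _ in range(epochs):
            for x, t in zip(xs, ts):
                phi = poly_basis(x)
                w -= lr * (w @ phi - t) * phi
        return w

    # toy 1-D problem: not separable by a linear function of x,
    # but separable with the quadratic basis (targets +1 / -1)
    xs = np.array([-2.0, -1.5, -0.5, 0.5, 1.5, 2.0])
    ts = np.array([ 1.0,  1.0, -1.0, -1.0, 1.0, 1.0])
    w = sequential_train(xs, ts)
    print(np.sign([w @ poly_basis(x) for x in xs]))  # expected: [ 1.  1. -1. -1.  1.  1.]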


Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Generalized Linear Discriminants

• Limitations/Caveats:
  • In general, the parameters can no longer be learned in closed form
  • The flexibility of the model is limited by the curse of dimensionality
  • Overfitting needs to be avoided; e.g., the linearly separable case often leads to overfitting

Chaohui Wang Introduction to Machine Learning 29 / 73


Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Outline of This Lecture

Linear Discriminant Functions
  Discriminant Functions
  Linear Discriminant Functions and Its Learning
  Generalizations of Linear Discriminants

SVMs

Ensemble Methods & Boosting
  Ensembles of Classifiers
  Constructing Ensembles
  Combining Classifiers

Random Forests
  Preliminary: Decision Trees
  Random Forests

Chaohui Wang Introduction to Machine Learning 30 / 73

Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Generalization and Overfitting

• Goal of classification: predict the class labels of new observations (generalization)
• The classification model is trained on a limited training set
• The further we optimize the model/parameters, the more the training error decreases
• However, at some point the test error goes up again
→ Overfitting to the training set

Chaohui Wang Introduction to Machine Learning 31 / 73


Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Typical Example: Linearly Separable Data

• Overfitting: a common problem with linearly separable data
• All candidate decision boundaries have zero error on the training set
→ Which of the many possible decision boundaries is correct?
• However, in general they perform differently on novel test data
→ Different generalization performance

Chaohui Wang Introduction to Machine Learning 32 / 73


Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Typical Example: Linearly Separable Data

• How to select the classifier with the best generalization performance?
• Intuitively, select the classifier that leaves maximal “safety room” for future data points
→ Refer to Statistical Learning Theory for a theoretical analysis
• This can be obtained by maximizing the margin between positive and negative data points

Chaohui Wang Introduction to Machine Learning 33 / 73


Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Support Vector Machines (SVMs)

• The Support Vector Machine (SVM) builds on the aforementioned idea:
  • Search for the classifier with maximum margin
  • Formulate the training as a convex optimization problem
→ Possible to find the globally optimal solution!

Chaohui Wang Introduction to Machine Learning 34 / 73


Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Support Vector Machines (SVMs)

• Notation:
  • Given $N$ training examples $\{(x_n, t_n)\}_{n=1}^{N}$ with target values $t_n \in \{-1, 1\}$
  • The hyperplane separating the data: $w^T x + b = 0$
• In the case of linearly separable data, the SVM looks for the hyperplane satisfying
  $\arg\min_{w,b}\; \tfrac{1}{2}\|w\|^2 \quad \text{subject to} \quad t_n(w^T x_n + b) \ge 1, \;\forall n$
• Quadratic programming problem with linear constraints
• Formulated using Lagrange multipliers
→ Globally optimal solution (a minimal sketch follows below)

Chaohui Wang Introduction to Machine Learning 35 / 73
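A minimal sketch, assuming scikit-learn (which the slides do not mention): the hard-margin problem above is approximated by a linear soft-margin SVC with a very large penalty C, on linearly separable toy data.

```python
# Sketch only: (approximately) hard-margin linear SVM on separable toy data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 2], 0.5, (50, 2)),      # class t = +1
               rng.normal([-2, -2], 0.5, (50, 2))])   # class t = -1
t = np.array([1] * 50 + [-1] * 50)

clf = SVC(kernel="linear", C=1e6)   # very large C ~ hard-margin behaviour
clf.fit(X, t)

w = clf.coef_[0]        # normal vector of the separating hyperplane
b = clf.intercept_[0]   # offset, i.e. the hyperplane is w^T x + b = 0
print("w =", w, " b =", b)
print("margin width 2/||w|| =", 2 / np.linalg.norm(w))
print("support vectors:", len(clf.support_vectors_))
```

Only the few points closest to the boundary end up as support vectors; all constraints $t_n(w^T x_n + b) \ge 1$ hold on this data.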


Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

SVMs for Non-Separable Data

Non-Separable Data: How to address it?

• Relax the hard constraints $t_n(w^T x_n + b) \ge 1$ for each training data point
→ e.g., via the hinge loss: $\max\big(0,\, 1 - t_n(w^T x_n + b)\big)$
• Jointly optimized together with $w$, e.g.:

  $\tilde{w}, \tilde{b} = \arg\min_{w,b}\Big\{ \frac{1}{N}\sum_{n=1}^{N} \max\big(0,\, 1 - t_n(w^T x_n + b)\big) + \lambda\,\frac{1}{2}\|w\|^2 \Big\}$

(a minimal optimization sketch follows below)

Chaohui Wang Introduction to Machine Learning 36 / 73
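A minimal optimization sketch (not the slides' algorithm), assuming NumPy and arbitrary illustrative constants: full-batch subgradient descent on the regularized hinge-loss objective above.

```python
# Sketch only: subgradient descent on
#   (1/N) * sum_n max(0, 1 - t_n (w^T x_n + b)) + (lambda/2) * ||w||^2
import numpy as np

rng = np.random.default_rng(1)
N = 200
X = np.vstack([rng.normal([1.5, 1.5], 1.0, (N // 2, 2)),
               rng.normal([-1.5, -1.5], 1.0, (N // 2, 2))])  # overlapping classes
t = np.array([1.0] * (N // 2) + [-1.0] * (N // 2))

lam, eta, steps = 0.01, 0.1, 500   # illustrative hyper-parameters
w, b = np.zeros(2), 0.0

for _ in range(steps):
    margins = t * (X @ w + b)
    active = margins < 1                                   # points with non-zero hinge loss
    grad_w = lam * w - (t[active, None] * X[active]).sum(axis=0) / N
    grad_b = -t[active].sum() / N
    w -= eta * grad_w
    b -= eta * grad_b

hinge = np.maximum(0.0, 1.0 - t * (X @ w + b)).mean()
print("objective:", hinge + 0.5 * lam * (w @ w))
print("training accuracy:", np.mean(np.sign(X @ w + b) == t))
```

Because the classes overlap, some hinge terms remain positive at the optimum; the solution trades those violations against the margin term.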


Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Other Extensions

• Multi-class SVM
  • Common approach: reduce the multi-class problem into multiple binary classification problems
  • More advanced methods exist
• Nonlinear SVMs
  • Basic idea: map the input data to some higher-dimensional feature space where the training set is separable
  • Implemented with the Kernel Trick
• Structured SVM: the label space is structured
• ...
(a one-vs-rest / kernel sketch follows below)

Chaohui Wang Introduction to Machine Learning 37 / 73
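A minimal sketch, assuming scikit-learn: a nonlinear, multi-class SVM obtained by the one-vs-rest reduction with an RBF kernel on a toy 3-class problem.

```python
# Sketch only: one-vs-rest reduction with an RBF-kernel SVM per class.
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.8, (60, 2))             # three Gaussian blobs
               for c in ([0, 3], [3, -2], [-3, -2])])
y = np.repeat([0, 1, 2], 60)

clf = OneVsRestClassifier(SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X, y)

print("binary sub-classifiers:", len(clf.estimators_))   # one per class
print("training accuracy:", clf.score(X, y))
print("predictions:", clf.predict([[0.0, 2.5], [-2.5, -2.0]]))
```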


Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Main Supervised Learning Approaches

• Discriminative Approaches
  • Linear Discriminant Functions
  • Support Vector Machines
  • Ensemble Methods & Boosting
  • (Randomized) Decision Trees & Random Forests
  • etc.
• Generative Models
  • Bayesian Networks
  • Markov Random Fields
  • etc.
• Deep Models

Chaohui Wang Introduction to Machine Learning 38 / 73

Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Outline of This Lecture

Linear Discriminant Functions
  Discriminant Functions
  Linear Discriminant Functions and Its Learning
  Generalizations of Linear Discriminants

SVMs

Ensemble Methods & Boosting
  Ensembles of Classifiers
  Constructing Ensembles
  Combining Classifiers

Random Forests
  Preliminary: Decision Trees
  Random Forests

Chaohui Wang Introduction to Machine Learning 39 / 73

Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Up to now...

• We have seen a variety of different classifiers:
  • k-NN
  • Bayes classifiers
  • Linear discriminants
  • SVMs
• Each of them has its strengths and weaknesses...
→ Can we improve performance by combining them?
→ One approach: via Ensemble Methods

Chaohui Wang Introduction to Machine Learning 40 / 73


Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Outline of This Lecture

Linear Discriminant Functions
  Discriminant Functions
  Linear Discriminant Functions and Its Learning
  Generalizations of Linear Discriminants

SVMs

Ensemble Methods & Boosting
  Ensembles of Classifiers
  Constructing Ensembles
  Combining Classifiers

Random Forests
  Preliminary: Decision Trees
  Random Forests

Chaohui Wang Introduction to Machine Learning 41 / 73

Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Ensembles of Classifiers

• Intuition
  • Assume we have L classifiers
  • They are independent (i.e., their errors are uncorrelated)
  • Each one has an error probability p < 0.5 on training data
→ Then a simple majority vote of all classifiers should have a lower error than each individual classifier...

Chaohui Wang Introduction to Machine Learning 42 / 73


Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Ensembles of Classifiers

• Example
  • L classifiers, each with error probability p = 0.3
  • Probability that exactly L′ of the L classifiers make an error: $\binom{L}{L'}\, p^{L'} (1-p)^{L-L'}$
  • For L = 21, the probability that 11 or more classifiers make an error (i.e., that the majority vote is wrong) is about 0.026 (a quick numeric check follows below)

Chaohui Wang Introduction to Machine Learning 43 / 73
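A quick numeric check, assuming SciPy: with L = 21 independent classifiers, each wrong with probability p = 0.3, the majority vote is wrong only when 11 or more of them err simultaneously.

```python
# Sketch only: probability that the majority vote of L independent classifiers is wrong.
from scipy.stats import binom

L, p = 21, 0.3
k_majority = L // 2 + 1                        # 11 errors needed to flip the vote
p_vote_wrong = binom.sf(k_majority - 1, L, p)  # P(X >= 11) for X ~ Binomial(L, p)
print(f"P(majority vote wrong) = {p_vote_wrong:.3f}")   # ~ 0.026
```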


Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Outline of This Lecture

Linear Discriminant Functions
  Discriminant Functions
  Linear Discriminant Functions and Its Learning
  Generalizations of Linear Discriminants

SVMs

Ensemble Methods & Boosting
  Ensembles of Classifiers
  Constructing Ensembles
  Combining Classifiers

Random Forests
  Preliminary: Decision Trees
  Random Forests

Chaohui Wang Introduction to Machine Learning 44 / 73

Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Constructing Ensembles

• How do we get different classifiers?
  • Simplest case: train the same classifier on different data
  • But... where shall we get this additional data from?
  → Recall: training data is very expensive!
• Idea: Subsample the training data
  • Reuse the same training algorithm several times on different subsets of the training data

Chaohui Wang Introduction to Machine Learning 45 / 73


Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Cross-validation

• Cross-Validation: a model validation technique for assessing how the results of a statistical analysis will generalize to an independent dataset
• K-fold cross-validation:
  ▶ Partition the available data into K roughly equal subsets
  ▶ In each run, train a classifier based on K − 1 subsets
  ▶ Estimate the generalization error on the remaining validation set
→ E.g., 5-fold cross-validation (a sketch follows below)

Chaohui Wang Introduction to Machine Learning 46 / 73
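A minimal sketch, assuming scikit-learn: estimating the generalization accuracy of a classifier with 5-fold cross-validation on toy data (the model and data are illustrative choices).

```python
# Sketch only: 5-fold cross-validation.
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([1, 1], 1.0, (100, 2)),
               rng.normal([-1, -1], 1.0, (100, 2))])
y = np.array([1] * 100 + [0] * 100)

kf = KFold(n_splits=5, shuffle=True, random_state=0)         # K roughly equal subsets
scores = cross_val_score(LogisticRegression(), X, y, cv=kf)  # train on K-1 folds, test on the rest

print("per-fold accuracy:", np.round(scores, 3))
print("estimated generalization accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```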


Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Bagging

• Bagging = “Bootstrap aggregating” (Breiman 1996): a special Bayesian model averaging approach to improving the model by combining multiple models trained on a set of randomly generated training sets
• In each run of the training algorithm, randomly select N′ samples (with replacement) from the full set of N training data points
• If N′ = N then, on average, 63.2% of the distinct training points will be represented; the remaining draws are duplicates (a bootstrap sketch follows below)

Chaohui Wang Introduction to Machine Learning 47 / 73
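A minimal sketch, assuming scikit-learn decision stumps as the base models (an illustrative choice, not Breiman's exact setup): each model is trained on a bootstrap sample of size N and the ensemble predicts by majority vote.

```python
# Sketch only: bagging with bootstrap samples and majority voting.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
N = 300
X = np.vstack([rng.normal([1, 1], 1.2, (N // 2, 2)),
               rng.normal([-1, -1], 1.2, (N // 2, 2))])
y = np.array([1] * (N // 2) + [-1] * (N // 2))

L = 25                       # number of bagged models (odd, so no vote ties)
models, coverage = [], []
for _ in range(L):
    idx = rng.integers(0, N, size=N)           # N draws with replacement (N' = N)
    coverage.append(len(np.unique(idx)) / N)   # fraction of distinct points represented
    models.append(DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx]))

votes = np.array([m.predict(X) for m in models]).sum(axis=0)
print("mean fraction of distinct points per bootstrap sample: %.3f" % np.mean(coverage))  # ~ 0.632
print("ensemble training accuracy:", np.mean(np.sign(votes) == y))
```

The ~63.2% figure is just $1 - (1 - 1/N)^N \approx 1 - e^{-1}$, which the `coverage` values reproduce.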


Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Others

• Injecting randomness
  • Many (iterative) learning algorithms need a random initialization (e.g., k-means, EM)
  → Perform multiple runs of the learning algorithm with different random initializations (a sketch follows below)

Chaohui Wang Introduction to Machine Learning 48 / 73
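A minimal sketch, assuming scikit-learn's KMeans: the same algorithm is run with several random initializations, and the run with the lowest within-cluster sum of squares (inertia) is kept.

```python
# Sketch only: multiple k-means runs with different random initializations.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.6, (80, 2)) for c in ([0, 0], [4, 0], [2, 3])])

runs = [KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(X)
        for seed in range(10)]                 # 10 runs, one random init each
best = min(runs, key=lambda km: km.inertia_)   # keep the best local optimum

print("inertia per run:", [round(km.inertia_, 1) for km in runs])
print("best inertia:", round(best.inertia_, 1))
```

For an ensemble, one could also keep all runs and combine them instead of selecting a single one.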


Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Outline of This Lecture

Linear Discriminant Functions
  Discriminant Functions
  Linear Discriminant Functions and Its Learning
  Generalizations of Linear Discriminants

SVMs

Ensemble Methods & Boosting
  Ensembles of Classifiers
  Constructing Ensembles
  Combining Classifiers

Random Forests
  Preliminary: Decision Trees
  Random Forests

Chaohui Wang Introduction to Machine Learning 49 / 73

Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Stacking

• Idea
  • Learn L classifiers (based on the training data)
  • Find a meta-classifier that takes as input the outputs of the L first-level classifiers
• An example
  • Learn L classifiers with leave-one-out cross-validation
  • Interpret the predictions of the L classifiers as an L-dimensional feature vector
  • Learn a “level-2” classifier based on the examples generated this way (a sketch follows below)

Chaohui Wang Introduction to Machine Learning 50 / 73
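A minimal sketch, assuming scikit-learn and with 5-fold out-of-fold predictions standing in for the slides' leave-one-out setup: the predictions of L = 3 first-level classifiers become a 3-dimensional feature vector on which a “level-2” logistic regression is trained.

```python
# Sketch only: stacking with out-of-fold predictions as meta-features.
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([1, 1], 1.3, (150, 2)),
               rng.normal([-1, -1], 1.3, (150, 2))])
y = np.array([1] * 150 + [0] * 150)

level1 = [KNeighborsClassifier(5), GaussianNB(), SVC(kernel="linear")]

# Each column = one classifier's out-of-fold predictions -> L-dimensional feature vector.
meta_features = np.column_stack([cross_val_predict(clf, X, y, cv=5) for clf in level1])

level2 = LogisticRegression().fit(meta_features, y)   # the "level-2" classifier
print("level-2 training accuracy:", level2.score(meta_features, y))

# At test time: refit the level-1 models on all data, feed their outputs to level 2.
x_new = np.array([[0.3, -0.2]])
meta_new = np.column_stack([clf.fit(X, y).predict(x_new) for clf in level1])
print("stacked prediction:", level2.predict(meta_new))
```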


Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Stacking

• Why can this be useful?
  • Simplicity: we may already have several existing classifiers available
  → No need to re-train those; they can just be combined with the additional new ones
  • Correlation between classifiers: the combined classifier can exploit such correlation
  • Feature combination: e.g., we can integrate information from multi-modal data (video, audio, subtitles, etc.) via the following scheme:
    ▶ First train each of the L classifiers on its own input data
    ▶ Then train the combination classifier on the combined input

Chaohui Wang Introduction to Machine Learning 51 / 73


Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Model Averaging: Expected Error

• Combine L predictors $y_l(x)$ of the target output $h(x)$
• The committee prediction is given by
  $y_{COM}(x) = \frac{1}{L}\sum_{l=1}^{L} y_l(x)$
• Each output can be written as the true value plus an error term:
  $y_l(x) = h(x) + \epsilon_l(x)$
• Thus, the average sum-of-squares error of model $l$ takes the form:
  $\mathbb{E}_x[\epsilon_l(x)^2] = \mathbb{E}_x[\{y_l(x) - h(x)\}^2]$

Chaohui Wang Introduction to Machine Learning 52 / 73


Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Model Averaging: Expected Error

• Average error of individual models
  $E_{AV} = \frac{1}{L}\sum_{l=1}^{L} \mathbb{E}_x\big[\epsilon_l(x)^2\big]$
• Average error of committee
  $E_{COM} = \mathbb{E}_x\Big[\Big\{\frac{1}{L}\sum_{l=1}^{L} y_l(x) - h(x)\Big\}^2\Big] = \mathbb{E}_x\Big[\frac{1}{L^2}\Big\{\sum_{l=1}^{L}\big(y_l(x) - h(x)\big)\Big\}^2\Big] = \mathbb{E}_x\Big[\frac{1}{L^2}\Big\{\sum_{l=1}^{L}\epsilon_l(x)\Big\}^2\Big]$

Chaohui Wang Introduction to Machine Learning 53 / 73


Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Model Averaging: Expected Error

• Average error of individual models
  $E_{AV} = \frac{1}{L}\sum_{l=1}^{L} \mathbb{E}_x\big[\epsilon_l(x)^2\big]$
• Average error of committee
  $E_{COM} = \mathbb{E}_x\Big[\frac{1}{L^2}\Big\{\sum_{l=1}^{L}\epsilon_l(x)\Big\}^2\Big]$
• Assumptions
  • Errors have zero mean: $\mathbb{E}_x[\epsilon_l(x)] = 0$
  • Errors are uncorrelated: $\mathbb{E}_x[\epsilon_l(x)\epsilon_{l'}(x)] = 0$ for $l \neq l'$
→ Then:
  $E_{COM} = \frac{1}{L} E_{AV}$

Chaohui Wang Introduction to Machine Learning 54 / 73


Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Model Averaging: Expected Error

• Average error of committee?
  $E_{COM} = \frac{1}{L} E_{AV}$
• This suggests that the average error of a model can be reduced by a factor of L simply by averaging L versions of the model!
→ This sounds almost too good to be true...
• Can you see where the problem is?
• Unfortunately, this result depends on the assumption that the errors are all uncorrelated
• In practice, they will typically be highly correlated
• Still, it can be shown (from the convexity of the squared error, i.e., Jensen's inequality) that
  $E_{COM} \le E_{AV}$
  (a small simulation follows below)

Chaohui Wang Introduction to Machine Learning 55 / 73
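A small simulation sketch, assuming NumPy, of the two statements above: with uncorrelated zero-mean errors the committee error is close to $E_{AV}/L$, while with strongly correlated errors only the weaker bound $E_{COM} \le E_{AV}$ remains.

```python
# Sketch only: committee error vs. average individual error.
import numpy as np

rng = np.random.default_rng(0)
L, n_points = 10, 100_000

def committee_vs_average(errors):
    """errors: (L, n_points) array of epsilon_l(x). Returns (E_AV, E_COM)."""
    e_av = np.mean(errors ** 2)                 # average squared error of the members
    e_com = np.mean(errors.mean(axis=0) ** 2)   # squared error of the averaged prediction
    return e_av, e_com

# Case 1: independent, zero-mean errors.
indep = rng.normal(0.0, 1.0, size=(L, n_points))
e_av, e_com = committee_vs_average(indep)
print("uncorrelated: E_AV=%.3f  E_COM=%.3f  E_AV/L=%.3f" % (e_av, e_com, e_av / L))

# Case 2: highly correlated errors (a shared component dominates).
shared = rng.normal(0.0, 1.0, size=n_points)
corr = 0.9 * shared + 0.1 * rng.normal(0.0, 1.0, size=(L, n_points))
e_av, e_com = committee_vs_average(corr)
print("correlated:   E_AV=%.3f  E_COM=%.3f  (still E_COM <= E_AV)" % (e_av, e_com))
```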


Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Boosting

• Main idea
  • Sequential classifier selection: successively train component classifiers, each on a subset of the training data
  → Select the subset that is most informative given the current set of classifiers

Chaohui Wang Introduction to Machine Learning 56 / 73


Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Boosting

• Algorithm (3-component classifier) [Schapire 1989]
  1. Sample N1 < N training examples (without replacement) from the training set D to get set D1
     → Train weak classifier C1 on D1
  2. Sample N2 < N training examples (without replacement), half of which were misclassified by C1, to get set D2
     → Train weak classifier C2 on D2
  3. Choose all data in D on which C1 and C2 disagree to get set D3
     → Train weak classifier C3 on D3
  4. Get the final classifier output by majority voting of C1, C2, and C3
→ Problem: How should we choose the number of samples N1?
(a minimal sketch of this procedure follows below)

Chaohui Wang Introduction to Machine Learning 57 / 73
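A minimal sketch of the 3-component procedure above (the subset sizes and sampling details are simplified illustrative choices, not Schapire's exact construction), with scikit-learn decision stumps as the weak classifiers C1, C2, C3.

```python
# Sketch only: Schapire-style boosting with three weak classifiers.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
N = 600
X = np.vstack([rng.normal([1, 1], 1.5, (N // 2, 2)),
               rng.normal([-1, -1], 1.5, (N // 2, 2))])   # overlapping classes
t = np.array([1] * (N // 2) + [-1] * (N // 2))

def weak():
    return DecisionTreeClassifier(max_depth=1)   # decision stump

# 1) Train C1 on a random subset D1 (N1 < N, without replacement).
N1 = N // 3
idx1 = rng.choice(N, size=N1, replace=False)
C1 = weak().fit(X[idx1], t[idx1])

# 2) Build D2 so that roughly half of it was misclassified by C1; train C2.
wrong = np.flatnonzero(C1.predict(X) != t)
right = np.flatnonzero(C1.predict(X) == t)
half = min(len(wrong), len(right), N1 // 2)
idx2 = np.concatenate([rng.choice(wrong, half, replace=False),
                       rng.choice(right, half, replace=False)])
C2 = weak().fit(X[idx2], t[idx2])

# 3) D3 = all points on which C1 and C2 disagree; train C3 on it.
idx3 = np.flatnonzero(C1.predict(X) != C2.predict(X))
C3 = weak().fit(X[idx3], t[idx3]) if len(idx3) > 0 else C1

# 4) Final output: majority vote of C1, C2 and C3.
votes = C1.predict(X) + C2.predict(X) + C3.predict(X)
print("C1 alone, training accuracy:", C1.score(X, t))
print("majority vote, training accuracy:", np.mean(np.sign(votes) == t))
```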

Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Boosting

• How should we choose the number of samples N1?
• Ideally, the number of samples should be roughly equal in all 3 component classifiers
• Reasonable first guess: N1 ≈ N/3
• However, if the problem is very simple
  → C1 will explain most of the data
  → N2 and N3 will be very small
  → Not all of the data will be used effectively
• Similarly, if the problem is extremely hard
  → C1 will explain only a small part of the data
  → N2 may be unacceptably large
• In practice, we may need to run the boosting procedure a few times and adjust N1 in order to explore the full training set

Chaohui Wang Introduction to Machine Learning 58 / 73

Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

AdaBoost: Idea

• AdaBoost: “Adaptive Boosting” [Freund & Schapire, 1996]
• Main idea: Reweight misclassified training examples, instead of resampling as in the original boosting algorithm
  • Increase the chance of being selected in a sampled training set
  • Or: increase the misclassification cost when training on the full set

Chaohui Wang Introduction to Machine Learning 59 / 73

Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

AdaBoost: Formulation

• Formulation: Construct a strong classifier as a thresholded linear combination of the weighted weak classifiers:

  $H(x) = \mathrm{sign}\!\left(\sum_{l=1}^{L} \alpha_l h_l(x)\right)$

• Notations:
  • $H(x)$: strong classifier (final classifier)
  • $h_l(x)$ ($l \in \{1, \ldots, L\}$): weak classifier (base classifier)
    → Condition: < 50% training error over any distribution
    → Why? Discover the reason after studying the algorithm on the next slide
  • $\alpha_l$ ($l \in \{1, \ldots, L\}$): weight for weak classifier $l$
    → Also learned during the training process

Chaohui Wang Introduction to Machine Learning 60 / 73

Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

AdaBoost: Algorithm

• Initialization: Set $w_n^{(1)} = 1/N$ ($n = 1, \ldots, N$)
• For $l = 1, \ldots, L$, do iterations:
  1. Train a new weak classifier $h_l(x)$ using the current weights $W^{(l)}$ by minimizing the weighted error function
     (where $[\cdot]$ equals 1 if the condition holds and 0 otherwise):

     $J_l = \sum_{n=1}^{N} w_n^{(l)} \, [h_l(x_n) \neq t_n]$

  2. Estimate the weighted error of this classifier on all data:

     $\varepsilon_l = \dfrac{\sum_{n=1}^{N} w_n^{(l)} \, [h_l(x_n) \neq t_n]}{\sum_{n=1}^{N} w_n^{(l)}}$

  3. Calculate the weight for classifier $h_l(x)$: $\alpha_l = \ln \dfrac{1 - \varepsilon_l}{\varepsilon_l}$
  4. Update the weighting coefficients of all training samples:

     $w_n^{(l+1)} = w_n^{(l)} \exp\!\left(\alpha_l \, [h_l(x_n) \neq t_n]\right)$

Chaohui Wang Introduction to Machine Learning 61 / 73
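
As a complement to the two preceding slides, here is a minimal sketch of the AdaBoost training loop and of the strong classifier $H(x) = \mathrm{sign}(\sum_l \alpha_l h_l(x))$. Assumptions not fixed by the slides: binary labels $t_n \in \{-1, +1\}$, scikit-learn decision stumps as the weak classifiers, the weighted error minimized via the sample_weight argument, and a small clipping of the error to avoid division by zero.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, t, L=50):
    N = len(t)
    w = np.full(N, 1.0 / N)               # w_n^(1) = 1/N
    classifiers, alphas = [], []
    for _ in range(L):
        # 1) Train h_l on the weighted data (minimizes the weighted error J_l)
        h = DecisionTreeClassifier(max_depth=1).fit(X, t, sample_weight=w)
        miss = (h.predict(X) != t)        # indicator [h_l(x_n) != t_n]
        # 2) Weighted error eps_l of h_l on all data
        eps = np.clip(np.dot(w, miss) / np.sum(w), 1e-10, 1 - 1e-10)
        # 3) Classifier weight alpha_l = ln((1 - eps_l) / eps_l)
        alpha = np.log((1 - eps) / eps)
        # 4) Increase the weights of the misclassified samples only
        w = w * np.exp(alpha * miss)
        classifiers.append(h)
        alphas.append(alpha)
    return classifiers, alphas

def adaboost_predict(classifiers, alphas, X):
    scores = sum(a * h.predict(X) for h, a in zip(classifiers, alphas))
    return np.sign(scores)                # strong classifier H(x)
```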

Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Summary: AdaBoost

• Properties
  • Simple combination of multiple classifiers
  • Easy to implement
  • Can be used with many different types of classifiers
    → None of them needs to be too good on its own; they only have to be slightly better than chance
  • Commonly used in many areas
  • Empirically good generalization capabilities
• Limitations
  • Original AdaBoost is sensitive to misclassified training data points (e.g., noisy or mislabeled samples, which keep receiving larger weights)
    → Improvements exist, e.g., GentleBoost
  • Binary classifier, although multiclass extensions are available

Chaohui Wang Introduction to Machine Learning 62 / 73

Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Applications: Viola-Jones Face Detector

• Viola-Jones Face Detector [Viola & Jones 2004]

Chaohui Wang Introduction to Machine Learning 63 / 73

Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Outline of This Lecture

Linear Discriminant FunctionsDiscriminant FunctionsLinear Discriminant Functions and Its LearningGeneralizations of Linear Discriminants

SVMs

Ensemble Methods & BoostingEnsembles of ClassifiersConstructing EnsemblesCombining Classifiers

Random ForestsPreliminary: Decision TreesRandom Forests

Chaohui Wang Introduction to Machine Learning 64 / 73

Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Outline of This Lecture

Linear Discriminant FunctionsDiscriminant FunctionsLinear Discriminant Functions and Its LearningGeneralizations of Linear Discriminants

SVMs

Ensemble Methods & BoostingEnsembles of ClassifiersConstructing EnsemblesCombining Classifiers

Random ForestsPreliminary: Decision TreesRandom Forests

Chaohui Wang Introduction to Machine Learning 65 / 73

Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Decision Trees

• A tree
  • Each node specifies a test on some attribute
  • Each branch corresponds to a possible (attribute) value
  → Example: is Sunday suitable for playing tennis?
• A tree ⇒ a set of if-then rules ⇒ a logical expression
  → For the question above:
    (Outlook = Sunny ∩ Humidity = Normal) ∪ (Outlook = Overcast) ∪ (Outlook = Rain ∩ Wind = Weak)

Chaohui Wang Introduction to Machine Learning 66 / 73
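
As a tiny illustration of “a tree ⇒ a set of if-then rules ⇒ a logical expression”, the expression above can be written directly as code; the attribute names follow the classic play-tennis example and are only illustrative.

```python
def play_tennis(outlook: str, humidity: str, wind: str) -> bool:
    # (Outlook = Sunny AND Humidity = Normal) OR (Outlook = Overcast)
    # OR (Outlook = Rain AND Wind = Weak)
    return ((outlook == "Sunny" and humidity == "Normal")
            or outlook == "Overcast"
            or (outlook == "Rain" and wind == "Weak"))

print(play_tennis("Sunny", "High", "Weak"))       # False
print(play_tennis("Overcast", "High", "Strong"))  # True
```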

Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Decision Trees: Training

• Common procedure: Greedy top-down growing
  1. Start at the root node
  2. Progressively split the training data into smaller and smaller subsets
  3. In each step, pick the best attribute to split the data
  4. If the resulting subsets are pure (only one label) or if no further attribute can be found that splits them, terminate the tree
  5. Else, recursively apply the procedure to the subsets
• Study & formalization of the different design choices
  → E.g., the CART framework (Classification And Regression Trees) [Breiman et al. 1984]

Chaohui Wang Introduction to Machine Learning 67 / 73
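
Below is a minimal sketch of this greedy top-down procedure for categorical attributes, using entropy / information gain as the “best attribute” criterion (one common choice; the slide leaves the criterion open) and no pruning. The data layout, function names and stopping rules are assumptions for illustration.

```python
import numpy as np
from collections import Counter

def entropy(y):
    p = np.array(list(Counter(y).values()), dtype=float)
    p /= p.sum()
    return -np.sum(p * np.log2(p))

def grow_tree(X, y, attributes):
    """X: list of tuples of categorical values, y: labels,
    attributes: list of attribute indices still available for splitting."""
    # Terminate on pure subsets or when no further attribute is available
    if len(set(y)) == 1 or not attributes:
        return Counter(y).most_common(1)[0][0]        # leaf: majority label

    def gain(a):                                      # information gain of attribute a
        rem = 0.0
        for v in {row[a] for row in X}:
            sub = [yi for row, yi in zip(X, y) if row[a] == v]
            rem += len(sub) / len(y) * entropy(sub)
        return entropy(y) - rem

    best = max(attributes, key=gain)                  # pick the best attribute
    node = {best: {}}
    for v in {row[best] for row in X}:                # recurse on each subset
        idx = [i for i, row in enumerate(X) if row[best] == v]
        node[best][v] = grow_tree([X[i] for i in idx], [y[i] for i in idx],
                                  [a for a in attributes if a != best])
    return node

# Usage: tree = grow_tree(X, y, attributes=list(range(len(X[0]))))
```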

Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Decision Trees: Issues

• Finding the optimal decision tree is NP-hard, so trees are grown greedily in practice; the resulting trees are:
  • Prone to overfitting (the learned model fits the training data very well, but generalizes poorly to unseen data at the testing stage)
  • Of high complexity

→ Next: Randomized decision trees & Random forests

Chaohui Wang Introduction to Machine Learning 68 / 73

Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Outline of This Lecture

Linear Discriminant FunctionsDiscriminant FunctionsLinear Discriminant Functions and Its LearningGeneralizations of Linear Discriminants

SVMs

Ensemble Methods & BoostingEnsembles of ClassifiersConstructing EnsemblesCombining Classifiers

Random ForestsPreliminary: Decision TreesRandom Forests

Chaohui Wang Introduction to Machine Learning 69 / 73

Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Randomized Decision Trees

• Basic idea: randomize attribute selection
  • No longer look for the globally optimal split
  • Instead, randomly choose a subset of K attributes based on which to perform the split
  • Choose the best splitting attribute among them, e.g., by maximizing the information gain (i.e., reducing entropy)

→ On its own, the obtained tree is not as powerful a classifier . . .
→ Regard it as a weak classifier & build multiple ones (“Random Forest”)
→ What technique can be used here? “Ensemble Methods”

Chaohui Wang Introduction to Machine Learning 70 / 73
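
A minimal sketch of the randomized split selection described above: instead of scanning all attributes, draw a random subset of K attribute indices and keep the best one according to some scoring function (e.g., the information gain from the previous sketch). The signature, K and the rng seed are assumptions for illustration.

```python
import numpy as np

def random_split_attribute(X, y, n_attributes, score, K=3, rng=np.random.default_rng(0)):
    """Pick the best of K randomly chosen attributes, ranked by score(a, X, y)."""
    candidates = rng.choice(n_attributes, size=min(K, n_attributes), replace=False)
    return max(candidates, key=lambda a: score(a, X, y))
```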

Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Random Forest

• Random forest: composed of a set of L randomized decision trees
  • Internal node: a random test on an input feature vector
  • Leaf node: stores a histogram $H = (h_1, \ldots, h_K)$ (K: the number of classes)
    → obtained during the training phase by counting the number of labeled feature vectors that arrive at this leaf

Chaohui Wang Introduction to Machine Learning 71 / 73

Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Random Forest

• Random forest: composed of a set of L randomized decision trees
  • In the testing phase, a feature vector is dropped into each decision tree $l$ and reaches a leaf $\tau_l$
  • The probabilities of all the trees are averaged to obtain the probability over the forest:

    $p(H \mid x) = \frac{1}{L} \sum_{l=1}^{L} p_l(H \mid x)$

  → $p_l(H \mid x)$ is the normalized histogram at the leaf $\tau_l$ of tree $l$

Chaohui Wang Introduction to Machine Learning 72 / 73
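
A minimal sketch of this prediction rule: each trained tree is viewed as a function mapping a feature vector to the class histogram stored at the leaf it reaches; the forest normalizes these histograms and averages them over the L trees. The list-of-callables representation and the hard-coded example histograms are illustrative assumptions (scikit-learn's RandomForestClassifier.predict_proba performs essentially the same averaging over its trees).

```python
import numpy as np

def forest_predict_proba(trees, x):
    probs = []
    for tree in trees:
        hist = np.asarray(tree(x), dtype=float)   # leaf histogram H = (h_1, ..., h_K)
        probs.append(hist / hist.sum())           # p_l(H | x): normalized histogram
    return np.mean(probs, axis=0)                 # p(H | x) = (1/L) * sum_l p_l(H | x)

# Example with two hypothetical 3-class trees returning fixed leaf histograms:
trees = [lambda x: [8, 1, 1], lambda x: [2, 6, 2]]
print(forest_predict_proba(trees, x=None))        # -> [0.5, 0.35, 0.15]
```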

Linear Discriminant Functions SVMs Ensemble Methods & Boosting Random Forests

Random Forests: Properties

• Advantages:
  1. Very simple algorithm
  2. Straightforward to deal with multiple classes
  3. Resistant to overfitting - generalizes well to new data
  4. Fast training
• Limitations:
  1. Memory consumption: decision tree construction uses much more memory
  2. Well-suited for problems with little training data, but offers little performance gain when the training data is really large

Chaohui Wang Introduction to Machine Learning 73 / 73
