Machine Learning: Supervised Techniques
UNIT 2
Basics of Supervised Machine Learning
QUESTIONS WE NEED TO ADDRESS
Does learning help in the future, i.e. does experience from previously observed examples help us to solve a future task?

What is a good model? How do we assess the quality of a model?

Will a given model be helpful in the future?
BASIC SETUP: INPUTS
Assume we want to learn something about objects from a set/space X. Most often, these objects are represented by vectors of feature values, i.e.
x = (x_1, \dots, x_d) \in \underbrace{X_1 \times \cdots \times X_d}_{=X}
For simplicity, we will not distinguish between the objects and the feature vectors in the following.
If X_j is a finite set of labels, we speak of a categorical variable/feature. If X_j = \mathbb{R}, a real interval, etc., we speak of a numerical variable/feature.
BASIC SETUP: INPUTS (cont'd)
Assume we are given l objects x_1, \dots, x_l that have been observed in the past, the so-called training set. Each of these objects is characterized by its feature vector:
x_i = (x_{i1}, \dots, x_{id})
We can write this conveniently in matrix notation (called matrix of feature vectors):
X = \begin{pmatrix} x_1 \\ \vdots \\ x_l \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1d} \\ \vdots & \ddots & \vdots \\ x_{l1} & \cdots & x_{ld} \end{pmatrix}
BASIC SETUP: INPUTS VS. OUTPUTS
Further assume that we know a target value y_i \in \mathbb{R} for each training sample x_i. All these values constitute the target/label vector:
y = (y_1, \dots, y_l)^T
The training data matrix is then defined as follows:
Z = (X \mid y) = \begin{pmatrix} x_1 & y_1 \\ \vdots & \vdots \\ x_l & y_l \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1d} & y_1 \\ \vdots & \ddots & \vdots & \vdots \\ x_{l1} & \cdots & x_{ld} & y_l \end{pmatrix}
In the following, we denote Z = X × R.
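As a concrete illustration, the matrices above can be assembled with NumPy; the numbers below are made-up toy values, not data from the slides:

```python
import numpy as np

# Hypothetical toy training set: l = 4 samples, d = 2 features each.
X = np.array([[0.1, 0.9],
              [0.4, 0.7],
              [0.6, 0.2],
              [0.8, 0.3]])
y = np.array([+1.0, +1.0, -1.0, -1.0])  # target/label vector

# Training data matrix Z = (X | y): append y as an extra column.
Z = np.column_stack([X, y])
assert Z.shape == (4, 3)   # (l, d + 1)
```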
CLASSIFICATION VS. REGRESSION
Classification: the target/label values are categorical, i.e. from a finite set of labels. We will often consider binary classification, i.e. where we have two classes; in this case, unless indicated otherwise, we will use the labels -1 (negative class) and +1 (positive class)
Regression: the target/label values are numerical
THE PROBABILISTIC FRAMEWORK FOR SUPERVISED ML (1/3)
The quality of a model can only be judged on the basis of its performance on future data. So assume that future data are generated according to some joint distribution of inputs and outputs, the joint density of which we denote as
p(z) = p(x, y)
If we have only finitely many possible data samples, p(z) = p(x, y) is the probability of observing the datum z = (x, y).
THE PROBABILISTIC FRAMEWORK FOR SUPERVISED ML (2/3)
Marginal distributions: p(x) is the density/probability of observing input vector x (regardless of its target value); p(y) is the density/probability of observing target value y
Conditional distributions: p(x | y) is the density of input values for a given target value y; p(y | x) is the density/probability of observing a target value y for a given input x
THE PROBABILISTIC FRAMEWORK FOR SUPERVISED ML (3/3)
In case of binary classification, we will use the following notations to make things a bit clearer:
p(y = −1) probability to observe a negative sample
p(y = +1) probability to observe a positive sample
p(x | y = −1) distribution density of negative class
p(x | y = +1) distribution density of positive class
p(y = −1 | x) probability that x belongs to negative class
p(y = +1 | x) probability that x belongs to positive class
SOME BASIC CORRESPONDENCES
Using definitions:
p(x, y) = p(x \mid y) \cdot p(y) \qquad p(x, y) = p(y \mid x) \cdot p(x)
Bayes’ Theorem:
p(y \mid x) = \frac{p(x \mid y) \cdot p(y)}{p(x)} \qquad p(x \mid y) = \frac{p(y \mid x) \cdot p(x)}{p(y)}
Getting marginal densities by integrating out:
p(x) = \int_{\mathbb{R}} p(x, y)\,dy = \int_{\mathbb{R}} p(x \mid y) \cdot p(y)\,dy

p(y) = \int_{X} p(x, y)\,dx = \int_{X} p(y \mid x) \cdot p(x)\,dx
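For a finite sample space, these identities can be checked numerically; the joint probability table below is an arbitrary made-up example (rows: three values of x, columns: y ∈ {−1, +1}):

```python
import numpy as np

# Arbitrary joint table p_xy[i, j] = p(x = i, y = j); entries sum to 1.
p_xy = np.array([[0.10, 0.20],
                 [0.25, 0.05],
                 [0.15, 0.25]])

p_x = p_xy.sum(axis=1)              # marginal p(x): sum out y
p_y = p_xy.sum(axis=0)              # marginal p(y): sum out x
p_y_given_x = p_xy / p_x[:, None]   # conditional p(y | x)
p_x_given_y = p_xy / p_y[None, :]   # conditional p(x | y)

# Bayes' theorem: p(y | x) = p(x | y) * p(y) / p(x)
assert np.allclose(p_y_given_x, p_x_given_y * p_y[None, :] / p_x[:, None])
```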
SOME BASIC CORRESPONDENCES (cont'd)
In the case of binary classification:
p(y = −1) + p(y = +1) = 1
p(y = −1 | x) + p(y = +1 | x) = 1 for all x
p(x) = p(x, y = −1) + p(x, y = +1)
= p(x | y = −1) · p(y = −1) + p(x | y = +1) · p(y = +1)
LOSS FUNCTIONS
Assume that the mapping g corresponds to our model class (parametric model) in the sense that
g(x; w)
maps the input vector x to the predicted output value using the parameter vector w (i.e. w determines the model). Then a loss function
L(y, g(x; w))
measures the loss/cost that is incurred for a given data sample z = (x, y) (i.e. with real output value y).
EXAMPLES OF LOSS FUNCTIONS
Zero-one loss:
L_{zo}(y, g(x; w)) = \begin{cases} 0 & y = g(x; w) \\ 1 & y \neq g(x; w) \end{cases}
Quadratic loss:
Lq(y, g(x; w)) = (y − g(x; w))2
Clearly, the zero-one loss function makes little sense for regression. For binary classification tasks (with labels -1 and +1), we have L_q = 4 \cdot L_{zo}.
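The relation L_q = 4 · L_zo for labels in {−1, +1} is easy to verify numerically (toy predictions below):

```python
import numpy as np

def zero_one_loss(y, g):
    return np.where(y == g, 0.0, 1.0)

def quadratic_loss(y, g):
    return (y - g) ** 2

# For labels in {-1, +1}, any misclassification costs (±2)^2 = 4 under the
# quadratic loss, so L_q = 4 * L_zo holds sample-wise.
y = np.array([-1, -1, +1, +1])
g = np.array([-1, +1, +1, -1])  # two correct, two wrong predictions
assert np.allclose(quadratic_loss(y, g), 4 * zero_one_loss(y, g))
```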
GENERALIZATION ERROR/RISK
The generalization error (or risk) is the expected loss on future data for a given model g(\cdot; w):
R(g(\cdot; w)) = E_z\big(L(y, g(x; w))\big) = \int_Z L(y, g(x; w)) \cdot p(z)\,dz

= \int_X \int_{\mathbb{R}} L(y, g(x; w)) \cdot p(x, y)\,dy\,dx

= \int_X p(x) \underbrace{\int_{\mathbb{R}} L(y, g(x; w)) \cdot p(y \mid x)\,dy}_{=R(g(x; w)) = E_{y \mid x}(L(y, g(x; w)))} \,dx
Obviously, R(g(x;w)) denotes the expected loss for input x.
The risk for the quadratic loss is called mean squared error (MSE).
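When we can sample from p(x, y), the risk can be approximated by Monte Carlo averaging. The distribution below is an assumed toy setup (x uniform, y a noisy linear function of x), not one taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def g(x, w):                        # linear model g(x; w) = w0 + w1 * x
    return w[0] + w[1] * x

# Assumed data distribution: x ~ Uniform(0, 1), y = 2x + N(0, 0.1^2) noise.
n = 200_000
x = rng.uniform(0.0, 1.0, n)
y = 2.0 * x + rng.normal(0.0, 0.1, n)

w = (0.0, 2.0)                      # the true underlying function
risk = np.mean((y - g(x, w)) ** 2)  # Monte Carlo estimate of the MSE
# The exact risk of the true function equals the noise variance 0.01.
assert abs(risk - 0.01) < 0.001
```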
ADVANCED BACKGROUND INFORMATION
GENERALIZATION ERROR FOR A NOISY FUNCTION
Assume that y is a function of x perturbed by some noise:
y = f(x) + ε
Assume further that \varepsilon is distributed according to some noise distribution p_n(\varepsilon). Then we can infer
p(y \mid x) = p_n(y - f(x)),
which implies
p(z) = p(y \mid x) \cdot p(x) = p(x) \cdot p_n(y - f(x)).
GENERALIZATION ERROR FOR A NOISY FUNCTION (cont'd)
Then we obtain
R(g(\cdot; w)) = \int_Z L(y, g(x; w)) \cdot p(z)\,dz = \int_X p(x) \int_{\mathbb{R}} L(y, g(x; w)) \cdot p_n(y - f(x))\,dy\,dx.
In the noise-free case, we get
R(g(\cdot; w)) = \int_X p(x) \cdot L(f(x), g(x; w))\,dx,
which can be understood as “modeling error”.
GENERALIZATION ERROR FOR BINARY CLASSIFICATION (1/3)
For the zero-one loss, we obtain
R(g(\cdot; w)) = \int_X \int_{\mathbb{R}} p\big(x, y \neq g(x; w)\big)\,dy\,dx,
i.e. the misclassification probability. With the notations
X_{-1} = \{x \in X \mid g(x; w) < 0\}, \qquad X_{+1} = \{x \in X \mid g(x; w) > 0\},
we can conclude further:
R(g(\cdot; w)) = \int_{X_{-1}} p(x, y = +1)\,dx + \int_{X_{+1}} p(x, y = -1)\,dx
GENERALIZATION ERROR FOR BINARY CLASSIFICATION (2/3)

So, we get:
R(g(\cdot; w)) = \int_{X_{-1}} p(y = +1 \mid x) \cdot p(x)\,dx + \int_{X_{+1}} p(y = -1 \mid x) \cdot p(x)\,dx

= \int_X \begin{cases} p(y = -1 \mid x) & \text{if } g(x; w) = +1 \\ p(y = +1 \mid x) & \text{if } g(x; w) = -1 \end{cases} \cdot p(x)\,dx

Hence, we can infer an optimal classification function, the so-called Bayes-optimal classifier:
g(x) = \begin{cases} +1 & \text{if } p(y = +1 \mid x) > p(y = -1 \mid x) \\ -1 & \text{if } p(y = -1 \mid x) > p(y = +1 \mid x) \end{cases} = \operatorname{sign}\big(p(y = +1 \mid x) - p(y = -1 \mid x)\big) \quad (1)
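A minimal sketch of the Bayes-optimal classifier for two assumed 1-D Gaussian class-conditional densities (means ±1, unit variance, equal priors; none of these values come from the slides). Since the common factor 1/p(x) cancels in the comparison, we can compare the joint densities p(x | y) · p(y) directly:

```python
from math import sqrt, pi, exp

def normal_pdf(x, mu, sigma):
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

p_pos, p_neg = 0.5, 0.5    # priors p(y = +1), p(y = -1)

def bayes_classifier(x):
    # sign of p(y = +1 | x) - p(y = -1 | x); dividing by p(x) > 0 does not
    # change the sign, so comparing the joints p(x | y) * p(y) is equivalent.
    joint_pos = normal_pdf(x, +1.0, 1.0) * p_pos
    joint_neg = normal_pdf(x, -1.0, 1.0) * p_neg
    return +1 if joint_pos > joint_neg else -1

# With equal priors and equal variances the decision border lies at x = 0.
assert bayes_classifier(0.5) == +1
assert bayes_classifier(-0.5) == -1
```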
GENERALIZATION ERROR FOR BINARY CLASSIFICATION (3/3)
The resulting minimal risk is
R_{\min} = \int_X \min\big(p(x, y = -1),\, p(x, y = +1)\big)\,dx = \int_X \min\big(p(y = -1 \mid x),\, p(y = +1 \mid x)\big) \cdot p(x)\,dx
Obviously, for non-overlapping classes, i.e. \min(p(y = -1 \mid x), p(y = +1 \mid x)) = 0, the minimal risk is zero and the optimal classification function is
g(x) = \begin{cases} +1 & \text{if } p(y = +1 \mid x) > 0 \\ -1 & \text{if } p(y = -1 \mid x) > 0 \end{cases}
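For an assumed symmetric 1-D Gaussian setup (means ±1, unit variance, equal priors; illustrative values only), R_min can be approximated by numerically integrating the pointwise minimum of the two joint densities; the closed-form value for this case is Φ(−1) ≈ 0.1587:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

xs = np.linspace(-10.0, 10.0, 200_001)
dx = xs[1] - xs[0]

# Joint densities p(x, y = ±1) = p(x | y = ±1) * p(y = ±1) with priors 1/2.
joint_neg = 0.5 * normal_pdf(xs, -1.0, 1.0)
joint_pos = 0.5 * normal_pdf(xs, +1.0, 1.0)

r_min = np.minimum(joint_neg, joint_pos).sum() * dx  # Riemann-sum integral
assert abs(r_min - 0.1587) < 0.001
```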
MINIMIZING RISK FOR A GAUSSIAN CLASSIFICATION TASK (1/4)
Assume that both the negative and the positive class are distributed according to d-variate normal distributions, i.e., p(x \mid y = -1) is N(\mu_{-1}, \Sigma_{-1})-distributed and p(x \mid y = +1) is N(\mu_{+1}, \Sigma_{+1})-distributed.
Note that the distribution density of a d-variate N(\mu, \Sigma)-distributed random variable is given as
p(x) = \frac{1}{(2\pi)^{d/2} \cdot \sqrt{\det \Sigma}} \cdot \exp\Big(-\frac{1}{2} \cdot (x - \mu)\,\Sigma^{-1}(x - \mu)^T\Big)
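The density can be written out directly in NumPy (same row-vector convention as above); as a sanity check, for d = 1, µ = 0, Σ = (1) it must reduce to the standard normal density:

```python
import numpy as np

def gauss_density(x, mu, Sigma):
    """d-variate N(mu, Sigma) density, row-vector convention as above."""
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / norm

# Sanity check against the univariate standard normal density at x = 0.5:
# exp(-0.5 * 0.25) / sqrt(2 * pi).
val = gauss_density(np.array([0.5]), np.array([0.0]), np.array([[1.0]]))
assert np.isclose(val, np.exp(-0.125) / np.sqrt(2 * np.pi))
```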
MINIMIZING RISK FOR A GAUSSIAN CLASSIFICATION TASK (2/4)
Using (1), we can infer

g(x) = \operatorname{sign}(\bar{g}(x)) = \operatorname{sign}(\tilde{g}(x)),

where

\bar{g}(x) = p(y = +1 \mid x) - p(y = -1 \mid x) = \frac{1}{p(x)} \cdot \big(p(x \mid y = +1) \cdot p(y = +1) - p(x \mid y = -1) \cdot p(y = -1)\big)

\tilde{g}(x) = \ln p(x \mid y = +1) - \ln p(x \mid y = -1) + \ln p(y = +1) - \ln p(y = -1)

\bar{g} and \tilde{g} are called discriminant functions.
ADVANCED BACKGROUND INFORMATION
MINIMIZING RISK FOR A GAUSSIAN CLASSIFICATION TASK (3/4)
Determining an optimal discriminant function:
\tilde{g}(x) = -\tfrac{1}{2}(x - \mu_{+1})\,\Sigma_{+1}^{-1}(x - \mu_{+1})^T - \tfrac{d}{2}\ln 2\pi - \tfrac{1}{2}\ln\det\Sigma_{+1} + \ln p(y = +1)
\qquad + \tfrac{1}{2}(x - \mu_{-1})\,\Sigma_{-1}^{-1}(x - \mu_{-1})^T + \tfrac{d}{2}\ln 2\pi + \tfrac{1}{2}\ln\det\Sigma_{-1} - \ln p(y = -1)

= -\tfrac{1}{2}(x - \mu_{+1})\,\Sigma_{+1}^{-1}(x - \mu_{+1})^T - \tfrac{1}{2}\ln\det\Sigma_{+1} + \ln p(y = +1)
\qquad + \tfrac{1}{2}(x - \mu_{-1})\,\Sigma_{-1}^{-1}(x - \mu_{-1})^T + \tfrac{1}{2}\ln\det\Sigma_{-1} - \ln p(y = -1)

= -\tfrac{1}{2}\, x \underbrace{\big(\Sigma_{+1}^{-1} - \Sigma_{-1}^{-1}\big)}_{=A}\, x^T + \underbrace{\big(\mu_{+1}\Sigma_{+1}^{-1} - \mu_{-1}\Sigma_{-1}^{-1}\big)}_{=b}\, x^T
\qquad \underbrace{- \tfrac{1}{2}\mu_{+1}\Sigma_{+1}^{-1}\mu_{+1}^T + \tfrac{1}{2}\mu_{-1}\Sigma_{-1}^{-1}\mu_{-1}^T - \tfrac{1}{2}\ln\det\Sigma_{+1} + \tfrac{1}{2}\ln\det\Sigma_{-1} + \ln p(y = +1) - \ln p(y = -1)}_{=c}

= -\tfrac{1}{2}\, x A x^T + b\, x^T + c
MINIMIZING RISK FOR A GAUSSIAN CLASSIFICATION TASK (4/4)
Thus, the optimal classification border \tilde{g}(x) = 0 is a d-dimensional hyper-quadric -\tfrac{1}{2}\, x A x^T + b\, x^T + c = 0.
In the special case \Sigma_{-1} = \Sigma_{+1}, we obtain A = 0, i.e. the optimal classification border is a linear hyperplane (a separating line in the case d = 2).
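The coefficients A, b, c can be computed numerically and checked against the log-density difference; the Gaussian parameters and priors below are illustrative made-up values, not those of the slides' examples:

```python
import numpy as np

def log_gauss(x, mu, S):
    d = len(mu)
    diff = x - mu
    return (-0.5 * diff @ np.linalg.inv(S) @ diff
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(S)))

# Illustrative (assumed) class parameters and priors.
mu_p, S_p = np.array([0.3, 0.7]), np.array([[0.02, 0.0], [0.0, 0.03]])
mu_n, S_n = np.array([0.5, 0.3]), np.array([[0.01, 0.0], [0.0, 0.04]])
prior_p, prior_n = 0.45, 0.55

Sp_inv, Sn_inv = np.linalg.inv(S_p), np.linalg.inv(S_n)
A = Sp_inv - Sn_inv
b = mu_p @ Sp_inv - mu_n @ Sn_inv
c = (-0.5 * mu_p @ Sp_inv @ mu_p + 0.5 * mu_n @ Sn_inv @ mu_n
     - 0.5 * np.log(np.linalg.det(S_p)) + 0.5 * np.log(np.linalg.det(S_n))
     + np.log(prior_p) - np.log(prior_n))

# The quadric -1/2 x A x^T + b x^T + c must equal
# ln p(x|+1) - ln p(x|-1) + ln p(+1) - ln p(-1) at every point.
x = np.array([0.4, 0.5])
quadric = -0.5 * x @ A @ x + b @ x + c
logdiff = (log_gauss(x, mu_p, S_p) - log_gauss(x, mu_n, S_n)
           + np.log(prior_p) - np.log(prior_n))
assert np.isclose(quadric, logdiff)
```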
GAUSSIAN CLASSIFICATION EXAMPLE #1 (1/6)

The data shown on Slide 9 were created according to the following distributions:
p(x | y = +1) corresponds to a two-variate normal distribution with parameters
\mu_{+1} = (0.3, 0.7) \qquad \Sigma_{+1} = \begin{pmatrix} 0.011875 & 0.016238 \\ 0.016238 & 0.030625 \end{pmatrix}

p(x \mid y = -1) corresponds to a two-variate normal distribution with parameters

\mu_{-1} = (0.5, 0.3) \qquad \Sigma_{-1} = \begin{pmatrix} 0.011875 & -0.016238 \\ -0.016238 & 0.030625 \end{pmatrix}

p(y = +1) = \tfrac{55}{120} = 0.45833 \qquad p(y = -1) = \tfrac{65}{120} = 0.54167
GAUSSIAN CLASSIFICATION EXAMPLE #1 (2/6 to 6/6)

[Figures: the mixture density p(x) = p(x | y = -1) · p(y = -1) + p(x | y = +1) · p(y = +1); the difference p(x | y = +1) · p(y = +1) - p(x | y = -1) · p(y = -1); the discriminant function, i.e. this difference divided by p(x); the data with the optimal decision border; the data with the estimated decision border.]
GAUSSIAN CLASSIFICATION EXAMPLE #2 (1/6)

Let us consider a data set created according to the following distributions:

p(x \mid y = +1) corresponds to a two-variate normal distribution with parameters

\mu_{+1} = (0.4, 0.8) \qquad \Sigma_{+1} = \begin{pmatrix} 0.09 & 0.0 \\ 0.0 & 0.0049 \end{pmatrix}

p(x \mid y = -1) corresponds to a two-variate normal distribution with parameters

\mu_{-1} = (0.5, 0.3) \qquad \Sigma_{-1} = \begin{pmatrix} 0.00398011 & -0.00730159 \\ -0.00730159 & 0.0385199 \end{pmatrix}

p(y = +1) = \tfrac{55}{120} = 0.45833 \qquad p(y = -1) = \tfrac{65}{120} = 0.54167
GAUSSIAN CLASSIFICATION EXAMPLE #2 (2/6 to 6/6)

[Figures: the mixture density p(x) = p(x | y = -1) · p(y = -1) + p(x | y = +1) · p(y = +1); the difference p(x | y = +1) · p(y = +1) - p(x | y = -1) · p(y = -1); the discriminant function, i.e. this difference divided by p(x); the data with the optimal decision border; the data with the estimated decision border.]
GAUSSIAN CLASSIFICATION EXAMPLE #3 (1/6)

Let us consider a data set created according to the following distributions:

p(x \mid y = +1) corresponds to a two-variate normal distribution with parameters

\mu_{+1} = (0.3, 0.7) \qquad \Sigma_{+1} = \begin{pmatrix} 0.0016 & 0.0 \\ 0.0 & 0.0016 \end{pmatrix}

p(x \mid y = -1) corresponds to a two-variate normal distribution with parameters

\mu_{-1} = (0.6, 0.3) \qquad \Sigma_{-1} = \begin{pmatrix} 0.09 & 0.0 \\ 0.0 & 0.09 \end{pmatrix}

p(y = +1) = \tfrac{1}{12} = 0.0833 \qquad p(y = -1) = \tfrac{11}{12} = 0.9167
GAUSSIAN CLASSIFICATION EXAMPLE #3 (2/6 to 6/6)

[Figures: the mixture density p(x) = p(x | y = -1) · p(y = -1) + p(x | y = +1) · p(y = +1); the difference p(x | y = +1) · p(y = +1) - p(x | y = -1) · p(y = -1); the discriminant function, i.e. this difference divided by p(x); the data with the optimal decision border; the data with the estimated decision border.]
WHAT ABOUT PRACTICE?
In practice, we hardly have any knowledge about p(x, y).

If we had, we could infer optimal prediction functions directly without using any machine learning method.

Therefore,

1. we have to estimate the prediction function with other methods;
2. we have to estimate the generalization error.
A BASIC CLASSIFIER: k-NEAREST NEIGHBOR
Suppose we have a labeled data set Z and a distance measure on the input space. Then the k-nearest neighbor classifier is defined as follows:
g_{k-NN}(x; Z) = class that occurs most often among the k samples that are closest to x
For k = 1, we simply call this the nearest neighbor classifier:

g_{NN}(x; Z) = class of the sample that is closest to x
In case of ties, a special strategy has to be employed, e.g. random class assignment or choosing the class with the larger number of samples.
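A minimal sketch of the classifier with Euclidean distance and majority vote; the tie-breaking rule (prefer the smaller label) is just one deterministic choice, and the data are made-up toy values:

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=1):
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances to x
    nearest = np.argsort(dists)[:k]               # indices of the k closest samples
    votes = Counter(y_train[nearest])
    # Majority vote; ties broken toward the smaller label (one possible strategy).
    return max(votes.items(), key=lambda kv: (kv[1], -kv[0]))[0]

X_train = np.array([[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [0.9, 0.8]])
y_train = np.array([-1, -1, +1, +1])
assert knn_classify(np.array([0.05, 0.0]), X_train, y_train, k=1) == -1
assert knn_classify(np.array([0.95, 0.9]), X_train, y_train, k=3) == +1
```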
k-NEAREST NEIGHBOR CLASSIFIER EXAMPLE #1

[Figures: a data set on [0, 1] × [0, 1] together with the decision regions of the k-nearest neighbor classifier for k = 1, k = 5, and k = 13.]
k-NEAREST NEIGHBOR CLASSIFIER EXAMPLE #2

[Figures: a data set on [0, 1] × [0, 1] together with the decision regions of the k-nearest neighbor classifier for k = 1, k = 5, k = 13, and k = 25.]
A BASIC NUMERICAL PREDICTOR: 1D LINEAR REGRESSION
Consider a data set Z = \{(x_i, y_i) \mid i = 1, \dots, l\} \subseteq \mathbb{R}^2 and a linear model

y = w_0 + w_1 \cdot x = g\big(x; \underbrace{(w_0, w_1)}_{w}\big).

Suppose we want to find (w_0, w_1) such that the average quadratic loss,

Q(w_0, w_1) = \frac{1}{l} \sum_{i=1}^{l} \big(w_0 + w_1 \cdot x_i - y_i\big)^2 = \frac{1}{l} \sum_{i=1}^{l} \big(g(x_i; w) - y_i\big)^2,

is minimized. Then the unique global solution is given as follows:

w_1 = \frac{\operatorname{Cov}(x, y)}{\operatorname{Var}(x)} \qquad w_0 = \bar{y} - w_1 \cdot \bar{x}
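The closed-form solution can be checked on a toy data set sampled from a known line (values made up):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = 1.5 * x + 0.5                        # exact line, no noise

# w1 = Cov(x, y) / Var(x), w0 = mean(y) - w1 * mean(x).
# bias=True makes np.cov divide by l, matching np.var's convention.
w1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)
w0 = y.mean() - w1 * x.mean()
assert np.isclose(w1, 1.5) and np.isclose(w0, 0.5)
```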
LINEAR REGRESSION EXAMPLE #1

[Figures: a one-dimensional data set and the fitted regression line.]
LINEAR REGRESSION FOR MULTIPLE VARIABLES

Consider a data set $Z = \{(\mathbf{x}_i, y_i) \mid i = 1, \dots, l\}$ and a linear model
$$y = w_0 + w_1 x_1 + \cdots + w_d x_d = (1 \mid \mathbf{x}) \cdot \mathbf{w} = g\bigl(\mathbf{x}; \underbrace{(w_0, w_1, \dots, w_d)}_{\mathbf{w}^T}\bigr).$$
Suppose we want to find $\mathbf{w} = (w_0, w_1, \dots, w_d)^T$ such that the average quadratic loss is minimized. Then the unique global solution is given as
$$\mathbf{w} = \underbrace{\bigl(\tilde{X}^T \cdot \tilde{X}\bigr)^{-1} \cdot \tilde{X}^T}_{\tilde{X}^{+}} \cdot \mathbf{y},$$
where $\tilde{X} = (\mathbf{1} \mid X)$ is the data matrix $X$ augmented with a column of ones.
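In code, the pseudoinverse solution can be sketched as follows (a minimal illustration; the function name and data are invented for the example):

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares weights via the pseudoinverse of the augmented data matrix (1 | X)."""
    X_aug = np.column_stack([np.ones(len(X)), X])   # prepend the all-ones column
    return np.linalg.pinv(X_aug) @ y                # equals (X^T X)^{-1} X^T y for full rank

# Noise-free data generated from y = 1 + 2*x1 - 3*x2 is recovered exactly.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]
w = fit_linear(X, y)   # approximately (1, 2, -3)
```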
LINEAR REGRESSION EXAMPLE #2

[Figure: a two-variable data set over the unit square and the fitted regression plane, with values ranging from -1 to 1]
POLYNOMIAL REGRESSION

Consider a data set $Z = \{(x_i, y_i) \mid i = 1, \dots, l\}$ and a polynomial model of degree $n$
$$y = w_0 + w_1 \cdot x + w_2 \cdot x^2 + \cdots + w_n \cdot x^n = g\bigl(x; \underbrace{(w_0, w_1, \dots, w_n)}_{\mathbf{w}^T}\bigr).$$
Suppose we want to find $\mathbf{w} = (w_0, w_1, \dots, w_n)^T$ such that the average quadratic loss is minimized. Then the unique global solution is given as follows:
$$\mathbf{w} = \underbrace{\bigl(\tilde{X}^T \cdot \tilde{X}\bigr)^{-1} \cdot \tilde{X}^T}_{\tilde{X}^{+}} \cdot \mathbf{y} \quad \text{with} \quad \tilde{X} = (\mathbf{1} \mid \mathbf{x} \mid \mathbf{x}^2 \mid \cdots \mid \mathbf{x}^n)$$
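This is ordinary least squares on a Vandermonde design matrix; a minimal sketch (function name and data invented for illustration):

```python
import numpy as np

def fit_poly(x, y, n):
    """Degree-n polynomial least squares on the Vandermonde matrix (1 | x | x^2 | ... | x^n)."""
    X = np.vander(x, n + 1, increasing=True)   # column j holds x**j
    return np.linalg.pinv(X) @ y               # coefficients (w0, w1, ..., wn)

# An exact quadratic is recovered when n = 2.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 - x + 0.5 * x**2
w = fit_poly(x, y, 2)   # approximately (2, -1, 0.5)
```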
POLYNOMIAL REGRESSION EXAMPLE

[Figure: least-squares polynomial fits of degree n = 1, 2, 3, 5, 25, and 75 to the same 1D data set]
EMPIRICAL RISK MINIMIZATION (ERM)

Linear and polynomial regression have been concerned with minimizing the average loss on a given (training) data set. This strategy is called empirical risk minimization:

Given a training set $Z_l$, empirical risk minimization is concerned with finding a parameter setting $\mathbf{w}$ such that the empirical risk
$$R_{\mathrm{emp}}\bigl(g(.; \mathbf{w}), Z_l\bigr) = \frac{1}{l} \cdot \sum_{i=1}^{l} L\bigl(y_i, g(x_i; \mathbf{w})\bigr)$$
is minimal (or at least as small as possible).
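Computing the empirical risk for an arbitrary loss is a one-liner; a hedged sketch with an invented toy model and data:

```python
import numpy as np

def empirical_risk(model, loss, X, y):
    """Average loss of `model` over the sample pairs (x_i, y_i)."""
    return np.mean([loss(yi, model(xi)) for xi, yi in zip(X, y)])

quadratic = lambda y_true, y_pred: (y_true - y_pred) ** 2
model = lambda x: 2.0 * x                      # toy predictor y = 2x
X = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 7.0])                  # last target is off by 1
risk = empirical_risk(model, quadratic, X, y)  # (0 + 0 + 1) / 3
```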
ESTIMATING THE RISK: TEST SET METHOD

Assume that we have $m$ more data samples $Z_m = (z_{l+1}, \dots, z_{l+m})$, the so-called test set, that are independently and identically distributed (i.i.d.) according to $p(x, y)$ (and, therefore, so is $L(y, g(x; \mathbf{w}))$). Then
$$R_{\mathrm{emp}}\bigl(g(.; \mathbf{w}), Z_m\bigr) = \frac{1}{m} \sum_{j=1}^{m} L\bigl(y_{l+j}, g(x_{l+j}; \mathbf{w})\bigr) \qquad (2)$$
can be considered an estimate of the risk $R(g(.; \mathbf{w}))$. By the (strong) law of large numbers, this estimate converges to $R(g(.; \mathbf{w}))$ for $m \to \infty$.
TEST SET METHOD: PRACTICAL REALIZATION

The common way of applying the test set method in practice is the following:

1. Split the set of labeled samples into a training set of l samples and a test set of m samples.
2. Perform model selection, i.e. find a suitable model w, making use only of the training set (hence, w = w(Z_l)), while withholding the test set.
3. Estimate the generalization error by (2) using the test set.

This is also called the hold-out method.
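A random hold-out split can be sketched as follows (function name, test fraction, and seed are illustrative choices, not prescribed by the slides):

```python
import numpy as np

def holdout_split(X, y, test_fraction=0.25, seed=0):
    """Shuffle the indices, then split; shuffling guards against data that arrives ordered."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    m = int(round(test_fraction * len(X)))
    test_idx, train_idx = idx[:m], idx[m:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

X = np.arange(20, dtype=float)
y = 3.0 * X
X_tr, y_tr, X_te, y_te = holdout_split(X, y)   # 15 training / 5 test samples
```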
TEST SET METHOD: A WORD OF CAUTION

The model g(.; w) is geared to the training set. Therefore, for training and test samples, the random variables L(y, g(x; w)) are not identically distributed. Hence, the estimate (2) becomes invalid as soon as a single training sample is used for estimating the risk. Therefore:

Training samples may never be used for "testing", i.e. estimating the generalization error!
Test samples may never be used for training!
TEST SET METHOD: PRACTICAL CAVEATS

To avoid the pitfall described above, take the following rules into account:

1. Choose training/test samples randomly (unless you can be completely sure that they are already in random order)! If the probabilities of being selected as a training or test sample are not equal for all samples, the independence property cannot be guaranteed.
2. Make sure that there is not the slightest influence of the test samples on the selection of the model! Pre-processing or feature selection steps that use all samples also imply that the estimate is biased.
CROSS VALIDATION: MOTIVATION

The following truisms can be stated about the test set method:

The more training samples (and the fewer test samples), the better the model, but the worse the risk estimate.
The more test samples (and the fewer training samples), the coarser the model, but the better the risk estimate.

In particular, for small sample sets, the requirement that training and test set must not overlap is painful.

Question: can we somehow improve the risk estimate without necessarily sacrificing model accuracy?
CROSS VALIDATION: BASIC IDEA

A simple idea would be to perform the splitting into training and test set several times and to average the estimates. This is incorrect, as the test sets overlap and, therefore, are no longer independent. Cross validation follows this line of thought to some extent, but splits the sample set into n disjoint fractions, so-called folds (for simplicity, assume in the following that l is divisible by n):

1. Training is done n times, each time leaving out one fold (i.e. taking the other n - 1 folds as the training set).
2. The risk estimate is then computed as the average of the risk estimates on the n left-out test folds.

The special case n = l is commonly called leave-one-out cross validation.
FIVE-FOLD CROSS VALIDATION VISUALIZED

[Figure: the sample set split into five folds; in round j (j = 1, ..., 5), fold j is used for evaluation and the remaining four folds for training]
CROSS VALIDATION: DEFINITION

We denote a given arbitrary sample set with $l$ elements as $Z_l$ in the following. The $j$-th fold inside $Z_l$ is denoted as $Z^j_{l/n}$ and the sample set corresponding to the remaining $n - 1$ folds as $Z_l \setminus Z^j_{l/n}$.

Then the risk estimate given by the $j$-th fold is given as
$$R_{n\text{-}\mathrm{cv},j}(Z_l) = \frac{n}{l} \sum_{z \in Z^j_{l/n}} L\Bigl(y, g\bigl(x; \mathbf{w}_j(Z_l \setminus Z^j_{l/n})\bigr)\Bigr).$$

The $n$-fold cross validation risk is defined as
$$R_{n\text{-}\mathrm{cv}}(Z_l) = \frac{1}{n} \sum_{j=1}^{n} R_{n\text{-}\mathrm{cv},j}(Z_l) = \frac{1}{l} \sum_{j=1}^{n} \sum_{z \in Z^j_{l/n}} L\Bigl(y, g\bigl(x; \mathbf{w}_j(Z_l \setminus Z^j_{l/n})\bigr)\Bigr).$$
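The definition translates directly into code. The sketch below assumes a generic `fit` function that returns a trained predictor; all names and the constant-predictor example are invented for illustration:

```python
import numpy as np

def cv_risk(fit, loss, X, y, n=5):
    """n-fold CV risk: train on n-1 folds, sum the loss on the left-out fold, divide by l."""
    folds = np.array_split(np.arange(len(X)), n)
    total = 0.0
    for j in range(n):
        test = folds[j]
        train = np.concatenate([folds[i] for i in range(n) if i != j])
        model = fit(X[train], y[train])
        total += sum(loss(y[k], model(X[k])) for k in test)
    return total / len(X)

# Toy learner: predict the mean of the training labels, regardless of x.
fit_mean = lambda X, y: (lambda x, c=y.mean(): c)
sq = lambda y_true, y_pred: (y_true - y_pred) ** 2
X, y = np.arange(10, dtype=float), np.zeros(10)
risk = cv_risk(fit_mean, sq, X, y, n=5)   # all-zero targets give zero risk
```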
ADVANCED BACKGROUND INFORMATION

CROSS VALIDATION: JUSTIFICATION

Theorem (Luntz & Brailovsky). The cross-validation risk estimate is an "almost unbiased estimator":
$$\mathbb{E}_{Z_{l-l/n}}\Bigl(R\bigl(g(.; \mathbf{w}(Z_{l-l/n}))\bigr)\Bigr) = \mathbb{E}_{Z_l}\bigl(R_{n\text{-}\mathrm{cv}}(Z_l)\bigr)$$
CROSS VALIDATION: MISCELLANEA

Obviously, n different models are computed during n-fold cross validation. Questions:

1. Which of these models should we select in the end?
2. Can we get a better model if we manage to average these n models?

Answers:

1. None, as the selection would be biased toward a certain fold.
2. It depends on the model class whether this is possible and meaningful (see later).

A good strategy is, once we know about the generalization abilities of our model, to finally train a model using all l samples.
CROSS VALIDATION: MISCELLANEA (cont'd)

Cross validation is also commonly applied to finding good choices of hyperparameters, i.e. by selecting those hyperparameters for which the smallest cross validation risk is obtained.

Note, however, that the obtained risk estimate is then biased toward the whole training set. If an unbiased estimate of the risk is desired, this can only be achieved by a combination of the test set method and cross validation:

1. Split the sample set into a training set and a test set first.
2. Apply cross validation on the training set (completely withholding the test set) to find the best hyperparameter choice.
3. Finally, compute the risk estimate using the test set.
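Step 2 of this procedure can be sketched for choosing a polynomial degree; the helper `select_degree`, the candidate degrees, and the toy data are all invented for illustration, and steps 1 and 3 (splitting off and finally evaluating on a test set) would wrap around this selection:

```python
import numpy as np

def select_degree(x_tr, y_tr, degrees, n=5):
    """Pick the polynomial degree with the smallest n-fold cross-validation risk."""
    def cv(deg):
        folds = np.array_split(np.arange(len(x_tr)), n)
        err = 0.0
        for j in range(n):
            te = folds[j]
            tr = np.concatenate([folds[i] for i in range(n) if i != j])
            w = np.polyfit(x_tr[tr], y_tr[tr], deg)          # fit on n-1 folds
            err += np.sum((np.polyval(w, x_tr[te]) - y_tr[te]) ** 2)
        return err / len(x_tr)
    return min(degrees, key=cv)

x = np.linspace(0.0, 1.0, 40)
y = (x - 0.5) ** 2            # noise-free quadratic
best = select_degree(x, y, degrees=[1, 2])   # degree 2 fits (almost) exactly
```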
ERM IN PRACTICE

As said, ERM is concerned with minimizing the training error. Our goal, however, is to minimize the generalization error (which can be estimated by the test error). In other words, ERM does not use the correct objective!

Question: what can go wrong because of using the wrong objective?
UNDERFITTING AND OVERFITTING

Underfitting: our model is too coarse to fit the data (neither training nor test data); this is usually the result of too restrictive model assumptions (i.e. too low model complexity).

Overfitting: our model works very well on training data, but generalizes poorly to future/test data; this is usually the result of too high model complexity.
NOTORIOUS SITUATION IN PRACTICE

[Figure: error versus model complexity; the training error decreases with complexity, while the test error first decreases and then rises again: the low-complexity regime is underfitting, the high-complexity regime is overfitting]
BIAS-VARIANCE DECOMPOSITION FOR QUADRATIC LOSS (1/4)

We are interested in the expected prediction error for a given $x_0 \in X$ (assuming that the size of the training set is fixed to $l$ examples):
$$\mathrm{EPE}(x_0) = \mathbb{E}_{y|x_0, Z_l}\bigl(L_q(y, g(x_0; \mathbf{w}(Z_l)))\bigr) = \mathbb{E}_{y|x_0, Z_l}\bigl((y - g(x_0; \mathbf{w}(Z_l)))^2\bigr)$$

Since $y \mid x_0$ and the selection of training samples are independent (or at least this should be assumed to be the case), we can infer the following:
$$\mathrm{EPE}(x_0) = \mathbb{E}_{y|x_0}\Bigl(\mathbb{E}_{Z_l}\bigl((y - g(x_0; \mathbf{w}(Z_l)))^2\bigr)\Bigr)$$
BIAS-VARIANCE DECOMPOSITION FOR QUADRATIC LOSS (2/4)

Using basic properties of expected values, we can infer the following representation:
$$\mathrm{EPE}(x_0) = \mathrm{Var}(y \mid x_0) + \Bigl(\mathbb{E}(y \mid x_0) - \mathbb{E}_{Z_l}\bigl(g(x_0; \mathbf{w}(Z_l))\bigr)\Bigr)^2 + \mathbb{E}_{Z_l}\Bigl(\bigl(g(x_0; \mathbf{w}(Z_l)) - \mathbb{E}_{Z_l}(g(x_0; \mathbf{w}(Z_l)))\bigr)^2\Bigr)$$
BIAS-VARIANCE DECOMPOSITION FOR QUADRATIC LOSS (3/4)

1. The first term, $\mathrm{Var}(y \mid x_0)$, is nothing else but the average amount to which the label $y$ varies at $x_0$. This is often termed the unavoidable error.

2. The second term,
$$\mathrm{bias}^2 = \Bigl(\mathbb{E}(y \mid x_0) - \mathbb{E}_{Z_l}\bigl(g(x_0; \mathbf{w}(Z_l))\bigr)\Bigr)^2,$$
measures how closely the model on average approximates the average target $y$ at $x_0$; thus, it is nothing else but the squared bias.

3. The third term,
$$\mathrm{variance} = \mathbb{E}_{Z_l}\Bigl(\bigl(g(x_0; \mathbf{w}(Z_l)) - \mathbb{E}_{Z_l}(g(x_0; \mathbf{w}(Z_l)))\bigr)^2\Bigr),$$
is nothing else but the variance of the models at $x_0$, i.e. $\mathrm{Var}_{Z_l}(g(x_0; \mathbf{w}(Z_l)))$.
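The decomposition can be checked by simulation. The sketch below uses an invented toy setup: the "learner" is a trivial constant predictor (the mean of its training labels), so the three terms are easy to estimate over many independently drawn training sets; an independent Monte-Carlo estimate of EPE(x0) should then match their sum up to sampling noise.

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                       # true regression function, assumed for the toy example
sigma, l, x0, runs = 0.3, 10, 1.0, 20000

# Draw many training sets; each "model" predicts the mean of its training labels.
xs = rng.uniform(0.0, np.pi, (runs, l))
preds = (f(xs) + sigma * rng.standard_normal((runs, l))).mean(axis=1)

noise = sigma ** 2                         # Var(y | x0), the unavoidable error
bias2 = (f(x0) - preds.mean()) ** 2        # squared bias at x0
var = preds.var()                          # variance of the models at x0
epe = noise + bias2 + var

# Independent direct estimate of EPE(x0) for comparison.
y_new = f(x0) + sigma * rng.standard_normal(runs)
direct = np.mean((y_new - preds) ** 2)     # should be close to epe
```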
BIAS-VARIANCE DECOMPOSITION FOR QUADRATIC LOSS (4/4)

[Figure: at a fixed $x_0$, the label distribution $p(y \mid x_0)$ with mean $\mathbb{E}(y \mid x_0)$ and spread $\mathrm{Var}(y \mid x_0)^{1/2}$ is shown next to the distribution of model predictions $p(g(x_0; \mathbf{w}(Z_l)))$ with mean $\mathbb{E}_{Z_l}(g(x_0; \mathbf{w}(Z_l)))$ and spread $\mathrm{Var}_{Z_l}(g(x_0; \mathbf{w}(Z_l)))^{1/2}$; the distance between the two means is the bias]
BIAS-VARIANCE DECOMPOSITION: SIMPLIFICATIONS

Assume that $y(x) = f(x) + \varepsilon$ holds, where $f$ is a deterministic function and $\varepsilon$ is a random variable that has mean zero and variance $\sigma_\varepsilon^2$ and is independent of $x$. Then we can infer the following:
$$\mathrm{Var}(y \mid x_0) = \sigma_\varepsilon^2, \qquad \mathbb{E}(y \mid x_0) = f(x_0), \qquad \mathrm{bias}^2 = \Bigl(f(x_0) - \mathbb{E}_{Z_l}\bigl(g(x_0; \mathbf{w}(Z_l))\bigr)\Bigr)^2.$$

In the noise-free case ($\sigma_\varepsilon = 0$), consequently, we get $\mathrm{Var}(y \mid x_0) = 0$, i.e. the unavoidable error vanishes and the rest stays the same.
BIAS-VARIANCE DECOMPOSITION FOR BINARY CLASSIFICATION

Now assume that we are given a binary classification task, i.e. $y \in \{-1, +1\}$ and $g(x; \mathbf{w}) \in \{-1, +1\}$. Since $L_{\mathrm{zo}} = \frac{1}{4} L_q$ holds, we can infer the following:
$$\mathrm{EPE}(x_0) = \mathbb{E}_{y|x_0, Z_l}\bigl(L_{\mathrm{zo}}(y, g(x_0; \mathbf{w}))\bigr) = \frac{1}{4} \cdot \mathbb{E}_{y|x_0}\Bigl(\mathbb{E}_{Z_l}\bigl((y - g(x_0; \mathbf{w}(Z_l)))^2\bigr)\Bigr) = \frac{1}{4} \cdot \bigl(\mathrm{Var}(y \mid x_0) + \mathrm{bias}^2 + \mathrm{variance}\bigr)$$

Note that, in these calculations, $g$ is the final binary classification function and not an arbitrary discriminant function. If the latter is the case, the above representation is not valid! (see literature)
BIAS-VARIANCE DECOMPOSITION FOR BINARY CLASSIF. (cont'd)

With the notations $p_R = p(y = +1 \mid x_0)$ and $p_O = p_{Z_l}\bigl(g(x_0; \mathbf{w}(Z_l)) = +1\bigr)$, we can further infer
$$\mathrm{Var}(y \mid x_0) = 4 \cdot p_R \cdot (1 - p_R), \qquad \mathrm{bias}^2 = 4 \cdot (p_R - p_O)^2, \qquad \mathrm{variance} = 4 \cdot p_O \cdot (1 - p_O);$$
hence, we obtain
$$\mathrm{EPE}(x_0) = \underbrace{p_R \cdot (1 - p_R)}_{\text{unavoidable error}} + \underbrace{(p_R - p_O)^2}_{\text{squared bias}} + \underbrace{p_O \cdot (1 - p_O)}_{\text{variance}}.$$
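Expanding the three terms shows that they sum to p_R + p_O - 2·p_R·p_O = p_R·(1 - p_O) + (1 - p_R)·p_O, i.e. exactly the probability that label and prediction disagree. A quick numeric check over a grid of probabilities:

```python
import itertools

# Verify: pR(1-pR) + (pR-pO)^2 + pO(1-pO) == pR(1-pO) + (1-pR)pO for all pR, pO in [0, 1].
grid = [i / 20 for i in range(21)]
ok = True
for pR, pO in itertools.product(grid, grid):
    lhs = pR * (1 - pR) + (pR - pO) ** 2 + pO * (1 - pO)
    rhs = pR * (1 - pO) + (1 - pR) * pO      # probability of disagreement
    ok = ok and abs(lhs - rhs) < 1e-12
```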
THE BIAS-VARIANCE TRADE-OFF

It seems intuitively reasonable that the bias decreases with model complexity. Rationale: the more degrees of freedom we allow, the more easily we can fit the actual function/relationship.

It also seems intuitively clear that the variance increases with model complexity. Rationale: the more degrees of freedom we allow, the higher the risk of fitting to noise.

This is usually referred to as the bias-variance trade-off, sometimes even the bias-variance "dilemma".
THE BIAS-VARIANCE TRADE-OFF (cont'd)

[Figure: training and test error versus model complexity; the test error decomposes into the constant unavoidable error, the bias (decreasing with complexity), and the variance (increasing with complexity); low complexity corresponds to underfitting, high complexity to overfitting]
BIAS-VARIANCE DECOMPOSITION: SUMMARY

We can state that minimizing the generalization error (learning) is concerned with optimizing bias and variance simultaneously.

Underfitting = high bias = too simple a model
Overfitting = high variance = too complex a model

It is clear that empirical risk minimization itself does not include any mechanism to assess bias and variance independently (how should it?); more specifically, if we do not care about model complexity (in particular, if we allow highly or even arbitrarily complex models), ERM runs a high risk of producing overfitted models.
HOW TO EVALUATE CLASSIFIERS?

So far, the only measure we have considered for assessing the performance of a classifier has been the generalization error based on the zero-one loss.

What if the data set is unbalanced?
What if the misclassification cost depends on the sample's class?
Can we define a general performance measure independent of class distributions and misclassification costs?

In order to answer these questions, we first need to introduce confusion matrices.
CONFUSION MATRIX FOR BINARY CLASSIFICATION

Let us introduce the following terminology (for a given sample (x, y) and a classifier g(.)): (x, y) is a

true positive (TP) if y = +1 and g(x) = +1,
true negative (TN) if y = -1 and g(x) = -1,
false positive (FP) if y = -1 and g(x) = +1,
false negative (FN) if y = +1 and g(x) = -1.
CONFUSION MATRIX FOR BINARY CLASSIFICATION (cont'd)

Given a data set (z_1, ..., z_m), the confusion matrix is defined as follows:

                          predicted value g(x; w)
                              +1        -1
    actual value y    +1     #TP       #FN
                      -1     #FP       #TN

In this table, the entries #TP, #FP, #FN, and #TN denote the numbers of true positives, false positives, false negatives, and true negatives, respectively, for the given data set.
EVALUATION MEASURES DERIVED FROM CONFUSION MATRIX

Accuracy: proportion of correctly classified items, i.e.
ACC = (#TP + #TN) / (#TP + #FN + #FP + #TN).

True Positive Rate (aka recall/sensitivity): proportion of correctly identified positives, i.e.
TPR = #TP / (#TP + #FN).

False Positive Rate: proportion of negative examples that were incorrectly classified as positives, i.e.
FPR = #FP / (#FP + #TN).

Precision: proportion of predicted positive examples that were correct, i.e.
PREC = #TP / (#TP + #FP).

True Negative Rate (aka specificity): proportion of correctly identified negatives, i.e.
TNR = #TN / (#FP + #TN).

False Negative Rate: proportion of positive examples that were incorrectly classified as negatives, i.e.
FNR = #FN / (#TP + #FN).
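All six measures follow directly from the four counts; a small helper (function name and example counts are invented for illustration):

```python
def binary_rates(tp, fn, fp, tn):
    """The six basic measures derived from the binary confusion matrix."""
    return {
        "ACC":  (tp + tn) / (tp + fn + fp + tn),
        "TPR":  tp / (tp + fn),   # recall / sensitivity
        "FPR":  fp / (fp + tn),
        "PREC": tp / (tp + fp),
        "TNR":  tn / (fp + tn),   # specificity
        "FNR":  fn / (tp + fn),
    }

# Example counts: 40 TP, 10 FN, 5 FP, 45 TN.
rates = binary_rates(40, 10, 5, 45)   # e.g. ACC = 0.85, TPR = 0.8, FPR = 0.1
```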
EVALUATION MEASURES DESIGNED FOR UNBALANCED DATA

Balanced Accuracy: mean of true positive and true negative rate, i.e.

$$\mathrm{BACC} = \frac{\mathrm{TPR} + \mathrm{TNR}}{2}$$

Matthews Correlation Coefficient: measure of non-randomness of classification; defined as normalized determinant of the confusion matrix, i.e.

$$\mathrm{MCC} = \frac{\#TP \cdot \#TN - \#FP \cdot \#FN}{\sqrt{(\#TP + \#FP)(\#TP + \#FN)(\#TN + \#FP)(\#TN + \#FN)}}$$

F-score: harmonic mean of precision and recall, i.e.

$$F_1 = \frac{2 \cdot \mathrm{PREC} \cdot \mathrm{TPR}}{\mathrm{PREC} + \mathrm{TPR}}$$
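A short sketch (not from the slides) of the three unbalanced-data measures, computed from hypothetical counts:

```python
import math

def unbalanced_measures(tp, fp, fn, tn):
    """Balanced accuracy, Matthews correlation coefficient, and F1 score."""
    tpr, tnr = tp / (tp + fn), tn / (fp + tn)
    prec = tp / (tp + fp)
    bacc = (tpr + tnr) / 2                     # mean of TPR and TNR
    mcc = (tp * tn - fp * fn) / math.sqrt(     # normalized determinant
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    f1 = 2 * prec * tpr / (prec + tpr)         # harmonic mean of PREC and TPR
    return bacc, mcc, f1

bacc, mcc, f1 = unbalanced_measures(tp=40, fp=10, fn=5, tn=45)
```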
CONFUSION MATRIX FOR MULTI-CLASS CLASSIFICATION

Assume that we have a k-class classification task. Given a data set, the confusion matrix is defined as follows:

| actual y \ predicted g(x) | 1    | ··· | j    | ··· | k    |
|---------------------------|------|-----|------|-----|------|
| 1                         | C_11 | ··· | C_1j | ··· | C_1k |
| ⋮                         | ⋮    | ⋱   | ⋮    | ⋱   | ⋮    |
| i                         | C_i1 | ··· | C_ij | ··· | C_ik |
| ⋮                         | ⋮    | ⋱   | ⋮    | ⋱   | ⋮    |
| k                         | C_k1 | ··· | C_kj | ··· | C_kk |

The entries C_ij correspond to the numbers of test samples that actually belong to class i and have been classified as j by the classifier g(·).
ACCURACY FOR MULTI-CLASS CLASSIFICATION

For a multi-class classification task (with the notations as on the previous slide), the accuracy of a classifier g(·) is defined as

$$\mathrm{ACC} = \frac{\sum_{i=1}^{k} C_{ii}}{\sum_{i,j=1}^{k} C_{ij}} = \frac{1}{m} \sum_{i=1}^{k} C_{ii},$$

i.e., not at all surprisingly, as the proportion of correctly classified samples. The other evaluation measures cannot be generalized to the multi-class case in a straightforward way.
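A minimal sketch (not from the slides) of building a multi-class confusion matrix and reading off the accuracy, using made-up labels with classes numbered 0..k-1:

```python
def confusion_matrix(y_true, y_pred, k):
    """C[i][j] = number of samples of actual class i predicted as class j
    (classes numbered 0..k-1)."""
    C = [[0] * k for _ in range(k)]
    for t, p in zip(y_true, y_pred):
        C[t][p] += 1
    return C

def accuracy(C):
    """Proportion of correctly classified samples: diagonal sum over m."""
    m = sum(sum(row) for row in C)
    return sum(C[i][i] for i in range(len(C))) / m

# Hypothetical 3-class example with 6 test samples
C = confusion_matrix([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0], k=3)
```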
OTHER PERFORMANCE MEASURES FOR MULTI-CLASS CLASSIFICATION

Beside accuracy, the other evaluation measures cannot be generalized to the multi-class case in a direct way, but we can easily define them for each class separately. Given a class j, we can define the confusion matrix of class j as follows:

| actual y \ predicted g(x;w) | = j   | ≠ j   |
|-----------------------------|-------|-------|
| = j                         | #TP_j | #FN_j |
| ≠ j                         | #FP_j | #TN_j |

From this confusion matrix, we can easily define all previously known evaluation measures (for class j).
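The per-class collapse can be sketched as follows (not from the slides; the sample matrix is hypothetical):

```python
def one_vs_rest_counts(C, j):
    """Collapse a k-class confusion matrix C into the binary counts
    (#TPj, #FPj, #FNj, #TNj) for class j."""
    k = len(C)
    tp = C[j][j]
    fn = sum(C[j][l] for l in range(k) if l != j)   # actual j, predicted != j
    fp = sum(C[i][j] for i in range(k) if i != j)   # actual != j, predicted j
    tn = sum(C[i][l] for i in range(k)              # neither actual nor
             for l in range(k) if i != j and l != j)  # predicted class is j
    return tp, fp, fn, tn

C = [[1, 1, 0],
     [0, 2, 0],
     [1, 0, 1]]
tp1, fp1, fn1, tn1 = one_vs_rest_counts(C, j=1)
```

From the four counts, any of the binary measures above (TPR_j, PREC_j, etc.) follows directly.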
ADVANCED BACKGROUND INFORMATION
RISK FOR BINARY CLASSIFICATION: ASYMMETRIC CASE (1/3)

Consider the following loss function (with l_FP, l_FN > 0):

$$L_{\mathrm{as}}(y, g(x;w)) = \begin{cases} 0 & \text{if } y = g(x;w) \\ l_{FP} & \text{if } y = -1 \text{ and } g(x;w) = +1 \\ l_{FN} & \text{if } y = +1 \text{ and } g(x;w) = -1 \end{cases}$$

Then we obtain the following (where X_{+1} and X_{-1} denote the regions of X that g(·;w) classifies as +1 and -1, respectively):

$$R(g(\cdot;w)) = \int_{X_{-1}} l_{FN} \cdot p(x, y=+1)\,dx + \int_{X_{+1}} l_{FP} \cdot p(x, y=-1)\,dx$$

$$= \int_{X} \begin{cases} l_{FP} \cdot p(y=-1 \mid x) & \text{if } g(x;w) = +1 \\ l_{FN} \cdot p(y=+1 \mid x) & \text{if } g(x;w) = -1 \end{cases} \cdot p(x)\,dx$$
ADVANCED BACKGROUND INFORMATION
RISK FOR BINARY CLASSIFICATION: ASYMMETRIC CASE (2/3)

We can infer the following optimal classification function:

$$g(x) = \begin{cases} +1 & \text{if } l_{FN} \cdot p(y=+1 \mid x) > l_{FP} \cdot p(y=-1 \mid x) \\ -1 & \text{if } l_{FP} \cdot p(y=-1 \mid x) > l_{FN} \cdot p(y=+1 \mid x) \end{cases}$$

$$= \operatorname{sign}\big(l_{FN} \cdot p(y=+1 \mid x) - l_{FP} \cdot p(y=-1 \mid x)\big) \qquad (3)$$

The resulting minimal risk is

$$R_{\min} = \int_X \min\big(l_{FP} \cdot p(x, y=-1),\; l_{FN} \cdot p(x, y=+1)\big)\,dx$$

$$= \int_X \min\big(l_{FP} \cdot p(y=-1 \mid x),\; l_{FN} \cdot p(y=+1 \mid x)\big) \cdot p(x)\,dx$$
ADVANCED BACKGROUND INFORMATION
RISK FOR BINARY CLASSIFICATION: ASYMMETRIC CASE (3/3)

Since

$$l_{FN} \cdot p(y=+1 \mid x) > l_{FP} \cdot p(y=-1 \mid x)$$

if and only if (with the convention 1/0 = ∞)

$$\frac{p(y=+1 \mid x)}{p(y=-1 \mid x)} > \frac{l_{FP}}{l_{FN}},$$

we can rewrite (3) as follows:

$$g(x) = \operatorname{sign}\left(\frac{p(y=+1 \mid x)}{p(y=-1 \mid x)} - \frac{l_{FP}}{l_{FN}}\right)$$

Hence, the optimal classification function depends only on the ratio of l_FP and l_FN.
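The decision rule above can be sketched in a few lines (not from the slides; the posterior value and cost ratio below are made-up inputs, and the true posterior would of course have to be known or estimated):

```python
def optimal_decision(p_pos, l_fp, l_fn):
    """Bayes-optimal label under asymmetric losses: predict +1 exactly when
    the posterior odds p(y=+1|x)/p(y=-1|x) exceed the cost ratio l_FP/l_FN."""
    return +1 if l_fn * p_pos > l_fp * (1.0 - p_pos) else -1

# With l_FP = 5 and l_FN = 1, a false positive is five times as costly,
# so a posterior of 0.8 (odds 4 < 5) is not yet enough to predict +1.
label = optimal_decision(p_pos=0.8, l_fp=5.0, l_fn=1.0)
```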
GENERAL PERFORMANCE OF DISCRIMINANT FUNCTION

If we have a general discriminant function g that maps objects to real values, we can adjust to different asymmetric/unbalanced situations by varying the classification threshold θ (which is 0 by default), obtaining the classifier

$$\hat{g}(x) = \operatorname{sign}(g(x) - \theta)$$

Question: can we assess the general performance of a classifier without choosing a particular discrimination threshold?
ROC CURVES

- ROC stands for Receiver Operating Characteristic. The concept was introduced in signal detection theory.
- ROC curves are a simple means for evaluating the performance of a binary classifier independently of class distributions and misclassification costs.
- The basic idea of ROC curves is to plot the true positive rate (TPR) vs. the false positive rate (FPR) while varying the classification threshold.
ROC CURVES: PRACTICAL REALIZATION

- Sort the samples descendingly according to g.
- Divide the horizontal axis into as many bins as there are negative samples; divide the vertical axis into as many bins as there are positive samples.
- Start the curve at (0, 0).
- Iterate over all possible thresholds, i.e. all possible "slots" between two discriminant function values. Every positive sample is a step up, every negative sample is a step to the right.
- In case of ties (equal discriminant function values), process them at once (which results in a ramp in the curve).
- Finally, end the curve at (1, 1).
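The steps above can be sketched as follows (not part of the original slides; the scores and labels in the example are made up):

```python
from itertools import groupby

def roc_points(scores, labels):
    """ROC curve as a list of (FPR, TPR) points: descending sort,
    one slot per distinct score so that ties become ramps."""
    n_pos = sum(1 for y in labels if y > 0)
    n_neg = len(labels) - n_pos
    pairs = sorted(zip(scores, labels), reverse=True)      # descending by score
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for _, group in groupby(pairs, key=lambda sy: sy[0]):  # one "slot" per score
        for _, y in group:                                 # positives step up,
            if y > 0:                                      # negatives step right
                tp += 1
            else:
                fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points                                          # ends at (1.0, 1.0)

pts = roc_points([0.9, 0.8, 0.7, 0.6], [+1, +1, -1, +1])
```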
AREA UNDER THE ROC CURVE (AUC)

- The area under the ROC curve (AUC) is a common measure for assessing the general performance of a classifier g(·;w). The lowest possible value is 0, the highest possible value is 1. Obviously, the higher, the better.
- An AUC of 1 means that there exists a threshold which perfectly separates the test samples.
- A random classifier produces an AUC of 0.5 on average; hence, an AUC smaller than 0.5 corresponds to a classification that is worse than random, and an AUC greater than 0.5 corresponds to a classification that is better than random.
ADVANCED BACKGROUND INFORMATION
AUC: CORRESPONDENCES

Suppose that #p and #n are the numbers of positive and negative samples, respectively, and further assume that y is the label vector if the samples are sorted according to the discriminant function value g(·;w). Then the following holds:

$$\mathrm{AUC} = \frac{1}{\#p \cdot \#n} \left( \underbrace{\sum_{i:\, y_i > 0} i}_{=R} - \frac{\#p \cdot (\#p + 1)}{2} \right)$$

Obviously, R is the sum of the ranks of the positive examples. Note that

$$U = R - \frac{\#p \cdot (\#p + 1)}{2}$$

is nothing else but the Mann-Whitney-Wilcoxon statistic.
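A minimal sketch of the rank-based correspondence (not from the slides), assuming ascending ranks (rank 1 = smallest discriminant value) and no tied scores:

```python
def auc_rank(scores, labels):
    """AUC via the Mann-Whitney-Wilcoxon statistic: R is the sum of the
    ascending ranks of the positive samples (ties not handled here)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    rank = {idx: r + 1 for r, idx in enumerate(order)}   # ranks 1..m
    n_pos = sum(1 for y in labels if y > 0)
    n_neg = len(labels) - n_pos
    R = sum(rank[i] for i, y in enumerate(labels) if y > 0)
    U = R - n_pos * (n_pos + 1) / 2                      # Mann-Whitney U
    return U / (n_pos * n_neg)

# Perfectly separating scores yield the maximal AUC of 1
a = auc_rank([0.1, 0.2, 0.8, 0.9], [-1, -1, +1, +1])
```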
ROC EXAMPLE: GAUSSIAN CLASSIF. EXAMPLE #2 REVISITED

[Figure: scatter plot of the data from Gaussian classification example #2]
ROC CURVE FOR g((x1, x2)) = x1

[Figure: ROC curve, TPR vs. FPR; AUC = 0.3452]
ROC CURVE FOR g((x1, x2)) = x2

[Figure: ROC curve, TPR vs. FPR; AUC = 0.9863]
ROC EXAMPLE: k-NN EXAMPLE #2 REVISITED

[Figure: scatter plot of the data from k-NN example #2]
ROC CURVES FOR k-NN EXAMPLE #2 (75% training, 25% test samples)

[Figure: ROC curves, TPR vs. FPR, for varying k; the resulting AUC values are:]

| k  | AUC    |
|----|--------|
| 1  | 0.6896 |
| 5  | 0.9121 |
| 9  | 0.9286 |
| 13 | 0.9299 |
| 17 | 0.9313 |
| 21 | 0.9272 |
| 25 | 0.9478 |
| 29 | 0.9437 |
| 33 | 0.9409 |
PRECISION-RECALL (PR) CURVES

- For highly unbalanced data sets, in particular if there are many true negatives, ROC curves may not necessarily provide a very informative picture.
- To compute a precision-recall curve, sweep through all possible thresholds as for ROC curves, but plot precision (vertical axis) versus recall (horizontal axis).
- The higher the area under the curve, the better the classifier.
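The sweep can be sketched analogously to the ROC procedure (not part of the original slides; tie handling is omitted for brevity, and the example scores are made up):

```python
def pr_points(scores, labels):
    """Precision-recall pairs (recall, precision) obtained by sweeping the
    threshold over the scores in descending order (no tie handling)."""
    n_pos = sum(1 for y in labels if y > 0)
    points, tp, fp = [], 0, 0
    for _, y in sorted(zip(scores, labels), reverse=True):
        if y > 0:
            tp += 1
        else:
            fp += 1
        points.append((tp / n_pos, tp / (tp + fp)))  # (recall, precision)
    return points

pts = pr_points([0.9, 0.8, 0.7], [+1, -1, +1])
```

Unlike an ROC curve, a PR curve need not be monotone: each false positive pulls the precision down without moving the recall.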
PR CURVES FOR k-NN EXAMPLE #2 (75% training, 25% test samples)

[Figure: precision-recall curves, PREC vs. TPR, for varying k; the resulting AUC values are:]

| k  | AUC    |
|----|--------|
| 1  | 0.7639 |
| 5  | 0.9538 |
| 9  | 0.9628 |
| 13 | 0.9658 |
| 17 | 0.9678 |
| 21 | 0.9669 |
| 25 | 0.9788 |
| 29 | 0.9766 |
| 33 | 0.9754 |
SUMMARY AND OUTLOOK

In this unit, we have studied the following:

- How to evaluate a given model:
  - Generalization error/risk
  - Estimates via the test set method and cross validation
  - Confusion matrices and evaluation measures
  - ROC and PR curve analysis
- Simple predictors like k-NN and linear/polynomial regression.
- ERM and the phenomena of underfitting and overfitting.

The following units will be devoted to state-of-the-art methods for classification and regression.