![Page 1: Data Mining & Machine Learning](https://reader030.vdocuments.net/reader030/viewer/2022012709/61a95ef471294c39ed56d66f/html5/thumbnails/1.jpg)
Data Mining & Machine Learning
CS37300, Purdue University
October 6, 2017
![Page 2: Data Mining & Machine Learning](https://reader030.vdocuments.net/reader030/viewer/2022012709/61a95ef471294c39ed56d66f/html5/thumbnails/2.jpg)
Extra Credit Competition Update
![Page 3: Data Mining & Machine Learning](https://reader030.vdocuments.net/reader030/viewer/2022012709/61a95ef471294c39ed56d66f/html5/thumbnails/3.jpg)
So far…
So far, we have reviewed the Naive Bayes classifier and the decision tree.
We now embark on a quest to find other classifiers
![Page 4: Data Mining & Machine Learning](https://reader030.vdocuments.net/reader030/viewer/2022012709/61a95ef471294c39ed56d66f/html5/thumbnails/4.jpg)
Classifiers for today
• Nearest neighbors
• Linear Regression
• Support vector machines
• Logistic Regression (1-layer neural network)
![Page 5: Data Mining & Machine Learning](https://reader030.vdocuments.net/reader030/viewer/2022012709/61a95ef471294c39ed56d66f/html5/thumbnails/5.jpg)
Classification Task (with C classes)
• Data representation:
◦ Training set: paired attribute vectors and class labels (yi, xi), with yi ∈ 𝕐, xi ∈ ℝd, d > 0, for some label set 𝕐, or an n × p table with one class label (y) and p − 1 attributes (x)
• Knowledge representation: a function f(x; θ) = y, parameterized by θ
• Model space: all values θ where f(x; θ) ∈ 𝕐
◦ Construct a model that approximates the mapping between x and y
• Classification: y is categorical (e.g., {yes, no}, {dog, cat, elephant})
• Regression: y is real-valued (e.g., stock prices)
![Page 6: Data Mining & Machine Learning](https://reader030.vdocuments.net/reader030/viewer/2022012709/61a95ef471294c39ed56d66f/html5/thumbnails/6.jpg)
Binary classification
• In its simplest form, a classification model defines a decision boundary (h) and a label for each side of the boundary
• Input: x = {x1, x2, ..., xp} is a set of attributes; a function f assigns a label y to input x, where y is a discrete variable with a finite number of values
![Page 7: Data Mining & Machine Learning](https://reader030.vdocuments.net/reader030/viewer/2022012709/61a95ef471294c39ed56d66f/html5/thumbnails/7.jpg)
Nearest Neighbors
![Page 8: Data Mining & Machine Learning](https://reader030.vdocuments.net/reader030/viewer/2022012709/61a95ef471294c39ed56d66f/html5/thumbnails/8.jpg)
Nearest neighbor
• Instance-based method
• Learning
• Stores training data and delays processing until a new instance must be classified
• Assumes that all points are represented in p-dimensional space
• Prediction
• Nearest neighbors are calculated using Euclidean distance
• Classification is made based on class labels of neighbors
![Page 9: Data Mining & Machine Learning](https://reader030.vdocuments.net/reader030/viewer/2022012709/61a95ef471294c39ed56d66f/html5/thumbnails/9.jpg)
1NN
• Training set: (x1, y1), (x2, y2), ..., (xn, yn), where xi = [xi1, xi2, …, xip] is a feature vector of p continuous attributes and yi is a discrete class label
• 1NN algorithm. To predict a class label for a new instance j: find the training instance xi such that d(xi, xj) is minimized, and let f(xj) = yi
• Key idea: find instances that are “similar” to the new instance and use their class labels to make a prediction for the new instance
• 1NN generalizes to kNN when more neighbors are considered
![Page 10: Data Mining & Machine Learning](https://reader030.vdocuments.net/reader030/viewer/2022012709/61a95ef471294c39ed56d66f/html5/thumbnails/10.jpg)
kNN model: decision boundaries
Source: http://cs231n.github.io/classification/
![Page 11: Data Mining & Machine Learning](https://reader030.vdocuments.net/reader030/viewer/2022012709/61a95ef471294c39ed56d66f/html5/thumbnails/11.jpg)
kNN
• kNN algorithm. To predict a class label for a new instance j: find the k nearest neighbors of j, i.e., those that minimize d(xk, xj), and let f(xj) = g(yk), e.g., the majority label among the yk
• Algorithm choices:
• How many neighbors to consider (i.e., choice of k)? Usually a small value is used, e.g., k < 10
• What distance measure d(·) to use? The Euclidean (L2) norm is often used
• What function g(·) to combine the neighbors’ labels into a prediction? Majority vote is often used
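The prediction rule above can be sketched in a few lines of pure Python (the function names and toy data below are illustrative choices, not from the course; k = 1 recovers 1NN):

```python
import math
from collections import Counter

def euclidean(a, b):
    # L2 (Euclidean) distance between two feature vectors
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(train, x_new, k=3):
    """train: list of (feature_vector, label) pairs.
    Find the k nearest neighbors of x_new (smallest d) and
    combine their labels with g = majority vote."""
    neighbors = sorted(train, key=lambda pair: euclidean(pair[0], x_new))[:k]
    labels = [label for _, label in neighbors]
    return Counter(labels).most_common(1)[0][0]
```

For example, with `train = [([0, 0], "a"), ([0, 1], "a"), ([5, 5], "b"), ([6, 5], "b")]`, a query near the origin is labeled "a" and one near (5, 5) is labeled "b".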
![Page 12: Data Mining & Machine Learning](https://reader030.vdocuments.net/reader030/viewer/2022012709/61a95ef471294c39ed56d66f/html5/thumbnails/12.jpg)
1NN decision boundary
• For each training example i, we can calculate its Voronoi cell, which corresponds to the space of points for which i is their nearest neighbor
• All points in such a Voronoi cell are labeled by the class of the training point, forming a Voronoi tessellation of the feature space
Source: http://www.cs.bilkent.edu.tr/~saksoy/courses/cs551-Spring2008/slides/cs551_nonbayesian1.pdf
![Page 13: Data Mining & Machine Learning](https://reader030.vdocuments.net/reader030/viewer/2022012709/61a95ef471294c39ed56d66f/html5/thumbnails/13.jpg)
Nearest neighbor
• Strengths:
• Simple model, easy to implement
• Very efficient learning: O(1)
• Weaknesses:
• Inefficient inference: time and space O(n)
• Curse of dimensionality:
• As the number of features increases, you need an exponential increase in the size of the data to ensure that you have nearby examples for any given data point
- See python notebook on sphere volume
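The sphere-volume argument can be sketched in a few lines (an illustrative stand-in for the course notebook, not the notebook itself): the volume of the unit ball inscribed in the cube [−1, 1]^d shrinks to a vanishing fraction of the cube as d grows, so "nearby" points become rare.

```python
import math

def sphere_volume(d, r=1.0):
    """Volume of a d-dimensional ball of radius r:
    V_d(r) = pi^(d/2) / Gamma(d/2 + 1) * r^d."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * r ** d

# Fraction of the cube [-1, 1]^d (volume 2^d) filled by the inscribed unit ball
for d in (1, 2, 3, 5, 10, 20):
    frac = sphere_volume(d) / 2 ** d
    print(f"d={d:2d}  inscribed-ball fraction = {frac:.8f}")
```

Already at d = 20 the fraction is far below 0.01%: almost all of the cube's volume lies in its corners, away from any "center" point.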
![Page 14: Data Mining & Machine Learning](https://reader030.vdocuments.net/reader030/viewer/2022012709/61a95ef471294c39ed56d66f/html5/thumbnails/14.jpg)
k-NN learning
• Parameters of the model:
• k (number of neighbors)
• any parameters of distance measure (e.g., weights on features)
• Model space
• Possible tessellations of the feature space
• Search algorithm
• Implicit search: choice of k, d, and g uniquely define a tessellation
• Score function
• Majority vote corresponds to minimizing the misclassification rate
![Page 15: Data Mining & Machine Learning](https://reader030.vdocuments.net/reader030/viewer/2022012709/61a95ef471294c39ed56d66f/html5/thumbnails/15.jpg)
Least Squares Classifier
![Page 16: Data Mining & Machine Learning](https://reader030.vdocuments.net/reader030/viewer/2022012709/61a95ef471294c39ed56d66f/html5/thumbnails/16.jpg)
Motivation
• Given features x of a car (length, width, mpg, maximum speed, …)
• Classify cars into categories based on x
![Page 17: Data Mining & Machine Learning](https://reader030.vdocuments.net/reader030/viewer/2022012709/61a95ef471294c39ed56d66f/html5/thumbnails/17.jpg)
Least Squares Classifier
Two classes of cars:
• xi is a real-valued vector (the features of car i) and yi is the class of car i, for i = 1, …, n
• yi = +1 if car i is “economy”, −1 if car i is “luxury”
• Find linear discriminant weights w: f(x) = wTx + b
• Score function: least squares error
  score = ∑_{i=1}^{n} (yi − f(xi))²
• Search: find w, b that minimize the score
(Slide figure: feature space with axes “car length” vs. “car max speed”; f(x) = 0 is the decision boundary, with f(x) > 0 on one side and f(x) < 0 on the other. The signed distance from a point x to the boundary is f(x)/∥x∥ scaled by ∥w∥, and the boundary lies at distance −b/∥w∥ from the origin.)
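A minimal sketch of this classifier with a single feature, where the least squares solution has a simple closed form (the function names and toy data here are illustrative, not from the course):

```python
def fit_least_squares(xs, ys):
    """Fit f(x) = w*x + b minimizing sum_i (y_i - f(x_i))^2,
    with labels y_i in {+1 (economy), -1 (luxury)}."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    var = sum((x - mx) ** 2 for x in xs)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    w = cov / var          # slope from the normal equations
    b = my - w * mx        # intercept
    return w, b

def classify(x, w, b):
    # Label by which side of the boundary f(x) = 0 the point falls on
    return 1 if w * x + b >= 0 else -1
```

For instance, fitting lengths [1, 2, 3] labeled +1 and [8, 9, 10] labeled −1 yields a boundary between the two groups, so short cars classify as +1 and long cars as −1.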
![Page 18: Data Mining & Machine Learning](https://reader030.vdocuments.net/reader030/viewer/2022012709/61a95ef471294c39ed56d66f/html5/thumbnails/18.jpg)
Issues with Least Squares Classification
• The least squares score
  score = ∑_{i=1}^{n} (yi − f(xi))²
  cares too much about well-classified items: a point far on the correct side of the boundary (at distance h + ε instead of ε) still contributes a large squared error, so it can pull the least squares solution away from a good boundary.
(Slide figures: four scatter plots of the least squares solution, showing how distant, already well-classified points shift the fitted decision boundary.)
![Page 19: Data Mining & Machine Learning](https://reader030.vdocuments.net/reader030/viewer/2022012709/61a95ef471294c39ed56d66f/html5/thumbnails/19.jpg)
Neural networks
• Analogous to biological systems
• Massive parallelism is computationally efficient
• First learning algorithm in 1959 (Rosenblatt)
• Perceptron learning rule
• Provide target outputs with inputs for a single neuron
• Incrementally update weights to learn to produce outputs
![Page 20: Data Mining & Machine Learning](https://reader030.vdocuments.net/reader030/viewer/2022012709/61a95ef471294c39ed56d66f/html5/thumbnails/20.jpg)
Neuron
A single neuron takes an input vector x = (x0, x1, …, xn), forms a weighted sum with weight vector w = (w0, w1, …, wn) plus a bias μk, and passes the result through an activation function to produce output y. Example:
  y = sign( ∑_{i=0}^{n} wi xi + μk )

Multi-Layer Perceptron
• Nodes are arranged in layers: input nodes (input vector xi), hidden nodes, and output nodes (output vector); wij is the weight on the connection from node i to node j
• Net input of node j: Ij = ∑_i wij Oi + θj
• Output of node j (logistic activation): Oj = 1 / (1 + e^{−Ij})
• Error of output node j: Errj = Oj (1 − Oj) (Tj − Oj)
• Error of hidden node j: Errj = Oj (1 − Oj) ∑_k Errk wjk
• Updates with learning rate l: wij ← wij + (l) Errj Oi and θj ← θj + (l) Errj
• Activation function applied to f(x) = wT x + b; the parameters are the weights w and the bias b
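The update rules on this slide can be turned into a short, self-contained sketch (pure Python; `train_mlp`, its defaults, and the single-output restriction are illustrative choices, not from the course):

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_mlp(data, n_in, n_hidden, lr=0.5, epochs=5000, seed=0):
    """One-hidden-layer perceptron trained with the slide's update rules.
    data: list of (input_vector, target) pairs with target in [0, 1]."""
    rng = random.Random(seed)
    # W[j][i]: weight from input i to hidden node j; th[j]: hidden bias theta_j
    W = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    th = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    # v[j]: weight from hidden node j to the single output node; th_o: its bias
    v = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    th_o = rng.uniform(-0.5, 0.5)

    def forward(x):
        # I_j = sum_i w_ij O_i + theta_j ;  O_j = 1 / (1 + e^{-I_j})
        h = [sigmoid(sum(W[j][i] * x[i] for i in range(n_in)) + th[j])
             for j in range(n_hidden)]
        o = sigmoid(sum(v[j] * h[j] for j in range(n_hidden)) + th_o)
        return h, o

    for _ in range(epochs):
        for x, t in data:
            h, o = forward(x)
            # output error: Err = O(1 - O)(T - O)
            err_o = o * (1 - o) * (t - o)
            # hidden error: Err_j = O_j(1 - O_j) * sum_k Err_k w_jk
            err_h = [h[j] * (1 - h[j]) * err_o * v[j] for j in range(n_hidden)]
            # w_ij <- w_ij + (l) Err_j O_i ;  theta_j <- theta_j + (l) Err_j
            for j in range(n_hidden):
                v[j] += lr * err_o * h[j]
                th[j] += lr * err_h[j]
                for i in range(n_in):
                    W[j][i] += lr * err_h[j] * x[i]
            th_o += lr * err_o

    return lambda x: forward(x)[1]
```

Trained on a simple Boolean function such as AND, the returned predictor outputs a value near 1 for (1, 1) and near 0 elsewhere.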
![Page 21: Data Mining & Machine Learning](https://reader030.vdocuments.net/reader030/viewer/2022012709/61a95ef471294c39ed56d66f/html5/thumbnails/21.jpg)
Logistic regression
• Task: binary classification
• Data representation: n observations of attributes xi ∈ ℝp and labels yi ∈ {0, 1}
• Knowledge representation: two classes (y = 0, y = 1), with
  P(Y = 1 | X = x) = σ(wT x + b)
  where σ is the logistic function (the neuron’s non-linear activation filter): if the input is x, the output σ(wT x + b) looks like a probability, p(y = 1 | x; w) = σ(wT x + b)
• Score function is the negative log-likelihood:
  score = −log P({xi, yi}_{i=1}^{n} | w, b)
        = −∑_{i=1}^{n} [ 1(yi = 1) log σ(wTxi + b) + 1(yi = 0) log(1 − σ(wTxi + b)) ]
  where 1(a = b) is one if a = b, zero otherwise
• Search: find w, b that minimize the score
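The search step can be sketched with plain gradient descent on the negative log-likelihood (a minimal illustration; the function names, learning rate, and epoch count are assumptions, not from the course):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Minimize score = -sum_i [ y_i log s_i + (1 - y_i) log(1 - s_i) ],
    where s_i = sigmoid(w . x_i + b), by stochastic gradient descent."""
    p = len(X[0])
    w = [0.0] * p
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            s = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # d(score)/dw_j = (s - y) x_j ; step against the gradient
            for j in range(p):
                w[j] -= lr * (s - yi) * xi[j]
            b -= lr * (s - yi)
    return w, b
```

On separable 1-d data with labels 0 for small x and 1 for large x, the fitted σ(wx + b) crosses 0.5 between the two groups.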
![Page 22: Data Mining & Machine Learning](https://reader030.vdocuments.net/reader030/viewer/2022012709/61a95ef471294c39ed56d66f/html5/thumbnails/22.jpg)
How to Deal with Multiple Classes?
![Page 23: Data Mining & Machine Learning](https://reader030.vdocuments.net/reader030/viewer/2022012709/61a95ef471294c39ed56d66f/html5/thumbnails/23.jpg)
Naïve Approach: one vs. many Classification
• How to classify objects into multiple types? Train one binary classifier per pair of classes:
  y_c^(1) = +1 if car c is “small”, −1 if car c is “luxury”
  y_c^(2) = +1 if car c is “small”, −1 if car c is “medium”
  y_c^(3) = +1 if car c is “medium”, −1 if car c is “luxury”
• Might work OK in some scenarios… but it is not clear in this case
![Page 24: Data Mining & Machine Learning](https://reader030.vdocuments.net/reader030/viewer/2022012709/61a95ef471294c39ed56d66f/html5/thumbnails/24.jpg)
Issue with using binary classifiers for K classes

Figure 4.2 (C. Bishop): Attempting to construct a K-class discriminant from a set of two-class discriminants leads to ambiguous regions, shown in green. On the left is an example involving two discriminants designed to distinguish points in class Ck from points not in class Ck. On the right is an example involving three discriminant functions, each of which is used to separate a pair of classes Ck and Cj. (In the car example, the regions correspond to “small”, “medium”, and “luxury”, with an uncertain classification region between them.)

With three classes, this approach leads to regions of input space that are ambiguously classified.

An alternative is to introduce K(K − 1)/2 binary discriminant functions, one for every possible pair of classes. This is known as a one-versus-one classifier. Each point is then classified according to a majority vote amongst the discriminant functions. However, this too runs into the problem of ambiguous regions, as illustrated in the right-hand diagram of Figure 4.2.

We can avoid these difficulties by considering a single K-class discriminant comprising K linear functions of the form

  yk(x) = wkT x + wk0    (4.9)

and then assigning a point x to class Ck if yk(x) > yj(x) for all j ≠ k. The decision boundary between class Ck and class Cj is therefore given by yk(x) = yj(x) and hence corresponds to a (D − 1)-dimensional hyperplane defined by

  (wk − wj)T x + (wk0 − wj0) = 0.    (4.10)

This has the same form as the decision boundary for the two-class case, and so analogous geometrical properties apply.

The decision regions of such a discriminant are always singly connected and convex. To see this, consider two points xA and xB, both of which lie inside decision region Rk. Any point x̂ that lies on the line connecting xA and xB can be expressed in the form

  x̂ = λ xA + (1 − λ) xB    (4.11)

Figure: C. Bishop
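The argmax rule of Eq. (4.9) can be written down directly (the weights in the usage example below are made up for illustration, not fitted to data):

```python
def predict_class(x, W, w0):
    """Single K-class linear discriminant (Bishop Eq. 4.9):
    assign x to the class k maximizing y_k(x) = w_k . x + w_k0."""
    scores = [sum(wi * xi for wi, xi in zip(wk, x)) + b0
              for wk, b0 in zip(W, w0)]
    return max(range(len(scores)), key=scores.__getitem__)
```

Because every point gets exactly one maximizing class, there are no ambiguous regions; e.g., with `W = [[1, 0], [0, 1], [-1, -1]]` and zero offsets, the plane is partitioned into three convex regions.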
![Page 25: Data Mining & Machine Learning](https://reader030.vdocuments.net/reader030/viewer/2022012709/61a95ef471294c39ed56d66f/html5/thumbnails/25.jpg)
We will revisit multi-class classification when we see neural networks
![Page 26: Data Mining & Machine Learning](https://reader030.vdocuments.net/reader030/viewer/2022012709/61a95ef471294c39ed56d66f/html5/thumbnails/26.jpg)
Support vector machines (SVMs)
![Page 27: Data Mining & Machine Learning](https://reader030.vdocuments.net/reader030/viewer/2022012709/61a95ef471294c39ed56d66f/html5/thumbnails/27.jpg)
Support vector machines
• Discriminative model
• General idea:
• Find best boundary points (support vectors) and build classifier on top of them
• Linear and non-linear SVMs
![Page 28: Data Mining & Machine Learning](https://reader030.vdocuments.net/reader030/viewer/2022012709/61a95ef471294c39ed56d66f/html5/thumbnails/28.jpg)
Choosing hyperplanes to separate points
Source: Introduction to Data Mining, Tan, Steinbach, and Kumar
![Page 29: Data Mining & Machine Learning](https://reader030.vdocuments.net/reader030/viewer/2022012709/61a95ef471294c39ed56d66f/html5/thumbnails/29.jpg)
Among equivalent hyperplanes, choose one that maximizes “margin”
Source: Introduction to Data Mining, Tan, Steinbach, and Kumar
![Page 30: Data Mining & Machine Learning](https://reader030.vdocuments.net/reader030/viewer/2022012709/61a95ef471294c39ed56d66f/html5/thumbnails/30.jpg)
Linear SVMs
• Same functional form as the perceptron:
  y = sign( ∑_{i=1}^{m} wi xi + b )
• Different learning procedure: search for the hyperplane with the largest margin
• Margin = d+ + d−, where d+ is the distance to the closest positive example and d− is the distance to the closest negative example
(Slide figure: positive (x) and negative (o) points separated by a hyperplane; the closest points on either side, at distances d+ and d−, are the support vectors.)
![Page 31: Data Mining & Machine Learning](https://reader030.vdocuments.net/reader030/viewer/2022012709/61a95ef471294c39ed56d66f/html5/thumbnails/31.jpg)
Constrained optimization for SVMs
• Prediction constraints:
  Eq1: x(j) · w + b ≥ +1 for y(j) = +1
  Eq2: x(j) · w + b ≤ −1 for y(j) = −1
  Eq3: y(j)(x(j) · w + b) − 1 ≥ 0 for all j
• Hyperplane boundaries:
  H1: x(j) · w + b = +1
  H2: x(j) · w + b = −1
• The distances satisfy d+ = d− = 1/||w||, so
  margin = 2/||w||
• Can maximize the margin by minimizing ||w||, as w defines the hyperplanes
![Page 32: Data Mining & Machine Learning](https://reader030.vdocuments.net/reader030/viewer/2022012709/61a95ef471294c39ed56d66f/html5/thumbnails/32.jpg)
SVM optimization
• Search: maximize the margin by minimizing 0.5||w||² subject to the constraints in Eq3
• Note: maximizing 2/||w|| is equivalent to minimizing 0.5||w||²
• Introduce Lagrange multipliers (αi) for the constraints into the score function to minimize:
  LP = 0.5||w||² − ∑_{i=1}^{n} αi y(i)[x(i) · w + b] + ∑_{i=1}^{n} αi
• Minimize LP with respect to w and b, with αi ≥ 0
• Convex programming problem
• Quadratic programming problem with parameters w, b, α
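The margin formula and the Eq3 feasibility check can be written down directly (a sketch that evaluates a given (w, b); it does not solve the quadratic program itself):

```python
import math

def margin(w):
    # margin = 2 / ||w||
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

def feasible(X, y, w, b):
    """Eq3: y_j (x_j . w + b) - 1 >= 0 must hold for every training point."""
    return all(
        yj * (sum(wi * xi for wi, xi in zip(w, xj)) + b) - 1 >= 0
        for xj, yj in zip(X, y)
    )
```

For example, with points (±2, 0) labeled ±1, the hyperplane w = (1, 0), b = 0 satisfies Eq3 and has margin 2, while scaling w down far enough breaks the constraints: this is why minimizing ||w|| over feasible (w, b) maximizes the margin.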
![Page 33: Data Mining & Machine Learning](https://reader030.vdocuments.net/reader030/viewer/2022012709/61a95ef471294c39ed56d66f/html5/thumbnails/33.jpg)
Constrained optimization
• Linear programming (LP) is a technique for the optimization of a linear objective function, subject to linear constraints on the variables
• Quadratic programming (QP) is a technique for the optimization of a quadratic function of several variables, subject to linear constraints on these variables
![Page 34: Data Mining & Machine Learning](https://reader030.vdocuments.net/reader030/viewer/2022012709/61a95ef471294c39ed56d66f/html5/thumbnails/34.jpg)
SVM components
• Model space
• Set of weights w and b (hyperplane boundary)
• Search algorithm
• Quadratic programming to minimize Lp with constraints
• Score function
• Lp: maximizes margin subject to constraint that all training data is correctly classified
![Page 35: Data Mining & Machine Learning](https://reader030.vdocuments.net/reader030/viewer/2022012709/61a95ef471294c39ed56d66f/html5/thumbnails/35.jpg)
Limitations of linear SVMs
• Linear classifiers cannot deal with:
• Non-linear concepts
• Noisy data
• Solutions:
• Soft margin (e.g., allow mistakes in training data)
• Network of simple linear classifiers (e.g., neural networks)
• Map data into richer feature space (e.g., non-linear features) and then use linear classifier