Lecture Slides for
INTRODUCTION TO MACHINE LEARNING, 3RD EDITION
ETHEM ALPAYDIN
Modified by Prof. Carolina Ruiz for CS539 Machine Learning at WPI
© The MIT Press, 2014
alpaydin@boun.edu.tr
http://www.cmpe.boun.edu.tr/~ethem/i2ml3e
CHAPTER 5: MULTIVARIATE METHODS
Multivariate Data
Multiple measurements (sensors): d inputs/features/attributes (d-variate), N instances/observations/examples.

$$\mathbf{X} = \begin{bmatrix} X_1^1 & X_2^1 & \cdots & X_d^1 \\ X_1^2 & X_2^2 & \cdots & X_d^2 \\ \vdots & \vdots & \ddots & \vdots \\ X_1^N & X_2^N & \cdots & X_d^N \end{bmatrix}$$
Multivariate Parameters
$$\text{Mean: } E[\mathbf{x}] = \boldsymbol{\mu} = [\mu_1, \ldots, \mu_d]^T$$

$$\text{Covariance: } \sigma_{ij} \equiv \operatorname{Cov}(X_i, X_j) = E[(X_i - \mu_i)(X_j - \mu_j)] = E[X_i X_j] - \mu_i \mu_j$$

$$\text{Correlation: } \operatorname{Corr}(X_i, X_j) \equiv \rho_{ij} = \frac{\sigma_{ij}}{\sigma_i \sigma_j}$$

Covariance matrix:

$$\boldsymbol{\Sigma} \equiv \operatorname{Cov}(\mathbf{X}) = E\left[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^T\right] = \begin{bmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1d} \\ \sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{d1} & \sigma_{d2} & \cdots & \sigma_d^2 \end{bmatrix}$$
Parameter Estimation from data sample X
$$\text{Sample mean: } m_i = \frac{\sum_{t=1}^N x_i^t}{N}, \quad i = 1, \ldots, d$$

$$\text{Covariance matrix } \mathbf{S}: \quad s_{ij} = \frac{\sum_{t=1}^N (x_i^t - m_i)(x_j^t - m_j)}{N}$$

$$\text{Correlation matrix } \mathbf{R}: \quad r_{ij} = \frac{s_{ij}}{s_i s_j}$$
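As a concrete illustration of these estimators, here is a minimal NumPy sketch; the variable names and the random demo data are ours, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # N=100 instances, d=3 features

N = X.shape[0]
m = X.mean(axis=0)                      # sample mean: m_i = sum_t x_i^t / N
Xc = X - m                              # centered data
S = (Xc.T @ Xc) / N                     # sample covariance (divides by N, as on the slide)
s = np.sqrt(np.diag(S))                 # sample standard deviations
R = S / np.outer(s, s)                  # correlation matrix: r_ij = s_ij / (s_i s_j)

# np.corrcoef agrees: the N vs. N-1 normalization cancels in the correlation
assert np.allclose(R, np.corrcoef(X, rowvar=False))
```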
Estimation of Missing Values
What to do if certain instances have missing attribute values?
Ignore those instances: not a good idea if the sample is small.
Use 'missing' as an attribute: it may give information.
Imputation: fill in the missing value.
  Mean imputation: use the most likely value (e.g., the mean).
  Imputation by regression: predict the missing value based on the other attributes.
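A small sketch of mean imputation in NumPy, assuming missing entries are encoded as `np.nan` (an encoding choice of ours); imputation by regression would instead fit a predictor of the missing attribute on the complete ones:

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, np.nan]])

col_means = np.nanmean(X, axis=0)       # per-attribute mean, ignoring missing values
missing = np.isnan(X)
# fill each hole with the mean of its own column
X[missing] = np.take(col_means, np.where(missing)[1])
```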
Multivariate Normal Distribution
$$\mathbf{x} \sim \mathcal{N}_d(\boldsymbol{\mu}, \boldsymbol{\Sigma})$$

$$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left[-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right]$$
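A direct transcription of this density into NumPy (the function name `mvn_pdf` is ours; `scipy.stats.multivariate_normal` provides the same computation):

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """p(x) = (2*pi)^(-d/2) |Sigma|^(-1/2) exp(-0.5 (x-mu)^T Sigma^{-1} (x-mu))."""
    d = len(mu)
    diff = x - mu
    maha = diff @ np.linalg.solve(Sigma, diff)   # (x-mu)^T Sigma^{-1} (x-mu)
    norm = (2 * np.pi) ** (-d / 2) * np.linalg.det(Sigma) ** (-0.5)
    return norm * np.exp(-0.5 * maha)

mu = np.zeros(2)
Sigma = np.array([[1.0, 0.5], [0.5, 2.0]])
print(mvn_pdf(np.array([0.3, -0.1]), mu, Sigma))
```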
Multivariate Normal Distribution
Mahalanobis distance: $(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})$ measures the distance from x to μ in terms of Σ (it normalizes for differences in variances and correlations).
Bivariate case (d = 2):

$$\boldsymbol{\Sigma} = \begin{bmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{bmatrix}$$

$$p(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left[-\frac{1}{2(1-\rho^2)}\left(z_1^2 - 2\rho z_1 z_2 + z_2^2\right)\right]$$

where z-normalization gives $z_i = \dfrac{x_i - \mu_i}{\sigma_i}$.
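A minimal sketch of the Mahalanobis distance; the example covariance (ρ = 0.8) is made up to show how correlation reshapes distances:

```python
import numpy as np

def mahalanobis2(x, mu, Sigma):
    """Squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu)."""
    diff = x - mu
    return diff @ np.linalg.solve(Sigma, diff)

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.8], [0.8, 1.0]])   # correlated bivariate case, rho = 0.8
print(mahalanobis2(np.array([1.0, 1.0]), mu, Sigma))   # small: along the correlation
print(mahalanobis2(np.array([1.0, -1.0]), mu, Sigma))  # large: against the correlation
```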
Bivariate normal: isoprobability contour plots [i.e., $(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) = c^2$].
When the covariance is 0, the ellipsoid axes are parallel to the coordinate axes.
Independent Inputs: Naive Bayes
If the $x_i$ are independent, the off-diagonal values of $\boldsymbol{\Sigma}$ are 0, and the Mahalanobis distance reduces to a weighted (by $1/\sigma_i$) Euclidean distance. If the variances are also equal, it reduces to the Euclidean distance:
$$p(\mathbf{x}) = \prod_{i=1}^d p_i(x_i) = \frac{1}{(2\pi)^{d/2} \prod_{i=1}^d \sigma_i} \exp\left[-\frac{1}{2} \sum_{i=1}^d \left(\frac{x_i - \mu_i}{\sigma_i}\right)^2\right]$$
The use of the term "Naïve Bayes" in this chapter is somewhat wrong: Naïve Bayes assumes independence in the probability sense, not in the linear algebra sense.
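A quick numerical sanity check of the factorization above: with a diagonal Σ, the multivariate density equals the product of the univariate densities (all numbers here are arbitrary):

```python
import numpy as np

mu = np.array([0.0, 1.0, -1.0])
sigma = np.array([1.0, 0.5, 2.0])            # per-input standard deviations
x = np.array([0.2, 1.3, -0.4])

# product of univariate normal densities
uni = np.prod(np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma))

# joint density with diagonal covariance, via the multivariate formula
Sigma = np.diag(sigma ** 2)
d = len(mu)
diff = x - mu
joint = ((2 * np.pi) ** (-d / 2) * np.linalg.det(Sigma) ** (-0.5)
         * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)))

assert np.isclose(uni, joint)
```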
Parametric Classification
If $p(\mathbf{x} \mid C_i) \sim \mathcal{N}_d(\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$:

$$p(\mathbf{x} \mid C_i) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}_i|^{1/2}} \exp\left[-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i)\right]$$

Discriminant functions:

$$g_i(\mathbf{x}) = \log p(\mathbf{x} \mid C_i) + \log P(C_i) = -\frac{d}{2}\log 2\pi - \frac{1}{2}\log|\boldsymbol{\Sigma}_i| - \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) + \log P(C_i)$$
Estimation of Parameters from data sample X
$$\hat{P}(C_i) = \frac{\sum_t r_i^t}{N} \qquad \mathbf{m}_i = \frac{\sum_t r_i^t \mathbf{x}^t}{\sum_t r_i^t} \qquad \mathbf{S}_i = \frac{\sum_t r_i^t (\mathbf{x}^t - \mathbf{m}_i)(\mathbf{x}^t - \mathbf{m}_i)^T}{\sum_t r_i^t}$$

Plugging these in (and dropping the constant term):

$$g_i(\mathbf{x}) = -\frac{1}{2}\log|\mathbf{S}_i| - \frac{1}{2}(\mathbf{x} - \mathbf{m}_i)^T \mathbf{S}_i^{-1} (\mathbf{x} - \mathbf{m}_i) + \log \hat{P}(C_i)$$
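A sketch of these maximum-likelihood estimates and the resulting discriminant, assuming class labels `y` in {0, ..., K-1} encode the indicators r_i^t (names and demo data are ours):

```python
import numpy as np

def fit_class_params(X, y, K):
    """Per-class priors, means, covariances (maximum likelihood, as on the slide)."""
    priors, means, covs = [], [], []
    for i in range(K):
        Xi = X[y == i]                      # instances with r_i^t = 1
        priors.append(len(Xi) / len(X))     # P_hat(C_i)
        means.append(Xi.mean(axis=0))       # m_i
        Xc = Xi - means[-1]
        covs.append((Xc.T @ Xc) / len(Xi))  # S_i
    return np.array(priors), np.array(means), np.array(covs)

def g(x, prior, m, S):
    """g_i(x) = -1/2 log|S_i| - 1/2 (x-m_i)^T S_i^{-1} (x-m_i) + log P_hat(C_i)."""
    diff = x - m
    _, logdet = np.linalg.slogdet(S)
    return -0.5 * logdet - 0.5 * diff @ np.linalg.solve(S, diff) + np.log(prior)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
priors, means, covs = fit_class_params(X, y, K=2)
scores = [g(np.array([2.0, 2.0]), priors[i], means[i], covs[i]) for i in range(2)]
print(np.argmax(scores))                    # choose the class with the largest g_i
```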
Assuming a different Si for each Ci
Quadratic discriminant: expanding the formula on the previous slide,

$$g_i(\mathbf{x}) = -\frac{1}{2}\log|\mathbf{S}_i| - \frac{1}{2}\left(\mathbf{x}^T \mathbf{S}_i^{-1} \mathbf{x} - 2\mathbf{x}^T \mathbf{S}_i^{-1} \mathbf{m}_i + \mathbf{m}_i^T \mathbf{S}_i^{-1} \mathbf{m}_i\right) + \log \hat{P}(C_i) = \mathbf{x}^T \mathbf{W}_i \mathbf{x} + \mathbf{w}_i^T \mathbf{x} + w_{i0}$$

where

$$\mathbf{W}_i = -\frac{1}{2}\mathbf{S}_i^{-1} \qquad \mathbf{w}_i = \mathbf{S}_i^{-1}\mathbf{m}_i \qquad w_{i0} = -\frac{1}{2}\mathbf{m}_i^T \mathbf{S}_i^{-1} \mathbf{m}_i - \frac{1}{2}\log|\mathbf{S}_i| + \log \hat{P}(C_i)$$

This has the form of a quadratic discriminant. See the figure on the next slide.
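The quadratic form can be checked numerically; a sketch computing W_i, w_i, and w_i0 from given m_i, S_i (illustrative values):

```python
import numpy as np

def quadratic_discriminant(m, S, prior):
    """Coefficients of g_i(x) = x^T W_i x + w_i^T x + w_i0."""
    S_inv = np.linalg.inv(S)
    W = -0.5 * S_inv
    w = S_inv @ m
    _, logdet = np.linalg.slogdet(S)
    w0 = -0.5 * m @ S_inv @ m - 0.5 * logdet + np.log(prior)
    return W, w, w0

m = np.array([1.0, 2.0])
S = np.array([[2.0, 0.3], [0.3, 1.0]])
W, w, w0 = quadratic_discriminant(m, S, prior=0.4)
x = np.array([0.5, 1.5])
print(x @ W @ x + w @ x + w0)               # g_i(x) in quadratic form
```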
[Figure: class likelihoods p(x|C_1), p(x|C_2); posterior for C_1; discriminant where P(C_1|x) = 0.5]
Assuming Common Covariance Matrix S
Shared common sample covariance S:

$$\mathbf{S} = \sum_i \hat{P}(C_i)\,\mathbf{S}_i$$

The discriminant reduces to

$$g_i(\mathbf{x}) = -\frac{1}{2}(\mathbf{x} - \mathbf{m}_i)^T \mathbf{S}^{-1} (\mathbf{x} - \mathbf{m}_i) + \log \hat{P}(C_i)$$

which is a linear discriminant:

$$g_i(\mathbf{x}) = \mathbf{w}_i^T \mathbf{x} + w_{i0} \qquad \text{where} \qquad \mathbf{w}_i = \mathbf{S}^{-1}\mathbf{m}_i \qquad w_{i0} = -\frac{1}{2}\mathbf{m}_i^T \mathbf{S}^{-1} \mathbf{m}_i + \log \hat{P}(C_i)$$
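A sketch of the shared-covariance (linear) case, pooling the per-class covariances with the priors as weights (demo values are ours):

```python
import numpy as np

def linear_discriminant(ms, Ss, priors):
    """w_i = S^{-1} m_i, w_i0 = -1/2 m_i^T S^{-1} m_i + log P_hat(C_i), with S pooled."""
    S = sum(p * Si for p, Si in zip(priors, Ss))    # S = sum_i P_hat(C_i) S_i
    ws = [np.linalg.solve(S, m) for m in ms]
    w0s = [-0.5 * m @ w + np.log(p) for m, w, p in zip(ms, ws, priors)]
    return ws, w0s

ms = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
Ss = [np.eye(2), np.array([[1.5, 0.2], [0.2, 0.8]])]
ws, w0s = linear_discriminant(ms, Ss, priors=[0.5, 0.5])
x = np.array([1.0, 0.5])
print(np.argmax([w @ x + w0 for w, w0 in zip(ws, w0s)]))  # predicted class
```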
Common Covariance Matrix S
Arbitrary covariances but shared by classes
Assuming Common Covariance Matrix S is Diagonal
When $x_j$, $j = 1, \ldots, d$, are independent, $\boldsymbol{\Sigma}$ is diagonal and

$$p(\mathbf{x} \mid C_i) = \prod_j p(x_j \mid C_i) \quad \text{(Naive Bayes' assumption)}$$

The discriminant becomes

$$g_i(\mathbf{x}) = -\frac{1}{2}\sum_{j=1}^d \left(\frac{x_j - m_{ij}}{s_j}\right)^2 + \log \hat{P}(C_i)$$

Classify based on weighted Euclidean distance (in $s_j$ units) to the nearest mean.
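With a diagonal shared S, the discriminant is just a sum of squared per-feature z-scores; a minimal sketch (names and demo values are ours):

```python
import numpy as np

def g_diag(x, m_i, s, prior_i):
    """g_i(x) = -1/2 sum_j ((x_j - m_ij)/s_j)^2 + log P_hat(C_i)."""
    return -0.5 * np.sum(((x - m_i) / s) ** 2) + np.log(prior_i)

s = np.array([1.0, 2.0])                    # shared per-feature standard deviations
means = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
x = np.array([1.2, 0.8])
print(np.argmax([g_diag(x, m, s, 0.5) for m in means]))
```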
Assuming Common Covariance Matrix S is Diagonal
Variances may be different.
Covariances are 0, so ellipsoid axes are parallel to coordinate axes
Assuming Common Covariance Matrix S is Diagonal and variances are equal
$$g_i(\mathbf{x}) = -\frac{\|\mathbf{x} - \mathbf{m}_i\|^2}{2s^2} + \log \hat{P}(C_i) = -\frac{1}{2s^2}\sum_{j=1}^d (x_j - m_{ij})^2 + \log \hat{P}(C_i)$$
Nearest mean classifier: Classify based on Euclidean distance to the nearest mean
Each mean can be considered a prototype or template, and this is template matching.
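A minimal sketch of the nearest-mean (template-matching) classifier, assuming equal priors so only the Euclidean distances matter:

```python
import numpy as np

def nearest_mean(x, means):
    """Classify x to the class whose mean (prototype/template) is closest."""
    dists = [np.linalg.norm(x - m) for m in means]
    return int(np.argmin(dists))

means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
print(nearest_mean(np.array([1.0, 0.5]), means))   # -> 0
```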
Assuming Common Covariance Matrix S is Diagonal and variances are equal
Covariances are 0, so the ellipsoid axes are parallel to the coordinate axes. Variances are the same, so the ellipsoids become circles.
The classifier looks for the nearest mean.
Model Selection
| Assumption | Covariance matrix | Number of parameters |
|---|---|---|
| Shared, hyperspheric | S_i = S = s²I | 1 |
| Shared, axis-aligned | S_i = S, with s_ij = 0 | d |
| Shared, hyperellipsoidal | S_i = S | d(d+1)/2 |
| Different, hyperellipsoidal | S_i | K · d(d+1)/2 |
As we increase complexity (less restricted S), bias decreases and variance increases
Assume simple models (allow some bias) to control variance (regularization)
Different cases of covariance matrices fitted to the same data lead to different decision boundaries.
Discrete Features
Binary features: $p_{ij} \equiv p(x_j = 1 \mid C_i)$

If the $x_j$ are independent (Naive Bayes'):

$$p(\mathbf{x} \mid C_i) = \prod_{j=1}^d p_{ij}^{x_j} (1 - p_{ij})^{1 - x_j}$$

The discriminant is linear:

$$g_i(\mathbf{x}) = \log p(\mathbf{x} \mid C_i) + \log P(C_i) = \sum_j \left[x_j \log p_{ij} + (1 - x_j)\log(1 - p_{ij})\right] + \log P(C_i)$$

Estimated parameters:

$$\hat{p}_{ij} = \frac{\sum_t x_j^t r_i^t}{\sum_t r_i^t}$$
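A sketch of these estimates for binary features; the small clipping constant is our own safeguard (not on the slides) to keep the logs finite:

```python
import numpy as np

def fit_bernoulli_nb(X, y, K, eps=1e-9):
    """p_hat_ij = sum_t x_j^t r_i^t / sum_t r_i^t, per class i and feature j."""
    p = np.array([X[y == i].mean(axis=0) for i in range(K)])
    return np.clip(p, eps, 1 - eps)          # keep logs finite (an added safeguard)

def g_bernoulli(x, p_i, prior_i):
    """g_i(x) = sum_j [x_j log p_ij + (1-x_j) log(1-p_ij)] + log P(C_i)."""
    return np.sum(x * np.log(p_i) + (1 - x) * np.log(1 - p_i)) + np.log(prior_i)

X = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 0]])
y = np.array([0, 0, 1, 1])
p = fit_bernoulli_nb(X, y, K=2)
scores = [g_bernoulli(np.array([1, 0, 1]), p[i], 0.5) for i in range(2)]
print(np.argmax(scores))
```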
Discrete Features
Multinomial (1-of-$n_j$) features: $x_j \in \{v_1, v_2, \ldots, v_{n_j}\}$

$$p_{ijk} \equiv p(z_{jk} = 1 \mid C_i) = p(x_j = v_k \mid C_i)$$

where $z_{jk} = 1$ if $x_j = v_k$ and 0 otherwise. If the $x_j$ are independent:

$$p(\mathbf{x} \mid C_i) = \prod_{j=1}^d \prod_{k=1}^{n_j} p_{ijk}^{z_{jk}}$$

$$g_i(\mathbf{x}) = \sum_j \sum_k z_{jk} \log p_{ijk} + \log P(C_i)$$

Estimated parameters:

$$\hat{p}_{ijk} = \frac{\sum_t z_{jk}^t r_i^t}{\sum_t r_i^t}$$
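A sketch for 1-of-n_j features; the `one_hot` helper that builds the z_jk indicators is ours, added for illustration (the log-floor is also an added safeguard):

```python
import numpy as np

def one_hot(X, n_values):
    """Encode column j of X (values in {0,...,n_values[j]-1}) as z_jk indicators."""
    cols = [np.eye(n)[X[:, j]] for j, n in enumerate(n_values)]
    return np.hstack(cols)

# two discrete features: x_1 in {0,1,2}, x_2 in {0,1}
X = np.array([[0, 1], [2, 0], [1, 1], [0, 0]])
y = np.array([0, 0, 1, 1])
Z = one_hot(X, n_values=[3, 2])              # columns are the z_jk

# p_hat_ijk = sum_t z_jk^t r_i^t / sum_t r_i^t, per class
p = np.array([Z[y == i].mean(axis=0) for i in range(2)])

# g_i(x) = sum_jk z_jk log p_ijk + log P(C_i)
z = one_hot(np.array([[0, 1]]), n_values=[3, 2])[0]
scores = [z @ np.log(np.clip(p[i], 1e-9, 1)) + np.log(0.5) for i in range(2)]
print(np.argmax(scores))
```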
Multivariate Regression

$$r^t = g(\mathbf{x}^t \mid w_0, w_1, \ldots, w_d) + \varepsilon$$

Multivariate linear model:

$$g(\mathbf{x}^t) = w_0 + w_1 x_1^t + w_2 x_2^t + \cdots + w_d x_d^t$$

$$E(w_0, w_1, \ldots, w_d \mid \mathcal{X}) = \frac{1}{2}\sum_t \left(r^t - w_0 - w_1 x_1^t - \cdots - w_d x_d^t\right)^2$$

Multivariate polynomial model: define new higher-order variables

$$z_1 = x_1, \quad z_2 = x_2, \quad z_3 = x_1^2, \quad z_4 = x_2^2, \quad z_5 = x_1 x_2$$

and use the linear model in this new z space (basis functions, kernel trick: Chapter 13).
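A least-squares sketch of the multivariate linear model, plus the polynomial trick of regressing on the z variables (demo data and coefficients are made up):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))                         # two inputs x_1, x_2
r = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(0, 0.1, 200)

# multivariate linear model: minimize E = 1/2 sum_t (r^t - w_0 - w^T x^t)^2
D = np.hstack([np.ones((200, 1)), X])                 # prepend the bias column for w_0
w, *_ = np.linalg.lstsq(D, r, rcond=None)
print(w)                                              # approx [1, 2, -3]

# polynomial model: z_1=x_1, z_2=x_2, z_3=x_1^2, z_4=x_2^2, z_5=x_1*x_2
Z = np.column_stack([X[:, 0], X[:, 1], X[:, 0]**2, X[:, 1]**2, X[:, 0]*X[:, 1]])
Dz = np.hstack([np.ones((200, 1)), Z])                # same linear machinery in z space
wz, *_ = np.linalg.lstsq(Dz, r, rcond=None)
```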