Medical Image Analysis – Machine Learning 2 · 2015-11-10
TRANSCRIPT
Medical Image Analysis Machine learning 2 KALLE ÅSTRÖM
Contents
• Review – the basic machine learning problems
  – Clustering, classification, regression, novelty detection
• Plug-in classifier vs. integration over estimated parameters
• Classification
  – Logistic regression
  – Regression trees
• Novelty detection
• Visualization, dimensionality reduction
  – The many applications of SVD
  – Multi-dimensional scaling
  – ISOMAP – non-linear dimensionality reduction
Review – Basic Machine Learning Questions
• Clustering: $x_1, \dots, x_n \to y_1, \dots, y_n$
• Classification: $(x_1, y_1), \dots, (x_n, y_n) \to f$, with $f : \mathbb{R}^d \to \{1, \dots, k\}$
• Regression: $(x_1, y_1), \dots, (x_n, y_n) \to f$, with $f : \mathbb{R}^d \to \mathbb{R}$
Bayes Theorem
• Bayes theorem: $P(Y \mid X) = \dfrac{P(X \mid Y)\, P(Y)}{P(X)}$
• Interpret $P$ as probabilities, e.g. if $Y$ is discrete: $P(Y = y \mid X = x) = \dfrac{f_X(x \mid Y = y)\, P(Y = y)}{f_X(x)}$
• Interpret $P$ as probability density functions, e.g. if $X$ and/or $Y$ are continuous stochastic variables: $f_Y(y \mid X = x) = \dfrac{f_X(x \mid Y = y)\, f_Y(y)}{f_X(x)}$
7 Nearest Neighbour Classification
[Figure: 7-nearest-neighbour estimate of $P(Y = 1 \mid X = x)$]
Non-parametric density estimation: kernel (Parzen) density estimation, mixture densities
[Figure: estimated $f_X(x \mid Y = 1)$]
Bin counting
[Figure: bin-counting estimate of $f_X(x \mid Y = 1)$]
Estimated density or pdf (probability density function) (discussion)
[Figure: density estimate of $f_X(x \mid Y = 1)$]
Cross-validation
• Take away a subset $X'$ of the training data.
• Estimate the distribution $p_{X \setminus X'}(x)$ from the remaining data $X \setminus X'$ for one choice of width $r$.
• Compute the mean log-likelihood on the held-out subset $X'$.
• Pick the radius/width $r$ that maximizes this quantity.
Cross-validation Details
Basic idea: compute $p(X' \mid \theta(X \setminus X'))$ for various subsets $X'$ of $X$ and average over the corresponding log-likelihoods.
Practical implementation: generate subsets $X_i \subset X$ and compute the log-likelihood estimate
$$\frac{1}{n} \sum_{i=1}^{n} \frac{1}{|X_i|} \log p(X_i \mid \theta(X \setminus X_i)).$$
Pick the parameter which maximizes the above estimate.
Special case: leave-one-out cross-validation. For a kernel (Parzen) density estimate,
$$p_{X \setminus x_i}(x_i) = \frac{m}{m-1}\, p_X(x_i) - \frac{1}{m-1}\, k(x_i, x_i).$$
Cross-validation
[Figure: cross-validation results]
Parzen windows classifier – estimated P(x|y)
[Figure: estimated $f_X(x \mid Y = 1)$ and $f_X(x \mid Y = -1)$]
Parzen windows density estimate
[Figure: estimated $f_X(x)$]
Parzen windows conditional – estimated via Bayes theorem (discussion → logistic regression)
[Figure: estimated $P(Y = 1 \mid X = x)$]
HEp-2 data (mean density) (intro to logistic regression)
[Figures: estimated $f_X(x \mid Y = 1)$, $f_X(x \mid Y = -1)$, and $P(Y = 1 \mid X = x)$]
Logistic regression
• Discuss ideas and derivations on the blackboard.
• $z$ = simple function of $x$, e.g. linear: $z = w^T x + b$, with $x \in \mathbb{R}^d$, $w \in \mathbb{R}^d$, $b \in \mathbb{R}$.
• Output $y$ = smooth threshold of $z$, for example
$$f(x) = s(w^T x + b), \qquad s(z) = \frac{1}{1 + e^{-z}}.$$
• Notice that $s(z)$ looks like a typical $P(Y = 1 \mid x)$ function:
$$P(Y = 1 \mid x) = \frac{1}{1 + e^{-z}}.$$
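A minimal sketch (not from the slides) of the logistic model's prediction step; the weight values are arbitrary placeholders.

```python
import numpy as np

def sigmoid(z):
    # smooth threshold s(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    # P(Y = 1 | x) = s(w^T x + b), evaluated for each row of X
    return sigmoid(X @ w + b)

X = np.array([[0.2, 1.5], [2.0, -0.3]])    # two example feature vectors
w, b = np.array([1.0, -2.0]), 0.5          # placeholder parameters
print(predict_proba(X, w, b))              # probabilities of class Y = 1
```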
Derivation
• Estimate the parameters from training data $T = (x_1, y_1), \dots, (x_n, y_n)$.
$$P(Y = 1 \mid x) = \frac{1}{1 + e^{-z}}, \qquad P(Y = -1 \mid x) = 1 - \frac{1}{1 + e^{-z}} = \frac{e^{-z}}{1 + e^{-z}} = \frac{1}{e^{z} + 1}.$$
• Both cases can be written compactly as
$$P(Y = y \mid x) = \frac{1}{1 + e^{-yz}}, \qquad y \in \{1, -1\}.$$
Estimate parameters
• Parameters $\theta = (w, b)$, training data $T = (x_1, y_1), \dots, (x_n, y_n)$, and model $P(Y = y \mid x) = \dfrac{1}{1 + e^{-yz}}$ with $z = w^T x + b$.
• Maximize the log-likelihood of the training data:
$$\log P = \log\Big(\prod_i P(Y = y_i \mid x_i, \theta)\Big) = \sum_i \log \frac{1}{1 + e^{-y_i (w^T x_i + b)}}.$$
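A small sketch (my own, not from the lecture) of estimating $\theta = (w, b)$ by minimizing the negative log-likelihood with scipy; the synthetic data and optimizer choice are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(theta, X, y):
    # theta = (w_1, ..., w_d, b); labels y_i in {+1, -1}
    w, b = theta[:-1], theta[-1]
    z = X @ w + b
    # -log P = sum_i log(1 + exp(-y_i * z_i))
    return np.logaddexp(0.0, -y * z).sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                                     # toy features
noise = rng.normal(scale=0.5, size=100)
y = np.where(X[:, 0] + 0.5 * X[:, 1] + noise > 0, 1, -1)          # noisy toy labels
theta0 = np.zeros(X.shape[1] + 1)
res = minimize(neg_log_likelihood, theta0, args=(X, y))
print("w =", res.x[:-1], "b =", res.x[-1])
```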
More machine learning algorithms where the parameter estimation becomes a convex optimization problem
• SVM (L2-regularized, L1 loss)
• SVM (L2-regularized, L2 loss)
• LR (L2-regularized)
• SVM (L1-regularized, L2 loss)
• LR (L1-regularized)
• Efficient implementations exist, e.g. in the 'liblinear' package (see the sketch below).
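As a usage illustration (not part of the original slides), scikit-learn wraps liblinear's solvers; the snippet fits an L2-regularized logistic regression and an L2-regularized L1-loss (hinge) SVM. The dataset is a made-up example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] - X[:, 1] > 0).astype(int)   # toy labels in {0, 1}

# L2-regularized logistic regression, solved by liblinear
lr = LogisticRegression(penalty="l2", C=1.0, solver="liblinear").fit(X, y)

# L2-regularized L1-loss (hinge) SVM; liblinear solves its dual form
svm = LinearSVC(penalty="l2", loss="hinge", C=1.0, dual=True).fit(X, y)

print(lr.coef_, svm.coef_)
```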
LIBLINEAR: A Library for Large Linear Classification

Acknowledgments
This work was supported in part by the National Science Council of Taiwan via the grant NSC 95-2221-E-002-205-MY3.

Appendix: Implementation Details and Practical Guide

Appendix A. Formulations
This section briefly describes classifiers supported in LIBLINEAR. Given training vectors $x_i \in \mathbb{R}^n$, $i = 1, \dots, l$ in two classes, and a vector $y \in \mathbb{R}^l$ such that $y_i \in \{1, -1\}$, a linear classifier generates a weight vector $w$ as the model. The decision function is $\operatorname{sgn}(w^T x)$. LIBLINEAR allows the classifier to include a bias term $b$. See Section 2 for details.

A.1 L2-regularized L1- and L2-loss Support Vector Classification
L2-regularized L1-loss SVC solves the following primal problem:
$$\min_w \ \tfrac{1}{2} w^T w + C \sum_{i=1}^{l} \max(0,\, 1 - y_i w^T x_i),$$
whereas L2-regularized L2-loss SVC solves the following primal problem:
$$\min_w \ \tfrac{1}{2} w^T w + C \sum_{i=1}^{l} \big(\max(0,\, 1 - y_i w^T x_i)\big)^2. \qquad (2)$$
Their dual forms are:
$$\min_\alpha \ \tfrac{1}{2} \alpha^T \bar{Q} \alpha - e^T \alpha \quad \text{subject to } 0 \le \alpha_i \le U,\ i = 1, \dots, l,$$
where $e$ is the vector of all ones, $\bar{Q} = Q + D$, $D$ is a diagonal matrix, and $Q_{ij} = y_i y_j x_i^T x_j$. For L1-loss SVC, $U = C$ and $D_{ii} = 0,\ \forall i$. For L2-loss SVC, $U = \infty$ and $D_{ii} = 1/(2C),\ \forall i$.

A.2 L2-regularized Logistic Regression
L2-regularized LR solves the following unconstrained optimization problem:
$$\min_w \ \tfrac{1}{2} w^T w + C \sum_{i=1}^{l} \log\big(1 + e^{-y_i w^T x_i}\big). \qquad (3)$$
Its dual form is:
$$\min_\alpha \ \tfrac{1}{2} \alpha^T Q \alpha + \sum_{i:\,\alpha_i > 0} \alpha_i \log \alpha_i + \sum_{i:\,\alpha_i < C} (C - \alpha_i) \log(C - \alpha_i) - \sum_{i=1}^{l} C \log C \quad \text{subject to } 0 \le \alpha_i \le C,\ i = 1, \dots, l. \qquad (4)$$
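To connect the formulas above to code, here is a small sketch (my own, not from the paper) that evaluates the L2-regularized L1- and L2-loss SVC primal objectives for a given $w$; the data are placeholders.

```python
import numpy as np

def svc_primal(w, X, y, C, loss="l1"):
    # margins: y_i * w^T x_i; hinge terms: max(0, 1 - margin)
    hinge = np.maximum(0.0, 1.0 - y * (X @ w))
    data_term = hinge.sum() if loss == "l1" else (hinge ** 2).sum()
    return 0.5 * w @ w + C * data_term

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = np.where(X[:, 0] > 0, 1, -1)
w = np.zeros(3)
print(svc_primal(w, X, y, C=1.0, loss="l1"),
      svc_primal(w, X, y, C=1.0, loss="l2"))
```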
A.3 L1-regularized L2-loss Support Vector Classification
L1 regularization generates a sparse solution $w$. L1-regularized L2-loss SVC solves the following primal problem:
$$\min_w \ \|w\|_1 + C \sum_{i=1}^{l} \big(\max(0,\, 1 - y_i w^T x_i)\big)^2, \qquad (5)$$
where $\|\cdot\|_1$ denotes the 1-norm.

A.4 L1-regularized Logistic Regression
L1-regularized LR solves the following unconstrained optimization problem:
$$\min_w \ \|w\|_1 + C \sum_{i=1}^{l} \log\big(1 + e^{-y_i w^T x_i}\big), \qquad (6)$$
where $\|\cdot\|_1$ denotes the 1-norm.

A.5 L2-regularized L1- and L2-loss Support Vector Regression
Support vector regression (SVR) considers a problem similar to (1), but $y_i$ is a real value instead of $+1$ or $-1$. L2-regularized SVR solves the following primal problems:
$$\min_w \ \tfrac{1}{2} w^T w + \begin{cases} C \sum_{i=1}^{l} \max(0,\, |y_i - w^T x_i| - \epsilon) & \text{if using L1 loss,} \\ C \sum_{i=1}^{l} \big(\max(0,\, |y_i - w^T x_i| - \epsilon)\big)^2 & \text{if using L2 loss,} \end{cases}$$
where $\epsilon \ge 0$ is a parameter to specify the sensitiveness of the loss. Their dual forms are:
$$\min_{\alpha^+,\,\alpha^-} \ \tfrac{1}{2} \begin{bmatrix} \alpha^+ \\ \alpha^- \end{bmatrix}^T \begin{bmatrix} \bar{Q} & -Q \\ -Q & \bar{Q} \end{bmatrix} \begin{bmatrix} \alpha^+ \\ \alpha^- \end{bmatrix} - y^T(\alpha^+ - \alpha^-) + \epsilon\, e^T(\alpha^+ + \alpha^-) \quad \text{subject to } 0 \le \alpha_i^+, \alpha_i^- \le U,\ i = 1, \dots, l, \qquad (7)$$
where $e$ is the vector of all ones, $\bar{Q} = Q + D$, $Q \in \mathbb{R}^{l \times l}$ is a matrix with $Q_{ij} \equiv x_i^T x_j$, $D$ is a diagonal matrix with $D_{ii} = 0$ for L1-loss SVR and $D_{ii} = \tfrac{1}{2C}$ for L2-loss SVR, and $U = C$ if using L1-loss SVR, $U = \infty$ if using L2-loss SVR.
Rather than (7), in LIBLINEAR, we consider the following problem:
$$\min_\beta \ \tfrac{1}{2} \beta^T \bar{Q} \beta - y^T \beta + \epsilon \|\beta\|_1 \quad \text{subject to } -U \le \beta_i \le U,\ i = 1, \dots, l, \qquad (8)$$
where $\beta \in \mathbb{R}^l$ and $\|\cdot\|_1$ denotes the 1-norm. It can be shown that an optimal solution of (8) leads to the following optimal solution of (7):
$$\alpha_i^+ \equiv \max(\beta_i, 0) \quad \text{and} \quad \alpha_i^- \equiv \max(-\beta_i, 0).$$
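A brief illustration (not from the paper) of the sparsity induced by problem (6), using scikit-learn's liblinear-backed L1-penalized logistic regression on made-up data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # only 2 of 20 features matter

# L1-regularized LR (problem (6)); smaller C => stronger regularization, sparser w
clf = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(X, y)
print("non-zero weights:", np.count_nonzero(clf.coef_), "of", clf.coef_.size)
```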
Watson-Nadaraya Classifier/Regression (plug-in classifier/regression)
Decision boundary: picking $y = 1$ or $y = -1$ depends on the sign of
$$\Pr(y = 1 \mid x) - \Pr(y = -1 \mid x) = \frac{\sum_i y_i\, k(x_i, x)}{\sum_i k(x_i, x)}.$$
Extension to regression: use the same equation for regression,
$$f(x) = \frac{\sum_i y_i\, k(x_i, x)}{\sum_i k(x_i, x)},$$
where now $y_i \in \mathbb{R}$. We get a locally weighted version of the data.
[Figures: a regression problem and the corresponding Watson-Nadaraya regression estimate]
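A compact sketch (not from the slides) of Watson-Nadaraya regression with a Gaussian kernel; the width value and toy data are illustrative.

```python
import numpy as np

def nadaraya_watson(x_query, x_train, y_train, r=0.3):
    # f(x) = sum_i y_i k(x_i, x) / sum_i k(x_i, x), Gaussian kernel of width r
    K = np.exp(-0.5 * ((x_query[:, None] - x_train[None, :]) / r) ** 2)
    return (K * y_train).sum(axis=1) / K.sum(axis=1)

x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x) + 0.1 * np.random.default_rng(0).normal(size=50)
x_new = np.linspace(0, 1, 200)
print(nadaraya_watson(x_new, x, y)[:5])   # locally weighted estimate at x_new
```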
Plug-in versus integration over estimated parameters
• The estimated quantities, e.g. the class-conditional density $f_X(x \mid Y = 1)$, are themselves uncertain.
• Example: parametric estimation with parameters $\theta$.
• Estimate the parameters given training data $T$, i.e. the distribution $f_\Theta(\theta \mid T)$.
• Approach 1 (plug-in): take the $\theta$ that maximizes $f_\Theta(\theta \mid T)$.
• Approach 2 (integration): integrate over $\theta$.
Integration over theta
• Since the parameters are uncertain, form the mean over all models, weighted according to their uncertainty:
$$f_X(x \mid Y = 1, T) = \int_\theta f_X(x \mid Y = 1, \theta)\, f_\Theta(\theta \mid Y = 1, T)\, d\theta.$$
• Plug these estimated densities into Bayes theorem to get the classification.
• This can be difficult to calculate for more advanced models.
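One way to make the integral concrete (my sketch, not from the lecture) is a Monte Carlo approximation: draw parameter samples from the posterior and average the resulting densities. The Gaussian model and the simple posterior over the mean are simplifying assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(loc=1.0, scale=1.0, size=20)        # class-1 training data (toy)

# crude posterior over the mean theta (known variance 1, flat prior):
# theta | T ~ N(mean(data), 1/n)
theta_samples = rng.normal(data.mean(), 1 / np.sqrt(len(data)), size=1000)

x = 0.5
plug_in = norm.pdf(x, loc=data.mean(), scale=1.0)                 # approach 1
integrated = norm.pdf(x, loc=theta_samples, scale=1.0).mean()     # approach 2
print(plug_in, integrated)
```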
Density estimation for novelty detection
• Find the least likely observations $x_i$ in a dataset $X$.
• Perform density estimation from the data $X$.
• Check the data points $x_i$ that have the lowest $p(x_i \mid X)$, perhaps using a leave-one-out strategy (see the sketch below).
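A minimal novelty-detection sketch (not from the slides) using scikit-learn's kernel density estimator; the bandwidth and the number of flagged points are arbitrary choices.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(200, 2)),            # normal data
               rng.normal(loc=6.0, size=(3, 2))])    # a few outliers

kde = KernelDensity(bandwidth=0.5).fit(X)
log_p = kde.score_samples(X)                         # log p(x_i | X)
novel = np.argsort(log_p)[:3]                        # indices of least likely points
print("suspected outliers:", novel)
```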
Applications
• Network intrusion detection
• Jet engine failure detection
• Database cleaning
• Fraud detection
• Detecting bad sensors, e.g. EEG sensors that have been erroneously placed on the patient
• Self-calibrating alarm devices
[Figures: typical data, outliers, and HEp-2 data (mean density)]
Regression trees
Decision trees advantages
• Simple to understand and interpret
• Requires little data preparation
• Can handle both continuous and discrete data
• 'White box' model: you can easily explain a decision afterwards
• Robust
• Performs well with large datasets
Decision tree limitations
• Optimal learning is NP-complete (use heuristics)
• Problems with overfitting
Regression trees learning
• Try each variable.
• Try each threshold.
• Calculate a score, e.g.
  – Entropy
  – Gini impurity
• Entropy example: $\log_2(6) = 2.58$ bits (the maximum entropy over six classes); for the class frequencies $f = [0.24, 0.23, 0.11, 0.13, 0.21, 0.08]$, the entropy is $I(f) = 2.48$ bits.
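A short sketch (my own, not from the slides) of the scores and the exhaustive split search described above; the toy dataset is a placeholder.

```python
import numpy as np

def entropy(labels):
    # I(f) = -sum_k f_k log2 f_k over class frequencies f_k
    _, counts = np.unique(labels, return_counts=True)
    f = counts / counts.sum()
    return -(f * np.log2(f)).sum()

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    f = counts / counts.sum()
    return 1.0 - (f ** 2).sum()

def best_split(X, y, score=entropy):
    # try each variable and each threshold, keep the split with the lowest weighted score
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            s = (len(left) * score(left) + len(right) * score(right)) / len(y)
            if s < best[2]:
                best = (j, t, s)
    return best

print(entropy(np.array([0, 0, 1, 1, 2, 2])))   # log2(3) = 1.58 bits for 3 equal classes
```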
Using SVD in several ways
• The singular value decomposition: $M = U S V^T$
• Essentially a unique factorization
• $U$ and $V$ are rotation matrices (unitary matrices)
• $S$ is diagonal with decreasing non-negative diagonal elements called singular values
Using SVD in several ways
• Find a vector $x$ that 'solves' $Ax = 0$
• If $A$ has fewer rows than columns
  – Underdetermined, many solutions
• If $A$ is overdetermined, find the $x$ that gives the smallest $|Ax|$
  – But a smaller $x$ always gives a smaller $|Ax|$, so the problem must be normalized
• Minimize $|Ax|$ under the constraint that $|x| = 1$
• This is solved by setting $x$ to the last column of $V$, where $A = U S V^T$
Using SVD in several ways
• Given a data matrix $A$,
  – find the rank-$k$ matrix $A_k$ that is closest to $A$
• Solution
  – Compute the singular value decomposition $A = U S V^T$
  – Let $U_k$ be the first $k$ columns of $U$
  – Let $V_k$ be the first $k$ columns of $V$
  – Let $S_k$ be the upper-left $k \times k$ submatrix of $S$
  – $A_k = U_k S_k V_k^T$
  – There is a proof that $A_k$ is the solution to the minimization problem
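A small numpy sketch (not from the slides) of the best rank-k approximation via the SVD; the matrix and k are placeholder values.

```python
import numpy as np

def best_rank_k(A, k):
    # A = U S V^T; keep the first k columns of U, V and the top k singular values
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] * s[:k] @ Vt[:k, :]

A = np.random.default_rng(0).normal(size=(6, 4))
A2 = best_rank_k(A, 2)
print(np.linalg.matrix_rank(A2), np.linalg.norm(A - A2))   # rank 2, approximation error
```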
Using SVD in several ways
• Find an optimal basis (and coordinates) for a set of images
• "Optimal" (linear) dimensionality reduction
• Solution
  – Put the images as columns in a matrix $A$; remove the mean first
  – Compute the singular value decomposition $A = U S V^T$
  – Possibly form the best low-rank approximation $A_k = U_k S_k V_k^T$
  – The columns of $U_k$ are a basis for the optimal subspace of dimension $k$
  – $X_k = S_k V_k^T$ are the optimal coordinates
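A sketch (not from the slides) of this basis computation; the 'images' here are random vectors standing in for flattened image data.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(1024, 40))           # 40 flattened 32x32 'images' as columns
A = A - A.mean(axis=1, keepdims=True)     # remove the mean image

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 5
Uk = U[:, :k]                             # basis for the optimal k-dimensional subspace
Xk = s[:k, None] * Vt[:k, :]              # optimal coordinates S_k V_k^T
print(Uk.shape, Xk.shape)                 # (1024, 5), (5, 40)
```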
Using SVD in several ways
• Given a symmetric matrix $A$ (positive semidefinite),
  – factorize $A = B B^T$
• Solution
  – Compute the singular value decomposition $A = U S V^T$
  – Take the square root of $S = D^2$
  – Set $B = U D$, so that $A = B B^T$
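A brief check (my own sketch) of this factorization on a random symmetric positive semidefinite matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 4))
A = M @ M.T                                # symmetric positive semidefinite

U, s, _ = np.linalg.svd(A)                 # for PSD A, A = U S U^T
B = U * np.sqrt(s)                         # B = U D with D = sqrt(S)
print(np.allclose(B @ B.T, A))             # True
```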
Multi-Dimensional Scaling
• Assume that 'distances'/'similarities' are measured between each pair $(i, j)$ of feature vectors $(x_1, x_2, \dots, x_n)$.
• Can you reconstruct the feature vectors $(x_1, x_2, \dots, x_n)$ from the interpoint distances $d_{ij} = |x_i - x_j|$?
Multi-Dimensional Scaling - The trick
• Choose the coordinate system so that $x_1 = 0$.
• Square the distances:
$$t_{ij} = d_{ij}^2 = (x_i - x_j)^T (x_i - x_j) = x_i^T x_i + x_j^T x_j - 2 x_i^T x_j.$$
• With $x_1 = 0$, $t_{1i} = x_i^T x_i$ and $t_{1j} = x_j^T x_j$.
• Form a new matrix with entries
$$s_{ij} = -\tfrac{1}{2}(t_{ij} - t_{1i} - t_{1j}) = x_i^T x_j,$$
$$S = \begin{pmatrix} x_2^T \\ x_3^T \\ \vdots \\ x_n^T \end{pmatrix} \begin{pmatrix} x_2 & x_3 & \cdots & x_n \end{pmatrix}.$$
Multi-Dimensional Scaling
• Use the singular value decomposition to calculate $X$ from $S$ (see the previous slide):
$$S = \begin{pmatrix} x_2^T \\ x_3^T \\ \vdots \\ x_n^T \end{pmatrix} \begin{pmatrix} x_2 & x_3 & \cdots & x_n \end{pmatrix} = X^T X.$$
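A sketch (my own, not from the slides) of this reconstruction: build $S$ from the squared distances, then factor the Gram matrix $S = X^T X$ with the SVD as in the symmetric-factorization trick above. The embedding dimension is an assumption.

```python
import numpy as np

def classical_mds(D, dim=2):
    # D: n x n matrix of pairwise distances d_ij = |x_i - x_j|
    T = D ** 2
    # s_ij = -(t_ij - t_1i - t_1j) / 2, using point 1 (index 0) as the origin
    S = -0.5 * (T - T[0:1, :] - T[:, 0:1])
    U, s, _ = np.linalg.svd(S)                 # S = X^T X is symmetric PSD
    return U[:, :dim] * np.sqrt(s[:dim])       # one reconstructed point per row

pts = np.random.default_rng(0).normal(size=(10, 2))
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
rec = classical_mds(D, dim=2)
# the reconstruction preserves all pairwise distances (up to a rigid motion)
print(np.allclose(np.linalg.norm(rec[:, None] - rec[None, :], axis=2), D))
```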
MDS example
Non-linear dimensionality reduction
• Many different methods, see http://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction
• Examples:
  – Kernel PCA
  – Locally linear embedding
  – ISOMAP
• Many similarities between the methods.
ISOMAP
• Idea (illustrated on the blackboard):
  – For each point, choose the k nearest neighbours.
  – Form a weighted graph using the distances to the k nearest neighbours.
  – Calculate a distance matrix D containing all distances $d_{ij}$ between pairs of feature vectors, using the shortest path distance in the graph.
  – Use multi-dimensional scaling to embed the points in e.g. $\mathbb{R}^2$.
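A compact sketch (not from the lecture) of these steps, using scikit-learn's neighbour graph, scipy's shortest-path routine, and the classical_mds function from the MDS sketch above; k and the data are placeholder choices.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                                # toy high-dimensional data

# steps 1-2: weighted k-nearest-neighbour graph (edge weights = Euclidean distances)
G = kneighbors_graph(X, n_neighbors=10, mode="distance")

# step 3: all-pairs shortest-path (geodesic) distance matrix
D = shortest_path(G, directed=False)

# step 4: embed in R^2 with multi-dimensional scaling (classical_mds defined earlier)
Y = classical_mds(D, dim=2)
print(Y.shape)                                               # (100, 2)
```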
Summary
• Machine learning topics, Bayes theorem again
• Classification
  – Logistic regression: classification where parameter estimation becomes a convex optimization problem
  – Regression trees
• Novelty detection
• Visualization, dimensionality reduction
  – The many applications of SVD
  – Multi-dimensional scaling
  – ISOMAP – non-linear dimensionality reduction