Medical Image Analysis – Machine Learning 2 · 2015-11-10
TRANSCRIPT
Medical Image Analysis Machine learning 2 KALLE ÅSTRÖM
Contents
• Review – the basic machine learning problems
  – Clustering, classification, regression, novelty detection
• Plug-in classifier vs. integration over estimated parameters
• Classification
  – Logistic regression
  – Regression trees
• Novelty detection
• Visualization, dimensionality reduction
  – The many applications of SVD
  – Multi-dimensional scaling
  – ISOMAP – non-linear dimensionality reduction
Review – Basic Machine Learning Questions
• Clustering: $x_1, \dots, x_n \to y_1, \dots, y_n$
• Classification: $(x_1, y_1), \dots, (x_n, y_n) \to f$, with $f : \mathbb{R}^d \to \{1, \dots, k\}$
• Regression: $(x_1, y_1), \dots, (x_n, y_n) \to f$, with $f : \mathbb{R}^d \to \mathbb{R}$
Bayes Theorem
• Bayes theorem: $P(Y \mid X) = \dfrac{P(X \mid Y)\, P(Y)}{P(X)}$
• Interpret $P$ as probabilities, e.g. if $Y$ is discrete: $P(Y = y \mid X = x) = \dfrac{f_X(x \mid Y = y)\, P(Y = y)}{f_X(x)}$
• Interpret $P$ as probability density functions, e.g. if $X$ and/or $Y$ are continuous stochastic variables: $f_Y(y \mid X = x) = \dfrac{f_X(x \mid Y = y)\, f_Y(y)}{f_X(x)}$
7 Nearest Neighbour Classification
[Figure: 7-nearest-neighbour estimate of $P(Y = 1 \mid X = x)$]
Non-parametric density estimation: kernel (Parzen) density estimation, mixture densities
[Figure: estimated $f_X(x \mid Y = 1)$]
Bin counting
[Figure: bin-counting estimate of $f_X(x \mid Y = 1)$]
Estimated density or pdf (probability density function) (discussion)
[Figure: density estimate of $f_X(x \mid Y = 1)$]
Cross-validation
• Take away a subset $X'$ of the training data.
• Estimate the distribution $p_{X \setminus X'}(x)$ from the remaining data $X \setminus X'$ for one choice of width $r$.
• Compute the mean log-likelihood on the held-out subset $X'$.
• Pick the radius/width $r$ that maximizes this quantity.
Cross-validation Details
Basic idea: compute $p(X' \mid \theta(X \setminus X'))$ for various subsets $X'$ of $X$ and average over the corresponding log-likelihoods.
Practical implementation: generate subsets $X_i \subset X$ and compute the log-likelihood estimate
$$\frac{1}{n} \sum_{i=1}^{n} \frac{1}{|X_i|} \log p(X_i \mid \theta(X \setminus X_i)).$$
Pick the parameter which maximizes the above estimate.
Special case: leave-one-out cross-validation. For a kernel (Parzen) density estimate,
$$p_{X \setminus x_i}(x_i) = \frac{m}{m-1}\, p_X(x_i) - \frac{1}{m-1}\, k(x_i, x_i).$$
Cross-validation
[Figure: cross-validation results]
Parzen windows classifier – estimated P(x|y)
[Figure: estimated $f_X(x \mid Y = 1)$ and $f_X(x \mid Y = -1)$]
Parzen windows density estimate
[Figure: estimated $f_X(x)$]
Parzen windows conditional – estimated via Bayes theorem (discussion → logistic regression)
[Figure: estimated $P(Y = 1 \mid X = x)$]
HEp-2 data (mean density) (intro to logistic regression)
[Figures: estimated $f_X(x \mid Y = 1)$, $f_X(x \mid Y = -1)$, and $P(Y = 1 \mid X = x)$]
Logistic regression
• Discuss ideas and derivations on the blackboard.
• $z$ = simple function of $x$, e.g. linear: $z = w^T x + b$, with $x \in \mathbb{R}^d$, $w \in \mathbb{R}^d$, $b \in \mathbb{R}$.
• Output $y$ = smooth threshold of $z$, for example
$$f(x) = s(w^T x + b), \qquad s(z) = \frac{1}{1 + e^{-z}}.$$
• Notice that $s(z)$ looks like a typical $P(Y = 1 \mid x)$ function:
$$P(Y = 1 \mid x) = \frac{1}{1 + e^{-z}}.$$
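A minimal sketch (not from the slides) of the logistic model's prediction step; the weight values are arbitrary placeholders.

```python
import numpy as np

def sigmoid(z):
    # smooth threshold s(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    # P(Y = 1 | x) = s(w^T x + b), evaluated for each row of X
    return sigmoid(X @ w + b)

X = np.array([[0.2, 1.5], [2.0, -0.3]])    # two example feature vectors
w, b = np.array([1.0, -2.0]), 0.5          # placeholder parameters
print(predict_proba(X, w, b))              # probabilities of class Y = 1
```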
Derivation
• Estimate the parameters from training data $T = (x_1, y_1), \dots, (x_n, y_n)$.
$$P(Y = 1 \mid x) = \frac{1}{1 + e^{-z}}, \qquad P(Y = -1 \mid x) = 1 - \frac{1}{1 + e^{-z}} = \frac{e^{-z}}{1 + e^{-z}} = \frac{1}{e^{z} + 1}.$$
• Both cases can be written compactly as
$$P(Y = y \mid x) = \frac{1}{1 + e^{-yz}}, \qquad y \in \{1, -1\}.$$
Estimate parameters
• Parameters $\theta = (w, b)$, training data $T = (x_1, y_1), \dots, (x_n, y_n)$, and model $P(Y = y \mid x) = \dfrac{1}{1 + e^{-yz}}$ with $z = w^T x + b$.
• Maximize the log-likelihood of the training data:
$$\log P = \log\Big(\prod_i P(Y = y_i \mid x_i, \theta)\Big) = \sum_i \log \frac{1}{1 + e^{-y_i (w^T x_i + b)}}.$$
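A small sketch (my own, not from the lecture) of estimating $\theta = (w, b)$ by minimizing the negative log-likelihood with scipy; the synthetic data and optimizer choice are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(theta, X, y):
    # theta = (w_1, ..., w_d, b); labels y_i in {+1, -1}
    w, b = theta[:-1], theta[-1]
    z = X @ w + b
    # -log P = sum_i log(1 + exp(-y_i * z_i))
    return np.logaddexp(0.0, -y * z).sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                                     # toy features
noise = rng.normal(scale=0.5, size=100)
y = np.where(X[:, 0] + 0.5 * X[:, 1] + noise > 0, 1, -1)          # noisy toy labels
theta0 = np.zeros(X.shape[1] + 1)
res = minimize(neg_log_likelihood, theta0, args=(X, y))
print("w =", res.x[:-1], "b =", res.x[-1])
```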
More machine learning algorithms where the parameter estimation becomes a convex optimization problem
• SVM (L2-regularized, L1 loss)
• SVM (L2-regularized, L2 loss)
• LR (L2-regularized)
• SVM (L1-regularized, L2 loss)
• LR (L1-regularized)
• Efficient implementations exist, e.g. in the 'liblinear' package (see the sketch below).
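As a usage illustration (not part of the original slides), scikit-learn wraps liblinear's solvers; the snippet fits an L2-regularized logistic regression and an L2-regularized L1-loss (hinge) SVM. The dataset is a made-up example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] - X[:, 1] > 0).astype(int)   # toy labels in {0, 1}

# L2-regularized logistic regression, solved by liblinear
lr = LogisticRegression(penalty="l2", C=1.0, solver="liblinear").fit(X, y)

# L2-regularized L1-loss (hinge) SVM; liblinear solves its dual form
svm = LinearSVC(penalty="l2", loss="hinge", C=1.0, dual=True).fit(X, y)

print(lr.coef_, svm.coef_)
```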
LIBLINEAR: A Library for Large Linear Classification

Acknowledgments
This work was supported in part by the National Science Council of Taiwan via the grant NSC 95-2221-E-002-205-MY3.

Appendix: Implementation Details and Practical Guide

Appendix A. Formulations
This section briefly describes classifiers supported in LIBLINEAR. Given training vectors $x_i \in \mathbb{R}^n$, $i = 1, \dots, l$ in two classes, and a vector $y \in \mathbb{R}^l$ such that $y_i \in \{1, -1\}$, a linear classifier generates a weight vector $w$ as the model. The decision function is $\operatorname{sgn}(w^T x)$. LIBLINEAR allows the classifier to include a bias term $b$. See Section 2 for details.

A.1 L2-regularized L1- and L2-loss Support Vector Classification
L2-regularized L1-loss SVC solves the following primal problem:
$$\min_w \ \tfrac{1}{2} w^T w + C \sum_{i=1}^{l} \max(0,\, 1 - y_i w^T x_i),$$
whereas L2-regularized L2-loss SVC solves the following primal problem:
$$\min_w \ \tfrac{1}{2} w^T w + C \sum_{i=1}^{l} \big(\max(0,\, 1 - y_i w^T x_i)\big)^2. \qquad (2)$$
Their dual forms are:
$$\min_\alpha \ \tfrac{1}{2} \alpha^T \bar{Q} \alpha - e^T \alpha \quad \text{subject to } 0 \le \alpha_i \le U,\ i = 1, \dots, l,$$
where $e$ is the vector of all ones, $\bar{Q} = Q + D$, $D$ is a diagonal matrix, and $Q_{ij} = y_i y_j x_i^T x_j$. For L1-loss SVC, $U = C$ and $D_{ii} = 0,\ \forall i$. For L2-loss SVC, $U = \infty$ and $D_{ii} = 1/(2C),\ \forall i$.

A.2 L2-regularized Logistic Regression
L2-regularized LR solves the following unconstrained optimization problem:
$$\min_w \ \tfrac{1}{2} w^T w + C \sum_{i=1}^{l} \log\big(1 + e^{-y_i w^T x_i}\big). \qquad (3)$$
Its dual form is:
$$\min_\alpha \ \tfrac{1}{2} \alpha^T Q \alpha + \sum_{i:\,\alpha_i > 0} \alpha_i \log \alpha_i + \sum_{i:\,\alpha_i < C} (C - \alpha_i) \log(C - \alpha_i) - \sum_{i=1}^{l} C \log C \quad \text{subject to } 0 \le \alpha_i \le C,\ i = 1, \dots, l. \qquad (4)$$
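To connect the formulas above to code, here is a small sketch (my own, not from the paper) that evaluates the L2-regularized L1- and L2-loss SVC primal objectives for a given $w$; the data are placeholders.

```python
import numpy as np

def svc_primal(w, X, y, C, loss="l1"):
    # margins: y_i * w^T x_i; hinge terms: max(0, 1 - margin)
    hinge = np.maximum(0.0, 1.0 - y * (X @ w))
    data_term = hinge.sum() if loss == "l1" else (hinge ** 2).sum()
    return 0.5 * w @ w + C * data_term

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = np.where(X[:, 0] > 0, 1, -1)
w = np.zeros(3)
print(svc_primal(w, X, y, C=1.0, loss="l1"),
      svc_primal(w, X, y, C=1.0, loss="l2"))
```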
A.3 L1-regularized L2-loss Support Vector Classification
L1 regularization generates a sparse solution $w$. L1-regularized L2-loss SVC solves the following primal problem:
$$\min_w \ \|w\|_1 + C \sum_{i=1}^{l} \big(\max(0,\, 1 - y_i w^T x_i)\big)^2, \qquad (5)$$
where $\|\cdot\|_1$ denotes the 1-norm.

A.4 L1-regularized Logistic Regression
L1-regularized LR solves the following unconstrained optimization problem:
$$\min_w \ \|w\|_1 + C \sum_{i=1}^{l} \log\big(1 + e^{-y_i w^T x_i}\big), \qquad (6)$$
where $\|\cdot\|_1$ denotes the 1-norm.

A.5 L2-regularized L1- and L2-loss Support Vector Regression
Support vector regression (SVR) considers a problem similar to (1), but $y_i$ is a real value instead of $+1$ or $-1$. L2-regularized SVR solves the following primal problems:
$$\min_w \ \tfrac{1}{2} w^T w + \begin{cases} C \sum_{i=1}^{l} \max(0,\, |y_i - w^T x_i| - \epsilon) & \text{if using L1 loss,} \\ C \sum_{i=1}^{l} \big(\max(0,\, |y_i - w^T x_i| - \epsilon)\big)^2 & \text{if using L2 loss,} \end{cases}$$
where $\epsilon \ge 0$ is a parameter to specify the sensitiveness of the loss. Their dual forms are:
$$\min_{\alpha^+,\,\alpha^-} \ \tfrac{1}{2} \begin{bmatrix} \alpha^+ \\ \alpha^- \end{bmatrix}^T \begin{bmatrix} \bar{Q} & -Q \\ -Q & \bar{Q} \end{bmatrix} \begin{bmatrix} \alpha^+ \\ \alpha^- \end{bmatrix} - y^T(\alpha^+ - \alpha^-) + \epsilon\, e^T(\alpha^+ + \alpha^-) \quad \text{subject to } 0 \le \alpha_i^+, \alpha_i^- \le U,\ i = 1, \dots, l, \qquad (7)$$
where $e$ is the vector of all ones, $\bar{Q} = Q + D$, $Q \in \mathbb{R}^{l \times l}$ is a matrix with $Q_{ij} \equiv x_i^T x_j$, $D$ is a diagonal matrix with $D_{ii} = 0$ for L1-loss SVR and $D_{ii} = \tfrac{1}{2C}$ for L2-loss SVR, and $U = C$ if using L1-loss SVR, $U = \infty$ if using L2-loss SVR.
Rather than (7), in LIBLINEAR, we consider the following problem:
$$\min_\beta \ \tfrac{1}{2} \beta^T \bar{Q} \beta - y^T \beta + \epsilon \|\beta\|_1 \quad \text{subject to } -U \le \beta_i \le U,\ i = 1, \dots, l, \qquad (8)$$
where $\beta \in \mathbb{R}^l$ and $\|\cdot\|_1$ denotes the 1-norm. It can be shown that an optimal solution of (8) leads to the following optimal solution of (7):
$$\alpha_i^+ \equiv \max(\beta_i, 0) \quad \text{and} \quad \alpha_i^- \equiv \max(-\beta_i, 0).$$
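A brief illustration (not from the paper) of the sparsity induced by problem (6), using scikit-learn's liblinear-backed L1-penalized logistic regression on made-up data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # only 2 of 20 features matter

# L1-regularized LR (problem (6)); smaller C => stronger regularization, sparser w
clf = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(X, y)
print("non-zero weights:", np.count_nonzero(clf.coef_), "of", clf.coef_.size)
```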
Watson-Nadaraya Classifier/Regression (plug-in classifier/regression)
Decision boundary: picking $y = 1$ or $y = -1$ depends on the sign of
$$\Pr(y = 1 \mid x) - \Pr(y = -1 \mid x) = \frac{\sum_i y_i\, k(x_i, x)}{\sum_i k(x_i, x)}.$$
Extension to regression: use the same equation for regression,
$$f(x) = \frac{\sum_i y_i\, k(x_i, x)}{\sum_i k(x_i, x)},$$
where now $y_i \in \mathbb{R}$. We get a locally weighted version of the data.
[Figures: a regression problem and the corresponding Watson-Nadaraya regression estimate]
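A compact sketch (not from the slides) of Watson-Nadaraya regression with a Gaussian kernel; the width value and toy data are illustrative.

```python
import numpy as np

def nadaraya_watson(x_query, x_train, y_train, r=0.3):
    # f(x) = sum_i y_i k(x_i, x) / sum_i k(x_i, x), Gaussian kernel of width r
    K = np.exp(-0.5 * ((x_query[:, None] - x_train[None, :]) / r) ** 2)
    return (K * y_train).sum(axis=1) / K.sum(axis=1)

x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x) + 0.1 * np.random.default_rng(0).normal(size=50)
x_new = np.linspace(0, 1, 200)
print(nadaraya_watson(x_new, x, y)[:5])   # locally weighted estimate at x_new
```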
Plug-in versus integration over estimated parameters
• The estimated quantities, e.g. the class-conditional density $f_X(x \mid Y = 1)$, are themselves uncertain.
• Example: parametric estimation with parameters $\theta$.
• Estimate the parameters given training data $T$, i.e. the distribution $f_\Theta(\theta \mid T)$.
• Approach 1 (plug-in): take the $\theta$ that maximizes $f_\Theta(\theta \mid T)$.
• Approach 2 (integration): integrate over $\theta$.
Integration over theta
• Since the parameters are uncertain, form the mean over all models, weighted according to their uncertainty:
$$f_X(x \mid Y = 1, T) = \int_\theta f_X(x \mid Y = 1, \theta)\, f_\Theta(\theta \mid Y = 1, T)\, d\theta.$$
• Plug these estimated densities into Bayes theorem to get the classification.
• This can be difficult to calculate for more advanced models.
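One way to make the integral concrete (my sketch, not from the lecture) is a Monte Carlo approximation: draw parameter samples from the posterior and average the resulting densities. The Gaussian model and the simple posterior over the mean are simplifying assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(loc=1.0, scale=1.0, size=20)        # class-1 training data (toy)

# crude posterior over the mean theta (known variance 1, flat prior):
# theta | T ~ N(mean(data), 1/n)
theta_samples = rng.normal(data.mean(), 1 / np.sqrt(len(data)), size=1000)

x = 0.5
plug_in = norm.pdf(x, loc=data.mean(), scale=1.0)                 # approach 1
integrated = norm.pdf(x, loc=theta_samples, scale=1.0).mean()     # approach 2
print(plug_in, integrated)
```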
Density estimation for novelty detection
• Find the least likely observations $x_i$ in a dataset $X$.
• Perform density estimation from the data $X$.
• Check the data points $x_i$ that have the lowest $p(x_i \mid X)$, perhaps using a leave-one-out strategy (see the sketch below).
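A minimal novelty-detection sketch (not from the slides) using scikit-learn's kernel density estimator; the bandwidth and the number of flagged points are arbitrary choices.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(200, 2)),            # normal data
               rng.normal(loc=6.0, size=(3, 2))])    # a few outliers

kde = KernelDensity(bandwidth=0.5).fit(X)
log_p = kde.score_samples(X)                         # log p(x_i | X)
novel = np.argsort(log_p)[:3]                        # indices of least likely points
print("suspected outliers:", novel)
```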
Applications
• Network intrusion detection
• Jet engine failure detection
• Database cleaning
• Fraud detection
• Detecting bad sensors, e.g. EEG sensors that have been erroneously placed on the patient
• Self-calibrating alarm devices
[Figures: typical data, outliers, and HEp-2 data (mean density)]
Regression trees
Decision trees advantages
• Simple to understand and interpret
• Requires little data preparation
• Can handle both continuous and discrete data
• 'White box' model: you can easily explain a decision afterwards
• Robust
• Performs well with large datasets
Decision tree limitations
• Optimal learning is NP-complete (use heuristics)
• Problems with overfitting
Regression trees learning
• Try each variable.
• Try each threshold.
• Calculate a score, e.g.
  – Entropy
  – Gini impurity
• Entropy example: $\log_2(6) = 2.58$ bits (the maximum entropy over six classes); for the class frequencies $f = [0.24, 0.23, 0.11, 0.13, 0.21, 0.08]$, the entropy is $I(f) = 2.48$ bits.
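A short sketch (my own, not from the slides) of the scores and the exhaustive split search described above; the toy dataset is a placeholder.

```python
import numpy as np

def entropy(labels):
    # I(f) = -sum_k f_k log2 f_k over class frequencies f_k
    _, counts = np.unique(labels, return_counts=True)
    f = counts / counts.sum()
    return -(f * np.log2(f)).sum()

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    f = counts / counts.sum()
    return 1.0 - (f ** 2).sum()

def best_split(X, y, score=entropy):
    # try each variable and each threshold, keep the split with the lowest weighted score
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            s = (len(left) * score(left) + len(right) * score(right)) / len(y)
            if s < best[2]:
                best = (j, t, s)
    return best

print(entropy(np.array([0, 0, 1, 1, 2, 2])))   # log2(3) = 1.58 bits for 3 equal classes
```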
Using SVD in several ways
• The singular value decomposition: $M = U S V^T$
• Essentially a unique factorization
• $U$ and $V$ are rotation matrices (unitary matrices)
• $S$ is diagonal with decreasing non-negative diagonal elements called singular values
Using SVD in several ways
• Find a vector $x$ that 'solves' $Ax = 0$
• If $A$ has fewer rows than columns
  – Underdetermined, many solutions
• If $A$ is overdetermined, find the $x$ that gives the smallest $|Ax|$
  – But a smaller $x$ always gives a smaller $|Ax|$, so the problem must be normalized
• Minimize $|Ax|$ under the constraint that $|x| = 1$
• This is solved by setting $x$ to the last column of $V$, where $A = U S V^T$
Using SVD in several ways
• Given a data matrix $A$,
  – find the rank-$k$ matrix $A_k$ that is closest to $A$
• Solution
  – Compute the singular value decomposition $A = U S V^T$
  – Let $U_k$ be the first $k$ columns of $U$
  – Let $V_k$ be the first $k$ columns of $V$
  – Let $S_k$ be the upper-left $k \times k$ submatrix of $S$
  – $A_k = U_k S_k V_k^T$
  – There is a proof that $A_k$ is the solution to the minimization problem
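A small numpy sketch (not from the slides) of the best rank-k approximation via the SVD; the matrix and k are placeholder values.

```python
import numpy as np

def best_rank_k(A, k):
    # A = U S V^T; keep the first k columns of U, V and the top k singular values
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] * s[:k] @ Vt[:k, :]

A = np.random.default_rng(0).normal(size=(6, 4))
A2 = best_rank_k(A, 2)
print(np.linalg.matrix_rank(A2), np.linalg.norm(A - A2))   # rank 2, approximation error
```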
Using SVD in several ways
• Find an optimal basis (and coordinates) for a set of images
• "Optimal" (linear) dimensionality reduction
• Solution
  – Put the images as columns in a matrix $A$; remove the mean first
  – Compute the singular value decomposition $A = U S V^T$
  – Possibly form the best low-rank approximation $A_k = U_k S_k V_k^T$
  – The columns of $U_k$ are a basis for the optimal subspace of dimension $k$
  – $X_k = S_k V_k^T$ are the optimal coordinates
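A sketch (not from the slides) of this basis computation; the 'images' here are random vectors standing in for flattened image data.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(1024, 40))           # 40 flattened 32x32 'images' as columns
A = A - A.mean(axis=1, keepdims=True)     # remove the mean image

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 5
Uk = U[:, :k]                             # basis for the optimal k-dimensional subspace
Xk = s[:k, None] * Vt[:k, :]              # optimal coordinates S_k V_k^T
print(Uk.shape, Xk.shape)                 # (1024, 5), (5, 40)
```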
Using SVD in several ways
• Given a symmetric matrix $A$ (positive semidefinite),
  – factorize $A = B B^T$
• Solution
  – Compute the singular value decomposition $A = U S V^T$
  – Take the square root of $S = D^2$
  – Set $B = U D$, so that $A = B B^T$
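A brief check (my own sketch) of this factorization on a random symmetric positive semidefinite matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 4))
A = M @ M.T                                # symmetric positive semidefinite

U, s, _ = np.linalg.svd(A)                 # for PSD A, A = U S U^T
B = U * np.sqrt(s)                         # B = U D with D = sqrt(S)
print(np.allclose(B @ B.T, A))             # True
```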
Multi-Dimensional Scaling
• Assume that 'distances'/'similarities' are measured between each pair $(i, j)$ of feature vectors $(x_1, x_2, \dots, x_n)$.
• Can you reconstruct the feature vectors $(x_1, x_2, \dots, x_n)$ from the interpoint distances $d_{ij} = |x_i - x_j|$?
Multi-Dimensional Scaling - The trick
• Choose the coordinate system so that $x_1 = 0$.
• Square the distances:
$$t_{ij} = d_{ij}^2 = (x_i - x_j)^T (x_i - x_j) = x_i^T x_i + x_j^T x_j - 2 x_i^T x_j.$$
• With $x_1 = 0$, $t_{1i} = x_i^T x_i$ and $t_{1j} = x_j^T x_j$.
• Form a new matrix with entries
$$s_{ij} = -\tfrac{1}{2}(t_{ij} - t_{1i} - t_{1j}) = x_i^T x_j,$$
$$S = \begin{pmatrix} x_2^T \\ x_3^T \\ \vdots \\ x_n^T \end{pmatrix} \begin{pmatrix} x_2 & x_3 & \cdots & x_n \end{pmatrix}.$$
Multi-Dimensional Scaling
• Use the singular value decomposition to calculate $X$ from $S$ (see the previous slide):
$$S = \begin{pmatrix} x_2^T \\ x_3^T \\ \vdots \\ x_n^T \end{pmatrix} \begin{pmatrix} x_2 & x_3 & \cdots & x_n \end{pmatrix} = X^T X.$$
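A sketch (my own, not from the slides) of this reconstruction: build $S$ from the squared distances, then factor the Gram matrix $S = X^T X$ with the SVD as in the symmetric-factorization trick above. The embedding dimension is an assumption.

```python
import numpy as np

def classical_mds(D, dim=2):
    # D: n x n matrix of pairwise distances d_ij = |x_i - x_j|
    T = D ** 2
    # s_ij = -(t_ij - t_1i - t_1j) / 2, using point 1 (index 0) as the origin
    S = -0.5 * (T - T[0:1, :] - T[:, 0:1])
    U, s, _ = np.linalg.svd(S)                 # S = X^T X is symmetric PSD
    return U[:, :dim] * np.sqrt(s[:dim])       # one reconstructed point per row

pts = np.random.default_rng(0).normal(size=(10, 2))
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
rec = classical_mds(D, dim=2)
# the reconstruction preserves all pairwise distances (up to a rigid motion)
print(np.allclose(np.linalg.norm(rec[:, None] - rec[None, :], axis=2), D))
```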
MDS example
Non-linear dimensionality reduction
• Many different methods, see http://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction
• Examples:
  – Kernel PCA
  – Locally linear embedding
  – ISOMAP
• Many similarities between the methods.
ISOMAP
• Idea (illustrated on the blackboard):
  – For each point, choose the k nearest neighbours.
  – Form a weighted graph using the distances to the k nearest neighbours.
  – Calculate a distance matrix D containing all distances $d_{ij}$ between pairs of feature vectors, using the shortest path distance in the graph.
  – Use multi-dimensional scaling to embed the points in e.g. $\mathbb{R}^2$.
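A compact sketch (not from the lecture) of these steps, using scikit-learn's neighbour graph, scipy's shortest-path routine, and the classical_mds function from the MDS sketch above; k and the data are placeholder choices.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                                # toy high-dimensional data

# steps 1-2: weighted k-nearest-neighbour graph (edge weights = Euclidean distances)
G = kneighbors_graph(X, n_neighbors=10, mode="distance")

# step 3: all-pairs shortest-path (geodesic) distance matrix
D = shortest_path(G, directed=False)

# step 4: embed in R^2 with multi-dimensional scaling (classical_mds defined earlier)
Y = classical_mds(D, dim=2)
print(Y.shape)                                               # (100, 2)
```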
Summary
• Machine learning topics, Bayes theorem again
• Classification
  – Logistic regression: classification where parameter estimation becomes a convex optimization problem
  – Regression trees
• Novelty detection
• Visualization, dimensionality reduction
  – The many applications of SVD
  – Multi-dimensional scaling
  – ISOMAP – non-linear dimensionality reduction