

CMSE 890-002: Mathematics of Deep Learning, MSU, Spring 2020

Lecture 05: MAP & k-Nearest Neighbors
January 22, 2020

Lecturer: Matthew Hirn

3.2 Bayes and maximum a posteriori

Recall from Section 1.3 we defined maximum likelihood estimation (MLE) as:

\[
\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta \in \mathbb{R}^n} p_{X,Y}(\mathcal{T} \mid \theta) = \arg\max_{\theta \in \mathbb{R}^n} p(\mathcal{T} \mid \theta),
\]

where we recall $\mathcal{T} = \{(x_i, y_i)\}_{i=1}^N$ is our training set. In other words, we maximized the probability of observing the particular training set $\mathcal{T}$ that we have, given the model parameters $\theta$. On the other hand, a more intuitive and perhaps useful object might be $p(\theta \mid \mathcal{T})$, the probability of the model $\theta$ being correct given the training set $\mathcal{T}$. Indeed, this is how we generally think of machine learning. We are given a training set $\mathcal{T}$ and we pick a model $\theta$ that has the highest chance of describing the data. This is called the maximum a posteriori estimator:

\[
\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta \in \mathbb{R}^n} p(\theta \mid \mathcal{T}). \tag{8}
\]

The question then becomes: how do we compute $\hat{\theta}_{\mathrm{MAP}}$? For this, we can use Bayes' Theorem:

\[
p(\theta \mid \mathcal{T}) \propto p(\mathcal{T} \mid \theta)\, p(\theta).
\]

Notice that the right-hand side contains the term $p(\mathcal{T} \mid \theta)$, which is what we used to compute the MLE model. But it also contains another, new term, $p(\theta)$. What is this term? It is referred to as the prior distribution on the parameters $\theta$. It encodes any prior knowledge we have about our model, or behavior that we want to build into our model. For example, we can place a Gaussian/normal prior on $\theta$, meaning that we assume each parameter $\theta(k)$ is distributed according to the normal distribution with mean zero and variance proportional to $\lambda^{-1}$:

\[
p(\theta) = p(\theta \mid \lambda) = \prod_{k=1}^{n} \sqrt{\frac{\lambda}{2\pi}}\, e^{-\lambda |\theta(k)|^2} \quad \text{(normal prior)}.
\]

We can also use a Laplace prior:

\[
p(\theta) = p(\theta \mid \lambda) = \prod_{k=1}^{n} \frac{\lambda}{2}\, e^{-\lambda |\theta(k)|} \quad \text{(Laplace prior)}.
\]
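To preview the role these priors will play, here is a minimal numerical sketch (not part of the notes; it assumes NumPy, and the function names and values are illustrative) showing that the negative log of each prior reduces, up to constants that do not depend on $\theta$, to the penalties $\lambda \sum_k |\theta(k)|^2$ and $\lambda \sum_k |\theta(k)|$:

```python
import numpy as np

def neg_log_normal_prior(theta, lam):
    """-log of the normal prior: prod_k sqrt(lam/(2*pi)) * exp(-lam * |theta(k)|^2)."""
    n = theta.size
    return lam * np.sum(np.abs(theta) ** 2) - 0.5 * n * np.log(lam / (2 * np.pi))

def neg_log_laplace_prior(theta, lam):
    """-log of the Laplace prior: prod_k (lam/2) * exp(-lam * |theta(k)|)."""
    n = theta.size
    return lam * np.sum(np.abs(theta)) - n * np.log(lam / 2)

theta = np.array([0.5, -1.0, 2.0])
lam = 0.1

# Up to theta-independent constants these match the squared l2 and l1 penalties.
print(neg_log_normal_prior(theta, lam) - lam * np.sum(theta ** 2))      # constant in theta
print(neg_log_laplace_prior(theta, lam) - lam * np.sum(np.abs(theta)))  # constant in theta
```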


Let us now see how the $\hat{\theta}_{\mathrm{MAP}}$ model is related to regularized functional models, and in particular ridge regression for the normal prior and LASSO for the Laplace prior. Using (8) we have:

\[
\begin{aligned}
\hat{\theta}_{\mathrm{MAP}} &= \arg\max_{\theta \in \mathbb{R}^n} p(\theta \mid \mathcal{T}) \\
&= \arg\max_{\theta \in \mathbb{R}^n} p(\mathcal{T} \mid \theta)\, p(\theta) \\
&= \arg\max_{\theta \in \mathbb{R}^n} \left( \log p(\mathcal{T} \mid \theta) + \log p(\theta) \right).
\end{aligned}
\]

Here we used the fact that the logarithm is increasing, so taking logs does not change the maximizer. Recall from (3) that if the $\{x_i\}_{i=1}^N$ are sampled i.i.d. (independently and identically distributed) from $X$, then we can decompose $\log p(\mathcal{T} \mid \theta)$ as:

\[
\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta \in \mathbb{R}^n} \left( \sum_{i=1}^{N} \log p_X(x_i \mid \theta) + \sum_{i=1}^{N} \log p_{Y \mid X}(y_i \mid x_i, \theta) + \log p(\theta) \right).
\]

Furthermore, in the case of supervised learning we will often discard the first term involving $p_X$, and, as we saw in Section 2.2, taking

\[
p_{Y \mid X}(y \mid x, \theta) = e^{-|y - f(x;\theta)|^2/(2\sigma^2)},
\]

is a reasonable choice. We then have:

\[
\hat{\theta}_{\mathrm{MAP}} = \arg\min_{\theta \in \mathbb{R}^n} \left( \frac{1}{2\sigma^2} \sum_{i=1}^{N} |y_i - f(x_i;\theta)|^2 - \log p(\theta) \right).
\]

We thus see that $-\log p(\theta)$ serves as the regularization term on the parameters $\theta$. In particular, if we use the Gaussian/normal prior, we obtain:

\[
\hat{\theta}_{\mathrm{MAP}} = \arg\min_{\theta \in \mathbb{R}^n} \left( \frac{1}{2\sigma^2} \sum_{i=1}^{N} |y_i - f(x_i;\theta)|^2 + \lambda \sum_{k=1}^{n} |\theta(k)|^2 \right),
\]

which is precisely ridge regression if $f(x; \theta) = \langle (1, x), \theta \rangle$. Similarly, if we use the Laplace prior, we obtain LASSO.
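As a sanity check of this equivalence, here is a minimal sketch (not from the notes; it assumes NumPy and synthetic data, and the names, values, and the plain gradient-descent solver are all illustrative choices) that minimizes the MAP objective with a Gaussian prior and compares the result to the closed-form ridge solution of the same objective:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data y = <(1, x), theta_true> + noise (illustrative sizes and values).
N, d = 200, 3
X = rng.normal(size=(N, d))
X1 = np.hstack([np.ones((N, 1)), X])      # the feature map x -> (1, x)
theta_true = rng.normal(size=d + 1)
sigma, lam = 0.5, 1.0
y = X1 @ theta_true + sigma * rng.normal(size=N)

def map_gradient(theta):
    """Gradient of (1/(2*sigma^2)) * sum_i |y_i - <(1,x_i), theta>|^2 + lam * sum_k |theta(k)|^2."""
    residual = X1 @ theta - y
    return (X1.T @ residual) / sigma ** 2 + 2 * lam * theta

# Minimize the MAP objective by plain gradient descent.
theta_map = np.zeros(d + 1)
step = 5e-4
for _ in range(5000):
    theta_map -= step * map_gradient(theta_map)

# Closed-form ridge solution of the same objective:
# minimize ||X1 theta - y||^2 + 2*sigma^2*lam*||theta||^2.
theta_ridge = np.linalg.solve(X1.T @ X1 + 2 * sigma ** 2 * lam * np.eye(d + 1), X1.T @ y)

print(np.max(np.abs(theta_map - theta_ridge)))   # should be near machine precision
```

Since both computations minimize the same objective, the printed difference should be essentially zero.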

3.3 k-Nearest neighbors and local methods

Figure 7c illustrates the need for a nonlinear classifier, capable of learning a nonlinear decision boundary as opposed to a straight line. A polynomial method could do better, but this still imposes a modeling assumption on the underlying data generation process. In order to avoid such assumptions, we instead turn to another type of method, k-nearest neighbors, which is a local method.


[Figure 10: Two-dimensional data with two classes, orange (+1) and blue (-1). Three different classifiers and their respective decision boundaries: (a) linear regression, (b) k-nearest neighbors with k = 1, (c) k-nearest neighbors with k = 15. Figure taken from [6].]

Let $N_k(x)$ denote the $k$ nearest neighbors of $x$, in the Euclidean distance, from the set of training points $\{x_1, \ldots, x_N\} \subset \mathbb{R}^d$. The $k$-nearest neighbor model is:

\[
f(x; k) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i.
\]
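A minimal implementation sketch of this model (not the notes' code; it assumes NumPy, and the data and function names are illustrative) computes the Euclidean distances to the training points, averages the k nearest labels, and thresholds at zero to assign a class in {+1, -1}:

```python
import numpy as np

def knn_predict(x, X_train, y_train, k):
    """The model f(x; k): average the labels of the k Euclidean nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distances to every training point
    nearest = np.argsort(dists)[:k]               # indices of the neighborhood N_k(x)
    return np.mean(y_train[nearest])

def knn_classify(x, X_train, y_train, k):
    """Threshold f(x; k) at zero to decide between the classes +1 and -1."""
    return 1 if knn_predict(x, X_train, y_train, k) >= 0 else -1

# Illustrative two-class data in R^2 with labels in {+1, -1}.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(+1.0, 1.0, size=(50, 2)),
                     rng.normal(-1.0, 1.0, size=(50, 2))])
y_train = np.concatenate([np.ones(50), -np.ones(50)])

print(knn_classify(np.array([0.5, 0.5]), X_train, y_train, k=15))
```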

Figure 10 illustrates the model for k = 1 and k = 15, and compares to the linear classifier from Figure 7c. In the figure we see that the k-nearest neighbors classifiers make significantly fewer mis-classifications on the training set than the linear regression classifier. However, classification rate on the training set is not the sole criterion for the quality of a model, as we saw in our discussion on regularization. Indeed, models must generalize well to unseen points. The k = 1 nearest neighbor model in fact makes no mis-classifications on the training set, since it simply assigns to each training point its own label. But the decision boundary of the 1-nearest neighbor classifier is extremely irregular and likely to make mistakes on test data. The k = 15 nearest neighbor classifier, on the other hand, has a semi-regular boundary that is likely to generalize better than either the 1-nearest neighbor classifier (which is too rough) or the linear regression classifier (which is too smooth). This discussion is hinting at the bias-variance tradeoff in machine learning, which we will make more precise in Section 4. It also illustrates that the k = 1 model is, in some sense, more complex than the k = 15 model. In the extreme case, k = N, all points are classified the same, and so this is the simplest model. In fact, even though there is only one parameter, k, that we set, the effective number of parameters or complexity of the k-nearest neighbor model is N/k.

Note that because the 1-nearest neighbor classifier will always make zero errors on the training set, we require a different method by which to select k. We instead use cross-validation, which requires a validation set separate from the training set, but for which we still know the correct labels. In fact, many hyper-parameters, including k in k-nearest neighbors and $\lambda$ in regularized linear models, are selected through cross-validation.


[Figure 11: k-nearest neighbors classifier for k decreasing from left to right, evaluated on the training set and the test (validation) set. The models with k ≈ 10 perform the best on the test set, even though the training error essentially decreases as k → 1.]

The way it works for k-nearest neighbors is that we use the original training set $\mathcal{T}$ to define the neighborhoods of each point and the model $f(x; k)$, but we then evaluate this model on the validation set and compute the average error on the validation set. The k with the lowest average error on the validation set is the model we use on the test set. See Figure 11 for an illustration using k-nearest neighbors. The more general principle, which includes k-nearest neighbors and regularized linear regression, is based on the bias-variance tradeoff, which will be discussed in the next section.
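A minimal sketch of this selection procedure (not from the notes; it assumes NumPy, and the split, candidate values of k, and names are illustrative): the neighborhoods always come from the training set, while the average 0-1 error used to choose k is computed on the held-out validation set.

```python
import numpy as np

def validation_error(X_train, y_train, X_val, y_val, k):
    """Average 0-1 error on the validation set of the k-NN classifier built from the training set."""
    errors = 0
    for x, y in zip(X_val, y_val):
        dists = np.linalg.norm(X_train - x, axis=1)
        nearest = np.argsort(dists)[:k]
        pred = 1 if np.mean(y_train[nearest]) >= 0 else -1
        errors += int(pred != y)
    return errors / len(y_val)

# Illustrative data and split: T defines the neighborhoods, the validation set picks k.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+1.0, 1.0, size=(100, 2)),
               rng.normal(-1.0, 1.0, size=(100, 2))])
y = np.concatenate([np.ones(100), -np.ones(100)])
perm = rng.permutation(len(y))
train, val = perm[:150], perm[150:]

candidate_ks = [1, 3, 5, 9, 15, 25, 51]
errs = [validation_error(X[train], y[train], X[val], y[val], k) for k in candidate_ks]
best_k = candidate_ks[int(np.argmin(errs))]
print(best_k, dict(zip(candidate_ks, errs)))
```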


References

[1] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations, 2015.

[2] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Conference on Computer Vision and Pattern Recognition, 2009.

[3] Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1097-1105, 2012.

[4] Larry Greenemeier. AI versus AI: Self-taught AlphaGo Zero vanquishes its predecessor. Scientific American, October 18, 2017.

[5] Pankaj Mehta, Marin Bukov, Ching-Hao Wang, Alexandre G. R. Day, Clint Richardson, Charles K. Fisher, and David J. Schwab. A high-bias, low-variance introduction to Machine Learning for physicists. arXiv:1803.08823, 2018.

[6] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer-Verlag New York, 2nd edition, 2009.