Optimization for ML · 2020-03-05
TRANSCRIPT
Optimization for ML

10-601 Introduction to Machine Learning
Matt Gormley, Lecture 7
Feb. 7, 2018
Machine Learning Department, School of Computer Science, Carnegie Mellon University
Reminders
• Homework 3: KNN, Perceptron, Linear Regression
  – Out: Wed, Feb 7
  – Due: Wed, Feb 14 at 11:59pm
OPTIMIZATION FOR LINEAR REGRESSION
Regression
Example Applications:
• Stock price prediction
• Forecasting epidemics
• Speech synthesis
• Generation of images (e.g. Deep Dream)
• Predicting the number of tourists on Machu Picchu on a given day
[Slide shows an excerpt from "Flexible Modeling of Epidemics with an Empirical Bayes Framework" (PLOS Computational Biology, doi:10.1371/journal.pcbi.1004382), including Fig 2: 2013–2014 national wILI forecasts, made retrospectively using revised wILI data through epidemiological weeks (A) 47, (B) 51, (C) 1, and (D) 7.]
Optimization for Linear Regression
Whiteboard
– Closed-form (Normal Equations)
  • Computational complexity
  • Stability
– Gradient Descent for Linear Regression
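The closed-form route mentioned above can be sketched in a few lines of NumPy. This is not from the slides; function and variable names are illustrative. It solves the normal equations XᵀXθ = Xᵀy with a linear solver rather than forming an explicit inverse, which is cheaper and more numerically stable:

```python
import numpy as np

def linreg_closed_form(X, y):
    """Closed-form linear regression via the normal equations.

    Solves X^T X theta = X^T y. Cost is roughly O(N K^2 + K^3)
    for N examples and K features, so it becomes expensive as
    the feature dimension grows.
    """
    return np.linalg.solve(X.T @ X, X.T @ y)

# Noise-free data with y = 1 + 2x, using a leading bias feature.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([3.0, 5.0, 7.0])
theta = linreg_closed_form(X, y)  # recovers [1, 2] up to rounding
```

Using `np.linalg.solve` instead of `np.linalg.inv` is the standard stability choice when XᵀX is well conditioned.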
Topographical Maps
Gradients
Gradients
These are the gradients that Gradient Ascent would follow.
(Negative) Gradients
These are the negative gradients that Gradient Descent would follow.
(Negative) Gradient Paths
Shown are the paths that Gradient Descent would follow if it were making infinitesimally small steps.
Gradient Descent
Algorithm 1 Gradient Descent
1: procedure GD(D, θ(0))
2:   θ ← θ(0)
3:   while not converged do
4:     θ ← θ − λ∇θJ(θ)
5:   return θ

In order to apply GD to Linear Regression, all we need is the gradient of the objective function (i.e. the vector of partial derivatives):

∇θJ(θ) = [ dJ(θ)/dθ1, dJ(θ)/dθ2, …, dJ(θ)/dθN ]ᵀ
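The pseudocode above can be sketched in NumPy. The quadratic test objective, step size, and tolerance below are illustrative choices, not prescribed by the slides:

```python
import numpy as np

def gradient_descent(grad_J, theta0, lam=0.1, max_iter=1000, tol=1e-6):
    """Algorithm 1: step against the gradient until convergence.

    grad_J: function mapping theta to the gradient vector of J.
    lam:    step size (learning rate).
    Stops once the L2 norm of the gradient falls below tol.
    """
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(max_iter):
        g = grad_J(theta)
        if np.linalg.norm(g) <= tol:
            break
        theta = theta - lam * g
    return theta

# Minimize J(theta) = 1/2 ||theta - c||^2, whose gradient is theta - c,
# so the minimizer is theta = c.
c = np.array([1.0, 2.0])
theta = gradient_descent(lambda t: t - c, np.zeros(2))
```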
![Page 14: Optimization for ML · 2020-03-05 · Optimization for ML 1 10-601 Introduction to Machine Learning Matt Gormley Lecture 7 Feb. 7, 2018 Machine Learning Department School of Computer](https://reader030.vdocuments.net/reader030/viewer/2022041101/5eda8a4bfebf237c0c3b780e/html5/thumbnails/14.jpg)
Gradient Descent
Algorithm 1 Gradient Descent
1: procedure GD(D, θ(0))
2:   θ ← θ(0)
3:   while not converged do
4:     θ ← θ − λ∇θJ(θ)
5:   return θ

There are many possible ways to detect convergence. For example, we could check whether the L2 norm of the gradient is below some small tolerance:

||∇θJ(θ)||₂ ≤ ε

Alternatively, we could check that the reduction in the objective function from one iteration to the next is small.
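Both stopping rules can be wrapped in a small predicate. This is a sketch, not from the slides, and the tolerance values are illustrative:

```python
import numpy as np

def converged(grad, J_prev, J_curr, eps_grad=1e-6, eps_obj=1e-8):
    """Declare convergence when either (1) the L2 norm of the
    gradient is below a small tolerance, or (2) the objective
    changed by less than a small tolerance between iterations."""
    small_gradient = np.linalg.norm(grad) <= eps_grad
    small_decrease = abs(J_prev - J_curr) <= eps_obj
    return bool(small_gradient or small_decrease)

# Far from an optimum: large gradient, objective still dropping fast.
still_going = converged(np.array([1.0, 1.0]), J_prev=1.0, J_curr=0.5)
# At an optimum: the zero gradient triggers the first test.
done = converged(np.array([0.0, 0.0]), J_prev=0.5, J_curr=0.5)
```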
Pros and cons of gradient descent
• Simple and often quite effective on ML tasks
• Often very scalable
• Only applies to smooth (differentiable) functions
• Might find a local minimum, rather than a global one

Slide courtesy of William Cohen
Stochastic Gradient Descent (SGD)
Algorithm 2 Stochastic Gradient Descent (SGD)
1: procedure SGD(D, θ(0))
2:   θ ← θ(0)
3:   while not converged do
4:     for i ∈ shuffle({1, 2, …, N}) do
5:       θ ← θ − λ∇θJ^(i)(θ)
6:   return θ

We need a per-example objective:
Let J(θ) = Σ_{i=1}^N J^(i)(θ), where J^(i)(θ) = ½ (θᵀx^(i) − y^(i))².
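Algorithm 2 can be sketched for this per-example objective. The gradient of J^(i)(θ) = ½(θᵀx^(i) − y^(i))² is (θᵀx^(i) − y^(i)) x^(i); the step size, epoch count, and data below are illustrative, not from the slides:

```python
import numpy as np

def sgd_linreg(X, y, lam=0.05, epochs=500, seed=0):
    """Algorithm 2: SGD on the squared-error per-example objective.

    Each update uses the gradient of a single J^(i):
        grad J^(i)(theta) = (theta^T x^(i) - y^(i)) * x^(i)
    Examples are visited in a fresh random order each epoch.
    """
    rng = np.random.default_rng(seed)
    N, K = X.shape
    theta = np.zeros(K)
    for _ in range(epochs):
        for i in rng.permutation(N):          # shuffle({1, ..., N})
            residual = X[i] @ theta - y[i]
            theta = theta - lam * residual * X[i]
    return theta

# Noise-free data with y = 1 + 2x (leading bias feature).
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([3.0, 5.0, 7.0])
theta = sgd_linreg(X, y)
```

On this noise-free problem the iterates settle close to the exact solution [1, 2]; with noisy data and a fixed step size, SGD instead hovers in a neighborhood of the minimizer.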
Stochastic Gradient Descent (SGD)
We need a per-example objective:
Let J(θ) = Σ_{i=1}^N J^(i)(θ), where J^(i)(θ) = ½ (θᵀx^(i) − y^(i))².

Algorithm 2 Stochastic Gradient Descent (SGD)
1: procedure SGD(D, θ(0))
2:   θ ← θ(0)
3:   while not converged do
4:     for i ∈ shuffle({1, 2, …, N}) do
5:       for k ∈ {1, 2, …, K} do
6:         θk ← θk − λ (d/dθk) J^(i)(θ)
7:   return θ
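The per-coordinate update on line 6 needs d/dθk of J^(i); for the squared error this works out to (θᵀx^(i) − y^(i)) x_k^(i). A quick finite-difference sanity check of that formula, with illustrative values (not from the slides):

```python
import numpy as np

def partial_Ji(theta, x_i, y_i, k):
    """d/d theta_k of J^(i)(theta) = 1/2 (theta^T x^(i) - y^(i))^2,
    which equals (theta^T x^(i) - y^(i)) * x^(i)_k."""
    return (theta @ x_i - y_i) * x_i[k]

# Compare the analytic partial against a central finite difference.
theta = np.array([0.5, -1.0])
x_i, y_i, k = np.array([2.0, 3.0]), 1.0, 1
J = lambda t: 0.5 * (t @ x_i - y_i) ** 2
eps = 1e-6
e_k = np.zeros(2)
e_k[k] = 1.0
fd = (J(theta + eps * e_k) - J(theta - eps * e_k)) / (2 * eps)
analytic = partial_Ji(theta, x_i, y_i, k)  # the two should agree closely
```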
Stochastic Gradient Descent (SGD)
Whiteboard
– Expectations of gradients
– Algorithm
– Mini-batches
– Details: mini-batches, step size, stopping criterion
– Problematic cases for SGD
Convergence
Whiteboard
– Comparison of Newton's method, Gradient Descent, SGD
– Asymptotic convergence
– Convergence in practice
Convergence Curves
• SGD reduces MSE much more rapidly than GD
• For GD / SGD, training MSE is initially large due to uninformed initialization
• Def: an epoch is a single pass through the training data
  1. For GD, only one update per epoch
  2. For SGD, N updates per epoch, where N = (# train examples)

[Figure compares convergence curves for Gradient Descent, SGD, and the closed-form (normal equations) solution. Figure adapted from Eric P. Xing.]
Optimization Objectives
You should be able to…
• Apply gradient descent to optimize a function
• Apply stochastic gradient descent (SGD) to optimize a function
• Apply knowledge of zero derivatives to identify a closed-form solution (if one exists) to an optimization problem
• Distinguish between convex, concave, and nonconvex functions
• Obtain the gradient (and Hessian) of a (twice) differentiable function
Linear Regression Objectives
You should be able to…
• Design k-NN Regression and Decision Tree Regression
• Implement learning for Linear Regression using three optimization techniques: (1) closed form, (2) gradient descent, (3) stochastic gradient descent
• Choose a Linear Regression optimization technique that is appropriate for a particular dataset by analyzing the tradeoff of computational complexity vs. convergence speed
• Distinguish the three sources of error identified by the bias-variance decomposition: bias, variance, and irreducible error