[Slide 1]
Introduction to machine learning
Recitation 2
MIT 6.802 / 6.874 / 20.390 / 20.490 / HST.506, Spring 2019
Sachit Saksena
[Slide 2]
Recap of recitation 1
Google Colaboratory setup
Introduction to Tensorflow
Setting up Google Cloud with Datalab
Everyone set up for PS1?
Local conda and jupyter notebook setup
[Slide 3]
Basics of machine learning: steps
I. Get data
II. Identify the space of possible solutions
III. Formulate an objective
IV. Choose algorithm
V. Train (loss)
VI. Validate results (metrics)
[Slide 4]
Basics of machine learning: tasks
[Slide 5]
Basics of machine learning: loss
Task | Loss
Regression (penalize large errors) | Squared error (MSE)
Regression (penalize error linearly) | Absolute error (MAE)
Classification (binary) | Binary cross-entropy
Classification (multi-class) | Categorical cross-entropy
Generative | Model-dependent (e.g., negative log-likelihood)
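The losses in the table can be written out directly (a minimal numpy sketch; the function names are mine):

```python
import numpy as np

def mse(y_true, y_pred):
    # Squared error: penalizes large errors quadratically
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    # Absolute error: penalizes error linearly
    return np.mean(np.abs(y_true - y_pred))

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    # y_true in {0, 1}; p_pred is the predicted probability of class 1
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def categorical_cross_entropy(y_onehot, p_pred, eps=1e-12):
    # y_onehot: (n, k) one-hot labels; p_pred: (n, k) class probabilities
    p = np.clip(p_pred, eps, 1.0)
    return -np.mean(np.sum(y_onehot * np.log(p), axis=1))
```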
[Slide 6]
Basics of machine learning: metrics
[Slide 7]
Basics of machine learning: metrics
[Slide 8]
Basics of machine learning: ROC
https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5
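The ROC curve sweeps a decision threshold over the classifier's scores and plots TPR against FPR; AUC is the area underneath. A minimal numpy sketch (function names are mine):

```python
import numpy as np

def roc_curve_points(scores, labels):
    # Sort examples by descending score; walking down the list is equivalent
    # to lowering the threshold. TPR = TP / P, FPR = FP / N at each step.
    scores, labels = np.asarray(scores), np.asarray(labels)
    order = np.argsort(-scores)
    labels = labels[order]
    P = labels.sum()
    N = len(labels) - P
    tpr = np.concatenate([[0.0], np.cumsum(labels) / P])
    fpr = np.concatenate([[0.0], np.cumsum(1 - labels) / N])
    return fpr, tpr

def auc(fpr, tpr):
    # Trapezoidal area under the ROC curve
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
```

A classifier that ranks every positive above every negative gets AUC = 1; random scores give AUC ≈ 0.5.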
[Slide 9]
Basics of machine learning: data
Train—test split (1-fold)
Cross-validation (6-fold)
https://towardsdatascience.com/cross-validation-code-visualization-kind-of-fun-b9741baea1f8
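Both splitting schemes can be sketched in a few lines of numpy (the helper names are mine; the first mirrors the scikit-learn helper of the same name):

```python
import numpy as np

def train_test_split(X, y, test_frac=0.2, seed=0):
    # Shuffle indices once, then carve off test_frac of them as the test set
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_frac)
    test, train = idx[:n_test], idx[n_test:]
    return X[train], X[test], y[train], y[test]

def k_fold_indices(n, k=6):
    # Yield (train_idx, val_idx) pairs: each fold is the validation set once
    folds = np.array_split(np.arange(n), k)
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, val
```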
[Slides 10-11 repeat slide 9]
[Slide 12]
Basics of machine learning: non-parametric models
D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}

function knn(k, dist, D_train, D_test):
    votes = []
    for i = 1 to len(D_test):
        d = []
        for j = 1 to len(D_train):
            d[j] = dist(x_i, x_j)
        nearest = argsort(d)[1..k]
        votes[i] = most_common(labels[nearest])
    return votes
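The pseudocode above runs as-is once translated to Python (a minimal sketch; any distance function can be passed in):

```python
import numpy as np
from collections import Counter

def knn(k, dist, X_train, y_train, X_test):
    # For each test point, vote among the labels of its k nearest neighbors
    preds = []
    for x in X_test:
        d = [dist(x, xj) for xj in X_train]
        nearest = np.argsort(d)[:k]
        preds.append(Counter(y_train[i] for i in nearest).most_common(1)[0][0])
    return preds

euclid = lambda a, b: np.linalg.norm(a - b)
```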
[Slide 13]
Basics of machine learning: semi-parametric models
The idea here is that we would like to find a partition of the input space and then fit very simple models to predict the output in each piece. The partition is described using a (typically binary) "decision tree," which recursively splits the space.
[Slide 14]
Basics of machine learning: implementations
k-Nearest Neighbors (kNN)
Classification and regression decision trees (CART)
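Both model families are available off the shelf; a minimal sketch with scikit-learn (assuming it is installed; the toy data is made up):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Tiny toy problem: the label is just the first coordinate
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 1, 1]

knn = KNeighborsClassifier(n_neighbors=1).fit(X, y)
cart = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

print(knn.predict([[0.9, 0.9]]))   # nearest neighbor is (1, 1), class 1
print(cart.predict([[0.1, 0.2]]))  # falls in the x1 < 0.5 region, class 0
```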
[Slide 15]
Neural networks: perceptrons to neurons
[Slide 16]
Basics of machine learning: feature representations
Polynomial basis
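A concrete feature map: lift a scalar input to powers of itself, so that a linear model in the new features fits a polynomial in the original one (a sketch; `polynomial_basis` is my own helper name):

```python
import numpy as np

def polynomial_basis(x, degree):
    # Map each scalar x to the feature vector [1, x, x^2, ..., x^degree]
    x = np.asarray(x, dtype=float)
    return np.stack([x ** d for d in range(degree + 1)], axis=-1)

phi = polynomial_basis([1.0, 2.0], degree=3)
# rows are [1, x, x^2, x^3] for x = 1 and x = 2
```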
[Slide 17]
Neural networks: perceptrons to neurons
[Slide 18]
Neural networks: activation functions
Step function
Sigmoid
Rectified linear unit
Hyperbolic tangent
Softmax
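The five activations listed above, in numpy (a minimal sketch):

```python
import numpy as np

def step(z):
    # Heaviside step: 1 where z > 0, else 0 (the original perceptron)
    return (z > 0).astype(float)

def sigmoid(z):
    # Squashes to (0, 1); useful as a binary-class output
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # Rectified linear unit: max(0, z)
    return np.maximum(0.0, z)

def tanh(z):
    # Squashes to (-1, 1)
    return np.tanh(z)

def softmax(z):
    # Subtract the row max for numerical stability; rows sum to 1
    e = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```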
[Slide 19]
Neural networks: activation functions
Task | Loss | Activation (output layer)
Regression (penalize large errors) | Squared error (MSE) | Linear (ReLU, Leaky ReLU, etc.)
Regression (penalize error linearly) | Absolute error (MAE) | Linear (ReLU, Leaky ReLU, etc.)
Classification (binary) | Binary cross-entropy | Sigmoid, tanh
Classification (multi-class) | Categorical cross-entropy | Softmax
Generative | Model-dependent | Linear (ReLU, Leaky ReLU, etc.)
Other considerations: gradient magnitude, computational cost of the activation, exploding/vanishing gradients, depth of the network (stacking purely linear layers collapses to a single linear map, so nonlinearity is essential)
[Slide 20]
Neural networks: single layer feed-forward NN
[Slide 21]
Optimization: batch gradient update
function gradient_update(W_init, η, J, ε):
    W_0 = W_init
    t = 0
    repeat:
        t = t + 1
        W_t = W_{t-1} − η ∇_W J(W_{t-1})
    until |J(W_t) − J(W_{t-1})| < ε
    return W_t
Gradient of objective J with respect to parameter vector W
Pseudocode for gradient update algorithm
Batch gradient update
This is an arbitrary stopping criterion
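The pseudocode above, instantiated in numpy for a least-squares objective (the data, step size, and tolerance are made up for illustration):

```python
import numpy as np

def gradient_update(W_init, eta, J, grad_J, eps=1e-8, max_iter=10_000):
    # Batch gradient descent: step along -grad until J stops decreasing by eps
    W = W_init
    prev = np.inf
    for _ in range(max_iter):
        if abs(prev - J(W)) <= eps:
            break
        prev = J(W)
        W = W - eta * grad_J(W)
    return W

# Least-squares example: J(W) = mean((XW - y)^2), minimized at W = [1, 2]
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
J = lambda W: np.mean((X @ W - y) ** 2)
grad_J = lambda W: 2 * X.T @ (X @ W - y) / len(y)
W = gradient_update(np.zeros(2), eta=0.1, J=J, grad_J=grad_J)
```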
[Slide 22]
Neural networks: multi-layer feed-forward NN
[Figure: forward pass. The input X enters layer 1; each layer l computes Z_l = W_l A_{l-1} + b_l followed by A_l = f_l(Z_l), for layers 1 through L; the final activation A_L is compared against the target y by the loss.]
Inspired by 6.036 lecture notes (Leslie Kaelbling)
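The diagram translates directly into code (a sketch; the 2-3-1 architecture and random weights are made up for illustration):

```python
import numpy as np

def forward(X, params, activations):
    # params: list of (W_l, b_l); activations: list of f_l, one per layer.
    # Each layer computes Z_l = W_l @ A_{l-1} + b_l, then A_l = f_l(Z_l).
    A = X
    for (W, b), f in zip(params, activations):
        Z = W @ A + b
        A = f(Z)
    return A

relu = lambda z: np.maximum(0.0, z)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Hypothetical 2 -> 3 -> 1 network on a single input column
rng = np.random.default_rng(0)
params = [(rng.normal(size=(3, 2)), np.zeros((3, 1))),
          (rng.normal(size=(1, 3)), np.zeros((1, 1)))]
X = np.array([[0.5], [-0.2]])
A_L = forward(X, params, [relu, sigmoid])  # final activation, shape (1, 1)
```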
[Slide 23]
Neural networks: error backpropagation
[Figure: the same L-layer network, with gradients of the loss propagated backward from y through A_L, Z_L, ..., A_1, Z_1.]
Inspired by 6.036 lecture notes (Leslie Kaelbling)
Let's work out error backprop on the board!
[Slide 24]
Neural networks: error backpropagation
So let’s use the following shorthand from the previous figure,
First, let’s break down how the loss depends on the final layer,
Since,
We can re-write the equation as,
Now, to propagate through the whole network, we can keep applying the chain rule until the first layer of the network,
If you spend a few minutes looking at matrix dimensions, it becomes clear that this is an informal derivation. Here are the dimensions to think about:
Since we have the outputs of every layer, all we need to compute for the gradient of the last layer with respect to the weights is the gradient of the loss with respect to the pre-activation output.
The equation with the correct dimensions for matrix multiplication,
Inspired by 6.036 lecture notes (Leslie Kaelbling)
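The equations sketched in words above can be reconstructed in standard notation (my reconstruction, following the figure's shorthand Z_l = W_l A_{l-1} + b_l, A_l = f_l(Z_l); the dimension conventions below are my assumption):

```latex
% Shorthand from the figure: A^0 = X, \quad Z^\ell = W^\ell A^{\ell-1} + b^\ell, \quad A^\ell = f^\ell(Z^\ell)
% How the loss depends on the final layer (chain rule):
\frac{\partial \mathcal{L}}{\partial W^L}
  = \frac{\partial \mathcal{L}}{\partial A^L}
    \frac{\partial A^L}{\partial Z^L}
    \frac{\partial Z^L}{\partial W^L}
% Since Z^L = W^L A^{L-1} + b^L, the last factor is A^{L-1}, so with the correct
% dimensions for matrix multiplication:
\frac{\partial \mathcal{L}}{\partial W^L}
  = \frac{\partial \mathcal{L}}{\partial Z^L}\,(A^{L-1})^{\top},
\qquad
\frac{\partial \mathcal{L}}{\partial Z^L}
  = f^{L\,\prime}(Z^L) \odot \frac{\partial \mathcal{L}}{\partial A^L}
% Keep applying the chain rule back to the first layer (\ell = L, \dots, 1):
\frac{\partial \mathcal{L}}{\partial A^{\ell-1}}
  = (W^\ell)^{\top} \frac{\partial \mathcal{L}}{\partial Z^\ell},
\qquad
\frac{\partial \mathcal{L}}{\partial W^\ell}
  = \frac{\partial \mathcal{L}}{\partial Z^\ell}\,(A^{\ell-1})^{\top},
\qquad
\frac{\partial \mathcal{L}}{\partial b^\ell}
  = \frac{\partial \mathcal{L}}{\partial Z^\ell}
% Dimensions to check (n_\ell units in layer \ell):
% W^\ell \in \mathbb{R}^{n_\ell \times n_{\ell-1}}, \quad
% Z^\ell, A^\ell, \partial\mathcal{L}/\partial Z^\ell \in \mathbb{R}^{n_\ell}, \quad
% so \partial\mathcal{L}/\partial W^\ell \in \mathbb{R}^{n_\ell \times n_{\ell-1}} as required.
```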
[Slide 25]
Neural networks: automatic differentiation
https://www.robots.ox.ac.uk/~tvg/publications/talks/autodiff.pdf
End result
Augment a standard Taylor series (as used in numerical differentiation) with a "dual number",
Forward automatic differentiation
Because ”dual numbers” have the (manufactured) property,
The Taylor series simplifies to,
Which recovers the function output as well as the first derivative.
Dual numbers
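The dual-number trick: represent a + b·ε with ε² = 0, so f(a + ε) = f(a) + f'(a)·ε and the Taylor series truncates after the first-order term. A minimal sketch of forward-mode autodiff built on this property (class and function names are mine):

```python
class Dual:
    """Dual number a + b*eps with eps**2 == 0; b carries the derivative."""
    def __init__(self, a, b=0.0):
        self.a, self.b = a, b

    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.a + o.a, self.b + o.b)
    __radd__ = __add__

    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        # (a1 + b1 eps)(a2 + b2 eps) = a1 a2 + (a1 b2 + b1 a2) eps, since eps^2 = 0
        return Dual(self.a * o.a, self.a * o.b + self.b * o.a)
    __rmul__ = __mul__

def derivative(f, x):
    # Seed the dual part with 1, then read the derivative off the output
    out = f(Dual(x, 1.0))
    return out.a, out.b

f = lambda x: 3 * x * x + 2 * x + 1  # f'(x) = 6x + 2
value, deriv = derivative(f, 2.0)     # value = 17.0, deriv = 14.0
```

One forward pass recovers the function value and the first derivative together, with no finite-difference truncation error.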
[Slide 26]
Neural networks: regularization
Objective function
Objective function with ridge regularization (penalizes large weights)
Dropout
Popular new approach: Batch normalization
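A sketch of the first two ideas in numpy (the penalty strength λ and dropout rate p are arbitrary illustrative values):

```python
import numpy as np

def objective(W, X, y):
    # Plain least-squares objective
    return np.mean((X @ W - y) ** 2)

def ridge_objective(W, X, y, lam=0.1):
    # Ridge: add an L2 penalty that shrinks large weights toward zero
    return objective(W, X, y) + lam * np.sum(W ** 2)

def dropout(A, p=0.5, rng=None):
    # Training-time (inverted) dropout: zero each activation with prob p,
    # and rescale the survivors so the expected activation is unchanged
    rng = rng or np.random.default_rng(0)
    mask = rng.random(A.shape) >= p
    return A * mask / (1.0 - p)
```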
[Slide 27]
Optimization: overview
[Slide 28]
Optimization: stochastic gradient descent
Stochastic gradient update (per randomly sampled training example):
Pseudocode for stochastic gradient update algorithm
function sgd(W_init, η, J, T):
    W_0 = W_init
    for t = 1 to T:
        randomly select i ∈ {1, 2, ..., n}
        W_t = W_{t-1} − η(t) ∇_W J_i(W_{t-1})
    return W_T
Mini-batch gradient descent
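The update above in numpy; setting batch_size > 1 gives mini-batch gradient descent (a sketch: the data is made up, and a constant learning rate stands in for the schedule η(t)):

```python
import numpy as np

def sgd(W_init, eta, grad_Ji, n, T, batch_size=1, seed=0):
    # Stochastic / mini-batch gradient descent: at each step, estimate the
    # gradient from a random subset of examples instead of the full batch
    rng = np.random.default_rng(seed)
    W = W_init
    for t in range(1, T + 1):
        batch = rng.choice(n, size=batch_size, replace=False)
        W = W - eta * grad_Ji(W, batch)
    return W

# Least-squares example with true weights [1, 2] (noiseless, so SGD converges)
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = X @ np.array([1.0, 2.0])
grad_Ji = lambda W, b: 2 * X[b].T @ (X[b] @ W - y[b]) / len(b)
W = sgd(np.zeros(2), eta=0.05, grad_Ji=grad_Ji, n=len(y), T=5000, batch_size=2)
```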
[Slide 29]
Optimization: momentum
Goh, "Why Momentum Really Works", Distill, 2017. http://doi.org/10.23915/distill.00006
Nesterov momentum? How can we make momentum better?
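The classical momentum update, with a Nesterov-style lookahead as one answer to the question above (a sketch; momentum parameterizations vary across papers and libraries, and this is one common form):

```python
import numpy as np

def momentum_gd(W_init, eta, beta, grad_J, T, nesterov=False):
    # Classical momentum accumulates a velocity v; Nesterov evaluates the
    # gradient at the looked-ahead point W - eta*beta*v instead of at W
    W, v = W_init, np.zeros_like(W_init)
    for _ in range(T):
        g = grad_J(W - eta * beta * v) if nesterov else grad_J(W)
        v = beta * v + g
        W = W - eta * v
    return W

# Quadratic bowl J(W) = ||W||^2, so grad_J(W) = 2W; both variants spiral
# in toward the minimum at the origin
W = momentum_gd(np.array([1.0, 1.0]), eta=0.1, beta=0.9,
                grad_J=lambda w: 2 * w, T=300)
```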
[Slide 31]
Next week
Neural network interpretability
Convolutional neural networks
Recurrent neural networks
Batch normalization