Linear Classifiers - lxmls.it.pt/2020/LxMLS_2020_Martins_Lecture.pdf
TRANSCRIPT
Linear Classifiers
André Martins
Lisbon Machine Learning School, July 22, 2020
Andre Martins (IST) Linear Classifiers LxMLS 2020 1 / 157
Why Linear Classifiers?
It’s 2020 and everybody uses neural networks. Why a lecture on linear classifiers?
• The underlying machine learning concepts are the same
• The theory (statistics and optimization) is much better understood
• Linear classifiers are still widely used (and very effective when data is scarce)
• Linear classifiers are a component of neural networks.
Linear Classifiers and Neural Networks
Linear Classifier
Handcrafted Features
Today’s Roadmap
• Linear regression
• Binary and multi-class classification
• Linear classifiers: perceptron, naive Bayes, logistic regression, SVMs
• Softmax and sparsemax
• Regularization and optimization, stochastic gradient descent
• Similarity-based classifiers and kernels.
Andre Martins (IST) Linear Classifiers LxMLS 2020 4 / 157
![Page 8: Linear Classi ers - lxmls.it.ptlxmls.it.pt/2020/LxMLS_2020_Martins_Lecture.pdf · Linear Classi ers Andr e Martins Lisbon Machine Learning School, July 22, 2020 Andr e Martins (IST)](https://reader033.vdocuments.net/reader033/viewer/2022050308/5f7031da377db4699513f2c1/html5/thumbnails/8.jpg)
Example Tasks
Binary: given an e-mail, is it spam or not-spam?
Multi-class: given a news article, determine its topic (politics, sports, etc.)
Outline
1 Data and Feature Representation
2 Regression
3 Classification
Perceptron
Naive Bayes
Logistic Regression
Support Vector Machines
4 Regularization
5 Non-Linear Classifiers
Disclaimer
Some of the following slides are adapted from Ryan McDonald.
Let’s Start Simple
• Example 1 – sequence: ? ; label: −1
• Example 2 – sequence: ? ♥ 4; label: −1
• Example 3 – sequence: ? 4 ♠; label: +1
• Example 4 – sequence: 4 ; label: +1
• New sequence: ? ; label ?
• New sequence: ? ♥; label ?
• New sequence: ? 4 ; label ?
Why can we do this?
Let’s Start Simple
• Example 1 – sequence: ? ; label: −1
• Example 2 – sequence: ? ♥ 4; label: −1
• Example 3 – sequence: ? 4 ♠; label: +1
• Example 4 – sequence: 4 ; label: +1
• New sequence: ? ; label −1
• New sequence: ? ♥; label −1
• New sequence: ? 4 ; label ?
Why can we do this?
Let’s Start Simple: Machine Learning
• Example 1 – sequence: ? ; label: −1
• Example 2 – sequence: ? ♥ 4; label: −1
• Example 3 – sequence: ? 4 ♠; label: +1
• Example 4 – sequence: 4 ; label: +1
• New sequence: ? ♥; label −1
Label −1 vs. Label +1:

P(−1 | ?) = count(? and −1) / count(?) = 2/3 ≈ 0.67   vs.   P(+1 | ?) = count(? and +1) / count(?) = 1/3 ≈ 0.33

P(−1 | ) = count( and −1) / count( ) = 1/2 = 0.5   vs.   P(+1 | ) = count( and +1) / count( ) = 1/2 = 0.5

P(−1 | ♥) = count(♥ and −1) / count(♥) = 1/1 = 1.0   vs.   P(+1 | ♥) = count(♥ and +1) / count(♥) = 0/1 = 0.0
Let’s Start Simple: Machine Learning
• Example 1 – sequence: ? ; label: −1
• Example 2 – sequence: ? ♥ 4; label: −1
• Example 3 – sequence: ? 4 ♠; label: +1
• Example 4 – sequence: 4 ; label: +1
• New sequence: ? 4 ; label ?
Label −1 vs. Label +1:

P(−1 | ?) = count(? and −1) / count(?) = 2/3 ≈ 0.67   vs.   P(+1 | ?) = count(? and +1) / count(?) = 1/3 ≈ 0.33

P(−1 | 4) = count(4 and −1) / count(4) = 1/3 ≈ 0.33   vs.   P(+1 | 4) = count(4 and +1) / count(4) = 2/3 ≈ 0.67

P(−1 | ) = count( and −1) / count( ) = 1/2 = 0.5   vs.   P(+1 | ) = count( and +1) / count( ) = 1/2 = 0.5
Machine Learning
1 Define a model/distribution of interest
2 Make some assumptions if needed
3 Fit the model to the data
• Model: P(label | sequence) = P(label | symbol_1, ..., symbol_n)
• Prediction for a new sequence: argmax_label P(label | sequence)
• Assumption (naive Bayes; more later):

  P(symbol_1, ..., symbol_n | label) = ∏_{i=1}^{n} P(symbol_i | label)

• Fit the model to the data: count! (simple probabilistic modeling)
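The fit-by-counting recipe above can be sketched in a few lines of Python. This is an illustrative sketch, not code from the lecture: the symbols are written as plain strings, and "*" stands in for a symbol that did not survive extraction from the slides.

```python
from collections import Counter

def fit_counts(examples):
    """Count (symbol, label) co-occurrences and per-symbol totals."""
    pair = Counter()
    total = Counter()
    for sequence, label in examples:
        for symbol in sequence:
            pair[(symbol, label)] += 1
            total[symbol] += 1
    return pair, total

def p_label_given_symbol(pair, total, label, symbol):
    """Relative-frequency estimate of P(label | symbol)."""
    return pair[(symbol, label)] / total[symbol]

# The four training sequences from the slides; "*" is a placeholder
# for the symbol lost in extraction.
data = [
    (["?", "*"], -1),
    (["?", "H", "4"], -1),
    (["?", "4", "S"], +1),
    (["4", "*"], +1),
]
pair, total = fit_counts(data)
print(round(p_label_given_symbol(pair, total, -1, "?"), 2))  # 0.67
print(round(p_label_given_symbol(pair, total, +1, "4"), 2))  # 0.67
```

These match the counts in the tables above: "?" co-occurs with label −1 in 2 of its 3 occurrences, and "4" with label +1 in 2 of its 3.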
Some Notation: Inputs and Outputs
• Input x ∈ X
  e.g., a news article, a sentence, an image, ...
• Output y ∈ Y
  e.g., spam/not spam, a topic, a parse tree, an image segmentation
• Input/output pair (x, y) ∈ X × Y
  e.g., a news article together with a topic
  e.g., a sentence together with a parse tree
  e.g., an image partitioned into segmentation regions
Supervised Machine Learning
• We are given a labeled dataset of input/output pairs:

  D = {(x_n, y_n)}_{n=1}^{N} ⊆ X × Y

• Goal: use it to learn a predictor h : X → Y that generalizes well to arbitrary inputs.

• At test time, given x ∈ X, we predict ŷ = h(x).

• Hopefully, ŷ ≈ y most of the time.
Things can go by different names depending on what Y is...
Regression
Deals with continuous output variables:
• Regression: Y = R
  e.g., given a news article, how much time will a user spend reading it?
• Multivariate regression: Y = R^K
  e.g., predict the x-y coordinates in an image where the user will click
Classification
Deals with discrete output variables:
• Binary classification: Y = {−1, +1}
  e.g., spam detection
• Multi-class classification: Y = {1, 2, ..., K}
  e.g., topic classification
• Structured classification: Y exponentially large and structured
  e.g., machine translation, caption generation, image segmentation
• See Xavier Carreras’ lecture later at LxMLS!
Today we’ll focus mostly on multi-class classification.
Sometimes reductions are convenient:
• logistic regression reduces classification to regression
• one-vs-all reduces multi-class to binary
• greedy search reduces structured classification to multi-class
... but other times it’s better to tackle the problem in its native form.
More later!
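As a small illustration of the one-vs-all reduction (a sketch; the class names and scoring functions are made up, not from the lecture):

```python
def one_vs_all_predict(scorers, x):
    """One-vs-all: one binary real-valued scorer per class;
    predict the class whose scorer gives the highest score."""
    return max(scorers, key=lambda label: scorers[label](x))

# Hypothetical per-class scorers over 2-d inputs.
scorers = {
    "politics": lambda x: x[0] - x[1],
    "sports":   lambda x: x[1],
    "culture":  lambda x: -x[0],
}
print(one_vs_all_predict(scorers, (0.2, 0.9)))  # sports
```

Each scorer is trained as an independent binary problem (this class vs. all the rest); the multi-class decision is just an argmax over the scores.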
Feature Representations
Feature engineering is an important step in linear classifiers:
• Bag-of-words features for text, also lemmas, parts-of-speech, ...
• SIFT features and wavelet representations in computer vision
• Other categorical, Boolean, and continuous features
Feature Representations
We need to represent information about x
Typical approach: define a feature map φ : X → R^D

• φ(x) is a high-dimensional feature vector

We can use feature vectors to encapsulate Boolean, categorical, and continuous features

• e.g., categorical features can be reduced to a range of one-hot binary values.
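As a tiny sketch of such a feature map (the field and category names are invented for illustration), mixing a one-hot categorical feature with a continuous one:

```python
def one_hot(value, vocabulary):
    """Map a categorical value to a one-hot binary sub-vector."""
    return [1.0 if value == v else 0.0 for v in vocabulary]

def phi(x):
    """Toy feature map phi: one categorical field plus one
    continuous field, concatenated into a single vector."""
    topics = ["politics", "sports", "culture"]  # hypothetical vocabulary
    return one_hot(x["topic"], topics) + [float(x["length"])]

print(phi({"topic": "sports", "length": 42}))  # [0.0, 1.0, 0.0, 42.0]
```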
Example: Continuous Features
Linear Classifier
Handcrafted Features
Feature Engineering and NLP Pipelines
Classical NLP pipelines consist of stacking together several linear classifiers
Each classifier’s predictions are used to handcraft features for other classifiers
Examples of features:
• Word occurrences: binary feature denoting whether or not a word occurs in a document

• Word counts: real-valued feature counting how many times a word occurs
• POS tags: adjective counts for sentiment analysis
• Spell checker: misspellings counts for spam detection
Example: Translation Quality Estimation
Goal: estimate the quality of a translation on the fly (without a reference)!
Example: Translation Quality Estimation
Hand-crafted features:
• no of tokens in the source/target segment
• LM probability of source/target segment and their ratio
• % of source 1–3-grams observed in 4 frequency quartiles of source corpus
• average no of translations per source word
• ratio of brackets and punctuation symbols in source & target segments
• ratio of numbers, content/non-content words in source & target segments
• ratio of nouns/verbs/etc in the source & target segments
• % of dependency relations b/w constituents in source & target segments
• diff in depth of the syntactic trees of source & target segments
• diff in no of PP/NP/VP/ADJP/ADVP/CONJP in source & target
• diff in no of person/location/organization entities in source & target
• features and global score of the SMT system
• number of distinct hypotheses in the n-best list
• 1–3-gram LM probabilities using translations in the n-best to train the LM
• average size of the target phrases
• proportion of pruned search graph nodes
• proportion of recombined graph nodes
Representation Learning
Feature engineering is a black art and can be very time-consuming
But it’s a good way of encoding prior knowledge, and it is still widely used in practice (in particular with “small data”)
One alternative to feature engineering: representation learning
Bhiksha will talk about this tomorrow!
Outline
1 Data and Feature Representation
2 Regression
3 Classification
Perceptron
Naive Bayes
Logistic Regression
Support Vector Machines
4 Regularization
5 Non-Linear Classifiers
Regression
Output space Y is continuous
Example: given an article, how much time does a user spend reading it?

• x is the number of words in the article
• y is the reading time (minutes)
How to define a model that predicts y from x?
Linear Regression
• First take: assume y = wx + b
• Model parameters: w and b
• Given training data D = {(x_n, y_n)}_{n=1}^{N}, how to estimate w and b?

Least squares method: fit w and b on the training set by minimizing Σ_{n=1}^{N} (y_n − (w x_n + b))²
Linear Regression
Often a linear dependency of y on x is a poor assumption
Second take: assume y = w · φ(x), where φ(x) is a feature vector

• e.g., φ(x) = [1, x, x², ..., x^D] (polynomial features of degree ≤ D)
• the bias term b is captured by the constant feature φ₀(x) = 1

Fit w by minimizing Σ_n (y_n − w · φ(x_n))²

• Closed-form solution: w = (XᵀX)⁻¹Xᵀy, where the n-th row of the matrix X is φ(x_n)ᵀ and the n-th entry of the vector y is y_n.

This is still called linear regression: linearity is w.r.t. the model parameters w.
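The closed-form solution can be sketched in a few lines of numpy (an illustrative sketch with made-up noise-free data, so the true parameters are recovered exactly):

```python
import numpy as np

def polynomial_features(x, degree):
    """phi(x) = [1, x, x^2, ..., x^degree], one row per data point."""
    return np.stack([x ** d for d in range(degree + 1)], axis=1)

def least_squares(X, y):
    """Closed-form w = (X^T X)^{-1} X^T y, via a linear solver
    rather than an explicit matrix inverse."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Data generated from y = 3 + 2x with no noise.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 3.0 + 2.0 * x
w = least_squares(polynomial_features(x, 1), y)
print(np.round(w, 6))  # [3. 2.]
```

Solving the normal equations with `np.linalg.solve` is numerically preferable to forming the inverse explicitly.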
Linear Regression (D = 1)
Linear Regression (D = 2)
Squared Loss Function
Linear regression with the least squares method corresponds to a loss function

  L(ŷ, y) = ½ (ŷ − y)², where ŷ = w · φ(x).

The model is fit to the training data by minimizing this loss function.

This is called the squared loss.
More later.
Least Squares – Probabilistic Interpretation
The least squares method has a probabilistic interpretation.
Assume the data is generated stochastically as
y = w* · φ(x) + n,

where n ∼ N(0, σ²) is Gaussian noise (with σ fixed) and w* are the “true” model parameters.

That is, y ∼ N(w* · φ(x), σ²).
Then w given by least squares is the maximum likelihood estimate underthis model.
One-Slide Proof
Recall N(y; μ, σ²) = (1 / (√(2π) σ)) exp(−(y − μ)² / (2σ²)).

w_MLE = argmax_w ∏_{n=1}^{N} P(y_n | x_n; w)

      = argmax_w Σ_{n=1}^{N} log P(y_n | x_n; w)

      = argmax_w Σ_{n=1}^{N} [ −(y_n − w · φ(x_n))² / (2σ²) − log(√(2π) σ) ]   (the last term is a constant)

      = argmin_w Σ_{n=1}^{N} (y_n − w · φ(x_n))²
Thus, linear regression with the squared loss = MLE under Gaussian noise.
Other Regression Losses
Squared loss: L(ŷ, y) = ½ (ŷ − y)².

Absolute error loss: L(ŷ, y) = |ŷ − y|.

Huber loss: L(ŷ, y) = ½ (ŷ − y)² if |ŷ − y| ≤ 1, and |ŷ − y| − ½ otherwise.
Overfitting and Underfitting
We saw earlier an example of underfitting.
However, if the model is too complex (too many parameters) and the data is scarce, we run the risk of overfitting:
To avoid overfitting, we need regularization (more later).
Maximum A Posteriori
Assuming we have a prior distribution on w: w ∼ N(0, σ_w² I).

A criterion to estimate w is maximum a posteriori (MAP):

w_MAP = argmax_w P(w) ∏_{n=1}^{N} P(y_n | x_n; w)

      = argmax_w log P(w) + Σ_{n=1}^{N} log P(y_n | x_n; w)

      = argmax_w −‖w‖² / (2σ_w²) − Σ_{n=1}^{N} (y_n − w · φ(x_n))² / (2σ²) + constant

      = argmin_w (λ/2) ‖w‖² + Σ_{n=1}^{N} (y_n − w · φ(x_n))², with λ = 2σ²/σ_w²

Thus, ℓ2-regularization is equivalent to MAP estimation with a Gaussian prior.
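The resulting ℓ2-regularized least squares problem (ridge regression) also has a closed form. A minimal numpy sketch, assuming the objective (λ/2)‖w‖² + Σₙ (yₙ − w · φ(xₙ))², whose gradient vanishes at (XᵀX + (λ/2)I) w = Xᵀy:

```python
import numpy as np

def ridge(X, y, lam):
    """Minimize (lam/2)*||w||^2 + sum_n (y_n - w.phi(x_n))^2.
    Setting the gradient to zero gives
    (X^T X + (lam/2) I) w = X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + 0.5 * lam * np.eye(d), X.T @ y)

# Made-up data exactly fitting y = 1 + x.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 2.0, 3.0])
w_ols = ridge(X, y, 0.0)    # lam = 0 recovers plain least squares
w_reg = ridge(X, y, 10.0)   # lam > 0 shrinks the weights
print(np.round(w_ols, 6))   # [1. 1.]
```

With λ = 0 this reduces to the unregularized solution; increasing λ shrinks ‖w‖, which is the regularization effect the MAP view predicts.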
Outline
1 Data and Feature Representation
2 Regression
3 Classification
Perceptron
Naive Bayes
Logistic Regression
Support Vector Machines
4 Regularization
5 Non-Linear Classifiers
Binary Classification
Before covering multi-class classification, we address the simpler case of binary classification

Output space: Y = {−1, +1}. Example: given a news article, is it true or fake?

• x is the news article, represented as a feature vector φ(x)

• y can be either true (+1) or fake (−1)
How to define a model to predict y from x?
Linear Classifier
Defined by ŷ = sign(w · φ(x) + b) = +1 if w · φ(x) + b ≥ 0, and −1 if w · φ(x) + b < 0
Intuitively, w · φ(x) + b is a “score” for the positive class: if positive, predict +1; if negative, predict −1

Difference from regression: the sign function converts from continuous to binary

The decision boundary is a hyperplane defined by the model parameters w and b
Also called a “hyperplane classifier.”
Linear Classifier
(w, b) defines a hyperplane that splits the space into two half spaces:
[Figure: a separating line in 2D; points along the line have scores of 0]
How to learn this hyperplane from the training data D = {(x_n, y_n)}_{n=1}^N ?
Linear Separability
• A dataset D is linearly separable if there exists (w, b) such that classification is perfect
Separable Not Separable
We next present an algorithm that finds such a hyperplane if it exists!
Linear Classifier: No Bias Term
It is common to present linear classifiers without the bias term b: ŷ = sign(w · φ(x))

In this case, the decision boundary is a hyperplane that passes through the origin
We can always do this without loss of generality:
• Add a constant feature to φ(x): φ0(x) = 1
• Then the corresponding weight w0 replaces the bias term b
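The two bullets above can be checked numerically. In this sketch (weights and inputs are made-up toy numbers), folding the bias into an augmented weight vector gives the same predictions as the explicit-bias classifier:

```python
def sign(z):
    return 1 if z >= 0 else -1

def predict_with_bias(w, b, phi):
    return sign(sum(wi * pi for wi, pi in zip(w, phi)) + b)

def predict_no_bias(w_aug, phi):
    # w_aug = [b, w_1, ..., w_D]; the constant feature phi_0(x) = 1 carries the bias.
    phi_aug = [1.0] + phi
    return sign(sum(wi * pi for wi, pi in zip(w_aug, phi_aug)))
```

For example, w = [2, −1] with b = 0.5 behaves exactly like the augmented vector [0.5, 2, −1] on inputs prepended with 1.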
Outline
1 Data and Feature Representation
2 Regression
3 Classification
Perceptron
Naive Bayes
Logistic Regression
Support Vector Machines
4 Regularization
5 Non-Linear Classifiers
Perceptron (Rosenblatt, 1958)
(Extracted from Wikipedia)
• Invented in 1957 at the Cornell Aeronautical Laboratory by Frank Rosenblatt

• Implemented in custom-built hardware as the “Mark 1 perceptron,” designed for image recognition

• 400 photocells, randomly connected to the “neurons.” Weights were encoded in potentiometers

• Weight updates during learning were performed by electric motors.
Perceptron in the News...
Perceptron Algorithm
Online algorithm: process one data point at each round
1 Take xi ; apply the current model to make a prediction for it
2 If prediction is correct, do nothing
3 Else, correct model w by adding/subtracting feature vector φ(xi )
For simplicity, omit the bias b: assume a constant feature φ0(x) = 1 as explained earlier.
Perceptron Algorithm
input: labeled data D
initialize w^(0) = 0
initialize k = 0 (number of mistakes)
repeat
  get new training example (x_i, y_i)
  predict ŷ_i = sign(w^(k) · φ(x_i))
  if ŷ_i ≠ y_i then
    update w^(k+1) = w^(k) + y_i φ(x_i)
    increment k
  end if
until maximum number of epochs
output: model weights w^(k)
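The pseudocode above translates directly into Python. The toy dataset below is made up for illustration; each φ(x) starts with the constant feature 1 standing in for the bias:

```python
def perceptron(data, num_epochs=10):
    """Train on data = [(phi, y)] with y in {-1, +1}."""
    w = [0.0] * len(data[0][0])
    k = 0  # number of mistakes so far
    for _ in range(num_epochs):
        for phi, y in data:
            score = sum(wi * pi for wi, pi in zip(w, phi))
            y_hat = 1 if score >= 0 else -1
            if y_hat != y:  # mistake: move w toward y * phi
                w = [wi + y * pi for wi, pi in zip(w, phi)]
                k += 1
    return w, k

# Toy linearly separable data (made up).
train = [([1.0, 2.0, 2.0], 1), ([1.0, -2.0, -1.0], -1),
         ([1.0, 3.0, 1.0], 1), ([1.0, -1.0, -3.0], -1)]
w, k = perceptron(train)
```

On this data the algorithm converges after a single update and then classifies every training point correctly.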
Perceptron’s Mistake Bound
A couple definitions:
• the training data is linearly separable with margin γ > 0 iff there is a weight vector u with ‖u‖ = 1 such that y_i u · φ(x_i) ≥ γ, ∀i

• radius of the data: R = max_i ‖φ(x_i)‖

Then we have the following bound on the number of mistakes:

Theorem (Novikoff (1962)): The perceptron algorithm is guaranteed to find a separating hyperplane after at most R²/γ² mistakes.
One-Slide Proof
Recall that w(k+1) = w(k) + yiφ(xi ).
• Lower bound on ‖w^(k+1)‖:

  u · w^(k+1) = u · w^(k) + y_i u · φ(x_i) ≥ u · w^(k) + γ ≥ kγ.

  Hence ‖w^(k+1)‖ = ‖u‖ ‖w^(k+1)‖ ≥ u · w^(k+1) ≥ kγ (by the Cauchy–Schwarz inequality).

• Upper bound on ‖w^(k+1)‖:

  ‖w^(k+1)‖² = ‖w^(k)‖² + ‖φ(x_i)‖² + 2 y_i w^(k) · φ(x_i) ≤ ‖w^(k)‖² + R² ≤ kR²,

  where the middle step uses ‖φ(x_i)‖ ≤ R and the fact that updates only happen on mistakes, so y_i w^(k) · φ(x_i) ≤ 0.

Combining both bounds, we get (kγ)² ≤ ‖w^(k+1)‖² ≤ kR² ⇒ k ≤ R²/γ² (QED).
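The bound can be checked numerically. Below is a sketch under assumptions of mine: a toy separable dataset and a hand-picked unit vector u, which gives a valid (not necessarily maximal) margin γ and hence a valid bound:

```python
import math

# Toy separable data; first coordinate is the constant feature.
data = [([1.0, 2.0, 2.0], 1), ([1.0, 3.0, 1.0], 1),
        ([1.0, -2.0, -1.0], -1), ([1.0, -1.0, -3.0], -1)]

# A unit vector u that separates the data, yielding margin gamma and radius R.
u = [0.0, 1.0 / math.sqrt(2), 1.0 / math.sqrt(2)]
gamma = min(y * sum(ui * pi for ui, pi in zip(u, phi)) for phi, y in data)
R = max(math.sqrt(sum(pi * pi for pi in phi)) for phi, _ in data)

# Run the perceptron and count mistakes k.
w, k = [0.0] * 3, 0
for _ in range(20):
    for phi, y in data:
        if (1 if sum(wi * pi for wi, pi in zip(w, phi)) >= 0 else -1) != y:
            w = [wi + y * pi for wi, pi in zip(w, phi)]
            k += 1

assert k <= (R / gamma) ** 2  # Novikoff's bound holds on this run
```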
What a Simple Perceptron Can and Can’t Do
• Remember: the decision boundary is linear (linear classifier)
• It can solve linearly separable problems (OR, AND)
What a Simple Perceptron Can and Can’t Do
• ... but it can’t solve non-linearly separable problems such as simple XOR (unless the input is transformed into a better representation):

• This result is often attributed to Minsky and Papert (1969) but was known well before.
Limitations of the Perceptron
Minsky and Papert (1969):
• Showed limitations of single-layer perceptrons and fostered an “AI winter” period.
More tomorrow at Bhiksha’s lecture!
Multi-Class Classification
Let’s now assume a multi-class classification problem, with |Y| ≥ 2 labels (classes).
Reduction to Binary Classification
One strategy for multi-class classification is to train one binary classifier per label (using all the other classes as negative examples) and pick the class with the highest score (one-vs-all)

Another strategy is to train pairwise classifiers and to use majority voting (one-vs-one)
Here, we’ll consider classifiers that tackle the multiple classes directly.
Multi-Class Linear Classifiers
• Parametrized by a weight matrix W ∈ R^{|Y|×D} (one weight per feature/label pair) and a bias vector b ∈ R^{|Y|}: W stacks one row w_y^⊤ per label, and b one entry b_y per label

• Equivalently, |Y| weight vectors w_y ∈ R^D and scalars b_y ∈ R

• The score (or probability) of a particular label is based on a linear combination of features and their weights

• Predict the ŷ which maximizes this score:

  ŷ = arg max_{y∈Y} w_y · φ(x) + b_y
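A minimal sketch of this prediction rule. The label names and all numbers below are made up; dicts keyed by label stand in for the rows of W and entries of b:

```python
labels = ["sports", "politics", "tech"]
W = {"sports":   [0.5, -0.2, 0.0],
     "politics": [0.1,  0.8, -0.3],
     "tech":     [-0.4, 0.1, 0.9]}
b = {"sports": 0.0, "politics": -0.1, "tech": 0.2}

def predict(phi):
    # Score every label with w_y . phi(x) + b_y, return the argmax.
    scores = {y: sum(wi * pi for wi, pi in zip(W[y], phi)) + b[y] for y in labels}
    return max(scores, key=scores.get)
```

For instance, an input whose third feature dominates gets the label whose weight vector puts most mass on that feature.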
Multi-Class Linear Classifier
Geometrically, (W, b) split the feature space into regions delimited by hyperplanes.
Commonly Used Notation in Neural Networks
[Diagram: handcrafted features φ(x) fed into a linear classifier]

ŷ = argmax (W φ(x) + b), where W stacks one row w_y^⊤ per label and b one entry b_y per label.
Multi-Class Recovers Binary
With two classes (Y = {±1}), this formulation recovers the binary classifier presented earlier:

ŷ = arg max_{y∈{±1}} w_y · φ(x) + b_y

  = +1 if w_{+1} · φ(x) + b_{+1} > w_{−1} · φ(x) + b_{−1}, −1 otherwise

  = sign((w_{+1} − w_{−1}) · φ(x) + (b_{+1} − b_{−1})) = sign(w · φ(x) + b), with w = w_{+1} − w_{−1} and b = b_{+1} − b_{−1}.
That is: only half of the parameters are needed.
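A quick numerical check of this identity, with made-up weights: the two-class argmax classifier and the collapsed sign classifier agree on every input (ties aside):

```python
w_plus, b_plus = [0.7, -0.2], 0.1
w_minus, b_minus = [0.3, 0.4], -0.2

def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

def argmax_classify(phi):
    return 1 if dot(w_plus, phi) + b_plus > dot(w_minus, phi) + b_minus else -1

# Collapse to half the parameters: w = w_+1 - w_-1, b = b_+1 - b_-1.
w = [a - c for a, c in zip(w_plus, w_minus)]
b = b_plus - b_minus

def sign_classify(phi):
    return 1 if dot(w, phi) + b > 0 else -1
```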
Linear Classifiers (Binary vs Multi-Class)
• Prediction rule: ŷ = h(x) = arg max_{y∈Y} w_y · φ(x)  (linear in w_y)
• The decision boundary is defined by the intersection of half spaces
• In the binary case (|Y| = 2) this corresponds to a hyperplane classifier
Linear Classifier – No Bias Term
Again, it is common to omit the bias vector b:
ŷ = arg max_{y∈Y} w_y · φ(x)

Like before, this can be done without loss of generality, by assuming a constant feature φ0(x) = 1

The first column of W then replaces the bias vector.
We assume this for simplicity.
Example: Perceptron
The perceptron algorithm also works for the multi-class case!
It has a similar mistake bound: if the data is separable, it’s guaranteed to find separating hyperplanes!
Perceptron Algorithm: Multi-Class
input: labeled data D
initialize W^(0) = 0
initialize k = 0 (number of mistakes)
repeat
  get new training example (x_i, y_i)
  predict ŷ_i = arg max_{y∈Y} w_y^(k) · φ(x_i)
  if ŷ_i ≠ y_i then
    update w_{y_i}^(k+1) = w_{y_i}^(k) + φ(x_i)   (increase weight of gold class)
    update w_{ŷ_i}^(k+1) = w_{ŷ_i}^(k) − φ(x_i)   (decrease weight of incorrect class)
    increment k
  end if
until maximum number of epochs
output: model weights W^(k)
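A direct transcription of this pseudocode into Python; the three-class toy dataset and label names are mine, and a dict keyed by label stands in for the rows of W:

```python
def multiclass_perceptron(data, labels, dim, num_epochs=10):
    """data: list of (phi, y) pairs; one weight vector per label."""
    W = {y: [0.0] * dim for y in labels}
    k = 0  # number of mistakes
    for _ in range(num_epochs):
        for phi, y in data:
            y_hat = max(labels, key=lambda c: sum(wi * pi for wi, pi in zip(W[c], phi)))
            if y_hat != y:
                W[y] = [wi + pi for wi, pi in zip(W[y], phi)]          # promote gold class
                W[y_hat] = [wi - pi for wi, pi in zip(W[y_hat], phi)]  # demote prediction
                k += 1
    return W, k

labels = ["a", "b", "c"]
train = [([1.0, 0.0], "a"), ([0.0, 1.0], "b"), ([-1.0, -1.0], "c")]
W, k = multiclass_perceptron(train, labels, dim=2)
```

After a couple of updates the learned weights classify all three toy points correctly.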
Outline
1 Data and Feature Representation
2 Regression
3 Classification
Perceptron
Naive Bayes
Logistic Regression
Support Vector Machines
4 Regularization
5 Non-Linear Classifiers
Probabilistic Models
• For a moment, forget linear classifiers and parameter vectors w
• Let’s assume our goal is to model the conditional probability of output labels y given inputs x, i.e. P(y|x)
• If we can define this distribution, then classification becomes:
ŷ = arg max_{y∈Y} P(y|x)
Bayes Rule
• One way to model P(y|x) is through Bayes Rule:

  P(y|x) = P(y) P(x|y) / P(x)

  arg max_y P(y|x) = arg max_y P(y) P(x|y)   (since x is fixed!)

• P(y) P(x|y) = P(x, y): a joint probability

• Above is a “generative story”: “pick y; then pick x given y.”

• Models that consider the joint P(x, y) are called generative models, because they come with a generative story.
Naive Bayes
Assume that an input x is partitioned as v1, . . . , vL, where vk ∈ Vk
Example:
• x is a document of length L
• vk is the kth token (a word)
• The set Vk = V is a fixed vocabulary (all tokens drawn from V)
Naive Bayes Assumption (conditional independence):

P(x | y) = P(v_1, . . . , v_L | y) = ∏_{k=1}^L P(v_k | y)
Multinomial Naive Bayes
P(x, y) = P(y) P(v_1, . . . , v_L | y) = P(y) ∏_{k=1}^L P(v_k | y)

• All tokens are conditionally independent, given the topic

• The word order doesn’t change P(x, y) (bag-of-words assumption)
Small caveat: we assumed that the document has a fixed length L.
This is not realistic.
How to deal with variable length?
Multinomial Naive Bayes – Arbitrary Length
Solution: introduce a distribution over document length, P(|x|)
• e.g. a Poisson distribution.
We get:
P(x, y) = P(y) P(|x|) ∏_{k=1}^{|x|} P(v_k | y), where the product is P(x | y)
P(|x |) is constant (independent of y), so nothing really changes
• the posterior P(y |x) is the same as before.
What Does This Buy Us?
P(v_1, . . . , v_L | y) = ∏_{k=1}^L P(v_k | y)

What do we gain with the Naive Bayes assumption?

• A huge reduction in the number of parameters!

• If we hadn’t made any factorization assumption, how many parameters would be required for expressing P(v_1, . . . , v_L | y)? O(|V|^L)

• And how many parameters with Naive Bayes? O(|V|)

Fewer parameters ⟹ less computation, and less risk of overfitting

(Though we may underfit if our independence assumptions are too strong.)
Naive Bayes – Learning
P(x, y) = P(y) P(v_1, . . . , v_L | y) = P(y) ∏_{k=1}^L P(v_k | y)

• Input: dataset D = {(x_t, y_t)}_{t=1}^N (examples assumed i.i.d.)

• Parameters: Θ = {P(y), P(v|y)}

• Objective: Maximum Likelihood Estimation (MLE): choose parameters that maximize the likelihood of the observed data

L(Θ; D) = ∏_{t=1}^N P(x_t, y_t) = ∏_{t=1}^N ( P(y_t) ∏_{k=1}^L P(v_k(x_t) | y_t) )

Θ̂ = arg max_Θ ∏_{t=1}^N ( P(y_t) ∏_{k=1}^L P(v_k(x_t) | y_t) )
Naive Bayes – Learning via MLE
For the multinomial Naive Bayes model, MLE has a closed form solution!!
It all boils down to counting and normalizing!!
(The proof is left as an exercise...)
Naive Bayes – Learning via MLE
Θ̂ = arg max_Θ ∏_{t=1}^N ( P(y_t) ∏_{k=1}^L P(v_k(x_t) | y_t) )

P̂(y) = (Σ_{t=1}^N [[y_t = y]]) / N

P̂(v|y) = (Σ_{t=1}^N Σ_{k=1}^L [[v_k(x_t) = v and y_t = y]]) / (L Σ_{t=1}^N [[y_t = y]])

[[X]] is 1 if property X holds, 0 otherwise (Iverson notation). P̂(v|y) is the fraction of times a feature appears in training cases of a given label.
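These two estimators really are just counting and normalizing. A minimal sketch with a made-up four-document corpus of fixed length L = 2 (not the movie-review corpus used later):

```python
from collections import Counter

docs = [(["great", "plot"], "pos"), (["bad", "plot"], "neg"),
        (["great", "cast"], "pos"), (["bad", "cast"], "neg")]
N = len(docs)
L = 2

class_counts = Counter(y for _, y in docs)                 # counts for hat P(y)
token_counts = Counter((v, y) for toks, y in docs for v in toks)

def p_y(y):
    # hat P(y): fraction of documents with label y
    return class_counts[y] / N

def p_v_given_y(v, y):
    # hat P(v|y): token count in class y, normalized by all tokens in class y
    return token_counts[(v, y)] / (L * class_counts[y])
```

For example, "great" appears in both positive documents, so P̂(great|pos) = 2 / (2 × 2) = 0.5.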
Naive Bayes Example
• Corpus of movie reviews: 7 examples for training
Doc | Words | Class
1 | Great movie, excellent plot, renowned actors | Positive
2 | I had not seen a fantastic plot like this in good 5 years. Amazing!!! | Positive
3 | Lovely plot, amazing cast, somehow I am in love with the bad guy | Positive
4 | Bad movie with great cast, but very poor plot and unimaginative ending | Negative
5 | I hate this film, it has nothing original | Negative
6 | Great movie, but not... | Negative
7 | Very bad movie, I have no words to express how I dislike it | Negative
Naive Bayes Example
• Features: adjectives (bag-of-words)
Doc | Words | Class
1 | Great movie, excellent plot, renowned actors | Positive
2 | I had not seen a fantastic plot like this in good 5 years. amazing!!! | Positive
3 | Lovely plot, amazing cast, somehow I am in love with the bad guy | Positive
4 | Bad movie with great cast, but very poor plot and unimaginative ending | Negative
5 | I hate this film, it has nothing original. Really bad | Negative
6 | Great movie, but not... | Negative
7 | Very bad movie, I have no words to express how I dislike it | Negative
Naive Bayes Example

Relative frequency.

Priors:

P(positive) = ∑_{t=1}^N [[y_t = positive]] / N = 3/7 ≈ 0.43

P(negative) = ∑_{t=1}^N [[y_t = negative]] / N = 4/7 ≈ 0.57

Assume standard pre-processing: tokenization, lowercasing, punctuation removal (except special punctuation like !!!)
Naive Bayes Example

Likelihoods: count of adjective v in class y, divided by the total adjective count in class y:

P(v|y) = (∑_{t=1}^N ∑_{k=1}^L [[v_k(x_t) = v and y_t = y]]) / (∑_{t=1}^N L [[y_t = y]])

| v | P(v \| positive) | P(v \| negative) |
|---|---|---|
| amazing | 2/10 | 0/8 |
| bad | 1/10 | 3/8 |
| excellent | 1/10 | 0/8 |
| fantastic | 1/10 | 0/8 |
| good | 1/10 | 0/8 |
| great | 1/10 | 2/8 |
| lovely | 1/10 | 0/8 |
| original | 0/10 | 1/8 |
| poor | 0/10 | 1/8 |
| renowned | 1/10 | 0/8 |
| unimaginative | 0/10 | 1/8 |
Naive Bayes Example

Given a new segment to classify (test time):

| Doc | Words | Class |
|-----|-------|-------|
| 8 | This was a fantastic story, good, lovely | ??? |

Final decision:

ŷ = argmax_y ( P(y) ∏_{k=1}^L P(v_k|y) )

P(positive) · P(fantastic|positive) · P(good|positive) · P(lovely|positive)
= 3/7 · 1/10 · 1/10 · 1/10 ≈ 0.00043

P(negative) · P(fantastic|negative) · P(good|negative) · P(lovely|negative)
= 4/7 · 0/8 · 0/8 · 0/8 = 0

So: sentiment = positive
Naive Bayes Example

Given a new segment to classify (test time):

| Doc | Words | Class |
|-----|-------|-------|
| 9 | Great plot, great cast, great everything | ??? |

Final decision:

P(positive) · P(great|positive) · P(great|positive) · P(great|positive)
= 3/7 · 1/10 · 1/10 · 1/10 ≈ 0.00043

P(negative) · P(great|negative) · P(great|negative) · P(great|negative)
= 4/7 · 2/8 · 2/8 · 2/8 ≈ 0.00893

So: sentiment = negative
Naive Bayes Example

But if the new segment to classify (test time) is:

| Doc | Words | Class |
|-----|-------|-------|
| 10 | Boring movie, annoying plot, unimaginative ending | ??? |

Final decision:

P(positive) · P(boring|positive) · P(annoying|positive) · P(unimaginative|positive)
= 3/7 · 0/10 · 0/10 · 0/10 = 0

P(negative) · P(boring|negative) · P(annoying|negative) · P(unimaginative|negative)
= 4/7 · 0/8 · 0/8 · 1/8 = 0

So: sentiment = ???
Laplace Smoothing

Add smoothing to the feature counts (add 1 to every count):

P(v|y) = (∑_{t=1}^N ∑_{k=1}^L [[v_k(x_t) = v and y_t = y]] + 1) / (∑_{t=1}^N L [[y_t = y]] + |V|)

where |V| = number of distinct adjectives in training (all classes) = 12

| Doc | Words | Class |
|-----|-------|-------|
| 11 | Boring movie, annoying plot, unimaginative ending | ??? |

Final decision:

P(positive) · P(boring|positive) · P(annoying|positive) · P(unimaginative|positive)
= 3/7 · ((0 + 1)/(10 + 12)) · ((0 + 1)/(10 + 12)) · ((0 + 1)/(10 + 12)) ≈ 0.000040

P(negative) · P(boring|negative) · P(annoying|negative) · P(unimaginative|negative)
= 4/7 · ((0 + 1)/(8 + 12)) · ((0 + 1)/(8 + 12)) · ((1 + 1)/(8 + 12)) ≈ 0.000143

So: sentiment = negative
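The whole pipeline above (counting, Laplace smoothing, prediction) fits in a few lines. The sketch below is a toy: the hand-extracted adjective lists are an assumption, so its counts differ slightly from the slides' (e.g. 9 rather than 10 positive tokens), but the smoothed classifier reproduces the qualitative decisions.

```python
from collections import Counter

# Toy version of the slides' corpus: adjective features per document.
# The exact adjective lists are an assumption; counts may differ slightly.
train = [
    (["great", "excellent", "renowned"], "positive"),
    (["fantastic", "good", "amazing"], "positive"),
    (["lovely", "amazing", "bad"], "positive"),
    (["bad", "great", "poor", "unimaginative"], "negative"),
    (["original", "bad"], "negative"),
    (["great"], "negative"),
    (["bad"], "negative"),
]

def train_nb(data):
    """Estimate priors P(y) and Laplace-smoothed likelihoods P(v|y) by counting."""
    class_counts = Counter(y for _, y in data)
    word_counts = {y: Counter() for y in class_counts}
    vocab = set()
    for words, y in data:
        word_counts[y].update(words)
        vocab.update(words)
    priors = {y: c / len(data) for y, c in class_counts.items()}
    totals = {y: sum(word_counts[y].values()) for y in class_counts}
    def likelihood(v, y):
        # add-1 smoothing: unseen words get a small, nonzero probability
        return (word_counts[y][v] + 1) / (totals[y] + len(vocab))
    return priors, likelihood

def predict(words, priors, likelihood):
    scores = dict(priors)
    for y in scores:
        for v in words:
            scores[y] *= likelihood(v, y)
    return max(scores, key=scores.get)

priors, likelihood = train_nb(train)
```

With this model, `predict(["boring", "annoying", "unimaginative"], priors, likelihood)` comes out `"negative"`, matching the smoothed decision for Doc 11 on the slide.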
Finally...
Multinomial Naive Bayes is a Linear Classifier!
One Slide Proof

• Let b_y = log P(y), ∀y ∈ Y
• Let [w_y]_v = log P(v|y), ∀y ∈ Y, v ∈ V
• Let [φ(x)]_v = ∑_{k=1}^L [[v_k(x) = v]], ∀v ∈ V (# of times v occurs in x)

argmax_y P(y|x) = argmax_y ( P(y) ∏_{k=1}^L P(v_k(x)|y) )

= argmax_y ( log P(y) + ∑_{k=1}^L log P(v_k(x)|y) )

= argmax_y ( log P(y) + ∑_{v∈V} [φ(x)]_v log P(v|y) )

= argmax_y ( w_y · φ(x) + b_y ).
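The one-slide proof can also be checked numerically. The sketch below uses made-up probabilities (not the slides' movie-review estimates) and confirms that the generative score P(y) ∏_k P(v_k|y) and the linear score w_y · φ(x) + b_y rank classes identically:

```python
import math

# Made-up probabilities for illustration (not the slides' estimates).
priors = {"pos": 0.43, "neg": 0.57}
lik = {
    "pos": {"good": 0.2, "bad": 0.1, "great": 0.7},
    "neg": {"good": 0.1, "bad": 0.6, "great": 0.3},
}
vocab = ["good", "bad", "great"]

def nb_score(words, y):
    # generative score: P(y) * prod_k P(v_k | y)
    p = priors[y]
    for v in words:
        p *= lik[y][v]
    return p

def linear_score(words, y):
    # linear score: w_y . phi(x) + b_y, with b_y = log P(y), [w_y]_v = log P(v|y)
    phi = {v: words.count(v) for v in vocab}
    return math.log(priors[y]) + sum(phi[v] * math.log(lik[y][v]) for v in vocab)

doc = ["good", "great", "great"]
```

Since the linear score is exactly the log of the generative score, exponentiating one recovers the other, and both argmax to the same class.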
Discriminative versus Generative

• Generative models attempt to model inputs and outputs
  • e.g., Naive Bayes = MLE of the joint distribution P(x, y)
  • The statistical model must explain the generation of the input
  • Can we sample a document from the multinomial Naive Bayes model? How?
• Occam's Razor: why model the input?
• Discriminative models
  • Use a loss function that directly optimizes P(y|x) (or something related)
  • Logistic Regression: MLE of P(y|x)
  • Perceptron and SVMs: minimize classification error
• Generative and discriminative models both use P(y|x) for prediction
• They differ only in which distribution they use to set w
Coffee-break!
So far
We have covered:
• The perceptron algorithm
• (Multinomial) Naive Bayes.
We saw that both are instances of linear classifiers.
The perceptron finds a separating hyperplane (if one exists); Naive Bayes is a generative probabilistic model.
Next: a discriminative probabilistic model.
Reminder

Linear Classifier (handcrafted features):

ŷ = argmax_y (W φ(x) + b),

where the rows of W are the class weight vectors w_y^⊤ and the entries of b are the biases b_y.
Key Problem

How do we map a vector of label scores in R^|Y| to a probability distribution over Y?

[Figure: label scores z mapped to a probability distribution p]

We'll see two mappings: softmax (next) and sparsemax (later).
Outline
1 Data and Feature Representation
2 Regression
3 Classification
Perceptron
Naive Bayes
Logistic Regression
Support Vector Machines
4 Regularization
5 Non-Linear Classifiers
Logistic Regression

Recall: a linear model gives the score for each class, w_y · φ(x).

Define a conditional probability:

P(y|x) = exp(w_y · φ(x)) / Z_x, where Z_x = ∑_{y'∈Y} exp(w_{y'} · φ(x))

This operation (exponentiating and normalizing) is called the softmax transformation (more later!)

Note: still a linear classifier:

argmax_y P(y|x) = argmax_y exp(w_y · φ(x)) / Z_x

= argmax_y exp(w_y · φ(x))

= argmax_y w_y · φ(x)
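As a quick sketch of the softmax transformation just defined (the max-subtraction below is a standard numerical-stability trick, not part of the slide's formula; it does not change the result):

```python
import numpy as np

# Softmax: exponentiate the scores and normalize. Subtracting the max
# before exponentiating avoids overflow and leaves the output unchanged.
def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([1.0, 2.0, 3.0])   # label scores
p = softmax(z)                  # a probability distribution over labels
```

Since exponentiation is monotone, argmax(softmax(z)) = argmax(z), which is exactly why the classifier stays linear.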
Binary Logistic Regression

Binary labels (Y = {±1}). Scores: 0 for the negative class, w · φ(x) for the positive class:

P(y = +1 | x) = exp(w · φ(x)) / (1 + exp(w · φ(x))) = 1 / (1 + exp(−w · φ(x))) = σ(w · φ(x)).
This is called a sigmoid transformation (more later!)
Sigmoid Transformation

σ(z) = 1 / (1 + e^{−z})

[Figure: the sigmoid curve plotted for −3 ≤ z ≤ 3]
• Widely used in neural networks (wait for tomorrow!)
• Can be regarded as a 2D softmax
• “Squashes” a real number between 0 and 1
• The output can be interpreted as a probability
• Positive, bounded, strictly increasing
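The "2D softmax" remark can be made concrete: with scores [0, z] (0 for the negative class, as on the previous slide), the two-class softmax collapses to the sigmoid. A minimal sketch:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax2(z):
    # two-class softmax over scores [0, z]; returns the positive-class probability
    e0, e1 = math.exp(0.0), math.exp(z)
    return e1 / (e0 + e1)

for z in [-2.0, 0.0, 3.5]:
    assert abs(sigmoid(z) - softmax2(z)) < 1e-12
```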
Multinomial Logistic Regression

P_W(y | x) = exp(w_y · φ(x)) / Z_x

• How do we learn the weights W?
• Set W to maximize the conditional log-likelihood of the training data:

W = argmax_W log ∏_{t=1}^N P_W(y_t|x_t) = argmin_W −∑_{t=1}^N log P_W(y_t|x_t)

= argmin_W ∑_{t=1}^N ( log ∑_{y'} exp(w_{y'} · φ(x_t)) − w_{y_t} · φ(x_t) ),

i.e., set W to assign as much probability mass as possible to the correct labels!
Logistic Regression

• This objective function is convex
  • Therefore any local minimum is a global minimum
• No closed-form solution, but lots of numerical techniques
  • Gradient methods (gradient descent, conjugate gradient)
  • Quasi-Newton methods (L-BFGS, ...)
• Logistic Regression = Maximum Entropy: maximize entropy subject to constraints on the features
  • Proof left as an exercise!
Recap: Convex functions

Pro: guarantee of a global minimum ✓

Figure: illustration of a convex function. The line segment between any two points on the graph lies entirely above the curve.
Recap: Iterative Descent Methods
Goal: find the minimum/minimizer of f : R^d → R

• Proceed in small steps in the optimal direction until a stopping criterion is met.
• Gradient descent: updates of the form x^(k+1) ← x^(k) − η_k ∇f(x^(k))
Figure: Illustration of gradient descent. The red lines correspond to steps takenin the negative gradient direction.
Gradient Descent

• Our loss function in logistic regression is

L(W; (x, y)) = log ∑_{y'} exp(w_{y'} · φ(x)) − w_y · φ(x).

• We want to find argmin_W ∑_{t=1}^N L(W; (x_t, y_t))
• Set W^0 = 0
• Iterate until convergence (for a suitable stepsize η_k):

W^{k+1} = W^k − η_k ∇_W ∑_{t=1}^N L(W^k; (x_t, y_t)) = W^k − η_k ∑_{t=1}^N ∇_W L(W^k; (x_t, y_t))

• ∇_W L(W) is the gradient of L w.r.t. W
• L(W) convex ⇒ gradient descent will reach the global optimum W.
Stochastic Gradient Descent
It turns out this works with a Monte Carlo approximation of the gradient (more frequent updates, convenient with large datasets):

• Set W^0 = 0
• Iterate until convergence:
  • Pick (x_t, y_t) randomly
  • Update W^{k+1} = W^k − η_k ∇_W L(W^k; (x_t, y_t))
• i.e., we approximate the true gradient with a noisy but unbiased gradient based on a single sample
• Variants exist in between (mini-batches)
• All guaranteed to find the optimal W (for suitable step sizes)
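A runnable sketch of this SGD loop on synthetic data (a toy setup, not from the slides). The single-sample update uses the logistic-loss gradient derived on the "Computing the Gradient" slides, written here as (p − e_y) φ(x)^⊤ with p the softmax probability vector:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # 100 examples, 3 features
true_W = np.array([[2.0, 0.0, 0.0],
                   [0.0, 2.0, 0.0]])
y = (X @ true_W.T).argmax(axis=1)              # labels from a linear rule (separable)

W = np.zeros((2, 3))
eta = 0.1
for epoch in range(50):
    for t in rng.permutation(len(X)):
        scores = W @ X[t]
        p = np.exp(scores - scores.max())
        p /= p.sum()                           # softmax probabilities
        e_y = np.zeros(2)
        e_y[y[t]] = 1.0
        W -= eta * np.outer(p - e_y, X[t])     # single-sample gradient step

train_acc = ((X @ W.T).argmax(axis=1) == y).mean()
```

On this separable toy data, the training accuracy approaches 1 after a few epochs.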
Computing the Gradient

• For this to work, we need to compute ∇_W L(W; (x_t, y_t)), where

L(W; (x, y)) = log ∑_{y'} exp(w_{y'} · φ(x)) − w_y · φ(x)

• Some reminders:
  1. ∇_W log F(W) = (1 / F(W)) ∇_W F(W)
  2. ∇_W exp F(W) = exp(F(W)) ∇_W F(W)
• We denote by e_y = [0, ..., 0, 1, 0, ..., 0]^⊤ (with the 1 in position y) the one-hot vector representation of class y.
Computing the Gradient

∇_W L(W; (x, y)) = ∇_W ( log ∑_{y'} exp(w_{y'} · φ(x)) − w_y · φ(x) )

= ∇_W log ∑_{y'} exp(w_{y'} · φ(x)) − ∇_W (w_y · φ(x))

= (1 / ∑_{y'} exp(w_{y'} · φ(x))) ∑_{y'} ∇_W exp(w_{y'} · φ(x)) − e_y φ(x)^⊤

= (1 / Z_x) ∑_{y'} exp(w_{y'} · φ(x)) ∇_W (w_{y'} · φ(x)) − e_y φ(x)^⊤

= ∑_{y'} (exp(w_{y'} · φ(x)) / Z_x) e_{y'} φ(x)^⊤ − e_y φ(x)^⊤

= ∑_{y'} P_W(y'|x) e_{y'} φ(x)^⊤ − e_y φ(x)^⊤

= (p − e_y) φ(x)^⊤, where p = [P_W(y'|x)]_{y'∈Y} is the vector of class probabilities.
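As a sanity check (a sketch, not part of the slides), the closed-form gradient ∑_{y'} P_W(y'|x) e_{y'} φ(x)^⊤ − e_y φ(x)^⊤, i.e. (p − e_y) φ(x)^⊤, can be compared against central finite differences of the loss:

```python
import numpy as np

def loss(W, phi, y):
    scores = W @ phi
    return np.log(np.exp(scores).sum()) - scores[y]

def grad(W, phi, y):
    scores = W @ phi
    p = np.exp(scores - scores.max())
    p /= p.sum()                       # softmax probabilities
    e_y = np.zeros(W.shape[0])
    e_y[y] = 1.0
    return np.outer(p - e_y, phi)      # (p - e_y) phi^T

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 4))
phi = rng.normal(size=4)
y = 2
G = grad(W, phi, y)

# central finite differences, one coordinate at a time
eps = 1e-6
err = 0.0
for i in range(3):
    for j in range(4):
        Wp = W.copy(); Wp[i, j] += eps
        Wm = W.copy(); Wm[i, j] -= eps
        num = (loss(Wp, phi, y) - loss(Wm, phi, y)) / (2 * eps)
        err = max(err, abs(num - G[i, j]))
```

The two agree to within numerical precision; note also that the gradient's columns sum to zero, because the probabilities p sum to one.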
Logistic Regression Summary

• Define the conditional probability:

P_W(y|x) = exp(w_y · φ(x)) / Z_x

• Set the weights to maximize the conditional log-likelihood of the training data:

W = argmax_W ∑_t log P_W(y_t|x_t) = argmin_W ∑_t L(W; (x_t, y_t))

• Can find the gradient and run gradient descent (or any gradient-based optimization algorithm):

∇_W L(W; (x, y)) = ∑_{y'} P_W(y'|x) e_{y'} φ(x)^⊤ − e_y φ(x)^⊤
The Story So Far

• Naive Bayes is generative: maximizes the joint likelihood
  • closed-form solution (boils down to counting and normalizing)
• Logistic regression is discriminative: maximizes the conditional likelihood
  • also called log-linear model and max-entropy classifier
  • no closed-form solution
  • stochastic gradient updates look like

W^{k+1} = W^k + η ( e_y φ(x)^⊤ − ∑_{y'} P_W(y'|x) e_{y'} φ(x)^⊤ )

• The perceptron is a discriminative, non-probabilistic classifier
  • its updates (on a mistake, with predicted label ŷ) look like

W^{k+1} = W^k + e_y φ(x)^⊤ − e_ŷ φ(x)^⊤

SGD updates for logistic regression and the perceptron's updates look similar!
Maximizing Margin

• For a training set D
• The margin of a weight matrix W is the largest γ such that

w_{y_t} · φ(x_t) − w_{y'} · φ(x_t) ≥ γ

• for every training instance (x_t, y_t) ∈ D and every y' ≠ y_t
Margin

[Figure: the margin, with value denoted γ, illustrated at training and test time]
Maximizing Margin

• Intuitively, maximizing the margin makes sense
• More importantly, the generalization error on unseen test data is proportional to the inverse of the margin:

ε ∝ R² / (γ² · N)

• Perceptron:
  • If a training set is separable by some margin, the perceptron will find a W that separates the data
  • However, the perceptron does not pick W to maximize the margin!
Outline
1 Data and Feature Representation
2 Regression
3 Classification
Perceptron
Naive Bayes
Logistic Regression
Support Vector Machines
4 Regularization
5 Non-Linear Classifiers
Maximizing Margin

Let γ > 0:

max_{||U||=1} γ

such that: u_{y_t} · φ(x_t) − u_{y'} · φ(x_t) ≥ γ, ∀(x_t, y_t) ∈ D and y' ≠ y_t

• Note: the solution still ensures a separating hyperplane if there is one (zero training error), due to the hard constraint
• We fix ||U|| = 1, since scaling U to increase ‖U‖ would trivially produce a larger margin
Max Margin = Min Norm

Let γ > 0.

Max Margin:

max_{||U||=1} γ
such that: u_{y_t} · φ(x_t) − u_{y'} · φ(x_t) ≥ γ, ∀(x_t, y_t) ∈ D and y' ≠ y_t

is equivalent to

Min Norm:

min_W (1/2) ||W||²
such that: w_{y_t} · φ(x_t) − w_{y'} · φ(x_t) ≥ 1, ∀(x_t, y_t) ∈ D and y' ≠ y_t

• Instead of fixing ||U||, we fix the margin to 1
• Make the substitution W = U/γ; then ‖W‖ = ‖U‖/γ = 1/γ.
Support Vector Machines

W = argmin_W (1/2) ||W||²
such that: w_{y_t} · φ(x_t) − w_{y'} · φ(x_t) ≥ 1, ∀(x_t, y_t) ∈ D and y' ≠ y_t

• A quadratic programming problem, a well-known class of convex optimization problems
• Can be solved with many techniques.
Support Vector Machines

What if the data is not separable?

W = argmin_{W,ξ} (1/2) ||W||² + C ∑_{t=1}^N ξ_t
such that: w_{y_t} · φ(x_t) − w_{y'} · φ(x_t) ≥ 1 − ξ_t and ξ_t ≥ 0, ∀(x_t, y_t) ∈ D and y' ≠ y_t

The slack variables ξ_t trade off margin violations per example against ‖W‖.
A larger C means more examples classified correctly, but a smaller margin.
Kernels
Historically, SVMs and kernels co-occurred and were extremely popular.

We can "kernelize" algorithms to make them non-linear (not only SVMs, but also logistic regression, the perceptron, ...).
More later.
Support Vector Machines

W = argmin_{W,ξ} (1/2) ||W||² + C ∑_{t=1}^N ξ_t
such that: w_{y_t} · φ(x_t) − w_{y'} · φ(x_t) ≥ 1 − ξ_t, ∀y' ≠ y_t

If W classifies (x_t, y_t) with margin 1, the penalty is ξ_t = 0.
Otherwise, the penalty is ξ_t = 1 + max_{y'≠y_t} w_{y'} · φ(x_t) − w_{y_t} · φ(x_t).

Hinge loss:

L((x_t, y_t); W) = max(0, 1 + max_{y'≠y_t} w_{y'} · φ(x_t) − w_{y_t} · φ(x_t))
Support Vector Machines

Equivalently, using the most-violating class in the constraint:

W = argmin_{W,ξ} (1/2) ||W||² + C ∑_{t=1}^N ξ_t
such that: w_{y_t} · φ(x_t) − max_{y'≠y_t} w_{y'} · φ(x_t) ≥ 1 − ξ_t

(Same penalties ξ_t and hinge-loss interpretation as before.)
Support Vector Machines

Moving the max to the slack side of the constraint:

W = argmin_{W,ξ} (1/2) ||W||² + C ∑_{t=1}^N ξ_t
such that: ξ_t ≥ 1 + max_{y'≠y_t} w_{y'} · φ(x_t) − w_{y_t} · φ(x_t)

(Same penalties ξ_t and hinge-loss interpretation as before.)
Support Vector Machines

Scaling the objective by 1/C, with λ = 1/C:

W = argmin_{W,ξ} (λ/2) ||W||² + ∑_{t=1}^N ξ_t
such that: ξ_t ≥ 1 + max_{y'≠y_t} w_{y'} · φ(x_t) − w_{y_t} · φ(x_t)

If W classifies (x_t, y_t) with margin 1, the penalty is ξ_t = 0.
Otherwise, the penalty is ξ_t = 1 + max_{y'≠y_t} w_{y'} · φ(x_t) − w_{y_t} · φ(x_t).

Hinge loss:

L((x_t, y_t); W) = max(0, 1 + max_{y'≠y_t} w_{y'} · φ(x_t) − w_{y_t} · φ(x_t))
Support Vector Machines
$$W = \arg\min_{W,\,\xi}\; \frac{\lambda}{2}\|W\|^2 + \sum_{t=1}^{N} \xi_t \qquad \Big(\lambda = \frac{1}{C}\Big)$$

such that

$$\xi_t \ge 1 + \max_{y' \ne y_t} w_{y'} \cdot \phi(x_t) - w_{y_t} \cdot \phi(x_t)$$

Hinge loss equivalent:

$$W = \arg\min_{W} \sum_{t=1}^{N} \underbrace{\max\Big(0,\; 1 + \max_{y' \ne y_t} w_{y'} \cdot \phi(x_t) - w_{y_t} \cdot \phi(x_t)\Big)}_{L(W;\,(x_t, y_t))} + \frac{\lambda}{2}\|W\|^2$$
From Gradient to Subgradient
The hinge loss is a piecewise linear function—not differentiable everywhere
Cannot use gradient descent
But... can use subgradient descent (almost the same)!
Recap: Subgradient
• Defined for convex functions f : ℝ^D → ℝ
• Generalizes the notion of gradient: at points where f is differentiable, there is a single subgradient, which equals the gradient
• Other points may have multiple subgradients
Subgradient Descent
$$
\begin{aligned}
L(W; (x, y)) &= \max\Big(0,\; 1 + \max_{y' \ne y} w_{y'} \cdot \phi(x) - w_y \cdot \phi(x)\Big) \\
&= \Big(\max_{y' \in \mathcal{Y}}\; w_{y'} \cdot \phi(x) + [[y' \ne y]]\Big) - w_y \cdot \phi(x)
\end{aligned}
$$

A subgradient of the hinge loss is

$$\partial_W L(W; (x, y)) \ni e_{\hat{y}}\,\phi(x)^\top - e_y\,\phi(x)^\top$$

where

$$\hat{y} = \arg\max_{y' \in \mathcal{Y}}\; w_{y'} \cdot \phi(x) + [[y' \ne y]]$$

We can also train SVMs with (stochastic) subgradient descent!
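The stochastic subgradient update above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the slides' reference implementation: `svm_subgradient_step` is a hypothetical name, W stores one weight row per label, and the cost-augmented argmax implements arg max_{y'} w_{y'}·φ(x) + [[y' ≠ y]].

```python
import numpy as np

def svm_subgradient_step(W, phi, y, eta=0.1, lam=0.0):
    """One stochastic subgradient step on the multi-class hinge loss.

    W: (num_labels, D) weight matrix, phi: (D,) feature vector, y: gold label index.
    """
    scores = W @ phi
    cost = np.ones_like(scores)
    cost[y] = 0.0                      # cost-augmented scores: add [[y' != y]]
    y_hat = int(np.argmax(scores + cost))
    W = (1.0 - eta * lam) * W          # shrinkage from an optional l2 term
    if y_hat != y:                     # hinge active: subgradient e_yhat phi^T - e_y phi^T
        W[y_hat] = W[y_hat] - eta * phi
        W[y] = W[y] + eta * phi
    return W
```

When the margin constraint already holds (ŷ = y), the subgradient is zero and only the shrinkage applies.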
Perceptron and Hinge-Loss
The SVM subgradient update looks like the perceptron update:

$$W^{k+1} = W^k - \eta \begin{cases} 0 & \text{if } w_{y_t} \cdot \phi(x_t) - \max_{y \ne y_t} w_y \cdot \phi(x_t) \ge 1 \\ e_{\hat{y}}\,\phi(x_t)^\top - e_{y_t}\,\phi(x_t)^\top & \text{otherwise, where } \hat{y} = \arg\max_y\; w_y \cdot \phi(x_t) + [[y \ne y_t]] \end{cases}$$

Perceptron (with η = 1):

$$W^{k+1} = W^k - \eta \begin{cases} 0 & \text{if } w_{y_t} \cdot \phi(x_t) - \max_{y} w_y \cdot \phi(x_t) \ge 0 \\ e_{\hat{y}}\,\phi(x_t)^\top - e_{y_t}\,\phi(x_t)^\top & \text{otherwise, where } \hat{y} = \arg\max_y\; w_y \cdot \phi(x_t) \end{cases}$$

Perceptron = SGD with the no-margin hinge loss

$$\max\Big(0,\; \max_{y \ne y_t} w_y \cdot \phi(x_t) - w_{y_t} \cdot \phi(x_t)\Big)$$
Summary
What we have covered
• Linear Classifiers
• Naive Bayes
• Logistic Regression
• Perceptron
• Support Vector Machines
What is next
• Regularization
• Softmax and sparsemax
• Non-linear classifiers
Outline
1 Data and Feature Representation
2 Regression
3 Classification
Perceptron
Naive Bayes
Logistic Regression
Support Vector Machines
4 Regularization
5 Non-Linear Classifiers
Regularization
Overfitting
If the model is too complex (too many parameters) and the data is scarce, we run the risk of overfitting:
• We saw one example already when talking about add-one smoothing in Naive Bayes!
Regularization
In practice, we regularize models to prevent overfitting
$$\arg\min_{W}\; \sum_{t=1}^{N} L(W; (x_t, y_t)) + \lambda\,\Omega(W),$$

where Ω(W) is the regularization function and λ controls how much to regularize.

• Gaussian prior (ℓ2), promotes smaller weights:

$$\Omega(W) = \|W\|_2^2 = \sum_y \|w_y\|_2^2 = \sum_y \sum_j w_{y,j}^2$$

• Laplacian prior (ℓ1), promotes sparse weights!

$$\Omega(W) = \|W\|_1 = \sum_y \|w_y\|_1 = \sum_y \sum_j |w_{y,j}|$$
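The two penalties above are straightforward to compute. A minimal NumPy sketch (function names are illustrative): both sum over all labels y and features j of the weight matrix W.

```python
import numpy as np

def l2_penalty(W):
    # Gaussian prior: sum of squared weights over all labels and features
    return float(np.sum(W ** 2))

def l1_penalty(W):
    # Laplacian prior: sum of absolute weights; its minimizers tend to be sparse
    return float(np.sum(np.abs(W)))
```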
Empirical Risk Minimization
Logistic Regression with ℓ2 Regularization
$$\sum_{t=1}^{N} L(W; (x_t, y_t)) + \lambda\,\Omega(W) = -\sum_{t=1}^{N} \log\big(\exp(w_{y_t} \cdot \phi(x_t))/Z_x\big) + \frac{\lambda}{2}\|W\|^2$$

• What is the new gradient?

$$\sum_{t=1}^{N} \nabla_W L(W; (x_t, y_t)) + \nabla_W\, \lambda\,\Omega(W)$$

• We know ∇_W L(W; (x_t, y_t))
• Just need $\nabla_W \frac{\lambda}{2}\|W\|^2 = \lambda W$
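The regularized gradient above can be sketched for a single example. A minimal NumPy sketch under illustrative names: the unregularized part is −(e_y − softmax(z(x)))φ(x)^⊤, and the regularizer simply adds λW.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # shift by the max for numerical stability
    return e / e.sum()

def logreg_l2_gradient(W, phi, y, lam=0.0):
    """Gradient of -log [softmax(W phi)]_y + (lam/2) ||W||^2 w.r.t. W."""
    p = softmax(W @ phi)
    grad = np.outer(p, phi)     # softmax(z(x)) phi^T
    grad[y] -= phi              # minus e_y phi^T
    return grad + lam * W       # plus lam W from the regularizer
```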
Support Vector Machines
Hinge-loss formulation: ℓ2 regularization already happening!

$$
\begin{aligned}
W &= \arg\min_W\; \sum_{t=1}^{N} L(W; (x_t, y_t)) + \lambda\,\Omega(W) \\
&= \arg\min_W\; \sum_{t=1}^{N} \max\Big(0,\; 1 + \max_{y \ne y_t} w_y \cdot \phi(x_t) - w_{y_t} \cdot \phi(x_t)\Big) + \lambda\,\Omega(W) \\
&= \arg\min_W\; \sum_{t=1}^{N} \max\Big(0,\; 1 + \max_{y \ne y_t} w_y \cdot \phi(x_t) - w_{y_t} \cdot \phi(x_t)\Big) + \frac{\lambda}{2}\|W\|^2
\end{aligned}
$$

The last line is exactly the SVM optimization problem.
SVMs vs. Logistic Regression
$$W = \arg\min_W\; \sum_{t=1}^{N} L(W; (x_t, y_t)) + \lambda\,\Omega(W)$$

• SVMs / hinge loss:

$$L(W; (x_t, y_t)) = \max\Big(0,\; 1 + \max_{y \ne y_t}\big(w_y \cdot \phi(x_t) - w_{y_t} \cdot \phi(x_t)\big)\Big), \qquad \Omega(W) = \frac{1}{2}\|W\|^2$$

• Logistic regression / log loss:

$$L(W; (x_t, y_t)) = -\log\big(\exp(w \cdot \psi(x_t, y_t))/Z_x\big), \qquad \Omega(W) = \frac{1}{2}\|W\|^2$$
Loss Function
Should match as much as possible the metric we want to optimize at test time

Should be well-behaved (continuous, maybe smooth) to be amenable to optimization (this rules out the 0/1 loss)
Some examples:
• Squared loss for regression
• Negative log-likelihood (cross-entropy): multinomial logistic regression
• Hinge loss: support vector machines
• Sparsemax loss for multi-class and multi-label classification (next)
Recap
How to map from a vector of label scores z ∈ ℝ^|Y| to a probability distribution p over Y?
We already saw one example: softmax.
Next: sparsemax.
Recap: Softmax Transformation
The typical transformation for multi-class classification is softmax : ℝ^|Y| → Δ^{|Y|−1}:

$$\mathrm{softmax}(z) = \left[\frac{\exp(z_1)}{\sum_c \exp(z_c)},\; \ldots,\; \frac{\exp(z_{|Y|})}{\sum_c \exp(z_c)}\right]$$

• Underlies multinomial logistic regression!
• Strictly positive, sums to 1
• Resulting distribution has full support: softmax(z) > 0, ∀z
• A disadvantage if a sparse probability distribution is desired
• Common workaround: threshold and truncate
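The full-support property is easy to see in code. A minimal sketch of softmax (the max-shift is a standard numerical-stability trick, not part of the definition): every output coordinate is strictly positive, however small.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max so exp never overflows
    return e / e.sum()
```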
Sparsemax (Martins and Astudillo, 2016)
A sparse-friendly alternative is sparsemax : ℝ^|Y| → Δ^{|Y|−1}, defined as:

$$\mathrm{sparsemax}(z) := \arg\min_{p \in \Delta^{|Y|-1}} \|p - z\|^2$$
• In words: Euclidean projection of z onto the probability simplex
• Likely to hit the boundary of the simplex, in which casesparsemax(z) becomes sparse (hence the name)
• Retains many of the properties of softmax (e.g. differentiability), with the added ability to produce sparse distributions
• Projecting onto the simplex amounts to a soft-thresholding operation
• Efficient linear time forward/backward propagation (see paper)
Sparsemax in Closed Form
• Projecting onto the simplex amounts to a soft-thresholding operation:

$$\mathrm{sparsemax}_i(z) = \max\{0,\; z_i - \tau\}$$

where τ is a normalizing constant such that $\sum_j \max\{0,\; z_j - \tau\} = 1$

• To evaluate the sparsemax, all we need is to compute τ
• Coordinates above the threshold will be shifted by this amount; the others will be truncated to zero
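The threshold τ can be found by sorting the scores. A minimal sketch of this projection (a sorting-based routine, assumed here rather than copied from the paper's pseudocode): after sorting z in decreasing order, the support is the largest prefix where 1 + k·z_(k) > Σ_{j≤k} z_(j), and τ averages the support's excess mass.

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection of z onto the probability simplex."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]                  # scores in decreasing order
    k = np.arange(1, z.size + 1)
    cum = np.cumsum(z_sorted)
    support = 1.0 + k * z_sorted > cum           # coordinates that stay positive
    rho = k[support][-1]                         # size of the support
    tau = (cum[support][-1] - 1.0) / rho         # the soft threshold
    return np.maximum(z - tau, 0.0)
```

The whole computation is dominated by the sort, and the output sums to 1 with exact zeros outside the support.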
Two Dimensions
• Parametrize z = (t, 0)
• The 2D softmax is the logistic (sigmoid) function:

$$\mathrm{softmax}_1(z) = (1 + \exp(-t))^{-1}$$

• The 2D sparsemax is the "hard" version of the sigmoid:

(plot of softmax₁([t, 0]) and sparsemax₁([t, 0]) for t ∈ [−3, 3])
Three Dimensions
• Parameterize z = (t1, t2, 0) and plot softmax₁(z) and sparsemax₁(z) as a function of t1 and t2
• sparsemax is piecewise linear, but asymptotically similar to softmax
Loss Function
How to use sparsemax as a loss function?
Caveat: sparsemax is sparse and we don’t want to take the log of zero...
Recap: Multinomial Logistic Regression
• The common choice for a softmax output layer
• The classifier estimates P(y = c | x ;W )
• We minimize the negative log-likelihood:

$$
\begin{aligned}
L(W; (x, y)) &= -\log P(y \mid x; W) \\
&= -\log\,[\mathrm{softmax}(z(x))]_y,
\end{aligned}
$$

where z_c(x) = w_c · φ(x) is the score of class c.

• Loss gradient:

$$\nabla_W L(W; (x, y)) = -\big(e_y\,\phi(x)^\top - \mathrm{softmax}(z(x))\,\phi(x)^\top\big)$$
Sparsemax Loss (Martins and Astudillo, 2016)
• The natural choice for a sparsemax output layer
• The neural network estimates P(y | x; W) as a sparse distribution
• The sparsemax loss is

$$L(W; (x, y)) = -z_y(x) + \frac{1}{2} - \frac{1}{2}\|\mathrm{sparsemax}(z(x))\|^2 + z(x)^\top \mathrm{sparsemax}(z(x)),$$

where z_y(x) = w_y · φ(x).

• Loss gradient:

$$\nabla_W L(W; (x, y)) = -\big(e_y\,\phi(x)^\top - \mathrm{sparsemax}(z(x))\,\phi(x)^\top\big)$$
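The loss and its gradient follow directly from the formulas above. A self-contained sketch (the compact sparsemax routine and function names are illustrative assumptions); note the gradient here is taken with respect to the scores z, so chaining with φ(x)^⊤ gives the gradient in W shown above.

```python
import numpy as np

def sparsemax(z):
    z_sorted = np.sort(z)[::-1]
    k = np.arange(1, z.size + 1)
    cum = np.cumsum(z_sorted)
    support = 1.0 + k * z_sorted > cum
    rho = k[support][-1]
    tau = (cum[support][-1] - 1.0) / rho
    return np.maximum(z - tau, 0.0)

def sparsemax_loss(z, y):
    """-z_y + 1/2 - 1/2 ||sparsemax(z)||^2 + z . sparsemax(z)"""
    p = sparsemax(z)
    return float(-z[y] + 0.5 - 0.5 * np.dot(p, p) + np.dot(z, p))

def sparsemax_loss_grad_z(z, y):
    """Gradient w.r.t. the scores: -(e_y - sparsemax(z))."""
    g = sparsemax(z).copy()
    g[y] -= 1.0
    return g
```

The constant 1/2 makes the loss exactly zero when sparsemax(z) puts all mass on the gold label.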
Classification Losses (Binary Case)
• Let the correct label be y = +1 and define s = z₂ − z₁.
• The sparsemax loss in 2D becomes a "classification Huber loss":
Outline
1 Data and Feature Representation
2 Regression
3 Classification
Perceptron
Naive Bayes
Logistic Regression
Support Vector Machines
4 Regularization
5 Non-Linear Classifiers
Recap: What a Linear Classifier Can Do
• It can solve linearly separable problems (OR, AND)
Recap: What a Linear Classifier Can’t Do
• ... but it can't solve non-linearly separable problems such as simple XOR (unless the input is transformed into a better representation):
• This was observed by Minsky and Papert (1969) (for the perceptron) and motivated strong criticisms
Summary: Linear Classifiers
We’ve seen
• Perceptron
• Naive Bayes
• Logistic regression
• Support vector machines
All lead to convex optimization problems ⇒ no issues with local minima/initialization

All assume the features are well-engineered such that the data is nearly linearly separable
What If Data Are Not Linearly Separable?
Engineer better features (often works!)
Kernel methods:

• work implicitly in a high-dimensional feature space
• ... but still need to choose/design a good kernel
• model capacity confined to positive-definite kernels

Neural networks (next class!)

• embrace non-convexity and local minima
• instead of engineering features/kernels, engineer the model architecture
Two Views of Machine Learning
There are two main ways of building machine learning systems:

1 Feature-based: describe objects' properties (features) and build models that manipulate them
• everything that we have seen so far.

2 Similarity-based: don't describe objects by their properties; rather, build systems based on comparing objects to each other
• k-nearest neighbors; kernel methods; Gaussian processes.
Sometimes the two are equivalent!
Nearest Neighbor Classifier
• Not a linear classifier!
• In its simplest version, doesn’t require any parameters
• Instead of "training", memorize all the data D = {(x_i, y_i)}_{i=1}^N
• Given a new input x, find its most similar data point x_i and predict ŷ = y_i
• Many variants (e.g. k-th nearest neighbor)
• Disadvantage: requires searching over the entire training data
• Specialized data structures can be used to speed up search.
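The simplest version above fits in a few lines. A minimal NumPy sketch (the function name is illustrative, and this is the brute-force search the slide warns about, with no speed-up structures):

```python
import numpy as np

def nn_classify(X_train, y_train, x):
    """Predict the label of the closest training point (Euclidean distance)."""
    dists = np.linalg.norm(np.asarray(X_train) - np.asarray(x), axis=1)
    return y_train[int(np.argmin(dists))]
```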
Kernels
• A kernel is a similarity function between two points that is symmetric and positive semi-definite, which we denote by:

$$\kappa(x_i, x_j) \in \mathbb{R}$$

• Given a dataset D = {(x_i, y_i)}_{i=1}^N, the Gram matrix K is the N × N matrix defined as:

$$K_{i,j} = \kappa(x_i, x_j)$$

• Symmetric: κ(x_i, x_j) = κ(x_j, x_i)
• Positive semi-definite: for all v,

$$v K v^\top \ge 0$$
Kernels
• Mercer's Theorem: for any kernel κ : X × X → ℝ, there exists some feature mapping φ : X → ℝ^X such that:

$$\kappa(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$$

• That is: a kernel corresponds to a dot product in some implicit feature space!
• Kernel trick: take a feature-based algorithm (SVMs, perceptron, logistic regression) and replace all explicit feature computations by kernel evaluations:

$$w_y \cdot \phi(x) = \sum_{i=1}^{N} \alpha_{i,y}\, \kappa(x, x_i) \quad \text{for some } \alpha_{i,y} \in \mathbb{R}$$

• Extremely popular idea in the 1990s-2000s!
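The kernel-trick identity above can be sketched directly: instead of storing w_y, store dual coefficients α_{i,y} and score a new input by kernel evaluations against the training points. Names here are illustrative, not an API from the lecture.

```python
import numpy as np

def linear_kernel(u, v):
    return float(np.dot(u, v))   # kappa(u, v) = phi(u) . phi(v) with phi = identity

def kernel_score(alpha_y, X_train, x, kernel=linear_kernel):
    """w_y . phi(x) rewritten as sum_i alpha_{i,y} kappa(x, x_i)."""
    return sum(a * kernel(x, x_i) for a, x_i in zip(alpha_y, X_train))
```

Swapping `linear_kernel` for a non-linear kernel changes the implicit feature space without touching the rest of the algorithm.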
Kernels = Tractable Non-Linearity
• A linear classifier in a higher dimensional feature space is a non-linear classifier in the original space
• Computing a non-linear kernel is sometimes computationally cheaper than calculating the corresponding dot product in the high-dimensional feature space
• Many models can be "kernelized" – learning algorithms generally solve the dual optimization problem (also convex)
• Drawback: quadratic dependency on dataset size
Linear Classifiers in High Dimension
Popular Kernels
• Polynomial kernel:

$$\kappa(x_i, x_j) = (\phi(x_i) \cdot \phi(x_j) + 1)^d$$

• Gaussian radial basis kernel:

$$\kappa(x_i, x_j) = \exp\left(-\frac{\|\phi(x_i) - \phi(x_j)\|^2}{2\sigma}\right)$$

• String kernels (Lodhi et al., 2002; Collins and Duffy, 2002)
• Tree kernels (Collins and Duffy, 2002)
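The two closed-form kernels above, together with the Gram matrix from the earlier slide, can be sketched as follows (a sketch with illustrative names, taking φ to be the identity and following the slide's normalization of the RBF exponent):

```python
import numpy as np

def poly_kernel(u, v, d=2):
    # (phi(u) . phi(v) + 1)^d, with phi = identity for illustration
    return (np.dot(u, v) + 1.0) ** d

def rbf_kernel(u, v, sigma=1.0):
    # exp(-||phi(u) - phi(v)||^2 / (2 sigma))
    diff = np.asarray(u, dtype=float) - np.asarray(v, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma))

def gram_matrix(X, kernel):
    n = len(X)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(X[i], X[j])
    return K
```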
Joint Feature Mappings (useful for the labs)
Feature Representations: Joint Feature Mappings
For multi-class/structured classification, a joint feature map ψ : X × Y → ℝ^D is sometimes more convenient

• ψ(x, y) instead of φ(x)
Each feature now represents a joint property of the input x and the candidate output y.
We’ll use this notation in the labs this afternoon!
Examples
• x is a document and y is a label
$$\psi_j(x, y) = \begin{cases} 1 & \text{if } x \text{ contains the word “interest” and } y = \text{“financial”} \\ 0 & \text{otherwise} \end{cases}$$

ψ_j(x, y) = % of words in x with punctuation, and y = "scientific"

• x is a word and y is a part-of-speech tag

$$\psi_j(x, y) = \begin{cases} 1 & \text{if } x = \text{“bank” and } y = \text{Verb} \\ 0 & \text{otherwise} \end{cases}$$
More Examples
• x is a name, y is a label classifying the type of entity
• ψ₀(x, y) = 1 if x contains "George" and y = "Person", 0 otherwise
• ψ₁(x, y) = 1 if x contains "Washington" and y = "Person", 0 otherwise
• ψ₂(x, y) = 1 if x contains "Bridge" and y = "Person", 0 otherwise
• ψ₃(x, y) = 1 if x contains "General" and y = "Person", 0 otherwise
• ψ₄(x, y) = 1 if x contains "George" and y = "Location", 0 otherwise
• ψ₅(x, y) = 1 if x contains "Washington" and y = "Location", 0 otherwise
• ψ₆(x, y) = 1 if x contains "Bridge" and y = "Location", 0 otherwise
• ψ₇(x, y) = 1 if x contains "General" and y = "Location", 0 otherwise
• x=General George Washington, y=Person → ψ(x , y) = [1 1 0 1 0 0 0 0]
• x=George Washington Bridge, y=Location → ψ(x , y) = [0 0 0 0 1 1 1 0]
• x=George Washington George, y=Location → ψ(x , y) = [0 0 0 0 1 1 0 0]
Block Feature Vectors
• x=General George Washington, y=Person → ψ(x , y) = [1 1 0 1 0 0 0 0]
• x=General George Washington, y=Location → ψ(x , y) = [0 0 0 0 1 1 0 1]
• x=George Washington Bridge, y=Location → ψ(x , y) = [0 0 0 0 1 1 1 0]
• x=George Washington George, y=Location → ψ(x , y) = [0 0 0 0 1 1 0 0]
• Each equal size block of the feature vector corresponds to one label
• Non-zero values allowed only in one block
Feature Representations – φ(x) vs. ψ(x , y)
Equivalent if ψ(x, y) conjoins input features φ(x) with one-hot label representations e_y := [0, …, 0, 1, 0, …, 0]:

$$\psi(x, y) = \phi(x) \otimes e_y = [\,0, \ldots, 0, \underbrace{\phi(x)}_{y\text{-th block}}, 0, \ldots, 0\,]$$

• φ(x)
  • x=General George Washington → φ(x) = [1 1 0 1]
• ψ(x, y)
  • x=General George Washington, y=Person → ψ(x, y) = [1 1 0 1 0 0 0 0]
  • x=General George Washington, y=Object → ψ(x, y) = [0 0 0 0 1 1 0 1]

φ(x) is sometimes simpler and more convenient in binary classification

... but ψ(x, y) is more expressive (allows more complex features over properties of labels)
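The block construction above is a one-liner in NumPy. A minimal sketch (the function name is illustrative; under NumPy's Kronecker-product convention, placing φ(x) in the y-th block corresponds to `np.kron(e_y, phi)`):

```python
import numpy as np

def joint_features(phi, y, num_labels):
    """psi(x, y): copy phi(x) into the y-th block of a zero vector."""
    e_y = np.zeros(num_labels)
    e_y[y] = 1.0
    # kron(e_y, phi) zeroes out every block except the y-th, which holds phi
    return np.kron(e_y, phi)
```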
Linear Classifiers – ψ(x , y)
• Parametrized by a weight vector w ∈ ℝ^D (one weight per feature)
• The score (or probability) of a particular label is based on a linear combination of features and their weights
• At test time (known w), predict the class ŷ which maximizes this score:

$$\hat{y} = h(x) = \arg\max_{y \in \mathcal{Y}}\; w \cdot \psi(x, y)$$

• At training time, different strategies to learn w yield different linear classifiers: perceptron, naïve Bayes, logistic regression, SVMs, ...
Andre Martins (IST) Linear Classifiers LxMLS 2020 153 / 157
Linear Classifiers – φ(x)
• Define |Y| weight vectors w_y ∈ R^D, i.e., one weight vector per output label y
• Classification:

    ŷ = arg max_{y∈Y} w_y · φ(x)

• ψ(x, y):
  • x=General George Washington, y=Person → ψ(x, y) = [1 1 0 1 0 0 0 0]
  • x=General George Washington, y=Object → ψ(x, y) = [0 0 0 0 1 1 0 1]
  • A single weight vector w ∈ R^8
• φ(x):
  • x=General George Washington → φ(x) = [1 1 0 1]
  • Two parameter vectors w_0 ∈ R^4, w_1 ∈ R^4
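The two parameterizations give identical scores when the single w is the concatenation of the per-label vectors w_0 and w_1; a small check with illustrative weights:

```python
import numpy as np

phi_x = np.array([1.0, 1.0, 0.0, 1.0])

# Formulation 1: one weight vector per label, score w_y . phi(x)
w0 = np.array([0.5, 0.2, 0.0, 0.3])
w1 = np.array([0.1, 0.0, 0.0, 0.1])
pred_per_label = int(np.argmax([w0 @ phi_x, w1 @ phi_x]))

# Formulation 2: single stacked w in R^8 with joint features psi(x, y)
w = np.concatenate([w0, w1])

def psi(phi_x, y, num_labels):
    e_y = np.zeros(num_labels)
    e_y[y] = 1.0
    return np.kron(e_y, phi_x)

pred_joint = int(np.argmax([w @ psi(phi_x, y, 2) for y in range(2)]))

assert pred_per_label == pred_joint  # the two views agree
```

Because w · ψ(x, y) picks out exactly the y-th block of w, each joint score equals the corresponding per-label score w_y · φ(x), so the arg max is the same.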
Conclusions
• Linear classifiers are a broad class that includes well-known ML methods such as the perceptron, Naive Bayes, logistic regression, and support vector machines
• They all involve manipulating weights and features
• They lead either to closed-form solutions or to convex optimization problems (no local minima)
• Stochastic gradient descent algorithms are useful when training datasets are large
• However, they require manual specification of feature representations
• Tomorrow: methods that are able to learn internal representations
Thank You!
Post-Doc Openings for the ERC project DeepSPIN (Deep Structured Prediction in NLP)
• 1 post-doc position available
• Topics: deep learning, structured prediction, NLP, machine translation
• Involving University of Lisbon and Unbabel
• More details: https://deep-spin.github.io