Machine Learning for NLP
Perceptron Learning
Joakim Nivre
Uppsala University, Department of Linguistics and Philology
Slides adapted from Ryan McDonald, Google Research
Machine Learning for NLP 1(25)
Introduction

Linear Classifiers

- Classifiers covered so far:
  - Decision trees
  - Nearest neighbor
- Next three lectures: linear classifiers
  - Statistics from Google Scholar (October 2009):
    - "Maximum Entropy" & "NLP": 2660 hits, 141 before 2000
    - "SVM" & "NLP": 2210 hits, 16 before 2000
    - "Perceptron" & "NLP": 947 hits, 118 before 2000
- All are linear classifiers and basic tools in NLP
- They are also the bridge to deep learning techniques
Outline

- Today:
  - Preliminaries: input/output, features, etc.
  - Perceptron
  - Assignment 2
- Next time:
  - Large-margin classifiers (SVMs, MIRA)
  - Logistic regression (Maximum Entropy)
- Final session:
  - Naive Bayes classifiers
  - Generative and discriminative models
Preliminaries

Inputs and Outputs

- Input: x ∈ X
  - e.g., a document or sentence with some words x = w1 ... wn, or a sequence of previous actions
- Output: y ∈ Y
  - e.g., a parse tree, document class, part-of-speech tags, word sense
- Input/output pair: (x, y) ∈ X × Y
  - e.g., a document x and its label y
  - Sometimes x is explicit in y, e.g., a parse tree y will contain the sentence x
Feature Representations

- We assume a mapping from input–output pairs (x, y) to a high-dimensional feature vector
  - f(x, y) : X × Y → R^m
- In some cases, e.g., binary classification with Y = {−1, +1}, we can map only from the input to the feature space
  - f(x) : X → R^m
- However, most problems in NLP require more than two classes, so we focus on the multi-class case
- For any vector v ∈ R^m, let v_j be the j-th value
Features and Classes

- All features must be numerical
  - Numerical features are represented directly as f_i(x, y) ∈ R
  - Binary (boolean) features are represented as f_i(x, y) ∈ {0, 1}
- Multinomial (categorical) features must be binarized
  - Instead of: f_i(x, y) ∈ {v_0, ..., v_p}
  - We have: f_{i+0}(x, y) ∈ {0, 1}, ..., f_{i+p}(x, y) ∈ {0, 1}
  - Such that: f_{i+j}(x, y) = 1 iff f_i(x, y) = v_j
- We need distinct features for distinct output classes
  - Instead of: f_i(x) (1 ≤ i ≤ m)
  - We have: f_{i+0m}(x, y), ..., f_{i+Nm}(x, y) for Y = {0, ..., N}
  - Such that: f_{i+jm}(x, y) = f_i(x) iff y = y_j
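The binarization scheme above can be sketched in a few lines of Python; the category values here are made up for illustration:

```python
# Binarize one categorical feature into p+1 boolean indicator
# features, as described above (hypothetical values).
VALUES = ["Noun", "Verb", "Adj"]  # v_0, ..., v_p

def binarize(value):
    """Return one indicator feature per possible value."""
    return [1 if value == v else 0 for v in VALUES]
```

Exactly one indicator fires per input: `binarize("Verb")` yields `[0, 1, 0]`.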
Examples

- x is a document and y is a label

  f_j(x, y) = 1 if x contains the word "interest" and y = "financial"; 0 otherwise

  f_j(x, y) = % of words in x with punctuation, and y = "scientific"

- x is a word and y is a part-of-speech tag

  f_j(x, y) = 1 if x = "bank" and y = Verb; 0 otherwise
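The word/tag feature above translates directly into a function; this is just a sketch, not the assignment's starter code:

```python
def f_j(x, y):
    """Joint feature from the example: fires only for the word
    "bank" paired with the tag Verb."""
    return 1 if x == "bank" and y == "Verb" else 0
```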
Examples

- x is a name, y is a label classifying the name

  f_0(x, y) = 1 if x contains "George" and y = "Person"; 0 otherwise
  f_1(x, y) = 1 if x contains "Washington" and y = "Person"; 0 otherwise
  f_2(x, y) = 1 if x contains "Bridge" and y = "Person"; 0 otherwise
  f_3(x, y) = 1 if x contains "General" and y = "Person"; 0 otherwise
  f_4(x, y) = 1 if x contains "George" and y = "Object"; 0 otherwise
  f_5(x, y) = 1 if x contains "Washington" and y = "Object"; 0 otherwise
  f_6(x, y) = 1 if x contains "Bridge" and y = "Object"; 0 otherwise
  f_7(x, y) = 1 if x contains "General" and y = "Object"; 0 otherwise

- x = General George Washington, y = Person → f(x, y) = [1 1 0 1 0 0 0 0]
- x = George Washington Bridge, y = Object → f(x, y) = [0 0 0 0 1 1 1 0]
- x = George Washington George, y = Object → f(x, y) = [0 0 0 0 1 1 0 0]
Block Feature Vectors

- x = General George Washington, y = Person → f(x, y) = [1 1 0 1 0 0 0 0]
- x = George Washington Bridge, y = Object → f(x, y) = [0 0 0 0 1 1 1 0]
- x = George Washington George, y = Object → f(x, y) = [0 0 0 0 1 1 0 0]
- One equal-size block of the feature vector for each label
- Input features duplicated in each block
- Non-zero values allowed only in one block
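The block layout can be sketched as follows, reproducing the name example (the word list and labels mirror the slides):

```python
LABELS = ["Person", "Object"]
WORDS = ["George", "Washington", "Bridge", "General"]

def f(x, y):
    """Block feature vector: one block of input features per label;
    only the block for the candidate label y may be non-zero."""
    vec = [0] * (len(WORDS) * len(LABELS))
    offset = LABELS.index(y) * len(WORDS)
    for i, w in enumerate(WORDS):
        if w in x.split():
            vec[offset + i] = 1
    return vec
```

For instance, `f("General George Washington", "Person")` gives `[1, 1, 0, 1, 0, 0, 0, 0]`, matching the example above.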
Linear Classifiers

Linear Classifiers

- Linear classifier: the score (or probability) of a particular classification is based on a linear combination of features and their weights
- Let w ∈ R^m be a high-dimensional weight vector
- If we assume that w is known, then we define our classifier as
- Multiclass classification: Y = {0, 1, ..., N}

  y = argmax_y w · f(x, y)
    = argmax_y Σ_{j=0}^{m} w_j × f_j(x, y)

- Binary classification is just a special case of multiclass
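A minimal sketch of the argmax classifier; the toy feature map and weights below are invented for illustration:

```python
def dot(w, v):
    return sum(wi * vi for wi, vi in zip(w, v))

def predict(w, x, labels, f):
    """y_hat = argmax_y w · f(x, y): score every candidate label
    and return the highest-scoring one."""
    return max(labels, key=lambda y: dot(w, f(x, y)))

# Toy block feature map over two input features and two labels.
def f(x, y):
    block = [1 if t in x else 0 for t in ("bank", "run")]
    return block + [0, 0] if y == "Noun" else [0, 0] + block

w = [1.0, 0.0, 0.0, 2.0]
```

With these weights, `predict(w, "they run", ["Noun", "Verb"], f)` returns `"Verb"`.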
Linear Classifiers – Bias Terms

- Linear classifiers are often presented as

  y = argmax_y Σ_{j=0}^{m} w_j × f_j(x, y) + b_y

- where b_y is a bias or offset term
- But this can be folded into f

  x = General George Washington, y = Person → f(x, y) = [1 1 0 1 1 0 0 0 0 0]
  x = General George Washington, y = Object → f(x, y) = [0 0 0 0 0 1 1 0 1 1]

  f_4(x, y) = 1 if y = "Person"; 0 otherwise
  f_9(x, y) = 1 if y = "Object"; 0 otherwise

- w_4 and w_9 are now the bias terms for the labels
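Folding the bias into f amounts to adding one always-on indicator feature per label. A sketch with a hypothetical helper name (the slides place the indicator inside each label's block; here it is appended at the end, which is equivalent up to reordering the features):

```python
def fold_bias(f_xy, y, labels):
    """Append one per-label indicator feature; the weight learned
    at that position then acts as the bias b_y for label y."""
    return f_xy + [1 if y == lab else 0 for lab in labels]
```

For example, `fold_bias([1, 1, 0], "Person", ["Person", "Object"])` gives `[1, 1, 0, 1, 0]`.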
Binary Linear Classifier

Divides all points: [figure omitted]
Multiclass Linear Classifier

Defines regions of space: [figure omitted]

- i.e., + are all points (x, y) where + = argmax_y w · f(x, y)
Separability

- A set of points is separable if there exists a w such that classification is perfect

  [figure: Separable vs. Not Separable]

- This can also be defined mathematically (and we will shortly)
Supervised Learning – How to Find w

- Input: training examples T = {(x_t, y_t)}_{t=1}^{|T|}
- Input: feature representation f
- Output: w that maximizes/minimizes some important function on the training set
  - minimize error (Perceptron, SVMs, Boosting)
  - maximize likelihood of data (Logistic Regression, Naive Bayes)
- Assumption: the training data is separable
  - Not necessary, just makes life easier
  - There is a lot of good work in machine learning to tackle the non-separable case
Perceptron

- Choose a w that minimizes error:

  w = argmin_w Σ_t ( 1 − 1[y_t = argmax_y w · f(x_t, y)] )

  where 1[p] = 1 if p is true; 0 otherwise

- This is a 0–1 loss function
- Aside: when minimizing error, people tend to use hinge loss or other smoother loss functions
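The 0–1 objective is easy to compute directly, even though it is hard to minimize; a sketch with hypothetical helper names:

```python
def zero_one_loss(w, data, labels, f):
    """Sum of 1 - 1[y_t == argmax_y w · f(x_t, y)] over the
    training set, i.e., the number of misclassified examples."""
    def score(x, y):
        return sum(wi * fi for wi, fi in zip(w, f(x, y)))
    return sum(1 for x, y_gold in data
               if max(labels, key=lambda y: score(x, y)) != y_gold)
```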
Perceptron Learning Algorithm

Training data: T = {(x_t, y_t)}_{t=1}^{|T|}

1. w^(0) = 0; i = 0
2. for n : 1..N
3.   for t : 1..T
4.     Let y′ = argmax_y w^(i) · f(x_t, y)
5.     if y′ ≠ y_t
6.       w^(i+1) = w^(i) + f(x_t, y_t) − f(x_t, y′)
7.       i = i + 1
8. return w^(i)
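The pseudocode above maps onto a short Python sketch (the feature map and data used to exercise it are invented; the assignment's starter code will differ):

```python
def train_perceptron(data, labels, f, dim, epochs=10):
    """Multiclass perceptron: predict with the current weights and,
    on a mistake, add f(x, gold) and subtract f(x, predicted)."""
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y_gold in data:
            y_pred = max(labels, key=lambda y: sum(
                wi * fi for wi, fi in zip(w, f(x, y))))
            if y_pred != y_gold:
                for j, (g, p) in enumerate(zip(f(x, y_gold), f(x, y_pred))):
                    w[j] += g - p
    return w
```

Note that the update leaves w unchanged on correctly classified examples, so once the data is fit, the weights stop moving.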
Perceptron: Separability and Margin

- Given a training instance (x_t, y_t), define:
  - Y_t = Y − {y_t}
  - i.e., Y_t is the set of incorrect labels for x_t
- A training set T is separable with margin γ > 0 if there exists a vector u with ||u|| = 1 such that:

  u · f(x_t, y_t) − u · f(x_t, y′) ≥ γ

  for all y′ ∈ Y_t, where ||u|| = √(Σ_j u_j²) (Euclidean or L2 norm)

- Assumption: the training set is separable with margin γ
Perceptron: Main Theorem

- Theorem: For any training set separable with a margin of γ, the following holds for the perceptron algorithm:

  mistakes made during training ≤ R² / γ²

  where R ≥ ||f(x_t, y_t) − f(x_t, y′)|| for all (x_t, y_t) ∈ T and y′ ∈ Y_t

- Thus, after a finite number of training iterations, the error on the training set will converge to zero
- For proof, see Collins (2002)
Practical Considerations

- The perceptron is sensitive to the order of training examples
  - Consider: 500 positive instances followed by 500 negative instances
- Shuffling:
  - Randomly permute the training instances between iterations
- Voting and averaging:
  - Let w_1, ..., w_n be all the weight vectors seen in training
  - The voted perceptron predicts the majority vote of w_1, ..., w_n
  - The averaged perceptron predicts using the average vector:

    w = (1/n) Σ_{i=1}^{n} w_i
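Shuffling and averaging slot straight into the training loop; a sketch under the same toy setup as before (not the assignment's actual code):

```python
import random

def train_averaged(data, labels, f, dim, epochs=10, seed=0):
    """Averaged perceptron: shuffle between epochs and return the
    average of the weight vector over every training step (a cheap
    stand-in for the voted perceptron)."""
    rng = random.Random(seed)
    data = list(data)
    w = [0.0] * dim
    total = [0.0] * dim   # running sum of all weight vectors seen
    n = 0
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y_gold in data:
            y_pred = max(labels, key=lambda y: sum(
                wi * fi for wi, fi in zip(w, f(x, y))))
            if y_pred != y_gold:
                for j, (g, p) in enumerate(zip(f(x, y_gold), f(x, y_pred))):
                    w[j] += g - p
            for j in range(dim):
                total[j] += w[j]
            n += 1
    return [t / n for t in total]
```

Summing w after every instance rather than storing all n vectors keeps memory constant while computing the same average.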
Perceptron Summary

- Learns a linear classifier that minimizes error
- Guaranteed to find a w in a finite amount of time
- Improvement 1: shuffle the training data between iterations
- Improvement 2: average the weight vectors seen during training
- The perceptron is an example of an online learning algorithm
  - w is updated based on a single training instance in isolation:

    w^(i+1) = w^(i) + f(x_t, y_t) − f(x_t, y′)

- Compare decision trees, which perform batch learning
  - All training instances are used to find the best split
Assignment 2

- Implement the perceptron (starter code in Python)
- Three subtasks:
  - Implement the linear classifier (dot product)
  - Implement the perceptron update
  - Evaluate on the spambase data set
- For VG:
  - Implement the averaged perceptron
Appendix
Proofs and Derivations
Convergence Proof for Perceptron

Setup, restating the update from the learning algorithm:

- w^(k−1) are the weights before the kth mistake
- Suppose the kth mistake is made at the tth example, (x_t, y_t)
- y′ = argmax_y w^(k−1) · f(x_t, y), with y′ ≠ y_t
- w^(k) = w^(k−1) + f(x_t, y_t) − f(x_t, y′)

Then:

- u · w^(k) = u · w^(k−1) + u · (f(x_t, y_t) − f(x_t, y′)) ≥ u · w^(k−1) + γ
- Since w^(0) = 0 and u · w^(0) = 0, by induction on k, u · w^(k) ≥ kγ
- Since u · w^(k) ≤ ||u|| × ||w^(k)|| and ||u|| = 1, it follows that ||w^(k)|| ≥ kγ
- Also:

  ||w^(k)||² = ||w^(k−1)||² + ||f(x_t, y_t) − f(x_t, y′)||² + 2 w^(k−1) · (f(x_t, y_t) − f(x_t, y′))
  ||w^(k)||² ≤ ||w^(k−1)||² + R²

  (since R ≥ ||f(x_t, y_t) − f(x_t, y′)|| and w^(k−1) · f(x_t, y_t) − w^(k−1) · f(x_t, y′) ≤ 0)
Convergence Proof for Perceptron (continued)

- We have just shown that ||w^(k)|| ≥ kγ and ||w^(k)||² ≤ ||w^(k−1)||² + R²
- By induction on k, since w^(0) = 0 and ||w^(0)||² = 0:

  ||w^(k)||² ≤ kR²

- Therefore:

  k²γ² ≤ ||w^(k)||² ≤ kR²

- Solving for k:

  k ≤ R² / γ²

- Therefore the number of errors is bounded!