Machine Learning for NLP
Perceptron Learning
Joakim Nivre
Uppsala University, Department of Linguistics and Philology
Slides adapted from Ryan McDonald, Google Research
Machine Learning for NLP 1(25)
Introduction

Linear Classifiers

- Classifiers covered so far:
  - Decision trees
  - Nearest neighbor
- Next three lectures: linear classifiers
  - Statistics from Google Scholar (October 2009):
    - "Maximum Entropy" & "NLP": 2660 hits, 141 before 2000
    - "SVM" & "NLP": 2210 hits, 16 before 2000
    - "Perceptron" & "NLP": 947 hits, 118 before 2000
- All are linear classifiers and basic tools in NLP
- They are also the bridge to deep learning techniques
Outline

- Today:
  - Preliminaries: input/output, features, etc.
  - Perceptron
  - Assignment 2
- Next time:
  - Large-margin classifiers (SVMs, MIRA)
  - Logistic regression (Maximum Entropy)
- Final session:
  - Naive Bayes classifiers
  - Generative and discriminative models
Preliminaries

Inputs and Outputs

- Input: x ∈ X
  - e.g., a document or sentence with some words x = w1 ... wn, or a sequence of previous actions
- Output: y ∈ Y
  - e.g., a parse tree, document class, part-of-speech tags, word sense
- Input/output pair: (x, y) ∈ X × Y
  - e.g., a document x and its label y
  - Sometimes x is explicit in y, e.g., a parse tree y will contain the sentence x
Feature Representations

- We assume a mapping from input–output pairs (x, y) to a high-dimensional feature vector
  - f(x, y) : X × Y → R^m
- In some cases, e.g., binary classification with Y = {−1, +1}, we can map only from the input to the feature space
  - f(x) : X → R^m
- However, most problems in NLP require more than two classes, so we focus on the multi-class case
- For any vector v ∈ R^m, let v_j be the j-th value
Features and Classes

- All features must be numerical
  - Numerical features are represented directly as f_i(x, y) ∈ R
  - Binary (boolean) features are represented as f_i(x, y) ∈ {0, 1}
- Multinomial (categorical) features must be binarized
  - Instead of: f_i(x, y) ∈ {v_0, ..., v_p}
  - We have: f_{i+0}(x, y) ∈ {0, 1}, ..., f_{i+p}(x, y) ∈ {0, 1}
  - Such that: f_{i+j}(x, y) = 1 iff f_i(x, y) = v_j
- We need distinct features for distinct output classes
  - Instead of: f_i(x) (1 ≤ i ≤ m)
  - We have: f_{i+0m}(x, y), ..., f_{i+Nm}(x, y) for Y = {0, ..., N}
  - Such that: f_{i+jm}(x, y) = f_i(x) iff y = y_j
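The binarization scheme above can be sketched in a few lines of Python; the category values here are made up for illustration:

```python
# Binarize one categorical feature into p+1 boolean indicator
# features, as described above (hypothetical values).
VALUES = ["Noun", "Verb", "Adj"]  # v_0, ..., v_p

def binarize(value):
    """Return one indicator feature per possible value."""
    return [1 if value == v else 0 for v in VALUES]
```

Exactly one indicator fires per input: `binarize("Verb")` yields `[0, 1, 0]`.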
Examples

- x is a document and y is a label

  f_j(x, y) = 1 if x contains the word "interest" and y = "financial"; 0 otherwise

  f_j(x, y) = % of words in x with punctuation, and y = "scientific"

- x is a word and y is a part-of-speech tag

  f_j(x, y) = 1 if x = "bank" and y = Verb; 0 otherwise
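The word/tag feature above translates directly into a function; this is just a sketch, not the assignment's starter code:

```python
def f_j(x, y):
    """Joint feature from the example: fires only for the word
    "bank" paired with the tag Verb."""
    return 1 if x == "bank" and y == "Verb" else 0
```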
Examples

- x is a name, y is a label classifying the name

  f_0(x, y) = 1 if x contains "George" and y = "Person"; 0 otherwise
  f_1(x, y) = 1 if x contains "Washington" and y = "Person"; 0 otherwise
  f_2(x, y) = 1 if x contains "Bridge" and y = "Person"; 0 otherwise
  f_3(x, y) = 1 if x contains "General" and y = "Person"; 0 otherwise
  f_4(x, y) = 1 if x contains "George" and y = "Object"; 0 otherwise
  f_5(x, y) = 1 if x contains "Washington" and y = "Object"; 0 otherwise
  f_6(x, y) = 1 if x contains "Bridge" and y = "Object"; 0 otherwise
  f_7(x, y) = 1 if x contains "General" and y = "Object"; 0 otherwise

- x = General George Washington, y = Person → f(x, y) = [1 1 0 1 0 0 0 0]
- x = George Washington Bridge, y = Object → f(x, y) = [0 0 0 0 1 1 1 0]
- x = George Washington George, y = Object → f(x, y) = [0 0 0 0 1 1 0 0]
Block Feature Vectors

- x = General George Washington, y = Person → f(x, y) = [1 1 0 1 0 0 0 0]
- x = George Washington Bridge, y = Object → f(x, y) = [0 0 0 0 1 1 1 0]
- x = George Washington George, y = Object → f(x, y) = [0 0 0 0 1 1 0 0]
- One equal-size block of the feature vector for each label
- Input features duplicated in each block
- Non-zero values allowed only in one block
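The block layout can be sketched as follows, reproducing the name example (the word list and labels mirror the slides):

```python
LABELS = ["Person", "Object"]
WORDS = ["George", "Washington", "Bridge", "General"]

def f(x, y):
    """Block feature vector: one block of input features per label;
    only the block for the candidate label y may be non-zero."""
    vec = [0] * (len(WORDS) * len(LABELS))
    offset = LABELS.index(y) * len(WORDS)
    for i, w in enumerate(WORDS):
        if w in x.split():
            vec[offset + i] = 1
    return vec
```

For instance, `f("General George Washington", "Person")` gives `[1, 1, 0, 1, 0, 0, 0, 0]`, matching the example above.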
Linear Classifiers

Linear Classifiers

- Linear classifier: the score (or probability) of a particular classification is based on a linear combination of features and their weights
- Let w ∈ R^m be a high-dimensional weight vector
- If we assume that w is known, then we define our classifier as
- Multiclass classification: Y = {0, 1, ..., N}

  y = argmax_y w · f(x, y)
    = argmax_y Σ_{j=0}^{m} w_j × f_j(x, y)

- Binary classification is just a special case of multiclass
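A minimal sketch of the argmax classifier; the toy feature map and weights below are invented for illustration:

```python
def dot(w, v):
    return sum(wi * vi for wi, vi in zip(w, v))

def predict(w, x, labels, f):
    """y_hat = argmax_y w · f(x, y): score every candidate label
    and return the highest-scoring one."""
    return max(labels, key=lambda y: dot(w, f(x, y)))

# Toy block feature map over two input features and two labels.
def f(x, y):
    block = [1 if t in x else 0 for t in ("bank", "run")]
    return block + [0, 0] if y == "Noun" else [0, 0] + block

w = [1.0, 0.0, 0.0, 2.0]
```

With these weights, `predict(w, "they run", ["Noun", "Verb"], f)` returns `"Verb"`.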
Linear Classifiers – Bias Terms

- Linear classifiers are often presented as

  y = argmax_y Σ_{j=0}^{m} w_j × f_j(x, y) + b_y

- where b_y is a bias or offset term
- But this can be folded into f

  x = General George Washington, y = Person → f(x, y) = [1 1 0 1 1 0 0 0 0 0]
  x = General George Washington, y = Object → f(x, y) = [0 0 0 0 0 1 1 0 1 1]

  f_4(x, y) = 1 if y = "Person"; 0 otherwise
  f_9(x, y) = 1 if y = "Object"; 0 otherwise

- w_4 and w_9 are now the bias terms for the labels
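Folding the bias into f amounts to adding one always-on indicator feature per label. A sketch with a hypothetical helper name (the slides place the indicator inside each label's block; here it is appended at the end, which is equivalent up to reordering the features):

```python
def fold_bias(f_xy, y, labels):
    """Append one per-label indicator feature; the weight learned
    at that position then acts as the bias b_y for label y."""
    return f_xy + [1 if y == lab else 0 for lab in labels]
```

For example, `fold_bias([1, 1, 0], "Person", ["Person", "Object"])` gives `[1, 1, 0, 1, 0]`.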
Binary Linear Classifier

Divides all points: [figure omitted]
Multiclass Linear Classifier

Defines regions of space: [figure omitted]

- i.e., + are all points (x, y) where + = argmax_y w · f(x, y)
Separability

- A set of points is separable if there exists a w such that classification is perfect

  [figure: Separable vs. Not Separable]

- This can also be defined mathematically (and we will shortly)
Supervised Learning – How to Find w

- Input: training examples T = {(x_t, y_t)}_{t=1}^{|T|}
- Input: feature representation f
- Output: w that maximizes/minimizes some important function on the training set
  - minimize error (Perceptron, SVMs, Boosting)
  - maximize likelihood of data (Logistic Regression, Naive Bayes)
- Assumption: the training data is separable
  - Not necessary, just makes life easier
  - There is a lot of good work in machine learning to tackle the non-separable case
Perceptron

- Choose a w that minimizes error:

  w = argmin_w Σ_t ( 1 − 1[y_t = argmax_y w · f(x_t, y)] )

  where 1[p] = 1 if p is true; 0 otherwise

- This is a 0–1 loss function
- Aside: when minimizing error, people tend to use hinge loss or other smoother loss functions
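The 0–1 objective is easy to compute directly, even though it is hard to minimize; a sketch with hypothetical helper names:

```python
def zero_one_loss(w, data, labels, f):
    """Sum of 1 - 1[y_t == argmax_y w · f(x_t, y)] over the
    training set, i.e., the number of misclassified examples."""
    def score(x, y):
        return sum(wi * fi for wi, fi in zip(w, f(x, y)))
    return sum(1 for x, y_gold in data
               if max(labels, key=lambda y: score(x, y)) != y_gold)
```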
Perceptron Learning Algorithm

Training data: T = {(x_t, y_t)}_{t=1}^{|T|}

1. w^(0) = 0; i = 0
2. for n : 1..N
3.   for t : 1..T
4.     Let y′ = argmax_y w^(i) · f(x_t, y)
5.     if y′ ≠ y_t
6.       w^(i+1) = w^(i) + f(x_t, y_t) − f(x_t, y′)
7.       i = i + 1
8. return w^(i)
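The pseudocode above maps onto a short Python sketch (the feature map and data used to exercise it are invented; the assignment's starter code will differ):

```python
def train_perceptron(data, labels, f, dim, epochs=10):
    """Multiclass perceptron: predict with the current weights and,
    on a mistake, add f(x, gold) and subtract f(x, predicted)."""
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y_gold in data:
            y_pred = max(labels, key=lambda y: sum(
                wi * fi for wi, fi in zip(w, f(x, y))))
            if y_pred != y_gold:
                for j, (g, p) in enumerate(zip(f(x, y_gold), f(x, y_pred))):
                    w[j] += g - p
    return w
```

Note that the update leaves w unchanged on correctly classified examples, so once the data is fit, the weights stop moving.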
Perceptron: Separability and Margin

- Given a training instance (x_t, y_t), define:
  - Y_t = Y − {y_t}
  - i.e., Y_t is the set of incorrect labels for x_t
- A training set T is separable with margin γ > 0 if there exists a vector u with ||u|| = 1 such that:

  u · f(x_t, y_t) − u · f(x_t, y′) ≥ γ

  for all y′ ∈ Y_t, where ||u|| = √(Σ_j u_j²) (Euclidean or L2 norm)

- Assumption: the training set is separable with margin γ
Perceptron: Main Theorem

- Theorem: For any training set separable with a margin of γ, the following holds for the perceptron algorithm:

  mistakes made during training ≤ R² / γ²

  where R ≥ ||f(x_t, y_t) − f(x_t, y′)|| for all (x_t, y_t) ∈ T and y′ ∈ Y_t

- Thus, after a finite number of training iterations, the error on the training set will converge to zero
- For proof, see Collins (2002)
Practical Considerations

- The perceptron is sensitive to the order of training examples
  - Consider: 500 positive instances followed by 500 negative instances
- Shuffling:
  - Randomly permute the training instances between iterations
- Voting and averaging:
  - Let w_1, ..., w_n be all the weight vectors seen in training
  - The voted perceptron predicts the majority vote of w_1, ..., w_n
  - The averaged perceptron predicts using the average vector:

    w = (1/n) Σ_{i=1}^{n} w_i
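Shuffling and averaging slot straight into the training loop; a sketch under the same toy setup as before (not the assignment's actual code):

```python
import random

def train_averaged(data, labels, f, dim, epochs=10, seed=0):
    """Averaged perceptron: shuffle between epochs and return the
    average of the weight vector over every training step (a cheap
    stand-in for the voted perceptron)."""
    rng = random.Random(seed)
    data = list(data)
    w = [0.0] * dim
    total = [0.0] * dim   # running sum of all weight vectors seen
    n = 0
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y_gold in data:
            y_pred = max(labels, key=lambda y: sum(
                wi * fi for wi, fi in zip(w, f(x, y))))
            if y_pred != y_gold:
                for j, (g, p) in enumerate(zip(f(x, y_gold), f(x, y_pred))):
                    w[j] += g - p
            for j in range(dim):
                total[j] += w[j]
            n += 1
    return [t / n for t in total]
```

Summing w after every instance rather than storing all n vectors keeps memory constant while computing the same average.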
Perceptron Summary

- Learns a linear classifier that minimizes error
- Guaranteed to find a w in a finite amount of time
- Improvement 1: shuffle the training data between iterations
- Improvement 2: average the weight vectors seen during training
- The perceptron is an example of an online learning algorithm
  - w is updated based on a single training instance in isolation:

    w^(i+1) = w^(i) + f(x_t, y_t) − f(x_t, y′)

- Compare decision trees, which perform batch learning
  - All training instances are used to find the best split
Assignment 2

- Implement the perceptron (starter code in Python)
- Three subtasks:
  - Implement the linear classifier (dot product)
  - Implement the perceptron update
  - Evaluate on the spambase data set
- For VG:
  - Implement the averaged perceptron
Appendix
Proofs and Derivations
Convergence Proof for Perceptron

Setup, restating the update from the learning algorithm:

- w^(k−1) are the weights before the kth mistake
- Suppose the kth mistake is made at the tth example, (x_t, y_t)
- y′ = argmax_y w^(k−1) · f(x_t, y), with y′ ≠ y_t
- w^(k) = w^(k−1) + f(x_t, y_t) − f(x_t, y′)

Then:

- u · w^(k) = u · w^(k−1) + u · (f(x_t, y_t) − f(x_t, y′)) ≥ u · w^(k−1) + γ
- Since w^(0) = 0 and u · w^(0) = 0, by induction on k, u · w^(k) ≥ kγ
- Since u · w^(k) ≤ ||u|| × ||w^(k)|| and ||u|| = 1, it follows that ||w^(k)|| ≥ kγ
- Also:

  ||w^(k)||² = ||w^(k−1)||² + ||f(x_t, y_t) − f(x_t, y′)||² + 2 w^(k−1) · (f(x_t, y_t) − f(x_t, y′))
  ||w^(k)||² ≤ ||w^(k−1)||² + R²

  (since R ≥ ||f(x_t, y_t) − f(x_t, y′)|| and w^(k−1) · f(x_t, y_t) − w^(k−1) · f(x_t, y′) ≤ 0)
Convergence Proof for Perceptron (continued)

- We have just shown that ||w^(k)|| ≥ kγ and ||w^(k)||² ≤ ||w^(k−1)||² + R²
- By induction on k, since w^(0) = 0 and ||w^(0)||² = 0:

  ||w^(k)||² ≤ kR²

- Therefore:

  k²γ² ≤ ||w^(k)||² ≤ kR²

- Solving for k:

  k ≤ R² / γ²

- Therefore the number of errors is bounded!