
Page 1: Distributed perceptron

Distributed Perceptron

Introducing Distributed Training Strategies for the Structured Perceptron,

published by R. McDonald, K. Hall & G. Mann in NAACL 2010

2010-10-06 / 2nd seminar for State-of-the-Art NLP

Page 2: Distributed perceptron

Distributed training of perceptrons in a theoretically proven way

Naive distribution strategy fails: parameter mixing (or averaging)

Simple modification: iterative parameter mixing

Proofs & experiments: convergence, convergence speed, NER experiments, dependency parsing experiments

Page 3: Distributed perceptron

Timeline

1958  F. Rosenblatt: the perceptron (later elaborated in Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms)

1962  H.D. Block and A.B. Novikoff (independently): the perceptron convergence theorem for the separable case

1999  Y. Freund & R.E. Schapire: the voted perceptron, with a bound on the generalization error for the inseparable case

2002  M. Collins: generalization to the structured prediction problem

2010  R. McDonald et al.: parallelization with parameter mixing and synchronization

Page 4: Distributed perceptron

A new parallelization strategy is required for distributed perceptrons

Gradient-based batch training algorithms have already been parallelized in the MapReduce framework, and parameter mixing works for maximum entropy models:

Divide the training data into a number of shards, train a separate model on each shard, and take the average of the models' weights (a schematic sketch follows).

What about perceptrons? The objective function is non-convex, and simple parameter mixing doesn't work.
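
As a concrete illustration of the shard-train-average recipe above, here is a minimal Python sketch (my own illustration, not code from the paper); `features`, `labels`, and the shard layout are hypothetical placeholders, and the argmax stands in for whatever decoder the structured task uses.

```python
# Minimal sketch of (non-iterative) parameter mixing; not the authors' code.
import numpy as np

def train_perceptron(shard, features, labels, dim, epochs=10):
    """Ordinary perceptron trained on one shard in isolation."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for x, y in shard:                      # shard = list of (input, gold label)
            y_hat = max(labels, key=lambda y2: w @ features(x, y2))  # decode / argmax
            if y_hat != y:
                w += features(x, y) - features(x, y_hat)             # perceptron update
    return w

def parameter_mix(shards, features, labels, dim, mu=None):
    """Train one perceptron per shard (embarrassingly parallel), then average the weights."""
    weights = [train_perceptron(shard, features, labels, dim) for shard in shards]
    mu = np.full(len(shards), 1.0 / len(shards)) if mu is None else np.asarray(mu)
    return sum(m * w for m, w in zip(mu, weights))
```

The next slides show why this naive averaging can fail for perceptrons.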

Page 5: Distributed perceptron

Parameter mixing (averaging) fails (1/6)

Parameter mixing: train S perceptrons on S shards of the training data, then take a weighted average of their weights.

Distributed Training Strategies for the Structured Perceptron by R. McDonald, K. Hall & G. Mann, 2010

Page 6: Distributed perceptron

Parameter mixing (averaging) fails (2/6)

Counter example. Feature space (label-0 features and label-1 features occupy disjoint dimensions):

f(x1,1, 0) = [1 1 0 0 0 0]   f(x1,1, 1) = [0 0 0 1 1 0]
f(x1,2, 0) = [0 0 1 0 0 0]   f(x1,2, 1) = [0 0 0 0 0 1]
f(x2,1, 0) = [0 1 1 0 0 0]   f(x2,1, 1) = [0 0 0 0 1 1]
f(x2,2, 0) = [1 0 0 0 0 0]   f(x2,2, 1) = [0 0 0 1 0 0]

Shard 1: (x1,1, 0), (x1,2, 1)
Shard 2: (x2,1, 0), (x2,2, 1)

Preview of the consequence: the mixture of the two local optima fails to separate the data. Small shards can fool the algorithm, because initialization and tie-breaking have a larger influence on each shard.

Page 7: Distributed perceptron

Parameter mixing (averaging) fails (3/6)

Counter example, shard 1: (x1,1, 0), (x1,2, 1). Feature space:

f(x1,1, 0) = [1 1 0 0 0 0]   f(x1,1, 1) = [0 0 0 1 1 0]
f(x1,2, 0) = [0 0 1 0 0 0]   f(x1,2, 1) = [0 0 0 0 0 1]

w1 := [0 0 0 0 0 0]  {initialization}
w1·f(x1,1, 0) ≦ w1·f(x1,1, 1)  {tie-breaking picks y = 1, an error, so update}
w1 := [1 1 0 0 0 0] - [0 0 0 1 1 0] = [1 1 0 -1 -1 0]
w1·f(x1,2, 0) ≦ w1·f(x1,2, 1)  {tie-breaking picks y = 1, which is correct, so no update}

Page 8: Distributed perceptron

Parameter mixing (averaging) fails (4/6)

Counter example, shard 2: (x2,1, 0), (x2,2, 1). Feature space:

f(x2,1, 0) = [0 1 1 0 0 0]   f(x2,1, 1) = [0 0 0 0 1 1]
f(x2,2, 0) = [1 0 0 0 0 0]   f(x2,2, 1) = [0 0 0 1 0 0]

w2 := [0 0 0 0 0 0]  {initialization}
w2·f(x2,1, 0) ≦ w2·f(x2,1, 1)  {tie-breaking picks y = 1, an error, so update}
w2 := [0 1 1 0 0 0] - [0 0 0 0 1 1] = [0 1 1 0 -1 -1]
w2·f(x2,2, 0) ≦ w2·f(x2,2, 1)  {tie-breaking picks y = 1, which is correct, so no update}

Page 9: Distributed perceptron

Parameter mixing (averaging) fails (5/6)

Counter example. Feature space:

f(x1,1, 0) = [1 1 0 0 0 0]   f(x1,1, 1) = [0 0 0 1 1 0]
f(x1,2, 0) = [0 0 1 0 0 0]   f(x1,2, 1) = [0 0 0 0 0 1]
f(x2,1, 0) = [0 1 1 0 0 0]   f(x2,1, 1) = [0 0 0 0 1 1]
f(x2,2, 0) = [1 0 0 0 0 0]   f(x2,2, 1) = [0 0 0 1 0 0]

Shard 1: (x1,1, 0), (x1,2, 1)  ...  w1 = [1 1 0 -1 -1 0]
Shard 2: (x2,1, 0), (x2,2, 1)  ...  w2 = [0 1 1 0 -1 -1]

Mixed weight (μ1 w1 + μ2 w2, with μ1 + μ2 = 1): [μ1 1 μ2 -μ1 -1 -μ2]

Page 10: Distributed perceptron

Parameter mixing (averaging) fails (6/6)

Counter example. Scores of each feature vector under the mixed weight [μ1 1 μ2 -μ1 -1 -μ2]:

f(x1,1, 0) = [1 1 0 0 0 0]   f(x1,1, 1) = [0 0 0 1 1 0]  ...  μ1+1, -μ1-1
f(x1,2, 0) = [0 0 1 0 0 0]   f(x1,2, 1) = [0 0 0 0 0 1]  ...  μ2, -μ2
f(x2,1, 0) = [0 1 1 0 0 0]   f(x2,1, 1) = [0 0 0 0 1 1]  ...  μ2+1, -μ2-1
f(x2,2, 0) = [1 0 0 0 0 0]   f(x2,2, 1) = [0 0 0 1 0 0]  ...  μ1, -μ1

The mixed weight doesn't separate the data: the label-0 feature vectors always beat the label-1 vectors, w·f(·, 0) ≧ w·f(·, 1), so at least one of the examples whose correct label is 1 (x1,2, x2,2) is misclassified for every choice of μ1, μ2 ≧ 0 with μ1 + μ2 = 1.

But there is a separating weight vector: [-1 2 -1 1 -2 1]
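
For readers who want to check the arithmetic, here is a small script (my own sketch, not from the paper) that reruns the counterexample: it reproduces w1 and w2, verifies that every convex mixture of them misclassifies at least one training example, and confirms that the vector above does separate the data. Tie-breaking toward y = 1 is assumed, as on the previous slides.

```python
# Numeric check of the counterexample; feature vectors as listed on the slides.
import numpy as np

f = {  # (example, label) -> feature vector
    ("x11", 0): np.array([1, 1, 0, 0, 0, 0]), ("x11", 1): np.array([0, 0, 0, 1, 1, 0]),
    ("x12", 0): np.array([0, 0, 1, 0, 0, 0]), ("x12", 1): np.array([0, 0, 0, 0, 0, 1]),
    ("x21", 0): np.array([0, 1, 1, 0, 0, 0]), ("x21", 1): np.array([0, 0, 0, 0, 1, 1]),
    ("x22", 0): np.array([1, 0, 0, 0, 0, 0]), ("x22", 1): np.array([0, 0, 0, 1, 0, 0]),
}
shard1 = [("x11", 0), ("x12", 1)]   # (example, gold label)
shard2 = [("x21", 0), ("x22", 1)]

def predict(w, x):
    return 1 if w @ f[(x, 1)] >= w @ f[(x, 0)] else 0   # ties broken toward y = 1

def one_epoch(shard, w):
    w = w.copy()
    for x, y in shard:
        y_hat = predict(w, x)
        if y_hat != y:
            w += f[(x, y)] - f[(x, y_hat)]               # perceptron update
    return w

w1 = one_epoch(shard1, np.zeros(6, dtype=int))   # -> [ 1  1  0 -1 -1  0]
w2 = one_epoch(shard2, np.zeros(6, dtype=int))   # -> [ 0  1  1  0 -1 -1]

for mu1 in np.linspace(0.0, 1.0, 11):            # every convex mixture fails
    w = mu1 * w1 + (1.0 - mu1) * w2
    errors = sum(predict(w, x) != y for x, y in shard1 + shard2)
    assert errors >= 1

u = np.array([-1, 2, -1, 1, -2, 1])              # ...but a separating vector exists
assert all(u @ f[(x, y)] > u @ f[(x, 1 - y)] for x, y in shard1 + shard2)
print("counterexample verified")
```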

Page 11: Distributed perceptron

Iterative parameter mixing
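
The algorithm on this slide was shown as a figure; below is a minimal sketch of iterative parameter mixing as I read it from the paper (my own paraphrase, not the authors' code). `features`, `labels`, and the shard layout are hypothetical placeholders; in a MapReduce setting each shard's one-epoch pass would be a map task and the mixing step a reduce.

```python
# Sketch of iterative parameter mixing: one perceptron epoch per shard, then mix, repeat.
import numpy as np

def one_epoch_perceptron(shard, w_init, features, labels):
    """OneEpochPerceptron: a single pass over one shard, started from the mixed weights."""
    w = w_init.copy()
    errors = 0
    for x, y in shard:
        y_hat = max(labels, key=lambda y2: w @ features(x, y2))   # decode / argmax
        if y_hat != y:
            w += features(x, y) - features(x, y_hat)
            errors += 1
    return w, errors

def iterative_parameter_mixing(shards, features, labels, dim, max_epochs=10):
    w_avg = np.zeros(dim)
    for n in range(max_epochs):
        # Run in parallel in practice; here a plain loop stands in for the map step.
        results = [one_epoch_perceptron(sh, w_avg, features, labels) for sh in shards]
        k = np.array([errors for _, errors in results], dtype=float)
        if k.sum() == 0:
            break                          # no errors anywhere: the data are separated
        mu = k / k.sum()                   # error-proportional mixing (uniform 1/S also works)
        w_avg = sum(m * w for m, (w, _) in zip(mu, results))
    return w_avg
```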

Page 12: Distributed perceptron

Convergence theorem of iterative parameter mixing (1/4)

Assumptions:

u: separating weight vector (taken with unit norm, |u| = 1)
γ: margin, γ ≦ u·(f(xt, yt) - f(xt, y')) for all t and all y' ≠ yt
R: maxt,y' |f(xt, yt) - f(xt, y')|
ki,n: the number of updates (errors) that occur in the n-th epoch of the i-th OneEpochPerceptron
μi,n: mixture weight of shard i in epoch n (μi,n ≧ 0, Σi μi,n = 1)

Distributed Training Strategies for the Structured Perceptron by R. McDonald, K. Hall & G. Mann, 2010

Page 13: Distributed perceptron

Convergence theorem of iterative parameter mixing (2/4)

Lower bound on u·w(avg,N) in terms of the number of errors in each epoch

Distributed Training Strategies for the Structured Perceptron by R. McDonald, K. Hall & G. Mann, 2010

← from the definition of the margin: γ ≦ u·(f(xt, yt) - f(xt, y'))

By induction on n: u·w(avg,N) ≧ ΣnΣi μi,n ki,n γ
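
A sketch of the induction step behind this bound (my reconstruction from the definitions on the previous slide, not the paper's exact wording): each of the ki,n updates made in shard i increases u·w by at least γ, and mixing preserves this because the μi,n sum to one.

```latex
% Induction step for the lower bound (reconstruction; w^{(i,n)} is shard i's weight
% vector after its one-epoch pass starting from w^{(avg, n-1)}).
\begin{align*}
  u \cdot w^{(i,n)}
    &\ge u \cdot w^{(\mathrm{avg},\,n-1)} + k_{i,n}\,\gamma
      && \text{each update adds } u\cdot\bigl(f(x_t,y_t)-f(x_t,y')\bigr) \ge \gamma \\
  u \cdot w^{(\mathrm{avg},\,n)}
    &= \sum_i \mu_{i,n}\, u \cdot w^{(i,n)}
     \ge u \cdot w^{(\mathrm{avg},\,n-1)} + \sum_i \mu_{i,n}\, k_{i,n}\,\gamma
      && \text{since } \textstyle\sum_i \mu_{i,n} = 1 \\
  u \cdot w^{(\mathrm{avg},\,N)}
    &\ge \sum_{n=1}^{N} \sum_i \mu_{i,n}\, k_{i,n}\,\gamma
      && \text{unrolling over } n \text{ with } w^{(\mathrm{avg},\,0)} = 0
\end{align*}
```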

Page 14: Distributed perceptron

Convergence theorem of iterative parameter mixing (3/4)

Upper bound on |w(avg,N)|² in terms of the number of errors in each epoch

Distributed Training Strategies for the Structured Perceptron by R. McDonald, K. Hall & G. Mann, 2010

← from the definition of R: R ≧ |f(xt, yt) - f(xt, y')|, where y' = argmaxy w·f(xt, y)

By induction on n: |w(avg,N)|² ≦ ΣnΣi μi,n ki,n R²
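
Similarly, a sketch of the induction step for the upper bound (my reconstruction): each update increases the squared norm by at most R², because an update only happens when the current weights score the wrong output at least as high, and the squared norm is convex, so averaging cannot increase it beyond the average.

```latex
% Induction step for the upper bound (reconstruction).
\begin{align*}
  \bigl\|w + f(x_t,y_t)-f(x_t,y')\bigr\|^2
    &= \|w\|^2 + 2\,w\cdot\bigl(f(x_t,y_t)-f(x_t,y')\bigr) + \bigl\|f(x_t,y_t)-f(x_t,y')\bigr\|^2
     \le \|w\|^2 + R^2 \\
    &\phantom{=}\ \text{(an update occurs only when } w\cdot f(x_t,y') \ge w\cdot f(x_t,y_t)\text{)} \\
  \bigl\|w^{(i,n)}\bigr\|^2
    &\le \bigl\|w^{(\mathrm{avg},\,n-1)}\bigr\|^2 + k_{i,n}\,R^2 \\
  \bigl\|w^{(\mathrm{avg},\,n)}\bigr\|^2
    &= \Bigl\|\sum_i \mu_{i,n}\, w^{(i,n)}\Bigr\|^2
     \le \sum_i \mu_{i,n}\, \bigl\|w^{(i,n)}\bigr\|^2
     \quad \text{(convexity of } \|\cdot\|^2\text{)} \\
  \bigl\|w^{(\mathrm{avg},\,N)}\bigr\|^2
    &\le \sum_{n=1}^{N} \sum_i \mu_{i,n}\, k_{i,n}\,R^2
\end{align*}
```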

Page 15: Distributed perceptron

Convergence theorem of iterative parameter mixing (4/4)

Distributed Training Strategies for the Structured Perceptronby R. McDonald, K. Hall & G. Mann, 2010

Combining the two bounds:

|w(avg,N)|² ≧ (u·w(avg,N))² ≧ (ΣnΣi μi,n ki,n γ)² = (ΣnΣi μi,n ki,n)² γ²   (using |u| ≦ 1)

|w(avg,N)|² ≦ (ΣnΣi μi,n ki,n) R²

⇒ (ΣnΣi μi,n ki,n)² γ² ≦ (ΣnΣi μi,n ki,n) R²

⇒ (ΣnΣi μi,n ki,n) γ² ≦ R²

⇒ (ΣnΣi μi,n ki,n) ≦ R²/γ²

The weighted number of errors is bounded, so iterative parameter mixing converges on separable data.

Page 16: Distributed perceptron

Convergence speed is predicted in two ways (1/2)

Theorem 3 implies that, when we take uniform weights for mixing, the worst-case number of errors (when the equality holds) is proportional to the number of shards S,

implying that we cannot benefit very much from the parallelization:

the number of errors per epoch can be multiplied by S, even though the time required per epoch is reduced to 1/S.

Distributed Training Strategies for the Structured Perceptron by R. McDonald, K. Hall & G. Mann, 2010

Page 17: Distributed perceptron

Convergence speed is predicted in two ways (2/2), Section 4.3

When we take error-proportional weighting for mixing, the number of epochs Ndist is bounded (see Section 4.3 of the paper for the bound).

Worst case (when the equality holds): the same number of epochs as the serial perceptron. Even in that case, each epoch is S times faster because of the parallelization.

Ndist doesn't depend on the number of shards, implying that we can benefit well from parallelization.

(Annotations on the bound: error-proportional mixing; geometric mean ≦ arithmetic mean.)

Distributed Training Strategies for the Structured Perceptron by R. McDonald, K. Hall & G. Mann, 2010
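
The slide does not spell out the mixing coefficients; presumably "error-proportional weighting" means the following (my assumption, consistent with "weighting with the number of updates" on the comments slide):

```latex
% Presumed error-proportional mixture weights (assumption, not quoted from the paper).
\mu_{i,n} \;=\; \frac{k_{i,n}}{\sum_{j=1}^{S} k_{j,n}},
\qquad \mu_{i,n} \ge 0, \quad \textstyle\sum_{i=1}^{S} \mu_{i,n} = 1
```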

Page 18: Distributed perceptron

Experiments

Comparison:
Serial (All Data)
Serial (Sub Sampling): use only one shard
Parallel (Parameter Mix)
Parallel (Iterative Parameter Mix)

Settings: number of shards = 10 (see the paper for more details)

Page 19: Distributed perceptron

NER experiments: faster & better, close to averaged perceptrons

Distributed Training Strategies for the Structured Perceptron by R. McDonald, K. Hall & G. Mann, 2010

Page 20: Distributed perceptron

NER experiments: faster & better, close to averaged perceptrons

Distributed Training Strategies for the Structured Perceptron by R. McDonald, K. Hall & G. Mann, 2010

Iterative mixing is faster and more accurate than serial. (non-averaged case)

Iterative mixing is faster than, and similarly accurate to, serial training. (averaged case)

Page 21: Distributed perceptron

Dependency parsing experiments: similar improvements

Distributed Training Strategies for the Structured Perceptron by R. McDonald, K. Hall & G. Mann, 2010

Page 22: Distributed perceptron

Varying the number of shards: the more shards, the slower the convergence

Distributed Training Strategies for the Structured Perceptron by R. McDonald, K. Hall & G. Mann, 2010

Page 23: Distributed perceptron

Varying the number of shards: the more shards, the slower the convergence

Distributed Training Strategies for the Structured Perceptron by R. McDonald, K. Hall & G. Mann, 2010

High parallelism leads to slower convergence (at a rate somewhere between the two predictions)

Page 24: Distributed perceptron

Conclusions

Distributed training of the structured perceptron via simple parameter mixing strategies

Guaranteed to converge and to separate the data (if it is separable); results in fast and accurate classifiers

Trade-off between high parallelism and convergence speed

(+ also applicable to the online passive-aggressive algorithm)

Page 25: Distributed perceptron

Presenter's comments

Parameter synchronization can be slow, especially when the feature space or the number of epochs is large.

Analysis of the generalization error (for the inseparable case)? Relation to the voted perceptron?

Voted perceptron: weighting by survival time. Distributed perceptron: weighting by the number of updates.

Relation to Bayes point machines?