Dr. Perceptron 1 (wcohen/10-605/perceptrons-1.pdf)
TRANSCRIPT
1http://futurama.wikia.com/wiki/Dr._Perceptron
Quick review of Tuesday
• Learning as optimization
• Optimizing conditional log-likelihood Pr(y|x) with logistic regression
• Stochastic gradient descent for logistic regression
  – Stream multiple times (epochs) thru data
  – Keep model in memory
• L2-regularization
• Sparse/lazy L2 regularization
• The "hash trick": allow feature collisions, use an array indexed by hash code instead of a hash table for parameters.
Quick look ahead
• Experiments with a hash-trick implementation of logistic regression
• Next question:
  – how do you parallelize SGD, or more generally, this kind of streaming algorithm?
  – each example affects the next prediction ⇒ order matters ⇒ parallelization changes the behavior
  – we will step back to perceptrons and then step forward to parallel perceptrons
Debugging Machine Learning Algorithms
William Cohen
Debugging for non-ML systems
• "If it compiles, ship it."
Debugging for ML systems
1. It's definitely exactly the algorithm you read about in that paper
2. It also compiles
3. It gets 87% accuracy on the author's dataset
   – but he got 91%
   – so it's not working?
   – or, your eval is wrong?
   – or, his eval is wrong?
Debugging for ML systems
1. It's definitely exactly the algorithm you read about in that paper
2. It also compiles
3. It gets 97% accuracy on the author's dataset
   – but he got 91%
   – so you have a best paper award!
   – or, maybe a bug…
Debugging for ML systems
• It's always hard to debug software
• It's especially hard for ML
  – a wide range of almost-correct modes for a program to be in
Debugging advice
1. Write tests
2. For subtle problems, write tests
3. If you're still not sure why it's not working, write tests
4. If you get really stuck:
   – take a walk and come back to it in an hour
   – ask a friend
     • If s/he's also in 10-605 s/he can still help as long as no notes are taken (my rules)
   – take a break and write some tests
Debugging ML systems
Write tests
– For a generative learner, write a generator and generate training/test data from the assumed distribution
  • E.g., for NB: use one small multinomial for pos examples, another one for neg examples, and a weighted coin for the class priors (a sketch follows below).
– The learner should (usually) recover the actual parameters of the generator
  • given enough data, modulo convexity, …
– Test it on the weird cases (e.g., uniform class priors, highly skewed multinomials)
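A minimal sketch of this kind of test in Python (the generator parameters, tolerances, and the tiny NB trainer here are illustrative assumptions, not part of the slides): generate data from a known Naive Bayes model, train, and check that the learned parameters are close to the generator's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical generator: a class prior and one small multinomial
# over a 4-word vocabulary per class.
prior_pos = 0.3
theta = {1: np.array([0.7, 0.1, 0.1, 0.1]),   # pos-class multinomial
         0: np.array([0.1, 0.2, 0.3, 0.4])}   # neg-class multinomial

def generate(n_docs=10000, doc_len=20):
    ys = (rng.random(n_docs) < prior_pos).astype(int)   # weighted coin for the class
    counts = np.array([rng.multinomial(doc_len, theta[y]) for y in ys])
    return counts, ys

def train_nb(counts, ys):
    """Maximum-likelihood NB estimates (no smoothing, for clarity)."""
    prior_hat = ys.mean()
    theta_hat = {c: counts[ys == c].sum(axis=0) / counts[ys == c].sum()
                 for c in (0, 1)}
    return prior_hat, theta_hat

counts, ys = generate()
prior_hat, theta_hat = train_nb(counts, ys)

# With enough data the learner should recover the generator's parameters.
assert abs(prior_hat - prior_pos) < 0.02
assert np.allclose(theta_hat[1], theta[1], atol=0.02)
assert np.allclose(theta_hat[0], theta[0], atol=0.02)
print("NB recovered the generator's parameters (within tolerance)")
```

The tolerances are loose guesses; the point is the structure of the test, not the exact thresholds.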
Debugging ML systems
Write tests
– For a discriminative learner, similar trick…
– Also, use what you know: e.g., for SGD
  • does taking one gradient step (on a sample task) lower the loss on the training data?
  • does it lower the loss as expected? (a gradient-check sketch follows below)
    – (f(x + d) - f(x))/d should approximate f'(x)
  • does regularization work as expected?
    – large µ ⇒ smaller param values
  • record training set / test set loss
    – with and without regularization
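A minimal finite-difference gradient check along these lines (the logistic-loss task and variable names are illustrative assumptions): perturb one parameter at a time and compare the numeric slope to the analytic gradient.

```python
import numpy as np

def loss(w, X, y):
    """Logistic loss (negative conditional log-likelihood) on a tiny dataset."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad(w, X, y):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ (p - y) / len(y)

rng = np.random.default_rng(1)
X, y = rng.normal(size=(50, 3)), rng.integers(0, 2, size=50)
w = rng.normal(size=3)

d = 1e-6
for j in range(len(w)):
    e = np.zeros_like(w); e[j] = d
    numeric = (loss(w + e, X, y) - loss(w, X, y)) / d   # (f(x+d) - f(x)) / d
    analytic = grad(w, X, y)[j]
    assert abs(numeric - analytic) < 1e-4, (j, numeric, analytic)
print("gradient check passed")
```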
Debugging ML systems
Compare to a "baseline": a mathematically clean method vs a scalable, efficient method
• lazy/sparse vs naïve regularizer
• hashed feature values vs hash table feature values
• …
ON-LINE ANALYSIS AND REGRET
On-line learning / regret analysis
• Optimization
  – is a great model of what you want to do
  – a less good model of what you have time to do
• Example:
  – How much do we lose when we replace gradient descent with SGD?
  – what if we can only approximate the local gradient?
  – what if the distribution changes over time?
  – …
• One powerful analytic approach: online learning, aka regret analysis (~ aka on-line optimization)
On-line learning
Binstance xi Compute: yi = sign(v . xi )^
+1,-1: label yiGet yi and make update to vTrain Data
To detect interactions:• increase/decrease vk only if we need to (for that example)• otherwise, leave it unchanged
• We can be sensitive to duplication by stopping updates when we get better performance
On-line learning
• B receives instance xi and computes ŷi = sign(v . xi)
• B gets the label yi ∈ {+1, -1}; if mistake: vk+1 = vk + correction
To detect interactions:
• increase/decrease vk only if we need to (for that example)
• otherwise, leave it unchanged
• We can be sensitive to duplication by stopping updates when we get better performance
Theory: the prediction game
• Player A:
  – picks a "target concept" c
    • for now, from a finite set of possibilities C (e.g., all decision trees of size m)
  – for t = 1, …,
    • Player A picks x = (x1, …, xn) and sends it to B
      – For now, from a finite set of possibilities (e.g., all binary vectors of length n)
    • B predicts a label, ŷ, and sends it to A
    • A sends B the true label y = c(x)
    • we record if B made a mistake or not
  – We care about the worst case number of mistakes B will make over all possible concept & training sequences of any length
    • The "mistake bound" for B, MB(C), is this bound
Perceptrons
The prediction game
• Are there practical algorithms where we can compute the mistake bound?
The voted perceptron
(Diagram: A sends instance xi to B; B returns prediction ŷi; A returns label yi)
• Compute: ŷi = sign(vk . xi)
• If mistake: vk+1 = vk + yi xi
The voted perceptron
• Compute: p = sign(vk . xi)
• If mistake: vk+1 = vk + yi xi
Aside: this is related to the SGD update.
• With 0/1 labels: y = p: no update; y = 0, p = 1: -x; y = 1, p = 0: +x
• With ±1 labels: y = -1, p = +1: -x; y = +1, p = -1: +x
(Figure: the target u, with margin 2γ, and the guess v1 = +x1)
(1) A target u. (2) The guess v1 after one positive example.
(Figure: the same picture after a second example)
(3a) The guess v2 after two positive examples: v2 = v1 + x2
(3b) The guess v2 after one positive and one negative example: v2 = v1 - x2
If mistake: vk+1 = vk + yi xi
(Figure: each mistake increases vk · u by at least γ, while ||vk||² grows by at most R² per mistake; combining these facts gives the mistake bound)

k ≤ (R/γ)²
What if the separating line doesn't go thru the origin?
Replace x = (x1, …, xn) with (x0, x1, …, xn) where x0 = 1 for every example x.
Then ŷ = sign(Σi xi wi) becomes sign(x0 w0 + Σi≥1 xi wi), which is sign(w0 + Σi≥1 xi wi).
One Weird Trick for Making Perceptrons More Expressive
Summary
• We have shown that
  – If: there exists a u with unit norm that has margin γ on the examples in the seq (x1,y1), (x2,y2), …
  – Then: the perceptron algorithm makes < R²/γ² mistakes on the sequence (where R ≥ ||xi||)
  – Independent of dimension of the data or classifier (!)
  – This doesn't follow from M(C) ≤ VCDim(C)
• We don't know if this algorithm could be better
  – There are many variants that rely on similar analysis (ROMMA, Passive-Aggressive, MIRA, …)
• We don't know what happens if the data's not separable
  – Unless I explain the "Δ trick" to you
• We don't know what classifier to use "after" training
The idea of the "delta trick"
(Figure: points on a line, + + + + / - - - -, with one stray + among the negatives)
A noisy example makes the data inseparable.
The idea of the "delta trick"
(Figure: the same points, with the noisy example lifted by △ in a new dimension)
So let's add a new dimension and give the noisy example an offset of △ in that dimension.
I don't know which examples are noisy… but I have lots of dimensions that I could add to the data…
Now it's separable!
The Δ Trick
• The proof assumes the data is separable by a wide margin
• We can make that true by adding an "id" feature to each example
  – sort of like we added a constant feature

  x^1 = (x^1_1, x^1_2, …, x^1_m) → (x^1_1, …, x^1_m, Δ, 0, …, 0)
  x^2 = (x^2_1, x^2_2, …, x^2_m) → (x^2_1, …, x^2_m, 0, Δ, …, 0)
  …
  x^n = (x^n_1, x^n_2, …, x^n_m) → (x^n_1, …, x^n_m, 0, 0, …, Δ)
  (n new features)
The Δ Trick
• The proof assumes the data is separable by a wide margin
• We can make that true by adding an "id" feature to each example (n new features)
  – sort of like we added a constant feature

doc17: i, found, aardvark, today → i, found, aardvark, today, doc17
doc37: aardvarks, are, dangerous → aardvarks, are, dangerous, doc37
…
The Δ Trick
• Replace xi with x'i, so X becomes [X | IΔ]
• Replace R² in our bounds with R² + Δ²
• Let di = max(0, γ - yi xi·u)
• Let u' = (u1, …, un, y1d1/Δ, …, ymdm/Δ) * 1/Z
  – So Z = sqrt(1 + D²/Δ²), for D = sqrt(d1² + … + dm²)
  – Now [X | IΔ] is separable by u' with margin γ/Z (a short check follows below)
• Mistake bound is (R² + Δ²)Z² / γ²
• Let Δ = sqrt(RD) ⇒ k ≤ ((R + D)/γ)²
• Conclusion: a little noise is ok
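The margin and norm claims can be checked with a short calculation; a sketch in the notation of this slide (using yi² = 1 throughout):

```latex
% x'_i is x_i with \Delta in its own "id" coordinate; u' = (u, y_1 d_1/\Delta, \dots, y_m d_m/\Delta)/Z.
% Margin of u' on x'_i:
y_i (u' \cdot x'_i)
  = \tfrac{1}{Z}\big( y_i (u \cdot x_i) + d_i \big)
  \ge \tfrac{1}{Z}\big( (\gamma - d_i) + d_i \big)
  = \gamma / Z
% (since d_i = \max(0,\ \gamma - y_i\, x_i \cdot u) implies y_i (u \cdot x_i) \ge \gamma - d_i).
% Norm of u':
\|u'\|^2 = \tfrac{1}{Z^2}\Big( \|u\|^2 + \sum_i d_i^2/\Delta^2 \Big)
         = \tfrac{1}{Z^2}\big( 1 + D^2/\Delta^2 \big) = 1
% Plugging margin \gamma/Z and radius \sqrt{R^2+\Delta^2} into k \le (R/\gamma)^2 gives
k \le (R^2 + \Delta^2)\, Z^2 / \gamma^2
% and choosing \Delta = \sqrt{RD} makes this ((R + D)/\gamma)^2.
```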
Summary
• We have shown that
  – If: there exists a u with unit norm that has margin γ on the examples in the seq (x1,y1), (x2,y2), …
  – Then: the perceptron algorithm makes < R²/γ² mistakes on the sequence (where R ≥ ||xi||)
  – Independent of dimension of the data or classifier (!)
• We don't know what happens if the data's not separable
  – Unless I explain the "Δ trick" to you
• We don't know what classifier to use "after" training
The averaged perceptron
On-line to batch learning
1. Pick a vk at random according to mk/m, the fraction of examples it was used for.
2. Predict using the vk you just picked.
3. (Actually, use some sort of deterministic approximation to this:
   predict using sign(v* . x), where v* is the mk-weighted average of the vk's.)
SPARSIFYING THE AVERAGED PERCEPTRON UPDATE
Complexity of perceptron learning
• Algorithm (a dict-based sketch follows below):
  • v = 0                            [init hashtable]
  • for each example x, y:           [O(n) examples]
    – if sign(v.x) != y
      • v = v + y x                  [for xi != 0: vi += y xi, so O(|x|) = O(|d|)]
• Final hypothesis (last): v
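A minimal Python sketch of this sparse update (feature vectors as {feature: value} dicts; the helper names are mine, not from the slides):

```python
def sparse_dot(v, x):
    # v: dict of weights (the "hashtable"); x: sparse {feature: value} example
    return sum(v.get(j, 0.0) * xj for j, xj in x.items())

def perceptron(examples, epochs=1):
    v = {}                                  # init hashtable
    for _ in range(epochs):
        for x, y in examples:               # y in {+1, -1}
            if (1 if sparse_dot(v, x) >= 0 else -1) != y:
                for j, xj in x.items():     # only touch nonzero features: O(|x|)
                    v[j] = v.get(j, 0.0) + y * xj
    return v                                # final hypothesis (last)
```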
Complexity of averaged perceptron
• Algorithm:
  • vk = 0; va = 0                   [init hashtables]
  • for each example x, y:           [O(n) examples; O(n|V|) total]
    – if sign(vk.x) != y
      • va = va + mk * vk            [for vki != 0: vai += mk * vki, so O(|V|)]
      • vk = vk + y x                [for xi != 0: vki += y xi, so O(|x|) = O(|d|)]
      • m = m + 1
      • mk = 1
    – else
      • mk++
• Final hypothesis (avg): va / m
Complexity of perceptron learning
• Algorithm:
  • v = 0                            [init hashtable]
  • for each example x, y:           [O(n) examples]
    – if sign(v.x) != y
      • v = v + y x                  [for xi != 0: vi += y xi, so O(|x|) = O(|d|)]
Complexity of averaged perceptron
• Algorithm:
  • vk = 0; va = 0                   [init hashtables]
  • for each example x, y:           [O(n) examples; O(n|V|) total]
    – if sign(vk.x) != y
      • va = va + vk                 [for vki != 0: vai += vki, so O(|V|)]
      • vk = vk + y x                [for xi != 0: vki += y xi, so O(|x|) = O(|d|)]
      • mk = 1
    – else
      • mk++
Alternative averaged perceptron
• Algorithm:
  • vk = 0; va = 0
  • for each example x, y:
    – va = va + vk
    – m = m + 1
    – if sign(vk.x) != y
      • vk = vk + y*x
  • Return va/m

Observe: vk = Σ_{j ∈ Sk} yj xj, where Sk is the set of examples including the first k mistakes.
Alternative averaged perceptron
• Algorithm:
  • vk = 0; va = 0
  • for each example x, y:
    – va = va + vk
    – m = m + 1
    – if sign(vk.x) != y
      • vk = vk + y*x
  • Return va/m

vk = Σ_{j ∈ Sk} yj xj
So when there's a mistake at time t on x, y: y*x is added to va on every subsequent iteration.
Suppose you know T, the total number of examples in the stream…
Alternative averaged perceptron
• Algorithm (sparsified: the dense "va = va + vk" step is replaced by a single correction on each mistake; a code sketch follows below):
  • vk = 0; va = 0
  • for each example x, y:
    – m = m + 1
    – if sign(vk.x) != y
      • vk = vk + y*x
      • va = va + (T - m)*y*x        [accounts for all subsequent additions of y*x to va]
  • Return va/T

T = the total number of examples in the stream (all epochs)
Unpublished? I figured this out recently; Leon Bottou knows it too.
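A sketch of this sparsified averaged update in Python (dict-based sparse vectors; the function and variable names are mine). As on the slide, it assumes T, the total number of examples across all epochs, is known up front:

```python
def averaged_perceptron(examples, epochs=1):
    """Sparse averaged perceptron using the (T - m) correction trick."""
    T = epochs * len(examples)       # total number of examples in the stream
    vk, va, m = {}, {}, 0
    for _ in range(epochs):
        for x, y in examples:        # x: {feature: value}, y in {+1, -1}
            m += 1
            score = sum(vk.get(j, 0.0) * xj for j, xj in x.items())
            if (1 if score >= 0 else -1) != y:
                for j, xj in x.items():
                    vk[j] = vk.get(j, 0.0) + y * xj
                    # y*x would be added to va on each of the (T - m) remaining
                    # iterations, so add it all at once:
                    va[j] = va.get(j, 0.0) + (T - m) * y * xj
    return {j: w / T for j, w in va.items()}   # return va / T
```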
KERNELS AND PERCEPTRONS
The kernel perceptron
(Diagram: A sends instance xi to B; B returns prediction ŷi; A returns label yi)
Original form: compute ŷi = vk . xi; if mistake: vk+1 = vk + yi xi
Equivalent form:
• Compute: ŷ = Σ_{xi ∈ FNk} x·xi − Σ_{xi ∈ FPk} x·xi
• If false positive (too high) mistake: add xi to FP
• If false negative (too low) mistake: add xi to FN
Mathematically the same as before … but allows use of the kernel trick
The kernel perceptron
• Compute: ŷ = Σ_{xi ∈ FNk} K(x, xi) − Σ_{xi ∈ FPk} K(x, xi), where K(x, xk) ≡ x · xk
• If false positive (too high) mistake: add xi to FP
• If false negative (too low) mistake: add xi to FN
Mathematically the same as before … but allows use of the "kernel trick" (a sketch follows below)
Other kernel methods (SVM, Gaussian processes) aren't constrained to a limited set (+1/-1/0) of weights on the K(x,v) values.
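A sketch of the kernel perceptron in this FP/FN form (Python; the pluggable kernel K and the function names are mine, e.g. one of the kernels on the next slide):

```python
def kernel_perceptron_train(examples, K, epochs=1):
    """Keep the mistake examples in FP/FN instead of an explicit weight vector."""
    FP, FN = [], []                       # false-positive / false-negative mistakes
    for _ in range(epochs):
        for x, y in examples:             # y in {+1, -1}
            score = sum(K(x, xi) for xi in FN) - sum(K(x, xi) for xi in FP)
            if y == -1 and score >= 0:    # false positive: prediction too high
                FP.append(x)
            elif y == +1 and score < 0:   # false negative: prediction too low
                FN.append(x)
    return FP, FN

def kernel_perceptron_predict(x, FP, FN, K):
    return 1 if sum(K(x, xi) for xi in FN) - sum(K(x, xi) for xi in FP) >= 0 else -1
```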
Some common kernels (sketches below)
• Linear kernel: K(x, x') ≡ x · x'
• Polynomial kernel: K(x, x') ≡ (x · x' + 1)^d
• Gaussian kernel: K(x, x') ≡ e^(−||x − x'||² / σ)
• More later….
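For concreteness, the three kernels as plain functions (a sketch; the default d and σ values are arbitrary choices of mine):

```python
import numpy as np

def linear_kernel(x, xp):
    return np.dot(x, xp)

def polynomial_kernel(x, xp, d=3):
    return (np.dot(x, xp) + 1) ** d

def gaussian_kernel(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / sigma)
```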
Kernels 101
• Duality
  – and computational properties
  – Reproducing Kernel Hilbert Space (RKHS)
• Gram matrix
• Positive semi-definite
• Closure properties
Kernels 101
• Duality: two ways to look at this

  (1) Explicitly map from x to φ(x)
      – i.e. to the point corresponding to x in the Hilbert space
      w = Σ_{xk ∈ FN} φ(xk) − Σ_{xk ∈ FP} φ(xk)
      ŷ = φ(x) · w

  (2) Implicitly map from x to φ(x) by changing the kernel function K
      ŷ = Σ_{xk ∈ FN} K(x, xk) − Σ_{xk ∈ FP} K(x, xk), where K(x, xk) ≡ φ(x) · φ(xk)

  Two different computational ways of getting the same behavior.
  (In the linear case: w = Σ_{xk ∈ FN} xk − Σ_{xk ∈ FP} xk and ŷ = x · w, i.e. K(x, xk) ≡ x · xk.)
Kernels 101
• Duality
• Gram matrix K: kij = K(xi, xj)
  – K(x, x') = K(x', x) ⇒ the Gram matrix is symmetric
  – K(x, x) > 0 ⇒ the diagonal of K is positive
  – K is "positive semi-definite": zᵀ K z ≥ 0 for all z
Review: the hash trick
Learning as optimization for regularized logistic regression
• Algorithm:
• Initialize hashtables W, A and set k=0
• For each iteration t=1,…,T
  – For each example (xi, yi)
    • pi = …; k++
    • For each feature j with xij > 0:
      » W[j] *= (1 - λ2µ)^(k - A[j])
      » W[j] = W[j] + λ(yi - pi)xij
      » A[j] = k
Learning as optimization for regularized logistic regression
• Algorithm (a code sketch follows below):
• Initialize arrays W, A of size R and set k=0
• For each iteration t=1,…,T
  – For each example (xi, yi)
    • Let V be a hash table so that V[h] = Σ_{j: hash(j) % R == h} xij
    • pi = …; k++
    • For each hash value h with V[h] > 0:
      » W[h] *= (1 - λ2µ)^(k - A[h])
      » W[h] = W[h] + λ(yi - pi)V[h]
      » A[h] = k
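A compact sketch of this hashed, lazily regularized SGD step in Python. It follows the pseudocode above rather than any particular library; the hash function, default array size, learning rate λ and regularizer µ are placeholder assumptions, and "λ2µ" is read here as 2λµ:

```python
import math
from collections import defaultdict

def train_hashed_lr(examples, R=2**20, lam=0.1, mu=1e-4, epochs=1):
    """examples: iterable of (dict feature->value, label in {0, 1})."""
    W = [0.0] * R          # hashed weights
    A = [0] * R            # last update time per bucket (for lazy regularization)
    k = 0
    for _ in range(epochs):
        for x, y in examples:
            # build the hashed feature vector V[h] = sum of xij over colliding features
            V = defaultdict(float)
            for j, xij in x.items():
                V[hash(j) % R] += xij
            z = max(-30.0, min(30.0, sum(W[h] * v for h, v in V.items())))
            p = 1.0 / (1.0 + math.exp(-z))
            k += 1
            for h, v in V.items():
                W[h] *= (1.0 - 2 * lam * mu) ** (k - A[h])   # catch up on L2 shrinkage
                W[h] += lam * (y - p) * v
                A[h] = k
    return W
```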
The hash trick as a kernel
Some details
ϕ[h] = Σ_{j: hash(j) % m == h} ξ(j) xij, where ξ(j) ∈ {−1, +1}
Slightly different hash to avoid systematic bias (compare the earlier V[h] = Σ_{j: hash(j) % R == h} xij).
m is the number of buckets you hash into (R in my discussion). A sketch of ϕ follows below.
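A sketch of the signed hashed feature map ϕ in Python (ξ is implemented here with a second, hypothetical sign hash; names are mine):

```python
from collections import defaultdict

def hashed_features(x, m):
    """x: dict feature->value. Returns phi as {bucket: value} with a +/-1 sign hash."""
    phi = defaultdict(float)
    for j, xij in x.items():
        h = hash(j) % m                                     # bucket, hash(j) % m
        sign = 1 if hash(("sign", j)) % 2 == 0 else -1      # xi(j) in {-1, +1}
        phi[h] += sign * xij
    return phi
```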
Some details
I.e. – a hashed vector is probably close to the original vector
Some details
I.e. the inner products between x and x' are probably not changed too much by the hash function: a classifier will probably still work
The hash kernel: implementation
• One problem: debugging is harder
  – Features are no longer meaningful
  – There's a new way to ruin a classifier
    • Change the hash function ☹
• You can separately compute the set of all words that hash to h and guess what features mean
  – Build an inverted index h → w1, w2, … (a sketch follows below)
An example
2^26 entries = 1 Gb @ 8bytes/weight