Dr. Perceptron 1 (wcohen/10-605/perceptrons-1.pdf)
TRANSCRIPT
1http://futurama.wikia.com/wiki/Dr._Perceptron
Quick review of Tuesday
• Learning as optimization
• Optimizing conditional log-likelihood Pr(y|x) with logistic regression
• Stochastic gradient descent for logistic regression
  – Stream multiple times (epochs) thru data
  – Keep model in memory
• L2-regularization
• Sparse/lazy L2 regularization
• The "hash trick": allow feature collisions, use an array indexed by hash code instead of a hash table for parameters.
Quick look ahead
• Experiments with a hash-trick implementation of logistic regression
• Next question:
  – how do you parallelize SGD, or more generally, this kind of streaming algorithm?
  – each example affects the next prediction ⇒ order matters ⇒ parallelization changes the behavior
  – we will step back to perceptrons and then step forward to parallel perceptrons
Debugging Machine Learning Algorithms
William Cohen
Debugging for non-ML systems
• "If it compiles, ship it."
Debugging for ML systems
1. It's definitely exactly the algorithm you read about in that paper
2. It also compiles
3. It gets 87% accuracy on the author's dataset
   – but he got 91%
   – so it's not working?
   – or, your eval is wrong?
   – or, his eval is wrong?
Debugging for ML systems
1. It's definitely exactly the algorithm you read about in that paper
2. It also compiles
3. It gets 97% accuracy on the author's dataset
   – but he got 91%
   – so you have a best paper award!
   – or, maybe a bug…
Debugging for ML systems
• It's always hard to debug software
• It's especially hard for ML
  – a wide range of almost-correct modes for a program to be in
Debugging advice
1. Write tests
2. For subtle problems, write tests
3. If you're still not sure why it's not working, write tests
4. If you get really stuck:
   – take a walk and come back to it in an hour
   – ask a friend
     • If s/he's also in 10-605 s/he can still help as long as no notes are taken (my rules)
   – take a break and write some tests
Debugging ML systems
Write tests
– For a generative learner, write a generator and generate training/test data from the assumed distribution
  • E.g., for NB: use one small multinomial for pos examples, another one for neg examples, and a weighted coin for the class priors (a sketch follows below).
– The learner should (usually) recover the actual parameters of the generator
  • given enough data, modulo convexity, …
– Test it on the weird cases (e.g., uniform class priors, highly skewed multinomials)
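A minimal sketch of this kind of test in Python (the generator parameters, tolerances, and the tiny NB trainer here are illustrative assumptions, not part of the slides): generate data from a known Naive Bayes model, train, and check that the learned parameters are close to the generator's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical generator: a class prior and one small multinomial
# over a 4-word vocabulary per class.
prior_pos = 0.3
theta = {1: np.array([0.7, 0.1, 0.1, 0.1]),   # pos-class multinomial
         0: np.array([0.1, 0.2, 0.3, 0.4])}   # neg-class multinomial

def generate(n_docs=10000, doc_len=20):
    ys = (rng.random(n_docs) < prior_pos).astype(int)   # weighted coin for the class
    counts = np.array([rng.multinomial(doc_len, theta[y]) for y in ys])
    return counts, ys

def train_nb(counts, ys):
    """Maximum-likelihood NB estimates (no smoothing, for clarity)."""
    prior_hat = ys.mean()
    theta_hat = {c: counts[ys == c].sum(axis=0) / counts[ys == c].sum()
                 for c in (0, 1)}
    return prior_hat, theta_hat

counts, ys = generate()
prior_hat, theta_hat = train_nb(counts, ys)

# With enough data the learner should recover the generator's parameters.
assert abs(prior_hat - prior_pos) < 0.02
assert np.allclose(theta_hat[1], theta[1], atol=0.02)
assert np.allclose(theta_hat[0], theta[0], atol=0.02)
print("NB recovered the generator's parameters (within tolerance)")
```

The tolerances are loose guesses; the point is the structure of the test, not the exact thresholds.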
Debugging ML systems
Write tests
– For a discriminative learner, similar trick…
– Also, use what you know: e.g., for SGD
  • does taking one gradient step (on a sample task) lower the loss on the training data?
  • does it lower the loss as expected? (a gradient-check sketch follows below)
    – (f(x + d) - f(x))/d should approximate f'(x)
  • does regularization work as expected?
    – large µ ⇒ smaller param values
  • record training set / test set loss
    – with and without regularization
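A minimal finite-difference gradient check along these lines (the logistic-loss task and variable names are illustrative assumptions): perturb one parameter at a time and compare the numeric slope to the analytic gradient.

```python
import numpy as np

def loss(w, X, y):
    """Logistic loss (negative conditional log-likelihood) on a tiny dataset."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad(w, X, y):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ (p - y) / len(y)

rng = np.random.default_rng(1)
X, y = rng.normal(size=(50, 3)), rng.integers(0, 2, size=50)
w = rng.normal(size=3)

d = 1e-6
for j in range(len(w)):
    e = np.zeros_like(w); e[j] = d
    numeric = (loss(w + e, X, y) - loss(w, X, y)) / d   # (f(x+d) - f(x)) / d
    analytic = grad(w, X, y)[j]
    assert abs(numeric - analytic) < 1e-4, (j, numeric, analytic)
print("gradient check passed")
```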
Debugging ML systems
Compare to a "baseline": a mathematically clean method vs a scalable, efficient method
• lazy/sparse vs naïve regularizer
• hashed feature values vs hash table feature values
• …
ON-LINE ANALYSIS AND REGRET
On-line learning / regret analysis
• Optimization
  – is a great model of what you want to do
  – a less good model of what you have time to do
• Example:
  – How much do we lose when we replace gradient descent with SGD?
  – what if we can only approximate the local gradient?
  – what if the distribution changes over time?
  – …
• One powerful analytic approach: online learning, aka regret analysis (~ aka on-line optimization)
On-line learning
Binstance xi Compute: yi = sign(v . xi )^
+1,-1: label yiGet yi and make update to vTrain Data
To detect interactions:• increase/decrease vk only if we need to (for that example)• otherwise, leave it unchanged
• We can be sensitive to duplication by stopping updates when we get better performance
On-line learning
• B receives instance xi and computes ŷi = sign(v . xi)
• B gets the label yi ∈ {+1, -1}; if mistake: vk+1 = vk + correction
To detect interactions:
• increase/decrease vk only if we need to (for that example)
• otherwise, leave it unchanged
• We can be sensitive to duplication by stopping updates when we get better performance
Theory: the prediction game
• Player A:
  – picks a "target concept" c
    • for now, from a finite set of possibilities C (e.g., all decision trees of size m)
  – for t = 1, …,
    • Player A picks x = (x1, …, xn) and sends it to B
      – For now, from a finite set of possibilities (e.g., all binary vectors of length n)
    • B predicts a label, ŷ, and sends it to A
    • A sends B the true label y = c(x)
    • we record if B made a mistake or not
  – We care about the worst case number of mistakes B will make over all possible concept & training sequences of any length
    • The "mistake bound" for B, MB(C), is this bound
Perceptrons
The prediction game
• Are there practical algorithms where we can compute the mistake bound?
The voted perceptron
(Diagram: A sends instance xi to B; B returns prediction ŷi; A returns label yi)
• Compute: ŷi = sign(vk . xi)
• If mistake: vk+1 = vk + yi xi
The voted perceptron
• Compute: p = sign(vk . xi)
• If mistake: vk+1 = vk + yi xi
Aside: this is related to the SGD update.
• With 0/1 labels: y = p: no update; y = 0, p = 1: -x; y = 1, p = 0: +x
• With ±1 labels: y = -1, p = +1: -x; y = +1, p = -1: +x
(Figure: the target u, with margin 2γ, and the guess v1 = +x1)
(1) A target u. (2) The guess v1 after one positive example.
(Figure: the same picture after a second example)
(3a) The guess v2 after two positive examples: v2 = v1 + x2
(3b) The guess v2 after one positive and one negative example: v2 = v1 - x2
If mistake: vk+1 = vk + yi xi
(Figure: each mistake increases vk · u by at least γ, while ||vk||² grows by at most R² per mistake; combining these facts gives the mistake bound)

k ≤ (R/γ)²
What if the separating line doesn't go thru the origin?
Replace x = (x1, …, xn) with (x0, x1, …, xn) where x0 = 1 for every example x.
Then ŷ = sign(Σi xi wi) becomes sign(x0 w0 + Σi≥1 xi wi), which is sign(w0 + Σi≥1 xi wi).
One Weird Trick for Making Perceptrons More Expressive
Summary
• We have shown that
  – If: there exists a u with unit norm that has margin γ on the examples in the seq (x1,y1), (x2,y2), …
  – Then: the perceptron algorithm makes < R²/γ² mistakes on the sequence (where R ≥ ||xi||)
  – Independent of dimension of the data or classifier (!)
  – This doesn't follow from M(C) ≤ VCDim(C)
• We don't know if this algorithm could be better
  – There are many variants that rely on similar analysis (ROMMA, Passive-Aggressive, MIRA, …)
• We don't know what happens if the data's not separable
  – Unless I explain the "Δ trick" to you
• We don't know what classifier to use "after" training
The idea of the "delta trick"
(Figure: points on a line, + + + + / - - - -, with one stray + among the negatives)
A noisy example makes the data inseparable.
The idea of the "delta trick"
(Figure: the same points, with the noisy example lifted by △ in a new dimension)
So let's add a new dimension and give the noisy example an offset of △ in that dimension.
I don't know which examples are noisy… but I have lots of dimensions that I could add to the data…
Now it's separable!
The Δ Trick
• The proof assumes the data is separable by a wide margin
• We can make that true by adding an "id" feature to each example
  – sort of like we added a constant feature

  x^1 = (x^1_1, x^1_2, …, x^1_m) → (x^1_1, …, x^1_m, Δ, 0, …, 0)
  x^2 = (x^2_1, x^2_2, …, x^2_m) → (x^2_1, …, x^2_m, 0, Δ, …, 0)
  …
  x^n = (x^n_1, x^n_2, …, x^n_m) → (x^n_1, …, x^n_m, 0, 0, …, Δ)
  (n new features)
The Δ Trick
• The proof assumes the data is separable by a wide margin
• We can make that true by adding an "id" feature to each example (n new features)
  – sort of like we added a constant feature

doc17: i, found, aardvark, today → i, found, aardvark, today, doc17
doc37: aardvarks, are, dangerous → aardvarks, are, dangerous, doc37
…
The Δ Trick
• Replace xi with x'i, so X becomes [X | IΔ]
• Replace R² in our bounds with R² + Δ²
• Let di = max(0, γ - yi xi·u)
• Let u' = (u1, …, un, y1d1/Δ, …, ymdm/Δ) * 1/Z
  – So Z = sqrt(1 + D²/Δ²), for D = sqrt(d1² + … + dm²)
  – Now [X | IΔ] is separable by u' with margin γ/Z (a short check follows below)
• Mistake bound is (R² + Δ²)Z² / γ²
• Let Δ = sqrt(RD) ⇒ k ≤ ((R + D)/γ)²
• Conclusion: a little noise is ok
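The margin and norm claims can be checked with a short calculation; a sketch in the notation of this slide (using yi² = 1 throughout):

```latex
% x'_i is x_i with \Delta in its own "id" coordinate; u' = (u, y_1 d_1/\Delta, \dots, y_m d_m/\Delta)/Z.
% Margin of u' on x'_i:
y_i (u' \cdot x'_i)
  = \tfrac{1}{Z}\big( y_i (u \cdot x_i) + d_i \big)
  \ge \tfrac{1}{Z}\big( (\gamma - d_i) + d_i \big)
  = \gamma / Z
% (since d_i = \max(0,\ \gamma - y_i\, x_i \cdot u) implies y_i (u \cdot x_i) \ge \gamma - d_i).
% Norm of u':
\|u'\|^2 = \tfrac{1}{Z^2}\Big( \|u\|^2 + \sum_i d_i^2/\Delta^2 \Big)
         = \tfrac{1}{Z^2}\big( 1 + D^2/\Delta^2 \big) = 1
% Plugging margin \gamma/Z and radius \sqrt{R^2+\Delta^2} into k \le (R/\gamma)^2 gives
k \le (R^2 + \Delta^2)\, Z^2 / \gamma^2
% and choosing \Delta = \sqrt{RD} makes this ((R + D)/\gamma)^2.
```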
Summary
• We have shown that
  – If: there exists a u with unit norm that has margin γ on the examples in the seq (x1,y1), (x2,y2), …
  – Then: the perceptron algorithm makes < R²/γ² mistakes on the sequence (where R ≥ ||xi||)
  – Independent of dimension of the data or classifier (!)
• We don't know what happens if the data's not separable
  – Unless I explain the "Δ trick" to you
• We don't know what classifier to use "after" training
The averaged perceptron
On-line to batch learning
1. Pick a vk at random according to mk/m, the fraction of examples it was used for.
2. Predict using the vk you just picked.
3. (Actually, use some sort of deterministic approximation to this:
   predict using sign(v* . x), where v* is the mk-weighted average of the vk's.)
SPARSIFYING THE AVERAGED PERCEPTRON UPDATE
Complexity of perceptron learning
• Algorithm (a dict-based sketch follows below):
  • v = 0                            [init hashtable]
  • for each example x, y:           [O(n) examples]
    – if sign(v.x) != y
      • v = v + y x                  [for xi != 0: vi += y xi, so O(|x|) = O(|d|)]
• Final hypothesis (last): v
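A minimal Python sketch of this sparse update (feature vectors as {feature: value} dicts; the helper names are mine, not from the slides):

```python
def sparse_dot(v, x):
    # v: dict of weights (the "hashtable"); x: sparse {feature: value} example
    return sum(v.get(j, 0.0) * xj for j, xj in x.items())

def perceptron(examples, epochs=1):
    v = {}                                  # init hashtable
    for _ in range(epochs):
        for x, y in examples:               # y in {+1, -1}
            if (1 if sparse_dot(v, x) >= 0 else -1) != y:
                for j, xj in x.items():     # only touch nonzero features: O(|x|)
                    v[j] = v.get(j, 0.0) + y * xj
    return v                                # final hypothesis (last)
```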
Complexity of averaged perceptron
• Algorithm:
  • vk = 0; va = 0                   [init hashtables]
  • for each example x, y:           [O(n) examples; O(n|V|) total]
    – if sign(vk.x) != y
      • va = va + mk * vk            [for vki != 0: vai += mk * vki, so O(|V|)]
      • vk = vk + y x                [for xi != 0: vki += y xi, so O(|x|) = O(|d|)]
      • m = m + 1
      • mk = 1
    – else
      • mk++
• Final hypothesis (avg): va / m
Complexity of perceptron learning
• Algorithm:
  • v = 0                            [init hashtable]
  • for each example x, y:           [O(n) examples]
    – if sign(v.x) != y
      • v = v + y x                  [for xi != 0: vi += y xi, so O(|x|) = O(|d|)]
Complexity of averaged perceptron
• Algorithm:
  • vk = 0; va = 0                   [init hashtables]
  • for each example x, y:           [O(n) examples; O(n|V|) total]
    – if sign(vk.x) != y
      • va = va + vk                 [for vki != 0: vai += vki, so O(|V|)]
      • vk = vk + y x                [for xi != 0: vki += y xi, so O(|x|) = O(|d|)]
      • mk = 1
    – else
      • mk++
Alternative averaged perceptron
• Algorithm:
  • vk = 0; va = 0
  • for each example x, y:
    – va = va + vk
    – m = m + 1
    – if sign(vk.x) != y
      • vk = vk + y*x
  • Return va/m

Observe: vk = Σ_{j ∈ Sk} yj xj, where Sk is the set of examples including the first k mistakes.
Alternative averaged perceptron
• Algorithm:
  • vk = 0; va = 0
  • for each example x, y:
    – va = va + vk
    – m = m + 1
    – if sign(vk.x) != y
      • vk = vk + y*x
  • Return va/m

vk = Σ_{j ∈ Sk} yj xj
So when there's a mistake at time t on x, y: y*x is added to va on every subsequent iteration.
Suppose you know T, the total number of examples in the stream…
Alternative averaged perceptron
• Algorithm (sparsified: the dense "va = va + vk" step is replaced by a single correction on each mistake; a code sketch follows below):
  • vk = 0; va = 0
  • for each example x, y:
    – m = m + 1
    – if sign(vk.x) != y
      • vk = vk + y*x
      • va = va + (T - m)*y*x        [accounts for all subsequent additions of y*x to va]
  • Return va/T

T = the total number of examples in the stream (all epochs)
Unpublished? I figured this out recently; Leon Bottou knows it too.
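A sketch of this sparsified averaged update in Python (dict-based sparse vectors; the function and variable names are mine). As on the slide, it assumes T, the total number of examples across all epochs, is known up front:

```python
def averaged_perceptron(examples, epochs=1):
    """Sparse averaged perceptron using the (T - m) correction trick."""
    T = epochs * len(examples)       # total number of examples in the stream
    vk, va, m = {}, {}, 0
    for _ in range(epochs):
        for x, y in examples:        # x: {feature: value}, y in {+1, -1}
            m += 1
            score = sum(vk.get(j, 0.0) * xj for j, xj in x.items())
            if (1 if score >= 0 else -1) != y:
                for j, xj in x.items():
                    vk[j] = vk.get(j, 0.0) + y * xj
                    # y*x would be added to va on each of the (T - m) remaining
                    # iterations, so add it all at once:
                    va[j] = va.get(j, 0.0) + (T - m) * y * xj
    return {j: w / T for j, w in va.items()}   # return va / T
```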
KERNELS AND PERCEPTRONS
The kernel perceptron
(Diagram: A sends instance xi to B; B returns prediction ŷi; A returns label yi)
Original form: compute ŷi = vk . xi; if mistake: vk+1 = vk + yi xi
Equivalent form:
• Compute: ŷ = Σ_{xi ∈ FNk} x·xi − Σ_{xi ∈ FPk} x·xi
• If false positive (too high) mistake: add xi to FP
• If false negative (too low) mistake: add xi to FN
Mathematically the same as before … but allows use of the kernel trick
The kernel perceptron
• Compute: ŷ = Σ_{xi ∈ FNk} K(x, xi) − Σ_{xi ∈ FPk} K(x, xi), where K(x, xk) ≡ x · xk
• If false positive (too high) mistake: add xi to FP
• If false negative (too low) mistake: add xi to FN
Mathematically the same as before … but allows use of the "kernel trick" (a sketch follows below)
Other kernel methods (SVM, Gaussian processes) aren't constrained to a limited set (+1/-1/0) of weights on the K(x,v) values.
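A sketch of the kernel perceptron in this FP/FN form (Python; the pluggable kernel K and the function names are mine, e.g. one of the kernels on the next slide):

```python
def kernel_perceptron_train(examples, K, epochs=1):
    """Keep the mistake examples in FP/FN instead of an explicit weight vector."""
    FP, FN = [], []                       # false-positive / false-negative mistakes
    for _ in range(epochs):
        for x, y in examples:             # y in {+1, -1}
            score = sum(K(x, xi) for xi in FN) - sum(K(x, xi) for xi in FP)
            if y == -1 and score >= 0:    # false positive: prediction too high
                FP.append(x)
            elif y == +1 and score < 0:   # false negative: prediction too low
                FN.append(x)
    return FP, FN

def kernel_perceptron_predict(x, FP, FN, K):
    return 1 if sum(K(x, xi) for xi in FN) - sum(K(x, xi) for xi in FP) >= 0 else -1
```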
Some common kernels (sketches below)
• Linear kernel: K(x, x') ≡ x · x'
• Polynomial kernel: K(x, x') ≡ (x · x' + 1)^d
• Gaussian kernel: K(x, x') ≡ e^(−||x − x'||² / σ)
• More later….
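For concreteness, the three kernels as plain functions (a sketch; the default d and σ values are arbitrary choices of mine):

```python
import numpy as np

def linear_kernel(x, xp):
    return np.dot(x, xp)

def polynomial_kernel(x, xp, d=3):
    return (np.dot(x, xp) + 1) ** d

def gaussian_kernel(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / sigma)
```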
Kernels 101
• Duality
  – and computational properties
  – Reproducing Kernel Hilbert Space (RKHS)
• Gram matrix
• Positive semi-definite
• Closure properties
Kernels 101
• Duality: two ways to look at this

  (1) Explicitly map from x to φ(x)
      – i.e. to the point corresponding to x in the Hilbert space
      w = Σ_{xk ∈ FN} φ(xk) − Σ_{xk ∈ FP} φ(xk)
      ŷ = φ(x) · w

  (2) Implicitly map from x to φ(x) by changing the kernel function K
      ŷ = Σ_{xk ∈ FN} K(x, xk) − Σ_{xk ∈ FP} K(x, xk), where K(x, xk) ≡ φ(x) · φ(xk)

  Two different computational ways of getting the same behavior.
  (In the linear case: w = Σ_{xk ∈ FN} xk − Σ_{xk ∈ FP} xk and ŷ = x · w, i.e. K(x, xk) ≡ x · xk.)
Kernels 101
• Duality
• Gram matrix K: kij = K(xi, xj)
  – K(x, x') = K(x', x) ⇒ the Gram matrix is symmetric
  – K(x, x) > 0 ⇒ the diagonal of K is positive
  – K is "positive semi-definite": zᵀ K z ≥ 0 for all z
Review: the hash trick
Learning as optimization for regularized logistic regression
• Algorithm:
• Initialize hashtables W, A and set k=0
• For each iteration t=1,…,T
  – For each example (xi, yi)
    • pi = …; k++
    • For each feature j with xij > 0:
      » W[j] *= (1 - λ2µ)^(k - A[j])
      » W[j] = W[j] + λ(yi - pi)xij
      » A[j] = k
Learning as optimization for regularized logistic regression
• Algorithm (a code sketch follows below):
• Initialize arrays W, A of size R and set k=0
• For each iteration t=1,…,T
  – For each example (xi, yi)
    • Let V be a hash table so that V[h] = Σ_{j: hash(j) % R == h} xij
    • pi = …; k++
    • For each hash value h with V[h] > 0:
      » W[h] *= (1 - λ2µ)^(k - A[h])
      » W[h] = W[h] + λ(yi - pi)V[h]
      » A[h] = k
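A compact sketch of this hashed, lazily regularized SGD step in Python. It follows the pseudocode above rather than any particular library; the hash function, default array size, learning rate λ and regularizer µ are placeholder assumptions, and "λ2µ" is read here as 2λµ:

```python
import math
from collections import defaultdict

def train_hashed_lr(examples, R=2**20, lam=0.1, mu=1e-4, epochs=1):
    """examples: iterable of (dict feature->value, label in {0, 1})."""
    W = [0.0] * R          # hashed weights
    A = [0] * R            # last update time per bucket (for lazy regularization)
    k = 0
    for _ in range(epochs):
        for x, y in examples:
            # build the hashed feature vector V[h] = sum of xij over colliding features
            V = defaultdict(float)
            for j, xij in x.items():
                V[hash(j) % R] += xij
            z = max(-30.0, min(30.0, sum(W[h] * v for h, v in V.items())))
            p = 1.0 / (1.0 + math.exp(-z))
            k += 1
            for h, v in V.items():
                W[h] *= (1.0 - 2 * lam * mu) ** (k - A[h])   # catch up on L2 shrinkage
                W[h] += lam * (y - p) * v
                A[h] = k
    return W
```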
The hash trick as a kernel
Some details
ϕ[h] = Σ_{j: hash(j) % m == h} ξ(j) xij, where ξ(j) ∈ {−1, +1}
Slightly different hash to avoid systematic bias (compare the earlier V[h] = Σ_{j: hash(j) % R == h} xij).
m is the number of buckets you hash into (R in my discussion). A sketch of ϕ follows below.
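A sketch of the signed hashed feature map ϕ in Python (ξ is implemented here with a second, hypothetical sign hash; names are mine):

```python
from collections import defaultdict

def hashed_features(x, m):
    """x: dict feature->value. Returns phi as {bucket: value} with a +/-1 sign hash."""
    phi = defaultdict(float)
    for j, xij in x.items():
        h = hash(j) % m                                     # bucket, hash(j) % m
        sign = 1 if hash(("sign", j)) % 2 == 0 else -1      # xi(j) in {-1, +1}
        phi[h] += sign * xij
    return phi
```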
Some details
I.e. – a hashed vector is probably close to the original vector
Some details
I.e. the inner products between x and x' are probably not changed too much by the hash function: a classifier will probably still work
The hash kernel: implementation
• One problem: debugging is harder
  – Features are no longer meaningful
  – There's a new way to ruin a classifier
    • Change the hash function ☹
• You can separately compute the set of all words that hash to h and guess what features mean
  – Build an inverted index h → w1, w2, … (a sketch follows below)
An example
2^26 entries = 1 Gb @ 8bytes/weight