dr. perceptron - cs.cmu.edu
TRANSCRIPT
[1] http://futurama.wikia.com/wiki/Dr._Perceptron
Where we are…
• Experiments with a hash-trick implementation of logistic regression
• Next question:
  – How do you parallelize SGD, or more generally, this kind of streaming algorithm?
  – Each example affects the next prediction → order matters → parallelization changes the behavior
  – We will step back to perceptrons and then step forward to parallel perceptrons
  – Then another nice parallel learning algorithm, then a midterm
Recap: perceptrons
The perceptron
A sends B an instance $x_i$; B computes $\hat{y}_i = \mathrm{sign}(v_k \cdot x_i)$ and returns $\hat{y}_i$; A replies with the true label $y_i$.
If mistake: $v_{k+1} = v_k + y_i x_i$
The perceptron

[Figure: positive (+) and negative (−) examples separated by a margin of $2\gamma$ around the separator defined by the target vector $u$]

Mistake bound: $k \le \left(R/\gamma\right)^2$
A lot like SGD update for logistic regression!
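A minimal sketch of this mistake-driven loop (my own illustration, assuming dense numpy arrays and labels in {−1, +1}):

    import numpy as np

    def perceptron(X, y, epochs=10):
        """Train v with the update from the slide: on a mistake, v += y_i * x_i."""
        v = np.zeros(X.shape[1])
        for _ in range(epochs):
            for xi, yi in zip(X, y):              # yi is -1 or +1
                if np.sign(v @ xi) != yi:         # mistake (sign(0) counts as one)
                    v = v + yi * xi
        return v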
On-line to batch learning
1. Pick a $v_k$ at random according to $m_k/m$, the fraction of examples it was used for.
2. Predict using the $v_k$ you just picked.
3. (Actually, use some sort of deterministic approximation to this.)

[Figure: a run of the perceptron with, e.g., $m_1 = 3$, $m_2 = 4$, …, $m = 10$]

Predict using $\mathrm{sign}(v^* \cdot x)$.
Averaging/voting

[Figure: learning curves comparing the averaged perceptron with the last perceptron]

Also: there's a sparsification trick that makes learning the averaged perceptron fast.
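A sketch of the deterministic approximation (averaging); this is my own rendering, and it sums the dense weight vector every step, which is exactly what the sparsification trick would avoid:

    import numpy as np

    def averaged_perceptron(X, y, epochs=10):
        # v_sum accumulates v after every example, so each v_k ends up
        # weighted by m_k, the number of examples it survived -- the
        # deterministic stand-in for sampling v_k with probability m_k/m.
        v = np.zeros(X.shape[1])
        v_sum = np.zeros(X.shape[1])
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                if np.sign(v @ xi) != yi:
                    v = v + yi * xi
                v_sum += v
        return v_sum / (epochs * len(X))   # v*; predict with sign(v* . x)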
KERNELS AND PERCEPTRONS
The kernel perceptron
A sends B an instance $x_i$; B computes $\hat{y}_i = v_k \cdot x_i$ and returns it; A replies with the true label $y_i$.
If mistake: $v_{k+1} = v_k + y_i x_i$

Equivalently, keep two sets of mistakes, FP and FN, instead of $v_k$:

Compute: $\hat{y}_i = \sum_{x_k \in FN} x_k \cdot x_i - \sum_{x_k \in FP} x_k \cdot x_i$

If false positive (prediction too high): add $x_i$ to FP
If false negative (prediction too low): add $x_i$ to FN

Mathematically the same as before … but allows use of the kernel trick.
The kernel perceptron

A sends B an instance $x_i$; B computes $\hat{y}_i$ and returns it; A replies with the true label $y_i$.

Compute: $\hat{y}_i = \sum_{x_k \in FN} K(x_k, x_i) - \sum_{x_k \in FP} K(x_k, x_i)$, where $K(x, x_k) \equiv x \cdot x_k$

If false positive (prediction too high): add $x_i$ to FP
If false negative (prediction too low): add $x_i$ to FN

Mathematically the same as before … but allows use of the "kernel trick".

Other kernel methods (SVMs, Gaussian processes) aren't constrained to a limited set (+1/−1/0) of weights on the $K(x, v)$ values.
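A sketch of the kernel perceptron as described above (illustrative names; K is any kernel function, labels are ±1):

    def kernel_perceptron(X, y, K, epochs=10):
        # Instead of a weight vector, keep the mistakes:
        # yhat(x) = sum_{xk in FN} K(xk, x) - sum_{xk in FP} K(xk, x)
        FP, FN = [], []
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                s = sum(K(xk, xi) for xk in FN) - sum(K(xk, xi) for xk in FP)
                if yi < 0 and s >= 0:
                    FP.append(xi)          # false positive: score too high
                elif yi > 0 and s <= 0:
                    FN.append(xi)          # false negative: score too low
        return FP, FN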
Some common kernels
• Linear kernel: $K(x, x') \equiv x \cdot x'$
• Polynomial kernel: $K(x, x') \equiv (x \cdot x' + 1)^d$
• Gaussian kernel: $K(x, x') \equiv e^{-\|x - x'\|^2 / \sigma}$
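The same three kernels as code (one plausible parameterization; note this slide's Gaussian divides by σ rather than the also-common 2σ²):

    import numpy as np

    def linear_K(x, xp):
        return x @ xp

    def poly_K(x, xp, d=2):
        return (x @ xp + 1) ** d

    def gaussian_K(x, xp, sigma=1.0):
        return np.exp(-np.linalg.norm(x - xp) ** 2 / sigma)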
Some common kernels
• Polynomial kernel: $K(x, x') \equiv (x \cdot x' + 1)^d$
• For d = 2:

$(\langle x_1, x_2 \rangle \cdot \langle x'_1, x'_2 \rangle + 1)^2$
$= (x_1 x'_1 + x_2 x'_2 + 1)^2$
$= (x_1 x'_1 + x_2 x'_2 + 1)(x_1 x'_1 + x_2 x'_2 + 1)$
$= (x_1 x'_1)^2 + 2 x_1 x'_1 x_2 x'_2 + 2(x_1 x'_1) + (x_2 x'_2)^2 + 2(x_2 x'_2) + 1$
$\cong \langle 1, x_1, x_2, x_1 x_2, x_1^2, x_2^2 \rangle \cdot \langle 1, x'_1, x'_2, x'_1 x'_2, x'^2_1, x'^2_2 \rangle$
$= \langle 1, \sqrt{2}\,x_1, \sqrt{2}\,x_2, \sqrt{2}\,x_1 x_2, x_1^2, x_2^2 \rangle \cdot \langle 1, \sqrt{2}\,x'_1, \sqrt{2}\,x'_2, \sqrt{2}\,x'_1 x'_2, x'^2_1, x'^2_2 \rangle$
Some common kernels
• Polynomial kernel, for d = 2:

$(\langle x_1, x_2 \rangle \cdot \langle x'_1, x'_2 \rangle + 1)^2 = \langle 1, \sqrt{2}\,x_1, \sqrt{2}\,x_2, \sqrt{2}\,x_1 x_2, x_1^2, x_2^2 \rangle \cdot \langle 1, \sqrt{2}\,x'_1, \sqrt{2}\,x'_2, \sqrt{2}\,x'_1 x'_2, x'^2_1, x'^2_2 \rangle$

Similarity with the kernel on x is equivalent to dot-product similarity on a transformed feature vector $\phi(x)$.
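A quick numeric check of the derivation (my own example values):

    import numpy as np

    def phi(x):                  # the transformed feature vector phi(x) for d=2
        x1, x2 = x
        r2 = np.sqrt(2)
        return np.array([1, r2*x1, r2*x2, r2*x1*x2, x1**2, x2**2])

    x, xp = np.array([3.0, 1.0]), np.array([2.0, -1.0])
    assert np.isclose((x @ xp + 1) ** 2, phi(x) @ phi(xp))   # both equal 36.0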
Kernels 101
• Duality: two ways to look at this. For the plain perceptron, $\hat{y} = x \cdot w$ with $w = \sum_{x_k \in FN} x_k - \sum_{x_k \in FP} x_k$.

Implicitly map from x to $\phi(x)$ by changing the kernel function K:
$\hat{y} = \sum_{x_k \in FN} K(x, x_k) - \sum_{x_k \in FP} K(x, x_k)$, where $K(x, x_k) \equiv \phi(x) \cdot \phi(x_k)$

Explicitly map from x to $\phi(x)$, i.e. to the point corresponding to x in the Hilbert space (RKHS):
$\hat{y} = \phi(x) \cdot w$, where $w = \sum_{x_k \in FN} \phi(x_k) - \sum_{x_k \in FP} \phi(x_k)$

Same behavior, but compute time/space are different.
Observation about the perceptron
• Generalization: add weights to the sums for w.
Kernels 101
• Duality
• Gram matrix K: $k_{ij} = K(x_i, x_j)$
  – $K(x, x') = K(x', x)$ → the Gram matrix is symmetric
  – $K(x, x) > 0$ → the diagonal of K is positive → K is "positive semi-definite", i.e. $z^T K z \ge 0$ for all z
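A numeric spot-check of these two properties (my own sketch, using the polynomial kernel on random data):

    import numpy as np

    X = np.random.randn(20, 5)
    K = (X @ X.T + 1) ** 2                        # Gram matrix, k_ij = K(x_i, x_j)
    assert np.allclose(K, K.T)                    # symmetric
    assert np.linalg.eigvalsh(K).min() >= -1e-8   # z'Kz >= 0 for all z (PSD)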
A FAMILIAR KERNEL
Learning as optimization for regularized logistic regression + hashes

• Algorithm:
• Initialize arrays W, A of size R and set k = 0
• For each iteration t = 1, …, T:
  – For each example $(x_i, y_i)$:
    • V is a hash table
    • For each j with $x_j > 0$: increment V[h[j]] by $x_j$
    • $p_i$ = …; k++
    • For each hash value h with V[h] > 0:
      » W[h] *= $(1 - \lambda 2\mu)^{k - A[h]}$
      » W[h] = W[h] + $\lambda (y_i - p_i) V[h]$
      » A[h] = k

where $V[h] = \sum_{j:\ \mathrm{hash}(j)\,\%\,R = h} x_{ij}$
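A sketch of this algorithm in Python (my own rendering; `data` is assumed to be a list of (feature-dict, label) pairs with labels in {0, 1}, and $p_i$ is the usual logistic prediction):

    import numpy as np

    def hashed_logistic_regression(data, R=2**20, lam=0.1, mu=1e-4, T=1):
        W = np.zeros(R)               # hashed weights
        A = np.zeros(R, dtype=int)    # A[h] = value of k when W[h] was last touched
        k = 0
        for _ in range(T):
            for x, y in data:                       # x: dict feature -> value
                V = {}                              # the hashed example
                for j, xj in x.items():
                    h = hash(j) % R
                    V[h] = V.get(h, 0.0) + xj
                p = 1.0 / (1.0 + np.exp(-sum(W[h] * v for h, v in V.items())))
                k += 1
                for h, v in V.items():
                    W[h] *= (1 - lam * 2 * mu) ** (k - A[h])  # lazy regularization
                    W[h] += lam * (y - p) * v                 # SGD step
                    A[h] = k
        return W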
Some details
$\phi[h] = \sum_{j:\ \mathrm{hash}(j)\,\%\,m = h} \xi(j)\, x_{ij}$, where $\xi(j) \in \{-1, +1\}$

Slightly different hash to avoid systematic bias. Here m is the number of buckets you hash into (R in my discussion); compare the earlier $V[h] = \sum_{j:\ \mathrm{hash}(j)\,\%\,R = h} x_{ij}$.
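A sketch of the signed version (deriving ξ from a second, salted hash is my own improvisation):

    def hashed_phi(x, m):
        # phi[h] = sum over j with hash(j) % m == h of xi(j) * x_j,
        # where xi(j) in {-1, +1} removes the systematic (positive) bias
        phi = [0.0] * m
        for j, xj in x.items():
            xi_j = 1 if hash((j, 'sign')) % 2 == 0 else -1   # xi(j)
            phi[hash(j) % m] += xi_j * xj
        return phi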
Some details

I.e., for large feature sets the variance should be low.
Some details
I.e., a hashed vector is probably close to the original vector.
Some details
I.e., the inner products between x and x' are probably not changed too much by the hash function: a classifier will probably still work.
The Voted Perceptron for Ranking and Structured Classification
William Cohen
The voted perceptron for ranking
A sends B instances $x_1, x_2, x_3, x_4, \ldots$; B computes $\hat{y}_i = v_k \cdot x_i$ and returns $b^*$, the index of the "best" $x_i$; A replies with $b$, the index of the right answer.
If mistake: $v_{k+1} = v_k + x_b - x_{b^*}$
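One round of this protocol as code (a sketch; `xs` is a matrix with one candidate instance per row):

    import numpy as np

    def ranking_update(v, xs, b):
        b_star = int(np.argmax(xs @ v))   # B's pick: the highest-scoring x_i
        if b_star != b:                   # A says index b was actually best
            v = v + xs[b] - xs[b_star]
        return v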
[Figure: "Ranking some x's with the target vector u" — the x's projected onto u, with margin γ between consecutive ranks]
[Figure: "Ranking some x's with some guess vector v – part 1"]
[Figure: "Ranking some x's with some guess vector v – part 2." The purple-circled x is $x_{b^*}$, the one the learner has chosen to rank highest. The green-circled x is $x_b$, the right answer.]
[Figure: "Correcting v by adding $x_b - x_{b^*}$"]
[Figure: "Correcting v by adding $x_b - x_{b^*}$ (part 2)", showing $v_k$ and $v_{k+1}$]
[Figure, repeated from the classification mistake-bound proof: "(3a) The guess $v_2$ after the two positive examples: $v_2 = v_1 + x_2$", with target u, margin $2\gamma$, and progress $> \gamma$ per mistake]
Notice this doesn't depend at all on the number of x's being ranked. Neither proof depends on the dimension of the x's.
Ranking perceptrons → structured perceptrons

• The API:
  – A sends B a (maybe huge) set of items to rank
  – B finds the single best one according to the current weight vector
  – A tells B which one was actually best
• Structured classification on a sequence:
  – Input: list of words: x = (w_1, …, w_n)
  – Output: list of labels: y = (y_1, …, y_n)
  – If there are K classes, there are $K^n$ labels possible for x
Borkar et al.'s HMMs for segmentation
– Example: addresses, bib records
– Problem: some DBs may split records up differently (e.g. no "mail stop" field, combining address and apt #, …) or not at all
– Solution: learn to segment the textual form of records

P.P. Wangikar, T.P. Graycar, D.A. Estell, D.S. Clark, J.S. Dordick (1993) Protein and Solvent Engineering of Subtilisin BPN' in Nearly Anhydrous Organic Media. J. Amer. Chem. Soc. 115, 12231-12237.
→ segmented into the fields Author, Year, Title, Journal, Volume, Page
IE with Hidden Markov Models

[Figure: an HMM over states Author, Year, Title, Journal, with transition probabilities on the edges (0.9, 0.8, 0.5, 0.5, 0.2, 0.1) and per-state emission probabilities, e.g. Title: Learning 0.06, Convex 0.03, …; Journal: Comm. 0.04, Trans. 0.02, Chemical 0.004; Author: Smith 0.01, Cohen 0.05, Jordan 0.3, …; Year: dddd 0.8, dd 0.2]
Inference for linear-chain CRFs
When will prof Cohen post the notes …
Idea 1: features are properties of two adjacent tokens, and the pair of labels assigned to them (Begin,Inside,Outside)
• (y(i)==B or y(i)==I) and (token(i) is capitalized)
• (y(i)==I and y(i-1)==B) and (token(i) is hyphenated)
• (y(i)==B and y(i-1)==B)
• e.g., "tell Rose William is on the way"
Idea 2: construct a graph where each path is a possible sequence labeling.
Inference for a linear-chain CRF
[Figure: a trellis with one column of states {B, I, O} for each token of "When will prof Cohen post the notes …"; every left-to-right path is one possible sequence labeling]
• Inference: find the highest-weight path given a weighting of features
• This can be done efficiently using dynamic programming (Viterbi)
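A sketch of that dynamic program (my own conventions: `score[i][s, t]` is the weight of giving token i label t when token i−1 has label s, with the edge and token features already folded in; row 0 of `score[0]` holds the start scores):

    import numpy as np

    def viterbi(score):
        n, S = len(score), score[0].shape[1]
        best = np.full((n, S), -np.inf)     # best[i, t]: top path ending in t at i
        back = np.zeros((n, S), dtype=int)  # backpointers
        best[0] = score[0][0]
        for i in range(1, n):
            for t in range(S):
                cand = best[i - 1] + score[i][:, t]
                back[i, t] = int(np.argmax(cand))
                best[i, t] = cand[back[i, t]]
        path = [int(np.argmax(best[-1]))]   # recover the highest-weight path
        for i in range(n - 1, 0, -1):
            path.append(back[i, path[-1]])
        return path[::-1]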
Ranking perceptrons → structured perceptrons

• New API:
  – A sends B the word sequence x
  – B finds the single best y according to the current weight vector, using Viterbi
  – A tells B which y was actually best
  – This is equivalent to ranking pairs g = (x, y′)
• Structured classification on a sequence:
  – Input: list of words: x = (w_1, …, w_n)
  – Output: list of labels: y = (y_1, …, y_n)
  – If there are K classes, there are $K^n$ labels possible for x
The voted perceptron for ranking
A sends B instances $x_1, x_2, x_3, x_4, \ldots$; B computes $\hat{y}_i = v_k \cdot x_i$ and returns $b^*$, the index of the "best" $x_i$; A replies with $b$, the right answer.
If mistake: $v_{k+1} = v_k + x_b - x_{b^*}$

Change number one is notation: replace x with g.
The voted perceptron for structured classification tasks
A sends B instances $g_1, g_2, g_3, g_4, \ldots$; B computes $\hat{y}_i = v_k \cdot g_i$ and returns $b^*$, the index of the "best" $g_i$; A replies with $b$.
If mistake: $v_{k+1} = v_k + g_b - g_{b^*}$

1. A sends B feature functions, and instructions for creating the instances g:
   • A sends a word vector $x_i$. Then B could create the instances $g_1 = F(x_i, y_1), g_2 = F(x_i, y_2), \ldots$
   • But instead B just returns the $y^*$ that gives the best score for the dot product $v_k \cdot F(x_i, y^*)$, by using Viterbi.
2. A sends B the correct label sequence $y_i$.
3. On errors, B sets $v_{k+1} = v_k + g_b - g_{b^*} = v_k + F(x_i, y) - F(x_i, y^*)$.
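The whole interaction as one sketch (`F` and `viterbi_argmax` are stand-ins for the feature function and a decoder like the one above; both names are mine):

    def structured_perceptron_step(v, x, y_true, F, viterbi_argmax):
        y_star = viterbi_argmax(v, x)            # B's best y under v (via Viterbi)
        if y_star != y_true:                     # A reveals the correct sequence
            v = v + F(x, y_true) - F(x, y_star)  # v_{k+1} = v_k + g_b - g_{b*}
        return v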
EMNLP 2002, Best paper
Results from the original paper…
Collins' Experiments
• POS tagging
• NP chunking (words and POS tags from Brill's tagger as features) and BIO output tags
• Compared logistic regression methods (MaxEnt) and "voted-perceptron-trained HMMs"
  – With and w/o averaging
  – With and w/o feature selection (count > 5)
Collins’ results
Where we are…
• Experiments with a hash-trick implementation of logistic regression
• Next question:
  – How do you parallelize SGD, or more generally, this kind of streaming algorithm?
  – Each example affects the next prediction → order matters → parallelization changes the behavior
  – We will step back to perceptrons and then step forward to parallel perceptrons
  – Then another nice parallel learning algorithm, then a midterm