dr. perceptron - cs.cmu.edu
TRANSCRIPT
[1] http://futurama.wikia.com/wiki/Dr._Perceptron
Where we are…
• Experiments with a hash-trick implementation of logistic regression
• Next question:
  – How do you parallelize SGD, or more generally, this kind of streaming algorithm?
  – Each example affects the next prediction → order matters → parallelization changes the behavior
  – We will step back to perceptrons and then step forward to parallel perceptrons
  – Then another nice parallel learning algorithm, then a midterm
Recap: perceptrons
The perceptron
A sends B an instance $x_i$; B computes $\hat{y}_i = \mathrm{sign}(v_k \cdot x_i)$ and returns $\hat{y}_i$; A replies with the true label $y_i$.
If mistake: $v_{k+1} = v_k + y_i x_i$
The perceptron

[Figure: positive (+) and negative (−) examples separated by a margin of $2\gamma$ around the separator defined by the target vector $u$]

Mistake bound: $k \le \left(R/\gamma\right)^2$
A lot like SGD update for logistic regression!
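A minimal sketch of this mistake-driven loop (my own illustration, assuming dense numpy arrays and labels in {−1, +1}):

    import numpy as np

    def perceptron(X, y, epochs=10):
        """Train v with the update from the slide: on a mistake, v += y_i * x_i."""
        v = np.zeros(X.shape[1])
        for _ in range(epochs):
            for xi, yi in zip(X, y):              # yi is -1 or +1
                if np.sign(v @ xi) != yi:         # mistake (sign(0) counts as one)
                    v = v + yi * xi
        return v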
On-line to batch learning
1. Pick a $v_k$ at random according to $m_k/m$, the fraction of examples it was used for.
2. Predict using the $v_k$ you just picked.
3. (Actually, use some sort of deterministic approximation to this.)

[Figure: a run of the perceptron with, e.g., $m_1 = 3$, $m_2 = 4$, …, $m = 10$]

Predict using $\mathrm{sign}(v^* \cdot x)$.
Averaging/voting

[Figure: learning curves comparing the averaged perceptron with the last perceptron]

Also: there's a sparsification trick that makes learning the averaged perceptron fast.
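A sketch of the deterministic approximation (averaging); this is my own rendering, and it sums the dense weight vector every step, which is exactly what the sparsification trick would avoid:

    import numpy as np

    def averaged_perceptron(X, y, epochs=10):
        # v_sum accumulates v after every example, so each v_k ends up
        # weighted by m_k, the number of examples it survived -- the
        # deterministic stand-in for sampling v_k with probability m_k/m.
        v = np.zeros(X.shape[1])
        v_sum = np.zeros(X.shape[1])
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                if np.sign(v @ xi) != yi:
                    v = v + yi * xi
                v_sum += v
        return v_sum / (epochs * len(X))   # v*; predict with sign(v* . x)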
KERNELS AND PERCEPTRONS
The kernel perceptron
A sends B an instance $x_i$; B computes $\hat{y}_i = v_k \cdot x_i$ and returns it; A replies with the true label $y_i$.
If mistake: $v_{k+1} = v_k + y_i x_i$

Equivalently, keep two sets of mistakes, FP and FN, instead of $v_k$:

Compute: $\hat{y}_i = \sum_{x_k \in FN} x_k \cdot x_i - \sum_{x_k \in FP} x_k \cdot x_i$

If false positive (prediction too high): add $x_i$ to FP
If false negative (prediction too low): add $x_i$ to FN

Mathematically the same as before … but allows use of the kernel trick.
The kernel perceptron

A sends B an instance $x_i$; B computes $\hat{y}_i$ and returns it; A replies with the true label $y_i$.

Compute: $\hat{y}_i = \sum_{x_k \in FN} K(x_k, x_i) - \sum_{x_k \in FP} K(x_k, x_i)$, where $K(x, x_k) \equiv x \cdot x_k$

If false positive (prediction too high): add $x_i$ to FP
If false negative (prediction too low): add $x_i$ to FN

Mathematically the same as before … but allows use of the "kernel trick".

Other kernel methods (SVMs, Gaussian processes) aren't constrained to a limited set (+1/−1/0) of weights on the $K(x, v)$ values.
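A sketch of the kernel perceptron as described above (illustrative names; K is any kernel function, labels are ±1):

    def kernel_perceptron(X, y, K, epochs=10):
        # Instead of a weight vector, keep the mistakes:
        # yhat(x) = sum_{xk in FN} K(xk, x) - sum_{xk in FP} K(xk, x)
        FP, FN = [], []
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                s = sum(K(xk, xi) for xk in FN) - sum(K(xk, xi) for xk in FP)
                if yi < 0 and s >= 0:
                    FP.append(xi)          # false positive: score too high
                elif yi > 0 and s <= 0:
                    FN.append(xi)          # false negative: score too low
        return FP, FN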
Some common kernels
• Linear kernel: $K(x, x') \equiv x \cdot x'$
• Polynomial kernel: $K(x, x') \equiv (x \cdot x' + 1)^d$
• Gaussian kernel: $K(x, x') \equiv e^{-\|x - x'\|^2 / \sigma}$
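The same three kernels as code (one plausible parameterization; note this slide's Gaussian divides by σ rather than the also-common 2σ²):

    import numpy as np

    def linear_K(x, xp):
        return x @ xp

    def poly_K(x, xp, d=2):
        return (x @ xp + 1) ** d

    def gaussian_K(x, xp, sigma=1.0):
        return np.exp(-np.linalg.norm(x - xp) ** 2 / sigma)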
Some common kernels
• Polynomial kernel: $K(x, x') \equiv (x \cdot x' + 1)^d$
• For d = 2:

$(\langle x_1, x_2 \rangle \cdot \langle x'_1, x'_2 \rangle + 1)^2$
$= (x_1 x'_1 + x_2 x'_2 + 1)^2$
$= (x_1 x'_1 + x_2 x'_2 + 1)(x_1 x'_1 + x_2 x'_2 + 1)$
$= (x_1 x'_1)^2 + 2 x_1 x'_1 x_2 x'_2 + 2(x_1 x'_1) + (x_2 x'_2)^2 + 2(x_2 x'_2) + 1$
$\cong \langle 1, x_1, x_2, x_1 x_2, x_1^2, x_2^2 \rangle \cdot \langle 1, x'_1, x'_2, x'_1 x'_2, x'^2_1, x'^2_2 \rangle$
$= \langle 1, \sqrt{2}\,x_1, \sqrt{2}\,x_2, \sqrt{2}\,x_1 x_2, x_1^2, x_2^2 \rangle \cdot \langle 1, \sqrt{2}\,x'_1, \sqrt{2}\,x'_2, \sqrt{2}\,x'_1 x'_2, x'^2_1, x'^2_2 \rangle$
Some common kernels
• Polynomial kernel, for d = 2:

$(\langle x_1, x_2 \rangle \cdot \langle x'_1, x'_2 \rangle + 1)^2 = \langle 1, \sqrt{2}\,x_1, \sqrt{2}\,x_2, \sqrt{2}\,x_1 x_2, x_1^2, x_2^2 \rangle \cdot \langle 1, \sqrt{2}\,x'_1, \sqrt{2}\,x'_2, \sqrt{2}\,x'_1 x'_2, x'^2_1, x'^2_2 \rangle$

Similarity with the kernel on x is equivalent to dot-product similarity on a transformed feature vector $\phi(x)$.
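A quick numeric check of the derivation (my own example values):

    import numpy as np

    def phi(x):                  # the transformed feature vector phi(x) for d=2
        x1, x2 = x
        r2 = np.sqrt(2)
        return np.array([1, r2*x1, r2*x2, r2*x1*x2, x1**2, x2**2])

    x, xp = np.array([3.0, 1.0]), np.array([2.0, -1.0])
    assert np.isclose((x @ xp + 1) ** 2, phi(x) @ phi(xp))   # both equal 36.0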
Kernels 101
• Duality: two ways to look at this. For the plain perceptron, $\hat{y} = x \cdot w$ with $w = \sum_{x_k \in FN} x_k - \sum_{x_k \in FP} x_k$.

Implicitly map from x to $\phi(x)$ by changing the kernel function K:
$\hat{y} = \sum_{x_k \in FN} K(x, x_k) - \sum_{x_k \in FP} K(x, x_k)$, where $K(x, x_k) \equiv \phi(x) \cdot \phi(x_k)$

Explicitly map from x to $\phi(x)$, i.e. to the point corresponding to x in the Hilbert space (RKHS):
$\hat{y} = \phi(x) \cdot w$, where $w = \sum_{x_k \in FN} \phi(x_k) - \sum_{x_k \in FP} \phi(x_k)$

Same behavior, but compute time/space are different.
Observation about the perceptron
• Generalization: add weights to the sums for w.
Kernels 101
• Duality
• Gram matrix K: $k_{ij} = K(x_i, x_j)$
  – $K(x, x') = K(x', x)$ → the Gram matrix is symmetric
  – $K(x, x) > 0$ → the diagonal of K is positive → K is "positive semi-definite", i.e. $z^T K z \ge 0$ for all z
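A numeric spot-check of these two properties (my own sketch, using the polynomial kernel on random data):

    import numpy as np

    X = np.random.randn(20, 5)
    K = (X @ X.T + 1) ** 2                        # Gram matrix, k_ij = K(x_i, x_j)
    assert np.allclose(K, K.T)                    # symmetric
    assert np.linalg.eigvalsh(K).min() >= -1e-8   # z'Kz >= 0 for all z (PSD)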
A FAMILIAR KERNEL
Learning as optimization for regularized logistic regression + hashes

• Algorithm:
• Initialize arrays W, A of size R and set k = 0
• For each iteration t = 1, …, T:
  – For each example $(x_i, y_i)$:
    • V is a hash table
    • For each j with $x_j > 0$: increment V[h[j]] by $x_j$
    • $p_i$ = …; k++
    • For each hash value h with V[h] > 0:
      » W[h] *= $(1 - \lambda 2\mu)^{k - A[h]}$
      » W[h] = W[h] + $\lambda (y_i - p_i) V[h]$
      » A[h] = k

where $V[h] = \sum_{j:\ \mathrm{hash}(j)\,\%\,R = h} x_{ij}$
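A sketch of this algorithm in Python (my own rendering; `data` is assumed to be a list of (feature-dict, label) pairs with labels in {0, 1}, and $p_i$ is the usual logistic prediction):

    import numpy as np

    def hashed_logistic_regression(data, R=2**20, lam=0.1, mu=1e-4, T=1):
        W = np.zeros(R)               # hashed weights
        A = np.zeros(R, dtype=int)    # A[h] = value of k when W[h] was last touched
        k = 0
        for _ in range(T):
            for x, y in data:                       # x: dict feature -> value
                V = {}                              # the hashed example
                for j, xj in x.items():
                    h = hash(j) % R
                    V[h] = V.get(h, 0.0) + xj
                p = 1.0 / (1.0 + np.exp(-sum(W[h] * v for h, v in V.items())))
                k += 1
                for h, v in V.items():
                    W[h] *= (1 - lam * 2 * mu) ** (k - A[h])  # lazy regularization
                    W[h] += lam * (y - p) * v                 # SGD step
                    A[h] = k
        return W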
Some details
$\phi[h] = \sum_{j:\ \mathrm{hash}(j)\,\%\,m = h} \xi(j)\, x_{ij}$, where $\xi(j) \in \{-1, +1\}$

Slightly different hash to avoid systematic bias. Here m is the number of buckets you hash into (R in my discussion); compare the earlier $V[h] = \sum_{j:\ \mathrm{hash}(j)\,\%\,R = h} x_{ij}$.
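A sketch of the signed version (deriving ξ from a second, salted hash is my own improvisation):

    def hashed_phi(x, m):
        # phi[h] = sum over j with hash(j) % m == h of xi(j) * x_j,
        # where xi(j) in {-1, +1} removes the systematic (positive) bias
        phi = [0.0] * m
        for j, xj in x.items():
            xi_j = 1 if hash((j, 'sign')) % 2 == 0 else -1   # xi(j)
            phi[hash(j) % m] += xi_j * xj
        return phi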
Some details

I.e., for large feature sets the variance should be low.
Some details
I.e., a hashed vector is probably close to the original vector.
Some details
I.e., the inner products between x and x' are probably not changed too much by the hash function: a classifier will probably still work.
The Voted Perceptron for Ranking and Structured Classification
William Cohen
The voted perceptron for ranking
A sends B instances $x_1, x_2, x_3, x_4, \ldots$; B computes $\hat{y}_i = v_k \cdot x_i$ and returns $b^*$, the index of the "best" $x_i$; A replies with $b$, the index of the right answer.
If mistake: $v_{k+1} = v_k + x_b - x_{b^*}$
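One round of this protocol as code (a sketch; `xs` is a matrix with one candidate instance per row):

    import numpy as np

    def ranking_update(v, xs, b):
        b_star = int(np.argmax(xs @ v))   # B's pick: the highest-scoring x_i
        if b_star != b:                   # A says index b was actually best
            v = v + xs[b] - xs[b_star]
        return v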
[Figure: "Ranking some x's with the target vector u" — the x's projected onto u, with margin γ between consecutive ranks]
[Figure: "Ranking some x's with some guess vector v – part 1"]
[Figure: "Ranking some x's with some guess vector v – part 2." The purple-circled x is $x_{b^*}$, the one the learner has chosen to rank highest. The green-circled x is $x_b$, the right answer.]
[Figure: "Correcting v by adding $x_b - x_{b^*}$"]
[Figure: "Correcting v by adding $x_b - x_{b^*}$ (part 2)", showing $v_k$ and $v_{k+1}$]
[Figure, repeated from the classification mistake-bound proof: "(3a) The guess $v_2$ after the two positive examples: $v_2 = v_1 + x_2$", with target u, margin $2\gamma$, and progress $> \gamma$ per mistake]
Notice this doesn't depend at all on the number of x's being ranked. Neither proof depends on the dimension of the x's.
Ranking perceptrons → structured perceptrons

• The API:
  – A sends B a (maybe huge) set of items to rank
  – B finds the single best one according to the current weight vector
  – A tells B which one was actually best
• Structured classification on a sequence:
  – Input: list of words: x = (w_1, …, w_n)
  – Output: list of labels: y = (y_1, …, y_n)
  – If there are K classes, there are $K^n$ labels possible for x
Borkar et al.'s HMMs for segmentation
– Example: addresses, bib records
– Problem: some DBs may split records up differently (e.g. no "mail stop" field, combining address and apt #, …) or not at all
– Solution: learn to segment the textual form of records

P.P. Wangikar, T.P. Graycar, D.A. Estell, D.S. Clark, J.S. Dordick (1993) Protein and Solvent Engineering of Subtilisin BPN' in Nearly Anhydrous Organic Media. J. Amer. Chem. Soc. 115, 12231-12237.
→ segmented into the fields Author, Year, Title, Journal, Volume, Page
IE with Hidden Markov Models

[Figure: an HMM over states Author, Year, Title, Journal, with transition probabilities on the edges (0.9, 0.8, 0.5, 0.5, 0.2, 0.1) and per-state emission probabilities, e.g. Title: Learning 0.06, Convex 0.03, …; Journal: Comm. 0.04, Trans. 0.02, Chemical 0.004; Author: Smith 0.01, Cohen 0.05, Jordan 0.3, …; Year: dddd 0.8, dd 0.2]
Inference for linear-chain CRFs
When will prof Cohen post the notes …
Idea 1: features are properties of two adjacent tokens, and the pair of labels assigned to them (Begin,Inside,Outside)
• (y(i)==B or y(i)==I) and (token(i) is capitalized)
• (y(i)==I and y(i-1)==B) and (token(i) is hyphenated)
• (y(i)==B and y(i-1)==B)
• e.g., "tell Rose William is on the way"
Idea 2: construct a graph where each path is a possible sequence labeling.
Inference for a linear-chain CRF
[Figure: a trellis with one column of states {B, I, O} for each token of "When will prof Cohen post the notes …"; every left-to-right path is one possible sequence labeling]
• Inference: find the highest-weight path given a weighting of features
• This can be done efficiently using dynamic programming (Viterbi)
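A sketch of that dynamic program (my own conventions: `score[i][s, t]` is the weight of giving token i label t when token i−1 has label s, with the edge and token features already folded in; row 0 of `score[0]` holds the start scores):

    import numpy as np

    def viterbi(score):
        n, S = len(score), score[0].shape[1]
        best = np.full((n, S), -np.inf)     # best[i, t]: top path ending in t at i
        back = np.zeros((n, S), dtype=int)  # backpointers
        best[0] = score[0][0]
        for i in range(1, n):
            for t in range(S):
                cand = best[i - 1] + score[i][:, t]
                back[i, t] = int(np.argmax(cand))
                best[i, t] = cand[back[i, t]]
        path = [int(np.argmax(best[-1]))]   # recover the highest-weight path
        for i in range(n - 1, 0, -1):
            path.append(back[i, path[-1]])
        return path[::-1]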
Ranking perceptrons → structured perceptrons

• New API:
  – A sends B the word sequence x
  – B finds the single best y according to the current weight vector, using Viterbi
  – A tells B which y was actually best
  – This is equivalent to ranking pairs g = (x, y′)
• Structured classification on a sequence:
  – Input: list of words: x = (w_1, …, w_n)
  – Output: list of labels: y = (y_1, …, y_n)
  – If there are K classes, there are $K^n$ labels possible for x
The voted perceptron for ranking
A sends B instances $x_1, x_2, x_3, x_4, \ldots$; B computes $\hat{y}_i = v_k \cdot x_i$ and returns $b^*$, the index of the "best" $x_i$; A replies with $b$, the right answer.
If mistake: $v_{k+1} = v_k + x_b - x_{b^*}$

Change number one is notation: replace x with g.
The voted perceptron for structured classification tasks
A sends B instances $g_1, g_2, g_3, g_4, \ldots$; B computes $\hat{y}_i = v_k \cdot g_i$ and returns $b^*$, the index of the "best" $g_i$; A replies with $b$.
If mistake: $v_{k+1} = v_k + g_b - g_{b^*}$

1. A sends B feature functions, and instructions for creating the instances g:
   • A sends a word vector $x_i$. Then B could create the instances $g_1 = F(x_i, y_1), g_2 = F(x_i, y_2), \ldots$
   • But instead B just returns the $y^*$ that gives the best score for the dot product $v_k \cdot F(x_i, y^*)$, by using Viterbi.
2. A sends B the correct label sequence $y_i$.
3. On errors, B sets $v_{k+1} = v_k + g_b - g_{b^*} = v_k + F(x_i, y) - F(x_i, y^*)$.
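The whole interaction as one sketch (`F` and `viterbi_argmax` are stand-ins for the feature function and a decoder like the one above; both names are mine):

    def structured_perceptron_step(v, x, y_true, F, viterbi_argmax):
        y_star = viterbi_argmax(v, x)            # B's best y under v (via Viterbi)
        if y_star != y_true:                     # A reveals the correct sequence
            v = v + F(x, y_true) - F(x, y_star)  # v_{k+1} = v_k + g_b - g_{b*}
        return v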
EMNLP 2002, Best paper
Results from the original paper…
Collins' Experiments
• POS tagging
• NP chunking (words and POS tags from Brill's tagger as features) and BIO output tags
• Compared logistic regression methods (MaxEnt) and "voted-perceptron-trained HMMs"
  – With and w/o averaging
  – With and w/o feature selection (count > 5)
Collins’ results
Where we are…
• Experiments with a hash-trick implementation of logistic regression
• Next question:
  – How do you parallelize SGD, or more generally, this kind of streaming algorithm?
  – Each example affects the next prediction → order matters → parallelization changes the behavior
  – We will step back to perceptrons and then step forward to parallel perceptrons
  – Then another nice parallel learning algorithm, then a midterm