Dr. Perceptron - cs.cmu.edu


Page 1: Dr. Perceptron - cs.cmu.edu

1 http://futurama.wikia.com/wiki/Dr._Perceptron

Page 2: Dr. Perceptron - cs.cmu.edu

Where we are…

• Experiments with a hash-trick implementation of logistic regression
• Next question:
  – how do you parallelize SGD, or more generally, this kind of streaming algorithm?
  – each example affects the next prediction → order matters → parallelization changes the behavior
  – we will step back to perceptrons and then step forward to parallel perceptrons
  – then another nice parallel learning algorithm
  – then a midterm

Page 3: Dr. Perceptron - cs.cmu.edu

Recap: perceptrons

Page 4: Dr. Perceptron - cs.cmu.edu

The perceptron

A sends B an instance xi. B computes ŷi = sign(vk · xi) and sends back the prediction ŷi; A responds with the true label yi.

If mistake: vk+1 = vk + yi xi

Page 5: Dr. Perceptron - cs.cmu.edu

The perceptron

A sends B an instance xi. B computes ŷi = sign(vk · xi) and sends back the prediction ŷi; A responds with the true label yi.

If mistake: vk+1 = vk + yi xi

Mistake bound: k ≤ (R/γ)², where R bounds ‖xi‖ and γ is the margin of the separator u.

[Figure: positive and negative examples separated by the target vector u (and −u), with margin γ]

A lot like SGD update for logistic regression!
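The update above is easy to sketch in code. Here is a minimal numpy version of the mistake-driven loop (an illustration, not taken from the slides):

```python
import numpy as np

def perceptron(examples, n_features, n_passes=10):
    """Mistake-driven perceptron: v changes only when sign(v . x) != y."""
    v = np.zeros(n_features)
    for _ in range(n_passes):
        for x, y in examples:            # x: numpy vector, y in {-1, +1}
            if np.sign(v @ x) != y:      # mistake (sign(0) counts as a mistake)
                v = v + y * x            # vk+1 = vk + yi * xi
    return v
```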

Page 6: Dr. Perceptron - cs.cmu.edu

On-line to batch learning

1. Pick a vk at random according to mk/m, the fraction of examples it was used for.
2. Predict using the vk you just picked.
3. (Actually, use some sort of deterministic approximation to this.)

[Figure: e.g. m1 = 3, m2 = 4, …, m = 10]

Page 7: Dr. Perceptron - cs.cmu.edu

1. Pick a vk at random according to mk/m, the fraction of examples it was used for.
2. Predict using the vk you just picked.
3. (Actually, use some sort of deterministic approximation to this.)

predict using sign(v* · x)

[Figure: e.g. m1 = 3, m2 = 4, …, m = 10]
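Step 1 is just weighted sampling. A small sketch, assuming the learner kept every intermediate vk and the count mk of examples each one survived:

```python
import random

def pick_vk(vs, counts):
    """Pick vk with probability mk / m, where counts[k] = mk and m = sum(counts)."""
    return random.choices(vs, weights=counts, k=1)[0]
```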

Page 8: Dr. Perceptron - cs.cmu.edu

predict using sign(v* · x)

Averaging/voting

Last perceptron

Also: there’s a sparsification trick that makes learning the averaged perceptron fast.
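One common form of that trick (the lazy-accumulator version popularized in Daumé's CIML notes; assumed here rather than taken from these slides) only touches a second vector on mistakes, yet still recovers the average of all intermediate weight vectors:

```python
import numpy as np

def averaged_perceptron(examples, n_features, n_passes=10):
    """Averaged perceptron with a lazy accumulator: only mistakes touch u,
    but w - u/c equals the average of the intermediate weight vectors."""
    w = np.zeros(n_features)
    u = np.zeros(n_features)     # counter-weighted sum of the updates
    c = 1.0                      # example counter
    for _ in range(n_passes):
        for x, y in examples:                # y in {-1, +1}
            if y * (w @ x) <= 0:             # mistake
                w = w + y * x
                u = u + c * y * x
            c += 1.0
    return w - u / c                         # the averaged weight vector
```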

Page 9: Dr. Perceptron - cs.cmu.edu

KERNELS AND PERCEPTRONS

Page 10: Dr. Perceptron - cs.cmu.edu

The kernel perceptron

A sends B an instance xi. B computes ŷi = vk · xi and sends back the prediction; A responds with the true label yi.

If mistake: vk+1 = vk + yi xi

Equivalently, keep two sets of mistake examples, FP and FN, and compute

  ŷ = Σ_{xi ∈ FN} x · xi − Σ_{xi ∈ FP} x · xi

If false positive (prediction too high) mistake: add xi to FP
If false negative (prediction too low) mistake: add xi to FN

Mathematically the same as before … but allows use of the kernel trick

Page 11: Dr. Perceptron - cs.cmu.edu

The kernel perceptron

A sends B an instance xi. B computes ŷi = vk · xi and sends back the prediction; A responds with the true label yi.

If mistake: vk+1 = vk + yi xi

Equivalently, keep two sets of mistake examples, FP and FN, and compute

  ŷ = Σ_{xi ∈ FN} x · xi − Σ_{xi ∈ FP} x · xi

If false positive (prediction too high) mistake: add xi to FP
If false negative (prediction too low) mistake: add xi to FN

Mathematically the same as before … but allows use of the “kernel trick”: replace each dot product with a kernel,

  ŷ = Σ_{xi ∈ FN} K(x, xi) − Σ_{xi ∈ FP} K(x, xi),   where here K(x, xk) ≡ x · xk

Other kernel methods (SVM, Gaussian processes) aren’t constrained to a limited set (+1/−1/0) of weights on the K(x,v) values.
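A short sketch of the dual form above, keeping the two mistake sets and scoring new points against them with an arbitrary kernel K (names and signatures are illustrative):

```python
def kp_score(x, FP, FN, K):
    """Dual form: a +1/-1-weighted sum of kernel values against past mistakes."""
    return sum(K(x, xi) for xi in FN) - sum(K(x, xi) for xi in FP)

def kernel_perceptron(examples, K, n_passes=5):
    FP, FN = [], []                      # false-positive / false-negative mistake sets
    for _ in range(n_passes):
        for x, y in examples:            # y in {-1, +1}
            s = kp_score(x, FP, FN, K)
            if y > 0 and s <= 0:         # false negative: prediction too low
                FN.append(x)
            elif y < 0 and s >= 0:       # false positive: prediction too high
                FP.append(x)
    return FP, FN
```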

Page 12: Dr. Perceptron - cs.cmu.edu

Some common kernels

• Linear kernel: K(x, x′) ≡ x · x′
• Polynomial kernel: K(x, x′) ≡ (x · x′ + 1)^d
• Gaussian kernel: K(x, x′) ≡ e^(−‖x − x′‖² / σ)
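The three kernels written out directly in numpy (a sketch; the Gaussian kernel follows the slide's e^(−‖x − x′‖²/σ) form):

```python
import numpy as np

def linear_kernel(x, xp):
    return x @ xp

def poly_kernel(x, xp, d=2):
    return (x @ xp + 1.0) ** d

def gaussian_kernel(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / sigma)
```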

Page 13: Dr. Perceptron - cs.cmu.edu

Some common kernels

• Polynomial kernel: K(x, x′) ≡ (x · x′ + 1)^d
• for d = 2:

  (⟨(x1, x2), (x′1, x′2)⟩ + 1)²
  = (x1x′1 + x2x′2 + 1)²
  = (x1x′1 + x2x′2 + 1)(x1x′1 + x2x′2 + 1)
  = (x1x′1)² + 2·x1x′1·x2x′2 + 2(x1x′1) + (x2x′2)² + 2(x2x′2) + 1
  ≅ ⟨(1, x1, x2, x1x2, x1², x2²), (1, x′1, x′2, x′1x′2, x′1², x′2²)⟩
  = ⟨(1, √2·x1, √2·x2, √2·x1x2, x1², x2²), (1, √2·x′1, √2·x′2, √2·x′1x′2, x′1², x′2²)⟩

Page 14: Dr. Perceptron - cs.cmu.edu

Some common kernels

• Polynomial kernel: K(x, x′) ≡ (x · x′ + 1)^d
• for d = 2:

  (⟨(x1, x2), (x′1, x′2)⟩ + 1)²
  = ⟨(1, √2·x1, √2·x2, √2·x1x2, x1², x2²), (1, √2·x′1, √2·x′2, √2·x′1x′2, x′1², x′2²)⟩

Similarity with the kernel on x is equivalent to dot-product similarity on a transformed feature vector φ(x).
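A quick numeric check of this equivalence for d = 2: the kernel value equals the dot product of the explicit feature vectors φ(x) (the feature map and test points are illustrative):

```python
import numpy as np

def phi(x):
    """Explicit feature map for the d=2 polynomial kernel, x in R^2."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     np.sqrt(2) * x1 * x2,
                     x1 ** 2, x2 ** 2])

x  = np.array([0.5, -1.0])
xp = np.array([2.0,  3.0])
lhs = (x @ xp + 1.0) ** 2     # kernel value
rhs = phi(x) @ phi(xp)        # dot product in the transformed space
assert np.isclose(lhs, rhs)
```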

Page 15: Dr. Perceptron - cs.cmu.edu

Kernels 101

• Duality: two ways to look at this

Observation about perceptron – the weight vector is just a ±1-weighted sum of mistake examples, so the prediction can be written either way:

  ŷ = x · w = K(x, w),   with   w = Σ_{xk ∈ FN} xk − Σ_{xk ∈ FP} xk

Generalization of perceptron – the same two views after mapping x to φ(x):

  ŷ = φ(x) · w,   with   w = Σ_{xk ∈ FN} φ(xk) − Σ_{xk ∈ FP} φ(xk)
      (explicitly map from x to φ(x) – i.e. to the point corresponding to x in the Hilbert space (RKHS))

  ŷ = Σ_{xi ∈ FN} K(x, xi) − Σ_{xi ∈ FP} K(x, xi),   where   K(x, xk) ≡ φ(x) · φ(xk)
      (implicitly map from x to φ(x) by changing the kernel function K)

Same behavior, but compute time/space are different.

Further generalization: add weights to the sums for w.

Page 16: Dr. Perceptron - cs.cmu.edu

Kernels 101

• Duality
• Gram matrix K: kij = K(xi, xj)
  – K(x, x′) = K(x′, x) → the Gram matrix is symmetric
  – K(x, x) > 0 → the diagonal of K is positive
  – K is “positive semi-definite”, i.e. zᵀ K z ≥ 0 for all z
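A small check of these properties, using the d = 2 polynomial kernel on random data (illustrative only):

```python
import numpy as np

X = np.random.randn(20, 5)
K = (X @ X.T + 1.0) ** 2          # Gram matrix for the d=2 polynomial kernel

assert np.allclose(K, K.T)        # symmetric, since K(x, x') = K(x', x)
eigvals = np.linalg.eigvalsh(K)
assert eigvals.min() > -1e-8      # positive semi-definite: z^T K z >= 0 for all z
```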

Page 17: Dr. Perceptron - cs.cmu.edu

A FAMILIAR KERNEL

Page 18: Dr. Perceptron - cs.cmu.edu

Learning as optimization for regularized logistic regression + hashes

• Algorithm:
  – Initialize arrays W, A of size R and set k = 0
  – For each iteration t = 1, …, T
    • For each example (xi, yi):
      – Build V, a hash table holding the hashed example:
          V[h] = Σ_{j : hash(j) % R == h} xij
        (i.e., for j : xj > 0, increment V[h[j]] by xj)
      – pi = … ; k++
      – For each hash value h with V[h] > 0:
        » W[h] *= (1 − λ2µ)^(k − A[h])
        » W[h] = W[h] + λ(yi − pi)·V[h]
        » A[h] = k
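A rough sketch of that algorithm in Python, assuming binary labels y ∈ {0, 1}, a logistic pi, and Python's built-in hash standing in for the hash function (a real system would use a stable hash such as murmurhash):

```python
import math

def train_hashed_lr(stream, R, lam=0.1, mu=0.01, n_iters=1):
    """Hash-trick logistic regression with lazy ("catch-up") L2 regularization.
    W[h] holds the weight for bucket h; A[h] remembers the last step at which
    W[h] was regularized, so the decay can be applied all at once."""
    W = [0.0] * R
    A = [0] * R
    k = 0
    for _ in range(n_iters):
        for features, y in stream:              # features: dict name -> value, y in {0, 1}
            V = {}                              # hashed copy of the example
            for j, xj in features.items():
                h = hash(j) % R
                V[h] = V.get(h, 0.0) + xj
            z = sum(W[h] * v for h, v in V.items())
            p = 1.0 / (1.0 + math.exp(-max(min(z, 30.0), -30.0)))   # clamped logistic
            k += 1
            for h, v in V.items():              # only the buckets this example touches
                W[h] *= (1.0 - lam * 2 * mu) ** (k - A[h])          # catch-up regularization
                W[h] += lam * (y - p) * v                           # SGD step
                A[h] = k
    return W
```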

Page 19: Dr. Perceptron - cs.cmu.edu


Page 20: Dr. Perceptron - cs.cmu.edu

Some details

φ[h] = Σ_{j : hash(j) % m == h} ξ(j)·xij ,   where ξ(j) ∈ {−1, +1}

Slightly different hash to avoid systematic bias (compare the earlier V[h] = Σ_{j : hash(j) % R == h} xij).

m is the number of buckets you hash into (R in my discussion).
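A sketch of the ξ(j) version, again with Python's built-in hash standing in for a proper stable hash:

```python
def hash_features(features, m):
    """phi[h] = sum over j with hash(j) % m == h of xi(j) * x_j,
    with a +/-1 sign hash xi(j) to avoid systematic bias."""
    phi = [0.0] * m
    for j, xj in features.items():
        h = hash(j) % m
        xi = 1.0 if hash(("sign", j)) % 2 == 0 else -1.0   # xi(j) in {-1, +1}
        phi[h] += xi * xj
    return phi
```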

Page 21: Dr. Perceptron - cs.cmu.edu

Some details

φ[h] = Σ_{j : hash(j) % m == h} ξ(j)·xij ,   where ξ(j) ∈ {−1, +1}

Slightly different hash to avoid systematic bias

I.e., for large feature sets the variance should be low

Page 22: Dr. Perceptron - cs.cmu.edu

Some details

I.e. – a hashed vector is probably close to the original vector


Page 23: Dr. Perceptron - cs.cmu.edu

Some details

I.e. the inner products between x and x’ are probably not changed too much by the hash function: a classifier will probably still work.

Page 24: Dr. Perceptron - cs.cmu.edu

The Voted Perceptron for Ranking and Structured Classification

William Cohen

Page 25: Dr. Perceptron - cs.cmu.edu

The voted perceptron for ranking

A sends B a set of instances x1 x2 x3 x4 …  B computes yi = vk · xi for each, and returns the index b* of the “best” xi; A responds with the index b of the one that was actually best.

If mistake: vk+1 = vk + xb − xb*
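One round of this protocol as code, with the candidate set passed in as a list of numpy vectors (a sketch, not the paper's implementation):

```python
import numpy as np

def ranking_perceptron_step(v, candidates, b):
    """One round: B returns the index b* of the highest-scoring candidate;
    if it is not the correct index b, update v toward xb and away from xb*."""
    scores = [v @ x for x in candidates]     # candidates: list of numpy vectors
    b_star = int(np.argmax(scores))
    if b_star != b:                          # mistake
        v = v + candidates[b] - candidates[b_star]
    return v, b_star
```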

Page 26: Dr. Perceptron - cs.cmu.edu

[Figure: several x’s and the target vector u (and −u), with margin γ]

Ranking some x’s with the target vector u

Page 27: Dr. Perceptron - cs.cmu.edu

[Figure: the same x’s and target vector u (and −u, margin γ), now with a guess vector v]

Ranking some x’s with some guess vector v – part 1

Page 28: Dr. Perceptron - cs.cmu.edu

[Figure: the x’s with u, −u, and guess vector v; two of the x’s are circled]

Ranking some x’s with some guess vector v – part 2.

The purple-circled x is xb* - the one the learner has chosen to rank highest. The green circled x is xb, the right answer.

Page 29: Dr. Perceptron - cs.cmu.edu

[Figure: the x’s with u, −u, and v, showing the update direction xb − xb*]

Correcting v by adding xb – xb*

Page 30: Dr. Perceptron - cs.cmu.edu

[Figure: the x’s with the old guess vk and the corrected guess vk+1]

Correcting v by adding xb – xb*

(part 2)

Page 31: Dr. Perceptron - cs.cmu.edu

(3a) The guess v2 after the two positive examples: v2 = v1 + x2

[Figure: the convergence-proof picture (u, −u, v1, x2, v2) next to the ranking picture (u, −u, v, and the x’s)]

Page 32: Dr. Perceptron - cs.cmu.edu

(3a) The guess v2 after the two positive examples: v2 = v1 + x2

[Figure: the convergence-proof picture (u, −u, v1, x2, v2) next to the ranking picture (u, −u, v, and the x’s)]

Page 33: Dr. Perceptron - cs.cmu.edu

(3a) The guess v2 after the two positive examples: v2 = v1 + x2

[Figure: the convergence-proof picture (u, −u, v1, x2, v2) next to the ranking picture (u, −u, v, and the x’s)]

Page 34: Dr. Perceptron - cs.cmu.edu

(3a) The guess v2 after the two positive examples: v2 = v1 + x2

[Figure: the convergence-proof picture (u, −u, v1, x2, v2) next to the ranking picture (u, −u, v, and the x’s)]

Notice this doesn’t depend at all on the number of x’s being ranked

Neither proof depends on the dimension of the x’s.

Page 35: Dr. Perceptron - cs.cmu.edu

Ranking perceptrons → structured perceptrons

• The API:
  – A sends B a (maybe huge) set of items to rank
  – B finds the single best one according to the current weight vector
  – A tells B which one was actually best

• Structured classification on a sequence
  – Input: list of words: x = (w1, …, wn)
  – Output: list of labels: y = (y1, …, yn)
  – If there are K classes, there are K^n possible labelings of x

Page 36: Dr. Perceptron - cs.cmu.edu

Borkar et al.: HMMs for segmentation

– Example: Addresses, bib records
– Problem: some DBs may split records up differently (e.g. no “mail stop” field, combine address and apt #, …) or not at all
– Solution: Learn to segment the textual form of records

P.P. Wangikar, T.P. Graycar, D.A. Estell, D.S. Clark, J.S. Dordick (1993) Protein and Solvent Engineering of Subtilisin BPN' in Nearly Anhydrous Organic Media. J. Amer. Chem. Soc. 115, 12231-12237.

Author  Year  Title  Journal  Volume  Page

Page 37: Dr. Perceptron - cs.cmu.edu

IE with Hidden Markov Models

[Figure: an HMM over citation fields with states Author, Year, Title, Journal; transition probabilities between states (0.9, 0.8, 0.5, 0.5, 0.2, 0.1, …) and per-state emission probabilities over tokens, e.g. “Learning” 0.06, “Convex” 0.03, …; “Comm.” 0.04, “Trans.” 0.02, “Chemical” 0.004; “Smith” 0.01, “Cohen” 0.05, “Jordan” 0.3; digit patterns “dddd” 0.8, “dd” 0.2]

Page 38: Dr. Perceptron - cs.cmu.edu

Inference for linear-chain CRFs

When will prof Cohen post the notes …

Idea 1: features are properties of two adjacent tokens, and the pair of labels assigned to them (Begin,Inside,Outside)

• (y(i)==B or y(i)==I) and (token(i) is capitalized)

• (y(i)==I and y(i-1)==B) and (token(i) is hyphenated)

• (y(i)==B and y(i-1)==B)

• e.g. “tell Rose William is on the way”

Idea 2: construct a graph where each path is a possible sequence labeling.

Page 39: Dr. Perceptron - cs.cmu.edu

Inference for a linear-chain CRF

[Figure: a trellis with one column of nodes {B, I, O} per token; edges connect each label in one column to each label in the next, so every left-to-right path is a possible labeling of the sentence]

When will prof Cohen post the notes …

• Inference: find the highest-weight path given a weighting of features
• This can be done efficiently using dynamic programming (Viterbi)
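A compact sketch of that dynamic program over the B/I/O trellis; score(i, prev, label) is an assumed callback giving the weight of the features that fire when token i gets `label` and token i−1 got `prev` (None at i = 0):

```python
def viterbi(n_tokens, labels, score):
    """Return the highest-weight label sequence through the trellis."""
    best = {y: (score(0, None, y), [y]) for y in labels}     # per-label best prefix
    for i in range(1, n_tokens):
        new_best = {}
        for y in labels:
            w, path = max(((best[yp][0] + score(i, yp, y), best[yp][1] + [y])
                           for yp in labels), key=lambda t: t[0])
            new_best[y] = (w, path)
        best = new_best
    return max(best.values(), key=lambda t: t[0])[1]          # best label sequence
```

With labels = ["B", "I", "O"] and the 7-token sentence above, this searches the 3^7 possible labelings implicitly, in time linear in the sentence length.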

Page 40: Dr. Perceptron - cs.cmu.edu

Ranking perceptrons → structured perceptrons

• The API:
  – A sends B a (maybe huge) set of items to rank
  – B finds the single best one according to the current weight vector
  – A tells B which one was actually best

• Structured classification on a sequence
  – Input: list of words: x = (w1, …, wn)
  – Output: list of labels: y = (y1, …, yn)
  – If there are K classes, there are K^n possible labelings of x

Page 41: Dr. Perceptron - cs.cmu.edu

Ranking perceptrons → structured perceptrons

• New API:
  – A sends B the word sequence x
  – B finds the single best y according to the current weight vector, using Viterbi
  – A tells B which y was actually best
  – This is equivalent to ranking pairs g = (x, y′)

• Structured classification on a sequence
  – Input: list of words: x = (w1, …, wn)
  – Output: list of labels: y = (y1, …, yn)
  – If there are K classes, there are K^n possible labelings of x

Page 42: Dr. Perceptron - cs.cmu.edu

The voted perceptron for ranking

A sends B a set of instances x1 x2 x3 x4 …  B computes yi = vk · xi for each, and returns the index b* of the “best” xi; A responds with the index b of the one that was actually best.

If mistake: vk+1 = vk + xb − xb*

Change number one is notation: replace x with g

Page 43: Dr. Perceptron - cs.cmu.edu

The voted perceptron for structured classification tasks

A sends B instances g1 g2 g3 g4 …  B computes yi = vk · gi for each, and returns the index b* of the “best” gi; A responds with the index b of the correct one.

If mistake: vk+1 = vk + gb − gb*

1. A sends B feature functions, and instructions for creating the instances g:
   • A sends a word vector xi. Then B could create the instances g1 = F(xi, y1), g2 = F(xi, y2), …
   • but instead B just returns the y* that gives the best score for the dot product vk · F(xi, y*), by using Viterbi.
2. A sends B the correct label sequence yi.
3. On errors, B sets vk+1 = vk + gb − gb* = vk + F(xi, y) − F(xi, y*)
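Once F and the Viterbi decoder are given, the whole exchange collapses to a few lines; both F(x, y) and decode(v, x) are assumed helpers here, not defined in the slides:

```python
def structured_perceptron_step(v, x, y_true, F, decode):
    """One structured-perceptron update. F(x, y) maps a (sentence, labeling)
    pair to a feature vector g; decode(v, x) returns the argmax labeling y*
    under v (e.g. via Viterbi). v and F(x, y) are numpy vectors."""
    y_star = decode(v, x)
    if y_star != y_true:                       # mistake
        v = v + F(x, y_true) - F(x, y_star)    # vk+1 = vk + gb - gb*
    return v
```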

Page 44: Dr. Perceptron - cs.cmu.edu

EMNLP 2002, Best paper

Results from the original paper….

Page 45: Dr. Perceptron - cs.cmu.edu

Collins’ Experiments

• POS tagging
• NP Chunking (words and POS tags from Brill’s tagger as features) and BIO output tags
• Compared logistic regression methods (MaxEnt) and “Voted Perceptron trained HMMs”
  – With and w/o averaging
  – With and w/o feature selection (count > 5)

Page 46: Dr. Perceptron - cs.cmu.edu

Collins’ results

Page 47: Dr. Perceptron - cs.cmu.edu

Where we are…

• Experiments with a hash-trick implementation of logistic regression
• Next question:
  – how do you parallelize SGD, or more generally, this kind of streaming algorithm?
  – each example affects the next prediction → order matters → parallelization changes the behavior
  – we will step back to perceptrons and then step forward to parallel perceptrons
  – then another nice parallel learning algorithm
  – then a midterm