
Page 1

Lecture 6: Representing Words

Kai-Wei Chang, CS @ UCLA

[email protected]

Course webpage: https://uclanlp.github.io/CS269-17/

ML in NLP 1

Page 2

Bag-of-Words with N-grams

- N-grams: contiguous sequences of n tokens from a given piece of text

CS 6501: Natural Language Processing 2

http://recognize-speech.com/language-model/n-gram-model/comparison
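To make the definition concrete, here is a minimal sketch (not from the slides; the `ngrams` helper and the example sentence are my own) of extracting n-gram features and their bag-of-words counts:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-token sequences in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "a dog is chasing a cat".split()
print(ngrams(tokens, 2))            # bigrams: ('a', 'dog'), ('dog', 'is'), ...
print(Counter(ngrams(tokens, 2)))   # bag-of-bigrams feature counts
```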

Page 3

Language model

- Probability distributions over sentences (i.e., word sequences):
  $P(W) = P(w_1 w_2 w_3 w_4 \ldots w_n)$
- Can use them to generate strings:
  $P(w_n \mid w_1 w_2 \ldots w_{n-1})$
- Rank possible sentences (a toy scoring sketch follows below):
  - P("Today is Tuesday") > P("Tuesday Today is")
  - P("Today is Tuesday") > P("Today is Los Angeles")
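A toy illustration of the ranking idea, as a sketch with entirely made-up bigram probabilities and an assumed `<s>` start token:

```python
import math

# Made-up bigram probabilities, purely for illustration
p = {("<s>", "Today"): 0.2, ("Today", "is"): 0.5, ("is", "Tuesday"): 0.1,
     ("<s>", "Tuesday"): 0.02, ("Tuesday", "Today"): 0.001}

def score(sentence):
    """Log-probability of a sentence under the toy bigram table."""
    words = ["<s>"] + sentence.split()
    return sum(math.log(p[pair]) for pair in zip(words, words[1:]))

print(score("Today is Tuesday") > score("Tuesday Today is"))  # True
```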

CS 6501: Natural Language Processing 3

Page 4

N-Gram Models

- Unigram model: $P(w_1) P(w_2) P(w_3) \cdots P(w_n)$
- Bigram model: $P(w_1) P(w_2 \mid w_1) P(w_3 \mid w_2) \cdots P(w_n \mid w_{n-1})$ (count-based estimation sketched below)

- Trigram model: $P(w_1) P(w_2 \mid w_1) P(w_3 \mid w_2, w_1) \cdots P(w_n \mid w_{n-1}, w_{n-2})$

- N-gram model: $P(w_1) P(w_2 \mid w_1) \cdots P(w_n \mid w_{n-1} w_{n-2} \cdots w_{n-N+1})$
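The slides do not show how these probabilities are obtained; a common choice is maximum-likelihood estimation from corpus counts, $P(w_i \mid w_{i-1}) = \#(w_{i-1}, w_i) / \#(w_{i-1})$. A minimal sketch, where the toy two-sentence corpus and the `<s>` token are assumptions:

```python
from collections import Counter

corpus = [["<s>", "a", "dog", "is", "chasing", "a", "cat"],
          ["<s>", "a", "fox", "was", "chasing", "a", "bird"]]

bigram, context = Counter(), Counter()
for sent in corpus:
    for prev, cur in zip(sent, sent[1:]):
        bigram[(prev, cur)] += 1    # #(w_{i-1}, w_i)
        context[prev] += 1          # #(w_{i-1})

def p(cur, prev):
    """MLE bigram probability P(w_i | w_{i-1})."""
    return bigram[(prev, cur)] / context[prev] if context[prev] else 0.0

print(p("dog", "a"))   # 0.25: "a" appears 4 times as context, "a dog" once
```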

CS 6501: Natural Language Processing 4

Page 5

Random language via n-gram

- http://www.cs.jhu.edu/~jason/465/PowerPoint/lect01,3tr-ngram-gen.pdf

- https://research.googleblog.com/2006/08/all-our-n-gram-are-belong-to-you.html

CS 6501: Natural Language Processing 5

Collection of n-grams

Page 6

N-Gram Viewer

ML in NLP 6

https://books.google.com/ngrams

Page 7

How to represent words?

- N-grams: cannot capture word similarity

- Word clusters
  - Brown clustering
  - Part-of-speech tagging

- Continuous-space representations
  - Word embeddings

ML in NLP 7

Page 8

Brown Clustering

- Similar to a language model, but the basic unit is "word clusters"

- Intuition: similar words appear in similar contexts

- Recap: bigram language models
  $P(w_0, w_1, w_2, \ldots, w_n) = P(w_1 \mid w_0) P(w_2 \mid w_1) \cdots P(w_n \mid w_{n-1}) = \prod_{i=1}^{n} P(w_i \mid w_{i-1})$

CS 6501: Natural Language Processing 8

$w_0$ is a dummy word representing the beginning of a sentence

Page 9

Motivation example

- "a dog is chasing a cat"

  $P(w_0, \text{"a"}, \text{"dog"}, \ldots, \text{"cat"}) = P(\text{"a"} \mid w_0)\, P(\text{"dog"} \mid \text{"a"}) \cdots P(\text{"cat"} \mid \text{"a"})$

- Assume every word belongs to a cluster

CS 6501: Natural Language Processing 9

Cluster 3: a, the

Cluster 46: dog, cat, fox, rabbit, bird, boy

Cluster 64: is, was

Cluster 8: chasing, following, biting, …

Page 10

Motivation example

- Assume every word belongs to a cluster
- "a dog is chasing a cat"

CS 6501: Natural Language Processing 10

Cluster 3: a, the

Cluster 46: dog, cat, fox, rabbit, bird, boy

Cluster 64: is, was

Cluster 8: chasing, following, biting, …

C3 C46 C64 C8 C3 C46

Page 11

Motivation example

- Assume every word belongs to a cluster
- "a dog is chasing a cat"

CS 6501: Natural Language Processing 11

Cluster 3: a, the

Cluster 46: dog, cat, fox, rabbit, bird, boy

Cluster 64: is, was

Cluster 8: chasing, following, biting, …

C3 C46 C64 C8 C3 C46

a dog is chasing a cat

Page 12

Motivation example

- Assume every word belongs to a cluster
- "the boy is following a rabbit"

CS 6501: Natural Language Processing 12

Cluster 3: a, the

Cluster 46: dog, cat, fox, rabbit, bird, boy

Cluster 64: is, was

Cluster 8: chasing, following, biting, …

C3 C46 C64 C8 C3 C46

the boy is following a rabbit

Page 13

Motivation example

- Assume every word belongs to a cluster
- "a fox was chasing a bird"

CS 6501: Natural Language Processing 13

Cluster 3: a, the

Cluster 46: dog, cat, fox, rabbit, bird, boy

Cluster 64: is, was

Cluster 8: chasing, following, biting, …

C3 C46 C64 C8 C3 C46

a fox was chasing a bird

Page 14

Brown Clustering

- Let $C(w)$ denote the cluster that $w$ belongs to
- "a dog is chasing a cat"

CS 6501: Natural Language Processing 14

Cluster 3: a, the

Cluster 46: dog, cat, fox, rabbit, bird, boy

Cluster 64: is, was

Cluster 8: chasing, following, biting, …

C3 C46 C64 C8 C3 C46

a dog is chasing a cat

P(C(dog)|C(a)) P(cat|C(cat))

Page 15

Brown clustering model

- P("a dog is chasing a cat")
  = P(C("a") | C_0) P(C("dog") | C("a")) P(C("is") | C("dog")) …

    × P("a" | C("a")) P("dog" | C("dog")) …

CS 6501: Natural Language Processing 15

Cluster 3: a, the

Cluster 46: dog, cat, fox, rabbit, bird, boy

Cluster 64: is, was

Cluster 8: chasing, following, biting, …

C3 C46 C64 C8 C3 C46

a dog is chasing a cat

P(C(dog)|C(a)) P(cat|C(cat))

Page 16

Brown clustering model

- P("a dog is chasing a cat")
  = P(C("a") | C_0) P(C("dog") | C("a")) P(C("is") | C("dog")) …

    × P("a" | C("a")) P("dog" | C("dog")) …

- In general:
  $P(w_0, w_1, w_2, \ldots, w_n) = P(C(w_1) \mid C(w_0)) P(C(w_2) \mid C(w_1)) \cdots P(C(w_n) \mid C(w_{n-1})) \cdot P(w_1 \mid C(w_1)) P(w_2 \mid C(w_2)) \cdots P(w_n \mid C(w_n))$
  $= \prod_{i=1}^{n} P(C(w_i) \mid C(w_{i-1})) \, P(w_i \mid C(w_i))$

CS 6501: Natural Language Processing 16

Page 17

Model parameters

$P(w_0, w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(C(w_i) \mid C(w_{i-1})) \, P(w_i \mid C(w_i))$

CS 6501: Natural Language Processing 17

Cluster 3: a, the

Cluster 46: dog, cat, fox, rabbit, bird, boy

Cluster 64: is, was

Cluster 8: chasing, following, biting, …

C3 C46 C64 C8 C3 C46

a dog is chasing a cat

Parameter set 1: $P(C(w_i) \mid C(w_{i-1}))$

Parameter set 2: $P(w_i \mid C(w_i))$

Parameter set 3: $C(w_i)$

Page 18

Model parameters

$P(w_0, w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(C(w_i) \mid C(w_{i-1})) \, P(w_i \mid C(w_i))$
- A vocabulary set $W$
- A function $C: W \to \{1, 2, 3, \ldots, k\}$
  - A partition of the vocabulary into $k$ classes
- Conditional probabilities $P(c' \mid c)$ for $c, c' \in \{1, \ldots, k\}$
- Conditional probabilities $P(w \mid c)$ for $c \in \{1, \ldots, k\}$, $w \in c$

CS 6501: Natural Language Processing 18

$\theta$ represents the set of conditional probability parameters; $C$ represents the clustering

Page 19

Log likelihood

$LL(\theta, C) = \log P(w_0, w_1, w_2, \ldots, w_n \mid \theta, C)$
$= \log \prod_{i=1}^{n} P(C(w_i) \mid C(w_{i-1})) \, P(w_i \mid C(w_i))$
$= \sum_{i=1}^{n} [\log P(C(w_i) \mid C(w_{i-1})) + \log P(w_i \mid C(w_i))]$

- Maximizing $LL(\theta, C)$ can be done by alternately updating $\theta$ and $C$ (a loop skeleton is sketched below):
  1. $\max_{\theta \in \Theta} LL(\theta, C)$
  2. $\max_{C} LL(\theta, C)$
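A skeleton of this alternating scheme (my own sketch; `estimate_theta` and `improve_clustering` are assumed callables implementing steps 1 and 2, e.g. the closed-form counts on the next slide and greedy cluster merges):

```python
def fit(corpus, C, estimate_theta, improve_clustering, iters=10):
    """Alternate the two maximization steps for a fixed number of rounds."""
    for _ in range(iters):
        theta = estimate_theta(corpus, C)         # step 1: max over theta, C fixed
        C = improve_clustering(corpus, C, theta)  # step 2: max over C, theta fixed
    return theta, C
```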

CS 6501: Natural Language Processing 19

Page 20

$\max_{\theta \in \Theta} LL(\theta, C)$

$LL(\theta, C) = \log P(w_0, w_1, w_2, \ldots, w_n \mid \theta, C)$
$= \log \prod_{i=1}^{n} P(C(w_i) \mid C(w_{i-1})) \, P(w_i \mid C(w_i))$
$= \sum_{i=1}^{n} [\log P(C(w_i) \mid C(w_{i-1})) + \log P(w_i \mid C(w_i))]$

- $P(c' \mid c) = \dfrac{\#(c, c')}{\#(c)}$

- $P(w \mid c) = \dfrac{\#(w, c)}{\#(c)}$
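A sketch of this closed-form step (the `estimate_theta` helper is my own, not from the slides), for a corpus of token lists and a fixed cluster map `C`:

```python
from collections import Counter

def estimate_theta(corpus, C):
    """MLE for a fixed clustering: P(c'|c) = #(c,c')/#(c), P(w|c) = #(w,c)/#(c)."""
    trans, emit, ctx = Counter(), Counter(), Counter()
    for sent in corpus:
        prev = 0                      # class 0: dummy begin-of-sentence
        for w in sent:
            trans[(prev, C[w])] += 1  # #(c, c') transition counts
            emit[(w, C[w])] += 1      # #(w, c) emission counts
            ctx[prev] += 1            # #(c) as the conditioning class
            prev = C[w]
    size = Counter(C[w] for sent in corpus for w in sent)  # #(c) over tokens
    p_class = {(c1, c2): n / ctx[c1] for (c1, c2), n in trans.items()}
    p_word = {(w, c): n / size[c] for (w, c), n in emit.items()}
    return p_class, p_word

# Toy usage with the slides' example clusters
C = {"a": 3, "dog": 46, "is": 64, "chasing": 8, "cat": 46}
print(estimate_theta([["a", "dog", "is", "chasing", "a", "cat"]], C))
```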

CS 6501: Natural Language Processing 20

This part is the same as training a POS tagging model

See Section 9.2: http://ciml.info/dl/v0_99/ciml-v0_99-ch09.pdf

Page 21

$\max_{C} LL(\theta, C)$

max_∑56"7 [logP 𝐶 w: 𝐶 𝑤:)" + log 𝑃(𝑤: ∣ 𝐶 𝑤: ) ]

= n ∑ ∑ 𝑝 𝑐, 𝑐N log e a,ab

e a e(ab)+ 𝐺'

aN6"'a6"

where G is a constantvHere,

𝑝 𝑐, 𝑐N = # a,ab

∑ #(a,ab)�h,hb

, 𝑝 𝑐 = # a∑ #(a)�h

v 𝑐: cluster of w:, 𝑐′:cluster of w:)"

ve a,ab

e a e(ab)= e 𝑐 𝑐N

e a(mutual information)
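A sketch of evaluating this objective (up to the factor $n$ and the constant $G$) from class-bigram counts; the toy counts and the choice of left/right marginals are illustrative assumptions:

```python
import math
from collections import Counter

def mutual_information(cc):
    """sum_{c,c'} p(c,c') log[p(c,c') / (p(c) p(c'))] from class-bigram counts."""
    total = sum(cc.values())
    left, right = Counter(), Counter()
    for (c1, c2), n in cc.items():
        left[c1] += n    # marginal counts of the first class
        right[c2] += n   # marginal counts of the second class
    return sum((n / total) * math.log((n / total) /
               ((left[c1] / total) * (right[c2] / total)))
               for (c1, c2), n in cc.items())

cc = Counter({(3, 46): 2, (46, 64): 1, (64, 8): 1, (8, 3): 1})  # toy counts
print(mutual_information(cc))
```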

CS 6501: Natural Language Processing 21

See class note here: http://web.cs.ucla.edu/~kwchang/teaching/NLP16/slides/classnote.pdf

Page 22

Algorithm 1

- Start with $|V|$ clusters; each word is in its own cluster

- The goal is to get $k$ clusters
- We run $|V| - k$ merge steps:
  - Pick 2 clusters and merge them
  - At each step, pick the merge that maximizes $LL(\theta, C)$

- Cost? $O(|V| - k)$ iterations $\times$ $O(|V|^2)$ pairs $\times$ $O(|V|^2)$ to compute $LL$ $= O(|V|^5)$, which can be improved to $O(|V|^3)$; a brute-force sketch follows below
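A brute-force sketch of this greedy procedure, matching the naive cost above; `quality` is an assumed caller-supplied scorer standing in for $LL(\theta, C)$ evaluated at the best $\theta$:

```python
def brown_algorithm1(vocab, quality, k):
    """Start with |V| singletons; do |V|-k greedy merges, each maximizing quality."""
    clusters = [{w} for w in vocab]
    while len(clusters) > k:
        best_score, best = None, None
        for i in range(len(clusters)):            # O(|V|^2) candidate pairs
            for j in range(i + 1, len(clusters)):
                merged = [c for t, c in enumerate(clusters) if t not in (i, j)]
                merged.append(clusters[i] | clusters[j])
                score = quality(merged)           # re-scores the whole clustering
                if best_score is None or score > best_score:
                    best_score, best = score, merged
        clusters = best
    return clusters

# Toy usage: a dummy quality that prefers balanced clusters
print(brown_algorithm1(list("abcd"), lambda cl: -max(len(c) for c in cl), 2))
```

The two nested loops are the $O(|V|^2)$ pair search, and each `quality` call is the (naively) $O(|V|^2)$ likelihood evaluation from the cost analysis above.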

CS 6501: Natural Language Processing 22

Page 23

Algorithm 2

- $m$: a hyper-parameter; sort words by frequency
- Take the top $m$ most frequent words and put each of them in its own cluster $c_1, c_2, c_3, \ldots, c_m$
- For $i = m + 1, \ldots, |V|$:
  - Create a new cluster $c_{m+1}$ (we have $m+1$ clusters)
  - Choose two clusters from the $m+1$ clusters based on $LL(\theta, C)$ and merge ⇒ back to $m$ clusters
- Carry out $(m - 1)$ final merges ⇒ full hierarchy
- Running time: $O(|V| m^2 + n)$, where $n$ = #words in the corpus (a sketch follows below)
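A sketch of this windowed variant (my own; as before, `quality` is an assumed scorer over a candidate clustering, standing in for the $LL$-based criterion):

```python
def best_merge(clusters, quality):
    """Index pair (i, j), i < j, whose merge yields the best clustering."""
    def merged(i, j):
        out = [c for t, c in enumerate(clusters) if t not in (i, j)]
        return out + [clusters[i] | clusters[j]]
    return max(((i, j) for i in range(len(clusters))
                for j in range(i + 1, len(clusters))),
               key=lambda ij: quality(merged(*ij)))

def brown_algorithm2(words_by_freq, quality, m):
    """Keep at most m+1 active clusters; record merges to recover the hierarchy."""
    active = [{w} for w in words_by_freq[:m]]   # top-m words, one cluster each
    merges = []
    for w in words_by_freq[m:]:                 # remaining words by frequency
        active.append({w})                      # temporarily m+1 clusters
        i, j = best_merge(active, quality)
        merges.append((frozenset(active[i]), frozenset(active[j])))
        active[i] |= active.pop(j)              # merge -> back to m clusters
    while len(active) > 1:                      # m-1 final merges -> full hierarchy
        i, j = best_merge(active, quality)
        merges.append((frozenset(active[i]), frozenset(active[j])))
        active[i] |= active.pop(j)
    return merges

# e.g. brown_algorithm2(["a", "the", "dog", "cat", "is"],
#                       lambda cl: -max(len(c) for c in cl), 2)
```

Keeping only $m+1$ active clusters is what replaces the $|V|^2$ pair search with an $m^2$ one, giving the $O(|V| m^2 + n)$ running time on the slide.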

CS 6501: Natural Language Processing 23

Page 24

Example clusters (Brown et al., 1992)

CS 6501: Natural Language Processing 24

Page 25

Example hierarchy (Miller et al., 2004)

CS 6501: Natural Language Processing 25