Word Embeddings - Wei Xu (many slides from Greg Durrett)


Page 1: Word Embeddings - cocoxu.github.io

Word Embeddings

Wei Xu (many slides from Greg Durrett)

Page 2: Word Embeddings - cocoxu.github.io

Recall: Feedforward NNs

[Diagram: input f(x) with n features → V (d x n matrix) → z (d hidden units) → nonlinearity g (tanh, relu, …) → W (num_classes x d matrix) → softmax → P(y|x)]

P(y|x) = softmax(W g(V f(x)))

Page 3: Word Embeddings - cocoxu.github.io

Recall: Backpropagation

[Diagram: same network, f(x) → V → z (d hidden units) → g → W → softmax → P(y|x), with error signals err(root) and err(z) flowing backward to give ∂L/∂W and ∂L/∂V]

P(y|x) = softmax(W g(V f(x)))

Page 4: Word Embeddings - cocoxu.github.io

This Lecture

‣ Training

‣ Word representations

‣ word2vec/GloVe

‣ Evaluating word embeddings

Page 5: Word Embeddings - cocoxu.github.io

Training Tips

Page 6: Word Embeddings - cocoxu.github.io

Batching

‣ Batching data gives speedups due to more efficient matrix operations

‣ Need to make the computation graph process a batch at the same time

def make_update(input, gold_label):
    # input is [batch_size, num_feats]
    # gold_label is [batch_size, num_classes], one-hot
    ...
    probs = ffnn.forward(input)  # [batch_size, num_classes]
    # negative log-likelihood of the gold labels, summed over the batch
    loss = torch.sum(torch.neg(torch.log(probs)) * gold_label)
    ...

‣ Batch sizes from 1-100 often work well
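To make the elided pieces concrete, here is a minimal runnable sketch of one batched update; the two-layer network, sizes, optimizer, and random batch are made-up illustrations, not taken from the slides.

import torch
import torch.nn as nn

batch_size, num_feats, hidden, num_classes = 32, 50, 100, 3   # hypothetical sizes

ffnn = nn.Sequential(nn.Linear(num_feats, hidden), nn.Tanh(),
                     nn.Linear(hidden, num_classes), nn.Softmax(dim=1))
optimizer = torch.optim.SGD(ffnn.parameters(), lr=0.1)

def make_update(input, gold_label):
    # input is [batch_size, num_feats]; gold_label is [batch_size, num_classes], one-hot
    optimizer.zero_grad()
    probs = ffnn(input)                                          # [batch_size, num_classes]
    loss = torch.sum(torch.neg(torch.log(probs)) * gold_label)   # summed NLL over the batch
    loss.backward()
    optimizer.step()
    return loss.item()

x = torch.randn(batch_size, num_feats)
y = nn.functional.one_hot(torch.randint(num_classes, (batch_size,)), num_classes).float()
make_update(x, y)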

Page 7: Word Embeddings - cocoxu.github.io

Training Basics

‣ Basic formula: compute gradients on batch, use first-order optimization method (SGD, Adagrad, etc.)

‣ How to initialize? How to regularize? What optimizer to use?

‣ This lecture: some practical tricks. Take deep learning or optimization courses to understand this further

Page 8: Word Embeddings - cocoxu.github.io

How does initialization affect learning?

[Diagram: f(x) with n features → V (d x n matrix) → z (d hidden units) → nonlinearity g (tanh, relu, …) → W (m x d matrix) → softmax → P(y|x)]

P(y|x) = softmax(W g(V f(x)))

‣ How do we initialize V and W? What consequences does this have?

‣ Nonconvex problem, so initialization matters!

Page 9: Word Embeddings - cocoxu.github.io

‣ Nonlinear model… how does this affect things?

‣ Tanh: if cell activations are too large in absolute value, gradients are small

‣ ReLU: larger dynamic range (all positive numbers), but can produce big values, and can break down if everything is too negative ("dead" ReLU)

How does initialization affect learning?

Krizhevsky et al. (2012)

Page 10: Word Embeddings - cocoxu.github.io

Initialization

1) Can't use zeroes for parameters to produce hidden layers: all values in that hidden layer are always 0 and have gradients of 0, never change

‣ Can do random uniform/normal initialization with appropriate scale

‣ Xavier initializer:  U[ -sqrt(6 / (fan-in + fan-out)),  +sqrt(6 / (fan-in + fan-out)) ]

‣ Want variance of inputs and gradients for each layer to be the same

2) Initialize too large and cells are saturated

https://mmuratarat.github.io/2019-02-25/xavier-glorot-he-weight-init
https://arxiv.org/pdf/1502.01852v1.pdf
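As a concrete illustration (a minimal sketch; the layer sizes are made up), PyTorch exposes this initializer directly:

import torch.nn as nn

layer = nn.Linear(50, 100)             # a d x n weight matrix (sizes are illustrative)
nn.init.xavier_uniform_(layer.weight)  # U[-sqrt(6/(fan-in+fan-out)), +sqrt(6/(fan-in+fan-out))]
nn.init.zeros_(layer.bias)             # zero biases are fine; the random weights above break symmetry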

Page 11: Word Embeddings - cocoxu.github.io

Batch Normalization

‣ Batch normalization (Ioffe and Szegedy, 2015): periodically shift + rescale each layer to have mean 0 and variance 1 over a batch (useful if net is deep)

https://medium.com/@shiyan/xavier-initialization-and-batch-normalization-my-understanding-b5b91268c25c

Page 12: Word Embeddings - cocoxu.github.io

Dropout

‣ Probabilistically zero out parts of the network during training to prevent overfitting, use whole network at test time

Srivastava et al. (2014)

‣ Similar to benefits of ensembling: network needs to be robust to missing signals, so it has redundancy

‣ Form of stochastic regularization

‣ One line in PyTorch/TensorFlow
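That one line in PyTorch is nn.Dropout; a minimal sketch (the dropout rate and layer sizes are illustrative):

import torch.nn as nn

net = nn.Sequential(nn.Linear(50, 100), nn.ReLU(),
                    nn.Dropout(p=0.5),      # probabilistically zero out hidden units during training
                    nn.Linear(100, 3))

net.train()   # dropout active while training
net.eval()    # dropout disabled: use the whole network at test time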

Page 13: Word Embeddings - cocoxu.github.io

Optimizer

‣ Adam (Kingma and Ba, ICLR 2015) is very widely used

‣ Adaptive step size like Adagrad, incorporates momentum

Page 14: Word Embeddings - cocoxu.github.io

Optimizer

‣ Wilson et al. (NIPS 2017): adaptive methods can actually perform badly at test time (Adam is in pink, SGD in black)

‣ Check dev set periodically, decrease learning rate if not making progress
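A sketch combining both bullets: Adam plus a learning-rate drop when the dev metric stalls. The stand-in model, loop, and dev score are placeholders for a real training setup.

import torch
import torch.nn as nn

model = nn.Linear(50, 3)                                      # stand-in model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)     # adaptive step size + momentum
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=2)            # halve lr when dev accuracy stalls

for epoch in range(10):
    # ... run one epoch of batched updates with `optimizer` here ...
    dev_acc = 0.80                       # stand-in for a real dev-set evaluation
    scheduler.step(dev_acc)              # check dev set periodically, decay lr if not making progress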

Page 15: Word Embeddings - cocoxu.github.io

‣ Model: feedforward, RNNs, CNNs can be defined in a uniform framework

‣ Objective: many loss functions look similar, just changes the last layer of the neural network

‣ Inference: define the network, your library of choice takes care of it (mostly…)

‣ Training: lots of choices for optimization/hyperparameters

Four Elements of NNs

Page 16: Word Embeddings - cocoxu.github.io

Word Representations

Page 17: Word Embeddings - cocoxu.github.io

Word Representations

‣ Continuous model <-> expects continuous semantics from input

‣ "You shall know a word by the company it keeps" - Firth (1957)

‣ Neural networks work very well at continuous data, but words are discrete

slide credit: Dan Klein, Dan Jurafsky

Page 18: Word Embeddings - cocoxu.github.io

Discrete Word Representations

[Figure: binary Brown-cluster tree, 0/1 branches leading to leaves such as good, enjoyable, great, fish, cat, dog, is, go]

‣ Brown clusters: hierarchical agglomerative hard clustering (each word has one cluster, not some posterior distribution like in mixture models)

‣ Maximize  P(w_i | w_{i-1}) = P(c_i | c_{i-1}) P(w_i | c_i)

‣ Useful features for tasks like NER, not suitable for NNs

Brown et al. (1992)

Page 19: Word Embeddings - cocoxu.github.io


‣ Brown clusters: hierarchical agglomerative hard clustering

‣ We give a very brief sketch of the algorithm here:

k: a hyper-parameter; sort words by frequency

Take the top k most frequent words, put each of them in its own cluster c_1, c_2, c_3, …, c_k

For i = (k + 1) … |V|:

    Create a new cluster c_{k+1} (we have k + 1 clusters)

    Choose two clusters from the k + 1 clusters based on Quality(C) and merge (back to k clusters)

Carry out k - 1 final merges (full hierarchy)

Running time O(|V| k^2 + n), n = # words in corpus

Learn more: Percy Liang's PhD thesis, "Semi-Supervised Learning for Natural Language"

Quality(C) = Σ_{i=1}^{n} log e(w_i | C(w_i)) q(C(w_i) | C(w_{i-1}))
           = Σ_{c=1}^{k} Σ_{c'=1}^{k} p(c, c') log [ p(c, c') / (p(c) p(c')) ] + G

(the double sum is the mutual information between adjacent clusters; G is the entropy of the word distribution)

Discrete Word Representations

Page 20: Word Embeddings - cocoxu.github.io


‣ Brown clusters: hierarchical agglomerative hard clustering

Discrete Word Representations

‣ Example clusters from Miller et al. (2004)

[Table: words with their cluster features (bit string prefixes)]

Page 21: Word Embeddings - cocoxu.github.io

NER in Twitter


2m 2ma 2mar 2mara 2maro 2marrow 2mor 2mora 2moro 2morow 2morr 2morro 2morrow 2moz 2mr

2mro 2mrrw 2mrw 2mw tmmrw tmo tmoro tmorrow tmoz tmr tmro tmrow tmrrow tmrrw tmrw tmrww tmw tomaro tomarow tomarro tomarrow tomm tommarow

tommarrow tommoro tommorow tommorrow tommorw tommrow tomo tomolo tomoro tomorow

tomorro tomorrw tomoz tomrw tomz

Ritter et al. (2011), Cherry & Guo (2015)


Page 22: Word Embeddings - cocoxu.github.io

Word Embeddings

Botha et al. (2017)

Fed raises interest rates in order to …

‣ Word embeddings for each word form input

f(x) = ?? → [emb(raises), emb(interest), emb(rates), other words, feats, etc.]   (previous word, curr word, next word)

‣ Part-of-speech tagging with FFNNs

‣ What properties should these vectors have?

Page 23: Word Embeddings - cocoxu.github.io

Sentiment Analysis

‣ Deep Averaging Networks: feedforward neural network on average of word embeddings from input

Iyyer et al. (2015)
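A minimal sketch of a deep averaging network along these lines; the vocabulary size, dimensions, and the random input batch are made up for illustration.

import torch
import torch.nn as nn

class DAN(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=300, hidden=300, num_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.ff = nn.Sequential(nn.Linear(emb_dim, hidden), nn.ReLU(),
                                nn.Linear(hidden, num_classes))

    def forward(self, word_ids):              # word_ids: [batch_size, sent_len]
        avg = self.emb(word_ids).mean(dim=1)  # average the word embeddings over the sentence
        return self.ff(avg)                   # feedforward network on the average

logits = DAN()(torch.randint(10000, (8, 20)))  # [8, num_classes]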

Page 24: Word Embeddings - cocoxu.github.io

[Figure: embedding space with good, enjoyable, great near each other, and bad, dog, is elsewhere]

‣ Want a vector space where similar words have similar embeddings

the movie was great  ≈  the movie was good

Word Embeddings

‣ Goal: come up with a way to produce these embeddings

‣ For each word, want "medium" dimensional vector (50-300 dims) representing it.

Page 25: Word Embeddings - cocoxu.github.io

Word Representations


‣ Count-based: tf*idf, PPMI, …

‣ Class-based: Brown Clusters, …

‣ Distributed prediction-based embeddings: word2vec, GloVe, FastText, …

‣ Distributed contextual embeddings: ELMo, BERT, GPT, …

‣ + many more variants: multi-sense embeddings, syntactic embeddings, …

Page 26: Word Embeddings - cocoxu.github.io

word2vec/GloVe

Page 27: Word Embeddings - cocoxu.github.io

Neural Probabilistic Language Model

Bengio et al. (2003)

Page 28: Word Embeddings - cocoxu.github.io

word2vec: Continuous Bag-of-Words

‣ Predict word from context

the dog bit the man

‣ Parameters: d x |V| (one d-length context vector per voc word), |V| x d output parameters (W)

[Figure: d-dimensional word embeddings c(dog) + c(the), summed into a size-d vector, multiplied by W (size |V| x d), softmax over size |V|]

gold label = bit, no manual labeling required!

P(w | w_{-1}, w_{+1}) = softmax(W (c(w_{-1}) + c(w_{+1})))

Mikolov et al. (2013)
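A sketch of the CBOW scoring above (sum the two context vectors, multiply by W, softmax over the vocabulary); the toy vocabulary and dimension are illustrative.

import torch
import torch.nn as nn

vocab = {"the": 0, "dog": 1, "bit": 2, "man": 3}
d, V = 100, len(vocab)

c = nn.Embedding(V, d)            # one d-length context vector per vocabulary word
W = nn.Linear(d, V, bias=False)   # |V| x d output parameters

def cbow_probs(prev_word, next_word):
    # P(w | w_-1, w_+1) = softmax(W (c(w_-1) + c(w_+1)))
    ctx = c(torch.tensor([vocab[prev_word]])) + c(torch.tensor([vocab[next_word]]))
    return torch.softmax(W(ctx), dim=1)   # [1, |V|]

probs = cbow_probs("dog", "the")   # gold label would be "bit"; train with cross-entropy against it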

Page 29: Word Embeddings - cocoxu.github.io

word2vec: Skip-Gram

the dog bit the man

‣ Predict one word of context from word

[Figure: d-dimensional word embedding e(bit), multiplied by W (size |V| x d), softmax; gold label = dog]

‣ Parameters: d x |V| vectors, |V| x d output parameters (W) (also usable as vectors!)

‣ Another training example: bit -> the

P(w' | w) = softmax(W e(w))

Mikolov et al. (2013)

Page 30: Word Embeddings - cocoxu.github.io

Hierarchical Softmax

‣ Matmul + softmax over |V| is very slow to compute for CBOW and SG

P(w | w_{-1}, w_{+1}) = softmax(W (c(w_{-1}) + c(w_{+1})))

P(w' | w) = softmax(W e(w))

‣ Standard softmax: O(|V|) dot products of size d, per training instance per context word

‣ Hierarchical softmax: O(log(|V|)) dot products of size d, |V| x d parameters

‣ Huffman encode vocabulary, use binary classifiers to decide which branch to take

‣ log(|V|) binary decisions

[Figure: binary tree over the vocabulary, with words such as "the" and "a" at the leaves]

Mikolov et al. (2013)

http://building-babylon.net/2017/08/01/hierarchical-softmax/

Page 31: Word Embeddings - cocoxu.github.io

Skip-Gram with Negative Sampling

‣ d x |V| vectors, d x |V| context vectors (same # of params as before)

Mikolov et al. (2013)

(bit, the) => +1   (bit, cat) => -1
(bit, a) => -1   (bit, fish) => -1

‣ Take (word, context) pairs and classify them as "real" or not. Create random negative examples by sampling from unigram distribution

words in similar contexts select for similar c vectors

P(y = 1 | w, c) = e^{w·c} / (e^{w·c} + 1)

the dog bit the man

‣ Objective = log P(y = 1 | w, c) + (1/k) Σ_{i=1}^{n} log P(y = 0 | w_i, c),  where the w_i are sampled negatives
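A minimal sketch of this classifier and objective: real (word, context) pairs are scored as positive and sampled pairs as negative. The vocabulary size, dimension, ids, and sampled negatives are made up for illustration.

import torch
import torch.nn as nn

V, d, k = 10000, 100, 5
word_vecs = nn.Embedding(V, d)   # d x |V| word vectors
ctx_vecs = nn.Embedding(V, d)    # d x |V| context vectors

def p_real(w_id, c_id):
    # P(y = 1 | w, c) = e^{w.c} / (e^{w.c} + 1) = sigmoid(w . c)
    w, c = word_vecs(torch.tensor(w_id)), ctx_vecs(torch.tensor(c_id))
    return torch.sigmoid(torch.dot(w, c))

def sgns_loss(w_id, c_id, neg_ids):
    # maximize log P(y=1 | w, c) plus the log P(y=0 | w_i, c) terms for sampled negatives
    pos = torch.log(p_real(w_id, c_id))
    neg = sum(torch.log(1 - p_real(n, c_id)) for n in neg_ids)
    return -(pos + neg / k)      # negate to get a loss to minimize

loss = sgns_loss(w_id=2, c_id=1, neg_ids=torch.randint(V, (k,)).tolist())
loss.backward()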

Page 32: Word Embeddings - cocoxu.github.io

Connections with Matrix Factorization

‣ Skip-gram model looks at word-word co-occurrences and produces two types of vectors

word pair counts (|V| x |V|):

        knife  dog  sword  love  like
knife     0     1     6     5     5
dog       1     0     5     5     5
sword     6     5     0     5     5
love      5     5     5     0     5
like      5     5     5     5     2

Two words are “similar” in meaning if their context vectors are similar. Similarity == relatedness

Page 33: Word Embeddings - cocoxu.github.io

Connections with Matrix Factorization

Levy et al. (2014)

‣ Skip-gram model looks at word-word co-occurrences and produces two types of vectors

[Figure: |V| x |V| word pair counts matrix ≈ (|V| x d word vecs) x (d x |V| context vecs)]

‣ Looks almost like a matrix factorization… can we interpret it this way?

Page 34: Word Embeddings - cocoxu.github.io

Skip-Gram as Matrix Factorization

Levy et al. (2014)

‣ If we sample negative examples from the uniform distribution over words, the skip-gram objective exactly corresponds to factoring this |V| x |V| matrix:

M_ij = PMI(w_i, c_j) - log k        (k = num negative samples)

PMI(w_i, c_j) = P(w_i, c_j) / (P(w_i) P(c_j)) = (count(w_i, c_j) / D) / ((count(w_i) / D) (count(c_j) / D))

‣ … and it's a weighted factorization problem (weighted by word freq)
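A sketch of building the shifted PMI matrix M_ij = PMI(w_i, c_j) - log k from a word-context count matrix; the small count matrix here is made up.

import numpy as np

counts = np.array([[0., 1., 6.], [1., 0., 5.], [6., 5., 0.]])  # toy word-context counts
k = 5                                                          # num negative samples
D = counts.sum()                                               # total number of pairs

P_wc = counts / D
P_w = counts.sum(axis=1, keepdims=True) / D
P_c = counts.sum(axis=0, keepdims=True) / D

with np.errstate(divide="ignore"):
    pmi = np.log(P_wc / (P_w * P_c))   # PMI(w_i, c_j) = log P(w,c) / (P(w) P(c))
M = pmi - np.log(k)                    # the matrix SGNS implicitly factorizes (Levy et al. 2014)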

Page 35: Word Embeddings - cocoxu.github.io

Co-occurrence Matrix


‣ Typical problems in word-word co-occurrences:

‣ Raw frequency is not the best measure of association between words.

‣ Frequent words are often more important than rare words that only appear once or twice;

‣ But, frequent words (e.g., the) that appear in all documents are also not a very useful signal.

‣ Solutions: weighting terms in the word-word / word-doc co-occurrence matrix

‣ Tf*idf

‣ PPMI (Positive PMI)

Page 36: Word Embeddings - cocoxu.github.io

Co-occurrence Matrix


‣ Tf*idf

‣ Tf: term frequency

‣ Idf: inverse document frequency

word-doc co-occurrences:

            As You Like It   Twelfth Night   Julius Caesar   Henry V
battle            1                0                7           17
soldier           2               80               62           89
fool             36               58                1            4
clown            20               15                2            3

tf = log10(count(t, d) + 1)

idf_i = log10(N / df_i),  where N = total number of docs in collection and df_i = number of docs that have word i
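A sketch of tf*idf weighting applied to the word-document counts above (counts copied from the table, log base 10 as in the formulas):

import numpy as np

# rows: battle, soldier, fool, clown; columns: As You Like It, Twelfth Night, Julius Caesar, Henry V
counts = np.array([[1, 0, 7, 17],
                   [2, 80, 62, 89],
                   [36, 58, 1, 4],
                   [20, 15, 2, 3]], dtype=float)

tf = np.log10(counts + 1)                      # tf = log10(count(t, d) + 1)
N = counts.shape[1]                            # total number of docs in collection
df = (counts > 0).sum(axis=1, keepdims=True)   # number of docs that contain word i
idf = np.log10(N / df)                         # idf_i = log10(N / df_i)
tfidf = tf * idf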

Page 37: Word Embeddings - cocoxu.github.io

GloVe (Global Vectors)

Pennington et al. (2014)

‣ Loss = Σ_{i,j} f(count(w_i, c_j)) (w_i^T c_j + a_i + b_j - log count(w_i, c_j))^2

‣ Also operates on counts matrix, weighted regression on the log co-occurrence matrix

‣ Constant in the dataset size (just need counts), quadratic in voc size

‣ By far the most common non-contextual word vectors used today (10000+ citations)

[Figure: |V| x |V| word pair counts matrix]
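A sketch of this loss as weighted least squares over a counts matrix; the toy counts, the capped weighting function f, and the sizes are illustrative assumptions, not the reference GloVe implementation.

import torch

V, d = 1000, 50
counts = torch.randint(0, 100, (V, V)).float()            # toy co-occurrence counts
w = torch.randn(V, d, requires_grad=True)                 # word vectors
c = torch.randn(V, d, requires_grad=True)                 # context vectors
a = torch.zeros(V, requires_grad=True)                    # word biases
b = torch.zeros(V, requires_grad=True)                    # context biases

def glove_loss():
    f = torch.clamp(counts / 100.0, max=1.0) ** 0.75      # f(count): capped weighting
    mask = (counts > 0).float()                            # only sum over observed pairs
    pred = w @ c.T + a[:, None] + b[None, :]               # w_i . c_j + a_i + b_j
    log_counts = torch.log(counts + (counts == 0).float()) # avoid log(0); zeros are masked out anyway
    return (f * mask * (pred - log_counts) ** 2).sum()

loss = glove_loss()
loss.backward()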

Page 38: Word Embeddings - cocoxu.github.io

Preview: Context-dependent Embeddings

Peters et al. (2018)

‣ Train a neural language model to predict the next word given previous words in the sentence, use its internal representations as word vectors

‣ Context-sensitive word embeddings: depend on rest of the sentence

‣ Huge improvements across nearly all NLP tasks over word2vec & GloVe

‣ How to handle different word senses? One vector for balls

they hit the balls
they dance at balls

Page 39: Word Embeddings - cocoxu.github.io

Evaluation

Page 40: Word Embeddings - cocoxu.github.io

Visualization

Source: tensorflow.org

Page 41: Word Embeddings - cocoxu.github.io

Visualization

Kulkarni et al. (2015)

Page 42: Word Embeddings - cocoxu.github.io

Evaluating Word Embeddings

‣ What properties of language should word embeddings capture?

[Figure: embedding space with good, enjoyable, great clustered together; bad apart; dog, cat, wolf, tiger clustered; is, was clustered]

‣ Similarity: similar words are close to each other

‣ Analogy:

Paris is to France as Tokyo is to ???

good is to best as smart is to ???

Page 43: Word Embeddings - cocoxu.github.io

Word Similarity

Levy et al. (2015)

‣ Cosine Similarity:

cosine(v, w) = (v · w) / (|v| |w|) = Σ_{i=1}^{N} v_i w_i / ( sqrt(Σ_{i=1}^{N} v_i^2) · sqrt(Σ_{i=1}^{N} w_i^2) )
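The cosine similarity above in a few lines of Python (the toy vectors are just for illustration):

import numpy as np

def cosine(v, w):
    # cos(v, w) = (v . w) / (|v| |w|)
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

v, w = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 5.9])
print(cosine(v, w))   # close to 1.0 for near-parallel vectors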

Page 44: Word Embeddings - cocoxu.github.io

Word Similarity

Levy et al. (2015)

‣ SVD = singular value decomposition on PMI matrix

‣ GloVe does not appear to be the best when experiments are carefully controlled, but it depends on hyperparameters + these distinctions don't matter in practice


Page 45: Word Embeddings - cocoxu.github.io

Hypernymy Detection

‣ Hypernyms: detective is a person, dog is an animal

‣ word2vec (SGNS) works barely better than random guessing here

‣ Do word vectors encode these relationships?

Chang et al. (2017)

Page 46: Word Embeddings - cocoxu.github.io

Analogies

[Figure: vectors for king, queen, man, woman; the queen - king offset parallels woman - man]

(king - man) + woman = queen

‣ Why would this be?

‣ woman - man captures the difference in the contexts that these occur in

king + (woman - man) = queen

‣ Dominant change: more "he" with man and "she" with woman, similar to difference between king and queen
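A sketch of the analogy computation king - man + woman ≈ queen as a nearest-neighbor search under cosine similarity; the embedding dictionary `vecs` is a hypothetical stand-in for loaded GloVe/word2vec vectors.

import numpy as np

def cosine(v, w):
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

def analogy(a1, a2, b1, vecs):
    # returns argmax_b cos(b, a2 - a1 + b1), excluding the query words themselves
    target = vecs[a2] - vecs[a1] + vecs[b1]
    candidates = [w for w in vecs if w not in (a1, a2, b1)]
    return max(candidates, key=lambda w: cosine(vecs[w], target))

# vecs: dict mapping word -> np.ndarray, e.g. loaded from pretrained GloVe (hypothetical here)
# analogy("man", "king", "woman", vecs)   # hopefully "queen"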

Page 47: Word Embeddings - cocoxu.github.io

Analogies

Levy et al. (2015)

‣ These methods can perform well on analogies on two different datasets using two different methods

Maximizing for b:

Add = cos(b, a_2 - a_1 + b_1)

Mul = cos(b_2, a_2) · cos(b_2, b_1) / (cos(b_2, a_1) + ε)

Page 48: Word Embeddings - cocoxu.github.io

Using Semantic Knowledge

Faruqui et al. (2015)

‣ Structure derived from a resource like WordNet

[Figure: original vector for false vs. adapted vector for false]

‣ Doesn't help most problems

Page 49: Word Embeddings - cocoxu.github.io

Using Word Embeddings

‣ Approach 1: learn embeddings as parameters from your data

‣ Often works pretty well

‣ Approach 2: initialize using GloVe/word2vec/ELMo, keep fixed

‣ Faster because no need to update these parameters

‣ Approach 3: initialize using GloVe, fine-tune

‣ Works best for some tasks, not used for ELMo, often used for BERT
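The three approaches in PyTorch terms, assuming a `pretrained` matrix of GloVe vectors already loaded as a tensor (the random tensor here is a stand-in; loading code is omitted):

import torch
import torch.nn as nn

vocab_size, emb_dim = 10000, 300
pretrained = torch.randn(vocab_size, emb_dim)    # stand-in for loaded GloVe vectors

emb1 = nn.Embedding(vocab_size, emb_dim)                        # Approach 1: learn from scratch
emb2 = nn.Embedding.from_pretrained(pretrained, freeze=True)    # Approach 2: initialize, keep fixed
emb3 = nn.Embedding.from_pretrained(pretrained, freeze=False)   # Approach 3: initialize, fine-tune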

Page 50: Word Embeddings - cocoxu.github.io

Takeaways

‣ Lots to tune with neural networks

‣ Training: optimizer, initializer, regularization (dropout), …

‣ Hyperparameters: dimensionality of word embeddings, layers, …

‣ Word vectors: learning word -> context mappings has given way to matrix factorization approaches (constant in dataset size)

‣ Lots of pretrained embeddings work well in practice, they capture some desirable properties

‣ Even better: context-sensitive word embeddings (ELMo/BERT)

‣ Next time: RNNs and CNNs