word embeddings - cocoxu.github.io
TRANSCRIPT
Word Embeddings
Wei Xu (many slides from Greg Durrett)
Recall: Feedforward NNs
[Diagram: input f(x) with n features → V (d x n matrix) → z, d hidden units → nonlinearity g (tanh, relu, …) → W (num_classes x d matrix) → softmax → P(y|x), num_classes probs]
P(y|x) = softmax(W g(V f(x)))
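A minimal PyTorch sketch of this classifier; the class name FFNN and the sizes are illustrative, not from the slides:

import torch
import torch.nn as nn

class FFNN(nn.Module):
    def __init__(self, n_feats, d_hidden, num_classes):
        super().__init__()
        self.V = nn.Linear(n_feats, d_hidden)      # d x n matrix V
        self.g = nn.Tanh()                          # nonlinearity g (tanh, relu, ...)
        self.W = nn.Linear(d_hidden, num_classes)   # num_classes x d matrix W

    def forward(self, f_x):
        # P(y|x) = softmax(W g(V f(x)))
        return torch.softmax(self.W(self.g(self.V(f_x))), dim=-1)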
Recall: Backpropagation
[Diagram: the same network, with gradients flowing backward: ∂L/∂W computed at the output from err(root); err(z) propagated through z back to V and f(x) to give ∂L/∂V]
P(y|x) = softmax(W g(V f(x)))
This Lecture
‣ Training
‣ Word representations
‣ word2vec/GloVe
‣ Evaluating word embeddings
Training Tips
Batching
‣ Batching data gives speedups due to more efficient matrix operations
‣ Need to make the computation graph process a batch at the same time (see the fuller sketch below)

def make_update(input, gold_label):
    # input is [batch_size, num_feats]
    # gold_label is [batch_size, num_classes]
    ...
    probs = ffnn.forward(input)  # [batch_size, num_classes]
    loss = torch.sum(torch.neg(torch.log(probs)) * gold_label)
    ...

‣ Batch sizes from 1-100 often work well
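A fuller, runnable sketch of one batched update, assuming the FFNN model sketched earlier and a one-hot gold_label matrix; the optimizer argument is illustrative:

import torch

def make_update(ffnn, optimizer, input, gold_label):
    # input is [batch_size, num_feats]; gold_label is [batch_size, num_classes] (one-hot)
    optimizer.zero_grad()
    probs = ffnn.forward(input)                                  # [batch_size, num_classes]
    loss = torch.sum(torch.neg(torch.log(probs)) * gold_label)   # summed negative log-likelihood over the batch
    loss.backward()
    optimizer.step()
    return loss.item()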
Training Basics
‣ Basic formula: compute gradients on batch, use first-order optimization method (SGD, Adagrad, etc.)
‣ How to initialize? How to regularize? What optimizer to use?
‣ This lecture: some practical tricks. Take deep learning or optimization courses to understand this further
How does initialization affect learning?
[Diagram: f(x) with n features → V (d x n matrix) → z, d hidden units → nonlinearity g (tanh, relu, …) → W (m x d matrix) → softmax → P(y|x)]
P(y|x) = softmax(W g(V f(x)))
‣ How do we initialize V and W? What consequences does this have?
‣ Nonconvex problem, so initialization matters!
‣ Nonlinear model… how does this affect things?
‣ Tanh: if cell activations are too large in absolute value, gradients are small
‣ ReLU: larger dynamic range (all positive numbers), but can produce big values, and can break down if everything is too negative ("dead" ReLU)
How does initialization affect learning?
Krizhevsky et al. (2012)
Initialization
1) Can't use zeroes for parameters to produce hidden layers: all values in that hidden layer are always 0 and have gradients of 0, never change
2) Initialize too large and cells are saturated
‣ Can do random uniform/normal initialization with appropriate scale
‣ Xavier initializer: U[ −√(6 / (fan-in + fan-out)), +√(6 / (fan-in + fan-out)) ]
‣ Want variance of inputs and gradients for each layer to be the same
https://mmuratarat.github.io/2019-02-25/xavier-glorot-he-weight-init
https://arxiv.org/pdf/1502.01852v1.pdf
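A minimal sketch of the Xavier/Glorot uniform initializer above using NumPy (the function name and shapes are illustrative); PyTorch's built-in torch.nn.init.xavier_uniform_ does the same thing:

import numpy as np

def xavier_uniform(fan_out, fan_in):
    # U[-sqrt(6/(fan-in + fan-out)), +sqrt(6/(fan-in + fan-out))]
    bound = np.sqrt(6.0 / (fan_in + fan_out))
    return np.random.uniform(-bound, bound, size=(fan_out, fan_in))

V = xavier_uniform(100, 50)   # e.g., a 100 x 50 hidden-layer weight matrix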
Batch Normalization
‣ Batch normalization (Ioffe and Szegedy, 2015): periodically shift + rescale each layer to have mean 0 and variance 1 over a batch (useful if net is deep)
https://medium.com/@shiyan/xavier-initialization-and-batch-normalization-my-understanding-b5b91268c25c
Dropout
‣ Probabilistically zero out parts of the network during training to prevent overfitting, use whole network at test time
Srivastava et al. (2014)
‣ Similar to benefits of ensembling: network needs to be robust to missing signals, so it has redundancy
‣ Form of stochastic regularization
‣ One line in Pytorch/Tensorflow (see below)
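For example, in PyTorch it is a single module placed between layers (a sketch; p = 0.5 is just a common choice):

dropout = torch.nn.Dropout(p=0.5)   # zeros each unit with prob 0.5 in train() mode, identity in eval() mode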
Optimizer
‣ Adam (Kingma and Ba, ICLR 2015) is very widely used
‣ Adaptive step size like Adagrad, incorporates momentum
Optimizer
‣ Wilson et al. NIPS 2017: adaptive methods can actually perform badly at test time (Adam is in pink, SGD in black)
‣ Check dev set periodically, decrease learning rate if not making progress
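A minimal sketch of this recipe in PyTorch, assuming a model ffnn and a dev-set loss computed once per epoch (train_one_epoch and evaluate_on_dev are hypothetical helpers); ReduceLROnPlateau lowers the learning rate when the dev loss stops improving:

import torch

optimizer = torch.optim.Adam(ffnn.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=2)
for epoch in range(num_epochs):
    train_one_epoch(ffnn, optimizer)    # hypothetical training loop over batches
    dev_loss = evaluate_on_dev(ffnn)    # hypothetical dev-set evaluation
    scheduler.step(dev_loss)            # decrease lr if dev loss has plateaued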
Four Elements of NNs
‣ Model: feedforward, RNNs, CNNs can be defined in a uniform framework
‣ Objective: many loss functions look similar, just changes the last layer of the neural network
‣ Inference: define the network, your library of choice takes care of it (mostly…)
‣ Training: lots of choices for optimization/hyperparameters
Word Representations
Word Representations
‣ Continuous model <-> expects continuous semantics from input
‣ "You shall know a word by the company it keeps" Firth (1957)
‣ Neural networks work very well at continuous data, but words are discrete
slide credit: Dan Klein, Dan Jurafsky
Discrete Word Representations
‣ Brown clusters: hierarchical agglomerative hard clustering (each word has one cluster, not some posterior distribution like in mixture models)
‣ Maximize P(w_i | w_{i−1}) = P(c_i | c_{i−1}) P(w_i | c_i)
‣ Useful features for tasks like NER, not suitable for NNs
[Diagram: binary tree over words such as good, enjoyable, great, dog, fish, cat, is, go, with 0/1 labels on the branches giving each word a bit string]
Brown et al. (1992)
‣ Brown clusters: hierarchical agglomerative hard clustering
‣ We give a very brief sketch of the algorithm here:
    k: a hyper-parameter; sort words by frequency
    Take the top k most frequent words, put each of them in its own cluster c_1, c_2, c_3, …, c_k
    For i = (k + 1) … |V|:
        Create a new cluster c_{k+1} (we have k + 1 clusters)
        Choose two clusters from the k + 1 clusters based on Quality(C) and merge (back to k clusters)
    Carry out k − 1 final merges (full hierarchy)
    Running time O(|V| k² + n), n = # words in corpus
Learn more: Percy Liang's PhD thesis, Semi-Supervised Learning for Natural Language
Quality(C) = (1/n) Σ_{i=1}^{n} log [ e(w_i | C(w_i)) q(C(w_i) | C(w_{i−1})) ]
           = Σ_{c=1}^{k} Σ_{c'=1}^{k} p(c, c') log [ p(c, c') / (p(c) p(c')) ] + G
(first term: mutual information between adjacent clusters; G: entropy of the word distribution)
Discrete Word Representations
‣ Brown clusters: hierarchical agglomerative hard clustering
Discrete Word Representations
‣ Example clusters from Miller et al. 2004: word cluster features (bit string prefix)
‣ NER in Twitter: one example cluster (variant spellings of "tomorrow"):
2m 2ma 2mar 2mara 2maro 2marrow 2mor 2mora 2moro 2morow 2morr 2morro 2morrow 2moz 2mr
2mro 2mrrw 2mrw 2mw tmmrw tmo tmoro tmorrow tmoz tmr tmro tmrow tmrrow tmrrw tmrw tmrww tmw tomaro tomarow tomarro tomarrow tomm tommarow
tommarrow tommoro tommorow tommorrow tommorw tommrow tomo tomolo tomoro tomorow
tomorro tomorrw tomoz tomrw tomz
Ritter et al. (2011), Cherry & Guo (2015)
Word Embeddings
Botha et al. (2017)
‣ Part-of-speech tagging with FFNNs
‣ Word embeddings for each word form input
Fed raises interest rates in order to …
f(x) = ?? → [emb(raises) (previous word); emb(interest) (curr word); emb(rates) (next word); other words, feats, etc.]
‣ What properties should these vectors have?
Sentiment Analysis
‣ Deep Averaging Networks: feedforward neural network on average of word embeddings from input
Iyyer et al. (2015)
‣ Want a vector space where similar words have similar embeddings
[Vector space diagram: good, enjoyable, great near each other; bad, dog, is elsewhere]
the movie was great ≈ the movie was good
Word Embeddings
‣ Goal: come up with a way to produce these embeddings
‣ For each word, want "medium" dimensional vector (50-300 dims) representing it
Word Representations
‣ Count-based: tf*idf, PPMI, …
‣ Class-based: Brown Clusters, …
‣ Distributed prediction-based embeddings: Word2vec, GloVe, FastText, …
‣ Distributed contextual embeddings: ELMo, BERT, GPT, …
‣ + many more variants: multi-sense embeddings, syntactic embeddings, …
word2vec/GloVe
Neural Probabilistic Language Model
Bengio et al. (2003)
word2vec: Continuous Bag-of-Words
‣ Predict word from context
the dog bit the man
‣ Parameters: d x |V| (one d-length context vector per voc word), |V| x d output parameters (W)
[Diagram: context words "the" and "dog" → their d-dimensional word embeddings are summed (size d) → multiplied by W (size |V| x d) → softmax (size |V|)]
gold label = bit, no manual labeling required!
P(w | w−1, w+1) = softmax(W (c(w−1) + c(w+1)))
Mikolov et al. (2013)
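A minimal PyTorch sketch of this CBOW model (the class and argument names are illustrative, not from Mikolov et al.):

import torch
import torch.nn as nn

class CBOW(nn.Module):
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.context = nn.Embedding(vocab_size, dim)            # d x |V| context vectors c
        self.output = nn.Linear(dim, vocab_size, bias=False)    # |V| x d output parameters W

    def forward(self, prev_word, next_word):
        # P(w | w-1, w+1) = softmax(W (c(w-1) + c(w+1)))
        summed = self.context(prev_word) + self.context(next_word)
        return torch.log_softmax(self.output(summed), dim=-1)   # log-probabilities over |V|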
word2vec: Skip-Gram
the dog bit the man
‣ Predict one word of context from word
[Diagram: word "bit" → its d-dimensional word embedding → multiplied by W (size |V| x d) → softmax (size |V|) → gold label = dog]
‣ Parameters: d x |V| vectors, |V| x d output parameters (W) (also usable as vectors!)
‣ Another training example: bit -> the
P(w' | w) = softmax(W e(w))
Mikolov et al. (2013)
Hierarchical Softmax
‣ Matmul + softmax over |V| is very slow to compute for CBOW and SG
P(w | w−1, w+1) = softmax(W (c(w−1) + c(w+1)))
P(w' | w) = softmax(W e(w))
‣ Standard softmax: O(|V|) dot products of size d, per training instance per context word
‣ Hierarchical softmax: O(log(|V|)) dot products of size d, log(|V|) binary decisions
‣ Huffman encode vocabulary, use binary classifiers to decide which branch to take
[Diagram: binary tree over the vocabulary, with leaves such as "the" and "a"]
|V| x d parameters
Mikolov et al. (2013)
http://building-babylon.net/2017/08/01/hierarchical-softmax/
Skip-Gram with Negative Sampling
the dog bit the man
‣ Take (word, context) pairs and classify them as "real" or not. Create random negative examples by sampling from unigram distribution
(bit, the) => +1    (bit, cat) => -1
(bit, a) => -1     (bit, fish) => -1
P(y = 1 | w, c) = e^(w·c) / (e^(w·c) + 1)
‣ Objective = log P(y = 1 | w, c) + (1/k) Σ_{i=1}^{n} log P(y = 0 | w_i, c), where the w_i are sampled negatives
‣ d x |V| vectors, d x |V| context vectors (same # of params as before)
‣ Words in similar contexts select for similar c vectors
Mikolov et al. (2013)
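A minimal sketch of this objective for a single (word, context) pair with sampled negatives, assuming PyTorch vectors word_vec (size d), context_vec (size d), and neg_vecs (n x d, one row per sampled negative word); all names are illustrative:

import torch
import torch.nn.functional as F

def sgns_loss(word_vec, context_vec, neg_vecs, k):
    # log P(y=1 | w, c): the real pair should be classified as real
    pos = F.logsigmoid(word_vec @ context_vec)
    # log P(y=0 | w_i, c): sampled negatives should be classified as fake
    neg = F.logsigmoid(-(neg_vecs @ context_vec)).sum()
    return -(pos + neg / k)   # negate to turn the maximization objective into a loss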
Connections with Matrix Factorization
‣ Skip-gram model looks at word-word co-occurrences and produces two types of vectors
word pair counts (|V| x |V|):
        knife  dog  sword  love  like
knife     0     1     6      5     5
dog       1     0     5      5     5
sword     6     5     0      5     5
love      5     5     5      0     5
like      5     5     5      5     2
Two words are "similar" in meaning if their context vectors are similar. Similarity == relatedness
Connections with Matrix Factorization
Levy et al. (2014)
‣ Skip-gram model looks at word-word co-occurrences and produces two types of vectors
[Diagram: |V| x |V| word pair counts matrix ≈ (|V| x d word vecs) x (d x |V| context vecs)]
‣ Looks almost like a matrix factorization… can we interpret it this way?
Skip-Gram as Matrix Factorization
Levy et al. (2014)
‣ If we sample negative examples from the uniform distribution over words, the skip-gram objective exactly corresponds to factoring this |V| x |V| matrix:
M_ij = PMI(w_i, c_j) − log k    (k = num negative samples)
PMI(w_i, c_j) = log [ P(w_i, c_j) / (P(w_i) P(c_j)) ] = log [ (count(w_i, c_j)/D) / ( (count(w_i)/D) (count(c_j)/D) ) ]
‣ …and it's a weighted factorization problem (weighted by word freq)
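A minimal NumPy sketch of building this shifted PMI matrix from a |V| x |V| co-occurrence count matrix (the function name and inputs are illustrative):

import numpy as np

def shifted_pmi(counts, k):
    # counts[i, j] = count(w_i, c_j); D = total number of (word, context) pairs
    D = counts.sum()
    p_wc = counts / D
    p_w = counts.sum(axis=1, keepdims=True) / D    # P(w_i)
    p_c = counts.sum(axis=0, keepdims=True) / D    # P(c_j)
    with np.errstate(divide="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))           # PMI(w_i, c_j); zero counts give -inf
    return pmi - np.log(k)                         # M_ij = PMI(w_i, c_j) - log k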
Co-occurrence Matrix
‣ Typical problems in word-word co-occurrences:
‣ Raw frequency is not the best measure of association between words.
‣ Frequent words are often more important than rare words that only appear once or twice;
‣ But, frequent words (e.g., the) that appear in all documents are also not very useful signal.
‣ Solutions: weighting terms in word-word/word-doc co-occurrence matrix
‣ Tf*idf
‣ PPMI (Positive PMI)
Co-occurrence Matrix
‣ Tf*idf
‣ Tf: term frequency
‣ Idf: inverse document frequency
word-doc co-occurrences:
          As You Like It   Twelfth Night   Julius Caesar   Henry V
battle          1               0                7            17
soldier         2              80               62            89
fool           36              58                1             4
clown          20              15                2             3
tf = log10(count(t, d) + 1)
idf_i = log10(N / df_i)
N = total number of docs in collection; df_i = number of docs that have word i
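A worked example of these formulas on the table above ("battle" in Henry V; "battle" appears in 3 of the N = 4 plays):

import math

tf = math.log10(17 + 1)    # count("battle", Henry V) = 17  ->  tf ≈ 1.26
idf = math.log10(4 / 3)    # N = 4 docs, df("battle") = 3   ->  idf ≈ 0.12
tfidf = tf * idf           # ≈ 0.16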
GloVe (Global Vectors)
Pennington et al. (2014)
‣ Also operates on counts matrix, weighted regression on the log co-occurrence matrix
‣ Loss = Σ_{i,j} f(count(w_i, c_j)) (w_i^T c_j + a_i + b_j − log count(w_i, c_j))²
‣ Constant in the dataset size (just need counts), quadratic in voc size
‣ By far the most common non-contextual word vectors used today (10000+ citations)
word pair counts (|V| x |V|)
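A minimal PyTorch sketch of this loss over a dense counts matrix; the weighting function f follows Pennington et al.'s f(x) = min(x/x_max, 1)^alpha, and all tensor names are illustrative:

import torch

def glove_loss(counts, W, C, a, b, x_max=100.0, alpha=0.75):
    # counts: |V| x |V| co-occurrence counts; W, C: |V| x d word/context vectors; a, b: |V| bias terms
    f = torch.clamp(counts / x_max, max=1.0) ** alpha     # weighting f(count(w_i, c_j)); f(0) = 0
    log_counts = torch.log(counts.clamp(min=1.0))         # log co-occurrence (zero-count cells get weight 0 anyway)
    pred = W @ C.T + a[:, None] + b[None, :]              # w_i^T c_j + a_i + b_j
    return torch.sum(f * (pred - log_counts) ** 2)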
Preview: Context-dependent Embeddings
Peters et al. (2018)
‣ How to handle different word senses? One vector for balls:
they hit the balls / they dance at balls
‣ Train a neural language model to predict the next word given previous words in the sentence, use its internal representations as word vectors
‣ Context-sensitive word embeddings: depend on rest of the sentence
‣ Huge improvements across nearly all NLP tasks over word2vec & GloVe
Evaluation
Visualization
Source: tensorflow.org
Visualization
Kulkarni et al. (2015)
Evaluating Word Embeddings
‣ What properties of language should word embeddings capture?
[Vector space diagram: good, enjoyable, great near each other; bad apart; dog, cat, wolf, tiger clustered; is, was clustered]
‣ Similarity: similar words are close to each other
‣ Analogy:
Paris is to France as Tokyo is to ???
good is to best as smart is to ???
Word Similarity
Levy et al. (2015)
‣ Cosine Similarity:
cosine(v, w) = (v · w) / (|v| |w|) = Σ_{i=1}^{N} v_i w_i / ( √(Σ_{i=1}^{N} v_i²) √(Σ_{i=1}^{N} w_i²) )
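A minimal NumPy sketch of this similarity (the vector arguments are illustrative):

import numpy as np

def cosine(v, w):
    # cos(v, w) = (v . w) / (|v| |w|)
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))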
Word Similarity
Levy et al. (2015)
‣ SVD = singular value decomposition on PMI matrix
‣ GloVe does not appear to be the best when experiments are carefully controlled, but it depends on hyperparameters + these distinctions don't matter in practice
[Results table comparing Word2vec and other methods on word similarity benchmarks]
Hypernymy Detection
‣ Hypernyms: detective is a person, dog is an animal
‣ Do word vectors encode these relationships?
‣ word2vec (SGNS) works barely better than random guessing here
Chang et al. (2017)
Analogies
[Diagram: vectors for king, queen, man, woman]
(king − man) + woman = queen
‣ Why would this be?
‣ woman − man captures the difference in the contexts that these occur in
‣ Dominant change: more "he" with man and "she" with woman, similar to difference between king and queen
king + (woman − man) = queen
Analogies
Levy et al. (2015)
‣ These methods can perform well on analogies on two different datasets using two different methods
Maximizing for b2:
Add = cos(b2, a2 − a1 + b1)
Mul = cos(b2, a2) cos(b2, b1) / (cos(b2, a1) + ε)
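A minimal NumPy sketch of the Add method above, assuming a dict embs mapping words to vectors (all names are illustrative):

import numpy as np

def analogy_add(a1, a2, b1, embs):
    # return the b2 maximizing cos(b2, a2 - a1 + b1), excluding the query words
    target = embs[a2] - embs[a1] + embs[b1]
    target = target / np.linalg.norm(target)
    best, best_sim = None, -np.inf
    for word, vec in embs.items():
        if word in (a1, a2, b1):
            continue
        sim = np.dot(vec, target) / np.linalg.norm(vec)
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# analogy_add("man", "king", "woman", embs) should ideally return "queen"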
Using Semantic Knowledge
Faruqui et al. (2015)
‣ Structure derived from a resource like WordNet
[Diagram: original vector for false vs. adapted vector for false]
‣ Doesn't help most problems
Using Word Embeddings
‣ Approach 1: learn embeddings as parameters from your data
‣ Often works pretty well
‣ Approach 2: initialize using GloVe/word2vec/ELMo, keep fixed
‣ Faster because no need to update these parameters
‣ Approach 3: initialize using GloVe, fine-tune
‣ Works best for some tasks, not used for ELMo, often used for BERT (all three sketched below)
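A minimal PyTorch sketch of the three approaches with nn.Embedding (pretrained is an illustrative |V| x d tensor standing in for loaded GloVe vectors):

import torch
import torch.nn as nn

vocab_size, dim = 10000, 300
pretrained = torch.randn(vocab_size, dim)   # stand-in for a loaded GloVe matrix

emb1 = nn.Embedding(vocab_size, dim)                            # Approach 1: learn from scratch
emb2 = nn.Embedding.from_pretrained(pretrained, freeze=True)    # Approach 2: initialize, keep fixed
emb3 = nn.Embedding.from_pretrained(pretrained, freeze=False)   # Approach 3: initialize, fine-tune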
Takeaways
‣ Lots to tune with neural networks
‣ Training: optimizer, initializer, regularization (dropout), …
‣ Hyperparameters: dimensionality of word embeddings, layers, …
‣ Word vectors: learning word -> context mappings has given way to matrix factorization approaches (constant in dataset size)
‣ Lots of pretrained embeddings work well in practice, they capture some desirable properties
‣ Even better: context-sensitive word embeddings (ELMo/BERT)
‣ Next time: RNNs and CNNs