NLP Guest Lecture: Morphological Word Embeddings
PAMELA SHAPIRO
Outline
Review: Morphology, Word Embeddings
Research: Morphological Word Embeddings
Application: Neural Machine Translation
Review: Morphology
Morphology: studies sub-word-level phenomena
◦ e.g. prefixes, suffixes, compounding
Morpheme: sub-word component that bears meaning
◦ e.g. play, -ing, -s, -er, re-
Inflectional morphology: doesn't change the core content of the word (including POS); tenses/cases
◦ The base which inflectional morphology attaches to is called the lemma
Derivational morphology: can change meaning, part of speech
Compounding: combining complete words
◦ e.g. toothbrush
Interacts with phonology, orthography, and syntax
Morphology Across Languages
Morphological patterns vary across languages
Analytic: few morphemes per word (e.g. Chinese, Vietnamese)
◦ English is moderately analytic
Fusional: tend to "fuse" multiple features into one affix
◦ Includes most Indo-European and Semitic languages
◦ e.g. Spanish habló: -ó denotes both past tense and 3rd person singular
Agglutinative: tend to have more morphemes per word, more clearly demarcated and regular
◦ Includes Finnish, Hungarian, Turkish
◦ e.g. anlamadım 'I did not understand': verb root anla-, negative marker -m(a), definite past marker -d(ı), and 1st person indicator -(ı)m
Non-concatenative Morphology
Non-concatenative morphology: an additional property of Semitic languages; morphemes are not just concatenated together but include "templates" into which roots are inserted
Review: Word Embeddings
Skip-gram objective (word2vec)
Learn input vectors and output vectors simultaneously, such that the input vector predicts the output vectors of surrounding words
Mikolov et al. (2013)
[Figure: skip-gram architecture: INPUT w(t) -> PROJECTION -> OUTPUT w(t-2), w(t-1), w(t+1), w(t+2)]
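As a concrete sketch of what the objective trains on, the illustrative helper below (my own, not the lecture's code) enumerates the (input, output) pairs in which each center word w(t) predicts its surrounding words within a window:

```python
def skipgram_pairs(tokens, window=2):
    """Enumerate (input, output) skip-gram training pairs:
    the center word w(t) predicts each word w(t+off) in its window."""
    pairs = []
    for t, center in enumerate(tokens):
        for off in range(-window, window + 1):
            c = t + off
            if off != 0 and 0 <= c < len(tokens):
                pairs.append((center, tokens[c]))
    return pairs

print(skipgram_pairs(["the", "cat", "sat"], window=1))
# -> [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
```

In the actual model, each pair drives a gradient update that pulls the center word's input vector toward the context word's output vector.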
Problem: Sparsity due to Morphology
Rare words don't get their vectors updated enough to have meaningful representations
In morphologically rich languages, maybe only the most common variant gets updated frequently
Solution: Bring morphological variants closer together
◦ "morphological word embeddings" is a very vague term that could mean many things, but generally addresses this issue
We try a couple of approaches to this problem in Arabic, training word vectors on Arabic Wikipedia
Approach #1: Bag of Character N-Grams
From existing literature (fastText)
No access to morphological information, but may be able to learn it implicitly
Bojanowski et al. (2017)
[Figure: fastText architecture: INPUT character n-grams g_{i-1}(w(t)), g_i(w(t)), g_{i+1}(w(t)) summed -> PROJECTION -> OUTPUT w(t-2), w(t-1), w(t+1), w(t+2)]
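To make the subword idea concrete, here is a small illustrative helper (the function name is mine, not fastText's API) that extracts the boundary-marked character n-grams whose vectors fastText sums to form a word vector; fastText's defaults are n = 3 to 6:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams with boundary markers '<' and '>', as in
    fastText; a word's vector is the sum of its n-gram vectors
    (plus a vector for the whole marked word itself)."""
    marked = "<" + word + ">"
    return [marked[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(marked) - n + 1)]

print(char_ngrams("play", n_min=3, n_max=3))
# -> ['<pl', 'pla', 'lay', 'ay>']
```

Because inflectional variants like "play"/"plays" share most of their n-grams, their vectors end up close even when one form is rare.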
Approach #2: Lemma-Informed
Lemma-informed (morph)
Acquire lemmas using a specialized morphological analyzer (MADAMIRA); train word and lemma embeddings
[Figure: lemma-informed architecture: INPUT w(t) -> PROJECTION -> OUTPUT w(t-2), w(t-1), w(t+1), w(t+2) and the lemma l(t)]
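A minimal sketch of how the training pairs change under this approach: each center word additionally predicts its own lemma l(t). Here `lemma_of` is a hypothetical stand-in for the MADAMIRA analyzer's lemma lookup:

```python
def lemma_skipgram_pairs(tokens, lemma_of, window=2):
    """Skip-gram pairs plus one extra output per position: the center
    word w(t) also predicts its lemma l(t). `lemma_of` is a stand-in
    for a real morphological analyzer such as MADAMIRA."""
    pairs = []
    for t, center in enumerate(tokens):
        for off in range(-window, window + 1):
            c = t + off
            if off != 0 and 0 <= c < len(tokens):
                pairs.append((center, tokens[c]))
        # Extra prediction target: the word's lemma, in its own namespace
        pairs.append((center, "LEMMA:" + lemma_of(center)))
    return pairs
```

Tying every inflected variant to a shared lemma target is what pulls the variants' vectors together during training.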
Word Embeddings: Evaluation
Intrinsic: Assess similarity of words either directly or with analogies, comparing to human judgment
Extrinsic: Assess usefulness in a downstream application (e.g. Neural Machine Translation)
Word Similarity
WordSimilarity-353 test set: word pairs with human judgments of "relatedness" on a scale of 0-10
◦ e.g. tiger/cat 7.35, noon/string 0.54 (averaged over several human judgments)
To evaluate success:
◦ Use the model's word vectors to compute cosine similarity (which ranges from -1 to 1)
◦ Assess correlation with human judgments, typically using Spearman's rank correlation coefficient
We use a version of this test set that has been translated into Arabic (Hassan and Mihalcea, 2009)
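The two evaluation steps can be sketched in plain Python (illustrative helpers; the rank statistic below omits the tie correction a real evaluation would use, e.g. scipy.stats.spearmanr):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors, in [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def spearman(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the ranks
    (no tie correction, for illustration only)."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=vs.__getitem__)
        r = [0] * len(vs)
        for rank, idx in enumerate(order):
            r[idx] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)
```

The model's cosine scores for each pair are correlated against the human relatedness scores; a higher Spearman coefficient means the embedding space better matches human judgment.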
Results: Word Similarity
Results: Word Similarity
Extrinsic Evaluation: NMT
Use word embeddings to initialize the source word vectors of an NMT system
MT Bitext: Ar→En corpus of TED talks, considered a low-resource setting
◦ ~175k train sentences
◦ 2k dev
◦ 2k test
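A minimal sketch of the initialization step, with hypothetical names: pretrained vectors are copied in where available, and any remaining words fall back to a small random init.

```python
import random

def init_source_embeddings(vocab, pretrained, dim, seed=0):
    """Build the NMT encoder's source embedding matrix: copy the
    pretrained vector when one exists, otherwise use a small random
    initialization (a common fallback for uncovered words)."""
    rng = random.Random(seed)
    return [list(pretrained[w]) if w in pretrained
            else [rng.uniform(-0.1, 0.1) for _ in range(dim)]
            for w in vocab]
```

In the low-resource setting, most source-side parameters that the bitext alone would estimate poorly are instead inherited from the Wikipedia-trained embeddings.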
Background: Neural Machine Translation
[Figure: encoder: word embeddings for the source sentence "vielen dank!" feed into a bidirectional LSTM/GRU]
Background: Neural Machine Translation
[Figure, built up across slides 16-20: decoder: a unidirectional LSTM/GRU with attention over the encoder states emits the target sentence one word at a time: "thank" -> "thank you" -> "thank you very" -> "thank you very much" -> "thank you very much!"]
MT Evaluation: BLEU Score
Human evaluation is expensive, so use an automatic metric based on n-gram precision
Not perfect, but shown to correlate with human judgment
Higher is better
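A self-contained sketch of sentence-level BLEU under simplifying assumptions (single reference, no smoothing), to show where n-gram precision and the brevity penalty enter:

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU sketch: geometric mean of modified n-gram
    precisions (n = 1..max_n) times a brevity penalty. Single
    reference, no smoothing, so any zero precision gives a score of 0."""
    def ngrams(toks, n):
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # "Modified" precision: each reference n-gram can be matched
        # at most as many times as it occurs in the reference
        overlap = sum(min(count, ref[g]) for g, count in cand.items())
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty discourages overly short candidates
    bp = 1.0 if len(candidate) >= len(reference) else \
        math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

Corpus-level BLEU (what MT papers report) accumulates the n-gram counts over all sentences before taking the geometric mean, rather than averaging per-sentence scores.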
Results: MT
Example Sentence
Can discuss BPE at the end if time (tl;dr: it segments words with an automatic heuristic)
Conclusion
Incorporating awareness of morphology into word embeddings can help in NMT
Lemma-informed (morph) and bag-of-character-n-grams (fastText) modifications to skip-gram (word2vec)
◦ Both help in both intrinsic & extrinsic evaluation
◦ However, lemma is more useful for word similarity
◦ Meanwhile, fastText does better with MT
Byte-Pair Encoding
Compression algorithm (Gage, 1994; adapted for MT by Sennrich et al., 2016) aimed at keeping frequent words intact and breaking up rare/unknown words
Then translates these "subword units" with the same model
Has become standard practice
Byte-Pair Encoding
Begins with everything split up into characters, iteratively merges the most frequent pairs of symbols, then applies the learned merge operations to test data (doesn't cross word boundaries)
e.g. all of the animals are going down the path
all | of | the | animals | are | going | down | the | path
Most frequent pairs are merged in turn:
1) t h → th
2) a l → al
3) th e → the
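The merge-learning loop behind these steps can be sketched as follows (an illustrative reimplementation, not the reference subword-nmt code):

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merge operations: start from single characters and
    repeatedly merge the most frequent adjacent pair of symbols,
    never crossing word boundaries."""
    vocab = Counter(tuple(w) for w in words)  # word -> frequency, as symbol tuples
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, replacing occurrences of the best pair
        merged = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += freq
        vocab = merged
    return merges

sentence = "all of the animals are going down the path".split()
print(learn_bpe(sentence, 3))
```

On this sentence the learned merges are t h → th, a l → al, and th e → the, matching the slides (ties here are broken by first occurrence; real implementations fix their own tie-breaking rule).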
Byte-Pair Encoding
Typical range of BPE merge operations: 15k-100k