CS388: Natural Language Processing Lecture 24: Multilinguality + Morphology
Greg Durrett
Administrivia
‣ Project 2 graded; average = 19.0
‣ Final project presentations next week
‣ See Canvas announcement for who is presenting when
‣ Can be “work in progress”, but there should be at least preliminary results
‣ Final reports due on December 14; no slip days
Dealing with other languages
‣ Many algorithms so far have been developed for English
‣ Some structures like constituency parsing don’t make sense for other languages
‣ Neural methods are typically tuned to English-scale resources, may not be the best for other languages where less data is available
‣ Other languages present some problems not seen in English at all!
‣ Question:
1) What other phenomena / challenges do we need to solve?
2) How can we leverage existing resources to do better in other languages without just annotating massive data?
This Lecture
‣ Morphology: effects and challenges
‣ Morphology tasks: analysis, inflection, word segmentation
‣ Cross-lingual tagging and parsing
Morphology
What is morphology?
‣ Study of how words form
‣ Derivational morphology: create a new lexeme from a base
  estrange (v) => estrangement (n)
  become (v) => unbecoming (adj)
‣ May not be totally regular: enflame => inflammable
‣ Inflectional morphology: word is inflected based on its context
  I become / she becomes
‣ Mostly applies to verbs and nouns
Morphological Inflection
‣ In English:
  I arrive     you arrive     he/she/it arrives
  we arrive    you arrive     they arrive
  [X] arrived
‣ In French:
Morphological Inflection
‣ In Spanish:
Noun Inflection
‣ Not just verbs either; gender, number, case complicate things
‣ Nominative: I/he/she, accusative: me/him/her, genitive: mine/his/hers
‣ Dative: merged with accusative in English, shows recipient of something
  I give the children a book <=> Ich gebe den Kindern ein Buch
  I teach the children <=> Ich unterrichte die Kinder
Irregular Inflection
‣ Common words are often irregular
‣ I am / you are / she is
‣ Je suis / tu es / elle est
‣ Yo soy / tú eres / ella es
‣ However, less common words typically fall into some regular paradigm, so these are somewhat predictable
Agglutinating Languages
‣ Finnish/Hungarian (Finno-Ugric), Turkish (Turkic): what a preposition would do in English is instead part of the verb
  illative: “into”    adessive: “on”
‣ Many possible forms, and in newswire data, only a few are observed
Morphologically-Rich Languages
‣ Many languages spoken all over the world have much richer morphology than English (Chinese is the main exception)
‣ CoNLL 2006 / 2007: dependency parsing + morphological analyses for ~15 mostly Indo-European languages
‣ SPMRL shared tasks (2013-2014): Syntactic Parsing of Morphologically-Rich Languages
‣ Word piece / byte-pair encoding models for MT are pretty good at handling these if there’s enough data
Morphologically-Rich Languages
‣ Great resources for challenging your assumptions about language and for understanding multilingual models!
Morphological Analysis / Inflection
Morphological Analysis
‣ In English, not that many word forms, lexical features on words and word vectors are pretty effective
‣ In other languages, *lots* more unseen words! Affects parsing, translation, …
‣ When we’re building systems, we probably want to know base form + morphological features explicitly
‣ How to do this kind of morphological analysis?
Morphological Analysis
Ám a kormány egyetlen adó csökkentését sem javasolja.
  kormány: n=singular|case=nominative|proper=no
  egyetlen: deg=positive|n=singular|case=nominative
  adó: n=singular|case=nominative|proper=no
  csökkentését: n=singular|case=accusative|proper=no|pperson=3rd|pnumber=singular
  javasolja: mood=indicative|t=present|p=3rd|n=singular|def=yes
But the government does not recommend reducing taxes.
‣ Why is this useful?
Morphological Analysis
‣ Given a word, need to recognize what its morphological features are
‣ Basic approach:
  ‣ Lexicon: tells you what possibilities are
  ‣ Analyzer: statistical model that disambiguates
‣ Models are largely CRF-like: score morphological features in context
‣ Lots of work on Arabic inflection (high amounts of ambiguity)
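The lexicon-plus-analyzer split can be illustrated with a small sketch. Everything here is an assumption made for illustration: the toy lexicon entries, the hand-set weights, and the greedy left-to-right decoding, which stands in for the CRF-style sequence models mentioned above and is not one itself.

```python
from typing import Dict, List, Tuple

Analysis = Tuple[str, str]  # (lemma, feature string)

# Hypothetical toy lexicon: each surface form lists every analysis it allows.
LEXICON: Dict[str, List[Analysis]] = {
    "the": [("the", "pos=DET")],
    "i": [("I", "pos=PRON|case=nominative")],
    "saw": [("saw", "pos=NOUN|n=singular"), ("see", "pos=VERB|t=past")],
}

# Hand-set weights on (previous POS, candidate POS); a real analyzer would
# learn these, plus many richer context features, from annotated data.
WEIGHTS: Dict[Tuple[str, str], float] = {
    ("DET", "NOUN"): 2.0,
    ("PRON", "VERB"): 2.0,
    ("DET", "VERB"): -1.0,
    ("PRON", "NOUN"): -1.0,
}


def pos_of(analysis: Analysis) -> str:
    """Read the pos=... value out of a feature string like 'pos=VERB|t=past'."""
    features = dict(item.split("=") for item in analysis[1].split("|"))
    return features["pos"]


def analyze(words: List[str]) -> List[Analysis]:
    """Greedily pick, for each word, the lexicon analysis that best fits the previous tag."""
    prev_pos = "<s>"
    output: List[Analysis] = []
    for word in words:
        # Lexicon step: enumerate the possibilities (unknown words get a dummy analysis).
        candidates = LEXICON.get(word.lower(), [(word, "pos=UNK")])
        # Analyzer step: score each possibility in context and keep the best one.
        best = max(candidates, key=lambda a: WEIGHTS.get((prev_pos, pos_of(a)), 0.0))
        output.append(best)
        prev_pos = pos_of(best)
    return output


print(analyze(["the", "saw"]))
print(analyze(["I", "saw"]))
```

In this sketch the ambiguous form "saw" is resolved as a noun after a determiner and as the past tense of "see" after a pronoun; the lexicon only proposes, and the scoring step disambiguates.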
Predicting Inflection
Durrett and DeNero (2013)
‣ Other direction: given base form + features, inflect the word
‣ Hard for unknown words: need models that generalize
Predicting Inflection
[Figure: change rules for the German verb winden (“to wind”), marking the spans i, en, and n of the base form that are rewritten across its inflected forms]
Durrett and DeNero (2013)
‣ Take a bunch of existing verbs from Wiktionary, extract these change rules using character alignments
‣ Train a CRF with character n-gram context features to learn where to apply them
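A rough sketch of the rule-extraction step, under loud assumptions: Python's `difflib.SequenceMatcher` is swapped in as the character aligner (the slides do not say which alignment procedure is used), and rules are applied at fixed character offsets instead of at positions chosen by a trained CRF. The function names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

Rule = Tuple[int, str, str]  # (start offset in base form, old substring, new substring)


def extract_rules(base: str, inflected: str) -> List[Rule]:
    """Align two forms character by character and keep only the spans that change."""
    matcher = SequenceMatcher(None, base, inflected, autojunk=False)
    rules: List[Rule] = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            rules.append((i1, base[i1:i2], inflected[j1:j2]))
    return rules


def apply_rules(base: str, rules: List[Rule]) -> str:
    """Apply rules right to left so that earlier offsets stay valid."""
    result = base
    for start, old, new in sorted(rules, reverse=True):
        if result[start:start + len(old)] != old:
            raise ValueError(f"rule {old!r}->{new!r} does not match {base!r} at {start}")
        result = result[:start] + new + result[start + len(old):]
    return result


past_rules = extract_rules("winden", "wand")         # i -> a, en -> (deleted)
second_sg_rules = extract_rules("winden", "windest")  # n -> st
print(past_rules, second_sg_rules)
print(apply_rules("finden", past_rules), apply_rules("finden", second_sg_rules))
```

Rules extracted from winden carry over to finden here only because the two verbs share a shape; deciding when a rule applies to an unseen verb is the part the CRF is trained for.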
Morphological Reinflection
Chahuneau et al. (2013)
‣ Machine translation where phrase table is defined in terms of lemmas
‣ “Translate-and-inflect”: translate into uninflected words and predict inflection based on source side
Word Segmentation
Morpheme Segmentation
Creutz and Lagus (2002)
‣ Can we do something unsupervised rather than these complicated analyses?
‣ unbecoming => un + becom + ing: we should be able to recognize these common pieces and split them off
‣ How do we do this?
Morpheme Segmentation
Creutz and Lagus (2002)
‣ Simple probabilistic model
‣ p(m_i) = count(m_i) / count(all tokens)
‣ Train with EM: E-step involves estimating best segmentation with Viterbi, M-step: collect token counts
  allowed expected need needed
  E0: all+owe+d  expe+cted  n+e+ed  ne+ed+ed
  M0: ed has count 3
  E1: all+ow+ed  expect+ed  ne+ed  ne+ed+ed
‣ Some heuristics: reject rare morphemes, one-letter morphemes
‣ Doesn’t handle stem changes: becoming => becom + ing
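The E-step and M-step above can be sketched as a hard (Viterbi) EM loop over a unigram morpheme model. This is a minimal sketch and not the Creutz and Lagus model: the fallback score for unseen morphemes (`UNSEEN_CHAR_PROB ** len(m)`) is an assumption added so that Viterbi can consider new pieces, the heuristics from the slide are omitted, and the loop is not claimed to reproduce the E1 row of the example.

```python
import math
from collections import Counter
from typing import Dict, List

# Hypothetical fallback: an unseen morpheme m is scored as UNSEEN_CHAR_PROB ** len(m).
UNSEEN_CHAR_PROB = 0.01


def m_step(segmentations: List[List[str]]) -> Dict[str, float]:
    """M-step: p(m) = count(m) / count(all morpheme tokens)."""
    counts = Counter(m for seg in segmentations for m in seg)
    total = sum(counts.values())
    return {m: c / total for m, c in counts.items()}


def log_prob(morpheme: str, probs: Dict[str, float]) -> float:
    if morpheme in probs:
        return math.log(probs[morpheme])
    return len(morpheme) * math.log(UNSEEN_CHAR_PROB)


def viterbi_segment(word: str, probs: Dict[str, float]) -> List[str]:
    """E-step for one word: the highest-probability split into morphemes."""
    n = len(word)
    best = [0.0] + [-math.inf] * n  # best[i] = best log-prob of word[:i]
    back = [0] * (n + 1)            # back[i] = start of the last morpheme in word[:i]
    for end in range(1, n + 1):
        for start in range(end):
            score = best[start] + log_prob(word[start:end], probs)
            if score > best[end]:
                best[end], back[end] = score, start
    pieces: List[str] = []
    end = n
    while end > 0:
        pieces.append(word[back[end]:end])
        end = back[end]
    return pieces[::-1]


def hard_em(words: List[str], initial: List[List[str]], iterations: int = 5) -> List[List[str]]:
    """Alternate M-steps (recount) and E-steps (re-segment with Viterbi)."""
    segmentations = initial
    for _ in range(iterations):
        probs = m_step(segmentations)
        segmentations = [viterbi_segment(w, probs) for w in words]
    return segmentations


words = ["allowed", "expected", "need", "needed"]
e0 = [["all", "owe", "d"], ["expe", "cted"], ["n", "e", "ed"], ["ne", "ed", "ed"]]
print(m_step(e0)["ed"])      # ed accounts for 3 of the 11 morpheme tokens in E0
print(hard_em(words, e0))
```

How unseen morphemes are scored largely decides whether the loop can move away from its initial segmentation, which is one reason the heuristics on the slide matter in practice.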
Chinese Word Segmentation
Chen et al. (2015)
‣ Some languages including Chinese are totally untokenized
‣ LSTMs over character embeddings / character bigram embeddings to predict word boundaries
‣ Having the right segmentation can help machine translation
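Predicting word boundaries is commonly set up as per-character tagging. The sketch below shows only the label encoding and decoding around such a tagger, using the B/M/E/S scheme (begin, middle, end, single-character word); that this exact scheme matches the slide's model is an assumption, and the LSTM itself is left out.

```python
from typing import List


def words_to_tags(words: List[str]) -> List[str]:
    """Turn a gold segmentation into one B/M/E/S label per character."""
    tags: List[str] = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Rebuild words from characters and predicted labels; a word closes on E or S."""
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate a label sequence that ends mid-word
        words.append(current)
    return words


gold = ["我", "喜欢", "自然语言"]
tags = words_to_tags(gold)
print(tags)
print(tags_to_words("".join(gold), tags))
```

A character-level model is trained to output the tag sequence, and `tags_to_words` turns its predictions back into a segmentation.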
Cross-Lingual Tagging and Parsing
Cross-Lingual Tagging
Snyder et al. (2008)
‣ Multilingual POS induction
‣ Generative model of two languages simultaneously, joint alignment + tag learning
‣ Complex generative model, requires Gibbs sampling for inference
Cross-Lingual Tagging
Das and Petrov (2011)
‣ We have resources for languages like English; can we use these more directly?
  I like it a lot
  N V PR DT ADJ
  Je l’aime beaucoup
  N PR V ??
‣ Tag with English tagger, project across bitext, train French tagger?
‣ Can do something smarter
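The naive projection baseline can be sketched in a few lines. The word alignment below is a hypothetical one written by hand for the example sentence pair (in practice it would come from an automatic aligner), and the rule of leaving a token untagged when its aligned source tags disagree is an assumption of this sketch.

```python
from typing import List, Optional, Set, Tuple


def project_tags(
    source_tags: List[str],
    alignment: List[Tuple[int, int]],  # (source index, target index) pairs
    target_len: int,
) -> List[Optional[str]]:
    """Copy each source tag onto its aligned target token.

    Target tokens that are unaligned, or aligned to source words with
    conflicting tags, are left as None.
    """
    candidates: List[Set[str]] = [set() for _ in range(target_len)]
    for src, tgt in alignment:
        candidates[tgt].add(source_tags[src])
    return [next(iter(c)) if len(c) == 1 else None for c in candidates]


english = ["I", "like", "it", "a", "lot"]
english_tags = ["N", "V", "PR", "DT", "ADJ"]
french = ["Je", "l'", "aime", "beaucoup"]
# Hypothetical alignment: I-Je, like-aime, it-l', a-beaucoup, lot-beaucoup
alignment = [(0, 0), (1, 2), (2, 1), (3, 3), (4, 3)]

print(list(zip(french, project_tags(english_tags, alignment, len(french)))))
```

Under this alignment "beaucoup" receives both DT and ADJ and is left untagged, which matches the question marks in the example and illustrates why plain projection produces a noisy training corpus.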
Cross-Lingual Tagging
Das and Petrov (2011)
  I like it a lot
  N V PR DT ADJ
  Je l’aime beaucoup
‣ Form a graph of trigrams, use these to propagate knowledge about tags
Cross-Lingual Tagging
Das and Petrov (2011)
[Figure: graph over trigram nodes]
  English trigrams: I like it, I love it, he loves it, she loves it
  French trigrams: l’aime beaucoup, l’adore un peu, l’adore beaucoup
  ‣ Edge weights based on similarity of contexts these trigrams occur in
  ‣ Edge weights based on alignments (middle word must be aligned)
‣ Each node is associated with a distribution over tags, label propagation updates these using the graph
Cross-Lingual Tagging
Das and Petrov (2011)
‣ Label propagation: encourages nodes with higher-weight edges between them to have similar tags
‣ Prune to only keep tags above some probability to get the lexicon (valid tag-word pairs)
‣ Take these trigrams and treat them as “soft training examples” and learn an HMM tagger
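A simplified sketch of label propagation followed by pruning. It uses plain iterative neighbor averaging with clamped seed nodes, which is a stand-in and not the exact objective optimized by Das and Petrov; the two-tag set, the edge weights, and the threshold are all made up for illustration.

```python
from typing import Dict, List, Tuple

TAGS = ["N", "V"]  # hypothetical two-tag set to keep the example small

Dist = Dict[str, float]


def propagate(
    nodes: List[str],
    edges: Dict[Tuple[str, str], float],  # undirected weighted edges
    seeds: Dict[str, Dist],               # nodes with fixed tag distributions
    iterations: int = 20,
) -> Dict[str, Dist]:
    """Repeatedly set each non-seed node to the weighted average of its neighbors."""
    uniform = {t: 1.0 / len(TAGS) for t in TAGS}
    dist = {n: dict(seeds.get(n, uniform)) for n in nodes}
    neighbors: Dict[str, List[Tuple[str, float]]] = {n: [] for n in nodes}
    for (u, v), w in edges.items():
        neighbors[u].append((v, w))
        neighbors[v].append((u, w))
    for _ in range(iterations):
        updated: Dict[str, Dist] = {}
        for n in nodes:
            if n in seeds or not neighbors[n]:
                updated[n] = dist[n]  # seeds stay clamped; isolated nodes stay as they are
                continue
            total = sum(w for _, w in neighbors[n])
            updated[n] = {
                t: sum(w * dist[m][t] for m, w in neighbors[n]) / total for t in TAGS
            }
        dist = updated
    return dist


def prune(dist: Dict[str, Dist], threshold: float) -> Dict[str, List[str]]:
    """Keep only the tags whose probability exceeds the threshold."""
    return {n: [t for t, p in d.items() if p > threshold] for n, d in dist.items()}


nodes = ["I like it", "l' aime beaucoup", "l' adore beaucoup"]
edges = {
    ("I like it", "l' aime beaucoup"): 1.0,          # alignment-based edge (made-up weight)
    ("l' aime beaucoup", "l' adore beaucoup"): 0.8,  # context-similarity edge (made-up weight)
}
seeds = {"I like it": {"N": 0.0, "V": 1.0}}          # tag of the middle word from the English tagger

dist = propagate(nodes, edges, seeds)
print(prune(dist, threshold=0.2))
```

The verb label reaches "l' adore beaucoup" even though that trigram has no direct alignment edge, because it is connected to an aligned French trigram with a similar context.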
Cross-Lingual Tagging
Das and Petrov (2011)
‣ EM-HMM / feature HMM: unsupervised methods with a greedy mapping from learned tags to gold tags
‣ Projection: project tags across bitext to make pseudo-gold corpus, train on that
Cross-Lingual Parsing
McDonald et al. (2011)
‣ Now that we can POS tag other languages, can we parse them too?
‣ Direct transfer: train a parser over POS sequences in one language, then apply it to another language
  I like tomatoes
  PRON VERB NOUN
  I like them
  PRON VERB PRON
  Je les aime
  PRON PRON VERB
‣ Even though we’ve never seen this sequence in English and don’t know the words, we can still figure it out
Cross-Lingual Parsing
McDonald et al. (2011)
‣ Multi-dir: transfer a parser trained on several source treebanks to the target language
‣ Multi-proj: more complex annotation projection approach
Cross-Lingual Embeddings
Ammar et al. (2016)
‣ Learn a shared multilingual embedding space so any neural system can transfer over
‣ multiCluster: use bilingual dictionaries to form clusters of words that are translations of one another, replace corpora with cluster IDs, train “monolingual” embeddings over all these corpora
‣ multiCCA: “project” all other languages into English
  ‣ CCA: learn a projection of aligned data points into a shared space
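The preprocessing half of multiCluster can be sketched with a union-find over dictionary entries. This is an illustrative sketch under assumptions: the dictionary pairs, the cluster ID format, and the choice to leave out-of-dictionary words as language-prefixed tokens are made up here, and the embedding training that follows is not shown.

```python
from typing import Dict, List, Tuple

Word = Tuple[str, str]  # (language code, word)


def build_clusters(dictionary_pairs: List[Tuple[Word, Word]]) -> Dict[Word, str]:
    """Group words connected by dictionary entries and name each group with an ID."""
    parent: Dict[Word, Word] = {}

    def find(x: Word) -> Word:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    for a, b in dictionary_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    cluster_ids: Dict[Word, str] = {}
    cluster_of: Dict[Word, str] = {}
    for word in list(parent):
        root = find(word)
        cluster_of[word] = cluster_ids.setdefault(root, f"C{len(cluster_ids)}")
    return cluster_of


def replace_with_ids(tokens: List[str], lang: str, cluster_of: Dict[Word, str]) -> List[str]:
    """Rewrite a sentence so that translations of one another share a single symbol."""
    return [cluster_of.get((lang, tok), f"{lang}:{tok}") for tok in tokens]


pairs = [
    (("en", "dog"), ("fr", "chien")),
    (("en", "dog"), ("de", "Hund")),
    (("en", "cat"), ("fr", "chat")),
]
clusters = build_clusters(pairs)
print(replace_with_ids(["the", "dog"], "en", clusters))
print(replace_with_ids(["le", "chien"], "fr", clusters))
print(replace_with_ids(["der", "Hund"], "de", clusters))
```

After this rewrite, any standard monolingual embedding trainer run over the concatenated corpora assigns one vector per cluster, which is what places the languages in a shared space.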
Cross-Lingual Embeddings
Ammar et al. (2016)
‣ Word vectors work pretty well at “intrinsic” tasks, some improvement on things like document classification and dependency parsing as well
Where are we now?
‣ Universal dependencies: treebanks (+ tags) for 70+ languages
‣ Treebanks for many languages are still small, so projection techniques may still help
‣ More corpora in other languages, less and less reliance on structured tools like parsers, and pretraining on unlabeled data means that performance on other languages is better than ever
‣ BERT has pretrained multilingual models that seem to work pretty well (trained on a whole bunch of languages)
Takeaways
‣ Many languages have richer morphology than English and pose distinct challenges
‣ Problems: how to analyze rich morphology, how to generate with it
‣ Can leverage resources for English using bitexts
‣ Next time: wrap up + ethics of NLP