NLP with H2O
TRANSCRIPT
![Page 1: NLP with H2O](https://reader034.vdocuments.net/reader034/viewer/2022042619/58e505111a28ab2c1c8b46e5/html5/thumbnails/1.jpg)
Michal Kurka, Megan Kurka
![Page 2: NLP with H2O](https://reader034.vdocuments.net/reader034/viewer/2022042619/58e505111a28ab2c1c8b46e5/html5/thumbnails/2.jpg)
Agenda
![Page 3: NLP with H2O](https://reader034.vdocuments.net/reader034/viewer/2022042619/58e505111a28ab2c1c8b46e5/html5/thumbnails/3.jpg)
Today’s Talk
H2O Overview
• What is H2O?
• H2O in R
Natural Language Processing
• Word2Vec Algorithm
• Converting Text to Vectors: An Example
Supervised Learning
• Introduction to Machine Learning
• Overview of Supervised Learning Algorithms
• Training Models in R and Flow
• Hyper-Parameter Tuning
![Page 4: NLP with H2O](https://reader034.vdocuments.net/reader034/viewer/2022042619/58e505111a28ab2c1c8b46e5/html5/thumbnails/4.jpg)
High Level Architecture
![Page 5: NLP with H2O](https://reader034.vdocuments.net/reader034/viewer/2022042619/58e505111a28ab2c1c8b46e5/html5/thumbnails/5.jpg)
H2O in R
![Page 6: NLP with H2O](https://reader034.vdocuments.net/reader034/viewer/2022042619/58e505111a28ab2c1c8b46e5/html5/thumbnails/6.jpg)
Using H2O with R
• REST API drives H2O from R
• All computations are performed in performance-optimized Java code in the H2O cluster
![Page 7: NLP with H2O](https://reader034.vdocuments.net/reader034/viewer/2022042619/58e505111a28ab2c1c8b46e5/html5/thumbnails/7.jpg)
Reading Data into H2O with R
![Page 8: NLP with H2O](https://reader034.vdocuments.net/reader034/viewer/2022042619/58e505111a28ab2c1c8b46e5/html5/thumbnails/8.jpg)
Reading Data from HDFS into H2O with R
![Page 9: NLP with H2O](https://reader034.vdocuments.net/reader034/viewer/2022042619/58e505111a28ab2c1c8b46e5/html5/thumbnails/9.jpg)
Reading Data from HDFS into H2O with R
![Page 10: NLP with H2O](https://reader034.vdocuments.net/reader034/viewer/2022042619/58e505111a28ab2c1c8b46e5/html5/thumbnails/10.jpg)
Natural Language Processing
![Page 11: NLP with H2O](https://reader034.vdocuments.net/reader034/viewer/2022042619/58e505111a28ab2c1c8b46e5/html5/thumbnails/11.jpg)
Word2Vec
• Learns vector representations of words by analyzing a large text corpus
• Word Embeddings = mapping of words to vectors from a high dimensional space (100-1000)
• Text sources: Google News, Wikipedia, Tweets, …
• Embeddings capture meaning of the word
  • Semantically similar words are close to each other
  • Can be used to find synonyms & analogies
  • Famous example:
    » man is to woman as king is to ??? (queen)
    » Vec(king) - Vec(man) + Vec(woman) = Vec(queen)
• Different variations of word2vec
  • Skip-Gram, CBOW
  • Hierarchical Softmax, Negative Sampling
[Figure: example embedding vectors for the words "car" and "alien"; the values shown are (-0.09, -0.14, -0.12, -0.06, 0.16) and (-0.38, -0.11, 0.10, -0.26, -0.24)]
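The analogy arithmetic on this slide can be tried on a toy example. The sketch below is plain Python, not H2O code, and the three-dimensional vectors are invented for illustration; real embeddings are learned from a corpus, and the analogy typically holds only approximately, which is why the nearest neighbour by cosine similarity is used.

```python
import math

# Toy 3-dimensional embeddings, invented for illustration only.
# Real word2vec vectors are learned from text and have 100+ dimensions.
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "car":   [0.5, 0.5, 0.5],
}

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Answer 'a is to b as c is to ?' via Vec(c) - Vec(a) + Vec(b)."""
    target = [vc - va + vb
              for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])]
    # Exclude the three query words, then pick the most similar remaining word.
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(embeddings[w], target))

result = analogy("man", "woman", "king")
print(result)
```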
![Page 12: NLP with H2O](https://reader034.vdocuments.net/reader034/viewer/2022042619/58e505111a28ab2c1c8b46e5/html5/thumbnails/12.jpg)
Word2Vec
![Page 13: NLP with H2O](https://reader034.vdocuments.net/reader034/viewer/2022042619/58e505111a28ab2c1c8b46e5/html5/thumbnails/13.jpg)
Word2Vec
Source: twitter.com/DanilBaibak/status/844647217885581312
![Page 14: NLP with H2O](https://reader034.vdocuments.net/reader034/viewer/2022042619/58e505111a28ab2c1c8b46e5/html5/thumbnails/14.jpg)
Word2Vec Algorithm
[Figure: neural network with Input Layer, Hidden Layer, and Output Layer]
• word2vec trains a Neural Network
  • Single hidden layer
  • Number of neurons in hidden layer = length of embeddings
• Uses a trick to formulate the problem as a supervised problem
  • The network is trained to predict target words based on given input words (window sliding over the text)
  • The weights on the edges connecting input layer to the hidden layer constitute the word embedding (weight matrix)
  • Similar approach to auto-encoders
• Optimizations
  • Hierarchical Softmax: binary tree to represent output probabilities (implemented in H2O)
  • Negative Sampling: sample the output vectors to be updated
  • HogWild!: lock-free parallelization (ignores conflicts)
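To see why the input-to-hidden weights are the embeddings, it may help to note that feeding a one-hot input through the weight matrix simply selects one row. The sketch below is plain Python with a made-up three-word vocabulary and made-up weights; it is not how H2O stores its model.

```python
# Hypothetical vocabulary and weight matrix (values invented for illustration).
# One row per word, one column per hidden neuron, so the embedding length is 2.
vocab = ["indestructible", "humanoid", "cyborg"]
W = [
    [0.2, -0.1],
    [0.4, 0.3],
    [-0.5, 0.1],
]

def one_hot(word):
    """Input layer encoding: 1.0 at the word's index, 0.0 elsewhere."""
    return [1.0 if w == word else 0.0 for w in vocab]

def hidden_activation(word):
    """Hidden layer output: the one-hot input multiplied by the weight matrix."""
    x = one_hot(word)
    return [sum(x[i] * W[i][j] for i in range(len(vocab)))
            for j in range(len(W[0]))]

def embedding(word):
    """The same result as a direct row lookup in the weight matrix."""
    return W[vocab.index(word)]

print(hidden_activation("humanoid"), embedding("humanoid"))
```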
![Page 15: NLP with H2O](https://reader034.vdocuments.net/reader034/viewer/2022042619/58e505111a28ab2c1c8b46e5/html5/thumbnails/15.jpg)
Word2Vec Algorithm
"A seemingly indestructible humanoid cyborg is sent from 2029 to 1984 to assassinate a waitress, whose unborn son will lead humanity in a war against the machines, while a soldier from that war is sent to protect her at all costs."
Given: humanoid
Predict: indestructible and cyborg
[Figure: network with Input Layer (humanoid), Hidden Layer, and Output Layer (indestructible, cyborg)]
• H2O supports Skip-Gram architecture
  • Network tries to predict surrounding words (context) based on a given input word
  • It treats each context-target pair as a new observation
  • Skip-Gram works well with larger training data
  • Represents well even rare words and phrases
• CBOW support is planned
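The way Skip-Gram turns raw text into supervised observations can be sketched in a few lines. This is plain Python rather than H2O code, and the window size of 1 is an assumption chosen so that the output matches the slide's example (given "humanoid", predict "indestructible" and "cyborg").

```python
def skipgram_pairs(tokens, window=1):
    """Return (input word, context word) pairs from a sliding window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                # Each pair becomes one training observation.
                pairs.append((center, tokens[j]))
    return pairs

tokens = "a seemingly indestructible humanoid cyborg is sent".split()
pairs = skipgram_pairs(tokens, window=1)
humanoid_pairs = [p for p in pairs if p[0] == "humanoid"]
print(humanoid_pairs)
```

A larger window produces more pairs per word and captures broader context, at the cost of more training work.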
![Page 16: NLP with H2O](https://reader034.vdocuments.net/reader034/viewer/2022042619/58e505111a28ab2c1c8b46e5/html5/thumbnails/16.jpg)
Word2Vec - Usage
• word2vec embeddings are typically used in a pre-processing step of another ML algo
• we need to aggregate the word embeddings for word sequences of variable length (sentences, paragraphs, ...)
• pooling
  • averaging (aka "mean pooling")
  • (TF-IDF) weighted averaging
  • Not good for use cases when order of words is important (Sentiment Analysis)
  • http://jxieeducation.com/static/research/documentembedding_poster.pdf
  • PCA & Fisher Vectors: http://www.cs.tau.ac.il/~wolf/papers/qagg.pdf
• simply concatenate embeddings
  • for short inputs, e.g. job titles
• use an extension of word2vec
  • Paragraph Vectors / doc2vec
  • Adds "paragraph token" as an input of the training process; it acts as a memory that remembers what is missing from the current context, or the topic of the paragraph
  • https://cs.stanford.edu/~quocle/paragraph_vector.pdf
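The two pooling options above can be illustrated with a minimal plain-Python sketch. The word vectors and the weights are invented for illustration; in practice the weights would come from a TF-IDF computation over the corpus, which is not shown here.

```python
def mean_pool(vectors):
    """Average the word vectors component by component."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def weighted_pool(vectors, weights):
    """Weighted average; the weights stand in for per-word TF-IDF scores."""
    total = sum(weights)
    return [sum(w * x for w, x in zip(weights, column)) / total
            for column in zip(*vectors)]

# Hypothetical 2-dimensional embeddings for a three-word document.
word_vecs = {"alien": [1.0, 0.0], "invades": [0.0, 1.0], "earth": [1.0, 1.0]}
doc = ["alien", "invades", "earth"]
vectors = [word_vecs[w] for w in doc]

doc_vec = mean_pool(vectors)
weighted_vec = weighted_pool(vectors, [2.0, 1.0, 1.0])  # "alien" counts double
print(doc_vec, weighted_vec)
```

Both functions return one fixed-length vector regardless of document length, and both discard word order, which is the limitation noted for sentiment analysis.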
![Page 17: NLP with H2O](https://reader034.vdocuments.net/reader034/viewer/2022042619/58e505111a28ab2c1c8b46e5/html5/thumbnails/17.jpg)
Word2Vec in an Example ML Workflow
Use Case: Predict movie genre from a movie synopsis / plot summary.
1. word2vec workflow
   1. Tokenize plots: break up plots into separate words
   2. Filter words: remove stop words like "and", "the", "of"; remove too short words
   3. Train a word2vec model
   4. Sanity check: find synonyms based on the word embeddings
   5. (optional) Evaluate semantic accuracy of the model
      • But the best measure is always the performance of the overall problem (how well we predict the genres)
2. Use model to transform plots to vectors
3. Use plot vector representations with a supervised machine learning algorithm to predict the genre
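The tokenize and filter steps of this workflow can be sketched in plain Python. The stop-word list and the minimum word length below are small illustrative assumptions, not the values used in the talk's demo.

```python
import re

# Tiny illustrative stop-word list (an assumption, not the demo's list).
STOP_WORDS = {"and", "the", "of", "a", "to", "is"}

def tokenize(plot):
    """Lower-case the plot and split it on runs of non-word characters."""
    return [t for t in re.split(r"\W+", plot.lower()) if t]

def filter_words(tokens, min_length=2):
    """Drop stop words and words shorter than min_length."""
    return [t for t in tokens if len(t) >= min_length and t not in STOP_WORDS]

plot = "The king of the aliens is sent to Earth."
tokens = filter_words(tokenize(plot))
print(tokens)
```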
![Page 18: NLP with H2O](https://reader034.vdocuments.net/reader034/viewer/2022042619/58e505111a28ab2c1c8b46e5/html5/thumbnails/18.jpg)
Word2Vec in H2O: Key Functions
# Break plots into a sequence of words
words <- h2o.tokenize(movies$plot, "\\\\W+")
# Build word2vec model
w2v_model <- h2o.word2vec(words, epochs = 10)
# Sanity check - find synonyms for the word 'teacher'
h2o.findSynonyms(w2v_model, "teacher", count = 5)
# Transform words into embeddings and aggregate for each plot
plot_vecs <- h2o.transform(w2v_model, words, aggregate_method = "AVERAGE")
![Page 19: NLP with H2O](https://reader034.vdocuments.net/reader034/viewer/2022042619/58e505111a28ab2c1c8b46e5/html5/thumbnails/19.jpg)
Machine Learning
![Page 20: NLP with H2O](https://reader034.vdocuments.net/reader034/viewer/2022042619/58e505111a28ab2c1c8b46e5/html5/thumbnails/20.jpg)
Rule Based Model
[Figure: Movie Plots pass through "If Plot Contains" rules]
• Dog, Magic, Princess → Family
• Alien, Future, Scientist → Sci-Fi
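A rule based model of this kind amounts to a keyword lookup. The plain-Python sketch below uses the keywords from the slide; the function name and the behaviour when no rule matches are assumptions made for illustration.

```python
# Keyword rules taken from the slide.
RULES = {
    "Family": {"dog", "magic", "princess"},
    "Sci-Fi": {"alien", "future", "scientist"},
}

def rule_based_genre(plot):
    """Return the first genre whose keywords appear in the plot, else None."""
    words = set(plot.lower().split())
    for genre, keywords in RULES.items():
        if words & keywords:
            return genre
    return None  # no rule fired

print(rule_based_genre("A scientist travels to the future"))
print(rule_based_genre("Two friends open a bakery"))
```

The second call shows the weakness of hand-written rules: any plot that avoids the listed keywords gets no prediction at all.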
![Page 21: NLP with H2O](https://reader034.vdocuments.net/reader034/viewer/2022042619/58e505111a28ab2c1c8b46e5/html5/thumbnails/21.jpg)
Machine Learning Model
[Figure: chart with an axis from 0 to 100 and the classes Sci-Fi and Family]
![Page 22: NLP with H2O](https://reader034.vdocuments.net/reader034/viewer/2022042619/58e505111a28ab2c1c8b46e5/html5/thumbnails/22.jpg)
Machine Learning Model
[Figure: Movie Plots classified as Sci-Fi or Family]
![Page 23: NLP with H2O](https://reader034.vdocuments.net/reader034/viewer/2022042619/58e505111a28ab2c1c8b46e5/html5/thumbnails/23.jpg)
Supervised Learning
![Page 24: NLP with H2O](https://reader034.vdocuments.net/reader034/viewer/2022042619/58e505111a28ab2c1c8b46e5/html5/thumbnails/24.jpg)
Supervised Learning
Regression: How much will a movie cost to make?
Classification: What is the genre of a movie? Is it a Comedy? Drama? Action?
![Page 25: NLP with H2O](https://reader034.vdocuments.net/reader034/viewer/2022042619/58e505111a28ab2c1c8b46e5/html5/thumbnails/25.jpg)
Overfitting
Finding the Signal in the Data | Memorizing the Data
![Page 26: NLP with H2O](https://reader034.vdocuments.net/reader034/viewer/2022042619/58e505111a28ab2c1c8b46e5/html5/thumbnails/26.jpg)
Demo Time
![Page 27: NLP with H2O](https://reader034.vdocuments.net/reader034/viewer/2022042619/58e505111a28ab2c1c8b46e5/html5/thumbnails/27.jpg)
Establish a Baseline
In R:
print("Build Gradient Boosted Machine with default parameters")
gbm_model <- h2o.gbm(x = myX, y = "genre", training_frame = train, validation_frame = test)
In Flow:
![Page 28: NLP with H2O](https://reader034.vdocuments.net/reader034/viewer/2022042619/58e505111a28ab2c1c8b46e5/html5/thumbnails/28.jpg)
Performance Metrics in Flow
Hit Ratio | Scoring History
![Page 29: NLP with H2O](https://reader034.vdocuments.net/reader034/viewer/2022042619/58e505111a28ab2c1c8b46e5/html5/thumbnails/29.jpg)
Performance Metrics in Flow
Confusion Matrix on Validation Data
![Page 30: NLP with H2O](https://reader034.vdocuments.net/reader034/viewer/2022042619/58e505111a28ab2c1c8b46e5/html5/thumbnails/30.jpg)
Examining Results
We see that the model predicted a lot of movies that were Thriller as Drama.
Why?
Plot of a Drama from the Training Data: "Cordelia Gray is the reluctant owner of a ramshackle investigation agency following the suicide of her boss. Watching over her as she hunts down clues in the murky and sinister world of crime, is her straight-laced and intuitive office assistant Edith Sparshott."
![Page 31: NLP with H2O](https://reader034.vdocuments.net/reader034/viewer/2022042619/58e505111a28ab2c1c8b46e5/html5/thumbnails/31.jpg)
Demo Time
![Page 32: NLP with H2O](https://reader034.vdocuments.net/reader034/viewer/2022042619/58e505111a28ab2c1c8b46e5/html5/thumbnails/32.jpg)
Adding New Features

| Features | Mean Per Class Error |
| --- | --- |
| Pre-trained Word Embeddings | 0.637 |
| Pre-trained Word Embeddings + H2O Word Embeddings | 0.542 |
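Mean per class error averages the error rate of each class, so a rare genre counts as much as a common one. The plain-Python sketch below computes it on a made-up set of six labels; it illustrates the definition and is not H2O's implementation.

```python
def mean_per_class_error(actual, predicted):
    """Average of the per-class error rates (lower is better)."""
    errors = []
    for cls in sorted(set(actual)):
        idx = [i for i, a in enumerate(actual) if a == cls]
        wrong = sum(1 for i in idx if predicted[i] != cls)
        errors.append(wrong / len(idx))
    return sum(errors) / len(errors)

# Made-up labels: four Dramas (all correct), two Thrillers (one misread as Drama).
actual = ["Drama", "Drama", "Drama", "Drama", "Thriller", "Thriller"]
predicted = ["Drama", "Drama", "Drama", "Drama", "Drama", "Thriller"]

mpce = mean_per_class_error(actual, predicted)
overall_error = sum(a != p for a, p in zip(actual, predicted)) / len(actual)
print(mpce, overall_error)
```

Here the overall error is 1 in 6, while the mean per class error is higher because half of the Thrillers are wrong.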
![Page 33: NLP with H2O](https://reader034.vdocuments.net/reader034/viewer/2022042619/58e505111a28ab2c1c8b46e5/html5/thumbnails/33.jpg)
Examine Results
Variable Importance
Words with High C58
• spirituals
• percussion
• philharmonic
• orchestral
• concerto
• sonata
• hadyn
![Page 34: NLP with H2O](https://reader034.vdocuments.net/reader034/viewer/2022042619/58e505111a28ab2c1c8b46e5/html5/thumbnails/34.jpg)
Examine Results
Variable Importance
Words with High C74
• einsteins
• patrolmen
• upperclassman
• speakeasies
• razzle
![Page 35: NLP with H2O](https://reader034.vdocuments.net/reader034/viewer/2022042619/58e505111a28ab2c1c8b46e5/html5/thumbnails/35.jpg)
Early Stopping
Stop early once the validation mean per class error doesn't improve by at least 0.01% for 5 consecutive scoring events
• stopping_rounds = 5
• stopping_tolerance = 1e-4
• stopping_metric = "mean_per_class_error"
[Figure: scoring history with the Early Stopping point and the Overfitting region marked]
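The stopping rule can be approximated with a short plain-Python check. This is a simplified sketch of the idea under the parameter values above; H2O's own rule may differ in detail (for example in how it smooths the metric across scoring events), and the metric histories below are made up.

```python
def should_stop(history, stopping_rounds=5, stopping_tolerance=1e-4):
    """history: validation metric per scoring event, lower is better.

    Simplified rule: stop when the best value of the last `stopping_rounds`
    events is not a relative improvement of at least `stopping_tolerance`
    over the best value seen before them.
    """
    if len(history) <= stopping_rounds:
        return False  # not enough scoring events yet
    best_before = min(history[:-stopping_rounds])
    recent_best = min(history[-stopping_rounds:])
    return recent_best > best_before * (1 - stopping_tolerance)

improving = [0.70, 0.65, 0.60, 0.58, 0.56, 0.55, 0.54]
plateau = [0.60, 0.55, 0.56, 0.57, 0.56, 0.58, 0.57]
print(should_stop(improving), should_stop(plateau))
```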
![Page 36: NLP with H2O](https://reader034.vdocuments.net/reader034/viewer/2022042619/58e505111a28ab2c1c8b46e5/html5/thumbnails/36.jpg)
Examine Results

| Model | Mean Per Class Error |
| --- | --- |
| Pre-trained Word Embeddings | 0.637 |
| Pre-trained Word Embeddings + H2O Word Embeddings | 0.542 |
| Pre-trained Word Embeddings + H2O Word Embeddings + Early Stopping | 0.503 |
![Page 37: NLP with H2O](https://reader034.vdocuments.net/reader034/viewer/2022042619/58e505111a28ab2c1c8b46e5/html5/thumbnails/37.jpg)
Grid Search
In R:
grid <- h2o.grid(hyper_params = list(max_depth = c(1:20)), search_criteria = list(strategy = "RandomDiscrete", max_runtime_secs = 3600), algorithm = "gbm", ..)
In Flow:
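The idea behind a random discrete search can be sketched without H2O: try the candidate values in random order until a budget runs out, then keep the best. In the plain-Python sketch below the budget is a model count rather than the slide's max_runtime_secs, and toy_error is a made-up stand-in for training a GBM and reading its validation error.

```python
import random

def random_discrete_search(param_values, score_fn, max_models, seed=42):
    """Evaluate up to max_models randomly chosen values; return (best score, value)."""
    rng = random.Random(seed)
    candidates = list(param_values)
    rng.shuffle(candidates)
    results = [(score_fn(value), value) for value in candidates[:max_models]]
    return min(results)  # lowest error wins

def toy_error(max_depth):
    """Made-up validation error, lowest at max_depth = 7."""
    return abs(max_depth - 7) / 20 + 0.5

# With a budget that covers all 20 depths, the search must find the optimum.
best_error, best_depth = random_discrete_search(range(1, 21), toy_error, max_models=20)
# With a smaller budget, only a random subset of depths is tried.
partial_error, partial_depth = random_discrete_search(range(1, 21), toy_error, max_models=5)
print(best_depth, best_error, partial_depth, partial_error)
```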
![Page 38: NLP with H2O](https://reader034.vdocuments.net/reader034/viewer/2022042619/58e505111a28ab2c1c8b46e5/html5/thumbnails/38.jpg)
Questions?