Exploring Word2Vec in Scala
Gary Sieling, @garysieling
Wingspan, an IQVIA Company
Jan 11, 2018, PHASE

FindLectures.com: A case study on natural language search
• Demo
• Crawling
• Search Use Cases
• Machine Learning

Goals
• Using machine learning on text
• Practical examples of Word2Vec in Scala
• Show uses of CUDA

Agenda
• Proof of Concept: Email alerts
• Concept Search
• CUDA
• Demo

Papers
• An empirical study of semantic similarity in WordNet and Word2Vec
  http://scholarworks.uno.edu/cgi/viewcontent.cgi?article=3003&context=td
• A Dual Embedding Space Model for Document Ranking
  https://arxiv.org/pdf/1602.01137v1.pdf

Email Alerts

Concept Search
• Writing, NOT Code
• Excludes "writing css", "writing php"
• Implies "poetry", "fiction", "copyediting"

Concept Search
• Recipes, Vegetarian Food
• NOT Dairy
• All three might include "vegan cooking"
• Implies no milk, cheese

Requirements
• Talks "about" the chosen topic
• Incorporate meaning: "Scala" + "Machine Learning" -> Dl4j
• Maybe a concept hierarchy
• Don't combine meaning if nothing in common (hiking, art)
• Don't send duplicate talks/articles (e.g. announcement from different publications)
• Choose a wide variety of talks (not 5 on type systems, etc.)
• Bonus points for "negative" meanings (scala, but not monads)

This is a "search" problem
• Tokenize text
• Maybe mark known "entities"
• Filter/de-emphasize common terms/meanings
• Find the terms we should have searched for
• Search for those terms
• Re-rank/filter results

Solution: Word2Vec
https://github.com/idio/wiki2vec

Terms in context: Political Coding
http://findlectures.com/?q=liberation

Terms in context: Context definitions
http://findlectures.com/?q=quaker

Training Vectors
Was raised a Quaker
["was", "raised", "a", "religious", "since", "the", "whose", "patience"]
[1, 1, 1, 0, 0, 0, 0, 0]
The Quaker whose patience was
["was", "raised", "a", "religious", "since", "the", "whose", "patience"]
[1, 0, 0, 0, 0, 1, 1, 1]
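
The two training vectors above can be reproduced with a few lines of plain Scala. This is only a sketch of the idea (mark which vocabulary words appear around the target term); the `tokenize` helper is a simplistic stand-in invented for the example, not the tokenizer used in the talk.

```scala
object ContextVectors {
  // Vocabulary of context words, in the order shown on the slide.
  val vocabulary: Vector[String] =
    Vector("was", "raised", "a", "religious", "since", "the", "whose", "patience")

  // Lowercase and split on non-letters; a stand-in for a real tokenizer.
  def tokenize(text: String): Vector[String] =
    text.toLowerCase.split("[^a-z]+").filter(_.nonEmpty).toVector

  // 1 where the vocabulary word occurs in the sentence, 0 otherwise.
  def contextVector(sentence: String): Vector[Int] = {
    val present = tokenize(sentence).toSet
    vocabulary.map(w => if (present.contains(w)) 1 else 0)
  }
}
```

Calling `ContextVectors.contextVector("Was raised a Quaker")` marks only the first three vocabulary entries, matching the first vector on the slide.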

Word2Vec Output
P(Term | Context)
or
P(Context | Term)

Example: Vector Addition
Gloria Steinem - Person + Ideology ~=
1. Marxist Feminism
2. Radical Feminism
3. Feminist Movement
4. Feminist Theory
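
A rough sketch of the arithmetic behind this slide: subtract one vector, add another, then rank the remaining terms by cosine similarity to the result. The three-dimensional vectors below are made up purely for illustration; a trained model would supply vectors with 50 to 1000 dimensions.

```scala
object VectorArithmetic {
  type Vec = Vector[Double]

  def add(a: Vec, b: Vec): Vec = a.zip(b).map { case (x, y) => x + y }
  def sub(a: Vec, b: Vec): Vec = a.zip(b).map { case (x, y) => x - y }

  def cosine(a: Vec, b: Vec): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norms = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    if (norms == 0.0) 0.0 else dot / norms
  }

  // Rank every candidate term by cosine similarity to the target vector.
  def nearest(target: Vec, candidates: Map[String, Vec], n: Int): List[String] =
    candidates.toList.sortBy { case (_, v) => -cosine(target, v) }.take(n).map(_._1)

  // Made-up toy vectors: the axes loosely mean (person, ideology, cooking).
  val toy: Map[String, Vec] = Map(
    "gloria_steinem"  -> Vector(0.9, 0.6, 0.0),
    "person"          -> Vector(1.0, 0.0, 0.0),
    "ideology"        -> Vector(0.0, 1.0, 0.0),
    "feminist_theory" -> Vector(0.1, 0.9, 0.0),
    "vegan_cooking"   -> Vector(0.0, 0.1, 0.9)
  )

  // "gloria_steinem - person + ideology", excluding the input terms from the results.
  def example: List[String] = {
    val target = add(sub(toy("gloria_steinem"), toy("person")), toy("ideology"))
    nearest(target, toy -- Set("gloria_steinem", "person", "ideology"), 1)
  }
}
```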

Suggested Search

Example: Data Format
{"word": "zulus", "count": 30,
 "syn0": [-0.064, 0.118, 0.031, 0.163, 0.019, 0.197, 0.097, -0.139, -0.055, 0.155, -0.033, -0.252, -0.029, 0.119, 0.007, -0.017, 0.187, 0.017, 0.058, -0.097, -0.255, -0.159, -0.053, -0.090, -0.118, 0.119, 0.068, 0.025, 0.160, -0.035, -0.216, 0.065, 0.017, 0.038, -0.068, 0.101, 0.090, 0.089, -0.023, 0.265, -0.161, -0.178, -0.362, 0.016, 0.226, -0.070, -0.079, 0.040, 0.368, -0.150],
 "syn1": [0.312, 0.379, 0.168, -0.371, -0.094, 0.218, -0.022, -0.051, 0.003, -0.010, 0.233, -0.005, -0.037, 0.105, 0.025, -0.040, -0.127, 0.201, 0.175, 0.277, 0.185, -0.219, -0.504, -0.187, 0.069, 0.041, 0.237, -0.245, 0.067, -0.186, 0.127, 0.235, -0.262, -0.020, -0.152, 0.007, -0.346, 0.008, -0.173, -0.267, -0.049, 0.051, 0.087, 0.046, -0.059, 0.147, 0.024, 0.032, -0.403, 0.019]}

Example: Similarity
Number from [0, 1] (in general, cosine similarity can range from -1 to 1)
Image credit: https://engineering.aweber.com/cosine-similarity/

Operation 1: "Similarity"
  import org.nd4j.linalg.api.ndarray.INDArray
  import org.nd4j.linalg.ops.transforms.Transforms

  def cosineSimilarity(a: INDArray, b: INDArray): Double = {
    Transforms.cosineSim(a, b)
  }

INDArray
- Similar to a numpy array
- Implementation depends on the dependency:
  libraryDependencies += "org.nd4j" % "nd4j-cuda-8.0-platform" % nd4jVersion
  libraryDependencies += "org.nd4j" % "nd4j-native" % nd4jVersion
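
For readers without Nd4j on hand, the same similarity can be sketched in plain Scala. This is an illustrative re-implementation of the standard cosine formula (dot product divided by the product of the norms), not the Nd4j code; returning 0.0 for a zero vector is a choice made for this example.

```scala
object CosineSketch {
  def cosineSimilarity(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same length")
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    // A zero vector has no direction; treat it as unrelated to everything.
    if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
  }
}
```

Identical directions score 1, perpendicular vectors score 0, and opposite directions score -1.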

CUDA
• Specialized instruction set in video cards/GPUs
• Requires NVIDIA SDK and a recent card ($100-$xx,xxx)
• Available on AWS
• Deeplearning4j: JVM libraries for machine learning
• Nd4j/nd4s: matrix algebra on large arrays

CUDA: example C code
  __global__ void coalescedMultiply(float *a, float *c, int M)
  {
      __shared__ float aTile[TILE_DIM][TILE_DIM],
                       transposedTile[TILE_DIM][TILE_DIM];
      int row = blockIdx.y * blockDim.y + threadIdx.y;
      int col = blockIdx.x * blockDim.x + threadIdx.x;
      float sum = 0.0f;
      aTile[threadIdx.y][threadIdx.x] = a[row*TILE_DIM + threadIdx.x];
      transposedTile[threadIdx.x][threadIdx.y] =
          a[(blockIdx.x*blockDim.x + threadIdx.y)*TILE_DIM +
            threadIdx.x];
      __syncthreads();
      for (int i = 0; i < TILE_DIM; i++)
          sum += aTile[threadIdx.y][i] * transposedTile[i][threadIdx.x];
      c[row*M + col] = sum;
  }

Ways to obtain GPUs
• Buying
• Renting
• AWS ($0.90/hr)

Name         GPUs  vCPUs  RAM (GiB)  Network Bandwidth  Price/Hour*  RI Price/Hour**
p2.xlarge    1     4      61         High               $0.900       $0.425
p2.8xlarge   8     32     488        10 Gbps            $7.200       $3.400
p2.16xlarge  16    64     732        20 Gbps            $14.400      $6.800

Training Word2Vec
  val vec = new Word2Vec.Builder()
    .minWordFrequency(5)
    .iterations(1)
    .layerSize(100)
    .seed(42)
    .windowSize(5)
    .iterate(sentenceIterator)
    .tokenizerFactory(tokenizer)
    .build()
  vec.fit()

How do you tell if your code is running on a GPU?

How does this affect word2vec?
• Dl4j demo project: 72 minutes (CPU)
• Dl4j demo project: 41 minutes (GPU)

Most Similar...

Defining ops we can use (should this be sooner?)
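
Operation 2: Compute a document mean
  import scala.collection.JavaConverters._

  def getWordVectorsMean(tokens: List[String]): INDArray = {
    // Keep only tokens the model has a vector for
    val words = tokens.filter(model.getWordVector(_) != null).sorted
    model.getWordVectorsMean(words.asJavaCollection)
  }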
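
The document mean is simply the element-wise average of the vectors for every known token. A minimal plain-Scala sketch, assuming the model is represented as a `Map` from term to vector (a simplification for the example, not how Deeplearning4j stores it):

```scala
object DocumentMean {
  type Vec = Vector[Double]

  // Average the vectors of the tokens the model knows; None if it knows none of them.
  def mean(tokens: Seq[String], vectors: Map[String, Vec]): Option[Vec] = {
    val known = tokens.collect { case t if vectors.contains(t) => vectors(t) }
    if (known.isEmpty) None
    else Some(known.transpose.map(_.sum / known.size).toVector)
  }
}
```

Out-of-vocabulary tokens are dropped before averaging, mirroring the null check in the slide's code.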

Nd4s/Nd4j
- Everything is one long array, with dimensions (like numpy)
- Create one with a big iterator
- Easy to reshape
- Parallelism: min 32 cores, all following the same path

Problem: Suggestions
By the next search?

Problem: Noise
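
Nd4s: Make an array
  // Three "modes", each numWords * widthOfWordVector values long:
  // the word vectors, then term frequencies, then document frequencies.
  // Seq.fill repeats each frequency across the vector width
  // (Seq.iterate(1, n)(...) would emit a stray leading 1).
  val data: Seq[Double] =
    Seq(
      words.flatMap((w) => wordVectors(w)),
      words.flatMap((w) => Seq.fill(widthOfWordVector)(termFrequencies(w).toDouble)),
      words.flatMap((w) => Seq.fill(widthOfWordVector)(documentFrequencies(w).toDouble))
    ).flatten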

Nd4s: Computation of TF*IDF average
  val modeVectors = arr.reshape(modes, widthOfWordVector * numWords)
  val scores = modeVectors(0 -> 1)
  val tf = modeVectors(1 -> 2)
  val df = modeVectors(2 -> 3)
  val weighted = scores * tf / df
  val wordVects = weighted.reshape(numWords, widthOfWordVector)
  // this is the weighted average
  wordVects.sum(0) / numWords
  // TODO is this any better?
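
The same weighting can be written without Nd4s to make the arithmetic explicit: each word vector is scaled by tf/df and the scaled vectors are averaged. This sketch uses plain `Map`s as stand-ins for the model and the frequency tables, and follows the slide's raw tf/df ratio rather than a log-scaled IDF.

```scala
object WeightedMean {
  type Vec = Vector[Double]

  // Scale each word vector by tf/df, then average across words.
  def tfDfWeightedMean(
      words: Seq[String],
      vectors: Map[String, Vec],
      tf: Map[String, Int],
      df: Map[String, Int]): Vec = {
    require(words.nonEmpty, "need at least one word")
    val weighted = words.map { w =>
      val weight = tf(w).toDouble / df(w).toDouble
      vectors(w).map(_ * weight)
    }
    // Column-wise sum, divided by the number of words.
    weighted.transpose.map(_.sum / words.size).toVector
  }
}
```

With "python" at tf 2, df 4 and "pandas" at tf 1, df 1, the rarer term contributes twice the weight of the common one.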

"Synonym" Discovery Example
"Code"
"Coat"

Word2Vec: Build a Full Text Query
  import scala.collection.JavaConverters._

  List("python", "machine", "learning").map((queryTerm) =>
    "(" +
      model.wordsNearest(
        List(queryTerm).asJava,   // positive terms
        List[String]().asJava,    // negative terms
        25
      ).asScala.map((nearWord) =>
        "transcript:" + nearWord + "^" + model.similarity(nearWord, queryTerm)
      ).mkString(" OR ") +
    ")"
  ).mkString(" AND ")
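
The shape of the generated query is easier to see with the model stubbed out. In this sketch `neighbors` is a hypothetical lookup standing in for `wordsNearest` plus `similarity`; each query term becomes an OR-group of boosted neighbors, and the groups are joined with AND.

```scala
object QueryBuilder {
  // Build one OR-group per query term, boosted by similarity, joined with AND.
  def expand(
      queryTerms: List[String],
      neighbors: String => List[(String, Double)],
      field: String): String =
    queryTerms.map { term =>
      neighbors(term)
        .map { case (word, sim) =>
          // Locale.ROOT keeps the decimal point stable across machines.
          field + ":" + word + "^" + "%.2f".formatLocal(java.util.Locale.ROOT, sim)
        }
        .mkString("(", " OR ", ")")
    }.mkString(" AND ")
}
```

A plain `Map` can be passed as `neighbors`, which is convenient for testing the query syntax before wiring in a trained model.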

Visual: Nearest terms
Query Term
Top N closest

Example: Query ("Python + Machine Learning")
  title_s:python^10 OR title_s:"machine learning"^10 …
  (title_s:software^1.21 OR title_s:database^1.20 OR title_s:format^1.18 OR
   title_s:applications^1.14 OR title_s:browser^1.14 OR title_s:setup^1.13 OR
   title_s:bootstrap^1.13 OR title_s:in-class^1.13 OR title_s:campesina^1.12 OR
   title_s:excel^1.12 OR title_s:hardware^1.11 OR title_s:programming^1.11 OR
   title_s:api^1.11 OR title_s:prototype^1.11 OR title_s:middleware^1.11 OR
   title_s:openstreetmap^1.10 OR title_s:product^1.10 OR title_s:app^1.09 OR
   title_s:hbp^1.09 OR title_s:programmers^1.09 OR title_s:application^1.09 OR
   title_s:databases^1.09 OR title_s:idiomatic^1.09 OR title_s:spreadsheet^1.09 OR
   title_s:java^1.09 …
  AND (…)

Results (Python + Machine Learning + BM25)
Python for Data Analysis
How To Get Started With Machine Learning? | Two Minute Papers
The /r/playrust Classifier: Real World Rust Data Science
Andreas Mueller - Commodity Machine Learning
A Gentle Introduction To Machine Learning
A full Machine learning pipeline in Scikit-learn vs in scala-Spark
Hello World - Machine Learning Recipes #1
Visual diagnostics for more informed machine learning
Lab to Factory: Robust Machine Learning Systems
Machine Learning with Scala on Spark by Jose Quesada

Word2Vec: "Writing"
Issues Related to the Teaching of Creative Writing
Is Nonfiction Literature?
"Oh, you liar, you storyteller": On Fibbing, Fact and Fabulation
The Value of the Essay in the 21st Century
Rewriting Rereading Rethinking: Web Design in Words
Aspen New York Book Series: The Art of the Memoir
Cheryl Strayed: "Wild"
Siri Hustvedt in Conversation with Paul Auster
Mary Karr: The 2016 Diana and Simon Raab Writer-in-Residence
History, Memory, and the Novel

Aboutness
Re-sorting top 100 documents
  val queryMean = model.getWordVectorsMean(List("writing"))
  val mean = model.getWordVectorsMean(NLP.getWords(document._1))
  // cosineSim returns a similarity: higher means closer to the query
  val distance = Transforms.cosineSim(mean, queryMean)
5 min 45 seconds @ 16 parallel threads
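
A compact sketch of the re-sorting step, assuming the candidate documents are already tokenized and the model is a plain `Map` from term to vector (both simplifications for the example). Each document is scored by the cosine similarity between its mean vector and the query's mean vector, then sorted from most to least "about" the query.

```scala
object Aboutness {
  type Vec = Vector[Double]

  def cosine(a: Vec, b: Vec): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norms = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    if (norms == 0.0) 0.0 else dot / norms
  }

  // Mean vector of the known tokens, or None when no token is in the model.
  def meanOf(tokens: Seq[String], vectors: Map[String, Vec]): Option[Vec] = {
    val known = tokens.collect { case t if vectors.contains(t) => vectors(t) }
    if (known.isEmpty) None
    else Some(known.transpose.map(_.sum / known.size).toVector)
  }

  // documents are (title, tokens); the result is (title, score), best first.
  def rerank(
      queryTerms: Seq[String],
      documents: Seq[(String, Seq[String])],
      vectors: Map[String, Vec]): Seq[(String, Double)] =
    meanOf(queryTerms, vectors) match {
      case None => Seq.empty
      case Some(q) =>
        documents.flatMap { case (title, tokens) =>
          meanOf(tokens, vectors).map(m => title -> cosine(m, q)).toList
        }.sortBy(p => -p._2)
    }
}
```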

Visual: Aboutness
Query Average
Document Average

Aboutness: Results
Issues Related to the Teaching of Creative Writing: 0.43
Autobiography: 0.41
Contemporary Indian Writers: The Search for Creativity: 0.41
Marjorie Welish: Lecture: 0.40
History and Literature: The State of Play: A Roundtable Discussion: 0.40
Critical Reading of Great Writers: Albert Camus: 0.40
Daniel Schwarz: In Defense of Reading: 0.39
The Journey To The West by Professor Anthony C. Yu: 0.39
Blogs, Twitter, the Kindle: The Future of Reading: 0.39

Word2Vec + Overlapping Search Terms
Python, Programming vs Art, Hiking
  terms.map((term1) =>
    terms.map((term2) => (term1, term2))
  ).flatten
    .filter((tuple) => tuple._1 < tuple._2)
    .map((tuple) =>
      (tuple._1, tuple._2, w2v.model.get.similarity(tuple._1, tuple._2))
    )

Visual: Overlapping Search Terms
Query Term 1
Query Term 2

Word2Vec + Overlapping Search Terms
programming <--> python: 0.61
art <--> hiking: 0.10
Python, Programming -> (python AND programming)
Hiking, Art -> (hiking OR art)
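
One way to turn those pairwise scores into a query is to AND the terms when every pair is similar enough and OR them otherwise. The sketch below does that; the 0.3 cut-off used in the usage note is an arbitrary assumption for illustration, and `slideSimilarity` simply replays the two scores from the slide instead of calling a model.

```scala
object OverlappingTerms {
  // Similarities reported on the slide, keyed by alphabetically ordered pair.
  val slideScores: Map[(String, String), Double] = Map(
    ("programming", "python") -> 0.61,
    ("art", "hiking") -> 0.10
  )

  def slideSimilarity(a: String, b: String): Double =
    slideScores.getOrElse((a, b), 0.0)

  // Every unordered pair of distinct terms, scored by the supplied function.
  def pairs(
      terms: List[String],
      similarity: (String, String) => Double): List[(String, String, Double)] =
    for { a <- terms; b <- terms; if a < b } yield (a, b, similarity(a, b))

  // AND the terms when all pairs clear the threshold, otherwise OR them.
  def combine(
      terms: List[String],
      similarity: (String, String) => Double,
      threshold: Double): String = {
    val scored = pairs(terms, similarity)
    val related = scored.nonEmpty && scored.forall(_._3 >= threshold)
    terms.mkString("(", if (related) " AND " else " OR ", ")")
  }
}
```

With a threshold of 0.3, `combine(List("python", "programming"), slideSimilarity, 0.3)` yields an AND query, while the hiking/art pair falls back to OR.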

Topic Diversity
Writing
• A Conversation with David Gerrold, Writer of Star Trek: The Trouble with Tribbles - Teletalk (58 minutes)
• Star Trek: Science Fiction to Science Fact - STEM in 30 (28 minutes)
Python
• Pythons Positive Press Pumps Pandas
• Why is Python Growing So Quickly? - Stack Overflow Blog
• Python explosion blamed on pandas

Visual: Topic Diversity
Document 1 - Average
Document 2 - Average

Pick one, find the least related (Python + Pandas)
Python explosion blamed on pandas: 1.0
Considering Python's Target Audience: 0.97
Animated routes with QGIS and Python: 0.97
I can't get some SQL to commit reading data from a database: 0.97
Using Python to build an AI Twitter bot people trust: 0.96
Getting a Job as a Self-Taught Python Developer: 0.96
Download and Process DEMs in Python: 0.96
How to mine newsfeed data and extract interactive insights in Python: 0.94
Differential Equation Solver In MATLAB, R, Julia, Python, C, Mathematica, Maple, and Fortran: 0.86
My personal data science toolbox written in Python: 0.75
1 min 30 seconds @ 16 parallel threads
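
"Pick one, find the least related" can be read as a greedy selection over document mean vectors. The sketch below is one possible interpretation, not necessarily the exact procedure used for the numbers above: start from the first document, then repeatedly add the candidate whose highest similarity to anything already picked is lowest.

```scala
object TopicDiversity {
  type Vec = Vector[Double]

  def cosine(a: Vec, b: Vec): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norms = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    if (norms == 0.0) 0.0 else dot / norms
  }

  // docs are (title, mean vector); returns up to n titles, chosen greedily.
  def diversify(docs: List[(String, Vec)], n: Int): List[String] = {
    @annotation.tailrec
    def loop(picked: List[(String, Vec)], rest: List[(String, Vec)]): List[(String, Vec)] =
      if (picked.size >= n || rest.isEmpty) picked
      else {
        // The candidate least similar to its closest already-picked document.
        val next = rest.minBy { case (_, v) => picked.map(p => cosine(p._2, v)).max }
        loop(picked :+ next, rest.filterNot(_ == next))
      }

    docs match {
      case head :: tail if n > 0 => loop(List(head), tail).map(_._1)
      case _ => Nil
    }
  }
}
```

Each round compares every remaining candidate against every picked document, which is affordable when only the top results of a search are being re-shuffled.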

Technique: Summary
• Get top X results, re-shuffle
• More computing resources + data -> higher relevance

Where Word2Vec Works
• Synonym generation
• Improve recall
• Search suggestions
• Incorporate secondary dataset (e.g. for enterprise search, privacy)

Why Scala?
• Ecosystem: Lucene, Spark
• Dependency Management

Performance
• Models take 1-2 weeks to train
• Some of the computations take minutes, which would not work in a search engine
• Changes:
  • Pre-compute tokens (e.g. use Lucene)
  • Pre-compute averages (don't naturally store in Lucene)
  • Hazelcast

How do you tell if your code is running on a GPU (Spark + Deeplearning4j)
• 15:17:27,828 INFO ~ Loaded [CpuBackend] backend
• 15:17:28,008 INFO ~ Number of threads used for NativeOps: 4
• 15:17:29,182 INFO ~ Number of threads used for BLAS: 4
• 15:17:29,185 INFO ~ Backend used: [CPU]; OS: [Windows 10]
• 15:17:29,185 INFO ~ Cores: [8]; Memory: [3.6GB];
• 15:17:29,185 INFO ~ Blas vendor: [MKL]
• 15:17:34,546 INFO ~ Using Spark Local

CUDA
• Switch between CPU and GPU by changing sbt configuration:
• Threading resources. Execution pipelines on host systems can support a limited number of concurrent threads. Servers that have four hex-core processors today can run only 24 threads concurrently (or 48 if the CPUs support HyperThreading.) By comparison, the smallest executable unit of parallelism on a CUDA device comprises 32 threads (termed a warp of threads). Modern NVIDIA GPUs can support up to 1536 active threads concurrently per multiprocessor (see Section F.1 of the CUDA C Programming Guide). On GPUs with 16 multiprocessors, this leads to more than 24,000 concurrently active threads.
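
The sbt configuration itself is not shown in the transcript. As a hedged sketch, the two Nd4j dependencies listed earlier in the deck could be toggled with a system property; the `gpu` property name and the version number below are assumptions made for this example, not taken from the talk.

```scala
// build.sbt (sketch)
val nd4jVersion = "0.9.1" // assumed version; use whichever release the project targets

// Hypothetical switch: `sbt -Dgpu=true run` pulls in the CUDA backend instead of the native one.
val useGpu = sys.props.get("gpu").contains("true")

libraryDependencies += {
  if (useGpu) "org.nd4j" % "nd4j-cuda-8.0-platform" % nd4jVersion
  else "org.nd4j" % "nd4j-native" % nd4jVersion
}
```

Only one backend should be on the classpath at a time, which is why the sketch selects a single artifact rather than adding both.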

Hazelcast
• Just videos: 241.8 minutes
• Nothing cached, but hazelcast: 76 minutes
• On query combos: 234 minutes
• Adding Hazelcast on queries: 62.091
• After all cached: 2.38
• Move word2vec model from spinner to SSD:

jCuda
  import jcuda.driver.{CUcontext, CUdevice}
  import jcuda.driver.JCudaDriver._

  // Returns (total, free) device memory in bytes
  def memory: (Long, Long) = {
    cuInit(0)
    val device = new CUdevice
    cuDeviceGet(device, deviceId)

    val total = Array(0L)
    val free = Array(0L)

    val context = new CUcontext
    cuCtxCreate(context, 0, device)
    cuMemGetInfo(free, total)
    cuCtxDestroy(context)

    (total(0), free(0))
  }
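
Tokenize: Lucene
  import java.io.StringReader
  import java.util
  import org.apache.lucene.analysis.{Analyzer, TokenStream}
  import org.apache.lucene.analysis.standard.StandardAnalyzer
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute
  import scala.collection.JavaConverters._

  def getTokens(text: String): List[String] = {
    val result = new util.ArrayList[String]()
    val analyzer: Analyzer = new StandardAnalyzer()
    val stream: TokenStream = analyzer.tokenStream(null, new StringReader(text))
    stream.reset()
    while (stream.incrementToken) {
      result.add(stream.getAttribute(classOf[CharTermAttribute]).toString())
    }
    // Lucene expects end() and close() once the stream is consumed
    stream.end()
    stream.close()
    result.asScala.toList
  }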

Other Lessons
- Inventing your own math does not work
  - High-dimensional "objects" do not follow your intuition like 2D/3D
  - Floating point math not associative
- Math in papers is untyped
  - "Distance" between two vectors: cosine, euclidean, manhattan?
  - vs. Probability curves
  - Unlike Physics (types naturally compose, kg⋅m²⋅s⁻²)
- Follow a paper
  - Nearly impossible to test on your own
  - Almost no one publishes code

Next Idea...

CUDA Surprises
• High end GPUs don't do video
• A ton of people are using these for bitcoin mining (see local craigslist)
• CUDA uses a lot of CPU
• Floating-Point Math Is Not Associative
• "...the peak theoretical memory bandwidth of the NVIDIA Tesla M2090 is 177.6 GB/sec: (1.85 × 10^9 × (384/8) × 2) / 10^9 = 177.6 GB/sec"
• "...the peak theoretical bandwidth between host memory and device memory (8 GB/s on the PCIe ×16 Gen2)."
• "...if, switch, do, for, while significantly affect throughput... The different execution paths must be serialized, since all threads of a warp share a program counter; this increases the total number of instructions executed for this warp"
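
Two of the points above are easy to check on the JVM. The sketch below shows that regrouping a floating-point sum changes the result, and reproduces the bandwidth arithmetic from the quoted figure; the function and parameter names are invented for the example.

```scala
object CudaSurprises {
  // Same three numbers, different grouping, different result.
  val leftToRight: Double = (0.1 + 0.2) + 0.3
  val rightToLeft: Double = 0.1 + (0.2 + 0.3)

  // Peak theoretical bandwidth in GB/sec:
  // clock rate (Hz) * bus width in bytes * transfers per clock / 10^9.
  def peakBandwidthGBs(clockHz: Double, busWidthBits: Int, transfersPerClock: Int): Double =
    clockHz * (busWidthBits / 8.0) * transfersPerClock / 1e9
}
```

This matters on a GPU because a parallel reduction adds values in a different order than a sequential loop, so CPU and GPU runs of the same computation can differ in the last digits. `peakBandwidthGBs(1.85e9, 384, 2)` gives approximately 177.6, the Tesla M2090 figure.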

Resources
• "Relevant Search"
• "Deep Learning: A Practitioner's Approach"
• Deeplearning4j
• Gensim
• https://github.com/DiceTechJobs/ConceptualSearch
• https://www.reddit.com/r/datasets/comments/3mg812/full_reddit_submission_corpus_now_available_2006/

FindLectures.com
Weekly Emails with Lunch and Learn Suggestions
http://findlectures.com/emails

Next installment:
Java Users Group in February 2018
"GPU Programming for Java Developers"

Contact:
@garysieling @[email protected]
https://www.findlectures.com
https://www.garysieling.com
https://github.com/garysieling/