Exploring Word2vec in Scala
Gary Sieling, @garysieling
Wingspan, an IQVIA Company
Jan 11, 2018, PHASE


FindLectures.com: A case study on natural language search

• Demo
• Crawling
• Search Use Cases
• Machine Learning

Goals

• Using machine learning on text
• Practical examples of Word2Vec in Scala
• Show uses of CUDA

Agenda

• Proof of Concept: Email alerts
• Concept Search
• CUDA
• Demo
• Crawling
• Search Use Cases
• Machine Learning

Papers

An empirical study of semantic similarity in WordNet and Word2Vec
http://scholarworks.uno.edu/cgi/viewcontent.cgi?article=3003&context=td

A Dual Embedding Space Model for Document Ranking
https://arxiv.org/pdf/1602.01137v1.pdf


Email Alerts

Concept Search

• Writing, NOT Code
• Excludes “writing css”, “writing php”
• Implies “poetry”, “fiction”, “copyediting”

Concept Search

• Recipes, Vegetarian Food
• NOT Dairy
• All three might include “vegan cooking”
• Implies no milk, cheese

Requirements

• Talks “about” the chosen topic
• Incorporate meaning – “Scala” + “Machine Learning” -> Dl4j
• Maybe a concept hierarchy
• Don’t combine meaning if nothing in common (hiking, art)
• Don’t send duplicate talks / articles (e.g. announcement from different publications)
• Choose a wide variety of talks (not 5 on type systems, etc.)
• Bonus points for “negative” meanings (scala, but not monads)

This is a “search” problem

• Tokenize text
• Maybe mark known “entities”
• Filter / de-emphasize common terms / meanings
• Find the terms we should have searched for
• Search for those terms
• Re-rank / filter results
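
The first few steps of that list can be sketched in plain Scala. This is only an illustration: the stop list, the `related` lookup table, and all names here are assumptions made for the example, not code from the talk (which uses Lucene for tokenizing and word2vec for finding related terms).

```scala
object SearchPipelineSketch {
  // Hypothetical stop list; a real system would derive this from corpus statistics.
  val commonTerms: Set[String] =
    Set("a", "an", "and", "the", "of", "to", "in", "on", "for", "with")

  // Tokenize: lowercase, then split on anything that is not a letter or digit.
  def tokenize(text: String): List[String] =
    text.toLowerCase.split("[^a-z0-9]+").filter(_.nonEmpty).toList

  // Filter: drop common terms so they do not dominate the query.
  def filterCommon(tokens: List[String]): List[String] =
    tokens.filterNot(commonTerms.contains)

  // Expand: add "the terms we should have searched for" from a hypothetical lookup table.
  def expand(tokens: List[String], related: Map[String, List[String]]): List[String] =
    tokens.flatMap(t => t :: related.getOrElse(t, Nil)).distinct
}
```

Searching and re-ranking would then run on the expanded term list.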

Solution: Word2Vec

https://github.com/idio/wiki2vec

Terms in context: Political Coding
http://findlectures.com/?q=liberation

Terms in context: Context definitions
http://findlectures.com/?q=quaker

Training Vectors

Was raised a Quaker
[“was”, “raised”, “a”, “religious”, “since”, “the”, “whose”, “patience”]
[1, 1, 1, 0, 0, 0, 0, 0]

The Quaker whose patience was
[“was”, “raised”, “a”, “religious”, “since”, “the”, “whose”, “patience”]
[1, 0, 0, 0, 0, 1, 1, 1]
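
A minimal sketch of how the two vectors above can be produced, assuming the whole sentence counts as the context window and that the target term itself is excluded. The object and method names are made up for this example.

```scala
object ContextVectors {
  // Vocabulary from the slide; each word's position is its index in the vector.
  val vocabulary: Vector[String] =
    Vector("was", "raised", "a", "religious", "since", "the", "whose", "patience")

  // 1 if the vocabulary word appears in the sentence alongside the target term, else 0.
  def encode(sentence: String, target: String): Vector[Int] = {
    val context = sentence.toLowerCase.split("\\s+").filter(_ != target.toLowerCase).toSet
    vocabulary.map(w => if (context.contains(w)) 1 else 0)
  }
}
```

Calling `ContextVectors.encode("Was raised a Quaker", "Quaker")` gives the first vector on the slide, and the second sentence gives the second.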

Word2Vec Output

P(Term | Context)

Or

P(Context | Term)

Example: Vector Addition

Gloria Steinem - Person + Ideology ~=
1. Marxist Feminism
2. Radical Feminism
3. Feminist Movement
4. Feminist Theory
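
The arithmetic behind an example like this can be sketched with toy vectors. Everything below is an assumption for illustration: real word2vec vectors have around a hundred dimensions or more, and the talk's model lookup is replaced here by a plain `Map` of candidates.

```scala
object VectorArithmetic {
  type Vec = Vector[Double]

  def add(a: Vec, b: Vec): Vec = a.zip(b).map { case (x, y) => x + y }
  def sub(a: Vec, b: Vec): Vec = a.zip(b).map { case (x, y) => x - y }

  // Cosine similarity; 0.0 when either vector has zero length.
  def cosine(a: Vec, b: Vec): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norms = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    if (norms == 0.0) 0.0 else dot / norms
  }

  // Rank candidates by cosine similarity to (start - minus + plus), best first.
  def analogy(start: Vec, minus: Vec, plus: Vec,
              candidates: Map[String, Vec], n: Int): List[String] = {
    val target = add(sub(start, minus), plus)
    candidates.toList.sortBy { case (_, v) => -cosine(target, v) }.take(n).map(_._1)
  }
}
```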

Suggested Search

Example: Data Format

{
  "word": "zulus",
  "count": 30,
  "syn0": [
    -0.064, 0.118, 0.031, 0.163, 0.019, 0.197, 0.097, -0.139, -0.055, 0.155,
    -0.033, -0.252, -0.029, 0.119, 0.007, -0.017, 0.187, 0.017, 0.058, -0.097,
    -0.255, -0.159, -0.053, -0.090, -0.118, 0.119, 0.068, 0.025, 0.160, -0.035,
    -0.216, 0.065, 0.017, 0.038, -0.068, 0.101, 0.090, 0.089, -0.023, 0.265,
    -0.161, -0.178, -0.362, 0.016, 0.226, -0.070, -0.079, 0.040, 0.368, -0.150
  ],
  "syn1": [
    0.312, 0.379, 0.168, -0.371, -0.094, 0.218, -0.022, -0.051, 0.003, -0.010,
    0.233, -0.005, -0.037, 0.105, 0.025, -0.040, -0.127, 0.201, 0.175, 0.277,
    0.185, -0.219, -0.504, -0.187, 0.069, 0.041, 0.237, -0.245, 0.067, -0.186,
    0.127, 0.235, -0.262, -0.020, -0.152, 0.007, -0.346, 0.008, -0.173, -0.267,
    -0.049, 0.051, 0.087, 0.046, -0.059, 0.147, 0.024, 0.032, -0.403, 0.019
  ]
}

Example: Similarity

Number from [0, 1] (cosine similarity can in general be as low as -1)

Image credit: https://engineering.aweber.com/cosine-similarity/

Operation 1: “Similarity”

    import org.nd4j.linalg.api.ndarray.INDArray
    import org.nd4j.linalg.ops.transforms.Transforms

    // Cosine similarity of two word (or document) vectors.
    def cosineSimilarity(a: INDArray, b: INDArray): Double = {
      Transforms.cosineSim(a, b)
    }
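
For reference, here is what a cosine similarity computes, written against plain arrays so it runs without ND4J. This is a sketch of the standard formula, not the library's implementation; the zero-vector handling is a choice made for this example.

```scala
object Cosine {
  // cos(a, b) = (a . b) / (|a| * |b|); returns 0.0 if either vector has zero length.
  def cosineSimilarity(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same dimension")
    val dot = a.indices.map(i => a(i) * b(i)).sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
  }
}
```

Vectors pointing the same way score 1, perpendicular vectors score 0, and opposite vectors score -1.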

INDArray

- Similar to numpy array
- Implementation depends on dependency:

    libraryDependencies += "org.nd4j" % "nd4j-cuda-8.0-platform" % nd4jVersion

    libraryDependencies += "org.nd4j" % "nd4j-native" % nd4jVersion

CUDA

• Specialized instruction set in video cards / GPUs
• Requires NVIDIA SDK and a recent card ($100 - $xx,xxx)
• Available on AWS
• Deeplearning4j: JVM libraries for machine learning
• Nd4j / nd4s: matrix algebra on large arrays

CUDA: example C code

    __global__ void coalescedMultiply(float *a, float *c, int M)
    {
        __shared__ float aTile[TILE_DIM][TILE_DIM],
                         transposedTile[TILE_DIM][TILE_DIM];
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        float sum = 0.0f;
        aTile[threadIdx.y][threadIdx.x] = a[row*TILE_DIM + threadIdx.x];
        transposedTile[threadIdx.x][threadIdx.y] =
            a[(blockIdx.x*blockDim.x + threadIdx.y)*TILE_DIM +
              threadIdx.x];
        __syncthreads();
        for (int i = 0; i < TILE_DIM; i++)
            sum += aTile[threadIdx.y][i] * transposedTile[i][threadIdx.x];
        c[row*M + col] = sum;
    }

Ways to obtain GPUs

• Buying
• Renting
• AWS ($0.90/hr)

Name         GPUs  vCPUs  RAM (GiB)  Network Bandwidth  Price/Hour*  RI Price/Hour**
p2.xlarge    1     4      61         High               $0.900       $0.425
p2.8xlarge   8     32     488        10 Gbps            $7.200       $3.400
p2.16xlarge  16    64     732        20 Gbps            $14.400      $6.800

Training Word2Vec

    import org.deeplearning4j.models.word2vec.Word2Vec

    val vec = new Word2Vec.Builder()
      .minWordFrequency(5)        // ignore words seen fewer than 5 times
      .iterations(1)
      .layerSize(100)             // dimensions of each word vector
      .seed(42)
      .windowSize(5)              // context words considered around each term
      .iterate(sentenceIterator)
      .tokenizerFactory(tokenizer)
      .build()

    vec.fit()

How do you tell if your code is running - GPU

How does this affect word2vec

• Dl4j Demo project: 72 minutes (CPU)
• Dl4j Demo project: 41 minutes (GPU)

Most Similar…

Defining ops we can use – should this be sooner?

Operation 2: Compute a document mean

    import scala.collection.JavaConverters._

    def getWordVectorsMean(tokens: List[String]): INDArray = {
      // Keep only tokens the model has a vector for.
      val words = tokens.filter(
        model.getWordVector(_) != null
      ).sorted

      model.getWordVectorsMean(
        words.asJavaCollection
      )
    }
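
The same idea without the model object, as a rough sketch: average the vectors of the tokens that have one. The `vectors` lookup table is a stand-in for the trained model, and returning `None` for a document with no known tokens is a choice made for this example.

```scala
object DocumentMean {
  // Average the vectors of the tokens found in the (hypothetical) lookup table.
  // Returns None when no token has a vector, rather than dividing by zero.
  def mean(tokens: List[String], vectors: Map[String, Vector[Double]]): Option[Vector[Double]] = {
    val known = tokens.flatMap(t => vectors.get(t))
    if (known.isEmpty) None
    else {
      val summed = known.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
      Some(summed.map(_ / known.size))
    }
  }
}
```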

Nd4s / Nd4j

- Everything is one long array, with dimensions (like numpy)
- Create one with a big iterator
- Easy to reshape
- Parallelism – min 32 cores, all following same path

Problem: Suggestions

By the next search?

Problem: Noise

Nd4s – Make an array

    // Three "modes" laid end to end: word vectors, term frequencies, document frequencies.
    // Each frequency is repeated widthOfWordVector times so all three modes line up.
    val data: Seq[Double] =
      Seq(
        words.flatMap(
          (w) => wordVectors(w)
        ),
        words.flatMap(
          (w) => Seq.fill(widthOfWordVector)(termFrequencies(w).toDouble)
        ),
        words.flatMap(
          (w) => Seq.fill(widthOfWordVector)(documentFrequencies(w).toDouble)
        )
      ).flatten

Nd4s – Computation of TF*IDF average

    val modeVectors = arr.reshape(modes, widthOfWordVector * numWords)

    val scores = modeVectors(0 -> 1)
    val tf = modeVectors(1 -> 2)
    val df = modeVectors(2 -> 3)

    val weighted = scores * tf / df

    val wordVects = weighted.reshape(numWords, widthOfWordVector)

    // this is the weighted average
    wordVects.sum(0) / numWords

    // TODO is this any better?
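
The same weighting can be written out with ordinary collections, which makes the arithmetic easier to check by hand. This is a sketch only: the maps stand in for the model and the frequency tables, and it mirrors the `scores * tf / df` expression above rather than any particular TF-IDF variant.

```scala
object WeightedMean {
  // Weight each word vector by tf / df, sum the results, and divide by the number of words.
  def tfIdfMean(
      words: List[String],
      vectors: Map[String, Vector[Double]],
      tf: Map[String, Int],
      df: Map[String, Int]): Vector[Double] = {
    require(words.nonEmpty, "need at least one word")
    val weighted = words.map { w =>
      val weight = tf(w).toDouble / df(w).toDouble
      vectors(w).map(_ * weight)
    }
    weighted.reduce((a, b) => a.zip(b).map { case (x, y) => x + y }).map(_ / words.size)
  }
}
```

A word that is frequent in the document but rare in the corpus pulls the average toward its own vector, while a word that appears everywhere is damped.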

"Synonym" Discovery Example

[Figure: cosine similarity diagram with vectors labeled "Code" and "Coat"]
Image credit: https://engineering.aweber.com/cosine-similarity/

Word2Vec – Build a Full Text Query

    List("python", "machine", "learning").map(
      (queryTerm) =>
        "(" +
          model.wordsNearest(
            List(queryTerm), // positive terms
            List(),          // negative terms
            25
          ).map(
            // boost each near word by its similarity to the query term
            (nearWord) =>
              "transcript:" + nearWord + "^" + model.similarity(queryTerm, nearWord)
          ).mkString(" OR ") + ")"
    ).mkString(" AND ")
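
A standalone sketch of the same query-building step, with the nearest-neighbour lookup passed in as data so it runs without a trained model. The `neighbours` map, the two-decimal boost format, and the decision to always include the query term itself are assumptions for this example.

```scala
object QueryBuilder {
  // neighbours: for each query term, (near word, similarity) pairs. In the talk these
  // come from the word2vec model; here they are passed in so the sketch runs standalone.
  def build(terms: List[String],
            neighbours: Map[String, List[(String, Double)]],
            field: String): String =
    terms.map { term =>
      val clauses = neighbours.getOrElse(term, Nil).map { case (word, similarity) =>
        // Locale.ROOT keeps the decimal point a "." regardless of system locale.
        s"$field:$word^" + "%.2f".formatLocal(java.util.Locale.ROOT, similarity)
      }
      // Always include the query term itself so a term without neighbours still matches.
      (s"$field:$term" :: clauses).mkString("(", " OR ", ")")
    }.mkString(" AND ")
}
```

Each query term becomes one parenthesized OR group, and the groups are ANDed, so a document has to match something related to every term.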

Visual – Nearest terms

[Figure: cosine similarity diagram with vectors labeled "Query Term" and "Top N closest"]
Image credit: https://engineering.aweber.com/cosine-similarity/

Example – Query (“Python + Machine Learning”)

title_s:python^10 OR title_s:"machine learning"^10 …
(title_s:software^1.21 OR title_s:database^1.20 OR title_s:format^1.18
 title_s:applications^1.14 OR title_s:browser^1.14 OR title_s:setup^1.13
 title_s:bootstrap^1.13 OR title_s:in-class^1.13 OR title_s:campesina^1.12 OR
 title_s:excel^1.12 OR title_s:hardware^1.11 OR title_s:programming^1.11 OR
 title_s:api^1.11 OR title_s:prototype^1.11 OR title_s:middleware^1.11 OR
 title_s:openstreetmap^1.10 OR title_s:product^1.10 OR title_s:app^1.09 OR
 title_s:hbp^1.09 OR title_s:programmers^1.09 OR title_s:application^1.09 OR
 title_s:databases^1.09 OR title_s:idiomatic^1.09 OR title_s:spreadsheet^1.09 OR
 title_s:java^1.09 …
AND (…)

Results (Python + Machine Learning + BM25)

Python for Data Analysis
How To Get Started With Machine Learning? | Two Minute Papers
The /r/playrust Classifier: Real World Rust Data Science
Andreas Mueller - Commodity Machine Learning
A Gentle Introduction To Machine Learning
A full Machine learning pipeline in Scikit-learn vs in scala-Spark
Hello World - Machine Learning Recipes #1
Visual diagnostics for more informed machine learning
Lab to Factory: Robust Machine Learning Systems
Machine Learning with Scala on Spark by Jose Quesada

Word2Vec – “Writing”

Issues Related to the Teaching of Creative Writing
Is Nonfiction Literature?
"Oh, you liar, you storyteller": On Fibbing, Fact and Fabulation
The Value of the Essay in the 21st Century
Rewriting Rereading Rethinking – Web Design in Words
Aspen New York Book Series: The Art of the Memoir
Cheryl Strayed: "Wild"
Siri Hustvedt in Conversation with Paul Auster
Mary Karr: The 2016 Diana and Simon Raab Writer-in-Residence
History, Memory, and the Novel

Aboutness

Re-sorting top 100 documents

    val queryMean = model.getWordVectorsMean(List("writing"))
    val mean = model.getWordVectorsMean(NLP.getWords(document._1))
    val distance = Transforms.cosineSim(mean, queryMean)

5 min 45 seconds @ 16 parallel threads
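
The re-sorting step can be sketched as follows, assuming each document's mean vector has already been computed. The names and the toy two-dimensional vectors in the usage note are illustrative only.

```scala
object Aboutness {
  type Vec = Vector[Double]

  // Cosine similarity; 0.0 when either vector has zero length.
  def cosine(a: Vec, b: Vec): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norms = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    if (norms == 0.0) 0.0 else dot / norms
  }

  // Score each (title, document mean) pair against the query mean; highest score first.
  def rerank(queryMean: Vec, docs: List[(String, Vec)]): List[(String, Double)] =
    docs.map { case (title, mean) => (title, cosine(mean, queryMean)) }
      .sortBy { case (_, score) => -score }
}
```

For example, with a query mean of `Vector(1.0, 0.0)`, a document at `Vector(2.0, 0.0)` sorts ahead of one at `Vector(0.0, 1.0)`, because cosine similarity ignores vector length and only compares direction.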

Visual – Aboutness

[Figure: cosine similarity diagram with vectors labeled "Query Average" and "Document Average"]
Image credit: https://engineering.aweber.com/cosine-similarity/

Aboutness - Results

Issues Related to the Teaching of Creative Writing: 0.43
Autobiography: 0.41
Contemporary Indian Writers: The Search for Creativity: 0.41
Marjorie Welish: Lecture: 0.40
History and Literature: The State of Play: A Roundtable Discussion: 0.40
Critical Reading of Great Writers: Albert Camus: 0.40
Daniel Schwarz: In Defense of Reading: 0.39
The Journey To The West by Professor Anthony C. Yu: 0.39
Blogs, Twitter, the Kindle: The Future of Reading: 0.39

Word2Vec + Overlapping Search Terms

Python, Programming vs Art, Hiking

    terms.map(
      (term1) => terms.map((term2) => (term1, term2))
    ).flatten.filter(
      (tuple) => tuple._1 < tuple._2
    ).map(
      (tuple) => (tuple._1, tuple._2, w2v.model.get.similarity(tuple._1, tuple._2))
    )

Visual – Overlapping Search Terms

[Figure: cosine similarity diagram with vectors labeled "Query Term 1" and "Query Term 2"]
Image credit: https://engineering.aweber.com/cosine-similarity/

Word2Vec + Overlapping Search Terms

Python, Programming
programming <--> python: 0.61
(python AND programming)

Hiking, Art
art <--> hiking: 0.10
(hiking OR art)
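
One way to turn those similarity scores into a query operator is a simple cut-off. The threshold below is a made-up value for illustration; the talk gives the two scores but does not state where the line was drawn.

```scala
object CombineTerms {
  // Hypothetical cut-off: pairs at or above it are treated as one topic (AND),
  // pairs below it as unrelated interests (OR). The talk does not state a value.
  val threshold = 0.3

  def combine(term1: String, term2: String, similarity: Double): String = {
    val operator = if (similarity >= threshold) "AND" else "OR"
    s"($term1 $operator $term2)"
  }
}
```

With the scores from the slide, `combine("python", "programming", 0.61)` yields `(python AND programming)` and `combine("hiking", "art", 0.10)` yields `(hiking OR art)`.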

Topic Diversity

Writing
A Conversation with David Gerrold, Writer of Star Trek: The Trouble with Tribbles - Teletalk (58 minutes)
Star Trek: Science Fiction to Science Fact - STEM in 30 (28 minutes)

Python
Pythons Positive Press Pumps Pandas
Why is Python Growing So Quickly? - Stack Overflow Blog
Python explosion blamed on pandas

Visual – Topic Diversity

[Figure: cosine similarity diagram with vectors labeled "Document 1 - Average" and "Document 2 - Average"]
Image credit: https://engineering.aweber.com/cosine-similarity/

Pick one, find the least related (Python + Pandas)

Python explosion blamed on pandas: 1.0
Considering Python's Target Audience: 0.97
Animated routes with QGIS and Python: 0.97
I can't get some SQL to commit reading data from a database: 0.97
Using Python to build an AI Twitter bot people trust: 0.96
Getting a Job as a Self-Taught Python Developer: 0.96
Download and Process DEMs in Python: 0.96
How to mine newsfeed data and extract interactive insights in Python: 0.94
Differential Equation Solver In MATLAB, R, Julia, Python, C, Mathematica, Maple, and Fortran: 0.86
My personal data science toolbox written in Python: 0.75

1 min 30 seconds @ 16 parallel threads
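
A small sketch of the "pick one, find the least related" step, again assuming document mean vectors are already available. The names are illustrative, and a real pipeline would presumably repeat this to build up a varied set rather than pick a single item.

```scala
object TopicDiversity {
  type Vec = Vector[Double]

  // Cosine similarity; 0.0 when either vector has zero length.
  def cosine(a: Vec, b: Vec): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norms = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    if (norms == 0.0) 0.0 else dot / norms
  }

  // Title of the document whose mean vector is least similar to the chosen one.
  def leastRelated(chosen: Vec, docs: List[(String, Vec)]): Option[String] =
    if (docs.isEmpty) None
    else Some(docs.minBy { case (_, v) => cosine(chosen, v) }._1)
}
```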

Technique - Summary

• Get top X results, re-shuffle
• More computing resources + data -> higher relevance

Where Word2Vec Works

• Synonym generation
• Improve recall
• Search suggestions
• Incorporate secondary dataset (e.g. for enterprise search, privacy)

Why Scala?

• Ecosystem: Lucene, Spark
• Dependency Management

Performance

• Models take 1-2 weeks to train
• Some of the computations take minutes, which would not work in a search engine
• Changes:
    • Pre-compute tokens (e.g. use Lucene)
    • Pre-compute averages (don’t naturally store in Lucene)
    • Hazelcast

How do you tell if your code is running on a GPU (Spark + Deeplearning4j)

• 15:17:27,828 INFO ~ Loaded [CpuBackend] backend
• 15:17:28,008 INFO ~ Number of threads used for NativeOps: 4
• 15:17:29,182 INFO ~ Number of threads used for BLAS: 4
• 15:17:29,185 INFO ~ Backend used: [CPU]; OS: [Windows 10]
• 15:17:29,185 INFO ~ Cores: [8]; Memory: [3.6GB];
• 15:17:29,185 INFO ~ Blas vendor: [MKL]
• 15:17:34,546 INFO ~ Using Spark Local

CUDA

• Switch between CPU and GPU by changing sbt configuration:
• Threading resources. Execution pipelines on host systems can support a limited number of concurrent threads. Servers that have four hex-core processors today can run only 24 threads concurrently (or 48 if the CPUs support Hyper-Threading.) By comparison, the smallest executable unit of parallelism on a CUDA device comprises 32 threads (termed a warp of threads). Modern NVIDIA GPUs can support up to 1536 active threads concurrently per multiprocessor (see Section F.1 of the CUDA C Programming Guide). On GPUs with 16 multiprocessors, this leads to more than 24,000 concurrently active threads.

Hazelcast

• Just videos – 241.8 minutes
• Nothing cached, but hazelcast - 76 minutes
• On query combos – 234 minutes
• Adding Hazelcast on queries - 62.091
• After all cached – 2.38
• Move word2vec model from spinner to SSD:

jCuda

    import jcuda.driver.{CUcontext, CUdevice, JCudaDriver}
    import jcuda.driver.JCudaDriver._

    // Returns (total, free) device memory in bytes for the GPU at deviceId.
    def memory = {
      cuInit(0)
      val device = new CUdevice
      JCudaDriver.cuDeviceGet(device, deviceId)

      val total = Array(0L)
      val free = Array(0L)
      cuInit(0)
      cuDeviceGet(device, deviceId)

      // cuMemGetInfo needs an active context on the device.
      val context = new CUcontext
      cuCtxCreate(context, 0, device)
      cuMemGetInfo(free, total)

      cuCtxDestroy(context)

      (total(0), free(0))
    }

Tokenize - Lucene

    def getTokens(text: String): List[String] = {
      val result = new util.ArrayList[String]()
      val analyzer: Analyzer = new StandardAnalyzer()

      val stream: TokenStream = analyzer.tokenStream(null, new StringReader(text))
      stream.reset()

      while (stream.incrementToken) {
        result.add(stream.getAttribute(classOf[CharTermAttribute]).toString())
      }

      import scala.collection.JavaConversions._
      result.toList
    }

Other Lessons

- Inventing your own math does not work
    - High-dimensional “objects” do not follow your intuition like 2D/3D
    - Floating point math not associative
- Math in papers is untyped
    - “Distance” between two vectors – cosine, euclidean, manhattan?
    - vs. Probability curves
    - Unlike Physics (types naturally compose, kg⋅m²⋅s⁻²)
- Follow a paper
    - Nearly impossible to test on your own
    - Almost no one publishes code
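
The point about floating point math not being associative is easy to demonstrate with three doubles. This matters for sums computed in parallel, where the order of additions is not fixed, so a CPU and a GPU run can differ in the last digits.

```scala
object FloatingPoint {
  // Same three numbers, different grouping, different result.
  val left: Double = (0.1 + 0.2) + 0.3   // 0.6000000000000001
  val right: Double = 0.1 + (0.2 + 0.3)  // 0.6
  val associative: Boolean = left == right
}
```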

Next Idea…

CUDA Surprises

• High end GPUs don’t do video
• A ton of people are using these for bitcoin mining (see local craigslist)
• CUDA uses a lot of CPU
• Floating-Point Math Is Not Associative
• “…the peak theoretical memory bandwidth of the NVIDIA Tesla M2090 is 177.6 GB/sec: (1.85 × 10^9 × (384/8) × 2) / 10^9 = 177.6 GB/sec”
• “…the peak theoretical bandwidth between host memory and device memory (8 GB/s on the PCIe ×16 Gen2).”
• “…if, switch, do, for, while significantly affect throughput... The different execution paths must be serialized, since all threads of a warp share a program counter; this increases the total number of instructions executed for this warp”
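
The bandwidth figure in the quote above can be reproduced directly: memory clock times bus width in bytes times two transfers per clock (double data rate), divided by 10^9. The function and parameter names below are made up for this sketch.

```scala
object MemoryBandwidth {
  // Peak bandwidth in GB/s = clock (Hz) * bus width (bytes) * transfers per clock / 10^9.
  def peakGBps(clockHz: Double, busWidthBits: Int, transfersPerClock: Int): Double =
    clockHz * (busWidthBits / 8.0) * transfersPerClock / 1e9
}
```

With the quoted Tesla M2090 numbers (1.85 GHz memory clock, 384-bit bus, 2 transfers per clock), this works out to 177.6 GB/s, roughly 22 times the 8 GB/s quoted for the PCIe link between host and device.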

Resources

• “Relevant Search”
• “Deep Learning – A Practitioner’s Approach”
• Deeplearning4j
• Gensim
• https://github.com/DiceTechJobs/ConceptualSearch
• https://www.reddit.com/r/datasets/comments/3mg812/full_reddit_submission_corpus_now_available_2006/

FindLectures.com

Weekly Emails with Lunch and Learn Suggestions

http://findlectures.com/emails

Next installment:

Java Users Group
In February 2018

“GPU Programming for Java Developers”

Contact: @garysieling, [email protected]

https://www.findlectures.com
https://www.garysieling.com
https://github.com/garysieling/