BI Over Petabytes: Meet Apache Mahout · ororke.com/paul/blog/wp-content/uploads/2009/04/090421...
TRANSCRIPT
BI Over Petabytes: Meet Apache Mahout
Industrial Strength Machine Learning, April 2009
http://lucene.apache.org/mahout/
4/22/09 [email protected]

BI and ML
• Business Intelligence
  – OLAP
  – Analytics
  – Data mining
  – Performance analysis
  – Text mining
  – Predictive analysis
• Machine Learning
  – Classification
  – Clustering
  – Regression
  – Collaborative filtering
  – Evolutionary algorithms

What is Machine Learning?
• “Machine learning is the subfield of artificial intelligence that is concerned with the design and development of algorithms that allow computers to improve their performance over time…” (http://en.wikipedia.org/wiki/Machine_learning)
• Types of ML algorithms
  – Supervised: Using labeled training data, create a function that predicts output for unseen inputs
  – Unsupervised: Using unlabeled data, create a function that can predict output
  – Semi-supervised: Uses labeled and unlabeled data

Where ML is Used Today
• Internet search clustering
• Knowledge management systems
• Social network mapping
• Taxonomy transformations
• Marketing analytics
• Recommendation systems
• Log analysis & event filtering
• SPAM filtering, fraud detection

Current Situation
• Vast amounts of data are now available via the Internet
• Platforms now exist to run computations over large datasets (MapReduce, Hadoop, Dryad)
• Sophisticated analytics are needed to turn data into information people can use
• Active Machine Learning research community and research/proprietary implementations of ML algorithms
• The world needs scalable implementations of ML under open license - ASF

History of Mahout
• Summer 2007
  – Developers needed scalable ML
  – Mailing list formed
• Community formed
  – Apache contributors
  – Academia & industry
  – Lots of initial interest
• Mahout project formed under Apache Lucene
  – January 25, 2008
  – Mahout 0.1 release April, 2009

Who We Are (so far)
Grant Ingersoll
Karl Wettin
Isabel Drost
Ted Dunning
Jeff Eastman
Dawid Weiss
Otis Gospodnetic
Erik Hatcher
Sean Owen
Ozgur Yilmazel

Release 0.1 Code Base
• Matrix & Vector library
  – Memory resident sparse & dense implementations
• Classification
  – Naïve Bayes, Complementary Naïve Bayes
• Clustering
  – Canopy
  – K-Means, fuzzy K-Means
  – Mean Shift
  – Dirichlet Process
• Collaborative Filtering
  – Taste
• Evolutionary Algorithms
  – Watchmaker
• Utilities
  – Distance Measures
  – Parameters
Highly scalable, parallel implementations on the Apache Hadoop platform

Examples: Clustering
• Canopy
  – Single pass (fast approximation) assigns every point to a single cluster
  – Inputs: Distance Measure, T1, T2 canopy values
• Mean Shift
  – Iterative process converges on modes of density distribution
  – Inputs: Distance Measure, T1, T2 values, convergence criteria
• K-Means
  – Iterative process converges on a single, ‘best’ assignment of points to clusters
  – Inputs: Distance Measure, initial clusters, convergence criteria
• Fuzzy K-Means
  – Like K-Means but uses probability density function to weight all points against all clusters
• Dirichlet Process
  – Bayesian: incorporates prior domain knowledge as a mixture of models
  – Iterative process converges on multiple, ‘most likely’ answers
  – Inputs:
    • Number of models, number of iterations to perform
    • Model (parameters, observations, probability density function)
    • Model Distribution (prior, posterior sampling)
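
To make the roles of the two canopy thresholds concrete, here is a minimal sketch of the textbook canopy procedure in plain Java. It is not Mahout's code: the class `CanopySketch`, its nested `Canopy` type, and the Euclidean `distance` helper are hypothetical names chosen for illustration, and it assumes T1 > T2. Unlike the single-assignment behavior described on the slide, this textbook variant lets a point fall inside more than one canopy.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: a textbook canopy pass, not the Mahout implementation.
final class CanopySketch {

    /** A canopy: its center plus every point within the loose threshold T1. */
    static final class Canopy {
        final double[] center;
        final List<double[]> members = new ArrayList<>();

        Canopy(double[] center) {
            this.center = center;
        }
    }

    /** Euclidean distance, standing in for a pluggable distance measure. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Builds canopies; assumes t1 (loose) is greater than t2 (tight). */
    static List<Canopy> buildCanopies(List<double[]> points, double t1, double t2) {
        List<double[]> candidates = new ArrayList<>(points);
        List<Canopy> canopies = new ArrayList<>();
        while (!candidates.isEmpty()) {
            // The next remaining candidate becomes a new canopy center.
            Canopy canopy = new Canopy(candidates.remove(0));
            // Every point within T1 of the center joins the canopy.
            for (double[] p : points) {
                if (distance(canopy.center, p) < t1) {
                    canopy.members.add(p);
                }
            }
            // Points within T2 are too close to start a canopy of their own.
            candidates.removeIf(p -> distance(canopy.center, p) < t2);
            canopies.add(canopy);
        }
        return canopies;
    }

    public static void main(String[] args) {
        List<double[]> points = Arrays.asList(
                new double[] {0.0, 0.0}, new double[] {0.5, 0.0},
                new double[] {10.0, 10.0}, new double[] {10.5, 10.0});
        System.out.println(buildCanopies(points, 3.0, 1.0).size() + " canopies");
    }
}
```

The resulting canopy centers are commonly used as the initial clusters for an iterative method such as K-Means.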

Apache Hadoop
• Uses clusters of (5-10,000) general purpose Linux boxes
• HDFS supports redundant file storage and streaming access in the face of predictable hardware failures
• Map/Reduce API simplifies programming of algorithms that operate over vast datasets
• Hbase offers Google BigTable style of schema-less, temporal database
• PIG offers higher level language for manipulating very large datasets that reduces the need for M/R programming
• Zookeeper is a highly available and reliable coordination system used to synchronize state between applications
• Hive is a data warehouse infrastructure that provides data summarization, ad hoc querying and analysis of datasets
http://hadoop.apache.org
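
The Map/Reduce programming model mentioned above can be illustrated without a cluster. The sketch below runs the three phases (map, group by key, reduce) in memory on a word count. It deliberately avoids the Hadoop API: `MapReduceSketch` and its methods are hypothetical names, and a real job would leave the grouping step, along with distribution and failure handling, to the framework.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: the map/reduce model in memory, with no Hadoop dependency.
final class MapReduceSketch {

    /** Map phase: emit a (word, 1) pair for every word in one input line. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    /** Reduce phase: combine all values that were emitted for one key. */
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    /** Driver: map every line, group pairs by key, then reduce each group. */
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                        .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("big data", "big clusters")));
    }
}
```

Because `map` looks at one record at a time and `reduce` looks at one key at a time, both phases can be spread across many machines, which is what makes the model attractive for vast datasets.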

The Hadoop Iceberg
[Figure: iceberg diagram with the labels Map/Reduce Code, Storage Replication, Process Scheduling, Failure Handling, Data Movement, Disk Management, Network Management, Monitoring] (http://hadoop.apache.org)

Reference Dirichlet Implementation

    private void iterate(int iteration, DirichletState<Observation> state) {
        // create new posterior models
        Model<Observation>[] newModels = modelFactory.sampleFromPosterior(state.getModels());

        // iterate over the samples, assigning each to a model
        for (Observation x : sampleData) {
            // compute normalized vector of probabilities that x is described by each model
            Vector pi = normalizedProbabilities(state, x);
            // then pick one cluster by sampling a Multinomial distribution based upon them
            // see: http://en.wikipedia.org/wiki/Multinomial_distribution
            int k = UncommonDistributions.rMultinom(pi);
            // ask the selected model to observe the datum
            newModels[k].observe(x);
        }

        // update the state from the new models
        state.update(newModels);
    }
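
The step that picks one model index from the probability vector can be sketched on its own. The code below is an assumption about what a single multinomial draw looks like, not the body of `UncommonDistributions.rMultinom`: the class `MultinomialSketch` and its `sample` method are hypothetical, and the input is a plain `double[]` that is assumed to be non-negative and to sum to 1.

```java
import java.util.Random;

// Illustrative only: one draw from a discrete distribution over model indices.
final class MultinomialSketch {

    /** Returns index k with probability pi[k]. */
    static int sample(double[] pi, Random rng) {
        double u = rng.nextDouble(); // uniform in [0, 1)
        double cumulative = 0.0;
        for (int k = 0; k < pi.length; k++) {
            cumulative += pi[k];
            if (u < cumulative) {
                return k;
            }
        }
        // Only reached if rounding leaves the cumulative sum just below 1.
        return pi.length - 1;
    }

    public static void main(String[] args) {
        double[] pi = {0.1, 0.6, 0.3};
        System.out.println("selected model " + sample(pi, new Random()));
    }
}
```

Sampling an index, rather than always taking the most probable model, is what lets each iteration explore alternative assignments.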

Dirichlet Mapper on Hadoop

    public void map(WritableComparable<?> key, Text value,
            OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        // read the next sample point
        Vector sample = DenseVector.decodeFormat(value.toString());
        // compute a vector of probabilities that sample is described by each model
        Vector pi = normalizedProbabilities(state, sample);
        // then pick one model by sampling a Multinomial distribution based upon them
        // see: http://en.wikipedia.org/wiki/Multinomial_distribution
        int k = UncommonDistributions.rMultinom(pi);
        // output value with key of selected model
        output.collect(new Text(String.valueOf(k)), value);
    }
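
The slides do not show the body of `normalizedProbabilities`. One plausible reading, offered here only as an assumption, is that each model's mixture weight is multiplied by that model's probability density at the sample and the results are scaled to sum to 1. The sketch uses one-dimensional Gaussian models for brevity, and every name in it (`NormalizedProbabilitiesSketch`, `gaussianPdf`, the array parameters) is hypothetical.

```java
// Illustrative only: normalized per-model probabilities for a single sample.
final class NormalizedProbabilitiesSketch {

    /** Density of a one-dimensional Gaussian at x. */
    static double gaussianPdf(double x, double mean, double sd) {
        double z = (x - mean) / sd;
        return Math.exp(-0.5 * z * z) / (sd * Math.sqrt(2.0 * Math.PI));
    }

    /** Returns pi, where pi[k] is proportional to weights[k] * pdf_k(x). */
    static double[] normalizedProbabilities(
            double x, double[] means, double[] sds, double[] weights) {
        double[] pi = new double[means.length];
        double total = 0.0;
        for (int k = 0; k < means.length; k++) {
            pi[k] = weights[k] * gaussianPdf(x, means[k], sds[k]);
            total += pi[k];
        }
        for (int k = 0; k < pi.length; k++) {
            // Fall back to a uniform vector if every density underflowed to zero.
            pi[k] = (total > 0.0) ? pi[k] / total : 1.0 / pi.length;
        }
        return pi;
    }

    public static void main(String[] args) {
        double[] pi = normalizedProbabilities(
                1.0, new double[] {0.0, 5.0}, new double[] {1.0, 1.0}, new double[] {0.5, 0.5});
        System.out.println(java.util.Arrays.toString(pi));
    }
}
```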

Dirichlet Reducer on Hadoop

    public void reduce(Text key, Iterator<Text> values,
            OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        // load the model for this set of values
        int k = Integer.parseInt(key.toString());
        Model<Vector> model = newModels[k];
        while (values.hasNext()) {
            Vector v = DenseVector.decodeFormat(values.next().toString());
            // ask the selected model to observe the datum
            model.observe(v);
        }
        // compute & set new model parameters based upon the observations
        model.computeParameters();
        state.clusters.get(k).setModel(model);
        // output the cluster state for the next iteration
        output.collect(key, new Text(state.clusters.get(k).asFormatString()));
    }
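
The reducer relies on a model that can `observe` data and then `computeParameters`. As a rough illustration of that contract, the sketch below accumulates observations and sets its single parameter, the mean, from them. This is a simplification and an assumption: `MeanModelSketch` is a hypothetical class, and a Bayesian model of the kind the talk describes would sample its parameters from a posterior distribution rather than take a plain average.

```java
// Illustrative only: a model that observes points and recomputes its mean.
final class MeanModelSketch {
    private final double[] sum;
    private final double[] mean;
    private int count;

    MeanModelSketch(int dimensions) {
        this.sum = new double[dimensions];
        this.mean = new double[dimensions];
    }

    /** Accumulates one observation; cheap enough to call once per record. */
    void observe(double[] x) {
        for (int i = 0; i < sum.length; i++) {
            sum[i] += x[i];
        }
        count++;
    }

    /** Recomputes the mean from everything observed so far. */
    void computeParameters() {
        if (count == 0) {
            return; // nothing observed: keep the previous parameters
        }
        for (int i = 0; i < sum.length; i++) {
            mean[i] = sum[i] / count;
        }
    }

    /** Returns a copy of the current mean. */
    double[] getMean() {
        return mean.clone();
    }

    public static void main(String[] args) {
        MeanModelSketch model = new MeanModelSketch(2);
        model.observe(new double[] {1.0, 2.0});
        model.observe(new double[] {3.0, 4.0});
        model.computeParameters();
        System.out.println(java.util.Arrays.toString(model.getMean()));
    }
}
```

Keeping only running sums means the model never has to hold the observations themselves, which suits a reducer that streams through its values once.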

Conclusion
• This is just the beginning
• High demand for scalable machine learning
• Contributors are needed who have
  – Interest, enthusiasm & programming ability
  – Test driven development skills
  – Comfort with the scary math (or bravery)
  – Interest and/or proficiency with Hadoop
  – Some large data sets you want to analyze
http://lucene.apache.org/mahout/