Automated Machine Learning (AutoML) and Pentaho Caio Moreno de Souza Pentaho Senior Consultant, Hitachi Vantara

Post on 20-May-2020

Agenda

We will discuss how Automated Machine Learning (AutoML) and Pentaho, together, can help customers save time in the process of creating a model and deploying this model into production.

• Business Case for Automated Machine Learning (AutoML) and Pentaho;

• High-level overview of Automated Machine Learning (AutoML);

• Demonstrations (Pentaho + AutoML).

The Perfect Model Does Not Exist

“All models are wrong, but some are useful.”

– George Box, 1919-2013

Business Case for AutoML and Pentaho

• Finding the correct machine learning algorithm is not an easy task.

• You need to find a balance between the time you would need to spend and the time you can actually spend on the ML problem.

• To create a good model, you will need to know the problem and its variables and instances very well, prepare the data, do feature engineering and test different algorithms.

• Some data scientists will also say to add a little bit of MAGIC.

• Adding, of course, in most cases, a lot of computing power.

Machine Learning High-Level Overview

What is Automated Machine Learning (AutoML)?

Illustration by Shyam Sundar Srinivasan

What is Automated Machine Learning (AutoML)?

“Machine learning is very successful, but its successes crucially rely on human machine learning experts, who select appropriate ML architectures (deep learning architectures or more traditional ML workflows) and their hyperparameters. As the complexity of these tasks is often beyond non-experts, the rapid growth of machine learning applications has created a demand for off-the-shelf machine learning methods that can be used easily and without expert knowledge. We call the resulting research area that targets progressive automation of machine learning AutoML.”

Source: https://sites.google.com/site/automl2016/

Why Automated Machine Learning (AutoML)?

• The demand for machine learning experts has outpaced the supply. To address this gap, there have been big strides in the development of user-friendly machine learning software that can be used by non-experts and experts alike.

• AutoML software can be used for automating a large part of the machine learning workflow, which includes automatic training and tuning of many models within a user-specified time limit.
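
The idea of training and tuning many candidates within a user-specified time limit can be sketched in a few lines of plain Python. This is a toy illustration only, not how any of the AutoML tools is implemented: the candidate family (a one-feature threshold rule), the function names and the data are all assumptions made for the example.

```python
import random
import time


def accuracy(threshold, rows):
    """Share of rows where the rule 'x >= threshold means class 1' is right."""
    hits = sum(1 for x, label in rows if (x >= threshold) == (label == 1))
    return hits / len(rows)


def search_within_budget(rows, time_limit_secs, seed=0):
    """Try random candidate thresholds until the time budget runs out."""
    rng = random.Random(seed)
    low = min(x for x, _ in rows)
    high = max(x for x, _ in rows)
    deadline = time.monotonic() + time_limit_secs
    best_threshold, best_score, tried = None, -1.0, 0
    # Always evaluate at least one candidate, even with a tiny budget.
    while tried == 0 or time.monotonic() < deadline:
        candidate = rng.uniform(low, high)
        score = accuracy(candidate, rows)
        tried += 1
        if score > best_score:
            best_threshold, best_score = candidate, score
    return best_threshold, best_score, tried


# Hypothetical toy data: (feature value, class label).
data = [(0.5, 0), (1.0, 0), (1.5, 0), (2.5, 1), (3.0, 1), (3.5, 1)]
threshold, score, tried = search_within_budget(data, time_limit_secs=0.05)
print(f"best threshold={threshold:.2f} accuracy={score:.2f} after {tried} candidates")
```

A larger time budget lets the loop evaluate more candidates, which is the same trade-off the real tools expose through their time-limit settings.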

What is NOT Automated Machine Learning (AutoML)?

• AutoML is not automated data science;

• AutoML will not replace data scientists;
  – All the methods of automated machine learning are developed to support data scientists, not to replace them.
  – AutoML is meant to free data scientists from the burden of repetitive and time-consuming tasks (e.g., machine learning pipeline design and hyperparameter optimization) so they can better spend their time on tasks that are much more difficult to automate.

AutoML Tools

• AutoWeka (Open Source)
  – http://www.cs.ubc.ca/labs/beta/Projects/autoweka/

• H2O.ai AutoML (Open Source)
  – https://www.h2o.ai/

• TPOT (Open Source)
  – https://github.com/rhiever/tpot

• AutoSklearn (Open Source)
  – https://github.com/automl/auto-sklearn
  – http://automl.github.io/auto-sklearn/stable/

• machineJS (Open Source)
  – https://github.com/ClimbsRocks/machineJS

PDI + AutoML

Machine Learning with Pentaho in 4 Steps

http://www.pentaho.com/blog/4-steps-machine-learning-pentaho

CRISP-DM

http://www.pentaho.com/blog/4-steps-machine-learning-pentaho

[Diagram: the CRISP-DM cycle, with the phases Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation and Deployment arranged around the Data.]

Use Case: AutoML + Pentaho

• Our users have a well-defined ML problem and the initial version of the dataset (train and test).

• Unfortunately, they haven’t created an ML model yet.

• Also, they have no idea how to create it.

• And they want us to help them create it as soon as possible using only open source tools.

The Journey

• If you embark on this journey, you can stay stuck on this problem forever…

…or you can find quick ways to do it in a specified time.

• Customers can then spend enough time later to improve their current model.

• The next steps will be:
  – Hire a data scientist or a team of data scientists;
  – Hire a domain expert in that problem.

Our Goal

• In this specific scenario, our goal will be to help them start the process of creating a dummy model using AutoML.

Create Your First ML Model

1. Define the problem;

2. Analyze and prepare the data;

3. Select algorithms (start simple);

4. Run and evaluate the algorithms;

5. Improve the results with focused experiments;

6. Finalize results with fine tuning.
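
The six steps above can be walked through on a toy problem. The sketch below uses only the Python standard library; the hand-made data, the majority-class baseline and the small k-nearest-neighbours classifier are assumptions chosen to keep the example short, not the algorithms used in the demo.

```python
import random
from collections import Counter

# 1. Define the problem: predict a class label (0 or 1) from two numeric features.
# 2. Prepare the data: hypothetical, hand-made rows of ((feature1, feature2), label).
rows = [((1.0, 1.2), 0), ((0.8, 1.0), 0), ((1.1, 0.9), 0), ((1.3, 1.1), 0),
        ((0.9, 1.4), 0), ((3.0, 3.2), 1), ((2.8, 3.1), 1), ((3.3, 2.9), 1),
        ((3.1, 3.4), 1), ((2.9, 2.7), 1)]
rng = random.Random(42)
rng.shuffle(rows)
train, test = rows[:7], rows[7:]


# 3. Select algorithms, starting simple: a majority-class baseline and k-NN.
def majority_predict(train_rows, _features):
    return Counter(label for _, label in train_rows).most_common(1)[0][0]


def knn_predict(train_rows, features, k=1):
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = sorted(train_rows, key=lambda row: dist(row[0], features))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# 4. Run and evaluate the algorithms on held-out data.
def evaluate(predict, **params):
    hits = sum(1 for features, label in test
               if predict(train, features, **params) == label)
    return hits / len(test)


results = {"majority baseline": evaluate(majority_predict)}
# 5. Focused experiments: vary one hyperparameter at a time.
for k in (1, 3, 5):
    results[f"k-NN (k={k})"] = evaluate(knn_predict, k=k)

# 6. Finalize: keep the best-scoring configuration.
best_name = max(results, key=results.get)
print(results, "->", best_name)
```

Keeping a trivial baseline in the comparison makes it obvious whether a more complex candidate is actually adding anything.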

Sample Dataset

• More data is better, but more data means more complexity.

• More data means more time that you will have to spend on your problem.

• Why not create a sample dataset?!
  – Create 1 to 20 sample datasets to test your problem and create your models;
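
Drawing a handful of sample datasets from the full data takes only a few lines. The sketch below is one possible way to do it in plain Python; the function name, the sample sizes and the generated rows are assumptions for illustration.

```python
import random


def make_sample_datasets(rows, n_samples, sample_size, seed=0):
    """Draw n_samples random subsets (no repeated rows within a subset)."""
    rng = random.Random(seed)
    size = min(sample_size, len(rows))
    return [rng.sample(rows, size) for _ in range(n_samples)]


# Hypothetical full dataset: 1,000 rows of (id, value).
full_dataset = [(i, i * 0.1) for i in range(1000)]
samples = make_sample_datasets(full_dataset, n_samples=5, sample_size=100)
print(len(samples), "samples of", len(samples[0]), "rows each")
```

Fixing the seed keeps the samples reproducible, so different algorithms can be compared on exactly the same subsets.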

Demo AutoML + Pentaho

• This presentation aims to demo how AutoML open source tools and Pentaho, together, can help customers save time in the process of creating a model and deploying this model into production.

The Power of PDI

• PDI (Pentaho Data Integration) will help data scientists and data engineers with data onboarding, data preparation, data blending, model orchestration (model and predict), and saving and visualizing the data.

Data Onboarding, Data Preparation and Data Blending

• Below we can see a data preparation process using PDI (Pentaho Data Integration);

• ML dataset output: ARFF file (Weka file), CSV (Python, R and Apache Spark MLlib) and Hadoop Output to save the txt file to the Data Lake;
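
In the demo, PDI writes these outputs itself. The sketch below is only meant to show what the CSV and ARFF formats look like for the same prepared rows; it is plain Python, and the column names, values and helper functions are hypothetical.

```python
import csv
import io

# Hypothetical prepared rows: two numeric attributes and a nominal class.
header = ["age", "income", "churn"]
rows = [(34, 52000.0, "no"), (51, 38000.0, "yes"), (27, 61000.0, "no")]


def to_csv(header, rows):
    """Header line followed by one comma-separated line per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def to_arff(relation, header, rows):
    """Minimal ARFF: numeric attributes plus a nominal last attribute."""
    class_values = sorted({row[-1] for row in rows})
    lines = [f"@relation {relation}", ""]
    for name in header[:-1]:
        lines.append(f"@attribute {name} numeric")
    lines.append(f"@attribute {header[-1]} {{{','.join(class_values)}}}")
    lines += ["", "@data"]
    lines += [",".join(str(value) for value in row) for row in rows]
    return "\n".join(lines) + "\n"


csv_text = to_csv(header, rows)
arff_text = to_arff("customers", header, rows)
print(arff_text)
```

The main difference is that ARFF carries the attribute types in its header, while CSV leaves type inference to the tool that reads it.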

Predicting New Values Using Your Model

Demonstration

Demo Agenda

What we will cover in the demo:

• Data Preparation with PDI;

• Model creation using an AutoML tool;

• Model Deployment with PDI;

Pentaho Data Integration + H2O AutoML

Summary

What we covered today:

• Business Case for Automated Machine Learning (AutoML) and Pentaho;

• High-level overview of Automated Machine Learning (AutoML);

• Demonstrations (Pentaho + AutoML).

Next Steps

Want to learn more?

• Talk to me during PentahoWorld 2017 or send me an e-mail: caio.moreno@HitachiVantara.com;

• Meet-the-Experts:
  – https://www.pentahoworld.com/meet-the-experts

Appendices

Top Prediction Algorithms

• According to Dataiku, the top prediction algorithms are the ones explained in the image on the right side.

• This image also summarizes the advantages and disadvantages of each algorithm.

Source: https://blog.dataiku.com/machine-learning-explained-algorithms-are-your-friend

Algorithms

The Rexer Analytics Data Science Survey* gives us a good idea of which algorithms have been used over the years.

* Special thanks to Mark Hall (Pentaho) for sharing this document with me. Document available at: http://www.rexeranalytics.com/data-science-survey.html

Core Algorithms

Source: http://www.rexeranalytics.com/files/Rexer_Data_Science_Survey_Highlights_Apr-2016.pdf

Tools

• The huge number of tools increases the complexity.

Source: http://www.rexeranalytics.com/files/Rexer_Data_Science_Survey_Highlights_Apr-2016.pdf

AutoWeka

• AutoWeka
  – Provides automatic selection of models and hyperparameters for WEKA.
  – http://www.cs.ubc.ca/labs/beta/Projects/autoweka/

• Open datasets for AutoWeka
  – http://www.cs.ubc.ca/labs/beta/Projects/autoweka/datasets/

AutoSklearn

• AutoWeka inspired the authors of AutoSklearn;

• AutoSklearn
  – auto-sklearn is an automated machine learning toolkit and a drop-in replacement for a scikit-learn estimator.
  – https://github.com/automl/auto-sklearn
  – http://automl.github.io/auto-sklearn/stable/

Types of ML Problems with AutoML

• The types of machine learning problems that we can solve using AutoWeka and AutoSklearn are Classification, Regression and Clustering:
  – Classification and Regression are already supported in Auto-sklearn & Auto-WEKA.
  – For clustering, you can use them as long as you have an objective function to optimize.

Automated by TPOT

• TPOT will automate the most tedious part of machine learning by intelligently exploring thousands of possible pipelines to find the best one for your data.

https://github.com/rhiever/tpot

AutoML Tools Installation

Installing AutoWeka

• To install AutoWeka, go to the Weka Package Manager, search for Auto-WEKA and click the “Install” button.

Installing TPOT

• Command to install TPOT:
  – $ pip install tpot

• Learn more:
  – http://rhiever.github.io/tpot/installing/

Installing AutoSklearn on Ubuntu

• Use the documentation below to help you:
  – http://automl.github.io/auto-sklearn/stable/

• Run these commands in an Ubuntu terminal:
  – $ conda install gcc swig
  – $ curl https://raw.githubusercontent.com/automl/auto-sklearn/master/requirements.txt | xargs -n 1 -L 1 pip install
  – $ sudo apt-get install build-essential swig
  – $ pip install -U auto-sklearn

AutoSklearn Error on Ubuntu

• Error reported on June 14th, 2017. Solution sent on the same day.

• Check the GitHub link below to find the solution: https://github.com/automl/auto-sklearn/issues/308

Installing H2O.ai

• To install H2O.ai AutoML, visit the websites:
  – https://blog.h2o.ai/2017/06/automatic-machine-learning/
  – https://www.h2o.ai/

AutoML Demonstration

Using AutoWeka

• timeLimit = the time, in minutes, that you want AutoWeka to use to run and find the best option.
  – More time = usually better results

Using AutoWeka

• You can run AutoWeka from the Weka Explorer user interface

Using AutoWeka

• For better performance, try giving Auto-WEKA more time

Using AutoWeka

• AutoWeka output results

Testing AutoSklearn

• Open Spyder and test the code below:

Source code: http://automl.github.io/auto-sklearn/stable/

Testing AutoSklearn with the Iris Dataset

Testing H2O.ai AutoML

To test H2O AutoML, it is necessary to install version 3.11.0.3888 or later.
http://h2o-release.s3.amazonaws.com/h2o/rel-vapnik/1/index.html

https://github.com/caiomsouza/machine-learning-orchestration/blob/master/AutoML/src/r/h2o-automl/H20_AutoML_Example.R

aml <- h2o.automl(x = x, y = y, training_frame = train, leaderboard_frame = test, max_runtime_secs = 30)

# View the AutoML Leaderboard
lb <- aml@leaderboard
lb

Demo AutoML (AutoWeka) + Pentaho

• Using AutoWeka from the Weka user interface, we created a first “dummy” model in 15 minutes.

• AutoWeka will output the best model created in the time specified; this model can then be used to predict new values.

AutoWeka output
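
The hand-off from model creation to prediction can be sketched as "save the selected model, load it elsewhere, score new rows". The example below is a stand-in in plain Python: the threshold rule, the JSON layout and the function name are assumptions, and a real AutoWeka run would save a Weka model rather than a JSON string.

```python
import json


def predict(model, value):
    """Apply a stored threshold rule: 1 if value >= threshold, else 0."""
    return 1 if value >= model["threshold"] else 0


# Model-building side: persist the selected model as text (a file in practice).
saved_model = json.dumps({"type": "threshold_rule", "threshold": 2.0})

# Deployment side: load the model and score rows it has never seen.
model = json.loads(saved_model)
new_values = [0.7, 1.9, 2.0, 3.4]
predictions = [predict(model, value) for value in new_values]
print(list(zip(new_values, predictions)))
```

Separating the two sides this way means the scoring step only needs the saved model, not the training data or the search that produced it.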

No Free Lunch Theorem

https://ti.arc.nasa.gov/m/profile/dhw/papers/78.pdf

http://www.no-free-lunch.org/

http://philosophy.wisc.edu/forster/papers/Krakow.pdf