spark tutorial pycon 2016 part 1
TRANSCRIPT
![Page 1: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/1.jpg)
David Taieb
STSM - IBM Cloud Data Services Developer Advocate
david_taieb@us.ibm.com
HANDS-ON SESSION: DEVELOPING ANALYTIC APPLICATIONS USING APACHE SPARK™ AND PYTHON
Part 1: Flight Delay Predict with Spark ML
PyCon 2016, Portland
![Page 2: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/2.jpg)
© 2016 IBM Corporation
Agenda
• Prerequisite steps to be completed before the session
• Flight Predict app description and architecture
• Train the models in the Notebook
• Accuracy analysis and model refinement
• Deploy and run the models
![Page 3: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/3.jpg)
Sign up for Bluemix
• Access the IBM Bluemix website at https://console.ng.bluemix.net
• Click Get Started for Free
• Complete the form and click Create account
• Look for the confirmation email and click the "confirm your account" link
![Page 4: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/4.jpg)
Sign up for a free trial at Flightstats.com
• Sign up at https://developer.flightstats.com/signup
• Fill out the form and monitor your email for the confirmation link (access to the APIs may take up to 24 hours)
• Once access is granted, go to https://developer.flightstats.com/admin/applications to view your appId and appKey (you will need them in the simple-data-pipe tool to create the training sets)
• Optional: get familiar with the various FlightStats APIs:
– https://developer.flightstats.com/api-docs/scheduledFlights/v1
– https://developer.flightstats.com/api-docs/airports/v1
![Page 5: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/5.jpg)
Where to find the FlightStats app id and app key
App ID
App Key
![Page 6: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/6.jpg)
Create a new space on Bluemix
In preparation for running the project, we create a new space on Bluemix.
Optional: You can skip this step if you already have a space with a Spark instance that you would like to reuse.
![Page 7: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/7.jpg)
Create a Spark Instance
Optional: You can skip this step if you already have a space with a Spark instance that you would like to reuse.
![Page 8: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/8.jpg)
Create New Spark Instance
Optional: You can skip this step if you already have a space with a Spark instance that you would like to reuse.
![Page 9: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/9.jpg)
Agenda
• Prerequisite steps to be completed before the session
• Flight Predict app description and architecture
• Train the models in the Notebook
• Accuracy analysis and model refinement
• Deploy and run the models
![Page 10: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/10.jpg)
Flight App Project Description
• Use case
– Flight delays are a common disturbance during business trips
– Being able to predict how likely a flight is to be delayed can remove uncertainty and enable users to plan around it
– Idea: weather data can be a good explanatory variable for building predictive models
• Implementation
– Combine flight statistics from flightstats.com (system of record) with weather data from IBM Insight for Weather (system of operations) to build training, test and blind sets
– Use Spark MLlib to train predictive models and cross-validate them
– Create a custom card for Google Now that automatically notifies the user of an impending flight delay
– Propose alternative flight routes (e.g. Freebird)
![Page 11: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/11.jpg)
Get/Build/Analyze methodology
![Page 12: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/12.jpg)
Flight Predict App Architecture
[Architecture diagram. Labels: Weather; Simple Data Pipes; Airports; Flight Schedules; Flight Status; Metadata; Training Set; Test Set; Blind Set; Custom Connector (run every 24 hours); Notebook]
![Page 13: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/13.jpg)
Flow Diagram
[Flow diagram. Stages: Data Acquisition; Data Preparation (cleansing, shaping, enrichment); Data Annotation (ground truth); Model Training; Model Testing; Deploy and Run Model. Data sets: Training Set, Test Set, Blind Set. Iterative loop: Train Model, Cross-Validation, Evaluate Performance and optimize model]
• Iterative in nature: we are never done!
• We will be using this diagram as a roadmap throughout this course
![Page 14: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/14.jpg)
Get the data and build the training/test/blind sets
In this step we'll use the Simple Data Pipes open source project to acquire data from FlightStats, combine it with weather data from IBM Insight for Weather, and save the data sets into a Cloudant NoSQL database.
![Page 15: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/15.jpg)
Acquiring the data
• In the next section, we show how to acquire the training data by using the simple-data-pipe tool and the flight predict connector.
• The flight predict connector combines historical flight data from flightstats.com with weather data from IBM Insight for Weather.
• If you want to skip these steps, you can use the already-built data set with the following credentials:
– cloudantHost: dtaieb.cloudant.com
– cloudantUserName: weenesserliffircedinvers
– cloudantPassword: 72a5c4f939a9e2578698029d2bb041d775d088b5
![Page 16: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/16.jpg)
Deploy simple-data-pipe with flightstats connector
• Go to https://github.com/ibm-cds-labs/simple-data-pipe
• Click the Deploy to Bluemix button
Clicking the button takes you to Bluemix
![Page 17: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/17.jpg)
Complete simple-data-pipe deployment
![Page 18: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/18.jpg)
Add an instance of IBM Weather Service on Bluemix
• Return to the application dashboard
• The Weather service is required by the flight predict connector and must be installed first
• From the app dashboard, click Add a service or API
![Page 19: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/19.jpg)
Create an instance of IBM Weather Service on Bluemix
Search for Weather
Make sure to select the "premium plan" to have enough authorized API calls
![Page 20: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/20.jpg)
Checkpoint: simple data pipe app dashboard
• Verify that your app is correctly bound to the right services
Weather Service: used to enrich flight records with weather observations
Cloudant Service: used to store the training, test and blind data sets
You'll need to click on this button for the step on the next page
It is recommended to increase the app memory to 1 GB
![Page 21: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/21.jpg)
Install flight predict connector
• Click the Edit Code button, then edit package.json to add the flight predict module to the dependencies:
– "simple-data-pipe-connector-flightstats": "git://github.com/ibm-cds-labs/simple-data-pipe-connector-flightstats.git"
Don't forget to add a comma at the end of the preceding line to keep the JSON valid
Save your changes
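The package.json edit above can be sanity-checked with a short script. This is only a sketch: the starting file content (including the `express` entry) is a made-up stand-in for the real package.json, and `add_dependency` is a hypothetical helper, not part of simple-data-pipe. Parsing the text with `json.loads` is a quick way to catch the missing-comma mistake the slide warns about.

```python
import json

# Hypothetical stand-in for the app's package.json; the real file has more entries.
package_json = '{"name": "simple-data-pipe", "dependencies": {"express": "^4.0.0"}}'


def add_dependency(package_text, name, source):
    """Return package.json text with one more entry under "dependencies"."""
    package = json.loads(package_text)  # raises ValueError if the JSON is invalid
    package.setdefault("dependencies", {})[name] = source
    return json.dumps(package, indent=2)


updated = add_dependency(
    package_json,
    "simple-data-pipe-connector-flightstats",
    "git://github.com/ibm-cds-labs/simple-data-pipe-connector-flightstats.git",
)
print(updated)
```

In the Bluemix editor the change is made by hand; the script only illustrates what a valid result looks like.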
![Page 22: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/22.jpg)
Install flight predict connector
• Click File/Save to save your changes
![Page 23: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/23.jpg)
Redeploy simple data pipe app
• Use the Live Edit editor to redeploy the app
![Page 24: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/24.jpg)
Verify connector install
• In this step, we verify through the UI that the flight predict connector is correctly installed
Flight connector correctly installed
![Page 25: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/25.jpg)
Create a new FlightStats pipe
• Follow each screen to create and configure a new pipe
![Page 26: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/26.jpg)
Run the pipe
• Skip over the Schedule tab
• In the Activity tab, click Run Now to start the pipe
Click Run Now, then open the log to monitor the activity
![Page 27: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/27.jpg)
Explore the data sets
• In this step, we take a moment to explore the different data sets that have been created by the simple data pipe tool
• From the Bluemix dashboard, click the Cloudant service tile, then the Launch button
• From the Cloudant dashboard, open the training database
• Open a document to look at the data structure
![Page 28: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/28.jpg)
Run the pipe again to build the test set
![Page 29: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/29.jpg)
Train the Models
• In the previous section we created the training data, and we are now ready to train the models.
• Steps in this section:
– Create an IPython Notebook
– Load the data sets from the Cloudant database into a Spark cluster
– Explore the data and train the machine learning models
![Page 30: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/30.jpg)
Create a new IPython Notebook
![Page 31: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/31.jpg)
Notebook tour
![Page 32: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/32.jpg)
Notebook tour: Notebook Info
![Page 33: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/33.jpg)
Notebook tour: Environment
![Page 34: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/34.jpg)
Notebook tour: Sharing
![Page 35: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/35.jpg)
Agenda
• Prerequisite steps to be completed before the session
• Flight Predict app description and architecture
• Train the models in the Notebook
• Accuracy analysis and model refinement
• Deploy and run the models
![Page 36: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/36.jpg)
Before we start building the app…
• You can optionally follow this tutorial from GitHub by using a fully built notebook:
– https://github.com/ibm-cds-labs/simple-data-pipe-connector-flightstats/blob/master/notebook/Flight%20Predict%20PyCon%202016.ipynb
![Page 37: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/37.jpg)
Optional: use prebuilt notebook
• Create a notebook from URL
• Use https://github.com/ibm-cds-labs/simple-data-pipe-connector-flightstats/raw/master/notebook/Flight%20Predict%20PyCon%202016.ipynb
![Page 38: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/38.jpg)
Using Python Packages
• Write code inline within cells
• Encapsulate helper APIs within a Python package
• Two ways of using helper Python packages:
– Egg distribution package: pip install from a PyPI server or a file server (e.g. GitHub)
  • Persistent install across sessions
  • Recommended in production
– SparkContext.addPyFile
  • Easy addition of a Python module file
  • Supports multiple module files via zip format
  • Recommended during development, where frequent code changes occur
![Page 39: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/39.jpg)
Flight Predict Python Package on Github
Setup script for installing the Python package
Flight Predict Python library
![Page 40: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/40.jpg)
Method 1: Install Flight Predict Package
• Use pip to install the Flight Predict package
• Recommended alternative: build an egg distribution package and deploy it to PyPI
![Page 41: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/41.jpg)
Manage Python packages
• Check status
• Uninstall package
![Page 42: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/42.jpg)
Method 2: Install py modules via sc.addPyFile
• addPyFile installs individual py modules and makes them available to all executor processes
• Works with modules in zipped files
Module containing APIs for training the models
Module containing APIs for running the models
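A minimal sketch of Method 2, assuming `sc` is the notebook's SparkContext. `SparkContext.addPyFile(path)` is the real PySpark call; the helper name `add_py_modules` and the example.com URLs are placeholders, not the tutorial's actual file locations.

```python
def add_py_modules(sc, module_urls):
    """Ship each module (.py or .zip path/URL) to the driver and all executors."""
    for url in module_urls:
        sc.addPyFile(url)  # SparkContext.addPyFile(path)
    return list(module_urls)


# Placeholder locations, not the tutorial's actual file URLs.
MODULE_URLS = [
    "https://example.com/flightpredict/training.py",
    "https://example.com/flightpredict/run.py",
]

# In a notebook where `sc` is the SparkContext:
#   add_py_modules(sc, MODULE_URLS)
#   import training, run
```

Because the files are fetched again on each call, this approach suits the development phase described above, when the module code still changes often.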
![Page 43: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/43.jpg)
Setup credentials and import required Python modules
In this step, we import the Python modules that will be needed throughout the notebook and set up credentials for various services.
Credentials for Cloudant NoSQL Service
Credentials for Weather Service
![Page 44: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/44.jpg)
Get Credentials for Cloudant
From the app dashboard, click Environment Variables in the left sidebar
![Page 45: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/45.jpg)
Get Credentials for Weather
![Page 46: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/46.jpg)
Load training set in Spark SQL DataFrame
…
In this step, we use the cloudant-spark connector (https://github.com/cloudant-labs/spark-cloudant) to load data into Spark
Make sure to change the dbname to match the one created for your training set by your activity (open the Cloudant dashboard to find the name)
![Page 47: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/47.jpg)
Loading data: Behind the scene
Use the Spark SQL connector to load data into a DataFrame
Connector id
Options
Cache data for optimized reuse
Create temp SQL table
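The four callouts above (connector id, options, cache, temp table) could look roughly like this. It is a sketch, assuming the Spark 1.x `SQLContext` API and the `com.cloudant.spark` data source with its `cloudant.host`, `cloudant.username` and `cloudant.password` options; the function name and the default table name `training` are illustrative, not taken from the tutorial's code.

```python
def load_cloudant_dataframe(sql_context, host, username, password,
                            db_name, table_name="training"):
    """Load a Cloudant database into a cached DataFrame and register it for SQL."""
    df = (
        sql_context.read.format("com.cloudant.spark")  # connector id
        .option("cloudant.host", host)                 # options
        .option("cloudant.username", username)
        .option("cloudant.password", password)
        .load(db_name)
    )
    df.cache()                         # cache data for optimized reuse
    df.registerTempTable(table_name)   # create temp SQL table (Spark 1.x API)
    return df


# In a notebook where `sqlContext` exists (values below are placeholders):
#   df = load_cloudant_dataframe(sqlContext, "ACCOUNT.cloudant.com",
#                                "USERNAME", "PASSWORD", "your_training_db")
```

On Spark 2.x and later, `createOrReplaceTempView` replaces `registerTempTable`.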
![Page 48: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/48.jpg)
Scatter plot visualization
![Page 49: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/49.jpg)
Visualization API
![Page 50: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/50.jpg)
Transform into an RDD of LabeledPoint
Use Spark SQL connector to load data into a DataFrame
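A LabeledPoint pairs a numeric label with a feature vector. The sketch below shows the per-record conversion in plain Python. The `classification` field is mentioned later in the deck, but the feature names and the sample record are invented for illustration, and the real loadLabeledDataRDD code may differ.

```python
def to_label_and_features(record, feature_names, label_field="classification"):
    """Turn one record (a dict) into a (label, features) pair of floats."""
    label = float(record[label_field])
    features = [float(record[name]) for name in feature_names]
    return label, features


# Hypothetical feature names and record, for illustration only.
FEATURE_NAMES = ["departure_temp", "departure_wind_speed"]
record = {"classification": 2, "departure_temp": 18.5, "departure_wind_speed": 12}

label, features = to_label_and_features(record, FEATURE_NAMES)

# With PySpark available, the same function can feed MLlib:
#   from pyspark.mllib.regression import LabeledPoint
#   labeled_rdd = df.rdd.map(
#       lambda row: LabeledPoint(*to_label_and_features(row.asDict(), FEATURE_NAMES)))
```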
![Page 51: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/51.jpg)
loadLabeledDataRDD API
![Page 52: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/52.jpg)
Machine Learning Algorithms
Supervised Learning (requires ground truth)
– Continuous output
  • Regression: Linear, Ridge, Lasso, Isotonic
  • Decision Tree
  • Random Forest
  • Gradient Boosted Tree
– Discrete output
  • Classification: Logistic Regression, SVM, Naive Bayes
  • Decision Tree
  • Random Forest
  • Gradient Boosted Tree
  • K-NN (available as an add-on Spark package)
Unsupervised Learning (no ground-truth data required)
– Continuous output
  • Clustering: KMeans, Gaussian Mixture
  • Dimensionality Reduction: PCA, SVD
– Discrete output
  • FP-Growth
![Page 53: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/53.jpg)
Train Logistic Regression Model
![Page 54: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/54.jpg)
Train NaiveBayes Model
![Page 55: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/55.jpg)
Train Decision Tree Model
![Page 56: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/56.jpg)
Train Random Forest Model
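The four training steps can be sketched with the RDD-based MLlib API. This is an assumption-laden outline, not the tutorial's own code: the function name is hypothetical, hyperparameters such as `numTrees=10` and `maxDepth=5` are arbitrary, and `labeled_rdd` is assumed to be an RDD of LabeledPoint with non-negative features (MLlib's NaiveBayes requires that).

```python
def train_models(labeled_rdd, num_classes=5):
    """Train the four classifiers on an RDD of LabeledPoint; returns a dict of models."""
    # Imported here so the function can be defined without a Spark installation.
    from pyspark.mllib.classification import LogisticRegressionWithLBFGS, NaiveBayes
    from pyspark.mllib.tree import DecisionTree, RandomForest

    return {
        "logistic_regression": LogisticRegressionWithLBFGS.train(
            labeled_rdd, iterations=100, numClasses=num_classes),
        "naive_bayes": NaiveBayes.train(labeled_rdd),
        "decision_tree": DecisionTree.trainClassifier(
            labeled_rdd, numClasses=num_classes, categoricalFeaturesInfo={},
            impurity="gini", maxDepth=5, maxBins=32),
        "random_forest": RandomForest.trainClassifier(
            labeled_rdd, numClasses=num_classes, categoricalFeaturesInfo={},
            numTrees=10, featureSubsetStrategy="auto",
            impurity="gini", maxDepth=5, maxBins=32),
    }


# In the notebook: models = train_models(labeled_rdd)
```

The default of five classes matches the default training handler described later in the deck.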
![Page 57: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/57.jpg)
Naïve Bayes vs Decision Tree
Naïve Bayes
• Probabilistic: computes the probability that a data instance belongs to a specific class
• Assumes that each feature (variable) is independent of the others
• Performance depends on the predictive nature of the features (non-predictive features will affect the accuracy)
• Works well with a small amount of training data; doesn't need all the possibilities
• Doesn't work with categorical features
Decision Tree
• Non-probabilistic: partitions the data into subsets that best describe the variable
• The deeper the tree, the better the model fits the data
• Watch out for overfitting: need to prune the tree
• Can handle categorical or continuous features
• No need for input to be scaled or standardized: set your features and go!
• Requires a lot of data covering all possibilities
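To make the Naïve Bayes independence assumption concrete: the score of each class is its prior multiplied by the per-feature likelihoods, then normalized. The numbers below are invented for illustration and do not come from the flight data; the function is a toy, not MLlib's implementation.

```python
from math import prod


def naive_bayes_posteriors(priors, likelihoods):
    """priors: {class: P(class)}; likelihoods: {class: [P(feature_i | class), ...]}."""
    # Independence assumption: multiply the per-feature likelihoods together.
    scores = {c: priors[c] * prod(likelihoods[c]) for c in priors}
    total = sum(scores.values())
    return {c: score / total for c, score in scores.items()}


# Made-up probabilities for two observed features (e.g. "storm", "evening departure").
posteriors = naive_bayes_posteriors(
    priors={"delayed": 0.5, "on_time": 0.5},
    likelihoods={"delayed": [0.8, 0.5], "on_time": [0.2, 0.5]},
)
print(posteriors)
```

The second feature has the same likelihood under both classes, so it does not change the result; this is the sense in which non-predictive features add nothing.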
![Page 58: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/58.jpg)
Accuracy Analysis of the Machine Learning Models
In this section, we will perform accuracy analysis on the test data. We will start by computing the accuracy metrics for each model, including the confusion matrix. We will then use a histogram chart to understand the data distribution and refine how the classes are computed.
![Page 59: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/59.jpg)
Agenda
• Prerequisite steps to be completed before the session
• Flight Predict app description and architecture
• Train the models in the Notebook
• Accuracy analysis and model refinement
• Deploy and run the models
![Page 60: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/60.jpg)
Load Test data
Make sure to change the dbname to match the one created for your test set by your activity (open the Cloudant dashboard to find the name)
![Page 61: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/61.jpg)
Accuracy Metrics
![Page 62: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/62.jpg)
Confusion Matrix
![Page 63: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/63.jpg)
Confusion Matrix
![Page 64: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/64.jpg)
Confusion Matrix
![Page 65: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/65.jpg)
Confusion Matrix
![Page 66: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/66.jpg)
Accuracy metrics API
Output HTML
Display results HTML in Notebook cell
Compute metrics from labeled and prediction data
Get the confusion matrix and build HTML table
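The steps named above (compute metrics from labels and predictions, build the confusion matrix, render it as an HTML table) can be sketched in plain Python. The function names and the sample labels are illustrative; the tutorial's own API works on Spark data and is not reproduced here.

```python
from collections import Counter


def confusion_matrix(labels, predictions, num_classes):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(labels, predictions))
    return [[counts[(actual, predicted)] for predicted in range(num_classes)]
            for actual in range(num_classes)]


def accuracy(matrix):
    """Fraction of records on the diagonal (correct predictions)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


def matrix_to_html(matrix):
    """Render the matrix as a simple HTML table."""
    header = "<tr><th></th>" + "".join(
        f"<th>predicted {j}</th>" for j in range(len(matrix))) + "</tr>"
    rows = "".join(
        f"<tr><th>actual {i}</th>" + "".join(f"<td>{v}</td>" for v in row) + "</tr>"
        for i, row in enumerate(matrix))
    return f"<table>{header}{rows}</table>"


# Made-up labels and predictions for three classes.
labels = [0, 0, 1, 1, 2, 2]
predictions = [0, 1, 1, 1, 2, 0]
matrix = confusion_matrix(labels, predictions, 3)
html = matrix_to_html(matrix)

# In a notebook cell: from IPython.display import HTML, display; display(HTML(html))
```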
![Page 67: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/67.jpg)
Understand the distribution of your data with Histograms
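A histogram is just a count of values per bin. The sketch below assumes the variable of interest is the delay in minutes and uses invented values; a distribution like this one can suggest where to draw class boundaries.

```python
from collections import Counter


def histogram(values, bin_width):
    """Count values per bin; keys are the lower bound of each bin."""
    bins = Counter((int(v) // bin_width) * bin_width for v in values)
    return dict(sorted(bins.items()))


# Hypothetical departure delays in minutes.
delays = [0, 3, 7, 12, 14, 31, 45, 47, 95]
BIN_WIDTH = 15
hist = histogram(delays, BIN_WIDTH)

for start, count in hist.items():
    print(f"{start:>3}-{start + BIN_WIDTH - 1:<3} {'#' * count}")
```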
![Page 68: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/68.jpg)
Training Handler class
• Provides flexibility and extensibility to the application
• Provides a fail-fast, try-something-else mechanism
• Enables the user to easily customize classes of data based on how the data is distributed
• Enables the user to easily add training features
![Page 69: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/69.jpg)
Default Training Handler class
Return a description for each class
Return the total number of classes: default is 5
Re-classify a record: default uses the s.classification field in the JSON record
Extra feature names to be added: none by default
Extra features to be added: the array must match the one returned by customTrainingFeaturesNames
![Page 70: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/70.jpg)
Customize Training Handler
Provide a new classification and add the day of departure as a new feature
Inherit from the default Training Handler
Add the day of the week using a technique called dummy coding
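A rough sketch of the handler pattern. Only `customTrainingFeaturesNames` is named in the slides; the class names, the other method names and the `departure_date` field are hypothetical. The example emits one 0/1 indicator per weekday, which is the one-hot form of dummy coding (strict dummy coding would drop one reference day).

```python
from datetime import datetime

DAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]


class DefaultTrainingHandler:
    """Hypothetical stand-in for the tutorial's default handler."""

    def numClasses(self):
        return 5  # default number of classes per the slides

    def computeClassification(self, record):
        return record["classification"]  # default: use the stored field

    def customTrainingFeaturesNames(self):
        return []  # no extra features by default

    def customTrainingFeatures(self, record):
        return []


class DayOfWeekTrainingHandler(DefaultTrainingHandler):
    """Adds the departure weekday as seven indicator features."""

    def customTrainingFeaturesNames(self):
        return ["departure_" + day for day in DAYS]

    def customTrainingFeatures(self, record):
        # "departure_date" is an assumed field name, formatted YYYY-MM-DD.
        weekday = datetime.strptime(record["departure_date"], "%Y-%m-%d").weekday()
        return [1.0 if i == weekday else 0.0 for i in range(len(DAYS))]


handler = DayOfWeekTrainingHandler()
names = handler.customTrainingFeaturesNames()
features = handler.customTrainingFeatures({"departure_date": "2016-05-30"})  # a Monday
```

The names list and the feature list have the same length and order, which is the matching requirement stated for the default handler.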
![Page 71: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/71.jpg)
Re-train the models
![Page 72: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/72.jpg)
Re-compute accuracy
Models 1
Models 2
Better accuracy for Naive Bayes and Logistic Regression
Worse for Decision Tree and Random Forest
![Page 73: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/73.jpg)
Agenda
• Prerequisite steps to be completed before the session
• Flight Predict app description and architecture
• Train the models in the Notebook
• Accuracy analysis and model refinement
• Deploy and run the models
![Page 74: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/74.jpg)
Deploy and Run the models
In the last section, we will simulate deploying and running the models through the notebook by calling APIs from the run package.
![Page 75: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/75.jpg)
Run the predictive model
![Page 76: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/76.jpg)
runModel API
![Page 77: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/77.jpg)
Get Weather Predictions
![Page 78: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/78.jpg)
Show prediction results
![Page 79: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/79.jpg)
Resources
• https://developer.ibm.com/clouddataservices/
• https://github.com/ibm-cds-labs/simple-data-pipe
• https://github.com/ibm-cds-labs/pipes-connector-flightstats
• http://spark.apache.org/docs/latest/mllib-guide.html
• https://console.ng.bluemix.net/data/analytics/
![Page 80: Spark tutorial pycon 2016 part 1](https://reader030.vdocuments.net/reader030/viewer/2022021500/58e7b8481a28abbb4e8b591d/html5/thumbnails/80.jpg)
Thank You