Data Science Notebook Webinar 2017-11-16

Data Science Notebook Guidelines

ODPi BI & Data Science SIG: Cupid Chan, Moon Soo Lee, Frank McQuillan

BI & Data Science Special Interest Group (SIG)

• Bridge the gap so that BI tools can sit harmoniously on top of both Hadoop and RDBMS, while providing the same, or even more, business insight to BI users who also have Hadoop in the backend.

• Provide an objective guideline for evaluating the effectiveness of a BI solution and/or other related middleware technologies.

Target user persona

• Jupyter: Data science user with programming experience in one of the supported kernels.

• Zeppelin: Data engineers, data scientists, and business users who need to collaborate in the same data processing pipeline.

Installation

• Jupyter: Easy installation with Anaconda or pip. Standalone, or Hadoop and Spark (via YARN) clusters supported.

• Zeppelin: Download the binary package and start the daemon script. Included in HDP.

Configuration

• Jupyter: Edit config files or use the command line tool for notebook settings. Community-maintained language kernels have various configuration workflows.

• Zeppelin: Edit config files. Interpreters can be configured through the GUI.

User Interface

• Jupyter: Functional notebook user interface that can be used to create readable analyses combining code, images, comments, formulae, and plots.

• Zeppelin: Notebook interface in which users can document, run code, and visualize outputs with a flexible layout and multiple looks and feels.

Supported languages

• Jupyter: Python, R, Julia, and dozens of community-maintained kernels.

• Zeppelin: Support for various languages is included in the binary package, including Spark, Python, JDBC, etc. Third-party interpreters are available through an online registry.

Multi-user support

• Jupyter: Native Jupyter does not support multiple users. However, JupyterHub can be used to serve notebooks to users working in separate sessions.

• Zeppelin: Multiple users can collaborate in real time on a notebook. Multiple users can work with multiple languages in the same notebook.

Support and community

• Jupyter: Mature project with an active community and good support. The Jupyter project was born in 2014 but has roots going back to 2001.

• Zeppelin: Apache Zeppelin is one of the most active projects in the Apache Software Foundation. The project was born in 2013 and became a top-level project of the ASF in 2015.

Architecture

• Jupyter: The notebook server sends code to language kernels, renders in a browser, and stores code/output/Markdown in JSON files.

• Zeppelin: The Zeppelin server daemon manages multiple interpreters (backend integrations). The web application communicates with the server over a websocket for real-time communication.
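
The Jupyter bullet above says that code, output, and Markdown are stored together in JSON files. A minimal sketch of what such a file can look like, built as a plain Python dict with only the standard library, is shown here. The field names follow the version 4 notebook layout; the cell contents and the stored output are made-up illustration data, not taken from the webinar demo.

```python
import json


def make_notebook():
    """Build a minimal notebook as a plain dict in the version 4 JSON layout."""
    markdown_cell = {
        "cell_type": "markdown",
        "metadata": {},
        "source": ["# Quick exploration\n", "Add one and one."],
    }
    code_cell = {
        "cell_type": "code",
        "execution_count": 1,
        "metadata": {},
        "source": ["print(1 + 1)"],
        # Output returned by the kernel is stored next to the code that produced it.
        "outputs": [
            {"output_type": "stream", "name": "stdout", "text": ["2\n"]}
        ],
    }
    return {
        "cells": [markdown_cell, code_cell],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 2,
    }


notebook = make_notebook()

# Serialize to the text that would be written to an .ipynb file, then read it back.
serialized = json.dumps(notebook, indent=1)
restored = json.loads(serialized)
cell_types = [cell["cell_type"] for cell in restored["cells"]]
print(cell_types)
```

Because the file is plain JSON, it can be inspected or processed by any tool that reads JSON, without starting a notebook server.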

Big data ecosystem

• Jupyter: Can be connected to a variety of big data execution engines and frameworks: Spark, massively parallel processing (MPP) databases, Hadoop, etc.

• Zeppelin: Tightly integrated with Apache Spark and other big data processing engines.

Security

• Jupyter: Code executed in the notebook is trusted, like any other Python program. Token-based authentication is on by default. Root use is disabled by default.

• Zeppelin: User authentication (LDAP, AD integration). Notebook ACL. Interpreter ACL. SSL connection.
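
To make the idea of token-based authentication concrete, here is a generic sketch: the server generates a random token at startup and only accepts requests that present the same token. This is an illustration of the general technique under our own assumptions, not Jupyter's actual implementation; the function names `new_token` and `is_authorized` are hypothetical.

```python
import hmac
import secrets


def new_token():
    """Generate a random token (hypothetical helper): 24 random bytes as 48 hex characters."""
    return secrets.token_hex(24)


def is_authorized(expected_token, presented_token):
    """Compare tokens in constant time to avoid leaking information through timing."""
    return hmac.compare_digest(expected_token, presented_token)


# The server creates its token once at startup.
server_token = new_token()

# A client that presents the right token is accepted; any other value is rejected.
accepted = is_authorized(server_token, server_token)
rejected = is_authorized(server_token, "not-the-token")
print(accepted, rejected)
```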

Data science readiness

• Jupyter: Widely used by data scientists for a variety of tasks including quick exploration, documentation of findings, reproducibility, teaching, and presentations.

• Zeppelin: Data scientists can collaborate with each other. Business users can also log in and collaborate with data scientists directly on notebooks.

Jupyter

Frank McQuillan

Agenda

• What is a Jupyter notebook?

• Lightning tutorial - my first Jupyter notebook

• Data science examples
  – Python
  – SQL

• Key strengths and potential areas of improvement

What is a Jupyter Notebook?

• Tell a story with your data

• Program in a web browser

• “Multimodal”

• Favorite tool of data scientists and researchers

Support and Community

• 2001 - IPython project (Fernando Perez)

• 2014 - Jupyter notebook launched

• Open source (modified BSD license)

• Steering council of ~15 members from academia and commercial companies

• Mature product with active community: https://stackoverflow.com/search?q=jupyter returns ~10,500 results

Architecture

● IPython

● IRkernel

● IJulia

● Dozens of community-maintained kernels: https://github.com/jupyter/jupyter/wiki/Jupyter-kernels
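
Each kernel in the list above is made known to Jupyter through a small JSON file called a kernel spec (`kernel.json`). The sketch below writes one for the IPython kernel into a temporary directory, using only the standard library. The directory name `demo_kernel` and the display name are hypothetical values chosen for illustration; `{connection_file}` is a placeholder that Jupyter fills in when it launches the kernel.

```python
import json
import tempfile
from pathlib import Path


def write_kernel_spec(kernel_dir):
    """Write a minimal kernel.json describing how to start the IPython kernel."""
    spec = {
        # Command used to start the kernel process.
        "argv": ["python", "-m", "ipykernel_launcher", "-f", "{connection_file}"],
        # Name shown in the notebook user interface (illustrative value).
        "display_name": "Python 3 (demo)",
        "language": "python",
    }
    kernel_dir = Path(kernel_dir)
    kernel_dir.mkdir(parents=True, exist_ok=True)
    spec_path = kernel_dir / "kernel.json"
    spec_path.write_text(json.dumps(spec, indent=1))
    return spec_path


# Write the spec into a throwaway directory and read it back.
with tempfile.TemporaryDirectory() as tmp:
    spec_path = write_kernel_spec(Path(tmp) / "demo_kernel")
    loaded = json.loads(spec_path.read_text())

print(loaded["display_name"])
```

A community-maintained kernel for another language follows the same pattern, with a different launch command in `argv` and a different `language` value.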

Demo

Summary

• Key strengths
  – Data science friendly
  – Mature project
  – Widely used
  – Intuitive UI
  – Nice presentation of code, images, comments, formulae
  – Lots of available kernels

• Some potential improvements
  – Multi-user support
  – Cell drag and drop
  – Hiding code/output
  – IDE-type operations like syntax checking, version control, running code one line at a time

Zeppelin

Moon Soo Lee

Slide & demo notebook - https://s.apache.org/ZPLN