Data science lifecycle with Apache Zeppelin and Spark, by Moonsoo Lee
Data science lifecycle with Apache Zeppelin And Spark
2015 Spark Summit Amsterdam
Moon [email protected]
NFLabs www.nflabs.com
Data science lifecycle
Data Science: process
https://en.wikipedia.org/wiki/Data_analysis
Data Science: tools
MLlib
Data Science: people
Engineer, Data Scientist, DevOps, Business
http://aarondavis.design/
Hadoop Landscape
Cloudera-ML, ML-base, MRQL, Shark
?
Project Timeline
12.2012 Commercial Product for data analysis
10.2013 Open sourced a single feature
08.2014 Started getting adoption
12.2014 ASF Incubation
http://zeppelin.incubator.apache.org
Commercial Product 12.2012
Zeppelin 10.2013
Zeppelin 08.2014
Third-party Products 10.2014
Apache Incubation Proposal 11.2014
Acceptance by Incubator 23.12.2014
Current Status
1 Release
71 Contributors worldwide
766 Stars on GH
300/900 Emails at users/dev @i.a.o
Interactive Notebooks
Interactive Visualization
Multiple Backends
Interpreter
http://zeppelin.incubator.apache.org/docs/development/writingzeppelininterpreter.html
Writing an Interpreter
public abstract void open();
public abstract void close();
public abstract InterpreterResult interpret(String st, InterpreterContext context);
public abstract void cancel(InterpreterContext context);
public abstract int getProgress(InterpreterContext context);
public abstract List<String> completion(String buf, int cursor);
public abstract FormType getFormType();
public Scheduler getScheduler();
Must have, Good to have, Advanced
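To make the interpreter contract above concrete, here is a minimal sketch of the open / interpret / close lifecycle. The real base class lives in Zeppelin's interpreter package; SketchInterpreter, SketchResult and UpperCaseInterpreter below are hypothetical, simplified stand-ins so the example compiles without the Zeppelin jars, and the InterpreterContext argument is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for an interpreter result: a status flag plus output text.
class SketchResult {
    final boolean success;
    final String message;

    SketchResult(boolean success, String message) {
        this.success = success;
        this.message = message;
    }
}

// Hypothetical, reduced version of the contract: only open, close and interpret.
abstract class SketchInterpreter {
    public abstract void open();
    public abstract void close();
    public abstract SketchResult interpret(String st);
}

// Toy interpreter that upper-cases whatever paragraph text it receives.
class UpperCaseInterpreter extends SketchInterpreter {
    private boolean opened = false;
    private final List<String> history = new ArrayList<>();

    @Override
    public void open() {
        // A real interpreter would create its backend connection here.
        opened = true;
    }

    @Override
    public void close() {
        // Release state so the interpreter can be reopened cleanly.
        opened = false;
        history.clear();
    }

    @Override
    public SketchResult interpret(String st) {
        if (!opened) {
            return new SketchResult(false, "interpreter is not open");
        }
        history.add(st);
        return new SketchResult(true, st.toUpperCase(Locale.ROOT));
    }

    // Number of paragraphs run since the last open().
    int executedCount() {
        return history.size();
    }
}
```

Calling open(), then interpret("hello"), returns a successful result whose message is "HELLO"; calling interpret before open() returns a failed result instead of throwing.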
Display System
Zeppelin webapp, Zeppelin Server (Websocket, REST)
Spark Interpreter, Other Interpreter
Text, Html, Table, Angular
Display System
Select display system through output
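As an illustration of selecting a display system through output: in Zeppelin, output that begins with a marker such as %table or %html is rendered by the matching display system, and table output uses tab-separated columns with one row per line. The helper class TableOutput below is hypothetical and only builds such strings; it is a sketch, not part of Zeppelin.

```java
import java.util.List;

// Hypothetical helper that formats text so the leading marker picks the display system.
class TableOutput {

    // Returns "%table " followed by a tab-separated header row and
    // one tab-separated line per data row, each ending in a newline.
    static String toTable(List<String> header, List<List<String>> rows) {
        StringBuilder sb = new StringBuilder("%table ");
        sb.append(String.join("\t", header)).append('\n');
        for (List<String> row : rows) {
            sb.append(String.join("\t", row)).append('\n');
        }
        return sb.toString();
    }

    // Prefixes a fragment with "%html " so it is treated as HTML output.
    static String toHtml(String body) {
        return "%html " + body;
    }
}
```

An interpreter or notebook paragraph would print the returned string as its result; plain output without a marker is shown as text.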
Built-in scheduler
Built-in scheduler runs your notebook with a cron expression.
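For readers new to cron expressions, a small sketch, assuming the Quartz-style format (seconds, minutes, hours, day of month, month, day of week, optional year). For example, "0 0/5 * * * ?" fires every five minutes. The CronFields class is hypothetical and only labels the fields; it does not validate their values or schedule anything.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that names the fields of a Quartz-style cron expression.
class CronFields {
    private static final String[] NAMES = {
        "seconds", "minutes", "hours", "dayOfMonth", "month", "dayOfWeek", "year"
    };

    // Splits the expression on whitespace; 6 fields are required, the 7th (year) is optional.
    static Map<String, String> parse(String expression) {
        String[] parts = expression.trim().split("\\s+");
        if (parts.length < 6 || parts.length > 7) {
            throw new IllegalArgumentException(
                "expected 6 or 7 fields, got " + parts.length);
        }
        Map<String, String> fields = new LinkedHashMap<>();
        for (int i = 0; i < parts.length; i++) {
            fields.put(NAMES[i], parts[i]);
        }
        return fields;
    }
}
```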
Flexible layout
DEMO
Zeppelin & Friends
Z-Manager
ZeppelinHub
Collaboration/Sharing
Packaging & Deployment
Zeppelin + Full stack on a cloud
Packages
Backend Integration
Z-Manager installer
Deployment
https://github.com/hortonworks-gallery/ambari-zeppelin-service
Deployment
As a Service
AWS EMR
https://aws.amazon.com/blogs/aws/amazon-emr-release-4-1-0-spark-1-5-0-hue-3-7-1-hdfs-encryption-presto-oozie-zeppelin-improved-resizing/
Online Viewer
Zeppelin for organizations
An Engineer
A Team
An Organization
That's too many!
engineer by http://aarondavis.design/
What is the problem?
Too much: Install, Configure, Cluster resources
Solution?
We have containers + reverse proxy
Z Manager
http://github.com/NFLabs/z-manager
Z Manager
- Apache 2.0 License
- Containerized deployment per user
- Reverse proxy
- Single binary
- Simple web application
- SGA to ASF coming *
* following the destiny of Z: PoC, internal adoption, OSS, ASF
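The combination of one container per user and a reverse proxy can be pictured as a routing table from a user-specific path to that user's container. The sketch below is only an illustration of that idea: the class UserRouter, the "/u/<user>/" path scheme and the backend addresses are all assumptions, and this is not how Z Manager itself (written in Go) is implemented.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical routing table: one Zeppelin container per user behind a single entry point.
class UserRouter {
    private final Map<String, String> backends = new HashMap<>();

    // Records the base URL of the container that runs this user's Zeppelin.
    void register(String user, String backendBaseUrl) {
        backends.put(user, backendBaseUrl);
    }

    // Maps "/u/<user>/<rest>" to "<backend>/<rest>".
    // Returns null when the path has another shape or the user has no container.
    String resolve(String path) {
        if (!path.startsWith("/u/")) {
            return null;
        }
        String remainder = path.substring(3);
        int slash = remainder.indexOf('/');
        String user = slash < 0 ? remainder : remainder.substring(0, slash);
        String rest = slash < 0 ? "/" : remainder.substring(slash);
        String backend = backends.get(user);
        return backend == null ? null : backend + rest;
    }
}
```

With register("alice", "http://127.0.0.1:8081"), a request for "/u/alice/api/notebook" resolves to "http://127.0.0.1:8081/api/notebook", while a request for an unknown user resolves to null and can be rejected by the proxy.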
Z Manager
Auto-update
Linux box
go + react :)
Z Manager process
Z Manager
ZeppelinHub
https://www.zeppelinhub.com
Sharing notebooks with access control
Zeppelin
Shares Notebook
Provides multi-tenant environment
z-manager, ZeppelinHub
Data Science: people
Engineer, Data Scientist, DevOps, Business
Before
Cloudera-ML, ML-base, MRQL, Shark
?
After
Cloudera-ML, ML-base, MRQL, Shark
Project roadmap
Helium
People do similar work with different data
New visualization, Model & Algorithm, Data process pipeline
Package and distribute work
New visualization, Model & Algorithm, Data process pipeline
Pkg, Repo
Helium
https://s.apache.org/helium
Platform for Data Analytics Application on top of Apache Zeppelin
Helium Application = View + Algorithm
Zeppelin provided Resources
Resources
Data, Computing
Any java object
- Result of last execution
- JDBC connection (from JDBC Interpreter)*
- SparkContext (from SparkInterpreter)
- Flink environment (from FlinkInterpreter)*
- Provided by user created Interpreter
- Provided by user created Helium application
Application Examples
Data
- ex) get git commit log data https://github.com/Leemoonsoo/zeppelin-gitcommitdata
Computing
- ex) run cpu usage monitoring code across spark cluster, using SparkContext https://github.com/Leemoonsoo/zeppelin-sparkmon
Visualization
- ex) display result data as a wordcloud https://github.com/Leemoonsoo/zeppelin-wordcloud
How it works
Zeppelin Server, Web browser, View, Interpreter Process, Algorithm, Resource pool
Resource pools are connected
Algorithm runs where resource exists
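The idea that an algorithm runs where its resource exists can be sketched as a lookup over resource pools: an application lists the classes it consumes, and the process whose pool holds matching objects is chosen. SketchResourcePool, its methods and the pool names are hypothetical and much simpler than whatever Zeppelin actually uses; the match here is by exact runtime class name only.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical resource pool: named Java objects held by one process.
class SketchResourcePool {
    private final Map<String, Object> resources = new LinkedHashMap<>();

    void put(String name, Object value) {
        resources.put(name, value);
    }

    // True when some resource's runtime class has exactly this fully qualified name.
    boolean provides(String className) {
        for (Object value : resources.values()) {
            if (value != null && value.getClass().getName().equals(className)) {
                return true;
            }
        }
        return false;
    }

    // True when every consumed class name is provided by this pool.
    boolean canRun(List<String> consume) {
        for (String className : consume) {
            if (!provides(className)) {
                return false;
            }
        }
        return true;
    }

    // Returns the name of the first pool that can run the application, or null if none can.
    static String choosePool(Map<String, SketchResourcePool> pools, List<String> consume) {
        for (Map.Entry<String, SketchResourcePool> entry : pools.entrySet()) {
            if (entry.getValue().canRun(consume)) {
                return entry.getKey();
            }
        }
        return null;
    }
}
```

An application that consumes org.apache.spark.SparkContext would, under this sketch, be placed in the process whose pool holds a SparkContext, and would not be runnable anywhere if no pool does.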
API
class YourApplication extends org.apache.zeppelin.helium.Application {
  @Override
  public void run(ApplicationArgument arg, InterpreterContext context) {
    ..
  }
}
Easy API
Just extend helium.Application
Application Spec
{
  mavenArtifact : "groupId:artifactId:version",
  className : "your.helium.application.Class",
  icon : "fa fa-cloud",
  name : "My app name",
  description : "some description",
  consume : [ "org.apache.spark.SparkContext" ]
}
Simple
Writing a spec file allows Zeppelin to load the application
Deploy
Public Repository, Private Repository
Handy
Private, Public
Packaged to Jar and Distributed through Maven
Downloaded on the fly and run when user selects it
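Since applications are packaged as jars and fetched through Maven, a "groupId:artifactId:version" coordinate is enough to locate the jar. The sketch below converts such a coordinate into the relative path used by the standard Maven repository layout. The class MavenCoordinate and the example coordinate are hypothetical; how Helium actually resolves and downloads artifacts is not shown in the slides.

```java
// Hypothetical value class for a "groupId:artifactId:version" coordinate.
class MavenCoordinate {
    final String groupId;
    final String artifactId;
    final String version;

    MavenCoordinate(String groupId, String artifactId, String version) {
        this.groupId = groupId;
        this.artifactId = artifactId;
        this.version = version;
    }

    // Parses exactly three non-empty, colon-separated parts.
    static MavenCoordinate parse(String coordinate) {
        String[] parts = coordinate.split(":");
        if (parts.length != 3) {
            throw new IllegalArgumentException(
                "expected groupId:artifactId:version, got " + coordinate);
        }
        for (String part : parts) {
            if (part.isEmpty()) {
                throw new IllegalArgumentException("empty part in " + coordinate);
            }
        }
        return new MavenCoordinate(parts[0], parts[1], parts[2]);
    }

    // Relative jar path in the standard repository layout:
    // group/with/slashes/artifactId/version/artifactId-version.jar
    String jarPath() {
        return groupId.replace('.', '/') + "/" + artifactId + "/" + version
            + "/" + artifactId + "-" + version + ".jar";
    }
}
```

For a made-up coordinate "org.example:my-helium-app:0.1.0", jarPath() yields "org/example/my-helium-app/0.1.0/my-helium-app-0.1.0.jar", which is appended to the base URL of a public or private repository.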
Thank you
Q & A
Moon [email protected]
NFLabs www.nflabs.com
http://zeppelin.incubator.apache.org/