Apache Zeppelin and Spark for Enterprise Data Science
TRANSCRIPT
1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Enabling Apache Zeppelin* and Spark* for Data Science in the Enterprise
Bikas Saha (@bikassaha)
*Apache Hadoop, Falcon, Atlas, Tez, Sqoop, Flume, Kafka, Pig, Hive, HBase, Accumulo, Storm, Solr, Spark, Ranger, Knox, Ambari, ZooKeeper, Oozie, Zeppelin and the Hadoop elephant logo are trademarks of the Apache Software Foundation.
Agenda
Making Big Data Science easy to approach
What are the current issues for the enterprise
Making Apache Zeppelin enterprise ready
Future Roadmap
Apache Zeppelin
Zeppelin makes Big Data Science Easy to Approach
Zero install – just connect via a web browser and you are ready to run
Support for multiple execution platforms (Apache Spark, JDBC, Hive…)
Support for multiple languages (Scala, SQL, Python…)
Support for built-in visualizations
Support for reporting
Support for sharing and collaborative work
Does NOT have machine learning built-in – that’s where Apache Spark comes in (or your favorite SQL engine Apache Flink/Drill/Hive… and 30+ others)
Zeppelin for Sharing
Agenda
Making Big Data Science easy to approach
What are the current issues for the enterprise
Making Apache Zeppelin enterprise ready
Future Roadmap
Current Apache Zeppelin and Spark integration
[Diagram labels: User, Zeppelin Server, Spark Driver, Spark Executors]
Architectural Issue with Secure Data Access
[Diagram labels: User 1, Zeppelin Server, Spark Driver, Spark Executors, Zeppelin Server User, HDFS]
Architectural Issues with Multi-Tenancy – Fault Tolerance
[Diagram labels: User 1, User 2, Zeppelin Server, Spark Driver, Spark Executors]
User 1 failure affects User 2
Heavy-weight Spark drivers
Architectural Issues with Multi-Tenancy – Privacy
[Diagram labels: User 1, User 2, Zeppelin Server, Spark Driver, Spark Executors]
User 1 can access User 2's data
Agenda
Making Big Data Science easy to approach
What are the current issues for the enterprise
Enterprise Ready Big Data Science
Future Roadmap
Livy Server as a Session Management Service
[Diagram labels: Livy Server, Interactive REST API, Batch REST API, Session, Remote Spark Driver, Remote Context, Standard Spark Batch Job, Spark Executors]
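As a rough illustration of the two REST entry points named on this slide, the sketch below builds (but does not send) requests for Livy's interactive and batch APIs. The endpoints `/sessions`, `/sessions/{id}/statements` and `/batches` and the `kind`, `code`, `file` and `className` fields come from the Livy REST API; the host name, the helper function names and the example file path are assumptions made up for this example.

```python
import json
from urllib.request import Request

# Hypothetical host; 8998 is Livy's default port.
LIVY_URL = "http://livy.example.com:8998"


def livy_request(path, payload):
    """Build (but do not send) a JSON POST request for the Livy REST API."""
    return Request(
        url=LIVY_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_session(kind="pyspark"):
    """Interactive API: ask Livy to start a session backed by a remote Spark driver."""
    return livy_request("/sessions", {"kind": kind})


def run_statement(session_id, code):
    """Interactive API: run a code snippet inside an existing session."""
    return livy_request(f"/sessions/{session_id}/statements", {"code": code})


def submit_batch(file, class_name=None):
    """Batch API: submit a standard Spark batch job."""
    payload = {"file": file}
    if class_name is not None:
        payload["className"] = class_name
    return livy_request("/batches", payload)


# Illustrative values only; nothing is sent over the network here.
session_req = create_session("spark")
statement_req = run_statement(0, "sc.parallelize(1 to 10).sum()")
batch_req = submit_batch("hdfs:///jobs/example.jar", "com.example.Main")
```

To actually send one of these requests you would pass it to `urllib.request.urlopen` against a running Livy server; a secured cluster would typically also need authentication, which this sketch leaves out.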
Secure Data Access - Solved
[Diagram labels: User, Zeppelin Server, Livy Interpreter, Livy Server, Session, Remote Spark Driver, Remote Context, Spark Executors, User, HDFS]
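The slide does not spell out how the end user's identity reaches HDFS. One plausible mechanism, assumed here, is Livy's impersonation option: the `proxyUser` field accepted by `POST /sessions`. The sketch below only builds the request body; the helper name and the user name `alice` are hypothetical.

```python
import json


def session_payload(end_user, kind="spark"):
    """Body for POST /sessions asking Livy to start the session as end_user.

    Assumption: impersonation is enabled on the Livy server, so the Spark
    driver (and its HDFS access) runs as end_user rather than as the
    Zeppelin server user.
    """
    return {"kind": kind, "proxyUser": end_user}


# Hypothetical user name, for illustration only.
body = json.dumps(session_payload("alice"))
```

With a payload like this, file permissions in HDFS would be checked against the notebook user, not against a shared service account.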
Multi Tenancy - Solved
[Diagram labels: User 1, User 2, Zeppelin Server, Livy Interpreter (×2), Livy Server, Session 1, Session 2, Remote Spark Driver (×2), Remote Context (×2), Spark Executor (×2)]
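The isolation idea on this slide (one session, and so one remote Spark driver, per user) can be sketched as simple bookkeeping. This is not Zeppelin's or Livy's actual code: `SessionRegistry`, `fake_start` and the user names are hypothetical, and `fake_start` stands in for a real session-creation call.

```python
import itertools


class SessionRegistry:
    """Hypothetical per-user bookkeeping: each user gets a separate session."""

    def __init__(self, start_session):
        # start_session is a callable: user name -> new session id.
        self._start_session = start_session
        self._sessions = {}

    def session_for(self, user):
        """Return the user's session, starting one if none exists yet."""
        if user not in self._sessions:
            self._sessions[user] = self._start_session(user)
        return self._sessions[user]

    def mark_failed(self, user):
        """Forget only the failed user's session; other users are untouched."""
        self._sessions.pop(user, None)


# Stand-in for a real session-creation call; hands out increasing ids.
_ids = itertools.count(1)


def fake_start(user):
    return next(_ids)


registry = SessionRegistry(fake_start)
s1 = registry.session_for("user1")
s2 = registry.session_for("user2")
registry.mark_failed("user1")  # user1's driver died; user2 keeps its session
```

Because each user has a separate driver, a crash in one session neither takes down nor exposes the state of another, which is the contrast with the shared-driver slides earlier in the deck.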
Agenda
Making Big Data Science easy to approach
What are the current issues for the enterprise
Making Apache Zeppelin enterprise ready
Future Roadmap
Near Term Improvements
Session Management
Debuggability
Unified session for all languages
Better visualizations for Machine Learning
Support for Spark 2.0
Long Term Improvements
Controlled sharing of sessions for collaboration
Data exploration and browsing with metadata
Taking the model from training to production
Thank You