different strategies of scaling h2o machine learning on apache spark
TRANSCRIPT
![Page 1: Different Strategies of Scaling H2O Machine Learning on Apache Spark](https://reader031.vdocuments.net/reader031/viewer/2022030309/58f2d48b1a28abb04d8b45b7/html5/thumbnails/1.jpg)
Different Strategies of Scaling H2O Machine
Learning on Apache Spark
Jakub Há[email protected]
H2O & Booking.com, Amsterdam,April 6, 2017
![Page 2: Different Strategies of Scaling H2O Machine Learning on Apache Spark](https://reader031.vdocuments.net/reader031/viewer/2022030309/58f2d48b1a28abb04d8b45b7/html5/thumbnails/2.jpg)
Who am I
• Finishing high-performance cluster monitoring tool for JVM based languages ( JNI, JVMTI, instrumentation )
• Finishing Master’s at Charles Uni in Prague • Software engineer at H2O.ai - Core Sparkling Water
• Climbing & Tea lover, ( doesn’t mean I don’t like beer!)
![Page 3: Different Strategies of Scaling H2O Machine Learning on Apache Spark](https://reader031.vdocuments.net/reader031/viewer/2022030309/58f2d48b1a28abb04d8b45b7/html5/thumbnails/3.jpg)
Distributed Sparkling Team
• Michal - Mt. View, CA • Kuba - Prague, CZ • Mateusz – Tokyo, JP • Vlad - Mt. View, CA
![Page 4: Different Strategies of Scaling H2O Machine Learning on Apache Spark](https://reader031.vdocuments.net/reader031/viewer/2022030309/58f2d48b1a28abb04d8b45b7/html5/thumbnails/4.jpg)
H2O.aiMachine Intelligence
H2O+Spark = Sparkling
Water
![Page 5: Different Strategies of Scaling H2O Machine Learning on Apache Spark](https://reader031.vdocuments.net/reader031/viewer/2022030309/58f2d48b1a28abb04d8b45b7/html5/thumbnails/5.jpg)
Sparkling Water
• Transparent integration of H2O with Spark ecosystem - MLlib and H2O side-by-side
• Transparent use of H2O data structures and algorithms with Spark API
• Platform for building Smarter Applications • Excels in existing Spark workflows requiring advanced
Machine Learning algorithms
Functionality missing in H2O can be replaced by Spark and vice versa
![Page 6: Different Strategies of Scaling H2O Machine Learning on Apache Spark](https://reader031.vdocuments.net/reader031/viewer/2022030309/58f2d48b1a28abb04d8b45b7/html5/thumbnails/6.jpg)
Benefits
• Additional algorithms – NLP
• Powerful data munging • ML Pipelines
• Advanced algorithms
• speed v. accuracy
• advanced parameters
• Fully distributed and parallelised
• Graphical environment
• R/Python interface
![Page 7: Different Strategies of Scaling H2O Machine Learning on Apache Spark](https://reader031.vdocuments.net/reader031/viewer/2022030309/58f2d48b1a28abb04d8b45b7/html5/thumbnails/7.jpg)
H2O.aiMachine Intelligence
How to use Sparkling Water?
![Page 8: Different Strategies of Scaling H2O Machine Learning on Apache Spark](https://reader031.vdocuments.net/reader031/viewer/2022030309/58f2d48b1a28abb04d8b45b7/html5/thumbnails/8.jpg)
Start spark with Sparkling Water
![Page 9: Different Strategies of Scaling H2O Machine Learning on Apache Spark](https://reader031.vdocuments.net/reader031/viewer/2022030309/58f2d48b1a28abb04d8b45b7/html5/thumbnails/9.jpg)
H2O.aiMachine Intelligence
Model Building
Data Source
Data munging Modelling
Deep Learning, GBMDRF, GLM, GLRM
K-Means, PCACoxPH, Ensembles
Prediction processing
![Page 10: Different Strategies of Scaling H2O Machine Learning on Apache Spark](https://reader031.vdocuments.net/reader031/viewer/2022030309/58f2d48b1a28abb04d8b45b7/html5/thumbnails/10.jpg)
H2O.aiMachine Intelligence
Data Munging
Data Source
Data load/munging/ exploration Modelling
![Page 11: Different Strategies of Scaling H2O Machine Learning on Apache Spark](https://reader031.vdocuments.net/reader031/viewer/2022030309/58f2d48b1a28abb04d8b45b7/html5/thumbnails/11.jpg)
H2O.aiMachine Intelligence
Stream processing
DataSourceO
ff-lin
e m
odel
trai
ning
Data munging
Model prediction
Deploy the model
Stre
ampr
oces
sing
Data Stream
Spark Streaming/Storm/Flink
Export modelin a binary format
or as code
Modelling
![Page 12: Different Strategies of Scaling H2O Machine Learning on Apache Spark](https://reader031.vdocuments.net/reader031/viewer/2022030309/58f2d48b1a28abb04d8b45b7/html5/thumbnails/12.jpg)
H2O.aiMachine Intelligence
Features Overview
![Page 13: Different Strategies of Scaling H2O Machine Learning on Apache Spark](https://reader031.vdocuments.net/reader031/viewer/2022030309/58f2d48b1a28abb04d8b45b7/html5/thumbnails/13.jpg)
Scala code in H2O Flow
• New type of cell • Access Spark from Flow UI • Experimenting made easy
![Page 14: Different Strategies of Scaling H2O Machine Learning on Apache Spark](https://reader031.vdocuments.net/reader031/viewer/2022030309/58f2d48b1a28abb04d8b45b7/html5/thumbnails/14.jpg)
H2O Frame as Spark’s Datasource
• Use native Spark API to load and save data • Spark can optimise the queries when loading data
from H2O Frame • Use of Spark query optimiser
![Page 15: Different Strategies of Scaling H2O Machine Learning on Apache Spark](https://reader031.vdocuments.net/reader031/viewer/2022030309/58f2d48b1a28abb04d8b45b7/html5/thumbnails/15.jpg)
Machine learning pipelines
• Wrap our algorithms as Transformers and Estimators • Support for embedding them into Spark ML
Pipelines • Can serialise fitted/unfitted pipelines • Unified API => Arguments are set in the same way
for Spark and H2O Models
![Page 16: Different Strategies of Scaling H2O Machine Learning on Apache Spark](https://reader031.vdocuments.net/reader031/viewer/2022030309/58f2d48b1a28abb04d8b45b7/html5/thumbnails/16.jpg)
MLlib Algorithms in Flow UI
• Can examine them in H2O Flow • Can generate POJO out of them
![Page 17: Different Strategies of Scaling H2O Machine Learning on Apache Spark](https://reader031.vdocuments.net/reader031/viewer/2022030309/58f2d48b1a28abb04d8b45b7/html5/thumbnails/17.jpg)
PySparkling made easy
• PySparkling now in PyPi • Contains all H2O and Sparkling Water
dependencies, no need to worry about them • Just add in on your Python path and that’s it
![Page 18: Different Strategies of Scaling H2O Machine Learning on Apache Spark](https://reader031.vdocuments.net/reader031/viewer/2022030309/58f2d48b1a28abb04d8b45b7/html5/thumbnails/18.jpg)
And others!
• Support for Datasets • RSparkling • Zeppelin notebook support • Integration with TensorFlow ( via DeepWater) • Support for high-cardinality fast joins • Secure Communication - SSL • Bug fixes..
![Page 19: Different Strategies of Scaling H2O Machine Learning on Apache Spark](https://reader031.vdocuments.net/reader031/viewer/2022030309/58f2d48b1a28abb04d8b45b7/html5/thumbnails/19.jpg)
Coming features
• Support for more MLlib algorithms in Flow • Python cell in the H2O Flow • Integration with Steam • …
![Page 20: Different Strategies of Scaling H2O Machine Learning on Apache Spark](https://reader031.vdocuments.net/reader031/viewer/2022030309/58f2d48b1a28abb04d8b45b7/html5/thumbnails/20.jpg)
H2O.aiMachine Intelligence
Internal Backend
![Page 21: Different Strategies of Scaling H2O Machine Learning on Apache Spark](https://reader031.vdocuments.net/reader031/viewer/2022030309/58f2d48b1a28abb04d8b45b7/html5/thumbnails/21.jpg)
Sparkling Water Internal Backend
Sparkling App
jar file
Spark Master
JVM
spark-submit
Spark Worker
JVM
Spark Worker
JVM
Spark Worker
JVM
Sparkling Water Cluster
Spark Executor JVM
H2O
Spark Executor JVM
H2O
Spark Executor JVM
H2O
![Page 22: Different Strategies of Scaling H2O Machine Learning on Apache Spark](https://reader031.vdocuments.net/reader031/viewer/2022030309/58f2d48b1a28abb04d8b45b7/html5/thumbnails/22.jpg)
Internal Backend
Cluster
Worker node
Spark executor Spark executorSpark executor
Driver node
SparkContext
Worker nodeWorker node
H2OContext
![Page 23: Different Strategies of Scaling H2O Machine Learning on Apache Spark](https://reader031.vdocuments.net/reader031/viewer/2022030309/58f2d48b1a28abb04d8b45b7/html5/thumbnails/23.jpg)
Internal Backend
H2O
Ser
vice
sH
2O S
ervi
ces
DataSource
Spar
k Ex
ecut
orSp
ark
Exec
utor
Spar
k Ex
ecut
or
Spark Cluster
DataFrame
H2O
Ser
vice
s
H2OFrame
DataSource
![Page 24: Different Strategies of Scaling H2O Machine Learning on Apache Spark](https://reader031.vdocuments.net/reader031/viewer/2022030309/58f2d48b1a28abb04d8b45b7/html5/thumbnails/24.jpg)
Pros & Cons
• Advantages • Easy to configure • Faster ( no need to send data to different cluster)
• Disadvantages • Spark kills or joins new executor => H2O goes
down • No way how to discover all executors
![Page 25: Different Strategies of Scaling H2O Machine Learning on Apache Spark](https://reader031.vdocuments.net/reader031/viewer/2022030309/58f2d48b1a28abb04d8b45b7/html5/thumbnails/25.jpg)
When to use
• Have a few Spark Executors • Stable environment - not many Spark jobs need to
be repeated • Quick jobs
![Page 26: Different Strategies of Scaling H2O Machine Learning on Apache Spark](https://reader031.vdocuments.net/reader031/viewer/2022030309/58f2d48b1a28abb04d8b45b7/html5/thumbnails/26.jpg)
When not To Use
• Unstable environment • Big number of executors • Big data, long running tasks which often fails in the
middle
![Page 27: Different Strategies of Scaling H2O Machine Learning on Apache Spark](https://reader031.vdocuments.net/reader031/viewer/2022030309/58f2d48b1a28abb04d8b45b7/html5/thumbnails/27.jpg)
How to use
from pysparkling import *
conf = H2OConf(sc) .set_internal_cluster_mode() hc = H2OContext.getOrCreate(sc, conf)
![Page 28: Different Strategies of Scaling H2O Machine Learning on Apache Spark](https://reader031.vdocuments.net/reader031/viewer/2022030309/58f2d48b1a28abb04d8b45b7/html5/thumbnails/28.jpg)
H2O.aiMachine Intelligence
External Backend
![Page 29: Different Strategies of Scaling H2O Machine Learning on Apache Spark](https://reader031.vdocuments.net/reader031/viewer/2022030309/58f2d48b1a28abb04d8b45b7/html5/thumbnails/29.jpg)
Overview
• Sparkling Water is using external H2O cluster instead of starting H2O in each executor
• Spark executors can come and go and H2O won’t be affected
• Start H2O cluster on YARN automatically
![Page 30: Different Strategies of Scaling H2O Machine Learning on Apache Spark](https://reader031.vdocuments.net/reader031/viewer/2022030309/58f2d48b1a28abb04d8b45b7/html5/thumbnails/30.jpg)
Separation Approach
• Separating Spark and H2O• But preserving same API:
val h2oContext = H2OContext.getOrCreate(“http://h2o:54321/)
• Spark and H2O can be submitted as Yarn job and controlled in separation• But H2O still needs non-elastic environment (H2O itself
does not implement HA yet)
![Page 31: Different Strategies of Scaling H2O Machine Learning on Apache Spark](https://reader031.vdocuments.net/reader031/viewer/2022030309/58f2d48b1a28abb04d8b45b7/html5/thumbnails/31.jpg)
Sparkling Water External Backend
Sparkling App
jar file
Spark Master JVM
spark-submit
Spark Worker
JVM
Spark Worker
JVM
Spark Worker
JVM
Sparkling Water Cluster
Spark Executor JVM
H2O
Spark Executor JVM
H2O
Spark Executor JVM
H2O
H2O Cluster
H2O
![Page 32: Different Strategies of Scaling H2O Machine Learning on Apache Spark](https://reader031.vdocuments.net/reader031/viewer/2022030309/58f2d48b1a28abb04d8b45b7/html5/thumbnails/32.jpg)
External Backend
Cluster
Worker node
Spark executor Spark executorSpark executor
Driver node
SparkContext
Worker nodeWorker node
H2OContext
![Page 33: Different Strategies of Scaling H2O Machine Learning on Apache Spark](https://reader031.vdocuments.net/reader031/viewer/2022030309/58f2d48b1a28abb04d8b45b7/html5/thumbnails/33.jpg)
Pros & Cons
• Advantages• H2O does not crash when Spark executor goes down• Better resource management since resources can be
planned per tool
• Disadvantages• Transfer overhead between Spark and H2O processes
• under measurement with cooperation of a customer
![Page 34: Different Strategies of Scaling H2O Machine Learning on Apache Spark](https://reader031.vdocuments.net/reader031/viewer/2022030309/58f2d48b1a28abb04d8b45b7/html5/thumbnails/34.jpg)
Extended H2O Jar
• External cluster requires H2O jar file to be extended by sparkling water classes
• ./gradlew extendJar -PdownloadH2O=cdh5.8
• We are working on making this even easier so the extended Jars will be available online
![Page 35: Different Strategies of Scaling H2O Machine Learning on Apache Spark](https://reader031.vdocuments.net/reader031/viewer/2022030309/58f2d48b1a28abb04d8b45b7/html5/thumbnails/35.jpg)
Modes
• Auto Start mode • Start h2o cluster automatically on YARN
• Manual Start Mode • User is responsible for starting the cluster
manually
![Page 36: Different Strategies of Scaling H2O Machine Learning on Apache Spark](https://reader031.vdocuments.net/reader031/viewer/2022030309/58f2d48b1a28abb04d8b45b7/html5/thumbnails/36.jpg)
Manual Cluster Start Mode
from pysparkling import *
conf = H2OConf(sc) .set_external_cluster_mode() .use_manual_cluster_start() .set_cloud_name(“test”) .set_h2o_driver_path(“path to h2o driver”)
hc = H2OContext.getOrCreate(sc, conf)
![Page 37: Different Strategies of Scaling H2O Machine Learning on Apache Spark](https://reader031.vdocuments.net/reader031/viewer/2022030309/58f2d48b1a28abb04d8b45b7/html5/thumbnails/37.jpg)
Auto Cluster Start Mode
from pysparkling import *
conf = H2OConf(sc) .set_external_cluster_mode() .use_auto_cluster_start() .set_h2o_driver_path(“path_to_extended_driver") .set_num_of_external_h2o_nodes(4) .set_mapper_xmx(“2G”) .set_yarn_queue(“abc”)
hc = H2OContext.getOrCreate(sc, conf)
![Page 38: Different Strategies of Scaling H2O Machine Learning on Apache Spark](https://reader031.vdocuments.net/reader031/viewer/2022030309/58f2d48b1a28abb04d8b45b7/html5/thumbnails/38.jpg)
Multi-network Environment
• Node with Spark driver & H2O client is connected to more networks
• Spark driver choose IP in different network then the external H2O cloud
• We need to manually configure H2O client IP to be on the same network as rest of the H2O cloud
![Page 39: Different Strategies of Scaling H2O Machine Learning on Apache Spark](https://reader031.vdocuments.net/reader031/viewer/2022030309/58f2d48b1a28abb04d8b45b7/html5/thumbnails/39.jpg)
H2O.aiMachine Intelligence
Demo Time
![Page 40: Different Strategies of Scaling H2O Machine Learning on Apache Spark](https://reader031.vdocuments.net/reader031/viewer/2022030309/58f2d48b1a28abb04d8b45b7/html5/thumbnails/40.jpg)
H2O.aiMachine Intelligence
Checkout H2O.ai Training Books
http://h2o.ai/resources
Checkout H2O.ai Blog
http://h2o.ai/blog/
Checkout H2O.ai Youtube Channel
https://www.youtube.com/user/0xdata
Checkout GitHub
https://github.com/h2oai/sparkling-water
More info
![Page 41: Different Strategies of Scaling H2O Machine Learning on Apache Spark](https://reader031.vdocuments.net/reader031/viewer/2022030309/58f2d48b1a28abb04d8b45b7/html5/thumbnails/41.jpg)
Learn more at h2o.ai Follow us at @h2oai
Thank you!Sparkling Water is
open-source ML application platform
combining power of Spark and H2O