apache® spark™ mllib 2.x: migrating ml workloads to dataframes
TRANSCRIPT
![Page 1: Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames](https://reader034.vdocuments.net/reader034/viewer/2022051404/586f79d11a28ab10258b70a7/html5/thumbnails/1.jpg)
![Page 2: Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames](https://reader034.vdocuments.net/reader034/viewer/2022051404/586f79d11a28ab10258b70a7/html5/thumbnails/2.jpg)
Webinar Logistics
3
![Page 3: Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames](https://reader034.vdocuments.net/reader034/viewer/2022051404/586f79d11a28ab10258b70a7/html5/thumbnails/3.jpg)
Webinar Logistics
4
![Page 4: Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames](https://reader034.vdocuments.net/reader034/viewer/2022051404/586f79d11a28ab10258b70a7/html5/thumbnails/4.jpg)
About the speaker: Joseph Bradley
Joseph Bradley is a Software Engineer and Apache Spark Committer & PMC member working on Machine Learning at Databricks. Previously, he was a postdoc at UC Berkeley after receiving his Ph.D. in Machine Learning from Carnegie Mellon in 2013.
5
![Page 5: Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames](https://reader034.vdocuments.net/reader034/viewer/2022051404/586f79d11a28ab10258b70a7/html5/thumbnails/5.jpg)
About the speaker: Jules S. Damji
Jules S. Damji is an Apache Spark Community Evangelist with Databricks. He is a hands-on developer with over 15 years of experience and has worked at leading companies building large-scale distributed systems.
6
![Page 6: Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames](https://reader034.vdocuments.net/reader034/viewer/2022051404/586f79d11a28ab10258b70a7/html5/thumbnails/6.jpg)
DatabricksFounded by the creators of Apache Spark in 2013
Share of Spark code contributed by Databricksin 2014
75%
Data Value
Created Databricks on top of Spark to make big data simple.7
![Page 7: Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames](https://reader034.vdocuments.net/reader034/viewer/2022051404/586f79d11a28ab10258b70a7/html5/thumbnails/7.jpg)
…
Apache Spark Engine
Spark Core
Spark StreamingSpark SQL MLlib GraphX
Unified engine across diverse workloads & environments
Scale out, fault tolerant
Python, Java, Scala, & R APIs
Standard libraries
8
![Page 8: Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames](https://reader034.vdocuments.net/reader034/viewer/2022051404/586f79d11a28ab10258b70a7/html5/thumbnails/8.jpg)
9
![Page 9: Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames](https://reader034.vdocuments.net/reader034/viewer/2022051404/586f79d11a28ab10258b70a7/html5/thumbnails/9.jpg)
10
NOTABLE USERS THAT PRESENTED AT SPARK SUMMIT 2015 SAN FRANCISCO
Source: Slide 5 of Spark Community Update
![Page 10: Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames](https://reader034.vdocuments.net/reader034/viewer/2022051404/586f79d11a28ab10258b70a7/html5/thumbnails/10.jpg)
OutlineIntro to MLlib in 2.x
Migrating an ML workload to DataFrames
ML persistence
Roadmap ahead during 2.x
11
![Page 11: Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames](https://reader034.vdocuments.net/reader034/viewer/2022051404/586f79d11a28ab10258b70a7/html5/thumbnails/11.jpg)
OutlineIntro to MLlib in 2.x
Migrating an ML workload to DataFrames
ML persistence
Roadmap ahead during 2.x
12
![Page 12: Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames](https://reader034.vdocuments.net/reader034/viewer/2022051404/586f79d11a28ab10258b70a7/html5/thumbnails/12.jpg)
A bit of MLlib historySpark 0.8RDD-based API Fast, scale-out ML
Challenges• Expressing complex workflows• Integrating with DataFrames• Developing Java, Python & R APIs
Spark 1.2DataFrame-based API (a.k.a. “Spark ML”)
Major improvements• ML Pipelines with automated tuning• Native DataFrame integration• Standard API across languages
See Xiangrui Meng’s original design & prototype in SPARK-3530.
13
![Page 13: Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames](https://reader034.vdocuments.net/reader034/viewer/2022051404/586f79d11a28ab10258b70a7/html5/thumbnails/13.jpg)
MLlib trajectory
v0.8 v0.9 v1.0 v1.1 v1.2 v1.3 v1.4 v1.5 v1.6 v2.00
200400600800
1000
com
mits
/ re
leas
e
Scala/JavaAPI
Primary API for MLlib
Python API R API
DataFrame-basedAPI for MLlib
14
![Page 14: Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames](https://reader034.vdocuments.net/reader034/viewer/2022051404/586f79d11a28ab10258b70a7/html5/thumbnails/14.jpg)
DataFrame-based API for MLlibDataFrames are the standard ML dataset type.
Uniform APIs for algorithms, hyperparameters, etc.
Pipelines provide utilities for constructing ML workflows + automating hyperparameter tuning.
Learn more about ML Pipelines:http://spark-summit.org/2015/events/practical-machine-learning-pipelines-with-mllib-2http://docs.databricks.com/spark/latest/mllib/binary-classification-mllib-pipelines.html
15
![Page 15: Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames](https://reader034.vdocuments.net/reader034/viewer/2022051404/586f79d11a28ab10258b70a7/html5/thumbnails/15.jpg)
DataFrame-based API for MLlibIn 2.0, the DataFrame-based API became the primary MLlib API.• Voted by community• org.apache.spark.ml, pyspark.ml
The RDD-based API is in maintenance mode.• Still maintained with bug fixes, but no new features• org.apache.spark.mllib, pyspark.mllib
16
![Page 16: Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames](https://reader034.vdocuments.net/reader034/viewer/2022051404/586f79d11a28ab10258b70a7/html5/thumbnails/16.jpg)
OutlineIntro to MLlib in 2.x
Migrating an ML workload to DataFrames
ML persistence
Roadmap during 2.x
17
![Page 17: Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames](https://reader034.vdocuments.net/reader034/viewer/2022051404/586f79d11a28ab10258b70a7/html5/thumbnails/17.jpg)
Why migrate to DataFrames?DataFrames
Language APIs
Pipelines
DataFrames & Datasets are the new “core” API for Spark.• Data sources & ETL• Latest performance improvements (Catalyst & Tungsten)• Structured Streaming
18
![Page 18: Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames](https://reader034.vdocuments.net/reader034/viewer/2022051404/586f79d11a28ab10258b70a7/html5/thumbnails/18.jpg)
Why migrate to DataFrames?DataFrames
Language APIs
Pipelines
Standardized across Scala, Java, Python, and R• Python & R match Scala/Java performance• Cross-language persistence (saving/loading models)
19
![Page 19: Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames](https://reader034.vdocuments.net/reader034/viewer/2022051404/586f79d11a28ab10258b70a7/html5/thumbnails/19.jpg)
Why migrate to DataFrames?DataFrames
Language APIs
Pipelines Specify complex ML workflows• Chain together Transformers, Estimators, & Models• Automated hyperparameter tuning
20
![Page 20: Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames](https://reader034.vdocuments.net/reader034/viewer/2022051404/586f79d11a28ab10258b70a7/html5/thumbnails/20.jpg)
Demo migrationConvert a notebook from the RDD-based API to the DataFrame-based API.
Key points• Work with single models or complex Pipelines• Incremental migration• Many benefits: simpler APIs, SQL integration, tuning• A few gotchas (linear algebra types)
21
WarningDemo for experts!
![Page 21: Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames](https://reader034.vdocuments.net/reader034/viewer/2022051404/586f79d11a28ab10258b70a7/html5/thumbnails/21.jpg)
Demo recap: migration processSeparate 2 migrations:
• Spark 1.6 2.0• RDDs DataFrames
Migrate ML APIs: spark.mllib spark.ml• Gotcha: a few naming changes (from standardizing algorithm APIs)
• Certain Param and model methods• run() fit()
• Tips:• Use explainParams()• Compare the API docs if you hit issues!
Migrate data APIs: RDDs DataFrames• Tip: Get familiar with conversion syntax in both directions.
22
![Page 22: Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames](https://reader034.vdocuments.net/reader034/viewer/2022051404/586f79d11a28ab10258b70a7/html5/thumbnails/22.jpg)
Demo recap: migration processDebugging runtime errors
• Gotcha: Lazy evaluation in Pipelines means bugs appear later than expected.• Tip: Check intermediate results.
• Gotcha: Vector (and Matrix) types in spark.mllib and spark.ml.• Relevant for Spark 1.6 2.0 migration• Tip: Watch for buried errors: MatchError and mentions of “vector”• Tip: Use helper methods for conversion
• org.apache.spark.mllib.linalg.Vector.asML• org.apache.spark.mllib.linalg.Vectors.fromML• http://spark.apache.org/docs/latest/ml-guide.html#migration-guide
23
![Page 23: Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames](https://reader034.vdocuments.net/reader034/viewer/2022051404/586f79d11a28ab10258b70a7/html5/thumbnails/23.jpg)
Future benefits of migrationCurrentlyML training is implemented on RDDs.
GoalPort implementation to DataFrames.Benefit from DataFrame
optimizations (Catalyst, Tungsten).
Spark SQL MLlib
Core
RDDs
DataFramesDatasetsSQL
24
![Page 24: Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames](https://reader034.vdocuments.net/reader034/viewer/2022051404/586f79d11a28ab10258b70a7/html5/thumbnails/24.jpg)
Future benefits of migrationStatusFirst published implementation in GraphFrames (Spark package for graph processing)
Ongoing workDataFrame improvements for iterative algorithms: checkpointing, improved caching, and more.
Spark SQL
MLlib
Core
RDDs
DataFramesDatasetsSQL
25
![Page 25: Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames](https://reader034.vdocuments.net/reader034/viewer/2022051404/586f79d11a28ab10258b70a7/html5/thumbnails/25.jpg)
OutlineIntro to MLlib in 2.x
Migrating an ML workload to DataFrames
ML persistence
Roadmap during 2.x
26
![Page 26: Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames](https://reader034.vdocuments.net/reader034/viewer/2022051404/586f79d11a28ab10258b70a7/html5/thumbnails/26.jpg)
Why ML persistence?Data Science Software Engineering
Prototype (Python/R)Create model
Re-implement model for production (Java)
Deploy model
27
![Page 27: Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames](https://reader034.vdocuments.net/reader034/viewer/2022051404/586f79d11a28ab10258b70a7/html5/thumbnails/27.jpg)
Why ML persistence?Data Science Software Engineering
Prototype (Python/R)Create Pipeline• Extract raw features• Transform features• Select key features• Fit multiple models• Combine results to
make prediction
• Extra implementation work• Different code paths• Synchronization overhead
Re-implement Pipeline for production (Java)
Deploy Pipeline
28
![Page 28: Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames](https://reader034.vdocuments.net/reader034/viewer/2022051404/586f79d11a28ab10258b70a7/html5/thumbnails/28.jpg)
With ML persistence...Data Science Software Engineering
Prototype (Python/R)Create Pipeline
Persist model or Pipeline: model.save(“s3n://...”)
Load Pipeline (Scala/Java) Model.load(“s3n://…”)Deploy in production
29
![Page 29: Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames](https://reader034.vdocuments.net/reader034/viewer/2022051404/586f79d11a28ab10258b70a7/html5/thumbnails/29.jpg)
Model tuning
ML persistence status
Text preprocessing
Feature generation
Random forest
Unfitted Fitted
Model
Pipeline
“recipe” “result”
30
![Page 30: Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames](https://reader034.vdocuments.net/reader034/viewer/2022051404/586f79d11a28ab10258b70a7/html5/thumbnails/30.jpg)
ML persistence statusNear-complete coverage in all Spark language APIs• Scala & Java: complete• Python: complete except for 2 algorithms• R: complete for existing APIs
Single underlying implementation of models
Exchangeable data format• JSON for metadata• Parquet for model data (coefficients, etc.)
31
![Page 31: Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames](https://reader034.vdocuments.net/reader034/viewer/2022051404/586f79d11a28ab10258b70a7/html5/thumbnails/31.jpg)
Demo: ML persistence• Can persist single models & complex workflows
• Easy to move models across Spark deployments
• Share models across teams & languages
32
![Page 32: Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames](https://reader034.vdocuments.net/reader034/viewer/2022051404/586f79d11a28ab10258b70a7/html5/thumbnails/32.jpg)
ML persistence: pending issuesPython tuning: not yet implemented• CrossValidator, TrainValidationSplit
R format: incompatible with Python/Java/Scala• Issue: R wrappers are all special Pipelines.• Working towards a fix• Workaround: Load underlying PipelineModel from subfolder in
saved model directory.
Backwards compatibility: WIP in SPARK-15573
ML persistence blog post: http://databricks.com/blog/2016/05/31
33
![Page 33: Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames](https://reader034.vdocuments.net/reader034/viewer/2022051404/586f79d11a28ab10258b70a7/html5/thumbnails/33.jpg)
OutlineIntro to MLlib in 2.x
Migrating an ML workload to DataFrames
ML persistence
Roadmap during 2.x
34
![Page 34: Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames](https://reader034.vdocuments.net/reader034/viewer/2022051404/586f79d11a28ab10258b70a7/html5/thumbnails/34.jpg)
Goals for MLlib in 2.xMajor initiatives• ML persistence: saving &
loading models & Pipelines• Complete feature parity for
DataFrames-based API. Missing items:• Frequent Pattern Mining• Certain methods in models• Developer APIs
For an overview of MLlib in 2.0, seehttp://spark-summit.org/2016/events/apache-spark-mllib-20-preview-data-science-and-production
35
Other important improvements• Generalized Linear Models• Python & R API parity• Speed & scalability improvements
![Page 35: Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames](https://reader034.vdocuments.net/reader034/viewer/2022051404/586f79d11a28ab10258b70a7/html5/thumbnails/35.jpg)
Coming in 2.1Multiclass logistic regression (SPARK-7159)Locality sensitive hashing (SPARK-5992)More ML in SparkR (SPARK-16442)• ALS• Isotonic Regression• Multilayer Perceptron Classifier• Random Forest• Gaussian Mixture Model• LDA• Multiclass Logistic Regression• Gradient Boosted Trees
Various speed & scalability improvements• Random Forest, Naive Bayes, LDA, Gaussian Mixture, and others
Spark 2.1 status:Release candidates are under QA.
For release schedule, see http://spark.apache.org/versioning-policy.html
36
![Page 36: Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames](https://reader034.vdocuments.net/reader034/viewer/2022051404/586f79d11a28ab10258b70a7/html5/thumbnails/36.jpg)
Get startedGet involved in the community• Events & news https://sparkhub.databricks.com/• User mailing list http://spark.apache.org/
community.html
Get involved in development• Dev mailing list
http://spark.apache.org/community.html• JIRA http://issues.apache.org/jira/browse/SPARK• Contribute http://spark.apache.org/contributing.html
Try out Apache Spark for free on Databricks Community Edition!http://databricks.com/try
Many thanks to the Apache Spark community!
37
![Page 38: Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames](https://reader034.vdocuments.net/reader034/viewer/2022051404/586f79d11a28ab10258b70a7/html5/thumbnails/38.jpg)
Thank you!Twitter: @jkbatcmu