Recent Developments in SparkR for Advanced Analytics
Xiangrui Meng [email protected]
2016/06/07 - Spark Summit 2016
About Me
• Software Engineer at Databricks
• tech lead of machine learning and data science
• Committer and PMC member of Apache Spark
• Ph.D. from Stanford in computational mathematics
Outline
• Introduction to SparkR
• Descriptive analytics in SparkR
• Predictive analytics in SparkR
• Future directions
Introduction to SparkR
Bridging the gap between R and Big Data
SparkR
• Introduced in Spark 1.4
• Wrappers over DataFrames and DataFrame-based APIs
• SparkR APIs are made similar to existing ones in R (or R packages), rather than to the Python/Java/Scala APIs.
  • R is very convenient for analytics, and users love it.
  • Scalability, not the API, is the main issue.
DataFrame-based APIs
• Storage: S3 / HDFS / local / …
• Data sources: CSV / Parquet / JSON / …
• DataFrame operations:
  • select / subset / groupBy / agg / collect / …
  • rand / sample / avg / var / …
• Conversion to/from R data.frame
SparkR Architecture
[Architecture diagram: the R process talks through the R Backend to a JVM on the Spark Driver; the driver coordinates JVM Workers, which read from the Data Sources.]
Data Conversion between R and SparkR
[Diagram: SparkR::createDataFrame() ships an R data.frame across the R Backend to the JVM; SparkR::collect() brings a Spark DataFrame back into R.]
Descriptive Analytics
Big Data at a glimpse in SparkR
Summary Statistics
• count, min, max, mean, standard deviation, variance
  describe(df)
  df %>% groupBy("dept") %>% agg(avgAge = avg(df$age))
• covariance, correlation
  df %>% select(covar_samp(df$x, df$y), corr(df$x, df$y))
• skewness, kurtosis
  df %>% select(skewness(df$x), kurtosis(df$x))
Sampling Algorithms
• Bernoulli sampling (without replacement)
  df %>% sample(FALSE, 0.01)
• Poisson sampling (with replacement)
  df %>% sample(TRUE, 0.01)
• stratified sampling
  df %>% sampleBy("key", c(positive = 1.0, negative = 0.1))
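The three sampling schemes above can be sketched in plain Python. This is a toy single-machine illustration of the semantics, not SparkR's implementation; `poisson_draw` is a hypothetical helper using Knuth's method.

```python
import math
import random

random.seed(0)
rows = [("positive", i) for i in range(1000)] + [("negative", i) for i in range(1000)]

def bernoulli_sample(rows, fraction):
    """Without replacement: keep each row independently with probability `fraction`."""
    return [r for r in rows if random.random() < fraction]

def poisson_draw(lam):
    """Knuth's method: count uniform draws until their product falls below e^-lam."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while p > threshold:
        k += 1
        p *= random.random()
    return k - 1

def poisson_sample(rows, fraction):
    """With replacement: emit each row Poisson(fraction) times (0, 1, 2, ... copies)."""
    return [r for r in rows for _ in range(poisson_draw(fraction))]

def sample_by(rows, fractions):
    """Stratified: a separate Bernoulli fraction per key (first field of each row)."""
    return [r for r in rows if random.random() < fractions[r[0]]]

s1 = bernoulli_sample(rows, 0.01)
s2 = poisson_sample(rows, 0.01)
s3 = sample_by(rows, {"positive": 1.0, "negative": 0.1})
```

Because `random.random()` is in [0, 1), a fraction of 1.0 keeps every row of that stratum, which is why `sampleBy` is handy for keeping a rare class intact while downsampling the rest.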
Approximate Algorithms
• frequent items [Karp03]
  df %>% freqItems(c("title", "gender"), support = 0.01)
• approximate quantiles [Greenwald01]
  df %>% approxQuantile("value", c(0.1, 0.5, 0.9), relErr = 0.01)
• single pass with aggregate pattern
• trade-off between accuracy and space
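The frequent-items idea can be illustrated with a one-pass, counter-based sketch in the spirit of the [Karp03] algorithm. This is a simplified single-machine sketch; the distributed version applies the same counters per partition and merges them with the aggregate pattern.

```python
import math

def frequent_items(stream, support):
    """One pass over `stream`; returns a superset of the items whose
    frequency exceeds `support` (a fraction of the stream length)."""
    capacity = math.ceil(1.0 / support) - 1   # number of counters kept
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < capacity:
            counters[x] = 1
        else:
            # table full: decrement every counter (the arriving item is discarded)
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return set(counters)

# "spam" appears in 500 of 1500 records (33% > 20% support), so it must survive
stream = ["spam"] * 500 + ["user%d" % i for i in range(1000)]
hits = frequent_items(stream, support=0.2)
```

The trade-off mentioned above is visible here: with support s, only about 1/s counters are stored regardless of stream length, at the cost of possible false positives in the result.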
Implementation: Aggregation Pattern
split + aggregate + combine in a single pass:
• split data into multiple partitions
• calculate a partially aggregated result on each partition
• combine the partial results into the final result
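The pattern above can be sketched in plain Python for mean and variance: each partition is reduced to a small partial state (count, mean, M2) via Welford's online update, and partial states are merged with the standard pairwise formula of Chan et al. Higher moments such as kurtosis are merged analogously.

```python
from functools import reduce

def partial(partition):
    """Aggregate one partition into (count, mean, M2) with Welford's online update."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in partition:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)    # M2 = sum of squared deviations so far
    return n, mean, m2

def combine(a, b):
    """Merge two partial results into one (pairwise update of Chan et al.)."""
    na, ma, m2a = a
    nb, mb, m2b = b
    n = na + nb
    if n == 0:
        return 0, 0.0, 0.0
    delta = mb - ma
    mean = ma + delta * nb / n
    m2 = m2a + m2b + delta * delta * na * nb / n
    return n, mean, m2

partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]        # "split"
n, mean, m2 = reduce(combine, map(partial, partitions))  # "aggregate" + "combine"
variance = m2 / (n - 1)                                  # sample variance
```

Because `combine` is associative, the partial results can be merged in any order, which is exactly what makes the pattern work on distributed partitions.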
Implementation: High-Performance
• new online update formulas for summary statistics
• code generation to achieve high performance

kurtosis of 1 billion values on a MacBook Pro (2 cores):

| Implementation | Time |
| --- | --- |
| scipy.stats | 250 s |
| Octave | 120 s |
| CRAN::moments | 70 s |
| SparkR / Spark / PySpark | 5.5 s |
Predictive Analytics
Enabling large-scale machine learning in SparkR
MLlib + SparkR
MLlib and SparkR integration started in Spark 1.5.
API design choices:
1. mimic the methods implemented in R or R packages
   • no new methods to learn
   • similar but not the same / shadows existing methods
   • inconsistent APIs
2. create a new set of APIs
Generalized Linear Models (GLMs)
• Linear models are simple but extremely popular.
• A GLM is specified by the following:
  • a distribution of the response (from the exponential family),
  • a link function g such that g(E[y | x]) = x^T β.
• Fitting finds the coefficients β that maximize the sum of log-likelihoods.
Distributions and Link Functions
In Spark 2.0, SparkR supports all families supported by R.

| Model | Distribution | Link |
| --- | --- | --- |
| linear least squares | normal | identity |
| logistic regression | binomial | logit |
| Poisson regression | Poisson | log |
| gamma regression | gamma | inverse |
| … | … | … |
GLMs in SparkR
# Create the DataFrame for training
df <- read.df(sqlContext, "path/to/training")

# Fit a Gaussian linear model
model <- glm(y ~ x1 + x2, data = df, family = "gaussian")  # mimics R's glm
model <- spark.glm(df, y ~ x1 + x2, family = "gaussian")

# Get the model summary
summary(model)

# Make predictions
predict(model, newDF)
Implementation: SparkR::glm
`SparkR::glm` is a simple wrapper over an ML pipeline that consists of the following stages:
• RFormula, which itself embeds an ML pipeline for feature preprocessing and encoding,
• an estimator (GeneralizedLinearRegression).
Implementation: SparkR::glm

[Pipeline diagram: an RWrapper around RFormula followed by the GLM estimator; the RFormula stage itself expands into StringIndexer, VectorAssembler, and IndexToString stages.]
Implementation: R Formula
• R provides model formulas to express models.
• We support the following R formula operators in SparkR:
  • `~` separates target and terms
  • `+` concatenates terms; "+ 0" removes the intercept
  • `-` removes a term; "- 1" removes the intercept
  • `:` interaction (multiplication for numeric values, or binarized categorical values)
  • `.` all columns except the target
• The implementation is in Scala.
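A rough single-machine illustration of what formula expansion produces: for a formula like `y ~ x1 + x2 + g + x1:x2`, the result is an intercept column, the numeric terms, dummy columns for the categorical term, and the interaction product. The `design_matrix` helper below is hypothetical and far simpler than Spark's RFormula (which also handles metadata, encoding, and assembly into vectors).

```python
# toy rows: response y, numeric x1/x2, categorical g
rows = [
    {"y": 1.0, "x1": 2.0, "x2": 3.0, "g": "a"},
    {"y": 0.0, "x1": 1.0, "x2": 4.0, "g": "b"},
    {"y": 1.0, "x1": 5.0, "x2": 0.5, "g": "c"},
]

def design_matrix(rows, numeric, categorical, interactions, intercept=True):
    """Expand formula terms into a numeric matrix. Categorical columns are
    dummy-coded against their first level, as R's treatment contrasts do."""
    levels = {c: sorted({r[c] for r in rows}) for c in categorical}
    matrix = []
    for r in rows:
        feat = [1.0] if intercept else []            # "+ 0" / "- 1" would drop this
        feat += [float(r[c]) for c in numeric]       # "+" just concatenates terms
        for c in categorical:                        # one dummy per non-reference level
            feat += [1.0 if r[c] == lv else 0.0 for lv in levels[c][1:]]
        for a, b in interactions:                    # ":" multiplies numeric terms
            feat.append(float(r[a]) * float(r[b]))
        matrix.append(feat)
    return matrix

# y ~ x1 + x2 + g + x1:x2
X = design_matrix(rows, ["x1", "x2"], ["g"], [("x1", "x2")])
```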
Implementation: Test against R
Besides normal tests, we also verify our implementation using R.
/*
df <- as.data.frame(cbind(A, b))
for (formula in c(b ~ . -1, b ~ .)) {
  model <- lm(formula, data = df, weights = w)
  print(as.vector(coef(model)))
}
[1] -3.727121 3.009983
[1] 18.08 6.08 -0.60
*/
val expected = Seq(
  Vectors.dense(0.0, -3.727121, 3.009983),
  Vectors.dense(18.08, 6.08, -0.60))
ML Models in SparkR
• generalized linear models (GLMs)
  • glm / spark.glm (stats::glm)
• accelerated failure time (AFT) model for survival analysis
  • spark.survreg (survival)
• k-means clustering
  • spark.kmeans (stats::kmeans)
• Bernoulli naive Bayes
  • spark.naiveBayes (e1071)
Model Persistence in SparkR
• model persistence is supported for all ML models in SparkR
• thin wrappers over pipeline persistence from MLlib

model <- spark.glm(df, x ~ y + z, family = "gaussian")
write.ml(model, path)
model <- read.ml(path)
summary(model)

• saved models can be passed on to Scala/Java engineers
Work with R Packages in SparkR
• There are ~8,500 community packages on CRAN.
  • It is impossible for SparkR to match all existing features.
• Not every dataset is large.
  • Many people work with small/medium datasets.
• SparkR helps in those scenarios by:
  • connecting to different data sources,
  • filtering or downsampling big datasets,
  • parallelizing training/tuning tasks.
Work with R Packages in SparkR
df <- sqlContext %>% read.df(…) %>% collect()
points <- data.matrix(df)

run_kmeans <- function(k) {
  kmeans(points, centers = k)
}

kk <- 1:6
lapply(kk, run_kmeans)            # R's apply, sequential
spark.lapply(sc, kk, run_kmeans)  # parallelize the tasks
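The fan-out pattern behind spark.lapply can be mimicked on one machine with Python's concurrent.futures: map an ordinary function over a list of parameters, with each task fitted independently. This is a toy sketch; `run_kmeans` here is a minimal Lloyd's iteration on hard-coded points, standing in for a call into a real package (and threads only illustrate the pattern, not true cluster parallelism).

```python
from concurrent.futures import ThreadPoolExecutor

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (9.8, 10.0)]

def run_kmeans(k, iters=10):
    """Tiny Lloyd's k-means on the toy data; returns within-cluster SSE."""
    centers = points[:k]                      # deterministic init: first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]     # assign each point to nearest center
        for p in points:
            j = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2
                                + (p[1] - centers[c][1]) ** 2)
            clusters[j].append(p)
        centers = [                           # recompute centers; keep old if empty
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
            if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
    return sum(min((p[0] - cx) ** 2 + (p[1] - cy) ** 2 for cx, cy in centers)
               for p in points)

kk = [1, 2, 3]
with ThreadPoolExecutor() as pool:
    sse = list(pool.map(run_kmeans, kk))  # each k fits independently, like spark.lapply
```

Because the tasks share nothing but their inputs, the same shape scales out directly: spark.lapply ships the function and the small inputs to the workers and collects the results.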
summary(this.talk)
• SparkR enables big data analytics in R
  • descriptive analytics on top of DataFrames
  • predictive analytics from MLlib integration
• SparkR works well with existing R packages

Thanks to the Apache Spark community for developing and maintaining SparkR: Alteryx, Berkeley AMPLab, Databricks, Hortonworks, IBM, Intel, etc., and individual contributors!
Future Directions
• CRAN release of SparkR
• more consistent APIs with existing R packages: dplyr, etc.
• better R formula support
• more algorithms from MLlib: decision trees, ALS, etc.
• better integration with existing R packages: gapply / UDFs
• integration with Spark packages: GraphFrames, CoreNLP, etc.

We'd greatly appreciate feedback from the R community!
Try Apache Spark with Databricks
http://databricks.com/try
• Download a companion notebook of this talk at: http://dbricks.co/1rbujoD
• Try the latest version of Apache Spark and a preview of Spark 2.0
Thank you.

• SparkR user guide on the Apache Spark website
• MLlib roadmap for Spark 2.1
• Office hours:
  • 2-3:30pm at the Expo Hall Theater; 3:45-6pm at the Databricks booth
• Databricks Community Edition and blog posts