![Page 1: SparkR Under the Hood - Hossein Falaki · Making Big Data Simple PRODUCT Unified Analytics Platform. ... Each SparkR call sends serialized data over the socket and waits for response](https://reader030.vdocuments.net/reader030/viewer/2022041117/5f2c7e408a7dfd161d6c80f0/html5/thumbnails/1.jpg)
SparkR Under the Hood
How to debug your SparkR code
Hossein Falaki, June 2017
![Page 2: SparkR Under the Hood - Hossein Falaki · Making Big Data Simple PRODUCT Unified Analytics Platform. ... Each SparkR call sends serialized data over the socket and waits for response](https://reader030.vdocuments.net/reader030/viewer/2022041117/5f2c7e408a7dfd161d6c80f0/html5/thumbnails/2.jpg)
About me
• Software Engineer at Databricks Inc.
• Data Scientist at Apple Siri
• Using Spark since version 0.6
• Developed the first version of the Apache Spark CSV data source
• Developed Databricks R Notebooks
• Currently focusing on the R experience at Databricks
![Page 3: SparkR Under the Hood - Hossein Falaki · Making Big Data Simple PRODUCT Unified Analytics Platform. ... Each SparkR call sends serialized data over the socket and waits for response](https://reader030.vdocuments.net/reader030/viewer/2022041117/5f2c7e408a7dfd161d6c80f0/html5/thumbnails/3.jpg)
About Databricks
TEAM: Started the Spark project (now Apache Spark) at UC Berkeley in 2009
MISSION: Making Big Data Simple
PRODUCT: Unified Analytics Platform
![Page 4: SparkR Under the Hood - Hossein Falaki · Making Big Data Simple PRODUCT Unified Analytics Platform. ... Each SparkR call sends serialized data over the socket and waits for response](https://reader030.vdocuments.net/reader030/viewer/2022041117/5f2c7e408a7dfd161d6c80f0/html5/thumbnails/4.jpg)
About this talk
What this talk is NOT:
• Introduction to SparkR API
• Introducing new features
• How to use SparkR
What this talk IS:
• SparkR architecture
• SparkR implementation
• Common performance bottlenecks
• Common sources of error
• How to debug your code
![Page 5: SparkR Under the Hood - Hossein Falaki · Making Big Data Simple PRODUCT Unified Analytics Platform. ... Each SparkR call sends serialized data over the socket and waits for response](https://reader030.vdocuments.net/reader030/viewer/2022041117/5f2c7e408a7dfd161d6c80f0/html5/thumbnails/5.jpg)
Outline
• Architecture
• Implementation
• Limitations
• Common errors and problems
• How to debug your code
![Page 6: SparkR Under the Hood - Hossein Falaki · Making Big Data Simple PRODUCT Unified Analytics Platform. ... Each SparkR call sends serialized data over the socket and waits for response](https://reader030.vdocuments.net/reader030/viewer/2022041117/5f2c7e408a7dfd161d6c80f0/html5/thumbnails/6.jpg)
What is SparkR
R package distributed with Apache Spark
• Provides R front-end to Apache Spark
• Exposes Spark DataFrames (inspired by R & Pandas)
• Convenient interoperability between R and Spark DataFrames
Spark: robust distributed processing, data sources, off-memory data
+
R: dynamic environment, interactivity, 10K+ packages, visualizations
![Page 7: SparkR Under the Hood - Hossein Falaki · Making Big Data Simple PRODUCT Unified Analytics Platform. ... Each SparkR call sends serialized data over the socket and waits for response](https://reader030.vdocuments.net/reader030/viewer/2022041117/5f2c7e408a7dfd161d6c80f0/html5/thumbnails/7.jpg)
SparkR architecture
[Diagram: Spark driver containing an R process connected through the R Backend to the driver JVM; two workers, each running a JVM; data sources]
![Page 8: SparkR Under the Hood - Hossein Falaki · Making Big Data Simple PRODUCT Unified Analytics Platform. ... Each SparkR call sends serialized data over the socket and waits for response](https://reader030.vdocuments.net/reader030/viewer/2022041117/5f2c7e408a7dfd161d6c80f0/html5/thumbnails/8.jpg)
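SparkR architecture (2.x)
[Diagram: Spark driver containing an R process connected through the R Backend to the driver JVM; two workers, each running a JVM with R processes alongside it; data sources]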
![Page 9: SparkR Under the Hood - Hossein Falaki · Making Big Data Simple PRODUCT Unified Analytics Platform. ... Each SparkR call sends serialized data over the socket and waits for response](https://reader030.vdocuments.net/reader030/viewer/2022041117/5f2c7e408a7dfd161d6c80f0/html5/thumbnails/9.jpg)
Driver implementation
1. RBackend opens a server port and waits for connections
2. SparkR establishes socket connections
3. Each SparkR call sends serialized data over the socket and waits for a response
4. RBackendHandler handles and processes requests
[Diagram: R process connected to the JVM backend]
![Page 10: SparkR Under the Hood - Hossein Falaki · Making Big Data Simple PRODUCT Unified Analytics Platform. ... Each SparkR call sends serialized data over the socket and waits for response](https://reader030.vdocuments.net/reader030/viewer/2022041117/5f2c7e408a7dfd161d6c80f0/html5/thumbnails/10.jpg)
SparkR Serialization
[Diagram: R process connected to the JVM backend]
R and the JVM use a proprietary serialization format as the wire protocol.
• Basic type: type, binary data
• Lists: type, size, element 1, element 2, element 3, ...
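The layout above can be sketched in base R. This is only an illustration of a type-tagged, length-prefixed encoding, not SparkR's actual wire format: the `write_typed` helper and its one-letter tags are hypothetical, and it handles length-one values only.

```r
# Hypothetical encoder: same shape as the slide (type tag, then payload),
# NOT the real SparkR protocol.
write_typed <- function(con, x) {
  if (is.list(x)) {
    writeBin(charToRaw("l"), con)                  # type tag for a list
    writeBin(length(x), con, endian = "big")       # size as a 4-byte integer
    for (element in x) write_typed(con, element)   # each element carries its own tag
  } else if (is.integer(x)) {
    writeBin(charToRaw("i"), con)
    writeBin(x, con, endian = "big")               # 4 bytes
  } else if (is.double(x)) {
    writeBin(charToRaw("d"), con)
    writeBin(x, con, endian = "big")               # 8 bytes
  } else if (is.character(x)) {
    payload <- charToRaw(x)
    writeBin(charToRaw("c"), con)
    writeBin(length(payload), con, endian = "big") # byte count before the bytes
    writeBin(payload, con)
  } else {
    stop("unsupported type: ", class(x)[1])
  }
}

con <- rawConnection(raw(0), "wb")
write_typed(con, list(1L, "ab"))
bytes <- rawConnectionValue(con)
close(con)
length(bytes)   # tag + size + (tag + int) + (tag + length + 2 bytes)
```

The receiving side reads a tag first and then knows how many bytes to expect, which is why a corrupted or oversized length field shows up as a de-serialization error rather than as bad data.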
![Page 11: SparkR Under the Hood - Hossein Falaki · Making Big Data Simple PRODUCT Unified Analytics Platform. ... Each SparkR call sends serialized data over the socket and waits for response](https://reader030.vdocuments.net/reader030/viewer/2022041117/5f2c7e408a7dfd161d6c80f0/html5/thumbnails/11.jpg)
A simple SparkR query
[Diagram: R process on one side, JVM on the other]
1. serialize method name + arguments
2. send to backend
3. de-serialize
4. find Spark method
5. invoke method
6. serialize returned value
7. send to R process
8. de-serialize and return result to user
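The eight steps can be mimicked in a single R process. The sketch below is a toy: `backend_methods`, `backend_handle` and `invoke_backend` are hypothetical names, base R's `serialize()` stands in for the SparkR wire format, and a direct function call stands in for the socket.

```r
# Toy stand-in for the JVM side: a registry of callable methods.
backend_methods <- list(
  count = function(x) length(x),
  limit = function(x, n) head(x, n)
)

# Steps 3-6: de-serialize the request, find the method, invoke it,
# and serialize the returned value.
backend_handle <- function(request_bytes) {
  request <- unserialize(request_bytes)
  method <- backend_methods[[request$method]]
  if (is.null(method)) {
    stop("No matched method found for ", request$method)
  }
  serialize(do.call(method, request$args), NULL)
}

# Steps 1-2 and 7-8: serialize name + arguments, "send" them,
# then de-serialize the response for the user.
invoke_backend <- function(method, ...) {
  request_bytes <- serialize(list(method = method, args = list(...)), NULL)
  response_bytes <- backend_handle(request_bytes)  # a socket in the real system
  unserialize(response_bytes)
}

invoke_backend("count", c(10, 20, 30))
invoke_backend("limit", 1:10, 3L)
```

Every call pays for two serializations and a round trip, and a method name or argument list the backend cannot match fails at step 4.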
![Page 12: SparkR Under the Hood - Hossein Falaki · Making Big Data Simple PRODUCT Unified Analytics Platform. ... Each SparkR call sends serialized data over the socket and waits for response](https://reader030.vdocuments.net/reader030/viewer/2022041117/5f2c7e408a7dfd161d6c80f0/html5/thumbnails/12.jpg)
What can go wrong?
[Diagram: R process on one side, JVM on the other]
1. serialize method name + arguments
2. send to backend
3. de-serialize
4. find Spark method
5. invoke method
6. serialize returned value
7. send to R process
8. de-serialize and return result to user
![Page 13: SparkR Under the Hood - Hossein Falaki · Making Big Data Simple PRODUCT Unified Analytics Platform. ... Each SparkR call sends serialized data over the socket and waits for response](https://reader030.vdocuments.net/reader030/viewer/2022041117/5f2c7e408a7dfd161d6c80f0/html5/thumbnails/13.jpg)
Serialization & deserialization
Memory allocation in R
Error in writeBin(batch, con, endian = "big")
attempting to add too many elements to raw vector
De-serialization in JVM
ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.NegativeArraySizeException
org.apache.spark.api.r.SerDe$.readStringBytes(SerDe.scala:110)
at org.apache.spark.api.r.SerDe$.readString(SerDe.scala:119)
![Page 14: SparkR Under the Hood - Hossein Falaki · Making Big Data Simple PRODUCT Unified Analytics Platform. ... Each SparkR call sends serialized data over the socket and waits for response](https://reader030.vdocuments.net/reader030/viewer/2022041117/5f2c7e408a7dfd161d6c80f0/html5/thumbnails/14.jpg)
Serialization & deserialization
Corner case with types
Lost task 0.3 in stage 52.0 (TID 10114, 10.0.229.211): java.lang.RuntimeException: java.lang.Double is not a valid external type for schema of date
Corner case with types
org.apache.spark.SparkException: Job aborted due to stage failure:
java.lang.IllegalArgumentException
at java.sql.Date.valueOf(Date.java:143)
at org.apache.spark.api.r.SerDe$.readDate(SerDe.scala:128)
at org.apache.spark.api.r.SerDe$.readTypedObject(SerDe.scala:77)
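One plausible way to end up with a double where a date is expected (an assumption; the slides do not show the code that produced these errors) is that an R `Date` is a double with a class attribute, and some base R functions drop that attribute. A base R sketch:

```r
d <- as.Date("2017-06-01")
typeof(d)    # "double": days since 1970-01-01
class(d)     # "Date"

# ifelse() drops attributes, so the result is a plain number, not a Date.
stripped <- ifelse(TRUE, d, NA)
class(stripped)   # "numeric"

# Restoring the class explicitly turns the value back into a Date.
restored <- as.Date(stripped, origin = "1970-01-01")
class(restored)   # "Date"
```

Checking `class()` on the columns of a value returned from a UDF against the declared schema is a cheap way to catch this kind of mismatch before it reaches the JVM.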
![Page 15: SparkR Under the Hood - Hossein Falaki · Making Big Data Simple PRODUCT Unified Analytics Platform. ... Each SparkR call sends serialized data over the socket and waits for response](https://reader030.vdocuments.net/reader030/viewer/2022041117/5f2c7e408a7dfd161d6c80f0/html5/thumbnails/15.jpg)
Method signature matching and invocation
RBackendHandler: dfToCols on org.apache.spark.sql.api.r.SQLUtils failed
java.lang.Exception: No matched method found for class org.apache.spark.sql.api.r.SQLUtils.dfToCols
![Page 16: SparkR Under the Hood - Hossein Falaki · Making Big Data Simple PRODUCT Unified Analytics Platform. ... Each SparkR call sends serialized data over the socket and waits for response](https://reader030.vdocuments.net/reader030/viewer/2022041117/5f2c7e408a7dfd161d6c80f0/html5/thumbnails/16.jpg)
A complex SparkR query
[Diagram: R process and JVM on the driver; JVM and R process on the worker]
1. serialize R closure
2. transfer over local socket
3. transfer serialized closure over the network
4. transfer over local socket
5. de-serialize closure
6. execution
7. serialize result
8. transfer over local socket
9. transfer serialized result over the network
10. transfer over local socket
11. de-serialize result
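The round trip of a closure can be simulated in one R process. This is a sketch under stated assumptions: base R's `serialize()` stands in for SparkR's closure serialization, `split()` stands in for partitioning, and `keep_errors` and `run_on_worker` are hypothetical names.

```r
# The user's function (the "closure"); it sees one partition at a time.
keep_errors <- function(part) part[part$type == "ERROR", , drop = FALSE]

logs <- data.frame(
  type = c("INFO", "ERROR", "WARN", "ERROR"),
  msg  = c("a", "b", "c", "d"),
  stringsAsFactors = FALSE
)
partitions <- split(logs, c(1, 1, 2, 2))   # two pretend partitions

# Step 1: serialize the closure on the driver.
# Steps 2-4 would move these bytes over sockets and the network.
closure_bytes <- serialize(keep_errors, NULL)

# Steps 5-7: a worker de-serializes the closure, executes it on its
# partition, and serializes the result.
run_on_worker <- function(closure_bytes, part) {
  fn <- unserialize(closure_bytes)
  serialize(fn(part), NULL)
}
result_bytes <- unname(lapply(partitions, run_on_worker, closure_bytes = closure_bytes))

# Steps 8-11: results travel back and are de-serialized on the driver.
result <- do.call(rbind, lapply(result_bytes, unserialize))
result$msg
```

The closure bytes are shipped once per task, so anything the function drags along is paid for on every partition.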
![Page 18: SparkR Under the Hood - Hossein Falaki · Making Big Data Simple PRODUCT Unified Analytics Platform. ... Each SparkR call sends serialized data over the socket and waits for response](https://reader030.vdocuments.net/reader030/viewer/2022041117/5f2c7e408a7dfd161d6c80f0/html5/thumbnails/18.jpg)
Common problems when using UDFs
• Skew in data
  • Are partitions evenly sized?
• Packing too much data in the closure
  • Auxiliary data
    • Can be joined with input DataFrame
    • Can be distributed to all the workers
• Returned data schema
![Page 19: SparkR Under the Hood - Hossein Falaki · Making Big Data Simple PRODUCT Unified Analytics Platform. ... Each SparkR call sends serialized data over the socket and waits for response](https://reader030.vdocuments.net/reader030/viewer/2022041117/5f2c7e408a7dfd161d6c80f0/html5/thumbnails/19.jpg)
Practical guide to debug SparkR code
![Page 20: SparkR Under the Hood - Hossein Falaki · Making Big Data Simple PRODUCT Unified Analytics Platform. ... Each SparkR call sends serialized data over the socket and waits for response](https://reader030.vdocuments.net/reader030/viewer/2022041117/5f2c7e408a7dfd161d6c80f0/html5/thumbnails/20.jpg)
Get used to reading Java stack traces
• Often the root cause is at the bottom of the stack trace
• Stack trace includes both driver and executor exceptions
• In many cases the R worker error is included in the exception message
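A small helper can pull the likely root cause out of a long trace. The sketch below assumes the error text has already been captured as one string (for example with `conditionMessage()` inside `tryCatch()`); `root_cause_line` is a hypothetical helper, and the sample text is the trace shown later in this deck.

```r
# Return the last "Caused by:" line, or the last non-empty line if there is none.
root_cause_line <- function(message_text) {
  lines <- trimws(strsplit(message_text, "\n", fixed = TRUE)[[1]])
  caused_by <- grep("^Caused by:", lines, value = TRUE)
  if (length(caused_by) > 0) {
    tail(caused_by, 1)
  } else {
    tail(lines[nzchar(lines)], 1)
  }
}

trace <- paste(
  "Job aborted due to stage failure: java.lang.ArrayIndexOutOfBoundsException",
  "Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)",
  "...",
  "Caused by: java.lang.ArrayIndexOutOfBoundsException",
  sep = "\n"
)
root_cause_line(trace)
```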
![Page 21: SparkR Under the Hood - Hossein Falaki · Making Big Data Simple PRODUCT Unified Analytics Platform. ... Each SparkR call sends serialized data over the socket and waits for response](https://reader030.vdocuments.net/reader030/viewer/2022041117/5f2c7e408a7dfd161d6c80f0/html5/thumbnails/21.jpg)
data.frame vs. DataFrame
• ... doesn't know how to deal with data of class SparkDataFrame
• no method for coercing this S4 class to a ...
• Expressions other than filtering predicates are not supported in the first parameter of extract operator.
![Page 22: SparkR Under the Hood - Hossein Falaki · Making Big Data Simple PRODUCT Unified Analytics Platform. ... Each SparkR call sends serialized data over the socket and waits for response](https://reader030.vdocuments.net/reader030/viewer/2022041117/5f2c7e408a7dfd161d6c80f0/html5/thumbnails/22.jpg)
R function vs. SparkSQL expression
Expressions translate to JVM calls, but functions run in the R process of the driver or workers
• filter(logs, logs$type == "ERROR")
• ifelse(df$level > 2, "deep", "shallow")
• dapply(logs, function(x) {
subset(x, type == "ERROR")
}, schema(logs))
![Page 23: SparkR Under the Hood - Hossein Falaki · Making Big Data Simple PRODUCT Unified Analytics Platform. ... Each SparkR call sends serialized data over the socket and waits for response](https://reader030.vdocuments.net/reader030/viewer/2022041117/5f2c7e408a7dfd161d6c80f0/html5/thumbnails/23.jpg)
Special characters in schema names
• ‘.’ is a special character in Spark
• Sometimes SparkR automatically converts ‘.’ to ‘_’ in column names
In FUN(X[[i]], ...) :
Use Sepal_Length instead of Sepal.Length as column name
• Sometimes, names are not transformed and you may end up with ‘.’ in column names
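One way to sidestep the ambiguity, sketched in base R: rename the columns of a local data.frame yourself before handing it to SparkR, so the names are the same on both sides. The choice of `_` as the replacement mirrors the warning above; `local_df` is just an illustrative name.

```r
local_df <- iris   # built-in data set whose column names contain '.'

# fixed = TRUE treats "." as a literal dot rather than a regex wildcard.
names(local_df) <- gsub(".", "_", names(local_df), fixed = TRUE)
names(local_df)
```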
![Page 24: SparkR Under the Hood - Hossein Falaki · Making Big Data Simple PRODUCT Unified Analytics Platform. ... Each SparkR call sends serialized data over the socket and waits for response](https://reader030.vdocuments.net/reader030/viewer/2022041117/5f2c7e408a7dfd161d6c80f0/html5/thumbnails/24.jpg)
Packing too much into the closure
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...):
org.apache.spark.SparkException: Job aborted due to stage failure: Serialized task 29877:0 was 520644552 bytes, which exceeds max allowed: spark.rpc.message.maxSize (268435456 bytes).
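The limit in this message, 268435456 bytes, is 256 MiB. A rough way to see how much a function drags along is to serialize it locally. This is a base R approximation: SparkR has its own logic for capturing the variables a function refers to, so the numbers are indicative only, and `make_udf` and `closure_size` are hypothetical helpers.

```r
# A function factory: the returned function carries `lookup` in its environment.
make_udf <- function(lookup) {
  force(lookup)
  function(part) part[part$key %in% lookup, , drop = FALSE]
}

small_udf <- make_udf(c(1, 2, 3))
big_udf   <- make_udf(rnorm(1e5))   # about 800 KB of doubles captured

# Size in bytes of the function plus everything in its enclosing environment.
closure_size <- function(fn) length(serialize(fn, NULL))

closure_size(small_udf)
closure_size(big_udf)
```

If the captured data is large, the options listed earlier in the deck apply: join it with the input DataFrame, or distribute it to the workers by other means instead of carrying it in the closure.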
![Page 25: SparkR Under the Hood - Hossein Falaki · Making Big Data Simple PRODUCT Unified Analytics Platform. ... Each SparkR call sends serialized data over the socket and waits for response](https://reader030.vdocuments.net/reader030/viewer/2022041117/5f2c7e408a7dfd161d6c80f0/html5/thumbnails/25.jpg)
Workers returning empty results
Job aborted due to stage failure: java.lang.ArrayIndexOutOfBoundsException
Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
...
Caused by: java.lang.ArrayIndexOutOfBoundsException
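A hedged sketch of a defensive pattern, assuming the failure comes from a UDF that returns nothing (for example `NULL`) for some partitions: return a zero-row data.frame that still has the declared columns. The function names are hypothetical, and this is base R standing in for the body of a function passed to `dapply`.

```r
# Returns NULL when a partition has no matching rows: the column structure is lost.
risky_udf <- function(part) {
  hits <- part[part$type == "ERROR", , drop = FALSE]
  if (nrow(hits) == 0) return(NULL)
  hits
}

# Always returns a data.frame: zero rows, but the same columns as the input.
safe_udf <- function(part) {
  part[part$type == "ERROR", , drop = FALSE]
}

partition_without_errors <- data.frame(
  type = c("INFO", "WARN"),
  msg  = c("a", "c"),
  stringsAsFactors = FALSE
)

is.null(risky_udf(partition_without_errors))
names(safe_udf(partition_without_errors))
```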
![Page 26: SparkR Under the Hood - Hossein Falaki · Making Big Data Simple PRODUCT Unified Analytics Platform. ... Each SparkR call sends serialized data over the socket and waits for response](https://reader030.vdocuments.net/reader030/viewer/2022041117/5f2c7e408a7dfd161d6c80f0/html5/thumbnails/26.jpg)
Try Apache Spark in Databricks!
DATABRICKS RUNTIME 3.0
• Apache Spark - optimized for the cloud
• Caching and optimization layer - DBIO
• Enterprise security – DBES
• Support for sparklyr
UNIFIED ANALYTICS PLATFORM
• Collaborative cloud environment
• Free version (community edition)
Try for free today. databricks.com
![Page 27: SparkR Under the Hood - Hossein Falaki · Making Big Data Simple PRODUCT Unified Analytics Platform. ... Each SparkR call sends serialized data over the socket and waits for response](https://reader030.vdocuments.net/reader030/viewer/2022041117/5f2c7e408a7dfd161d6c80f0/html5/thumbnails/27.jpg)
Thank You Hossein Falaki @mhfalaki