A Parallel R Framework for Processing Large Dataset on Distributed Systems
Nov. 17, 2013
This work is initiated and supported by Huawei Technologies
Rise of Data-Intensive Analytics Data Sources
Personal data from the internet
• Query history
• Click stream logging
• Tweets
• …
Machine-generated data
• Sensor networks
• Genome sequencing
• Physics experiments
• Satellite imaging data
New Challenges
Huge demand for data analysts, statisticians & data scientists
Traditional tools work on summary or sampled data
A good tool for large-scale data:
• Usability: stick to traditional semantics
• Performance: distributed parallelism
• Fault tolerance: MapReduce
Figure labels: GB for SAS, MB for R
Usability
R is used by about 30% of data analysts
Why R:
• Data structures
• Functional language
• Rich functionality
• Graphic visualization
• Open source
Why not R:
• Single-threaded
• Limited memory
From: Survey by Revolution Analytics; http://r4stats.com/articles/popularity/
Performance - Spark Framework Distributed Computing with Fault-tolerance
Developed at the AMPLab, UC Berkeley
Flexible programming model
• DAG job scheduler
Performance
• In-memory
• Good for iterative algorithms
Resilient Distributed Datasets (RDDs)
• Recover from loss and failures
RABID Package Structure
Figure: RABID package structure. Labeled components: R user analytics application; RABID as an R extension package (DM functions, matrix ops, lower-level ops); optimizer and scheduler; server; Spark; storage (HDFS/HBase).
Runtime Overview
Figure: runtime architecture. Labeled components: master (web server, R driver session, symbol tables, R scripts); Spark (scheduler, optimizer, fault-tolerant); slaves, each with workers that run tasks in R worker processes; data shuffle between slaves.
Programming Model
R List – Most general data structure
• Collection to store elements of any & different types
• Similar to Python tuples
• Very general and flexible
BigList – distributed list structure
• Extended R List to be distributed
• Override R list functions to support BigList
• Building blocks for higher level structures and functions
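The idea of overriding list functions for a new list type can be sketched with base R's S3 dispatch. This is only an illustration under stated assumptions: `toy_biglist` is a hypothetical local class that keeps its elements in chunks on one machine, and masking `lapply` with a generic is one possible mechanism, not necessarily how the Rabid package does it.

```r
# Toy stand-in for a distributed list: the elements are held as local chunks.
# (Hypothetical class and helper names; this is not the Rabid implementation.)
toy_biglist <- function(x, chunk_size = 2) {
  chunks <- unname(split(x, ceiling(seq_along(x) / chunk_size)))
  structure(chunks, class = "toy_biglist")
}

# Turn lapply into an S3 generic so that it can dispatch on the new class.
lapply <- function(X, FUN, ...) UseMethod("lapply")
lapply.default <- function(X, FUN, ...) base::lapply(X, FUN, ...)

# Method for the toy class: apply FUN chunk by chunk and keep the chunked layout.
lapply.toy_biglist <- function(X, FUN, ...) {
  chunks <- base::lapply(unclass(X), function(chunk) base::lapply(chunk, FUN, ...))
  structure(chunks, class = "toy_biglist")
}

# Collect the chunks back into an ordinary R list.
as.list.toy_biglist <- function(x, ...) {
  unlist(unclass(x), recursive = FALSE, use.names = FALSE)
}

big <- toy_biglist(list(1, 2, 3, 4, 5))
doubled <- as.list(lapply(big, function(v) v * 2))   # same call shape as for a plain list
plain <- lapply(list(1, 2, 3), function(v) v * 2)    # plain lists still use base::lapply
```

The user-facing call stays `lapply(x, f)` in both cases; only the class of `x` decides which implementation runs.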
RABID Example (1) Sample APIs
library("Rabid")  # Load the Rabid package
DATA <- rb.readLines("hdfs://…")  # Read text into a BigList
DATA <- lapply(DATA, as.numeric, cache=TRUE)  # Apply function in parallel
centroids <- as.list(sample(DATA, 16))  # Transform back to R list
func <- function(a) { ... }
DATA2 <- lapply(DATA, func)  # Apply UDF in parallel
DATA3 <- aggregate(DATA2, by=id, FUN=mean)  # Aggregate by user-specified keys
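For readers without a cluster, the same sequence of steps can be tried on a single machine with base R only. The sketch below is an assumption-laden analogue, not Rabid code: the input vector stands in for text read from HDFS, the centroids are fixed instead of sampled, and `nearest` is a hypothetical user-defined function.

```r
# Local, single-machine analogue using only base R (not the Rabid package).
lines <- c("1.0", "1.2", "0.8", "9.0", "9.5", "10.1")  # stand-in for text read from HDFS
data <- lapply(lines, as.numeric)                      # parse each record
centroids <- c(1, 9)                                   # fixed here; the slide samples them

# UDF: tag each value with the index of its nearest centroid.
nearest <- function(a) list(id = which.min(abs(a - centroids)), value = a)
tagged <- lapply(data, nearest)

# Aggregate by key: mean of the values assigned to each centroid.
df <- data.frame(id = sapply(tagged, `[[`, "id"),
                 value = sapply(tagged, `[[`, "value"))
new_centroids <- aggregate(value ~ id, data = df, FUN = mean)
```

This is one k-means update step written with `lapply` and `aggregate`, the two list operations the slide highlights.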
RABID Example (2) Sample APIs
library("Rabid")
mat <- rb.read.matrix("hdfs://…", cache=FALSE)
mat1 <- mat + mat
mat2 <- mat %*% mat
t(mat)
cor(mat)
Data Blocking for lapply()
Figure: the text input is stored as HDFS blocks; each Hadoop split goes to a slave (Slave 1, Slave 2, …), where it is blocked into R lists and processed by lapply.
The data is blocked further so that it can be transferred and processed more efficiently.
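Why blocking helps can be illustrated with base R's `serialize`. The sketch below is a local experiment under assumptions (1000 short records, a block size of 100 chosen arbitrarily); it compares the bytes produced when every record is serialized on its own with the bytes produced when records are serialized in blocks, and shows that a block can be processed with one vectorized call.

```r
records <- as.character(seq_len(1000))  # stand-in for lines of text input
block_size <- 100                       # illustrative value, not from the slides

# One serialized message per record.
bytes_per_record <- sum(vapply(records,
                               function(r) length(serialize(r, NULL)),
                               integer(1)))

# One serialized message per block of records.
blocks <- split(records, ceiling(seq_along(records) / block_size))
bytes_per_block <- sum(vapply(blocks,
                              function(b) length(serialize(b, NULL)),
                              integer(1)))

# Processing a block is a single vectorized call instead of one call per record.
parsed <- unlist(lapply(blocks, as.numeric), use.names = FALSE)
```

Each serialized object carries a fixed header, so fewer and larger messages mean fewer bytes overall and fewer function calls on the R side.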
Data Blocking for aggregate()
Figure: records on Slave 1 and Slave 2 are tagged with hash keys (hash key 1, hash key 2). Data shuffling regroups them so that the records for each hash key sit together on one slave, and aggregate then runs on each slave. Shuffled records carry the user's key and aggregated values.
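The shuffle-then-aggregate pattern can be sketched locally in base R. Everything here is illustrative: the two data frames stand in for the records held by two slaves, and `hash_key` is a toy hash invented for this example rather than the partitioner Spark or RABID uses.

```r
n_slaves <- 2
hash_key <- function(k) sum(utf8ToInt(k)) %% n_slaves + 1  # toy hash, hypothetical

# Records initially spread over two slaves, as (key, value) pairs.
slave1 <- data.frame(key = c("a", "b", "a"), value = c(1, 2, 3),
                     stringsAsFactors = FALSE)
slave2 <- data.frame(key = c("b", "c", "a"), value = c(4, 5, 6),
                     stringsAsFactors = FALSE)

# Shuffle: route each record to the slave that owns its hash key.
all_records <- rbind(slave1, slave2)
owner <- vapply(all_records$key, hash_key, numeric(1))
shuffled <- split(all_records, owner)

# Aggregate independently on each slave, then collect the results.
partials <- lapply(shuffled, function(part) {
  aggregate(value ~ key, data = part, FUN = mean)
})
result <- do.call(rbind, partials)
result <- result[order(result$key), ]
```

Because all records with the same key end up in the same partition, each partition can compute its means without looking at any other partition.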
Distributing Computation
Computations are abstracted as R functions, which are serialized, sent to the nodes, and evaluated there
R uses lexical scoping: free variables are looked up in the function's enclosing environments
We need to ship each function together with the values of the free variables in its environment
z <- 1
func1 <- function() {
  y <- 2
  func2 <- function(x) { x + y + z }
}
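One way to collect the free variables of a function before shipping it can be sketched in base R. This is a simplified illustration, not RABID's mechanism: `capture_free_vars` is a hypothetical helper, it treats every non-argument symbol in the body as a candidate (so it over-approximates when the body assigns local variables), and it skips names that resolve to base R functions.

```r
# Collect the values of symbols used in f's body that are not arguments of f.
capture_free_vars <- function(f) {
  candidates <- setdiff(all.vars(body(f)), names(formals(f)))
  env <- environment(f)
  values <- list()
  for (name in candidates) {
    if (exists(name, envir = env, inherits = TRUE)) {
      values[[name]] <- get(name, envir = env, inherits = TRUE)
    }
  }
  values
}

z <- 1
func1 <- function() {
  y <- 2
  function(x) { x + y + z }
}
func2 <- func1()

vars <- capture_free_vars(func2)  # values of y (from func1) and z (from the caller)

# "Worker side": rebuild the function in a fresh environment that holds
# only the shipped values, with no access to the driver's variables.
worker_fun <- func2
environment(worker_fun) <- list2env(vars, parent = baseenv())
answer <- worker_fun(10)
```

The rebuilt function no longer depends on the driver session, which is the property a remote R worker needs.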
Merging Deferred Operations
Two major overheads of each RABID operation:
1) data transfer
2) serialization/deserialization
Merging adjacent deferred operations into one reduces these overheads
Which operations can be merged: non-aggregation operations
Merging Deferred Operations
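Figure: before merging, f is applied to x and y to produce a and b, and g(a, b) then runs as a separate operation. After merging, a single operation computes g[f(x), f(y)] directly from x and y.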
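For element-wise operations, merging amounts to function composition. The sketch below is a minimal local illustration in base R with hypothetical names (`compose`, `deferred`); it only shows that one pass with the composed function gives the same result as two separate passes, not how RABID tracks deferred operations.

```r
# Compose two functions: apply f first, then g.
compose <- function(f, g) {
  force(f); force(g)  # evaluate the arguments now, not lazily
  function(x) g(f(x))
}

# Two pending element-wise operations, in the order they were requested.
deferred <- list(function(v) v + 1, function(v) v * 2)
merged <- Reduce(compose, deferred)

data <- list(1, 2, 3)
two_passes <- lapply(lapply(data, deferred[[1]]), deferred[[2]])  # intermediate list built
one_pass <- lapply(data, merged)                                  # no intermediate list
```

With the merged function, the data would cross the boundary between the R worker and the execution engine once instead of twice.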
Fault Tolerance
Take advantage of Spark's fault tolerance feature on the worker side
Detect user code errors that terminate R worker sessions; catch the error and stop the Spark job immediately
ZooKeeper on the master side
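Catching a user code error inside an R worker can be sketched with base R's `tryCatch`. This is an assumption about the general shape only: `run_udf_safely` is a hypothetical wrapper, and the step that would actually stop the Spark job is not shown.

```r
# Run a user-defined function over a block and report failure as data
# instead of letting the error terminate the R session.
run_udf_safely <- function(udf, block) {
  tryCatch(
    list(ok = TRUE, value = lapply(block, udf)),
    error = function(e) list(ok = FALSE, message = conditionMessage(e))
  )
}

good <- run_udf_safely(sqrt, list(4, 9))
bad <- run_udf_safely(function(v) stop("bad record"), list(1))
```

A caller can inspect the `ok` field and abort the job at once, since retrying a task whose failure comes from a bug in user code would fail again.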
Applications & Benchmarking
Logistic Regression:
• 10 worker nodes
• 1 ~ 100 million records
• RABID uses 1/6 LOC of Hadoop
Movie Clustering (K-Means):
• 10 worker nodes
• 11 ~ 90 million ratings
• RABID uses 1/8 LOC of Hadoop
Compare RABID with Hadoop & RHIPE
• RHIPE: R and Hadoop Integrated Programming Environment, developed at Purdue University
Logistic Regression Runtime over Data Size
Figure: runtime chart; series: RABID, Hadoop, RHIPE
K-Means Movie Clustering Runtime
Figure: runtime chart; series: RABID, Hadoop, RHIPE
Logistic Regression Runtime over # nodes
Logistic Regression Runtime on iterations
Conclusions
RABID provides R users with a familiar programming model that scales to large cloud-based clusters, allowing larger problem sizes to be solved efficiently.
Preliminary results show that RABID outperforms Hadoop and RHIPE on our benchmarks.
RABID is cloud-ready and can be used as a service.
Future Work
Optimizing data transfer between R sessions and Spark
Trade-off between fault tolerance and performance
Benchmarking more applications and at a larger scale
Thank You!
Nov 17, 2013
We also thank Prof. Michael Franklin and Matei Zaharia at UC Berkeley for discussing ideas about the Spark project