useR Tutorial Part 1: Introduction to SparkR
TRANSCRIPT
![Page 1: Use r tutorial part1, introduction to sparkr](https://reader036.vdocuments.net/reader036/viewer/2022062412/586f79e91a28ab10258b70f7/html5/thumbnails/1.jpg)
Introduction to SparkR
Shivaram Venkataraman, Hossein Falaki
![Page 2: Use r tutorial part1, introduction to sparkr](https://reader036.vdocuments.net/reader036/viewer/2022062412/586f79e91a28ab10258b70f7/html5/thumbnails/2.jpg)
Big Data & R
DataFrames, Visualization, Libraries + Data
![Page 3: Use r tutorial part1, introduction to sparkr](https://reader036.vdocuments.net/reader036/viewer/2022062412/586f79e91a28ab10258b70f7/html5/thumbnails/3.jpg)
Big Data & R: Challenges
Data access: HDFS, Hive
Capacity: Single machine memory
Parallelism: Single thread
![Page 4: Use r tutorial part1, introduction to sparkr](https://reader036.vdocuments.net/reader036/viewer/2022062412/586f79e91a28ab10258b70f7/html5/thumbnails/4.jpg)
Apache Spark
Engine for large-scale data processing
Fast, easy to use
Runs everywhere: EC2, clusters, laptop, etc.
![Page 5: Use r tutorial part1, introduction to sparkr](https://reader036.vdocuments.net/reader036/viewer/2022062412/586f79e91a28ab10258b70f7/html5/thumbnails/5.jpg)
Speed
Scalable
Flexible
Statistics
Visualization
DataFrames
SparkR
![Page 6: Use r tutorial part1, introduction to sparkr](https://reader036.vdocuments.net/reader036/viewer/2022062412/586f79e91a28ab10258b70f7/html5/thumbnails/6.jpg)
Big Data & R: Patterns
1. Big Data, Small Learning
2. Partition Aggregate
3. Large Scale Machine Learning
![Page 7: Use r tutorial part1, introduction to sparkr](https://reader036.vdocuments.net/reader036/viewer/2022062412/586f79e91a28ab10258b70f7/html5/thumbnails/7.jpg)
1. Big Data, Small Learning
Data -> Cleaning, Filtering, Aggregation -> Collect -> Subset -> DataFrames, Visualization, Libraries
![Page 8: Use r tutorial part1, introduction to sparkr](https://reader036.vdocuments.net/reader036/viewer/2022062412/586f79e91a28ab10258b70f7/html5/thumbnails/8.jpg)
1. Big Data, Small Learning
songs <- read.df("songs.json", "json")
newSongs <- filter(songs, songs$year > 2000)
ggplot(collect(newSongs))
Data -> Cleaning, Filtering, Aggregation -> Collect -> Subset
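
The shape of this pattern can be tried locally before touching a cluster. Below is a minimal sketch in base R, where an ordinary data frame stands in for the Spark DataFrame. The `plays` column and all values are assumptions made up for illustration; only the `year > 2000` filter comes from the slide. The idea is to filter and aggregate first, so that only a small summary reaches the plotting step.

```r
# Hypothetical stand-in for the large songs dataset (values are made up).
songs <- data.frame(
  title = c("a", "b", "c", "d", "e"),
  year  = c(1995, 2001, 2001, 2010, 1999),
  plays = c(10, 20, 30, 40, 50)
)

# Filter: keep songs released after 2000, as on the slide.
newSongs <- songs[songs$year > 2000, ]

# Aggregate: total plays per year, a much smaller table than the raw data.
playsByYear <- aggregate(plays ~ year, data = newSongs, FUN = sum)

print(playsByYear)
```

In the SparkR version on the slide, the filtering happens on the cluster and `collect()` is the step that brings the reduced result into local R memory.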
![Page 9: Use r tutorial part1, introduction to sparkr](https://reader036.vdocuments.net/reader036/viewer/2022062412/586f79e91a28ab10258b70f7/html5/thumbnails/9.jpg)
2. Partition Aggregate
Data, Params -> Best Model
Parameter Tuning
![Page 10: Use r tutorial part1, introduction to sparkr](https://reader036.vdocuments.net/reader036/viewer/2022062412/586f79e91a28ab10258b70f7/html5/thumbnails/10.jpg)
2. Partition Aggregate
library(MASS)
params <- c(1e-3, 1e-1, 1e2)
data <- read.csv("t.csv")
# Fit one ridge regression per candidate lambda
train <- function(prm) { lm.ridge(y ~ x + z, data, lambda = prm) }
lapply(params, train)
Data, Params -> Best Model
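
The slide stops at fitting one model per parameter. A possible way to complete the "aggregate" half is sketched below in base R. Everything beyond `params` and `lapply` is an assumption: the data is synthetic, the ridge fit is a hand-written closed form rather than `lm.ridge`, and the held-out split used to pick the best parameter is one choice among many.

```r
# Synthetic data standing in for t.csv (columns y, x, z are assumed).
set.seed(42)
n <- 200
x <- rnorm(n)
z <- rnorm(n)
y <- 2 * x - z + rnorm(n, sd = 0.5)
X <- cbind(x, z)

# Hold out the last 50 rows to score each candidate parameter.
trainIdx <- 1:150
testIdx  <- 151:200

# Closed-form ridge fit (no intercept): solve (X'X + lambda * I) b = X'y.
ridgeFit <- function(lambda, X, y) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

# Partition: one independent fit per parameter value.
params <- c(1e-3, 1e-1, 1e2)
errors <- lapply(params, function(prm) {
  b <- ridgeFit(prm, X[trainIdx, ], y[trainIdx])
  mean((y[testIdx] - X[testIdx, ] %*% b)^2)
})

# Aggregate: keep the parameter with the lowest held-out error.
best <- params[which.min(unlist(errors))]
print(best)
```

Because each fit depends only on its own parameter value, the `lapply` call is the part that a parallel API could distribute across workers.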
![Page 11: Use r tutorial part1, introduction to sparkr](https://reader036.vdocuments.net/reader036/viewer/2022062412/586f79e91a28ab10258b70f7/html5/thumbnails/11.jpg)
3. Large Scale Machine Learning
Data -> Featurize -> Learning -> Model
![Page 12: Use r tutorial part1, introduction to sparkr](https://reader036.vdocuments.net/reader036/viewer/2022062412/586f79e91a28ab10258b70f7/html5/thumbnails/12.jpg)
3. Large Scale Machine Learning
Data -> Featurize -> Learning -> Model
training <- read.csv("t.csv")
model <- glm(delay ~ Distance + Dest, family = "gaussian", data = training)
summary(model)
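
To see what this call produces without the original `t.csv`, here is a small sketch in base R on synthetic data. The column names follow the slide, but the values, the three destination codes, and the true coefficients are all assumptions made up for illustration.

```r
# Synthetic flight-like data (column names follow the slide; values are made up).
set.seed(1)
n <- 300
training <- data.frame(
  Distance = runif(n, 100, 2000),
  Dest     = factor(sample(c("JFK", "LAX", "SFO"), n, replace = TRUE))
)
training$delay <- 5 + 0.01 * training$Distance + rnorm(n, sd = 2)

# Gaussian GLM with an R formula; Dest is expanded into dummy variables.
model <- glm(delay ~ Distance + Dest, family = gaussian(), data = training)

# Coefficient table: estimates, standard errors, t values, p values.
print(coef(summary(model)))
```

The formula handles the "featurize" step implicitly: the factor `Dest` becomes indicator columns, with one level absorbed into the intercept.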
![Page 13: Use r tutorial part1, introduction to sparkr](https://reader036.vdocuments.net/reader036/viewer/2022062412/586f79e91a28ab10258b70f7/html5/thumbnails/13.jpg)
Big Data & R
Big Data, Small Learning
Partition Aggregate
Large Scale Machine Learning
SparkR: Unified approach
![Page 14: Use r tutorial part1, introduction to sparkr](https://reader036.vdocuments.net/reader036/viewer/2022062412/586f79e91a28ab10258b70f7/html5/thumbnails/14.jpg)
SparkR DataFrames
people <- read.df("people.json", "json")
avgAge <- select(people, avg(people$age))
head(avgAge)
Number of data sources
Column functions, SQL
Support for R UDFs
![Page 15: Use r tutorial part1, introduction to sparkr](https://reader036.vdocuments.net/reader036/viewer/2022062412/586f79e91a28ab10258b70f7/html5/thumbnails/15.jpg)
Large Scale Machine Learning
Integration with MLlib
Key features: R-like formulas, model statistics
model <- glm(a ~ b + c, data = df)
summary(model)
![Page 16: Use r tutorial part1, introduction to sparkr](https://reader036.vdocuments.net/reader036/viewer/2022062412/586f79e91a28ab10258b70f7/html5/thumbnails/16.jpg)
Partition Aggregate
spark.lapply: Simple, parallel API
Ex: Parameter tuning, model averaging
Include existing R packages
![Page 17: Use r tutorial part1, introduction to sparkr](https://reader036.vdocuments.net/reader036/viewer/2022062412/586f79e91a28ab10258b70f7/html5/thumbnails/17.jpg)
SparkR Status
Open source, part of Apache Spark
More than 60 committers from UC Berkeley, Databricks, IBM, Intel, Alteryx, etc.
Contributions welcome!
![Page 18: Use r tutorial part1, introduction to sparkr](https://reader036.vdocuments.net/reader036/viewer/2022062412/586f79e91a28ab10258b70f7/html5/thumbnails/18.jpg)
Tutorial Outline
Part 1: Data Exploration
• ETL: Data loading, schema
• Exploration: Filter, clean, aggregate, etc.
• Visualization: Integration with ggplot
Part 2: Advanced Analytics (After the break)
![Page 19: Use r tutorial part1, introduction to sparkr](https://reader036.vdocuments.net/reader036/viewer/2022062412/586f79e91a28ab10258b70f7/html5/thumbnails/19.jpg)
Tutorial Setup
Each user gets a dedicated micro cluster
• Cluster is terminated after 1 hour of inactivity
• Multiple users can collaborate on a notebook
Notebooks can be exported/imported
Examples and tutorials in R/Python/Scala
Free online service for learning Apache Spark
![Page 20: Use r tutorial part1, introduction to sparkr](https://reader036.vdocuments.net/reader036/viewer/2022062412/586f79e91a28ab10258b70f7/html5/thumbnails/20.jpg)
Tutorial Setup
Databricks Notebooks
• Interactive workspace
• Markdown + R, Python, Scala, SQL
Sign up at http://databricks.com/ce
![Page 22: Use r tutorial part1, introduction to sparkr](https://reader036.vdocuments.net/reader036/viewer/2022062412/586f79e91a28ab10258b70f7/html5/thumbnails/22.jpg)
SparkR
Big data processing from R
DataFrames for ETL, data exploration
Support for advanced analytics
![Page 23: Use r tutorial part1, introduction to sparkr](https://reader036.vdocuments.net/reader036/viewer/2022062412/586f79e91a28ab10258b70f7/html5/thumbnails/23.jpg)
Tutorial Next Steps
Sign up at http://databricks.com/ce
Part 1: tiny.cc/sparkr-tutorial-part1
Fill out our survey at tiny.cc/sparkr-user-survey