introduc+ontoparallel compu+ngwithapachespark · 7 spark"background" •...
TRANSCRIPT
![Page 1: Introduc+ontoParallel Compu+ngwithApacheSpark · 7 Spark"Background" • AmplabUCBerkeley" • ProjectLead:"Dr." Matei&Zaharia& • Firstpaper"published"on"RDD’s"was"in"2012" •](https://reader034.vdocuments.net/reader034/viewer/2022042311/5ed9aadfa5d2fa37887930ce/html5/thumbnails/1.jpg)
1
Introduc+on to Parallel Compu+ng with Apache Spark
![Page 2: Introduc+ontoParallel Compu+ngwithApacheSpark · 7 Spark"Background" • AmplabUCBerkeley" • ProjectLead:"Dr." Matei&Zaharia& • Firstpaper"published"on"RDD’s"was"in"2012" •](https://reader034.vdocuments.net/reader034/viewer/2022042311/5ed9aadfa5d2fa37887930ce/html5/thumbnails/2.jpg)
2
Why Parallel Processing? Divide and Conquer
Problem Single machine cannot complete the computa+on at hand Solu+on Parallelize the job and distribute work among a network of machines
![Page 3: Introduc+ontoParallel Compu+ngwithApacheSpark · 7 Spark"Background" • AmplabUCBerkeley" • ProjectLead:"Dr." Matei&Zaharia& • Firstpaper"published"on"RDD’s"was"in"2012" •](https://reader034.vdocuments.net/reader034/viewer/2022042311/5ed9aadfa5d2fa37887930ce/html5/thumbnails/3.jpg)
3
Issues Arise in Parallel Processing View the world from the eyes of a single worker
• How do I parallelize an algorithm? • How do I par))on my dataset? • How do I maintain a single consistent view of a shared state?
• How do I recover from machine failures? • How do I allocate cluster resources? • …..
![Page 4: Introduc+ontoParallel Compu+ngwithApacheSpark · 7 Spark"Background" • AmplabUCBerkeley" • ProjectLead:"Dr." Matei&Zaharia& • Firstpaper"published"on"RDD’s"was"in"2012" •](https://reader034.vdocuments.net/reader034/viewer/2022042311/5ed9aadfa5d2fa37887930ce/html5/thumbnails/4.jpg)
4
Finding majority element in a single machine
List(20, 18, 20, 18, 20)
Think distributed
![Page 5: Introduc+ontoParallel Compu+ngwithApacheSpark · 7 Spark"Background" • AmplabUCBerkeley" • ProjectLead:"Dr." Matei&Zaharia& • Firstpaper"published"on"RDD’s"was"in"2012" •](https://reader034.vdocuments.net/reader034/viewer/2022042311/5ed9aadfa5d2fa37887930ce/html5/thumbnails/5.jpg)
5
Finding majority element in a distributed dataset Think distributed
List(1, 18, 1, 18, 1)
List(2, 18, 2, 18, 2)
List(3, 18, 3, 18, 3)
List(4, 18, 4, 18, 4)
List(5, 18, 5, 18, 5)
![Page 6: Introduc+ontoParallel Compu+ngwithApacheSpark · 7 Spark"Background" • AmplabUCBerkeley" • ProjectLead:"Dr." Matei&Zaharia& • Firstpaper"published"on"RDD’s"was"in"2012" •](https://reader034.vdocuments.net/reader034/viewer/2022042311/5ed9aadfa5d2fa37887930ce/html5/thumbnails/6.jpg)
6
Finding majority element in a distributed dataset Think distributed
List(1, 18, 1, 18, 1)
List(2, 18, 2, 18, 2)
List(3, 18, 3, 18, 3)
List(4, 18, 4, 18, 4)
List(5, 18, 5, 18, 5)
18
![Page 7: Introduc+ontoParallel Compu+ngwithApacheSpark · 7 Spark"Background" • AmplabUCBerkeley" • ProjectLead:"Dr." Matei&Zaharia& • Firstpaper"published"on"RDD’s"was"in"2012" •](https://reader034.vdocuments.net/reader034/viewer/2022042311/5ed9aadfa5d2fa37887930ce/html5/thumbnails/7.jpg)
7
Spark Background
• Amplab UC Berkeley
• Project Lead: Dr. Matei Zaharia
• First paper published on RDD’s was in 2012 • Open sourced from day one, growing number of contributors
• Released its 1.0 version May 2014. Currently in 1.4.1
• Databricks company established to support Spark and all its related technologies. Matei currently sits as its CTO
• Amazon, Alibaba, Baidu, eBay, Groupon, Ooyala, OpenTable, Box, Shopify, TechBase, Yahoo!
Arose from an academic sebng
![Page 8: Introduc+ontoParallel Compu+ngwithApacheSpark · 7 Spark"Background" • AmplabUCBerkeley" • ProjectLead:"Dr." Matei&Zaharia& • Firstpaper"published"on"RDD’s"was"in"2012" •](https://reader034.vdocuments.net/reader034/viewer/2022042311/5ed9aadfa5d2fa37887930ce/html5/thumbnails/8.jpg)
8
Example: Log Mining Load error messages from a log into memory, then interac+vely search for various paderns
lines = spark.textFile(“hdfs://...”)
errors = lines.filter(_.startsWith(“ERROR”))
messages = errors.map(_.split(‘\t’)(2))
messages.persist()
Block 1
Block 2
Block 3
Worker
Worker
Worker
Driver
messages.filter(_.contains(“foo”)).count
messages.filter(_.contains(“bar”)).count
. . .
tasks
results
Msgs. 1
Msgs. 2
Msgs. 3
Base RDD Transformed RDD
Ac+on
Result: full-‐text search of Wikipedia in <1 sec (vs 20 sec for on-‐disk data)
Result: scaled to 1 TB data in 5-‐7 sec (vs 170 sec for on-‐disk data)
Slide courtesy of Matei Zaharia, Introduction to Spark
![Page 9: Introduc+ontoParallel Compu+ngwithApacheSpark · 7 Spark"Background" • AmplabUCBerkeley" • ProjectLead:"Dr." Matei&Zaharia& • Firstpaper"published"on"RDD’s"was"in"2012" •](https://reader034.vdocuments.net/reader034/viewer/2022042311/5ed9aadfa5d2fa37887930ce/html5/thumbnails/9.jpg)
9
Resilient Distributed Datasets (RDDs)
• Main object in Spark’s universe
• Think of it as represen+ng the data at that stage in the opera+on
• Allows for coarse-‐grained transforma+ons (e.g. map, group-‐by, join)
• Allows for efficient fault recovery using lineage ‒ Log one opera+on to apply to many elements ‒ Recompute lost par++ons of dataset on failure ‒ No cost if nothing fails
![Page 10: Introduc+ontoParallel Compu+ngwithApacheSpark · 7 Spark"Background" • AmplabUCBerkeley" • ProjectLead:"Dr." Matei&Zaharia& • Firstpaper"published"on"RDD’s"was"in"2012" •](https://reader034.vdocuments.net/reader034/viewer/2022042311/5ed9aadfa5d2fa37887930ce/html5/thumbnails/10.jpg)
10
RDD Ac+ons and Transforma+ons
• Transforma+ons ‒ Lazy opera+ons applied on an RDD ‒ Creates a new RDD from an exis+ng RDD ‒ Allows Spark to perform op+miza+ons ‒ e.g. map, filter, flatMap, union, intersec+on, dis+nct, reduceByKey, groupByKey
• Ac+ons ‒ Returns a value to the driver program aner computa+on ‒ e.g. reduce, collect, count, first, take, saveAsFile
Transforma+ons are realized when an ac+on is called
![Page 11: Introduc+ontoParallel Compu+ngwithApacheSpark · 7 Spark"Background" • AmplabUCBerkeley" • ProjectLead:"Dr." Matei&Zaharia& • Firstpaper"published"on"RDD’s"was"in"2012" •](https://reader034.vdocuments.net/reader034/viewer/2022042311/5ed9aadfa5d2fa37887930ce/html5/thumbnails/11.jpg)
11
RDD Representa+on
• Simple common interface: – Set of par++ons – Preferred loca+ons for each par++on – List of parent RDDs – Func+on to compute a par++on given parents – Op+onal par++oning info
• Allows capturing wide range of transforma+ons
Slide courtesy of Matei Zaharia, Introduction to Spark
![Page 12: Introduc+ontoParallel Compu+ngwithApacheSpark · 7 Spark"Background" • AmplabUCBerkeley" • ProjectLead:"Dr." Matei&Zaharia& • Firstpaper"published"on"RDD’s"was"in"2012" •](https://reader034.vdocuments.net/reader034/viewer/2022042311/5ed9aadfa5d2fa37887930ce/html5/thumbnails/12.jpg)
12
Spark Cluster
Driver • Entry point of Spark applica+on • Main Spark applica+on is ran here • Results of “reduce” opera+ons are aggregated here
Driver
![Page 13: Introduc+ontoParallel Compu+ngwithApacheSpark · 7 Spark"Background" • AmplabUCBerkeley" • ProjectLead:"Dr." Matei&Zaharia& • Firstpaper"published"on"RDD’s"was"in"2012" •](https://reader034.vdocuments.net/reader034/viewer/2022042311/5ed9aadfa5d2fa37887930ce/html5/thumbnails/13.jpg)
13
Spark Cluster
Driver Master
Master • Distributed coordina+on of Spark workers including:
§ Health checking workers § Reassignment of failed tasks § Entry point for job and cluster metrics
![Page 14: Introduc+ontoParallel Compu+ngwithApacheSpark · 7 Spark"Background" • AmplabUCBerkeley" • ProjectLead:"Dr." Matei&Zaharia& • Firstpaper"published"on"RDD’s"was"in"2012" •](https://reader034.vdocuments.net/reader034/viewer/2022042311/5ed9aadfa5d2fa37887930ce/html5/thumbnails/14.jpg)
14
Spark Cluster
Driver Master
Worker
Worker
Worker
Worker
Worker • Spawns executors to perform tasks on par++ons of data
![Page 15: Introduc+ontoParallel Compu+ngwithApacheSpark · 7 Spark"Background" • AmplabUCBerkeley" • ProjectLead:"Dr." Matei&Zaharia& • Firstpaper"published"on"RDD’s"was"in"2012" •](https://reader034.vdocuments.net/reader034/viewer/2022042311/5ed9aadfa5d2fa37887930ce/html5/thumbnails/15.jpg)
15
The Spark Family Cheaper by the dozen
• Aside from its performance and API, the diverse tool set available in Spark is the reason for its wide adop+on 1. Spark SQL 2. Spark Streaming 3. MLlib 4. GraphX
![Page 16: Introduc+ontoParallel Compu+ngwithApacheSpark · 7 Spark"Background" • AmplabUCBerkeley" • ProjectLead:"Dr." Matei&Zaharia& • Firstpaper"published"on"RDD’s"was"in"2012" •](https://reader034.vdocuments.net/reader034/viewer/2022042311/5ed9aadfa5d2fa37887930ce/html5/thumbnails/16.jpg)
16
Disk Based vs Memory Based Frameworks
• Disk Based Frameworks ‒ Persists intermediate results to disk ‒ Data is reloaded from disk with every query ‒ Easy failure recovery ‒ Best for ETL like work-‐loads ‒ Examples: Hadoop, Dryad
Acyclic data flow
Map
Map
Map
Reduce
Reduce
Input Output
Image courtesy of Matei Zaharia, Introduction to Spark
![Page 17: Introduc+ontoParallel Compu+ngwithApacheSpark · 7 Spark"Background" • AmplabUCBerkeley" • ProjectLead:"Dr." Matei&Zaharia& • Firstpaper"published"on"RDD’s"was"in"2012" •](https://reader034.vdocuments.net/reader034/viewer/2022042311/5ed9aadfa5d2fa37887930ce/html5/thumbnails/17.jpg)
17
Disk Based vs Memory Based Frameworks
• Memory Based Frameworks ‒ Circumvents heavy cost of I/O by keeping intermediate results in memory
‒ Sensi+ve to availability of memory ‒ Remembers opera+onsapplied to dataset ‒ Best for itera+ve workloads ‒ Examples: Spark, Flink
Reuse working data set in memory
iter. 1 iter. 2 . . .
Input Image courtesy of Matei Zaharia, Introduction to Spark
![Page 18: Introduc+ontoParallel Compu+ngwithApacheSpark · 7 Spark"Background" • AmplabUCBerkeley" • ProjectLead:"Dr." Matei&Zaharia& • Firstpaper"published"on"RDD’s"was"in"2012" •](https://reader034.vdocuments.net/reader034/viewer/2022042311/5ed9aadfa5d2fa37887930ce/html5/thumbnails/18.jpg)
18
Spark versus Scalding (Hadoop) Clear win for itera+ve applica+ons
![Page 19: Introduc+ontoParallel Compu+ngwithApacheSpark · 7 Spark"Background" • AmplabUCBerkeley" • ProjectLead:"Dr." Matei&Zaharia& • Firstpaper"published"on"RDD’s"was"in"2012" •](https://reader034.vdocuments.net/reader034/viewer/2022042311/5ed9aadfa5d2fa37887930ce/html5/thumbnails/19.jpg)
19
Ad-‐hoc batch queries
![Page 20: Introduc+ontoParallel Compu+ngwithApacheSpark · 7 Spark"Background" • AmplabUCBerkeley" • ProjectLead:"Dr." Matei&Zaharia& • Firstpaper"published"on"RDD’s"was"in"2012" •](https://reader034.vdocuments.net/reader034/viewer/2022042311/5ed9aadfa5d2fa37887930ce/html5/thumbnails/20.jpg)
20
Lambda Architecture
![Page 21: Introduc+ontoParallel Compu+ngwithApacheSpark · 7 Spark"Background" • AmplabUCBerkeley" • ProjectLead:"Dr." Matei&Zaharia& • Firstpaper"published"on"RDD’s"was"in"2012" •](https://reader034.vdocuments.net/reader034/viewer/2022042311/5ed9aadfa5d2fa37887930ce/html5/thumbnails/21.jpg)
21
Lambda Architecture Unified Framework
Spark Core
Spark Streaming
Spark SQL