CS535 Big Data 2/5/2020 Week 3- B Sangmi Lee Pallickara
http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University 1
CS535 Big Data | Computer Science | Colorado State University
CS535 BIG DATA
PART A. BIG DATA TECHNOLOGY
3. DISTRIBUTED COMPUTING MODELS FOR SCALABLE BATCH COMPUTING
SECTION 2: IN-MEMORY CLUSTER COMPUTING
Sangmi Lee Pallickara
Computer Science, Colorado State University
http://www.cs.colostate.edu/~cs535

FAQs
• PA1
• GEAR Session 1 signup is available
  • See the announcement in Canvas
• Feedback policy
  • Quiz, TP proposal: 1 week
  • Email: 24 hrs

Topics of Today's Class
• 3. Distributed Computing Models for Scalable Batch Computing
  • Introduction to Spark
  • Operations: transformations, actions, persistence

In-Memory Cluster Computing: Apache Spark
RDD (Resilient Distributed Dataset)

RDD (Resilient Distributed Dataset)
• Read-only, memory-resident, partitioned collection of records
• A fault-tolerant collection of elements that can be operated on in parallel
• RDDs are the core unit of data in Spark
  • Most Spark programming involves performing operations on RDDs

Creating RDDs [1/3]
• Loading an external dataset
val lines = sc.textFile("/path/to/README.md")
• Parallelizing a collection in your driver program
val lines = sc.parallelize(List("pandas", "i like pandas"))
https://spark.apache.org/docs/latest/rdd-programming-guide.html

Creating RDDs [2/3]
• Line 1: defines a base RDD from an external file
  • This dataset is not loaded into memory
• Line 2: defines lineLengths as the result of a map transformation
  • It is not immediately computed
• Line 3: performs reduce and computes the result
1: val lines = sc.textFile("data.txt")
2: val lineLengths = lines.map(s => s.length)
3: val totalLength = lineLengths.reduce((a, b) => a + b)
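The lazy behaviour of the three lines above can be mimicked in plain Python, without Spark. This is only a sketch under stated assumptions: the list `data_txt` is a hypothetical stand-in for data.txt, and Python's lazy `map()` plays the role of the map transformation while `reduce()` plays the role of the action.

```python
# Plain-Python model of lazy evaluation (no Spark required).
# data_txt is a hypothetical stand-in for the contents of data.txt.
from functools import reduce

data_txt = ["spark", "resilient distributed dataset", "rdd"]
computed = []  # records which lines the map function has actually touched


def length_of(s):
    computed.append(s)
    return len(s)


# "Transformation": map() only builds a lazy iterator; length_of has not run yet.
line_lengths = map(length_of, data_txt)
calls_before_action = len(computed)

# "Action": reduce() pulls every value through the pipeline and returns a result.
total_length = reduce(lambda a, b: a + b, line_lengths)
calls_after_action = len(computed)

print(calls_before_action, calls_after_action, total_length)  # 0 3 37
```

The map function runs only once the reduce step asks for values, which is the same ordering Spark follows for transformations and actions.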

Creating RDDs [3/3]
• If you want to use lineLengths again later, persist it before the first action (here, before reduce)
  • It is then kept in memory after the first time it is computed
lineLengths.persist()

Spark Programming Interface to RDD [1/3]
• transformations
  • Operations that create RDDs
  • Return pointers to new RDDs
  • e.g. map, filter, and join
• RDDs can only be created through deterministic operations on either
  • Data in stable storage
  • Other RDDs
1: val lines = sc.textFile("data.txt")
2: val lineLengths = lines.map(s => s.length)
3: val totalLength = lineLengths.reduce((a, b) => a + b)
Don’t be confused with “transformation” of the Scala language

Spark Programming Interface to RDD [2/3]
• actions
  • Operations that return a value to the application or export data to a storage system
  • e.g. count: returns the number of elements in the dataset
  • e.g. collect: returns the elements themselves
  • e.g. save: outputs the dataset to a storage system
1: val lines = sc.textFile("data.txt")
2: val lineLengths = lines.map(s => s.length)
3: val totalLength = lineLengths.reduce((a, b) => a + b)
Don’t be confused with “transformation” of the Scala language

Spark Programming Interface to RDD [3/3]
• persistence
  • Users indicate which RDDs they want to reuse in future operations
  • Spark keeps persistent RDDs in memory by default
  • If there is not enough RAM
    • It can spill them to disk
  • Users are allowed to
    • store the RDD only on disk
    • replicate the RDD across machines
    • specify a persistence priority on each RDD
lineLengths.persist()

In-Memory Cluster Computing: Apache Spark
RDD: Transformations
RDD: Actions
RDD: Persistence

What are the Transformations?
• RDD's transformations create a new dataset from an existing one
• Simple transformations
• Transformations with multiple RDDs
• Transformations with the Pair RDDs

In-Memory Cluster Computing: Apache Spark
RDD: Transformations
1. Simple transformations
2. Transformations with multiple RDDs
3. Transformations with the Pair RDDs

Simple Transformations
• These transformations create a new RDD from an existing RDD
• E.g. map(), filter(), flatMap(), sample()

map() vs. filter() [1/2]
• The map(func) transformation takes in a function
  • It applies the function to each element in the RDD; the result of the function becomes the new value of each element in the resulting RDD
• The filter(func) transformation takes in a function and returns an RDD that only has elements that pass the filter() function
inputRDD {1, 2, 3, 4}
map x => x * x: mappedRDD {1, 4, 9, 16}
filter x != 1: filteredRDD {2, 3, 4}

map() vs. filter() [2/2]
• map() that squares all of the numbers in an RDD
val input = sc.parallelize(List(1, 2, 3, 4))
val result = input.map(x => x * x)
println(result.collect().mkString(","))
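The map() and filter() behaviour from these two slides can be checked with a plain-Python sketch. It uses ordinary lists as a stand-in for RDDs (an assumption made so that it runs without Spark); the variable names are illustrative only.

```python
# Plain-Python stand-in for inputRDD {1, 2, 3, 4}; no Spark required.
input_rdd = [1, 2, 3, 4]

# map(x => x * x): exactly one output element per input element.
mapped = [x * x for x in input_rdd]

# filter(x != 1): keeps only the elements for which the predicate holds.
filtered = [x for x in input_rdd if x != 1]

print(",".join(str(x) for x in mapped))  # 1,4,9,16
print(filtered)                          # [2, 3, 4]
```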

map() vs. flatMap() [1/2]
• As a result of flatMap(), we have an RDD of the elements
  • Instead of an RDD of lists of elements
RDD1 {"coffee panda", "happy panda", "happiest panda party"}
RDD1.map(tokenize): mappedRDD {["coffee", "panda"], ["happy", "panda"], ["happiest", "panda", "party"]}
RDD1.flatMap(tokenize): flatMappedRDD {"coffee", "panda", "happy", "panda", "happiest", "panda", "party"}

map() vs. flatMap() [2/2]
• Using flatMap() that splits lines into multiple words
val lines = sc.parallelize(List("hello world", "hi"))
val words = lines.flatMap(line => line.split(" "))
words.first() // returns "hello"

Example [1/2]
data (one record per line):
Colorado State University
Ohio State University
Washington State University
Boston University
• Using map
>>> wc = data.map(lambda line: line.split(" "))
>>> llist = wc.collect()
>>> for line in llist:  # print the list
...     print(line)
Output?
• Using flatMap
>>> fm = data.flatMap(lambda line: line.split(" "))
>>> llist = fm.collect()
>>> for line in llist:  # print the list
...     print(line)
Output?

Example -- Answers [2/2]
Input:
Colorado State University
Ohio State University
Washington State University
Boston University
• Using map (one list of words per input line)
['Colorado', 'State', 'University']
['Ohio', 'State', 'University']
['Washington', 'State', 'University']
['Boston', 'University']
• Using flatMap (one word per output line)
Colorado
State
University
Ohio
State
University
Washington
State
University
Boston
University
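The two answers can be reproduced without Spark. The sketch below assumes the four input lines from the example and uses plain Python lists in place of RDDs; `itertools.chain.from_iterable` does the flattening that flatMap() performs.

```python
# Plain-Python model of map() vs. flatMap() on the example data (no Spark required).
from itertools import chain

data = [
    "Colorado State University",
    "Ohio State University",
    "Washington State University",
    "Boston University",
]

# map: one output element (a list of words) per input line.
mapped = [line.split(" ") for line in data]

# flatMap: the per-line lists are flattened into a single sequence of words.
flat_mapped = list(chain.from_iterable(line.split(" ") for line in data))

for words in mapped:
    print(words)
for word in flat_mapped:
    print(word)
```

map keeps four elements (one list per line), while flatMap yields eleven elements (one per word).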

map() vs. mapPartitions() [2/2]
• map(func) converts each element of the source RDD into a single element of the result RDD by applying a function
• mapPartitions(func) converts each partition of the source RDD into multiple elements of the result (possibly none)
  • Similar to map, but runs separately on each partition (block) of the RDD
  • func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.
val sc = new SparkContext(master, "BasicAvgMapPartitions", System.getenv("SPARK_HOME"))
val input = sc.parallelize(List(1, 2, 3, 4))
// AvgCount is a helper class with merge() methods; its definition is not shown on this slide
val result = input.mapPartitions(partition =>
  Iterator(AvgCount(0, 0).merge(partition))).reduce((x, y) => x.merge(y))
println(result)
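Because AvgCount is not defined on the slide, the following plain-Python sketch shows the same idea with a (sum, count) tuple instead. The split of the input into two partitions and the helper names `sum_and_count` and `map_partitions` are assumptions made for illustration; this is not Spark's implementation.

```python
# Plain-Python model of mapPartitions() computing an average (no Spark required).
from functools import reduce

# Assumed partitioning of List(1, 2, 3, 4) into two partitions.
partitions = [[1, 2], [3, 4]]


def sum_and_count(partition):
    """Iterator[T] => Iterator[U]: emits a single (sum, count) pair per partition."""
    total, count = 0, 0
    for x in partition:
        total += x
        count += 1
    yield (total, count)


def map_partitions(func, parts):
    """Calls func once per partition instead of once per element."""
    return [list(func(iter(p))) for p in parts]


per_partition = map_partitions(sum_and_count, partitions)  # [[(3, 2)], [(7, 2)]]

# reduce: merge the per-partition pairs, then divide.
pairs = [pair for part in per_partition for pair in part]
total, count = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]), pairs)
average = total / count
print(average)  # 2.5
```

The function is invoked twice here (once per partition) rather than four times (once per element), which is the main practical difference from map().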

map() vs. mapPartitions(): Performance
• Does map() perform faster than mapPartitions()?
  • Assume that they are performed over the same RDD with the same number of partitions

repartition() vs. coalesce()
• repartition(numPartitions)
  • Reshuffles the data in the RDD randomly to create either more or fewer partitions and balances it across them
  • This always shuffles all data over the network
• coalesce(numPartitions)
  • Decreases the number of partitions in the RDD to numPartitions
  • Useful for running operations more efficiently after filtering down a large dataset
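The difference can be pictured with a toy model in plain Python. This is a simplified sketch, not Spark's algorithm: partitions are lists, `coalesce` merges whole neighbouring partitions so no element is redistributed individually, and `repartition` redistributes every element. Both helper functions and the fixed random seed are assumptions made for illustration.

```python
# Toy model of coalesce() vs. repartition() (no Spark required).
import random


def coalesce(parts, n):
    """Merge whole partitions into n groups; elements stay with their neighbours."""
    merged = [[] for _ in range(n)]
    for i, part in enumerate(parts):
        merged[i * n // len(parts)].extend(part)
    return merged


def repartition(parts, n, seed=0):
    """Full shuffle: every element may move, and the result is balanced."""
    rng = random.Random(seed)
    elements = [x for part in parts for x in part]
    rng.shuffle(elements)
    return [elements[i::n] for i in range(n)]


parts = [[1, 2], [3], [4, 5, 6], [7]]
coalesced = coalesce(parts, 2)      # [[1, 2, 3], [4, 5, 6, 7]]
reshuffled = repartition(parts, 2)  # balanced, but element placement is random

print(coalesced)
print(reshuffled)
```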

In-Memory Cluster Computing: Apache Spark
RDD: Transformations
1. Simple transformations
2. Transformations with multiple RDDs
3. Transformations with the Pair RDDs

Two-RDD transformations on RDDs containing RDD1 = {1, 2, 3} and RDD2 = {3, 4, 5}
name | purpose | results
union() | Union (duplicates are kept) | {1, 2, 3, 3, 4, 5}
intersection() | Intersection | {3}
subtract() | Remove the contents of one RDD (e.g. remove training data) | {1, 2}
cartesian() | Cartesian product | {(1,3), (1,4), (1,5), (2,3), (2,4), (2,5), (3,3), (3,4), (3,5)}
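The four results in the table can be verified with plain Python lists standing in for RDD1 and RDD2 (an assumption so that the sketch runs without Spark). Note that the Cartesian product pairs every element of RDD1 with every element of RDD2, so the last row of pairs starts with 3 and ends with (3,5).

```python
# Plain-Python check of the two-RDD transformations (no Spark required).
from itertools import product

rdd1 = [1, 2, 3]
rdd2 = [3, 4, 5]

union = rdd1 + rdd2                              # duplicates are kept
intersection = sorted(set(rdd1) & set(rdd2))     # duplicates are removed
subtract = [x for x in rdd1 if x not in set(rdd2)]
cartesian = list(product(rdd1, rdd2))            # all (a, b) pairs

print(union)         # [1, 2, 3, 3, 4, 5]
print(intersection)  # [3]
print(subtract)      # [1, 2]
print(cartesian)
```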

In-Memory Cluster Computing: Apache Spark
RDD: Transformations
1. Simple transformations
2. Transformations with multiple RDDs
3. Transformations with the Pair RDDs

Why Key/Value Pairs?
• Pair RDDs
  • Spark provides special operations on RDDs containing key/value pairs
  • Pair RDDs allow you to act on each key in parallel or regroup data across the network
• reduceByKey()
  • Aggregates data separately for each key
• join()
  • Merges two RDDs by grouping elements with the same key

Creating Pair RDDs
• Running map() function
  • Returns key/value pairs
val pairs = lines.map(x => (x.split(" ")(0), x))
Creating a pair RDD using the first word as the key in Scala

Transformations on a single pair RDD (example: RDD1 = {(1, 2), (3, 4), (3, 6)})
• Pair RDDs are allowed to use all the transformations available to standard RDDs.
Function | purpose | Example | Result
reduceByKey() | Combine values with the same key | rdd.reduceByKey((x, y) => x + y) | {(1,2), (3,10)}
groupByKey() | Group values with the same key | rdd.groupByKey() | {(1,[2]), (3,[4,6])}
combineByKey(createCombiner, mergeValue, mergeCombiners, partitioner) | Combine values with the same key using a different result type | We will revisit this function later. |

Transformations on a single pair RDD (example: RDD1 = {(1, 2), (3, 4), (3, 6)})
Function | purpose | Example | Result
mapValues(func) | Apply a function to each value of a pair RDD without changing the key | rdd.mapValues(x => x + 1) | {(1,3), (3,5), (3,7)}
flatMapValues(func) | Apply a function that returns an iterator to each value; each returned element is paired with the original key | rdd.flatMapValues(x => (x to 5)) | {(1,2), (1,3), (1,4), (1,5), (3,4), (3,5)}
keys() | Return an RDD of just the keys | rdd.keys() | {1, 3, 3}
values() | Return an RDD of just the values | rdd.values() | {2, 4, 6}
sortByKey() | Return an RDD sorted by the key | rdd.sortByKey() | {(1,2), (3,4), (3,6)}
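The single-pair-RDD results can be reproduced with a plain-Python sketch in which a list of tuples stands in for the pair RDD. The helper functions `reduce_by_key` and `group_by_key` are hypothetical names written for this illustration; they model the semantics only and say nothing about how Spark distributes the work.

```python
# Plain-Python model of single pair-RDD transformations (no Spark required).
from collections import defaultdict

rdd = [(1, 2), (3, 4), (3, 6)]


def reduce_by_key(pairs, func):
    """Combine all values that share a key with func."""
    acc = {}
    for k, v in pairs:
        acc[k] = func(acc[k], v) if k in acc else v
    return sorted(acc.items())


def group_by_key(pairs):
    """Collect all values that share a key into a list."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return sorted(groups.items())


reduced = reduce_by_key(rdd, lambda x, y: x + y)  # [(1, 2), (3, 10)]
grouped = group_by_key(rdd)                       # [(1, [2]), (3, [4, 6])]
map_values = [(k, v + 1) for k, v in rdd]         # key is unchanged
# "x to 5" in Scala is inclusive, hence range(v, 6); (3, 6) yields nothing.
flat_map_values = [(k, x) for k, v in rdd for x in range(v, 6)]
keys = [k for k, _ in rdd]
values = [v for _, v in rdd]
sorted_by_key = sorted(rdd, key=lambda kv: kv[0])

print(reduced, grouped, map_values, flat_map_values, keys, values, sorted_by_key)
```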

Transformations on two pair RDDs (example: rdd = {(1, 2), (3, 4), (3, 6)}, other = {(3, 9)})
Function | purpose | Example | Result
subtractByKey() | Remove elements with a key present in the other RDD | rdd.subtractByKey(other) | {(1,2)}
join() | Inner join | rdd.join(other) | {(3,(4,9)), (3,(6,9))}
rightOuterJoin() | Perform a join where the key must be present in the other RDD | rdd.rightOuterJoin(other) | {(3,(Some(4),9)), (3,(Some(6),9))}
leftOuterJoin() | Perform a join where the key must be present in the first RDD | rdd.leftOuterJoin(other) | {(1,(2,None)), (3,(4,Some(9))), (3,(6,Some(9)))}
cogroup() | Group data from both RDDs sharing the same key | rdd.cogroup(other) | {(1,([2],[])), (3,([4,6],[9]))}
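The join results can be modelled in plain Python with lists of tuples standing in for the two pair RDDs. This is a sketch of the semantics only: the helper `values_for` is a hypothetical name, and where Scala wraps the optional side in Some(...)/None, this model uses the bare value or Python's None.

```python
# Plain-Python model of two-pair-RDD transformations (no Spark required).
rdd = [(1, 2), (3, 4), (3, 6)]
other = [(3, 9)]


def values_for(pairs, key):
    """All values stored under key, in order of appearance."""
    return [v for k, v in pairs if k == key]


subtract_by_key = [(k, v) for k, v in rdd if not values_for(other, k)]
join = [(k, (v, w)) for k, v in rdd for w in values_for(other, k)]
# Outer joins keep unmatched keys of one side and pad the other side with None.
left_outer = [(k, (v, w)) for k, v in rdd
              for w in (values_for(other, k) or [None])]
right_outer = [(k, (v, w)) for k, w in other
               for v in (values_for(rdd, k) or [None])]
all_keys = sorted({k for k, _ in rdd} | {k for k, _ in other})
cogroup = [(k, (values_for(rdd, k), values_for(other, k))) for k in all_keys]

print(subtract_by_key)  # [(1, 2)]
print(join)             # [(3, (4, 9)), (3, (6, 9))]
print(left_outer)       # [(1, (2, None)), (3, (4, 9)), (3, (6, 9))]
print(right_outer)      # [(3, (4, 9)), (3, (6, 9))]
print(cogroup)          # [(1, ([2], [])), (3, ([4, 6], [9]))]
```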

Pair RDDs are still RDDs
• Supports the same functions as RDDs
pairs.filter{case (key, value) => value.length < 20}
• Access only the “value” part: mapValues(func)
  • Same as map{case (x, y) => (x, func(y))}

Aggregations with Pair RDDs
• Aggregate statistics across all elements with the same key
• reduceByKey()
  • Similar to reduce()
  • Takes a function and uses it to combine values
  • Runs several parallel reduce operations
    • One for each key in the dataset
    • Each operation combines values that have the same key
• reduceByKey() is NOT implemented as an action that returns a value to the user program
  • It returns a new RDD consisting of each key and the reduced value for that key

Example
• Per-key average with reduceByKey() and mapValues()
rdd.mapValues(x => (x, 1)).reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
Results of reduceByKey() include (sum, count)
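To see how the (sum, count) pairs turn into averages, here is a plain-Python sketch of the same two steps followed by the final division. The sample data is hypothetical (the slide does not give any), and a dictionary stands in for the per-key reduction.

```python
# Plain-Python model of a per-key average via mapValues + reduceByKey (no Spark required).
# Hypothetical sample data; the slide does not specify the contents of rdd.
rdd = [("panda", 0), ("pink", 3), ("pirate", 3), ("panda", 1), ("pink", 4)]

# mapValues(x => (x, 1)): attach a count of 1 to every value.
with_counts = [(k, (v, 1)) for k, v in rdd]

# reduceByKey: add sums and counts component-wise for each key.
sums = {}
for k, (s, c) in with_counts:
    if k in sums:
        sums[k] = (sums[k][0] + s, sums[k][1] + c)
    else:
        sums[k] = (s, c)

# Final step (not on the slide): divide sum by count.
averages = {k: s / c for k, (s, c) in sums.items()}

print(sums)      # {'panda': (1, 2), 'pink': (7, 2), 'pirate': (3, 1)}
print(averages)  # {'panda': 0.5, 'pink': 3.5, 'pirate': 3.0}
```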

combineByKey()
• The most general of the per-key aggregation functions
  • Most of the other per-key combiners are implemented using it
• Allows the user to return values that are not the same type as the input data
• createCombiner()
  • Called when combineByKey() finds a new key
  • This happens the first time a key is found in each partition, rather than only the first time the key is found in the RDD
• mergeValue()
  • Called when the key has already been seen in that partition
• mergeCombiners()
  • Merges the results from each partition

Per-key average using combineByKey()
• createCombiner()
• mergeValue()
• mergeCombiners()
val result = input.combineByKey(
  (v) => (v, 1),                                    // createCombiner()
  (acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1), // mergeValue()
  (acc1: (Int, Int), acc2: (Int, Int)) =>           // mergeCombiners()
    (acc1._1 + acc2._1, acc1._2 + acc2._2)
).map{ case (key, value) => (key, value._1 / value._2.toFloat) }
result.collectAsMap().map(println(_))

combineByKey() sample data flow
• Partition 1 trace:
(coffee, 1) → new key
accumulators[coffee] = createCombiner(1)
(coffee, 2) → existing key
accumulators[coffee] = mergeValue(accumulators[coffee], 2)
(panda, 3) → new key
accumulators[panda] = createCombiner(3)
• Partition 2 trace:
(coffee, 9) → new key
accumulators[coffee] = createCombiner(9)
• Merge Partitions:
mergeCombiners(partition1.accumulators[coffee], partition2.accumulators[coffee])
def createCombiner(value):
    return (value, 1)
def mergeValue(acc, value):
    return (acc[0] + value, acc[1] + 1)
def mergeCombiners(acc1, acc2):
    return (acc1[0] + acc2[0], acc1[1] + acc2[1])
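The trace above can be replayed end to end in plain Python. The sketch below is a model of the data flow only, not Spark's implementation: the driver function `combine_by_key` is a hypothetical name, and the two partitions are hard-coded to match the trace.

```python
# Plain-Python replay of the combineByKey() trace (no Spark required).
def create_combiner(value):
    return (value, 1)


def merge_value(acc, value):
    return (acc[0] + value, acc[1] + 1)


def merge_combiners(acc1, acc2):
    return (acc1[0] + acc2[0], acc1[1] + acc2[1])


def combine_by_key(partitions):
    """Hypothetical driver: per-partition accumulation, then a cross-partition merge."""
    per_partition = []
    for part in partitions:
        accumulators = {}
        for key, value in part:
            if key not in accumulators:   # first time the key is seen in this partition
                accumulators[key] = create_combiner(value)
            else:                         # key already seen in this partition
                accumulators[key] = merge_value(accumulators[key], value)
        per_partition.append(accumulators)

    merged = {}
    for accumulators in per_partition:
        for key, acc in accumulators.items():
            merged[key] = merge_combiners(merged[key], acc) if key in merged else acc
    return merged


partitions = [
    [("coffee", 1), ("coffee", 2), ("panda", 3)],  # partition 1
    [("coffee", 9)],                               # partition 2
]
combined = combine_by_key(partitions)
averages = {k: s / c for k, (s, c) in combined.items()}

print(combined)  # {'coffee': (12, 3), 'panda': (3, 1)}
print(averages)  # {'coffee': 4.0, 'panda': 3.0}
```

Note that create_combiner runs twice for coffee (once in each partition), which matches the point that a key is "new" per partition rather than per RDD.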

Tuning the level of parallelism
• When performing aggregations or grouping operations, we can ask Spark to use a specific number of partitions
  • reduceByKey((x, y) => x + y, 10)
• repartition()
  • Shuffles the data across the network to create a new set of partitions
  • Expensive operation
  • Optimized version: coalesce()
    • Avoids data movement when decreasing the number of partitions

groupByKey()
• Groups our data using the key in our RDD
• On an RDD consisting of keys of type K and values of type V
  • Results will be an RDD of type [K, Iterable[V]]
• cogroup()
  • Groups data from multiple RDDs
  • Over two RDDs sharing the same key type K, with the respective value types V and W, gives us back RDD[(K, (Iterable[V], Iterable[W]))]

joins(other dataset, [numPartitions])
• Inner join
  • Only keys that are present in both pair RDDs are output
• leftOuterJoin(other) and rightOuterJoin(other)
  • One of the pair RDDs can be missing the key
• leftOuterJoin(other)
  • The resulting pair RDD has entries for each key in the source RDD
• rightOuterJoin(other)
  • The resulting pair RDD has entries for each key in the other RDD

In-Memory Cluster Computing: Apache Spark
RDD: Transformations
RDD: Actions
RDD: Persistence

Actions [1/2]
println("Input had " + badLinesRDD.count() + " concerning lines")
println("Here are 10 examples:")
badLinesRDD.take(10).foreach(println)
• Returns a final value to the driver program
  • Or writes data to an external storage system
• Log file analysis example is continued
  • take() retrieves a small number of elements in the RDD at the driver program
  • Iterates over them locally to print out information at the driver

Actions [2/2]
• collect()
  • Retrieves the entire RDD to the driver
  • The entire dataset (RDD) should fit in memory on a single machine
  • Useful if the RDD has been filtered down to a very small dataset
• For a very large RDD
  • Store it in external storage (e.g. S3 or HDFS)
  • saveAsTextFile() action

reduce()
• Takes a function that operates on two elements of the type in your RDD and returns a new element of the same type
• The function should be commutative and associative so that it can be computed correctly in parallel
val rdd1 = sc.parallelize(List(1, 2, 5))
val sum = rdd1.reduce{ (x, y) => x + y }
// result: sum: Int = 8
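Why the function needs to be commutative and associative can be illustrated with a plain-Python sketch. The helper `parallel_reduce` is a hypothetical model (reduce each partition, then reduce the partial results); the partition layouts are assumptions chosen for illustration.

```python
# Plain-Python model of a partitioned reduce (no Spark required).
from functools import reduce


def parallel_reduce(partitions, func):
    """Reduce each non-empty partition, then reduce the partial results."""
    partials = [reduce(func, p) for p in partitions if p]
    return reduce(func, partials)


def add(x, y):
    return x + y


def sub(x, y):
    return x - y


one_partition = [[1, 2, 5]]
two_partitions = [[1], [2, 5]]

# Addition is commutative and associative: the layout does not matter.
add_one = parallel_reduce(one_partition, add)   # 8
add_two = parallel_reduce(two_partitions, add)  # 8

# Subtraction is neither: the result depends on how the data is partitioned.
sub_one = parallel_reduce(one_partition, sub)   # (1 - 2) - 5 = -6
sub_two = parallel_reduce(two_partitions, sub)  # 1 - (2 - 5) = 4

print(add_one, add_two, sub_one, sub_two)
```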

reduce() vs. fold()
• fold() is similar to reduce() but it takes a ‘zero value’ (initial value)
• The function should be commutative and associative so that it can be computed correctly in parallel
val rdd1 = sc.parallelize(List(("maths", 80), ("science", 90)))
rdd1.partitions.length // result: res8: Int = 8
val additionalMarks = ("extra", 4)
val sum = rdd1.fold(additionalMarks){ (acc, marks) =>
  val sum = acc._2 + marks._2
  ("total", sum)
}
What will be the result (sum)?

reduce() vs. fold()
• fold() is similar to reduce() but it takes a ‘zero value’ (initial value)
• The function should be commutative and associative so that it can be computed correctly in parallel
val rdd1 = sc.parallelize(List(("maths", 80), ("science", 90)))
rdd1.partitions.length // result: res8: Int = 8
val additionalMarks = ("extra", 4)
val sum = rdd1.fold(additionalMarks){ (acc, marks) =>
  val sum = acc._2 + marks._2
  ("total", sum)
}
// result: sum: (String, Int) = (total,206)
// 4 x (8 + 1) + 80 + 90 = 206
// The zero value is applied once in each of the 8 partitions and once more when the partition results are merged
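The result of 206 follows from the zero value being folded in once per partition and once more in the final merge. The plain-Python sketch below models that behaviour; the helper `fold` and the placement of the two records within the 8 partitions are assumptions made for illustration.

```python
# Plain-Python model of fold() with a zero value over partitions (no Spark required).
from functools import reduce


def fold(partitions, zero, op):
    """Fold each partition starting from zero, then fold the partials starting from zero."""
    partials = [reduce(op, part, zero) for part in partitions]
    return reduce(op, partials, zero)


def op(acc, marks):
    return ("total", acc[1] + marks[1])


zero = ("extra", 4)

# Assumed layout: two records spread over 8 partitions, six of them empty.
eight_partitions = [[], [], [], [("maths", 80)], [], [], [], [("science", 90)]]
result_eight = fold(eight_partitions, zero, op)  # 4 * (8 + 1) + 80 + 90

# With a single partition the zero value is applied only twice.
one_partition = [[("maths", 80), ("science", 90)]]
result_one = fold(one_partition, zero, op)       # 4 * (1 + 1) + 80 + 90

print(result_eight)  # ('total', 206)
print(result_one)    # ('total', 178)
```

Because the result changes with the number of partitions, a zero value other than the identity of the operation (0 for addition) is usually a mistake.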

take(n)
• Returns n elements from the RDD and attempts to minimize the number of partitions it accesses
  • It may represent a biased collection
  • It does not return the elements in the order you might expect
  • Useful for unit testing

In-Memory Cluster Computing: Apache Spark
RDD: Transformations
RDD: Actions
RDD: Persistence

Persistence
• Caches a dataset across operations
  • Nodes store any partitions of results from previous operation(s) in memory and reuse them in other actions
• An RDD to be persisted can be specified by persist() or cache()
• The persisted RDD can be stored using a different storage level
  • Using a StorageLevel object
  • Passing a StorageLevel object to persist()

Benefits of RDDs as a distributed memory abstraction [1/3]
• RDDs can only be created (“written”) through coarse-grained transformations
  • Distributed shared memory (DSM) allows reads and writes to each memory location
  • Reads on RDDs can still be fine-grained
    • A large read-only lookup table
• Applications perform bulk writes
  • More efficient fault tolerance
  • Lineage-based bulk recovery

Persistence levels
level | Space used | CPU time | In memory/On disk | Comment
MEMORY_ONLY | High | Low | Y/N |
MEMORY_ONLY_SER | Low | High | Y/N | Store RDD as serialized Java objects (one byte array per partition).
MEMORY_AND_DISK | High | Medium | Some/Some | Spills to disk if there is too much data to fit in memory
MEMORY_AND_DISK_SER | Low | High | Some/Some | Spills to disk if there is too much data to fit in memory. Stores serialized representation in memory
DISK_ONLY | Low | High | N/Y |

Benefits of RDDs as a distributed memory abstraction [2/3]
• RDDs’ immutable data
  • System can mitigate slow nodes (stragglers)
  • Creates backup copies of slow tasks
    • without accessing the same memory
• Spark distributes the data over different worker nodes that run computations in parallel
  • Orchestrates communication between nodes to integrate intermediate results and combine them for the final result

Benefits of RDDs as a distributed memory abstraction [3/3]
• Runtime can schedule tasks based on data locality
  • To improve performance
• RDDs degrade gracefully when there is insufficient memory
  • Partitions that do not fit in RAM are stored on disk
Questions?