CS535 Big Data 2/5/2020 Week 3- B Sangmi Lee Pallickara

http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University

CS535 Big Data | Computer Science | Colorado State University

CS535 BIG DATA

PART A. BIG DATA TECHNOLOGY
3. DISTRIBUTED COMPUTING MODELS FOR SCALABLE BATCH COMPUTING
SECTION 2: IN-MEMORY CLUSTER COMPUTING

Sangmi Lee Pallickara
Computer Science, Colorado State University
http://www.cs.colostate.edu/~cs535

FAQs
• PA1
  • GEAR Session 1 signup is available
  • See the announcement in Canvas
• Feedback policy
  • Quiz, TP proposal: 1 week
  • Email: 24 hrs

Topics of Today's Class
• 3. Distributed Computing Models for Scalable Batch Computing
  • Introduction to Spark
  • Operations: transformations, actions, persistence

In-Memory Cluster Computing: Apache Spark
RDD (Resilient Distributed Dataset)

RDD (Resilient Distributed Dataset)
• Read-only, memory-resident, partitioned collection of records
• A fault-tolerant collection of elements that can be operated on in parallel
• RDDs are the core unit of data in Spark
• Most Spark programming involves performing operations on RDDs

Creating RDDs [1/3]

• Loading an external dataset

val lines = sc.textFile("/path/to/README.md")

• Parallelizing a collection in your driver program

val lines = sc.parallelize(List("pandas", "i like pandas"))

https://spark.apache.org/docs/latest/rdd-programming-guide.html

Creating RDDs [2/3]

1: val lines = sc.textFile("data.txt")
2: val lineLengths = lines.map(s => s.length)
3: val totalLength = lineLengths.reduce((a, b) => a + b)

• Line 1: defines a base RDD from an external file
  • The dataset is not loaded into memory at this point
• Line 2: defines lineLengths as the result of a map transformation
  • It is not computed immediately
• Line 3: performs the reduce action, which triggers the computation and returns the result
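The lazy behaviour of the three lines above can be imitated without Spark. The following is only a sketch in plain Python: `LazyDataset` is a hypothetical stand-in invented for illustration, not a Spark class, and it records `map` functions without running them until an action (`reduce`) is called.

```python
from functools import reduce


class LazyDataset:
    """Hypothetical stand-in for an RDD: records transformations, runs them only on an action."""

    def __init__(self, source, funcs=()):
        self._source = source        # the underlying records (a plain list here)
        self._funcs = tuple(funcs)   # pending map functions, not yet applied

    def map(self, func):
        # Transformation: returns a new dataset description; nothing is computed.
        return LazyDataset(self._source, self._funcs + (func,))

    def _compute(self):
        for record in self._source:
            for func in self._funcs:
                record = func(record)
            yield record

    def reduce(self, func):
        # Action: this is the point where the recorded pipeline actually runs.
        return reduce(func, self._compute())


calls = []  # records every invocation of the map function


def length(s):
    calls.append(s)
    return len(s)


lines = LazyDataset(["pandas", "i like pandas"])
line_lengths = lines.map(length)                        # nothing computed yet
calls_before_action = len(calls)                        # still 0
total_length = line_lengths.reduce(lambda a, b: a + b)  # 6 + 13 = 19
```

Only the call to `reduce` causes `length` to run, which mirrors why line 2 of the slide costs nothing until line 3 executes.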

Creating RDDs [3/3]

• If you want to use lineLengths again later, persist it before the first action (here, before the reduce) so that it is kept in memory after it is first computed

lineLengths.persist()

Spark Programming Interface to RDD [1/3]

• Transformations
  • Operations that create RDDs
  • Return pointers to new RDDs
  • e.g. map, filter, and join
• RDDs can only be created through deterministic operations on either
  • Data in stable storage
  • Other RDDs

1: val lines = sc.textFile("data.txt")
2: val lineLengths = lines.map(s => s.length)
3: val totalLength = lineLengths.reduce((a, b) => a + b)

Do not confuse these with "transformation" in the Scala language

Spark Programming Interface to RDD [2/3]

• Actions
  • Operations that return a value to the application or export data to a storage system
  • e.g. count: returns the number of elements in the dataset
  • e.g. collect: returns the elements themselves
  • e.g. save: outputs the dataset to a storage system (saveAsTextFile() in the Spark API)

1: val lines = sc.textFile("data.txt")
2: val lineLengths = lines.map(s => s.length)
3: val totalLength = lineLengths.reduce((a, b) => a + b)

Spark Programming Interface to RDD [3/3]

• Persistence
  • Users indicate which RDDs they want to reuse in future operations
  • Spark keeps persistent RDDs in memory by default
  • If there is not enough RAM, it can spill them to disk
  • Users are allowed to
    • store the RDD only on disk
    • replicate the RDD across machines
    • specify a persistence priority on each RDD

lineLengths.persist()
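To see what persisting buys, the sketch below counts how often a function is re-applied with and without a cache. It is plain Python under stated assumptions: `CachedDataset` and its `evaluations` counter are hypothetical names for illustration, and storage levels, spilling, and replication are not modelled.

```python
class CachedDataset:
    """Hypothetical stand-in for an RDD with an optional in-memory cache."""

    def __init__(self, source, func):
        self._source = source
        self._func = func
        self._persist = False
        self._cache = None
        self.evaluations = 0  # how many times func has been applied

    def persist(self):
        # Only marks the dataset; the cache is filled by the first action.
        self._persist = True
        return self

    def collect(self):
        if self._cache is not None:
            return list(self._cache)
        result = []
        for record in self._source:
            self.evaluations += 1
            result.append(self._func(record))
        if self._persist:
            self._cache = result
        return list(result)


data = ["a", "bb", "ccc"]

unpersisted = CachedDataset(data, len)
unpersisted.collect()
unpersisted.collect()   # recomputed: 6 evaluations in total

persisted = CachedDataset(data, len).persist()
persisted.collect()
persisted.collect()     # served from the cache: 3 evaluations in total
```

As in Spark, calling `persist()` by itself computes nothing here; the first action fills the cache and later actions reuse it.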

In-Memory Cluster Computing: Apache Spark
RDD: Transformations
RDD: Actions
RDD: Persistence

What are the Transformations?
• An RDD's transformations create a new dataset from an existing one
  • Simple transformations
  • Transformations with multiple RDDs
  • Transformations with the Pair RDDs

In-Memory Cluster Computing: Apache Spark
RDD: Transformations
1. Simple transformations
2. Transformations with multiple RDDs
3. Transformations with the Pair RDDs

Simple Transformations
• These transformations create a new RDD from an existing RDD
• E.g. map(), filter(), flatMap(), sample()

map() vs. filter() [1/2]
• The map(func) transformation takes in a function and applies it to each element in the RDD
  • The result of the function becomes the new value of each element in the resulting RDD
• The filter(func) transformation takes in a function and returns an RDD that contains only the elements for which the function returns true

inputRDD {1,2,3,4}
  map x => x*x    →  mappedRDD {1,4,9,16}
  filter x != 1   →  filteredRDD {2,3,4}

map() vs. filter() [2/2]
• map() that squares all of the numbers in an RDD

val input = sc.parallelize(List(1, 2, 3, 4))
val result = input.map(x => x * x)
println(result.collect().mkString(","))
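The same two operations can be checked on an ordinary list. This is a plain-Python sketch of the semantics only (no Spark, no partitions); the variable names are illustrative.

```python
input_data = [1, 2, 3, 4]

# map: exactly one output element per input element
mapped = [x * x for x in input_data]

# filter: keeps only the elements for which the predicate is true
filtered = [x for x in input_data if x != 1]

# Join with commas, in the spirit of mkString(",")
print(",".join(str(x) for x in mapped))
```

map always preserves the number of elements, while filter can only keep or drop them.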

map() vs. flatMap() [1/2]
• As the result of flatMap(), we have an RDD of the elements
  • Instead of an RDD of lists of elements

RDD1 {"coffee panda", "happy panda", "happiest panda party"}

RDD1.map(tokenize)
  →  mappedRDD {["coffee","panda"], ["happy","panda"], ["happiest","panda","party"]}

RDD1.flatMap(tokenize)
  →  flatMappedRDD {"coffee","panda","happy","panda","happiest","panda","party"}

map() vs. flatMap() [2/2]
• Using flatMap() to split lines into multiple words

val lines = sc.parallelize(List("hello world", "hi"))
val words = lines.flatMap(line => line.split(" "))
words.first() // returns "hello"

Example [1/2]

Input (data), one element per line:
Colorado State University
Ohio State University
Washington State University
Boston University

• Using map

>>> wc = data.map(lambda line: line.split(" "))
>>> llist = wc.collect()
>>> for line in llist:  # print the list
...     print(line)

Output?

• Using flatMap

>>> fm = data.flatMap(lambda line: line.split(" "))
>>> llist = fm.collect()
>>> for line in llist:  # print the list
...     print(line)

Output?

Example -- Answers [2/2]

• Using map: four elements, each the list of words from one input line

['Colorado', 'State', 'University']
['Ohio', 'State', 'University']
['Washington', 'State', 'University']
['Boston', 'University']

• Using flatMap: eleven elements, one word each

Colorado
State
University
Ohio
State
University
Washington
State
University
Boston
University
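The answers can be reproduced without a cluster. The sketch below uses plain Python lists in place of the RDD (an assumption made so it runs anywhere); `itertools.chain.from_iterable` plays the role of the flattening step in flatMap.

```python
from itertools import chain

data = [
    "Colorado State University",
    "Ohio State University",
    "Washington State University",
    "Boston University",
]

# map: one output element (a list of words) per input line
mapped = [line.split(" ") for line in data]

# flatMap: the per-line lists are flattened into a single sequence of words
flat_mapped = list(chain.from_iterable(line.split(" ") for line in data))

for element in mapped:
    print(element)
for word in flat_mapped:
    print(word)
```

The map result has as many elements as there are input lines; the flatMap result has as many elements as there are words.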

map() vs. mapPartitions()
• map(func) converts each element of the source RDD into a single element of the result RDD by applying a function
• mapPartitions(func) converts each partition of the source RDD into multiple elements of the result (possibly none)
  • Similar to map, but runs separately on each partition (block) of the RDD
  • func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T

// master and the AvgCount helper class are defined elsewhere (not shown on the slide)
val sc = new SparkContext(master, "BasicAvgMapPartitions", System.getenv("SPARK_HOME"))
val input = sc.parallelize(List(1, 2, 3, 4))
val result = input.mapPartitions(partition =>
  Iterator(AvgCount(0, 0).merge(partition))).reduce((x, y) => x.merge(y))
println(result)
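Since the AvgCount class is not shown on the slide, here is a self-contained sketch of the same idea in plain Python. The partition layout and the helper names `map_partitions` and `sum_and_count` are assumptions for illustration; the point is that the function is called once per partition and receives an iterator.

```python
def map_partitions(partitions, func):
    """Apply func once per partition; func takes an iterator and returns an iterator."""
    return [list(func(iter(partition))) for partition in partitions]


def sum_and_count(records):
    # Called once per partition, so any setup cost would be paid once per partition.
    total, count = 0, 0
    for value in records:
        total += value
        count += 1
    yield (total, count)


# Assumed layout: the list 1..4 split into two partitions.
partitions = [[1, 2], [3, 4]]

per_partition = map_partitions(partitions, sum_and_count)  # [[(3, 2)], [(7, 2)]]

# Merge the per-partition (sum, count) pairs, as the reduce step does.
total, count = 0, 0
for partition in per_partition:
    for s, c in partition:
        total, count = total + s, count + c

average = total / count  # 2.5
```

Each partition yields a single (sum, count) pair here, so the final merge touches two small tuples instead of four raw records.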

map() vs. mapPartitions(): Performance
• Does map() perform faster than mapPartitions()?
  • Assume that they are performed over the same RDD with the same number of partitions

repartition() vs. coalesce()
• repartition(numPartitions)
  • Reshuffles the data in the RDD randomly to create either more or fewer partitions and balances it across them
  • This always shuffles all data over the network
• coalesce(numPartitions)
  • Decreases the number of partitions in the RDD to numPartitions
  • Useful for running operations more efficiently after filtering down a large dataset
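The difference can be pictured with lists of lists. This is a deliberately simplified model in plain Python, not Spark's actual placement logic: here `coalesce` merges whole partitions, while `repartition` redistributes every record individually (round-robin is an assumption chosen for readability).

```python
def coalesce(partitions, num_partitions):
    """Merge whole partitions into fewer ones; records of one partition stay together."""
    merged = [[] for _ in range(num_partitions)]
    for index, partition in enumerate(partitions):
        merged[index % num_partitions].extend(partition)
    return merged


def repartition(partitions, num_partitions):
    """Redistribute every record individually (round-robin here) across new partitions."""
    redistributed = [[] for _ in range(num_partitions)]
    records = [record for partition in partitions for record in partition]
    for index, record in enumerate(records):
        redistributed[index % num_partitions].append(record)
    return redistributed


# After a filter, partitions are often small and uneven.
partitions = [[1], [], [2, 3, 4, 5, 6], [7]]

fewer = coalesce(partitions, 2)        # [[1, 2, 3, 4, 5, 6], [7]]
balanced = repartition(partitions, 2)  # [[1, 3, 5, 7], [2, 4, 6]]
```

In this model coalesce is cheap but can leave the result unbalanced, while repartition balances the data at the cost of moving every record.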

In-Memory Cluster Computing: Apache Spark
RDD: Transformations
1. Simple transformations
2. Transformations with multiple RDDs
3. Transformations with the Pair RDDs

Two-RDD transformations on RDDs containing RDD1 = {1, 2, 3} and RDD2 = {3, 4, 5}

Name            Purpose                                                      Result
union()         Union                                                        {1,2,3,3,4,5}
intersection()  Intersection                                                 {3}
subtract()      Remove the contents of one RDD (e.g. remove training data)   {1,2}
cartesian()     Cartesian product                                            {(1,3),(1,4),(1,5),(2,3),(2,4),(2,5),(3,3),(3,4),(3,5)}
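The four results can be verified with ordinary Python collections. This is a sketch of the semantics only; lists stand in for the two RDDs, and element order is fixed here although Spark does not guarantee it.

```python
from itertools import product

rdd1 = [1, 2, 3]
rdd2 = [3, 4, 5]

union = rdd1 + rdd2                                  # duplicates are kept
intersection = sorted(set(rdd1) & set(rdd2))         # duplicates are removed
subtract = [x for x in rdd1 if x not in set(rdd2)]   # elements of rdd1 not in rdd2
cartesian = list(product(rdd1, rdd2))                # every (a, b) pair
```

The Cartesian product of two three-element datasets has nine pairs, and every second component comes from RDD2, so no pair can contain 6.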

In-Memory Cluster Computing: Apache Spark
RDD: Transformations
1. Simple transformations
2. Transformations with multiple RDDs
3. Transformations with the Pair RDDs

Why Key/Value Pairs?
• Pair RDDs
  • Spark provides special operations on RDDs containing key/value pairs
  • Pair RDDs allow you to act on each key in parallel or regroup data across the network
• reduceByKey()
  • Aggregates data separately for each key
• join()
  • Merges two RDDs by grouping elements with the same key

Creating Pair RDDs
• Running the map() function
  • Returns key/value pairs

// Creating a pair RDD using the first word as the key in Scala
val pairs = lines.map(x => (x.split(" ")(0), x))

Transformations on a single pair RDD (example: RDD1 = {(1, 2), (3, 4), (3, 6)})

• Pair RDDs are allowed to use all the transformations available to standard RDDs.

Function        Purpose                            Example                            Result
reduceByKey()   Combine values with the same key   rdd.reduceByKey((x, y) => x + y)   {(1,2),(3,10)}
groupByKey()    Group values with the same key     rdd.groupByKey()                   {(1,[2]),(3,[4,6])}
combineByKey(createCombiner, mergeValue, mergeCombiners, partitioner)
                Combine values with the same key using a different result type       (We will revisit this function later.)

Transformations on a single pair RDD, continued (example: RDD1 = {(1, 2), (3, 4), (3, 6)})

Function             Purpose                                                                 Example                            Result
mapValues(func)      Apply a function to each value of a pair RDD without changing the key   rdd.mapValues(x => x + 1)          {(1,3),(3,5),(3,7)}
flatMapValues(func)  Apply a function that returns an iterator                               rdd.flatMapValues(x => (x to 5))   {(1,2),(1,3),(1,4),(1,5),(3,4),(3,5)}
keys()               Return an RDD of just the keys                                          rdd.keys                           {1,3,3}
values()             Return an RDD of just the values                                        rdd.values                         {2,4,6}
sortByKey()          Return an RDD sorted by the key                                         rdd.sortByKey()                    {(1,2),(3,4),(3,6)}
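The result column of these tables can be recomputed on a plain list of tuples. The sketch below is plain Python, not the Spark API: `reduce_by_key` and `group_by_key` are hypothetical helpers written for illustration, and results are sorted by key only to make them deterministic.

```python
from collections import defaultdict

rdd = [(1, 2), (3, 4), (3, 6)]


def reduce_by_key(pairs, func):
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return sorted(result.items())


def group_by_key(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


reduced = reduce_by_key(rdd, lambda x, y: x + y)
grouped = group_by_key(rdd)
map_values = [(k, v + 1) for k, v in rdd]
# Scala's (x to 5) is inclusive, hence range(v, 6); the value 6 yields nothing.
flat_map_values = [(k, x) for k, v in rdd for x in range(v, 6)]
keys = [k for k, _ in rdd]
values = [v for _, v in rdd]
sorted_by_key = sorted(rdd, key=lambda kv: kv[0])
```

Note that the pair (3, 6) disappears under flatMapValues because the range from 6 to 5 is empty, and that sortByKey keeps every value unchanged.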

Transformations on two pair RDDs (example: rdd = {(1, 2), (3, 4), (3, 6)}, other = {(3, 9)})

Function          Purpose                                                         Example                     Result
subtractByKey()   Remove elements with a key present in the other RDD            rdd.subtractByKey(other)    {(1,2)}
join()            Inner join                                                      rdd.join(other)             {(3,(4,9)),(3,(6,9))}
rightOuterJoin()  Perform a join where the key must be present in the other RDD   rdd.rightOuterJoin(other)   {(3,(Some(4),9)),(3,(Some(6),9))}
leftOuterJoin()   Perform a join where the key must be present in the first RDD   rdd.leftOuterJoin(other)    {(1,(2,None)),(3,(4,Some(9))),(3,(6,Some(9)))}
cogroup()         Group data from both RDDs sharing the same key                  rdd.cogroup(other)          {(1,([2],[])),(3,([4,6],[9]))}
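The join variants differ only in which side is allowed to be missing. A plain-Python sketch (lists of tuples instead of RDDs; `values_by_key` is a hypothetical helper) makes this concrete. Python has no Option type, so a missing side is shown as `None` and a present value is left bare instead of being wrapped in `Some(...)`.

```python
from collections import defaultdict

rdd = [(1, 2), (3, 4), (3, 6)]
other = [(3, 9)]


def values_by_key(pairs):
    index = defaultdict(list)
    for key, value in pairs:
        index[key].append(value)
    return index


left, right = values_by_key(rdd), values_by_key(other)

subtract_by_key = [(k, v) for k, v in rdd if k not in right]
join = [(k, (v, w)) for k, v in rdd for w in right.get(k, [])]
left_outer = [(k, (v, w)) for k, v in rdd for w in right.get(k, [None])]
right_outer = [(k, (v, w)) for k, w in other for v in left.get(k, [None])]
cogroup = [(k, (left.get(k, []), right.get(k, [])))
           for k in sorted(set(left) | set(right))]
```

Key 1 exists only in `rdd`, so it survives subtractByKey and leftOuterJoin, is dropped by join and rightOuterJoin, and appears in cogroup with an empty list on the other side.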

Pair RDDs are still RDDs
• Supports the same functions as RDDs

pairs.filter{case (key, value) => value.length < 20}

• To access only the "value" part, use mapValues(func)
  • Same as map{case (x, y) => (x, func(y))}

Aggregations with Pair RDDs
• Aggregate statistics across all elements with the same key
• reduceByKey()
  • Similar to reduce()
  • Takes a function and uses it to combine values
  • Runs several parallel reduce operations
    • One for each key in the dataset
    • Each operation combines values that have the same key
• reduceByKey() is NOT implemented as an action that returns a value to the user program
  • It returns a new RDD consisting of each key and the reduced value for that key

Example
• Per-key average with reduceByKey() and mapValues()

rdd.mapValues(x => (x, 1)).reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))

• The result of reduceByKey() holds a (sum, count) pair for each key; dividing sum by count gives the per-key average
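The two steps can be traced on a small dataset. The sketch below is plain Python; the sample pairs are made up for illustration and are not from the slides, and the final division is an extra step that the one-line Scala example leaves to the reader.

```python
# Hypothetical sample data: (key, value) pairs
rdd = [("panda", 0), ("pink", 3), ("pirate", 3), ("panda", 1), ("pink", 4)]

# Step 1, mapValues: turn every value into a (sum, count) pair.
with_counts = [(key, (value, 1)) for key, value in rdd]

# Step 2, reduceByKey: add the pairs component-wise for each key.
sum_count = {}
for key, (s, c) in with_counts:
    if key in sum_count:
        prev_s, prev_c = sum_count[key]
        sum_count[key] = (prev_s + s, prev_c + c)
    else:
        sum_count[key] = (s, c)

# Step 3 (not in the Scala one-liner): divide sum by count.
averages = {key: s / c for key, (s, c) in sum_count.items()}
```

Carrying the count alongside the sum is what makes the reduction function associative; averaging averages directly would not be.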

combineByKey()
• The most general of the per-key aggregation functions
  • Most of the other per-key combiners are implemented using it
• Allows the user to return values that are not the same type as the input data
• createCombiner()
  • Called when combineByKey() finds a new key
  • This happens the first time a key is found in each partition, rather than only the first time the key is found in the RDD
• mergeValue()
  • Called when the key has already been seen in that partition
• mergeCombiners()
  • Merges the results from each partition

Per-key average using combineByKey()

val result = input.combineByKey(
  (v) => (v, 1),                                        // createCombiner()
  (acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1),     // mergeValue()
  (acc1: (Int, Int), acc2: (Int, Int)) =>
    (acc1._1 + acc2._1, acc1._2 + acc2._2)              // mergeCombiners()
).map{ case (key, value) => (key, value._1 / value._2.toFloat) }

result.collectAsMap().foreach(println(_))

combineByKey() sample data flow
• Partition 1 trace:
  (coffee, 1) → new key
    accumulators[coffee] = createCombiner(1)
  (coffee, 2) → existing key
    accumulators[coffee] = mergeValue(accumulators[coffee], 2)
  (panda, 3) → new key
    accumulators[panda] = createCombiner(3)
• Partition 2 trace:
  (coffee, 9) → new key
    accumulators[coffee] = createCombiner(9)
• Merge partitions:
  mergeCombiners(partition1.accumulators[coffee], partition2.accumulators[coffee])

def createCombiner(value):
    return (value, 1)

def mergeValue(acc, value):
    return (acc[0] + value, acc[1] + 1)

def mergeCombiners(acc1, acc2):
    return (acc1[0] + acc2[0], acc1[1] + acc2[1])
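The trace can be executed end to end in plain Python. The driver function `combine_by_key` below is a hypothetical sketch of the control flow described on the slide (per-partition accumulators, then a merge), not Spark's implementation; the partition contents are the ones from the trace.

```python
def create_combiner(value):
    return (value, 1)


def merge_value(acc, value):
    return (acc[0] + value, acc[1] + 1)


def merge_combiners(acc1, acc2):
    return (acc1[0] + acc2[0], acc1[1] + acc2[1])


def combine_by_key(partitions):
    # Phase 1: each partition builds its own accumulators independently.
    per_partition = []
    for partition in partitions:
        accumulators = {}
        for key, value in partition:
            if key in accumulators:
                accumulators[key] = merge_value(accumulators[key], value)
            else:
                accumulators[key] = create_combiner(value)
        per_partition.append(accumulators)

    # Phase 2: accumulators for the same key are merged across partitions.
    merged = {}
    for accumulators in per_partition:
        for key, acc in accumulators.items():
            merged[key] = merge_combiners(merged[key], acc) if key in merged else acc
    return merged


partitions = [
    [("coffee", 1), ("coffee", 2), ("panda", 3)],
    [("coffee", 9)],
]
result = combine_by_key(partitions)
averages = {key: total / count for key, (total, count) in result.items()}
```

"coffee" triggers the create-combiner step twice, once per partition, which is why the slide stresses that a key is "new" per partition rather than per RDD.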

Tuning the level of parallelism
• When performing aggregations or grouping operations, we can ask Spark to use a specific number of partitions
  • reduceByKey((x, y) => x + y, 10)
• repartition()
  • Shuffles the data across the network to create a new set of partitions
  • Expensive operation
• Optimized version: coalesce()
  • Avoids data movement when decreasing the number of partitions

groupByKey()
• Groups data using the key in the RDD
• On an RDD consisting of keys of type K and values of type V
  • The result is an RDD of type RDD[(K, Iterable[V])]
• cogroup()
  • Groups data from multiple RDDs
  • Over two RDDs sharing the same key type K, with the respective value types V and W, it gives back RDD[(K, (Iterable[V], Iterable[W]))]

join(otherDataset, [numPartitions])
• Inner join
  • Only keys that are present in both pair RDDs are output
• leftOuterJoin(other) and rightOuterJoin(other)
  • One of the pair RDDs can be missing the key
• leftOuterJoin(other)
  • The resulting pair RDD has entries for each key in the source RDD
• rightOuterJoin(other)
  • The resulting pair RDD has entries for each key in the other RDD

In-Memory Cluster Computing: Apache Spark
RDD: Transformations
RDD: Actions
RDD: Persistence

Actions [1/2]
• Returns a final value to the driver program
  • Or writes data to an external storage system

println("Input had " + badLinesRDD.count() + " concerning lines")
println("Here are 10 examples:")
badLinesRDD.take(10).foreach(println)

• The log file analysis example, continued
  • take() retrieves a small number of elements in the RDD at the driver program
  • The driver then iterates over them locally to print out information

Actions [2/2]
• collect()
  • Retrieves the entire RDD to the driver
  • The entire dataset (RDD) should fit in memory on a single machine
  • Useful if the RDD has been filtered down to a very small dataset
• For a very large RDD
  • Store it in external storage (e.g. S3 or HDFS)
  • saveAsTextFile() action

reduce()
• Takes a function that operates on two elements of the type in your RDD and returns a new element of the same type
• The function should be commutative and associative so that it can be computed correctly in parallel

val rdd1 = sc.parallelize(List(1, 2, 5))
val sum = rdd1.reduce{ (x, y) => x + y }
// result: sum: Int = 8
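Why the function should be commutative and associative becomes visible once the same data is split in two different ways. The sketch below is plain Python; `distributed_reduce` is a hypothetical helper that reduces each partition first and then the partial results, which is the general shape of a parallel reduce.

```python
from functools import reduce


def distributed_reduce(partitions, func):
    """Reduce each non-empty partition, then reduce the partial results."""
    partials = [reduce(func, p) for p in partitions if p]
    return reduce(func, partials)


def add(x, y):
    return x + y


def subtract(x, y):
    return x - y


# The same three numbers, partitioned in two different ways.
layout_a = [[1, 2], [5]]
layout_b = [[1], [2, 5]]

sum_a = distributed_reduce(layout_a, add)        # 8
sum_b = distributed_reduce(layout_b, add)        # 8

diff_a = distributed_reduce(layout_a, subtract)  # (1 - 2) - 5 = -6
diff_b = distributed_reduce(layout_b, subtract)  # 1 - (2 - 5) = 4
```

Addition gives 8 regardless of the layout, whereas subtraction, which is neither commutative nor associative, gives a result that depends on how the data happened to be partitioned.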

reduce() vs. fold()
• fold() is similar to reduce(), but it takes a 'zero value' (initial value)
• The function should be commutative and associative so that it can be computed correctly in parallel

val rdd1 = sc.parallelize(List(("maths", 80), ("science", 90)))
rdd1.partitions.length // result: res8: Int = 8
val additionalMarks = ("extra", 4)
val sum = rdd1.fold(additionalMarks){ (acc, marks) =>
  val sum = acc._2 + marks._2
  ("total", sum)
}

What will be the result (sum)?

// result: sum: (String, Int) = (total,206)
// (4 x 9) + 80 + 90 = 206
// The zero value is applied once in each of the 8 partitions and once more when the per-partition results are merged
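The value 206 can be reproduced by modelling how the zero value is used. In this plain-Python sketch (the `fold` helper and the placement of the two records are assumptions for illustration), the zero value seeds the fold inside each of the 8 partitions and then seeds the final merge once more, so it is added 9 times: 4 x 9 + 80 + 90 = 206.

```python
from functools import reduce


def fold(partitions, zero_value, func):
    # The zero value seeds the fold inside every partition, including empty ones...
    partials = [reduce(func, partition, zero_value) for partition in partitions]
    # ...and seeds the merge of the per-partition results once more.
    return reduce(func, partials, zero_value)


def add_marks(acc, marks):
    return ("total", acc[1] + marks[1])


# Assumed layout: 8 partitions, of which only two hold a record.
partitions = [[], [], [], [("maths", 80)], [], [], [], [("science", 90)]]

result = fold(partitions, ("extra", 4), add_marks)
```

With a neutral zero value such as ("extra", 0) the result would be 170 for any number of partitions, which is why the zero value is normally the identity element of the operation.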

take(n)
• Returns n elements from the RDD and attempts to minimize the number of partitions it accesses
  • It may represent a biased collection
  • It does not return the elements in the order you might expect
  • Useful for unit testing

In-Memory Cluster Computing: Apache Spark
RDD: Transformations
RDD: Actions
RDD: Persistence

Persistence
• Caches a dataset across operations
  • Nodes store any partitions of results from previous operation(s) in memory and reuse them in other actions
• An RDD to be persisted can be specified by persist() or cache()
• The persisted RDD can be stored using a different storage level
  • Using a StorageLevel object
  • Passing a StorageLevel object to persist()

Benefits of RDDs as a distributed memory abstraction [1/3]
• RDDs can only be created ("written") through coarse-grained transformations
  • Distributed shared memory (DSM) allows reads and writes to each memory location
  • Reads on RDDs can still be fine-grained
    • e.g. a large read-only lookup table
  • Applications perform bulk writes
• More efficient fault tolerance
  • Lineage-based bulk recovery

Persistence levels

Level                Space used  CPU time  In memory  On disk  Comment
MEMORY_ONLY          High        Low       Y          N
MEMORY_ONLY_SER      Low         High      Y          N        Store RDD as serialized Java objects (one byte array per partition)
MEMORY_AND_DISK      High        Medium    Some       Some     Spills to disk if there is too much data to fit in memory
MEMORY_AND_DISK_SER  Low         High      Some       Some     Spills to disk if there is too much data to fit in memory. Stores serialized representation in memory
DISK_ONLY            Low         High      N          Y

Benefits of RDDs as a distributed memory abstraction [2/3]
• RDDs' data is immutable
  • The system can mitigate slow nodes (stragglers)
    • It creates backup copies of slow tasks
    • The copies do not access the same memory
• Spark distributes the data over different worker nodes that run computations in parallel
  • It orchestrates communication between nodes to integrate intermediate results and combine them into the final result

Benefits of RDDs as a distributed memory abstraction [3/3]
• The runtime can schedule tasks based on data locality
  • To improve performance
• RDDs degrade gracefully when there is insufficient memory
  • Partitions that do not fit in RAM are stored on disk


Questions?