apache spark rdd join to reduce job run time
Post on 14-Jan-2017
Private and confidential. Copyright (C) 2016, Imaginea Technologies Inc. All rights reserved.
APACHE SPARKTM
RDD JOIN TO REDUCE JOB RUN TIME
Insights from Imaginea
THE PROBLEM
Over 1 TB of data would be received from S3 as rows of tuples [ID, v1, v2, … vn]
The need was to find the aggregate of values by their IDs and store the result in HDFS
As we performed a join between the existing data and the incremental data on Spark, there would be a
HUGE AMOUNT OF DATA SHUFFLE across the cluster
We wanted to reduce this shuffle and thus reduce the job run time
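The aggregation described above can be sketched roughly as follows. This is only an illustration: the variable names, the value types, and the use of reduceByKey are assumptions, not taken from the original job.

```scala
import org.apache.spark.rdd.RDD

// Sketch only: aggregate values by ID, assuming each input row has already
// been parsed into an (id, values) pair.
val rows: RDD[(String, Array[Double])] = ??? // parsed from the S3 input

// Element-wise sum of v1..vn for all rows sharing the same ID.
val aggregated = rows.reduceByKey { (a, b) =>
  a.zip(b).map { case (x, y) => x + y }
}

aggregated.saveAsTextFile("hdfs://...") // destination path elided
```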
THE APPROACH
SAME ID GOES TO SAME PARTITION
So, given that the aggregate for each ID in one dataset has to be matched with the same ID in the other dataset, we partitioned the datasets in such a way that rows with the same ID go to the same partition, and thus to the same Spark worker
With this approach, rows were joined locally and the costly shuffle over the network was avoided
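The co-partitioning idea can be sketched like this, assuming both datasets are already keyed (id, values) RDDs; the names and the partition count are hypothetical:

```scala
import org.apache.spark.HashPartitioner

// Co-partition both datasets with the same partitioner so that rows with
// the same ID land in the same partition index of each RDD.
val part = new HashPartitioner(128) // partition count is an assumption

val existing    = existingRdd.partitionBy(part)
val incremental = incrementalRdd.partitionBy(part)

// Because both sides share the partitioner, Spark can join partition i of
// one RDD with partition i of the other without re-shuffling the data.
val joined = existing.join(incremental)
```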
HASHPARTITIONER ON THE RDDs
The approach is to use a HashPartitioner on the RDDs, which partitions the data based on the key hash
THE HURDLE
1. Even though the HashPartitioner divides data based on keys, it does not enforce node affinity
2. Thus, the amount of data shuffle did not reduce
So, we dug into the Spark code to figure out
a way to reduce data shuffle
OUR SOLUTION
STEP 1: OVERRIDE THE TASKSCHEDULER'S PLACEMENT TO ENSURE DATA GOES TO A SINGLE NODE
This can be implemented as follows (delegate everything to the underlying RDD but
re-implement getPreferredLocations)
class NodeAffinityRDD[U: ClassTag](prev: RDD[U]) extends RDD[U](prev) {
  val nodeIPs = Array("192.168.2.140", "192.168.2.157", "192.168.2.77")

  override def getPreferredLocations(split: Partition): Seq[String] =
    Seq(nodeIPs(split.index % nodeIPs.length))
}
The TaskScheduler assigns worker nodes to partitions. Overriding this process in the new wrapper RDD ensures that a given partition's data always goes to the same node.
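A hypothetical usage of the wrapper, just to illustrate the placement it produces:

```scala
// Sketch: wrap an existing RDD so partition placement becomes deterministic.
// With three node IPs, partition i prefers nodeIPs(i % 3):
//   partition 0 -> 192.168.2.140, partition 1 -> 192.168.2.157,
//   partition 2 -> 192.168.2.77,  partition 3 -> 192.168.2.140, ...
val affine = new NodeAffinityRDD(dsRdd)
```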
STEP 2: WRITE SPARK CODE TO RUN A JOB AND MAKE EDITS TO EXECUTE A JOIN
1. Take a trial dataset
2. Move it to HDFS
3. Run a very simple job: do a couple of transformations and finally call dsRdd.join(devRDD)
val r1 = sc.textFile("hdfs://192.168.2.145:9000/todelete/partitions/random1")
val r2 = sc.textFile("hdfs://192.168.2.145:9000/todelete/partitions/random2")
val dsRdd = r1.map(line => <some transformation>).map(tokens => <some more>)
val devRDD = r2.map(line => <some transformation>).map(tokens => <some more>)
// finally join and materialize
dsRdd.join(devRDD, dummy).count
UNDERSTANDING THE JOB RUN
WHY A THREE STAGE PROCESS?
After the job ran, this is the result that the Spark UI delivered.
There are three stages. Stages 0 and 1 produce shuffle data, which is consumed as a whole in stage 2.
DAG FOR STAGE 0
It starts with reading the “random1” file and then applies the two map functions on that RDD.
The same is done with file “random2” in stage 1, which has the same DAG.
DAG FOR STAGE 2
There is only one block, corresponding to the “join” method call on dsRdd; this stage corresponds to dsRdd.join(devRDD, dummy).
Here we observe that there are shuffle boundaries in the job when ideally there is no reason for them to exist.
Hence we take a look at CoGroupedRDD to see what causes the shuffle boundary between the join and the previous stages.
ANALYSING COGROUPEDRDD
If an RDD being joined does not have the exact same partitioner as this RDD, it is marked as a ShuffleDependency (the else block below).
Clearly both dsRdd and devRDD went through this path, and both were marked as separate stages coming into the join. Hence we get three stages.
override def getDependencies: Seq[Dependency[_]] = {
  rdds.map { rdd: RDD[_] =>
    if (rdd.partitioner == Some(part)) {
      logDebug("Adding one-to-one dependency with " + rdd)
      new OneToOneDependency(rdd)
    } else {
      logDebug("Adding shuffle dependency with " + rdd)
      new ShuffleDependency[K, Any, CoGroupCombiner](
        rdd.asInstanceOf[RDD[_ <: Product2[K, _]]], part, serializer)
    }
  }
}
WRAP TWO RDDs WITHOUT REPARTITIONING
Without repartitioning, the two RDDs can simply be wrapped: delegate everything to the underlying RDD and plug in the dummy partitioner. This makes CoGroupedRDD report that there are no stage boundaries, and the DAGScheduler will schedule everything locally on each worker.
Here’s the (very small) code for WrapRDD:
class WrapRDD[T: ClassTag](rdd: RDD[T], part: Partitioner)
    extends RDD[T](rdd.sparkContext, rdd.dependencies) {

  @DeveloperApi
  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    rdd.compute(split, context)

  override protected def getPartitions: Array[Partition] = rdd.partitions

  // ********* main thing/hack ******* //
  override val partitioner = Some(part)
}
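Putting it together, a hypothetical usage sketch (the name `dummy` follows the join call shown earlier; the variable names and partition count are assumptions):

```scala
import org.apache.spark.HashPartitioner

// Sketch: give both wrapped RDDs the very same Partitioner instance so that
// CoGroupedRDD's check (rdd.partitioner == Some(part)) succeeds for both
// sides, producing a OneToOneDependency instead of a ShuffleDependency.
val dummy = new HashPartitioner(dsRdd.partitions.length)
val left  = new WrapRDD(dsRdd, dummy)
val right = new WrapRDD(devRDD, dummy)

// With both sides reporting the same partitioner, the join runs in one stage.
left.join(right, dummy).count
```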
THE RESULT
ONE STAGE JOB RUN
After the changes are made, it can be seen that there is only one stage. The whole data is read (51 x 2 = 102 MB) and
nothing is written to or read from shuffle.
EXECUTION TIME IMPROVED BY 25%, FROM 4 SECONDS TO 3 SECONDS
The two RDD’s that were computed in separate stages (0 and 1) earlier are now part of the same stage because they were wrapped in WrapRDD.
Also note the difference in execution time: it was 4 seconds in the first case and is now 3 seconds, a good 25% improvement. This is because the process no longer has to write the shuffle data to disk and then immediately read it back; instead it works on the intermediate data right away.
IN SUMMARY …
A good workaround was found for processes where a large amount of data needs to be
analyzed and segregated
This in turn improves the quality of work and delivery time while using Spark RDDs
Execution time improved by 25%, from 4 seconds to 3 seconds
EXPERIENCE THE POWER OF
APACHE SPARK WITH IMAGINEA
Imaginea is among the top contributors to Spark code
Building products on Spark since 2014
Open-source contributors to Apache Hadoop and Zeppelin
To find out more, visit http://www.imaginea.com/apache-spark
ABOUT THE AUTHOR
SACHIN TYAGI, Head of Data Engineering, Imaginea
Sachin heads the Data Engineering & Analytics practice at Imaginea. With over 10 years of IT experience, he brings both Data Science and Data Engineering expertise to solving complex problems in Big Data and Machine Learning. At Imaginea, Sachin has been pivotal in implementing Apache Spark solutions for several FAST 500 companies in areas such as Predictive Recommendation, Anomaly Detection, and Contextual Search.
Disclaimer
This document may contain forward-looking statements concerning products and strategies. These statements are based on management's current expectations and actual results may differ materially from those projected, as a result of certain risks, uncertainties and assumptions, including but not limited to: the growth of the markets addressed by our products and our customers' products, the demand for and market acceptance of our products; our ability to successfully compete in the markets in which we do business; our ability to successfully address the cost structure of our offerings; the ability to develop and implement new technologies and to obtain protection for the related intellectual property; and our ability to realize financial and strategic benefits of past and future transactions. These forward-looking statements are made only as of the date indicated, and the company disclaims any obligation to update or revise the information contained in any forward-looking statements, whether as a result of new information, future events or otherwise.
All Trademarks and other registered marks belong to their respective owners.
Copyright © 2012-2015, Imaginea Technologies, Inc. and/or its affiliates. All rights reserved.
Credits
Images under Creative Commons Zero license.