apache spark rdd join to reduce job run time
Post on 14-Jan-2017
Private and confidential. Copyright (C) 2016, Imaginea Technologies Inc. All rights reserved.
APACHE SPARKTM
RDD JOIN TO REDUCE JOB RUN TIME
Insights from Imaginea
THE PROBLEM
Over 1 TB of data would be received from S3 as rows of tuples [ID, v1, v2, … vn]
The need was to find the aggregate of values by their IDs and store the result in HDFS
As we performed a join between the existing data and the incremental data on Spark, there would be a
HUGE AMOUNT OF DATA SHUFFLE across the cluster
We wanted to reduce this shuffle and thus reduce the job run time
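The aggregation described above can be sketched roughly as follows. This is only an illustration: the variable names, the value types, and the use of reduceByKey are assumptions, not taken from the original job.

```scala
import org.apache.spark.rdd.RDD

// Sketch only: aggregate values by ID, assuming each input row has already
// been parsed into an (id, values) pair.
val rows: RDD[(String, Array[Double])] = ??? // parsed from the S3 input

// Element-wise sum of v1..vn for all rows sharing the same ID.
val aggregated = rows.reduceByKey { (a, b) =>
  a.zip(b).map { case (x, y) => x + y }
}

aggregated.saveAsTextFile("hdfs://...") // destination path elided
```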
THE APPROACH
SAME ID GOES TO SAME PARTITION
So, given that the aggregate for each ID in one dataset has to be matched with the same ID in the other dataset, we partitioned the datasets in such a way that rows with the same ID go to the same partition, and thus to the same Spark worker
With this approach, rows were joined locally and the costly shuffle over the network was avoided
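The co-partitioning idea can be sketched like this, assuming both datasets are already keyed (id, values) RDDs; the names and the partition count are hypothetical:

```scala
import org.apache.spark.HashPartitioner

// Co-partition both datasets with the same partitioner so that rows with
// the same ID land in the same partition index of each RDD.
val part = new HashPartitioner(128) // partition count is an assumption

val existing    = existingRdd.partitionBy(part)
val incremental = incrementalRdd.partitionBy(part)

// Because both sides share the partitioner, Spark can join partition i of
// one RDD with partition i of the other without re-shuffling the data.
val joined = existing.join(incremental)
```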
HASHPARTITIONER ON THE RDDs
The approach is to use a HashPartitioner on the RDDs, which partitions the data based on the key hash
THE HURDLE
1. Even though the HashPartitioner divides data based on keys, it does not enforce node affinity
2. Thus, the amount of data shuffle did not reduce
So, we dug into the Spark code to figure out
a way to reduce data shuffle
OUR SOLUTION
STEP 1: OVERRIDE THE TASKSCHEDULER'S PLACEMENT TO ENSURE DATA GOES TO A SINGLE NODE
This can be implemented as follows (delegate everything to the underlying RDD but
re-implement getPreferredLocations)
class NodeAffinityRDD[U: ClassTag](prev: RDD[U]) extends RDD[U](prev) {
  val nodeIPs = Array("192.168.2.140", "192.168.2.157", "192.168.2.77")

  override def getPreferredLocations(split: Partition): Seq[String] =
    Seq(nodeIPs(split.index % nodeIPs.length))
}
The TaskScheduler assigns worker nodes to partitions. Overriding this process in the new wrapper RDD ensures that a given partition's data always goes to the same node.
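A hypothetical usage of the wrapper, just to illustrate the placement it produces:

```scala
// Sketch: wrap an existing RDD so partition placement becomes deterministic.
// With three node IPs, partition i prefers nodeIPs(i % 3):
//   partition 0 -> 192.168.2.140, partition 1 -> 192.168.2.157,
//   partition 2 -> 192.168.2.77,  partition 3 -> 192.168.2.140, ...
val affine = new NodeAffinityRDD(dsRdd)
```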
STEP 2: WRITE SPARK CODE TO RUN A JOB AND MAKE EDITS TO EXECUTE A JOIN
1. Take a trial dataset
2. Move it to HDFS
3. Run a very simple job: do a couple of transformations and finally call dsRdd.join(devRDD)
val r1 = sc.textFile("hdfs://192.168.2.145:9000/todelete/partitions/random1")
val r2 = sc.textFile("hdfs://192.168.2.145:9000/todelete/partitions/random2")
val dsRdd = r1.map(line => <some transformation>).map(tokens => <some more>)
val devRDD = r2.map(line => <some transformation>).map(tokens => <some more>)
// finally join and materialize
dsRdd.join(devRDD, dummy).count
UNDERSTANDING THE JOB RUN
WHY A THREE STAGE PROCESS?
After the job ran, this is the result that the Spark UI delivered.
There are three stages. Stages 0 and 1 produce shuffle data, which is consumed as a whole in stage 2.
DAG FOR STAGE 0
It starts with reading the “random1” file and then applies the two map functions on that RDD.
The same is done with file “random2” in stage 1, which has the same DAG.
DAG FOR STAGE 2
There is only one block, corresponding to the “join” method call on dsRdd; this stage corresponds to dsRdd.join(devRDD, dummy).
Here we observe that there are shuffle boundaries in the job when ideally there is no reason for them to exist.
Hence we take a look at CoGroupedRDD to see what causes the shuffle boundary between the join and the previous stages.
ANALYSING COGROUPEDRDD
If an RDD being joined does not have the exact same partitioner as this RDD, it is marked as a ShuffleDependency (the else block below).
Clearly both dsRdd and devRDD went through this path, and both were marked as separate stages coming into the join. Hence we get three stages.
override def getDependencies: Seq[Dependency[_]] = {
  rdds.map { rdd: RDD[_] =>
    if (rdd.partitioner == Some(part)) {
      logDebug("Adding one-to-one dependency with " + rdd)
      new OneToOneDependency(rdd)
    } else {
      logDebug("Adding shuffle dependency with " + rdd)
      new ShuffleDependency[K, Any, CoGroupCombiner](
        rdd.asInstanceOf[RDD[_ <: Product2[K, _]]], part, serializer)
    }
  }
}
WRAP TWO RDDs WITHOUT REPARTITIONING
Without repartitioning, the two RDDs can simply be wrapped: delegate everything to the underlying RDD and plug in the dummy partitioner. This makes CoGroupedRDD report that there are no stage boundaries, and the DAGScheduler will schedule everything locally on each worker.
Here’s the (very small) code for WrapRDD:
class WrapRDD[T: ClassTag](rdd: RDD[T], part: Partitioner)
    extends RDD[T](rdd.sparkContext, rdd.dependencies) {

  @DeveloperApi
  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    rdd.compute(split, context)

  override protected def getPartitions: Array[Partition] = rdd.partitions

  // ********* main thing/hack ******* //
  override val partitioner = Some(part)
}
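Putting it together, a hypothetical usage sketch (the name `dummy` follows the join call shown earlier; the variable names and partition count are assumptions):

```scala
import org.apache.spark.HashPartitioner

// Sketch: give both wrapped RDDs the very same Partitioner instance so that
// CoGroupedRDD's check (rdd.partitioner == Some(part)) succeeds for both
// sides, producing a OneToOneDependency instead of a ShuffleDependency.
val dummy = new HashPartitioner(dsRdd.partitions.length)
val left  = new WrapRDD(dsRdd, dummy)
val right = new WrapRDD(devRDD, dummy)

// With both sides reporting the same partitioner, the join runs in one stage.
left.join(right, dummy).count
```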
THE RESULT
ONE STAGE JOB RUN
After the changes are made, it can be seen that there is only one stage. The whole data is read (51 x 2 = 102 MB) and
nothing is written to or read from shuffle.
EXECUTION TIME IMPROVED BY 25%, FROM 4 SECONDS TO 3 SECONDS
The two RDD’s that were computed in separate stages (0 and 1) earlier are now part of the same stage because they were wrapped in WrapRDD.
Also note the difference in execution time: it was 4 seconds in the first case and is now 3 seconds, a good 25% improvement. This is because the process no longer has to write the shuffle data to disk and then immediately read it back; instead it works on the intermediate data right away.
IN SUMMARY …
A good workaround was found for processes where a large amount of data needs to be
analyzed and segregated
This in turn improves the quality of work and delivery time while using Spark RDDs
Execution time improved by 25%, from 4 seconds to 3 seconds
EXPERIENCE THE POWER OF
APACHE SPARK WITH IMAGINEA
Imaginea is among the top contributors to Spark code
Building products on Spark since 2014
Open-source contributors to Apache Hadoop and Zeppelin
To find out more, visit http://www.imaginea.com/apache-spark
ABOUT THE AUTHOR
SACHIN TYAGI, Head of Data Engineering, Imaginea
Sachin heads the Data Engineering & Analytics practice at Imaginea. With over 10 years of IT experience, he brings both Data Science and Data Engineering expertise to solving complex problems in Big Data and Machine Learning. At Imaginea, Sachin has been pivotal in implementing Apache Spark solutions for several FAST 500 companies in areas such as Predictive Recommendation, Anomaly Detection, and Contextual Search.
Disclaimer
This document may contain forward-looking statements concerning products and strategies. These statements are based on management's current expectations and actual results may differ materially from those projected, as a result of certain risks, uncertainties and assumptions, including but not limited to: the growth of the markets addressed by our products and our customers' products, the demand for and market acceptance of our products; our ability to successfully compete in the markets in which we do business; our ability to successfully address the cost structure of our offerings; the ability to develop and implement new technologies and to obtain protection for the related intellectual property; and our ability to realize financial and strategic benefits of past and future transactions. These forward-looking statements are made only as of the date indicated, and the company disclaims any obligation to update or revise the information contained in any forward-looking statements, whether as a result of new information, future events or otherwise.
All Trademarks and other registered marks belong to their respective owners.
Copyright © 2012-2015, Imaginea Technologies, Inc. and/or its affiliates. All rights reserved.
Credits
Images under Creative Commons Zero license.