Why Your Apache Spark Job Is Failing


Page 1: Why Your Apache Spark Job is Failing

Why Your Spark Job Is Failing
Kostas Sakellis

Page 2: Why Your Apache Spark Job is Failing

Me

• Software Engineer at Cloudera
• Contributor to Apache Spark
• Before that, worked on Cloudera Manager

Page 3: Why Your Apache Spark Job is Failing


com.esotericsoftware.kryo.KryoException: Unable to find class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$4$$anonfun$apply$3

Page 4: Why Your Apache Spark Job is Failing


We go about our day ignoring manholes until…

Courtesy of: http://www.independent.co.uk/incoming/article9127706.ece/binary/original/maholev23.jpg

Page 5: Why Your Apache Spark Job is Failing


… something goes wrong.

Courtesy of: http://greenpointers.com/wp-content/uploads/2015/03/Manhole-Explosion1.jpg

Page 6: Why Your Apache Spark Job is Failing

org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 6, kostas-4.vpc.cloudera.com): java.lang.NumberFormatException: For input string: "3.9166,10.2491,-4.0926,-4.4659,0"
  at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1250)
  at java.lang.Double.parseDouble(Double.java:540)
  at scala.collection.immutable.StringLike[...]
Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
  [...]


Page 8: Why Your Apache Spark Job is Failing


Job? What now?

Courtesy of: http://calvert.lib.md.us/jobs_pic.jpg

Page 9: Why Your Apache Spark Job is Failing

Example

sc.textFile("hdfs://...", 4)
  .map((x) => x.toInt)
  .filter(_ > 10)
  .sum()


Page 12: Why Your Apache Spark Job is Failing


Then what the heck is a stage?

Courtesy of: https://writinginadeadworld.files.wordpress.com/2014/03/rock1.jpeg

Page 13: Why Your Apache Spark Job is Failing

Partitions

sc.textFile("hdfs://...", 4)
  .map((x) => x.toInt)
  .filter(_ > 10)
  .sum()

[Diagram: the HDFS file is read as four partitions, Partition 1–4]

Page 14: Why Your Apache Spark Job is Failing

RDDs

sc.textFile("hdfs://...", 4)
  .map((x) => x.toInt)
  .filter(_ > 10)
  .sum()

[Diagram: textFile produces RDD1, whose four partitions are backed by the HDFS blocks]

Page 15: Why Your Apache Spark Job is Failing

RDDs

sc.textFile("hdfs://...", 4)
  .map((x) => x.toInt)
  .filter(_ > 10)
  .sum()

[Diagram: map adds RDD2; each of its four partitions depends on the matching partition of RDD1]

Page 16: Why Your Apache Spark Job is Failing

RDDs

sc.textFile("hdfs://...", 4)
  .map((x) => x.toInt)
  .filter(_ > 10)
  .sum()

[Diagram: filter adds RDD3, again with four partitions, each depending on the matching partition of RDD2]

Page 17: Why Your Apache Spark Job is Failing

RDDs

sc.textFile("hdfs://...", 4)
  .map((x) => x.toInt)
  .filter(_ > 10)
  .sum()

[Diagram: the full chain HDFS → RDD1 → RDD2 → RDD3 → Sum, four partitions at every step]

Page 18: Why Your Apache Spark Job is Failing

RDD Lineage

sc.textFile("hdfs://...", 4)
  .map((x) => x.toInt)
  .filter(_ > 10)
  .sum()

[Diagram: the same HDFS → RDD1 → RDD2 → RDD3 → Sum chain; the chain of dependencies is the RDD's lineage]

Page 19: Why Your Apache Spark Job is Failing

RDD Dependencies

• Narrow and wide dependencies

[Diagram: the HDFS → RDD1 → RDD2 → RDD3 → Sum chain; each one-to-one, per-partition edge is a narrow dependency]

Page 20: Why Your Apache Spark Job is Failing

Wide Dependencies

• Sometimes records need to be grouped together
• Examples:
  • join
  • groupByKey
• Stages are created at wide-dependency boundaries (see the sketch below)
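A quick way to see where Spark will cut a stage is RDD.toDebugString, which prints the lineage and marks shuffle boundaries. A minimal sketch, assuming a live SparkContext sc (as in the rest of this deck) and an illustrative path:

// Narrow dependency: each output partition depends on one input partition.
val pairs = sc.textFile("hdfs://...").map(line => (line, 1))

// Wide dependency: groupByKey must shuffle, so the scheduler starts a new stage here.
val grouped = pairs.groupByKey()

// The lineage printout shows a ShuffledRDD where the stage boundary falls.
println(grouped.toDebugString)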

Page 21: Why Your Apache Spark Job is Failing

A More Interesting Spark Job

val rdd1 = sc.textFile("hdfs://...")
  .map(someFunc)
  .filter(filterFunc)

val rdd2 = sc.hadoopFile("hdfs://...")
  .groupByKey()
  .map(someOtherFunc)

val rdd3 = rdd1.join(rdd2)
  .map(someFunc)

rdd3.collect()

Page 22: Why Your Apache Spark Job is Failing

A More Interesting Spark Job

val rdd1 = sc.textFile("hdfs://...")
  .map(someFunc)
  .filter(filterFunc)

[Diagram: textFile → map → filter]

Page 23: Why Your Apache Spark Job is Failing

A More Interesting Spark Job

val rdd2 = sc.hadoopFile("hdfs://...")
  .groupByKey()
  .map(someOtherFunc)

[Diagram: hadoopFile → groupByKey → map]

Page 24: Why Your Apache Spark Job is Failing

A More Interesting Spark Job

val rdd3 = rdd1.join(rdd2)
  .map(someFunc)

[Diagram: join → map]

Page 25: Why Your Apache Spark Job is Failing

A More Interesting Spark Job

rdd3.collect()

[Diagram: the full DAG — textFile → map → filter and hadoopFile → groupByKey → map feed into join → map; the wide dependencies at groupByKey and join split the job into four stages, numbered 1–4]

Page 26: Why Your Apache Spark Job is Failing


Get to the point before I stop caring!


Page 28: Why Your Apache Spark Job is Failing

What was the failure?

org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 6, kostas-4.vpc.cloudera.com): java.lang.NumberFormatException: For input string: "3.9166,10.2491,-4.0926,-4.4659,0" [...]
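The tell is the input string itself: it is a whole comma-separated record, so the task almost certainly called toDouble (hence Double.parseDouble in the trace) on the entire line rather than on one field. A minimal sketch of the likely fix, assuming CSV input; the column index is illustrative:

// Broken: tries to parse "3.9166,10.2491,..." as a single double.
// val values = sc.textFile("hdfs://...").map(_.toDouble)

// Fixed: split the record first, then parse one field.
val values = sc.textFile("hdfs://...").map(_.split(",")(0).toDouble)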

Page 29: Why Your Apache Spark Job is Failing

What was the failure?

[Diagram: a stage containing four tasks]


Page 31: Why Your Apache Spark Job is Failing

What was the failure?

[Diagram: the same stage; one task fails and is retried]

spark.task.maxFailures=4
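That threshold is configurable: a task is retried until it has failed spark.task.maxFailures times, and only then is the whole job aborted. If failures are transient (a flaky disk, an overloaded node), raising it at submit time can help; a sketch, with an illustrative value:

spark-submit --conf spark.task.maxFailures=8 ...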

Page 32: Why Your Apache Spark Job is Failing

org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 6, kostas-4.vpc.cloudera.com): java.lang.NumberFormatException: For input string: "3.9166,10.2491,-4.0926,-4.4659,0"
  at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1250)
  at java.lang.Double.parseDouble(Double.java:540)
  at scala.collection.immutable.StringLike[...]
Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
  [...]


Page 34: Why Your Apache Spark Job is Failing

ERROR executor.Executor: Exception in task ID 2866
java.io.IOException: Filesystem closed
  at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:565)
  at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:648)
  at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:706)
  at java.io.DataInputStream.read(DataInputStream.java:100)
  at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:209)
  at org.apache.hadoop.util.LineReader.readLine(LineReader.java:173)
  at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:206)
  at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:45)
  at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:164)
  [...]

Page 35: Why Your Apache Spark Job is Failing


Spark Architecture

Page 36: Why Your Apache Spark Job is Failing

YARN Architecture

[Diagram: a Client talks to the Resource Manager; Node Managers host Containers; one Container runs the Application Master, the others run application processes]

Page 37: Why Your Apache Spark Job is Failing

Spark on YARN Architecture

[Diagram: the same YARN cluster; the Client submits the job and executor processes run inside Containers on the Node Managers]

Page 38: Why Your Apache Spark Job is Failing

Spark on YARN Architecture

[Diagram: as before, with the Application Master running in its own Container and negotiating executor Containers with the Resource Manager]

Page 39: Why Your Apache Spark Job is Failing

spark-submit --master yarn-client \
  --num-executors 2 \
  --executor-memory 2g \
  --executor-cores 2

Page 40: Why Your Apache Spark Job is Failing


Container [pid=63375,containerID=container_1388158490598_0001_01_000003] is running beyond physical memory limits. Current usage: 2.2 GB of 2.1 GB physical memory used; 2.8 GB of 4.2 GB virtual memory used. Killing container. [...]


Page 42: Why Your Apache Spark Job is Failing

spark-submit --master yarn-client \
  --num-executors 2 \
  --executor-memory 2g \
  --executor-cores 2
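Those two numbers are no coincidence: with the roughly 7% overhead described on the next slide, the 2 GB executor heap maps almost exactly onto the 2.1 GB container limit cited in the error:

  2048 MB (spark.executor.memory)
+ ~143 MB (7% spark.yarn.executor.memoryOverhead)
≈ 2191 MB ≈ 2.1 GB container limit

The executor's actual usage of 2.2 GB exceeded that limit, so YARN killed the container.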

Page 43: Why Your Apache Spark Job is Failing

Memory Allocation

yarn.nodemanager.resource.memory-mb
  Executor Container
    spark.yarn.executor.memoryOverhead (7%; 10% in 1.4)
    spark.executor.memory
      spark.shuffle.memoryFraction (0.4)
      spark.storage.memoryFraction (0.6)
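When a container keeps getting killed for exceeding its physical memory limit, one common remedy is to grant the JVM's off-heap usage explicit headroom rather than relying on the percentage default. A sketch; the property takes megabytes and the 768 value is illustrative:

spark-submit --executor-memory 2g \
  --conf spark.yarn.executor.memoryOverhead=768 \
  ...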

Page 44: Why Your Apache Spark Job is Failing


Sometimes jobs run slow or even…

Courtesy of: http://blog.sdrock.com/pastors/files/2013/06/time-clock.jpg

Page 45: Why Your Apache Spark Job is Failing

java.lang.OutOfMemoryError: GC overhead limit exceeded
  at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1986)
  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
  at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
  [...]

Page 46: Why Your Apache Spark Job is Failing


GC Stalls

Page 47: Why Your Apache Spark Job is Failing


Too much spilling!

Courtesy of: http://tgnp.me/wp-content/uploads/2014/05/spilled-starbucks.jpg

Page 48: Why Your Apache Spark Job is Failing

Shuffle Boundaries

[Diagram: the earlier DAG — textFile → map → filter and hadoopFile → groupByKey → map meeting at join → map; the shuffles happen at the groupByKey and join boundaries]

Page 49: Why Your Apache Spark Job is Failing


Most performance issues are in shuffles!

Page 50: Why Your Apache Spark Job is Failing

Inside a Task: Fetch & Aggregate

[Diagram: fetched shuffle blocks are deserialized into an ExternalAppendOnlyMap of key -> values; when the map outgrows memory it is sorted and spilled to disk]

Page 51: Why Your Apache Spark Job is Failing

Inside a Task: Specify Partitions

rdd.reduceByKey(reduceFunc, numPartitions = 1000)
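reduceByKey is not special here: most shuffle operators accept a partition count, and an existing RDD can be reshuffled outright. A few illustrative variants (the counts are arbitrary):

rdd.groupByKey(numPartitions = 1000)
rdd1.join(rdd2, numPartitions = 1000)
rdd.repartition(1000)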

Page 52: Why Your Apache Spark Job is Failing


Why not set partitions to ∞ ?

Page 53: Why Your Apache Spark Job is Failing

Excessive Parallelism

• Overwhelming scheduler overhead
• More fetches -> more disk seeks
• Driver needs to track state per task

Page 54: Why Your Apache Spark Job is Failing

So How to Choose?

• Easy answer:
  • Keep multiplying by 1.5 and see what works (see the sketch below)
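In practice that heuristic starts from the RDD's current partition count, which it can report itself. A sketch, assuming you re-run the job and compare timings between tries:

val current = rdd.partitions.size   // e.g., the default derived from the input splits
val next = (current * 1.5).toInt    // try this value on the next run
rdd.reduceByKey(reduceFunc, next)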

Page 55: Why Your Apache Spark Job is Failing


Is Spark bad?

Courtesy of: https://theferkel.files.wordpress.com/2015/04/250474-breaking-bad-quotes.jpg

Page 56: Why Your Apache Spark Job is Failing


Thank you