Why Your Apache Spark Job Is Failing


Page 1: Why Your Apache Spark Job is Failing

Why Your Spark Job Is Failing
Kostas Sakellis

Page 2: Why Your Apache Spark Job is Failing

Me

• Software Engineer at Cloudera
• Contributor to Apache Spark
• Before that, worked on Cloudera Manager

Page 3: Why Your Apache Spark Job is Failing


com.esotericsoftware.kryo.KryoException: Unable to find class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$4$$anonfun$apply$3

Page 4: Why Your Apache Spark Job is Failing


We go about our day ignoring manholes until…

Courtesy of: http://www.independent.co.uk/incoming/article9127706.ece/binary/original/maholev23.jpg

Page 5: Why Your Apache Spark Job is Failing


… something goes wrong.

Courtesy of: http://greenpointers.com/wp-content/uploads/2015/03/Manhole-Explosion1.jpg

Page 6: Why Your Apache Spark Job is Failing

org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 6, kostas-4.vpc.cloudera.com): java.lang.NumberFormatException: For input string: "3.9166,10.2491,-4.0926,-4.4659,0"
  at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1250)
  at java.lang.Double.parseDouble(Double.java:540)
  at scala.collection.immutable.StringLike[...]
Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
  [...]


Page 8: Why Your Apache Spark Job is Failing


Job? What now?

Courtesy of: http://calvert.lib.md.us/jobs_pic.jpg

Page 9: Why Your Apache Spark Job is Failing

Example

sc.textFile("hdfs://...", 4)
  .map((x) => x.toInt)
  .filter(_ > 10)
  .sum()


Page 12: Why Your Apache Spark Job is Failing


Then what the heck is a stage?

Courtesy of: https://writinginadeadworld.files.wordpress.com/2014/03/rock1.jpeg

Page 13: Why Your Apache Spark Job is Failing

Partitions

sc.textFile("hdfs://...", 4)
  .map((x) => x.toInt)
  .filter(_ > 10)
  .sum()

[Diagram: the HDFS file is read as four partitions, Partition 1–4]

Page 14: Why Your Apache Spark Job is Failing

RDDs

sc.textFile("hdfs://...", 4)
  .map((x) => x.toInt)
  .filter(_ > 10)
  .sum()

[Diagram: textFile produces RDD1, whose four partitions are backed by the HDFS blocks]

Page 15: Why Your Apache Spark Job is Failing

RDDs

sc.textFile("hdfs://...", 4)
  .map((x) => x.toInt)
  .filter(_ > 10)
  .sum()

[Diagram: map adds RDD2; each of its four partitions depends on the matching partition of RDD1]

Page 16: Why Your Apache Spark Job is Failing

RDDs

sc.textFile("hdfs://...", 4)
  .map((x) => x.toInt)
  .filter(_ > 10)
  .sum()

[Diagram: filter adds RDD3, again with four partitions, each depending on the matching partition of RDD2]

Page 17: Why Your Apache Spark Job is Failing

RDDs

sc.textFile("hdfs://...", 4)
  .map((x) => x.toInt)
  .filter(_ > 10)
  .sum()

[Diagram: the full chain HDFS → RDD1 → RDD2 → RDD3 → Sum, four partitions at every step]

Page 18: Why Your Apache Spark Job is Failing

RDD Lineage

sc.textFile("hdfs://...", 4)
  .map((x) => x.toInt)
  .filter(_ > 10)
  .sum()

[Diagram: the same HDFS → RDD1 → RDD2 → RDD3 → Sum chain; the chain of dependencies is the RDD's lineage]

Page 19: Why Your Apache Spark Job is Failing

RDD Dependencies

• Narrow and wide dependencies

[Diagram: the HDFS → RDD1 → RDD2 → RDD3 → Sum chain; each one-to-one, per-partition edge is a narrow dependency]

Page 20: Why Your Apache Spark Job is Failing

Wide Dependencies

• Sometimes records need to be grouped together
• Examples:
  • join
  • groupByKey
• Stages are created at wide-dependency boundaries (see the sketch below)
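A quick way to see where Spark will cut a stage is RDD.toDebugString, which prints the lineage and marks shuffle boundaries. A minimal sketch, assuming a live SparkContext sc (as in the rest of this deck) and an illustrative path:

// Narrow dependency: each output partition depends on one input partition.
val pairs = sc.textFile("hdfs://...").map(line => (line, 1))

// Wide dependency: groupByKey must shuffle, so the scheduler starts a new stage here.
val grouped = pairs.groupByKey()

// The lineage printout shows a ShuffledRDD where the stage boundary falls.
println(grouped.toDebugString)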

Page 21: Why Your Apache Spark Job is Failing

A More Interesting Spark Job

val rdd1 = sc.textFile("hdfs://...")
  .map(someFunc)
  .filter(filterFunc)

val rdd2 = sc.hadoopFile("hdfs://...")
  .groupByKey()
  .map(someOtherFunc)

val rdd3 = rdd1.join(rdd2)
  .map(someFunc)

rdd3.collect()

Page 22: Why Your Apache Spark Job is Failing

A More Interesting Spark Job

val rdd1 = sc.textFile("hdfs://...")
  .map(someFunc)
  .filter(filterFunc)

[Diagram: textFile → map → filter]

Page 23: Why Your Apache Spark Job is Failing

A More Interesting Spark Job

val rdd2 = sc.hadoopFile("hdfs://...")
  .groupByKey()
  .map(someOtherFunc)

[Diagram: hadoopFile → groupByKey → map]

Page 24: Why Your Apache Spark Job is Failing

A More Interesting Spark Job

val rdd3 = rdd1.join(rdd2)
  .map(someFunc)

[Diagram: join → map]

Page 25: Why Your Apache Spark Job is Failing

A More Interesting Spark Job

rdd3.collect()

[Diagram: the full DAG — textFile → map → filter and hadoopFile → groupByKey → map feed into join → map; the wide dependencies at groupByKey and join split the job into four stages, numbered 1–4]

Page 26: Why Your Apache Spark Job is Failing


Get to the point before I stop caring!


Page 28: Why Your Apache Spark Job is Failing

What was the failure?

org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 6, kostas-4.vpc.cloudera.com): java.lang.NumberFormatException: For input string: "3.9166,10.2491,-4.0926,-4.4659,0" [...]
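The tell is the input string itself: it is a whole comma-separated record, so the task almost certainly called toDouble (hence Double.parseDouble in the trace) on the entire line rather than on one field. A minimal sketch of the likely fix, assuming CSV input; the column index is illustrative:

// Broken: tries to parse "3.9166,10.2491,..." as a single double.
// val values = sc.textFile("hdfs://...").map(_.toDouble)

// Fixed: split the record first, then parse one field.
val values = sc.textFile("hdfs://...").map(_.split(",")(0).toDouble)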

Page 29: Why Your Apache Spark Job is Failing

What was the failure?

[Diagram: a stage containing four tasks]


Page 31: Why Your Apache Spark Job is Failing

What was the failure?

[Diagram: the same stage; one task fails and is retried]

spark.task.maxFailures=4
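That threshold is configurable: a task is retried until it has failed spark.task.maxFailures times, and only then is the whole job aborted. If failures are transient (a flaky disk, an overloaded node), raising it at submit time can help; a sketch, with an illustrative value:

spark-submit --conf spark.task.maxFailures=8 ...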

Page 32: Why Your Apache Spark Job is Failing

org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 6, kostas-4.vpc.cloudera.com): java.lang.NumberFormatException: For input string: "3.9166,10.2491,-4.0926,-4.4659,0"
  at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1250)
  at java.lang.Double.parseDouble(Double.java:540)
  at scala.collection.immutable.StringLike[...]
Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
  [...]


Page 34: Why Your Apache Spark Job is Failing

ERROR executor.Executor: Exception in task ID 2866
java.io.IOException: Filesystem closed
  at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:565)
  at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:648)
  at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:706)
  at java.io.DataInputStream.read(DataInputStream.java:100)
  at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:209)
  at org.apache.hadoop.util.LineReader.readLine(LineReader.java:173)
  at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:206)
  at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:45)
  at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:164)
  [...]

Page 35: Why Your Apache Spark Job is Failing


Spark Architecture

Page 36: Why Your Apache Spark Job is Failing

YARN Architecture

[Diagram: a Client talks to the Resource Manager; Node Managers host Containers; one Container runs the Application Master, the others run application processes]

Page 37: Why Your Apache Spark Job is Failing

Spark on YARN Architecture

[Diagram: the same YARN cluster; the Client submits the job and executor processes run inside Containers on the Node Managers]

Page 38: Why Your Apache Spark Job is Failing

Spark on YARN Architecture

[Diagram: as before, with the Application Master running in its own Container and negotiating executor Containers with the Resource Manager]

Page 39: Why Your Apache Spark Job is Failing

spark-submit --master yarn-client \
  --num-executors 2 \
  --executor-memory 2g \
  --executor-cores 2

Page 40: Why Your Apache Spark Job is Failing


Container [pid=63375,containerID=container_1388158490598_0001_01_000003] is running beyond physical memory limits. Current usage: 2.2 GB of 2.1 GB physical memory used; 2.8 GB of 4.2 GB virtual memory used. Killing container. [...]


Page 42: Why Your Apache Spark Job is Failing

spark-submit --master yarn-client \
  --num-executors 2 \
  --executor-memory 2g \
  --executor-cores 2
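Those two numbers are no coincidence: with the roughly 7% overhead described on the next slide, the 2 GB executor heap maps almost exactly onto the 2.1 GB container limit cited in the error:

  2048 MB (spark.executor.memory)
+ ~143 MB (7% spark.yarn.executor.memoryOverhead)
≈ 2191 MB ≈ 2.1 GB container limit

The executor's actual usage of 2.2 GB exceeded that limit, so YARN killed the container.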

Page 43: Why Your Apache Spark Job is Failing

Memory Allocation

yarn.nodemanager.resource.memory-mb
  Executor Container
    spark.yarn.executor.memoryOverhead (7%; 10% in 1.4)
    spark.executor.memory
      spark.shuffle.memoryFraction (0.4)
      spark.storage.memoryFraction (0.6)
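When a container keeps getting killed for exceeding its physical memory limit, one common remedy is to grant the JVM's off-heap usage explicit headroom rather than relying on the percentage default. A sketch; the property takes megabytes and the 768 value is illustrative:

spark-submit --executor-memory 2g \
  --conf spark.yarn.executor.memoryOverhead=768 \
  ...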

Page 44: Why Your Apache Spark Job is Failing


Sometimes jobs run slow or even…

Courtesy of: http://blog.sdrock.com/pastors/files/2013/06/time-clock.jpg

Page 45: Why Your Apache Spark Job is Failing

java.lang.OutOfMemoryError: GC overhead limit exceeded
  at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1986)
  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
  at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
  [...]

Page 46: Why Your Apache Spark Job is Failing


GC Stalls

Page 47: Why Your Apache Spark Job is Failing


Too much spilling!

Courtesy of: http://tgnp.me/wp-content/uploads/2014/05/spilled-starbucks.jpg

Page 48: Why Your Apache Spark Job is Failing

Shuffle Boundaries

[Diagram: the earlier DAG — textFile → map → filter and hadoopFile → groupByKey → map meeting at join → map; the shuffles happen at the groupByKey and join boundaries]

Page 49: Why Your Apache Spark Job is Failing


Most performance issues are in shuffles!

Page 50: Why Your Apache Spark Job is Failing

Inside a Task: Fetch & Aggregate

[Diagram: fetched shuffle blocks are deserialized into an ExternalAppendOnlyMap of key -> values; when the map outgrows memory it is sorted and spilled to disk]

Page 51: Why Your Apache Spark Job is Failing

Inside a Task: Specify Partitions

rdd.reduceByKey(reduceFunc, numPartitions = 1000)
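reduceByKey is not special here: most shuffle operators accept a partition count, and an existing RDD can be reshuffled outright. A few illustrative variants (the counts are arbitrary):

rdd.groupByKey(numPartitions = 1000)
rdd1.join(rdd2, numPartitions = 1000)
rdd.repartition(1000)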

Page 52: Why Your Apache Spark Job is Failing


Why not set partitions to ∞ ?

Page 53: Why Your Apache Spark Job is Failing

Excessive Parallelism

• Overwhelming scheduler overhead
• More fetches -> more disk seeks
• Driver needs to track state per task

Page 54: Why Your Apache Spark Job is Failing

So How to Choose?

• Easy answer:
  • Keep multiplying by 1.5 and see what works (see the sketch below)
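In practice that heuristic starts from the RDD's current partition count, which it can report itself. A sketch, assuming you re-run the job and compare timings between tries:

val current = rdd.partitions.size   // e.g., the default derived from the input splits
val next = (current * 1.5).toInt    // try this value on the next run
rdd.reduceByKey(reduceFunc, next)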

Page 55: Why Your Apache Spark Job is Failing


Is Spark bad?

Courtesy of: https://theferkel.files.wordpress.com/2015/04/250474-breaking-bad-quotes.jpg

Page 56: Why Your Apache Spark Job is Failing


Thank you