Arthur: The Spark Debugger

Ankur Dave, Matei Zaharia, Murphy McCauley, Scott Shenker, Ion Stoica

UC Berkeley

Motivation

Debugging large parallel jobs is hard

Current approaches to debugging:
• Repeatedly modify and rerun the program
• Run isolated code in Spark shell

Introducing Arthur

Interactive replay debugger for Spark programs
• Reconstruct and query intermediate datasets
• Visualize the program’s data flow
• Rerun any task in a single-process debugger
• Trace records across transformations
• Aggregate exceptions at the master

Spark Programming Model

Example: Find how many Wikipedia articles match a search term

HDFS file → map(_.split('\t')(3)) → articles → filter(_.contains("Berkeley")) → matches → count() → 10,000

• articles and matches are Resilient Distributed Datasets (RDDs)
• map and filter are deterministic transformations
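The pipeline in this example can be sketched in Scala. This is a minimal local sketch that uses a plain Seq in place of an RDD, so it runs without a cluster; RDDs offer map, filter, and count calls of the same shape. The object name WikipediaSearch and the sample rows are hypothetical; only the split, filter, and count steps come from the slide.

```scala
// Local stand-in for the slide's pipeline: a Seq plays the role of the RDD.
// WikipediaSearch and the sample rows below are illustrative assumptions.
object WikipediaSearch {
  // Keep field 3 of each tab-separated line, then count the lines containing the term.
  def countMatches(lines: Seq[String], term: String): Long = {
    val articles = lines.map(_.split('\t')(3))        // same call as on the slide
    val matches  = articles.filter(_.contains(term))  // same call as on the slide
    matches.size.toLong                               // stands in for count()
  }

  def main(args: Array[String]): Unit = {
    val lines = Seq(
      "1\tUC Berkeley\t2011\tBerkeley is a city in California",
      "2\tSpark\t2011\tSpark is a cluster computing framework",
      "3\tAMPLab\t2011\tThe AMPLab is at UC Berkeley"
    )
    println(countMatches(lines, "Berkeley"))
  }
}
```

Because each step is a deterministic function of its input, recomputing articles or matches from the same file gives the same result, which is the property a replay debugger relies on.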

Approach

[Diagram: Master, Workers, Log. Edge labels: tasks; results, checksums, events; lineage, checksums, events]

Approach

[Diagram: Master, Workers, Log, user input. Edge labels: tasks; results, checksums; lineage]
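One way to read these diagrams is that lineage is written to a log during the original run and read back later, so that an intermediate dataset can be recomputed on request. The sketch below illustrates that idea on local collections. It is an assumption-laden illustration, not Arthur's API: LineageReplay, LineageLog, Source, and Transform are hypothetical names, and a real log would be persisted rather than held in memory.

```scala
// Hypothetical lineage log: each dataset id maps to a record of how it was derived.
object LineageReplay {
  sealed trait Entry
  // A dataset read directly from input.
  final case class Source(data: Seq[String]) extends Entry
  // A dataset derived from a parent dataset by a deterministic transformation.
  final case class Transform(parent: Int, f: Seq[String] => Seq[String]) extends Entry

  final class LineageLog {
    private val entries = scala.collection.mutable.ArrayBuffer.empty[Entry]

    // Record an entry and return the id assigned to the new dataset.
    def record(e: Entry): Int = { entries += e; entries.length - 1 }

    // Rebuild dataset `id` by replaying its chain of transformations from the source.
    def reconstruct(id: Int): Seq[String] = entries(id) match {
      case Source(data)         => data
      case Transform(parent, f) => f(reconstruct(parent))
    }
  }

  def main(args: Array[String]): Unit = {
    val log     = new LineageLog
    val input   = log.record(Source(Seq("UC Berkeley", "Stanford")))
    val matches = log.record(Transform(input, _.filter(_.contains("Berkeley"))))
    println(log.reconstruct(matches)) // recomputed from the log, not cached
  }
}
```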

Detecting Nondeterministic Transformations

Re-running a nondeterministic transformation may yield different results

Arthur checksums RDD contents and alerts the user if necessary
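The checksum idea can be illustrated with a small sketch: compute a checksum over a transformation's output on two runs and compare. The slides do not say which checksum Arthur uses; CRC32 from the Java standard library is an assumption here, and ChecksumCheck, checksum, and isDeterministic are hypothetical names.

```scala
import java.nio.charset.StandardCharsets
import java.util.zip.CRC32

// Hypothetical helper; CRC32 is an assumed choice of checksum.
object ChecksumCheck {
  // Order-sensitive checksum over a dataset's records.
  def checksum(records: Seq[String]): Long = {
    val crc = new CRC32()
    records.foreach { r =>
      crc.update(r.getBytes(StandardCharsets.UTF_8))
      crc.update('\n'.toInt) // record separator, so ("ab", "c") differs from ("a", "bc")
    }
    crc.getValue
  }

  // Apply the transformation twice and report whether the output checksums agree.
  def isDeterministic(input: Seq[String], f: Seq[String] => Seq[String]): Boolean =
    checksum(f(input)) == checksum(f(input))

  def main(args: Array[String]): Unit = {
    val counter = new java.util.concurrent.atomic.AtomicInteger()
    // Pure function of its input: both runs produce the same output.
    println(isDeterministic(Seq("a", "b"), _.map(_.toUpperCase)))
    // Depends on hidden mutable state: the second run produces different output.
    println(isDeterministic(Seq("x"), _.map(r => r + counter.incrementAndGet())))
  }
}
```

Matching checksums do not prove that a transformation is deterministic; a mismatch only shows that two particular runs disagreed, which is enough to warn the user.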

Demo

Example dataset: 1 GB partial Wikipedia dump
• Reconstruct and query intermediate datasets
• Visualize the program’s data flow
• Rerun any task in a single-process debugger

Record Tracing

Example: query a database of users and groups

[Diagram: HDFS file A → map(_.split('\t')) → users; HDFS file B → map(_.split('\t')) → groups; join() → groupCounts]
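A small local sketch can show what tracing a record forward through this kind of program looks like. The file layouts are assumptions (file A as "groupId, user name" and file B as "groupId, group name", tab-separated), and RecordTracing, parse, join, and traceForward are hypothetical names. Re-running the pipeline on a single input line is one simple way to trace forward and is not necessarily how Arthur implements it.

```scala
// Local sketch with Seq in place of RDDs; all names and file layouts are assumptions.
object RecordTracing {
  type Pair = (String, String)

  // Split each tab-separated line and key it on the first field.
  def parse(lines: Seq[String]): Seq[Pair] =
    lines.map(_.split('\t')).map(f => (f(0), f(1)))

  // Inner join on key, similar in shape to join() on key-value RDDs.
  def join(left: Seq[Pair], right: Seq[Pair]): Seq[(String, (String, String))] =
    for {
      (k, l)  <- left
      (k2, r) <- right
      if k == k2
    } yield (k, (l, r))

  // Forward trace: the joined records that a single user line contributes to.
  def traceForward(userLine: String, groups: Seq[Pair]): Seq[(String, (String, String))] =
    join(parse(Seq(userLine)), groups)

  def main(args: Array[String]): Unit = {
    val groups = parse(Seq("g1\tadmins", "g2\tstaff"))
    println(traceForward("g1\talice", groups))
  }
}
```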

Performance

Event logging introduces minimal overhead

[Bar chart: normalized runtime, No debugging vs. Debugging, for PageRank, logistic regression, and k-means. Data labels: 1.04, 1.02, 1.02]

Future Plans
• More analyses like backward tracing and culprit detection
• Profiling tools for GC and memory
• Real bugs

Ankur Dave

http://ankurdave.com

Arthur is in development at https://github.com/mesos/spark, branch arthur

Documentation: https://github.com/mesos/spark/wiki/Spark-Debugger


