The Spark Debugger
Arthur: The Spark Debugger
Ankur Dave, Matei Zaharia, Murphy McCauley, Scott Shenker, Ion Stoica
UC Berkeley
Motivation
Debugging large parallel jobs is hard
Current approaches to debugging:
• Repeatedly modify and rerun the program
• Run isolated code in Spark shell
Introducing Arthur
Interactive replay debugger for Spark programs
• Reconstruct and query intermediate datasets
• Visualize the program’s data flow
• Rerun any task in a single-process debugger
• Trace records across transformations
• Aggregate exceptions at the master
Spark Programming Model
Resilient Distributed Datasets (RDDs)
Deterministic transformations
Example: Find how many Wikipedia articles match a search term
HDFS file -> map(_.split('\t')(3)) -> articles -> filter(_.contains("Berkeley")) -> matches -> count() -> 10,000
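The pipeline in this example can be sketched in plain Scala. This is only a stand-in under stated assumptions: it uses an in-memory Seq where the slide uses an RDD backed by an HDFS file, and the sample lines, the object name WikipediaSearch, and the helper countMatches are invented for illustration. Only the map, filter, and count steps come from the slide.

```scala
object WikipediaSearch {
  // Hypothetical sample input: tab-separated lines whose fourth field
  // (index 3) holds the article text, as in map(_.split('\t')(3)).
  val sampleLines: Seq[String] = Seq(
    "1\t2012-01-01\tUC Berkeley\tThe University of California, Berkeley is a public university",
    "2\t2012-01-02\tSpark\tSpark is a cluster computing framework",
    "3\t2012-01-03\tBerkeley DB\tBerkeley DB is an embedded database"
  )

  // Mirrors the slide's chain: split each line, keep field 3,
  // keep the articles that mention the term, and count them.
  def countMatches(lines: Seq[String], term: String): Int = {
    val articles = lines.map(_.split('\t')(3))
    val matches = articles.filter(_.contains(term))
    matches.size
  }

  def main(args: Array[String]): Unit = {
    println(countMatches(sampleLines, "Berkeley"))
  }
}
```

Because each step is a deterministic function of its input, any intermediate dataset (articles, matches) can be recomputed from the source lines, which is the property a replay debugger relies on.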
Approach
[Diagram: Master, Workers, and Log. Edge labels: tasks; results, checksums, events; lineage, checksums, events]
Approach
[Diagram: Master, Workers, Log, and user input. Edge labels: tasks; results, checksums; lineage]
Detecting Nondeterministic Transformations
Re-running a nondeterministic transformation may yield different results
Arthur checksums RDD contents and alerts the user if necessary
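The idea of checksumming contents to spot nondeterminism can be illustrated with a small sketch. The slides do not say which checksum Arthur uses or how it compares runs, so everything here is an assumption: CRC32 from the Java standard library, an order-sensitive checksum over in-memory records, and the hypothetical names ChecksumCheck, checksum, and differsOnRerun.

```scala
import java.nio.charset.StandardCharsets
import java.util.zip.CRC32

object ChecksumCheck {
  // Order-sensitive CRC32 over all records, with a newline byte after
  // each record so that record boundaries affect the result.
  def checksum(records: Seq[String]): Long = {
    val crc = new CRC32()
    records.foreach { r =>
      crc.update(r.getBytes(StandardCharsets.UTF_8))
      crc.update('\n'.toInt)
    }
    crc.getValue
  }

  // Runs the transformation twice on the same input and reports
  // whether the two outputs have different checksums.
  def differsOnRerun[A](input: Seq[A])(transform: Seq[A] => Seq[String]): Boolean = {
    val first = checksum(transform(input))
    val second = checksum(transform(input))
    first != second
  }
}
```

A transformation that reads hidden mutable state, such as a counter or a random number generator, would typically produce different checksums across the two runs, while a pure function such as uppercasing each record would not.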
Demo
Example dataset: 1 GB partial Wikipedia dump
• Reconstruct and query intermediate datasets
• Visualize the program’s data flow
• Rerun any task in a single-process debugger
Record Tracing
Example: query a database of users and groups
HDFS file A -> map(_.split('\t')) -> users
HDFS file B -> map(_.split('\t')) -> groups
users, groups -> join() -> groupCounts
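A rough sketch of what tracing a record forward through this example might look like, using plain Scala collections. The slides give neither the file layouts nor Arthur's tracing mechanism, so the field order, the sample data, the reading of groupCounts as "users per group", and the names RecordTracing, joinUsersGroups, and traceForward are all assumptions. The tracing shown is the naive approach of re-running the pipeline on a single input record.

```scala
object RecordTracing {
  // Hypothetical stand-ins for HDFS files A and B (tab-separated).
  val fileA: Seq[String] = Seq("1\talice\t10", "2\tbob\t20", "3\tcarol\t10") // userId, name, groupId
  val fileB: Seq[String] = Seq("10\tadmins", "20\tguests")                   // groupId, groupName

  // map(_.split('\t')) on both inputs, then join on groupId.
  def joinUsersGroups(a: Seq[String], b: Seq[String]): Seq[(String, String)] = {
    val users = a.map(_.split('\t')).map(f => (f(2), f(1)))   // (groupId, userName)
    val groups = b.map(_.split('\t')).map(f => (f(0), f(1)))  // (groupId, groupName)
    for ((gid, name) <- users; (gid2, group) <- groups if gid == gid2)
      yield (name, group)
  }

  // Assumed meaning of groupCounts: number of users in each group.
  def groupCounts(joined: Seq[(String, String)]): Map[String, Int] =
    joined.groupBy(_._2).map { case (group, rows) => group -> rows.size }

  // Naive forward trace: feed one line of file A through the same
  // pipeline to see which joined records it contributes to.
  def traceForward(line: String): Seq[(String, String)] =
    joinUsersGroups(Seq(line), fileB)
}
```

For instance, tracing the line for bob shows the single joined record it produces, which in turn identifies the group whose count it affects.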
Performance
Event logging introduces minimal overhead
[Bar chart: normalized runtime, No debugging vs. Debugging, for PageRank, logistic regression, and k-means. Bar labels: 1.04, 1.02, 1.02]
Future Plans
• More analyses like backward tracing and culprit detection
• Profiling tools for GC and memory
• Real bugs
Ankur Dave
[email protected]
http://ankurdave.com
Arthur is in development at https://github.com/mesos/spark, branch arthur
Documentation: https://github.com/mesos/spark/wiki/Spark-Debugger