The Spark Debugger

Arthur: The Spark Debugger. Ankur Dave, Matei Zaharia, Murphy McCauley, Scott Shenker, Ion Stoica (UC Berkeley)


TRANSCRIPT

Page 1: The Spark Debugger

Arthur

The Spark Debugger

Ankur Dave, Matei Zaharia, Murphy McCauley, Scott Shenker, Ion Stoica

UC Berkeley

Page 2: The Spark Debugger

Motivation

Debugging large parallel jobs is hard

Current approaches to debugging:
• Repeatedly modify and rerun the program
• Run isolated code in Spark shell

Page 3: The Spark Debugger

Introducing Arthur

Interactive replay debugger for Spark programs
• Reconstruct and query intermediate datasets
• Visualize the program’s data flow
• Rerun any task in a single-process debugger
• Trace records across transformations
• Aggregate exceptions at the master

Page 4: The Spark Debugger

Spark Programming Model

Example: Find how many Wikipedia articles match a search term

[Diagram: HDFS file → map(_.split('\t')(3)) → articles → filter(_.contains("Berkeley")) → matches → count() → 10,000]

Diagram labels: Resilient Distributed Datasets (RDDs); Deterministic transformations
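The pipeline on this slide can be sketched without a cluster. The following is a minimal illustration in which plain Scala collections stand in for RDDs; the object name `ArticleSearch`, the sample lines, and the assumption that column 3 holds the article text are all hypothetical and not taken from the slides.

```scala
// Minimal sketch: plain Scala collections stand in for RDDs (an assumption,
// not Spark itself). The two transformations mirror the ones on the slide.
object ArticleSearch {
  // Hypothetical tab-separated records; column 3 is assumed to hold the text.
  val lines: Seq[String] = Seq(
    "1\tBerkeley\t2011\tUniversity of California, Berkeley",
    "2\tSpark\t2011\tCluster computing framework",
    "3\tMesos\t2011\tStarted at Berkeley"
  )

  // First deterministic transformation: keep only column 3 of each line.
  def articles(in: Seq[String]): Seq[String] = in.map(_.split('\t')(3))

  // Second deterministic transformation: keep articles matching the term.
  def matches(in: Seq[String]): Seq[String] =
    articles(in).filter(_.contains("Berkeley"))

  def main(args: Array[String]): Unit =
    println(matches(lines).size) // the count() step
}
```

Because both steps are deterministic, re-running them on the same input rebuilds the same intermediate datasets, which is the property a replay debugger relies on.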

Page 5: The Spark Debugger

Approach

[Diagram: Master, Workers, Log. Arrow labels: tasks; results, checksums, events; lineage, checksums, events]

Page 6: The Spark Debugger

Approach

[Diagram: Master, Workers, Log. Arrow labels: tasks; results, checksums; lineage; user input]

Page 7: The Spark Debugger

Detecting Nondeterministic Transformations

Re-running a nondeterministic transformation may yield different results

Arthur checksums RDD contents and alerts the user if necessary
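The checksum comparison described above might look roughly like the sketch below. The slides do not say which checksum Arthur uses or how the alert is delivered, so CRC32, the object name `ChecksumCheck`, and the helpers `checksum`, `compare`, and `stamp` are assumptions made for illustration only.

```scala
import java.util.zip.CRC32

// Sketch only: CRC32 is an assumed stand-in for whatever checksum Arthur uses.
object ChecksumCheck {
  // Checksum the records of one dataset, in order, with a newline separator.
  def checksum(records: Seq[String]): Long = {
    val crc = new CRC32()
    records.foreach { r =>
      crc.update(r.getBytes("UTF-8"))
      crc.update('\n'.toInt)
    }
    crc.getValue
  }

  // Compare the checksum logged during the original run with the replayed one.
  // Returns a warning message when they differ, None when they agree.
  def compare(name: String, logged: Long, replayed: Long): Option[String] =
    if (logged == replayed) None
    else Some(s"$name: contents differ on replay; a transformation may be nondeterministic")

  // Hypothetical nondeterministic transformation: output depends on call count.
  private var calls = 0
  def stamp(s: String): String = { calls += 1; s"$s#$calls" }

  def main(args: Array[String]): Unit = {
    val input = Seq("a", "b")
    val logged = checksum(input.map(stamp))   // original run
    val replayed = checksum(input.map(stamp)) // replay yields different records
    compare("stamped", logged, replayed).foreach(println)
  }
}
```

A deterministic transformation such as `_.toUpperCase` produces the same checksum on every run, so `compare` returns `None` and no alert is raised.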

Page 8: The Spark Debugger

Demo

Example dataset: 1 GB partial Wikipedia dump
• Reconstruct and query intermediate datasets
• Visualize the program’s data flow
• Rerun any task in a single-process debugger

Page 9: The Spark Debugger

Record Tracing

Example: query a database of users and groups

[Diagram labels: HDFS file A; HDFS file B; map(_.split('\t')) (twice); users; groups; join(); groupCounts]
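One way to picture tracing a record forward through this example is to carry each value's source lines along with it. The sketch below uses plain Scala collections rather than Spark, and it is not Arthur's implementation. The file contents, the column layout (user and group id in file A, group id and group name in file B), and the names `RecordTrace`, `Traced`, and `traceForward` are all hypothetical.

```scala
// Sketch only: provenance sets attached to values, on local collections.
object RecordTrace {
  // A value plus the input lines it was derived from.
  final case class Traced[A](value: A, sources: Set[String])

  val fileA: Seq[String] = Seq("alice\t1", "bob\t2", "carol\t1") // user, group id
  val fileB: Seq[String] = Seq("1\tadmins", "2\tguests")         // group id, name

  // map(_.split('\t')), remembering which line each row came from.
  def parse(lines: Seq[String]): Seq[Traced[Array[String]]] =
    lines.map(l => Traced(l.split('\t'), Set(l)))

  // Join users to groups on group id, then count users per group name.
  def groupCounts: Seq[Traced[(String, Int)]] = {
    val users = parse(fileA)
    val groups = parse(fileB)
    val joined = for {
      u <- users
      g <- groups
      if u.value(1) == g.value(0)
    } yield Traced(g.value(1), u.sources ++ g.sources)
    joined.groupBy(_.value).toSeq.sortBy(_._1).map { case (name, rows) =>
      Traced((name, rows.size), rows.flatMap(_.sources).toSet)
    }
  }

  // Forward trace: which output records were derived from this input line?
  def traceForward(line: String): Seq[(String, Int)] =
    groupCounts.filter(_.sources.contains(line)).map(_.value)

  def main(args: Array[String]): Unit =
    println(traceForward("alice\t1"))
}
```

Carrying provenance with every record is simple but adds memory overhead; it is shown here only to make the idea of following a record across `map` and `join` concrete.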

Page 10: The Spark Debugger

Performance

Event logging introduces minimal overhead

[Bar chart: normalized run time (axis 0.00 to 1.40), No debugging vs. Debugging, for PageRank, Logistic regression, and k-means. Values shown: 1.04, 1.02, 1.02]

Page 11: The Spark Debugger

Future Plans
• More analyses like backward tracing and culprit detection
• Profiling tools for GC and memory
• Real bugs

Page 12: The Spark Debugger

Ankur Dave

[email protected]

http://ankurdave.com

Arthur is in development at https://github.com/mesos/spark, branch arthur

Documentation: https://github.com/mesos/spark/wiki/Spark-Debugger