The Spark Debugger

Post on 22-Feb-2016


Arthur: The Spark Debugger

Ankur Dave, Matei Zaharia, Murphy McCauley, Scott Shenker, Ion Stoica

UC Berkeley

Motivation

Debugging large parallel jobs is hard

Current approaches to debugging:
• Repeatedly modify and rerun the program
• Run isolated code in Spark shell

Introducing Arthur

Interactive replay debugger for Spark programs
• Reconstruct and query intermediate datasets
• Visualize the program’s data flow
• Rerun any task in a single-process debugger
• Trace records across transformations
• Aggregate exceptions at the master

Spark Programming Model

Example: Find how many Wikipedia articles match a search term

• Resilient Distributed Datasets (RDDs)
• Deterministic transformations

[Diagram: HDFS file → map(_.split('\t')(3)) → articles → filter(_.contains("Berkeley")) → matches → count() → 10,000]
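The pipeline on this slide can be tried without a cluster. The sketch below runs the same map/filter/count chain on a plain Scala `Seq` that stands in for the HDFS file. The object name `ArticleSearch`, the sample lines, and the choice of tab-separated field 3 (zero-based) as the article text are illustrative assumptions, not taken from the slides. On a real RDD, the same three calls would be applied to the dataset loaded from HDFS.

```scala
object ArticleSearch {
  // Hypothetical stand-in for the HDFS file: one tab-separated record per line.
  val lines: Seq[String] = Seq(
    "1\ten\tUC_Berkeley\tThe University of California, Berkeley is a public university.",
    "2\ten\tSpark\tSpark is a cluster computing framework.",
    "3\ten\tBerkeley_DB\tBerkeley DB is an embedded database library."
  )

  // Same chain as on the slide: extract field 3, keep matching articles, count them.
  def countMatches(input: Seq[String], term: String): Int = {
    val articles = input.map(_.split('\t')(3))
    val matches = articles.filter(_.contains(term))
    matches.size
  }

  def main(args: Array[String]): Unit =
    println(countMatches(lines, "Berkeley"))
}
```

Because each step depends only on its input, rerunning the chain on the same lines always gives the same count, which is the property a replay debugger relies on.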

Approach

[Diagram: logging while the job runs]
• Master → Workers: tasks
• Workers → Master: results, checksums, events
• Master → Log: lineage, checksums, events

Approach

[Diagram: replay driven by the log]
• Log → Master: lineage
• User input → Master
• Master → Workers: tasks
• Workers → Master: results, checksums

Detecting Nondeterministic Transformations

• Re-running a nondeterministic transformation may yield different results
• Arthur checksums RDD contents and alerts the user if necessary
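A minimal sketch of the checksum idea, on plain Scala collections rather than RDDs. The slides do not say which checksum Arthur uses or how it is combined across partitions; CRC32 over the records in order is an assumption here, and the names `ChecksumCheck`, `checksum`, `isDeterministic`, `upper`, and `stateful` are hypothetical.

```scala
import java.util.zip.CRC32

object ChecksumCheck {
  // Checksum of a sequence of records, in order.
  // CRC32 is an assumption, not necessarily what Arthur uses.
  def checksum(records: Seq[String]): Long = {
    val crc = new CRC32()
    records.foreach { r =>
      crc.update(r.getBytes("UTF-8"))
      crc.update('\n'.toInt) // separator, so ("ab", "c") differs from ("a", "bc")
    }
    crc.getValue
  }

  // Apply the transformation twice (original run and replay) and compare checksums.
  def isDeterministic(input: Seq[String])(transform: Seq[String] => Seq[String]): Boolean =
    checksum(transform(input)) == checksum(transform(input))

  // Deterministic: the output depends only on the input.
  val upper: Seq[String] => Seq[String] = _.map(_.toUpperCase)

  // Nondeterministic: the output depends on hidden state (a call counter here,
  // standing in for a clock, a random seed, or an external service).
  private var calls = 0
  val stateful: Seq[String] => Seq[String] = { in =>
    calls += 1
    in.map(r => s"$r#$calls")
  }
}
```

With these definitions, `ChecksumCheck.isDeterministic(Seq("a", "b"))(ChecksumCheck.upper)` holds, while the same check on `stateful` fails because the second application sees a different counter value. A mismatch of this kind is the signal that would prompt an alert to the user.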

Demo

Example dataset: 1 GB partial Wikipedia dump
• Reconstruct and query intermediate datasets
• Visualize the program’s data flow
• Rerun any task in a single-process debugger

Record Tracing

Example: query a database of users and groups

[Diagram:]
• HDFS file A → map(_.split('\t')) → users
• HDFS file B → map(_.split('\t')) → groups
• users, groups → join() → groupCounts
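One way to picture record tracing is to carry provenance alongside each record. The sketch below is not Arthur's mechanism, which the slides do not describe. It tags every input line with an origin such as `A:0`, propagates the tags through a map and a join on plain Scala collections, and then asks which outputs depend on a given input line. All names (`RecordTrace`, `Traced`, `load`, `mapT`, `joinT`, `trace`) and the sample data are hypothetical.

```scala
object RecordTrace {
  // A record plus the set of input lines it was derived from, e.g. "A:0".
  final case class Traced[T](value: T, sources: Set[String])

  // Tag each line of an input file with its origin ("<file>:<line index>").
  def load(file: String, lines: Seq[String]): Seq[Traced[String]] =
    lines.zipWithIndex.map { case (l, i) => Traced(l, Set(s"$file:$i")) }

  // map: each output keeps the provenance of its single input record.
  def mapT[A, B](in: Seq[Traced[A]])(f: A => B): Seq[Traced[B]] =
    in.map(t => Traced(f(t.value), t.sources))

  // join on the key: each output inherits the provenance of both sides.
  def joinT[K, A, B](left: Seq[Traced[(K, A)]],
                     right: Seq[Traced[(K, B)]]): Seq[Traced[(K, (A, B))]] =
    for {
      l <- left
      r <- right
      if l.value._1 == r.value._1
    } yield Traced((l.value._1, (l.value._2, r.value._2)), l.sources ++ r.sources)

  // Forward trace: which outputs depend on the given input line?
  def trace[T](out: Seq[Traced[T]], source: String): Seq[T] =
    out.filter(_.sources.contains(source)).map(_.value)

  // Sample data (assumed layout: file A is user<TAB>groupId, file B is groupId<TAB>groupName).
  val users = mapT(load("A", Seq("alice\tg1", "bob\tg2"))) { l =>
    val cols = l.split('\t'); (cols(1), cols(0)) // (groupId, userName)
  }
  val groups = mapT(load("B", Seq("g1\tadmins", "g2\tstaff"))) { l =>
    val cols = l.split('\t'); (cols(0), cols(1)) // (groupId, groupName)
  }
  val joined = joinT(users, groups)
}
```

Here `RecordTrace.trace(RecordTrace.joined, "A:1")` returns the single joined record derived from the second line of file A. Tracing in the other direction, from an output back to its inputs, amounts to reading the `sources` set of that output.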

Performance

Event logging introduces minimal overhead

[Bar chart: normalized run time, no debugging vs. debugging. Values shown: PageRank 1.04, logistic regression 1.02, k-means 1.02]

Future Plans

• More analyses like backward tracing and culprit detection
• Profiling tools for GC and memory
• Real bugs

Ankur Dave
ankurd@eecs.berkeley.edu
http://ankurdave.com

Arthur is in development at https://github.com/mesos/spark, branch arthur

Documentation: https://github.com/mesos/spark/wiki/Spark-Debugger
