Unraveling Hadoop Meltdown Mysteries

Sean Suchter

Disks are thrashing!

Solution

• Make job author aware of surprising behavior.

• Modify job code & settings to be nicer to disks.

Nodes are dying!

Initial diagnosis…

• Nodes abruptly started swapping and becoming non-responsive. (Required physical power cycling)

• Job submitters report “I didn’t change anything”

• Question: What’s doing this to the cluster?

Cause & solution

• While the job didn’t change, its input data did.

• Stop that user’s jobs immediately.

• Better use of capacity scheduler virtual memory controls.

• Use Pepperdata protection to limit physical memory as well.
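As a rough illustration of the idea behind the last two bullets, here is a minimal Python sketch of a per-task memory check that looks at both virtual and physical memory. It is not the capacity scheduler's or Pepperdata's implementation; the TaskMemory record, the tasks_over_limit function, and the limit values are all hypothetical names and numbers chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class TaskMemory:
    """Hypothetical snapshot of one task's memory use on a node."""
    task_id: str       # made-up task identifier
    virtual_mb: int    # virtual memory mapped by the task
    physical_mb: int   # resident (physical) memory used by the task


def tasks_over_limit(tasks, virtual_limit_mb, physical_limit_mb):
    """Return (task_id, reason) pairs for tasks exceeding either limit.

    A virtual-memory limit alone can miss a task whose resident set
    grows with its input data, so the physical limit is checked too.
    """
    offenders = []
    for t in tasks:
        if t.virtual_mb > virtual_limit_mb:
            offenders.append((t.task_id, "virtual"))
        elif t.physical_mb > physical_limit_mb:
            offenders.append((t.task_id, "physical"))
    return offenders


# Example with made-up numbers: one task over each kind of limit.
tasks = [
    TaskMemory("t1", virtual_mb=1500, physical_mb=900),
    TaskMemory("t2", virtual_mb=5000, physical_mb=1000),
    TaskMemory("t3", virtual_mb=1800, physical_mb=3000),
]
for task_id, reason in tasks_over_limit(tasks, 4096, 2048):
    print(task_id, "exceeds the", reason, "memory limit")
```

In this sketch, t3 stays under the virtual limit but would still push a node toward swapping, which is why a physical limit is checked as well.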

Take-away

• You see problems at the node level.

• You see the root causes at the task level.
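One way to picture this take-away is a small Python sketch that starts from a troubled node and rolls task-level samples up by job, so the job consuming the most memory on that node stands out. The sample records, field names ("node", "job", "task", "rss_mb"), and the memory_by_job_on_node function are assumptions made for illustration, not the format of any real Hadoop or Pepperdata metric.

```python
from collections import defaultdict


def memory_by_job_on_node(samples, node):
    """Sum task-level resident memory per job for one node.

    Returns (job, total_rss_mb) pairs, largest consumer first.
    """
    totals = defaultdict(int)
    for s in samples:
        if s["node"] == node:
            totals[s["job"]] += s["rss_mb"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical task-level samples; all values are made up.
samples = [
    {"node": "n1", "job": "jobA", "task": "a1", "rss_mb": 500},
    {"node": "n1", "job": "jobB", "task": "b1", "rss_mb": 3000},
    {"node": "n1", "job": "jobB", "task": "b2", "rss_mb": 2500},
    {"node": "n2", "job": "jobA", "task": "a2", "rss_mb": 700},
]

for job, total in memory_by_job_on_node(samples, "n1"):
    print(job, total)
```

The node-level view only says that n1 is short on memory; the per-task breakdown is what points at a specific job.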

Pepperdata meetup tomorrow!

• War Stories from the Hadoop Trenches

• Allen Wittenauer (Apache Hadoop committer and former LinkedIn)

• Eric Baldeschwieler (former Hortonworks CEO / CTO)

• Todd Nemet (Looker; former Altiscale, ClearStory Data, Cloudera)

• 6pm Wed 6/25

• Firehouse Brewery, 111 S Murphy, Sunnyvale

• http://www.meetup.com/pepperdata/
