
Page 1: Troubleshooting Hadoop: Distributed Debugging

Troubleshooting Hadoop: Distributed Debugging
Dustin Cote | Customer Operations Engineer

Page 2: Troubleshooting Hadoop: Distributed Debugging

Roadmap

• The Hadoop Ecosystem
  • What is Hadoop?
  • What are some clear challenge areas?
• Debugging tools
  • How do built-in Linux tools help?
  • Where do we look for typical problems?
  • Custom tooling to facilitate problem solving
• Deep dive example
  • Application with intermittent failure
  • Some data is bigger than others

Page 3: Troubleshooting Hadoop: Distributed Debugging

The Hadoop Ecosystem

Page 4: Troubleshooting Hadoop: Distributed Debugging

What is Hadoop?

• Top-level Apache project for storing and processing large data sets
• Originally an implementation of Google's MapReduce and Google File System papers
• Has since evolved into the general platform for working with petabyte-scale datasets
  • Specifically relevant for this presentation
• Mostly implemented in Java
• Users generally expand to 20+ other components that work with Hadoop
• Master-slave architecture
• Commonly used "in the cloud"

Page 7: Troubleshooting Hadoop: Distributed Debugging

Challenge Areas

• Infrastructure
  • Network sensitivity
  • Disk contention
• JVM scaling
  • Garbage collection
  • Memory sizing
• Configuration management
  • Host inconsistencies
  • Platform config inconsistencies
  • Version tracking

Page 8: Troubleshooting Hadoop: Distributed Debugging

Debugging tools

Page 9: Troubleshooting Hadoop: Distributed Debugging

Linux-based utilities

• Hadoop runs on Linux
  • Leverage existing skill sets
• Log parsing
  • grep, sed, awk, perl, etc.
• Network health
  • ifconfig, telnet, traceroute, tcpdump, etc.
• Process health
  • top, ps, etc.
• System health
  • dmesg, /var/log/messages, etc. (a quick triage sketch follows below)
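
A minimal first-pass triage sketch using the stock tools listed above; the hostname, port, and log paths are placeholders, not anything this deck prescribes:

    # Is the host under obvious pressure? Load, memory, runaway processes.
    top -b -n 1 | head -20

    # Can this node reach the NameNode RPC port? (example host; 8020 is the default port)
    telnet namenode.example.com 8020 < /dev/null

    # Anything nasty at the kernel level: OOM killer, disk resets, NIC flaps?
    dmesg | tail -50
    grep -i 'killed process' /var/log/messages   # the OOM killer leaves its mark here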

Page 10: Troubleshooting Hadoop: Distributed Debugging

Extending Linux-based utilities

• My application logs are 80 GB!
  • split, filter, slice -- but how?
• ERROR is a good place to start
  • zgrep when you have time
• Keywords for YARN applications
  • ApplicationMaster, MRAppMaster
  • FAIL, KILL, timed out
  • Map those container IDs (container_XXXXX_XX) -- see the sketch below
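
A minimal sketch of that first pass over a large aggregated application log, assuming the log has already been pulled into out.file with yarn logs as shown later in this deck; the keywords are the ones on this slide:

    # Start from errors and failure keywords, then see which containers they map to.
    grep -E 'ERROR|FAIL|KILL|timed out' out.file > suspects.txt

    # Rank containers by how often they appear in the suspect lines
    # (the container ID layout can vary slightly between Hadoop versions).
    grep -oE 'container_[0-9]+_[0-9]+_[0-9]+_[0-9]+' suspects.txt | sort | uniq -c | sort -rn | head

    # zgrep works the same way while the logs are still gzipped on disk.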

Page 11: Troubleshooting Hadoop: Distributed Debugging

JVM tools

• Mostly Java means a mostly familiar toolkit
  • jstack, jmap, jconsole, jps
• Careful with heap dumps: data-processing JVMs can have 10+ GB heaps
• Garbage collection logging (-XX:+PrintGCDetails)
• Lots of different users, so make sure you are running as the right user when collecting JVM metrics
  • Do not just run as root everywhere
  • sudo to the JVM owner when collecting jstacks and jmaps (see the sketch below)
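
A minimal sketch of collecting a thread dump the way the slide suggests, using a DataNode as the example process; the process name and output path are illustrative only:

    # Find the DataNode JVM and its owning user without assuming it runs as root.
    PID=$(pgrep -f org.apache.hadoop.hdfs.server.datanode.DataNode | head -1)
    OWNER=$(ps -o user= -p "$PID")

    # Take the thread dump as the JVM owner, not as root, so the attach works cleanly.
    sudo -u "$OWNER" jstack "$PID" > "jstack.$PID.$(date +%s).txt"

    # A heap histogram is far cheaper than a full heap dump on a 10+ GB heap.
    sudo -u "$OWNER" jmap -histo "$PID" | head -40

    # GC logging (-XX:+PrintGCDetails -Xloggc:...) is normally set in the service's startup options.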

Page 12: Troubleshooting Hadoop: Distributed Debugging

Source code!

• Most of the code base is open source!
  • Found a NullPointerException? Hop on GitHub and find the line.
  • https://github.com/apache/hadoop
• Even better, JIRA is available to see known issues
  • Hadoop Common
  • HDFS
  • MapReduce
  • YARN

Page 13: Troubleshooting Hadoop: Distributed Debugging

Log analysis helps identify anomalies

• Word counts are simple but powerful
• Tracking service logging over time shows patterns (a sketch follows below)
• Master tracking helps drill into which slaves may be unhealthy

Custom tooling
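
A minimal example of that word-count style of analysis: counting WARN and ERROR lines per hour in a service log. The log path and the timestamp layout are assumptions based on typical Hadoop log4j output, not something this deck specifies:

    # Count WARN and ERROR lines per hour to spot when a service started misbehaving.
    # Assumes log4j's usual "2015-04-20 15:40:04,938 LEVEL ..." line layout.
    grep -hE ' (WARN|ERROR) ' /var/log/hadoop-hdfs/hadoop-*-namenode-*.log* \
      | awk '{ print substr($1 " " $2, 1, 13) }' \
      | sort | uniq -c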

Page 14: Troubleshooting Hadoop: Distributed Debugging

Configuration management is hard

• Validating that configuration is in lock-step across all instances is ideal (a sketch follows below)
• Keep configuration simple and logical
• At Cloudera, we pull whole-cluster configurations for validation

Custom tooling
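
A minimal sketch of that kind of cross-host validation, assuming passwordless SSH; the host list, config path, and reference copy are placeholders:

    # Compare one config file across the cluster against a known-good reference copy.
    REF=/etc/hadoop/conf/hdfs-site.xml        # reference copy on the machine running the check
    while read -r host; do
      ssh -n "$host" cat /etc/hadoop/conf/hdfs-site.xml | diff -q - "$REF" > /dev/null \
        || echo "config drift on $host"
    done < cluster-hosts.txt                  # hypothetical list of cluster hostnames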

Page 15: Troubleshooting Hadoop: Distributed Debugging

Deep dive examples

Page 16: Troubleshooting Hadoop: Distributed Debugging

Example

• Initial complaint
  • MapReduce job shows "SUCCESSFUL" but does not generate any output
  • Job was known to produce output on smaller datasets
• User environment
  • ~100 node cluster
  • Running YARN with MapReduce v2
  • Job uses the Kite SDK and Apache Crunch APIs (also open source)
  • Job runs for several hours, so reproducing is painful

Page 17: Troubleshooting Hadoop: Distributed Debugging

Example

• Debugging the environment
  • Searching on errors, this was found first:
  • 2015-04-20 15:40:04,938 WARN [Readahead Thread #1] org.apache.hadoop.io.ReadaheadPool: Failed readahead on ifile EINVAL: Invalid argument
  • Bad disk? Probably not -- this job runs when the data batch is smaller (see the sketch below)
  • User mailing lists confirm this is a false positive!
  • File a JIRA and move on
  • Other node problems? Probably not -- no indication of other jobs failing
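
A hedged sketch of the quick host-level checks behind the "bad disk?" question; the NodeManager log path is an assumption about where this warning would land:

    # Kernel-level I/O errors would show up here if a disk were actually failing.
    dmesg | grep -iE 'i/o error|blk_update_request|ata[0-9]+.*error'

    # How widespread is the readahead warning: one node, or every node?
    grep -c 'Failed readahead on ifile' /var/log/hadoop-yarn/yarn-*-nodemanager-*.log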

Page 18: Troubleshooting Hadoop: Distributed Debugging

Example

• Debugging the application
  • Logging obtained through Hadoop commands:
  • yarn logs -applicationId APP_ID > out.file
  • Logs are huge, so we need a strategy (see the sketch below)
  • First check whether a write-out failure is being silently ignored -- it was not
  • Check whether any output data is created at all -- yes!
  • The output data is then destroyed when moving to the final location -- bad, but why?
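
A minimal sketch of that narrowing strategy, using only the command already on this slide plus grep; the _temporary convention is an assumption about the job using a standard file output committer:

    # Pull the aggregated logs once, then work against the local copy.
    yarn logs -applicationId APP_ID > out.file

    # Did any task write output at all? Temporary task output paths usually mention _temporary.
    grep -c '_temporary' out.file

    # Focus on the commit/move-to-final-location phase rather than all 80 GB of log.
    grep -nE 'OutputCommitter|commit|rename' out.file | less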

Page 19: Troubleshooting Hadoop: Distributed Debugging

Example

• Debugging the application
  • Need more information, so let's get DEBUG-level logging
  • Logs are already 80 GB, so now we have even more data to sift through; let's focus on the final move stage
  • org.kitesdk.data.mapreduce.DatasetKeyOutputFormat$MergeOutputCommitter makes that move happen, so let's raise just that class to DEBUG (see the sketch below)
  • Success! We see this class toss aside the dataset -- but why?
  • The code shows an int is being used to count the records to output :( (an int overflows past roughly 2.1 billion, which only bites on the biggest datasets)
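
A hedged sketch of how that targeted DEBUG setting and the follow-up sifting could look; the log4j line raises the whole Kite mapreduce package, which is slightly coarser than the single class but reliably covers the committer's logger:

    # Class-level override in the job's log4j configuration (log4j 1.x syntax):
    #   log4j.logger.org.kitesdk.data.mapreduce=DEBUG
    # After rerunning, pull the logs again and read only the committer's trail.
    yarn logs -applicationId APP_ID > out.debug.file
    grep -n 'MergeOutputCommitter' out.debug.file | less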

Page 20: Troubleshooting Hadoop: Distributed Debugging

Example 2

• Initial complaint
  • A Hive query that used to run in 10 minutes is now not complete after 10 hours
  • Nothing has changed (!)
• User environment
  • ~50 node cluster
  • Hive tables on the scale of several hundred GB
  • Query with JOIN operations

Page 21: Troubleshooting Hadoop: Distributed Debugging

Example 2

• The user may not be aware of changes, but what do the logs say?
• Hive generates MapReduce jobs deterministically based on:
  • Table structure
  • Optimization flags
  • HQL (SQL-like) query structure
• The user shows it is the same query and no properties have changed (one way to verify this is sketched below)
• Back to those challenge areas (infrastructure, JVM, config management)
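
One hedged way to double-check the "no properties have changed" claim instead of taking it on faith, assuming the Hive CLI is available on both the old and the new client machines; the file names are placeholders:

    # Dump every effective Hive/Hadoop session property from each client, then diff.
    hive -e 'set -v;' > props_old_client.txt   # run on the client that used to be fast
    hive -e 'set -v;' > props_new_client.txt   # run on the client that is slow now
    diff props_old_client.txt props_new_client.txt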

Page 22: Troubleshooting Hadoop: Distributed Debugging

Example 2

• Config management is easiest to check
  • Running from another client machine?
  • Cluster-side default changes? (upgrades, patches, etc.)
• JVM is next easiest
  • Let's pull in the MapReduce logs again (see the sketch below)
  • yarn logs -applicationId APP_ID > out.log
  • 2015-11-24 17:56:46,324 INFO [main] org.apache.hadoop.hive.ql.exec.CommonJoinOperator: table 0 has 3628000 rows for join key [00011]
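
A minimal sketch of turning that CommonJoinOperator line into a skew report, using only grep and awk over the log pulled above; the field positions are taken from the exact line quoted on this slide:

    # The join operator periodically logs its running row count per join key;
    # the same key can appear several times as its count grows, so take the top lines.
    grep 'rows for join key' out.log | awk '{ print $9, $NF }' | sort -rn | head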

Page 23: Troubleshooting Hadoop: Distributed Debugging

Example 2

• Why so many rows for one join key?
• How many rows overall? ~1.8 billion!
  • Disk write throughput alone will take several hours at that size (a back-of-envelope check follows below)
• So, what changed?
  • Configuration changes would not create more rows in the output
  • JVM settings and memory management do not seem likely
  • Infrastructure was never going to be fast enough to do this in 10 minutes
• UAT testing was being used as the performance baseline!
• Hadoop scales linearly only if you scale your data linearly :)
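
A back-of-envelope check of the "several hours just for disk writes" claim. The row size and throughput numbers are assumptions, and it assumes the hot join key funnels through a single reduce task, which is how a common join handles one key:

    # ~1.8e9 rows * ~200 bytes per row, through one task at ~100 MB/s effective throughput:
    echo 'scale=1; (1800000000 * 200) / (100 * 1000000) / 3600' | bc   # ~1.0 hour for the raw write alone
    # Sort spills, shuffle, and 3x HDFS replication multiply that, so "several hours" is about right.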

Page 24: Troubleshooting Hadoop: Distributed Debugging

Thank you