Hadoop World 2010: Productionizing Hadoop: Lessons Learned

Productionizing Hadoop: Lessons Learned
Eric Sammer - Solution Architect - Cloudera
email: [email protected]
twitter: @esammer, @cloudera

Upload: cloudera-inc

Post on 10-May-2015




Copyright 2010 Cloudera Inc. All rights reserved

Starting Out

(You)

http://www.iccs.inf.ed.ac.uk/~miles/code.html

“Let’s build a Hadoop cluster!”




Where you want to be

(You)

Yahoo! Hadoop Cluster (2007)


What is Hadoop?

• A scalable fault-tolerant distributed system for data storage and processing (open source under the Apache license)
• Core Hadoop has two main components
  • Hadoop Distributed File System (HDFS): self-healing high-bandwidth clustered storage
  • MapReduce: fault-tolerant distributed processing
• Key value
  • Flexible -> store data without a schema and add it later as needed
  • Affordable -> cost / TB at a fraction of traditional options
  • Broadly adopted -> a large and active ecosystem
  • Proven at scale -> dozens of petabyte + implementations in production today


Cloudera’s Distribution for Hadoop, Version 3

• Open source – 100% Apache licensed
• Simplified – Component versions & dependencies managed for you
• Integrated – All components & functions interoperate through standard APIs
• Reliable – Patched with fixes from future releases to improve stability
• Supported – Employs project founders and committers for >70% of components


[Diagram: CDH3 components: Hue, Hue SDK, Oozie, HBase, Flume, Sqoop, Zookeeper, Hive, Pig]

The Industry’s Leading Hadoop Distribution


Overview

• Proper planning
• Data Ingestion
• ETL and Data Processing Infrastructure
• Authentication, Authorization, and Sharing
• Monitoring


The production data platform

• Data storage
• ETL / data processing / analysis infrastructure
• Data ingestion infrastructure
• Integration with tools
• Data security and access control
• Health and performance monitoring


Proper planning

• Know your use cases!
  • Log transformation, aggregation
  • Text mining, IR
  • Analytics
  • Machine learning
• Critical to proper configuration
  • Hadoop
  • Network
  • OS
• Resource utilization, deep job insight will tell you more


HDFS Concerns

• Name node availability
  • HA is tricky
  • Consider where Hadoop lives in the system
  • Manual recovery can be simple, fast, effective
• Backup Strategy
  • Name node metadata – hourly, ~2 day retention
  • User data
    • Log shipping style strategies
    • DistCp
    • “Fan out” to multiple clusters on ingestion
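
The "hourly, ~2 day retention" policy for name node metadata can be sketched as a small pruning rule. This is only an illustration: the slide does not say how the backups are taken or stored, and the function name `select_expired` and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta

def select_expired(backup_times, now, retention=timedelta(days=2)):
    """Return the backup timestamps that fall outside the retention window."""
    cutoff = now - retention
    return sorted(t for t in backup_times if t < cutoff)

# Hypothetical data: one metadata snapshot per hour for the last three days.
now = datetime(2010, 10, 12, 12, 0)
snapshots = [now - timedelta(hours=h) for h in range(72)]

# Snapshots older than two days are candidates for deletion.
expired = select_expired(snapshots, now)
kept = [t for t in snapshots if t not in expired]
```

A real job would delete only after confirming that the most recent snapshot is readable, so a failed backup run never leaves zero usable copies.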


Data Ingestion

• Many data sources
  • Streaming data sources (log files, mostly)
  • RDBMS
  • EDW
  • Files (usually exports from 3rd party)
• Common place we see DIY
  • You probably shouldn’t
  • Sqoop, Flume, Oozie (but I’m biased)
• No matter what - fault tolerant, performant, monitored


ETL and Data Processing

• Non-interactive jobs
  • Establish a common directory structure for processes
  • Need tools to handle complex chains of jobs
  • Workflow tools support
    • Job dependencies, error handling
    • Tracking
    • Invocation based on time or events
• Most common mistake: depending on jobs always completing successfully or within a window of time.
  • Monitor for SLA rather than pray
  • Defensive coding practices apply just as they do everywhere else!
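
"Monitor for SLA rather than pray" might look like the sketch below: every run carries an explicit deadline, and a check classifies it instead of assuming success. The `JobRun` record and the status names are assumptions made for illustration; they are not part of Oozie or any other workflow tool.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class JobRun:
    name: str
    deadline: datetime                      # when downstream consumers need the output
    finished_at: Optional[datetime] = None  # None while the job is still running
    succeeded: bool = False

def sla_status(run, now):
    """Classify a run without assuming it completed, or completed on time."""
    if run.finished_at is None:
        return "running" if now <= run.deadline else "sla_missed"
    if not run.succeeded:
        return "failed"
    return "ok" if run.finished_at <= run.deadline else "late"

deadline = datetime(2010, 10, 12, 6, 0)
on_time = JobRun("daily_rollup", deadline, datetime(2010, 10, 12, 5, 30), True)
stuck = JobRun("daily_rollup", deadline)
```

Anything other than "ok" or "running" is worth an alert, which covers both failure modes named on the slide: jobs that fail and jobs that miss their window.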


Metadata Management

• Tool independent metadata about…
  • Data sets we know about and their location (on HDFS)
  • Schemata
  • Authorization (currently HDFS permissions only)
  • Partitioning
  • Format and compression
  • Guarantees (consistency, timeliness, permits duplicates)
• Currently still DIY in many ways, tool-dependent
  • Most people rely on prayer and hard coding
  • (H)OWL is interesting
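
As a rough picture of what tool-independent metadata could hold, the sketch below records the items from the list above in one place so jobs look up locations instead of hard coding them. The `DatasetMetadata` and `Catalog` names, the fields, and the example path are all hypothetical; this is not the (H)OWL data model.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DatasetMetadata:
    name: str
    location: str                     # HDFS path
    schema: tuple                     # (column, type) pairs
    partitioned_by: tuple = ()
    file_format: str = "text"
    compression: str = "none"
    permits_duplicates: bool = False  # one of the "guarantees"

class Catalog:
    """In-memory registry; a real one would need durable, shared storage."""

    def __init__(self):
        self._datasets = {}

    def register(self, dataset):
        if dataset.name in self._datasets:
            raise ValueError("already registered: " + dataset.name)
        self._datasets[dataset.name] = dataset

    def location_of(self, name):
        return self._datasets[name].location

catalog = Catalog()
catalog.register(DatasetMetadata(
    name="web_logs",
    location="/data/production/web_logs",
    schema=(("ts", "bigint"), ("url", "string")),
    partitioned_by=("date",),
    compression="gzip",
))
```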


Authentication and authorization

• Authentication
  • Don’t talk to strangers
  • Should integrate with existing IT infrastructure
  • Yahoo! security (Kerberos) patches now part of CDH3b3
• Authorization
  • Not everyone can access everything

• Ex. Production data sets are read-only to quants / analysts. Analysts have home or group directories for derived data sets.

• Mostly enforced via HDFS permissions; directory structure and organization is critical

• Not as fine grained as column level access in EDW, RDBMS

• HUE as a gateway to the cluster
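
The read-only production / writable home directory example above can be sketched as a directory plan with POSIX-style modes, which is the model HDFS permissions follow. The paths, the `etl` owner, and the `analysts` group are hypothetical, and `can_write` is a simplified check that ignores superusers and parent-directory permissions.

```python
def build_layout(analysts, group="analysts", etl_user="etl"):
    """Return (path, owner, group, octal mode) entries for the cluster layout."""
    entries = [
        # Production data: the ETL user writes, analysts may only read and list.
        ("/data/production", etl_user, group, "750"),
        # Shared group directory for derived data sets.
        ("/groups/" + group, etl_user, group, "770"),
    ]
    for name in analysts:
        # Private home directory per analyst.
        entries.append(("/user/" + name, name, group, "700"))
    return entries

def can_write(user, user_groups, entry):
    """Simplified owner/group/other check for the write bit."""
    _path, owner, group, mode = entry
    if user == owner:
        digit = mode[0]
    elif group in user_groups:
        digit = mode[1]
    else:
        digit = mode[2]
    return (int(digit) & 2) != 0

layout = build_layout(["alice"])
production, shared, alice_home = layout
```

Because the only lever is the mode on each directory, the layout itself carries the policy, which is why the slide calls directory structure critical.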


Resource Sharing

• Prefer one large cluster to many small clusters (unless maybe you’re Facebook)

• “Stop hogging the cluster!”
  • Cluster resources
    • Disk space (HDFS size quotas)
    • Number of files (HDFS file count quotas)
    • Simultaneous jobs
    • Tasks – guaranteed capacity, full utilization, SLA enforcement

• Monitor and track resource utilization across all groups
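
Tracking utilization across groups can start as a comparison of reported usage against each group's limits. In the sketch below the group names, resource keys, and numbers are made up, and collecting the usage figures from the cluster is left out entirely.

```python
def over_quota(usage, quotas):
    """Return {group: [resources]} for every resource whose usage exceeds its quota."""
    violations = {}
    for group, used in usage.items():
        limits = quotas.get(group, {})
        exceeded = sorted(
            resource for resource, amount in used.items()
            if resource in limits and amount > limits[resource]
        )
        if exceeded:
            violations[group] = exceeded
    return violations

# Hypothetical limits and measurements per group.
quotas = {
    "analytics": {"disk_tb": 50, "files": 1_000_000, "jobs": 10},
    "search": {"disk_tb": 20, "files": 500_000, "jobs": 5},
}
usage = {
    "analytics": {"disk_tb": 42, "files": 1_200_000, "jobs": 3},
    "search": {"disk_tb": 25, "files": 100_000, "jobs": 6},
}
violations = over_quota(usage, quotas)
```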


Monitoring

• Critical for keeping things running
  • Cluster health
    • Duh.
    • Traditional monitoring tools: Nagios, Hyperic, Zenoss
    • Host checks, service checks
    • When to alert? It’s tricky.
  • Cluster performance
    • Overall utilization in aggregate
    • 30,000ft view of utilization and performance; macro level


Monitoring

• Hadoop aware cluster monitoring
  • Traditional tools don’t cut it; Hadoop monitoring is inherently Hadoop specific
  • Analogous to RDBMS monitoring tools
• Job level “monitoring”
  • More like analysis
  • “What resources does this job use?”
  • “How does this run compare to last run?”
  • “How can I make this run faster, more resource efficient?”
  • Two views we care about
    • Job perspective
    • Resource perspective (task slots, scheduler pool)
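
The question "How does this run compare to last run?" can be approximated by diffing per-run metrics and flagging large relative changes. The metric names and the 20% threshold below are assumptions; how the metrics are collected from the job tracker is not shown.

```python
def compare_runs(previous, current, threshold=0.2):
    """Return {metric: relative change} for metrics that moved more than threshold."""
    changes = {}
    for metric, old in previous.items():
        new = current.get(metric)
        if new is None or old == 0:
            continue  # metric missing in this run, or no baseline to compare against
        change = (new - old) / old
        if abs(change) > threshold:
            changes[metric] = change
    return changes

# Hypothetical metrics for two runs of the same job.
last_run = {"runtime_s": 600, "map_tasks": 100, "hdfs_bytes_read": 4_000_000}
this_run = {"runtime_s": 900, "map_tasks": 105, "hdfs_bytes_read": 4_000_000}
regressions = compare_runs(last_run, this_run)
```

Here the run took 50% longer with nearly the same task count and input size, which points at the cluster or the job's code rather than at data growth.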


Wrapping it up

• Hadoop proper is awesome, but is only part of the picture

• Much of Professional Services time is filling in the blanks
• There’s still a way to go
  • Metadata management
  • Operational tools and support
  • Improvements to Hadoop core to improve stability, security, manageability
• Adoption and feedback drive progress
• CDH provides the infrastructure for a complete system


Cloudera Makes Hadoop Safe For the Enterprise

Software Services Training


• Increases reliability and consistency of the Hadoop platform
• Improves Hadoop’s conformance to important IT policies and procedures
• Lowers the cost of management and administration

Cloudera Enterprise
Enterprise Support and Management Tools


References / Resources

• Cloudera documentation - http://docs.cloudera.com
• Cloudera Groups – http://groups.cloudera.org
• Cloudera JIRA – http://issues.cloudera.org
• Hadoop: The Definitive Guide
• [email protected]
• irc.freenode.net #cloudera, #hadoop
• @esammer
