CLOUDERA: CDH4 MELBOURNE HUG
Presented by Angus Klein, Sr. Director of Support, and Linden Hillenbrand, Field Support Engineer
ABOUT CLOUDERA
1. STARTED IN ’08
2. > 250 EMPLOYEES
3. > 1,200 PATCHES IN 2011
4. PARTNER PROGRAM
5. CERTIFIED & COMPATIBLE
CLOUDERA TIMELINE
2008: CLOUDERA FOUNDED BY MIKE OLSON, AMR AWADALLAH & JEFF HAMMERBACHER
2009: HADOOP CREATOR DOUG CUTTING JOINS CLOUDERA
2009: CDH: FIRST COMMERCIAL APACHE HADOOP DISTRIBUTION
2010: CLOUDERA MANAGER: FIRST MANAGEMENT APPLICATION FOR HADOOP
2011: CLOUDERA REACHES 100 PRODUCTION CUSTOMERS
2011: CLOUDERA UNIVERSITY EXPANDS TO 140 COUNTRIES
2012: CLOUDERA ENTERPRISE 4: THE STANDARD FOR HADOOP IN THE ENTERPRISE
2012: CLOUDERA CONNECT REACHES 300 PARTNERS
BEYOND…: TRANSFORMING HOW COMPANIES THINK ABOUT DATA
CDH CLOUDERA MANAGER
CLOUDERA ENTERPRISE
CHANGING THE WORLD ONE PETABYTE AT A TIME
©2011 Cloudera, Inc. All Rights Reserved. Confidential. Reproduction or redistribution without written permission is prohibited.
Cloudera’s proactive, production-level support gives you the expertise and responsiveness you need to run Apache Hadoop in production.
Support Center and Support Presence locations (map labels):
Chennai, India; Tokyo; Palo Alto, CA; RTP, NC; Boston; New York; Austin, TX; Tucson, AZ; San Francisco, CA; Stuttgart, Germany; Seattle, WA; Melbourne, Aus.; UK
THE FUTURE: CDH4
What does the enterprise need?
CDH4
THE STANDARD FOR HADOOP IN THE ENTERPRISE
100% OPEN SOURCE HADOOP DISTRIBUTION
CLOUDERA ENTERPRISE 4.0
CLOUDERA MANAGER 4
PRODUCTION SUPPORT
• Scalable and extensible
• High Availability
• Integration with the rest of IT
• Secure
• Simplified configuration and deployment
• Global Support and services
Highly Available NameNode
Function: A second NameNode runs as a hot standby, ready for failover
Benefit: Eliminates the only remaining single point of failure in HDFS
Heterogeneous Clusters
Function: Users can run different nodes on different Hadoop versions
Benefit: Lower downtime by gradually staging code changes into select nodes of a cluster
Increased usability for mission-critical use cases and applications.
Store more sensitive data in CDH and get granular access control to facilitate multi-tenancy.
HBase Table & Column Permissions
Function: Control which users and groups have access to HBase tables and columns
Benefit: Allows sensitive data to be stored in HBase and facilitates multi-tenancy
Fair Scheduler ACLs
Function: Control which groups can administer or submit jobs into different Fair Scheduler pools
Benefit: Makes it easier to administer a multi-tenant cluster
CDH extends its leadership and ability to solve a broader range of problems than other data management systems.
Co-processors
Function: Create and run custom programs that operate on data as it changes in real time
Benefit: Developers can create more sophisticated real-time applications on top of HBase
Open Resource Management (a.k.a. MR2)
Function: Enable multiple data processing frameworks to run on the same Hadoop cluster
Benefit: Save cost by running more applications on the same storage & cluster resources
• Common compression codec (Snappy)
• Common file format (Avro)
• REST over HTTP access to HDFS
• Manage limitless file counts with NameNode federation
• Web shell for Pig, HBase and Flume
• Slot-less resource manager
• Faster, more usable user web access to Hadoop systems
• Support for concurrent queries over Hive
• 100% gain in filesystem I/O performance
• 100% speedup in HBase random reads
• 200% improvement in Flume data ingest rate
• 30% faster MapReduce shuffle
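As an illustration of the "REST over HTTP access to HDFS" item, here is a minimal sketch of how a WebHDFS request URL is put together. The `/webhdfs/v1/<path>?op=<OPERATION>` layout follows the WebHDFS REST API; the host name `namenode.example.com`, the helper name `webhdfs_url`, and the port 50070 default are assumptions for this example, so check them against your own cluster. The sketch only builds URLs and does not contact a cluster.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, path, op, port=50070, **params):
    """Build a WebHDFS URL: http://host:port/webhdfs/v1/<path>?op=<OP>&...

    `webhdfs_url` is a hypothetical helper; port 50070 is an assumed
    NameNode HTTP port and may differ on your cluster.
    """
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute")
    # The operation name goes first in the query string, followed by any
    # operation-specific parameters (for example offset and length for OPEN).
    query = urlencode({"op": op.upper(), **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


if __name__ == "__main__":
    # List a directory.
    print(webhdfs_url("namenode.example.com", "/user/demo", "liststatus"))
    # Read the first kilobyte of a file.
    print(webhdfs_url("namenode.example.com", "/user/demo/part-00000",
                      "open", offset=0, length=1024))
```

Any HTTP client (curl, a browser, or a monitoring tool) can then issue a GET against the resulting URL, which is what makes this access path easy to integrate with non-Java systems.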
BUT HOW DO I MANAGE THIS? CM4
Cloudera Enterprise is the easiest Hadoop solution to deploy and manage.
3-Step HA Configuration
Function: Enable high availability for the NameNode in 3 steps
Benefit: Guided setup reduces more than a dozen manual procedures to 3 simple steps
Backward Compatible
Function: Support CDH3 and CDH4 clusters with Cloudera Manager 4
Benefit: Flexibility in management
Multi-Cluster Management
Function: Manage multiple clusters from a single instance of Cloudera Manager
Benefit: Central administration for your entire CDH environment
Cloudera Manager provides rich visualizations and sophisticated automation to reliably manage large-scale clusters.
Heatmaps
Function: Visualizes health status and metrics across the cluster
Benefit: Quickly identify problem nodes within large clusters and take action
Federated NameNode Management
Function: Configure and manage NameNode federation
Benefit: Simplifies the process of growing CDH to billions of files across thousands of nodes
Cloudera Enterprise fits seamlessly into existing infrastructures and processes.
Cloudera Manager API
Function: Provides application calls for all features in Cloudera Manager
Benefit: Easily integrate Cloudera Manager with your existing enterprise-wide management and monitoring tools
Broader Support & Packaging
Function: Cloudera Manager packages for Debian and Ubuntu; support for Oracle 11g and PostgreSQL as backend databases
Benefit: Increased flexibility in deployment
LDAP Authentication
Function: Authenticate administrator logins against Active Directory
Benefit: Single set of login credentials for administrators
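To make the Cloudera Manager API item more concrete, here is a minimal sketch of how an external monitoring tool might prepare an authenticated call. It assumes the API is exposed as REST over HTTP with basic authentication under a path like `/api/v1/clusters` on port 7180; the path, port, host name `cm.example.com`, credentials, and the helper name `cm_request` are all assumptions for illustration, so consult the API documentation for your Cloudera Manager version. The sketch builds the request object but does not send it.

```python
import base64
import urllib.request


def cm_request(host, resource, user, password, port=7180, api_version="v1"):
    """Prepare (but do not send) a GET request for a Cloudera Manager resource.

    Hypothetical helper: the URL layout and default port are assumptions.
    """
    url = f"http://{host}:{port}/api/{api_version}/{resource.lstrip('/')}"
    # HTTP basic authentication: base64("user:password").
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        headers={
            "Authorization": f"Basic {token}",
            "Accept": "application/json",
        },
    )


if __name__ == "__main__":
    req = cm_request("cm.example.com", "clusters", "admin", "admin")
    print(req.get_method(), req.full_url)
    # To actually issue the call against a live server:
    #   with urllib.request.urlopen(req) as resp:
    #       print(resp.read().decode("utf-8"))
```

Because the call is plain HTTP, the same pattern fits into existing enterprise monitoring scripts without a Hadoop client installed.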
• Syncing of client configurations
• More granular host monitoring
• Simpler interface for alert management
• Improved quota management
• Improved logic for auto configuration and validation checks
• New health checks for various services
• Manage and configure multiple processing frameworks
• Better support integration
• Support for webhdfs/httpfs
CLOUDERA USE CASES
• Customer Risk Analysis
• Surveillance and Fraud Detection
• Central Data Repository
• Personalization and Asset Management
• Market Risk Modeling
• Trade Performance Analytics

• Genomics
• Utilities and Power Grid
• Smart Meters
• Biodiversity Indexing
• Preventing Network Failures
• Seismic Data

• Customer Churn Analysis
• Brand and Sentiment Analysis
• POS Transaction Analysis
• Pricing Models
• Customer Loyalty
• Targeted Offers

• Online Media
• Mobile
• Online Gaming
• Search Quality
• Recommendations
• Influence
Financial data is messy due to many interacting systems
Personal data is obfuscated for security and records get out of sync
Trades need to be “sessionized” into accounts and products
Discrepancies are difficult to reconcile, need to track corrections

Hadoop is a centralized platform for data collection
Single source for data, processing happens on the platform
Metadata used to track information lifecycle
Workflows run and monitor data transformation pipelines

Data served via APIs or in batch
Single version of the truth, data processed and cleansed centrally
Clear audit trail of data dependencies and usage
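The "sessionized" step in the financial use case above can be sketched in a few lines. This is an illustrative, single-machine version under stated assumptions: trades are dictionaries with hypothetical `account`, `product`, and `ts` fields, and a new session starts when more than 30 minutes pass between consecutive trades for the same account and product. The slides do not describe the actual pipeline, and on a cluster the same grouping would run as a MapReduce, Pig, or Hive job.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def sessionize(trades, gap=timedelta(minutes=30)):
    """Group trades by (account, product) and split each group into sessions.

    A new session starts whenever the time since the previous trade in the
    same group exceeds `gap`. Field names and the gap are assumptions.
    """
    by_key = defaultdict(list)
    for trade in trades:
        by_key[(trade["account"], trade["product"])].append(trade)

    sessions = []
    for key, group in by_key.items():
        group.sort(key=lambda t: t["ts"])  # order trades in time
        current = [group[0]]
        for trade in group[1:]:
            if trade["ts"] - current[-1]["ts"] > gap:
                sessions.append((key, current))  # close the session
                current = [trade]
            else:
                current.append(trade)
        sessions.append((key, current))
    return sessions


if __name__ == "__main__":
    day = datetime(2012, 6, 5)
    trades = [
        {"account": "A1", "product": "FX", "ts": day.replace(hour=9, minute=0)},
        {"account": "A2", "product": "EQ", "ts": day.replace(hour=9, minute=5)},
        {"account": "A1", "product": "FX", "ts": day.replace(hour=9, minute=10)},
        {"account": "A1", "product": "FX", "ts": day.replace(hour=11, minute=0)},
    ]
    for key, session in sessionize(trades):
        print(key, [t["ts"].strftime("%H:%M") for t in session])
```

Grouping by key first and sorting within each group mirrors the shuffle-and-sort phase of MapReduce, which is why this kind of job maps naturally onto Hadoop.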
Copyright 2010 Cloudera Inc. All rights reserved
Power grid is aging and maintained incrementally
Failures hard to predict and can have cascading effects
Looking at vibration of transformers over time to find patterns

Predicting failure of grid equipment
Supervised learning to scan time series data for fuzzy patterns
Identify likely faulting equipment for targeted replacement

Hadoop-based tools to model equipment behavior
openPDC project: http://openpdc.codeplex.com
Lumberyard - indexing time series data for low latency fuzzy queries
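To give a feel for scanning a vibration time series for unusual readings, here is a deliberately simple sketch. Note that it is not the supervised-learning approach named in the slides: it substitutes a rolling-baseline z-score check, which needs no labeled training data. The function name, window size, threshold, and sample readings are all assumptions for illustration.

```python
from statistics import mean, pstdev


def flag_deviations(readings, window=5, threshold=3.0):
    """Return indexes of readings that deviate strongly from the recent baseline.

    Each reading is compared with the mean and standard deviation of the
    `window` readings before it. This is a simple stand-in technique, not
    the supervised model described in the slides.
    """
    flagged = []
    for i in range(window, len(readings)):
        baseline = readings[i - window:i]
        mu = mean(baseline)
        sigma = pstdev(baseline)
        if sigma == 0:
            # Flat baseline: any change at all is a deviation.
            if readings[i] != mu:
                flagged.append(i)
            continue
        if abs(readings[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    # Hypothetical transformer vibration amplitudes, one spike at index 6.
    vibration = [1.0, 1.1, 0.9, 1.0, 1.1, 1.0, 4.0, 1.0]
    print(flag_deviations(vibration))
```

A check like this is cheap enough to run per device across a large fleet, leaving the heavier pattern-matching models for the equipment it singles out.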
OPERATIONAL ROADMAP: ARCHITECTING A CENTER OF EXCELLENCE
• Pace of Business Offers Little Room for Risk
• Disruptive Technologies Require New Skills
• New Processes are Error Prone
A Center of Excellence (COE) is where organizations:
• Identify Technologies
• Learn New Skills
• Develop and Test Processes to Lower Risk
COE activities (diagram labels): Learn, Develop, Operate, Deploy, Publish, Research
An integrated, turn-key system for Big Data management
Continually first to market with ground-breaking features
CDH is 100% Apache licensed with no forks or proprietary underpinnings
More enterprises run and more vendors certify with Cloudera than all other solutions combined.
4th generation product with multi-year track record of predictable releases, low-risk upgrades and strong compatibility guarantees
cloudera.com
+1 (888) 789-1488
twitter.com/cloudera
facebook.com/cloudera
Thank You!
For questions, please contact:
Angus Klein: [email protected]
Linden Hillenbrand: [email protected]
HUGs All Around
3 Years of Steady Innovation.

Q3 2009
• The first commercial Apache Hadoop distro
• HDFS and MapReduce for storage and computation
• Hive and Pig for data processing/analysis
• Packaged for Red Hat Enterprise Linux
• Amazon EC2 support

Q1 2010
• Improved stability and performance
• Added packages for CentOS and Ubuntu
• Rackspace and SoftLayer support

Q2 2011
• Added HBase for real-time read/write access
• Added ZooKeeper and Oozie for coordination
• Added Flume and Sqoop for data integration
• Added Hue for web-based access
• Integrated authentication throughout the stack

Q3 2011
• Apache-compatible licensed compression
• Web shell for Hue
• Flume/HBase integration
• HBase bulk loading

Q4 2011
• Added Mahout for machine learning
• Improved Avro file format support in Sqoop, Flume, MapReduce, Pig, Hive and Hue

Q1 2012
• Performance improvements to HBase and HDFS
• Faster HBase recoverability from region failure
• Improved HDD failure handling

AVAILABLE NOW
• NameNode high availability
• HBase table and column permissions for increased security
• HBase co-processors for real-time applications
• Support for multiple data processing frameworks (MR1 and MR2)
Installs the complete Hadoop stack in minutes via a wizard-based interface
Gives you complete, end-to-end visibility and control over your Hadoop cluster from a single interface
Allows you to manage multiple clusters from a single instance of Cloudera Manager
Integrate Cloudera Manager with Active Directory
Establishes the time context globally for almost all views
Correlates jobs, activities, logs, system changes, configuration changes and service metrics along a single timeline to simplify diagnosis
Set server roles, configure services and manage security across the cluster
Gracefully start, stop and restart services as needed
Supports Administrator and Read-Only users
Maintains a complete record of configuration changes with the ability to roll back to previous states
Monitors dozens of service performance metrics and alerts you when you approach critical thresholds
Gather, view and search Hadoop logs collected from across the cluster
Scans Hadoop logs for irregularities and warns you before they impact the cluster
Creates and aggregates relevant Hadoop events pertaining to system health, log messages, user services and activities, and makes them available for alerting and searching
Generates email alerts when certain events occur
Consolidates all cluster activity into a single, real-time view
View information pertaining to hosts in your cluster including status, resident memory, virtual memory and roles
Visualize health status and metrics across the cluster to quickly identify problem nodes and take action
Visualize current and historical disk usage by user, group and directory
Track MapReduce activity on the cluster by job or user
Takes a snapshot of the cluster state and automatically sends it to Cloudera support to assist with resolution
Easily integrate Cloudera Manager with your existing enterprise-wide management and monitoring tools