Savanna: Hadoop on OpenStack
DESCRIPTION
More details about the project and its current state can be found at http://savanna.readthedocs.org
TRANSCRIPT
Savanna - Hadoop on OpenStack
Mirantis, 2013
Ilya Elterman
Dmitry Mescheryakov
Agenda
● Savanna Overview
● Roadmap
● Phase 1 Live Demo
● Phase 2 Features and Architecture
Savanna - Elastic Hadoop on OpenStack
The goal is to create a native OpenStack component that provisions and operates Hadoop clusters on top of OpenStack. Key characteristics:
● Open source
● Native for OpenStack
● Support for different Hadoop distributions
● Covers both the bare cluster provisioning use case and "analytics as a service"
Savanna Architecture Principles
● Designed as an OpenStack component
● Managed through a REST API, with a UI available as part of Horizon
● Pluggable system of Hadoop installation engines
● Integration with Hadoop vendor-specific management tools
● Predefined templates of Hadoop configurations, with the ability to modify parameters
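The "predefined templates with the ability to modify parameters" principle can be pictured as a base configuration merged with user overrides. The sketch below is illustrative only: the template name, its layout, and `apply_overrides` are hypothetical and are not taken from Savanna's code. `dfs.replication` and `mapred.reduce.tasks` are ordinary Hadoop 1.x properties used as sample values.

```python
import copy

# Hypothetical predefined template; name and values are illustrative only.
BASE_TEMPLATE = {
    "name": "small-hadoop",
    "hadoop_params": {"dfs.replication": 3, "mapred.reduce.tasks": 2},
}


def apply_overrides(template, overrides):
    """Return a copy of the template with user-supplied parameters applied."""
    unknown = set(overrides) - set(template["hadoop_params"])
    if unknown:
        raise ValueError("unknown parameters: %s" % sorted(unknown))
    result = copy.deepcopy(template)  # never mutate the shared template
    result["hadoop_params"].update(overrides)
    return result


custom = apply_overrides(BASE_TEMPLATE, {"dfs.replication": 2})
print(custom["hadoop_params"])
```

Copying the template before applying overrides keeps the predefined version intact for the next user.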
Use Cases
● Administrators - centralized cluster management and monitoring
● Dev and QA teams - fast cluster provisioning
● Data Scientists/Analysts - API to run analytic jobs, with infrastructure provisioning happening under the hood
● Making resources dedicated to the IaaS cloud available for Hadoop workloads
Administrators Use Case
● Central point of control over infrastructure
● Enables self-service capabilities, including the choice of Hadoop distribution to use
● Integration with vendor tooling
○ Ambari for Apache/Hortonworks
○ Cloudera Management Console
● Utilization of free IaaS capacity for Hadoop tasks
Dev and QA Use Cases
● Fast on-demand provisioning of environments
● Increased agility and speed of innovation
● Controlled access to data from production
Analytics Use Cases
● Simplified task execution - the complexity of provisioning and managing a cluster is hidden under the hood
○ Access to higher-level interfaces (e.g. Pig, Hive)
● Bursty workloads: ad-hoc queries that need significant resources for only a short time
● Utilization of free IaaS capacity for Hadoop tasks
Roadmap for Hadoop in Cloud
Phase 1: Basic cluster provisioning
Phase 2: Cluster operation support and integration with tooling
Phase 3: "Analytics as a service": job execution framework, support for different scripting languages
Phase 1 - Basic Cluster Operation
● Cluster provisioning
● Deployment Engine implementation for pre-installed images
● Templates for Hadoop cluster configuration
● REST API for cluster startup and operations
● UI integrated into Horizon
Phase 1 - Current Status
● All code and documentation open sourced
● Phase 1 completed, v0.1 released on 04/10
● Launchpad home page
○ https://launchpad.net/savanna
● Code on stackforge
○ Integrated with OpenStack CI/CD
○ https://github.com/stackforge/savanna
● New contributors: Red Hat and Hortonworks
Phase 2 - Advanced Configuration
● Hadoop cluster configuration support:
○ Solutions for the HDFS data reliability issue
○ Configurable DN storage location
○ Configurable topology of DN, NN, TT, JT
○ Add/remove nodes
○ More Hadoop parameters
● Integration with vendor deployment/management tooling
● Basic monitoring support
Phase 3 - Analytics as a Service
● API to execute Map/Reduce jobs without exposing details of underlying infrastructure (similar to AWS EMR)
● User-friendly UI for ad-hoc analytics queries based on Hive or Pig
Further Roadmap
● Autoscaling
● HBase support
● HA for NameNode
● HDFS and Swift integration
○ Caching of Swift data on HDFS
● Mahout as a service
● Integration with logging and error handling
How to Contribute
● Download and install Savanna
● Provide feedback and report bugs
● Share more ideas via IRC sessions or the mailing list
More details: https://wiki.openstack.org/wiki/Savanna/HowToParticipate
Architecture Overview
[Diagram components: Savanna Python Client; REST API; Cluster Configuration Manager; Horizon; Savanna Pages; Keystone; Auth; DAL; Nova; Glance; Swift; Provisioning Plugin; VM Manager; Image Registry; four Hadoop VMs]
Extensible Provisioning
Plugin
● get extra configs
● validate input
● launch/terminate cluster
● add/remove nodes
Savanna
● VM manager
○ launch/terminate VMs
○ get VM status
○ ssh/scp to VM
● Image registry
○ register image in Savanna
○ add/remove tags
○ get image by tag
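The plugin side of this slide can be read as a small contract that each Hadoop installation engine implements. The following is a minimal sketch under that reading: the class and method names paraphrase the slide's bullet points and are hypothetical, not Savanna's actual plugin API.

```python
from abc import ABC, abstractmethod


class ProvisioningPlugin(ABC):
    """Hypothetical plugin contract; method names paraphrase the slide."""

    @abstractmethod
    def get_extra_configs(self):
        """Return names of plugin-specific settings a user may supply."""

    @abstractmethod
    def validate_input(self, cluster):
        """Raise ValueError if the cluster description is not acceptable."""

    @abstractmethod
    def launch_cluster(self, cluster):
        """Provision the cluster."""

    @abstractmethod
    def terminate_cluster(self, cluster):
        """Tear the cluster down."""

    @abstractmethod
    def add_nodes(self, cluster, count):
        """Grow the cluster by `count` nodes."""

    @abstractmethod
    def remove_nodes(self, cluster, count):
        """Shrink the cluster by `count` nodes."""


class InMemoryPlugin(ProvisioningPlugin):
    """Toy implementation that edits a dict instead of touching real VMs."""

    def get_extra_configs(self):
        return ["hadoop_version"]

    def validate_input(self, cluster):
        if cluster.get("nodes", 0) < 1:
            raise ValueError("a cluster needs at least one node")

    def launch_cluster(self, cluster):
        self.validate_input(cluster)
        cluster["status"] = "active"

    def terminate_cluster(self, cluster):
        cluster["status"] = "terminated"

    def add_nodes(self, cluster, count):
        cluster["nodes"] += count

    def remove_nodes(self, cluster, count):
        if count >= cluster["nodes"]:
            raise ValueError("cannot remove every node")
        cluster["nodes"] -= count


cluster = {"nodes": 3}
plugin = InMemoryPlugin()
plugin.launch_cluster(cluster)
plugin.add_nodes(cluster, 2)
print(cluster)
```

Keeping the contract abstract is what lets different Hadoop distributions ship their own engines behind the same calls.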
Provisioning Interaction
[Sequence diagram. Participants: User, Savanna, Plugin. Calls shown: launch cluster; get extra parameters for the plugin; validate cluster parameters; add/remove nodes]
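The calls named on this slide can be sketched as a single request handler on the Savanna side. The ordering used here (fetch the plugin's extra parameters, validate, then launch) is an assumption, since the arrows of the original diagram did not survive in the transcript. `RecordingPlugin`, `handle_launch_request`, and the sample parameter values are hypothetical.

```python
class RecordingPlugin:
    """Stand-in plugin that records the order in which it is called."""

    def __init__(self):
        self.calls = []

    def get_extra_parameters(self):
        self.calls.append("get_extra_parameters")
        return ["hadoop_version"]

    def validate(self, params):
        self.calls.append("validate")
        if params.get("node_count", 0) < 1:
            raise ValueError("node_count must be at least 1")

    def launch_cluster(self, params):
        self.calls.append("launch_cluster")
        return {"status": "launching", "nodes": params["node_count"]}


def handle_launch_request(plugin, params):
    """Assumed flow: ask for extra parameters, validate, then launch."""
    required = plugin.get_extra_parameters()
    missing = [name for name in required if name not in params]
    if missing:
        raise ValueError("missing plugin parameters: %s" % ", ".join(missing))
    plugin.validate(params)
    return plugin.launch_cluster(params)


plugin = RecordingPlugin()
result = handle_launch_request(
    plugin, {"node_count": 4, "hadoop_version": "1.1.2"}
)
print(plugin.calls)
print(result)
```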
Provisioning: Launching a Cluster
[Diagram. Participants: Plugin, Image Registry, VM Manager, four Hadoop VMs. Steps shown: get image by tag; launch VMs; install and configure Hadoop; pass commands via ssh, scp]
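The launch steps on this slide (look up an image by tag, start VMs, then configure Hadoop over ssh) can be sketched with in-memory stand-ins. Everything below is hypothetical: the class names, the `configure-hadoop` command string, and the VM naming are placeholders, and no real Nova, Glance, or ssh calls are made.

```python
class ImageRegistry:
    """In-memory stand-in for a tag-based image registry."""

    def __init__(self):
        self._tags = {}  # image id -> set of tags

    def register_image(self, image_id):
        self._tags.setdefault(image_id, set())

    def add_tag(self, image_id, tag):
        self._tags[image_id].add(tag)

    def remove_tag(self, image_id, tag):
        self._tags[image_id].discard(tag)

    def get_image_by_tag(self, tag):
        for image_id, tags in self._tags.items():
            if tag in tags:
                return image_id
        raise LookupError("no image tagged %r" % tag)


class FakeVMManager:
    """Records requests instead of launching VMs or opening ssh sessions."""

    def __init__(self):
        self.vms = []
        self.commands = []

    def launch_vms(self, image_id, count):
        start = len(self.vms)
        new_vms = ["vm-%d" % (start + i) for i in range(count)]
        self.vms.extend(new_vms)
        return new_vms

    def run_command(self, vm, command):
        # A real VM manager would use ssh/scp here; this only records the call.
        self.commands.append((vm, command))


def launch_hadoop_cluster(registry, vm_manager, tag, node_count):
    image_id = registry.get_image_by_tag(tag)
    vms = vm_manager.launch_vms(image_id, node_count)
    for vm in vms:
        vm_manager.run_command(vm, "configure-hadoop")  # placeholder command
    return vms


registry = ImageRegistry()
registry.register_image("image-1")
registry.add_tag("image-1", "hadoop")
manager = FakeVMManager()
vms = launch_hadoop_cluster(registry, manager, "hadoop", 3)
print(vms)
```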
Q&A
HDFS Reliability: the issue
[Diagram: two Compute hosts, six DN VMs, one Data Block]
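One common reading of this slide, assumed here because the diagram did not survive extraction, is that several DataNode VMs can share one physical compute host, so replicas that HDFS places on distinct DataNodes may still sit on the same hardware. The sketch below checks that condition. The host and DataNode names are hypothetical.

```python
def survives_host_failure(replica_dns, dn_to_host, failed_host):
    """True if at least one replica lives on a host other than the failed one."""
    return any(dn_to_host[dn] != failed_host for dn in replica_dns)


# Six DataNode VMs spread over two physical hosts (hypothetical names).
dn_to_host = {
    "dn1": "compute-1", "dn2": "compute-1", "dn3": "compute-1",
    "dn4": "compute-2", "dn5": "compute-2", "dn6": "compute-2",
}

# Three replicas on three distinct DataNodes, all on the same host.
replicas = ["dn1", "dn2", "dn3"]

print(survives_host_failure(replicas, dn_to_host, "compute-1"))  # False
print(survives_host_failure(replicas, dn_to_host, "compute-2"))  # True
```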
HDFS Reliability: single DN per host
[Diagram: Compute hosts with DN and TT | DN VMs, grouped into Cluster A and Cluster B]
HDFS Reliability: HADOOP-8468, hypervisor awareness for the HDFS scheduler
[Diagram: three Compute hosts running DN VMs; HDFS Data Block]
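The idea of hypervisor awareness can be illustrated with a simplified greedy placement that spreads replicas across physical hosts before reusing one. This is not the algorithm implemented in HADOOP-8468, only a sketch of the goal. The function name and the host and DataNode names are hypothetical.

```python
def place_replicas(dn_to_host, replication=3):
    """Pick DataNodes for one block, preferring distinct physical hosts."""
    chosen, used_hosts = [], set()
    # First pass: at most one replica per physical host.
    for dn in sorted(dn_to_host):
        if len(chosen) == replication:
            break
        host = dn_to_host[dn]
        if host not in used_hosts:
            chosen.append(dn)
            used_hosts.add(host)
    # Second pass: if hosts ran out, fill up from the remaining DataNodes.
    for dn in sorted(dn_to_host):
        if len(chosen) == replication:
            break
        if dn not in chosen:
            chosen.append(dn)
    return chosen


dn_to_host = {
    "dn1": "compute-1", "dn2": "compute-1", "dn3": "compute-1",
    "dn4": "compute-2", "dn5": "compute-2", "dn6": "compute-2",
}
placement = place_replicas(dn_to_host)
print(placement)  # ['dn1', 'dn4', 'dn2']
print({dn_to_host[dn] for dn in placement})
```

With only two hosts and three replicas, one host necessarily holds two copies, but the block still survives the loss of either host.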
HDFS Reliability: HADOOP-8545, enables Swift for Hadoop
[Diagram: Swift, HDFS, and Hadoop Jobs #1, #2, ... #N; arrows labeled "initial input" and "final output"]
HDFS Placement Options
● Ephemeral drive: /var/lib/nova/instances/instance-xxx/disk -> /mnt/ephemeral
● Block storage volume: Cinder Volume -> /mnt/volume
● Bare drive support: /dev/sdb -> /mnt/sdb
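A configurable DataNode storage location could translate the chosen placement options into the Hadoop 1.x `dfs.data.dir` property, which takes a comma-separated list of directories. The sketch below assumes that mapping. The mount points come from the slide, while the option keys, the `hdfs/data` subdirectory, and the function name are hypothetical.

```python
# Mount points as listed on the slide; the option keys are hypothetical.
MOUNT_POINTS = {
    "ephemeral": "/mnt/ephemeral",
    "volume": "/mnt/volume",
    "bare": "/mnt/sdb",
}


def datanode_data_dirs(options, subdir="hdfs/data"):
    """Build a dfs.data.dir setting from the selected placement options."""
    if not options:
        raise ValueError("at least one placement option is required")
    dirs = ["%s/%s" % (MOUNT_POINTS[option], subdir) for option in options]
    return {"dfs.data.dir": ",".join(dirs)}


print(datanode_data_dirs(["ephemeral", "volume"]))
```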
Configurable topology of DN, NN, TT, JT
● Master node(s): JT | NN, JT, NN+
● Worker nodes: TT, TT | DN, DN
[Diagram: node counts shown: 10, 6, 8]
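A configurable topology can be described as groups of nodes, each running a set of Hadoop processes, and checked before launch. The sketch below is hypothetical throughout: the rule of exactly one NN and one JT assumes a non-HA Hadoop 1.x cluster, and the assignment of the slide's counts (10, 6, 8) to particular worker groups is a guess.

```python
KNOWN_PROCESSES = ("NN", "JT", "DN", "TT")


def summarize_topology(node_groups):
    """Count processes across groups and report rule violations.

    node_groups is a list of (processes, node_count) pairs.
    """
    totals = dict.fromkeys(KNOWN_PROCESSES, 0)
    for processes, count in node_groups:
        for process in processes:
            if process not in totals:
                raise ValueError("unknown process: %s" % process)
            totals[process] += count
    problems = []
    # Assumption: non-HA Hadoop 1.x, so exactly one NameNode and JobTracker.
    for master in ("NN", "JT"):
        if totals[master] != 1:
            problems.append(
                "expected exactly one %s, found %d" % (master, totals[master])
            )
    for worker in ("DN", "TT"):
        if totals[worker] < 1:
            problems.append("expected at least one %s" % worker)
    return totals, problems


groups = [
    (("JT", "NN"), 1),   # combined master node
    (("TT",), 10),       # compute-only workers
    (("TT", "DN"), 6),   # combined workers
    (("DN",), 8),        # storage-only workers
]
totals, problems = summarize_topology(groups)
print(totals)
print(problems)
```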