Hadoop and OpenStack - Hadoop Summit San Jose 2014
DESCRIPTION
Merging the insightful power of Hadoop with the management capabilities of OpenStack via Sahara
TRANSCRIPT
![Page 1: Hadoop and OpenStack - Hadoop Summit San Jose 2014](https://reader034.vdocuments.net/reader034/viewer/2022042714/554f58dfb4c905524c8b5384/html5/thumbnails/1.jpg)
Hadoop and OpenStack
Matthew Farrellee, @spinningmatt, Red Hat
Sumit Mohanty, @smohanty, Hortonworks
![Page 2: Hadoop and OpenStack - Hadoop Summit San Jose 2014](https://reader034.vdocuments.net/reader034/viewer/2022042714/554f58dfb4c905524c8b5384/html5/thumbnails/2.jpg)
What is OpenStack?
![Page 3: Hadoop and OpenStack - Hadoop Summit San Jose 2014](https://reader034.vdocuments.net/reader034/viewer/2022042714/554f58dfb4c905524c8b5384/html5/thumbnails/3.jpg)
OpenStack is a cloud operating system that controls large pools of compute, storage, and networking resources throughout a datacenter, all managed through a dashboard that gives administrators control while empowering their users to provision resources through a web interface.
![Page 4: Hadoop and OpenStack - Hadoop Summit San Jose 2014](https://reader034.vdocuments.net/reader034/viewer/2022042714/554f58dfb4c905524c8b5384/html5/thumbnails/4.jpg)
An ecosystem of projects
● Compute - Nova
● Networking - Neutron
● Object Storage - Swift
● Block Storage - Cinder
● Identity - Keystone
● Image Service - Glance
● Dashboard - Horizon
● Telemetry - Ceilometer
● Orchestration - Heat
● Data Processing - Sahara
![Page 5: Hadoop and OpenStack - Hadoop Summit San Jose 2014](https://reader034.vdocuments.net/reader034/viewer/2022042714/554f58dfb4c905524c8b5384/html5/thumbnails/5.jpg)
Sahara is combining use cases
![Page 6: Hadoop and OpenStack - Hadoop Summit San Jose 2014](https://reader034.vdocuments.net/reader034/viewer/2022042714/554f58dfb4c905524c8b5384/html5/thumbnails/6.jpg)
Trends
[Chart: search-interest trends for Hadoop, EC2, and OpenStack]
Source: www.google.com/trends/explore#q=hadoop,ec2,openstack
EC2 beta: Aug 25 2006 (http://aws.typepad.com/aws/2006/08/amazon_ec2_beta.html)
![Page 7: Hadoop and OpenStack - Hadoop Summit San Jose 2014](https://reader034.vdocuments.net/reader034/viewer/2022042714/554f58dfb4c905524c8b5384/html5/thumbnails/7.jpg)
Data analysis is hard
![Page 8: Hadoop and OpenStack - Hadoop Summit San Jose 2014](https://reader034.vdocuments.net/reader034/viewer/2022042714/554f58dfb4c905524c8b5384/html5/thumbnails/8.jpg)
Data analysis is hard...
● Come up with a relevant question
  ○ The question you answer won’t be the question you set out to ask
  ○ Mine: Can I predict a doctor’s specialty from the procedures they perform?
● Find the data
  ○ Tons of it, with little consistency, unknown origin, hoarded
  ○ Data without a dictionary is worse than code without comments. Run away!
![Page 9: Hadoop and OpenStack - Hadoop Summit San Jose 2014](https://reader034.vdocuments.net/reader034/viewer/2022042714/554f58dfb4c905524c8b5384/html5/thumbnails/9.jpg)
Data analysis is hard...
● Data usability
  ○ Acceptable license? (Even for Gov’t sets)
    ■ Mine: Metadata copyrighted by AMA!
  ○ Private data is often highly protected, with no or narrow DMZ
● Explore and clean
  ○ Two of the oldest people in the medical profession working with Medicare
  ○ Stephen Glasser graduated in 1773
  ○ Cheryl Palma graduated in 1776
![Page 10: Hadoop and OpenStack - Hadoop Summit San Jose 2014](https://reader034.vdocuments.net/reader034/viewer/2022042714/554f58dfb4c905524c8b5384/html5/thumbnails/10.jpg)
Data analysis is hard...
● You got some answer to a question you approximately asked
● You must refine the question and process
● Repeat
This is hard enough without having to manage tools and infrastructure!
![Page 11: Hadoop and OpenStack - Hadoop Summit San Jose 2014](https://reader034.vdocuments.net/reader034/viewer/2022042714/554f58dfb4c905524c8b5384/html5/thumbnails/11.jpg)
Sahara’s goal
Make managing Hadoop+ infrastructure and tools so simple that they get out of your way
![Page 12: Hadoop and OpenStack - Hadoop Summit San Jose 2014](https://reader034.vdocuments.net/reader034/viewer/2022042714/554f58dfb4c905524c8b5384/html5/thumbnails/12.jpg)
Sahara provides
● Apache Hadoop cluster and workload management
  ○ Cluster - construct and manage the lifecycle of a Hadoop cluster
  ○ Workload - workflow for big data processing with Hadoop (AWS EMR-like)
● Through a Python library, REST API, Web UI, command line interface
![Page 13: Hadoop and OpenStack - Hadoop Summit San Jose 2014](https://reader034.vdocuments.net/reader034/viewer/2022042714/554f58dfb4c905524c8b5384/html5/thumbnails/13.jpg)
Sahara’s architecture
[Architecture diagram. Labels recovered from the figure:]
● Sahara Service: REST API, Auth, Cluster Configuration Manager, Resources Orchestration Manager, Job Manager, Data Access Layer, Vendors Plugins, Data Sources, Job Sources
● Clients: Horizon (Sahara Pages), Sahara Python Client
● OpenStack services: Keystone, Heat, Nova, Glance, Cinder, Neutron, Swift, Trove DB
● Hadoop VMs (four shown)
![Page 14: Hadoop and OpenStack - Hadoop Summit San Jose 2014](https://reader034.vdocuments.net/reader034/viewer/2022042714/554f58dfb4c905524c8b5384/html5/thumbnails/14.jpg)
Sahara’s features
● Plugin mechanism - distro choice
● Cluster scaling - elasticity
● Swift integration - data storage
● Cinder integration - persistent HDFS
● Network management with Nova and Neutron
● Anti-affinity, separate services on physical hardware
● Data locality with Swift
● Repeatable cluster creation w/ template mechanism
● http://docs.openstack.org/developer/sahara/userdoc/features.html
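Two of the features listed above, repeatable cluster creation from templates and anti-affinity, can be illustrated with a small Python model. This is only a conceptual sketch: `NodeGroup`, `expand`, `violates_anti_affinity`, the instance naming scheme, and the host names are all hypothetical and are not Sahara's API or data model.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class NodeGroup:
    """Hypothetical stand-in for one node group entry in a cluster template."""
    name: str
    processes: Tuple[str, ...]
    count: int


def expand(cluster_name: str, groups: List[NodeGroup]) -> Dict[str, Tuple[str, ...]]:
    """Expand a template into instance names mapped to the processes they run."""
    instances = {}
    for group in groups:
        for index in range(1, group.count + 1):
            # Hypothetical naming scheme: <cluster>-<group>-<3-digit index>.
            instances["%s-%s-%03d" % (cluster_name, group.name, index)] = group.processes
    return instances


def violates_anti_affinity(instances: Dict[str, Tuple[str, ...]],
                           placement: Dict[str, str],
                           process: str) -> bool:
    """Return True if two instances running `process` share a physical host."""
    hosts = [placement[name] for name, procs in instances.items() if process in procs]
    return len(hosts) != len(set(hosts))


TEMPLATE = [
    NodeGroup("master", ("namenode", "resourcemanager"), 1),
    NodeGroup("worker", ("datanode", "nodemanager"), 3),
]

# The same template always expands to the same layout, which is what makes
# cluster creation repeatable in this model.
first = expand("demo", TEMPLATE)
second = expand("demo", TEMPLATE)

# Hypothetical placement of instances on physical hosts.
placement = {
    "demo-master-001": "host-a",
    "demo-worker-001": "host-a",
    "demo-worker-002": "host-b",
    "demo-worker-003": "host-b",
}
print(first == second)
print(violates_anti_affinity(first, placement, "datanode"))
```

In this model, the placement shown puts two DataNode instances on `host-b`, so the anti-affinity check flags it; spreading the workers over three distinct hosts would clear it.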
![Page 15: Hadoop and OpenStack - Hadoop Summit San Jose 2014](https://reader034.vdocuments.net/reader034/viewer/2022042714/554f58dfb4c905524c8b5384/html5/thumbnails/15.jpg)
Storage considerations
● Swift
  ○ Input/output through Swift HCFS plugin
  ○ Intermediate data stored in HDFS on cluster
  ○ Locality when co-locating swift & nova-compute
● HDFS
  ○ Local (long lived cluster) and remote (copy in)
● HDFS backed by ephemeral disk or Cinder
  ○ Ephemeral - /var/lib/nova/instances on compute host
  ○ Cinder - persistent block devices attached to instances
![Page 16: Hadoop and OpenStack - Hadoop Summit San Jose 2014](https://reader034.vdocuments.net/reader034/viewer/2022042714/554f58dfb4c905524c8b5384/html5/thumbnails/16.jpg)
Sahara’s plugin architecture
● This is important!
● It’s where Hadoop distribution vendors integrate their management software
● It’s how users pick different software versions
● Currently: Vanilla (reference impl. w/ Apache versions), HDP (via Ambari), IDH (via Intel Manager), and Spark (w/ minimal CDH)
![Page 17: Hadoop and OpenStack - Hadoop Summit San Jose 2014](https://reader034.vdocuments.net/reader034/viewer/2022042714/554f58dfb4c905524c8b5384/html5/thumbnails/17.jpg)
HDP Plugin Overview
● Full support for all Sahara functionality
  ○ Nova and Neutron network
  ○ Cluster Scaling
  ○ Scale Up
  ○ Swift Integration
  ○ Cinder Support
  ○ Data Locality
  ○ EDP
● Apache Ambari REST APIs used for cluster provisioning
● Monitoring/Management of clusters via Ambari
● Full support for multiple HDP stacks
● HDP pre-installed or generic VM images
![Page 18: Hadoop and OpenStack - Hadoop Summit San Jose 2014](https://reader034.vdocuments.net/reader034/viewer/2022042714/554f58dfb4c905524c8b5384/html5/thumbnails/18.jpg)
HDP Plugin Stack Support
● HDP 1.3 (Available): NameNode, Secondary NameNode, DataNode, HDFS, ZooKeeper, Ambari Server/Agent, HCatalog, Sqoop, Job Tracker, Task Tracker, MapReduce, Hive, MySQL, Pig, WebHCat Server, Oozie, Ganglia, Nagios, HBase
● HDP 2.0 (Available): History Server, MapReduce 2 / YARN, Resource Manager, YARN Client
● HDP 2.1 (Coming Soon!): Storm, Falcon
● HDP 2.1+ (Roadmap): SOLR, Cascading
![Page 19: Hadoop and OpenStack - Hadoop Summit San Jose 2014](https://reader034.vdocuments.net/reader034/viewer/2022042714/554f58dfb4c905524c8b5384/html5/thumbnails/19.jpg)
Ambari Blueprints
● Two primary goals of Ambari Blueprints
  ○ Ability to export a complete description of a running cluster
  ○ Provide API-based cluster installations based on a self-contained cluster description
● Blueprints contain cluster topology and configuration information
● Enable interesting use cases between physical and virtual, including OpenStack/Sahara
![Page 20: Hadoop and OpenStack - Hadoop Summit San Jose 2014](https://reader034.vdocuments.net/reader034/viewer/2022042714/554f58dfb4c905524c8b5384/html5/thumbnails/20.jpg)
Blueprint API
1. BLUEPRINT: POST /blueprints/my-blueprint
2. CLUSTER INSTANCE: POST /clusters/MyCluster
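The two-step flow above (register a blueprint, then create a cluster instance from it) can be sketched in Python. This sketch only builds the two requests and does not send them. The host `ambari.example.com`, port 8080, the `/api/v1` prefix, and the `X-Requested-By` header are assumptions about a typical Ambari deployment rather than details from the slides, and authentication is omitted. The helper names `build_post` and `blueprint_requests` are hypothetical.

```python
import json
import urllib.request

# Assumption: base URL of an Ambari server; host and port are placeholders.
AMBARI_BASE = "http://ambari.example.com:8080/api/v1"


def build_post(path, payload):
    """Build (but do not send) a JSON POST request for the given API path."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        AMBARI_BASE + path,
        data=body,
        method="POST",
        # Assumption: the server expects an X-Requested-By header on writes.
        headers={"X-Requested-By": "sketch", "Content-Type": "application/json"},
    )


def blueprint_requests(blueprint_name, blueprint, cluster_name, cluster_instance):
    """Return the two requests in the order the slide numbers them."""
    step1 = build_post("/blueprints/" + blueprint_name, blueprint)
    step2 = build_post("/clusters/" + cluster_name, cluster_instance)
    return [step1, step2]


# Placeholder bodies; a real blueprint would list host groups and components.
requests_in_order = blueprint_requests(
    "my-blueprint", {"host_groups": []},
    "MyCluster", {"blueprint": "my-blueprint", "host_groups": []},
)
for req in requests_in_order:
    print(req.get_method(), req.full_url)
```

Sending the requests would be a matter of passing each one to `urllib.request.urlopen` with suitable credentials. The order matters because the cluster instance refers to the blueprint by name.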
![Page 21: Hadoop and OpenStack - Hadoop Summit San Jose 2014](https://reader034.vdocuments.net/reader034/viewer/2022042714/554f58dfb4c905524c8b5384/html5/thumbnails/21.jpg)
Example: Single-Node Definitions
Description
• Single-node cluster
• Use HDP 2.0 Stack
• HDFS + YARN + MR2
• Everything on c6401
BLUEPRINT
{
  "configurations" : [
    { "hdfs-site" : { "dfs.namenode.name.dir" : "/hadoop/nn" } }
  ],
  "host_groups" : [
    {
      "name" : "uber-host",
      "components" : [
        { "name" : "NAMENODE" },
        { "name" : "SECONDARY_NAMENODE" },
        { "name" : "DATANODE" },
        { "name" : "HDFS_CLIENT" },
        { "name" : "RESOURCEMANAGER" },
        { "name" : "NODEMANAGER" },
        { "name" : "YARN_CLIENT" },
        { "name" : "HISTORYSERVER" },
        { "name" : "MAPREDUCE2_CLIENT" }
      ],
      "cardinality" : "1"
    }
  ],
  "Blueprints" : {
    "blueprint_name" : "single-node-hdfs-yarn",
    "stack_name" : "HDP",
    "stack_version" : "2.0"
  }
}
CLUSTER INSTANCE
{
  "blueprint" : "single-node-hdfs-yarn",
  "host_groups" : [
    {
      "name" : "uber-host",
      "hosts" : [ { "fqdn" : "c6401.ambari.apache.org" } ]
    }
  ]
}
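The relationship between the two documents above (the cluster instance names a blueprint and maps hosts onto its host groups) can be checked with a short Python sketch. The function `check_cluster_instance` is hypothetical and is not Ambari's own validation; in particular, treating a numeric `cardinality` as an exact host count is an assumption made for illustration.

```python
# The blueprint and cluster instance from the slide, as Python dicts.
BLUEPRINT = {
    "configurations": [{"hdfs-site": {"dfs.namenode.name.dir": "/hadoop/nn"}}],
    "host_groups": [{
        "name": "uber-host",
        "components": [{"name": n} for n in (
            "NAMENODE", "SECONDARY_NAMENODE", "DATANODE", "HDFS_CLIENT",
            "RESOURCEMANAGER", "NODEMANAGER", "YARN_CLIENT",
            "HISTORYSERVER", "MAPREDUCE2_CLIENT")],
        "cardinality": "1",
    }],
    "Blueprints": {"blueprint_name": "single-node-hdfs-yarn",
                   "stack_name": "HDP", "stack_version": "2.0"},
}

CLUSTER_INSTANCE = {
    "blueprint": "single-node-hdfs-yarn",
    "host_groups": [{"name": "uber-host",
                     "hosts": [{"fqdn": "c6401.ambari.apache.org"}]}],
}


def check_cluster_instance(blueprint, instance):
    """Return a list of problems; an empty list means the documents agree."""
    problems = []
    expected = blueprint["Blueprints"]["blueprint_name"]
    if instance["blueprint"] != expected:
        problems.append("instance references %r, blueprint is named %r"
                        % (instance["blueprint"], expected))

    defined = {group["name"]: group for group in blueprint["host_groups"]}
    mapped = set()
    for group in instance["host_groups"]:
        name = group["name"]
        if name not in defined:
            problems.append("host group %r is not defined in the blueprint" % name)
            continue
        mapped.add(name)
        cardinality = defined[name].get("cardinality", "")
        # Assumption: a purely numeric cardinality means an exact host count.
        if cardinality.isdigit() and len(group["hosts"]) != int(cardinality):
            problems.append("host group %r has %d host(s), expected %s"
                            % (name, len(group["hosts"]), cardinality))

    for name in defined:
        if name not in mapped:
            problems.append("host group %r has no hosts assigned" % name)
    return problems


print(check_cluster_instance(BLUEPRINT, CLUSTER_INSTANCE))
```

Running a check like this before the second POST catches mismatched names early, since the cluster instance only works if it refers to host groups the blueprint actually defines.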
![Page 22: Hadoop and OpenStack - Hadoop Summit San Jose 2014](https://reader034.vdocuments.net/reader034/viewer/2022042714/554f58dfb4c905524c8b5384/html5/thumbnails/22.jpg)
Demo - youtu.be/vmry_kXqn4c
● http://jayunit100.github.io/bigpetstore/slides
● Bigpetstore
  ○ A full stack Hadoop application
  ○ Uses the main players in the Hadoop ecosystem
  ○ To demonstrate a single domain
  ○ Just accepted into the Bigtop project!
● Come by the Red Hat booth - G18
![Page 23: Hadoop and OpenStack - Hadoop Summit San Jose 2014](https://reader034.vdocuments.net/reader034/viewer/2022042714/554f58dfb4c905524c8b5384/html5/thumbnails/23.jpg)
Q&A
● Status - Integrated for Juno (Oct 2014)
● Distro - RDO (Fedora/RHEL/CentOS), RHEL OSP 5, ...
● Home - https://launchpad.net/sahara
● Docs - http://docs.openstack.org/developer/sahara
● Code - https://github.com/openstack/ *sahara*
● Email - openstack-dev w/ [sahara]
● IRC - #openstack-sahara on freenode