HADOOP AT THE CENTER: THE NEXT GENERATION OF HADOOP
DATA MARKETING 2014 - TORONTO
Adam Muise, Principal Architect, Hortonworks
Who am I?
Who is Hortonworks?
We do Hadoop
The leaders of Hadoop’s development
Community driven, Enterprise Focused
Drive Innovation in the platform – We lead the roadmap
100% Open Source – Democratized Access to Data
We do Hadoop successfully.
• Develop Open Source Hadoop
• Distribute Hadoop with HDP
• Support
• Professional Services
• Training
Hortonworks Approach
1. Innovate the Core
Architect and build innovation at the core of Hadoop
• YARN: Data Operating System
• HDFS as the storage layer
• Key processing engines
2. Extend Hadoop as an Enterprise Data Platform
Extend Hadoop with enterprise capabilities for governance, security & operations
Apply enterprise software rigor to the open source development process
3. Enable the Ecosystem
Enable the leaders in the data center to easily adopt & extend their platforms
• Establish Hadoop as standard component of a modern data architecture
• Joint engineering
[Diagram: HDP 2.2 stack. Data Access on YARN: Data Operating System: Script (Pig), SQL (Hive/Tez, HCatalog), NoSQL (HBase, Accumulo), Stream (Storm), Memory (Spark), Batch (MapReduce). Data Management: HDFS (Hadoop Distributed File System). Alongside: Governance & Integration, Security, Operations.]
Innovating within the community for the enterprise
• Open Source: fastest path to innovation for a platform technology
• Complete open source platform speeds enterprise and ecosystem adoption and minimizes lock in
• Enables the market to function much bigger much faster
…all done completely in Open Source 4
Driving our innovation through Apache Software Foundation Projects
Apache Project    Committers    PMC Members
Hadoop            27            20
Pig               5             5
Hive              16            4
Tez               15            15
HBase             6             4
Phoenix           4             4
Accumulo          2             2
Storm             3             2
Slider            10            10
Flume             1             0
Sqoop             1             1
Ambari            32            27
Oozie             3             2
Zookeeper         2             1
Knox              11            5
Argus             10            n/a
Falcon            5             3
TOTAL             153           105
Let’s talk challenges…
[Slide sequence: the word "Volume" repeated in growing numbers until it fills the screen]
Storage, Management, and Processing all become challenges with Data at Volume
Traditional technologies adopt a divide, drop, and conquer approach
The solution?
[Diagram: data divided among separate stores: EDW, Yet Another EDW, Analytical DB, OLTP, Another EDW]
Ummm…you dropped something
[Diagram: the same stores (EDW, Yet Another EDW, Analytical DB, OLTP, Another EDW) surrounded by many more blocks of data]
What keeps us from our Data?
Data Silos. Your data silos are lonely places.
[Diagram: separate silos: EDW, Accounts, Customers, Web Properties]
… Data likes to be together.
[Diagram: EDW, Accounts, Customers, Web Properties]
Data likes to socialize too.
[Diagram: EDW, Accounts, Customers, Web Properties, Machine Data, Twitter, CDR, Weather Data]
New types of data don’t quite fit into your pristine view of the world.
[Diagram: "My Little Data Empire" alongside Logs, Machine Data, and "? ? ? ?"]
To resolve this, some people take hints from Lord Of The Rings...
…and create One-Schema-To-Rule-Them-All…
[Diagram: EDW with a single Schema]
…but that has its problems too.
[Diagram: EDW, Schema, and multiple ETL jobs]
What if the data was processed and stored centrally? What if you didn’t need to force it into a single schema?
We call it a Modern Data Architecture*
*AKA Data Lake
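Storing data centrally without forcing it into one schema is often described as schema-on-read. The following is a minimal Python sketch of that idea, not anything shown in the talk: the record shapes, field names, and the `read_with_schema` helper are all hypothetical. Raw records are kept as they arrive, and each consumer projects only the fields it needs when it reads.

```python
import json

# Hypothetical raw records from different sources, stored as-is (one JSON document per line).
RAW_LINES = [
    '{"source": "web", "user": "u1", "page": "/pricing"}',
    '{"source": "crm", "user": "u1", "plan": "basic", "region": "ON"}',
    '{"source": "sensor", "device": "d7", "temp_c": 21.5}',
]

def read_with_schema(lines, fields):
    """Apply a schema at read time: keep only records that have every requested field."""
    for line in lines:
        record = json.loads(line)
        if all(f in record for f in fields):
            yield {f: record[f] for f in fields}

# Two consumers read the same raw data with different schemas.
web_view = list(read_with_schema(RAW_LINES, ["user", "page"]))
crm_view = list(read_with_schema(RAW_LINES, ["user", "plan"]))
print(web_view)
print(crm_view)
```

The sensor record fits neither view, yet it stays in the store untouched and remains available to any later consumer that asks for `device` and `temp_c`.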
A Modern Data Architecture
• Consolidate siloed data sets, structured and unstructured
• Central data set on a single cluster
• Multiple workloads across batch, interactive, and real time
• Central services for security, governance and operation
• Preserve existing investment in current tools and platforms
• Single view of the customer, product, supply chain
[Diagram: Modern Data Architecture. Applications: Business Analytics, Custom Applications, Packaged Applications. Data System: RDBMS, EDW, MPP alongside YARN: Data Operating System (Interactive, Real-Time, Batch) on HDFS (Hadoop Distributed File System). Sources: Existing Systems (CRM, ERP, Other), Clickstream, Web & Social, Geolocation, Sensor & Machine, Server Logs, Unstructured.]
What do you want to do with your data?
Marketing Analytics needs data.
Work with the population, not just a sample.
Your segmentation today.
Male
Female
Age: 25-30
Town/City
Middle Income Band
Product Category Preferences
Your segmentation with better data.
Male
Female
Age: 27 but feels old
GPS coordinates
$65-68k per year
Product recommendations per time of day and per weather
Tea Party Hippie
Looking to start a business
Walking into Starbucks right now…
A depressed Toronto Maple Leafs fan
Products left in basket indicate drunk Amazon shopper
Purchase history indicates a risk taker
Thinking about a new house
Unhappy with his cell phone plan
Pregnant
Spent 25 minutes looking at tea cozies
Pick up all of that data that was prohibitively expensive to store and use.
To approach these use cases you need an affordable platform that stores, processes, and analyzes the data.
Don’t wait for your data. Batch is often too late to influence the person who is in your store or on your website right now.
Streaming Processing, Search, and Storage
[Diagram: Hortonworks Data Platform 2.2 on YARN and HDFS, fed by real-time data feeds. Components: Apache Kafka, Streaming Ingest, Real Time Stream Processing (Storm), Online Data Processing (HBase), Search (Solr, Slider), SQL (Hive).]
Stream data into Hadoop and process it in near real-time
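To illustrate why per-event processing matters for the "in your store right now" case, here is a minimal Python sketch. It does not use Storm or Kafka APIs; the event fields, the `in_store` flag, and the offer rule are hypothetical. The streaming function reacts as each event arrives, while the batch function produces output only after the whole feed has been read.

```python
def stream_offers(events):
    """React to each event as it arrives (near real-time)."""
    for event in events:
        if event["in_store"]:
            yield f"offer for {event['user']}"

def batch_report(events):
    """Summarize only after all events have been collected."""
    collected = list(events)
    return {
        "visits": len(collected),
        "in_store": sum(1 for e in collected if e["in_store"]),
    }

# Hypothetical feed of customer events.
FEED = [
    {"user": "u1", "in_store": True},
    {"user": "u2", "in_store": False},
    {"user": "u3", "in_store": True},
]

offers = stream_offers(iter(FEED))
first_offer = next(offers)         # available after reading a single event
report = batch_report(iter(FEED))  # available only after the full feed is read
print(first_offer)
print(report)
```

The two styles complement each other: the stream drives immediate action, and the batch summary supports later analysis over the full data set.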
How? With Hortonworks Data Platform*
*AKA Hadoop
What’s New in HDP 2.2
New and Improved YARN Ready Engines
• Enterprise SQL at Hadoop Scale with Stinger.next
• Enterprise Ready Spark on YARN
• Deep YARN integration for real-time engines: HBase, Accumulo, Storm
• Enabling ISVs with a general SDK and API for direct YARN integration
• Only solution to provide real-time to micro batch for analyzing the internet of things
• Other engines/tools: Solr, Cascading
Continued Innovation of Central Enterprise Services
• Centralized security administration and policy enforcement
• Ease of use and operations agility features to speed cluster deployment
• 100% uptime target with cluster rolling upgrades
Expanded Deployment Options
• Enhanced business continuity with replication/archival across on-premises and cloud storage tiers (Azure Blob, S3)
• Simultaneous ship of Windows and Linux installs
• Expand Azure support beyond HDInsight Azure to include HDP for Windows or Linux in Azure VMs
HDP 2.2 Delivering Apache Hadoop for the Enterprise
Complete List of New Features in HDP 2.2
Apache Hadoop YARN
• Slide existing services onto YARN through ‘Slider’
• GA release of HBase, Accumulo, and Storm on YARN
• Support long running services: handling of logs, containers not killed when AM dies, secure token renewal, YARN Labels for tagging nodes for specific workloads
• Support for CPU Scheduling and CPU Resource Isolation through CGroups
Apache Hadoop HDFS
• Heterogeneous storage: Support for archival
• Rolling Upgrade (This is an item that applies to the entire HDP Stack. YARN, Hive, HBase, everything. We now support comprehensive Rolling Upgrade across the HDP Stack).
• Multi-NIC Support
• Heterogeneous storage: Support memory as a storage tier (TP)
• HDFS Transparent Data Encryption (TP)
Apache Hive, Apache Pig, and Apache Tez
• Hive Cost Based Optimizer: Function Pushdown & Join re-ordering support for other join types: star & bushy.
• Hive SQL Enhancements including:
• ACID Support: Insert, Update, Delete
• Temporary Tables
• Metadata-only queries return instantly
• Pig on Tez
• Including DataFu for use with Pig
• Vectorized shuffle
• Tez Debug Tooling & UI
Hue
• Support for HiveServer 2
• Support for Resource Manager HA
Apache Spark
• Refreshed Tech Preview to Spark 1.1.0 (available now)
• ORC File support & Hive 0.13 integration
• Planned for GA of Spark 1.2.0
• Operations integration via YARN ATS and Ambari
• Security: Authentication
Apache Solr
• Added Banana, a rich and flexible UI for visualizing time series data indexed in Solr
Cascading
• Cascading 3.0 on Tez distributed with HDP (coming soon)
Apache Falcon
• Authentication Integration
• Lineage – now GA. (it’s been a tech preview feature…)
• Improve UI for pipeline management & editing: list, detail, and create new (from existing elements)
• Replicate to Cloud – Azure & S3
Apache Sqoop, Apache Flume & Apache Oozie
• Sqoop import support for Hive types via HCatalog
• Secure Windows cluster support: Sqoop, Flume, Oozie
• Flume streaming support: sink to HCat on secure cluster
• Oozie HA now supports secure clusters
• Oozie Rolling Upgrade
• Operational improvements for Oozie to better support Falcon
• Capture workflow job logs in HDFS
• Don’t start new workflows for re-run
• Allow job property updates on running jobs
Apache HBase, Apache Phoenix, & Apache Accumulo
• HBase & Accumulo on YARN via Slider
• HBase HA
• Replicas update in real-time
• Fully supports region split/merge
• Scan API now supports standby RegionServers
• HBase Block cache compression
• HBase optimizations for low latency
• Phoenix Robust Secondary Indexes
• Performance enhancements for bulk import into Phoenix
• Hive over HBase Snapshots
• Hive Connector to Accumulo
• HBase & Accumulo wire-level encryption
• Accumulo multi-datacenter replication
Apache Storm
• Storm-on-YARN via Slider
• Ingest & notification for JMS (IBM MQ not supported)
• Kafka bolt for Storm – supports sophisticated chaining of topologies through Kafka
• Kerberos support
• Hive update support – Streaming Ingest
• Connector improvements for HBase and HDFS
• Deliver Kafka as a companion component
• Kafka install, start/stop via Ambari
• Security Authorization Integration with Ranger
Apache Slider
• Allow on-demand create and run different versions of heterogeneous applications
• Allow users to configure different application instances differently
• Manage operational lifecycle of application instances
• Expand / shrink application instances
• Provide application registry for publish and discovery
Apache Knox & Apache Ranger (Argus) & HDP Security
• Apache Ranger – Support authorization and auditing for Storm and Knox
• Introducing REST APIs for managing policies in Apache Ranger
• Apache Ranger – Support native grant/revoke permissions in Hive and HBase
• Apache Ranger – Support Oracle DB and storing of audit logs in HDFS
• Apache Ranger to run on Windows environment
• Apache Knox to protect YARN RM
• Apache Knox support for HDFS HA
• Apache Ambari install, start/stop of Knox
Apache Ambari
• Support for HDP 2.2 Stack, including support for Kafka, Knox and Slider
• Enhancements to Ambari Web configuration management including: versioning, history and revert, setting final properties and downloading client configurations
• Launch and monitor HDFS rebalance
• Perform Capacity Scheduler queue refresh
• Configure High Availability for ResourceManager
• Ambari Administration framework for managing user and group access to Ambari
• Ambari Views development framework for customizing the Ambari Web user experience
• Ambari Stacks for extending Ambari to bring custom Services under Ambari management
• Ambari Blueprints for automating cluster deployments
• Performance improvements and enterprise usability guardrails
Hortonworks Data Platform: A comprehensive data management platform
[Diagram: Hortonworks Data Platform 2.2. Batch, interactive & real-time data access on YARN: Data Operating System (Cluster Resource Management) over HDFS (Hadoop Distributed File System): Script (Pig), SQL (Hive), Java/Scala (Cascading), Stream (Storm), Search (Solr), NoSQL (HBase, Accumulo), In-Memory (Spark), Others (ISV Engines), with Tez and Slider as supporting layers. Governance: Data Workflow, Lifecycle & Governance (Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS). Security: Authentication, Authorization, Accounting, Data Protection (Storage: HDFS; Resources: YARN; Access: Hive, …; Pipeline: Falcon; Cluster: Knox; Cluster: Ranger). Operations: Provision, Manage & Monitor (Ambari, Zookeeper); Scheduling (Oozie). Deployment Choice: Linux, Windows, On-Premises, Cloud.]
YARN is the architectural center of HDP
Enables batch, interactive and real-time workloads
Provides comprehensive enterprise capabilities
The widest range of deployment options
Delivered Completely in the OPEN
HDP = Apache Hadoop
[Chart: Hortonworks Data Platform 2.2 component versions by release: HDP 2.0 (October 2013), HDP 2.1 (April 2014), HDP 2.2 (October 2014). Components: Hadoop & YARN, Pig, Hive & HCatalog, HBase, Sqoop, Oozie, Zookeeper, Ambari, Storm, Flume, Knox, Phoenix, Accumulo, Falcon, Ranger, Spark, Kafka, Tez, Slider, Solr, grouped under Data Management, Data Access, Governance & Integration, Security, and Operations. Per-component version numbers appear in the original slide.]
What else are we working on?
hortonworks.com/labs/
Hadoop is the new Data Operating System for the Enterprise
There is NO second place
Hortonworks …the Bull Elephant of Hadoop Innovation