#HadoopSummit
The Time Has Come for Big-Data-as-a-Service
Kris Applegate – Cloud and Big Data Solution Architect, Dell
Tom Phelan – Co-Founder and Chief Architect, BlueData
Agenda
• A Brief History of Hadoop
• Data Storage and Networking Evolution
• The Virtualization Revolution
• Rise of Big-Data-as-a-Service
• Big-Data-as-a-Service (BDaaS) Defined
• BDaaS – Public Cloud or On-Premises?
• Q & A
A Brief History of Hadoop
In the Beginning (circa 2003) …
• Networks were slow (1 Gigabit per second maximum)
• Siloed storage was expensive (proprietary and often required special hardware)
• Local HDDs were cheap and fast enough for big data needs
Source: http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf
Bringing the Compute to the Data
Compute Storage
Co-Locate Compute & Storage
Hadoop and HDFS are Born
Network Improvements
Data Compression Options in HDFS
Source: www.slideshare.net/Hadoop_Summit/singh-kamat-june27425pmroom210c
Result: Is Disk-Locality Irrelevant?
Source: https://amplab.cs.berkeley.edu/wp-content/uploads/2011/06/disk-irrelevant_hotos2011.pdf
Less relevant may be more accurate
• Faster data center networks
• Distributed/non-distributed caching platforms
  • Example: Alluxio (Tachyon)
• Compute and storage separation
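The claim that disk locality matters less on faster networks can be checked with rough arithmetic. The Python sketch below estimates how long it takes to move a data set across a single link at the 1, 10, and 40 Gbit/s speeds mentioned in this talk, with and without compression. The 1 TB data set, the 2:1 compression ratio, and the function name are illustrative assumptions, not figures from the talk; protocol overhead, disk throughput, and decompression cost are ignored.

```python
def transfer_seconds(data_bytes: float, link_gbit_per_s: float,
                     compression_ratio: float = 1.0) -> float:
    """Idealized time to move `data_bytes` over one network link.

    compression_ratio is uncompressed size divided by compressed size,
    so 2.0 means the data shrinks to half before it goes on the wire.
    """
    if link_gbit_per_s <= 0 or compression_ratio <= 0:
        raise ValueError("link speed and compression ratio must be positive")
    bytes_on_wire = data_bytes / compression_ratio
    link_bytes_per_s = link_gbit_per_s * 1e9 / 8  # Gbit/s -> bytes/s
    return bytes_on_wire / link_bytes_per_s


if __name__ == "__main__":
    one_terabyte = 1e12  # hypothetical data set size
    for gbit in (1, 10, 40):
        raw = transfer_seconds(one_terabyte, gbit)
        packed = transfer_seconds(one_terabyte, gbit, compression_ratio=2.0)
        print(f"{gbit:>2} Gbit/s: {raw:6.0f} s uncompressed, {packed:6.0f} s at 2:1")
```

Under these assumptions the same terabyte that ties up a 1 Gbit/s link for more than two hours crosses a 40 Gbit/s link in a few minutes, which is the intuition behind separating compute from storage.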
• Virtualization / “cloud” technology is not absolutely required
• But realistically … the flexibility and elasticity of BDaaS cannot be economically provided without these underlying technologies
BDaaS and Cloud
The Virtualization Revolution
VMware
KVM
Docker
Hyper-V
LXC
Virtualization enabled several key benefits including:
• Automation, flexibility, elasticity
• Cost reduction and consolidation
• Higher utilization, less hardware overprovisioning
• Multi-tenancy
• Security
• VxLAN
• Fault isolation
The Virtualization Revolution
But … the overhead involved in the virtualization of storage and networking within a hypervisor makes it difficult to meet the performance needs of Big Data workloads (SLAs, QoS)
The Virtualization Revolution
• Linux Containers
• OS virtualization reduces CPU, memory, network, and storage virtualization overhead
• Docker file format makes containers easy to use and share
The Virtualization Revolution
Rise of Big-Data-as-a-Service
Big Data New Realities
Big Data Traditional Assumptions
Bare-metal
Disk-locality
HDFS on local disks
Big Data New Realities
Containers
Compute and storage separation
In-place access on remote data stores
New Benefits and Value
Big-Data-as-a-Service
Agility and cost savings
Faster time-to-insights
Journey to BDaaS
2002 Initial release of VMware ESX
2003 Google paper
2004 Big Data era begins
2007 Hadoop release 0.14.1
2008 Initial release of Linux containers
2009 Dell DCS delivers first Big Data server
2009 Amazon launches EMR
2011 Dell first to launch optimized Apache Hadoop solution
2012 Hadoop 1.0.2 Snappy Compression
2012 10 Gbit networking in data center
2013 Dell Hadoop Performance Analysis
2013 Initial release of Docker
2014 VxLANs available
2014 BlueData wins Strata + Hadoop World Showcase
2015 BlueData EPIC 2.0 with Docker
2015 40 Gb networking in data center
2016 BDaaS available on-prem or cloud
BDaaS – The Time Has Come
All the pieces are now available:
• Fast network hardware and good data compression
• Compute and storage separation
• Low overhead virtualization (containers)
• Ability to run network and storage-intensive workloads
• No sacrifice in performance
• Demand from end users for agility, flexibility, & speed
Big-Data-as-a-Service Defined
“A mechanism for the delivery of statistical analysis tools and information that helps organizations understand and use insights gained from large information sets in order to gain a competitive advantage.”
On-Demand, Self-Service, Elastic
Big Data Infrastructure, Applications, Analytics
Source: www.semantikoz.com/blog/big-data-as-a-service-definition-classification
• Core BDaaS
• Performance BDaaS
• Feature BDaaS
• Integrated BDaaS
Four Types of BDaaS
Source: www.semantikoz.com/blog/big-data-as-a-service-definition-classification
Core BDaaS
• Minimal platform, such as Hadoop with YARN
Performance BDaaS
• “Downwards” vertical integration
• Includes optimized infrastructure
• Tight integration with Core BDaaS
Source: www.semantikoz.com/blog/big-data-as-a-service-definition-classification
Four Types of BDaaS
Four Types of BDaaS
Feature BDaaS
• “Upwards” vertical integration
• Includes features beyond Hadoop
• Support for multiple Core BDaaS providers
Integrated BDaaS
• Full vertical integration and optimization
• Includes both Performance BDaaS & Feature BDaaS
Source: www.semantikoz.com/blog/big-data-as-a-service-definition-classification
BDaaS – Public Cloud or On-Prem?
Public Cloud
• Low Capex, high Opex
• “Infinite” expandability
• Less secure?
• Less control: software, SLAs, configs, etc.
On-Premises (Private Cloud)
• High Capex, low Opex
• Eventually reach resource limit
• More secure?
• More control: software, SLAs, configs, etc.
BDaaS – Public Cloud or On-Prem
Challenge: Public cloud services can be proprietary
Goal: Deliver API-compatible on-prem + public cloud
• BDaaS layer (e.g. BlueData)
• PaaS layer (e.g. Cloudforms, Cloud Foundry)
• API-compatible private cloud (e.g. Microsoft Azure Pack/Stack, OpenStack, VMware)
BDaaS – Workload Portability
• Workloads with a shorter life than 16 months* (e.g. Dev/Test)
• When data is in the cloud too
• Public-facing services
Example Public Cloud Use Cases
BDaaS – Public Cloud
* www.dell.com/learn/us/en/555/business~solutions~whitepapers~en/documents~microsoft-private-cloud-tco-0914.pdf
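The 16-month threshold cited above is a break-even point: the month at which cumulative public cloud spend catches up with the up-front plus running cost of on-premises hardware. A minimal Python sketch of that arithmetic follows. The function name and all dollar amounts are made-up placeholders, deliberately chosen so the example lands on 16 months; they are not taken from the Dell white paper.

```python
import math
from typing import Optional


def break_even_month(onprem_capex: float,
                     onprem_opex_per_month: float,
                     cloud_cost_per_month: float) -> Optional[int]:
    """First whole month in which cumulative cloud cost reaches on-prem cost.

    Returns None when the cloud is never more expensive per month,
    because in that case the on-prem purchase never pays for itself.
    """
    monthly_saving = cloud_cost_per_month - onprem_opex_per_month
    if monthly_saving <= 0:
        return None
    return math.ceil(onprem_capex / monthly_saving)


if __name__ == "__main__":
    # Hypothetical figures: $160,000 hardware purchase, $2,000/month to run it,
    # versus $12,000/month for equivalent public cloud capacity.
    months = break_even_month(160_000, 2_000, 12_000)
    print(f"On-prem pays for itself after {months} months")
```

A cluster expected to live for fewer months than this figure favors the public cloud; a longer-lived, persistent cluster favors on-premises.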
Example On-Prem Use Cases
• High performance clusters
• Data security
• Data compliance
• Persistent clusters with > 16 month lifespan*
• High capacity clusters
• When SLAs are needed
* The BlueData EPIC software platform addresses this potential limitation
BDaaS – On-Premises / Private Cloud
• BDaaS software platform, using Docker containers
• Self-service, on-demand Hadoop / Spark clusters
• Bring your own application / distribution / version
• Compute and storage separation
  • Scale resources independently
  • Clusters with < 16 month lifespan well supported (e.g. transient)
  • No HDFS data ingestion penalty
• Secure multi-tenancy, Quality of Service (QoS)
BlueData EPIC – Integrated BDaaS
Big Data On-Premises
Traditional Big Data On-Prem
IT
Manufacturing, Sales, R&D, Services
• < 30% Utilization
• Duplication of data
• Management complexity
• Weeks to build each cluster
• Complex, painful upgrades
BDaaS On-Prem with BlueData
BlueData EPIC Software Platform
Manufacturing, Sales, R&D, Services
BI/Analytics Tools
• > 90% Utilization
• No Duplication of Data
• Simplified Management
• Multi-Tenant
• Simple, instant upgrades
• Self-service, on-demand clusters
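The utilization figures above translate directly into hardware counts. As a rough sketch in Python, the function below computes how many servers must be provisioned to deliver a fixed amount of useful work at a given average utilization. The workload of 27 fully busy server-equivalents is a hypothetical number picked for round results, and 30% and 90% are used as stand-ins for the "< 30%" and "> 90%" bounds on the slide.

```python
def servers_needed(busy_server_equivalents: int, utilization_percent: int) -> int:
    """Servers to provision so the useful work fits at the given utilization.

    Integer arithmetic (ceiling division) is used to avoid floating-point
    rounding surprises.
    """
    if busy_server_equivalents < 0:
        raise ValueError("workload must be non-negative")
    if not 0 < utilization_percent <= 100:
        raise ValueError("utilization must be in the range 1..100")
    return (busy_server_equivalents * 100 + utilization_percent - 1) // utilization_percent


if __name__ == "__main__":
    workload = 27  # hypothetical: work equal to 27 fully busy servers
    for percent in (30, 90):
        print(f"{percent}% utilization -> {servers_needed(workload, percent)} servers")
```

With these assumed inputs, the siloed setup needs three times the hardware of the shared one for the same work.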
NEW – BDaaS On-Prem and Cloud
• BlueData announced AWS and multi-cloud strategy
  • Extending the user experience and value of BlueData to public cloud
  • Single pane of glass for on-prem and off-prem Big Data workloads
  • Initial AWS support; then MS Azure, Google Cloud Platform, others
• Support for data on-prem and compute in the cloud
  • Leverage cloud compute elasticity while keeping data on-premises
  • Eliminate challenge of data movement from on-prem to cloud
BlueData and Dell Partnership
• Joint solution for Big-Data-as-a-Service
• BlueData = Certified Dell Technology Partner
• Installed, tested, validated on Dell hardware
• Featured in Dell’s Global Customer Solution Centers