Faster Infrastructure for Big Data Workloads

Baremetal, AWS or Clusters-as-a-Service
July 19, 2017

Overview for Big Data Developers, IT DevOps and Data Scientists

Morgan Littlewood, Founder [email protected]


© 2017 Kodiak Data

Infrastructure Blocks Big Data Deployment

The Infrastructure Chasm

Who provides the infrastructure?
How do you ensure each stage is successful?
How do you migrate applications between infrastructures?

Stages: DEV, TEST, PILOT, PRODUCTION

Big Data Software
- Analytics software
- Orchestration software
- Distro software
- Datasets
- NoSQL databases
- Application logic

Deployment Needs
- SaaS/Hosted
- Colo
- Private Data Center
- On-prem Managed Cloud


The Infrastructure Challenge
How do you design, deploy, maintain, scale and pay for your cluster infrastructure?

Gather Technical Needs
• How many clusters and stacks?
• How many types of servers?
• How many servers?
• What CPU/RAM configuration?
• How much disk capacity?
• What disk speed?
• What network speed?
• Names and IP Addressing for servers?
• Which OS? What SW distros?
• Docker? VM? Don't care?
• Other stacks like NAS filer?

Predict Business Needs
• Development, Test and Production Clusters?
• Will the clusters grow with time?
• How long will clusters exist?
• What is the CAPEX budget?
• What is the OPEX budget?
• Where is the data coming from?
• How many users of each cluster?
• How critical is performance?
• Will the software stack change with time?
• What happens when the workload changes?
• Does all the data need to be retained?


Typical Big Data User Scenarios

Start-up: Analytics Services
• Mix of real-time and batch analytics
• Multiple datasets and customers
• Datasets start small and grow with time
• Performance requirements change rapidly
• Multiple stacks: Hadoop, Kafka, Druid
• 5–10 Clusters needed
• Smaller budgets – grow with business
• Time to start is critical

Mid-size Company: IT-Centric Products
• Mix of real-time and batch analytics
• Large numbers of very large datasets
• Ever-increasing performance requirements
• Large Development and Data Science team
• Many data stacks – new stacks being tested
• 20–100 Clusters needed
• Larger budgets and security needs
• Cost-efficiency is important


Today’s Choices for Cluster Infrastructure

Baremetal
• Dedicated cluster hardware
• High Performance
• High CAPEX
[Diagram: four baremetal clusters, each running standard software]

AWS
• EC2, EBS & S3
• Low Performance
• High OPEX – “Pay-per-month”
[Diagram: AWS icons labeled AMI, Amazon EC2, Amazon EBS, customer gateway, Amazon VPC, AWS CloudFormation, optimized instance, Spot instance, Availability Zone]


Baremetal: Servers Dedicated to Specific Stacks/Clusters
Server Configurations Driven by Software/Application needs

• Today’s applications need multiple stacks – data, metadata, real-time,…
• Each stack needs specific CPU, RAM, Disk, Network + Power, Cabling
• Performance can be scaled, but resource efficiency is poor
• Very difficult to change configurations, but software changes constantly
• Need to acquire and size datacenter space – LARGE BUDGETS!

[Diagram: dedicated servers labeled HDFS NameNode, HDFS DataNode, MySQL Server, Cassandra Server, Spark Server]


AWS is a Great Cloud - What’s not to Like?

AWS has been ‘Cloud 1.0’ since 2006
• Largest, most mature public cloud
• Wide range of useful service offerings
• All you need is a credit card
• Pricing by the hour makes it “addictive”

[Diagram: AWS icons labeled AMI, Amazon EC2, Amazon EBS, customer gateway, Amazon VPC, AWS CloudFormation, optimized instance, Spot instance, Availability Zone]


AWS is not Designed for Big Data
Big Data requirements are a ‘Whole Different Ball Game’!

Webby: ‘Instance’ & ‘Person to Machine’
Big Data: ‘Cluster’ & ‘Machine to Machine’

• Fixed Instances – Specific CPU/RAM/Disk/Network configurations
• Provisioning – Complex (spot/on-demand/reserved) & time-consuming
• Too Slow – 10 Gbit/s bandwidth is only available with high-CPU instances


EC2 Instances – Only a Few are Even 10Gbit/s


• Clusters need bandwidth for EBS and node-to-node

• Difficult to change instance sizes with instance storage

• Forced to use EBS for persistent data


Is There a Better Choice for Cluster Infrastructure?

Baremetal
• Dedicated cluster hardware
• High Performance
• High CAPEX

AWS
• EC2, EBS & S3
• Low Performance
• High OPEX – “Pay-per-month”

Clusters-as-a-Service
• Hosted or On-Premises
• High Performance
• Low CAPEX

[Diagram: the AWS icon set and four baremetal clusters running standard software, as on the earlier comparison slide]


Introducing Cloud 2.0: MemCloud™
Combines the Speed of Baremetal with the Agility of Cloud

Memory-Speed: 100GE, ALL flash Infrastructure

• On-Prem Cloud: Group and Team Developers, Big Data
• Hosted Cloud: Big-data ‘virtual’ IaaS
• Public Cloud: Public Cloud for Web, Storage, Repository, Disaster and Spin-up/Spin-down Services

[Diagram: two links, each labeled “hybrid option”]


Clusters-as-a-Service
Fabric Solves the Performance Issues at Big Data Scale

Project Clusters Now
• Each Project Cluster: Many groups of Server+disks

Virtualized Clusters on KODIAK Fabric
• Fabric: Cluster Hypervisor on each Server; Linux software – Performance HW
• DataContainer™: Virtual Cluster manifest for VM/Docker + vDisks + network
• User SW runs in standard VM/Docker containers

[Diagram: several virtual clusters of VMs running on the fabric, which runs on servers with disks]
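
The DataContainer description above (a virtual cluster manifest covering VM/Docker nodes, vDisks and network) can be pictured as a small data structure. The sketch below is purely illustrative: the slides do not show the real DataContainer format, so every class name, field name and value here is a hypothetical assumption.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical model of a virtual cluster manifest. None of these names
# come from Kodiak's DataContainer format; they only illustrate the idea
# of describing nodes, virtual disks and network in one document.


@dataclass
class VDisk:
    name: str
    size_gb: int


@dataclass
class NodeGroup:
    role: str      # e.g. "namenode" or "datanode"
    count: int     # number of identical nodes in this group
    runtime: str   # "vm" or "docker"
    vcpus: int
    ram_gb: int
    disks: List[VDisk] = field(default_factory=list)  # disks attached to EACH node


@dataclass
class VirtualCluster:
    name: str
    network_cidr: str
    groups: List[NodeGroup] = field(default_factory=list)

    def total_nodes(self) -> int:
        return sum(g.count for g in self.groups)

    def total_disks(self) -> int:
        return sum(g.count * len(g.disks) for g in self.groups)

    def total_capacity_gb(self) -> int:
        return sum(g.count * sum(d.size_gb for d in g.disks) for g in self.groups)


def example_cluster() -> VirtualCluster:
    """Build a small made-up cluster: 1 name node and 3 data nodes."""
    return VirtualCluster(
        name="hadoop-dev",
        network_cidr="10.0.0.0/24",
        groups=[
            NodeGroup("namenode", 1, "vm", 8, 64, [VDisk("meta", 200)]),
            NodeGroup("datanode", 3, "vm", 16, 128,
                      [VDisk("data0", 1000), VDisk("data1", 1000)]),
        ],
    )


if __name__ == "__main__":
    cluster = example_cluster()
    print(cluster.total_nodes(), cluster.total_disks(), cluster.total_capacity_gb())
```

Grouping identical nodes under a count, rather than listing each server, is one way a short manifest could describe a large cluster.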


Simplify Cluster Infrastructure
Manage nodes like Cattle not Pets!

Gather Cluster Needs

Single-Click Deployment on Environment of Choice
• In Hosted Co-Lo: Clusters from $100/week
• On Premise: Workgroup Appliances
• In Data Center: Rack Solutions to PBs

100 Nodes & 10 Disks each
= 3100 Steps (Create, Attach, Format) in AWS
= 5-line spreadsheet and 50-line DataContainer
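
The slide does not break down how the 3100 AWS steps for 100 nodes with 10 disks each are counted. One reading that reproduces the figure, offered here only as an assumption, is one launch step per node plus three steps (create, attach, format) per disk. The function name and parameters below are hypothetical.

```python
def manual_provisioning_steps(nodes: int, disks_per_node: int,
                              steps_per_disk: int = 3) -> int:
    """Count manual steps under the assumed breakdown:
    one launch step per node, plus create/attach/format for every disk."""
    if nodes < 0 or disks_per_node < 0 or steps_per_disk < 0:
        raise ValueError("counts must be non-negative")
    node_steps = nodes
    disk_steps = nodes * disks_per_node * steps_per_disk
    return node_steps + disk_steps


if __name__ == "__main__":
    # 100 launches + 1000 disks * 3 steps each
    print(manual_provisioning_steps(100, 10))  # 3100
```

Under this reading the step count grows with the product of nodes and disks, whereas a declarative description can stay roughly constant in length.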


Use-case: AWS Big Data Offload Strategy
Startup: Machine Learning Company

• Business: Market Campaign Analysis
  - TB scale processing cluster per large customer - data in AWS S3
  - Paying $28K/year/customer cluster in AWS (growing)
  - 14 Node Cluster - Hadoop, Druid
• Solution: ML Data in S3 → MemCloud → Results in S3
• Benefit: 4X Faster, ONE HALF the Cost
• Benefit: ZERO Client IT/DevOps Required
  - 2 Day start-up: spec-build-run-migrate data from AWS → MemCloud
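
As a rough worked example of the cost claim above, the sketch below applies the stated "ONE HALF the Cost" ratio to the stated $28K/year per customer cluster. The function name is hypothetical, and the flat per-cluster ratio and the `clusters` parameter are assumptions; the slide gives no customer count or pricing model.

```python
def projected_annual_cost(current_cost_per_cluster: float,
                          clusters: int = 1,
                          cost_ratio: float = 0.5) -> float:
    """Annual cost after offload, assuming the same flat ratio applies
    to every cluster (an assumption, not a quoted price)."""
    if clusters < 0 or cost_ratio < 0:
        raise ValueError("clusters and cost_ratio must be non-negative")
    return current_cost_per_cluster * clusters * cost_ratio


if __name__ == "__main__":
    # $28K/year in AWS at half the cost
    print(projected_annual_cost(28_000))  # 14000.0
```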


Use-case: Private Cloud for Analytics
Mid-size Company: Agile Environment with Memory-Speed Performance

• Business: Self Driving Vehicles
  - PB scale data-processing clusters
  - Development, Test, Machine Learning, Production Clusters
  - Many cluster types – Batch, Real-time
• Solution: Virtual Cluster Infrastructure
  - Kubernetes for compute orchestration
  - Kodiak MemCloud for virtual disks and clusters
  - Shared resource pool for many clusters
• Benefit: 4X Faster Disks, ONE HALF the Cost
• Benefit: Low IT Admin Costs
  - Spin-up new clusters in minutes
  - Create disks and nodes of any size


Clusters-as-a-Service
Cloud 2.0 Provides Simple & Powerful Virtual Cluster Infrastructure

1-Click Provision Test/Dev/Prod Clusters...SIMPLE, in minutes

Data-Heavy Apps Run 5x Faster...Out-Of-Box

Provision ‘ALL Flash Data Plane’...at fixed, HDD pricing


Thank You!

www.kodiakdata.com

http://info.kodiakdata.com/memcloud/
http://www.memcloud.works/benchmark-results.html

MemCloud access portal (need credentials from Kodiak Data) http://memcloud.kodiakdata.com

Contact us at [email protected].