SAP Virtualization Week 2012 - The Lego Cloud


Page 1: SAP Virtualization Week 2012 - The Lego Cloud

Benoit Hudzia Sr. Researcher; SAP Research CEC Belfast (UK)

Aidan Shribman Sr. Researcher; SAP Research Israel

SAP Virtualization Week 2012: TRND04

SAP DKOM 2012: NA 6747

The Lego Cloud

Page 2: SAP Virtualization Week 2012 - The Lego Cloud


Agenda

Introduction

Hardware Trends

Live Migration

Flash Cloning

Memory Pooling

Distributed Shared Memory

Summary

Page 3: SAP Virtualization Week 2012 - The Lego Cloud

Introduction: The evolution of the datacenter

Page 4: SAP Virtualization Week 2012 - The Lego Cloud


Evolution of Virtualization

No virtualization → Basic consolidation → Flexible resource management (cloud) → Resource disaggregation (true utility cloud)

Page 5: SAP Virtualization Week 2012 - The Lego Cloud


Why Disaggregate Resources?

Better Performance

Replacing slow local devices (e.g. disk) with fast remote devices (e.g. DRAM).

Many remote devices working in parallel (e.g. DRAM, disk, compute)

Superior Scalability

Going beyond boundaries of the single node

Improved Economics

Do more with existing hardware

Reach better hardware utilization levels

Page 6: SAP Virtualization Week 2012 - The Lego Cloud


The Hecatonchire Project

Hecatonchires in Greek mythology means "Hundred-Handed Ones". The original idea: provide Distributed Shared Memory (DSM) capabilities to the cloud.

Strategic goal: full resource liberation brought to the cloud by:

Breaking down physical nodes to their core elements (CPU, memory, I/O)

Extending the existing cloud software stack (KVM, QEMU, libvirt, OpenStack) without degrading any existing capabilities

Using commodity cloud hardware and standard interconnects

Initiated by Benoit Hudzia in 2011. Currently developed by two SAP Research TI Practice teams located in Belfast and Ra’anana.

Hecatonchire is not a monolithic project but a set of separate capabilities. We are currently identifying stakeholders and defining use cases for each such capability.

Page 7: SAP Virtualization Week 2012 - The Lego Cloud


Hecatonchire Architecture

Cluster Servers

Commodity hosts (e.g. 64 GB, 16 cores)

Commodity network adapters:

– Standard: SoftiWARP over 1 GbE

– Enterprise: RoCE/iWARP over 10 GbE or native IB

A modified version of the QEMU/KVM hypervisor

An RDMA remote-memory kernel module

Guests / VMs

Use resources from one or several underlying hosts

Existing OS/applications can run transparently

– Not exactly … but we will get to this later

[Diagram: Server #1 … Server #n, each contributing CPUs, memory, and I/O; guest VMs (H/W, OS, App) draw resources across the servers over fast RDMA communication]

Page 8: SAP Virtualization Week 2012 - The Lego Cloud


The Team - Panoramic View

Page 9: SAP Virtualization Week 2012 - The Lego Cloud

Hardware Trends: The blurring of physical host boundaries

Page 10: SAP Virtualization Week 2012 - The Lego Cloud


DRAM Latency Has Remained Constant

CPU clock speed and memory bandwidth have increased steadily while memory latency has remained constant.

As a result, local memory appears slower from the CPU's perspective.

Source: J. Karstens: In-Memory Technology at SAP. DKOM 2010

Page 11: SAP Virtualization Week 2012 - The Lego Cloud


CPU Cores Stopped Getting Faster

Moore’s law prevailed until 2005, when cores hit a practical limit of about 3.4 GHz.

The “single-threaded free lunch” (as coined by Herb Sutter) is over.

So CPU cores have stopped getting faster - but you do get more cores now.

Source: http://www.intel.com/pressroom/kits/quickrefyr.htm

Source: “The Free Lunch Is Over..” by Herb Sutter

Page 12: SAP Virtualization Week 2012 - The Lego Cloud


But … Interconnects Continue to Evolve

(providing higher bandwidth and lower latency)

Page 13: SAP Virtualization Week 2012 - The Lego Cloud


Result: Remote Nodes Are Becoming “Closer”

Accessing DRAM on a remote host via IB interconnects is only 20x slower than local DRAM.

Remote DRAM is 100x or 5000x faster than local SSD or HDD devices respectively.

Source: HANA Performance Analysis, Intel Westmere (formerly Nehalem-C) and IB QDR, Chaim Bendelac, 2011

Page 14: SAP Virtualization Week 2012 - The Lego Cloud


Result: Blurring the Boundaries of the Physical Host

[Diagram: access latencies across the blurring host boundary: local disk ~10,000,000 ns; local DRAM 60 ns-100 ns; CPU caches 15 ns-80 ns; memory on a neighbouring node over the interconnect ~2,000 ns]

Page 15: SAP Virtualization Week 2012 - The Lego Cloud

Live Migration: Serving as a platform to evaluate remote page faulting

Page 16: SAP Virtualization Week 2012 - The Lego Cloud


Enabling Live Migration of SAP Workloads

Business Problem

Typical SAP workloads such as SAP ERP are transactional, large, and have a fast rate of memory writes. Classic live migration fails for such workloads, as rapid memory writes cause memory pages to be re-sent over and over again.

Hecatonchire’s Solution

Enable live migration by reducing both the number of pages re-sent and the cost of a page re-send.

Across-the-board improvement of live migration metrics:

– Downtime - reduced

– Service degradation - reduced

– Total migration time - reduced

Page 17: SAP Virtualization Week 2012 - The Lego Cloud


Classic Pre-Copy Live Migration

Pre-migration: VM active on host A; destination host selected (block devices mirrored)

Reservation: Initialize container on target host

Iterative pre-copy: Copy dirty pages in successive rounds

Stop and copy: Suspend VM on host A; redirect network traffic; synch remaining state; activate on host B

Commitment: VM state on host A released

Page 18: SAP Virtualization Week 2012 - The Lego Cloud


Hecatonchire Pre-copy Live Migration

Reducing the number of page re-sends

Page LRU reordering, such that pages with a low chance of being re-dirtied are sent first

Contribution to QEMU planned for 2012

Reducing the cost of a page re-send

Using the XBZRLE delta encoder, page changes can be represented much more efficiently (a sketch of the idea follows below)

Contributed to QEMU during 2011
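To make the delta-encoding idea concrete, here is a minimal C sketch. It is not QEMU's actual XBZRLE wire format; the one-byte run lengths and the function name are illustrative assumptions. The previously sent copy of a page is compared with its current contents, unchanged stretches are skipped, and only the changed bytes are emitted, so a page with just a few dirtied words shrinks to a handful of bytes.

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of XOR/run-length delta encoding: emit (skip, copy) pairs, where
 * "skip" bytes are unchanged since the last send and "copy" bytes follow
 * verbatim. Returns the encoded size, or -1 if the delta would exceed
 * out_cap (in which case the raw page should be sent instead). */
static int delta_encode_page(const uint8_t *old_page, const uint8_t *new_page,
                             size_t page_size, uint8_t *out, size_t out_cap)
{
    size_t i = 0, o = 0;

    while (i < page_size) {
        size_t skip = 0, copy = 0;

        /* Run of unchanged bytes (capped so it fits in one byte). */
        while (i + skip < page_size && skip < 255 &&
               old_page[i + skip] == new_page[i + skip])
            skip++;

        /* Run of changed bytes that must be re-sent. */
        while (i + skip + copy < page_size && copy < 255 &&
               old_page[i + skip + copy] != new_page[i + skip + copy])
            copy++;

        if (o + 2 + copy > out_cap)
            return -1;                      /* not worth delta encoding */

        out[o++] = (uint8_t)skip;
        out[o++] = (uint8_t)copy;
        for (size_t k = 0; k < copy; k++)
            out[o++] = new_page[i + skip + k];

        i += skip + copy;
    }
    return (int)o;
}
```

The decoder simply walks the (skip, copy) pairs over the receiver's old copy of the page, which is why both sides must keep a cache of previously transferred pages.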

Page 19: SAP Virtualization Week 2012 - The Lego Cloud


More Than One Way to Live Migrate…

Pre-copy live migration: Pre-migrate; Reservation → Iterative pre-copy (X rounds) → Stop and copy → Commit. Downtime covers the stop-and-copy phase; the VM is live on A until then and live on B afterwards.

Post-copy live migration: Pre-migrate; Reservation → Stop and copy → Page pushing (1 round) → Commit. Downtime covers only the brief stop-and-copy; the VM then runs degraded on B while pages are pushed, and is fully live on B afterwards.

Hybrid post-copy live migration: Pre-migrate; Reservation → Iterative pre-copy (X rounds) → Stop and copy → Page pushing (1 round) → Commit. Live on A, briefly down, degraded on B, then live on B.

Total migration time spans all phases in each variant.

Page 20: SAP Virtualization Week 2012 - The Lego Cloud


Hecatonchire Post-copy Live Migration

In post-copy live migration we reverse the order:

1. Transfer of state: Transfer the VM running state from A to B and immediately activate the VM on B.

2. Transfer of memory: B can initiate a network-bound page fault handled by A; in the background, memory is actively pushed from A to B until completion.

Post-copy has some unique advantages

Downtime is minimal, as only a few MB of a GB-sized VM need to be transferred before re-activation

Total migration time is minimal and predictable

Hecatonchire’s unique enhancements (see the sketch below)

Low-latency RDMA page transfer protocol

Demand pre-paging (pre-fetching) mechanism

Full Linux MMU integration

Hybrid post-copy supported
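A minimal C sketch of the demand pre-paging idea follows. The helpers rdma_read_page, rdma_read_page_async, page_is_present and guest_page are hypothetical stand-ins (the real implementation lives in Hecatonchire's RDMA kernel module): the faulting page is fetched synchronously so the vCPU can resume, and a small window of neighbouring pages is prefetched in the background to hide remote-access latency.

```c
#include <stdint.h>

#define PAGE_SIZE      4096UL
#define PREFETCH_PAGES 8

/* Assumed transport helpers; placeholders for the RDMA kernel module. */
void rdma_read_page(uint64_t gpa, void *dst);        /* blocking read  */
void rdma_read_page_async(uint64_t gpa, void *dst);  /* non-blocking   */
int  page_is_present(uint64_t gpa);
void *guest_page(uint64_t gpa);

void handle_remote_fault(uint64_t fault_gpa)
{
    uint64_t base = fault_gpa & ~(PAGE_SIZE - 1);

    /* 1. Resolve the faulting page synchronously so the vCPU can resume. */
    rdma_read_page(base, guest_page(base));

    /* 2. Demand pre-paging: pull the next few pages in the background,
     *    betting on spatial locality of the guest's access pattern. */
    for (int i = 1; i <= PREFETCH_PAGES; i++) {
        uint64_t gpa = base + i * PAGE_SIZE;
        if (!page_is_present(gpa))
            rdma_read_page_async(gpa, guest_page(gpa));
    }
}
```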

Page 21: SAP Virtualization Week 2012 - The Lego Cloud

Demo

Page 22: SAP Virtualization Week 2012 - The Lego Cloud
Page 23: SAP Virtualization Week 2012 - The Lego Cloud

Flash Cloning: Sub-second elastic auto-scaling

Page 24: SAP Virtualization Week 2012 - The Lego Cloud


Automated Elasticity

Elasticity is the basis of cloud economics

You can scale up or scale down on demand

You only pay for what you use

The chart depicts the evolution of scaling approaches

Scale-up approach: purchase bigger machines to meet rising demand

Traditional scale-out approach: reconfigure the cluster size according to demand

Automated elasticity: grow and shrink your resources automatically in response to changing demand, represented by monitored metrics

If you can’t respond fast enough you may either miss business opportunities or have to increase your margin of purchased resources

Amazon Web Services - Guide

Page 25: SAP Virtualization Week 2012 - The Lego Cloud


Hecatonchire Flash Cloning

Business Problem

AWS auto scaling (and others) takes minutes to scale up:

– Disk image cloned from a template (AMI) image

– Full boot-up sequence of the VM

– Acquiring an IP address via DHCP

– Starting up the application

Hecatonchire Solution

Provide just-in-time (sub-second) scaling according to demand:

– Clone a paused source VM copy-on-write (CoW), including: disk image, VM memory, VM state (registers, etc.); see the sketch below

– Use a post-copy live-migration scheme, including page faulting to fetch missing pages, with background active page pushing

– Create a private network switch per clone (to avoid assigning a new MAC and performing IP reconfiguration)
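The copy-on-write behaviour the clone relies on can be illustrated with ordinary Linux primitives. This is a minimal sketch only, not how Hecatonchire clones a VM: a private mapping of a template image shares all unmodified pages between clones, and the kernel copies a page only on the clone's first write to it.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <template-image>\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* MAP_PRIVATE gives copy-on-write: reads come from the shared template,
     * and the first write to a page triggers a private copy for this clone. */
    char *mem = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE, fd, 0);
    if (mem == MAP_FAILED) { perror("mmap"); return 1; }

    if (st.st_size > 0)
        mem[0] ^= 0xff;   /* first write: kernel copies just this one page */

    munmap(mem, st.st_size);
    close(fd);
    return 0;
}
```

Run it against any non-empty file standing in for the template image; only the written page is duplicated, everything else stays shared between clones.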

Page 26: SAP Virtualization Week 2012 - The Lego Cloud

Memory Pooling: Tapping into unused memory resources of remote hosts

Page 27: SAP Virtualization Week 2012 - The Lego Cloud


Hecatonchire Breakthrough Capability: Breaking the Memory Box Barrier for Memory-Intensive Applications

[Chart: capacity (MB, GB, TB, PB) versus access speed (msec, µsec, nsec) for embedded, local, and networked resources (local disk, SSD, NAS, SAN), showing the performance barrier that networked resources have traditionally sat behind]

Page 28: SAP Virtualization Week 2012 - The Lego Cloud


The Memory Cloud: Turning memory into a distributed memory service

[Diagram: groups of servers (Server1, Server2, Server3) pooling storage, applications, and memory; apps and VMs on each server draw RAM from the shared pool rather than only from their own host]

Business Problem

Large amounts of DRAM required on demand, from shared cloud hosts

Current cloud offerings are limited by the size of their physical host: AWS can’t go beyond 68 GB of DRAM, as these large memory instances fully occupy the physical host

Hecatonchire Solution

Access remote DRAM via a low-latency RDMA stack (using pre-pushing to hide latency)

MMU integration for transparent consumption by applications and VMs, which as a result also supports compression (zcache), de-duplication (KSM), and N-tier storage

No hardware investment needed! No need for dedicated servers!

Page 29: SAP Virtualization Week 2012 - The Lego Cloud


Hecatonchire RRAIM

RRAIM: Remote Redundant Array of Inexpensive Memory; memory fault tolerance as part of a full HA solution (a write-path sketch follows below)

[Diagram: many physical nodes hosting a variety of VMs under the cloud management stack; VM high availability via KVM Kemari / Xen Remus; Hecatonchire RRAIM-1 (mirroring) keeps each VM's RAM replicated across memory nodes in active/active and master/slave roles]
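A minimal sketch of the RRAIM-1 (mirroring) write path, with hypothetical helpers (sponsor_write, sponsor_read and sponsor_alive are stand-ins, not the project's API): every page written back is sent to two memory sponsors, and a read can be served by whichever sponsor is still reachable.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical transport helpers for a "memory sponsor" (a host lending RAM). */
struct sponsor;
int sponsor_write(struct sponsor *s, uint64_t page_no, const void *buf, size_t len);
int sponsor_read(struct sponsor *s, uint64_t page_no, void *buf, size_t len);
int sponsor_alive(const struct sponsor *s);

/* RRAIM-1: each page is mirrored on two sponsors. */
struct rraim1 {
    struct sponsor *primary;
    struct sponsor *mirror;
};

/* Write back a page to both sponsors; succeed as long as at least one copy
 * lands (the array is then degraded, but the guest keeps running). */
int rraim1_writeback(struct rraim1 *r, uint64_t page_no, const void *buf, size_t len)
{
    int ok_primary = sponsor_write(r->primary, page_no, buf, len) == 0;
    int ok_mirror  = sponsor_write(r->mirror,  page_no, buf, len) == 0;
    return (ok_primary || ok_mirror) ? 0 : -1;
}

/* Fault a page back in from whichever sponsor is reachable. */
int rraim1_fetch(struct rraim1 *r, uint64_t page_no, void *buf, size_t len)
{
    if (sponsor_alive(r->primary) &&
        sponsor_read(r->primary, page_no, buf, len) == 0)
        return 0;
    return sponsor_read(r->mirror, page_no, buf, len);
}
```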

Page 30: SAP Virtualization Week 2012 - The Lego Cloud

Distributed Shared Memory: Our next challenge

Page 31: SAP Virtualization Week 2012 - The Lego Cloud


Cache-Coherent Non-Uniform Memory Access (ccNUMA)

Traditional cluster

Distributed memory

Standard interconnects

OS instance on each node

Distribution handled by application

ccNUMA

Cache coherent shared memory

Fast interconnects

One OS instance

Distribution handled by hardware/hypervisor

Page 32: SAP Virtualization Week 2012 - The Lego Cloud


Hecatonchire Distributed Shared Memory (DSM) VM

Page 33: SAP Virtualization Week 2012 - The Lego Cloud


Hecatonchire DSM – Cache Coherency (CC) Challenge

Standard ccNUMA

Inter-node (2,000 ns) cache coherency takes too long

Inter-node reads are expensive, while the processor cache is not large enough

Adding COMA (Cache Only Memory Access)

Can help improve performance for multi-read scenarios

A COMA implementation requires a 4 KB cache line, leading to false sharing of data

NUMA Topology / Dynamic NUMA Topology

An application’s NUMA-aware implementation may not be complete

Dynamic changes in the NUMA topology will not be supported by most current apps

We need to attempt to hide some of the performance challenges (so that we can expose a fixed NUMA topology)

Adding vCPU live migration

Compact vCPU state (only several KB) can be live migrated

Page 34: SAP Virtualization Week 2012 - The Lego Cloud

Summary

Page 35: SAP Virtualization Week 2012 - The Lego Cloud


Roadmap

2011

• Live Migration

• Pre-copy XBZRLE Delta Encoding

• Pre-copy LRU page reordering

• Post-copy using RDMA interconnects

2012

• Memory Cloud

• Memory Pooling

• Memory Fault Tolerance (RRAIM)

• Flash Cloning

2013

• Lego Landscape

• Distributed Shared Memory

• Flexible resource management

Page 36: SAP Virtualization Week 2012 - The Lego Cloud


Key takeaways

Hecatonchire extends the standard Linux stack, requiring only standard commodity hardware

With Hecatonchire, unmodified applications or VMs (which are NUMA-aware) can tap into remote resources transparently

To be released as open source under the GPLv2 and LGPL licenses to the QEMU and Linux communities

Developed by the SAP Research Technology Infrastructure (TI) Practice

Page 37: SAP Virtualization Week 2012 - The Lego Cloud

Thank you

Benoit Hudzia; Sr. Researcher;

SAP Research CEC Belfast

[email protected]

Aidan Shribman; Sr. Researcher;

SAP Research Israel

[email protected]

Page 38: SAP Virtualization Week 2012 - The Lego Cloud

Appendix

Page 39: SAP Virtualization Week 2012 - The Lego Cloud


Communication Stacks have Become Leaner

Traditional network interface

Application / OS context switches

Intermediate buffer copies

OS handles transport processing

RDMA adapters

Zero copy directly from/to application physical memory

Offloading of transport processing to the RDMA adapter, effectively bypassing the OS and CPU

A standard interface, OFED “Verbs”, supporting all RDMA adapters (IB, RoCE, iWARP); a one-sided RDMA read is sketched below
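For illustration, a minimal fragment using the OFED verbs API. It assumes a queue pair has already been connected and both buffers registered; connection setup and completion polling are omitted, and the wrapper function name is a placeholder. It posts a one-sided RDMA READ that pulls remote memory into a local buffer without involving the remote CPU.

```c
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Post an RDMA READ of len bytes from (remote_addr, rkey) into local_buf.
 * Returns 0 on success; a work completion later signals the data arrived. */
int rdma_read_remote(struct ibv_qp *qp,
                     void *local_buf, uint32_t lkey,
                     uint64_t remote_addr, uint32_t rkey,
                     uint32_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = len,
        .lkey   = lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.wr_id               = 1;
    wr.opcode              = IBV_WR_RDMA_READ;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;   /* request a completion */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);
}
```

Because the same verbs interface is served by IB, RoCE, iWARP and SoftiWARP providers, the fragment is transport-agnostic.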

Page 40: SAP Virtualization Week 2012 - The Lego Cloud


Linux Kernel Virtual Machine (KVM)

Released as a Linux kernel module (LKM) under the GPLv2 license in 2007 by Qumranet

Full virtualization via the Intel VT-x and AMD-V virtualization extensions to the x86 instruction set

Uses QEMU for invoking KVM, for handling I/O, and for advanced capabilities such as VM live migration

KVM is considered the primary hypervisor on most major Linux distributions, such as Red Hat and SUSE

Page 41: SAP Virtualization Week 2012 - The Lego Cloud


Remote Page Faulting Architecture Comparison

Hecatonchire

No context switches

Zero copy

Uses iWARP RDMA

Yabusame

Context switches into user mode

Uses standard TCP/IP transport

Sources: Hirofuchi and Yamahata, KVM Forum 2011; Hudzia and Shribman, SYSTOR 2012

Page 42: SAP Virtualization Week 2012 - The Lego Cloud


Hecatonchire DSM VM – ccNUMA Challenge

Linux NUMA topology

Linux is aware of the NUMA topology (which cores and memory banks reside in each zone/node) and exposes this topology for applications to make use of (see the sketch below).

But it is up to the application to be NUMA-aware … if it is not, it may suffer when running on a NUMA topology.

And even if the application is NUMA-aware, the longer time needed for cache coherency (cc) may hurt performance.

Intel Nehalem memory hierarchy:

Inter-core (L3 cache): 20 ns

Inter-socket (main memory): 100 ns

Inter-node (IB, remote memory): 2,000 ns
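As a small illustration of the topology information Linux exposes (this uses the standard libnuma API and is not part of the original deck; build with -lnuma), the program below prints the size and free memory of each NUMA node and the node the current CPU belongs to.

```c
#define _GNU_SOURCE
#include <numa.h>
#include <sched.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not supported on this system\n");
        return 1;
    }

    /* Per-node memory: what a NUMA-aware allocator would consult. */
    int max_node = numa_max_node();
    for (int node = 0; node <= max_node; node++) {
        long long free_bytes = 0;
        long long size = numa_node_size64(node, &free_bytes);
        printf("node %d: %lld MB total, %lld MB free\n",
               node, size >> 20, free_bytes >> 20);
    }

    /* Which node the calling thread is currently executing on. */
    int cpu = sched_getcpu();
    printf("cpu %d is on node %d\n", cpu, numa_node_of_cpu(cpu));
    return 0;
}
```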

Page 43: SAP Virtualization Week 2012 - The Lego Cloud


Legal Disclaimer

The information in this presentation is confidential and proprietary to SAP and may not be disclosed without the permission of SAP. This presentation is not subject to your license agreement or any other service or subscription agreement with SAP. SAP has no obligation to pursue any course of business outlined in this document or any related presentation, or to develop or release any functionality mentioned therein. This document, or any related presentation and SAP's strategy and possible future developments, products and or platforms directions and functionality are all subject to change and may be changed by SAP at any time for any reason without notice. The information in this document is not a commitment, promise or legal obligation to deliver any material, code or functionality. This document is provided without a warranty of any kind, either express or implied, including but not limited to, the implied warranties of merchantability, fitness for a particular purpose, or non-infringement. This document is for informational purposes and may not be incorporated into a contract. SAP assumes no responsibility for errors or omissions in this document, except if such damages were caused by SAP intentionally or grossly negligent.

All forward-looking statements are subject to various risks and uncertainties that could cause actual results to differ materially from expectations. Readers are cautioned not to place undue reliance on these forward-looking statements, which speak only as of their dates, and they should not be relied upon in making purchasing decisions.