
Page 1:

2015 Storage Developer Conference. © Microsoft Corporation. All Rights Reserved.

Concepts on Moving From SAS-Connected JBOD to an Ethernet-Connected JBOD (EBOD)

Jim Pinkerton, Microsoft Corporation

9/22/2015

Page 2: Re-Thinking Software Defined Storage: Conceptual Model Definition

[Diagram: a Compute/Application Node (mem, CPU) uses file or block access over a front-end fabric (Ethernet, InfiniBand, Fibre Channel) to reach a Storage Processing Node (mem, CPU). The storage node builds reliable, available storage out of flaky non-volatile devices (disk, flash, tape, other), attached either direct (SATA, NVME, CPU memory interface) or as shared storage (SAS, Fibre Channel, Ethernet).]

• Three "entities":
  • Compute Node
  • Storage Node
  • Flaky Storage Devices
• Front-end fabric: Ethernet, IB, FC
• Back-end fabric: Direct Attached or Shared Storage

Page 3: Yesterday's Storage Architecture: Still highly profitable

[Diagram: a compute farm of Compute/Application Nodes (mem, CPU) connects over a fabric (Ethernet, InfiniBand, Fibre Channel) to a traditional SAN/NAS box: multiple Storage Processing Nodes (mem, CPU) with flaky non-volatile devices (disk, flash, tape, other) attached direct (SATA, NVME, CPU memory interface) or as shared storage (SAS, Fibre Channel, Ethernet), exported as reliable, available storage.]

• "Appliance": the vendor tests it as a unit
• Moved to general-purpose CPU & OS

Page 4: Today: Software Defined Storage (SDS) "Converged" Deployments

The rise of componentization of storage, but an interop nightmare for the user.

[Diagram: a compute farm of Compute/Application Nodes (mem, CPU) connects over a fabric (Ethernet, InfiniBand, Fibre Channel) to a storage service: Storage Processing Nodes (mem, CPU) in front of JBODs of flaky non-volatile devices, attached direct (SATA, NVME, CPU memory interface) or as shared storage (SAS, Fibre Channel, Ethernet), exported as reliable, available storage.]

• Software ships independent of hardware
• Customer is the first to test it as a system

Page 5: Today: Software Defined Storage (SDS) "Hyper-Converged" (H-C) Deployments

H-C appliances are a dream for the customer; H-C $/GB storage is expensive.

[Diagram: the same storage service, but the Storage Processing Nodes and the Compute/Application Nodes share physical hosts in the compute farm. Unreliable NV storage (flaky non-volatile devices over direct attach or shared storage) is built into reliable, available storage.]

• Typically sold as an appliance (OEM owns E2E testing)

Page 6: An Example: Microsoft's Cloud Platform System (CPS)

[Diagram: a compute farm of Compute/Application Nodes (mem, CPU) connects over Ethernet to a storage service: Storage Processing Nodes (mem, CPU) with shared-storage SAS attach to JBODs of flaky non-volatile devices (SSD, HDD), exported as reliable, available storage.]

• "Azure-aligned innovation"
• "Appliance-like experience"
• "Single throat to choke"
• Package deal, shipped as of Oct 2014

Page 7: SDS with DAS

[Diagram: two variants of the storage service, both reaching the compute farm over Ethernet. In the first, Storage Processing Nodes build reliable, available storage directly from unreliable NV storage over direct attach (SATA, NVME, CPU memory interface). In the second, the service splits into a Storage FE Service on front-end Storage Processing Nodes and a Storage BE Service (labeled "Calabria") on back-end Storage Processing Nodes that expose unreliable shared block storage over Ethernet; direct attach on the back end is SATA, SAS, NVME, or the CPU memory interface.]

With DAS, rebuild & replication traffic moves to Ethernet.

Page 8: Software Defined Storage Topologies

[Diagram: three topologies, each with a compute farm of Compute/Application Nodes, a Storage FE Service on Storage Processing Nodes (mem, CPU), and a Storage BE Service over unreliable NV storage (direct attach: SATA, SAS, NVME, CPU memory interface; fabrics: Ethernet): Hyper-Converged, Converged, and "EBOD Future?". Roles shown: file or block access, device access, data mover workload, EBOD. Legend: boxes mark physical host boundaries, blue marks the workload on a physical node, and arrows can cross physical nodes.]

Page 9: EBOD Works for a Variety of Access Protocols & Topologies

• SMB3 "block"
• Lustre – object store
• Ceph – object store
• NVME Fabric?
• T10 Objects

[Diagram: a compute farm, a Storage MDS Service on Storage Processing Nodes (mem, CPU), and a Storage BE Service on the EBOD, over unreliable NV storage (direct attach: SATA, SAS, NVME, CPU memory interface; fabrics: Ethernet).]

Page 10: I have a problem… shared SAS interop

[Diagram: an example storage cluster (Microsoft's CPS). Four Scale-Out File Server nodes, each running Failover Clustering, the SMB3 service, Storage Spaces, and a file system over CSV v2 + CSV Cache and shared storage pool(s). Each node has two 10Gb-E RDMA ports and a SAS HBA with four SAS ports, cross-connected to four SAS JBOD chassis. Each chassis holds dual-port SAS disks behind two SAS expanders (2 in/out ports each).]

NOTE: Each JBOD chassis expander has 4 SAS links: 2 links connect to nodes, and 2 links provide MPIO connectivity in the case of a SAS expander failure.

SAS shared fabric: multi-initiator, multi-path.

My nightmare experience:
• Disk multi-path interop
• Expander multi-path interop
• HBA distributed failures

There is a good reason why customers prefer appliances!

Page 11: To Share or Not to Share?

Shared SAS:
• Customer deployment can have serious bugs
• Failure of an FE node: the JBOD fails over to another node
• Failure of a JBOD: all data is replicated

Non-Shared SAS (or SATA, NVME, or SCM):
• Customer deployment is more straightforward
• Failure of an FE node: the EBOD fails over to another node
• Failure of an EBOD: all data is replicated
• New Ethernet traffic (see the sketch below):
  • Triple replica (3x increase in bandwidth on Ethernet)
  • Rebuild traffic
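To make the "triple replica" line concrete, here is a minimal sketch (the numbers are hypothetical; it assumes every front-end write fans out synchronously to three EBODs over Ethernet, which is what moving replication off shared SAS implies):

```python
def backend_write_gbps(client_write_gbps: float, replicas: int = 3) -> float:
    """Ethernet bandwidth behind the FE node when each client write
    is copied to `replicas` separate EBODs."""
    return client_write_gbps * replicas

# 10 Gb/s of client writes -> 30 Gb/s of backend Ethernet traffic,
# the 3x increase called out above.
print(backend_write_gbps(10.0))  # 30.0
```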

Page 12: Hyper-Scale Cloud Tension – Fault Domain Rebuild Time

Rebuild time is a function of:
• The number of disks and size of each disk behind a node
• The speed of the network, and how much of it you want to use

Storage cost reduction is driving higher drive counts behind a node (>30 drives). That drives network costs up, because the rebuild must still complete in constant time.

  TB behind one node | % BW utilization | Net speed (Gb/s) | # nodes in rebuild | Min for rebuild
                 180 |               25 |               40 |                120 |              20
                1080 |               25 |               40 |                120 |             120
                1080 |               50 |              100 |                120 |              24

Large numbers of drives require extreme network bandwidth.

Conclusions:
• The required network speed offsets the benefits of greater density
• The fault domain for storage is too big
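The rows above follow from one formula: rebuild minutes = bits behind the failed node, divided by the aggregate network bandwidth devoted to rebuild. A minimal sketch that reproduces the table (function and variable names are illustrative, not from the deck):

```python
def rebuild_minutes(tb_behind_node: float, bw_util_pct: float,
                    net_gbps: float, nodes_in_rebuild: int) -> float:
    """Minutes to re-replicate a failed node's data, assuming the rebuild
    streams evenly across `nodes_in_rebuild` peers, each devoting
    `bw_util_pct` percent of a `net_gbps` link to rebuild traffic."""
    bits_to_move = tb_behind_node * 1e12 * 8                      # TB -> bits
    rebuild_bps = nodes_in_rebuild * net_gbps * 1e9 * bw_util_pct / 100
    return bits_to_move / rebuild_bps / 60

# Reproduces the table: 20, 120, and 24 minutes.
for row in [(180, 25, 40, 120), (1080, 25, 40, 120), (1080, 50, 100, 120)]:
    print(round(rebuild_minutes(*row)))
```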

Page 13: Private Cloud Tension – Not Enough Disks

• The goal is an entry point at 4 nodes (or fewer)
• If the same 30-disk JBOD is used, loss of one node implies loss of 30 disks (180 TB)
• To recover from node loss, 25% of capacity must sit idle for single-node fault tolerance, 50% for dual-node fault tolerance: losing 1 of 4 nodes removes a quarter of the raw capacity, and the survivors must have room to absorb it (see the sketch below)

Conclusion: the fault domain is too large.
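A quick check of those idle-capacity figures (a sketch; the 4-node and 180 TB numbers come from the slide):

```python
def idle_fraction(total_nodes: int, tolerated_failures: int) -> float:
    """Fraction of raw capacity that must stay idle so the surviving
    nodes can absorb the data of the failed ones."""
    return tolerated_failures / total_nodes

print(idle_fraction(4, 1))  # 0.25 -> 25% idle for single-node failure
print(idle_fraction(4, 2))  # 0.50 -> 50% idle for dual-node failure
```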

Page 14: Goals in Refactoring SDS

• Optimize workloads for the class of CPU
• The back end is a "data mover" (EBOD)
  • Primarily movement of data, plus background tasks (data integrity, caching, …)
  • Little processing power
• The front end is a "general-purpose CPU"
  • Still needs accelerators for data movement, data integrity, encryption, etc.

Page 15: EBOD Goals

• Reduce storage costs
  • Right-size the config for front-end and back-end workloads
  • Reduce the size of the fault domain (and rebuild traffic and network bandwidth requirements)
  • Small private clouds
• Move the storage fabric to Ethernet
• Build on the more robust ecosystem of DAS
• Keep the topology for the storage device simple

Page 16: EBOD Design Points

Ethernet-connected JBOD:
• High-end box (NVME, storage class memory)
• Volume box (some NVME/SSD, HDD)
• Capacity box (primarily HDD, some NVME/SSD)

What if we used a "small core" CPU? Fewer disks per box, because the CPU is cheaper.

Page 17: EBOD Volume Box

• Enable an Ethernet-connected JBOD with low disk count at *very* low cost
  • Critical to hit a price point similar to existing SAS JBODs that are integrated into the chassis
• Export just raw disks, to keep the CPU as simple as possible and SDS as close to the hardware as possible
  • Needed for storage class memory
• Enable front-end nodes (big core) to create reliable/available storage

Page 18: Comparing Storage Node Design Points

[Diagram: current hyper-scale JBOD vs. Ethernet-connected JBOD (EBOD). JBOD: a large-core CPU with a NIC and SAS HBA drives 60 disks through multiple expanders. EBOD: a large-core CPU with a NIC connects through an Ethernet switch and Ethernet cables to three EBODs, each an SOC + NIC + expander serving 20 disks (20 disks x 3).]

HBA replaced with:
• NIC on the storage node
• Switch port to the storage node
• Switch port to the EBOD

Expander replaced with:
• SOC
• NIC
• Smaller expanders

Switch in chassis?

Page 19: EBOD Volume Concept

• CPU and memory cost optimized for EBOD
• Dual attach >=10 GbE
• SOC with integrated RDMA NIC
• SATA/SAS/PCIE connectivity to ~20 devices
• Universal connector (SFF-8639)
• Management
  • Out-of-band management through a BMC (DMTF Redfish?)
  • In-band management with SCSI Enclosure Services

[Diagram: an EBOD with 4 NVME devices and 16 HDD drives behind SFF-8639 connectors, dual-port >=10 GbE, and a management port.]
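Since the slide floats DMTF Redfish for the management port, here is a minimal sketch of what out-of-band enclosure discovery over that port could look like. Only the /redfish/v1 service root and the Chassis collection are standard Redfish; the BMC address and credentials are hypothetical:

```python
import requests

BASE = "https://ebod-bmc.example.net"   # hypothetical BMC address
AUTH = ("admin", "password")            # hypothetical credentials

def get(path: str) -> dict:
    # Redfish resources are plain JSON over HTTPS.
    r = requests.get(BASE + path, auth=AUTH, verify=False)
    r.raise_for_status()
    return r.json()

# Walk the standard Chassis collection and report enclosure health.
for member in get("/redfish/v1/Chassis")["Members"]:
    chassis = get(member["@odata.id"])
    print(chassis.get("Name"), chassis.get("Status", {}).get("Health"))
```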

Page 20: Volume EBOD Proof Point

• Intel Avoton microserver
• PCIE Gen 2
• Chelsio 10 GbE NIC
• SAS HBA
• SAS SSDs

Page 21: EBOD Avoton Microserver POC, 8K Random Reads, IOPs

• Max remote performance: ~159K IOPS with RDMA, ~122K IOPS without RDMA
• Higher performance gain with RDMA at lower IOPs
• At higher queue depths, RDMA gains reduce to ~30%
• CPU% caps non-RDMA throughput at 122K IOPS (28 outstanding IOs / 7 SSDs): the CPU is the bottleneck
• With RDMA, remote access is bottlenecked on network speed

[Chart: 8K random reads (100%); local IOPs vs. remote IOPs with and without RDMA, y-axis in K IOPs (8 KB) up to ~300, x-axis outstanding IO/disk. RDMA performance gain by outstanding IO/disk: 1: 55%, 2: 47%, 4: 41%, 8: 35%, 16: 32%, 32: 31%, 64: 31%.]

Configuration:
• Intel Avoton microserver
• PCIE Gen 2
• Chelsio 10 GbE NIC
• SAS HBA
• SAS SSD
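The "bottlenecked on network speed" note can be sanity-checked with arithmetic. A sketch (it ignores protocol overhead and assumes a single 10 GbE port, so it is a rough ceiling rather than an exact one):

```python
def line_rate_iops(link_gbps: float, io_bytes: int) -> float:
    """Upper bound on IOPS one link can carry, ignoring protocol overhead."""
    return link_gbps * 1e9 / (io_bytes * 8)

# 8 KiB reads over 10 GbE: ~153K IOPS, the same ballpark as the ~159K
# the slide reports once the RDMA curve flattens, which is consistent
# with the network rather than the CPU being the limiter.
print(round(line_rate_iops(10, 8 * 1024)))  # 152588
```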

Page 22: EBOD Avoton Microserver POC, 8K Random Reads, Latency

• With RDMA, at least a 65% reduction in IO latency (~75% at higher IOPs)
• Remote latency drops at 4 outstanding IO/disk (7 disks), then goes back up at 8 outstanding IO/disk

[Chart: 8K random 100% reads; latency (msec, 0-4) vs. outstanding IO/disk (0-70), for local latency, RDMA remote latency, and no-RDMA remote latency.]

Page 23: EBOD Performance Concept

• Goal is the highest-performance storage, so a big CPU is a small part of total cost
  • A high-speed CPU reduces latency
• Dual attach >=40 GbE
• SOC with integrated RDMA NIC
• PCIE connectivity to ~20 devices
• Possibly all NVME attach, or storage class memory

[Diagram: an EBOD with 16 NVME drives or storage class memory, dual-port >=40 GbE, and a management port (DMTF Redfish?).]

See the "SMB3.1.1 Update" SDC talk for dual 100 GbE early results.

Page 24: Summary

• EBOD enables shared storage on Ethernet using DAS storage
  • DAS storage has better interop within the ecosystem
• The price point of EBOD must be carefully managed
  • A microserver CPU is viable for a broad spectrum of perf
  • An integrated SOC solution is preferred
• The low price point of the EBOD CPU/memory enables a smaller fault domain (fewer disks need sit behind the microserver)

Page 25: Q&A

Page 26: Outstanding Technical Issues

• Enclosure level
  • How to manage the storage enclosure? SCSI Enclosure Services (SES)?
  • Base management? DMTF Redfish or IPMI?
  • How to provision raw storage?
  • Liveness of drives/enclosure
• How to fence individual drives?
• Security model
• Advanced features in EBOD?
  • Low latency, integrity check, caching, …