Brent Gorda, General Manager, High Performance Data Division


TRANSCRIPT

Page 1:

Brent Gorda

General Manager, High Performance Data Division

Page 2:

Legal Disclaimer

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase.

For more complete information about performance and benchmark results, visit http://www.intel.com/performance.

Intel does not control or audit the design or implementation of third party benchmark data or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmark data are reported and confirm whether the referenced benchmark data are accurate and reflect performance of systems available for purchase.

Intel, Xeon, Xeon Phi and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.

*Other names and brands may be claimed as the property of others.

Copyright © 2016 Intel Corporation. All rights reserved.


Page 3:


Today: ML Data Movement with Lustre*

[Diagram: ML data movement pipeline]

– Collection (remote, ruggedized): 100’s MB/s of IO, terabytes of data
– Data ingress over the network infrastructure: asynchronous upload, formatting/authentication, security, up to GB/s of IO
– Data center / cloud with Lustre parallel file system: petabytes of data, 100’s GB/s of bandwidth, 20 GB/s+ per PB, 10’s to 100’s of PB, data growth > 2 PB/month
– Machine learning: rack-scale compute

Data = memories, CPUs = brain. More data == better results; more CPU = faster results.

Page 4:


Lustre with OpenStack is a growing area

https://www.openstack.org/videos/video/lustre-integration-for-hpc-on-openstack-at-cambridge-and-monash

Page 5:

HPE Scalable Storage with Intel Enterprise Edition for Lustre*

Designed for PB-Scale Data Sets

Density Optimized Design For Scale
• Dense Storage Design Translates to Lower $/GB
• Limitless Scale Capability – Solution Grows Linearly in …

Innovative Software Features

Leading Edge Yet Enterprise Ready Solution
• ZFS software RAID provides Snapshot, Compression & Error Correction
• ZFS reduces hardware costs with uncompromised performance
• Rigorously Tested for Stability & Efficiency

High Performance Storage Solution

Meets Demanding I/O Requirements
Performance measured for an Apollo 4520 building block:
• Up to 17 GB/s Reads / 15 GB/s Writes with EDR¹
• Up to 16 GB/s Reads and Writes with OPA¹
• Up to 21 GB/s Reads and 15 GB/s Writes with all SSDs²

“Appliance Like” Delivery Model

Pre-Configured Flexible Solution
• Deployment Services for Installation
• Simplified Wizard Driven Management through Intel Manager for Lustre*

[Diagram: HPE Scalable Storage with Intel EE for Lustre* – components include HPC Clients (Apollo 6000), Omni-Path fabric, DL360 (Intel Management Server), DL360 + MSA 2040, and storage built on Apollo 4520]

¹ Different conditions & workloads can affect performance. ² 24x 1.6 TB MU SSDs in A4520, no JBOD.

Page 6:

Page 7:


Byte-addressable Persistent Memory


[Figure: storage hierarchy plotted as frequency of access (hot / warm / cold) versus random access time. Rotational disk drives occupy the cold, slow end; NAND SSDs (SATA and PCIe/NVMe) the middle; byte-addressable 3D XPoint NVDIMMs the hot, fast end, closing the random access time gap.]

Page 8:


3D Xpoint + Omni-Path

Byte granular

Ultra low latency

– ~0.01 µS storage latency

– + ~1 µS network latency

= ~1 µS hardware latency

Conventional Storage Stack

Block/page granular and locking

High overhead

– ~1 µS kernel/user context switch

– + ~10 µS communications software

– + ~100 µS filesystem & block I/O stack

= ~100 µS software latency

Disruptive Technologies

Entirely masks HW capabilities!!!

~ = order of magnitude; µS = microseconds
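
To make the arithmetic above concrete, the sketch below simply adds the order-of-magnitude contributions quoted on this slide and prints how far the conventional software stack sits from the hardware capability. The numbers are the slide's; the code is only a worked illustration.

```c
/* Worked latency budget from the slide (order-of-magnitude values, in microseconds). */
#include <stdio.h>

int main(void)
{
    /* Hardware path: 3D XPoint + Omni-Path */
    double storage_us = 0.01;       /* ~0.01 us media latency */
    double network_us = 1.0;        /* ~1 us fabric latency   */
    double hw_us = storage_us + network_us;              /* ~1 us   */

    /* Conventional storage stack overheads */
    double ctx_switch_us = 1.0;     /* kernel/user context switch   */
    double comms_sw_us   = 10.0;    /* communications software      */
    double fs_block_us   = 100.0;   /* filesystem & block I/O stack */
    double sw_us = ctx_switch_us + comms_sw_us + fs_block_us;  /* ~100 us */

    printf("hardware latency  ~%.2f us\n", hw_us);
    printf("software latency  ~%.0f us\n", sw_us);
    printf("software/hardware ~%.0fx -- the stack masks the hardware\n",
           sw_us / hw_us);
    return 0;
}
```

With these figures the conventional software stack is roughly two orders of magnitude slower than the 3D XPoint + Omni-Path hardware path, which is the point of the slide.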

Page 9:


3D Xpoint + Omni-Path

Byte granular

Ultra low latency

– ~0.01 µS storage latency

– + ~1 µS network latency

= ~1 µS hardware latency

Exascale NEW Storage Stack

Arbitrary alignment and granularity

Ultra low overhead

– OS bypass comms + storage

– Shared nothing

Disruptive Technologies

Deliver HW performance!!! 100x/1000x increase in data velocity!!!

~ = order of magnitude; µS = microseconds

Page 10:


End-to-end OS bypass

Mercury userspace function shipping

– MPI equivalent communications latency

– Built over libfabric

Applications link directly with DAOS lib

– Direct call, no context switch

– No locking, caching or data copy

Userspace DAOS server

– Mmap non-volatile memory (NVML)

– NVMe access through SPDK*

– User-level thread with Argobots**

– FPGA offload

Lightweight Storage Stack

[Diagram: the HPC application links directly against the DAOS library and talks to the userspace DAOS server on the storage server over Mercury/libfabric; bulk transfers go directly to the server, which accesses NVRAM and NVMe.]

* https://01.org/spdk ** https://github.com/pmodels/argobots/wiki
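
As a loose, self-contained illustration of the OS-bypass idea above (not the DAOS, Mercury, or Argobots APIs; all names below are hypothetical), the sketch ships a write request from an application thread to a user-level server thread through a shared in-memory mailbox, so no system call or kernel context switch sits on the I/O path.

```c
/* Toy userspace "function shipping": an app thread hands a write request to a
 * user-level server thread through a single-slot mailbox. Illustrative only;
 * the real stack uses Mercury/libfabric RPCs and Argobots user-level threads. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <string.h>

#define OBJ_SIZE 4096

struct request {                 /* the "shipped" operation and its arguments */
    size_t offset, len;
    const char *data;
    atomic_int done;             /* completion flag polled by the caller */
};

static char pmem[OBJ_SIZE];              /* stand-in for persistent memory */
static struct request *_Atomic mailbox;  /* single-slot request queue */
static atomic_int stop_flag;

static void *server_main(void *arg)      /* user-level storage "server" */
{
    (void)arg;
    while (!atomic_load(&stop_flag)) {
        struct request *req = atomic_exchange(&mailbox, NULL);
        if (!req)
            continue;                    /* busy-poll: no syscalls, no blocking */
        memcpy(pmem + req->offset, req->data, req->len);
        atomic_store(&req->done, 1);
    }
    return NULL;
}

/* "Client library" call: a direct function call, no kernel context switch. */
static void toy_obj_write(size_t offset, const char *data, size_t len)
{
    struct request req = { .offset = offset, .len = len, .data = data };
    struct request *expected = NULL;

    atomic_init(&req.done, 0);
    while (!atomic_compare_exchange_weak(&mailbox, &expected, &req))
        expected = NULL;                 /* mailbox busy: retry */
    while (!atomic_load(&req.done))
        ;                                /* spin until the server completes it */
}

int main(void)
{
    pthread_t server;

    pthread_create(&server, NULL, server_main, NULL);
    toy_obj_write(0, "hello, daos", 12);
    printf("stored: %s\n", pmem);

    atomic_store(&stop_flag, 1);
    pthread_join(server, NULL);
    return 0;
}
```

In DAOS the same shape is achieved with Mercury function shipping over libfabric between nodes and Argobots user-level threads inside the server; the point is the same as in the toy: the request never has to cross into the kernel.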

Page 11:


Mix of storage technologies

NVRAM (3D Xpoint DIMMs)

– DAOS metadata & application metadata

– Byte-granular application data

NVMe (NAND, 3D NAND, 3D Xpoint)

– Cheaper storage for bulk data

– Multi-KB

I/Os are logged & inserted into persistent index

All I/O operations tagged/indexed by version

Non-destructive write: log blob@version

Consistent read: blob@version

No alignment constraints

Ultra-fine grained I/O index

[Diagram: extents indexed by version (= epoch), spanning NVRAM and NVMe; extents at versions v1, v2, v3 are shown as either being written or committed, with a read@v3 resolving against the committed extents.]
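
To give a rough feel for this versioned, non-destructive index, here is a toy sketch (my own simplification in plain C, not the DAOS on-media structures): every write appends an extent tagged with its epoch, and a read at version v applies only extents with epoch <= v, so writers never overwrite in place and readers never block them.

```c
/* Toy versioned extent log: non-destructive writes, consistent reads at a version.
 * Each write is appended with its epoch; read@v replays entries with epoch <= v. */
#include <stdio.h>
#include <string.h>

#define OBJ_SIZE 64
#define MAX_LOG  16

struct extent {
    unsigned epoch;            /* version tag                         */
    size_t   offset, len;      /* arbitrary alignment and granularity */
    char     data[OBJ_SIZE];
};

static struct extent log_[MAX_LOG];
static size_t nlog;

/* Non-destructive write: append, never overwrite in place. */
static void obj_write(unsigned epoch, size_t off, const char *buf, size_t len)
{
    struct extent *e = &log_[nlog++];    /* toy: no bounds or ordering checks */
    e->epoch = epoch;
    e->offset = off;
    e->len = len;
    memcpy(e->data, buf, len);
}

/* Consistent read@epoch: apply extents with epoch <= v in log order. */
static void obj_read(unsigned v, char *out)
{
    memset(out, '.', OBJ_SIZE);
    for (size_t i = 0; i < nlog; i++)
        if (log_[i].epoch <= v)
            memcpy(out + log_[i].offset, log_[i].data, log_[i].len);
}

int main(void)
{
    char buf[OBJ_SIZE + 1] = {0};

    obj_write(1, 0, "AAAAAAAA", 8);   /* v1 */
    obj_write(2, 4, "BBBB", 4);       /* v2 overlaps v1, no read-modify-write */
    obj_write(3, 2, "CCCCCC", 6);     /* v3 */

    obj_read(2, buf);                 /* sees v1 + v2, ignores v3 */
    printf("read@v2: %.16s\n", buf);
    obj_read(3, buf);
    printf("read@v3: %.16s\n", buf);
    return 0;
}
```

Running it prints read@v2: AAAABBBB........ and read@v3: AACCCCCC........, i.e. each reader sees a consistent snapshot at its chosen version.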

Page 12:


Scalable I/O

Lockless, no read-modify-write

Producers not blocked by consumers

– And vice-versa

Conflict resolution in I/O middleware

– No system-imposed, worst case serialization

– Ad hoc concurrency control mechanism

Scalable communications

Track process groups/jobs

– not individual compute nodes

Tree-based broadcast

Scalable metadata

Collective open/close

Tree-based caching, refcount, open handles, …
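
The tree-based broadcast and collective open/close above bound the fan-out seen by any one server: a request reaches N processes in O(log N) hops instead of N point-to-point messages hitting a single metadata server. A minimal sketch of the k-ary tree arithmetic involved (hypothetical helpers, not DAOS code):

```c
/* k-ary broadcast tree over ranks 0..n-1: parent/children arithmetic only. */
#include <stdio.h>

#define FANOUT 4

static int tree_parent(int rank)
{
    return rank == 0 ? -1 : (rank - 1) / FANOUT;
}

static void tree_children(int rank, int n, int *child, int *nchild)
{
    *nchild = 0;
    for (int i = 1; i <= FANOUT; i++) {
        int c = rank * FANOUT + i;
        if (c < n)
            child[(*nchild)++] = c;
    }
}

int main(void)
{
    int n = 16, child[FANOUT], nchild;

    /* Each rank forwards the broadcast (e.g. a collective open) to its children. */
    for (int rank = 0; rank < n; rank++) {
        tree_children(rank, n, child, &nchild);
        printf("rank %2d (parent %2d) -> ", rank, tree_parent(rank));
        for (int i = 0; i < nchild; i++)
            printf("%d ", child[i]);
        printf("\n");
    }
    return 0;
}
```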

Distributed/global transactions

No object metadata maintained by default

Shared-nothing distribution schema

Algorithmic placement

– Scales with # storage nodes

Data-driven placement

– Scales with volume of data

Performance domains

Extreme Scale-out
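
"Algorithmic placement" above means any client can compute an object's location purely from its identifier, with no per-object layout metadata to store or fetch, which is why it scales with the number of storage nodes. A toy stand-in using a simple hash (not the actual DAOS placement algorithm, which also accounts for performance and fault domains):

```c
/* Toy algorithmic placement: shard -> storage target computed from the object ID,
 * no per-object metadata or central lookup required. */
#include <stdint.h>
#include <stdio.h>

/* 64-bit mix (splitmix64 finalizer); any good hash would do. */
static uint64_t mix64(uint64_t x)
{
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

/* Every client computes the same target for (object, shard). */
static unsigned place_shard(uint64_t oid, unsigned shard, unsigned ntargets)
{
    return (unsigned)(mix64(oid ^ ((uint64_t)shard << 48)) % ntargets);
}

int main(void)
{
    unsigned ntargets = 8;                   /* number of storage nodes/targets */
    uint64_t oid = 0x00c0ffee;

    for (unsigned shard = 0; shard < 4; shard++)
        printf("object %#llx shard %u -> target %u\n",
               (unsigned long long)oid, shard, place_shard(oid, shard, ntargets));
    return 0;
}
```

A plain modulo reshuffles almost everything when the target count changes; real systems use consistent-hash-style algorithms to limit rebalancing, and data-driven placement complements this where layouts must scale with the volume of data rather than the node count.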

Page 13:


DAOS Ecosystem

DAOS (Apache 2.0) underneath HPC applications, storage services/engines, machine learning, …

Interfaces and consumers:
– HDF5 + extensions (HPC)
– Legion (HPC)
– NetCDF (HPC)
– USD (computer animation)
– HDFS/Spark (analytics)
– Cloud integration
– POSIX (HPC)
– KV store (various)
– MPI-IO (HPC)

Page 14:

DAOS Roadmap

ESSIO (Q3’15 - Q2’17)
– Alpha quality
– N-way replication
– Online rebuild
– Multi-tier prototype

Follow-on projects
– DAOS productization
– Next-gen HW support
– System integration

Future
– Progressive layout
– Erasure code
– Security model

Page 15:


DAOS Resources (Apache 2.0)

Public git repository

git clone http://review.whamcloud.com/daos/daos_m

Browsable source code: http://git.whamcloud.com/daos/daos_m.git

Mirror on github: https://github.com/daos-stack/daos

Released under the open-source Apache 2.0 license

Leveraging other open source projects

– DoE: Mercury, Argobots

– Intel: Libfabric, PMIx, NVML, SPDK, ISA-L

Page 16:

Page 17:

High IOPS Lustre Configurations with Intel SSDs

Utilize single-port SSDs to build a high-performance Lustre scratch file system.

Design with SATA SSDs for object store targets and NVMe SSDs for metadata targets.

An all-flash configuration on the Intel Endeavor cluster, built from 350 SSDs, delivers remarkable performance for parallel access needs:

– 4x throughput with SSDs, at 44 GB/s
– Base cost roughly 4x lower than a commercial HDD solution
– Continued interleaved scaling beyond 32 clients (IOzone*)

https://www-ssl.intel.com/content/www/us/en/solid-state-drives/hpc-ssd-dc-family-lustre-file-system-study.html

LFS09 – Intel SSD DC S3500 Series based Lustre* system
– 1x Metadata Server (MDS): 2x Intel® Xeon® Processor E5-2680 + 64 GB RAM + 2x SATA RAID0 MDT, FDR InfiniBand* (56 Gb/s)
– 8x Storage Servers (OSS): 2x Intel® Xeon® Processor E5-2680 + 64 GB RAM, 1x Intel® SSD 320 Series for OS, 3x RAID controllers with 8 SAS/SATA targets each
– 6x OSTs per server, each target = 4x Intel® SSD DC S3500 Series @ 600 GB; SSDs ‘over-provisioned’ to 75% for 450 GB usable
– Software stack: Red Hat* Enterprise Linux* 6.4 + Lustre* 2.1.5

*Other names and brands may be claimed as the property of others.

+Intel® SSD DC S3500 Series


Page 18:

Emerging trends

Vast majority of storage objects are tiny…

Kilobytes and smaller

Trending to larger proportion of system capacity

High performance => Billions of IOPS @ lowest possible latency

Resilience => Replication improves concurrency/scalability @ acceptable storage overhead

Poorly supported by today’s filesystems

…but vast majority of space is used by large storage objects

Megabytes and larger

High performance => efficient streaming @ Terabytes per second

Resilience => Erasure codes required for space efficiency & to limit system cost
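
The split above is ultimately a space-versus-concurrency calculation: n-way replication of tiny objects costs (n-1)x extra space but gives n independently readable copies, while a k+m erasure code on large objects costs only m/k extra space. A quick worked comparison with illustrative parameters (3-way replication vs. a 10+2 code, my choice rather than the slide's):

```c
/* Storage overhead: n-way replication vs. a k+m erasure code. */
#include <stdio.h>

int main(void)
{
    int n = 3;                 /* 3-way replication for small, hot objects  */
    int k = 10, m = 2;         /* 10+2 erasure code for large, cold objects */

    double repl_overhead = (double)(n - 1) * 100.0;        /* % extra space */
    double ec_overhead   = (double)m / k * 100.0;

    printf("%d-way replication : +%.0f%% space, %d readable copies\n",
           n, repl_overhead, n);
    printf("%d+%d erasure code  : +%.0f%% space, tolerates %d failures\n",
           k, m, ec_overhead, m);
    return 0;
}
```

With these parameters replication costs +200% space while the erasure code costs +20%, which is why large objects favor erasure coding and tiny, hot objects favor replication.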
