designing high-performance, resilient and …...designing high-performance, resilient and...

16
Designing High-Performance, Resilient and Heterogeneity-Aware Key-Value Storage for Modern HPC Clusters Dipti Shankar Dr. Dhabaleswar K. Panda (Advisor) Dr. Xiaoyi Lu (Co-Advisor) Department of Computer Science & Engineering The Ohio State University

Upload: others

Post on 07-Jun-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Designing High-Performance, Resilient and …...Designing High-Performance, Resilient and Heterogeneity-Aware Key-Value Storage for Modern HPC Clusters Dipti Shankar Dr. DhabaleswarK

Designing High-Performance, Resilient and Heterogeneity-Aware Key-Value Storage for

Modern HPC Clusters

Dipti Shankar

Dr. Dhabaleswar K. Panda (Advisor)Dr. Xiaoyi Lu (Co-Advisor)

Department of Computer Science & EngineeringThe Ohio State University

Page 2: Designing High-Performance, Resilient and …...Designing High-Performance, Resilient and Heterogeneity-Aware Key-Value Storage for Modern HPC Clusters Dipti Shankar Dr. DhabaleswarK

SC 2018 Doctoral Showcase 2Network Based Computing Laboratory

Outline

• Introduction and Problem Statement

• Research Highlights and Results• Broader Impact

• Conclusion & Future Avenues

Page 3: Designing High-Performance, Resilient and …...Designing High-Performance, Resilient and Heterogeneity-Aware Key-Value Storage for Modern HPC Clusters Dipti Shankar Dr. DhabaleswarK

SC 2018 Doctoral Showcase 3Network Based Computing Laboratory

Key-Value Storage in HPC and Data Centers• General purpose distributed memory-centric storage

– Allows to aggregate spare memory from multiple nodes (e.g, Memcached)

• Accelerating Online and Offline Analytics in High-Performance Compute (HPC) environments

• Our Basis: Current High-performance and hybrid key-value stores for modern HPC clusters

v High-Performance Network Interconnects (e.g., InfiniBand) v Low end-to-end latencies with IP-over-InfiniBand

(IPoIB) and Remote Direct Memory Access (RDMA)

v ‘DRAM+SSD’ hybrid memory designs

v Extend storage capabilities beyond DRAM capabilities using high-speed SSDs

!"#$%"$#&

Web Frontend Servers (Memcached Clients)

(Memcached Servers)

(Database Servers)

High Performance

Networks

High Performance

Networks

(Offline Analytical Workloads: Software-Assisted Burst-Buffer )

(Online Analytical Workloads: OLTP/NoSQL Query Cache)

Page 4: Designing High-Performance, Resilient and …...Designing High-Performance, Resilient and Heterogeneity-Aware Key-Value Storage for Modern HPC Clusters Dipti Shankar Dr. DhabaleswarK

SC 2018 Doctoral Showcase 4Network Based Computing Laboratory

Drivers of Modern HPC Cluster Architectures

• Multi-core/many-core technologies

• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)

• Solid State Drives (e.g., PCIe/NVMe-SSDs), NVRAM (e.g., PCM, 3DXpoint), Parallel Filesystems (e.g., Lustre)

• Accelerators (e.g., NVIDIA GPGPUs)

• Production-scale HPC Clusters: SDSC Comet, TACC Stampede, OSC Owens, etc.

High Performance

Interconnects

Multi-core Processors with

vectorization support +

Accelerators (GPUs)

Large Memory

Nodes (DRAM)SSD, NVMe-SSD, NVRAM

Page 5: Designing High-Performance, Resilient and …...Designing High-Performance, Resilient and Heterogeneity-Aware Key-Value Storage for Modern HPC Clusters Dipti Shankar Dr. DhabaleswarK

SC 2018 Doctoral Showcase 5Network Based Computing Laboratory

• Our focus: Holistic approach to designing a high-performance and resilient

key-value storage for current and emerging HPC systems

– RDMA-enabled networking

– Hybrid NVM storage-awareness: High-speed NVMe-/PCIe-SSDs and upcoming

NVRAM technologies

– Heterogeneous compute devices (e.g., SIMD capabilities of CPUs and GPUs)

• Goals: Maximize end-to-end performance to

– Optimally exploit available compute/storage/network capabilities

– Enable data-intensive Big Data applications to leverage memory-centric key-value

storage to improve their overall performance

Holistic Approach for HPC-Centric KVS

Page 6: Designing High-Performance, Resilient and …...Designing High-Performance, Resilient and Heterogeneity-Aware Key-Value Storage for Modern HPC Clusters Dipti Shankar Dr. DhabaleswarK

SC 2018 Doctoral Showcase 6Network Based Computing Laboratory

Research Framework

High­Performance, Hybrid and Resilient Key Value Storage 

Application WorkloadsOnline 

(E.g., SQL/NoSQL query cache/ LinkBench and TAO Graph KV workloads)

Offline (E.g., Burst­Buffer over PFS for Hadoop for

MapReduce/Spark)

Non­Blocking RDMA­aware API Extensions

Fast Online Erasure Coding (Reliability)

NVRAM­aware CommunicationProtocols (Persistence)

Accelerations on SIMD­based(e.g., GPU) Architecture

(Scalability)

Heterogeneity­Aware (DRAM/NVRAM/NVMe/PFS) Key­Value Storage Engine 

Modern HPC System Architecture

Volatile/Non­VolatileMemory Technologies (DRAM, NVRAM)

Multi­CoreNodes 

(w/ SIMD units)GPUs

RDMA­CapableNetworks (InfiniBand,

40Gbe RoCE)

Parallel FileSystem (Lustre)

Local Storage(PCIe SSD/  NVMe­SSD)

Page 7: Designing High-Performance, Resilient and …...Designing High-Performance, Resilient and Heterogeneity-Aware Key-Value Storage for Modern HPC Clusters Dipti Shankar Dr. DhabaleswarK

SC 2018 Doctoral Showcase 7Network Based Computing Laboratory

High-Performance Non-Blocking API Semantics

v Heterogeneous Storage-Aware Key-Value Stores (e.g., ‘DRAM + PCIe/NVMe-SSD’)

• Higher data retention at the cost of SSD I/O; suitable for out-of-memory scenarios

• Performance limited by Blocking API semantics

v Goals: Achieve near in-memory speeds while being able to exploit hybrid memory

v Approach: Novel Non-blocking API Semantics to extend RDMA-Libmemcached library

• memcached_(iset/iget/bset/bget) APIs for SET/GET

• memcached_(test/wait) APIs for progressing communication

D. Shankar, X. Lu , N. Islam , M. W. Rahman , and D. K. Panda, “High-Performance Hybrid Key-Value Store on Modern Clusters with RDMA Interconnects and

SSDs: Non-blocking Extensions, Designs, and Benefits”, IPDPS 2016

BLOCKINGAPI

NON-BLOCKINGAPIREQ.

NON-BLOCKINGAPIREPLY

CLIENT

SERV

ER

HYBRIDSLABMANAGER(RAM+SSD)

RDMA-ENHANCEDCOMMUNICATIONLIBRARY

RDMA-ENHANCEDCOMMUNICATIONLIBRARY

LIBMEMCACHEDLIBRARY

BlockingAPIFlow Non-BlockingAPIFlowNonB-b

Test/Wait

NonB-i

Page 8: Designing High-Performance, Resilient and …...Designing High-Performance, Resilient and Heterogeneity-Aware Key-Value Storage for Modern HPC Clusters Dipti Shankar Dr. DhabaleswarK

SC 2018 Doctoral Showcase 8Network Based Computing Laboratory

High-Performance Non-Blocking API Semantics

• Set/Get Latency with Non-Blocking API: Up to 8x gain in overall latency vs. blocking API semantics over RDMA+SSD hybrid design

• Up to 2.5x gain in throughput observed at client; Ability to overlap request and response phases to hide SSD I/O overheads

0

500

1000

1500

2000

2500

Set Get Set Get Set Get Set Get Set Get Set GetIPoIB-Mem RDMA-Mem H-RDMA-Def H-RDMA-Opt-

Block H-RDMA-Opt-

NonB-i H-RDMA-Opt-

NonB-b

Aver

age

Late

ncy

(us)

Miss Penalty (Backend DB Access Overhead)Client WaitServer ResponseCache UpdateCache check+Load (Memory and/or SSD read)Slab Allocation (w/ SSD write on Out-of-Mem)

Overlap SSD I/O Access (near in-memory latency)

Page 9: Designing High-Performance, Resilient and …...Designing High-Performance, Resilient and Heterogeneity-Aware Key-Value Storage for Modern HPC Clusters Dipti Shankar Dr. DhabaleswarK

SC 2018 Doctoral Showcase 9Network Based Computing Laboratory

v Erasure Coding (EC): A low storage-overhead alternative to Replication

v Bottlenecks: (1) Encode/decode compute overheads (2) Communication overhead of scattering/ gathering distributed data/parity chunks

v Goal: Making Online EC viable for key-value stores

v Approach: Non-blocking RDMA-aware semantics to enable compute/communication overlap

v Encode/Decode offload capabilities integrated into Memcached client (CE/CD) and server (SE/SD)

Fast Online Erasure Coding with RDMA

EC

Perf

orman

ce

Ideal Scenario

MemoryEfficient

High Performance

Memory Overhead

No FT

Replication

RS (3,2) => 66% Storage Overhead vs 200% of Replication-Factor=3

Page 10: Designing High-Performance, Resilient and …...Designing High-Performance, Resilient and Heterogeneity-Aware Key-Value Storage for Modern HPC Clusters Dipti Shankar Dr. DhabaleswarK

SC 2018 Doctoral Showcase 10Network Based Computing Laboratory

Fast Online Erasure Coding with RDMA

D. Shankar , X. Lu , and D. K. Panda, “High-Performance and Resilient Key-Value Store with Online Erasure Coding for Big Data Workloads”, 37th International Conference on Distributed Computing Systems (ICDCS 2017)

0

500

1000

1500

2000

2500

3000

3500

512 4K 16K 64K 256K 1M

Tota

l Set

Lat

ency

(us

)

Value Size (Bytes)

Sync-Rep=3 Async-Rep=3

Era(3,2)-CE-CD Era(3,2)-SE-SD

Era(3,2)-SE-CD

~1.6x

~2.8x

0

50

100

150

200

250

4K 16K 32K 4K 16K 32K

YCSB-A YCSB-B

Ag

g. T

hro

ug

hp

ut

(Ko

ps/

sec)

Memc-RDMA-NoRep Async-Rep=3Era(3,2)-CE-CD Era(3,2)-SE-CD

~1.5x~1.34x

• Experiments with YCSB for Online EC vs. Async. Rep:

• 150 Clients on 10 nodes on SDSC Comet Cluster (IB FDR + 24-core Intel Haswell) over 5-node RDMA-Memcached Cluster

• (1) CE-CD gains ~1.34x for Update-Heavy workloads; SE-CD on-par (2) CE-CD/SE-CD on-par for Read-Heavy workloads

Intel Westmere + QDR-IB ClusterRep=3 vs. RS (3,2)

Page 11: Designing High-Performance, Resilient and …...Designing High-Performance, Resilient and Heterogeneity-Aware Key-Value Storage for Modern HPC Clusters Dipti Shankar Dr. DhabaleswarK

SC 2018 Doctoral Showcase 11Network Based Computing Laboratory

• Offline Data Analytics Use-Case: Hybrid and resilient key-value store-based Burst-Buffer system Over Lustre (Boldio)

• Overcome local storage limitations on HPC nodes; performance of `data locality’

• Light-weight transparent interface to Hadoop/ Spark applications

• Accelerating I/O-intensive Big Data workloads

– Non-blocking RDMA-Libmemcached APIs to maximize overlap

– Client-based replication or Online Erasure Coding with RDMA for resilience

– Asynchronous persistence to Lustre parallel file system at RDMA-Memcached Servers

D. Shankar, X. Lu, D. Panda, Boldio: A Hybrid and Resilient Burst-Buffer over Lustre for Accelerating Big Data I/O, IEEE International Conference on Big

Data 2016 (Short Paper)

Co-Designing Key-Value Store-based Burst Buffer over PFS

Page 12: Designing High-Performance, Resilient and …...Designing High-Performance, Resilient and Heterogeneity-Aware Key-Value Storage for Modern HPC Clusters Dipti Shankar Dr. DhabaleswarK

SC 2018 Doctoral Showcase 12Network Based Computing Laboratory

Co-Designing Key-Value Store-based Burst Buffer over PFS

• TestDFSIO on SDSC Gordon Cluster (16-core Intel Sandy Bridge and IB QDR) with 16-node MapReduce Cluster + 4-node Boldio Cluster

• Boldio can sustain 3x and 6.7x gains in read and write throughputs over stand-alone Lustre

0

20000

40000

60000

80000

60 GB 100 GB 60 GB 100 GB

Write Read

Agg.Throu

ghpu

t(MBp

s) Lustre-Direct Boldio

6.7x

3x

0

10000

20000

30000

40000

50000

60000

Write Read

DF

SIO

Agg

. T

hro

ugh

pu

t (M

Bp

s)

Lustre-DirectAlluxio-RemoteBoldio_Async-Rep=3Boldio_Online-EC

No EC overhead (Rep=3 vs. RS (3,2))

• TestDFSIO on Intel Westmere Cluster (8-core Intel Sandy Bridge and IB QDR); 8-node MapReduce Cluster + 5-node Boldio Cluster over Lustre

• Performance gains over designs like Alluxio(formerly Tachyon) in HPC environments with no local storage

Page 13: Designing High-Performance, Resilient and …...Designing High-Performance, Resilient and Heterogeneity-Aware Key-Value Storage for Modern HPC Clusters Dipti Shankar Dr. DhabaleswarK

SC 2018 Doctoral Showcase 13Network Based Computing Laboratory

Exploring Opportunities with NVRAM and RDMA

DRAM HCA 

Mem.  Cntrl 

DRAM NVRAM 

I/O  Cntrllr 

L3 

PCIe Bus 

HCA 

Mem.Cntrl

NVRAM 

I/O Cntrllr

L3 

PCIe Bus  Du

rab

ility P

ath

RDMA into NVRAM RDMA from NVRAMLocal Durability Point

Core L1/L2 

Core L1/L2 

Core L1/L2 

Core L1/L2 

Node1 (Initiator/Target)Core L1/L2 

Core L1/L2 

Core L1/L2 

Core L1/L2 

Node0 (Initiator/Target)• Emerging Non-volatile Memory technologies

(NVRAM)

• Potential: Byte-addressable and persistent; capable of RDMA

• Observations: RDMA writes into NVRAM needs to guarantee remote durability

• Appliance Method: Hardware-Assisted remote direct persistence

• General Purpose Server Method: Server-Assisted software-based persistence

• Opportunities: Designing high-performance RDMA-based Persistence Protocols (e.g., Persistent In-Memory KVS over NVRAM) D. Shankar, X. Lu, D. Panda, RDMP-KVS: Remote Direct Memory

Persistence Aware Key-Value Store for NVRAM Systems (Under

Submission)

Page 14: Designing High-Performance, Resilient and …...Designing High-Performance, Resilient and Heterogeneity-Aware Key-Value Storage for Modern HPC Clusters Dipti Shankar Dr. DhabaleswarK

SC 2018 Doctoral Showcase 14Network Based Computing Laboratory

• RDMA for Apache Spark (RDMA-Spark), Apache Hadoop 2.x (RDMA-Hadoop-2.x), RDMA for Apache HBase

• RDMA for Memcached (RDMA-Memcached)

– RDMA-aware `DRAM+SSD’ hybrid Memcached server design

– Non-Blocking RDMA-based Client API designs (RDMA-Libmemcached)

– Based on Memcached 1.5.3 and Libmemcached client 1.0.18

– Available for InfiniBand and RoCE

• OSU HiBD-Benchmarks (OHB)– Memcached Set/Get Micro-benchmarks for Blocking and Non-Blocking APIs, and Hybrid

Memcached designs

– YCSB plugin for RDMA-Memcached

– Also includes HDFS, HBase, Spark Micro-benchmarks

• http://hibd.cse.ohio-state.edu

• Users Base: 290 organizations, 34 countries, 27,950 downloads

Broader Impact: The High-Performance Big Data (HiBD) Project

Page 15: Designing High-Performance, Resilient and …...Designing High-Performance, Resilient and Heterogeneity-Aware Key-Value Storage for Modern HPC Clusters Dipti Shankar Dr. DhabaleswarK

SC 2018 Doctoral Showcase 15Network Based Computing Laboratory

Conclusion & Future Avenuesv Holistic approach to designing key-value storage by exploiting the capabilities of HPC

clusters for (1) performance, (2) scalability, and, (3) data resilience/availability

v RDMA-capable Networks: (1) Proposed Non-blocking RDMA-based Libmemcached APIs

(2) Fast Online EC-based RDMA-Memcached designs

v Heterogeneous Storage-Awareness: (1) Leverage `RDMA+SSD’ hybrid designs, (2)

`RDMA+NVRAM’ Persistent Key-Value Storage

v Application Co-Design: Memory-centric data-intensive applications on HPC Clusters

v Online (e.g., SQL query cache, YCSB) and Offline Data Analytics (e.g., Boldio Burst-Buffer for

Hadoop I/O)

v Future Work: Ongoing work in this thesis direction

v Heterogeneous compute capabilities of CPU/GPU: End-to-end SIMD-aware KVS designs

v Exploring co-design of (1) Read-intensive Graph-based workloads (E.g., LinkBench, RedisGraph)

(2) Key-value storage engine for ML Parameter Server frameworks

Page 16: Designing High-Performance, Resilient and …...Designing High-Performance, Resilient and Heterogeneity-Aware Key-Value Storage for Modern HPC Clusters Dipti Shankar Dr. DhabaleswarK

SC 2018 Doctoral Showcase 16Network Based Computing Laboratory

[email protected]

http://www.cse.ohio-state.edu/~shankard

Thank You!

Network-Based Computing Laboratoryhttp://nowlab.cse.ohio-state.edu/

The High-Performance Big Data Projecthttp://hibd.cse.ohio-state.edu/