Taming I/O-hungry Application Beasts with BeeGFS & NVMesh
Sven Breuner
Field CTO
[email protected]
About Your Speaker: Sven Breuner
M.Sc. in Computer Science, specialization in Distributed Algorithms & Applications
Joined Fraunhofer Center for High Performance Computing in 2005 to create a new parallel file system, which later became BeeGFS
Co-Founder, CEO & CTO of ThinkParQ in 2013 to provide professional services and to extend BeeGFS for use-cases beyond HPC
Joined Excelero in 2019 as Field CTO to be at the forefront of the paradigm shift to the new generation of NVMe-based storage systems
Why is everyone talking about NVMe storage these days?

• Ultra low latency & high throughput storage to…
  • Enable deeper analytics & more realistic simulations for better insights
  • Complete critical tasks in shorter time
  • Allow for better parallelism in multi-user environments
  • Enable a new class of efficient algorithms that are not designed around avoiding disk seeks
• Stay ahead of the competition

But: While it's easy to throw some NVMe drives at a server these days, it's usually considered hard to make them play well with the cluster.
Ingredients for ultra low latency & high throughput storage
• Hardware technology for ultra low latency & high throughput: NVMe
• A fast network: Ideally InfiniBand or 100GbE
• Software for network access to NVMe, designed for ultra low latency: NVMesh
• Software for scale-out volumes on top of NVMe drives: NVMesh
• Flexibility to…
  • Have a good system balance (e.g. network interconnect vs drives per server)
  • Pick the hardware (models & amount) that works best for you
  • Keep your solution affordable
  ⇨ Software-defined storage (NVMesh + BeeGFS)
• A cluster file system that nicely handles…
  • Small & large files, different access patterns, various system sizes, concurrent access
  ⇨ BeeGFS
• Easy management, so that you don't have to hug your storage every day to keep it running
  ⇨ NVMesh + BeeGFS
Enter NVMesh…
Turn individual NVMe drives into something that's actually useful for a cluster
Ingredient #1: Remote NVMe Access at Local Latency
Ingredient #2: Scale-out Volumes

[Diagram: I/O-hungry application beasts on Linux servers, using local or parallel file systems on top of logical volumes with multi-pathing, built on ultra fast remote NVMe access across the servers]
Ingredient #3: Flexible Data Protection

• No Redundancy: Concatenated, Striped (Distributed RAID0)
• Mirroring: Mirrored (Distributed RAID1), Striped & Mirrored (Distributed RAID10)
• Parity-based: N+M Erasure Coding (Distributed RAID5 / RAID6) (New! 90+% usable capacity)
Ingredient #4: Deployment Your Way

Converged - Local Storage in Servers:
• Single, unified storage pool
• NVMesh runs on all nodes
• NVMesh bypasses server CPU
• Rotating parity
• Linearly scalable

Top-of-Rack Flash:
• Single, unified storage pool
• NVMesh Target runs on storage nodes
• NVMesh Client runs on application servers
• Applications get performance of local storage
• Rotating parity
• Linearly scalable
Enter BeeGFS…
Turn fast block storage into fast file storage…
Ingredient #5: Turn fast blocks into fast files
BeeGFS is…
a hardware-independent parallel cluster filesystem, designed for performance-critical environments.

[Diagram: File #1, File #2 and File #3 under /mnt/beegfs/dir1 are striped in chunks across Storage Servers #1-#5, with metadata (M) on Metadata Server #1]

Simply grow CAPACITY and PERFORMANCE to the level that you need.
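For the curious, a hedged illustration of the striping idea above: BeeGFS exposes stripe settings through the beegfs-ctl tool (the path and values below are examples, and exact options can differ between BeeGFS releases; see beegfs-ctl --help):

    # Show how an existing file or directory is striped (example path):
    beegfs-ctl --getentryinfo /mnt/beegfs/dir1

    # Stripe new files in this directory across 4 storage targets in 1m chunks:
    beegfs-ctl --setpattern --numtargets=4 --chunksize=1m /mnt/beegfs/dir1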
BeeGFS Services Overview
▪ CLIENT SERVICE
  ▪ Native Linux module to mount the file system
▪ STORAGE SERVICE
  ▪ Store the (distributed) file contents
▪ METADATA SERVICE
  ▪ Maintain striping information for files
  ▪ Not involved in data access between file open/close
▪ MANAGEMENT SERVICE + GUI
  ▪ Service registry and watchdog
▪ BeeGFS servers assume fast, locally-attached, protected volumes to deliver best-in-class performance ⇨ NVMesh
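As a minimal sketch of how the client service is typically configured (paths are the common defaults, not anything shown on this slide): the client kernel module mounts whatever is listed in /etc/beegfs/beegfs-mounts.conf, one mountpoint/config pair per line:

    # /etc/beegfs/beegfs-mounts.conf
    # <mountpoint> <client config file>
    /mnt/beegfs /etc/beegfs/beegfs-client.conf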
BeeOND: BeeGFS-On-Demand for NVMes you already have
Create parallel file system instances on-the-fly

▪ Start/stop with one simple command (see the sketch below)
▪ Can be integrated into cluster batch system (Slurm, Univa, PBS, …)
▪ Common use case: per-job parallel file system
  ▪ Aggregate the performance and capacity of local drives in compute nodes of a job
  ▪ Take load from global storage
  ▪ Speed up "nasty" I/O patterns
▪ NVMesh flexibly creates the volumes and makes them available where needed
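A minimal sketch of that one-command start/stop (the node file and paths are hypothetical examples; flag semantics can vary between BeeGFS releases, so check beeond help before relying on them):

    # Start a BeeOND instance on all hosts listed in nodefile, using
    # /mnt/nvmesh_vol on each node as storage and mounting at /mnt/beeond:
    beeond start -n nodefile -d /mnt/nvmesh_vol -c /mnt/beeond

    # Tear the instance down again after the job (cleanup flags as in the
    # BeeOND docs; verify with beeond help):
    beeond stop -n nodefile -L -d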
Turbo-boosted Storage with BeeGFS & NVMesh
Sven Breuner
Field CTO
[email protected]
BeeGFS + NVMesh: How are they related?
[Diagram: BeeGFS on top of NVMesh logical volumes under Linux, serving efficient applications & happy users]
How to boost Storage with BeeGFS & NVMesh
BeeGFS runs seamlessly on top of NVMesh

• Various booster options
  • Metadata target, Storage Pool, BeeOND, all-flash system
• Data protection
  • NVMesh adds improved mirroring and erasure coding to BeeGFS
• Logical volumes
  • NVMe drives can be shared for BeeGFS and other use-cases
• Easy & flexible monitoring
  • Grafana dashboards can show BeeGFS & NVMesh workloads
Typical AI System with BeeGFS + NVMesh in a Box
The DGX-2s here are very happy not to be starving on I/O :-)
Recipe for balanced NVMe Servers

What to keep in mind for perfect balance:
• 48 PCIe lanes per Intel socket (96 PCIe lanes for dual socket)
• 16 PCIe lanes per 100Gbps NIC
• 4 PCIe lanes per NVMe drive
• To avoid crossing sockets, you are limited to 1 NIC per socket (because 2 NICs would consume more than half the available lanes)

⇨ Thus, 1 NIC + 4 NVMe drives per Intel socket are optimal (2 NICs + 8 NVMe drives per dual-socket server); the short sketch below works out the numbers.
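A tiny Python sketch of that lane arithmetic (my own illustration, using the per-device lane counts from the slide):

    # PCIe lane budget per Intel socket, with the slide's numbers.
    LANES_PER_SOCKET = 48
    LANES_PER_NIC = 16   # one 100Gbps NIC
    LANES_PER_NVME = 4   # one NVMe drive

    def lanes_used(nics: int, nvmes: int) -> int:
        return nics * LANES_PER_NIC + nvmes * LANES_PER_NVME

    print(lanes_used(1, 4))  # 32 of 48 lanes -> 1 NIC + 4 NVMe fits on one socket
    print(lanes_used(2, 0))  # 32 lanes for 2 NICs alone -> more than half of 48,
                             # which is why more than 1 NIC per socket won't balance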
BeeGFS & NVMesh in a Box
Balanced Reference Design
• Elegant and dense
  • 4x server in 2U, 24x NVMe, 8x 100Gbps NIC
• Always on
  • Nicely goes with RAID10 or RAID6 (6+2)
• Fully unleashed NVMe performance
  • Random 4K write IOPS boosted to <imagine_crazy_high_number_here>
  • File creates boosted to <imagine_insanely_cool_number_here>
BeeGFS & NVMesh Performance "out of the box"
Clients / Compute Nodes
4-Node 2U Server with 24 NVMe Drives
8x BeeGFS Client, each:
• 2x Intel 2630 CPU, 128 GB RAM
• 1x Mellanox ConnectX-4 100Gb InfiniBand

4x BeeGFS & NVMesh Storage Node, each:
• 6x WD SN200 3.2TB NVMe
• 2x Intel Xeon 4114 CPU, 192GB RAM
• 2x Mellanox ConnectX-5 100Gb InfiniBand

BeeGFS Storage Node Setup, each:
• 4x 100GB MeshProtect 10 volumes for metadata services
• 6x 1.5TB MeshProtect 10 volumes for storage services
OR
• 6x 2.25TB MeshProtect 6 (6+2) volumes for storage services

RDMA Fabric (100Gb InfiniBand)
Metadata Write & Read Operations: MeshProtect 10 vs. Buddy Mirror
BeeGFS file creates boosted to 600,000/s(3x improvement on same hardware)
BeeGFS file stats boosted to >60,000,000/s(2.5x improvement on same hardware)
Small I/O (4K) Performance: MeshProtect 10 vs. Buddy Mirror
BeeGFS random small writes boosted to > 1.25M/s(2.5x improvement on same hardware)
Large I/O Performance

[Chart: large-I/O throughput; bars labeled 50% usable capacity vs. 75% usable capacity (MeshProtect Erasure Coding 6+2)]

MeshProtect Erasure Coding (6+2) = high throughput, more capacity

NVMesh Erasure Coding enables over 90% usable capacity (11+1) for BeeGFS at NVMe speed
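A quick sketch of where those usable-capacity figures come from (my own arithmetic, not taken from the slides): with N data chunks and M parity or mirror chunks, the usable fraction is N / (N + M):

    # Usable capacity of an N+M protection scheme.
    def usable_fraction(n_data: int, m_parity: int) -> float:
        return n_data / (n_data + m_parity)

    print(usable_fraction(1, 1))   # mirroring / RAID10   -> 0.50   (50% usable)
    print(usable_fraction(6, 2))   # erasure coding 6+2   -> 0.75   (75% usable)
    print(usable_fraction(11, 1))  # erasure coding 11+1  -> 0.9166 (over 90%)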
See what's going on for BeeGFS & NVMesh together
Combine BeeGFS and NVMesh Grafana dashboards to produce a unified, end-to-end cluster view.

In this example:
• NVMesh cluster IOPS
• BeeGFS I/O load
• BeeGFS metadata operations
• NVMesh cluster throughput
• BeeGFS file system throughput
To sum it up: How do NVMesh & BeeGFS boost your storage?
• Protected Storage & Protected Investment
  • Easy to manage software-defined storage with flexible redundancy
  • Freedom to choose components from different manufacturers
  • No lock-in
• Full NVMe Advantage & Scale-out Performance
  • NVMesh provides ultra low latency access to NVMe volumes for BeeGFS over the network
  • Hardware, volumes and service instances can be added on-the-fly
Sven Breuner
Field CTO
[email protected]
Thank you!
Questions?
Taming I/O-hungry Application Beasts with BeeGFS & NVMesh