DCS: A Fast and Scalable Device-Centric Server Architecture
Jaehyung Ahn, Dongup Kwon, Youngsok Kim, Mohammadamin Ajdari, Jaewon Lee, and Jangwoo Kim
{jh2ekd, nankdu7, elixir, majdari, spiegel0, jangwoo}@postech.ac.kr
High Performance Computing Lab, Pohang University of Science and Technology (POSTECH)
Inefficient device utilization
• Host-centric device management
− Host manages every device invocation
− Frequent host-involved layer crossings increase latency and management cost

[Diagram: the application (userspace) reaches devices A, B, and C (hardware) only through each device's driver and kernel stack (kernel); both the data path and the metadata/command path cross these layers on every invocation.]
Latency: High software overhead
• Single sendfile: Storage read & NIC send
− Faster devices, more software overhead

[Chart: normalized latency decomposition (0–100%) of one sendfile into software, storage, and NIC time. Software overhead: 7% with HDD + 10Gb NIC, 50% with NVMe + 10Gb NIC, 77% with PCM + 10Gb NIC, 82% with PCM + 100Gb NIC.]
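For reference, the host-centric baseline behind these numbers is the stock Linux sendfile(2) path; a minimal timing sketch (the file path, socket setup, and transfer size are placeholders, not the actual benchmark harness):

/* Host-centric baseline: read a file from storage and send it over
 * a connected TCP socket with a single sendfile(2) call. */
#include <fcntl.h>
#include <time.h>
#include <unistd.h>
#include <sys/sendfile.h>

long sendfile_latency_ns(int sock_fd, const char *path, size_t len)
{
    int file_fd = open(path, O_RDONLY);
    if (file_fd < 0)
        return -1;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    /* One call, but internally: storage read through the VFS and
     * block layer, then NIC send through the network stack --
     * every step crosses the user/kernel/hardware layers. */
    off_t off = 0;
    ssize_t sent = sendfile(sock_fd, file_fd, &off, len);

    clock_gettime(CLOCK_MONOTONIC, &t1);
    close(file_fd);
    if (sent < 0)
        return -1;
    return (t1.tv_sec - t0.tv_sec) * 1000000000L
         + (t1.tv_nsec - t0.tv_nsec);
}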
Cost: High host resource demand
• Sendfile under host resource (CPU) contention
− Faster devices, more host resource consumption

[Chart, measured with an NVMe SSD and a 10Gb NIC: with no contention, sendfile delivers 100% bandwidth at 34% CPU usage; under high contention it gets only 6% CPU and bandwidth drops to 14%.]
Index
• Inefficient device utilization
• Limitations of existing solutions
• DCS: Device-Centric Server architecture
• Experimental results
• Conclusion
Limitations of existing work
• Single-device optimization
− Does not address inter-device communication
− e.g., Moneta (SSD), DCA (NIC), mTCP (NIC), Arrakis (generic)
• Inter-device communication
− Not applicable to unsupported devices
− e.g., GPUNet (GPU–NIC), GPUDirect RDMA (GPU–InfiniBand)
• Integrating devices
− Custom devices and protocols, limited applicability
− e.g., QuickSAN (SSD + NIC), BlueDBM (accelerator + SSD + NIC)

Need for fast, scalable, and generic inter-device communication
Index
• Inefficient device utilization
• Limitations of existing solutions
• DCS: Device-Centric Server architecture
− Key idea and benefits
− Design considerations
• Experimental results
• Conclusion
DCS: Key idea
• Minimize host involvement & data movement
• Single command → optimized multi-device invocation

[Diagram: the application calls the DCS Library (userspace); the DCS Driver sits alongside the existing device drivers and kernel stacks (kernel); the DCS Engine (hardware) drives devices A, B, and C directly, so the data path stays among the devices while only the metadata/command path involves the host.]
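As an illustration of the key idea, the library-level view might look like this; dcs_sendfile() and the engine handle are hypothetical names, not the actual DCS interface:

/* Hypothetical API sketch: the host submits ONE command and the
 * DCS Engine performs the storage read and NIC send device-to-device,
 * with no further host involvement. */
#include <sys/types.h>

typedef struct dcs_engine dcs_engine_t;   /* opaque engine handle (assumed) */

/* Single command -> optimized multi-device invocation. */
ssize_t dcs_sendfile(dcs_engine_t *eng, int out_sock_fd, int in_file_fd,
                     off_t offset, size_t count);

/* Usage: replaces the kernel sendfile(2) call of the baseline, e.g.
 *   ssize_t sent = dcs_sendfile(eng, sock_fd, file_fd, 0, len);     */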
DCS: Benefits
• Better device performance
− Faster data delivery, lower total operation latency
• Better host performance/efficiency
− Resources and time once spent on device management become available to other applications
• High applicability
− Relies on existing drivers, kernel support, and interfaces
− Easy to extend to cover more devices
Index
• Inefficient device utilization
• Limitations of existing solutions
• DCS: Device-Centric Server architecture
− Key idea and benefits
− Design considerations, discussed through implementation details
• Experimental results
• Conclusion
DCS: Architecture overview

[Diagram: in userspace, the application calls the DCS Library (sendfile(), encrypted sendfile()). In the kernel, the DCS Driver (command generator + kernel communicator) runs alongside the existing drivers and kernel stacks. In hardware, the DCS Engine (on the NetFPGA NIC) contains the command queue, command interpreter, and per-device managers, and reaches the NVMe SSD, GPU, and NetFPGA NIC through a PCIe switch. Fully compatible with the existing system.]
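The command generator / command queue / command interpreter split suggests a descriptor-based protocol in the spirit of NVMe submission queues. A sketch of what one command descriptor could look like; every field and name here is an assumption, not the actual DCS format:

/* Hypothetical command descriptor written by the DCS Driver's command
 * generator into the DCS Engine's command queue, where the command
 * interpreter dispatches it to the per-device managers. */
#include <stdint.h>

enum dcs_op {
    DCS_OP_SENDFILE,            /* SSD -> NIC */
    DCS_OP_ENCRYPTED_SENDFILE,  /* SSD -> GPU (encrypt) -> NIC */
};

struct dcs_cmd {
    uint32_t opcode;        /* enum dcs_op */
    uint32_t tag;           /* matches completions to commands */
    uint64_t src_lba;       /* block address resolved from the fd */
    uint32_t num_blocks;    /* transfer length in blocks */
    uint32_t conn_id;       /* connection info resolved from the socket */
    uint64_t gpu_buf_addr;  /* GPU buffer mapping, if a GPU is involved */
};

/* The driver would place the descriptor in a queue exposed over PCIe
 * (e.g., a BAR-mapped ring) and ring a doorbell register. */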
Communicating with storage

[Diagram: ❶ the application's call is hooked by the DCS Library; ❷ the DCS Driver asks the (virtual) filesystem to translate the file descriptor into block addresses on the device, or buffer addresses for data cached in the VFS cache, keeping data consistency guaranteed; ❸–❺ the command reaches the DCS Engine, which reads from the NVMe SSD (source device) and forwards the data to the target device.]
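Translating a file descriptor into on-device block addresses is a service the kernel already exposes; for example, a userspace prototype could use Linux's FIEMAP ioctl, though the in-kernel DCS Driver may well use a different mechanism:

/* Resolve the first physical extent of an open file, illustrating the
 * fd -> block address translation the DCS Driver performs via the
 * (virtual) filesystem. FIEMAP_FLAG_SYNC flushes dirty pages first,
 * one way to keep the on-device data consistent with the VFS cache. */
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int first_extent(int fd, uint64_t *phys, uint64_t *len)
{
    char buf[sizeof(struct fiemap) + sizeof(struct fiemap_extent)];
    struct fiemap *fm = (struct fiemap *)buf;

    memset(buf, 0, sizeof(buf));
    fm->fm_start = 0;
    fm->fm_length = ~0ULL;            /* map the whole file */
    fm->fm_flags = FIEMAP_FLAG_SYNC;  /* sync cached data to the device */
    fm->fm_extent_count = 1;

    if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0 || fm->fm_mapped_extents == 0)
        return -1;

    *phys = fm->fm_extents[0].fe_physical; /* byte offset on the device */
    *len  = fm->fm_extents[0].fe_length;
    return 0;
}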
Communicating with network interface

[Diagram: ❶ the application's call is hooked by the DCS Library with a socket descriptor; ❷ the DCS Driver retrieves the connection information from the kernel network stack; ❸–❺ the DCS Engine hands the data buffer to the hardware packet generator (HW PacketGen) on the NetFPGA NIC, which generates and sends the packets. HW-assisted packet generation.]
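Part of the connection information, the TCP 4-tuple, is reachable through standard socket calls; deeper TCP state (sequence numbers, window) lives in the kernel and would need the DCS Driver's hooks, which this sketch omits:

/* Extract the TCP 4-tuple of a connected socket -- the most basic
 * piece of the connection information the HW packet generator needs
 * to build valid packets for an existing connection. */
#include <stdio.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int print_tuple(int sock_fd)
{
    struct sockaddr_in local, peer;
    socklen_t llen = sizeof(local), plen = sizeof(peer);

    if (getsockname(sock_fd, (struct sockaddr *)&local, &llen) < 0 ||
        getpeername(sock_fd, (struct sockaddr *)&peer, &plen) < 0)
        return -1;

    /* inet_ntoa() reuses a static buffer, so print in two steps. */
    printf("%s:%u -> ", inet_ntoa(local.sin_addr), ntohs(local.sin_port));
    printf("%s:%u\n", inet_ntoa(peer.sin_addr), ntohs(peer.sin_port));
    return 0;
}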
Communicating with accelerator

[Diagram: ❶ the application allocates GPU memory through the GPU user library and ❷ calls the DCS Library; ❸ the DCS Driver gets the memory mapping from the GPU kernel driver; ❹–❺ the DCS Engine moves data from the source device into GPU memory via DMA / NVMe transfer; ❻–❼ the application invokes its GPU kernel and the data is processed. Direct data loading without memcpy.]
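In code, the accelerator path could look roughly as follows; cudaMalloc()/cudaFree() are real CUDA runtime calls, while dcs_load_from_storage() is a hypothetical stand-in for steps ❷–❺ (the DCS Driver obtaining the mapping from the GPU kernel driver and the DCS Engine DMA'ing data in, GPUDirect-style):

#include <sys/types.h>
#include <cuda_runtime.h>

/* Hypothetical DCS call: transfer file contents straight into GPU
 * memory, with no intermediate host buffer or memcpy. */
extern int dcs_load_from_storage(void *gpu_dst, int file_fd,
                                 off_t offset, size_t len);

int process_on_gpu(int file_fd, size_t len)
{
    void *dev_buf = NULL;

    /* Step 1: allocate device memory through the GPU user library. */
    if (cudaMalloc(&dev_buf, len) != cudaSuccess)
        return -1;

    /* Steps 2-5: one DCS command; the engine resolves the file's
     * blocks and issues an NVMe/DMA transfer into GPU memory. */
    if (dcs_load_from_storage(dev_buf, file_fd, 0, len) != 0) {
        cudaFree(dev_buf);
        return -1;
    }

    /* Steps 6-7: the application launches its processing kernel as
     * usual; the data is already resident in GPU memory. */
    return 0;
}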
Index
• Inefficient device utilization
• Limitations of existing solutions
• DCS: Device-Centric Server architecture
• Experimental results
• Conclusion
Experimental setup
• Host: power-efficient system
− Core 2 Duo @ 2.00GHz, 2MB LLC
− 2GB DDR2 DRAM
• Devices: off-the-shelf emerging devices
− Storage: Samsung XS1715 NVMe SSD
− NIC: NetFPGA with Xilinx Virtex-5 (up to 1Gb bandwidth)
− Accelerator: NVIDIA Tesla K20m
− Device interconnect: Cyclone Microsystems PCIe2-2707 (Gen 2 switch, 5 slots, up to 80Gbps)
DCS prototype implementation
• Our 4-node DCS prototype
− Can support many devices per host

[Photo: prototype with the NVMe SSD, NetFPGA NIC, GPU, and PCIe switch.]
Reducing device utilization latency
• Single sendfile: Storage read & NIC send
− Host-centric: per-device layer crossings
− DCS: batch management in the HW layer

[Chart: latency (µs), split into SW and HW time. Host-centric: 79 µs SW + 75 µs HW; DCS: 39 µs SW + 75 µs HW. 2x latency improvement (with low-latency devices).]
Host-independent performance
• Sendfile under host resource (CPU) contention
− Host-centric: host-dependent, high management cost
− DCS: host-independent, low management cost

[Chart: sendfile bandwidth vs. CPU busy time. No contention: host-centric 100% BW at 70% CPU busy, DCS 100% BW at 29% CPU busy. High contention: host-centric 13% BW at 10% CPU busy, DCS 71% BW at 11% CPU busy.]

High performance even on weak hosts
Multi-device invocation
• Encrypted sendfile (SSD → GPU → NIC, 512MB)
− DCS provides much more efficient data movement to the GPU
− Current bottleneck is the NIC (1Gbps)

[Chart: normalized processing time, broken into GPU data loading, GPU processing, network send, and NVIDIA driver time. With the 1Gb NIC, network send dominates and DCS cuts total time by 14%; projected with a 10Gb NIC, the reduction grows to 38%.]
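The chained operation evaluated above could plausibly be exposed as one library call mirroring the sendfile() shape; dcs_encrypted_sendfile() is a hypothetical name used only for illustration:

/* Hypothetical single-command form of the SSD -> GPU -> NIC chain:
 * read file blocks, encrypt on the GPU, send on the NIC, all
 * orchestrated by the DCS Engine without host data copies. */
#include <sys/types.h>

extern ssize_t dcs_encrypted_sendfile(int out_sock_fd, int in_file_fd,
                                      off_t offset, size_t count);

/* e.g., the 512MB transfer measured above:
 *   ssize_t sent = dcs_encrypted_sendfile(sock_fd, file_fd,
 *                                         0, 512UL << 20);          */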
Real-world workload: Hadoop-grep
• Hadoop-grep (10GB)
− Faster input delivery & smaller host resource consumption

[Chart: map/reduce progress (%) over time (0–150 s), host-centric vs. DCS. 38% faster processing.]
Scalability: More devices per host
• Doubling the number of devices in a single host

[Chart: normalized total device throughput and CPU utilization. Host-centric: SSD + NIC at 60% CPU; SSDx2 + NICx2 saturates the CPU (100%) for only 1.3x throughput. DCS: SSD + NIC at 22% CPU; SSDx2 + NICx2 at 37% CPU for the full 2x throughput. Scalable many-device support.]
Conclusion
• Device-Centric Server architecture
− Manages emerging devices on behalf of the host
− Optimized data transfer and device control
− Easily extensible, modularized design
• Real hardware prototype evaluation
− Device latency reduction: ~25%
− Host resource savings: ~61%
− Hadoop-grep speed improvement: ~38%
Thank you!
High Performance Computing Lab, Pohang University of Science and Technology (POSTECH)

Device latency reduction: ~25% · Host resource savings: ~61% · Hadoop-grep speed improvement: ~38%