
NIGHTWATCH: Remoting Accelerator APIs through the Hypervisor

Hangchen Yu, Amogh Akshintala, Arthur Peters, Christopher J. Rossbach

The University of Texas at Austin · University of North Carolina at Chapel Hill · VMware Research Group

NIGHTWATCH: Automatic Accelerator Virtualization

• Compatibility recovered: stack generation is automatic
• Interposition recovered: APIs are forwarded over a VMM-managed transport

Evaluation

• Near-native performance
• Scales to 16 VMs
• Fair scheduling with heterogeneous workloads
• Migration overhead < 40 ms
• Slowdown under memory over-subscription < 2.5×

Development Effort

             # of APIs   Lines of Spec   LOC (generated)   Time
OpenCL            88         4,200            4,150        A handful of days
CUDA             211         6,350            8,150
TensorFlow       160         5,350            6,900
MVNC API          25           910            2,450

Silos Complicate Virtualization

• API forwarding: sacrifices interposition and compatibility

• Para-virtual I/O: e.g., SVGA translates guest interactions into DirectX, which leads to serious complexity and compatibility issues

Automatic Generation

[Figure: NWExtractor parses the vendor headers (CL.h, cuda.h, tensorflow.h) into a PVADL specification, which NWCC compiles into the LibForward, Guest Driver, and Worker code shown below.]

Example

// The CUDA driver API being virtualized
CUresult CUDAAPI cuMemcpyHtoD(
    CUdeviceptr dstDevice,
    const void *srcHost,
    size_t      byteCount);

// Guest Driver (generated): marshal data to the accelerator shared buffer
switch (guest_base->api_id) {
case CU_MEMCPY_H_TO_D:
    COPY_FROM_GUEST(arg1_0);
    COPY_FROM_GUEST(arg2_0);
    COPY_SPACE_FROM_GUEST(arg3_0, guest->arg2_0);
    break;
}

// Guest Driver (generated): copy the result from the accelerator shared buffer
switch (guest_base->api_id) {
case CU_MEMCPY_H_TO_D:
    copy_to_user(&arg->ret_arg0, &host->ret_arg0, sizeof(CUresult));
    break;
}
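For intuition, COPY_SPACE_FROM_GUEST above has to move a variable-length guest buffer into a region both sides can see. The sketch below shows only an assumed shape of that operation (the nw_dpool name and layout are illustrative, not NIGHTWATCH's actual definitions): the buffer is appended to a shared "data pool" and its offset travels in place of the raw guest pointer.

/* Hedged sketch: copy a variable-length guest buffer into a shared data
 * pool so the worker can rebuild a pointer to it on its side. */
#include <stddef.h>
#include <string.h>

struct nw_dpool {
    char  *base;    /* start of the region shared with the worker */
    size_t offset;  /* next free byte in the region                */
};

static size_t nw_copy_space_from_guest(struct nw_dpool *dpool,
                                       const void *guest_buf, size_t len)
{
    size_t off = dpool->offset;
    memcpy(dpool->base + off, guest_buf, len);  /* stage the payload */
    dpool->offset += len;
    return off;  /* this offset is shipped instead of the raw guest pointer */
}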

// Worker (generated): unmarshal arguments and invoke the real API
switch (param->base.api_id) {
case CU_MEMCPY_H_TO_D: {
    const void *srcHost = GET_PTR_FROM_DPOOL(param->arg3_0, const void *);
    param->ret_arg0 = cuMemcpyHtoD(param->arg1_0, srcHost, param->arg2_0);
    break;
}
}
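The generated switch needs a driver loop around it. As a minimal sketch under assumed names (struct cuda_param and the nw_* helpers are not NIGHTWATCH's published interface), a Worker could pull one parameter block at a time off the VMM-managed transport, dispatch it, and send the completed block back:

/* Hedged sketch of a Worker main loop; all nw_* names are assumptions. */
struct cuda_param;                                     /* generated per-API layout */
extern int  nw_transport_recv(struct cuda_param **p);  /* block for the next call  */
extern void nw_transport_send(struct cuda_param *p);   /* return results to guest  */
extern void nw_dispatch_cuda(struct cuda_param *p);    /* the generated switch     */

static void worker_loop(void)
{
    struct cuda_param *param;
    while (nw_transport_recv(&param) == 0) {
        nw_dispatch_cuda(param);    /* runs the real cuMemcpyHtoD, etc.    */
        nw_transport_send(param);   /* ships ret_arg0 and output data back */
    }
}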

PVADL specification

cuMemcpyHtoD:
  async: False
  args:
    dstDevice:
      - type: CUdeviceptr
    srcHost:
      - type: const void*
      - dim: 1
      - length: byteCount
    byteCount:
      - type: size_t
  ret:
    - type: CUresult
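The specification above is enough to fix the wire format of the call. A plausible (assumed, not NIGHTWATCH's actual) layout for the resulting parameter block, using the arg1_0/arg2_0/arg3_0, api_id, and dpool_size names that appear in the generated snippets, might look like:

/* Hedged sketch of the per-call parameter block implied by the spec. */
#include <stddef.h>
#include <stdint.h>

struct nw_param_base {
    uint32_t api_id;       /* e.g. CU_MEMCPY_H_TO_D                         */
    size_t   dpool_size;   /* bytes of variable-length data being shipped   */
};

struct cu_memcpy_htod_param {
    struct nw_param_base base;
    uint64_t    arg1_0;    /* dstDevice (stand-in for CUdeviceptr)          */
    size_t      arg2_0;    /* byteCount                                     */
    const void *arg3_0;    /* srcHost; carried as a data-pool offset in     */
                           /* transit, per GET_PTR_FROM_DPOOL on the worker */
    int         ret_arg0;  /* stand-in for the CUresult return value        */
};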

[Figure: the three generated components — LibForward (guest library), Guest Driver, and Worker — cooperate on every forwarded call.]

// LibForward (generated): the guest-side stub that replaces the vendor library
CUresult CUDAAPI cuMemcpyHtoD(
    CUdeviceptr dstDevice,
    const void *srcHost,
    size_t byteCount)
{
    INIT_CUDA_PARAM(param);
    param.base.api_id = CU_MEMCPY_H_TO_D;
    // set arguments
    param.arg1_0 = dstDevice;
    param.arg2_0 = byteCount;
    param.arg3_0 = srcHost;
    // compute data size
    param.base.dpool_size += COMPUTE_SIZE(param.arg3_0, byteCount);
    // forward call to guest driver
    IOCTL_TO_DRIVER(&param);
    return param.ret_arg0;
}
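One payoff of generating the whole stack is that guest code does not change: the application keeps calling the same API, and the LibForward stub above does the forwarding. A trivial illustration, assuming the guest links against the generated LibForward instead of the vendor library:

/* Unmodified guest code: the call below is transparently forwarded. */
#include <stddef.h>
#include <cuda.h>   /* the same header used with the native stack */

static CUresult copy_to_device(CUdeviceptr dptr, const void *buf, size_t nbytes)
{
    return cuMemcpyHtoD(dptr, buf, nbytes);
}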


[Figure: accelerator stacks are silos — APPLICATIONS call Runtime/Device APIs that reach vendor-specific drivers via ioctl and mmap; the HYPERVISOR already virtualizes CPU, DISK, and NVM as vCPU, vDISK, and vNVM, while GPU, FPGA, ASIC, DSP, and CRYPTO devices would each need their own vGPU, vFPGA, vASIC, vDSP, or vTPU.]

[Figure: NIGHTWATCH architecture — guest APPLICATIONS issue OpenCL, CUDA, and TensorFlow APIs through LibForward and the Guest Driver, which forward them over a Virtual PCIe transport in the HYPERVISOR to a Worker (API dispatcher, fair scheduler, device memory management) sitting on the HW vendor-specific driver and the Physical Accelerator; guests see a universal vAccelerator.]

• Full virtualization: incurs significant overhead from trap-based interposition
• SR-IOV: hardware support remains lacking (< 0.95% of NVIDIA GPUs)

Accelerator Stacks are Silos

• Hardware interface: MMIO, mmap'd command queues
• Software interface: vendor-specific drivers, proprietary protocols

Interposition is only possible at the top or bottom of a silo (the figure uses a GPU stack as the example).
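To see why, consider what the hardware interface looks like from user space: once the runtime has mmap'd a command queue, work submission is plain memory writes that neither the driver nor the hypervisor observes without trapping every access. The fragment below is illustrative only ("/dev/accel0", the queue size, and the command word are assumptions):

/* Hedged sketch: direct command submission through an mmap'd queue. */
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/accel0", O_RDWR);
    if (fd < 0)
        return 1;
    volatile uint32_t *queue = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                    MAP_SHARED, fd, 0);
    if (queue == MAP_FAILED)
        return 1;
    queue[0] = 0x1;   /* a command word written straight to the device */
    munmap((void *)queue, 4096);
    close(fd);
    return 0;
}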

[Figure legend: automatically generated components vs. universal components; remaining figure labels: HW Vendor Driver, Driver VM.]