Microsecond Latency, Real-Time, Multi-Input/Output Control using GPU Processing
Nikolaus Rath
March 20th, 2013
N. Rath (Columbia University) µs Latency Control using GPU Processing March 20th, 2013 1 / 23
Outline
1 Motivation
2 GPU Control System Architecture
3 Performance
Fusion keeps the Sun Burning
Nuclear fusion is the process that keeps the sun burning.
Very hot hydrogen atoms (the “plasma”) collide to form helium, releasing lots of energy
Would be great to replicate this on earth. Plenty of fuel available, and no risk of nuclear meltdown.
Challenges: heat things to millions of degrees (not so hard), and keep them confined (very hard)
²H + ³H → ⁴He (3.5 MeV) + n (14.1 MeV)
At Millions of Degrees, Small Plasmas Evaporate Away
Magnetic Fields Constrain Plasma Movement to One Dimension
Closed Magnetic Fields Can Confine Plasmas
Tokamaks Confine Plasmas Using Magnetic Fields
Orange, Magenta, Green: magnetic field generating coils
Violet: plasma; Blue: single magnetic field line (example)
1 meter radius, 1 million °C, 15,000 A current
Self Generated Fields Cause Instabilities
Electric currents (which generate magnetic fields) flow not just in the coils, but also in the plasma itself
The plasma thus modifies the fields that confine it
... sometimes in a self-amplifying way – instability
Typical shape: rotating, helical deformation. Timescale: 50 microseconds.
Only High-Speed Feedback Control Can Preserve Confinement
Sensors detect deformations due to plasma currents
Control coils dynamically push back – “feedback control”
Real-Time Performance is Determined by Latency and Sampling Period
[Diagram: sample packets (S) flow from the digitizer through the GPU processing pipelines to the analog output; the sampling period is the spacing between successive samples, and the latency is the delay from digitizer input to analog output.]
Latency is response time of feedback system
Sampling period determines smoothness
Algorithmic complexity limits latency, not sampling period
Need both latency and sampling period on the order of a few microseconds
Control Algorithm is Implemented in One Kernel
[Flow chart comparing the two designs. Traditional approach – CPU: read input data, process data, send data to GPU memory, start GPU kernel A, wait for kernel A, read results from GPU memory, process results, send new data to GPU memory, start GPU kernel B, wait for kernel B, read results from GPU memory, ..., write output data; GPU: compute result a, compute result b, ... Single-kernel approach – CPU: read data, send parameters to GPU memory, start GPU kernel, wait for GPU kernel; GPU: compute result a, process results, compute result b, ..., process data, write output data.]
Redundant PCIe Transfers Have to Be Avoided to Reduce Latency
Traditional
Data bounces through host RAM
PCIe bus has multi-GB/s throughput
Transfer setup takes several µs
Okay if data chunks are big and transfer and processing take long
Bad if the transfer setup latency is longer than the transfer time itself
Redundant PCIe Transfers Have to Be Avoided to Reduce Latency
New
Peer-to-peer transfers eliminate the need for a bounce buffer
Good performance even for small amounts of data
Can be implemented in software (kernel)
Required peer-to-peer-capable root complex is present in most mid- to high-end mainboards.
Peer-to-peer PCIe Transfers are Set Up by Sharing BARs
[Diagram: the GPU maps part of its memory into one of its BARs; the A/D module’s DMA controller writes to that BAR and the D/A module’s DMA controller reads from it; BAR addresses are initialized from the BIOS by the CPU.]
PCIe devices communicate via “BARs” in the PCI address space
GPU can map part of its memory into a BAR
AD/DA modules can transfer to/from arbitrary PCI addresses
CPU establishes communication by telling AD/DA modules about the GPU BAR.
Required some trickery in the past, but with CUDA 5 this is now officially supported.
Example: Userspace
/* Allocate buffer with extra space for 64kb alignment */
CUdeviceptr dev_addr;
cuMemAlloc(&dev_addr, size + 0xFFFF);

/* Prepare mapping */
CUDA_POINTER_ATTRIBUTE_P2P_TOKENS tokens;
cuPointerGetAttribute(&tokens, CU_POINTER_ATTRIBUTE_P2P_TOKENS,
                      dev_addr);

/* Align to 64kb */
dev_addr += 0xFFFF;
dev_addr &= ~0xFFFF;

/* Call custom kernel module to get bus address,
 * @fd refers to open device file */
struct rdma_info s;
s.dev_addr = dev_addr;
s.p2pToken = tokens.p2pToken;
s.vaSpaceToken = tokens.vaSpaceToken;
s.size = size;
ioctl(fd, RDMA_TRANSLATE_TOKEN, &s);
Example: Kernelspace
long rtm_t_dma_ioctl(struct file *filp, unsigned int cmd,
                     unsigned long arg) {
    nvidia_p2p_page_table_t *page_table;
    // ....
    switch (cmd) {
    case RDMA_TRANSLATE_TOKEN: {
        COPY_FROM_USER(&rdma_info, varg, sizeof(struct rdma_info));
        nvidia_p2p_get_pages(rdma_info.p2pToken, rdma_info.vaSpaceToken,
                             rdma_info.dev_addr, rdma_info.size,
                             &page_table, rdma_free_callback, tdev);
        rdma_info.bus_addr = page_table->pages[0]->physical_address;
        COPY_TO_USER(varg, &rdma_info, sizeof(struct rdma_info));
        return 0;
    }
    // Other ioctls
    }
}
Userspace Continued
/* Call custom kernel module to get bus address,
 * @fd refers to open device file */
rtm_t_rdma_info s;
s.dev_addr = dev_addr;
ioctl(fd, RTM_T_TRANSLATE_TOKEN, &s);

/* Retrieve bus address */
uint64_t bus_addr;
bus_addr = s.bus_addr;

/* Send bus address to digitizer */
init_rtm_t(bus_addr, other, stuff, here);

// Start GPU kernel
// Kernel polls input buffer
// Wait for kernel to complete
The HBT-EP Plasma Control System was Built with Commodity Hardware.
Hardware:
Workstation PC
NVIDIA GeForce GTX 580
D-TACQ ACQ196 A-D Converter (96 channels, 16 bit)
2 D-TACQ AO32CPCI D-A Converters (2 × 32 channels, 16 bit)
Standard Linux host system (no real-time kernel required!)
P2P Transfers Reduce Latency by 50%
[Plot: latency vs. sampling period (2–20 µs), comparing GPU RAM and host RAM buffers.]
Optimal latency when using host memory: 16 µs
Optimal latency when using GPU memory: 10 µs
The 50% difference does not just mean having to wait twice as long; it is the difference between things blowing up or not.
GPU Beats CPU in Computational and Real-Time Performance even in the Microsecond Regime
Performance tested with repeated matrix application
GPU beats CPU down to 5 µs
Missed samples counted in 1000 runs
Missed samples with GPU: none; with CPU: up to 2.5%
[Plots: achievable sampling period vs. matrix size (30–100) for GPU and CPU; histogram of missed-sample percentage (0–2.5%) over 1000 runs for CPU and GPU.]
Summary
1 The advantages of GPUs are not restricted to large problems requiring long calculations.
2 Even when processing kB-sized batches under microsecond latency constraints, GPUs can be faster than CPUs, while at the same time offering better real-time performance.
3 In these regimes, data transfer overhead becomes the dominating factor, and using peer-to-peer transfers improves performance by more than 50%.
4 A GPU-based real-time control system has been developed at Columbia University and tested for the control of magnetically confined plasmas. Contact us for details.
Outline
4 Backup Slides
Latency and Sampling Period are Measured Experimentally by Copying Square Waves
[Oscilloscope traces from shot 76504: control input, control output, and sample clock vs. time (2380–2405 µs), ±0.20 V.]
Control algorithm set up to copy input to output 1:1
Blue trace is input square wave
Green trace is output square wave
Output lags behind input by control system latency
Red trace is sampling interval (sampling on downward edge)
Plasma Physics Results: Dominant Mode Amplitude Reduced by up to 60%
[Plot: mode amplitude vs. frequency (−20 to 20 kHz) with no feedback and with feedback gains g = 144 and g = 577.]
Feedback Control uses Measurements to Determine Control Signals
[Block diagram: sensors provide measurements (control input) to the controller; the controller sends control signals (control output) to actuators, which physically interact with the system.]
Goal: keep system in specific state
If system is perfectly known, can calculate required control signals (open-loop control)
In practice, need to use measurements to determine effects and update signals: feedback control
A control system acquires measurements, performs computations, and generates control output to manipulate the system state.
Data Passthrough Establishes 8 µs Lower Latency Limit
[Plot: latency vs. sampling period (0–12 µs) for GPU RAM and host RAM; the two curves coincide.]
Control system uses same buffer to write input and read output
No GPU processing, so no difference between host and GPU memory
Jump: 4 µs required for A-D conversion and data push
Offset: 4 µs required for data pull and D-A conversion