Microsecond Latency, Real-Time, Multi-Input/Output Control using GPU Processing
Nikolaus Rath
March 20th, 2013
N. Rath (Columbia University) µs Latency Control using GPU Processing March 20th, 2013 1 / 23
Outline
1 Motivation
2 GPU Control System Architecture
3 Performance
Fusion keeps the Sun Burning
Nuclear fusion is the process that keeps the sun burning.
Very hot hydrogen atoms (the “plasma”) collide to form helium, releasing lots of energy
Would be great to replicate this on earth. Plenty of fuel available, and no risk of nuclear meltdown.
Challenges: heat things to millions of degrees (not so hard), and keep them confined (very hard)
²H + ³H → ⁴He (3.5 MeV) + n (14.1 MeV)
At Millions of Degrees, Small Plasmas Evaporate Away
Magnetic Fields Constrain Plasma Movement to One Dimension
Closed Magnetic Fields Can Confine Plasmas
Tokamaks Confine Plasmas Using Magnetic Fields
Orange, Magenta, Green: magnetic field generating coils
Violet: plasma; Blue: single magnetic field line (example)
1 meter radius, 1 million °C, 15,000 A current
Self Generated Fields Cause Instabilities
Electric currents (which generate magnetic fields) flow not just in the coils, but also in the plasma itself
The plasma thus modifies the fields that confine it
... sometimes in a self-amplifying way – instability
Typical shape: rotating, helical deformation. Timescale: 50 microseconds.
Only High-Speed Feedback Control Can Preserve Confinement
Sensors detect deformations due to plasma currents
Control coils dynamically push back – “feedback control”
Real-Time Performance is Determined by Latency and Sampling Period
[Diagram: sample packets (S) flow from the digitizer through the GPU processing pipelines to the analog output; the sampling period is the spacing between successive samples, and the latency is the delay from digitizer input to analog output.]
Latency is response time of feedback system
Sampling period determines smoothness
Algorithmic complexity limits latency, not sampling period
Need both latency and sampling period on the order of a few microseconds
Control Algorithm is Implemented in One Kernel
[Flow chart comparing the two designs. Traditional approach – CPU: read input data, process data, send data to GPU memory, start GPU kernel A, wait for kernel A, read results from GPU memory, process results, send new data to GPU memory, start GPU kernel B, wait for kernel B, read results from GPU memory, ..., write output data; GPU: compute result a, compute result b, ... Single-kernel approach – CPU: read data, send parameters to GPU memory, start GPU kernel, wait for GPU kernel; GPU: compute result a, process results, compute result b, ..., process data, write output data.]
Redundant PCIe Transfers Have to Be Avoided to Reduce Latency
Traditional
Data bounces through host RAM
PCIe bus has multi-GB/s throughput
Transfer setup takes several µs
Okay if data chunks are big and transfer and processing take long
Bad if the transfer setup latency is longer than the transfer time itself
Redundant PCIe Transfers Have to Be Avoided to Reduce Latency
New
Peer-to-peer transfers eliminate the need for a bounce buffer
Good performance even for small amounts of data
Can be implemented in software (kernel)
Required peer-to-peer-capable root complex is present in most mid- to high-end mainboards.
Peer-to-peer PCIe Transfers are Set Up by Sharing BARs
[Diagram: the GPU maps part of its memory into one of its BARs; the A/D module’s DMA controller writes to that BAR and the D/A module’s DMA controller reads from it; BAR addresses are initialized from the BIOS by the CPU.]
PCIe devices communicate via “BARs” in the PCI address space
GPU can map part of its memory into a BAR
AD/DA modules can transfer to/from arbitrary PCI addresses
CPU establishes communication by telling AD/DA modules about the GPU BAR.
Required some trickery in the past, but with CUDA 5 this is now officially supported.
Example: Userspace
/* Allocate buffer with extra space for 64kb alignment */
CUdeviceptr dev_addr;
cuMemAlloc(&dev_addr, size + 0xFFFF);

/* Prepare mapping */
CUDA_POINTER_ATTRIBUTE_P2P_TOKENS tokens;
cuPointerGetAttribute(&tokens, CU_POINTER_ATTRIBUTE_P2P_TOKENS,
                      dev_addr);

/* Align to 64kb */
dev_addr += 0xFFFF;
dev_addr &= ~0xFFFF;

/* Call custom kernel module to get bus address,
 * @fd refers to open device file */
struct rdma_info s;
s.dev_addr = dev_addr;
s.p2pToken = tokens.p2pToken;
s.vaSpaceToken = tokens.vaSpaceToken;
s.size = size;
ioctl(fd, RDMA_TRANSLATE_TOKEN, &s);
Example: Kernelspace
long rtm_t_dma_ioctl(struct file *filp, unsigned int cmd,
                     unsigned long arg) {
    nvidia_p2p_page_table_t *page_table;
    // ....
    switch (cmd) {
    case RDMA_TRANSLATE_TOKEN: {
        COPY_FROM_USER(&rdma_info, varg, sizeof(struct rdma_info));
        nvidia_p2p_get_pages(rdma_info.p2pToken, rdma_info.vaSpaceToken,
                             rdma_info.dev_addr, rdma_info.size,
                             &page_table, rdma_free_callback, tdev);
        rdma_info.bus_addr = page_table->pages[0]->physical_address;
        COPY_TO_USER(varg, &rdma_info, sizeof(struct rdma_info));
        return 0;
    }
    // Other ioctls
    }
}
Userspace Continued
/* Call custom kernel module to get bus address,
 * @fd refers to open device file */
rtm_t_rdma_info s;
s.dev_addr = dev_addr;
ioctl(fd, RTM_T_TRANSLATE_TOKEN, &s);

/* Retrieve bus address */
uint64_t bus_addr;
bus_addr = s.bus_addr;

/* Send bus address to digitizer */
init_rtm_t(bus_addr, other, stuff, here);

// Start GPU kernel
// Kernel polls input buffer
// Wait for kernel to complete
The HBT-EP Plasma Control System was Built with Commodity Hardware.
Hardware:
Workstation PC
NVIDIA GeForce GTX 580
D-TACQ ACQ196 A-D Converter (96 channels, 16 bit)
2 D-TACQ AO32CPCI D-A Converters (2 × 32 channels, 16 bit)
Standard Linux host system (no real-time kernel required!)
P2P Transfers Reduce Latency by 50%
[Plot: latency vs. sampling period (2–20 µs), comparing GPU RAM and host RAM buffers.]
Optimal latency when using host memory: 16 µs
Optimal latency when using GPU memory: 10 µs
The 50% difference does not just mean having to wait twice as long; it is the difference between things blowing up or not.
GPU Beats CPU in Computational and Real-Time Performance even in the Microsecond Regime
Performance tested with repeated matrix application
GPU beats CPU down to 5 µs
Missed samples counted in 1000 runs
Missed samples with GPU: none; with CPU: up to 2.5%
[Plots: achievable sampling period vs. matrix size (30–100) for GPU and CPU; histogram of missed-sample percentage (0–2.5%) over 1000 runs for CPU and GPU.]
Summary
1 The advantages of GPUs are not restricted to large problems requiring long calculations.
2 Even when processing kB-sized batches under microsecond latency constraints, GPUs can be faster than CPUs, while at the same time offering better real-time performance.
3 In these regimes, data transfer overhead becomes the dominating factor, and using peer-to-peer transfers improves performance by more than 50%.
4 A GPU-based real-time control system has been developed at Columbia University and tested for the control of magnetically confined plasmas. Contact us for details.
Outline
4 Backup Slides
Latency and Sampling Period are Measured Experimentally by Copying Square Waves
[Oscilloscope traces from shot 76504: control input, control output, and sample clock vs. time (2380–2405 µs), ±0.20 V.]
Control algorithm set up to copy input to output 1:1
Blue trace is input square wave
Green trace is output square wave
Output lags behind input by control system latency
Red trace is sampling interval (sampling on downward edge)
Plasma Physics Results: Dominant Mode Amplitude Reduced by up to 60%
[Plot: mode amplitude vs. frequency (−20 to 20 kHz) with no feedback and with feedback gains g = 144 and g = 577.]
Feedback Control uses Measurements to Determine Control Signals
[Block diagram: sensors provide measurements (control input) to the controller; the controller sends control signals (control output) to actuators, which physically interact with the system.]
Goal: keep system in specific state
If system is perfectly known, can calculate required control signals (open-loop control)
In practice, need to use measurements to determine effects and update signals: feedback control
A control system acquires measurements, performs computations, and generates control output to manipulate the system state.
Data Passthrough Establishes 8 µs Lower Latency Limit
[Plot: latency vs. sampling period (0–12 µs) for GPU RAM and host RAM; the two curves coincide.]
Control system uses same buffer to write input and read output
No GPU processing, so no difference between host and GPU memory
Jump: 4 µs required for A-D conversion and data push
Offset: 4 µs required for data pull and D-A conversion