![Page 1: GPU Programming for High- Performance Graphics Workstation](https://reader031.vdocuments.net/reader031/viewer/2022012019/61688387d394e9041f701c9b/html5/thumbnails/1.jpg)
GPU Programming for High-Performance Graphics Workstation Applications
Shalini Venkataraman, Alina Alt, Will Braithwaite
Applied Engineering, NVIDIA PSG
![Page 2: GPU Programming for High- Performance Graphics Workstation](https://reader031.vdocuments.net/reader031/viewer/2022012019/61688387d394e9041f701c9b/html5/thumbnails/2.jpg)
NVIDIA Confidential
Talk Outline
Scaling transfers and rendering
- Shalini Venkataraman
Mixing graphics and compute
- Alina Alt
Developing an optimized Maya plugin using CUDA and OpenGL
- Will Braithwaite
Questions at the end
![Page 3: GPU Programming for High- Performance Graphics Workstation](https://reader031.vdocuments.net/reader031/viewer/2022012019/61688387d394e9041f701c9b/html5/thumbnails/3.jpg)
Scaling Transfers and Rendering
Overlapping transfers & rendering
Implementing various transfer methods
Multi-threading and Synchronization
Debugging transfers
Best Practices & Results
Scaling to Multi-GPU
Pinning OpenGL context to GPU
Application structure
Optimized inter-GPU transfers
![Page 4: GPU Programming for High- Performance Graphics Workstation](https://reader031.vdocuments.net/reader031/viewer/2022012019/61688387d394e9041f701c9b/html5/thumbnails/4.jpg)
Applications
Streaming videos/time varying geometry or volumes
Broadcast, real-time fluid simulations, etc.
Level of detail
Out-of-core image viewers, terrain engines
Bricks paged in as needed
Parallel rendering
Fast communication between multiple GPUs for scaling data/render
Remoting graphics
Read back GPU results fast and stream over the network
![Page 5: GPU Programming for High- Performance Graphics Workstation](https://reader031.vdocuments.net/reader031/viewer/2022012019/61688387d394e9041f701c9b/html5/thumbnails/5.jpg)
Previous Approach – Synchronous Transfers
Straightforward
Upload texture every frame
The driver performs all copies
Copy, download and draw are sequential
![Page 6: GPU Programming for High- Performance Graphics Workstation](https://reader031.vdocuments.net/reader031/viewer/2022012019/61688387d394e9041f701c9b/html5/thumbnails/6.jpg)
Previous Approach - CPU Asynchronous Transfers
Non CPU-blocking transfer using Pixel Buffer Objects (PBO)
Ping-pong PBOs for optimal throughput
Data must be in GPU native format
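[Figure: bricks pData[0] … pData[nBricks] are loaded from disk into main memory. Data_next is copied by memcpy into one PBO (OpenGL-controlled memory) while Data_cur is transferred from the other PBO into the texture texID in graphics memory by glTexSubImage; PBO0 and PBO1 alternate roles.]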
![Page 7: GPU Programming for High- Performance Graphics Workstation](https://reader031.vdocuments.net/reader031/viewer/2022012019/61688387d394e9041f701c9b/html5/thumbnails/7.jpg)
CPU Asynchronous - Timeline
[Timeline (CPU async): CPU row: Upload t0:PBO0, Upload t1:PBO1, Upload t2:PBO0; bus row: Copy t0:PBO0, Copy t1:PBO1, Copy t2:PBO0; GPU row: Draw t0, Draw t1, one per frame.]
Analysis with GPUView (http://graphics.stanford.edu/~mdfisher/GPUView.html)
![Page 8: GPU Programming for High- Performance Graphics Workstation](https://reader031.vdocuments.net/reader031/viewer/2022012019/61688387d394e9041f701c9b/html5/thumbnails/8.jpg)
Example – 3D texture upload +Ping-Pong PBOs
GLuint pbo[2];            // ping-pong PBOs; generate and initialize them ahead of time
unsigned int curPBO = 0;
// Bind the current PBO for the app->PBO transfer
glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, pbo[curPBO]);
GLubyte* ptr = (GLubyte*)glMapBufferRange(GL_PIXEL_UNPACK_BUFFER_ARB, 0, size,
    GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);
memcpy(ptr, pData[curBrick], xdim*ydim*zdim);
glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER_ARB);
// Copy pixels from the other PBO (filled on the previous pass) to the texture object
glBindTexture(GL_TEXTURE_3D, texId);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, pbo[1-curPBO]);
glTexSubImage3D(GL_TEXTURE_3D, 0, 0, 0, 0, xdim, ydim, zdim, GL_LUMINANCE, GL_UNSIGNED_BYTE, 0);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, 0);
glBindTexture(GL_TEXTURE_3D, 0);
curPBO = 1-curPBO;        // swap roles for the next pass
// Call drawing code here
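To make the ping-pong bookkeeping in the example above easier to follow, here is a minimal C++ sketch that replays only the index arithmetic, with no OpenGL calls. The names FrameSlots and pingPongSchedule are hypothetical and do not appear in the original slides.

```cpp
#include <cstddef>
#include <vector>

// Which PBO each stage touches during one pass of the ping-pong loop.
struct FrameSlots {
    unsigned fill;  // PBO mapped and filled by memcpy on this pass
    unsigned copy;  // PBO read by glTexSubImage3D on this pass
};

// Replays the index bookkeeping from the slide: fill pbo[cur],
// copy from pbo[1 - cur], then flip cur.
std::vector<FrameSlots> pingPongSchedule(std::size_t frames) {
    std::vector<FrameSlots> schedule;
    unsigned cur = 0;
    for (std::size_t f = 0; f < frames; ++f) {
        schedule.push_back({cur, 1u - cur});
        cur = 1u - cur;
    }
    return schedule;
}
```

From the second pass on, the PBO that is copied to the texture is the one that was filled on the pass before. On the very first pass the copy reads pbo[1], which the loop itself has not filled yet; this is presumably why the slide asks for the PBOs to be initialized ahead of time.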
![Page 9: GPU Programming for High- Performance Graphics Workstation](https://reader031.vdocuments.net/reader031/viewer/2022012019/61688387d394e9041f701c9b/html5/thumbnails/9.jpg)
Results – Synchronous vs CPU Async
[Chart: PBO vs synchronous uploads on Quadro 6000. Bandwidth (MB/s, axis 0 to 4500) for PBO and TexSubImage uploads at brick sizes 16^3 (4KB), 32^3 (32KB), 64^3 (256KB), 128^3 (2MB) and 256^3 (16MB).]
- Transfers only
- Adding rendering will reduce bandwidth, GPU can’t do both
- Ideally – want to sustain bandwidth with render, need GPU overlap
Achieving GPU Overlap – Copy Engines
Fermi and later GPUs have copy engines (CEs)
GeForce, low-end Quadro: 1 CE
Quadro 4000 and up: 2 CEs
Allows copy-to-host + compute + copy-to-device to overlap
Graphics/OpenGL
Using PBOs in multiple threads
Handle synchronization
![Page 11: GPU Programming for High- Performance Graphics Workstation](https://reader031.vdocuments.net/reader031/viewer/2022012019/61688387d394e9041f701c9b/html5/thumbnails/11.jpg)
GPU Asynchronous Transfers
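Downloads/uploads in a separate thread
Using OpenGL PBOs
ARB_sync used for context synchronization
[Timeline figure: CPU row: Upload t0:PBO0, Upload t1:PBO1, Upload t2:PBO0; bus row: Copy t0:PBO0, Copy t1:PBO1, Copy t2:PBO0 (labels: Using PBO, Using CE); GPU row: Draw t0, Draw t1, Draw t2, one per frame.]
[Thread diagram: the main app thread runs Init and Draw; Upload and Readback run alongside Draw and exchange data with it through shared textures.]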
![Page 12: GPU Programming for High- Performance Graphics Workstation](https://reader031.vdocuments.net/reader031/viewer/2022012019/61688387d394e9041f701c9b/html5/thumbnails/12.jpg)
Multi-threaded Context Creation
Sharing textures between multiple contexts
Don’t use wglShareLists
Use WGL_ARB_create_context / GLX_ARB_create_context instead
Turn on the OpenGL debug context flag
static const int contextAttribs[] =
{
    WGL_CONTEXT_FLAGS_ARB, WGL_CONTEXT_DEBUG_BIT_ARB,
    0
};
mainGLRC = wglCreateContextAttribsARB(winDC, 0, contextAttribs);
wglMakeCurrent(winDC, mainGLRC);
glGenTextures(numTextures, srcTex);
// uploadGLRC shares all of its textures with mainGLRC
uploadGLRC = wglCreateContextAttribsARB(winDC, mainGLRC, contextAttribs);
// Create the upload thread
// Repeat the above for a readback context, if one is used
![Page 13: GPU Programming for High- Performance Graphics Workstation](https://reader031.vdocuments.net/reader031/viewer/2022012019/61688387d394e9041f701c9b/html5/thumbnails/13.jpg)
Upload-Render: Application Layout
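[Figure: bricks pData[0] … pData[nBricks] are loaded from disk into main memory. The upload thread (uploadGLRC) copies Data_next into a PBO (OpenGL-controlled memory) with memcpy and transfers Data_cur from the other PBO into srcTex[numTextures] in graphics memory with glTexSubImage. The render thread (mainGLRC) binds the textures with glBindTexture.]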
![Page 14: GPU Programming for High- Performance Graphics Workstation](https://reader031.vdocuments.net/reader031/viewer/2022012019/61688387d394e9041f701c9b/html5/thumbnails/14.jpg)
Adding Render – Readback
[Figure: the render thread (mainGLRC) draws into resultTex[numTextures] in graphics memory, attached as the render target with glFramebufferTexture. The readback thread (readbackGLRC) reads Frame_cur into a PBO (OpenGL-controlled memory) with glGetTexImage and copies Frame_prev from the other PBO into Images[nFrames] in main memory with memcpy.]
Use glGetTexImage, not glReadPixels between contexts/threads
![Page 15: GPU Programming for High- Performance Graphics Workstation](https://reader031.vdocuments.net/reader031/viewer/2022012019/61688387d394e9041f701c9b/html5/thumbnails/15.jpg)
Synchronization using ARB_SYNC
OpenGL commands are asynchronous
When glDrawXXX returns, it does not mean the command has completed
A sync object, GLsync (ARB_sync), is used by multi-threaded apps that need synchronization
E.g. rendering a texture waits for upload completion
A fence is inserted in the unsignaled state and changes to signaled once it completes
//Upload
glTexSubImage(texID,..)
GLsync fence = glFenceSync(..)
//Render
glWaitSync(fence);
glBindTexture(.., texID);
![Page 16: GPU Programming for High- Performance Graphics Workstation](https://reader031.vdocuments.net/reader031/viewer/2022012019/61688387d394e9041f701c9b/html5/thumbnails/16.jpg)
Upload-Render-Readback Pipeline
// Upload thread
// Wait for signal to start upload
CPUWait(startUploadValid);
glWaitSync(startUpload[2]);
// Bind texture object
BindTexture(capTex[2]);
// Upload
glTexSubImage(texID…);
// Signal upload complete
endUpload[2] = glFenceSync(…);
CPUSignal(endUploadValid);
// Render thread
// Wait for download to complete
CPUWait(endDownloadValid);
glWaitSync(endDownload[3]);
// Wait for upload to complete
CPUWait(endUploadValid);
glWaitSync(endUpload[0]);
// Bind render target
glFramebufferTexture(playTex[3]);
// Bind video capture source texture
BindTexture(capTex[0]);
// Draw
// Signal next upload
startUpload[0] = glFenceSync(…);
CPUSignal(startUploadValid);
// Signal next download
startDownload[3] = glFenceSync(…);
CPUSignal(startDownloadValid);
// Readback (playout) thread
CPUWait(startDownloadValid);
glWaitSync(startDownload[2]);
// Readback
glGetTexImage(playTex[2]);
// Read pixels to PBO
// Signal download complete
endDownload[2] = glFenceSync(…);
CPUSignal(endDownloadValid);
True, S038 – Best Practices in GPU-based Video Processing, GTC 2012 Proceedings
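CPUWait and CPUSignal in the pipeline above are application helpers, not OpenGL calls, and the slides do not show how they are implemented. One plausible reading is that they guard the GLsync handle itself: a thread should not pass a fence to glWaitSync before the other thread has created it with glFenceSync. A minimal sketch of such an auto-reset event in standard C++, under the hypothetical name CpuEvent, could look like this:

```cpp
#include <condition_variable>
#include <mutex>

// Hypothetical auto-reset event standing in for CPUSignal/CPUWait.
// It only orders CPU threads; GPU ordering is still done by the fences.
class CpuEvent {
public:
    // Mark the event as set and wake one waiting thread.
    void signal() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            set_ = true;
        }
        cv_.notify_one();
    }

    // Block until the event is set, then reset it.
    void wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return set_; });
        set_ = false;
    }

    // Non-blocking variant: returns true and resets if the event was set.
    bool tryWait() {
        std::lock_guard<std::mutex> lock(mutex_);
        const bool wasSet = set_;
        set_ = false;
        return wasSet;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    bool set_ = false;
};
```

Under this reading, the upload thread would call signal() on its endUploadValid event right after glFenceSync, and the render thread would call wait() on the same event before glWaitSync. It is also worth flushing the signalling context (for example with glFlush) after glFenceSync, so that the fence is actually submitted before another context waits on it.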
![Page 17: GPU Programming for High- Performance Graphics Workstation](https://reader031.vdocuments.net/reader031/viewer/2022012019/61688387d394e9041f701c9b/html5/thumbnails/17.jpg)
Results
[Chart: Performance scaling from CPU asynchronous transfers on Quadro 6000. Scaling factor (axis 0 to 2, with reference lines marked "Perfect Scaling" and "No Scaling") against texture size (256KB, 1MB, 8MB, 32MB) for upload-render scaling and render-download scaling. Bandwidth labels on the chart: 4.2 GB/s, 3.2 GB/s, 1.4 GB/s, 900 MB/s.]
Larger texture sizes scale better
![Page 18: GPU Programming for High- Performance Graphics Workstation](https://reader031.vdocuments.net/reader031/viewer/2022012019/61688387d394e9041f701c9b/html5/thumbnails/18.jpg)
Debugging Transfers
Some OpenGL calls may not overlap between the transfer and render threads
E.g. non-transfer-related OpenGL calls in the transfer thread
The driver generates a debug message
“Pixel transfer is synchronized with 3D rendering”
The application uses ARB_debug_output to check the OpenGL debug log
OpenGL 4.0 and above
Currently supported for PBOs, not VBOs
Transfers will serialize on pre-Fermi hardware
GL_ARB_debug_output - http://www.opengl.org/registry/specs/ARB/debug_output.txt
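As a rough illustration of checking the debug log for this message, the sketch below classifies message strings without touching OpenGL. It assumes the application registers a callback through glDebugMessageCallbackARB (from GL_ARB_debug_output) and forwards the callback's message argument to these helpers; isSerializedTransfer, TransferDebugStats and recordDebugMessage are hypothetical names, and the exact driver wording may differ from the text quoted on the slide.

```cpp
#include <cstring>

// Message text quoted on the slide; the real driver string may differ.
static const char kSerializedTransferText[] =
    "Pixel transfer is synchronized with 3D rendering";

// True if a debug-log message reports a transfer that did not overlap rendering.
bool isSerializedTransfer(const char* message) {
    return message != nullptr &&
           std::strstr(message, kSerializedTransferText) != nullptr;
}

// Hypothetical counters a debug callback could update.
struct TransferDebugStats {
    unsigned serialized = 0;  // transfers reported as synchronized with rendering
    unsigned other = 0;       // every other debug message
};

// Intended to be called from the application's debug callback.
void recordDebugMessage(TransferDebugStats& stats, const char* message) {
    if (isSerializedTransfer(message)) {
        ++stats.serialized;
    } else {
        ++stats.other;
    }
}
```

A non-zero serialized count after a run would suggest that some call in the transfer thread is preventing overlap and is worth inspecting.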
![Page 19: GPU Programming for High- Performance Graphics Workstation](https://reader031.vdocuments.net/reader031/viewer/2022012019/61688387d394e9041f701c9b/html5/thumbnails/19.jpg)
Debugging with Nsight Visual Studio
![Page 20: GPU Programming for High- Performance Graphics Workstation](https://reader031.vdocuments.net/reader031/viewer/2022012019/61688387d394e9041f701c9b/html5/thumbnails/20.jpg)
Scaling Rendering on Multi-GPU
Focus on OpenGL graphics
Onscreen Rendering
Display scaling for multi-projector, multi-tiled display environments: http://developer.download.nvidia.com/GTC/PDF/GTC2012/PresentationPDF/S0353-GTC2012-Multi-GPU-Rendering.pdf
Offscreen Parallel Rendering
Image Scaling – final image resolution
Data scaling – texture size, # triangles
Task/Process Scaling – eg render farm serving thin clients
![Page 21: GPU Programming for High- Performance Graphics Workstation](https://reader031.vdocuments.net/reader031/viewer/2022012019/61688387d394e9041f701c9b/html5/thumbnails/21.jpg)
Multi-GPU - Transparent Behavior
Default Behavior of OGL command dispatch
Win XP : Sent to all GPUs, slowest GPU gates performance
Linux : Only to the GPU attached to screen
Win 7: Sent to most powerful GPU and blitted across
SLI AFR
Single threaded application
Data and commands are replicated across all GPUs
![Page 22: GPU Programming for High- Performance Graphics Workstation](https://reader031.vdocuments.net/reader031/viewer/2022012019/61688387d394e9041f701c9b/html5/thumbnails/22.jpg)
Specifying OpenGL GPU on NVIDIA Quadro
Directed GPU Rendering
Quadro-only
Heuristics for automatic GPU selection
Allows the app to pick the GPU for rendering, with a fast blit path to other displays
Set programmatically using NVAPI or through the control panel (CPL)
![Page 23: GPU Programming for High- Performance Graphics Workstation](https://reader031.vdocuments.net/reader031/viewer/2022012019/61688387d394e9041f701c9b/html5/thumbnails/23.jpg)
Programming for Multi-GPU
Linux
Specify separate X screens using XOpenDisplay
Xinerama disabled
Windows
Vendor-specific extensions
NVIDIA: WGL_NV_gpu_affinity extension
AMD: WGL_AMD_gpu_association extension
// Linux: one X screen per GPU, e.g. ":0.0", ":0.1"
std::string name = ":0." + std::to_string(gpu);
Display* dpy = XOpenDisplay(name.c_str());
GLXContext ctx = glXCreateContextAttribsARB(dpy, …);
![Page 24: GPU Programming for High- Performance Graphics Workstation](https://reader031.vdocuments.net/reader031/viewer/2022012019/61688387d394e9041f701c9b/html5/thumbnails/24.jpg)
GPU Affinity– Enumerating and attaching to GPUs
Enumerate GPUs
BOOL wglEnumGpusNV(UINT iGpuIndex, HGPUNV *phGPU);
Enumerate displays per GPU
BOOL wglEnumGpuDevicesNV(HGPUNV hGPU, UINT iDeviceIndex, PGPU_DEVICE lpGpuDevice);
Pin an OpenGL context to a specific GPU
HGPUNV hGPU;
for (UINT i = 0; wglEnumGpusNV(i, &hGPU); ++i) {
    HGPUNV gpuMask[2] = { hGPU, NULL };   // NULL-terminated list holding this GPU only
    // Get an affinity DC for this GPU
    HDC affinityDC = wglCreateAffinityDCNV(gpuMask);
    setPixelFormat(affinityDC);            // application helper, as on the slide
    HGLRC affinityGLRC = wglCreateContext(affinityDC);
}
![Page 25: GPU Programming for High- Performance Graphics Workstation](https://reader031.vdocuments.net/reader031/viewer/2022012019/61688387d394e9041f701c9b/html5/thumbnails/25.jpg)
Scaling Rendering
Scaling data size using Sort-Last approach
E.g. Visible Human dataset: 14GB 3D texture rendered across 4 GPUs
[Figure: data distribution and render on GPU #0 to GPU #3, followed by sort and alpha composite into the final image; display is decoupled from render.]
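The slides do not spell out the compositing step, so the following is only a sketch of what the alpha composite in a sort-last pipeline commonly looks like: the "over" operator applied to per-GPU images once they are sorted by depth. It assumes premultiplied-alpha colours in [0, 1] and layers ordered front to back; Rgba, over and compositeFrontToBack are hypothetical names.

```cpp
#include <vector>

// One pixel with premultiplied alpha; all components in [0, 1] (assumption).
struct Rgba {
    float r, g, b, a;
};

// "Over" operator for premultiplied alpha: front drawn over back.
Rgba over(const Rgba& front, const Rgba& back) {
    const float k = 1.0f - front.a;  // how much of the back layer shows through
    return { front.r + k * back.r,
             front.g + k * back.g,
             front.b + k * back.b,
             front.a + k * back.a };
}

// Composites one pixel from each GPU's partial image.
// layers[0] is nearest the viewer; the order comes from the sort step.
Rgba compositeFrontToBack(const std::vector<Rgba>& layers) {
    Rgba acc = { 0.0f, 0.0f, 0.0f, 0.0f };  // fully transparent start
    for (const Rgba& layer : layers) {
        acc = over(acc, layer);
    }
    return acc;
}
```

In practice this would run per pixel, typically in a shader on the compositing GPU rather than on the CPU; the CPU version here is only meant to show the arithmetic.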
![Page 26: GPU Programming for High- Performance Graphics Workstation](https://reader031.vdocuments.net/reader031/viewer/2022012019/61688387d394e9041f701c9b/html5/thumbnails/26.jpg)
Using GPU Affinity
App manages
Distributing render workload
Implementing various composition methods for final image assembly
Inter-GPU communication
Data, image & task scaling
[Figure: Scaling image resolution. Two GPUs (labels: Primary, Slave, Producer GPU, Consumer GPU), each with its own affinityDC and affinityGLRC (gpuMask=0 and gpuMask=1) set up through wglCreateContext and wglMakeCurrent. Each GPU renders offscreen (FBO); the image is copied over PCI-e and composited, and the result is shown through winDC.]
![Page 27: GPU Programming for High- Performance Graphics Workstation](https://reader031.vdocuments.net/reader031/viewer/2022012019/61688387d394e9041f701c9b/html5/thumbnails/27.jpg)
Sharing data between GPUs
For multiple contexts on the same GPU: ShareLists & WGL/GLX_ARB_create_context
For multiple contexts across multiple GPUs
Readback (GPU1 to host), copies on host, upload (host to GPU0)
NV_copy_image extension for OpenGL 3.x
Windows - wglCopyImageSubDataNV
Linux - glXCopyImageSubDataNV
Avoids extra copies; the same pinned host memory is accessed by both GPUs
![Page 28: GPU Programming for High- Performance Graphics Workstation](https://reader031.vdocuments.net/reader031/viewer/2022012019/61688387d394e9041f701c9b/html5/thumbnails/28.jpg)
NV_Copy_Image Extension
Transfer in single call
No binding of objects
No state changes
Supports 2D, 3D textures & cube maps
Async for Fermi & above
Requires programming
[Figure: two GPUs, each with a copy engine, a graphics engine and GPU memory; srcTex (in srcCtx) is copied to destTex (in destCtx) on the consumer GPU.]
wglCopyImageSubDataNV(srcCtx, srcTex, GL_TEXTURE_2D,0, 0, 0, 0,
destCtx, destTex, GL_TEXTURE_2D, 0, 0, 0, 0,
width, height, 1);
![Page 29: GPU Programming for High- Performance Graphics Workstation](https://reader031.vdocuments.net/reader031/viewer/2022012019/61688387d394e9041f701c9b/html5/thumbnails/29.jpg)
Producer-Consumer Application Structure
One thread per GPU to maximize CPU core utilization
OpenGL commands are asynchronous
Need GPU-level synchronization
Use GL_ARB_sync
Can scale to multiple producers/consumers
[Figure: the producer (srcCtx) renders with glDraw* into an FBO whose attachments, set with glFramebufferTexture, are srcTex[nBuffers] in its GPU memory. The step labelled glCopyImageNV copies each buffer to destTex[nBuffers] in the consumer's GPU memory, where the consumer (destCtx) binds it with glBindTex… for the app.]
![Page 30: GPU Programming for High- Performance Graphics Workstation](https://reader031.vdocuments.net/reader031/viewer/2022012019/61688387d394e9041f701c9b/html5/thumbnails/30.jpg)
Applications : Texture/Geometry Scaling
Adding more GPUs increases transfer time
But scales data size
Full-res images transferred between GPUs
Volumetric Data
Transfer RGBA images
Polygonal Data (2X transfer overhead)
Transfer RGBA and Depth (32bit) images
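The 2X figure follows from the per-pixel payload. A small sketch of the arithmetic, assuming 8 bits per RGBA channel (the slide only states that depth is 32-bit) and using hypothetical function names:

```cpp
#include <cstddef>

constexpr std::size_t kRgbaBytesPerPixel = 4;   // assumed: 8 bits per channel
constexpr std::size_t kDepthBytesPerPixel = 4;  // 32-bit depth, as on the slide

// Bytes moved between GPUs per full-resolution image of volumetric data (RGBA only).
constexpr std::size_t volumeFrameBytes(std::size_t width, std::size_t height) {
    return width * height * kRgbaBytesPerPixel;
}

// Bytes moved per full-resolution image of polygonal data (RGBA plus depth).
constexpr std::size_t polygonFrameBytes(std::size_t width, std::size_t height) {
    return width * height * (kRgbaBytesPerPixel + kDepthBytesPerPixel);
}
```

For a 1920x1080 image this gives 8,294,400 bytes for RGBA alone and 16,588,800 bytes with depth, which is where the doubled transfer overhead for polygonal data comes from.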
![Page 31: GPU Programming for High- Performance Graphics Workstation](https://reader031.vdocuments.net/reader031/viewer/2022012019/61688387d394e9041f701c9b/html5/thumbnails/31.jpg)
Applications : Task Scaling
Render scaling
Flight simulation, raytracing
Server-side rendering
Assign GPU for a user depending on heuristics
E.g. using GL_NVX_gpu_memory_info to assign a GPU
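One possible heuristic, sketched here as an assumption rather than the presenters' method, is to send each new user to the GPU with the most free video memory. The sketch takes the per-GPU values as plain integers; in a real server they would presumably be read in each GPU's context with glGetIntegerv and GL_GPU_MEMORY_INFO_CURRENT_AVAILABLE_VIDMEM_NVX, which reports kilobytes. pickGpuByFreeMemory is a hypothetical name.

```cpp
#include <cstddef>
#include <vector>

// Returns the index of the GPU with the most available video memory,
// or -1 if the list is empty. Ties go to the lowest index.
int pickGpuByFreeMemory(const std::vector<int>& freeKbPerGpu) {
    int best = -1;
    for (std::size_t i = 0; i < freeKbPerGpu.size(); ++i) {
        if (best < 0 || freeKbPerGpu[i] > freeKbPerGpu[static_cast<std::size_t>(best)]) {
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

Free memory is only one signal; a production scheduler would likely also weigh current load and the size of the scene the user is about to open.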
![Page 32: GPU Programming for High- Performance Graphics Workstation](https://reader031.vdocuments.net/reader031/viewer/2022012019/61688387d394e9041f701c9b/html5/thumbnails/32.jpg)
References
OpenGL Insights chapters
Chapter 29 - Fermi Asynchronous Texture Transfers
Chapter 27 - Multi-GPU Rendering on NVIDIA Quadro
Source Code -
https://github.com/OpenGLInsights/OpenGLInsightsCode
GTC 2012 On-demand talks
http://www.gputechconf.com/gtcnew/on-demand-gtc.php
S0353 - Programming Multi-GPUs for Scalable Rendering
S0356 - Optimized Texture Transfers