GPU Programming for High-Performance Graphics Workstation Applications
Shalini Venkataraman, Alina Alt, Will Braithwaite (Applied Engineering, NVIDIA PSG)


Page 1: GPU Programming for High- Performance Graphics Workstation

GPU Programming for High-Performance Graphics Workstation Applications

Shalini Venkataraman, Alina Alt, Will Braithwaite

Applied Engineering, NVIDIA PSG

Page 2: GPU Programming for High- Performance Graphics Workstation

NVIDIA Confidential

Talk Outline

Scaling transfers and rendering

- Shalini Venkataraman

Mixing graphics and compute

- Alina Alt

Developing an optimized Maya plugin using CUDA and OpenGL

- Will Braithwaite

Questions at the end

Page 3: GPU Programming for High- Performance Graphics Workstation

Scaling Transfers and Rendering

Overlapping transfers & rendering
- Implementing various transfer methods
- Multi-threading and Synchronization
- Debugging transfers
- Best Practices & Results

Scaling to Multi-GPU
- Pinning OpenGL context to GPU
- Application structure
- Optimized inter-GPU transfers

Page 4: GPU Programming for High- Performance Graphics Workstation

Applications

Streaming videos, time-varying geometry or volumes
- Broadcast, real-time fluid simulations, etc.

Level of detail
- Out-of-core image viewers, terrain engines
- Bricks paged in as needed

Parallel rendering
- Fast communication between multiple GPUs for scaling data/render

Remoting graphics
- Read back GPU results fast and stream them over the network

Page 5: GPU Programming for High- Performance Graphics Workstation

Previous Approach – Synchronous Transfers

Straightforward

Upload texture every frame

Driver does all copy

Copy, download and draw are sequential

Page 6: GPU Programming for High- Performance Graphics Workstation

Previous Approach - CPU Asynchronous Transfers

Non CPU-blocking transfer using Pixel Buffer Objects (PBO)

Ping-pong PBOs for optimal throughput

Data must be in GPU native format

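[Diagram: bricks pData[0] to pData[nBricks-1] are loaded from disk into main memory. The next brick (Data_next) is memcpy'd into one of PBO0/PBO1 in OpenGL-controlled memory, while the current brick (Data_cur) is copied from the other PBO into the texture texID in graphics memory with glTexSubImage.]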

Page 7: GPU Programming for High- Performance Graphics Workstation

CPU Asynchronous - Timeline

[Timeline figure for CPU Async, time running left to right, one frame draw per step]
CPU: Upload t0 (PBO0), Upload t1 (PBO1), Upload t2 (PBO0)
Bus: Copy t0 (PBO0), Copy t1 (PBO1), Copy t2 (PBO0)
GPU: Draw t0, Draw t1

Analysis with GPUView (http://graphics.stanford.edu/~mdfisher/GPUView.html)

Page 8: GPU Programming for High- Performance Graphics Workstation

Example – 3D texture upload + Ping-Pong PBOs

GLuint pbo[2];             // ping-pong PBOs; generate and initialize them ahead of time
unsigned int curPBO = 0;

// Bind the current PBO for the app->PBO transfer
glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, pbo[curPBO]);
// size is the PBO size in bytes; it must be at least xdim*ydim*zdim
GLubyte* ptr = (GLubyte*)glMapBufferRange(GL_PIXEL_UNPACK_BUFFER_ARB, 0, size,
                                          GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);
memcpy(ptr, pData[curBrick], xdim*ydim*zdim);
glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER_ARB);

// Copy pixels from the other PBO to the texture object
glBindTexture(GL_TEXTURE_3D, texId);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, pbo[1-curPBO]);
glTexSubImage3D(GL_TEXTURE_3D, 0, 0, 0, 0, xdim, ydim, zdim, GL_LUMINANCE, GL_UNSIGNED_BYTE, 0);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, 0);
glBindTexture(GL_TEXTURE_3D, 0);

curPBO = 1-curPBO;         // swap the roles of the two PBOs for the next frame

// Call drawing code here
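The index arithmetic behind the ping-pong scheme can be checked without a GL context. The following is a minimal sketch in plain C++, not part of the original slides: the names PingPongStep and pingPongSchedule are hypothetical, and the function only records which PBO the CPU would fill and which PBO would feed glTexSubImage3D on each frame.

```cpp
#include <cassert>
#include <vector>

// One frame of the ping-pong schedule (hypothetical helper type).
struct PingPongStep {
    unsigned fillPbo;  // PBO that receives the memcpy of the next brick
    unsigned copyPbo;  // PBO used as the source of the texture copy
};

// Replays the curPBO = 1 - curPBO toggle from the example for `frames` frames.
std::vector<PingPongStep> pingPongSchedule(unsigned frames) {
    std::vector<PingPongStep> steps;
    unsigned curPBO = 0;
    for (unsigned f = 0; f < frames; ++f) {
        assert(curPBO < 2);                    // the index is always 0 or 1
        steps.push_back({curPBO, 1 - curPBO});
        curPBO = 1 - curPBO;                   // swap roles for the next frame
    }
    return steps;
}
```

On frame 0 the copy PBO has not been filled yet, so an application following this pattern would typically prime the first brick before entering the loop.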

Page 9: GPU Programming for High- Performance Graphics Workstation

Results – Synchronous vs CPU Async

- Transfers only
- Adding rendering will reduce bandwidth, GPU can’t do both
- Ideally – want to sustain bandwidth with render, need GPU overlap

[Chart: PBO vs Synchronous uploads - Quadro 6000. Y axis: bandwidth (MB/s), 0 to 4500. X axis: brick size, 16^3 (4KB), 32^3 (32KB), 64^3 (256KB), 128^3 (2MB), 256^3 (16MB). Series: PBO (MB/s) and TexSubImage (MB/s).]
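The brick sizes on the chart follow from one byte per voxel, which matches the GL_LUMINANCE / GL_UNSIGNED_BYTE upload used earlier. A small sketch of that arithmetic and of a MB/s figure is given below; the helper names are hypothetical, and 1 MB is assumed to mean 1024 * 1024 bytes.

```cpp
#include <cassert>
#include <cstddef>

// Bytes in a cubic brick with `dim` voxels per side, one byte per voxel.
std::size_t brickBytes(std::size_t dim) { return dim * dim * dim; }

// Throughput in MB/s (1 MB = 1024 * 1024 bytes) for `bytes` moved in `seconds`.
double bandwidthMBps(std::size_t bytes, double seconds) {
    assert(seconds > 0.0);
    return static_cast<double>(bytes) / (1024.0 * 1024.0) / seconds;
}
```

For example, a 128^3 brick is 2 MB, so uploading it in half a second would correspond to 4 MB/s.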

Page 10: GPU Programming for High- Performance Graphics Workstation

Achieving GPU Overlap – Copy Engines

Fermi+ GPUs have copy engines (CEs)
- GeForce, low-end Quadro: 1 CE
- Quadro 4000+: 2 CEs

Allows copy-to-host + compute + copy-to-device to overlap simultaneously

Graphics/OpenGL
- Using PBOs in multiple threads
- Handle synchronization

Page 11: GPU Programming for High- Performance Graphics Workstation

GPU Asynchronous Transfers

- Downloads/uploads in separate thread
- Using OpenGL PBOs
- ARB_SYNC used for context synchronization

[Timeline figure, one frame draw per step; labels: Using PBO, Using CE]
CPU: Upload t0 (PBO0), Upload t1 (PBO1), Upload t2 (PBO0)
Bus: Copy t0 (PBO0), Copy t1 (PBO1), Copy t2 (PBO0)
GPU: Draw t0, Draw t1, Draw t2

[Thread diagram: the main app thread runs Init; Upload, Draw and Readback operate on shared textures.]

Page 12: GPU Programming for High- Performance Graphics Workstation

Multi-threaded Context Creation

Sharing textures between multiple contexts
- Don’t use wglShareLists
- Use WGL_ARB_create_context / GLX_ARB_create_context instead
- Set OpenGL debug on

static const int contextAttribs[] =
{
    WGL_CONTEXT_FLAGS_ARB, WGL_CONTEXT_DEBUG_BIT_ARB,
    0
};
mainGLRC = wglCreateContextAttribsARB(winDC, 0, contextAttribs);
wglMakeCurrent(winDC, mainGLRC);
glGenTextures(numTextures, srcTex);

// uploadGLRC now shares all its textures with mainGLRC
uploadGLRC = wglCreateContextAttribsARB(winDC, mainGLRC, contextAttribs);

// Create the upload thread
// Do the same for readback if it is used

Page 13: GPU Programming for High- Performance Graphics Workstation

Upload-Render: Application Layout

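[Diagram: the upload thread (context uploadGLRC) memcpy's the next brick (Data_next) from pData[nBricks] in main memory, loaded from disk, into PBO0/PBO1 in OpenGL-controlled memory, and copies the current brick (Data_cur) into srcTex[numTextures] in graphics memory with glTexSubImage. The render thread (context mainGLRC) binds those textures with glBindTexture.]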

Page 14: GPU Programming for High- Performance Graphics Workstation

Adding Render – Readback

[Diagram: the render thread (context mainGLRC) draws into resultTex[numTextures] in graphics memory, attached with glFramebufferTexture(GL_DRAW_FRAMEBUFFER, …). The readback thread (context readbackGLRC) reads the current frame (Frame_cur) into PBO0/PBO1 in OpenGL-controlled memory with glGetTexImage and memcpy's the previous frame (Frame_prev) into Images[nFrames] in main memory.]

Use glGetTexImage, not glReadPixels between contexts/threads

Page 15: GPU Programming for High- Performance Graphics Workstation

Synchronization using ARB_SYNC

- OpenGL commands are asynchronous
  - When glDrawXXX returns, the command has not necessarily completed
- A sync object, GLsync (ARB_sync), is used for multi-threaded apps that need synchronization
  - E.g. rendering a texture waits for upload completion
- A fence is inserted in an unsignaled state and changes to signaled once it completes

// Upload thread                     // Render thread
glTexSubImage(texID,..)              glWaitSync(fence);
GLsync fence = glFenceSync(..)       glBindTexture(.., texID);

[Diagram: the fence goes from unsignaled to signaled when the upload completes.]
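The unsignaled-to-signaled handoff can be illustrated on the CPU alone. The sketch below is an analogy in standard C++ and not OpenGL code: the Fence class and runUploadThenRender are hypothetical names, and the wait here blocks a CPU thread, whereas glWaitSync makes the GL server wait without stalling the calling thread.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// CPU-side stand-in for a sync object: created unsignaled, signaled once.
class Fence {
public:
    void signal() {
        { std::lock_guard<std::mutex> lock(m_); signaled_ = true; }
        cv_.notify_all();
    }
    void wait() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return signaled_; });
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    bool signaled_ = false;
};

// Runs an "upload" and a "render" thread and returns the order of their work.
std::vector<std::string> runUploadThenRender() {
    Fence uploadDone;
    std::vector<std::string> log;
    std::thread render([&] {
        uploadDone.wait();        // plays the role of glWaitSync(fence)
        log.push_back("render");  // plays the role of glBindTexture + draw
    });
    std::thread upload([&] {
        log.push_back("upload");  // plays the role of glTexSubImage
        uploadDone.signal();      // the fence becomes signaled
    });
    upload.join();
    render.join();
    assert(log.size() == 2);
    return log;
}
```

Whichever thread starts first, the render step is recorded after the upload step, which is the ordering guarantee the fence is there to provide.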

Page 16: GPU Programming for High- Performance Graphics Workstation

Upload-Render-Readback Pipeline

Upload Thread

// Wait for signal to start upload
CPUWait(startUploadValid);
glWaitSync(startUpload[2]);
// Bind texture object
BindTexture(capTex[2]);
// Upload
glTexSubImage(texID…);
// Signal upload complete
endUpload[2] = glFenceSync(…);
CPUSignal(endUploadValid);

Render Thread

// Wait for download to complete
CPUWait(endDownloadValid);
glWaitSync(endDownload[3]);
// Wait for upload to complete
CPUWait(endUploadValid);
glWaitSync(endUpload[0]);
// Bind render target
glFramebufferTexture(playTex[3]);
// Bind video capture source texture
BindTexture(capTex[0]);
// Draw
// Signal next upload
startUpload[0] = glFenceSync(…);
CPUSignal(startUploadValid);
// Signal next download
startDownload[3] = glFenceSync(…);
CPUSignal(startDownloadValid);

Readback Thread

// Playout thread
CPUWait(startDownloadValid);
glWaitSync(startDownload[2]);
// Readback: read pixels to PBO
glGetTexImage(playTex[2]);
// Signal download complete
endDownload[2] = glFenceSync(…);
CPUSignal(endDownloadValid);

[Diagram: two four-entry texture arrays, each indexed [0] to [3].]

True, S038 – Best Practices in GPU-based Video Processing, GTC 2012 Proceedings

Page 17: GPU Programming for High- Performance Graphics Workstation

Results

[Chart: Performance Scaling from CPU Asynchronous Transfers, Quadro 6000. Y axis: scaling factor, 0 to 2, with reference lines for No Scaling and Perfect Scaling. X axis: texture size, 256KB, 1MB, 8MB, 32MB. Series: Upload-Render Scaling and Render-Download Scaling. Data labels: 4.2 GB/s, 3.2 GB/s, 1.4 GB/s, 900 MB/s.]

Larger texture sizes scale better

Page 18: GPU Programming for High- Performance Graphics Workstation

Debugging Transfers

Some OGL calls may not overlap between the transfer and render threads
- E.g. non-transfer-related OGL calls in the transfer thread
- Driver generates the debug message “Pixel transfer is synchronized with 3D rendering”

Application uses ARB_DEBUG_OUTPUT to check the OGL debug log
- OpenGL 4.0 and above
- Currently supported for PBOs, not VBOs
- Will serialize on pre-Fermi hardware

GL_ARB_debug_output - http://www.opengl.org/registry/specs/ARB/debug_output.txt
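One way to act on that debug message is to test each incoming message string for the quoted text. The sketch below covers only the string check, in plain C++; the function name is hypothetical, and it assumes the application receives message strings through a callback registered with glDebugMessageCallbackARB and that the driver text matches the wording quoted on the slide.

```cpp
#include <cassert>
#include <cstring>

// Message text as quoted on the slide; the exact driver wording is an assumption.
const char* const kSerializedTransferMsg =
    "Pixel transfer is synchronized with 3D rendering";

// Returns true if a driver debug message reports a serialized transfer.
bool reportsSerializedTransfer(const char* debugMessage) {
    assert(debugMessage != nullptr);
    return std::strstr(debugMessage, kSerializedTransferMsg) != nullptr;
}
```

A debug callback could call this on every message and count or log the hits, which makes accidental serialization visible during development.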

Page 19: GPU Programming for High- Performance Graphics Workstation

Debugging with Nsight Visual Studio

Page 21: GPU Programming for High- Performance Graphics Workstation

Multi-GPU - Transparent Behavior

Default behavior of OGL command dispatch
- Win XP: sent to all GPUs, slowest GPU gates performance
- Linux: only to the GPU attached to the screen
- Win 7: sent to the most powerful GPU and blitted across

SLI AFR
- Single-threaded application
- Data and commands are replicated across all GPUs

Page 22: GPU Programming for High- Performance Graphics Workstation

Specifying OpenGL GPU on NVIDIA Quadro

Directed GPU Rendering
- Quadro-only
- Heuristics for automatic GPU selection
- Allow app to pick the GPU for rendering, fast blit path to other displays
- Programmatically using NVAPI or using CPL

Page 23: GPU Programming for High- Performance Graphics Workstation

Programming for Multi-GPU

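Linux
- Specify separate X screens using XOpenDisplay
- Xinerama disabled

Windows
- Vendor-specific extension
- NVIDIA: NV_GPU_AFFINITY extension
- AMD cards: AMD_GPU_Association

// Linux: open the X screen that belongs to the chosen GPU
char name[16];
snprintf(name, sizeof(name), ":0.%d", gpu);
Display* dpy = XOpenDisplay(name);
GLXContext ctx = glXCreateContextAttribsARB(dpy, …);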
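The display name passed to XOpenDisplay has the form ":display.screen", so the per-GPU name has to be built as a string rather than by adding an integer to a string literal. A minimal sketch follows; the helper name is hypothetical, and it assumes one X screen per GPU on display 0, as on the slide.

```cpp
#include <cassert>
#include <string>

// Builds an X display name of the form ":<display>.<screen>".
std::string xScreenName(int display, int screen) {
    assert(display >= 0 && screen >= 0);
    return ":" + std::to_string(display) + "." + std::to_string(screen);
}
```

The result's c_str() would then be the argument handed to XOpenDisplay for the GPU in question.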

Page 24: GPU Programming for High- Performance Graphics Workstation

GPU Affinity– Enumerating and attaching to GPUs

Enumerate GPUs
BOOL wglEnumGpusNV(UINT iGpuIndex, HGPUNV *phGPU);

Enumerate displays per GPU
BOOL wglEnumGpuDevicesNV(HGPUNV hGPU, UINT iDeviceIndex, PGPU_DEVICE lpGpuDevice);

Pinning OpenGL context to a specific GPU
For each GPU i enumerated {
    GpuMask[0] = hGPU[i];
    GpuMask[1] = NULL;      // the GPU list is NULL-terminated
    // Get affinity DC based on GPU
    HDC affinityDC = wglCreateAffinityDCNV(GpuMask);
    setPixelFormat(affinityDC);
    HGLRC affinityGLRC = wglCreateContext(affinityDC);
}

Page 25: GPU Programming for High- Performance Graphics Workstation

Scaling Rendering

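Scaling data size using Sort-Last approach
- E.g. Visible Human Dataset: 14GB 3D texture rendered across 4 GPUs

[Diagram: data is distributed to GPU #0, GPU #1, GPU #2 and GPU #3, which each render their part (Data Distribution + Render). The partial images are sorted and alpha-composited (Sort + Alpha Composite) into the final image. Display is decoupled from render.]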
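The Sort + Alpha Composite step can be sketched for a single pixel. The code below is an illustration under stated assumptions rather than the presenters' implementation: it uses the standard "over" operator with premultiplied alpha, the type and function names are hypothetical, and the layers are assumed to be already sorted front to back.

```cpp
#include <cassert>
#include <vector>

// RGBA pixel with premultiplied alpha, components in [0, 1].
struct Rgba { float r, g, b, a; };

// "Over" operator: places `front` over `back`.
Rgba over(const Rgba& front, const Rgba& back) {
    assert(front.a >= 0.0f && front.a <= 1.0f);
    const float t = 1.0f - front.a;  // how much of `back` shows through
    return {front.r + t * back.r, front.g + t * back.g,
            front.b + t * back.b, front.a + t * back.a};
}

// Composites one pixel from each GPU's partial image. `layers` must already
// be sorted front to back (the sort step).
Rgba compositeFrontToBack(const std::vector<Rgba>& layers) {
    Rgba result{0.0f, 0.0f, 0.0f, 0.0f};
    for (const Rgba& layer : layers) {
        result = over(result, layer);  // the accumulated result stays in front
    }
    return result;
}
```

An opaque front layer hides everything behind it, while a half-transparent red layer over opaque green yields equal parts of both.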

Page 26: GPU Programming for High- Performance Graphics Workstation

Using GPU Affinity

App manages
- Distributing render workload
- Implementing various composition methods for final image assembly
- Inter-GPU communication

Data, image & task scaling

[Diagram: Scaling Image Resolution. Two GPUs (gpuMask=0 and gpuMask=1) each get an affinityDC and an affinityGLRC via wglCreateContext and wglMakeCurrent, and each renders offscreen (FBO). The producer GPU's image is copied over PCI-e to the consumer GPU, which composites it and presents through winDC (wglMakeCurrent). Labels: Primary, Slave.]

Page 27: GPU Programming for High- Performance Graphics Workstation

Sharing data between GPUs

For multiple contexts on the same GPU
- ShareLists & GL_ARB_Create_Context

For multiple contexts across multiple GPUs
- Readback (GPU1 to host), copies on host, upload (host to GPU0)
- NV_copy_image extension for OGL 3.x
  - Windows: wglCopyImageSubDataNV
  - Linux: glXCopyImageSubDataNV
  - Avoids extra copies, same pinned host memory is accessed by both GPUs

Page 28: GPU Programming for High- Performance Graphics Workstation

NV_Copy_Image Extension

- Transfer in single call
  - No binding of objects
  - No state changes
  - Supports 2D, 3D textures & cube maps
- Async for Fermi & above
  - Requires programming

[Diagram: two GPUs, each with a copy engine, a graphics engine and GPU memory. srcTex lives in srcCtx on one GPU, destTex in destCtx on the consumer GPU.]

wglCopyImageSubDataNV(srcCtx, srcTex, GL_TEXTURE_2D, 0, 0, 0, 0,
                      destCtx, destTex, GL_TEXTURE_2D, 0, 0, 0, 0,
                      width, height, 1);

Page 29: GPU Programming for High- Performance Graphics Workstation

Producer-Consumer Application Structure

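- One thread per GPU to maximize CPU core utilization
- OpenGL commands are asynchronous
  - Need GPU-level synchronization
  - Use GL_ARB_SYNC
- Can scale to multiple producers/consumers

[Diagram: the app drives a producer and a consumer. The producer (srcCtx) renders with glDraw* into an FBO whose attachment is set with glFramebufferTexture to srcTex[nBuffers] (slots [0], [1], [2]) in its GPU memory. glCopyImageNV copies to destTex[nBuffers] (slots [0], [1], [2]) in the consumer's GPU memory, where the consumer (destCtx) binds it with glBindTex.]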
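The hand-off through nBuffers slots can be modelled on the CPU with a bounded ring shared by two threads. This is a sketch in standard C++ and not GL code: FrameRing and runProducerConsumer are hypothetical names, integers stand in for rendered frames, and condition variables stand in for the GPU-level fences a real application would use.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Fixed-size ring of buffer slots shared by one producer and one consumer.
class FrameRing {
public:
    explicit FrameRing(std::size_t nBuffers) : slots_(nBuffers) {
        assert(nBuffers > 0);
    }
    void push(int frame) {  // producer side: blocks while all slots are in use
        std::unique_lock<std::mutex> lock(m_);
        notFull_.wait(lock, [this] { return count_ < slots_.size(); });
        slots_[(head_ + count_) % slots_.size()] = frame;
        ++count_;
        notEmpty_.notify_one();
    }
    int pop() {             // consumer side: blocks while no slot is ready
        std::unique_lock<std::mutex> lock(m_);
        notEmpty_.wait(lock, [this] { return count_ > 0; });
        const int frame = slots_[head_];
        head_ = (head_ + 1) % slots_.size();
        --count_;
        notFull_.notify_one();
        return frame;
    }
private:
    std::vector<int> slots_;
    std::size_t head_ = 0, count_ = 0;
    std::mutex m_;
    std::condition_variable notFull_, notEmpty_;
};

// The producer "renders" frames 0..frames-1; the consumer receives them in order.
std::vector<int> runProducerConsumer(int frames, std::size_t nBuffers) {
    FrameRing ring(nBuffers);
    std::vector<int> received;
    std::thread producer([&] { for (int f = 0; f < frames; ++f) ring.push(f); });
    std::thread consumer([&] {
        for (int f = 0; f < frames; ++f) received.push_back(ring.pop());
    });
    producer.join();
    consumer.join();
    return received;
}
```

With more than one slot the producer can run ahead of the consumer by up to nBuffers frames, which is what lets the two GPUs work concurrently.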

Page 30: GPU Programming for High- Performance Graphics Workstation

Applications : Texture/Geometry Scaling

- Adding more GPUs increases transfer time
  - But scales data size
- Full-res images transferred between GPUs
  - Volumetric data: transfer RGBA images
  - Polygonal data (2X transfer overhead): transfer RGBA and depth (32-bit) images
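The 2X overhead for polygonal data follows from the per-pixel byte count. The sketch below assumes 8-bit RGBA channels (4 bytes per pixel), which the slide does not state, plus the 32-bit depth value it does mention; the function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Bytes moved per frame when a full-resolution image is sent between GPUs.
// RGBA is assumed to take 4 bytes per pixel; depth adds another 4 bytes.
std::size_t transferBytes(std::size_t width, std::size_t height, bool withDepth) {
    assert(width > 0 && height > 0);
    const std::size_t rgbaBytes = 4;
    const std::size_t depthBytes = withDepth ? 4 : 0;
    return width * height * (rgbaBytes + depthBytes);
}
```

For a 1920x1080 image this gives 8,294,400 bytes for RGBA alone and twice that with depth.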

Page 31: GPU Programming for High- Performance Graphics Workstation

Applications : Task Scaling

Render scaling
- Flight simulation, raytracing

Server-side rendering
- Assign a GPU to a user depending on heuristics
- E.g. using GL_NVX_gpu_memory_info to assign the GPU
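One possible heuristic is to give a new user the GPU with the most free video memory. The sketch below shows only the selection step and is an assumption about how such a heuristic might look: the function name is hypothetical, and the per-GPU values are assumed to have been queried beforehand, for example with GL_GPU_MEMORY_INFO_CURRENT_AVAILABLE_VIDMEM_NVX, which reports kilobytes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the index of the GPU with the most free video memory (in KB).
// On a tie the lowest index wins.
std::size_t pickGpuByFreeMemory(const std::vector<std::size_t>& freeKb) {
    assert(!freeKb.empty());
    std::size_t best = 0;
    for (std::size_t i = 1; i < freeKb.size(); ++i) {
        if (freeKb[i] > freeKb[best]) best = i;
    }
    return best;
}
```

A server could re-run the query and the selection each time a session starts, since free memory changes as users come and go.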

Page 32: GPU Programming for High- Performance Graphics Workstation

References

OpenGL Insights chapters
- Chapter 27 - Multi-GPU Rendering on NVIDIA Quadro
- Chapter 29 - Fermi Asynchronous Texture Transfers
- Source code - https://github.com/OpenGLInsights/OpenGLInsightsCode

GTC 2012 on-demand talks (http://www.gputechconf.com/gtcnew/on-demand-gtc.php)
- S0353 - Programming Multi-GPUs for Scalable Rendering
- S0356 - Optimized Texture Transfers