accelerating cloud graphics - nvidia · constrained vbv buffer size —network packet framing for...
TRANSCRIPT
![Page 1: Accelerating Cloud Graphics - NVIDIA · Constrained VBV buffer size —network packet framing for real time delivery CBR, VBR, Min QP CUDA Interoperability I-frame on-demand Max frame/slice](https://reader033.vdocuments.net/reader033/viewer/2022060707/6073a7cac758d2004719bf18/html5/thumbnails/1.jpg)
Accelerating Cloud Graphics Franck DIARD, Ph. D. SW Architect Distinguished Engineer, NVIDIA
![Page 2: Accelerating Cloud Graphics - NVIDIA · Constrained VBV buffer size —network packet framing for real time delivery CBR, VBR, Min QP CUDA Interoperability I-frame on-demand Max frame/slice](https://reader033.vdocuments.net/reader033/viewer/2022060707/6073a7cac758d2004719bf18/html5/thumbnails/2.jpg)
Agenda
30 minute talk
10 minute demo
10 minute Q&A
![Page 3: Accelerating Cloud Graphics - NVIDIA · Constrained VBV buffer size —network packet framing for real time delivery CBR, VBR, Min QP CUDA Interoperability I-frame on-demand Max frame/slice](https://reader033.vdocuments.net/reader033/viewer/2022060707/6073a7cac758d2004719bf18/html5/thumbnails/3.jpg)
GeForce® GRID
Lower Latency
Higher Density
Higher Quality
![Page 4: Accelerating Cloud Graphics - NVIDIA · Constrained VBV buffer size —network packet framing for real time delivery CBR, VBR, Min QP CUDA Interoperability I-frame on-demand Max frame/slice](https://reader033.vdocuments.net/reader033/viewer/2022060707/6073a7cac758d2004719bf18/html5/thumbnails/4.jpg)
Scope GeForce GRID
— Coherent set of technologies
— Moving GPUs into the Cloud, providing scalability
— Overcoming density and cost challenges
Hardware
— GPU architecture / System integration
Software
— APIs, SDK, SW environment
— Virtualization
— Clients
![Page 5: Accelerating Cloud Graphics - NVIDIA · Constrained VBV buffer size —network packet framing for real time delivery CBR, VBR, Min QP CUDA Interoperability I-frame on-demand Max frame/slice](https://reader033.vdocuments.net/reader033/viewer/2022060707/6073a7cac758d2004719bf18/html5/thumbnails/5.jpg)
Kepler, the First Cloud GPU
High performance per watt
Integrated hardware encoder
Low-latency frame buffer
reads
GPU Virtualization
28nm
![Page 6: Accelerating Cloud Graphics - NVIDIA · Constrained VBV buffer size —network packet framing for real time delivery CBR, VBR, Min QP CUDA Interoperability I-frame on-demand Max frame/slice](https://reader033.vdocuments.net/reader033/viewer/2022060707/6073a7cac758d2004719bf18/html5/thumbnails/6.jpg)
What for?
Streaming anything from Cloud GPUs
— Gaming
— Enterprise workstation/VDI
— Consumer destop
Mobile clients
— Tegra3
— Low power playback
— Convenience
![Page 7: Accelerating Cloud Graphics - NVIDIA · Constrained VBV buffer size —network packet framing for real time delivery CBR, VBR, Min QP CUDA Interoperability I-frame on-demand Max frame/slice](https://reader033.vdocuments.net/reader033/viewer/2022060707/6073a7cac758d2004719bf18/html5/thumbnails/7.jpg)
GeForce GRID Latency
CLIENT
Decode Render
Kybd/Mse
SERVER
Render
Capture
Encode
GeForce GRID
<30 ms
Network
30 ms
2 Frames
GeForce GRID
<16ms
IP Network
CPU NIC
![Page 8: Accelerating Cloud Graphics - NVIDIA · Constrained VBV buffer size —network packet framing for real time delivery CBR, VBR, Min QP CUDA Interoperability I-frame on-demand Max frame/slice](https://reader033.vdocuments.net/reader033/viewer/2022060707/6073a7cac758d2004719bf18/html5/thumbnails/8.jpg)
Server Optimized GPU
GPUs CUDA
Cores
Memory
Size
Memory
Perf
Shader
Perf TDP
Dual 3,072 8GB 320 GB/sec 4.7 TFLOPS 250W
![Page 9: Accelerating Cloud Graphics - NVIDIA · Constrained VBV buffer size —network packet framing for real time delivery CBR, VBR, Min QP CUDA Interoperability I-frame on-demand Max frame/slice](https://reader033.vdocuments.net/reader033/viewer/2022060707/6073a7cac758d2004719bf18/html5/thumbnails/9.jpg)
Server Grade Solution
Passive cooling
High Quality Components
![Page 10: Accelerating Cloud Graphics - NVIDIA · Constrained VBV buffer size —network packet framing for real time delivery CBR, VBR, Min QP CUDA Interoperability I-frame on-demand Max frame/slice](https://reader033.vdocuments.net/reader033/viewer/2022060707/6073a7cac758d2004719bf18/html5/thumbnails/10.jpg)
Power – Density, Savings
GeForce GRID Game Servers 4 GPUs / Server
4 Game Streams / Server
75W / Game Stream
First Generation Cloud 1 GPU / Server
1 Game Stream / Server
150W / Game Stream Power management library
— Per GPU Power capping
![Page 11: Accelerating Cloud Graphics - NVIDIA · Constrained VBV buffer size —network packet framing for real time delivery CBR, VBR, Min QP CUDA Interoperability I-frame on-demand Max frame/slice](https://reader033.vdocuments.net/reader033/viewer/2022060707/6073a7cac758d2004719bf18/html5/thumbnails/11.jpg)
GeForce Grid Software
SDK
— 1 Header file, 1 DLL
— Set of code samples, documentation
Server Side
— Accelerated frame grabbing
— Video compression
— Virtualization
Client Side
— PC low latency HW decode/display
— Tegra low latency decode/display
![Page 12: Accelerating Cloud Graphics - NVIDIA · Constrained VBV buffer size —network packet framing for real time delivery CBR, VBR, Min QP CUDA Interoperability I-frame on-demand Max frame/slice](https://reader033.vdocuments.net/reader033/viewer/2022060707/6073a7cac758d2004719bf18/html5/thumbnails/12.jpg)
Server Side Architecture
Frame Buffer
Render
Target
HO
ST
I/F
DR
AM
I/F
DR
AM
I/F
DR
AM
I/F
DR
AM
I/F
GPU Virtualization
HO
ST
I/F
H
OS
T I/F
H
OS
T I/F
Front
Buffer
NVIFR
NVENC
Render
Target Render
Target
NVIFR NVIFR H.264 Streams
Other Interfaces
NVFBC
![Page 13: Accelerating Cloud Graphics - NVIDIA · Constrained VBV buffer size —network packet framing for real time delivery CBR, VBR, Min QP CUDA Interoperability I-frame on-demand Max frame/slice](https://reader033.vdocuments.net/reader033/viewer/2022060707/6073a7cac758d2004719bf18/html5/thumbnails/13.jpg)
GPU Virtualization
Increase density, cutting cost
NVMOS
— Deployment, isolation
— nvidia.com NVIDIA driver in VM at bare metal speed
SDK: API Shimming
— Injection of application
— Inserting in band encoding calls
— Allows n games to run on GPU
![Page 14: Accelerating Cloud Graphics - NVIDIA · Constrained VBV buffer size —network packet framing for real time delivery CBR, VBR, Min QP CUDA Interoperability I-frame on-demand Max frame/slice](https://reader033.vdocuments.net/reader033/viewer/2022060707/6073a7cac758d2004719bf18/html5/thumbnails/14.jpg)
Windows 7+
GeForce Grid: Virtualization
GPU
REMOTE GRAPHICS
Windows 7+
Low Latency Frame Buffer Capture
Low Latency Render Target Capture
NVENC Low latency
Encoder
NVMOS Platform Virtualization
Dedicated GPU
Ad-hoc API Shimming DirectX
GPU
Windows 7+
Game Game
GPU GPU
Game Game
Game Game
![Page 15: Accelerating Cloud Graphics - NVIDIA · Constrained VBV buffer size —network packet framing for real time delivery CBR, VBR, Min QP CUDA Interoperability I-frame on-demand Max frame/slice](https://reader033.vdocuments.net/reader033/viewer/2022060707/6073a7cac758d2004719bf18/html5/thumbnails/15.jpg)
Frame Grabbing
Low latency
— Using async units in GPU, 0 CPU cycles
Convenient
— Minimal API, fast integration in existing stacks
Flexible API
— To HW H.264 encoder fastpath
— To system memory for CPU codecs
— To CUDA buffer for specialized codecs
![Page 16: Accelerating Cloud Graphics - NVIDIA · Constrained VBV buffer size —network packet framing for real time delivery CBR, VBR, Min QP CUDA Interoperability I-frame on-demand Max frame/slice](https://reader033.vdocuments.net/reader033/viewer/2022060707/6073a7cac758d2004719bf18/html5/thumbnails/16.jpg)
Whole Display Grabbing
Asynchronous Windows7 display grabber
Orthogonal to all GFX stacks (gdi,dx9, dx10, OGL)
Windows7 head, desktop games, flash games
Standard Windows API
— does not grab all cases
— incurs a severe performance hit
![Page 17: Accelerating Cloud Graphics - NVIDIA · Constrained VBV buffer size —network packet framing for real time delivery CBR, VBR, Min QP CUDA Interoperability I-frame on-demand Max frame/slice](https://reader033.vdocuments.net/reader033/viewer/2022060707/6073a7cac758d2004719bf18/html5/thumbnails/17.jpg)
Whole Display Grabbing
HW overlay, HW mouse, Aero on, off, transitions
Tear-free, all DMA, not vsync’d, format conversion, scaling
Performance:
— 4 ms to H.264 encoder, bits written back in system memory (720p)
— 2 ms API call to system memory
— 0.1 ms to CUDA
![Page 18: Accelerating Cloud Graphics - NVIDIA · Constrained VBV buffer size —network packet framing for real time delivery CBR, VBR, Min QP CUDA Interoperability I-frame on-demand Max frame/slice](https://reader033.vdocuments.net/reader033/viewer/2022060707/6073a7cac758d2004719bf18/html5/thumbnails/18.jpg)
Render Target Grabbing
SDK to use with API shimming
Render target read back: Dx9, Dx10, Dx11 (OGL planned)
— format conversion, scaling
In band with GFX API: Present() call
— Page locked sysmem
— H.264 interface
— CUDA interoperability
Asynchronous Event Signaling
— Not blocking main render loop
— CPU friendly, interrupt driven
![Page 19: Accelerating Cloud Graphics - NVIDIA · Constrained VBV buffer size —network packet framing for real time delivery CBR, VBR, Min QP CUDA Interoperability I-frame on-demand Max frame/slice](https://reader033.vdocuments.net/reader033/viewer/2022060707/6073a7cac758d2004719bf18/html5/thumbnails/19.jpg)
H.264 HW Encoding
Completely separate GPU unit: <2 watts
PSNR
— comparable to x264
up to 32 encoding contexts
— 4 HD streams @60fps
High Profile
— 720p: 4 ms
— 1080p: 8 ms
![Page 20: Accelerating Cloud Graphics - NVIDIA · Constrained VBV buffer size —network packet framing for real time delivery CBR, VBR, Min QP CUDA Interoperability I-frame on-demand Max frame/slice](https://reader033.vdocuments.net/reader033/viewer/2022060707/6073a7cac758d2004719bf18/html5/thumbnails/20.jpg)
H.264 Encoder Features
Constrained VBV buffer size
— network packet framing for real time delivery
CBR, VBR, Min QP
CUDA Interoperability
I-frame on-demand
Max frame/slice size Capping
Reference picture invalidation logic API for packet loss
4Kx4K support
Stereo MVC Encoding
![Page 21: Accelerating Cloud Graphics - NVIDIA · Constrained VBV buffer size —network packet framing for real time delivery CBR, VBR, Min QP CUDA Interoperability I-frame on-demand Max frame/slice](https://reader033.vdocuments.net/reader033/viewer/2022060707/6073a7cac758d2004719bf18/html5/thumbnails/21.jpg)
Client Side
Client side is important
— easy to ruin the user experience
Generic, CPU based plugins
— Slow decode and multi-frame buffering increase latency
— Slow render of decoded output
— CPU cycles burn a lot of power
GeForce GRID for client: low latency and low power
— GPU offload for decode and fast render
— CPU just drives the IP stack and feeds GPU hardware
![Page 22: Accelerating Cloud Graphics - NVIDIA · Constrained VBV buffer size —network packet framing for real time delivery CBR, VBR, Min QP CUDA Interoperability I-frame on-demand Max frame/slice](https://reader033.vdocuments.net/reader033/viewer/2022060707/6073a7cac758d2004719bf18/html5/thumbnails/22.jpg)
Client Side PC SDK
SDK for:
— bits in, frame out on the screen, lean and mean
— feeding from system memory buffer
HW decode on all nvidia GPUs: Windows, Linux
— 60 FPS HD on common NVIDIA GPUs
CUDA/DX/OGL interoperability
— Gamma correction
— Titling/HUD
![Page 23: Accelerating Cloud Graphics - NVIDIA · Constrained VBV buffer size —network packet framing for real time delivery CBR, VBR, Min QP CUDA Interoperability I-frame on-demand Max frame/slice](https://reader033.vdocuments.net/reader033/viewer/2022060707/6073a7cac758d2004719bf18/html5/thumbnails/23.jpg)
Client Side Tegra3 SDK
SDK:
— HoneyComb/ICS and up, native
— No added frame latency
— Bypass of OS traditional stack, this is not streaming
Decoder: 8ms 720p
Tear free display
720p / 60 FPS
1080p / 30 FPS
![Page 24: Accelerating Cloud Graphics - NVIDIA · Constrained VBV buffer size —network packet framing for real time delivery CBR, VBR, Min QP CUDA Interoperability I-frame on-demand Max frame/slice](https://reader033.vdocuments.net/reader033/viewer/2022060707/6073a7cac758d2004719bf18/html5/thumbnails/24.jpg)
Recorded Demo
Gaming with Tegra3 over WIFI from local server
![Page 25: Accelerating Cloud Graphics - NVIDIA · Constrained VBV buffer size —network packet framing for real time delivery CBR, VBR, Min QP CUDA Interoperability I-frame on-demand Max frame/slice](https://reader033.vdocuments.net/reader033/viewer/2022060707/6073a7cac758d2004719bf18/html5/thumbnails/25.jpg)
Recorded Demo
Win8 with Tegra3 over WIFI from local server
![Page 26: Accelerating Cloud Graphics - NVIDIA · Constrained VBV buffer size —network packet framing for real time delivery CBR, VBR, Min QP CUDA Interoperability I-frame on-demand Max frame/slice](https://reader033.vdocuments.net/reader033/viewer/2022060707/6073a7cac758d2004719bf18/html5/thumbnails/26.jpg)
Live Demos
Server
— CoreI7 2.6 Ghz
— @NVIDIA Headquarters, Santa Clara (10 miles away)
— Bare metal Win7 32
— Kepler Geforce GRID edition
4GB FB, 1500 cores
![Page 27: Accelerating Cloud Graphics - NVIDIA · Constrained VBV buffer size —network packet framing for real time delivery CBR, VBR, Min QP CUDA Interoperability I-frame on-demand Max frame/slice](https://reader033.vdocuments.net/reader033/viewer/2022060707/6073a7cac758d2004719bf18/html5/thumbnails/27.jpg)
BF3
Tegra Transformer Prime
— USB/Ethernet
720p
5Mbps
30 fps
HW H.264 Encoding, high profile, no B frames
![Page 28: Accelerating Cloud Graphics - NVIDIA · Constrained VBV buffer size —network packet framing for real time delivery CBR, VBR, Min QP CUDA Interoperability I-frame on-demand Max frame/slice](https://reader033.vdocuments.net/reader033/viewer/2022060707/6073a7cac758d2004719bf18/html5/thumbnails/28.jpg)
Desktop Remoting
Aero On
1080p
5 Mbps
Google SketchUp
Flash/Web gaming
Video playback
— HW overlay on server
![Page 29: Accelerating Cloud Graphics - NVIDIA · Constrained VBV buffer size —network packet framing for real time delivery CBR, VBR, Min QP CUDA Interoperability I-frame on-demand Max frame/slice](https://reader033.vdocuments.net/reader033/viewer/2022060707/6073a7cac758d2004719bf18/html5/thumbnails/29.jpg)
High End Gaming
DX10
8 Mbps
1080p
![Page 30: Accelerating Cloud Graphics - NVIDIA · Constrained VBV buffer size —network packet framing for real time delivery CBR, VBR, Min QP CUDA Interoperability I-frame on-demand Max frame/slice](https://reader033.vdocuments.net/reader033/viewer/2022060707/6073a7cac758d2004719bf18/html5/thumbnails/30.jpg)
Thanks To GeForce GRID Partners
Gaikai
Ubitus
Playcast
G-Cluster
Otoy
…