
Threading Games for Performance – Architecture – Case Studies

Threading Issues

Threads are a tool, not a ready-made solution.

Most threading tutorials use “embarrassingly parallel” examples.

Games are especially challenging for threading, because of architectural requirements and genre expectations.

Many issues have to be considered when implementing a threading strategy:

◦ Is high frame rate the most important performance indicator?

◦ Is input latency a deal-breaker?

◦ Is it fair for clients to run at different speeds?

◦ High frame rate or smooth frame rate?

◦ How well will it scale when Intel ships n-core?

Terminology

A task is a piece of work that is mapped to a thread.

◦ Dedicated threads run the same task repeatedly

◦ Thread pools are assigned tasks dynamically

Work can be broken down into tasks in various ways.

◦ One task for each subsystem is functional decomposition

◦ Multiple tasks for a subsystem is data decomposition

[Figure: task boxes labeled Animation, Physics, Physics, Physics]
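The two decompositions can be sketched on one thread pool. This is a minimal illustration, not code from the talk: `update_animation`, `update_physics` and both helper functions are hypothetical names, and the "physics" is a stand-in. In CPython the global interpreter lock means this shows the structure rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def update_animation(frame):
    # Stand-in for the animation subsystem.
    return f"animation:{frame}"

def update_physics(bodies):
    # Stand-in for the physics subsystem: advance every body by one step.
    return [b + 1 for b in bodies]

def functional_decomposition(pool, frame, bodies):
    # One task per subsystem.
    anim = pool.submit(update_animation, frame)
    phys = pool.submit(update_physics, bodies)
    return anim.result(), phys.result()

def data_decomposition(pool, bodies, chunks):
    # Several tasks for one subsystem: each task gets a slice of the data.
    size = max(1, -(-len(bodies) // chunks))  # ceiling division
    parts = [bodies[i:i + size] for i in range(0, len(bodies), size)]
    results = pool.map(update_physics, parts)  # map keeps the input order
    return [b for part in results for b in part]
```

Both helpers take the same pool, so the pool decides which thread runs which task.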

Let’s hack a game…

Task Cluster: isolate procedures in the game into general tasks.

[Figure: task boxes labeled Particles, Render, AI, Animation, Level Loading, Physics]

Get Organized - A simple Render Split?

Render Split: queue up calls, pass them to the render thread.

[Figure: boxes labeled Buffer, Render, Everything Else]
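A Render Split along these lines could look like the sketch below, assuming a plain FIFO queue as the buffer. The function names and the tuple format of a "call" are hypothetical, and appending to a list stands in for issuing a real draw call.

```python
import queue
import threading

def render_thread(commands, drawn):
    # Dedicated thread: drain queued calls until the sentinel (None) arrives.
    while True:
        call = commands.get()
        if call is None:
            break
        drawn.append(call)  # stand-in for issuing the real draw call

def run_frames(frame_count):
    commands = queue.Queue()
    drawn = []
    worker = threading.Thread(target=render_thread, args=(commands, drawn))
    worker.start()
    for frame in range(frame_count):
        # "Everything else" runs here and only queues its render calls.
        commands.put(("clear", frame))
        commands.put(("draw_scene", frame))
    commands.put(None)  # tell the render thread to stop
    worker.join()
    return drawn
```

Because the queue is first-in, first-out and only one thread consumes it, calls are executed in the order they were queued.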


Render Split == wait on data or tasks. Try a Work Crew

Work Crew: like a Task Cluster, but buffer data for each task.

[Figure: tasks labeled Render, Physics, Particles, Animation and AI, several of them replicated into multiple instances]



Work Crew == High Memory Bandwidth. Try an Operation Queue

Operation Queue: data is broken into blocks, and a service thread executes the operations that are put into a queue.

[Figure: boxes labeled Queue, Service, AI, Animation, Physics, Render]
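One way to read the Operation Queue idea is sketched below. The `OperationQueue` class and its `post` and `close` methods are hypothetical names, and the sketch assumes that only the service thread touches the data blocks, so the blocks themselves need no locks.

```python
import queue
import threading

class OperationQueue:
    def __init__(self, blocks):
        self.blocks = blocks  # the data, already split into blocks
        self._ops = queue.Queue()
        self._thread = threading.Thread(target=self._serve)
        self._thread.start()

    def _serve(self):
        # Service thread: apply queued operations one at a time, in order.
        while True:
            op = self._ops.get()
            if op is None:
                break
            func, index = op
            self.blocks[index] = func(self.blocks[index])

    def post(self, func, index):
        # Any subsystem may post an operation against one block.
        self._ops.put((func, index))

    def close(self):
        # Stop the service thread after all earlier operations have run.
        self._ops.put(None)
        self._thread.join()
```

For example, posting a doubling operation and then an increment against block 0 applies them in that order, whichever threads posted them.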


Architecture Model – Synchronous Function Parallel Model

Find parallel tasks from an existing loop.

To reduce the need for communication between parallel tasks, the tasks should preferably be truly independent of each other.



Divide the functionality into small tasks, and build a graph of which tasks precede which.

Supply this task-dependency graph to a framework.

The framework in turn schedules the proper tasks to run, minding the number of available processor cores.
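A tiny stand-in for such a framework can be built from Python's standard `graphlib.TopologicalSorter` and a thread pool. The function `run_task_graph` and its argument shapes are assumptions made for illustration; a real engine framework would be considerably more involved.

```python
import os
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter

def run_task_graph(tasks, depends_on, workers=None):
    """tasks: name -> callable; depends_on: name -> set of predecessor names."""
    sorter = TopologicalSorter({name: depends_on.get(name, ()) for name in tasks})
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    order = []        # completion order, returned for inspection
    pending = {}      # future -> task name
    with ThreadPoolExecutor(max_workers=workers or os.cpu_count()) as pool:
        while sorter.is_active():
            # Submit every task whose predecessors have all finished.
            for name in sorter.get_ready():
                pending[pool.submit(tasks[name])] = name
            done, _ = wait(list(pending), return_when=FIRST_COMPLETED)
            for future in done:
                future.result()  # re-raise any error from the task
                name = pending.pop(future)
                order.append(name)
                sorter.done(name)  # may unlock successor tasks
    return order
```

With `physics` and `ai` both depending on `input`, and `render` depending on both, `input` always finishes first and `render` last, while `physics` and `ai` may run side by side.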



There is an upper limit to how many cores the model can support, dictated by how many parallel tasks it is possible to find in the engine.

The number of meaningful tasks is reduced further because threading very small tasks yields negligible gains.

The parallel tasks should have very few dependencies on each other.


Architecture Model – Asynchronous Function Parallel Model

This model doesn't contain a game loop.

The tasks that drive the game forward update at their own pace.

The most recent information is used by the render engine.
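The sketch below illustrates the idea under simple assumptions: each task publishes its newest state into a small holder, and the renderer reads whatever is there when it draws. `LatestValue`, `run_task`, `render_frames`, `demo` and the timings are all hypothetical.

```python
import threading
import time

class LatestValue:
    """Holds only the most recent value published by a task."""
    def __init__(self, initial):
        self._lock = threading.Lock()
        self._value = initial

    def publish(self, value):
        with self._lock:
            self._value = value

    def read(self):
        with self._lock:
            return self._value

def run_task(state, step_seconds, stop):
    # Each task advances at its own pace; there is no shared game loop.
    tick = 0
    while not stop.is_set():
        tick += 1
        state.publish(tick)
        time.sleep(step_seconds)

def render_frames(states, frames, frame_seconds):
    shown = []
    for _ in range(frames):
        # Render with whatever each task published most recently.
        shown.append({name: s.read() for name, s in states.items()})
        time.sleep(frame_seconds)
    return shown

def demo():
    stop = threading.Event()
    states = {"physics": LatestValue(0), "ai": LatestValue(0)}
    threads = [
        threading.Thread(target=run_task, args=(states["physics"], 0.001, stop)),
        threading.Thread(target=run_task, args=(states["ai"], 0.005, stop)),
    ]
    for t in threads:
        t.start()
    shown = render_frames(states, frames=5, frame_seconds=0.01)
    stop.set()
    for t in threads:
        t.join()
    return shown
```

The renderer never waits for a task to finish a step; it may draw the same AI state twice or skip several physics states.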


Architecture Model – Asynchronous Function Parallel Model

The scalability of the asynchronous function parallel model is limited by how many tasks it is possible to find in the engine.

Communicating between threads using only the latest information available effectively reduces the need for the threads to be truly independent.

The asynchronous model can support a larger number of tasks, and therefore a larger number of processor cores, than the synchronous model.


Architecture Model – Data Parallel Model

Find some set of similar data for which to perform the same tasks in parallel.

These are typically the objects in the game.

◦ Example: In a flying simulation, divide all of the planes into two threads. Each thread handles the simulation of half of the planes.

Optimally the engine would use as many threads as there are logical processor cores.
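The plane example might be sketched as follows, assuming each plane is just a position and velocity pair. `simulate` and `simulate_parallel` are hypothetical names; in CPython the global interpreter lock means this shows the structure of the split rather than an actual speedup.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def simulate(planes, dt):
    # Same task for every object: advance position by velocity * dt.
    return [(x + vx * dt, vx) for x, vx in planes]

def simulate_parallel(planes, dt, threads=None):
    # Default to one thread per logical core.
    threads = threads or os.cpu_count() or 1
    # Deal planes out round-robin so each thread gets a similar share.
    groups = [planes[i::threads] for i in range(threads)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        updated = list(pool.map(simulate, groups, [dt] * threads))
    # Undo the round-robin split to restore the original order.
    result = [None] * len(planes)
    for i, group in enumerate(updated):
        result[i::threads] = group
    return result
```

With `threads=2` this is the split from the example above: each thread simulates half of the planes.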



How to divide the objects into threads?

◦ Threads should be properly balanced, so that each processor core gets used to full capacity.

What will happen when two objects in different threads need to interact?

◦ Communication using synchronization primitives could potentially reduce the amount of parallelism.

◦ Use message passing, accompanied by using the latest known updates as in the asynchronous model.

◦ Communication between threads can be reduced by grouping objects that are most likely to interact with each other.

◦ Objects are more likely to come into contact with their neighbors, so one strategy could be to group objects by area.
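Grouping by area while keeping threads balanced could be approached as in this sketch. The uniform grid and the greedy "largest cell to the lightest thread" rule are assumptions chosen for brevity, not something the slides prescribe, and both function names are hypothetical.

```python
from collections import defaultdict

def group_by_area(objects, cell_size):
    """objects: name -> (x, y). Returns grid cell -> list of object names."""
    cells = defaultdict(list)
    for name, (x, y) in objects.items():
        cell = (int(x // cell_size), int(y // cell_size))
        cells[cell].append(name)
    return dict(cells)

def assign_cells_to_threads(cells, thread_count):
    """Greedy balance: give the largest remaining cell to the lightest thread."""
    loads = [0] * thread_count
    assignment = [[] for _ in range(thread_count)]
    # Largest cells first; ties broken by cell coordinate for determinism.
    for cell, names in sorted(cells.items(), key=lambda kv: (-len(kv[1]), kv[0])):
        target = loads.index(min(loads))
        assignment[target].extend(names)
        loads[target] += len(names)
    return assignment
```

Neighbors in the same cell always land on the same thread, so only interactions across cell borders need cross-thread communication.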



The data parallel model has excellent scalability.

The number of object threads can be set automatically to the number of cores in the system, and the only non-parallelizable parts of the game loop would be the ones that don't deal directly with game objects.

Data parallelism is needed to fully utilize future processors with dozens of cores.

The performance of the data parallel model is directly related to how large a part of the game engine can be parallelized by data.

As the number of processor cores goes up, the data parallel parts of the engine take less time to run. Fortunately these are usually also the performance-heavy parts of a game engine.
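The slides do not name it, but the relationship described here is commonly expressed with Amdahl's law. The helper below is a small illustration under that assumption; `speedup` and `parallel_fraction` are hypothetical names, and the formula ignores synchronization and memory costs.

```python
def speedup(parallel_fraction, cores):
    """Amdahl's law: the serial part stays fixed, the parallel part divides by cores."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)
```

If 80% of the frame time is data parallel, 4 cores give a 2.5x speedup, and no number of cores can exceed 5x, which is why the size of the data parallel part matters so much.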



The biggest drawback of the model is the need to have components that support data parallelism.

For example, a physics component would need to be able to run several physics updates in parallel, and be able to correctly calculate collisions with objects that are in these separate threads.



Valve uses a hybrid approach to threading the Source* engine

Uses both functional and data parallelism (coarse and fine grain).

Single mechanism (thread pool with task queue) supports both.

Conventional functional threading: Sound, Rendering back end (D3D calls).

Example parallel tasks:

◦ Construct scene rendering lists for multiple scenes in parallel (e.g., the world and its reflection in water)

◦ Graphics simulation (particles, ropes, sprites)

◦ Character bone transformations for all characters in all scenes in parallel

◦ Shadows for all characters



Valve’s hybrid threading

[Figure: boxes labeled Main Thread (Game Engine Loop), Task Q, Thread Pool, Re-Order Buffer, Render Thread, D3D Driver, Sound Thread]



The Quake 4* engine takes a different approach to threading

The engine is split up into 3 main components:

- The Quake 4 Engine (exe) – this is the part that gets threaded

- idlib – common library for all id stuff (math, timing, algorithms, memory management, parsers, …), linked statically and very well optimized with SSE, SSE2 and SSE3

- The Game DLL – the basic game DLL implements classes specific to the game, like Weapons, Vehicles, Characters, Script engine, AI, Game physics, …, and calls into the Quake 4 Engine for all of the lower-level work, like the skinning of characters during animation



VTune™ Analyzer shows unthreaded Quake 4* has no big hotspots

Analysis with the VTune™ Performance Analyzer revealed that:

◦ It was single threaded and CPU bound

◦ Roughly equal amounts of time were spent in the driver and the engine (41% and 49% respectively)

◦ Each of the major hotspots consumed 2-4% of CPU time


Best performance gains by overlapping engine and renderer



Quake 4* gets the Render Split treatment

◦ Latency is a key issue, so we have to achieve the most performance in a time-constrained scenario: only one frame of latency is allowed.

◦ The engine was functionally decomposed into its two largest blocks, to maximize overlap and minimize synchronization

◦ All of the time spent in the OpenGL driver is due to the rendering subsystem of the Quake 4 Engine

◦ Split the renderer into a front end and a back end, so all the OpenGL calls were now made from the back-end thread

◦ The front-end and back-end communicate through command queues and synchronization events
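A front end and back end that communicate through a command list and two events, with at most one frame of latency, could be sketched like this. It is an illustration under simplifying assumptions, not the Quake 4 code: `run_pipeline`, `frame_ready`, `back_done` and the command tuples are hypothetical, and appending to a list stands in for the OpenGL calls.

```python
import threading

def run_pipeline(frame_count):
    frame_ready = threading.Event()  # front end -> back end: a frame is handed over
    back_done = threading.Event()    # back end -> front end: previous frame finished
    back_done.set()                  # nothing is in flight at start-up
    slot = {"commands": None}
    rendered = []

    def back_end():
        while True:
            frame_ready.wait()
            frame_ready.clear()
            commands = slot["commands"]
            if commands is None:       # shutdown signal
                return
            rendered.append(commands)  # stand-in for issuing the OpenGL calls
            back_done.set()

    worker = threading.Thread(target=back_end)
    worker.start()
    for n in range(frame_count):
        commands = [("draw", n)]  # built while the back end renders frame n-1
        back_done.wait()          # never run more than one frame ahead
        back_done.clear()
        slot["commands"] = commands
        frame_ready.set()
    back_done.wait()              # let the last frame finish
    slot["commands"] = None
    frame_ready.set()
    worker.join()
    return rendered
```

The front end blocks only at the hand-over point, so building frame n+1 overlaps with rendering frame n.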



Quake 4* control flow

[Figure: timeline with a Front End row (Frame n, Frame n+1, Frame n+2) and a Back End row (Frame n, Frame n+1)]



Though simple in concept, the Render Split requires significant changes

◦ The frame was prepared by the front end and handed over to the back end while the front end prepared the next frame.

◦ Data specific to a frame was duplicated

◦ Data had to be allocated and freed safely.

◦ All allocations, with the exception of a few, were done in the front end

◦ Data to be freed was kept until the back end was done, and cleared at the front end just before reuse.

◦ Subsystems that were not thread safe had to be rewritten for thread safety: models classes, animation, shadows, texture subsystems, deforms, loaders, writers, vertex caches, effects, …

Minimize synchronization

Have a policy on memory allocation



Debugging the threaded engine is a further challenge

◦ Debugging the threaded code is the hardest problem

◦ Issues could be broadly categorized into 3 major types:

- Data race conditions

- Object lifetime issues

- OpenGL context issues

◦ Added a lock-step mode to the threaded code, where the front end and back end run on separate threads but in lock step

◦ Added lots of initialization and destruction code to deal with lifetime issues

◦ Used synchronization points to slowly and painfully eliminate data races

- Threading is hard. Interaction with the GPU adds more complexity

- Need to design debugging aids while designing engine threading
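A lock-step debugging mode of the kind described could be sketched as a flag that makes the front end wait for the back end after every hand-over. This is a hypothetical illustration: `run`, `handoff` and the event trace are invented for the example and do not come from the Quake 4 code.

```python
import queue
import threading

def run(frames, lock_step):
    handoff = queue.Queue(maxsize=1)
    events = []  # trace of what happened, in order
    trace_lock = threading.Lock()

    def log(entry):
        with trace_lock:
            events.append(entry)

    def back_end():
        while True:
            item = handoff.get()
            if item is None:
                break
            frame, done = item
            log(("back", frame))
            done.set()  # signal that this frame is finished

    worker = threading.Thread(target=back_end)
    worker.start()
    for frame in range(frames):
        log(("front", frame))
        done = threading.Event()
        handoff.put((frame, done))
        if lock_step:
            done.wait()  # debugging aid: the two threads never overlap
    handoff.put(None)
    worker.join()
    return events
```

With `lock_step=True` the trace strictly alternates front and back for each frame, which keeps both threads and their contexts in play while removing the timing-dependent overlap.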



Multi-threaded drivers enable a further performance gain

After Quake 4* was threaded, NVIDIA and ATI both released multi-threaded drivers.

The drivers have matured and now work well with a threaded renderer

With the multi-threaded drivers we see a further gain of about 30-40%
