Threading Games for Performance – Architecture – Case Studies
Threading Issues
Threads are a tool, not a ready-made solution.
Most threading tutorials use “embarrassingly parallel” examples.
Games are especially challenging for threading, because of architectural requirements and genre expectations.
Many issues have to be considered when implementing a threading strategy:
◦ Is high frame rate the most important performance indicator?
◦ Is input latency a deal-breaker?
◦ Is it fair for clients to run at different speeds?
◦ High frame rate or smooth frame rate?
◦ How well will it scale when Intel ships n-core?
Terminology
A task is a piece of work that is mapped to a thread.
◦ Dedicated threads run the same task repeatedly
◦ Thread pools are assigned tasks dynamically
Work can be broken down into tasks in various ways.
◦ One task for each subsystem is functional decomposition
◦ Multiple tasks for a subsystem is data decomposition
[Figure: task boxes labeled Animation, Physics, Physics, Physics]
Let’s hack a game…
Task Cluster: isolate procedures in the game into general tasks.
[Figure: task cluster of Particles, Render, AI, Animation, Level Loading, Physics]
Get Organized - A simple Render Split?
Render Split: queue up calls, pass them to the
render thread.
[Figure: Everything Else, Buffer, Render]
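A Render Split of this kind can be sketched as a thread-safe queue of recorded calls that a dedicated render thread drains in order. This is a minimal sketch, not code from any engine: `RenderCommand`, `RenderCommandQueue`, and `run_render_split` are hypothetical names, and each closure merely stands in for a real graphics API call.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical render command: a closure standing in for one graphics API call.
using RenderCommand = std::function<void()>;

// Thread-safe FIFO of render commands, filled by the game thread.
class RenderCommandQueue {
public:
    void push(RenderCommand cmd) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            assert(!closed_ && "push after close");
            commands_.push(std::move(cmd));
        }
        ready_.notify_one();
    }
    // Signals that no more commands will arrive.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }
    // Blocks until a command is available; returns false once closed and drained.
    bool pop(RenderCommand& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return closed_ || !commands_.empty(); });
        if (commands_.empty()) return false;
        out = std::move(commands_.front());
        commands_.pop();
        return true;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<RenderCommand> commands_;
    bool closed_ = false;
};

// "Everything else" queues draw calls; the render thread executes them in order.
std::vector<std::string> run_render_split(int draw_calls) {
    RenderCommandQueue queue;
    std::vector<std::string> executed;  // written only by the render thread until join

    std::thread render_thread([&] {
        RenderCommand cmd;
        while (queue.pop(cmd)) cmd();
    });

    for (int i = 0; i < draw_calls; ++i) {
        queue.push([&executed, i] { executed.push_back("draw " + std::to_string(i)); });
    }
    queue.close();
    render_thread.join();
    return executed;
}
```

Calling `run_render_split(3)` records three calls on the calling thread and executes them on the render thread in submission order. Only the queue is shared between the two threads, which is what keeps the split simple.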
Render Split == wait on data or tasks. Try a Work Crew
Work Crew: like a Task Cluster, but buffer data for each task.
[Figure: work crew of Render, Physics, Particles, Animation, and AI tasks, with several instances of Particles, Physics, and Animation]
Work Crew == High Memory Bandwidth. Try an Operation Queue
Operation Queue: data is broken into blocks, and a
service thread executes operations that are put into a
queue.
[Figure: Queue, Service, AI, Animation, Physics, Render]
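One way to read the Operation Queue is that a single service thread owns the data blocks and every other subsystem only submits operations against them. The sketch below assumes that reading; `Block`, `Operation`, `OperationQueue`, and `run_operation_queue_demo` are hypothetical names, and the float arrays stand in for real game data.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// One block of game data; only the service thread touches blocks while it runs.
using Block = std::vector<float>;

// An operation names a block and the work to apply to it.
struct Operation {
    std::size_t block_index;
    std::function<void(Block&)> apply;
};

class OperationQueue {
public:
    OperationQueue(std::size_t block_count, std::size_t block_size)
        : blocks_(block_count, Block(block_size, 0.0f)),
          service_([this] { serve(); }) {}

    ~OperationQueue() {
        if (service_.joinable()) finish();
    }

    // Submit work from any thread (AI, animation, physics, ...).
    void submit(Operation op) {
        assert(op.block_index < blocks_.size());
        {
            std::lock_guard<std::mutex> lock(mutex_);
            pending_.push_back(std::move(op));
        }
        ready_.notify_one();
    }

    // Drains the remaining operations, stops the service thread, returns the data.
    std::vector<Block> finish() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        ready_.notify_one();
        if (service_.joinable()) service_.join();
        return blocks_;
    }

private:
    void serve() {
        for (;;) {
            Operation op;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                ready_.wait(lock, [this] { return stopping_ || !pending_.empty(); });
                if (pending_.empty()) return;  // stopping and fully drained
                op = std::move(pending_.front());
                pending_.pop_front();
            }
            op.apply(blocks_[op.block_index]);  // runs outside the lock
        }
    }

    std::vector<Block> blocks_;
    std::mutex mutex_;
    std::condition_variable ready_;
    std::deque<Operation> pending_;
    bool stopping_ = false;
    std::thread service_;  // declared last so it starts after the other members exist
};

// Example: two producers queue operations against different blocks.
std::vector<Block> run_operation_queue_demo() {
    OperationQueue queue(2, 4);
    std::thread physics([&] {
        for (int i = 0; i < 10; ++i)
            queue.submit({0, [](Block& b) { for (float& v : b) v += 1.0f; }});
    });
    std::thread animation([&] {
        for (int i = 0; i < 5; ++i)
            queue.submit({1, [](Block& b) { for (float& v : b) v += 2.0f; }});
    });
    physics.join();
    animation.join();
    return queue.finish();
}
```

Because only the service thread reads or writes the blocks, the data itself needs no locks; the cost is that every operation is serialized through one thread.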
Architecture Model – Synchronous Function Parallel Model
Find parallel tasks from an existing loop.
To reduce the need for communication between
parallel tasks, the tasks should preferably be truly
independent of each other.
Architecture Model – Synchronous Function Parallel Model
Divide the functionality into small tasks and build a graph
of which tasks precede which.
Supply this task-dependency graph to a framework.
The framework in turn schedules the proper tasks
to run, taking into account the number of available processor
cores.
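The task-dependency graph idea can be illustrated with a very small scheduler. This is a sketch under stated assumptions rather than a real framework: `Task` and `run_task_graph` are hypothetical names, ready tasks are run in batches of at most `max_parallel` with a barrier after each batch, and a production scheduler would typically avoid that barrier.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <future>
#include <vector>

struct Task {
    std::function<void()> work;
    std::vector<std::size_t> depends_on;  // indices of tasks that must finish first
};

// Runs every task after its predecessors, at most `max_parallel` at a time.
// Returns the order in which tasks were launched.
std::vector<std::size_t> run_task_graph(const std::vector<Task>& tasks,
                                        std::size_t max_parallel) {
    assert(max_parallel > 0);
    std::vector<std::size_t> remaining(tasks.size());  // unmet dependency counts
    std::vector<std::vector<std::size_t>> dependents(tasks.size());
    for (std::size_t i = 0; i < tasks.size(); ++i) {
        remaining[i] = tasks[i].depends_on.size();
        for (std::size_t d : tasks[i].depends_on) {
            assert(d < tasks.size());
            dependents[d].push_back(i);
        }
    }

    std::vector<std::size_t> ready;
    std::vector<std::size_t> launched;
    for (std::size_t i = 0; i < tasks.size(); ++i)
        if (remaining[i] == 0) ready.push_back(i);

    while (!ready.empty()) {
        // Take one batch of ready tasks, bounded by the core budget.
        std::size_t batch = std::min(ready.size(), max_parallel);
        std::vector<std::size_t> running(ready.begin(), ready.begin() + batch);
        ready.erase(ready.begin(), ready.begin() + batch);

        std::vector<std::future<void>> futures;
        for (std::size_t id : running) {
            launched.push_back(id);
            futures.push_back(std::async(std::launch::async, tasks[id].work));
        }
        for (auto& f : futures) f.get();  // barrier: the whole batch finishes

        // Finished tasks may unblock their dependents.
        for (std::size_t id : running)
            for (std::size_t next : dependents[id])
                if (--remaining[next] == 0) ready.push_back(next);
    }
    assert(launched.size() == tasks.size() && "cycle in task graph");
    return launched;
}
```

For a frame shaped like input, then physics and AI in parallel, then render, the graph has four tasks where render depends on both physics and AI. Passing the number of logical cores as `max_parallel` is one way to take the available processor cores into account.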
Architecture Model – Synchronous Function Parallel Model
There is an upper limit to how many cores this model can
support, dictated by how many parallel tasks it is
possible to find in the engine.
The number of meaningful tasks is further reduced because
threading very small tasks yields negligible gains.
The parallel tasks should have very few
dependencies on each other.
Architecture Model – Asynchronous Function Parallel Model
This model doesn't contain a game loop.
The tasks that drive the game forward update at their
own pace.
The most recent information is used by the render
engine.
Architecture Model – Asynchronous Function Parallel Model
The scalability of the asynchronous function parallel
model is limited by how many tasks it is possible to
find in the engine.
Because threads communicate using only the latest
information available, they have less need to be
truly independent.
The asynchronous model can support a larger number
of tasks, and therefore a larger number of processor
cores, than the synchronous model.
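Communicating through "the latest information available" can be sketched as a slot that a task overwrites at its own pace and that readers sample whenever they need it. The names `WorldState`, `Latest`, and `run_async_model` are hypothetical, and a mutex-protected copy is only one of several possible implementations.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical game state published by a simulation task.
struct WorldState {
    long tick = 0;
    double player_x = 0.0;
};

// Holds only the most recent value; readers never wait for a specific update.
template <typename T>
class Latest {
public:
    void publish(const T& value) {
        std::lock_guard<std::mutex> lock(mutex_);
        value_ = value;
    }
    T read() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return value_;
    }

private:
    mutable std::mutex mutex_;
    T value_{};
};

// Physics advances at its own pace; the renderer samples whatever is newest.
long run_async_model(long physics_ticks, int render_frames) {
    assert(physics_ticks > 0 && render_frames > 0);
    Latest<WorldState> latest;

    std::thread physics([&] {
        WorldState s;
        for (long t = 1; t <= physics_ticks; ++t) {
            s.tick = t;
            s.player_x = 0.5 * static_cast<double>(t);
            latest.publish(s);
        }
    });

    long last_seen = 0;
    for (int f = 0; f < render_frames; ++f) {
        WorldState s = latest.read();  // may repeat or skip ticks, never goes backwards
        assert(s.tick >= last_seen);
        last_seen = s.tick;
    }

    physics.join();
    return latest.read().tick;
}
```

The renderer may see the same tick twice or miss some ticks entirely. That is the trade the asynchronous model makes: no task ever blocks waiting for another task's next update.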
Architecture Model – Data Parallel Model
Find some set of similar data for which to perform the
same tasks in parallel.
These are typically the objects in the game.
◦ Example: In a flight simulation, divide all of the planes between two
threads. Each thread handles the simulation of half of the planes.
Optimally the engine would use as many threads as there are
logical processor cores.
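The plane example can be sketched as a parallel loop over contiguous slices of an object array. `Plane`, `simulate_range`, `simulate_planes`, and `default_thread_count` are hypothetical names, and the "simulation" here is just an altitude update so the structure stays visible.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

struct Plane {
    float altitude;
    float climb_rate;
};

// Advance a contiguous slice of planes by dt seconds.
void simulate_range(std::vector<Plane>& planes, std::size_t begin, std::size_t end,
                    float dt) {
    for (std::size_t i = begin; i < end; ++i)
        planes[i].altitude += planes[i].climb_rate * dt;
}

// Split the planes into `thread_count` near-equal slices, one per thread.
void simulate_planes(std::vector<Plane>& planes, float dt, unsigned thread_count) {
    assert(thread_count > 0);
    const std::size_t n = planes.size();
    const std::size_t chunk = (n + thread_count - 1) / thread_count;  // ceiling division
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < thread_count; ++t) {
        const std::size_t begin = std::min(n, t * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        if (begin == end) break;  // more threads than planes
        workers.emplace_back(simulate_range, std::ref(planes), begin, end, dt);
    }
    for (auto& w : workers) w.join();
}

// One thread per logical core; hardware_concurrency() may return 0 if unknown.
unsigned default_thread_count() {
    return std::max(1u, std::thread::hardware_concurrency());
}
```

With `thread_count` set to 2 this is the two-thread split from the example; passing `default_thread_count()` instead uses one thread per logical core. Spawning threads every frame is done here only for brevity, and a real engine would more likely reuse a pool.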
Architecture Model – Data Parallel Model
How to divide the objects into threads?
◦ Threads should be properly balanced, so that each processor core gets
used to full capacity.
What will happen when two objects in different threads
need to interact?
◦ Communication using synchronization primitives could potentially
reduce the amount of parallelism.
◦ Use message passing together with the latest known updates, as in
the asynchronous model.
◦ Communication between threads can be reduced by grouping objects
that are most likely to interact with each other.
◦ Objects are more likely to come into contact with their neighbors, so
one strategy could be to group objects by area.
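Grouping by area can be as simple as bucketing objects into strips of the world, one strip per thread. The sketch below assumes a world that spans `0` to `world_width` along x; `Object` and `group_by_area` are hypothetical names, and a real engine would more likely use a 2D or 3D grid.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Object {
    float x;
    float y;
};

// Assign each object to a vertical strip of the world, one strip per thread.
// Returns, for each strip, the indices of the objects that fall inside it.
std::vector<std::vector<std::size_t>> group_by_area(const std::vector<Object>& objects,
                                                    float world_width,
                                                    std::size_t groups) {
    assert(world_width > 0.0f && groups > 0);
    std::vector<std::vector<std::size_t>> result(groups);
    const float strip = world_width / static_cast<float>(groups);
    for (std::size_t i = 0; i < objects.size(); ++i) {
        // Clamp so objects slightly outside the world still land in an edge strip.
        const float clamped = std::min(std::max(objects[i].x, 0.0f), world_width);
        std::size_t g = static_cast<std::size_t>(clamped / strip);
        if (g >= groups) g = groups - 1;  // x == world_width lands in the last strip
        result[g].push_back(i);
    }
    return result;
}
```

Fixed strips favor locality over balance: if most objects crowd into one area, one thread gets most of the work, so the strip boundaries may need to move as objects do.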
Architecture Model – Data Parallel Model
The data parallel model has excellent scalability.
The number of object threads can be automatically set to the
number of cores in the system, and the only non-parallelizable
parts of the game loop would be ones that don't
directly deal with game objects.
Data parallelism is needed to fully utilize future processors with
dozens of cores.
The performance of the data parallel model is directly related to
how large a part of the game engine can be parallelized by data.
As the number of processor cores goes up, the data parallel parts of
the engine take less time to run. Fortunately these are usually also
the performance-heavy parts of a game engine.
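The slides do not name it, but the relationship described here is the one usually stated as Amdahl's law. As a worked illustration, let p be the fraction of frame time spent in data-parallel code and N the number of cores; both symbols are introduced here and do not appear in the slides. Assuming the parallel part scales perfectly, the speedup over a single core is:

```latex
S(N) = \frac{1}{(1 - p) + \dfrac{p}{N}},
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{1 - p}
```

For example, with p = 0.8 and N = 4 the speedup is 1 / (0.2 + 0.2) = 2.5, and no number of cores can push it past 1 / 0.2 = 5. This is why the size of the data-parallel part matters so much.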
Architecture Model – Data Parallel Model
The biggest drawback of the model is the need to have
components that support data parallelism.
For example, a physics component would need to be able to
run several physics updates in parallel, and be able to
correctly calculate collisions with objects that are in these
separate threads.
Valve uses a hybrid approach to threading the Source* engine
Uses both functional and data parallelism (coarse and fine grain).
Single mechanism (thread pool with task queue) supports both.
Conventional functional threading: Sound, Rendering back end (D3D calls).
Example parallel tasks:
◦ Construct scene rendering lists for multiple scenes in parallel (e.g., the world and its reflection in water)
◦ Graphics simulation (particles, ropes, sprites)
◦ Character bone transformations for all characters in all scenes in parallel
◦ Shadows for all characters
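A thread pool fed by a task queue can serve both coarse and fine-grained work, as the following sketch suggests. It is a generic illustration and not Valve's implementation: `ThreadPool`, `submit`, and `run_hybrid_frame` are hypothetical names, and the task bodies are placeholders.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

class ThreadPool {
public:
    explicit ThreadPool(std::size_t worker_count) {
        assert(worker_count > 0);
        for (std::size_t i = 0; i < worker_count; ++i)
            workers_.emplace_back([this] { work(); });
    }

    // Finishes all queued tasks, then stops the workers.
    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        ready_.notify_all();
        for (auto& w : workers_) w.join();
    }

    // Queue any callable; the returned future reports completion.
    std::future<void> submit(std::function<void()> fn) {
        auto task = std::make_shared<std::packaged_task<void()>>(std::move(fn));
        std::future<void> done = task->get_future();
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push([task] { (*task)(); });
        }
        ready_.notify_one();
        return done;
    }

private:
    void work() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                ready_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // stopping and fully drained
                job = std::move(tasks_.front());
                tasks_.pop();
            }
            job();  // runs outside the lock
        }
    }

    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<std::function<void()>> tasks_;
    bool stopping_ = false;
    std::vector<std::thread> workers_;
};

// Coarse (functional) and fine (data-parallel) work share one pool.
// Returns the number of tasks that completed.
int run_hybrid_frame(std::size_t character_count) {
    ThreadPool pool(4);
    std::vector<int> bones_done(character_count, 0);
    int scene_list_built = 0;
    std::vector<std::future<void>> pending;

    // One coarse task, e.g. building a scene rendering list.
    pending.push_back(pool.submit([&] { scene_list_built = 1; }));
    // Many fine tasks, e.g. bone transformations, one per character.
    for (std::size_t c = 0; c < character_count; ++c)
        pending.push_back(pool.submit([&bones_done, c] { bones_done[c] = 1; }));

    for (auto& f : pending) f.get();  // wait for the whole frame's work
    int total = scene_list_built;
    for (int b : bones_done) total += b;
    return total;
}
```

The pool does not care how large a task is, which is what lets a single mechanism cover both grains. Each fine task writes to its own array element, so the tasks need no locking among themselves.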
Valve’s hybrid threading
[Figure: Main Thread (Game Engine Loop), Task Queue, Thread Pool, Re-Order Buffer, Render Thread, D3D Driver, Sound Thread]
The Quake 4* engine takes a different approach to threading
The engine is split up into 3 main components:
- The Quake 4 Engine (exe): this is the part that gets threaded
- idlib: common library (math, timing, algorithms, memory management, parsers, …), linked statically and very well optimized with SSE, SSE2, SSE3
- The Game DLL: implements classes specific to the game like Weapons, Vehicles, Characters, Script engine, AI, Game physics, …; it calls into the Quake 4 Engine for all of the lower level work, like the skinning of characters during animation
VTune™ Analyzer shows unthreaded Quake 4* has no big hotspots
Analysis with the VTune™ Performance Analyzer revealed that:
◦ It was single threaded and CPU bound
◦ Roughly equal amounts of time are spent in the driver and the engine: 41% and 49% respectively
◦ Each of the major hotspots consumed 2-4% of CPU time
Best performance gains by overlapping engine and renderer
Quake 4* gets the Render Split treatment
◦ Latency is a key issue, so we have to achieve the most performance in a time-constrained scenario: only one frame of latency is allowed.
◦ The engine was functionally decomposed into its two largest blocks to maximize overlap and minimize synchronization.
◦ All of the time spent in the OpenGL driver is due to the rendering subsystem of the Quake 4 Engine.
◦ The renderer was split into a front end and a back end so that all the OpenGL calls are made from the back-end thread.
◦ The front end and back end communicate through command queues and synchronization events.
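The one-frame latency constraint can be sketched as a single-slot handoff: the front end may prepare the next frame while the back end renders, but cannot hand it over until the back end reports that the previous frame is done. This is an illustrative sketch and not the Quake 4 code; `FramePacket`, `FrameHandoff`, and `run_front_and_back_end` are hypothetical names, and a condition variable stands in for the synchronization events.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical per-frame packet: the commands the back end will issue.
struct FramePacket {
    int frame_number = 0;
    std::vector<int> commands;
};

// Single-slot handoff between the front end and the back end.
class FrameHandoff {
public:
    // Front end: blocks until the back end has finished the previous frame.
    void submit(FramePacket packet) {
        std::unique_lock<std::mutex> lock(mutex_);
        changed_.wait(lock, [this] { return !busy_; });
        slot_ = std::move(packet);
        busy_ = true;
        ready_ = true;
        changed_.notify_all();
    }
    // Back end: blocks until a frame has been handed over.
    FramePacket take() {
        std::unique_lock<std::mutex> lock(mutex_);
        changed_.wait(lock, [this] { return ready_; });
        ready_ = false;
        return std::move(slot_);
    }
    // Back end: signals that rendering of the taken frame is complete.
    void done() {
        std::lock_guard<std::mutex> lock(mutex_);
        assert(busy_ && !ready_);
        busy_ = false;
        changed_.notify_all();
    }

private:
    std::mutex mutex_;
    std::condition_variable changed_;
    FramePacket slot_;
    bool busy_ = false;   // a frame is handed over or still being rendered
    bool ready_ = false;  // a frame is waiting to be taken
};

// Runs `frames` frames and returns the frame numbers in the order rendered.
std::vector<int> run_front_and_back_end(int frames) {
    assert(frames >= 0);
    FrameHandoff handoff;
    std::vector<int> rendered;  // written only by the back end until join

    std::thread back_end([&] {
        for (int i = 0; i < frames; ++i) {
            FramePacket p = handoff.take();
            rendered.push_back(p.frame_number);  // stand-in for issuing OpenGL calls
            handoff.done();
        }
    });

    for (int n = 0; n < frames; ++n) {
        FramePacket p;  // the front end prepares frame n ...
        p.frame_number = n;
        p.commands.assign(3, n);
        handoff.submit(std::move(p));  // ... while the back end may still render n-1
    }
    back_end.join();
    return rendered;
}
```

Releasing the slot in `done()` rather than in `take()` is the design choice that bounds latency: if the slot were freed as soon as the back end picked a frame up, the front end could run two frames ahead.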
Quake 4* control flow
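[Figure: timeline of Front End and Back End; the Front End works on frame n+1 while the Back End works on frame n, then on frame n+2 while the Back End works on frame n+1]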
Though simple in concept, the Render Split requires significant changes
◦ The frame prepared by the front end was handed over to the back end while the front end prepared the next frame.
◦ Data specific to a frame was duplicated.
◦ Data had to be allocated and freed safely.
◦ All allocations, with the exception of a few, were done in the front end.
◦ Data to be freed was kept until the back end was done and cleared at the front end just before reuse.
◦ Subsystems that were not thread safe had to be rewritten for thread safety: model classes, animation, shadows, texture subsystems, deforms, loaders, writers, vertex caches, effects, …
Minimize synchronization
Have a policy on memory allocation
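One possible shape for such a memory policy is a deferred-free list per duplicated frame slot: the front end never destroys data immediately, and a slot's list is cleared only when the front end is about to reuse that slot. The sketch below is an assumption about how this could look, not the engine's allocator; `Buffer`, `FrameData`, and `FrameAllocator` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical resource that the back end may still be reading.
struct Buffer {
    std::vector<unsigned char> bytes;
};

// Per-frame data, duplicated so the front end and back end never share a slot.
struct FrameData {
    std::vector<std::unique_ptr<Buffer>> pending_free;  // released, not yet destroyed
};

// All allocation and freeing happens on the front-end thread.
class FrameAllocator {
public:
    // Called before building a frame. The caller must guarantee that the back end
    // has finished the frame that last used this slot.
    void begin_frame() {
        current_ = (current_ + 1) % frames_.size();
        frames_[current_].pending_free.clear();  // the actual frees happen here
    }

    std::unique_ptr<Buffer> allocate(std::size_t size) {
        std::unique_ptr<Buffer> buffer(new Buffer);
        buffer->bytes.resize(size);
        return buffer;
    }

    // Defer destruction instead of freeing immediately.
    void release(std::unique_ptr<Buffer> buffer) {
        assert(buffer != nullptr);
        frames_[current_].pending_free.push_back(std::move(buffer));
    }

    // Number of buffers released but not yet destroyed, across both slots.
    std::size_t pending_count() const {
        std::size_t total = 0;
        for (const FrameData& f : frames_) total += f.pending_free.size();
        return total;
    }

private:
    std::array<FrameData, 2> frames_;
    std::size_t current_ = 0;
};
```

With two slots, a buffer released while building frame n is destroyed at the start of frame n+2, when its slot comes up for reuse. Under a one-frame latency limit the back end has finished frame n by then, so no lock is needed around the free.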
Debugging the threaded engine is a further challenge
◦ Debugging the threaded code is the hardest problem
◦ Issues could be broadly categorized into 3 major types:
  - Data race conditions
  - Object lifetime issues
  - OpenGL context issues
◦ Added a lock step mode to the threaded code, in which the front end and back end run on separate threads but in lock step
◦ Added lots of initialization and destruction code to deal with lifetime issues
◦ Used synchronization points to slowly & painfully eliminate data races
- Threading is hard. Interaction with the GPU adds more complexity
- Need to design debugging aids while designing engine threading
Multi-threaded drivers enable a further performance gain
After Quake 4* was threaded, NVIDIA and ATI both released multi-threaded drivers.
The drivers have matured and now work well with a threaded renderer.
With the multi-threaded drivers we see a further gain of about 30-40%.