Post on 02-Nov-2014

Page 1: Dynamic Load-balancing On Graphics Processors

On Dynamic Load Balancing on Graphics Processors

Daniel Cederman and Philippas Tsigas

Chalmers University of Technology

Page 2: Dynamic Load-balancing On Graphics Processors

Overview

• Motivation

• Methods

• Experimental evaluation

• Conclusion

Page 3: Dynamic Load-balancing On Graphics Processors

The problem setting

[Diagram: a block of work divided into tasks; labels: Offline, Online]

Page 4: Dynamic Load-balancing On Graphics Processors

Static Load Balancing

[Diagram built up over slides 4 to 8: four processors; four tasks are placed one per processor; four subtasks are then added beneath the tasks.]

Page 9: Dynamic Load-balancing On Graphics Processors

Dynamic Load Balancing

[Diagram: four processors with four tasks and four subtasks.]

Page 10: Dynamic Load-balancing On Graphics Processors

Task sharing

[Flowchart around a shared Task Set, which holds the tasks:]

• Check condition: Work done? If yes: Done

• Acquire Task: Try to get task

• Got task? No, retry

• Perform task

• New tasks? If yes: Add task (Add Task to the Task Set). No, continue
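The loop in the flowchart can be sketched on the CPU as follows. This is a minimal illustration, not the authors' GPU code: `Task`, `TaskSet`, `run_worker` and the `pending` counter are hypothetical names, and the task set is guarded by a plain mutex for brevity. A task with `depth > 0` spawns two smaller tasks; a task with `depth == 0` is a leaf.

```cpp
#include <atomic>
#include <mutex>
#include <vector>

// Hypothetical task: a tree node with `depth` levels left to expand.
struct Task { int depth; };

// Shared task set, guarded by a mutex here for brevity (an assumption;
// any of the queue designs from the later slides could take its place).
class TaskSet {
    std::mutex m_;
    std::vector<Task> tasks_;
public:
    void add(Task t) {
        std::lock_guard<std::mutex> g(m_);
        tasks_.push_back(t);
    }
    bool try_get(Task& out) {
        std::lock_guard<std::mutex> g(m_);
        if (tasks_.empty()) return false;
        out = tasks_.back();
        tasks_.pop_back();
        return true;
    }
};

// Worker loop following the flowchart. `pending` counts tasks that have
// been added but not yet completed, so it reaches zero only when all
// work is done.
void run_worker(TaskSet& set, std::atomic<int>& pending, std::atomic<int>& leaves) {
    while (pending.load() > 0) {          // Work done?
        Task t;
        if (!set.try_get(t)) continue;    // Got task? No, retry
        if (t.depth == 0) {               // Perform task
            leaves.fetch_add(1);
        } else {                          // New tasks: add them
            pending.fetch_add(2);         // count them before they are visible
            set.add(Task{t.depth - 1});
            set.add(Task{t.depth - 1});
        }
        pending.fetch_sub(1);             // this task is complete
    }
}
```

To use it, add a root task, set `pending` to 1, and call `run_worker` from one or more threads. Counting new tasks before adding them keeps `pending` from dropping to zero while work is still in flight.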

Page 11: Dynamic Load-balancing On Graphics Processors

System Model

• CUDA

• Global Memory

• Gather and scatter

• Compare-And-Swap

• Fetch-And-Inc

• Multiprocessors

• Maximum number of concurrent thread blocks

[Diagram: three multiprocessors, each holding three thread blocks, with global memory shown beneath them.]

Page 12: Dynamic Load-balancing On Graphics Processors

Synchronization

• Blocking: uses mutual exclusion so that only one process at a time can access the object.

• Lock-free: multiple processes can access the object concurrently. At least one operation in a set of concurrent operations finishes in a finite number of its own steps.

• Wait-free: multiple processes can access the object concurrently. Every operation finishes in a finite number of its own steps.
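The three classes can be illustrated with a shared counter. This is a sketch in standard C++ rather than CUDA, and the class names are made up for this example. The wait-free variant assumes the hardware implements `fetch_add` as a single atomic instruction.

```cpp
#include <atomic>

// Blocking: a spin lock lets only one thread at a time touch the value.
// A thread that stalls while holding the lock stalls everyone else.
class BlockingCounter {
    std::atomic<bool> locked_{false};
    int value_ = 0;
public:
    int increment() {
        while (locked_.exchange(true, std::memory_order_acquire)) { /* spin */ }
        int old = value_++;
        locked_.store(false, std::memory_order_release);
        return old;
    }
};

// Lock-free: a compare-and-swap retry loop. A failed CAS means some other
// thread's CAS succeeded, so the system as a whole always makes progress,
// although one particular thread may retry many times.
class LockFreeCounter {
    std::atomic<int> value_{0};
public:
    int increment() {
        int old = value_.load();
        while (!value_.compare_exchange_strong(old, old + 1)) {
            // `old` now holds the current value; try again
        }
        return old;
    }
};

// Wait-free (assuming a hardware atomic add): every call finishes in a
// bounded number of steps regardless of what other threads do.
class WaitFreeCounter {
    std::atomic<int> value_{0};
public:
    int increment() { return value_.fetch_add(1); }
};
```

Each `increment` returns the value before the increment, which is the behaviour of a fetch-and-inc operation.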

Page 13: Dynamic Load-balancing On Graphics Processors

Load Balancing Methods

• Blocking Task Queue

• Non-blocking Task Queue

• Task Stealing

• Static Task List

Page 14: Dynamic Load-balancing On Graphics Processors

Blocking queue

[Diagram built up over slides 14 to 18: thread blocks TB 1 to TB n share a single queue with Head and Tail pointers and a field labelled Free; task T1 is placed in the queue.]
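A blocking queue of this kind might look like the sketch below. It assumes the "Free" label in the diagram denotes the state of a lock that a thread block must take before touching Head or Tail; the class name and the use of `int` as the task type are choices made for this example.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Fixed-capacity circular queue protected by a single spin lock.
class BlockingTaskQueue {
    std::atomic<bool> locked_{false};   // false corresponds to "Free"
    std::vector<int> slots_;
    std::size_t head_ = 0;              // next position to dequeue
    std::size_t tail_ = 0;              // next position to enqueue

    void lock()   { while (locked_.exchange(true, std::memory_order_acquire)) {} }
    void unlock() { locked_.store(false, std::memory_order_release); }

public:
    explicit BlockingTaskQueue(std::size_t capacity) : slots_(capacity) {}

    // Returns false if the queue is full.
    bool enqueue(int task) {
        lock();
        bool ok = (tail_ - head_) < slots_.size();
        if (ok) slots_[tail_++ % slots_.size()] = task;
        unlock();
        return ok;
    }

    // Returns false if the queue is empty.
    bool dequeue(int& task) {
        lock();
        bool ok = head_ != tail_;
        if (ok) task = slots_[head_++ % slots_.size()];
        unlock();
        return ok;
    }
};
```

Every operation serializes on the one lock, so all thread blocks contend for the same memory location.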

Page 19: Dynamic Load-balancing On Graphics Processors

Non-blocking Queue

[Diagram built up over slides 19 to 24: thread blocks TB 1 to TB n operate on a shared queue holding tasks T1 to T4, with Head and Tail pointers; task T5 is then added.]

Reference: P. Tsigas and Y. Zhang, A simple, fast and scalable non-blocking concurrent FIFO queue for shared memory multiprocessor systems [SPAA01]
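The sketch below shows how Head and Tail can be advanced with compare-and-swap instead of a lock. It is not the Tsigas and Zhang algorithm from the reference. It is a simpler, well-known bounded design that tags each slot with a sequence number, and unlike a truly lock-free queue a thread that stalls between claiming a slot and publishing it can delay others. `BoundedQueue` is a name chosen for this example, and the capacity is assumed to be a power of two, at least 2.

```cpp
#include <atomic>
#include <cstddef>
#include <memory>

template <typename T>
class BoundedQueue {
    struct Cell {
        std::atomic<std::size_t> seq;  // tells whether the slot is free or filled
        T data;
    };
    std::unique_ptr<Cell[]> cells_;
    std::size_t mask_;
    std::atomic<std::size_t> tail_{0};  // next enqueue position
    std::atomic<std::size_t> head_{0};  // next dequeue position

public:
    explicit BoundedQueue(std::size_t capacity)  // power of two, >= 2
        : cells_(new Cell[capacity]), mask_(capacity - 1) {
        for (std::size_t i = 0; i < capacity; ++i) cells_[i].seq.store(i);
    }

    bool enqueue(const T& value) {
        std::size_t pos = tail_.load();
        for (;;) {
            Cell& cell = cells_[pos & mask_];
            std::ptrdiff_t diff = static_cast<std::ptrdiff_t>(cell.seq.load()) -
                                  static_cast<std::ptrdiff_t>(pos);
            if (diff == 0) {
                // Slot is free: claim it by moving Tail forward with CAS.
                if (tail_.compare_exchange_weak(pos, pos + 1)) {
                    cell.data = value;
                    cell.seq.store(pos + 1);  // publish to dequeuers
                    return true;
                }
            } else if (diff < 0) {
                return false;                 // queue is full
            } else {
                pos = tail_.load();           // another thread got ahead
            }
        }
    }

    bool dequeue(T& out) {
        std::size_t pos = head_.load();
        for (;;) {
            Cell& cell = cells_[pos & mask_];
            std::ptrdiff_t diff = static_cast<std::ptrdiff_t>(cell.seq.load()) -
                                  static_cast<std::ptrdiff_t>(pos + 1);
            if (diff == 0) {
                // Slot is filled: claim it by moving Head forward with CAS.
                if (head_.compare_exchange_weak(pos, pos + 1)) {
                    out = cell.data;
                    cell.seq.store(pos + mask_ + 1);  // free for the next lap
                    return true;
                }
            } else if (diff < 0) {
                return false;                 // queue is empty
            } else {
                pos = head_.load();
            }
        }
    }
};
```

Enqueuers contend only on Tail and dequeuers only on Head, so the two ends of the queue do not block each other.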

Page 25: Dynamic Load-balancing On Graphics Processors

Task stealing

[Diagram built up over slides 25 to 31: each thread block (TB 1 to TB n) has its own task list. One list starts with T1, grows to T1, T4, T5, then shrinks again until it is empty. The other list holds T3 and T2 and ends with only T2.]

Reference: N. S. Arora, R. D. Blumofe and C. G. Plaxton, Thread Scheduling for Multiprogrammed Multiprocessors [SPAA 98]
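The access pattern behind task stealing can be sketched as below: the owner pushes and pops at one end of its own list, and an idle worker steals from the other end of someone else's. For brevity this example guards each list with a mutex; the deque in the reference above achieves the same discipline without locks, which is not reproduced here. `StealableDeque` and `next_task` are hypothetical names.

```cpp
#include <cstddef>
#include <deque>
#include <mutex>
#include <vector>

// One task list per worker. The owner works at the back (newest task),
// thieves take from the front (oldest task).
class StealableDeque {
    std::mutex m_;
    std::deque<int> tasks_;
public:
    void push(int task) {                 // owner only
        std::lock_guard<std::mutex> g(m_);
        tasks_.push_back(task);
    }
    bool pop(int& task) {                 // owner only
        std::lock_guard<std::mutex> g(m_);
        if (tasks_.empty()) return false;
        task = tasks_.back();
        tasks_.pop_back();
        return true;
    }
    bool steal(int& task) {               // any other worker
        std::lock_guard<std::mutex> g(m_);
        if (tasks_.empty()) return false;
        task = tasks_.front();
        tasks_.pop_front();
        return true;
    }
};

// Take a task from our own list first; if it is empty, try to steal
// from the other workers in turn.
bool next_task(std::vector<StealableDeque>& deques, std::size_t self, int& task) {
    if (deques[self].pop(task)) return true;
    for (std::size_t i = 1; i < deques.size(); ++i) {
        std::size_t victim = (self + i) % deques.size();
        if (deques[victim].steal(task)) return true;
    }
    return false;
}
```

Because owners and thieves work at opposite ends, they interfere only when a list is nearly empty.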

Page 32: Dynamic Load-balancing On Graphics Processors

Static Task List

[Diagram built up over slides 32 to 37: an In list holds tasks T1 to T4, one for each of thread blocks TB 1 to TB 4; an Out list is then added, and new tasks T5, T6 and T7 appear one at a time.]
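One reading of the In/Out diagram is sketched below: each pass reads tasks from a fixed In list, appends any new tasks to the Out list using fetch-and-inc on a shared counter, and the two lists are swapped between passes. The names `TaskList`, `run_pass` and `count_leaves` are assumptions for this example, and the loop over `block` stands in for what would be one thread block per task on a GPU.

```cpp
#include <atomic>
#include <cstddef>
#include <utility>
#include <vector>

// Fixed-size list; `count` is advanced with fetch-and-inc on append.
struct TaskList {
    std::vector<int> tasks;
    std::atomic<std::size_t> count{0};
    explicit TaskList(std::size_t capacity) : tasks(capacity) {}
    void append(int task) { tasks[count.fetch_add(1)] = task; }
};

// One pass over the In list. A task of depth 0 is a leaf; any other task
// produces two tasks of depth - 1 in the Out list.
void run_pass(const TaskList& in, TaskList& out, std::atomic<int>& leaves) {
    out.count.store(0);
    std::size_t n = in.count.load();
    for (std::size_t block = 0; block < n; ++block) {
        int depth = in.tasks[block];
        if (depth == 0) {
            leaves.fetch_add(1);
        } else {
            out.append(depth - 1);
            out.append(depth - 1);
        }
    }
}

// Repeat passes, swapping In and Out, until no tasks remain.
int count_leaves(int root_depth) {
    std::size_t capacity = static_cast<std::size_t>(1) << root_depth;
    TaskList a(capacity), b(capacity);
    a.append(root_depth);
    TaskList* in = &a;
    TaskList* out = &b;
    std::atomic<int> leaves{0};
    while (in->count.load() > 0) {
        run_pass(*in, *out, leaves);
        std::swap(in, out);
    }
    return leaves.load();
}
```

The In list is never modified during a pass, so the only synchronization needed is the counter on the Out list.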

Page 38: Dynamic Load-balancing On Graphics Processors

Octree Partitioning

• Bandwidth bound

Page 42: Dynamic Load-balancing On Graphics Processors

Four-in-a-row

• Computation intensive

Page 43: Dynamic Load-balancing On Graphics Processors

Graphics Processors

8800GT

• 14 Multiprocessors

• 57 GB/sec bandwidth

9600GT

• 8 Multiprocessors

• 57 GB/sec bandwidth

Page 44: Dynamic Load-balancing On Graphics Processors

Blocking Queue – Octree/9600GT

[Plot: Time (ms), axis 0 to 600, against Threads and Blocks, each from 16 to 128.]

Page 45: Dynamic Load-balancing On Graphics Processors

Blocking Queue – Octree/8800GT

[Plot: Time (ms), axis 0 to 800, against Threads and Blocks, each from 16 to 128.]

Page 46: Dynamic Load-balancing On Graphics Processors

Blocking Queue – Four-in-a-row

[Plot: Time (ms), axis 0 to 2500, against Threads and Blocks, each from 16 to 128.]

Page 47: Dynamic Load-balancing On Graphics Processors

Non-blocking Queue – Octree/9600GT

[Plot: Time (ms), axis 0 to 250, against Threads and Blocks, each from 16 to 128.]

Page 48: Dynamic Load-balancing On Graphics Processors

Non-blocking Queue – Octree/8800GT

[Plot: Time (ms), axis 0 to 250, against Threads and Blocks, each from 16 to 128.]

Page 49: Dynamic Load-balancing On Graphics Processors

Non-blocking Queue – Four-in-a-row

[Plot: Time (ms), axis 0 to 200, against Threads and Blocks, each from 16 to 128.]

Page 50: Dynamic Load-balancing On Graphics Processors

Task stealing – Octree/9600GT

[Plot: Time (ms), axis 0 to 250, against Threads and Blocks, each from 16 to 128.]

Page 51: Dynamic Load-balancing On Graphics Processors

Task stealing – Octree/8800GT

[Plot: Time (ms), axis 0 to 250, against Threads and Blocks, each from 16 to 128.]

Page 52: Dynamic Load-balancing On Graphics Processors

Task stealing – Four-in-a-row

[Plot: Time (ms), axis 0 to 150, against Threads and Blocks, each from 16 to 128.]

Page 53: Dynamic Load-balancing On Graphics Processors

Static List

[Plot: Time (ms), axis 0 to 140, against Threads/Block, from 8 to 128. Series: Octree 9600GT, Octree 8800GTS, Four-in-a-row.]

Page 54: Dynamic Load-balancing On Graphics Processors

Octree Comparison

[Plot: Time (ms), axis marked at 10 and 100, against Particles (thousands), from 100 to 500. Series: Blocking Queue, Non-Blocking Queue, Static List, Work Stealing.]

Page 55: Dynamic Load-balancing On Graphics Processors

Previous work

• Korch M., Rauber T., A comparison of task pools for dynamic load balancing of irregular algorithms, Concurrency and Computation: Practice & Experience, 16, 2003

• Heirich A., Arvo J., A competitive analysis of load balancing strategies for parallel ray tracing, Journal of Supercomputing, 12, 1998

• Foley T., Sugerman J., KD-tree acceleration structures for a GPU raytracer, Graphics Hardware 2005

Page 56: Dynamic Load-balancing On Graphics Processors

Conclusion

• Synchronization plays a significant role in dynamic load-balancing

• Lock-free data structures and synchronization scale well and look promising for general-purpose GPU programming as well

• Locks perform poorly

• It is good that operations such as CAS and FAA have been introduced in newer GPUs

• Work stealing could outperform static load balancing

Page 57: Dynamic Load-balancing On Graphics Processors

Thank you!

http://www.cs.chalmers.se/~dcs