On Dynamic Load Balancing on Graphics Processors Daniel Cederman and Philippas Tsigas Chalmers University of Technology

Post on 01-Apr-2015

Page 1

On Dynamic Load Balancing on Graphics Processors

Daniel Cederman and Philippas Tsigas

Chalmers University of Technology

Page 2

Overview

• Motivation

• Methods

• Experimental evaluation

• Conclusion

Page 3

The problem setting

[Diagram: Work divided into Tasks, with the labels Offline and Online.]

Pages 4-8

Static Load Balancing

[Animation: four Processors each receive one Task; four Subtasks then appear.]

Page 9

Dynamic Load Balancing

[Diagram: four Processors, four Tasks, and four Subtasks.]

Page 10

Task sharing

[Flowchart: each thread block loops over the steps below, operating on a shared Task Set.]

• Check condition: Work done? If yes: Done.

• Acquire Task: Try to get task from the Task Set. Got task? No, retry.

• Perform task.

• Add Task: New tasks? If yes, add task to the Task Set. No, continue.
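The task-sharing flow on this slide can be read as a loop that every worker runs. What follows is a minimal single-threaded sketch in Python, not the authors' GPU code. It assumes a plain `deque` as the Task Set and treats "work done" as "the set is empty", which is only valid without concurrency. The names `run_worker`, `perform`, and `split_range` are hypothetical.

```python
from collections import deque

def run_worker(task_set, perform):
    """Repeat: try to get a task, perform it, add any new tasks it creates."""
    results = []
    while task_set:                        # "Work done?" (assumption: empty set means done)
        task = task_set.popleft()          # "Try to get task" / "Acquire Task"
        new_tasks, output = perform(task)  # "Perform task"
        results.append(output)
        task_set.extend(new_tasks)         # "New tasks?" then "Add task"
    return results

def split_range(task):
    """Hypothetical task: split a (lo, hi) range until it holds one element."""
    lo, hi = task
    if hi - lo <= 1:
        return [], lo                      # leaf: no new tasks, emit the element
    mid = (lo + hi) // 2
    return [(lo, mid), (mid, hi)], None    # inner node: two subtasks, no output

leaves = [r for r in run_worker(deque([(0, 8)]), split_range) if r is not None]
print(sorted(leaves))
```

With several workers sharing one Task Set, an empty set no longer proves that all work is done, because another worker may be about to add tasks. That is why the flowchart has a separate "Check condition" step.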

Page 11

System Model

• CUDA

• Global Memory

• Gather and scatter

• Compare-And-Swap

• Fetch-And-Inc

• Multiprocessors

• Maximum number of concurrent thread blocks

[Diagram: three Multiprocessors, each with three Thread Blocks, and one Global Memory.]

Page 12

Synchronization

• Blocking

• Uses mutual exclusion to only allow one process at a time to access the object.

• Lock-free

• Multiple processes can access the object concurrently. At least one operation in a set of concurrent operations finishes in a finite number of its own steps.

• Wait-free

• Multiple processes can access the object concurrently. Every operation finishes in a finite number of its own steps.
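To make the difference between blocking and lock-free concrete, here is a small Python sketch of two ways to increment a shared counter. Python has no hardware compare-and-swap, so the hypothetical `AtomicInt` class models one; its internal lock only stands in for the atomicity a real CAS instruction provides. All names are illustrative assumptions.

```python
import threading

class AtomicInt:
    """Models one memory word with an atomic compare-and-swap.
    The guard lock is an assumption that stands in for hardware atomicity."""
    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        with self._guard:
            if self._value == expected:
                self._value = new
                return True
            return False

def lock_free_increment(word):
    # Retry loop: a failed CAS means some other operation succeeded,
    # so at least one operation always makes progress.
    while True:
        old = word.load()
        if word.compare_and_swap(old, old + 1):
            return old

def blocking_increment(state, mutex):
    # Mutual exclusion: only one thread at a time is inside this section.
    with mutex:
        state["value"] += 1

word = AtomicInt()
state, mutex = {"value": 0}, threading.Lock()

def worker():
    for _ in range(1000):
        lock_free_increment(word)
        blocking_increment(state, mutex)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(word.load(), state["value"])
```

Both counters end at the same value. The difference is in progress: if the thread holding the mutex is delayed, every other thread waits, whereas a delayed thread in the CAS loop cannot stop the others from finishing.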

Page 13

Load Balancing Methods

• Blocking Task Queue

• Non-blocking Task Queue

• Task Stealing

• Static Task List

Pages 14-18

Blocking queue

[Animation: thread blocks TB 1, TB 2, ..., TB n share one queue with Head and Tail pointers and a flag labeled Free; a task T1 is added to the queue.]
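A blocking queue of this kind can be sketched as an array with head and tail indices guarded by a single lock. The Python below is an illustration under assumptions, not the implementation measured in the talk: `BlockingTaskQueue` and `thread_block` are hypothetical names, and Python threads stand in for thread blocks.

```python
import threading

class BlockingTaskQueue:
    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.head = 0                  # index of the next task to dequeue
        self.tail = 0                  # index of the next free slot
        self.lock = threading.Lock()   # "Free" when no thread block holds it

    def enqueue(self, task):
        with self.lock:                # every operation takes the same lock
            if self.tail - self.head == len(self.slots):
                return False           # queue is full
            self.slots[self.tail % len(self.slots)] = task
            self.tail += 1
            return True

    def dequeue(self):
        with self.lock:
            if self.head == self.tail:
                return None            # queue is empty
            task = self.slots[self.head % len(self.slots)]
            self.head += 1
            return task

queue = BlockingTaskQueue(capacity=128)
for t in range(100):
    queue.enqueue(t)

taken = [[] for _ in range(4)]

def thread_block(i):
    while True:
        task = queue.dequeue()
        if task is None:
            return
        taken[i].append(task)

workers = [threading.Thread(target=thread_block, args=(i,)) for i in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(sum(len(t) for t in taken))
```

Every enqueue and dequeue serializes on the one lock, so adding more thread blocks adds contention rather than throughput.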

Pages 19-24

Non-blocking Queue

[Animation: thread blocks TB 1, TB 2, ..., TB n operate on a queue holding T1 to T4 with Head and Tail pointers; a task T5 is then added.]

Reference: P. Tsigas and Y. Zhang, A simple, fast and scalable non-blocking concurrent FIFO queue for shared memory multiprocessor systems [SPAA01]
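The idea of a non-blocking queue can be illustrated with compare-and-swap on the head and tail indices. The Python sketch below is a deliberately simplified stand-in and is not the Tsigas and Zhang algorithm: slots are never reused, which sidesteps the ABA problem that a real circular queue must handle. The `AtomicArray` class is a hypothetical model of memory words with atomic CAS; its guard lock only stands in for hardware atomicity. Tasks must not be `None`, since `None` marks an empty slot.

```python
import threading

class AtomicArray:
    """Memory words with atomic CAS (the guard lock models the hardware)."""
    def __init__(self, values):
        self._v = list(values)
        self._guard = threading.Lock()

    def load(self, i):
        return self._v[i]

    def cas(self, i, expected, new):
        with self._guard:
            if self._v[i] == expected:
                self._v[i] = new
                return True
            return False

HEAD, TAIL = 0, 1

class NonBlockingQueue:
    def __init__(self, capacity):
        self.capacity = capacity
        self.idx = AtomicArray([0, 0])            # head and tail indices
        self.slots = AtomicArray([None] * capacity)

    def enqueue(self, task):
        while True:
            t = self.idx.load(TAIL)
            if t == self.capacity:
                return False                      # out of slots (no reuse here)
            won = self.slots.cas(t, None, task)   # try to claim slot t
            self.idx.cas(TAIL, t, t + 1)          # advance tail, helping if we lost
            if won:
                return True

    def dequeue(self):
        while True:
            h = self.idx.load(HEAD)
            if h == self.idx.load(TAIL):
                return None                       # empty
            task = self.slots.load(h)             # slot h was written before tail passed it
            if self.idx.cas(HEAD, h, h + 1):      # claim it; retry if another thread won
                return task

q = NonBlockingQueue(capacity=128)
for t in range(50):
    q.enqueue(t)

done = [[] for _ in range(4)]

def thread_block(i):
    while True:
        task = q.dequeue()
        if task is None:
            return
        done[i].append(task)
        if task < 50:
            q.enqueue(task + 100)                 # each initial task spawns one new task

workers = [threading.Thread(target=thread_block, args=(i,)) for i in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(sum(len(d) for d in done))
```

No thread ever waits for another to release anything: a failed CAS only means that some other operation completed, after which the loser retries with fresh indices.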

Pages 25-31

Task stealing

[Animation: each thread block (TB 1, TB 2, ..., TB n) has its own task queue. One queue grows from T1 to T1 T4 T5 and then drains until it is empty, while another holds T3 T2; finally a task is taken from that other queue, leaving T2.]

Reference: Arora N. S., Blumofe R. D., Plaxton C. G., Thread Scheduling for Multiprogrammed Multiprocessors [SPAA 98]
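The sequence in this animation can be replayed with a small single-threaded Python sketch. It shows only the access pattern (the owner works at one end of its own queue, a thief takes from the other end of a victim's queue) and has no synchronization at all, whereas the referenced deque of Arora, Blumofe, and Plaxton is non-blocking. `Worker` and `get_task` are hypothetical names.

```python
from collections import deque

class Worker:
    def __init__(self, name):
        self.name = name
        self.tasks = deque()

    def push(self, task):
        self.tasks.append(task)      # owner adds at the bottom

    def pop(self):
        # Owner takes from the bottom, so the newest task comes first.
        return self.tasks.pop() if self.tasks else None

    def steal(self):
        # Thief takes from the top, so the oldest task comes first.
        return self.tasks.popleft() if self.tasks else None

def get_task(worker, others):
    """Prefer the worker's own queue; steal only when it is empty."""
    task = worker.pop()
    if task is not None:
        return task, worker.name
    for victim in others:
        task = victim.steal()
        if task is not None:
            return task, victim.name
    return None, None

tb1, tb2 = Worker("TB 1"), Worker("TB 2")
for t in ("T1", "T4", "T5"):
    tb1.push(t)
for t in ("T3", "T2"):
    tb2.push(t)

trace = [get_task(tb1, [tb2]) for _ in range(4)]
print(trace)
print(list(tb2.tasks))
```

TB 1 first drains its own queue and touches TB 2's queue only once it has nothing left, so thread blocks rarely contend as long as each has local work.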

Pages 32-37

Static Task List

[Animation: an In list holds T1 to T4, one task per thread block (TB 1 to TB 4); new tasks T5, T6, and T7 are added one by one to an Out list.]
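One way to read the In and Out lists is as a sequence of passes: every task in In is processed, new tasks are collected in Out, and Out then serves as In for the next pass. The Python sketch below works under that assumption; the swap between passes and the names `run_static_task_list` and `split_range` are not taken from the slides.

```python
def run_static_task_list(initial, perform):
    """Process tasks in passes; returns (outputs, number_of_passes)."""
    in_list, outputs, passes = list(initial), [], 0
    while in_list:
        out_list = []
        for task in in_list:             # on a GPU: one thread block per In entry
            new_tasks, output = perform(task)
            # Assumption: on a GPU each Out slot would be reserved with an
            # atomic increment of a shared index; a plain append suffices here.
            out_list.extend(new_tasks)
            if output is not None:
                outputs.append(output)
        in_list = out_list               # assumption: Out becomes In for the next pass
        passes += 1
    return outputs, passes

def split_range(task):
    """Hypothetical task: split a (lo, hi) range until it holds one element."""
    lo, hi = task
    if hi - lo <= 1:
        return [], lo
    mid = (lo + hi) // 2
    return [(lo, mid), (mid, hi)], None

leaves, passes = run_static_task_list([(0, 8)], split_range)
print(leaves, passes)
```

Within a pass no task is ever removed from a shared structure, so the only synchronization needed is on the Out index. The cost is that a pass cannot finish before its slowest task does.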

Pages 38-41

Octree Partitioning

• Bandwidth bound

Page 42

Four-in-a-row

• Computation intensive

Page 43

Graphics Processors

8800GT

• 14 Multiprocessors

• 57 GB/sec bandwidth

9600GT

• 8 Multiprocessors

• 57 GB/sec bandwidth

Page 44

Blocking Queue – Octree/9600GT

[Plot: Time (ms) against Threads (16 to 128) and Blocks (16 to 128).]

Page 45

Blocking Queue – Octree/8800GT

[Plot: Time (ms) against Threads (16 to 128) and Blocks (16 to 128).]

Page 46

Blocking Queue – Four-in-a-row

[Plot: Time (ms) against Threads (16 to 128) and Blocks (16 to 128).]

Page 47

Non-blocking Queue – Octree/9600GT

[Plot: Time (ms) against Threads (16 to 128) and Blocks (16 to 128).]

Page 48

Non-blocking Queue – Octree/8800GT

[Plot: Time (ms) against Threads (16 to 128) and Blocks (16 to 128).]

Page 49

Non-blocking Queue – Four-in-a-row

[Plot: Time (ms) against Threads (16 to 128) and Blocks (16 to 128).]

Page 50

Task stealing – Octree/9600GT

[Plot: Time (ms) against Threads (16 to 128) and Blocks (16 to 128).]

Page 51

Task stealing – Octree/8800GT

[Plot: Time (ms) against Threads (16 to 128) and Blocks (16 to 128).]

Page 52

Task stealing – Four-in-a-row

[Plot: Time (ms) against Threads (16 to 128) and Blocks (16 to 128).]

Page 53

Static List

[Plot: Time (ms), 0 to 140, against Threads/Block, 8 to 128. Series: Octree 9600GT, Octree 8800GTS, Four-in-a-row.]

Page 54

Octree Comparison

[Plot: Time (ms), axis marked at 10 and 100, against Particles (thousands), 100 to 500. Series: Blocking Queue, Non-Blocking Queue, Static List, Work Stealing.]

Page 55

Previous work

• Korch M., Rauber T., A comparison of task pools for dynamic load balancing of irregular algorithms, Concurrency and Computation: Practice & Experience, 16, 2003

• Heirich A., Arvo J., A competitive analysis of load balancing strategies for parallel ray tracing, Journal of Supercomputing, 12, 1998

• Foley T., Sugerman J., KD-tree acceleration structures for a GPU raytracer, Graphics Hardware 2005

Page 56

Conclusion

• Synchronization plays a significant role in dynamic load balancing

• Lock-free data structures and synchronization scale well and look promising for general-purpose GPU programming as well

• Locks perform poorly

• Atomic operations such as CAS and FAA, introduced in newer GPUs, are a welcome addition

• Work stealing could outperform static load balancing

Page 57

Thank you!

http://www.cs.chalmers.se/~dcs