numa-friendly data structures (using delegation and ...numa node (multiple cores, shared last level...

27
NUMA-Friendly Stack (using Delegation and Elimination) Irina Calciu Justin Gottschlich Maurice Herlihy HotPar ‘13 1

Upload: others

Post on 01-Jul-2020

9 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: NUMA-Friendly Data Structures (using Delegation and ...NUMA NODE (multiple cores, shared Last Level Cache) Cache coherency maintained between caches on different NUMA nodes 4 . Overview

NUMA-Friendly Stack (using Delegation and Elimination)

Irina Calciu

Justin Gottschlich

Maurice Herlihy

HotPar ‘13

1

Page 2: NUMA-Friendly Data Structures (using Delegation and ...NUMA NODE (multiple cores, shared Last Level Cache) Cache coherency maintained between caches on different NUMA nodes 4 . Overview

Trends for Future Architectures

2

Page 3: NUMA-Friendly Data Structures (using Delegation and ...NUMA NODE (multiple cores, shared Last Level Cache) Cache coherency maintained between caches on different NUMA nodes 4 . Overview

Uniform Memory Access (UMA)

3

Page 4: NUMA-Friendly Data Structures (using Delegation and ...NUMA NODE (multiple cores, shared Last Level Cache) Cache coherency maintained between caches on different NUMA nodes 4 . Overview

Non-Uniform Memory Access (NUMA)

(interconnect)

NUMA NODE (multiple cores, shared Last Level Cache)

NUMA NODE (multiple cores, shared Last Level Cache)

NUMA NODE (multiple cores, shared Last Level Cache)

NUMA NODE (multiple cores, shared Last Level Cache)

Cache coherency maintained between caches on different NUMA nodes

4

Page 5: NUMA-Friendly Data Structures (using Delegation and ...NUMA NODE (multiple cores, shared Last Level Cache) Cache coherency maintained between caches on different NUMA nodes 4 . Overview

Overview

• Motivation

• Algorithms

• Results

• Conclusions

5

Page 6: NUMA-Friendly Data Structures (using Delegation and ...NUMA NODE (multiple cores, shared Last Level Cache) Cache coherency maintained between caches on different NUMA nodes 4 . Overview

Delegation

NUMA node 0 NUMA node 1

Clients Clients

SEQ STACK

Server

6

Page 7: NUMA-Friendly Data Structures (using Delegation and ...NUMA NODE (multiple cores, shared Last Level Cache) Cache coherency maintained between caches on different NUMA nodes 4 . Overview

Delegation

NUMA node 0 NUMA node 1

Server

Client 5

Client 6

Client 7

Client 8

Slots Client 1 Client 2

Client 3 Client 4

Slots

Loops through all slots

SEQ STACK

7

Page 8: NUMA-Friendly Data Structures (using Delegation and ...NUMA NODE (multiple cores, shared Last Level Cache) Cache coherency maintained between caches on different NUMA nodes 4 . Overview

Elimination, Rendezvous

8

Page 9: NUMA-Friendly Data Structures (using Delegation and ...NUMA NODE (multiple cores, shared Last Level Cache) Cache coherency maintained between caches on different NUMA nodes 4 . Overview

Local Rendezvous

NUMA node 0 NUMA node 1

STACK

9

Page 10: NUMA-Friendly Data Structures (using Delegation and ...NUMA NODE (multiple cores, shared Last Level Cache) Cache coherency maintained between caches on different NUMA nodes 4 . Overview

Delegation + Elimination

NUMA node 0 NUMA node 1

Clients Clients

SEQ STACK

Server

10

Page 11: NUMA-Friendly Data Structures (using Delegation and ...NUMA NODE (multiple cores, shared Last Level Cache) Cache coherency maintained between caches on different NUMA nodes 4 . Overview

Delegation + LOCAL Elimination

NUMA node 0 NUMA node 1

Clients

Clients

SEQ STACK

Server

11

Page 12: NUMA-Friendly Data Structures (using Delegation and ...NUMA NODE (multiple cores, shared Last Level Cache) Cache coherency maintained between caches on different NUMA nodes 4 . Overview

Effect of Elimination

Throughput (Better)

50% push 50% pop

90% push 10% pop

12

Page 13: NUMA-Friendly Data Structures (using Delegation and ...NUMA NODE (multiple cores, shared Last Level Cache) Cache coherency maintained between caches on different NUMA nodes 4 . Overview

Effect of Delegation

Throughput (Better)

50% push 50% pop

90% push 10% pop

13

Page 14: NUMA-Friendly Data Structures (using Delegation and ...NUMA NODE (multiple cores, shared Last Level Cache) Cache coherency maintained between caches on different NUMA nodes 4 . Overview

Number of Slots

Throughput (Better)

50% push 50% pop

90% push 10% pop

14

Page 15: NUMA-Friendly Data Structures (using Delegation and ...NUMA NODE (multiple cores, shared Last Level Cache) Cache coherency maintained between caches on different NUMA nodes 4 . Overview

Workloads: Balanced vs. Unbalanced

Throughput (Better)

50% push 50% pop

70% push 30% pop

15

Page 16: NUMA-Friendly Data Structures (using Delegation and ...NUMA NODE (multiple cores, shared Last Level Cache) Cache coherency maintained between caches on different NUMA nodes 4 . Overview

Advantages

• Memory and cache locality

• Reduced bus traffic

• Increased parallelism through elimination

16

Page 17: NUMA-Friendly Data Structures (using Delegation and ...NUMA NODE (multiple cores, shared Last Level Cache) Cache coherency maintained between caches on different NUMA nodes 4 . Overview

Drawbacks

• Communication cost between clients and server thread

o Insignificant compared to the benefits

• Serializing otherwise parallel data structures

o Parallelism through elimination

• Elimination opportunities decrease as workload more unbalanced

17

Page 18: NUMA-Friendly Data Structures (using Delegation and ...NUMA NODE (multiple cores, shared Last Level Cache) Cache coherency maintained between caches on different NUMA nodes 4 . Overview

Open Questions

• Are there other data structures where we can use delegation and elimination?

• Are there data structures where direct access is much better?

• What can we do for those data structures?

18

Page 19: NUMA-Friendly Data Structures (using Delegation and ...NUMA NODE (multiple cores, shared Last Level Cache) Cache coherency maintained between caches on different NUMA nodes 4 . Overview

Thank you! Questions?

19

Page 20: NUMA-Friendly Data Structures (using Delegation and ...NUMA NODE (multiple cores, shared Last Level Cache) Cache coherency maintained between caches on different NUMA nodes 4 . Overview

References

• A Scalable Lock-free Stack Algorithm

http://www.inf.ufsc.br/~dovicchi/pos-ed/pos/artigos/p206-hendler.pdf

• Flat Combining and the Synchronization-Parallelism Tradeoff

http://www.cs.bgu.ac.il/~hendlerd/papers/flat-combining.pdf

• Fast and Scalable Rendezvousing

http://www.cs.tau.ac.il/~afek/rendezvous.pdf

20

Page 21: NUMA-Friendly Data Structures (using Delegation and ...NUMA NODE (multiple cores, shared Last Level Cache) Cache coherency maintained between caches on different NUMA nodes 4 . Overview

Cache to Cache Traffic

Better

21

Page 22: NUMA-Friendly Data Structures (using Delegation and ...NUMA NODE (multiple cores, shared Last Level Cache) Cache coherency maintained between caches on different NUMA nodes 4 . Overview

Coefficient of Variation

Better

22

Page 23: NUMA-Friendly Data Structures (using Delegation and ...NUMA NODE (multiple cores, shared Last Level Cache) Cache coherency maintained between caches on different NUMA nodes 4 . Overview

Flat Combining

23

Page 24: NUMA-Friendly Data Structures (using Delegation and ...NUMA NODE (multiple cores, shared Last Level Cache) Cache coherency maintained between caches on different NUMA nodes 4 . Overview

Delegation

CLIENT Find corresponding slot (by NUMA node and cpuid) Post message Wait for response Get response

SERVER Loop through all slots: If slot has message:

Take message Process message Send response

Time

24

Page 25: NUMA-Friendly Data Structures (using Delegation and ...NUMA NODE (multiple cores, shared Last Level Cache) Cache coherency maintained between caches on different NUMA nodes 4 . Overview

Delegation

CLIENT Find corresponding slot (by NUMA node and cpuid) try_elimination: if (eliminate) return Post message Wait for response Get response else try_elimination

SERVER Loop through all slots: If slot has message:

Take message Process message Send response

Time

25

Page 26: NUMA-Friendly Data Structures (using Delegation and ...NUMA NODE (multiple cores, shared Last Level Cache) Cache coherency maintained between caches on different NUMA nodes 4 . Overview

Delegation

CLIENT Find corresponding slot (by NUMA node and cpuid) try_elimination: if (eliminate) return if (Acquire slot lock) Post message Wait for response Get response Release slot lock else try_elimination

SERVER Loop through all slots: If slot has message:

Take message Process message Send response

Time

26

Page 27: NUMA-Friendly Data Structures (using Delegation and ...NUMA NODE (multiple cores, shared Last Level Cache) Cache coherency maintained between caches on different NUMA nodes 4 . Overview

Open Questions

• Performance

• Scalability

• Power

27