DATA STRUCTURES OPTIMISATION FOR MANY-CORE SYSTEMS
Matthew Freeman | Supervisor: Maciej Golebiewski
CSIRO Vacation Scholar Program 2013-14
Presentation title | Presenter name
The Multi-core Age
• Mobile Phone: 2-4 Cores
• PC: 4-16 Cores
• Intel Xeon Phi: 61 Cores
• CSIRO ‘Bragg’ Compute Cluster: 2048 Cores
Programming for multi-cores
Divide the problem
[Diagram: Problem -> CPU Core 1, CPU Core 2, CPU Core 3, CPU Core 4 -> Machine Instructions Execution]
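As a minimal sketch of dividing a problem among cores, the function below sums a vector with an OpenMP parallel loop. The function name `parallel_sum` and the choice of a sum as the "problem" are assumptions made for illustration; they do not come from the slides.

```cpp
#include <cstddef>
#include <vector>

// Sum a vector by dividing the loop iterations among the available cores.
// With OpenMP enabled (for example -fopenmp), each thread sums its own share
// and the partial sums are combined at the end. Without OpenMP the pragma is
// ignored and the loop simply runs on one core.
long long parallel_sum(const std::vector<int>& values) {
    long long total = 0;
    const long long n = static_cast<long long>(values.size());
    #pragma omp parallel for reduction(+:total)
    for (long long i = 0; i < n; ++i) {
        total += values[static_cast<std::size_t>(i)];
    }
    return total;
}
```

The `reduction` clause gives every thread a private running total, so the threads do not contend for one shared variable while the loop runs.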
Amdahl's Law
• The maximum speedup depends on the percentage of the problem that can run in parallel.
Maximum Speedup (Single Core Processor = 1x Speed):
• 50% parallel: 2x speedup
• 75% parallel: 4x speedup
• 90% parallel: 10x speedup
• 95% parallel: 20x speedup
Data structures:
• Memory (data) is still a shared resource.
[Diagram: Single Core Computer: one CPU core connected to Memory (data). 4-Core Computer: four CPU cores connected to the same Memory (data).]
Linked-list (Stack) Data Structure
• A “node” that holds data.
• Each node has a link to the next data point.
[Diagram: TOP -> Data -> EMPTY]
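A node of this kind could be written in C++ roughly as follows. This is a sketch only: the `int` payload, the field names, and the helper `count_nodes` are assumptions for illustration, and `nullptr` stands in for EMPTY.

```cpp
#include <cstddef>

// One "node": a piece of data plus a link to the next node.
struct Node {
    int data;    // the data held by this node (an int, for illustration)
    Node* next;  // link to the next node; nullptr plays the role of EMPTY
};

// Count the nodes reachable from TOP by following the links.
std::size_t count_nodes(const Node* top) {
    std::size_t n = 0;
    for (const Node* cur = top; cur != nullptr; cur = cur->next) {
        ++n;
    }
    return n;
}
```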
Add new item (Push)
We want to add a chunk of data (Data B) to the structure.
[Diagram: TOP -> Data A -> EMPTY; Data B is not yet linked]
Add new item (Push)
Steps: For new data B
1) Find the start of the structure (TOP).
[Diagram: TOP -> Data A -> EMPTY; Data B is not yet linked]
Add new item
Steps: For new data B
2) Link into the structure.
[Diagram: Data B -> Data A -> EMPTY; TOP still points to Data A]
Add new item
Steps: For new data B
3) Update TOP.
[Diagram: TOP (new) -> Data B -> Data A -> NULL]
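The three push steps can be sketched for a single thread as below. The `Stack` type, its member names, and the `int` payload are hypothetical, chosen only to mirror the slides.

```cpp
struct Node {
    int data;
    Node* next;
};

struct Stack {
    Node* top = nullptr;  // TOP; nullptr means the stack is EMPTY

    void push(int value) {
        Node* node = new Node{value, nullptr};
        Node* old_top = top;   // 1) Find the start of the structure (TOP)
        node->next = old_top;  // 2) Link into the structure
        top = node;            // 3) Update TOP
    }

    // Free every node that is still linked from TOP.
    ~Stack() {
        while (top != nullptr) {
            Node* n = top;
            top = top->next;
            delete n;
        }
    }
};
```

Each step is a separate read or write, which is harmless with one thread but matters once several threads push at the same time.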
Resulting structure
• Like stacking dinner plates.
• Only need to keep track of where TOP is to access the rest.
[Diagram: TOP -> Data -> Data -> Data -> Data -> Data -> NULL]
What happens in multi-core systems?
Two threads trying to operate on the stack structure:
Thread 1 attempts at time T.
Thread 2 attempts at time T + 1 nanosecond.
Because each of the steps takes time to complete, errors occur.
What happens in multi-core systems?
This causes the interleaving of steps
• Thread 1 reads TOP (1)
• Thread 2 reads TOP (1)
• Thread 1 sets the next pointer (2)
• Thread 2 sets the next pointer (2)
• Thread 1 updates TOP (3)
• Thread 2 updates TOP (3)
Data B is lost forever because it is not linked to TOP anymore (stack failure).
[Diagram: TOP -> Data C -> Data A -> EMPTY. Data B also links to Data A, but nothing links to Data B. Thread 1 and Thread 2 each added one of the two new nodes.]
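The failure can be reproduced deterministically by replaying the interleaved steps on a single thread. The sketch below assumes Thread 1 is pushing Data B and Thread 2 is pushing Data C, which is consistent with Thread 2 updating TOP last; the function names are hypothetical.

```cpp
struct Node {
    const char* name;
    Node* next;
};

// Replay the interleaving by hand so the outcome is reproducible.
// Returns the final value of TOP.
Node* replay_interleaved_push(Node* top, Node* b, Node* c) {
    Node* seen_by_t1 = top;  // Thread 1 reads TOP (1)
    Node* seen_by_t2 = top;  // Thread 2 reads TOP (1)
    b->next = seen_by_t1;    // Thread 1 sets the next pointer (2)
    c->next = seen_by_t2;    // Thread 2 sets the next pointer (2)
    top = b;                 // Thread 1 updates TOP (3)
    top = c;                 // Thread 2 updates TOP (3), overwriting Thread 1
    return top;
}

// True if `target` can be reached by following links from `top`.
bool reachable(const Node* top, const Node* target) {
    for (const Node* cur = top; cur != nullptr; cur = cur->next) {
        if (cur == target) return true;
    }
    return false;
}
```

After the replay, TOP leads to Data C and then Data A. Data B still points at Data A, but no path from TOP reaches it.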
How do we fix this?
• Use “data locks”.
• Protect the 3 steps.
• One thread at a time is granted access to the stack.
• Complete an operation and release the lock.
This is the standard approach for multithreaded structures.
Locks
Easy to use. 2 lines of code added to fix.
- Get Lock
- Step 1, 2, 3.
- Release Lock.
× Slow. One thread at a time can use the lock.
This becomes sequential code. This is the code that cannot run in parallel.
Analogy: Merging highway traffic into a single lane.
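A lock-protected push might look like the sketch below. It uses `std::mutex` from the C++ standard library for the lock and an OpenMP loop to supply the threads; the class name `LockedStack` and the helper `push_many` are assumptions for illustration, not the project's code.

```cpp
#include <mutex>

struct Node {
    int data;
    Node* next;
};

class LockedStack {
public:
    void push(int value) {
        Node* node = new Node{value, nullptr};
        std::lock_guard<std::mutex> guard(lock_);  // Get Lock
        node->next = top_;                         // Steps 1 and 2
        top_ = node;                               // Step 3
    }                                              // Release Lock (guard is destroyed)

    // Number of nodes currently linked from TOP.
    int size() const {
        std::lock_guard<std::mutex> guard(lock_);
        int n = 0;
        for (const Node* cur = top_; cur != nullptr; cur = cur->next) ++n;
        return n;
    }

    ~LockedStack() {
        while (top_ != nullptr) {
            Node* n = top_;
            top_ = top_->next;
            delete n;
        }
    }

private:
    mutable std::mutex lock_;
    Node* top_ = nullptr;
};

// Push `count` items, spread across the OpenMP threads when OpenMP is
// enabled; without OpenMP the loop runs on one thread.
void push_many(LockedStack& stack, int count) {
    #pragma omp parallel for
    for (int i = 0; i < count; ++i) {
        stack.push(i);
    }
}
```

No push is lost, because only one thread at a time can be between acquiring and releasing the lock. That same property is what makes this section sequential.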
Lock-free
New method
• Lock-free data structure.
• Special low-level instructions allow the three steps to be carried out in one computer instruction.
• Removes the need for locks.
• Called a Compare-Exchange.
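In C++, a compare-exchange is available through `std::atomic`. The sketch below shows only the behaviour of the primitive; the wrapper name `try_swap` is hypothetical.

```cpp
#include <atomic>

// Try to change `target` from `expected` to `desired` in one atomic step.
// The write happens only if `target` still holds `expected`; the return
// value says whether the write happened.
bool try_swap(std::atomic<int>& target, int expected, int desired) {
    return target.compare_exchange_strong(expected, desired);
}
```

If another thread changed `target` in the meantime, the call returns false and leaves the value untouched, so the caller can re-read and try again.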
Lock-free
• Downside: Writing lock-free code is difficult (hence the project).
• The Compare-Exchange operation forms the base for writing lock-free code.
• The project implements specifications taken from research papers.
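To illustrate how a compare-exchange can replace the lock, here is a minimal push-only sketch in the style commonly known as a Treiber stack. It is a textbook pattern, not the project's implementation; the class name `LockFreeStack` and the helper `push_many` are assumptions. Pop is deliberately left out, since a correct lock-free pop also has to handle memory reclamation and the ABA problem.

```cpp
#include <atomic>

struct Node {
    int data;
    Node* next;
};

class LockFreeStack {
public:
    void push(int value) {
        Node* node = new Node{value, nullptr};
        // Steps 1 and 2: read TOP and link the new node to it.
        node->next = top_.load(std::memory_order_relaxed);
        // Step 3: make the new node TOP, but only if TOP is still the node
        // we linked to. On failure, compare_exchange_weak stores the current
        // TOP into node->next, so the loop retries with a fresh link.
        while (!top_.compare_exchange_weak(node->next, node,
                                           std::memory_order_release,
                                           std::memory_order_relaxed)) {
        }
    }

    // Number of nodes linked from TOP. Call only when no push is in progress.
    int size() const {
        int n = 0;
        for (const Node* cur = top_.load(std::memory_order_acquire);
             cur != nullptr; cur = cur->next) {
            ++n;
        }
        return n;
    }

    // Free all nodes. Assumes no other thread is still using the stack.
    ~LockFreeStack() {
        Node* cur = top_.load(std::memory_order_acquire);
        while (cur != nullptr) {
            Node* next = cur->next;
            delete cur;
            cur = next;
        }
    }

private:
    std::atomic<Node*> top_{nullptr};
};

// Push `count` items, spread across the OpenMP threads when OpenMP is
// enabled; without OpenMP the loop runs on one thread.
void push_many(LockFreeStack& stack, int count) {
    #pragma omp parallel for
    for (int i = 0; i < count; ++i) {
        stack.push(i);
    }
}
```

A thread that loses the race does not wait for a lock holder. It simply relinks its node to the new TOP and tries again.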
Lock-free
Implemented a range of lock-free optimizations for the stack.
Open coding standards (C++, OpenMP).
Benchmarked using an Intel Xeon Phi 61-core processor.
Lock-free structure performed about 2x better for pure stack operations.
Summary
Amdahl’s Law shows that it’s important to optimize sequential sections of code.
The shared data structures are often sequential bottlenecks.
Implementing lock-free data structures reduced this bottleneck.