
DATA STRUCTURES OPTIMISATION FOR MANY-CORE SYSTEMS

Matthew Freeman | Supervisor: Maciej Golebiewski

CSIRO Vacation Scholar Program 2013-14


The Multi-core Age


• Mobile Phone: 2-4 Cores
• PC: 4-16 Cores
• Intel Xeon Phi: 61 Cores
• CSIRO ‘Bragg’ Compute Cluster: 2048 Cores


Programming for multi-cores


Divide the problem

[Diagram: Problem, divided among CPU Core 1, CPU Core 2, CPU Core 3 and CPU Core 4. Labels: Machine Instructions, Execution.]


Amdahl's Law

• The maximum speedup depends on the percentage of the problem you can run in parallel.

[Chart: Maximum Speedup (axis 0 to 25) by parallel percentage. Single Core Processor: 1x speed; 50%: 2x speedup; 75%: 4x speedup; 90%: 10x speedup; 95%: 20x speedup.]
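The chart values can be reproduced with a short C++ sketch. It assumes the usual statement of Amdahl's Law, speedup = 1 / ((1 - p) + p / n), where p is the parallel fraction and n the number of cores. The function names are illustrative and do not come from the project.

```cpp
#include <limits>

// Amdahl's Law: speedup on `cores` cores when `parallel_fraction` (0..1)
// of the work can run in parallel. Names are hypothetical.
double amdahl_speedup(double parallel_fraction, double cores) {
    const double serial_fraction = 1.0 - parallel_fraction;
    return 1.0 / (serial_fraction + parallel_fraction / cores);
}

// Upper bound as the core count grows without limit: 1 / serial fraction.
double amdahl_max_speedup(double parallel_fraction) {
    const double serial_fraction = 1.0 - parallel_fraction;
    if (serial_fraction <= 0.0) {
        return std::numeric_limits<double>::infinity();
    }
    return 1.0 / serial_fraction;
}
```

For example, `amdahl_max_speedup(0.95)` gives the 20x bound from the chart, while `amdahl_speedup(0.95, 61)` gives 15.25x for a 61-core processor.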


Data structures:


• Memory (data) is still a shared resource.

[Diagram: in a Single Core Computer, one CPU core accesses Memory (data); in a 4-Core Computer, four CPU cores share the same Memory (data).]


Linked-list (Stack) Data Structure


[Diagram: TOP points to a “node” that holds data. Each node has two parts: the Data and a link to the next data point. The last link is EMPTY.]
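The node layout can be sketched in C++ as follows. This is an illustration only: the type names, the `int` payload and the `count_nodes` helper are assumptions, and `nullptr` stands in for EMPTY.

```cpp
#include <cstddef>

// One "node": a payload plus a link to the next node.
// nullptr plays the role of EMPTY on the slide. (Hypothetical names.)
struct Node {
    int data;
    Node* next;
};

// The stack only needs to remember where TOP is.
struct Stack {
    Node* top = nullptr;
};

// Walk the links from TOP until EMPTY and count the nodes visited.
std::size_t count_nodes(const Stack& stack) {
    std::size_t count = 0;
    for (const Node* node = stack.top; node != nullptr; node = node->next) {
        ++count;
    }
    return count;
}
```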


Add new item (Push)


We want to add a chunk of data (Data B) to the structure.

[Diagram: TOP → Data A → EMPTY; Data B is not yet linked.]


Add new item (Push)


Steps for new Data B:

1) Find the start of the structure (TOP).

[Diagram: TOP → Data A → EMPTY; Data B is not yet linked.]


Add new item


2) Link into the structure.

[Diagram: Data B → Data A → EMPTY; TOP still points to Data A.]


Add new item


3) Update TOP.

[Diagram: TOP (new) → Data B → Data A → NULL.]
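The three steps map onto three lines of code. The sketch below is a single-threaded illustration with assumed names (`Node`, `Stack`, `push`); it is not the project's code.

```cpp
// Hypothetical node and stack types for illustration.
struct Node {
    int data;
    Node* next;
};

struct Stack {
    Node* top = nullptr;  // nullptr means the stack is EMPTY
};

// Single-threaded push, following the three steps from the slides.
void push(Stack& stack, Node* node) {
    Node* old_top = stack.top;  // 1) find the start of the structure (TOP)
    node->next = old_top;       // 2) link the new node into the structure
    stack.top = node;           // 3) update TOP to the new node
}
```

Pushing Data A and then Data B leaves TOP pointing at B, with B linked to A.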


Resulting structure

• Like stacking dinner plates.
• Only need to keep track of where TOP is to access the rest.

[Diagram: TOP → Data → Data → Data → Data → Data → NULL.]


What happens in multi-core systems?


Two threads trying to operate on the stack structure:

• Thread 1 attempts at time T.
• Thread 2 attempts at time T + 1 nanosecond.

Because each of the steps takes time to complete, errors occur.


What happens in multi-core systems?


This causes the interleaving of steps

• Thread 1 reads TOP (1)
• Thread 2 reads TOP (1)
• Thread 1 sets the next pointer (2)
• Thread 2 sets the next pointer (2)
• Thread 1 updates TOP (3)
• Thread 2 updates TOP (3)


[Diagram: Thread 1 and Thread 2 push Data B and Data C onto TOP → Data A → EMPTY at the same time.]

Data B is lost forever because it can no longer be reached from TOP (stack failure).
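The failure can be replayed deterministically on one thread by executing the steps in the interleaved order. This sketch assumes Thread 1 pushes Data B and Thread 2 pushes Data C, which is inferred from the slide rather than stated on it. All names are illustrative.

```cpp
// Hypothetical node type for illustration.
struct Node {
    char data;
    Node* next;
};

// True if `target` can be reached by following links from `top`.
bool reachable(const Node* top, const Node* target) {
    for (const Node* node = top; node != nullptr; node = node->next) {
        if (node == target) {
            return true;
        }
    }
    return false;
}

// Replays the interleaved steps of two pushes on a single thread.
// Returns whether Data B is still reachable from TOP afterwards.
bool replay_interleaved_push() {
    Node a{'A', nullptr};
    Node b{'B', nullptr};  // assumed to be pushed by Thread 1
    Node c{'C', nullptr};  // assumed to be pushed by Thread 2
    Node* top = &a;

    Node* seen_by_thread1 = top;  // Thread 1 reads TOP (1)
    Node* seen_by_thread2 = top;  // Thread 2 reads TOP (1)
    b.next = seen_by_thread1;     // Thread 1 sets the next pointer (2)
    c.next = seen_by_thread2;     // Thread 2 sets the next pointer (2)
    top = &b;                     // Thread 1 updates TOP (3)
    top = &c;                     // Thread 2 updates TOP (3), overwriting Thread 1

    return reachable(top, &b);    // false: B was dropped from the stack
}
```

Both threads linked their node to Data A, so the second update of TOP silently discards the first.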


How do we fix this?

• Use “data locks”.
• Protect the 3 steps.
• One thread at a time is granted access to the stack.
• Complete an operation and release the lock.

This is the standard approach for multithreaded structures.



Locks

Easy to use. 2 lines of code added to fix.
- Get Lock
- Step 1, 2, 3.
- Release Lock.

× Slow. Only one thread at a time can hold the lock.

This becomes sequential code: the part that cannot run in parallel.

Analogy: Merging highway traffic into a single lane.
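A lock-protected push might look like the sketch below. It assumes `std::mutex` from the C++ standard library as the lock; the slides do not say which lock the project used, and the class and member names are illustrative.

```cpp
#include <mutex>

// Hypothetical node type for illustration.
struct Node {
    int data;
    Node* next;
};

class LockedStack {
public:
    void push(Node* node) {
        std::lock_guard<std::mutex> guard(lock_);  // get lock
        Node* old_top = top_;                      // step 1
        node->next = old_top;                      // step 2
        top_ = node;                               // step 3
    }                                              // lock released when guard goes out of scope

    Node* top() {
        std::lock_guard<std::mutex> guard(lock_);
        return top_;
    }

private:
    std::mutex lock_;
    Node* top_ = nullptr;
};
```

Here `std::lock_guard` covers both "Get Lock" and "Release Lock", releasing automatically at the end of `push`. Every thread that calls `push` waits its turn on `lock_`, which is the single-lane merge from the analogy.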



Lock-free

New method

• Lock-free data structure.

• A special low-level instruction checks that TOP is unchanged and updates it in one atomic step, so the three steps take effect as one.

• Removes the need for locks.

• Called a Compare-Exchange.
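In C++, a Compare-Exchange is available through `std::atomic`. The sketch below shows only its behaviour on a plain integer; the wrapper name `try_update` is an assumption made for illustration.

```cpp
#include <atomic>

// Stores `desired` and returns true only if `value` still equals `expected`.
// Otherwise `value` is left alone, the value actually found is written
// into `expected`, and the function returns false.
bool try_update(std::atomic<int>& value, int& expected, int desired) {
    return value.compare_exchange_strong(expected, desired);
}
```

If another thread changed `value` in the meantime, the call fails and reports what it found, so the caller can retry with up-to-date information instead of overwriting the other thread's work.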



Lock-free

• Downside: Writing lock-free code is difficult (hence the project).

• The Compare-Exchange operation forms the base for writing lock-free code.

• The project implements specifications taken from research papers.
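A lock-free push built on Compare-Exchange could be sketched as follows. This is a common textbook pattern using `std::atomic`, not necessarily the design the project implemented, and the names are illustrative. Pop is left out because it needs extra care (for example, the ABA problem).

```cpp
#include <atomic>

// Hypothetical node type for illustration.
struct Node {
    int data;
    Node* next;
};

class LockFreeStack {
public:
    void push(Node* node) {
        // 1) read the current TOP
        Node* expected = top_.load(std::memory_order_relaxed);
        do {
            // 2) link the new node in front of the TOP we last saw
            node->next = expected;
            // 3) update TOP only if it is still the one we saw; on failure
            //    `expected` is refreshed with the current TOP and we retry
        } while (!top_.compare_exchange_weak(expected, node,
                                             std::memory_order_release,
                                             std::memory_order_relaxed));
    }

    Node* top() const {
        return top_.load(std::memory_order_acquire);
    }

private:
    std::atomic<Node*> top_{nullptr};
};
```

No thread ever waits for a lock. A thread that loses the race simply repeats steps 2 and 3 against the new TOP.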



Lock-free

Implemented a range of lock-free optimizations for the stack.

Open coding standards (C++, OpenMP).

Benchmarked on an Intel Xeon Phi 61-core processor.

The lock-free structure performed about 2x better for pure stack operations.
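The sketch below shows how many threads might push onto one shared stack using C++ and OpenMP. It is not the project's benchmark and measures nothing; it only checks that no node is lost. It assumes the compiler allows `std::atomic` to be used from OpenMP threads, which common toolchains do. If OpenMP is not enabled at compile time, the pragma is ignored and the loop runs on one thread. All names are illustrative.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Hypothetical node type for illustration.
struct Node {
    int data;
    Node* next;
};

// Pushes every node in `nodes` onto one shared lock-free stack,
// then counts how many nodes are reachable from TOP.
std::size_t push_in_parallel(std::vector<Node>& nodes) {
    std::atomic<Node*> top{nullptr};
    const long total = static_cast<long>(nodes.size());

    #pragma omp parallel for
    for (long i = 0; i < total; ++i) {
        Node* node = &nodes[static_cast<std::size_t>(i)];
        Node* expected = top.load(std::memory_order_relaxed);
        do {
            node->next = expected;  // link in front of the TOP last seen
        } while (!top.compare_exchange_weak(expected, node,
                                            std::memory_order_release,
                                            std::memory_order_relaxed));
    }

    // After the loop, walk the stack on one thread and count the nodes.
    std::size_t count = 0;
    for (Node* node = top.load(std::memory_order_acquire); node != nullptr;
         node = node->next) {
        ++count;
    }
    return count;
}
```

If any push were lost, as in the unprotected version, the returned count would be smaller than `nodes.size()`.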



Summary

Amdahl’s Law shows that it’s important to optimize sequential sections of code.

The shared data structures are often sequential bottlenecks.

Implementing lock-free data structures reduced this bottleneck.
