Introduction to Cache-Oblivious Algorithms

Post on 02-Jul-2015


DESCRIPTION

In today's world, developers face the problem of writing high-performance algorithms that scale efficiently across a range of multi-core processors. Traditional blocked algorithms must be tuned to each processor, but cache-oblivious algorithms give developers new tools to tackle this challenge. In this talk you will learn about the external memory model, the cache-oblivious model, and how to use these tools to create faster, more scalable algorithms.

TRANSCRIPT

Introduction To Cache-Oblivious Algorithms
by Christopher Gilbert

http://www.twitter.com/bigdatadev

http://www.github.com/bigdatadev

http://www.cjgilbert.me

“A cache-oblivious algorithm is not oblivious to cache memory”

(However, it is oblivious to the size of the cache)

“Cache-oblivious algorithms are effective on any system, regardless of memory hierarchy”

“Cache-oblivious algorithms do not improve computational complexity.”

(But they still improve performance)

[Diagram: a modern multi-core processor die — four cores, each with private L1 and L2 caches, a shared L3 cache, processor graphics, and a system agent with memory controller and I/O.]

[Diagram: the memory hierarchy — Register, L1 Cache, L2 Cache, L3 Cache, Memory, Disk — with latency increasing toward the bottom and capacity decreasing toward the top.]

Tall Cache Assumption

Fully Associative

Perfect Eviction Policy

MT(N) = Θ(N/B)
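As a concrete illustration of the Θ(N/B) scan bound, here is a minimal arithmetic sketch (the helper name mt_scan is mine, not from the talk): a linear scan of n contiguous elements, with b elements per cache line, touches ceil(n / b) lines.

```cpp
#include <cstddef>

// Estimated memory transfers for a linear scan of n contiguous
// elements, with b elements per cache line: ceil(n / b) lines touched.
std::size_t mt_scan(std::size_t n, std::size_t b) {
    return (n + b - 1) / b;
}
```

For example, scanning one million 16-byte items on a 64-byte cache line (b = 4) touches 250,000 lines.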

Estimating Memory Transfers

struct item_t {
    uint64_t x;
    uint64_t y;
};

bool predicate(item_t const* i1, item_t const* i2) {
    return ((i1->x + i2->y) * i2->x == (i1->y + i2->x) * i2->y);
}

size_t kernel(item_t const* begin1, item_t const* end1,
              item_t const* begin2, item_t const* end2) {
    size_t count = 0;
    for (item_t const* pos1 = begin1; pos1 != end1; pos1++) {
        for (item_t const* pos2 = begin2; pos2 != end2; pos2++) {
            if (predicate(pos1, pos2))
                count += 1;
        }
    }
    return count;
}

size_t simple_parallel(item_t const* begin, size_t count, unsigned thread_count) {
    size_t res = 0;
    #pragma omp parallel for reduction(+:res) num_threads(thread_count)
    for (int i = 0; i < (int)count; i += 1)
        res += kernel(begin + i, begin + i + 1, begin, begin + count);
    return res;
}
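For checking the parallel versions, a serial reference can be useful. This sketch (count_pairs is a hypothetical helper, not from the talk; item_t and predicate are restated so it compiles standalone) computes the same all-pairs count directly:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// item_t and predicate restated from the talk so the sketch is standalone.
struct item_t { std::uint64_t x; std::uint64_t y; };

bool predicate(item_t const* i1, item_t const* i2) {
    return (i1->x + i2->y) * i2->x == (i1->y + i2->x) * i2->y;
}

// Serial all-pairs count of the whole range against itself; this is
// exactly the work simple_parallel distributes across threads.
std::size_t count_pairs(std::vector<item_t> const& items) {
    std::size_t count = 0;
    for (item_t const& a : items)
        for (item_t const& b : items)
            if (predicate(&a, &b))
                count += 1;
    return count;
}
```

Two identical items {1, 1} satisfy the predicate in all four orderings, so count_pairs({{1, 1}, {1, 1}}) yields 4.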

Memory Transfer Estimate

MT(N) = Θ(N²/B)
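The estimate can be sanity-checked with a little arithmetic (mt_naive is my name for it, assuming neither range survives in cache between outer iterations): the inner range of n items is streamed once per outer item.

```cpp
// Transfer estimate for the naive nested loops: n passes over the
// inner range, each costing about n / b cache-line transfers, on the
// assumption that nothing is retained in cache between passes.
double mt_naive(double n, double b) {
    return n * (n / b);
}
```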

[Chart: execution time (ms) against number of threads for simple_parallel; the time axis runs from 0 to 30000 ms.]

Recursive Approach

// Uses the (legacy) TBB low-level task interface; the members and
// constructor, elided on the slide, are restated here.
class coba_task : public task {
public:
    coba_task(item_t const* begin1, size_t count1,
              item_t const* begin2, size_t count2)
        : _begin1(begin1), _count1(count1),
          _begin2(begin2), _count2(count2), _result(0) {}

    size_t result() const { return _result; }

    task* execute() {
        if (_count1 + _count2 > 256) {
            // Split both ranges in half, yielding four quadrant sub-tasks.
            coba_task& a = *new(allocate_child()) coba_task(
                _begin1, _count1 / 2,
                _begin2, _count2 / 2);
            coba_task& b = *new(allocate_child()) coba_task(
                _begin1, _count1 / 2,
                _begin2 + _count2 / 2, _count2 - _count2 / 2);
            coba_task& c = *new(allocate_child()) coba_task(
                _begin1 + _count1 / 2, _count1 - _count1 / 2,
                _begin2 + _count2 / 2, _count2 - _count2 / 2);
            coba_task& d = *new(allocate_child()) coba_task(
                _begin1 + _count1 / 2, _count1 - _count1 / 2,
                _begin2, _count2 / 2);
            set_ref_count(5);
            spawn(b);
            spawn(c);
            spawn(d);
            spawn_and_wait_for_all(a);
            _result = a.result() + b.result() + c.result() + d.result();
        } else {
            _result = kernel(_begin1, _begin1 + _count1,
                             _begin2, _begin2 + _count2);
        }
        return NULL;
    }

private:
    item_t const* _begin1;
    size_t _count1;
    item_t const* _begin2;
    size_t _count2;
    size_t _result;
};

// Note: thread_count is unused here; the TBB scheduler manages threads.
size_t recursive_parallel(item_t const* begin, size_t count, unsigned thread_count) {
    // The root task must span the full range against itself to count the
    // same pairs as simple_parallel (the slide's root covered only the
    // first half against the second half).
    coba_task& a = *new(task::allocate_root()) coba_task(
        begin, count, begin, count);
    task::spawn_root_and_wait(a);
    return a.result();
}
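The same four-way split can be sketched without TBB. Here coba_count is my serial rendering of coba_task's logic (same 256-item cutoff), which makes the recursion easy to test in isolation:

```cpp
#include <cstddef>
#include <cstdint>

struct item_t { std::uint64_t x; std::uint64_t y; };

bool predicate(item_t const* i1, item_t const* i2) {
    return (i1->x + i2->y) * i2->x == (i1->y + i2->x) * i2->y;
}

std::size_t kernel(item_t const* b1, item_t const* e1,
                   item_t const* b2, item_t const* e2) {
    std::size_t count = 0;
    for (item_t const* p1 = b1; p1 != e1; ++p1)
        for (item_t const* p2 = b2; p2 != e2; ++p2)
            if (predicate(p1, p2))
                ++count;
    return count;
}

// Serial version of the 4-way recursive split: once a sub-problem is
// small enough to fit in some level of cache, fall back to the plain
// kernel. The four quadrants partition the full cross product.
std::size_t coba_count(item_t const* b1, std::size_t n1,
                       item_t const* b2, std::size_t n2) {
    if (n1 + n2 <= 256)
        return kernel(b1, b1 + n1, b2, b2 + n2);
    std::size_t h1 = n1 / 2, h2 = n2 / 2;
    return coba_count(b1,      h1,      b2,      h2)
         + coba_count(b1,      h1,      b2 + h2, n2 - h2)
         + coba_count(b1 + h1, n1 - h1, b2 + h2, n2 - h2)
         + coba_count(b1 + h1, n1 - h1, b2,      h2);
}
```

Because the quadrants are disjoint and cover every pair, the result matches the naive nested loops regardless of where the recursion bottoms out.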

Revised Memory Transfer Estimate

MT(N) = Θ(N²/CB)
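A back-of-envelope version of the revised bound (mt_blocked is mine; C is read here as the cache capacity in items, which the slide does not spell out): once the recursion bottoms out in sub-problems of c items per side that fit in cache, there are (n/c)² of them, and each streams two runs of c contiguous items.

```cpp
// Transfer estimate for the recursive version: (n/c)^2 cache-sized
// sub-problems, each loading two runs of c contiguous items, i.e.
// 2c / b line transfers -- about 2 n^2 / (c b) in total.
double mt_blocked(double n, double c, double b) {
    return (n / c) * (n / c) * (2.0 * c / b);
}
```

Compared with the naive n²/b estimate for the same n, this is a factor of roughly c/2 fewer transfers, which is where the speedup comes from despite identical operation counts.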

[Chart: execution time (ms) against number of threads for recursive_parallel; the time axis runs from 0 to 30000 ms.]

Exploit both spatial and temporal locality.

Use recursion.

Optimise your memory transfers.
