Introduction to Cache-Oblivious Algorithms
DESCRIPTION
In today's world, developers face the problem of writing high-performing algorithms that scale efficiently across a range of multi-core processors. Traditional blocked algorithms need to be tuned to each processor, but the discovery of cache-oblivious algorithms gives developers new tools to tackle this emerging challenge. In this talk you will learn about the external memory model, the cache-oblivious model, and how to use these tools to create faster, scalable algorithms.

TRANSCRIPT
Introduction To Cache-Oblivious Algorithms
by Christopher Gilbert
http://www.twitter.com/bigdatadev
http://www.github.com/bigdatadev
http://www.cjgilbert.me
“A cache-oblivious algorithm is not oblivious to cache memory”
(However, it is oblivious to the size of the cache)
“Cache-oblivious algorithms are effective on any system, regardless
of memory hierarchy”
“Cache-oblivious algorithms do not improve complexity.”
(But they still improve performance)
[Slide: processor die diagram — four cores, each with private L1 and L2 caches, a shared L3 cache, processor graphics, and a system agent & memory controller with I/O]
[Slide: memory hierarchy — Register → L1 Cache → L2 Cache → L3 Cache → Memory → Disk; latency increases down the hierarchy, capacity decreases up it]
Tall Cache Assumption
Fully Associative
Perfect Eviction Policy
MT(N) = Θ(N/B)
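For intuition (a minimal sketch with hypothetical numbers, not from the talk): a linear scan of N items arrives in blocks of B items each, so the transfer count is just the ceiling of N/B.

```cpp
#include <cstddef>

// Scan bound MT(N) = Θ(N/B): one streaming pass over n items costs
// ceil(n / b) block transfers when each block (cache line) holds b items.
size_t scan_transfers(size_t n, size_t b) {
    return (n + b - 1) / b;
}
```

With 64-byte cache lines holding four 16-byte records, scanning one million items costs scan_transfers(1000000, 4) = 250,000 transfers — far fewer than the one million word-at-a-time accesses a naive RAM-model count would suggest.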
Estimating Memory Transfers
struct item_t {
    uint64_t x;
    uint64_t y;
};

bool predicate(item_t const* i1, item_t const* i2) {
    return ((i1->x + i2->y) * i2->x == (i1->y + i2->x) * i2->y);
}

// Count the pairs across the two ranges that satisfy the predicate.
size_t kernel(item_t const* begin1, item_t const* end1,
              item_t const* begin2, item_t const* end2) {
    size_t count = 0;
    for (item_t const* pos1 = begin1; pos1 != end1; pos1++) {
        for (item_t const* pos2 = begin2; pos2 != end2; pos2++) {
            if (predicate(pos1, pos2))
                count += 1;
        }
    }
    return count;
}

// Naive parallelisation: each iteration compares one item against the
// whole array, so every iteration re-streams all N items through cache.
size_t simple_parallel(item_t const* begin, size_t count, unsigned thread_count) {
    size_t res = 0;
    #pragma omp parallel for reduction(+:res) num_threads(thread_count)
    for (int i = 0; i < (int)count; i += 1)
        res += kernel(begin + i, begin + i + 1, begin, begin + count);
    return res;
}
Memory Transfer Estimate
MT(N) = Θ(N²/B)
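The estimate follows because each of the N outer iterations re-streams all N items: N · Θ(N/B) = Θ(N²/B) transfers once the array no longer fits in cache. A quick sketch (hypothetical helper, not from the talk):

```cpp
#include <cstddef>

// Naive all-pairs transfer estimate: each of n outer items re-streams
// all n items in blocks of b, so roughly n * ceil(n / b) transfers.
size_t naive_pair_transfers(size_t n, size_t b) {
    return n * ((n + b - 1) / b);
}
```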
[Plot: running time (ms, 0–30000) of simple_parallel vs. number of threads]
Recursive Approach
class coba_task : public task {
public:
    coba_task(item_t const* begin1, size_t count1,
              item_t const* begin2, size_t count2)
        : _begin1(begin1), _count1(count1),
          _begin2(begin2), _count2(count2), _result(0) {}

    size_t result() const { return _result; }

    task* execute() {
        if (_count1 + _count2 > 256) {
            // Split both ranges in half and recurse on the four quadrants.
            coba_task& a = *new(allocate_child()) coba_task(
                _begin1, _count1 / 2,
                _begin2, _count2 / 2);
            coba_task& b = *new(allocate_child()) coba_task(
                _begin1, _count1 / 2,
                _begin2 + _count2 / 2, _count2 - _count2 / 2);
            coba_task& c = *new(allocate_child()) coba_task(
                _begin1 + _count1 / 2, _count1 - _count1 / 2,
                _begin2 + _count2 / 2, _count2 - _count2 / 2);
            coba_task& d = *new(allocate_child()) coba_task(
                _begin1 + _count1 / 2, _count1 - _count1 / 2,
                _begin2, _count2 / 2);
            set_ref_count(5);
            spawn(b);
            spawn(c);
            spawn(d);
            spawn_and_wait_for_all(a);
            _result = a.result() + b.result() + c.result() + d.result();
        } else {
            // Base case: small enough to fit in cache; run the kernel directly.
            _result = kernel(_begin1, _begin1 + _count1,
                             _begin2, _begin2 + _count2);
        }
        return NULL;
    }

private:
    item_t const* _begin1;
    size_t _count1;
    item_t const* _begin2;
    size_t _count2;
    size_t _result;
};
size_t recursive_parallel(item_t const* begin, size_t count, unsigned thread_count) {
    // The root task covers the full N x N pairing; the recursion inside
    // coba_task performs the quadrant splits.
    coba_task& a = *new(task::allocate_root()) coba_task(
        begin, count, begin, count);
    task::spawn_root_and_wait(a);
    return a.result();
}
Revised Memory Transfer Estimate
MT(N) = Θ(N²/CB)
[Plot: running time (ms, 0–30000) of recursive_parallel vs. number of threads]
Exploit both spatial and temporal locality.
Use recursion.
Optimise your memory transfers.