Post on 14-Jan-2016

By: A. LaMarca & R. Ladner

Presenter: Shai Brandes

The Influence of Caches on the Performance of Sorting

Introduction

Sorting is one of the most important operations performed by computers.

In the days of magnetic tape storage, before modern databases, sorting was almost certainly the most common operation performed by computers, since most "database" updating was done by sorting transactions and merging them with a master file.

Introduction cont.

Since the introduction of caches, main memory has continued to grow slower relative to processor cycle times.

The time to service a cache miss grew to 100 cycles and more.

Cache miss penalties have grown to the point where good overall performance cannot be achieved without good cache performance.

Introduction cont.

In the article, the authors investigate, both experimentally and analytically, the potential performance gains that cache-conscious design offers for several sorting algorithms.

Introduction cont.

For each algorithm, an implementation variant with potential for good overall performance was chosen.

Then the algorithm was optimized using traditional techniques to minimize the number of instructions executed.

This optimized algorithm forms the baseline for comparison.

Memory optimizations were then applied to the baseline comparison sort in order to improve cache performance.

Performance measures

The authors concentrate on three performance measures:

1. Instruction count

2. Cache misses

3. Overall performance

The analyses presented here are only approximations, since cache misses cannot be analyzed precisely due to factors such as variation in process scheduling and the operating system's virtual-to-physical page mapping policy.

Main lesson

The main lesson from the article is that, because cache miss penalties are growing larger with each new generation of processors, improving an algorithm's overall performance requires increasing the number of instructions executed while, at the same time, reducing the number of cache misses.

Design parameters of caches

Capacity – the total number of blocks the cache can hold.

Block size – the number of bytes that are loaded from and written to memory at a time.

Associativity – in an N-way set-associative cache, a particular block can be loaded in N different cache locations.

Replacement policy – which block is removed from the cache when a new block is loaded.

Which cache are we investigating?

In modern machines, more than one cache is placed between the main memory and the processor.

[Figure: several caches sit between the processor and main memory; the organizations shown are direct-mapped, N-way associative, and fully associative.]

Which cache are we investigating?

The largest miss penalty is typically incurred at the cache closest to main memory, which is usually direct-mapped.

Thus, we will focus on improving the performance of direct-mapped caches.

Improve the cache hit ratio

Temporal locality – there is a good chance that a data item that has been accessed will be accessed again in the near future.

Spatial locality – there is a good chance that subsequently accessed data items are located near each other in memory.

Cache misses

Compulsory miss – occurs when a block is first accessed and loaded into the cache.

Capacity miss – caused by the fact that the cache is not large enough to hold all the accessed blocks at one time.

Conflict miss – occurs when two or more blocks that are mapped to the same location in the cache are accessed.

Measurements

n – the number of keys to be sorted

C – the number of blocks in the cache

B – the number of keys that fit in a cache block

[Figure: a cache block holding B keys.]
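
To make the parameters C and B concrete, here is a minimal sketch of a direct-mapped cache that counts misses. The class name `DirectMappedCache`, the helper `sequential_scan_misses`, and the idea of addressing memory by key index are illustrative assumptions, not something taken from the article.

```python
class DirectMappedCache:
    """Toy direct-mapped cache with C blocks of B keys each (hypothetical model)."""

    def __init__(self, C, B):
        self.C = C
        self.B = B
        self.tags = [None] * C  # memory block currently held by each cache block
        self.misses = 0

    def access(self, key_index):
        block = key_index // self.B  # memory block that holds this key
        slot = block % self.C        # direct-mapped: exactly one possible cache block
        if self.tags[slot] != block:
            # Compulsory, capacity, or conflict miss: load the block.
            self.misses += 1
            self.tags[slot] = block


def sequential_scan_misses(n, C, B):
    """Misses incurred by one sequential pass over n keys starting at index 0."""
    cache = DirectMappedCache(C, B)
    for i in range(n):
        cache.access(i)
    return cache.misses
```

A sequential pass costs one miss per block, so `sequential_scan_misses(64, 8, 4)` counts 16 misses. Scanning the same n keys twice only reuses cached blocks when n is at most BC.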

Mergesort

Two sorted lists can be merged into a single list by repeatedly adding the smaller key to a single sorted list:

Example: merging 1 3 6 and 2 4 5 gives 1 2 3 4 5 6.
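
The merge step can be sketched as follows. The function name `merge` and the use of Python lists are illustrative choices, not the authors' implementation.

```python
def merge(left, right):
    """Merge two sorted lists into a single sorted list."""
    merged = []
    i = j = 0
    # Repeatedly move the smaller head key to the output list.
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # One list is exhausted; append the rest of the other.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged
```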

Mergesort

By treating a set of unordered keys as a set of sorted lists of length 1, the keys can be repeatedly merged together until a single sorted set of keys remains.

The iterative mergesort was chosen as the base algorithm.

Mergesort base algorithm

Mergesort makes $\lceil \log_2 n \rceil$ passes over the array, where the $i$-th pass merges sorted subarrays of length $2^{i-1}$ into sorted subarrays of length $2^i$.

Example: 1 4 8 3 → (i=1) 1 4 3 8 → (i=2) 1 3 4 8
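
A minimal sketch of the iterative (bottom-up) mergesort described above. The function name `iterative_mergesort` is hypothetical, and the code swaps the roles of two arrays after every pass instead of copying back. None of the instruction-count optimizations of the baseline are applied here.

```python
def iterative_mergesort(keys):
    """Bottom-up mergesort: each pass merges sorted runs of length `width`
    into sorted runs of length 2 * width."""
    src = list(keys)
    n = len(src)
    dst = [None] * n
    width = 1
    while width < n:
        for lo in range(0, n, 2 * width):
            mid = min(lo + width, n)
            hi = min(lo + 2 * width, n)
            i, j, k = lo, mid, lo
            while i < mid and j < hi:
                if src[i] <= src[j]:
                    dst[k] = src[i]
                    i += 1
                else:
                    dst[k] = src[j]
                    j += 1
                k += 1
            # Copy whatever remains of the run that was not exhausted.
            dst[k:hi] = src[i:mid] + src[j:hi]
        src, dst = dst, src  # the destination becomes the source of the next pass
        width *= 2
    return src
```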

Improvements to the base algorithm

1. Alternating the merging process from one array to another to avoid unnecessary copying.

2. Loop unrolling

3. Sorting subarrays of size 4 with a fast in-line sorting method.

Thus, the number of passes is $\lceil \log_2(n/4) \rceil$.

If this number is odd, then an additional copy pass is needed to return the sorted array to the input array.

The problem with the algorithm

The base mergesort has the potential for terrible cache performance: if a pass is large enough to wrap around in the cache, keys will be ejected before they are used again.

n ≤ BC/2 → the entire sort is performed in the cache; only compulsory misses occur.

BC/2 < n ≤ BC → temporal reuse drops off sharply.

BC < n → no temporal reuse.

In each pass:

1. The block is accessed in the input array (read or write).

2. The block is accessed in the auxiliary array (write or read).

→ 2 cache misses per block

→ 2/B cache misses per key

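[Figure: mergesort of the input array 1 4 3 2 (n keys) with an auxiliary array, for n ≤ BC/2. Pass i=1 produces 1 4 2 3 and pass i=2 produces 1 2 3 4. In the first pass each block read and each block written costs 1 miss, 4 cache misses in total; in the second pass every block is already in the cache, so there are no cache misses.]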

Mergesort analysis

For n ≤ BC/2 → 2/B misses per key.

The entire sort is performed in the cache, so only compulsory misses occur.

Base Mergesort analysis cont.

For BC/2 < n, the number of misses per key is:

$$\frac{2}{B}\left\lceil \log_2\frac{n}{4} \right\rceil + \frac{1}{B} + \frac{2}{B}\left(\left\lceil \log_2\frac{n}{4} \right\rceil \bmod 2\right)$$

First term: $\lceil \log_2(n/4) \rceil$ is the number of merge passes. In each pass, each key is moved from a source array to a destination array. Every B-th key visited in the source array results in one cache miss, and every B-th key written to the destination array results in one cache miss.

Second term: the initial pass that sorts groups of 4 keys incurs 1 compulsory miss per block, thus 1/B misses per key.

Third term: if the number of passes is odd, the sorted array must be copied back to the input array.
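
The base mergesort estimate can be evaluated directly. The sketch below simply transcribes the formula from this slide; the function name `base_mergesort_misses_per_key` is hypothetical, and the parameter values in the usage note are illustrative assumptions, not measurements from the article.

```python
import math


def base_mergesort_misses_per_key(n, B, C):
    """Approximate cache misses per key for the base mergesort.

    n: number of keys, B: keys per cache block, C: blocks in the cache.
    """
    if n <= B * C / 2:
        return 2 / B  # the whole sort fits in the cache: compulsory misses only
    passes = math.ceil(math.log2(n / 4))  # merge passes after the in-line sort of 4 keys
    return (2 / B) * passes + 1 / B + (2 / B) * (passes % 2)
```

For example, with the assumed values B = 4 and C = 2**16, a set of n = 2**22 keys needs 20 merge passes and the estimate is 10.25 misses per key.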

1st memory optimization: tiled mergesort

Improve temporal locality:

Phase 1 - subarrays of length BC/2 are sorted using mergesort.

Phase 2 - the arrays are returned to the base mergesort to complete the sorting of the entire array.

To avoid the final copy when $\lceil \log_2(n/4) \rceil$ is odd, subarrays of size 2 are sorted in-line if $\log_2 n$ is odd.

tiled-mergesort example

Input: 1 14 2 8 12 7 5 9 16 11 15 3 6 10 13 4

Phase 1 - mergesort every BC/2 keys (here 4 keys): [1 2 8 14] [5 7 9 12] [3 11 15 16] [4 6 10 13]

Phase 2 - regular mergesort: [1 2 5 7 8 9 12 14] [3 4 6 10 11 13 15 16], then 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
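
The two phases of the 16-key example can be sketched as follows. The names `merge_passes` and `tiled_mergesort` and the parameter `tile` (standing in for BC/2) are assumptions made for illustration. The sketch does not include the trick that avoids the final copy pass.

```python
def merge_passes(keys, width):
    """Run bottom-up merge passes, assuming runs of length `width` are already sorted."""
    src = list(keys)
    n = len(src)
    dst = [None] * n
    while width < n:
        for lo in range(0, n, 2 * width):
            mid, hi = min(lo + width, n), min(lo + 2 * width, n)
            i, j, k = lo, mid, lo
            while i < mid and j < hi:
                if src[i] <= src[j]:
                    dst[k] = src[i]
                    i += 1
                else:
                    dst[k] = src[j]
                    j += 1
                k += 1
            dst[k:hi] = src[i:mid] + src[j:hi]  # leftover keys of the unfinished run
        src, dst = dst, src
        width *= 2
    return src


def tiled_mergesort(keys, tile):
    """Phase 1 sorts each tile of `tile` keys; phase 2 merges the sorted tiles."""
    a = list(keys)
    for lo in range(0, len(a), tile):
        a[lo:lo + tile] = merge_passes(a[lo:lo + tile], 1)  # phase 1: tile fits in cache
    return merge_passes(a, tile)                            # phase 2: regular merge passes
```

Calling `tiled_mergesort` on the 16 input keys above with `tile=4` goes through the same intermediate runs as the example.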

Tiled Mergesort analysis

For n ≤ BC/2 → 2/B misses per key.

The entire sort is performed in the cache, so only compulsory misses occur.

Tiled Mergesort analysis cont.

For BC/2 < n, the number of misses per key is:

$$\frac{2}{B}\left\lceil \log_2\frac{2n}{BC} \right\rceil + \frac{2}{B} + 0$$

First term: $\lceil \log_2(2n/(BC)) \rceil$ is the number of merge passes in phase 2. Each pass is large enough to wrap around in the cache, so keys are ejected before they are used again: 2 misses per block, thus 2/B misses per key in each pass.

Second term: the initial pass mergesorts groups of BC/2 keys. Each merge is done in the cache with 2 compulsory misses per block, thus 2/B misses per key.

Third term: the number of iterations is forced to be even, so there is no need to copy the sorted array back to the input array.

Tiled mergesort cont.

The problem:

In phase 2 – no reuse is achieved across passes since the set size is larger than the cache.

The solution: multi-mergesort

2nd memory optimization: multi-mergesort

We replace the final $\lceil \log_2(n/(BC/2)) \rceil$ merge passes of tiled mergesort with a single pass that merges all the subarrays at once.

The last pass uses a memory-optimized heap which holds the heads of the subarrays.

The number of misses per key due to the use of the heap is negligible for practical values of n, B and C.

multi-mergesort example

Input: 1 14 2 8 12 7 5 9 16 11 15 3 6 10 13 4

Phase 1 - mergesort every BC/2 keys (here 4 keys): [1 2 8 14] [5 7 9 12] [3 11 15 16] [4 6 10 13]

Phase 2 - multi-mergesort all $\lceil n/(BC/2) \rceil$ subarrays at once: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
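
A minimal sketch of the single multi-way merge pass, using Python's standard `heapq.merge`, which keeps a heap of the heads of the sorted runs. This is a stand-in for the memory-optimized heap mentioned in the slides, not the authors' data structure. The function name `multi_mergesort` and the parameter `tile` (standing in for BC/2) are assumptions, and the built-in `sorted` replaces the in-cache mergesort of phase 1.

```python
import heapq


def multi_mergesort(keys, tile):
    """Phase 1 sorts tiles of `tile` keys; phase 2 merges all tiles in one pass."""
    # Phase 1: sort every tile (built-in sort used as a stand-in).
    runs = [sorted(keys[lo:lo + tile]) for lo in range(0, len(keys), tile)]
    # Phase 2: one k-way merge driven by a heap of run heads.
    return list(heapq.merge(*runs))
```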

Multi Mergesort analysis

For n ≤ BC/2 → 2/B misses per key.

The entire sort is performed in the cache, so only compulsory misses occur.

Multi Mergesort analysis cont.

For BC/2 < n, the number of misses per key is:

$$\frac{2}{B} + \frac{2}{B}$$

First term: the initial pass mergesorts groups of BC/2 keys. Each merge is done in the cache with 2 compulsory misses per block. The number of iterations is forced to be odd, so that the next phase multi-merges keys from the auxiliary array to the input array.

Second term: a single pass merges all the $\lceil n/(BC/2) \rceil$ subarrays at once, with 2 compulsory misses per block, thus 2/B misses per key.
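
As a worked comparison of the three mergesort estimates, assume purely for illustration B = 4 keys per block, C = 2^16 blocks and n = 2^22 keys, so that n/4 = 2^20 and 2n/(BC) = 2^5. These values are assumptions, not measurements from the article.

```latex
\begin{align*}
\text{base:}  &\quad \frac{2}{B}\left\lceil\log_2\frac{n}{4}\right\rceil + \frac{1}{B}
                + \frac{2}{B}\left(\left\lceil\log_2\frac{n}{4}\right\rceil \bmod 2\right)
                = \frac{2}{4}\cdot 20 + \frac{1}{4} + \frac{2}{4}\cdot 0 = 10.25 \\
\text{tiled:} &\quad \frac{2}{B}\left\lceil\log_2\frac{2n}{BC}\right\rceil + \frac{2}{B}
                = \frac{2}{4}\cdot 5 + \frac{2}{4} = 3 \\
\text{multi:} &\quad \frac{2}{B} + \frac{2}{B} = \frac{4}{4} = 1
\end{align*}
```

Under these assumed values, only the multi-mergesort estimate is independent of n.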

Performance

[Figure: instructions per key versus set size in keys (10,000 to 100,000, with the cache size marked) for base, tiled, and multi mergesort. Annotation: "Multi-merge all subarrays in a single pass".]

Performance cont.

[Figure: cache misses per key (0 to 2) versus set size in keys (10,000 to 100,000, with the cache size marked) for base, tiled, and multi mergesort. Annotations: "Increase in cache misses: set size is larger than cache"; "constant number of cache misses per key!"; "66% fewer misses than the base".]

Performance cont.

[Figure: time (cycles per key, 0 to 200) versus set size in keys (10,000 to 100,000, with the cache size marked) for base, tiled, and multi mergesort. Annotations: "Worst performance due to the large number of cache misses"; "Executes up to 55% faster than Base"; "Due to increase in instruction count".]

Quicksort - a divide-and-conquer algorithm.

1. Choose a pivot.

2. Partition the set around the pivot.

3. Quicksort the left region and quicksort the right region.

At the end of the pass the pivot is in its final position.

[Figure: partitioning example on an array of the keys 1 to 7.]
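
A minimal sketch of these steps. The names `partition` and `quicksort` are illustrative, and taking the last key as the pivot with a single left-to-right scan is an assumption made for brevity, not the pivot rule or partitioning scheme of the article's baseline.

```python
def partition(a, lo, hi):
    """Partition a[lo..hi] around the pivot a[hi]; return the pivot's final index."""
    pivot = a[hi]  # assumption: last key is the pivot
    i = lo
    for j in range(lo, hi):
        if a[j] < pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    a[i], a[hi] = a[hi], a[i]  # the pivot is now in its final position
    return i


def quicksort(a, lo=0, hi=None):
    """Sort the list `a` in place."""
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:
        return
    p = partition(a, lo, hi)
    quicksort(a, lo, p - 1)  # left region
    quicksort(a, p + 1, hi)  # right region
```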

Quicksort base algorithm

An implementation of the optimized quicksort developed by Sedgewick:

Rather than sorting small subsets in the natural course of the quicksort recursion, they are left unsorted until the very end, at which time they are sorted using insertion sort in a single final pass over the entire array.

Insertion sort

Sort by repeatedly taking the next item and inserting it into the final data structure in its proper order with respect to items already inserted.

Example: in 1 3 4 2, the next item 2 is inserted between 1 and 3, giving 1 2 3 4.
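
A minimal in-place sketch of insertion sort. The function name `insertion_sort` is an illustrative choice.

```python
def insertion_sort(a):
    """Sort the list `a` in place by inserting each item into the sorted prefix."""
    for k in range(1, len(a)):
        key = a[k]
        j = k - 1
        # Shift larger items of the sorted prefix one place to the right.
        while j >= 0 and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key
    return a
```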

Quicksort base algorithm cont.

Quicksort makes sequential passes → all keys in a block are always used → excellent spatial locality.

Divide and conquer → if a subarray is small enough to fit in the cache, quicksort will incur at most 1 cache miss per block before the subset is fully sorted → excellent temporal locality.

1st memory optimization: memory tuned quicksort

Remove Sedgewick's insertion sort in the final pass.

Instead, sort small subarrays when they are first encountered using insertion sort.

Motivation:

When a small subarray is encountered, it has just been part of a recent partition → all of its keys should be in the cache.
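
The idea can be sketched as a quicksort that insertion-sorts small subarrays as soon as it meets them. The names `insertion_sort_range` and `memory_tuned_quicksort`, the `cutoff` value of 8, and the last-key pivot are assumptions made for illustration, not values from the article.

```python
def insertion_sort_range(a, lo, hi):
    """Insertion-sort the slice a[lo..hi] in place."""
    for k in range(lo + 1, hi + 1):
        key = a[k]
        j = k - 1
        while j >= lo and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key


def memory_tuned_quicksort(a, lo=0, hi=None, cutoff=8):
    """Quicksort that sorts small subarrays immediately instead of in a final pass."""
    if hi is None:
        hi = len(a) - 1
    if hi - lo + 1 <= cutoff:
        # The subarray was just touched by a partition, so it is likely still cached.
        insertion_sort_range(a, lo, hi)
        return
    pivot = a[hi]  # assumption: last key as pivot
    i = lo
    for j in range(lo, hi):
        if a[j] < pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    a[i], a[hi] = a[hi], a[i]
    memory_tuned_quicksort(a, lo, i - 1, cutoff)
    memory_tuned_quicksort(a, i + 1, hi, cutoff)
```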

2nd memory optimization: multi quicksort

n ≤ BC → 1 cache miss per block.

Problem: larger sets of keys incur a substantial number of misses.

Solution: a single multi-partition pass is performed, which divides the full set into a number of subsets that are likely to be cache sized or smaller.

Feller

If k points are placed randomly in a range of length 1:

$$P(\text{subrange}_i \ge X) = (1 - X)^k$$

multi quicksort cont.

Multi-partition the array into 3n/(BC) pieces.

→ (3n/(BC)) − 1 pivots.

→ By Feller's result, $P(\text{subset}_i \ge BC) = (1 - BC/n)^{(3n/(BC)) - 1}$

$$\lim_{n \to \infty}\left(1 - \frac{BC}{n}\right)^{\frac{3n}{BC} - 1} = e^{-3}$$

→ The percentage of subsets that are larger than the cache is less than 5%.
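
The bound can be checked numerically. The function name `prob_subset_exceeds_cache` and the sample values used in the usage note are illustrative assumptions.

```python
import math


def prob_subset_exceeds_cache(n, B, C):
    """Probability that a given subset holds at least BC keys when the array is
    multi-partitioned into 3n/(BC) pieces (Feller's formula with X = BC/n)."""
    pieces = 3 * n / (B * C)
    return (1 - B * C / n) ** (pieces - 1)
```

With the assumed values B = 4, C = 2**16 and n = 2**26, the result stays below 0.05 and is close to `math.exp(-3)`, which is about 0.0498.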

Memory tuned quicksort analysis

We analyze the algorithm in two parts:

1. Assumption: partitioning an array of size m costs m/B misses if m > BC and 0 misses if m ≤ BC.

2. Correct the assumption: estimate the undercounted and over-counted cache misses.

Memory tuned quicksort analysis cont.

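M(n) = the expected number of misses:

$$M(n) = \begin{cases} 0 & n \le BC \\ \dfrac{n}{B} + \dfrac{1}{n}\sum_{0 \le i \le n-1}\left[M(i) + M(n-i-1)\right] & \text{otherwise} \end{cases}$$

n/B: by assumption, partitioning an array of size n > BC costs n/B misses.

1/n: there are n places to locate the pivot, so P(pivot is in the i-th place) = 1/n.

M(i): the number of misses in the left region.

M(n-i-1): the number of misses in the right region.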

Memory tuned quicksort analysis cont.

The recurrence solves to:

$$M(n) = \begin{cases} 0 & n \le BC \\ \dfrac{2(n+1)}{B}\ln\dfrac{n+1}{BC+2} + O\!\left(\dfrac{1}{n}\right) & \text{otherwise} \end{cases}$$

Memory tuned quicksort analysis cont.: first correction

We undercounted the misses incurred when a subproblem first reaches size ≤ BC.

We counted them as 0, but this subproblem may have no part in the cache!

We add n/B more misses, since there are approximately n keys in all the subproblems that first reach size ≤ BC.

Memory tuned quicksort analysis cont.: second correction

The very first partitioning incurs n/B misses, but the subsequent partitionings do not!

At the end of a partitioning, some of the array in the LEFT subproblem is still in the cache.

→ There are hits that we counted as misses.

Note: by the time the algorithm reaches the right subproblem, its data has been removed from the cache.

Memory tuned quicksort analysis Second correction cont.

The expected number of subproblems of size > BC:

$$N(n) = \begin{cases} 0 & n \le BC \\ 1 + \dfrac{1}{n}\sum_{0 \le i \le n-1}\left[N(i) + N(n-i-1)\right] & \text{otherwise} \end{cases}$$

1: since n > BC, this array itself is a subproblem larger than the cache.

1/n: there are n places to locate the pivot, so P(pivot is in the i-th place) = 1/n.

N(i) and N(n-i-1): the number of subproblems of size > BC in the left and right regions.

Memory tuned quicksort analysis Second correction cont.

The recurrence solves to:

$$N(n) = \begin{cases} 0 & n \le BC \\ \dfrac{n+1}{BC+2} - 1 & \text{otherwise} \end{cases}$$

Memory tuned quicksort analysis Second correction cont.

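In each subproblem of size n > BC:

[Figure: the array being partitioned around the pivot, with index L moving right and index R moving left, next to the cache, which still holds part of the left subproblem.]

On average, BC/2 keys are in the cache (half of the cache).

R progresses left → it can access these cached blocks (blue in the figure).

L progresses right → it eventually accesses blocks that map to the blue blocks in the cache and replaces them.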

Memory tuned quicksort analysis Second correction cont.

Assumption: R points to a key in the block located at the right end of the cache.

Reminder: this is a direct-mapped cache, so the i-th memory block is placed in cache block i mod C.

Memory tuned quicksort analysis Second correction cont.

There are 2 possible scenarios. The first:

[Figure: the cache, with a span of i blocks marked.]

R points to a key in a block that is mapped to the cache block indicated in the figure. R progresses to the blue blocks on the left.

L points to a key in a block that is mapped to the cache block indicated in the figure. L progresses and replaces the blocks on the right.

On average, there will be $(C/2 + i)/2$ hits.

Memory tuned quicksort analysis Second correction cont.

The second scenario:

[Figure: the cache, with a span of i blocks marked.]

R points to a key in a block that is mapped to the cache block indicated in the figure. R progresses to the blue blocks on the left.

L points to a key in a block that is mapped to the cache block indicated in the figure. L progresses and replaces the blocks on the right.

On average, there will be $i + (C/2 - i)/2 = (C/2 + i)/2$ hits.

Memory tuned quicksort analysis Second correction cont.

Number of hits:

$$\frac{1}{C/2}\sum_{0 \le i < C/2}\frac{C/2 + i}{2} \approx \frac{3C}{8}$$

The factor 1/(C/2) reflects that L can start on any block with equal probability; the result is the average number of hits.
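
The average can be checked by evaluating the sum directly. The function name `average_hits` is an illustrative choice, and C is assumed to be even.

```python
def average_hits(C):
    """Average of (C/2 + i) / 2 over the C/2 equally likely starting blocks i."""
    half = C // 2
    return sum((half + i) / 2 for i in range(half)) / half
```

For C = 1024 the sum evaluates to 383.75, against 3C/8 = 384.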

Memory tuned quicksort analysis Second correction cont.

Number of hits not accounted for in the computation of M(n):

$$\frac{3C}{8} \cdot N(n)$$

Here 3C/8 is the number of hits after a partition, N(n) is the expected number of subproblems of size > BC, and M(n) is the expected number of misses.

Memory tuned quicksort analysis Second correction cont.

The expected number of misses per key for n > BC:

$$\frac{M(n) + \frac{n}{B} - \frac{3C}{8}N(n)}{n} \approx \frac{2}{B}\ln\frac{n}{BC} + \frac{5}{8B} + \frac{3C}{8n}$$

Base quicksort analysis

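Number of cache misses per key:

$$\frac{2}{B}\ln\frac{n}{BC} + \frac{5}{8B} + \frac{3C}{8n} + \frac{1}{B}$$

The first three terms are the same as for memory tuned quicksort. The extra 1/B arises because base quicksort makes an extra pass at the end to perform the insertion sort.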
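
The two quicksort estimates for n > BC can be evaluated as follows. The function names are hypothetical and the code only transcribes the formulas from the slides; the sample values in the usage note are illustrative assumptions.

```python
import math


def memory_tuned_quicksort_misses_per_key(n, B, C):
    """Approximate misses per key for memory tuned quicksort, valid for n > BC."""
    return (2 / B) * math.log(n / (B * C)) + 5 / (8 * B) + 3 * C / (8 * n)


def base_quicksort_misses_per_key(n, B, C):
    """Base quicksort pays an extra 1/B for the final insertion-sort pass."""
    return memory_tuned_quicksort_misses_per_key(n, B, C) + 1 / B
```

With the assumed values B = 4, C = 2**16 and n = 2**22, the memory tuned estimate is about 1.548 misses per key and the base estimate is 0.25 higher.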

Multi quicksort analysis cont.

Number of cache misses per key for n ≤ BC: 1/B (compulsory misses).

Multi quicksort analysis cont.

Number of cache misses per key for n > BC: 2/B + 2/B

We partition the input into k = 3n/(BC) pieces.

Assumption: each partition is smaller than the cache.

We hold k linked lists, one for each partition. 100 keys fit in one linked list node (to minimize storage waste).

Each partitioned key is moved to a linked list: 1 miss per block in the input array and 1 miss per block in the linked list.

Each partition is then returned to the input array and sorted in place.
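
The multi-partition pass can be sketched as follows. This is an illustration under several assumptions: the function name `multi_quicksort` and the parameter `cache_keys` (standing in for BC) are hypothetical, pivots are drawn by random sampling, Python lists replace the linked lists of 100-key nodes, and the built-in `sorted` replaces the in-place quicksort of each partition.

```python
import bisect
import random


def multi_quicksort(keys, cache_keys, seed=0):
    """Split `keys` into about 3n/(BC) subsets with one multi-partition pass,
    then sort each subset separately."""
    keys = list(keys)
    n = len(keys)
    if n <= cache_keys:
        return sorted(keys)  # already cache sized: sort directly
    pieces = min(n, max(2, (3 * n) // cache_keys))  # about 3n/(BC) pieces
    rng = random.Random(seed)
    pivots = sorted(rng.sample(keys, pieces - 1))   # pieces - 1 pivots
    buckets = [[] for _ in range(pieces)]
    for key in keys:
        # Binary search over the pivots finds the subset this key belongs to.
        buckets[bisect.bisect_right(pivots, key)].append(key)
    result = []
    for bucket in buckets:
        result.extend(sorted(bucket))  # each subset is likely cache sized
    return result
```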

Performance

[Figure: instructions per key (0 to 150) versus set size in keys (10,000 to 100,000, with the cache size marked) for base, memory tuned, and multi quicksort. Annotations: "Constant number of additional instructions"; "Multi partition".]

Performance cont.

[Figure: cache misses per key (0 to 1) versus set size in keys (10,000 to 100,000, with the cache size marked) for base, memory tuned, and multi quicksort. Annotations: "Multi partition usually produces subsets smaller than the cache"; "1 miss per key!".]

Performance cont.

[Figure: time (cycles per key, 0 to 200) versus set size in keys (10,000 to 100,000, with the cache size marked) for the three quicksort variants. Annotation: "Due to increase in instruction cost. If larger sets were sorted, it would have outperformed the other 2 variants."]