Cache-Aware Hybrid Sorter

Manny Ko

Outline

• Sorting in CG

• Quick radix sort refresher

• Issues with radix sort

– Incoherent memory access during parts of it

– Originally only for integers

• Two-phase sort

– Cache-aware stream splitting

– Cache-friendly merge using Loser Tree

– A lot faster than STL sort (several times)

Sort in CG

• Depth-sort for transparency [Patney 2010]

• Better Z-cull

• Collision detection [Lin 2000]

• Minimizing state-changes

• Ray coherency [Garanzha & Loop 2010]

• HPC to handle irregular workloads

• PBGI ?

Inspirations

• Out-of-core sorts, e.g. AlphaSort [Nyberg 95]

• GPU based stream processing

• Cache-aware algorithms

• Came out of my work on fast kd-tree builder

Importance of Memory

• GPUs and CPU cores keep getting faster

• Tons of cores, and more are coming

• For GFLOPS Moore’s Law still holds

• NOT for bandwidth to memory

– While GFLOPS doubles or triples every 18 months

– Bandwidth barely moves (~15%)

• Bandwidth equals power; pushing electrons

Real-time Rendering

• Have been focusing on cache and memory patterns for a while

• CG researchers like Ingo Wald et al. have tackled that in ray-tracing

STL Sort

• Quicksort based

– Memory access pattern less than ideal

– Not sequential and lots of branching

• Will not dwell too much on this

Radix Sort

• The only practical O(dN) sort algorithm

– d is the # of radix digits, e.g. for a 32-bit word and 1 bit per pass, d is 32.

• Almost no branching, at least for integers

Counting Sort – Pass 1

• For radix = 2 we allocate two counters

• Each pass we go through the input and count the # of inputs whose digit is 0 and the # whose digit is 1

• Extract digit (1 bit) and use that as index to increment the right counter – no branching

• d is a key design parameter

Pass 2 - Scatter

• At the end of the pass the counter for 0s will give us the offset to insert the 1s

• We go through the input again, using the counters to guide us where to scatter each item into the output buffer
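
The counting and scatter passes for a single binary digit can be sketched as below. This is a minimal illustration, not the presenter's code: the function name `RadixPassBit` and its parameters are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One stable radix-2 pass on bit `bit` of each key (hypothetical helper).
// Pass 1 counts the 0s and 1s; pass 2 scatters the keys into `out`.
void RadixPassBit(const std::vector<std::uint32_t>& in,
                  std::vector<std::uint32_t>& out,
                  unsigned bit)
{
    assert(bit < 32);
    out.resize(in.size());

    // Pass 1: the extracted digit indexes the counter directly, no branch.
    std::size_t count[2] = {0, 0};
    for (std::uint32_t key : in)
        ++count[(key >> bit) & 1u];

    // The 0s start at offset 0; the 1s start right after all the 0s.
    std::size_t offset[2] = {0, count[0]};

    // Pass 2: scatter each key to the next free slot for its digit.
    // Keys with equal digits keep their input order, so the pass is stable.
    for (std::uint32_t key : in)
        out[offset[(key >> bit) & 1u]++] = key;
}
```

Calling this for bit 0, then bit 1, and so on up to bit 31, swapping the two buffers between calls, gives the full radix-2 sort.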

Number of Passes

• In the original radix-sort, each radix digit requires 1 counting pass through the input and 1 scatter pass

• Swap input and output; repeat d times

• Each of the passes is a stable-sort

Prefix-Sum

• Radix-2 is simple; in general we have to compute the prefix-sum for the counters

• Key building block for GPU computing

• A big topic on its own

• Our array is only 256 entries long, so we didn’t use a fancy SIMD method
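
For a 256-entry counter array, a plain serial loop is enough. A minimal sketch of the exclusive prefix-sum that turns counts into scatter offsets, assuming a hypothetical helper name `ExclusivePrefixSum`:

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Turn per-digit counts into starting offsets (exclusive prefix sum).
// Returns the total number of keys that were counted.
std::size_t ExclusivePrefixSum(std::array<std::size_t, 256>& counters)
{
    std::size_t running = 0;
    for (std::size_t& c : counters) {
        const std::size_t n = c;  // how many keys have this digit
        c = running;              // where the first such key will be written
        running += n;
    }
    return running;
}
```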

Access Patterns

• Pass 1 – pure sequential read. Good

– Very parallelizable too.

• Pass 2 – random scatter. Not so good

• Each pass requires one complete round trip from and to memory

Random Scatter

• Idea: utilize the cache

• Split the input into sub-streams

• Sub-stream size is defined by the size of the cache (or other fast memory)

Cache Resident Passes

• When we swap input and outputs

– Output from previous pass still in cache

Stream Merging

• Sorted sub-streams will be merged

• Merge is streaming friendly:

– Inputs are read sequentially

– Output is generated sequentially

• This is where the fun is

• We will get back to this. I promise.

Cache-Aware Hybrid Sort

• Cache-aware because we use the actual cache size of the machine to split the input

• Hybrid: radix sort sub-streams then merge

Cache sizing

• cpuid instruction

• code in the book ‘Game Engine Gems II’, AK Peters 2011.

Stream Splitting

• Depends on # of threads

• General strategy is to keep the output of each scatter pass completely within the cache
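
One way to picture the split is sketched below. It is an illustration under stated assumptions, not the presenter's implementation: the names `Substream` and `SplitIntoSubstreams` are hypothetical, the cache size is passed in by the caller, and half of the cache is budgeted for the scatter output (the other half for the input being read). It also ignores the thread count, which the real splitter takes into account.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// A contiguous slice of the input (hypothetical type for this sketch).
struct Substream {
    std::size_t begin;  // index of the first element
    std::size_t count;  // number of elements
};

// Split `count` elements into substreams whose scatter output fits in cache.
// Assumption: half of the cache is reserved for the output buffer.
std::vector<Substream> SplitIntoSubstreams(std::size_t count,
                                           std::size_t cacheBytes,
                                           std::size_t elemBytes)
{
    assert(elemBytes > 0);
    const std::size_t maxElems =
        std::max<std::size_t>(1, cacheBytes / (2 * elemBytes));

    std::vector<Substream> streams;
    for (std::size_t begin = 0; begin < count; begin += maxElems)
        streams.push_back({begin, std::min(maxElems, count - begin)});
    return streams;
}
```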

Substream Sorting

• Each byte is a digit

• Radix-256 sort – allocate 256 counters

– 1 KB with 32-bit counters or 2 KB with 64-bit counters; fits in L1 cache

– Actually we allocate 4 sets of counters

• d is logically 4, but we do all the counting in 1 pass

• Form the 4 sets of prefix-sums

• 4 scatter passes
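
The steps above (one counting pass that fills four sets of counters, four prefix-sums, four scatter passes) can be sketched for plain 32-bit unsigned keys as follows. The function name `RadixSort256` is an assumption for this example, and the sketch sorts keys only, whereas the presenter's sorter works on <key,index> pairs.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// LSD radix-256 sort of 32-bit keys: one counting pass, four scatter passes.
void RadixSort256(std::vector<std::uint32_t>& keys)
{
    // Four sets of 256 counters, one per byte, filled in a single read pass.
    std::array<std::array<std::size_t, 256>, 4> hist{};
    for (std::uint32_t k : keys)
        for (unsigned d = 0; d < 4; ++d)
            ++hist[d][(k >> (8 * d)) & 0xFFu];

    // Convert each set of counts to starting offsets (exclusive prefix sum).
    for (auto& h : hist) {
        std::size_t running = 0;
        for (std::size_t& c : h) {
            const std::size_t n = c;
            c = running;
            running += n;
        }
    }

    // Four stable scatter passes, least significant byte first,
    // swapping input and output after each one.
    std::vector<std::uint32_t> temp(keys.size());
    std::vector<std::uint32_t>* in = &keys;
    std::vector<std::uint32_t>* out = &temp;
    for (unsigned d = 0; d < 4; ++d) {
        for (std::uint32_t k : *in)
            (*out)[hist[d][(k >> (8 * d)) & 0xFFu]++] = k;
        std::swap(in, out);
    }

    // After an even number of swaps the sorted data is back in `keys`.
    assert(in == &keys);
}
```

Counting all four digits up front is valid because every scatter pass only permutes the keys, so the per-byte histograms never change.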

Floats

• Radix-sort was originally designed for ints

• What if we treat float as int? Casting?

• Almost works, if all the floats are positive

• IEEE is sign-exponent-mantissa.

• The sign bit makes all negative numbers appear to be larger than the positive ones

Float example

2.0 is 0x40000000

-2.0 is 0xc0000000

-4.0 is 0xc0800000

Compared as unsigned ints, this implies -4.0 > -2.0 > 2.0,

just the opposite of what we want
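
The bit patterns in this example can be checked with a small helper. This is a sketch that assumes 32-bit IEEE-754 floats; the name `FloatBits` is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reinterpret the bits of a float as an unsigned 32-bit integer.
// memcpy is used instead of a pointer cast to avoid aliasing problems.
std::uint32_t FloatBits(float f)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "expects a 32-bit float");
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}
```

Comparing `FloatBits(-4.0f)`, `FloatBits(-2.0f)` and `FloatBits(2.0f)` as unsigned integers reproduces the inverted ordering.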

Terdiman’s Solution

• The usual solution [Terdiman 2000] treats the high byte specially and uses a test in the inner loop

• Modern CPUs do not like branching

• GPUs like it even less

Herf’s Hack

1. Always invert the sign bit

2. If the sign bit was set, then also invert the exponent and mantissa

2.0 is 0x40000000 -> 0xc0000000

-2.0 is 0xc0000000 -> 0x3fffffff

-4.0 is 0xc0800000 -> 0x3f7fffff

We get the correct ordering

Herf’s FloatFlip

U32 FloatFlip(U32 f)
{
    U32 mask = -int32(f >> 31) | 0x80000000;
    return (f ^ mask);
}

My Version

int32 mask = (int32(f) >> 31) | 0x80000000;

Utilizes the sign extension when right-shifting signed numbers. Generates better code.
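
A self-contained sketch of the flip, together with the inverse needed to recover the floats after sorting, is given below. It uses unsigned arithmetic only, so it does not rely on how the compiler right-shifts signed values. The inverse and the `FloatToSortKey` wrapper are not shown in the slides and are included here as assumptions for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Map raw IEEE-754 float bits to a key whose unsigned order matches float order.
std::uint32_t FloatFlip(std::uint32_t f)
{
    // 0xFFFFFFFF if the sign bit is set (flip everything),
    // otherwise 0x80000000 (flip only the sign bit).
    const std::uint32_t mask = (0u - (f >> 31)) | 0x80000000u;
    return f ^ mask;
}

// Undo FloatFlip after sorting (not shown in the slides).
std::uint32_t InverseFloatFlip(std::uint32_t f)
{
    // A flipped key with the top bit set came from a non-negative float,
    // so only the sign bit is flipped back; otherwise all bits are.
    const std::uint32_t mask = ((f >> 31) - 1u) | 0x80000000u;
    return f ^ mask;
}

// Convenience wrapper (hypothetical name): float -> sortable unsigned key.
std::uint32_t FloatToSortKey(float value)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "expects a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    return FloatFlip(bits);
}
```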

Parallel Sorting

• Each substream can be sorted in parallel

• We allocate 1 core per substream

• We size the substream so that it fits into each core’s L2 or L1 cache (or GPU shared memory)

• At the end of substream sort phase we have read the input from memory (disk) twice

/*!
 -- RadixSorter: a builder class to aid with the use of radix-sorter.
 -- It splits the input stream into substreams that fit into cache.
 -- Mostly it holds the indices and temporaries for reuse.
 -- It currently only supports sorting of <key,index> pairs. Caller can either
 -- request the sorted indices or request the original values to be moved.
 */
class RadixSorter {
    typedef size_t* Indices;
    static const size_t kStreams = 4;
public:
    static const size_t kNumThreads = 4;   // # of threads

    RadixSorter( int count );
    ~RadixSorter();

    /// reallocate internal storage to prepare for a stream of length 'count':
    void Resize( int count );
    /// deallocate all storage:
    void Clear();
    /// initialize the sorter for 'values':
    void SortInit( float* values, int count );
    /// sort 'values':
    void Sort( float* values, int count );
    /// sort sub-stream 's':
    void SortStream( int s );
    void MergeStreams();

public:
    size_t m_blockSizes[kStreams];   //!< size of each sub-stream
    float* m_streams[kStreams];      //!< our sub-streams of work
    float* m_temp[kNumThreads];      //!< working buffer carved from output buffer
    float* m_outbuf;
    int    m_count;                  //!< max size of the input sequence
    bool   m_inited;
};

Stream Merging

• Usually performed using a priority-queue, most likely a heap-based PQ

• I tried to find the best PQ implementation

• Disappointing, the gain from radix-sort was almost negated by the merge phase
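
For reference, the conventional heap-based merge looks roughly like the sketch below, built on `std::priority_queue`. It is a baseline illustration only; the function name `MergeWithHeap` and the use of `std::vector` for the substreams are assumptions, and it is not one of the PQ implementations the presenter benchmarked.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// k-way merge of sorted streams using a binary min-heap of (key, stream_id).
std::vector<float> MergeWithHeap(const std::vector<std::vector<float>>& streams)
{
    using Entry = std::pair<float, std::size_t>;  // (key, stream_id)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> pq;

    // cursor[s] is the index of the element of stream s currently in the heap.
    std::vector<std::size_t> cursor(streams.size(), 0);
    std::size_t total = 0;
    for (std::size_t s = 0; s < streams.size(); ++s) {
        total += streams[s].size();
        if (!streams[s].empty())
            pq.push({streams[s][0], s});
    }

    std::vector<float> out;
    out.reserve(total);
    while (!pq.empty()) {
        const Entry top = pq.top();
        pq.pop();
        out.push_back(top.first);

        // Refill the heap from the stream the smallest key came from.
        const std::size_t s = top.second;
        if (++cursor[s] < streams[s].size())
            pq.push({streams[s][cursor[s]], s});
    }
    return out;
}
```

Every output element costs one pop and usually one push, each of which walks the heap with data-dependent comparisons.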

Loser-Tree

• Comes to the rescue

• Thanks Knuth

– The Art of Computer Programming Vol. 3

• Almost forgotten and I am a Knuth fan

• It is a kind of tournament-tree

Tournament Tree

• Single elimination

• Loser-tree is a tournament tree where the loser is kept in each round

• Winner moves on (in a register)

Our Tree

• Node consists of a float key and a payload of stream_id

• Linearized binary-tree, no pointers

– Navigation up and down is by shifts and adds

• Initialized by inserting the head of each substream into the tree

– Size_of_tree = 2 x # of substreams

• Let the play begin!

Winner

• Winner rises to the top

– We remove the winner and output the key

• We use the winner’s stream_id to pull the next item from the stream

• Key idea: new winner can only come from players that had faced the previous winner – i.e. the path from the root to the original position of the winner

Repeat Matches

• Repeat those matches, a new winner emerges
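
A loser-tree merge along these lines can be sketched as follows. This is an illustration under assumptions, not the presenter's code: the name `MergeWithLoserTree` is hypothetical, the tree stores stream ids and looks the keys up through per-stream cursors (rather than storing key and stream_id together in each node), and a temporary `winner` array is used only to play the initial tournament.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// k-way merge of sorted streams with a linearized loser tree.
// Node n has children 2n and 2n+1; the leaf of stream s is node k + s.
std::vector<float> MergeWithLoserTree(const std::vector<std::vector<float>>& streams)
{
    const std::size_t k = streams.size();
    std::vector<float> out;
    if (k == 0) return out;

    std::vector<std::size_t> cursor(k, 0);
    std::size_t total = 0;
    for (const auto& s : streams) total += s.size();
    out.reserve(total);

    // Stream a wins against b if its head key is smaller or equal;
    // an exhausted stream loses to everything.
    auto wins = [&](std::size_t a, std::size_t b) {
        if (cursor[a] >= streams[a].size()) return false;
        if (cursor[b] >= streams[b].size()) return true;
        return streams[a][cursor[a]] <= streams[b][cursor[b]];
    };

    // Play the initial tournament bottom-up, keeping the loser at each node.
    std::vector<std::size_t> loser(k, 0);      // internal nodes 1..k-1
    std::vector<std::size_t> winner(2 * k, 0); // used only during setup
    for (std::size_t s = 0; s < k; ++s) winner[k + s] = s;
    for (std::size_t n = k - 1; n >= 1; --n) {
        const std::size_t a = winner[2 * n], b = winner[2 * n + 1];
        if (wins(a, b)) { winner[n] = a; loser[n] = b; }
        else            { winner[n] = b; loser[n] = a; }
    }
    std::size_t champion = winner[1];

    for (std::size_t i = 0; i < total; ++i) {
        out.push_back(streams[champion][cursor[champion]]);
        ++cursor[champion];  // pull the next item from the winner's stream

        // Replay only the matches on the path from that leaf to the root.
        std::size_t contender = champion;
        for (std::size_t n = (k + champion) >> 1; n >= 1; n >>= 1)
            if (wins(loser[n], contender))
                std::swap(loser[n], contender);
        champion = contender;
    }
    return out;
}
```

Each output element replays about log2(k) matches along a single fixed path, compared with the pop-then-push pair of a heap.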

Access Pattern of Merge

• Each substream is accessed sequentially

• Output is written sequentially

• Modern CPUs and GPUs like these sorts of patterns due to their pre-fetch, write-coalesce and caching logic

• Tree is small and fits into the L1 cache or even register file

Performance (1 core)

• Serially sort all substreams

• Merge using Loser-Tree on same thread

• Small data set: 2.1 to 3.5 times faster than STL

– The poor access pattern of quicksort is less problematic when everything fits into cache

Scalability (4-cores)

(times in ms)

             Threaded (Q6600)   Serial (Q6600)   Threaded (I7)   Serial (I7)
1 stream     5.12               5                3.89            3.62
2 streams    6.90               10.04            4.20            7.1
3 streams    8.08               15.07            4.56            10.69
4 streams    10.97              20.55            4.86            14.2
4 + merge    16.4               26.01            9.61            19.0

Multi-Core Performances

• One million entries: Q6600

– STL took 76 ms

– radix-sort: 28 ms

– 4-core: 16.4 ms, of which 5-6 ms is in the merge

• One million entries: I7

– STL (58ms),

– hybrid (9.6ms)

• 6 times faster than STL

Threading Overhead

• The threaded 1-stream time vs. the serial time is 5.12 ms vs. 5 ms (Q6600)

– So only 0.12 ms of threading overhead

Related Work

• Funnel-Sort [Brodal 2008]

• GPU radix-sort [Satish 2009]
