High Performance Comparison-Based Sorting Algorithm on Many-Core GPUs Xiaochun Ye, Dongrui Fan, Wei Lin, Nan Yuan, and Paolo Ienne Key Laboratory of Computer System and Architecture ICT, CAS, China



Page 1: High Performance Comparison-Based Sorting Algorithm on Many-Core GPUs

High Performance Comparison-Based Sorting Algorithm on Many-Core GPUs

Xiaochun Ye, Dongrui Fan, Wei Lin, Nan Yuan, and Paolo Ienne

Key Laboratory of Computer System and Architecture, ICT, CAS, China

Page 2: High Performance Comparison-Based Sorting Algorithm on Many-Core GPUs

Outline

GPU computation model

Our sorting algorithm
– A new bitonic-based merge sort, named Warpsort

Experiment results

Conclusion

Page 3: High Performance Comparison-Based Sorting Algorithm on Many-Core GPUs

GPU computation model

Massively multi-threaded, data-parallel many-core architecture

Important features:
– SIMT execution model
Avoid branch divergence
– Warp-based scheduling
Implicit hardware synchronization among threads within a warp
– Access pattern
Coalesced vs. non-coalesced

Page 4: High Performance Comparison-Based Sorting Algorithm on Many-Core GPUs

Why merge sort?

Similar case with external sorting
– Limited shared memory on chip vs. limited main memory

Sequential memory access
– Easy to meet the coalesced access requirement

Page 5: High Performance Comparison-Based Sorting Algorithm on Many-Core GPUs

Why bitonic-based merge sort?

Massively fine-grained parallelism
– Because of its relatively high complexity, a bitonic network is not good at sorting large arrays
– Only used to sort small subsequences in our implementation

Again, the coalesced memory access requirement

Page 6: High Performance Comparison-Based Sorting Algorithm on Many-Core GPUs

Problems in the bitonic network

Naïve implementation
– Block-based bitonic network
– One element per thread

Some problems
– In each stage, n elements produce only n/2 compare-and-swap operations
Form both ascending pairs and descending pairs
– Between stages, synchronization is required

[Figure: a block-based bitonic network with its phases and stages labeled; each thread in the block holds one element.]

Too many branch divergences and synchronization operations

Page 7: High Performance Comparison-Based Sorting Algorithm on Many-Core GPUs

What do we use?

Warp-based bitonic network
– Each bitonic network is assigned to an independent warp, instead of a block
Barrier-free: avoids synchronization between stages
– Threads in a warp perform 32 distinct compare-and-swap operations with the same order
Avoids branch divergence
At least 128 elements per warp

And further, a complete comparison-based sorting algorithm: GPU-Warpsort

Page 8: High Performance Comparison-Based Sorting Algorithm on Many-Core GPUs

Overview of GPU-Warpsort

[Figure: overview of GPU-Warpsort — the input is split into tiles that are bitonic-sorted by warps (Step 1), merged by warps (Step 2), split into independent subsequences (Step 3), and merged by warps into the output (Step 4).]

Step 1: divide the input sequence into small tiles, and sort each tile with a warp-based bitonic sort.

Step 2: merge by warps, until the parallelism is insufficient.

Step 3: split the large sequences into small independent subsequences.

Step 4: merge by warps, and form the output.

Page 9: High Performance Comparison-Based Sorting Algorithm on Many-Core GPUs

Step 1: barrier-free bitonic sort

Divide the input array into equal-sized tiles

Each tile is sorted by a warp-based bitonic network
– 128+ elements per tile to avoid branch divergence
– No need for __syncthreads()
– Ascending pairs + descending pairs
– Use max() and min() to replace if-swap pairs

bitonic_warp_128(key_t *keyin, key_t *keyout)
{
    // phases 0 to log(128)-1
    for (i = 2; i < 128; i *= 2) {
        for (j = i/2; j > 0; j /= 2) {
            k0 ← position of preceding element in each pair to form ascending order
            if (keyin[k0] > keyin[k0+j])
                swap(keyin[k0], keyin[k0+j]);
            k1 ← position of preceding element in each pair to form descending order
            if (keyin[k1] < keyin[k1+j])
                swap(keyin[k1], keyin[k1+j]);
        }
    }
    // special case for the last phase
    for (j = 128/2; j > 0; j /= 2) {
        k0 ← position of preceding element in the thread's first pair to form ascending order
        if (keyin[k0] > keyin[k0+j])
            swap(keyin[k0], keyin[k0+j]);
        k1 ← position of preceding element in the thread's second pair to form ascending order
        if (keyin[k1] > keyin[k1+j])
            swap(keyin[k1], keyin[k1+j]);
    }
}

Page 10: High Performance Comparison-Based Sorting Algorithm on Many-Core GPUs

Step 2: bitonic-based merge sort

t-element merge sort
– Allocate a t-element buffer in shared memory
– Load the t/2 smallest elements from sequences A and B, respectively
– Merge
– Output the lower t/2 elements
– Load the next t/2 smallest elements from A or B

t = 8 in this example

[Figure: a t-element merge step with t = 8 — the 4 smallest elements of sequence A (0 2 4 6 8 10 12 14) and of sequence B (1 3 5 7 9 11 13 15) are merged in the shared-memory buffer by a barrier-free bitonic merge network, producing 0 1 2 3 4 5 6 7; the lower 4 elements are output, and the next 4 elements are loaded from A if A[3] < B[3], otherwise from B.]

Page 11: High Performance Comparison-Based Sorting Algorithm on Many-Core GPUs

Step 3: split into small tiles

Problem of merge sort
– The number of sequence pairs decreases geometrically
– Cannot fit this massively parallel platform

Method
– Divide the large sequences into independent small tiles which satisfy:

∀a ∈ subsequence(x, i), ∀b ∈ subsequence(y, j): a ≤ b,
where 0 ≤ x < l, 0 ≤ y < l, 0 ≤ i < j < s.

Page 12: High Performance Comparison-Based Sorting Algorithm on Many-Core GPUs

Step 3: split into small tiles (cont.)

How to get the splitters?
– Sample the input sequence randomly

[Figure: a sample sequence is drawn randomly from the input sequence and sorted; the splitters are selected from the sorted sample sequence.]
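A plain-Python sketch of this sampling-based split (the sample size, the evenly spaced selection, and the use of `bisect` are illustrative choices, not taken from the slides):

```python
import bisect
import random

def choose_splitters(seq, s, sample_size=64):
    """Pick s-1 splitters from a random sample of the input (Step 3)."""
    sample = sorted(random.sample(list(seq), min(sample_size, len(seq))))
    # evenly spaced elements of the sorted sample become the splitters
    return [sample[(k * len(sample)) // s] for k in range(1, s)]

def split_sorted_seq(seq, splitters):
    """Cut one sorted subsequence at the splitters, yielding s tiles."""
    bounds = [bisect.bisect_left(seq, sp) for sp in splitters] + [len(seq)]
    tiles, start = [], 0
    for b in bounds:
        tiles.append(seq[start:b])
        start = b
    return tiles
```

Because every subsequence is cut at the same splitter values, any element of a tile in column i is ≤ any element of a tile in column j for i < j, across all subsequences — exactly the condition on the previous slide.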

Page 13: High Performance Comparison-Based Sorting Algorithm on Many-Core GPUs

Step 4: final merge sort

Subsequences (0,i), (1,i), …, (l-1,i) are merged into Si

Then S0, S1, …, S(s-1) are assembled into a totally sorted array

[Figure: an l × s grid of subsequences — row x holds tiles (x,0) … (x,s-1); column i across all l rows is merged into Si.]
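A sketch of this final step, assuming the l × s tile grid from the figure; `heapq.merge` stands in for the warp-based merges:

```python
import heapq

def final_merge(grid):
    """grid[x][i] is tile (x, i): row x, column i (l rows, s columns).
    Merge column i across all rows into S_i, then concatenate the S_i."""
    out = []
    s = len(grid[0])
    for i in range(s):
        # merge tiles (0, i), (1, i), ..., (l-1, i) into S_i
        out.extend(heapq.merge(*(row[i] for row in grid)))
    return out
```

Concatenating the S_i directly is valid because the split in Step 3 already guarantees that every element of column i is ≤ every element of column j for i < j.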

Page 14: High Performance Comparison-Based Sorting Algorithm on Many-Core GPUs

Experimental setup

Host
– AMD Opteron 880 @ 2.4 GHz, 2 GB RAM

GPU
– 9800GTX+, 512 MB

Input sequence
– Key-only and key-value configurations
32-bit keys and values
– Sequence size: from 1M to 16M elements
– Distributions
Zero, Sorted, Uniform, Bucket, and Gaussian

Page 15: High Performance Comparison-Based Sorting Algorithm on Many-Core GPUs

Performance comparison

Mergesort
– Fastest comparison-based sorting algorithm on GPUs (Satish, IPDPS'09)
– Implementations already compared by Satish are not included

Quicksort
– Cederman, ESA'08

Radixsort
– Fastest sorting algorithm on GPUs (Satish, IPDPS'09)

Warpsort
– Our implementation

[Figure: sorting time (msec) and sorting rate (millions/sec) for mergesort, radixsort, warpsort, and quicksort, on key-only (ko) and key-value (kv) sequences of 1M to 16M elements.]

Page 16: High Performance Comparison-Based Sorting Algorithm on Many-Core GPUs

Performance results

Key-only
– 70% higher performance than quicksort

Key-value
– 20%+ higher performance than mergesort
– 30%+ for large sequences (>4M)

Page 17: High Performance Comparison-Based Sorting Algorithm on Many-Core GPUs

Results under different distributions

The Uniform, Bucket, and Gaussian distributions achieve almost the same performance

The Zero distribution is the fastest

Does not excel on the Sorted distribution
– Load imbalance

[Figure: sorting time (msec) and sorting rate (millions/sec) under the Zero, Uniform, Gaussian, Bucket, and Sorted distributions, for 1M to 16M elements.]

Page 18: High Performance Comparison-Based Sorting Algorithm on Many-Core GPUs

Conclusion

We present an efficient comparison-based sorting algorithm for many-core GPUs
– Carefully map the tasks to the GPU architecture

Use warp-based bitonic networks to eliminate barriers
– Provide sufficient homogeneous parallel operations for each thread
Avoid thread idling and thread divergence
– Totally coalesced global memory accesses when fetching and storing the sequence elements

The results demonstrate up to 30% higher performance
– Compared with previous optimized comparison-based algorithms

Page 19: High Performance Comparison-Based Sorting Algorithm on Many-Core GPUs

Thanks