
Split Primitive on the GPU

Split Primitive

Split can be defined as performing append(x, List[category(x)]) for each element x of the input; each List holds the elements of one category together.

Split Sequential Algorithm

I. Count the number of elements falling into each bin
– for each element x of list L do
• histogram[category(x)]++ [possible clashes on a category]

II. Find the starting index for each bin (prefix sum)
– for each category m do
• startIndex[m] = startIndex[m-1] + histogram[m-1]

III. Assign each element to the output [initialize localIndex[m] = 0 for every category m]
– for each element x of list L do
• itemIndex = localIndex[category(x)]++ [possible clashes on a category]
• globalIndex = startIndex[category(x)]
• outArray[globalIndex + itemIndex] = x
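As a concrete reference, a minimal sequential C sketch of these three steps (the function name and temporary arrays are illustrative assumptions; category[] maps an element to its bin, as in the kernels later):

    #include <stdlib.h>

    /* Minimal sequential sketch of the three steps above; names are
     * assumptions, category[] maps an element to its bin. */
    void splitSequential( int *L, int *outArray, int n, int numBins, int *category )
    {
        int *histogram  = calloc( numBins, sizeof(int) );
        int *startIndex = calloc( numBins, sizeof(int) );
        int *localIndex = calloc( numBins, sizeof(int) );

        /* Step I: count elements per bin */
        for ( int i = 0; i < n; i++ )
            histogram[ category[ L[i] ] ]++;

        /* Step II: exclusive prefix sum gives each bin's starting index */
        for ( int m = 1; m < numBins; m++ )
            startIndex[m] = startIndex[m-1] + histogram[m-1];

        /* Step III: scatter each element behind its bin's start */
        for ( int i = 0; i < n; i++ ) {
            int c = category[ L[i] ];
            outArray[ startIndex[c] + localIndex[c]++ ] = L[i];
        }

        free( histogram ); free( startIndex ); free( localIndex );
    }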

Split Operation in Parallel

• In order to parallelize the above split algorithm, we require a clash-free method for building the histogram on the GPU

• This can be achieved on a parallel machine using one of the following two methods:
– Personal histograms for each processor, followed by merging the histograms
– Atomic operations on the histogram array(s)

Global Memory Atomic Split

• Code:

__global__ void globalHist( unsigned int *histogram, int *gArray, int *category )
{
    int curElement, curCategory;
    for ( int i = 0; i < ELEMENTS_PER_THREAD; i++ ) {
        // Each block reads ELEMENTS_PER_THREAD * blockDim.x elements,
        // one stride of blockDim.x per iteration
        curElement  = gArray[ blockIdx.x * blockDim.x * ELEMENTS_PER_THREAD
                              + i * blockDim.x + threadIdx.x ];
        curCategory = category[curElement];
        // Wrap limit 0xFFFFFFFFU makes atomicInc a plain clash-safe increment
        atomicInc( &histogram[curCategory], 0xFFFFFFFFU );
    }
}

• Global memory is too slow to access
• A single histogram in global memory means the number of clashes is data dependent
• Overusing shared memory instead (one histogram per thread, as in the next approach) limits the maximum number of categories to 64
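For reference, a possible launch of globalHist might look as follows; the buffer names and sizes here are assumptions for illustration, and numElements is assumed to be a multiple of the tile size:

    // Hypothetical launch of globalHist; d_histogram, d_gArray, d_category
    // are assumed device allocations.
    int numElements = 1 << 24;   // 16M elements, as in the comparison later
    int numThreads  = 256;
    int numBlocks   = numElements / ( numThreads * ELEMENTS_PER_THREAD );

    cudaMemset( d_histogram, 0, NUMBINS * sizeof(unsigned int) );
    globalHist<<< numBlocks, numThreads >>>( d_histogram, d_gArray, d_category );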

Non-Atomic Approach (He et al.)

• A histogram for each 'thread'
• Combine all the histograms to get the final histogram (a merge sketch follows the kernel below)

__global__ void nonAtomicHistogram( int *gArray, int *category, unsigned int *tHistGlobal )
{
    int curElement, curCategory;
    int tx = threadIdx.x;
    __shared__ unsigned int tHist[NUMBINS * NUMTHREADS];

    // Clear this thread's private histogram
    for ( int i = 0; i < NUMBINS; i++ )
        tHist[tx * NUMBINS + i] = 0;

    // Count into the private histogram: no clashes, so no atomics needed
    for ( int i = 0; i < ELEMENTS_PER_THREAD; i++ ) {
        curElement  = gArray[ blockIdx.x * NUMTHREADS * ELEMENTS_PER_THREAD
                              + i * NUMTHREADS + tx ];
        curCategory = category[curElement];
        tHist[tx * NUMBINS + curCategory]++;
    }

    // Write the private histogram to global memory in bin-major order
    for ( int i = 0; i < NUMBINS; i++ )
        tHistGlobal[ i * NUMBLOCKS * NUMTHREADS + blockIdx.x * NUMTHREADS + tx ]
            = tHist[tx * NUMBINS + i];
}
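The slides do not show the merge step; one straightforward way to combine the per-thread histograms, sketched here as an assumption, is a kernel with one thread per bin (the bin-major layout written above makes each bin's counts contiguous):

    // Hypothetical merge kernel (not from the slides): one thread sums all
    // NUMBLOCKS * NUMTHREADS per-thread counts of a single bin.
    __global__ void mergeHistograms( unsigned int *tHistGlobal, unsigned int *histogram )
    {
        int bin = blockIdx.x * blockDim.x + threadIdx.x;
        if ( bin >= NUMBINS ) return;

        unsigned int sum = 0;
        for ( int t = 0; t < NUMBLOCKS * NUMTHREADS; t++ )
            sum += tHistGlobal[ bin * NUMBLOCKS * NUMTHREADS + t ];
        histogram[bin] = sum;
    }

For the full split, a prefix sum taken over tHistGlobal in this same bin-major order can yield each thread's write offset directly.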

Shared Memory Atomic

• Global atomic does not use the fast shared memory available
• The non-atomic approach overuses the shared memory

• Incorporating atomic operations on fast shared memory may perform better than the above two approaches

• Shared memory atomic can be performed using one of the following techniques:
– H/W atomic operations
– Clash-serial atomic operations
– Thread-serial atomic operations

SM Atomic :: H/W Atomic

• Latest GPUs (G2xx and later) support atomic operations on shared memory

__global__ void histkernel( unsigned int *blockHists, int *gArray, unsigned int *category )
{
    extern __shared__ unsigned int sharedmem[];
    unsigned int *s_Hist = sharedmem;   // one histogram per block
    unsigned int curElement, curCategory;

    // Clear the block's shared-memory histogram cooperatively
    for ( int pos = threadIdx.x; pos < NUMBINS; pos += blockDim.x )
        s_Hist[pos] = 0;
    __syncthreads();

    for ( int i = 0; i < ELEMENTS_PER_THREAD; i++ ) {
        curElement  = gArray[ blockIdx.x * NUMTHREADS * ELEMENTS_PER_THREAD
                              + i * NUMTHREADS + threadIdx.x ];
        curCategory = category[curElement];
        atomicInc( &s_Hist[curCategory], 0xFFFFFFFFU );  // H/W shared-memory atomic
    }
    __syncthreads();

    // Write the block histogram out in bin-major order (scan-friendly layout)
    for ( int pos = threadIdx.x; pos < NUMBINS; pos += blockDim.x )
        blockHists[ blockIdx.x + gridDim.x * pos ] = s_Hist[pos];
}
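Note that s_Hist lives in dynamically sized shared memory (extern __shared__), so a launch must pass the histogram size as the third execution-configuration argument; a sketch with assumed buffer names:

    // Hypothetical launch: NUMBINS * sizeof(unsigned int) bytes of dynamic
    // shared memory back the extern __shared__ array inside histkernel.
    histkernel<<< numBlocks, NUMTHREADS, NUMBINS * sizeof(unsigned int) >>>(
            d_blockHists, d_gArray, d_category );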

SM Atomic :: Thread Serial

• Threads can be serialized within a 'warp' in order to avoid clashes:

....
for ( int i = 0; i < ELEMENTS_PER_THREAD; i++ ) {
    curElement  = gArray[ blockIdx.x * NUMTHREADS * ELEMENTS_PER_THREAD
                          + i * NUMTHREADS + threadIdx.x ];
    curCategory = category[curElement];
    // Lanes take turns: only one thread of the warp updates the
    // shared histogram at a time, so no clash can occur
    for ( int j = 0; j < WARPSIZE; j++ )
        if ( ( threadIdx.x & ( WARPSIZE - 1 ) ) == j )
            s_Hist[curCategory]++;
}
....

SM Atomic :: Clash Serial

• Each thread writes to the common histogram of the block until it succeeds
• A thread is tagged by its thread ID in order to find out whether it successfully updated the histogram

// Main loop: each thread processes elements strided across the grid
for ( int pos = globalTid; pos < NUMELEMENTS; pos += numThreads ) {
    unsigned int curElement  = gArray[pos];
    unsigned int curCategory = category[curElement];
    addData256( s_Hist, curCategory, threadTag );
}

// Clash-serializing function for a warp: retry until this thread's
// tagged write survives, i.e., no other lane overwrote the counter
__device__ void addData256( volatile unsigned int *s_WarpHist,
                            unsigned int data, unsigned int threadTag )
{
    unsigned int count;
    do {
        count = s_WarpHist[data] & 0x07FFFFFFU;  // strip the tag bits
        count = threadTag | ( count + 1 );       // increment and re-tag
        s_WarpHist[data] = count;
    } while ( s_WarpHist[data] != count );       // lost a clash? retry
}
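The threadTag itself is not defined on the slide; given the 0x07FFFFFFU mask (27 bits of count), a consistent construction, assumed here, places the 5-bit lane ID in the top bits:

    // Assumed construction of threadTag: 5-bit lane ID in the top bits,
    // leaving 27 bits of count to match the 0x07FFFFFFU mask above.
    const unsigned int threadTag = ( threadIdx.x & ( WARPSIZE - 1 ) ) << 27;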

Comparison of Histogram Methods for 16 Million Elements

Split using Shared Atomic

• Shared atomic is used to build block-level histograms

• A parallel prefix sum is used to compute the starting indices

• The split is then performed by each block on the same set of elements used in Step 1 (a host-side sketch of the three steps follows)
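Putting the three steps together, a host-side driver might look like the sketch below; prefixSum and scatterKernel are assumed names, not code from the slides. The bin-major layout of blockHists (blockIdx.x + gridDim.x * pos) means a single exclusive scan over the whole array gives each block its per-bin starting offsets.

    // Hypothetical driver for the shared-atomic split (names assumed):
    histkernel<<< numBlocks, NUMTHREADS, NUMBINS * sizeof(unsigned int) >>>(
            d_blockHists, d_gArray, d_category );             // Step 1: block histograms
    prefixSum( d_startIndex, d_blockHists,
               NUMBINS * numBlocks );                         // Step 2: parallel prefix sum
    scatterKernel<<< numBlocks, NUMTHREADS, NUMBINS * sizeof(unsigned int) >>>(
            d_outArray, d_gArray, d_category, d_startIndex ); // Step 3: each block scatters
                                                              // the same elements as Step 1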

Comparison of Split Methods

• Global atomic suffers for low numbers of categories
• Non-atomic can do a maximum of 64 categories in one pass (multiple passes for higher category counts)
• Shared atomic performs better than the other two GPU methods and the CPU for a wide range of categories
• Shared memory limits the maximum number of bins to 2048 (for power-of-2 bin counts)

Multi Level Split

• Bin counts higher than 2K are broken into sub-bins

• A hierarchy of bins is created and a split is performed at each level for the different sub-bins

• The number of splits to be performed grows exponentially with the number of levels

• With 2 levels we can perform a split for up to 4 million bins

[Figure: a 32-bit bin broken into 4 sub-bins of 8 bits each]
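Each level then splits on one 8-bit sub-bin. A small helper, assumed here for illustration, extracts the sub-bin for a given level in left-to-right (most-significant-first) order:

    // Hypothetical helper: level 0 is the most significant 8-bit sub-bin.
    __device__ unsigned int subBin( unsigned int bin, int level )
    {
        return ( bin >> ( 24 - level * 8 ) ) & 0xFFU;
    }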

Results for Bins up to 4 Million

Multi-Level Split performed on a GTX 280. Bins from 4K to 512K are handled with 2 passes; results for 1M and 2M bins over 1M elements are computed using 3 passes for better performance.

MLS :: Right to Left

• Using an iterative approach requires a constant number of splits at each level

• Highly scalable due to its iterative nature; an ideal number of bins can be chosen for best performance

• Dividing the bins from right to left requires preserving the order of elements from the previous pass

• The complete list of elements is re-arranged at each level

Ordered Atomic

• Atomic operations perform safe reads/writes by serializing the clashes, but do not guarantee the required order of operations

• An ordered atomic serializes the clashes in a fixed order provided by the user

• In case of a clash at higher levels in the Right-to-Left split, elements should be inserted in the order of their existing position in the list (see the sketch below)
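As a minimal sketch of the idea (not necessarily the exact mechanism used here), taking turns in increasing lane order serializes clashing updates in a fixed order; assuming earlier list elements occupy lower lanes, each element claims its offset in list order:

    // Sketch of an ordered update within a warp; localOffset and the
    // lane-to-position mapping are assumptions for illustration.
    unsigned int lane = threadIdx.x & ( WARPSIZE - 1 );
    unsigned int localOffset;
    for ( int j = 0; j < WARPSIZE; j++ )
        if ( lane == j )                          // lanes run in fixed order
            localOffset = s_Hist[curCategory]++;  // earlier element, smaller offset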

Split on 4 Billion Bins

• Right-to-Left split can be used for splitting integers into 4 billion bins (sorting?)

• Integers can be sorted on the desired number of bits (keys can be 8, 16, 24, or 32 bits long; 64-bit too)
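Structurally this is an LSD radix sort: a fixed number of stable 8-bit split passes from the least significant sub-bin upward. A host-side sketch, with splitPass as an assumed driver function:

    // Hypothetical Right-to-Left loop over 32-bit keys: four ordered
    // (stable) split passes of 8 bits each.
    for ( int level = 0; level < 4; level++ ) {
        int shift = level * 8;   // this pass splits on ( key >> shift ) & 0xFF
        splitPass( d_keys, numElements, shift, 256 /* bins per pass */ );
    }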

SplitSort Comparison with other GPU Sorting Implementations

Sorting 64 Bit numbers on the GPU

Conclusion

• Various histogram methods implemented on shared memory

• The split operation now handles millions and billions of bins using the Left-to-Right and Right-to-Left methods of Multi-Level Split

• The shared-memory split operation is faster and more scalable than the previous implementation (He et al.)

• The fastest sorting is achieved with the extension of split to billions of bins

• Variable bit-length sorting is helpful with keys of varying size (bit length)