GPGPU Assignment 2
String Matching
by Dominik Seifert, B97902122
Overview
- Data Alignment
- Hashtable
  - MurmurHash Function
  - "Stupid Parallel Hashing"
- Lookup
- The Complete Algorithm
- The Little Things
Data Alignment (1/3): The Alignment Trap
- x86 supports unaligned memory access, but GPUs don't!
- Word-sized (4-byte) loads require word-size-aligned pointers (see the sketch below)!
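To illustrate the trap, here is a minimal hypothetical CUDA sketch (not code from the assignment): casting an arbitrary byte offset to a 4-byte pointer works on x86 but is undefined on the GPU unless the address is a multiple of 4.

    // Hypothetical illustration of the alignment trap.
    // x86 tolerates the misaligned 4-byte load below; CUDA requires natural
    // alignment, so it may fault or return garbage unless offset % 4 == 0.
    __global__ void loadWordChunk(const char* corpus, int offset,
                                  unsigned int* out) {
        *out = *(const unsigned int*)(corpus + offset);  // UNSAFE if misaligned
    }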
Data Alignment (2/3)
- Copy all words (corpus & query each) into a new array consisting of 4-byte chunks
  - Improves memory access patterns
  - Allows us to always consider 4 bytes at a time
  - Needs more space, but who cares!
- Keep the old offsets and translate them to new offsets with:
  AlignedWordOffset = OrigWordOffset / 4 + WordIndex
- What's the size of the i'th string?
  strlen(i'th string) == offset(i+1) - offset(i) - 1
Data Alignment (3/3)
AlignedWordOffset = OrigWordOffset / 4 + WordIndex
NewSize = 4 x (TotalSize / 4 + WordCount)   (integer division)
Example:
TotalSize = 10, WordCount = 3
NewSize = 4 x (10 / 4 + 3) = 4 x (2 + 3) = 4 x 5 = 20
[Figure: original string (10 bytes) vs. aligned string (5 x 4 = 20 bytes)]
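A minimal host-side sketch of this repacking, using the two formulas above; all names (alignWords, alignedData, ...) are hypothetical, and it assumes the offset arrays have WordCount + 1 entries and a pre-zeroed destination buffer:

    #include <string.h>

    // Hypothetical sketch of the alignment copy. Each word is copied to its
    // translated offset (counted in 4-byte chunks); the zeroed destination
    // buffer supplies the NUL terminator and the padding bytes.
    void alignWords(const char* data, const int* offsets, int wordCount,
                    unsigned int* alignedData,  // NewSize / 4 chunks, zeroed
                    int* alignedOffsets)        // wordCount + 1 entries
    {
        for (int i = 0; i < wordCount; ++i) {
            int alignedOffset = offsets[i] / 4 + i;     // AlignedWordOffset
            int len = offsets[i + 1] - offsets[i] - 1;  // strlen of word i
            alignedOffsets[i] = alignedOffset;
            memcpy(alignedData + alignedOffset, data + offsets[i], len);
        }
        alignedOffsets[wordCount] = offsets[wordCount] / 4 + wordCount;
    }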
Hashtable: Motivation and Overview
- A hash is an index into an array that contains a value
- Hashtables are perfect for exact matching
  - Simple
  - Build time: O(1) per insertion
  - Lookup time: O(1)
- Databases use hashtables whenever they don't need to support range queries
  - Trees are too much work, slower, and way harder to parallelize
- Idea: build a hashtable of all corpus words, then search it for every query word
Hashtable: MurmurHash Function (1/2)
- Simple: only a few lines (available online)
- Fast: always considers 4 bytes at a time
- Conflict-resilient: very few strings have the same hash
- I improved it slightly for my case: removed the 6 lines that handle strings whose sizes are not divisible by 4 (all my aligned string sizes are divisible by 4)
- Largest bucket size for the corpus (found through trial & error): 4
  - A hashtable of the query strings would have a largest bucket size of 6
  - Inverting the lookup was slower!
Hashtable: MurmurHash Function (2/2)
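This slide showed the hash code itself. For reference, here is a sketch of the trimmed MurmurHash2 (standard published constants; the __device__ qualifier and the exact signature are my assumptions). The removed 6 lines are the tail-handling switch for lengths not divisible by 4:

    // Sketch of the trimmed MurmurHash2 (standard constants from the
    // published version). The tail-handling switch is gone because all
    // aligned string sizes are divisible by 4.
    __device__ unsigned int murmurHash2(const unsigned int* data, int len,
                                        unsigned int seed) {
        const unsigned int m = 0x5bd1e995;
        const int r = 24;
        unsigned int h = seed ^ len;         // len in bytes, divisible by 4
        while (len >= 4) {
            unsigned int k = *data++;
            k *= m; k ^= k >> r; k *= m;     // mix the next 4-byte chunk
            h *= m; h ^= k;
            len -= 4;
        }
        // (tail switch for the last 1-3 bytes removed here)
        h ^= h >> 13; h *= m; h ^= h >> 15;  // final avalanche
        return h;
    }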
Hashtable: Stupid Parallel Hashing (1/2)
- No space optimization constraint; available space: about 900 MB (not counting the space required for input & output)
- Outline:
  - Create H layers, each about 900/H MB in size (the size should be a prime number!)
  - A layer is an array that maps a hash to an index
  - For each layer L: place all previously conflicting words in L
  - Number of layers = largest bucket size = 4
- Conflicting parallel writes = race condition. CUDA C Programming Guide, section 4.1:
  "If a non-atomic instruction executed by a warp writes to the same location in global or shared memory for more than one of the threads of the warp, the number of serialized writes that occur to that location varies depending on the compute capability of the device and which thread performs the final write is undefined."
- One thread will always succeed!!! (see the kernel sketch below)
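A minimal sketch of how one such racy insertion round could look; the names and the split into a write pass and a check pass are my assumptions, not the assignment's actual code:

    // Hypothetical sketch of one "stupid parallel hashing" round.
    // pending[i] is nonzero while corpus word i still needs a slot.
    __global__ void insertIntoLayer(int* layer, int layerSize,
                                    const unsigned int* hashes,
                                    const int* pending, int wordCount) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= wordCount || !pending[i]) return;
        layer[hashes[i] % layerSize] = i;  // racy non-atomic write; one wins
    }

    __global__ void checkLayer(const int* layer, int layerSize,
                               const unsigned int* hashes,
                               int* pending, int wordCount) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= wordCount || !pending[i]) return;
        if (layer[hashes[i] % layerSize] == i)
            pending[i] = 0;  // our write survived: word is placed
        // losers stay pending and retry in the next layer
    }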
Hashtable: Stupid Parallel Hashing (2/2)
Note: rows = layers, columns = buckets
[Figure: input words cascading through Layer 1, Layer 2, and Layer 3; legend: occupied/conflicted, occupied, empty (-1)]
Lookup
Problems:
- Slowest kernel!
- Needs too many registers!
- Did not benefit from shared memory! (But it should have)
A hypothetical sketch of the kernel follows below.
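All names here are mine, and the equal-size flattened layer layout is an assumption (the slides suggest prime layer sizes): each thread walks its query word's slot through the layers until the stored corpus word matches or the slot is empty.

    // True if the 4-byte value contains a zero byte (the NUL terminator).
    __device__ bool hasZeroByte(unsigned int v) {
        return ((v - 0x01010101u) & ~v & 0x80808080u) != 0;
    }

    // Compare two aligned, NUL-terminated, zero-padded words 4 bytes at a time.
    __device__ bool wordsEqual(const unsigned int* a, const unsigned int* b) {
        for (;;) {
            unsigned int x = *a++, y = *b++;
            if (x != y) return false;
            if (hasZeroByte(x)) return true;  // reached the terminator
        }
    }

    // Hypothetical lookup: layers flattened into numLayers * layerSize ints;
    // offsets are in 4-byte chunks.
    __global__ void lookupKernel(const int* layers, int numLayers, int layerSize,
                                 const unsigned int* qHashes,
                                 const unsigned int* qData, const int* qOffsets,
                                 const unsigned int* cData, const int* cOffsets,
                                 int* results, int queryCount) {
        int q = blockIdx.x * blockDim.x + threadIdx.x;
        if (q >= queryCount) return;
        int slot = qHashes[q] % layerSize;
        results[q] = -1;                       // not found
        for (int l = 0; l < numLayers; ++l) {
            int w = layers[l * layerSize + slot];
            if (w < 0) break;                  // empty (-1): hash not present
            if (wordsEqual(qData + qOffsets[q], cData + cOffsets[w])) {
                results[q] = w;                // matching corpus word found
                break;
            }
        }
    }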
The Complete Algorithm
1. Align words into 4-byte chunks
2. Compute the hashes of all corpus words
3. For each hashtable layer L (4 in total): place all previously conflicting words in L
   - Use templates to determine the layer number (see the sketch below)
4. Look up the index for every query word in each layer L until the word matches or the current layer has no such hash
Four kernels in total.
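A sketch of the template trick from step 3 (launch configuration and names are hypothetical). Baking the layer number in as a compile-time parameter lets the compiler specialize each insertion pass; between passes, a check step like the one sketched earlier would update pending:

    // Hypothetical sketch: pending[i] holds the layer that still-unplaced
    // word i should be inserted into next.
    template <int LAYER>
    __global__ void insertPass(int* layers, int layerSize,
                               const unsigned int* hashes,
                               const int* pending, int wordCount) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= wordCount || pending[i] != LAYER) return;
        layers[LAYER * layerSize + hashes[i] % layerSize] = i;  // racy write
    }

    // Host side: one launch per layer, unrolled so LAYER is a constant.
    void buildLayers(int* dLayers, int layerSize, const unsigned int* dHashes,
                     const int* dPending, int wordCount) {
        int blocks = (wordCount + 255) / 256;
        insertPass<0><<<blocks, 256>>>(dLayers, layerSize, dHashes, dPending, wordCount);
        insertPass<1><<<blocks, 256>>>(dLayers, layerSize, dHashes, dPending, wordCount);
        insertPass<2><<<blocks, 256>>>(dLayers, layerSize, dHashes, dPending, wordCount);
        insertPass<3><<<blocks, 256>>>(dLayers, layerSize, dHashes, dPending, wordCount);
    }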
The Little Things (1/2)
A previous presenter inspired this idea:
- Init: allocate & memset all arrays once, using max sizes (sketch below)
- Cleanup: free all arrays
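A minimal sketch of that pattern (buffer names and sizes are hypothetical). A nice detail: memset with 0xFF bytes sets every int slot to -1, the "empty" marker from the hashtable slides:

    #include <cuda_runtime.h>

    int* dLayers = 0;  // hypothetical: all hashtable layers in one allocation

    // Init: allocate once up front using worst-case sizes; 0xFF bytes make
    // every int slot -1, i.e. "empty".
    void init(size_t maxTableBytes) {
        cudaMalloc((void**)&dLayers, maxTableBytes);
        cudaMemset(dLayers, 0xFF, maxTableBytes);
    }

    // Cleanup: free everything in one place at the end.
    void cleanup() {
        cudaFree(dLayers);
    }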
The Little Things (2/2)
Compare Words:
- I did not really use shared memory
- It did not improve performance, even though it should have, likely due to load balancing:
  - every thread reads roughly the average word size,
  - vs. some threads reading only 1 byte and some reading 100 bytes
- Did not investigate further since the speed was already very fast
References
- MurmurHash: https://sites.google.com/site/murmurhash/MurmurHash2.cpp?attredirects=0
- "Real-time Parallel Hashing on the GPU", ACM Transactions on Graphics (Proceedings of ACM SIGGRAPH Asia 2009), by Dan A. Alcantara, Andrei Sharf, Fatemeh Abbasinejad, Shubhabrata Sengupta, Michael Mitzenmacher, John D. Owens, and Nina Amenta
  - I took some ideas from it but did not implement it at all
  - Their method needs atomicAdd