GPGPU Assignment 2
String Matching
by Dominik Seifert, B97902122
Overview
- Data Alignment
- Hashtable
  - MurmurHash Function
  - "Stupid Parallel Hashing"
- Lookup
- The Complete Algorithm
- The Little Things
Data Alignment (1/3): The Alignment Trap
- x86 supports unaligned memory access, but GPUs don't!
- Word-sized (4-byte) loads require word-size-aligned pointers (see the sketch below)!
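To illustrate the trap, here is a minimal hypothetical CUDA sketch (not code from the assignment): casting an arbitrary byte offset to a 4-byte pointer works on x86 but is undefined on the GPU unless the address is a multiple of 4.

    // Hypothetical illustration of the alignment trap.
    // x86 tolerates the misaligned 4-byte load below; CUDA requires natural
    // alignment, so it may fault or return garbage unless offset % 4 == 0.
    __global__ void loadWordChunk(const char* corpus, int offset,
                                  unsigned int* out) {
        *out = *(const unsigned int*)(corpus + offset);  // UNSAFE if misaligned
    }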
Data Alignment (2/3)
- Copy all words (corpus & query each) into a new array consisting of 4-byte chunks
  - Improves memory access patterns
  - Allows us to always consider 4 bytes at a time
  - Needs more space, but who cares!
- Keep the old offsets and translate them to new offsets with:
  AlignedWordOffset = OrigWordOffset / 4 + WordIndex
- What's the size of the i'th string?
  strlen(i'th string) == offset(i+1) - offset(i) - 1
Data Alignment (3/3)
AlignedWordOffset = OrigWordOffset / 4 + WordIndex
NewSize = 4 x (TotalSize / 4 + WordCount)   (integer division)
Example:
TotalSize = 10, WordCount = 3
NewSize = 4 x (10 / 4 + 3) = 4 x (2 + 3) = 4 x 5 = 20
[Figure: original string (10 bytes) vs. aligned string (5 x 4 = 20 bytes)]
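A minimal host-side sketch of this repacking, using the two formulas above; all names (alignWords, alignedData, ...) are hypothetical, and it assumes the offset arrays have WordCount + 1 entries and a pre-zeroed destination buffer:

    #include <string.h>

    // Hypothetical sketch of the alignment copy. Each word is copied to its
    // translated offset (counted in 4-byte chunks); the zeroed destination
    // buffer supplies the NUL terminator and the padding bytes.
    void alignWords(const char* data, const int* offsets, int wordCount,
                    unsigned int* alignedData,  // NewSize / 4 chunks, zeroed
                    int* alignedOffsets)        // wordCount + 1 entries
    {
        for (int i = 0; i < wordCount; ++i) {
            int alignedOffset = offsets[i] / 4 + i;     // AlignedWordOffset
            int len = offsets[i + 1] - offsets[i] - 1;  // strlen of word i
            alignedOffsets[i] = alignedOffset;
            memcpy(alignedData + alignedOffset, data + offsets[i], len);
        }
        alignedOffsets[wordCount] = offsets[wordCount] / 4 + wordCount;
    }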
Hashtable: Motivation and Overview
- A hash is an index into an array that contains a value
- Hashtables are perfect for exact matching
  - Simple
  - Build time: O(1) per insertion
  - Lookup time: O(1)
- Databases use hashtables whenever they don't need to support range queries
  - Trees are too much work, slower, and way harder to parallelize
- Idea: build a hashtable of all corpus words, then search it for every query word
Hashtable: MurmurHash Function (1/2)
- Simple: only a few lines (available online)
- Fast: always considers 4 bytes at a time
- Conflict-resilient: very few strings have the same hash
- I improved it slightly for my case: removed the 6 lines that handle strings whose sizes are not divisible by 4 (all my aligned string sizes are divisible by 4)
- Largest bucket size for the corpus (found through trial & error): 4
  - A hashtable of the query strings would have a largest bucket size of 6
  - Inverting the lookup was slower!
Hashtable: MurmurHash Function (2/2)
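This slide showed the hash code itself. For reference, here is a sketch of the trimmed MurmurHash2 (standard published constants; the __device__ qualifier and the exact signature are my assumptions). The removed 6 lines are the tail-handling switch for lengths not divisible by 4:

    // Sketch of the trimmed MurmurHash2 (standard constants from the
    // published version). The tail-handling switch is gone because all
    // aligned string sizes are divisible by 4.
    __device__ unsigned int murmurHash2(const unsigned int* data, int len,
                                        unsigned int seed) {
        const unsigned int m = 0x5bd1e995;
        const int r = 24;
        unsigned int h = seed ^ len;         // len in bytes, divisible by 4
        while (len >= 4) {
            unsigned int k = *data++;
            k *= m; k ^= k >> r; k *= m;     // mix the next 4-byte chunk
            h *= m; h ^= k;
            len -= 4;
        }
        // (tail switch for the last 1-3 bytes removed here)
        h ^= h >> 13; h *= m; h ^= h >> 15;  // final avalanche
        return h;
    }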
Hashtable: Stupid Parallel Hashing (1/2)
- No space optimization constraint; available space: about 900 MB (not counting the space required for input & output)
- Outline:
  - Create H layers, each about 900/H MB in size (the size should be a prime number!)
  - A layer is an array that maps a hash to an index
  - For each layer L: place all previously conflicting words in L
  - Number of layers = largest bucket size = 4
- Conflicting parallel writes = race condition. CUDA C Programming Guide, section 4.1:
  "If a non-atomic instruction executed by a warp writes to the same location in global or shared memory for more than one of the threads of the warp, the number of serialized writes that occur to that location varies depending on the compute capability of the device and which thread performs the final write is undefined."
- One thread will always succeed!!! (see the kernel sketch below)
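A minimal sketch of how one such racy insertion round could look; the names and the split into a write pass and a check pass are my assumptions, not the assignment's actual code:

    // Hypothetical sketch of one "stupid parallel hashing" round.
    // pending[i] is nonzero while corpus word i still needs a slot.
    __global__ void insertIntoLayer(int* layer, int layerSize,
                                    const unsigned int* hashes,
                                    const int* pending, int wordCount) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= wordCount || !pending[i]) return;
        layer[hashes[i] % layerSize] = i;  // racy non-atomic write; one wins
    }

    __global__ void checkLayer(const int* layer, int layerSize,
                               const unsigned int* hashes,
                               int* pending, int wordCount) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= wordCount || !pending[i]) return;
        if (layer[hashes[i] % layerSize] == i)
            pending[i] = 0;  // our write survived: word is placed
        // losers stay pending and retry in the next layer
    }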
Hashtable: Stupid Parallel Hashing (2/2)
Note: rows = layers, columns = buckets
[Figure: input words cascading through Layer 1, Layer 2, and Layer 3; legend: occupied/conflicted, occupied, empty (-1)]
Lookup
Problems:
- Slowest kernel!
- Needs too many registers!
- Did not benefit from shared memory! (But it should have)
A hypothetical sketch of the kernel follows below.
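All names here are mine, and the equal-size flattened layer layout is an assumption (the slides suggest prime layer sizes): each thread walks its query word's slot through the layers until the stored corpus word matches or the slot is empty.

    // True if the 4-byte value contains a zero byte (the NUL terminator).
    __device__ bool hasZeroByte(unsigned int v) {
        return ((v - 0x01010101u) & ~v & 0x80808080u) != 0;
    }

    // Compare two aligned, NUL-terminated, zero-padded words 4 bytes at a time.
    __device__ bool wordsEqual(const unsigned int* a, const unsigned int* b) {
        for (;;) {
            unsigned int x = *a++, y = *b++;
            if (x != y) return false;
            if (hasZeroByte(x)) return true;  // reached the terminator
        }
    }

    // Hypothetical lookup: layers flattened into numLayers * layerSize ints;
    // offsets are in 4-byte chunks.
    __global__ void lookupKernel(const int* layers, int numLayers, int layerSize,
                                 const unsigned int* qHashes,
                                 const unsigned int* qData, const int* qOffsets,
                                 const unsigned int* cData, const int* cOffsets,
                                 int* results, int queryCount) {
        int q = blockIdx.x * blockDim.x + threadIdx.x;
        if (q >= queryCount) return;
        int slot = qHashes[q] % layerSize;
        results[q] = -1;                       // not found
        for (int l = 0; l < numLayers; ++l) {
            int w = layers[l * layerSize + slot];
            if (w < 0) break;                  // empty (-1): hash not present
            if (wordsEqual(qData + qOffsets[q], cData + cOffsets[w])) {
                results[q] = w;                // matching corpus word found
                break;
            }
        }
    }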
The Complete Algorithm
1. Align words into 4-byte chunks
2. Compute the hashes of all corpus words
3. For each hashtable layer L (4 in total): place all previously conflicting words in L
   - Use templates to determine the layer number (see the sketch below)
4. Look up the index for every query word in each layer L until the word matches or the current layer has no such hash
Four kernels in total.
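A sketch of the template trick from step 3 (launch configuration and names are hypothetical). Baking the layer number in as a compile-time parameter lets the compiler specialize each insertion pass; between passes, a check step like the one sketched earlier would update pending:

    // Hypothetical sketch: pending[i] holds the layer that still-unplaced
    // word i should be inserted into next.
    template <int LAYER>
    __global__ void insertPass(int* layers, int layerSize,
                               const unsigned int* hashes,
                               const int* pending, int wordCount) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= wordCount || pending[i] != LAYER) return;
        layers[LAYER * layerSize + hashes[i] % layerSize] = i;  // racy write
    }

    // Host side: one launch per layer, unrolled so LAYER is a constant.
    void buildLayers(int* dLayers, int layerSize, const unsigned int* dHashes,
                     const int* dPending, int wordCount) {
        int blocks = (wordCount + 255) / 256;
        insertPass<0><<<blocks, 256>>>(dLayers, layerSize, dHashes, dPending, wordCount);
        insertPass<1><<<blocks, 256>>>(dLayers, layerSize, dHashes, dPending, wordCount);
        insertPass<2><<<blocks, 256>>>(dLayers, layerSize, dHashes, dPending, wordCount);
        insertPass<3><<<blocks, 256>>>(dLayers, layerSize, dHashes, dPending, wordCount);
    }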
The Little Things (1/2)
A previous presenter inspired this idea:
- Init: allocate & memset all arrays once, using max sizes (sketch below)
- Cleanup: free all arrays
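A minimal sketch of that pattern (buffer names and sizes are hypothetical). A nice detail: memset with 0xFF bytes sets every int slot to -1, the "empty" marker from the hashtable slides:

    #include <cuda_runtime.h>

    int* dLayers = 0;  // hypothetical: all hashtable layers in one allocation

    // Init: allocate once up front using worst-case sizes; 0xFF bytes make
    // every int slot -1, i.e. "empty".
    void init(size_t maxTableBytes) {
        cudaMalloc((void**)&dLayers, maxTableBytes);
        cudaMemset(dLayers, 0xFF, maxTableBytes);
    }

    // Cleanup: free everything in one place at the end.
    void cleanup() {
        cudaFree(dLayers);
    }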
The Little Things (2/2)
Compare Words:
- I did not really use shared memory
- It did not improve performance, even though it should have, likely due to load balancing:
  - every thread reads roughly the average word size,
  - vs. some threads reading only 1 byte and some reading 100 bytes
- Did not investigate further since the speed was already very fast
References
- MurmurHash: https://sites.google.com/site/murmurhash/MurmurHash2.cpp?attredirects=0
- "Real-time Parallel Hashing on the GPU", ACM Transactions on Graphics (Proceedings of ACM SIGGRAPH Asia 2009), by Dan A. Alcantara, Andrei Sharf, Fatemeh Abbasinejad, Shubhabrata Sengupta, Michael Mitzenmacher, John D. Owens, and Nina Amenta
  - I took some ideas from it but did not implement it at all
  - Their method needs atomicAdd