GPGPU Assignment 2 String matching by Dominik Seifert B97902122

Upload: mary-owen

Post on 26-Dec-2015

Page 1: By Dominik Seifert B97902122. Overview Data Alignment Hashtable MurMurHash Function “Stupid Parallel Hashing” Lookup The Complete Algorithm The Little

GPGPU Assignment 2

String matching

by Dominik Seifert, B97902122

Overview

Data Alignment
Hashtable
  MurmurHash Function
  “Stupid Parallel Hashing”
Lookup
The Complete Algorithm
The Little Things

Data Alignment (1/3)
The Alignment Trap

x86 supports reading a word through an unaligned pointer
But GPUs don’t!
On the GPU, word-sized pointers must always be word-size-aligned!

Data Alignment (2/3)

Copy all words (corpus and query separately) into a new array consisting of 4-byte chunks
  Improves memory access patterns
  Allows us to always consider 4 bytes at a time
  Needs more space, but who cares!

Keep the old offsets and translate them to new offsets with:
  AlignedWordOffset = OrigWordOffset / 4 + WordIndex
  (integer division; the aligned offset counts 4-byte chunks)

What’s the size of the i’th string?
  strlen(i’th string) == offset(i+1) - offset(i) - 1
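The offset translation above can be sketched on the host in plain C++. This is only an illustration under assumptions: the names `AlignedWords`, `wordLength` and `alignWords` are hypothetical, the input is assumed to be null-terminated words stored back to back, and `offsets` is assumed to carry one extra entry holding the total size.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical container; the names are illustrative, not from the slides.
struct AlignedWords {
    std::vector<std::uint32_t> chunks;   // word data, padded with zero bytes
    std::vector<std::size_t>   offsets;  // start of each word, in 4-byte chunks
};

// Length of word i, given the byte offsets of the null-terminated words.
// offsets has wordCount + 1 entries; the last one equals the total size.
inline std::size_t wordLength(const std::vector<std::size_t>& offsets, std::size_t i) {
    return offsets[i + 1] - offsets[i] - 1;  // minus the terminating '\0'
}

// Copy every word to a 4-byte-aligned position in a new, zero-filled array.
inline AlignedWords alignWords(const char* data, const std::vector<std::size_t>& offsets) {
    const std::size_t wordCount = offsets.size() - 1;
    const std::size_t totalSize = offsets.back();

    AlignedWords out;
    // NewSize = 4 x (TotalSize / 4 + WordCount) bytes, i.e. this many chunks.
    out.chunks.assign(totalSize / 4 + wordCount, 0u);
    out.offsets.resize(wordCount);

    for (std::size_t i = 0; i < wordCount; ++i) {
        // AlignedWordOffset = OrigWordOffset / 4 + WordIndex (integer division)
        out.offsets[i] = offsets[i] / 4 + i;
        std::memcpy(reinterpret_cast<char*>(out.chunks.data()) + 4 * out.offsets[i],
                    data + offsets[i], wordLength(offsets, i));
    }
    return out;
}
```

For the three words "ab", "cde" and "fg" (10 bytes including terminators, byte offsets 0, 3, 7), the aligned offsets come out as chunks 0, 1 and 3, and the new array has 5 chunks, i.e. 20 bytes.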

Data Alignment (3/3)

AlignedWordOffset = OrigWordOffset / 4 + WordIndex
NewSize = 4 x (TotalSize / 4 + WordCount)

Example:
  TotalSize = 10
  WordCount = 3
  NewSize = 4 x (10 / 4 + 3) = 4 x 5 = 20

Original String (10 bytes):

Aligned String (5 x 4 = 20 bytes):

Hashtable
Motivation and overview

A hash is an index into an array that contains a value

Hashtables are perfect for exact matching
  Simple
  Build time: O(1) per word (expected)
  Lookup time: O(1) per word (expected)
  Databases typically use hashtables if they don’t need to support range queries
  Trees are too much work, slower and way harder to parallelize

Idea:
  Build a hashtable of all corpus words
  Search for every query word

Hashtable
MurmurHash Function (1/2)

Simple
  Only a few lines (available online)
Fast
  Always considers 4 bytes at a time
Conflict-resilient
  Very few strings have the same hash

I improved it slightly for my case:
  6 lines were removed which handle strings of sizes that are not divisible by 4 (since all my aligned string sizes are divisible by 4)

Largest bucket size for corpus (found out through trial & error): 4
Hashtable of query strings has largest bucket size 6
  Inverting the lookup (hashing the query strings and searching for the corpus words) was slower!

Hashtable
MurmurHash function (2/2)
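A minimal sketch of what the trimmed hash might look like, assuming the variant in question is the publicly available MurmurHash2 by Austin Appleby (the one linked in the references). The function name `murmurHash2Aligned` and the word-based signature are assumptions for illustration; the constants and mixing steps follow the public MurmurHash2 code, with the tail switch for the last 1 to 3 bytes left out.

```cpp
#include <cstddef>
#include <cstdint>

// MurmurHash2 restricted to inputs made of whole 32-bit words.
// The tail handling for a trailing 1-3 bytes is omitted, which is only
// valid because aligned strings never end in a partial word.
inline std::uint32_t murmurHash2Aligned(const std::uint32_t* words,
                                        std::size_t wordCount,
                                        std::uint32_t seed) {
    const std::uint32_t m = 0x5bd1e995u;
    const int r = 24;

    // The original mixes the length in bytes into the initial state.
    std::uint32_t h = seed ^ static_cast<std::uint32_t>(wordCount * 4);

    // Mix four bytes at a time into the hash.
    for (std::size_t i = 0; i < wordCount; ++i) {
        std::uint32_t k = words[i];
        k *= m;
        k ^= k >> r;
        k *= m;
        h *= m;
        h ^= k;
    }

    // Final avalanche so that the last words affect all bits.
    h ^= h >> 13;
    h *= m;
    h ^= h >> 15;
    return h;
}
```

The bucket for a word would then be `murmurHash2Aligned(...) % bucketCount`, with `bucketCount` ideally prime.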

Hashtable
Stupid Parallel Hashing (1/2)

No space optimization constraint
Available space: about 900 MB (without the space required for input & output)

Outline:
  Create H layers, each about 900/H MB in size (the size should be a prime number!)
  A layer is an array that maps hash to index
  For each layer L:
    Place all previously conflicting words in L
  Number of layers = largest bucket size: 4

Conflicting parallel writes = race condition
CUDA C Programming Guide, section 4.1:
  If a non-atomic instruction executed by a warp writes to the same location in global or shared memory for more than one of the threads of the warp, the number of serialized writes that occur to that location varies depending on the compute capability of the device and which thread performs the final write is undefined.

One thread will always succeed!!!
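The layered build can be sketched sequentially on the host. This is a stand-in for the parallel kernels, not the actual implementation: `buildLayers`, `Layers` and `kEmpty` are hypothetical names, and the "last writer wins" loop only imitates the undefined outcome of racing GPU writes.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::int32_t kEmpty = -1;  // marks an unused bucket

// layers[l][bucket] holds the index of the corpus word stored there, or kEmpty.
using Layers = std::vector<std::vector<std::int32_t>>;

// Sequential stand-in for the parallel build. hashes[i] is the hash of corpus
// word i. Returns the number of words that found no slot in any layer.
inline std::size_t buildLayers(const std::vector<std::uint32_t>& hashes,
                               std::size_t layerCount, std::size_t bucketCount,
                               Layers& layers) {
    layers.assign(layerCount, std::vector<std::int32_t>(bucketCount, kEmpty));

    std::vector<std::int32_t> pending(hashes.size());
    for (std::size_t i = 0; i < pending.size(); ++i)
        pending[i] = static_cast<std::int32_t>(i);

    for (std::size_t l = 0; l < layerCount && !pending.empty(); ++l) {
        std::vector<std::int32_t>& layer = layers[l];

        // Phase 1: every pending word writes its index without any locking.
        // On the GPU these writes race; here the last writer simply wins.
        for (std::int32_t w : pending)
            layer[hashes[w] % bucketCount] = w;

        // Phase 2: every word reads its bucket back. Words that lost the
        // race are carried over to the next layer.
        std::vector<std::int32_t> losers;
        for (std::int32_t w : pending)
            if (layer[hashes[w] % bucketCount] != w) losers.push_back(w);
        pending.swap(losers);
    }
    return pending.size();
}
```

Because exactly one writer per bucket survives in each layer, a bucket shared by k words is fully placed after k layers, which is why the number of layers equals the largest bucket size.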

Hashtable
Stupid Parallel Hashing (2/2)

Note:
  Rows = Layers
  Columns = Buckets

[Figure: the input words placed into Layer 1, Layer 2 and Layer 3. Legend: occupied / conflicted, occupied, empty (-1)]

Lookup

Problems
  Slowest kernel!
  Needs too many registers!
  Did not benefit from shared memory! (But should)
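For orientation, the per-word lookup logic might look roughly like the following host-side sketch. The names `WordRef`, `sameWord` and `lookup` are hypothetical, and the table layout (one array of corpus word indices per layer, -1 for empty) is assumed from the hashtable slides rather than taken from the actual kernel.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical view of one aligned word: a pointer to its 4-byte chunks
// and the number of chunks.
struct WordRef {
    const std::uint32_t* chunks;
    std::size_t chunkCount;
};

// Compare two aligned words four bytes at a time.
inline bool sameWord(const WordRef& a, const WordRef& b) {
    if (a.chunkCount != b.chunkCount) return false;
    for (std::size_t i = 0; i < a.chunkCount; ++i)
        if (a.chunks[i] != b.chunks[i]) return false;
    return true;
}

// Returns the index of the matching corpus word, or -1 if there is none.
// layers[l][bucket] is a corpus word index or -1 (empty).
inline std::int32_t lookup(const WordRef& query, std::uint32_t queryHash,
                           const std::vector<std::vector<std::int32_t>>& layers,
                           const std::vector<WordRef>& corpus) {
    for (const std::vector<std::int32_t>& layer : layers) {
        const std::int32_t candidate = layer[queryHash % layer.size()];
        if (candidate < 0) return -1;  // no such hash in this layer: stop
        if (sameWord(query, corpus[static_cast<std::size_t>(candidate)]))
            return candidate;
        // Same bucket but a different word: try the next layer.
    }
    return -1;
}
```

Stopping at the first empty bucket is safe under the layered build, since a word only moves to a deeper layer when its bucket in the current layer is already taken.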

The Complete Algorithm

1. Align words into 4-byte chunks
2. Compute hashes of all corpus words
3. For each hashtable layer L (total of 4):
   Place all previously conflicting words in L
   Use templates to determine the layer number
4. Look up the index for every query word in every layer L until the word stored there matches or the current layer has no such hash

Four kernels:

The Little Things (1/2)

A previous presenter inspired this idea:
  Init: Allocate & memset (using max sizes)
  Cleanup: Free all arrays
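One possible reading of the init/cleanup split, sketched on the host. Everything here is an assumption for illustration: the `Buffers` struct and function names are hypothetical, and `std::malloc`, `std::memset` and `std::free` stand in for `cudaMalloc`, `cudaMemset` and `cudaFree` so the sketch runs without a GPU.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Hypothetical buffer set holding all hashtable layers back to back.
struct Buffers {
    std::int32_t* layers = nullptr;
    std::size_t   layerEntries = 0;
};

// Init: allocate once for the largest input we expect and mark every bucket
// as empty. Filling with byte 0xFF makes each 32-bit entry read as -1.
inline bool initBuffers(Buffers& b, std::size_t maxLayerEntries) {
    const std::size_t bytes = maxLayerEntries * sizeof(std::int32_t);
    b.layers = static_cast<std::int32_t*>(std::malloc(bytes));
    if (b.layers == nullptr) return false;
    std::memset(b.layers, 0xFF, bytes);
    b.layerEntries = maxLayerEntries;
    return true;
}

// Cleanup: free all arrays in one place.
inline void freeBuffers(Buffers& b) {
    std::free(b.layers);
    b.layers = nullptr;
    b.layerEntries = 0;
}
```

Sizing for the maximum up front means the allocation cost is paid once, and a byte-wise memset is enough to produce the "empty (-1)" marker used by the layers.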

The Little Things (2/2)

Compare Words:
  I did not really use shared memory
  It did not improve performance even though it should have, due to load balancing
    Every thread reads roughly the average word size
    vs. some threads reading only 1 byte and some reading 100 bytes
  Did not investigate further since it was already very fast

References

MurmurHash: https://sites.google.com/site/murmurhash/MurmurHash2.cpp?attredirects=0

Real-time Parallel Hashing on the GPU
  ACM Transactions on Graphics (Proceedings of ACM SIGGRAPH Asia 2009)
  by Dan A. Alcantara, Andrei Sharf, Fatemeh Abbasinejad, Shubhabrata Sengupta, Michael Mitzenmacher, John D. Owens, and Nina Amenta
  I took some ideas from it but did not implement it at all
  Needs atomicAdd