parallel suffix array construction by accelerated sampling

Parallel Suffix Array Construction by Accelerated Sampling Matthew Felice Pace University of Warwick Joint work with Alexander Tiskin University of Warwick


Post on 25-Feb-2016


Page 1: Parallel Suffix Array Construction by Accelerated Sampling

Parallel Suffix Array Construction by Accelerated Sampling

Matthew Felice Pace University of Warwick

Joint work with Alexander Tiskin, University of Warwick

Page 2: Parallel Suffix Array Construction by Accelerated Sampling

Outline

• Introduction
• Difference Covers
• Sequential Suffix Array Construction
• Bulk-Synchronous Parallel (BSP) Model
• Suffix Array Construction in BSP
• Conclusion

Page 3: Parallel Suffix Array Construction by Accelerated Sampling

Introduction

• What is a suffix array?
• A data structure, denoted by $SA_x$, that holds the lexicographic order of all the suffixes of a given string $x$ of size $n$.

• Suffix array construction is related to sorting.
• The naïve solution is to radix sort all the suffixes in $O(n^2)$ time.

• We assume that the given string of size $n$ is over an integer alphabet or an indexed alphabet.

0 1 2 3 4 5 6 7 8 9

c a b d a b b d a $

8 4 1 5 6 2 0 7 3
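The naïve construction can be sketched in a few lines of Python. This is only an illustration: it uses Python's comparison sort on the suffixes rather than the radix sort mentioned on the slide, and the function name `naive_suffix_array` is an assumption of this sketch. Python orders `$` before the lowercase letters, so the sentinel suffix at index 9 comes first; the row above lists the remaining nine suffixes.

```python
def naive_suffix_array(x: str) -> list[int]:
    """Return the start indices of all suffixes of x in lexicographic order."""
    # Each comparison may scan O(n) characters, so this is far from linear time.
    return sorted(range(len(x)), key=lambda i: x[i:])


if __name__ == "__main__":
    # Example string from the slide, including the sentinel '$'.
    print(naive_suffix_array("cabdabbda$"))
```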

Page 4: Parallel Suffix Array Construction by Accelerated Sampling

Introduction

• Manber and Myers [1990] presented the first suffix array construction algorithm (SACA), running in $O(n \log n)$ time.

• Kärkkäinen and Sanders [2003], Kim et al. [2003], and Ko and Aluru [2003] all developed SACAs running in $O(n)$ time.

• Kärkkäinen et al. [2006] extended their algorithm to run on a $p$-processor BSP machine with optimal $O(n/p)$ local computation and communication costs, requiring $O(\log^2 p)$ supersteps.

• We reduce the number of supersteps required to $O(\log \log p)$ while preserving the optimal computation and communication costs.

Page 5: Parallel Suffix Array Construction by Accelerated Sampling

Introduction

• The idea behind the SACAs with linear worst-case running time is to use recursion:

1. Divide the indices of the input string $x$ into two nonempty disjoint sets.

2. Form strings $x'$ and $x''$ from the characters indexed by each set.

3. Recursively construct $SA_{x'}$.

4. Use $SA_{x'}$ to construct $SA_{x''}$.

5. Merge $SA_{x'}$ and $SA_{x''}$ to obtain $SA_x$.

0 1 2 3 4 5 6 7 8 9

c a b d a b b d a $

c d b $

a b a b d a $

Page 6: Parallel Suffix Array Construction by Accelerated Sampling

Difference Covers

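• Given a positive integer $v$, let $Z_v$ denote the set of integers $\{0, 1, \ldots, v-1\}$.

• Then $D \subseteq Z_v$ can be defined such that for any $z \in Z_v$, there exist $a, b \in D$ such that $z \equiv a - b \pmod{v}$.

• $D$ is known as a difference cover of $Z_v$.

• Let $v = 4$, i.e. $Z_4 = \{0, 1, 2, 3\}$, then we can have e.g. $D = \{1, 2, 3\}$, but not $D = \{0, 1\}$.

D = {1, 2, 3}           D = {0, 1}

0 ≡ 1 – 1 (mod 4)       0 ≡ 1 – 1 (mod 4)

1 ≡ 3 – 2 (mod 4)       1 ≡ 1 – 0 (mod 4)

2 ≡ 3 – 1 (mod 4)       3 ≡ 0 – 1 (mod 4)

3 ≡ 1 – 2 (mod 4)       2 ≡ ?? (no such pair)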
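The definition can be checked by brute force. The sketch below is an illustration only (the name `is_difference_cover` is an assumption, not from the slides): it collects every difference of two elements of a candidate set modulo v and tests whether all residues are reached. It reproduces the two mod 4 cases tabulated above.

```python
def is_difference_cover(d: set[int], v: int) -> bool:
    """True if every z in {0, ..., v-1} equals (a - b) mod v for some a, b in d."""
    differences = {(a - b) % v for a in d for b in d}
    return differences == set(range(v))


if __name__ == "__main__":
    print(is_difference_cover({1, 2, 3}, 4))  # left column of the table
    print(is_difference_cover({0, 1}, 4))     # right column: residue 2 is never reached
```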

Page 7: Parallel Suffix Array Construction by Accelerated Sampling

Difference Covers

• Colbourn and Ling [2000] give a method for computing a difference cover $D$ of $Z_v$, for any positive integer $v$, in time $O(\sqrt{v})$.
• $|D| \le \sqrt{1.5v} + 6$

• Lemma 1 [Kärkkäinen and Sanders 2003]
• If $D$ is a difference cover of $Z_v$, and $i$ and $j$ are integers, then there exists $l \in Z_v$ such that $(i + l) \bmod v$ and $(j + l) \bmod v$ are both in $D$.

• Let $v = 4$ and $D = \{1, 2, 3\}$, then

i     j     l     (i + l) mod 4           (j + l) mod 4

30    35    3     (30 + 3) mod 4 = 1      (35 + 3) mod 4 = 2

20    35    2     (20 + 2) mod 4 = 2      (35 + 2) mod 4 = 1
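A brute-force search makes Lemma 1 concrete. The following sketch (the helper name `valid_shifts` is an assumption) lists every shift l in Z_v that moves both i and j onto the difference cover. For the two rows above, taken modulo 4 with D = {1, 2, 3}, the shifts shown on the slide are among the results, although they are not the only valid ones.

```python
def valid_shifts(i: int, j: int, d: set[int], v: int) -> list[int]:
    """All l in {0, ..., v-1} with (i + l) mod v and (j + l) mod v both in d."""
    return [l for l in range(v) if (i + l) % v in d and (j + l) % v in d]


if __name__ == "__main__":
    d, v = {1, 2, 3}, 4
    print(valid_shifts(30, 35, d, v))  # contains the slide's choice l = 3
    print(valid_shifts(20, 35, d, v))  # contains the slide's choice l = 2
```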

Page 8: Parallel Suffix Array Construction by Accelerated Sampling

Sequential Suffix Array Construction

• Given a string $x$ of size $n$ and a positive integer $v$, we construct the suffix array as follows:

• Construct a difference cover $D$ of $Z_v$ (e.g. for $v = 3$, $D = \{1, 2\}$).

• Partition the set of indices into $v$ sets according to $i \bmod v$.

• Denote every character $x[i]$, $0 \le i < n$, such that $i \bmod v \in D$, as a sample character, and for each such character define a super-character corresponding to $x[i : i+v-1]$.

x[0] x[1] x[2] x[3] x[4] x[5] … x[n-1]

x[2] x[3] x[5] x[6] … -1

x[3] x[4] x[6] x[7] … -1
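As a rough illustration of the sampling step, the sketch below builds the super-characters for a string given as a list of integers. The parameters v = 3 and D = {1, 2}, the padding value -1 for positions past the end of the string, and the name `super_characters` are assumptions chosen to match the figure, not a definitive implementation.

```python
def super_characters(x: list[int], v: int, d: set[int], pad: int = -1) -> dict[int, tuple[int, ...]]:
    """Map each sample index i (i mod v in d) to the v characters starting at i."""
    padded = list(x) + [pad] * v          # pad so that every window has length v
    samples = [i for i in range(len(x)) if i % v in d]
    return {i: tuple(padded[i:i + v]) for i in samples}


if __name__ == "__main__":
    # "cabdabbda" encoded with a=0, b=1, c=2, d=3 (encoding is an assumption)
    x = [2, 0, 1, 3, 0, 1, 1, 3, 0]
    for i, sc in super_characters(x, v=3, d={1, 2}).items():
        print(i, sc)
```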

Page 9: Parallel Suffix Array Construction by Accelerated Sampling

Sequential Suffix Array Construction

• Construct the string $X$ of super-characters, of size approximately $|D| n / v$.

• Construct $X'$, identical to $X$ with each super-character replaced by its rank in the sorted list of super-characters.

• Recursively call the algorithm on string $X'$, with parameter $v'$.

• When the algorithm returns with $SA_{X'}$, fill the array rank with the rank of each suffix of $X'$.

x[1:3] x[4:6] … x[n-2:n] x[2:4] x[5:7] … x[n-1:n+1]

4 8 3 3 … 2
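The renaming step can be sketched as follows. The super-characters are listed by residue class (all sample positions of one class, then the next), as in the figure above, and each is replaced by its rank among the distinct super-characters. This sketch uses a comparison sort and 0-based ranks; the function name and the padding value -1 are assumptions. Equal super-characters receive equal ranks, which is why the recursive call is still needed.

```python
def rank_super_characters(x: list[int], v: int, d: set[int], pad: int = -1):
    """Return (sample positions, ranks of their super-characters)."""
    n = len(x)
    padded = list(x) + [pad] * v
    # Sample positions grouped by residue class k in d, then by position.
    positions = [i for k in sorted(d) for i in range(k, n, v)]
    supers = [tuple(padded[i:i + v]) for i in positions]
    # Rank among the distinct super-characters (ties share a rank).
    rank_of = {s: r for r, s in enumerate(sorted(set(supers)))}
    return positions, [rank_of[s] for s in supers]


if __name__ == "__main__":
    x = [2, 0, 1, 3, 0, 1, 1, 3, 0]   # "cabdabbda" with a=0, b=1, c=2, d=3
    print(rank_super_characters(x, v=3, d={1, 2}))
```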

Page 10: Parallel Suffix Array Construction by Accelerated Sampling

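Sequential Suffix Array Construction

• For each $k \in Z_v \setminus D$, find an $l_k \in Z_v$ such that $(k + l_k) \bmod v \in D$ (e.g. $v = 3$ and $D = \{1, 2\}$, then $l_0 = 1$).

• Then for each non-sample character $x[i]$, $i \bmod v = k$, define the tuple $(x[i], x[i+1], \ldots, x[i + l_k - 1], rank[i + l_k])$, and sort the tuples separately for each $k$.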

x[0] x[1] x[2] x[3] x[4] x[5] … x[n-1]

rank[1] rank[4] …
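A hedged sketch of this step: each non-sample suffix is represented by a few leading characters followed by the rank of the next sample suffix, and the tuples are sorted per residue class. To stay short, the ranks are obtained here by brute-force sorting as a stand-in for the recursive call, and rank 0 marks a position past the end of the string; both choices, and the function name, are assumptions of this illustration.

```python
def sort_non_sample_suffixes(x: str, v: int, d: set[int]) -> dict[int, list[int]]:
    """For each non-sample class k, return its suffix indices in sorted order."""
    n = len(x)
    # Stand-in for the recursion: brute-force ranks (only sample positions are read).
    order = sorted(range(n), key=lambda i: x[i:])
    rank = {i: r + 1 for r, i in enumerate(order)}
    result = {}
    for k in sorted(set(range(v)) - set(d)):
        # Smallest shift that moves class k onto a sample class.
        l_k = next(l for l in range(1, v) if (k + l) % v in d)
        tuples = [(tuple(x[i:i + l_k]), rank.get(i + l_k, 0), i)
                  for i in range(k, n, v)]
        result[k] = [i for *_, i in sorted(tuples)]
    return result


if __name__ == "__main__":
    print(sort_non_sample_suffixes("cabdabbda", v=3, d={1, 2}))
```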

Page 11: Parallel Suffix Array Construction by Accelerated Sampling

Sequential Suffix Array Construction

• Sort all the suffixes of $x$ by their first $v$ characters to get sets of suffixes having an identical prefix.

• Each set of suffixes with an identical prefix can be divided into subsets of suffixes whose order within the subset has already been found.

• Merge the subsets of each set of suffixes with identical prefixes, using Lemma 1.

• Suffix array is obtained in time   .

aaa aab …

x[0:2] x[10:12] x[5:7] x[12:14] x[1:3]

⁞ ⁞ ⁞ ⁞ ⁞
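The role of Lemma 1 in the merge can be illustrated with a comparator. In the sketch below, any two suffixes i and j are compared by finding a shift l that lands both on sample positions, comparing their first l characters, and then comparing the known ranks of the sample suffixes at i + l and j + l. This is a simplification under stated assumptions: it feeds the comparator to a general comparison sort instead of performing the bucket-and-merge procedure on the slide, and the sample ranks come from a brute-force sort standing in for the recursion.

```python
from functools import cmp_to_key


def dc_suffix_array(x: str, v: int, d: set[int]) -> list[int]:
    """Sort all suffixes of x using only sample-suffix ranks and short prefixes."""
    n = len(x)
    samples = [i for i in range(n) if i % v in d]
    # Stand-in for the recursion: ranks of sample suffixes only (0 = past the end).
    rank = {i: r + 1 for r, i in enumerate(sorted(samples, key=lambda i: x[i:]))}

    def compare(i: int, j: int) -> int:
        # Lemma 1 guarantees that such a shift exists when d is a difference cover.
        l = next(s for s in range(v) if (i + s) % v in d and (j + s) % v in d)
        a, b = x[i:i + l], x[j:j + l]
        if a != b:
            return -1 if a < b else 1
        return rank.get(i + l, 0) - rank.get(j + l, 0)

    return sorted(range(n), key=cmp_to_key(compare))


if __name__ == "__main__":
    print(dc_suffix_array("cabdabbda$", v=3, d={1, 2}))
```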

Page 12: Parallel Suffix Array Construction by Accelerated Sampling

Sequential Suffix Array Construction

• The size of the string decreases by a factor of $v/|D|$ in each level of recursion.

n → … → 1

• This requires $O(\log n)$ levels of recursion.

Page 13: Parallel Suffix Array Construction by Accelerated Sampling

BSP model

• Model developed to allow rigorous parallel algorithm design over diverse physical systems
• $p$ processors, each with local memory
• Global communication environment
• Barrier synchronisation

[Figure: processors (P), each with its own local memory (M), connected by a communication environment]

Page 14: Parallel Suffix Array Construction by Accelerated Sampling

BSP model

• A BSP machine is defined by 3 parameters
• p – number of processors
• g – inverse bandwidth of the network
• l – network latency

• Algorithms run in supersteps, each of which is measured by
• comp – maximum computation over all processors
• comm – maximum communication over all processors

• Total cost of an algorithm having S supersteps is $comp + comm \cdot g + S \cdot l$, where comp and comm are summed over all S supersteps.
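The standard BSP cost formula can be evaluated with a small helper. This is a sketch under the usual convention that the total cost is the summed computation, plus g times the summed communication, plus l per superstep; the function name and the sample numbers are assumptions for illustration only.

```python
def bsp_cost(supersteps: list[tuple[float, float]], g: float, l: float) -> float:
    """Total BSP cost for a list of (comp, comm) pairs, one pair per superstep."""
    comp = sum(c for c, _ in supersteps)   # maximum local computation, summed
    comm = sum(h for _, h in supersteps)   # maximum communication, summed
    return comp + comm * g + len(supersteps) * l


if __name__ == "__main__":
    # Two supersteps on a hypothetical machine with g = 2 and l = 20.
    print(bsp_cost([(100, 10), (50, 5)], g=2, l=20))
```

The last term shows why the number of supersteps matters: every barrier synchronisation costs l regardless of how little work the superstep performs.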

Page 15: Parallel Suffix Array Construction by Accelerated Sampling

Suffix Array Construction in BSP

• Sequential algorithm divided into four steps
• Three integer sorting steps
• Final merging step

• Integer sorting in BSP requires $O(1)$ supersteps with $O(n/p)$ comp and comm, using a technique called regular sampling [Chan and Dehne 1999].

• We can perform the final merging step using the same technique.

• Therefore, we can perform each level of recursion in $O(1)$ supersteps.

Page 16: Parallel Suffix Array Construction by Accelerated Sampling

Suffix Array Construction in BSP

• The size of the string decreases by a factor of $v/|D|$ in each level of recursion.

n

• This requires $O(\log p)$ levels, i.e. $O(\log p)$ supersteps.

Page 17: Parallel Suffix Array Construction by Accelerated Sampling

Suffix Array Construction in BSP

• However, by decreasing the sampling frequency at each level of recursion we can accelerate the rate by which the size of the input string in successive levels of recursion decreases.

• By setting   , the size of the input string converges towards   super-exponentially.

Page 18: Parallel Suffix Array Construction by Accelerated Sampling

Suffix Array Construction in BSP

Page 19: Parallel Suffix Array Construction by Accelerated Sampling

Suffix Array Construction in BSP

• Therefore, we only require $O(\log \log p)$ supersteps to construct the suffix array of a given string.
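The effect of acceleration on the number of recursion levels can be illustrated numerically. The sketch below does not use the schedule from the talk, which is not recoverable from this transcript; instead it assumes a hypothetical schedule in which the per-level shrink factor is raised to a fixed power `growth` after every level (`growth = 1` models a fixed sampling frequency). It counts the levels needed until the string has shrunk by a factor of at least p.

```python
def levels_needed(p: float, first_factor: float, growth: float) -> int:
    """Count levels until the cumulative shrink factor reaches at least p."""
    levels, shrink, factor = 0, 1.0, first_factor
    while shrink < p:
        shrink *= factor      # the string gets smaller by the current factor
        factor **= growth     # hypothetical acceleration of the next level
        levels += 1
    return levels


if __name__ == "__main__":
    p = 1024
    print("fixed sampling:      ", levels_needed(p, 1.5, 1.0))
    print("accelerated sampling:", levels_needed(p, 1.5, 1.25))
```

With a fixed factor the level count grows logarithmically in p, whereas a geometrically growing exponent makes it grow doubly logarithmically, which is the behaviour the accelerated sampling idea relies on.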

Page 20: Parallel Suffix Array Construction by Accelerated Sampling

Conclusion

• Presented an algorithm for constructing suffix arrays in parallel on a $p$-processor machine.

• Algorithm requires optimal $O(n/p)$ local computation and communication costs.

• Reduced the number of supersteps required to a near-optimal $O(\log \log p)$.

• Open questions
• Can we construct suffix arrays in $O(1)$ supersteps?
• Can we apply the accelerated sampling technique to other algorithms?

Page 21: Parallel Suffix Array Construction by Accelerated Sampling

Thank you!