Fine Tuning the Enhanced Suffix Arrays

Ayat A. Dawood, CIS, Nile University. Joint work with: Mohamed AbouelHoda


Post on 23-Feb-2016


Page 1: Fine Tuning the Enhanced Suffix Arrays

Ayat A. Dawood
CIS, Nile University
Joint work with: Mohamed AbouelHoda

Page 2: Fine Tuning the Enhanced Suffix Arrays

Table of Contents

- Suffix array
- The enhanced suffix array
- Our accomplishment:
    - Minimal perfect hashing function
    - The exact pattern matching problem
    - Improving the bucket table representation

Page 3: Fine Tuning the Enhanced Suffix Arrays

Suffix array

An array of integers in the range 0 to n, specifying the lexicographic order of the n+1 suffixes of S$.

e.g., S$ = acaaacatat$ (n = 10; the sentinel $ sorts after every other character)

S(Suftab[i])   Suftab   i
aaacatat$      2        0
aacatat$       3        1
acaaacatat$    0        2
acatat$        4        3
atat$          6        4
at$            8        5
caaacatat$     1        6
catat$         5        7
tat$           7        8
t$             9        9
$              10       10
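The table above can be reproduced with a few lines of Python. This is only a naive sketch that sorts the suffixes directly (quadratic in the worst case, not how a production suffix array is built); the function name `suffix_array` is made up for illustration, and the sort key encodes the assumption, visible in the table, that `$` sorts after every letter.

```python
def suffix_array(text):
    """Return the start positions of all suffixes of text in lexicographic order.

    Assumption taken from the slide's table: the sentinel '$' is larger
    than every other character.
    """
    def sort_key(pos):
        # (1, '') for '$' compares greater than (0, c) for any letter c.
        return [(1, "") if c == "$" else (0, c) for c in text[pos:]]

    return sorted(range(len(text)), key=sort_key)


TEXT = "acaaacatat$"
SUFTAB = suffix_array(TEXT)

for i, start in enumerate(SUFTAB):
    print(f"{TEXT[start:]:<12} {start:>2} {i:>2}")
```

Each printed row corresponds to one row of the Suftab table: the suffix, its start position, and its rank i.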


Page 5: Fine Tuning the Enhanced Suffix Arrays

Enhanced suffix array

Basically, it is the suffix array enhanced with a set of tables.

Using those tables, the best performance and time complexity are achieved.

lcptab[i] stores the length of the longest common prefix of the suffixes starting at Suftab[i-1] and Suftab[i]; lcptab[0] = 0.

S(Suftab[i])   lcptab   Suftab   i
aaacatat$      0        2        0
aacatat$       2        3        1
acaaacatat$    1        0        2
acatat$        3        4        3
atat$          1        6        4
at$            2        8        5
caaacatat$     0        1        6
catat$         2        5        7
tat$           0        7        8
t$             1        9        9
$              0        10       10
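The lcp column can be checked by comparing neighbouring suffixes character by character. The sketch below does exactly that and nothing smarter; the names `common_prefix_length` and `lcp_table` are illustrative, and linear-time constructions exist but are not shown here.

```python
TEXT = "acaaacatat$"
SUFTAB = [2, 3, 0, 4, 6, 8, 1, 5, 7, 9, 10]  # copied from the table above


def common_prefix_length(text, a, b):
    """Length of the longest common prefix of text[a:] and text[b:]."""
    k = 0
    while a + k < len(text) and b + k < len(text) and text[a + k] == text[b + k]:
        k += 1
    return k


def lcp_table(text, suftab):
    """lcptab[i] = lcp of the suffixes at suftab[i-1] and suftab[i]; lcptab[0] = 0."""
    lcptab = [0]
    for i in range(1, len(suftab)):
        lcptab.append(common_prefix_length(text, suftab[i - 1], suftab[i]))
    return lcptab


LCPTAB = lcp_table(TEXT, SUFTAB)
print(LCPTAB)
```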


Page 8: Fine Tuning the Enhanced Suffix Arrays

Enhanced suffix array: l-interval

L-interval: an interval of the suffix array whose suffixes share the same prefix. l-[i..j] denotes the interval from index i to index j whose suffixes share a longest common prefix of length l (suffix array and lcp-table as on Page 5).

0-[0..10]
    a: 1-[0..5]
        a: 2-[0..1]
        c: 3-[2..3]
        t: 2-[4..5]
    c: 2-[6..7]
    t: 1-[8..9]
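The intervals in the tree can be recovered from the lcp-table alone. The following brute-force sketch tests every candidate range [i..j]: it is an l-interval if l is the minimum of lcptab[i+1..j] and the lcp values just outside the range are smaller than l. Treating the positions before the first and after the last entry as -1 is an assumption made here so that the root 0-[0..n] is reported as well; the function name `lcp_intervals` is illustrative.

```python
LCPTAB = [0, 2, 1, 3, 1, 2, 0, 2, 0, 1, 0]  # lcp-table of acaaacatat$


def lcp_intervals(lcptab):
    """Return all l-intervals as (l, i, j) triples, found by brute force."""
    n = len(lcptab) - 1

    def boundary(k):
        # Assumption: outside the table the lcp value counts as -1.
        return -1 if k == 0 or k == n + 1 else lcptab[k]

    found = []
    for i in range(n):
        for j in range(i + 1, n + 1):
            l = min(lcptab[i + 1:j + 1])
            if boundary(i) < l and boundary(j + 1) < l:
                found.append((l, i, j))
    return found


for l, i, j in lcp_intervals(LCPTAB):
    print(f"{l}-[{i}..{j}]")
```

The quadratic scan is only meant to make the definition concrete; it is not how the enhanced suffix array computes its intervals.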

Page 9: Fine Tuning the Enhanced Suffix Arrays

Our accomplishment

Improvement (Fine Tuning):
- Alphabet-independent exact pattern matching.
- Improving the bucket table representation.
- Improving access to the lcp-table.

The improvements are achieved using minimal perfect hashing techniques.

Page 10: Fine Tuning the Enhanced Suffix Arrays

Minimal perfect hashing function (MPHF)

Stores n static keys from a universe U in O(n) space with O(1) access time. [Botelho et al.]

A plain lookup table requires O(|U|) space to achieve constant access time.
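To make the idea concrete, here is a small hash-and-displace style construction. It is not the algorithm of Botelho et al. cited on the slide, only a compact stand-in that shows the defining property: n keys are mapped to the n slots 0..n-1 without collisions. All names (`seeded_hash`, `build_mphf`, `mphf_index`) and the example keys are assumptions for illustration, and the sketch keeps one Python integer per slot rather than a few bits per key.

```python
import hashlib


def seeded_hash(seed, key):
    """Deterministic 64-bit hash of key; each seed gives a different function."""
    digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def build_mphf(keys):
    """Return a seed table that maps the distinct keys one-to-one onto 0..n-1."""
    n = len(keys)
    buckets = [[] for _ in range(n)]
    for key in keys:
        buckets[seeded_hash(0, key) % n].append(key)

    seeds = [0] * n
    occupied = [False] * n
    # Largest buckets first: they are the hardest to place.
    for bucket in sorted(buckets, key=len, reverse=True):
        if len(bucket) <= 1:
            break
        seed = 1
        while True:
            slots = [seeded_hash(seed, key) % n for key in bucket]
            if len(set(slots)) == len(slots) and not any(occupied[s] for s in slots):
                break
            seed += 1  # try the next hash function for this bucket
        seeds[seeded_hash(0, bucket[0]) % n] = seed
        for s in slots:
            occupied[s] = True

    # Single-key buckets take the remaining free slots; the slot is stored
    # directly, encoded as a negative number.
    free = [s for s in range(n) if not occupied[s]]
    for bucket in buckets:
        if len(bucket) == 1:
            seeds[seeded_hash(0, bucket[0]) % n] = -free.pop() - 1
    return seeds


def mphf_index(seeds, key):
    """Slot of key in 0..n-1 (only meaningful for keys used at build time)."""
    seed = seeds[seeded_hash(0, key) % len(seeds)]
    if seed < 0:
        return -seed - 1
    return seeded_hash(seed, key) % len(seeds)


KEYS = ["aa", "ac", "at", "ca", "ta"]  # illustrative static key set
SEEDS = build_mphf(KEYS)
print(sorted(mphf_index(SEEDS, key) for key in KEYS))
```

A key that was not in the build set still hashes to some slot, so a real table has to verify the hit, for example by storing the key or by checking against the text.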

Page 11: Fine Tuning the Enhanced Suffix Arrays

Exact pattern matching problem

e.g., pattern = aca

The search walks the l-interval tree from the root, one edge per step (suffix array and lcp-table as on Page 5):

0-[0..10]
    a: 1-[0..5]
        a: 2-[0..1]
        c: 3-[2..3]
        t: 2-[4..5]
    c: 2-[6..7]
    t: 1-[8..9]

Following a leads to 1-[0..5], and following c leads to 3-[2..3], whose suffixes share the prefix aca. The pattern therefore occurs at positions Suftab[2] = 0 and Suftab[3] = 4.
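The walk for pattern = aca can be sketched as a top-down search. The child intervals are found here by scanning the lcp-table, which is a simplification chosen to keep the example short (the enhanced suffix array uses a precomputed child table instead); `child_intervals` and `find_interval` are illustrative names.

```python
TEXT = "acaaacatat$"
SUFTAB = [2, 3, 0, 4, 6, 8, 1, 5, 7, 9, 10]
LCPTAB = [0, 2, 1, 3, 1, 2, 0, 2, 0, 1, 0]


def child_intervals(i, j, l):
    """Split l-[i..j] at every index k with LCPTAB[k] == l."""
    bounds = [i] + [k for k in range(i + 1, j + 1) if LCPTAB[k] == l] + [j + 1]
    return [(lo, hi - 1) for lo, hi in zip(bounds, bounds[1:])]


def find_interval(pattern):
    """Return the suffix array interval (i, j) of all matches, or None."""
    m = len(pattern)
    i, j, matched = 0, len(SUFTAB) - 1, 0  # start at the root 0-[0..n]
    while True:
        start = SUFTAB[i]
        # Shared prefix length of the interval (whole suffix for a singleton).
        l = len(TEXT) - start if i == j else min(LCPTAB[i + 1:j + 1])
        upto = min(l, m)
        if TEXT[start + matched:start + upto] != pattern[matched:upto]:
            return None
        matched = upto
        if matched == m:
            return (i, j)
        if i == j:
            return None  # pattern is longer than the only remaining suffix
        # Try each child: up to |alphabet| + 1 candidates per step.
        for ci, cj in child_intervals(i, j, l):
            if TEXT[SUFTAB[ci] + l] == pattern[l]:
                i, j = ci, cj
                break
        else:
            return None


hit = find_interval("aca")
print(hit, sorted(SUFTAB[k] for k in range(hit[0], hit[1] + 1)))
```

The inner loop over the children is where the alphabet-size factor in the running time comes from.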


Page 15: Fine Tuning the Enhanced Suffix Arrays

Exact pattern matching problem

- Using the normal method (scanning the text), it takes O(nm) time for a text of length n and a pattern of length m.
- Using the enhanced suffix arrays, it can be achieved in O(|Σ|m). [AbouElHoda et al.]
- Other modifications to the enhanced suffix arrays allow it to be done in O(m log |Σ|). [Kim et al.], [Fischer et al.]
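For contrast with the index-based bounds, the O(nm) baseline is just a scan that tries every start position. A minimal sketch (the name `naive_find` is illustrative):

```python
def naive_find(text, pattern):
    """Try every start position: up to n - m + 1 comparisons of length m."""
    n, m = len(text), len(pattern)
    return [p for p in range(n - m + 1) if text[p:p + m] == pattern]


print(naive_find("acaaacatat", "aca"))
```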

Page 16: Fine Tuning the Enhanced Suffix Arrays

Exact pattern matching problem

Our work: using the minimal perfect hashing technique, it can be achieved in O(m), removing the alphabet factor.

[Figure: the l-interval tree 0-[0..10]; 1-[0..5], 2-[6..7], 1-[8..9]; 2-[0..1], 3-[2..3], 2-[4..5], with edge labels a, c, t and two nodes annotated "MPHF table".]
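One way to picture the O(m) bound: if the child reached from an interval by a given character can be looked up in constant time, each pattern character is handled once. In the sketch below a Python `dict` stands in for the MPHF table (an assumption, since a dict is an ordinary hash table, not a minimal perfect one), keyed by (i, j, character) and storing the child interval together with its lcp value. All function names are illustrative.

```python
TEXT = "acaaacatat$"
SUFTAB = [2, 3, 0, 4, 6, 8, 1, 5, 7, 9, 10]
LCPTAB = [0, 2, 1, 3, 1, 2, 0, 2, 0, 1, 0]


def lcp_value(i, j):
    """Shared prefix length of [i..j]; for a singleton, the suffix length."""
    if i == j:
        return len(TEXT) - SUFTAB[i]
    return min(LCPTAB[i + 1:j + 1])


def build_child_table():
    """Preprocessing: map (i, j, first edge character) -> (ci, cj, child lcp)."""
    table = {}
    stack = [(0, len(SUFTAB) - 1)]
    while stack:
        i, j = stack.pop()
        if i == j:
            continue
        l = lcp_value(i, j)
        bounds = [i] + [k for k in range(i + 1, j + 1) if LCPTAB[k] == l] + [j + 1]
        for lo, hi in zip(bounds, bounds[1:]):
            child = (lo, hi - 1)
            table[(i, j, TEXT[SUFTAB[lo] + l])] = child + (lcp_value(*child),)
            stack.append(child)
    return table


def find_interval(pattern, table):
    """One constant-time table lookup per tree edge, O(m) character work overall."""
    m = len(pattern)
    i, j, matched = 0, len(SUFTAB) - 1, 0
    while matched < m:
        child = table.get((i, j, pattern[matched]))
        if child is None:
            return None
        i, j, l = child
        upto = min(l, m)
        start = SUFTAB[i]
        if TEXT[start + matched:start + upto] != pattern[matched:upto]:
            return None
        matched = upto
    return (i, j)


CHILD = build_child_table()
print(find_interval("aca", CHILD))
```

Because the set of (interval, character) keys is fixed once the index is built, it is the kind of static key set an MPHF is designed for.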


Page 19: Fine Tuning the Enhanced Suffix Arrays

Improving the bucket table representation

Bucket table: a table used to jump directly to a certain lcp-interval in the suffix array. Each prefix of length d (here d = 2) maps to the index i at which its interval starts; prefixes that do not occur in S have no entry (suffix array and lcp-table as on Page 5).

Prefix   Start index i
aa       0
ac       2
at       4
ag       -
ca       6
ct       -
cc       -
cg       -
ta       8
tc       -
tg       -
tt       -
ga       -
gt       -
gc       -
gg       -
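A direct-addressed bucket table for d = 2 can be sketched as follows. Every length-d string over the alphabet is encoded as a number in base |Σ|, and the table stores the first suffix array index whose suffix starts with that string. The alphabet order `acgt` and the helper names are assumptions for illustration.

```python
TEXT = "acaaacatat$"
SUFTAB = [2, 3, 0, 4, 6, 8, 1, 5, 7, 9, 10]
ALPHABET = "acgt"  # assumed order; any fixed order works


def encode(prefix):
    """Read the prefix as a number in base len(ALPHABET)."""
    code = 0
    for ch in prefix:
        code = code * len(ALPHABET) + ALPHABET.index(ch)
    return code


def build_bucket_table(d):
    """One slot per possible prefix: len(ALPHABET) ** d entries in total."""
    table = [None] * (len(ALPHABET) ** d)
    for i, start in enumerate(SUFTAB):
        prefix = TEXT[start:start + d]
        if len(prefix) == d and all(ch in ALPHABET for ch in prefix):
            slot = encode(prefix)
            if table[slot] is None:  # keep the first (smallest) index
                table[slot] = i
    return table


BUCKETS = build_bucket_table(2)
print(BUCKETS[encode("at")], sum(slot is not None for slot in BUCKETS), len(BUCKETS))
```

Only 5 of the 16 slots are used in this example, which is the waste the next slides address.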


Page 21: Fine Tuning the Enhanced Suffix Arrays

Improving the bucket table representation (cont'd)

Problem: the space consumption of the lookup table is prohibitive for large d and Σ (|Σ|^d entries).

Solution: use minimal perfect hashing techniques to store the lookup table.
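The growth of the lookup table is plain arithmetic: one slot per possible prefix of length d. A short sketch (the function name is illustrative) for a 4-letter and a 5-letter alphabet at d = 12:

```python
def lookup_table_entries(alphabet_size, d):
    """Number of slots in a direct-addressed bucket table."""
    return alphabet_size ** d


for sigma in (4, 5):
    print(sigma, lookup_table_entries(sigma, 12))
```

Adding a single character to the alphabet multiplies the table size by more than 14 at d = 12, whether or not the extra prefixes ever occur in the text.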

Page 22: Fine Tuning the Enhanced Suffix Arrays

Improving the bucket table representation (cont'd)

Results: for the bacterial E. coli genome (size = 5400 bp) and for d = 12

Alphabet size        No. of keys   Lookup table size in bits   MPHF size in bits   Reduction compared to lookup table
4 (A, T, C, G)       3474814       1677216                     7231956.638         46% reduction
5 (A, T, C, G, N*)   8451811       244140625                   17590331.64         93% reduction

* N for undefined nucleotide or dummy character
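Dividing the reported MPHF size by the number of keys gives the space per key. The sketch below uses only the figures from the table above; the variable and function names are illustrative.

```python
# (alphabet size, number of keys, MPHF size in bits), copied from the table
ROWS = [
    (4, 3474814, 7231956.638),
    (5, 8451811, 17590331.64),
]


def bits_per_key(keys, mphf_bits):
    return mphf_bits / keys


for sigma, keys, mphf_bits in ROWS:
    print(sigma, round(bits_per_key(keys, mphf_bits), 3))
```

Both rows come out at roughly 2.08 bits per key, so the MPHF size depends on how many prefixes actually occur and not on the alphabet size.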

Page 23: Fine Tuning the Enhanced Suffix Arrays

Conclusion

- Exact pattern matching problem.
- Improving the bucket table representation.
- Improving access to the lcp-table.

Page 24: Fine Tuning the Enhanced Suffix Arrays

Questions?

Page 25: Fine Tuning the Enhanced Suffix Arrays

Improving access to the lcp-table

To reduce space, each lcp-table entry is stored in 1 byte. If a common prefix is longer than 255, it is stored in another table (the extra lcp-table).

The extra table is accessed sequentially or using binary search.

Our enhancement: use an MPHF to store the extra table, so that it is accessed in constant time.

[Figure: the 1-byte lcp-table next to the extra lcp-table that holds the long entries.]
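The two-table layout can be sketched as below. Two assumptions are made here that the slide does not spell out: the byte value 255 is reserved as an "overflow" marker, so values of 255 and above go to the extra table, and a Python `dict` keyed by position stands in for the MPHF-backed extra table. The names and the sample lcp values are illustrative.

```python
OVERFLOW = 255  # assumed marker: "look in the extra table"


def compress_lcp(lcp_values):
    """Split lcp values into a 1-byte-per-entry table and an extra table."""
    small = bytearray()
    extra = {}  # position -> full lcp value; stand-in for the MPHF table
    for position, value in enumerate(lcp_values):
        if value >= OVERFLOW:
            small.append(OVERFLOW)
            extra[position] = value
        else:
            small.append(value)
    return bytes(small), extra


def lcp_at(small, extra, position):
    """Constant-time access: one byte read, plus one hash lookup on overflow."""
    value = small[position]
    return extra[position] if value == OVERFLOW else value


SAMPLE = [0, 2, 3, 2, 0, 279, 260]  # made-up values for illustration
SMALL, EXTRA = compress_lcp(SAMPLE)
print([lcp_at(SMALL, EXTRA, p) for p in range(len(SAMPLE))])
```

With sequential or binary search the overflow lookup costs time that depends on the size of the extra table; hashing on the position removes that dependence.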