Factor Oracle, Suffix Oracle

Upload: clement-rice

Post on 18-Dec-2015



Outline

Factor oracle definition

Construction methods

Suffix oracle

Factor oracle for a set of words

Applications in string matching


Data Structures that Represent the Factors of a String

Suffix trie – tree representing all the suffixes of the string.

Suffix automaton (DAWG) – the minimal automaton recognizing all the suffixes of the string.

Both the suffix automaton and the factor oracle can be obtained from the suffix trie.

Figure 1. Suffix Trie of the string abbc

Figure 2. Suffix Automaton of the string abbc

Figure 3. Factor Oracle for the string abbc
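To make the size comparison concrete, here is a minimal sketch of a suffix trie built as nested dictionaries. The helper names `suffix_trie`, `is_factor` and `count_nodes` are illustrative assumptions, not part of the original slides. Unlike the factor oracle, the trie recognizes exactly the factors of the string, but it needs one node per distinct factor.

```python
def suffix_trie(s):
    """Build the suffix trie of s as nested dicts (one dict per node)."""
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            # Reuse the child if it exists, otherwise create it.
            node = node.setdefault(ch, {})
    return root


def is_factor(trie, w):
    """Return True if w labels a path from the root, i.e. w is a factor."""
    node = trie
    for ch in w:
        if ch not in node:
            return False
        node = node[ch]
    return True


def count_nodes(node):
    """Count the nodes of the trie, root included."""
    return 1 + sum(count_nodes(child) for child in node.values())


trie = suffix_trie("abbc")
print(is_factor(trie, "bbc"), is_factor(trie, "ac"))  # True False
print(count_nodes(trie))  # 10 nodes, versus m + 1 = 5 states for the oracle
```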


Factor Oracle – Basic Ideas

The factor oracle is a data structure used for indexing all the factors of a given word.

It is an automaton built on a string p that acts like an oracle on the factors of p.

If a string is accepted by the automaton, it may be a factor of p, but this is not guaranteed (weak factor recognition).

Every factor of p is accepted, so a rejected string is certainly not a factor of p.


Factor Oracle – Example

Factor oracle for the string abbbaab. All states are considered final.

The word abba is accepted although it is not a factor of abbbaab.

Figure 4. Factor oracle for abbbaab

ba is a factor of baab so a transition from 2 to 5 by a is added
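The example can be checked with a small sketch. The transition table below is transcribed by hand from the construction trace for abbbaab given in this deck. The names `ORACLE` and `accepts` are illustrative assumptions.

```python
# Transitions of the factor oracle for "abbbaab" (states 0..7, all final),
# transcribed from the construction trace for this string.
ORACLE = {
    0: {"a": 1, "b": 2},
    1: {"b": 2, "a": 6},
    2: {"b": 3, "a": 5},
    3: {"b": 4, "a": 5},
    4: {"a": 5},
    5: {"a": 6},
    6: {"b": 7},
    7: {},
}


def accepts(word):
    """Follow the transitions from state 0; every state is final."""
    state = 0
    for ch in word:
        if ch not in ORACLE[state]:
            return False
        state = ORACLE[state][ch]
    return True


p = "abbbaab"
# All factors of p, including the empty word.
factors = {p[i:j] for i in range(len(p)) for j in range(i, len(p) + 1)}
assert all(accepts(f) for f in factors)  # every real factor is accepted

# abba is accepted (path 0, 1, 2, 3, 5) although it is not a factor of p.
print(accepts("abba"), "abba" in p)  # True False
```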


Factor Oracle – Formal Definition

Definition 1. The factor oracle of a string $p = p_1 p_2 \ldots p_m$ is the automaton built by the algorithm Build_Oracle, where all the states are terminal.

Figure 5. High level construction algorithm of Oracle(p). The algorithm has a quadratic time complexity.


Factor Oracle – Properties

1. Acyclic, homogeneous, deterministic automaton (homogeneous: all transitions entering a given state carry the same label).

2. Recognizes at least the factors of p, the string that it was built for.

3. Has the fewest states possible (for a string p of length m there are precisely m+1 states).

4. Has a linear number of transitions (the total number ranges between m and 2m-1).


Factor Oracle – Construction

In the sequential construction the letters of the word are read from left to right and the automaton is updated at each step.

We denote by $repet_p(i)$ the longest suffix of $pref_p(i) = p_1 p_2 \ldots p_i$ that appears at least twice in it.

We define a function $S_p$ on the states of the automaton, called the supply function, that maps each state i of Oracle(p) to the state j where the reading of $repet_p(i)$ ends.


Factor Oracle – Construction Algorithm

Build_Oracle_Sequential($p = p_1 p_2 \ldots p_m$)
create initial state 0, set $S_p(0) = -1$
for i = 1 to m do
    create new state i
    add a new transition from i-1 to i by $p_i$
    set $k = S_p(i-1)$
    while $k > -1$ and there is no transition from k by $p_i$ do
        add new transition from k to i by $p_i$
        set $k = S_p(k)$
    endwhile
    if k = -1 then set $S_p(i) = 0$
    else set $S_p(i) = \delta(k, p_i)$, the state reached from k by $p_i$
endfor
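The sequential construction can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the names `build_oracle`, `delta` and `supply` are assumptions, transitions are stored as one dictionary per state, and the string is 0-indexed, so the letter $p_i$ is `p[i - 1]`.

```python
def build_oracle(p):
    """Sequential construction of the factor oracle of p.

    Returns (delta, supply): delta[k] maps a letter to the target state of
    the transition leaving state k, and supply[i] is the supply function
    S_p(i), with supply[0] == -1.
    """
    m = len(p)
    delta = [dict() for _ in range(m + 1)]
    supply = [-1] * (m + 1)
    for i in range(1, m + 1):
        c = p[i - 1]
        delta[i - 1][c] = i              # transition from i-1 to i by p_i
        k = supply[i - 1]
        while k > -1 and c not in delta[k]:
            delta[k][c] = i              # extra transition from k to i by p_i
            k = supply[k]
        supply[i] = 0 if k == -1 else delta[k][c]
    return delta, supply


def accepts(delta, word):
    """All states are final, so a word is accepted iff it can be read."""
    state = 0
    for ch in word:
        state = delta[state].get(ch)
        if state is None:
            return False
    return True


delta, supply = build_oracle("abbbaab")
print(supply)  # [-1, 0, 0, 2, 3, 1, 1, 2]
```

Each state keeps a small dictionary here for readability. A fixed-size array indexed by letter would be a natural alternative for small alphabets.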


Construction of the Factor Oracle for the string abbbaab

$S_p(0) = -1$, $S_p(1) = 0$

$S_p(2) = 0$: add a new transition from 0 to 2 by b.

$S_p(3) = 2$: no new transition is needed.

$S_p(4) = 3$


Construction of the Factor Oracle for the string abbbaab

$S_p(5) = 1$: add new transitions from 3 and 2 to 5 by a.

$S_p(6) = 1$: add a new transition from 1 to 6 by a.

$S_p(7) = 2$: no new transition is needed.


Suffix Oracle - Definition

We mark some states in the factor oracle for the string p as final in order to recognize suffixes of p. The new structure is called suffix oracle.

A state q of the suffix oracle is terminal if and only if there is a path labeled by a suffix of p from the initial state leading to q.

Terminal states are determined by following the supply function from state m of Oracle($p = p_1 p_2 \ldots p_m$).
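Marking the terminal states can be sketched as a walk along the supply function. The function name `terminal_states` is an assumption, and the supply values below are the ones from the construction trace for abbbaab. The sketch also marks the initial state, on the assumption that the empty word counts as a suffix.

```python
def terminal_states(supply):
    """Follow the supply function from state m until it leaves the automaton.

    supply[i] is S_p(i), with supply[0] == -1.
    """
    terminals = set()
    state = len(supply) - 1          # state m
    while state != -1:
        terminals.add(state)
        state = supply[state]
    return terminals


# Supply function of the factor oracle for "abbbaab".
SUPPLY = [-1, 0, 0, 2, 3, 1, 1, 2]
print(sorted(terminal_states(SUPPLY)))  # [0, 2, 7]
```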


Suffix Oracle – Example

The suffix oracle is a little more complicated to implement than the factor oracle. Also, it requires more memory space.

Figure 6. Suffix oracle for the string abbbaab.

Double circled states are terminal.


Factor Oracle for a Set of Words

The factor oracle can be extended for a set of words so that it contains at least all the factors of the words from the set.

We fix an order on the words of the set, so that the resulting oracle is uniquely defined.

The oracle is built on a trie of all the words which is updated similarly to the factor oracle for one word.

The supply function maps each state i of the oracle to the state j where the reading of the longest repeated suffix that appears in one of the words ends.


Factor Oracle for a Set of Words Example

Figure 7. Trie for the set {abbba, baaa}

Figure 8. Intermediate phase in the construction of the factor oracle for the set {abbba, baaa}

Figure 9. Factor oracle for the set {abbba, baaa}


Backward Oracle Matching Algorithm

Version of the BDM (Backward DAWG Matching) algorithm that uses the factor oracle instead of the suffix automaton.

Fast in practice for very long patterns and small alphabets.

Preprocessing is linear in both time and space.

Conjectured to be optimal on average.


BOM – Main Idea

The search uses the oracle of the reversed pattern and reads the search window from right to left.

The search stops when the word read so far is no longer recognized by the oracle, which shows that it is certainly not a factor of the reversed pattern.

The search window is shifted beyond the point where the search failed (safe shift).
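The idea can be sketched in Python as follows. This is a hedged illustration rather than a tuned implementation: the names `build_oracle` and `bom_search` are assumptions, the oracle construction is repeated so that the block is self-contained, and the explicit comparison before reporting a match is a defensive check added for this sketch.

```python
def build_oracle(p):
    """Sequential factor oracle construction; returns the transition table."""
    m = len(p)
    delta = [dict() for _ in range(m + 1)]
    supply = [-1] * (m + 1)
    for i in range(1, m + 1):
        c = p[i - 1]
        delta[i - 1][c] = i
        k = supply[i - 1]
        while k > -1 and c not in delta[k]:
            delta[k][c] = i
            k = supply[k]
        supply[i] = 0 if k == -1 else delta[k][c]
    return delta


def bom_search(text, pattern):
    """Return the start positions of all occurrences of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    delta = build_oracle(pattern[::-1])   # oracle of the reversed pattern
    hits = []
    pos = 0                               # left end of the search window
    while pos <= n - m:
        state = 0
        j = m
        # Read the window from right to left until the oracle fails.
        while j > 0 and state is not None:
            state = delta[state].get(text[pos + j - 1])
            j -= 1
        if state is not None and text[pos:pos + m] == pattern:
            hits.append(pos)
        # On failure, text[pos + j] is the letter that could not be read:
        # move the window just past it. After a full read, j == 0 and the
        # window moves by one position.
        pos += j + 1
    return hits


print(bom_search("abracadabra abra", "abra"))  # [0, 7, 12]
```

The shift is safe because the rejected part of the window is not a factor of the pattern, so no occurrence can contain it.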


BOM – Facts

The suffix oracle of the reversed pattern can be used instead of the factor oracle. The shifts are longer but there are more operations needed.

Worst case complexity of BOM is O(mn), where m is the length of the pattern, and n the total length of the text.

Because the factor oracle accepts some words that are not really factors of the pattern, in some cases the total number of inspections is greater than in BDM.

TurboBOM combines BOM with KMP to obtain an algorithm linear in the worst case.


Factor Oracle – Applications

Finding the repeats in a string

Data compression

Bioinformatics

Machine improvisation


Factor Oracle – Open Problems

What is the automaton-independent characterization of the language recognized by the oracle?

Figure 10. The factor oracle for the string abbb accepts exactly all the factors of the string.


Factor Oracle – Open Problems

The factor oracle is not minimal in its number of transitions among the homogeneous automata that recognize at least the factors of the string.

Figure 11. The factor oracle for the string abcacdace has 8 extra transitions

Figure 12. A similar automaton with 7 extra transitions

