constant-delay enumeration for nondeterministic document ... · constant-delay enumeration for...

113
Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli , Pierre Bourhis , Stefan Mengel , Matthias Niewerth March th, Télécom ParisTech CNRS CRIStAL CNRS CRIL Universität Bayreuth /

Upload: others

Post on 06-Oct-2020

12 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Constant-Delay Enumeration for NondeterministicDocument Spanners

Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3, Matthias Niewerth4

March 27th, 20191Télécom ParisTech

2CNRS CRIStAL

3CNRS CRIL

4Universität Bayreuth1/16

Page 2: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Problem: Finding Patterns in Text

• We have a long text T:Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07.French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and [email protected] Affiliation Associate professor of computer science (office C201-4) in the DIG team ofTélécom ParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer scienceawarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure.More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git ...

• We want to find a pattern P in the text T:→ Example: find email addresses

• Write the pattern as a regular expression:

P := [a-z0-9.]∗ @ [a-z0-9.]∗

→ How to find the pattern P eciently in the text T?

2/16

Page 3: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Problem: Finding Patterns in Text

• We have a long text T:Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07.French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and [email protected] Affiliation Associate professor of computer science (office C201-4) in the DIG team ofTélécom ParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer scienceawarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure.More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git ...

• We want to find a pattern P in the text T:→ Example: find email addresses

• Write the pattern as a regular expression:

P := [a-z0-9.]∗ @ [a-z0-9.]∗

→ How to find the pattern P eciently in the text T?

2/16

Page 4: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Problem: Finding Patterns in Text

• We have a long text T:Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07.French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and [email protected] Affiliation Associate professor of computer science (office C201-4) in the DIG team ofTélécom ParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer scienceawarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure.More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git ...

• We want to find a pattern P in the text T:→ Example: find email addresses

• Write the pattern as a regular expression:

P := [a-z0-9.]∗ @ [a-z0-9.]∗

→ How to find the pattern P eciently in the text T?

2/16

Page 5: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Problem: Finding Patterns in Text

• We have a long text T:Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07.French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and [email protected] Affiliation Associate professor of computer science (office C201-4) in the DIG team ofTélécom ParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer scienceawarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure.More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git ...

• We want to find a pattern P in the text T:→ Example: find email addresses

• Write the pattern as a regular expression:

P := [a-z0-9.]∗ @ [a-z0-9.]∗

→ How to find the pattern P eciently in the text T?

2/16

Page 6: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Solution: Automata

• Convert the regular expression P to an automaton A

P := [a-z0-9.]∗ @ [a-z0-9.]∗

1start 2 3 4

• •

[a-z0-9.] [a-z0-9.]

@

• Then, evaluate the automaton on the text TE m a i l a 3 n m @ a 3 n m . n e t A f f i l i a t i o n

• The complexity is O(|A| × |T|), i.e., linear in T and polynomial in P→ This is very ecient in T and reasonably ecient in P

3/16

Page 7: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Solution: Automata

• Convert the regular expression P to an automaton A

P := [a-z0-9.]∗ @ [a-z0-9.]∗

1start 2 3 4

• •

[a-z0-9.] [a-z0-9.]

@

• Then, evaluate the automaton on the text TE m a i l a 3 n m @ a 3 n m . n e t A f f i l i a t i o n

• The complexity is O(|A| × |T|), i.e., linear in T and polynomial in P→ This is very ecient in T and reasonably ecient in P

3/16

Page 8: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Solution: Automata

• Convert the regular expression P to an automaton A

P := [a-z0-9.]∗ @ [a-z0-9.]∗

1start 2 3 4

• •

[a-z0-9.] [a-z0-9.]

@

• Then, evaluate the automaton on the text TE m a i l a 3 n m @ a 3 n m . n e t A f f i l i a t i o n

• The complexity is O(|A| × |T|), i.e., linear in T and polynomial in P→ This is very ecient in T and reasonably ecient in P

3/16

Page 9: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Solution: Automata

• Convert the regular expression P to an automaton A

P := [a-z0-9.]∗ @ [a-z0-9.]∗

1start 2 3 4

• •

[a-z0-9.] [a-z0-9.]

@

• Then, evaluate the automaton on the text T

E m a i l a 3 n m @ a 3 n m . n e t A f f i l i a t i o n

• The complexity is O(|A| × |T|), i.e., linear in T and polynomial in P→ This is very ecient in T and reasonably ecient in P

3/16

Page 10: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Solution: Automata

• Convert the regular expression P to an automaton A

P := [a-z0-9.]∗ @ [a-z0-9.]∗

1start 2 3 4

• •

[a-z0-9.] [a-z0-9.]

@

• Then, evaluate the automaton on the text TE m a i l a 3 n m @ a 3 n m . n e t A f f i l i a t i o n

• The complexity is O(|A| × |T|), i.e., linear in T and polynomial in P

→ This is very ecient in T and reasonably ecient in P

3/16

Page 11: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Solution: Automata

• Convert the regular expression P to an automaton A

P := [a-z0-9.]∗ @ [a-z0-9.]∗

1start 2 3 4

• •

[a-z0-9.] [a-z0-9.]

@

• Then, evaluate the automaton on the text TE m a i l a 3 n m @ a 3 n m . n e t A f f i l i a t i o n

• The complexity is O(|A| × |T|), i.e., linear in T and polynomial in P→ This is very ecient in T and reasonably ecient in P

3/16

Page 12: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Actual Problem: Extracting all Patterns

• This only tests if the pattern occurs in the text!→ “YES”

• Goal: find all substrings in the text T which match the pattern P0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

E m a i l A f f i l i a t i o n

→ One match: [5, 20〉

4/16

Page 13: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Actual Problem: Extracting all Patterns

• This only tests if the pattern occurs in the text!→ “YES”

• Goal: find all substrings in the text T which match the pattern P

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

E m a i l A f f i l i a t i o n

→ One match: [5, 20〉

4/16

Page 14: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Actual Problem: Extracting all Patterns

• This only tests if the pattern occurs in the text!→ “YES”

• Goal: find all substrings in the text T which match the pattern P0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

E m a i l a 3 n m @ a 3 n m . n e t A f f i l i a t i o n

→ One match: [5, 20〉

4/16

Page 15: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Actual Problem: Extracting all Patterns

• This only tests if the pattern occurs in the text!→ “YES”

• Goal: find all substrings in the text T which match the pattern P0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

E m a i l a 3 n m @ a 3 n m . n e t A f f i l i a t i o n

→ One match: [5, 20〉

4/16

Page 16: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Formal Problem Statement

• Problem description:

• Input:• A text T

Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. Frenchnational. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP [email protected] Associate professor of computer science (office C201-4) in the DIG team of TélécomParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded byTélécom ParisTech on March 14, 2016. Former student of the École normale supérieure. [email protected] Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git ...

• A pattern P given as a regular expression

P := [a-z0-9.]∗ @ [a-z0-9.]∗

• Output: the list of substrings of T that match P:[186, 200〉, [483, 500〉, . . .

• Goal: be very ecient in T and reasonably ecient in P

5/16

Page 17: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Formal Problem Statement

• Problem description:• Input:

• A text TAntoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. Frenchnational. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP [email protected] Associate professor of computer science (office C201-4) in the DIG team of TélécomParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded byTélécom ParisTech on March 14, 2016. Former student of the École normale supérieure. [email protected] Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git ...

• A pattern P given as a regular expression

P := [a-z0-9.]∗ @ [a-z0-9.]∗

• Output: the list of substrings of T that match P:[186, 200〉, [483, 500〉, . . .

• Goal: be very ecient in T and reasonably ecient in P

5/16

Page 18: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Formal Problem Statement

• Problem description:• Input:

• A text TAntoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. Frenchnational. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP [email protected] Associate professor of computer science (office C201-4) in the DIG team of TélécomParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded byTélécom ParisTech on March 14, 2016. Former student of the École normale supérieure. [email protected] Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git ...

• A pattern P given as a regular expression

P := [a-z0-9.]∗ @ [a-z0-9.]∗

• Output: the list of substrings of T that match P:[186, 200〉, [483, 500〉, . . .

• Goal: be very ecient in T and reasonably ecient in P

5/16

Page 19: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Formal Problem Statement

• Problem description:• Input:

• A text TAntoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. Frenchnational. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP [email protected] Associate professor of computer science (office C201-4) in the DIG team of TélécomParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded byTélécom ParisTech on March 14, 2016. Former student of the École normale supérieure. [email protected] Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git ...

• A pattern P given as a regular expression

P := [a-z0-9.]∗ @ [a-z0-9.]∗

• Output: the list of substrings of T that match P:[186, 200〉, [483, 500〉, . . .

• Goal: be very ecient in T and reasonably ecient in P

5/16

Page 20: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Formal Problem Statement

• Problem description:• Input:

• A text TAntoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. Frenchnational. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP [email protected] Associate professor of computer science (office C201-4) in the DIG team of TélécomParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded byTélécom ParisTech on March 14, 2016. Former student of the École normale supérieure. [email protected] Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git ...

• A pattern P given as a regular expression

P := [a-z0-9.]∗ @ [a-z0-9.]∗

• Output: the list of substrings of T that match P:[186, 200〉, [483, 500〉, . . .

• Goal: be very ecient in T and reasonably ecient in P

5/16

Page 21: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Formal Problem Statement

• Problem description:• Input:

• A text TAntoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. Frenchnational. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP [email protected] Associate professor of computer science (office C201-4) in the DIG team of TélécomParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded byTélécom ParisTech on March 14, 2016. Former student of the École normale supérieure. [email protected] Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git ...

• A sequential document spanner P given as a regular expression

P := x`[a-z0-9.]∗ @ [a-z0-9.]∗a x

• Output: the list of tuples of substrings of T that match P:[186, 200〉, [483, 500〉, . . .

• Goal: be very ecient in T and reasonably ecient in P

5/16

Page 22: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Formal Problem Statement

• Problem description:• Input:

• A text TAntoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07. Frenchnational. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and XMPP [email protected] Associate professor of computer science (office C201-4) in the DIG team of TélécomParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer science awarded byTélécom ParisTech on March 14, 2016. Former student of the École normale supérieure. [email protected] Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git ...

• A pattern P given as a regular expression

P := [a-z0-9.]∗ @ [a-z0-9.]∗

• Output: the list of substrings of T that match P:[186, 200〉, [483, 500〉, . . .

• Goal: be very ecient in T and reasonably ecient in P

5/16

Page 23: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Measuring the Complexity

• Naive algorithm: Run the automaton A on each substring of T

[〉

l

[〉

o

[〉

l

[〉→ Complexity is O(|T|2 × |A| × |T|)→ Can be optimized to O(|T|2 × |A|)

• Problem: We may need to output Ω(|T|2) matching substrings:• Consider the text T:

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

• Consider the pattern P := a∗

• The number of matches is Ω(|T|2)

→ We need a dierent way to measure complexity

6/16

Page 24: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Measuring the Complexity

• Naive algorithm: Run the automaton A on each substring of T[〉 l

[〉

o

[〉

l

[〉→ Complexity is O(|T|2 × |A| × |T|)→ Can be optimized to O(|T|2 × |A|)

• Problem: We may need to output Ω(|T|2) matching substrings:• Consider the text T:

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

• Consider the pattern P := a∗

• The number of matches is Ω(|T|2)

→ We need a dierent way to measure complexity

6/16

Page 25: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Measuring the Complexity

• Naive algorithm: Run the automaton A on each substring of T[

l

[

〉 o

[〉

l

[〉→ Complexity is O(|T|2 × |A| × |T|)→ Can be optimized to O(|T|2 × |A|)

• Problem: We may need to output Ω(|T|2) matching substrings:• Consider the text T:

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

• Consider the pattern P := a∗

• The number of matches is Ω(|T|2)

→ We need a dierent way to measure complexity

6/16

Page 26: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Measuring the Complexity

• Naive algorithm: Run the automaton A on each substring of T[

l

[〉

o

[

〉 l

[〉→ Complexity is O(|T|2 × |A| × |T|)→ Can be optimized to O(|T|2 × |A|)

• Problem: We may need to output Ω(|T|2) matching substrings:• Consider the text T:

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

• Consider the pattern P := a∗

• The number of matches is Ω(|T|2)

→ We need a dierent way to measure complexity

6/16

Page 27: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Measuring the Complexity

• Naive algorithm: Run the automaton A on each substring of T[

l

[〉

o

[〉

l

[

→ Complexity is O(|T|2 × |A| × |T|)→ Can be optimized to O(|T|2 × |A|)

• Problem: We may need to output Ω(|T|2) matching substrings:• Consider the text T:

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

• Consider the pattern P := a∗

• The number of matches is Ω(|T|2)

→ We need a dierent way to measure complexity

6/16

Page 28: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Measuring the Complexity

• Naive algorithm: Run the automaton A on each substring of T

[〉

l [〉 o

[〉

l

[〉→ Complexity is O(|T|2 × |A| × |T|)→ Can be optimized to O(|T|2 × |A|)

• Problem: We may need to output Ω(|T|2) matching substrings:• Consider the text T:

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

• Consider the pattern P := a∗

• The number of matches is Ω(|T|2)

→ We need a dierent way to measure complexity

6/16

Page 29: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Measuring the Complexity

• Naive algorithm: Run the automaton A on each substring of T

[〉

l [

o

[

〉 l

[〉→ Complexity is O(|T|2 × |A| × |T|)→ Can be optimized to O(|T|2 × |A|)

• Problem: We may need to output Ω(|T|2) matching substrings:• Consider the text T:

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

• Consider the pattern P := a∗

• The number of matches is Ω(|T|2)

→ We need a dierent way to measure complexity

6/16

Page 30: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Measuring the Complexity

• Naive algorithm: Run the automaton A on each substring of T

[〉

l [

o

[〉

l

[

→ Complexity is O(|T|2 × |A| × |T|)→ Can be optimized to O(|T|2 × |A|)

• Problem: We may need to output Ω(|T|2) matching substrings:• Consider the text T:

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

• Consider the pattern P := a∗

• The number of matches is Ω(|T|2)

→ We need a dierent way to measure complexity

6/16

Page 31: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Measuring the Complexity

• Naive algorithm: Run the automaton A on each substring of T

[〉

l

[〉

o [〉 l

[〉→ Complexity is O(|T|2 × |A| × |T|)→ Can be optimized to O(|T|2 × |A|)

• Problem: We may need to output Ω(|T|2) matching substrings:• Consider the text T:

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

• Consider the pattern P := a∗

• The number of matches is Ω(|T|2)

→ We need a dierent way to measure complexity

6/16

Page 32: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Measuring the Complexity

• Naive algorithm: Run the automaton A on each substring of T

[〉

l

[〉

o [

l

[

→ Complexity is O(|T|2 × |A| × |T|)→ Can be optimized to O(|T|2 × |A|)

• Problem: We may need to output Ω(|T|2) matching substrings:• Consider the text T:

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

• Consider the pattern P := a∗

• The number of matches is Ω(|T|2)

→ We need a dierent way to measure complexity

6/16

Page 33: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Measuring the Complexity

• Naive algorithm: Run the automaton A on each substring of T

[〉

l

[〉

o

[〉

l [〉

→ Complexity is O(|T|2 × |A| × |T|)→ Can be optimized to O(|T|2 × |A|)

• Problem: We may need to output Ω(|T|2) matching substrings:• Consider the text T:

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

• Consider the pattern P := a∗

• The number of matches is Ω(|T|2)

→ We need a dierent way to measure complexity

6/16

Page 34: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Measuring the Complexity

• Naive algorithm: Run the automaton A on each substring of T

[〉

l

[〉

o

[〉

l

[〉

→ Complexity is O(|T|2 × |A| × |T|)

→ Can be optimized to O(|T|2 × |A|)

• Problem: We may need to output Ω(|T|2) matching substrings:• Consider the text T:

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

• Consider the pattern P := a∗

• The number of matches is Ω(|T|2)

→ We need a dierent way to measure complexity

6/16

Page 35: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Measuring the Complexity

• Naive algorithm: Run the automaton A on each substring of T

[〉

l

[〉

o

[〉

l

[〉

→ Complexity is O(|T|2 × |A| × |T|)→ Can be optimized to O(|T|2 × |A|)

• Problem: We may need to output Ω(|T|2) matching substrings:• Consider the text T:

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

• Consider the pattern P := a∗

• The number of matches is Ω(|T|2)

→ We need a dierent way to measure complexity

6/16

Page 36: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Measuring the Complexity

• Naive algorithm: Run the automaton A on each substring of T

[〉

l

[〉

o

[〉

l

[〉

→ Complexity is O(|T|2 × |A| × |T|)→ Can be optimized to O(|T|2 × |A|)

• Problem: We may need to output Ω(|T|2) matching substrings:

• Consider the text T:aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

• Consider the pattern P := a∗

• The number of matches is Ω(|T|2)

→ We need a dierent way to measure complexity

6/16

Page 37: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Measuring the Complexity

• Naive algorithm: Run the automaton A on each substring of T

[〉

l

[〉

o

[〉

l

[〉

→ Complexity is O(|T|2 × |A| × |T|)→ Can be optimized to O(|T|2 × |A|)

• Problem: We may need to output Ω(|T|2) matching substrings:• Consider the text T:

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

• Consider the pattern P := a∗

• The number of matches is Ω(|T|2)

→ We need a dierent way to measure complexity

6/16

Page 38: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Measuring the Complexity

• Naive algorithm: Run the automaton A on each substring of T

[〉

l

[〉

o

[〉

l

[〉

→ Complexity is O(|T|2 × |A| × |T|)→ Can be optimized to O(|T|2 × |A|)

• Problem: We may need to output Ω(|T|2) matching substrings:• Consider the text T:

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

• Consider the pattern P := a∗

• The number of matches is Ω(|T|2)

→ We need a dierent way to measure complexity

6/16

Page 39: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Measuring the Complexity

• Naive algorithm: Run the automaton A on each substring of T

[〉

l

[〉

o

[〉

l

[〉

→ Complexity is O(|T|2 × |A| × |T|)→ Can be optimized to O(|T|2 × |A|)

• Problem: We may need to output Ω(|T|2) matching substrings:• Consider the text T:

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

• Consider the pattern P := a∗

• The number of matches is Ω(|T|2)

→ We need a dierent way to measure complexity

6/16

Page 40: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Measuring the Complexity

• Naive algorithm: Run the automaton A on each substring of T

[〉

l

[〉

o

[〉

l

[〉

→ Complexity is O(|T|2 × |A| × |T|)→ Can be optimized to O(|T|2 × |A|)

• Problem: We may need to output Ω(|T|2) matching substrings:• Consider the text T:

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

• Consider the pattern P := a∗

• The number of matches is Ω(|T|2)

→ We need a dierent way to measure complexity

6/16

Page 41: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Enumeration Algorithms

Idea: In real life, we do not want to compute all the matcheswe just need to be able to enumerate matches quickly

→ Formalization: enumeration algorithms

7/16

Page 42: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Enumeration Algorithms

Idea: In real life, we do not want to compute all the matcheswe just need to be able to enumerate matches quickly

→ Formalization: enumeration algorithms

7/16

Page 43: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Enumeration Algorithms

Idea: In real life, we do not want to compute all the matcheswe just need to be able to enumerate matches quickly

→ Formalization: enumeration algorithms

7/16

Page 44: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Enumeration Algorithms

Idea: In real life, we do not want to compute all the matcheswe just need to be able to enumerate matches quickly

→ Formalization: enumeration algorithms

7/16

Page 45: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Enumeration Algorithms

Idea: In real life, we do not want to compute all the matcheswe just need to be able to enumerate matches quickly

→ Formalization: enumeration algorithms

7/16

Page 46: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Enumeration Algorithms

Idea: In real life, we do not want to compute all the matcheswe just need to be able to enumerate matches quickly

→ Formalization: enumeration algorithms

7/16

Page 47: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Formalizing Enumeration Algorithms

Antoine Amarilli Description Name AntoineAmarilli. Handle: a3nm. Identity Born1990-02-07. French national. Appearance asof 2017. Auth OpenPGP. OpenId. Bitcoin.Contact Email and XMPP [email protected] Associate professor ...

Text T

[a-z0-9.]∗@[a-z0-9.]∗

Pattern P

Phase 1:Preprocessing

Index structure

Phase 2:Enumeration

[42, 57〉,[1337, 1351〉

Results

8/16

Page 48: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Formalizing Enumeration Algorithms

Antoine Amarilli Description Name AntoineAmarilli. Handle: a3nm. Identity Born1990-02-07. French national. Appearance asof 2017. Auth OpenPGP. OpenId. Bitcoin.Contact Email and XMPP [email protected] Associate professor ...

Text T

[a-z0-9.]∗@[a-z0-9.]∗

Pattern P

Phase 1:Preprocessing

Index structure

Phase 2:Enumeration

[42, 57〉,[1337, 1351〉

Results

8/16

Page 49: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Formalizing Enumeration Algorithms

Antoine Amarilli Description Name AntoineAmarilli. Handle: a3nm. Identity Born1990-02-07. French national. Appearance asof 2017. Auth OpenPGP. OpenId. Bitcoin.Contact Email and XMPP [email protected] Associate professor ...

Text T

[a-z0-9.]∗@[a-z0-9.]∗

Pattern P

Phase 1:Preprocessing

Index structure

Phase 2:Enumeration

[42, 57〉,[1337, 1351〉

Results

8/16

Page 50: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Formalizing Enumeration Algorithms

Antoine Amarilli Description Name AntoineAmarilli. Handle: a3nm. Identity Born1990-02-07. French national. Appearance asof 2017. Auth OpenPGP. OpenId. Bitcoin.Contact Email and XMPP [email protected] Associate professor ...

Text T

[a-z0-9.]∗@[a-z0-9.]∗

Pattern P

Phase 1:Preprocessing

Index structure

Phase 2:Enumeration

[42, 57〉,[1337, 1351〉

Results

8/16

Page 51: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Formalizing Enumeration Algorithms

Antoine Amarilli Description Name AntoineAmarilli. Handle: a3nm. Identity Born1990-02-07. French national. Appearance asof 2017. Auth OpenPGP. OpenId. Bitcoin.Contact Email and XMPP [email protected] Associate professor ...

Text T

[a-z0-9.]∗@[a-z0-9.]∗

Pattern P

Phase 1:Preprocessing

Index structure

Phase 2:Enumeration

[42, 57〉,

[1337, 1351〉

Results8/16

Page 52: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Formalizing Enumeration Algorithms

Antoine Amarilli Description Name AntoineAmarilli. Handle: a3nm. Identity Born1990-02-07. French national. Appearance asof 2017. Auth OpenPGP. OpenId. Bitcoin.Contact Email and XMPP [email protected] Associate professor ...

Text T

[a-z0-9.]∗@[a-z0-9.]∗

Pattern P

Phase 1:Preprocessing

Index structure

Phase 2:Enumeration

[42, 57〉,[1337, 1351〉

Results

8/16

Page 53: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Formalizing Enumeration Algorithms

Antoine Amarilli Description Name AntoineAmarilli. Handle: a3nm. Identity Born1990-02-07. French national. Appearance asof 2017. Auth OpenPGP. OpenId. Bitcoin.Contact Email and XMPP [email protected] Associate professor ...

Text T

[a-z0-9.]∗@[a-z0-9.]∗

Pattern P

Phase 1:Preprocessing

Index structure

Phase 2:Enumeration

[42, 57〉,[1337, 1351〉

Results

Two performance criteria:

• Total time for phase 1

• Delay between two results in phase 2... as a function of the text and pattern

8/16

Page 54: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Complexity of Enumeration Algorithms

• Recall the inputs to our problem:• A text T

Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07.French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and [email protected] Affiliation Associate professor of computer science (office C201-4) in the DIG team ofTélécom ParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer scienceawarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure.More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git ...

• A pattern P given as a regular expression

P := [a-z0-9.]∗ @ [a-z0-9.]∗

• What is the delay of the naive algorithm?

→ it is the maximal time to find the next matching substring→ i.e. O(|T|2 × |A|), e.g., if only the beginning and end match

→ Can we do better?

9/16

Page 55: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Complexity of Enumeration Algorithms

• Recall the inputs to our problem:• A text T

Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07.French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and [email protected] Affiliation Associate professor of computer science (office C201-4) in the DIG team ofTélécom ParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer scienceawarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure.More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git ...

• A pattern P given as a regular expression

P := [a-z0-9.]∗ @ [a-z0-9.]∗

• What is the delay of the naive algorithm?

→ it is the maximal time to find the next matching substring→ i.e. O(|T|2 × |A|), e.g., if only the beginning and end match

→ Can we do better?

9/16

Page 56: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Complexity of Enumeration Algorithms

• Recall the inputs to our problem:• A text T

Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07.French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and [email protected] Affiliation Associate professor of computer science (office C201-4) in the DIG team ofTélécom ParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer scienceawarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure.More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git ...

• A pattern P given as a regular expression

P := [a-z0-9.]∗ @ [a-z0-9.]∗

• What is the delay of the naive algorithm?

→ it is the maximal time to find the next matching substring→ i.e. O(|T|2 × |A|), e.g., if only the beginning and end match

→ Can we do better?

9/16

Page 57: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Complexity of Enumeration Algorithms

• Recall the inputs to our problem:• A text T

Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07.French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and [email protected] Affiliation Associate professor of computer science (office C201-4) in the DIG team ofTélécom ParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer scienceawarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure.More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git ...

• A pattern P given as a regular expression

P := [a-z0-9.]∗ @ [a-z0-9.]∗

• What is the delay of the naive algorithm?

→ it is the maximal time to find the next matching substring

→ i.e. O(|T|2 × |A|), e.g., if only the beginning and end match

→ Can we do better?

9/16

Page 58: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Complexity of Enumeration Algorithms

• Recall the inputs to our problem:• A text T

Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07.French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and [email protected] Affiliation Associate professor of computer science (office C201-4) in the DIG team ofTélécom ParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer scienceawarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure.More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git ...

• A pattern P given as a regular expression

P := [a-z0-9.]∗ @ [a-z0-9.]∗

• What is the delay of the naive algorithm?

→ it is the maximal time to find the next matching substring→ i.e. O(|T|2 × |A|), e.g., if only the beginning and end match

→ Can we do better?

9/16

Page 59: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Complexity of Enumeration Algorithms

• Recall the inputs to our problem:• A text T

Antoine Amarilli Description Name Antoine Amarilli. Handle: a3nm. Identity Born 1990-02-07.French national. Appearance as of 2017. Auth OpenPGP. OpenId. Bitcoin. Contact Email and [email protected] Affiliation Associate professor of computer science (office C201-4) in the DIG team ofTélécom ParisTech, 46 rue Barrault, F-75634 Paris Cedex 13, France. Studies PhD in computer scienceawarded by Télécom ParisTech on March 14, 2016. Former student of the École normale supérieure.More Résumé Location Other sites Blogging: a3nm.net/blog Git: a3nm.net/git ...

• A pattern P given as a regular expression

P := [a-z0-9.]∗ @ [a-z0-9.]∗

• What is the delay of the naive algorithm?

→ it is the maximal time to find the next matching substring→ i.e. O(|T|2 × |A|), e.g., if only the beginning and end match

→ Can we do better?9/16

Page 60: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Results for Enumerating Pattern Matches

• Existing work has shown the best possible bounds:

Theorem [Florenzano et al., 2018]We can enumerate all matches of a pattern P on a text T with:

• Preprocessing linear in T

and exponential in P

• Delay constant (independent from T)

and exponential in P

→ Problem: Only ecient for deterministic automata!

• Our contribution is:

TheoremWe can enumerate all matches of a pattern P on a text T with:

• Preprocessing in O(|T| × Poly(P))

• Delay polynomial in P and independent from T

10/16

Page 61: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Results for Enumerating Pattern Matches

• Existing work has shown the best possible bounds:

Theorem [Florenzano et al., 2018]We can enumerate all matches of a pattern P on a text T with:

• Preprocessing linear in T

and exponential in P

• Delay constant (independent from T)

and exponential in P

→ Problem: Only ecient for deterministic automata!

• Our contribution is:

TheoremWe can enumerate all matches of a pattern P on a text T with:

• Preprocessing in O(|T| × Poly(P))

• Delay polynomial in P and independent from T

10/16

Page 62: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Results for Enumerating Pattern Matches

• Existing work has shown the best possible bounds in T:

Theorem [Florenzano et al., 2018]We can enumerate all matches of a pattern P on a text T with:

• Preprocessing linear in T and exponential in P• Delay constant (independent from T) and exponential in P

→ Problem: Only ecient for deterministic automata!

• Our contribution is:

TheoremWe can enumerate all matches of a pattern P on a text T with:

• Preprocessing in O(|T| × Poly(P))

• Delay polynomial in P and independent from T

10/16

Page 63: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Results for Enumerating Pattern Matches

• Existing work has shown the best possible bounds in T:

Theorem [Florenzano et al., 2018]We can enumerate all matches of a pattern P on a text T with:

• Preprocessing linear in T and exponential in P• Delay constant (independent from T) and exponential in P

→ Problem: Only ecient for deterministic automata!

• Our contribution is:

TheoremWe can enumerate all matches of a pattern P on a text T with:

• Preprocessing in O(|T| × Poly(P))

• Delay polynomial in P and independent from T

10/16

Page 64: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Results for Enumerating Pattern Matches

• Existing work has shown the best possible bounds in T:

Theorem [Florenzano et al., 2018]We can enumerate all matches of a pattern P on a text T with:

• Preprocessing linear in T and exponential in P• Delay constant (independent from T) and exponential in P

→ Problem: Only ecient for deterministic automata!

• Our contribution is:

TheoremWe can enumerate all matches of a pattern P on a text T with:

• Preprocessing in O(|T| × Poly(P))

• Delay polynomial in P and independent from T

10/16

Page 65: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Automaton Formalism

• We use automata that read letters and capture variables

→ Example: P := •∗ α a∗ β •∗

1 2 3

α

a

β

• Semantics of the automaton A:• Reads letters from the text• Guesses variables at positions in the text→ Output: tuples 〈α : i, β : j〉 such that

A has an accepting run reading α at position i and β at j

• Assumption: There is no run for which A readsthe same capture variable twice at the same position

• Challenge: Because of nondeterminism we can havemany dierent runs of A producing the same tuple!

11/16

Page 66: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Automaton Formalism

• We use automata that read letters and capture variables→ Example: P := •∗ α a∗ β •∗

1 2 3

α

a

β

• Semantics of the automaton A:• Reads letters from the text• Guesses variables at positions in the text→ Output: tuples 〈α : i, β : j〉 such that

A has an accepting run reading α at position i and β at j

• Assumption: There is no run for which A readsthe same capture variable twice at the same position

• Challenge: Because of nondeterminism we can havemany dierent runs of A producing the same tuple!

11/16

Page 67: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Automaton Formalism

• We use automata that read letters and capture variables→ Example: P := •∗ α a∗ β •∗

1 2 3

α

a

β

• Semantics of the automaton A:• Reads letters from the text• Guesses variables at positions in the text→ Output: tuples 〈α : i, β : j〉 such that

A has an accepting run reading α at position i and β at j

• Assumption: There is no run for which A readsthe same capture variable twice at the same position

• Challenge: Because of nondeterminism we can havemany dierent runs of A producing the same tuple!

11/16

Page 68: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Automaton Formalism

• We use automata that read letters and capture variables→ Example: P := •∗ x`

α

a∗ a x

β

•∗

1 2 3

x`

a

a x

• Semantics of the automaton A:• Reads letters from the text• Guesses variables at positions in the text→ Output: tuples 〈α : i, β : j〉 such that

A has an accepting run reading α at position i and β at j

• Assumption: There is no run for which A readsthe same capture variable twice at the same position

• Challenge: Because of nondeterminism we can havemany dierent runs of A producing the same tuple!

11/16

Page 69: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Automaton Formalism

• We use automata that read letters and capture variables→ Example: P := •∗ α a∗ β •∗

1 2 3

α

a

β

• Semantics of the automaton A:• Reads letters from the text• Guesses variables at positions in the text→ Output: tuples 〈α : i, β : j〉 such that

A has an accepting run reading α at position i and β at j

• Assumption: There is no run for which A readsthe same capture variable twice at the same position

• Challenge: Because of nondeterminism we can havemany dierent runs of A producing the same tuple!

11/16

Page 70: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Automaton Formalism

• We use automata that read letters and capture variables→ Example: P := •∗ α a∗ β •∗

1 2 3

α

a

β

• Semantics of the automaton A:• Reads letters from the text• Guesses variables at positions in the text

→ Output: tuples 〈α : i, β : j〉 such thatA has an accepting run reading α at position i and β at j

• Assumption: There is no run for which A readsthe same capture variable twice at the same position

• Challenge: Because of nondeterminism we can havemany dierent runs of A producing the same tuple!

11/16

Page 71: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Automaton Formalism

• We use automata that read letters and capture variables→ Example: P := •∗ α a∗ β •∗

1 2 3

α

a

β

• Semantics of the automaton A:• Reads letters from the text• Guesses variables at positions in the text→ Output: tuples 〈α : i, β : j〉 such that

A has an accepting run reading α at position i and β at j

• Assumption: There is no run for which A readsthe same capture variable twice at the same position

• Challenge: Because of nondeterminism we can havemany dierent runs of A producing the same tuple!

11/16

Page 72: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Automaton Formalism

• We use automata that read letters and capture variables→ Example: P := •∗ α a∗ β •∗

1 2 3

α

a

β

• Semantics of the automaton A:• Reads letters from the text• Guesses variables at positions in the text→ Output: tuples 〈α : i, β : j〉 such that

A has an accepting run reading α at position i and β at j

• Assumption: There is no run for which A readsthe same capture variable twice at the same position

• Challenge: Because of nondeterminism we can havemany dierent runs of A producing the same tuple!

11/16

Page 73: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Automaton Formalism

• We use automata that read letters and capture variables→ Example: P := •∗ α a∗ β •∗

1 2 3

α

a

β

• Semantics of the automaton A:• Reads letters from the text• Guesses variables at positions in the text→ Output: tuples 〈α : i, β : j〉 such that

A has an accepting run reading α at position i and β at j

• Assumption: There is no run for which A readsthe same capture variable twice at the same position

• Challenge: Because of nondeterminism we can havemany dierent runs of A producing the same tuple!

11/16

Page 74: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Proof idea: Product DAG

Compute a product DAG of the text T and of the automaton A

Example: Text T := aaaba and P := •∗ α a∗ β •∗,

match 〈α : 0, β : 3〉

1

2

3

α

a

β

a a a b a

1

2

3

1

2

3

1

2

3

1

2

3

1

2

3

1

2

3 3

α

β

α

β

α

β

α

β

α

β

α

β

→ Each path in the product DAG corresponds to a match

→ Challenge: Enumerate paths but avoid duplicate matchesand do not waste time to ensure constant delay

12/16

Page 75: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Proof idea: Product DAG

Compute a product DAG of the text T and of the automaton A

Example: Text T := aaaba and P := •∗ α a∗ β •∗,

match 〈α : 0, β : 3〉

1

2

3

α

a

β

a a a b a

1

2

3

1

2

3

1

2

3

1

2

3

1

2

3

1

2

3 3

α

β

α

β

α

β

α

β

α

β

α

β

→ Each path in the product DAG corresponds to a match

→ Challenge: Enumerate paths but avoid duplicate matchesand do not waste time to ensure constant delay

12/16

Page 76: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Proof idea: Product DAG

Compute a product DAG of the text T and of the automaton A

Example: Text T := aaaba and P := •∗ α a∗ β •∗,

match 〈α : 0, β : 3〉

1

2

3

α

a

β

a a a b a

1

2

3

1

2

3

1

2

3

1

2

3

1

2

3

1

2

3 3

α

β

α

β

α

β

α

β

α

β

α

β

→ Each path in the product DAG corresponds to a match

→ Challenge: Enumerate paths but avoid duplicate matchesand do not waste time to ensure constant delay

12/16

Page 77: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Proof idea: Product DAG

Compute a product DAG of the text T and of the automaton A

Example: Text T := aaaba and P := •∗ α a∗ β •∗,

match 〈α : 0, β : 3〉

1

2

3

α

a

β

a a a b a

1

2

3

1

2

3

1

2

3

1

2

3

1

2

3

1

2

3 3

α

β

α

β

α

β

α

β

α

β

α

β

→ Each path in the product DAG corresponds to a match

→ Challenge: Enumerate paths but avoid duplicate matchesand do not waste time to ensure constant delay

12/16

Page 78: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Proof idea: Product DAG

Compute a product DAG of the text T and of the automaton A

Example: Text T := aaaba and P := •∗ α a∗ β •∗,

match 〈α : 0, β : 3〉

1

2

3

α

a

β

a a a b a

1

2

3

1

2

3

1

2

3

1

2

3

1

2

3

1

2

3 3

α

β

α

β

α

β

α

β

α

β

α

β

→ Each path in the product DAG corresponds to a match

→ Challenge: Enumerate paths but avoid duplicate matchesand do not waste time to ensure constant delay

12/16

Page 79: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Proof idea: Product DAG

Compute a product DAG of the text T and of the automaton A

Example: Text T := aaaba and P := •∗ α a∗ β •∗,

match 〈α : 0, β : 3〉

1

2

3

α

a

β

a a a b a

1

2

3

1

2

3

1

2

3

1

2

3

1

2

3

1

2

3 3

α

β

α

β

α

β

α

β

α

β

α

β

→ Each path in the product DAG corresponds to a match

→ Challenge: Enumerate paths but avoid duplicate matchesand do not waste time to ensure constant delay

12/16

Page 80: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Proof idea: Product DAG

Compute a product DAG of the text T and of the automaton A

Example: Text T := aaaba and P := •∗ α a∗ β •∗, match 〈α : 0, β : 3〉

1

2

3

α

a

β

a a a b a

1

2

3

1

2

3

1

2

3

1

2

3

1

2

3

1

2

3 3

α

β

α

β

α

β

α

β

α

β

α

β

→ Each path in the product DAG corresponds to a match

→ Challenge: Enumerate paths but avoid duplicate matchesand do not waste time to ensure constant delay

12/16

Page 81: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Proof idea: Product DAG

Compute a product DAG of the text T and of the automaton A

Example: Text T := aaaba and P := •∗ α a∗ β •∗,

match 〈α : 0, β : 3〉

1

2

3

α

a

β

a a a b a

1

2

3

1

2

3

1

2

3

1

2

3

1

2

3

1

2

3 3

α

β

α

β

α

β

α

β

α

β

α

β

→ Each path in the product DAG corresponds to a match

→ Challenge: Enumerate paths but avoid duplicate matchesand do not waste time to ensure constant delay 12/16

Page 82: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Proof ingredients

Several ingredients to do this ecient

• Prune non-accepting paths

• Use shortcuts (pointers) to skip long paths

• Flashlight search

13/16

Page 83: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Proof ingredient: jump pointers to save time

• Issue: When we can’t assign variables, we do not make progress

· · ·

· · ·

· · ·

· · ·α α α α α

• Idea: Directly jump to the reachable statesat the next position where we can assign a variable

• Challenge: Preprocessing in linear time in T and polynomial in A:→ Compute for each state the next position where we can reach

some state that can assign a variable→ Compute at each position i the transitive closure to all positions j

such that j is the next position of some state at i (there are ≤ |A|)

14/16

Page 84: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Proof ingredient: jump pointers to save time

• Issue: When we can’t assign variables, we do not make progress

· · ·

· · ·

· · ·

· · ·α α α α α

• Idea: Directly jump to the reachable statesat the next position where we can assign a variable

• Challenge: Preprocessing in linear time in T and polynomial in A:→ Compute for each state the next position where we can reach

some state that can assign a variable→ Compute at each position i the transitive closure to all positions j

such that j is the next position of some state at i (there are ≤ |A|)

14/16

Page 85: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Proof ingredient: jump pointers to save time

• Issue: When we can’t assign variables, we do not make progress

· · ·

· · ·

· · ·

· · ·α α α α α

• Idea: Directly jump to the reachable statesat the next position where we can assign a variable

• Challenge: Preprocessing in linear time in T and polynomial in A:→ Compute for each state the next position where we can reach

some state that can assign a variable→ Compute at each position i the transitive closure to all positions j

such that j is the next position of some state at i (there are ≤ |A|)

14/16

Page 86: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Proof ingredient: jump pointers to save time

• Issue: When we can’t assign variables, we do not make progress

· · ·

· · ·

· · ·

· · ·α α α α α

• Idea: Directly jump to the reachable statesat the next position where we can assign a variable

• Challenge: Preprocessing in linear time in T and polynomial in A:→ Compute for each state the next position where we can reach

some state that can assign a variable→ Compute at each position i the transitive closure to all positions j

such that j is the next position of some state at i (there are ≤ |A|)

14/16

Page 87: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Proof ingredient: jump pointers to save time

• Issue: When we can’t assign variables, we do not make progress

· · ·

· · ·

· · ·

· · ·α α α α α

• Idea: Directly jump to the reachable statesat the next position where we can assign a variable

• Challenge: Preprocessing in linear time in T and polynomial in A:→ Compute for each state the next position where we can reach

some state that can assign a variable→ Compute at each position i the transitive closure to all positions j

such that j is the next position of some state at i (there are ≤ |A|)

14/16

Page 88: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Proof ingredient: jump pointers to save time

• Issue: When we can’t assign variables, we do not make progress

· · ·

· · ·

· · ·

· · ·α α α α α

• Idea: Directly jump to the reachable statesat the next position where we can assign a variable

• Challenge: Preprocessing in linear time in T and polynomial in A:→ Compute for each state the next position where we can reach

some state that can assign a variable→ Compute at each position i the transitive closure to all positions j

such that j is the next position of some state at i (there are ≤ |A|)

14/16

Page 89: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Proof ingredient: jump pointers to save time

• Issue: When we can’t assign variables, we do not make progress

· · ·

· · ·

· · ·

· · ·α α α α α

• Idea: Directly jump to the reachable statesat the next position where we can assign a variable

• Challenge: Preprocessing in linear time in T and polynomial in A:→ Compute for each state the next position where we can reach

some state that can assign a variable→ Compute at each position i the transitive closure to all positions j

such that j is the next position of some state at i (there are ≤ |A|)

14/16

Page 90: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Proof ingredient: jump pointers to save time

• Issue: When we can’t assign variables, we do not make progress

· · ·

· · ·

· · ·

· · ·α α α α α

• Idea: Directly jump to the reachable statesat the next position where we can assign a variable

• Challenge: Preprocessing in linear time in T and polynomial in A:

→ Compute for each state the next position where we can reachsome state that can assign a variable

→ Compute at each position i the transitive closure to all positions jsuch that j is the next position of some state at i (there are ≤ |A|)

14/16

Page 91: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Proof ingredient: jump pointers to save time

• Issue: When we can’t assign variables, we do not make progress

· · ·

· · ·

· · ·

· · ·α α α α α

• Idea: Directly jump to the reachable statesat the next position where we can assign a variable

• Challenge: Preprocessing in linear time in T and polynomial in A:→ Compute for each state the next position where we can reach

some state that can assign a variable

→ Compute at each position i the transitive closure to all positions jsuch that j is the next position of some state at i (there are ≤ |A|)

14/16

Page 92: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Proof ingredient: jump pointers to save time

• Issue: When we can’t assign variables, we do not make progress

· · ·

· · ·

· · ·

· · ·α α α α α

• Idea: Directly jump to the reachable statesat the next position where we can assign a variable

• Challenge: Preprocessing in linear time in T and polynomial in A:→ Compute for each state the next position where we can reach

some state that can assign a variable→ Compute at each position i the transitive closure to all positions j

such that j is the next position of some state at i (there are ≤ |A|)14/16

Page 93: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Proof ingredient: jump pointers to save time

• Issue: When we can’t assign variables, we do not make progress

· · ·

· · ·

· · ·

· · ·α α α α α

• Idea: Directly jump to the reachable statesat the next position where we can assign a variable

• Challenge: Preprocessing in linear time in T and polynomial in A:→ Compute for each state the next position where we can reach

some state that can assign a variable→ Compute at each position i the transitive closure to all positions j

such that j is the next position of some state at i (there are ≤ |A|)14/16

Page 94: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Proof ingredient: jump pointers to save time

• Issue: When we can’t assign variables, we do not make progress

· · ·

· · ·

· · ·

· · ·α α α α α

• Idea: Directly jump to the reachable statesat the next position where we can assign a variable

• Challenge: Preprocessing in linear time in T and polynomial in A:→ Compute for each state the next position where we can reach

some state that can assign a variable→ Compute at each position i the transitive closure to all positions j

such that j is the next position of some state at i (there are ≤ |A|)14/16

Page 95: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Proof ingredient: jump pointers to save time

• Issue: When we can’t assign variables, we do not make progress

· · ·

· · ·

· · ·

· · ·α α α α α

• Idea: Directly jump to the reachable statesat the next position where we can assign a variable

• Challenge: Preprocessing in linear time in T and polynomial in A:→ Compute for each state the next position where we can reach

some state that can assign a variable→ Compute at each position i the transitive closure to all positions j

such that j is the next position of some state at i (there are ≤ |A|)14/16

Page 96: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Proof ingredient: flashlight search

• Issue: Finding which variable sets we can assign at position i?

i i+ 1

α

γ

β

α

α

β

γ

α?

β?

γ?0 1

γ?0 1

0 1β?

γ?0 1

γ?0 1

0 1

0 1

• Idea: Explore a decision tree on the variables (built on the fly)

• At each decision tree node, find the reachable states whichhave all required variables (1) and no forbidden variables (0)→ Assumption: we don’t see the same variable twice on a path

15/16

Page 97: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Proof ingredient: flashlight search

• Issue: Finding which variable sets we can assign at position i?

i i+ 1

α

γ

β

α

α

β

γ

α?

β?

γ?0 1

γ?0 1

0 1β?

γ?0 1

γ?0 1

0 1

0 1

• Idea: Explore a decision tree on the variables (built on the fly)

• At each decision tree node, find the reachable states whichhave all required variables (1) and no forbidden variables (0)→ Assumption: we don’t see the same variable twice on a path

15/16

Page 98: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Proof ingredient: flashlight search

• Issue: Finding which variable sets we can assign at position i?

i i+ 1

α

γ

β

α

α

β

γ

α?

β?

γ?0 1

γ?0 1

0 1β?

γ?0 1

γ?0 1

0 1

0 1

• Idea: Explore a decision tree on the variables (built on the fly)

• At each decision tree node, find the reachable states whichhave all required variables (1) and no forbidden variables (0)→ Assumption: we don’t see the same variable twice on a path

15/16

Page 99: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Proof ingredient: flashlight search

• Issue: Finding which variable sets we can assign at position i?

i i+ 1

α

γ

β

α

α

β

γ

α?

β?

γ?0 1

γ?0 1

0 1β?

γ?0 1

γ?0 1

0 1

0 1

• Idea: Explore a decision tree on the variables (built on the fly)

• At each decision tree node, find the reachable states whichhave all required variables (1) and no forbidden variables (0)

→ Assumption: we don’t see the same variable twice on a path

15/16

Page 100: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Proof ingredient: flashlight search

• Issue: Finding which variable sets we can assign at position i?

i i+ 1

α

γ

β

α

α

β

γ

α?

β?

γ?0 1

γ?0 1

0 1β?

γ?0 1

γ?0 1

0 1

0 1

• Idea: Explore a decision tree on the variables (built on the fly)

• At each decision tree node, find the reachable states whichhave all required variables (1) and no forbidden variables (0)→ Assumption: we don’t see the same variable twice on a path

15/16

Page 101: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Proof ingredient: flashlight search

• Issue: Finding which variable sets we can assign at position i?

i i+ 1

α

γ

β

α

α

β

γ

α?

β?

γ?0 1

γ?0 1

0 1β?

γ?0 1

γ?0 1

0 1

0 1

• Idea: Explore a decision tree on the variables (built on the fly)

• At each decision tree node, find the reachable states whichhave all required variables (1) and no forbidden variables (0)→ Assumption: we don’t see the same variable twice on a path

15/16

Page 102: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Proof ingredient: flashlight search

• Issue: Finding which variable sets we can assign at position i?

i i+ 1

α

γ

β

α

α

β

γ

α?

β?

γ?1

γ?0 1

0 1β?

γ?0 1

1

0 1

• Idea: Explore a decision tree on the variables (built on the fly)

• At each decision tree node, find the reachable states whichhave all required variables (1) and no forbidden variables (0)→ Assumption: we don’t see the same variable twice on a path

15/16

Page 103: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Summary and Future Work

Page 104: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Main Result and Future Work

TheoremGiven a sequential document spanner P and text T, we canenumerate with:

Preprocessing O(|P|ω+1 × |T|)Delay O(|V3| × |P|2)

V : Set of Variablesω : Exponent for Boolean matrix multiplication

Extensions and future work:

• Extending the results from text to trees

PODS 2019

• Supporting updates on the input data• Enumerating results in a relevant order?• Testing how well our methods perform in practice

Rémi Dupré

Thanks for your attention!

16/16

Page 105: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Main Result and Future Work

TheoremGiven a sequential document spanner P and text T, we canenumerate with:

Preprocessing O(|P|ω+1 × |T|)Delay O(|V3| × |P|2)

Extensions and future work:

• Extending the results from text to trees

PODS 2019

• Supporting updates on the input data• Enumerating results in a relevant order?• Testing how well our methods perform in practice

Rémi Dupré

Thanks for your attention!

16/16

Page 106: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Main Result and Future Work

TheoremGiven a sequential document spanner P and text T, we canenumerate with:

Preprocessing O(|P|ω+1 × |T|)Delay O(|V3| × |P|2)

Extensions and future work:

• Extending the results from text to trees

PODS 2019• Supporting updates on the input data

• Enumerating results in a relevant order?• Testing how well our methods perform in practice

Rémi Dupré

Thanks for your attention!

16/16

Page 107: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Main Result and Future Work

TheoremGiven a sequential document spanner P and text T, we canenumerate with:

Preprocessing O(|P|ω+1 × |T|)Delay O(|V3| × |P|2)

Extensions and future work:

• Extending the results from text to trees

PODS 2019

• Supporting updates on the input data

• Enumerating results in a relevant order?• Testing how well our methods perform in practice

Rémi Dupré

Thanks for your attention!

16/16

Page 108: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Main Result and Future Work

TheoremGiven a sequential document spanner P and text T, we canenumerate with:

Preprocessing O(|P|ω+1 × |T|)Delay O(|V3| × |P|2)

Extensions and future work:

• Extending the results from text to trees PODS 2019• Supporting updates on the input data

• Enumerating results in a relevant order?• Testing how well our methods perform in practice

Rémi Dupré

Thanks for your attention!

16/16

Page 109: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Main Result and Future Work

TheoremGiven a sequential document spanner P and text T, we canenumerate with:

Preprocessing O(|P|ω+1 × |T|)Delay O(|V3| × |P|2)

Extensions and future work:

• Extending the results from text to trees PODS 2019• Supporting updates on the input data

• Enumerating results in a relevant order?

• Testing how well our methods perform in practice

Rémi Dupré

Thanks for your attention!

16/16

Page 110: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Main Result and Future Work

TheoremGiven a sequential document spanner P and text T, we canenumerate with:

Preprocessing O(|P|ω+1 × |T|)Delay O(|V3| × |P|2)

Extensions and future work:

• Extending the results from text to trees PODS 2019• Supporting updates on the input data

• Enumerating results in a relevant order?• Testing how well our methods perform in practice

Rémi Dupré

Thanks for your attention!

16/16

Page 111: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Main Result and Future Work

TheoremGiven a sequential document spanner P and text T, we canenumerate with:

Preprocessing O(|P|ω+1 × |T|)Delay O(|V3| × |P|2)

Extensions and future work:

• Extending the results from text to trees PODS 2019• Supporting updates on the input data

• Enumerating results in a relevant order?• Testing how well our methods perform in practice Rémi Dupré

Thanks for your attention!

16/16

Page 112: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

Main Result and Future Work

TheoremGiven a sequential document spanner P and text T, we canenumerate with:

Preprocessing O(|P|ω+1 × |T|)Delay O(|V3| × |P|2)

Extensions and future work:

• Extending the results from text to trees PODS 2019• Supporting updates on the input data

• Enumerating results in a relevant order?• Testing how well our methods perform in practice Rémi Dupré

Thanks for your attention!16/16

Page 113: Constant-Delay Enumeration for Nondeterministic Document ... · Constant-Delay Enumeration for Nondeterministic Document Spanners Antoine Amarilli1, Pierre Bourhis2, Stefan Mengel3,

References i

Florenzano, F., Riveros, C., Ugarte, M., Vansummeren, S., and Vrgoc,D. (2018).Constant delay algorithms for regular document spanners.In PODS.