[ 1 ] May 11, 2010 C. Brabrand & J. G. Thomsen REGULAR EXPRESSIONS COPLAS DIKU, Denmark
Pattern Matching on Strings using Regular Expressions
Claus Brabrand [ [email protected] ]
IT University of Copenhagen
Jakob G. Thomsen [ [email protected] ]
Aarhus University
Num = 0 | [1-9][0-9]*
Email = [a-z]+ "@" [a-z]+ ("." [a-z]+ )*
Outline
Pattern Matching (intro & motiv): The Chomsky Hierarchy (1956)
Regular Expressions: The Recording Construction
Ambiguity: Disambiguation
Type Inference
Usage and Examples
Evaluation and Conclusion
Introduction & Motivation
Pattern matching is an indispensable problem: many applications need to "parse" dynamic input
1) URLs:
http://first.dk/index.php?id=141&view=details
(protocol, host, path, query-string; the query-string is a list of key-value pairs)
2) Log Files:
13/02/2010 66.249.65.107 get /support.html
20/02/2010 42.116.32.64 post /search.html
3) DBLP:
<article>
  <title>Three Models for the...</title>
  <author>Noam Chomsky</author>
  <year>1956</year>
</article>
The Chomsky Hierarchy (1956)
Language classes (+ formalisms):
Type-3 regular expressions are "enough" for: URLs, log files, DBLP, ...
"Trade" (excess) expressivity for: declarativity, simplicity, and static safety!
Type-0: java.net.URL
Turing-Complete programming (e.g., Java)
[ "unrestricted grammars" (e.g., rewriting systems) ]
Cyclomatic complexity (of official "java.net.URL"):
88 bug reports on Sun's Bug Repository!
Bug reports span more than a decade!
Type-1: Context-Sensitivity
Not a widely used (or studied?) formalism
Presumably because: it restricts expressivity w/o offering extra safety?
- ? -
Type-2: Context-Free Grammars
Conceptually harder than regexps
Essentially (Type-3) Regular Expressions + recursion
The ultimate end-all scientific argument: We d:
regexps 12 times more popular! (conjecture!)
Type-?: Regexp Capture Groups
Capturing groups (Perl, PHP, Java regex, ...):
Syntax: (R) (i.e., in parentheses)
Back-references:
Syntax: \7 (i.e., "index of" capturing group)
Beyond regularity!
(a*)b\1 is non-regular: { a^n b a^n | n ≥ 0 }
In fact, not even context-free!!!
(.*).\1 is non-context-free: { ω a ω | a ∈ Σ, ω ∈ Σ* }
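The back-reference example can be tried directly in Java's java.util.regex, which supports capturing groups and \1-style back-references. This is only a small illustration; the class name BackrefDemo is made up for this sketch.

```java
import java.util.regex.Pattern;

class BackrefDemo {
    // (a*)b\1 : group 1 captures some a^n, and \1 demands the same a^n again after the 'b'.
    static final Pattern A_N_B_A_N = Pattern.compile("(a*)b\\1");

    static boolean matches(String s) {
        return A_N_B_A_N.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("aabaa")); // true:  a^2 b a^2
        System.out.println(matches("aaba"));  // false: a^2 b a^1
        System.out.println(matches("b"));     // true:  n = 0
    }
}
```

No finite automaton can count the a's on both sides of the b, which is why this pattern goes beyond Type-3.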
Type-?: Regexp Capture Groups
Interpretation with back-tracking:
NP-complete (exponential worst-case) :-(
regexp "a?^n a^n" vs. string "a^n"
1 minute vs. 0.02 msecs
3,000,000:1 on strings of length 29!!!
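A rough way to observe the blow-up is to build the regexp a?^n a^n and run it against a^n with a back-tracking matcher such as java.util.regex. The sketch below assumes Java 11+ (for String.repeat); the class and method names are illustrative, and the timings it prints depend entirely on the machine and JVM, so none are quoted here.

```java
import java.util.regex.Pattern;

class PathologicalRegex {
    // a? repeated n times followed by a repeated n times, e.g. n = 3 gives a?a?a?aaa
    static String pattern(int n) {
        return "a?".repeat(n) + "a".repeat(n);
    }

    // Matches the pattern for n against the string a^n (always a match, but slow to find).
    static boolean matchesAn(int n) {
        return Pattern.matches(pattern(n), "a".repeat(n));
    }

    public static void main(String[] args) {
        // Keep n modest: the work roughly doubles with each increment of n.
        for (int n = 5; n <= 20; n += 5) {
            long start = System.nanoTime();
            boolean ok = matchesAn(n);
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.println("n=" + n + " matched=" + ok + " in " + micros + " us");
        }
    }
}
```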
Type-3: Regular Expressions
Closure properties:
Union
Concatenation
Iteration
Restriction
Intersection
Complement
...
Decidability properties:
...
...
Containment: L(R) ⊆ L(R')
Ambiguity
...
...
Declarative ! Safe ! Simple !
Regular Expressions
Syntax:
Semantics:
where:
L1 L2 is concatenation (i.e., { ω1 ω2 | ω1 ∈ L1, ω2 ∈ L2 })
L* = ∪ (i ≥ 0) L^i, where L^0 = { ε } and L^i = L L^(i-1)
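The two definitions above can be made concrete on finite languages. The following sketch (class and method names are ours) computes concatenation and a bounded approximation of the star, i.e. the union of L^0 through L^k, since L* itself is infinite for any L containing a non-empty string.

```java
import java.util.Set;
import java.util.TreeSet;

class LanguageOps {
    // L1 L2 = { w1 w2 | w1 in L1, w2 in L2 }
    static Set<String> concat(Set<String> l1, Set<String> l2) {
        Set<String> out = new TreeSet<>();
        for (String w1 : l1)
            for (String w2 : l2)
                out.add(w1 + w2);
        return out;
    }

    // Finite approximation of L*: the union of L^0 .. L^k.
    static Set<String> starUpTo(Set<String> l, int k) {
        Set<String> power = new TreeSet<>(Set.of("")); // L^0 = { epsilon }
        Set<String> out = new TreeSet<>(power);
        for (int i = 1; i <= k; i++) {
            power = concat(l, power);                   // L^i = L L^(i-1)
            out.addAll(power);
        }
        return out;
    }

    public static void main(String[] args) {
        // L = { a, bb }: prints the 7 strings of L^0, L^1 and L^2.
        System.out.println(starUpTo(Set.of("a", "bb"), 2));
    }
}
```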
Common Extensions (sugar)
Any character (aka, dot): "." as c1|c2|...|cn, where ci ∈ Σ
Character ranges: "[a-z]" as a|b|...|z
One-or-more regexps: "R+" as RR*
Optional regexp: "R?" as ε|R
Various repetitions; e.g.: "R{2,3}" as RRR?
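As a sanity check on the desugarings above, one can compare a sugared and a desugared regexp on sample strings using java.util.regex. This is only a spot check on a handful of inputs, not a proof of equivalence; the class name and sample list are ours.

```java
import java.util.List;
import java.util.regex.Pattern;

class SugarCheck {
    // True if both regexps accept exactly the same strings among the samples.
    static boolean agree(String sugared, String desugared, List<String> samples) {
        Pattern p = Pattern.compile(sugared), q = Pattern.compile(desugared);
        for (String s : samples)
            if (p.matcher(s).matches() != q.matcher(s).matches()) return false;
        return true;
    }

    public static void main(String[] args) {
        List<String> samples = List.of("", "a", "ba", "ab", "abab", "ababab", "abababab");
        System.out.println(agree("(ab)+", "(ab)(ab)*", samples));         // R+     as RR*
        System.out.println(agree("(ab)?", "(|ab)", samples));             // R?     as epsilon|R
        System.out.println(agree("(ab){2,3}", "(ab)(ab)(ab)?", samples)); // R{2,3} as RRR?
    }
}
```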
Recording
Syntax: <x = R>, where "x" is a recording identifier
(it "remembers" the substring it matches)
Semantics:
Example (simplified emails):
<user = [a-z]+ > "@" <domain = [a-z]+ ("." [a-z]+)* >
Matching against string:
"obama@whitehouse.gov"
yields:
user = "obama" & domain = "whitehouse.gov"
NB: cannot use DFAs / NFAs!
- only recognition (yes / no)
- not how (i.e., "the structure")
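For comparison, the closest analogue in plain java.util.regex is a named capturing group. The sketch below re-creates the simplified email example that way; it is not the authors' tool, and the class and method names are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmailRecording {
    // Named groups play the role of the recordings <user = ...> and <domain = ...>.
    static final Pattern EMAIL =
        Pattern.compile("(?<user>[a-z]+)@(?<domain>[a-z]+(\\.[a-z]+)*)");

    // Returns { user, domain }, or null if s is not a (simplified) email.
    static String[] match(String s) {
        Matcher m = EMAIL.matcher(s);
        if (!m.matches()) return null;
        return new String[] { m.group("user"), m.group("domain") };
    }

    public static void main(String[] args) {
        String[] r = match("obama@whitehouse.gov");
        System.out.println("user = " + r[0] + " & domain = " + r[1]);
    }
}
```

Unlike the recording construction, the group names here are looked up by string at run time, so a misspelled name is only detected when the code executes.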
Recording (structured)
Another example (with nested recordings):
<date = <day = [0-9]{2} > "/" <month = [0-9]{2} > "/" <year = [0-9]{4} >>
Matching against string:
"26/06/1992"
yields:
date.day = 26
date.month = 06
date.year = 1992
date = 26/06/1992
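Nested recordings can be approximated in java.util.regex by nesting named groups. The sketch below flattens the result into a map keyed by dotted paths so that it prints the same four entries as above; the map representation and all names are our own choice, not something the slides prescribe.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DateRecording {
    // The outer group "date" encloses the inner groups "day", "month" and "year".
    static final Pattern DATE = Pattern.compile(
        "(?<date>(?<day>[0-9]{2})/(?<month>[0-9]{2})/(?<year>[0-9]{4}))");

    // Returns the recorded substrings keyed by dotted path, or null on no match.
    static Map<String, String> match(String s) {
        Matcher m = DATE.matcher(s);
        if (!m.matches()) return null;
        Map<String, String> rec = new LinkedHashMap<>();
        rec.put("date.day", m.group("day"));
        rec.put("date.month", m.group("month"));
        rec.put("date.year", m.group("year"));
        rec.put("date", m.group("date"));
        return rec;
    }

    public static void main(String[] args) {
        System.out.println(match("26/06/1992"));
    }
}
```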
Recording (structured, lists)
Yet another example (yielding lists):
<name = [a-z]+ > " & " <name = [a-z]+ >
Matching against string:
"obama & bush"
yields a list structure:
name = [obama,bush]
Likewise (recordings under a star):
( <name = [a-z]+ > "\n" )*
<name = [a-z]+ > (" & " <name = [a-z]+ > )*
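This list behaviour is something java.util.regex does not offer directly: a capturing group inside a repetition keeps only the text of its last iteration. A common workaround, sketched here with illustrative names, is to validate the whole string first and then collect the items with a find() loop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NameList {
    static final Pattern NAMES = Pattern.compile("([a-z]+)( & ([a-z]+))*");
    static final Pattern NAME = Pattern.compile("[a-z]+");

    // Group 3 sits under the star, so only its last iteration is retained.
    static String lastCapture(String s) {
        Matcher m = NAMES.matcher(s);
        return m.matches() ? m.group(3) : null;
    }

    // Workaround: check the overall shape, then collect every name with find().
    static List<String> names(String s) {
        List<String> out = new ArrayList<>();
        if (!NAMES.matcher(s).matches()) return out;
        Matcher m = NAME.matcher(s);
        while (m.find()) out.add(m.group());
        return out;
    }

    public static void main(String[] args) {
        System.out.println(lastCapture("obama & bush & clinton")); // clinton
        System.out.println(names("obama & bush"));                 // [obama, bush]
    }
}
```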
Abstract Syntax Trees (ASTs)
Ambiguity
Definition: R ambiguous iff
∃ T,T' ∈ AST_R: T ≠ T' ∧ ||T|| = ||T'||
where ||·||: AST → Σ* is the flattening
[figure: two distinct ASTs, T and T', with equal flattenings]
Characterization of Ambiguity
Theorem: R unambiguous iff
NB: sound & complete!
R* = ε | RR*
Examples
Ambiguous:
a|a: L(a) ∩ L(a) = { a } ≠ Ø
a*a*: L(a*) ⋈ L(a*) = { a^n } ≠ Ø
Unambiguous:
a|aa: L(a) ∩ L(aa) = Ø
a*ba*: L(a*) ⋈ L(ba*) = Ø
(∩: intersection, for choice; ⋈: overlap, for concatenation)
Ambiguity Examples
a?b+|(ab)*
*** ambiguous choice: a?b+ <-|-> (ab)*
    shortest ambiguous string: "ab"
(a|ab)(ba|a)
*** ambiguous concatenation: (a|ab) <--> (ba|a)
    shortest ambiguous string: "aba"
(aa|aaa)*
*** ambiguous star: (aa|aaa)*
    shortest ambiguous string: "aaaaa"
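The three reported strings can be reproduced by brute force. The sketch below is not the authors' algorithm: it simply enumerates strings over {a,b}, shortest first, and counts parses at the top-level operator only (it assumes the sub-expressions are themselves unambiguous). Membership is tested with java.util.regex; all names are ours.

```java
import java.util.function.Predicate;
import java.util.regex.Pattern;

class ShortestAmbiguous {
    // Membership test for a regexp (full match).
    static Predicate<String> lang(String regex) {
        Pattern p = Pattern.compile(regex);
        return s -> p.matcher(s).matches();
    }

    // Number of ways to split s = xy with x in l1 and y in l2.
    static int concatParses(String s, Predicate<String> l1, Predicate<String> l2) {
        int n = 0;
        for (int i = 0; i <= s.length(); i++)
            if (l1.test(s.substring(0, i)) && l2.test(s.substring(i))) n++;
        return n;
    }

    // Number of ways to cut s into non-empty pieces that are all in l.
    static int starParses(String s, Predicate<String> l) {
        int[] ways = new int[s.length() + 1];
        ways[s.length()] = 1;
        for (int i = s.length() - 1; i >= 0; i--)
            for (int j = i + 1; j <= s.length(); j++)
                if (l.test(s.substring(i, j))) ways[i] += ways[j];
        return ways[0];
    }

    // First string over {a,b}, shortest first, that satisfies the predicate (or null).
    static String shortest(Predicate<String> ambiguous, int maxLen) {
        for (int len = 0; len <= maxLen; len++)
            for (int bits = 0; bits < (1 << len); bits++) {
                StringBuilder sb = new StringBuilder();
                for (int i = len - 1; i >= 0; i--)
                    sb.append(((bits >> i) & 1) == 0 ? 'a' : 'b');
                if (ambiguous.test(sb.toString())) return sb.toString();
            }
        return null;
    }

    public static void main(String[] args) {
        Predicate<String> c1 = lang("a?b+"), c2 = lang("(ab)*");
        System.out.println(shortest(s -> c1.test(s) && c2.test(s), 6));         // ab
        Predicate<String> k1 = lang("a|ab"), k2 = lang("ba|a");
        System.out.println(shortest(s -> concatParses(s, k1, k2) >= 2, 6));     // aba
        Predicate<String> st = lang("aa|aaa");
        System.out.println(shortest(s -> starParses(s, st) >= 2, 6));           // aaaaa
    }
}
```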
Disambiguation
1) Manual rewriting:
Always possible :-)
Tedious :-(
Error-prone :-(
Not structure-preserving :-(
2) Restriction: R1 - R2
And then encode...:
R^C as: Σ* - R
R1 & R2 as: (R1^C | R2^C)^C
3) Disambiguators:
From characterization:
concat: 'L', 'R'
choice: '|L', '|R'
star: '*L', '*R'
(partial-order on ASTs)
4) Default disambiguation:
concat, choice, and star are all left-biased (by default)!
(Our tool does this)
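The encoding of complement and intersection in terms of restriction is De Morgan's law on languages. A minimal sketch, modelling a language as a membership predicate (our own modelling choice, with java.util.regex used only to test membership), could look like this:

```java
import java.util.function.Predicate;
import java.util.regex.Pattern;

class Restriction {
    // A language is modelled as a membership test on strings.
    static Predicate<String> lang(String regex) {
        Pattern p = Pattern.compile(regex);
        return s -> p.matcher(s).matches();
    }

    // Restriction: R1 - R2
    static Predicate<String> minus(Predicate<String> r1, Predicate<String> r2) {
        return s -> r1.test(s) && !r2.test(s);
    }

    // Complement: R^C = Sigma* - R
    static Predicate<String> complement(Predicate<String> r) {
        return minus(s -> true, r);
    }

    // Intersection: R1 & R2 = (R1^C | R2^C)^C
    static Predicate<String> and(Predicate<String> r1, Predicate<String> r2) {
        Predicate<String> union = s -> complement(r1).test(s) || complement(r2).test(s);
        return complement(union);
    }

    public static void main(String[] args) {
        Predicate<String> both = and(lang("a*"), lang("(aa)*"));
        System.out.println(both.test("aa"));  // true:  in a* and in (aa)*
        System.out.println(both.test("aaa")); // false: in a* but not in (aa)*
    }
}
```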
Type Inference
Type Inference: R : (L,S)
Examples (Type Inference)
Regexp:
Person = <name = [a-z]+ > " (" <age = [0-9]+ > ")"
compile (our tool) yields:
class Person { // auto-generated
  String name;
  int age;
  static Person match(String s) { ... }
  public String toString() { ... }
}
Usage:
String s = "obama (48)";
Person p = Person.match(s);
print(p.name + " is " + p.age + "y old");
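To make the shape of such a class concrete, here is a hand-written stand-in built on java.util.regex. It is a guess at what a generated class could look like, not the tool's actual output; in particular, returning null on a failed match is our assumption, since the slides elide the body of match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Person {
    // Stand-in for: Person = <name = [a-z]+ > " (" <age = [0-9]+ > ")"
    private static final Pattern P =
        Pattern.compile("(?<name>[a-z]+) \\((?<age>[0-9]+)\\)");

    String name;
    int age;

    static Person match(String s) {
        Matcher m = P.matcher(s);
        if (!m.matches()) return null;             // assumption: null signals "no match"
        Person p = new Person();
        p.name = m.group("name");
        p.age = Integer.parseInt(m.group("age")); // [0-9]+ is typed as int; very long digit strings would overflow
        return p;
    }

    @Override
    public String toString() {
        return name + " (" + age + ")";
    }

    public static void main(String[] args) {
        Person p = Person.match("obama (48)");
        System.out.println(p.name + " is " + p.age + "y old");
    }
}
```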
Examples (Type Inference)
Regexp:
Person = <name = [a-z]+ > " (" <age = [0-9]+ > ")"
People = ( $Person "\n" )*
compile (our tool) yields:
class People { // auto-generated
  String[] name;
  int[] age;
  static People match(String s) { ... }
  public String toString() { ... }
}
Usage:
String s = "obama (48)\nbush (63)\n";
People p = People.match(s);
println("Second name is " + p.name[1]);
Examples (Type Inference)
Regexp:
Person = <name = [a-z]+ > " (" <age = [0-9]+ > ")"
People = ( <person = $Person > "\n" )* ;
compile (our tool) yields:
class People { // auto-generated
  Person[] person;
  class Person { // nested class
    String name;
    int age;
  }
  ...
}
Usage:
String s = "obama (48)\nbush (63)\n";
People people = People.match(s);
for (People.Person p : people.person) println(p.name);
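A hand-written approximation of the nested variant, again on top of java.util.regex, might look as follows. It is a sketch under our own assumptions: it uses a List where the slide shows an array, returns null when the input does not match, and collects the entries with a find() loop because Java's capture groups do not accumulate across iterations of a star.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class People {
    static class Person { // nested class
        final String name;
        final int age;
        Person(String name, int age) { this.name = name; this.age = age; }
    }

    private static final String PERSON = "([a-z]+) \\(([0-9]+)\\)";
    private static final Pattern ALL = Pattern.compile("(?:" + PERSON + "\n)*");
    private static final Pattern ONE = Pattern.compile(PERSON + "\n");

    final List<Person> person = new ArrayList<>();

    static People match(String s) {
        if (!ALL.matcher(s).matches()) return null; // assumption: null signals "no match"
        People people = new People();
        Matcher m = ONE.matcher(s);
        while (m.find())
            people.person.add(new Person(m.group(1), Integer.parseInt(m.group(2))));
        return people;
    }

    public static void main(String[] args) {
        People people = People.match("obama (48)\nbush (63)\n");
        for (Person p : people.person) System.out.println(p.name);
    }
}
```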
URLs
URLs:
"http://www.google.com/search?q=record&hl=en"
(protocol, host, path, query-string)
Regexp:
Host = <host = [a-z]+ ("." [a-z]+ )* > ;
Path = <path = [a-z/.]* > ;
Query = <query = [a-z&=]* > ;
URL = "http://" $Host "/" $Path "?" $Query ;
Query string further structured (list of key-value pairs):
KeyVal = <key = [a-z]* > "=" <val = [a-z]* > ;
Query = $KeyVal ("&" $KeyVal)* ;
URLs (Usage Example)
Regexp:
Host = <host = [a-z]+ ("." [a-z]+ )* > ;
Path = <path = [a-z/.]* > ;
KeyVal = <key = [a-z]* > "=" <val = [a-z]* > ;
Query = $KeyVal ("&" $KeyVal)* ;
URL = "http://" $Host "/" $Path "?" $Query ;
Usage (example):
String s = "http://www.google.com/search?q=record";
URL url = URL.match(s);
print("Host is: " + url.host);
if (url.key.length>0) print("1st key: " + url.key[0]);
for (String val : url.val) println("value = " + val);
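The same extraction can be approximated by hand with java.util.regex, which gives a feel for what the declarative version saves. In this sketch the class is called UrlSketch (to avoid confusion with java.net.URL), it uses lists rather than arrays, and returning null on failure is our assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlSketch {
    // "http://" $Host "/" $Path "?" $Query, with Query = $KeyVal ("&" $KeyVal)*
    private static final String KEYVAL = "[a-z]*=[a-z]*";
    private static final Pattern URL = Pattern.compile(
        "http://(?<host>[a-z]+(\\.[a-z]+)*)/(?<path>[a-z/.]*)\\?(?<query>"
        + KEYVAL + "(&" + KEYVAL + ")*)");
    private static final Pattern PAIR = Pattern.compile("([a-z]*)=([a-z]*)");

    String host;
    String path;
    final List<String> key = new ArrayList<>();
    final List<String> val = new ArrayList<>();

    static UrlSketch match(String s) {
        Matcher m = URL.matcher(s);
        if (!m.matches()) return null;            // assumption: null signals "no match"
        UrlSketch url = new UrlSketch();
        url.host = m.group("host");
        url.path = m.group("path");
        Matcher kv = PAIR.matcher(m.group("query"));
        while (kv.find()) {                        // collect the list of key-value pairs
            url.key.add(kv.group(1));
            url.val.add(kv.group(2));
        }
        return url;
    }

    public static void main(String[] args) {
        UrlSketch url = match("http://www.google.com/search?q=record&hl=en");
        System.out.println("Host is: " + url.host);
        System.out.println("keys = " + url.key + ", values = " + url.val);
    }
}
```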
Log Files
Format:
13/02/2010 66.249.65.107 /support.html
20/02/2010 42.116.32.64 /search.html
...
Regexp:
Date = <date = <day = $Day > "/" <month = $Month > "/" <year = [0-9]{4} > > ;
IP = <ip = [0-9]{1,3} ("." [0-9]{1,3} ){3} > ;
Entry = <entry = $Date " " $IP " " $Path "\n" > ;
Log = $Entry * ;
Usage:
Log log = Log.match(log_file);
for (Entry e : log.entry)
  if (e.date.month == 02 && e.date.day == 29)
    print("Access on LEAP YEAR from IP# " + e.ip);
Log Files (cont'd, ambiguity)
Assume we forgot "/" (between day & month):
Regexp:
Day = 0?[1-9] | [1-2][0-9] | 30 | 31 ;
Month = 0?[1-9] | 10 | 11 | 12 ;
Date = <date = <day = $Day > // no slash !
              <month = $Month > "/" <year = [0-9]{4} > > ;
Error:
*** ambiguous concatenation: <day> <--> <month>
    shortest ambiguous string: "101"
Ambiguity:
i.e. "1/01" (January 1) vs. "10/1" (January 10) :-)
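The two readings of "101" can be confirmed by counting the split points at which the prefix is a Day and the suffix is a Month. The sketch below does this by brute force with java.util.regex; it only illustrates the reported ambiguity and is not how the tool detects it. The class and method names are ours.

```java
import java.util.regex.Pattern;

class DayMonthAmbiguity {
    static final Pattern DAY = Pattern.compile("0?[1-9]|[1-2][0-9]|30|31");
    static final Pattern MONTH = Pattern.compile("0?[1-9]|10|11|12");

    // Number of ways to split s into a Day immediately followed by a Month.
    static int parses(String s) {
        int n = 0;
        for (int i = 0; i <= s.length(); i++)
            if (DAY.matcher(s.substring(0, i)).matches()
                    && MONTH.matcher(s.substring(i)).matches()) n++;
        return n;
    }

    public static void main(String[] args) {
        System.out.println(parses("101"));  // 2: "1"+"01" and "10"+"1"
        System.out.println(parses("3112")); // 1: "31"+"12"
    }
}
```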
DBLP (Format)
DBLP (XML) Format:
<article>
  <author>Noam Chomsky</author>
  <title>Three Models for the Description of Language</title>
  <year>1956</year>
  <journal>IRE Transactions on Information Theory</journal>
</article>
<article>
  <author>Claus Brabrand</author>
  <author>Jakob G Thomsen</author>
  <title>Typed and Unambiguous Pattern Matching on Strings using Regular Expressions</title>
  <year>2010</year>
  <note>Submitted</note>
</article>
...
DBLP (Regexp)
DBLP Regexp:
Author = "<author>" <author = [a-z]* > "</author>" ;
Title = "<title>" <title = [a-z]* > "</title>" ;
Article = "<article>" $Author* $Title .* "</article>" ;
DBLP = <pub = $Article > * ;
Ambiguity!:
*** ambiguous star: <pub>*
    shortest ambiguous string: "<article><title></title></article><article><title></title></article>"
EITHER 2 publications (.* = "") OR 1 publication (.* = gray part) !!!
DBLP (Disambiguated)
DBLP Regexp:
Author = "<author>" <author = [a-z]* > "</author>" ;
Title = "<title>" <title = [a-z]* > "</title>" ;
Article = "<article>" $Author* $Title .* "</article>" ;
DBLP = <pub = $Article > * ;
Disambiguated (using "(R1-R2)"):
Article = "<article>" $Author* $Title (.* - (.* "</article>" .*)) "</article>" ;
Unambiguous! :-)
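java.util.regex has no restriction operator, but the effect of (.* - (.* "</article>" .*)) can be approximated with a negative lookahead that forbids the closing tag inside the wildcard. The sketch below contrasts this with the plain .* version on the shortest ambiguous string; note that the plain version does not report any ambiguity, it just silently picks the single-publication reading. Names are ours.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ArticleSplit {
    // Plain wildcard: .* is free to swallow "</article><article>".
    static final Pattern GREEDY = Pattern.compile("<article>.*</article>");

    // Wildcard restricted so that it never contains "</article>".
    static final Pattern RESTRICTED =
        Pattern.compile("<article>(?:(?!</article>).)*</article>");

    // Number of successive, non-overlapping matches found in s.
    static int count(Pattern p, String s) {
        Matcher m = p.matcher(s);
        int n = 0;
        while (m.find()) n++;
        return n;
    }

    public static void main(String[] args) {
        String s = "<article><title></title></article><article><title></title></article>";
        System.out.println(count(GREEDY, s));     // 1 publication
        System.out.println(count(RESTRICTED, s)); // 2 publications
    }
}
```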
DBLP (Usage Example)
DBLP Regexp:
Author = "<author>" <author = [a-z]* > "</author>" ;
Title = "<title>" <title = [a-z]* > "</title>" ;
Article = "<article>" $Author* $Title .* "</article>" ;
DBLP = <article = $Article > * ;
Usage (example):
DBLP dblp = DBLP.match(readXMLfile("DBLP.xml"));
for (Article a: dblp.article)
  print("Title: " + a.title);
Evaluation
Evaluation summary:
Also, (Type-3) regexps are expressive "enough" for: URLs, log files, DBLP, ...
[ MatMult ][ NP-Complete ][ Frisch&Cardelli'04 ]
Type-3 vs. Type-0 (URLs)
Regexps vs. Java:
Regexps are 8 times more concise !
java.util.regex vs. Our approach
Efficiency (on DBLP):
java.util.regex: exponential, O(2^|ω|): 2,500 chars in 2 mins!
In contrast, ours: linear (on DBLP): 1,200,000 chars in 6 secs!
[chart labels: 2 mins vs. 10 msecs]
Related Work
Recording (with lists in general): "x as R" in XDuce; "x::R" in CDuce; and "x@R" in Scala and HaRP
Ambiguity: [Book+Even+Greibach+Ott'71] and [Hosoya'03] for XDuce, but indirectly via NFAs, not directly (syntax-directed)
Disambiguation: [Vansummeren'06], but with global, not local disambiguation
Type inference: exact type inference in XDuce & CDuce (soundness+completeness proof in [Vansummeren'06]), but not for stand-alone and non-intrusive usage (Java)
Conclusion
For string pattern matching, it is possible to:
i.e., ambiguity checking and type inference!
+ stand-alone & non-intrusive language integration (Java)!
In conclusion:
We conclude that if regular expressions are sufficiently expressive, they provide a simple, declarative, and safe means for pattern matching on strings, capable of extracting highly structural information in a statically type-safe and unambiguous manner.
"trade (excess) expressivity for safety+simplicity"
</Talk>
Questions ? Complaints ?
[ http://www.cs.au.dk/~gedefar/reg-exp-rec/ ]