Understanding advanced regular expressions

Page 1: Understanding advanced regular expressions

Deeper down the rabbit hole: Advanced Regular Expressions

Jakob Westhoff, @jakobwesthoff

PHPBarcamp.at, May 3, 2010

http://westhoffswelt.de jakob@westhoffswelt.de slide: 1 / 26

Page 2: Understanding advanced regular expressions

About Me

Jakob Westhoff

PHP developer for several years

Computer science student at the TU Dortmund

Co-Founder of the PHP Usergroup Dortmund

Active in different Open Source projects

Page 3: Understanding advanced regular expressions

Asking the audience

Who already works with regular expressions?

Regular expressions like this:

/[a-zA-Z]+/

Or like this:

(?P<image>(?:none|inherit)|(?:url\(\s*(?:'|")?(?:\\['"\\)]|\\[^\'"\\)]|[^'"\\)])*(?:'|")?\s*\)))

Page 6: Understanding advanced regular expressions

Goals of this session

Learn advanced techniques to use in (PCRE) regular expressions

Assertions
Once only subpatterns
Conditional subpatterns
Pattern recursion
...

Learn how to handle Unicode in your regular expressions

Page 9: Understanding advanced regular expressions

What Regular Expressions are. . .

In theoretical computer science:

Express regular languages
Languages which can be described by deterministic finite state automata
Type 3 grammars in the Chomsky hierarchy

Page 12: Understanding advanced regular expressions

What Regular Expressions are. . .

In practical day to day usage:

“[...] regular expressions provide concise and flexible means for identifying strings of text of interest, such as particular characters, words, or patterns of characters.”

– Wikipedia [1]

Page 17: Understanding advanced regular expressions

Building Blocks of a Regular Expression

Basic structure of every regular expression

/[a-z]+/im

Delimiter

A pair of identical characters of your choice (must be escaped when used inside the expression)

Bracket pairs such as ( and ) may be used as well in PHP's PCRE functions

Expression

Modifier

A sequence of characters providing processing instructions
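As a minimal sketch of the three building blocks, the snippet below runs the same expression with three different delimiters and the i and m modifiers. The subject string and variable names are made up for illustration.

```php
<?php
// Illustrative subject: two lines, the first one in upper case.
$subject = "PHP\nbarcamp";

// Slash delimiters, i (case-insensitive) and m (multiline) modifiers.
$withSlashes = preg_match('/^[a-z]+$/im', $subject, $m1);

// Hash delimiters: handy when the expression itself contains slashes.
$withHashes = preg_match('#^[a-z]+$#im', $subject, $m2);

// Bracket-style delimiters: opening and closing character differ.
$withParens = preg_match('(^[a-z]+$)im', $subject, $m3);

// Without modifiers the same expression does not match this subject.
$withoutModifiers = preg_match('/^[a-z]+$/', $subject);

var_dump($m1[0], $m2[0], $m3[0], $withoutModifiers);
```

All three delimiter styles describe the same expression, so they produce the same first match.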

Page 24: Understanding advanced regular expressions

Getting everybody up to speed

., .*, .+, .?, .{1,2} - Arbitrary characters and repetitions

^, $ - Start and end of subject (or line in multiline mode)

foo|bar - Logical Or

(foo)(bar) - Subpattern grouping

/(foo|bar)baz(\1)/ - Backreferences

[a-z], [^a-z] - Character classes
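The basics listed above can be tried out quickly from PHP. This is only a sketch; the subjects and variable names are illustrative.

```php
<?php
// Repetition: .{1,2} takes one or two arbitrary characters (greedy).
preg_match('/a.{1,2}/', 'abcdef', $repeat);

// Anchors: with the m modifier ^ and $ also match at line breaks.
$lineCount = preg_match_all('/^\w+$/m', "foo\nbar", $anchored);

// Alternation inside a group.
preg_match('/(foo|bar)baz/', 'barbaz', $alternation);

// Backreference: \1 must repeat exactly what group 1 matched.
$sameWord  = preg_match('/(foo|bar)baz(\1)/', 'foobazfoo');
$otherWord = preg_match('/(foo|bar)baz(\1)/', 'foobazbar');

// Character classes: [^a-z] is anything except a lowercase letter.
preg_match('/[^a-z]/', 'abc1', $notLower);

var_dump($repeat[0], $lineCount, $alternation[1], $sameWord, $otherWord, $notLower[0]);
```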

Page 35: Understanding advanced regular expressions

Grouping Without Subpattern Creation

Grouping might be needed without creating a subpattern

/(?:foobar)*/
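A short sketch of the difference as seen from PHP: the capturing variant adds an entry to the result array, the (?:...) variant does not. The trailing baz and the variable names are added here for illustration only.

```php
<?php
$subject = 'foobarfoobarbaz';

// Capturing group: the last repetition ends up in index 1.
preg_match('/(foobar)*baz/', $subject, $capturing);

// Non-capturing group: same match, but no subpattern entry.
preg_match('/(?:foobar)*baz/', $subject, $nonCapturing);

print_r($capturing);
print_r($nonCapturing);
```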

Page 37: Understanding advanced regular expressions

Subpattern identification

Subpatterns are numbered by their opening parenthesis

/(foo(bar)(baz))/

1 foobarbaz
2 bar
3 baz

Matches available from within PHP

$matches = array(
    0 => "foobarbaz",
    1 => "foobarbaz",
    2 => "bar",
    3 => "baz",
);
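The numbering can be checked with a small sketch. The second call is an illustrative variation, showing that a non-capturing group does not take part in the counting.

```php
<?php
// Groups are counted by their opening parenthesis, left to right.
preg_match('/(foo(bar)(baz))/', 'foobarbaz', $matches);

// A non-capturing group is skipped when numbering.
preg_match('/(foo(?:bar)(baz))/', 'foobarbaz', $fewer);

print_r($matches);
print_r($fewer);
```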

Page 43: Understanding advanced regular expressions

Subpattern Naming

PCRE allows custom naming

/(?P<firstname>[A-Za-z]+) (?P<lastname>[A-Za-z]+)/

Result with input Jakob Westhoff

array(
    0 => 'Jakob Westhoff',
    'firstname' => 'Jakob',
    1 => 'Jakob',
    'lastname' => 'Westhoff',
    2 => 'Westhoff',
)
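A minimal usage sketch: named subpatterns let the calling code refer to parts of the match by name instead of by position. The $greeting variable is only an illustration.

```php
<?php
$pattern = '/(?P<firstname>[A-Za-z]+) (?P<lastname>[A-Za-z]+)/';

$greeting = null;
if (preg_match($pattern, 'Jakob Westhoff', $matches)) {
    // Both the name and the numeric index point at the same text.
    $greeting = sprintf('%s, %s', $matches['lastname'], $matches['firstname']);
}

echo $greeting, PHP_EOL;
print_r(array_keys($matches));
```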

Page 45: Understanding advanced regular expressions

Assertions

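Formulate assertions about the subject string without consuming any characters

Example

/foo(?=foo)/

Input

foofoofoo

Match

foofoofoo (the first foo matches and, when searching repeatedly, so does the second, since each is followed by another foo that the assertion checks but does not consume; the last foo never matches)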
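To see that the assertion consumes nothing, a sketch with preg_match_all and PREG_OFFSET_CAPTURE can list where the matches start. Variable names are illustrative.

```php
<?php
// Collect every match together with its byte offset in the subject.
preg_match_all('/foo(?=foo)/', 'foofoofoo', $matches, PREG_OFFSET_CAPTURE);

// Each entry of $matches[0] is array(matched text, offset).
$texts   = array_column($matches[0], 0);
$offsets = array_column($matches[0], 1);

print_r($texts);
print_r($offsets);
```

Each match is only three characters long, and the second match starts directly where the first one ended, which would not be possible had the lookahead consumed the following foo.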

Page 54: Understanding advanced regular expressions

Negative Assertions

Negative assertions are possible

foo not followed by another foo

/foo(?!foo)/
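Run against the subject from the previous slide, the negative assertion picks exactly the foo that the positive one skipped. A small sketch with illustrative variable names:

```php
<?php
// Only a foo that is NOT followed by another foo may match.
preg_match_all('/foo(?!foo)/', 'foofoofoo', $matches, PREG_OFFSET_CAPTURE);

$offsets = array_column($matches[0], 1);

print_r($offsets);
```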

Page 56: Understanding advanced regular expressions

Backward Assertions

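bar preceded by foo

/(?=foo)bar/ ? (struck out on the slide: a forward assertion inspects the text after the current position, which cannot be foo and bar at once, so this never matches)

Backward assertion

/(?<=foo)bar/

Negative backward assertion

bar not preceded by foo

/(?<!foo)bar/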
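The three variants can be compared on one subject. This is a sketch; the subject string is made up so that one bar follows foo and one does not.

```php
<?php
$subject = 'foobar bazbar';

// Forward assertion in front of bar: can never succeed.
$forward = preg_match('/(?=foo)bar/', $subject);

// Backward assertion: bar directly preceded by foo.
preg_match_all('/(?<=foo)bar/', $subject, $after, PREG_OFFSET_CAPTURE);

// Negative backward assertion: bar not preceded by foo.
preg_match_all('/(?<!foo)bar/', $subject, $notAfter, PREG_OFFSET_CAPTURE);

$afterOffsets    = array_column($after[0], 1);
$notAfterOffsets = array_column($notAfter[0], 1);

var_dump($forward, $afterOffsets, $notAfterOffsets);
```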

Page 60: Understanding advanced regular expressions

Inner workings of the PCRE matcher

PCRE uses backtracking to find matches

Pattern: /\d+foo/
Subject: 123456789bar

1 Eat up all the numbers: 123456789

2 Try to match foo

3 Backtrack one number and try to match foo again

4 Repeat step 3 until a match is found or the subject's beginning is reached
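The steps above can be traced with a toy model. The function below is not how PCRE is implemented; traceDigitsThenLiteral is a hypothetical helper that only imitates the greedy-then-give-back behaviour for a pattern of the form \d+ followed by a literal, at a single start offset.

```php
<?php
// Toy model (not PCRE itself) of matching /\d+<literal>/ at one start offset.
function traceDigitsThenLiteral(string $subject, string $literal, int $start = 0): array
{
    $log = [];

    // Step 1: greedily eat up all digits.
    $end = $start + strspn($subject, '0123456789', $start);

    // Steps 2 to 4: try the literal, then give back one digit and retry.
    for ($stop = $end; $stop > $start; $stop--) {
        $digits  = substr($subject, $start, $stop - $start);
        $matched = substr($subject, $stop, strlen($literal)) === $literal;
        $log[]   = sprintf('%s + %s: %s', $digits, $literal, $matched ? 'match' : 'no match');
        if ($matched) {
            break;
        }
    }

    return $log;
}

$trace = traceDigitsThenLiteral('123456789bar', 'foo');
print_r($trace);
```

For the subject from the slide the trace contains one failed attempt per digit, which is the work a once only subpattern avoids.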

Page 66: Understanding advanced regular expressions

Once only subpattern

Once only subpatterns (also called atomic groups) prevent backtracking into a subpattern once it has acquired a match.

Applying a once only pattern to the shown example

/(?>\d+)foo/

After matching the numbers and determining that the following string is not foo, the matcher gives up this attempt instead of giving back digits one by one

123456789bar

Can massively improve regex speed if used correctly
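Because a once only subpattern never gives characters back, it can also change what matches, not only how fast. The sketch below uses an illustrative pattern and subject (not from the slides) where the plain version needs to backtrack to succeed.

```php
<?php
// Greedy \d+ first takes 12345, then gives back the 5 so the literal 5 can match.
$greedy = preg_match('/\d+5/', '12345');

// The once only subpattern keeps all digits, so the literal 5 can never match.
$atomic = preg_match('/(?>\d+)5/', '12345');

// A possessive quantifier is shorthand for the same behaviour.
$possessive = preg_match('/\d++5/', '12345');

// On the slide's subject both variants fail; only the amount of work differs.
$plainOnSlide  = preg_match('/\d+foo/', '123456789bar');
$atomicOnSlide = preg_match('/(?>\d+)foo/', '123456789bar');

var_dump($greedy, $atomic, $possessive, $plainOnSlide, $atomicOnSlide);
```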

Page 70: Understanding advanced regular expressions

Conditional subpattern

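If statement equivalent in PCRE

/(?(condition)yes-pattern|no-pattern)/

Conditions can be assertions or references to a capturing group (true if that group has matched)

Numbers need to be followed by foo, while everything else needs to be followed by bar

/(?(?=\d)\d+foo|bar)/

PCRE does not accept a plain subpattern such as (\d+) as the condition, so the digits are tested with a forward assertion and consumed in the yes-pattern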
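A sketch of both kinds of condition PCRE accepts: a lookahead assertion and a reference to a capturing group. The anchors, the second pattern and the subjects are added here for illustration.

```php
<?php
// Condition is an assertion: if a digit is next, require digits + foo, else bar.
$byAssertion = '/^(?(?=\d)\d+foo|bar)$/';

$digitsFoo = preg_match($byAssertion, '123foo');
$plainBar  = preg_match($byAssertion, 'bar');
$digitsBar = preg_match($byAssertion, '123bar');
$plainFoo  = preg_match($byAssertion, 'foo');

// Condition is a group reference: a closing > is required only if < was matched.
$byGroup = '/^(<)?\w+(?(1)>)$/';

$bracketed  = preg_match($byGroup, '<tag>');
$bare       = preg_match($byGroup, 'tag');
$unbalanced = preg_match($byGroup, '<tag');

var_dump($digitsFoo, $plainBar, $digitsBar, $plainFoo, $bracketed, $bare, $unbalanced);
```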

Page 77: Understanding advanced regular expressions

Unicode: Character, code points and graphemes

Unicode consists of different code points

The letter a: U+0061
The mark ` (combining grave accent): U+0300

One character might consist of multiple code points

The letter a with the mark ` (à): U+0061 U+0300

Some of these combinations exist as single code points

The letter à: U+00E0
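The two spellings of the same visible character can be compared in PHP. This sketch assumes PHP 7 or later for the "\u{...}" string escape; $countCodePoints is an illustrative helper built on the u modifier.

```php
<?php
$decomposed  = "\u{0061}\u{0300}"; // a followed by the combining grave accent
$precomposed = "\u{00E0}";         // the same character as a single code point

// Count code points by matching one "character" at a time in UTF-8 mode.
$countCodePoints = function (string $text): int {
    return preg_match_all('/./su', $text);
};

var_dump($countCodePoints($decomposed));   // code points in the two-part form
var_dump($countCodePoints($precomposed));  // code points in the single form
var_dump(strlen($decomposed), strlen($precomposed)); // UTF-8 byte lengths
var_dump($decomposed === $precomposed);    // byte-wise comparison
```

Both strings render as the same character, yet they differ in bytes and in code point count, which is why a pattern has to be written with care.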

Page 84: Understanding advanced regular expressions

Unicode: Pattern matching

Unicode processing is enabled using the u modifier

PCRE works on UTF-8 encoded strings

Each code point is handled as one character

Match a specific Unicode code point by its hexadecimal value: \x{FFFF}

Remember the letter a with the mark ` (à)

/\x{0061}\x{0300}/u
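A sketch of matching by code point with the lowercase u modifier and the combining grave accent U+0300. It assumes PHP 7 or later for the "\u{...}" string escape; the anchors and variable names are illustrative.

```php
<?php
$decomposed  = "a\u{0300}"; // a + combining grave accent
$precomposed = "\u{00E0}";  // single code point

// Matches exactly the two code point spelling.
$pattern = '/^\x{0061}\x{0300}$/u';

$hitsDecomposed  = preg_match($pattern, $decomposed);
$hitsPrecomposed = preg_match($pattern, $precomposed);

// Without u the dot walks over bytes, with u over code points.
$bytes      = preg_match_all('/./s', $precomposed);
$codePoints = preg_match_all('/./su', $precomposed);

var_dump($hitsDecomposed, $hitsPrecomposed, $bytes, $codePoints);
```

The pattern finds only the decomposed form, which leads to the question of how to match both spellings at once.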

Page 90: Understanding advanced regular expressions

Unicode: Extended Unicode sequences

How can both the single and the multi code point form of a character be matched?

Remember: à = U+0061 U+0300 or U+00E0

Use the escape for extended Unicode sequences: \X

\X is equivalent to (?>\P{M}\p{M}*)

Wait. What? → Unicode character properties
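
A short PHP sketch of \X, assuming PHP 7.0 or newer for the "\u{...}" escape. The variable names are hypothetical. Newer PCRE releases define \X through Unicode grapheme cluster rules rather than the literal equivalence above, but for a base letter followed by a combining mark the outcome should be the same.

```php
<?php
// Both spellings of "à": base letter plus combining mark, and precomposed.
$decomposed  = "a\u{0300}";
$precomposed = "\u{00E0}";

// \X consumes one complete extended Unicode sequence, whatever its length.
$sequence = '/^\X$/u';
$xMatchesDecomposed  = preg_match($sequence, $decomposed);
$xMatchesPrecomposed = preg_match($sequence, $precomposed);

// A single "." covers exactly one code point, so it cannot span
// the two code points of the decomposed form.
$dotMatchesDecomposed = preg_match('/^.$/u', $decomposed);

// Counting \X matches counts visible characters rather than code points.
$characterCount = preg_match_all('/\X/u', $decomposed . 'b');

var_dump($xMatchesDecomposed, $xMatchesPrecomposed, $dotMatchesDecomposed, $characterCount);
```

Here \X treats both spellings as one character, while "." only accepts the precomposed one.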

Page 95: Understanding advanced regular expressions

Unicode: Character properties

Every Unicode code point has a certain property assigned

Characters may be matched by these properties

Escapes \p and \P are used for this:

\p{xx}: All code points with the property xx

\P{xx}: All code points without the property xx

Possible properties:

L: Letter

M: Mark

P: Punctuation

Sc: Currency symbol

...
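
The property escapes can be tried out with a small PHP sketch. The sample text and variable names are invented for illustration, and the "\u{...}" escape assumes PHP 7.0 or newer.

```php
<?php
// Sample text: "Price: 5€, naïve!" with the non-ASCII characters escaped.
$text = "Price: 5\u{20AC}, na\u{00EF}ve!";

// \p{L}+ collects runs of letters, including the non-ASCII "ï".
preg_match_all('/\p{L}+/u', $text, $words);

// \p{Sc} finds currency symbols, \p{P} finds punctuation.
preg_match_all('/\p{Sc}/u', $text, $currency);
preg_match_all('/\p{P}/u', $text, $punctuation);

// \P{L} is the negation: everything that is not a letter gets removed.
$lettersOnly = preg_replace('/\P{L}+/u', '', $text);

var_dump($words[0], $currency[0], $punctuation[0], $lettersOnly);
```

The digit 5 is matched by none of the three property escapes, because it belongs to the number category rather than to letters, punctuation or currency symbols.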

Page 99: Understanding advanced regular expressions

Pattern Recursion

Recursion in regular expressions?

Possible with PCRE

Validate BB-Code using PCRE

[b]Hello [i]World[/i]![/b]

Page 102: Understanding advanced regular expressions

BB-Code Recursion Example

[b]Hello [i]World[/i]![/b]

Recursive regular expression pattern

([^\[]*\[(b|i)\]
(?:[^\[]+|(?R))\[/\2\]
[^\[]*)
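
A possible way to try the recursive pattern in PHP is sketched below. It makes a few assumptions that are not on the slide: the pattern is written on one line, ~ is used as the delimiter so the slash needs no escaping, the tag name is referenced as group 2 (group 1 is the enclosing group), and the recursion uses (?1) instead of (?R) so that the ^ and $ anchors stay outside the recursed part. The function name isValidBbCode is hypothetical.

```php
<?php
// Validates a single, properly nested [b]/[i] structure.
// (?1) re-enters group 1; \2 is the tag name captured at the current level.
function isValidBbCode(string $input): bool
{
    $pattern = '~^([^\[]*\[(b|i)\](?:[^\[]+|(?1))\[/\2\][^\[]*)$~';

    return preg_match($pattern, $input) === 1;
}

$nested   = isValidBbCode('[b]Hello [i]World[/i]![/b]');
$simple   = isValidBbCode('[b]bold[/b]');
$crossed  = isValidBbCode('[b]Hello [i]World[/b]![/i]');

var_dump($nested, $simple, $crossed);
```

Like the pattern on the slide, this sketch expects exactly one outer tag and does not accept several sibling tags in a row.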

Page 108: Understanding advanced regular expressions

Do NOT Parse Using Regular Expressions

Even though this is possible, you do NOT want to do it

It is not maintainable

It is nearly impossible to find errors

Useful information extraction (building an AST) is not possible

Use regular expressions for

Matching patterns (not recursive structures)

Tokenizing strings

Validating tightly restricted input values
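
The split of responsibilities suggested above could look roughly like this in PHP: a regular expression only tokenizes the input, and ordinary code with a stack checks the nesting. Both function names, tokenizeBbCode and isBalanced, are hypothetical and the tag set is limited to b and i as in the earlier example.

```php
<?php
// Splits the input into tag tokens and text tokens, keeping the tags.
function tokenizeBbCode(string $input): array
{
    return preg_split(
        '~(\[/?(?:b|i)\])~',
        $input,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );
}

// Checks the nesting with a stack instead of a recursive pattern.
function isBalanced(array $tokens): bool
{
    $stack = [];
    foreach ($tokens as $token) {
        if (preg_match('~^\[(/?)(b|i)\]$~', $token, $match) !== 1) {
            continue; // plain text token, nothing to check
        }
        if ($match[1] === '') {
            $stack[] = $match[2];               // opening tag
        } elseif (array_pop($stack) !== $match[2]) {
            return false;                       // closing tag does not fit
        }
    }

    return $stack === [];                       // no tag left open
}

$tokens = tokenizeBbCode('[b]Hello [i]World[/i]![/b]');
var_dump($tokens, isBalanced($tokens));
```

Because the nesting logic lives in plain code, it is easy to extend the loop to report the position of a mismatch or to build a tree, which is hard to get out of a single recursive pattern.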

Page 116: Understanding advanced regular expressions

Thanks for listening

Questions, comments or annotations?

Slides: http://westhoffswelt.de/portfolio.htm

Contact: Jakob Westhoff <jakob@westhoffswelt.de>

Twitter: @jakobwesthoff

Please leave comments and vote at: http://joind.in/1620

