colin neller lead software engineer serviceu corporation

Regular Expressions Revisited

Colin Neller
Lead Software Engineer
ServiceU Corporation
http://colinneller.com/blog/


Agenda

Tools

RegEx Refresher

Advanced (read: lesser known) Features

Maintenance

RegEx Tools…or toys?!?

Tools

Presentation: ZoomIt, SlickRun

Regex: The Regulator, Regulazy, RegexBuddy

RegEx Refresher…couldn't hurt, right?

Function

Validate (IsMatch)

Parse (Match)

Manipulate (Replace)

Slice (Split)
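The four functions above can be sketched in a few lines. This is only an illustration using Python's `re` module as a stand-in for the .NET `Regex` methods named on the slide (`re.search` for IsMatch, `re.match` for Match, `re.sub` for Replace, `re.split` for Split); the sample log line and variable names are made up for the example.

```python
import re

log_line = "2009-03-14 ERROR disk full"  # hypothetical sample input

# Validate: is there a date-like token anywhere in the input?
has_date = re.search(r"\d{4}-\d{2}-\d{2}", log_line) is not None

# Parse: capture the year and the log level.
match = re.match(r"(\d{4})-\d{2}-\d{2} (\w+)", log_line)
year, level = match.group(1), match.group(2)

# Manipulate: mask every digit.
masked = re.sub(r"\d", "#", log_line)

# Slice: split on runs of white space, at most twice.
fields = re.split(r"\s+", log_line, maxsplit=2)
```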

Language Elements

Character Classes

Quantifiers

Groups

Alternation

Character Escapes

Substitution

Options

Character Classes

Class  Description

.  Matches any character except \n. If modified by the Singleline option, a period character matches any character.

[aeiou]  Matches any single character included in the specified set of characters.

[^aeiou]  Matches any single character not in the specified set of characters.

[0-9a-fA-F]  Use of a hyphen (-) allows specification of contiguous character ranges.

\p{name}  Matches any character in the named character class specified by {name}. Supported names are Unicode groups and block ranges. For example, Ll, Nd, Z, IsGreek, IsBoxDrawing.

\P{name}  Matches text not included in groups and block ranges specified in {name}.

\w  Matches any word character. Equivalent to the Unicode categories [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}].

\W  Matches any nonword character. Equivalent to the Unicode categories [^\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}].

\s  Matches any white-space character. Equivalent to the Unicode character categories [\f\n\r\t\v\x85\p{Z}].

\S  Matches any non-white-space character. Equivalent to the Unicode character categories [^\f\n\r\t\v\x85\p{Z}].

\d  Matches any decimal digit. Equivalent to \p{Nd} for Unicode and [0-9] for non-Unicode.

\D  Matches any nondigit. Equivalent to \P{Nd} for Unicode and [^0-9] for non-Unicode.
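A few of the classes above in action. This is a rough sketch in Python's `re` module rather than .NET; the sets, ranges, and shorthand classes shown here behave the same way for these inputs, but note that Python's built-in `re` does not support \p{name}, so that class is left out. The sample strings are invented.

```python
import re

# [aeiou]: any single character from the set.
vowels = re.findall(r"[aeiou]", "regular")

# [^aeiou]: any single character NOT in the set (removed here).
only_vowels = re.sub(r"[^aeiou]", "", "expression")

# [0-9a-fA-F]: contiguous ranges joined by hyphens.
is_hex = re.fullmatch(r"[0-9a-fA-F]+", "DeadBeef") is not None

# \d and \w: shorthand classes for digits and word characters.
digit_runs = re.findall(r"\d+", "call 555-1234")
words = re.findall(r"\w+", "tabs\tand  spaces")
```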

Character Escapes

Escaped character  Description

(ordinary characters)  Characters other than . $ ^ { [ ( | ) * + ? \ match themselves.

\a  Matches a bell (alarm) \u0007.

\b  Matches a backspace \u0008 if in a [] character class; otherwise, \b denotes a word boundary (see Atomic Zero-Width Assertions).

\t  Matches a tab \u0009.

\r  Matches a carriage return \u000D.

\v  Matches a vertical tab \u000B.

\f  Matches a form feed \u000C.

\n  Matches a new line \u000A.

\e  Matches an escape \u001B.

\040  Matches an ASCII character as octal (up to three digits); numbers with no leading zero are backreferences if they have only one digit or if they correspond to a capturing group number. (For more information, see Backreferences.) For example, the character \040 represents a space.

\x20  Matches an ASCII character using hexadecimal representation (exactly two digits).

\cC  Matches an ASCII control character; for example, \cC is control-C.

\u0020  Matches a Unicode character using hexadecimal representation (exactly four digits).

\  When followed by a character that is not recognized as an escaped character, matches that character. For example, \* is the same as \x2A.

http://msdn.microsoft.com/en-us/library/20bw873z%28VS.71%29.aspx

http://msdn.microsoft.com/en-us/library/ksz2azbh%28VS.71%29.aspx
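To make the escape table concrete, here is a small sketch, again in Python's `re` as a stand-in for .NET. It only uses escapes that both engines share (\x20, \u0020, \040, \t, and a backslash-escaped metacharacter); \e and \cC are not supported by Python's `re`, so they are not shown.

```python
import re

# Three spellings of the same character: a space (hex, Unicode, octal).
space_forms = [r"\x20", r"\u0020", r"\040"]
all_match_space = all(re.fullmatch(p, " ") for p in space_forms)

# \* escapes the metacharacter; \x2A names the same character by code.
star_literal = re.findall(r"\*", "2*3*4")
star_hex = re.findall(r"\x2A", "2*3*4")

# \t matches a tab character.
tab_split = re.split(r"\t", "a\tb\tc")
```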

Quantifiers

Quantifier  Description

*  Specifies zero or more matches; for example, \w* or (abc)*. Equivalent to {0,}.

+  Specifies one or more matches; for example, \w+ or (abc)+. Equivalent to {1,}.

?  Specifies zero or one matches; for example, \w? or (abc)?. Equivalent to {0,1}.

{n}  Specifies exactly n matches; for example, (pizza){2}.

{n,}  Specifies at least n matches; for example, (abc){2,}.

{n,m}  Specifies at least n, but no more than m, matches.

*?  Specifies the first match that consumes as few repeats as possible (equivalent to lazy *).

+?  Specifies as few repeats as possible, but at least one (equivalent to lazy +).

??  Specifies zero repeats if possible, or one (lazy ?).

{n}?  Equivalent to {n} (lazy {n}).

{n,}?  Specifies as few repeats as possible, but at least n (lazy {n,}).

{n,m}?  Specifies as few repeats as possible between n and m (lazy {n,m}).
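The difference between greedy and lazy quantifiers is easiest to see side by side. A minimal sketch, using Python's `re` (the quantifier syntax shown is the same as in .NET); the HTML-like sample string is invented for the example.

```python
import re

html = "<b>bold</b> and <i>italic</i>"  # hypothetical sample input

# Greedy: .+ grabs as much as it can, so one match spans the whole string.
greedy = re.findall(r"<.+>", html)

# Lazy: .+? stops at the first closing bracket, giving one match per tag.
lazy = re.findall(r"<.+?>", html)

# {n}: exactly n repetitions of the group.
two_pizzas = re.fullmatch(r"(pizza){2}", "pizzapizza") is not None

# {n,m} takes as many as allowed; {n,m}? takes as few as allowed.
greedy_digits = re.match(r"\d{2,4}", "123456").group()
lazy_digits = re.match(r"\d{2,4}?", "123456").group()
```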

Atomic Zero-Width Assertions

Assertion  Description

^ Specifies that the match must occur at the beginning of the string or the beginning of the line.

$ Specifies that the match must occur at the end of the string, before \n at the end of the string, or at the end of the line.

\A Specifies that the match must occur at the beginning of the string (ignores the Multiline option).

\Z Specifies that the match must occur at the end of the string or before \n at the end of the string (ignores the Multiline option).

\z Specifies that the match must occur at the end of the string (ignores the Multiline option).

\G Specifies that the match must occur at the point where the previous match ended. When used with Match.NextMatch(), this ensures that matches are all contiguous.

\b Specifies that the match must occur on a boundary between \w (alphanumeric) and \W (nonalphanumeric) characters. The match must occur on word boundaries — that is, at the first or last characters in words separated by any nonalphanumeric characters.

\B Specifies that the match must not occur on a \b boundary.
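A short sketch of how ^, $, \A, and \b behave with and without the Multiline option, written in Python's `re` as a stand-in for .NET. One caveat: Python's \Z means "very end of string" (closer to .NET's \z), so \Z and \z are deliberately left out here to avoid confusion. The sample text is made up.

```python
import re

text = "first line\nsecond line\n"  # hypothetical two-line input

# Without Multiline, ^ only matches at the start of the string.
starts_default = re.findall(r"^\w+", text)

# With Multiline, ^ also matches at the start of every line.
starts_multiline = re.findall(r"^\w+", text, re.MULTILINE)

# $ matches at the end of the string or before a final \n ...
ends_default = re.findall(r"\w+$", text)

# ... and, with Multiline, at the end of every line.
ends_multiline = re.findall(r"\w+$", text, re.MULTILINE)

# \A ignores Multiline: it still matches only at the very start.
start_only = re.findall(r"\A\w+", text, re.MULTILINE)

# \b requires a word boundary on each side of "cat".
whole_words = re.findall(r"\bcat\b", "cat concat catalog cat.")
```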

Advanced Features…or maybe just some that you haven't heard of

Grouping Constructs

Construct  Description

( )  Captures the matched substring (or acts as a noncapturing group when the ExplicitCapture option is set; see Options).

(?<name> )  Captures the matched substring into a group name or number name. You can use single quotes instead of angle brackets; for example, (?'name').

(?<name1-name2> )  Balancing group definition. Deletes the definition of the previously defined group name2 and stores in group name1 the interval between the previously defined name2 group and the current group. If no group name2 is defined, the match backtracks.

(?: )  Noncapturing group.

(?imnsx-imnsx: )  Applies or disables the specified options within the subexpression. For example, (?i-s: ) turns on case insensitivity and disables single-line mode.

(?= ) Zero-width positive lookahead assertion. Continues match only if the subexpression matches at this position on the right. For example, \w+(?=\d) matches a word followed by a digit, without matching the digit. This construct does not backtrack.

(?! ) Zero-width negative lookahead assertion. Continues match only if the subexpression does not match at this position on the right. For example, \b(?!un)\w+\b matches words that do not begin with un.

(?<= ) Zero-width positive lookbehind assertion. Continues match only if the subexpression matches at this position on the left. For example, (?<=19)99 matches instances of 99 that follow 19. This construct does not backtrack.

(?<! ) Zero-width negative lookbehind assertion. Continues match only if the subexpression does not match at the position on the left.

(?> ) Nonbacktracking subexpression (also known as a "greedy" subexpression). The subexpression is fully matched once, and then does not participate piecemeal in backtracking. (That is, the subexpression matches only strings that would be matched by the subexpression alone.)

Alternation Constructs

Construct  Definition

|  Matches any one of the terms separated by the | (vertical bar) character; for example, cat|dog|tiger.

(?(expression)yes|no)  Matches the "yes" part if the expression matches at this point; otherwise, matches the "no" part. The "no" part can be omitted. The expression can be any valid subexpression, but it is turned into a zero-width assertion, so this syntax is equivalent to (?(?=expression)yes|no). Note that if the expression is the name of a named group or a capturing group number, the alternation construct is interpreted as a capture test (described in the next row of this table). To avoid confusion in these cases, you can spell out the inside (?=expression) explicitly.

(?(name)yes|no) Matches the "yes" part if the named capture string has a match; otherwise, matches the "no" part. The "no" part can be omitted. If the given name does not correspond to the name or number of a capturing group used in this expression, the alternation construct is interpreted as an expression test (described in the preceding row of this table).
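As a rough illustration of plain alternation and the capture-test form (?(name)yes|no), here is a sketch in Python's `re`. Python supports the capture test but not the (?(expression)yes|no) form, and it spells named groups (?P<name> ) where .NET uses (?<name> ). The group name `open` and the sample inputs are hypothetical.

```python
import re

# Plain alternation: any one of the listed terms.
animals = re.findall(r"cat|dog|tiger", "a dog chased a cat past a tiger")

# Capture test: require a closing parenthesis only if an opening one
# was captured by the group named "open".
area_code = re.compile(r"^(?P<open>\()?\d{3}(?(open)\))$")
samples = ["(555)", "555", "(555", "555)"]
results = {s: bool(area_code.match(s)) for s in samples}
```

In .NET syntax the same pattern would be written ^(?<open>\()?\d{3}(?(open)\))$.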

Backreference Constructs

Construct  Definition

\number  Backreference. For example, (\w)\1 finds doubled word characters.

\k<name>  Named backreference. For example, (?<char>\w)\k<char> finds doubled word characters. The expression (?<43>\w)\43 does the same. You can use single quotes instead of angle brackets; for example, \k'char'.
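The doubled-character example can be sketched as follows. This uses Python's `re`, where numbered backreferences look the same as in .NET but the named form is written (?P<char> ) and (?P=char) instead of (?<char> ) and \k<char>. The sample text is invented.

```python
import re

text = "balloon keeper"  # hypothetical sample input

# Numbered backreference: \1 must repeat whatever group 1 captured.
doubled = [m.group(0) for m in re.finditer(r"(\w)\1", text)]

# Named backreference (Python spelling of (?<char>\w)\k<char>).
doubled_named = [
    m.group(0) for m in re.finditer(r"(?P<char>\w)(?P=char)", text)
]
```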

Substitutions

Character  Description

$number  Substitutes the last substring matched by group number number (decimal).

${name}  Substitutes the last substring matched by a (?<name> ) group.

$$  Substitutes a single "$" literal.

$& Substitutes a copy of the entire match itself.

$` Substitutes all the text of the input string before the match.

$' Substitutes all the text of the input string after the match.

$+ Substitutes the last group captured.

$_ Substitutes the entire input string.
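Substitution syntax is one place where engines differ noticeably, so treat the following as an analogy rather than .NET code. In Python's `re.sub`, \1 plays the role of $1, \g<name> the role of ${name}, and \g<0> the role of $&; there is no direct counterpart to $`, $', $+, or $_. The sample names and strings are made up.

```python
import re

# Numbered groups (Python \2 \1, roughly .NET "$2 $1").
swapped = re.sub(r"(\w+), (\w+)", r"\2 \1", "Neller, Colin")

# Named groups (Python \g<first>, roughly .NET "${first}").
swapped_named = re.sub(
    r"(?P<last>\w+), (?P<first>\w+)",
    r"\g<first> \g<last>",
    "Neller, Colin",
)

# Whole match (Python \g<0>, roughly .NET "$&").
bracketed = re.sub(r"\d+", r"[\g<0>]", "a1b22")
```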

Options

RegexOptions member (inline character)  Description

None  Specifies that no options are set.

IgnoreCase (i)  Specifies case-insensitive matching.

Multiline (m)  Specifies multiline mode. Changes the meaning of ^ and $ so that they match at the beginning and end, respectively, of any line.

ExplicitCapture (n)  Specifies that the only valid captures are explicitly named or numbered groups of the form (?<name> ).

Compiled  Specifies that the regular expression will be compiled to an assembly; yields faster execution at the expense of startup time.

Singleline (s)  Specifies single-line mode. Changes the meaning of the period character (.) so that it matches every character (instead of every character except \n).

IgnorePatternWhitespace (x)  Specifies that unescaped white space is excluded from the pattern and enables comments following a number sign (#).

RightToLeft  Specifies that the search moves from right to left.

ECMAScript  Specifies that ECMAScript-compliant behavior is enabled for the expression. This option can be used only in conjunction with the IgnoreCase and Multiline flags. Use of ECMAScript with any other flags results in an exception.

CultureInvariant  Specifies that cultural differences in language are ignored.
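A brief sketch of how a few of these options change matching, using the nearest Python `re` flags as stand-ins: `re.IGNORECASE` for IgnoreCase and `re.DOTALL` for Singleline. Options such as ExplicitCapture, RightToLeft, and ECMAScript have no direct Python flag and are not shown. The sample strings are invented.

```python
import re

# IgnoreCase, passed as a flag or set inline with (?i).
by_flag = re.findall(r"regex", "RegEx regex REGEX", re.IGNORECASE)
by_inline = re.findall(r"(?i)regex", "RegEx regex REGEX")

# Singleline: without it, the period does not match \n.
dot_default = re.search(r"a.c", "a\nc") is not None
dot_singleline = re.search(r"a.c", "a\nc", re.DOTALL) is not None

# Scoped option group: (?i: ) applies only inside the parentheses.
scoped_hit = re.search(r"(?i:reg)ex", "REGex") is not None
scoped_miss = re.search(r"(?i:reg)ex", "REGEX") is not None
```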

Maintenance

Comments & Option Switches

Construct  Definition

(?imnsx-imnsx)  Turns options, such as case insensitivity, on or off in the middle of a pattern. Option changes are effective until the end of the enclosing group.

(?# ) Inline comment inserted within a regular expression. The comment terminates at the first closing parenthesis character.

# [to end of line] X-mode comment. The comment begins at an unescaped # and continues to the end of the line. (Note that the x option or the RegexOptions.IgnorePatternWhitespace enumerated option must be activated for this kind of comment to be recognized.)
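Both comment styles make a pattern much easier to maintain. The sketch below shows them in Python's `re`, with `re.VERBOSE` standing in for RegexOptions.IgnorePatternWhitespace; the phone-number pattern, group names, and sample input are hypothetical and only meant to show the layout. Python writes named groups as (?P<name> ) where .NET uses (?<name> ).

```python
import re

# X-mode: white space in the pattern is ignored and # starts a comment.
PHONE = re.compile(
    r"""
    ^\(?(?P<area>\d{3})\)?   # area code, optional parentheses
    [\s.-]?                  # optional separator
    (?P<prefix>\d{3})        # three-digit prefix
    [\s.-]?                  # optional separator
    (?P<line>\d{4})$         # four-digit line number
    """,
    re.VERBOSE,
)

parsed = PHONE.match("(555) 123-4567")
parts = (parsed.group("area"), parsed.group("prefix"), parsed.group("line"))

# Inline comment: (?# ) works without any option set.
inline = re.search(r"\d{3}(?#prefix)-\d{4}(?#line number)", "555-1234")
inline_text = inline.group(0)
```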

Maintenance Tools & Helpers

Questions?

Documentation: http://msdn.microsoft.com/en-us/library/az24scfc(VS.71).aspx

Cheat Sheet: http://www.addedbytes.com/cheat-sheets/regular-expressions-cheat-sheet/

Tools

The Regulator (free!): http://sourceforge.net/projects/regulator/

RegexBuddy ($39.95): http://www.regexbuddy.com/

