Regular Expressions
Overview
Regular expressions allow you to do complex searches within text documents.
Example: when searching 8-K filings for restatements, a Boolean search for “restate” would yield too many “false positives.”
Regular expressions provide much more flexibility.
Getting Started
Open your “RegexBuddy” program. We are going to build regular expressions
to find specific text in this document using a variety of “Tokens.”
Specifying Literal Text
Literal defined - A literal means that the characters are to be interpreted “as is.” The application will not attempt to give the characters a special meaning.
For example, suppose you were looking for the text “\t”. You need to tell the application that you are looking for the two characters “\t” and not a tab, because \t normally represents a tab character.
Specifying Literal Text
Click on “Insert Token,” then click on “Literal Text.”
In the text box, type “\t” and click OK. You will see “\\t” in the regular expression window. The first “\” tells Perl to interpret the following “\” literally.
Non-printable characters
\t – Tab
\r – Carriage return
\n – Newline (UNIX/Linux)
\r\n – Newline (Windows)
Dot and Short-Hand Character Classes
. Match any character but newline (unless modified with s)
Short-Hand Character Classes
\w Match any word character (letters, digits and “_”)
\W Match any non-word character
\d Match a digit character
\D Match a non-digit character
\s Match a whitespace character
\S Match a non-whitespace character
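A short sketch of the dot and the short-hand classes, written with Python's re module as a stand-in for the Perl flavor (an assumption). The sample string is invented for illustration.

```python
import re

# Hypothetical sample: words, digits, a hyphen, spaces and one tab.
sample = "Form 8-K filed_2009\tQ3"

digits = re.findall(r"\d", sample)      # every digit character
words = re.findall(r"\w+", sample)      # runs of letters, digits and "_"
spaces = re.findall(r"\s", sample)      # the two spaces and the tab
non_words = re.findall(r"\W", sample)   # everything \w does not match

# Without the s modifier the dot stops at a newline,
# so each line is matched separately.
per_line = re.findall(r".+", "line one\nline two")

print(digits, words, spaces, non_words, per_line)
```

Note that the hyphen in “8-K” is not a word character, so \w+ splits it into “8” and “K”.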
Character Class and Anchors
Character Class
[456] - matches 4, 5 or 6.
[^456] - matches any character except 4, 5 or 6.
Exercise: create an expression that matches either “Balls” or “Balks.”
Anchors
• \A – beginning of the string
• \z – end of the string
• ^ – beginning of the line
• $ – end of the line
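One possible answer to the Balls/Balks prompt, plus the anchors in action. This is a sketch in Python's re module, used here as a stand-in for Perl (an assumption). Two differences to keep in mind: Python spells the end-of-string anchor \Z where Perl uses \z, and ^ and $ only match at line breaks when the multi-line mode is on.

```python
import re

# A character class lets the fourth letter be either l or k.
balls_or_balks = re.compile(r"Bal[lk]s")
accepted = [w for w in ["Balls", "Balks", "Bales"] if balls_or_balks.fullmatch(w)]

# A negated class matches any character that is NOT listed.
not_456 = re.findall(r"[^456]", "14567")

text = "first line\nsecond line"
string_start = re.findall(r"\A\w+", text)                    # start of the string only
line_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)  # start of each line
line_ends = re.findall(r"\w+$", text, flags=re.MULTILINE)    # end of each line
string_end = re.findall(r"\w+\Z", text)                      # end of the string only

print(accepted, not_456, string_start, line_starts, line_ends, string_end)
```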
Alternation
Alternation is essentially “OR.”
| - is inserted between alternatives.
Boy|Girl – matches “Boy” or “Girl”
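A small sketch of alternation, using Python's re module as a stand-in for Perl (an assumption). The second half shows a point the slide does not spell out: the | splits the whole pattern unless parentheses limit its scope. The Mr/Ms strings are made up for illustration.

```python
import re

# Boy|Girl matches either word wherever it occurs.
found = re.findall(r"Boy|Girl", "Boy meets Girl; Boys and Girls")

text = "Mr Smith and Ms Smith"

# Without parentheses the alternatives are "Mr" and "Ms Smith".
unscoped = re.findall(r"Mr|Ms Smith", text)

# With parentheses the alternatives are "Mr" and "Ms", each followed by " Smith".
scoped = [m.group(0) for m in re.finditer(r"(Mr|Ms) Smith", text)]

print(found, unscoped, scoped)
```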
Quantifiers
x? Match 0 or 1 occurrence of x
x* Match 0 or more occurrences of x
x+ Match 1 or more occurrences of x
(xyz)+ Match 1 or more occurrences of xyz
x{m,n} Match at least m and at most n occurrences of x
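Each quantifier above can be checked with a one-line test. The sketch uses Python's re module as a stand-in for Perl (an assumption), and the sample strings are invented for illustration.

```python
import re

# u? : the u is optional, so both spellings are accepted.
candidates = ["colr", "color", "colour", "colouur"]
optional_u = [w for w in candidates if re.fullmatch(r"colou?r", w)]

star = re.findall(r"ab*", "a ab abbb")   # a followed by 0 or more b
plus = re.findall(r"ab+", "a ab abbb")   # a followed by 1 or more b

# (xyz)+ repeats the whole group; group(0) is the full match.
repeated_group = re.search(r"(xyz)+", "xyzxyzxy").group(0)

# \d{2,4} takes at least 2 and at most 4 digits at a time.
bounded = re.findall(r"\d{2,4}", "1 12 123 12345")

print(optional_u, star, plus, repeated_group, bounded)
```

In the last line, “12345” yields only “1234”: the leftover “5” is a single digit and is too short to match.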
Grouping and Backreferencing
(string) - use for backreferencing
$1 - reference to contents of first set of parentheses
$2 - reference to contents of second set of parentheses
In the regex toolkit:
Put the following in the regular expression window: (.*)\s(.*)
Put the following in the “Test” window: John Smith
Select Group 2 from the highlight drop-down.
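The John Smith exercise can be reproduced in code. This sketch uses Python's re module as a stand-in for Perl (an assumption); Python exposes the groups as group(1) and group(2) and writes backreferences in a replacement as \1 and \2, where Perl uses $1 and $2. The name swap at the end is an added illustration, not part of the slide.

```python
import re

match = re.match(r"(.*)\s(.*)", "John Smith")

first = match.group(1)    # contents of the first set of parentheses
second = match.group(2)   # contents of the second set of parentheses

# Backreferences in the replacement text reuse the captured groups.
swapped = re.sub(r"(.*)\s(.*)", r"\2, \1", "John Smith")

print(first, second, swapped)
```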
Greediness
Normally, expressions match as many characters as possible (they are greedy).
$_ = "ab12345AB";
The substitution s/ab[0-9]*/X/ will replace as follows: XAB
We can turn off greediness by adding a “?” after the greedy quantifier (*).
The substitution s/ab[0-9]*?/X/ will replace as follows: X12345AB
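The two substitutions can be checked directly. The sketch below uses Python's re.sub as a stand-in for Perl's s/// operator (an assumption); count=1 mirrors Perl's default of replacing only the first match.

```python
import re

subject = "ab12345AB"

# Greedy: [0-9]* takes every digit it can, so "ab12345" is replaced.
greedy = re.sub(r"ab[0-9]*", "X", subject, count=1)

# Lazy: [0-9]*? takes as few digits as possible (zero), so only "ab" is replaced.
lazy = re.sub(r"ab[0-9]*?", "X", subject, count=1)

print(greedy)  # XAB
print(lazy)    # X12345AB
```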
Substitution of subpatterns
Remember, using () causes Perl to remember the contents.
Suppose we want to replace Fred with Freddy.
Put “(Fred)” in the regular expression window.
Put \1dy in the replace window.
Put Fred Couples in the Test window.
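The Fred-to-Freddy replacement, sketched with Python's re.sub as a stand-in for the RegexBuddy replace window (an assumption). The replacement text \1dy means “whatever group 1 captured, followed by dy.”

```python
import re

# (Fred) captures the name; \1 in the replacement puts it back, then adds "dy".
result = re.sub(r"(Fred)", r"\1dy", "Fred Couples")

print(result)  # Freddy Couples
```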
Look Ahead and Look Behind
Allows you to check ahead of or behind the current position for a particular pattern before continuing the match.
/PATTERN(?=pattern)/ Positive look ahead
/PATTERN(?!pattern)/ Negative look ahead
/(?<=pattern)PATTERN/ Positive look behind
/(?<!pattern)PATTERN/ Negative look behind
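A sketch of all four forms, using Python's re module as a stand-in for Perl (an assumption) and made-up sample strings. The look-around text is checked but is not part of the match, so the code records where each match starts rather than what it contains.

```python
import re

def starts(pattern, text):
    """Return the start offset of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

names = "Fred Freddy"
ahead_pos = starts(r"Fred(?=dy)", names)   # Fred only when followed by "dy"
ahead_neg = starts(r"Fred(?!dy)", names)   # Fred only when NOT followed by "dy"

titles = "Mr Smith, Ms Smith"
behind_pos = starts(r"(?<=Mr )Smith", titles)  # Smith only when preceded by "Mr "
behind_neg = starts(r"(?<!Mr )Smith", titles)  # Smith only when NOT preceded by "Mr "

print(ahead_pos, ahead_neg, behind_pos, behind_neg)
```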
Mode Modifiers
Dot matches newlines (s in Perl)
Case insensitive (i in Perl)
^ and $ match at line breaks (m in Perl)
Free-spacing (x in Perl)
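The four modes have direct counterparts in Python's re module, used here as a stand-in for Perl (an assumption): re.DOTALL for s, re.IGNORECASE for i, re.MULTILINE for m and re.VERBOSE for x. The sample strings are invented for illustration.

```python
import re

two_lines = "Total\nrestated"

# s: by default the dot does not cross a newline; with DOTALL it does.
dot_default = re.search(r"Total.restated", two_lines)
dot_all = re.search(r"Total.restated", two_lines, flags=re.DOTALL)

# i: case-insensitive matching.
any_case = re.findall(r"restate", "Restate RESTATE restate", flags=re.IGNORECASE)

# m: ^ and $ match at every line break, not just at the ends of the string.
whole_lines = re.findall(r"^\w+$", "one\ntwo", flags=re.MULTILINE)

# x: free-spacing mode ignores whitespace in the pattern and allows comments.
year_month = re.compile(r"""
    \d{4}   # four-digit year
    -       # literal hyphen
    \d{2}   # two-digit month
""", re.VERBOSE)
date_found = year_month.search("filed 2009-03").group(0)

print(dot_default, bool(dot_all), any_case, whole_lines, date_found)
```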
Note on Regex
Regular expressions can be used on many platforms besides Perl.
For example, SAS has built-in support for Perl regular expressions.