
Regular Expressions

Overview

Regular expressions allow you to do complex searches within text documents.

Example: Searching 8-K filings for restatements. A Boolean search for “restate” would yield too many “false positives.” Regular expressions provide tremendous flexibility.

Getting Started

Open your “RegexBuddy” program. We are going to build regular expressions to find specific text in this document using a variety of “Tokens.”

Specifying Literal Text

Literal defined - A literal just means that the characters are to be interpreted “as is.” The application will not attempt to interpret them as special tokens.

For example, suppose you were looking for the text “\t”. You need to tell the application that you are looking for a literal “\t” and not a tab, because \t normally represents a tab character.

Specifying Literal Text

Click on “Insert Token,” then click on “Literal Text.”

In the text box, type “\t” and click OK. You will see “\\t” in the regular expression window. The first “\” tells Perl to interpret the following “\” literally.
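The difference between the tab token and the escaped literal can also be checked outside RegexBuddy. The following is a minimal sketch using Python's `re` module, which is an assumption (the slides work in RegexBuddy with Perl syntax); the sample string is hypothetical.

```python
import re

# Hypothetical sample: contains one real tab and one literal backslash-t.
sample = "tab:\there, literal:\\there"

# r"\t" is the regex token for a tab character.
tab_hits = re.findall(r"\t", sample)

# r"\\t" escapes the backslash, so it matches a backslash followed by "t".
literal_hits = re.findall(r"\\t", sample)

# re.escape builds the escaped (literal) pattern from the raw text.
escaped = re.escape("\\t")

print(len(tab_hits), len(literal_hits), escaped)
```

Each pattern finds exactly one match here, and `re.escape` produces the same `\\t` pattern that RegexBuddy shows.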

Non-printable characters

\t - Tab
\r - Carriage return
\n - Newline (UNIX/Linux)
\r\n - Newline (Windows)

Dot and Short-Hand Character Classes

.  Match any character but newline (unless modified with s)

Short-Hand Character Classes

\w  Match any word character (includes numbers and “_”)
\W  Match any non-word character
\d  Match a digit character
\D  Match a non-digit character
\s  Match a whitespace character
\S  Match a non-whitespace character
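To see what the dot and the short-hand classes pick up, here is a small sketch in Python's `re` module (an assumption; the slides use Perl syntax in RegexBuddy). The sample text is made up, and `re.S` plays the role of the s modifier.

```python
import re

# Hypothetical sample text; the "\n" in the middle is a real newline.
text = "Form 8-K filed 2009_Q1\nnext line"

digits = re.findall(r"\d", text)          # every digit character
words = re.findall(r"\w+", text)          # runs of word characters (letters, digits, "_")
non_space_runs = re.findall(r"\S+", text) # runs of non-whitespace characters
whitespace = re.findall(r"\s", text)      # spaces and the newline

# The dot stops at a newline by default...
dot_runs = re.findall(r".+", text)
# ...but crosses it when the "dot matches newline" mode (re.S) is on.
dot_runs_s = re.findall(r".+", text, flags=re.S)

print(words)
print(dot_runs, dot_runs_s)
```

Note that `\w+` splits “8-K” into “8” and “K” because the hyphen is not a word character, while `\S+` keeps it whole.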

Character Class and Anchors

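Character Class

[456] - matches 4, 5 or 6.
[^456] - matches any character other than 4, 5 or 6.

Exercise: create an expression that matches either “Balls” or “Balks.”

Anchors

• \A - beginning of the string
• \z - end of the string
• ^ - beginning of the line (when the m modifier is on; otherwise beginning of the string)
• $ - end of the line (when the m modifier is on; otherwise end of the string)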
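One possible answer to the “Balls”/“Balks” exercise, together with the anchors, sketched in Python's `re` module. Python is an assumption here, and it differs from Perl in one detail used below: the end-of-string anchor is written `\Z` in Python, where Perl uses `\z`. The sample strings are hypothetical.

```python
import re

# Character class: the fourth letter may be "l" or "k".
balls_or_balks = re.compile(r"Bal[lk]s")
matches = balls_or_balks.findall("Balls, Balks, Balms")

# Negated class: any character except 4, 5 or 6.
non_456 = re.findall(r"[^456]", "4a5b6")

text = "first line\nsecond line"

string_start = re.findall(r"\A\w+", text)            # start of the whole string only
string_end = re.findall(r"\w+\Z", text)              # end of the whole string only
line_starts = re.findall(r"^\w+", text, flags=re.M)  # start of every line
line_ends = re.findall(r"\w+$", text, flags=re.M)    # end of every line

print(matches, non_456)
print(string_start, line_starts, line_ends, string_end)
```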

Alternation

Alternation is essentially “OR.”

| - is inserted between alternatives.
Boy|Girl - matches “Boy” or “Girl”
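A short sketch of alternation in Python's `re` module (an assumption; the slides use RegexBuddy). The second pattern wraps the alternatives in a non-capturing group, a common way to limit how far the “OR” reaches; the sample strings are hypothetical.

```python
import re

# Plain alternation: match either word, case-sensitively.
found = re.findall(r"Boy|Girl", "Boy meets Girl; boys and girls")

# Grouped alternation: the "s" applies to both alternatives.
plurals = re.findall(r"\b(?:Boy|Girl)s\b", "Boys Girls Boy")

print(found, plurals)
```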

Quantifiers

x?  Match 0 or 1 x
x*  Match 0 or more occurrences of x
x+  Match 1 or more occurrences of x
(xyz)+  Match 1 or more occurrences of xyz
x{m,n}  Match at least m and at most n occurrences of x
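Each quantifier can be tried on a small input. The following sketch uses Python's `re` module (an assumption; the quantifier syntax shown is the same as in Perl), with made-up sample strings.

```python
import re

# x?  : the "u" is optional.
optional = re.findall(r"colou?r", "color colour colouur")

# x*  : "a" followed by zero or more "b".
star = re.findall(r"ab*", "a ab abbb")

# x+  : "a" followed by one or more "b".
plus = re.findall(r"ab+", "a ab abbb")

# (xyz)+ : the whole group repeats (non-capturing here so findall returns the full match).
group_plus = re.findall(r"(?:xyz)+", "xyzxyz xy")

# x{m,n} : between 2 and 4 digits.
bounded = re.findall(r"\d{2,4}", "1 12 12345")

print(optional, star, plus, group_plus, bounded)
```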

Grouping and Backreferencing

(string) - parentheses capture the matched text for backreferencing
$1 - reference to contents of first set of parentheses
$2 - reference to contents of second set of parentheses

In the regex toolkit:

Put the following in the regular expression window: (.*)\s(.*)

Put the following in the “Test” window: John Smith

Select Group 2 from the highlight drop-down. “Smith” should be highlighted.
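The same “John Smith” exercise can be reproduced in code. This is a sketch in Python's `re` module, which is an assumption; Python reads groups with `group(1)` and `group(2)` where Perl uses `$1` and `$2`. The doubled-word pattern at the end is an extra illustration that is not in the slides.

```python
import re

m = re.search(r"(.*)\s(.*)", "John Smith")
first = m.group(1)   # contents of the first set of parentheses
second = m.group(2)  # contents of the second set of parentheses

# Groups can be reused in a replacement: swap the two names.
swapped = re.sub(r"(.*)\s(.*)", r"\2, \1", "John Smith")

# A backreference inside the pattern itself: find a doubled word.
doubled = re.search(r"\b(\w+)\s+\1\b", "this is is a test").group(1)

print(first, second, swapped, doubled)
```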

Greediness

Normally, expressions match as many characters as possible (they are greedy).

$_ = "ab12345AB"

The regex s/ab[0-9]*/X/ will replace as follows: XAB

We can turn off greediness by adding a “?” after the greedy quantifier (*).

The regex s/ab[0-9]*?/X/ will replace as follows: X12345AB
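The two substitutions can be verified with a short script. This sketch uses Python's `re.sub` in place of Perl's `s///` operator (an assumption), on the same input string as the slide.

```python
import re

text = "ab12345AB"

# Greedy: [0-9]* takes every digit it can, so "ab12345" is replaced.
greedy = re.sub(r"ab[0-9]*", "X", text)

# Lazy: [0-9]*? takes as few digits as possible (none), so only "ab" is replaced.
lazy = re.sub(r"ab[0-9]*?", "X", text)

print(greedy)  # XAB
print(lazy)    # X12345AB
```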

Substitution of subpatterns

Remember, using () causes Perl to remember the contents.

Suppose we want to replace Fred with Freddy:

Put “(Fred)” in the regular expression window.
Put “\1dy” in the replace window.
Put “Fred Couples” in the Test window.

The result is “Freddy Couples.”
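The Fred-to-Freddy replacement looks like this in code. This is a sketch in Python's `re` module (an assumption); `\g<1>` is Python's unambiguous spelling of the same group reference and is useful when the text after the reference starts with a digit.

```python
import re

# \1 inserts whatever the first set of parentheses captured.
result = re.sub(r"(Fred)", r"\1dy", "Fred Couples")

# Equivalent form with an explicit group reference.
result_explicit = re.sub(r"(Fred)", r"\g<1>dy", "Fred Couples")

print(result)  # Freddy Couples
```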

Look Ahead and Look Behind

Lets you check ahead of or behind the current position for a particular pattern before continuing the match. The text that is checked is not included in the match.

/PATTERN(?=pattern)/ Positive look ahead

/PATTERN(?!pattern)/ Negative look ahead

/(?<=pattern)PATTERN/ Positive look behind

/(?<!pattern)PATTERN/ Negative look behind
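The four lookaround forms can be compared side by side. This is a sketch in Python's `re` module (an assumption; the syntax is the same as in Perl), with hypothetical currency examples. The `\b` word boundaries in the negative cases keep the engine from backtracking to a shorter number that would satisfy the condition.

```python
import re

amounts = "100 USD, 200 EUR"
prices = "$50 and 60"

# Positive look ahead: digits that ARE followed by " USD".
usd = re.findall(r"\d+(?= USD)", amounts)

# Negative look ahead: whole numbers that are NOT followed by " USD".
not_usd = re.findall(r"\b\d+\b(?! USD)", amounts)

# Positive look behind: digits that ARE preceded by "$".
dollars = re.findall(r"(?<=\$)\d+", prices)

# Negative look behind: whole numbers that are NOT preceded by "$".
not_dollars = re.findall(r"(?<!\$)\b\d+", prices)

print(usd, not_usd, dollars, not_dollars)
```

In every case only the digits are returned; “ USD” and “$” are checked but never become part of the match.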

Mode Modifiers

Dot matches new lines (s in Perl)
Case insensitive (i in Perl)
^$ match at line breaks (m in Perl)
Free-spacing (x in Perl)
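Each mode can be switched on in code as well. This sketch uses Python's `re` flags as stand-ins for the Perl modifiers (an assumption): `re.S` for s, `re.I` for i, `re.M` for m and `re.X` for x. The sample text is hypothetical.

```python
import re

text = "Line one\nline two"

# s: without it the dot cannot cross the newline; with it, it can.
dot_default = re.findall(r"one.line", text)
dot_all = re.findall(r"one.line", text, flags=re.S)

# i: ignore case.
case_insensitive = re.findall(r"line", text, flags=re.I)

# m: ^ and $ match at every line break, not just the ends of the string.
line_starts = re.findall(r"^\w+", text, flags=re.M)

# x: free-spacing; whitespace and comments inside the pattern are ignored.
two_words = re.compile(r"""
    (\w+)   # first word
    \s      # one whitespace character
    (\w+)   # second word
""", flags=re.X)
first_pair = two_words.search(text).groups()

print(dot_default, dot_all, case_insensitive, line_starts, first_pair)
```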

Note on Regex

Regular expressions can be used on many platforms (besides Perl). For example, SAS has built-in support for Perl regular expressions.
