Page 1: Copyright © 2013 Splunk Inc. Regex is Fun David Clawson SplunkYoda

Copyright © 2013 Splunk Inc.

Regex is Fun

David Clawson

SplunkYoda

Page 3:

Regular Expressions

• “A regular expression is a pattern which specifies a set of strings of characters; it is said to match certain strings.”

—Ken Thompson

• Ken Thompson's version of the QED text editor brought regular expressions into practical use in the late 1960s

• Invented in the 1940s

• Help celebrate its 70th year

Page 4:

Types of Regular Expressions

Page 5:

How is Regex used in Python?

Python “re”

Python's built-in "re" module provides excellent support for regular expressions, with a modern and complete regex flavor.

When this talk was given, the only significant features missing from Python's regex syntax were atomic grouping, possessive quantifiers, and Unicode properties. Python 3.11 later added atomic grouping and possessive quantifiers; Unicode properties are still not supported by "re".

Using Regular Expressions in Python

The first thing to do is to import the regex module into your script with “import re”.

Page 6:

How is Regex used in Python?

Call re.search(regex, subject) to apply a regex pattern to a subject string.

• The function returns None if the matching attempt fails, and a Match object otherwise.

• The Match object stores details about the part of the string matched by the regular expression pattern.

Since None evaluates to False, you can easily use re.search() in an if statement.
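A minimal sketch of re.search() inside an if statement. The subject string is made up for illustration and is not from the talk.

```python
import re

# Hypothetical event text, used only for illustration.
subject = "EventCode=592 Type=Information"

# re.search() returns a Match object on success and None on failure,
# so the result can be tested directly in an if statement.
match = re.search(r"EventCode=(\d+)", subject)
if match:
    event_code = match.group(1)  # text captured by the first group
    print("found event code", event_code)
else:
    event_code = None
    print("no event code found")
```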

Page 7:

How is Regex used in Python?

Do not confuse re.search() with re.match().

• Both functions apply the pattern in the same way, with one important distinction: re.search() attempts the pattern at every position in the string until it finds a match, while re.match() only attempts the pattern at the very start of the string.
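The difference is easiest to see on one string. This is a small sketch with an invented subject, not an example from the slides.

```python
import re

# Hypothetical subject string for illustration.
subject = "src=10.23.10.11 status=301"

anywhere = re.search(r"status=\d+", subject)  # scans the whole string
at_start = re.match(r"status=\d+", subject)   # only tries position 0
prefix = re.match(r"src=", subject)           # succeeds: the text is at the start

print(anywhere.group(0))
print(at_start)
```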

Page 8:

How is Regex used in Python?

To get all matches from a string, call re.findall(regex, subject). This will return a list of all non-overlapping regex matches in the string. "Non-overlapping" means that the string is searched through from left to right, and the next match attempt starts beyond the previous match.

If the regex contains exactly one capturing group, re.findall() returns a list of the text matched by that group. If it contains more than one capturing group, re.findall() returns a list of tuples, with each tuple containing the text matched by all the capturing groups.

The overall regex match is not included in the result, unless you place the entire regex inside a capturing group.
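A short sketch of how the return value of re.findall() changes with the number of capturing groups. The subject reuses the two IP addresses from the sample log in this deck; the patterns are illustrative only.

```python
import re

subject = "10.23.10.11 - 10.100.0.11"

# No capturing groups: a list of whole matches.
no_groups = re.findall(r"\d+\.\d+\.\d+\.\d+", subject)

# One capturing group: a list of that group's text.
one_group = re.findall(r"(\d+)\.\d+\.\d+\.\d+", subject)

# Two capturing groups: a list of tuples, one tuple per match.
two_groups = re.findall(r"(\d+)\.(\d+)\.\d+\.\d+", subject)

print(no_groups, one_group, two_groups)
```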

Page 9:

How is Regex used in Python?

More efficient than re.findall() is re.finditer(regex, subject). It returns an iterator that enables you to loop over the regex matches in the subject string:

for m in re.finditer(regex, subject)

The for-loop variable m is a Match object with the details of the current match.
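A minimal sketch of looping with re.finditer(). The subject string is invented for illustration.

```python
import re

# Hypothetical subject string for illustration.
subject = "GET /a 301, GET /b 200, GET /c 404"

statuses = []
for m in re.finditer(r"\b(\d{3})\b", subject):
    # Each m is a Match object: group(1) is the captured status code,
    # start() is the offset where this match begins.
    statuses.append((m.group(1), m.start()))

print(statuses)
```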

Page 10:

How is Regex used in Splunk?

Field extraction

| rex field=_raw "%UC_CALLMANAGER-(?<Severity>\d+)-EndPointUnregistered:"

Configure Line Breaking

LINE_BREAKER = ([\r\n]+)

Filtering and Routing Data to Queues

REGEX = (?m)^EventCode=(592|593)

Many more…

Page 11:

Regex Testing Tools

• RegExr http://gskinner.com/RegExr/

• Reggy http://reggyapp.com/

• RegexPal http://regexpal.com/

• Regex Buddy http://www.regexbuddy.com/

• Lars Olav Torvik http://regex.larsolavtorvik.com/

• Rubular http://rubular.com/

Page 12:

Regex Reference Texts

• http://www.regular-expressions.info/reference.html - from the creators of RegexBuddy

• Introducing Regular Expressions by Michael Fitzgerald

• Mastering Regular Expressions by Jeffrey Friedl

• Regular Expressions Cookbook by Jan Goyvaerts

• Regular Expressions Pocket Reference by Tony Stubblebine

Page 13:

Basic Concepts of Regular Expressions

Because Knowing Leads to Doing

Page 15:

Simple Pattern Matching

Matching String Literals

Matching Digits and Non-Digits

Matching Word and Non-Word Characters

Matching Whitespace

Matching Any Character

Page 16:

Matching String Literals

Sample Apache Log

10.23.10.11 www.iamcool.com 10.100.0.11 - - [06/Dec/2012:14:39:03 -0800] "GET /Facelift/answers/swelling HTTP/1.1" 301 20 14932 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

A literal string match of the first IP address would be:

10\.23\.10\.11

(The dots are escaped because an unescaped dot matches any character.)
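A dot in a pattern matches any character, so a strictly literal match needs the dots escaped. The sketch below uses Python's re.escape() to do that; the "lookalike" string is invented to show the difference.

```python
import re

log_line = "10.23.10.11 www.iamcool.com 10.100.0.11 - - [06/Dec/2012:14:39:03 -0800]"
lookalike = "10x23y10z11"  # hypothetical text that is NOT the IP address

unescaped = re.compile("10.23.10.11")            # each dot matches any character
escaped = re.compile(re.escape("10.23.10.11"))   # dots matched literally

print(bool(unescaped.search(lookalike)))  # the loose pattern accepts the lookalike
print(bool(escaped.search(lookalike)))    # the literal pattern rejects it
```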

Page 17:

Matching Digits and Non-Digits

\d or \D or [0-9]

\d - match a digit

\D - match a non-digit (letters, whitespace, punctuation, and any other character that is not a digit)

[0-9] - match any single digit (called a character class)

[^0-9] - match any single non-digit
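A quick sketch of the digit shorthands, run against the tail of the request field from the sample log.

```python
import re

subject = "HTTP/1.1 301 20 14932"

digits = re.findall(r"\d", subject)           # every single digit
non_digit_runs = re.findall(r"\D+", subject)  # runs of anything that is not a digit
numbers = re.findall(r"[0-9]+", subject)      # runs of digits via a character class

print(numbers)
print(non_digit_runs)
```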

Page 18:

Matching Words and Non-Words

\w or \W

\w - match any word character; essentially the same as the character class [a-zA-Z0-9_] (note the underscore)

\W - match any non-word character
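The underscore is the detail that is easy to miss. A small sketch with an invented key=value string (in Python 3, \w on str patterns also matches non-ASCII letters and digits):

```python
import re

# Hypothetical key=value text for illustration.
subject = "sourcetype=access_combined"

words = re.findall(r"\w+", subject)                   # underscore counts as a word character
no_underscore = re.findall(r"[a-zA-Z0-9]+", subject)  # this class splits on the underscore
separators = re.findall(r"\W", subject)               # everything that is not a word character

print(words, no_underscore, separators)
```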

Page 19:

Matching Whitespace

\s or \S

\s - match whitespace (spaces, tabs, line feeds and carriage returns)

\S - match any character that is not whitespace; same as [^\s]
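A minimal sketch of \s and \S on a string with mixed whitespace. The spacing in the subject is invented for illustration.

```python
import re

# Request line with a tab, a double space, and a CRLF ending.
subject = "GET\t/Facelift/answers/swelling  HTTP/1.1\r\n"

tokens = re.findall(r"\S+", subject)           # runs of non-whitespace
whitespace_runs = re.findall(r"\s+", subject)  # runs of whitespace

print(tokens)
```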

Page 20:

Character shorthands for whitespace

Character Shorthand    Description
\f                     Form Feed
\h                     Horizontal Whitespace
\H                     Not Horizontal Whitespace
\n                     Newline
\r                     Carriage Return
\t                     Horizontal Tab
\v                     Vertical Tab (whitespace)
\V                     Not vertical whitespace

Page 21:

Matching Any Character

Dot (.)

Matches any character except line-ending characters

\b - matches a word boundary without consuming any characters
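A small sketch of both points: the dot stops at a line ending, and \b restricts a match to whole words. The subject text is invented.

```python
import re

# Hypothetical two-line subject for illustration.
subject = "cat concatenate\ncat"

dot_runs = re.findall(r".+", subject)         # the dot does not cross the newline
any_cat = re.findall(r"cat", subject)         # also finds "cat" inside "concatenate"
whole_word = re.findall(r"\bcat\b", subject)  # only stand-alone words

print(dot_runs, len(any_cat), len(whole_word))
```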

Page 22:

Boundaries and Alternation

Matching the Beginning and End of Line

List of Regex Special Characters

Alternation and Regex Options

Subpatterns

Capturing and Named Groups

Character Classes

Negated Character Classes

Page 23:

Matching Beginning and End of Line

^ OR $

^ - matches the beginning of a line

$ - matches the end of a line
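In Python's re module, ^ and $ anchor to the whole string unless multiline mode is on. A minimal sketch with an invented three-line event:

```python
import re

# Hypothetical multi-line event for illustration.
events = "EventCode=592\nUser=admin\nEventCode=593"

# Default mode: ^ only matches at the start of the string.
default_mode = re.findall(r"^EventCode=(\d+)$", events)

# Multiline mode: ^ and $ match at the start and end of every line.
multiline_mode = re.findall(r"^EventCode=(\d+)$", events, flags=re.MULTILINE)

print(default_mode, multiline_mode)
```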

Page 24:

List of Regex Special Characters

. ^ * + ? | ( ) { } [ ] \ -

. - matches any character

^ - matches beginning of the line

* - matches zero or more

+ - matches one or more

? - matches zero or one

| - used for alternation (choice of patterns to match)

() - used for grouping

{} - used as a quantifier

[] - used with character classes

\ - used to escape a special character so it is matched literally, or to introduce a shorthand such as \d

- - hyphen is used in a character class range

Page 25:

Alternation and Options

| OR ?

| - gives a choice of alternate patterns to match, e.g. (THE|The|the)

(?i) - case insensitive

(?J) - allow duplicate group names

(?m) - multiline: ^ and $ match at the start and end of every line

(?s) - single line (dotall): the dot also matches line endings

(?U) - ungreedy: quantifiers are lazy by default

(?x) - extended: ignore whitespace and comments in the pattern

(?-…) - unset or turn off options
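A short sketch of the inline options that Python's re module shares with PCRE: (?i), (?m), (?s) and (?x). The (?J) and (?U) options are PCRE features and are not available in re. The sample text is invented.

```python
import re

# Hypothetical three-line text for illustration.
text = "The cat\nTHE dog\nthe end"

case_sensitive = re.findall(r"the", text)
case_insensitive = re.findall(r"(?i)the", text)   # (?i): ignore case
line_starts = re.findall(r"(?m)^\w+", text)       # (?m): ^ matches at every line start
no_dotall = re.search(r"cat.*dog", text)          # dot stops at the newline
dotall = re.search(r"(?s)cat.*dog", text)         # (?s): dot also matches newlines

# (?x): whitespace and comments inside the pattern are ignored.
verbose = re.findall(r"""(?x)
    \d+     # one or more digits
    \s*     # optional whitespace
    ms      # literal unit
""", "took 42 ms")

print(case_insensitive, line_starts, verbose)
```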

Page 26:

Subpatterns

Group(s) within a group

(THE|The|the) - has three subpatterns

(t|T)h(e|eir) - matches the, The, their, Their
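A minimal check of the second pattern, written here as (t|T)h(e|eir). The word list is invented for illustration.

```python
import re

pattern = re.compile(r"(t|T)h(e|eir)")

# Hypothetical candidates; fullmatch() requires the whole word to match.
words = ["the", "The", "their", "Their", "THE", "this"]
matched = [w for w in words if pattern.fullmatch(w)]

# Each parenthesised subpattern is also captured as a group.
groups = pattern.fullmatch("Their").groups()

print(matched, groups)
```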

Page 27:

Capturing and Named Groups

() OR (?<name>…) OR (?P<name>…)

Store their content in memory

(it is) (time to eat)

"it is" is stored as $1 and "time to eat" as $2

(?<Severity>\d)

Splunk creates a field of Severity from this named group
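The same ideas in Python, as a sketch. Python's re module only accepts the (?P<name>…) spelling for named groups, and refers to numbered groups as \1, \2 in replacements rather than $1, $2. The event text is invented to fit the rex pattern shown earlier in the deck.

```python
import re

# Hypothetical event text for illustration.
event = "%UC_CALLMANAGER-3-EndPointUnregistered: device unregistered"

m = re.search(r"%UC_CALLMANAGER-(?P<Severity>\d+)-EndPointUnregistered:", event)
fields = m.groupdict()  # named groups as a dict, similar in spirit to extracted fields

# Numbered groups: group(1) and group(2) hold the two captured phrases.
phrase = re.search(r"(it is) (time to eat)", "it is time to eat")
swapped = re.sub(r"(it is) (time to eat)", r"\2, \1", "it is time to eat")

print(fields, phrase.group(1), phrase.group(2), swapped)
```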

Page 28:

Character Classes

[]

[aeiou] –only matches the characters inside of the brackets

[0-9] –matches a range of characters, using a hyphen

[a-zA-Z0-9] –matches all alphanumeric characters

Page 29:

Negated Character Classes

[^…]

*** Super important - especially for Splunk field extractions ***

[^aeiou] - matches any character that is NOT a lowercase vowel (consonants, but also digits, spaces and punctuation)

[^\s] - matches any character that is not whitespace
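A sketch of why negated classes suit field extraction: "everything up to the next delimiter" is stated directly. The event text and field names are invented for illustration.

```python
import re

# Hypothetical key=value event for illustration.
event = 'src=10.23.10.11 user=admin action="login failed"'

# Value runs until the next whitespace character.
src = re.search(r"src=([^\s]+)", event).group(1)

# Quoted value runs until the closing quote.
action = re.search(r'action="([^"]*)"', event).group(1)

# [^aeiou] matches more than consonants: digits, spaces and punctuation too.
non_vowels = re.findall(r"[^aeiou]", "a1 b!")

print(src, action, non_vowels)
```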

Page 30:

Quantifiers

Greedy, Lazy, Possessive

Matching a certain number of times

Page 31:

Greedy, Lazy, Possessive

* + ?

* - match zero or more times

.* - will match all of the characters in the subject text (want to avoid this)

+ - match one or more times

\d+ - match all of the digits until there aren't any more (greedy)

? - match 0 or 1 of the preceding token

colou?r - matches either color or colour
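A minimal sketch contrasting a greedy .* with its lazy form .*?, plus the optional-character case. The subject string is invented for illustration.

```python
import re

# Hypothetical subject with two quoted fields.
subject = '"GET /index.html" "Mozilla/5.0"'

greedy = re.search(r'".*"', subject).group(0)  # runs to the LAST quote
lazy = re.search(r'".*?"', subject).group(0)   # stops at the FIRST closing quote

digits = re.search(r"\d+", "status 14932 ok").group(0)  # takes every digit it can

optional = [w for w in ("color", "colour", "colr") if re.fullmatch(r"colou?r", w)]

print(greedy)
print(lazy)
```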

Page 32:

Matching a Certain Number of Times

{}

\d{3} - matches exactly 3 digits

\d{1,3} - matches 1 to 3 digits

\d{1,} - same as \d+

\d{0,} - same as \d*

\d{0,1} - same as \d?
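A small sketch that checks the counts and the three equivalences on a handful of invented samples. The agree() helper is hypothetical and only compares the patterns on those samples.

```python
import re

samples = ["", "7", "301", "14932"]

def agree(p1, p2):
    """True when both patterns accept exactly the same samples."""
    return all(bool(re.fullmatch(p1, s)) == bool(re.fullmatch(p2, s)) for s in samples)

exactly_three = [s for s in samples if re.fullmatch(r"\d{3}", s)]
one_to_three = [s for s in samples if re.fullmatch(r"\d{1,3}", s)]

equivalences = [
    agree(r"\d{1,}", r"\d+"),
    agree(r"\d{0,}", r"\d*"),
    agree(r"\d{0,1}", r"\d?"),
]

print(exactly_three, one_to_three, equivalences)
```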

Page 33:

Any Thoughts or Ideas?

Page 34:

Optimized Regular Expressions

Because fast is elegant!

Page 35:

Optimize Regular Expressions

Good: (whiskey)
Better: (?:whiskey)

Capture groups add unnecessary overhead and impact overall performance; use them only when necessary.
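A sketch of the behavioral difference: both forms match the same text, but only the capturing form stores a group. The subject string is invented, and the example does not measure speed.

```python
import re

# Hypothetical subject string for illustration.
subject = "a glass of whiskey"

capturing = re.search(r"(whiskey)", subject)
non_capturing = re.search(r"(?:whiskey)", subject)

# Same overall match, but (?:...) keeps nothing in memory.
print(capturing.group(0), capturing.groups())
print(non_capturing.group(0), non_capturing.groups())
```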

Page 36:

Optimize Regular Expressions

Good: splunk|splash
Better: spl(?:unk|ash)

Try to “factor” on the left, when you can, while exposing required text. Less alternation is better.

Page 37:

Optimize Regular Expressions

Good: (?:aussie$|gypsie$)
Better: (?:aus|gyp)sie$

Try to “factor” on the right when input text is close to end of the line. Most regex engines will anchor at end of line when “$” is present.

Page 38:

Optimize Regular Expressions

Good: 0{3,7}
Better: 0000{0,4}

Typically, exposing required or literal text makes the engine execute the regex faster.

Page 39:

Optimize Regular Expressions

Good: (.)*
Better: .*

Useless parentheses add unnecessary overhead. As above, use them only when necessary.

Page 40:

Optimize Regular Expressions

Good: matty[:]
Better: matty:

The character class/set (indicated by []) will add unnecessary overhead when not needed.

Page 41:

Optimize Regular Expressions

Good: ^genti|^collar
Better: ^(?:genti|collar)

Anchoring the regex at the beginning of the line will result in improved performance with most regex engines.

Page 42:

Optimize Regular Expressions

Good: delaney$|connery$
Better: (delaney|connery)$

I said, anchor the regex!

Page 43:

Optimize Regular Expressions

Good: ^src.*:
Better: ^src[^:]*:

Using a negated character class/set instead of a lazy or greedy dot (.*? or .*) will typically result in faster regexes. A lazy or greedy dot makes the regex engine backtrack, which ultimately impacts overall performance.
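Note that the two patterns are not interchangeable when the text contains more than one colon. A sketch with an invented event shows what each one captures; it does not measure speed.

```python
import re

# Hypothetical event with several colons.
event = "src_ip: 10.23.10.11 dest: 10.100.0.11 port: 443"

# Greedy dot: runs to the end of the line, then backtracks to the LAST colon.
greedy = re.search(r"^src.*:", event).group(0)

# Negated class: can never step over a colon, so it stops at the FIRST one.
negated = re.search(r"^src[^:]*:", event).group(0)

print(greedy)
print(negated)
```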

Page 44:

Optimize Regular Expressions

Good: bride|brian
Better: bri(?:de|an)

Full alternation is more expensive than partial alternation. Also, in this case the regex engine will alternate only AFTER ‘bri’ has been matched.
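Before swapping a pattern for a "better" one, it is worth checking that both still match the same text. The sketch below does that for three of the rewrites in this section; the same_matches() helper and the sample strings are hypothetical, and the check is only as thorough as the samples.

```python
import re

def same_matches(good, better, samples):
    """Return True if both patterns find the same text in every sample."""
    for s in samples:
        a = re.search(good, s)
        b = re.search(better, s)
        if (a and a.group(0)) != (b and b.group(0)):
            return False
    return True

# (original, rewrite, sample inputs)
pairs = [
    (r"splunk|splash", r"spl(?:unk|ash)", ["splunk", "a splash", "splat"]),
    (r"0{3,7}", r"0000{0,4}", ["00", "000", "0000000", "000000000"]),
    (r"bride|brian", r"bri(?:de|an)", ["bride", "brian", "brie"]),
]

results = [same_matches(good, better, samples) for good, better, samples in pairs]
print(results)
```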

Page 45:

Optimize Regular Expressions

Good: (?:edu|com|net|…)
Better: (?:com|edu|net|…)

Leading the engine to a match by placing the most popular match first may result in faster execution in some engines.

Page 46:

Optimize Regular Expressions

Good: ^.*(answer)
Better: ^.{42}(answer)

Specifying an exact position inside the string and leading the engine to a match will help improve performance drastically compared to using a simple greedy/lazy quantifier.

Page 47:

Optimize Regular Expressions

Good: .*?a
Better: ^.*a

If ‘a’ is near the end of the input string, the greedy form will match faster because less backtracking is required.

Page 48:

Optimize Regular Expressions

Good: .*a
Better: ^.*?a

If ‘a’ is near the beginning of the input string, the lazy form will match faster.

Page 49:

Optimize Regular Expressions

Good: :[^:]*:
Better: :[^:]*+:

Example: on the input ‘ :destination’ the second regex fails faster.

Page 50:

Optimize Regular Expressions

Good: :[^:]*:
Better: :(?>[^:]*):

Same as above, using different notation. Explanation:

Atomic grouping or possessive quantifiers instruct the regex engine not to keep the states captured by * or +, which prevents it from backtracking unsuccessfully and in turn lets it fail faster.
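A sketch of the three spellings in Python. The possessive and atomic forms are only accepted by the re module from Python 3.11 onward, so they are guarded by a version check here; on older versions only the plain pattern runs. The example shows that all forms agree on the result and does not measure speed.

```python
import re
import sys

subject = " :destination"  # only one colon, so none of the patterns can match

results = {"plain": re.search(r":[^:]*:", subject)}

if sys.version_info >= (3, 11):
    # Possessive quantifier: [^:]*+ never gives characters back.
    results["possessive"] = re.search(r":[^:]*+:", subject)
    # Atomic group: (?>...) discards backtracking positions once it has matched.
    results["atomic"] = re.search(r":(?>[^:]*):", subject)
    # The atomic form still matches when a closing colon is present.
    results["atomic_hit"] = re.search(r":(?>[^:]*):", ":dest: ")

print({name: (m.group(0) if m else None) for name, m in results.items()})
```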

Page 51:

Python for the Masses