![Page 1: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/1.jpg)
Secrets of Regexp
Hiro Asari
Red Hat, Inc.
![Page 2: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/2.jpg)
Let's Talk About Regular Expressions
![Page 3: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/3.jpg)
Let's Talk About Regular Expressions
• There is no regular expression
![Page 4: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/4.jpg)
Let's Talk About Regular Expressions
• A good approximation as a name
![Page 5: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/5.jpg)
Let's Talk About Regexp
![Page 6: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/6.jpg)
Some people, when confronted with a problem, think, "I know, I'll use regular expressions."
Now they have two problems.
Jamie Zawinski, 12 Aug 1997
http://regex.info/blog/2006-09-15/247
http://www.codinghorror.com/blog/2008/06/regular-expressions-now-you-have-two-problems.html
The point is not so much the evils of regular expressions as the evils of overusing them.
![Page 7: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/7.jpg)
Formal Language Theory
• The Language L
• Over Alphabet Σ
![Page 8: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/8.jpg)
Formal Language Theory
• Alphabet Σ={a, b, c, d, e, …, z, λ} (example)
![Page 9: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/9.jpg)
Formal Language Theory
• Alphabet Σ={a, b, c, d, e, …, z, λ} (example)
• Words over Σ: "a", "b", "ab", "aequafdhfad"
![Page 10: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/10.jpg)
Formal Language Theory
• Alphabet Σ={a, b, c, d, e, …, z, λ} (example)
• Words over Σ: "a", "b", "ab", "aequafdhfad"
• Σ*: The set of all words over Σ
![Page 11: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/11.jpg)
Formal Language over Σ
• A subset L of Σ* (with various properties)
• L can be finite (its well-formed words can simply be enumerated), but it is often infinite
![Page 12: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/12.jpg)
Example
• Language L over Σ = {a,b}
• 'a' is a word
• a word may be obtained by appending 'ab' to an existing word
• only words thus formed are legal
![Page 13: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/13.jpg)
a
aab
aabab
Well-formed words
![Page 14: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/14.jpg)
baaaababb
Ill-formed words
![Page 15: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/15.jpg)
Succinctly…
• a(ab)*
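As a rough illustration, membership in the example language can be checked in Ruby by anchoring the expression to the whole string. The helper name `well_formed?` is made up for this sketch and does not come from the slides.

```ruby
# Hypothetical helper: true when `word` belongs to the language a(ab)*.
# \A and \z anchor the expression to the entire string.
def well_formed?(word)
  !!(word =~ /\Aa(ab)*\z/)
end

%w[a aab aabab].each { |w| puts "#{w}: #{well_formed?(w)}" } # each prints true
%w[b aa abb].each    { |w| puts "#{w}: #{well_formed?(w)}" } # each prints false
```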
![Page 16: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/16.jpg)
Expression
• A textual representation of a formal language; an input is tested against it to decide whether the input is a well-formed word in that language
![Page 17: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/17.jpg)
Regular Languages
• ∅ (empty language) is regular
![Page 18: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/18.jpg)
Regular Languages
• ∅ (empty language) is regular
• For each a ∈ Σ (a belongs to Σ), the singleton language {a} is a regular language.
![Page 19: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/19.jpg)
Regular Languages
• ∅ (empty language) is regular
• For each a ∈ Σ (a belongs to Σ), the singleton language {a} is a regular language.
• If A and B are regular languages, then A ∪ B (union), A•B (concatenation), and A* (Kleene star) are regular languages
![Page 20: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/20.jpg)
Regular Languages
• ∅ (empty language) is regular
• For each a ∈ Σ (a belongs to Σ), the singleton language {a} is a regular language.
• If A and B are regular languages, then A ∪ B (union), A•B (concatenation), and A* (Kleene star) are regular languages
• No other languages over Σ are regular.
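The three closure operations map directly onto Regexp building blocks. The following is a small sketch, assuming Σ contains the letters a and b; the helper `member?` is a hypothetical name introduced only for this example.

```ruby
a = /a/ # the singleton language {a}
b = /b/ # the singleton language {b}

union  = Regexp.union(a, b) # A ∪ B
concat = /#{a}#{b}/         # A•B
star   = /(?:#{a})*/        # A* (Kleene star)

# Hypothetical helper: does the whole word belong to the language?
def member?(language, word)
  !!(word =~ /\A(?:#{language})\z/)
end

puts member?(union, 'b')   # true
puts member?(concat, 'ab') # true
puts member?(star, '')     # true: the star includes the empty word
puts member?(star, 'ab')   # false
```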
![Page 21: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/21.jpg)
Regular Expressions
• Expressions of regular languages
![Page 22: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/22.jpg)
Regular Expressions
• Expressions of regular languages
Not
![Page 23: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/23.jpg)
Regular? Expressions
• It turns out that some expressions are more powerful and express non-regular languages
• Language of 'squares': (.*)\1
• aa, aaaa, WikiWiki
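A quick way to try the backreference is to anchor it to the whole string, so that a word matches only when it is some string followed by an exact copy of itself. This is a sketch; `square?` is a hypothetical helper name.

```ruby
# Whole-string version of (.*)\1: some word w followed by w again.
SQUARE = /\A(.*)\1\z/

# Hypothetical helper for this example.
def square?(word)
  !!(word =~ SQUARE)
end

puts square?('WikiWiki') # true  ("Wiki" twice)
puts square?('aaaa')     # true  ("aa" twice)
puts square?('aba')      # false
```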
![Page 24: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/24.jpg)
How does Regexp work?
• Build a finite state automaton representing a given regular expression
• Feed the string to the automaton and see if the match succeeds
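The two steps can be sketched by hand for a tiny expression. The example below hard-codes an automaton for `ab*` as a transition table and feeds a string through it; it is a toy model for illustration, not how Ruby's Regexp engine is built, and the names `TRANSITIONS`, `ACCEPTING`, and `accepts?` are invented here.

```ruby
# Hand-built automaton for ab*: one 'a', then any number of 'b'.
TRANSITIONS = {
  start:  { 'a' => :seen_a },
  seen_a: { 'b' => :seen_a }
}.freeze
ACCEPTING = [:seen_a].freeze

# Feed the input one character at a time; reject as soon as
# the current state has no transition for the character.
def accepts?(input)
  state = :start
  input.each_char do |char|
    state = TRANSITIONS.fetch(state, {})[char]
    return false if state.nil?
  end
  ACCEPTING.include?(state)
end

puts accepts?('abbb') # true
puts accepts?('ba')   # false
```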
![Page 25: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/25.jpg)
a
[Automaton diagram; edge label: a]
![Page 26: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/26.jpg)
ab*
[Automaton diagram; edge labels: a, b]
![Page 27: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/27.jpg)
.*
[Automaton diagram; edge label: .]
![Page 28: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/28.jpg)
a$
[Automaton diagram; edge labels: a, $]
![Page 29: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/29.jpg)
a?
[Automaton diagram; edge labels: a, ε]
![Page 30: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/30.jpg)
a|b
[Automaton diagram; edge labels: a, b]
![Page 31: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/31.jpg)
(ab|c)
[Automaton diagram; edge labels: c, a, b]
![Page 32: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/32.jpg)
(ab+|c)
[Automaton diagram; edge labels: c, a, b, b]
![Page 33: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/33.jpg)
Match is attempted at every character, left to right
![Page 34: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/34.jpg)
zyxwvutsrqponmlkjihgfedcba
^
/a$/
Regexp does not think, "'a$' can match only at the end of the line, so we should fast-forward to the end of the line."
![Page 35: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/35.jpg)
zyxwvutsrqponmlkjihgfedcba
^
zyxwvutsrqponmlkjihgfedcba
 ^
/a$/
Regexp does not think, "'a$' can match only at the end of the line, so we should fast-forward to the end of the line."
![Page 36: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/36.jpg)
zyxwvutsrqponmlkjihgfedcba
^
zyxwvutsrqponmlkjihgfedcba
 ^
zyxwvutsrqponmlkjihgfedcba
  ^
/a$/
Regexp does not think, "'a$' can match only at the end of the line, so we should fast-forward to the end of the line."
![Page 37: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/37.jpg)
zyxwvutsrqponmlkjihgfedcba
^
zyxwvutsrqponmlkjihgfedcba
 ^
zyxwvutsrqponmlkjihgfedcba
  ^
zyxwvutsrqponmlkjihgfedcba
   ^
/a$/
Regexp does not think, "'a$' can match only at the end of the line, so we should fast-forward to the end of the line."
![Page 38: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/38.jpg)
zyxwvutsrqponmlkjihgfedcba
^
zyxwvutsrqponmlkjihgfedcba
 ^
zyxwvutsrqponmlkjihgfedcba
  ^
zyxwvutsrqponmlkjihgfedcba
   ^
⋮
zyxwvutsrqponmlkjihgfedcba
                         ^
/a$/
Regexp does not think, "'a$' can match only at the end of the line, so we should fast-forward to the end of the line."
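The left-to-right scan can be modeled in a few lines of Ruby. This is a conceptual sketch only: it tries the pattern anchored at each start offset in turn and counts the attempts, whereas a real engine may apply optimizations of its own. The helper `scan_for_match` is a hypothetical name.

```ruby
str = ('a'..'z').to_a.reverse.join # "zyxwvutsrqponmlkjihgfedcba"

# Try the pattern at offset 0, then 1, then 2, ... until it matches.
# `pattern_at_start` must begin with \A so it only matches at the offset.
def scan_for_match(str, pattern_at_start)
  attempts = 0
  (0..str.length).each do |offset|
    attempts += 1
    return [offset, attempts] if str[offset..-1] =~ pattern_at_start
  end
  [nil, attempts]
end

offset, attempts = scan_for_match(str, /\Aa$/)
puts "matched at offset #{offset} after #{attempts} attempts"
# prints: matched at offset 25 after 26 attempts
```

The built-in operator reports the same position: `str =~ /a$/` returns 25.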
![Page 39: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/39.jpg)
abc d a dfadg
[Figure: the ^ marker steps through the string one position at a time]
# group 1 captures 'abc d a dfadg ', trailing whitespace included, because .* is greedy
^\s*(.*)\s*$
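The effect of the greedy `.*` is easy to reproduce. The sketch below uses an input with two leading and two trailing spaces (an assumption made for the example) and compares the slide's pattern with a lazy variant, `(.*?)`, which is one possible way to leave the trailing whitespace to `\s*`.

```ruby
input = "  abc d a dfadg  "

# Greedy: (.*) runs to the end of the line, so \s* matches nothing.
greedy = input[/^\s*(.*)\s*$/, 1]

# Lazy: (.*?) stops as early as possible, so \s* takes the trailing spaces.
lazy = input[/^\s*(.*?)\s*$/, 1]

p greedy # "abc d a dfadg  "
p lazy   # "abc d a dfadg"
```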
![Page 40: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/40.jpg)
def pathological(n=5)
  Regexp.new('a?' * n + 'a' * n)
end

1.upto(40) do |n|
  print n, ": "
  print Time.now, "\n" if 'a'*n =~ pathological(n)
end

a?a?a?…a?aaa…a
![Page 41: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/41.jpg)
aaa
^
a?a?a?aaa
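One way to see why this pattern is pathological is to count the work a naive backtracking matcher does. The sketch below is a toy model written for that purpose, not a description of how Ruby's engine is implemented; `try_match` and `backtrack_steps` are hypothetical names. Each `a?` first tries to consume an 'a' and only later falls back to matching nothing, so the one successful path (every `a?` matching nothing) is the last one tried.

```ruby
# Toy backtracking matcher for n optional a's followed by n literal a's.
# counter[:steps] records how many times a token is attempted.
def try_match(tokens, text, token_index, pos, counter)
  counter[:steps] += 1
  return true if token_index == tokens.length

  can_consume = pos < text.length && text[pos] == 'a'
  if tokens[token_index] == :optional
    # Greedy a?: try consuming first, then try skipping.
    (can_consume && try_match(tokens, text, token_index + 1, pos + 1, counter)) ||
      try_match(tokens, text, token_index + 1, pos, counter)
  else
    can_consume && try_match(tokens, text, token_index + 1, pos + 1, counter)
  end
end

def backtrack_steps(n)
  tokens = [:optional] * n + [:literal] * n
  counter = { steps: 0 }
  matched = try_match(tokens, 'a' * n, 0, 0, counter)
  [matched, counter[:steps]]
end

1.upto(12) do |n|
  matched, steps = backtrack_steps(n)
  puts "n=#{n}: matched=#{matched}, steps=#{steps}"
end
```

In this model the step count more than doubles each time n grows by one, which is the exponential behavior the timing loop on the previous slide is meant to expose.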
![Page 42: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/42.jpg)
Regexp tips
![Page 43: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/43.jpg)
UP_TO_256 = /\b(?:25[0-5]   # 250-255
  |2[0-4][0-9]              # 200-249
  |1[0-9][0-9]              # 100-199
  |[1-9][0-9]               # 2-digit numbers
  |[0-9])                   # single-digit numbers
  \b/x
IPV4_ADDRESS = /#{UP_TO_256}(?:\.#{UP_TO_256}){3}/
Use /x
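A short usage sketch follows. The two constants are copied from the slide so that the block runs on its own; `WHOLE_IPV4` is an extra constant added here, on the assumption that a caller wants to validate an entire string and not just find an address inside it.

```ruby
UP_TO_256 = /\b(?:25[0-5]   # 250-255
  |2[0-4][0-9]              # 200-249
  |1[0-9][0-9]              # 100-199
  |[1-9][0-9]               # 2-digit numbers
  |[0-9])                   # single-digit numbers
  \b/x
IPV4_ADDRESS = /#{UP_TO_256}(?:\.#{UP_TO_256}){3}/

# Assumption for this example: anchor to the whole string for validation.
WHOLE_IPV4 = /\A#{IPV4_ADDRESS}\z/

p 'host 10.0.0.1 is up'[IPV4_ADDRESS] # "10.0.0.1"
p '192.168.0.1' =~ WHOLE_IPV4         # 0
p '256.1.1.1' =~ WHOLE_IPV4           # nil
```

Interpolating a `/x` regexp into another regexp keeps its own flags, so the commented sub-pattern can be reused inside a pattern that is not itself written with `/x`.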
![Page 44: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/44.jpg)
\A, \z for strings
^, $ for lines
• \A: the beginning of the string
• \z: the end of the string
• ^: after \n
• $: before \n
![Page 45: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/45.jpg)
always in Ruby
\A, \z for strings
^, $ for lines
• \A: the beginning of the string
• \z: the end of the string
• ^: after \n
• $: before \n
![Page 46: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/46.jpg)
What's the problem?
also note the difference in what /m means
![Page 47: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/47.jpg)
#! /usr/bin/env perl
$a = "abc\ndef";
if ($a =~ /^d/) {
  print "yes\n";
}
if ($a =~ /^d/m) {
  print "yes now\n";
}
# prints 'yes now'
What's the problem?
also note the difference in what /m means
![Page 48: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/48.jpg)
#! /usr/bin/env ruby
a = "abc\ndef"
if a =~ /^d/
  p "yes"
end
# prints "yes": in Ruby, ^ matches after every \n without any flag
What's the problem?
http://guides.rubyonrails.org/security.html#regular-expressions
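The differences can be checked directly in Ruby. In this sketch, `^` matches after the newline with no flag at all, `\A` matches only at the start of the string, and `/m` changes something else entirely: it lets `.` match a newline, which is what Perl calls `/s`.

```ruby
text = "abc\ndef"

p text =~ /^d/   # 4   (^ matches at the start of the second line)
p text =~ /\Ad/  # nil (\A matches only at the start of the string)

p text =~ /c.d/  # nil (by default . does not match "\n")
p text =~ /c.d/m # 2   (with /m, . matches "\n" as well)
```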
![Page 49: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/49.jpg)
class File < ActiveRecord::Base
  validates :name, :format => /^[\w\.\-\+]+$/
end
Security Implications
http://guides.rubyonrails.org/security.html#regular-expressions
![Page 50: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/50.jpg)
file.txt%0A<script>alert('hello')</script>
![Page 51: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/51.jpg)
file.txt%0A<script>alert('hello')</script>
![Page 52: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/52.jpg)
file.txt\n<script>alert('hello')</script>
![Page 53: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/53.jpg)
file.txt\n<script>alert('hello')</script>
/^[\w\.\-\+]+$/
![Page 54: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/54.jpg)
file.txt\n<script>alert('hello')</script>
/^[\w\.\-\+]+$/
Match succeeds
ActiveRecord validation succeeds
![Page 55: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/55.jpg)
file.txt\n<script>alert('hello')</script>
/\A[\w\.\-\+]+\z/
![Page 56: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/56.jpg)
file.txt\n<script>alert('hello')</script>
/\A[\w\.\-\+]+\z/
Match fails
ActiveRecord validation fails
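The two outcomes can be reproduced without Rails by applying the patterns to the decoded input directly. This is a plain-Ruby sketch; the constant names `LINE_ANCHORED` and `STRING_ANCHORED` are made up for the example.

```ruby
LINE_ANCHORED   = /^[\w\.\-\+]+$/   # the pattern from the model above
STRING_ANCHORED = /\A[\w\.\-\+]+\z/ # the corrected pattern

malicious = "file.txt\n<script>alert('hello')</script>"
harmless  = "file.txt"

p malicious =~ LINE_ANCHORED   # 0   (the first line alone satisfies ^...$)
p malicious =~ STRING_ANCHORED # nil (the whole string must match)
p harmless  =~ STRING_ANCHORED # 0
```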
![Page 57: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/57.jpg)
require 'benchmark'

# simple benchmark for alternations and character class
n = 5_000
str = 'cafebabedeadbeef' * n

Benchmark.bmbm do |x|
  x.report('alternation') do
    str =~ /^(a|b|c|d|e|f)+$/
  end
  x.report('character class') do
    str =~ /^[a-f]+$/
  end
end

Prefer Character Class to Alternations
![Page 58: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/58.jpg)
Ruby 1.8.7
                      user     system      total        real
alternation       0.030000   0.010000   0.040000 (  0.036702)
character class   0.000000   0.000000   0.000000 (  0.004704)

Ruby 2.0.0
                      user     system      total        real
alternation       0.020000   0.010000   0.030000 (  0.023139)
character class   0.000000   0.000000   0.000000 (  0.009641)

JRuby 1.7.4.dev
                      user     system      total        real
alternation       0.030000   0.000000   0.030000 (  0.021000)
character class   0.010000   0.000000   0.010000 (  0.007000)
Benchmarks
![Page 59: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/59.jpg)
# case-insensitively match any non-word character…
# one is unlike the others
'r' =~ /(?i:[\W])/
's' =~ /(?i:[\W])/
't' =~ /(?i:[\W])/
Beware of Character Classes
The 's' case matches, even though 's' is a word character
https://bugs.ruby-lang.org/issues/4044
![Page 60: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/60.jpg)
/^1?$|^(11+?)\1+$/
![Page 61: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/61.jpg)
/^1?$|^(11+?)\1+$/
Matches '1' or ''
![Page 62: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/62.jpg)
/^1?$|^(11+?)\1+$/
Non-greedily match 2 or more 1's
![Page 63: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/63.jpg)
/^1?$|^(11+?)\1+$/
1 or more additional times
![Page 64: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/64.jpg)
/^1?$|^(11+?)\1+$/
matches a composite number
![Page 65: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/65.jpg)
/^1?$|^(11+?)\1+$/
Matches a string of 1's if and only if the number of 1's is not prime
![Page 66: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/66.jpg)
class Integer
  def prime?
    "1" * self !~ /^1?$|^(11+?)\1+$/
  end
end
Integer#prime?
No performance guarantee
Attributed to the Perl hacker Abigail
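The expression can be tried without reopening Integer. The sketch below wraps it in a standalone method; `unary_prime?` and `COMPOSITE_IN_UNARY` are hypothetical names, and the guard for negative input is an addition made for this example. As the slide warns, this is a curiosity with no performance guarantee.

```ruby
# Matches a run of 1's whose length is 0, 1, or a composite number.
COMPOSITE_IN_UNARY = /^1?$|^(11+?)\1+$/

# Hypothetical helper: write n in unary and test it against the pattern.
def unary_prime?(n)
  raise ArgumentError, 'n must be non-negative' if n < 0
  ('1' * n) !~ COMPOSITE_IN_UNARY
end

p (0..30).select { |n| unary_prime?(n) }
# prints [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```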
![Page 67: Regexp secrets](https://reader033.vdocuments.net/reader033/viewer/2022050919/54639b61b1af9fb0588b4599/html5/thumbnails/67.jpg)
• @hiro_asari
• Github: BanzaiMan