Regexp secrets

Secrets of Regexp Hiro Asari Red Hat, Inc.

description

The Regexp class is in every Rubyist's toolbox. But do you know the theory behind it, and what goes on under the hood?

Transcript of Regexp secrets

Page 1: Regexp secrets

Secrets of Regexp

Hiro Asari

Red Hat, Inc.

Page 3: Regexp secrets

Let's Talk About Regular Expressions

• There is no regular expression

Page 4: Regexp secrets

Let's Talk About Regular Expressions

• A good approximation as a name

Page 5: Regexp secrets

Let's Talk About Regexp

Page 6: Regexp secrets

Some people, when confronted with a problem, think, "I know, I'll use regular expressions."

Now they have two problems.

Jamie Zawinski, 12 Aug 1997

http://regex.info/blog/2006-09-15/247
http://www.codinghorror.com/blog/2008/06/regular-expressions-now-you-have-two-problems.html

The point is not so much the evils of regular expressions as the evils of overusing them.

Page 7: Regexp secrets

Formal Language Theory

• The Language L

• Over Alphabet Σ

Page 10: Regexp secrets

Formal Language Theory

• Alphabet Σ={a, b, c, d, e, …, z, λ} (example)

• Words over Σ: "a", "b", "ab", "aequafdhfad"

• Σ*: The set of all words over Σ

Page 11: Regexp secrets

Formal Language over Σ

• A subset L of Σ* (with various properties)

• L can be finite, in which case its well-formed words can simply be enumerated, but it is often infinite

Page 12: Regexp secrets

Example

• Language L over Σ = {a,b}

• 'a' is a word

• a word may be obtained by appending 'ab' to an existing word

• only words thus formed are legal

Page 13: Regexp secrets

a, aab, aabab

Well-formed words

Page 14: Regexp secrets

baaaababb

Ill-formed words

Page 15: Regexp secrets

Succinctly…

• a(ab)*
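The three rules and the expression a(ab)* describe the same language, which can be checked in Ruby. This is a minimal sketch: `well_formed?` is a hypothetical helper name, and the \A and \z anchors are added here so that the whole input is tested rather than a substring.

```ruby
# Hypothetical helper: true when `word` belongs to the language a(ab)*.
# \A and \z anchor the pattern so the whole word must match.
def well_formed?(word)
  word.match?(/\Aa(ab)*\z/)
end

%w[a aab aabab b aa abab].each do |word|
  puts "#{word}: #{well_formed?(word) ? 'well-formed' : 'ill-formed'}"
end
```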

Page 16: Regexp secrets

Expression

• A textual representation of a formal language, against which an input is tested to see whether it is a well-formed word in that language

Page 20: Regexp secrets

Regular Languages

• ∅ (empty language) is regular

• For each a ∈ Σ (a belongs to Σ), the singleton language {a} is a regular language.

• If A and B are regular languages, then A ∪ B (union), A•B (concatenation), and A* (Kleene star) are regular languages

• No other languages over Σ are regular.
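The three closure operations map directly onto Regexp syntax. The sketch below builds each one from two single-letter patterns. The variable names and the `whole_match?` helper are illustrative assumptions, not part of the talk.

```ruby
a = /a/
b = /b/

union  = Regexp.union(a, b) # A union B: matches "a" or "b"
concat = /#{a}#{b}/         # A concatenated with B: "a" followed by "b"
star   = /(?:#{a})*/        # Kleene star of A: zero or more repetitions of "a"

# Hypothetical helper: does `re` match the whole of `str`?
def whole_match?(re, str)
  /\A(?:#{re})\z/.match?(str)
end

puts whole_match?(union, 'b')
puts whole_match?(concat, 'ab')
puts whole_match?(concat, 'ba')
puts whole_match?(star, '')
puts whole_match?(star, 'aaa')
```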

Page 22: Regexp secrets

Regular Expressions

• Expressions of regular languages

Not

Page 23: Regexp secrets

Regular? Expressions

• It turns out that some expressions are more powerful and express non-regular languages

• Language of 'squares': (.*)\1

• a, aa, aaaa, WikiWiki
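A backreference is what takes the expression beyond regular languages: \1 must repeat whatever the group captured. A small sketch, assuming a hypothetical helper `square?` and adding \A and \z so the whole word is tested:

```ruby
# Hypothetical helper: true when `word` is some string written twice in a row.
def square?(word)
  word.match?(/\A(.*)\1\z/)
end

%w[aa aaaa WikiWiki Wiki aba].each do |word|
  puts "#{word}: #{square?(word)}"
end
```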

Page 24: Regexp secrets

How does Regexp work?

• Build a finite state automaton representing a given regular expression

• Feed the String to the automaton and see if the match succeeds
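To make the two steps concrete, here is a toy automaton for the expression ab*, written out by hand. It is only a sketch of the idea: the state names, the `TRANSITIONS` table and `accepts?` are assumptions made for illustration, and this is not how Ruby's Regexp engine is implemented.

```ruby
# Toy deterministic automaton for the expression ab*.
# States and transitions are hand-written for illustration only.
TRANSITIONS = {
  start: { 'a' => :seen_a },   # the leading "a"
  seen_a: { 'b' => :seen_a }   # any number of "b"s loops on the same state
}.freeze
ACCEPTING = [:seen_a].freeze

# Feed the input one character at a time; accept if we end in an accepting state.
def accepts?(input)
  state = :start
  input.each_char do |char|
    state = TRANSITIONS.fetch(state, {})[char]
    return false if state.nil? # no transition for this character: reject
  end
  ACCEPTING.include?(state)
end

['a', 'abbb', '', 'ba', 'aab'].each do |input|
  puts "#{input.inspect}: #{accepts?(input)}"
end
```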

Page 25: Regexp secrets

a

[Automaton diagram; transition label: a]

Page 26: Regexp secrets

ab*

[Automaton diagram; transition labels: a, b]

Page 27: Regexp secrets

.*

[Automaton diagram; transition label: .]

Page 28: Regexp secrets

a$

[Automaton diagram; transition labels: a, $]

Page 29: Regexp secrets

a?

[Automaton diagram; transition labels: a, ε]

Page 30: Regexp secrets

a|b

[Automaton diagram; transition labels: a, b]

Page 31: Regexp secrets

(ab|c)

[Automaton diagram; transition labels: c, a, b]

Page 32: Regexp secrets

(ab+|c)

[Automaton diagram; transition labels: c, a, b, b]

Page 33: Regexp secrets

Match is attempted at every character, left to right

Page 38: Regexp secrets

zyxwvutsrqponmlkjihgfedcba
^
zyxwvutsrqponmlkjihgfedcba
 ^
zyxwvutsrqponmlkjihgfedcba
  ^
zyxwvutsrqponmlkjihgfedcba
   ^
⋮
zyxwvutsrqponmlkjihgfedcba
                         ^

/a$/

Regexp does not think, "'a$' can match only at the end of the line, so we should fast-forward to the end of the line."
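The left-to-right scan can be imitated by hand. The sketch below tries an anchored version of the pattern at each start position and counts the attempts. `first_match_with_attempts` is a hypothetical helper, and the loop models only the conceptual behaviour described on the slide, not the engine's actual internals or optimizations.

```ruby
# Hypothetical helper: try the pattern at each start position, left to right,
# and report where it first matched and how many attempts that took.
def first_match_with_attempts(str, anchored_pattern)
  (0..str.length).each do |pos|
    return [pos, pos + 1] if str[pos..-1].match?(anchored_pattern)
  end
  [nil, str.length + 1]
end

str = ('a'..'z').to_a.reverse.join # "zyxwvutsrqponmlkjihgfedcba"
position, attempts = first_match_with_attempts(str, /\Aa$/)
puts "matched at index #{position} after #{attempts} attempts"
```

The only successful attempt is the last one, at the final character.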

Page 39: Regexp secrets

[Animation: the input 'abc d a dfadg ' shown four times, with the pointer ^ at a different position each time]

# matches 'abc d a dfadg '

^\s*(.*)\s*$

Page 40: Regexp secrets

def pathological(n = 5)
  Regexp.new('a?' * n + 'a' * n)
end

1.upto(40) do |n|
  print n, ": "
  print Time.now, "\n" if 'a' * n =~ pathological(n)
end

a?a?a?…a?aaa…a

Page 41: Regexp secrets

aaa
^

a?a?a?aaa
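Why this pattern hurts a backtracking matcher can be seen by counting steps in a toy matcher. The sketch below is an assumption-laden model, not Ruby's engine: `backtrack` and `steps_for` are hypothetical helpers, the pattern is represented as a list of `:opt` (a?) and `:lit` (a) tokens, and each a? greedily tries to consume a character before trying to skip.

```ruby
# Toy backtracking matcher for patterns made only of "a?" (:opt) and "a" (:lit).
# counter[:steps] records how many (token, position) pairs were visited.
def backtrack(tokens, ti, str, si, counter)
  counter[:steps] += 1
  return si == str.length if ti == tokens.length

  can_consume = si < str.length && str[si] == 'a'
  case tokens[ti]
  when :lit
    can_consume && backtrack(tokens, ti + 1, str, si + 1, counter)
  when :opt
    # Greedy: first try to consume an "a", then fall back to skipping.
    (can_consume && backtrack(tokens, ti + 1, str, si + 1, counter)) ||
      backtrack(tokens, ti + 1, str, si, counter)
  end
end

# Match 'a' * n against a?{n}a{n} and return [matched, steps].
def steps_for(n)
  counter = { steps: 0 }
  tokens = [:opt] * n + [:lit] * n
  matched = backtrack(tokens, 0, 'a' * n, 0, counter)
  [matched, counter[:steps]]
end

1.upto(12) do |n|
  matched, steps = steps_for(n)
  puts "n=#{n}: matched=#{matched} steps=#{steps}"
end
```

In this model the only successful path is the one where every a? matches nothing, and it is the last one tried, so the step count roughly doubles with each increase in n.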

Page 42: Regexp secrets

Regexp tips

Page 43: Regexp secrets

UP_TO_256 = /\b(?:25[0-5]   # 250-255
  |2[0-4][0-9]              # 200-249
  |1[0-9][0-9]              # 100-199
  |[1-9][0-9]               # 2-digit numbers
  |[0-9])                   # single-digit numbers
  \b/x

IPV4_ADDRESS = /#{UP_TO_256}(?:\.#{UP_TO_256}){3}/

Use /x
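A short usage sketch for a pattern written with /x. It restates the slide's pattern under the hypothetical name `OCTET` so the block runs on its own, and adds a hypothetical anchored variant, `WHOLE_IPV4`, that requires the entire string to be one address. Neither name comes from the talk.

```ruby
# OCTET restates the UP_TO_256 pattern from the slide under a hypothetical name.
OCTET = /\b(?:25[0-5]   # 250-255
  |2[0-4][0-9]          # 200-249
  |1[0-9][0-9]          # 100-199
  |[1-9][0-9]           # 2-digit numbers
  |[0-9])               # single-digit numbers
  \b/x

# Hypothetical anchored variant: the whole string must be a single address.
# Interpolation keeps the /x option of OCTET inside the new pattern.
WHOLE_IPV4 = /\A#{OCTET}(?:\.#{OCTET}){3}\z/

%w[192.168.0.1 255.255.255.255 256.0.0.1 1.2.3].each do |candidate|
  puts "#{candidate}: #{WHOLE_IPV4.match?(candidate)}"
end
```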

Page 45: Regexp secrets

always in Ruby

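\A, \z for strings; ^, $ for lines

• \A: the beginning of the string

• \z: the end of the string

• ^: the beginning of a line (the start of the string or after \n)

• $: the end of a line (before \n or at the end of the string)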
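The difference between the two pairs of anchors shows up as soon as the string contains a newline. A minimal sketch (the variable name `text` is arbitrary):

```ruby
text = "abc\ndef"

p text =~ /^d/   # => 4, ^ matches right after the newline
p text =~ /\Ad/  # => nil, \A matches only at the very start of the string
p text =~ /c$/   # => 2, $ matches just before the newline
p text =~ /c\z/  # => nil, \z matches only at the very end of the string
```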

Page 47: Regexp secrets

#! /usr/bin/env perl
$a = "abc\ndef";
if ($a =~ /^d/) {
  print "yes\n";
}
if ($a =~ /^d/m) {
  print "yes now\n";
}
# prints 'yes now'

What's the problem?

also note the difference in what /m means

Page 48: Regexp secrets

#! /usr/bin/env ruby

a = "abc\ndef"
if a =~ /^d/
  p "yes"
end

What's the problem?

http://guides.rubyonrails.org/security.html#regular-expressions
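As for what /m means: in Ruby it does not change ^ and $ at all. It makes . match a newline, which is what Perl's /s does. A minimal sketch of that behaviour:

```ruby
text = "abc\ndef"

p text =~ /c.d/   # => nil, by default . does not match a newline
p text =~ /c.d/m  # => 2, with /m the . also matches the newline
p text =~ /^d/    # => 4, ^ matches after the newline without /m
p text =~ /^d/m   # => 4, and /m makes no difference to ^
```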

Page 49: Regexp secrets

class File < ActiveRecord::Base
  validates :name, :format => /^[\w\.\-\+]+$/
end

Security Implications

http://guides.rubyonrails.org/security.html#regular-expressions

Page 51: Regexp secrets

file.txt%0A<script>alert('hello')</script>

Page 54: Regexp secrets

file.txt\n<script>alert('hello')</script>

/^[\w\.\-\+]+$/

Match succeeds
ActiveRecord validation succeeds

Page 56: Regexp secrets

file.txt\n<script>alert('hello')</script>

/\A[\w\.\-\+]+\z/

Match fails
ActiveRecord validation fails
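The two outcomes can be reproduced in plain Ruby, without ActiveRecord. This sketch only exercises the two patterns against the malicious name. The variable names are illustrative.

```ruby
name = "file.txt\n<script>alert('hello')</script>"

line_anchored   = /^[\w\.\-\+]+$/    # anchors match at line boundaries
string_anchored = /\A[\w\.\-\+]+\z/  # anchors match only at string boundaries

puts "with ^ and $:   #{line_anchored.match?(name)}"
puts "with \\A and \\z: #{string_anchored.match?(name)}"
puts "plain file.txt: #{string_anchored.match?('file.txt')}"
```

The line-anchored pattern is satisfied by the first line alone, so everything after the newline slips through.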

Page 57: Regexp secrets

require 'benchmark'

# simple benchmark for alternations and character class

n = 5_000

str = 'cafebabedeadbeef' * n

Benchmark.bmbm do |x|
  x.report('alternation') do
    str =~ /^(a|b|c|d|e|f)+$/
  end
  x.report('character class') do
    str =~ /^[a-f]+$/
  end
end

Prefer Character Class to Alternations

Page 58: Regexp secrets

Ruby 1.8.7
                      user     system      total        real
alternation       0.030000   0.010000   0.040000 (  0.036702)
character class   0.000000   0.000000   0.000000 (  0.004704)

Ruby 2.0.0
                      user     system      total        real
alternation       0.020000   0.010000   0.030000 (  0.023139)
character class   0.000000   0.000000   0.000000 (  0.009641)

JRuby 1.7.4.dev
                      user     system      total        real
alternation       0.030000   0.000000   0.030000 (  0.021000)
character class   0.010000   0.000000   0.010000 (  0.007000)

Benchmarks

Page 59: Regexp secrets

# case-insensitively match any non-word character…

# one is unlike the others
'r' =~ /(?i:[\W])/
's' =~ /(?i:[\W])/
't' =~ /(?i:[\W])/

Beware of Character Classes

's' matches, even though 's' is a word character

https://bugs.ruby-lang.org/issues/4044

Page 61: Regexp secrets

/^1?$|^(11+?)\1+$/

Matches '1' or ''

Page 62: Regexp secrets

/^1?$|^(11+?)\1+$/

Non-greedily match 2 or more 1's

Page 63: Regexp secrets

/^1?$|^(11+?)\1+$/

1 or more additional times

Page 64: Regexp secrets

/^1?$|^(11+?)\1+$/

matches a composite number

Page 65: Regexp secrets

/^1?$|^(11+?)\1+$/

Matches a string of 1's if and only if the number of 1's is not prime

Page 66: Regexp secrets

class Integer
  def prime?
    "1" * self !~ /^1?$|^(11+?)\1+$/
  end
end

Integer#prime?

No performance guarantee

Attributed to the Perl hacker Abigail
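A quick way to try the trick without reopening Integer is a standalone method. `unary_prime?` is a hypothetical name used only in this sketch. The pattern is the one from the slides.

```ruby
# Hypothetical standalone version of the check, to avoid reopening Integer.
# Writes n in unary ("1" repeated n times) and asks whether the non-prime
# pattern fails to match.
def unary_prime?(n)
  ('1' * n) !~ /^1?$|^(11+?)\1+$/
end

primes = (0..20).select { |n| unary_prime?(n) }
p primes # => [2, 3, 5, 7, 11, 13, 17, 19]
```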

Page 67: Regexp secrets

• @hiro_asari

• GitHub: BanzaiMan