Regexp secrets

Post on 12-Nov-2014

Regexp class is in every Rubyist's toolbox. But do you know the theory behind it, and what goes on under the hood?

Transcript of Regexp secrets

Secrets of Regexp

Hiro Asari

Red Hat, Inc.

Let's Talk About Regular Expressions

• There is no regular expression

• A good approximation as a name

Let's Talk About Regexp

Some people, when confronted with a problem, think, "I know, I'll use regular expressions."

Now they have two problems.

Jamie Zawinski, 12 Aug 1997

http://regex.info/blog/2006-09-15/247

http://www.codinghorror.com/blog/2008/06/regular-expressions-now-you-have-two-problems.html

The point is not so much the evils of regular expressions, but the evils of overuse of it.

Formal Language Theory

• The Language L

• Over Alphabet Σ

Formal Language Theory

• Alphabet Σ={a, b, c, d, e, …, z, λ} (example)

• Words over Σ: "a", "b", "ab", "aequafdhfad"

• Σ*: The set of all words over Σ

Formal Language over Σ

• A subset L of Σ* (with various properties)

• L can be finite, simply enumerating its well-formed words, but it is often infinite

Example

• Language L over Σ = {a,b}

• 'a' is a word

• a word may be obtained by appending 'ab' to an existing word

• only words thus formed are legal

a, aab, aabab

Well-formed words

baaaababb

Ill-formed words

Succinctly…

• a(ab)*
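The expression a(ab)* can be tried directly in Ruby. This is a minimal sketch; the names WELL_FORMED and well_formed? are illustrative and do not come from the talk. The pattern is wrapped in \A and \z so that the whole word is tested, not just a substring.

```ruby
# The language from the example: 'a', followed by zero or more 'ab'.
# WELL_FORMED and well_formed? are hypothetical names used for illustration.
WELL_FORMED = /\Aa(?:ab)*\z/

def well_formed?(word)
  WELL_FORMED.match?(word)
end

# Print each candidate word with its verdict.
%w[a aab aabab b aa ab].each do |word|
  puts "#{word}: #{well_formed?(word)}"
end
```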

Expression

• A textual representation of a formal language; an input is tested against it to decide whether the input is a well-formed word in that language

Regular Languages

• ∅ (empty language) is regular

• For each a ∈ Σ (a belongs to Σ), the singleton language {a} is a regular language.

• If A and B are regular languages, then A ∪ B (union), A•B (concatenation), and A* (Kleene star) are regular languages

• No other languages over Σ are regular.
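The three closure rules map onto Regexp operations. The sketch below builds larger languages from the singleton languages {a} and {b}; the helpers concat, star, and member? are hypothetical names chosen for this illustration, while Regexp.union is the standard library's own union.

```ruby
# Singleton languages {a} and {b}.
A = /a/
B = /b/

# Concatenation A•B: a word of x followed by a word of y.
def concat(x, y)
  Regexp.new("(?:#{x.source})(?:#{y.source})")
end

# Kleene star A*: zero or more words of x in a row.
def star(x)
  Regexp.new("(?:#{x.source})*")
end

# Membership test: the whole word must belong to the language.
def member?(lang, word)
  Regexp.new("\\A(?:#{lang.source})\\z").match?(word)
end

EITHER         = Regexp.union(A, B)               # A ∪ B
A_THEN_AB_STAR = concat(A, star(concat(A, B)))    # a(ab)*

puts member?(EITHER, "b")
puts member?(A_THEN_AB_STAR, "aabab")
```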

Regular Expressions

• Expressions of regular languages

Not

Regular? Expressions

• It turns out that some expressions are more powerful and express non-regular languages

• Language of 'squares': (.*)\1

• aa, aaaa, WikiWiki
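A quick way to see the backreference at work is to anchor the expression so that the whole word must be some string repeated twice. The helper name square? is an assumption made for this sketch.

```ruby
# square? is a hypothetical helper. \A and \z force the entire word
# to have the shape "ww"; without them (.*)\1 matches the empty
# string at the start of any input.
def square?(word)
  /\A(.*)\1\z/.match?(word)
end

%w[aa aaaa WikiWiki a aaa Wiki].each do |word|
  puts "#{word}: #{square?(word)}"
end
```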

How does Regexp work?

• Build a finite state automaton representing a given regular expression

• Feed the String to the automaton and see if the match succeeds

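[State diagrams, one per expression: a, ab*, .*, a$, a?, a|b, (ab|c), (ab+|c). Each edge is labeled with the character it consumes; the diagram for a? also has an ε edge that consumes nothing.]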
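To make the automaton idea concrete, here is a hand-written transition table for ab* and a loop that walks it. This is only a sketch of a deterministic automaton for one expression; it is not how Ruby's Regexp engine is implemented, and the names AB_STAR and accepts? are made up for the example.

```ruby
# Transition table for ab*: state => { input character => next state }.
# :start is the initial state, :accept the only accepting state.
AB_STAR = {
  start:  { 'a' => :accept },
  accept: { 'b' => :accept }
}.freeze

# Walk the table one character at a time. A missing transition means
# the word is rejected immediately.
def accepts?(table, word, start: :start, accepting: [:accept])
  state = word.each_char.reduce(start) do |current, char|
    nxt = table.fetch(current, {})[char]
    return false if nxt.nil?
    nxt
  end
  accepting.include?(state)
end

['a', 'abbb', '', 'ba', 'abab'].each do |word|
  puts "#{word.inspect}: #{accepts?(AB_STAR, word)}"
end
```

Unlike Regexp#=~, this sketch tests the whole word from its first character; it never retries from a later position.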

Match is attempted at every character, left to right

/a$/

zyxwvutsrqponmlkjihgfedcba
^
zyxwvutsrqponmlkjihgfedcba
 ^
zyxwvutsrqponmlkjihgfedcba
  ^
zyxwvutsrqponmlkjihgfedcba
   ^
⋮
zyxwvutsrqponmlkjihgfedcba
                         ^

Regexp does not think, "'a$' can match only at the end of the line, so we should fast-forward to the end of the line."
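The left-to-right scan can be imitated by hand: try the pattern anchored at offset 0, then 1, and so on, counting the attempts. This is a simplified model for illustration only (real engines apply their own optimizations), and first_match_offset is a hypothetical helper. Slicing the string works here because a$ only looks forward from the current position.

```ruby
# Try the pattern at every offset, left to right, and report the first
# offset that matches together with the number of attempts made.
def first_match_offset(pattern_source, str)
  anchored = Regexp.new("\\A(?:#{pattern_source})")
  attempts = 0
  (0..str.length).each do |i|
    attempts += 1
    return [i, attempts] if anchored.match?(str[i..-1])
  end
  [nil, attempts]
end

str = 'zyxwvutsrqponmlkjihgfedcba'
offset, attempts = first_match_offset('a$', str)
puts "matched at offset #{offset} after #{attempts} attempts"
puts "Regexp agrees: #{(str =~ /a$/) == offset}"
```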

abc d a dfadg ^ abc d a dfadg ^ abc d a dfadg ^ abc d a dfadg ^

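# (.*) captures 'abc d a dfadg ', trailing space included: the greedy .* runs to the end of the line and leaves nothing for the final \s*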

^\s*(.*)\s*$
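The effect of the greedy group is easy to check, along with one possible remedy: making the group lazy so that the trailing \s* gets a chance to absorb the whitespace. The lazy variant is a suggestion added here, not something shown on the slide.

```ruby
line = "abc d a dfadg "

# Greedy: (.*) swallows the trailing space, \s* matches nothing.
greedy = line[/^\s*(.*)\s*$/, 1]

# Lazy: (.*?) stops as early as it can, so \s* takes the trailing space.
lazy = line[/^\s*(.*?)\s*$/, 1]

p greedy  # => "abc d a dfadg "
p lazy    # => "abc d a dfadg"
```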

def pathological(n=5)
  Regexp.new('a?' * n + 'a' * n)
end

1.upto(40) do |n|
  print n, ": "
  print Time.now, "\n" if 'a'*n =~ pathological(n)
end

a?a?a?…a?aaa…a

aaa^

a?a?a?aaa
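To see how the cost grows, the match can be timed for a few sizes. This sketch uses Benchmark.realtime from the standard library; time_match is a hypothetical helper, and the sizes are kept small on purpose. Actual timings depend heavily on the Ruby version and engine, so no numbers are claimed here.

```ruby
require 'benchmark'

# Same construction as above: n optional a's followed by n mandatory a's.
def pathological(n)
  Regexp.new('a?' * n + 'a' * n)
end

# Seconds taken to match a string of n a's against pathological(n).
def time_match(n)
  subject = 'a' * n
  pattern = pathological(n)
  Benchmark.realtime { subject.match?(pattern) }
end

[6, 10, 14, 18].each do |n|
  printf("n=%2d  %.6f s\n", n, time_match(n))
end
```

A backtracking matcher first lets every a? consume a character, runs out of input for the mandatory a's, and then has to back out of those choices one combination at a time.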

Regexp tips

UP_TO_256 = /\b(?:25[0-5]   # 250-255
             |2[0-4][0-9]   # 200-249
             |1[0-9][0-9]   # 100-199
             |[1-9][0-9]    # 2-digit numbers
             |[0-9])        # single-digit numbers
             \b/x

IPV4_ADDRESS = /#{UP_TO_256}(?:\.#{UP_TO_256}){3}/

Use /x
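A short usage sketch for the two constants follows. The definitions are repeated so the snippet runs on its own; the ipv4? helper and its \A ... \z wrapping are assumptions added for illustration.

```ruby
# Definitions repeated from the slide so that this snippet is self-contained.
UP_TO_256 = /\b(?:25[0-5]   # 250-255
             |2[0-4][0-9]   # 200-249
             |1[0-9][0-9]   # 100-199
             |[1-9][0-9]    # 2-digit numbers
             |[0-9])        # single-digit numbers
             \b/x

IPV4_ADDRESS = /#{UP_TO_256}(?:\.#{UP_TO_256}){3}/

# Hypothetical helper: the whole string must be a dotted quad.
def ipv4?(str)
  /\A#{IPV4_ADDRESS}\z/.match?(str)
end

%w[192.168.0.1 10.0.0.256 1.2.3].each do |candidate|
  puts "#{candidate}: #{ipv4?(candidate)}"
end
```

Interpolating a Regexp keeps its own flags, so the /x comments inside UP_TO_256 stay comments even though IPV4_ADDRESS itself is not written with /x.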

\A, \z for strings; ^, $ for lines

• \A: the beginning of the string

• \z: the end of the string

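• ^: the beginning of a line (start of the string, or right after \n)

• $: the end of a line (end of the string, or right before \n)

^ and $ are always line anchors in Ruby; no flag is needed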
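The difference between the two pairs of anchors shows up as soon as the string contains a newline. A small check:

```ruby
text = "abc\ndef"

p text =~ /^d/    # => 4    (^ matches right after the \n)
p text =~ /\Ad/   # => nil  (\A matches only at the start of the string)
p text =~ /c$/    # => 2    ($ matches right before the \n)
p text =~ /c\z/   # => nil  (\z matches only at the end of the string)
p text =~ /f\z/   # => 6
```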

What's the problem?

also note the difference in what /m means

#! /usr/bin/env perl
$a = "abc\ndef";
if ($a =~ /^d/) {
  print "yes\n";
}
if ($a =~ /^d/m) {
  print "yes now\n";
}
# prints 'yes now'

#! /usr/bin/env ruby

a = "abc\ndef"
if (a =~ /^d/)
  p "yes"
end
# prints "yes"
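In Ruby, /m has a different job than in Perl: it lets . match a newline, which is what Perl calls /s. A minimal check:

```ruby
text = "abc\ndef"

p text =~ /c.d/    # => nil  (by default . does not match \n)
p text =~ /c.d/m   # => 2    (with /m, . matches \n as well)
```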

What's the problem?

http://guides.rubyonrails.org/security.html#regular-expressions

class File < ActiveRecord::Base
  validates :name, :format => /^[\w\.\-\+]+$/
end

Security Implications

file.txt%0A<script>alert('hello')</script>

URL-decoded: file.txt\n<script>alert('hello')</script>

/^[\w\.\-\+]+$/

Match succeeds; ActiveRecord validation succeeds

/\A[\w\.\-\+]+\z/

Match fails; ActiveRecord validation fails
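The same outcome can be reproduced in plain Ruby, without Rails. In this sketch URI.decode_www_form_component stands in for the decoding a web framework would perform, and the constant names are illustrative.

```ruby
require 'uri'

LINE_ANCHORED   = /^[\w\.\-\+]+$/     # the pattern from the model above
STRING_ANCHORED = /\A[\w\.\-\+]+\z/   # the corrected pattern

# %0A decodes to a newline, splitting the value into two lines.
raw  = "file.txt%0A<script>alert('hello')</script>"
name = URI.decode_www_form_component(raw)

p name
p LINE_ANCHORED.match?(name)    # => true   (the first line alone satisfies ^...$)
p STRING_ANCHORED.match?(name)  # => false  (the whole string must match)
```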

require 'benchmark'

# simple benchmark for alternations and character class

n = 5_000

str = 'cafebabedeadbeef' * n

Benchmark.bmbm do |x|
  x.report('alternation') do
    str =~ /^(a|b|c|d|e|f)+$/
  end
  x.report('character class') do
    str =~ /^[a-f]+$/
  end
end

Prefer Character Class to Alternations

Ruby 1.8.7
                     user     system      total        real
alternation      0.030000   0.010000   0.040000 (  0.036702)
character class  0.000000   0.000000   0.000000 (  0.004704)

Ruby 2.0.0
                     user     system      total        real
alternation      0.020000   0.010000   0.030000 (  0.023139)
character class  0.000000   0.000000   0.000000 (  0.009641)

JRuby 1.7.4.dev
                     user     system      total        real
alternation      0.030000   0.000000   0.030000 (  0.021000)
character class  0.010000   0.000000   0.010000 (  0.007000)

Benchmarks

# case-insensitively match any non-word character…

# one is unlike the others
'r' =~ /(?i:[\W])/
's' =~ /(?i:[\W])/
't' =~ /(?i:[\W])/

Beware of Character Classes

's' matches, even though 's' is a word character

https://bugs.ruby-lang.org/issues/4044

/^1?$|^(11+?)\1+$/

• ^1?$: matches '1' or ''

• (11+?): non-greedily matches 2 or more 1's

• \1+: the same group, 1 or more additional times

• ^(11+?)\1+$: matches a composite number

• The whole expression matches a string of 1's if and only if there is a non-prime number of 1's

class Integer
  def prime?
    "1" * self !~ /^1?$|^(11+?)\1+$/
  end
end

Integer#prime?

No performance guarantee

Attributed to Abigail, a Perl hacker
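A usage sketch for the same expression, written as a standalone method so that Integer is not reopened. The names NON_PRIME_UNARY and unary_prime? are made up for this example, and the approach is only practical for small numbers.

```ruby
# Matches a run of 1's whose length is 0, 1, or a composite number.
NON_PRIME_UNARY = /^1?$|^(11+?)\1+$/

# Write n in unary and ask whether the non-prime pattern fails to match.
def unary_prime?(n)
  ('1' * n) !~ NON_PRIME_UNARY
end

primes = (1..30).select { |n| unary_prime?(n) }
p primes  # => [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```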

• @hiro_asari

• Github: BanzaiMan