Regexp secrets

Secrets of Regexp

Hiro Asari

Red Hat, Inc.

Let's Talk About Regular Expressions

• There is no regular expression

• A good approximation as a name

Let's Talk About Regexp

Some people, when confronted with a problem, think, "I know, I'll use regular expressions."

Now they have two problems.

Jamie Zawinski, 12 Aug 1997

http://regex.info/blog/2006-09-15/247
http://www.codinghorror.com/blog/2008/06/regular-expressions-now-you-have-two-problems.html

The point is not so much the evils of regular expressions as the evils of overusing them.

Formal Language Theory

• The Language L

• Over Alphabet Σ

• Alphabet Σ={a, b, c, d, e, …, z, λ} (example)

• Words over Σ: "a", "b", "ab", "aequafdhfad"

• Σ*: The set of all words over Σ

Formal Language over Σ

• A subset L of Σ* (with various properties)

• L can be finite (simply enumerating its well-formed words), but it is often infinite

Example

• Language L over Σ = {a,b}

• 'a' is a word

• a word may be obtained by appending 'ab' to an existing word

• only words thus formed are legal

a, aab, aabab

Well-formed words

baaaababb

Ill-formed words

Succinctly…

• a(ab)*
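As a rough check, the expression can be tried in Ruby. This is only a sketch: the \A and \z anchors are an addition so that the whole input must be a word of the language, and the variable names and sample words are chosen for illustration.

```ruby
# a(ab)*, anchored so the entire input must be a word of the language
word = /\Aa(ab)*\z/

well_formed = %w[a aab aabab]
ill_formed  = %w[b aaaa babb]

well_formed.each { |w| puts "#{w}: #{word.match?(w)}" }   # each line ends in true
ill_formed.each  { |w| puts "#{w}: #{word.match?(w)}" }   # each line ends in false
```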

Expression

• A textual representation of a formal language, against which an input is tested to see whether it is a well-formed word in that language

Regular Languages

• ∅ (empty language) is regular

• For each a ∈ Σ (a belongs to Σ), the singleton language {a} is a regular language.

• If A and B are regular languages, then A ∪ B (union), A•B (concatenation), and A* (Kleene star) are regular languages

• No other languages over Σ are regular.
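The three closure operations map directly onto regexp syntax. The following is a minimal sketch, assuming the two singleton languages {a} and {b}; the variable names and the `full` helper are hypothetical and exist only for this illustration.

```ruby
a = /a/   # the singleton language {a}
b = /b/   # the singleton language {b}

union  = Regexp.union(a, b)   # A ∪ B
concat = /#{a}#{b}/           # A•B
star   = /(?:#{a})*/          # A*

# Helper (hypothetical): true when the pattern matches the whole input.
full = ->(re, s) { /\A(?:#{re})\z/.match?(s) }

puts full.(union, 'b')     # true
puts full.(concat, 'ab')   # true
puts full.(concat, 'ba')   # false
puts full.(star, '')       # true: A* contains the empty word
puts full.(star, 'aaa')    # true
```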

Regular Expressions

• Expressions of regular languages

Regular? Expressions

• It turns out that some expressions are more powerful and express non-regular languages

• Language of 'squares': (.*)\1

• a, aa, aaaa, WikiWiki
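A short Ruby sketch of the backreference at work. The \A and \z anchors are an assumption added here so that the whole input must be a doubled word; unanchored, the pattern matches any string, because the group may capture the empty string.

```ruby
square = /\A(.*)\1\z/

%w[aa aaaa WikiWiki].each do |w|
  m = square.match(w)
  puts "#{w} = #{m[1]} + #{m[1]}"
end
# prints:
#   aa = a + a
#   aaaa = aa + aa
#   WikiWiki = Wiki + Wiki

puts square.match?('Wiki')   # false: not a doubled word
```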

How does Regexp work?

• Build a finite state automaton representing a given regular expression

• Feed the String to the regular expression and see if the match succeeds

(ab|c)

(ab+|c)
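To make the automaton idea concrete, here is a hand-written transition table for ab+|c, walked one character at a time. It is a sketch only: the state names and the `accepts?` helper are hypothetical, it tests whole-string acceptance, and it is not a description of how Ruby's engine is implemented.

```ruby
# Transition table for ab+|c (state names are illustrative).
transitions = {
  start:   { 'a' => :seen_a, 'c' => :seen_c },
  seen_a:  { 'b' => :seen_ab },
  seen_ab: { 'b' => :seen_ab },   # the + loops on 'b'
  seen_c:  {}
}
accepting = [:seen_ab, :seen_c]

# Follow one transition per character; a missing transition rejects the input.
def accepts?(transitions, accepting, input)
  state = :start
  input.each_char do |ch|
    state = transitions.fetch(state, {})[ch]
    return false if state.nil?
  end
  accepting.include?(state)
end

%w[ab abbb c a abc].each do |s|
  puts "#{s}: #{accepts?(transitions, accepting, s)}"
end
# ab, abbb and c are accepted; a and abc are rejected
```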

Match is attempted at every character, left to right

zyxwvutsrqponmlkjihgfedcba^

Regexp does not think, "'a$' can match only at the end of the line, so we should fast-forward to the end of the line."

(Animation: the ^ marker under zyxwvutsrqponmlkjihgfedcba advances one character per step until it reaches the final 'a'.)
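In Ruby the result of that scan is visible as the match index. When the question is really "does the string end with 'a'?", a plain string method states the intent directly. Note the two are not equivalent in general, since $ also matches before a newline, and whether one is faster than the other depends on the engine and input; nothing is measured here.

```ruby
str = 'zyxwvutsrqponmlkjihgfedcba'

puts(str =~ /a$/)         # 25: the index at which the match finally succeeds
puts str.end_with?('a')   # true: looks only at the tail of the string
```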

(Animation: the ^ marker steps through the input 'abc d a dfadg ' one position at a time.)

# matches 'abc d a dfadg '

^\s*(.*)\s*$
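The capture keeps the trailing whitespace because (.*) is greedy and the following \s* is content to match nothing. A minimal sketch, assuming an input padded with two blanks on each side (the exact spacing on the slide is not recoverable):

```ruby
input = '  abc d a dfadg  '   # assumed spacing: two blanks on each side

greedy = input.match(/^\s*(.*)\s*$/)
puts greedy[1].inspect     # "abc d a dfadg  " (trailing blanks are captured)

lazy = input.match(/^\s*(.*?)\s*$/)
puts lazy[1].inspect       # "abc d a dfadg"

puts input.strip.inspect   # "abc d a dfadg"
```

For plain trimming, String#strip says what is meant without involving the regexp engine at all.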

def pathological(n=5)
  Regexp.new('a?' * n + 'a' * n)
end

1.upto(40) do |n|
  print n, ": "
  print Time.now, "\n" if 'a'*n =~ pathological(n)
end

a?a?a?…a?aaa…a

a?a?a?aaa
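For n = 3 the generated pattern is exactly the one shown above. Against the input 'aaa', the match can succeed only if every a? matches the empty string, and a backtracking engine may try up to 2^n combinations of "take the a" or "skip it" before finding that. The sketch below only inspects the pattern and does no timing; some engines or newer versions may mitigate the blow-up.

```ruby
def pathological(n = 5)
  Regexp.new('a?' * n + 'a' * n)
end

puts pathological(3).source           # a?a?a?aaa
puts 'aaa'.match?(pathological(3))    # true
```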

Regexp tips

UP_TO_256 = /\b(?:25[0-5]   # 250-255
            |2[0-4][0-9]    # 200-249
            |1[0-9][0-9]    # 100-199
            |[1-9][0-9]     # 2-digit numbers
            |[0-9])         # single-digit numbers
            \b/x

IPV4_ADDRESS = /#{UP_TO_256}(?:\.#{UP_TO_256}){3}/

Use /x
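A usage sketch for the pattern above, rewritten with local variables (`octet` and `ipv4` are illustrative names) so it runs on its own. An interpolated regexp keeps its own /x flag, so the commented sub-pattern can be embedded in a pattern that is not itself extended. The sample strings are made up for the example.

```ruby
octet = /\b(?:25[0-5]    # 250-255
        |2[0-4][0-9]     # 200-249
        |1[0-9][0-9]     # 100-199
        |[1-9][0-9]      # 10-99
        |[0-9])          # 0-9
        \b/x
ipv4 = /#{octet}(?:\.#{octet}){3}/

puts ipv4.match?('192.168.0.1')     # true
puts ipv4.match?('256.1.1.1')       # false
puts 'gateway: 10.0.0.254'[ipv4]    # 10.0.0.254
```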

\A, \z for strings; ^, $ for lines

• \A: the beginning of the string

• \z: the end of the string

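• ^: the beginning of a line (the start of the string, or just after \n)

• $: the end of a line (just before \n, or the end of the string)

In Ruby, ^ and $ always behave this way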

What's the problem?

also note the difference in what /m means

#! /usr/bin/env perl
$a = "abc\ndef";
if ($a =~ /^d/) {
  print "yes\n";
}
if ($a =~ /^d/m) {
  print "yes now\n";
}
# prints 'yes now'

#! /usr/bin/env ruby
a = "abc\ndef"
if a =~ /^d/
  p "yes"
end
# prints "yes"
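The difference can be seen side by side in Ruby. In this sketch, the first three lines contrast ^ with \A on the same string; the last two show what /m does in Ruby, which is to let . match a newline rather than to change the anchors.

```ruby
a = "abc\ndef"

puts((a =~ /^d/).inspect)    # 4: ^ matches right after the "\n"
puts((a =~ /\Ad/).inspect)   # nil: \A matches only at the start of the string
puts((a =~ /\Aa/).inspect)   # 0

puts(("a\nb" =~ /a.b/).inspect)    # nil: . does not match "\n" by default
puts(("a\nb" =~ /a.b/m).inspect)   # 0: with /m, . matches "\n" as well
```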

What's the problem?

http://guides.rubyonrails.org/security.html#regular-expressions

class File < ActiveRecord::Base
  validates :name, :format => /^[\w\.\-\+]+$/
end

Security Implications

file.txt%0A<script>alert('hello')</script>

file.txt\n<script>alert('hello')</script>

/^[\w\.\-\+]+$/

Match succeeds

ActiveRecord validation succeeds

/\A[\w\.\-\+]+\z/

Match fails

ActiveRecord validation fails
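The same comparison can be reproduced in plain Ruby without Rails. This sketch only exercises the two patterns against the decoded file name; the variable names are illustrative and no ActiveRecord behavior is modeled.

```ruby
name = "file.txt\n<script>alert('hello')</script>"

line_anchored   = /^[\w\.\-\+]+$/
string_anchored = /\A[\w\.\-\+]+\z/

puts line_anchored.match?(name)           # true: "file.txt" is a complete first line
puts string_anchored.match?(name)         # false: the whole string must match
puts string_anchored.match?('file.txt')   # true
```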

require 'benchmark'

# simple benchmark for alternations and character class
n = 5_000
str = 'cafebabedeadbeef' * n

Benchmark.bmbm do |x|
  x.report('alternation') do
    str =~ /^(a|b|c|d|e|f)+$/
  end
  x.report('character class') do
    str =~ /^[a-f]+$/
  end
end

Prefer Character Class to Alternations

Ruby 1.8.7
                      user     system      total        real
alternation       0.030000   0.010000   0.040000 (  0.036702)
character class   0.000000   0.000000   0.000000 (  0.004704)

Ruby 2.0.0
                      user     system      total        real
alternation       0.020000   0.010000   0.030000 (  0.023139)
character class   0.000000   0.000000   0.000000 (  0.009641)

JRuby 1.7.4.dev
                      user     system      total        real
alternation       0.030000   0.000000   0.030000 (  0.021000)
character class   0.010000   0.000000   0.010000 (  0.007000)

Benchmarks

# case-insensitively match any non-word character…
# one is unlike the others
'r' =~ /(?i:[\W])/
's' =~ /(?i:[\W])/
't' =~ /(?i:[\W])/

Beware of Character Classes

's' matches, even though 's' is a word character

https://bugs.ruby-lang.org/issues/4044

/^1?$|^(11+?)\1+$/

• ^1?$ matches '1' or ''

• (11+?) non-greedily matches 2 or more 1's

• \1+ matches the captured group 1 or more additional times

• ^(11+?)\1+$ matches a composite number of 1's

• The whole expression matches a string of 1's if and only if there is a non-prime number of 1's

class Integer
  def prime?
    "1" * self !~ /^1?$|^(11+?)\1+$/
  end
end

Integer#prime?

No performance guarantee

Attributed to the Perl hacker Abigail
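A usage sketch of the same expression. To avoid reopening Integer in an example, it is wrapped in a standalone method; the name `unary_prime?` is hypothetical and not part of the talk.

```ruby
# Hypothetical helper: same regexp as above, applied to a string of n 1's.
def unary_prime?(n)
  ('1' * n) !~ /^1?$|^(11+?)\1+$/
end

p((1..20).select { |n| unary_prime?(n) })
# [2, 3, 5, 7, 11, 13, 17, 19]
```

The work grows quickly with n because the engine effectively tries candidate divisors by backtracking, which is why there is no performance guarantee.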

• @hiro_asari

• Github: BanzaiMan
