Posted on 30-Dec-2015
Regular Expressions in Perl, Part I
Alan Gold
Basic syntax
• =~ is the matching operator
• !~ is the negated matching operator
• // are the default delimiters
• Prefixing the expression with “m” allows for arbitrary delimiters, e.g. m%Don’t use this%
• Modifiers follow the closing delimiter
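As a minimal sketch of the points above, the following script exercises each operator once. The variable names ($text, $path and the result flags) are illustrative and not part of the original slides.

```perl
use strict;
use warnings;

my $text = "Hello World";

# =~ is true when the pattern matches somewhere in the string.
my $has_hello = $text =~ /Hello/ ? 1 : 0;

# !~ is true when the pattern does NOT match.
my $lacks_bye = $text !~ /Goodbye/ ? 1 : 0;

# With the "m" prefix any delimiter works, which avoids escaping slashes.
my $path = "/usr/local/bin";
my $has_local = $path =~ m{/local/} ? 1 : 0;

# Modifiers go after the closing delimiter; "i" ignores case.
my $has_world = $text =~ /world/i ? 1 : 0;

print "$has_hello $lacks_bye $has_local $has_world\n";   # prints "1 1 1 1"
```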
Simple matching
• "Hello World" =~ /Hello/
• Matches the literal string “Hello”
• "Superman" =~ /Kal-El/
• Unfortunately does not match
Metacharacters
• Metacharacters are {}[]()^$.|*+?\
• These must be escaped with a “\” to match their literal characters
• "Spoon+fork" =~ /Spoon+/ will match, but not how you want it to: the “+” means “one or more n”, so only “Spoon” is matched
• "Spoonnnnnn" =~ /Spoon+/ will also match
• "Spoon+fork" =~ /Spoon\+/ matches properly
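The difference is easier to see by capturing what was actually matched. This sketch wraps each pattern in parentheses (a capture group, used here only to inspect the match) and also shows quotemeta and \Q...\E, which are standard Perl but are not mentioned on the slides. The variable names are illustrative.

```perl
use strict;
use warnings;

# Unescaped, "+" is a quantifier: /Spoon+/ means "Spoo" followed by one or more "n".
my ($loose) = "Spoon+fork" =~ /(Spoon+)/;    # captures "Spoon", the "+" is not part of it
my ($many)  = "Spoonnnnnn" =~ /(Spoon+)/;    # captures the whole run of n's

# Escaped, "\+" matches a literal plus sign.
my ($exact) = "Spoon+fork" =~ /(Spoon\+)/;   # captures "Spoon+"

# quotemeta() escapes metacharacters in a string; \Q...\E does the same inside a pattern.
my $needle  = "Spoon+";
my $escaped = quotemeta($needle);            # "Spoon\+"
my ($quoted) = "Spoon+fork" =~ /(\Q$needle\E)/;

print "$loose | $many | $exact | $escaped | $quoted\n";
```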
Escape sequences
• Several characters can’t be printed directly
• They are matched using an escape sequence
• \t is a tab character (ASCII code 9)
• \n is a newline character (ASCII code 10)
• \r is a carriage return (ASCII code 13)
• \0.. is an octal character code, e.g. \033
• \x.. is a hexadecimal character code, e.g. \x1B
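A short sketch of these escapes in use, assuming a made-up input line ($line) and using chr(27) to build the escape character that both \033 and \x1B denote:

```perl
use strict;
use warnings;

# A tab-separated line with a Windows-style line ending.
my $line = "name\tvalue\r\n";

my $has_tab  = $line =~ /\t/   ? 1 : 0;
my $has_crlf = $line =~ /\r\n/ ? 1 : 0;

# \033 (octal) and \x1B (hexadecimal) both denote character code 27 (escape).
my $esc      = chr(27);
my $octal_ok = $esc =~ /\033/ ? 1 : 0;
my $hex_ok   = $esc =~ /\x1B/ ? 1 : 0;

print "$has_tab $has_crlf $octal_ok $hex_ok\n";   # prints "1 1 1 1"
```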
Variables
• Variables can be used in regular expressions similarly to double-quoted strings
• $something = "cool";
• 'cool cruel pool' =~ /$something/
• Will match just fine
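One consequence worth sketching: the interpolated value is treated as a pattern, so any metacharacters in it stay active unless they are quoted. The $dotted variable and the \Q...\E quoting below are additions for illustration and do not come from the slides.

```perl
use strict;
use warnings;

my $something = "cool";
my $plain = 'cool cruel pool' =~ /$something/ ? 1 : 0;   # 1: "cool" is found

# The interpolated text is a pattern: the "." here matches any character.
my $dotted = "c.ol";
my $as_pattern = "c-ol" =~ /$dotted/ ? 1 : 0;            # 1: "." matched "-"

# \Q...\E makes the interpolated text literal, so "." must be a real dot.
my $as_literal = "c-ol" =~ /\Q$dotted\E/ ? 1 : 0;        # 0

print "$plain $as_pattern $as_literal\n";                # prints "1 1 0"
```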
Anchors
• ^ anchors the pattern to the beginning of the string
• $ anchors to the end
• "Speaker" =~ /^peak/
• Will not match
• "Rabbit" =~ /bit$/
• Will match
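The two slide examples can be checked directly, along with an unanchored pattern for contrast and both anchors combined to require a whole-string match. The result variable names are illustrative.

```perl
use strict;
use warnings;

my $anchored   = "Speaker" =~ /^peak/ ? 1 : 0;      # 0: "peak" is not at the start
my $unanchored = "Speaker" =~ /peak/  ? 1 : 0;      # 1: "peak" occurs inside the string
my $at_end     = "Rabbit"  =~ /bit$/  ? 1 : 0;      # 1: the string ends in "bit"

# Using both anchors requires the whole string to match.
my $whole      = "Rabbit"  =~ /^Rabbit$/ ? 1 : 0;   # 1
my $not_whole  = "Rabbits" =~ /^Rabbit$/ ? 1 : 0;   # 0: trailing "s" is left over

print "$anchored $unanchored $at_end $whole $not_whole\n";   # prints "0 1 1 1 0"
```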
Character classes
• Character classes match any character contained in [brackets]
• /tin[yas]/ will match tiny, tina, and tins
• “-” can be used to represent a range
• /[a-zA-Z0-9]/ will match a single alphanumeric character
• The literal “-” character can be matched if it is the first or last character, e.g. /[-0-9]/
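A small sketch of these classes, filtering an illustrative word list with grep. The extra word “tint” and the anchors around the class are added here only to show a non-match and a whole-string test.

```perl
use strict;
use warnings;

# Keep the words in which "tin" is followed by y, a, or s.
my @tin_words = grep { /tin[yas]/ } qw(tiny tina tins tint);

# A range class, anchored so the whole string must be one alphanumeric character.
my $one_alnum = "x" =~ /^[a-zA-Z0-9]$/ ? 1 : 0;   # 1
my $not_alnum = "?" =~ /^[a-zA-Z0-9]$/ ? 1 : 0;   # 0

# A leading "-" inside the brackets is a literal minus sign, not a range.
my $signed = "-5" =~ /^[-0-9]+$/ ? 1 : 0;         # 1

print "@tin_words | $one_alnum $not_alnum $signed\n";   # prints "tiny tina tins | 1 0 1"
```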
Negated character classes
• The “^” character, placed first inside the brackets, negates a character class
• /200[^7]/ will not match 2007 but will match 2008, 200q, etc.
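A negated class still has to consume one character, which this sketch makes visible: the bare string “200” (an extra test case, not from the slides) fails because nothing follows the second zero.

```perl
use strict;
use warnings;

# [^7] needs exactly one character that is not "7" after "200".
my @hits = grep { /200[^7]/ } qw(2007 2008 200q 200);

print "@hits\n";   # prints "2008 200q"
```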
Shortcut character classes
• \d is a digit, equivalent to [0-9]
• \s is any whitespace, equivalent to [\ \t\r\n\f]
• \w is a word character, equivalent to [0-9a-zA-Z_]
• \D is any non-digit, equivalent to [^0-9]
• \S is any non-whitespace, equivalent to [^\s]
• \W is any non-word character, equivalent to [^\w]
• The period ‘.’ matches any character but ‘\n’
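The shortcuts can be tried on a sample string. This is a sketch only: the $input value is made up, and the capture groups, the + quantifier and split are used just to show what each class matched.

```perl
use strict;
use warnings;

my $input = "Order 66:\tpaid_in_full";

# \d+ captures the first run of digits.
my ($number) = $input =~ /(\d+)/;            # "66"

# \s+ as a separator splits on any run of spaces or tabs.
my @fields = split /\s+/, $input;            # "Order", "66:", "paid_in_full"

# \w includes the underscore, so the last field is a single "word".
my ($last_word) = $input =~ /(\w+)$/;        # "paid_in_full"

# \D is the complement of \d.
my $no_digits = "abc" =~ /^\D+$/ ? 1 : 0;    # 1

# By default "." does not match a newline.
my $dot_nl = "a\nb" =~ /a.b/ ? 1 : 0;        # 0

print "$number | ", join(",", @fields), " | $last_word | $no_digits $dot_nl\n";
```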
Word anchors
• The word anchor ‘\b’ matches the boundary between a word character and non-word character
• /\bpen/ matches “penitentiary”, not “open”
• /\bpen\b/ only matches “pen” if surrounded by non-word characters (or the start or end of the string), e.g. “this pen is blue”
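A quick sketch of both patterns against an illustrative list of phrases. The standalone string “pen” is an added case showing that the edges of the string also count as word boundaries.

```perl
use strict;
use warnings;

my @phrases = ("penitentiary", "open", "this pen is blue", "pen");

# \b before "pen": the "p" must start a word.
my @starts = grep { /\bpen/ } @phrases;

# \b on both sides: "pen" must be a whole word.
my @whole = grep { /\bpen\b/ } @phrases;

print join(", ", @starts), "\n";   # prints "penitentiary, this pen is blue, pen"
print join(", ", @whole), "\n";    # prints "this pen is blue, pen"
```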
Modifiers
• Modifiers change the behavior of the engine
• // is the default, ‘.’ doesn’t match newlines
• //s causes ‘.’ to match newlines
• //m treats each line as its own string: ‘^’ and ‘$’ match at the start and end of every line
• //i matches case-insensitively
• Modifiers can be combined, e.g. //sim
• /^car.$/im matches “not a car\nCAR!”
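To see which modifier does what, the slide's last example can be run with the modifiers applied one at a time. This is a sketch; the variable names are illustrative.

```perl
use strict;
use warnings;

my $text = "not a car\nCAR!";

my $none   = $text =~ /^car.$/   ? 1 : 0;   # 0: wrong case, and ^ only matches at the string start
my $i_only = $text =~ /^car.$/i  ? 1 : 0;   # 0: case is ignored, but ^ still cannot match on line 2
my $m_only = $text =~ /^car.$/m  ? 1 : 0;   # 0: ^ matches on line 2, but "CAR" is the wrong case
my $both   = $text =~ /^car.$/im ? 1 : 0;   # 1: line 2 matches as "CAR" plus "!"

# The "s" modifier lets "." match a newline.
my $dot_default = "a\nb" =~ /a.b/  ? 1 : 0;   # 0
my $dot_s       = "a\nb" =~ /a.b/s ? 1 : 0;   # 1

print "$none $i_only $m_only $both $dot_default $dot_s\n";   # prints "0 0 0 1 0 1"
```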
Or
• The pipe character ‘|’ can be used to match any one of the given choices
• /lumber|wood/ will match “My desk is made of spare lumber” and “My desk is made of 100,000 year old petrified wood”
• /0|1|2/ is equivalent to /[0-2]/
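Both claims can be checked with a short sketch. The third desk sentence and the test values passed to grep are made up for illustration.

```perl
use strict;
use warnings;

my @desks = (
    "My desk is made of spare lumber",
    "My desk is made of 100,000 year old petrified wood",
    "My desk is made of steel",
);

# Either alternative is enough for a match.
my @wooden = grep { /lumber|wood/ } @desks;   # the first two sentences

# Alternation between single characters selects the same strings as a character class.
my @values  = qw(0 1 2 3 a 12);
my @by_alt  = grep { /0|1|2/ } @values;
my @by_class = grep { /[0-2]/ } @values;

print scalar(@wooden), " | @by_alt | @by_class\n";   # prints "2 | 0 1 2 12 | 0 1 2 12"
```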