Perl Regex
Overview
Introduction
• Regular expressions are tiny programs in their own special language, built inside Perl.
• These allow fast, flexible, and reliable string handling.
• A regular expression, often called a pattern in Perl, is a template that either matches or doesn’t match a given string.
• That is, there are an infinite number of possible text strings; a given pattern divides that infinite set into two groups: the ones that match, and the ones that don’t.
• Don’t confuse regular expressions with shell filename-matching patterns, called globs, which are a different sort of pattern with their own rules.
Simple Pattern
• To match a pattern (regular expression) against the contents of $_, simply put the pattern between a pair of forward slashes (/).
$_ = "yabba dabba doo";
if (/abba/) {
print "It matched!\n";
}
• The expression /abba/ looks for that four-letter string in $_; if it finds it, it returns a true value.
Unicode Properties
• Unicode characters know something about themselves; they aren’t just sequences of bits.
• Instead of matching on a particular character, you can match a type of character.
• To match a particular property, you put the name in \p{PROPERTY}.
if (/\p{Space}/) { # 26 different possible characters
print "The string has some whitespace.\n";
}
if (/\p{Digit}/) { # 411 different possible characters
print "The string has a digit.\n";
}
• More properties at perluniprops .
Meta-characters
• The dot (.) is a wildcard character; it matches any single character except a newline.
/bet.y/ -> matches betty, betsy, bet=y, bet.y,
doesn’t match bety or betsey.
• The dot always matches exactly one character.
• If you want the dot to match just a period, simply backslash it.
/3\.141/ -> matches 3.141596456
doesn’t match 3a141545
• If you mean a real backslash, use a pair of them.
$_ = 'a real \\ backslash';
if (/\\/) {
print "It matched!\n";
}
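The dot examples above can be checked with a short script. This is only a sketch: the sample strings come from the slide, while the hash names %dot_result and %period_result are illustrative.

```perl
use strict;
use warnings;

# Sample strings taken from the /bet.y/ example above.
my @candidates = ('betty', 'betsy', 'bet=y', 'bet.y', 'bety', 'betsey');

# Match each candidate against /bet.y/ through $_ and record the result.
my %dot_result;
for (@candidates) {
    $dot_result{$_} = /bet.y/ ? 1 : 0;
    printf "%s: %s\n", $_, ($dot_result{$_} ? "matches" : "doesn't match");
}

# A backslashed dot matches only a literal period.
my %period_result;
for ('3.141596456', '3a141545') {
    $period_result{$_} = /3\.141/ ? 1 : 0;
    printf "%s: %s\n", $_, ($period_result{$_} ? "matches" : "doesn't match");
}
```

The two non-matching names fail because the dot must consume exactly one character: bety has none between "bet" and "y", and betsey has two.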
Simple Quantifiers
• * -- zero or more occurrences
/fred\t*barney/ matches fredbarney, fred\tbarney, fred\t\tbarney
/fred.*barney/ matches fredbarney, fredabcd…barney
• + -- one or more occurrences
/fred\t+barney/ matches fred\tbarney, fred\t\tbarney
doesn’t match fredbarney
• ? -- zero or one occurrence
/bam-?bam/ matches bambam, bam-bam
doesn’t match bam-----bam
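As a rough check of the three quantifiers, the sketch below runs the slide's patterns over the slide's sample strings. The hash names are made up for illustration, and \t inside the double-quoted strings is a real tab character.

```perl
use strict;
use warnings;

# Results keyed by the tested string: 1 for a match, 0 otherwise.
my (%star, %plus, %optional);

for ("fredbarney", "fred\tbarney", "fred\t\tbarney") {
    $star{$_} = /fred\t*barney/ ? 1 : 0;   # zero or more tabs
    $plus{$_} = /fred\t+barney/ ? 1 : 0;   # at least one tab
}

for ('bambam', 'bam-bam', 'bam-----bam') {
    $optional{$_} = /bam-?bam/ ? 1 : 0;    # at most one hyphen
}

printf "matched with *: %d of 3, with +: %d of 3\n",
    scalar(grep { $_ } values %star),
    scalar(grep { $_ } values %plus);
```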
Grouping in Patterns
• Use parentheses (“( )”) to group parts of a pattern.
• So, parentheses are also meta-characters.
/fred+/ matches fredddd, fredd
/(fred)+/ matches fred, fredfred, fredfredfred
/(fred)*/ matches hello, barney, fred, fredfred (any string at all, since zero occurrences are allowed)
• Using parentheses also makes Perl store the matched text in the special variables $1, $2, and so on. The number denotes the capture group.
$_ = "perl version is 5.14";
if (/perl version is (.*)/) {
print $1; #prints 5.14
}
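With more than one pair of parentheses, each group fills its own variable. The following sketch splits the version number from the example above into two parts; the variable names $major and $minor are illustrative, and the [0-9] character class it uses is covered later in these slides.

```perl
use strict;
use warnings;

my ($major, $minor);

$_ = "perl version is 5.14";
# First group captures the digits before the period, second the digits after.
if (/perl version is ([0-9]+)\.([0-9]+)/) {
    ($major, $minor) = ($1, $2);
    print "major=$major minor=$minor\n";
}
```

Copying $1 and $2 into named variables right after the match is a common habit, since the next successful match overwrites them.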
• Use back references to refer to text that you matched in the parentheses, called a capture group.
• You denote a back reference as a backslash followed by a number, like \1, \2, and so on.
$_ = "abba";
if (/(.)\1/) { # matches 'bb'
print "It matched same character next to itself!\n";
}
$_ = "yabba dabba doo";
if (/y(....) d\1/) {
print "It matched the same after y and d!\n";
}
$_ = "yabba dabba doo";
if (/y(.)(.)\2\1/) { # matches 'abba'
print "It matched after the y!\n";
}
• “How do I know which group gets which number?” Just count the opening parentheses from left to right and ignore nesting.
$_ = "yabba dabba doo";
if (/y((.)(.)\3\2) d\1/) {
print "It matched!\n";
}
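To see the numbering rule in action, this small sketch saves what each group captured in the nested pattern above. The array name @groups is illustrative.

```perl
use strict;
use warnings;

my @groups;

$_ = "yabba dabba doo";
# Group 1 opens first (the outer pair), then group 2, then group 3.
if (/y((.)(.)\3\2) d\1/) {
    @groups = ($1, $2, $3);
    print "group 1: $1\n";   # the whole four-letter chunk
    print "group 2: $2\n";   # its first letter
    print "group 3: $3\n";   # its second letter
}
```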
• Consider the problem where you want to use a back reference next to a part of the pattern that is a number.
• In this regular expression, you want to use \1 to repeat the character you matched in the parentheses and follow that with the literal string 11:
$_ = "aa11bb";
if (/(.)\111/) {
print "It matched!\n";
}
Is that \1, \11, or \111?
• Starting with Perl 5.10, you can write \g{1} to separate the back reference from the literal parts of the pattern:
use 5.010;
$_ = "aa11bb";
if (/(.)\g{1}11/) {
print "It matched!\n";
}
• With the \g{N} notation, you can also use negative numbers, which count backward from the back reference: \g{-1} refers to the group just before it.
use 5.010;
$_ = "xaa11bb";
if (/(.)(.)\g{-1}11/) {
print "It matched!\n";
}
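The sketch below compares the ambiguous form with the two \g{N} forms. It rests on one assumption worth stating: with only one capture group in the pattern, Perl reads \111 as an octal character escape (the letter I) rather than as \1 followed by 11, so the first match fails. The variable names are illustrative.

```perl
use strict;
use warnings;
use 5.010;   # \g{N} needs Perl 5.10 or later

$_ = "aa11bb";
# Assumption: \111 is taken as an octal escape here, not as \1 plus "11".
my $ambiguous = /(.)\111/    ? 1 : 0;
# \g{1} is unmistakably group 1, followed by the literal 11.
my $explicit  = /(.)\g{1}11/ ? 1 : 0;

$_ = "xaa11bb";
# \g{-1} is the group just before it, which is group 2 here.
my $relative  = /(.)(.)\g{-1}11/ ? 1 : 0;

print "ambiguous=$ambiguous explicit=$explicit relative=$relative\n";
```

A relative reference keeps working if you later add another group at the front of the pattern, which an absolute number would not.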
Alternatives
• The vertical bar (|), often called “or” in this usage, means that if the part of the pattern on the left of the bar fails, the part on the right gets a chance to match.
/fred|barney|betty/ matches fred, barney, betty.
/fred( |\t)+barney/ matches if fred and barney are separated by spaces, tabs, or a mixture of the two.
/fred( +|\t+)barney/ matches if fred and barney are separated only by spaces or only by tabs, not by a mixture of the two.
/fred (and|or) barney/ matches fred and barney, fred or barney. Same as pattern /fred and barney|fred or barney/.
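The difference between ( |\t)+ and ( +|\t+) only shows up on mixed whitespace. A minimal sketch, using a made-up test string with a space, a tab, and another space between the names:

```perl
use strict;
use warnings;

# Space, tab, space between the two names (an illustrative test string).
$_ = "fred \t barney";
my $any_mixture = /fred( |\t)+barney/  ? 1 : 0;   # each repetition may pick either
my $one_kind    = /fred( +|\t+)barney/ ? 1 : 0;   # one choice for the whole gap

# Parentheses around alternatives also capture the branch that matched.
$_ = "fred or barney";
my $joiner = /fred (and|or) barney/ ? $1 : '';

print "any_mixture=$any_mixture one_kind=$one_kind joiner=$joiner\n";
```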
Character Classes
• A character class is a list of possible characters inside square brackets.
• It matches just one single character, but that one character may be any of the ones you list in the brackets.
[abcwxyz] matches a, b, c, w, x, y, z (any of those seven characters)
• You may specify a range of characters with a hyphen (-).
[a-cw-z] matches any lowercase letter from a to c or from w to z
[a-zA-Z0-9] matches any ASCII alphanumeric character
$_ = "The HAL-9000 requires authorization to continue.";
if (/HAL-[0-9]+/) {
print "The string mentions some model of HAL computer.\n";
}
Character Class Shortcuts
• Some character classes appear so frequently that they have shortcuts.
• The character class for any digit is \d.
use 5.010; # enables say
$_ = 'The HAL-9000 requires authorization to continue.';
if (/HAL-[\d]+/) {
say 'The string mentions some model of HAL computer.';
}
• However, there are many more digits than the 0 to 9 that you may expect from ASCII, so that pattern will also match HAL-٩٠٠٠.
• Recognizing this problematic shift from ASCII to Unicode, Perl 5.14 adds the /a modifier; placed at the end of the match operator, it tells Perl to use the old ASCII interpretation.
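A small sketch of the difference /a makes, assuming Perl 5.14 or later. The Arabic-Indic digits are written as \x{...} escapes so the script itself stays plain ASCII; the variable names are illustrative.

```perl
use 5.014;      # /a needs Perl 5.14; this also turns on strict
use warnings;

my $ascii_model  = "HAL-9000";
# U+0669 and U+0660 are ARABIC-INDIC DIGIT NINE and ZERO, spelling 9000.
my $arabic_model = "HAL-\x{0669}\x{0660}\x{0660}\x{0660}";

$_ = $ascii_model;
my $ascii_default = /HAL-\d+/  ? 1 : 0;   # Unicode rules
my $ascii_with_a  = /HAL-\d+/a ? 1 : 0;   # ASCII-only rules

$_ = $arabic_model;
my $arabic_default = /HAL-\d+/  ? 1 : 0;  # \d accepts any Unicode digit
my $arabic_with_a  = /HAL-\d+/a ? 1 : 0;  # \d accepts only 0 to 9

say "ASCII digits:  default=$ascii_default /a=$ascii_with_a";
say "Arabic digits: default=$arabic_default /a=$arabic_with_a";
```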
• \s matches any whitespace, which is almost the same as the Unicode property \p{Space}.
• \h matches only horizontal whitespace.
• \v matches only vertical whitespace.
• Taken together, \h and \v are the same as \p{Space}.
• The \R shortcut, introduced in Perl 5.10, matches any sort of line break, independent of operating system.
• \w matches a “word” character; under the ASCII interpretation this is the set [a-zA-Z0-9_].
Negating the Shortcuts
• To specify the characters you want to leave out, rather than the ones to include, use a caret (^).
• A caret (^) at the start of a character class (i.e., just inside the square brackets) negates the class.
[^def] matches any single character except one of those three.
[^n\-z] matches any character except for n, hyphen, or z.
• To negate a shortcut, use its uppercase form.
\S matches any non-space
\D matches any non-digit
[\d\D] matches any digit, or any non-digit. i.e., any character or anything
[^\d\D] matches anything that’s not either a digit or a non-digit. i.e., nothing!
Matches with m//
• We put patterns in pairs of forward slashes, like /fred/. But this is actually a shortcut for m// (the pattern match operator).
• We may choose any pair of delimiters to quote the contents.
m(fred), m<fred>, m{fred}, m[fred], m,fred,, m!fred!, m^fred^
• The shortcut is that if you choose the forward slash as the delimiter, you may omit the initial m.
• Wisely choose a delimiter that doesn’t appear in your pattern.
m%http://% instead of /http:\/\// to match the initial "http://".
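The three spellings below are meant to be the same match, differing only in delimiter. This is a sketch; the test string and its example.com URL are made up for illustration.

```perl
use strict;
use warnings;

$_ = "See http://www.example.com/ for details.";

# With slashes as delimiters, every slash in the pattern needs a backslash.
my $with_slashes = /http:\/\// ? 1 : 0;

# With other delimiters, the slashes can be written as they are.
my $with_percent = m%http://% ? 1 : 0;
my $with_braces  = m{http://} ? 1 : 0;

print "slashes=$with_slashes percent=$with_percent braces=$with_braces\n";
```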
Match Modifiers
• Case-Insensitive Matching with /i
$_ = "Is Freddy there?";
if (/freddy/i) {
print "Yes Freddy is here";
}
• Matching Any Character with /s
– The /s modifier makes the dot (.) match any character, including a newline.
– It does this by making every dot act like the character class [\d\D], which matches any character.
– The effect shows only when the string contains newline characters.
$_ = "I saw Barney\ndown at the bowling alley\nwith Fred\nlast night.\n";
if (/Barney.*Fred/s) {
print "That string mentions Fred after Barney!\n";
}
• Without the /s modifier, that match would fail, since the two names aren’t on the same line.
• What if you still want to match any character except a newline? You could use the character class [^\n], or the shortcut \N, which Perl 5.12 added to mean the complement of \n.
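The effect of /s, and of \N under /s, can be sketched on the same multi-line string. This assumes Perl 5.12 or later for \N; the variable names are illustrative.

```perl
use 5.012;      # \N needs Perl 5.12; this also turns on strict
use warnings;

$_ = "I saw Barney\ndown at the bowling alley\nwith Fred\nlast night.\n";

# The plain dot stops at the first newline, so Fred is out of reach.
my $without_s = /Barney.*Fred/ ? 1 : 0;

# With /s the dot also crosses newlines.
my $with_s = /Barney.*Fred/s ? 1 : 0;

# \N never matches a newline, with or without /s.
my $non_newline = /Barney\N*Fred/s ? 1 : 0;

say "without /s: $without_s, with /s: $with_s, \\N under /s: $non_newline";
```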
• Many other modifiers are described in the perlop documentation. A few are described below.
• Adding Whitespace with /x
– /x allows you to add arbitrary whitespace to a pattern, in order to make it easier to read.
/-?[0-9]+\.?[0-9]*/ # what is this doing?
/ -? [0-9]+ \.? [0-9]* /x # a little better
– Because /x allows whitespace inside the pattern, Perl ignores literal space or tab characters within the pattern.
– Use a backslashed space, \t, or (more commonly) \s, \s*, or \s+ when you want to match whitespace.
– Perl considers comments a type of whitespace, so you can put comments into that pattern to tell what you are trying to do:
/
-? # an optional minus sign
[0-9]+ # one or more digits before the decimal point
\.? # an optional decimal point
[0-9]* # some optional digits after the decimal point
/x # end of string
– Use the escaped character, \#, or the character class, [#], if you need to match a literal pound sign, since an unescaped # starts a comment.
/
[0-9]+ # one or more digits before the decimal point
[#] # literal pound sign
/x # end of string
– Be careful not to include the closing delimiter inside the comments, or it will prematurely terminate the pattern. This pattern ends before you think it does:
/
-? # with / without - <--- OOPS!
[0-9]+ # one or more digits before the decimal point
\.? # an optional decimal point
[0-9]* # some optional digits after the decimal point
/x # end of string
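One consequence of /x that is easy to trip over: a plain space in the pattern no longer matches a space in the string. A minimal sketch with an illustrative test string:

```perl
use strict;
use warnings;

$_ = "fred barney";

# Under /x the space in the pattern is ignored, so this looks for "fredbarney".
my $bare_space = /fred barney/x ? 1 : 0;

# A backslashed space is kept as a literal space.
my $escaped_space = /fred\ barney/x ? 1 : 0;

# \s+ matches the whitespace; the spaces around it are only for readability.
my $shortcut = /fred \s+ barney/x ? 1 : 0;

print "bare=$bare_space escaped=$escaped_space shortcut=$shortcut\n";
```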
Combining Option Modifiers
• If you want to use more than one modifier on the same match, just put them all at the end (their order isn’t significant).
if (/barney.*fred/is) { # both /i and /s
print "That string mentions Fred after Barney!\n";
}
Or as a more expanded version with comments:
if (m{
barney # the little guy
.* # anything in between
fred # the loud guy
}isx) { # all three of /s and /i and /x
print "That string mentions Fred after Barney!\n";
}
Misc
• The trick with a good pattern is to not match more than you ever mean to match.