Regular Expressionspdm12/cmpsc311-f15/slides/26-regex.pdf · grep command syntax • To find...

23
CMPSC 311: Introduction to Systems Programming Page 1 Institute for Networking and Security Research Department of Computer Science and Engineering Pennsylvania State University, University Park, PA Systems and Internet Infrastructure Security i i Regular Expressions Devin J. Pohly <[email protected]>

Transcript of Regular Expressionspdm12/cmpsc311-f15/slides/26-regex.pdf · grep command syntax • To find...

Page 1: Regular Expressionspdm12/cmpsc311-f15/slides/26-regex.pdf · grep command syntax • To find matches in files: egrep regex file(s) • To filter standard input: egrep regex ‣ where

CMPSC 311: Introduction to Systems Programming Page 1

Institute for Networking and Security ResearchDepartment of Computer Science and EngineeringPennsylvania State University, University Park, PA

Systems and Internet Infrastructure Security

i

i

Regular Expressions

Devin J. Pohly <[email protected]>

Page 2: Regular Expressionspdm12/cmpsc311-f15/slides/26-regex.pdf · grep command syntax • To find matches in files: egrep regex file(s) • To filter standard input: egrep regex ‣ where

Page 2CMPSC 311: Introduction to Systems Programming

Regular expressions

• Often shortened to “regexp” or “regex”

• Regular expressions are a language for matching patterns‣ Powerful find and replace

tool‣ Can be used on the CLI,

in shell scripts, as a text editor feature, or as part of a program

Page 3: Regular Expressionspdm12/cmpsc311-f15/slides/26-regex.pdf · grep command syntax • To find matches in files: egrep regex file(s) • To filter standard input: egrep regex ‣ where

Page 3CMPSC 311: Introduction to Systems Programming

What are they good for?

• Searching for specifically formatted text‣ Email address‣ Phone number‣ Anything that follows a

pattern

• Validating input‣ Same idea

• Powerful find-and-replace‣ E.g. change “X and Y” to

“Y and X” for any X, Y

Page 4: Regular Expressionspdm12/cmpsc311-f15/slides/26-regex.pdf · grep command syntax • To find matches in files: egrep regex file(s) • To filter standard input: egrep regex ‣ where

Page 4CMPSC 311: Introduction to Systems Programming

Regex flavors

• Many languages support regular expressions‣ Perl‣ JavaScript‣ Python‣ PHP‣ Java, Ruby, .NET, etc.

• Today we will be learning standard Unix “extended regular expressions”

Page 5: Regular Expressionspdm12/cmpsc311-f15/slides/26-regex.pdf · grep command syntax • To find matches in files: egrep regex file(s) • To filter standard input: egrep regex ‣ where

Page 5CMPSC 311: Introduction to Systems Programming

On the command line

• The grep command is a regex filter‣ That’s what the “re” in the

middle stands for‣ We have seen fgrep,

which looks for literal strings

• Today we will use egrep‣ E for “extended” regular

expressions‣ Very close to other

languages’ flavors

Page 6: Regular Expressionspdm12/cmpsc311-f15/slides/26-regex.pdf · grep command syntax • To find matches in files: egrep regex file(s) • To filter standard input: egrep regex ‣ where

Page 6CMPSC 311: Introduction to Systems Programming

grep command syntax• To find matches in files:

egrep regex file(s)

• To filter standard input:egrep regex

‣ where regex is a regular expression, and file(s) are the files to search

• Options (aka “flags”):‣ -i: ignore case

‣ -v: find non-matching lines

‣ -r: search entire directories

‣ man grep for more

Page 7: Regular Expressionspdm12/cmpsc311-f15/slides/26-regex.pdf · grep command syntax • To find matches in files: egrep regex file(s) • To filter standard input: egrep regex ‣ where

Page 7CMPSC 311: Introduction to Systems Programming

Okay, let’s begin!

$ cd /usr/share/dict$ egrep hello words...$ cat words | egrep hello...

Page 8: Regular Expressionspdm12/cmpsc311-f15/slides/26-regex.pdf · grep command syntax • To find matches in files: egrep regex file(s) • To filter standard input: egrep regex ‣ where

Page 8CMPSC 311: Introduction to Systems Programming

First lesson

• Letters, numbers, and a few other things match literally‣ Find all the words that

contain “fgh”‣ Find all the words that

contain “lmn”

• Notice: this will match anywhere in the string‣ Not necessarily the whole

string

Page 9: Regular Expressionspdm12/cmpsc311-f15/slides/26-regex.pdf · grep command syntax • To find matches in files: egrep regex file(s) • To filter standard input: egrep regex ‣ where

Page 9CMPSC 311: Introduction to Systems Programming

Anchors• Caret ^ matches at the

beginning of a line

• Dollar sign $ matches at the end of a line‣ Use '…' to protect

characters from the shell!

• Try it‣ Find words ending in “gry”‣ Find words starting with “ah”

• What happens if we use both?

Page 10: Regular Expressionspdm12/cmpsc311-f15/slides/26-regex.pdf · grep command syntax • To find matches in files: egrep regex file(s) • To filter standard input: egrep regex ‣ where

Page 10CMPSC 311: Introduction to Systems Programming

Single-character wildcard

• Dot . matches any single character (exactly one)‣ Find a 6-letter word

where the second, fourth, and sixth letters are “o”

‣ Find any words that are at least 23 characters long

Page 11: Regular Expressionspdm12/cmpsc311-f15/slides/26-regex.pdf · grep command syntax • To find matches in files: egrep regex file(s) • To filter standard input: egrep regex ‣ where

Page 11CMPSC 311: Introduction to Systems Programming

Multi-character wildcard

• Dot-star .* will match 0 or more characters‣ We’ll see why on the

next slide‣ Find all the words that

contain a, e, i, o, u in that order (with anything in between)

‣ How about u, o, i, e, a?

Page 12: Regular Expressionspdm12/cmpsc311-f15/slides/26-regex.pdf · grep command syntax • To find matches in files: egrep regex file(s) • To filter standard input: egrep regex ‣ where

Page 12CMPSC 311: Introduction to Systems Programming

Quantifiers• How many repetitions of

the previous thing to match?‣ Star *: 0 or more

‣ Plus +: at least 1

‣ Question mark ?: 0 or 1 (i.e., optional)

• Try it out‣ Spell check: necc?ess?ary‣ Global awareness: colou?r‣ Find words with u, o, i, e, a in

that order and at least one letter in between each

Page 13: Regular Expressionspdm12/cmpsc311-f15/slides/26-regex.pdf · grep command syntax • To find matches in files: egrep regex file(s) • To filter standard input: egrep regex ‣ where

Page 13CMPSC 311: Introduction to Systems Programming

Careful!

• What happens if you search for the empty string?‣ Use '' to give the shell

an empty argument

• Now, what happens if you search for z*?‣ Why?

• Make sure your regex always tries to match something!

Page 14: Regular Expressionspdm12/cmpsc311-f15/slides/26-regex.pdf · grep command syntax • To find matches in files: egrep regex file(s) • To filter standard input: egrep regex ‣ where

Page 14CMPSC 311: Introduction to Systems Programming

Character classes• Square brackets [abc] will

match exactly one of the enclosed characters‣ What will [chs]andy

match?‣ You can use quantifiers on

character classes◾ Find words starting with b

where all the rest of the letters are a, n, or s

◾ How many words can you type with ASDFJKL?

◾ How many words can you type with AOEUHTNS?

Page 15: Regular Expressionspdm12/cmpsc311-f15/slides/26-regex.pdf · grep command syntax • To find matches in files: egrep regex file(s) • To filter standard input: egrep regex ‣ where

Page 15CMPSC 311: Introduction to Systems Programming

• Part of character classes

• You can specify a range of characters with [a-j]‣ One hex digit: [0-9a-f]‣ Consonants:[b-df-hj-np-tv-z]

‣ Find all the words you can make with A through E◾ … that are at least 5

letters long (hint: pipe the output!)

Ranges

Page 16: Regular Expressionspdm12/cmpsc311-f15/slides/26-regex.pdf · grep command syntax • To find matches in files: egrep regex file(s) • To filter standard input: egrep regex ‣ where

Page 16CMPSC 311: Introduction to Systems Programming

Negative character classes

• If the first character is a caret, matches anything except these characters‣ Consonants: [^aeiou]‣ Find words that contain a

q, followed by something other than u

‣ Can be combined with ranges◾ Any character that isn’t a

digit: ???

Page 17: Regular Expressionspdm12/cmpsc311-f15/slides/26-regex.pdf · grep command syntax • To find matches in files: egrep regex file(s) • To filter standard input: egrep regex ‣ where

Page 17CMPSC 311: Introduction to Systems Programming

Negative character classes

• If the first character is a caret, matches anything except these characters‣ Consonants: [^aeiou]‣ Find words that contain a

q, followed by something other than u

‣ Can be combined with ranges◾ Any character that isn’t a

digit: [^0-9]

Page 18: Regular Expressionspdm12/cmpsc311-f15/slides/26-regex.pdf · grep command syntax • To find matches in files: egrep regex file(s) • To filter standard input: egrep regex ‣ where

Page 18CMPSC 311: Introduction to Systems Programming

Groups

• Parentheses (…) create groups within a regex‣ Quantifiers operate on

the entire group‣ Find words with an m,

followed by “ach” one or more times, followed by e

‣ Find words where every other character, starting with the first, is an e

Page 19: Regular Expressionspdm12/cmpsc311-f15/slides/26-regex.pdf · grep command syntax • To find matches in files: egrep regex file(s) • To filter standard input: egrep regex ‣ where

Page 19CMPSC 311: Introduction to Systems Programming

Branches

• The vertical bar | denotes that either the left or right side matches‣ It’s the “or” operator‣ Useful inside parentheses

• Guess before you try:‣ book(worm|end)‣ ^(out|lay)+$

Page 20: Regular Expressionspdm12/cmpsc311-f15/slides/26-regex.pdf · grep command syntax • To find matches in files: egrep regex file(s) • To filter standard input: egrep regex ‣ where

Page 20CMPSC 311: Introduction to Systems Programming

Special characters

• We’ve seen a lot already‣ ^$.*+?[]()|\

• Backslash \ will escape a special character to search for it literally‣ For example, you could

search your code for the expression int \* to find integer pointers

Page 21: Regular Expressionspdm12/cmpsc311-f15/slides/26-regex.pdf · grep command syntax • To find matches in files: egrep regex file(s) • To filter standard input: egrep regex ‣ where

Page 21CMPSC 311: Introduction to Systems Programming

Backreferences

• Groups in () can be referred to later‣ Must match exactly the

same characters again‣ Numbered \1, \2, \3

from the start of the regex

‣ Try it: (can)\1‣ Find words that have a

four-character sequence repeated immediately

Page 22: Regular Expressionspdm12/cmpsc311-f15/slides/26-regex.pdf · grep command syntax • To find matches in files: egrep regex file(s) • To filter standard input: egrep regex ‣ where

Page 22CMPSC 311: Introduction to Systems Programming

Substituting: a demo• The sed program has many commands for modifying text

• Most useful is the s///g “substitute” command: regular expression find-and-replace‣ Also available in Vim by typing :%s/regex/replacement/g

• Try it: run this command and type things

$ sed -r 's/([a-z]+) and ([a-z]+)/\2 and \1/g'

Page 23: Regular Expressionspdm12/cmpsc311-f15/slides/26-regex.pdf · grep command syntax • To find matches in files: egrep regex file(s) • To filter standard input: egrep regex ‣ where

Page 23CMPSC 311: Introduction to Systems Programming

Enjoy puzzles?

• regexcrossword.com‣ Great way to practice

your regex-fu‣ Starts with simple tutorial

puzzles and works up