Post on 09-Jul-2018
/usr/share/dict/wordsIntroducing My Favourite Unix File
235,886 English words in alphabetical order.
A pattern.Matched against a line.
Did it match at all?
Where, specifically,did it match?
Can we change the part that matched?
Theoretical Computer ScienceFormal Language and Automata Theory
1950s• Stephen Cole Kleene – Foundations of Recursion Theory
– Mathematical notation of “regular sets”
– Kleene star - unary operation on a set of strings, known
in mathematics as the “free monoid construction.”
– Application of the Kleene star to a set V is written as V*
QED
Ordinary character Matches itself
^ Beginning of line
$ End of line
. Any character
[string] Any character in the string
[^string] Any character not in the string
* Zero or more occurrences
a|b Either of the expressions
(expr) Grouping
UNIX and the “ed” editor
• Hey, I’ll show you.
• You should know this editor. Some day you’ll need it.
Kernighan and Pike, 1984Early UNIX Regular Expressions
c Any non-special character matches itself\c Turn off any special meaning of c^ Beginning of line$ End of line. Any single character
[...] Any one of the characters in the range[^...] Any one character not in the range\(r\) Tagged regular expression (grep only)\x What the x’th \(expression\) matched (grep only)r* Zero or more occurrences of rr+ One or more occurrences of r (egrep only)r? Zero or one occurrences of r
r1r2 r1 followed by r2
r1|r2 r1 or r2 (egrep only)
Early Unix - THREE grep programs
grep• Early. Based on “ed”
egrep• Extended Grep. Fancier regular expressions. Runs faster, but starts slower.
fgrep• “Fixed Grep”. No patterns, static strings only. Very efficient.
• fgrep -f fileOfWordsICannotSpell myDocument
“The distinction is hard to justify”.
Surely we don’t have three grep programs any more. It’s 2012.
Today’s UNIX. Much better.
$ ls -l /usr/bin/*grep-rwxr-xr-x 6 root wheel 29664 23 Jul 20:58 /usr/bin/bzegrep-rwxr-xr-x 6 root wheel 29664 23 Jul 20:58 /usr/bin/bzfgrep-rwxr-xr-x 6 root wheel 29664 23 Jul 20:58 /usr/bin/bzgrep-rwxr-xr-x 3 root wheel 29664 23 Jul 20:57 /usr/bin/egrep-rwxr-xr-x 3 root wheel 29664 23 Jul 20:57 /usr/bin/fgrep-rwxr-xr-x 3 root wheel 29664 23 Jul 20:57 /usr/bin/grep-rwxr-xr-x 2 root wheel 25632 23 Jul 20:58 /usr/bin/pgrep-rwxr-xr-x 6 root wheel 29664 23 Jul 20:58 /usr/bin/zegrep-rwxr-xr-x 6 root wheel 29664 23 Jul 20:58 /usr/bin/zfgrep-rwxr-xr-x 6 root wheel 29664 23 Jul 20:58 /usr/bin/zgrep-rwxr-xr-x 1 root wheel 1188 23 Jul 20:58 /usr/bin/zipgrep
Surely we don’t have three grep programs any more. It’s 2012.
Today’s UNIX. Much better.
$ ls -l /usr/bin/*grep-rwxr-xr-x 6 root wheel 29664 23 Jul 20:58 /usr/bin/bzegrep-rwxr-xr-x 6 root wheel 29664 23 Jul 20:58 /usr/bin/bzfgrep-rwxr-xr-x 6 root wheel 29664 23 Jul 20:58 /usr/bin/bzgrep-rwxr-xr-x 3 root wheel 29664 23 Jul 20:57 /usr/bin/egrep-rwxr-xr-x 3 root wheel 29664 23 Jul 20:57 /usr/bin/fgrep-rwxr-xr-x 3 root wheel 29664 23 Jul 20:57 /usr/bin/grep-rwxr-xr-x 2 root wheel 25632 23 Jul 20:58 /usr/bin/pgrep-rwxr-xr-x 6 root wheel 29664 23 Jul 20:58 /usr/bin/zegrep-rwxr-xr-x 6 root wheel 29664 23 Jul 20:58 /usr/bin/zfgrep-rwxr-xr-x 6 root wheel 29664 23 Jul 20:58 /usr/bin/zgrep-rwxr-xr-x 1 root wheel 1188 23 Jul 20:58 /usr/bin/zipgrep
Perl-Compatible Regular Expressions
Inspired by a billion things Perl 5 added to regular expressions• lookahead, lookbehind, lookaround
Used by• Apache
• PHP
• Postfix
• Safari
• ack
A C library used by many other tools
It can be simple. /* match: search for regexp anywhere in text */ int match(char *regexp, char *text) { if (regexp[0] == '^') return matchhere(regexp+1, text); do { /* must look even if string is empty */ if (matchhere(regexp, text)) return 1; } while (*text++ != '\0'); return 0; }
/* matchhere: search for regexp at beginning of text */ int matchhere(char *regexp, char *text) { if (regexp[0] == '\0') return 1; if (regexp[1] == '*') return matchstar(regexp[0], regexp+2, text); if (regexp[0] == '$' && regexp[1] == '\0') return *text == '\0'; if (*text!='\0' && (regexp[0]=='.' || regexp[0]==*text)) return matchhere(regexp+1, text+1); return 0; }
/* matchstar: search for c*regexp at beginning of text */ int matchstar(int c, char *regexp, char *text) { do { /* a * matches zero or more instances */ if (matchhere(regexp, text)) return 1; } while (*text != '\0' && (*text++ == c || c == '.')); return 0; }
c matches any literal character c . matches any single character ^ matches the beginning of the input string $ matches the end of the input string * matches zero or more occurrences of the previous character
$ grep pattern file1 file2 file3 ....About “grep”
Read each file in turn, one line at a time
If the pattern matches anywhere in the line, print the line.
$ grep steve /usr/share/dict/words
$ grep pattern file1 file2 file3 ....About “grep”
Read each file in turn• One line at a time
If the pattern matches anywhere in the line,• print the line.
$ grep steve /usr/share/dict/words
$ grep pattern file1 file2 file3 ....About “grep”
Read each file in turn• One line at a time
If the pattern matches anywhere in the line,• print the line.
$ grep steve /usr/share/dict/words
$ grep pattern file1 file2 file3 ....About “grep”
Read each file in turn• One line at a time
If the pattern matches anywhere in the line,• print the line.
$ grep steve /usr/share/dict/wordsstevedoragestevedorestevedoringstevelstevenTransteverineTrastevereTrasteverine
grep -v: “Print the lines that DON’T match”
Getting fancy with grep
$ grep -v steve /usr/share/dict/wordsAaaaaalaaliiaamAaniaardvarkaardwolfAaron
grep -c: “Just count the matches”
Getting fancy with grep
$ grep -c steve /usr/share/dict/words8
Tip:Remember
this one!
grep -o: “Just print the matching part”
Getting fancy with grep
$ grep -o steve /usr/share/dict/wordsstevestevestevestevestevestevestevesteve
grep -q: “Be quiet.”
Getting fancy with grep
$ grep -q steve /usr/share/dict/words$$ if grep -q steve /usr/share/dict/words; then echo Not a good password; fi Not a good password
grep -C n: “Print some context.”
Getting fancy with grep
$ grep -C 2 steve /usr/share/dict/words
--transsubjectivetranstemporalTransteverinetransthalamictransthoracic----trashytrassTrastevereTrasteverinetrasytraulism
2 lines before, 2 lines after each match.
grep --color: “Colour the matches.”
Getting fancy with grep
$ grep --color steve /usr/share/dict/wordsstevedoragestevedorestevedoringstevelstevenTransteverineTrastevereTrasteverine
grep -i: “Case-insensitive matching.”
Getting fancy with grep
$ grep -i steve /usr/share/dict/wordsStevestevedoragestevedorestevedoringstevelStevenstevenStevensonianStevensonianaTransteverineTrastevereTrasteverine
grep -r “Recursively go through subdirectories”grep -l “Just list the matching file names”
Getting fancy with grep
$ grep -r -l -i Tennis ~/Data/Users/me/Data/Medallists.csv/Users/me/Data/Shakespeare.txt
That you match against text.(Usually against a single line.)
A pattern.
Pattern Matches this Doesn’t match this
stevesteve
stevenEmelio Estevez
stephenSteve
[0123456789] Can$200 Two hundred dollars
. Steve
Beginning or end of the line
Anchors
Pattern Matches this Doesn’t match
^SteSteveStephenStegosaurus
Hello Steve
hen$
Stephenlichenstrengthen
Hello SteveChicken
Repeat the previous pattern
Repetition
Pattern Matches this Meaning
S*Steve“HISSSS” said the snake
|TychoZero or more
S? |Steve|Tycho
Zero or one(Optional)
S+ SteveSSSSnake
One or more
S{3} SSSnakeExactly three
S{3,4} SSSnakeSSSSnake
Three or four
Matches as much as it canNote: * is “greedy”
Suppose you want to match HTML tags.
You have a string
<HTML> <HEAD> <TITLE>My Page</TITLE> </HEAD> </HTML>
You have a pattern <.*>
You can say <.*?> to make it “lazy” instead of “greedy”
Combinations
Pattern Matches lines that....* Anything at all.
^x Start with x
y$ End with y
^x... Start with x;at least 3 more characters
^x.*y$Start with x,anything in the middle,end with y
.......... Have at least 10 characters
^..........$ Have exactly 10 characters
^.{10}$ Have exactly 10 characters
Tip:Remember
this one!
(egrep)Backreferences
References to a (expression) you already matched.
\1 is the first one, \2 is the 2nd, etc.
Example: Find words with double letters
$ egrep ‘(.)\1’ wordsaaaalaaliiaamaardvarkaardwolfabactinallyabaffabaissed
abandonee
sed, vi, perl, XcodeSubstitution
You match something with (grouping) and (more grouping)
You replace it with something else, using \1, \2 to refer to the matched groups
s/pattern/replacement/
perl -p -e 's/(.*):(.*)/\2,\1/;' -- Flip the first two colon-separated fields around
foo:bar => bar,foo
perl -i.bak -p -e 's/(.*):(.*)/\2,\1/;' names.txt
Tycho:SjogrenArek:Dreyer
names.txt
Tycho:SjogrenArek:Dreyer
names.txt.bak
Sjogren,TychoDreyer,Arek
names.txt
Challenge: Find a word witha repeating five letter pattern (like “abcdeabcde”)
$ cd /usr/share/dict$ egrep '(.)\1' words
“Lookaround”Lookahead and Lookbehind
Pattern Means ...
(?=pattern) Zero-width positive lookahead.
(?!pattern) Zero-width negative lookahead.
(?<=pattern) Zero-width positive lookbehind.
(?<!pattern) Zero-width negative lookbehind.
“U” always follows “Q”, right?A Negative Lookahead Example
Negative Lookahead: A “q” not followed by a “u”
q(?!u)
Why not just use q[^u] ?
Didn’t we use that already?
A q followed by a non-u character?
Lookahead and Lookbehind don’t work everywhere.A Problem
Perl 5
grep -P• 10.7 and earlier only
• 10.8 uses a different grep. You can use “perl -e”
ack
Nice tutorial here:• http://www.regular-expressions.info/lookaround.html
Certain tools only (perl)Comments and Free-Spacing
Match a valid date:
^(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])$
Or, with comments and free-spacing
# Match a 20th or 21st century date in yyyy-mm-dd format(19|20)\d\d # year (group 1)[- /.] # separator(0[1-9]|1[012]) # month (group 2)[- /.] # separator(0[1-9]|[12][0-9]|3[01]) # day (group 3)
www.regular-expressions.info
Matching an Email Address
It’s easy, right?
Everybody is someString@someDomain.com, right?
\s+@\s+\.\s+
Can you match this?
steve@example.com
steve@[192.0.43.10]
“Steve” <steve@example.com>
steve@example.com (Steve (Musical Dictator))
steve+junkmail@example.com
Steve\ Lastname@example.com
“steve”@example.com
steve\@lastname@example.com
Doing it Right Is Hard
Recommended on stackoverflow.com ...
^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4})$
perl - Mail::RFC822::ADDRESS(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*))*)?;\s*)
perl - Mail::RFC822::ADDRESS
(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*))*)?;\s*)
So how do you really do it?
Ask what their address is.
Maybe look for an “@”
Send a test message.
If it gets through, it’s valid.
betterthangrep.com - based on PRCEConsider installing “ack”
Use --passthru to print all lines, even those not matching
Surprise!grep changed in Mountain Lion
GNU Grep in 10.7
BSD Grep in 10.8– No more “--use-perl” for PCRE
Use perl itself - or go get “ack”.• www.betterthangrep.com
Beyond Regular ExpressionsFor Objective-C Programmers
NSRegularExpression
NSDataDetector (iOS 4, OS X 10.7)
• Dates
• Addresses
• Links
• Phone Numbers
• Transit Information
NSLinguisticTagger
NSLinguisticTagger
Find ...• Word and sentence boundaries
• Lexical classes– noun, verb, adjective
• Lemmas– Root forms of words
• Named Entities– Personal names, place names, organization names
In ActionNSLinguisticTagger
NSLinguisticTagger *tagger = [[NSLinguisticTagger alloc] initWithTagSchemes: [NSArray arrayWithObjects: NSLinguisticTagSchemeTokenType, NSLinguisticTagSchemeLexicalClass, NSLinguisticTagSchemeNameType, NSLinguisticTagSchemeNameTypeOrLexicalClass, NSLinguisticTagSchemeLemma, nil] options:0];[tagger setString:string];
In ActionNSLinguisticTagger
[tagger enumerateTagsInRange:range scheme:NSLinguisticTagSchemeLexicalClass options:NSLinguisticTaggerOmitWhitespace usingBlock:^(NSString *tag, NSRange tokenRange, NSRange sentenceRange, BOOL *stop){
if (tag == NSLinguisticTagNoun) // we have found a noun...;
Safe Tweets In SchoolAn Example
Prompt the user for a line of text
Use NSLinguisticTagger to identify the nouns
Remove everything else.