REGEX
Problems
• Have a big text file, want to extract data
  – Phone numbers
    • 1-503-123-1234
    • 503-123-1234
    • (503) 123-1234
    • 123-1234
    • 503.123.1234
Regular Expressions
• Regular Expressions
  – Format for specifying patterns
• Pattern consists of
  – Literals
  – Ranges
  – Special values
  – Quantity indicators
  – Groupings
Literals
• Characters without special meaning are interpreted literally
1 Look for 1
123 Look for 123
12A Look for 12A
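A minimal C++ sketch of literal matching, assuming the C++11 `<regex>` header. The helper name `contains_literal` is made up for illustration and is not from the slides.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `literal` occurs anywhere inside `text`.
// Only intended for patterns with no special characters, such as 123 or 12A.
bool contains_literal(const std::string& text, const std::string& literal) {
    const std::regex re(literal);
    return std::regex_search(text, re);
}
```

For example, `contains_literal("x12Ay", "12A")` is true, while `contains_literal("1A2", "12A")` is false because the characters must appear in that exact order.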
Ranges
• [ ] enclose a group of options
[123] Look for 1, 2, or 3
[AB] Look for A or B
2[BC] Look for 2 followed by B or C
Ranges
• [a-b] indicates a range
[0-9] Look for 0-9
[1-3] Look for 1-3
[a-zA-Z] Look for lowercase a-z or uppercase A-Z
[0-9A-Z] digit or uppercase letter
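The bracket examples above could be tried out with a sketch like the following, assuming C++11 `std::regex`. Both function names are hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `s` is exactly one digit or uppercase letter.
// regex_match requires the whole string to match the pattern.
bool is_digit_or_upper(const std::string& s) {
    static const std::regex re("[0-9A-Z]");
    return std::regex_match(s, re);
}

// Hypothetical helper: true if `s` contains a 2 followed by B or C.
// regex_search accepts a match anywhere in the string.
bool has_two_then_b_or_c(const std::string& s) {
    static const std::regex re("2[BC]");
    return std::regex_search(s, re);
}
```

Note that a bracket expression stands for exactly one character, so `is_digit_or_upper("77")` is false.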
Ranges
• [^ ] says not any of these
[^123] Look for anything but 1,2,3
AA[^A] Look for 2 A's followed by anything not an A
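A small sketch of the negated set, assuming C++11 `std::regex`. The function name is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `s` contains two A's followed by one
// character that is not an A. [^A] must still consume a character.
bool two_as_then_other(const std::string& s) {
    static const std::regex re("AA[^A]");
    return std::regex_search(s, re);
}
```

Because `[^A]` has to match some character, a string that ends right after the two A's (for example `"AA"`) does not match.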
Special Characters
• . Means any character but newline
A.C Matches ABC, ADC, A_C, A+C…
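The `A.C` example can be checked with a sketch like this one, assuming C++11 `std::regex` with its default ECMAScript grammar. The function name is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `s` is exactly A, any one character, then C.
bool matches_a_any_c(const std::string& s) {
    static const std::regex re("A.C");
    return std::regex_match(s, re);
}
```

The dot always consumes exactly one character, so `"AC"` does not match, and neither does `"A\nC"` since the dot skips newlines.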
Special Characters
• ^ at start of pattern means nothing can come before the match
• $ at end of pattern means nothing else can come after the match
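A sketch contrasting an unanchored and an anchored search, assuming C++11 `std::regex`. Both function names are hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if ABC appears anywhere in `s`.
bool has_abc(const std::string& s) {
    static const std::regex re("ABC");
    return std::regex_search(s, re);
}

// Hypothetical helper: true only if `s` is ABC with nothing before or after.
bool is_only_abc(const std::string& s) {
    static const std::regex re("^ABC$");
    return std::regex_search(s, re);
}
```

With the input `"xxABCxx"`, `has_abc` succeeds but `is_only_abc` fails, because the anchors rule out the surrounding characters.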
Special Characters
• \s any whitespace
  – Tab, space, etc…
• \d any digit
  – Same as [0-9]
• \w any word character
  – Same as [a-zA-Z0-9_]
Special Characters
• \S anything BUT whitespace
• \D anything BUT digit
• \W anything BUT word character
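One way to see what each class matches is to count matches in a sample string. This is a sketch assuming C++11 `std::regex`; the helper `count_matches` is hypothetical. Raw string literals such as `R"(\d)"` are used so the backslash does not have to be doubled in C++ source.

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper: counts non-overlapping matches of `pattern` in `text`.
std::size_t count_matches(const std::string& text, const std::string& pattern) {
    const std::regex re(pattern);
    auto first = std::sregex_iterator(text.begin(), text.end(), re);
    auto last = std::sregex_iterator();  // default-constructed = end of matches
    return static_cast<std::size_t>(std::distance(first, last));
}
```

For the text `"a1 b22"`, `R"(\d)"` counts 3 matches and `R"(\D)"` counts the other 3 characters, showing that the uppercase form is the complement of the lowercase one.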
Quantity Indicators
• {n} Must have n copies of whatever came before
\d{5} Match 5 digits
A{3}B Match 3 A's followed by a B
Quantity Indicators
• {n,m} n to m copies
\d{2,5} Match 2 to 5 digits
• {n,} n or more copies
\d{3,} Match any sequence of 3 or more digits
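The counted forms can be exercised with a sketch like this, assuming C++11 `std::regex`. The helper `matches_whole` is hypothetical; it uses `regex_match`, so the entire string has to fit the pattern.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if the whole of `text` matches `pattern`.
bool matches_whole(const std::string& text, const std::string& pattern) {
    const std::regex re(pattern);
    return std::regex_match(text, re);
}
```

For example, `matches_whole("97201", R"(\d{5})")` is true and `matches_whole("9720", R"(\d{5})")` is false. With `R"(\d{2,5})"`, anything from 2 to 5 digits is accepted but 6 digits is not.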
Quantity Indicators
• ? indicates 0 or 1
• + indicates 1 or more
• * indicates 0 or more
A?B+C* could be:
BBBB, AB, ABBBC, ABCCCCC, B, BCCCC,…
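The listed strings can be checked against `A?B+C*` with a sketch like this, assuming C++11 `std::regex`. The function name is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `s` is an optional A, then one or more B's,
// then zero or more C's, with nothing else.
bool matches_a_b_c(const std::string& s) {
    static const std::regex re("A?B+C*");
    return std::regex_match(s, re);
}
```

Strings such as `"AAB"` (two A's) or `"AC"` (no B) are rejected, since `?` allows at most one A and `+` requires at least one B.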
\
• \ to escape chars
\[ Find a [
\. Find a .
\\ Find a \
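A sketch of escaped characters in C++, assuming C++11 `std::regex`. The function names are hypothetical. In an ordinary C++ string literal the backslash itself must be doubled (`"\\."`), so raw string literals (`R"(\.)"`) are used here to keep the pattern readable.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `text` contains a literal dot.
bool contains_dot(const std::string& text) {
    static const std::regex re(R"(\.)");  // same pattern as "\\."
    return std::regex_search(text, re);
}

// Hypothetical helper: true if `text` contains a literal [.
bool contains_open_bracket(const std::string& text) {
    static const std::regex re(R"(\[)");
    return std::regex_search(text, re);
}

// Hypothetical helper: true if `text` contains a literal backslash.
bool contains_backslash(const std::string& text) {
    static const std::regex re(R"(\\)");
    return std::regex_search(text, re);
}
```

Without the escape, `.` would match any character, so an unescaped pattern would also accept `"5031231234"`.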
Grouping
• ( ) groups sequences
  – Apply options to whole group
  – Can extract each group from results
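Extracting groups from a result might look like the following sketch, assuming C++11 `std::regex` and `std::smatch`. The helper name and the choice of a `123-1234` style pattern are illustrative assumptions.

```cpp
#include <regex>
#include <string>
#include <utility>

// Hypothetical helper: finds the first "ddd-dddd" in `text` and returns
// its two digit groups. Returns two empty strings when nothing matches.
std::pair<std::string, std::string> split_local_number(const std::string& text) {
    static const std::regex re(R"((\d{3})-(\d{4}))");
    std::smatch m;
    if (std::regex_search(text, m, re)) {
        // m[0] is the whole match; m[1] and m[2] are the parenthesized groups.
        return std::make_pair(m[1].str(), m[2].str());
    }
    return std::make_pair(std::string(), std::string());
}
```

Groups are numbered by the position of their opening parenthesis, starting at 1.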
|
• | gives multiple options
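Putting the pieces together, one possible pattern for the phone number formats listed under Problems is sketched below, assuming C++11 `std::regex`. The pattern and the function name are illustrative assumptions, not a definitive phone number validator. The `|` chooses between a parenthesized area code and a plain one.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `s` is exactly one phone number in one of
// the formats from the Problems slide.
//   (1-)?                        optional leading "1-"
//   (\(\d{3}\) |\d{3}[-.])?      optional area code: "(503) " OR "503-" / "503."
//   \d{3}[-.]\d{4}               local number: "123-1234" or "123.1234"
bool looks_like_phone(const std::string& s) {
    static const std::regex re(R"((1-)?(\(\d{3}\) |\d{3}[-.])?\d{3}[-.]\d{4})");
    return std::regex_match(s, re);
}
```

This sketch is deliberately loose: it would also accept mixed separators such as `503-123.1234`, which may or may not be wanted.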
In C++
• Part of C++11
  – Only partially implemented in GCC before version 4.9
  – Also available in the Boost.Xpressive library
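A minimal sketch of pulling every match out of a block of text with the standard `<regex>` header, assuming a compiler with complete C++11 support (for GCC, 4.9 or later). The helper `find_all` is hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns every non-overlapping match of `pattern`
// found in `text`, in order of appearance.
std::vector<std::string> find_all(const std::string& text, const std::string& pattern) {
    const std::regex re(pattern);
    std::vector<std::string> found;
    // A default-constructed sregex_iterator marks the end of the matches.
    for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it) {
        found.push_back(it->str());
    }
    return found;
}
```

For a large file, the same loop could be run on each line read with `std::getline`, reusing a single `std::regex` object, since constructing one is comparatively expensive.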