CSCE 121:509-512 Introduction to Program Design and Concepts J. Michael Moore Spring 2015 Set 20:...

18
CSCE 121:509-512 Set 20: Regular Expressions CSCE 121:509-512 Introduction to Program Design and Concepts J. Michael Moore Spring 2015 Set 20: Regular Expressions 1

Transcript of CSCE 121:509-512 Introduction to Program Design and Concepts J. Michael Moore Spring 2015 Set 20:...

Page 1: CSCE 121:509-512 Introduction to Program Design and Concepts J. Michael Moore Spring 2015 Set 20: Regular Expressions 1.

CSCE 121:509-512Introduction to Program Design and ConceptsJ. Michael MooreSpring 2015Set 20: Regular Expressions

1

Page 2: CSCE 121:509-512 Introduction to Program Design and Concepts J. Michael Moore Spring 2015 Set 20: Regular Expressions 1.

CSCE 121:509-512 Set 20: Regular Expressions

2

Text Processing

• “all we know can be represented as text”– And often is

• Books, articles• Transaction logs (email, phone, bank, sales, …)• Web pages (even the layout instructions)• Tables of figures (numbers)• Graphics (vectors)• Mail• Programs• Measurements• Historical data• Medical records• …

Amendment ICongress shall make no law respectingan establishment of religion, or prohibitingthe free exercise thereof; or abridging thefreedom of speech, or of the press; or theright of the people peaceably to assemble,and to petition the government for a redressof grievances.

Page 3: CSCE 121:509-512 Introduction to Program Design and Concepts J. Michael Moore Spring 2015 Set 20: Regular Expressions 1.

CSCE 121:509-512 Set 20: Regular Expressions

3

A problem: Read a ZIP code• U.S. state abbreviation and ZIP code

– two letters followed by five digits string s;while (cin>>s) {

if (s.size()==7&& isletter(s[0]) && isletter(s[1])&& isdigit(s[2]) && isdigit(s[3]) && isdigit(s[4])&& isdigit(s[5]) && isdigit(s[6]))

cout << "found " << s << '\n';}

• Brittle, messy, unique code

Page 4: CSCE 121:509-512 Introduction to Program Design and Concepts J. Michael Moore Spring 2015 Set 20: Regular Expressions 1.

CSCE 121:509-512 Set 20: Regular Expressions

4

Problems with simple solution• It’s verbose (4 lines, 8 function calls)• We miss (intentionally?) every ZIP code number not separated

from its context by whitespace– "TX77845", TX77845-1234, and ATM77845

• We miss (intentionally?) every ZIP code number with a space between the letters and the digits – TX 77845

• We accept (intentionally?) every ZIP code number with the letters in lower case– tx77845

• If we decided to look for a postal code in a different format we would have to completely rewrite the code– CB3 0DS, DK-8000 Arhus

Page 5: CSCE 121:509-512 Introduction to Program Design and Concepts J. Michael Moore Spring 2015 Set 20: Regular Expressions 1.

CSCE 121:509-512 Set 20: Regular Expressions

5

TX77845-1234• 1st try:

wwddddd• 2nd (remember -1234): wwddddd-

ddddWhat’s “special”?• 3rd:

\w\w\d\d\d\d\d-\d\d\d\d• 4th (make counts explicit): \w2\d5-\d4• 5th (and “special”): \w{2}\

d{5}-\d{4}But -1234 was optional?• 6th:

\w{2}\d{5}(-\d{4})?We wanted an optional space after TX• 7th (invisible space): \w{2} ?\

d{5}(-\d{4})?• 8th (make space visible): \w{2}\s?\d{5}(-\

d{4})?• 9th (lots of space – or none): \w{2}\s*\d{5}(-\

d{4})?

Page 6: CSCE 121:509-512 Introduction to Program Design and Concepts J. Michael Moore Spring 2015 Set 20: Regular Expressions 1.

CSCE 121:509-512 Set 20: Regular Expressions

6

Delimiters and Special Characters

• C++ strings mark beginning and end with "– What if we actually want to use the " symbol?– Escape… use \ before

• So use \" • What if we actually want to use the \ symbol?

– Use \\

• Regular Expressions has it’s own rules– Put a regular expression in to a c++ string??– Combine rules…

Page 7: CSCE 121:509-512 Introduction to Program Design and Concepts J. Michael Moore Spring 2015 Set 20: Regular Expressions 1.

CSCE 121:509-512 Set 20: Regular Expressions

7

Not standard C++

• Regex is not part of c++ standard… yet• Use external Libraries, e.g. Boost• Usage varies slightly… check documentation

Page 8: CSCE 121:509-512 Introduction to Program Design and Concepts J. Michael Moore Spring 2015 Set 20: Regular Expressions 1.

CSCE 121:509-512 Set 20: Regular Expressions

8

#include <iostream>#include <string>#include <fstream>using namespace std; int main(){

ifstream in("file.txt"); // input fileif (!in) cerr << "no file\n";

regex pat ("\\w{2}\\s*\\d{5}(-\\d{4})?");// ZIP pattern// cout << "pattern: " << pat << '\n'; // printing of patterns is not C++11

 // …

}

Page 9: CSCE 121:509-512 Introduction to Program Design and Concepts J. Michael Moore Spring 2015 Set 20: Regular Expressions 1.

CSCE 121:509-512 Set 20: Regular Expressions

9

int lineno = 0;string line; // input bufferwhile (getline(in,line)) {

++lineno;smatch matches; // matched strings go hereif (regex_search(line, matches, pat)) {

cout << lineno << ": " << matches[0] << '\n'; // whole match

if (1<matches.size() && matches[1].matched)cout << "\t: " << matches[1] << '\n‘;

// sub-match}

}

Page 10: CSCE 121:509-512 Introduction to Program Design and Concepts J. Michael Moore Spring 2015 Set 20: Regular Expressions 1.

CSCE 121:509-512 Set 20: Regular Expressions

10

ResultsInput: address TX77845

ffff tx 77843 asasasaaggg TX3456-23456howdyzzz TX23456-3456sss ggg TX33456-1234cvzcv TX77845-1234 sdsasxxxTx77845xxxTX12345-123456

 Output: pattern: "\w{2}\s*\d{5}(-\d{4})?"

1: TX778452: tx 778435: TX23456-3456

: -34566: TX77845-1234

: -12347: Tx778458: TX12345-1234

: -1234

Page 11: CSCE 121:509-512 Introduction to Program Design and Concepts J. Michael Moore Spring 2015 Set 20: Regular Expressions 1.

CSCE 121:509-512 Set 20: Regular Expressions

11

Regular Expression Syntax• Regular expressions have a thorough theoretical

foundation based on state machines– You can mess with the syntax, but not much with the semantics

• The syntax is terse, cryptic, boring, useful– Go learn it

Page 12: CSCE 121:509-512 Introduction to Program Design and Concepts J. Michael Moore Spring 2015 Set 20: Regular Expressions 1.

CSCE 121:509-512 Set 20: Regular Expressions

12

Characters with Special Meaning

• . Any single character (a ‘wildcard’)• [ Character class• ( ) Begin / end grouping• \ Next character has special meaning• | Alternative (or)• ^ start of line; negation• $ end of line

Page 13: CSCE 121:509-512 Introduction to Program Design and Concepts J. Michael Moore Spring 2015 Set 20: Regular Expressions 1.

CSCE 121:509-512 Set 20: Regular Expressions

13

Repetition

• * zero or more• + one or more• ? Optional (zero or one)• {n} exactly n times• {n,} n or more times• {n,m}at least n and at most m times

Page 14: CSCE 121:509-512 Introduction to Program Design and Concepts J. Michael Moore Spring 2015 Set 20: Regular Expressions 1.

CSCE 121:509-512 Set 20: Regular Expressions

14

Character Classes

• \d a decimal digit• \l a lowercase letter• \s a space (space, tab, etc.)• \u a uppercase character• \w a letter (a-z or A-Z) or digit (0-9) or

an underscore (_)

Page 15: CSCE 121:509-512 Introduction to Program Design and Concepts J. Michael Moore Spring 2015 Set 20: Regular Expressions 1.

CSCE 121:509-512 Set 20: Regular Expressions

15

Character Classes

• \D not \d (a decimal digit)• \L not \l (a lowercase letter)• \S not \s [a space (space, tab, etc.)]• \U not \u (a uppercase character)• \W not \w [a letter (a-z or A-Z) or digit

(0-9) or an underscore (_)]

Page 16: CSCE 121:509-512 Introduction to Program Design and Concepts J. Michael Moore Spring 2015 Set 20: Regular Expressions 1.

CSCE 121:509-512 Set 20: Regular Expressions

16

Examples– Xa{2,3} // Xaa Xaaa

– Xb{2} // Xbb

– Xc{2,} // Xcc Xccc Xcccc Xccccc …

– \w{2}-\d{4,5} // \w is letter \d is digit

– (\d*:)?(\d+) // 124:1232321 :123 123

– Subject: (FW:|Re:)?(.*)// . (dot) matches any character

– [a-zA-Z] [a-zA-Z_0-9]* // identifier

– [^aeiouy] // not an English vowel

• http://regex101.com

Page 17: CSCE 121:509-512 Introduction to Program Design and Concepts J. Michael Moore Spring 2015 Set 20: Regular Expressions 1.

CSCE 121:509-512 Set 20: Regular Expressions

17

https://xkcd.com/208/

Page 18: CSCE 121:509-512 Introduction to Program Design and Concepts J. Michael Moore Spring 2015 Set 20: Regular Expressions 1.

CSCE 121:509-512 Set 20: Regular Expressions

Acknowledgments

• Photo on title slide: http://pixabay.com/en/manuscript-writing-paper-old-729617/

• Slides are based on those for the textbook: http://www.stroustrup.com/Programming/23_text.ppt

• Regex cartoon: https://xkcd.com/208/

18