CSCE 121:509-512
Introduction to Program Design and Concepts
J. Michael Moore
Spring 2015
Set 20: Regular Expressions
Text Processing
• “All we know can be represented as text”
  – And often is
• Books, articles
• Transaction logs (email, phone, bank, sales, …)
• Web pages (even the layout instructions)
• Tables of figures (numbers)
• Graphics (vectors)
• Mail
• Programs
• Measurements
• Historical data
• Medical records
• …

Amendment I
Congress shall make no law respecting an establishment of religion, or prohibiting the free exercise thereof; or abridging the freedom of speech, or of the press; or the right of the people peaceably to assemble, and to petition the government for a redress of grievances.
A problem: Read a ZIP code
• U.S. state abbreviation and ZIP code
  – two letters followed by five digits

    string s;
    while (cin >> s) {
        if (s.size() == 7
            && isalpha(s[0]) && isalpha(s[1])    // two letters
            && isdigit(s[2]) && isdigit(s[3]) && isdigit(s[4])
            && isdigit(s[5]) && isdigit(s[6]))   // five digits
            cout << "found " << s << '\n';
    }

• Brittle, messy, unique code
Problems with simple solution
• It’s verbose (4 lines, 8 function calls)
• We miss (intentionally?) every ZIP code number not separated from its context by whitespace
  – "TX77845", TX77845-1234, and ATM77845
• We miss (intentionally?) every ZIP code number with a space between the letters and the digits
  – TX 77845
• We accept (intentionally?) every ZIP code number with the letters in lower case
  – tx77845
• If we decided to look for a postal code in a different format we would have to completely rewrite the code
  – CB3 0DS, DK-8000 Arhus
TX77845-1234
• 1st try: wwddddd
• 2nd (remember -1234): wwddddd-dddd
What’s “special”?
• 3rd: \w\w\d\d\d\d\d-\d\d\d\d
• 4th (make counts explicit): \w2\d5-\d4
• 5th (and “special”): \w{2}\d{5}-\d{4}
But -1234 was optional?
• 6th: \w{2}\d{5}(-\d{4})?
We wanted an optional space after TX
• 7th (invisible space): \w{2} ?\d{5}(-\d{4})?
• 8th (make space visible): \w{2}\s?\d{5}(-\d{4})?
• 9th (lots of space – or none): \w{2}\s*\d{5}(-\d{4})?
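The 9th-try pattern can be tried out with a small sketch like the following. It assumes a compiler with a working C++11 `<regex>`; the helper name `find_zip` is made up for illustration and is not from the slides.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not from the slides): returns the first substring of
// `text` that matches the 9th-try pattern, or an empty string if none does.
std::string find_zip(const std::string& text) {
    // Each regex backslash is doubled because this is an ordinary C++ string.
    static const std::regex pat("\\w{2}\\s*\\d{5}(-\\d{4})?");
    std::smatch m;
    return std::regex_search(text, m, pat) ? m.str(0) : std::string();
}
```

For example, `find_zip("TX 77845-1234")` returns the whole text, while `find_zip("ATM77845")` returns `TM77845`: \w{2} accepts any two word characters, so the pattern still finds a match inside ATM77845.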
Delimiters and Special Characters
• C++ strings mark beginning and end with "
  – What if we actually want to use the " symbol?
  – Escape it: put \ before it
    • So use \"
• What if we actually want to use the \ symbol?
  – Use \\
• Regular expressions have their own rules
  – Put a regular expression into a C++ string?
  – Combine the rules…
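A minimal sketch of how the two sets of rules combine, assuming C++11. The variable names are illustrative only. The raw string literal is an alternative that these slides do not cover; it is shown here only for comparison.

```cpp
#include <string>

// The same ZIP pattern written two ways.
// Ordinary literal: every regex backslash is doubled for the C++ compiler.
const std::string escaped = "\\w{2}\\s*\\d{5}(-\\d{4})?";
// Raw literal (C++11): characters between R"( and )" are taken verbatim.
const std::string raw = R"(\w{2}\s*\d{5}(-\d{4})?)";

// A literal double quote followed by a literal backslash: two characters.
const std::string quote_and_slash = "\"\\";
```

Both `escaped` and `raw` hold exactly the same characters, so either can be passed to a regex constructor.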
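Standard Library Support

• Regex has been part of the C++ standard library since C++11 (header <regex>)
• Some older compilers ship an incomplete <regex>; there, use an external library, e.g. Boost
• Usage varies slightly… check documentation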
    #include <iostream>
    #include <string>
    #include <fstream>
    #include <regex>    // regex, smatch, regex_search (C++11)
    using namespace std;

    int main()
    {
        ifstream in("file.txt");    // input file
        if (!in) cerr << "no file\n";

        regex pat("\\w{2}\\s*\\d{5}(-\\d{4})?");    // ZIP pattern
        // cout << "pattern: " << pat << '\n';    // printing of patterns is not C++11

        // …
    }
    int lineno = 0;
    string line;    // input buffer
    while (getline(in, line)) {
        ++lineno;
        smatch matches;    // matched strings go here
        if (regex_search(line, matches, pat)) {
            cout << lineno << ": " << matches[0] << '\n';    // whole match
            if (1 < matches.size() && matches[1].matched)
                cout << "\t: " << matches[1] << '\n';    // sub-match
        }
    }
Results
Input:
    address TX77845
    ffff tx 77843 asasasaa
    ggg TX3456-23456
    howdy
    zzz TX23456-3456sss ggg TX33456-1234
    cvzcv TX77845-1234 sdsas
    xxxTx77845xxx
    TX12345-123456

Output:
    pattern: "\w{2}\s*\d{5}(-\d{4})?"
    1: TX77845
    2: tx 77843
    5: TX23456-3456
        : -3456
    6: TX77845-1234
        : -1234
    7: Tx77845
    8: TX12345-1234
        : -1234
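The results above can be checked without a file by running the same search loop over in-memory text. This is only a sketch: the helper name `report` and the use of `istringstream` are assumptions made for illustration, and the report lines are returned rather than printed.

```cpp
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: applies the slide's search loop to `text` (one record
// per line) and returns the report lines instead of printing them.
std::vector<std::string> report(const std::string& text) {
    const std::regex pat("\\w{2}\\s*\\d{5}(-\\d{4})?");    // ZIP pattern
    std::vector<std::string> out;
    std::istringstream in(text);
    std::string line;
    int lineno = 0;
    while (std::getline(in, line)) {
        ++lineno;
        std::smatch m;
        if (std::regex_search(line, m, pat)) {
            out.push_back(std::to_string(lineno) + ": " + m.str(0));   // whole match
            if (m[1].matched) out.push_back("\t: " + m.str(1));        // sub-match
        }
    }
    return out;
}
```

Only the first match on each line is reported, which is why line 5 shows TX23456-3456 and not the later TX33456-1234.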
Regular Expression Syntax
• Regular expressions have a thorough theoretical foundation based on state machines
  – You can mess with the syntax, but not much with the semantics
• The syntax is terse, cryptic, boring, useful
  – Go learn it
Characters with Special Meaning
• .     Any single character (a ‘wildcard’)
• [     Character class
• ( )   Begin / end grouping
• \     Next character has special meaning
• |     Alternative (or)
• ^     Start of line; negation (inside a character class)
• $     End of line
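A short sketch for experimenting with the special characters listed above, assuming the default (ECMAScript) grammar of C++11 `<regex>`. The helper name `found` is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if the pattern `re` matches somewhere in `text`.
bool found(const std::string& text, const std::string& re) {
    return std::regex_search(text, std::regex(re));
}
```

For instance, `found("cat", "c.t")` is true (wildcard), `found("bobcat", "^cat")` is false but `found("bobcat", "cat$")` is true (anchors), `found("dog", "cat|dog")` is true (alternative), and `found("axb", "a\\.b")` is false because the escaped dot matches only a literal dot.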
Repetition
• *       zero or more
• +       one or more
• ?       optional (zero or one)
• {n}     exactly n times
• {n,}    n or more times
• {n,m}   at least n and at most m times
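The repetition counts can be checked with `regex_match`, which requires the whole string to match. A minimal sketch, assuming C++11 `<regex>`; the helper name `whole` is made up for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true only if the entire `text` matches the pattern `re`.
bool whole(const std::string& text, const std::string& re) {
    return std::regex_match(text, std::regex(re));
}
```

So `whole("", "a*")` is true but `whole("", "a+")` is false; `whole("aaa", "a{2,3}")` is true but `whole("aaaa", "a{2,3}")` is false.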
Character Classes
• \d   a decimal digit
• \l   a lowercase letter
• \s   a space (space, tab, etc.)
• \u   an uppercase letter
• \w   a letter (a-z or A-Z) or digit (0-9) or an underscore (_)
Character Classes
• \D   not \d (a decimal digit)
• \L   not \l (a lowercase letter)
• \S   not \s [a space (space, tab, etc.)]
• \U   not \u (an uppercase letter)
• \W   not \w [a letter (a-z or A-Z) or digit (0-9) or an underscore (_)]
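A small sketch that counts how many characters of a string fall into each class, assuming C++11 `<regex>` with its default grammar. The helper name `count_matches` is hypothetical. The letter-case shorthands \l and \u are assumed here not to be accepted by every regex library, so the sketch uses the bracket forms `[[:lower:]]` and `[[:upper:]]` for those two; check your library's documentation.

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper: counts the non-overlapping matches of `re` in `text`.
std::size_t count_matches(const std::string& text, const std::string& re) {
    const std::regex pat(re);
    std::sregex_iterator first(text.begin(), text.end(), pat);
    std::sregex_iterator last;    // default-constructed: end of sequence
    return static_cast<std::size_t>(std::distance(first, last));
}
```

With the text `TX 77845_a`, `count_matches(text, "\\d")` gives 5 and `count_matches(text, "\\D")` gives the other 5 characters; \w counts 9 (letters, digits and the underscore) and \W counts 1 (the space).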
Examples
  – Xa{2,3}                    // Xaa Xaaa
  – Xb{2}                      // Xbb
  – Xc{2,}                     // Xcc Xccc Xcccc Xccccc …
  – \w{2}-\d{4,5}              // \w is letter \d is digit
  – (\d*:)?(\d+)               // 124:1232321 :123 123
  – Subject: (FW:|Re:)?(.*)    // . (dot) matches any character
  – [a-zA-Z][a-zA-Z_0-9]*      // identifier
  – [^aeiouy]                  // not an English vowel
• http://regex101.com
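Two of the examples above, sketched as C++ functions under the assumption of C++11 `<regex>`. The function names are invented for illustration; sub-match 2 of the Subject pattern is the text captured by `(.*)`.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns what follows "Subject: " and an optional
// "FW:" or "Re:" prefix; returns an empty string for non-Subject lines.
std::string subject_text(const std::string& line) {
    static const std::regex pat("Subject: (FW:|Re:)?(.*)");
    std::smatch m;
    if (!std::regex_match(line, m, pat)) return std::string();
    return m.str(2);    // second group: the rest of the line
}

// Hypothetical helper: true if `s` is an identifier as defined by the
// example pattern (a letter, then letters, digits or underscores).
bool is_identifier(const std::string& s) {
    static const std::regex pat("[a-zA-Z][a-zA-Z_0-9]*");
    return std::regex_match(s, pat);
}
```

For example, `subject_text("Subject: Re:hello")` returns `hello`, and `is_identifier("1x")` is false because the first character must be a letter.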
Acknowledgments
• Photo on title slide: http://pixabay.com/en/manuscript-writing-paper-old-729617/
• Slides are based on those for the textbook: http://www.stroustrup.com/Programming/23_text.ppt
• Regex cartoon: https://xkcd.com/208/