Source: faculty.cse.tamu.edu/.../RegularExpressionsNotes.docx
Regular Expressions
Why use Regular Expressions
Pull or filter data from larger files
Validation
o HTML forms
o GUI forms
Most languages support Reg Ex
o C++/Java/Python/Bash/CSH
Regular expressions (REs)
Scanners are based on regular expressions that define simple patterns
o Simpler and less expressive than BNF
o Use some of the same notation as EBNF
Basic operations are set union, concatenation, and Kleene closure
o Plus: parentheses, naming patterns
No recursion!
Why use the extras?
o being able to name patterns is just syntactic sugar
o using parentheses to group things is just syntactic sugar, provided we specify the precedence and associativity of the operators (i.e., |, * and "concat")
Syntactic sugar refers to syntax within a programming language that is designed to make things easier to read or to express.
A regular language is a language that can be defined by a regular expression
http://youtu.be/394NxYBDaiA (about 11 minutes)
o does a great job of explaining much of what is below
Great training website!!
o https://regexone.com/
Basic Regular Expression Notes
Syntax     Meaning                                            Example          Matched                                   Not matched
.          Any single non-null character                      Sh.t             Shot, Shut, etc.                          Sht, Shoot
a          This particular character alone                    a                a                                         any character other than a
ab         These particular characters, joined                tha.             that, than, thal, thay                    any other joined characters than ab
a|b        Or                                                 demo|example     demo, example                             c, ab, ba, aa
*          Zero or more times                                 go*gle           ggle, gogle, google, gooooogle            gooogoogle
[abc]      Any one of these single characters                 tha[nt]          than, that                                tha, thant
[a-d]      Any one of the single characters in this range     so[b-f]          sob, soc, sod, soe, sof                   so, sobb, soy
[^abc]     None of these characters (notice ^ leads off)
[^a-d]     Not a character within this range                  so[^b-f]         soa, sog, soh, sot, sos                   sob, soc, sod, soe, sof
           (notice ^ leads off)
^          Starts with (notice NOT within [grouping])         ^The             These, The, Theatre, Theta                these, Tomas, Darn
$          Ends with                                          ton$             cotton, Clinton, ton, Scranton, Easton    jerk, certain
?          Zero or one (needs a value in front)               (dos)?e          dose, e                                   nose, doe
                                                              doss?e           dosse, dose                               dossse, doddoss, dosss
           (in doss?e, the s in front of ? is targeted)
+          One or more (needs a value in front)               (dos)+e
                                                              doss+e
           (in doss+e, the s in front of + is targeted; + is the same as {1,} below, but less resources)
{n}        Exactly n times (needs a value in front)           w{3}             www                                       ww, w, wwww
                                                              (nag){3} = ???
{n,m}      From n to m times (needs a value in front)         (blah){3,5}      blahblahblah, blahblahblahblah            blah, blahblahblah blahblahblah
{n,}       At least n times (needs a value in front)
[]         Group
\          Escape
\s         White space
\S         Non-white space
\d         Digit character
\D         Non-digit character
\w         Word
\W         Non-word (punctuation, spaces)
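A few rows of the table can be checked mechanically. The sketch below is only an illustration in Python, assuming whole-string matching (re.fullmatch); the rows list and the check helper are hypothetical names introduced here, not part of the original notes.

```python
import re

# (pattern, strings expected to match, strings expected NOT to match),
# copied from a few rows of the notes table
rows = [
    (r"Sh.t",        ["Shot", "Shut"],       ["Sht", "Shoot"]),
    (r"tha[nt]",     ["than", "that"],       ["tha", "thant"]),
    (r"so[b-f]",     ["sob", "soc", "sof"],  ["so", "sobb", "soy"]),
    (r"so[^b-f]",    ["soa", "sog", "sot"],  ["sob", "sof"]),
    (r"w{3}",        ["www"],                ["ww", "w", "wwww"]),
    (r"(blah){3,5}", ["blahblahblah"],       ["blah"]),
]

def check(pattern, good, bad):
    """True when every 'good' string fully matches and no 'bad' string does."""
    all_good_match = all(re.fullmatch(pattern, s) for s in good)
    no_bad_match = not any(re.fullmatch(pattern, s) for s in bad)
    return all_good_match and no_bad_match

results = {pattern: check(pattern, good, bad) for pattern, good, bad in rows}
for pattern, passed in results.items():
    print(pattern, "OK" if passed else "MISMATCH")
```

Adding a row for any other entry in the table is a quick way to test your own understanding of that syntax.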
Simple Union and Concatenation
thankfully nothing special, but order matters
Concatenation Example 1
A={grand, ε}, B={father, mother}
What is AB? (an element of A is then followed by an element of B)
AB={father, mother, grandfather, grandmother}
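The AB example can be reproduced with ordinary Python sets. This is a minimal sketch, assuming the empty string "" stands in for ε; the concat helper is a hypothetical name used only for illustration.

```python
A = {"grand", ""}           # "" plays the role of ε (the empty string)
B = {"father", "mother"}

def concat(left, right):
    """Concatenation of two languages: each x from left followed by each y from right."""
    return {x + y for x in left for y in right}

AB = concat(A, B)
BA = concat(B, A)
print(sorted(AB))  # → ['father', 'grandfather', 'grandmother', 'mother']
print(sorted(BA))  # → ['father', 'fathergrand', 'mother', 'mothergrand']
```

Comparing AB with BA shows why order matters: the two sets are different.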
RE operators "+", "ε", "." and "?"
The operator "?" means ZERO or ONE!! (Optional)
o This is different than *, which is 0 or MANY
? operator: ab?c (zero or one b)
Epsilon ε
o Sometimes we'd like a token that represents nothing
o This makes regular expression matching more complex, but can be useful
The + operator is commonly used to mean "one or more repetitions" of a pattern
o We can always do without this
letter+ = letter letter*
o So the + operator is just syntactic sugar
+ operator: ab+c (one or more b)
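The difference between ? and +, and the claim that + is only syntactic sugar, can be tried out in Python. This is a small sketch assuming whole-string matching; the variable names are illustrative only.

```python
import re

optional = re.compile(r"ab?c")    # zero or one b
one_plus = re.compile(r"ab+c")    # one or more b
expanded = re.compile(r"abb*c")   # b+ rewritten as b b* (no + needed)

samples = ["ac", "abc", "abbc", "abbbc"]
for s in samples:
    print(s,
          bool(optional.fullmatch(s)),
          bool(one_plus.fullmatch(s)),
          bool(expanded.fullmatch(s)))
```

On every sample the last two columns agree, which is the letter+ = letter letter* identity applied to b.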
The dot "." in Reg. Expressions
o matches a single character, without caring what that character is
o the dot CANNOT be epsilon
. operator: a.b*c
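A quick way to see that the dot must consume exactly one character is to test a.b*c against a few strings. This is a sketch in Python with whole-string matching; the sample strings are made up for illustration.

```python
import re

dot = re.compile(r"a.b*c")

samples = ["axc", "abc", "axbbbc", "ac", "axxc"]
verdicts = {s: bool(dot.fullmatch(s)) for s in samples}
for s, matched in verdicts.items():
    print(s, matched)
```

"ac" fails because the dot cannot stand for nothing (it cannot be epsilon), and "axxc" fails because the dot stands for only one character.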
Regular Expression Edge Values
edge values in an FA (Finite Automaton) can be of varying values and setup
for what we are doing, each edge will contain ONE value
Simplifying RE Edges for Reg. Exp. Understanding
[Figure: edge diagrams for loop, space, string match, wildcard (? and .), ε (epsilon), and either/or]
Difference in Grouping
there is a big difference in grouping in Reg. Ex.
o grouping options
   ( ) all together, whatever is within the ( )s
   [ ] select only ONE from whatever is within the [ ]s
Grouping Differences
(ab)     [ab]
(ab)*    [ab]*
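The grouping difference shows up clearly once * is applied. The following Python sketch (whole-string matching, illustrative sample strings) compares (ab)* with [ab]*.

```python
import re

group = re.compile(r"(ab)*")       # the pair "ab", repeated as a unit
char_class = re.compile(r"[ab]*")  # any mix of single a's and b's

samples = ["", "ab", "abab", "ba", "aab", "bbb"]
group_hits = [s for s in samples if group.fullmatch(s)]
class_hits = [s for s in samples if char_class.fullmatch(s)]
print("(ab)* matches:", group_hits)
print("[ab]* matches:", class_hits)
```

Every string that (ab)* accepts is also accepted by [ab]*, but not the other way around.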
Kleene Example 1
A={grand, ε}, B={father, mother}
What is A*B ?????
A*B={father, mother, grandfather, grandmother, grandgrandfather, …}
Kleene Example 2
(a | b | c)* = {ε, "a", "b", "c", "aa", "ab", ..., "bccabb", ...}
[a-c]* = …
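Because a Kleene closure is infinite, code can only list a finite slice of it. The sketch below enumerates the strings of (a | b | c)* up to a chosen length and cross-checks them against [a-c]*; kleene_upto and max_len are hypothetical names introduced for this illustration.

```python
import re
from itertools import product

def kleene_upto(alphabet, max_len):
    """All strings over `alphabet` with length 0..max_len (a finite slice of alphabet*)."""
    words = []
    for n in range(max_len + 1):
        for letters in product(sorted(alphabet), repeat=n):
            words.append("".join(letters))
    return words

words = kleene_upto({"a", "b", "c"}, 2)
print(words)   # "" (epsilon) comes first, then a, b, c, aa, ab, ...

# Every generated word is accepted by the character-class form of the same language
all_accepted = all(re.fullmatch(r"[a-c]*", w) for w in words)

# Kleene Example 1 as a regular expression: zero or more "grand", then father/mother
family = re.compile(r"(grand)*(father|mother)")
print(bool(family.fullmatch("grandgrandfather")))
```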
Order of operations
this is important, can really switch things up
Precedence of operators (highest to lowest)
   ( )s
   * +
   Concatenation
   |
All the operators are left associative
Example
(A) | ((B)* (C)) is equivalent to A | B * C
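The precedence rules can be confirmed by comparing a pattern with its fully parenthesized form. This is a Python sketch using single letters a, b, c in place of A, B, C; the third pattern is added only to show what a different grouping would do.

```python
import re

implicit = re.compile(r"a|b*c")          # relies on precedence
explicit = re.compile(r"(a)|((b)*(c))")  # the same thing, spelled out
regrouped = re.compile(r"(a|b)*c")       # a different expression entirely

samples = ["a", "c", "bc", "bbbc", "ac", "abc", "ab"]
for s in samples:
    print(s,
          bool(implicit.fullmatch(s)),
          bool(explicit.fullmatch(s)),
          bool(regrouped.fullmatch(s)))
```

The first two columns always agree; the third differs on strings such as "a" and "ac", which is how order of operations can "switch things up".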
Complete this exercise. $ is the delimiter character showing where the regular expression begins and ends. Strings to be matched start and end with non-blank characters: there are no leading or trailing blanks.
. matches any character. WILDCARD (for one char.)
* means zero or more instances of
? means optional
+ means one or more instances of
There can be more than ONE correct answer per question
1 Which of the following matches regexp $a(ab)*a$
1) abababa
2) aaba
3) aabbaa
4) aba
5) aabababa
2 Which of the following matches regexp $ab+c?$
1) abc
2) ac
3) abbb
4) bbc
3 Which of the following matches regexp $a.[bc]+$
1) abc
2) abbbbbbbb
3) azc
4) abcbcbcbc
5) ac
6) asccbbbbcbcccc
Answers
Try these on your own. Positive list should be all red, Negative should be all black
#1. http://regex.sketchengine.co.uk/cgi/ex1.cgi
#2. http://regex.sketchengine.co.uk/cgi/ex2.cgi
#3. http://regex.sketchengine.co.uk/cgi/ex3.cgi
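If the links are unavailable, a few lines of Python can check your answers once you have worked them out by hand. This is only a sketch: matching_choices is a hypothetical helper, it assumes the whole string must match (the $ delimiters are dropped), and the warm-up question shown is not one of the three above.

```python
import re

def matching_choices(pattern, choices):
    """Return the 1-based numbers of the choices that the whole pattern matches."""
    return [i for i, s in enumerate(choices, start=1) if re.fullmatch(pattern, s)]

# Hypothetical warm-up question: which of these match ab* ?
print(matching_choices(r"ab*", ["a", "abb", "ba", "abc"]))  # → [1, 2]
```

To check question 1, pass r"a(ab)*a" and the five choices in order, and compare the returned numbers with your own list.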
The use of a delimiter character ($ in the example above, / in JavaScript) depends on the language
o (none needed in C++/Java)
Filtering
remember you are given a "massive" amount of data in which you are searching for matches
the reg. ex. is going to filter out the strings that don't match and produce the matches
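Filtering in this sense can be sketched in a few lines of Python. The lines list below is made-up sample data, and the course-code pattern is the one used in the coded examples later in these notes.

```python
import re

course_code = re.compile(r"[A-Z]{4}[0-9]{3}")

# Hypothetical sample data standing in for a "massive" file
lines = [
    "This class is CMSC433",
    "no course mentioned here",
    "Also take MATH221 and CMSC341",
    "cmsc101 is lowercase, so it is filtered out",
]

# Keep only the lines that contain at least one match
kept = [line for line in lines if course_code.search(line)]
# Or pull out just the matching pieces themselves
codes = [code for line in lines for code in course_code.findall(line)]

print(kept)
print(codes)  # → ['CMSC433', 'MATH221', 'CMSC341']
```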
Where to go for support
Because I can't remember everything
o http://www.regular-expressions.info/anchors.html
Tutorials
o https://regexcrossword.com/
1. In regular-expressions.info, review the documentation on the left side of the page for:
a. Word Boundaries
b. Repetition
c. Dot
2. In regexcrossword.com
a. create an account
b. start the "Tutorial" portion
c. (as of 3/10/17) the last one (Space) is a little tricky, and may not exactly tell you that the Tutorial portion was completed.
d. Use the "Help" to view the various Reg. Ex. forms
3. Complete this problem
a. http://regex.sketchengine.co.uk/cgi/ex4.cgi
Testing your Regular Expression
You can certainly buy/download a Regular Expression editor that will show results
RegEx testers on defined text
o these will already have sample text that you will try (filter) your Reg. Ex. on
o http://regexr.com/
   try /l{2}/g
Reg. Ex. and Words and spacing around Data
some of the other features will help with real life applications
Reg. Ex. handling words and spacing
\s   White Space
\S   non-White Space
\d   digit character
\D   non-digit character
\w   Word
\W   non-Word (punctuation, spaces)
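These shorthand classes can be tried directly. The sketch below uses Python's re module on a made-up string; the variable names are illustrative only.

```python
import re

text = "Room 42, Bldg 7"   # hypothetical sample text

digits = re.findall(r"\d+", text)   # runs of digit characters
words = re.findall(r"\w+", text)    # runs of word characters
gaps = re.findall(r"\W+", text)     # runs of non-word characters
fields = re.split(r"\s+", text)     # split wherever white space occurs

print(digits)  # → ['42', '7']
print(words)   # → ['Room', '42', 'Bldg', '7']
print(gaps)    # → [' ', ', ', ' ']
print(fields)  # → ['Room', '42,', 'Bldg', '7']
```

Note that \w treats digits as word characters, and that splitting on \s leaves the comma attached to "42,".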
Using RegEx in other Languages
There are differences in some languages!!
o Minor, but when programming they can be haunting
o Be careful
Python
o https://www.debuggex.com/cheatsheet/regex/python
JavaScript
o https://www.debuggex.com/cheatsheet/regex/javascript
What were the differences between the two??
Coded examples of RegEx
In various languages
Languages and RegEx
Java – Simple String
import java.util.regex.*;
…
String expression = "JHGADEEZroots";
String pattern = "(DEE?)";
Pattern cool = Pattern.compile(pattern);
Matcher match = cool.matcher(expression);
if (match.find()) {
    // group(0) is the whole match; group(1) is the first parenthesized group
    System.out.println("Found value: " + match.group(0));
    // System.out.println("Found value: " + match.group(1));
    // System.out.println("Found value: " + match.group(2)); // no group 2 in this pattern
}
else { System.out.println("NO MATCH"); }

Java – File IO
import java.io.*;
import java.util.regex.*;
…
System.out.println("Regex on a text file");
String allData = "";
try {
    String line2 = "";
    FileInputStream fstream = new FileInputStream("courses.txt");
    BufferedReader br = new BufferedReader(new InputStreamReader(fstream));
    // readLine() strips the line break, so add it back to keep lines from running together
    while ((line2 = br.readLine()) != null) { allData += line2 + "\n"; }
    br.close();
} catch (Exception ex) { System.out.println("Could not read file: " + ex.getMessage()); }
// String to be scanned to find the pattern.
String pattern = "[A-Z]{4}[0-9]{3}";
// Create a Pattern object
Pattern r = Pattern.compile(pattern);
Matcher m2 = r.matcher(allData);
while (m2.find()) { System.out.println(m2.group()); }
Python (both simple and File IO)
#!/usr/bin/python
import re
import sys

print("Regex on a string")
line = "This class is CMSC433"
searchObj = re.findall(r'[A-Z]{4}[0-9]{3}', line)
for i in range(0, len(searchObj)):
    print(searchObj[i])

print("Regex on a text file")
allData = ""
with open("courses.txt", "r") as f:
    for line in f:
        allData += line
searchObj = re.findall(r'[A-Z]{4}[0-9]{3}', allData)
for i in range(0, len(searchObj)):
    print(searchObj[i])
C++ - Simple String
#include <iostream>
#include <regex>
#include <string>

using namespace std;

int main()
{
    string target = "Lupoli needs more work.";
    string replacement = "a vacation.";
    string result;
    regex vacation("m.*");   // from the first 'm' through the end of the string

    cout << "Before regex replace: " << target << endl;
    cout << "regex is: m.*" << endl;

    result = regex_replace(target, vacation, replacement);
    cout << "After regex replace: " << result << endl;

    return 0;
}
JavaScript Code for Regular Expressions
has some items to watch for
/ and / in front and behind the pattern you are looking for
o much like http://regexr.com/
uses string commands in which the regular expression is passed to those functions
o search
   returns the index number of where the match can be found
   notice VERY limiting!!! will only return the first instance!!
   may have to create your own search function that will return an array of starting index values
o replace
o test
   returns a Boolean to see if the regular expression matches anything in the string passed in
Sample test code
var dateTime = /\d\d-\d\d-\d\d\d\d \d\d:\d\d/;
console.log(dateTime.test("30-01-2003 15:20")); // → true
console.log(dateTime.test("30-jan-2003 15:20")); // → false
console.log(/'\d+'/.test("'123'")); // → true
console.log(/'\d+'/.test("''")); // → false
console.log(/'\d*'/.test("'123'")); // → true
console.log(/'\d*'/.test("''")); // → true
(from http://eloquentjavascript.net/09_regexp.html)
modifiers
o global flags used for the file/data read in
JavaScript Modifier Descriptions
Modifier   Description
i          Perform case-insensitive matching
g          Perform a global match (find all matches rather than stopping after the first match)
m          Perform multiline matching
Using the example above, create a new function completeSearch that will return and display a list of indices on where the string was found. This should help significantly.
The exec JavaScript Command
tokenizes the data
returns the full string match, or each match within the data when called repeatedly with the g flag
I hate you Lupoli, now you tell me about exec
<!DOCTYPE html>
<html>
<body>

<p>Search a string for "w3Schools", and display the position of the match:</p>

<button onclick="myFunction()">Try it</button>

<p id="demo"></p>

<script>
function myFunction() {
    var str = "Visit W3Schools! W3SCHOOLS";
    completeSearch(str);
    //document.getElementById("demo").innerHTML = n;
}

function completeSearch(str) {
    var matches = [];
    var regex = /w3Schools/gi;   // g: find every match, i: ignore case
    var match;

    // with the g flag, each exec call resumes after the previous match
    while ((match = regex.exec(str)) !== null) matches.push(match.index);

    var res = "";
    for (var i = 0; i < matches.length; i++) {
        res += matches[i];
        if (i < matches.length - 1) res += ", ";
    }
    document.getElementById("demo").innerHTML = res;
}
</script>

</body>
</html>
This should display 6 and 17 to the screen. Why? Try another, changing the variables str and regex.
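For comparison, the same "all starting indices" idea can be sketched in Python, where re.finditer plays the role of the repeated exec calls. The function name complete_search is a hypothetical Python counterpart, not part of the original exercise.

```python
import re

def complete_search(pattern, text, flags=0):
    """Return the starting index of every non-overlapping match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text, flags)]

indices = complete_search(r"w3schools", "Visit W3Schools! W3SCHOOLS", re.IGNORECASE)
print(indices)  # → [6, 17]
```

"Visit " is six characters long, so the first match starts at index 6; the second copy starts after "W3Schools", "!" and a space, at index 17.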
Game-planning for a Reg. Ex. application
Overall gameplan
1. knowing what the application is looking for comes first
2. knowing what is in the file comes second
a. and how it is set up
b. you'll know why XML rocks
3. How are we to read the file?
a. place it into a text box
b. upload the file
4. How are we to display the results?
Exercises:
The data for both exercises can be found here (as of 1/6/16)
https://earthquake.usgs.gov/earthquakes/feed/v1.0/quakeml.php
Past Day All Earthquakes
Save the file as a .txt file
You can also copy and paste into the regexr website for #1 and 2 below
In regexr.com, remember to use Library Cheatsheet

     Exercise 1                                          Exercise 2
1.   Display town location of earthquake                 Display magnitude
2.   look for <text> tag                                 look for <mag><value> tag
3.   use http://regexr.com/                              use http://regexr.com/
4.   use http://regexr.com/                              use http://regexr.com/
part 1 - <text>..</text> is fine within the answer
part 2 – see if you can leave out <text>..</text>
(1-4 above are from 1-4 in game planning)

     Exercise 3                                          Exercise 4
1.   Display town location of earthquake                 Display magnitude
2.   look for <text> tag                                 look for <mag><value> tag
3.   use HTML file uploader (use this as help)           same as Exercise 3
4.   display within same HTML page using innerText       same as Exercise 3
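The game plan can be sketched on a tiny sample before touching the real feed. The XML fragment below is hypothetical (place names and magnitudes are made up, and the real QuakeML file may be laid out differently); it only assumes the <text> and <mag><value> tags named in the exercises.

```python
import re

# Hypothetical fragment imitating the tags named in the exercises
sample = """<description><type>earthquake name</type><text>10km NE of Exampleville, CA</text></description>
<mag><value>1.8</value></mag>
<description><type>earthquake name</type><text>5km S of Sampletown, Alaska</text></description>
<mag><value>2.4</value></mag>"""

# A capture group returns only what sits between the tags (part 2 of the exercise);
# .*? is non-greedy so each match stops at the nearest closing tag
places = re.findall(r"<text>(.*?)</text>", sample)
magnitudes = [float(v) for v in re.findall(r"<mag>\s*<value>(.*?)</value>", sample)]

for place, mag in zip(places, magnitudes):
    print(mag, place)
```

The \s* between <mag> and <value> is there in case the two tags are separated by a line break in the saved file.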
Substituting using Regular Expressions
s/ usually is the substitute command (as in sed) when scripting in shells like BASH/CSH/KSH
Pull the timestamp with whatever language you wish
http://www.usgovxml.com/examples/public/merged_catalog.xml
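Substitution itself can be sketched in Python with re.sub. The first call mirrors the C++ regex_replace example from the coded examples; the date string in the second call is made up to show captured groups being reused in the replacement.

```python
import re

# Same substitution as the C++ example: replace from the first 'm' to the end
target = "Lupoli needs more work."
result = re.sub(r"m.*", "a vacation.", target)
print(result)  # → Lupoli needs a vacation.

# Captured groups can be reused in the replacement as \1, \2, \3
reordered = re.sub(r"(\d\d)-(\d\d)-(\d{4})", r"\3-\2-\1", "30-01-2003")
print(reordered)  # → 2003-01-30
```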
Solutions
#1 – Using DFSM
#1 – BUT a(ab)+a
#3
Exercise 1
\A(pi|sp|sl|re).*
r*e*s*l*a*p[ioa ]te*w*o*
(re)?s?(la)?p(a|e|o|i| )t(e|w|o)*
.*p.t.* (had several times)
re|s+la|s|(p.t)+e?|wo?
(p|(sp)|(sl)|(re))+[aeiou]p?\s?t(e|wo)?
^((pi)|[str]).*([oet]|(ot)
.*p.{1}t.*
.*(r|s|p|l)+.t.*
[a-z]*p.{1}t[a-z]*

Exercise 2
.*ap( |et|h|/|9|o|t).*
\A(ra|ta|ap|wr|sa|87|ap)[^r][^l].*
s*w*8*7*r*t*(ap)o*(et)*[ t]*[hr9/][eta]*[mryhca]*
.*(ap).?t.*
.*ap.?t.* (had several times)
[rtws87]*ap[\s/9oe]?[thm]+e?[mcary]*
.*(ap).?t.h*.*
[a-z0-9]*ap.?t[a-z]*

Exercise 3
.*(afgk|af.g.k|afg.k|af[a-z]gk).*
[r,b]af.*|\Aaffgf.*
[br]*aff*gf* *[hk][aik][nhtge]*
 *af+g.[a-z]+
[br]?af+g.?k[ingahet]*
.*af+g.*k.* (had several times)
.?af+g.?k.*
.?af{1,2}g.*k.*
.?af{1,2}g.?k.*
.?(af).?g.?k.*
[a-z]+fg.?k[a-z]
completeSearch
by Luke Carrico S17
function globalSearch() {
    //String to search
    var str = "Visit W3Schools! W3Schools W3Schools";
    //Index of search result
    var n = str.search(/W3Schools/i);
    var index = [];
    var offset = 0;
    //search returns -1 when no match is found
    while (n != -1) {
        //add the found index to index
        //offset is included because part of the string is removed every time
        index.push(n + offset);
        //offset += start index of the string + size of the string
        //keep track of what has already been removed
        offset += n + 9;
        //remove everything that has been searched
        str = str.substr(n + 9);
        //search for another match
        n = str.search(/W3Schools/i);
    }
    document.getElementById("demo").innerHTML = index;
}
by Dohyun Roh S17
RegExr and EarthQuake Data
1. Part A - /<text>(\w+.)+/g
2. Part A - /<mag>\n<value>(.*)<\/value>/gm
1. Part B - (?![<text>]).*(?=<\/text>)
2. Part B -
Part 3
<input type="file" id="fileinput" />
<script type="text/javascript">
  function readSingleFile(evt) {
    //Retrieve the first (and only!) File from the FileList object
    var f = evt.target.files[0];

    if (f) {
      var r = new FileReader();
      r.onload = function(e) {
        var contents = e.target.result;
        var regex = /<text>.*<\/text>/g;
        // exec returns one match per call; call it in a loop to get them all
        console.log(regex.exec(contents));
      };
      r.readAsText(f);
    } else {
      alert("Failed to load file");
    }
  }

  document.getElementById('fileinput').addEventListener('change', readSingleFile, false);
</script>
// originally from
// http://www.htmlgoodies.com/beyond/javascript/read-text-files-using-the-javascript-filereader.html#fbid=lVKVjUCWdjk
Sources
http://www.csee.umbc.edu/~damas1/courses/cmsc433/fall2014/tools/regex-evaluator/index.php
http://www.regular-expressions.info/
http://www.funduc.com/regexp.htm
Search and Replace
Java - http://www.javamex.com/tutorials/regular_expressions/search_replace.shtml#.VtV-QfkrKUk
http://eloquentjavascript.net/09_regexp.html