Regexes in .NET
pablo-fernandez-duran
Reg-what?
• Regular expressions
• Describing a search pattern
• Find and replace operations
• Date back to the 1950s
• Regular language, formal language …
• Different flavors -> PCRE (Perl Compatible Regular Expressions)
• Now… not so regular
regex
regexp
reg-exp
regexps
reg-exps
regexes
regexen
^reg-?ex(?(?<=-ex)p|p?)(?(?<=x)e[sn]|s)?$
var re = new RegExp(/.*/); // js
var re = new Regex(".*"); // .NET
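The seven spellings listed above can be checked against the pattern in a few lines of C#. This is only a minimal sketch: the class and method names (SpellingDemo, IsAcceptedSpelling) are made up for illustration, and the pattern is copied unchanged from the slide.

```csharp
using System.Text.RegularExpressions;

public static class SpellingDemo
{
    // Pattern from the slide. Two conditionals, each driven by a lookbehind:
    //  - after "-ex" a "p" is required, otherwise it is optional;
    //  - right after "x" the plural is "es" or "en", otherwise "s".
    private static readonly Regex Spelling =
        new Regex(@"^reg-?ex(?(?<=-ex)p|p?)(?(?<=x)e[sn]|s)?$");

    public static bool IsAcceptedSpelling(string word) => Spelling.IsMatch(word);
}
```

With this sketch, IsAcceptedSpelling("regexen") is accepted while IsAcceptedSpelling("reg-ex") is rejected, because the hyphenated form requires the trailing "p".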
What about you?
• Can you read regexes?
^[0-9]\w*$
• Can you really read regexes?
^[^)(]*\((?>[^()]+|\((?<p>)|\)(?<-p>))*(?(p)(?!))\)[^)(]*$
Language overview
• Character classes
• . (wildcard) \w (word character) \d (decimal digit) \s (white space)
• \W (not \w) \D (not \d) \S (not \s)
• Character group [abc]
• Negation [^a1]
• Range [C-F] or [2-6A-D]
• Differences [A-Z-[B]]
• Anchors
• ^ (beginning of string or line) $ (end of string or line) \b (word boundary)
• \B (not \b)
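A short sketch of the character classes and anchors listed above. The class and method names (CharacterClassDemo and its members) are hypothetical, chosen only for this illustration.

```csharp
using System.Text.RegularExpressions;

public static class CharacterClassDemo
{
    // ^ and $ anchor the whole input: one digit, then any number of word characters.
    public static bool StartsWithDigit(string s) =>
        Regex.IsMatch(s, @"^[0-9]\w*$");

    // Character class subtraction: any letter from A to Z except B.
    public static bool IsUpperExceptB(string s) =>
        Regex.IsMatch(s, @"^[A-Z-[B]]$");

    // \b on both sides restricts the match to whole words.
    public static int CountWholeWord(string text, string word) =>
        Regex.Matches(text, @"\b" + Regex.Escape(word) + @"\b").Count;
}
```

For example, CountWholeWord("cat concat cat", "cat") counts two occurrences, since the "cat" inside "concat" has no word boundary in front of it.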
Language overview
• Quantifiers
• Range: {n,m}, {n,}
• Zero or more: * (can be written {0,})
• One or more: + (can be written {1,})
• Zero or one: ? (can be written {0,1})
• Greedy vs lazy
• Greedy: match as many repetitions as possible (default)
• Lazy: match as few repetitions as possible
• Lazy syntax: *?, +?, ??, {n,m}?
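The difference between greedy and lazy quantifiers is easiest to see on the same input. A minimal sketch, with hypothetical names (QuantifierDemo, Greedy, Lazy):

```csharp
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    // Greedy: .+ takes as much as it can, then gives characters back
    // only until the final '>' can match.
    public static string Greedy(string input) => Regex.Match(input, "<.+>").Value;

    // Lazy: .+? takes as little as it can, so the match stops at the first '>'.
    public static string Lazy(string input) => Regex.Match(input, "<.+?>").Value;
}
```

On the input "<b>bold</b>", Greedy returns the whole string, while Lazy returns only "<b>".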
Language overview
• Grouping constructs
• Capturing group: (subexpression)
• Named group: (?<group_name>subexpression)
• Non-capturing group: (?:subexpression)
• Balancing groups: (?<name1-name2>subexpression)
• Lookaround assertions (zero length)
• Positive lookahead: (?=subexpression)
• Negative lookahead: (?!subexpression)
• Positive lookbehind: (?<=subexpression)
• Negative lookbehind: (?<!subexpression)
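The grouping constructs above can be sketched as follows. The class and method names (GroupingDemo, ValueOf, PixelCount, HasBalancedGroup) and the key/value input format are assumptions made for this illustration; the balancing-group pattern is the one from the earlier "Can you really read regexes" slide.

```csharp
using System.Text.RegularExpressions;

public static class GroupingDemo
{
    // Named groups capture the key and the value; (?:=|:) groups the
    // separator alternatives without creating a capture.
    private static readonly Regex Pair =
        new Regex(@"^(?<key>\w+)\s*(?:=|:)\s*(?<value>.+)$");

    public static string ValueOf(string line)
    {
        Match m = Pair.Match(line);
        return m.Success ? m.Groups["value"].Value : "";
    }

    // Positive lookahead: the digits must be followed by "px",
    // but "px" itself is not part of the match.
    public static string PixelCount(string css) =>
        Regex.Match(css, @"\d+(?=px)").Value;

    // Balancing group: (?<p>) pushes on every inner '(' and (?<-p>) pops on
    // every inner ')'; (?(p)(?!)) fails the match if anything is left over.
    public static bool HasBalancedGroup(string s) =>
        Regex.IsMatch(s, @"^[^)(]*\((?>[^()]+|\((?<p>)|\)(?<-p>))*(?(p)(?!))\)[^)(]*$");
}
```

Note that this particular balancing pattern accepts a single outer parenthesised group with properly nested content, such as "f(x(y))", rather than arbitrary sequences of groups.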
Language overview
• Backreference constructs
• \groupnumber or \k<groupname>
• Alternation constructs
• (expression1|..|expressionn)
• (?(expression)yes|no)
• (?(referenced group)yes|no)
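A small sketch of a named backreference and of a conditional that tests a group. The names (BackreferenceDemo, FirstRepeatedWord, IsConsistentlyQuoted) are hypothetical and the patterns are illustrative, not taken from the slides.

```csharp
using System.Text.RegularExpressions;

public static class BackreferenceDemo
{
    // \k<word> must repeat exactly the text captured by the group "word",
    // so this finds a word that is immediately doubled.
    public static string FirstRepeatedWord(string text)
    {
        Match m = Regex.Match(text, @"\b(?<word>\w+)\s+\k<word>\b");
        return m.Success ? m.Groups["word"].Value : "";
    }

    // (?(q)yes): if the optional group "q" captured an opening quote,
    // a closing quote is required; otherwise nothing more is expected.
    public static bool IsConsistentlyQuoted(string s) =>
        Regex.IsMatch(s, @"^(?<q>"")?\w+(?(q)"")$");
}
```

FirstRepeatedWord("it is is fine") yields "is". IsConsistentlyQuoted accepts both abc and "abc", but rejects a value with a quote on one side only.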
Format/Comment your code
As you do when you write code…
public static void C(string an, string pn, string n, string nn) { RegexCompilationInfo[] re ={ new RegexCompilationInfo(pn, RegexOptions.Compiled, n, nn, true) };System.Reflection.AssemblyName asn = new System.Reflection.AssemblyName(); asn.Name = an;Regex.CompileToAssembly(re, asn); }
Regexes can have inline comments:
(?#comment)
And can be written on multiple lines (don’t forget the IgnorePatternWhitespace option, which also allows # comments up to the end of the line):
Before:
^[^()]*((?<g>\()[^()]*)*((?<-g>\))[^()]*)*[^()]*(?(g)(?!))$
After:
^                    #start
[^()]*               #everything but ()
(
  (?<g>\()           #opening group (
  [^()]*             #everything but ()
)*
(
  (?<-g>\))          #closing group )
  [^()]*             #everything but ()
)*
[^()]*               #everything but ()
(?(g)                #if opening group remaining
  (?!))              #then make match fail
$                    #end
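In C#, a commented multi-line pattern is conveniently written as a verbatim string. The sketch below uses a deliberately simple date pattern rather than the one above; the names (CommentedPatternDemo, IsIsoDate, MonthOf) are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class CommentedPatternDemo
{
    // With IgnorePatternWhitespace, unescaped white space in the pattern is
    // ignored and '#' starts a comment that runs to the end of the line.
    private static readonly Regex IsoDate = new Regex(@"
        ^                   # start of input
        (?<year>  \d{4} )   # four-digit year
        -                   # literal dash
        (?<month> \d{2} )   # two-digit month
        -(?#an inline comment works here too)
        (?<day>   \d{2} )   # two-digit day
        $                   # end of input
        ", RegexOptions.IgnorePatternWhitespace);

    public static bool IsIsoDate(string s) => IsoDate.IsMatch(s);

    public static string MonthOf(string s) => IsoDate.Match(s).Groups["month"].Value;
}
```

One consequence of this option: a literal space or '#' in the pattern has to be escaped or placed inside a character class.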
In .NET / C#
• A class to know: System.Text.RegularExpressions.Regex
• Represents the Regex engine
• A pattern is tightly coupled to the regex engine
• All regular expressions must be compiled (sooner or later)
• Initialization can be an expensive process
Regex options
• None
• IgnoreCase
• Multiline
• Singleline
• ExplicitCapture
• Compiled
• IgnorePatternWhitespace
• RightToLeft
• ECMAScript
• CultureInvariant
http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regexoptions.aspx
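RegexOptions is a flags enumeration, so options can be combined, and several of them also have an inline form inside the pattern. A minimal sketch with hypothetical names (OptionsDemo and its methods):

```csharp
using System.Text.RegularExpressions;

public static class OptionsDemo
{
    // Options combined with |: case-insensitive, and ^ matches at every line start.
    public static int CountLinesStartingWith(string text, string word) =>
        new Regex("^" + Regex.Escape(word),
                  RegexOptions.IgnoreCase | RegexOptions.Multiline)
            .Matches(text).Count;

    // The same two options written inline as (?im) at the start of the pattern.
    public static int CountLinesStartingWithInline(string text, string word) =>
        Regex.Matches(text, "(?im)^" + Regex.Escape(word)).Count;

    // Number of groups, including group 0 (the whole match).
    // With ExplicitCapture, unnamed parentheses stop capturing.
    public static int GroupCount(string pattern, RegexOptions options) =>
        new Regex(pattern, options).GetGroupNumbers().Length;
}
```

For the pattern (a)(?<n>b), GroupCount gives 3 with RegexOptions.None and 2 with RegexOptions.ExplicitCapture.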
Instance or static method calls?
• Both provide the same matching/replacing methods
• Static method calls use a cache of parsed regexes (15 entries by default)
• Manage the cache size using Regex.CacheSize
• Only static calls use caching (since .NET 2.0)
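The two calling styles can be sketched side by side. The names (CachingDemo and its methods) are hypothetical; Regex.CacheSize is the property mentioned above.

```csharp
using System.Text.RegularExpressions;

public static class CachingDemo
{
    // Instance style: the pattern is parsed once, when the object is created,
    // and the object is then reused for every call.
    private static readonly Regex Digits = new Regex(@"^\d+$");

    public static bool IsDigitsInstance(string s) => Digits.IsMatch(s);

    // Static style: the engine looks the pattern up in its internal cache
    // and only parses it again if it is no longer there.
    public static bool IsDigitsStatic(string s) => Regex.IsMatch(s, @"^\d+$");

    // If many different patterns go through static calls, the cache can be enlarged.
    public static void SetCacheSize(int size) => Regex.CacheSize = size;
}
```

A reasonable rule of thumb, under these assumptions: keep a Regex instance in a static field for patterns used repeatedly, and use static calls for a small set of patterns that fits in the cache.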
Instance or static method calls?
• new Regex(pattern).IsMatch(email)
vs
• Regex.IsMatch(email, pattern)
Data from:
http://blogs.msdn.com/b/bclteam/archive/2010/06/25/optimizing-regular-expression-performance-part-i-working-with-the-regex-class-and-regex-objects.aspx
Interpreted or compiled
• Interpreted:
• opcodes created on initialization (static or instance).
• opcodes converted to MSIL and executed by the JIT when the method is called.
• Startup time reduced but slower execution time
• Compiled (RegexOptions.Compiled):
• regex converted to MSIL code.
• MSIL code executed by the JIT when the method is called.
• Execution time reduced but slower startup time.
• Compiled at design time:
• Regex.CompileToAssembly
• The regex is fixed and used only in instance calls.
• Startup and execution time are both reduced at run time, but the compilation must be done at design time.
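Choosing between the interpreted and the compiled mode is a one-flag change, as in the sketch below. The names (CompiledDemo and its members) and the simplistic e-mail-like pattern are assumptions for illustration only. Regex.CompileToAssembly is not shown because it is only available on .NET Framework; more recent .NET versions do not support it.

```csharp
using System.Text.RegularExpressions;

public static class CompiledDemo
{
    // Deliberately naive pattern, used only to have something to match.
    private const string Pattern = @"\b\w+@\w+\.\w+\b";

    // Interpreted: cheap to construct, slower on each match.
    private static readonly Regex Interpreted = new Regex(Pattern);

    // Compiled: more expensive to construct, faster on each match,
    // so it pays off when the same object is used many times.
    private static readonly Regex Compiled =
        new Regex(Pattern, RegexOptions.Compiled);

    public static int CountInterpreted(string text) => Interpreted.Matches(text).Count;

    public static int CountCompiled(string text) => Compiled.Matches(text).Count;
}
```

Both objects return the same matches; only the cost profile differs, so measuring with the real pattern and input is the safest way to decide.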
Interpreted or compiled
Data from:
http://blogs.msdn.com/b/bclteam/archive/2010/06/25/optimizing-regular-expression-performance-part-i-working-with-the-regex-class-and-regex-objects.aspx
Tools
• Regex Design
• Expresso
• The Regex Coach
• RegexBuddy (not free)
• Rex (Microsoft Research)
• Visual Studio
Bonus
• Mail::RFC822::Address: regexp-based address validation http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html
• A regular expression to check for prime numbers:
^1?$|^(11+?)\1+$
http://montreal.pm.org/tech/neil_kandalgaonkar.shtml
• RegEx match open tags except XHTML self-contained tags (Stack Overflow)
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
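The prime-number regex works on numbers written in unary (n is a string of n ones): it matches when n is 0, 1 or a composite number, so a number is prime when the pattern does not match. A minimal sketch, with hypothetical names (UnaryPrimeDemo, IsPrime):

```csharp
using System.Text.RegularExpressions;

public static class UnaryPrimeDemo
{
    // ^1?$          matches 0 and 1 (neither is prime).
    // ^(11+?)\1+$   matches when the ones can be split into two or more
    //               equal blocks of at least two ones, i.e. n is composite.
    public static bool IsPrime(int n) =>
        n >= 0 && !Regex.IsMatch(new string('1', n), @"^1?$|^(11+?)\1+$");
}
```

This is a curiosity rather than a practical primality test: the backreference makes the engine try every candidate divisor by backtracking.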
Regex optimization
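• Timeout: bound the time a match is allowed to take
• Consider the input source: it may contain text the pattern was not designed for
• Capture only when necessary: use (?:…) or the ExplicitCapture option
• Factorization: pull common prefixes out of alternations
• Backtracking: keep it under control (lazy quantifiers, atomic groups (?>…))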
“In general, a Nondeterministic Finite Automaton (NFA) engine like the .NET Framework regular expression engine places the responsibility for crafting efficient, fast regular expressions on the developer.”
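A match timeout can be set through the Regex constructor that takes a TimeSpan (available since .NET Framework 4.5). The sketch below is one possible way to wrap it; the names (TimeoutDemo, TryIsMatch) and the choice of returning null on timeout are assumptions for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TimeoutDemo
{
    // Returns true or false when the match completes in time,
    // and null when the engine gives up after the timeout.
    public static bool? TryIsMatch(string input, string pattern, TimeSpan timeout)
    {
        try
        {
            var regex = new Regex(pattern, RegexOptions.None, timeout);
            return regex.IsMatch(input);
        }
        catch (RegexMatchTimeoutException)
        {
            // Excessive backtracking (or a very large input) exceeded the limit.
            return null;
        }
    }
}
```

A classic candidate for such a guard is a nested quantifier like ^(a+)+$ applied to a long run of "a" followed by a character that cannot match, which forces the engine to explore a very large number of backtracking paths.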