Intro To Regex In Java

Regular Expressions in Java

Copyright© Nabeel Ali Memon

Introduction

A regular expression (regex or regexp for short) is a special text string for describing a search pattern.

You can think of regular expressions as wildcards on steroids.

You are probably familiar with wildcard notations such as *.txt to find all text files in a file manager. The regex equivalent is .*\.txt$.
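
As a minimal Java sketch of the same idea (the class and method names here are illustrative and not part of the original slides), the pattern can be compiled once and tried against a few file names. Note that the backslash has to be doubled inside a Java string literal.

```java
import java.util.regex.Pattern;

class WildcardVsRegex {
    // Regex counterpart of the wildcard *.txt, written as a Java string literal.
    static final Pattern TXT_FILE = Pattern.compile(".*\\.txt$");

    // True when the whole file name matches the pattern.
    static boolean isTextFile(String fileName) {
        return TXT_FILE.matcher(fileName).matches();
    }

    public static void main(String[] args) {
        String[] names = { "notes.txt", "notes.txt.bak", "readme" };
        for (String name : names) {
            System.out.println(name + " -> " + isTextFile(name));
        }
    }
}
```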

Copyright© 2008 Nabeel Ali Memon.

Importance

In computing, regular expressions provide a concise and flexible means for identifying strings of text of interest, such as particular characters, words, or patterns of characters.

Regular expressions (abbreviated as regex or regexp, with plural forms regexes, regexps, or regexen) are written in a formal language that can be interpreted by a regular expression processor, a program that either serves as a parser generator or examines text and identifies parts that match the provided specification.

Many validations and input checks are flexibly done through regexes.

Example

A regular expression can describe, for example:

The sequence of characters "car" in any context, such as "car", "cartoon", or "bicarbonate".

The word "car" when it appears as an isolated word.

The word "car" when preceded by the word "blue" or "red".

A dollar sign immediately followed by one or more digits, and then optionally a period and exactly two more digits.

Regular expressions can, however, be much more complex than these examples.
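
One possible way to write these descriptions as Java patterns is sketched below. The concrete regexes and all names are assumptions chosen for illustration; the slides only describe the patterns in words, and the "blue or red" case assumes a single space between the words.

```java
import java.util.regex.Pattern;

class CarExamples {
    // "car" anywhere, even inside a longer word.
    static final Pattern ANYWHERE = Pattern.compile("car");
    // "car" as an isolated word (\b is a word boundary).
    static final Pattern ISOLATED = Pattern.compile("\\bcar\\b");
    // "car" preceded by "blue" or "red" and one space.
    static final Pattern COLOURED = Pattern.compile("\\b(blue|red) car\\b");
    // Dollar sign, one or more digits, optionally a period and exactly two digits.
    static final Pattern DOLLARS = Pattern.compile("\\$\\d+(\\.\\d{2})?");

    // True when the pattern occurs somewhere in the text.
    static boolean found(Pattern p, String text) {
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(found(ANYWHERE, "bicarbonate"));
        System.out.println(found(ISOLATED, "cartoon"));
        System.out.println(found(COLOURED, "a red car"));
        System.out.println(DOLLARS.matcher("$12.50").matches());
    }
}
```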

Every-day usage

Regular expressions are used by many text editors, utilities, and programming languages to search and manipulate text based on patterns.

Perl and Ruby have a powerful regular expression engine built directly into their syntax. (Perl arguably owes much of its fame to its powerful built-in regex support.)

Several utilities provided by Unix distributions—including the editor ed and the filter grep—were the first to popularize the concept of regular expressions.

Basic concept

Alternation: A vertical bar separates alternatives. For example, gray|grey can match "gray" or "grey".

Grouping: Parentheses are used to define the scope and precedence of the operators (among other uses). For example, gray|grey and gr(a|e)y are equivalent patterns which both describe the set of "gray" and "grey".

Quantification: A quantifier after a token (such as a character) or group specifies how often that preceding element is allowed to occur. The most common quantifiers are ?, *, and +.

? The question mark indicates there is zero or one of the preceding element. For example, colou?r matches both "color" and "colour".

* The asterisk indicates there are zero or more of the preceding element. For example, ab*c matches "ac", "abc", "abbc", "abbbc", and so on.

+ The plus sign indicates that there is one or more of the preceding element. For example, ab+c matches "abc", "abbc", "abbbc", and so on, but not "ac".

These constructions can be combined to form arbitrarily complex expressions.

The precise syntax for regular expressions, however, varies among tools and with context.
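
The alternation, grouping, and quantifier examples above can be checked directly in Java. This is a small sketch with an illustrative helper name; String.matches() succeeds only when the whole input matches the regex.

```java
class BasicConcepts {
    // True when the entire input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("gray|grey", "grey"));   // alternation
        System.out.println(fullMatch("gr(a|e)y", "gray"));    // grouping
        System.out.println(fullMatch("colou?r", "color"));    // zero or one 'u'
        System.out.println(fullMatch("ab*c", "ac"));          // zero or more 'b'
        System.out.println(fullMatch("ab+c", "ac"));          // one or more 'b'
    }
}
```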

Example

H(ae?|ä)ndel and H(a|ae|ä)ndel are both valid patterns which match the same strings as H(ä|ae?)ndel: "Handel", "Haendel", and "Händel".
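
A quick way to convince yourself that the three patterns are equivalent is to run each spelling against all of them. The sketch below uses illustrative names and writes "ä" as the escape \u00e4 so that the source file encoding does not matter.

```java
class HandelPatterns {
    // The three patterns from the slide; \u00e4 is the letter a-umlaut.
    static final String[] PATTERNS = {
        "H(\u00e4|ae?)ndel",
        "H(ae?|\u00e4)ndel",
        "H(a|ae|\u00e4)ndel"
    };

    // True only when every pattern matches the whole name.
    static boolean allMatch(String name) {
        for (String pattern : PATTERNS) {
            if (!name.matches(pattern)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] names = { "Handel", "Haendel", "H\u00e4ndel", "Hendel" };
        for (String name : names) {
            System.out.println(name + " -> " + allMatch(name));
        }
    }
}
```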

History

The origins of regular expressions lie in automata theory and formal language theory, both of which are part of theoretical computer science. (But luckily we can survive without getting into them :-)

In the 1950s, mathematician Stephen Cole Kleene described these models using his mathematical notation called regular sets.

The SNOBOL language was an early implementation of pattern matching, but not identical to regular expressions.

Ken Thompson built Kleene's notation into the editor QED as a means to match patterns in text files.

POSIX BRE syntax

In the BRE syntax, most characters are treated as literals — they match only themselves (i.e., a matches "a"). The exceptions, listed below, are called metacharacters or metasequences.

. Matches any single character except newlines (exactly which characters are considered newlines is flavor, character encoding, and platform specific, but it is safe to assume that the line feed character is included). Within POSIX bracket expressions, the dot character matches a literal dot. For example, a.c matches "abc", etc., but [a.c] matches only "a", ".", or "c".

[ ]

A bracket expression. Matches a single character that is contained within the brackets. For example, [abc] matches "a", "b", or "c". [a-z] specifies a range which matches any lowercase letter from "a" to "z". These forms can be mixed: [abcx-z] matches "a", "b", "c", "x", "y", and "z", as does [a-cx-z].

The - character is treated as a literal character if it is the last or the first character within the brackets, or if it is escaped with a backslash: [abc-], [-abc], or [a\-bc].

[^ ]

Matches a single character that is not contained within the brackets. For example, [^abc] matches any character other than "a", "b", or "c". [^a-z] matches any single character that is not a lowercase letter from "a" to "z". As above, literal characters and ranges can be mixed.

^ Matches the starting position within the string. In line-based tools, it matches the starting position of any line.

$ Matches the ending position of the string or the position just before a string-ending newline. In line-based tools, it matches the ending position of any line.

\( \)

Defines a marked subexpression. The string matched within the parentheses can be recalled later (see the next entry, \n). A marked subexpression is also called a block or capturing group.

\n Matches what the nth marked subexpression matched, where n is a digit from 1 to 9. This construct is theoretically irregular and was not adopted in the POSIX ERE syntax. Some tools allow referencing more than nine capturing groups.

* Matches the preceding element zero or more times. For example, ab*c matches "ac", "abc", "abbbc", etc. [xyz]* matches "", "x", "y", "z", "zx", "zyx", "xyzzy", and so on. \(ab\)* matches "", "ab", "abab", "ababab", and so on.

\{m,n\}

Matches the preceding element at least m and not more than n times. For example, a\{3,5\} matches only "aaa", "aaaa", and "aaaaa". A few older regex implementations do not support this construct.
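
Java does not use the BRE flavor, so the same ideas are written without the backslashes: groups are ( ), intervals are {m,n}, and a backreference is still \1 (doubled to \\1 in a string literal). A minimal sketch with illustrative method names:

```java
class BackrefAndInterval {
    // Interval quantifier: three to five 'a' characters and nothing else.
    static boolean threeToFiveAs(String s) {
        return s.matches("a{3,5}");
    }

    // Backreference: a word, one space, then the same word again.
    static boolean repeatedWord(String s) {
        return s.matches("(\\w+) \\1");
    }

    // Star applied to a group: zero or more repetitions of "ab".
    static boolean repeatedAb(String s) {
        return s.matches("(ab)*");
    }

    public static void main(String[] args) {
        System.out.println(threeToFiveAs("aaaa"));
        System.out.println(repeatedWord("hello hello"));
        System.out.println(repeatedAb("ababab"));
    }
}
```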

Examples:

.at matches any three-character string ending with "at", including "hat", "cat", and "bat".

[hc]at matches "hat" and "cat".

[^b]at matches all strings matched by .at except "bat".

^[hc]at matches "hat" and "cat", but only at the beginning of the string or line.

[hc]at$ matches "hat" and "cat", but only at the end of the string or line.
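
These patterns behave the same way in Java. The sketch below (helper name is illustrative) uses Matcher.find(), which looks for the pattern anywhere in the text, so the effect of the ^ and $ anchors becomes visible.

```java
import java.util.regex.Pattern;

class AnchorExamples {
    // True when the regex matches somewhere inside the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(found(".at", "the bat"));       // any character before "at"
        System.out.println(found("[^b]at", "bat"));        // 'b' is excluded
        System.out.println(found("^[hc]at", "cat food"));  // at the beginning
        System.out.println(found("^[hc]at", "a cat"));     // not at the beginning
        System.out.println(found("[hc]at$", "a hat"));     // at the end
    }
}
```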

POSIX ERE Syntax

The meaning of metacharacters escaped with a backslash is reversed for some characters in the POSIX Extended Regular Expression (ERE) syntax.

With this syntax, a backslash causes the metacharacter to be treated as a literal character. Additionally, metacharacters are added:

? Matches the preceding element zero or one time. For example, ba? matches "b" or "ba".

+ Matches the preceding element one or more times. For example, ba+ matches "ba", "baa", "baaa", and so on.

| The choice (aka alternation or set union) operator matches either the expression before or the expression after the operator. For example, abc|def matches "abc" or "def".

Examples:

[hc]+at matches "hat", "cat", "hhat", "chat", "hcat", "ccchat", and so on, but not "at".

[hc]?at matches "hat", "cat", and "at".

cat|dog matches "cat" or "dog".

POSIX Extended Regular Expressions can often be used with modern Unix utilities by including the command line flag -E.

The Java™ way of doing Regexes.

Java specification

Recall that a backslash (\) has a special meaning inside a Java string literal.

For a regex we therefore write "\\", which inserts a single regular-expression backslash so that the following character gets its special meaning.

To match a literal backslash, write "\\\\".

"-?\\d+" means an optional minus sign followed by one or more digits.

A few more details will come up as we go along.

Yes, we call it Java-safe regex.
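
The doubling rule is easier to see in code. In this sketch (class and method names are illustrative) the comments show the regex as the engine receives it, after the Java compiler has processed the string literal.

```java
import java.util.regex.Pattern;

class EscapingDemo {
    // The engine receives: -?\d+
    static boolean isInteger(String s) {
        return s.matches("-?\\d+");
    }

    // The engine receives two backslashes, which match one literal backslash.
    static boolean containsBackslash(String s) {
        return Pattern.compile("\\\\").matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(isInteger("-42"));
        System.out.println(isInteger("4a"));
        System.out.println(containsBackslash("C:\\temp"));
    }
}
```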

Kick-start example using String class

//: String class based basic regex example
public class Main {
    public static void main(String[] args) {
        System.out.println("-123".matches("-?\\d+"));
        System.out.println("+123".matches("-?\\d+"));
        System.out.println("-+123".matches("(-|\\+)?\\d+"));
        System.out.println("+-123".matches("-?\\+?\\d+"));
        System.out.println("-+123".matches("-?\\+?\\d+"));
    }
}
/* Output:
true
false
false
false
true
*/

Another example

import java.util.Arrays;

public class Main {
    public static String knights =
        "nabeel, Then, when you have found the shrubbery," +
        "you must cut down the mightiest tree in the forest...";

    public static void split(String regex) {
        System.out.println(Arrays.toString(knights.split(regex)));
    }

    public static void main(String[] args) {
        split(" ");       // split on single spaces
        split("\\W+");    // split on runs of non-word characters
        split("n\\W+");   // split on 'n' followed by non-word characters
        System.out.println("f".matches("[^abc].?"));
    }
}

//output follows

//output

[nabeel,, Then,, when, you, have, found, the, shrubbery,you, must, cut, down, the, mightiest, tree, in, the, forest...]

[nabeel, Then, when, you, have, found, the, shrubbery, you, must, cut, down, the, mightiest, tree, in, the, forest]

[nabeel, The, whe, you have found the shrubbery,you must cut dow, the mightiest tree i, the forest...]

true

Other useful String class functions

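split() splits a string around matches of a regex.

replace() replaces literal character sequences (it does not use a regex).

replaceFirst() replaces the first match of a regex.

replaceAll() replaces every match of a regex.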
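
A short sketch of the four methods side by side; the helper names and sample strings are made up for illustration. Note the difference between replace() and replaceAll() when the search text is a dot: replace() treats it literally, while replaceAll() treats it as "any character".

```java
import java.util.Arrays;

class StringRegexMethods {
    static String[] splitOnDigits(String s) { return s.split("\\d+"); }
    static String literalDots(String s) { return s.replace(".", "-"); }
    static String regexDots(String s) { return s.replaceAll(".", "-"); }
    static String firstNumber(String s) { return s.replaceFirst("\\d+", "#"); }
    static String allNumbers(String s) { return s.replaceAll("\\d+", "#"); }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitOnDigits("a1b22c333")));
        System.out.println(literalDots("a.b"));
        System.out.println(regexDots("a.b"));
        System.out.println(firstNumber("a1b22"));
        System.out.println(allNumbers("a1b22"));
    }
}
```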

Summary of regex character class constructs

[xyz] matches x, y, or z

[^mno] matches any character except m, n, or o

[a-zA-Z] matches any uppercase or lowercase letter

[a-d[m-p]] matches any character from a to d or from m to p (union)

[a-z&&[def]] matches only d, e, or f (intersection)

[a-z&&[^mn]] matches any character from a to z except m and n (subtraction)

[a-z&&[^m-p]] matches any character from a to z except m through p (subtraction)
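
The union, intersection, and subtraction forms are specific to Java's character classes, so they are worth trying out. A minimal sketch, with an illustrative helper that tests a single character against a class:

```java
class CharClassDemo {
    // True when the single character c is matched by the character class.
    static boolean inClass(String charClass, char c) {
        return String.valueOf(c).matches(charClass);
    }

    public static void main(String[] args) {
        System.out.println(inClass("[a-d[m-p]]", 'n'));     // union
        System.out.println(inClass("[a-z&&[def]]", 'e'));   // intersection
        System.out.println(inClass("[a-z&&[^mn]]", 'm'));   // subtraction
        System.out.println(inClass("[a-z&&[^m-p]]", 'q'));  // subtraction of a range
    }
}
```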

Java API for Regex handling

Java provides the java.util.regex package.

Interface: MatchResult

Classes: Pattern, Matcher

Exception: PatternSyntaxException (an unchecked exception)

Pattern implements Serializable; Matcher implements MatchResult.
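
Because PatternSyntaxException is unchecked, the compiler will not force you to catch it. When the regex comes from user input, a small guard like the following sketch (the helper name is an assumption) can be useful:

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class SyntaxCheck {
    // True when the regex compiles; false when Pattern.compile() rejects it.
    static boolean isValidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidRegex("a*b"));  // well formed
        System.out.println(isValidRegex("(ab"));  // unclosed group
    }
}
```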

Regex API demystified

A regular expression, specified as a string, must first be compiled into an instance of the Pattern class.

The resulting pattern can then be used to create a Matcher object that can match arbitrary character sequences against the regular expression.

All of the state involved in performing a match resides in the matcher, so many matchers can share the same pattern.
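
The sketch below illustrates the sharing described above: one compiled Pattern serves two independent Matcher objects, each holding its own match state. The names are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SharedPattern {
    // Compiled once; safe to share because it holds no match state.
    static final Pattern A_STAR_B = Pattern.compile("a*b");

    // Matches two inputs with two separate matchers built from the same pattern.
    static boolean[] matchBoth(String first, String second) {
        Matcher m1 = A_STAR_B.matcher(first);
        Matcher m2 = A_STAR_B.matcher(second);
        return new boolean[] { m1.matches(), m2.matches() };
    }

    public static void main(String[] args) {
        boolean[] results = matchBoth("aaaaab", "abc");
        System.out.println(results[0] + " " + results[1]);
    }
}
```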

A typical example

A typical invocation sequence may be:

Pattern p = Pattern.compile("a*b");
Matcher m = p.matcher("aaaaab");
boolean b = m.matches();

A static matches() method is defined by the Pattern class as a convenience for when a regular expression is used just once. This method compiles an expression and matches an input sequence against it in a single invocation.

A boolean match statement may be:

boolean b = Pattern.matches("a*b", "aaaaab");

Pattern class

In Java, you compile a regular expression by calling the static factory method Pattern.compile().

This factory returns an object of type Pattern. E.g.: Pattern myPattern = Pattern.compile("regex");

You can specify certain options as an optional second parameter. E.g.: Pattern.compile("regex", Pattern.CASE_INSENSITIVE | Pattern.DOTALL | Pattern.MULTILINE);

When working with Unicode strings, specify Pattern.UNICODE_CASE together with Pattern.CASE_INSENSITIVE if you want to make the regex case insensitive for all characters in all languages; by default, case-insensitive matching applies only to US-ASCII characters.
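
The effect of each flag can be seen by compiling the same regex with and without it. This is a sketch with illustrative helper names; the accented letters are written as Unicode escapes (\u00e9 is a lowercase e with acute accent, \u00c9 its uppercase form).

```java
import java.util.regex.Pattern;

class FlagsDemo {
    // Whole-input match with the given flags.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    // Search anywhere in the input with the given flags.
    static boolean finds(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        int ci = Pattern.CASE_INSENSITIVE;
        System.out.println(matches("regex", ci, "REGEX"));
        // ASCII-only folding versus Unicode-aware folding.
        System.out.println(matches("\u00e9", ci, "\u00c9"));
        System.out.println(matches("\u00e9", ci | Pattern.UNICODE_CASE, "\u00c9"));
        // DOTALL lets the dot match a line terminator.
        System.out.println(matches("a.b", Pattern.DOTALL, "a\nb"));
        // MULTILINE lets ^ and $ match at line boundaries.
        System.out.println(finds("^b$", Pattern.MULTILINE, "a\nb"));
    }
}
```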

More on Pattern

You should specify Pattern.CANON_EQ so that canonically equivalent Unicode sequences are treated as equal (for example, a precomposed accented letter and the same letter followed by a combining accent). The flag carries a performance cost, so you can leave it out when you are sure your strings contain only US-ASCII characters.

If you will be using the same regular expression often in your source code, you should create a Pattern object once and reuse it to increase performance.
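
A common way to follow the reuse advice is to keep the compiled Pattern in a static final field. The sketch below is one possible arrangement; the class, field, and method names are assumptions made for the example.

```java
import java.util.regex.Pattern;

class ReusedPattern {
    // Compiled once when the class is loaded, then reused for every token.
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Counts the tokens that consist only of digits.
    static int countNumeric(String[] tokens) {
        int count = 0;
        for (String token : tokens) {
            if (DIGITS.matcher(token).matches()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countNumeric(new String[] { "12", "ab", "7", "3x" }));
    }
}
```

Calling token.matches("\\d+") inside the loop would give the same result but compile the regex again for every token.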

The Matcher class

To create a Matcher object, simply call Pattern.matcher() like this: Matcher myMatcher = myPattern.matcher("subject");

If you already created a Matcher object from the same pattern, call myMatcher.reset("newsubject") instead of creating a new matcher object, for reduced garbage and increased performance.

To find the first match of the regex in the subject string, call myMatcher.find().

Call find() again to get each following match, until it returns false.

Or use while (myMatcher.find()) {...} to make your life easy.
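
Put together, the find() loop and reset() might look like this sketch, which collects every word from two subject strings using a single Matcher. The class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindLoop {
    private static final Pattern WORD = Pattern.compile("\\w+");

    // Collects all words from both subjects, reusing one matcher via reset().
    static List<String> allWords(String first, String second) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(first);
        while (m.find()) {
            words.add(m.group());
        }
        m.reset(second);  // same matcher, new subject string
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(allWords("one two", "three"));
    }
}
```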

More on Matcher

The Matcher object holds the results of the last match.

Call its methods start(), end() and group() to get details about the entire regex match and the matches between capturing parentheses.

myMatcher.replaceAll("replacement") has exactly the same results as myString.replaceAll("regex", "replacement").

The String version, however, recompiles the regex on every call, so the two are interchangeable only if speed can be ignored.
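
The sketch below shows what start(), end(), and group() report for one match, and checks that both replaceAll() variants agree. The pattern and all names are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchDetails {
    private static final String REGEX = "(\\w+)@(\\w+)";
    private static final Pattern ADDRESS = Pattern.compile(REGEX);

    // Describes the first match: its position, full text, and both groups.
    static String describe(String text) {
        Matcher m = ADDRESS.matcher(text);
        if (!m.find()) {
            return "no match";
        }
        return m.start() + "-" + m.end() + " " + m.group()
                + " user=" + m.group(1) + " host=" + m.group(2);
    }

    // Compares Matcher.replaceAll() with String.replaceAll().
    static boolean sameReplacement(String text) {
        String viaMatcher = ADDRESS.matcher(text).replaceAll("[hidden]");
        String viaString = text.replaceAll(REGEX, "[hidden]");
        return viaMatcher.equals(viaString);
    }

    public static void main(String[] args) {
        System.out.println(describe("mail bob@example now"));
        System.out.println(sameReplacement("mail bob@example now"));
    }
}
```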

Example code-snippets

//snippet to append replacement using Matcher

StringBuffer myStringBuffer = new StringBuffer();
myMatcher = myPattern.matcher("subject");
while (myMatcher.find()) {
    if (method_replace_check()) {
        myMatcher.appendReplacement(myStringBuffer, computeReplacementString());
    }
}
//copy the text after the last replaced match
myMatcher.appendTail(myStringBuffer);

//method_replace_check() and computeReplacementString() are placeholders for your own logic
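
For a complete, runnable variant of this idea, here is a sketch in which the "replace check" and the "computed replacement" are filled in with an arbitrary rule chosen for illustration: every even number in the text is doubled, odd numbers are left alone. It uses StringBuffer because that overload of appendReplacement() is available in all Java versions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SelectiveReplace {
    private static final Pattern NUMBER = Pattern.compile("\\d+");

    // Doubles every even number in the text; odd numbers stay unchanged.
    static String doubleEvenNumbers(String text) {
        Matcher m = NUMBER.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int value = Integer.parseInt(m.group());
            if (value % 2 == 0) {
                // Copies the text since the last append, then the replacement.
                m.appendReplacement(out, String.valueOf(value * 2));
            }
        }
        // Copies whatever follows the last replaced match.
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(doubleEvenNumbers("a 2 b 3 c 4"));
    }
}
```

Skipped matches need no special handling: the matcher's append position stays where it was, so the skipped text is copied along with the next replacement or by appendTail().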

Another code-snippet

//validating a whole string with matches()
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatcherMethods {
    public static void main(String[] args) {
        //Input the string for validation
        String email = "[email protected]";

        //Set the email pattern string
        Pattern p = Pattern.compile(".+@.+\\.[a-z]+");

        //Match the given string with the pattern
        Matcher m = p.matcher(email);

        //check whether match is found
        boolean matchFound = m.matches();

        if (matchFound)
            System.out.println("Valid Email Id.");
        else
            System.out.println("Invalid Email Id.");
    }
}
