Finding the needle(s) in the textual haystack
Textual Patterns
http://www.flickr.com/photos/seadigs/4810947743/sizes/z/in/photostream/
Patterns
Consider the text below. How would you identify… Proper names?… Email addresses?… Dates?
From: Gow, Joe <[email protected]>
Subject: Reminder About Open Forums Today
Date: March 25, 2011 8:44:08 AM CDT
Bcc: [email protected]
Hello, everyone. I just wanted to send a quick reminder about the two campus wide Open Forums we're holding today from 2 to 3 and 3 to 4 p.m. in the Cleary Center. I'll host the first session from 2 to 3, and we'll cover any topics you'd like to discuss. Then from 3 to 4 Vice Chancellor Bob Hetzel will lead a conversation about the plans for a new Cowley Science Building. Please join us!
Thanks,
Joe
Joe Gow, Chancellor
University of Wisconsin-La Crosse
Patterns
What do you think of when you see the following?
MM/DD/YYYY
This is a (string) pattern.
Are there different patterns for this same thing?
How would you describe the pattern of a credit card number?
Regular Expressions
Regular expressions are “formulas” for string patterns.
Regular expressions follow a standard notation.
Regular expressions can be used in various computer applications and programming languages.
Applying a regular expression to a string (piece of text) is called pattern matching.
- The regular expression might match the string (or part of it) or it might not.
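As a minimal sketch of pattern matching, the snippet below uses Python's built-in re module. The choice of Python is an assumption (the slides do not name a language), and the helper name match_kind is hypothetical. It shows the three outcomes described above: the pattern matches the whole string, part of it, or nothing.

```python
import re

# A pattern made only of ordinary characters matches exactly those characters.
pattern = re.compile("Forum")

def match_kind(text):
    """Report whether the pattern matches all of text, part of it, or nothing."""
    if pattern.fullmatch(text):
        return "whole string"
    if pattern.search(text):
        return "part of string"
    return "no match"

for sample in ["Forum", "Open Forums Today", "Cleary Center"]:
    print(sample, "->", match_kind(sample))
```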
Regular Expression Notation
Regular expressions use a standard pattern language.
Any (non-meta) character is a pattern. The character pattern represents itself.
The '.' (period) is a pattern. The period (a meta character) pattern represents "any character".
If A and B are both patterns, then so are
AB : This represents the pattern A followed by pattern B
F. matches Fa FR and F3 but not fa or aF
A|B : This represents either the pattern A or the pattern B
P|Q matches P and Q but not R
Parentheses are special; they form a pattern group. Anything in parentheses is a group. A group is one "thing".
(red|blue) fish matches what strings?
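The notation rules can be tried out with a short sketch, again assuming Python's re module. The helper name matches is hypothetical; it requires the pattern to cover the entire string, which is how the "matches ... but not ..." statements on this slide are read here.

```python
import re

def matches(pattern, text):
    """True when the pattern matches the entire string."""
    return re.fullmatch(pattern, text) is not None

# F. : the letter F followed by any single character
print([s for s in ["Fa", "FR", "F3", "fa", "aF"] if matches("F.", s)])

# P|Q : either P or Q
print([s for s in ["P", "Q", "R"] if matches("P|Q", s)])

# (red|blue) fish : the group is one "thing", followed by " fish"
candidates = ["red fish", "blue fish", "red", "green fish"]
print([s for s in candidates if matches("(red|blue) fish", s)])
```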
Example
How would you write an expression for the time on a digital 12-hour clock?
[HINT: Let’s divide & conquer]
A regular expression matching any possible hour:
1|2|3|4|5|6|7|8|9|10|11|12
A regular expression matching any possible minute:
(0|1|2|3|4|5)(0|1|2|3|4|5|6|7|8|9)
A regular expression matching any possible time:
(1|2|3|4|5|6|7|8|9|10|11|12):(0|1|2|3|4|5)(0|1|2|3|4|5|6|7|8|9)
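The divide-and-conquer idea for the 12-hour clock can be sketched in Python (an assumed language) by building the hour and minute patterns separately and joining them with a colon. The names HOUR, MINUTE, TIME and is_clock_time are hypothetical.

```python
import re

# Each part is written with alternation only, as on the slide.
HOUR = "(1|2|3|4|5|6|7|8|9|10|11|12)"
MINUTE = "(0|1|2|3|4|5)(0|1|2|3|4|5|6|7|8|9)"
TIME = HOUR + ":" + MINUTE

def is_clock_time(text):
    """True when the whole string is a valid 12-hour clock time."""
    return re.fullmatch(TIME, text) is not None

for sample in ["12:30", "7:05", "13:00", "7:60"]:
    print(sample, is_clock_time(sample))
```

Matching the whole string matters here: a search inside "13:00" would still find the valid time "3:00".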
Repeating Patterns within Patterns
Quantifiers are used to allow and constrain repetitions. If re is a regular expression (pattern), then so are:
re* represents zero or more repetitions of re
re+ represents one or more repetitions of re
re? represents zero or one occurrences of re
re{n} represents exactly n repetitions of re (n is some positive integer)
re{m,n} represents at least m and no more than n repetitions of re
(n, m are positive integers, m ≤ n)
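A small sketch of each quantifier, assuming Python's re module. The patterns and test strings are illustrative choices, and the helper name full is hypothetical.

```python
import re

def full(pattern, text):
    """True when the pattern matches the entire string."""
    return re.fullmatch(pattern, text) is not None

# pattern -> strings to try against it
trials = {
    "ab*": ["a", "ab", "abbb"],          # zero or more b
    "ab+": ["a", "ab", "abbb"],          # one or more b
    "ab?": ["a", "ab", "abb"],           # zero or one b
    "a{3}": ["aa", "aaa", "aaaa"],       # exactly three a
    "a{2,3}": ["a", "aa", "aaa", "aaaa"] # two or three a
}

for pattern, texts in trials.items():
    print(pattern, [t for t in texts if full(pattern, t)])
```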
Write a regular expression for Social Security Numbers, such as 123-45-6789.
Example
Text: I sometimes wonder if the manufacturers of foolproof items keep a fool or two on their payroll.
Pattern: o{2}1?
Escaped Characters
Some characters have special meaning in regular expressions, and others have no printable form. Such characters can still be represented using a two-character notation, known as an escape code.
\+ represents +
\. represents .
The same technique works for * ? ( ) { } [ ] \ ^ $ |
\n represents the new line character
\t represents the tab character
\r represents the carriage return character
\v represents the vertical tab character
\f represents the form feed character
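The effect of escaping can be seen in a short sketch, assuming Python's re module (raw strings, written r"...", keep the backslash intact). The helper name full is hypothetical; re.escape is a standard-library function that escapes meta characters for you.

```python
import re

def full(pattern, text):
    """True when the pattern matches the entire string."""
    return re.fullmatch(pattern, text) is not None

# Unescaped, the period means "any character"; escaped, it means a literal period.
print(full(r"3.14", "3x14"))   # period acts as a wildcard
print(full(r"3\.14", "3x14"))  # escaped period must be a real "."
print(full(r"3\.14", "3.14"))

# An escaped plus sign is a literal +, not a quantifier.
print(full(r"1\+1", "1+1"))

# \t in a pattern stands for the tab character.
print(full(r"a\tb", "a\tb"))

# re.escape builds the escaped form of a literal string.
literal = re.escape("B.C.")
print(full(literal, "B.C."), full(literal, "BxCx"))
```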
Location Symbols
There are also two “location” symbols.
^ matches the start of a line, including right after \n
$ matches the end of a line, including right before \n
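A sketch of the location symbols, assuming Python's re module. One Python-specific detail: ^ and $ anchor at every line boundary only when the re.MULTILINE flag is given; without it they anchor at the start and end of the whole string. The sample text is made up for illustration.

```python
import re

text = "Right now.\nNot right now.\nRight now!"

# With re.MULTILINE, ^ anchors at the start of every line.
line_starts = re.findall(r"^Right", text, flags=re.MULTILINE)

# Without the flag, ^ anchors only at the start of the string.
string_start = re.findall(r"^Right", text)

# With re.MULTILINE, $ anchors at the end of every line.
line_ends = re.findall(r"now.$", text, flags=re.MULTILINE)

print(line_starts)
print(string_start)
print(line_ends)
```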
Sample Regular Expressions
(snow|rain)(flake|drop)
g(rr|ee)*
W.*W
B\.C\.
^Right now.$
^Right now.\$
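Each sample expression can be checked against a few strings. This is a sketch assuming Python's re module; the test strings are illustrative choices and the helper name full is hypothetical.

```python
import re

def full(pattern, text):
    """True when the pattern matches the entire string."""
    return re.fullmatch(pattern, text) is not None

# pattern -> strings to try against it
samples = {
    r"(snow|rain)(flake|drop)": ["snowflake", "raindrop", "snowdrop", "snow"],
    r"g(rr|ee)*": ["g", "grr", "grree", "gr"],
    r"W.*W": ["WW", "WoW", "Wow"],
    r"B\.C\.": ["B.C.", "BxCx"],
    r"^Right now.$": ["Right now.", "Right now!", "Right now"],
    r"^Right now.\$": ["Right now.$", "Right now."],
}

for pattern, texts in samples.items():
    print(pattern, [t for t in texts if full(pattern, t)])
```

In the last two patterns the period is unescaped, so it stands for any character, and \$ is a literal dollar sign rather than an end-of-line anchor.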
Character Classes
Square brackets enclose a character class (a set of characters). The class will match any one character from the set. Within brackets…
specific characters can be listed
ranges are denoted using -
Examples
[aDb] matches a or D or b and nothing else
[c-e] matches c or d or e and nothing else
[a-z] matches any lowercase letter and nothing else
[a-zA-Z0-9] matches any alphabetic or numeric symbol
[a+*] matches a or + or * and nothing else
Examples
Which of the following match [a-z][0-9]*
abc1z93a-9
Which of the following match [0-9]*[02468]
039929354
Give a pattern for social security numbers using character classes.
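A sketch of character classes, assuming Python's re module. The Social Security Number pattern is one possible answer, not necessarily the one intended by the slides; the name SSN, the helper full, and the test strings are hypothetical.

```python
import re

def full(pattern, text):
    """True when the pattern matches the entire string."""
    return re.fullmatch(pattern, text) is not None

# A class matches exactly one character from its set.
print(full("[aDb]", "D"), full("[aDb]", "d"))
print(full("[c-e]", "d"), full("[c-e]", "f"))
print(full("[a+*]", "+"))  # + and * are ordinary characters inside brackets

# A lowercase letter followed by any number of digits.
print([s for s in ["z93", "a", "a-9", "1"] if full("[a-z][0-9]*", s)])

# One possible SSN pattern: three digits, two digits, four digits.
SSN = "[0-9]{3}-[0-9]{2}-[0-9]{4}"
print(full(SSN, "123-45-6789"), full(SSN, "123456789"))
```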
Example 1: Phone Numbers
Create a regular expression to match phone numbers. The phone numbers can take on the following forms:
800-555-1212
800 555 1212
800.555.1212
1-800-555-1212
800-555-1212-1234
800-555-1212x1234
• Divide and conquer
Note that each phone number has at most five parts.
• prefix (the number 1)
• area code
• trunk (first three digits)
• rest (next 4 digits)
• extension (last digits; may be between 1 and 4 in length)
• Consider defining each of these parts
– what is the prefix?
– what is the area code?
– what is the trunk?
– what is the rest?
– what is the extension?
• We need to 'conquer' by combining the solutions for the parts.
• Rules:
– The prefix is optional
– One of the following must occur between the prefix and the area code: space, comma, dash, period
– One of the following must occur between the area code and the trunk: space, comma, dash, period
– One of the following must occur between the trunk and the rest: space, comma, dash, period
– An ‘x’ must occur between the rest and the extension.
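One way to combine the parts is sketched below, assuming Python's re module. It follows the rules exactly as stated, so the extension must be introduced by an x and is treated as optional (an assumption). All names (SEP, PREFIX, AREA, TRUNK, REST, EXT, PHONE, is_phone) are hypothetical.

```python
import re

SEP = r"[ ,.-]"                 # space, comma, period, dash (dash last, so it is literal)
PREFIX = r"(1" + SEP + r")?"    # optional leading 1 plus its separator
AREA = r"[0-9]{3}"
TRUNK = r"[0-9]{3}"
REST = r"[0-9]{4}"
EXT = r"(x[0-9]{1,4})?"         # optional extension, 1 to 4 digits after an x

PHONE = PREFIX + AREA + SEP + TRUNK + SEP + REST + EXT

def is_phone(text):
    """True when the whole string is a phone number under the stated rules."""
    return re.fullmatch(PHONE, text) is not None

for sample in ["800-555-1212", "800 555 1212", "800.555.1212",
               "1-800-555-1212", "800-555-1212x1234", "800-555-1212-1234"]:
    print(sample, is_phone(sample))
```

Under these rules the listed form 800-555-1212-1234 is rejected, because its extension follows a dash rather than an x. Accepting it would require widening EXT, for example to allow either character before the extension digits.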
Example 2: User Name
Suppose the rules for some system are that a user name must begin with a capital letter, followed by lowercase letters and/or dashes and/or periods. The length of a user name is restricted to 3 to 16 characters.
Examples
Dave
D.-riley
Rdave
Invalid
dave (doesn’t begin with a capital letter)
DDR3 (capital letters and digits not permitted after first symbol)
R (too short)
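A possible pattern for these user-name rules, sketched with Python's re module (an assumed language). The names USER_NAME and is_user_name are hypothetical. One capital letter plus 2 to 15 further characters gives a total length of 3 to 16.

```python
import re

# One capital letter, then 2 to 15 lowercase letters, periods, or dashes.
USER_NAME = r"[A-Z][a-z.-]{2,15}"

def is_user_name(text):
    """True when the whole string is a valid user name under the stated rules."""
    return re.fullmatch(USER_NAME, text) is not None

for sample in ["Dave", "D.-riley", "Rdave", "dave", "DDR3", "R"]:
    print(sample, is_user_name(sample))
```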
Example 3: MAC Address
Every computer network connection has a unique MAC address that is expressed as six numbers separated by colons. Each number consists of two hexadecimal digits.
Examples
10:22:93:04:91:00
AF:0C:AA:ED:B7:21
Invalid
10:22:93:04:91 (too short)
10:22:013:04:91 (numbers must be two digits long, not three)
AG:0C:AA:ED:B7:21 (the letter “G” is not a hexadecimal digit)
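A possible MAC address pattern, sketched with Python's re module (an assumed language). It accepts uppercase hexadecimal digits only, following the examples; whether lowercase a-f should also be allowed is not stated, so that restriction is an assumption. The names HEX_PAIR, MAC and is_mac are hypothetical.

```python
import re

HEX_PAIR = r"[0-9A-F]{2}"                    # two hexadecimal digits (uppercase only)
MAC = HEX_PAIR + r"(:" + HEX_PAIR + r"){5}"  # first pair, then five more ":pair" groups

def is_mac(text):
    """True when the whole string is six colon-separated hex pairs."""
    return re.fullmatch(MAC, text) is not None

for sample in ["10:22:93:04:91:00", "AF:0C:AA:ED:B7:21",
               "10:22:93:04:91", "10:22:013:04:91", "AG:0C:AA:ED:B7:21"]:
    print(sample, is_mac(sample))
```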
Example 4: IPv4
Internet addresses are referred to as IP numbers. A common address consists of four non-negative integers separated by periods. These integers must each be within the range of 0-255.
Examples
1.01.001.0
255.255.255.255
193.24.17.2
Invalid
256.255.255.255 (no number can be greater than 255)
193.24.175. (too few numbers)
193.24:17.2 (separators must be periods)
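Regular expressions describe text, not numeric ranges, so the 0-255 limit has to be spelled out digit by digit. The sketch below, assuming Python's re module, is one way to do that; leading zeros are allowed up to three digits because the example 1.01.001.0 is listed as valid. The names OCTET, IPV4 and is_ipv4 are hypothetical.

```python
import re

# 250-255, or 200-249, or any one- to three-digit number from 0 to 199
# (the last alternative also admits leading zeros such as 01 and 001).
OCTET = r"(25[0-5]|2[0-4][0-9]|[01]?[0-9]?[0-9])"
IPV4 = OCTET + r"(\." + OCTET + r"){3}"

def is_ipv4(text):
    """True when the whole string is four period-separated numbers in 0-255."""
    return re.fullmatch(IPV4, text) is not None

for sample in ["1.01.001.0", "255.255.255.255", "193.24.17.2",
               "256.255.255.255", "193.24.175.", "193.24:17.2"]:
    print(sample, is_ipv4(sample))
```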
Example 5: Email Addresses
• An email address consists of two strings separated by a @
localString @ domainString
• localString
– Must be one or more of the following characters: alphabetic, digits (0 through 9), or any of these !#$&’+-_/=?^`{|}~
– Periods are permitted but with the following restrictions: the first and last characters cannot be periods and there cannot be any consecutive periods.
– Note: There is another unusual notation for selected characters only allowed inside double quotes, which we will ignore.
• domainString
– Must be one or more of the following characters: alphabetic, digits, dashes or periods.
– Alternately, the domain could be written as a pair of square brackets enclosing four numbers separated by periods, where each of the four numbers is a non-negative number of one to three digits.
e.g., [138.93.200.0]
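The email rules can be combined in the same divide-and-conquer style. This is a sketch assuming Python's re module and a plain ASCII apostrophe in the list of permitted symbols; it implements only the simplified rules given here, not the full email standard. All names and the example.edu addresses are hypothetical.

```python
import re

# One or more permitted local characters (no periods).
ATOM = r"[A-Za-z0-9!#$&'+\-_/=?^`{|}~]+"
# Atoms joined by single periods: no leading, trailing, or doubled periods.
LOCAL = ATOM + r"(\." + ATOM + r")*"

# Domain as letters, digits, dashes, and periods.
DOMAIN_NAME = r"[A-Za-z0-9.-]+"
# Domain as four one- to three-digit numbers inside square brackets.
DOMAIN_NUMERIC = r"\[[0-9]{1,3}(\.[0-9]{1,3}){3}\]"

EMAIL = LOCAL + r"@(" + DOMAIN_NAME + r"|" + DOMAIN_NUMERIC + r")"

def is_email(text):
    """True when the whole string satisfies the simplified rules above."""
    return re.fullmatch(EMAIL, text) is not None

for sample in ["joe.gow@example.edu", "user@[138.93.200.0]",
               ".joe@example.edu", "joe..gow@example.edu", "joe.@example.edu"]:
    print(sample, is_email(sample))
```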