Regular Expressions (RE) Used for specifying text search strings. Standardized and used widely (UNIX:...

Date posted: 18-Dec-2015

Transcript

Page 1:

Regular Expressions (RE)

• Used for specifying text search strings.

• Standardized and used widely (UNIX: vi, perl, grep. Microsoft Word and other text editors…)

• A RE is a notation for characterizing a set of strings. Formally a language is defined as a (possibly infinite) set of strings over a given alphabet.

• A regular expression search consists of a search pattern and a text to search through.
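
A minimal sketch of a search as "pattern plus text", assuming Python's standard `re` module rather than the Perl-style /.../ notation used on the slides. The sample sentence is made up for illustration.

```python
import re

# Hypothetical text to search through; the pattern is the slides' /woodchuck/.
text = "How much wood would a woodchuck chuck?"
pattern = r"woodchuck"

# re.search scans the text and returns the first match, or None if there is none.
match = re.search(pattern, text)
if match:
    print(match.group(), "found at position", match.start())
```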

Page 2:

Basic RE Patterns

• E.g. /woodchuck/

• Case sensitive: /Woodchuck/ is not the same as /woodchuck/

• Disjunction: /[Ww]oodchuck/ : Woodchuck or woodchuck

• Ranges

– /[A-Z]/ : [ABCDEFGHIJKLMNOPQRSTUVWXYZ]

– /[0-9]/ : [0123456789]

• Negation

– [^a] : anything that is not an “a”

– [^A-Z] : anything that is not an uppercase letter

– But: [a^b] : one of “a”, “^” or “b” (the caret negates only when it comes first inside the brackets)
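
The bracket patterns above can be tried out with a short sketch, again assuming Python's `re` module. The helper name `found` and the test strings are illustrative, not from the slides.

```python
import re

def found(pattern, text):
    """Return every non-overlapping match of pattern in text, left to right."""
    return re.findall(pattern, text)

# Disjunction: either capitalisation is matched.
print(found(r"[Ww]oodchuck", "Woodchuck or woodchuck"))

# Range: uppercase letters only.
print(found(r"[A-Z]", "Regular Expressions"))

# Negation: every character that is not an "a".
print(found(r"[^a]", "banana"))

# A caret that is not first in the brackets is an ordinary character.
print(found(r"[a^b]", "a^b"))
```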

Page 3:

Basic RE Patterns

• Optional characters

– /woodchucks?/ : woodchuck or woodchucks

• Zero or more instances (Kleene star)

– /baa*!/ : ba! or baa! or baaa! or baaaa! …

– /c[ab]*c/ : cabababc or caaaac or cc …

– Note: /a*/ matches any string, because zero instances of “a” (the empty string) also count as a match.

• One or more instances

– /ba+!/ : ba! or baa! or baaa! or baaaa! …

– /[0-9]+/: A string of digits.
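
A small sketch of the optional, star and plus operators, assuming Python's `re` module. `re.fullmatch` is used so that the whole candidate string has to fit the pattern; the helper name `accepts` and the candidate lists are illustrative.

```python
import re

def accepts(pattern, candidates):
    """Return the candidates that the pattern matches from start to end."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

print(accepts(r"woodchucks?", ["woodchuck", "woodchucks", "woodchuckss"]))
print(accepts(r"baa*!", ["b!", "ba!", "baaa!"]))
print(accepts(r"c[ab]*c", ["cc", "cabababc", "cabdc"]))
print(accepts(r"ba+!", ["b!", "ba!", "baaa!"]))
print(accepts(r"[0-9]+", ["2015", "", "12a"]))

# /a*/ may match zero characters, so a search succeeds even without any "a".
print(repr(re.search(r"a*", "xyz").group()))
```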

Page 4:

Basic RE Patterns

• Wildcards: /./ matches any single character

– /beg.n/ : begin, begun, beg_n…

• Anchors:

– Pattern at beginning of string: /^the car/ matches “the car I drive” but not “I drive the car”

– Pattern at end of string: /the car$/ matches “I drive the car” but not “the car I drive”

– \b matches a word boundary: /\bthe\b/ matches “the” but not “other”
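
The wildcard and anchor examples can be checked with the following sketch, assuming Python's `re` module. The test sentences for the wildcard and word-boundary cases are made up.

```python
import re

# Wildcard: any single character between "beg" and "n".
print(re.findall(r"beg.n", "begin begun beg_n began bgn"))

# ^ anchors the pattern at the beginning of the string.
print(bool(re.search(r"^the car", "the car I drive")))
print(bool(re.search(r"^the car", "I drive the car")))

# $ anchors the pattern at the end of the string.
print(bool(re.search(r"the car$", "I drive the car")))
print(bool(re.search(r"the car$", "the car I drive")))

# \b requires a word boundary on both sides, so "other" and "then" are skipped.
print(re.findall(r"\bthe\b", "the other one, then the end"))
```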

Page 5:

Basic RE Patterns

• Parentheses: (abc)+ matches abc, abcabc, abcabcabc ...

• Disjunction: /cit(y|ies)/ matches city or cities

• Repetitions: /(abc){3}/ matches abcabcabc

• Backslash: Used for escaping special characters.

– \*, \+, \., \? ...

• Aliases

– \n: newline, \t: tab, \d: [0-9], \w: [a-zA-Z0-9_]
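
A short sketch of grouping, counted repetition, escaping and aliases, assuming Python's `re` module. Two Python-specific details are worth noting: `re.findall` returns group contents when a pattern has a capturing group, so the disjunction uses a non-capturing `(?:...)` group, and `\w` and `\d` are Unicode-aware for `str` patterns unless `re.ASCII` is passed. The test strings are illustrative.

```python
import re

# Parentheses group a sequence so that + applies to all of "abc".
print(bool(re.fullmatch(r"(abc)+", "abcabc")))

# Disjunction inside a (non-capturing) group.
print(re.findall(r"cit(?:y|ies)", "one city, two cities"))

# Counted repetition: exactly three copies of "abc".
print(bool(re.fullmatch(r"(abc){3}", "abcabcabc")))
print(bool(re.fullmatch(r"(abc){3}", "abcabc")))

# A backslash makes a special character literal.
print(re.findall(r"\?", "What? Why?"))

# Aliases; re.ASCII restricts them to the ranges listed on the slide.
print(re.findall(r"\d+", "the 35 boxes weigh 120 kg", flags=re.ASCII))
print(re.findall(r"\w+", "snake_case and CamelCase", flags=re.ASCII))
```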

Page 6:

RE Substitution

• s/regexp/replacement/, e.g. s/colour/color/

• Back references: \1, \2, \3 …

– s/([0-9]+)/<\1>/ : the 35 boxes -> the <35> boxes

– s/^\s*(\w+)\W+(\w+)/\2 \1/ : reverses the first two words of a sentence.

– Also used in search REs

• /A ([a-z]+) is a \1/ : matches “A car is a car” (the parentheses create the group that \1 refers back to).
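
The substitutions above can be sketched with Python's `re.sub`, which plays the role of the s/…/…/ operator here (an assumption about tooling; the slides use Perl-style notation). The sentence used for the word-swap example is made up.

```python
import re

# Plain substitution.
print(re.sub(r"colour", "color", "my favourite colour"))

# \1 in the replacement refers to what the first group matched.
print(re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes"))

# Swap the first two words of a sentence.
print(re.sub(r"^\s*(\w+)\W+(\w+)", r"\2 \1", "hello world again"))

# A back-reference inside the search pattern: both nouns must be identical.
same_noun = r"A ([a-z]+) is a \1"
print(bool(re.search(same_noun, "A car is a car")))
print(bool(re.search(same_noun, "A car is a bus")))
```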

Page 7:

ELIZA

• Simulated the responses of a psychologist based on simple pattern substitution.

• Initially it cascades through a set of RE substitutions that change first-person forms into second-person ones, for example s/I’m/YOU ARE/, s/my/YOUR/ ...

• Then it runs the input through RE substitutions looking for relevant patterns and produces the appropriate output. e.g.

s/.* YOU ARE (depressed|sad) .*/I AM SORRY TO HEAR THAT YOU ARE \1/

s/.* YOU ARE (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1\?/

s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/
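
A toy sketch of the two-stage cascade, assuming Python's `re` module. It is not the original ELIZA program: the patterns use word boundaries (`\b`) instead of the literal spaces on the slide so that matches also succeed at the start or end of the input, and the fallback reply is a hypothetical addition.

```python
import re

# Stage 1: person swaps taken from the slide (both apostrophe styles accepted).
PERSON_SWAPS = [
    (r"\bI['’]m\b", "YOU ARE"),
    (r"\bmy\b", "YOUR"),
]

# Stage 2: response templates, tried in order; the first full match wins.
RESPONSES = [
    (r".*\bYOU ARE (depressed|sad)\b.*", r"I AM SORRY TO HEAR THAT YOU ARE \1"),
    (r".*\balways\b.*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
]

def reply(utterance):
    """Apply the person swaps, then answer with the first matching template."""
    text = utterance
    for pattern, replacement in PERSON_SWAPS:
        text = re.sub(pattern, replacement, text)
    for pattern, template in RESPONSES:
        match = re.fullmatch(pattern, text)
        if match:
            return match.expand(template)  # fills in \1 from the match
    return "PLEASE GO ON"  # hypothetical fallback, not from the slides

print(reply("I'm depressed most of the time"))
print(reply("People always ignore me"))
```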

Page 8:

Finite State Automata (FSA)

• REs (that don’t use back-references) can be implemented as finite-state automata.

• Conversely, any FSA can be described by a regular expression.

• The languages that can be described by a RE or a FSA form a class called Regular Languages (RL).

Page 9:

Finite State Automata

• A FSA is represented as a graph with a finite set of nodes (called states) and directed arcs between pairs of states (called transitions) labeled with symbols from the alphabet.

• One state is a start state, represented by an incoming arrow.

• Some states are final or accepting states represented by a double circle.

Page 10:

FSA Example

Sheeptalk: baa! baaa! baaaa! baaaaa! …

Equivalent to RE: /baaa*!/

Page 11:

FSA Recognition

Examples:

baaa! Succeeds

aba!b Fails

Page 12:

FSA State Transition Table

• Alternative representation for FSA
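
The slide's table is not reproduced in this transcript, so the following is a sketch of one possible transition table for the sheeptalk language /baaa*!/, written as a Python dictionary. The state numbering (0 to 4) is an assumption; missing entries stand for "no transition".

```python
# TABLE[state][symbol] -> next state. A missing symbol means the FSA fails.
TABLE = {
    0: {"b": 1},
    1: {"a": 2},
    2: {"a": 3},
    3: {"a": 3, "!": 4},  # the loop on "a" implements the Kleene star
    4: {},
}
START = 0
FINAL = {4}

def recognize(tape):
    """Return True if the deterministic FSA accepts the whole tape."""
    state = START
    for symbol in tape:
        state = TABLE[state].get(symbol)
        if state is None:  # no arc for this symbol: reject
            return False
    return state in FINAL

print(recognize("baaa!"))  # the slide's succeeding example
print(recognize("aba!b"))  # the slide's failing example
```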

Page 13:

FSA Example

Page 14:

Formal FSA Definition

• Q: a finite set of states (q0, q1, q2, …)

• Σ: a finite input alphabet of symbols

• q0: the start state (first state)

• F: the set of final states (a subset of Q)

• δ(q,i): the transition function from states and inputs to states. Given a state q and an input i, it returns a new state q’.

Deterministic FSA (DFSA): the recognition of a string has no choice points.

Page 15:

Non Deterministic FSA (NFSA)

• When in state q2 with input a, the FSA has the choice to move to state q3 or remain in state q2.

Page 16:

Empty Arcs

From state q3 the FSA can move to state q2, without looking at the input (without advancing the tape).

Page 17:

NFSA Transition Tables

An extra ε column is added.

The transitions are now sets of states (instead of single states)
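
A sketch of such a table in Python, assuming a sheeptalk NFSA with an empty arc from q3 back to q2 as described on the previous slide. The rest of the machine and the state numbering are assumptions, since the figure is not part of the transcript. Each cell holds a set of states, and the helper `epsilon_closure` (an illustrative name) collects everything reachable without reading input.

```python
EPS = "ε"  # label used here for the empty-arc column

# NFSA_TABLE[state][symbol] -> set of possible next states.
NFSA_TABLE = {
    0: {"b": {1}},
    1: {"a": {2}},
    2: {"a": {3}},
    3: {"!": {4}, EPS: {2}},  # empty arc: back to state 2 without reading input
    4: {},
}

def epsilon_closure(states):
    """Return the given states plus all states reachable via empty arcs only."""
    closure = set(states)
    stack = list(states)
    while stack:
        for nxt in NFSA_TABLE[stack.pop()].get(EPS, set()):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return closure

print(epsilon_closure({3}))
```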

Page 18:

Accepting Strings with NFSA

• Since there is a choice of which arc to follow it is possible to take the wrong path and reject a string that should be accepted.

• All possible paths should be followed and if even one reaches a final state then the string is accepted.

• Computational approaches

– Backup: at each choice point we store the current search-state (the state of the FSA and the position of the tape); when we reach a dead end we back up to that search-state and try another path from there.

– Lookahead: We look ahead in the input to decide which path to take.

– Parallelism: Alternative paths are explored in parallel.
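
The parallel approach can be sketched by tracking the set of all states the NFSA could currently be in. The machine below is an assumed sheeptalk NFSA with a choice point in state 2 on input "a" (stay in 2 or move to 3) and no empty arcs; the numbering is illustrative.

```python
# NFSA[state][symbol] -> set of possible next states.
NFSA = {
    0: {"b": {1}},
    1: {"a": {2}},
    2: {"a": {2, 3}},  # choice point: stay in 2 or move on to 3
    3: {"!": {4}},
    4: {},
}
START = 0
FINAL = {4}

def accepts_parallel(tape):
    """Follow every alternative path at once by keeping a set of states."""
    current = {START}
    for symbol in tape:
        current = {
            nxt
            for state in current
            for nxt in NFSA[state].get(symbol, set())
        }
        if not current:  # every path has hit a dead end
            return False
    # Accept if at least one surviving path ends in a final state.
    return bool(current & FINAL)

print(accepts_parallel("baaa!"))
print(accepts_parallel("ba!"))
```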

Page 19:

NFSA Recognition as Search

• The NFSA recognition can be seen as a search through a space of search-states. This consists of all the possible pairings of FSA-states and tape positions.

• The order that these search-states are visited (i.e. the decision about which possible path to follow) is important for performance.

• Two common orderings are depth-first and breadth-first search.

• For larger search spaces it may be necessary to use more complex search techniques (e.g. dynamic programming or A*).
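
A sketch of recognition as search over (FSA-state, tape-position) pairs, using an agenda of unexplored search-states. Taking the newest entry gives depth-first search and taking the oldest gives breadth-first search. The machine is an assumed sheeptalk NFSA with an empty arc from state 3 to state 2, and the `seen` set, which prevents revisiting a search-state, is an addition of this sketch.

```python
from collections import deque

EPS = "ε"
NFSA = {
    0: {"b": {1}},
    1: {"a": {2}},
    2: {"a": {3}},
    3: {"!": {4}, EPS: {2}},  # empty arc back to state 2
    4: {},
}
START = 0
FINAL = {4}

def accepts_search(tape, depth_first=True):
    """Explore search-states (state, position) until one accepts or none remain."""
    agenda = deque([(START, 0)])
    seen = {(START, 0)}
    while agenda:
        # Newest entry first = depth-first; oldest first = breadth-first.
        state, pos = agenda.pop() if depth_first else agenda.popleft()
        if pos == len(tape) and state in FINAL:
            return True
        # Empty arcs change the state but do not advance the tape.
        successors = [(nxt, pos) for nxt in NFSA[state].get(EPS, set())]
        if pos < len(tape):
            successors += [(nxt, pos + 1) for nxt in NFSA[state].get(tape[pos], set())]
        for search_state in successors:
            if search_state not in seen:
                seen.add(search_state)
                agenda.append(search_state)
    return False

print(accepts_search("baaa!", depth_first=True))
print(accepts_search("baaa!", depth_first=False))
```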

Page 20:
Page 21:

Relating DFSA and NFSA

• For every NFSA there exists an equivalent DFSA (i.e. that accepts exactly the same set of strings).

• The idea behind the proof is based on converting a NFSA to an equivalent DFSA. The resulting DFSA may have many more states than the original NFSA (up to 2^N states for a NFSA with N states).
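
The conversion can be sketched with the subset construction, in which each DFSA state is a set of NFSA states. The example machine is an assumed sheeptalk NFSA with a choice point in state 2 and no empty arcs (handling empty arcs would additionally require taking closures over them); all names are illustrative.

```python
NFSA = {
    0: {"b": {1}},
    1: {"a": {2}},
    2: {"a": {2, 3}},  # non-deterministic choice
    3: {"!": {4}},
    4: {},
}
START = 0
FINAL = {4}
ALPHABET = "ab!"

def determinize():
    """Build a DFSA whose states are frozensets of NFSA states."""
    start = frozenset({START})
    table = {}
    todo = [start]
    while todo:
        subset = todo.pop()
        if subset in table:
            continue
        table[subset] = {}
        for symbol in ALPHABET:
            target = frozenset(
                nxt for state in subset for nxt in NFSA[state].get(symbol, set())
            )
            if target:  # the empty set is left out and treated as failure
                table[subset][symbol] = target
                todo.append(target)
    finals = {subset for subset in table if subset & FINAL}
    return start, table, finals

def accepts_dfsa(tape):
    """Run the constructed DFSA; there are no choice points left."""
    state, table, finals = determinize()
    for symbol in tape:
        state = table[state].get(symbol)
        if state is None:
            return False
    return state in finals

start, table, finals = determinize()
print(len(table), "reachable DFSA states out of at most", 2 ** len(NFSA))
```

Only the subsets reachable from the start state are built, which is why this machine stays far below the 2^N bound.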

Page 22:

Morphological Parsing and Recognition

• Morphological recognition: Accepts and rejects forms:

– Accept: geese

– Reject: gooses

• Morphological parsing: produces a morphological analysis (stem followed by morphological features)

– geese: goose +N +PL

– cats: cat +N +PL

– ground: ground +N +SG, grind +V +PPart

Page 23:

Morphological Parsing

• A morphological parser is composed of

– lexicon: the list of stems and affixes in a language, together with basic information about them.

– morphotactics: model of morpheme ordering, that defines which morpheme classes may follow other classes.

– orthographic rules: spelling rules used to model changes that occur in the language (e.g. city+s -> cities)

Page 24:

Lexicon

• A repository of words:

a, AAA, AA, Aachen, aardvark, aardwolf...

• Not practical to list every word in the language, and impossible for some languages (e.g. Finnish, Turkish...). Usually only the stems and the affixes are listed.

• Ideally every possible word (or stem) should be in the lexicon, including abbreviations and proper names.

• Often along with stems in the lexicon we keep information about stem classes.

– e.g. dog: reg-noun, goose: irreg-sg-noun,

– geese: irreg-pl-noun, -s: plural-suffix

Page 25:

Morphotactics

• Commonly represented as a FSA.

• e.g. Simple FSA for plural formation in English
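
The slide's figure is not reproduced in this transcript, so the following is a sketch of one plausible plural-formation FSA whose arcs are labeled with the stem classes from the lexicon slide. The state names, the arc layout and the naive "stem + s" segmentation are assumptions, and no orthographic rules (such as city+s -> cities) are applied.

```python
# Tiny lexicon: morpheme -> class, using the classes from the lexicon slide.
LEXICON = {
    "dog": "reg-noun",
    "cat": "reg-noun",
    "goose": "irreg-sg-noun",
    "geese": "irreg-pl-noun",
    "s": "plural-suffix",
}

# Morphotactics: (state, morpheme class) -> next state.
ARCS = {
    ("q0", "reg-noun"): "q1",
    ("q0", "irreg-sg-noun"): "q2",
    ("q0", "irreg-pl-noun"): "q2",
    ("q1", "plural-suffix"): "q2",
}
FINAL = {"q1", "q2"}

def accepts_morphemes(morphemes):
    """Run the FSA over a sequence of morphemes, reading their classes."""
    state = "q0"
    for morpheme in morphemes:
        state = ARCS.get((state, LEXICON.get(morpheme)))
        if state is None:
            return False
    return state in FINAL

def recognize(word):
    """Try the word as a single morpheme, then as stem + plural suffix."""
    segmentations = [[word]]
    if len(word) > 1 and word.endswith("s"):
        segmentations.append([word[:-1], "s"])
    return any(accepts_morphemes(seg) for seg in segmentations)

for word in ["cats", "geese", "gooses"]:
    print(word, recognize(word))
```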

Page 26:

Morphotactics

• In cases where a morphological process is more complicated, or not fully productive (unhappy, unreal but *unbig, *unred), the morphotactics FSA may become quite complicated and many different stem classes may be necessary.