
Regular Expressions (RE)

• Used for specifying text search strings.

• Standardized and used widely (UNIX: vi, perl, grep. Microsoft Word and other text editors…)

• A RE is a notation for characterizing a set of strings. Formally, a language is defined as a (possibly infinite) set of strings over a given alphabet.

• A regular expression search consists of a search pattern and a text to search through.

Basic RE Patterns

• E.g. /woodchuck/

• Case sensitive: /Woodchuck/ is not the same as /woodchuck/

• Disjunction: /[Ww]oodchuck/ : Woodchuck or woodchuck

• Ranges

– /[A-Z]/ : [ABCDEFGHIJKLMNOPQRSTUVWXYZ]

– /[0-9]/ : [0123456789]

• Negation

– [^a] : anything that is not an “a”

– [^A-Z] : anything that is not an uppercase letter

– But: a caret that is not the first character inside the brackets is literal: [a^b] matches “a”, “^” or “b”
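
The character-class patterns above can be tried out with a minimal Python sketch. It assumes Python's standard `re` module in place of the Perl-style /.../ notation used on the slides, and the sample strings are made up for illustration.

```python
import re

# Literal patterns are case sensitive.
print(bool(re.search(r"woodchuck", "Woodchuck")))      # False
# A bracketed class matches any one of the listed characters.
print(bool(re.search(r"[Ww]oodchuck", "Woodchuck")))   # True
# Ranges inside brackets.
print(re.findall(r"[A-Z]", "Hello World"))             # ['H', 'W']
print(re.findall(r"[0-9]", "Room 42"))                 # ['4', '2']
# A caret in first position negates the class.
print(re.findall(r"[^A-Z]", "AbC"))                    # ['b']
# A caret anywhere else in the class is an ordinary character.
print(re.findall(r"[a^b]", "a^b"))                     # ['a', '^', 'b']
```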

Basic RE Patterns

• Optional characters

– /woodchucks?/ : woodchuck or woodchucks

• Zero or more instances (Kleene star)

– /baa*!/ : ba! or baa! or baaa! or baaaa! …

– /c[ab]*c/ : cabababc or caaaac or cc …

– Note: /a*/ also matches the empty string, so it finds a (possibly empty) match in every string.

• One or more instances

– /ba+!/ : ba! or baa! or baaa! or baaaa! …

– /[0-9]+/: A string of digits.
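
A small Python sketch of the three quantifiers, again assuming the standard `re` module. The helper `full` is a hypothetical name introduced here so that each test checks the whole string rather than a substring.

```python
import re

def full(pattern, s):
    """True if the whole string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

print(full(r"woodchucks?", "woodchucks"))          # True
print(full(r"baa*!", "ba!"))                       # True: a* may match zero a's
print(full(r"ba+!", "b!"))                         # False: + needs at least one a
print(full(r"c[ab]*c", "cc"))                      # True
print(re.findall(r"[0-9]+", "35 boxes, 7 lids"))   # ['35', '7']
# a* can match the empty string, so a search succeeds even with no "a" present.
print(re.search(r"a*", "xyz").group() == "")       # True
```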

Basic RE Patterns

• Wildcards: /./ matches any character

– /beg.n/ : begin, begun, beg_n…

• Anchors:

– Pattern at beginning of string: /^the car/ matches “the car I drive” but not “I drive the car”

– Pattern at end of string: /the car$/ matches “I drive the car” but not “the car I drive”

– \b matches a word boundary: /\bthe\b/ matches “the” but not “other”
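
The wildcard and anchor examples can be checked with a short Python sketch (standard `re` module assumed; the test sentences beyond those on the slide are illustrative).

```python
import re

# The dot matches any single character (except a newline by default).
print(re.findall(r"beg.n", "begin begun began"))          # ['begin', 'begun', 'began']
# ^ anchors the pattern at the start of the string, $ at the end.
print(bool(re.search(r"^the car", "the car I drive")))    # True
print(bool(re.search(r"^the car", "I drive the car")))    # False
print(bool(re.search(r"the car$", "I drive the car")))    # True
# \b requires a word boundary on each side of "the".
print(re.findall(r"\bthe\b", "the other one, not the"))   # ['the', 'the']
print(bool(re.search(r"\bthe\b", "other")))               # False
```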

Basic RE Patterns

• Parentheses: (abc)+ matches abc, abcabc, abcabcabc ...

• Disjunction: /cit(y|ies)/ matches city or cities

• Repetitions: /(abc){3}/ matches abcabcabc

• Backslash: Used for escaping special characters.

– \*, \+, \., \? ...

• Aliases

– \n: newline, \t: tab, \d: [0-9], \w: [a-zA-Z0-9_]
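
A Python sketch of grouping, disjunction, counted repetition, escaping and aliases, assuming the standard `re` module. Two Python-specific details are not from the slides: `(?:...)` is a non-capturing group, used here only so that `findall` returns the whole match, and on Python 3 strings `\d` and `\w` also accept non-ASCII digits and letters.

```python
import re

print(bool(re.fullmatch(r"(abc)+", "abcabc")))        # True
print(re.findall(r"cit(?:y|ies)", "city cities"))     # ['city', 'cities']
print(bool(re.fullmatch(r"(abc){3}", "abcabcabc")))   # True
# An escaped metacharacter matches itself.
print(re.findall(r"\.", "a.b.c"))                     # ['.', '.']
print(re.findall(r"\d+", "tel 555 0199"))             # ['555', '0199']
print(re.findall(r"\w+", "snake_case word"))          # ['snake_case', 'word']
```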

RE Substitution

• s/regexp/replacement/ E.g. s/colour/color/

• Back references: \1, \2, \3 …

– s/([0-9]+)/<\1>/ : the 35 boxes -> the <35> boxes

– s/^\s*(\w+)\W+(\w+)/\2 \1/ : reverses the first two words of a sentence.

– Also used in search REs

• /A ([a-z]+) is a \1/ : matches “A car is a car”. The parentheses are needed so that \1 has a group to refer to.
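
The substitutions above map onto Python's `re.sub`, which takes the pattern, the replacement and the input text as separate arguments. This is a sketch under that assumption; the input sentences other than those on the slide are illustrative.

```python
import re

print(re.sub(r"colour", "color", "my favourite colour"))   # my favourite color
# \1 in the replacement refers to the first parenthesised group.
print(re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes"))        # the <35> boxes
# Swap the first two words.
print(re.sub(r"^\s*(\w+)\W+(\w+)", r"\2 \1", "hello world again"))   # world hello again
# A back-reference inside the search pattern needs a capturing group.
print(bool(re.search(r"A ([a-z]+) is a \1", "A car is a car")))      # True
print(bool(re.search(r"A ([a-z]+) is a \1", "A car is a bus")))      # False
```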

ELIZA

• Simulated the responses of a psychologist based on simple pattern substitution.

• First it cascades through a set of RE substitutions that turn first-person forms into second-person ones, for example s/I’m/YOU ARE/, s/my/YOUR/ ...

• Then it runs the input through RE substitutions looking for relevant patterns and produces the appropriate output. e.g.

s/.* YOU ARE (depressed|sad) .*/I AM SORRY TO HEAR THAT YOU ARE \1/

s/.* YOU ARE (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1\?/

s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/
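
The two-pass scheme can be sketched in Python as follows. This is a minimal illustration, not the original ELIZA program. The rule lists contain only rules quoted above, written with a straight apostrophe. The function name `respond`, the padding with spaces, and the fallback reply are assumptions added for this sketch.

```python
import re

# First pass: turn first-person forms into second-person ones.
PRONOUN_RULES = [
    (r"\bI'm\b", "YOU ARE"),
    (r"\bmy\b", "YOUR"),
]

# Second pass: the first pattern that matches produces the reply.
RESPONSE_RULES = [
    (r".* YOU ARE (depressed|sad) .*", r"I AM SORRY TO HEAR THAT YOU ARE \1"),
    (r".* always .*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
]

def respond(utterance):
    # Pad with spaces so patterns that expect surrounding spaces can match
    # at the edges of the input (an assumption of this sketch).
    text = " " + utterance + " "
    for pattern, replacement in PRONOUN_RULES:
        text = re.sub(pattern, replacement, text)
    for pattern, replacement in RESPONSE_RULES:
        if re.search(pattern, text):
            return re.sub(pattern, replacement, text).strip()
    return "PLEASE GO ON"   # hypothetical fallback, not from the slides

print(respond("I'm sad today"))               # I AM SORRY TO HEAR THAT YOU ARE sad
print(respond("my boss always ignores me"))   # CAN YOU THINK OF A SPECIFIC EXAMPLE
```

Rule order matters in this sketch: only the first matching response rule fires, which is why the two rules above that share one pattern cannot both be listed without some extra selection mechanism.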

Finite State Automata (FSA)

• REs (that don’t use back-references) can be implemented as finite-state automata.

• Conversely, any FSA can be described by a regular expression.

• A RE or a FSA can be used to describe a class of languages called Regular Languages (RL).

Finite State Automata

• A FSA is represented as a graph with a finite set of nodes (called states) and directed arcs between pairs of states (called transitions) labeled with symbols from the alphabet.

• One state is a start state, represented by an incoming arrow.

• Some states are final or accepting states represented by a double circle.

FSA Example

Sheeptalk: baa! baaa! baaaa! baaaaa! …

Equivalent to RE: /baaa*!/

FSA Recognition

Examples:

baaa! Succeeds

aba!b Fails

FSA State Transition Table

• Alternative representation for FSA

FSA Example

Formal FSA Definition

• Q: a finite set of states (q0, q1, q2, …)

• Σ: a finite input alphabet of symbols

• q0: the start state (first state)

• F: the set of final states (subset of Q)

• δ(q,i): the transition function from states and inputs to states. Given a state q and an input i, it returns a new state q’.

Deterministic FSA (DFSA). The recognition of a string has no choice points.
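
Deterministic recognition can be sketched as a table-driven loop. The sketch below encodes the sheeptalk language /baaa*!/ as a Python dictionary playing the role of δ. The state numbering is an assumption, since the original diagram and table are not reproduced here, and `d_recognize` is a name chosen for this example.

```python
# Transition table for sheeptalk /baaa*!/: (state, symbol) -> next state.
# State numbers are assumed; missing entries mean "no arc".
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,
    (3, "!"): 4,
}
START = 0
FINAL = {4}

def d_recognize(tape):
    """Deterministic recognition: one table lookup per input symbol."""
    state = START
    for symbol in tape:
        if (state, symbol) not in TRANSITIONS:
            return False              # no arc for this symbol: reject
        state = TRANSITIONS[(state, symbol)]
    return state in FINAL             # accept only if we end in a final state

print(d_recognize("baaa!"))   # True
print(d_recognize("aba!b"))   # False
```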

Non Deterministic FSA (NFSA)

• When in state q2 with input a, the FSA has the choice to move to state q3 or remain in state q2.

Empty Arcs

From state q3 the FSA can move to state q2, without looking at the input (without advancing the tape).

NFSA Transition Tables

An extra ε column is added.

The transitions are now sets of states (instead of single states)

Accepting Strings with NFSA

• Since there is a choice of which arc to follow it is possible to take the wrong path and reject a string that should be accepted.

• All possible paths should be followed, and if even one of them ends in a final state once the input is exhausted, the string is accepted.

• Computational approaches

– Backup: At each choice point we store the current search-state (the state of the FSA and the position of the tape); when we reach a dead end we back up to that search-state and try another path from there.

– Lookahead: We look ahead in the input to decide which path to take.

– Parallelism: Alternative paths are explored in parallel.
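
The parallel approach can be sketched by tracking the set of all states the NFSA could be in after each symbol. The table below is an assumed encoding of a sheeptalk NFSA in which state 2 has a choice on input a; the empty string stands in for the ε label, and the function names are chosen for this example.

```python
EPS = ""   # label used in this sketch for empty (epsilon) arcs

# NFSA table: (state, symbol) -> set of possible next states.
NFSA = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {2, 3},   # choice point: stay in 2 or move to 3
    (3, "!"): {4},
}
START, FINAL = 0, {4}

def eps_closure(states, table):
    """All states reachable from `states` using only empty arcs."""
    stack, closure = list(states), set(states)
    while stack:
        q = stack.pop()
        for nxt in table.get((q, EPS), set()):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return closure

def nd_recognize_parallel(tape, table=NFSA):
    """Follow every alternative path at once by keeping a set of states."""
    current = eps_closure({START}, table)
    for symbol in tape:
        moved = set()
        for q in current:
            moved |= table.get((q, symbol), set())
        current = eps_closure(moved, table)
    return bool(current & FINAL)

print(nd_recognize_parallel("baaa!"))   # True
print(nd_recognize_parallel("ba!"))     # False
```

Because the set of current states never holds more than the number of NFSA states, this version needs no backtracking at all.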

NFSA Recognition as Search

• The NFSA recognition can be seen as a search through a space of search-states. This consists of all the possible pairings of FSA-states and tape positions.

• The order that these search-states are visited (i.e. the decision about which possible path to follow) is important for performance.

• Depth-first or breadth-first search.

• For larger search spaces it may be necessary to use more complex search techniques (e.g. dynamic programming or A*).
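
Recognition as search can be sketched with an agenda of search-states, each a pair of FSA state and tape position. The NFSA table is the same assumed sheeptalk encoding with a choice point in state 2 and no empty arcs. The `depth_first` flag is a parameter introduced for this sketch to switch between the two visiting orders.

```python
from collections import deque

NFSA = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {2, 3},
    (3, "!"): {4},
}
START, FINAL = 0, {4}

def nd_recognize(tape, depth_first=True):
    """Search the space of (state, tape position) pairs with an agenda."""
    agenda = deque([(START, 0)])
    seen = set()
    while agenda:
        # Taking from the same end we add to gives depth-first search;
        # taking from the opposite end gives breadth-first search.
        state, pos = agenda.pop() if depth_first else agenda.popleft()
        if pos == len(tape) and state in FINAL:
            return True
        if (state, pos) in seen:
            continue                  # this search-state was already explored
        seen.add((state, pos))
        if pos < len(tape):
            for nxt in NFSA.get((state, tape[pos]), set()):
                agenda.append((nxt, pos + 1))
    return False                      # agenda exhausted: every path failed

print(nd_recognize("baaa!"))                      # True
print(nd_recognize("baaa!", depth_first=False))   # True
print(nd_recognize("aba!b"))                      # False
```

Both orders give the same answer; they differ only in which search-states are visited first, and therefore in how much work is done before an accepting path is found.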

Relating DFSA and NFSA

• For every NFSA there exists an equivalent DFSA (i.e. that accepts exactly the same set of strings).

• The idea behind the proof is based on converting a NFSA to an equivalent DFSA. The resulting DFSA may have many more states than the original NFSA (up to 2^N states for a NFSA with N states).
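
The conversion can be sketched as the usual subset construction, in which each DFSA state is a set of NFSA states. To stay short, this sketch assumes an NFSA without empty arcs and builds only the subsets reachable from the start state. The function name `determinize` and the sheeptalk table are illustrative assumptions.

```python
def determinize(table, start, finals, alphabet):
    """Subset construction for an NFSA without empty arcs."""
    start_set = frozenset({start})
    dfsa, todo, seen = {}, [start_set], {start_set}
    while todo:
        subset = todo.pop()
        for symbol in alphabet:
            # Union of everything reachable from any member on this symbol.
            target = frozenset(
                nxt for q in subset for nxt in table.get((q, symbol), set())
            )
            if not target:
                continue              # no arc: the DFSA rejects here
            dfsa[(subset, symbol)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    # A subset is final if it contains at least one final NFSA state.
    dfsa_finals = {s for s in seen if s & finals}
    return dfsa, start_set, dfsa_finals

NFSA = {(0, "b"): {1}, (1, "a"): {2}, (2, "a"): {2, 3}, (3, "!"): {4}}
dfsa, d_start, d_finals = determinize(NFSA, 0, {4}, "ba!")
print(dfsa[(frozenset({2}), "a")] == frozenset({2, 3}))   # True
```

For this small machine only five subsets are reachable, far fewer than the 2^N worst case.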

Morphological Parsing and Recognition

• Morphological recognition: Accepts and rejects forms:

– Accept: geese

– Reject: gooses

• Morphological parsing: produces a morphological analysis (stem followed by morphological features)

– geese: goose + N + PL

– cats: cat + N + PL

– ground: ground +N +SG, grind +V +PPart

Morphological Parsing

• A morphological parser is composed of

– lexicon: the list of stems and affixes in a language, together with basic information about them.

– morphotactics: model of morpheme ordering, that defines which morpheme classes may follow other classes.

– orthographic rules: spelling rules used to model changes that occur in the language (e.g. city+s -> cities)
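
As a rough illustration of an orthographic rule, the sketch below applies the city+s -> cities change when attaching the plural suffix. It covers only that one rule plus the default case, and the function name `attach_plural` is hypothetical. A full system would model such rules with finite-state machinery rather than ad hoc code.

```python
import re

def attach_plural(stem):
    """Attach the plural -s, applying one toy English spelling rule."""
    if re.search(r"[^aeiou]y$", stem):
        return stem[:-1] + "ies"      # consonant + y: city + s -> cities
    return stem + "s"                 # default: cat + s -> cats

print(attach_plural("city"))   # cities
print(attach_plural("cat"))    # cats
print(attach_plural("boy"))    # boys
```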

Lexicon

• A repository of words:

a, AAA, AA, Aachen, aardvark, aardwolf...

• It is not practical to list every word in the language, and for some languages (e.g. Finnish, Turkish...) it is impossible. Usually only the stems and the affixes are listed.

• Ideally every possible word (or stem) should be in the lexicon, including abbreviations and proper names.

• Often along with stems in the lexicon we keep information about stem classes.

– e.g. dog: reg-noun, goose: irreg-sg-noun,

– geese: irreg-pl-noun, -s: plural-suffix

Morphotactics

• Commonly represented as a FSA.

• e.g. Simple FSA for plural formation in English
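
A plural-formation FSA of this kind can be sketched with arcs labeled by stem classes rather than letters. The lexicon entries come from the stem-class examples given earlier. The state numbering, the arc layout and the function names are assumptions, since the original diagram is not reproduced here, and spelling changes are ignored.

```python
# Toy lexicon: stem or affix -> class.
LEXICON = {
    "dog": "reg-noun",
    "cat": "reg-noun",
    "goose": "irreg-sg-noun",
    "geese": "irreg-pl-noun",
    "s": "plural-suffix",
}

# Morphotactic FSA over morpheme classes (state numbers are assumed).
ARCS = {
    (0, "reg-noun"): 1,
    (0, "irreg-sg-noun"): 2,
    (0, "irreg-pl-noun"): 2,
    (1, "plural-suffix"): 2,
}
FINAL = {1, 2}

def segmentations(word):
    """Yield every split of word into one or two lexicon entries."""
    if word in LEXICON:
        yield [word]
    for i in range(1, len(word)):
        if word[:i] in LEXICON and word[i:] in LEXICON:
            yield [word[:i], word[i:]]

def recognize(word):
    """Accept word if some segmentation is licensed by the FSA."""
    for morphemes in segmentations(word):
        state = 0
        for m in morphemes:
            state = ARCS.get((state, LEXICON[m]))
            if state is None:
                break                 # no arc for this morpheme class
        else:
            if state in FINAL:
                return True
    return False

print(recognize("dogs"))     # True
print(recognize("geese"))    # True
print(recognize("gooses"))   # False
```

Only regular nouns have an arc for the plural suffix, which is what makes the machine reject gooses while accepting dogs.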

Morphotactics

• In cases where a morphological process is more complicated, or not fully productive (unhappy, unreal but *unbig, *unred), the morphotactics FSA may become quite complicated and many different stem classes may be necessary.
