Regular Expressions and Automata

31
Regular Expressions and Automata Chapter 2

description

Regular Expressions and Automata. Chapter 2. Regular Expressions. Standard notation for characterizing text sequences Used in all kinds of text processing and information extraction tasks - PowerPoint PPT Presentation

Transcript of Regular Expressions and Automata

Page 1: Regular Expressions and Automata

Regular Expressions and Automata

Chapter 2

Page 2: Regular Expressions and Automata

Regular Expressions• Standard notation for characterizing

text sequences• Used in all kinds of text processing

and information extraction tasks• As things have progressed, the RE

languages used in various tools and languages (grep, Emacs, Python, Ruby, Java, …) are very similar

04/24/23 Speech and Language Processing - Jurafsky and Martin 2

Page 3: Regular Expressions and Automata

Regular Expressions• We’ll look at a few examples [in

lecture], make a note about types of errors, and then move toward automata

04/24/23 Speech and Language Processing - Jurafsky and Martin 3

Page 4: Regular Expressions and Automata

04/24/23 Speech and Language Processing - Jurafsky and Martin 4

Example• Find all the instances of the word

“the” in a text. /the/ /[tT]he/ /\b[tT]he\b/

Page 5: Regular Expressions and Automata

04/24/23 Speech and Language Processing - Jurafsky and Martin 5

Errors

• We fixed two kinds of errors Matching strings that we should not have

matched False positives (Type I)

Not matching things that we should have matched False negatives (Type II)

Page 6: Regular Expressions and Automata

04/24/23 Speech and Language Processing - Jurafsky and Martin 6

Errors

• We’ll see the same story for many tasks, all semester. Reducing the error rate for an application often involves two antagonistic efforts: Increasing precision, (minimizing false

positives) Increasing coverage, or recall,

(minimizing false negatives)

Page 7: Regular Expressions and Automata

Formal Languages and Models• Language: a (possibly infinite) set of

strings made up of symbols from a finite alphabet

• Model of a language: can recognize and generate all and only the strings from the language Serves as a definition of the formal

language

04/24/23 Speech and Language Processing - Jurafsky and Martin 7

Page 8: Regular Expressions and Automata

Chomsky Hierarchy• Regular language

Model: regular expressions, finite state automata

• Context free language• Context sensitive language• Unrestricted language

Model: Turning Machine

04/24/23 Speech and Language Processing - Jurafsky and Martin 8

Page 9: Regular Expressions and Automata

Regular Expressions and Languages

• A regular expression pattern can be mapped to a set of strings

• A regular expression pattern defines a language (in the formal sense) – the class of this type of languages is called a regular language

04/24/23 Speech and Language Processing - Jurafsky and Martin 9

Page 10: Regular Expressions and Automata

04/24/23 Speech and Language Processing - Jurafsky and Martin 10

Finite State Automata• FSAs and their probabilistic relatives

are at the core of much of what we’ll be doing all semester.

• They also capture significant aspects of what linguists say we need for morphology and parts of syntax.

• They are formally equivalent to regular expressions

Page 11: Regular Expressions and Automata

Formal Definition of a Finite Automaton

1. Finite set of states, typically Q.2. Alphabet of input symbols, typically 3. One state is the start/initial state, typically q0 // q0 Q4. Zero or more final/accepting states; the set is typically F. //

F Q5. A transition function, typically δ. This function

• Takes a state and input symbol as arguments.• Returns a state.• One “rule” would be written δ(q, a) = p, where q and p are

states, and a is an input symbol.• Intuitively: if the FA is in state q, and input a is received, then

the FA goes to state p (note: q = p OK).6. A FA is represented as the five-tuple: A = (Q, , δ,q0, F).

Here, F is a set of accepting states.

Page 12: Regular Expressions and Automata

A Simple Example• Language: “Sheepish” Any string

that starts with the letter b, followed by two or more a’s, and ending in !

• {“baa!”,”baaa!”,”baaaa!”,”baaaaa!”,…}

• Regular expression for this?

04/24/23 Speech and Language Processing - Jurafsky and Martin 12

Page 13: Regular Expressions and Automata

04/24/23 Speech and Language Processing - Jurafsky and Martin 13

One Possible Sheepish FSA

• Formal definition of this FSA?

Page 14: Regular Expressions and Automata

FSA as a Recognizer• Does a string belong to its language?1.Place the input string on a tape, point at

start2.Initialize current state to q0

3.Iteratively check the next letter on tape.1. From the current state, if an outgoing arc

label matches new letter, move to new state2. If stuck, REJECT

4.If reach the end of the tape and in a final state, then ACCEPT; else, REJECT

04/24/23 Speech and Language Processing - Jurafsky and Martin 14

Page 15: Regular Expressions and Automata

04/24/23 Speech and Language Processing - Jurafsky and Martin 15

Recognition• Traditionally, (Turing’s notion) this

process is depicted with a tape.

Page 16: Regular Expressions and Automata

FSA as Generator• FSA can also produce strings in the

language it represents1.Start from q02.Pick an out-going arc to a new state (for

now, assume picking randomly) and print the symbol on the arc

3.Follow the arc to the new state4.Repeat from step 2 until reaching a

final state

04/24/23 Speech and Language Processing - Jurafsky and Martin 16

Page 17: Regular Expressions and Automata

FSAs and Regular Expressions• These are formally equivalent. • Both of these classes of models

recognize/generate exactly the class of regular languages

• Interesting proofs: constructive! Given any regular expression, create an equivalent FSA; given any FSA, create an equivalent regular expression

04/24/23 Speech and Language Processing - Jurafsky and Martin 17

Page 18: Regular Expressions and Automata

Note on Practical Regular Expression Utilities

• NOTE: additional features added to regular expression processing can make them more powerful; think of memory

04/24/23 Speech and Language Processing - Jurafsky and Martin 18

Page 19: Regular Expressions and Automata

04/24/23 Speech and Language Processing - Jurafsky and Martin 19

About Alphabets• Don’t take term alphabet word too

narrowly; it just means we need a finite set of symbols in the input.

• These symbols can and will stand for bigger objects that can have internal structure.

Page 20: Regular Expressions and Automata

04/24/23 Speech and Language Processing - Jurafsky and Martin 20

Often there is more than one FSA for a given language• E.g., here is another FSA for

“Sheepish”

Page 21: Regular Expressions and Automata

04/24/23 Speech and Language Processing - Jurafsky and Martin 21

Yet Another View

• The guts of FSAs can ultimately be represented as tables

b a ! e0 11 22 2,33 44

If you’re in state 1 and you’re looking at an a, go to state 2

Page 22: Regular Expressions and Automata

04/24/23 Speech and Language Processing - Jurafsky and Martin 22

Deterministic versus Non-Deterministic FSAs

• Deterministic means that at each point in processing there is always one unique thing to do (no choices).

• Non-deterministic means there are choices

• Go back and look at previous DFA• How do deterministic and non-

deterministic FSAs compare?

Page 23: Regular Expressions and Automata

04/24/23 Speech and Language Processing - Jurafsky and Martin 23

Non-Deterministic FSAs• May include

Epsilon transitions Key point: these transitions do not

examine or advance the tape during recognition

Page 24: Regular Expressions and Automata

04/24/23 Speech and Language Processing - Jurafsky and Martin 24

ND Recognition• Two basic approaches

1. Either take a ND machine and convert it to a D machine and then do recognition with that.

2. Or explicitly manage the process of recognition as a state-space search (leaving the machine as is).

Page 25: Regular Expressions and Automata

04/24/23 Speech and Language Processing - Jurafsky and Martin 25

Non-Deterministic Recognition: Search

• In a ND FSA there exists at least one path through the machine for a string that is in the language defined by the machine.

• But not all paths through the machine for an accept string lead to an accept state.

• If a string is not in the language, there are no paths through the machine that lead to an accept state

Page 26: Regular Expressions and Automata

04/24/23 Speech and Language Processing - Jurafsky and Martin 26

Non-Deterministic Recognition

• So success in non-deterministic recognition occurs when a path is found through the machine that ends in an accept.

• Failure occurs when all of the possible paths for a given string lead to failure.

Page 27: Regular Expressions and Automata

04/24/23 Speech and Language Processing - Jurafsky and Martin 27

Example

Page 28: Regular Expressions and Automata

04/24/23 Speech and Language Processing - Jurafsky and Martin 28

Why Non-Determinism?• Non-determinism doesn’t get us

more formal power and it causes headaches so why bother? More natural (understandable) solutions

Page 29: Regular Expressions and Automata

04/24/23 Speech and Language Processing - Jurafsky and Martin 29

Compositional Machines

• Formal languages are just sets of strings

• Therefore, we can talk about various set operations (intersection, union, concatenation)

• We’ll just do a couple

Page 30: Regular Expressions and Automata

04/24/23 Speech and Language Processing - Jurafsky and Martin 30

Union

Page 31: Regular Expressions and Automata

04/24/23 Speech and Language Processing - Jurafsky and Martin 31

Concatenation