LING 438/538 Computational Linguistics Sandiway Fong Lecture 10: 9/26.

Post on 23-Jan-2016


LING 438/538 Computational Linguistics

Sandiway Fong

Lecture 10: 9/26

2

Administrivia

• reminder
  – no class this Thursday

3

Last Time

• introduced Finite State Automata (FSA) and regular expressions (RE)
• formally equivalent
  – in terms of generative capacity or power

[Diagram: Regular Languages, described equivalently by Regular Grammars, FSA, and Regular Expressions; annotation on Regular Grammars: "+ left & right recursive rules"]

• regular grammar
    s --> [a],b.
    b --> [a],b.
    b --> [b],c.
    b --> [b].
    c --> [b],c.
    c --> [b].
• regular expression
    a+b+
• FSA
  [Diagram: states s, x, y; arcs s -a-> x, x -a-> x, x -b-> y, y -b-> y]
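The regular grammar above is already valid Prolog definite clause grammar (DCG) notation, so it can be loaded and queried directly. The following is a minimal sketch; the wrapper predicate accepts/1 is a name introduced here for illustration and is not from the slides.

```prolog
% Regular grammar for a+b+ written as a DCG.
% Every rule is right-recursive: a terminal followed by at most one nonterminal.
s --> [a], b.
b --> [a], b.
b --> [b], c.
b --> [b].
c --> [b], c.
c --> [b].

% accepts(+String): String (a list of atoms) is generated by the grammar.
% accepts/1 is a hypothetical helper, not part of the original slides.
accepts(String) :- phrase(s, String).
```

For instance, ?- accepts([a,a,b,b]). should succeed, while ?- accepts([a,b,a]). should fail because no rule allows an a after a b.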

4

Last Time

• FSA
  – gave a formal definition
    • (Q,s,f,Σ,δ)
• many practical applications
  – can be encoded and run efficiently on a computer
  – implement regular expressions
  – compress large dictionaries
  – build morphological analyzers (suffixation)
    • see chapter 3 of textbook
  – speech recognizers
    • (Hidden) Markov models = FSA + probabilities

5

Today’s Lecture

• from the textbook
  – Chapter 2: Regular Expressions and Finite State Automata
• how to implement FSA
  – in Prolog
  – from first principles
  – two ways

6

Regular Expressions

• pattern-matching using regular expressions
  – important tool in automated searching
• popular implementations
  – Unix grep
    • returns lines matching a regular expression
    • standard part of all Unix-based systems
      – including MacOS X (command-line interface in Terminal)
    • many shareware/freeware implementations available for Windows XP
      – just Google and see...
  – wildcard search in Microsoft Word
    • limited version with differences in notation

7

Regular Expressions

• One of the most popular programs for searching files and returning lines that match a regular expression pattern is called GREP
  – name comes from Unix ed command g/re/p
  – “search globally for lines matching the regular expression, and print them”
  – [Source: http://en.wikipedia.org/wiki/Grep]

8

Regular Expressions: GNU grep

• terminology:
  – metacharacter

• character with special meaning, not interpreted literally, e.g. ^ vs. a

• must be quoted or escaped using the backslash \ to get literal meaning, e.g. \^

• excerpts from the manpage

– A list of characters enclosed by [ and ] matches any single character in that list;

– if the first character of the list is the caret ^ then it matches any character not in the list.

• Examples
  – the regular expression [0123456789] matches any single digit.
  – A range of characters may be specified by giving the first and last characters, separated by a hyphen.
    • [0-9]
    • [a-z]
    • [A-Za-z]

9

Regular Expressions: grep

• excerpts from the manpage

– The caret ^ and the dollar sign $ are metacharacters that respectively match the empty string at the beginning and end of a line.

– The symbol \b matches the empty string at the edge of a word

– The symbols \< and \> respectively match the empty string at the beginning and end of a word.

– The period . matches any single character.

– Finally, certain named classes of characters are predefined.

• Their names are self explanatory, and they are [:alnum:], [:alpha:], [:cntrl:], [:digit:], [:graph:], [:lower:], [:print:], [:punct:], [:space:], [:upper:], and [:xdigit:].

• For example, [[:alnum:]] (alphanumeric) means [0-9A-Za-z]

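– The symbol \w is a synonym for [[:alnum:]] and
– \W is a synonym for [^[:alnum:]].
  • (current GNU grep documentation gives these as [_[:alnum:]] and [^_[:alnum:]], i.e. including the underscore, which agrees with the definition of word that follows)

• terminology
  – word
    • unbroken sequence of digits, underscores and letters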

10

Regular Expressions: grep

• Excerpts from the manpage
  – A regular expression may be followed by one of several repetition operators:
    • ? The preceding item is optional and matched at most once.
    • * The preceding item will be matched zero or more times.
    • + The preceding item will be matched one or more times.
    • {n} The preceding item is matched exactly n times.
    • {n,} The preceding item is matched n or more times.
    • {n,m} The preceding item is matched at least n times, but not more than m times.

11

Regular Expressions: GNU grep

Excerpts from the manpage

• concatenation
  – Two regular expressions may be concatenated; the resulting regular expression matches any string formed by concatenating two substrings that respectively match the concatenated subexpressions.
• disjunction
  – Two regular expressions may be joined by the infix operator |; the resulting regular expression matches any string matching either subexpression.

12

Regular Expressions: Examples

• Regular Expression
  – \b99
    • matches 99 in “there are 99 bottles …”
    • but not in “there are 299 bottles …”
  – Note:
    • $99 contains two words
    • so \b99 will match 99 here
• Regular Expression
  – beds?
    • examples
      – bed
      – beds

13

Regular Expressions: Examples

• example
  – guppy
  – guppies
• Regular Expression
  – gupp(y|ies)
  – | = disjunction
  – ( ) = parentheses indicate scope
• example
  – the
    • (whole word, case insensitive)
  – the25
• Regular Expression (pg. 29)
  – (^|[^a-zA-Z])[tT]he[^a-zA-Z]
  – ^ = beginning of line
  – [^ ] = negation
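The operators used in these examples (concatenation, disjunction |, and the repetition operators ?, * and +) can be modeled by a very small matcher in Prolog. This is only a sketch under stated assumptions: the term names sym/1, cat/2, alt/2, opt/1, star/1, plus/1 and the predicates m/3 and match/2 are invented here for illustration; they are not grep syntax and not from the slides. Anchors, character classes and \b are not covered, and the whole input list must match.

```prolog
% match(+RE, +String): the whole list String matches the regex term RE.
% Hypothetical regex terms:
%   sym(C)      the single symbol C
%   cat(R1,R2)  concatenation R1 R2
%   alt(R1,R2)  disjunction   R1|R2
%   opt(R)      R?  (zero or one)
%   star(R)     R*  (zero or more)
%   plus(R)     R+  (one or more)
match(RE, String) :- m(RE, String, []).

% m(+RE, +S0, -S): RE matches a prefix of S0, leaving the remainder S.
m(sym(C), [C|S], S).
m(cat(R1, R2), S0, S) :- m(R1, S0, S1), m(R2, S1, S).
m(alt(R1, _), S0, S)  :- m(R1, S0, S).
m(alt(_, R2), S0, S)  :- m(R2, S0, S).
m(opt(_), S, S).
m(opt(R), S0, S)      :- m(R, S0, S).
m(star(_), S, S).
m(star(R), S0, S) :-
    m(R, S0, S1),
    S1 \== S0,            % require progress so R* cannot loop on empty matches
    m(star(R), S1, S).
m(plus(R), S0, S)     :- m(R, S0, S1), m(star(R), S1, S).
```

With this encoding, beds? becomes cat(sym(b), cat(sym(e), cat(sym(d), opt(sym(s))))), and ?- match(cat(sym(b), cat(sym(e), cat(sym(d), opt(sym(s))))), [b,e,d,s]). should succeed. Prolog's backtracking handles the choices introduced by alt, opt and star.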

14

Regular Expressions: Microsoft Word

• terminology:
  – wildcard search

15

Regular Expressions: Microsoft Word

Note: zero or more times is missing in Microsoft Word

16

Finite State Automata (FSA)

• more formally
  – (Q,s,f,Σ,δ)
    1. set of states (Q): {s,x,y} (must be a finite set)
    2. start state (s): s
    3. end state(s) (f): y
    4. alphabet (Σ): {a, b}
    5. transition function δ:
       signature: character × state → state
       1. δ(a,s)=x
       2. δ(a,x)=x
       3. δ(b,x)=y
       4. δ(b,y)=y

[Diagram: states s, x, y; arcs s -a-> x, x -a-> x, x -b-> y, y -b-> y]

17

Finite State Automata (FSA)

• directly implement the formal definition
  – define a predicate fsa/2
  – takes two arguments
    • S = a start state
    • L = string (as a list) we’re interested in testing
• Prolog code (for any FSA)
    fsa(S,L) :-
        L = [C|M],
        transition(S,C,T),
        fsa(T,M).
    fsa(E,[]) :-
        end_state(E).

18

Finite State Automata (FSA)

• Prolog code (for any FSA)
    fsa(S,L) :-
        L = [C|M],
        transition(S,C,T),
        fsa(T,M).
    fsa(E,[]) :-
        end_state(E).
• Facts (FSA-particular)
    end_state(y).

    transition(s,a,x).
    transition(x,a,x).
    transition(x,b,y).
    transition(y,b,y).

[Diagram: states s, x, y; arcs s -a-> x, x -a-> x, x -b-> y, y -b-> y]

transition function δ: δ(a,s)=x, δ(a,x)=x, δ(b,x)=y, δ(b,y)=y

19

Finite State Automata (FSA)

• computation tree
    ?- fsa(s,[a,a,b]).
        ?- transition(s,a,T).            T=x
        ?- fsa(x,[a,b]).
            ?- transition(x,a,T').       T'=x
            ?- fsa(x,[b]).
                ?- transition(x,b,T'').  T''=y
                ?- fsa(y,[]).
                    ?- end_state(y).
                    Yes

    fsa(S,L) :-
        L = [C|M],
        transition(S,C,T),
        fsa(T,M).
    fsa(E,[]) :-
        end_state(E).
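The sequence of states visited in the computation tree can also be returned as an answer rather than read off a trace. The sketch below adds a third argument that collects the path; fsa_path/3 is a name introduced here for illustration and is not part of the slides. The facts are those of the a+b+ machine.

```prolog
% FSA-particular facts (the a+b+ machine from the slides).
end_state(y).

transition(s, a, x).
transition(x, a, x).
transition(x, b, y).
transition(y, b, y).

% fsa_path(+State, +String, -Path): like fsa/2, but Path is the list of
% states visited, beginning with State. (Hypothetical helper.)
fsa_path(E, [], [E]) :-
    end_state(E).
fsa_path(S, [C|M], [S|Path]) :-
    transition(S, C, T),
    fsa_path(T, M, Path).
```

The query ?- fsa_path(s, [a,a,b], P). should bind P to [s,x,x,y], matching the T, T' and T'' bindings in the tree.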

20

Finite State Automata (FSA)

• deterministic FSA (DFSA)
  – no ambiguity about where to go at any given state
• non-deterministic FSA (NDFSA)
  – no restriction on ambiguity (surprisingly, no increase in formal power)
• textbook
  – D-RECOGNIZE (FIGURE 2.13)
  – ND-RECOGNIZE (FIGURE 2.21)

    fsa(S,L) :-
        L = [C|M],
        transition(S,C,T),
        fsa(T,M).
    fsa(E,[]) :-
        end_state(E).
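Whether a given transition table is ambiguous can be checked mechanically: a machine is non-deterministic if some state has two different next states for the same input character. The following is a rough sketch; the list-of-t/3 representation and the names nondeterministic/1, machine1/1 and machine2/1 are assumptions made here for illustration, and ε-transitions (which these slides do not use) are ignored.

```prolog
:- use_module(library(lists)).   % member/2

% nondeterministic(+Transitions): Transitions is a list of
% t(State, Char, NextState) terms. Succeeds if some State has two
% different next states on the same Char.
nondeterministic(Transitions) :-
    member(t(S, C, T1), Transitions),
    member(t(S, C, T2), Transitions),
    T1 \== T2.

% The a+b+ machine: exactly one choice per (state, character) pair.
machine1([t(s,a,x), t(x,a,x), t(x,b,y), t(y,b,y)]).

% A variant in which the arc from x to y is labeled a instead of b,
% so state x has two possible moves on a.
machine2([t(s,a,x), t(x,a,x), t(x,a,y), t(y,b,y)]).
```

Here ?- machine1(M), nondeterministic(M). should fail, and ?- machine2(M), nondeterministic(M). should succeed.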

21

Finite State Automata (FSA)

• Prolog
  – no change in code
  – Prolog computation rule takes care of choice point management
• example
  – one change: “a” instead of “b” from x to y
  – non-deterministic
  – what regular language does this machine accept?

    fsa(S,L) :-
        L = [C|M],
        transition(S,C,T),
        fsa(T,M).
    fsa(E,[]) :-
        end_state(E).

[Diagram: states s, x, y; arcs s -a-> x, x -a-> x, x -a-> y, y -b-> y]
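To experiment with the modified machine, the general fsa/2 code can be combined with the changed facts. In this sketch the transition facts are read off the diagram description (the arc from x to y relabeled a); the head of fsa/2 is written with [C|M] directly, which behaves the same as the L = [C|M] form on the slides.

```prolog
% General recognizer (unchanged apart from head unification).
fsa(S, [C|M]) :-
    transition(S, C, T),
    fsa(T, M).
fsa(E, []) :-
    end_state(E).

% FSA-particular facts for the non-deterministic example.
end_state(y).

transition(s, a, x).
transition(x, a, x).
transition(x, a, y).   % changed: a (not b) from x to y
transition(y, b, y).
```

For the query ?- fsa(s,[a,a]). Prolog first tries transition(x,a,x), fails because x is not an end state, then backtracks to transition(x,a,y) and succeeds. Trying further strings such as [a], [a,b] and [a,a,b,b] is one way to explore which language the machine accepts.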

22

Finite State Automata (FSA)

• another possible Prolog encoding strategy

  – define one predicate for each state
    • taking one argument (the input string)
    • consume input character
    • call next state with remaining input string
  – query
    • ?- s(L).
      call start state s

23

Finite State Automata (FSA)

  – state s: (start state)
    • s([a|L]) :- x(L).
      match input string beginning with a and call state x with remainder of input
  – state x:
    • x([a|L]) :- x(L).
    • x([b|L]) :- y(L).
  – state y: (end state)
    • y([]).
    • y([b|L]) :- y(L).

[Diagram: states s, x, y; arcs s -a-> x, x -a-> x, x -b-> y, y -b-> y]

24

Finite State Automata (FSA)

example:
  1. ?- s([a,a,b]).
  2. ?- x([a,b]).
  3. ?- x([b]).
  4. ?- y([]).
  Yes

    s([a|L]) :- x(L).
    x([a|L]) :- x(L).
    x([b|L]) :- y(L).
    y([]).
    y([b|L]) :- y(L).

25

Finite State Automata (FSA)

• Note:
  – the non-deterministic properties of Prolog’s computation rule still apply here
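As an illustration of this note, the one-predicate-per-state encoding can express a non-deterministic machine simply by giving a state two clauses for the same input character. The sketch below assumes the variant machine in which the arc from x to y is labeled a rather than b; it is an example constructed here, not code from the slides.

```prolog
% state s (start state)
s([a|L]) :- x(L).

% state x: two clauses for the same character a, so Prolog leaves a
% choice point and backtracks into the second clause if the first fails.
x([a|L]) :- x(L).
x([a|L]) :- y(L).

% state y (end state)
y([]).
y([b|L]) :- y(L).
```

For ?- s([a,a]). the first x clause leads to x([]), which fails; Prolog then retries with the second clause, reaches y([]) and answers Yes.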

26

Finite State Automata (FSA)

example:
  1. ?- s([a,b,a]).
  2. ?- x([b,a]).
  3. ?- y([a]).
  No

    s([a|L]) :- x(L).
    x([a|L]) :- x(L).
    x([b|L]) :- y(L).
    y([]).
    y([b|L]) :- y(L).

27

Next Time

• bit more on FSA...
• read (if you haven’t yet)
  – Chapter 3: Morphology and Finite State Transducers