LING 581: Advanced Computational Linguistics Lecture Notes February 2nd.
LING 6932 Topics in Computational Linguistics
1 LING 6932 Spring 2007
LING 6932 Topics in Computational Linguistics
Hana Filip
Lecture 2: Regular Expressions, Finite State Automata
Regular expressions
formulas for specifying text strings
How can we search for any of these strings?
woodchuck
woodchucks
Woodchuck
Woodchucks
Figure from Dorr/Monz slides
Regular Expressions
Basic patterns of regular expressions
Perl-based syntax (slightly different from other notations for regular expressions as used in UNIX, for example)
/Woodchuck/ matches any string containing the substring Woodchuck, if your search application returns entire lines, for example
‘/’ notation used by Perl, NOT part of the RE
Google: Woodchuck Draft Cider
Producers of Woodchuck Draft Cider in Springfield, VT.
www.woodchuck.com/ - 17k - Cached - Similar pages
Slide from Dorr/Monz
Regular Expressions
Regular expressions are CASE SENSITIVE
The pattern /woodchuck/ will not match the string Woodchuck
Disjunction /[wW]oodchuck/
Slide from Dorr/Monz
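The case-sensitivity point can be checked with a short sketch. It assumes Python's `re` module rather than Perl, and the word list (including `groundhog`) is made up for illustration.

```python
import re

# Hypothetical test strings; only the woodchuck forms come from the slides.
lines = ["woodchuck", "woodchucks", "Woodchuck", "Woodchucks", "groundhog"]

# Case-sensitive literal pattern: misses the capitalized forms.
lower_only = [s for s in lines if re.search(r"woodchuck", s)]

# The character class [wW] accepts either case of the first letter.
either_case = [s for s in lines if re.search(r"[wW]oodchuck", s)]

print(lower_only)
print(either_case)
```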
Regular Expressions
Ranges [A-Z]
Slide from Dorr/Monz
Regular Expressions
Negation /[^a]/ ^: caret
‘match any single character except a’
Slide from Dorr/Monz
Regular Expressions
Operators ?, * and +
? (0 or 1)
/woodchucks?/ woodchuck or woodchucks
/colou?r/ color or colour
* (0 or more)
/oo*h!/ oh! or ooh! or ooooh!
+ (1 or more)
/o+h!/ oh! or ooh! or ooooh!
Each operator applies to the immediately preceding character or regular expression
* and + are the Kleene star and Kleene plus, named after Stephen Cole Kleene
Wild card .
/beg.n/ begin or began or begun
matches any single character between beg and n (except a carriage return)
Slide from Dorr/Monz
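A small sketch of the counters and the wildcard, assuming Python's `re` module. The helper name `matches` and the non-matching candidates are hypothetical additions. Note that in Python the dot excludes the newline character by default.

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern matches from start to end."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

# ? makes the preceding character optional.
optional = matches(r"colou?r", ["color", "colour", "colouur"])

# o followed by zero or more o's, versus one or more o's: the same language.
star = matches(r"oo*h!", ["h!", "oh!", "ooh!", "ooooh!"])
plus = matches(r"o+h!", ["h!", "oh!", "ooh!", "ooooh!"])

# The dot stands for exactly one character, but not a newline.
wildcard = matches(r"beg.n", ["begin", "began", "begun", "begn", "beg\nn"])

print(optional, star, plus, wildcard)
```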
Regular Expressions
Anchors ^ and $
^ start of line
/^[A-Z]/ “Ramallah, Palestine”
/^[^A-Z]/ “¿verdad?” “really?”
$ end of line
/\.$/ “It is over.”
/.$/ ?
Boundaries \b and \B
/\bon\b/ matches “on my way” but not “Monday” (\b: word boundary)
/\Bon\b/ matches “automaton” (\B: non-boundary)
Slide from Dorr/Monz
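The anchor and boundary examples can be tried out as follows. This is a sketch assuming Python's `re` module; the string “It is over?” is an added test case, not from the slides.

```python
import re

# ^ anchors the match at the start of the string.
starts_upper = bool(re.search(r"^[A-Z]", "Ramallah, Palestine"))
starts_non_upper = bool(re.search(r"^[^A-Z]", "¿verdad?"))

# $ anchors at the end; \. is a literal period, a bare . is any character.
ends_with_period = bool(re.search(r"\.$", "It is over."))
ends_with_any = bool(re.search(r".$", "It is over?"))

# \b requires a word boundary, \B requires the absence of one.
word_on = [bool(re.search(r"\bon\b", s)) for s in ["on my way", "Monday"]]
suffix_on = [bool(re.search(r"\Bon\b", s)) for s in ["automaton", "on my way"]]

print(word_on, suffix_on)
```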
Disjunction, Grouping, Precedence
Disjunction |
/yours|mine/ “it is either yours or mine”
/gupp(y|ies)/ “guppy” or “guppies”
Column 1 Column 2 Column 3 …
How do we express this?
/Column␣[0-9]␣*/
/(Column␣[0-9]␣*)*/
␣ stands for a ‘space’ and is NOT a RE character
The second pattern matches the word Column, followed by a space and a single digit, followed by zero or more spaces, the whole pattern repeated any number of times (zero or more times)
Slide from Dorr/Monz
Disjunction, Grouping, Precedence
Operator Precedence Hierarchy (highest to lowest)
Parenthesis ()
Counters * + ?
Sequences and anchors the ^my end$
Disjunction |
REs are greedy!
They always match the largest string they can
Slide from Dorr/Monz
Example
Find me all instances of the word “the” in a text.
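/the/
Misses capitalized examples
/[tT]he/
Returns “other” or “theology”
/\b[tT]he\b/ matches “the” or “The”, but not “the_” or “the25” (underscores and digits count as word characters)
/[^a-zA-Z][tT]he[^a-zA-Z]/
Matches “the_” or “the25”, but misses “the” at the start of a line
/(^|[^a-zA-Z])[tT]he[^a-zA-Z]/
Also matches at the start of a line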
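The successive refinements can be compared side by side. This sketch assumes Python's `re` module; the sample strings and the dictionary keys are invented for illustration.

```python
import re

# Hypothetical sample lines covering the cases discussed above.
samples = ["The cat", "see the cat", "other cat", "theology", "the_cat", "the25 cats"]

patterns = {
    "literal": r"the",
    "either_case": r"[tT]he",
    "word_boundary": r"\b[tT]he\b",
    "non_letter_context": r"(^|[^a-zA-Z])[tT]he[^a-zA-Z]",
}

# For each pattern, collect the samples in which it finds a match.
hits = {name: [s for s in samples if re.search(p, s)] for name, p in patterns.items()}

for name, matched in hits.items():
    print(name, matched)
```

The literal pattern misses “The cat” (a false negative), while the first two patterns also fire on “other cat” and “theology” (false positives).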
Errors
The process we just went through was based on fixing two kinds of errors
Not matching things that we should have matched (The)
– False negatives
Matching strings that we should not have matched (there, then, other)
– False positives
Errors cont.
We’ll be telling the same story for many tasks
Reducing the error rate for an application often involves two antagonistic efforts:
Increasing accuracy (minimizing false positives)
Increasing coverage (minimizing false negatives).
More complex RE example
Regular expressions for prices
/\$[0-9]+/
Doesn’t deal with fractions of dollars
/\$[0-9]+\.[0-9][0-9]/
Doesn’t allow $199, not at a word boundary
/\b\$[0-9]+(\.[0-9][0-9])?\b/
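A hedged sketch of the price pattern in Python's `re` module. Two adjustments are assumptions of this sketch rather than part of the slide: the dollar sign is written `\$` so that it is read literally instead of as the end-of-line anchor, and the leading `\b` is left out, because `$` is not a word character and a boundary before it would require a letter or digit immediately to its left. The sample sentence is invented.

```python
import re

# Dollars, an optional two-digit cents part, then a word boundary.
PRICE = re.compile(r"\$[0-9]+(?:\.[0-9][0-9])?\b")

# Hypothetical input text.
text = "It costs $199 today, or $24.99 on sale, not 50 dollars."
prices = PRICE.findall(text)

print(prices)
```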
Advanced operators
Regular expression operators for counting
RE Match
{n} exactly n occurrences of the previous character or expression
{n,m} from n to m occurrences of the previous character or expression
{n,} at least n occurrences of the previous character or expression
/a\.{24}z/ a followed by 24 dots followed by z
Advanced operators
To refer to characters that are themselves special, precede them with a backslash
RE Match Example Strings Matched
\* an asterisk “*” “K*A*P*L*A*N”
\. a period “.” “Dr. Livingston, I presume.”
\? a question mark “?” “Would you light my candle?”
\n a newline
\t tab
Advanced operators
Slide from Dorr/Monz
Substitutions and Memory
Substitution operator s/regexp/replacement/ (UNIX, Perl)
s/colour/color/
s/colour/color/g
The g flag: substitute as many times as possible!
Case insensitive matching
s/colour/color/i
Slide from Dorr/Monz
Substitutions and Memory
Substitutions
“the Xer they were, the Xer they will be”
constrain the two X’s to be the same string
/the (.*)er they were, the \1er they will be/
/the (.*)er they (.*), the \1er they \2/
Using numbered memories or registers: \1, \2, etc. refer back to earlier parenthesized matches within the pattern (Perl also exposes them as $1, $2 after a match)
An extended feature of regular expressions
Slide from Dorr/Monz
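The effect of the registers can be seen in a short sketch, assuming Python's `re` module, where back-references inside a pattern are written `\1`, `\2`. The two test sentences are chosen for illustration.

```python
import re

# Both parenthesized groups must reappear verbatim later in the string.
pattern = re.compile(r"the (.*)er they (.*), the \1er they \2")

# Same X and same verb phrase on both sides: the pattern matches.
same = pattern.search("the bigger they were, the bigger they were")

# The second register ("were") does not recur, so there is no match.
different = pattern.search("the bigger they were, the bigger they will be")

print(same.groups() if same else None)
print(different)
```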
Eliza [Weizenbaum, 1966]
User: Men are all alike
ELIZA: IN WHAT WAY
User: They’re always bugging us about something or other
ELIZA: CAN YOU THINK OF A SPECIFIC EXAMPLE?
User: Well, my boyfriend made me come here
ELIZA: YOUR BOYFRIEND MADE YOU COME HERE
User: He says I’m depressed much of the time
ELIZA: I AM SORRY TO HEAR THAT YOU ARE DEPRESSED
Eliza-style regular expressions
Step 1: replace first person with second person references
s/\bI(’m| am)\b/YOU ARE/g
s/\bmy\b/YOUR/g
s/\bmine\b/YOURS/g
Step 2: use substitutions that look for relevant patterns in the input and create an appropriate output (reply)
s/.* YOU ARE (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/
s/.* YOU ARE (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1/
s/.* all .*/IN WHAT WAY/
s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/
Step 3: use scores to rank possible transformations
Slide from Dorr/Monz
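Steps 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration and not Weizenbaum's program: the function name `eliza_reply`, the fallback reply, and the use of rule order in place of the scores of Step 3 are all assumptions of this sketch, and the input is expected to use a plain ASCII apostrophe.

```python
import re

# Step 2 rules: (pattern, reply template). Earlier rules win, which stands in
# for the score-based ranking of Step 3 (an assumption of this sketch).
RULES = [
    (r".*YOU ARE (depressed|sad).*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".*\ball\b.*", "IN WHAT WAY"),
    (r".*\balways\b.*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
]

def eliza_reply(utterance):
    # Step 1: swap first-person references for second-person ones.
    text = re.sub(r"\bI('m| am)\b", "YOU ARE", utterance)
    text = re.sub(r"\bmy\b", "YOUR", text)
    text = re.sub(r"\bmine\b", "YOURS", text)
    # Step 2: the first rule whose pattern covers the whole input builds the reply.
    for pattern, template in RULES:
        match = re.fullmatch(pattern, text)
        if match:
            return match.expand(template).upper()
    # Hypothetical fallback reply, not taken from the slides.
    return "PLEASE GO ON"

print(eliza_reply("He says I'm depressed much of the time"))
```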
Summary on REs so far
Regular expressions are perhaps the single most useful tool for text manipulation
Dumb but ubiquitous
Eliza: you can do a lot with simple regular-expression substitutions
Three Views
Three equivalent formal ways to look at what we’re up to (thanks to Martin Kay)
Regular Expressions
Regular Languages
Finite State Automata
Finite State Automata
Terminology: Finite State Automata, Finite State Machines, FSA, Finite Automata
Regular expressions are one way of specifying the structure of finite-state automata.
FSAs and their close relatives are at the core of most algorithms for speech and language processing.
Finite-state Automata (Machines)
/^baa+!$/
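[Figure: state diagram with states q0, q1, q2, q3, q4; transitions q0 → q1 on b, q1 → q2 on a, q2 → q3 on a, q3 → q3 on a, q3 → q4 on !; q4 is the final state. Labels: state, transition, final state]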
baa! baaa! baaaa! baaaaa! ...
Slide from Dorr/Monz
Sheep FSA
We can say the following things about this machine
It has 5 states
At least b, a, and ! are in its alphabet
q0 is the start state
q4 is the final (= accept) state
It has 5 transitions
More Formally: Defining an FSA
You can specify an FSA by enumerating the following things.
a finite set of states: Q
a finite alphabet of symbols: Σ
the start state: q0
the set of accepting/final states: F such that F ⊆ Q
a transition function δ(q,i) that maps Q × Σ to Q
Given a state q ∈ Q and an input symbol i ∈ Σ, δ(q,i) returns a new state q′ ∈ Q.
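The five components can be written down directly as data. The sketch below encodes the sheep machine for /baa+!/ in Python; the class name `FSA` and its field names are hypothetical, and the transition function is stored as a dictionary that is only defined for the five arcs of the diagram.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FSA:
    states: frozenset    # Q
    alphabet: frozenset  # Sigma
    start: str           # q0
    finals: frozenset    # F, a subset of Q
    delta: dict          # transition function: (state, symbol) -> state

# The sheep machine: b, a, a, then any number of extra a's, then !.
sheep = FSA(
    states=frozenset({"q0", "q1", "q2", "q3", "q4"}),
    alphabet=frozenset({"b", "a", "!"}),
    start="q0",
    finals=frozenset({"q4"}),
    delta={
        ("q0", "b"): "q1",
        ("q1", "a"): "q2",
        ("q2", "a"): "q3",
        ("q3", "a"): "q3",
        ("q3", "!"): "q4",
    },
)

print(sheep.delta[("q3", "a")])
```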
Yet Another View
State-transition table
Recognition
Recognition is the process of determining if a string should be accepted by a machine
Or… it’s the process of determining if a string is in the language we’re defining with the machine
Or… it’s the process of determining if a regular expression matches a string
Recognition
Traditionally, (Turing’s idea, 1936) this process is depicted with a tape.
http://www.cs.princeton.edu/introcs/75turing/
Recognition - Execution
Start in the start state
Examine the current input in the active cell
Consult the table: a finite table of instructions (a state transition diagram) that specifies exactly what action the machine takes at each step
Go to a new state and update the tape pointer.
Until you run out of tape.
Input Tape
[Figure: input tape b a a a ! processed by the sheep FSA; state sequence q0 q1 q2 q3 q3 q4]
ACCEPT
Slide from Dorr/Monz
Input Tape
[Figure: input tape a b a ! b; the sheep FSA is in state q0]
REJECT
Slide from Dorr/Monz
Adding a failing state
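[Figure: the sheep FSA (q0 → q1 on b, q1 → q2 on a, q2 → q3 on a, q3 → q3 on a, q3 → q4 on !) with an added fail state qF that receives the remaining a, b, and ! transitions]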
Slide from Dorr/Monz
Tracing D-Recognize
Key Points
Deterministic means that at each point in processing there is always one unique thing to do (no choices).
D-recognize is a simple table-driven interpreter
The algorithm is universal for all unambiguous regular languages.
To change the machine, you change the table.
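A table-driven recognizer in the spirit of D-recognize can be sketched as follows. This is an illustrative Python version, not the pseudocode from the textbook: the function signature and the dictionary encoding of the table are assumptions, and a missing table entry plays the role of the fail state.

```python
def d_recognize(tape, table, start, finals):
    """Return True if the deterministic machine described by table accepts tape."""
    state = start
    for symbol in tape:
        key = (state, symbol)
        if key not in table:
            # No transition defined for this state and symbol: reject.
            return False
        state = table[key]
    # End of tape: accept only if the machine stopped in a final state.
    return state in finals

# Transition table for the sheep language /baa+!/.
SHEEP_TABLE = {
    ("q0", "b"): "q1",
    ("q1", "a"): "q2",
    ("q2", "a"): "q3",
    ("q3", "a"): "q3",
    ("q3", "!"): "q4",
}

print(d_recognize("baaa!", SHEEP_TABLE, "q0", {"q4"}))
print(d_recognize("aba!b", SHEEP_TABLE, "q0", {"q4"}))
```

Swapping in a different table changes the language recognized without touching the interpreter.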
Key Points
Deterministic Pattern Example: Consider a set of traffic lights; the sequence of lights is red - red/amber - green - amber - red. The sequence can be pictured as a state machine, where the different states of the traffic lights follow each other.
Each state is dependent solely on the previous state, so if the lights are green, an amber light will always follow - that is, the system is deterministic. Deterministic systems are relatively easy to understand and analyse, once the transitions are fully known.
Key Points
Crudely therefore… matching strings with regular expressions (a la Perl) is a matter of
translating the expression into a machine (table) and passing the table to an interpreter
Recognition as Search
You can view this algorithm as state-space search.
States are pairings of tape positions and state numbers.
Operators are compiled into the table
Goal state is a pairing with the end of tape position and a final accept state
Generative Formalisms
A model m which can both generate and recognize all and only the strings of a formal language acts as a definition of that language; each string is composed of symbols from a finite set of symbols (alphabet)
L(m): ‘the formal language L characterized by the model m’
Finite-state automata define formal languages (without having to enumerate all the strings in the language)
The term Generative is based on the view that you can run the machine as a generator to get strings from the language.
Generative Formalisms
FSAs can be viewed from two perspectives:
Acceptors that can tell you if a string is in the language (recognition)
Generators to produce all and only the strings in the language (production)
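The generator view can be illustrated by walking the same kind of transition table outward from the start state. The sketch below is one possible approach, assuming a breadth-first enumeration with a length cap so that it terminates; the function name `generate` and the cap of 6 are arbitrary choices.

```python
from collections import deque

def generate(table, start, finals, max_length):
    """List every accepted string of at most max_length symbols, shortest first."""
    results = []
    queue = deque([(start, "")])
    while queue:
        state, prefix = queue.popleft()
        if state in finals:
            results.append(prefix)
        if len(prefix) < max_length:
            # Follow every arc that leaves the current state.
            for (source, symbol), target in sorted(table.items()):
                if source == state:
                    queue.append((target, prefix + symbol))
    return results

# Transition table for the sheep language /baa+!/.
SHEEP_TABLE = {
    ("q0", "b"): "q1",
    ("q1", "a"): "q2",
    ("q2", "a"): "q3",
    ("q3", "a"): "q3",
    ("q3", "!"): "q4",
}

sheep_strings = generate(SHEEP_TABLE, "q0", {"q4"}, 6)
print(sheep_strings)
```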
Summary
Regular expressions are just a compact textual representation of FSAs
Recognition is the process of determining if a string/input is in the language defined by some machine.
Recognition is straightforward with deterministic machines.