Finite State Automata and Tries

Finite State Automata and Tries Sambhav Jain IIIT Hyderabad


Transcript of Finite State Automata and Tries

Page 1: Finite State Automata and Tries

Finite State Automata and Tries

Sambhav Jain, IIIT Hyderabad

Page 2: Finite State Automata and Tries


Think !!!

• How do we store a dictionary in a computer?

• How do we search for an entry in that dictionary?

– Say every word is exactly 10 characters long and each character can be any letter from ‘a’ to ‘z’

E.g. aaaaaaaaaa, abcdefghij, ... As a regular expression, the language is [a-z]{10}

Page 3: Finite State Automata and Tries


A Simple Way

• aaaaaaaaaa
• aaaaaaaaab
• aaaaaaaaac
• ...
• zzzzzzzzzz

A Linear Sorted List of Entries

Page 4: Finite State Automata and Tries


A Simple Way

• aaaaaaaaaa
• aaaaaaaaab
• aaaaaaaaac
• ...
• zzzzzzzzzz

Words to be stored = 26^10 = 1.41167096 × 10^14

Each word has 10 characters, and each character takes 1 byte

Total ≈ 1.41 × 10^15 bytes ≈ 1.4 PB
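A quick check of this estimate, as a sketch only: it counts 26^10 words of 10 characters each at 1 byte per character. The variable names are illustrative and not part of the slides.

```python
# Storage estimate for a flat list of every string in [a-z]{10}.
ALPHABET_SIZE = 26      # letters a-z
WORD_LENGTH = 10        # every word has exactly 10 characters
BYTES_PER_CHAR = 1      # assumption taken from the slide: 1 byte per character

num_words = ALPHABET_SIZE ** WORD_LENGTH
total_bytes = num_words * WORD_LENGTH * BYTES_PER_CHAR

print(f"words: {num_words:.4e}")
print(f"bytes: {total_bytes:.4e}")
print(f"terabytes (10^12 bytes): {total_bytes / 1e12:.1f}")
```

The number of words alone is about 1.41 × 10^14; multiplying by the 10 bytes each word occupies gives the total size of the flat list.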

Page 5: Finite State Automata and Tries


Smart Way !

[Figure: 10 levels of nodes, each level holding the letters a b c d ... w x y z; a word is spelled by picking one letter per level]

Page 6: Finite State Automata and Tries


Smart Way !

[Figure: 10 levels of nodes, each level holding the letters a b c d ... w x y z; a word is spelled by picking one letter per level]

• Total storage = 26 × 10 = 260 bytes
• Traverse 10 nodes
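The layered idea can be sketched as follows, assuming one level per character position and 26 letters per level. The names `levels` and `accepts` are hypothetical, chosen only for this illustration.

```python
import string

WORD_LENGTH = 10

# One level per character position; each level holds the 26 letters a-z.
levels = [set(string.ascii_lowercase) for _ in range(WORD_LENGTH)]


def accepts(word: str) -> bool:
    """Return True if word is in [a-z]{10}, visiting one level per character."""
    if len(word) != len(levels):
        return False
    return all(ch in level for ch, level in zip(word, levels))


# Letters stored in total: 26 per level times 10 levels.
storage = sum(len(level) for level in levels)

print(storage)
print(accepts("abcdefghij"), accepts("abc"))
```

This only works because every combination of letters is a valid word; a real dictionary needs a structure that records which combinations exist.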

Page 7: Finite State Automata and Tries


Does it work for Natural Language?

• Oxford Advanced English Learner 20th Edition
– A quarter of a million distinct English words, excluding inflections, and words from technical and regional vocabulary not covered by the OED

• After inflections?
– eat, eats, eaten, eating ...

• What about multiple inflections?
– beauty, beautiful, beautifully ...

Page 8: Finite State Automata and Tries


Example (Store & Search)

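[Figure: trie storing eat, eats, eaten and eating; the shared prefix e-a-t is stored once and then branches into s, e-n and i-n-g]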
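A minimal sketch of storing and searching these forms in a trie. The class and method names are illustrative assumptions, not taken from the slides.

```python
class TrieNode:
    def __init__(self):
        self.children = {}      # maps a character to the child TrieNode
        self.is_word = False    # True if a stored word ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        """Store a word, reusing nodes for any prefix already present."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def search(self, word):
        """Return True only if exactly this word was stored."""
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word

    def count_nodes(self):
        """Count character nodes (the root is not counted)."""
        total, stack = 0, [self.root]
        while stack:
            node = stack.pop()
            total += len(node.children)
            stack.extend(node.children.values())
        return total


trie = Trie()
for form in ["eat", "eats", "eaten", "eating"]:
    trie.insert(form)

print(trie.search("eaten"), trie.search("ea"))
print(trie.count_nodes())
```

The four forms contain 18 characters in total, but the trie needs fewer nodes because the prefix "eat" is shared.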

Page 9: Finite State Automata and Tries


Example

[Figure: the same trie for the eat forms, extended with a new node b]

Page 10: Finite State Automata and Tries


Example

[Figure: the same trie, extended with nodes b, f, s and a]

Page 11: Finite State Automata and Tries


Example

[Figure: the same trie, extended further with nodes t, e, i, r and w]

Page 12: Finite State Automata and Tries


Inflectional morphology

• Deals with word forms of a root, when there is no change in lexical category.

• Each word form gives different values of features like gender, number, person, etc.

Page 13: Finite State Automata and Tries


Paradigm

• For a given root, there are many word forms with different features.

• Ex. Forms of Hindi root laDakA (boy)

         | Direct   Oblique
---------|------------------
Singular | laDakA   laDake
Plural   | laDake   laDakoM

Page 14: Finite State Automata and Tries


Paradigm

- 'laDakoM' is plural with oblique case, given by the feature structure {num=pl, case=obl}
- 'laDake' stands for two feature structures:
  + Singular oblique (Ex. laDake ne kahA ...), where oblique means 'laDake' is followed by a postposition marker
  + Plural direct case (Ex. laDake Aye)
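One way to picture this is a lookup from each word form to its list of feature structures. This is only a sketch: the dictionary layout and the value labels `sg` and `dir` are assumptions that mirror the `{num=pl, case=obl}` notation.

```python
# Word forms of the root laDakA and the feature structures each one carries.
ANALYSES = {
    "laDakA": [{"num": "sg", "case": "dir"}],
    "laDake": [
        {"num": "sg", "case": "obl"},   # e.g. laDake ne kahA ...
        {"num": "pl", "case": "dir"},   # e.g. laDake Aye
    ],
    "laDakoM": [{"num": "pl", "case": "obl"}],
}


def analyze(form):
    """Return every feature structure for a form (empty list if unknown)."""
    return ANALYSES.get(form, [])


print(analyze("laDake"))
print(analyze("laDakoM"))
```

An ambiguous form such as 'laDake' simply returns more than one analysis; choosing between them needs context, such as a following postposition.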

Page 15: Finite State Automata and Tries


Paradigm

o Paradigms
  - What operation is done on the root to obtain the word forms
  - Model using pairs: (delete string, add string), where 0 denotes the empty string

       | direct   oblique
    ---|------------------
    sg | (0,0)    (A,e)
    pl | (A,e)    (A,oM)

o List roots with the paradigms they follow:
  - ghoDA follows paradigm laDakA
  - charkhA follows paradigm laDakA
  - laDakA follows paradigm laDakA
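The (delete string, add string) model can be sketched as below, assuming the empty pair means "leave the root unchanged". The table layout and the function name `word_form` are hypothetical.

```python
# Paradigm table: (number, case) -> (string to delete, string to add).
PARADIGMS = {
    "laDakA": {
        ("sg", "direct"): ("", ""),
        ("sg", "oblique"): ("A", "e"),
        ("pl", "direct"): ("A", "e"),
        ("pl", "oblique"): ("A", "oM"),
    }
}

# Each root is listed with the paradigm it follows.
ROOT_PARADIGM = {"laDakA": "laDakA", "ghoDA": "laDakA", "charkhA": "laDakA"}


def word_form(root, num, case):
    """Apply the root's paradigm: strip the delete string, append the add string."""
    delete, add = PARADIGMS[ROOT_PARADIGM[root]][(num, case)]
    if not root.endswith(delete):
        raise ValueError(f"{root!r} does not end with {delete!r}")
    stem = root[:len(root) - len(delete)]
    return stem + add


for root in ROOT_PARADIGM:
    print(root, [word_form(root, n, c) for (n, c) in PARADIGMS["laDakA"]])
```

Only one table is stored for the whole class of roots, so adding a new root costs a single entry in `ROOT_PARADIGM`.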

Page 16: Finite State Automata and Tries


[Figure: trie with branches for roots beginning with l and k; every root is followed by its own full copy of the suffixes (A, e, oM, ...)]

Page 17: Finite State Automata and Tries


Abstracting out suffixes

[Figure: the same trie with the suffixes abstracted out; the root branches under k and l end in a pointer #1 instead of repeating the suffixes]

#1: Corresponds to paradigm for 'laDakA'

Page 18: Finite State Automata and Tries


- Suffix trie (forward)

            #1
            |
     ---------------
     |      |      |
     e      o      A
            |
            M

Page 19: Finite State Automata and Tries


• Can we further optimize our search?
– Use knowledge of paradigms
– Use a suffix tree

Page 20: Finite State Automata and Tries


• Store the suffix tree in main memory
• Store the rest of the words, categorized by paradigm, on the hard disk
• Do a backward search on the suffix tree
• Identify the paradigm
• Search only in that paradigm set
• E.g. if ‘-ing’ occurs, you will not first be searching words like home, cat, god ...
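The steps above can be sketched as follows. Everything here is a toy assumption: the in-memory dictionary `ROOTS_BY_PARADIGM` stands in for the per-paradigm sets on disk, and `ROOT_ENDING` assumes every form in a paradigm is turned back into its root by appending one fixed string.

```python
END = None  # key marking "a suffix ends here" inside a trie node

# Hypothetical data: suffixes per paradigm and the ending that restores the root.
SUFFIXES = [("A", "laDakA"), ("e", "laDakA"), ("oM", "laDakA")]
ROOT_ENDING = {"laDakA": "A"}
ROOTS_BY_PARADIGM = {"laDakA": {"laDakA", "ghoDA", "charkhA"}}  # stand-in for disk


def build_suffix_trie(suffixes):
    """Store each suffix reversed, so it can be matched from the word's end."""
    root = {}
    for suffix, paradigm in suffixes:
        node = root
        for ch in reversed(suffix):
            node = node.setdefault(ch, {})
        node.setdefault(END, []).append(paradigm)
    return root


def matching_suffixes(trie, word):
    """Walk the word backwards, yielding (suffix, paradigm) for every match."""
    node = trie
    for i in range(len(word) - 1, -1, -1):
        node = node.get(word[i])
        if node is None:
            return
        for paradigm in node.get(END, []):
            yield word[i:], paradigm


SUFFIX_TRIE = build_suffix_trie(SUFFIXES)


def lookup(word):
    """Identify the paradigm from the suffix, then search only that root set."""
    results = []
    for suffix, paradigm in matching_suffixes(SUFFIX_TRIE, word):
        root = word[:len(word) - len(suffix)] + ROOT_ENDING[paradigm]
        if root in ROOTS_BY_PARADIGM[paradigm]:
            results.append((root, paradigm))
    return results


print(lookup("ghoDoM"))
print(lookup("kitAb"))
```

A word whose ending matches no stored suffix is rejected without touching any root set, which is the saving the slide describes.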

Page 21: Finite State Automata and Tries


Finite State Automata

• A trie is a data structure

• An FSA is the computational approach

• Slight difference in representation – characters are put on edges rather than on nodes

Page 22: Finite State Automata and Tries


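[Figure: the trie redrawn as an FSA with characters on the edges. From the start state, one path reads l-a-D-a-k and another reads k-a-p-a-D; both join through 0 (empty) transitions into a shared suffix part with edges e, o followed by M, and A, each ending in an accepting state (+)]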

Page 23: Finite State Automata and Tries


FSA

o A deterministic finite-state machine is formally:
  - Q: A finite set of states (Ex.: {q0,q1,q2})
  - SIGMA: A finite input alphabet (Ex.: {a,b,c})
  - Start state: A state in Q, from which the machine starts (Ex.: q0)
  - F: A set of accepting states, F SUBSET Q (Ex.: {q2})
  - DELTA(q,i): A transition function or transition matrix where:
      - q MEMBER Q, i MEMBER SIGMA,
      - DELTA(q,i) MEMBER Q

Thus, DELTA: Q x SIGMA --> Q
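The five components map directly onto plain data. The sketch below reuses the example sets from the definition; the transition table itself is a made-up example (it ends in q2 exactly when the input ends in "ab") and is not from the slides.

```python
from itertools import product

Q = {"q0", "q1", "q2"}        # finite set of states
SIGMA = {"a", "b", "c"}       # input alphabet
START = "q0"                  # start state
F = {"q2"}                    # accepting states

# Hypothetical transition table: one entry for every (state, symbol) pair.
DELTA = {
    ("q0", "a"): "q1", ("q0", "b"): "q0", ("q0", "c"): "q0",
    ("q1", "a"): "q1", ("q1", "b"): "q2", ("q1", "c"): "q0",
    ("q2", "a"): "q1", ("q2", "b"): "q0", ("q2", "c"): "q0",
}


def is_well_formed():
    """Check that DELTA is a total function from Q x SIGMA into Q."""
    return (
        START in Q
        and F <= Q
        and set(DELTA) == set(product(Q, SIGMA))
        and set(DELTA.values()) <= Q
    )


def run(text):
    """Follow DELTA from the start state and return the state reached."""
    state = START
    for symbol in text:
        state = DELTA[(state, symbol)]
    return state


print(is_well_formed())
print(run("cab"), run("cab") in F)
```

Because DELTA has exactly one entry per (state, symbol) pair, the machine is deterministic: each input string leads to exactly one state.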

Page 24: Finite State Automata and Tries


RECOGNITION Problem

• So far we have handled only the RECOGNITION problem

• If the FSA is in a final state at the end of the input string, then EXIST

• Else NOT
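A minimal sketch of recognition, assuming a trie-shaped FSA with characters on the edges and integer states. A missing transition is treated as an immediate NOT; the function names are illustrative.

```python
def build_fsa(words):
    """Build a trie-shaped FSA: integer states, characters on the edges."""
    delta = {}          # (state, character) -> next state
    finals = set()      # states where a stored word ends
    next_state = 1      # state 0 is the start state
    for word in words:
        state = 0
        for ch in word:
            if (state, ch) not in delta:
                delta[(state, ch)] = next_state
                next_state += 1
            state = delta[(state, ch)]
        finals.add(state)
    return delta, finals


def recognize(delta, finals, text, start=0):
    """Return "EXIST" if the FSA ends in a final state, otherwise "NOT"."""
    state = start
    for ch in text:
        state = delta.get((state, ch))
        if state is None:       # no edge for this character
            return "NOT"
    return "EXIST" if state in finals else "NOT"


delta, finals = build_fsa(["laDakA", "laDake", "laDakoM"])
for candidate in ["laDake", "laDak", "laDakI"]:
    print(candidate, recognize(delta, finals, candidate))
```

The answer is only yes or no: nothing about the root or its features is returned.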

Page 25: Finite State Automata and Tries


• But we seek analyzed output
• We want the machine to tell
– Root
– Gender
– Number
– Person
– Case
– Etc. ...

Page 26: Finite State Automata and Tries


Finite State Transducer

An FST is like the finite state automaton defined earlier, except that each arc is labelled by a pair of symbols i:o, where
  i: symbol in the input string
  o: symbol output by the FST when the arc is taken

+ Ex. arc in a finite state transducer corresponding to 'e' in 'laDake'

          e : ((+sg, -direct), (+pl, +direct))
  q1 +----------------->--------------------+ q2

  Pair of symbols i : o
  - i is: 'e'
  - o is: '((+sg, -direct), (+pl, +direct))'

+ Ex. Morph Analyzer: match the input with i; if successful, go ahead and produce o in the output

Page 27: Finite State Automata and Tries


o Formally: Finite state transducer
  - Q: Finite set of states q0, ..., qN
  - SIGMA_IN: Finite set of input symbols
  - SIGMA_OUT: Finite set of output symbols
  - q0: Start state (q0 IN Q)
  - F: Set of final accepting states (F SUBSET Q)
  - DELTA(q, i:o): For every state q, gives the set of states that can be reached from q with i in SIGMA_IN and o in SIGMA_OUT
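A small sketch of such a transducer used as a morph analyzer for the forms of laDakA. The arc representation, the output strings and the function names are assumptions made for this illustration; the feature values follow the laDakA paradigm table. Since DELTA yields a set of states, the sketch follows all paths at once.

```python
from collections import defaultdict


def build_fst(stem, suffix_outputs):
    """Return (arcs, finals). States are integers and 0 is the start state."""
    arcs = defaultdict(list)    # (state, input symbol) -> [(output, next state)]
    state = 0
    for ch in stem:             # stem arcs copy the input symbol (ch:ch)
        arcs[(state, ch)].append((ch, state + 1))
        state += 1
    stem_end, next_state, finals = state, state + 1, set()
    for suffix, output in suffix_outputs:
        state = stem_end
        for k, ch in enumerate(suffix):
            is_last = k == len(suffix) - 1
            # Only the last arc of a suffix emits the analysis.
            arcs[(state, ch)].append((output if is_last else "", next_state))
            state, next_state = next_state, next_state + 1
        finals.add(state)
    return arcs, finals


def transduce(arcs, finals, text, start=0):
    """Return the output of every path that ends in a final state."""
    paths = [(start, "")]       # (current state, output produced so far)
    for ch in text:
        paths = [
            (nxt, out + o)
            for state, out in paths
            for o, nxt in arcs.get((state, ch), [])
        ]
    return [out for state, out in paths if state in finals]


ARCS, FINALS = build_fst("laDak", [
    ("A", "A{num=sg,case=dir}"),
    ("e", "A{num=sg,case=obl}"),
    ("e", "A{num=pl,case=dir}"),
    ("oM", "A{num=pl,case=obl}"),
])

print(transduce(ARCS, FINALS, "laDake"))
print(transduce(ARCS, FINALS, "laDakoM"))
```

An empty result plays the role of NOT in the recognition problem, while a non-empty result carries the root and features of each analysis.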

Page 28: Finite State Automata and Tries


Example

• on board

Page 29: Finite State Automata and Tries


Tools for FSA

• Lex
• OpenFST
– (www.openfst.org/)
• AT&T FSM Toolkit
– (http://www2.research.att.com/~fsmtools/fsm/)