October 2004 CSA3050 NL Algorithms 1

CSA3050: Natural Language Algorithms

Words, Strings and Regular Expressions

Finite State Automata

This lecture

• Outline
  – Words
  – The language of words
  – FSAs in Prolog

• Acknowledgement
  – Jurafsky and Martin, Speech and Language Processing, Prentice Hall 2000
  – Blackburn and Striegnitz, NLP Techniques in Prolog: http://www.coli.uni-sb.de/~kris/nlp-with-prolog/html/

What is a Word?

• A series of speech sounds that symbolizes meaning without being divisible into smaller units

• Any segment of written or printed discourse ordinarily appearing between spaces or between a space and a punctuation mark

• A set of linguistic forms produced by combining a single base with various inflectional elements without change in the part-of-speech elements

• A number of bytes processed as a unit.

Information Associated with Words

• Spelling
  – orthographic
  – phonological

• Syntax
  – POS
  – Valency

• Semantics
  – Meaning
  – Relationship to other words

Properties of Words

• Sequence
  – characters, e.g. pollution
  – phonemes

• Delimitation
  – whitespace
  – other?

• Structure
  – simple ("atomic") words
  – complex ("molecular") words

Complex Words

• enlargement
  – en + large + ment
  – (en + large) + ment
  – en + (large + ment)

• affixation
  – prefix
  – suffix
  – infix

Sets Underlie the Formation of Complex Words

prefixes: dis, re, un, en
roots: large, charge, infect, code, decide
suffixes: ed, ing, ee, er, ly

prefixes + roots + suffixes

Structure of Complex Words

• Complex words are made by concatenating elements chosen from
  – a set of prefixes
  – a set of roots
  – a set of suffixes

• The set of valid words for a given human language (e.g. English, Maltese) can be regarded as a formal language.
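
The concatenation of prefix, root and suffix sets can be sketched in Python as follows. This is only an illustration: the sets are the ones listed on the earlier slide, and the helper name `candidate_words` is an assumption, not part of the lecture. Plain concatenation applies no spelling rules and no restrictions on which affixes combine, so it produces many non-words alongside real ones.

```python
from itertools import product

# Sets taken from the "prefixes + roots + suffixes" slide.
PREFIXES = {"dis", "re", "un", "en"}
ROOTS = {"large", "charge", "infect", "code", "decide"}
SUFFIXES = {"ed", "ing", "ee", "er", "ly"}


def candidate_words(prefixes, roots, suffixes):
    """Return every prefix + root + suffix concatenation (hypothetical helper)."""
    return {p + r + s for p, r, s in product(prefixes, roots, suffixes)}


words = candidate_words(PREFIXES, ROOTS, SUFFIXES)
print(len(words))              # number of candidate forms
print("uninfected" in words)   # a real English word
print("rechargeing" in words)  # a misspelling: no e-deletion rule is applied
```

The overgeneration is the reason the set of valid words needs a more careful formal description than a bare product of three sets.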

The Language of Words

• What kind of formal language is the language of words?

• One which can be constructed out of
  – A characteristic set of basic symbols (alphabet)
  – A characteristic set of combining operations
    • Union (disjunction)
    • Concatenation
    • Closure (iteration)

• Regular Language; Regular Sets
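
As a rough illustration of the three combining operations, the Python sketch below treats a language as a set of strings. The function names are assumptions made for this example. The true closure is an infinite set, so the `closure` helper here stops after a fixed number of repetitions.

```python
def union(l1, l2):
    """Union (disjunction): strings in either language."""
    return l1 | l2


def concat(l1, l2):
    """Concatenation: every string of l1 followed by every string of l2."""
    return {x + y for x in l1 for y in l2}


def closure(lang, max_repeats):
    """Closure (iteration), truncated at max_repeats copies.

    The true Kleene closure is infinite, so this hypothetical helper
    only builds the finite part needed for illustration.
    """
    result = {""}    # zero repetitions: the empty string
    layer = {""}
    for _ in range(max_repeats):
        layer = concat(layer, lang)
        result |= layer
    return result


A = {"h"}
B = {"a"}
HA = concat(A, B)
print(sorted(union(A, B)))
print(sorted(HA))
print(sorted(closure(HA, 3), key=len))
```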

Characterising Classes of Set

[Diagram: a CLASS OF SETS or LANGUAGES is characterised both by a NOTATION and by a MACHINE]

Regular Expressions

• Notation for describing regular sets

• Used extensively in the Unix operating system (grep, sed, etc.) and also in some Microsoft products (Word)

• The Xerox finite-state tools use a somewhat different notation, but with similar function.

Regular Expressions

a       a simple symbol
A B     concatenation
A | B   alternation operator
A & B   intersection operator
A*      Kleene star
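
The operators in this table can be tried out with Python's `re` module, as in the sketch below. Python is an assumption here, and its syntax differs from the slide notation: concatenation is written without a space, and `re` has no intersection operator. The hypothetical helper `matches_both` emulates `A & B` by requiring a full match against both patterns.

```python
import re

# Concatenation: "h" followed by "a".
assert re.fullmatch(r"ha", "ha")

# Alternation: either "a" or "b".
assert re.fullmatch(r"a|b", "b")
assert not re.fullmatch(r"a|b", "ab")

# Kleene star: zero or more repetitions of "ha".
assert re.fullmatch(r"(ha)*", "")
assert re.fullmatch(r"(ha)*", "hahaha")
assert not re.fullmatch(r"(ha)*", "hah")


def matches_both(pattern_a, pattern_b, text):
    """Emulate A & B: text must be in both languages (hypothetical helper)."""
    return bool(re.fullmatch(pattern_a, text)) and bool(re.fullmatch(pattern_b, text))


# Strings of a's and b's that also have length exactly 2.
print(matches_both(r"[ab]*", r"..", "ab"))
print(matches_both(r"[ab]*", r"..", "aba"))
```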

Characterising Classes of Set

[Diagram: a CLASS OF SETS or LANGUAGES is characterised both by a NOTATION and by a MACHINE]

Finite Automaton

• A finite automaton comprises
  – A finite set of states Q
  – An alphabet of symbols I
  – A start state q0 ∈ Q
  – A set of final states F ⊆ Q
  – A transition function δ(q,i) which maps a state q ∈ Q and a symbol i ∈ I to a new state q' ∈ Q
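
The five components can be written down directly as a data structure. The Python sketch below is one possible rendering, not the lecture's own code: the class name, the field names and the example automaton (strings over a and b with an even number of a's) are all assumptions. Rejecting the input when no transition is defined is also a choice made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class FiniteAutomaton:
    states: frozenset     # Q
    alphabet: frozenset   # I
    start: object         # q0, a member of Q
    finals: frozenset     # F, a subset of Q
    delta: dict           # (q, i) -> q'

    def accepts(self, symbols):
        """Run the automaton; reject if a transition is undefined."""
        state = self.start
        for symbol in symbols:
            if (state, symbol) not in self.delta:
                return False
            state = self.delta[(state, symbol)]
        return state in self.finals


# Hypothetical example: strings over {a, b} with an even number of a's.
EVEN_AS = FiniteAutomaton(
    states=frozenset({"even", "odd"}),
    alphabet=frozenset({"a", "b"}),
    start="even",
    finals=frozenset({"even"}),
    delta={
        ("even", "a"): "odd",
        ("odd", "a"): "even",
        ("even", "b"): "even",
        ("odd", "b"): "odd",
    },
)

print(EVEN_AS.accepts("abba"))
print(EVEN_AS.accepts("ab"))
```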

Encoding FSAs in Prolog

• Three predicates
  – initial/1: initial(s) – s is an initial state
  – final/1: final(f) – f is a final state
  – arc/3: arc(s,t,c) – there is an arc from s to t labelled c

Example 1: FSA

initial(1).
final(4).
arc(1,2,h).
arc(2,3,a).
arc(3,4,!).
arc(3,2,h).

[Diagram: 1 (initial) -h-> 2 -a-> 3 -!-> 4 (final), with an h arc from 3 back to 2]

Example 2: FSA with jump arc

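initial(1).
final(4).
arc(1,2,h).
arc(2,3,a).
arc(3,4,!).
arc(3,1,#).    % jump arc: traversed without consuming a symbol

[Diagram: 1 (initial) -h-> 2 -a-> 3 -!-> 4 (final), with a jump arc # from 3 back to 1]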
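
One way to recognise strings with this automaton is to let a jump arc change state without consuming input. The Python sketch below illustrates that idea on the arcs of Example 2. It is not the lecture's Prolog code, and the names `JUMP`, `ARCS` and `recognize` are assumptions. The recursion terminates here because no cycle consists of jump arcs alone.

```python
JUMP = "#"  # label used on the slide for an arc that consumes no input

INITIAL = 1
FINAL = {4}
ARCS = [(1, 2, "h"), (2, 3, "a"), (3, 4, "!"), (3, 1, JUMP)]


def recognize(node, symbols):
    """Return True if symbols can be consumed on a path from node to a final state."""
    if not symbols and node in FINAL:
        return True
    for source, target, label in ARCS:
        if source != node:
            continue
        if label == JUMP:
            # Jump arc: move without consuming a symbol.
            if recognize(target, symbols):
                return True
        elif symbols and symbols[0] == label:
            # Ordinary arc: consume the matching first symbol.
            if recognize(target, symbols[1:]):
                return True
    return False


print(recognize(INITIAL, list("haha!")))
print(recognize(INITIAL, list("hah!")))
```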

Example 3: NDA

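initial(1).
final(4).
arc(1,2,h).
arc(2,3,a).
arc(3,4,!).
arc(2,1,a).    % non-determinism: two arcs labelled a leave state 2

[Diagram: 1 (initial) -h-> 2 -a-> 3 -!-> 4 (final), with a second a arc from 2 back to 1]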
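
A common way to handle this kind of non-determinism without backtracking is to track the set of all states the machine could be in. The Python sketch below applies that technique to the arcs of Example 3. It is a swapped-in illustration rather than the approach on the slides, and the names `step` and `recognize` are assumptions.

```python
INITIAL = 1
FINAL = {4}
ARCS = [(1, 2, "h"), (2, 3, "a"), (3, 4, "!"), (2, 1, "a")]


def step(states, symbol):
    """All states reachable from any state in `states` by one arc labelled `symbol`."""
    return {t for (s, t, label) in ARCS if s in states and label == symbol}


def recognize(symbols):
    """Track the set of possible current states instead of backtracking."""
    current = {INITIAL}
    for symbol in symbols:
        current = step(current, symbol)
    return bool(current & FINAL)


print(step({2}, "a"))        # both targets of the two a arcs
print(recognize("haha!"))
print(recognize("hah"))
```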

A Recogniser

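% Succeed when the input is exhausted in a final state.
recognize1(Node,[]) :-
    final(Node).

% Otherwise follow an arc whose label matches the next symbol.
recognize1(Node1,String) :-
    arc(Node1,Node2,Label),
    traverse1(Label,String,NewString),
    recognize1(Node2,NewString).

% Consume Label from the front of the input.
traverse1(Label,[Label|Symbols],Symbols).

% test1/1 is not defined on the slides; this definition is inferred from the trace.
test1(Symbols) :-
    initial(Node),
    recognize1(Node,Symbols).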

Trace

Call: (7) test1([h, a, !])
Call: (8) initial(_L181)
Exit: (8) initial(1)
Call: (8) recognize1(1, [h, a, !])
Call: (9) arc(1, _L199, _L200)
Exit: (9) arc(1, 2, h)
Call: (9) traverse1(h, [h, a, !], _L201)
Exit: (9) traverse1(h, [h, a, !], [a, !])
Call: (9) recognize1(2, [a, !])
Call: (10) recognize1(3, [!])
Call: (11) recognize1(4, [])
Call: (12) final(4)
Exit: (12) final(4)
Exit: (11) recognize1(4, [])
Exit: (10) recognize1(3, [!])
Exit: (9) recognize1(2, [a, !])
Exit: (8) recognize1(1, [h, a, !])
Exit: (7) test1([h, a, !])

Generation

• test1(X)

• X = [h, a, !] ;

• X = [h, a, h, a, !] ;

• X = [h, a, h, a, h, a, !] ;

• X = [h, a, h, a, h, a, h, a, !] ;

• etc.
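
In Prolog the same predicate generates strings when called with an unbound variable. Python has no direct equivalent, so the sketch below uses a different technique: a breadth-first walk over the arcs of Example 1 that yields accepted symbol lists shortest first. The names `ARCS` and `generate` are assumptions made for this illustration.

```python
from collections import deque
from itertools import islice

INITIAL = 1
FINAL = {4}
ARCS = [(1, 2, "h"), (2, 3, "a"), (3, 4, "!"), (3, 2, "h")]


def generate():
    """Yield accepted symbol lists, shortest first (breadth-first over paths)."""
    queue = deque([(INITIAL, [])])
    while queue:
        node, path = queue.popleft()
        if node in FINAL:
            yield path
        for source, target, label in ARCS:
            if source == node:
                queue.append((target, path + [label]))


# The language is infinite, so take only the first few answers.
for symbols in islice(generate(), 3):
    print(symbols)
```

Because the generator is lazy, `islice` can stop it after any number of answers, much as one stops pressing `;` at the Prolog prompt.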

3 Related Frameworks

[Diagram: REGULAR EXPRESSIONS describe REGULAR LANGS/SETS; FINITE STATE NETWORKS recognise REGULAR LANGS/SETS]

Regular Operations

• Operations
  – Concatenation
  – Union
  – Closure

• Over What
  – Language
  – Expressions
  – FS Automata

Concatenation over Reg. Expression and Language

Regular Expression

E1 = [a|b]
E2 = [c|d]
E1 E2 = [a|b] [c|d]

Language

L1 = {"a", "b"}
L2 = {"c", "d"}
L1 L2 = {"ac", "ad", "bc", "bd"}
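
The correspondence between the two columns can be checked mechanically. The Python sketch below is an illustration under stated assumptions: the slide's `[a|b] [c|d]` is rewritten as `(a|b)(c|d)` in the syntax of Python's `re` module, and the check only enumerates strings over {a, b, c, d} up to length 3.

```python
import re
from itertools import product

L1 = {"a", "b"}
L2 = {"c", "d"}

# Language side: concatenate every string of L1 with every string of L2.
L1_L2 = {x + y for x in L1 for y in L2}

# Expression side: [a|b] [c|d] on the slide, in Python's re syntax.
E1_E2 = re.compile(r"(a|b)(c|d)")

# Enumerate all strings over {a, b, c, d} up to length 3
# and keep those that the expression matches in full.
alphabet = "abcd"
matched = {
    "".join(chars)
    for n in range(4)
    for chars in product(alphabet, repeat=n)
    if E1_E2.fullmatch("".join(chars))
}

print(sorted(L1_L2))
print(matched == L1_L2)
```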

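Concatenation over FS Automata

[Diagram: an automaton accepting a or b and an automaton accepting c or d, followed by their concatenation, which accepts a or b and then c or d]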
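
One standard way to concatenate two automata is to link every final state of the first to the initial state of the second with a jump arc. The slide's diagram did not survive extraction, so the Python sketch below may not match its exact construction. The dictionary representation, the state tagging and the function names are assumptions made for this illustration.

```python
JUMP = "#"  # jump arc label, following the jump-arc example


def concatenate(fsa1, fsa2):
    """Build an automaton for L(fsa1) followed by L(fsa2).

    Each automaton is a dict with keys "initial", "finals" and "arcs";
    this representation is an assumption made for the sketch.
    States are tagged with 1 or 2 so the two state sets cannot clash.
    """
    arcs = [((1, s), (1, t), c) for s, t, c in fsa1["arcs"]]
    arcs += [((2, s), (2, t), c) for s, t, c in fsa2["arcs"]]
    # Link every final state of the first machine to the start of the second.
    arcs += [((1, f), (2, fsa2["initial"]), JUMP) for f in fsa1["finals"]]
    return {
        "initial": (1, fsa1["initial"]),
        "finals": {(2, f) for f in fsa2["finals"]},
        "arcs": arcs,
    }


def recognize(fsa, node, symbols):
    """Backtracking recogniser that follows jump arcs without consuming input."""
    if not symbols and node in fsa["finals"]:
        return True
    for source, target, label in fsa["arcs"]:
        if source != node:
            continue
        if label == JUMP:
            if recognize(fsa, target, symbols):
                return True
        elif symbols and symbols[0] == label:
            if recognize(fsa, target, symbols[1:]):
                return True
    return False


A_OR_B = {"initial": 1, "finals": {2}, "arcs": [(1, 2, "a"), (1, 2, "b")]}
C_OR_D = {"initial": 1, "finals": {2}, "arcs": [(1, 2, "c"), (1, 2, "d")]}
BOTH = concatenate(A_OR_B, C_OR_D)

for word in ["ac", "bd", "a", "ca"]:
    print(word, recognize(BOTH, BOTH["initial"], word))
```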

Issues

• Handling jump arcs.

• Handling non-determinism.

• Computing operations over networks.

• Maintaining multiple states in DB.

• Representation.