Alexander Gelbukh Gelbukh


Page 1: Alexander Gelbukh Gelbukh

Special Topics in Computer Science
Advanced Topics in Information Retrieval

Lecture 10: Natural Language Processing and IR.
Syntax and structural disambiguation

Alexander Gelbukh
www.Gelbukh.com

Page 2: Alexander Gelbukh Gelbukh

Previous Chapter: Conclusions

Tagging, word sense disambiguation, and anaphora resolution are cases of disambiguation of meaning
Useful in translation, information retrieval, and text understanding
Dictionary-based methods are good but expensive
Statistical methods are cheap and sometimes imperfect... but not always (if very large corpora are available)

Page 3: Alexander Gelbukh Gelbukh

Previous Chapter: Research topics

Too many to list
New methods
Lexical resources (dictionaries)
= Computational linguistics

Page 4: Alexander Gelbukh Gelbukh

Contents

Language levels
Syntax
    Dependency approach
    Constituency-based approach
    Head-driven approach
Grammars and parsing
Ambiguity and disambiguation

Page 5: Alexander Gelbukh Gelbukh

Language levels

Letters are built up into words
Words into sentences
Sentences into <...> text
Each level has its own representation
This allows for modular processing
    A module describes one level or transforms from one level to another

Page 6: Alexander Gelbukh Gelbukh

Source of language complexity: 1-D

[Figure labels: Brain 1, Brain 2, Meaning, Language, Text (speech)]

Page 7: Alexander Gelbukh Gelbukh

Source of language complexity: 1-D

[Figure labels: Knowledge, Language, Text]

Page 8: Alexander Gelbukh Gelbukh

Linguistic processor translates between representations

[Figure labels: Texts, Linguistic module, Meanings, Applied system]

Page 9: Alexander Gelbukh Gelbukh

General scheme of text processing

[Figure labels: Input, Linguistic processor, (Semantic) representation, Applied system (e.g., Expert system), Output]

Linguistic processor uses linguistic knowledge
Applied system uses other types of knowledge (e.g., Artificial Intelligence)

Page 10: Alexander Gelbukh Gelbukh

Language levels

Morphological: words
Syntactic: sentences
Semantic: meaning
Pragmatic: intention
...?

Page 11: Alexander Gelbukh Gelbukh

Fine structure of linguistic processor

[Figure: Text (surface representation) → Morphological transformer → Morphological representation → Syntactic transformer → Syntactic representation → Semantic transformer → Semantic representation (Meaning)]

Page 12: Alexander Gelbukh Gelbukh

Example of text

“Science is important for our country. The Government pays it much attention.”

Page 13: Alexander Gelbukh Gelbukh

Textual representation

Text is a sequence of letters.

S c i e n c e   i s   i m p o r t a n t   f o r   o u r   c o u n t r y .   T h e   G o v e r n m e n t   p a y s   i t   m u c h   a t t e n t i o n .

Page 14: Alexander Gelbukh Gelbukh

Morphological analysis

[Figure: linguistic processor = morphological analyzer → syntactic parser → semantic analyzer; current stage: morphological analysis]

Page 15: Alexander Gelbukh Gelbukh

Morphological representation

A sequence of words.

Word        Lemma       Part of speech   Features
The         THE         article          definite, plural/singular
science     SCIENCE     noun             singular
is          BE          verb             present, 3rd person, singular
important   IMPORTANT   adjective
for         FOR         preposition
our         WE          pronoun          possessive
country     COUNTRY     noun             singular
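
The representation on this slide can be sketched as a list of records, one per word. This is only a minimal illustration, assuming a tiny hand-written lexicon that covers the example sentence; the names `Morph`, `LEXICON`, and `analyze` are hypothetical and not part of the lecture.

```python
from typing import NamedTuple

class Morph(NamedTuple):
    word: str      # surface form as it appears in the text
    lemma: str     # dictionary form
    pos: str       # part of speech
    features: str  # grammatical features

# Hypothetical toy lexicon covering only the example sentence.
LEXICON = {
    "science": ("SCIENCE", "noun", "singular"),
    "is": ("BE", "verb", "present, 3rd person, singular"),
    "important": ("IMPORTANT", "adjective", ""),
    "for": ("FOR", "preposition", ""),
    "our": ("WE", "pronoun", "possessive"),
    "country": ("COUNTRY", "noun", "singular"),
}

def analyze(text: str) -> list[Morph]:
    """Split on whitespace, strip punctuation, and look each word up."""
    result = []
    for token in text.split():
        word = token.strip(".,!?").lower()
        # Unknown words get an upper-cased lemma and the tag "unknown".
        lemma, pos, features = LEXICON.get(word, (word.upper(), "unknown", ""))
        result.append(Morph(word, lemma, pos, features))
    return result

for m in analyze("Science is important for our country."):
    print(m.word, m.lemma, m.pos, m.features)
```

A real morphological analyzer would handle inflection by rules or a full dictionary rather than by listing every word form.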

Page 16: Alexander Gelbukh Gelbukh

Syntactic parsing

[Figure: linguistic processor = morphological analyzer → syntactic parser → semantic analyzer; current stage: syntactic parsing]

Page 17: Alexander Gelbukh Gelbukh

Syntactic representation

A sequence of syntactic trees.

[Figure: two trees. Tree 1 nodes: BE, SCIENCE, IMPORTANT, COUNTRY, WE, with the label "of". Tree 2: PAY with dependents GOVERNMENT, ATTENTION, IT; MUCH depends on ATTENTION]

Page 18: Alexander Gelbukh Gelbukh

Syntactic representation

What happened?
With whom did it happen?
... their details

[Figure: PAY with dependents GOVERNMENT, ATTENTION, IT; MUCH depends on ATTENTION]

Page 19: Alexander Gelbukh Gelbukh

Semantic analysis

[Figure: linguistic processor = morphological analyzer → syntactic parser → semantic analyzer; current stage: semantic analysis]

Next lecture...

Page 20: Alexander Gelbukh Gelbukh

Syntax

The structure describing the relationships between words in a sentence
Describes the relationships implied by grammatical characteristics, not by meaning
Often allows for simple paraphrasing
    John reads the book
    The book is read by John

Page 21: Alexander Gelbukh Gelbukh

Early approach: Dependency syntax

Tree
    Nodes: words
    Arcs: modified by
Modifies means adds details, clarifies, chooses one of many... makes more specific
Arcs are typed
    Types are: subject, object, attribute, ...

[Figure: PAY → GOVERNMENT (Subject), PAY → ATTENTION (Object), PAY → IT (Recipient), ATTENTION → MUCH (Attribute)]

Page 22: Alexander Gelbukh Gelbukh

... Dependency syntax

General situation: pay
More specifically: the one where:
    who pays is government
    what is paid is attention
    to whom it is paid is it
More specifically: attention that is much

[Figure: PAY → GOVERNMENT (Subject), PAY → ATTENTION (Object), PAY → IT (Recipient), ATTENTION → MUCH (Attribute)]
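
The dependency tree described above can be sketched as a small data structure with typed arcs. This is an illustrative sketch only; the class `Node` and the function `show` are hypothetical names, and the arc types are the ones shown in the figure.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    word: str
    # Each arc is (type, dependent node); the dependent modifies this word.
    arcs: list = field(default_factory=list)

    def add(self, arc_type: str, dependent: "Node") -> "Node":
        self.arcs.append((arc_type, dependent))
        return self

def show(node: Node, indent: int = 0, label: str = "root") -> list[str]:
    """Return the tree as indented lines, one node per line."""
    lines = [" " * indent + f"{label}: {node.word}"]
    for arc_type, dep in node.arcs:
        lines.extend(show(dep, indent + 4, arc_type))
    return lines

# "The Government pays it much attention."
attention = Node("ATTENTION").add("attribute", Node("MUCH"))
pay = (Node("PAY")
       .add("subject", Node("GOVERNMENT"))
       .add("object", attention)
       .add("recipient", Node("IT")))

print("\n".join(show(pay)))
```

Note that nothing in this structure records word order, which is one reason the approach suits languages with free word order.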

Page 23: Alexander Gelbukh Gelbukh

Advantages/disadvantages of Dependency Syntax

Advantages
    Solid linguistic base
    Rather direct translation into semantics
    Easily applicable to languages with free word order
        Korean? Russian, Latin
        This is why it has a solid linguistic base: it is good for classical languages!
Disadvantages
    No nice mathematical base
    No simple algorithms

Page 24: Alexander Gelbukh Gelbukh

Most popular approach: Constituency (Phrase Structure grammars)

Tree
    Nodes: nested segments of the phrase
        Cannot intersect, only nested
        Usually are labeled with part-of-speech names
    Arcs: nesting
        In classical approach, arcs are not labeled

[[Our Government] [pays [much attention] [to it]]]

Page 25: Alexander Gelbukh Gelbukh

Constituency

[[Our Government] [pays [much attention] [to it]]]

[Figure: the same sentence drawn as nested boxes: "Our Government", "pays", "much attention", "to it"]

Page 26: Alexander Gelbukh Gelbukh

Constituency

[[Our_R Government_N]_NP [pays_V [much_A attention_N]_NP [to_P it_R]_PP]_VP]_S

R: pronoun      NP: noun phrase
N: noun         VP: verb phrase
V: verb         PP: prepositional phrase
A: adjective    S: sentence

Page 27: Alexander Gelbukh Gelbukh

Constituency: graphical representation

[[Our Government]NP [pays [much attention]NP [to it]PP]VP]S

S
    NP
        R   Our
        NP
            N   Government
    VP
        VP
            V   pays
        NP
            A   much
            NP
                N   attention
        PP
            P   to
            NP
                R   it
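
The labeled bracketing on this slide can be produced from a nested data structure. The following is a minimal sketch that follows the flat bracketing rather than the fuller drawn tree; the tuple layout and the function names `brackets` and `words` are assumptions made for illustration.

```python
# A constituent is (label, child, child, ...); a leaf is (pos_tag, word).
TREE = ("S",
        ("NP", ("R", "Our"), ("N", "Government")),
        ("VP", ("V", "pays"),
               ("NP", ("A", "much"), ("N", "attention")),
               ("PP", ("P", "to"), ("R", "it"))))

def is_leaf(node) -> bool:
    return len(node) == 2 and isinstance(node[1], str)

def brackets(node) -> str:
    """Render the tree in the slide's style: [ ... ]LABEL."""
    if is_leaf(node):
        return node[1]
    inner = " ".join(brackets(child) for child in node[1:])
    return f"[{inner}]{node[0]}"

def words(node) -> list[str]:
    """Read the sentence back off the leaves, left to right."""
    if is_leaf(node):
        return [node[1]]
    return [w for child in node[1:] for w in words(child)]

print(brackets(TREE))
print(" ".join(words(TREE)))
```

Because constituents are nested and never intersect, reading the leaves left to right always recovers the original word order.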

Page 28: Alexander Gelbukh Gelbukh

Phrase structure grammar

Enumerates possible configurations at nodes
Usually recursive

S → NP VP
NP → A NP
NP → R NP
PP → P NP
NP → N
VP → VP NP PP
VP → V

[Figure: the constituency tree of "Our Government pays much attention to it", with S at the root, phrase labels NP, VP, PP below it, and the part-of-speech row R N V A N P R above the words]

Page 29: Alexander Gelbukh Gelbukh

Context-independency hypothesis

A configuration is possible or not, regardless of where it is used
    Wherever you find VP NP PP, it can be VP
    Wherever you find NP VP, it can be S
If you can put together an S that covers the whole sentence, it is a grammatically correct description
With this, given a suitable grammar, you can
    List all sentences of a language
    List only correct sentences of that language
    List all and only correct structures
Correctness means a native speaker’s intuition

Page 30: Alexander Gelbukh Gelbukh

Generative idea

Find a grammar to list all and only correct sentences (with their structures) of a language
This is a complete description of that language!
How can it be useful in analysis?
    Reverse the grammar
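
The generative idea can be illustrated by letting a small grammar list its sentences. This is only a sketch under stated assumptions: the rules, the lexicon, the extra symbol `R_obj`, and the function `generate` are hypothetical and deliberately non-recursive so that the list is finite.

```python
from itertools import product

# Hypothetical toy grammar, chosen only for illustration.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["R", "N"], ["A", "N"]],
    "VP": [["V", "NP"], ["V", "NP", "PP"]],
    "PP": [["P", "R_obj"]],
}
# Hypothetical lexicon: part-of-speech symbol -> words.
LEXICON = {
    "R": ["our"], "N": ["Government", "attention"], "A": ["much"],
    "V": ["pays"], "P": ["to"], "R_obj": ["it"],
}

def generate(symbol: str) -> list[list[str]]:
    """List every word sequence derivable from `symbol`."""
    if symbol in LEXICON:
        return [[w] for w in LEXICON[symbol]]
    results = []
    for rhs in GRAMMAR[symbol]:
        # Combine every expansion of each right-hand-side symbol.
        for parts in product(*(generate(s) for s in rhs)):
            results.append([w for part in parts for w in part])
    return results

sentences = [" ".join(s) for s in generate("S")]
print(len(sentences))
print("our Government pays much attention to it" in sentences)
```

This toy grammar also lists strings such as "much attention pays our Government", which shows why "all and only correct sentences" is the hard part of the task.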

Page 31: Alexander Gelbukh Gelbukh

Parsing

Given a grammar and a sentence
Find all possible structures
    That describe this sentence with this grammar
Many methods. Not discussed today.
    A lot of research. Very fast algorithms
Complexity: cubic in the number of words in the sentence (asymptotically faster methods exist, with an exponent of about 2.8)
Problem: combinatorics of variants
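
The lecture does not discuss specific methods, so the following is a swapped-in illustration: a minimal sketch of the well-known CKY chart algorithm, whose three nested loops over span length, start, and split point give the cubic complexity. The grammar (in binary form), the lexicon, and the function `count_parses` are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical toy grammar with binary rules only: (parent, left, right).
BINARY = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),
    ("PP", "P", "NP"),
]
# Hypothetical lexicon: word -> possible categories.
LEXICON = {
    "I": ["NP"], "see": ["V"], "a": ["Det"],
    "cat": ["N"], "telescope": ["N"], "with": ["P"],
}

def count_parses(words: list[str], start: str = "S") -> int:
    """CKY chart: chart[i][j][X] = number of ways X derives words[i:j]."""
    n = len(words)
    chart = [[defaultdict(int) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        for tag in LEXICON[w]:
            chart[i][i + 1][tag] += 1
    for length in range(2, n + 1):           # span length
        for i in range(n - length + 1):      # span start
            j = i + length
            for k in range(i + 1, j):        # split point
                for parent, left, right in BINARY:
                    chart[i][j][parent] += chart[i][k][left] * chart[k][j][right]
    return chart[0][n][start]

print(count_parses("I see a cat with a telescope".split()))
```

With this toy grammar the sentence gets two structures, one for each place where "with a telescope" can attach. The chart counts them without listing each tree separately, which is how the combinatorics of variants can be kept under control during parsing.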

Page 32: Alexander Gelbukh Gelbukh

Advantages and disadvantages of constituency approach

Advantages
    Nice mathematics, very well understood
    Efficient analysis algorithms, very well-elaborated
    Good for languages with fixed word order
        English. Chinese?
Disadvantages
    Difficult translation into semantics
    Bad when it comes to freer word order
        Even in English! Worse in other languages

Page 33: Alexander Gelbukh Gelbukh

Head-driven approaches

Combine some advantages of dependency-based and constituency-based approaches
Syntax is still fixed-order. But word dependency information is added
    Easier translation into semantics
    More linguistically-based
How?
    In each constituent, the main word (head) is marked
    It modifies the head of the larger constituent

[[Our Government] [pays [much attention] [to it]]]
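
The head-marking idea can be sketched as follows: each constituent records which child is its head, and dependencies are read off by letting the head word of every non-head child modify the head word of the enclosing constituent. The tree layout, the choice of heads (for example, "it" as head of "to it", to match the earlier dependency tree), and the function names are assumptions made for illustration.

```python
# A constituent is (label, head_index, children); a leaf is a plain word.
# Which child is the head is an assumption made for this illustration.
TREE = ("S", 1, [
    ("NP", 1, ["Our", "Government"]),
    ("VP", 0, [
        "pays",
        ("NP", 1, ["much", "attention"]),
        ("PP", 1, ["to", "it"]),
    ]),
])

def head_word(node) -> str:
    """Follow head marks down to a single word."""
    if isinstance(node, str):
        return node
    _, head_index, children = node
    return head_word(children[head_index])

def dependencies(node) -> list[tuple[str, str]]:
    """Return (head, dependent) pairs derived from the head marks."""
    if isinstance(node, str):
        return []
    _, head_index, children = node
    head = head_word(node)
    pairs = []
    for i, child in enumerate(children):
        if i != head_index:
            # The non-head child's head word modifies this constituent's head.
            pairs.append((head, head_word(child)))
        pairs.extend(dependencies(child))
    return pairs

for head, dep in dependencies(TREE):
    print(f"{head} <- {dep}")
```

The resulting pairs reproduce the untyped skeleton of the dependency tree (pays with Government, attention, and it as dependents), while the constituency tree itself still fixes the word order.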

Page 34: Alexander Gelbukh Gelbukh

Syntactic ambiguity

I see a cat with a telescope
    I see [a cat] [with a telescope]
        I use a telescope to see a cat
    I see [a cat [with a telescope]]
        I see a cat that has a telescope
Nearly any preposition causes ambiguity
Dozens, thousands, millions of variants for a sentence!
    Because their numbers multiply
    I see a cat with a telescope in a garden at the shore of a river
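
How fast the number of variants grows can be estimated with a small counting sketch. It assumes a simplified model that is not stated in the lecture: each prepositional phrase may attach to the verb or to any noun still reachable on the right edge of the structure, and attachments may not cross. The function name `count_attachments` is hypothetical.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_attachments(open_sites: int, remaining_pps: int) -> int:
    """Count non-crossing ways to attach the remaining prepositional phrases.

    open_sites: words a new phrase may still attach to (the verb plus the
    nouns on the right edge of the structure built so far).
    """
    if remaining_pps == 0:
        return 1
    total = 0
    for site in range(open_sites):
        # Attaching to site number `site` closes every site to its right
        # and opens one new site: the noun inside the attached phrase.
        total += count_attachments(site + 2, remaining_pps - 1)
    return total

# "I see a cat" starts with two open sites: "see" and "cat".
for n in range(1, 5):
    print(n, count_attachments(2, n))
```

Under this model one prepositional phrase gives 2 variants, two give 5, three give 14, and the four phrases of the long example sentence give 42, before any other source of ambiguity is counted.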

Page 35: Alexander Gelbukh Gelbukh

Ambiguity resolution

Syntactic means are not enough
Is telescope more related to see or to cat?
    Statistical methods: is it used with see or cat?
    Dictionary-based methods: does it share more meaning with see or cat?
        • Path length in a dictionary of semantic relationships
Ideally, context should be analyzed, and reasoning applied:
    I see a cat with a telescope. It keeps the telescope in its left paw.
    Currently there are no good methods for this.
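
The statistical option can be sketched as a comparison of co-occurrence frequencies. All numbers below are invented purely for illustration (real counts would come from a large corpus), and the names `PAIR_COUNTS`, `WORD_COUNTS`, and `attach` are hypothetical.

```python
# Invented co-occurrence counts for illustration only; real counts
# would come from a large corpus.
PAIR_COUNTS = {
    ("see", "telescope"): 40,
    ("cat", "telescope"): 2,
    ("see", "tail"): 3,
    ("cat", "tail"): 55,
}
# Invented total frequencies of the candidate head words.
WORD_COUNTS = {"see": 1000, "cat": 500}

def attach(verb: str, noun: str, pp_noun: str) -> str:
    """Pick the candidate head that co-occurs relatively more often with pp_noun."""
    def score(head: str) -> float:
        return PAIR_COUNTS.get((head, pp_noun), 0) / WORD_COUNTS[head]
    # Ties go to the verb; this tie-breaking rule is an arbitrary choice.
    return verb if score(verb) >= score(noun) else noun

print(attach("see", "cat", "telescope"))  # attachment for "with a telescope"
print(attach("see", "cat", "tail"))       # attachment for "with a tail"
```

With these invented counts, "telescope" attaches to "see" and "tail" attaches to "cat". Dividing by the head word's own frequency keeps a very frequent word from winning only because it is frequent.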

Page 36: Alexander Gelbukh Gelbukh

Shallow parsing

Due to the HUGE problems in resolving ambiguity
    Do not resolve it!
    Do what you can do well
    I see [a cat] [with a telescope] [in a garden] [at the shore] [of a river]
Better than nothing
Can be done well
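
The bracketing above can be produced without deciding any attachment. The following is a minimal sketch, assuming two small hand-written word lists and one simple rule; `PREPOSITIONS`, `DETERMINERS`, and `chunk` are hypothetical names, and real shallow parsers use part-of-speech tags instead of word lists.

```python
PREPOSITIONS = {"with", "in", "at", "of", "to", "on"}
DETERMINERS = {"a", "an", "the"}

def chunk(sentence: str) -> list[str]:
    """Group words into flat chunks; a chunk opens at a preposition, or at
    a determiner that does not directly follow a preposition."""
    chunks, current, prev = [], None, ""
    for word in sentence.split():
        low = word.lower()
        opens = low in PREPOSITIONS or (
            low in DETERMINERS and prev not in PREPOSITIONS)
        if opens:
            if current:
                chunks.append("[" + " ".join(current) + "]")
            current = [word]
        elif current is not None:
            current.append(word)
        else:
            chunks.append(word)  # words before the first chunk stay bare
        prev = low
    if current:
        chunks.append("[" + " ".join(current) + "]")
    return chunks

text = "I see a cat with a telescope in a garden at the shore of a river"
print(" ".join(chunk(text)))
```

The output is a flat sequence of chunks; which chunk modifies which word is deliberately left undecided.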

Page 37: Alexander Gelbukh Gelbukh

Evaluation

PARSEVAL international contests
A practical parser usually gives only one variant
    Implies disambiguation!
Manually built corpora (treebanks)
Compare what the program did with what humans did
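
Comparing a parser's output with a treebank is commonly done by matching labeled brackets. The following is a minimal sketch of such bracket scoring; the span sets are hand-made for the telescope sentence, the choice of which reading counts as the human annotation is an assumption, and the function name `bracket_scores` is hypothetical.

```python
def bracket_scores(gold: set, predicted: set) -> tuple[float, float, float]:
    """Labeled bracket precision, recall, and F1 over (label, start, end) spans."""
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if correct else 0.0
    return precision, recall, f1

# Spans over "I see a cat with a telescope" (word positions 0..7).
# Hypothetical human annotation: the telescope is the instrument of seeing.
gold = {("S", 0, 7), ("NP", 0, 1), ("VP", 1, 7), ("VP", 1, 4),
        ("NP", 2, 4), ("PP", 4, 7), ("NP", 5, 7)}
# Hypothetical parser output: the cat has the telescope.
predicted = {("S", 0, 7), ("NP", 0, 1), ("VP", 1, 7), ("NP", 2, 7),
             ("NP", 2, 4), ("PP", 4, 7), ("NP", 5, 7)}

precision, recall, f1 = bracket_scores(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here six of the seven brackets match in each direction, so a single wrong attachment still yields a fairly high score; this is worth keeping in mind when reading such figures.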

Page 38: Alexander Gelbukh Gelbukh

One of the uses in IR: Lexical ambiguity resolution

Syntactic analysis helps in POS disambiguation:
    Oil is used well in Mexico.
    Oil well is used in Mexico.
    Well = ?
But does not help in WSD:
    I deposited my money in an international bank.
    I live on a beautiful bank of the Han river.

Page 39: Alexander Gelbukh Gelbukh

Research topics

Faster algorithms
    E.g. parallel
Handling linguistic phenomena not handled by current approaches
Ambiguity resolution!
    Statistical methods
    A lot can be done

Page 40: Alexander Gelbukh Gelbukh

Conclusions

Syntax structure is one of the intermediate representations of a text for its processing
Helps text understanding
    Thus reasoning, question answering, ...
Directly helps POS tagging
    Resolves lexical ambiguity of part of speech
    But not WSD-type ambiguities
A big science in itself, with 50 (2000?) years of history

Page 41: Alexander Gelbukh Gelbukh

Thank you!

Till June 8? 6 pm

Semantics