Special Topics in Computer Science Advanced Topics in Information Retrieval Lecture 9: Natural...

Special Topics in Computer Science

Advanced Topics in Information Retrieval

Lecture 9: Natural Language Processing and IR. Tagging, WSD, and Anaphora Resolution

Alexander Gelbukh

www.Gelbukh.com

Previous Chapter: Conclusions

Reducing synonyms can help IR
• Better matching
• Ontologies are used: WordNet

Morphology is a variant of synonymy widely used in IR systems
• Precise analysis: dictionary-based analyzers
• Quick-and-dirty analysis: stemmers
• Rule-based stemmers: Porter stemmer
• Statistical stemmers

Previous Chapter: Research topics

Construction and application of ontologies
Building morphological dictionaries
Treatment of unknown words with morphological analyzers
Development of better stemmers
• Statistical stemmers?

Contents

Tagging: for each word, determine its POS (part of speech: noun, ...) and grammatical characteristics

WSD (Word Sense Disambiguation): for each word, determine which homonym is used

Anaphora resolution: for a pronoun (it, ...), determine what it refers to

Tagging: The problem

Ambiguity of parts of speech
• rice flies like sand
  = insects living in rice consider sand good?
  = rice can fly similarly to sand?
• ... insect of a container with rice...?
• We can fly like sand ...
• We think fly like sand...

Ambiguity of grammatical characteristics
• He has read the book
• He will read the book...
• He read the book

A very frequent phenomenon: it occurs at nearly every word!

Tagger...

A program that looks at the context and decides what the part of speech (and other characteristics) is

Input: He will read the book

Morphological analysis: He<...> will<Ns | Va> read<Vpa | Vpp | Vinf> the<...>

Tagger output: ? ? ? ? ?

Tags: Ns = noun singular, Va = verb auxiliary, Vpa = verb past, Vpp = verb past participle, Vinf = verb infinitive, ...

...Tagger

Input of tagger: He<...> will<Ns | Va> read<Vpa | Vpp | Vinf> the<...>

Task: Choose one! Output: He<...> will<Va> read<Vinf> the<...>

How do we do it?
• He will<N>: not possible, hence Va
• will<Va> read: hence Vinf
• This is simple, but imagine He is ambiguous... Explosion
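The "explosion" can be made concrete with a small sketch. The candidate tags for will and read come from the slide; the tags Pron and Det and the candidates for book are assumptions added only to complete the example. The number of possible taggings is the product of the per-word candidate counts, so it grows exponentially with sentence length.

```python
from itertools import product

# Candidate tags per word. "Pron", "Det" and the tags for "book" are assumed
# for illustration; the slide leaves them as <...>.
candidates = [
    ("He", ["Pron"]),
    ("will", ["Ns", "Va"]),
    ("read", ["Vpa", "Vpp", "Vinf"]),
    ("the", ["Det"]),
    ("book", ["Ns", "Vinf"]),
]

def all_taggings(cands):
    """Enumerate every combination of one tag per word."""
    return list(product(*(tags for _, tags in cands)))

def count_taggings(cands):
    """Number of combinations = product of the per-word candidate counts."""
    n = 1
    for _, tags in cands:
        n *= len(tags)
    return n

taggings = all_taggings(candidates)  # 1 * 2 * 3 * 1 * 2 combinations
```

A tagger has to pick one of these combinations without enumerating them all, which is what the methods on the following slides address.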

Applications

Used for word sense disambiguation:
• Oil well in Mexico is used.
• Oil is used well in Mexico.
For stemming and lemmatization
Important for matching in information retrieval
Greatly speeds up syntactic analysis
• Tagging is local
• No need to process the whole sentence to find that a certain tag is incorrect

How: Parsing?

We can find all the syntactic structures
• Only the correct variants will enter the syntactic structure
• will + Vinf form a syntactic unit
• will + Vpa do not

Problems
• Computationally expensive
• What to do with ambiguities?
  • rice flies like sand
  • Depends on what you need

Statistical tagger

Example: TnT tagger
Based on the Hidden Markov Model (HMM)
Idea:
• Some words are more probable after some other words
• Find these probabilities
• Guess the word if you know the nearby ones

Problem:
• Letter strings denote meanings
• In “x is more probable after y”, x and y are meanings, not strings
• So guess what you cannot see: meanings

Hidden Markov Model: Idea

A system changes its state
• What a person thinks
• Random... but not completely (how?)

In each state, it emits an output
• What he says when he thinks something
• Random... but somehow (?) depends on what he thinks

We know the sequence of produced outputs
• Text: we can see it!

Guess what the underlying states were
• Hidden: we cannot see them

Hidden Markov Model: Hypotheses

A finite set of states: q_1 ... q_N (invisible)
• POS and grammatical characteristics (language)

A finite set of observations: v_1 ... v_M
• Strings we see in the corpus (language)

A random sequence of states x_i
• POS in the text

Probabilities of state transitions P(x_{i+1} | x_i)
• Language rules and use

Probabilities of observations P(v_k | x_i)
• Words expressing the meanings: Vinf: ask, V3: asks

Hidden Markov Model: Problem

The same observation corresponds to different meanings
• Vinf: read, Vpp: read

Looking at what we can see, guess what we cannot
• This is why it is called hidden

Given a sequence of observations o_i
• The text: a sequence of letter strings. Training set

Guess the sequence of states x_i
• The POS of each word

Our hypotheses on x_i depend on each other
• Highly combinatorial task
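One standard way to guess the state sequence without enumerating every combination is Viterbi decoding, sketched here under assumptions. The slides do not name the decoding algorithm, and all probabilities and the tag Pron are invented toy values, not estimates from any corpus.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for the observations."""
    # best[i][s] = (probability of the best path ending in s at step i, previous state)
    best = [{s: (start_p.get(s, 0.0) * emit_p[s].get(observations[0], 0.0), None)
             for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p].get(s, 0.0) * emit_p[s].get(obs, 0.0), p)
                for p in states
            )
        best.append(column)
    # Trace back from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return list(reversed(path))

# Toy model: every number below is an assumption for illustration.
states = ["Pron", "Ns", "Va", "Vinf", "Vpa"]
start_p = {"Pron": 0.6, "Ns": 0.3, "Va": 0.1}
trans_p = {
    "Pron": {"Va": 0.5, "Vpa": 0.4, "Ns": 0.1},
    "Ns": {"Va": 0.3, "Vpa": 0.5, "Ns": 0.2},
    "Va": {"Vinf": 0.9, "Ns": 0.1},
    "Vinf": {"Ns": 1.0},
    "Vpa": {"Ns": 1.0},
}
emit_p = {
    "Pron": {"he": 1.0},
    "Ns": {"will": 0.5, "book": 0.5},
    "Va": {"will": 1.0},
    "Vinf": {"read": 1.0},
    "Vpa": {"read": 1.0},
}

tagged = viterbi(["he", "will", "read"], states, start_p, trans_p, emit_p)
```

The work grows with the sentence length times the square of the number of states, instead of exponentially with the sentence length.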

Hidden Markov Model: Solutions

Need to find the parameters of the model:
• P(x_{i+1} | x_i)
• P(v_k | x_i)

In the optimal way!
• To maximize the probability of generating this specific output

Optimization methods from Operations Research are used
• More details? Not so simple...
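The quantity being maximized, the probability that the model generates a given output, can be computed with the forward algorithm. The sketch below shows only that computation, not the optimization itself. The two states N and V and all numbers are assumed toy values.

```python
def sequence_probability(observations, states, start_p, trans_p, emit_p):
    """Forward algorithm: P(observations | model), summed over all state sequences."""
    # alpha[s] = probability of the observations so far, with the model ending in state s
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: sum(alpha[p] * trans_p[p][s] for p in states) * emit_p[s][obs]
            for s in states
        }
    return sum(alpha.values())

# Toy parameters (assumed for illustration only).
states = ["N", "V"]
start_p = {"N": 0.7, "V": 0.3}
trans_p = {"N": {"N": 0.4, "V": 0.6}, "V": {"N": 0.8, "V": 0.2}}
emit_p = {
    "N": {"will": 0.3, "read": 0.1, "book": 0.6},
    "V": {"will": 0.4, "read": 0.5, "book": 0.1},
}

likelihood = sequence_probability(["will", "read", "book"], states, start_p, trans_p, emit_p)
```

Training then means adjusting `start_p`, `trans_p` and `emit_p` so that this value, computed over the training text, becomes as large as possible.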

Brill Tagger (rule-based)

Eric Brill
Makes an initial assumption about POS tags in the text
Uses context-dependent rewriting rules to correct some tags
Applies them iteratively
Learns the rules from a training corpus
The rules are in human-understandable form
• You can correct them manually to improve the tagger
• Unlike HMMs, which are not understandable
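A minimal sketch of the rule-application step, assuming a toy lexicon and two hand-written rules. In the real tagger the rules are learned from a training corpus and the rule templates are richer; the lexicon, the tag names Pron and Det, and the rule format here are all hypothetical.

```python
def initial_tags(words, lexicon, default="Ns"):
    """Initial assumption: one default tag per word, taken from a toy lexicon."""
    return [lexicon.get(w.lower(), default) for w in words]

def apply_rules(tags, rules):
    """Apply each rule in order: change `old` to `new` when the previous tag is `prev`."""
    tags = list(tags)
    for old, new, prev in rules:
        for i in range(1, len(tags)):
            if tags[i] == old and tags[i - 1] == prev:
                tags[i] = new
    return tags

# Hypothetical lexicon and rules, for illustration only.
lexicon = {"he": "Pron", "will": "Ns", "read": "Vpa", "the": "Det", "book": "Ns"}
rules = [
    ("Ns", "Va", "Pron"),   # a noun right after a pronoun becomes an auxiliary verb
    ("Vpa", "Vinf", "Va"),  # a past-tense verb right after an auxiliary becomes an infinitive
]

words = "He will read the book".split()
tags = apply_rules(initial_tags(words, lexicon), rules)
```

Because each rule is a readable triple, a person can inspect or edit the list directly, which is the property the slide contrasts with HMM parameters.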

Word Sense Disambiguation

Query: international bank in Seoul

Bank                      한         원
financial institution     Korean     $
river shore               superior   official
place to store something  한상용     ...
...                       ...        ...

Hotel located at the beautiful bank of Han river. Relevant for the query?

POS is the same. Tagger will not distinguish them

Translation
• 대원군: Great Governor of the Court
• 만원: 10 thousand won
• international bank: banco internacional
• river bank: orilla del río

Information retrieval
• Document retrieval: is it really useful? Same info
• Passage retrieval: can prove very useful!

Semantic analysis

Representation of word senses

1. Explanations. Semantic dictionaries
• Bank1 is an institution to keep money
• Bank2 is a sloping edge of a river

2. Synsets and ontology: WordNet (HanNet: Chinese)
• Synonyms: {bank, shore}
  • WordNet terminology: synset #12345
  • Corresponds to all ways to call a concept
• Relationships:
  • #12345 IS_PART_OF #67890 {river, stream}
  • #987 IS_A #654 {institution, organization}
• WordNet also has glosses

Task

Given a text (probably POS-tagged)
Tag each word with its synset number #123 or dictionary number bank1

Input: Mary keeps the money in a bank. Han river’s bank is beautiful.

Output: Mary keeps<1> the money<1> in a bank<1>. Han river’s bank<2> is beautiful.

Lesk algorithm

Michael Lesk
Explanatory dictionary
• Bank1 is an institution to keep money
• Bank2 is a sloping edge of a river
Mary keeps her money (savings) in a bank.
Choose the sense which has more words in common with the immediate context
Improvements (Pedersen, Gelbukh & Sidorov)
• Use synonyms when there are no direct matches
• Use synonyms of synonyms, ...
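A simplified sketch of the basic overlap idea, using the two glosses from the slide. The stopword list and the tokenization are assumptions, and none of the synonym-based improvements are included.

```python
# Small, assumed stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "of", "to", "in", "at", "her"}

def content_words(text):
    """Lowercase, strip punctuation, drop stopwords."""
    tokens = (t.strip(".,;:!?()").lower() for t in text.split())
    return {t for t in tokens if t and t not in STOPWORDS}

def lesk(context, glosses):
    """Return the sense whose gloss shares the most words with the context."""
    ctx = content_words(context)
    return max(glosses, key=lambda sense: len(ctx & content_words(glosses[sense])))

glosses = {
    "bank1": "an institution to keep money",
    "bank2": "a sloping edge of a river",
}

sense_a = lesk("Mary keeps her money (savings) in a bank.", glosses)
sense_b = lesk("Hotel located at the beautiful bank of Han river.", glosses)
```

Note that "keeps" in the context does not match "keep" in the gloss, because this sketch does no stemming; only "money" produces the overlap. On a tie the first sense listed wins, which is an arbitrary choice of this sketch.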

Other word relatedness measures

Lexical chains in WordNet
• The length of the path in the graph of relationships
Mutual information: frequent co-occurrences
Collocations (Bolshakov & Gelbukh)
• Keep in bank1
• Bank2 of river
• Very large dictionary of such combinations
Number of words in common between explanations
• Recursive: common words or related words (Gelbukh & Sidorov)
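The path-length measure can be sketched as a breadth-first search over a relationship graph. The graph below is a hand-made toy loosely based on the bank examples; it is not real WordNet data, and treating all relationship types as undirected edges of equal weight is an assumption.

```python
from collections import deque

def undirected(edges):
    """Build an adjacency map from a list of (a, b) relationship pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def path_length(graph, start, goal):
    """Number of edges on the shortest path between two nodes, or None if unconnected."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None

# Toy relationship graph (hypothetical).
graph = undirected([
    ("bank1", "institution"),
    ("institution", "organization"),
    ("bank2", "river"),
    ("river", "stream"),
    ("shore", "bank2"),
])
```

With this graph, a context word such as "stream" is two steps from bank2 and unreachable from bank1, so the shorter path favours the river sense.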

Other methods

Hidden Markov Models
Logical reasoning

Yarowsky’s Principles

David Yarowsky
One sense per text!
One sense per collocation
I keep my money in the bank1. This is an international bank1 with a great capital. The bank2 is located near Han river.
• 3 words vote for ‘institution’, one for ‘shore’
• Institution! bank1 is located near Han river.
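The vote in this example can be sketched as follows. The cue-word table is a hypothetical stand-in for whatever local evidence (collocations, a dictionary) suggests a sense; only the counting and the "one sense per text" decision are illustrated.

```python
from collections import Counter

# Hypothetical cue words and the sense each one votes for.
CUES = {"money": "bank1", "international": "bank1", "capital": "bank1", "river": "bank2"}

def vote_for_sense(text, cues):
    """Count the votes of all cue words in the text and return (winner, tally)."""
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    tally = Counter(cues[t] for t in tokens if t in cues)
    winner = tally.most_common(1)[0][0] if tally else None
    return winner, tally

text = ("I keep my money in the bank. This is an international bank "
        "with a great capital. The bank is located near Han river.")
winner, tally = vote_for_sense(text, CUES)
```

The winning sense is then assigned to every occurrence of the word in the text, including the one next to "river".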

Anaphora resolution

Mainly pronouns.
Also co-reference: when do two words refer to the same thing?
• John took cake from the table and ate it.
• John took cake from the table and washed it.
Translation into Spanish: la ‘she’ table / lo ‘he’ cake
Methods:
• Dictionaries
• Different sources of evidence
• Logical reasoning

Translation
Information retrieval:
• Can improve frequency counts (?)
• Passage retrieval: can be very important

Mitkov’s knowledge-poor method

Ruslan Mitkov
Rule-based and statistical approach
Uses simple information on POS and general word classes
Combines different sources of evidence
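The idea of combining several weak sources of evidence can be sketched as a weighted score per candidate antecedent. The indicators, weights, word classes and roles below are invented for illustration and are not taken from Mitkov's published method.

```python
# Illustrative weights; assumptions, not Mitkov's actual indicators or values.
WEIGHTS = {"class_agreement": 2.0, "direct_object": 1.0, "distance_penalty": 0.1}

def score(pronoun_class, candidate, weights=WEIGHTS):
    """Sum the evidence for one candidate antecedent."""
    total = 0.0
    if candidate["word_class"] == pronoun_class:   # general word class matches the pronoun
        total += weights["class_agreement"]
    if candidate["role"] == "direct_object":       # simple syntactic preference
        total += weights["direct_object"]
    total -= weights["distance_penalty"] * candidate["distance"]  # nearer is better
    return total

def resolve(pronoun_class, candidates, weights=WEIGHTS):
    """Return the word of the highest-scoring candidate."""
    return max(candidates, key=lambda c: score(pronoun_class, c, weights))["word"]

# Candidates for "it" in: John took cake from the table and ate it.
candidates = [
    {"word": "John", "word_class": "person", "role": "subject", "distance": 3},
    {"word": "cake", "word_class": "thing", "role": "direct_object", "distance": 2},
    {"word": "table", "word_class": "thing", "role": "other", "distance": 1},
]

antecedent = resolve("thing", candidates)
```

Shallow evidence of this kind gives the same answer for "washed it", where the table is the more plausible antecedent; cases like that need dictionary knowledge or reasoning about the verb.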

Hidden Anaphora

John bought a house. The kitchen is big.
= that house’s kitchen

John was eating. The food was delicious.
= “that eating”’s food

John was buried. The widow was mad with grief.
= “that burying”’s death’s widow

Intersection of scenarios of the concepts (Gelbukh & Sidorov)
• house has a kitchen
• burying results from death & widow results from death

Evaluation

Senseval and TREC international competitions
• Korean track available
Human-annotated corpus
• Very expensive
• Inter-annotator agreement is often low!
• A program cannot do what humans cannot do
• Apply the program and compare with the corpus
• Accuracy
Sometimes the program cannot tag a word
• Precision, recall
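A minimal sketch of how these measures relate when the program may leave a word untagged. It assumes the common convention that precision is computed over the words the program attempted and recall over all words in the annotated corpus; the gold and predicted tags are made-up data.

```python
def evaluate(gold, predicted):
    """Return (precision, recall).

    gold: correct tags from the annotated corpus.
    predicted: program output, with None where the program did not tag the word.
    """
    attempted = [(g, p) for g, p in zip(gold, predicted) if p is not None]
    correct = sum(1 for g, p in attempted if g == p)
    precision = correct / len(attempted) if attempted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Made-up example: four occurrences of "bank", one left untagged.
gold = ["bank1", "bank2", "bank1", "bank1"]
predicted = ["bank1", "bank1", None, "bank1"]
precision, recall = evaluate(gold, predicted)
```

When the program tags every word, the two values coincide and equal plain accuracy.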

Research topics

Too many to list
New methods
Lexical resources (dictionaries)
= Computational linguistics

Conclusions

Tagging, word sense disambiguation, and anaphora resolution are cases of disambiguation of meaning

Useful in translation, information retrieval, and text understanding

Dictionary-based methods: good but expensive

Statistical methods: cheap and sometimes imperfect... but not always (if very large corpora are available)

Thank you!

Till May 31? June 1? 6 pm
