Special Topics in Computer Science Advanced Topics in Information Retrieval Lecture 9: Natural...
Special Topics in Computer Science
Advanced Topics in Information Retrieval
Lecture 9: Natural Language Processing and IR. Tagging, WSD, and Anaphora Resolution
Alexander Gelbukh
www.Gelbukh.com
2
Previous Chapter: Conclusions
Reducing synonyms can help IR
  Better matching
  Ontologies are used. WordNet
Morphology is a variant of synonymy widely used in IR systems
  Precise analysis: dictionary-based analyzers
  Quick-and-dirty analysis: stemmers
    Rule-based stemmers. Porter stemmer
    Statistical stemmers
3
Previous Chapter: Research topics
Construction and application of ontologies
Building of morphological dictionaries
Treatment of unknown words with morphological analyzers
Development of better stemmers
  Statistical stemmers?
4
Contents
Tagging: for each word, determine its POS (Part of Speech: noun, ...) and grammatical characteristics
WSD (Word Sense Disambiguation): for each word, determine which homonym is used
Anaphora resolution: for a pronoun (it, ...), determine what it refers to
5
Tagging: The problem
Ambiguity of parts of speech
  rice flies like sand
    = insects living in rice consider sand good?
    = rice can fly similarly to sand?
    ... insect of a container with rice...?
  We can fly like sand ... We think fly like sand...
Ambiguity of grammatical characteristics
  He has read the book
  He will read the book...
  He read the book
Very frequent phenomenon, nearly at each word!
6
Tagger...
A program that looks at the context and decides what the part of speech (and other characteristics) is
Input: He will read the book
Morphological analysis: He<...> will<Ns | Va> read<Vpa | Vpp | Vinf> the<...>
Tagger: ? ? ? ? ?
Tags: Ns = noun singular, Va = verb auxiliary, Vpa = verb past, Vpp = verb past participle, Vinf = verb infinitive, ...
7
...Tagger
Input of tagger: He<...> will<Ns | Va> read<Vpa | Vpp | Vinf> the<...>
Task: Choose one!
Output: He<...> will<Va> read<Vinf> the<...>
How do we do it?
  He will<N>: not possible, so Va
  will<Va> read: so Vinf
  This is simple, but imagine He is ambiguous... Explosion
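The elimination above can be sketched as a brute-force filter over all tag combinations. This is only an illustration: the tags Pron and Det are hypothetical stand-ins for the tags the slide elides as <...>, and the forbidden tag pairs are assumed rules, not taken from any real tagger.

```python
from itertools import product

# Ambiguity classes from the slide. "Pron" and "Det" are hypothetical
# placeholders for the tags the slide leaves out as <...>.
candidates = [
    ("He", ["Pron"]),
    ("will", ["Ns", "Va"]),
    ("read", ["Vpa", "Vpp", "Vinf"]),
    ("the", ["Det"]),
]

# Assumed local constraints: adjacent tag pairs that cannot occur.
forbidden_pairs = {("Pron", "Ns"), ("Va", "Vpa"), ("Va", "Vpp")}


def consistent(tags):
    """True if no adjacent pair of tags is forbidden."""
    return all(pair not in forbidden_pairs for pair in zip(tags, tags[1:]))


# Every combination of one tag per word: the count is the product of the
# ambiguity class sizes, which is the "explosion" for longer sentences.
all_sequences = list(product(*(tags for _, tags in candidates)))
valid = [seq for seq in all_sequences if consistent(seq)]

print(len(all_sequences), "combinations,", len(valid), "consistent")
print(valid)
```

Enumerating every combination is feasible only for toy sentences, which is why the following slides look for cheaper methods.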
8
Applications
Used for word sense disambiguation:
  Oil well in Mexico is used.
  Oil is used well in Mexico.
For stemming and lemmatization
Important for matching in information retrieval
Greatly speeds up syntactic analysis
  Tagging is local
  No need to process the whole sentence to find that a certain tag is incorrect
9
How: Parsing?
We can find all the syntactic structures
Only the correct variants will enter the syntactic structure
  will + Vinf form a syntactic unit
  will + Vpa do not
Problems
  Computationally expensive
  What to do with ambiguities?
• fly rice like sand
• Depends on what you need
10
Statistical tagger
Example: TnT tagger
Based on Hidden Markov Model (HMM)
Idea:
  Some words are more probable after some other words
  Find these probabilities
  Guess the word if you know the nearby ones
Problem:
  Letter strings denote meanings
  “x is more probable after y” holds for meanings, not strings
  So guess what you cannot see: meanings
11
Hidden Markov Model: Idea
A system changes its state
  What a person thinks
  Random... but not completely (how?)
In each state, it emits an output
  What he says when he thinks something
  Random... but somehow (?) depends on what he thinks
We know the sequence of produced outputs
  Text: we can see it!
Guess what were the underlying states
  Hidden: we cannot see them
12
Hidden Markov Model: Hypotheses
A finite set of states: q_1 ... q_N (invisible)
  POS and grammatical characteristics (language)
A finite set of observations: v_1 ... v_M
  Strings we see in the corpus (language)
A random sequence of states x_i
  POS in the
Probabilities of state transitions P(x_{i+1} | x_i)
  Language rules and use
Probabilities of observations P(v_k | x_i)
  Words expressing the meanings: Vinf: ask, V3: asks
13
Hidden Markov Model: Problem
Same observation corresponds to different meanings
  Vinf: read, Vpp: read
Looking at what we can see, guess what we cannot
  This is why hidden
Given a sequence of observations o_i
  The text: sequence of letter strings. Training set
Guess the sequence of states x_i
  The POS of each word
Our hypotheses on x_i depend on each other
  Highly combinatorial task
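Once the model parameters are known, the guessing step is usually done with the Viterbi algorithm, which avoids enumerating all state sequences. The sketch below is a minimal version under stated assumptions: the tag set, including the Pron tag, and every probability are made-up illustrative numbers, not values from TnT or from any corpus.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence and its probability."""
    # best[i][s] = (probability of the best path that ends in state s
    #               at position i, previous state on that path)
    best = [{
        s: (start_p.get(s, 0.0) * emit_p[s].get(observations[0], 0.0), None)
        for s in states
    }]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p].get(s, 0.0) * emit_p[s].get(obs, 0.0), p)
                for p in states
            )
        best.append(column)

    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, best[-1][last][0]


# Hypothetical toy model: all numbers below are invented for illustration.
states = ["Pron", "Ns", "Va", "Vinf", "Vpa"]
start_p = {"Pron": 0.6, "Ns": 0.4}
trans_p = {
    "Pron": {"Va": 0.5, "Vpa": 0.4, "Ns": 0.1},
    "Ns": {"Va": 0.4, "Vpa": 0.5, "Ns": 0.1},
    "Va": {"Vinf": 0.9, "Pron": 0.1},
    "Vinf": {"Ns": 0.6, "Pron": 0.4},
    "Vpa": {"Ns": 0.6, "Pron": 0.4},
}
emit_p = {
    "Pron": {"He": 1.0},
    "Ns": {"will": 0.5, "book": 0.5},
    "Va": {"will": 1.0},
    "Vinf": {"read": 1.0},
    "Vpa": {"read": 1.0},
}

path, prob = viterbi(["He", "will", "read"], states, start_p, trans_p, emit_p)
print(path, prob)
```

With these invented numbers the decoded sequence is Pron, Va, Vinf, which matches the manual reasoning on the tagger slide. The work grows linearly with sentence length instead of exponentially.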
14
Hidden Markov Model: Solutions
Need to find the parameters of the model:
  P(x_{i+1} | x_i)
  P(v_k | x_i)
In an optimal way: maximize the probability of generating this specific output
Optimization methods from Operations Research are used
More details? Not so simple...
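The optimization mentioned above works from raw text and is beyond a short example. As a much simpler stand-in, the sketch below estimates the same two tables by relative-frequency counting, assuming a small hand-tagged corpus is available. That is a different, supervised setting from the one on the slide, and the two tagged sentences and the Pron tag are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged training sentences (word, tag).
tagged_corpus = [
    [("He", "Pron"), ("will", "Va"), ("read", "Vinf")],
    [("He", "Pron"), ("read", "Vpa")],
]


def normalize(counts):
    """Turn nested counts into conditional probability tables."""
    return {
        key: {item: c / sum(row.values()) for item, c in row.items()}
        for key, row in counts.items()
    }


def estimate(corpus):
    """Estimate P(next tag | tag) and P(word | tag) by counting."""
    trans_counts = defaultdict(Counter)
    emit_counts = defaultdict(Counter)
    for sentence in corpus:
        for word, tag in sentence:
            emit_counts[tag][word] += 1
        for (_, prev_tag), (_, next_tag) in zip(sentence, sentence[1:]):
            trans_counts[prev_tag][next_tag] += 1
    return normalize(trans_counts), normalize(emit_counts)


trans_p, emit_p = estimate(tagged_corpus)
print(trans_p)
print(emit_p)
```

Counting gives the maximum-likelihood estimate only when the tags are visible in the training data. With untagged text the states are hidden, so iterative optimization is needed instead.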
15
Brill Tagger (rule-based)
Eric Brill
Makes an initial assumption about POS tags in the text
Uses context-dependent rewriting rules to correct some tags
Applies them iteratively
Learns the rules from a training corpus
The rules are in human-understandable form
  You can correct them manually to improve the tagger
  Unlike HMMs, whose parameters are not human-understandable
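A minimal sketch of the rule-application step, assuming rules of the form "change tag A to tag B if the previous tag is C". The initial lexicon, the two rules, and the Pron and Det tags are hypothetical; in the real tagger the rules are learned from a training corpus, which is not shown here.

```python
# Hypothetical initial guess: one default tag per word.
initial_tag = {
    "He": "Pron",
    "will": "Ns",
    "read": "Vpa",
    "the": "Det",
    "book": "Ns",
}

# Hypothetical rewriting rules: (from_tag, to_tag, required_previous_tag).
rules = [
    ("Ns", "Va", "Pron"),   # a "noun" right after a pronoun is an auxiliary
    ("Vpa", "Vinf", "Va"),  # a "past" form right after an auxiliary is an infinitive
]


def brill_tag(words):
    """Assign initial tags, then apply each rule in order over the sentence."""
    tags = [initial_tag[w] for w in words]
    for from_tag, to_tag, prev_tag in rules:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == prev_tag:
                tags[i] = to_tag
    return list(zip(words, tags))


print(brill_tag("He will read the book".split()))
```

Each rule is a readable triple, so a person can inspect, reorder, or correct the list by hand.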
16
Word Sense Disambiguation
Query: international bank in Seoul
Ambiguous words: Bank, 한, 원
Candidate senses: financial institution; Korean; $; river shore; superior official; place to store something; 한상용; ...
Hotel located at the beautiful bank of Han river. Relevant for the query?
POS is the same. Tagger will not distinguish them
17
Applications
Translation
  대원군: Great Governor of the Court
  만원: 10 thousand won
  international bank: banco internacional
  river bank: orilla del río
Information retrieval
  Document retrieval: is it really useful? Same info
  Passage retrieval: can prove very useful!
Semantic analysis
18
Representation of word senses
1. Explanations. Semantic dictionaries
  Bank1 is an institution to keep money
  Bank2 is a sloping edge of a river
2. Synsets and ontology: WordNet (HanNet: Chinese)
  Synonyms: {bank, shore}
• WordNet terminology: synset #12345
• Corresponds to all ways to call a concept
  Relationships:
    #12345 IS_PART_OF #67890 {river, stream}
    #987 IS_A #654 {institution, organization}
  WordNet also has glosses
19
Task
Given a text (probably POS-tagged)
Tag each word with its synset number #123 or dictionary number bank1
Input: Mary keeps the money in a bank. Han river’s bank is beautiful.
Output: Mary keeps<1> the money<1> in a bank<1> Han river’s bank<2> is beautiful.
20
Lesk algorithm
Michael Lesk
Explanatory dictionary
  Bank1 is an institution to keep money
  Bank2 is a sloping edge of a river
Mary keeps her money (savings) in a bank.
Choose the sense which has more words in common with the immediate context
Improvements (Pedersen, Gelbukh & Sidorov)
  Use synonyms when no direct matches
  Use synonyms of synonyms, ...
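A minimal sketch of the basic overlap count, using the two glosses from the slide. The stopword list and the tokenizer are simplifying assumptions, and the synonym-based improvements are not implemented.

```python
import re

# Assumed stopword list: function words that should not count as overlap.
STOPWORDS = {"a", "an", "the", "is", "to", "of", "in", "her", "s"}

# Glosses taken from the slide.
glosses = {
    "bank1": "an institution to keep money",
    "bank2": "a sloping edge of a river",
}


def content_words(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def lesk(context, sense_glosses):
    """Pick the sense whose gloss shares the most words with the context."""
    context_words = content_words(context)
    overlap = {
        sense: len(content_words(gloss) & context_words)
        for sense, gloss in sense_glosses.items()
    }
    # On a tie, max() returns the first sense listed.
    return max(overlap, key=overlap.get)


print(lesk("Mary keeps her money in a bank.", glosses))
print(lesk("Han river's bank is beautiful.", glosses))
```

Without stemming, "keeps" does not match "keep", so the first sentence is decided by "money" alone. This is the kind of missed match that the synonym-based improvements aim to recover.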
21
Other word relatedness measures
Lexical chains in WordNet
  The length of the path in the graph of relationships
Mutual information: frequent co-occurrences
Collocations (Bolshakov & Gelbukh)
  Keep in bank1
  Bank2 of river
  Very large dictionary of such combinations
Number of words in common between explanations
  Recursive: common words or related words (Gelbukh & Sidorov)
22
Other methods
Hidden Markov Models
Logical reasoning
23
Yarowsky’s Principles
David Yarowsky
One sense per text!
One sense per collocation
  I keep my money in the bank1. This is an international bank1 with a great capital. The bank2 is located near Han river.
  3 words vote for ‘institution’, one for ‘shore’
  Institution!
  bank1 is located near Han river.
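The "one sense per text" step can be sketched as a majority vote over the senses chosen locally for each occurrence. This is a simplification: it counts one vote per occurrence of the word, whereas the slide counts votes from context words, and the input labels are assumed to come from some earlier local disambiguation step.

```python
from collections import Counter


def one_sense_per_text(local_senses):
    """Relabel every occurrence with the sense that wins the majority vote."""
    majority, _ = Counter(local_senses).most_common(1)[0]
    return [majority] * len(local_senses)


# Locally chosen senses for the three occurrences of "bank" in the example.
print(one_sense_per_text(["bank1", "bank1", "bank2"]))
```

The third occurrence is relabeled from bank2 to bank1, as in the last line of the slide.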
24
Anaphora resolution
Mainly pronouns.
Also co-reference: when do two words refer to the same thing?
John took cake from the table and ate it.
John took cake from the table and washed it.
Translation into Spanish: la ‘she’ table / lo ‘he’ cake
Methods:
  Dictionaries
  Different sources of evidence
  Logical reasoning
25
Applications
Translation
Information retrieval:
  Can improve frequency counts (?)
  Passage retrieval: can be very important
26
Mitkov’s knowledge-poor method
Ruslan Mitkov
Rule-based and statistical approach
Uses simple information on POS and general word classes
Combines different sources of evidence
27
Hidden Anaphora
John bought a house. The kitchen is big.
  = that house’s kitchen
John was eating. The food was delicious.
  = “that eating”’s food
John was buried. The widow was mad with grief.
  = “that burying”’s death’s widow
Intersection of scenarios of the concepts (Gelbukh & Sidorov)
  house has a kitchen
  burying results from death & widow results from death
28
Evaluation
Senseval and TREC international competitions
  Korean track available
Human annotated corpus
  Very expensive
  Inter-annotator agreement is often low!
  A program cannot do what humans cannot do
Apply the program and compare with the corpus
  Accuracy
Sometimes the program cannot tag a word
  Precision, recall
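A small sketch of how precision and recall differ once the program is allowed to leave a word untagged. It assumes the common convention that precision is measured over attempted words and recall over all words. The gold and predicted labels are made up for illustration, with None meaning "no answer".

```python
def evaluate(gold, predicted):
    """Return (precision, recall) for a tagger that may abstain with None."""
    attempted = [(g, p) for g, p in zip(gold, predicted) if p is not None]
    correct = sum(1 for g, p in attempted if g == p)
    precision = correct / len(attempted) if attempted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical annotated corpus and program output.
gold = ["bank1", "bank2", "bank1", "bank1"]
predicted = ["bank1", "bank1", None, "bank1"]

precision, recall = evaluate(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here 2 of the 3 attempted answers are correct, while only 2 of the 4 words in the corpus are tagged correctly. When the program answers for every word, the two numbers coincide with accuracy.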
29
Research topics
Too many to list
New methods
Lexical resources (dictionaries)
= Computational linguistics
30
Conclusions
Tagging, word sense disambiguation, and anaphora resolution are cases of disambiguation of meaning
Useful in translation, information retrieval, and text understanding
Dictionary-based methods
  good but expensive
Statistical methods
  cheap and sometimes imperfect... but not always (if very large corpora are available)
31
Thank you!
Till May 31? June 1? 6 pm