Special Topics in Computer Science Advanced Topics in Information Retrieval Lecture 8: Natural...
Special Topics in Computer Science
Advanced Topics in Information Retrieval
Lecture 8: Natural Language Processing and IR. Synonymy, Morphology, and Stemming
Alexander Gelbukh
www.Gelbukh.com
2
Previous Chapter: Conclusions
Parallel computing can improve the response time for each query and/or the throughput: the number of queries processed at the same speed
Document partitioning is simple; good for distributed computing
Term partitioning is good for some data structures
Distributed computing is MIMD computing with slow communication
SIMD machines are good for signature files
Both are out of favor now
3
Previous Chapter: Research topics
How to evaluate the speedup
New algorithms
Adaptation of existing algorithms
Merging the results is a bottleneck
Meta search engines
Creating large collections with judgements
Is recall important?
4
Problem
Recall image retrieval:
• Find images similar in color, size, ...
• Find photos of the Korean President?
• Find nice girls? (Don't show ugly ones!)
• Looks very stupid
• Lacks understanding
• Too difficult
Text retrieval is no exception
• Find stories with a sad beginning and a happy end?
• Lacks understanding
• Difficult but possible
5
Possible?
Text is intended to facilitate understanding
Supposedly, even partial understanding should help
Degrees of understanding:
• Character strings (what is used now): well, geese, him
• Words (often used now): goose, he
• Concepts: hole in the ground (well), Roh Moo-Hyun
• Complex concepts: oil well, hot dog
• Situations (sentences, paragraphs)
• The story (direct meaning)
• The message (pragmatics, intended impact)
6
Easy?
Main problems:
Multiple ways to say the same thing
• Query does not match the doc
• Difficult to specify all variants
Ambiguity of the text
• False alarms in matching
Lack of implicit knowledge of the computer
• The computer “does not understand” the message
• Difficult to make inferences
Natural Language Processing tries to solve them
7
Solutions
Multiple ways to say the same thing?
• Normalizing: transforming to a “standard” variant
Ambiguity of the text?
• Ambiguity resolution
• Normalizing to one of the variants
• Perhaps the main problem in natural language processing
Lack of implicit knowledge of the computer?
• Dictionaries, grammars
• Knowledge of language structure is needed in all tasks
• Knowledge of the world is useful for advanced tasks
• Knowledge of language use is a substitute
8
Synonymy
Multiple ways to say the same thing
• Or at least when the difference does not matter
• Can be substituted in any (many?) context
Lexical synonymy
• Woman / female, professor / teacher
• Dictionaries
Phrase-level or sentence-level synonymy
• They gave me a book / I was given a book by them
• Syntactic analyzers
Semantic-level synonymy
• Reasoning
9
Not only synonymy
Multiple ways to say
• the same (synonymy)
• less: more general (hypernymy)
• more: more specific (hyponymy)
Complete synonyms are rare
• professor / teacher
• Abbreviations are usually (almost) complete synonyms
When the differences do not matter, these can be treated as synonymy
But: different data structures and methods
10
Lexical-level synonymy
Lexical synonymy
• Woman / female
• Mixed-type synonymy: USA / United States
Morphology is a kind of synonymy (actually hyponymy)
• ‘geese’ = ‘goose’ + ‘many’
• Russian ‘knigu’ = ‘kniga’ + ‘accusative role’
• the “second” part of the meaning is either not important or is another term
Morphology is a very common problem in IR
11
Lexical synonymy
Woman / female
Dictionaries
• Synonym dictionaries
• WordNet
Automatic learning of synonymy
• Clustering of contexts
• If the contexts are very similar, then the words are possible synonyms
• Problem: preserves meaning? Monday / Tuesday
• An interesting solution: compare dictionary definitions
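The idea of learning synonymy from contexts can be sketched as follows. This is only an illustration under assumptions that are not in the lecture: the toy corpus, the two-word window, and the use of cosine similarity are all chosen here for the example.

```python
from collections import Counter
from math import sqrt

def context_vectors(sentences, window=2):
    """Map each word to a bag of the words seen within `window` positions of it."""
    vectors = {}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            neighbours = tokens[lo:i] + tokens[i + 1:hi]
            vectors.setdefault(word, Counter()).update(neighbours)
    return vectors

def cosine(a, b):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical toy corpus, invented for this example.
corpus = [
    "the meeting is on monday morning",
    "the meeting is on tuesday morning",
    "the professor teaches the course",
    "the teacher teaches the course",
]
vectors = context_vectors(corpus)
sim_days = cosine(vectors["monday"], vectors["tuesday"])
sim_jobs = cosine(vectors["professor"], vectors["teacher"])
sim_mixed = cosine(vectors["monday"], vectors["professor"])
```

In this corpus professor / teacher and Monday / Tuesday both come out with identical contexts, which is exactly the problem noted above: similar contexts do not guarantee the same meaning.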
12
Uses in IR
Query expansion
• Add synonyms of the word to the query and process normally
• Flexible, slow
• Best for lexical synonymy: few synonyms, doubtful
Reducing at index time
• When reading the documents, reduce each word to a “standard” synonym
• Fast, rigid
• Best for morphology: many synonyms, less doubtful
Hierarchical indexing
13
Hierarchical indexing
(Gelbukh, Sidorov, Guzman-Arenas 2002)
Tree of concepts
Living things
• Animals
  1. a. Cat, b. cats
  2. a. Dog, b. dogs
• Persons
  3. a. Professor, b. professors
  4. a. Student, b. students
Order the vocabulary by the order of the leaves of the tree
Query expansion is done by ranges:
cat: 1, living things: 1-4
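A small sketch of the range idea, using the tree from the slide. It is not the implementation from the cited paper: the tuple encoding of the tree, the word-to-leaf table, and the sample documents are assumptions made for the example.

```python
# Concept tree from the slide, encoded as (name, children); the encoding is an assumption.
TREE = ("living things", [
    ("animals", [("cat", []), ("dog", [])]),
    ("persons", [("professor", []), ("student", [])]),
])

# Hypothetical table mapping word forms to leaf numbers (1. cat, 2. dog, ...).
WORD_TO_LEAF = {"cat": 1, "cats": 1, "dog": 2, "dogs": 2,
                "professor": 3, "professors": 3, "student": 4, "students": 4}

def number_leaves(node, ranges, counter=None):
    """Number leaves left to right; every concept gets the range of leaves below it."""
    if counter is None:
        counter = [0]
    name, children = node
    if not children:
        counter[0] += 1
        ranges[name] = (counter[0], counter[0])
    else:
        first = counter[0] + 1
        for child in children:
            number_leaves(child, ranges, counter)
        ranges[name] = (first, counter[0])
    return ranges

def build_postings(docs):
    """Postings keyed by leaf number instead of by word string."""
    postings = {}
    for doc_id, text in docs.items():
        for token in text.lower().split():
            leaf = WORD_TO_LEAF.get(token)
            if leaf is not None:
                postings.setdefault(leaf, set()).add(doc_id)
    return postings

def search(concept, ranges, postings):
    """Expand a concept to its leaf range and merge the postings in that range."""
    lo, hi = ranges[concept]
    hits = set()
    for leaf in range(lo, hi + 1):
        hits |= postings.get(leaf, set())
    return hits

ranges = number_leaves(TREE, {})
docs = {1: "cats chase dogs", 2: "the professor met students", 3: "a dog barked"}
postings = build_postings(docs)
```

Because the leaves under any concept are contiguous, a query for a general concept needs only two numbers rather than an explicit list of all its specific terms.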
14
Morphology
One of the large concerns in IR
Can be done
• precisely
• approximately (quick-and-dirty)
Level of generalization
• inflection: student – students
• derivation: study – student
Ambiguity
• all variants
• one variant
15
... morphology
Result is
• The unique ID
• The dictionary form
• A “stem”: part of the same string
16
Morphological analyzers
Precise analysis
Ambiguous
• Give all variants
• Tables: to table or the table?
• Spanish charlas: charla ‘talk’ or charlar ‘to talk’
• Russian dush: dush ‘shower’ or dusha ‘soul’
• Common in languages with developed morphology
• For short words, some 3 – 5 – 10 variants
Dictionaries are used
17
Morphological system
Dictionary specifies:
• Stem: bak-, ask-
• POS (part of speech): verb
• Inflection class (what endings it accepts): 1, 2
Tables of endings specify
• Paradigms:
  1. -e, -es, -ed, -ed, -ing
  2. -, -s, -ed, -ed, -ing
• Meanings: participle, ...
18
... morphological system
Algorithm
• Decompose the word into an existing stem and ending
• Check compatibility of stem and ending
• Give the stem ID and ending meaning
Ambiguous
• Many variants of decompositions
• Many stems with different IDs
• Many endings with different meanings
  • -ed: past or participle
Problem: words absent from the dictionary
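The algorithm above can be sketched with the two stems and two paradigms from the previous slide. The stem IDs and the meaning labels attached to each ending are assumptions added for the example; the slides name only “participle, ...”.

```python
# Dictionary: stem -> list of (stem ID, part of speech, inflection class).
# The IDs "BAKE" and "ASK" are hypothetical.
STEMS = {
    "bak": [("BAKE", "verb", 1)],
    "ask": [("ASK", "verb", 2)],
}

# Tables of endings per inflection class; the meaning labels are assumed.
PARADIGMS = {
    1: [("e", "base form"), ("es", "3rd person singular"), ("ed", "past"),
        ("ed", "past participle"), ("ing", "present participle")],
    2: [("", "base form"), ("s", "3rd person singular"), ("ed", "past"),
        ("ed", "past participle"), ("ing", "present participle")],
}

def analyze(word):
    """Return every (stem ID, POS, meaning) whose stem + ending spells the word."""
    analyses = []
    for cut in range(1, len(word) + 1):          # try every decomposition
        stem, ending = word[:cut], word[cut:]
        for stem_id, pos, infl_class in STEMS.get(stem, []):
            for allowed_ending, meaning in PARADIGMS[infl_class]:
                if allowed_ending == ending:     # stem and ending are compatible
                    analyses.append((stem_id, pos, meaning))
    return analyses
```

Here `analyze("baked")` returns two variants (past and past participle), which shows the ambiguity of -ed, and `analyze("walked")` returns an empty list, which shows the problem of words absent from the dictionary.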
19
Stemming
Substitute for real analysis
• Both inflection and derivation
Quick-and-dirty
• Only one variant
• Result: a part of the string
  • gene, genial -> gen-
Cheap development, bad results, simple description. Standard
Often used in academic research
Used to be used in real systems, but now less
20
Porter stemmer
Martin Porter, 1980
Standard stemmer
• Provides equal basis for evaluation of different IR programs
Uses “measure” m: [C](VC){m}[V]
• m=0: TR, EE, TREE, Y, BY
• m=1: TROUBLE, OATS, TREES, IVY
• m=2: TROUBLES, PRIVATE, OATEN, ORRERY
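A minimal sketch of how the measure m can be computed. It assumes Porter's usual convention that Y counts as a vowel when it follows a consonant and as a consonant otherwise; the function names are chosen here for illustration.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant; 'y' is a consonant unless it follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Number of vowel-run / consonant-run pairs: the m in [C](VC){m}[V]."""
    stem = stem.lower()
    pattern = "".join("C" if is_consonant(stem, i) else "V" for i in range(len(stem)))
    # Each place where a vowel is directly followed by a consonant closes one VC group.
    return pattern.count("VC")
```

For example, TROUBLE maps to the pattern CCVVCCV, which contains one vowel-to-consonant transition, so m=1.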
21
... Porter stemmer
Step 1a
SSES -> SS    caresses -> caress
IES -> I      ponies -> poni, ties -> ti
SS -> SS      caress -> caress
S ->          cats -> cat
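Step 1a has no conditions on the stem, so it can be sketched as a chain of suffix tests. The function name is chosen for illustration, and the input is assumed to be lowercase.

```python
def step_1a(word):
    """Porter step 1a: the first (longest) matching suffix rule is applied, then we stop."""
    if word.endswith("sses"):
        return word[:-2]      # SSES -> SS
    if word.endswith("ies"):
        return word[:-2]      # IES -> I
    if word.endswith("ss"):
        return word           # SS -> SS (blocks the S rule below)
    if word.endswith("s"):
        return word[:-1]      # S -> (nothing)
    return word
```

The SS -> SS rule changes nothing, but it has to be there so that words like caress do not lose their final S.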
22
... Porter stemmer
Step 1b
(m>0) EED -> EE    feed -> feed, agreed -> agree
(*v*) ED ->        plastered -> plaster, bled -> bled
(*v*) ING ->       motoring -> motor, sing -> sing
23
... Porter stemmer
If the 2nd or 3rd rule is successful
AT -> ATE    conflat(ed) -> conflate
BL -> BLE    troubl(ed) -> trouble
IZ -> IZE    siz(ed) -> size
(*d and not (*L or *S or *Z)) -> single letter
• hopp(ing) -> hop
• tann(ed) -> tan
• fall(ing) -> fall
• hiss(ing) -> hiss
• fizz(ed) -> fizz
(m=1 and *o) -> E
• fail(ing) -> fail
• fil(ing) -> file
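Step 1b together with its follow-up rules can be sketched as below. The helper names are chosen for illustration and the input is assumed to be lowercase; the conditions follow the notation of the slides: *v* (the stem contains a vowel), *d (the stem ends in a double consonant), and *o (the stem ends consonant-vowel-consonant, where the last consonant is not W, X or Y).

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    pattern = "".join("C" if is_consonant(stem, i) else "V" for i in range(len(stem)))
    return pattern.count("VC")

def has_vowel(stem):                      # condition *v*
    return any(not is_consonant(stem, i) for i in range(len(stem)))

def ends_double_consonant(stem):          # condition *d
    return len(stem) >= 2 and stem[-1] == stem[-2] and is_consonant(stem, len(stem) - 1)

def ends_cvc(stem):                       # condition *o
    n = len(stem)
    return (n >= 3 and is_consonant(stem, n - 3) and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1) and stem[-1] not in "wxy")

def tidy(stem):
    """Follow-up rules, applied only after ED or ING was removed."""
    if stem.endswith(("at", "bl", "iz")):
        return stem + "e"
    if ends_double_consonant(stem) and stem[-1] not in "lsz":
        return stem[:-1]
    if measure(stem) == 1 and ends_cvc(stem):
        return stem + "e"
    return stem

def step_1b(word):
    if word.endswith("eed"):
        stem = word[:-3]
        return stem + "ee" if measure(stem) > 0 else word
    for suffix in ("ed", "ing"):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            return tidy(stem) if has_vowel(stem) else word
    return word
```

The follow-up rules restore a final E or undo consonant doubling, so that filing and hopping end up as file and hop rather than fil and hopp.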
24
... Porter stemmer
Step 1c
(*v*) Y -> I
• happy -> happi
• sky -> sky
25
... Porter stemmer
Step 2
(m>0) ATIONAL -> ATE    relational -> relate
(m>0) TIONAL -> TION    conditional -> condition, rational -> rational
(m>0) ENCI -> ENCE      valenci -> valence
(m>0) ANCI -> ANCE      hesitanci -> hesitance
(m>0) IZER -> IZE       digitizer -> digitize
(m>0) ABLI -> ABLE      conformabli -> conformable
(m>0) ALLI -> AL        radicalli -> radical
(m>0) ENTLI -> ENT      differentli -> different
(m>0) ELI -> E          vileli -> vile
(m>0) OUSLI -> OUS      analogousli -> analogous
(m>0) IZATION -> IZE    vietnamization -> vietnamize
(m>0) ATION -> ATE      predication -> predicate
(m>0) ATOR -> ATE       operator -> operate
(m>0) ALISM -> AL       feudalism -> feudal
(m>0) IVENESS -> IVE    decisiveness -> decisive
(m>0) FULNESS -> FUL    hopefulness -> hopeful
(m>0) OUSNESS -> OUS    callousness -> callous
(m>0) ALITI -> AL       formaliti -> formal
(m>0) IVITI -> IVE      sensitiviti -> sensitive
(m>0) BILITI -> BLE     sensibiliti -> sensible
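Step 2 is naturally table-driven. The sketch below assumes lowercase input and relies on the order of the table to stand in for longest-suffix matching (ATIONAL is listed before TIONAL, IZATION before ATION); the names `STEP2_RULES` and `apply_rules` are chosen for illustration.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    pattern = "".join("C" if is_consonant(stem, i) else "V" for i in range(len(stem)))
    return pattern.count("VC")

# (suffix, replacement) pairs in the order given on the slide.
STEP2_RULES = [
    ("ational", "ate"), ("tional", "tion"), ("enci", "ence"), ("anci", "ance"),
    ("izer", "ize"), ("abli", "able"), ("alli", "al"), ("entli", "ent"),
    ("eli", "e"), ("ousli", "ous"), ("ization", "ize"), ("ation", "ate"),
    ("ator", "ate"), ("alism", "al"), ("iveness", "ive"), ("fulness", "ful"),
    ("ousness", "ous"), ("aliti", "al"), ("iviti", "ive"), ("biliti", "ble"),
]

def apply_rules(word, rules, min_measure):
    """Apply the first rule whose suffix matches; require measure(stem) > min_measure."""
    for suffix, replacement in rules:
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            if measure(stem) > min_measure:
                return stem + replacement
            return word       # suffix matched but the condition failed: stop here
    return word

def step_2(word):
    return apply_rules(word, STEP2_RULES, min_measure=0)
```

The condition m>0 is what keeps rational intact: the only stem left after removing the suffix is too short to count as a word. The same driver could in principle serve other condition-plus-suffix tables with a different rule list and threshold.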
26
... Porter stemmer
Step 3
(m>0) ICATE -> IC    triplicate -> triplic
(m>0) ATIVE ->       formative -> form
(m>0) ALIZE -> AL    formalize -> formal
(m>0) ICITI -> IC    electriciti -> electric
(m>0) ICAL -> IC     electrical -> electric
(m>0) FUL ->         hopeful -> hope
(m>0) NESS ->        goodness -> good
27
... Porter stemmer
Step 4
(m>1) AL ->       revival -> reviv
(m>1) ANCE ->     allowance -> allow
(m>1) ENCE ->     inference -> infer
(m>1) ER ->       airliner -> airlin
(m>1) IC ->       gyroscopic -> gyroscop
(m>1) ABLE ->     adjustable -> adjust
(m>1) IBLE ->     defensible -> defens
(m>1) ANT ->      irritant -> irrit
(m>1) EMENT ->    replacement -> replac
(m>1) MENT ->     adjustment -> adjust
(m>1) ENT ->      dependent -> depend
(m>1 and (*S or *T)) ION ->    adoption -> adopt
(m>1) OU ->       homologou -> homolog
(m>1) ISM ->      communism -> commun
(m>1) ATE ->      activate -> activ
(m>1) ITI ->      angulariti -> angular
(m>1) OUS ->      homologous -> homolog
(m>1) IVE ->      effective -> effect
(m>1) IZE ->      bowdlerize -> bowdler
28
... Porter stemmer
Step 5a
(m>1) E ->                  probate -> probat, rate -> rate
(m=1 and not *o) E ->       cease -> ceas
Step 5b
(m>1 and *d and *L) -> single letter
• controll -> control
• roll -> roll
29
Statistical stemmers
Take a list of words
Construct a model of language that “generates” it
• The “best” one
• The simplest one? How to find?
List of stems, list of endings
• Determine their probabilities
• Usage statistics
Decompose any input string into a stem and an ending
• Take the most probable variant
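One very naive way to make this concrete is sketched below. It is not a method given in the lecture: the word list is invented, stem and ending “probabilities” are simply taken as proportional to how often each string occurs as a prefix or suffix of the listed words, and ties are broken in favour of the shorter stem.

```python
from collections import Counter

# Hypothetical word list, invented for this example.
WORDS = ["walk", "walks", "walked", "walking",
         "talk", "talks", "talked", "talking",
         "jump", "jumps", "jumped", "jumping"]

def splits(word):
    """All ways to cut a word into a non-empty stem and a (possibly empty) ending."""
    return [(word[:i], word[i:]) for i in range(1, len(word) + 1)]

def train(words):
    """Count how often each string occurs as a candidate stem and as a candidate ending."""
    stem_counts, ending_counts = Counter(), Counter()
    for word in words:
        for stem, ending in splits(word):
            stem_counts[stem] += 1
            ending_counts[ending] += 1
    return stem_counts, ending_counts

def best_split(word, stem_counts, ending_counts):
    """Pick the split with the highest stem count * ending count (shorter stem on ties)."""
    return max(splits(word),
               key=lambda se: (stem_counts[se[0]] * ending_counts[se[1]], -len(se[0])))

stem_counts, ending_counts = train(WORDS)
```

With this list, `best_split("walking", stem_counts, ending_counts)` gives `("walk", "ing")`. Strings never seen during training get a count of zero, so a usable model would need smoothing and a better-founded criterion for the “best” model, which is the open question raised above.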
30
Research topics
Construction and application of ontologies
Building of morphological dictionaries
Treatment of unknown words with morphological analyzers
Development of better stemmers
• Statistical stemmers?
31
Conclusions
Reducing synonyms can help IR
• Better matching
• Ontologies are used. WordNet
Morphology is a variant of synonymy widely used in IR systems
• Precise analysis: dictionary-based analyzers
• Quick-and-dirty analysis: stemmers
  • Rule-based stemmers. Porter stemmer
  • Statistical stemmers
32
Thank you!
Till May 24? 25?, 6 pm