Special Topics in Computer Science Advanced Topics in Information Retrieval Lecture 8: Natural...


Special Topics in Computer Science

Advanced Topics in Information Retrieval

Lecture 8: Natural Language Processing and IR. Synonymy, Morphology, and Stemming

Alexander Gelbukh

www.Gelbukh.com


Previous Chapter: Conclusions

Parallel computing can improve the response time for each query and/or the throughput: the number of queries processed at the same speed
Document partitioning is simple and good for distributed computing
Term partitioning is good for some data structures
Distributed computing is MIMD computing with slow communication
SIMD machines are good for signature files
• Both are out of favor now


Previous Chapter: Research topics

How to evaluate the speedup
New algorithms
Adaptation of existing algorithms
Merging the results is a bottleneck
• Meta search engines
Creating large collections with judgements
Is recall important?


Problem

Recall image retrieval:
• Find images similar in color, size, ...
• Find photos of the Korean President?
• Find nice girls? (Don't show ugly ones!)
The system looks very stupid
• Lacks understanding
• Too difficult
Text retrieval is no exception
• Find stories with a sad beginning and a happy end?
• Lacks understanding
• Difficult but possible


Possible?

Text is intended to facilitate understanding
Supposedly, even partial understanding should help
Degrees of understanding:
• Character strings (what is used now): well, geese, him
• Words (often used now): goose, he
• Concepts: hole in the ground (well), Roh Moo-Hyun
• Complex concepts: oil well, hot dog
• Situations (sentences, paragraphs)
• The story (direct meaning)
• The message (pragmatics, intended impact)


Easy?

Main problems:
Multiple ways to say the same thing
• The query does not match the document
• Difficult to specify all variants
Ambiguity of the text
• False alarms in matching
The computer's lack of implicit knowledge
• The computer “does not understand” the message
• Difficult to make inferences

Natural Language Processing tries to solve them


Solutions

Multiple ways to say the same thing?
• Normalizing: transforming to a “standard” variant
Ambiguity of the text?
• Ambiguity resolution
• Normalizing to one of the variants
• Perhaps the main problem in natural language processing
The computer's lack of implicit knowledge?
• Dictionaries, grammars
• Knowledge of language structure is needed in all tasks
• Knowledge of the world is useful for advanced tasks
• Knowledge of language use is a substitute


Synonymy

Multiple ways to say the same thing
• Or at least when the difference does not matter
• Can be substituted in any (many?) context
Lexical synonymy
• Woman / female, professor / teacher
• Dictionaries
Phrase-level or sentence-level synonymy
• They gave me a book / I was given a book by them
• Syntactic analyzers
Semantic-level synonymy
• Reasoning


Not only synonymy

Multiple ways to say:
• the same (synonymy)
• less: more general (hypernymy)
• more: more specific (hyponymy)
Complete synonyms are rare
• professor / teacher
• Abbreviations are usually (almost) complete synonyms
When the differences do not matter, these relations can be treated as synonymy
But: different data structures and methods


Lexical-level synonymy

Lexical synonymy
• Woman / female
• Mixed-type synonymy: USA / United States
Morphology is a kind of synonymy (actually hyponymy)
• ‘geese’ = ‘goose’ + ‘many’
• Russian ‘knigu’ = ‘kniga’ + ‘accusative role’
• the “second” part of the meaning is either not important or is another term
Morphology is a very common problem in IR


Lexical synonymy

Woman / female
Dictionaries
• Synonym dictionaries
• WordNet
Automatic learning of synonymy
• Clustering of contexts
• If the contexts are very similar, the words are possible synonyms
• Problem: is the meaning preserved? Monday / Tuesday
• An interesting solution: compare dictionary definitions
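
The context-comparison idea, and the Monday / Tuesday problem with it, can be illustrated with a minimal sketch. Everything here (the toy corpus, the window size, the function names `context_vector` and `cosine`) is an assumption made for illustration, not the method of any particular system.

```python
from collections import Counter
import math

def context_vector(word, sentences, window=2):
    """Count the words occurring within `window` positions of `word`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token == word:
                lo, hi = max(0, i - window), i + window + 1
                counts.update(t for j, t in enumerate(tokens[lo:hi], lo) if j != i)
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical toy corpus.
corpus = [
    "the meeting is on monday at noon",
    "the meeting is on tuesday at noon",
    "we ate pizza at noon",
]
sim_days = cosine(context_vector("monday", corpus), context_vector("tuesday", corpus))
sim_other = cosine(context_vector("monday", corpus), context_vector("pizza", corpus))
```

In this toy corpus "monday" and "tuesday" get identical context vectors, so a similarity threshold would propose them as synonyms even though swapping them changes the meaning.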


Uses in IR

Query expansion
• Add synonyms of the word to the query and process normally
• Flexible, slow
• Best for lexical synonymy: few synonyms, doubtful
Reducing at index time
• When reading the documents, reduce each word to a “standard” synonym
• Fast, rigid
• Best for morphology: many synonyms, less doubtful
Hierarchical indexing
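
The first two options can be contrasted in a small sketch. The synonym dictionary, the normalization table, and the function names below are all hypothetical; a real system would use a proper dictionary and a morphological analyzer or stemmer.

```python
from collections import defaultdict

# Hypothetical toy resources, for illustration only.
SYNONYMS = {"woman": {"female"}, "female": {"woman"}}   # lexical synonyms
CANONICAL = {"geese": "goose", "cats": "cat"}           # word form -> "standard" form

def build_index(docs, normalize=False):
    """Inverted index; with normalize=True each word is reduced at index time."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            if normalize:
                token = CANONICAL.get(token, token)
            index[token].add(doc_id)
    return index

def search_expanded(index, word):
    """Query expansion: look up the word and each of its synonyms."""
    hits = set()
    for term in {word} | SYNONYMS.get(word, set()):
        hits |= index.get(term, set())
    return hits

def search_reduced(index, word):
    """Index-time reduction: normalize the query word the same way, one lookup."""
    return set(index.get(CANONICAL.get(word, word), set()))

docs = {1: "a woman feeds geese", 2: "female goose", 3: "black cats"}
raw = build_index(docs)
reduced = build_index(docs, normalize=True)

print(search_expanded(raw, "woman"))     # documents 1 and 2
print(search_reduced(reduced, "geese"))  # documents 1 and 2
```

Expansion costs one lookup per synonym at query time but can be switched off per query; reduction costs a single lookup but fixes the decision when the index is built.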


Hierarchical indexing

(Gelbukh, Sidorov, Guzman-Arenas 2002)
Tree of concepts:
Living things
• Animals
  1. a. Cat, b. cats
  2. a. Dog, b. dogs
• Persons
  3. a. Professor, b. professors
  4. a. Student, b. students
Order the vocabulary by the order of the leaves of the tree
Query expansion is done by ranges:
• cat: 1, living things: 1-4
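
A minimal sketch of the range idea, using the tree from this slide. The data layout (`TREE`, `postings`) and the function names are assumptions for illustration, and the a./b. word forms are collapsed into one leaf per concept; this is not the implementation from the cited paper.

```python
# Concept tree from the slide; leaves are listed in tree order.
TREE = {"living things": {"animals": ["cat", "dog"],
                          "persons": ["professor", "student"]}}

def assign_ranges(tree):
    """Number the leaves 1..n in tree order; map every node to its (first, last) leaf ID."""
    ranges = {}
    next_id = 1

    def visit(name, children):
        nonlocal next_id
        first = next_id
        if isinstance(children, dict):        # inner node: recurse into subtrees
            for child, sub in children.items():
                visit(child, sub)
        else:                                 # list of leaf concepts
            for leaf in children:
                ranges[leaf] = (next_id, next_id)
                next_id += 1
        ranges[name] = (first, next_id - 1)

    for name, children in tree.items():
        visit(name, children)
    return ranges

def search(concept, ranges, postings):
    """Expand a concept to its ID range and collect the documents of every ID in it."""
    lo, hi = ranges[concept]
    return set().union(*(postings.get(i, set()) for i in range(lo, hi + 1)))

ranges = assign_ranges(TREE)
postings = {1: {"d1"}, 3: {"d2"}, 4: {"d3"}}   # hypothetical leaf ID -> documents

print(ranges["cat"], ranges["living things"])
print(search("persons", ranges, postings))
```

Because the vocabulary is ordered like the leaves, a general concept corresponds to one contiguous range of entries rather than a long list of separate query terms.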


Morphology

One of the large concerns in IR
Can be done
• precisely
• approximately (quick-and-dirty)
Level of generalization
• inflection: student – students
• derivation: study – student
Ambiguity
• all variants
• one variant


... morphology

The result is one of:
• The unique ID
• The dictionary form
• A “stem”: part of the same string


Morphological analyzers

Precise analysis
Ambiguous
• Give all variants
• Tables: to table or the table?
• Spanish charlas: charla ‘talk’ or charlar ‘to talk’
• Russian dush: dush ‘shower’ or dusha ‘soul’
• Common in languages with developed morphology
• For short words, some 3 – 5 – 10 variants
Dictionaries are used


Morphological system

Dictionary specifies:
• Stem: bak-, ask-
• POS (part of speech): verb
• Inflection class (what endings it accepts): 1, 2
Tables of endings specify:
• Paradigms:
  1. -e, -es, -ed, -ed, -ing
  2. -, -s, -ed, -ed, -ing
• Meanings: participle, ...


... morphological system

Algorithm
• Decompose the word into an existing stem and ending
• Check compatibility of stem and ending
• Give the stem ID and ending meaning
Ambiguous
• Many variants of decomposition
• Many stems with different IDs
• Many endings with different meanings, e.g. -ed: past or participle
Problem: words absent from the dictionary
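
The algorithm can be sketched with the bak- / ask- example from the previous slide. The data layout, the lemma used as stem ID, and the meaning labels other than "participle" are assumptions made for this illustration.

```python
# Dictionary: stem -> list of (lemma used as ID, part of speech, inflection class).
STEMS = {
    "bak": [("bake", "verb", 1)],
    "ask": [("ask", "verb", 2)],
}

# Tables of endings: inflection class -> ending -> possible meanings (labels assumed).
ENDINGS = {
    1: {"e": ["base form"], "es": ["3rd person singular"],
        "ed": ["past", "past participle"], "ing": ["present participle"]},
    2: {"": ["base form"], "s": ["3rd person singular"],
        "ed": ["past", "past participle"], "ing": ["present participle"]},
}

def analyze(word):
    """Return every (lemma, POS, meaning) reading the dictionary allows for `word`."""
    results = []
    for cut in range(1, len(word) + 1):            # try every stem + ending split
        stem, ending = word[:cut], word[cut:]
        for lemma, pos, cls in STEMS.get(stem, []):
            # Compatibility check: the ending must belong to the stem's class.
            for meaning in ENDINGS[cls].get(ending, []):
                results.append((lemma, pos, meaning))
    return results

print(analyze("baked"))   # two readings: -ed is ambiguous
print(analyze("asks"))
print(analyze("walked"))  # stem absent from the dictionary: no analysis
```

The empty result for "walked" shows the problem of words absent from the dictionary: a precise analyzer has nothing to say about them unless a separate guessing mechanism is added.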


Stemming

Substitute for real analysis
• Both inflection and derivation
Quick-and-dirty
• Only one variant
• Result: a part of the string, e.g. gene, genial -> gen-
Cheap development, bad results, simple description
Standard
• Often used in academic research
• Used to be used in real systems, but less so now


Porter stemmer

Martin Porter, 1980
Standard stemmer
• Provides an equal basis for evaluation of different IR programs
Uses “measure” m: [C](VC){m}[V]
• m=0: TR, EE, TREE, Y, BY
• m=1: TROUBLE, OATS, TREES, IVY
• m=2: TROUBLES, PRIVATE, OATEN, ORRERY
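
A minimal sketch of how the measure m can be computed. It assumes the vowel definition of Porter's algorithm (a, e, i, o, u, plus y when preceded by a consonant); the function names are chosen for this illustration.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant; 'y' counts as a vowel after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Count the VC sequences, i.e. the m in [C](VC){m}[V]."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:      # a vowel run has just been closed by a consonant
            m += 1
        prev_vowel = not cons
    return m

for w in ["tree", "by", "trouble", "ivy", "troubles", "orrery"]:
    print(w, measure(w))
```

Counting only the vowel-to-consonant transitions is enough, because each (VC) group contributes exactly one such transition.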


... Porter stemmer

Step 1a
SSES -> SS    caresses -> caress
IES -> I      ponies -> poni, ties -> ti
SS -> SS      caress -> caress
S ->          cats -> cat
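
Step 1a has no conditions, so it reduces to trying the suffixes from longest to shortest. A minimal sketch (the function name is chosen for this illustration):

```python
def step1a(word):
    """Porter Step 1a: plural endings; the first (longest) matching suffix wins."""
    if word.endswith("sses"):
        return word[:-2]      # SSES -> SS
    if word.endswith("ies"):
        return word[:-2]      # IES -> I
    if word.endswith("ss"):
        return word           # SS -> SS (protects 'caress' from the S rule)
    if word.endswith("s"):
        return word[:-1]      # S -> (nothing)
    return word

for w in ["caresses", "ponies", "ties", "caress", "cats"]:
    print(w, "->", step1a(w))
```

The seemingly useless SS -> SS rule matters because only one rule of a step fires: it stops "caress" from reaching the S rule.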


... Porter stemmer

Step 1b
(m>0) EED -> EE    feed -> feed, agreed -> agree
(*v*) ED ->        plastered -> plaster, bled -> bled
(*v*) ING ->       motoring -> motor, sing -> sing


... Porter stemmer

If the 2nd or 3rd rule of Step 1b was successful:
AT -> ATE    conflat(ed) -> conflate
BL -> BLE    troubl(ed) -> trouble
IZ -> IZE    siz(ed) -> size
(*d and not (*L or *S or *Z)) -> single letter
• hopp(ing) -> hop
• tann(ed) -> tan
• fall(ing) -> fall
• hiss(ing) -> hiss
• fizz(ed) -> fizz
(m=1 and *o) -> E
• fail(ing) -> fail
• fil(ing) -> file
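
Step 1b together with its clean-up rules can be sketched as follows. The conditions are read as in Porter's description: *v* means the stem contains a vowel, *d that it ends in a double consonant, and *o that it ends consonant-vowel-consonant with the last consonant not w, x, or y. Function names are chosen for this illustration.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Number of VC sequences in the stem."""
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m

def has_vowel(stem):                      # condition *v*
    return any(not is_consonant(stem, i) for i in range(len(stem)))

def ends_double_consonant(stem):          # condition *d
    return len(stem) >= 2 and stem[-1] == stem[-2] and is_consonant(stem, len(stem) - 1)

def ends_cvc(stem):                       # condition *o
    n = len(stem)
    return (n >= 3 and is_consonant(stem, n - 3) and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1) and stem[-1] not in "wxy")

def cleanup(stem):
    """Repairs applied only after ED or ING was actually removed."""
    if stem.endswith(("at", "bl", "iz")):
        return stem + "e"
    if ends_double_consonant(stem) and stem[-1] not in "lsz":
        return stem[:-1]
    if measure(stem) == 1 and ends_cvc(stem):
        return stem + "e"
    return stem

def step1b(word):
    if word.endswith("eed"):
        return word[:-1] if measure(word[:-3]) > 0 else word
    for suffix in ("ed", "ing"):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            return cleanup(stem) if has_vowel(stem) else word
    return word

for w in ["agreed", "bled", "hopping", "falling", "filing", "sized"]:
    print(w, "->", step1b(w))
```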


... Porter stemmer

Step 1c
(*v*) Y -> I
• happy -> happi
• sky -> sky
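
A minimal sketch of Step 1c, with the (*v*) condition read as "the part before the final y contains a vowel"; the function names are chosen for this illustration.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def step1c(word):
    """Porter Step 1c: final Y -> I when the rest of the word contains a vowel."""
    stem = word[:-1]
    if word.endswith("y") and any(not is_consonant(stem, i) for i in range(len(stem))):
        return stem + "i"
    return word

print(step1c("happy"), step1c("sky"))
```

In "sky" the part before the y is "sk", which has no vowel, so the word is left alone.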


... Porter stemmer

Step 2
(m>0) ATIONAL -> ATE    relational -> relate
(m>0) TIONAL -> TION    conditional -> condition, rational -> rational
(m>0) ENCI -> ENCE      valenci -> valence
(m>0) ANCI -> ANCE      hesitanci -> hesitance
(m>0) IZER -> IZE       digitizer -> digitize
(m>0) ABLI -> ABLE      conformabli -> conformable
(m>0) ALLI -> AL        radicalli -> radical
(m>0) ENTLI -> ENT      differentli -> different
(m>0) ELI -> E          vileli -> vile
(m>0) OUSLI -> OUS      analogousli -> analogous
(m>0) IZATION -> IZE    vietnamization -> vietnamize
(m>0) ATION -> ATE      predication -> predicate
(m>0) ATOR -> ATE       operator -> operate
(m>0) ALISM -> AL       feudalism -> feudal
(m>0) IVENESS -> IVE    decisiveness -> decisive
(m>0) FULNESS -> FUL    hopefulness -> hopeful
(m>0) OUSNESS -> OUS    callousness -> callous
(m>0) ALITI -> AL       formaliti -> formal
(m>0) IVITI -> IVE      sensitiviti -> sensitive
(m>0) BILITI -> BLE     sensibiliti -> sensible
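
Step 2 is naturally table-driven. The sketch below encodes the rules of this slide and assumes, as in Porter's description, that the longest matching suffix decides and that the condition m>0 is tested on the part of the word before that suffix. Function and variable names are chosen for this illustration.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Number of VC sequences in the stem."""
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m

STEP2_RULES = [
    ("ational", "ate"), ("tional", "tion"), ("enci", "ence"), ("anci", "ance"),
    ("izer", "ize"), ("abli", "able"), ("alli", "al"), ("entli", "ent"),
    ("eli", "e"), ("ousli", "ous"), ("ization", "ize"), ("ation", "ate"),
    ("ator", "ate"), ("alism", "al"), ("iveness", "ive"), ("fulness", "ful"),
    ("ousness", "ous"), ("aliti", "al"), ("iviti", "ive"), ("biliti", "ble"),
]
# Longest suffix first, so that ATIONAL is tried before TIONAL, IZATION before ATION.
STEP2_ORDERED = sorted(STEP2_RULES, key=lambda rule: -len(rule[0]))

def step2(word):
    for suffix, replacement in STEP2_ORDERED:
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            # Only the first matching suffix is considered, whether or not it fires.
            return stem + replacement if measure(stem) > 0 else word
    return word

for w in ["relational", "conditional", "rational", "sensibiliti"]:
    print(w, "->", step2(w))
```

"rational" stays unchanged because the matching suffix is ATIONAL and the remaining stem "r" has m=0. The same loop would fit Steps 3 and 4 with their own tables and thresholds.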


... Porter stemmer

Step 3
(m>0) ICATE -> IC    triplicate -> triplic
(m>0) ATIVE ->       formative -> form
(m>0) ALIZE -> AL    formalize -> formal
(m>0) ICITI -> IC    electriciti -> electric
(m>0) ICAL -> IC     electrical -> electric
(m>0) FUL ->         hopeful -> hope
(m>0) NESS ->        goodness -> good


... Porter stemmer

Step 4
(m>1) AL ->       revival -> reviv
(m>1) ANCE ->     allowance -> allow
(m>1) ENCE ->     inference -> infer
(m>1) ER ->       airliner -> airlin
(m>1) IC ->       gyroscopic -> gyroscop
(m>1) ABLE ->     adjustable -> adjust
(m>1) IBLE ->     defensible -> defens
(m>1) ANT ->      irritant -> irrit
(m>1) EMENT ->    replacement -> replac
(m>1) MENT ->     adjustment -> adjust
(m>1) ENT ->      dependent -> depend
(m>1 and (*S or *T)) ION ->    adoption -> adopt
(m>1) OU ->       homologou -> homolog
(m>1) ISM ->      communism -> commun
(m>1) ATE ->      activate -> activ
(m>1) ITI ->      angulariti -> angular
(m>1) OUS ->      homologous -> homolog
(m>1) IVE ->      effective -> effect
(m>1) IZE ->      bowdlerize -> bowdler


... Porter stemmer

Step 5a
(m>1) E ->                 probate -> probat, rate -> rate
(m=1 and not *o) E ->      cease -> ceas

Step 5b
(m>1 and *d and *L) -> single letter
• controll -> control
• roll -> roll
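
Steps 5a and 5b can be sketched together. The conditions are read as in Porter's description (*o: the stem ends consonant-vowel-consonant with the last consonant not w, x, or y; *d and *L together: the word ends in a double l). Function names are chosen for this illustration.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Number of VC sequences in the stem."""
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m

def ends_cvc(stem):                       # condition *o
    n = len(stem)
    return (n >= 3 and is_consonant(stem, n - 3) and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1) and stem[-1] not in "wxy")

def step5(word):
    # Step 5a: drop a final E if the stem is long enough.
    if word.endswith("e"):
        stem = word[:-1]
        m = measure(stem)
        if m > 1 or (m == 1 and not ends_cvc(stem)):
            word = stem
    # Step 5b: reduce a final double L when m > 1.
    if measure(word) > 1 and word.endswith("ll"):
        word = word[:-1]
    return word

for w in ["probate", "rate", "cease", "controll", "roll"]:
    print(w, "->", step5(w))
```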


Statistical stemmers

Take a list of words
Construct a model of language that “generates” it
• The “best” one
• The simplest one?
• How to find it?
List of stems, list of endings
Determine their probabilities
• Usage statistics
Decompose any input string into a stem and an ending
• Take the most probable variant
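
A toy sketch of the idea. The scoring used here (how many other words share the candidate stem, times how many other words share the candidate ending) is a crude stand-in for real probability estimates and is purely an assumption of this illustration, as are the word list, the length limits, and the function names.

```python
from collections import Counter

MIN_STEM, MAX_ENDING = 3, 3   # assumed limits that keep the toy model small

def splits(word):
    """All (stem, ending) cuts with stem >= MIN_STEM letters and ending <= MAX_ENDING."""
    start = max(MIN_STEM, len(word) - MAX_ENDING)
    return [(word[:i], word[i:]) for i in range(start, len(word) + 1)]

def train(words):
    """Count in how many distinct words each candidate stem and ending occurs."""
    stems, endings = Counter(), Counter()
    for w in set(words):
        for s, e in splits(w):
            stems[s] += 1
            endings[e] += 1
    return stems, endings

def stem(word, stems, endings):
    """Pick the split whose stem and ending are both supported by other words."""
    best, best_score = word, 0
    for s, e in splits(word):
        score = max(stems[s] - 1, 0) * max(endings[e] - 1, 0)
        if score > best_score:
            best, best_score = s, score
    return best               # unsupported words are returned unchanged

WORDS = ["walk", "walks", "walked", "walking", "talk", "talks", "talking",
         "jump", "jumps", "jumped", "jumping"]
stems, endings = train(WORDS)

for w in ["walked", "jumping", "talked", "xyzzy"]:
    print(w, "->", stem(w, stems, endings))
```

"talked" is not in the word list, yet it is still split as talk + ed because both parts are supported by other words. A real statistical stemmer would replace the ad hoc score with probabilities estimated from usage statistics.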


Research topics

Construction and application of ontologies
Building of morphological dictionaries
Treatment of unknown words with morphological analyzers
Development of better stemmers
• Statistical stemmers?


Conclusions

Reducing synonyms can help IR
• Better matching
• Ontologies are used. WordNet
Morphology is a variant of synonymy widely used in IR systems
• Precise analysis: dictionary-based analyzers
• Quick-and-dirty analysis: stemmers
Rule-based stemmers. Porter stemmer
Statistical stemmers


Thank you!

Till May 24? 25?, 6 pm