This Class
![Page 1: This Class](https://reader036.fdocuments.us/reader036/viewer/2022062518/568144e1550346895db1ac7f/html5/thumbnails/1.jpg)
This Class
How stemming is used in IR
Stemming algorithms
Frakes: Chapter 8
Kowalski: pages 67-76
![Page 2: This Class](https://reader036.fdocuments.us/reader036/viewer/2022062518/568144e1550346895db1ac7f/html5/thumbnails/2.jpg)
Stemming algorithms
Affix removing stemmers
Dictionary lookup stemmers
n-gram stemmers
Successor variety stemmers
![Page 3: This Class](https://reader036.fdocuments.us/reader036/viewer/2022062518/568144e1550346895db1ac7f/html5/thumbnails/3.jpg)
Stemming
Conflation - combining morphological term variants
Done manually or automatically
Automatic algorithms are called stemmers
![Page 4: This Class](https://reader036.fdocuments.us/reader036/viewer/2022062518/568144e1550346895db1ac7f/html5/thumbnails/4.jpg)
Stemming algorithms
Conflation methods: manual or automatic
Automatic: affix removal, successor variety, dictionary lookup, n-grams
Affix removal: longest match, simple removal
![Page 5: This Class](https://reader036.fdocuments.us/reader036/viewer/2022062518/568144e1550346895db1ac7f/html5/thumbnails/5.jpg)
Stemming is used for:
Enhance query formulation (and improve recall) by providing term variants
Reduce size of index files by combining term variants into a single index term
![Page 6: This Class](https://reader036.fdocuments.us/reader036/viewer/2022062518/568144e1550346895db1ac7f/html5/thumbnails/6.jpg)
Stemming during indexing
Index terms are stemmed words
Saves dictionary space
One inverted index list for all variants
Saves inverted index file space when position information in the document is not included
Query terms are also stemmed
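A minimal sketch of stemming at indexing time, assuming a toy stemmer that only strips a trailing "s". The names `toy_stem`, `build_index` and `search` and the two sample documents are invented for illustration. The point is that all variants share one posting list and that the query passes through the same stemmer.

```python
from collections import defaultdict

def toy_stem(word):
    """Placeholder stemmer (an assumption): strip one trailing 's' from longer words."""
    return word[:-1] if len(word) > 3 and word.endswith("s") else word

def build_index(docs, stem=toy_stem):
    """Map each stemmed term to the ids of documents containing any of its variants."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[stem(word)].add(doc_id)
    return index

def search(index, query, stem=toy_stem):
    """Stem the query terms the same way, then intersect their posting lists."""
    postings = [index.get(stem(w), set()) for w in query.lower().split()]
    return set.intersection(*postings) if postings else set()

# Hypothetical two-document collection.
docs = {1: "users use systems", 2: "the system user"}
index = build_index(docs)
print(sorted(index["user"]))
print(sorted(search(index, "system users")))
```

Because no positions are stored, "user" and "users" cost a single dictionary entry and a single posting list.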
![Page 7: This Class](https://reader036.fdocuments.us/reader036/viewer/2022062518/568144e1550346895db1ac7f/html5/thumbnails/7.jpg)
Index is not stemmed
In this case the index contains words
No compression is achieved
No information is lost
Enables wild card searches
Enables long phrase searches when position information is included
![Page 8: This Class](https://reader036.fdocuments.us/reader036/viewer/2022062518/568144e1550346895db1ac7f/html5/thumbnails/8.jpg)
Providing term variants during search
A stemming algorithm generates term variants
Term variants are added to the query automatically (query expansion), or
The user is provided with term variants and decides which ones to include
![Page 9: This Class](https://reader036.fdocuments.us/reader036/viewer/2022062518/568144e1550346895db1ac7f/html5/thumbnails/9.jpg)
Example
A user searching for "system users" is provided in the CATALOG system with term variants for "users" and "system"
![Page 10: This Class](https://reader036.fdocuments.us/reader036/viewer/2022062518/568144e1550346895db1ac7f/html5/thumbnails/10.jpg)
Example (cont.)
Search term: users
Term Occurrences
1. user 15
2. users 1
3. used 3
4. using 2
User selects variants to include in query
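The variant list in this example could be produced roughly as follows. This is a sketch only: `toy_stem` is a hypothetical suffix stripper chosen so that the four terms conflate, not the stemmer CATALOG uses, and the "system" count is invented to show a term that is filtered out.

```python
def toy_stem(word):
    """Hypothetical stemmer for this example only: strip the first matching suffix."""
    for suffix in ("ers", "ing", "er", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word

def term_variants(term, occurrences, stem=toy_stem):
    """Return (variant, count) pairs for indexed terms that share the query term's stem."""
    target = stem(term)
    return sorted((t, n) for t, n in occurrences.items() if stem(t) == target)

# Counts for the four variants come from the slide; "system" is made up.
occurrences = {"user": 15, "users": 1, "used": 3, "using": 2, "system": 7}
for variant, count in term_variants("users", occurrences):
    print(variant, count)
```

The user would then pick from the printed list, or the whole list could be added to the query automatically.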
![Page 11: This Class](https://reader036.fdocuments.us/reader036/viewer/2022062518/568144e1550346895db1ac7f/html5/thumbnails/11.jpg)
Stemmer correctness
A stemmer can be incorrect by either
– Under-stemming or by
– Over-stemming
Over-stemming can reduce precision
Under-stemming can reduce recall
![Page 12: This Class](https://reader036.fdocuments.us/reader036/viewer/2022062518/568144e1550346895db1ac7f/html5/thumbnails/12.jpg)
Over-stemming
Terms with different meanings are conflated
"considerate", "consider" and "consideration" should not be stemmed to "con", with "contra", "contact", etc.
![Page 13: This Class](https://reader036.fdocuments.us/reader036/viewer/2022062518/568144e1550346895db1ac7f/html5/thumbnails/13.jpg)
Under-Stemming
Prevents related terms from being conflated
Under-stemming "consideration" to "considerat" prevents conflating it with "consider"
![Page 14: This Class](https://reader036.fdocuments.us/reader036/viewer/2022062518/568144e1550346895db1ac7f/html5/thumbnails/14.jpg)
Evaluating stemmers
In information retrieval stemmers are evaluated by their:
– effect on retrieval and
– compression rate, and
– not linguistic correctness
![Page 15: This Class](https://reader036.fdocuments.us/reader036/viewer/2022062518/568144e1550346895db1ac7f/html5/thumbnails/15.jpg)
Evaluating stemmers
Studies have shown that stemming has a positive effect on retrieval.
Performance of the algorithms is comparable
Results vary between test collections
![Page 16: This Class](https://reader036.fdocuments.us/reader036/viewer/2022062518/568144e1550346895db1ac7f/html5/thumbnails/16.jpg)
Affix removal stemmers
Remove
– suffixes and/or
– prefixes from terms
– leaving a stem
![Page 17: This Class](https://reader036.fdocuments.us/reader036/viewer/2022062518/568144e1550346895db1ac7f/html5/thumbnails/17.jpg)
Affix removal stemmers
In English stemmers are suffix removers
In other languages,
for example Hebrew,
both prefix and suffix are removed
![Page 18: This Class](https://reader036.fdocuments.us/reader036/viewer/2022062518/568144e1550346895db1ac7f/html5/thumbnails/18.jpg)
Affix removal stemmers
Most affix removal stemmers in use are:
– iterative - for example, "consideration" stemmed first to "considerat" then to "consider"
– longest match stemmers using a set of stemming rules.
![Page 19: This Class](https://reader036.fdocuments.us/reader036/viewer/2022062518/568144e1550346895db1ac7f/html5/thumbnails/19.jpg)
A simple stemmer
Harman experimented – concluded minimal stemming helpful
Her simple stemmer changes:
– Plural to singular
– Third person to first person
![Page 20: This Class](https://reader036.fdocuments.us/reader036/viewer/2022062518/568144e1550346895db1ac7f/html5/thumbnails/20.jpg)
A simple stemmer
Algorithm changes:
"skies" to "sky" (ies->y),
"retrieves" to "retrieve" (es->e), and
"doors" to "door" (s->NULL)
(leaves "corpus" or "wellness" unchanged)
![Page 21: This Class](https://reader036.fdocuments.us/reader036/viewer/2022062518/568144e1550346895db1ac7f/html5/thumbnails/21.jpg)
A simple stemmer
1. word ends in "ies" but not "eies" or "aies": change end to "y"
2. word ends in "es" but not "aes", "ees" or "oes": change to "e"
3. word ends in "s" but not "us" or "ss": remove "s"
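The three rules can be sketched as a short function. The name `s_stem` is chosen here, and the exception lists in the comments are the usual ones given for Harman's plural stemmer. One interpretation is assumed: a rule blocked by its exception falls through to the next rule, and only the first rule that fires is applied.

```python
def s_stem(word):
    """Apply the first rule that fires; return the word unchanged if none does."""
    # Rule 1: "ies" -> "y", except after "e" or "a" ("eies", "aies").
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"
    # Rule 2: "es" -> "e", except "aes", "ees", "oes".
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-2] + "e"
    # Rule 3: drop a final "s", except in "us" and "ss".
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]
    return word

for w in ("skies", "retrieves", "doors", "corpus", "wellness"):
    print(w, "->", s_stem(w))
```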
![Page 22: This Class](https://reader036.fdocuments.us/reader036/viewer/2022062518/568144e1550346895db1ac7f/html5/thumbnails/22.jpg)
The Paice/Husk stemmer
Uses a table of rules grouped into sections
Section for each last letter of a suffix (rules for forms ending in a, then b, etc.)
A form is any word or part of a word considered for stemming
![Page 23: This Class](https://reader036.fdocuments.us/reader036/viewer/2022062518/568144e1550346895db1ac7f/html5/thumbnails/23.jpg)
The Paice/Husk stemmer
Each rule specifies a deletion or a replacement of an ending
The order of the rules in each section is important.
Rules tried until one can be applied, and the current form is updated
![Page 24: This Class](https://reader036.fdocuments.us/reader036/viewer/2022062518/568144e1550346895db1ac7f/html5/thumbnails/24.jpg)
Rule structure
Each rule contains 5 parts (2 are optional):
An ending (one or more characters in reverse order)
An optional intact flag "*" denoting form not yet stemmed
![Page 25: This Class](https://reader036.fdocuments.us/reader036/viewer/2022062518/568144e1550346895db1ac7f/html5/thumbnails/25.jpg)
Rule structure
A digit (>=0) specifying no. of characters to remove
An optional string to append (after removal)
A rule ending with
">" denotes stemming should continue
"." terminates the stemming process
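As a rough illustration of the five parts, a rule string such as `sei3y>` can be split with a regular expression. `Rule`, `RULE_RE` and `parse_rule` are names invented for this sketch. It assumes lowercase rules with a single-digit removal count and `*` as the intact flag.

```python
import re
from typing import NamedTuple

class Rule(NamedTuple):
    ending: str    # ending in normal reading order (stored reversed in the rule text)
    intact: bool   # True if the rule applies only to a form not yet stemmed
    remove: int    # number of trailing characters to delete
    append: str    # string appended after deletion (may be empty)
    proceed: bool  # True for '>' (continue stemming), False for '.' (terminate)

# reversed ending, optional '*', one digit, optional append string, '>' or '.'
RULE_RE = re.compile(r"([a-z]+)(\*?)(\d)([a-z]*)([>.])")

def parse_rule(text):
    m = RULE_RE.fullmatch(text)
    if m is None:
        raise ValueError(f"not a valid rule: {text!r}")
    ending, flag, digit, append, cont = m.groups()
    return Rule(ending[::-1], flag == "*", int(digit), append, cont == ">")

print(parse_rule("sei3y>"))
print(parse_rule("mu*2."))
```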
![Page 26: This Class](https://reader036.fdocuments.us/reader036/viewer/2022062518/568144e1550346895db1ac7f/html5/thumbnails/26.jpg)
Examples of rules
"sei3y>": if form ends in "ies" then replace the last 3 letters by "y" and continue stemming
("-ries" becomes "-ry")
![Page 27: This Class](https://reader036.fdocuments.us/reader036/viewer/2022062518/568144e1550346895db1ac7f/html5/thumbnails/27.jpg)
Examples of rules
"mu*2.": if form ends with "um" and word is intact, remove the last 2 letters and terminate stemming.
"maximum" is stemmed to "maxim", but "presum" from "presumably" remains unchanged
![Page 28: This Class](https://reader036.fdocuments.us/reader036/viewer/2022062518/568144e1550346895db1ac7f/html5/thumbnails/28.jpg)
Examples of rules
"ylp0." - if word terminates in "ply", terminate. The next rule, "yl2>", therefore does not remove "ly" from "multiply"
"nois4j>" causes "sion" to be replaced by "j"
"j" acts as dummy ending: "provision" is converted to "provij" and then to "provid"
![Page 29: This Class](https://reader036.fdocuments.us/reader036/viewer/2022062518/568144e1550346895db1ac7f/html5/thumbnails/29.jpg)
Acceptability conditions
Rule not applied unless conditions satisfied
Attempt to prevent over-stemming
Without them "rent", "rant", "rice", "rate", "ration", "river" reduce to "r"
There are 2 rules:
![Page 30: This Class](https://reader036.fdocuments.us/reader036/viewer/2022062518/568144e1550346895db1ac7f/html5/thumbnails/30.jpg)
Acceptability conditions
If a form starts with a vowel then at least 2 letters must remain (owed/owing->ow but not ear->e)
If a form starts with a consonant then at least 3 letters must remain, and at least one must be a vowel or "y"
(saying->say, crying->cry, but not string->str, meant->me, or cement->ce)
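The two conditions translate almost directly into a predicate on the stem that would remain. This is a sketch: `acceptable` is a name chosen here, and it assumes "y" counts as a consonant when it is the first letter.

```python
VOWELS = set("aeiou")

def acceptable(stem):
    """Return True if `stem` may be kept as the result of applying a rule."""
    if not stem:
        return False
    if stem[0] in VOWELS:
        # Vowel-initial forms: at least 2 letters must remain.
        return len(stem) >= 2
    # Consonant-initial forms: at least 3 letters, one of them a vowel or "y".
    return len(stem) >= 3 and any(c in VOWELS or c == "y" for c in stem)

for s in ("ow", "e", "say", "cry", "str", "me", "ce"):
    print(s, acceptable(s))
```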
![Page 31: This Class](https://reader036.fdocuments.us/reader036/viewer/2022062518/568144e1550346895db1ac7f/html5/thumbnails/31.jpg)
Acceptability conditions
These rules cause errors in the stemming of some short-rooted words
(doing, dying, being). These could be dealt with separately with
a table lookup
![Page 32: This Class](https://reader036.fdocuments.us/reader036/viewer/2022062518/568144e1550346895db1ac7f/html5/thumbnails/32.jpg)
Example with Paice stemming
"separately" - use "y" section: mismatch with ylb1>, yli3y>, ylp0.; match with yl2>
Form becomes "separate" - use rule "e1>" in "e" section
Form changes to "separat" - use "t" section: mismatch with "tacilp4y.", match with "ta2>"
Form changes to "separ" - use "r" section: match with "ra2."
So the stem is "sep"
![Page 33: This Class](https://reader036.fdocuments.us/reader036/viewer/2022062518/568144e1550346895db1ac7f/html5/thumbnails/33.jpg)
Other examples
preparation:
rule nois4j> fails
rule noix4ct. fails
rule noi2> preparat
rule ta2> prepar
rule ra2. prep
prepare:
rule e1> prepar
rule ra2. prep
prepared:
rule de2> prepar
rule ra2. prep
![Page 34: This Class](https://reader036.fdocuments.us/reader036/viewer/2022062518/568144e1550346895db1ac7f/html5/thumbnails/34.jpg)
n-grams
Fixed length consecutive series of "n" characters
Bigrams:
– Sea colony -> (se ea co ol lo on ny)
Trigrams:
– Sea colony -> (sea col olo lon ony), or
-> (#se sea ea# #co col olo lon ony ny#)
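A minimal sketch of n-gram extraction that reproduces the three lists above. The function name `ngrams` and the `pad` option are choices made here. Words are split on whitespace and lowercased, and with `pad=True` each word is wrapped in one "#" on each side.

```python
def ngrams(text, n, pad=False):
    """Return the character n-grams of each word in `text`, in order of appearance."""
    grams = []
    for word in text.lower().split():
        if pad:
            word = f"#{word}#"   # mark word boundaries
        grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams

print(ngrams("Sea colony", 2))
print(ngrams("Sea colony", 3))
print(ngrams("Sea colony", 3, pad=True))
```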
![Page 35: This Class](https://reader036.fdocuments.us/reader036/viewer/2022062518/568144e1550346895db1ac7f/html5/thumbnails/35.jpg)
Usage of n-grams
Used in World War II by cryptographers
Spell checking
Text compression
Signature files
Stemming
![Page 36: This Class](https://reader036.fdocuments.us/reader036/viewer/2022062518/568144e1550346895db1ac7f/html5/thumbnails/36.jpg)
n-gram stemmers
Adamson and Boreham (1974)
Method for grouping term variants
Language independent
![Page 37: This Class](https://reader036.fdocuments.us/reader036/viewer/2022062518/568144e1550346895db1ac7f/html5/thumbnails/37.jpg)
n-gram stemmers
Each term is transformed to n-grams
A similarity value is generated between any pair of terms in the database, resulting in a similarity matrix
![Page 38: This Class](https://reader036.fdocuments.us/reader036/viewer/2022062518/568144e1550346895db1ac7f/html5/thumbnails/38.jpg)
n-gram stemmers
A clustering method (single link) groups highly similar terms into clusters
Most matrix elements had value 0.
Used a cutoff value of 0.6 for their clustering algorithm
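Single-link clustering at a fixed cutoff amounts to taking the connected components of the graph that links every pair of terms whose similarity reaches the cutoff. The sketch below assumes the Dice coefficient over unique bigrams as the similarity and treats a value equal to the cutoff as a link. The function names and the five sample terms are chosen for illustration.

```python
from itertools import combinations

def bigrams(term):
    return {term[i:i + 2] for i in range(len(term) - 1)}

def dice(a, b):
    A, B = bigrams(a), bigrams(b)
    return 2 * len(A & B) / (len(A) + len(B)) if A or B else 0.0

def single_link_clusters(terms, cutoff=0.6):
    """Group terms into connected components of the 'similarity >= cutoff' graph."""
    parent = {t: t for t in terms}

    def find(t):
        # Follow parent links to the cluster representative (with path halving).
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for a, b in combinations(terms, 2):
        if dice(a, b) >= cutoff:
            parent[find(a)] = find(b)   # merge the two clusters

    clusters = {}
    for t in terms:
        clusters.setdefault(find(t), []).append(t)
    return sorted(sorted(c) for c in clusters.values())

print(single_link_clusters(["statistics", "statistical", "retrieve", "retrieval", "door"]))
```

A sparse matrix (most values 0) keeps the number of candidate links small, which is why a simple pass over the pairs is enough here.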
![Page 39: This Class](https://reader036.fdocuments.us/reader036/viewer/2022062518/568144e1550346895db1ac7f/html5/thumbnails/39.jpg)
Dice Coefficient
Many formulas for computing set similarity
Dice coefficient:
S = 2|A ∩ B| / (|A| + |B|)
0 ≤ S ≤ 1
S = 1 if A = B, S = 0 if A ∩ B = ∅
![Page 40: This Class](https://reader036.fdocuments.us/reader036/viewer/2022062518/568144e1550346895db1ac7f/html5/thumbnails/40.jpg)
Sets of Unique Bigrams
Let A and B denote the sets of unique bigrams associated with two terms, and let C = A ∩ B
statistics -> (st ta at ti is st ti ic cs)
Set of unique bigrams for statistics:
A={at cs ic is st ta ti}, |A|=7
![Page 41: This Class](https://reader036.fdocuments.us/reader036/viewer/2022062518/568144e1550346895db1ac7f/html5/thumbnails/41.jpg)
n-gram stemmers
statistical -> (st ta at ti is st ti ic ca al)
Set of unique bigrams for statistical:
B={al at ca ic is st ta ti}, |B|=8
C={at ic is st ta ti}, |C|=6
S=2|C|/(|A|+|B|)=2x6/(7+8)=0.8
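The worked example can be checked with a few lines of code. This is a sketch with names chosen here. Returning 0 for two terms that have no bigrams at all is a convention adopted for the example, not something the slides specify.

```python
def unique_bigrams(term):
    """Set of distinct adjacent character pairs in `term`."""
    return {term[i:i + 2] for i in range(len(term) - 1)}

def dice(a, b):
    A, B = unique_bigrams(a), unique_bigrams(b)
    if not A and not B:
        return 0.0   # convention chosen here for terms too short to have bigrams
    return 2 * len(A & B) / (len(A) + len(B))

A = unique_bigrams("statistics")
B = unique_bigrams("statistical")
print(len(A), len(B), len(A & B))
print(dice("statistics", "statistical"))
```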
![Page 42: This Class](https://reader036.fdocuments.us/reader036/viewer/2022062518/568144e1550346895db1ac7f/html5/thumbnails/42.jpg)
Table lookup method
Ideally, a table is constructed with a stem for every word
Stemming - look up word, find stem
There is no such data for English
Systems use a combination of dictionary lookup and conflation rules
![Page 43: This Class](https://reader036.fdocuments.us/reader036/viewer/2022062518/568144e1550346895db1ac7f/html5/thumbnails/43.jpg)
Dictionary lookup method
INQUERY uses Kstem
Kstem is a morphological analyzer that conflates word variants to a root form
![Page 44: This Class](https://reader036.fdocuments.us/reader036/viewer/2022062518/568144e1550346895db1ac7f/html5/thumbnails/44.jpg)
Dictionary lookup method
Tries to avoid collapsing words with different meaning to same root
The original word or a stemmed version is looked up in a dictionary and replaced by the best stem
![Page 45: This Class](https://reader036.fdocuments.us/reader036/viewer/2022062518/568144e1550346895db1ac7f/html5/thumbnails/45.jpg)
Successor variety stemmer
Based on work in structural linguistics (Hafer and Weiss)
Performed less well than affix removing stemmers
Given a set of words, the successor variety (SV) of a string is the number of different characters that follow it in words in the set
![Page 46: This Class](https://reader036.fdocuments.us/reader036/viewer/2022062518/568144e1550346895db1ac7f/html5/thumbnails/46.jpg)
Successor variety stemmers
Terms: {able, axle, accident, ape, about, apply, application, applies}
The SV of "ap" is 2
"ap" is followed by "e" in "ape" and by "p" in "apply", "application" and "applies"
The SV of "a" is 4
"a" is followed in the set by "b", "x", "c" and "p"
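A short sketch of the successor variety computation over this term set. `WORDS`, `successors` and `successor_variety` are names chosen here. The end of a word is counted as one successor and is represented by the empty string.

```python
WORDS = ["able", "axle", "accident", "ape", "about", "apply", "application", "applies"]

def successors(prefix, words=WORDS):
    """Characters that follow `prefix` in the word set; '' stands for a blank (word end)."""
    return {w[len(prefix):len(prefix) + 1] for w in words if w.startswith(prefix)}

def successor_variety(prefix, words=WORDS):
    return len(successors(prefix, words))

for p in ("a", "ap", "app", "appl", "apply"):
    print(p, successor_variety(p), sorted(successors(p)))
```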
![Page 47: This Class](https://reader036.fdocuments.us/reader036/viewer/2022062518/568144e1550346895db1ac7f/html5/thumbnails/47.jpg)
SVs for "apply" and "applies"
"apply":
Prefix SV Letters
a 4 b, x, c, p
ap 2 e, p
app 1 l
appl * 2 y, i
apply 1 blank
"applies":
Prefix SV Letters
a 4 b, x, c, p
ap 2 e, p
app 1 l
appl * 2 y, i
appli 2 e, c
applie 1 s
applies 1 blank
* denotes a break point at peak
![Page 48: This Class](https://reader036.fdocuments.us/reader036/viewer/2022062518/568144e1550346895db1ac7f/html5/thumbnails/48.jpg)
SV for "application"
Prefix SV Letters
a 4 b, x, c, p
ap 2 e, p
app 1 l
appl 2 y, i
appli * 3 c, y, e
applic 1 a
applica 1 t
applicat 1 i
applicati 1 o
applicatio 1 n
application 1 blank
![Page 49: This Class](https://reader036.fdocuments.us/reader036/viewer/2022062518/568144e1550346895db1ac7f/html5/thumbnails/49.jpg)
Segmenting words
4 ways:
– Cut-off SV is reached
– SV peaks
– A substring of a word is equal to another word in the set
"readable" breaks into "read" and "able"
– Entropy based method
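The peak method can be sketched as follows, assuming a strict peak: a break is made after a prefix whose SV is greater than the SV of both the prefix one character shorter and the prefix one character longer. The helper names and the example term set are choices made for this sketch, and plateaus (equal neighbouring values) are not treated as breaks.

```python
WORDS = ["able", "axle", "accident", "ape", "about", "apply", "application", "applies"]

def successor_variety(prefix, words):
    """Number of distinct characters (or word end) that follow `prefix` in `words`."""
    return len({w[len(prefix):len(prefix) + 1] for w in words if w.startswith(prefix)})

def peak_breaks(word, words):
    """Lengths of the prefixes of `word` whose SV is a strict local maximum."""
    svs = [successor_variety(word[:i], words) for i in range(1, len(word) + 1)]
    # svs[i] belongs to the prefix of length i + 1.
    return [i + 1 for i in range(1, len(svs) - 1) if svs[i - 1] < svs[i] > svs[i + 1]]

def segment(word, words):
    cuts = [0] + peak_breaks(word, words) + [len(word)]
    return [word[a:b] for a, b in zip(cuts, cuts[1:])]

print(segment("apply", WORDS))
print(segment("able", WORDS))
```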
![Page 50: This Class](https://reader036.fdocuments.us/reader036/viewer/2022062518/568144e1550346895db1ac7f/html5/thumbnails/50.jpg)
Selecting a stem
First segment is selected if it occurs in at most 12 words,
Otherwise the second segment is selected (3 segments are unlikely)
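The selection rule is a one-line test once the segments are known. In this sketch "occurs in a word" is read as "the word starts with the segment", which is an assumption. `select_stem`, `max_words` and both sample corpora are chosen for illustration.

```python
def select_stem(segments, corpus, max_words=12):
    """Pick the first segment unless it is so frequent that it is probably a prefix."""
    first = segments[0]
    # Assumption: a segment "occurs in" a word when the word starts with it.
    occurrences = sum(1 for w in corpus if w.startswith(first))
    if occurrences <= max_words or len(segments) < 2:
        return first
    return segments[1]

words = ["able", "axle", "accident", "ape", "about", "apply", "application", "applies"]
print(select_stem(["appl", "y"], words))

# Hypothetical corpus with a frequent prefix; the threshold is lowered to fit its size.
prefixed = ["undo", "unfold", "untie", "unfolded"]
print(select_stem(["un", "fold"], prefixed, max_words=3))
```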
![Page 51: This Class](https://reader036.fdocuments.us/reader036/viewer/2022062518/568144e1550346895db1ac7f/html5/thumbnails/51.jpg)
Summary
All automatic stemmers - sometimes incorrect
n-gram method can be used for different languages
In general affix removing stemmers are more "correct"
Longest match stemming does not always generate satisfactory word stems