Discussion Class 3

Page 1: Discussion Class 3

Discussion Class 3

Stemming Algorithms

Page 2: Discussion Class 3

Discussion Classes

Format:

Question

Ask a member of the class to answer

Provide opportunity for others to comment

When answering:

Give your name. Make sure that the TA hears it.

Stand up

Speak clearly so that all the class can hear

Page 3: Discussion Class 3

Question 1: Conflation methods

(a) Define the terms: stem, suffix, prefix, conflation, morpheme

(b) Define the terms in the following diagram:

Conflation methods
    Manual
    Automatic (stemmers)
        Affix removal
            Longest match
            Simple removal
        Successor variety
        Table lookup
        n-gram

Page 4: Discussion Class 3

Question 2: Table look-up

(a) What are the advantages and disadvantages of table look-up methods?

(b) When would you use table look-up?

Page 5: Discussion Class 3

Question 3: Successor variety methods

Hafer and Weiss defined their technique as:

Let α be a word of length n; α_i is the length-i prefix of α. Let D be the corpus of words. D_αi is defined as the subset of D containing the terms whose first i letters match α_i exactly. The successor variety of α_i, denoted by S_αi, is then defined as the number of distinct letters that occupy the (i+1)st position of words in D_αi. A test word of length n has n successor varieties S_α1, S_α2, ..., S_αn.

Explain this definition, using the word "computation" as an example.
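The definition can be traced with a short Python sketch. The corpus below is hypothetical, made up only for illustration, and the function name `successor_varieties` is not from the slides. One assumption to note: a corpus word that ends exactly at the prefix contributes no successor letter here, although some presentations count the end of a word as an extra "blank" successor.

```python
def successor_varieties(word, corpus):
    """Return [S_α1, ..., S_αn] for `word`, measured against `corpus`."""
    varieties = []
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        # D_αi: corpus words whose first i letters equal the prefix.
        # Only words longer than i have a letter in position i+1.
        successors = {w[i] for w in corpus if w.startswith(prefix) and len(w) > i}
        varieties.append(len(successors))
    return varieties


# Hypothetical toy corpus, chosen only for illustration.
TOY_CORPUS = [
    "compute", "computer", "computers", "computing", "computation",
    "computations", "compare", "complete", "come", "cat", "dog",
]

print(successor_varieties("computation", TOY_CORPUS))
# → [2, 1, 2, 3, 1, 3, 1, 1, 1, 1, 1]
```

With this toy corpus the variety rises to 3 after "comp" and after "comput", which is where several different continuations exist. A different corpus would give different numbers.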

Page 6: Discussion Class 3

Question 4: Successor variety methods

With successor variety methods, how do the following methods of segmentation work?

(a) cutoff method

(b) peak and plateau method

(c) complete word method
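The three segmentation methods named in this question can be sketched in Python as below. This is one possible reading, not a definitive implementation: the cutoff rule breaks wherever the variety reaches a threshold, the peak and plateau rule is taken here in its strict form (a variety greater than both neighbours), and the complete word rule breaks wherever the prefix is itself a corpus word. The function names, the threshold value, and the list of varieties (illustrative values for "computation" against a small made-up corpus) are all assumptions.

```python
def cutoff_breaks(varieties, threshold):
    """Prefix lengths i whose successor variety reaches the threshold."""
    return [i for i, s in enumerate(varieties, start=1) if s >= threshold]


def peak_breaks(varieties):
    """Prefix lengths whose variety is strictly greater than both neighbours."""
    return [
        i + 1
        for i in range(1, len(varieties) - 1)
        if varieties[i] > varieties[i - 1] and varieties[i] > varieties[i + 1]
    ]


def complete_word_breaks(word, corpus):
    """Prefix lengths i where the prefix is itself a word in the corpus."""
    vocabulary = set(corpus)
    return [i for i in range(1, len(word)) if word[:i] in vocabulary]


def segment(word, breaks):
    """Split `word` after each prefix length listed in `breaks`."""
    pieces, start = [], 0
    for b in breaks:
        pieces.append(word[start:b])
        start = b
    pieces.append(word[start:])
    return pieces


# Hypothetical successor varieties for the prefixes of "computation".
VARIETIES = [2, 1, 2, 3, 1, 3, 1, 1, 1, 1, 1]

print(cutoff_breaks(VARIETIES, threshold=3))   # → [4, 6]
print(peak_breaks(VARIETIES))                  # → [4, 6]
print(segment("computation", [4, 6]))          # → ['comp', 'ut', 'ation']
print(complete_word_breaks("readable", ["read", "reads", "readable"]))  # → [4]
```

Segmentation only yields candidate pieces; choosing which piece to keep as the stem is a separate decision that this sketch does not make.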

Page 7: Discussion Class 3

Question 5: n-gram methods

(a) Explain the following notation:

statistics => st ta at ti is st ti ic cs
unique digrams => at cs ic is st ta ti

statistical => st ta at ti is st ti ic ca al
unique digrams => al at ca ic is st ta ti

(b) Calculate the similarity using Dice's coefficient:

S = 2C / (A + B)

A is the number of unique digrams in the first term
B is the number of unique digrams in the second term
C is the number of shared unique digrams

(c) How would you use this approach for stemming?
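As a check on part (b), Dice's coefficient S = 2C / (A + B) over unique digrams can be computed with a few lines of Python. The function names are illustrative, and returning 0.0 when neither term has any digram is an assumption made only to avoid dividing by zero.

```python
def unique_digrams(term):
    """Set of distinct adjacent letter pairs in `term`."""
    return {term[i:i + 2] for i in range(len(term) - 1)}


def dice(first, second):
    """Dice's coefficient S = 2C / (A + B) over unique digrams."""
    a = unique_digrams(first)
    b = unique_digrams(second)
    if not a and not b:
        return 0.0  # assumption: terms too short to form any digram
    shared = a & b
    return 2 * len(shared) / (len(a) + len(b))


print(sorted(unique_digrams("statistics")))   # 7 unique digrams
print(sorted(unique_digrams("statistical")))  # 8 unique digrams
print(dice("statistics", "statistical"))      # 2*6 / (7 + 8) → 0.8
```

For stemming, pairwise similarities like this one are typically fed into a clustering step so that terms scoring above some chosen threshold are conflated; the threshold is a tuning choice, not something the formula supplies.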

Page 8: Discussion Class 3

Question 6: Porter's algorithm

(a) What is an iterative, longest match stemmer?

(b) How is longest match achieved in the Porter algorithm?

Page 9: Discussion Class 3

Question 7: Porter's algorithm

Conditions   Suffix   Replacement   Examples

(m > 0)      eed      ee            feed -> feed
                                    agreed -> agree

(*v*)        ed       null          plastered -> plaster
                                    bled -> bled

(*v*)        ing      null          motoring -> motor
                                    sing -> sing

(a) Explain this table

(b) How does this table apply to: "exceeding", "ringed"?
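The three rules in the table can be traced with a small Python sketch. It covers only these rules, not the full Porter algorithm, and the helper names are illustrative. It assumes Porter's usual conventions: m counts the vowel-consonant sequences in the stem, *v* means the stem contains a vowel, y counts as a vowel when it follows a consonant, and only the longest matching suffix is tested, so a failed condition leaves the word unchanged.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant under Porter's definition."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y is a consonant at the start of a word or after a vowel.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Porter's m: the number of vowel-to-consonant transitions in the stem."""
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        vowel = not is_consonant(stem, i)
        if prev_vowel and not vowel:
            m += 1
        prev_vowel = vowel
    return m


def contains_vowel(stem):
    """The *v* condition: the stem contains at least one vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))


# (suffix, replacement, condition on the stem); "eed" is listed before "ed"
# so that the longest matching suffix is the one that gets tested.
RULES = [
    ("eed", "ee", lambda stem: measure(stem) > 0),
    ("ed", "", contains_vowel),
    ("ing", "", contains_vowel),
]


def apply_rules(word):
    """Apply the first rule whose suffix matches; at most one rule fires."""
    for suffix, replacement, condition in RULES:
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            return stem + replacement if condition(stem) else word
    return word


for w in ["feed", "agreed", "plastered", "bled", "motoring", "sing",
          "exceeding", "ringed"]:
    print(w, "->", apply_rules(w))
```

Under these assumptions "exceeding" matches the ing rule and becomes "exceed", and "ringed" matches the ed rule and becomes "ring"; in both cases the stem contains a vowel, and no second rule is applied in the same pass.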

Page 10: Discussion Class 3

Question 8: Evaluation

(a) What is the overall effectiveness of stemming?

(b) Give a possible reason why Stemmer A might be better than Stemmer B on Collection X but worse on Collection Y.