Discussion Class 3

Post on 30-Dec-2015

Transcript of Discussion Class 3

1

Discussion Class 3

Stemming Algorithms

2

Discussion Classes

Format:

Question

Ask a member of the class to answer

Provide opportunity for others to comment

When answering:

Give your name. Make sure that the TA hears it.

Stand up

Speak clearly so that all the class can hear

3

Question 1: Conflation methods

(a) Define the terms: stem, suffix, prefix, conflation, morpheme

(b) Define the terms in the following diagram:

Conflation methods
    Manual
    Automatic (stemmers)
        Affix removal
            Longest match
            Simple removal
        Successor variety
        Table lookup
        n-gram

4

Question 2: Table look-up

(a) What are the advantages and disadvantages of table look-up methods?

(b) When would you use table look-up?

5

Question 3: Successor variety methods

Hafer and Weiss defined their technique as:

Let α be a word of length n; α_i is the length-i prefix of α. Let D be the corpus of words. D_αi is defined as the subset of D containing the terms whose first i letters match α_i exactly. The successor variety of α_i, denoted by S_αi, is then defined as the number of distinct letters that occupy the (i+1)st position of words in D_αi. A test word of length n has n successor varieties S_α1, S_α2, ..., S_αn.

Explain this definition, using the word "computation" as an example.
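As a starting point for the discussion, the definition can be sketched in a few lines of Python. The corpus below is a hypothetical toy word list, not data from the class, and the function name `successor_varieties` is made up for illustration. The sketch assumes that the end of a word does not count as a successor letter; some presentations count it as an extra "blank" successor.

```python
def successor_varieties(word, corpus):
    """Return [S_1, ..., S_n] for `word`, measured against `corpus`."""
    varieties = []
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        # Distinct letters in position i+1 of the corpus words that start with the prefix.
        # Assumption: a word that ends exactly at the prefix contributes no successor.
        successors = {w[i] for w in corpus if w.startswith(prefix) and len(w) > i}
        varieties.append(len(successors))
    return varieties


# Hypothetical toy corpus, chosen only to make the example readable.
corpus = ["compute", "computer", "computing", "computation", "company", "come", "cat"]

print(successor_varieties("computation", corpus))
# [2, 1, 2, 2, 1, 3, 1, 1, 1, 1, 0]
```

With this corpus the largest value is S_6 = 3, for the prefix "comput", which is followed by "e", "i" and "a".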

6

Question 4: Successor variety methods

With successor variety methods, how do the following methods of segmentation work?

(a) cutoff method

(b) peak and plateau method

(c) complete word method
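One possible reading of the three segmentation methods is sketched below in Python. All function names are hypothetical, the successor varieties are illustrative values for "computation" on a toy corpus, and the peak test uses a strict "greater than both neighbours" rule, which is an assumption about how ties are handled.

```python
def cutoff_breaks(varieties, cutoff):
    """Break after prefix length i whenever S_i reaches the cutoff value."""
    return [i for i, s in enumerate(varieties, start=1) if s >= cutoff]


def peak_breaks(varieties):
    """Break after prefix length i when S_i exceeds both S_(i-1) and S_(i+1)."""
    return [i + 1 for i in range(1, len(varieties) - 1)
            if varieties[i - 1] < varieties[i] > varieties[i + 1]]


def complete_word_breaks(word, corpus):
    """Break after any proper prefix that is itself a word in the corpus."""
    return [i for i in range(1, len(word)) if word[:i] in corpus]


def segment(word, breaks):
    """Split `word` after each prefix length listed in `breaks`."""
    pieces, start = [], 0
    for b in sorted(breaks):
        if start < b < len(word):
            pieces.append(word[start:b])
            start = b
    pieces.append(word[start:])
    return pieces


# Illustrative successor varieties S_1..S_11 for "computation" (toy values).
varieties = [2, 1, 2, 2, 1, 3, 1, 1, 1, 1, 0]

print(segment("computation", cutoff_breaks(varieties, 3)))  # ['comput', 'ation']
print(segment("computation", peak_breaks(varieties)))       # ['comput', 'ation']
print(segment("readable", complete_word_breaks("readable", {"read", "readable"})))
# ['read', 'able']
```

The cutoff value of 3 is arbitrary here; choosing it too low produces spurious breaks, and choosing it too high misses real ones.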

7

Question 5: n-gram methods

(a) Explain the following notation:

statistics => st ta at ti is st ti ic cs
unique digrams => at cs ic is st ta ti

statistical => st ta at ti is st ti ic ca al
unique digrams => al at ca ic is st ta ti

(b) Calculate the similarity using Dice's coefficient:

S = 2C / (A + B)

where A is the number of unique digrams in the first term, B is the number of unique digrams in the second term, and C is the number of shared unique digrams.

(c) How would you use this approach for stemming?
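The calculation in part (b) can be checked with a short Python sketch. The function names are hypothetical; the formula is Dice's coefficient S = 2C / (A + B) computed over the sets of unique two-letter sequences in each term.

```python
def unique_digrams(term):
    """Set of distinct adjacent letter pairs in `term`."""
    return {term[i:i + 2] for i in range(len(term) - 1)}


def dice(term1, term2):
    """Dice's coefficient S = 2C / (A + B) over unique digrams."""
    a, b = unique_digrams(term1), unique_digrams(term2)
    if not a and not b:
        return 0.0  # assumption: terms too short to have digrams get similarity 0
    return 2 * len(a & b) / (len(a) + len(b))


print(sorted(unique_digrams("statistics")))
# ['at', 'cs', 'ic', 'is', 'st', 'ta', 'ti']
print(sorted(unique_digrams("statistical")))
# ['al', 'at', 'ca', 'ic', 'is', 'st', 'ta', 'ti']
print(dice("statistics", "statistical"))
# 0.8
```

Here A = 7, B = 8 and C = 6, so S = 12 / 15 = 0.8. For part (c), one option is to compute this similarity for every pair of terms and cluster terms whose similarity exceeds a chosen threshold, so that each cluster plays the role of a stem class.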

8

Question 6: Porter's algorithm

(a) What is an iterative, longest match stemmer?

(b) How is longest match achieved in the Porter algorithm?

9

Question 7: Porter's algorithm

Conditions   Suffix   Replacement   Examples
(m > 0)      eed      ee            feed -> feed, agreed -> agree
(*v*)        ed       null          plastered -> plaster, bled -> bled
(*v*)        ing      null          motoring -> motor, sing -> sing

(a) Explain this table

(b) How does this table apply to: "exceeding", "ringed"?
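The table can be tried out with the following Python sketch. It covers only these three rules and none of the follow-up cleanup that the full Porter algorithm performs after removing "ed" or "ing"; the helper names are hypothetical. It assumes the usual Porter conventions: m counts vowel-consonant sequences in the stem, *v* means the stem contains a vowel, and "y" counts as a vowel only when it follows a consonant.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant under Porter's definition."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y is a vowel only when preceded by a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """m in the form [C](VC)^m[V]: count vowel-to-consonant transitions."""
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m


def contains_vowel(stem):
    return any(not is_consonant(stem, i) for i in range(len(stem)))


# (suffix, replacement, condition on the stem that remains after removing the suffix)
RULES = [
    ("eed", "ee", lambda stem: measure(stem) > 0),
    ("ed", "", contains_vowel),
    ("ing", "", contains_vowel),
]


def apply_table(word):
    """Apply the single rule with the longest matching suffix, if its condition holds."""
    matches = [rule for rule in RULES if word.endswith(rule[0])]
    if not matches:
        return word
    suffix, replacement, condition = max(matches, key=lambda rule: len(rule[0]))
    stem = word[:-len(suffix)]
    # If the condition fails, no shorter suffix is tried: "feed" stays "feed".
    return stem + replacement if condition(stem) else word


for w in ["feed", "agreed", "plastered", "bled", "motoring", "sing", "exceeding", "ringed"]:
    print(w, "->", apply_table(w))
```

Note the design choice in `apply_table`: the rule is selected by longest suffix first and the condition is tested afterwards, which is why "feed" is not reduced to "fe" by the "ed" rule.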

10

Question 8: Evaluation

(a) What is the overall effectiveness of stemming?

(b) Give a possible reason why Stemmer A might be better than Stemmer B on Collection X but worse on Collection Y.