Improving Suffix Tree Clustering

• Base cluster ranking: s(B) = |B| * f(|P|)
  – |B| is the number of documents in base cluster B
  – |P| is the number of words in phrase P that have a non-zero score
  – Zero-score words: stopwords, and words occurring in too few (<3) or too many (>40%) of the documents

• Tf-Idf is better
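
A minimal sketch of the ranking formula s(B) = |B| * f(|P|). The slide does not define f; the shape used here (penalise single-word phrases, grow linearly up to six words, then stay constant) follows the description in the original STC paper, and the constant 0.5 is an assumption. The stopword list and the names `effective_length`, `f` and `base_cluster_score` are illustrative only.

```python
# Tiny illustrative stopword list (an assumption, not from the slides).
STOPWORDS = {"the", "of", "a", "and", "in", "to"}


def effective_length(phrase, doc_freq, n_docs, stopwords=STOPWORDS):
    """Return |P|: the number of words in the phrase with a non-zero score."""
    count = 0
    for word in phrase:
        df = doc_freq.get(word, 0)
        if word in stopwords:
            continue  # stopwords score zero
        if df < 3 or df > 0.4 * n_docs:
            continue  # too rare (<3 documents) or too common (>40%)
        count += 1
    return count


def f(length):
    """Assumed phrase-length weighting; the 0.5 penalty is a guess."""
    if length <= 0:
        return 0.0
    if length == 1:
        return 0.5
    return float(min(length, 6))


def base_cluster_score(docs, phrase, doc_freq, n_docs):
    """s(B) = |B| * f(|P|) for a base cluster with document set `docs`."""
    return len(docs) * f(effective_length(phrase, doc_freq, n_docs))
```

For example, a base cluster of four documents labelled with the phrase ("the", "suffix", "tree") keeps two scoring words once the stopword is dropped, so its score is 4 * f(2).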

Improving Suffix Tree Clustering

• Cluster similarity
  – Page overlap
  – Add: cluster label distance (word pair distance)
    • Google normalised distance
    • WikiMiner: wikilink similarity
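
As a sketch of the word pair distance, the function below computes the normalised Google distance from raw hit counts using the standard published formula. How the counts are obtained and the function name are assumptions; the slides do not say how the distance is wired into cluster similarity.

```python
import math


def normalised_google_distance(fx, fy, fxy, n):
    """Normalised Google distance between two terms.

    fx, fy -- number of pages containing each term
    fxy    -- number of pages containing both terms
    n      -- total number of indexed pages (must exceed fx and fy)
    """
    if fx <= 0 or fy <= 0 or fxy <= 0:
        # Terms that never co-occur are treated as infinitely far apart.
        return float("inf")
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))
```

Terms that always appear together get a distance of 0, and the value grows as co-occurrence becomes rarer.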

Improving Suffix Tree Clustering

• 3rd step: cluster merging
  – If more than half of the pages overlap, then merge
  – New: hierarchical agglomerative clustering (HAC)
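
The overlap rule can be sketched as follows. It is assumed here, following the original STC description, that the overlap must exceed half of both clusters and that merged clusters are the connected components of the resulting similarity graph. The function names are illustrative.

```python
def similar(a, b, threshold=0.5):
    """True if the shared pages exceed `threshold` of both clusters."""
    overlap = len(a & b)
    return overlap > threshold * len(a) and overlap > threshold * len(b)


def merge_base_clusters(base_clusters, threshold=0.5):
    """Merge base clusters (sets of page ids) by connected components."""
    parent = list(range(len(base_clusters)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(base_clusters)):
        for j in range(i + 1, len(base_clusters)):
            if similar(base_clusters[i], base_clusters[j], threshold):
                parent[find(i)] = find(j)

    merged = {}
    for i, pages in enumerate(base_clusters):
        merged.setdefault(find(i), set()).update(pages)
    return list(merged.values())
```

Because the components are built from pairwise links, a chain of overlapping clusters ends up in one merged cluster, which is the single-link behaviour that HAC is meant to replace.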

Query Directed Web Page Clustering

Daniel Crabtree, Peter Andreae, Xiaoying Gao

Victoria University of Wellington

Related Work: Web Page Clustering

• All Standard Algorithms
  – partitioning (k-means), hierarchical (agglomerative, divisive), …
• Web Features
  – structure, hyperlinks, colour
• Textual Features
  – STC: phrases, Lingo: latent semantic indexing
• Word Semantics
  – Global document analysis, co-occurrence statistics
• Query is never used

QDC – Query Directed Clustering

1: Find Base Clusters

2: Merge Clusters

3: Split Clusters

4: Select Clusters

5: Clean Clusters

QDC – 1: Find Base Clusters

• Clean Pages
• Identify Base Clusters
• Prune Small Clusters
• Semantic Prune #1
• Semantic Prune #2

[Figure: example base clusters with sizes: Mac (28), Car (40), Auto (25), Animal (18), OS (12), Atari (22), Game (5), Service (80), Forest (11). Score #1 and Score #2 are given in terms of cluster size and distance(cluster, query).]

QDC – 1: Query Distance

[Figure: query distance example for the query "Jaguar", with the labels Car, Home Page, Toyota, Specific, Broad, and Ambiguous (twice).]

QDC – 1: Find Base Clusters

• Removes Many Base Clusters
  – Normally Negative Effect on Performance

But …

• Query Directed Score
  – Reliable Guide to Cluster Quality
  – Removes just Low Quality Clusters
  – Improves Performance

QDC – 2: Merge Clusters

• Merging

[Figure: merging example. Base clusters: Mac (28), Car (40), Auto (25), Animal (18), OS (12), Atari (22). Merged clusters: Car, Auto (40) and Mac, OS (28).]

QDC – 2: Merge Clusters

• Single-link Clustering
• Similarity Function
  – Extension (by page overlap)
  – Intension (by description similarity)
    • Global document analysis: co-occurrence frequency relative to expected frequency if independent
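
The co-occurrence measure behind description similarity can be sketched as the ratio of observed to expected co-occurrence, where the expectation assumes the two words are independent. Representing each document as a set of words, and the function name, are assumptions made for illustration.

```python
def cooccurrence_ratio(docs, word_a, word_b):
    """Observed co-occurrence count divided by the count expected
    if the two words were distributed independently.

    docs -- list of sets of words, one set per document
    """
    n = len(docs)
    n_a = sum(1 for d in docs if word_a in d)
    n_b = sum(1 for d in docs if word_b in d)
    n_ab = sum(1 for d in docs if word_a in d and word_b in d)
    if n_a == 0 or n_b == 0:
        return 0.0
    expected = n_a * n_b / n  # n * P(a) * P(b)
    return n_ab / expected
```

A ratio above 1 means the words appear together more often than chance would predict, which suggests the two cluster descriptions are semantically related.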

QDC – 2: Merge Clusters

• Reducing Page Overlap Threshold
  – Normally Negative Effect on Performance

But …

• Description Similarity
  – More semantically related clusters merge
    • Increasing cluster coverage
  – Fewer semantically unrelated clusters merge
    • Increasing cluster quality

QDC – 3: Split Clusters

• Single Link Merging
  – Cluster Chaining (Drifting)
• Hierarchical Agglomerative
  – Distance Measure: Path Length
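
A sketch of how splitting by path length might undo chaining. The graph links base clusters that single-link merging joined directly; path length is the hop count between two base clusters. The complete-link rule and the `max_path` threshold are assumptions, since the slide only names the distance measure.

```python
from collections import deque


def path_lengths(graph, start):
    """Breadth-first hop counts from `start`; graph maps node -> neighbours."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def split_chain(graph, max_path=1):
    """Agglomerate nodes, joining two groups only while every pair of
    members stays within `max_path` hops (complete-link, assumed)."""
    dist = {n: path_lengths(graph, n) for n in graph}
    groups = [{n} for n in sorted(graph)]
    while True:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                d = max(dist[a].get(b, float("inf"))
                        for a in groups[i] for b in groups[j])
                if d <= max_path and (best is None or d < best[0]):
                    best = (d, i, j)
        if best is None:
            return groups
        _, i, j = best
        groups[i] |= groups[j]
        del groups[j]
```

On a chain A, B, C, D where only neighbours are linked, single-link merging would produce one cluster, whereas this sketch keeps A and D apart because they are three hops from each other.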

QDC – 4: Select Clusters

• ESTC cluster selection algorithm
  – Heuristic based hill-climbing search with look-ahead and advanced branch and bound pruning
• Original heuristic
  – Page Coverage and Cluster Overlap
• New heuristic
  – Page Coverage and Cluster Overlap
  – Pages Not Covered and Cluster Quality
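
A much simplified sketch of selection driven by page coverage and cluster overlap. It is plain greedy hill-climbing: the look-ahead and branch and bound pruning of ESTC are omitted, and the extra terms of the new heuristic (pages not covered, cluster quality) are not modelled. The scoring rule and the parameter names are assumptions.

```python
def select_clusters(clusters, max_clusters=3, overlap_penalty=1.0):
    """Greedily pick clusters that add coverage with little overlap.

    clusters -- dict mapping a cluster label to its set of page ids
    Returns (selected labels in order, set of covered pages).
    """
    remaining = dict(clusters)
    selected, covered = [], set()

    def gain(label):
        pages = remaining[label]
        # Newly covered pages count for, already covered pages count against.
        return len(pages - covered) - overlap_penalty * len(pages & covered)

    while remaining and len(selected) < max_clusters:
        best = max(sorted(remaining), key=gain)
        if gain(best) <= 0:
            break  # no remaining cluster improves the selection
        selected.append(best)
        covered |= remaining.pop(best)
    return selected, covered
```

With clusters car = {1, 2, 3, 4}, auto = {2, 3, 4} and animal = {5, 6}, the sketch picks car and animal and skips auto, which adds only overlap.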

QDC – 5: Clean Clusters

• Page-Cluster Relevance
  – Based on Base Cluster Membership
  – Cluster Size, Cluster Quality
• Remove Outliers and Erroneous Inclusions
• Sorting improves usability

Evaluation

• Algorithm Efficiency on 250 Documents
  – Ten Times Faster than STC
  – One Hundred Times Faster than K-means
• Algorithm Performance
  – External Evaluation against a rich gold standard
• Real World Usability
  – Informal Usability Comparison with four algorithms
    • K-means, ESTC, Lingo, Vivisimo

Evaluation: Algorithm Performance

• External Evaluation against a rich gold standard
• Four Algorithms
  – STC, ESTC, K-means, Random
• Four Data Sets
  – Salsa, Jaguar, GP, Victoria University
• Eleven Measurements
  – Average and Weighted: Quality, Coverage, Precision, Recall, and Entropy + Mutual Information
• Snippets and Full Page Text
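
The slides do not give the exact definitions of the eleven measurements. As an illustration only, the sketch below uses the textbook definitions of per-cluster precision and recall against a gold-standard topic, and the entropy of gold-standard labels inside a cluster. The function names and data layout are assumptions.

```python
import math
from collections import Counter


def precision_recall(cluster, topic):
    """Precision and recall of a cluster (set of pages) against one
    gold-standard topic (set of pages)."""
    hit = len(cluster & topic)
    precision = hit / len(cluster) if cluster else 0.0
    recall = hit / len(topic) if topic else 0.0
    return precision, recall


def cluster_entropy(cluster, topic_of):
    """Entropy in bits of the gold-standard labels within a cluster.

    topic_of -- dict mapping each page id to its gold-standard topic
    """
    counts = Counter(topic_of[page] for page in cluster)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A pure cluster has entropy 0, while a cluster split evenly between two topics has entropy 1 bit.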

Evaluation: Quality and Coverage

Evaluation: Improvement over Random

Evaluation: Precision and Recall

Evaluation: Entropy and Mutual Information

Evaluation: Real World Usability

• QDC finds broader topics
  – Maximizes probability of refinement
  – Simplifies user’s decision process
    • Fewer choices
    • Less chance of multiple relevant choices
• Fewer semantically meaningless clusters

Jaguar Results

Evaluation: Real World Usability

• Performance better than indicated by external evaluation
  – No penalty for overly specific clusters since gold standard included them
• External evaluation shows QDC clusters:
  – Contain fewer irrelevant pages
  – Cover more relevant pages

Conclusion

• QDC: New Web Page Clustering Algorithm
• Key innovations:
  – Query Directed Scoring
  – Merging using cluster descriptions
  – Solve cluster chaining by splitting
  – Improved cluster selection heuristic
• Vastly improved performance over other algorithms
  – External evaluation
  – Informal usability evaluation

Further Extension

• Use Phrases rather than just Words
  – STC, Lingo show large improvement possible
• Use Wiki Link similarity (WikiMiner) instead of GND
• Future work:
  – Improve cluster description similarity merging to consider entire description
  – Common shared phrases as key features, use VSM, build vectors for each cluster, new weighting
  – Formal usability evaluation
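
One way the VSM idea could look, as a sketch only: each cluster becomes a vector of phrase counts and two clusters are compared by cosine similarity. The raw-count weighting stands in for the unspecified "new weighting", and the function names are assumptions.

```python
import math
from collections import Counter


def cluster_vector(phrases):
    """Build a phrase-count vector for one cluster (raw counts assumed)."""
    return Counter(phrases)


def cosine(u, v):
    """Cosine similarity between two sparse phrase vectors."""
    dot = sum(u[term] * v[term] for term in u if term in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0
```

Clusters that share most of their phrases score close to 1, and clusters with no phrase in common score 0.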