Web Page Clustering Using a Fuzzy Logic Based Representation and Self-Organizing Maps

Alberto P. García-Plaza, Víctor Fresno, Raquel Martínez

NLP & IR Group, UNED

December 12, 2008


http://nlp.uned.es/~alpgarcia/pub_index.php


Table of Contents

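1 Objectives

2 Our Approach: Extended Fuzzy Combination of Criteria (EFCC)

   1 Fuzzy Logic
   2 EFCC
   3 Linguistic Variables
   4 Knowledge Base

3 Experiment Description

   1 Dimensionality Reduction
   2 Document Map
   3 Evaluation Methods

4 Results

5 Conclusion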


Objectives

Group HTML documents by content similarity.

Self-Organizing Maps (SOM) to organize, visualize and navigate through the collection.

Term weighting function taking advantage of HTML tags

Combining, by means of fuzzy logic, heuristic criteria based on the inherent semantics of some HTML tags and word positions in the document.

Hypothesis

An improvement in document representation will involve an increase in map quality.
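As a rough illustration of the tag- and position-based criteria named in the objectives, the sketch below collects, for each term, its count in the page title, its count inside emphasized text, its overall body frequency, and its word positions. The set of tags treated as emphasis, the class name `CriteriaParser` and the helper `extract_criteria` are assumptions made for this example; the slides do not list the tags or the exact criteria used by EFCC.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Assumption: which tags count as "emphasis" is chosen here for illustration only.
EMPHASIS_TAGS = {"b", "strong", "em", "i", "u", "h1", "h2", "h3"}


class CriteriaParser(HTMLParser):
    """Collects per-term raw criteria from one HTML document."""

    def __init__(self):
        super().__init__()
        self.open_tags = []           # stack of currently open tags
        self.frequency = Counter()    # term -> occurrences in the body
        self.in_title = Counter()     # term -> occurrences inside <title>
        self.in_emphasis = Counter()  # term -> occurrences inside emphasis tags
        self.positions = {}           # term -> list of word offsets in the body
        self.n_words = 0              # number of body words seen so far

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching opening tag; unmatched closers are ignored.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        if "script" in self.open_tags or "style" in self.open_tags:
            return
        emphasized = any(t in EMPHASIS_TAGS for t in self.open_tags)
        for word in re.findall(r"[a-z]+", data.lower()):
            if "title" in self.open_tags:
                self.in_title[word] += 1
                continue
            self.frequency[word] += 1
            self.positions.setdefault(word, []).append(self.n_words)
            self.n_words += 1
            if emphasized:
                self.in_emphasis[word] += 1


def extract_criteria(html):
    """Parse one HTML string and return the filled-in parser."""
    parser = CriteriaParser()
    parser.feed(html)
    parser.close()
    return parser
```

Counts like these would still need to be normalized (for example by the document length `n_words`) before being used as inputs to a fuzzy system.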


Fuzzy logic

Capturing human expert knowledge.

Close to natural language.

Knowledge base: defined by a set of IF-THEN rules.

Linguistic variables

Defined using natural language words and fuzzy sets. These sets allow the description of the membership degree of an object to a particular class.
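A minimal sketch of these ideas, assuming trapezoidal fuzzy sets over a normalized input in [0, 1] and the minimum operator for AND: a linguistic variable is a mapping from words ("low", "medium", "high") to fuzzy sets, and an IF-THEN rule fires with the strength of its weakest condition. The set boundaries, the variable name `FREQUENCY` and the function names are illustrative assumptions, not values from the presentation.

```python
def trapezoid(x, a, b, c, d):
    """Membership degree of x in a trapezoidal fuzzy set.

    The degree is 0 outside [a, d], 1 on [b, c], and linear in between.
    Setting a == b or c == d gives a left or right shoulder.
    """
    if x < a or x > d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x > c:
        return (d - x) / (d - c)
    return 1.0


# Assumption: a linguistic variable with three fuzzy sets on [0, 1].
FREQUENCY = {
    "low": (0.0, 0.0, 0.2, 0.5),
    "medium": (0.2, 0.5, 0.5, 0.8),
    "high": (0.5, 0.8, 1.0, 1.0),
}


def fuzzify(x, variable):
    """Return the membership degree of x in every fuzzy set of the variable."""
    return {label: trapezoid(x, *params) for label, params in variable.items()}


def rule_strength(*degrees):
    """Firing strength of a rule 'IF c1 AND c2 ... THEN ...', using min as AND."""
    return min(degrees)


# Example rule: IF frequency is high AND emphasis is high THEN importance is high.
freq = fuzzify(0.9, FREQUENCY)
emph = fuzzify(0.65, FREQUENCY)  # the same sets are reused for a second input
strength = rule_strength(freq["high"], emph["high"])
```

A value such as 0.35 belongs partly to "low" and partly to "medium", which is what distinguishes a fuzzy set from a crisp threshold.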


Extended Fuzzy Combination of Criteria


Linguistic Variables


Knowledge Base


Dimensionality Reduction

Input vector dimensions ranging from 100 to 5000.

Stopwords, punctuation marks, suffixes, and words occurring fewer than 50 times in the whole corpus were removed.

Two well-known methods:

Document frequency reduction.
Random projection method.

Three proposed rank-based methods:

Most Valued Terms.
Fixed reduction method.
More Frequent Terms until n level.
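The two well-known reductions listed above can be sketched as follows, assuming a dense document-term matrix `X` with one row per document. This is only one common reading of each method: the document frequency variant here simply keeps the terms that occur in the most documents, and the random projection uses a Gaussian matrix. The slides do not give the exact thresholds or the projection matrix that were used, and the three rank-based methods are not specified in enough detail to reproduce here.

```python
import numpy as np


def document_frequency_reduction(X, n_terms):
    """Keep the n_terms columns (terms) that occur in the most documents.

    Returns the reduced matrix and the indices of the kept columns.
    """
    df = np.count_nonzero(X, axis=0)               # documents containing each term
    top = np.argsort(-df, kind="stable")[:n_terms]  # highest document frequency first
    keep = np.sort(top)                             # preserve the original column order
    return X[:, keep], keep


def random_projection(X, n_dims, seed=0):
    """Project the rows of X onto n_dims random directions.

    Assumption: Gaussian entries scaled by 1/sqrt(n_dims); other random
    matrices (for example sparse ones) are also used in practice.
    """
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((X.shape[1], n_dims)) / np.sqrt(n_dims)
    return X @ R
```

Document frequency reduction keeps interpretable term dimensions, while random projection mixes all terms into each output dimension.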


Document Map Construction

Benchmark dataset for clustering: Banksearch [1]

10000 documents
10 classes

SOM size was set equal to the number of classes of input documents, i.e. 5x2, in order to compare clustering results.

[1] M. P. Sinka and D. W. Corne. A large benchmark dataset for web document clustering. Soft Computing Systems: Design, Management, and Applications, 2002.
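For readers unfamiliar with SOMs, here is a minimal sketch of training a 5x2 map, with one unit per class as described above, and of mapping documents onto it. It uses the basic online SOM update with a Gaussian neighbourhood and linearly decaying learning rate and radius. The hyperparameters (`epochs`, `lr0`, `sigma0`) and the function names are assumptions for illustration; the presentation does not state the training settings or the SOM implementation used.

```python
import numpy as np


def train_som(X, rows=5, cols=2, epochs=20, lr0=0.5, sigma0=1.5, seed=0):
    """Train a rows x cols SOM on the rows of X and return the unit vectors."""
    rng = np.random.default_rng(seed)
    n_units = rows * cols
    # Grid coordinates of each unit, shape (n_units, 2).
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Initialise the unit vectors from randomly chosen documents.
    picks = rng.choice(len(X), size=n_units, replace=len(X) < n_units)
    weights = X[picks].astype(float)
    n_steps = epochs * len(X)
    step = 0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)                  # decaying learning rate
            sigma = sigma0 * (1.0 - frac) + 1e-3     # decaying neighbourhood radius
            # Best matching unit: the unit vector closest to the document.
            bmu = np.argmin(np.linalg.norm(weights - X[i], axis=1))
            # Gaussian neighbourhood measured on the map grid, not in input space.
            dist2 = np.sum((grid - grid[bmu]) ** 2, axis=1)
            h = np.exp(-dist2 / (2.0 * sigma ** 2))
            weights += lr * h[:, None] * (X[i] - weights)
            step += 1
    return weights


def map_documents(X, weights):
    """Return, for each document, the index of its closest unit."""
    d2 = (
        (X ** 2).sum(axis=1)[:, None]
        - 2 * X @ weights.T
        + (weights ** 2).sum(axis=1)[None, :]
    )
    return np.argmin(d2, axis=1)
```

Each unit vector has the same dimension as the input vectors, which is why a smaller vocabulary reduces the cost of both training and mapping.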


Evaluation Methods

Weighted average of the F-measure for each class.

After mapping the collection onto the trained map, the class with the greatest number of documents mapped onto a neuron is selected to label that unit.

All document vectors in a neuron whose class differs from the neuron label are counted as errors.
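The labelling and scoring procedure above could be sketched as follows. Each unit is labelled with its majority class, every document is treated as predicted to belong to its unit's label, and per-class F-measures are averaged with weights proportional to class size. The slide does not give the exact F-measure formula, so the per-class precision and recall definitions used here are an assumption, as are the function names.

```python
from collections import Counter, defaultdict


def label_units(assignments, classes):
    """Label each unit with the majority class of the documents mapped onto it."""
    per_unit = defaultdict(Counter)
    for unit, cls in zip(assignments, classes):
        per_unit[unit][cls] += 1
    return {unit: counts.most_common(1)[0][0] for unit, counts in per_unit.items()}


def count_errors(assignments, classes):
    """Number of documents whose class differs from the label of their unit."""
    labels = label_units(assignments, classes)
    return sum(labels[u] != c for u, c in zip(assignments, classes))


def weighted_f_measure(assignments, classes):
    """Average of per-class F-measures, weighted by class size."""
    labels = label_units(assignments, classes)
    predicted = [labels[u] for u in assignments]
    n = len(classes)
    total = 0.0
    for cls, size in Counter(classes).items():
        tp = sum(p == cls and t == cls for p, t in zip(predicted, classes))
        n_pred = sum(p == cls for p in predicted)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / size
        if precision + recall:
            f = 2 * precision * recall / (precision + recall)
        else:
            f = 0.0
        total += (size / n) * f
    return total
```

A class that never wins the majority vote on any unit gets an F-measure of 0, which is why a map with as many units as classes makes the comparison strict.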


Best reduction for each term weighting function


MFTn reduction provides stability


EFCC+MFTn obtains its best results with the smallest number of features


Conclusion

Unsupervised document representation method, based on fuzzy logic, focused on clustering HTML documents by means of self-organizing maps.

MFTn reduction is the most stable reduction in all cases.

EFCC representation yields better results using a smaller vocabulary.

A smaller number of features is needed to represent the input documents and SOM unit vectors, which implies an improvement in computational cost.


Thank You!


Related Work

Work                                                    VSM  Topic Information  Document Type  Weighting Function  Modifies SOM
------------------------------------------------------  ---  -----------------  -------------  ------------------  ------------
Self organization of a Massive Document Collection [2]  Yes  Yes                Text           Shannon's Entropy   No
Document Clustering using Phrases [3]                   Yes  No                 Text           Binary, TF, TF-IDF  No
Document Clustering using WordNet [4]                   Yes  Yes                Text           ESVM, HSVM, HyM     No
Conceptional SOM [5]                                    Yes  No                 Text           TF                  Yes

[2] T. Kohonen, S. Kaski, K. Lagus, J. Salojärvi, J. Honkela, V. Paatero, and A. Saarela. Self organization of a massive document collection. IEEE Trans. on Neural Networks, 2000.

[3] J. Bakus, M. Hussin, and M. Kamel. A SOM-based document clustering using phrases. In ICONIP, 2002.

[4] C. Hung and S. Wermter. Neural network based document clustering using WordNet ontologies. Int. J. Hybrid Intell. Syst., 2004.

[5] Y. Liu, X. Wang, and C. Wu. ConSOM: A conceptional SOM model for text clustering. Neurocomputing, 2008.
