CLEI 2007 1 Measuring Contribution of HTML Features in Web Document Clustering Oldemar Rodríguez School of Mathematics, UCR and Predisoft Esteban Meneses Computing Research Center, ITCR

CLEI 2007 1

Measuring Contribution of HTML Features in Web Document Clustering

Oldemar Rodríguez

School of Mathematics, UCR

and Predisoft

Esteban Meneses

Computing Research Center, ITCR

Motivation

Which HTML feature contributes the most to good clustering results?

Using symbolic objects to cluster web documents.

15th World Wide Web Conference (2006)

HTML Document Clustering

Find meaningful groups from a web document collection.

Effectively represent web document clusters for further analysis.

HTML Document

Classical Representations

• Different approaches for representing a web document.

<5,22,19,4,...,38>

Vectorial Representation

• Every document is represented by a vector in n-dimensional space.

• Bag of words scheme. Each variable represents the relative weight of a term in the document.
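As a rough illustration of the bag-of-words scheme, the sketch below maps a text to a vector of relative term weights. The tokenizer (lowercase, split on whitespace), the tiny vocabulary and the function name `bag_of_words` are assumptions made for this example; the slides do not say how the authors tokenize or weight terms.

```python
from collections import Counter


def bag_of_words(text, vocabulary):
    """Return the relative weight of each vocabulary term in `text`.

    Assumption: weight = term count / total count of vocabulary terms.
    """
    tokens = text.lower().split()          # naive whitespace tokenizer
    counts = Counter(tokens)
    total = sum(counts[term] for term in vocabulary)
    if total == 0:
        return [0.0] * len(vocabulary)     # no vocabulary term occurs
    return [counts[term] / total for term in vocabulary]


# Hypothetical vocabulary and document, for illustration only.
vocabulary = ["web", "document", "clustering", "html"]
vector = bag_of_words("Web document clustering groups each web document", vocabulary)
print(vector)
```

Each position of the resulting vector corresponds to one vocabulary term, so word order and phrases are lost in this representation.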

Symbolic Objects

• Real-life objects are too complex to be represented by points in a vectorial space. [Bock&Diday, 2000]

• Symbolic objects overcome this limitation by representing concepts rather than individuals.

• In a symbolic data array each variable can have one of many data types: sets, intervals, histograms, trees, graphs, functions, fuzzy data, etc.

Symbolic Data Table

Multivariate Numeric Analysis

Individual  Age  Profession  Wage      Location
3457        36   Lawyer      2,500.00  San José
1251        28   Teacher     1,750.00  Alajuela
3245        39   Doctor      2,400.00  San José
7635        33   Teacher     1,900.00  Alajuela
3245        35   Engineer    1,850.00  Alajuela
5367        27   Engineer    1,900.00  Heredia
6486        34   Manager     1,600.00  Heredia

Multivariate Symbolic Analysis

Location  Age       Profession                   Wage
San José  [36, 39]  {Lawyer 50%, Doctor 50%}     [2,400.00 - 2,500.00]
Alajuela  [28, 35]  {Teacher 66%, Engineer 33%}  [1,750.00 - 1,900.00]
Heredia   [27, 34]  {Engineer 50%, Manager 50%}  [1,600.00 - 1,900.00]

From relational databases to symbolic databases: millions of data records become hundreds of concepts.
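The aggregation from individuals to concepts can be sketched as follows. The rows are copied from the slide; the function name `to_symbolic` and the choice of Python dictionaries and tuples to hold intervals and frequency sets are assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter, defaultdict

# Rows from the slide: (individual, age, profession, wage, location).
rows = [
    (3457, 36, "Lawyer", 2500.00, "San José"),
    (1251, 28, "Teacher", 1750.00, "Alajuela"),
    (3245, 39, "Doctor", 2400.00, "San José"),
    (7635, 33, "Teacher", 1900.00, "Alajuela"),
    (3245, 35, "Engineer", 1850.00, "Alajuela"),
    (5367, 27, "Engineer", 1900.00, "Heredia"),
    (6486, 34, "Manager", 1600.00, "Heredia"),
]


def to_symbolic(rows):
    """Group individuals by location and describe each group symbolically."""
    groups = defaultdict(list)
    for _, age, profession, wage, location in rows:
        groups[location].append((age, profession, wage))

    table = {}
    for location, members in groups.items():
        ages = [m[0] for m in members]
        wages = [m[2] for m in members]
        professions = Counter(m[1] for m in members)
        table[location] = {
            "age": (min(ages), max(ages)),                 # interval variable
            "profession": {p: c / len(members)             # weighted set
                           for p, c in professions.items()},
            "wage": (min(wages), max(wages)),              # interval variable
        }
    return table


symbolic = to_symbolic(rows)
for location, description in symbolic.items():
    print(location, description)
```

Each location becomes one symbolic object whose variables are intervals or weighted sets rather than single values.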

Symbolic Data Base

                      Knowledge  Size
Relational Data Base  100%       15 Gigabyte
Symbolic Data Base    90%        10.3 Megabyte

Symbolic Representations

• A complex representation that takes into account: term frequency, word order and phrases.

The K-Means Clustering Method

[Figure: four scatter-plot panels, each with both axes running from 0 to 10.]
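The slide only names the method, so the following is a minimal sketch of the standard k-means iteration (assign each point to its nearest center, then move each center to the mean of its points). Squared Euclidean distance, random initialization from the data, and the function names are assumptions; the slides do not say which variant the authors use.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean (center of gravity) of a list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, iterations=100, seed=0):
    """Return (labels, centers) after at most `iterations` passes."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)        # assumption: start from k data points
    labels = None
    for _ in range(iterations):
        # Assignment step: nearest center for every point.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:           # no change, so the partition is stable
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:                    # keep the old center if a cluster is empty
                centers[j] = mean(members)
    return labels, centers


# Hypothetical toy data with two well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels, centers = k_means(points, k=2)
print(labels, centers)
```

The update step relies on a mean being defined for the data, which is straightforward for numeric vectors but not for symbolic variables.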

But there are some problems...

Distance Measures

Theorem: Fisher's Equality

• Total inertia = Between-class inertia + Within-class inertia

[Figure: scatter plot with both axes running from 0 to 10.]
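Fisher's equality can be checked numerically on a small example. The sketch below assumes points with equal weights 1/n and squared Euclidean distance, which is the usual setting for this decomposition; the toy clusters and function names are hypothetical.

```python
def centroid(points):
    """Center of gravity: coordinate-wise mean of the points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def inertia_decomposition(clusters):
    """Return (total, between, within) inertia for a partition of points."""
    all_points = [p for cluster in clusters for p in cluster]
    n = len(all_points)
    g = centroid(all_points)                       # global center of gravity
    centers = [centroid(cluster) for cluster in clusters]

    total = sum(squared_distance(p, g) for p in all_points) / n
    within = sum(
        squared_distance(p, center)
        for cluster, center in zip(clusters, centers)
        for p in cluster
    ) / n
    between = sum(
        len(cluster) * squared_distance(center, g)
        for cluster, center in zip(clusters, centers)
    ) / n
    return total, between, within


# Hypothetical partition into two classes.
clusters = [[(1, 1), (1, 2), (2, 1)], [(8, 8), (8, 9), (9, 8)]]
total, between, within = inertia_decomposition(clusters)
print(total, between + within)
```

Because total inertia is fixed for a given data set, minimizing within-class inertia is equivalent to maximizing between-class inertia.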

Problems in the symbolic case:

1. Represent a class by its center of gravity, that is, by its vector of means.

2. What is the center of gravity?

What is the center of gravity?

Evaluation Criteria

1. Rand Index

2. Mutual Information

3. F-Measure

4. Entropy
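Two of the criteria listed above can be sketched with their common textbook definitions: the Rand index (fraction of document pairs on which the clustering and the reference classes agree) and the size-weighted entropy of the class distribution inside each cluster. The exact variants used in the experiments are not given in the slides, so the formulas and the toy labels below are assumptions.

```python
from collections import Counter
from itertools import combinations
from math import log2


def rand_index(truth, predicted):
    """Fraction of pairs treated the same way by both partitions."""
    pairs = list(combinations(range(len(truth)), 2))
    agreements = sum(
        (truth[i] == truth[j]) == (predicted[i] == predicted[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


def clustering_entropy(truth, predicted):
    """Size-weighted average entropy of true classes within each cluster."""
    n = len(truth)
    total = 0.0
    for cluster in set(predicted):
        classes = [t for t, p in zip(truth, predicted) if p == cluster]
        counts = Counter(classes)
        h = -sum((c / len(classes)) * log2(c / len(classes))
                 for c in counts.values())
        total += len(classes) / n * h
    return total


# Hypothetical reference classes and cluster assignments.
truth = ["a", "a", "a", "b", "b", "b"]
predicted = [0, 0, 1, 1, 1, 1]
print(rand_index(truth, predicted), clustering_entropy(truth, predicted))
```

A higher Rand index indicates better agreement with the reference classes, while a lower entropy indicates purer clusters.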

Experiments

WebKB

Feature  Contribution
Text     0.2894
Title    0.2584
Bold     0.0379
Anchor   0.1689
Header   0.1009
Graph    0.1229
Tree     0.0212

20Newsgroup

Feature  Contribution
Text     0.7035
Graph    0.2515
Tree     0.0449

Conclusions

• Symbolic representations are richer and more flexible than classical representations.

• The text in the HTML document seems to be the most important factor for clustering HTML documents.

Thank you!