A text mining approach on automatic generation of web directories and hierarchies

Intelligent Database Systems Lab, National Yunlin University of Science and Technology (國立雲林科技大學). A text mining approach on automatic generation of web directories and hierarchies. Advisor: Dr. Hsu. Reporter: Chun Kai Chen. Authors: Hsin-Chang Yang and Chung-Hong Lee. 2004. Expert Systems with Applications 645-663.

Transcript of A text mining approach on automatic generation of web directories and hierarchies

Page 1: A text mining approach on automatic generation of web directories and hierarchies

Intelligent Database Systems Lab

National Yunlin University of Science and Technology (國立雲林科技大學)

A text mining approach on automatic generation of web directories and hierarchies

Advisor : Dr. Hsu

Reporter : Chun Kai Chen

Author : Hsin-Chang Yang and Chung-Hong Lee

2004. Expert Systems with Applications 645-663

Page 2: A text mining approach on automatic generation of web directories and hierarchies

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.

Outline

Motivation
Objective
The text mining process
Automatic generation of web directories
Experimental Results
Summary

Page 3: A text mining approach on automatic generation of web directories and hierarchies

Motivation

The classification of web pages into proper directories and the organization of directory hierarchies are generally performed by human experts.

Page 4: A text mining approach on automatic generation of web directories and hierarchies

Objective

In this work, we provide a corpus-based method that applies text mining techniques to a corpus of web pages to automatically create web directories and organize them into hierarchies.

Page 5: A text mining approach on automatic generation of web directories and hierarchies

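[Figure: overview of the approach, in two stages.]

The text mining process: web pages → extract article text → SOM (DCM) → SOM (WCM)

Automatic generation of web directories: generation of directory hierarchies (two-level hierarchy; diagram labels S1, S2, S3, Si, Si+1) and generation of directories, producing the web directories

Example input table from the figure:

dID    Full text                                                  大陸 (mainland)   中國 (China)   海基會 (Straits Exchange Foundation)
Y001   Mainland experts say cross-strait negotiations…            1                0              1
Y002   Mainland sets up an economic reform research society..     1                0              0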

Page 6: A text mining approach on automatic generation of web directories and hierarchies

Automatic generation of web directories

Generation of directory hierarchies
─ the super-cluster generation process algorithm

Generation of directories
─ identify cluster themes by examining the WCM
─ selects the word that is the most important to a super-cluster

[Figure labels: DCM, WCM, stop criteria]

Page 7: A text mining approach on automatic generation of web directories and hierarchies

Experimental Results

The experiments show that our method can produce comprehensible and reasonable web directories and hierarchies.

Page 8: A text mining approach on automatic generation of web directories and hierarchies

Introduction (1/3)

Finding information is a serious problem on the web, since most users find it hard to obtain the information they need using current information retrieval strategies.

Two kinds of strategies are now adopted by the web communities, namely searching and browsing.

Page 9: A text mining approach on automatic generation of web directories and hierarchies

Introduction (2/3)

Since the link structures may be considered static during browsing
─ the selection of starting pages plays the most important role when a user tries to find his goal in minimum time

Therefore, many commercial or academic web sites actively collect web pages and sort them into web directories
─ to provide users the starting points in the browsing process

Page 10: A text mining approach on automatic generation of web directories and hierarchies

Introduction (3/3)

Most existing web directories were created manually by human specialists.
─ Yahoo!

The limitation of manual construction is mainly caused by the gigantic number of web pages already produced and still being produced.

Page 11: A text mining approach on automatic generation of web directories and hierarchies

Related work

category hierarchy
─ predefined category hierarchy (Yahoo!)
─ automatically developing category hierarchy

topic identification
─ mutually related text excerpts

Self-organizing map algorithm

Page 12: A text mining approach on automatic generation of web directories and hierarchies

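The text mining process (1/2)

The method is based on the self-organizing map learning algorithm and requires no human intervention during the construction of web directories and hierarchies.

[Figure: the text mining process: web pages → extract article text → SOM (DCM) → SOM (WCM)]

Example input table from the figure:

dID    Full text                                                  大陸 (mainland)   中國 (China)   海基會 (Straits Exchange Foundation)
Y001   Mainland experts say cross-strait negotiations…            1                0              1
Y002   Mainland sets up an economic reform research society..     1                0              0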
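The table on this slide shows each document as a row of 0/1 flags, one per keyword. A minimal sketch of how such rows could be built is given below. It assumes that keywords have already been extracted from each page; the document ids, the keywords and the helper names `build_vocabulary` and `encode` are all invented for illustration and do not come from the paper.

```python
# Toy corpus: document id -> keywords already extracted from the page.
# Ids and keywords are made up for illustration only.
docs = {
    "D1": ["mainland", "negotiation", "sef"],
    "D2": ["mainland", "economy"],
}


def build_vocabulary(docs):
    """Collect every distinct keyword and give it a fixed column index."""
    words = sorted({w for keywords in docs.values() for w in keywords})
    return {w: i for i, w in enumerate(words)}


def encode(docs, vocab):
    """Turn each document into a 0/1 vector: 1 if the keyword occurs in it."""
    vectors = {}
    for doc_id, keywords in docs.items():
        vec = [0] * len(vocab)
        for w in keywords:
            vec[vocab[w]] = 1
        vectors[doc_id] = vec
    return vectors


vocab = build_vocabulary(docs)
vectors = encode(docs, vocab)
```

Vectors of this kind would then serve as the training input to the self-organizing map.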

Page 13: A text mining approach on automatic generation of web directories and hierarchies

The text mining process (2/2)

Labeling process
─ Each document is associated with a neuron in the map. We record these associations to form the document cluster map (DCM).
─ In the DCM, each neuron is labeled with the list of documents that are considered similar and belong to the same cluster.
─ In the same manner, we label each word to some neuron in the map to form the word cluster map (WCM).
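The labeling step can be sketched as follows, under two assumptions that the slide does not spell out: a document is labeled to the neuron whose weight vector is closest to the document vector, and a word is labeled to the neuron whose weight vector has the largest component for that word. The weights, vocabulary and function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def label_documents(doc_vectors, weights):
    """DCM: neuron index -> ids of the documents whose closest neuron it is."""
    dcm = {n: [] for n in range(len(weights))}
    for doc_id, vec in doc_vectors.items():
        winner = min(range(len(weights)), key=lambda n: sq_dist(vec, weights[n]))
        dcm[winner].append(doc_id)
    return dcm


def label_words(vocab, weights):
    """WCM: neuron index -> words whose weight component peaks at that neuron."""
    wcm = {n: [] for n in range(len(weights))}
    for word, col in vocab.items():
        winner = max(range(len(weights)), key=lambda n: weights[n][col])
        wcm[winner].append(word)
    return wcm


# Hypothetical trained weights for a map with two neurons and three words.
vocab = {"china": 0, "mainland": 1, "sef": 2}
weights = [[0.1, 0.9, 0.8], [0.7, 0.6, 0.1]]
doc_vectors = {"D1": [0, 1, 1], "D2": [1, 1, 0]}

dcm = label_documents(doc_vectors, weights)
wcm = label_words(vocab, weights)
```

Because both maps are indexed by the same neurons, the words in the WCM entry of a neuron can be read as candidate themes for the documents in its DCM entry.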

Page 14: A text mining approach on automatic generation of web directories and hierarchies

Generation of directory hierarchies (1/3)

The two-level hierarchy generation process
─ the parent node is the constructed super-cluster
─ the child nodes are the clusters that compose the super-cluster
─ can be further applied to every super-cluster to establish the next level of this hierarchy

The overall hierarchy
─ iteratively using such top-down approach
─ until a stop criterion is satisfied

Page 15: A text mining approach on automatic generation of web directories and hierarchies

Generation of directory hierarchies (2/3)

To form a super-cluster
─ the distance between two clusters (distance between their 2-D map coordinates)
─ the dissimilarity between two clusters (similarity of the neuron vectors)
─ the supporting cluster similarity: we can determine the significance of a cluster by examining the overall similarity that is contributed by its neighboring clusters
   doc(i): the number of documents associated with neuron i
   Bi: the indices of the neurons neighboring neuron i
   F: a monotonically increasing function

The dominating clusters
─ have locally maximal supporting cluster similarity
─ each is the centroid of a super-cluster, which contains several child clusters
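The formula for the supporting cluster similarity did not survive in this transcript, so the sketch below is only one plausible form built from the symbols the slide lists. It assumes the score of neuron i is the average, over its neighbors j in Bi, of doc(i)·doc(j) divided by F applied to the product of grid distance and weight-vector dissimilarity. The choice F(x) = 1 + x, the toy map and all function names are assumptions, not the authors' definitions.

```python
import math


def supporting_similarity(i, coords, weights, doc_count, neighbors, F):
    """Average support that neuron i receives from its neighbors B_i (assumed form)."""
    total = 0.0
    for j in neighbors[i]:
        grid_distance = math.dist(coords[i], coords[j])    # 2-D map coordinates
        dissimilarity = math.dist(weights[i], weights[j])  # neuron weight vectors
        total += doc_count[i] * doc_count[j] / F(grid_distance * dissimilarity)
    return total / len(neighbors[i])


def dominating_clusters(scores, neighbors):
    """Neurons whose score is strictly larger than that of every neighbor."""
    return [i for i in scores if all(scores[i] > scores[j] for j in neighbors[i])]


# Toy 1x3 map: three neurons in a row, each with a 2-D weight vector.
coords = {0: (0, 0), 1: (0, 1), 2: (0, 2)}
weights = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0]}
doc_count = {0: 2, 1: 4, 2: 1}
neighbors = {0: [1], 1: [0, 2], 2: [1]}


def F(x):
    # Hypothetical monotonically increasing function; kept positive at x = 0.
    return 1.0 + x


scores = {i: supporting_similarity(i, coords, weights, doc_count, neighbors, F)
          for i in coords}
dominating = dominating_clusters(scores, neighbors)
```

Under this form, a neuron scores highly when it and its neighbors hold many documents and the neighbors are both close on the map and similar in weight space.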

Page 16: A text mining approach on automatic generation of web directories and hierarchies

Generation of directory hierarchies (3/3)

In Step 3 of the super-cluster generation process algorithm we set three stop criteria.
─ The first criterion stops finding super-clusters if there is no neuron left for selection.
─ The second criterion limits the number of dominating clusters, to constrain the breadth of hierarchies.
─ The third criterion constrains the depth of a hierarchy.
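The three stop criteria can be illustrated with a simplified recursive sketch. The algorithm itself is not reproduced on the slides, so this is a greedy stand-in rather than the authors' procedure: it picks the highest-scoring neurons as centroids instead of testing for local maxima, and assigns every other neuron to the nearest centroid on the map. The parameters `max_breadth` and `max_depth` and the toy scores are hypothetical.

```python
import math


def build_hierarchy(neurons, coords, score, max_breadth=2, max_depth=2, depth=0):
    """Return a list of super-cluster nodes, each with its clusters and children."""
    # Criterion 3: stop at the depth limit; a single cluster cannot be split.
    if depth >= max_depth or len(neurons) <= 1:
        return []
    remaining = sorted(neurons, key=lambda n: score[n], reverse=True)
    centroids = []
    # Criterion 1: no neuron left to select. Criterion 2: breadth limit reached.
    while remaining and len(centroids) < max_breadth:
        centroids.append(remaining.pop(0))
    # Every other neuron joins the nearest centroid on the map grid.
    groups = {c: [c] for c in centroids}
    for n in remaining:
        nearest = min(centroids, key=lambda c: math.dist(coords[n], coords[c]))
        groups[nearest].append(n)
    return [
        {
            "centroid": c,
            "clusters": groups[c],
            "children": build_hierarchy(
                groups[c], coords, score, max_breadth, max_depth, depth + 1
            ),
        }
        for c in centroids
    ]


# Toy 1x5 map with made-up scores for each neuron.
coords = {n: (0, n) for n in range(5)}
score = {0: 5, 1: 1, 2: 2, 3: 6, 4: 3}
tree = build_hierarchy(list(range(5)), coords, score)
```

Each node's `clusters` list plays the role of the child clusters that compose a super-cluster, and applying the same function to that list yields the next level of the hierarchy.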

Page 17: A text mining approach on automatic generation of web directories and hierarchies

[Figure: diagram with regions labeled S1, S2 and S3]

Page 18: A text mining approach on automatic generation of web directories and hierarchies

Generation of directories

In this work, we try to identify cluster themes, i.e. directory labels, by examining the WCM.
─ selects the word that is the most important to a super-cluster
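One way to read "the word that is the most important to a super-cluster" is sketched below. The slide does not define importance, so the sketch assumes it is the sum of a word's weight components over the neurons in the super-cluster, with candidates limited to the words the WCM assigns to those neurons. The function name `pick_label` and the toy data are hypothetical.

```python
def pick_label(cluster_neurons, wcm, weights, vocab):
    """Return the candidate word with the largest summed weight in the super-cluster."""
    candidates = [w for n in cluster_neurons for w in wcm.get(n, [])]
    if not candidates:
        return None  # no word was labeled to any neuron of this super-cluster
    return max(
        candidates,
        key=lambda w: sum(weights[n][vocab[w]] for n in cluster_neurons),
    )


# Hypothetical word cluster map and neuron weights for a two-neuron super-cluster.
vocab = {"china": 0, "mainland": 1, "sef": 2}
weights = {0: [0.1, 0.9, 0.8], 1: [0.7, 0.6, 0.1]}
wcm = {0: ["mainland", "sef"], 1: ["china"]}

label = pick_label([0, 1], wcm, weights, vocab)
```

The chosen word would then be used as the directory label for that super-cluster.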

Page 19: A text mining approach on automatic generation of web directories and hierarchies

Summary

In this paper, we present a method to automatically generate web directory hierarchies and identify directory labels.

Experiments show that our method could
─ successfully cluster the documents into directories,
─ reveal the hierarchical structure among these directories,
─ and assign a label to each directory.

However, a fully automatic process may not provide the best solutions for these tasks, which involve human beings so closely.

Thus, in our opinion, a semi-automatic process that uses the proposed method as a preprocessing stage should be a plausible way to meet the general requirements.

Page 20: A text mining approach on automatic generation of web directories and hierarchies

Personal Opinion

Application
─ such as text categorization, thesaurus construction, ontology learning, multilingual information retrieval

Advantage
─ fully automatic process, which can create web directory hierarchies without the intervention of human beings

Disadvantage
─ may not provide the best solutions