Efficient Text and Semi-structured Data Mining: Towards Knowledge Discovery in the Cyberspace
Hiroki Arimura, Department of Informatics, Kyushu University, Japan
Joint work with Tatsuya Asai, Shinji Kawasoe, Kenji Abe, Junichiro Abe, Ryoichi Fujino, Hiroshi Sakamoto, Setsuo Arikawa (Kyushu Univ), Shinichi Shimozono (Kyutech)
Supported by Grant-in-Aid for Scientific Research on Priority Areas "Discovery Science" and "Informatics"; Japan Science & Technology Corp., PRESTO
Outline
Efficient Text Data Mining
– Fast and robust text mining algorithms (ALT'98, ISAAC'98, DS'98)
– Efficient text index for data mining (CPM'01, CPM'02)
– Text mining on external storage (PAKDD'00)
Applications
– Interactive document browsing
– Keyword discovery from the Web
Towards Semi-structured Data Mining
– Efficient frequent tree miner (SDM'02, PKDD'02)
– Mining semi-structured data streams (ICDM'02)
– Information extraction from the Web (GI'00, FLAIRS'01)
Conclusion
Efficient Text Data Mining with Optimized Pattern Discovery
Joint work with Junichiro Abe, Ryoichi Fujino, Hiroshi Sakamoto, Setsuo Arikawa (Kyushu Univ), Shinichi Shimozono (Kyutech)
Large Text Databases
have emerged in the early 90's with the rapid progress of computer and network technologies.
– Web pages (OPENTEXT Index, GBs to TBs)
– Collections of XML / SGML documents
– Genome databases (GenBank, PIR)
– Bibliographic databases (INSPEC, ResearchIndex)
– Emails and plain texts on file systems
Huge, heterogeneous, unstructured data: traditional database / data mining technology does not work!
Our Research Goal
Text Mining as an inverse of IR:
Develop a new access tool for text data that interactively supports human discovery from large text collections.
Key: fast and robust text mining methods.
[Figure: the user interacts with the text data mining system over text databases: Web pages, XML/SGML archives, genome databases, e-mails and text files]
Browsing a large collection of documents with unknown vocabulary and structure
Reuters-21578: 21578 articles from Reuters newswires, Feb. to Oct. 1987, on economy and international affairs.
[Figure: discovered phrases browsed by the user, e.g. <vessels>, <ships>, <gulf>, <shipping>, <iranian>, <port>, <iran>, <the gulf>, <strike>, <attack>, <silk worm missile>, <us.>, <wheat>, <dallers>, <sea men>]
Text data mining for survey and browsing: information retrieval vs. direct browsing.
Proximity Word Association Patterns
Association rules over arbitrary subwords.
– Ordered: ordering among the subwords.
– Proximity: the distance between consecutive subwords is within a constant k (the proximity).
[Figure: a pattern CACA * AGGAT * CCAA and its occurrences in example DNA sequences]
Parameters: the maximum number of substrings d & the proximity k
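The matching semantics above can be made concrete in a short sketch (the function name and the search structure are ours, not from the talk): each subword must occur after the end of the previous one, with a gap of at most k characters between consecutive occurrences.

```python
def matches(text, words, k):
    """Test whether the ordered words occur in text with a gap of at most k
    between the end of one occurrence and the start of the next."""
    def occs(w, lo):
        # all start positions of w in text at or after position lo, ascending
        i = text.find(w, lo)
        while i != -1:
            yield i
            i = text.find(w, i + 1)

    def search(idx, lo, bounded):
        if idx == len(words):
            return True
        for p in occs(words[idx], lo):
            if bounded and p - lo > k:
                break                      # beyond the proximity window
            if search(idx + 1, p + len(words[idx]), True):
                return True
        return False

    return search(0, 0, False)             # the first word may start anywhere
```

The backtracking over all occurrence lists is what makes the naive score computation linear per pattern but expensive over all O(n^2d) candidate patterns.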
Related Research
Feldman and Dagan (KDD'96)
– Association rules over keywords: "Arab", "Egypt", "Iran" => "Oil"
– Using the Apriori-style algorithm of Agrawal et al. (1994)
Motwani (SIGMOD'97)
– Correlations over keywords
Mannila and Toivonen (KDD'96)
– Episode patterns (partially ordered sets of events)
Wang, Chirn, Marr, Shapiro, Shasha, Zhang (SIGMOD'94)
– Word association patterns without proximity: AGAG * TATA * AGAT
– A generate-and-test algorithm + heuristics
– Implementation for d = 2, or d = 1 + approximate matching
Iliopoulos, Makris, Sioutas, Tsakalidis, Tsichlas (CPM'02, this conference!)
– Model identification problem for maximal pairs of strings (2-dim)
– Common or frequent pattern discovery for d = 2 and proximity
How to find?
What properties separate the target data from the rest of the data?
Goal: to find those patterns that characterize the target collection.
Optimized Rule/Pattern Discovery
– Data mining: optimized data mining (IBM DM group, 1996-2000)
– Learning theory: agnostic PAC learning (1990s)
– Statistics: Vapnik-Chervonenkis theory (1970s)
Minimization of prediction error, information entropy, or Gini index.
p: ratio of positives that a pattern matches; f(p): impurity function, minimized at p = 0% and 100% and maximized at 50%.
[Figure: a good rectangle covering 8 positives and 2 negatives vs. a bad rectangle covering 9 positives and 9 negatives]
Optimized Rule/Pattern Discovery
Goodness of a pattern = goodness of the split by the pattern = weighted average of the values of the impurity function at the matched and unmatched sets.
[Figure: a pattern splits the population N; the impurity function f(p) over p from 0% to 100%]
Optimized Rule/Pattern Discovery
Split! The pattern splits the population N into the matched set S1 (population N1) and the unmatched set S0 (population N0).
Evaluation function for a pattern π:
G_{S,ψ}(π) = (N1/N) f(M1/N1) + (N0/N) f(M0/N0)
where f is the impurity function and M1, M0 are the numbers of positives in S1 and S0.
Optimal Pattern Discovery Problem
Given: a set S of documents and an objective function ψ: S → {0, 1}.
Problem: find an optimal pattern π that minimizes the evaluation function
G_{S,ψ}(π) = (N1/N) f(M1/N1) + (N0/N) f(M0/N0)
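A minimal sketch of the evaluation function, assuming binary entropy as the impurity f and a caller-supplied match predicate standing in for pattern matching (the function names are ours):

```python
from math import log2

def entropy(p):
    """Binary entropy, one choice of impurity f: 0 at p in {0, 1}, max at 1/2."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def evaluate(pattern_matches, docs, labels, impurity=entropy):
    """G(pi) = (N1/N) f(M1/N1) + (N0/N) f(M0/N0): the weighted impurity of
    the split that the pattern induces on the documents.  Lower is better."""
    n = len(docs)
    n1 = sum(1 for d in docs if pattern_matches(d))
    m1 = sum(1 for d, y in zip(docs, labels) if pattern_matches(d) and y == 1)
    n0, m0 = n - n1, sum(labels) - m1
    g = 0.0
    if n1:
        g += (n1 / n) * impurity(m1 / n1)
    if n0:
        g += (n0 / n) * impurity(m0 / n0)
    return g
```

A pattern that matches exactly the positives scores 0; a pattern whose match is uninformative about the labels scores the impurity of the whole collection.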
Relation to Robust Probabilistic Learning
Statistical decision theory in the 70s: Vapnik-Chervonenkis theory (1970s)
Computational learning theory in the 90s: agnostic PAC learning / robust trainability (Kearns et al. '92)
– An algorithm that efficiently solves the classification error minimization problem is an efficient robust learner; that is, it can approximate an arbitrary distribution generating the examples from the viewpoint of classification (Haussler 1990).
– Intractable in general (Kearns et al. 1992)
Empirical machine learning in the 90s: the power of simple rules + rigorous optimization (Weiss; Holte)
Data mining & COLT in the mid 90s: efficient algorithms for simple geometric patterns
Application to Text Mining
Traditional method (frequency-based): finding the most frequent patterns in the target set T.
– Many trivial patterns (stop words) may hide less frequent interesting patterns.
– Traditional stop-word elimination in IR may not work.
[Figure: word frequency distribution over the vocabulary of the target dataset: trivial stop words (the, a, an, that, of, with) are far more frequent than interesting phrases such as "iranian oil", "kuwaiti tanker attack", "Iranian oil platform", "kuwaiti tanker", "Silkworm missile"]
Application to Text Mining
Optimized data mining: finding optimal patterns.
– Uses an average dataset B of documents as a control dataset for canceling trivial patterns.
– Finds those patterns that appear more frequently in the target set T and less frequently in the control set B.
[Figure: word frequency distributions over the vocabulary of the target and background datasets: trivial stop words are frequent in both, while phrases such as "Iranian oil platform" and "Silkworm missile" are frequent only in the target]
Proximity Word Association Patterns (recap)
Association rules over arbitrary subwords, ordered, with the distance between consecutive subwords within the proximity k; parameters: the maximum number of substrings d and the proximity k.
Straightforward algorithm (case: the number d of substrings is bounded)
Procedure:
– Enumerate all O(n^2d) proximity patterns built from the O(n^2) subwords of the text.
– For each pattern p, compute the score in linear time.
The straightforward algorithm requires O(n^(2d+1)) time and is too slow to apply to real-world databases.
We require more efficient algorithms that run in time, say, O(n) to O(n log n) on real datasets.
Theoretical result: positive (the number d of substrings is bounded)
Details of the algorithm
Theorem: For a set of random texts of total size N, the Split-Merge algorithm finds all the k-proximity d-word association patterns that minimize the prediction error in average time O(k^d (log N)^(d+1) N) and space O(max(k, d) N).
Proc. ISAAC'98, LNCS 1533, 1998; New Generation Computing, 2000.
In practice d = 2-4, k = 2-8 (words), and log N = 10-20 (Reuters-21578 collection of 15.6MB), so the (log N)^(d+1) factor is a large constant.
Theoretical result: negative (the number d of substrings is unbounded)
Details of the algorithm
Theorem: If the number d of subwords is unbounded, then there is no polynomial-time approximation algorithm that solves the optimal pattern problem above within an arbitrarily small approximation ratio, assuming P ≠ NP (MAXSNP-hard).
Proc. ISAAC'98, LNCS 1533, 1998; New Generation Computing, 2000.
Suffix tree (McCreight, 1976)
– Represents all the substrings in O(n) space.
– Problems: not space efficient; dynamic reconstruction is not easy; not suitable for implementation on secondary storage.
Suffix array (Manber et al., 1990)
– Compactly represents all the substrings with a one-dimensional integer array.
Both are data structures for efficiently storing all of the O(n^2) substrings in O(n) space.
[Figure: the suffix tree and suffix array of the example text abcabbca$, whose suffixes in lexicographic order are abbca$, a$, bbca$, bcabbca$, bca$, cabbca$, ca$, $]
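For illustration, a naive suffix array construction can be written in one line (a sketch only: this comparison sort is quadratic-ish and is not the construction used at GB scale, and we omit the terminal $ sentinel of the slide's example):

```python
def suffix_array(text):
    """Naive suffix array: the start indices of all suffixes of text,
    sorted in lexicographic order of the suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])
```

Every substring of the text is a prefix of some suffix, which is why this single integer array stands in for all O(n^2) substrings.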
Basic Idea
– Reduce the problem of finding the best k-proximity d-word association pattern
– to finding the best d-dimensional box over the rank space.
The position space consists of all possible pairs of positions that differ by at most k.
The rank space consists of all pairs of the suffixes of the text, ordered lexicographically.
The translation between the two spaces is given by the suffix array.
An O(k^d (log N)^(d+1) N)-time Algorithm
Improvement of a generate-and-test algorithm using a d-dimensional orthogonal range tree structure (mean height O(log N)).
[Figure: the two-dimensional case]
Implementation: From Trees to Arrays
An efficient full-text index for text mining.
– Replace the tree with one-dimensional arrays.
– Most operations in the Split-Merge algorithm can be efficiently implemented with the suffix + height arrays.
• Enumeration of substrings and their occurrences is done in linear time by simulating the DFS of the "virtual" suffix tree while scanning the height array.
• Reconstruction (restriction) of the suffix and height arrays can be done with O(n log n) integer sorting and O(1)-time LCA/range-minima computation (Farach-Colton, Ferragina, Muthukrishnan '00).
T. Kasai, G. Lee, H. Arimura, S. Arikawa, K. Park, "Linear-time longest-common-prefix computation in suffix arrays and its applications", CPM'01; H. Arimura, CPM'01 talk.
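The height (LCP) array of the CPM'01 paper cited above can be computed in linear time roughly as follows (a sketch of the Kasai-style computation; variable names are ours): suffixes are processed in text order, and the h matched characters of the previous step are reused, losing at most one per step.

```python
def lcp_array(text, sa):
    """height[r] = length of the longest common prefix of the suffixes at
    sa[r] and sa[r-1] (height[0] = 0).  Linear time: the match length h
    decreases by at most 1 when moving from suffix i to suffix i+1."""
    n = len(text)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    height = [0] * n
    h = 0
    for i in range(n):                     # suffixes in text order
        r = rank[i]
        if r > 0:
            j = sa[r - 1]                  # lexicographic predecessor
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1
            height[r] = h
            if h > 0:
                h -= 1                     # reuse all but one matched char
        else:
            h = 0
    return height
```

Scanning this array left to right simulates a DFS over the internal nodes of the "virtual" suffix tree, which is the enumeration step the slide describes.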
Implementation: A More Practical Algorithm
Split-Merge-with-Array algorithm (SMA)
– A re-implementation of SMT with the suffix + height arrays.
– Has the same time complexity and slightly improved average space complexity compared to SMT.
– Easy to implement and scalable, thanks to a simple data structure that extensively uses one-dimensional arrays and sorting and mapping operations over them.
Theorem: For a set of random texts of total size N, the SMA algorithm finds all the k-proximity d-word association patterns that minimize the prediction error in average time O(N (log N)^(d+1)) and space O(max(k, d) N).
Prototype System
– Based on computational geometry techniques.
– Built on a full-text index called the suffix array, with a virtual traversal technique over the suffix array.
– Space requirement reduced to O(dn) with small constants by the extensive use of the suffix array and ternary quicksort.
– g++ on Solaris 2.6, Sun Ultra SPARC IIi, 250MHz.
Running Time
Summary for various values of the parameters d and k:

d  k  Time (s)        d   k  Time (s)
1  -     0.64         2   2     2.30
2  2     7.00         2   4     3.81
3  2    33.60         2   8     6.65
4  2   170.38         2  16    10.59
5  2   934.00         2  32    14.81
6  2  1405.82         2  64    14.67

• Data: 15.2MB (SHIP data from the Reuters-21578 collection)
• Sun Microsystems Ultra SPARC II, 300MHz, 512MB, g++ on Solaris 2.6
• Best 200 patterns with entropy minimization
Experiments on Document Browsing
Data
– Reuters-21578 text collection (Lewis, 1997)
– 21578 articles on economy and international affairs, from Feb. to Oct. 1987.
– Each article is tagged with dates and topics (ship, grain, wheat, gold, ...).
– ASCII data of total size 27.6MB (15.2MB after removing tags).
Task
– To find the optimized patterns that distinguish the sentences appearing in the articles of category ship from those in other categories.
Arimura, Abe, Fujino, Sakamoto, Shimozono, Arikawa, Proc. IEEE Kyoto Int'l Conf. Digital Library, 2000.
Application to document browsing: finding the phrases that characterize the articles of category ship, against the articles with other categories.

1. Best ten phrases with high entropy value (500 patterns found in 0.86 sec; #pos 2970, #neg 7887):
 id  eval   P1   N1  pattern
  1  0.568  209  23  <gulf>
  2  0.572  136   3  <ships>
  3  0.572  142   6  <shipping>
  4  0.572  132   3  <iranian>
  5  0.573  134   9  <iran>
  6  0.576  108   6  <port>
  7  0.576  111   8  <the gulf>
  8  0.577  111  14  <strike>
  9  0.577   81   0  <vessels>
 10  0.578   86   4  <attack>

2. Phrases with middle entropy values (ranks 261-270, all with eval 0.585, P1 = 12, N1 = 0):
<mhi>, <mclean>, <lloyds shipping intelligence>, <iranian oil platform>, <herald of free>, <began on>, <bagged>, <24 ->, <18 ->, <120 pct>

3. Best k-proximity d-word patterns with high entropy value, where k = 1, d = 5 and the first word is "attack" (138 patterns found in 0.12 sec):
 id  eval   P1  N1  pattern
  1  0.586   5   0  <attack> <on> <an> <iranian> <oil>
  2  0.586   4   0  <attack> <an> <iranian> <oil> <platform>
  3  0.586   3   0  <attack> <on> <u.s.-flagged> <in kuwaiti> <waters>
  4  0.586   3   0  <attack> <on> <iranian> <oil> <platform>
 11  0.586   3   0  <attack> <on> <a> <ship> <kuwaiti>
 12  0.586   3   0  <attack> <on> <a> <ship> <in kuwaiti>

Arimura, Abe, Fujino, Sakamoto, Shimozono, Arikawa, Proc. IEEE Kyoto Int'l Conf. Digital Library, 2000.
Optimization-based data mining vs. frequency-based data mining
– Title words: words from the <TITLE> sections of Reuters newswires.
– Stop words: words from the standard stop-word list for the Brown corpus.
– Measure the ratio of title words and stop words in the phrases found.
[Figure: two bar charts over rank buckets 1-100 through 901-1000 on the Reuters ship data. With frequency maximization from positive examples alone, the found phrases are dominated by stop words; with entropy minimization from positive and negative examples, the top-ranked phrases contain many title words and few stop words.]
Arimura, Abe, Fujino, Sakamoto, Shimozono, Arikawa, Proc. IEEE Kyoto Int'l Conf. Digital Library, 2000.
Discovery of Important Keywords in the Cyberspace
Target dataset: the base set for the query "HONDA"
– Root set S: the best 200 hits by AltaVista™.
– Base set T (1,000 - 5,000 pages): all pages of distance one from pages in S, i.e., the forward links from S and 50 randomly selected back links pointing to S.
Control dataset: the base set for the query "SOFTBANK".
Arimura, Abe, Fujino, Sakamoto, Shimozono, Arikawa, Proc. IEEE Kyoto Int'l Conf. Digital Library, 2000.
Frequency-based vs. Optimization-based
Mining patterns in the target/positive dataset (HONDA) using the background/negative dataset (SOFTBANK): an automobile company vs. an internet business.

Rank  Frequency Maximization  Entropy Minimization
  0   <the>                   <honda>
  1   <and>                   <prelude>
  2   <to>                    <i>
  3   <a>                     <car>
  4   <of>                    <parts>
  5   <for>                   <engine>
  6   <in>                    <99>
  7   <is>                    <rear>
  8   <I>                     <vtec>
  9   <honda>                 <exhaust>
 10   <on>                    <miles>
 11   <s>                     <bike>
 12   <with>                  <motorcycle>
 13   <you>                   <racing>
 14   <it>                    <black>
 15   <or>                    <si>
 16   <this>                  <me>
 17   <that>                  <tires>
 18   <are>                   <fuel>
 19   <99>                    <my>

Arimura, Abe, Fujino, Sakamoto, Shimozono, Arikawa, Proc. IEEE Kyoto Int'l Conf. Digital Library, 2000.
Dependence on the Background/Negative Data
Mining patterns in the target/positive dataset (HONDA) while varying the background/negative dataset: SOFTBANK (an internet business) vs. TOYOTA (also an automobile company). "Other" is the rank of the same pattern under the other background.

HONDA vs. SOFTBANK           HONDA vs. TOYOTA
Rank  Other  Pattern          Rank  Other  Pattern
  0      0   <honda>            0      0   <honda>
  1      1   <prelude>          1      1   <prelude>
  2     33   <i>                2      8   <vtec>
  3   >350   <car>              3     15   <si>
  4   >350   <parts>            4     11   <bike>
  5   >350   <engine>           5      6   <99>
  6      5   <99>               6     12   <motorcycle>
  7   >350   <rear>             7     25   <the honda>
  8      2   <vtec>             8     41   <prelude si>
  9     24   <exhaust>          9     20   <civic>
 10    108   <miles>           10     35   <honda prelude>
 11      4   <bike>            11     48   <98>
 12      6   <motorcycle>      12     53   <valkyrie>
 13   >350   <racing>          13     60   <99 time>
 14     19   <black>           14     37   <honda s>
 15      3   <si>              15     36   <rims>
 16     28   <me>              16     40   <looking>
 17   >350   <tires>           17     30   <looking for>
 18   >350   <fuel>            18     67   <scooters>
 19   >350   <my>              19     14   <black>

Arimura, Abe, Fujino, Sakamoto, Shimozono, Arikawa, Proc. IEEE Kyoto Int'l Conf. Digital Library, 2000.
Conclusion
Text databases & optimized pattern discovery
– Proximity phrase association patterns
Fast and robust text mining algorithms
– Split-Merge algorithm for finding the optimal patterns
– Levelwise-Scan algorithm for large disk-resident data
Applications
– Interactive document browsing
– Web mining
Please visit: http://www.i.kyushu-u.ac.jp/~arim/
Semi-structured Data (Web & XML data)

<ARTICLE status="draft">
  <TITLE>Fast Text Data Mining with optimal pattern discovery</TITLE>
  <AUTHOR>H. Arimura</AUTHOR>
  <AUTHOR>T. Kasai</AUTHOR>
  <AUTHOR>A. Wataki</AUTHOR>
  <AUTHOR>S. Arikawa</AUTHOR>
  <ABSTRACT>This paper considers the efficient discovery of a simple class of patterns from large text databases.</ABSTRACT>
  <SECTION>
    <TITLE>Introduction</TITLE>
    <BODY>Recent progress of network and storage technology enables us to collect and accumulate ...</BODY>
  </SECTION>
  <SECTION>
    <TITLE>Preliminaries</TITLE>
    <BODY>In this section, we give basic definitions and results on ...</BODY>
  </SECTION>
  ...
</ARTICLE>

[Figure: the corresponding ordered tree with TITLE, AUTHOR, ABSTRACT, SECTION, BODY, LINK, and FIGURE nodes]
Theoretical Results
Theorem: For any ε > 0, there exists no polynomial-time (770/767 - ε)-approximation algorithm for the maximum agreement problem for labeled ordered trees of unbounded size over an unbounded label alphabet, if P ≠ NP.
Details of the algorithm
Theorem: The algorithm OPTT solves the maximum agreement problem for labeled ordered trees in average time O(k^k b^k N).
(Note: a straightforward algorithm has super-linear time complexity when the number of labels grows with N.)
Proc. SIAM Data Mining (SDM'02), 2002, and Proc. PKDD'02, 2002.
Enumeration Graph of Ordered Trees
– The root is the empty tree.
– Each node is an ordered tree and has its rightmost expansions as its children.
[Figure: the first levels of the enumeration graph, trees T1-T4]
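A minimal sketch of this enumeration for unlabeled ordered trees, encoding a tree as its preorder depth sequence (our simplification: the real FREQT additionally handles node labels and frequency counting, and we start from the one-node tree rather than the empty tree):

```python
def expansions(tree):
    """Rightmost expansions of an ordered tree given as its preorder depth
    sequence (root = depth 0): a new rightmost leaf may be attached to any
    node on the rightmost path, i.e. appended at depth 1 .. last depth + 1."""
    return [tree + (d,) for d in range(1, tree[-1] + 2)]

def enumerate_trees(max_size):
    """Walk the enumeration graph level by level, generating every ordered
    tree with at most max_size nodes exactly once."""
    level = [(0,)]                 # the one-node tree
    all_trees = [(0,)]
    for _ in range(max_size - 1):
        level = [t2 for t in level for t2 in expansions(t)]
        all_trees.extend(level)
    return all_trees
```

Each tree has a unique parent in the graph (drop its last preorder node), which is why rightmost expansion enumerates every ordered tree exactly once.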
Results: Semi-structured Data Mining
A fast algorithm for discovering frequent ordered tree patterns: an extension of association rule discovery to semi-structured data.
– Efficient Substructure Discovery from Large Semi-structured Data
– Asai, Abe, Kawasoe, Arimura, Sakamoto, Arikawa
– Proc. 2nd SIAM International Conference on Data Mining (SDM'02), Arlington, April 2002. (To appear)
– Also presented at the IEICE DE workshop (Oct.), the JSAI SIG-FAI/KBS workshop (Nov.), and DEWS'02.
Correctness and Complexity of the FREQT Algorithm
Theorem: For any frequency threshold 0 < σ ≤ 1, the proposed algorithm Find-Freq-Trees enumerates exactly all the frequent ordered tree patterns.
Recently we obtained tight upper and lower bounds on the complexity (PKDD'02, Aug. 2002):
– Upper bound: linear time for constant-size patterns.
– Lower bound: for patterns of arbitrary size, even approximation within an error ratio close to 1 is NP-complete.