Efficient Text and Semi-structured Data Mining: Towards Knowledge Discovery in the Cyberspace
Hiroki Arimura, Department of Informatics, Kyushu University, Japan
Joint work with Tatsuya Asai, Shinji Kawasoe, Kenji Abe, Junichiro Abe, Ryoichi Fujino, Hiroshi Sakamoto, Setsuo Arikawa (Kyushu Univ), Shinichi Shimozono (Kyutech)
Supported by Grant-in-Aid for Scientific Research on Priority Areas "Discovery Science" and "Informatics"; Japan Science & Technology Corp., PRESTO
Outline
Efficient Text Data Mining
– Fast and robust text mining algorithms (ALT'98, ISAAC'98, DS'98)
– Efficient text index for data mining (CPM'01, CPM'02)
– Text mining on external storage (PAKDD'00)
Applications
– Interactive document browsing
– Keyword discovery from the Web
Towards Semi-structured Data Mining
– Efficient frequent tree miner (SDM'02, PKDD'02)
– Mining semi-structured data streams (ICDM'02)
– Information extraction from the Web (GI'00, FLAIRS'01)
Conclusion
Efficient Text Data Mining with Optimized Pattern Discovery
Joint work with Junichiro Abe, Ryoichi Fujino, Hiroshi Sakamoto, Setsuo Arikawa (Kyushu Univ), Shinichi Shimozono (Kyutech)
Large Text Databases
have emerged in the early 90's with the rapid progress of computer and network technologies.
– Web pages (OPENTEXT Index, GBs to TBs)
– Collections of XML / SGML documents
– Genome databases (GenBank, PIR)
– Bibliographic databases (INSPEC, ResearchIndex)
– Emails and plain texts on file systems
Huge, heterogeneous, unstructured data: traditional database / data mining technology does not work!
Our Research Goal
Text Mining as an inverse of IR:
Develop a new access tool for text data that interactively supports human discovery from large text collections.
Key: fast and robust text mining methods.
[Figure: the user interacts with the text data mining system over text databases: Web pages, XML/SGML archives, genome databases, e-mails and text files]
Browsing a large collection of documents with unknown vocabulary and structure
Reuters-21578: 21578 articles from Reuters newswires, Feb. to Oct. 1987, on economy and international affairs.
[Figure: discovered phrases browsed by the user, e.g. <vessels>, <ships>, <gulf>, <shipping>, <iranian>, <port>, <iran>, <the gulf>, <strike>, <attack>, <silk worm missile>, <us.>, <wheat>, <dallers>, <sea men>]
Text data mining for survey and browsing: information retrieval vs. direct browsing.
Proximity Word Association Patterns
Association rules over arbitrary subwords.
– Ordered: ordering among the subwords.
– Proximity: the distance between consecutive subwords is within a constant k (the proximity).
[Figure: a pattern CACA * AGGAT * CCAA and its occurrences in example DNA sequences]
Parameters: the maximum number of substrings d & the proximity k
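The matching semantics above can be made concrete in a short sketch (the function name and the search structure are ours, not from the talk): each subword must occur after the end of the previous one, with a gap of at most k characters between consecutive occurrences.

```python
def matches(text, words, k):
    """Test whether the ordered words occur in text with a gap of at most k
    between the end of one occurrence and the start of the next."""
    def occs(w, lo):
        # all start positions of w in text at or after position lo, ascending
        i = text.find(w, lo)
        while i != -1:
            yield i
            i = text.find(w, i + 1)

    def search(idx, lo, bounded):
        if idx == len(words):
            return True
        for p in occs(words[idx], lo):
            if bounded and p - lo > k:
                break                      # beyond the proximity window
            if search(idx + 1, p + len(words[idx]), True):
                return True
        return False

    return search(0, 0, False)             # the first word may start anywhere
```

The backtracking over all occurrence lists is what makes the naive score computation linear per pattern but expensive over all O(n^2d) candidate patterns.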
Related Research
Feldman and Dagan (KDD'96)
– Association rules over keywords: "Arab", "Egypt", "Iran" => "Oil"
– Using the Apriori-style algorithm of Agrawal et al. (1994)
Motwani (SIGMOD'97)
– Correlations over keywords
Mannila and Toivonen (KDD'96)
– Episode patterns (partially ordered sets of events)
Wang, Chirn, Marr, Shapiro, Shasha, Zhang (SIGMOD'94)
– Word association patterns without proximity: AGAG * TATA * AGAT
– A generate-and-test algorithm + heuristics
– Implementation for d = 2, or d = 1 + approximate matching
Iliopoulos, Makris, Sioutas, Tsakalidis, Tsichlas (CPM'02, this conference!)
– Model identification problem for maximal pairs of strings (2-dim)
– Common or frequent pattern discovery for d = 2 and proximity
How to find?
What properties separate the target data from the rest of the data?
Goal: to find those patterns that characterize the target collection.
Optimized Rule/Pattern Discovery
– Data mining: optimized data mining (IBM DM group, 1996-2000)
– Learning theory: agnostic PAC learning (1990s)
– Statistics: Vapnik-Chervonenkis theory (1970s)
Minimization of prediction error, information entropy, or Gini index.
p: ratio of positives that a pattern matches; f(p): impurity function, minimized at p = 0% and 100% and maximized at 50%.
[Figure: a good rectangle covering 8 positives and 2 negatives vs. a bad rectangle covering 9 positives and 9 negatives]
Optimized Rule/Pattern Discovery
Goodness of a pattern = goodness of the split by the pattern = weighted average of the values of the impurity function at the matched and unmatched sets.
[Figure: a pattern splits the population N; the impurity function f(p) over p from 0% to 100%]
Optimized Rule/Pattern Discovery
Split! The pattern splits the population N into the matched set S1 (population N1) and the unmatched set S0 (population N0).
Evaluation function for a pattern π:
G_{S,ψ}(π) = (N1/N) f(M1/N1) + (N0/N) f(M0/N0)
where f is the impurity function and M1, M0 are the numbers of positives in S1 and S0.
Optimal Pattern Discovery Problem
Given: a set S of documents and an objective function ψ: S → {0, 1}.
Problem: find an optimal pattern π that minimizes the evaluation function
G_{S,ψ}(π) = (N1/N) f(M1/N1) + (N0/N) f(M0/N0)
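A minimal sketch of the evaluation function, assuming binary entropy as the impurity f and a caller-supplied match predicate standing in for pattern matching (the function names are ours):

```python
from math import log2

def entropy(p):
    """Binary entropy, one choice of impurity f: 0 at p in {0, 1}, max at 1/2."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def evaluate(pattern_matches, docs, labels, impurity=entropy):
    """G(pi) = (N1/N) f(M1/N1) + (N0/N) f(M0/N0): the weighted impurity of
    the split that the pattern induces on the documents.  Lower is better."""
    n = len(docs)
    n1 = sum(1 for d in docs if pattern_matches(d))
    m1 = sum(1 for d, y in zip(docs, labels) if pattern_matches(d) and y == 1)
    n0, m0 = n - n1, sum(labels) - m1
    g = 0.0
    if n1:
        g += (n1 / n) * impurity(m1 / n1)
    if n0:
        g += (n0 / n) * impurity(m0 / n0)
    return g
```

A pattern that matches exactly the positives scores 0; a pattern whose match is uninformative about the labels scores the impurity of the whole collection.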
Relation to Robust Probabilistic Learning
Statistical decision theory in the 70s: Vapnik-Chervonenkis theory (1970s)
Computational learning theory in the 90s: agnostic PAC learning / robust trainability (Kearns et al. '92)
– An algorithm that efficiently solves the classification error minimization problem is an efficient robust learner; that is, it can approximate an arbitrary distribution generating the examples from the viewpoint of classification (Haussler 1990).
– Intractable in general (Kearns et al. 1992)
Empirical machine learning in the 90s: the power of simple rules + rigorous optimization (Weiss; Holte)
Data mining & COLT in the mid 90s: efficient algorithms for simple geometric patterns
Application to Text Mining
Traditional method (frequency-based): finding the most frequent patterns in the target set T.
– Many trivial patterns (stop words) may hide less frequent interesting patterns.
– Traditional stop-word elimination in IR may not work.
[Figure: word frequency distribution over the vocabulary of the target dataset: trivial stop words (the, a, an, that, of, with) are far more frequent than interesting phrases such as "iranian oil", "kuwaiti tanker attack", "Iranian oil platform", "kuwaiti tanker", "Silkworm missile"]
Application to Text Mining
Optimized data mining: finding optimal patterns.
– Uses an average dataset B of documents as a control dataset for canceling trivial patterns.
– Finds those patterns that appear more frequently in the target set T and less frequently in the control set B.
[Figure: word frequency distributions over the vocabulary of the target and background datasets: trivial stop words are frequent in both, while phrases such as "Iranian oil platform" and "Silkworm missile" are frequent only in the target]
Proximity Word Association Patterns (recap)
Association rules over arbitrary subwords, ordered, with the distance between consecutive subwords within the proximity k; parameters: the maximum number of substrings d and the proximity k.
Straightforward algorithm (case: the number d of substrings is bounded)
Procedure:
– Enumerate all O(n^2d) proximity patterns built from the O(n^2) subwords of the text.
– For each pattern p, compute the score in linear time.
The straightforward algorithm requires O(n^(2d+1)) time and is too slow to apply to real-world databases.
We require more efficient algorithms that run in time, say, O(n) to O(n log n) on real datasets.
Theoretical result: positive (the number d of substrings is bounded)
Details of the algorithm
Theorem: For a set of random texts of total size N, the Split-Merge algorithm finds all the k-proximity d-word association patterns that minimize the prediction error in average time O(k^d (log N)^(d+1) N) and space O(max(k, d) N).
Proc. ISAAC'98, LNCS 1533, 1998; New Generation Computing, 2000.
In practice d = 2-4, k = 2-8 (words), and log N = 10-20 (Reuters-21578 collection of 15.6MB), so the (log N)^(d+1) factor is a large constant.
Theoretical result: negative (the number d of substrings is unbounded)
Details of the algorithm
Theorem: If the number d of subwords is unbounded, then there is no polynomial-time approximation algorithm that solves the optimal pattern problem above within an arbitrarily small approximation ratio, assuming P ≠ NP (MAXSNP-hard).
Proc. ISAAC'98, LNCS 1533, 1998; New Generation Computing, 2000.
Suffix tree (McCreight, 1976)
– Represents all the substrings in O(n) space.
– Problems: not space efficient; dynamic reconstruction is not easy; not suitable for implementation on secondary storage.
Suffix array (Manber et al., 1990)
– Compactly represents all the substrings with a one-dimensional integer array.
Both are data structures for efficiently storing all of the O(n^2) substrings in O(n) space.
[Figure: the suffix tree and suffix array of the example text abcabbca$, whose suffixes in lexicographic order are abbca$, a$, bbca$, bcabbca$, bca$, cabbca$, ca$, $]
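For illustration, a naive suffix array construction can be written in one line (a sketch only: this comparison sort is quadratic-ish and is not the construction used at GB scale, and we omit the terminal $ sentinel of the slide's example):

```python
def suffix_array(text):
    """Naive suffix array: the start indices of all suffixes of text,
    sorted in lexicographic order of the suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])
```

Every substring of the text is a prefix of some suffix, which is why this single integer array stands in for all O(n^2) substrings.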
Basic Idea
– Reduce the problem of finding the best k-proximity d-word association pattern
– to finding the best d-dimensional box over the rank space.
The position space consists of all possible pairs of positions that differ by at most k.
The rank space consists of all pairs of the suffixes of the text, ordered lexicographically.
The translation between the two spaces is given by the suffix array.
An O(k^d (log N)^(d+1) N)-time Algorithm
Improvement of a generate-and-test algorithm using a d-dimensional orthogonal range tree structure (mean height O(log N)).
[Figure: the two-dimensional case]
Implementation: From Trees to Arrays
An efficient full-text index for text mining.
– Replace the tree with one-dimensional arrays.
– Most operations in the Split-Merge algorithm can be efficiently implemented with the suffix + height arrays.
• Enumeration of substrings and their occurrences is done in linear time by simulating the DFS of the "virtual" suffix tree while scanning the height array.
• Reconstruction (restriction) of the suffix and height arrays can be done with O(n log n) integer sorting and O(1)-time LCA/range-minima computation (Farach-Colton, Ferragina, Muthukrishnan '00).
T. Kasai, G. Lee, H. Arimura, S. Arikawa, K. Park, "Linear-time longest-common-prefix computation in suffix arrays and its applications", CPM'01; H. Arimura, CPM'01 talk.
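The height (LCP) array of the CPM'01 paper cited above can be computed in linear time roughly as follows (a sketch of the Kasai-style computation; variable names are ours): suffixes are processed in text order, and the h matched characters of the previous step are reused, losing at most one per step.

```python
def lcp_array(text, sa):
    """height[r] = length of the longest common prefix of the suffixes at
    sa[r] and sa[r-1] (height[0] = 0).  Linear time: the match length h
    decreases by at most 1 when moving from suffix i to suffix i+1."""
    n = len(text)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    height = [0] * n
    h = 0
    for i in range(n):                     # suffixes in text order
        r = rank[i]
        if r > 0:
            j = sa[r - 1]                  # lexicographic predecessor
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1
            height[r] = h
            if h > 0:
                h -= 1                     # reuse all but one matched char
        else:
            h = 0
    return height
```

Scanning this array left to right simulates a DFS over the internal nodes of the "virtual" suffix tree, which is the enumeration step the slide describes.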
Implementation: A More Practical Algorithm
Split-Merge-with-Array algorithm (SMA)
– A re-implementation of SMT with the suffix + height arrays.
– Has the same time complexity and slightly improved average space complexity compared to SMT.
– Easy to implement and scalable, thanks to a simple data structure that extensively uses one-dimensional arrays and sorting and mapping operations over them.
Theorem: For a set of random texts of total size N, the SMA algorithm finds all the k-proximity d-word association patterns that minimize the prediction error in average time O(N (log N)^(d+1)) and space O(max(k, d) N).
Prototype System
– Based on computational geometry techniques.
– Built on a full-text index called the suffix array, with a virtual traversal technique over the suffix array.
– Space requirement reduced to O(dn) with small constants by the extensive use of the suffix array and ternary quicksort.
– g++ on Solaris 2.6, Sun Ultra SPARC IIi, 250MHz.
Running Time
Summary for various values of the parameters d and k:

d  k  Time (s)        d   k  Time (s)
1  -     0.64         2   2     2.30
2  2     7.00         2   4     3.81
3  2    33.60         2   8     6.65
4  2   170.38         2  16    10.59
5  2   934.00         2  32    14.81
6  2  1405.82         2  64    14.67

• Data: 15.2MB (SHIP data from the Reuters-21578 collection)
• Sun Microsystems Ultra SPARC II, 300MHz, 512MB, g++ on Solaris 2.6
• Best 200 patterns with entropy minimization
Experiments on Document Browsing
Data
– Reuters-21578 text collection (Lewis, 1997)
– 21578 articles on economy and international affairs, from Feb. to Oct. 1987.
– Each article is tagged with dates and topics (ship, grain, wheat, gold, ...).
– ASCII data of total size 27.6MB (15.2MB after removing tags).
Task
– To find the optimized patterns that distinguish the sentences appearing in the articles of category ship from those in other categories.
Arimura, Abe, Fujino, Sakamoto, Shimozono, Arikawa, Proc. IEEE Kyoto Int'l Conf. Digital Library, 2000.
Application to document browsing: finding the phrases that characterize the articles of category ship, against the articles with other categories.

1. Best ten phrases with high entropy value (500 patterns found in 0.86 sec; #pos 2970, #neg 7887):
 id  eval   P1   N1  pattern
  1  0.568  209  23  <gulf>
  2  0.572  136   3  <ships>
  3  0.572  142   6  <shipping>
  4  0.572  132   3  <iranian>
  5  0.573  134   9  <iran>
  6  0.576  108   6  <port>
  7  0.576  111   8  <the gulf>
  8  0.577  111  14  <strike>
  9  0.577   81   0  <vessels>
 10  0.578   86   4  <attack>

2. Phrases with middle entropy values (ranks 261-270, all with eval 0.585, P1 = 12, N1 = 0):
<mhi>, <mclean>, <lloyds shipping intelligence>, <iranian oil platform>, <herald of free>, <began on>, <bagged>, <24 ->, <18 ->, <120 pct>

3. Best k-proximity d-word patterns with high entropy value, where k = 1, d = 5 and the first word is "attack" (138 patterns found in 0.12 sec):
 id  eval   P1  N1  pattern
  1  0.586   5   0  <attack> <on> <an> <iranian> <oil>
  2  0.586   4   0  <attack> <an> <iranian> <oil> <platform>
  3  0.586   3   0  <attack> <on> <u.s.-flagged> <in kuwaiti> <waters>
  4  0.586   3   0  <attack> <on> <iranian> <oil> <platform>
 11  0.586   3   0  <attack> <on> <a> <ship> <kuwaiti>
 12  0.586   3   0  <attack> <on> <a> <ship> <in kuwaiti>

Arimura, Abe, Fujino, Sakamoto, Shimozono, Arikawa, Proc. IEEE Kyoto Int'l Conf. Digital Library, 2000.
Optimization-based data mining vs. frequency-based data mining
– Title words: words from the <TITLE> sections of Reuters newswires.
– Stop words: words from the standard stop-word list for the Brown corpus.
– Measure the ratio of title words and stop words in the phrases found.
[Figure: two bar charts over rank buckets 1-100 through 901-1000 on the Reuters ship data. With frequency maximization from positive examples alone, the found phrases are dominated by stop words; with entropy minimization from positive and negative examples, the top-ranked phrases contain many title words and few stop words.]
Arimura, Abe, Fujino, Sakamoto, Shimozono, Arikawa, Proc. IEEE Kyoto Int'l Conf. Digital Library, 2000.
Discovery of Important Keywords in the Cyberspace
Target dataset: the base set for the query "HONDA"
– Root set S: the best 200 hits by AltaVista™.
– Base set T (1,000 - 5,000 pages): all pages of distance one from pages in S, i.e., the forward links from S and 50 randomly selected back links pointing to S.
Control dataset: the base set for the query "SOFTBANK".
Arimura, Abe, Fujino, Sakamoto, Shimozono, Arikawa, Proc. IEEE Kyoto Int'l Conf. Digital Library, 2000.
Frequency-based vs. Optimization-based
Mining patterns in the target/positive dataset (HONDA) using the background/negative dataset (SOFTBANK): an automobile company vs. an internet business.

Rank  Frequency Maximization  Entropy Minimization
  0   <the>                   <honda>
  1   <and>                   <prelude>
  2   <to>                    <i>
  3   <a>                     <car>
  4   <of>                    <parts>
  5   <for>                   <engine>
  6   <in>                    <99>
  7   <is>                    <rear>
  8   <I>                     <vtec>
  9   <honda>                 <exhaust>
 10   <on>                    <miles>
 11   <s>                     <bike>
 12   <with>                  <motorcycle>
 13   <you>                   <racing>
 14   <it>                    <black>
 15   <or>                    <si>
 16   <this>                  <me>
 17   <that>                  <tires>
 18   <are>                   <fuel>
 19   <99>                    <my>

Arimura, Abe, Fujino, Sakamoto, Shimozono, Arikawa, Proc. IEEE Kyoto Int'l Conf. Digital Library, 2000.
Dependence on the Background/Negative Data
Mining patterns in the target/positive dataset (HONDA) while varying the background/negative dataset: SOFTBANK (an internet business) vs. TOYOTA (also an automobile company). "Other" is the rank of the same pattern under the other background.

HONDA vs. SOFTBANK           HONDA vs. TOYOTA
Rank  Other  Pattern          Rank  Other  Pattern
  0      0   <honda>            0      0   <honda>
  1      1   <prelude>          1      1   <prelude>
  2     33   <i>                2      8   <vtec>
  3   >350   <car>              3     15   <si>
  4   >350   <parts>            4     11   <bike>
  5   >350   <engine>           5      6   <99>
  6      5   <99>               6     12   <motorcycle>
  7   >350   <rear>             7     25   <the honda>
  8      2   <vtec>             8     41   <prelude si>
  9     24   <exhaust>          9     20   <civic>
 10    108   <miles>           10     35   <honda prelude>
 11      4   <bike>            11     48   <98>
 12      6   <motorcycle>      12     53   <valkyrie>
 13   >350   <racing>          13     60   <99 time>
 14     19   <black>           14     37   <honda s>
 15      3   <si>              15     36   <rims>
 16     28   <me>              16     40   <looking>
 17   >350   <tires>           17     30   <looking for>
 18   >350   <fuel>            18     67   <scooters>
 19   >350   <my>              19     14   <black>

Arimura, Abe, Fujino, Sakamoto, Shimozono, Arikawa, Proc. IEEE Kyoto Int'l Conf. Digital Library, 2000.
Conclusion
Text databases & optimized pattern discovery
– Proximity phrase association patterns
Fast and robust text mining algorithms
– Split-Merge algorithm for finding the optimal patterns
– Levelwise-Scan algorithm for large disk-resident data
Applications
– Interactive document browsing
– Web mining
Please visit: http://www.i.kyushu-u.ac.jp/~arim/
Semi-structured Data (Web & XML data)

<ARTICLE status="draft">
  <TITLE>Fast Text Data Mining with optimal pattern discovery</TITLE>
  <AUTHOR>H. Arimura</AUTHOR>
  <AUTHOR>T. Kasai</AUTHOR>
  <AUTHOR>A. Wataki</AUTHOR>
  <AUTHOR>S. Arikawa</AUTHOR>
  <ABSTRACT>This paper considers the efficient discovery of a simple class of patterns from large text databases.</ABSTRACT>
  <SECTION>
    <TITLE>Introduction</TITLE>
    <BODY>Recent progress of network and storage technology enables us to collect and accumulate ...</BODY>
  </SECTION>
  <SECTION>
    <TITLE>Preliminaries</TITLE>
    <BODY>In this section, we give basic definitions and results on ...</BODY>
  </SECTION>
  ...
</ARTICLE>

[Figure: the corresponding ordered tree with TITLE, AUTHOR, ABSTRACT, SECTION, BODY, LINK, and FIGURE nodes]
Theoretical Results
Theorem: For any ε > 0, there exists no polynomial-time (770/767 - ε)-approximation algorithm for the maximum agreement problem for labeled ordered trees of unbounded size over an unbounded label alphabet, if P ≠ NP.
Details of the algorithm
Theorem: The algorithm OPTT solves the maximum agreement problem for labeled ordered trees in average time O(k^k b^k N).
(Note: a straightforward algorithm has super-linear time complexity when the number of labels grows with N.)
Proc. SIAM Data Mining (SDM'02), 2002, and Proc. PKDD'02, 2002.
Enumeration Graph of Ordered Trees
– The root is the empty tree.
– Each node is an ordered tree and has its rightmost expansions as its children.
[Figure: the first levels of the enumeration graph, trees T1-T4]
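A minimal sketch of this enumeration for unlabeled ordered trees, encoding a tree as its preorder depth sequence (our simplification: the real FREQT additionally handles node labels and frequency counting, and we start from the one-node tree rather than the empty tree):

```python
def expansions(tree):
    """Rightmost expansions of an ordered tree given as its preorder depth
    sequence (root = depth 0): a new rightmost leaf may be attached to any
    node on the rightmost path, i.e. appended at depth 1 .. last depth + 1."""
    return [tree + (d,) for d in range(1, tree[-1] + 2)]

def enumerate_trees(max_size):
    """Walk the enumeration graph level by level, generating every ordered
    tree with at most max_size nodes exactly once."""
    level = [(0,)]                 # the one-node tree
    all_trees = [(0,)]
    for _ in range(max_size - 1):
        level = [t2 for t in level for t2 in expansions(t)]
        all_trees.extend(level)
    return all_trees
```

Each tree has a unique parent in the graph (drop its last preorder node), which is why rightmost expansion enumerates every ordered tree exactly once.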
Results: Semi-structured Data Mining
A fast algorithm for discovering frequent ordered tree patterns: an extension of association rule discovery to semi-structured data.
– Efficient Substructure Discovery from Large Semi-structured Data
– Asai, Abe, Kawasoe, Arimura, Sakamoto, Arikawa
– Proc. 2nd SIAM International Conference on Data Mining (SDM'02), Arlington, April 2002. (To appear)
– Also presented at the IEICE DE workshop (Oct.), the JSAI SIG-FAI/KBS workshop (Nov.), and DEWS'02.
Correctness and Complexity of the FREQT Algorithm
Theorem: For any frequency threshold 0 < σ ≤ 1, the proposed algorithm Find-Freq-Trees enumerates exactly all the frequent ordered tree patterns.
Recently we obtained tight upper and lower bounds on the complexity (PKDD'02, Aug. 2002):
– Upper bound: linear time for constant-size patterns.
– Lower bound: for patterns of arbitrary size, even approximation within an error ratio close to 1 is NP-complete.