IEPAD: Information Extraction based on Pattern Discovery
Chia-Hui Chang
National Central University, Taiwan
http://www.csie.ncu.edu.tw/~chia
2001/5/4 2
Outline
Introduction
Problem definition
Related Work
System architecture
Extraction rule generation
Experiments
Summary and future work
Introduction
Web information integration
  • Multi-search engines, e.g. Metacrawler
  • Shopping agents, etc.
Common tasks
  • Data collection
  • Information extraction
Information Extraction
Information Extraction (IE)
  • Input: HTML pages
  • Output: a set of records
Related Work
Extractor generation
  • Hand-coded wrappers, written by observation
  • Machine-learning-based approaches
      • WIEN (Kushmerick), 1997
      • SoftMealy (Hsu), 1998
      • STALKER (Muslea), 1999
  • Fully automatic approaches
      • Embley et al., 1999
      • Chang et al., 2000
System Architecture
[Figure: system architecture. Components: Rule Generator, Pattern Viewer, Extractor. Data labels: HTML Page, Patterns, Users, Extraction Rule, HTML Pages, Extraction Results.]
Pattern Discovery based IE
Motivation
  • The display of multiple records often forms a repeated pattern
  • The occurrences of the pattern are spaced regularly and adjacently
Now the problem becomes ...
  • Find regular and adjacent repeats in a string
The Rule Generator
Translator
PAT tree construction
Pattern validator
Rule composer
[Figure: rule generator pipeline. HTML Page → Token Translator → A Token String → PAT Tree Constructor → PAT Trees and Maximal Repeats → Validator → Advanced Patterns → Rule Composer → Extraction Rules]
1. Web Page Translation
Encoding of HTML source
  • Rule 1: Each tag is encoded as a token
  • Rule 2: Any text between two tags is translated to a special token called TEXT (denoted by an underscore)
HTML example:
<B>Congo</B><I>242</I><BR>
<B>Egypt</B><I>20</I><BR>
Encoded token string:
T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
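The two encoding rules can be sketched in a few lines of Python. This is only an illustration, assuming that a simple regular expression is enough to locate tags and that whitespace-only text produces no TEXT token; the name `encode_all_tags` is hypothetical and not part of IEPAD.

```python
import re

# Matches one HTML tag; everything between two tags is treated as text.
TAG_RE = re.compile(r"<[^>]*>")

def encode_all_tags(html):
    """Rule 1: each tag becomes a token. Rule 2: text between tags becomes T(_)."""
    tokens = []
    pos = 0
    for m in TAG_RE.finditer(html):
        if html[pos:m.start()].strip():          # non-blank text before this tag
            tokens.append("T(_)")
        name = re.match(r"</?\s*([A-Za-z0-9]+)", m.group(0))
        if name:                                 # skip comments and other non-tags
            slash = "/" if m.group(0).startswith("</") else ""
            tokens.append("T(<%s%s>)" % (slash, name.group(1).upper()))
        pos = m.end()
    if html[pos:].strip():                       # trailing text after the last tag
        tokens.append("T(_)")
    return tokens

page = "<B>Congo</B><I>242</I><BR>\n<B>Egypt</B><I>20</I><BR>"
print("".join(encode_all_tags(page)))
```

Both records of the Congo example map to the same seven-token sequence, which is what makes the repeat detectable.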
Various Encoding Schemes
Block-level tags
  Headings          H1~H6
  Text containers   P, PRE, BLOCKQUOTE, ADDRESS
  Lists             UL, OL, LI, DL, DIR, MENU
  Others            DIV, CENTER, FORM, HR, TABLE, BR
Text-level tags
  Logical markup    EM, STRONG, DFN, CODE, SAMP, KBD, VAR, CITE
  Physical markup   TT, I, B, U, STRIKE, BIG, SMALL, SUB, SUP, FONT
  Special markup    A, BASEFONT, IMG, APPLET, PARAM, MAP, AREA
Figure 2. Tag classification
Example of BL Encoding
Encoding scheme = Block-Level Tags
  • Rule 1': Only block-level tags are considered; each such tag is encoded as a token
  • Rule 2: Any text between two tags is translated to a special token called TEXT (denoted by an underscore)
HTML example:
<dl><dt><b>1.</b><b><a ...>MGI 2.4 - Mouse <em>Genome</em> … </a><dd>The Mouse <b>Genome</b> Informatics (MGI) ..<br><span>URL:www.informatics.jax.org/ </span><br><a ...> …</a><a ...>…</a><img src=…><a ...>…</a>Facts about:<a> …</a></dl>
Encoded token string:
<dl> <dt> _ <dd> _ <br> _ <br> _ </dl>
1 5 9 64 68
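A possible sketch of block-level encoding: tags outside the kept set are skipped, so the text around them merges into one TEXT token. The kept set below follows the block-level rows of the tag classification, plus DT and DD, which are added here as an assumption because they appear in the encoded example. The function name and the shortened input are hypothetical.

```python
import re

# Captures the optional closing slash and the tag name.
TAG_RE = re.compile(r"<(/?)\s*([A-Za-z][A-Za-z0-9]*)[^>]*>")

# Block-level tags; DT and DD are an assumption (see the note above).
KEPT = {"H1", "H2", "H3", "H4", "H5", "H6", "P", "PRE", "BLOCKQUOTE",
        "ADDRESS", "UL", "OL", "LI", "DL", "DIR", "MENU", "DIV", "CENTER",
        "FORM", "HR", "TABLE", "BR", "DT", "DD"}

def encode_block_level(html, kept=KEPT):
    """Encode kept tags as tokens; all text between two kept tags becomes one '_'."""
    tokens, pos, has_text = [], 0, False
    for m in TAG_RE.finditer(html):
        if html[pos:m.start()].strip():
            has_text = True
        pos = m.end()
        name = m.group(2).upper()
        if name not in kept:
            continue                 # ignored tag: surrounding text runs merge
        if has_text:
            tokens.append("_")
            has_text = False
        tokens.append("<%s%s>" % (m.group(1), name))
    if has_text or html[pos:].strip():
        tokens.append("_")
    return tokens

html = ("<dl><dt><b>1.</b><a href='x'>MGI <em>Genome</em></a>"
        "<dd>The Mouse <b>Genome</b><br><span>URL</span><br><a>more</a></dl>")
print(" ".join(encode_block_level(html)))
```

On this shortened input the result has the same shape as the encoded string on the slide: DL, DT, TEXT, DD, TEXT, BR, TEXT, BR, TEXT, /DL.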
2. PAT Tree Construction
PAT tree: a binary suffix tree
  • A Patricia tree constructed over all possible suffix strings of a text
Example
Token codes:
T(<B>)    000
T(</B>)   001
T(<I>)    010
T(</I>)   011
T(<BR>)   100
T(_)      110
Token string:
T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
Binary encoding:
000110001010110011100000110001010110011100
Indexing positions:
suffix 1    000110001010110011100000110001010110011100$
suffix 2    110001010110011100000110001010110011100$
suffix 3    001010110011100000110001010110011100$
suffix 4    010110011100000110001010110011100$
suffix 5    110011100000110001010110011100$
suffix 6    011100000110001010110011100$
suffix 7    100000110001010110011100$
suffix 8    000110001010110011100$
suffix 9    110001010110011100$
suffix 10   001010110011100$
suffix 11   010110011100$
suffix 12   110011100$
suffix 13   011100$
suffix 14   100$
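The binary string and the 14 indexed suffixes can be reproduced with the short sketch below. It only generates the input that a PAT tree would be built over; it does not build the tree itself. The helper names are hypothetical.

```python
# 3-bit code for each token, as listed in the example.
CODES = {"<B>": "000", "</B>": "001", "<I>": "010",
         "</I>": "011", "<BR>": "100", "_": "110"}

def to_bits(tokens):
    """Concatenate the fixed-width binary code of every token."""
    return "".join(CODES[t] for t in tokens)

def token_suffixes(bits, width=3):
    """One suffix per token boundary, each terminated by '$'."""
    return [bits[i:] + "$" for i in range(0, len(bits), width)]

record = ["<B>", "_", "</B>", "<I>", "_", "</I>", "<BR>"]
bits = to_bits(record * 2)            # two records: Congo and Egypt
for n, suffix in enumerate(token_suffixes(bits), start=1):
    print("suffix %-2d %s" % (n, suffix))
```

Suffixes start only at token boundaries (every third bit), so suffixes 1 and 8 share the whole encoded record as a common prefix.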
The Constructed PAT Tree
Figure 3. The PAT tree for the Congo code
[Figure: binary tree with nodes labeled a to m. Abbreviated label strings shown in the figure: 0110001010110011100, 1010110011100, 01010110011100, 0110011100, 11100]
Definition of Maximal Repeats
Let $\alpha$ occur in $S$ at positions $p_1, p_2, p_3, \dots, p_k$.
  • $\alpha$ is left maximal if there exists at least one $(i, j)$ pair such that $S[p_i - 1] \neq S[p_j - 1]$
  • $\alpha$ is right maximal if there exists at least one $(i, j)$ pair such that $S[p_i + |\alpha|] \neq S[p_j + |\alpha|]$
  • $\alpha$ is a maximal repeat if it is both left maximal and right maximal
Finding Maximal Repeats
Definition:
  • Call character $S[p_i - 1]$ the left character of suffix $p_i$
  • A node $v$ is left diverse if at least two leaves in $v$'s subtree have different left characters
Lemma:
  • The path label $\alpha$ of an internal node $v$ in a PAT tree is a maximal repeat if and only if $v$ is left diverse
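For small inputs, maximal repeats can be checked directly from the left-maximal and right-maximal conditions, without a PAT tree. The brute-force sketch below is for illustration only and is far slower than the tree-based method of the lemma. It assumes that a position before the start or after the end of the sequence counts as a distinct boundary symbol; the function name is hypothetical.

```python
from collections import defaultdict

def maximal_repeats(seq):
    """Return {pattern: [start positions]} for every maximal repeat in seq."""
    n = len(seq)
    result = {}
    for length in range(1, n):
        occurrences = defaultdict(list)
        for i in range(n - length + 1):
            occurrences[tuple(seq[i:i + length])].append(i)
        for pattern, positions in occurrences.items():
            if len(positions) < 2:
                continue                      # not a repeat
            # None stands for the boundary symbol outside the sequence.
            left = {seq[p - 1] if p > 0 else None for p in positions}
            right = {seq[p + length] if p + length < n else None
                     for p in positions}
            if len(left) > 1 and len(right) > 1:   # left and right maximal
                result[pattern] = positions
    return result

print(maximal_repeats(list("xabcyabcz")))
```

In the string "xabcyabcz", "ab" repeats but is always followed by "c", so it is not right maximal; only "abc" qualifies.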
3. Pattern Validator
Suppose the occurrences of a maximal repeat $\alpha$ are ordered by position such that $p_1 < p_2 < p_3 < \dots < p_k$, where $p_i$ denotes the position of each suffix in the encoded token sequence.
Characteristics of a pattern
  • Regularity: variance coefficient
    $$V(\alpha) = \frac{\mathrm{StdDev}(\{p_{i+1} - p_i \mid 1 \le i < k\})}{\mathrm{Mean}(\{p_{i+1} - p_i \mid 1 \le i < k\})}$$
  • Adjacency: density
    $$D(\alpha) = \frac{k \cdot |\alpha|}{p_k - p_1 + |\alpha|}$$
Pattern Validator (Cont.)
Basic screening
For each maximal repeat $\alpha$, compute $V(\alpha)$ and $D(\alpha)$
  a) check the pattern's variance: $V(\alpha) < 0.5$
  b) check the pattern's density: $0.25 < D(\alpha) < 1.5$
[Flowchart: if $V(\alpha) < 0.5$ fails, discard; otherwise, if $0.25 < D(\alpha) < 1.5$ fails, discard; otherwise keep the pattern]
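The screening step can be sketched as follows, reading the variance coefficient as the standard deviation of the gaps between consecutive occurrences divided by their mean, and the density as k·|α| / (p_k − p_1 + |α|). Using the population standard deviation is an assumption, since the slides do not say which variant is meant; the function names are hypothetical.

```python
from statistics import mean, pstdev

def variance_coefficient(positions):
    """StdDev / Mean of the gaps between consecutive occurrence positions."""
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return pstdev(gaps) / mean(gaps)

def density(positions, length):
    """k * |alpha| / (p_k - p_1 + |alpha|), where length = |alpha|."""
    k = len(positions)
    return k * length / (positions[-1] - positions[0] + length)

def passes_basic_screening(positions, length,
                           v_max=0.5, d_min=0.25, d_max=1.5):
    """Apply the two threshold checks from the basic screening step."""
    if len(positions) < 2:
        return False
    return (variance_coefficient(positions) < v_max
            and d_min < density(positions, length) < d_max)

# A 7-token pattern repeated back to back: perfectly regular and adjacent.
print(passes_basic_screening([1, 8, 15], 7))
# Irregular, sparse occurrences of a 2-token pattern.
print(passes_basic_screening([0, 10, 100], 2))
```

A density of exactly 1 means the occurrences tile the covered span with no gaps; values above 1 can arise when occurrences overlap.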
4. Rule Composer
Occurrence partition
  • Flexible variance threshold control
Multiple string alignment
  • Increase density of a pattern
[Flowchart: decision nodes $V(\alpha) < 0.5$, $0.25 < D(\alpha) < 1.5$, $D(\alpha) < 1$, and $V(\alpha) < 0.1$; processing steps Occurrence Partition and Multiple String Alignment; failing patterns are discarded]
Occurrence Partition
Problem
  • Some patterns are divided into several blocks
  • Ex: Lycos, Excite with large regularity
Solution
  • Clustering of the occurrences of such a pattern
[Flowchart: Clustering, then check $V(\alpha) < 0.1$; No: discard; Yes: check density]
Multiple String Alignment
Problem
  • Patterns with density less than 1 can extract only part of the information
Solution
  • Align k-1 substrings among the k occurrences
  • A natural generalization of alignment for two strings, which can be solved in O(n*m) time by dynamic programming, where n and m are the string lengths
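The two-string case can be sketched with the classic dynamic-programming table. The unit costs for mismatches and gaps are an assumption chosen for simplicity, not the scoring used in IEPAD, and the function name is hypothetical.

```python
def align_pair(a, b, gap="-"):
    """Globally align two strings in O(n*m) time with unit mismatch/gap costs."""
    n, m = len(a), len(b)
    # cost[i][j] = minimum cost of aligning a[:i] with b[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]),
                             cost[i - 1][j] + 1,
                             cost[i][j - 1] + 1)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and cost[i][j] == cost[i - 1][j - 1] + (a[i - 1] != b[j - 1])):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            out_a.append(a[i - 1]); out_b.append(gap); i -= 1
        else:
            out_a.append(gap); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))

print(align_pair("adcwbd", "adcxb"))
```

Aligning more than two strings exactly is much more expensive, so practical systems usually build a multiple alignment from pairwise steps like this one.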
Multiple String Alignment (Cont.)
Suppose “adc” is the discovered pattern for token string “adcwbdadcxbadcxbdadcb”
If we have the following multiple alignment for strings “adcwbd”, “adcxb” and “adcxbd”:
a d c w b d
a d c x b -
a d c x b d
The extraction pattern can be generalized as “adc[w|x]b[d|-]”
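The example can be reproduced with a small sketch: cut the token string at each occurrence of the pattern, then merge the aligned rows column by column. The rows are taken as already aligned, and both function names are hypothetical.

```python
def substrings_between_occurrences(s, pattern):
    """Cut s at every start of pattern; k occurrences give k-1 substrings."""
    starts = []
    i = s.find(pattern)
    while i != -1:
        starts.append(i)
        i = s.find(pattern, i + 1)
    return [s[a:b] for a, b in zip(starts, starts[1:])]

def generalize(aligned_rows):
    """Merge equal-length aligned rows into one pattern with [x|y] alternatives."""
    parts = []
    for column in zip(*aligned_rows):
        symbols = []
        for s in column:
            if s not in symbols:          # keep first-seen order
                symbols.append(s)
        if len(symbols) == 1:
            parts.append(symbols[0])
        else:
            parts.append("[" + "|".join(symbols) + "]")
    return "".join(parts)

print(substrings_between_occurrences("adcwbdadcxbadcxbdadcb", "adc"))
print(generalize(["adcwbd", "adcxb-", "adcxbd"]))
```

The gap symbol in an alternative such as [d|-] marks a token that may be absent from a record.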
Pattern Viewer
Java-application based GUI
Web based GUI
  • http://140.115.155.102/WebIEPAD/
The Extractor
Matching the pattern against the encoded token string
  • Knuth-Morris-Pratt algorithm
  • Boyer-Moore algorithm
Alternatives in a rule
  • Match the longest pattern
What is extracted?
  • The whole record
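As an illustration of the matching step, here is a sketch of Knuth-Morris-Pratt search over a token list. It handles only a fixed pattern without alternatives, so it shows the basic algorithm rather than the full extractor; the function names are hypothetical.

```python
def failure_table(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

def kmp_find_all(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    if not pattern:
        return []
    fail = failure_table(pattern)
    hits, k = [], 0
    for i, token in enumerate(text):
        while k > 0 and token != pattern[k]:
            k = fail[k - 1]               # fall back without re-reading text
        if token == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]               # allow overlapping matches
    return hits

record = ["<B>", "_", "</B>", "<I>", "_", "</I>", "<BR>"]
print(kmp_find_all(record * 2, record))
```

Each hit marks the start of one record in the token string; mapping token positions back to page offsets would be needed to return the record text.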
Experiment Setup
Fourteen sources: search engines
Performance measures
  • Number of patterns
  • Retrieval rate and accuracy rate
Parameters
  • Encoding scheme
  • Threshold control
# of Patterns Discovered Using Block-Level Encoding
Figure 5. Number of patterns validated
[Plot: number of patterns (y-axis, 0 to 14) against density (x-axis, 0 to 1), with curves for r=0.25, r=0.5, and r=0.75]
Average: 117 maximal repeats in our test Web pages
Translation
Table 2. Size of translated sequences and number of patterns
Encoding Scheme   Length of Sequence   No. of Patterns
All Tag           1128                 7.9
No Physical       873                  6.5
No Special        796                  5.7
Block-Level       514                  4.4
Average page length is 22.7 KB
Accuracy and Retrieval Rate
Table 4. Effect of advanced techniques
Method                                             Retrieval Rate   Accuracy Rate   Matching Percentage
Block-level Encoding                               0.86             0.86            0.78
Occurrence Partition                               0.92             0.91            0.85
Occurrence Partition + Multiple String Alignment   0.97             0.94            0.90
Table 3. Basic screening (without Rule Composer)
Encoding Scheme   Retrieval Rate   Accuracy Rate   Matching Percentage
All Tag           0.73             0.82            0.60
No Physical       0.82             0.89            0.68
No Special        0.84             0.88            0.70
Block-Level       0.86             0.86            0.78
Accuracy and Retrieval Rate
Table 5. The performance of multiple string alignment
Search Engine    Retrieval Rate   Accuracy Rate   Matching Percentage
AltaVista        1.00             1.00            0.91
Cora             1.00             1.00            0.97
Excite           1.00             0.97            1.00
Galaxy           1.00             0.95            0.99
Hotbot           0.97             0.86            0.88
Infoseek         0.98             0.94            0.87
Lycos            0.94             0.63            0.94
Magellan         1.00             1.00            0.76
Metacrawler      0.90             0.96            0.78
NorthernLight    0.95             0.96            0.90
Openfind         0.83             0.90            0.66
Savvysearch      1.00             0.95            0.97
Stpt.com         0.99             1.00            0.95
Webcrawler       0.98             0.98            0.98
Average          0.97             0.94            0.90
Summary
IEPAD: Information Extraction based on Pattern Discovery
  • Rule generator
  • The extractor
  • Pattern viewer
Performance
  • 97% retrieval rate and 94% accuracy rate
Problems
Guarantees a high retrieval rate rather than a high accuracy rate
  • A generalized rule can extract more than the desired data
Currently applicable only when there are several records in a Web page
Final
Acknowledgement
  • We would like to thank Lee-Feng Chien, Ming-Jer Lee and Jung-Liang Chen for providing their PAT tree code.
Reference
  • Chang, C.H. and Lui, S.C. IEPAD: Information Extraction based on Pattern Discovery, WWW10, May 2001, Hong Kong.
Future Work
Interface for choosing a pattern
  • http://www.csie.ncu.edu.tw/~chia/webiepad/
Multi-level extraction
  • From record boundary extraction to attribute value extraction
Extractors in Java and C++
Rule Format
level 1 encoding scheme: rule
level 2 encoding scheme: rule for block 1
level 2 encoding scheme: rule for block 2
...
level 2 encoding scheme: rule for block k
level 1 block 1, level 2 block no for attribute 1
level 1 block 1, level 2 block no for attribute 2
...
level 1 block 1, level 2 block no for attribute t
(k blocks, t attributes)
Example (cont.)
Line 0: Blocklevel.h, <DL><DT>String<DD>String<BR>String<BR>String<BR>String</DD></DL>
Line 1: Alltag.h, rule for block 1
Line 2: Alltag.h, rule for block 2
...
Line k: Alltag.h, rule for block k
Line k+1: level 1 block no, level 2 block no for attribute 1
Line k+2: level 1 block no, level 2 block no for attribute 2
...
Line k+t: level 1 block no, level 2 block no for attribute t
Demo
  ex: 3, 2
  ex: 5, all
  ex: 5, 1 3
Congo Example
Performance Evaluation
Definition:
  • A pattern is said to enumerate a record if the overlapping percentage between the record and the pattern is greater than a threshold
Three measures
  • Retrieval rate
  • Accuracy rate
  • Matching percentage
Illustration
Let $G_{i,j}$ denote the ordered occurrences $p_i, p_{i+1}, \dots, p_j$
S = ∅; i = 1;
for j = 1 to k-1 do
    if R(G_{i,j+1}) > [threshold] then
        if R(G_{i,j}) < m then
            S = S ∪ {G_{i,j}};
        endif
        i = j+1;
    endif
endfor
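A Python sketch of this partition loop follows. Several points are assumptions rather than part of the slides: R is taken to be the variance coefficient of the gaps between occurrences, the two thresholds are hypothetical parameters with illustrative defaults, groups with fewer than two occurrences are dropped, and the final group is also checked after the loop ends.

```python
from statistics import mean, pstdev

def regularity(group):
    """Assumed meaning of R: StdDev / Mean of the gaps (0.0 for fewer than 2 gaps)."""
    gaps = [b - a for a, b in zip(group, group[1:])]
    if len(gaps) < 2:
        return 0.0
    return pstdev(gaps) / mean(gaps)

def occurrence_partition(positions, split_threshold=0.5, keep_threshold=0.1):
    """Split ordered occurrence positions into regularly spaced clusters."""
    clusters, i = [], 0
    for j in range(len(positions) - 1):
        # Adding occurrence j+1 would make the current group too irregular.
        if regularity(positions[i:j + 2]) > split_threshold:
            group = positions[i:j + 1]
            if len(group) >= 2 and regularity(group) < keep_threshold:
                clusters.append(group)
            i = j + 1
    tail = positions[i:]                   # assumption: flush the last group
    if len(tail) >= 2 and regularity(tail) < keep_threshold:
        clusters.append(tail)
    return clusters

print(occurrence_partition([0, 5, 10, 15, 100, 105, 110]))
```

The large jump from 15 to 100 separates the occurrences into two blocks, each of which is regular on its own.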