ROADRUNNER: Towards Automatic Data Extraction from Large Web Sites
Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo
VLDB 2001

Overview
Automatically generates a wrapper from large structured Web pages
Supports nested structures
Efficient approach to large, complex pages with regular structures

Approach
Given a set of example pages
Generate a Union-free Regular Expression (UFRE)
Find the least upper bound on the RE lattice to generate a wrapper
Reduces to finding the least upper bound of two UFREs
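
To make the notion of a union-free regular expression more concrete, here is a minimal sketch of one possible in-memory representation. The class names (`Token`, `PCData`, `Opt`, `Plus`) and the `render` helper are hypothetical and are not taken from the paper. The point is that the expression language has concatenation, optionals and iterators, but no alternation node.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    text: str  # a literal tag or fixed string, e.g. "<li>" or "Title:"


@dataclass(frozen=True)
class PCData:
    pass  # a data field, written #PCDATA in the slides


@dataclass(frozen=True)
class Opt:
    body: tuple  # optional sub-expression, written (...)?


@dataclass(frozen=True)
class Plus:
    body: tuple  # iterator over a sub-expression, written (...)+


def render(expr):
    """Render a sequence of nodes in the notation used on the slides."""
    parts = []
    for node in expr:
        if isinstance(node, Token):
            parts.append(node.text)
        elif isinstance(node, PCData):
            parts.append("#PCDATA")
        elif isinstance(node, Opt):
            parts.append("(" + render(node.body) + ")?")
        elif isinstance(node, Plus):
            parts.append("(" + render(node.body) + ")+")
        else:
            raise TypeError(f"unexpected node: {node!r}")
    return "".join(parts)


# Illustrative wrapper: a list whose items each hold one title field.
item = (Token("<li>"), Token("<i>"), Token("Title:"), Token("</i>"),
        PCData(), Token("</li>"))
wrapper = (Token("<ul>"), Plus(item), Token("</ul>"))
print(render(wrapper))
```

Because `Opt` and `Plus` hold tuples of nodes, nested structures (an iterator inside an iterator, for instance) fit the same representation.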

Matching/Mismatching
Start with the first page and create a RE that defines the wrapper
Match each successive sample against the wrapper
Mismatches result in generalizations of the regular expression
Types of mismatches
– String mismatches
– Tag mismatches
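
The matching step can be sketched as a parallel scan over two token lists that stops at the first disagreement and classifies it. This is a simplified illustration, not the paper's implementation: the tokenizer, the function names and the second page's text ("Paul Jones") are assumptions, and the scan ignores optionals and iterators already present in the wrapper.

```python
import re

# A token is either a tag or a run of text between tags (simplified assumption).
TOKEN = re.compile(r"<[^>]+>|[^<]+")


def tokenize(html):
    """Split HTML into tag tokens and non-empty text tokens."""
    return [t.strip() for t in TOKEN.findall(html) if t.strip()]


def is_tag(token):
    return token.startswith("<") and token.endswith(">")


def first_mismatch(wrapper, sample):
    """Return (index, kind) of the first mismatch, or None if none is found.

    kind is "string" when both tokens are text, and "tag" when at least
    one of them is a tag. Differences in length are not examined here.
    """
    for i, (w, s) in enumerate(zip(wrapper, sample)):
        if w == s or (w == "#PCDATA" and not is_tag(s)):
            continue
        kind = "tag" if is_tag(w) or is_tag(s) else "string"
        return i, kind
    return None


page1 = tokenize("<b>John Smith</b>")
page2 = tokenize("<b>Paul Jones</b>")
print(first_mismatch(page1, page2))
```

A string mismatch leads to a field, while a tag mismatch triggers the search for an iterator or an optional.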

Example Pages

Example
String mismatches are used to discover fields of the documents
Wrapper is generated by replacing “John Smith” with #PCDATA
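
A minimal sketch of this generalization, assuming both pages have already been tokenized and differ only in text tokens. The function name and the second page's value ("Paul Jones") are illustrative assumptions.

```python
def is_tag(token):
    return token.startswith("<") and token.endswith(">")


def generalize_strings(wrapper, sample):
    """Replace every text token that differs between the two lists by #PCDATA.

    Assumes the lists have equal length and that tags agree position by
    position; tag mismatches need the optional/iterator handling instead.
    """
    if len(wrapper) != len(sample):
        raise ValueError("token lists differ in length")
    result = []
    for w, s in zip(wrapper, sample):
        if w == s:
            result.append(w)           # same tag or same fixed string
        elif not is_tag(w) and not is_tag(s):
            result.append("#PCDATA")   # differing strings become a field
        else:
            raise ValueError(f"tag mismatch: {w!r} vs {s!r}")
    return result


wrapper = ["<b>", "John Smith", "</b>"]
sample = ["<b>", "Paul Jones", "</b>"]
print(generalize_strings(wrapper, sample))
```

Strings that are identical on both pages (labels such as "Title:") stay in the wrapper as fixed text.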

Example (Cont.)
Tag Mismatches: Discovering Optionals
First check to see if mismatch is caused by an iterator
If not, could be an optional field in wrapper or sample
Cross search used to determine possible optionals
Image field determined to be optional
– (<img src=…/>)?
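
The cross search can be sketched as follows: at a tag mismatch, look for the wrapper's token further along the sample, and for the sample's token further along the wrapper. Whichever side has to be skipped to realign holds the optional. This is a simplified illustration under stated assumptions: the function name is hypothetical, only the first realignment point is tried, and there is no backtracking.

```python
def add_optional(wrapper, sample, i, j):
    """Generalize the wrapper with an optional at mismatch positions i and j.

    wrapper and sample are token lists; wrapper[i] and sample[j] are the
    mismatching tokens. Returns the new wrapper, or None if the cross
    search finds no realignment point.
    """
    # Case 1: the sample has extra tokens before the wrapper's token reappears.
    if wrapper[i] in sample[j + 1:]:
        k = sample.index(wrapper[i], j + 1)
        optional = "(" + "".join(sample[j:k]) + ")?"
        return wrapper[:i] + [optional] + wrapper[i:]
    # Case 2: the wrapper has extra tokens before the sample's token reappears.
    if sample[j] in wrapper[i + 1:]:
        k = wrapper.index(sample[j], i + 1)
        optional = "(" + "".join(wrapper[i:k]) + ")?"
        return wrapper[:i] + [optional] + wrapper[k:]
    return None


wrapper = ["<b>", "#PCDATA", "</b>", "<ul>"]
sample = ["<b>", "#PCDATA", "</b>", "<img src=…/>", "<ul>"]
print(add_optional(wrapper, sample, 3, 3))
```

The same result is produced when the image appears in the wrapper and is missing from the sample, since the second branch handles that direction.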

Example (Cont.)
Tag Mismatches: Discovering Iterators
Assume mismatch is caused by repeated elements in a list
Match possible squares against earlier squares
Generalize the wrapper by finding all contiguous repeated occurrences
– (<li><i>Title:</i>#PCDATA</li>)+
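
A rough sketch of square discovery on a single token list, assuming the token just before the mismatch is the closing tag of the repeated element. The function names and the book titles are illustrative assumptions, and the real algorithm also handles nested squares and backtracking, which this sketch does not.

```python
def is_tag(token):
    return token.startswith("<") and token.endswith(">")


def find_square(tokens, j):
    """Return the generalized square starting at j, or None.

    The candidate runs from j to the next occurrence of the token just
    before j, and must match the equally long run that precedes it.
    """
    terminal = tokens[j - 1]
    if terminal not in tokens[j:]:
        return None
    end = tokens.index(terminal, j)
    n = end - j + 1
    if j - n < 0:
        return None
    earlier, candidate = tokens[j - n:j], tokens[j:end + 1]
    square = []
    for a, b in zip(earlier, candidate):
        if a == b:
            square.append(a)
        elif not is_tag(a) and not is_tag(b):
            square.append("#PCDATA")  # differing strings become a field
        else:
            return None               # tags disagree, so not a square
    return square


def collapse(tokens, square):
    """Replace each run of contiguous occurrences of square by one (...)+."""
    n = len(square)
    plus = "(" + "".join(square) + ")+"

    def matches(at):
        chunk = tokens[at:at + n]
        return len(chunk) == n and all(
            s == c or (s == "#PCDATA" and not is_tag(c))
            for s, c in zip(square, chunk))

    out, i = [], 0
    while i < len(tokens):
        if matches(i):
            while matches(i):
                i += n
            out.append(plus)
        else:
            out.append(tokens[i])
            i += 1
    return out


tokens = ["<ul>",
          "<li>", "<i>", "Title:", "</i>", "XML at Work", "</li>",
          "<li>", "<i>", "Title:", "</i>", "HTML Scripts", "</li>",
          "</ul>"]
square = find_square(tokens, 7)
print(collapse(tokens, square))
```

Here position 7 is where the second list item starts; the token before it, `</li>`, delimits the candidate square.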

Extracted Result

Recursive Example

Complexity

Discussion
Assumptions
– Pages are well-structured
– Want to extract at the level of entire fields
– Structure can be modeled without disjunctions
Search space for explaining mismatches is huge
– Uses a number of heuristics to prune the space
Limited backtracking
Limit on the number of choices to explore
Patterns cannot be delimited by optionals
– Will result in pruning possible wrappers

Experimental Result

Comparison with Other Works

Columns: Name | Structure | Semi | Free | Single-slot | Multi-slot | Missing items | Permutations | Nested data | Resilient
(marks listed in column order; empty cells omitted)
WIEN: X X X
SoftMealy: X X X X X X*
STALKER: X X X * X X X
RAPIER: X X ? X X X ?
SRV: X X ? X X X ?
WHISK: X X X X X X X* ?
AutoSlog: X X X X
ROADRUNNER: X X X X X
BYU Onto: X X ? X X X X X X
X means the information extraction system has the capability; X* means the system has the ability as long as the training corpus can accommodate the required training data; ? means the system has the ability to some degree; * means that the extraction pattern itself does not show the ability, but the overall system has the capability.