Effective Web Scraping with OXPath
DIADEM domain-centric intelligent automated data extraction methodology
Effective Web Scraping with OXPath
http://oxpath.org
Giovanni Grasso - Oxford University
May 15th, 2013 @ WWW developer track
joint work with Tim Furche, Christian Schallhart,
Wednesday, 15 May 13
1 OXPath » Lingua Franca for Web Extraction
A Call for Action in Web Extraction!
Past: Form Filling + HTML Patterns
Now: Interaction + DOM Patterns
getting to the data requires interaction not just form filling
identifying relevant data from rendered DOMs
across several pages
access to all CSS properties (computed style)
The nesting in the result mirrors the structure of the OXPath expression: extraction markers in a predicate (title, source) represent attributes to the last marker outside the predicate (story).
Kleene Star. Finally, we add the Kleene star, as in [12]. For example, the following expression queries Google for “Oxford”, traverses all accessible result pages and extracts all links.
doc("google.com")/descendant::field()[1]/{"Oxford"}/following::field()[1]/{click /}
/( /descendant::a:<Link=(@href)>[.#="Next"]/{click /} )*
To limit the range of the Kleene star, one can specify upper and lower bounds on the multiplicity, e.g., (...)*{3,8}.
doc("zillow.com")/descendant::field()[1]/{"Seattle"}
  /following::*#GOButton/{click/}
  /descendant::input[@type='checkbox'][2]/{uncheck}
  /following::field()[1]/{uncheck/}
  //div[.="Beds"]/following-sibling::select/{"3+"/}
  //a[./text()="More filters"]/{click/}
  //input[following-sibling[contains(.,"Multi Family")]]/{uncheck/}
  /(//span.arrowNext/a/{click/})*
  //ul#search-results/li:<property>
    [//div.property-info//a/{click/}
     //a[.#='More Facts']/{click/}
     //div.home-facts/table:<facts=(.)>]
2. REFERENCES
[1] G. O. Arocena and A. O. Mendelzon. WebOQL: Restructuring documents, databases, and webs. In ICDE, 24–33, 1998.
[2] R. Baumgartner, S. Flesca, and G. Gottlob. Visual web information extraction with Lixto. In VLDB, 119–128, 2001.
[3] M. Benedikt and C. Koch. XPath Leashed. CSUR, 41(1):3:1–3:54, 2007.
[4] M. Bolin, M. Webber, P. Rha, T. Wilson, and R. C. Miller. Automation and customization of rendered web pages. In UIST, 163–172, 2005.
[5] C.-H. Chang, M. Kayed, M. R. Girgis, and K. F. Shaalan. A survey of web information extraction systems. TKDE, 1411–1428, 2006.
[6] V. Crescenzi and G. Mecca. Automatic information extraction from large websites. JACM, 51(5):731–779, 2004.
[7] G. Gottlob, C. Koch, and R. Pichler. Efficient algorithms for processing XPath queries. TODS, 30(2):444–491, 2005.
[8] G. Leshed, E. M. Haber, T. Matthews, and T. Lau. CoScripter: automating & sharing how-to knowledge in the enterprise. In CHI, 1719–1728, 2008.
[9] J. Lin, J. Wong, J. Nichols, A. Cypher, and T. A. Lau. End-user programming of mashups with Vegemite. In IUI, 97–106, 2009.
[10] L. Liu, C. Pu, and W. Han. XWRAP: An XML-enabled wrapper construction system for web information sources. In ICDE, 611–621, 2000.
[11] M. Liu and T. W. Ling. A rule-based query language for HTML. In DASFAA, 6–13, 2001.
[12] M. Marx. Conditional XPath. TODS, 30(4):929–959, 2005.
[13] A. O. Mendelzon, G. A. Mihaila, and T. Milo. Querying the World Wide Web. In DIS, 80–91, 1996.
[14] A. Sahuguet and F. Azavant. Building light-weight wrappers for legacy web data-sources using W4F. In VLDB, 738–741, 1999.
[15] N. Sawa, A. Morishima, S. Sugimoto, and H. Kitagawa. Wraplet: Wrapping your web contents with a lightweight language. In SITIS, 387–394, 2007.
[16] W. Shen, A. Doan, J. F. Naughton, and R. Ramakrishnan. Declarative information extraction using datalog with embedded extraction predicates. In VLDB, 1033–1044, 2007.
[17] J.-Y. Su, D.-J. Sun, I.-C. Wu, and L.-P. Chen. On design of browser-oriented data extraction system and plug-ins. JMST, 18(2):189–200, 2010.
doc("zillow.com")/descendant::field()[1]/{"Seattle"}
  /following::*#GOButton/{click/}
  /descendant::input[@type='checkbox'][2]/{uncheck}
  /following::field()[1]/{uncheck/}
  //div[.="Beds"]/following-sibling::select/{"3+"/}
  //a[./text()="More filters"]/{click/}
  //input[following-sibling[contains(.,"Multi Family")]]/{uncheck/}
  /(//span.arrowNext/a/{click/})*
  //ul#search-results/li:<property>
    [//div.property-info//a/{click/}
     //div.home-description:<info=(.)>]
Wrapper Babel
Wrapper induction & data extraction systems
each invents its own wrapper language
or uses its own ad-hoc tool or proprietary language
Mainly pattern matching + imperative navigation
mix of XPath & external flow control
limited interaction with complex interfaces
(simple) form filling & submit
focus on automation via visual interfaces
limited extraction language
no multiway navigation
Why OXPath?
an XPath for data extraction
simplicity
learnable
familiarity
scalability
2 OXPath » The Language
OXPath = XPath + 4 extensions:
action
extraction
iteration
style
Start at kayak.co.uk:
doc("kayak.co.uk")
To select an airport, type a few letters and select from the completion list:
//field().destination/{"Sea" /} //div#smartbox//li[1]/{click /}
This will submit the form.
Refine the results by unchecking “2+ stops”:
//*#stops2/{uncheck }
On all result pages
/(//a[.='Next']/{click /})*
and for each flight
//body.resultrow:<flight>
Extract the attributes
Mouseover the ! to extract flight quality warnings
//span.qualityWarningIcon/{mouseover /}
Click on the details to extract layovers
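The steps of this walk-through chain into a single OXPath expression. The following is a minimal sketch in Python that only concatenates the fragments shown on the slides; it does not evaluate them. The names `KAYAK_STEPS` and `compose` are hypothetical, and joining the fragments without any separator is an assumption.

```python
# Fragments copied from the slides, in the order they are introduced.
KAYAK_STEPS = [
    'doc("kayak.co.uk")',
    '//field().destination/{"Sea" /}',
    '//div#smartbox//li[1]/{click /}',
    '//*#stops2/{uncheck }',
    "/(//a[.='Next']/{click /})*",
    '//body.resultrow:<flight>',
]


def compose(steps):
    """Join OXPath path fragments into one expression string (hypothetical helper)."""
    return "".join(steps)


# The composed expression; pass it to an OXPath engine to actually run it.
expression = compose(KAYAK_STEPS)
print(expression)
```

Keeping the fragments in a list makes it easy to try the expression step by step, for example by composing only the first three entries while debugging the form interaction.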
➊ Actions: Browser Interaction
Actions correspond to DOM events, e.g.,
Document: doc("google.com")
Click: {click}
Fill: {"Rio"}
Mouseover: {mouseover}
Executed once on each context node
Return the context nodes for contextual actions, or the root nodes of the new DOM for absolute actions such as {click/}
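The difference between contextual and absolute actions is visible in the syntax: the absolute form carries a trailing slash, as in {click/}. The following is a minimal sketch in Python that classifies action tokens of the forms shown on this slide. The `Action` type, the `parse_action` function, and the regular expression are hypothetical simplifications, not part of the OXPath engine.

```python
import re
from typing import NamedTuple


class Action(NamedTuple):
    name: str       # event keyword such as "click", or the text to type
    is_fill: bool   # True when the action types a quoted string into a field
    absolute: bool  # True when written with a trailing "/" as in {click /}


# Simplified, hypothetical pattern: a keyword or a double-quoted string,
# optionally followed by "/" before the closing brace.
_ACTION = re.compile(r'^\{\s*(?:"([^"]*)"|([A-Za-z]+))\s*(/)?\s*\}$')


def parse_action(token: str) -> Action:
    """Classify one action token; raise ValueError for anything else."""
    match = _ACTION.match(token)
    if match is None:
        raise ValueError(f"not an action: {token!r}")
    text, keyword, slash = match.groups()
    if text is not None:
        return Action(text, True, slash is not None)
    return Action(keyword, False, slash is not None)


for token in ['{click}', '{click /}', '{"Rio"}', '{mouseover}']:
    print(token, parse_action(token))
```

An evaluator built on this distinction would continue from the same context node after a contextual action and from the root of the newly loaded DOM after an absolute one.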
➋ Extraction: Compact Tree Construction
Extraction markers select nodes for extraction
record markers: :<flight>
attribute markers: :<price=string(.)>
Extracted data has tree shape
nesting of extraction markers in the OXPath expression defines the nesting of records and the attribute-record associations in the output
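How record and attribute markers turn into a tree can be sketched in a few lines of Python. The example assumes that each record marker becomes an element and each attribute marker a child element. The `results` wrapper element, the `build_records` helper, and the price values are made up for illustration and do not reproduce the engine's actual output format.

```python
import xml.etree.ElementTree as ET


def build_records(record_name, rows):
    """Build one element per record; each attribute becomes a child element."""
    root = ET.Element("results")  # hypothetical wrapper element
    for row in rows:
        record = ET.SubElement(root, record_name)
        for attr_name, value in row.items():
            ET.SubElement(record, attr_name).text = value
    return root


# Made-up values standing in for nodes matched by
# ...:<flight>[...:<price=string(.)>]
rows = [{"price": "420"}, {"price": "515"}]
tree = build_records("flight", rows)
xml_text = ET.tostring(tree, encoding="unicode")
print(xml_text)
```

Because the attribute marker sits in a predicate of the record marker, each price ends up nested inside its own flight record rather than in a flat list.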
➌ Iteration: Kleene Star
Most web sites use pagination techniques for results
traversing paginated results requires iteration
⇢ extraction from any unbounded component of a link graph
Kleene star with an action in the iterated expression
/(//a[.='Next']/{click /})*
/(//body/{scroll /})* (infinite scroll)
OXPath's evaluation algorithm
buffers in practice only a constant number of pages
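The behaviour of the Kleene star over paginated results can be imitated with a generator. The following is a simplified sketch in Python, not the OXPath evaluation algorithm: `kleene_pages`, the `next_page` callback, and the toy `NEXT` table are hypothetical, and holding only the current page is a stand-in for the constant page buffer. The optional `lower` and `upper` arguments imitate multiplicity bounds such as (...)*{3,8}.

```python
from typing import Callable, Iterator, Optional


def kleene_pages(first_page, next_page: Callable, lower: int = 0,
                 upper: Optional[int] = None) -> Iterator:
    """Yield the pages reached after lower..upper repetitions of next_page.

    Only the current page is kept in memory at any time.
    """
    page, steps = first_page, 0
    while page is not None:
        if steps >= lower:
            yield page
        if upper is not None and steps >= upper:
            return
        page = next_page(page)  # e.g. click the "Next" link; None when absent
        steps += 1


# Toy link graph: page 1 -> 2 -> 3 -> 4, and page 4 has no "Next" link.
NEXT = {1: 2, 2: 3, 3: 4, 4: None}

all_pages = list(kleene_pages(1, NEXT.get))
bounded = list(kleene_pages(1, NEXT.get, lower=1, upper=2))
print(all_pages, bounded)
```

Zero repetitions are included by default, so the first result page is processed as well, which matches how the star is used in the pagination expressions above.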
➍ Style: Querying Visual Attributes
Access to all computed-style CSS properties via the style axis
Visibility: style::display or style::visibility
Font size: style::font-size
Geometry: style::top, style::left, ...
Color: style::color or style::background-color
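To illustrate what a query on computed style selects, here is a toy sketch in Python. It assumes the computed style of each node is already available as a dictionary; the `Node` class and the `visible` function are hypothetical and no browser is involved. The visibility test is deliberately simplified to the two properties named on the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    text: str
    # Hypothetical stand-in for the computed style a browser would report.
    style: Dict[str, str] = field(default_factory=dict)


def visible(nodes: List[Node]) -> List[Node]:
    """Keep nodes that are not hidden by display or visibility."""
    return [
        n for n in nodes
        if n.style.get("display") != "none"
        and n.style.get("visibility") != "hidden"
    ]


nodes = [
    Node("Price: 420", {"display": "block"}),
    Node("tracking pixel", {"display": "none"}),
    Node("old price", {"visibility": "hidden"}),
]
shown = [n.text for n in visible(nodes)]
print(shown)
```

Filtering on the rendered state rather than on markup is what lets a wrapper ignore elements that are present in the DOM but never shown to the user.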
3 Evaluation
[Figure (b) Millions of results: memory [MB] and #pages [1000] / #results [100,000] over time [h]; series: memory, extracted matches, visited pages]
Constant Memory
100,000+ pages, millions of results
[Pie chart; slices: 2%, 13%, 85%; legend: page rendering, browser initialization, OXPath]
it's the browser
[Figure (b) Norm. evaluation time, <150 p.: time w/o page loading [sec] vs. #pages; series: OXPath, Web Content Extractor, Lixto, Visual Web Ripper, Web Harvest, Chickenfoot]
faster
[Figure (c) Norm. evaluation time, <850 p.: time w/o page loading [sec] vs. number of pages; series: OXPath, Lixto, Web Harvest, Chickenfoot]
even faster
[Figure: memory [MB] vs. #pages; series: OXPath, Lixto, Web Harvest, Chickenfoot]
memory
only hundreds of pages, as other tools fail for more pages
3 OXPath » System & Evaluation
Evaluation
constant memory
very low overhead on XPath
minimal page buffer
browser bound
fast
4 OXPath User stories
DIADEM
Unsupervised Domain-specific Web Objects Extraction
presented @ World Wide Web 2012 (WWW’12)
DIADEM := domain-centric intelligent automated data extraction methodology
1 Form Understanding & Filling
2 Object identification & alignment
Result pages
3 Context-driven block analysis
Single entity (details) pages
Tables
Flat Text
Energy Performance Chart
Maps
Floor plans
4 OXPath Wrapper
Cloud extraction
Data integration
4 Wrapper induction in DIADEM
Induced Wrapper (partial)
doc('wwagency.co.uk')//select#sale_type_id/{0/}
  //button.formbtn/{click /}
  (//div.pagenumlinks[last()]//a[last()]/{click /})*
  //div.proplist_wrap:<RECORD>
    [.//span.prop_price:<PRICE=string(.)>]
    [.//ul.prop_keypoints/li[2]/strong:<BEDROOM_ROOMS=string(.)>]
    [.//div.prop_statuses//text():<PROPERTY_STATUS=string(.)>]
    [.//strong.orange:<POSTCODE=string(.)>]
  //div.prop_img/a/{click /}//body
    [.//div#propertypage_copy/p[last()-1]:<DESCRIPTION=string(.)>]
    [.//div#print_contact/address/text()[2]:<ADDRESS=string(.)>]
    [.//a.~'Map view')]/@href:<MAP=string(.)>]
    [.//div#propertypage_copy/p[2]:<RECEPTION_ROOMS=string(.)>]
DEQA
Question Answering on the Deep Web
presented @ International Semantic Web Conference 2012 (ISWC’12)
[Diagram: DEQA example; components: OXPath, Linking-Metric, Domain Specific Triple Store, TBSL; other nodes shown: Kindergarden_A, Kindergarden_B]
Question: House near a Kindergarden under 2,000,000 £?
Answer: White_Road
rdf:type gr:Offering
dd:hasPrice 1,499,950 £
dd:bedrooms 15
dbp:near Kindergarden_A
4 OXPath » DEQA: Question Answering on the Deep Web
RDF Wrapper (partial)
doc('wwagency.co.uk')....
.... //div.proplist_wrap:<gr:Offering>
  [.//span.prop_price:<dd:hasPrice(xsd:double)=string(.)>]
  .....
  [.//strong.orange:<vcard:postal-code=string(.)>]
  ....
  [.//div.prop_img/a/@href:<foaf:page=string(.)>]
  //div.prop_img/a/{click /}//body
  [.//div#propertypage_copy/p[last()-1]:<gr:description=string(.)>]
  [.//a.~'Map view')]/@href:<wgs84:lat=extractLat(.)>]
  [.//a.~'Map view')]/@href:<wgs84:long=extractLong(.)>]
Question translation to SPARQL
Edwardian houses close to supermarket for less than 1,000,000 in Oxfordshire
mapping them to specific restrictions, e.g. cheap could be mapped to costs for flats less than 800 pounds per month.
An example of a successful query is “all houses in Abingdon with more than 2 bedrooms”:
SELECT ?y WHERE {
  ?y a <http://diadem.cs.ox.ac.uk/ontologies/real-estate#House> .
  ?y <http://diadem.cs.ox.ac.uk/ontologies/real-estate#bedrooms> ?y0 .
  ?y <http://www.w3.org/2006/vcard/ns#street-address> ?y1 .
  FILTER(?y0 > 2) .
  FILTER(regex(?y1,'Abingdon','i')) .
}
In that case, TBSL first performs a restriction by class (“House”), then it finds the town name “Abingdon” from the street address and it performs a filter on the number of rooms. Note that most QA systems would not be sufficiently powerful to include such filters.
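What the two filters in the Abingdon query compute can be shown on a handful of in-memory records. The following Python sketch is an illustration only: the `HOUSES` records and the `houses_in_town` function are made up, all records are assumed to be of class House, and the town is matched as a literal, case-insensitive substring, which is narrower than a general SPARQL regex.

```python
import re

# Hypothetical records; in DEQA the data lives in a triple store.
HOUSES = [
    {"id": "h1", "bedrooms": 3, "street_address": "12 Ock Street, Abingdon"},
    {"id": "h2", "bedrooms": 2, "street_address": "4 Bath Street, Abingdon"},
    {"id": "h3", "bedrooms": 5, "street_address": "7 High Street, Oxford"},
]


def houses_in_town(houses, town, min_bedrooms):
    """Mirror FILTER(?y0 > min_bedrooms) and FILTER(regex(?y1, town, 'i'))."""
    # re.escape treats the town as literal text (an assumption of this sketch).
    pattern = re.compile(re.escape(town), re.IGNORECASE)
    return [
        h["id"] for h in houses
        if h["bedrooms"] > min_bedrooms
        and pattern.search(h["street_address"])
    ]


matches = houses_in_town(HOUSES, "abingdon", 2)
print(matches)
```

The numeric comparison is strict, so a house with exactly two bedrooms is excluded, just as in the query's FILTER(?y0 > 2).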
Another example is “Edwardian houses close to supermarket for less than 1,000,000 in Oxfordshire”, which was translated to the following query:
SELECT ?x0 WHERE {
  ?x0 <http://dbpedia.org/property/near> ?y2 .
  ?x0 a <http://diadem.cs.ox.ac.uk/ontologies/real-estate#House> .
  ?v <http://purl.org/goodrelations/v1#includes> ?x0 .
  ?x0 <http://www.w3.org/2006/vcard/ns#street-address> ?y0 .
  ?v <http://diadem.cs.ox.ac.uk/ontologies/real-estate#hasPrice> ?y1 .
  ?y2 a <http://linkedgeodata.org/ontology/Supermarket> .
  ?x0 <http://purl.org/goodrelations/v1#description> ?y .
  FILTER(regex(?y0,'Oxfordshire','i')) .
  FILTER(regex(?y,'Edwardian ','i')) .
  FILTER(?y1 < 1000000) .
}
In that case, the links to LinkedGeoData were used by selecting the “near” property as well as by finding the correct class from the LinkedGeoData ontology.
3.2 Performance Evaluation
We conclude this evaluation with a brief look at the system performance, focusing on the resource intensive background extraction and linking, which require several hours compared to seconds for the actual query evaluation. For the real-estate scenario, the TBSL algorithm requires 7 seconds on average for answering a natural language query using a remote triple store as backend. The performance is quite stable even for complex queries, which required at most 10 seconds. So far, the TBSL system has not been heavily optimised in terms of performance, since the research focus was clearly to have a very flexible, robust and accurate algorithm. Performance could be improved, e.g., by using fulltext indices for speeding up NLP tasks and queries.
5 Hands-on
OXPath Engine
Version 1.1 available on http://oxpath.org (via code.google)
Java
Maven archetype and Command Line Interface with examples
Output in XML, RDF and relational DB; custom output handler
Based on HtmlUnit
some limitations (e.g., no style axis)
Ongoing work
WebDriver-based implementation, JavaScript in the near future
Visual interface (record-and-play) as Firefox extension
Any feedback is welcome! Get in touch with me
Live Demo
Questions?
oxpath.org