Effective Web Scraping with OXPath

DIADEM domain-centric intelligent automated data extraction methodology

http://oxpath.org

Giovanni Grasso - Oxford University

May 15th, 2013 @ WWW developer track

joint work with Tim Furche, Christian Schallhart,

1 OXPath » Lingua Franca for Web Extraction

A Call for Action in Web Extraction!

Past: Form Filling + HTML Patterns

Now: Interaction + DOM Patterns

getting to the data requires interaction not just form filling

identifying relevant data from rendered DOMs

across several pages

access to all CSS properties (computed style)

The nesting in the result mirrors the structure of the OXPath expression: extraction markers in a predicate (title, source) represent attributes of the last marker outside the predicate (story).

Kleene Star. Finally, we add the Kleene star, as in [12]. For example, the following expression queries Google for “Oxford”, traverses all accessible result pages and extracts all links.

doc("google.com")/descendant::field()[1]/{"Oxford"}/following::field()[1]/{click /}

/( /descendant::a:<Link=(@href)>[.#="Next"]/{click /} )*

To limit the range of the Kleene star, one can specify upper and lower bounds on the multiplicity, e.g., (...)*{3,8}.
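The effect of such bounds can be illustrated outside OXPath. The following is a minimal Python sketch, not OXPath's implementation: the page chain `NEXT_PAGE` and the helpers `follow_next` and `bounded_star` are hypothetical names introduced only for illustration. It treats the starred expression as a step applied k times and keeps the pages reached for every k between the lower and upper bound.

```python
from typing import Callable, Iterator, Optional

# Toy stand-in for a chain of result pages (hypothetical data): each page
# number maps to the next one, and the last page has no successor.
NEXT_PAGE = {1: 2, 2: 3, 3: 4, 4: 5, 5: None}


def follow_next(page: int) -> Optional[int]:
    """Simulates clicking "Next": returns the following page, or None."""
    return NEXT_PAGE.get(page)


def bounded_star(start: int,
                 step: Callable[[int], Optional[int]],
                 lower: int = 0,
                 upper: Optional[int] = None) -> Iterator[int]:
    """Yields every page reached after k applications of `step`,
    for lower <= k <= upper (upper=None means no upper bound)."""
    page: Optional[int] = start
    k = 0
    while page is not None and (upper is None or k <= upper):
        if k >= lower:
            yield page
        page, k = step(page), k + 1


print(list(bounded_star(1, follow_next)))        # plain star: all pages
print(list(bounded_star(1, follow_next, 1, 3)))  # analogous to (...)*{1,3}
```

With bounds 1 and 3, the start page itself (zero applications) is skipped and the traversal stops after the third click, even though further pages exist.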

doc("zillow.com")/descendant::field()[1]/{"Seattle"}/following::*#GOButton/{click/}/descendant::input[@type='checkbox'][2]/{uncheck}/following::field()[1]/{uncheck/}//div[.="Beds"]/following-sibling::select/{"3+"/}//a[./text()="More filters"]/{click/}//input[following-sibling[contains(.,"Multi Family")]]/{uncheck/}/(//span.arrowNext/a/{click/})*//ul#search-results/li:<property>[//div.property-info//a/{click/}//a[.#='More Facts']/{click/}//div.home-facts/table:<facts=(.)>]

2. REFERENCES

[1] G. O. Arocena and A. O. Mendelzon. WebOQL: Restructuring documents, databases, and webs. In ICDE, 24–33, 1998.

[2] R. Baumgartner, S. Flesca, and G. Gottlob. Visual web information extraction with Lixto. In VLDB, 119–128, 2001.

[3] M. Benedikt and C. Koch. XPath Leashed. CSUR, 41(1):3:1–3:54, 2007.

[4] M. Bolin, M. Webber, P. Rha, T. Wilson, and R. C. Miller. Automation and customization of rendered web pages. In UIST, 163–172, 2005.

[5] C.-H. Chang, M. Kayed, M. R. Girgis, and K. F. Shaalan. A survey of web information extraction systems. TKDE, 1411–1428, 2006.

[6] V. Crescenzi and G. Mecca. Automatic information extraction from large websites. JACM, 51(5):731–779, 2004.

[7] G. Gottlob, C. Koch, and R. Pichler. Efficient algorithms for processing XPath queries. TODS, 30(2):444–491, 2005.

[8] G. Leshed, E. M. Haber, T. Matthews, and T. Lau. CoScripter: automating & sharing how-to knowledge in the enterprise. In CHI, 1719–1728, 2008.

[9] J. Lin, J. Wong, J. Nichols, A. Cypher, and T. A. Lau. End-user programming of mashups with Vegemite. In IUI, 97–106, 2009.

[10] L. Liu, C. Pu, and W. Han. XWRAP: An XML-enabled wrapper construction system for web information sources. In ICDE, 611–621, 2000.

[11] M. Liu and T. W. Ling. A rule-based query language for HTML. In DASFAA, 6–13, 2001.

[12] M. Marx. Conditional XPath. TODS, 30(4):929–959, 2005.

[13] A. O. Mendelzon, G. A. Mihaila, and T. Milo. Querying the World Wide Web. In DIS, 80–91, 1996.

[14] A. Sahuguet and F. Azavant. Building light-weight wrappers for legacy web data-sources using W4F. In VLDB, 738–741, 1999.

[15] N. Sawa, A. Morishima, S. Sugimoto, and H. Kitagawa. Wraplet: Wrapping your web contents with a lightweight language. In SITIS, 387–394, 2007.

[16] W. Shen, A. Doan, J. F. Naughton, and R. Ramakrishnan. Declarative information extraction using datalog with embedded extraction predicates. In VLDB, 1033–1044, 2007.

[17] J.-Y. Su, D.-J. Sun, I.-C. Wu, and L.-P. Chen. On design of browser-oriented data extraction system and plug-ins. JMST, 18(2):189–200, 2010.


doc("zillow.com")/descendant::field()[1]/{"Seattle"}/following::*#GOButton/{click/}/descendant::input[@type='checkbox'][2]/{uncheck}/following::field()[1]/{uncheck/}//div[.="Beds"]/following-sibling::select/{"3+"/}//a[./text()="More filters"]/{click/}//input[following-sibling[contains(.,"Multi Family")]]/{uncheck/}/(//span.arrowNext/a/{click/})*//ul#search-results/li:<property>[//div.property-info//a/{click/}//div.home-description:<info=(.)>]
Wrapper Babel

Wrapper induction & data extraction systems

each invent their own wrapper language

or use their own ad-hoc tool or proprietary language

Mainly pattern matching + imperative navigation

mix of XPath & external flow control

limited interaction with complex interfaces

(simple) form filling & submit

focus on automation via visual interfaces

limited extraction language

no multiway navigation

Why OXPath?

an XPath for data extraction

simplicity

learnable

familiarity

scalability


2 OXPath » The Language

OXPath = XPath + 4

action

iteration

extraction

style


Start at kayak.co.uk:

doc("kayak.co.uk")

To select an airport, type a few letters and select from the completion list

//field().destination/{"Sea" /} //div#smartbox//li[1]/{click /}

This will submit the form


http://www.rightmove.co.uk


Refine the results by unchecking the “2+ stops”:

//*#stops2/{uncheck }

On all result pages

/(//a[.='Next']/{click /})*

and for each flight

//body.resultrow:<flight>

Extract the attributes

Mouseover the ! to extract flight quality warnings

//span.qualityWarningIcon/{mouseover /}

Click on the details to extract layovers



➊ Actions: Browser Interaction

Actions correspond to DOM events, e.g.,

Document: doc("google.com")

Click: {click}

Fill: {"Rio"}

Mouseover: {mouseover}

Executed once on each context node

Return context nodes for contextual actions, or root nodes of the new DOM for absolute actions such as {click/}
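The difference between the two kinds of action can be pictured with a toy model. This is a hedged Python sketch, not the OXPath engine: `Node`, `CLICK_TARGETS`, `contextual_click` and `absolute_click` are hypothetical names, and the navigation table is made-up data. It only shows what each kind of action hands on to the rest of the expression.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Node:
    page: str  # identifier of the page (DOM) the node belongs to
    path: str  # position of the node inside that page


# Hypothetical navigation table: clicking this node loads another page.
CLICK_TARGETS: Dict[Node, str] = {Node("search", "//button"): "results"}


def contextual_click(context: List[Node]) -> List[Node]:
    """Contextual action: the event fires once per context node and
    evaluation continues from the same nodes."""
    return list(context)


def absolute_click(context: List[Node]) -> List[Node]:
    """Absolute action: the event fires once per context node and
    evaluation continues from the root of the page shown afterwards."""
    return [Node(CLICK_TARGETS.get(node, node.page), "/") for node in context]


button = Node("search", "//button")
print(contextual_click([button]))
print(absolute_click([button]))
```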





➋ Extraction: Compact Tree Construction

Extraction markers select nodes for extraction

record markers: :<flight>

attribute markers: :<price=string(.)>

Extracted data has tree shape

nesting of extraction markers in the OXPath expression defines the nesting of records and attribute-record associations in the output
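How marker nesting turns into a tree can be sketched in a few lines of Python. This is an illustration under assumptions, not OXPath's output format: the tuple layout of `MATCHES`, the sample values and the helper `build_tree` are all hypothetical. Each match records which enclosing record marker it belongs to, and the tree is assembled from those parent links.

```python
from typing import Dict, List, Optional, Tuple

# (match id, parent match id, marker name, extracted value).
# Record markers carry no value; attribute markers point at their record.
Match = Tuple[int, Optional[int], str, Optional[str]]

MATCHES: List[Match] = [  # hypothetical sample matches
    (1, None, "flight", None),
    (2, 1, "price", "420"),
    (3, 1, "airline", "Example Air"),
    (4, None, "flight", None),
    (5, 4, "price", "515"),
]


def build_tree(matches: List[Match]) -> List[dict]:
    """Assembles nested records; assumes parents are listed before children."""
    nodes: Dict[int, dict] = {}
    roots: List[dict] = []
    for match_id, parent_id, name, value in matches:
        node = {"name": name, "value": value, "children": []}
        nodes[match_id] = node
        if parent_id is None:
            roots.append(node)
        else:
            nodes[parent_id]["children"].append(node)
    return roots


for record in build_tree(MATCHES):
    print(record["name"], [(c["name"], c["value"]) for c in record["children"]])
```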



➌ Iteration: Kleene Star

Most web sites use pagination techniques for results

traversing paginated results requires iteration

⇢ extraction from any unbounded component of a link graph

Kleene Star with action in the iterated expression

/(//a[.='Next']/{click /})*

/(//body/{scroll /})* (infinite scroll)

OXPath’s evaluation algorithm buffers in practice only a constant number of pages
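The constant page buffer can be illustrated with a streaming traversal. The following Python sketch is a simplified model and not OXPath's evaluation algorithm: the `SITE` data, the `Browser` class and `extract_all` are hypothetical. Results are emitted as soon as a page is processed and the page is released before the next one is loaded, so the number of pages held at once stays fixed no matter how long the pagination chain is.

```python
from typing import Dict, Iterator, List, Optional, Set, Tuple

# Hypothetical paginated site: page id -> (results on the page, next page id)
SITE: Dict[str, Tuple[List[str], Optional[str]]] = {
    "p1": (["a", "b"], "p2"),
    "p2": (["c"], "p3"),
    "p3": (["d", "e"], None),
}


class Browser:
    """Toy browser that records how many pages are held at the same time."""

    def __init__(self, site: Dict[str, Tuple[List[str], Optional[str]]]) -> None:
        self.site = site
        self.open_pages: Set[str] = set()
        self.peak = 0

    def load(self, page_id: str) -> Tuple[List[str], Optional[str]]:
        self.open_pages.add(page_id)
        self.peak = max(self.peak, len(self.open_pages))
        return self.site[page_id]

    def release(self, page_id: str) -> None:
        self.open_pages.discard(page_id)


def extract_all(browser: Browser, start: str) -> Iterator[str]:
    """Follows the "next" links and streams results page by page."""
    current: Optional[str] = start
    while current is not None:
        results, next_page = browser.load(current)
        yield from results          # emit matches immediately
        browser.release(current)    # the page is no longer needed
        current = next_page


browser = Browser(SITE)
print(list(extract_all(browser, "p1")), "peak pages held:", browser.peak)
```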



➍ Style: Querying Visual Attributes

Access to all computed style CSS properties via style axis

Visibility: style::display or style::visibility

Font size: style::font-size

Geometry: style::top, style::left, ...

Color: style::color or style::background-color
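As a rough picture of what querying computed style enables, here is a small Python sketch. It does not use OXPath or a browser: the `NODES` list with precomputed style values and the helper `is_rendered` are hypothetical, and the visibility rule is deliberately simplified to the two properties named above.

```python
from typing import Dict, List

# Hypothetical nodes with already computed CSS properties.
NODES: List[Dict] = [
    {"text": "Next", "style": {"display": "block", "visibility": "visible"}},
    {"text": "Hidden promo", "style": {"display": "none", "visibility": "visible"}},
    {"text": "Ghost", "style": {"display": "inline", "visibility": "hidden"}},
]


def is_rendered(node: Dict) -> bool:
    """Simplified check: a node counts as rendered unless display is 'none'
    or visibility is 'hidden'."""
    style = node["style"]
    return style.get("display") != "none" and style.get("visibility") != "hidden"


visible_texts = [node["text"] for node in NODES if is_rendered(node)]
print(visible_texts)
```

Selecting on rendered properties rather than on markup alone is what lets a wrapper skip elements that are present in the DOM but never shown to the user.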


3 Evaluation

Constant Memory

[Figure (b) Millions of results: memory [MB] and #pages [1000] / #results [100,000] plotted against time [h]; series: memory, extracted matches, visited pages]

100,000+ pages, millions of results


it’s the browser

[Figure: pie chart with slices 2%, 13%, 85%; legend: page rendering, browser initialization, OXPath]


faster

[Figure (b) Norm. evaluation time, <150 p.: time w/o page loading [sec] against #pages; series: OXPath, Web Content Extractor, Lixto, Visual Web Ripper, Web Harvest, Chickenfoot]


even faster

[Figure (c) Norm. evaluation time, <850 p.: time w/o page loading [sec] against number of pages; series: OXPath, Lixto]


memory

[Figure: memory [MB] against #pages; series: OXPath, Lixto]

only hundreds of pages as other tools fail for more pages


3 OXPath » System & Evaluation

Evaluation

constant memory

very low overhead on XPath

minimal page buffer

browser bound

fast


4 OXPath User stories

DIADEM: Unsupervised Domain-specific Web Objects Extraction

presented @ World Wide Web 2012 (WWW’12)

DIADEM: domain-centric intelligent automated data extraction methodology


1 Form Understanding & Filling

2 Object identification & alignment

Result pages

Single entity (details) pages

Tables

3 Context-driven block analysis

Flat Text

Energy Performance Chart

Maps

Floor plans

4 OXPath Wrapper

Cloud extraction

Data integration


4 Wrapper induction in DIADEM

Induced Wrapper (partial)

doc('wwagency.co.uk')//select#sale_type_id/{0/}
//button.formbtn/{click /}
(//div.pagenumlinks[last()]//a[last()]/{click /})*
//div.proplist_wrap:<RECORD>
[.//span.prop_price:<PRICE=string(.)>]
[.//ul.prop_keypoints/li[2]/strong:<BEDROOM_ROOMS=string(.)>]
[.//div.prop_statuses//text():<PROPERTY_STATUS=string(.)>]
[.//strong.orange:<POSTCODE=string(.)>]
//div.prop_img/a/{click /}//body
[.//div#propertypage_copy/p[last()-1]:<DESCRIPTION=string(.)>]
[.//div#print_contact/address/text()[2]:<ADDRESS=string(.)>]
[.//a.~'Map view')]/@href:<MAP=string(.)>]
[.//div#propertypage_copy/p[2]:<RECEPTION_ROOMS=string(.)>]


DEQA: Question Answering on the Deep Web

presented @ International Semantic Web Conference 2012 (ISWC’12)


[Figure: DEQA overview with components OXPath, Linking-Metric, TBSL and a Domain Specific Triple Store; example resources White_Road, Kindergarden_A and Kindergarden_B with properties rdf:type gr:Offering, dd:hasPrice 1,499,950 £, dd:bedrooms 15, dbp:near]

Question: House near a Kindergarden under 2,000,000 £?

Answer: White_Road (dd:bedrooms 15, dd:hasPrice 1,499,950 £, dbp:near Kindergarden_A)


4 OXPath » DEQA: Question Answering on the Deep Web

RDF Wrapper (partial)

doc('wwagency.co.uk')....
.... //div.proplist_wrap:<gr:Offering>
[.//span.prop_price:<dd:hasPrice(xsd:double)=string(.)>]
.....
[.//strong.orange:<vcard:postal-code=string(.)>]
....
[.//div.prop_img/a/@href:<foaf:page=string(.)>]
//div.prop_img/a/{click /}//body
[.//div#propertypage_copy/p[last()-1]:<gr:description=string(.)>]
[.//a.~'Map view')]/@href:<wgs84:lat=extractLat(.)>]
[.//a.~'Map view')]/@href:<wgs84:long=extractLong(.)>]
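To make the idea of an RDF wrapper concrete, the following Python sketch turns one extracted record into triples. It is an assumption-laden illustration, not the OXPath RDF output handler: the function `to_triples`, the subject `ex:offer1`, the sample postcode and the choice to treat untyped values as `xsd:string` are all hypothetical. Only the predicate names and the `xsd:double` annotation on the price are taken from the wrapper above.

```python
from typing import Dict, List, Tuple, Union

Literal = Tuple[str, str]  # (lexical value, datatype)
Triple = Tuple[str, str, Union[str, Literal]]


def to_triples(subject: str, record: Dict[str, str]) -> List[Triple]:
    """Maps one extracted record to triples; the record marker becomes the
    rdf:type, each attribute marker becomes one property."""
    triples: List[Triple] = [(subject, "rdf:type", "gr:Offering")]
    for predicate, raw in record.items():
        if predicate == "dd:hasPrice":
            # Typed marker: keep digits and the decimal point only.
            number = "".join(ch for ch in raw if ch.isdigit() or ch == ".")
            triples.append((subject, predicate, (str(float(number)), "xsd:double")))
        else:
            # Assumption: untyped markers are emitted as plain strings.
            triples.append((subject, predicate, (raw, "xsd:string")))
    return triples


record = {"dd:hasPrice": "1,499,950", "vcard:postal-code": "OX1 1AA"}
for triple in to_triples("ex:offer1", record):
    print(triple)
```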



Question translation to SPARQL

Edwardian houses close to supermarket for less than 1,000,000 in Oxfordshire

mapping them to specific restrictions, e.g. cheap could be mapped to costs for flats less than 800 pounds per month.

An example of a successful query is “all houses in Abingdon with more than 2 bedrooms”:

SELECT ?y WHERE {
  ?y a <http://diadem.cs.ox.ac.uk/ontologies/real-estate#House> .
  ?y <http://diadem.cs.ox.ac.uk/ontologies/real-estate#bedrooms> ?y0 .
  ?y <http://www.w3.org/2006/vcard/ns#street-address> ?y1 .
  FILTER(?y0 > 2) .
  FILTER(regex(?y1,'Abingdon','i')) .
}

In that case, TBSL first performs a restriction by class (“House”), then it finds the town name “Abingdon” from the street address and it performs a filter on the number of rooms. Note that most QA systems would not be sufficiently powerful to include such filters.
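The three restrictions in that query (class, bedroom count, case-insensitive match on the street address) can be mirrored over plain in-memory records. The Python sketch below is only an analogy to the SPARQL query, not part of DEQA or TBSL: the `HOUSES` records and the function `houses_in_town` are hypothetical.

```python
import re
from typing import Any, Dict, List

HOUSES: List[Dict[str, Any]] = [  # hypothetical sample data
    {"id": "h1", "type": "House", "bedrooms": 3, "street_address": "1 High Street, Abingdon"},
    {"id": "h2", "type": "House", "bedrooms": 2, "street_address": "2 Bath Street, abingdon"},
    {"id": "h3", "type": "Flat", "bedrooms": 4, "street_address": "3 Ock Street, Abingdon"},
    {"id": "h4", "type": "House", "bedrooms": 5, "street_address": "4 Broad Street, Oxford"},
]


def houses_in_town(records: List[Dict[str, Any]], town: str,
                   more_than_bedrooms: int) -> List[str]:
    """Applies the same three restrictions as the query: class, number of
    bedrooms, and a case-insensitive match on the street address."""
    town_pattern = re.compile(re.escape(town), re.IGNORECASE)
    return [
        record["id"]
        for record in records
        if record["type"] == "House"
        and record["bedrooms"] > more_than_bedrooms
        and town_pattern.search(record["street_address"])
    ]


print(houses_in_town(HOUSES, "Abingdon", 2))
```

Record h2 matches the town despite the lower-case spelling but fails the bedroom filter, which is the kind of combined restriction the text says most QA systems cannot express.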

Another example is “Edwardian houses close to supermarket for less than 1,000,000 in Oxfordshire”, which was translated to the following query:

SELECT ?x0 WHERE {
  ?x0 <http://dbpedia.org/property/near> ?y2 .
  ?x0 a <http://diadem.cs.ox.ac.uk/ontologies/real-estate#House> .
  ?v <http://purl.org/goodrelations/v1#includes> ?x0 .
  ?x0 <http://www.w3.org/2006/vcard/ns#street-address> ?y0 .
  ?v <http://diadem.cs.ox.ac.uk/ontologies/real-estate#hasPrice> ?y1 .
  ?y2 a <http://linkedgeodata.org/ontology/Supermarket> .
  ?x0 <http://purl.org/goodrelations/v1#description> ?y .
  FILTER(regex(?y0,'Oxfordshire','i')) .
  FILTER(regex(?y,'Edwardian ','i')) .
  FILTER(?y1 < 1000000) .
}

In that case, the links to LinkedGeoData were used by selecting the “near” property as well as by finding the correct class from the LinkedGeoData ontology.

3.2 Performance Evaluation

We conclude this evaluation with a brief look at the system performance, focusing on the resource intensive background extraction and linking, which require several hours compared to seconds for the actual query evaluation. For the real-estate scenario, the TBSL algorithm requires 7 seconds on average for answering a natural language query using a remote triple store as backend. The performance is quite stable even for complex queries, which required at most 10 seconds. So far, the TBSL system has not been heavily optimised in terms of performance, since the research focus was clearly to have a very flexible, robust and accurate algorithm. Performance could be improved, e.g., by using fulltext indices for speeding up NLP tasks and queries.


5 Hands-on

OXPath Engine

Version 1.1 available on http://oxpath.org (via code.google)

Java

Maven archetype and Command Line Interface with examples

Output in XML, RDF and Relational DB, custom output handler

Based on HtmlUnit

some limitations (e.g., no style axis)

Ongoing work

WebDriver-based implementation, JavaScript in the near future

Visual Interface (record-and-play) as Firefox Extension

Any feedback is welcome! Get in touch with me
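What an XML output handler has to do can be sketched independently of the engine. The Python snippet below is a hypothetical illustration and does not reproduce the element names or layout of OXPath's actual XML output: `records_to_xml`, the `results` root element and the sample record are assumptions. It writes each extracted record as an element with one child per attribute, using the standard library's `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def records_to_xml(records: List[Dict[str, str]], record_name: str) -> str:
    """Serializes flat records: one element per record, one child element
    per extracted attribute."""
    root = ET.Element("results")
    for record in records:
        element = ET.SubElement(root, record_name)
        for attribute, value in record.items():
            ET.SubElement(element, attribute).text = value
    return ET.tostring(root, encoding="unicode")


xml_text = records_to_xml([{"price": "420", "airline": "Example Air"}], "flight")
print(xml_text)
```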


Live Demo

Questions?

oxpath.org

