Data Extraction and Integration from Imprecise Web Sources
Lorenzo Blanco, Mirko Bronzi, Valter Crescenzi, Paolo Merialdo, Paolo Papotti
Università degli Studi Roma Tre
(Creative Commons License, see last slide)
Data-intensive websites
[Figure: a website whose pages are generated from a Database through Template1, Template2 and Template3; label: target]
Flint goal
… StockQuote
Attributes: Last, Min, Max, Volume, 52high, Open
Flint
System architecture
• WebSearch [WIDM08]
• Data Extraction
• Data Integration
• The Web
Novel contribution
Data Extraction
• Unsupervised
• Automatic
• Scalable
• No knowledge available
RoadRunner [Vldb01], ExAlg [Sigmod03], TurboWrapper [Vldb07]
Data Integration
• Unsupervised
• Automatic
• Scalable
• Uncertain Data
• No labels available
• No corpus available
WebTables [Vldb08], Cimple [Vldb07], MetaQuerier [Cidr05], PayGo [Cidr07]
Data Extraction
AAPL, GOOG, MSFT, INTC, …
128.09, 439.54, 34.89, 112.37, …
127.81, 439.25, 32.13, 111.01, …
132.43, 443.82, 33.67, 114.32, …
0.50%, -0.38%, 1.23%, 3.92%, -1.65%, …
Add AAPL to Your Portfolio, Add GOOG to Your Portfolio, Add MSFT to Your Portfolio, Add INTC to Your Portfolio, …
…
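Each line above is one column: the values that a single extraction rule returns when it is applied to every page of the site. A minimal sketch of that idea follows. The two pages, the rule names `a1` and `a2`, and the helper `extract_columns` are hypothetical; only the sample values come from the slide.

```python
import xml.etree.ElementTree as ET

# Two hypothetical pages generated from the same template. They are kept
# well-formed so the standard-library XML parser can read them; real HTML
# usually needs a tolerant parser.
PAGES = [
    "<html><body><table>"
    "<tr><td>Symbol</td><td>AAPL</td></tr>"
    "<tr><td>Last</td><td>128.09</td></tr>"
    "</table></body></html>",
    "<html><body><table>"
    "<tr><td>Symbol</td><td>GOOG</td></tr>"
    "<tr><td>Last</td><td>439.54</td></tr>"
    "</table></body></html>",
]

# Hypothetical wrapper: one positional rule per (still unlabeled) attribute.
# Paths are relative to the <html> root, as ElementTree requires.
RULES = {
    "a1": "body/table/tr[1]/td[2]",
    "a2": "body/table/tr[2]/td[2]",
}


def extract_columns(pages, rules):
    """Apply every rule to every page; return one list of values per rule."""
    columns = {name: [] for name in rules}
    for page in pages:
        root = ET.fromstring(page)
        for name, path in rules.items():
            node = root.find(path)
            columns[name].append(node.text if node is not None else None)
    return columns


columns = extract_columns(PAGES, RULES)
print(columns)
```

The wrapper has no attribute names at this stage: it only produces anonymous columns, which the integration step later has to label and match.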
HTML fragments taken from two pages belonging to the same website:
| Extraction rule | Page 1 | Page 2 | Note |
| --- | --- | --- | --- |
| /html/body/table/tr[1]/td[2] | 1,132,228 | 1,735,857 | |
| /html/body/table/tr[2]/td[2] | $20.66 | $414.58 | |
| /html/body/table/tr[3]/td[2] | $11.70 | $247.30 | |
| /html/body/table/tr[4]/td[2] | $20.72 | $414.06 | |
| /html/body/table/tr[5]/td[2] | $0.02 | 99,494,200 | Extraction error! |
| /html/body/table/tr[6]/td[2] | 4,732,600 | null | ? |
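The rows for tr[5] and tr[6] stand out because the two pages return values of different kinds for the same rule. The sketch below flags such rules with a simple type check. This check is an illustrative stand-in, not necessarily how Flint detects extraction errors; the regular expressions and the function names are assumptions.

```python
import re

# Assumed value formats, chosen to fit the samples on the slide.
CURRENCY = re.compile(r"^\$\d[\d,]*\.\d+$")
INTEGER = re.compile(r"^\d[\d,]*$")


def value_type(value):
    """Classify one extracted value as null, currency, integer or other."""
    if value is None:
        return "null"
    if CURRENCY.match(value):
        return "currency"
    if INTEGER.match(value):
        return "integer"
    return "other"


def suspicious_rules(extracted):
    """Return the rules whose values do not all share one type."""
    flagged = {}
    for rule, values in extracted.items():
        types = {value_type(v) for v in values}
        if len(types) > 1:
            flagged[rule] = sorted(types)
    return flagged


# Values from the slide: one list per rule, one entry per page.
EXTRACTED = {
    "/html/body/table/tr[1]/td[2]": ["1,132,228", "1,735,857"],
    "/html/body/table/tr[2]/td[2]": ["$20.66", "$414.58"],
    "/html/body/table/tr[3]/td[2]": ["$11.70", "$247.30"],
    "/html/body/table/tr[4]/td[2]": ["$20.72", "$414.06"],
    "/html/body/table/tr[5]/td[2]": ["$0.02", "99,494,200"],
    "/html/body/table/tr[6]/td[2]": ["4,732,600", None],
}

flagged = suspicious_rules(EXTRACTED)
for rule, types in flagged.items():
    print(rule, types)
```

A type check only catches errors that change the kind of value. A rule that returns the day high on one page and the day low on another passes it, which is why redundancy across sources is needed as well.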
Data Integration
| max | min | stock |
| --- | --- | --- |
| 10 | 4 | AA |
| 33 | 25 | GO |
| 16 | 10 | MS |
Build sequence of the figure:
• The three attributes (max, min, stock) each carry a threshold t=0.5.
• A second source with the same attributes (max, min, stock) is matched with scores 1.0, 1.0, 1.0.
• A third source with attributes price (6, 26, 12), min (4, 25, 10) and stock (AA, GO, MS) is matched with scores 0.6, 1.0, 1.0.
• The price score is then replaced by "?", while min and stock keep 1.0.
• In the last builds two thresholds read t=0.5 and two read t=0.7.
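The scores and thresholds in this sequence suggest instance-based matching: two attributes are compared through the values they hold for the same stocks, and a match is kept only if the score reaches the threshold t. The sketch below is one way to express that, reading the columns as max = 10, 33, 16 and price = 6, 26, 12 for stocks AA, GO, MS. The similarity measure (share of identical values) and the `>=` acceptance test are assumptions, so the sketch does not reproduce the 0.6 shown on the slide.

```python
def similarity(col_a, col_b):
    """Hypothetical instance-based score: the share of stocks present in
    both columns (with non-null values) on which the two columns agree."""
    shared = [k for k in col_a
              if k in col_b and col_a[k] is not None and col_b[k] is not None]
    if not shared:
        return 0.0
    return sum(col_a[k] == col_b[k] for k in shared) / len(shared)


def accepts(score, threshold):
    """Assumed acceptance test: a match is kept when score >= threshold."""
    return score >= threshold


# Columns keyed by stock symbol, so values are aligned on the same object.
max_source1 = {"AA": 10, "GO": 33, "MS": 16}
max_source2 = {"AA": 10, "GO": 33, "MS": 16}
price_source3 = {"AA": 6, "GO": 26, "MS": 12}

print(similarity(max_source1, max_source2))    # identical columns
print(similarity(price_source3, max_source1))  # no value in common

# With the slide's own numbers: a score of 0.6 passes t=0.5 but not t=0.7.
print(accepts(0.6, 0.5), accepts(0.6, 0.7))
```

Under these assumptions, raising a threshold from 0.5 to 0.7 is enough to keep a 0.6 match such as price against max out of the mapping.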
Wrapper Refinement
[Figure: a further source contributes one attribute, min/max, with values 10, null, 10. It is marked "? ?" and scores 0.3 (weak), 0.3 (weak), 0.0, 0.0 against the existing attributes.]
[Figure labels: matching value; nearby template tokens]
//td[contains(text(),'Open')]/../td[2]
//td[contains(text(),'Open')]/../../tr[5]/td[1]
//td[contains(text(),'Open')]/../../tr[5]/td[2]
//td[contains(text(),'High')]/../td[2]
…
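The expressions above all start from a template token and navigate to a cell. A rough sketch of how such candidates could be enumerated for one matching value is given below. The page, the token set and the function names are hypothetical, and how the template tokens are identified is outside the sketch. The rules are only built as strings: the standard-library ElementTree cannot evaluate `contains()`, so running them needs a full XPath 1.0 engine.

```python
import xml.etree.ElementTree as ET


def cells(page):
    """Yield (row, column, text) for every table cell, with 1-based indexes."""
    root = ET.fromstring(page)
    for i, tr in enumerate(root.iter("tr"), start=1):
        for j, td in enumerate(tr.findall("td"), start=1):
            yield i, j, (td.text or "").strip()


def candidate_rules(page, value, tokens):
    """Build one relative XPath per (template token, matching value) pair."""
    grid = list(cells(page))
    rules = []
    for vi, vj, vtext in grid:
        if vtext != value:
            continue
        for ti, _, ttext in grid:
            if ttext not in tokens:
                continue
            anchor = f"//td[contains(text(),'{ttext}')]"
            if ti == vi:
                # Token and value share a row: go up to the row, then across.
                rules.append(f"{anchor}/../td[{vj}]")
            else:
                # Different rows: go up to the table, then down to the cell.
                rules.append(f"{anchor}/../../tr[{vi}]/td[{vj}]")
    return rules


# Hypothetical well-formed page; a single table is assumed.
PAGE = ("<html><body><table>"
        "<tr><td>Open</td><td>9</td></tr>"
        "<tr><td>Max</td><td>10</td></tr>"
        "<tr><td>Min</td><td>4</td></tr>"
        "</table></body></html>")

rules = candidate_rules(PAGE, "10", {"Open", "Max", "Min"})
for rule in rules:
    print(rule)
```

Among the candidates, the one anchored on the token in the same row as the value is the least sensitive to rows being added or removed elsewhere in the table.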
[Figure: next to the min/max attribute, two attributes appear, max (10, 33, 16) and min (4, 25, 10), each with score 1.0, extracted by the rules:]
//td[contains(text(),'Max')]/../td[2]
//td[contains(text(),'Min')]/../td[2]
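The two rules above anchor on the labels Max and Min instead of on a row position. The sketch below shows why that helps, using two hypothetical pages for stocks AA and MS in which the table does not have the same number of rows. Because ElementTree does not support `contains()`, the sketch swaps in the exact-match predicate `[.='Max']` (Python 3.7 or later); the page layout and the Open value are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical pages: the MS page lacks the Open row, so every later row
# sits one position higher than on the AA page.
PAGE_AA = ("<html><body><table>"
           "<tr><td>Open</td><td>9</td></tr>"
           "<tr><td>Max</td><td>10</td></tr>"
           "<tr><td>Min</td><td>4</td></tr>"
           "</table></body></html>")
PAGE_MS = ("<html><body><table>"
           "<tr><td>Max</td><td>16</td></tr>"
           "<tr><td>Min</td><td>10</td></tr>"
           "</table></body></html>")


def apply_rule(path, pages):
    """Return the text selected by `path` on each page (None if no match)."""
    values = []
    for page in pages:
        node = ET.fromstring(page).find(path)
        values.append(node.text if node is not None else None)
    return values


pages = [PAGE_AA, PAGE_MS]

# Positional rule: picks Max on one page and Min on the other.
positional = apply_rule("body/table/tr[2]/td[2]", pages)

# Label-anchored rules: find the label cell, go up to its row, take cell 2.
max_values = apply_rule(".//td[.='Max']/../td[2]", pages)
min_values = apply_rule(".//td[.='Min']/../td[2]", pages)

print(positional, max_values, min_values)
```

The positional rule returns 10 on both pages, but it is the day maximum for AA and the day minimum for MS, which is the kind of mixed min/max column shown in the figure. The anchored rules separate the two attributes again.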
Experimental Results (100 websites for each domain)
Soccer domain (45,714 pages)
Attribute |m|
• Name 90
• Birth Date 61
• Height 54
• Nationality 48
• Club 43
• Position 43
• Weight 34
• League 14
Videogame domain (49,262 pages)
Attribute |m|
• Title 86
• Publisher 59
• Developer 45
• Genre 28
• ESRB rating 40
• Release Date 9
• Platform 9
• # Players 6
Finance domain (57,623 pages)
Attribute |m|
• Stock Symbol 84
• Price Change 73
• % Change 73
• Volume 52
• Day Low 43
• Day High 41
• Last Price 29
• Open Price 24
Demo
• Found Websites
• Integrated Data
the end!
http://flint.dia.uniroma3.it
License
• This work is licensed under the Creative Commons Attribution-ShareAlike License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/1.0/ or send a letter to Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.