Managing The Structured Web
![Page 1: Managing The Structured Web](https://reader035.fdocuments.us/reader035/viewer/2022070409/56814498550346895db13e8b/html5/thumbnails/1.jpg)
Managing The Structured Web
Michael J. Cafarella, University of Michigan
Michigan CSE, April 23, 2010
![Page 2: Managing The Structured Web](https://reader035.fdocuments.us/reader035/viewer/2022070409/56814498550346895db13e8b/html5/thumbnails/2.jpg)
The Structured Web
- Web pages contain structure that is obvious to humans, though not to machines
- Search engines are largely blind to it
- Databases need data that is perfectly structured
![Page 3: Managing The Structured Web](https://reader035.fdocuments.us/reader035/viewer/2022070409/56814498550346895db13e8b/html5/thumbnails/3.jpg)
![Page 4: Managing The Structured Web](https://reader035.fdocuments.us/reader035/viewer/2022070409/56814498550346895db13e8b/html5/thumbnails/4.jpg)
Different Approaches
- Extraction Techniques
  - Tables: WebTables [WebDB’08, VLDB’08]
  - Large-scale entity extraction: Structurepedia [ongoing]
- Applications
  - Web data integration: Octopus [VLDB’09]
  - Structure-aware Web search: Meez [ongoing]
- Tools
  - MapReduce Optimizer: Manimal [ongoing]
- Progress in one reinforces others
![Page 5: Managing The Structured Web](https://reader035.fdocuments.us/reader035/viewer/2022070409/56814498550346895db13e8b/html5/thumbnails/5.jpg)
Different Approaches
- Extraction Techniques
  - Tables: WebTables [WebDB’08, VLDB’08] (w/ Alon Halevy, Yang Zhang, Daisy Wang, Eugene Wu)
  - Large-scale entity extraction: Structurepedia [ongoing]
- Applications
  - Web data integration: Octopus [VLDB’09]
  - Structure-aware Web search: Meez [ongoing]
- Tools
  - MapReduce Optimizer: Manimal [ongoing] (w/ Chris Re)
![Page 6: Managing The Structured Web](https://reader035.fdocuments.us/reader035/viewer/2022070409/56814498550346895db13e8b/html5/thumbnails/6.jpg)
![Page 7: Managing The Structured Web](https://reader035.fdocuments.us/reader035/viewer/2022070409/56814498550346895db13e8b/html5/thumbnails/7.jpg)
![Page 8: Managing The Structured Web](https://reader035.fdocuments.us/reader035/viewer/2022070409/56814498550346895db13e8b/html5/thumbnails/8.jpg)
WebTables
- The WebTables system automatically extracts databases from a Web crawl
  [WebDB08, “Uncovering…”, Cafarella et al][VLDB08, “WebTables: Exploring…”, Cafarella et al]
- An extracted relation is one table plus labeled columns
- We estimate that our crawl of 14.1B raw HTML tables contains ~154M good relational databases
[Figure: Raw crawled pages → Raw HTML Tables → Recovered Relations → Applications; Schema Statistics]
![Page 9: Managing The Structured Web](https://reader035.fdocuments.us/reader035/viewer/2022070409/56814498550346895db13e8b/html5/thumbnails/9.jpg)
Schema Statistics
- Schema stats useful for computing attribute probabilities
  - p(“make”), p(“model”), p(“zipcode”)
  - p(“make” | “model”), p(“make” | “zipcode”)
- Allows many applications
  - Schema “tab-complete”
  - Synonym discovery
  - Others
- Progress in extraction technique enables new data applications
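To make the attribute probabilities concrete, here is a minimal sketch that counts attributes over a small, made-up list of schemas and derives p(a), p(a | b), and a naive "tab-complete" ranking. The function names (`schema_stats`, `p`, `p_given`, `tab_complete`) and the toy schemas are assumptions for illustration only; this is not the WebTables implementation.

```python
from collections import Counter
from itertools import permutations


def schema_stats(schemas):
    """Count how many schemas contain each attribute and each ordered attribute pair."""
    single = Counter()
    pair = Counter()
    for schema in schemas:
        attrs = set(schema)  # count each attribute once per schema
        single.update(attrs)
        pair.update(permutations(attrs, 2))
    return len(schemas), single, pair


def p(attr, stats):
    """Fraction of schemas that contain attr."""
    n, single, _ = stats
    return single[attr] / n


def p_given(a, b, stats):
    """Fraction of schemas containing b that also contain a."""
    _, single, pair = stats
    if single[b] == 0:
        return 0.0
    return pair[(a, b)] / single[b]


def tab_complete(seed, stats, k=3):
    """Suggest up to k attributes, ranked by p(attr | seed), ties broken by name."""
    _, single, _ = stats
    candidates = [a for a in single if a != seed]
    return sorted(candidates, key=lambda a: (-p_given(a, seed, stats), a))[:k]


# Hypothetical toy corpus of extracted schemas (not real crawl data).
schemas = [
    ["make", "model", "year"],
    ["make", "model", "price"],
    ["name", "zipcode"],
    ["make", "zipcode"],
]
stats = schema_stats(schemas)

print(p("make", stats))                   # 0.75
print(p_given("make", "model", stats))    # 1.0
print(p_given("make", "zipcode", stats))  # 0.5
print(tab_complete("model", stats, k=1))  # ['make']
```

On a real corpus the counts would come from the recovered relations, and a synonym-discovery application would need additional signals beyond these two probabilities.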
![Page 10: Managing The Structured Web](https://reader035.fdocuments.us/reader035/viewer/2022070409/56814498550346895db13e8b/html5/thumbnails/10.jpg)
Manimal (ongoing)
- MapReduce is very popular for “big data”
  - Easy for non-database programmers
  - Parallelizable, but inefficient
- RDBMSes are challenging for “big data”
  - Programming and administration are relatively difficult
  - When used well, very efficient
- Manimal is a hybrid MapReduce/RDBMS execution system
  - Static analysis to extract code semantics
  - if(score > 5)… corresponds to a database selection
  - Extractions enable RDBMS-style optimizations
- Progress in extraction enables new data tools
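The idea of recovering a database selection from map code can be sketched with Python's standard `ast` module. The example below is an illustrative assumption, not Manimal's analyzer: `MAP_SOURCE`, `find_selection`, and `to_where` are hypothetical names, and the check only recognizes a map function whose entire body is a single `if <variable> <op> <constant>` guard with no `else`.

```python
import ast

# Hypothetical map function; it is only parsed, never executed.
MAP_SOURCE = '''
def map_fn(url, score):
    if score > 5:
        emit(url, score)
'''

# Comparison operators this sketch knows how to translate.
_OPS = {ast.Gt: ">", ast.GtE: ">=", ast.Lt: "<", ast.LtE: "<=", ast.Eq: "="}


def find_selection(source):
    """Return (field, op, constant) if the whole map body is one guarded `if`, else None."""
    func = ast.parse(source).body[0]
    if not isinstance(func, ast.FunctionDef) or len(func.body) != 1:
        return None
    stmt = func.body[0]
    # Require a lone `if` with no else branch, so every emit is guarded by the test.
    if not isinstance(stmt, ast.If) or stmt.orelse:
        return None
    test = stmt.test
    if not (isinstance(test, ast.Compare) and len(test.ops) == 1):
        return None
    left, right = test.left, test.comparators[0]
    if not (isinstance(left, ast.Name) and isinstance(right, ast.Constant)):
        return None
    op = _OPS.get(type(test.ops[0]))
    if op is None:
        return None
    return (left.id, op, right.value)


def to_where(selection):
    """Render the detected predicate as a SQL-style condition string."""
    field, op, value = selection
    return f"{field} {op} {value!r}"


selection = find_selection(MAP_SOURCE)
print(selection)            # ('score', '>', 5)
print(to_where(selection))  # score > 5
```

Once a predicate like this is known, an execution system could in principle use an index on the field to skip records that would never pass the guard, which is the kind of RDBMS-style optimization the slide refers to. A real analyzer would also have to rule out side effects and handle far more code shapes than this sketch does.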