Managing The Structured Web
Michael J. Cafarella, University of Michigan
Michigan CSE, April 23, 2010
The Structured Web
  Web pages contain structure that is obvious to humans, though not to machines
  Search engines are largely blind to it
  Databases need data that is perfectly structured
Different Approaches
  Extraction Techniques
    Tables: WebTables [WebDB’08, VLDB’08]
    Large-scale entity extraction: Structurepedia [ongoing]
  Applications
    Web data integration: Octopus [VLDB’09]
    Structure-aware Web search: Meez [ongoing]
  Tools
    MapReduce Optimizer: Manimal [ongoing]
  Progress in one reinforces others
Different Approaches
  Extraction Techniques
    Tables: WebTables [WebDB’08, VLDB’08] (w/ Alon Halevy, Yang Zhang, Daisy Wang, Eugene Wu)
    Large-scale entity extraction: Structurepedia [ongoing]
  Applications
    Web data integration: Octopus [VLDB’09]
    Structure-aware Web search: Meez [ongoing]
  Tools
    MapReduce Optimizer: Manimal [ongoing] (w/ Chris Re)
WebTables
  The WebTables system automatically extracts databases from a web crawl
    [WebDB08, “Uncovering…”, Cafarella et al]
    [VLDB08, “WebTables: Exploring…”, Cafarella et al]
  An extracted relation is one table plus labeled columns
  We estimate that our crawl of 14.1B raw HTML tables contains ~154M good relational databases
  Pipeline (figure): Raw crawled pages → Raw HTML Tables → Recovered Relations → Applications and Schema Statistics
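To make "one table plus labeled columns" concrete, here is a minimal sketch in Python. It is not the WebTables extractor: the `Relation` class and the `to_relation` helper are hypothetical names, and the sketch simply assumes that the first row of a table holds the column labels and that tables are not nested. The real system has to decide which raw tables are relational at all, which this sketch does not attempt.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class Relation:
    """One extracted relation: labeled columns plus data rows (hypothetical type)."""
    attributes: list
    rows: list


class _TableRows(HTMLParser):
    """Collects the cell text of every <tr> in a single, non-nested table."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def to_relation(html):
    """Assumption: the first row carries the column labels."""
    parser = _TableRows()
    parser.feed(html)
    parser.close()
    if len(parser.rows) < 2:
        return None  # no header plus data rows, so not treated as a relation
    header, *body = parser.rows
    return Relation(attributes=[h.lower() for h in header], rows=body)


html = """
<table>
  <tr><th>Make</th><th>Model</th></tr>
  <tr><td>Toyota</td><td>Camry</td></tr>
  <tr><td>Honda</td><td>Civic</td></tr>
</table>
"""
relation = to_relation(html)
```

Here `relation.attributes` holds the lowercased labels and `relation.rows` holds the two data rows; the sample table is made up for illustration.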
Schema Statistics
  Schema stats are useful for computing attribute probabilities
    p(“make”), p(“model”), p(“zipcode”)
    p(“make” | “model”), p(“make” | “zipcode”)
  Allows many applications
    Schema “tab-complete”
    Synonym discovery
    Others
  Progress in extraction technique enables new data applications
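The attribute probabilities above can be estimated by counting over a corpus of extracted schemas. The sketch below assumes a tiny made-up corpus (`SCHEMAS`) and hypothetical helper names; it treats p(a) as the fraction of schemas containing attribute a, and p(a | b) as the fraction of schemas containing b that also contain a. The `suggest` function is one simple way a schema "tab-complete" might rank candidates given a single attribute, not necessarily the method used in the papers.

```python
from collections import Counter
from itertools import permutations

# Toy schema corpus, invented for illustration only.
SCHEMAS = [
    {"make", "model", "year"},
    {"make", "model", "price"},
    {"make", "model"},
    {"name", "zipcode"},
    {"name", "address", "zipcode"},
]


def schema_stats(schemas):
    """Count how many schemas contain each attribute and each ordered pair."""
    single, pair = Counter(), Counter()
    for schema in schemas:
        single.update(schema)
        pair.update(permutations(sorted(schema), 2))
    return single, pair


SINGLE, PAIR = schema_stats(SCHEMAS)


def p(attr):
    """p(attr): fraction of schemas that contain attr."""
    return SINGLE[attr] / len(SCHEMAS)


def p_given(a, b):
    """p(a | b): among schemas containing b, the fraction also containing a."""
    return PAIR[(a, b)] / SINGLE[b] if SINGLE[b] else 0.0


def suggest(attr):
    """Rank other attributes by p(candidate | attr), dropping zero-probability ones."""
    scored = [(c, p_given(c, attr)) for c in SINGLE if c != attr]
    scored = [(c, s) for c, s in scored if s > 0]
    return sorted(scored, key=lambda cs: (-cs[1], cs[0]))


print(p("make"), p_given("make", "model"), p_given("make", "zipcode"))
print(suggest("make"))
```

In this toy corpus "make" and "model" always co-occur, so "model" is the top suggestion after "make", while "make" never appears alongside "zipcode".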
Manimal (ongoing)
  MapReduce is very popular for “big data”
    Easy for non-database programmers
    Parallelizable, but inefficient
  RDBMSes are challenging for “big data”
    Programming and admin relatively difficult
    When well-used, very efficient
  Manimal is a hybrid MapReduce/RDBMS execution system
    Static analysis to extract code semantics
      if(score > 5)… → database selection
    Extractions enable RDBMS-style optimizations
  Progress in extraction enables new data tools
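The idea of turning an `if(score > 5)` guard into a database selection can be illustrated with a small static analysis. The sketch below is an analogy only: it inspects Python source with the standard `ast` module (Python 3.9+ assumed), whereas the slides do not say how Manimal itself analyzes code. The names `MAP_SOURCE`, `extract_selection`, and `to_where` are hypothetical, and the analysis deliberately accepts only the simplest case, a map function whose entire body is one guard comparing a record field to a constant.

```python
import ast

# A toy map function, kept as source text because it is analyzed, not executed.
MAP_SOURCE = '''
def map_fn(key, record):
    if record["score"] > 5:
        emit(key, record["url"])
'''

# Comparison operators the sketch knows how to express as a selection.
OPS = {ast.Gt: ">", ast.GtE: ">=", ast.Lt: "<", ast.LtE: "<=",
       ast.Eq: "=", ast.NotEq: "<>"}


def extract_selection(source):
    """Return (field, op, constant) if the map body is one simple guard, else None."""
    func = ast.parse(source).body[0]
    if not isinstance(func, ast.FunctionDef) or len(func.args.args) != 2:
        return None
    record_name = func.args.args[1].arg
    # The guard must be the whole body with no else branch; only then does
    # filtering records before the map call leave the output unchanged.
    if len(func.body) != 1:
        return None
    stmt = func.body[0]
    if not isinstance(stmt, ast.If) or stmt.orelse:
        return None
    test = stmt.test
    if not (isinstance(test, ast.Compare) and len(test.ops) == 1):
        return None
    left, op, right = test.left, test.ops[0], test.comparators[0]
    if type(op) not in OPS or not isinstance(right, ast.Constant):
        return None
    # Left side must be record["<field>"] on the record parameter.
    if not (isinstance(left, ast.Subscript)
            and isinstance(left.value, ast.Name)
            and left.value.id == record_name
            and isinstance(left.slice, ast.Constant)
            and isinstance(left.slice.value, str)):
        return None
    return left.slice.value, OPS[type(op)], right.value


def to_where(predicate):
    """Render the extracted predicate as a WHERE-clause fragment."""
    field, op, value = predicate
    return f"{field} {op} {value!r}"


predicate = extract_selection(MAP_SOURCE)
print(predicate, "->", to_where(predicate))
```

Once the predicate is known, an execution layer could in principle use an index on the field to skip non-matching records instead of scanning all input, which is the kind of RDBMS-style optimization the slide refers to. Any map function the sketch cannot prove safe is simply left alone (it returns `None`).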