![Page 1: Building Data Integration Systems for the Web Alon Halevy Google NSF Information Integration Workshop April 22, 2010.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649da85503460f94a9587a/html5/thumbnails/1.jpg)
Building Data Integration Systems for the Web
Alon Halevy, Google
NSF Information Integration Workshop
April 22, 2010
![Page 2: Building Data Integration Systems for the Web Alon Halevy Google NSF Information Integration Workshop April 22, 2010.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649da85503460f94a9587a/html5/thumbnails/2.jpg)
Without (too much) Loss of Generality
Web ≅ Enterprise, Science projects, …
Information integration ≅ data management
![Page 3: Building Data Integration Systems for the Web Alon Halevy Google NSF Information Integration Workshop April 22, 2010.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649da85503460f94a9587a/html5/thumbnails/3.jpg)
A Few Principles
• Data management “in situ”
  – Data meaning is derived from its context
  – Manipulate data in its natural location
• Pay-as-you-go data management
  – Provide services before modeling is done
  – Data can be about any domain
• Collaboration should be built in
  – Query answering is only the first step
![Page 4: Building Data Integration Systems for the Web Alon Halevy Google NSF Information Integration Workshop April 22, 2010.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649da85503460f94a9587a/html5/thumbnails/4.jpg)
Alex Labrinidis
@via Facebook
![Page 5: Building Data Integration Systems for the Web Alon Halevy Google NSF Information Integration Workshop April 22, 2010.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649da85503460f94a9587a/html5/thumbnails/5.jpg)
Structured Data & The Web
![Page 6: Building Data Integration Systems for the Web Alon Halevy Google NSF Information Integration Workshop April 22, 2010.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649da85503460f94a9587a/html5/thumbnails/6.jpg)
• Discover: hard to find structured data via search engines
• Extract: data is embedded in web pages, behind forms
• Manage, Analyze, Combine: hard to query, visualize, combine data across organizations
• Publish: requires infrastructure, concerns about losing control
![Page 7: Building Data Integration Systems for the Web Alon Halevy Google NSF Information Integration Workshop April 22, 2010.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649da85503460f94a9587a/html5/thumbnails/7.jpg)
Outline
• Surfacing the Deep Web
• Searching tables on the surface Web
• Fusion Tables: a platform for data management on the Web.
![Page 8: Building Data Integration Systems for the Web Alon Halevy Google NSF Information Integration Workshop April 22, 2010.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649da85503460f94a9587a/html5/thumbnails/8.jpg)
What is the Deep Web?
store locations, used cars, radio stations, patents, recipes
• Deep = not accessible through general-purpose search engines
  – Major gap in the coverage of search engines.
![Page 9: Building Data Integration Systems for the Web Alon Halevy Google NSF Information Integration Workshop April 22, 2010.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649da85503460f94a9587a/html5/thumbnails/9.jpg)
Tree Search
Amish quilts
Parking tickets in India
Horses
![Page 10: Building Data Integration Systems for the Web Alon Halevy Google NSF Information Integration Workshop April 22, 2010.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649da85503460f94a9587a/html5/thumbnails/10.jpg)
Solution Constraints
• Can’t design a solution that requires domain engineering
  – (unless you can make money in that domain!)
• Boundaries between domains are fuzzy
• Solution needs to be integrated into general web search
  – Can’t assume special query syntax
![Page 11: Building Data Integration Systems for the Web Alon Halevy Google NSF Information Integration Workshop April 22, 2010.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649da85503460f94a9587a/html5/thumbnails/11.jpg)
Surfacing the Deep Web [Madhavan et al. VLDB 2008]
• Surfacing:
  – Find high-quality forms
  – Guess good queries to submit
  – Put the resulting HTML pages in the index
• ~3M sites, 50 languages, 700 domains.
• 1000 queries per second get results from the deep web.
• 400K forms served per day, 800K per week
• Impact mostly on the long and heavy tail of queries
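The three surfacing steps can be illustrated with a minimal sketch for a GET form: enumerate guessed input values and emit the URLs a crawler would fetch and add to the index. The form URL, input names, and candidate values below are hypothetical, and the sketch does not show how the real system chooses which values are worth submitting.

```python
from itertools import product
from urllib.parse import urlencode


def surface_urls(action_url, candidate_values, max_urls=100):
    """Enumerate GET submission URLs for one form.

    candidate_values maps each form input name to the values we guess are
    worth submitting (hypothetical; picking them well is the hard part).
    """
    names = sorted(candidate_values)
    urls = []
    for combo in product(*(candidate_values[n] for n in names)):
        query = urlencode(dict(zip(names, combo)))
        urls.append(f"{action_url}?{query}")
        if len(urls) >= max_urls:
            break  # cap the load placed on any single site
    return urls


# Hypothetical used-car search form with two inputs.
urls = surface_urls(
    "http://example.com/search",
    {"make": ["honda", "ford"], "zip": ["94043"]},
)
```

Each resulting URL is an ordinary page from the search engine's point of view, which is why no special query syntax is needed at query time.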
![Page 12: Building Data Integration Systems for the Web Alon Halevy Google NSF Information Integration Workshop April 22, 2010.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649da85503460f94a9587a/html5/thumbnails/12.jpg)
Deep Web: The Future
• Still an opportunity to go deeper into the deep web:
  – E.g., map the user query into a form submission.
• Key challenge: given a keyword query, map it to forms in any domain
• Understanding the meaning of forms is still hard (e.g. content, geo constraints).
![Page 13: Building Data Integration Systems for the Web Alon Halevy Google NSF Information Integration Workshop April 22, 2010.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649da85503460f94a9587a/html5/thumbnails/13.jpg)
Outline
• Surfacing the Deep Web
• Searching tables on the surface Web
• Fusion Tables: a platform for data management on the Web.
![Page 14: Building Data Integration Systems for the Web Alon Halevy Google NSF Information Integration Workshop April 22, 2010.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649da85503460f94a9587a/html5/thumbnails/14.jpg)
Bad table
![Page 15: Building Data Integration Systems for the Web Alon Halevy Google NSF Information Integration Workshop April 22, 2010.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649da85503460f94a9587a/html5/thumbnails/15.jpg)
Vertical Tables
![Page 16: Building Data Integration Systems for the Web Alon Halevy Google NSF Information Integration Workshop April 22, 2010.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649da85503460f94a9587a/html5/thumbnails/16.jpg)
Sub-Header Rows
![Page 17: Building Data Integration Systems for the Web Alon Halevy Google NSF Information Integration Workshop April 22, 2010.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649da85503460f94a9587a/html5/thumbnails/17.jpg)
Winners of the Boston Marathon (but that’s nowhere in the table)
![Page 18: Building Data Integration Systems for the Web Alon Halevy Google NSF Information Integration Workshop April 22, 2010.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649da85503460f94a9587a/html5/thumbnails/18.jpg)
Schema OK, but context is subtle (year = 2006)
![Page 19: Building Data Integration Systems for the Web Alon Halevy Google NSF Information Integration Workshop April 22, 2010.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649da85503460f94a9587a/html5/thumbnails/19.jpg)
WebTables: Exploring the Relational Web [Cafarella et al., VLDB 2008, WebDB 08]
• In a corpus of 14B raw tables, we estimate 154M are “good” relations
  – Single-table databases; Schema = attr labels + types
  – Largest corpus of databases & schemas we know of
• The WebTables system:
  – Recovers good relations from crawl and enables search
  – Builds novel apps on the recovered data
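To make "recovering good relations" concrete, here is a hand-written filter that rejects tables with the problems shown on the earlier slides (layout tables, ragged sub-header rows, missing headers). It is only a stand-in for illustration: the function name, the rules, and the thresholds are assumptions, not the WebTables method.

```python
def _is_number(cell):
    """True if the cell parses as a number (thousands separators allowed)."""
    try:
        float(cell.replace(",", ""))
        return True
    except ValueError:
        return False


def looks_relational(rows, min_rows=2, min_cols=2):
    """Crude check that a table (list of rows, first row = header) is a relation.

    All thresholds are illustrative assumptions.
    """
    if len(rows) < min_rows + 1:
        return False
    header, body = rows[0], rows[1:]
    if len(header) < min_cols:
        return False
    # Ragged rows suggest layout tables or sub-header rows spanning columns.
    if any(len(row) != len(header) for row in body):
        return False
    # Header labels should be non-empty text, not numbers.
    if any(not h.strip() or _is_number(h) for h in header):
        return False
    # Each column should be mostly numeric or mostly text, not an even mix.
    for column in zip(*body):
        numeric_share = sum(_is_number(c) for c in column) / len(column)
        if 0.2 < numeric_share < 0.8:
            return False
    return True


good = [["country", "gdp"], ["France", "2,700"], ["Japan", "5,000"]]
ragged = [["country", "gdp"], ["Europe"], ["France", "2,700"]]
```

Here `looks_relational(good)` passes while `looks_relational(ragged)` fails on the row-width check.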
![Page 20: Building Data Integration Systems for the Web Alon Halevy Google NSF Information Integration Workshop April 22, 2010.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649da85503460f94a9587a/html5/thumbnails/20.jpg)
(Web-scale) Schema Collection
• name: e-mail|email, phone|telephone, e-mail_address|email_address, date|last_modified
• instructor: course-title|title, day|days, course|course-#, course-name|course-title
• elected: candidate|name, presiding-officer|speaker
• ab: k|so, h|hits, avg|ba, name|player
• sqft: bath|baths, list|list-price, bed|beds, price|rent
With 2.6 million schemas you can do some very interesting things.
Synonym discovery
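One intuition behind synonym discovery over a schema corpus is that two labels for the same attribute rarely appear together in one schema, yet both appear alongside the same other attributes. The sketch below scores pairs that way for a given context attribute. The Jaccard-style score and the toy schemas are assumptions for illustration, not the scoring used in the published system.

```python
from collections import defaultdict
from itertools import combinations


def synonym_candidates(schemas, context):
    """Rank attribute pairs that look like synonyms, given a context attribute.

    Only schemas containing `context` are considered. A pair is kept if the
    two attributes never co-occur in those schemas but share neighbours.
    """
    relevant = [set(s) for s in schemas if context in s]
    neighbors = defaultdict(set)  # attribute -> attributes seen alongside it
    for schema in relevant:
        for attr in schema - {context}:
            neighbors[attr] |= schema - {context, attr}
    scored = []
    for a, b in combinations(sorted(neighbors), 2):
        if b in neighbors[a]:
            continue  # they co-occur somewhere, so probably not synonyms
        shared = neighbors[a] & neighbors[b]
        if shared:
            union = neighbors[a] | neighbors[b]
            scored.append((len(shared) / len(union), a, b))
    return sorted(scored, reverse=True)


# Toy schema corpus (made up for this example).
schemas = [
    {"name", "email", "phone"},
    {"name", "e-mail", "phone"},
    {"name", "email", "telephone"},
    {"name", "e-mail", "telephone"},
]
pairs = {(a, b) for _, a, b in synonym_candidates(schemas, "name")}
```

On this toy corpus the surviving pairs are e-mail|email and phone|telephone, matching the first row of the slide.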
![Page 21: Building Data Integration Systems for the Web Alon Halevy Google NSF Information Integration Workshop April 22, 2010.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649da85503460f94a9587a/html5/thumbnails/21.jpg)
“KR”-Based Table Search [Wu, Madhavan, Miao, Pasca, Shen]
• Ideally, we describe every table:
  – Class of entities it contains
  – Properties being modeled
  – Context, quality, …
• Use Web-extracted knowledge bases
  – Extract isa-hierarchy using patterns:
  – “cities such as Paris and London”
  – “chemical elements including hydrogen and oxygen”
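The two patterns quoted above can be turned into a small extractor. This is a rough sketch: it takes at most two words before the cue phrase as the class label and handles only single-word instances, whereas a real extractor would need noun-phrase detection and filtering over a large corpus. The regular expression and function name are assumptions.

```python
import re

# "<class> such as <instances>" and "<class> including <instances>".
# The class label is crudely taken to be one or two words before the cue.
PATTERN = re.compile(
    r"(?P<cls>\w+(?: \w+)?) (?:such as|including) "
    r"(?P<items>\w+(?:(?:, and |, | and )\w+)*)"
)


def extract_isa(text):
    """Return (instance, class) pairs found by the two lexical patterns."""
    pairs = []
    for match in PATTERN.finditer(text):
        cls = match.group("cls").lower()
        for item in re.split(r", and |, | and ", match.group("items")):
            pairs.append((item, cls))
    return pairs


cities = extract_isa("cities such as Paris and London")
elements = extract_isa("chemical elements including hydrogen and oxygen")
```

Aggregating such pairs over many pages gives an instance-to-class map that can later be used to label table columns.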
![Page 22: Building Data Integration Systems for the Web Alon Halevy Google NSF Information Integration Workshop April 22, 2010.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649da85503460f94a9587a/html5/thumbnails/22.jpg)
Step 1: Find “Subject” of Table
Not always the leftmost (or first non-numeric) column
![Page 23: Building Data Integration Systems for the Web Alon Halevy Google NSF Information Integration Workshop April 22, 2010.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649da85503460f94a9587a/html5/thumbnails/23.jpg)
Step 2: Associate classes with subject
Chemical elements
Most of the time, the class labels are not in the attribute name
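Steps 1 and 2 can be sketched together: look up every cell in an instance-to-class map, pick the column whose cells are best covered as the subject, and let its cells vote on class labels. The voting scheme, the function name, and the sample data are assumptions made for illustration; the slides do not spell out the actual algorithm.

```python
from collections import Counter


def label_subject_column(columns, isa):
    """Pick the subject column and rank class labels for it.

    columns: dict mapping header -> list of cell strings.
    isa: dict mapping lowercase instance -> set of class labels
         (for example, harvested with lexical patterns).
    """
    best_header, best_votes, best_coverage = None, Counter(), 0.0
    for header, cells in columns.items():
        votes = Counter()
        covered = 0
        for cell in cells:
            classes = isa.get(cell.strip().lower(), set())
            if classes:
                covered += 1
                votes.update(classes)
        coverage = covered / len(cells) if cells else 0.0
        if coverage > best_coverage:
            best_header, best_votes, best_coverage = header, votes, coverage
    return best_header, [label for label, _ in best_votes.most_common()]


# Sample table where the subject is the third column, not the first.
columns = {
    "No.": ["1", "2", "8"],
    "Symbol": ["H", "He", "O"],
    "Name": ["Hydrogen", "Helium", "Oxygen"],
}
isa = {
    "hydrogen": {"chemical elements", "gases"},
    "helium": {"chemical elements", "gases"},
    "oxygen": {"chemical elements"},
}
subject, labels = label_subject_column(columns, isa)
```

The example reflects both observations on these slides: the subject column is "Name" rather than the leftmost column, and the winning label "chemical elements" appears nowhere in the attribute names.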
![Page 24: Building Data Integration Systems for the Web Alon Halevy Google NSF Information Integration Workshop April 22, 2010.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649da85503460f94a9587a/html5/thumbnails/24.jpg)
Leveraging Web-extracted Ontologies
• Given a query, e.g., (country, GDP)
  – Rank tables about countries that have GDP somewhere in the schema.
  – Very high precision (~90%)
• Next challenge: understand binary properties and binary relationships.
• Domain specialization:
  – System should improve if given ontologies in a particular domain.
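A query such as (country, GDP) can be answered over class-annotated tables roughly as follows: keep tables labeled with the class whose schema mentions the property, and order them by the strength of the class label. The table records, scores, and substring match below are hypothetical simplifications.

```python
def rank_tables(tables, cls, prop):
    """Return ids of tables about `cls` that have `prop` in the schema.

    tables: list of dicts with 'id', 'classes' (class label -> score in
    [0, 1]) and 'schema' (list of attribute names). All hypothetical.
    """
    hits = []
    for table in tables:
        class_score = table["classes"].get(cls, 0.0)
        has_prop = any(prop.lower() in attr.lower() for attr in table["schema"])
        if class_score > 0 and has_prop:
            hits.append((class_score, table["id"]))
    return [table_id for _, table_id in sorted(hits, reverse=True)]


tables = [
    {"id": "t1", "classes": {"countries": 0.9}, "schema": ["Country", "GDP (USD)"]},
    {"id": "t2", "classes": {"countries": 0.5}, "schema": ["name", "gdp"]},
    {"id": "t3", "classes": {"cities": 0.9}, "schema": ["city", "gdp"]},
    {"id": "t4", "classes": {"countries": 1.0}, "schema": ["country", "population"]},
]
ranked = rank_tables(tables, "countries", "gdp")
```

Tables t3 (wrong class) and t4 (no GDP attribute) are filtered out, which is the behavior that drives precision up at some cost in recall.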
![Page 25: Building Data Integration Systems for the Web Alon Halevy Google NSF Information Integration Workshop April 22, 2010.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649da85503460f94a9587a/html5/thumbnails/25.jpg)
Combine Search, Extraction, Cleaning and Integration
[Cafarella, Khoussainova, H., VLDB 2009]
• Try to create a database of all “VLDB program committee members”
![Page 26: Building Data Integration Systems for the Web Alon Halevy Google NSF Information Integration Workshop April 22, 2010.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649da85503460f94a9587a/html5/thumbnails/26.jpg)
Outline
• Surfacing the Deep Web
• Searching tables on the surface Web
• Fusion Tables: a platform for data management on the Web.
![Page 27: Building Data Integration Systems for the Web Alon Halevy Google NSF Information Integration Workshop April 22, 2010.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649da85503460f94a9587a/html5/thumbnails/27.jpg)
Data Management for the Web Era
• Integrate seamlessly with the Web:
  – Search, maps, …
• Easy to use:
  – Much broader user base, pay-as-you-go
  – Very simple data integration
• Provide incentives for sharing data
• Facilitate collaboration
Fusion Tables – our current attempt [Madhavan, Gonzalez, Langen, Shapley, Shen]
![Page 28: Building Data Integration Systems for the Web Alon Halevy Google NSF Information Integration Workshop April 22, 2010.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649da85503460f94a9587a/html5/thumbnails/28.jpg)
We store and leverage a large collection of tables.
Incentive
![Page 29: Building Data Integration Systems for the Web Alon Halevy Google NSF Information Integration Workshop April 22, 2010.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649da85503460f94a9587a/html5/thumbnails/29.jpg)
Incentive, Pay-..-Go
![Page 30: Building Data Integration Systems for the Web Alon Halevy Google NSF Information Integration Workshop April 22, 2010.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649da85503460f94a9587a/html5/thumbnails/30.jpg)
Coffee Production
![Page 31: Building Data Integration Systems for the Web Alon Halevy Google NSF Information Integration Workshop April 22, 2010.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649da85503460f94a9587a/html5/thumbnails/31.jpg)
Coffee Consumption
![Page 32: Building Data Integration Systems for the Web Alon Halevy Google NSF Information Integration Workshop April 22, 2010.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649da85503460f94a9587a/html5/thumbnails/32.jpg)
Seamless integration with other web tools
![Page 33: Building Data Integration Systems for the Web Alon Halevy Google NSF Information Integration Workshop April 22, 2010.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649da85503460f94a9587a/html5/thumbnails/33.jpg)
Toilet heat map…
![Page 34: Building Data Integration Systems for the Web Alon Halevy Google NSF Information Integration Workshop April 22, 2010.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649da85503460f94a9587a/html5/thumbnails/34.jpg)
Database functionality on map
![Page 35: Building Data Integration Systems for the Web Alon Halevy Google NSF Information Integration Workshop April 22, 2010.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649da85503460f94a9587a/html5/thumbnails/35.jpg)
Collaboration
Table Search
![Page 36: Building Data Integration Systems for the Web Alon Halevy Google NSF Information Integration Workshop April 22, 2010.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649da85503460f94a9587a/html5/thumbnails/36.jpg)
Show up in search results!
![Page 37: Building Data Integration Systems for the Web Alon Halevy Google NSF Information Integration Workshop April 22, 2010.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649da85503460f94a9587a/html5/thumbnails/37.jpg)
Data Integration
![Page 38: Building Data Integration Systems for the Web Alon Halevy Google NSF Information Integration Workshop April 22, 2010.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649da85503460f94a9587a/html5/thumbnails/38.jpg)
Merged Table
Carries attribution from both base tables. Owners maintain control of their own data.
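The idea of a merged table that carries attribution can be sketched as a join in which every non-key value remembers which base table supplied it. The data structures, owner names, and numbers below are made up for illustration. Note that this sketch copies values, whereas leaving owners in control of their data suggests the merged table should be a view over the base tables.

```python
def merge_tables(left, right, key):
    """Join two tables on a shared key column, tagging each value's source.

    Each table is a dict with 'owner' and 'rows' (list of dicts).
    If both tables have a non-key column of the same name, the right
    table's value wins in this simplified version.
    """
    right_by_key = {row[key]: row for row in right["rows"]}
    merged = []
    for row in left["rows"]:
        match = right_by_key.get(row[key])
        if match is None:
            continue  # inner join: keep only keys present in both tables
        out = {key: row[key]}
        for source, base_row in ((left["owner"], row), (right["owner"], match)):
            for col, val in base_row.items():
                if col != key:
                    out[col] = {"value": val, "source": source}
        merged.append(out)
    return merged


# Made-up data in the spirit of the coffee example.
production = {"owner": "alice", "rows": [{"country": "Brazil", "production": 100}]}
consumption = {
    "owner": "bob",
    "rows": [
        {"country": "Brazil", "consumption": 20},
        {"country": "Finland", "consumption": 5},
    ],
}
merged = merge_tables(production, consumption, "country")
```

Because each cell keeps its source, a viewer can show where every value came from, and either owner's contribution can be updated or withdrawn independently.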
![Page 39: Building Data Integration Systems for the Web Alon Halevy Google NSF Information Integration Workshop April 22, 2010.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649da85503460f94a9587a/html5/thumbnails/39.jpg)
Fine Grained Discussions
![Page 40: Building Data Integration Systems for the Web Alon Halevy Google NSF Information Integration Workshop April 22, 2010.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649da85503460f94a9587a/html5/thumbnails/40.jpg)
Example Uses of Fusion Tables
• Tracking potholes in Spain
• Displaying bike routes (MTBGuru)
• State of California statistics
• Government data from data.gov
• Data about voting locations in the USA
• Brazilian beaches
• Chicago homicides
• Most requested pop songs by year
![Page 41: Building Data Integration Systems for the Web Alon Halevy Google NSF Information Integration Workshop April 22, 2010.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649da85503460f94a9587a/html5/thumbnails/41.jpg)
Conclusions
• Information integration “in situ”
  – Blur the boundary between structured and unstructured data
• Combine search, extraction, cleaning and integration into a single experience
• Pay-as-you-go: introduce complexity as needed
  – Serve enterprises without IT depth
• OpenII – an open-source platform for information integration.
![Page 42: Building Data Integration Systems for the Web Alon Halevy Google NSF Information Integration Workshop April 22, 2010.](https://reader035.fdocuments.us/reader035/viewer/2022081519/56649da85503460f94a9587a/html5/thumbnails/42.jpg)
References
• Fusion Tables:
  – tables.googlelabs.com
  – SIGMOD, SOCC, 2010
• Deep-web crawling:
  – [Madhavan et al., VLDB 08]
• WebTables:
  – [Cafarella et al., VLDB 08]
• Octopus:
  – [Cafarella et al., VLDB 09]
  – [Elmeleegy et al., VLDB 09]