Xin Luna Dong (AT&T Labs Google Inc.) Barna Saha, Divesh Srivastava (AT&T Labs-Research)...
![Page 1: Xin Luna Dong (AT&T Labs Google Inc.) Barna Saha, Divesh Srivastava (AT&T Labs-Research) VLDB’2013.](https://reader035.fdocuments.us/reader035/viewer/2022062620/551a9dfb550346761a8b5e50/html5/thumbnails/1.jpg)
Xin Luna Dong (AT&T Labs Google Inc.)
Barna Saha, Divesh Srivastava (AT&T Labs-Research)
VLDB’2013
* Less is More: Selecting Sources Wisely for Integration
*“The More, The Better” —for Men
*“The More, The Better” —for Women
*“The More, The Better” —for DBers
*But Data Come with A Cost
*Lots of money
*But Data Come with A Cost
*Lots of machines
*But Data Come with A Cost
*Lots of people
*And The Gain Could Be Small
*1096 books from the largest source
*1213 books from the 2 largest sources
*1250 books from the 10 largest sources
*1260 books from the first 35 sources
*All 1265 books from the first 537 sources
In total 894 sources, 1265 CS books
CS books from AbeBooks.com
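The coverage figures on this slide can be turned into an average marginal gain per added source. The sketch below uses only the checkpoint numbers quoted above; the function name `marginal_gain_per_source` is made up for illustration, and it assumes the checkpoints follow one fixed ordering of the sources.

```python
# (number of sources integrated, distinct books covered), as quoted on the slide.
checkpoints = [(1, 1096), (2, 1213), (10, 1250), (35, 1260), (537, 1265)]


def marginal_gain_per_source(points):
    """Average number of new books per additional source between checkpoints."""
    rates = []
    prev_sources, prev_books = 0, 0
    for sources, books in points:
        rates.append((books - prev_books) / (sources - prev_sources))
        prev_sources, prev_books = sources, books
    return rates


rates = marginal_gain_per_source(checkpoints)
for (sources, books), rate in zip(checkpoints, rates):
    print(f"up to {sources:>3} sources: {books} books, "
          f"{rate:.2f} new books per added source")
```

Under that assumption the rate falls at every checkpoint, which is the diminishing-returns pattern the slide is pointing at.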
*And The Gain Could Even Be Negative
*90 > 80 books w. correct authors after 579 sources (Accu)
*93 > 80 books w. correct authors after 583 sources (Vote)
*All 100 books (gold standard) from the first 548 sources
*78 books w. correct authors for Vote
*80 books w. correct authors for Accu
CS books from AbeBooks.com
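The slide names Vote without defining it. Assuming it means "take the value provided by the most sources", the sketch below shows how adding sources can lower fusion quality. The author strings and the source groupings are invented for illustration and are not from the AbeBooks data.

```python
from collections import Counter


def vote(claims):
    """Return the value provided by the most sources (first seen wins a tie)."""
    return Counter(claims).most_common(1)[0][0]


# Hypothetical claims about one book's author list.
truth = "Jane Roe; John Doe"
first_sources = [truth, truth, "Jane Roe"]   # two correct, one incomplete
extra_sources = ["Jane Roe", "Jane Roe"]     # later sources repeat the error

before = vote(first_sources)
after = vote(first_sources + extra_sources)
print(before == truth, after == truth)
```

With three sources the correct value wins 2 to 1. Once the two extra sources are added, the incomplete value wins 3 to 2, so integrating more sources makes the result worse.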
*Less Is More—Source Selection [VLDB’13]
*Questions
*Is it best to integrate all data?
*How to spend the computing resources in a wise way?
*How to wisely select sources before real integration to balance the gain and the cost?
*Prelude for data integration and outside traditional integration tasks (schema mapping, entity resolution, data fusion)
*Maximize Quality Under Budget?
*14 books (17.6% fewer) w. correct authors from the first 200 sources (33% fewer resources)
*17 books w. correct authors from 300 sources (budget)
CS books from AbeBooks.com
*Minimize Cost w. Minimal Quality Requirement?
*65 books w. correct authors (quality requirement) from the first 520 sources
*81 books (25% more) w. correct authors from 526 sources (1% more)
CS books from AbeBooks.com
*Marginalism Principle in Economic Theory
[Figure: two plots against #(Resource Unit), with $ on the vertical axis. The first shows Gain and Cost; the second shows Marginal Gain and Marginal Cost. Annotations: "Marginal gain", "Marginal cost", "The Law of Diminishing Returns", "Largest profit".]
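As a rough illustration of the principle, the sketch below keeps adding resource units while the next unit's marginal gain exceeds its marginal cost, then checks the stopping point against an exhaustive search for the largest profit. The gain and cost curves are assumptions chosen to show diminishing returns; they are not read off the slide's plots.

```python
import math


def gain(units):
    # Hypothetical concave gain curve (diminishing returns).
    return 6.0 * math.sqrt(units)


def cost(units):
    # Hypothetical linear cost: every resource unit costs 1.
    return 1.0 * units


def marginal(f, units):
    """Change in f when going from units - 1 to units."""
    return f(units) - f(units - 1)


def marginal_point(max_units):
    """Add units while the next unit's marginal gain exceeds its marginal cost."""
    units = 0
    while (units < max_units
           and marginal(gain, units + 1) > marginal(cost, units + 1)):
        units += 1
    return units


best_by_marginalism = marginal_point(20)
best_by_search = max(range(21), key=lambda n: gain(n) - cost(n))
print(best_by_marginalism, best_by_search)
```

The two answers agree here only because the assumed gain curve is concave. When diminishing returns do not hold there can be several marginal points, and the first one found need not be the best.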
*Marginalism for Source Selection
Marginal point with the largest profit in this ordering: 548 sources
CS books from AbeBooks.com
Challenge 1. The Law of Diminishing Returns does not necessarily hold, so there can be multiple marginal points
Challenge 2. Each source is different in quality, so different orderings lead to different marginal points: the best solution integrates 26 sources
Challenge 3. Estimating gain and cost w/o real integration
*Insight I. Maximizing Profit
*Input
*S: a set of available sources
*F: integration model
*Output: subset Ŝ to maximize profit G_F(Ŝ) - C_F(Ŝ)
*G_F(Ŝ): Gain of integrating Ŝ using model F
*C_F(Ŝ): Cost of integrating Ŝ using model F
*Gain and cost need to be in the same unit to be comparable; e.g., $
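A minimal sketch of the problem statement, assuming a toy gain model in which each distinct covered item is worth one unit and every source has a fixed cost in the same unit. The source names, items, costs, and the exhaustive search are all illustrative; they are not the integration model F or the algorithm from the talk.

```python
from itertools import combinations

# Hypothetical sources: (items provided, integration cost).
SOURCES = {
    "s1": ({"a", "b", "c", "d"}, 2.0),
    "s2": ({"c", "d", "e"}, 1.5),
    "s3": ({"e", "f"}, 1.0),
    "s4": ({"a"}, 1.5),
}
VALUE_PER_ITEM = 1.0  # assumed conversion of coverage into the cost unit


def gain(selected):
    """Toy stand-in for G_F: value of the distinct items covered."""
    covered = set().union(*(SOURCES[s][0] for s in selected))
    return VALUE_PER_ITEM * len(covered)


def cost(selected):
    """Toy stand-in for C_F: sum of per-source costs."""
    return sum(SOURCES[s][1] for s in selected)


def profit(selected):
    return gain(selected) - cost(selected)


def best_subset(names):
    """Exhaustive search over all subsets; only feasible for tiny inputs."""
    candidates = (frozenset(c)
                  for r in range(len(names) + 1)
                  for c in combinations(names, r))
    return max(candidates, key=profit)


best = best_subset(sorted(SOURCES))
print(sorted(best), profit(best), profit(SOURCES.keys()))
```

In this toy instance, integrating all four sources yields zero profit, while the best subset uses only two of them.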
*Insight II. Yes, It Is A HARD Problem
*Theorem I (NP-Completeness). Under the arbitrary cost model (i.e., different sources have different costs), Marginalism is NP-complete.
*Theorem II (A greedy solution can obtain arbitrarily bad results). Let d_opt be the optimal profit and d be the profit obtained by a greedy solution. For any θ > 0, there exists an input set of sources and a gain model s.t. d/d_opt < θ.
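One way to see how a greedy strategy can fail is a gain model in which two sources are only useful together. The sketch below assumes one particular greedy variant (add the source with the best profit, stop when nothing improves) and an invented gain model. It illustrates the flavor of Theorem II and is not the construction used in the proof.

```python
from itertools import combinations

COST = {"a": 1.0, "b": 1.0}  # hypothetical per-source costs


def gain(selected):
    # Hypothetical gain model: the two sources only pay off together.
    return 10.0 if {"a", "b"} <= set(selected) else 0.0


def profit(selected):
    return gain(selected) - sum(COST[s] for s in selected)


def greedy(names):
    """Add the single best source while doing so strictly improves profit."""
    chosen = set()
    while True:
        remaining = [s for s in names if s not in chosen]
        if not remaining:
            return chosen
        best = max(remaining, key=lambda s: profit(chosen | {s}))
        if profit(chosen | {best}) <= profit(chosen):
            return chosen
        chosen.add(best)


def optimum(names):
    """Exhaustive search, for comparison on this tiny instance."""
    subsets = (set(c)
               for r in range(len(names) + 1)
               for c in combinations(names, r))
    return max(subsets, key=profit)


d = profit(greedy(["a", "b"]))
d_opt = profit(optimum(["a", "b"]))
print(d, d_opt, d / d_opt)
```

Either source alone costs more than it gains, so this greedy variant selects nothing and d/d_opt is 0, which is below any θ > 0.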
*Insight III. An Efficient Algorithm—GRASP Solution
Improvement I. Randomly select from Top-k solutions
Improvement II. Hill climbing to improve the initial solution
Improvement III. Repeat r times and choose the best solution
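The three improvements can be sketched as a small GRASP loop. Everything specific here is an assumption: the toy sources and profit function, the move set used for hill climbing (add, remove, or swap one source), and the parameter names `k`, `r`, and `seed`. The slide does not spell out these details.

```python
import random

# Hypothetical sources: (items provided, integration cost).
SOURCES = {
    "s1": ({"a", "b", "c", "d"}, 2.0),
    "s2": ({"c", "d", "e"}, 1.5),
    "s3": ({"e", "f"}, 1.0),
    "s4": ({"a"}, 1.5),
}


def profit(selected):
    """Toy profit: one unit per distinct covered item, minus source costs."""
    covered = set().union(*(SOURCES[s][0] for s in selected))
    return len(covered) - sum(SOURCES[s][1] for s in selected)


def construct(names, k, rng):
    """Improvement I: pick at random among the top-k improving additions."""
    chosen = set()
    while True:
        base = profit(chosen)
        improving = [s for s in names
                     if s not in chosen and profit(chosen | {s}) > base]
        if not improving:
            return chosen
        improving.sort(key=lambda s: profit(chosen | {s}), reverse=True)
        chosen.add(rng.choice(improving[:k]))


def neighbours(names, chosen):
    """Solutions one add, one remove, or one swap away from `chosen`."""
    result = [chosen ^ {s} for s in names]
    result += [(chosen - {i}) | {o}
               for i in chosen for o in names if o not in chosen]
    return result


def hill_climb(names, chosen):
    """Improvement II: move to the best neighbour until none is better."""
    chosen = set(chosen)
    while True:
        best = max(neighbours(names, chosen), key=profit)
        if profit(best) <= profit(chosen):
            return chosen
        chosen = best


def grasp(names, k, r, seed=0):
    """Improvement III: repeat r times and keep the best solution found."""
    rng = random.Random(seed)
    best = set()
    for _ in range(r):
        candidate = hill_climb(names, construct(names, k, rng))
        if profit(candidate) > profit(best):
            best = candidate
    return best


selected = grasp(sorted(SOURCES), k=2, r=5)
print(sorted(selected), profit(selected))
```

On this toy input the randomized construction can stop at a single mediocre source, and the swap move in hill climbing is what recovers from that. On larger inputs neither step guarantees the optimum, which is why the procedure is repeated.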
*Side Contributions
*Side contributions on data fusion
*The PopAccu model: monotonicity—adding a source should never decrease fusion quality
*Algorithms to estimate fusion quality: dynamic programming
*Experimental Setup
*Book data set: CS books at AbeBooks.com in 2007
*894 sources
*1265 books
*24364 records
*Flight data set: Deep-Web sources for “flight status” in 2011
*38 sources
*1200 flights
*27469 records
*Maximizing Fusion Quality
*228 sources provide books in the gold standard
*Marginalism selects 165 sources, reaching the highest quality
*PopAccu outperforms Vote and Accu, and is nearly monotonic for “good” sources
*Source Selection: The Goal
*Marginalism has higher profit than MaxGLimitC and MinCLimitG most of the time
*Source Selection: The Approach
*Greedy solution often cannot find the optimal solution
*GRASP (top-10, repeating 320 times) obtains nearly optimal results
*Future Work
*Full-fledged source selection for data integration
*Other quality measures: e.g., freshness, consistency, redundancy; correlations, copying relationships between sources
*Complex cost and gain models
*Selecting subsets of data from each source
*Other components of data integration: schema mapping, entity resolution
The More the Better? OR Less is More?