Update Exchange with Mappings and Provenance
Todd J. Green Grigoris Karvounarakis Zachary G. Ives Val Tannen
University of Pennsylvania
VLDB 2007, Vienna, Austria, September 26, 2007
![Page 2: Update Exchange with Mappings and Provenance Todd J. Green Grigoris Karvounarakis Zachary G. Ives Val Tannen University of Pennsylvania VLDB 2007 Vienna,](https://reader036.fdocuments.us/reader036/viewer/2022062511/55163c97550346b2068b5164/html5/thumbnails/2.jpg)
Adoption of data integration tools
• Structured information is pervasive in the Internet age, as is the need to access and integrate it…
  – Need to collect, transform, aggregate information
  – Need to import related data into an existing database instance
• … but after many years of research, few users of data integration tools
• Why?
Not because the problem is too hard!
• People are doing it anyway! (Just without help from DB research)
  – e.g., bioinformatics
• Ad-hoc solutions (Perl scripts) developed for specific domains
  – e.g., at Penn, a large staff of programmers maintains the Genomics Unified Schema (GUS)
• Point-to-point exchange between peers / collaborating sites
• To be adopted, data integration tools need to offer significant additional value...
Needs unmet by data integration tools
• Previous data integration tools do not offer:
  – Complete local control of data
    • Decide which data is imported / integrated
    • Ability to modify any data, even data from elsewhere!
  – Support for different points of view
    • Disagreements about data, mappings, schemas...
    • Which sources are trusted / distrusted
  – Tracking of data provenance
  – Support for incremental updates
    • Changes to data, mappings, schemas...
• Our system, ORCHESTRA, addresses these needs
Requirements for ORCHESTRA, a Collaborative Data Sharing System (CDSS) [Ives+05]
• Give peers full control using local instance
• Support different needs / perspectives
• Relate peers by mappings and trust policies
• Support update exchange
• Maintain data provenance
[Figure: Peers A, B, and C. Each peer runs queries and edits against its local DBMS and publishes its updates (∆A+/−, ∆B+/−, ∆C+/−).]
How ORCHESTRA addresses CDSS requirements
From one peer’s perspective:
[Figure: update-processing pipeline. Updates from other peers (∆Pother) → translate through mappings with provenance (m) → apply trust policies using provenance (σ) → produce candidate updates (∆Pc) → local curation (+/−) → resolve conflicts (r) [TaylorIves06] → apply final updates to peer (∆Pf). The stages marked 1, 2, and 3 are the contributions of this paper.]
Roadmap
• Update exchange in a CDSS:
  – Schema mappings
  – Tracking of data provenance
  – Incremental propagation of updates
  – Provenance-based trust policies
  – Local curation via insertions / deletions
• Prototype implementation
• Experimental evaluation
Mappings and updates
• CDSS setting: set of peers; set of declarative mappings (tgds)
• Given: setting, base data, updates
• Goal: local instance at each peer (cf. data exchange paradigm [Fagin+03])
  – Universal solution yields the “certain answers” to queries
  – Can be computed using the chase
• Our contribution: how to do it incrementally, with provenance...
[Figure: relations G, B, and U connected by mappings m1, m2, and m3]
$(m_1)\quad G(i,c,n) \rightarrow B(i,n)$
$(m_2)\quad G(i,c,n) \rightarrow U(n,c)$
$(m_3)\quad B(i,c) \wedge U(n,c) \rightarrow B(i,n)$
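As a rough illustration of computing a local instance with the chase, the sketch below applies m1, m2, and m3 as datalog-style rules until a fixpoint is reached. The function name `chase`, the set-of-tuples representation, and the choice of B(3, 5) as locally inserted base data are assumptions made for this example, not ORCHESTRA's implementation.

```python
def chase(G, B=(), U=()):
    """Apply m1, m2, m3 repeatedly until no new tuple is produced."""
    G, B, U = set(G), set(B), set(U)
    changed = True
    while changed:
        # m1: G(i,c,n) -> B(i,n)
        new_B = {(i, n) for (i, c, n) in G}
        # m2: G(i,c,n) -> U(n,c)
        new_U = {(n, c) for (i, c, n) in G}
        # m3: B(i,c) AND U(n,c) -> B(i,n)  (join on the shared variable c)
        new_B |= {(i, n) for (i, c) in B for (n, c2) in U if c2 == c}
        changed = not (new_B <= B and new_U <= U)
        B |= new_B
        U |= new_U
    return B, U


# Assumed base data: two G tuples plus one tuple inserted locally into B.
B, U = chase(G={(3, 5, 2), (1, 3, 3)}, B={(3, 5)})
```

Because none of the three mappings has an existential variable, the fixpoint here contains no labeled nulls; it recomputes everything from scratch, which is the cost the incremental algorithms are meant to avoid.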
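Incremental insertion
[Figure: provenance graph over the tuples G: (3, 5, 2), (1, 3, 3); B: (3, 5), (3, 2), (1, 3); U: (2, 5), (3, 3). Edges labeled m1, m2, and m3 link each derived tuple to the tuples it was derived from; + marks insertions.]
This graph represents the provenance information that ORCHESTRA maintains
$(m_1)\quad G(i,c,n) \rightarrow B(i,n)$
$(m_2)\quad G(i,c,n) \rightarrow U(n,c)$
$(m_3)\quad B(i,c) \wedge U(n,c) \rightarrow B(i,n)$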
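One way to picture incremental insertion is a worklist that fires only the rule instances involving a newly added tuple and records each firing as a provenance edge. The sketch below is a simplified illustration: the names `facts`, `base`, `prov`, `derive_from`, and `insert` are hypothetical, and treating B(3, 5) as a local insertion is an assumption.

```python
from collections import defaultdict

facts = set()              # all tuples, tagged with their relation, e.g. ("G", 3, 5, 2)
base = set()               # tuples inserted directly rather than derived
prov = defaultdict(set)    # derived tuple -> {(mapping name, source tuples)}


def derive_from(f):
    """Yield (mapping, sources, target) for every rule firing that uses tuple f."""
    rel, *v = f
    if rel == "G":
        i, c, n = v
        yield "m1", (f,), ("B", i, n)
        yield "m2", (f,), ("U", n, c)
    elif rel == "B":
        i, c = v
        for u in [g for g in facts if g[0] == "U" and g[2] == c]:
            yield "m3", (f, u), ("B", i, u[1])
    elif rel == "U":
        n, c = v
        for b in [g for g in facts if g[0] == "B" and g[2] == c]:
            yield "m3", (b, f), ("B", b[1], n)


def insert(base_fact):
    """Insert one base tuple and propagate only its consequences."""
    base.add(base_fact)
    todo = [base_fact]
    while todo:
        f = todo.pop()
        if f in facts:
            continue                      # already present: nothing new to propagate
        facts.add(f)
        for mapping, sources, target in derive_from(f):
            prov[target].add((mapping, sources))   # record the provenance edge
            todo.append(target)


insert(("G", 3, 5, 2))
insert(("B", 3, 5))
insert(("G", 1, 3, 3))
```

After these three insertions, B(3, 2) carries two provenance entries (one through m1, one through m3), and B(1, 3) has an m3 entry that mentions B(1, 3) itself, which is the kind of cycle the deletion algorithm has to handle.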
Incremental deletion
[Figure: provenance graph over the tuples G: (3, 5, 2), (1, 3, 3); B: (3, 5), (3, 2), (1, 3); U: (2, 5), (3, 3), with edges labeled m1, m2, and m3; + marks inserted tuples.]
• Step 1: Use provenance graph to find derived tuples which can also be deleted
• Step 2: Test other affected tuples for derivability, and delete any not derivable
• Step 3: Repeat
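The three steps can be sketched over an explicit provenance graph. This is a simplified illustration under stated assumptions: the dictionary `prov`, the set `base`, and the function `delete` are hypothetical names, the graph is written out by hand from the example tuples, and B(3, 5) is assumed to be a locally inserted base tuple.

```python
# Provenance graph for the example: derived tuple -> list of (mapping, source tuples).
prov = {
    ("B", 3, 2): [("m1", [("G", 3, 5, 2)]), ("m3", [("B", 3, 5), ("U", 2, 5)])],
    ("U", 2, 5): [("m2", [("G", 3, 5, 2)])],
    ("B", 1, 3): [("m1", [("G", 1, 3, 3)]), ("m3", [("B", 1, 3), ("U", 3, 3)])],
    ("U", 3, 3): [("m2", [("G", 1, 3, 3)])],
}
base = {("G", 3, 5, 2), ("G", 1, 3, 3), ("B", 3, 5)}


def delete(base_fact, base, prov):
    """Return (remaining base tuples, derived tuples that must be deleted)."""
    base = base - {base_fact}
    # Step 1: follow provenance edges forward to collect every affected tuple.
    affected, frontier = set(), {base_fact}
    while frontier:
        nxt = {t for t, ds in prov.items() if t not in affected
               and any(frontier & set(srcs) for _, srcs in ds)}
        affected |= nxt
        frontier = nxt
    # Steps 2-3: an affected tuple survives if some derivation uses only
    # surviving tuples; repeat until nothing changes.
    alive = base | (set(prov) - affected)
    changed = True
    while changed:
        changed = False
        for t in affected - alive:
            if any(all(s in alive for s in srcs) for _, srcs in prov[t]):
                alive.add(t)
                changed = True
    return base, affected - alive
```

For instance, `delete(("B", 3, 5), base, prov)` deletes nothing further, because B(3, 2) is still derivable from G(3, 5, 2) through m1, while `delete(("G", 1, 3, 3), base, prov)` removes both B(1, 3) and U(3, 3) even though B(1, 3) appears in its own m3 derivation.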
Other approaches to incremental deletion
• Many strategies (both research and commercial) for incremental deletions...
• ... but we have to support recursion
  – Mappings can have cycles
  – Count-based algorithms don’t work (infinite counts)
• Incremental maintenance for recursive datalog programs
  – DRed [GuptaMumick95]
  – DRed (“delete and re-derive”) computes superset of deletions, then corrects if needed
  – We use provenance to compute exact set of deletions
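One way to see why counting breaks down under cyclic mappings is the sketch below, which uses hypothetical data chosen so that mapping m3 of the running example forms a cycle: B(3, 5) with U(2, 5) yields B(3, 2), and B(3, 2) with U(5, 2) yields B(3, 5) again. The names `support`, `derives`, `count_based_delete`, and `derivable` are illustrative only.

```python
# Assumed data: U(2,5) and U(5,2) are base tuples; B(3,5) was inserted locally.
support = {("B", 3, 5): 2,    # one local insertion + one m3 derivation
           ("B", 3, 2): 1}    # one m3 derivation
derives = {("B", 3, 5): [("B", 3, 2)],   # tuple -> tuples it helps derive
           ("B", 3, 2): [("B", 3, 5)]}


def count_based_delete(fact):
    """Remove one unit of support; cascade only when the count reaches zero."""
    support[fact] -= 1
    if support[fact] == 0:
        del support[fact]
        for target in derives[fact]:
            if target in support:
                count_based_delete(target)


count_based_delete(("B", 3, 5))   # retract the local insertion of B(3,5)
survivors = set(support)          # tuples the counting scheme keeps


def derivable(base, prov):
    """Least fixpoint: tuples reachable from base through provenance edges."""
    alive = set(base)
    changed = True
    while changed:
        changed = False
        for fact, derivations in prov.items():
            if fact not in alive and any(all(s in alive for s in d) for d in derivations):
                alive.add(fact)
                changed = True
    return alive


prov = {("B", 3, 2): [[("B", 3, 5), ("U", 2, 5)]],
        ("B", 3, 5): [[("B", 3, 2), ("U", 5, 2)]]}
truly_derivable = derivable({("U", 2, 5), ("U", 5, 2)}, prov)
```

The counting scheme keeps both B tuples because each still supports the other, yet neither is derivable from the remaining base data. Counting whole derivation trees does not help either, since the cycle gives B(3, 5) unboundedly many of them.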
Trust policies (not every update should be propagated)
• Updates can be filtered automatically based on provenance and content
  – “Peer A distrusts any tuple U(i,n) if the data came from Peer B and n ≥ 3, and trusts any tuple from Peer C”
  – “Peer A distrusts any tuple U(i,n) that came from mapping m4 if n ≠ 2”
• Local curation: user can also manually accept/reject updates, or introduce new ones...
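The two quoted policies can be read as predicates over a tuple's content and its provenance. The sketch below assumes, for simplicity, that each candidate tuple carries a single originating peer and a single mapping; real provenance may record several derivations per tuple. The names `Candidate`, `policy_1`, and `policy_2` are hypothetical.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    i: int
    n: int
    origin_peer: str   # assumed: peer whose base data the tuple was derived from
    mapping: str       # assumed: mapping that produced the tuple


def policy_1(u: Candidate) -> bool:
    """Distrust U(i,n) from Peer B when n >= 3; trust any tuple from Peer C."""
    if u.origin_peer == "C":
        return True
    return not (u.origin_peer == "B" and u.n >= 3)


def policy_2(u: Candidate) -> bool:
    """Distrust U(i,n) that came from mapping m4 unless n == 2."""
    return not (u.mapping == "m4" and u.n != 2)


candidates = [
    Candidate(1, 5, "B", "m2"),   # from B with n >= 3
    Candidate(2, 2, "B", "m4"),   # from B with n < 3, via m4 with n == 2
    Candidate(3, 7, "C", "m4"),   # from C, via m4 with n != 2
]
trusted_1 = [u for u in candidates if policy_1(u)]
trusted_2 = [u for u in candidates if policy_2(u)]
```

The policies are kept as separate functions because the slide presents them as independent examples; how several policies would be combined is left open here.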
Local curation
• Extra tables for local insertions and deletions:
[Figure: (mappings, trust policies, etc.) → candidate updates ∆Pc → local curation (+/−) → final updates ∆Pf]
• Contribution: conforms to data exchange paradigm by using internal mappings with local insertions/deletions:
$(m_2)\quad G(x,y,z) \rightarrow U(x,y)$
$\mapsto$
$(m_2')\quad G^f(x,y,z) \rightarrow U^c(x,y)$
$(m_{U^+})\quad U^+(x,y) \rightarrow U^f(x,y)$
$(m_{U^-})\quad U^c(x,y) \wedge \neg U^-(x,y) \rightarrow U^f(x,y)$
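Read as set operations, the two internal mappings say that the final table holds every local insertion plus every candidate tuple that has not been locally deleted. A minimal sketch, with the function name `final_instance` and the sample tuples chosen purely for illustration:

```python
def final_instance(candidates, local_inserts, local_deletes):
    """Compute the final table from the candidate, insertion, and deletion tables."""
    # Deletion mapping: keep a candidate tuple only if it is not locally deleted.
    accepted = {t for t in candidates if t not in local_deletes}
    # Insertion mapping: every locally inserted tuple is in the final table.
    return accepted | set(local_inserts)


U_c = {(1, 2), (3, 4)}        # candidate tuples produced by mappings and trust policies
U_plus = {(5, 6)}             # local insertions
U_minus = {(3, 4)}            # local deletions (rejections)
U_f = final_instance(U_c, U_plus, U_minus)
```

Note that, following the rules as written, a tuple present in both the insertion and deletion tables still ends up in the final table, because the insertion mapping has no negated condition.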
Prototype implementation
• Middleware layer on top of relational DBMS
• Mappings converted to datalog rules (as in Clio)
• Separate tables for provenance info
• Engine option 1: based on commercial DBMS (DB2)
  – Datalog fixpoints in Java and SQL (only linear recursion in DB2)
  – Labeled nulls supported via encoding scheme
• Engine option 2: using in-house query engine (Tukwila)
  – BerkeleyDB for auxiliary storage and indexes
  – Custom operators for fixpoints, built-in labeled nulls
• 30,000 lines of Java and C++ code
Experimental evaluation
• DB2-based and Tukwila-based implementations
• Workload typical of bioinformatics setting (at most 10s of peers, GBs of data)
• Synthetic update workload sampled from SWISS-PROT biological data set
  – Randomly-generated schemas and mappings
• Dual Xeon 5150 server, 8 GB RAM (2 GB for DB)
• Variables: number of peers, complexity of mappings, volume of data, type of data, size of updates
• Measured: time to join system, time to propagate updates, size of updated database
Incremental deletion algorithm yields significant speedup
[Figure: time to propagate deletions (sec) for the non-incremental, incremental, and DRed algorithms]
Parameters: 5 peers, full acyclic mappings, string data, 1 GB database
System scales to realistic #s of peers
[Figure: time to propagate insertions (sec), axis from 0 to 80, versus number of peers (2, 5, 10, 20); series: 1% insertions (DB2), 10% insertions (DB2), 1% insertions (Tukwila), 10% insertions (Tukwila)]
Parameters: full acyclic mappings, integer data, up to 1 GB database
Contributions
• ORCHESTRA innovatively performs update exchange (not just mediated/federated query answering)
• Tracks data provenance across a network of schema mappings
• Supports provenance-based trust policies
• Features algorithms for incremental propagation of updates
• Solutions have been validated by experimental prototype for typical bioinformatics settings
Related work
• Peer data management systems Piazza [Halevy+03, 04], Hyperion [Kementsietsidis+04], [Bernstein+02], [Calvanese+04], ...
• Data exchange [Haas+99, Miller+00, Popa+02, Fagin+03], peer data exchange [Fuxman+05]
• Provenance / lineage [CuiWidom01], [Buneman+01], Trio [Widom+05], Spider [ChiticariuTan06], [Green+07], ...
• Incremental maintenance [GuptaMumick95], …
CDSS as a research platform: promising future directions
• Ranking-based trust with provenance
  – Numeric weights and “accumulation of evidence”
• More expressive mappings
  – e.g., “looking inside” attributes using regular expressions
• Compact representations of provenance
• Mixing virtual and materialized peers
  – Related to view selection problem
• Supporting key dependencies / egds
  – Deletion propagation becomes challenging
• Incorporating probabilistic mappings / data
Ongoing work at Penn
• Deploying ORCHESTRA in the real world
  – Pilot project with Penn Center for Bioinformatics
• Bidirectional mappings
  – Propagating updates in both directions
• Mapping evolution problem
  – Handling updates to mappings (not just data)
• Fully distributed implementation
  – Using P2P database engine
Bioinformatics mappings example
$(m_1)\quad G(i,c,n) \rightarrow B(i,n)$
$(m_2)\quad G(i,c,n) \rightarrow U(n,c)$
$(m_3)\quad G(i,c,n) \rightarrow \exists c\; U(n,c)$
$(m_4)\quad B(i,c) \wedge U(n,c) \rightarrow B(i,n)$
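Mapping m3 has an existential variable in its head, so applying it has to invent a placeholder value, commonly called a labeled null. The sketch below represents the null as a Skolem-style term keyed on the mapping name and the variable n shared between body and head; that choice of arguments and the name `apply_m3` are assumptions made for illustration.

```python
def apply_m3(G):
    """Apply m3 to G(i,c,n), producing U(n, null) with a labeled null for the unknown value."""
    # The labeled null depends only on n, so two G tuples with the same n
    # produce the same U tuple rather than two different unknowns.
    return {(n, ("null", "m3", n)) for (i, c, n) in G}


U_from_m3 = apply_m3({(3, 5, 2), (4, 6, 2), (1, 3, 3)})
```

Using a deterministic term for the null keeps repeated applications of the mapping idempotent, which matters when mappings are re-run as updates arrive.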
Delta rules for insertions
• As in DRed [GuptaMumick95]:
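$(m_4)\quad B(i,c) \wedge U(n,c) \rightarrow B(i,n)$
$(m_4')\quad B^+(i,c) \wedge U(n,c) \rightarrow B^+(i,n)$
$(m_4'')\quad B^\nu(i,c) \wedge U^+(n,c) \rightarrow B^+(i,n)$
$(m_{B^+})\quad B^+(i,c) \rightarrow B^\nu(i,c)$
$(m_B)\quad B(i,c) \rightarrow B^\nu(i,c)$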
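The delta rules can be evaluated as a small fixpoint: new B tuples are joined with the old U, and the new version of B is joined with the new U tuples, until no further insertions appear. The sketch below is an illustration on hypothetical data; the function name `propagate_insertions` and its return convention (only tuples not already in B) are assumptions.

```python
def propagate_insertions(B, U, B_plus, U_plus):
    """Return all tuples that must be added to B, following the delta rules for m4."""
    B_plus = set(B_plus)
    while True:
        # New version of B: old tuples together with insertions found so far.
        B_new = B | B_plus
        # First delta rule: inserted B tuples joined with the old U.
        derived = {(i, n) for (i, c) in B_plus for (n, c2) in U if c2 == c}
        # Second delta rule: new version of B joined with inserted U tuples.
        derived |= {(i, n) for (i, c) in B_new for (n, c2) in U_plus if c2 == c}
        if derived <= B_plus:
            return B_plus - B
        B_plus |= derived


# Assumed starting state, already closed under m4.
B = {(3, 5), (3, 2)}
U = {(2, 5)}
added = propagate_insertions(B, U, B_plus={(9, 5)}, U_plus={(7, 2)})
```

Only joins that involve at least one inserted tuple are evaluated, so the old B is never joined with the old U again.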