Trio: A System for Data, Uncertainty, and Lineage
-
Upload
philip-mccarthy -
Category
Documents
-
view
39 -
download
0
description
Transcript of Trio: A System for Data, Uncertainty, and Lineage
![Page 1: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/1.jpg)
Trio: A System for Data, Uncertainty, and Lineage
Martin Theobald
Jennifer Widom
Stanford University
![Page 2: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/2.jpg)
2
Uncertainty in Databases
• Not a new idea — proposed 20+ years ago• “Classic” probabilistic database systems
• Uncertainty about attribute values
• Modeling of missing values/incompleteness
[Cavallo ‘87, Abiteboul ‘91, Garcia-Molina ‘92, Fuhr/Rölleke ‘97]
• Most initial (18+ years) work largely theoretical; not much systems-building until recently:URank – U Waterloo, Orion – Purdue U, MayBMS -
Saarland U
• But applications weren’t ready anyway
Are they now?
![Page 3: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/3.jpg)
3
The “Trio” in Trio
1. Data Student #123 is majoring in Econ: (123,Econ)
Major
2. Uncertainty Student #123 is majoring in Econ or CS:
(123, Econ ∥ CS) Major With confidence 60% student #456 is a CS major:
(456, CS 0.6) Major
3. Lineage 456 HardWorker derived from:
(456, CS) Major and“CS is hard” some web page
![Page 4: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/4.jpg)
4
Original Motivation for Trio
New Application Domains
• Many involve data that is uncertain (probabilistic, imprecise, fuzzy, incomplete...)
• Many of the same ones need to track the lineage (provenance) of their data
Coincidence or Fate?Coincidence or Fate?
![Page 5: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/5.jpg)
5
Original Motivation for Trio
New Application Domains
• Many involve data that is uncertain (probabilistic, imprecise, fuzzy, incomplete...)
• Many of the same ones need to track the lineage (provenance) of their data
Trio combines uncertainty & lineage for a closed & complete representation modelNeither uncertainty nor lineage is supported in current commercial database systems
Neither uncertainty nor lineage is supported in current commercial database systems
![Page 6: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/6.jpg)
6
Sample Applications
Information extraction• Find & label entities in unstructured text• Often probabilistic
Information integration• Combine data from multiple sources• Inconsistencies, probabilistic schema mappings
“Human input” data / Scientific experiments• Inexact/incomplete data• Many levels of “derived data products”
![Page 7: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/7.jpg)
7
Sample Applications
Sensor data management• Approximate readings• Missing readings• Levels of data aggregation
Deduplication (“data cleaning”)• Object linkage, entity resolution• Often heuristic/probabilistic
But *NOT* approximate query processing• Fast but inexact answers
![Page 8: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/8.jpg)
8
Our Claim
The connection between uncertainty and lineage goes deeper than just a shared need by several applications
Coincidence or Fate?Coincidence or Fate?
![Page 9: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/9.jpg)
9
Substantiation of Claim
[technical] Lineage:1. Enables simple and consistent
representation of uncertain data
2. Correlates uncertainty in query results with uncertainty in the input data
3. Can make computations over uncertain data more efficient
[fluffy] “Applications may use lineage to reduce or resolve uncertainty”
![Page 10: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/10.jpg)
10
Our Goal
Develop a new kind of database management system (DBMS) in which:
1. Data2. Uncertainty3. Lineage
are all first-class interrelated concepts
With all the “usual” DBMS features
![Page 11: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/11.jpg)
11
The “Usual” DBMS Features
1. Efficient,
2. Convenient,
3. Safe,
4. Multi-User storage of and access to
5. Massive amounts of
6. Persistent data
![Page 12: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/12.jpg)
12
Standard Relational DBMSs
Persistent; Convenient• All data stored in simple tables (relations)
• Queries and updates via simple but powerful declarative language (SQL)
Multi-User; Safe• Transactions, Recovery
Massive; Efficient• Storage and indexing structures
• Query optimization
![Page 13: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/13.jpg)
13
Trio: What Changes
Persistent; Convenient• All data stored in “uncertain” relations
• Queries and updates via simple but powerful declarative language
Multi-User; Safe• Transactions, Recovery
Massive; Efficient• Storage and indexing structures
• Query optimization
![Page 14: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/14.jpg)
14
Trio: What Changes
Persistent; Convenient• All data stored in “uncertain” relations
• Queries and updates via simple but powerful declarative language
Multi-User; Safe – standard DBMS underneath• Transactions, Recovery
Massive; Efficient – standard DBMS underneath• Storage and indexing structures
• Query optimization
![Page 15: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/15.jpg)
15
Another “Trio” in Trio
1. Data Model Simplest extension to relational model that’s
sufficiently expressive
2. Query Language Simple extension to SQL with well-defined
semantics and intuitive behavior
3. System A complete open-source DBMS that people
want to use
![Page 16: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/16.jpg)
16
Another “Trio” in Trio
1. Data Model Uncertainty-Lineage Databases (ULDBs)
2. Query Language TriQL
3. System Trio-One — built on top of standard DBMS
![Page 17: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/17.jpg)
17
Remainder of Talk
1. Data Model Uncertainty-Lineage Databases (ULDBs)
2. Query Language TriQL
3. SystemTrio-One
4. Real-World Applications PharmGKB.org
SERF
![Page 18: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/18.jpg)
18
Running Example: Crime-Solving
Saw (witness, color, car) // may be uncertain
Drives (person, color, car) // may be uncertain
Suspects (person) = πperson(Saw ⋈ Drives)
![Page 19: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/19.jpg)
19
In Standard Relational DBMS
Drives (person, color, car)
Jimmy red Toyota
Frank red Mazda
Billy blue Honda
Frank green
Mazda
Saw (witness, color, car)
Amy red Mazda
Betty blue Honda
Carol green Toyota
Suspects
Frank
Billy
Create Table Suspects asSelect personFrom Saw, DrivesWhere Saw.color = Drives.color And Saw.car = Drives.car
Create Table Suspects asSelect personFrom Saw, DrivesWhere Saw.color = Drives.color And Saw.car = Drives.car
![Page 20: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/20.jpg)
20
Data Model: Uncertainty
An uncertain database enodes a set of possible instances based on different kinds of uncertainty:
• Jimmy drives either a Toyota or a Mazda, but not both.
• Amy saw either a Honda or a Toyota, or nothing?
• Hank is a suspect with confidence 0.7
• Betty saw either an Acura with confidence 0.5 or a Toyota with confidence 0.3
or nothing with confidence 0.2 ….
![Page 21: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/21.jpg)
21
Our Model for Uncertainty
1. Alternatives
2. ‘?’ (Maybe) Annotations
3. Confidences
![Page 22: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/22.jpg)
22
Our Model for Uncertainty
1. Alternatives: uncertainty about value
2. ‘?’ (Maybe) Annotations
3. ConfidencesSaw (witness, color, car)
Amy red, Honda ∥ red, Toyota ∥ orange, Mazda
Three possible
instances
![Page 23: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/23.jpg)
23
Six possibleinstances
Our Model for Uncertainty
1. Alternatives
2. ‘?’ (Maybe): uncertainty about presence
3. Confidences
?
Saw (witness, color, car)
Amy red, Honda ∥ red, Toyota ∥ orange, Mazda
Betty blue, Acura
![Page 24: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/24.jpg)
24
Our Model for Uncertainty
1. Alternatives
2. ‘?’ (Maybe): uncertainty about presence
3. ConfidencesSaw (witness, color, car)
Amy red, Honda ∥ red, Toyota ∥ orange, Mazda
Betty blue, Acura?
Betty blue, Acura ∥ NULL, NULL
absent unknownabsent unknown
![Page 25: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/25.jpg)
25
Our Model for Uncertainty
1. Alternatives
2. ‘?’ (Maybe) Annotations
3. Confidences: weighted uncertainty
Six possible instances, each with a probability
?
Saw (witness, color, car)
Amy red, Honda 0.5 ∥ red, Toyota 0.3 ∥ orange, Mazda 0.2
Betty blue, Acura 0.6
![Page 26: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/26.jpg)
26
Our Model is Not Closed
Saw (witness, car)
Cathy
Honda ∥ Mazda
Drives (person, car)
Jimmy, Toyota ∥ Jimmy, Mazda
Billy, Honda ∥ Frank, Honda
Hank, Honda
Suspects
Jimmy
Billy ∥ Frank
Hank
Suspects = πperson(Saw ⋈ Drives)
???
Does not correctlycapture possibleinstances in theresult (not 12)!
CANNOT
![Page 27: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/27.jpg)
27
to the Rescue
Lineage (provenance): “where data came from”• Internal lineage – from another Trio relation
• External lineage – from an external source
In Trio: A Boolean function λ from data elements to other data elements (or external sources)
Lineage
![Page 28: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/28.jpg)
28
Example with Lineage
ID Saw (witness, car)
11
Cathy
Honda ∥ Mazda
ID Drives (person, car)
21
Jimmy, Toyota ∥ Jimmy, Mazda
22
Billy, Honda ∥ Frank, Honda
23
Hank, Honda
ID Suspects
31
Jimmy
32
Billy ∥ Frank
33
Hank
Suspects = πperson(Saw ⋈ Drives)
???
λ(31) = (11,2)˄(21,2)
λ(32,1) = (11,1)˄(22,1)
λ(33) = (11,1) ˄ 23
; λ(32,2) = (11,1)˄(22,2)
![Page 29: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/29.jpg)
29
Example with Lineage
ID Saw (witness, car)
11
Cathy
Honda ∥ Mazda
ID Drives (person, car)
21
Jimmy, Toyota ∥ Jimmy, Mazda
22
Billy, Honda ∥ Frank, Honda
23
Hank, Honda
ID Suspects
31
Jimmy
32
Billy ∥ Frank
33
Hank
Suspects = πperson(Saw ⋈ Drives)
???
λ(31) = (11,2)˄(21,2)λ(32,1) = (11,1)˄(22,1); λ(32,2) = (11,1)˄(22,2)
λ(33) = (11,1) ˄ 23
Correctly captures possible instances inthe result (7)
![Page 30: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/30.jpg)
30
Trio Data Model
1. Alternatives
2. ‘?’ (Maybe) Annotations
3. Confidences
4. Lineage
ULDBs are closed and complete
Uncertainty-Lineage Databases (ULDBs)Uncertainty-Lineage Databases (ULDBs)
[Databases with Uncertainty and Lineage, VLDB J. Special Issue 2/08]
![Page 31: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/31.jpg)
31
ULDBs: Lineage
Conjunctive lineage sufficient for most operations (,,⋈)
• Disjunctive lineage for duplicate-elimination ()
• Negative lineage for set difference ()
General case after several queries: • Boolean formula
• Can represent any finite (sub)set of possible instances
• Can capture arbitrary Boolean constraints among tuples
• Can represent any finite (sub)set of possible instances
• Can capture arbitrary Boolean constraints among tuples
![Page 32: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/32.jpg)
32
ULDBs: Minimality
A ULDB relation R represents a set of possible
instances
• Does every tuple in R appear in some possible instance? (no extraneous tuples)
• Does every maybe-tuple in R not appear in some possible instance? (no extraneous ‘?’s)
• Also
Data-minimalityData-minimality
Lineage-minimalityLineage-minimality
![Page 33: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/33.jpg)
33
Data Minimality Examples
Extraneous ‘?’
. . .
10
Billy, Honda ∥ Frank, Mazda
. . .
. . .
40
Billy ∥ Frank
. . .
?λ(40,1)=(10,1)λ(40,2)=(10,2)
extraneous
. . .
20 Billy, Honda
. . .
. . .
30 Frank, Mazda
. . .
? ?
![Page 34: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/34.jpg)
34
Data Minimality Examples
Extraneous tuple
extraneous
. . . . . .
10
Diane Acura ∥ Mazda
. . . . . .
. . .
40
Diane
. . .
? λ(40,1)=(10,1)˄(10,2)
. . . . . .
20 Diane Acura
. . . . . .
. . . . . .
30 Diane Mazda
. . . . . .
? ?
![Page 35: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/35.jpg)
• Confidence computation over arbitrary Boolean expressions:
Efficient approximation algorithms (E.g. Monte Carlo simulations)
35
ULDBs: Complexity Questions• Does a given tuple t appear in some (all) possible
instance(s) of an uncertain relation R ?
• Is a given table T one of (all of) the possible instances of R ?
Polynomial algorithms based on data-minimizationPolynomial algorithms based on data-minimization
NP-hardNP-hard
#P-complete#P-complete
![Page 36: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/36.jpg)
36
Querying ULDBs
Simple extension to SQL
Formal semantics, intuitive meaning
Query uncertainty, confidences, and lineage
TriQLTriQL
![Page 37: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/37.jpg)
37
Formal Semantics
Relational (SQL) query Q on ULDB D
DD
D1, D2, …, DnD1, D2, …, Dn
possibleinstances
Q on eachinstance
representationof instances
Q(D1), Q(D2), …, Q(Dn)Q(D1), Q(D2), …, Q(Dn)
implementation of Q
operational semanticsD + ResultD + Result
![Page 38: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/38.jpg)
38
Operational Semantics: Example
SELECT personFROM Saw, DrivesWHERE Saw.car = Drives.car
Saw (witness, car)
Cathy Honda ∥ Mazda
Drives (person, car)
Jim ∥ Bill Mazda
Hank Honda
![Page 39: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/39.jpg)
39
Operational Semantics: Example
SELECT personFROM Saw, DrivesWHERE Saw.car = Drives.car
(Cathy,Honda,Jim,Mazda)∥(Cathy,Honda,Bill,Mazda)∥(Cathy,Mazda,Jim,Mazda)∥(Cathy,Mazda,Bill,Mazda)
Saw (witness, car)
Cathy Honda ∥ Mazda
Drives (person, car)
Jim ∥ Bill Mazda
Hank Honda
![Page 40: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/40.jpg)
40
Operational Semantics: Example
SELECT personFROM Saw, DrivesWHERE Saw.car = Drives.car
(Cathy,Honda,Jim,Mazda)∥(Cathy,Honda,Bill,Mazda)∥(Cathy,Mazda,Jim,Mazda)∥(Cathy,Mazda,Bill,Mazda)
Saw (witness, car)
Cathy Honda ∥ Mazda
Drives (person, car)
Jim ∥ Bill Mazda
Hank Honda
![Page 41: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/41.jpg)
41
Operational Semantics: Example
SELECT personFROM Saw, DrivesWHERE Saw.car = Drives.car
(Cathy,Honda,Jim,Mazda)∥(Cathy,Honda,Bill,Mazda)∥(Cathy,Mazda,Jim,Mazda)∥(Cathy,Mazda,Bill,Mazda)
Saw (witness, car)
Cathy Honda ∥ Mazda
Drives (person, car)
Jim ∥ Bill Mazda
Hank Honda
?
![Page 42: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/42.jpg)
42
Operational Semantics: Example
(Cathy,Honda,Jim,Mazda)∥(Cathy,Honda,Bill,Mazda)∥(Cathy,Mazda,Jim,Mazda)∥(Cathy,Mazda,Bill,Mazda)
SELECT personFROM Saw, DrivesWHERE Saw.car = Drives.car
(Cathy,Honda,Hank,Honda) ∥ (Cathy,Mazda,Hank,Honda)
Saw (witness, car)
Cathy Honda ∥ Mazda
Drives (person, car)
Jim ∥ Bill Mazda
Hank Honda
?
![Page 43: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/43.jpg)
43
Operational Semantics: Example
(Cathy,Honda,Jim,Mazda)∥(Cathy,Honda,Bill,Mazda)∥(Cathy,Mazda,Jim,Mazda)∥(Cathy,Mazda,Bill,Mazda)
SELECT personFROM Saw, DrivesWHERE Saw.car = Drives.car
(Cathy,Honda,Hank,Honda) ∥ (Cathy,Mazda,Hank,Honda)
Saw (witness, car)
Cathy Honda ∥ Mazda
Drives (person, car)
Jim ∥ Bill Mazda
Hank Honda
?
![Page 44: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/44.jpg)
44
Operational Semantics: Example
(Cathy,Honda,Jim,Mazda)∥(Cathy,Honda,Bill,Mazda)∥(Cathy,Mazda,Jim,Mazda)∥(Cathy,Mazda,Bill,Mazda)
SELECT personFROM Saw, DrivesWHERE Saw.car = Drives.car
(Cathy,Honda,Hank,Honda) ∥ (Cathy,Mazda,Hank,Honda)
Saw (witness, car)
Cathy Honda ∥ Mazda
Drives (person, car)
Jim ∥ Bill Mazda
Hank Honda
??
![Page 45: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/45.jpg)
45
Operational Semantics: Example
CREATE TABLE Suspects ASSELECT Drives.personFROM Saw, DrivesWHERE Saw.car = Drives.car
Suspects
Jim ∥ Bill
Hank
??λ( ) = ...λ( ) = ...
Saw (witness, car)
Cathy Honda ∥ Mazda
Drives (person, car)
Jim ∥ Bill Mazda
Hank Honda
• Efficiently implementable on top of standard SQL
• Data computation overhead: O(n log(n))
• Efficiently implementable on top of standard SQL
• Data computation overhead: O(n log(n))
![Page 46: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/46.jpg)
46
Confidences
Confidences supplied with base data
Trio computes confidences on query results• Default probabilistic interpretation (#P-complete)• Plug in different arithmetics, e.g., Min (linear)
Saw (witness, car)
Cathy
Honda 0.6 ∥ Mazda 0.4
Drives (person, car)
Jim 0.3 ∥ Bill 0.6 Mazda
Hank 1.0 Honda
Suspects
Jim 0.12 ∥ Bill 0.24
Hank 0.6
0.3 0.4
0.6 ProbabilisticProbabilistic Min
![Page 47: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/47.jpg)
47
TriQL: Querying Confidences Built-in function: Conf()
SELECT person FROM Saw, Drives WHERE Saw.car = Drives.car AND Conf(Saw) > 0.5 AND Conf(Drives) > 0.8
Similarly: Conf(*)
SELECT person FROM Saw, Drives WHERE Saw.car = Drives.car AND Conf(*) > 0.5
Naïve data-minimization:
WHERE Conf(*) > 0
Naïve data-minimization:
WHERE Conf(*) > 0
![Page 48: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/48.jpg)
48
TriQL: Querying Lineage
Built-in join predicate: Lineage()
SELECT Suspects.person FROM Suspects, Saw WHERE Saw.witness = ‘Amy’ AND Lineage(Suspects,Saw)
X ==> Y shorthand for Lineage(X,Y)
![Page 49: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/49.jpg)
49
More TriQL Constructs
• “Horizontal subqueries”• Refer to tuple alternatives as a relation
• Aggregations: • (sum, count, avg, min, max) (low, high,
expected)
• Fetch modifiers:• Merged (eliminate horizontal duplicates)• Flatten, GroupAlts, NoLineage, NoConf, NoMaybe
• Set operators: Union [All], Intersect, ExceptConjunctive, Disjunctive & Negative Lineage
![Page 50: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/50.jpg)
50
Horizontal Subquery Example
PrimeSuspect (crime#, suspect, accuser)
1 Jimmy, Amy ∥ Billy, Betty ∥ Hank, Cathy
2 Frank, Cathy ∥ Freddy, Betty
Credibility (person,scor
e)
Amy 10
Betty 15
Cathy 5
Suspects
Jimmy 0.33 ∥ Billy 0.5 ∥ Hank 0.166
Frank 0.25 ∥ Freddy 0.75
List suspects with conf values based on accuser credibility
![Page 51: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/51.jpg)
51
Horizontal Subquery Example
PrimeSuspect (crime#, suspect, accuser)
1 Jimmy, Amy ∥ Billy, Betty ∥ Hank, Cathy
2 Frank, Cathy ∥ Freddy, Betty
Credibility (person,scor
e)
Amy 10
Betty 15
Cathy 5
Suspects
Jimmy 0.33 ∥ Billy 0.5 ∥ Hank 0.166
Frank 0.25 ∥ Freddy 0.75
SELECT suspect, score/[sum(score)] as confFROM PrimeSuspect, CredibilityWHERE accuser = person
SELECT suspect, score/[sum(score)] as confFROM PrimeSuspect, CredibilityWHERE accuser = person
![Page 52: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/52.jpg)
SELECT suspect, FROM PrimeSuspect, CredibilityWHERE accuser = person
SELECT suspect, FROM PrimeSuspect, CredibilityWHERE accuser = person
52
Horizontal Subquery Example
PrimeSuspect (crime#, suspect, accuser)
1 Jimmy, Amy ∥ Billy, Betty ∥ Hank, Cathy
2 Frank, Cathy ∥ Freddy, Betty
Credibility (person,scor
e)
Amy 10
Betty 15
Cathy 5
Suspects
Jimmy 0.33 ∥ Billy 0.5 ∥ Hank 0.166
Frank 0.25 ∥ Freddy 0.75
scaled as conf
Remove extraneous ‘?’ if:
[Sum(Conf(*))] = 1
Remove extraneous ‘?’ if:
[Sum(Conf(*))] = 1
![Page 53: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/53.jpg)
53
Trio-Specific Additional Features
• Lineage tracing
• On-demand confidence computation
• Coexistence checks
• Extraneous data removal
Interrelated algorithms based on lineage
![Page 54: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/54.jpg)
54
The Trio System
• Version 1 (“Trio-One”)• Layered on top of a standard DBMS
(PostgreSQL)
• Surprisingly easy and complete, reasonably efficient
• Open source as of summer 2007• Supports Linux, Windows Server/XP/Vista
& Mac-OSX
• Constantly updated & maintained
![Page 55: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/55.jpg)
55
Trio-One Overview
Standard relational DBMS
Trio API and TriQL rewriter(Python)
Trio API and TriQL rewriter(Python)
TrioPlus(Command-line
Client)
TrioPlus(Command-line
Client)
TrioMetadat
a
TrioExplorer(GUI client)
TrioExplorer(GUI client)
Trio Stored
Procedures
EncodedData
TablesLineageTables
Standard SQL• Partition and “verticalize”• Shared IDs for alternatives• Columns for confidence,“?”• One per result table• Uses unique IDs• Encodes formulas
• Table types• Schema-level lineage structure• Conf()• Lineage()
• DDL commands• TriQL queries• Schema browsing• Table browsing• Explore lineage• On-demand confidence computation
Postgres SPI functions
![Page 56: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/56.jpg)
Applications
Pharmaceutical / Medical Databases• PharmGKB.org
• Relationships between Drugs, Diseases, Genes
• Evidence from Literature References, Clinical Outcomes, etc.
58
![Page 57: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/57.jpg)
Applications – PharmGKB.org
59
GeneC
DiseaseB
DrugA
EffectiveA,BEffectiveA,B,C
• Relational Schema • Drugs(DrugID, Name, AltName)
• Diseases(DiseaseID, Name, AltName)
• Genes(GeneID, Name, Symbol, AltSymbol)
• Relationships(RelID, RelatedTo, Evidence)
• Independent Base Data
P[EffectiveA,B,C]
= P[A] P[B] P[C]
![Page 58: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/58.jpg)
Applications – PharmGKB.org
60
GeneC
DiseaseB
DrugA
EffectiveA,BEffectiveA,B,C
• Relational Schema • Drugs(DrugID, Name, AltName)
• Diseases(DiseaseID, Name, AltName)
• Genes(GeneID, Name, Symbol, AltSymbol)
• Relationships(RelID, RelatedTo, Evidence)
• Non-Independent Base Data
P[EffectiveA,B,C]
= P[E|A,B,C]
≠ P[A] P[B] P[C]
ULDBs Vs. Bayesian Nets Factorized representation w/conditional probability tables (CPTs)
E.g.: C = Heart DiseaseB = High Blood PressureA = Warfarin
![Page 59: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/59.jpg)
Applications – SERF
Stanford Entity Resolution Framework
61
Swoosh: A Generic Approach to Entity ResolutionBenjelloun, Garcia-Molina, Whang et al.
<id>r2<name>John Doe<phone>357-2635<email>[email protected]
<id>r1<name>J. Doe<phone>351-2535
<id>r3<name>John D.<phone>351-2535<email>[email protected]
<id>r4<name>{John Doe||John D.}<phone>{357-2635||351-2535}<email>[email protected]
<id>r5<name>{John Doe||John D.||J. Doe}<phone>{357-2635||351-2535}<email>[email protected]
Iteration 2Iteration 2
Iteration 1Iteration 1
![Page 60: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/60.jpg)
Stanford Entity Resolution Framework
62
Swoosh: A Generic Approach to Entity ResolutionBenjelloun, Garcia-Molina, Whang et al.
<id>r2<name>John Doe<phone>357-2635<email>[email protected]
<id>r1<name>J. Doe<phone>351-2535
<id>r3<name>John D.<phone>351-2535<email>[email protected]
<id>r4<name>{John Doe : 0.4 ||John D. : 0.3}<phone>{357-2635 : 0.5 ||351-2535 : 0.5}<email>[email protected]
<id>r5<name>{John Doe : 0.4 ||John D. : 0.3||J. Doe : 0.3}<phone>{357-2635 : 0.3 ||351-2535 : 0.6}<email>[email protected]
Iteration 2Iteration 2
Iteration 1Iteration 1
Applications – SERF
![Page 61: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/61.jpg)
63
Future Topics (sample)
More forms of uncertainty• Continuous uncertainty (intervals, Gaussians)• Correlated uncertainty• Incomplete relations
More efficient confidence-based queries• Top-k by confidence• Confidence bounds & approximation
More forms of lineage• Update propagation w/lineage (versioning)• External lineage
![Page 62: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/62.jpg)
64
Future Topics (sample)
Theory• AI models/Bayesian Nets & inferencing
• Design theory
System• Full TriQL query language• Version 2: Go native beyond Postgres SPI?
– Storage and indexing structures– Statistics– Query optimization
![Page 63: Trio: A System for Data, Uncertainty, and Lineage](https://reader033.fdocuments.us/reader033/viewer/2022051619/56812b5e550346895d8f80ea/html5/thumbnails/63.jpg)
Search “stanford trio”
Trio contributors, past and present Parag Agrawal, Omar Benjelloun, Julien Chaumond, Ashok Chandra, Anish Das Sarma, Alon Halevy, Chris
Hayworth, Ander de Keijzer, Raghotham Murthy, Michi Mutsuzaki, Shubha Nabar, Tomoe Sugihara,
Martin Theobald, Jeff Ullman, Jennifer Widom