Trio: A System for Data, Uncertainty, and Lineage


Page 1: Trio:  A System for Data, Uncertainty, and Lineage

Trio: A System for Data, Uncertainty, and Lineage

Martin Theobald

Jennifer Widom

Stanford University

Page 2: Trio:  A System for Data, Uncertainty, and Lineage

Uncertainty in Databases

• Not a new idea: proposed 20+ years ago
• “Classic” probabilistic database systems
• Uncertainty about attribute values
• Modeling of missing values/incompleteness
  [Cavallo ‘87, Abiteboul ‘91, Garcia-Molina ‘92, Fuhr/Rölleke ‘97]
• Most initial (18+ years) work was largely theoretical; not much systems-building until recently: URank (U Waterloo), Orion (Purdue U), MayBMS (Saarland U)
• But applications weren’t ready anyway

Are they now?

Page 3: Trio:  A System for Data, Uncertainty, and Lineage

The “Trio” in Trio

1. Data
   Student #123 is majoring in Econ: (123, Econ) ∈ Major

2. Uncertainty
   Student #123 is majoring in Econ or CS: (123, Econ ∥ CS) ∈ Major
   With confidence 60% student #456 is a CS major: (456, CS 0.6) ∈ Major

3. Lineage
   456 ∈ HardWorker derived from: (456, CS) ∈ Major and “CS is hard” ∈ some web page

Page 4: Trio:  A System for Data, Uncertainty, and Lineage

Original Motivation for Trio

New Application Domains

• Many involve data that is uncertain (probabilistic, imprecise, fuzzy, incomplete...)

• Many of the same ones need to track the lineage (provenance) of their data

Coincidence or Fate?

Page 5: Trio:  A System for Data, Uncertainty, and Lineage

Original Motivation for Trio

New Application Domains

• Many involve data that is uncertain (probabilistic, imprecise, fuzzy, incomplete...)

• Many of the same ones need to track the lineage (provenance) of their data

Trio combines uncertainty & lineage for a closed & complete representation model

Neither uncertainty nor lineage is supported in current commercial database systems

Page 6: Trio:  A System for Data, Uncertainty, and Lineage

Sample Applications

Information extraction
• Find & label entities in unstructured text
• Often probabilistic

Information integration
• Combine data from multiple sources
• Inconsistencies, probabilistic schema mappings

“Human input” data / Scientific experiments
• Inexact/incomplete data
• Many levels of “derived data products”

Page 7: Trio:  A System for Data, Uncertainty, and Lineage

Sample Applications

Sensor data management
• Approximate readings
• Missing readings
• Levels of data aggregation

Deduplication (“data cleaning”)
• Object linkage, entity resolution
• Often heuristic/probabilistic

But *NOT* approximate query processing
• Fast but inexact answers

Page 8: Trio:  A System for Data, Uncertainty, and Lineage

Our Claim

The connection between uncertainty and lineage goes deeper than just a shared need by several applications

Coincidence or Fate?

Page 9: Trio:  A System for Data, Uncertainty, and Lineage

Substantiation of Claim

[technical] Lineage:
1. Enables simple and consistent representation of uncertain data
2. Correlates uncertainty in query results with uncertainty in the input data
3. Can make computations over uncertain data more efficient

[fluffy] “Applications may use lineage to reduce or resolve uncertainty”

Page 10: Trio:  A System for Data, Uncertainty, and Lineage

Our Goal

Develop a new kind of database management system (DBMS) in which:

1. Data
2. Uncertainty
3. Lineage

are all first-class interrelated concepts

With all the “usual” DBMS features

Page 11: Trio:  A System for Data, Uncertainty, and Lineage


The “Usual” DBMS Features

1. Efficient,

2. Convenient,

3. Safe,

4. Multi-User storage of and access to

5. Massive amounts of

6. Persistent data

Page 12: Trio:  A System for Data, Uncertainty, and Lineage

Standard Relational DBMSs

Persistent; Convenient
• All data stored in simple tables (relations)
• Queries and updates via simple but powerful declarative language (SQL)

Multi-User; Safe
• Transactions, Recovery

Massive; Efficient
• Storage and indexing structures
• Query optimization

Page 13: Trio:  A System for Data, Uncertainty, and Lineage

Trio: What Changes

Persistent; Convenient
• All data stored in “uncertain” relations
• Queries and updates via simple but powerful declarative language

Multi-User; Safe
• Transactions, Recovery

Massive; Efficient
• Storage and indexing structures
• Query optimization

Page 14: Trio:  A System for Data, Uncertainty, and Lineage

Trio: What Changes

Persistent; Convenient
• All data stored in “uncertain” relations
• Queries and updates via simple but powerful declarative language

Multi-User; Safe (standard DBMS underneath)
• Transactions, Recovery

Massive; Efficient (standard DBMS underneath)
• Storage and indexing structures
• Query optimization

Page 15: Trio:  A System for Data, Uncertainty, and Lineage

Another “Trio” in Trio

1. Data Model
   Simplest extension to relational model that’s sufficiently expressive

2. Query Language
   Simple extension to SQL with well-defined semantics and intuitive behavior

3. System
   A complete open-source DBMS that people want to use

Page 16: Trio:  A System for Data, Uncertainty, and Lineage


Another “Trio” in Trio

1. Data Model Uncertainty-Lineage Databases (ULDBs)

2. Query Language TriQL

3. System Trio-One — built on top of standard DBMS

Page 17: Trio:  A System for Data, Uncertainty, and Lineage

Remainder of Talk

1. Data Model: Uncertainty-Lineage Databases (ULDBs)

2. Query Language: TriQL

3. System: Trio-One

4. Real-World Applications: PharmGKB.org, SERF

Page 18: Trio:  A System for Data, Uncertainty, and Lineage


Running Example: Crime-Solving

Saw (witness, color, car) // may be uncertain

Drives (person, color, car) // may be uncertain

Suspects (person) = πperson(Saw ⋈ Drives)

Page 19: Trio:  A System for Data, Uncertainty, and Lineage

In Standard Relational DBMS

Saw (witness, color, car)
Amy     red     Mazda
Betty   blue    Honda
Carol   green   Toyota

Drives (person, color, car)
Jimmy   red     Toyota
Frank   red     Mazda
Billy   blue    Honda
Frank   green   Mazda

Create Table Suspects as
Select person
From Saw, Drives
Where Saw.color = Drives.color And Saw.car = Drives.car

Suspects
Frank
Billy
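The certain-data version of the running example can be checked directly. The following is a minimal sketch using Python's built-in sqlite3 module (an assumption made for convenience; the slide does not name a particular DBMS for this step). It loads the two tables shown on the slide and runs the same join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Saw (witness TEXT, color TEXT, car TEXT)")
cur.execute("CREATE TABLE Drives (person TEXT, color TEXT, car TEXT)")

# Rows copied from the slide.
cur.executemany("INSERT INTO Saw VALUES (?, ?, ?)", [
    ("Amy", "red", "Mazda"),
    ("Betty", "blue", "Honda"),
    ("Carol", "green", "Toyota"),
])
cur.executemany("INSERT INTO Drives VALUES (?, ?, ?)", [
    ("Jimmy", "red", "Toyota"),
    ("Frank", "red", "Mazda"),
    ("Billy", "blue", "Honda"),
    ("Frank", "green", "Mazda"),
])

# Same query as on the slide: match sightings to drivers on color and car.
cur.execute("""
    CREATE TABLE Suspects AS
    SELECT person FROM Saw, Drives
    WHERE Saw.color = Drives.color AND Saw.car = Drives.car
""")
suspects = sorted(row[0] for row in cur.execute("SELECT person FROM Suspects"))
print(suspects)  # ['Billy', 'Frank']
```

Carol's green Toyota matches no driver, so only Frank and Billy end up in Suspects.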

Page 20: Trio:  A System for Data, Uncertainty, and Lineage

Data Model: Uncertainty

An uncertain database encodes a set of possible instances based on different kinds of uncertainty:

• Jimmy drives either a Toyota or a Mazda, but not both.

• Amy saw either a Honda or a Toyota, or nothing.

• Hank is a suspect with confidence 0.7.

• Betty saw either an Acura with confidence 0.5 or a Toyota with confidence 0.3, or nothing with confidence 0.2.

• …

Page 21: Trio:  A System for Data, Uncertainty, and Lineage


Our Model for Uncertainty

1. Alternatives

2. ‘?’ (Maybe) Annotations

3. Confidences

Page 22: Trio:  A System for Data, Uncertainty, and Lineage

Our Model for Uncertainty

1. Alternatives: uncertainty about value
2. ‘?’ (Maybe) Annotations
3. Confidences

Saw (witness, color, car)
Amy     red, Honda ∥ red, Toyota ∥ orange, Mazda

Three possible instances

Page 23: Trio:  A System for Data, Uncertainty, and Lineage

Our Model for Uncertainty

1. Alternatives
2. ‘?’ (Maybe): uncertainty about presence
3. Confidences

Saw (witness, color, car)
Amy     red, Honda ∥ red, Toyota ∥ orange, Mazda
Betty   blue, Acura   ?

Six possible instances

Page 24: Trio:  A System for Data, Uncertainty, and Lineage

Our Model for Uncertainty

1. Alternatives
2. ‘?’ (Maybe): uncertainty about presence
3. Confidences

Saw (witness, color, car)
Amy     red, Honda ∥ red, Toyota ∥ orange, Mazda
Betty   blue, Acura   ?                  (absent)

Betty   blue, Acura ∥ NULL, NULL         (unknown)

Page 25: Trio:  A System for Data, Uncertainty, and Lineage

Our Model for Uncertainty

1. Alternatives
2. ‘?’ (Maybe) Annotations
3. Confidences: weighted uncertainty

Saw (witness, color, car)
Amy     red, Honda 0.5 ∥ red, Toyota 0.3 ∥ orange, Mazda 0.2
Betty   blue, Acura 0.6   ?

Six possible instances, each with a probability
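The count of six instances, each with a probability, can be reproduced by brute-force enumeration. The sketch below is illustrative only: it assumes the two tuples are independent of each other, and the list-of-pairs representation and the helper names `choices` and `possible_instances` are hypothetical, not Trio's internal format.

```python
from itertools import product

# Each tuple is a list of (alternative, confidence) pairs. When the
# confidences sum to less than 1, the remainder is the probability that
# the tuple is absent (the '?' annotation).
saw = [
    [(("Amy", "red", "Honda"), 0.5),
     (("Amy", "red", "Toyota"), 0.3),
     (("Amy", "orange", "Mazda"), 0.2)],
    [(("Betty", "blue", "Acura"), 0.6)],  # maybe-tuple: absent with 0.4
]

def choices(xtuple):
    """All ways to resolve one tuple, including absence (None) if allowed."""
    opts = list(xtuple)
    missing = 1.0 - sum(conf for _, conf in xtuple)
    if missing > 1e-9:
        opts.append((None, missing))
    return opts

def possible_instances(relation):
    """Yield (set of rows, probability), assuming independent tuples."""
    for combo in product(*(choices(t) for t in relation)):
        rows = frozenset(alt for alt, _ in combo if alt is not None)
        prob = 1.0
        for _, conf in combo:
            prob *= conf
        yield rows, prob

instances = list(possible_instances(saw))
for rows, prob in instances:
    print(round(prob, 2), sorted(rows))
```

Amy contributes three choices and Betty two (present or absent), which gives the six instances; their probabilities sum to 1.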

Page 26: Trio:  A System for Data, Uncertainty, and Lineage

Our Model is Not Closed

Saw (witness, car)
Cathy   Honda ∥ Mazda

Drives (person, car)
Jimmy, Toyota ∥ Jimmy, Mazda
Billy, Honda ∥ Frank, Honda
Hank, Honda

Suspects = πperson(Saw ⋈ Drives)

Suspects
Jimmy           ?
Billy ∥ Frank   ?
Hank            ?

CANNOT: does not correctly capture possible instances in the result (not 12)!

Page 27: Trio:  A System for Data, Uncertainty, and Lineage

Lineage to the Rescue

Lineage (provenance): “where data came from”
• Internal lineage – from another Trio relation
• External lineage – from an external source

In Trio: A Boolean function λ from data elements to other data elements (or external sources)

Page 28: Trio:  A System for Data, Uncertainty, and Lineage

Example with Lineage

ID   Saw (witness, car)
11   Cathy   Honda ∥ Mazda

ID   Drives (person, car)
21   Jimmy, Toyota ∥ Jimmy, Mazda
22   Billy, Honda ∥ Frank, Honda
23   Hank, Honda

Suspects = πperson(Saw ⋈ Drives)

ID   Suspects
31   Jimmy           ?   λ(31) = (11,2) ∧ (21,2)
32   Billy ∥ Frank   ?   λ(32,1) = (11,1) ∧ (22,1); λ(32,2) = (11,1) ∧ (22,2)
33   Hank            ?   λ(33) = (11,1) ∧ 23

Page 29: Trio:  A System for Data, Uncertainty, and Lineage

Example with Lineage

ID   Saw (witness, car)
11   Cathy   Honda ∥ Mazda

ID   Drives (person, car)
21   Jimmy, Toyota ∥ Jimmy, Mazda
22   Billy, Honda ∥ Frank, Honda
23   Hank, Honda

Suspects = πperson(Saw ⋈ Drives)

ID   Suspects
31   Jimmy           ?   λ(31) = (11,2) ∧ (21,2)
32   Billy ∥ Frank   ?   λ(32,1) = (11,1) ∧ (22,1); λ(32,2) = (11,1) ∧ (22,2)
33   Hank            ?   λ(33) = (11,1) ∧ 23

Correctly captures possible instances in the result (7)
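To see how lineage constrains the result, the sketch below evaluates the λ formulas from the slide over every choice of base alternatives. It is a simplified illustration: it assumes no base tuple carries a ‘?’, treats λ(33) as referring to alternative 1 of tuple 23, and uses hypothetical dictionary structures rather than Trio's own. It checks only which combinations of suspects lineage permits; it does not try to reproduce the instance count stated on the slide.

```python
from itertools import product

# Base tuples: ID -> alternatives (numbered 1..n as on the slide).
base = {
    11: [("Cathy", "Honda"), ("Cathy", "Mazda")],
    21: [("Jimmy", "Toyota"), ("Jimmy", "Mazda")],
    22: [("Billy", "Honda"), ("Frank", "Honda")],
    23: [("Hank", "Honda")],
}

# Lineage: (result ID, alternative) -> conjunction of base (ID, alternative).
lineage = {
    (31, 1): [(11, 2), (21, 2)],
    (32, 1): [(11, 1), (22, 1)],
    (32, 2): [(11, 1), (22, 2)],
    (33, 1): [(11, 1), (23, 1)],
}
names = {(31, 1): "Jimmy", (32, 1): "Billy", (32, 2): "Frank", (33, 1): "Hank"}

def suspect_instances():
    """Distinct Suspects instances consistent with the lineage formulas."""
    ids = sorted(base)
    found = set()
    # One world = one chosen alternative per base tuple.
    for picks in product(*(range(1, len(base[i]) + 1) for i in ids)):
        world = set(zip(ids, picks))
        present = frozenset(
            names[key] for key, conj in lineage.items()
            if all(term in world for term in conj)
        )
        found.add(present)
    return found

instances = suspect_instances()
for inst in sorted(instances, key=sorted):
    print(sorted(inst))
```

Because Jimmy requires alternative (11,2) and Hank requires (11,1), the two never appear together, which is exactly the kind of correlation that ‘?’ annotations alone cannot express.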

Page 30: Trio:  A System for Data, Uncertainty, and Lineage

Trio Data Model

Uncertainty-Lineage Databases (ULDBs)

1. Alternatives

2. ‘?’ (Maybe) Annotations

3. Confidences

4. Lineage

ULDBs are closed and complete

[Databases with Uncertainty and Lineage, VLDB J. Special Issue 2/08]

Page 31: Trio:  A System for Data, Uncertainty, and Lineage

ULDBs: Lineage

Conjunctive lineage sufficient for most operations (σ, π, ⋈)
• Disjunctive lineage for duplicate-elimination (δ)
• Negative lineage for set difference (−)

General case after several queries:
• Boolean formula
• Can represent any finite (sub)set of possible instances
• Can capture arbitrary Boolean constraints among tuples

Page 32: Trio:  A System for Data, Uncertainty, and Lineage

ULDBs: Minimality

A ULDB relation R represents a set of possible instances

Data-minimality
• Does every tuple in R appear in some possible instance? (no extraneous tuples)
• Is every maybe-tuple in R absent from some possible instance? (no extraneous ‘?’s)

• Also: Lineage-minimality

Page 33: Trio:  A System for Data, Uncertainty, and Lineage

Data Minimality Examples

Extraneous ‘?’

ID   Tuple
10   Billy, Honda ∥ Frank, Mazda
40   Billy ∥ Frank   ?   (extraneous)
     λ(40,1) = (10,1); λ(40,2) = (10,2)

20   Billy, Honda   ?
30   Frank, Mazda   ?

Page 34: Trio:  A System for Data, Uncertainty, and Lineage

Data Minimality Examples

Extraneous tuple

ID   Tuple
10   Diane, Acura ∥ Mazda
40   Diane   ?   (extraneous)
     λ(40,1) = (10,1) ∧ (10,2)

20   Diane, Acura   ?
30   Diane, Mazda   ?
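In the extraneous-tuple example, tuple 40 can never exist because its lineage requires two different alternatives of tuple 10 at once. A minimal sketch of that check for purely conjunctive lineage over base tuples follows; the function name `is_satisfiable` and the pair encoding are hypothetical, and general Boolean lineage would need a more general test.

```python
def is_satisfiable(conjunction):
    """A conjunction of (tuple ID, alternative) terms is satisfiable iff no
    tuple is required to take two different alternatives at once."""
    chosen = {}
    for tid, alt in conjunction:
        # setdefault records the first alternative seen for this tuple and
        # returns the recorded one on later occurrences.
        if chosen.setdefault(tid, alt) != alt:
            return False
    return True

# λ(40,1) = (10,1) ∧ (10,2): contradictory, so tuple 40 is extraneous.
print(is_satisfiable([(10, 1), (10, 2)]))  # False
# λ(40,1) = (10,1): fine, the alternative appears in some instance.
print(is_satisfiable([(10, 1)]))           # True
```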

Page 35: Trio:  A System for Data, Uncertainty, and Lineage

ULDBs: Complexity Questions

• Does a given tuple t appear in some (all) possible instance(s) of an uncertain relation R?
  Polynomial algorithms based on data-minimization

• Is a given table T one of (all of) the possible instances of R?
  NP-hard

• Confidence computation over arbitrary Boolean expressions:
  #P-complete
  Efficient approximation algorithms (e.g., Monte Carlo simulations)
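The Monte Carlo idea mentioned on this slide can be sketched as follows: sample one alternative per base tuple according to its confidences, evaluate the lineage formula in the sampled world, and report the fraction of worlds where it holds. This is a generic illustration, not Trio's algorithm; the base confidences, the example formula, and the function names are assumptions, and base tuples are assumed independent.

```python
import random

def sample_world(xtuples, rng):
    """Pick one alternative index per tuple (None if the tuple is absent)."""
    world = {}
    for tid, confs in xtuples.items():
        r, acc, pick = rng.random(), 0.0, None
        for idx, conf in enumerate(confs, start=1):
            acc += conf
            if r < acc:
                pick = idx
                break
        world[tid] = pick
    return world

def estimate_confidence(formula, xtuples, samples=20000, seed=0):
    """Fraction of sampled worlds in which the Boolean formula holds."""
    rng = random.Random(seed)
    hits = sum(bool(formula(sample_world(xtuples, rng))) for _ in range(samples))
    return hits / samples

# Hypothetical base data: tuple ID -> confidences of its alternatives.
xtuples = {11: [0.6, 0.4], 21: [0.3, 0.6]}
# Hypothetical lineage: (11,2) AND ((21,1) OR (21,2)); exact value 0.4 * 0.9.
formula = lambda w: w[11] == 2 and w[21] in (1, 2)

estimate = estimate_confidence(formula, xtuples)
print(estimate)
```

The estimate converges to the exact confidence of 0.36 as the number of samples grows; the trade-off is accuracy against sampling cost, which is why approximation is attractive when the exact problem is #P-complete.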

Page 36: Trio:  A System for Data, Uncertainty, and Lineage

Querying ULDBs

TriQL

Simple extension to SQL

Formal semantics, intuitive meaning

Query uncertainty, confidences, and lineage

Page 37: Trio:  A System for Data, Uncertainty, and Lineage

Formal Semantics

Relational (SQL) query Q on ULDB D

[Diagram: D represents the possible instances D1, D2, …, Dn; applying Q on each instance gives Q(D1), Q(D2), …, Q(Dn). The implementation of Q (operational semantics) takes D to D + Result, which is a representation of the instances Q(D1), Q(D2), …, Q(Dn).]

Page 38: Trio:  A System for Data, Uncertainty, and Lineage

Operational Semantics: Example

SELECT person
FROM Saw, Drives
WHERE Saw.car = Drives.car

Saw (witness, car)

Cathy Honda ∥ Mazda

Drives (person, car)

Jim ∥ Bill Mazda

Hank Honda

Page 39: Trio:  A System for Data, Uncertainty, and Lineage

Operational Semantics: Example

SELECT person
FROM Saw, Drives
WHERE Saw.car = Drives.car

(Cathy,Honda,Jim,Mazda) ∥ (Cathy,Honda,Bill,Mazda) ∥ (Cathy,Mazda,Jim,Mazda) ∥ (Cathy,Mazda,Bill,Mazda)

Saw (witness, car)

Cathy Honda ∥ Mazda

Drives (person, car)

Jim ∥ Bill Mazda

Hank Honda


Page 41: Trio:  A System for Data, Uncertainty, and Lineage

Operational Semantics: Example

SELECT person
FROM Saw, Drives
WHERE Saw.car = Drives.car

(Cathy,Honda,Jim,Mazda) ∥ (Cathy,Honda,Bill,Mazda) ∥ (Cathy,Mazda,Jim,Mazda) ∥ (Cathy,Mazda,Bill,Mazda)

Saw (witness, car)

Cathy Honda ∥ Mazda

Drives (person, car)

Jim ∥ Bill Mazda

Hank Honda

?

Page 42: Trio:  A System for Data, Uncertainty, and Lineage

Operational Semantics: Example

SELECT person
FROM Saw, Drives
WHERE Saw.car = Drives.car

(Cathy,Honda,Jim,Mazda) ∥ (Cathy,Honda,Bill,Mazda) ∥ (Cathy,Mazda,Jim,Mazda) ∥ (Cathy,Mazda,Bill,Mazda)

(Cathy,Honda,Hank,Honda) ∥ (Cathy,Mazda,Hank,Honda)

Saw (witness, car)

Cathy Honda ∥ Mazda

Drives (person, car)

Jim ∥ Bill Mazda

Hank Honda

?


Page 44: Trio:  A System for Data, Uncertainty, and Lineage

Operational Semantics: Example

SELECT person
FROM Saw, Drives
WHERE Saw.car = Drives.car

(Cathy,Honda,Jim,Mazda) ∥ (Cathy,Honda,Bill,Mazda) ∥ (Cathy,Mazda,Jim,Mazda) ∥ (Cathy,Mazda,Bill,Mazda)

(Cathy,Honda,Hank,Honda) ∥ (Cathy,Mazda,Hank,Honda)

Saw (witness, car)

Cathy Honda ∥ Mazda

Drives (person, car)

Jim ∥ Bill Mazda

Hank Honda

??

Page 45: Trio:  A System for Data, Uncertainty, and Lineage

Operational Semantics: Example

CREATE TABLE Suspects AS
SELECT Drives.person
FROM Saw, Drives
WHERE Saw.car = Drives.car

Saw (witness, car)
Cathy        Honda ∥ Mazda

Drives (person, car)
Jim ∥ Bill   Mazda
Hank         Honda

Suspects
Jim ∥ Bill   ?   λ( ) = ...
Hank         ?   λ( ) = ...

• Efficiently implementable on top of standard SQL
• Data computation overhead: O(n log(n))
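The steps walked through in this example (combine tuples pairwise, cross their alternatives, keep the combinations that satisfy the WHERE clause, mark the result ‘?’ when some combination was dropped, and record lineage) can be sketched in a few lines. This is a simplified illustration, not the Trio implementation: the tuple IDs 11, 21, 22, 31, 32 are hypothetical, the base tuples are assumed to carry no ‘?’, and the function `join_on_car` handles only this one join condition.

```python
# Base data from the slide, with hypothetical tuple IDs.
saw = {11: [("Cathy", "Honda"), ("Cathy", "Mazda")]}
drives = {21: [("Jim", "Mazda"), ("Bill", "Mazda")], 22: [("Hank", "Honda")]}

def join_on_car(saw, drives, first_id=31):
    """Join tuples pairwise, keep matching alternatives, record lineage."""
    result, lineage, next_id = {}, {}, first_id
    for sid, s_alts in saw.items():
        for did, d_alts in drives.items():
            # Cross the alternatives and apply Saw.car = Drives.car.
            kept = [
                (person, [(sid, si), (did, di)])
                for si, (_, s_car) in enumerate(s_alts, 1)
                for di, (person, d_car) in enumerate(d_alts, 1)
                if s_car == d_car
            ]
            if not kept:
                continue
            for k, (_, lin) in enumerate(kept, 1):
                lineage[(next_id, k)] = lin
            # '?' if at least one combination of alternatives was filtered out.
            maybe = len(kept) < len(s_alts) * len(d_alts)
            result[next_id] = ([person for person, _ in kept], maybe)
            next_id += 1
    return result, lineage

suspects, lineage = join_on_car(saw, drives)
print(suspects)  # {31: (['Jim', 'Bill'], True), 32: (['Hank'], True)}
print(lineage)
```

Both result tuples come out as maybe-tuples, matching the Suspects table on the slide, and each alternative points back to the Saw and Drives alternatives it was derived from.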

Page 46: Trio:  A System for Data, Uncertainty, and Lineage

Confidences

Confidences supplied with base data

Trio computes confidences on query results
• Default probabilistic interpretation (#P-complete)
• Plug in different arithmetics, e.g., Min (linear)

Saw (witness, car)
Cathy                Honda 0.6 ∥ Mazda 0.4

Drives (person, car)
Jim 0.3 ∥ Bill 0.6   Mazda
Hank 1.0             Honda

Suspects (Probabilistic)
Jim 0.12 ∥ Bill 0.24
Hank 0.6

Suspects (Min)
Jim 0.3 ∥ Bill 0.4
Hank 0.6
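The two sets of Suspects confidences on this slide follow from combining the confidences of the Saw and Drives alternatives that each suspect was derived from. The sketch below is an illustration under stated assumptions: each suspect's lineage is a conjunction of one Saw and one Drives alternative, the base alternatives are independent, and the helper `suspect_confidences` is hypothetical. The probabilistic interpretation multiplies; the Min arithmetic takes the smaller value.

```python
from math import prod

# Cathy's alternatives in Saw, keyed by car, with their confidences.
saw = {"Honda": 0.6, "Mazda": 0.4}
# Drives alternatives: (person, car, confidence).
drives = [("Jim", "Mazda", 0.3), ("Bill", "Mazda", 0.6), ("Hank", "Honda", 1.0)]

def suspect_confidences(combine):
    """Combine the confidences of the two alternatives in each lineage."""
    return {person: combine([saw[car], conf]) for person, car, conf in drives}

probabilistic = suspect_confidences(prod)  # multiply independent events
minimum = suspect_confidences(min)         # plug-in "Min" arithmetic

print({p: round(c, 2) for p, c in probabilistic.items()})
# {'Jim': 0.12, 'Bill': 0.24, 'Hank': 0.6}
print(minimum)
# {'Jim': 0.3, 'Bill': 0.4, 'Hank': 0.6}
```

Multiplication is only valid here because each lineage is a conjunction of independent base alternatives; for arbitrary Boolean lineage the probabilistic computation is the #P-complete case noted on the slide.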

Page 47: Trio:  A System for Data, Uncertainty, and Lineage

TriQL: Querying Confidences

Built-in function: Conf()

SELECT person
FROM Saw, Drives
WHERE Saw.car = Drives.car
  AND Conf(Saw) > 0.5 AND Conf(Drives) > 0.8

Similarly: Conf(*)

SELECT person
FROM Saw, Drives
WHERE Saw.car = Drives.car AND Conf(*) > 0.5

Naïve data-minimization:

WHERE Conf(*) > 0

Page 48: Trio:  A System for Data, Uncertainty, and Lineage

TriQL: Querying Lineage

Built-in join predicate: Lineage()

SELECT Suspects.person
FROM Suspects, Saw
WHERE Saw.witness = 'Amy' AND Lineage(Suspects,Saw)

X ==> Y shorthand for Lineage(X,Y)

Page 49: Trio:  A System for Data, Uncertainty, and Lineage

More TriQL Constructs

• “Horizontal subqueries”
  • Refer to tuple alternatives as a relation

• Aggregations:
  • (sum, count, avg, min, max) × (low, high, expected)

• Fetch modifiers:
  • Merged (eliminate horizontal duplicates)
  • Flatten, GroupAlts, NoLineage, NoConf, NoMaybe

• Set operators: Union [All], Intersect, Except
  • Conjunctive, Disjunctive & Negative Lineage

Page 50: Trio:  A System for Data, Uncertainty, and Lineage

Horizontal Subquery Example

PrimeSuspect (crime#, suspect, accuser)
1   Jimmy, Amy ∥ Billy, Betty ∥ Hank, Cathy
2   Frank, Cathy ∥ Freddy, Betty

Credibility (person, score)
Amy     10
Betty   15
Cathy   5

Suspects
Jimmy 0.33 ∥ Billy 0.5 ∥ Hank 0.166
Frank 0.25 ∥ Freddy 0.75

List suspects with conf values based on accuser credibility

Page 51: Trio:  A System for Data, Uncertainty, and Lineage

Horizontal Subquery Example

PrimeSuspect (crime#, suspect, accuser)
1   Jimmy, Amy ∥ Billy, Betty ∥ Hank, Cathy
2   Frank, Cathy ∥ Freddy, Betty

Credibility (person, score)
Amy     10
Betty   15
Cathy   5

Suspects
Jimmy 0.33 ∥ Billy 0.5 ∥ Hank 0.166
Frank 0.25 ∥ Freddy 0.75

SELECT suspect, score/[sum(score)] as conf
FROM PrimeSuspect, Credibility
WHERE accuser = person
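The bracketed `[sum(score)]` is the horizontal subquery: it sums over the alternatives of the current tuple rather than over the whole table. A minimal sketch of that arithmetic, using plain Python dictionaries as a hypothetical stand-in for the two relations:

```python
# crime# -> alternatives, each a (suspect, accuser) pair, as on the slide.
prime_suspect = {
    1: [("Jimmy", "Amy"), ("Billy", "Betty"), ("Hank", "Cathy")],
    2: [("Frank", "Cathy"), ("Freddy", "Betty")],
}
credibility = {"Amy": 10, "Betty": 15, "Cathy": 5}

def suspects_with_conf(prime_suspect, credibility):
    """Confidence of each alternative = accuser score / sum of scores
    across the alternatives of the same tuple."""
    out = {}
    for crime, alts in prime_suspect.items():
        total = sum(credibility[accuser] for _, accuser in alts)  # [sum(score)]
        out[crime] = [(s, credibility[accuser] / total) for s, accuser in alts]
    return out

suspects = suspects_with_conf(prime_suspect, credibility)
for crime, alts in suspects.items():
    print(crime, [(s, round(c, 3)) for s, c in alts])
# 1 [('Jimmy', 0.333), ('Billy', 0.5), ('Hank', 0.167)]
# 2 [('Frank', 0.25), ('Freddy', 0.75)]
```

For crime 1 the scores 10, 15, and 5 sum to 30, giving the 0.33, 0.5, and 0.166 shown in the Suspects table; for crime 2 the scores 5 and 15 sum to 20.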

Page 52: Trio:  A System for Data, Uncertainty, and Lineage

Horizontal Subquery Example

PrimeSuspect (crime#, suspect, accuser)
1   Jimmy, Amy ∥ Billy, Betty ∥ Hank, Cathy
2   Frank, Cathy ∥ Freddy, Betty

Credibility (person, score)
Amy     10
Betty   15
Cathy   5

Suspects
Jimmy 0.33 ∥ Billy 0.5 ∥ Hank 0.166
Frank 0.25 ∥ Freddy 0.75

SELECT suspect,
FROM PrimeSuspect, Credibility
WHERE accuser = person

scaled as conf

Remove extraneous ‘?’ if:

[Sum(Conf(*))] = 1

Page 53: Trio:  A System for Data, Uncertainty, and Lineage


Trio-Specific Additional Features

• Lineage tracing

• On-demand confidence computation

• Coexistence checks

• Extraneous data removal

Interrelated algorithms based on lineage

Page 54: Trio:  A System for Data, Uncertainty, and Lineage

The Trio System

• Version 1 (“Trio-One”)
• Layered on top of a standard DBMS (PostgreSQL)
• Surprisingly easy and complete, reasonably efficient
• Open source as of summer 2007
• Supports Linux, Windows Server/XP/Vista & Mac OS X
• Constantly updated & maintained

Page 55: Trio:  A System for Data, Uncertainty, and Lineage

Trio-One Overview

[Architecture diagram, top to bottom]

TrioExplorer (GUI client), TrioPlus (command-line client)
• DDL commands
• TriQL queries
• Schema browsing
• Table browsing
• Explore lineage
• On-demand confidence computation

Trio API and TriQL rewriter (Python)
• Standard SQL

Standard relational DBMS

Encoded Data Tables
• Partition and “verticalize”
• Shared IDs for alternatives
• Columns for confidence, “?”

Lineage Tables
• One per result table
• Uses unique IDs
• Encodes formulas

Trio Metadata
• Table types
• Schema-level lineage structure

Trio Stored Procedures (Postgres SPI functions)
• Conf()
• Lineage()
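The encoding ideas listed for the relational layer (shared IDs for alternatives, columns for confidence and ‘?’, one lineage table per result table) can be illustrated with ordinary tables. The sketch below uses Python's sqlite3 purely for convenience; the table and column names (`Saw_enc`, `Suspects_enc`, `Suspects_lin`, `aid`, `xid`, `maybe`) are hypothetical and are not Trio's actual schema, and the partitioning and “verticalizing” step is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One row per alternative; alternatives of the same tuple share xid.
    -- conf holds the confidence, maybe is 1 for a '?' tuple.
    CREATE TABLE Saw_enc (aid INTEGER PRIMARY KEY, xid INTEGER,
                          witness TEXT, car TEXT, conf REAL, maybe INTEGER);
    CREATE TABLE Suspects_enc (aid INTEGER PRIMARY KEY, xid INTEGER,
                               person TEXT, conf REAL, maybe INTEGER);
    -- One lineage table per result table: result alternative -> source.
    CREATE TABLE Suspects_lin (aid INTEGER, src_table TEXT, src_aid INTEGER);
""")
conn.executemany("INSERT INTO Saw_enc VALUES (?,?,?,?,?,?)", [
    (1, 11, "Cathy", "Honda", 0.6, 0),
    (2, 11, "Cathy", "Mazda", 0.4, 0),
])
conn.executemany("INSERT INTO Suspects_enc VALUES (?,?,?,?,?)", [
    (10, 31, "Jim", None, 1),
    (11, 31, "Bill", None, 1),
    (12, 32, "Hank", None, 1),
])
conn.executemany("INSERT INTO Suspects_lin VALUES (?,?,?)", [
    (10, "Saw_enc", 2), (11, "Saw_enc", 2), (12, "Saw_enc", 1),
])

# A Lineage()-style question in plain SQL: which suspects derive from
# the sighting of a Mazda?
rows = conn.execute("""
    SELECT s.person FROM Suspects_enc s
    JOIN Suspects_lin l ON l.aid = s.aid
    JOIN Saw_enc w ON l.src_table = 'Saw_enc' AND l.src_aid = w.aid
    WHERE w.car = 'Mazda' ORDER BY s.person
""").fetchall()
print([r[0] for r in rows])  # ['Bill', 'Jim']
```

The point of the design is that once alternatives and lineage are rows in ordinary tables, the underlying DBMS supplies storage, indexing, transactions, and recovery unchanged.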

Page 56: Trio:  A System for Data, Uncertainty, and Lineage

Applications

Pharmaceutical / Medical Databases
• PharmGKB.org
• Relationships between Drugs, Diseases, Genes
• Evidence from Literature References, Clinical Outcomes, etc.

Page 57: Trio:  A System for Data, Uncertainty, and Lineage

Applications – PharmGKB.org

[Diagram: Drug A, Disease B, Gene C; derived Effective_{A,B} and Effective_{A,B,C}]

• Relational Schema
  • Drugs(DrugID, Name, AltName)
  • Diseases(DiseaseID, Name, AltName)
  • Genes(GeneID, Name, Symbol, AltSymbol)
  • Relationships(RelID, RelatedTo, Evidence)

• Independent Base Data
  P[Effective_{A,B,C}] = P[A] P[B] P[C]

Page 58: Trio:  A System for Data, Uncertainty, and Lineage

Applications – PharmGKB.org

[Diagram: Drug A, Disease B, Gene C; derived Effective_{A,B} and Effective_{A,B,C}]

• Relational Schema
  • Drugs(DrugID, Name, AltName)
  • Diseases(DiseaseID, Name, AltName)
  • Genes(GeneID, Name, Symbol, AltSymbol)
  • Relationships(RelID, RelatedTo, Evidence)

• Non-Independent Base Data
  P[Effective_{A,B,C}] = P[E|A,B,C] ≠ P[A] P[B] P[C]
  E.g.: C = Heart Disease, B = High Blood Pressure, A = Warfarin

ULDBs vs. Bayesian Nets
Factorized representation w/ conditional probability tables (CPTs)

Page 59: Trio:  A System for Data, Uncertainty, and Lineage

Applications – SERF

Stanford Entity Resolution Framework

Swoosh: A Generic Approach to Entity Resolution (Benjelloun, Garcia-Molina, Whang et al.)

<id>r1 <name>J. Doe <phone>351-2535
<id>r2 <name>John Doe <phone>357-2635 <email>[email protected]
<id>r3 <name>John D. <phone>351-2535 <email>[email protected]

Iteration 1
<id>r4 <name>{John Doe||John D.} <phone>{357-2635||351-2535} <email>[email protected]

Iteration 2
<id>r5 <name>{John Doe||John D.||J. Doe} <phone>{357-2635||351-2535} <email>[email protected]
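The merged records r4 and r5 keep every value seen for an attribute as a set of alternatives. A minimal sketch of such a merge step follows; it is not the Swoosh algorithm itself. The decision of which records match is taken as given, the `merge` function is hypothetical, and the email attribute is left out because its values are not legible in the transcript.

```python
def merge(a, b, new_id):
    """Merge two records by unioning their values per attribute."""
    merged = {"id": new_id}
    for attr in sorted((set(a) | set(b)) - {"id"}):
        merged[attr] = a.get(attr, set()) | b.get(attr, set())
    return merged

r1 = {"id": "r1", "name": {"J. Doe"}, "phone": {"351-2535"}}
r2 = {"id": "r2", "name": {"John Doe"}, "phone": {"357-2635"}}
r3 = {"id": "r3", "name": {"John D."}, "phone": {"351-2535"}}

r4 = merge(r2, r3, "r4")  # iteration 1
r5 = merge(r4, r1, "r5")  # iteration 2

print(sorted(r4["name"]), sorted(r4["phone"]))
print(sorted(r5["name"]), sorted(r5["phone"]))
```

Each set of values maps naturally onto Trio alternatives (John Doe ∥ John D. ∥ J. Doe), and the chain r2, r3 → r4 → r5 is the kind of derivation that lineage records.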

Page 60: Trio:  A System for Data, Uncertainty, and Lineage

Applications – SERF

Stanford Entity Resolution Framework

Swoosh: A Generic Approach to Entity Resolution (Benjelloun, Garcia-Molina, Whang et al.)

<id>r1 <name>J. Doe <phone>351-2535
<id>r2 <name>John Doe <phone>357-2635 <email>[email protected]
<id>r3 <name>John D. <phone>351-2535 <email>[email protected]

Iteration 1
<id>r4 <name>{John Doe : 0.4 || John D. : 0.3} <phone>{357-2635 : 0.5 || 351-2535 : 0.5} <email>[email protected]

Iteration 2
<id>r5 <name>{John Doe : 0.4 || John D. : 0.3 || J. Doe : 0.3} <phone>{357-2635 : 0.3 || 351-2535 : 0.6} <email>[email protected]

Page 61: Trio:  A System for Data, Uncertainty, and Lineage

Future Topics (sample)

More forms of uncertainty
• Continuous uncertainty (intervals, Gaussians)
• Correlated uncertainty
• Incomplete relations

More efficient confidence-based queries
• Top-k by confidence
• Confidence bounds & approximation

More forms of lineage
• Update propagation w/lineage (versioning)
• External lineage

Page 62: Trio:  A System for Data, Uncertainty, and Lineage

Future Topics (sample)

Theory
• AI models/Bayesian Nets & inferencing
• Design theory

System
• Full TriQL query language
• Version 2: Go native beyond Postgres SPI?
  – Storage and indexing structures
  – Statistics
  – Query optimization

Page 63: Trio:  A System for Data, Uncertainty, and Lineage

Search “stanford trio”

Trio contributors, past and present: Parag Agrawal, Omar Benjelloun, Julien Chaumond, Ashok Chandra, Anish Das Sarma, Alon Halevy, Chris Hayworth, Ander de Keijzer, Raghotham Murthy, Michi Mutsuzaki, Shubha Nabar, Tomoe Sugihara, Martin Theobald, Jeff Ullman, Jennifer Widom