Probabilistic Ranking of Database Query Results

Surajit Chaudhuri, Microsoft Research; Gautam Das, Microsoft Research; Vagelis Hristidis, Florida International University; Gerhard Weikum, MPI Informatik. Presented by Weimin He, CSE@UTA.


Transcript of Probabilistic Ranking of Database Query Results

Page 1: Probabilistic Ranking of Database Query Results

Probabilistic Ranking of Database Query Results

Surajit Chaudhuri, Microsoft Research
Gautam Das, Microsoft Research
Vagelis Hristidis, Florida International University
Gerhard Weikum, MPI Informatik

Presented by Weimin He, CSE@UTA

Page 2: Probabilistic Ranking of Database Query Results

04/19/23 Weimin He CSE@UTA 2

Outline

Motivation
Problem Definition
System Architecture
Construction of Ranking Function
Implementation
Experiments
Conclusion and Open Problems

Page 3: Probabilistic Ranking of Database Query Results


Motivating example

Realtor DB: Table D = (TID, Price, City, Bedrooms, Bathrooms, LivingArea, SchoolDistrict, View, Pool, Garage, BoatDock)

SQL query: Select * From D Where City="Seattle" And View="Waterfront"

Page 4: Probabilistic Ranking of Database Query Results


Motivation

Many-answers problem
Two alternative solutions:
  Query reformulation
  Automatic ranking
Apply probabilistic model in IR to DB tuple ranking

Page 5: Probabilistic Ranking of Database Query Results


Problem Definition

Given a database table D with n tuples {t1, …, tn} over a set of m categorical attributes A = {A1, …, Am}, and a query

Q: SELECT * FROM D WHERE X1=x1 AND … AND Xs=xs

where each Xi is an attribute from A and xi is a value in its domain.

The set of attributes X = {X1, …, Xs} is known as the set of attributes specified by the query, while the set Y = A – X is known as the set of unspecified attributes.

Let S ⊆ {t1, …, tn} be the answer set of Q.

How to rank tuples in S and return top-k tuples to the user?
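
The setup above can be sketched in a few lines of Python. This is only an illustration: the in-memory list of dictionaries, the sample rows, and the helper name `answer_set` are assumptions made for the example, not part of the presented system.

```python
def answer_set(table, query):
    """Return the tuples of `table` that satisfy every condition Xi = xi in `query`."""
    return [t for t in table if all(t.get(a) == v for a, v in query.items())]

# Hypothetical rows of the realtor table D (only a few attributes shown).
D = [
    {"TID": "1", "City": "Seattle", "View": "Waterfront", "SchoolDistrict": "Excellent"},
    {"TID": "2", "City": "Seattle", "View": "Street", "SchoolDistrict": "Good"},
    {"TID": "3", "City": "Seattle", "View": "Waterfront", "SchoolDistrict": "Good"},
    {"TID": "4", "City": "Bellevue", "View": "Waterfront", "SchoolDistrict": "Excellent"},
]

Q = {"City": "Seattle", "View": "Waterfront"}  # specified attributes X and their values
S = answer_set(D, Q)                            # answer set of Q, still unranked

X = set(Q)                   # specified attributes
Y = set(D[0]) - X - {"TID"}  # unspecified attributes (TID is only an identifier)
```

Every tuple in S agrees on the X values, so any ranking of S has to come from the Y values.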

Page 6: Probabilistic Ranking of Database Query Results


System Architecture

Page 7: Probabilistic Ranking of Database Query Results


Intuition for Ranking Function

Select * From D Where City="Seattle" And View="Waterfront"

Score of a result tuple t depends on:
  Global Score: global importance of unspecified attribute values
    E.g., homes with good school districts are globally desirable
  Conditional Score: correlations between specified and unspecified attribute values
    E.g., Waterfront ⇒ BoatDock

Page 8: Probabilistic Ranking of Database Query Results


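Probabilistic Model in IR

Bayes’ Rule:

$$p(a \mid b) = \frac{p(b \mid a)\, p(a)}{p(b)}$$

Product Rule:

$$p(a, b \mid c) = p(a \mid c)\, p(b \mid a, c)$$

Document t, Query Q
R: relevant document set
$\bar{R} = D - R$: irrelevant document set

$$\mathrm{Score}(t) = \frac{p(R \mid t)}{p(\bar{R} \mid t)} = \frac{p(t \mid R)\, p(R) / p(t)}{p(t \mid \bar{R})\, p(\bar{R}) / p(t)} \propto \frac{p(t \mid R)}{p(t \mid \bar{R})}$$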

Note (Vagelis Hristidis): Let's see how adapting PIR techniques to our problem yields a ranking function.

Page 9: Probabilistic Ranking of Database Query Results


Adaptation of PIR to DB

Tuple t is considered as a document

Partition t into t(X) and t(Y)
t(X) and t(Y) are written as X and Y
Derive from the initial scoring function until the final ranking function is obtained

Page 10: Probabilistic Ranking of Database Query Results


Preliminary Derivation

Page 11: Probabilistic Ranking of Database Query Results


Limited Independence Assumptions

Given a query Q and a tuple t, the X (and Y) values within themselves are assumed to be independent, though dependencies between the X and Y values are allowed

$$p(X \mid C) = \prod_{x \in X} p(x \mid C)$$

$$p(Y \mid C) = \prod_{y \in Y} p(y \mid C)$$

Page 12: Probabilistic Ranking of Database Query Results


Continuing Derivation

Page 13: Probabilistic Ranking of Database Query Results


Workload-based Estimation of p(y | R)

Assume a collection of “past” queries exists in the system

Workload W is represented as a set of “tuples”

Given query Q and specified attribute set X, approximate R as all query “tuples” in W that also request X

All properties of the relevant tuple set R can be obtained by examining only the subset of the workload that contains queries that also request X

$$p(y \mid R) \approx p(y \mid X, W)$$
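
The approximation above can be illustrated with a small Python sketch. It assumes that each past query is stored as a dictionary of the attribute values it specified, and that "also request X" means the past query carries the same X values as the current one. The workload contents and the function name `p_y_given_X_W` are made up for the example.

```python
def p_y_given_X_W(workload, X, y):
    """Estimate p(y | X, W): among workload queries that specify all of X,
    the fraction that also specify the attribute value y."""
    attr, val = y
    # Approximate R by the workload "tuples" that also request X.
    relevant = [q for q in workload if all(q.get(a) == v for a, v in X.items())]
    if not relevant:
        return 0.0  # no evidence in the workload
    return sum(1 for q in relevant if q.get(attr) == val) / len(relevant)

# Hypothetical workload W: one dictionary per past query.
W = [
    {"City": "Seattle", "View": "Waterfront", "BoatDock": "Yes"},
    {"City": "Seattle", "View": "Waterfront", "BoatDock": "Yes", "SchoolDistrict": "Excellent"},
    {"City": "Seattle", "View": "Waterfront"},
    {"City": "Seattle", "SchoolDistrict": "Excellent"},
    {"City": "Bellevue", "View": "Waterfront", "BoatDock": "Yes"},
]

X = {"City": "Seattle", "View": "Waterfront"}
p_dock = p_y_given_X_W(W, X, ("BoatDock", "Yes"))
p_school = p_y_given_X_W(W, X, ("SchoolDistrict", "Excellent"))
```

Three of the five past queries request X; two of those also ask for a boat dock and one asks for an excellent school district.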

Page 14: Probabilistic Ranking of Database Query Results


Final Ranking Function
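
A sketch of a ranking function in LaTeX, written so that it uses only the four atomic quantities p(y | W), p(y | D), p(x | y, W) and p(x | y, D) that the pre-computation slides rely on. The product form is an assumption, pieced together from those quantities, the limited independence assumptions, and the global/conditional split described earlier; treat it as a plausible reading rather than a quoted formula.

```latex
\[
\mathrm{Score}(t) \;\propto\;
\underbrace{\prod_{y \in Y} \frac{p(y \mid W)}{p(y \mid D)}}_{\text{global part}}
\;\cdot\;
\underbrace{\prod_{y \in Y} \prod_{x \in X} \frac{p(x \mid y, W)}{p(x \mid y, D)}}_{\text{conditional part}}
\]
```

Under this reading, the first product rewards unspecified values that users ask for more often than they occur in the data, and the second rewards unspecified values that are correlated with the specified ones.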

Page 15: Probabilistic Ranking of Database Query Results


Pre-computing Atomic Probabilities in Ranking Function

p(y | W): relative frequency in W
p(y | D): relative frequency in D
p(x | y, W): (# of tuples in W that contain x, y) / total # of tuples in W
p(x | y, D): (# of tuples in D that contain x, y) / total # of tuples in D

Page 16: Probabilistic Ranking of Database Query Results


Example for Computing Atomic Probabilities

Select * From D Where City=“Seattle” And View=“Waterfront”

Y={SchoolDistrict, BoatDock, …}

|D| = 10,000, |W| = 1,000, W{excellent} = 10, W{waterfront & yes} = 5

p(excellent | W) = 10/1,000 = 0.01
p(excellent | D) = 10/10,000 = 0.001
p(waterfront | yes, W) = 5/1,000 = 0.005
p(waterfront | yes, D) = 5/10,000 = 0.0005
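
The arithmetic of this example can be checked with a short Python sketch. It builds synthetic tables with the stated sizes and counts (1,000 tuples for W, 10,000 for D, 10 "excellent", 5 "waterfront & yes"); the row layout and the helper names `rel_freq` and `make_rows` are assumptions for illustration. Following the slide, the pairwise quantity is normalized by the total number of tuples; dividing by the number of tuples that contain y instead would give the textbook conditional probability.

```python
def rel_freq(rows, **conds):
    """Fraction of rows that carry every attribute value listed in `conds`."""
    hits = sum(1 for r in rows if all(r.get(a) == v for a, v in conds.items()))
    return hits / len(rows)

def make_rows(total, n_excellent, n_waterfront_yes):
    """Synthetic table with the given counts; all other values are filler."""
    rows = [{"SchoolDistrict": "other", "View": "other", "BoatDock": "no"}
            for _ in range(total)]
    for r in rows[:n_excellent]:
        r["SchoolDistrict"] = "excellent"
    for r in rows[-n_waterfront_yes:]:
        r["View"] = "waterfront"
        r["BoatDock"] = "yes"
    return rows

W = make_rows(1_000, 10, 5)
D = make_rows(10_000, 10, 5)

p_exc_W = rel_freq(W, SchoolDistrict="excellent")        # 10 / 1,000
p_exc_D = rel_freq(D, SchoolDistrict="excellent")        # 10 / 10,000
p_wf_yes_W = rel_freq(W, View="waterfront", BoatDock="yes")  # 5 / 1,000
p_wf_yes_D = rel_freq(D, View="waterfront", BoatDock="yes")  # 5 / 10,000
```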

Page 17: Probabilistic Ranking of Database Query Results


Indexing Atomic Probabilities

p(y | W):
  {AttName, AttVal, Prob}
  B+ tree index on (AttName, AttVal)

p(y | D):
  {AttName, AttVal, Prob}
  B+ tree index on (AttName, AttVal)

p(x | y, W):
  {AttNameLeft, AttValLeft, AttNameRight, AttValRight, Prob}
  B+ tree index on (AttNameLeft, AttValLeft, AttNameRight, AttValRight)

p(x | y, D):
  {AttNameLeft, AttValLeft, AttNameRight, AttValRight, Prob}
  B+ tree index on (AttNameLeft, AttValLeft, AttNameRight, AttValRight)
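
As a rough illustration of these auxiliary tables, the sketch below creates two of them in an in-memory SQLite database from Python. This is a stand-in, not the system described in the slides: the table and index names are hypothetical, SQLite's ordinary B-tree indexes substitute for the B+ tree indexes, and which attribute goes on the "left" or "right" side is an arbitrary choice here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- p(y | W): one row per distinct attribute value
    CREATE TABLE GlobalProbW (AttName TEXT, AttVal TEXT, Prob REAL);
    CREATE INDEX idx_global_w ON GlobalProbW (AttName, AttVal);

    -- p(x | y, W): one row per pair of attribute values
    CREATE TABLE CondProbW (
        AttNameLeft TEXT, AttValLeft TEXT,
        AttNameRight TEXT, AttValRight TEXT,
        Prob REAL
    );
    CREATE INDEX idx_cond_w
        ON CondProbW (AttNameLeft, AttValLeft, AttNameRight, AttValRight);
""")

# Made-up sample entries.
conn.execute("INSERT INTO GlobalProbW VALUES (?, ?, ?)",
             ("SchoolDistrict", "excellent", 0.01))
conn.execute("INSERT INTO CondProbW VALUES (?, ?, ?, ?, ?)",
             ("View", "waterfront", "BoatDock", "yes", 0.005))

def lookup_global(conn, name, val):
    """Index-assisted point lookup of one atomic probability; None if absent."""
    row = conn.execute(
        "SELECT Prob FROM GlobalProbW WHERE AttName = ? AND AttVal = ?",
        (name, val),
    ).fetchone()
    return row[0] if row else None
```

The tables for D would have the same shape; the composite index makes each probability lookup a single keyed probe at scoring time.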

Page 18: Probabilistic Ranking of Database Query Results


Scan Algorithm

Preprocessing - Atomic Probabilities Module
  Computes and indexes the quantities P(y | W), P(y | D), P(x | y, W), and P(x | y, D) for all distinct values x and y

Execution
  Select tuples that satisfy the query
  Scan and compute score for each result tuple
  Return top-K tuples
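
The execution steps above can be sketched in Python as follows. The scoring function is an assumption: it multiplies the ratios P(y | W) / P(y | D) over unspecified values with the ratios P(x | y, W) / P(x | y, D) over pairs of specified and unspecified values. The dictionaries of precomputed probabilities, their key layout, and all numbers are invented for the example.

```python
import heapq
from math import prod

def score(t, X, Y, p_w, p_d, pc_w, pc_d):
    """Assumed score: global ratios times conditional ratios for tuple t."""
    glob = prod(p_w[(y, t[y])] / p_d[(y, t[y])] for y in Y)
    cond = prod(
        pc_w[((x, t[x]), (y, t[y]))] / pc_d[((x, t[x]), (y, t[y]))]
        for y in Y for x in X
    )
    return glob * cond

def scan_top_k(table, query, Y, k, probs):
    """Select matching tuples, score each one, and keep the k best."""
    answers = [t for t in table if all(t[a] == v for a, v in query.items())]
    return heapq.nlargest(
        k, answers, key=lambda t: score(t, list(query), Y, *probs)
    )

# Hypothetical data and precomputed atomic probabilities.
table = [
    {"TID": 1, "View": "Waterfront", "BoatDock": "Yes"},
    {"TID": 2, "View": "Waterfront", "BoatDock": "No"},
    {"TID": 3, "View": "Street", "BoatDock": "Yes"},
]
wf = ("View", "Waterfront")
p_w = {("BoatDock", "Yes"): 0.2, ("BoatDock", "No"): 0.05}
p_d = {("BoatDock", "Yes"): 0.1, ("BoatDock", "No"): 0.9}
pc_w = {(wf, ("BoatDock", "Yes")): 0.15, (wf, ("BoatDock", "No")): 0.01}
pc_d = {(wf, ("BoatDock", "Yes")): 0.05, (wf, ("BoatDock", "No")): 0.2}

top = scan_top_k(table, {"View": "Waterfront"}, ["BoatDock"], 2,
                 (p_w, p_d, pc_w, pc_d))
```

The cost is linear in the size of the answer set, since every matching tuple is scored before the top K are chosen.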

Page 19: Probabilistic Ranking of Database Query Results


Beyond Scan Algorithm

Scan algorithm is inefficient
  Many tuples in the answer set
Another extreme
  Pre-compute top-K tuples for all possible queries
  Still infeasible in practice
Trade-off solution
  Pre-compute ranked lists of tuples for all possible atomic queries
  At query time, merge ranked lists to get top-K tuples

Page 20: Probabilistic Ranking of Database Query Results


Two Kinds of Ranked Lists

CondList Cx
  {AttName, AttVal, TID, CondScore}
  B+ tree index on (AttName, AttVal, CondScore)

GlobList Gx
  {AttName, AttVal, TID, GlobScore}
  B+ tree index on (AttName, AttVal, GlobScore)

Page 21: Probabilistic Ranking of Database Query Results


Index Module

Page 22: Probabilistic Ranking of Database Query Results


List Merge Algorithm
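
One way to merge pre-computed ranked lists into a top-K answer is a threshold-style merge in the spirit of Fagin's Threshold Algorithm. The Python sketch below is an illustration under assumptions, not necessarily the algorithm shown on this slide: it assumes each list holds (TID, score) pairs sorted by descending non-negative score, that a tuple's overall score is the product of its per-list scores, and that only tuples present in every list belong to the answer. The function name and sample lists are made up.

```python
import heapq
from math import prod

def list_merge_top_k(ranked_lists, k):
    """Return the k best (score, tid) pairs, best first, reading the lists
    in parallel and stopping once no unseen tuple can enter the top k."""
    lookup = [dict(lst) for lst in ranked_lists]  # random access by TID
    seen = set()
    top = []  # min-heap of (score, tid); top[0] is the current k-th best
    # Once the shortest list is exhausted, every answer tuple has been seen.
    for pos in range(min(len(lst) for lst in ranked_lists)):
        last_scores = []
        for lst in ranked_lists:
            tid, s = lst[pos]  # sorted access
            last_scores.append(s)
            if tid in seen:
                continue
            seen.add(tid)
            if all(tid in d for d in lookup):  # tuple satisfies all conditions
                total = prod(d[tid] for d in lookup)
                if len(top) < k:
                    heapq.heappush(top, (total, tid))
                elif total > top[0][0]:
                    heapq.heapreplace(top, (total, tid))
        # Upper bound on the score of any tuple not yet seen in any list.
        threshold = prod(last_scores)
        if len(top) == k and top[0][0] >= threshold:
            break
    return sorted(top, reverse=True)

# Hypothetical lists: two CondLists (one per specified value) and one GlobList.
c_seattle = [(3, 4.0), (1, 2.0), (2, 1.0), (5, 0.5)]
c_waterfront = [(1, 3.0), (3, 1.5), (4, 1.0), (2, 0.2)]
g_list = [(2, 5.0), (1, 2.0), (3, 1.0), (4, 0.5), (5, 0.1)]

best = list_merge_top_k([c_seattle, c_waterfront, g_list], k=2)
```

In this example the merge stops after reading two entries from each list, because the product of the last scores read (the threshold) no longer exceeds the current second-best score.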

Page 23: Probabilistic Ranking of Database Query Results


Experimental Setup

Datasets:
  MSR HomeAdvisor Seattle (http://houseandhome.msn.com/)
  Internet Movie Database (http://www.imdb.com)

Software and Hardware:
  Microsoft SQL Server 2000 RDBMS
  P4 2.8-GHz PC, 1 GB RAM
  C#, connected to RDBMS through DAO

Page 24: Probabilistic Ranking of Database Query Results


Quality Experiments

Conducted on Seattle Homes and Movies tables

Collect a workload from users

Compare the Conditional Ranking method in the paper with the Global method [CIDR03]

Page 25: Probabilistic Ranking of Database Query Results


Quality Experiment - Average Precision

For each query Qi, generate a set Hi of 30 tuples likely to contain a good mix of relevant and irrelevant tuples

Let each user mark 10 tuples in Hi as most relevant to Qi

Measure how closely the 10 tuples marked by the user match the 10 tuples returned by each algorithm
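
One plausible reading of "how closely the tuples match" is the fraction of an algorithm's returned tuples that the user also marked. The Python sketch below computes that overlap; the function name and the tuple IDs are invented, and the exact measure used in the experiments may differ.

```python
def overlap_precision(user_marked, algo_top):
    """Fraction of the algorithm's returned tuples that the user also marked."""
    return len(set(user_marked) & set(algo_top)) / len(algo_top)

# Hypothetical tuple IDs for one query Qi.
user_marked = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]   # 10 tuples the user picked from Hi
algo_top = [1, 2, 3, 4, 5, 6, 7, 11, 12, 13]    # 10 tuples returned by an algorithm

precision = overlap_precision(user_marked, algo_top)  # 7 of 10 overlap
```

Averaging this value over all queries and users would give one number per ranking algorithm to compare.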

Page 26: Probabilistic Ranking of Database Query Results


Quality Experiment - Fraction of Users Preferring Each Algorithm

5 new queries

Users were given the top-5 results

Page 27: Probabilistic Ranking of Database Query Results


Performance Experiments

Datasets

Table           NumTuples   Database Size (MB)
Seattle Homes   17463       1.936
US Homes        1380762     140.432

Compare 2 algorithms:
  Scan algorithm
  List Merge algorithm

Page 28: Probabilistic Ranking of Database Query Results


Performance Experiments – Pre-computation Time

Page 29: Probabilistic Ranking of Database Query Results


Performance Experiments – Execution Time

Page 30: Probabilistic Ranking of Database Query Results


Performance Experiments – Execution Time

Page 31: Probabilistic Ranking of Database Query Results


Performance Experiments – Execution Time

Page 32: Probabilistic Ranking of Database Query Results


Conclusion and Open Problems

Automatic ranking for the many-answers problem

Adaptation of PIR to DB

Open problems:
  Multiple-table queries
  Non-categorical attributes