Sideway Value Algebra for Object-Relational Databases G. Ozsoyoglu A. Al-Hamdani I. S. Altıngovde...

34
Sideway Value Algebra for Object- Relational Databases G. Ozsoyoglu A. Al-Hamdani I. S. Altıngovde S.A. Ozel O. Ulusoy Z.M. Ozsoyoglu Case Western Reserve University, USA Bilkent University, Turkey

Transcript of Sideway Value Algebra for Object-Relational Databases G. Ozsoyoglu A. Al-Hamdani I. S. Altıngovde...

Page 1: Sideway Value Algebra for Object-Relational Databases G. Ozsoyoglu A. Al-Hamdani I. S. Altıngovde S.A. Ozel O. Ulusoy Z.M. Ozsoyoglu Case Western Reserve.

Sideway Value Algebra for Object-Relational

Databases

G. Ozsoyoglu

A. Al-Hamdani

I. S. Altıngovde

S.A. Ozel

O. Ulusoy

Z.M. Ozsoyoglu

Case Western Reserve University, USA

Bilkent University, Turkey

Page 2: Sideway Value Algebra for Object-Relational Databases G. Ozsoyoglu A. Al-Hamdani I. S. Altıngovde S.A. Ozel O. Ulusoy Z.M. Ozsoyoglu Case Western Reserve.

Attach

(a) Functions to relations “Sideway Value Fncts”

(b) Function evaluations to tuples of relations

“Sideway Values”

“Scores” : Cohen, Sigmod’98

“Preference values” : Agrawal and Wimmers, Sigmod’00

Hristidis, Papakonstantinou, Koudas, Sigmod’01

“Probabilistic values”: Barbara, Garcia-Molina, Porter, IEEE TKDE’92

Our Motivation: Recent database applications on the web

Page 3: Sideway Value Algebra for Object-Relational Databases G. Ozsoyoglu A. Al-Hamdani I. S. Altıngovde S.A. Ozel O. Ulusoy Z.M. Ozsoyoglu Case Western Reserve.

Sideway Values and Functions

Represent• Recommendations of Data Creators,

• Preferences of users. Importance Values

Employed for• User-guided query output ranking, • Top-K “query evaluation”.

Integrated with “approximate text joins”.

Page 4: Sideway Value Algebra for Object-Relational Databases G. Ozsoyoglu A. Al-Hamdani I. S. Altıngovde S.A. Ozel O. Ulusoy Z.M. Ozsoyoglu Case Western Reserve.

Web Resources: DBLP Bibliography and SIGMOD Anthology

Data Model (by Data Provider)

Topics

Tid TName TType TDomain TImp

T01 Edward F. Codd Researcher Database 0.9

T08 data models database mngement ResearchPaper Database 0.8

Query: Database Researchers with a name similar to E. Codd?

Text similarity between Edward F. Codd and E. Codd is judged to be 0.7

Result: Tuple T01 is returned with importance value 0.63 (0.9 * 0.7)

Edward F. Codd, “Data Models in Database Management”, 1980

Page 5: Sideway Value Algebra for Object-Relational Databases G. Ozsoyoglu A. Al-Hamdani I. S. Altıngovde S.A. Ozel O. Ulusoy Z.M. Ozsoyoglu Case Western Reserve.

“T08 is a research paper of T01”

Recursive Metalinks: RelatedToPapers, PrerequisitePapers

Topic Sources (web documents)

Relationships (Metalinks)

MId MType AntecedentId ConsequentId

M01 ResearchPaperOf T01 T08

Data Model

Topics

Page 6: Sideway Value Algebra for Object-Relational Databases G. Ozsoyoglu A. Al-Hamdani I. S. Altıngovde S.A. Ozel O. Ulusoy Z.M. Ozsoyoglu Case Western Reserve.

Query (top-k, approximate text similarity)

Use the advice at www.expert.com/advice. Find the 20 most-important papers

having index terms with similarity to “join algorithms” above 0.9

Query (time constraints, topic closure, threshold )

Use the advice at www.expert.com/advice .Find in two minutes those Anthology papers that are prerequisites

of T08 with importance values above 0.7.

Page 7: Sideway Value Algebra for Object-Relational Databases G. Ozsoyoglu A. Al-Hamdani I. S. Altıngovde S.A. Ozel O. Ulusoy Z.M. Ozsoyoglu Case Western Reserve.

Summary So Far

• Importance values as sideway values/functions

• Top-k or threshold-based queries

• Regular-expression-based topic closure queries:

PapersWrittenBy(RelatedToPapers*(T08))

• Expensive metalink importance value computation *

RelatedToPapers, PrerequisitePapers

• Time Constraints *

Page 8: Sideway Value Algebra for Object-Relational Databases G. Ozsoyoglu A. Al-Hamdani I. S. Altıngovde S.A. Ozel O. Ulusoy Z.M. Ozsoyoglu Case Western Reserve.

Rest of the talk

• SQL extensions for sideway values

• Sideway Value Algebra (SVA) operators for Selection

Join Topic Closure

• Evaluations of SVA operators Join Topic Closure

• Experimental results

Page 9: Sideway Value Algebra for Object-Relational Databases G. Ozsoyoglu A. Al-Hamdani I. S. Altıngovde S.A. Ozel O. Ulusoy Z.M. Ozsoyoglu Case Western Reserve.

SVA queries propagate sideway values to output relations in automated ways.

Query output sizes controlled by (a) Ranking threshold k (final output size control)

(b) Sideway value threshold Vt

(intermediate output size control)

(c) k and Vt

Page 10: Sideway Value Algebra for Object-Relational Databases G. Ozsoyoglu A. Al-Hamdani I. S. Altıngovde S.A. Ozel O. Ulusoy Z.M. Ozsoyoglu Case Western Reserve.

SVs attached to base relations

• Open Form: SV stored in column of base relation.

• Closed Form: SV derived thru closed function.

Ex: for relation Employee, we have Imp (Salary,JobCode) = a.Salary + b.JobCode

• Semi-Closed Form: SV value for a set of

tuples identified through regular expression. Ex: Imp(<TopicName = “*kidney complications*”>) = 0.9

Page 11: Sideway Value Algebra for Object-Relational Databases G. Ozsoyoglu A. Al-Hamdani I. S. Altıngovde S.A. Ozel O. Ulusoy Z.M. Ozsoyoglu Case Western Reserve.

Approximate Text Similarity

• Vector space based similarity model.

• Vocabulary W for all words (terms).

• Represent each topic as a vector of |W| terms.

• TF-IDF model of IR, and cosine similarity.

Page 12: Sideway Value Algebra for Object-Relational Databases G. Ozsoyoglu A. Al-Hamdani I. S. Altıngovde S.A. Ozel O. Ulusoy Z.M. Ozsoyoglu Case Western Reserve.

SQL Extensions

• Database

using advice at www.xx as database

• Propagate Sideway Values propagate importance as <type> function of <arg list>

• Stopping Conditions stop after k

stop with threshold Vt

stop after k and with threshold Vt

Page 13: Sideway Value Algebra for Object-Relational Databases G. Ozsoyoglu A. Al-Hamdani I. S. Altıngovde S.A. Ozel O. Ulusoy Z.M. Ozsoyoglu Case Western Reserve.

20 highest topic-importance-ranked papers having index terms with similarity to “join algorithms” above 0.9.

A product-based importance propagation function.

IndexedBy: SetOf IndexTermId PaperId

select M.ConsequentIdusing advice at www.expert.com/advice as database DBfrom DB.Topics T, DB.Metalinks Mwhere T.TType=”Index Term” and M.MType=”IndexedBy” and T.TId is-in M.AntecedentId and

T.TName (0.9) ”join algorithms”propagate importance as product function of Tstop after 20 most important

Page 14: Sideway Value Algebra for Object-Relational Databases G. Ozsoyoglu A. Al-Hamdani I. S. Altıngovde S.A. Ozel O. Ulusoy Z.M. Ozsoyoglu Case Western Reserve.

SVA Operators

For each RA operator, there is SVA counterpart extended with

(a) Output SV function fout

(b) Output threshold (ranking or sideway threshold )

Operators with superscript * : SVA operators.

Operators without superscript * : RA operators.

They carry sideway values of the left or right relation.

Page 15: Sideway Value Algebra for Object-Relational Databases G. Ozsoyoglu A. Al-Hamdani I. S. Altıngovde S.A. Ozel O. Ulusoy Z.M. Ozsoyoglu Case Western Reserve.

SVA Selection: σ*C, fout, (R)

Input • Input relation R with sideway function fin

• Selection Condition C• Output sideway propagation function fout

• Output threshold (k or Vt)

Output

=k : Top-k fout-ranking tuples that satisfy C

=Vt : Tuples with fout value greater than Vt and satisfy C

= (k,Vt)

Page 16: Sideway Value Algebra for Object-Relational Databases G. Ozsoyoglu A. Al-Hamdani I. S. Altıngovde S.A. Ozel O. Ulusoy Z.M. Ozsoyoglu Case Western Reserve.

STOP AFTER (20) ΠConsequentId ORDER BYImportance L L.TId in R.AntecedentId

IndexedBy"" MTypeσ

*TName (0.9) “join algorithms” and TType=”Index Term”,

fout=fin*sim(TName,”join algorithms”), = 0.0 DB.Topics DB.Metalinks

Logical Query Tree

20 most-important papershaving index terms with similarity to “join algorithms” above 0.9

Page 17: Sideway Value Algebra for Object-Relational Databases G. Ozsoyoglu A. Al-Hamdani I. S. Altıngovde S.A. Ozel O. Ulusoy Z.M. Ozsoyoglu Case Western Reserve.

SVA Join: L *A B, fout, R

Input

• L and R with two functions flin and frin

• Join Condition on attributes A and B

• Output sideway propagation function fout

• Output threshold (k or Vt)

Output: join of L and R

Sideway values of output tuples computed by fout.

Output threshold satisfied.

Page 18: Sideway Value Algebra for Object-Relational Databases G. Ozsoyoglu A. Al-Hamdani I. S. Altıngovde S.A. Ozel O. Ulusoy Z.M. Ozsoyoglu Case Western Reserve.

Query: Five researchers with (a) papers having index terms similar to “join algorithms” above 0.9

(b) geometric averages of highest importance values.

ResearchTopicOf: SetOf Index-termId ResearcherId

select M.ConsequentIdusing advice at www.expert.com/advice as database DBfrom DB.Topics T, DB.Metalinks Mwhere TType = ”Index Term” and M.MType = ”ResearchTopicOf” and T.Td is-in M.AntecedentId and T.TName (0.9) “join algorithms”propagate importance as gmtrc-average function of T, Mstop after 5 most important

Page 19: Sideway Value Algebra for Object-Relational Databases G. Ozsoyoglu A. Al-Hamdani I. S. Altıngovde S.A. Ozel O. Ulusoy Z.M. Ozsoyoglu Case Western Reserve.

STOP AFTER (5) ConsequentId ORDER BYImportance * L.TId in R.AntecedentId, fout=sqrt(flin*frin), = 5

*TName (0.9) “join algorithms” picOf"ResearchTo" MTypeσ

and TType=”Index Term”, fout=fin*sim(TName,”join algorithms”), = 0.0 DB.Topics DB.Metalinks

Page 20: Sideway Value Algebra for Object-Relational Databases G. Ozsoyoglu A. Al-Hamdani I. S. Altıngovde S.A. Ozel O. Ulusoy Z.M. Ozsoyoglu Case Western Reserve.

Semantics of TClosure

FPath:

X.

.

.

.

tP1 Pn

. . .

FPathMerge

Page 21: Sideway Value Algebra for Object-Relational Databases G. Ozsoyoglu A. Al-Hamdani I. S. Altıngovde S.A. Ozel O. Ulusoy Z.M. Ozsoyoglu Case Western Reserve.

Metalink types have associated axioms Ex. RelatedTo is transitive and reflexive

SubTopicOf is transitive but not reflexive

Computing topic closure requires• Sound and complete set of axioms for metalink types,

and

• Polynomial-time algorithm to compute the topic closure

using the axioms.

Page 22: Sideway Value Algebra for Object-Relational Databases G. Ozsoyoglu A. Al-Hamdani I. S. Altıngovde S.A. Ozel O. Ulusoy Z.M. Ozsoyoglu Case Western Reserve.

Evaluation of SVA Join Operator

• Nested-Loops-Based algorithms for top-k-based and sideway-value-based threshold Join operators

• Sorted input relations wrpt sideway values.

• fout() function is monotone

e.g. Product, numeric average, geometric average

Page 23: Sideway Value Algebra for Object-Relational Databases G. Ozsoyoglu A. Al-Hamdani I. S. Altıngovde S.A. Ozel O. Ulusoy Z.M. Ozsoyoglu Case Western Reserve.

Nested-Loops Threshold Join Algorithm

• Enforce new stopping conditions while processing inner and outer loops.

– Inner loop (S) exits: fout() value of the tuple ri*sj < Vt (ri is in R and sj is in S).

– Outer loop (R) exits at the ith iteration: fout() value of the output tuple ri.s1 < Vt

(ri is in R and s1 is the first tuple in S)

R S

risj

Page 24: Sideway Value Algebra for Object-Relational Databases G. Ozsoyoglu A. Al-Hamdani I. S. Altıngovde S.A. Ozel O. Ulusoy Z.M. Ozsoyoglu Case Western Reserve.

Nested-Loops Threshold Join Algorithm with Approximate Text Match Join Conditions

• Similarity of text-valued join attributes > threshold tsim

• Use vector-space model. – Term vector ur = <u1 u2 … ux> corresponding to join attribute A of

tuple r in R, where ui represents weight of term i in A

– Filter vector fS = <w1 ... wx>: each value wi is the max weight of the corresponding term i among all vectors of S

• Cosine (ur, fS) : the maximal similarity of a record r in R to any other record s in S

• if Cosine (ur, fS) < Vt then r can not be similar to any tuple s in S with similarity > Vt

Page 25: Sideway Value Algebra for Object-Relational Databases G. Ozsoyoglu A. Al-Hamdani I. S. Altıngovde S.A. Ozel O. Ulusoy Z.M. Ozsoyoglu Case Western Reserve.

Evaluation of SVA TClosure Operator

• Compute Topic Closure X+ where regular expression R is a single metalink type M

• Each metalink VM Tid is represented by a tuple in the Metalinks table

V is a set of topic identifiers and Tid is a topic identifier

• All metalinks are RHS-decomposable.

• If a metalink type is LHS-decomposable then each metalink with V in LHS is decomposed into multiple metalinks with a single topic in LHS

Page 26: Sideway Value Algebra for Object-Relational Databases G. Ozsoyoglu A. Al-Hamdani I. S. Altıngovde S.A. Ozel O. Ulusoy Z.M. Ozsoyoglu Case Western Reserve.

• Create index MIndex for all metalink instances

• TClosure algorithms use only MIndex to find closure of a given set of topics.

• Create a second index HIndex to maintain all nodes that are not decomposable.

Page 27: Sideway Value Algebra for Object-Relational Databases G. Ozsoyoglu A. Al-Hamdani I. S. Altıngovde S.A. Ozel O. Ulusoy Z.M. Ozsoyoglu Case Western Reserve.

0.8 RelatedTo 0.6 0.85 0.9 Pre 0..95 0.7 Pre Pre 0.9 0.95 0.9

T2

T333

T4

T111

T5

MType Tid1 Imp(Tid1) ParentListChildList

<Tid2, Imp(Tid2), Imp(Mid)> triplets

Pre T1 0.9 {} <T3,0.85,0.95>, <T4,0.95,0.9>

Pre T3 0.85 {T1} -

Pre H1 Avg(0.9, 0.95) = 0.925

{T3, T4} <T5,0.7,0.9>

RelatedTo T1 0.9 {T2, T3, T4} <T2,0.8,0.6>,<T3,0.85, 0.95>, <T4,0.95,0.9>

Tid NodeList <TidList, Hid>

T3 <{T3,T4}, H1>

T4 <{T3,T4}, H1>

MIndex HIndex

Example 5.

Page 28: Sideway Value Algebra for Object-Relational Databases G. Ozsoyoglu A. Al-Hamdani I. S. Altıngovde S.A. Ozel O. Ulusoy Z.M. Ozsoyoglu Case Western Reserve.

Ranking-based Topic Closure

Algorithm• First compute the initial candidate top k ranked topics

from input topics X. • Then, in each iteration i,

– Extract ith top-ranked topic from current k-i+1 candidate top-ranked topics, and

– Update current candidate topics by processing all emanating metalinks from ith topic.

• Algorithm requires k iterations in order to compute the top-k-ranked topic closure of a set X of input topics.

Page 29: Sideway Value Algebra for Object-Relational Databases G. Ozsoyoglu A. Al-Hamdani I. S. Altıngovde S.A. Ozel O. Ulusoy Z.M. Ozsoyoglu Case Western Reserve.

Experimental Results SVA Join Operator

• Extracted titles of journals and conf. papers from the DBLP data set into two different files, R and S.

• For arbitrary predicates and monotone SV functions, algorithms NLoopSVT and NLoopTop-k improve the performance of (block-nested loops) BNL considerably.

• For similarity-based joins, the algorithms are further optimized, and more gains are obtained.

Page 30: Sideway Value Algebra for Object-Relational Databases G. Ozsoyoglu A. Al-Hamdani I. S. Altıngovde S.A. Ozel O. Ulusoy Z.M. Ozsoyoglu Case Western Reserve.

Tuple Comparisons

0

1500

3000

4500

6000

7500

9000

10500

12000

0 0.2 0.4 0.6 0.8 1SV threshold

Co

mp

aris

on

# (

Mil

lio

ns) BNL

NLoopSVTNLoopSimSVT

Tuple Comparisons

0

1500

3000

45006000

7500

9000

10500

12000

0 100 200 300 400 500

Top-k

Co

mp

aris

on

# (

Mil

lio

ns) BNL

NLoopTopK

NLoopSimTopK

Page 31: Sideway Value Algebra for Object-Relational Databases G. Ozsoyoglu A. Al-Hamdani I. S. Altıngovde S.A. Ozel O. Ulusoy Z.M. Ozsoyoglu Case Western Reserve.

Experimental Results SVA Topic Closure Operator

• Synthetically generated data for – Topics file with N, 1000 ≤ N ≤ 100,000, topics.

Topic importance values in [0.4, 1.0]– Metalink file with metalinks/topics ratio of 3.

Metalink importance values in [0.4, 1.0]– Set X of input topics with size= 100 topics.

Topic importance values in [0.5, 1.0]

• Disk-based MIndex file generated from topic and metalink files.

• In-memory buffer of size 100KB. A sparse index table of size 1000 tuples.

Page 32: Sideway Value Algebra for Object-Relational Databases G. Ozsoyoglu A. Al-Hamdani I. S. Altıngovde S.A. Ozel O. Ulusoy Z.M. Ozsoyoglu Case Western Reserve.

0

2000

4000

6000

8000

10000

12000

14000

16000

18000

20000

22000

24000

0.3 0.4 0.5 0.6 0.7 0.8 0.9

Sideway Values Threshold [0,1]

1000 Topics5000 Topics10000 Topics20000 Topics50000 Topics100000 Topics

0

10000

20000

30000

40000

50000

60000

70000

80000

90000

100000

0 500 1000 1500 2000 2500 3000

Top-K Ranking

1000 Topics5000 Topics10000 Topics20000 Topics50000 Topics100000 Topics

Page 33: Sideway Value Algebra for Object-Relational Databases G. Ozsoyoglu A. Al-Hamdani I. S. Altıngovde S.A. Ozel O. Ulusoy Z.M. Ozsoyoglu Case Western Reserve.

0

2000

4000

6000

8000

10000

12000

14000

16000

18000

20000

22000

24000

0.3 0.4 0.5 0.6 0.7 0.8 0.9

Sideway Values Threshold [0,1]

defualtMetalinks/topic R=7Topics and Metalinks have Imp[0.7,1.0] Buffer Size=1000Input Size=500

0

2000

4000

6000

8000

10000

12000

14000

16000

0 500 1000 1500 2000 2500 3000

Top-K Ranking

defualt

Metalinks/topic R=7

Topics and Metalinkshave Imp[0.7,1.0] Input Size=500

Page 34: Sideway Value Algebra for Object-Relational Databases G. Ozsoyoglu A. Al-Hamdani I. S. Altıngovde S.A. Ozel O. Ulusoy Z.M. Ozsoyoglu Case Western Reserve.

Conclusions

• Sideway Values and Functions

• SQL Extensions and SVA

• Application: Web Querying