Preference and Diversity-based Ranking in Network-Centric Information Management Systems

Preference and Diversity-based Ranking in Network-Centric Information Management Systems. PhD defense. Marina Drosou, Computer Science & Engineering Dept., University of Ioannina


Transcript of Preference and Diversity-based Ranking in Network-Centric Information Management Systems

Page 1: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems

Preference and Diversity-based Ranking in Network-Centric Information Management Systems

PhD defense

Marina Drosou
Computer Science & Engineering Dept.
University of Ioannina

Page 2: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems


Why diversify?

Marina Drosou, Preference and Diversity-based Ranking in Network-Centric Information Management Systems

Car

Animal

Sports Team

“Mr. Jaguar”

Page 3: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems

Thesis Goal

This PhD thesis concerns the development, implementation and evaluation of models, algorithms and techniques for the ranking of information being presented to users of network-centric information management systems

This ranking is based on the importance of each piece of information. We consider that importance is influenced by both relevance to user information needs and diversity:
  Relevance is important so that users are only presented with the most useful results according to their needs
  Diversity ensures that the received results do not all contain similar information

Page 4: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems


Outline

Search Result Diversification: Introduction & Related Work

Content Diversification using Indices

DisC Diversity: Diversification based on Dissimilarity and Coverage

POIKILO: Evaluating the Results of Diversification Models and Algorithms

Summary


Page 5: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems

Outline

Search Result Diversification: Introduction & Related Work
  Problem Definition
  Variations
  Algorithms

Content Diversification using Indices

DisC Diversity: Diversification based on Dissimilarity and Coverage

POIKILO: Evaluating the Results of Diversification Models and Algorithms

Summary

Page 6: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems

Problem Definition

Given:
1. P = {p1, …, pn}
2. k ≤ n
3. d: a distance metric
4. f: a diversity function

Find:

$$ S^* = \operatorname*{argmax}_{S \subseteq P,\ |S| = k} f(S, d) $$

Given a set P of items and a number k, select a subset S* of P with the k most diverse items of P

Page 7: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems

What it means

Given a set P of query results, we want to select a representative diverse subset S* of P

What does diverse mean?
  Content: dissimilar items
    e.g., distant locations on a map, different attribute values in tuples
  Coverage: different aspects, perspectives, concepts
    e.g., different interpretations of a keyword in web search, different topics
  Novelty: items not seen in the past
    e.g., novel results in a notification service

Page 8: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems

Content-based diversity

$$ f_{\mathrm{MIN}}(S, d) = \min_{\substack{p_i, p_j \in S \\ p_i \neq p_j}} d(p_i, p_j) $$

$$ f_{\mathrm{SUM}}(S, d) = \sum_{\substack{p_i, p_j \in S \\ p_i \neq p_j}} d(p_i, p_j) $$
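The two content-based objectives on this slide, the minimum and the sum of the pairwise distances inside S, can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and the sum is taken over unordered pairs, which is an assumption (summing over ordered pairs would simply double the value).

```python
from itertools import combinations

def f_min(subset, dist):
    """Smallest distance between any two distinct items of the subset."""
    return min(dist(p, q) for p, q in combinations(subset, 2))

def f_sum(subset, dist):
    """Sum of distances over all unordered pairs of the subset (assumed convention)."""
    return sum(dist(p, q) for p, q in combinations(subset, 2))

# Toy usage on 1-D points with absolute difference as the metric.
abs_dist = lambda a, b: abs(a - b)
example = [0, 3, 10]
```

Both functions need at least two items in the subset; `f_min` raises `ValueError` otherwise.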

Page 9: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems

Coverage-based diversity

Basic idea: Find a set of results that cover different interpretations of the query

Common assumptions:
  A taxonomy exists
  Both queries and results may belong to many categories
  Statistics on the distribution of user intents have been collected
  Result independence

Probabilistic view of the problem

Page 10: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems

Novelty-based diversity

Novelty: the need to avoid redundancy (vs. Diversity: the need to resolve ambiguity)

Intuitively: an item should be returned in the ith position of the list if
  it is relevant
  the previous (i-1) items do not contain the same information

Information is partitioned into “nuggets”
  Often, human judges decide what is relevant or not for each nugget (IR approach)

Page 11: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems

Adding relevance in the mix

We must not forget: Relevance to the query is also important!
  Results must be both relevant and diverse

Two alternatives:
  Select the k most diverse results out of the top-m most relevant ones, m > k
  Include diversity into the ranking criterion
    Augmenting the diversity function with relevance
    Adapting IR criteria, e.g., discounted cumulative gain (DCG) at position i

Page 12: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems

Adding relevance in the mix

Augmenting diversity functions with relevance:

MAXMIN:

MAXSUM:

Mono-objective formula:

and others

(where is a relevance function and )


Page 13: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems

Problem Complexity

The problem of choosing diverse items is NP-hard
  This follows from the MAX COVERAGE/SET COVER problems

Intuitively: To find the most diverse subset S* of all items P, we have to compute all possible combinations of k items out of |P| and keep the one with the maximum diversity
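The exhaustive search described on this slide can be sketched as follows, assuming the MAXMIN objective (minimum pairwise distance) as the diversity function. The function names are hypothetical; the point is only to show why the approach does not scale, since the number of candidate subsets is the binomial coefficient C(|P|, k).

```python
from itertools import combinations
from math import comb

def brute_force_maxmin(points, k, dist):
    """Try every k-subset and keep the one with the largest minimum distance."""
    def f_min(subset):
        return min(dist(p, q) for p, q in combinations(subset, 2))
    return max(combinations(points, k), key=f_min)

def candidate_count(n, k):
    """Number of subsets the brute-force search has to examine."""
    return comb(n, k)

abs_dist = lambda a, b: abs(a - b)
best = brute_force_maxmin([0, 1, 2, 3, 10], 3, abs_dist)
```

Even for modest inputs the count explodes: `candidate_count(100, 10)` is already larger than 10^13.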

Page 14: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems

Solving the problem

Thus, we use heuristics for approximate solutions

Greedy heuristics:
  Selecting items one by one until we have k of them

Interchange heuristics:
  Start with a random solution and interchange items that improve the objective function

Also:
  Neighborhood heuristics: Disqualify items close to the ones already selected
  Simulated Annealing: Apply simulated annealing to avoid local maxima
  and others
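A minimal sketch of the first two heuristic families, again assuming the MAXMIN objective. The starting item of the greedy variant, the swap order of the interchange variant and all function names are illustrative choices, not taken from the slides.

```python
import random
from itertools import combinations

def f_min(items, dist):
    return min(dist(p, q) for p, q in combinations(items, 2))

def greedy_maxmin(points, k, dist):
    """Add, one by one, the item farthest from the items chosen so far."""
    chosen = [points[0]]  # arbitrary seed item (assumption)
    while len(chosen) < k:
        rest = [p for p in points if p not in chosen]
        chosen.append(max(rest, key=lambda p: min(dist(p, s) for s in chosen)))
    return chosen

def interchange(points, k, dist, seed=0):
    """Start from a random k-subset and apply swaps that improve f_min."""
    current = random.Random(seed).sample(points, k)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for p in points:
                if p in current:
                    continue
                candidate = current[:i] + [p] + current[i + 1:]
                if f_min(candidate, dist) > f_min(current, dist):
                    current, improved = candidate, True
    return current

abs_dist = lambda a, b: abs(a - b)
data = [0, 1, 2, 3, 10]
picked = greedy_maxmin(data, 3, abs_dist)
swapped = interchange(data, 3, abs_dist)
```

The interchange loop stops at a local optimum: every accepted swap strictly increases the objective, so it terminates, but the result is not guaranteed to be the best subset.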

Page 15: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems

Related Work

Content, Coverage, Novelty

Greedy
  Jain, PAKDD 2004; Ziegler, WWW 2005; Gollapudi, WWW 2009; Drosou, DEBS 2009; Tao, ICDE 2009; Haritsa, IEEE Data Eng. Bull. 2009; Vieira, ICDE 2011; Bozzon, ICWE 2012; Santos, SSDBM 2013; Abbar, WWW 2013; Valkanas, EDBT 2013
  Agrawal, WSDM 2009; Liu, SDM 2009; Zhu, WWW 2011; Li, CIKM 2012
  Zhang, SIGIR 2002; Clarke, SIGIR 2008; Souravlias, 2010; Lathia, SIGIR 2010; Szpektor, WWW 2013

Interchange
  Yu, EDBT 2009; Vieira, ICDE 2011; Liu, PVLDB 2009; Minack, SIGIR 2011; Liu, TODS 2012

Others
  Vee, ICDE 2008; Zhang, RecSys 2008; Angel, SIGMOD 2011; Fraternali, SIGMOD 2012; Li, PVLDB 2013

Page 16: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems

Outline

Search Result Diversification: Introduction & Related Work

Content Diversification using Indices
  Model
  Diverse set computation
  Combining diversity & relevance

DisC Diversity: Diversification based on Dissimilarity and Coverage

POIKILO: Evaluating the Results of Diversification Models and Algorithms

Summary

Page 17: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems

Introduction

We focus on content-based diversification
  MAXMIN

Basic idea: employ indices for the efficient computation of diverse items
  Cover Trees

We also define the Continuous k-diversity problem

Page 18: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems

The Cover Tree

A leveled tree where each level is a “cover” for all levels beneath it

Items at higher levels are farther apart from each other than items at lower levels

[Figure: three consecutive levels of a Cover Tree]

Page 19: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems

Cover Tree Invariants - Nesting

Nesting: $C_l \subseteq C_{l-1}$, where $C_l$ is the set of items at level $l$, i.e., once an item $p$ appears at some level, then every lower level has a node associated with $p$

[Figure: items p1, p2, p3 repeated across three consecutive tree levels]

Page 20: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems

Cover Tree Invariants - Covering

Covering: For every $p_i \in C_{l-1}$, there exists a $p_j \in C_l$, such that $d(p_i, p_j) \le b^l$ and the node associated with $p_j$ is the parent of the node associated with $p_i$

b: the “base” of the tree
l: the level of pi

[Figure: items p1, p2, p3 across three consecutive tree levels, with a covering radius labeled $b^{l-1}$]

Page 21: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems

Cover Tree Invariants - Separation

Separation: For all distinct $p_i, p_j \in C_l$, it holds that $d(p_i, p_j) > b^l$

[Figure: items p1, p2, p3 across three consecutive tree levels, with distances labeled $> b^{l-2}$ and $b^{l-1}$]

Page 22: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems


Example

Items indexed at the first ten levels of the same Cover Tree

Page 23: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems

Cover Tree Representations

After an item appears in some level of the tree, it is a child of itself at all levels below that level.

Implicit Representation: space depending on P

Explicit Representation: O(n) space

[Figure: the same tree over p1, p2, p3 in its implicit and explicit representations]

Page 24: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems

Dynamic Construction

Items can be inserted into and deleted from a Cover Tree in a dynamic fashion

Insertion:
1. Starting from the root, descend towards the candidate nodes that can cover the new item p
2. Continue until a level Cl is reached where p is separated from all other items
3. Select as parent a candidate node of Cl+1 that covers p

Deletion:
1. Descend the tree looking for p, keeping note of candidate nodes that can cover the children of p
2. Remove p and reassign its children to the candidate nodes

Page 25: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems

Level Family of Algorithms

The higher the tree level, the farther apart its nodes. Thus, by selecting items from nodes at high levels, we retrieve more diverse results

Let $C_l$ be the first level with at least k items

1. Level-Basic: Select k random items from $C_l$
2. Level-Greedy: Greedily select k items from $C_l$
3. Level-Inherit: Select all items in $C_{l+1}$ and greedily select $k - |C_{l+1}|$ items from $C_l$
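The first two members of the Level family can be illustrated on a tree that is already stored level by level. This is a hedged sketch: the `levels` dict (level number to list of items, larger numbers closer to the root), the helper names and the choice of the first greedy item are assumptions, not the thesis implementation.

```python
import random

def first_level_with_k(levels, k):
    """Highest level (closest to the root) that holds at least k items."""
    for l in sorted(levels, reverse=True):
        if len(levels[l]) >= k:
            return l
    raise ValueError("no level holds k items")

def level_basic(levels, k, rng=random):
    """Pick k random items from that level."""
    return rng.sample(levels[first_level_with_k(levels, k)], k)

def level_greedy(levels, k, dist):
    """Pick k items from that level, farthest-first."""
    pool = levels[first_level_with_k(levels, k)]
    chosen = [pool[0]]  # arbitrary seed item (assumption)
    while len(chosen) < k:
        rest = (p for p in pool if p not in chosen)
        chosen.append(max(rest, key=lambda p: min(dist(p, s) for s in chosen)))
    return chosen

abs_dist = lambda a, b: abs(a - b)
tree_levels = {3: [0], 2: [0, 6], 1: [0, 3, 6]}
```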

Page 26: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems

Approximation Bound

Let P be a set of items, k ≥ 2, $d_{OPT}(P,k)$ the optimal minimum distance for the MAXMIN problem and $d_{CT}(P,k)$ the minimum distance of the diverse set computed by the Level-Basic algorithm. It holds that:

$$ d_{CT}(P,k) \ge \alpha \, d_{OPT}(P,k), \quad \text{where } \alpha = \frac{b-1}{2b^2} $$

(Proved by exploiting the covering invariant of the tree to bound the level where the least common ancestor of any two items of the optimal solution appears in the tree)

Page 27: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems

Cover Tree implementation of Greedy

Any Cover Tree can be employed for implementing the greedy heuristic
  ½-approximation of the optimal solution

We perform k descents of the tree, using one of the following pruning rules:

CT PRUNING RULE: Let p and q be two sibling nodes at level l in a CT. If , then no node in the subtree of q can be further apart from S than p

WCT PRUNING RULE: Let p and q be two sibling nodes at level l in a CT. If , then no node in the subtree of q can be further apart from S than p

Page 28: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems

Batch Construction

If all items of P are available, we can perform a batch construction of the Cover Tree
  We call such trees “Batch Cover Trees” (BCTs)
  As we descend a BCT, we get items in the order selected by Greedy

Algorithm:
1. The leaf level Cl contains all items in P
2. We greedily select items from Cl with distance larger than $b^{l+1}$ and promote them to Cl+1
3. The rest of the items in Cl are distributed as children among the new nodes of Cl+1
4. Continue until we reach the root level
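The four steps above can be read as a bottom-up loop. The sketch below is one plausible reading, not the thesis code: it picks the leaf level from the smallest pairwise distance, promotes items farthest-first while they are more than b^(l+1) away from the items already promoted, and gives every item the nearest promoted item as parent (the "np" choice). It assumes at least two distinct items; all names are hypothetical.

```python
import math
from itertools import combinations

def build_bct(points, dist, b=2.0):
    """Return (levels, parent): levels[l] lists the items of level l and
    parent[(l, item)] is that item's parent at level l + 1."""
    min_d = min(dist(p, q) for p, q in combinations(points, 2))
    l = math.floor(math.log(min_d, b))
    if b ** l >= min_d:  # keep the leaf level strictly separated
        l -= 1
    levels, parent = {l: list(points)}, {}
    while len(levels[l]) > 1:
        current, radius = levels[l], b ** (l + 1)
        promoted = [current[0]]
        while True:  # farthest-first promotion to the next level
            cand = max(current, key=lambda p: min(dist(p, s) for s in promoted))
            if min(dist(cand, s) for s in promoted) <= radius:
                break
            promoted.append(cand)
        for p in current:  # nearest promoted item becomes the parent
            parent[(l, p)] = min(promoted, key=lambda s: dist(p, s))
        levels[l + 1] = promoted
        l += 1
    return levels, parent

abs_dist = lambda a, b: abs(a - b)
bct_levels, bct_parent = build_bct([0, 3, 6, 20], abs_dist)
```

Because promotion stops only when every remaining item is within the radius of some promoted item, each non-promoted item always finds a covering parent.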

Page 29: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems

Adding relevance

Two approaches:

1. Incorporate relevance to the distance function

2. Use relevance to select items from the tree, e.g., MMR
     Level-Greedy
     Level-Hybrid: Greedily select k items from and the k most relevant descendants of items in

Page 30: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems

Continuous Model

We consider a streaming scenario, where new items arrive and older items expire

We want to provide users with a continuously updated subset of the top-k most diverse recent items in the stream

We consider a sliding-window model:

[Figure: consecutive windows Pi-1 and Pi of length w, separated by a jump step]

Page 31: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems

Continuous k-Diversity Problem

Let $P_{i-1}$, $P_i$ be two subsequent jumping windows

For each $P_i$, we seek to select a diverse subset $S_i$, where the additional two constraints hold:

Durability: Once selected as diverse, an item remains as such until it expires

Freshness: Let $p$ be the newest item in $S_{i-1}$. Then no item of $P_i \setminus S_{i-1}$ that is older than $p$ is selected in $S_i$
  Items are selected in the same order they are produced

Page 32: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems

Continuity Requirements

Items in the tree are marked as valid or invalid:

Freshness: non-diverse items that are older than the newest diverse item from the previous window are marked as invalid in the cover tree and are not further considered.

Durability: Let r be the number of diverse items from previous windows that have not yet expired. We select k-r new valid diverse items from the new window.
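The two continuity requirements can be illustrated without a tree, using a plain greedy pass per window. This is a simplified sketch under assumptions: items are `(timestamp, value)` pairs, newer items have larger timestamps, and the function name and tie-breaking are hypothetical.

```python
def next_diverse_set(window, prev_diverse, k, dist):
    """Select up to k diverse items for the current window.

    window: list of (timestamp, value) pairs currently alive.
    prev_diverse: pairs selected for the previous window.
    """
    alive = set(window)
    # Durability: previously selected items stay until they expire.
    selected = [e for e in prev_diverse if e in alive]
    # Freshness: only items newer than the newest previous diverse item are valid.
    newest = max((t for t, _ in prev_diverse), default=float("-inf"))
    valid = [e for e in window if e[0] > newest]
    while len(selected) < k and valid:
        if selected:
            best = max(valid, key=lambda e: min(dist(e[1], s[1]) for s in selected))
        else:
            best = valid[0]
        selected.append(best)
        valid.remove(best)
    return selected

abs_dist = lambda a, b: abs(a - b)
w1 = [(1, 0), (2, 5), (3, 1), (4, 9)]
s1 = next_diverse_set(w1, [], 2, abs_dist)
w2 = [(3, 1), (4, 9), (5, 20), (6, 8)]  # window after a jump of two items
s2 = next_diverse_set(w2, s1, 2, abs_dist)
```

In the second window, `(4, 9)` is kept by durability and `(3, 1)` is skipped by freshness, so only one new item is chosen among the two newest arrivals.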

Page 33: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems

Building Batch Cover Trees

We measure the extra cost of building a BCT as compared to executing the greedy heuristic (GR) for k = n
  This extra cost corresponds to assigning nodes to suitable parents to form the tree levels

Extra Cost:

        Clustered           Faces
b       non-np    np        non-np    np
1.3     0.42%     0.58%     1.49%     1.94%
1.5     0.42%     0.56%     1.47%     1.92%
1.7     0.41%     0.55%     1.47%     1.91%

np: nearest parent heuristic (choose closest candidate parent).

The quality of the solution is the same for BCT and GR.

Page 34: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems

Building Incremental Cover Trees

Building ICTs requires a small fraction of the cost required for the corresponding BCTs

However, the quality of the solutions provided by ICTs is comparable to that of BCTs (and, thus, GR)

Extra Cost:

b       Clustered   Faces
1.3     0.16%       0.79%
1.5     0.08%       0.41%
1.7     0.06%       0.28%

For trees with 10,000 items:
  Insertion cost: ~2.6 msec
  Deletion cost: ~10 msec

Inserting/removing items after a window jump depends on the size of the window and the jump step but is much faster than re-building a BCT for the new set of items

Page 35: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems

Pruning

Pruning is even better for non-uniform datasets, since each selection of a diverse item results in pruning a larger number of items around it

Also, pruning is better for large values of λ

Page 36: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems

Streaming Data

We compare ICTs against SGR, a streaming version of GR:
  At each window, we keep any remaining diverse items from the previous window (durability) and let GR select items from the new window satisfying freshness

Comparable achieved diversity, while ICTs are much faster
  Retrieving the top-100 items from an ICT with 1,000-10,000 items requires ~1.5 msec
  Executing SGR requires 3.2 sec for 5,000 items and more than 15 sec for 10,000 items

Page 37: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems

Summary

We proposed an index-based diversification approach based on Cover Trees

We provided a new suite of algorithms along with theoretical results for the quality of our approach

We studied the diversification problem in a dynamic setting, where items change over time, and defined continuity requirements that the diversified items must satisfy

Page 38: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems


Related Publications

1. M. Drosou and E. Pitoura, Diverse Set Selection over Dynamic Data, in IEEE TKDE (to appear)

2. M. Drosou and E. Pitoura, Dynamic Diversification of Continuous Data, EDBT 2012


Page 39: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems

Outline

Search Result Diversification: Introduction & Related Work

Content Diversification using Indices

DisC Diversity: Diversification based on Dissimilarity and Coverage
  DisC Diversity
  Algorithms
  Comparison with other models
  Incremental DisC

POIKILO: Evaluating the Results of Diversification Models and Algorithms

Summary

Page 40: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems

DisC Diversity

What is the right size for the diverse subset S? What is a good k?

What if… instead of k, a radius r?

Given a result set P and a radius r, we select a representative subset S ⊆ P such that:
1. For each item in P, there is at least one similar item in S (coverage)
2. No two items in S are similar to each other (dissimilarity)

Page 41: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems

DisC Diversity

Small r: more and less dissimilar points (zoom in)

Large r: fewer and more dissimilar points (zoom out)

Local zooming at specific points

[Figure panels: Zoom-out, Zoom-in, Local zoom]

Page 42: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems

DisC Diversity

Formal definition: Let P be a set of objects and r, r ≥ 0, a real number. A subset S ⊆ P is an r-Dissimilar-and-Covering diverse subset, or r-DisC diverse subset, of P, if the following two conditions hold:
1. coverage condition: $\forall p_i \in P$, $\exists p_j \in N_r^+(p_i)$, such that $p_j \in S$
2. dissimilarity condition: $\forall p_i, p_j \in S$ with $p_i \neq p_j$, it holds that $d(p_i, p_j) > r$

Since a DisC set for a set P is not unique, we seek a concise representation → the minimum DisC set
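The two conditions of the definition translate directly into a validity check. The sketch assumes that the extended neighborhood of an item contains the item itself plus every item at distance at most r; the function name is hypothetical.

```python
from itertools import combinations

def is_r_disc(subset, points, dist, r):
    """True if `subset` covers `points` within r and is pairwise more than r apart."""
    # Coverage: every item has a selected item (possibly itself) within distance r.
    covered = all(any(dist(p, s) <= r for s in subset) for p in points)
    # Dissimilarity: any two selected items are more than r apart.
    dissimilar = all(dist(p, q) > r for p, q in combinations(subset, 2))
    return covered and dissimilar

abs_dist = lambda a, b: abs(a - b)
line = list(range(10))
```

On the ten points 0..9 with r = 3, the subset [0, 4, 8] is valid, [0, 3, 8] violates dissimilarity and [0, 8] leaves item 4 uncovered.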

Page 43: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems

Graph model

We use a graph to model the problem:
  Each item is a vertex
  There exists an edge between two vertices, if their distance is at most r

[Figure: items connected by edges for a given radius r]

Page 44: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems

Graph model

Finding a minimum r-DisC diverse subset of a set P is equivalent to finding a minimum Independent Dominating set of the corresponding graph
  Independent: no edge between any two vertices in the set
  Dominating: all vertices outside are connected with at least one inside

This is an NP-hard problem

[Figure: two example sets, one dominating but not independent, one dominating and independent]

Page 45: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems


Computing DisC subsets

A basic or greedy approach: select random items or items with large neighborhoods
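Both variants can be sketched with one loop: pick an uncovered item, add it to the solution, and mark everything within r of it as covered. The basic variant takes any uncovered item, while the greedy variant prefers the one with the most uncovered neighbors. Names and the brute-force neighborhood computation are illustrative assumptions (items must be hashable and distinct); a spatial index would normally supply the neighbors.

```python
def disc_subset(points, dist, r, greedy=True):
    """Build an r-DisC subset by repeatedly selecting an uncovered item."""
    neighbors = {p: {q for q in points if q != p and dist(p, q) <= r}
                 for p in points}
    uncovered = set(points)
    selected = []
    while uncovered:
        if greedy:
            # Prefer the item that covers the most still-uncovered neighbors.
            p = max(uncovered, key=lambda x: len(neighbors[x] & uncovered))
        else:
            p = next(iter(uncovered))
        selected.append(p)
        uncovered -= neighbors[p] | {p}
    return selected

abs_dist = lambda a, b: abs(a - b)
tiny = disc_subset([0, 1, 2], abs_dist, 1)
line = list(range(10))
result = disc_subset(line, abs_dist, 2, greedy=False)
```

Since only uncovered items are ever selected, no two selected items are within r of each other, and the loop ends only when every item is covered.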

Page 46: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems

How much smaller is the (optimal) minimum DisC set?

The size of any r-DisC diverse subset S of P is at most B times the size of any minimum r-DisC diverse subset S*
  where B is the maximum number of independent neighbors of any item in P
  i.e., each item has at most B neighbors that are independent from each other

B depends on the distance metric and data cardinality

We have proved that:
  for the Euclidean distance in the 2D plane: B = 5
  for the Manhattan distance in the 2D plane: B = 7
  for the Euclidean distance in 3D space: B = 24

Page 47: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems


Raising the dissimilarity condition

When we consider only coverage:

Let Δ be the maximum number of neighbors of any item in P; the size of any covering (but not dissimilar) diverse subset S of P is at most lnΔ times larger than any minimum covering subset S*

Page 48: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems

Adding weights

We also consider weights

e.g., indicating relevance

We now seek the DisC set S with the minimum value of

When all weights are equal, the problem is reduced to finding a minimum r-DisC subset


Page 49: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems

Multiple radii

We want to allow different areas of the data to contribute more or fewer items to the diverse set

The problem now loses its symmetry

Two interpretations:
1. pi can represent all items lying at a distance at most r(pi) around it (Covering problem)
2. pi can be represented only by items lying at a distance at most r(pi) around it (CoveredBy problem)

Page 50: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems

Multiple radii variations

The problem is now modeled via a directed graph
  Directed graphs do not always have an independent dominating set!
  We provide heuristic algorithms that always locate a valid DisC set
    Covering: start with items with larger radii
    CoveredBy: start with items with smaller radii

[Figure panels: A set P, Covering, CoveredBy]

Page 51: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems

Comparison with other models

[Figure panels: r-DisC, MAXSUM, MAXMIN, k-medoids]

Page 52: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems

Comparison with MAXMIN

For a set of items P, we have proved that:
1. Let S be an r-DisC set and S* be an optimal MAXMIN set. Let λ and λ* be the MAXMIN distances of the two sets. Then, λ* ≤ 3λ.
2. Let S* be the optimal MAXMIN set of size k with MAXMIN distance equal to λ*. Let S be an r-DisC set with r = λ*. Then, |S| < k′, where k′ is the first integer larger than k for which the corresponding optimal MAXMIN set S*′ of P has MAXMIN distance equal to λ*′, with λ*′ < λ*.

Page 53: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems

Zooming

We want to change the radius r to r’ interactively and compute a new diverse set
  r’ < r: zoom in, r’ > r: zoom out

Two requirements:
1. Support an incremental mode of operation: the new set Sr’ should be as close as possible to the already seen result Sr. Ideally, Sr’ ⊇ Sr for r’ < r and Sr’ ⊆ Sr for r’ > r
2. The size of Sr’ should be as close as possible to the size of the minimum r’-DisC diverse subset

There is no monotonic property among the r-DisC diverse and the r’-DisC diverse subsets of a set of objects P (the two sets may be completely different)

Page 54: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems

Size when moving from r to r’

The change in size of the diverse set when moving from r to r’ depends on the number of independent neighbors (for r’) in the “ring” around an object between the two radii:

$N^{I}_{r_1, r_2}(p_i)$

Page 55: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems


Zooming

Again, depends on the distance metric and data cardinality

We have proved that: 2D Euclidean: , where

2D Manhattan: , where

(proofs in the thesis)

Page 56: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems

Zooming-In

For zooming-in, we keep the items of Sr and fill in the solution with items from uncovered areas

It holds that:
1. Sr ⊆ Sr′
2. |Sr′| ≤ N|Sr|, where N is the maximum in Sr

(proof and various algorithms for keeping the size small in the thesis)
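The zoom-in step can be sketched as follows: keep every item of the previous solution and then add items that are no longer covered under the smaller radius. This is a simple first-come variant for illustration only; the function name and the order in which uncovered items are added are assumptions, and the size-reducing algorithms referred to above are not reproduced.

```python
def zoom_in(points, prev_solution, dist, new_r):
    """Extend an r-DisC solution to a smaller radius new_r < r."""
    selected = list(prev_solution)  # incremental: the old items are kept
    uncovered = [p for p in points
                 if all(dist(p, s) > new_r for s in selected)]
    while uncovered:
        p = uncovered[0]
        selected.append(p)
        # Items within new_r of p are now covered and drop out.
        uncovered = [q for q in uncovered if dist(p, q) > new_r]
    return selected

abs_dist = lambda a, b: abs(a - b)
line = list(range(10))
s_r3 = [0, 4, 8]                        # a valid 3-DisC subset of 0..9
s_r1 = zoom_in(line, s_r3, abs_dist, 1)
```

The old items remain pairwise dissimilar because they were more than r apart and new_r < r, and every added item is more than new_r away from all items selected before it.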

Page 57: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems

Zooming-Out

For zooming-out, we keep the independent items of Sr and fill in the solution with items from uncovered areas.

It holds that:
1. There are at most N items in Sr\Sr’
2. For each item in Sr\Sr’, at most (B-1) items are added to Sr’

(proof and various algorithms for keeping the size small in the thesis)

Page 58: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems

Implementation

We base our implementation on a spatial data structure (central operation: compute neighbors)

We use an M-tree
  We link together all leaf nodes (we visit items in a single left-to-right traversal of the leaf level to exploit locality)
  We build trees using splitting policies that minimize overlap

Page 59: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems

Implementation

Two implementations for our greedy approach: Grey-Greedy, White-Greedy

Lazy variations for updating neighborhoods

Pruning Rule: A leaf node that contains no white objects is colored grey. When all its children become grey, an internal node is colored grey and becomes inactive. We prune subtrees with only “grey nodes”.

Page 60: Preference  and  Diversity-based Ranking  in Network-Centric  Information Management  Systems

Marina Drosou, Preference and Diversity-based Ranking in Network-Centric Information Management Systems 60

Performance

Many real and synthetic datasets

General trade-off: larger r → smaller diverse set → higher cost

Lazy variations of our algorithms further reduce computational cost

The cost also depends on the characteristics of the M-tree (fat-factor)

Smaller sizes for clustered data

[Plots: cost; solution size]


Diversity and Relevance


Similar diversity for the Basic and Greedy algorithms

Greedy considers relevance and produces subsets of larger average weight

Lifting the dissimilarity condition improves the average weight but decreases the minimum distance

Also, we get larger subsets than in the diversity-only case


Zooming

[Plots: solution size; cost; Jaccard distance among solutions]

Both requirements are met: the computation is incremental (much smaller cost) and the solution size is small (relative to computing it from scratch)

Larger overlap between Sr and Sr′
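The Jaccard distance among solutions measures how much two solutions differ: it is 0 for identical sets and 1 for disjoint ones, so a larger overlap between Sr and Sr′ means a smaller distance. A minimal sketch, assuming solutions are given as collections of hashable items (the function name is illustrative only):

```python
def jaccard_distance(a, b):
    """Return 1 - |a ∩ b| / |a ∪ b| for two solutions given as collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Two empty solutions are treated as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


s_r = {1, 2, 3}
s_r_new = {2, 3, 4}
print(jaccard_distance(s_r, s_r_new))  # two shared items out of four in total
```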


Related Publications

1. M. Drosou and E. Pitoura, Multiple Radii DisC Diversity: Result Diversification based on Dissimilarity and Coverage (submitted)

2. M. Drosou and E. Pitoura, DisC Diversity: Result Diversification based on Dissimilarity and Coverage, in PVLDB, vol. 6, no. 1, pp. 13–24, 2012, VLDB Endowment (Best Paper Award)


Outline

Search Result Diversification: Introduction & Related Work

Content Diversification using Indices

DisC Diversity: Diversification based on Dissimilarity and Coverage

POIKILO: Evaluating the Results of Diversification Models and Algorithms

Summary


Visualizing Diverse Items


[Screenshot with callouts: selecting diversification parameters; zooming and streaming; result; statistics]


Related Publications

1. M. Drosou and E. Pitoura, POIKILO: A Tool for Evaluating the Results of Diversification Models and Algorithms, VLDB 2013


Outline

Search Result Diversification: Introduction & Related Work

Content Diversification using Indices

DisC Diversity: Diversification based on Dissimilarity and Coverage

POIKILO: Evaluating the Results of Diversification Models and Algorithms

Summary

  Thesis contributions

  Directions for future research


Thesis Contributions

Index-based diversification

We proposed a novel, index-based approach for solving the MAXMIN diversification problem

We provided a suite of algorithms exploiting the Cover Tree

We introduced a new family of diversity algorithms which provide an approximation of the optimal solution with a factor that depends on b, the base of the cover tree

We presented an efficient tree-based implementation of a traditional greedy algorithm which provides a ½-approximation of the optimal solution

We defined the continuous diversification problem and introduced a number of continuity requirements for increasing the quality of diverse results in a streaming scenario
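As an illustration of the MAXMIN objective (select k items so that the minimum pairwise distance among them is as large as possible), the sketch below uses the common farthest-first greedy heuristic. The slide does not spell out which greedy algorithm is meant, so this choice is an assumption. The sketch is brute force and does not use a cover tree; the function name and the Euclidean distance over coordinate tuples are also assumptions.

```python
from itertools import combinations
from math import dist


def greedy_maxmin(items, k):
    """Pick k items greedily so that the minimum pairwise distance stays large."""
    if k <= 0:
        return []
    if k >= len(items):
        return list(items)
    if k == 1:
        return [items[0]]
    # Seed with the two items that are farthest apart.
    a, b = max(combinations(items, 2), key=lambda pair: dist(*pair))
    selected = [a, b]
    while len(selected) < k:
        # Add the item whose distance to its closest selected item is largest.
        best = max(
            (p for p in items if p not in selected),
            key=lambda p: min(dist(p, q) for q in selected),
        )
        selected.append(best)
    return selected


items = [(0, 0), (1, 0), (5, 0), (10, 0)]
print(greedy_maxmin(items, 3))
```

Each step rescans all items, which is the cost that an index-based implementation aims to reduce.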


Thesis Contributions

Diversification based on dissimilarity and coverage

We introduced a novel diversity definition, called DisC diversity, based on using a radius r rather than a size limit k to select diverse items

We presented both a spatial and a graph model for our definition

We studied the weighted and multiple radii cases

We introduced incremental diversification to a new radius through zooming-in and zooming-out

We presented algorithms for locating DisC diverse subsets and derived bounds concerning the size of such subsets

We provided efficient implementations of our algorithms based on a spatial index structure, namely the M-tree
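A radius-based selection of this kind can be sketched as follows, assuming the two conditions are read as: every item lies within distance r of some selected item (coverage), and no two selected items lie within distance r of each other (dissimilarity). The function names are hypothetical and the sketch is a brute-force scan over coordinate tuples with Euclidean distance, not the M-tree implementation.

```python
from math import dist


def basic_disc(items, r):
    """Scan the items once, selecting each item not yet covered within radius r."""
    selected = []
    for p in items:
        if all(dist(p, q) > r for q in selected):
            selected.append(p)
    return selected


def is_disc_diverse(items, subset, r):
    """Check coverage and dissimilarity of `subset` with respect to `items`."""
    covered = all(any(dist(p, q) <= r for q in subset) for p in items)
    dissimilar = all(
        dist(p, q) > r
        for i, p in enumerate(subset)
        for q in subset[i + 1:]
    )
    return covered and dissimilar


items = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
subset = basic_disc(items, r=1)
print(subset, is_disc_diverse(items, subset, r=1))
```

Note that the size of the result is determined by r and the data rather than by a parameter k.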


Thesis Contributions

Visualizing and comparing diversification algorithms

We developed a system prototype, called “Poikilo”, providing implementations of a wide variety of diversification approaches to assist users in locating, visualizing and comparing diverse results


Directions for future research

Short term plans

Diversification in Database Exploration

  Interesting suggestions in database exploration are often similar

  Also: exploit external sources

Diversification of Multiple Search Results

  Exploit overlap among results of different queries

  Use diversified results of past queries to answer new ones

Diversification of Keyword Search Results in Databases

  Moving diversification to the ranking phase

  Apply coverage-based definitions

Long term plans

Diversification in a distributed setting

  Place “diversification filters” on the overlay network to reduce computational and communication costs


Thesis Publications

Journal Publications

1. M. Drosou and E. Pitoura, Multiple Radii DisC Diversity: Result Diversification based on Dissimilarity and Coverage (submitted)

2. M. Drosou and E. Pitoura, YmalDB: Exploring Relational Databases via Result Driven Recommendations, in VLDBJ (to appear)

3. M. Drosou and E. Pitoura, Diverse Set Selection over Dynamic Data, in IEEE TKDE (to appear)

4. M. Drosou and E. Pitoura, DisC Diversity: Result Diversification based on Dissimilarity and Coverage, in PVLDB, vol. 6, no. 1, pp. 13–24, 2012, VLDB Endowment (Best Paper Award)

5. M. Drosou and E. Pitoura, Search Result Diversification, in SIGMOD Record, vol. 39, no. 1, pp. 41–47, 2010, ACM

6. M. Drosou and E. Pitoura, Diversity over Continuous Data, in IEEE Data Engineering Bulletin, vol. 32, no. 4, pp. 49–56, 2009, IEEE


Thesis Publications

Conference Publications

1. M. Drosou and E. Pitoura, Dynamic Diversification of Continuous Data, EDBT 2012

2. M. Drosou and E. Pitoura, REDRIVE: Result Driven Database Exploration through Recommendations, CIKM 2011

3. K. Stefanidis, M. Drosou and E. Pitoura, PerK: Personalized Keyword Search in Relational Databases through Preferences, EDBT 2010

Workshop Publications

4. D. Souravlias, M. Drosou, K. Stefanidis and E. Pitoura, On Novelty in Publish/Subscribe Delivery, DBRank 2010

5. K. Stefanidis, M. Drosou and E. Pitoura, “You May Also Like” Results in Relational Databases, PersDB 2009

Demos

6. M. Drosou and E. Pitoura, POIKILO: A Tool for Evaluating the Results of Diversification Models and Algorithms, VLDB 2013

7. M. Drosou and E. Pitoura, YmalDB: A Result Driven Recommendation System for Databases, EDBT 2013


Thank you!