Preference and Diversity-based Ranking in Network-Centric Information Management Systems
PhD defense
Marina Drosou
Computer Science & Engineering Dept.
University of Ioannina
Why diversify?
Marina Drosou, Preference and Diversity-based Ranking in Network-Centric Information Management Systems
[Figure: different interpretations of “Jaguar”: Car, Animal, Sports Team, “Mr. Jaguar”]
Thesis Goal
This PhD thesis concerns the development, implementation and evaluation of models, algorithms and techniques for the ranking of information being presented to users of network-centric information management systems
This ranking is based on the importance of each piece of information. We consider that importance is influenced by both relevance to user information needs and diversity:
  Relevance is important so that users are only presented with the most useful results according to their needs
  Diversity ensures that the received results do not all contain similar information
Outline
Search Result Diversification: Introduction & Related Work
Content Diversification using Indices
DisC Diversity: Diversification based on Dissimilarity and Coverage
POIKILO: Evaluating the Results of Diversification Models and Algorithms
Summary
Outline
Search Result Diversification: Introduction & Related Work
  Problem Definition
  Variations
  Algorithms
Content Diversification using Indices
DisC Diversity: Diversification based on Dissimilarity and Coverage
POIKILO: Evaluating the Results of Diversification Models and Algorithms
Summary
Problem Definition
Given:
1. P = {p1, …, pn}
2. k ≤ n
3. d: a distance metric
4. f: a diversity function
Find:
$$S^* = \operatorname*{argmax}_{S \subseteq P,\ |S| = k} f(S, d)$$
Given a set P of items and a number k, select a subset S* of P with the k most diverse items of P
What it means
Given a set P of query results we want to select a representative diverse subset S* of P
What does diverse mean?
  Content: dissimilar items
    e.g., distant locations on a map, different attribute values in tuples
  Coverage: different aspects, perspectives, concepts
    e.g., different interpretations of a keyword in web search, different topics
  Novelty: items not seen in the past
    e.g., novel results in a notification service
Content-based diversity
$$f_{\mathrm{MIN}}(S, d) = \min_{p_i, p_j \in S,\ p_i \neq p_j} d(p_i, p_j)$$
$$f_{\mathrm{SUM}}(S, d) = \sum_{p_i, p_j \in S,\ p_i \neq p_j} d(p_i, p_j)$$
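As a rough illustration of the two content-based diversity functions (minimum and sum of pairwise distances), here is a minimal Python sketch. The function names `f_min` and `f_sum` and the use of Euclidean distance are assumptions made for the example, and each unordered pair of items is counted once.

```python
from itertools import combinations
from math import dist  # Euclidean distance, assumed here as the metric d


def f_min(S, d=dist):
    """Smallest pairwise distance among the items of S."""
    return min(d(p, q) for p, q in combinations(S, 2))


def f_sum(S, d=dist):
    """Sum of pairwise distances, each unordered pair counted once (assumption)."""
    return sum(d(p, q) for p, q in combinations(S, 2))


S = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
print(f_min(S), f_sum(S))
```

A set scores well under the minimum variant only if every pair is far apart, whereas the sum variant can be dominated by a few distant pairs.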
Coverage-based diversity
Basic idea: Find a set of results that cover different interpretations of the query
Common assumptions:
  A taxonomy exists
  Both queries and results may belong to many categories
  Statistics on the distribution of user intents have been collected
  Result independence
Probabilistic view of the problem
Novelty-based diversity
Novelty: the need to avoid redundancy (vs. Diversity: the need to resolve ambiguity)
Intuitively: an item should be returned in the ith position of the list if
  it is relevant
  the previous (i-1) items do not contain the same information
Information is partitioned into “nuggets”
Often, human judges decide what is relevant or not for each nugget (IR approach)
Adding relevance in the mix
We must not forget: relevance to the query is also important! Results must be both relevant and diverse
Two alternatives:
  Select the k most diverse results out of the top-m most relevant ones, m > k
  Include diversity into the ranking criterion
    Augmenting the diversity function with relevance
    Adapting IR criteria, e.g., discounted cumulative gain (DCG) at position i
Adding relevance in the mix
Augmenting diversity functions with relevance:
MAXMIN:
MAXSUM:
Mono-objective formula:
and others
(where is a relevance function and )
Problem Complexity
The problem of choosing diverse items is NP-hard
  This follows from the MAX COVERAGE/SET COVER problems
Intuitively:
  To find the most diverse subset S* of all items P we have to compute all possible combinations of k items out of |P| and keep the one with the maximum diversity
Solving the problem
Thus, we use heuristics for approximate solutions
Greedy heuristics:
  Selecting items one by one until we have k of them
Interchange heuristics:
  Start with a random solution and interchange items that improve the objective function
Also:
  Neighborhood heuristics: Disqualify items close to the ones already selected
  Simulated Annealing: Apply simulated annealing to avoid local maxima
  and others
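The greedy heuristic can be sketched as follows for the MAXMIN objective. This is an illustrative farthest-first selection, not necessarily the exact variant used in the thesis; the name `greedy_maxmin`, the Euclidean metric and the choice of the first item as seed are assumptions.

```python
from math import dist  # Euclidean distance, assumed as the metric d


def greedy_maxmin(P, k, d=dist):
    """Select k items one by one, each time adding the item that is
    farthest from the items already selected."""
    if not 1 <= k <= len(P):
        raise ValueError("k must be between 1 and |P|")
    selected = [P[0]]          # seed: first item (an assumption of this sketch)
    remaining = list(P[1:])
    while len(selected) < k:
        # Distance of a candidate to the selected set = distance to its closest member.
        best = max(remaining, key=lambda p: min(d(p, s) for s in selected))
        selected.append(best)
        remaining.remove(best)
    return selected


points = [(0, 0), (1, 0), (10, 0), (5, 0), (9, 0)]
print(greedy_maxmin(points, 3))
```

Each step costs one pass over the remaining items, which is far cheaper than enumerating all subsets of size k.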
Related Work
Approaches by heuristic (rows) and diversity definition (columns: Content, Coverage, Novelty)
Greedy
  Content: Jain, PAKDD 2004; Ziegler, WWW 2005; Gollapudi, WWW 2009; Drosou, DEBS 2009; Tao, ICDE 2009; Haritsa, IEEE Data Eng. Bull. 2009; Vieira, ICDE 2011; Bozzon, ICWE 2012; Santos, SSDBM 2013; Abbar, WWW 2013; Valkanas, EDBT 2013
  Coverage: Agrawal, WSDM 2009; Liu, SDM 2009; Zhu, WWW 2011; Li, CIKM 2012
  Novelty: Zhang, SIGIR 2002; Clarke, SIGIR 2008; Souravlias, 2010; Lathia, SIGIR 2010; Szpektor, WWW 2013
Interchange
  Yu, EDBT 2009; Vieira, ICDE 2011; Liu, PVLDB 2009; Minack, SIGIR 2011; Liu, TODS 2012
Others
  Vee, ICDE 2008; Zhang, RecSys 2008; Angel, SIGMOD 2011; Fraternali, SIGMOD 2012; Li, PVLDB 2013
Outline
Search Result Diversification: Introduction & Related Work
Content Diversification using Indices
  Model
  Diverse set computation
  Combining diversity & relevance
DisC Diversity: Diversification based on Dissimilarity and Coverage
POIKILO: Evaluating the Results of Diversification Models and Algorithms
Summary
Introduction
We focus on content-based diversification (MAXMIN)
Basic idea: employ indices for the efficient computation of diverse items
  Cover Trees
We also define the Continuous k-diversity problem
The Cover Tree
A leveled tree where each level is a “cover” for all levels beneath it
Items at higher levels are farther apart from each other than items at lower levels
Cover Tree Invariants - Nesting
Nesting: C_l ⊆ C_{l-1}, i.e., once an item p appears at some level, then every lower level has a node associated with p
[Figure: items p1, p2, p3 on three levels of a tree]
Cover Tree Invariants - Covering
Covering: For every p_i ∈ C_{l-1}, there exists a p_j ∈ C_l, such that d(p_i, p_j) ≤ b^l and the node associated with p_j is the parent of the node associated with p_i
[Figure: items p1, p2, p3 on three levels of a tree, annotated with the distance b^{l-1}. b: the “base” of the tree; l: the level of pi]
Cover Tree Invariants - Separation
Separation: For all distinct p_i, p_j ∈ C_l, it holds that d(p_i, p_j) > b^l
[Figure: items p1, p2, p3 on three levels of a tree, annotated with the distances > b^{l-2} and b^{l-1}]
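To make the three invariants concrete, the following Python sketch checks them on a small explicit tree. It assumes the usual cover tree formulation with base b: items at level l are more than b^l apart (separation), every item at level l-1 has a parent at level l within distance b^l (covering), and every level is a subset of the level below it (nesting). The data layout (`levels`, `parent`) and the function name `check_cover_tree` are hypothetical and chosen only for the example.

```python
from itertools import combinations
from math import dist  # Euclidean distance, assumed as the metric d


def check_cover_tree(levels, parent, b, d=dist):
    """Return True if nesting, covering and separation hold.

    levels: dict mapping a level l to the set of items C_l (consecutive integer levels)
    parent: dict mapping (l, p) to the parent item, at level l + 1, of p's node at level l
    """
    ls = sorted(levels)
    for l in ls:
        # Separation: distinct items of C_l are more than b**l apart.
        if any(d(p, q) <= b ** l for p, q in combinations(levels[l], 2)):
            return False
    for lo, hi in zip(ls, ls[1:]):
        # Nesting: every item of the higher level also appears at the lower level.
        if not levels[hi] <= levels[lo]:
            return False
        # Covering: each item of the lower level has a parent within b**hi.
        for p in levels[lo]:
            q = parent.get((lo, p))
            if q not in levels[hi] or d(p, q) > b ** hi:
                return False
    return True


levels = {2: {(0.0,)}, 1: {(0.0,), (3.0,)}, 0: {(0.0,), (3.0,), (4.5,)}}
parent = {
    (1, (0.0,)): (0.0,), (1, (3.0,)): (0.0,),
    (0, (0.0,)): (0.0,), (0, (3.0,)): (3.0,), (0, (4.5,)): (3.0,),
}
print(check_cover_tree(levels, parent, b=2))
```

Moving the item at 3.0 to 1.5 would violate separation at level 1, since the two items would then be within b^1 = 2 of each other.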
Example
Items indexed at the first ten levels of the same Cover Tree
Cover Tree Representations
After an item p appears in some level l of the tree, p is a child of itself at all levels below l.
Implicit Representation: space depending on P
Explicit Representation: O(n) space
[Figure: the same tree over items p1, p2, p3 in its implicit and explicit representation]
Dynamic Construction
Items can be inserted into and deleted from a Cover Tree in a dynamic fashion
Insertion:
1. Starting from the root, descend towards the candidate nodes that can cover the new item p
2. Continue until a level C_l is reached where p is separated from all other items
3. Select as parent a candidate node of C_{l+1} that covers p
Deletion:
1. Descend the tree looking for p, keeping note of candidate nodes that can cover the children of p
2. Remove p and reassign its children to the candidate nodes
Level Family of Algorithms
The higher the tree level, the farther apart its nodes. Thus, by selecting items from nodes at high levels, we retrieve more diverse results
Let C_l be the first level with at least k items
1. Level-Basic: Select k random items from C_l
2. Level-Greedy: Greedily select k items from C_l
3. Level-Inherit: Select all items in C_{l+1} and greedily select k-|C_{l+1}| items from C_l
Approximation Bound
Let P be a set of items, k ≥ 2, dOPT(P,k) the optimal minimum distance for the MAXMIN problem and dCT(P,k) the minimum distance of the diverse set computed by the Level-Basic algorithm. It holds that:
dCT(P,k) ≥ α · dOPT(P,k), where α = (b-1)/(2b²)
(Proved by exploiting the covering invariant of the tree to bound the level where the least common ancestor of any two items of the optimal solution appears in the tree)
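As a quick numeric illustration of the factor (b-1)/(2b²), the sketch below evaluates it for a few bases. The function name `level_basic_factor` is hypothetical, and the requirement b > 1 is an assumption of this sketch.

```python
def level_basic_factor(b):
    """Approximation factor (b - 1) / (2 * b**2) for a cover tree with base b."""
    if b <= 1:
        raise ValueError("the base b is assumed to be larger than 1")
    return (b - 1) / (2 * b ** 2)


for b in (1.3, 1.5, 1.7, 2.0):
    print(f"b = {b}: factor = {level_basic_factor(b):.4f}")
```

For b = 2 the factor is 1/8, so the minimum distance of the Level-Basic result is guaranteed to be at least one eighth of the optimal minimum distance.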
Cover Tree implementation of Greedy
Any Cover Tree can be employed for implementing the greedy heuristic
  ½-approximation of the optimal solution
We perform k descents of the tree, using one of the following pruning rules:
CT PRUNING RULE: Let p and q be two sibling nodes at level l in a CT. If , then no node in the subtree of q can be
further apart from S than p
WCT PRUNING RULE: Let p and q be two sibling nodes at level l in a CT. If , then no node in the subtree of q can be further apart from S than p
Batch Construction
If all items of P are available, we can perform a batch construction of the Cover Tree
  We call such trees “Batch Cover Trees” (BCTs)
  As we descend a BCT, we get items in the order selected by Greedy
Algorithm:
1. The leaf level C_l contains all items in P
2. We greedily select items from C_l with distance larger than b^{l+1} and promote them to C_{l+1}
3. The rest of the items in C_l are distributed as children among the new nodes of C_{l+1}
4. Continue until we reach the root level
Adding relevance
Two approaches:
1. Incorporate relevance to the distance function
2. Use relevance to select items from the tree, e.g., MMR
Level-Greedy Level-Hybrid: Greedily select k items from and the k most
relevant descendants of items in
Continuous Model
We consider a streaming scenario, where new items arrive and older items expire
We want to provide users with a continuously updated subset of the top-k most diverse recent items in the stream
We consider a sliding-window model:
[Figure: two consecutive windows, Pi-1 and Pi, each of length w and separated by a jump step]
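The sliding-window model can be sketched as a generator of consecutive windows. This is a simplified, count-based illustration: the name `jumping_windows` and the parameters `w` (window length) and `h` (jump step) are assumptions of the example.

```python
def jumping_windows(stream, w, h):
    """Yield consecutive windows of w items; each window starts h items
    after the previous one, so items older than the window expire."""
    items = list(stream)
    for start in range(0, len(items) - w + 1, h):
        yield items[start:start + w]


for window in jumping_windows(range(1, 9), w=4, h=2):
    print(window)
```

With w = 4 and h = 2, every window shares two items with its predecessor, which is what makes it worthwhile to carry diverse items over from one window to the next.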
Continuous k-Diversity Problem
Let Pi-1, Pi be two subsequent jumping windows
For each Pi, we seek to select a diverse subset Si ⊆ Pi, where the following two additional constraints hold:
  Durability: Once selected as diverse, an item remains as such until it expires
  Freshness: Let p be the newest item in Si-1. Then, no item of Pi \ Si-1 that is older than p is selected in Si, i.e., items are selected in the same order they are produced
Continuity Requirements
Items in the tree are marked as valid or invalid:
Freshness: non-diverse items that are older than the newest diverse item from the previous window are marked as invalid in the cover tree and are not further considered.
Durability: Let r be the number of diverse items from previous windows that have not yet expired. We select k-r new valid diverse items from the new window.
Building Batch Cover Trees
We measure the extra cost of building a BCT as compared to executing the greedy heuristic (GR) for k = n
  This extra cost corresponds to assigning nodes to suitable parents to form the tree levels

Extra cost:
b      Clustered            Faces
       non-np    np         non-np    np
1.3    0.42%     0.58%      1.49%     1.94%
1.5    0.42%     0.56%      1.47%     1.92%
1.7    0.41%     0.55%      1.47%     1.91%

np: nearest parent heuristic (choose closest candidate parent).
The quality of the solution is the same for BCT and GR.
Building Incremental Cover Trees
Building ICTs requires a small fraction of the cost required for the corresponding BCTs
However, the quality of the solutions provided by ICTs is comparable to that of BCTs (and, thus, GR)

Extra cost:
b      Clustered    Faces
1.3    0.16%        0.79%
1.5    0.08%        0.41%
1.7    0.06%        0.28%

For trees with 10,000 items:
  Insertion cost: ~2.6 msec
  Deletion cost: ~10 msec
The cost of inserting/removing items after a window jump depends on the size of the window and the jump step, but is much lower than re-building a BCT for the new set of items
Pruning
Pruning is even better for non-uniform datasets, since each selection of a diverse item results in pruning a larger number of items around it
Also, pruning is better for large values of λ
Streaming Data
We compare ICTs against SGR, a streaming version of GR:
At each window, we keep any remaining diverse items from the previous window (durability) and let GR select items from the new window satisfying freshness
Comparable achieved diversity, while ICTs are much faster
Retrieving the top-100 items from an ICT with 1,000-10,000 items requires ~1.5 msec
Executing SGR requires 3.2 sec for 5,000 items and more than 15 sec for 10,000 items
Summary
We proposed an index-based diversification approach based on Cover Trees
We provided a new suite of algorithms along with theoretical results for the quality of our approach
We studied the diversification problem in a dynamic setting, where items change over time and defined continuity requirements that the diversified items must satisfy
Related Publications
1. M. Drosou and E. Pitoura, Diverse Set Selection over Dynamic Data, in IEEE TKDE (to appear)
2. M. Drosou and E. Pitoura, Dynamic Diversification of Continuous Data, EDBT 2012
Outline
Search Result Diversification: Introduction & Related Work
Content Diversification using Indices
DisC Diversity: Diversification based on Dissimilarity and Coverage
  DisC Diversity
  Algorithms
  Comparison with other models
  Incremental DisC
POIKILO: Evaluating the Results of Diversification Models and Algorithms
Summary
DisC Diversity
What is the right size for the diverse subset S? What is a good k?
What if… instead of k, a radius r?
Given a result set P and a radius r, we select a representative subset S ⊆ P such that:
1. For each item in P, there is at least one similar item in S (coverage)
2. No two items in S are similar to each other (dissimilarity)
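The two conditions can be checked directly for a candidate subset. The sketch below is illustrative only: it assumes that two items are “similar” when their distance is at most r, uses Euclidean distance, and the name `is_disc_diverse` is hypothetical.

```python
from itertools import combinations
from math import dist  # Euclidean distance, assumed as the metric d


def is_disc_diverse(P, S, r, d=dist):
    """Check coverage and dissimilarity of S with respect to P and radius r."""
    if any(s not in P for s in S):
        return False  # S must be a subset of P
    # Coverage: every item of P has an item of S within distance r.
    covered = all(any(d(p, s) <= r for s in S) for p in P)
    # Dissimilarity: any two selected items are more than r apart.
    dissimilar = all(d(p, q) > r for p, q in combinations(S, 2))
    return covered and dissimilar


P = [(0, 0), (1, 0), (2, 0), (5, 0)]
print(is_disc_diverse(P, [(1, 0), (5, 0)], r=1))
print(is_disc_diverse(P, [(0, 0), (5, 0)], r=1))
```

In the second call the item (2, 0) has no selected item within distance 1, so the coverage condition fails.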
DisC Diversity
[Figure panels: Zoom-out, Zoom-in, Local zoom]
Small r: more points that are less dissimilar (zoom in)
Large r: fewer points that are more dissimilar (zoom out)
Local zooming at specific points
DisC Diversity
Formal definition:
Let P be a set of objects and r, r ≥ 0, a real number. A subset S ⊆ P is an r-Dissimilar-and-Covering diverse subset, or r-DisC diverse subset, of P, if the following two conditions hold:
1. coverage condition: ∀pi ∈ P, ∃pj ∈ N_r^+(pi), such that pj ∈ S (N_r^+(pi): the items within distance r of pi, including pi itself)
2. dissimilarity condition: ∀pi, pj ∈ S with pi ≠ pj, it holds that d(pi, pj) > r
Since a DisC set for a set P is not unique, we seek a concise representation → the minimum DisC set
Graph model
We use a graph to model the problem:
  Each item is a vertex
  There exists an edge between two vertices, if their distance is at most r
Graph model
Finding a minimum r-DisC diverse subset of a set P is equivalent to finding a minimum Independent Dominating set of the corresponding graph
  Independent: no edge between any two vertices in the set
  Dominating: all vertices outside the set are connected with at least one vertex inside it
This is an NP-hard problem
[Figure panels: a set that is dominating but not independent; a set that is dominating and independent]
Computing DisC subsets
A basic or greedy approach: select random items or items with large neighborhoods
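A minimal sketch of the two strategies follows. It is a simplification rather than the thesis implementation: items within distance r (Euclidean, by assumption) are treated as neighbors, the basic variant picks the lowest-index uncovered item instead of a random one so that the output is reproducible, and the name `disc_subset` is hypothetical.

```python
from math import dist  # Euclidean distance, assumed as the metric d


def disc_subset(P, r, greedy=True, d=dist):
    """Return an r-DisC diverse subset of P (not necessarily a minimum one)."""
    n = len(P)
    neighbors = {i: {j for j in range(n) if j != i and d(P[i], P[j]) <= r}
                 for i in range(n)}
    uncovered = set(range(n))
    S = []
    while uncovered:
        if greedy:
            # Pick the uncovered item with the most uncovered neighbors
            # (ties go to the lowest index).
            i = max(sorted(uncovered), key=lambda j: len(neighbors[j] & uncovered))
        else:
            i = min(uncovered)  # stand-in for a random choice
        S.append(P[i])
        # The chosen item covers itself and all of its neighbors.
        uncovered -= neighbors[i] | {i}
    return S


P = [(0, 0), (1, 0), (2, 0), (5, 0)]
print(disc_subset(P, r=1, greedy=True))
print(disc_subset(P, r=1, greedy=False))
```

Because only uncovered items are ever selected, no two selected items are within r of each other, and the loop ends only when every item is covered. On this input the greedy variant returns two items and the basic variant three.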
How much smaller is the (optimal) minimum DisC set?
The size of any r-DisC diverse subset S of P is at most B times the size of any minimum r-DisC diverse subset S*
where B is the maximum number of independent neighbors of any item in P
i.e., each item has at most B neighbors that are independent from each other
B depends on the distance metric and data cardinality
We have proved that:
  for the Euclidean distance in the 2D plane: B = 5
  for the Manhattan distance in the 2D plane: B = 7
  for the Euclidean distance in 3D space: B = 24
Raising the dissimilarity condition
When we consider only coverage:
Let Δ be the maximum number of neighbors of any item in P; the size of any covering (but not dissimilar) diverse subset S of P is at most ln Δ times larger than any minimum covering subset S*
Adding weights
We also consider weights e.g., indicating relevance
We now seek the DisC set S with the minimum value of
When all weights are equal, the problem is reduced to finding a minimum r-DisC subset
Multiple radii
We want to allow different areas of the data to contribute more or fewer items to the diverse set
The problem now loses its symmetry
Two interpretations:
1. pi can represent all items lying at a distance at most r(pi) around it (Covering problem)
2. pi can be represented only by items lying at a distance at most r(pi) around it (CoveredBy problem)
Multiple radii variations
The problem is now modeled via a directed graph
Directed graphs do not always have an independent dominating set!
We provide heuristic algorithms that always locate a valid DisC set
  Covering: start with items with larger radii
  CoveredBy: start with items with smaller radii
[Figure panels: a set P; Covering; CoveredBy]
Comparison with other models
[Figure panels: r-DisC, MAXSUM, MAXMIN, k-medoids]
Comparison with MAXMIN
For a set of items P, we have proved that:
1. Let S be an r-DisC set and S* be an optimal MAXMIN set. Let λ and λ* be the MAXMIN distances of the two sets. Then, λ* ≤ 3λ.
2. Let S* be the optimal MAXMIN set of size k with MAXMIN distance equal to λ*. Let S be an r-DisC set with r = λ*. Then, |S| < k′, where k′ is the first integer larger than k for which the corresponding optimal MAXMIN set S*′ of P has MAXMIN distance equal to λ*′, with λ*′ < λ*.
Zooming
We want to change the radius r to r’ interactively and compute a new diverse set
  r’ < r: zoom in, r’ > r: zoom out
Two requirements:
1. Support an incremental mode of operation: the new set Sr’ should be as close as possible to the already seen result Sr. Ideally, Sr’ ⊇ Sr for r’ < r and Sr’ ⊆ Sr for r’ > r
2. The size of Sr’ should be as close as possible to the size of the minimum r’-DisC diverse subset
There is no monotonic property among the r-DisC diverse and the r’-DisC diverse subsets of a set of objects P (the two sets may be completely different)
Size when moving from r to r’
The change in size of the diverse set when moving from r to r’ depends on the number of independent neighbors (for r’) in the “ring” around an object between the two radii.
Notation: $N^{I}_{r_1, r_2}(p_i)$
Zooming
Again, $N^{I}_{r_1, r_2}(p_i)$ depends on the distance metric and data cardinality
We have proved that: 2D Euclidean: , where
2D Manhattan: , where
(proofs in the thesis)
Zooming-In
For zooming-in, we keep the items of Sr and fill in the solution with items from uncovered areas
It holds that:
1. Sr ⊆ Sr′
2. |Sr′| ≤ N|Sr|, where N is the maximum in Sr
(proof and various algorithms for keeping the size small in the thesis)
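The zoom-in step can be sketched as follows: keep the items already selected for radius r and scan the data once, adding every item that is left uncovered under the smaller radius. This is a simplified illustration in which the scan order is arbitrary and no attempt is made to keep the result small; the name `zoom_in` and the Euclidean metric are assumptions.

```python
from math import dist  # Euclidean distance, assumed as the metric d


def zoom_in(P, S_r, r_new, d=dist):
    """Extend an r-DisC diverse subset S_r to a smaller radius r_new < r."""
    S = list(S_r)  # all previously selected items are kept
    for p in P:
        # p is uncovered if no selected item lies within r_new of it.
        if all(d(p, s) > r_new for s in S):
            S.append(p)
    return S


P = [(0, 0), (1, 0), (2, 0), (5, 0)]
S_r = [(1, 0), (5, 0)]  # a DisC diverse subset of P for r = 2
print(zoom_in(P, S_r, r_new=0.5))
```

Items of S_r stay pairwise dissimilar because they were already more than r apart, and each added item is more than r_new away from everything selected before it.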
Zooming-Out
For zooming-out, we keep the independent items of Sr and fill in the solution with items from uncovered areas.
It holds that:
1. There are at most N items in Sr\Sr’
2. For each item in Sr\Sr’, at most (B-1) items are added to Sr’
(proof and various algorithms for keeping the size small in the thesis)
Implementation
We base our implementation on a spatial data structure (central operation: compute neighbors)
We use an M-tree
  We link together all leaf nodes (we visit items in a single left-to-right traversal of the leaf level to exploit locality)
  We build trees using splitting policies that minimize overlap
Implementation
Two implementations for our greedy approach: Grey-Greedy and White-Greedy
Lazy variations for updating neighborhoods
Pruning Rule: A leaf node that contains no white objects is colored grey. When all its children become grey, an internal node is colored grey and becomes inactive. We prune subtrees with only “grey nodes”.
Performance
Many real and synthetic datasets
General trade-off: larger r → smaller diverse set → higher cost
Lazy variations of our algorithms further reduce computational cost
The cost also depends on the characteristics of the M-tree (fat-factor)
Smaller sizes for clustered data
[Figure panels: Cost, Solution size]
Diversity and Relevance
Similar diversity for the Basic and Greedy algorithms
Greedy considers relevance and produces subsets of larger average weight
Raising the dissimilarity condition improves average weight but minimum distance is decreased
Also, we get larger subsets than in the diversity-only case
Zooming
[Figure panels: Solution size, Cost, Jaccard distance among solutions]
Both requirements: incremental (much smaller cost) and small size (relative to computing it from scratch)
Larger overlap among Sr and Sr’
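The overlap between two solutions can be quantified with the Jaccard distance, i.e., one minus the ratio of common items to all items in either set. A minimal sketch follows; the function name `jaccard_distance` and the convention that two empty sets have distance 0 are assumptions.

```python
def jaccard_distance(A, B):
    """1 - |A ∩ B| / |A ∪ B|; 0 means identical sets, 1 means disjoint sets."""
    A, B = set(A), set(B)
    if not A and not B:
        return 0.0  # convention assumed for two empty solutions
    return 1 - len(A & B) / len(A | B)


S_r = {"p1", "p2", "p3"}
S_r_new = {"p2", "p3", "p4"}
print(jaccard_distance(S_r, S_r_new))
```

A small distance between Sr and Sr’ indicates that most of the items the user has already seen remain in the result after zooming.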
Related Publications
1. M. Drosou and E. Pitoura, Multiple Radii DisC Diversity: Result Diversification based on Dissimilarity and Coverage (submitted)
2. M. Drosou and E. Pitoura, DisC Diversity: Result Diversification based on Dissimilarity and Coverage, in PVLDB, vol. 6, no. 1, pp. 13-24, 2012, VLDB Endowment (Best Paper Award)
Outline
Search Result Diversification: Introduction & Related Work
Content Diversification using Indices
DisC Diversity: Diversification based on Dissimilarity and Coverage
POIKILO: Evaluating the Results of Diversification Models and Algorithms
Summary
Visualizing Diverse Items
[Screenshots of the tool, annotated: Selecting diversification parameters; Zooming and Streaming; Result Statistics]
Related Publications
1. M. Drosou and E. Pitoura, POIKILO: A Tool for Evaluating the Results of Diversification Models and Algorithms, VLDB 2013
Outline
Search Result Diversification: Introduction & Related Work
Content Diversification using Indices
DisC Diversity: Diversification based on Dissimilarity and Coverage
POIKILO: Evaluating the Results of Diversification Models and Algorithms
Summary
  Thesis contribution
  Directions for future research
Thesis Contributions
Index-based diversification
  We proposed a novel, index-based approach for solving the MAXMIN diversification problem
  We provided a suite of algorithms exploiting the Cover Tree
    We introduced a new family of diversity algorithms which provide a (b-1)/(2b²)-approximation of the optimal solution, where b is the base of the cover tree
    We presented an efficient tree-based implementation of a traditional greedy algorithm which provides a ½-approximation of the optimal solution
  We defined the continuous diversification problem and introduced a number of continuity requirements for increasing the quality of diverse results in a streaming scenario
Thesis Contributions
Diversification based on dissimilarity and coverage
  We introduced a novel diversity definition, called DisC diversity, based on using a radius r rather than a size limit k to select diverse items
  We presented both a spatial and a graph model for our definition
  We studied the weighted and multiple radii cases
  We introduced incremental diversification to a new radius through zooming-in and zooming-out
  We presented algorithms for locating DisC diverse subsets and derived bounds concerning the size of such subsets
  We provided efficient implementations of our algorithms based on spatial index structures, namely the M-Tree
Thesis Contributions
Visualizing and comparing diversification algorithms
  We developed a system prototype, called “Poikilo”, providing implementations of a wide variety of diversification approaches to assist users in locating, visualizing and comparing diverse results
Directions for future research
Short term plans
  Diversification in Database Exploration
    Interesting suggestions in database exploration are often similar
    Also: exploit external sources
  Diversification of Multiple Search Results
    Exploit overlap among results of different queries
    Use diversified results of past queries to answer new ones
  Diversification of Keyword Search Results in Databases
    Moving diversification to the ranking phase
    Apply coverage-based definitions
Long term plans
  Diversification in a distributed setting
    Place “diversification filters” on the overlay network to reduce computational and communication costs
Thesis Publications
Journal Publications
1. M. Drosou and E. Pitoura, Multiple Radii DisC Diversity: Result Diversification based on Dissimilarity and Coverage (submitted)
2. M. Drosou and E. Pitoura, YmalDB: Exploring Relational Databases via Result Driven Recommendations, in VLDBJ (to appear)
3. M. Drosou and E. Pitoura, Diverse Set Selection over Dynamic Data, in IEEE TKDE (to appear)
4. M. Drosou and E. Pitoura, DisC Diversity: Result Diversification based on Dissimilarity and Coverage, in PVLDB, vol. 6, no. 1, pp. 13-24, 2012, VLDB Endowment (Best Paper Award)
5. M. Drosou and E. Pitoura, Search Result Diversification, in SIGMOD Record, vol. 39, no. 1, pp. 41-47, 2010, ACM
6. M. Drosou and E. Pitoura, Diversity over Continuous Data, in IEEE Data Engineering Bulletin, vol. 32, no. 4, pp. 49-56, 2009, IEEE
Thesis Publications
Conference Publications
1. M. Drosou and E. Pitoura, Dynamic Diversification of Continuous Data, EDBT 2012
2. M. Drosou and E. Pitoura, REDRIVE: Result Driven Database Exploration through Recommendations, CIKM 2011
3. K. Stefanidis, M. Drosou and E. Pitoura, PerK: Personalized Keyword Search in Relational Databases through Preferences, EDBT 2010

Workshop Publications
4. D. Souravlias, M. Drosou, K. Stefanidis and E. Pitoura, On Novelty in Publish/Subscribe Delivery, DBRank 2010
5. K. Stefanidis, M. Drosou and E. Pitoura, “You May Also Like” Results in Relational Databases, PersDB 2009

Demos
6. M. Drosou and E. Pitoura, POIKILO: A Tool for Evaluating the Results of Diversification Models and Algorithms, VLDB 2013
7. M. Drosou and E. Pitoura, YmalDB: A Result Driven Recommendation System for Databases, EDBT 2013
Thank you!