Efficient Diversity-Aware Search

Dacong (Tony) Yan

May 4, 2011

Background & Motivation

What is search?

1. A user U initiates a query Q.
2. A list of documents D, sorted by relevance R w.r.t. Q, is returned.

User Satisfaction sat(U, Q)

- It's all about relevance between D and Q!
- User U has their own perspective on relevance, $R_U$.
- Roughly speaking, $\mathrm{sat}(U, Q) \propto \dfrac{1}{\mathrm{diff}(R_U, R)}$.

Problem: $R_U$ is difficult to capture, and is usually ignored!

Symptoms of ignoring $R_U$:

- Redundant documents are included in the result set.
- The documents most relevant in terms of $R_U$ are excluded from the result set.

Solution: diversity-aware search!

CSE 788, Dacong (Tony) Yan Efficient Diversity-Aware Search 2/20

Agenda

Diversity-Aware Search

DivGen Approach

Evaluation

Conclusion

Intuitively, relevance + dissimilarity

Formally, a content-based diversification perspective:

- Data Model
- User Behavior Model
- Answer Quality

Data Model

Vector Space Model: documents as weighted sets of features

Each document d is represented as a vector

$d = (d_1, d_2, \ldots)$,

where $d_i \ge 0$ denotes the weight of feature i in document d.

Examples

- textual documents: features can be keywords, weighted in a tf-idf manner
- graph "documents": features can be paths in the corpus graph
- in a recsys scenario: features can be the set of users who recommend a document
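The vector space model above can be sketched in a few lines of Python. This is an illustrative sketch only: documents are stored as sparse dicts from feature name to non-negative weight, and `cosine_sim` is an assumed choice of similarity function, since the slides do not say which sim is used. The names `Doc`, `cosine_sim`, and the sample documents are hypothetical.

```python
import math

# Hypothetical sparse document vector: feature name -> non-negative weight.
Doc = dict[str, float]


def cosine_sim(a: Doc, b: Doc) -> float:
    """Cosine similarity of two sparse vectors (an assumed choice of sim)."""
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


# Toy documents with made-up weights.
d1 = {"search": 2.0, "diversity": 1.0}
d2 = {"search": 2.0, "diversity": 1.0}
d3 = {"graph": 3.0}
```

With non-negative weights the result stays between 0 and 1, which is convenient when sim is later read as a probability.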

Data Model

Vector Space Model: documents as weighted sets of features

Each document d is represented as a vector

d = (d1, d2, ...),

denoting feature i has weight d i ≥ 0 in document d

Examples

textual documents: features can be keywords weighted in a tf.idfmannergraph “documents”: features can be paths in the corpus graphin recsys scenario: features can be the set of users who recommend adocument

User Behavior Model

Assumption: the user examines the results in their order of presentation.

- Usefulness of a document d: the probability that d is useful
- Relevance: the probability that d is relevant
- Novelty: the probability that d's content is not redundant

Consider a document d preceded by $d_1, d_2, \ldots, d_m$ w.r.t. a query q. Its usefulness is defined below:

$\mathrm{use}(d \mid \{d_1, \ldots, d_m\}, q) = \mathrm{rel}(d \mid q) \cdot \bigl(1 - \mathrm{red}(d \mid \{d_1, \ldots, d_m\}, q)\bigr)$

⇓

$\mathrm{use}(d \mid \{d_1, \ldots, d_m\}, q) = \mathrm{sim}(d, q) \cdot \prod_{i=1}^{m} \bigl(1 - \mathrm{red}(d \mid d_i, q)\bigr)$

$\mathrm{red}(d \mid d_i, q)$ can be decomposed further:

- $\mathrm{sim}(d, d_i)$: the probability that the content of d is similar to, or contained in, that of $d_i$;
- $f_q$: the estimated probability that, given a query q, a document with content similar to, or contained in, a previously emitted document is redundant.

$\mathrm{red}(d \mid d_i, q) = \mathrm{sim}(d, d_i) \cdot f_q$
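The usefulness formula can be written directly as a short function. The sketch below is a minimal illustration, not the authors' implementation: `jaccard_sim` is a stand-in similarity chosen only because it is easy to check by hand, and all names (`Doc`, `jaccard_sim`, `usefulness`) are hypothetical.

```python
from typing import Callable, Sequence

Doc = dict[str, float]  # hypothetical sparse vector: feature -> weight


def jaccard_sim(a: Doc, b: Doc) -> float:
    """Stand-in similarity in [0, 1]: shared features over all features."""
    union = set(a) | set(b)
    return len(set(a) & set(b)) / len(union) if union else 0.0


def usefulness(
    d: Doc,
    preceding: Sequence[Doc],
    q: Doc,
    f_q: float,
    sim: Callable[[Doc, Doc], float] = jaccard_sim,
) -> float:
    """use(d | preceding, q) = sim(d, q) * prod_i (1 - sim(d, d_i) * f_q)."""
    use = sim(d, q)  # relevance term
    for prev in preceding:
        use *= 1.0 - sim(d, prev) * f_q  # novelty with respect to one earlier result
    return use
```

For instance, with `f_q = 1.0` a document identical to an earlier result gets usefulness 0, while with no preceding results the usefulness is just the relevance `sim(d, q)`.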

User Behavior Model (Cont.)

Focus Parameter fq

- $f_q$ is the main tunable parameter in $\mathrm{red}(d \mid d_i, q) = \mathrm{sim}(d, d_i) \cdot f_q$
- It is defined on a per-query basis, and denotes the amount of desired diversification
- Smaller $f_q$ favors relevance over diversity
- Larger $f_q$ favors diversity over relevance

Probabilistic interpretation:

"how likely is a relevant document to be useful to the user, given that they have already examined a document with similar content?"
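A small numeric sketch may help show how the focus parameter shifts the choice between a relevant near-duplicate and a less relevant but novel document. The relevance and similarity values are made up for illustration, and the names `use_after_one`, `A`, `B`, and `pick` are hypothetical.

```python
def use_after_one(rel: float, sim_to_prev: float, f_q: float) -> float:
    """Usefulness of a candidate shown after exactly one earlier document."""
    return rel * (1.0 - sim_to_prev * f_q)


# Two hypothetical candidates for the second result slot (made-up numbers):
#   A: very relevant, but a near-duplicate of the first result
#   B: less relevant, but dissimilar to the first result
A = {"rel": 0.9, "sim_to_prev": 0.95}
B = {"rel": 0.6, "sim_to_prev": 0.10}


def pick(f_q: float) -> str:
    """Return the name of the candidate with the higher usefulness."""
    use_a = use_after_one(A["rel"], A["sim_to_prev"], f_q)
    use_b = use_after_one(B["rel"], B["sim_to_prev"], f_q)
    return "A" if use_a >= use_b else "B"
```

With these numbers, `pick(0.1)` selects A (relevance dominates), while `pick(0.9)` selects B (the near-duplicate is heavily penalized).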

Answer Quality

Quantification properties

Tractable instantiation

An optimal answer for strict order dominance semantics can be found by greedily identifying the best result at positions 1, 2, ..., k

The DivGen Approach

A First Stab at DAS

Steps:

1. Compute the relevance of each document to the query;
2. Identify the highest-scoring document d, and update the usefulness of all other documents based on their similarity to d;
3. Repeat the procedure k times.

Problems:

- It requires access to the entire corpus.
- It is too inefficient even for a moderately large set of documents.
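The three steps above can be sketched as a naive greedy loop over the whole corpus. This is a minimal illustration under stated assumptions, not the paper's code: cosine similarity stands in for sim, the corpus is an in-memory dict, and the names `cosine_sim` and `greedy_diverse_top_k` are hypothetical.

```python
import math

Doc = dict[str, float]  # hypothetical sparse vector: feature -> weight


def cosine_sim(a: Doc, b: Doc) -> float:
    """Assumed similarity function; the slides do not fix a particular sim."""
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def greedy_diverse_top_k(corpus: dict[str, Doc], query: Doc, k: int, f_q: float) -> list[str]:
    # Step 1: the usefulness of every document starts out as its relevance.
    use = {doc_id: cosine_sim(vec, query) for doc_id, vec in corpus.items()}
    result: list[str] = []
    # Step 3: repeat k times (or until the corpus is exhausted).
    for _ in range(min(k, len(corpus))):
        # Step 2: emit the most useful remaining document ...
        best = max(use, key=use.get)
        result.append(best)
        del use[best]
        # ... and discount every other document by its similarity to it.
        for other in use:
            use[other] *= 1.0 - cosine_sim(corpus[other], corpus[best]) * f_q
    return result
```

Each of the k rounds touches every remaining document, so the cost grows with k times the corpus size, which is the inefficiency the slide points out.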

A Threshold Algorithm for DAS

Generate-Filter Idea:

- Incrementally compute documents in descending order of relevance;
- Maintain upper and lower bounds on the relevance of every encountered document;
- Rerank the documents with diversity taken into account.

Data Access Primitives

- Sequential Access (SA): retrieve the id of the document with the next highest weight for a specified feature i
- Random Access (RA): retrieve the exact weight of feature i in document d

Drawbacks

- Fully compute the relevance, and retrieve the entire content;
- Wasted I/O effort, and a lot of this I/O is not sequential in nature;
- Hardly any early pruning is possible.

DivGen: making Generate aware of diversity!

The DivGen Algorithm

Idea: maintain a set of candidate documents with bounds on usefulness

Novel Data Access Primitives

- Bound Access (BA): retrieve the features with the highest weight in d, as well as an upper bound w on the weight of any other feature of d
- Batch Sequential Access (BSA): retrieve the documents with the highest weight of non-query feature i, as well as an upper bound w on the weight of i in any other document
- Document Random Access (DocRA): retrieve all the features with nonzero weight in d, along with their exact weights
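To make the three primitives concrete, here is a simplified in-memory sketch. It is an assumption-laden illustration rather than the DivGen implementation: the function names, the batch-size parameter `n`, and the one-shot (stateless) form of BSA are all hypothetical, and the upper bound is taken to be the smallest weight already returned.

```python
Doc = dict[str, float]  # hypothetical sparse vector: feature -> weight
Corpus = dict[str, Doc]


def bound_access(corpus: Corpus, doc_id: str, n: int) -> tuple[Doc, float]:
    """BA: the n heaviest features of a document (n >= 1), plus an upper
    bound w on the weight of any feature that was not returned."""
    ranked = sorted(corpus[doc_id].items(), key=lambda fw: -fw[1])
    top, rest = ranked[:n], ranked[n:]
    # Anything not returned weighs at most as much as the last returned feature.
    bound = top[-1][1] if rest else 0.0
    return dict(top), bound


def batch_sequential_access(corpus: Corpus, feature: str, n: int) -> tuple[list[tuple[str, float]], float]:
    """BSA: the n documents with the highest weight for `feature` (n >= 1),
    plus an upper bound w on that feature's weight in any other document."""
    ranked = sorted(
        ((doc_id, vec[feature]) for doc_id, vec in corpus.items() if feature in vec),
        key=lambda dw: -dw[1],
    )
    top, rest = ranked[:n], ranked[n:]
    bound = top[-1][1] if rest else 0.0
    return top, bound


def doc_random_access(corpus: Corpus, doc_id: str) -> Doc:
    """DocRA: every feature with nonzero weight in the document, with exact weights."""
    return {f: w for f, w in corpus[doc_id].items() if w > 0.0}
```

The bounds are what make early pruning possible: a candidate whose unseen features are all bounded by a small w cannot turn out to be much more similar to an emitted document than what has already been observed.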

Advantages of BA, BSA, DocRA

- Existing index techniques can be easily leveraged to enable these primitives.
- These primitives enable a set of early prunings that make the algorithm more efficient.

Algorithm Pseudo-code

Revisit Data Access Primitives

An Execution Example

Evaluation

Experimental Setup

- Java 6, Oracle BerkeleyDB Java Edition v3.3.74
- Ubuntu Linux 8.04, Intel Core2 X6800 2.93GHz CPU, 1GB memory
- ext3fs filesystem with a page size of 4KB

Datasets

- Real data: taken from Grapevine, a tool for distilling knowledge from social media
- Synthetic data: Zipfian distribution across documents, and normal distribution in each document.

How to synthesize?
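The slides do not spell out the generation procedure, so the following is only one possible reading, offered as a sketch: feature popularity across documents follows a Zipfian law, and the weights inside each document are drawn from a normal distribution. Every name and parameter here (`synth_corpus`, `zipf_s`, `mu`, `sigma`, the clamping to a small positive value) is an assumption, not taken from the paper.

```python
import random

Doc = dict[str, float]  # hypothetical sparse vector: feature -> weight


def synth_corpus(
    num_docs: int,
    num_features: int,
    feats_per_doc: int,
    zipf_s: float = 1.0,
    mu: float = 0.5,
    sigma: float = 0.15,
    seed: int = 0,
) -> dict[str, Doc]:
    """Generate a toy corpus under the assumed interpretation described above."""
    rng = random.Random(seed)
    # Zipfian popularity: the feature of rank r is drawn with
    # probability proportional to 1 / r**zipf_s.
    popularity = [1.0 / (rank ** zipf_s) for rank in range(1, num_features + 1)]
    corpus: dict[str, Doc] = {}
    for doc in range(num_docs):
        # Sampling with replacement, then deduplicating, so a document
        # may end up with fewer than feats_per_doc distinct features.
        chosen = set(rng.choices(range(num_features), weights=popularity, k=feats_per_doc))
        # Normally distributed weights, clamped so they stay positive.
        corpus[f"d{doc}"] = {f"f{feat}": max(1e-6, rng.gauss(mu, sigma)) for feat in chosen}
    return corpus
```

Fixing `seed` makes the generated corpus reproducible, which is useful when comparing algorithms on the same synthetic input.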

Evaluation (Cont. I)

Evaluation (Cont. II)

Conclusion

This paper

- formally studied the diversity-aware search problem;
- proposed a set of novel data access primitives to solve DAS efficiently;
- performed experimental studies demonstrating the usefulness of DivGen.

Thank you!
