Efficient Diversity-Aware Search

Efficient Diversity-Aware Search Dacong (Tony) Yan May 4, 2011

Yet another class present.

Transcript of Efficient Diversity-Aware Search

Page 1: Efficient Diversity-Aware Search

Efficient Diversity-Aware Search

Dacong (Tony) Yan

May 4, 2011

Page 2: Efficient Diversity-Aware Search

Background & Motivation

What is search?

1. A user U initiates a query Q.
2. A list of documents D, sorted by relevance R w.r.t. Q, is returned.

User Satisfaction sat(U ,Q)

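It’s all about relevance between D and Q!

- User U has its own perspective on relevance, R_U.
- Roughly speaking, satisfaction is inversely proportional to the gap between R_U and R:

$$\mathrm{sat}(U, Q) \propto \frac{1}{\mathrm{diff}(R_U, R)}$$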

Problem: R_U is difficult to capture, and usually ignored!

Symptoms of ignoring R_U:

- Redundant documents are included in the result set.
- The most relevant documents in terms of R_U are excluded from the result set.

Solution: diversity-aware search!

CSE 788, Dacong (Tony) Yan Efficient Diversity-Aware Search 2/20

Page 7: Efficient Diversity-Aware Search

Agenda

Background & Motivation

Diversity-Aware Search

DivGen Approach

Evaluation

Conclusion

Page 8: Efficient Diversity-Aware Search

Diversity-Aware Search

Intuitively, relevance + dissimilarity

Formally, a content-based diversification perspective:

- Data Model
- User Behavior Model
- Answer Quality

Page 10: Efficient Diversity-Aware Search

Data Model

Vector Space Model: documents as weighted sets of features

Each document d is represented as a vector

$$d = (d_1, d_2, \ldots),$$

where $d_i \ge 0$ is the weight of feature i in document d.

Examples

- Textual documents: features can be keywords weighted in a tf-idf manner.
- Graph “documents”: features can be paths in the corpus graph.
- Recommender-system scenario: features can be the set of users who recommend a document.
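As an illustration of the vector space model, the sketch below stores each document as a sparse mapping from feature to non-negative weight. The slides do not fix a similarity function at this point; cosine similarity is an assumption here, and the function name `cosine` and the sample documents are hypothetical.

```python
import math


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity of two sparse, non-negative feature vectors."""
    # Only features present in both vectors contribute to the dot product.
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical documents: feature -> weight (for example, tf-idf scores).
d1 = {"search": 0.8, "diversity": 0.6}
d2 = {"search": 0.6, "ranking": 0.8}
d3 = {"graph": 1.0}
```

Because weights are non-negative, the result always lies in [0, 1], which is convenient when similarity is later read as a probability.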

Page 12: Efficient Diversity-Aware Search

User Behavior Model

Assumption: the user examines the results in their order of presentation.

Usefulness of a document d: the probability that d is useful

- Relevance: the probability that d is relevant
- Novelty: the probability that d’s content is not redundant

Consider a document d preceded by d_1, d_2, ..., d_m w.r.t. a query q. Its usefulness is defined as:

$$\mathrm{use}(d \mid \{d_1, \ldots, d_m\}, q) = \mathrm{rel}(d \mid q) \cdot \bigl(1 - \mathrm{red}(d \mid \{d_1, \ldots, d_m\}, q)\bigr)$$

Substituting sim(d, q) for rel(d | q) and factoring the novelty term over the preceding documents gives:

$$\mathrm{use}(d \mid \{d_1, \ldots, d_m\}, q) = \mathrm{sim}(d, q) \cdot \prod_{i=1}^{m} \bigl(1 - \mathrm{red}(d \mid d_i, q)\bigr)$$

red(d | d_i, q) can be decomposed further:

- sim(d, d_i): the probability that the content of d is similar to, or contained in, that of d_i;
- f_q: the estimated probability that, given a query q, a document with similar content to, or content contained in, a previously emitted document is redundant.

$$\mathrm{red}(d \mid d_i, q) = \mathrm{sim}(d, d_i) \cdot f_q$$
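The usefulness formula can be sketched in a few lines of Python. This is only an illustration: the function name `usefulness` and its parameters are hypothetical, and the similarity values are assumed to be precomputed probabilities in [0, 1].

```python
import math


def usefulness(sim_to_query: float, sims_to_previous: list[float], f_q: float) -> float:
    """use(d | {d_1..d_m}, q) = sim(d, q) * prod_i (1 - sim(d, d_i) * f_q)."""
    # Each previously shown document d_i contributes a factor 1 - red(d | d_i, q),
    # where red(d | d_i, q) = sim(d, d_i) * f_q.
    novelty = math.prod(1.0 - s * f_q for s in sims_to_previous)
    return sim_to_query * novelty
```

For the first result the list of preceding documents is empty, the product is 1, and usefulness reduces to plain relevance. With `sim_to_query = 0.8`, one preceding document at similarity 0.5, and `f_q = 0.5`, the formula evaluates 0.8 · (1 − 0.25), that is, about 0.6.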

Page 18: Efficient Diversity-Aware Search

User Behavior Model (Cont.)

Focus Parameter f_q

- f_q is the main tunable parameter in red(d | d_i, q) = sim(d, d_i) · f_q.
- It is defined on a per-query basis, and denotes the amount of desired diversification:
  - A smaller f_q favors relevance over diversity.
  - A larger f_q favors diversity over relevance.

Probabilistic interpretation:

“How likely is a relevant document to be useful to the user, given that they have already examined a document with similar content?”
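A small worked example may help show how f_q shifts the ranking. The numbers are invented for illustration and do not come from the slides: document d has sim(d, q) = 0.8 and sim(d, d_1) = 0.5 with an already emitted document d_1, while a competing document d' has sim(d', q) = 0.6 and sim(d', d_1) = 0. Using use(d) = sim(d, q) · (1 − sim(d, d_1) · f_q):

```latex
\begin{aligned}
f_q = 0.2:&\quad \mathrm{use}(d) = 0.8 \cdot (1 - 0.5 \cdot 0.2) = 0.72, &\quad \mathrm{use}(d') = 0.6 \cdot (1 - 0 \cdot 0.2) = 0.60 \\
f_q = 0.9:&\quad \mathrm{use}(d) = 0.8 \cdot (1 - 0.5 \cdot 0.9) = 0.44, &\quad \mathrm{use}(d') = 0.6 \cdot (1 - 0 \cdot 0.9) = 0.60
\end{aligned}
```

With the smaller f_q, the more relevant but partly redundant d is ranked next; with the larger f_q, the less relevant but novel d' overtakes it.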

Page 20: Efficient Diversity-Aware Search

Answer Quality

Quantification properties

Tractable instantiation

An optimal answer for strict order dominance semantics can be found by greedily identifying the best result at position 1, 2, ..., k

Page 23: Efficient Diversity-Aware Search

The DivGen Approach

Page 24: Efficient Diversity-Aware Search

A First Stab at DAS

Steps:

1. Compute the relevance of each document to the query;
2. Identify the highest-scoring document d, and update the usefulness of all other documents based on their similarity to d;
3. Repeat step 2 until k documents have been selected.

Problems:

- It requires access to the entire corpus.
- It is too inefficient even for a moderately large set of documents.
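The naive procedure can be sketched as follows. This is a minimal in-memory illustration, not the presenter's implementation: cosine similarity stands in for sim, and the names `cosine`, `naive_diverse_top_k`, and the sample corpus are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse feature vectors (assumed sim function)."""
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def naive_diverse_top_k(query, corpus, k, f_q):
    """Greedy selection: emit the most useful remaining document k times."""
    # Step 1: the initial usefulness of every document is its relevance.
    useful = {doc_id: cosine(vec, query) for doc_id, vec in corpus.items()}
    result = []
    for _ in range(min(k, len(corpus))):
        # Step 2: emit the document with the highest current usefulness ...
        best = max(useful, key=useful.get)
        result.append(best)
        del useful[best]
        # ... and discount every remaining document by its similarity to it.
        for doc_id in useful:
            useful[doc_id] *= 1.0 - cosine(corpus[doc_id], corpus[best]) * f_q
    return result


# Hypothetical corpus: "a" and "b" are near-duplicates, "c" covers another sense.
query = {"jaguar": 1.0}
corpus = {
    "a": {"jaguar": 1.0, "car": 1.0},
    "b": {"jaguar": 1.0, "car": 0.9},
    "c": {"jaguar": 1.0, "cat": 1.2},
}
```

Every round rescans all remaining documents and needs their full content, which is exactly the cost the two problems above describe.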

Page 26: Efficient Diversity-Aware Search

A Threshold Algorithm for DAS

Generate-Filter Idea:

- Incrementally compute documents in descending order of relevance;
- Maintain upper and lower bounds on the relevance of every encountered document;
- Rerank the documents with diversity taken into account.

Data Access Primitives

- Sequential Access (SA): retrieve the id of the document with the next highest weight for a specified feature i
- Random Access (RA): retrieve the exact weight of feature i in document d
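To make the two primitives concrete, here is a small in-memory stand-in for an index that supports them. It is a sketch under assumptions: the class name `FeatureIndex`, its method names, and the per-feature cursor are hypothetical, and a real system would read sorted posting lists from disk rather than build them in memory.

```python
class FeatureIndex:
    """In-memory stand-in for an inverted index offering SA and RA."""

    def __init__(self, corpus):
        self._corpus = corpus
        # One posting list per feature, sorted by descending weight.
        self._postings = {}
        for doc_id, vec in corpus.items():
            for feature, weight in vec.items():
                self._postings.setdefault(feature, []).append((weight, doc_id))
        for plist in self._postings.values():
            plist.sort(key=lambda p: (-p[0], p[1]))
        # Position of the next unread entry in each posting list.
        self._cursor = {feature: 0 for feature in self._postings}

    def sequential_access(self, feature):
        """SA: id of the document with the next highest weight, or None."""
        plist = self._postings.get(feature, [])
        pos = self._cursor.get(feature, 0)
        if pos >= len(plist):
            return None  # posting list exhausted (or feature unknown)
        self._cursor[feature] = pos + 1
        return plist[pos][1]

    def random_access(self, feature, doc_id):
        """RA: exact weight of a feature in a document (0.0 if absent)."""
        return self._corpus[doc_id].get(feature, 0.0)
```

Note that SA returns only a document id, so learning the weight of any other feature in that document costs a separate RA call.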

Drawbacks

- The relevance must be fully computed, and the entire content retrieved;
- I/O effort is wasted, and much of this I/O is not sequential in nature;
- Hardly any early pruning is possible.

DivGen: making Generate aware of diversity!

Page 30: Efficient Diversity-Aware Search

The DivGen Algorithm

Idea: maintain a set of candidate documents with bounds on usefulness

Novel Data Access Primitives

- Bound Access (BA): retrieve the features with the highest weight in d, as well as an upper bound w on the weight of any other feature of d
- Batch Sequential Access (BSA): retrieve the documents with the highest weight of non-query feature i, as well as an upper bound w on the weight of i in any other document
- Document Random Access (DocRA): retrieve all the features with nonzero weight in d, along with their exact weights

Advantages of BA, BSA, DocRA

- Existing index techniques can be easily leveraged to enable these primitives.
- These primitives enable a set of early prunings that make the algorithm more efficient.
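The three primitives can be mimicked over an in-memory corpus as below. This is only a sketch of their contracts as worded on the slide: the function names, the batch-size parameter `n`, and the choice of the next unread weight as the upper bound w are assumptions, and sorting on every call is for clarity rather than efficiency.

```python
def bound_access(corpus, doc_id, n):
    """BA: the n highest-weight features of a document, plus an upper bound w
    on the weight of any feature of that document that was not returned."""
    ranked = sorted(corpus[doc_id].items(), key=lambda fw: (-fw[1], fw[0]))
    # The first weight left out bounds everything after it in sorted order.
    bound = ranked[n][1] if len(ranked) > n else 0.0
    return ranked[:n], bound


def batch_sequential_access(corpus, feature, n):
    """BSA: the n documents with the highest weight for a feature, plus an
    upper bound w on that feature's weight in any document not returned."""
    ranked = sorted(
        ((doc_id, vec[feature]) for doc_id, vec in corpus.items() if feature in vec),
        key=lambda dw: (-dw[1], dw[0]),
    )
    bound = ranked[n][1] if len(ranked) > n else 0.0
    return ranked[:n], bound


def doc_random_access(corpus, doc_id):
    """DocRA: every nonzero-weight feature of a document with its exact weight."""
    return {f: w for f, w in corpus[doc_id].items() if w > 0.0}


# Hypothetical corpus: document id -> {feature: weight}.
corpus = {
    "a": {"x": 0.9, "y": 0.4, "z": 0.1},
    "b": {"x": 0.5, "z": 0.0},
    "c": {"y": 0.7},
}
```

The returned bound is what allows pruning: a candidate's unseen weights cannot exceed w, so its usefulness can be bounded without a full DocRA.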

Page 33: Efficient Diversity-Aware Search

Algorithm Pseudo-code

Page 34: Efficient Diversity-Aware Search

Revisit Data Access Primitives

Page 35: Efficient Diversity-Aware Search

An Execution Example

Page 37: Efficient Diversity-Aware Search

Evaluation

Experimental Setup

- Java 6, Oracle BerkeleyDB Java Edition v3.3.74
- Ubuntu Linux 8.04, Intel Core2 X6800 2.93GHz CPU, 1GB Memory
- ext3fs filesystem with a page size of 4KB

Datasets

- Real data: taken from Grapevine, a tool for distilling knowledge from social media
- Synthetic data: Zipfian distribution across documents, and normal distribution in each document.

How to synthesize?

Page 40: Efficient Diversity-Aware Search

Evaluation (Cont. I)

Page 41: Efficient Diversity-Aware Search

Evaluation (Cont. II)

Page 42: Efficient Diversity-Aware Search

Conclusion

This paper

formally studied the diversity-aware search problem;

proposed a set of novel data access primitives to efficiently solve DAS;

performed experimental studies demonstrating the usefulness of DivGen.

Page 43: Efficient Diversity-Aware Search

Thank you!