Efficient Diversity-Aware Search
Dacong (Tony) Yan
May 4, 2011
Background & Motivation
What is search?
1. A user U initiates a query Q
2. A list of documents D, sorted by relevance R w.r.t. Q, is returned
User Satisfaction sat(U, Q)
It’s all about relevance between D and Q!
User U has its own perspective on relevance, R_U
Roughly speaking, $\mathrm{sat}(U, Q) \propto \frac{1}{\mathrm{diff}(R_U, R)}$
Problem: R_U is difficult to capture, and usually ignored!
Symptoms of ignoring R_U
Redundant documents included in the result set
Most relevant documents in terms of R_U excluded from the result set
Solution: diversity-aware search!
CSE 788, Dacong (Tony) Yan Efficient Diversity-Aware Search 2/20
Agenda
Background & Motivation
Diversity-Aware Search
DivGen Approach
Evaluation
Conclusion
Diversity-Aware Search
Intuitively, relevance + dissimilarity
Formally, a content-based diversification perspective:
Data Model
User Behavior Model
Answer Quality
Data Model
Vector Space Model: documents as weighted sets of features
Each document d is represented as a vector
$d = (d_1, d_2, \ldots)$,
denoting that feature i has weight $d_i \ge 0$ in document d
Examples
textual documents: features can be keywords weighted in a tf.idf manner
graph “documents”: features can be paths in the corpus graph
in a recsys scenario: features can be the set of users who recommend a document
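To make the vector space model concrete, here is a minimal Python sketch for the textual case. It is an illustration under assumptions, not the representation used in the talk: the function names `tfidf_vectors` and `cosine` are hypothetical, the `log(1 + n/df)` weighting is just one of many tf.idf variants, and cosine similarity is only one possible choice for a similarity in [0, 1].

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenized documents into sparse {feature: weight} vectors.

    Assumption: weight = term frequency * log(1 + n / document frequency),
    which keeps every weight non-negative as the data model requires.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(1 + n / df[t]) for t in tf})
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Because all weights are non-negative, the cosine always falls in [0, 1], which is convenient if the value is later read as a probability of similarity.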
User Behavior Model
Assumption: the user examines the results in their order of presentation.
Usefulness of a document d: the probability that d is useful
Relevance: the probability that d is relevant
Novelty: the probability that d’s content is not redundant
Consider a document d preceded by d_1, d_2, ..., d_m w.r.t. a query q; its usefulness is defined below:
$\mathrm{use}(d \mid \{d_1, \ldots, d_m\}, q) = \mathrm{rel}(d \mid q) \cdot (1 - \mathrm{red}(d \mid \{d_1, \ldots, d_m\}, q))$
$\Downarrow$
$\mathrm{use}(d \mid \{d_1, \ldots, d_m\}, q) = \mathrm{sim}(d, q) \cdot \prod_{i=1}^{m} (1 - \mathrm{red}(d \mid d_i, q))$
red(d | d_i, q) can be decomposed further:
sim(d, d_i): the probability that the content of d is similar to, or contained in, that of d_i;
f_q: the estimated probability that, given a query q, a document with similar content to, or content contained in, a document previously emitted, is redundant.
$\mathrm{red}(d \mid d_i, q) = \mathrm{sim}(d, d_i) \cdot f_q$
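The usefulness formula can be sketched directly in Python. This is a hedged illustration: the function names `redundancy` and `usefulness` are hypothetical, and the caller is assumed to supply sim(d, q) and the list of sim(d, d_i) values as numbers in [0, 1].

```python
def redundancy(sim_d_di, f_q):
    """red(d | d_i, q) = sim(d, d_i) * f_q."""
    return sim_d_di * f_q


def usefulness(sim_d_q, sims_to_previous, f_q):
    """use(d | {d_1..d_m}, q) = sim(d, q) * prod_i (1 - red(d | d_i, q)).

    sims_to_previous holds sim(d, d_i) for every document d_i that
    precedes d in the result list (empty for the first result).
    """
    novelty = 1.0
    for s in sims_to_previous:
        novelty *= 1.0 - redundancy(s, f_q)
    return sim_d_q * novelty
```

With no preceding documents the product is empty, so usefulness reduces to plain relevance; each similar predecessor can only lower it.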
User Behavior Model (Cont.)
Focus Parameter fq
f_q is the main tunable parameter in $\mathrm{red}(d \mid d_i, q) = \mathrm{sim}(d, d_i) \cdot f_q$
It is defined on a per-query basis, and denotes the amount of desired diversification
Smaller f_q favors relevance over diversity
Larger f_q favors diversity over relevance
Probabilistic interpretation:
“how likely is a relevant document to be useful to the user, given that they have already examined a document with similar content?”
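As a worked illustration with made-up numbers (not taken from the talk), suppose sim(d, q) = 0.9 and a single earlier result d_1 with sim(d, d_1) = 0.8. Plugging three values of f_q into use = sim(d, q) · (1 − sim(d, d_1) · f_q) shows how the parameter trades relevance against diversity:

```latex
\begin{align*}
f_q = 0:   &\quad \mathrm{use} = 0.9 \cdot (1 - 0.8 \cdot 0)   = 0.9  \\
f_q = 0.5: &\quad \mathrm{use} = 0.9 \cdot (1 - 0.8 \cdot 0.5) = 0.54 \\
f_q = 1:   &\quad \mathrm{use} = 0.9 \cdot (1 - 0.8 \cdot 1)   = 0.18
\end{align*}
```

At f_q = 0 the near-duplicate keeps its full relevance score; at f_q = 1 it drops to a fifth of it, so a less relevant but dissimilar document can overtake it.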
Answer Quality
Quantification properties
Tractable instantiation
An optimal answer for strict order dominance semantics can be found by greedily identifying the best result at position 1, 2, ..., k
The DivGen Approach
A First Stab at DAS
Steps:
1. Compute the relevance of each document to the query;
2. Identify the highest-scoring document d, and update the usefulness of all other documents, based on their similarity to d;
3. Repeat the procedure k times.
Problems:
It requires access to the entire corpus.
It is too inefficient even for a moderately large set of documents.
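The three steps above can be sketched as a short greedy loop. This is a minimal illustration of the naive approach, not the speaker's code: the name `greedy_diverse_top_k` is hypothetical, relevance scores are assumed to be precomputed for the whole corpus, and `similarity` is assumed to be any function returning a value in [0, 1].

```python
def greedy_diverse_top_k(relevance, similarity, k, f_q):
    """Naive diversity-aware top-k.

    relevance:  {doc_id: sim(d, q)} for every document in the corpus
    similarity: function (doc_a, doc_b) -> sim in [0, 1]
    """
    use = dict(relevance)           # step 1: usefulness starts as relevance
    result = []
    while use and len(result) < k:  # step 3: repeat k times
        best = max(use, key=use.get)
        result.append(best)
        del use[best]
        # step 2: discount every remaining document by its similarity to best
        for d in use:
            use[d] *= 1.0 - similarity(d, best) * f_q
    return result
```

Every iteration touches every remaining document, so the cost grows with k times the corpus size, which is the inefficiency the slide points out.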
A Threshold Algorithm for DAS
Generate-Filter Idea:
Incrementally compute documents in descending order of relevance;
Maintain upper and lower bounds on the relevance of every encountered document;
Rerank the documents with diversity taken into account.
Data Access Primitives
Sequential Access (SA): retrieve the id of the document with the next highest weight for a specified feature i
Random Access (RA): retrieve the exact weight of feature i in document d
Drawbacks
The relevance must be fully computed, and the entire content retrieved;
I/O effort is wasted, and much of this I/O is not sequential in nature;
Hardly any early pruning is possible.
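For readers unfamiliar with threshold-style algorithms, the sketch below shows only the Generate half: a classic threshold algorithm for top-k relevance built on SA and RA. It is written under stated assumptions rather than taken from the talk: relevance is assumed to be the dot product of query and document weights, postings lists are assumed to be in memory and sorted by descending weight, and the name `threshold_top_k` is hypothetical. The diversity-aware reranking (Filter) step is not shown.

```python
import heapq


def threshold_top_k(postings, query, k):
    """Return the k most relevant (doc_id, score) pairs.

    postings: {feature: [(doc_id, weight), ...]} sorted by weight descending
    query:    {feature: non-negative query weight}
    """
    lookup = {f: dict(postings.get(f, [])) for f in query}  # backs RA
    seen = {}
    longest = max((len(postings.get(f, [])) for f in query), default=0)
    for depth in range(longest):
        threshold = 0.0
        for f, qw in query.items():
            plist = postings.get(f, [])
            if depth >= len(plist):
                continue  # exhausted list: unseen docs have weight 0 here
            doc, w = plist[depth]  # SA: next highest weight for feature f
            threshold += qw * w
            if doc not in seen:
                # RA: fetch the exact weight of every query feature in doc
                seen[doc] = sum(q2 * lookup[f2].get(doc, 0.0)
                                for f2, q2 in query.items())
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        # No unseen document can score above the threshold, so stop early
        if len(top) == k and top[-1][1] >= threshold:
            return top
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
```

The early stop relies only on relevance bounds, which is why a separate reranking pass is still needed afterwards and why little pruning based on diversity is possible in this scheme.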
DivGen: making Generate aware of diversity!
The DivGen Algorithm
Idea: maintain a set of candidate documents with bounds on usefulness
Novel Data Access Primitives
Bound Access (BA): retrieve the features with the highest weight in d, as well as an upper bound w on the weight of any other feature of d
Batch Sequential Access (BSA): retrieve the documents with the highest weight of non-query feature i, as well as an upper bound w on the weight of i in any other document
Document Random Access (DocRA): retrieve all the features with nonzero weight in d, along with their exact weights
Advantages of BA, BSA, DocRA
Existing index techniques can be easily leveraged to enable these primitives.
These primitives can enable a set of early prunings to make the algorithm more efficient.
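To show what the three primitives return, here is a small in-memory Python sketch. It is an assumption-laden illustration, not the DivGen implementation: the class `InMemoryIndex`, its method names, and the batch-size parameter `n` are all hypothetical, and a real system would serve these calls from on-disk indexes.

```python
class InMemoryIndex:
    """Toy index over {doc_id: {feature: weight}} exposing BA, BSA, DocRA."""

    def __init__(self, docs):
        self.docs = docs
        self.by_feature = {}
        for doc_id, feats in docs.items():
            for f, w in feats.items():
                self.by_feature.setdefault(f, []).append((doc_id, w))
        for plist in self.by_feature.values():
            plist.sort(key=lambda e: e[1], reverse=True)

    def bound_access(self, doc_id, n):
        """BA: the n heaviest features of doc_id, plus an upper bound w
        on the weight of any other feature of that document."""
        feats = sorted(self.docs[doc_id].items(),
                       key=lambda e: e[1], reverse=True)
        rest = feats[n:]
        return feats[:n], (rest[0][1] if rest else 0.0)

    def batch_sequential_access(self, feature, n):
        """BSA: the n documents with the highest weight for feature, plus
        an upper bound w on that feature's weight in any other document."""
        plist = self.by_feature.get(feature, [])
        rest = plist[n:]
        return plist[:n], (rest[0][1] if rest else 0.0)

    def doc_random_access(self, doc_id):
        """DocRA: every nonzero feature of doc_id with its exact weight."""
        return {f: w for f, w in self.docs[doc_id].items() if w > 0}
```

The returned bound is what makes pruning possible: a caller can cap how much an unread part of a document or postings list could still contribute without fetching it.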
Algorithm Pseudo-code
Revisit Data Access Primitives
An Execution Example
Evaluation
Experimental Setup
Java 6, Oracle BerkeleyDB Java Edition v3.3.74
Ubuntu Linux 8.04, Intel Core2 X6800 2.93GHz CPU, 1GB Memory
ext3fs filesystem with a page size of 4KB
Datasets
Real data: taken from Grapevine, a tool for distilling knowledge from social media
Synthetic data: Zipfian distribution across documents, and normal distribution in each document.
How to synthesize?
Evaluation (Cont. I)
Evaluation (Cont. II)
Conclusion
This paper
formally studied the diversity-aware search problem;
proposed a set of novel data access primitives to efficiently solve DAS;
performed experimental studies demonstrating the usefulness of DivGen.
Thank you!