Post on 12-Jan-2016
Improved search for Socially Annotated Data
Authors: Nikos Sarkas, Gautam Das, Nick Koudas
Presented by: Amanda Cohen Mostafavi
Introduction
•Social Annotation: A process where users collaboratively assign a short sequence of keywords (tags) to a number of resources
▫Each tag sequence is a concise and accurate summary of the resource’s content
▫Meant to aid navigation through a collection
•Leads to searching via tags
▫Enables relevant text retrieval
▫Allows accurate retrieval of non-textual objects
▫Presents a need for an efficient retrieval and ranking method based on user tags
RadING
•Ranking annotated data using Interpolated N-Grams
•Searching and ranking method based exclusively on user tags
•Uses interpolated n-grams to model tag sequences associated with every resource
•How does it rank?
Probabilistic Foundations
•Goal: To rank resources by the probability that they will be relevant to the query
•Given a keyword query Q and a resource R from the collection, we apply Bayes’ theorem to get:
$$p(R \text{ is relevant} \mid Q) = \frac{p(Q \mid R \text{ is relevant})\, p(R \text{ is relevant})}{p(Q)}$$
where p(R is relevant) is the probability that R is relevant, independent of the query posed, and p(Q) is the probability of the query being issued
Probabilistic Foundations
•p(R is relevant) is assumed constant throughout the resource collection, and p(Q) is the same for every resource
▫Meaning: ranking resources by p(R is relevant|Q) is equivalent to ranking by p(Q|R is relevant)
•In order to estimate the probability of the query being “generated” by each resource, resources need to be modeled based on knowledge of social annotation
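The equivalence between the two rankings can be checked by comparing any two resources. Here R1 and R2 are hypothetical names for two resources in the collection, and the prior is taken to be the same for both, as the slide assumes:

$$\frac{p(R_1 \text{ is relevant} \mid Q)}{p(R_2 \text{ is relevant} \mid Q)} = \frac{p(Q \mid R_1 \text{ is relevant})\, p(R_1 \text{ is relevant}) / p(Q)}{p(Q \mid R_2 \text{ is relevant})\, p(R_2 \text{ is relevant}) / p(Q)} = \frac{p(Q \mid R_1 \text{ is relevant})}{p(Q \mid R_2 \text{ is relevant})}$$

Both p(Q) and the shared prior cancel, so R1 outranks R2 under one score exactly when it does under the other.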
Dynamics and Properties of the Social Annotation Process
•The goal of the tagging process is to describe the resource’s content
•User opinions crystallize quickly, so annotation trends can be found after witnessing a small number of assignments
•Therefore we assume the following:
▫p(Q | R is relevant) = p(Q is used to tag R)
▫In English: Users will use keyword sequences derived from the same distribution to both tag and search for a resource
Social Annotation Process: Things to consider…
•Resources are rarely given assignments with one tag
•Also, tag positions are not random: they progress from left to right, from more general to more specific
•Tags representing different perspectives on a resource are less likely to occur together in the same assignment
•n-gram models are used to capture these co-occurrence patterns
N-gram Models
•Given an assignment made up of a sequence s of l tags t1,…,tl, the probability of this sequence being assigned to a resource is:
▫p(t1,…,tl) = p(t1)p(t2|t1)…p(tl|t1,…,tl-1)
•The purpose of using n-gram models is to approximate the probability of each tag given all of its predecessors using only the last n-1 tags
▫In the case of a bi-gram model, p(tk|t1,…,tk-1) is approximated by p(tk|tk-1)
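As a worked instance of the bi-gram approximation, take a hypothetical assignment of three tags t1, t2, t3. The exact chain rule and its bi-gram approximation are:

$$p(t_1, t_2, t_3) = p(t_1)\, p(t_2 \mid t_1)\, p(t_3 \mid t_1, t_2) \approx p(t_1)\, p(t_2 \mid t_1)\, p(t_3 \mid t_2)$$

Only the last factor changes: the dependence of t3 on t1 is dropped, so every factor can be estimated from counts of single tags and adjacent tag pairs.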
N-gram Models
•Calculate the probability using the Maximum Likelihood equation
•c(t1, t2) = the number of occurrences of the bi-gram
•The summation is the sum of the occurrences of all bigrams involving t1 as the first tag
$$p(t_2 \mid t_1) = \frac{c(t_1, t_2)}{\sum_{t} c(t_1, t)}$$
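The maximum likelihood estimate described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `bigram_mle` and the sample assignments are hypothetical, and no start-of-sequence marker is used, so only adjacent tag pairs inside an assignment are counted.

```python
from collections import Counter


def bigram_mle(assignments):
    """Return a function p(t2, t1) giving the ML estimate of p(t2 | t1)."""
    bigram_counts = Counter()      # c(t1, t2)
    first_tag_totals = Counter()   # sum over t of c(t1, t)
    for tags in assignments:
        # Adjacent pairs of each tag sequence are the observed bi-grams.
        for t1, t2 in zip(tags, tags[1:]):
            bigram_counts[(t1, t2)] += 1
            first_tag_totals[t1] += 1

    def p(t2, t1):
        total = first_tag_totals[t1]
        # A tag never seen as the first element of a bi-gram gets 0.0.
        return bigram_counts[(t1, t2)] / total if total else 0.0

    return p


# Hypothetical assignments attached to one resource (illustrative data only).
assignments = [
    ["python", "tutorial"],
    ["python", "tutorial", "beginner"],
    ["python", "reference"],
]
p = bigram_mle(assignments)
print(p("tutorial", "python"))  # c(python, tutorial) = 2 out of 3 bi-grams starting with python
```

Unseen bi-grams receive probability zero under this estimate, which is the sparsity problem that motivates interpolation.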
Interpolation
•Interpolation is used to compensate for sparse data, distributes probability mass from high counts to low counts
•Used the Jelinek-Mercer interpolation technique. Applied to a bi-gram, yields:
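$$p(t_2 \mid t_1) = \lambda_2\, \hat{p}(t_2 \mid t_1) + \lambda_1\, \hat{p}(t_2) + \lambda_0\, p_{bg}(t_2)$$
$$0 \le \lambda_0, \lambda_1, \lambda_2 \le 1, \qquad \lambda_0 + \lambda_1 + \lambda_2 = 1$$
where the hat marks the maximum likelihood bi-gram and unigram estimates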
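A minimal Python sketch of the interpolated estimate follows. The function name `interpolate`, its argument names and the sample numbers are hypothetical. The sketch assumes the three weights sum to 1, so λ0 is derived from λ1 and λ2.

```python
def interpolate(p_bigram, p_unigram, p_background, lam1, lam2):
    """Jelinek-Mercer mix of bi-gram, unigram and background estimates."""
    lam0 = 1.0 - lam1 - lam2  # the three weights must sum to 1
    # Small tolerance so that floating-point round-off does not reject valid weights.
    if min(lam0, lam1, lam2) < -1e-12:
        raise ValueError("weights must lie in [0, 1] and sum to 1")
    return lam2 * p_bigram + lam1 * p_unigram + lam0 * p_background


# Hypothetical numbers: the bi-gram was never observed for this resource,
# yet the smoothed probability stays above zero.
smoothed = interpolate(p_bigram=0.0, p_unigram=0.2, p_background=0.01,
                       lam1=0.3, lam2=0.6)
print(smoothed)
```

Because the unigram and background terms carry some weight, a query containing an unseen tag pair is penalized rather than eliminated.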
Parameter Optimization
•Goal: to maximize the likelihood function L(λ1,λ2) in order to find the optimal interpolation parameters
•Definitions:
▫D*: The constrained domain of λ1 and λ2
▫λ*: The global maximum of L(λ1,λ2)
▫λc: The point at which L(λ1,λ2) attains its maximum value within D*; this is the point that must be found to optimize the parameters
RadING Optimization Framework
•Step 1: If L(λ1,λ2) is unbounded, perform 1D optimization to locate λc
•Step 2: If L(λ1,λ2) is bounded, apply 2D optimization to find λ*
•Step 3: If λ* is not in D*, locate λc (if λ* is in D*, then λc = λ*)
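The slides do not give the form of L(λ1,λ2) or the 1D and 2D routines, so the following is not the RadING framework. It is a minimal sketch of expectation-maximization, a common alternative way to fit Jelinek-Mercer weights, included only to make the optimization target concrete. It assumes the likelihood is the sum of log interpolated probabilities over a set of held-out tags. The function names and the sample events are hypothetical.

```python
import math


def em_interpolation_weights(events, iterations=200):
    """Fit (lam0, lam1, lam2) by EM.

    events: one (p_background, p_unigram, p_bigram) triple per held-out tag.
    """
    lam = [1 / 3, 1 / 3, 1 / 3]  # start from uniform weights
    for _ in range(iterations):
        resp_sums = [0.0, 0.0, 0.0]
        for comps in events:
            mix = sum(l * c for l, c in zip(lam, comps))
            if mix == 0.0:
                continue  # event carries no information under current weights
            # E-step: share of this event explained by each component.
            for i in range(3):
                resp_sums[i] += lam[i] * comps[i] / mix
        total = sum(resp_sums)
        if total == 0.0:
            break
        # M-step: new weights are the normalized responsibilities.
        lam = [r / total for r in resp_sums]
    return tuple(lam)


def log_likelihood(events, lam):
    """Assumed objective: sum of log interpolated probabilities."""
    return sum(math.log(sum(l * c for l, c in zip(lam, comps)))
               for comps in events)


# Hypothetical held-out events (illustrative numbers only).
events = [(0.01, 0.2, 0.5), (0.02, 0.1, 0.0), (0.01, 0.3, 0.6), (0.05, 0.05, 0.0)]
weights = em_interpolation_weights(events)
print(weights, log_likelihood(events, weights))
```

The weights returned by EM always satisfy the constraints that define D*, since each update is a normalized vector of non-negative responsibilities.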
Searching Process
•Step 1: Train a bi-gram model for each resource
▫Compute the bi-gram and unigram probabilities and optimize the interpolation parameters
•Step 2: At query time, compute the probability of the query keyword sequence being generated by each resource’s bi-gram model
•Use the Threshold Algorithm to compute the top-k results
$$p_R(q_1, \ldots, q_k) = \prod_{j=1}^{k} p(q_j \mid q_{j-1})$$
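The query-time step can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it scans every resource and keeps the best k with a heap instead of using the Threshold Algorithm, it treats q0 as a start marker `<s>`, and it replaces the interpolated probability of a missing bi-gram with a fixed `floor` value. The model dictionaries, resource names and function names are all hypothetical.

```python
import heapq
import math

START = "<s>"  # assumed start-of-sequence marker standing in for q_0


def query_log_prob(query, model, floor=1e-6):
    """log p_R(q_1..q_k) = sum over j of log p(q_j | q_{j-1})."""
    prev = START
    logp = 0.0
    for tag in query:
        # `floor` stands in for the smoothed probability of an unseen bi-gram.
        logp += math.log(model.get((prev, tag), floor))
        prev = tag
    return logp


def top_k(query, models, k):
    """Exhaustively score every resource and return the k best names."""
    scored = ((query_log_prob(query, m), name) for name, m in models.items())
    return [name for _, name in heapq.nlargest(k, scored)]


# Hypothetical per-resource bi-gram models: (previous tag, tag) -> probability.
models = {
    "url_a": {(START, "python"): 0.5, ("python", "tutorial"): 0.4},
    "url_b": {(START, "python"): 0.3, ("python", "tutorial"): 0.1},
    "url_c": {(START, "java"): 0.6},
}
print(top_k(["python", "tutorial"], models, k=2))
```

Working in log space avoids underflow when queries are long. The exhaustive scan is linear in the number of resources, which is the cost a top-k method such as the Threshold Algorithm aims to avoid.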
Searching Example
Experimental Evaluation
•Test data: web crawl of del.icio.us
▫70,658,851 assignments
▫Posted by 567,539 users
▫Attached to 24,245,248 unique URLs
▫Average length of assignment: 2.77
▫Standard deviation: 2.70
▫Median: 2
Optimization Efficiency
Ranking Effectiveness
•Compares the RadING ranking method to adaptations of tf/idf ranking
▫Tf/Idf: concatenates a resource’s assignments into a document and ranks resources by the tf/idf similarity of each document to the query
▫Tf/Idf+: computes the tf/idf similarity of each individual assignment and ranks resources by average similarity
•10 judges contacted through Amazon Mechanical Turk to measure precision