CUBELSI: AN EFFECTIVE AND EFFICIENT METHOD FOR SEARCHING RESOURCES IN SOCIAL TAGGING SYSTEMS
Bin Bi, Sau Dan Lee, Ben Kao, Reynold Cheng
The University of Hong Kong
{bbi, sdlee, kao, ckcheng}@cs.hku.hk
TAG INCONSISTENCY
car? automobile?
Example tag sets from different users: {car, automobile}, {car, Benz}, {car}, {automobile}, {Audi}
A MULTITUDE OF ASPECTS
Example tag sets, each covering a different aspect:
moon, worm moon, Perigee moon, lunar
cherry blossoms, Sakura, cherry blossom
Nikon, astrophotography, D40
SOLUTION
LSI (Latent Semantic Indexing): SVD (Singular Value Decomposition)
CubeLSI: Tucker Decomposition, Taggers
Analyzing semantic relations among tags by taking into account the role of taggers
PROPOSED RANKING FRAMEWORK
CubeLSI Algorithm:
Input: tag assignments
Output: pairwise tag semantic distances
CONCEPT DISTILLATION
Tags with pairwise distances: mp3, music, photo, photos, video, movie
Concepts/Clusters: {photo, photos}, {music, mp3}, {video, movie}
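The grouping of tags into concepts can be illustrated with a small sketch. The slides do not name the clustering algorithm, so this example swaps in a simple threshold-based single-link grouping; the distance values, the `threshold` parameter, and the function names are all hypothetical.

```python
from itertools import combinations

# Hypothetical pairwise tag distances (not from the slides).
# Any pair that is not listed is treated as far apart (distance 1.0).
distances = {
    ("mp3", "music"): 0.10,
    ("photo", "photos"): 0.05,
    ("video", "movie"): 0.20,
}
tags = ["mp3", "music", "photo", "photos", "video", "movie"]


def distance(a, b):
    """Look up a symmetric distance, defaulting to 1.0 for unlisted pairs."""
    return distances.get((a, b), distances.get((b, a), 1.0))


def distill_concepts(tags, threshold):
    """Merge tags whose distance is at most `threshold` (single-link grouping)."""
    parent = {t: t for t in tags}

    def find(t):
        # Union-find lookup with path halving.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for a, b in combinations(tags, 2):
        if distance(a, b) <= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for t in tags:
        clusters.setdefault(find(t), set()).add(t)
    return list(clusters.values())


concepts = distill_concepts(tags, threshold=0.3)
```

With these made-up distances, each near-synonym pair ends up in its own concept, mirroring the three clusters on the slide.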
RANKING SEARCH RESULTS
Figure: Query, Resource 1, and Resource 2 shown on axes x, y, z.
Search results are sorted in descending order of their cosine similarity scores.
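A minimal sketch of this ranking step, assuming the query and each resource are represented as vectors over three concepts (the x, y, z axes). The vector values and the function name `cosine` are illustrative assumptions, not data from the slides.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical concept weights along the x, y, z axes.
query = [1.0, 0.0, 1.0]
resources = {
    "Resource 1": [2.0, 0.0, 1.0],
    "Resource 2": [0.0, 3.0, 1.0],
}

# Sort resource names by descending cosine similarity to the query.
ranked = sorted(resources, key=lambda name: cosine(query, resources[name]), reverse=True)
```

Here Resource 1 points in nearly the same direction as the query, so it is ranked ahead of Resource 2.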
PAIRWISE TAG DISTANCE
Two sources of noise:
1. The absence of a tag assignment may not result from the user considering the tag to be irrelevant to the resource.
2. Tagging is a casual and ad-hoc activity.
TUCKER DECOMPOSITION
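Figure: Tucker decomposition of the original tensor (three modes: Tag, Resource, User) into a core tensor and factor matrices, from which the purified tensor is obtained.
Purified Tag Distance: the Frobenius norm of the difference between two tags' slices of the purified tensor.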
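The decomposition can be sketched in NumPy. The slides do not say which algorithm computes the Tucker decomposition, so this example uses a truncated higher-order SVD (HOSVD) as a simple stand-in; the tensor contents, the chosen ranks, and all function names are assumptions made for illustration.

```python
import numpy as np


def unfold(tensor, mode):
    """Flatten a tensor into a matrix whose rows index the given mode."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)


def mode_product(tensor, matrix, mode):
    """Multiply `matrix` (new_dim x old_dim) into one mode of `tensor`."""
    moved = np.moveaxis(tensor, mode, -1)   # bring the mode to the last axis
    result = moved @ matrix.T               # contract it against the matrix
    return np.moveaxis(result, -1, mode)    # restore the original axis order


def truncated_hosvd(tensor, ranks):
    """Return (core tensor, factor matrices, purified tensor)."""
    factors = []
    for mode, rank in enumerate(ranks):
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u[:, :rank])         # leading left singular vectors
    core = tensor
    for mode, factor in enumerate(factors):
        core = mode_product(core, factor.T, mode)
    purified = core
    for mode, factor in enumerate(factors):
        purified = mode_product(purified, factor, mode)
    return core, factors, purified


# Toy binary tensor: 5 tags x 4 resources x 3 users (random, for illustration only).
rng = np.random.default_rng(0)
tensor = rng.integers(0, 2, size=(5, 4, 3)).astype(float)
core, factors, purified = truncated_hosvd(tensor, ranks=(3, 2, 2))
```

The purified tensor has the same shape as the original one but is built only from the small core tensor and the factor matrices, which is what smooths out noisy entries.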
SPACE & TIME COSTS
Last.fm dataset (3897 users, 3326 tags, 2849 resources)
Purified tensor: 36.9 billion entries
Each tag slice: 11.1 million entries
Computing the Frobenius norm for each tag pair requires 11.1 million subtractions, squarings, and additions.
There are a total of 5.5 million tag pairs for 3326 tags.
The amount of computation needed would be prohibitively large.
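The figures on this slide can be reproduced with a few lines of arithmetic. Reading the 36.9 billion entries as the full tensor and the 11.1 million entries as one tag's user-by-resource slice is an inference from the dataset sizes; the variable names are illustrative.

```python
# Dataset sizes quoted on the slide.
users, tags, resources = 3897, 3326, 2849

tensor_entries = users * tags * resources   # entries in the full third-order tensor
slice_entries = users * resources           # entries in one tag's slice
tag_pairs = tags * (tags - 1) // 2          # unordered pairs of distinct tags

# Each pair needs one subtraction, squaring, and addition per slice entry.
total_entry_visits = slice_entries * tag_pairs
```

The product of slice size and pair count is on the order of 10^13 entry visits, which is why evaluating the distance naively is not practical.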
SHORT-CUT TO EVALUATING THE PURIFIED TAG DISTANCE
Evaluating the distance directly on the purified tensor is impractical.
• The new formula depends only on the core tensor and a factor matrix.
• There is no need to compute any entries of the purified tensor.
• The relatively low dimensions of the core tensor and the factor matrix imply far fewer computations.
The formula uses a matrix that can be readily computed from the core tensor.
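The idea behind the short-cut can be checked numerically. This sketch assumes the purified tensor is the core tensor multiplied by factor matrices with orthonormal columns, in which case standard Tucker algebra gives a distance that needs only the tag factor matrix and a small matrix built from the core. The exact formula on the original slide was lost in extraction, so the names `sigma`, `Y_tag`, and the mode order (tags, resources, users) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tags, n_res, n_users = 6, 5, 4     # toy sizes, for illustration only
c_tags, c_res, c_users = 3, 2, 2     # core tensor dimensions


def orthonormal(rows, cols):
    """Random rows x cols matrix with orthonormal columns (via QR)."""
    q, _ = np.linalg.qr(rng.standard_normal((rows, cols)))
    return q


core = rng.standard_normal((c_tags, c_res, c_users))
Y_tag = orthonormal(n_tags, c_tags)
Y_res = orthonormal(n_res, c_res)
Y_usr = orthonormal(n_users, c_users)

# Purified tensor: materialized here only to verify the short-cut.
purified = np.einsum("abc,ia,jb,kc->ijk", core, Y_tag, Y_res, Y_usr)


def slow_distance(i, j):
    """Frobenius norm of the difference between two purified tag slices."""
    return float(np.linalg.norm(purified[i] - purified[j]))


# Small (c_tags x c_tags) matrix computed once from the core tensor.
core_unfolded = core.reshape(c_tags, -1)
sigma = core_unfolded @ core_unfolded.T


def fast_distance(i, j):
    """Same distance using only the tag factor matrix and `sigma`."""
    x = Y_tag[i] - Y_tag[j]
    return float(np.sqrt(max(x @ sigma @ x, 0.0)))
```

Both functions return the same value for any tag pair, but `fast_distance` works with vectors of length `c_tags` instead of whole tensor slices.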
OTHER RANKING METHODS
Freq: Resources are ranked in descending order of the number of users who annotate the resource with query tags.
BOW (Bag-of-Words): Apply standard information retrieval (IR) ranking, treating each resource as a document and each tag as a word.
FolkRank [Hotho et al. 2006]: A modified version of PageRank. It follows the assumption that votes cast by important users with important tags would make the annotated resources important.
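The Freq baseline is simple enough to sketch directly. This example assumes tag assignments are (user, tag, resource) triples and that a user counts if they applied any of the query tags; the sample data and the function name `freq_rank` are hypothetical.

```python
from collections import defaultdict

# Hypothetical tag assignments as (user, tag, resource) triples.
assignments = [
    ("alice", "car", "r1"),
    ("bob", "car", "r1"),
    ("bob", "automobile", "r2"),
    ("carol", "car", "r2"),
    ("carol", "Audi", "r1"),
]


def freq_rank(assignments, query_tags):
    """Rank resources by how many distinct users tagged them with a query tag."""
    query = set(query_tags)
    users_per_resource = defaultdict(set)
    for user, tag, resource in assignments:
        if tag in query:
            users_per_resource[resource].add(user)
    # Descending user count; ties broken by resource name for determinism.
    return sorted(users_per_resource, key=lambda r: (-len(users_per_resource[r]), r))


ranking = freq_rank(assignments, ["car"])
```

Note that a query for "car" never credits the "automobile" assignment on r2, which is the tag inconsistency problem that motivates semantic analysis.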
OTHER RANKING METHODS
LSI: This method projects the third-order tensor onto a 2D tag-resource matrix, and then applies traditional LSI on the tag-resource matrix using SVD.
CubeSim: This method is similar to CubeLSI except that it computes the distance between two tags directly from the original tensor.
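The two baselines above can be contrasted in a short NumPy sketch. It assumes the projection onto the tag-resource matrix is a sum over users and that CubeSim uses the Frobenius norm of the difference between two tag slices of the original tensor; the slides do not spell out either detail, and the tensor and function names are illustrative.

```python
import numpy as np

# Toy binary tensor: 6 tags x 5 resources x 4 users (random, for illustration only).
rng = np.random.default_rng(1)
tensor = rng.integers(0, 2, size=(6, 5, 4)).astype(float)


def cubesim_distance(tensor, i, j):
    """Distance between two tags taken directly from the original tensor."""
    return float(np.linalg.norm(tensor[i] - tensor[j]))


def lsi_tag_vectors(tensor, rank):
    """Collapse the user mode, then keep the top `rank` SVD components per tag."""
    tag_resource = tensor.sum(axis=2)        # 2D tag-resource matrix
    u, s, _ = np.linalg.svd(tag_resource, full_matrices=False)
    return u[:, :rank] * s[:rank]            # one latent vector per tag


def lsi_distance(tag_vectors, i, j):
    """Euclidean distance between two tags in the latent space."""
    return float(np.linalg.norm(tag_vectors[i] - tag_vectors[j]))


tag_vectors = lsi_tag_vectors(tensor, rank=2)
d_lsi = lsi_distance(tag_vectors, 0, 1)
d_cubesim = cubesim_distance(tensor, 0, 1)
```

LSI discards who applied each tag before doing any analysis, while CubeSim keeps the user mode but applies no decomposition to reduce noise.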
RANKING QUALITY
Evaluation metric: Normalized Discounted Cumulative Gain (NDCG).
NDCG rewards relevant resources more heavily when they are ranked near the top than when they appear lower in the list.
In NDCG at cutoff k, the metric is evaluated only on the resources ranked in the top k of the list; each ranked position contributes according to the relevance level of the resource at that position, and a normalization factor is chosen so that the optimal ranking's NDCG score is 1.
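A minimal sketch of NDCG at cutoff k, using one common formulation (gain 2^rel - 1, log2 position discount). The exact gain and discount in the original formula were lost in extraction, so this form is an assumption, as are the function names and the choice to normalize against the ideal ordering of the same relevance list.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k positions (1-based ranks)."""
    return sum(
        (2 ** rel - 1) / math.log2(1 + pos)
        for pos, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the best possible ordering of the same items."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance levels of the resources in ranked order (hypothetical judgments).
good_ranking = [3, 2, 1, 0]
poor_ranking = [0, 1, 2, 3]
score_good = ndcg_at_k(good_ranking, k=4)
score_poor = ndcg_at_k(poor_ranking, k=4)
```

The optimally ordered list scores 1, while the reversed list scores lower because its relevant resources sit at heavily discounted positions.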
16 users, each proposing 8 queries
EFFICIENCY
Offline: pre-processing times (hours)
Online: query processing times (seconds)
Storage size:
RELATED WORK
Matrix Factorization (MF): Our work differs from MF in two ways:
• We aim at capturing semantic relations among tags.
• We deal with a three-dimensional tensor.
Hotho et al. 2006: Our work differs from FolkRank in that our approach performs offline semantic analysis, which allows online query processing to be done efficiently.
Wu et al. 2006: Our approach is technically different from that work.
Bi et al. 2009: Our approach scales to large social tagging databases, which the previous work is unable to handle.
CONCLUSIONS
We introduce a novel tag-based framework for searching resources in social tagging systems.
We study the role of taggers in search quality for social tagging systems.
We propose CubeLSI, which is a 3D extension of LSI, for semantic analysis over the third-order tensor of resources, taggers, and tags.
We present a comprehensive empirical evaluation of CubeLSI against a number of ranking methods on real datasets.