Yahoo! Social Search Tag-based Social Interest Discovery Xin Li, Lei Guo, Eric Zhao Yahoo!...
Yahoo! Social Search
Tag-based Social Interest Discovery
Xin Li, Lei Guo, Eric Zhao
Yahoo! International Social Search
Internet Social Networks Are Emerging!
• Internet social networks are self-organized by online users
– Del.icio.us, facebook, flickr, MySpace, YouTube
• Users are driven by their interests
– Fetch and bookmark contents
– Create new contents
– Share contents
• Interest discovery is crucial to a social network
– Discover interests of users in different contents
– Locate users with similar interests
– Link people with similar interests to form communities
Important Features of Social Networks
• Organize users and contents
– Cluster users into communities
– Categorize contents into interesting topics
• Provide search functions
– Given a topic, locate all matching contents and all users that are interested in the topic
– Given a user, locate all his fetched/created contents and the topics of his interests
– Given a user, locate all other users that have similar interests
The Problem: Social Interest Discovery
• Questions to answer
– How to discover a user’s interests based on his fetched/created contents?
– How to use individual users’ interests to find interesting topics shared by users?
– How to use the topics to create interest-based user communities?
Existing Solutions and Limitations
• User-centric
– Use the social network graph to discover users with common interests
– Problem: online/offline user connections are hard to identify
• Object-centric
– Detect common interests based on the common objects fetched by users
– Problem: discovered interests are object-based, non-descriptive and implicit
• Predefined categorization
– Not flexible, cannot catch the most recent popular or hot user interests
– Cannot reflect various user interest groups, which may keep changing over time
Our approach
• Leverage user-generated tags
• Compute frequent co-occurrences of tag patterns
• Use the tag patterns as topics of interests
• Cluster users and content around the topics to build communities
Overview
• Motivation and Problem
• Analysis of tags in a social network
• ISID system design
• Evaluation
• Conclusion
Tags in Social Networks
• User-generated labels for annotating the contents
– Descriptive, summary, reflecting human judgment
– Meta data between users and contents
• Widely used in social networks
– Del.icio.us: http://del.icio.us/help/tags
– Youtube: http://www.google.com/support/youtube/bin/answer.py?hl=en&answer=55769
– Facebook: http://www.facebook.com/help.php?hq=tag
del.icio.us Social Network
• A pioneering social bookmarking system
– http://del.icio.us/
• Our Data Set
– Dump for a limited period of time
– 4.3 M public, tagged bookmarks, 0.2 M users, 1.4 M bookmarked URLs
URL Popularity Follows Power Law
The distribution of URL bookmarking frequency. Most URLs are unpopular.
[Figure: log-log plot. x-axis: Number of occurrences (log); y-axis: Number of URLs (log).]
User Activity Follows Heavy-tail
The distribution of user bookmarking frequency.
Most users are less active.
[Figure: log-log plot. x-axis: Number of occurrences (log); y-axis: Number of users (log).]
Tags vs. Keywords
URL: http://ka1fsb.home.att.net/resolve.html
Top tf keywords: domain, name, file, resolver, server, conf, network, nameserver, ip, org, ampr
Top tf-idf keywords: ampr, domain, jnos, nameserver, conf, ka1fsb, resolver, ip, file, name, server
All tags: linux, howto, network, sysadmin, dns
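The two keyword rankings compared with tags on this slide can be illustrated with a small sketch. It assumes raw term counts for tf and the common tf × log(N / df) weighting for tf-idf; the slides do not say which variants were used. The toy corpus and the function names `top_tf` and `top_tfidf` are hypothetical, and the ranked document is assumed to be part of the corpus.

```python
import math
from collections import Counter

def top_tf(doc_tokens, k):
    """Return the k most frequent terms of one document."""
    return [term for term, _ in Counter(doc_tokens).most_common(k)]

def top_tfidf(doc_tokens, corpus, k):
    """Return the k terms of one document with the highest tf * log(N / df).

    Assumes doc_tokens is one of the documents in corpus, so every df is >= 1.
    """
    n_docs = len(corpus)
    df = Counter()
    for tokens in corpus:
        df.update(set(tokens))  # count each term once per document
    tf = Counter(doc_tokens)
    scores = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

# Hypothetical toy corpus, not data from the slides.
corpus = [
    "domain name server domain resolver ampr".split(),
    "domain name server mail".split(),
    "server network linux".split(),
]
print(top_tf(corpus[0], 1))             # most frequent term
print(top_tfidf(corpus[0], corpus, 2))  # rare terms outrank common ones
```

In this toy corpus, tf favors the frequent word "domain", while tf-idf promotes terms that occur in only one document, which mirrors how "ampr" and "ka1fsb" rise in the tf-idf list above.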
Tag Vocabulary
[Figure, two panels: "Tag coverage for tf keywords" and "Tag coverage for tf-idf keywords". x-axis: Fraction of keywords (TF / TFIDF) missed by tags; y-axis: CDF over all URLs; curves: Top 10, Top 20, Top 40.]
User tags missed ≤ 20% of tf keywords for ≥ 98% of docs and ≤ 10% of tf-idf keywords for ≥ 90% of docs.
Tags covered the most important keywords, yet the total number of unique tags is ~10x smaller than the number of unique keywords.
Tag Convergence
The total number of different tags users can use for a given document is limited no matter how popular the URL is.
[Figure: x-axis: # of saves of URLs; y-axis: # of tags; curves: Tag 0, Tag -1, Tag -2.]
Tags Capture Concepts of Contents
$$e(T, U) = \frac{\sum_{k \mid t_k \in U} w(t_k)}{\sum_{i} w(t_i)}$$
• Nearly 50% of all URLs have tag match ratio 1
• 70% of all URLs have a tag match ratio > 0.5
• Only 10% of the URLs have no matched tags
[Figure: x-axis: URL ids normalized and ranked; y-axis: Tag match ratio.]
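As a rough illustration of the tag match ratio on this slide, the sketch below computes the weighted share of a URL's tags that also occur among its keywords. The slide does not define the weight function, so taking w(t) as the number of times a tag was applied to the URL is an assumption, as are the function name `tag_match_ratio` and the sample weights.

```python
def tag_match_ratio(tag_weights, url_keywords):
    """Weight of tags found among the URL's keywords, over the total tag weight.

    tag_weights:  {tag: weight}; the weight is assumed to be how often the tag
                  was applied to this URL (not specified on the slide).
    url_keywords: set of keywords extracted from the page content.
    """
    total = sum(tag_weights.values())
    if total == 0:
        return 0.0  # no tags at all: define the ratio as 0
    matched = sum(w for tag, w in tag_weights.items() if tag in url_keywords)
    return matched / total

# Hypothetical weights; the tag names are borrowed from the earlier example slide.
tags = {"linux": 5, "howto": 3, "network": 4, "sysadmin": 2, "dns": 6}
keywords = {"domain", "name", "network", "dns", "resolver"}
print(tag_match_ratio(tags, keywords))  # (4 + 6) / 20 = 0.5
```

A ratio of 1 means every tag on the URL also appears in its content; a ratio of 0 corresponds to the URLs with no matched tags.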
From Tags to User Interests
• Bookmarks reflect user interests
• Tags summarize/describe bookmarked contents
– Meta data between users and contents
– Connect users and bookmarked contents
• Frequently used tag patterns reflect user interests
– The key is the co-occurrences of tags
Overview
• Motivation and Problem
• Analysis of tags in a social network
• ISID system design
• Evaluation
• Conclusion
System Design
• Find topics of interests
– For a given set of tagged bookmarks, find all topics of interests, i.e., frequent co-occurrences of tags
• Clustering
– For each topic, find all the URLs and the users such that those users have labeled each of the URLs with all the tags in the topic.
• Indexing
– Import the topics and their user and URL clusters into an indexing system for application queries.
ISID Architecture
[Diagram: Data Source --(Posts)--> Topic Discovery --(Topics, posts)--> Clustering --(Topics, Clusters)--> Indexing]
Topic Discovery
• Use association rule algorithms to discover co-occurring tag patterns
– Originally invented for identifying frequently bought items in supermarkets
  • E.g., bread and milk
– Use a support number to define the frequency threshold
– Efficient in finding frequent patterns in a large set of transactions for a given support number (threshold)
– The rule-building part is not used
• One more step: remove pattern A if A is a sub-pattern of some other pattern B, and both A and B have the same support number
– To remove duplicate clusters
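The two steps on this slide, finding tag sets that meet a support threshold and then dropping sub-patterns that have the same support as a larger pattern, can be sketched as follows. This is a brute-force enumeration for illustration, not an optimized association-rule miner such as Apriori or FP-growth, and it is not taken from the ISID implementation. The names `frequent_tag_patterns`, `drop_redundant`, and the `max_size` cap are hypothetical, and restricting topics to two or more tags is an assumption.

```python
from collections import Counter
from itertools import combinations

def frequent_tag_patterns(posts, min_support, max_size=4):
    """Count tag sets (size >= 2) that co-occur in at least min_support posts."""
    counts = Counter()
    for tags in posts:
        unique = sorted(set(tags))
        # Enumerate every tag combination of this post up to max_size tags.
        for size in range(2, min(len(unique), max_size) + 1):
            for combo in combinations(unique, size):
                counts[frozenset(combo)] += 1
    return {p: c for p, c in counts.items() if c >= min_support}

def drop_redundant(patterns):
    """Remove pattern A if a strict super-pattern B has the same support."""
    return {
        a: support_a
        for a, support_a in patterns.items()
        if not any(a < b and support_a == support_b
                   for b, support_b in patterns.items())
    }

# Hypothetical posts: each post is the list of tags on one bookmark.
posts = [
    ["linux", "howto", "dns"],
    ["linux", "howto", "dns", "network"],
    ["linux", "howto"],
    ["python", "howto"],
]
topics = drop_redundant(frequent_tag_patterns(posts, min_support=2))
for topic, support in topics.items():
    print(sorted(topic), support)
```

Here {dns, howto} and {dns, linux} are dropped because they occur in exactly the same posts as {dns, howto, linux}, so they would produce duplicate clusters; {howto, linux} stays because its support is higher.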
Clustering
for all topic T do
    T.user ← ∅
    T.url ← ∅
for all post P do
    for all topic T of P do
        T.user ← T.user ∪ {P.user}
        T.url ← T.url ∪ {P.url}
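A minimal Python sketch of this clustering step is given below. It assumes that a topic belongs to a post when every tag of the topic appears among the post's tags, which follows the description on the System Design slide; the `Post` tuple, the function name `cluster_posts`, and the sample data are hypothetical.

```python
from collections import namedtuple

# Hypothetical post record: who bookmarked which URL with which tags.
Post = namedtuple("Post", ["user", "url", "tags"])

def cluster_posts(topics, posts):
    """Collect, for every topic, the users and URLs of the posts that carry it."""
    clusters = {topic: {"users": set(), "urls": set()} for topic in topics}
    for post in posts:
        post_tags = set(post.tags)
        for topic in topics:
            # A topic "of" a post: all tags of the topic are on the post.
            if topic <= post_tags:
                clusters[topic]["users"].add(post.user)
                clusters[topic]["urls"].add(post.url)
    return clusters

topics = [frozenset({"linux", "howto"}), frozenset({"linux", "dns"})]
posts = [
    Post("alice", "http://a.example", ["linux", "howto", "dns"]),
    Post("bob", "http://b.example", ["linux", "howto"]),
    Post("carol", "http://c.example", ["python"]),
]
clusters = cluster_posts(topics, posts)
for topic, members in clusters.items():
    print(sorted(topic), sorted(members["users"]), sorted(members["urls"]))
```

The inner loop scans every topic for every post, which is fine for a toy example; a real system would look up candidate topics through a tag index instead.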
Indexing
• Find all URLs that contain a topic, i.e., that are tagged with the same set of tags
• Find all users interested in a topic
• Find all topics containing a tag
• Find all topics for a user
• Find all topics for a URL
• Combination of the above
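The queries listed above map naturally onto a few lookup tables. The sketch below is an in-memory stand-in for whatever indexing system ISID actually used, which the slides do not name; the class `TopicIndex`, its method names, and the sample clusters are all hypothetical.

```python
from collections import defaultdict

class TopicIndex:
    """In-memory lookup tables for topic, user, URL and tag queries."""

    def __init__(self, clusters):
        # clusters maps topic (frozenset of tags) -> {"users": set, "urls": set}
        self.clusters = clusters
        self.by_tag = defaultdict(set)
        self.by_user = defaultdict(set)
        self.by_url = defaultdict(set)
        for topic, members in clusters.items():
            for tag in topic:
                self.by_tag[tag].add(topic)
            for user in members["users"]:
                self.by_user[user].add(topic)
            for url in members["urls"]:
                self.by_url[url].add(topic)

    def urls_for_topic(self, tags):
        return self.clusters.get(frozenset(tags), {}).get("urls", set())

    def users_for_topic(self, tags):
        return self.clusters.get(frozenset(tags), {}).get("users", set())

    def topics_for_tag(self, tag):
        return self.by_tag.get(tag, set())

    def topics_for_user(self, user):
        return self.by_user.get(user, set())

    def topics_for_url(self, url):
        return self.by_url.get(url, set())

# Hypothetical clusters, shaped like the output of the clustering step.
clusters = {
    frozenset({"linux", "howto"}): {
        "users": {"alice", "bob"},
        "urls": {"http://a.example", "http://b.example"},
    },
    frozenset({"linux", "dns"}): {
        "users": {"alice"},
        "urls": {"http://a.example"},
    },
}
index = TopicIndex(clusters)
# Combined query: topics of user "bob" that contain the tag "linux".
print(index.topics_for_user("bob") & index.topics_for_tag("linux"))
```

Because every lookup returns a set, the "combination" queries reduce to set intersections and unions.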
Overview
• Motivation and Problem
• Analysis of tags in a social network
• ISID system design
• Evaluation
• Conclusion
Content Similarity of Topic Clusters
• Similarity of two documents
– Inner product of tf-idf document vectors
  • Keyword-based vector
  • Tag-based vector (comparison)
• Intra-topic similarity
– Average cosine similarity of every document pair
• Inter-topic similarity
– Similarity of two topics
– Average similarity of one topic to all other topics
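These similarity measures can be sketched with sparse vectors stored as dictionaries. The slide does not spell out how the similarity of two topics is computed, so averaging the cosine similarity over all cross-topic document pairs is an assumption; the function names and the toy vectors are hypothetical.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def intra_topic_similarity(docs):
    """Average cosine similarity over every pair of documents in one topic."""
    pairs = list(combinations(docs, 2))
    if not pairs:
        return 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

def inter_topic_similarity(docs_a, docs_b):
    """Average cosine similarity over all cross-topic pairs (assumed definition)."""
    sims = [cosine(a, b) for a in docs_a for b in docs_b]
    return sum(sims) / len(sims) if sims else 0.0

# Hypothetical tf-idf vectors (keyword-based or tag-based) for two topics.
linux_docs = [{"linux": 1.0, "kernel": 1.0}, {"linux": 1.0, "shell": 1.0}]
cooking_docs = [{"recipe": 1.0, "pasta": 1.0}]
print(intra_topic_similarity(linux_docs))                # 0.5, up to rounding
print(inter_topic_similarity(linux_docs, cooking_docs))  # 0.0
```

The same functions work for both representations compared in the evaluation: only the dictionary keys change from page keywords to tags.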
Yahoo! International Social Search
Inter- and Intra- Topic Similarity
• Intra-topic similarity is significantly higher than inter-topic similarity
• Tag co-occurrence clusters similar content well
• Tag-based similarity is quite close to keyword-based similarity
[Figure, two panels: "Keyword based (tf-idf)" and "Tag based (tf-idf)". x-axis: Topic rank; y-axis: Average cosine similarity; curves: intra-topic, inter-topic.]
Inter-topic Similarity
[Figure, two panels: "Tag-based (tf-idf)" and "Keyword-based (tf-idf)". x-axis: Number of overlapped tags; y-axis: Average cosine similarity.]
Similarity of two topics with different number of overlapped tags
Inter-topic similarity increases with number of co-occurring tags. Tag co-occurrences capture similar contents.
User Interest Coverage
• 90% of users have ≥ 90% of their top 5 tags covered
• 87% of users have ≥ 90% of their top 10 tags covered
• 90% of users have ≥ 80% of all their tags covered
[Figure: x-axis: Fraction of top tags covered by topics; y-axis: CDF of fraction of users; curves: Of top 5 tags, Of top 10 tags, Of all tags.]
The topics discovered by ISID capture the interests of users.
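The coverage metric on this slide can be approximated as follows. The sketch treats a tag as covered when it appears in at least one discovered topic; the slides do not give the exact definition, so that reading, the function name `top_tag_coverage`, and the sample data are assumptions.

```python
from collections import Counter

def top_tag_coverage(user_tags, topics, k=None):
    """Fraction of a user's k most used tags that appear in at least one topic.

    user_tags: list of tags with repetition, one entry per use by this user.
    topics:    iterable of tag sets; k=None means "all of the user's tags".
    """
    top = [tag for tag, _ in Counter(user_tags).most_common(k)]
    if not top:
        return 0.0
    topic_tags = set().union(*topics)  # every tag used by any topic
    return sum(tag in topic_tags for tag in top) / len(top)

# Hypothetical user history and topics.
user_tags = ["linux"] * 5 + ["howto"] * 3 + ["dns"] * 2 + ["cats"]
topics = [frozenset({"linux", "howto"}), frozenset({"linux", "dns"})]
print(top_tag_coverage(user_tags, topics, k=3))  # 1.0: the top 3 tags are all covered
print(top_tag_coverage(user_tags, topics))       # 0.75: "cats" is in no topic
```

Computing this value for every user and plotting its cumulative distribution would give curves of the kind shown in the figure.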
Human Reviews
[Figure: x-axis: # of interest topic; y-axis: Average score.]
According to human reviewers, ISID does group related URLs into a cluster for each topic defined by user tags.
Scores:
1, Highly unrelated
2, Unrelated
3, Not sure
4, Related
5, Highly related
Cluster Properties
Cluster size follows a power law.
User interests follow a power law. There exist really hot topics!
[Figure: log-log plot. x-axis: Number of posts (log); y-axis: Number of clusters (log).]
Cluster Properties
Most topics have fewer than 6 tags. Beyond 6, the number of clusters drops quickly.
[Figure: x-axis: Topic size (Number of tags); y-axis: Number of clusters (log).]
Overview
• Motivation and Problem
• Data and Their Properties
• ISID system
• Evaluation
• Conclusion
Conclusion
• Tags reflect human judgments on contents
• Co-occurring tags represent user interests effectively
– Reflect human understanding of different but similar web contents
– Consensus of judgments among users
• ISID system
– Topic discovery, Clustering, Indexing
– Evaluation results are promising