Community-Centric Information Exploration on Social Content Sites Sihem Amer-Yahia and Cong Yu...
-
date post
21-Dec-2015 -
Category
Documents
-
view
217 -
download
0
Transcript of Community-Centric Information Exploration on Social Content Sites Sihem Amer-Yahia and Cong Yu...
Community-CentricInformation Explorationon Social Content Sites
Sihem Amer-Yahia and Cong Yu
Yahoo! Research New YorkM3SN, Shanghai, China
March 29th, 2009
2 | M3SN, Shanghai, China, 2009Sihem Amer-Yahia and Cong Yu
Yahoo! Research New York
Acknowledgement
• Michael Benedikt, Oxford
• Alban Galland, INRIA
• Jian Huang, Penn State
• Laks Lakshmanan, UBC
• Julia Stoyanovich, Columbia
Acknowledgement (thanks to Facebook)
3 | M3SN, Shanghai, China, 2009Sihem Amer-Yahia and Cong Yu
Yahoo! Research New York
Social Networking Sites Are Popular!
4 | M3SN, Shanghai, China, 2009Sihem Amer-Yahia and Cong Yu
Yahoo! Research New York
An Emerging Trend: Social Content Integration
5 | M3SN, Shanghai, China, 2009Sihem Amer-Yahia and Cong Yu
Yahoo! Research New York
Social Content Sites (SCS)
• Web destinations that let users:– Consume content-oriented information: videos, photos,
news articles, etc.
– Engage in social activities with their friends and people of similar interests
• Two major driving factors:– Incorporating social activities improves the
attractiveness of traditional content sites• The “similar traveler” feature improves the user duration on Yahoo! Travel significantly
– Incorporating content is critical to the value of social networking sites• A significant amount of user time is spent on browsing other people’s photos, notes, etc.
• Privacy: important but not the focus here
6 | M3SN, Shanghai, China, 2009Sihem Amer-Yahia and Cong Yu
Yahoo! Research New York
Main Challenges in SCS
• Social Content Management– Data must be maintained and analyzed at web-scale: the
underlying social content graph can be enormous. (Edward Chang’s keynote later)
– Information has to be integrated from various places.
• Information Discovery– Current: pure content based search or simple
popularity/rating based browsing
– Future: effectively and efficiently leverage the underlying social graph in more sophisticated ways
• Information Presentation– Current: a single ranked list based on relevance
– Future: metrics other than relevance / groups and clusters / explanations
InformationExploration
7 | M3SN, Shanghai, China, 2009Sihem Amer-Yahia and Cong Yu
Yahoo! Research New York
Talk Outline
• Emerging Trend of Social Content Sites (SCS)
• Information Exploration on SCS– Community Centric Model
– Information Discovery: Improving Search Performance through Clustering
– Information Presentation: Diversification via Explanation
• SocialScope and Jelly
• Further Challenges
• Conclusion
8 | M3SN, Shanghai, China, 2009Sihem Amer-Yahia and Cong Yu
Yahoo! Research New York
Information Exploration on SCS
• Major Paradigms– User-based: browsing the content of your friends or
other users you simply follow (Facebook, twitter)
– Search: content based keywords matching (most traditional content sites)
– Recommendations: hotlists, tag-based hotlists (Amazon, many collaborative tagging sites)
– Query: database-style querying with complex conditions (not many so far)
But which ones are more frequent?
9 | M3SN, Shanghai, China, 2009Sihem Amer-Yahia and Cong Yu
Yahoo! Research New York
Search vs. Recommendation
• Majority of those searches are in fact recommendations in disguise!– Users are usually NOT looking for specific items
– Majority of the sessions are seeking recommendations with geographical and topical constraints
• Ranking the paradigms (for neurotic people who insist on orders):“Recommendation” ~ User-based > Search > Query
10 | M3SN, Shanghai, China, 2009Sihem Amer-Yahia and Cong Yu
Yahoo! Research New York
Recommendation 2.0 (or 1.5)
• Users demand a lot more from us– Be able to specify the topic: “what are the good destinations
for a romantic trip in Asia?”
– Be able to specify the community: “what’s hot among my tech-savvy friends?”
– Be able to present the results in ways that can maximize the understanding of those results
• But we do know a lot more about the users and the contents (through the users)– Richer user profiles
– Friendships and relationships
– Content activities (rating movies, tagging travel destinations, etc.)
– Communication activities (IM, email, twitter follows, Facebook wall messages, etc.)
11 | M3SN, Shanghai, China, 2009Sihem Amer-Yahia and Cong Yu
Yahoo! Research New York
Data Model: Social Content Graph
John
Joe NYC
AAAmy
friendvisit
fun,club
Yankees
Football
id: 301destinationstate: NY
id: 302destinationstate: MIcollege: umich
id: 102people
id: 101people
id: 103people
A social content graph is a logical graph structure where the labeled nodes represent users and objects, and the labeled edges represent relations between users and items, as well as activities users perform on items or other users. Both nodes and edges can have structural attributes.
12 | M3SN, Shanghai, China, 2009Sihem Amer-Yahia and Cong Yu
Yahoo! Research New York
A Community-Centric Model
• Communities of users as the central concept– Community: a group of “similar” or “related” users
– Recommendations are performed per community and results from each community are assembled together in the end
– A user can belong to multiple communities:
• A vector-based representation: user:<c1:w1, c2:w2, ...>
• Challenge I: Community Generation– Topic based
– Relationship based
– Activity based
• Challenge II: Community Selection– Given a user session, determine the appropriate set of
communities to work with
14 | M3SN, Shanghai, China, 2009Sihem Amer-Yahia and Cong Yu
Yahoo! Research New York
A Case Study in Yahoo! Travel on Topic-Based Communities
15 | M3SN, Shanghai, China, 2009Sihem Amer-Yahia and Cong Yu
Yahoo! Research New York
Data in Yahoo! Travel
• Users– Yahoo! users
• Destinations– cities, attractions, etc.
• Tagging activities:– Destination tags: editorial + custom
– Self tags: mostly custom
• Goal:– Given a user session (user + tags + region),
recommending travel destinations to the user
• Approach:– Discover topics for users, tags and destinations based
on tagging actions
– Construct communities based on topics
16 | M3SN, Shanghai, China, 2009Sihem Amer-Yahia and Cong Yu
Yahoo! Research New York
• Joint work with Jian Huang
• Intuition– Each destination is modeled as a document consists of all tags
being used on it
– Tags are treated as words in the document
• Apply Latent Dirichlet Allocation [BNJ03]– A probabilistic generative model for document topic discovery
– Produces the following conditional probabilities• Prob(w|t): given a latent topic, the probability of the tag word
belongs to that topic
• Prob(t|d): given a document, the probability of the document being about a certain latent topic
• Prob(t|u) = (Prob(t|di)), where u tagged di
k
ik 1
1
Topic Discovery through LDA
17 | M3SN, Shanghai, China, 2009Sihem Amer-Yahia and Cong Yu
Yahoo! Research New York
Examples of Topic Discovery
Topic 1: Seaside/Water related
Tokens: beach, scuba, summer, diving, fishing, snorkeling, island, lake …
Top cities: San Diego, Honolulu,Chicago, Miami, Seattle …
Topic 3: Nightlife, City-life
Tokens: nightlife, gambling, wine, drinking, exciting, casino …
Top cities: Las Vegas, New York City, Los Angeles, San Francisco …
Topic 0: Romantic/Luxury
Tokens: romantic, 4-star, shopping, spa, golf, luxury, honeymoon …
Top cities: Cancun, San Juan, Grand Canyon, Bangkok …
Topic 2: Arts/Historical/Cultural
Tokens: architecture, sightseeing, art, history, culture, cathedral, castle …
Top cities: Paris, London, Boston, Istanbul Amsterdam, Rome, Hong Kong …
San Diego
Las Vegas
Paris
Cancun
18 | M3SN, Shanghai, China, 2009Sihem Amer-Yahia and Cong Yu
Yahoo! Research New York
Community Generation
• Identify destinations belong to the same topic– For each topic t, identify destination d belongs to t:
D(t) = { d | Prob(t|d) > threshold }
• Generate topical communities– D(u) = { d | user u visited destination d }
– Interests-based topical community
Cinterest(t) = { u | > threshold }
– Expertise-based topical community
Cexpert(t) = { u | > threshold }
)(
)()(
uD
uDtD
)(
)()(
tD
uDtD
D(u)
D(t)
19 | M3SN, Shanghai, China, 2009Sihem Amer-Yahia and Cong Yu
Yahoo! Research New York
Community Selection
• Each user session consists of three pieces of information– (user, tag, region)
– Region is treated as a Boolean constraint
– We combine and convert user and tag into a single topical vector
Prob(t|(u,w)) =
w1·Prob(t|u) + w2·(Prob(w|t)·Prob(t)/Prob(w))
• A community C(t) is selected if Prob(t|(u,w)) is greater than a threshold
• The choice between Interests or Experts Communities is heuristically decided– User is an expert: leverage interests-based community
– User is not an expert: leverage experts-based community
20 | M3SN, Shanghai, China, 2009Sihem Amer-Yahia and Cong Yu
Yahoo! Research New York
Example Results
21 | M3SN, Shanghai, China, 2009Sihem Amer-Yahia and Cong Yu
Yahoo! Research New York
Talk Outline
• Emerging Trend of Social Content Sites (SCS)
• Information Exploration on SCS– Community Centric Model
– Information Discovery: Improving Search Performance through Clustering
– Information Presentation: Diversification via Explanation
• SocialScope and Jelly
• Further Challenges
• Conclusion
22 | M3SN, Shanghai, China, 2009Sihem Amer-Yahia and Cong Yu
Yahoo! Research New York
Querying Items within Social Network
• Problem Definition:– Given a user and a query, retrieve items relevant to the
query based only on the user’s network
– The score of the item depends on keyword and user
– E.g., score(u,q,i) = cnt(v),
where tag(v,i,q) and friend(u,v)
– A common scenario on social tagging sites like del.icio.us
• Naïve Approach– One inverted index for each (user,keyword)
– Apply top-k threshold algorithms at query time
– Optimal efficiency
– But, high storage overhead
23 | M3SN, Shanghai, China, 2009Sihem Amer-Yahia and Cong Yu
Yahoo! Research New York
Space Overhead of Per-User Indexing
item score
i7
i1i2i3i4i5i6
i816
736562403918
16
user Jane
i7
i5i9i2i6i5i8
i3
user Ann
10
533630151410
5
scoreitem
tag = music
item score
i7
i1i8i4i2i3i6
i915
302927252320
13
user Jane
i4
i5i2i8i7i1i6
i3
user Ann
60
998078757263
50
scoreitem
tag = news
• Conservative Example:– 100K users, 1M items, 1K tags
– 20 tags/item from 5% of the taggers
– 10 bytes per inverted list entry
– 1 Terabyte index!
24 | M3SN, Shanghai, China, 2009Sihem Amer-Yahia and Cong Yu
Yahoo! Research New York
Clustering to the Rescue!• Joint work with Michael Benedikt, Laks Lakshamanan,
Julia Stoyanovich• Clustered Approach
– Group users based on shared behavior into communities• Shared items• Shared items with same tags
– One inverted index for each (group,keyword)– Replace item score with score upper bound (among all users in
the group)– Sacrifice a bit of query performance for indexing space
efficiency
• Community-based (seeker clustering):– Groups are keyword-independent– E.g., shared items, shared tags
• Behavior-based (tagger clustering):– Groups are keyword-specific– E.g., Jane may be similar to John on “travel”, but very
different from him on “sports”.
25 | M3SN, Shanghai, China, 2009Sihem Amer-Yahia and Cong Yu
Yahoo! Research New York
User Clustering
JaneChris Sarah Ann
Community-based Behavior-based
one user, one cluster
one inverted index per cluster
upper-bound score per item
one user, many clusters
one inverted index per cluster
upper-bound score per item
26 | M3SN, Shanghai, China, 2009Sihem Amer-Yahia and Cong Yu
Yahoo! Research New York
Community-based Clustering: Space
27 | M3SN, Shanghai, China, 2009Sihem Amer-Yahia and Cong Yu
Yahoo! Research New York
Community-based Clustering: Performance
28 | M3SN, Shanghai, China, 2009Sihem Amer-Yahia and Cong Yu
Yahoo! Research New York
Talk Outline
• Emerging Trend of Social Content Sites (SCS)
• Information Exploration on SCS– Community Centric Model
– Information Discovery: Improving Search Performance through Clustering
– Information Presentation: Diversification via Explanation
• SocialScope and Jelly
• Further Challenges
• Conclusion
29 | M3SN, Shanghai, China, 2009Sihem Amer-Yahia and Cong Yu
Yahoo! Research New York
Information Presentation• A broad definition:
Given the set of relevant items, identify the appropriate subset of items to be shown to the user(s) in the right organization and with the right amount of information.
• Appropriate Subset:– Timeliness
– Geographical closeness
– Diversity
– Etc.
• Right Organization:– Ranking
– Grouping
– Faceted Navigation
• Right Amount of Information– Explanation
30 | M3SN, Shanghai, China, 2009Sihem Amer-Yahia and Cong Yu
Yahoo! Research New York
A Case Study in Recommendation
• Joint work with Laks Lakshmanan
• While relevance is important to recommendation, others are critical too:– Novelty: avoid returning results that users are likely
to know already.
– Serendipity: aim to return less relevant results that might give users a pleasant surprise.
– Diversity: avoid returning results that are too similar to each other.
• Recommendation Diversification:
From the pool of candidate items, identify a list of items that are dissimilar to each other while maintaining a high cumulative relevance, i.e., strike a good balance between relevance and diversity.
31 | M3SN, Shanghai, China, 2009Sihem Amer-Yahia and Cong Yu
Yahoo! Research New York
Existing Solutions for Diversification
• Attribute-Based Diversification– Diversity semantics: pair-wise distance functions based on
item attributes (e.g., movie attributes).
– Combining with relevance:• Threshold either relevance or distance, maximize the other
• Optimize an overall score as a weighted combination of relevance and distance
– Algorithm• Perform traditional recommendation
• Obtain the attributes of each candidate item and compute the pair-wise distance
• Ad-hoc methods follow how diversity and relevance are combined
• A Major Problem:– Lack of attributes suitable for estimating distance between
pairs of items: e.g., URLs in del.icio.us, photos on Flickr, videos on Vimeo.
32 | M3SN, Shanghai, China, 2009Sihem Amer-Yahia and Cong Yu
Yahoo! Research New York
• Intuition– Explanation is the set of objects because of which a
particular item is recommended to the user.
– Two items share similar explanations are likely to be similar to each other.
• Explanation for Item-Based Strategies
• Explanation for Collaborative Filtering Strategies (social!)
Explanation-Based Diversification
33 | M3SN, Shanghai, China, 2009Sihem Amer-Yahia and Cong Yu
Yahoo! Research New York
Explanation-Based Diversity
• Pair-wise diversity distance between two recommended items– Standard similarity measures like Jaccard similarity and
cosine similarity
– E.g. (Distance based on Jaccard similarity)
• Diversity for the set of recommended results (S)
34 | M3SN, Shanghai, China, 2009Sihem Amer-Yahia and Cong Yu
Yahoo! Research New York
Benefits and Practicality of Explanation-Based Diversification
• Applicable to items without attributes or whose attributes are difficult to analyze– Common on social content sites
• Explanations are by-products of many recommendation processes– They can be maintained with little overhead
See our poster on Monday for details
35 | M3SN, Shanghai, China, 2009Sihem Amer-Yahia and Cong Yu
Yahoo! Research New York
Talk Outline
• Emerging Trend of Social Content Sites (SCS)
• Information Exploration on SCS– Community Centric Model
– Information Discovery: Improving Search Performance through Clustering
– Information Presentation: Diversification via Explanation
• SocialScope and Jelly
• Further Challenges
• Conclusion
36 | M3SN, Shanghai, China, 2009Sihem Amer-Yahia and Cong Yu
Yahoo! Research New York
SocialScope Platform
Facebook Y! Sports
Content Integrator
Social Content Analyzer Social Query Evaluator
Acti
vit
y
Man
ag
er
Social Content Grouper
Social Content
Graph
Data Manager
Query / Result
OpenSocial API
Activities
Y! IM
OpenSocial API
Co
nte
nt
Ma
na
ge
me
nt
Information Discovery
Important Information Flow(1) Raw/derived social nodes and links(2) Social content sub-graph(3) Relevant social content sub-graph(4) Relevant social nodes and links
Information Presentation Social Content Ranker
Social Content Admin
User
User
Inte
rface
37 | M3SN, Shanghai, China, 2009Sihem Amer-Yahia and Cong Yu
Yahoo! Research New York
A Graph Based Logical Algebra Framework
• Designed for information discovery on social content sites
• Aim to provide a declarative way of specifying analysis and query tasks– Uniformity and flexibility
– Opportunities for performance optimization
• Basic operators:– Node Selection (σN), Link Selection (σL)
– Composition, Semi-Join
– Node Aggregation, Link Aggregation
– Details in [CIDR09]
38 | M3SN, Shanghai, China, 2009Sihem Amer-Yahia and Cong Yu
Yahoo! Research New York
A Simple Search Task
John’s friends
People visited Denver
John’s friendswho visited Denver
Their activities
39 | M3SN, Shanghai, China, 2009Sihem Amer-Yahia and Cong Yu
Yahoo! Research New York
Jelly:A Language Over Social Content Sites
• Designed with a focus on community-centric information exploration applications– Most useful applications
• A restricted implementation of the SocialScope algebra– Based on nested relation model, instead of full graph
model
• Built-in primitives for topic and community generation– Topic Generation, Community Extraction
– Recommendation Generation
– Group Generation, Explanation Generation
40 | M3SN, Shanghai, China, 2009Sihem Amer-Yahia and Cong Yu
Yahoo! Research New York
A Simple (Incomplete) Example• Topic Generation and Community Extraction
• Recommendation
Generation
• Information
Presentation
generate topics for item into topicsfrom tagging Rusing LDA (seed, th=0.8)seed R.item group R.tag weight-with count()
generate communities into expertsfrom topics T, tagging Rwhere T.*.item = R.itemusing jaccard-similarity (seed, th=0.7)seed (R.user, T.topic) list R.item
generate recommendations into candidatesgiven user u, query qfrom experts Twhere Selected (T.topic, u, q)using count-users (seed)seed T.*.item list T.user
generate explanations into resultsfrom candidates C, tagging Twhere C.item = T.itemusing identity (seed)seed C.item list T.user weight-with count()
41 | M3SN, Shanghai, China, 2009Sihem Amer-Yahia and Cong Yu
Yahoo! Research New York
Talk Outline
• Emerging Trend of Social Content Sites (SCS)
• Information Exploration on SCS– Community Centric Model
– Information Discovery: Improving Search Performance through Clustering
– Information Presentation: Diversification via Explanation
• SocialScope and Jelly
• Further Challenges
• Conclusion
42 | M3SN, Shanghai, China, 2009Sihem Amer-Yahia and Cong Yu
Yahoo! Research New York
Scalability Challenges
• Social content graphs are large– 10’s of millions of users
– millions (movies, destinations) or billions (web links) of items
– Challenge 1: being able to analyze the graph efficiently• We need to design Map-Reduce frameworks that are suitable for those analyses
• Activities are being generated at a very high rate– millions of status updates per day (Facebook, twitter)
– Similar amount of content activities (browsing, tagging, etc.)
– Challenge 2: being able to incrementally incorporate new activities into the existing model
43 | M3SN, Shanghai, China, 2009Sihem Amer-Yahia and Cong Yu
Yahoo! Research New York
Semantic Challenges and Questions
• Connections are multi-dimensional– Two main categories on the surface: explicit
(friendship) vs. implicit (shared activities)
– However: not all friendship links are the same and not all activities are the same
– Challenge: for a given user on a given topic, identify the appropriate set of explicit and implicit links• expert versus interest based communities is one step toward that goal
• Temporal, geographical, and other external influences– How does a community evolve at time goes on and as
members moves around?
• Social graph interactions– How does one social graph (say Facebook) impact another
(say MySpace)?
44 | M3SN, Shanghai, China, 2009Sihem Amer-Yahia and Cong Yu
Yahoo! Research New York
Conclusion
• The emerging trend of social content sites present many challenges– Social information integration
– Information discovery
– Information presentation
• The notion of communities and topics are crucial for effective and efficient information exploration on those sites– Community discovery and analysis
– Community-based information exploration
• Lots more!
• At Yahoo!, we are focusing on building systems that can address those challenges