Recommendations and Discovery at StumbleUpon

Recommendations and Discovery at StumbleUpon Sumanth Kolar, Director, Engineering @_5K

description

RecSys 2012 Industry Track - Sumanth Kolar, StumbleUpon

It's human nature to be curious, to learn new things, to want to find out more. Discovery is an innate human need, and with the rise of the Web, the urge to learn more has increased by leaps and bounds. According to David Hornik, investor at August Capital, "The massive scale of the Web not only creates huge challenges for search, it also cripples discovery. Gone are the good old days in which fortuity would lead to the unearthing of interesting new websites." Indeed, we live in the age of "infovores" and there is definitely a need for a service that provides serendipity. Providing serendipitous discovery that can inform, entertain and enlighten our users is of utmost importance to StumbleUpon.

This talk will focus on how StumbleUpon uses several machine learning techniques such as collaborative filtering, active learning, decision trees, Bayesian models and more to solve complex problems involving classification, user behavior analysis, modelling, anti-spam and recommendations. An average StumbleUpon user spends over 7 hours per month using the product, equating to hundreds of varied recommendations and ample feedback. The talk will also provide insights into some of StumbleUpon's rich data and how we can use scale to accomplish what would otherwise not be possible. We will look at innovative ways that StumbleUpon figures out the right metrics to evaluate recommender systems - a very complex problem. We will also discuss our research on StumbleUpon's mobile activity, which is growing 800% year over year and is the fastest growing part of our business, and how mobile recommendations are unique and important.

Bio: As Engineering Director at StumbleUpon, Sumanth Kolar leads the applied research team, overseeing recommendations, anti-spam, content analysis, user modeling, data sciences and infrastructure. Sumanth tackles very interesting and challenging research problems as StumbleUpon delivers more than 1 billion personalized recommendations a month to its more than 25 million users. Prior to joining the company in 2009, Sumanth engineered bidding and computer vision systems at Yahoo! and Adobe Research. Sumanth holds a master's degree in computer science from the University of California at Santa Cruz.

Transcript of Recommendations and Discovery at StumbleUpon

Page 1: Recommendations and Discovery at StumbleUpon

Recommendations and Discovery at StumbleUpon

Sumanth Kolar,

Director, Engineering

@_5K

Page 2: Recommendations and Discovery at StumbleUpon

StumbleUpon’s Mission

Help users find content they did not expect to find

Be the best way to discover new and interesting things from across the Web.

Page 3: Recommendations and Discovery at StumbleUpon

How StumbleUpon works

1. Register

2. Tell us your interests

3. Start Stumbling and rating web pages

We use your interests and behavior to recommend new content for you!

Page 4: Recommendations and Discovery at StumbleUpon

There is an ongoing shift from search to discovery

Discovery is very different from search

Discovery at StumbleUpon | Search
Serendipitous | Intent driven
One at a time | List of articles
Never repeats | Always repeats
Constantly adapting | Fixed results
Tailored for you | Impersonal

Page 5: Recommendations and Discovery at StumbleUpon

StumbleUpon

Page 6: Recommendations and Discovery at StumbleUpon

StumbleUpon Overview

[Diagram labels: Discovery (Users, Automated / Crawled), Ingestion Pipeline, Sampling, Pass? (Yes), URL Index, Rec Engine; stages numbered 1, 2, 3]

Page 7: Recommendations and Discovery at StumbleUpon

What are the key challenges to good recommendations?

Page 8: Recommendations and Discovery at StumbleUpon

Pillars of good recommendations

Understand who the user is and what he is interested in.

Separate good content from the bad.

Learn from your recommendations.

Explore various techniques for matching users to content.

Page 9: Recommendations and Discovery at StumbleUpon

Pillars of good recommendations

Understand who the user is and what he is interested in.

Separate good content from the bad.

Learn from your recommendations.

Explore various techniques for matching users to content.

Page 10: Recommendations and Discovery at StumbleUpon

User self reports topics of interest

Part of the sign up flow…

Page 11: Recommendations and Discovery at StumbleUpon

User’s Interest Graph

[Diagram: User connected to Food/Cooking, Italian Recipes, Cars, Vintage Cars]

Page 12: Recommendations and Discovery at StumbleUpon

Continually Enhance a User’s Interest Graph

Analyze user’s StumbleUpon history to expand on interest preferences:

• Add/remove topics

• Follow/block particular domains

Page 13: Recommendations and Discovery at StumbleUpon

Continually Enhance a User’s Interest Graph

Leverage social network data:

• Find friends & people to follow

• Find content trending in your social circles

• Find additional interests

Page 14: Recommendations and Discovery at StumbleUpon

Continually Enhance a User’s Interest Graph

Mine internal StumbleUpon rating and sharing data to suggest other stumblers, topics.

Page 15: Recommendations and Discovery at StumbleUpon

Enhanced Interest Graph

[Diagram: User connected to Food/Cooking, Italian Recipes, Cars, Vintage Cars, nasa.gov, 1x.com, News, Friends, Trending]

Page 16: Recommendations and Discovery at StumbleUpon

Pillars of good recommendations

Understand who the user is and what he is interested in.

Separate good content from the bad.

Learn from your recommendations.

Explore various techniques for matching users to content.

Page 17: Recommendations and Discovery at StumbleUpon

Sampling

On average, hundreds of URLs are ingested into the StumbleUpon pipeline every minute.

• Sampling key goals:

1. Determine which URLs to sample and which to skip completely

2. Examine sampling results to identify good URLs

• URL features used when sampling:

• Known domain performance (ratings, timespent)

• Content related features (#images, #ads, URL length, etc.)

• User features of the discoverer (spammer vs trusted user)
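As a rough illustration of the first sampling goal, the sketch below turns the three feature groups listed on the slide into one score and compares it with a threshold. The `UrlCandidate` fields, the weights, and the threshold are all hypothetical; the slides do not say how StumbleUpon actually combines these features.

```python
from dataclasses import dataclass


@dataclass
class UrlCandidate:
    # All field names are illustrative assumptions, not StumbleUpon's schema.
    domain_like_rate: float      # known domain performance: share of thumbs-up, 0..1
    domain_avg_timespent: float  # known domain performance: seconds per visit
    num_images: int              # content related feature
    num_ads: int                 # content related feature
    url_length: int              # content related feature
    discoverer_trusted: bool     # user feature of the discoverer


def should_sample(c: UrlCandidate, threshold: float = 0.5) -> bool:
    """Hand-tuned score standing in for whatever model makes the real decision."""
    score = 0.4 * c.domain_like_rate
    score += 0.2 * min(c.domain_avg_timespent / 60.0, 1.0)
    score += 0.2 if c.discoverer_trusted else -0.3
    score += 0.02 * min(c.num_images, 5)
    score -= 0.05 * min(c.num_ads, 6)
    score -= 0.1 if c.url_length > 200 else 0.0
    return score >= threshold


trusted_find = UrlCandidate(0.9, 45.0, 3, 1, 60, True)
likely_spam = UrlCandidate(0.2, 5.0, 0, 8, 250, False)
print(should_sample(trusted_find), should_sample(likely_spam))
```

In practice the weights would be learned from sampling outcomes rather than set by hand.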

Page 18: Recommendations and Discovery at StumbleUpon

Recommendations at StumbleUpon: Sampling

Classifier based on User Feedback (Timespent, Ratings)

[Diagram labels: Webpage, Random Forest, Vote (Yes, Yes, No, Yes), Recommend (Yes)]

Rating | Timespent
Good | 35 sec
Good | 22 sec
Bad | 15 sec
Good | 45 sec
Good | 14 sec
Good | 28 sec
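To make the vote step concrete, here is a minimal sketch that summarizes the feedback table from this slide into a few features and lets four hand-written rules vote on whether to recommend. The slide names a Random Forest; the rules below are hypothetical stand-ins for learned trees, and their thresholds are invented for illustration.

```python
from statistics import median

# Sampled feedback for one webpage, copied from the table on the slide.
samples = [("Good", 35), ("Good", 22), ("Bad", 15),
           ("Good", 45), ("Good", 14), ("Good", 28)]


def summarize(samples):
    """Turn raw (rating, timespent_seconds) pairs into simple features."""
    times = [t for _, t in samples]
    likes = sum(1 for rating, _ in samples if rating == "Good")
    return {
        "like_fraction": likes / len(samples),
        "median_timespent": median(times),
        "min_timespent": min(times),
        "num_samples": len(samples),
    }


# Hypothetical hand-written rules standing in for learned trees.
STUMPS = [
    lambda f: f["like_fraction"] >= 0.6,
    lambda f: f["median_timespent"] >= 20,
    lambda f: f["min_timespent"] >= 15,
    lambda f: f["num_samples"] >= 5,
]


def vote(features, stumps=STUMPS):
    """Return (individual votes, majority decision)."""
    votes = [stump(features) for stump in stumps]
    return votes, sum(votes) > len(votes) / 2


features = summarize(samples)
votes, recommend = vote(features)
print(votes, recommend)
```

A real forest would learn both the split features and the thresholds from labeled sampling results; only the majority-vote mechanics carry over from this sketch.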

Page 19: Recommendations and Discovery at StumbleUpon

Leveraging In-Network Experts

• Users who thumb-up good content and thumb-down bad content

• For example
– Joe DiMaggio – Baseball
– Julia Child – Food/Cooking
– Da Vinci – Art and Architecture

• Ratings from Experts are more trustworthy and earn more weight.
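One simple way to give expert ratings more weight is a weighted like rate, sketched below. The `expert_weight` value of 3.0 is an arbitrary assumption; the slides do not state how much extra weight experts earn or how expertise is detected.

```python
def weighted_like_rate(ratings, expert_weight=3.0):
    """ratings: list of (thumb_up, rater_is_topic_expert) pairs for one URL.

    Each expert rating counts expert_weight times; everyone else counts once.
    """
    liked = 0.0
    total = 0.0
    for thumb_up, is_expert in ratings:
        weight = expert_weight if is_expert else 1.0
        liked += weight if thumb_up else 0.0
        total += weight
    return liked / total if total else 0.0


# One expert thumbs up, two non-experts thumb down.
ratings = [(True, True), (False, False), (False, False)]
print(weighted_like_rate(ratings))                     # 3 / 5 = 0.6
print(weighted_like_rate(ratings, expert_weight=1.0))  # plain like rate, 1 / 3
```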

Page 20: Recommendations and Discovery at StumbleUpon

Recommendations at StumbleUpon: Experts

[Chart: Expert vs Non-Expert]

Page 21: Recommendations and Discovery at StumbleUpon

Pillars of good recommendations

Understand who the user is and what he is interested in.

Separate good content from the bad.

Learn from your recommendations.

Explore various techniques for matching users to content.

Page 22: Recommendations and Discovery at StumbleUpon

Challenge: User expectations are different

“I LOVE cars!”-Anonymous Stumbler

“Me too!”-Another Stumbler

Page 23: Recommendations and Discovery at StumbleUpon

Like-Minded Users

• Find users who like content similar to what you like

• Signals can be ratings, time spent, interests, etc.

• Use the content they’ve liked
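The three steps on this slide can be sketched as a small user-based collaborative filter. This version uses cosine similarity over explicit ratings only (+1 thumb-up, -1 thumb-down); that choice, the toy users, and the function names are assumptions for illustration, and later slides indicate the production system uses a PLSI-based similarity instead.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two {url: rating} dicts."""
    dot = sum(a[u] * b[u] for u in set(a) & set(b))
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(target, ratings, k=1):
    """Suggest URLs liked by the k most similar users and unseen by target."""
    sims = sorted(
        ((cosine(ratings[target], r), user)
         for user, r in ratings.items() if user != target),
        reverse=True,
    )
    neighbours = [user for sim, user in sims[:k] if sim > 0]
    seen = set(ratings[target])
    scores = {}
    for user in neighbours:
        for url, value in ratings[user].items():
            if value > 0 and url not in seen:
                scores[url] = scores.get(url, 0) + 1
    return sorted(scores, key=lambda url: (-scores[url], url))


ratings = {
    "alice": {"a": 1, "b": 1, "c": -1},
    "bob": {"a": 1, "b": 1, "d": 1},
    "carol": {"c": 1, "e": 1},
}
print(recommend("alice", ratings))  # ['d']
```

Filtering out URLs the target has already seen mirrors the "never repeats" property of discovery described earlier in the deck.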

Page 24: Recommendations and Discovery at StumbleUpon

PLSI based like-minded

Interest lists:

• Neuroscience, Astronomy, Space Exploration, Comedy Movies

• Astronomy, Space Exploration, Physics, Classic Movies

• Vintage Cars, Action movies, Astronomy, Robotics

Topics: Science, Space, Movies, Cars

Page 25: Recommendations and Discovery at StumbleUpon

Like-Minded Users: Challenges Scaling

Total Pairwise Similarity Calculations = 50K users * 5 million users * 1K features = 250 Trillion

Probabilistic Latent Semantic Index (PLSI) based similarity

Over 500 trillion calculations: PLSI based similarity framework computes in less than an hour
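The product stated on the slide can be checked directly, along with the sustained throughput needed to finish it within an hour. This is plain arithmetic on the slide's own figures; it makes no attempt to account for the separate "over 500 trillion" number, and the function name is illustrative.

```python
def pairwise_cost(candidate_users, all_users, features):
    """Multiply-adds for comparing every candidate with every user."""
    return candidate_users * all_users * features


total_ops = pairwise_cost(50_000, 5_000_000, 1_000)
print(f"{total_ops:,}")  # 250,000,000,000,000 (250 trillion)

# Throughput needed to finish within one hour (3600 seconds).
ops_per_second = total_ops / 3600
print(f"{ops_per_second / 1e9:.1f} billion operations per second")
```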

Page 26: Recommendations and Discovery at StumbleUpon

Grow User’s Interest Graph: Implicit + Explicit

[Diagram: User connected to Food/Cooking, Italian Recipes, Cars, Vintage Cars, nasa.gov, 1x.com, News, Experts, Friends, Trending, Like-minded Users]

Page 27: Recommendations and Discovery at StumbleUpon

Different methods perform differently for different users at different times

[Chart: User 1 through User 5 on one axis, 0% to 100% on the other; series: Trending, Follow, Bias domains, Experts, News, Like-minded]
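If each method works differently for each user, one plausible response is to weight the methods per user by how well they have performed for that user so far. The sketch below does this with a smoothed like rate; the smoothing prior, the function name, and the idea of mixing in proportion to like rate are assumptions, not something the slides describe.

```python
def method_weights(feedback, prior_likes=1.0, prior_total=2.0):
    """feedback: {method: (likes, total_recommendations)} for one user.

    Returns weights that sum to 1, proportional to each method's smoothed
    like rate. Methods with little data stay near the 0.5 prior.
    """
    rates = {
        method: (likes + prior_likes) / (total + prior_total)
        for method, (likes, total) in feedback.items()
    }
    norm = sum(rates.values())
    return {method: rate / norm for method, rate in rates.items()}


user_1 = {"Trending": (8, 10), "Experts": (2, 10)}
print(method_weights(user_1))  # {'Trending': 0.75, 'Experts': 0.25}
```

Because the slide also says performance changes over time, a real system would likely recompute these weights over a recent window rather than over all history.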

Page 28: Recommendations and Discovery at StumbleUpon

Recommendation context

Page 29: Recommendations and Discovery at StumbleUpon

Pillars of good recommendations

Understand who the user is and what he is interested in.

Separate good content from the bad.

Learn from your recommendations.

Explore various techniques for matching users to content.

Page 30: Recommendations and Discovery at StumbleUpon

Two Main Signals from Recommendation

• Rating

• Time Spent

Both present numerous challenges . . .

Page 31: Recommendations and Discovery at StumbleUpon

Ratings: volume decay

[Chart: # Ratings vs Time]

Users rate more during their initial experience

Why is this happening?

Page 32: Recommendations and Discovery at StumbleUpon

Time Spent

• Ratings are sparse

• < 10% of recommendations have explicit ratings.

• Use time spent to decide whether the stumble was skipped

• Timespent on videos is longer than images.

• Solution: Estimate p(Like | Timespent)

• Model based on user, content patterns

[Diagram: a sequence of stumbles with time spent: Video (T1 sec), Images (T2 sec), Video (T3 sec), Text (T4 sec), Images (T5 sec); some marked "?"]
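A minimal sketch of estimating p(Like | Timespent) is a smoothed frequency table keyed by content type and a time bucket, fitted on the small share of stumbles that do have explicit ratings. The bucket boundaries, the Laplace smoothing, and the toy data are assumptions; the slide only says the model is based on user and content patterns.

```python
from collections import defaultdict


def bucket(seconds):
    """Coarse time-spent buckets; the boundaries are arbitrary."""
    if seconds < 5:
        return "0-5s"
    if seconds < 20:
        return "5-20s"
    return "20s+"


def fit(rated_stumbles, alpha=1.0):
    """rated_stumbles: iterable of (content_type, seconds, liked).

    Returns {(content_type, bucket): smoothed p(like)}.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [likes, total]
    for content_type, seconds, liked in rated_stumbles:
        key = (content_type, bucket(seconds))
        counts[key][0] += int(liked)
        counts[key][1] += 1
    return {
        key: (likes + alpha) / (total + 2 * alpha)
        for key, (likes, total) in counts.items()
    }


def p_like(model, content_type, seconds, default=0.5):
    """Look up p(like) for an unrated stumble; fall back to the prior."""
    return model.get((content_type, bucket(seconds)), default)


rated = [
    ("video", 40, True), ("video", 35, True), ("video", 8, False),
    ("image", 8, True), ("image", 3, False),
]
model = fit(rated)
# The same 8 seconds means different things for a video and an image.
print(p_like(model, "video", 8), p_like(model, "image", 8))
```

Keying on content type is what lets the estimate reflect that time spent on videos is longer than on images.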

Page 33: Recommendations and Discovery at StumbleUpon

Challenges: Time spent on different devices

[Chart axes: median time spent per stumble and 5th percentile time spent per stumble; points: Installed plugin, Stumble Bar, Mobile / Tablets]

Page 34: Recommendations and Discovery at StumbleUpon

Pillars of good recommendations

Understand who the user is and what he is interested in.

Separate good content from the bad.

Learn from your recommendations.

Explore various techniques for matching users to content.

Page 35: Recommendations and Discovery at StumbleUpon

How do we know we are doing a good job?

Page 36: Recommendations and Discovery at StumbleUpon

Extensive A/B Testing

A/B tests on metrics such as session length, retention, and rating behavior
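For a rate metric such as retention, an A/B comparison can be evaluated with a two-proportion z-test, sketched below with only the standard library. The user counts are invented for illustration, and the slides do not say which statistical test StumbleUpon uses.

```python
from math import erf, sqrt


def two_proportion_z(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference between two proportions.

    Returns (z, p_value); positive z means variant B has the higher rate.
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Standard normal CDF via erf, then the two-sided tail probability.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return z, 2 * (1 - cdf)


# Made-up example: 400 of 1000 control users retained vs 460 of 1000 in B.
z, p_value = two_proportion_z(400, 1000, 460, 1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

Metrics that are not rates, such as session length, would need a different test (for example a t-test or a bootstrap).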

Page 37: Recommendations and Discovery at StumbleUpon

Measurable Improvements In Rec Quality

[Chart: Normalized Likes vs Dislikes, Recent Months; x-axis from 12/1/08 to 8/1/12 in two-month steps, y-axis from 0 to 16; trend line with R² = 0.737311794306772]

+111% improvement!

Page 38: Recommendations and Discovery at StumbleUpon

Many other interesting problems…

• Dupe detection

• Anti-spam

• News

• Topic classification

• Metrics, quality analysis

• Trending

• Search

• User biases, mood

• Many more…

We are HIRING !!!