haventreddityet demo

13
THE PROBLEM LIVE DEMO ALGO DATA THANK YOU! BONUS SLIDES haventreddityet.com A reddit rec om men da tion discovery engine James Douglas Pearce 1 / 13

Transcript of haventreddityet demo

THE PROBLEM LIVE DEMO ALGO DATA THANK YOU! BONUS SLIDES

haventreddityet.comA reddit recommendation discovery engine

James Douglas Pearce

1 / 13

THE PROBLEM LIVE DEMO ALGO DATA THANK YOU! BONUS SLIDES

HOW CAN I DISCOVER NEW SUBREDDITS?

I Results seem to be skewed by a few very popular subreddits...I Simple listI No contextI Does not encourage discovery!

2 / 13

THE PROBLEM LIVE DEMO ALGO DATA THANK YOU! BONUS SLIDES

Live Demo

3 / 13

THE PROBLEM LIVE DEMO ALGO DATA THANK YOU! BONUS SLIDES

ALGORITHM

I Jaccard similarity: overlap of user in subreddits divided by totalI Categories are defined as a set of example subreddits

4 / 13

THE PROBLEM LIVE DEMO ALGO DATA THANK YOU! BONUS SLIDES

GRAPH GERNERATION

I Subgraphs generated on the fly with a series of random walksI Robert Frost rule: take the road less traveledI Hipster rule: Penalize popular, i.e. highly connected nodes

5 / 13

THE PROBLEM LIVE DEMO ALGO DATA THANK YOU! BONUS SLIDES

MORE ABOUT ME

Research:I Searching for Dark Matter at

the LHCI Big Data!I Specialize in data mining and

machine learningHobbies:

I Kaggle competitionsI PokerI MotorcyclesI Boardgames

6 / 13

THE PROBLEM LIVE DEMO ALGO DATA THANK YOU! BONUS SLIDES

Questions?

7 / 13

THE PROBLEM LIVE DEMO ALGO DATA THANK YOU! BONUS SLIDES

UNIQUE CATEGORIES?

8 / 13

THE PROBLEM LIVE DEMO ALGO DATA THANK YOU! BONUS SLIDES

REDDIT BY CATEGORIES

9 / 13

THE PROBLEM LIVE DEMO ALGO DATA THANK YOU! BONUS SLIDES

JACCARD COEFFICIENT

The Jaccard coefficient can be used as a user-user type similarity measure.

JB(A) =|A ∩ B||A ∪ B|

(1)

I A is the set of redditors in subreddit A,I B is the set of redditors in subreddit B.

JCatC (A) =

1|C|

∑Bi∈C

JBi(A) (2)

I C is the set of sets of redditors in subreddits Bi in category C,I |C| is the size of set C (number of sets).

10 / 13

THE PROBLEM LIVE DEMO ALGO DATA THANK YOU! BONUS SLIDES

CRAWLING REDDIT WITH THE PRAW API

I reddit.com is HUGE, I can onlytake a small sample

I The redditor-subreddit matrix Iam sampling from is sparse

I I want to collect informationabout as many subreddits aspossible

I I want my redditor sets tooverlap

Strategy:I Collect “smart” dataI Use “dumb” (Jaccard)

algorithm

11 / 13

THE PROBLEM LIVE DEMO ALGO DATA THANK YOU! BONUS SLIDES

MODEL VALIDATION

I For each redditor hold out one subreddit they are part ofI Make a list of the top N subreddits based on Jacquard similarityI Calculate the accuracy of that recommendation as a function of N12 / 13

THE PROBLEM LIVE DEMO ALGO DATA THANK YOU! BONUS SLIDES

SUBGRAPH GENERATOR

Ptransition(ni) ∝ Jcanada(ni) · αntrans · βncon (3)

I α ∈ (0, 1) is a transition decay factorI ntrans is the number of times ni has been traversedI β ∈ (0, 1) is a connectivity decay factorI ncon is ni number of connections (degree)

13 / 13