haventreddityet demo
-
Upload
james-pearce -
Category
Data & Analytics
-
view
20 -
download
0
Transcript of haventreddityet demo
THE PROBLEM LIVE DEMO ALGO DATA THANK YOU! BONUS SLIDES
haventreddityet.comA reddit recommendation discovery engine
James Douglas Pearce
1 / 13
THE PROBLEM LIVE DEMO ALGO DATA THANK YOU! BONUS SLIDES
HOW CAN I DISCOVER NEW SUBREDDITS?
I Results seem to be skewed by a few very popular subreddits...I Simple listI No contextI Does not encourage discovery!
2 / 13
THE PROBLEM LIVE DEMO ALGO DATA THANK YOU! BONUS SLIDES
Live Demo
3 / 13
THE PROBLEM LIVE DEMO ALGO DATA THANK YOU! BONUS SLIDES
ALGORITHM
I Jaccard similarity: overlap of user in subreddits divided by totalI Categories are defined as a set of example subreddits
4 / 13
THE PROBLEM LIVE DEMO ALGO DATA THANK YOU! BONUS SLIDES
GRAPH GERNERATION
I Subgraphs generated on the fly with a series of random walksI Robert Frost rule: take the road less traveledI Hipster rule: Penalize popular, i.e. highly connected nodes
5 / 13
THE PROBLEM LIVE DEMO ALGO DATA THANK YOU! BONUS SLIDES
MORE ABOUT ME
Research:I Searching for Dark Matter at
the LHCI Big Data!I Specialize in data mining and
machine learningHobbies:
I Kaggle competitionsI PokerI MotorcyclesI Boardgames
6 / 13
THE PROBLEM LIVE DEMO ALGO DATA THANK YOU! BONUS SLIDES
JACCARD COEFFICIENT
The Jaccard coefficient can be used as a user-user type similarity measure.
JB(A) =|A ∩ B||A ∪ B|
(1)
I A is the set of redditors in subreddit A,I B is the set of redditors in subreddit B.
JCatC (A) =
1|C|
∑Bi∈C
JBi(A) (2)
I C is the set of sets of redditors in subreddits Bi in category C,I |C| is the size of set C (number of sets).
10 / 13
THE PROBLEM LIVE DEMO ALGO DATA THANK YOU! BONUS SLIDES
CRAWLING REDDIT WITH THE PRAW API
I reddit.com is HUGE, I can onlytake a small sample
I The redditor-subreddit matrix Iam sampling from is sparse
I I want to collect informationabout as many subreddits aspossible
I I want my redditor sets tooverlap
Strategy:I Collect “smart” dataI Use “dumb” (Jaccard)
algorithm
11 / 13
THE PROBLEM LIVE DEMO ALGO DATA THANK YOU! BONUS SLIDES
MODEL VALIDATION
I For each redditor hold out one subreddit they are part ofI Make a list of the top N subreddits based on Jacquard similarityI Calculate the accuracy of that recommendation as a function of N12 / 13
THE PROBLEM LIVE DEMO ALGO DATA THANK YOU! BONUS SLIDES
SUBGRAPH GENERATOR
Ptransition(ni) ∝ Jcanada(ni) · αntrans · βncon (3)
I α ∈ (0, 1) is a transition decay factorI ntrans is the number of times ni has been traversedI β ∈ (0, 1) is a connectivity decay factorI ncon is ni number of connections (degree)
13 / 13