Enhance micro-blogging recommendations of posts with an homophily-based graph
Quentin Grossetti (1,2)
Supervised by Cedric du Mouza (2), Camelia Constantin (1) and Nicolas Travers (2)
(1) LIP6 - Universite Pierre Marie Curie - Paris, France
(2) CEDRIC Laboratory - CNAM - Paris, France
BDA - November 2017
Enhance micro-blogging recommendations of posts with an homophily-based graph BDA - Novembre 2017 1 / 31
Introduction: Context
Growth of microblogging platforms since 2000
700 million messages/day in 2017
300 million messages/day in 2017
70 million publications/day in 2017
70 million pictures/day in 2017
Introduction: Real-life examples
Finding Users of Interest in Micro-blogging Systems (EDBT 2016)
Problem
How to connect users to relevant messages?
Recommendation of messages
700M new messages every day
300M users
Real time
Table of contents
1 State of the art
2 Data Analysis: Topology, Retweets, Homophily
3 Approach: Similarity graph, Propagation Model
4 Experiments: Protocol, Results, Updating strategies
5 Conclusion
6 Annexes
State of the art
Method | Pros | Cons
Content-based [Lops (2011)] | no need for interactions | tweets are hard to describe
Collaborative filtering [Schafer (2007)] | simple model and good results | matrix too large
Matrix factorization [Koren (2009)] | efficient against sparsity | matrix grows too fast
Hybrid systems [Bostandjiev (2010)] | increase user engagement | relationships are hard to describe
Random walk models [Sharma (2016)] | very cheap | low memory
State of the art: Not only recommendations
User recommendation (topology, content-based, demographic, etc.)
Hashtag (Bayesian model, Euclidean...)
Timeline filtering (deep learning)
Few papers on tweet recommendation, except Twitter's in 2016
Data Analysis: Dataset
Updated connected component from the graph found in [Kwak (2009)].
No. of nodes | 2,182,867
No. of edges | 325,451,980
No. of tweets | 2,571,173,369
Avg. out-degree | 57.8
Avg. in-degree | 69.4
Max out-degree | 348,595
Max in-degree | 185,401
Diameter | 15
Average shortest path | 3.7
Table – Twitter dataset characteristics
Data Analysis: Topology
Figure – Twitter smallest paths distribution (x-axis: smallest path length; y-axis: number of paths, log scale)
Small world with an average distance of 3.7
Data Analysis: Retweets
Figure – Distribution of the number of retweets per tweet (x-axis: number of retweets, binned as 0, 1, 2-5, 6-50, 51-200, 201-500, 500+; y-axis: number of tweets, log scale)
1 retweet - 7%
2-5 retweets - 1%
6+ retweets - 0.2%
Data Analysis: Lifespan
Figure – Lifespan of a message (x-axis: lifespan in hours; y-axis: number of messages, log scale)
< 1 hour: 40%
< 3 days: 90%
Data Analysis: Homophily
Distance | No. of users | % | Mean similarity
1 | 3,229 | 2.65 | 0.0085
2 | 32,668 | 26.86 | 0.0014
3 | 81,645 | 67.13 | 0.0009
4 | 3,820 | 3.14 | 0.0010
5 | 43 | 0.03 | 0.0014
6 | 1 | 0 | 0.0008
Impossible | 216 | 0.18 | 0.0017
Table – Evolution of the similarity score with distance in the network
\[ sim(u, v) = \frac{\sum_{i \in L_u \cap L_v} \frac{1}{\log(1 + pop(i))}}{|L_u \cup L_v|} \tag{1} \]
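As a minimal sketch, equation (1) can be computed as below. Several points are assumptions, not statements from the slides: L_u is taken to be the set of message ids user u has shared or liked, pop(i) is the number of users who shared message i (so pop(i) ≥ 1 for any shared message), and the logarithm is taken in base 2, which keeps every term at most 1. The function name and the toy data are hypothetical.

```python
import math


def similarity(items_u, items_v, pop):
    """Equation (1): popularity-discounted overlap between two users.

    items_u, items_v: sets of message ids (assumed meaning of L_u, L_v).
    pop: dict mapping a message id to its popularity, assumed >= 1.
    """
    union = items_u | items_v
    if not union:
        return 0.0
    shared = items_u & items_v
    # Each shared message contributes 1 / log(1 + pop(i)):
    # rarely shared messages weigh more than very popular ones.
    # Base 2 is an assumption; the slides do not state the base.
    score = sum(1.0 / math.log2(1 + pop[i]) for i in shared)
    return score / len(union)


# Hypothetical toy data.
pop = {"t1": 1, "t2": 9, "t3": 99}
u = {"t1", "t2"}
v = {"t2", "t3"}
print(similarity(u, v, pop))
```

Dividing by the size of the union means that users with large, mostly disjoint histories get low scores even when they share a few messages, which is in line with the very small mean similarities reported in the table.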
Data Analysis: Homophily
Figure: average similarity score (×10^-2) versus position in the ranking
Distances distribution (%)
Rank | Average distance | Distance 1 | Distance 2 | Distance 3 | Distance 4
1 | 1.55 | 57.03 | 31.53 | 10.64 | 0.8
2 | 1.68 | 49.60 | 33.13 | 16.87 | 0.4
3 | 1.8 | 42.45 | 36.02 | 20.72 | 0.8
4 | 1.86 | 38.71 | 38.71 | 20.56 | 2.02
5 | 1.98 | 31.44 | 40.16 | 27.59 | 0.81
Table – Link between distance in the network and position in the Top-N ranking
Data Analysis: Conclusions
Several conclusions from this analysis:
Freshness is crucial (messages die very fast) ⇒ real-time recommendation
Few users have high similarity ⇒ use transitivity
Distance 2 successfully gathers important users ⇒ rely on this homophily
Similarity Graph: Building process
Figure – Twitter Graph (nodes U, V, W, X, Y, Z, Z1, Z2, Z3, Z4)
Figure – Similarity Graph (U linked to V, Y and Z1, with edge weights sim(u, v), sim(u, y) and sim(u, z1))
Similarity Graph: Characteristics
 | Twitter Network | Similarity Graph
No. of nodes | 2,182,867 | 1,149,374
No. of edges | 325,451,980 | 4,950,417
Avg. similarity score | - | 0.008
Mean out-degree | 57.8 | 5.9
Table – Similarity Graph Characteristics
Propagation Model: In a nutshell
\[ p(u, t) = \frac{\sum_{v \in F_u} p(u \leftarrow v, t)}{|F_u|} \tag{2} \]
With $F_u$ the set of users influential to u and $p(u \leftarrow v, t)$ an estimate of the probability that u likes t, determined by the behavior of user v.
\[ p(u \leftarrow v, t) = p(v, t) \times sim(u, v) \tag{3} \]
Propagation Model: Example
Figure – Propagation example (nodes U, V, W, X, Y; edge weights 0.3, 0.5, 0.1, 0.5, 0.4, 0.8), shown in four stages:
A tweet t1 is published
X shares/likes t1:
\[ p(x, t_1) = 1 \]
Propagation to W:
\[ p(w, t_1) = \frac{\sum_{v \in F_w} p(w \leftarrow v, t_1)}{|F_w|} = \frac{0 + 1 \times 0.5}{2} = 0.25 \]
Propagation to U:
\[ p(u, t_1) = \frac{0.25 \times 0.5}{2} = 0.0625 \]
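The two propagation steps of this example can be reproduced with a short sketch of equations (2) and (3). The slides do not say which edge carries which weight, so the mapping below is an assumption chosen to match the published numbers: W is influenced by X (0.5) and Y (0.8), and U by W (0.5) and V (0.3). The function name `propagate_once` is hypothetical.

```python
def propagate_once(p, influencers):
    """One synchronous pass of equations (2) and (3).

    p: dict user -> current estimate p(user, t).
    influencers: dict user -> {influencer: sim(user, influencer)},
                 i.e. the set F_u with its similarity weights.
    Users absent from `influencers` (e.g. the sharer X) keep their value.
    """
    updated = dict(p)
    for u, sims in influencers.items():
        if not sims:
            continue
        # p(u <- v, t) = p(v, t) * sim(u, v), averaged over F_u.
        total = sum(p.get(v, 0.0) * s for v, s in sims.items())
        updated[u] = total / len(sims)
    return updated


# Assumed edge assignment (not given explicitly in the slides).
influencers = {
    "w": {"x": 0.5, "y": 0.8},
    "u": {"w": 0.5, "v": 0.3},
}

p0 = {"u": 0.0, "v": 0.0, "w": 0.0, "x": 1.0, "y": 0.0}  # X shares t1
p1 = propagate_once(p0, influencers)  # reaches W
p2 = propagate_once(p1, influencers)  # reaches U
print(p1["w"], p2["u"])  # 0.25 0.0625
```

Each pass only pushes the score one hop further, which is why U receives a non-zero value on the second pass and not on the first.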
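Propagation Model: Convergence
Let $(u_1, u_2, \dots, u_n)$ be the n users:
\[
\begin{aligned}
a_{11} p_{u_1} + a_{12} p_{u_2} + \dots + a_{1n} p_{u_n} &= b_1 \\
a_{21} p_{u_1} + a_{22} p_{u_2} + \dots + a_{2n} p_{u_n} &= b_2 \\
&\;\;\vdots \\
a_{n1} p_{u_1} + a_{n2} p_{u_2} + \dots + a_{nn} p_{u_n} &= b_n
\end{aligned}
\]
This can also be written as $Ap = b$, with the rows and columns of A indexed by $u_1, \dots, u_n$:
\[
A = \begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n1} & a_{n2} & \cdots & a_{nn}
\end{pmatrix},
\quad
p = \begin{pmatrix} p(u_1) \\ p(u_2) \\ \vdots \\ p(u_n) \end{pmatrix},
\quad
b = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{pmatrix}
\]
Because $\forall u, v,\ sim(u, v) \le 1$, we have $|a_{ii}| \ge \sum_{j \ne i} |a_{ij}|$ for every i, so the matrix A is diagonally dominant.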
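The sketch below illustrates the linear-system view on a toy graph. The slides do not give the coefficients, so the construction is an assumption derived from equations (2) and (3): a_ii = 1, a_ij = -sim(u_i, u_j) / |F_{u_i}| for each influencer u_j of u_i, and b_i = 1 for users who shared the tweet (0 otherwise). The solver is a plain Jacobi iteration, a standard method that converges when A is strictly diagonally dominant; it is swapped in for illustration and is not claimed to be the talk's implementation. All names and the toy weights are hypothetical.

```python
def build_system(users, influencers, seeds):
    """Build A and b from equations (2)-(3) under the stated assumptions."""
    idx = {u: k for k, u in enumerate(users)}
    n = len(users)
    A = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    for u in users:
        i = idx[u]
        A[i][i] = 1.0
        if u in seeds:
            b[i] = 1.0  # p(u, t) = 1 for users who shared t
            continue
        sims = influencers.get(u, {})
        for v, s in sims.items():
            A[i][idx[v]] = -s / len(sims)
    return A, b


def is_strictly_diagonally_dominant(A):
    """Check |a_ii| > sum_{j != i} |a_ij| on every row."""
    for i, row in enumerate(A):
        off_diagonal = sum(abs(x) for j, x in enumerate(row) if j != i)
        if abs(row[i]) <= off_diagonal:
            return False
    return True


def jacobi(A, b, tol=1e-12, max_iter=1000):
    """Jacobi iteration: each unknown is updated from the previous iterate."""
    n = len(b)
    p = [0.0] * n
    for _ in range(max_iter):
        new = [
            (b[i] - sum(A[i][j] * p[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)
        ]
        if max(abs(x - y) for x, y in zip(new, p)) < tol:
            return new
        p = new
    return p


users = ["u", "v", "w", "x", "y"]
influencers = {"w": {"x": 0.5, "y": 0.8}, "u": {"w": 0.5, "v": 0.3}}
A, b = build_system(users, influencers, seeds={"x"})
solution = dict(zip(users, jacobi(A, b)))
print(is_strictly_diagonally_dominant(A), solution["w"], solution["u"])
```

On this toy graph every off-diagonal row sum is below 1, so dominance is strict and the iteration settles on the same values as the step-by-step example (0.25 for W, 0.0625 for U).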
Propagation Model: Optimizations
Speed up the convergence
Let $\Delta(u, t) = p(u, t)^{k+1} - p(u, t)^{k}$ be the change of the score between iterations k and k+1
If $\Delta(u, t) < \beta$, we stop the propagation
Limitation of popular messages
If $p(u, t) < f(t)$, no need to propagate, with
\[ f(t) = 1 - \frac{k^p}{k^p + pop(t)^p} \]
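A minimal sketch of the two pruning rules follows. It reads the slide's threshold as f(t) = 1 - k^p / (k^p + pop(t)^p), with k and p as tuning constants; that reading is a reconstruction of a garbled formula, and the parameter values, function names and the way both tests are combined are assumptions made for illustration.

```python
def popularity_threshold(pop_t, k, p):
    """f(t) = 1 - k^p / (k^p + pop(t)^p).

    Grows from 0 (unpopular message) toward 1 (very popular message),
    so popular messages need a higher score to keep propagating.
    """
    return 1.0 - k**p / (k**p + pop_t**p)


def should_propagate(prev_score, new_score, pop_t, beta, k, p):
    """Apply both rules: the convergence test and the popularity test."""
    # Rule 1: the score barely moved between two iterations.
    if new_score - prev_score < beta:
        return False
    # Rule 2: the score is below the popularity-dependent threshold.
    if new_score < popularity_threshold(pop_t, k, p):
        return False
    return True


# Hypothetical settings: beta = 0.01, k = 100, p = 2.
print(popularity_threshold(100, k=100, p=2))               # 0.5
print(should_propagate(0.0, 0.25, 10, 0.01, 100, 2))       # True
print(should_propagate(0.25, 0.2505, 10, 0.01, 100, 2))    # False
print(should_propagate(0.0, 0.25, 1000, 0.01, 100, 2))     # False
```

With these settings, a message shared 10 times passes with a score of 0.25, while the same score is rejected once the message has been shared 1000 times.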
Experiments: Protocol
130 million messages shared at least twice
Split the ranked set 90% - 10%
Compute recommendations during this 10% for 1500 random users (500 small, 500 medium, 500 big)
Comparison with
CF: naive collaborative filtering
Bayes: probabilistic model
GraphJet: the solution used by Twitter
Experiments: Hits
Figure – Hits for 1500 users (x-axis: number of daily recommendations per user; y-axis: number of hits (×10^4); series: Bayes, CF, GraphJet, SimGraph)
Linear growth of CF
Fast growth for SimGraph
GraphJet stuck around 5000 hits
Experiments: Hits according to user profiles
Figure – 500 small users, 500 medium users, 500 big users (three panels; x-axis: number of daily recommendations per user; y-axis: number of hits; series: Bayes, CF, GraphJet, SimGraph)
small < 50; medium < 1000; big > 1000
Trends are very stable regardless of the user profile
Experiments: Hits accuracy
Figure – Hits popularity (x-axis: number of daily recommendations per user; y-axis: average number of shares, log scale; series: Bayes, CF, GraphJet, SimGraph)
Bayes targets close messages
GraphJet targets popular messages
CF and SimGraph mix both popular and close messages
Experiments: F1 scores
Figure – F1 Scores (x-axis: number of daily recommendations per user; y-axis: F1 score (×10^-2); series: Bayes, CF, GraphJet, SimGraph)
Small values
Peak around 20 recommendations
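For reference, hits and F1 can be computed per user as in the sketch below. It assumes that a hit is a recommended message the user actually shared during the test period; the slides do not spell out the definition, so this is an illustrative reading, and the function name and data are hypothetical.

```python
def hits_and_f1(recommended, actually_shared):
    """Return (hits, F1) for one user.

    recommended: iterable of message ids recommended to the user.
    actually_shared: iterable of message ids the user shared in the test period.
    """
    rec = set(recommended)
    truth = set(actually_shared)
    hits = len(rec & truth)
    if hits == 0:
        return 0, 0.0
    precision = hits / len(rec)   # share of recommendations that were hits
    recall = hits / len(truth)    # share of the user's shares that were found
    return hits, 2 * precision * recall / (precision + recall)


# Hypothetical example: 4 recommendations, 3 messages actually shared.
hits, f1 = hits_and_f1(["t1", "t2", "t3", "t4"], {"t2", "t4", "t9"})
print(hits, round(f1, 4))
```

Since users share only a handful of messages while receiving tens of recommendations, precision stays low as the recommendation budget grows, which fits the small F1 values and the early peak noted above.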
Experiments: Running time
Method | Init. (per user) | Init. total time (1,149,374 users) | Time (per message) | Total time, 70 cores in parallel (13,238,941 tweets, trial period) | Total time (init + recos)
Bayes | 10 ms | 0.04 h | 975 ms | 51.22 h | 51.26 h
CF | 8,583 ms | 39.40 h | 0.5 ms | 0.02 h | 41.01 h
SimGraph | 311 ms | 1.41 h | 38 ms | 2.00 h | 3.41 h
Method | Init. (per user) | Init. total time (1,149,374 users) | Time (per user) | Total time, 70 cores in parallel (1,149,374 users * 66 days, trial period) | Total time (init + recos)
GraphJet | 0 ms | 0 h | 14 ms | 4.2 h | 4.2 h
Table – Initialization and recommendation time (per-item times in ms, totals in hours)
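The totals in this table are consistent with a simple estimate: per-item time multiplied by the number of items, divided by the 70 cores, then converted to hours. The sketch below checks a few entries under that assumption (perfect parallel speed-up, which the slides do not state explicitly); the helper name is hypothetical.

```python
CORES = 70
MS_PER_HOUR = 3_600_000


def total_hours(ms_per_item, n_items, cores=CORES):
    """Wall-clock hours for n_items at ms_per_item each, spread over `cores`."""
    return ms_per_item * n_items / cores / MS_PER_HOUR


users = 1_149_374
tweets = 13_238_941

print(round(total_hours(975, tweets), 2))      # Bayes, recommendations
print(round(total_hours(38, tweets), 2))       # SimGraph, recommendations
print(round(total_hours(311, users), 2))       # SimGraph, initialization
print(round(total_hours(14, users * 66), 2))   # GraphJet, 66 days of per-user work
```

The results land within about 0.01 h of the reported 51.22 h, 2.00 h, 1.41 h and 4.2 h.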
Experiments: Updating strategies
How to update SimGraph?
Split the last 10% in two
Evaluate the impact on hit prediction for the remaining 5%:
do nothing
recompute everything
update only weights
crossfold
Experiments: Updating strategies
Figure – Hits / updating strategies (x-axis: number of daily recommendations per user; y-axis: number of hits; series: recompute everything, do nothing, crossfold, update weights)
Doing nothing is the same as updating weights
Crossfold (very cheap) works very well
Experiments: Convergence property of the SimGraph
Iteration | Number of edges
1 | 4,950,417
2 | 7,519,031
3 | 10,836,129
4 | 11,496,445
5 | 11,678,747
Table – Evolution of the number of edges through iterations
Conclusion: Contributions
Construction and analysis of a large Twitter dataset
Method relying on homophily to find nearest neighbors at low cost
Construction and optimization of a convergent propagation model
Comparison of the recommendations made by our model with state-of-the-art solutions
Possibility for the model to be updated at low cost
Conclusion: Future work
Densify points of comparison between users
Burst recommendation bubbles
Work on the crossfold convergence of the model
Add a popularity prediction optimization
Thanks for your attention!
Annexes
Annexes: Lifespan and popularity
Figure – Correlation between lifespan and popularity (x-axis: average lifespan in hours, log scale; y-axis: average number of retweets, log scale)
Strong correlation up to 10^3 hours
After a month, the correlation fades
Annexes: Topology
Figure – Smallest path distribution for the similarity graph (x-axis: shortest distance; y-axis: number of paths, log scale)
Diameter of 21 for an average path of 7.5
Annexes: Similarities
Figure – Score similarity evolution (x-axis: position in the ranking; y-axis: average score (×10^-2))
Really weak scores
Breaks after the fifth most similar user
Annexes: Intersections
Figure – Share of hits included in SimGraph (x-axis: number of daily recommendations per user; y-axis: ratio of hits in common with SimGraph; series: Bayes, CF, GraphJet, SimGraph)
Annexes: Number of recommendations
Figure – Recall capacity (x-axis: number of daily recommendations per user; y-axis: number of actual recommendations; series: Bayes, CF, GraphJet, SimGraph)
CF is less limited
Other methods are bunched together
Threshold effect for SimGraph and Bayes