Enhance micro-blogging recommendations of posts with an ... · 1 State of the art 2 Data Analysis...

Enhance micro-blogging recommendations of posts withan homophily-based graph

Quentin Grossetti1,2

Supervised by Cedric du Mouza2,Camelia Constantin1 and Nicolas Travers2

1LIP6 - Universite Pierre Marie Curie - Paris, France

2CEDRIC Laboratory - CNAM - Paris, France

BDA - Novembre 2017

Enhance micro-blogging recommendations of posts with an homophily-based graph BDA - Novembre 2017 1 / 31

IntroductionContext

Growth of microblogging plateforms since 2000

700 millions of messages/day in 2017

300 millions of messages/day in 2017

70 millions of publications/day in 2017

70 millions of pictures/day in 2017


IntroductionReal life examples

Finding Users of Interest in Micro-blogging Systems (EDBT 2016)


Problem

How to connect users to relevant messages ?

Recommendation of messages

700M new messages every day

300M of users

Real time


Table of contents

1 State of the art

2 Data AnalysisTopologyRetweetsHomophily

3 ApproachSimilarity graphPropagation Model

4 ExperimentsProtocolResultsUpdating strategies

5 Conclusion

6 Annexes


State of the art

State of the art

Content-based [Lops (2011)]

Method Pros ConsContent-based No need of interactions tweets are hard to describe

Collaborative filtering simple model and good results too large matrixMatrix Factorization efficient to fight sparsity matrix growing too fast

Hybrid systems increase user engagement hard to describe relationshipRandom walks models very cheap low memory


State of the art

State of the art

Collaborative filtering [Schafer (2007)]


Collaborative filtering simple model and good results too large matrix

Matrix Factorization efficient to fight sparsity matrix growing too fastHybrid systems increase user engagement hard to describe relationship

Random walks models very cheap low memory


State of the art

State of the art

Matrix Factorization [Koren (2009)]





State of the art

State of the art

Hybrid systems [Bostandjiev (2010)]



Hybrid systems increase user engagement hard to describe relationship

Random walks models very cheap low memory


State of the art

State of the art

Random walks models [Sharma (2016)]





State of the art

State of the artNot only recommendations

User recommendation(topology,content-based,demographic etc...)

Hashtag (Bayesian model,euclidien...)

Timeline Filtering (DeepLearning)

Few papers on tweetsrecommendation except Twitterin 2016


Data Analysis

Data AnalysisDataset

Updated connected component from the graph found in [Kwak (2009)].

No of nodes 2,182,867No of edges 325,451,980No of tweets 2,571,173,369

Avg. out-degree 57.8Avg. in-degree 69.4max out-degree 348,595max in-degree 185,401

Diameter 15Average shortest path 3.7

Table – Twitter dataset characteristics


Data Analysis Topology

Data AnalysisTopology

1 2 3 4 5 10 15

100

102

104

106

108

1010

Smallest path

Nu

mb

erof

pat

hs

Figure – Twitter smallest paths distribution

Small world with averagedistance of 3.7


Data Analysis Retweets

Data AnalysisRetweets

0 1 2-5 6-50 51-200201-500 500+

103

104

105

106

107

108

109

1010

Number of retweets

Nu

mb

erof

twee

ts

Figure – Distribution of the number ofretweets per tweet

1 retweet - 7%

2-5 retweets - 1%

6+ - 0,2%


Data Analysis Retweets

Data AnalysisLifespan

10 100 500 1,000102

103

104

105

106

107

Lifespan (in hours)

Nb

ofm

essa

ges

Figure – Lifespan of a message

< 1hour : 40%

< 3days : 90%


Data Analysis Homophily

Data AnalysisHomophily

Distance No of users % Mean similarity

1 3 229 02,65 0,00852 32 668 26,86 0,00143 81 645 67,13 0,00094 3 820 03,14 0,00105 43 00,03 0,00146 1 0 0,0008

Impossible 216 0,18 0,0017

Table – Evolution of the similarity score through distance in the network

sim(u, v) =

∑i∈Lu∩Lv

1log(1+pop(i))

|Lu ∪ Lv |(1)



Data AnalysisHomophily

0 5 10 15 20 250

0.5

·10−2

Position in the ranking

Ave

rage

scor

e

Distances distribution (%)

Rank Average Distance 1 2 3 4

1 1,55 57,03 31,53 10,64 0,82 1,68 49,60 33,13 16,87 0,43 1,8 42,45 36,02 20,72 0,84 1,86 38,71 38,71 20,56 2,025 1,98 31,44 40,16 27,59 0,81

Table – Link beetween distance in the network and position in the Top-Nranking Top-NEnhance micro-blogging recommendations of posts with an homophily-based graph BDA - Novembre 2017 13 / 31


Data AnalysisConclusions

Many conclusions from this analysis :

Freshness is crucial (Messages dies very fast)⇒ real-time recommendation

Few users have high similarity⇒ use transitivity

Distance 2 successfully gather important users⇒ rely on this homophily


Approach Similarity graph

Similarity GraphBuilding process

U W

Z

V Y

X

Z1

Z2

Z3

Z4

Figure – Twitter Graph



Graphe de similariteExemple de construction

U W

Z

V Y

X

Z1

Z2

Z3

Z4





U W

Z

V Y

X

Z1

Z2

Z3

Z4




Graphe de similariteExemple de construction

U W

Z

V Y

X

Z1

Z2

Z3

Z4





U Y

V

Z1

sim(u, v)

sim(u, y)

sim(u, z1)

Figure – Similarity Graph



Similarity GraphCharacteristics

Twitter Network Similarity Graph

No of nodes 2 182 867 1 149 374No of edges 325,451,980 4 950 417

Avg. similarity score 0.008Mean out-degree 57.8 5.9

Table – Similarity Graph Characteristics



Propagation ModelIn a nutshell

p(u, t) =

∑v∈Fu

p(u ← v , t)

|Fu|(2)

With Fu the set of users influential to u and p(u ← v , t) a probabilityestimation that u likes t determined by the behavior of the user v .

p(u ← v , t) = p(v , t)× sim(u, v) (3)



Propagation ModelExample

U W

V Y

X

0.3

0.5

0.1

0.5

0.4 0.8

Figure – Propagation example


Approach Propagation Model


U W

V Y

X t1

0.3

0.5

0.1

0.5

0.4 0.8

Figure – Propagation example - a tweet t1 is published




U W

V Y

X t1

0.3

0.5

0.1

0.5

0.4 0.8

Figure – Propagation example - X shares/likes t1

p(x , t1) = 1




U W

V Y

X t1

0.3

0.5

0.1

0.5

0.4 0.8

Figure – Propagation example - Propagation

p(w , t1) =

∑v∈Fw

p(w←v ,t)

|Fw | = 0+1×0.52 = 0.25




U W

V Y

X t1

0.3

0.5

0.1

0.5

0.4 0.8

Figure – Propagation example - Propagation

p(u, t1) = 0.25×0.52 = 0.0625



Propagation ModelConvergence

Let n be users (u1, u2, ..., un) :a11pu1 + a12pu2 + ....+ a1npun = b1

a21pu1 + a22pu2 + ....+ a2npun = b2

... = ...an1pu1 + an2pu2 + ....+ annpun = bn

Could also be written as Ap = b with

A =

u1 u2 · · · un

u1 a11 a12 . . . a1n

u2 a21 a22 . . . a2n...

......

. . ....

un an1 an2 . . . ann

p =

p(u1)p(u2)

...p(un)

b =

b1

b2...bn

Because ∀u, v sim(u, v) ≤ 1, |ajj | ≥

∑j 6=i

|aij | for every i , the matrix A is

diagonally dominant.Enhance micro-blogging recommendations of posts with an homophily-based graph BDA - Novembre 2017 19 / 31


Propagation ModelOptimizations

Speed up the convergence

Let ∆(u, t1) = p(u, t)k+1 − p(u, t)k

If ∆(u, t1) < β we stop the propagation

Limitation of popular messages

If p(u, t) < f (t) no need to propagate.f (t) = 1− kp

kp+pop(t)p


Experiments Protocol

ExperimentsProtocol

130 Millions of messages shared at least twice

Split the ranked set 90% - 10%

Compute recommendation during this 10% for 1500 random users(500 small, 500 medium, 500 big)

Comparison with

CF : naive collaborative filteringBayes : probabilistic modelGraphJet : Twitter used solution


Experiments Results

ExperimentsHits

20 40 60 80 100 120 140 160 180 2000

0.5

1

1.5

2

2.5·104

Number of daily recommendations per user

Num

ber

ofhi

ts(×

104)

BayesCF

GraphJetSimGraph

Figure – Hits pour 1500 utilisateurs

Linear growth of CF

Fast growth forSimGraph

GraphJet stuck around5000 hits


Experiments Results

ExperimentsHits according to user profiles

20 40 60 80 100 120 140 160 180 2000

200

400

600

800


Num

ber

ofhi

ts

BayesCF

GraphJetSimGraph

Figure – 500 small

20 40 60 80 100 120 140 160 180 2000

1,000

2,000

3,000

4,000

5,000

6,000


BayesCF

GraphJetSimGraph

Figure – 500 medium

20 40 60 80 100 120 140 160 180 2000

0.5

1

1.5

·104


BayesCF

GraphJetSimGraph

Figure – 500 big users

small < 50 ; medium < 1000 ; big > 1000Tendencies are very stables no matter the profile of users


Experiments Results

ExperimentsHits accuracy

20 40 60 80 100 120 140 160 180 200

101

102


Avg

.nu

mb

erof

shar

es

BayesCF

GraphJetSimGraph

Figure – Hits popularity

Bayes targets closemessages

GraphJet targets popularmessages

CF and SimGraph aremixing both popular andclose messages


Experiments Results

ExperimentsF1 scores

20 40 60 80 100 120 140 160 180 2000

0.2

0.4

0.6

0.8

1·10−2


F1

Sco

re(×

10−

2)

BayesCF

GraphJetSimGraph

Figure – F1 Scores

Small values

Peak around 20recommendations


Experiments Results

ExperimentsRunning time

init. (per user) init total time time (per message) total time (70 cores //) total time1,149,374 users 13,238,941 Tweets (Trial period) init + recos

Bayes 10ms 0.04h 975ms 51.22h 51.26hCF 8,583ms 39.40h 0.5ms 0.02h 41.01h

SimGraph 311ms 1.41h 38ms 2.00h 3.41h

init. (per user) init total time time (per user) total time (70 cores //) total time1,149,374 users 1,149,374 users * 66 days (Trial period) init + recos

GraphJet 0ms 0h 14ms 4.2h 4.2h

Table – Initialization and recommendation time (in ms)


Experiments Updating strategies

ExperimentsUpdating strategies

How to update SimGraph ?

Split the last 10% in 2

Evaluate hits prediction impact for the remaining 5% :

do nothingrecompute everythingupdate only weightscrossfold



ExperimentsUpdating strategies

20 40 60 80 100 120 140 160 180 2000

1,000

2,000

3,000

4,000

5,000

6,000


Num

ber

ofhi

ts

recompute everythingdo nothingcrossfold

update weights

Figure – Hits / updating strategies

doing nothing is thesame as updatingweights

crossfold (very cheap)works very well



Experiments

Convergence property of the SimGraph

Iteration Number of edges

1 4 950 4172 7 519 0313 10 836 1294 11 496 4455 11 678 747

Table – Number of edges evolution through iterations


Conclusion

ConclusionContribution

Construction and analysis of a large Twitter dataset

Method relying on homophily to find nearest neighbors at low cost

Construction and optimization of a convergent propagation model

Comparison of the recommendations made by our model with state ofthe art solutions

Possibility for the model to be updated at low cost


Conclusion

ConclusionFuture works

Densify points of comparison between users

Burst recommendation bubbles

Work on the crossfold convergence of the model

Add a popularity prediction optimization


Conclusion

Thanks for you attention !


Annexes

ANNEXES


Annexes

AnnexesLifespan and popularity

100 101 102 103 104100

101

102

103

104

Avg. lifepan (hours)

Avg

.n

um

ber

ofre

twee

ts

Figure – Correlation between lifespan andpopularity

Strong correlation up to103 hours

After a month, thecorrelation fades


Annexes

AnnexesTopology

0 10 20100

101

102

103

104

105

106

107

108

Shortest distance

Nu

mb

erof

pat

hs

Figure – Smallest path distribution for thesimilarity graph

Diameter of 21 for anaverage path of 7.5


Annexes

AnnexesSimilarities

0 5 10 15 20 250

0.5

·10−2

Position dans le classement

Sco

rem

oyen

Figure – Score similarity evolution

Really weak scores

Breaks after the fifthmost similar user


Annexes

AnnexesIntersections

20 40 60 80 100 120 140 160 180 2000

0.2

0.4

0.6

0.8

1


Rat

ioof

hits

inco

mm

onw

ith

Sim

Gra

ph

BayesCF

GraphJetSimGraph

Figure – Parts of hits included in SimGraphEnhance micro-blogging recommendations of posts with an homophily-based graph BDA - Novembre 2017 31 / 31

Annexes

AnnexesNumber of recommendations

20 40 60 80 100 120 140 160 180 2000

20

40

60

80

100

120

140


Num

ber

ofac

tual

reco

mm

enda

tion

s

BayesCF

GraphJetSimGraph

Figure – Recall capacity

CF is less limited

Other methods arebunched together

Threshold effect forSimGraph and Bayes


Enhance micro-blogging recommendations of posts with an ... · 1 State of the art 2 Data Analysis...

Documents

Transcript of Enhance micro-blogging recommendations of posts with an ... · 1 State of the art 2 Data Analysis...