Mining Content Information Networks
[email protected]
2013 Master DMKM UPMC

Transcript of lecture slides.

Page 1:

Mining Content Information Networks

[email protected]

Master DMKM, UPMC

Page 2:

Content

• Link analysis
  o PageRank, HITS, ...
• Markov Chains
• Random Walks
• Learning on graphs
  o Classification, clustering
  o Transductive and inductive learning
• Theme and sentiment analysis
  o Latent models
  o Markov Chain Monte Carlo
• Graph mining
  o Community detection
  o Diffusion in graphs
• Recommendation
  o Collaborative recommendation
  o Singular value decomposition, non-negative matrix factorization
  o Ranking, etc.

Page 3:

Content Information Networks

• Web
  o Hyperlinks
• Social networks
  o Friends, comments, tags, metadata (date, geo-localization, etc.)
• Bibliographical networks
  o Authors, co-authors, conferences, editor site, metadata, ...
• Blogs
  o Comments, messages, backlinks, linkbacks
  o Micro-blogging: followers
• E-mails
  o To, from, subject, date, etc.
• Any collection of content elements with relations
  o Images, video, texts, ...
  o Implicit relations based on similarities
• Collaborative recommendation networks

Page 4:

Examples

• Enron e-mail
• 11 K Web hosts (Webspam)
• Wikipedia theme classification
• Flickr friendship network

Page 5:

Heterogeneous network

Page 6:

• Modeling
  o Graphs
    • Nodes are content elements
    • Links represent relations
• Characteristics
  o Content elements
    • May be of different types (heterogeneous)
  o Relations
    • Simple
    • Homogeneous
    • Heterogeneous
    • Multiple
    • Directed / undirected
  o Static or dynamic networks

Page 7:

• Needs
  o Structural characteristics of the network
  o Dynamics
    • Network evolution
    • Information propagation
  o Node importance
  o Classification, ranking
  o Content analysis
    • Thematic
    • Sentiment
    • ...

Page 8:

Link analysis

PageRank
HITS
SALSA

Page 9:

Motivations

• Computing score functions on graph data
  o Importance of an item
    • Web page
      o The number of incoming links measures the popularity of the page
    • Social networks
      o Links measure social interaction (e.g. friends)
    • Scientific literature
      o Impact factor (journals)
        • Average number of citations per published item
  o Classification or ranking score
    • Annotation of items (images)
  o Recommendation

Page 10:

PageRank

• General
  o Popularized by Google
  o Assigns an authority score to each web page
  o Uses only the structure of the web graph (query independent)
  o Now one of the many components used for computing page scores in Google's search engine
• Intuition
  o Assign higher scores to pages with many in-links from authoritative pages with few out-links
• Model
  o Random surfer model
    • Stationary distribution of a Markov Chain
  o Principal eigenvector of a linear system

Page 11:

Notations

o G = (V, E) a graph
o A its adjacency matrix
  • A binary matrix
    o aij = 1 if there is a link between i and j
    o aij = 0 otherwise
o P the transition matrix
  • pij = aij / di, for i, j = 1..n
  • di the degree of node vi (di = Σj aij)
  • pij is the probability of moving from node i to node j in the graph
  • P is row stochastic
    o Σj pij = 1

Page 12:

A =
0 1 0 1
0 0 1 1
1 0 0 0
0 0 1 0

P =
0 1/2 0 1/2
0 0 1/2 1/2
1 0 0 0
0 0 1 0

• Basic PageRank
  o Initialize the PageRank score vector to a stochastic vector, e.g. p(0) = (1/n)1
  o Update the PageRank vector until convergence
    • p(k+1) = P^T p(k)
  o p(0) = (1/4, 1/4, 1/4, 1/4), p(1) = (2/8, 1/8, 3/8, 2/8), p(2) = (6/16, 2/16, 5/16, 3/16), ...
• Conditions for convergence, unicity of the solution?
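The iterations above can be reproduced exactly in a few lines of Python. A minimal sketch using exact fractions on the slide's 4-node example graph:

```python
from fractions import Fraction

# Adjacency matrix of the 4-node example graph.
A = [[0, 1, 0, 1],
     [0, 0, 1, 1],
     [1, 0, 0, 0],
     [0, 0, 1, 0]]
n = 4

# Row-stochastic transition matrix P: p_ij = a_ij / d_i.
P = [[Fraction(a, sum(row)) for a in row] for row in A]

def step(p):
    # Basic PageRank update: p(k+1) = P^T p(k).
    return [sum(P[i][j] * p[i] for i in range(n)) for j in range(n)]

p0 = [Fraction(1, 4)] * n   # uniform starting vector
p1 = step(p0)               # (2/8, 1/8, 3/8, 2/8)
p2 = step(p1)               # (6/16, 2/16, 5/16, 3/16)
```

Using `Fraction` instead of floats makes the intermediate vectors match the slide's hand computation exactly.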

[Figure: the 4-node example directed graph]

Page 13:

Non Negative Matrices

• A square matrix A (n×n) is non-negative if aij ≥ 0
  o Notation: A ≥ 0
  o Example: a graph adjacency matrix
• A is positive if aij > 0
  o Notation: A > 0
• A is irreducible if
  o ∀ i, j, ∃ k ∈ ℕ such that (A^k)ij > 0
  o If A is a graph adjacency matrix, this means that G is strongly connected
    • There is a path between any pair of vertices
• A is primitive if ∃ k ∈ ℕ such that A^k > 0
  o A primitive matrix is irreducible
  o The converse is false
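These definitions can be illustrated by inspecting matrix powers, which is feasible only for tiny examples. A sketch (the 2×2 permutation matrix below is the standard example of a matrix that is irreducible but not primitive):

```python
def matmul(X, Y):
    # Plain list-of-lists matrix product.
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def is_primitive(A, max_k=20):
    # A is primitive if some power A^k is strictly positive.
    M = A
    for _ in range(max_k):
        if all(x > 0 for row in M for x in row):
            return True
        M = matmul(M, A)
    return False

# Irreducible but not primitive: powers alternate between this
# matrix and the identity, so no power is strictly positive.
C = [[0, 1], [1, 0]]

# Primitive: its square [[2, 1], [1, 1]] is strictly positive.
D = [[1, 1], [1, 0]]
```

The power test is exponential-cost bookkeeping for illustration only, not a practical primitivity check.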

Page 14:

Examples (Baldi et al. 2003)

Page 15:

Perron‐Frobenius theorem

• Let A (n×n) be a non-negative irreducible matrix
  o A has a real positive eigenvalue λ1 such that λ1 ≥ |λ'| for any other eigenvalue λ'
  o λ1 corresponds to a strictly positive eigenvector
  o No other eigenvector is positive
  o λ1 is a simple root of the characteristic equation det(A − λI) = 0
• Remarks
  o λ1 is called the dominant eigenvalue of A, and the corresponding eigenvector the dominant eigenvector
  o There might be other eigenvalues λj with |λj| = |λ1|
    • e.g. the matrix [[0, 1], [1, 0]] is non-negative and irreducible, with two eigenvalues 1 and −1 on the unit circle

Page 16:

• Perron-Frobenius theorem for a primitive matrix
  o In property 1, the inequality is strict
    • i.e. A has a real positive eigenvalue λ1 such that λ1 > |λ'| for any other eigenvalue λ'
• For a primitive stochastic matrix
  o λ1 = 1, since A1 = 1
• Why is this interesting?
  o It yields a simple procedure for computing the dominant eigenvector of a matrix using the powers of the matrix

Page 17:

Intuition on the power method

• Let
  o x ∈ ℝ^n
  o u1, ..., un the eigenvectors of A
  o c1, ..., cn the coordinates of x in the eigenvector basis
• Then
  o x = Σi ci ui
  o A^t x = Σi ci λi^t ui
  o If λ1 dominates, then A^t x / λ1^t → c1 u1 for t large
    • True if x is non-orthogonal to u1
    • Since u1 is positive, any positive vector will do
      o e.g. x = (1/n)1

Page 18:

Power method

• Let A be a primitive matrix
• Start with an arbitrary vector x0
  o yt = A xt
  o xt+1 = yt / ||yt||
• Convergence
  o Converges towards u1, the eigenvector associated with λ1, the largest eigenvalue of A
  o Whatever the initial vector x0
• Rate of convergence
  o Geometric, with ratio |λ2 / λ1|
  o λ1 > λ2 are the two dominant eigenvalues of A
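A minimal power-method sketch in pure Python. The 2×2 matrix is an illustrative choice, not from the slides; its dominant eigenpair is λ1 = 3 with eigenvector (1, 1):

```python
def mat_vec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def power_method(A, x, iters=100):
    # Repeatedly apply A and renormalize (1-norm); for a primitive
    # matrix the iterate converges to the dominant eigenvector.
    for _ in range(iters):
        y = mat_vec(A, x)
        norm = sum(abs(v) for v in y)
        x = [v / norm for v in y]
    return x

A = [[2.0, 1.0],
     [1.0, 2.0]]    # eigenvalues 3 and 1
x = power_method(A, [1.0, 0.0])
```

With λ2/λ1 = 1/3, the error shrinks by a factor 3 per iteration, so 100 iterations are far more than enough here.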

Page 19:

PageRank

• Recall
  o G: a directed graph (the Web)
  o A: its adjacency matrix
  o P: the transition matrix

Page 20:

• Intuition
  o The rank of a document is high if the ranks of its parents are high
  o Embodied e.g. in
    • r(v) ∝ Σ_{w ∈ pa(v)} r(w) / dout(w)
    • r(v): rank value at v
  o Each parent contributes
    • Proportionally to r(w)
    • Inversely to its out-degree
  o Amounts to solving r = M^T r
    • for a given matrix M
    • an eigenvector problem

Page 21:

Examples (Baldi et al. 2003)

Page 22:

• In order to converge to a stationary solution
  o Remove sink nodes
    • Many such situations on the web
      o Images, files, etc.
  o Make M primitive

Page 23:

Adjustments of the P matrix

• The transition matrix P most often lacks these properties
  o Stochasticity
    • Dangling nodes (nodes with no out-links) make P non-stochastic
    • Rows corresponding to dangling nodes are replaced by a stochastic vector v
      o A common choice is v = (1/n)1, with 1 the vector of ones
• The new transition matrix is
  o P' = P + a v^T
  o ai = 1 if i is a dangling node, 0 otherwise
  o P' is row stochastic
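A sketch of the dangling-node fix P' = P + a v^T. The 3-node matrix is an illustrative example, not from the slides; node 2 is dangling:

```python
n = 3
P = [[0.0, 1.0, 0.0],
     [0.5, 0.0, 0.5],
     [0.0, 0.0, 0.0]]          # node 2 has no out-links

v = [1.0 / n] * n              # uniform patch vector v = (1/n)1
# Dangling indicator: a_i = 1 iff row i of P is all zeros.
a = [1 if all(p == 0.0 for p in row) else 0 for row in P]
# P' = P + a v^T, applied entry-wise.
P_prime = [[P[i][j] + a[i] * v[j] for j in range(n)] for i in range(n)]
```

After the patch, every row of `P_prime` sums to 1, i.e. P' is row stochastic.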

Page 24:

Example (Langville&Meyer 2006)

Page 25:

• Primitivity
  o The matrix M must be primitive for the PageRank vector to exist
  o One possible solution is
    • P'' = α P' + (1 − α) 1 v^T, with 0 < α < 1 and v a stochastic vector
    • Different v correspond to different random walks
  o v uniform, v = (1/n)1: teleportation operator in the random walk model
  o v non-uniform: personalization vector
• P'' is a mixture of two stochastic matrices
  o It is stochastic
  o P'' is trivially primitive, since every node is connected to every node (all entries are positive)
  o α controls the proportion of time P' and 1v^T are used
    • α also controls the convergence rate of the random walk
  o P'' is called the Google matrix
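A sketch of the Google matrix construction P'' = α P' + (1 − α) 1 v^T, with α = 0.85 as in Brin & Page; the matrix P' below is an illustrative 3-node example:

```python
n = 3
alpha = 0.85
P_prime = [[0.0, 1.0, 0.0],
           [0.5, 0.0, 0.5],
           [1/3, 1/3, 1/3]]     # already row stochastic
v = [1.0 / n] * n               # uniform teleportation vector

# Google matrix: every entry gets a (1 - alpha) * v_j teleportation term.
P_gg = [[alpha * P_prime[i][j] + (1 - alpha) * v[j] for j in range(n)]
        for i in range(n)]
```

All entries of `P_gg` are strictly positive, so the matrix is primitive, and each row still sums to 1.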

Page 26:

• Example (Langville&Meyer 2006) 

Page 27:

Two formulations of the PageRank problem
1. Eigenvector solution

• Solve y = P''^T y
  o with y a stochastic vector
  o The original PageRank algorithm uses the power method
    • y(k+1) = P''^T y(k), with any stochastic starting vector y(0)
  o This rewrites as
    • P''^T y = α P'^T y + (1 − α) v (1^T y) = α P'^T y + (1 − α) v
    • = α P^T y + (α a^T y + 1 − α) v
  o Note
    • Computations can thus be performed on the sparse matrix P instead of the dense matrix P''
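The sparse reformulation can be sketched as follows, on an illustrative 3-node example with one dangling node and α = 0.85: the iteration only touches P, the dangling indicator a, and v, never the dense P'':

```python
alpha, n = 0.85, 3
P = [[0.0, 1.0, 0.0],
     [0.5, 0.0, 0.5],
     [0.0, 0.0, 0.0]]          # node 2 dangling
a = [0, 0, 1]                  # dangling indicator vector
v = [1.0 / n] * n              # teleportation / patch vector

def step(y):
    # y(k+1) = alpha * P^T y + (alpha * a^T y + 1 - alpha) * v
    Pt_y = [sum(P[i][j] * y[i] for i in range(n)) for j in range(n)]
    c = alpha * sum(ai * yi for ai, yi in zip(a, y)) + (1 - alpha)
    return [alpha * t + c * vj for t, vj in zip(Pt_y, v)]

y = v[:]                       # stochastic starting vector
for _ in range(200):
    y = step(y)
```

Each update preserves the 1-norm of `y`, and the iterate converges geometrically (ratio α) to the PageRank vector.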

Page 29:

• Check
  o P'' being stochastic, its dominant eigenvalue is λ1 = 1, which guarantees convergence
  o P'' being primitive, the eigenvector associated with λ1 (the PageRank vector) is unique

Page 30:

• Rate of convergence
  o For the web graph, convergence is governed by α
  o The rate of convergence is the rate at which α^t → 0
    • The initial paper by Brin & Page uses α = 0.85 and 50 to 100 iterations

Page 31:

Two formulations of the PageRank problem
2. Linear system formulation

• Solve (I − α P'^T) y = (1 − α) v
  o This can be rewritten as a function of P directly
    • Solve (I − α P^T) x = v, then normalize: y = x / ||x||1
  o The coefficient matrix I − α P^T
    • is non-singular
    • has column sums 1 − α for non-dangling nodes, 1 for dangling nodes
  o Standard iterative solvers apply, e.g. Jacobi, Gauss-Seidel, successive over-relaxation methods

Page 32:

Jacobi method for solving linear systems

• Let the linear system• Ax = b• Decompose A into

o A = D + R with D the diagonal of A• R diagonal is 0

• Ax = b writes Dx = b – Rx• If D invertible, Jacobi method solves the linear equation by

• Matrix form:  1

• Element form:  1 ∑• Converges if A is strictly diagonally dominant

• i.e.  ∑ (strict row diagonal dominance)• Th Levy‐Desplanques

o a square matrix with a diagonal strictly dominant is invertible
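A minimal Jacobi sketch; the 2×2 system below is an illustrative, strictly diagonally dominant example with exact solution (1, 2):

```python
def jacobi(A, b, iters=100):
    # Jacobi iteration: x_i <- (b_i - sum_{j != i} a_ij x_j) / a_ii.
    # All updates use the previous iterate (simultaneous update).
    n = len(b)
    x = [0.0] * n
    for _ in range(iters):
        x = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
             for i in range(n)]
    return x

A = [[4.0, 1.0],
     [2.0, 5.0]]   # |a_ii| > sum of off-diagonal row entries
b = [6.0, 12.0]    # solution x = (1, 2)
x = jacobi(A, b)
```

The list comprehension reads the old `x` throughout, which is exactly the simultaneous update that distinguishes Jacobi from Gauss-Seidel.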

Page 33:

• PageRank with Jacobi
• Algorithm
  o Start with an arbitrary vector y(0)
  o Iterate
    • y(k+1) = α P'^T y(k) + (1 − α) v
    • (here D = I, since the diagonal of P' is 0 when the graph has no self-loops)

Page 34:

• Personalization vector
  o Any probability vector v with positive elements can be used
  o v = (1/n)1: uniform teleportation
  o v can be used to
    • personalize the search
    • control spamming (link farms)

Page 35:

Convergence rate of PageRank

• Theorem (Bianchini et al. 2005)
  o Let y* be the stationary vector of PageRank and e(t) = ||y(t) − y*||1 / ||y*||1 the 1-norm of the relative error in the computation of PageRank at time t; then e(t) ≤ α^t e(0), so e(t) → 0
  o If there is no dangling page, then there exists v ≥ 0 with v = P''v such that the equality holds

Page 36:

Random Walk interpretation

• The initial formulation of PageRank was in terms of random walks
  o A surfer walks the web and moves from page to page according to a transition probability matrix M
  o Rank of a page v
    • The probability that the surfer is browsing page v
  o M is interpreted as the matrix of a first-order Markov Chain
  o The Google vector r is the stationary distribution of a discrete-time Markov Chain

Page 37:

Markov Chains

• Stochastic process
  o A set of random variables {Xt} defined on a state space S = {S1, ..., Sn}
    • t is often the time; Xt is the state of the process at time t
  o e.g. S: pages of the Web; process: surfing the web; Xt: the web page viewed at time t
• Markov Chain
  o A stochastic process that satisfies the Markov property
  o P(Xt+1 = Sj | Xt = Si, ..., X1) = P(Xt+1 = Sj | Xt = Si)
  o i.e. a memoryless process: the state at time t+1 only depends on the state at time t
    • P(Xt = Sj | Xt−1 = Si) is the transition probability, i.e. the probability of moving from Si at time t−1 to Sj at time t
  o P is a row-stochastic matrix
• Stationary Markov Chain
  o An MC in which the transition probabilities do not depend on t: P(Xt+1 = Sj | Xt = Si) = pij, ∀t

Page 38:

• Transition matrix
  o pij = P(Xt+1 = Sj | Xt = Si)
• Initial distribution vector
  o p(0) = (p1(0), ..., pn(0))
    • pi(0) is the probability that the chain starts in Si
• Irreducible Markov Chain
  o The transition matrix is irreducible

Page 39:

Graphical representation

• State = circle, transition = directed link

[Figure: three-state Markov chain with states S1, S2, S3 and transition probabilities pij]

Page 40:

Example 1 (Rabiner, Juang)

• S1 = Rain, S2 = Clouds, S3 = Sun

Page 41:

Example 2

• Web surfing
  o States: pages
  o Transitions: hyperlinks
  o Parameter estimation: statistics on users' browsing
  o Use: modeling browsing behavior

Page 42:

Example 3: n-gram language model

• Build a language model that captures the sequential nature of texts in a corpus.
• An n-gram model = an MC of order n−1
• Example: 20 K word vocabulary

MODEL      # PARAMETERS
Bigram     20K * 20K = 400 * 10^6
Trigram    8 * 10^12
4-gram     1.6 * 10^17
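The parameter counts in the table can be checked directly; a sketch, with the 20,000-word vocabulary from the slide (an order-(n−1) MC over words has one conditional probability per n-gram, i.e. V^n parameters before any smoothing or pruning):

```python
V = 20_000   # vocabulary size

# Number of n-gram parameters: one per sequence of n words.
counts = {n: V ** n for n in (2, 3, 4)}
```

This reproduces 400·10^6 bigram, 8·10^12 trigram and 1.6·10^17 4-gram parameters.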

Page 43:

• Probability distribution vector
  o A non-negative vector p = (p1, ..., pn) whose components sum to 1
• Stationary distribution vector of an MC with transition matrix P
  o A vector π such that π^T = π^T P
• kth step probability vector of an MC
  o Probability of being in each state at time k
  o p(k) = (p1(k), ..., pn(k))

Page 44:

Properties

• For an initial vector p(0), what is the state distribution p(k) at time k?
• Property
  o Let P be the transition matrix of an MC on states S1, ..., Sn
  o P^k is the kth step transition matrix
    • [P^k]ij is the probability of moving from i to j in k steps
  o p(k) = (P^k)^T p(0) is the kth probability vector
  o If P is primitive, p(k) → π, with π the unique dominant eigenvector of the transition matrix P
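A sketch of this property on a toy 2-state chain (the transition matrix is an illustrative choice; its stationary distribution is π = (5/6, 1/6)):

```python
P = [[0.9, 0.1],
     [0.5, 0.5]]   # row-stochastic, strictly positive -> primitive

p = [1.0, 0.0]     # start deterministically in state S1
for _ in range(200):
    # One step of p(k+1) = P^T p(k).
    p = [sum(P[i][j] * p[i] for i in range(2)) for j in range(2)]
```

Whatever the starting distribution, `p` converges to the stationary vector π, which satisfies π^T = π^T P.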

Page 45:

Random Walks on graphs

• Random Walk
  o A stochastic process that randomly jumps from node to node
  o i.e. an MC
• G = (V, W) a weighted graph
  o The transition matrix of the random walk is
    • pij = wij / di, for i, j = 1..n, with di = Σj wij
    • P = D^{-1} W
• If the graph is connected and non-bipartite (P primitive), the random walk possesses a unique stationary distribution π, with πi = di / Σj dj
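A sketch checking the stationary distribution on a triangle graph (an illustrative connected, non-bipartite example; the degree-proportional formula πi = di / Σj dj is the standard result for undirected graphs):

```python
W = [[0, 1, 1],
     [1, 0, 1],
     [1, 1, 0]]    # unweighted triangle
n = 3

d = [sum(row) for row in W]          # node degrees
pi = [di / sum(d) for di in d]       # candidate stationary distribution
P = [[W[i][j] / d[i] for j in range(n)] for i in range(n)]  # P = D^-1 W

# Stationarity check: pi^T P should equal pi^T.
pi_P = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
```

On the triangle all degrees are equal, so π is uniform; with unequal weights π simply follows the (weighted) degrees.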

Page 46:

HITS (Kleinberg 98)

• Hub

• Authority

Page 47:

• Hub
  o Points to good authority pages
  o Hub score of a page: the sum of the authority scores of its children
• Authority
  o Important reference pages for a topic
  o Pointed to by good hub pages
  o Authority score of a page: the sum of the hub scores of its parents

Page 48:

HITS ‐ Algorithm

• Input
  o Web subgraph relative to a query
    • The subgraph is composed of the retrieved documents + the linked (in and out) web documents
    • Only a part of the linked documents is considered (e.g. 100)
• Output
  o Authority and hub scores, a() and h(), for all pages in the graph
• Algorithm
  o Initialize
    • a(v) = 1, h(v) = 1, ∀v (any positive vector will do)
  o Repeat
    • a(v) = Σ_{w→v} h(w)
    • h(v) = Σ_{v→w} a(w)
    • Normalize h and a
      o a = a / ||a||
      o h = h / ||h||
  o Until convergence
  o Return the two lists
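A compact sketch of these updates (the edge list encodes the same 4-node graph used earlier in the PageRank example; a fixed iteration count stands in for a convergence test):

```python
E = [(0, 1), (0, 3), (1, 2), (1, 3), (2, 0), (3, 2)]  # directed edges u -> w
n = 4

h = [1.0] * n   # hub scores
a = [1.0] * n   # authority scores
for _ in range(100):
    # a(v) = sum of h(w) over in-neighbors w -> v
    a = [sum(h[u] for (u, w) in E if w == v) for v in range(n)]
    # h(v) = sum of a(w) over out-neighbors v -> w
    h = [sum(a[w] for (u, w) in E if u == v) for v in range(n)]
    # 1-norm normalization of both vectors
    a = [x / sum(a) for x in a]
    h = [x / sum(h) for x in h]
```

The edge-list scan is O(|E|) per update; real implementations would use adjacency lists, but the fixed point is the same.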

Page 49:

HITS algorithm (continued)

• For the subgraph, let
  o h: the vector of page hub scores
  o a: the vector of page authority scores
  o A: the adjacency matrix
• In matrix form, the algorithm writes
  o a(t) = A^T h(t−1), h(t) = A a(t) (followed by normalization)
  o or, after substitution
    • a(t) = A^T A a(t−1), h(t) = A A^T h(t−1) (followed by normalization)

Page 50:

• Matrices
  o A^T A is called the authority matrix
    • Determines the authority scores
  o A A^T is called the hub matrix
    • Determines the hub scores
  o Both are symmetric, positive semi-definite
    • The dominant eigenvalue λ1 is unique
• Algorithm
  o The update algorithm is the power method applied to the matrices A A^T and A^T A
  o It converges towards a dominant eigenvector of A A^T and of A^T A

Page 51:

• Convergence
  o Although λ1 is unique, it may have multiple eigenvectors, so convergence will depend on the initial vectors a(0) and h(0)
  o A trick similar to PageRank can be used to make the matrices primitive and converge to a unique eigenvector:
    • Replace A A^T with α A A^T + (1 − α) 1 v^T, with 0 < α < 1 and v a stochastic vector
  o Same thing with A^T A

Page 52:

• Example (Langville – Meyer 2006)

Page 53:

• Matrices
  o A symmetric matrix B is positive semi-definite if
    • for every non-zero vector x, x^T B x ≥ 0
    • or, equivalently, all its eigenvalues are ≥ 0
  o A matrix B is positive definite if
    • ≥ is replaced by >

Page 54:

SALSA (Lempel – Moran 2001)

• Many variants and algorithms were inspired by the success of PageRank and HITS.
• SALSA (Stochastic Approach for Link Structure Analysis) is a stochastic extension of HITS
  o Makes use of a subgraph of the web
  o Computes hub and authority values

Page 55:

• G = (V, E)
• Build a bipartite undirected graph with two sets of vertices
  o Vh: all vertices with out-degree > 0 in G (hub side)
  o Va: all vertices with in-degree > 0 in G (authority side)
  o Edges connect Vh to Va
• Perform two separate random walks to compute hub and authority scores, using hub and authority transition matrices H and B
  o Hub score
    • Start from a node in Vh
    • Follow a link in G (jump to a node in Va)
    • Follow a backlink in G (jump back to a node in Vh)
    • These two-step transitions are described by H
  o Authority score
    • Idem, starting from Va (two-step transitions described by B)
  o The stationary vectors of the random walks are the two score vectors h and a
    • h and a are the principal eigenvectors of H and B
  o Note
    • Each walk starts on one side of the bipartite graph and remains on that side

Page 56:

Example (Langville – Meyer 2006)

Page 57:

• Transition matrices
  o Hub matrix H
    • H(u, v) = Σ_{w : (u,w) ∈ E, (v,w) ∈ E} 1 / (deg(u) deg(w))
    • u, v ∈ Vh, w ∈ Va
  o Authority matrix B
    • B(u, v) = Σ_{w : (w,u) ∈ E, (w,v) ∈ E} 1 / (deg(u) deg(w))
    • u, v ∈ Va, w ∈ Vh

Page 58:

• The transition matrices can be computed from the adjacency matrix A of the initial graph G
  o Ar: the row-normalized adjacency matrix
  o Ac: the column-normalized adjacency matrix
  o H: the non-zero rows and columns of Ar (Ac)^T
  o B: the non-zero rows and columns of (Ac)^T Ar
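A sketch of this construction on a tiny illustrative graph (not from the slides). For hub nodes, the corresponding rows of H should sum to 1, since H describes a random walk on the hub side:

```python
A = [[0, 1, 1],
     [0, 0, 1],
     [0, 0, 0]]    # edges: 0->1, 0->2, 1->2
n = 3

row_sums = [sum(r) for r in A]
col_sums = [sum(A[i][j] for i in range(n)) for j in range(n)]
# Row- and column-normalized adjacency matrices (0 rows/cols left as 0).
Ar = [[A[i][j] / row_sums[i] if row_sums[i] else 0.0 for j in range(n)]
      for i in range(n)]
Ac = [[A[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
      for i in range(n)]
# Hub chain: H = Ar (Ac)^T, i.e. H[u][v] = sum_w Ar[u][w] * Ac[v][w].
H = [[sum(Ar[u][w] * Ac[v][w] for w in range(n)) for v in range(n)]
     for u in range(n)]
```

Restricting H to its non-zero rows and columns gives the hub-side transition matrix of SALSA; the authority matrix B is obtained the same way from (Ac)^T Ar.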

Page 59:

Example (Langville – Meyer 2006)

Page 60:

Example (Langville – Meyer 2006)

Page 61:

Bibliography

• Langville A., Meyer C.D., Google's PageRank and Beyond: The Science of Search Engine Rankings, Princeton University Press, Princeton, NJ, USA, 2006. ISBN 0691122024.
• Baldi P., Frasconi P., Smyth P., Modeling the Internet and the Web: Probabilistic Methods and Algorithms, Wiley, 2003.
• Bianchini M., Gori M., Scarselli F., Inside PageRank, ACM Transactions on Internet Technology, 2005, 5(1):92-128.

Page 62:

Classification and ranking on networked data

Introduction
• Motivation
• Graph Laplacian
Regularization-based methods
Collective classification

Page 63:

Relational graph data

• Fast-growing semantic resources
  o Web
  o Social networks
  o Media sharing
• New
  o Services
  o Data types
  o Industrial problems
  o Research problems
• Challenge
  o Machine learning and Information Retrieval for networked data

Page 64:

Different types of analysis

• Analysis of networks (Kleinberg, Faloutsos, ...)
  o Mainly based on connectivity analysis
  o Structure
  o Dynamics
  o Information propagation
• Machine learning on graph data
  o Mainly classification
  o Two main approaches
    • Collective classification
    • Regularization framework
  o May take into account both content and connectivity

Page 65:

Classification and Ranking on networked data

• Problem
  o Classification and ranking are two important generic problems in machine learning
    • Mainly developed for vectorial and sequential data
  o Classification / ranking on graphs
    • As usual
      o Some data points are labeled
      o Infer the labels of the other nodes
• Specificity of graph data
  o Node interdependency
  o The label inferred at a given node depends on its neighbors

[Figure: partially labeled graph; unlabeled nodes marked with "?"]

Page 66:

Example: webspam detection

• WebSpam challenge 2007
  o 11 K hosts
  o 7 K labeled
  o 26 % spam
• Partial view of the host graph
  o Black: spam
  o White: non-spam

Page 67:

Example: blog spam

Page 68:

Graph Laplacians (von Luxburg 2007)

• The following definitions hold for undirected graphs
  o Let G = (V, E) be an undirected graph
    • |V| = n, W an n×n non negative, symmetric weight matrix
    • D an n×n diagonal matrix with Dii = Σj Wij
  o The unnormalized Laplacian of G is
    • L = D − W
  o Properties
    • ∀ f ∈ Rⁿ : fᵀLf = ½ Σi,j Wij (fi − fj)²
    • L is symmetric, positive semi‐definite
    • L has n non negative real eigenvalues 0 = λ1 ≤ λ2 ≤ … ≤ λn
    • The smallest eigenvalue of L is 0, with eigenvector 1 (the constant one vector)
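These properties are easy to check numerically. A minimal numpy sketch (the 4-node weighted graph is an illustrative choice, not from the slides):

```python
import numpy as np

# Hypothetical 4-node weighted undirected graph
W = np.array([[0., 1., 1., 0.],
              [1., 0., 2., 0.],
              [1., 2., 0., 3.],
              [0., 0., 3., 0.]])
D = np.diag(W.sum(axis=1))          # degree matrix: D_ii = sum_j W_ij
L = D - W                           # unnormalized graph Laplacian

# Property: f'Lf = 1/2 * sum_ij W_ij (f_i - f_j)^2, for any f
f = np.array([0.3, -1.2, 0.7, 2.0])
quad = f @ L @ f
direct = 0.5 * sum(W[i, j] * (f[i] - f[j]) ** 2
                   for i in range(4) for j in range(4))
assert np.isclose(quad, direct)

# Property: L symmetric PSD, smallest eigenvalue 0 with eigenvector 1
assert np.allclose(L, L.T)
eigvals = np.linalg.eigvalsh(L)     # sorted ascending
assert np.all(eigvals >= -1e-10) and np.isclose(eigvals[0], 0.0)
assert np.allclose(L @ np.ones(4), 0.0)
```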

Page 69

Graph Laplacians (continued)

o The following two matrices are called normalized Laplacians
  • Lrw = D−1L = I − D−1W ; Lrw is related to random walks
  • Ls = D−1/2LD−1/2 = I − D−1/2WD−1/2 ; Ls is symmetric
o Properties
  • Ls and Lrw are positive semi‐definite; they have n non negative real eigenvalues
  • 0 is an eigenvalue of Ls and Lrw, with eigenvector D1/21 and 1 respectively
  • Ls and Lrw are similar matrices
    o They have the same eigenvalues
    o λ is an eigenvalue of Lrw with eigenvector v iff λ is an eigenvalue of Ls with eigenvector D1/2v
  • ∀ f ∈ Rⁿ : fᵀLs f = ½ Σi,j Wij (fi/√Dii − fj/√Djj)²
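The similarity of Lrw and Ls (same spectrum, related eigenvectors) can be verified directly; a sketch on the same illustrative weighted graph as before:

```python
import numpy as np

# same hypothetical 4-node weighted graph as above
W = np.array([[0., 1., 1., 0.],
              [1., 0., 2., 0.],
              [1., 2., 0., 3.],
              [0., 0., 3., 0.]])
d = W.sum(axis=1)
L = np.diag(d) - W

L_rw = np.diag(1.0 / d) @ L                     # I - D^-1 W
D_isqrt = np.diag(1.0 / np.sqrt(d))
L_s = D_isqrt @ L @ D_isqrt                     # I - D^-1/2 W D^-1/2

# similar matrices: identical (real, non-negative) eigenvalues
ev_rw = np.sort(np.linalg.eigvals(L_rw).real)
ev_s = np.linalg.eigvalsh(L_s)
assert np.allclose(ev_rw, ev_s, atol=1e-8)
assert ev_s[0] > -1e-10

# 0-eigenvectors: 1 for L_rw, D^{1/2} 1 for L_s
assert np.allclose(L_rw @ np.ones(4), 0.0)
assert np.allclose(L_s @ np.sqrt(d), 0.0, atol=1e-10)
```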

Page 70

Classification and ranking on networked data

Introduction
• Motivation
• Graph Laplacian

Regularization based methods
Collective classification

Page 71


Graph labeling with regularization

• Initial framework comes from semi‐supervised learning

• Later extended to other situations
  o Classification on graphs
  o Ranking

Page 72

• Classification framework
  o The classical setting for classification is inductive learning
    • Learn from a set of labeled data
      o Usually manual labeling of the data
    • Infer on new data
• Semi‐supervised learning
  o Motivation
    • Labeling data is expensive, while unlabeled data is often available in large quantities
      o Often the case for e.g. web applications
    • Train classifiers using both (few) labeled data and unlabeled data
  o The regularization framework mainly concerns transductive learning
    • i.e. all data (labeled + unlabeled) are available at once

Page 73

• Where does the graph come from in semi‐supervised learning?

• Make use of local data consistency (proximity, similarity) besides global consistency

• See illustration

Page 74


Data consistency (Zhou et al. 2003)
• Context: semi‐supervised learning (SSL)
• SSL relies on local (neighbors share the same label) and global (data structure) data consistency

Fig. from Zhou et al. 2003

Page 75


• Graph methods for semi‐supervised learning — general idea
  o Given
    • An undirected graph G defined on the data points
    • A similarity matrix between nodes in G
    • A set of labeled nodes in G
  o Propagate the observed labels to unlabeled nodes using the similarities

Page 76

• Notations
  o D = {x1,…, xl, xl+1,…, xn} data points
    • The first l points are labeled, the others unlabeled
    • n : # data points
  o y : n×1 vector of class scores
    • We consider binary classification (classes C1 and C2) for simplicity
    • yi : class score for pattern xi
      o e.g. the target is 1 if xi is in C1 and 0 if in C2
  o G = (V, E) an undirected graph
    • A its adjacency matrix
    • W a similarity matrix : Wij is the similarity between nodes i and j
    • S a row stochastic matrix defined on G
      o Different choices of S are possible

Page 77

Iterative algorithm for semi‐supervised classification –general scheme

• I/O
  o Input
    • Labeled and unlabeled data points
  o Output
    • A label for every data point
• Algorithm
  o Compute
    • a similarity matrix W
    • a normalized similarity matrix S
  o Iterate
    • y(t+1) = α S y(t) + (1−α) y(0), with 0 < α < 1
  o Label each point
    • e.g. y*i = 1 if y*i > 0.5, 0 otherwise

• Example
  o Wij = exp(−‖xi − xj‖² / 2σ²), Wii = 0
  o S = D−1W
    • D is a diagonal matrix whose ith element is the sum of the ith row of W
  o y(0) : vector of initial labels
    • yi(0) = 1 if xi is labeled C1, 0 if labeled C2
    • yi(0) = 0 for unlabeled nodes
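The whole scheme fits in a few lines of numpy. A sketch on two toy 2-D clusters (the data, σ = 0.5 and α = 0.9 are illustrative choices, not values from the slides); note that it also matches the closed-form fixed point derived on the next slides:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two toy clusters; only x_0 (class C1) and x_10 (class C2) are labeled.
X = np.vstack([rng.normal(0, 0.3, (10, 2)),      # cluster of class C1
               rng.normal(3, 0.3, (10, 2))])     # cluster of class C2
y0 = np.zeros(20)
y0[0] = 1.0                  # y(0): 1 for the labeled C1 point, 0 elsewhere

# W_ij = exp(-||x_i - x_j||^2 / 2 sigma^2), W_ii = 0
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-d2 / (2 * 0.5 ** 2))
np.fill_diagonal(W, 0.0)
S = W / W.sum(axis=1, keepdims=True)             # S = D^-1 W, row stochastic

alpha = 0.9
y = y0.copy()
for _ in range(200):                 # y(t+1) = alpha S y(t) + (1-alpha) y(0)
    y = alpha * S @ y + (1 - alpha) * y0

# agrees with the fixed point y* = (1-alpha)(I - alpha S)^-1 y(0)
y_star = (1 - alpha) * np.linalg.solve(np.eye(20) - alpha * S, y0)
assert np.allclose(y, y_star, atol=1e-6)
assert y[:10].min() > y[10:].max()   # C1-cluster scores dominate C2's
```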

Page 78

• Properties
  • Classical convergence conditions of iterative methods apply
    o e.g. S primitive, α < 1
  • Converges to
    o y* = (1−α)(I − αS)−1 y(0)

Page 79

• Different variants
  o Different W matrices can be used
    • Inverse exponential distances
      o Dense connection matrix
      o May use a threshold to make it sparse
    • k nearest neighbors
      o Local connectivity, sparse connection matrix
    • Kernels on graphs
      o See later
  o Any S matrix which satisfies the convergence conditions can be used
    o e.g. S = D−1W or S = D−1/2WD−1/2

Page 80


Iterations

Fig. from Zhou et al. 2003

Same algorithm but with S = D−1/2WD−1/2 (Zhou et al. 2003)

Page 81

Regularization view of the algorithm

• y* can be obtained as the solution minimizing the following cost function
  o Q(y) = μ Σi (yi − yi(0))² + ½ Σi,j Wij (yi/√Dii − yj/√Djj)²
    • First term : fitting constraint wrt the initial labels
    • Second term : smoothness constraint on neighbor nodes
    • yi(0) = 1 if node i is in class 1, 0 otherwise (class 2 and unlabeled points)
• In compact form, with S = D−1/2WD−1/2
  o Q(y) = μ (y − y(0))ᵀ(y − y(0)) + yᵀ(I − S)y
• Differentiating Q wrt y gives
  o (I − αS) y* = (1−α) y(0)  (*)  with α = 1/(1+μ)
• (I − αS) is nonsingular, so the solution is
  • y* = (1−α)(I − αS)−1 y(0)
o The iterative algorithm above is the Jacobi iterative method for solving the linear system (*)

Page 82

Multiclass extension

• Direct extension of the above algorithm
  o Replace the vector y (n×1) with a matrix Y (n×c)
    • c is the number of classes
    • Yij(0) = 1 if xi is labeled with class Cj, 0 otherwise
    • Y(0) is the matrix of initial labels
  o Final label of xi : argmaxk Y*ik

Page 83


Ranking extensions

• Remark

o Similar formulations have been proposed for ranking in web search engines (e.g. Zhou 2004, Deng 2008)

o Ideas
  • Documents and queries are the graph vertices
  • Scores are propagated for computing document relevance to queries while considering document similarity

• Documents are ranked for each query according to scores

Page 84

Content + link Information

• Propagation methods
  o do not directly consider the content of the different nodes
  o content only appears through the similarity or kernel matrix

• It is possible to use the graph regularization idea together with content based classifiers

Page 85

Content + link Information (continued)
Abernethy et al. 2010 (classification)
Denoyer et al. 2010 (ranking)

• Context
  o Transductive semi‐supervised learning
  o Each node is characterized by content information
    • e.g. image, text, other
• Content classifier f
  o Fitting term : Σi∈labeled Δ(f(xi), yi(0))
• Smoothing term
  o Σ(i,j)∈E Wij Δ′(f(xi), f(xj))
• Regularized content + link classifier
  o L(f) = Σi∈labeled Δ(f(xi), yi(0)) + λ Σ(i,j)∈E Wij Δ′(f(xi), f(xj))
• Learning
  o Gradient‐like algorithm for learning the parameters of f
  o Extensions allow learning the weights Wij as well
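As a minimal sketch of this idea (not the exact objective of the cited papers): a linear content classifier trained by gradient descent on a squared fitting loss plus a Laplacian smoothness penalty. The data, the 3-nearest-neighbor graph and the hyperparameters are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
n, dim = 40, 2
X = np.vstack([rng.normal(-1, 0.7, (20, dim)),   # class 0
               rng.normal(+1, 0.7, (20, dim))])  # class 1
y = np.array([0.0] * 20 + [1.0] * 20)
labeled = np.array([0, 1, 20, 21])               # only four labeled nodes

# toy graph: symmetrized 3-nearest-neighbor links
d2 = ((X[:, None] - X[None]) ** 2).sum(-1)
W = np.zeros((n, n))
for i in range(n):
    for j in np.argsort(d2[i])[1:4]:
        W[i, j] = W[j, i] = 1.0
L = np.diag(W.sum(axis=1)) - W                   # graph Laplacian

w, b = np.zeros(dim), 0.0
lam, lr = 0.02, 0.005
for _ in range(3000):
    s = X @ w + b                                # content scores f(x_i)
    g = np.zeros(n)
    g[labeled] = 2 * (s[labeled] - y[labeled])   # fitting-term gradient
    g += 4 * lam * (L @ s)   # smoothness: sum_ij W_ij (f_i - f_j)^2 = 2 f'Lf
    w -= lr * (X.T @ g)
    b -= lr * g.sum()

s = X @ w + b
assert s[y == 1].mean() > s[y == 0].mean()       # classes are separated
```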

Page 86


Ranking model for image annotation in a social network (Denoyer 2009)

• Problem
  o Automatic annotation of images in large social networks (e.g. Flickr)
  o Consider simultaneously
    • Explicit relations (authorship, friendship)
    • Implicit relations (similarity)
    • Different types of content
      o Text, image

Page 87


• Approach
  o Regularization‐based method
• Cost function
  o Fitting term : ranking function
  o Regularity term : based on one type of relation
• Results
  o Importance of social links
    • Authors, friendship
  o Large improvement over non‐relational (classical) ranking methods
  o Little improvement with implicit relations

Page 88


Experiments

• 3 corpora extracted from Flickr

Page 89


Results

Page 90

Other extensions

• Directed graphs
• Multiple relations
• Heterogeneous networks

Page 91

• Zhou D., Bousquet O., Lal T.N., Weston J., Schölkopf B. Learning with local and global consistency. In: Advances in Neural Information Processing Systems 16 (NIPS 2003). 2004:595–602.
• Zhu X., Ghahramani Z. Semi‐supervised learning using Gaussian fields and harmonic functions. ICML 2003:912–919.
• Abernethy J., Chapelle O., Castillo C. Graph regularization methods for Web spam detection. Machine Learning. 2010;81(2):207–225.

Page 92

Classification and ranking on networked data

Introduction
• Motivation
• Graph Laplacian

Regularization based methods
Collective classification

Page 93


Networked data

• Available information
  o Connectivity
    • Labels : partial labeling
    • Links
      o usually assumed known – explicit or implicit
  o Node features
  o Others
    • Label metric
• Three types of correlations can then be exploited
  o Correlation between the label of node i and its features
  o Correlation between the label of i and the observed features and/or labels of the neighbors of i
  o Correlation between the label of i and the unobserved features and labels of the neighbors of i

Page 94


• Solving the global label assignment problem is usually NP‐hard
• Exact inference algorithms, when they exist, are too costly
• Most methods use approximate inference algorithms
• Note: most methods consider only
  o Unweighted links
  o Single links

Page 95


Notations and problem definition

• Notations
  o Graph G = (V, E)
  o Node i features : xi
    • xi may incorporate input features (e.g. text) and/or relational features
      o local features : e.g. neighbor labels, number of neighbors, …
      o global features : e.g. centrality
  o Node i label : yi
  o Neighborhood of node i : N(i)
  o Labels take their values in L = {l1, …, lp}
• Classification problem
  o Some labels and/or features being observed
  o Infer the unobserved labels of the other nodes

Page 96


Collective classification methods (Sen et al. 2008)

• Usual scheme
  o Bootstrap
    • Assign an initial value to each node using a local classifier
    • Any classifier may be used
  o Iterate
    • Compute node labels using graph contextual information
    • Iterations are needed since the new label values for nodes in N(i) provide new information for yi
• Most methods for collective classification thus require
  o A relational classifier
  o An iteration policy

Page 97


Collective classification methods

o Gibbs sampling
o Iterative classification
o Relaxation labeling
o Stacked learning
o Random walks
o …

Page 98


Feature vectors

• For vector classifiers, xi should be of fixed size
  o Neighborhoods N(i) may have different sizes for different nodes i
  o Usual solution : use aggregate features in order to build fixed‐size feature vectors
    • e.g. # class‐k labels in N(i), class‐k relative frequency in N(i), majority label in N(i), …
• The value of xi may change from one iteration to the next
  o xi must be recomputed at each iteration

Page 99

• Example – aggregate features

Page 100


Iterative classification (Neville et al 2000, Lu et al. 2003)

• Bootstrap
  o For each unlabeled node i
    • Local classifier
      o Compute xi
      o Compute label yi using the observed nodes in N(i) : yi = F(xi)
• Iterate
  o Generate an ordering over the unlabeled nodes
  o For each unlabeled node i
    • Relational classifier
      o Compute xi
      o Compute label yi using N(i) : yi = F(xi)
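A minimal sketch of this scheme on a toy graph (two cliques joined by one edge; the scalar node feature, the local nearest-labeled-feature classifier and the majority-vote relational classifier are illustrative stand-ins for F):

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(2)
# Toy homophilous graph: two 10-node cliques joined by a single edge.
n = 20
A = np.zeros((n, n), dtype=int)
A[:10, :10] = 1
A[10:, 10:] = 1
np.fill_diagonal(A, 0)
A[9, 10] = A[10, 9] = 1
true = np.array([0] * 10 + [1] * 10)
observed = {0: 0, 19: 1}                    # two labeled nodes
x = true + rng.normal(0, 0.3, n)            # one noisy scalar feature per node

labels = dict(observed)

# Bootstrap: local classifier ignoring the graph (nearest labeled feature)
for i in range(n):
    if i not in observed:
        labels[i] = 0 if abs(x[i] - x[0]) < abs(x[i] - x[19]) else 1

# Iterate: relational classifier = majority label in N(i)
for _ in range(5):
    for i in range(n):
        if i not in observed:
            neigh = [labels[j] for j in np.flatnonzero(A[i])]
            labels[i] = Counter(neigh).most_common(1)[0][0]

pred = np.array([labels[i] for i in range(n)])
assert (pred == true).mean() >= 0.9         # graph context cleans up errors
```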

Page 101


Simulated Iterative Classification (Maes et al. 2009)

• Training bias in ICA
  o Training is performed on correct labels
  o Inference is performed with noisy labels

Page 102

SICA (continued)

• Idea
  o Training and test conditions should be made similar
  o Make training examples representative of test ones by simulating inference during learning
  o How
    • Repeatedly run inference during training by sampling from the current classifier's distribution of predicted labels
    • Different sampling schemes

Page 103


Gibbs sampling (McDowell et al. 2007, Neville et al. 2007)

• Simplified version of the original Gibbs sampling strategy (Geman & Geman 1984)
  o Introduces a classifier F(), not present in the original Gibbs sampler
• Training often requires a fully labeled training set
• Inference
  o Sample the outputs for each node and take the majority label
• Difference with ICA
  o Label sampling
• Sequential update
• Any classifier can be used for the Bootstrap or Iterate steps

Page 104


Gibbs sampling

• Bootstrap
  o For each unlabeled node i
    • Local classifier
      o Compute xi
      o Compute label yi using the observed nodes in N(i) : yi = F(xi)
  o For each label l
    • Counts[i, l] = 0
• Iterate
  o Generate an ordering over the unlabeled nodes
  o For each unlabeled node i
    • Relational classifier
      o Compute xi
      o Sample label yi using N(i) : yi = F(xi)
      o Counts[i, yi] = Counts[i, yi] + 1
• Output : yi = argmaxl Counts[i, l]
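The difference with ICA is the sampling step and the label counts. A sketch on the same illustrative two-clique toy graph, where the relational "classifier" samples a label from the empirical neighbor-label distribution:

```python
import numpy as np

rng = np.random.default_rng(3)
# Same two-clique toy graph as for ICA.
n = 20
A = np.zeros((n, n), dtype=int)
A[:10, :10] = 1
A[10:, 10:] = 1
np.fill_diagonal(A, 0)
A[9, 10] = A[10, 9] = 1
true = np.array([0] * 10 + [1] * 10)
observed = {0: 0, 19: 1}
x = true + rng.normal(0, 0.3, n)            # noisy scalar node feature

labels = dict(observed)
for i in range(n):                          # bootstrap with a local classifier
    if i not in observed:
        labels[i] = 0 if abs(x[i] - x[0]) < abs(x[i] - x[19]) else 1

counts = np.zeros((n, 2), dtype=int)
for _ in range(50):                         # iterate: SAMPLE instead of argmax
    for i in range(n):
        if i not in observed:
            neigh = [labels[j] for j in np.flatnonzero(A[i])]
            p1 = sum(neigh) / len(neigh)    # empirical P(y_i = 1 | N(i))
            labels[i] = int(rng.random() < p1)
            counts[i, labels[i]] += 1

# final label = majority over the sampled labels
pred = np.array([observed.get(i, counts[i].argmax()) for i in range(n)])
assert (pred == true).mean() >= 0.9
```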

Page 105


Remarks

• For both ICA and Gibbs
  o A sequential update of unobserved labels is performed
  o Any classifier can be used for the Bootstrap and Iterate steps
  o Hard labels are computed at each step
  o Node ordering has no real impact
  o The choice of classifier may impact the performance
• Training
  o Usually requires a fully annotated data set

Page 106


ICA ‐ Gibbs

Page 107


Stacked graphical learning  (Cohen et al XX)

• Main difference is in the training phase
• Ideas
  o Train a local classifier y = F(x)
  o Train a second classifier using both the input x and the predicted outputs in N(i)
  o Uses stacked learning
• Usually requires only a few (often 1!) iterations

Page 108


Stacked graphical learning

• Training
  o Bootstrap
    • Learn a local classifier F0 on training set D
  o Iterate k = 1 to K
    • Build training set Dk by augmenting xi with the predicted labels YN(i) of its neighbors :
      o xk = (x, YN(x))
    • Learn Fk on Dk
    • Note : this step uses stacked learning
  o Final model : FK
• Inference
  o y0 = F0(x)
  o For k = 1 to K
    • Compute xk as above
    • yk = Fk(xk)
  o Final prediction : yK = FK(xK)

Page 109


Stacked learning

o For robustness, training is performed using stacked learning
o Training set D
  • Let D1, …, Dm be a partition of D
  • Fk is trained as follows :
    o Train m functions fi
      • fi is trained on D − Di
      • For x ∈ Di , y = F(x) = fi(x)
o Note
  • At each iteration a different partition may be used
  • This prevents overfitting on the predicted labels
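The partition trick above can be sketched in a few lines: each fold's predictions come from a model trained on the other folds, so no example is ever predicted by a model that saw it (the least-squares model stands in for an arbitrary base classifier; data and fold count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 30
X = rng.normal(size=(n, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

def cv_predictions(X, y, m=5):
    """Stacked-learning predictions: f_i trained on D - D_i, applied on D_i."""
    idx = np.arange(len(X))
    preds = np.empty(len(X))
    for fold in np.array_split(idx, m):       # D_1, ..., D_m partition of D
        train = np.setdiff1d(idx, fold)
        # least-squares model as a stand-in base classifier
        w, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        preds[fold] = X[fold] @ w             # predict only on the held-out fold
    return preds

p = cv_predictions(X, y)
assert p.shape == (n,)                        # one held-out prediction per example
```

These held-out predictions are what gets appended to the features when training the next-level classifier Fk.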

Page 110


Stacked learning

Page 111


Some tests

• Sequential data (handwritten word recognition)

Page 112


Other approaches

• Extensions of graphical model classifiers have been proposed for collective and relational classification
  o Directed models
    • Relational Bayesian Networks (Taskar et al. 2001)
  o Undirected models
    • Relational Dependency Networks (Neville et al. 2003, 2007)
    • Relational Markov Networks (Taskar et al. 2002)
  o …

Page 113


Special case : univariate classification

• When the input features are ignored, collective classification is known as univariate collective classification

• Labels are propagated from observed nodes to unlabeled nodes

• All the above methods can be used in this setting (Macskassy et al. 2007)

Page 114

References

• Sen P., Namata G., Bilgic M., Getoor L., Gallagher B., Eliassi‐Rad T. Collective classification in network data. AI Magazine. 2008;29(3):1–24.

Page 115

Graph kernels

Page 116

Motivations

• Graph kernels allow one to define similarities between nodes in a graph, based on the graph structure
  o e.g. # paths connecting two nodes, mean weight of paths, etc.
    • i.e. complex similarity measures
  o Distance measures between nodes can easily be derived from these similarities
  o The kernel framework allows one to consider a large variety of distance measures and to represent the nodes of a graph as points in a Euclidean space
• Link with regularization‐based approaches
  o Some graph kernels may be obtained as solutions to the optimization of loss functions
• Link with random walks
  o Some graph kernels may be defined in terms of random walks on the graph

Page 117

Kernels

• Kernels are “similarity” functions k(x, x′) s.t.
  o k(x, x′) can be computed via an inner product of some transformation of x and x′ in a feature space
• Definition
  o K : X × X → R is a kernel function if for all x, z in X, K(x, z) = ⟨Φ(x), Φ(z)⟩, where Φ is a mapping from X onto an inner‐product feature space (a Hilbert space)

Page 118

• Initial motivations in machine learning
  o Non‐linear classification
    • Map the data onto a possibly high‐dimensional space, so that the problem becomes linear in that space
    • Limit the complexity of similarity computations to O(input space dimension)
      o Computations may be performed in the original (smaller) space at a linear cost

Page 119

• Initial motivations in machine learning
  o Non‐linear classification


(Figure: x and x′ in input space are mapped to Φ(x) and Φ(x′) in feature space; K(x, x′) = ⟨Φ(x), Φ(x′)⟩)

Page 120

Kernel functions ‐ examples

• Linear kernel
  o K(x, z) = x·z
• Order‐2 polynomial kernels
  o Homogeneous : K(x, z) = (x·z)² = Σi,j (xi xj)(zi zj) = Φ(x)·Φ(z)
    • with Φ(x) = (xi xj)i,j=1..n , i.e. all monomials of degree 2
  o Inhomogeneous : K(x, z) = (x·z + c)² = Φ(x)·Φ(z)
    • with Φ(x) = ((xi xj)i,j=1..n , (√(2c) xi)i=1..n , c)
    • i.e. a subset of the set of degree‐2 polynomials
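The correspondence between the kernel and its explicit feature map can be checked numerically (random 3-dimensional vectors and c = 2 are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
x, z = rng.normal(size=3), rng.normal(size=3)
c = 2.0

def phi(v, c):
    """Explicit degree-2 feature map: all monomials v_i v_j,
    the sqrt(2c)-scaled linear terms, and the constant c."""
    quad = np.outer(v, v).ravel()
    return np.concatenate([quad, np.sqrt(2 * c) * v, [c]])

# homogeneous: (x.z)^2 equals the dot product of the quadratic monomials
assert np.isclose((x @ z) ** 2,
                  np.outer(x, x).ravel() @ np.outer(z, z).ravel())

# inhomogeneous: (x.z + c)^2 equals phi(x).phi(z)
assert np.isclose((x @ z + c) ** 2, phi(x, c) @ phi(z, c))
```

The kernel evaluates in O(n) while the explicit feature space has dimension n² + n + 1, which is the whole point of the kernel trick.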

Page 121

Kernels on finite spaces

• Let
  o X = {x1,…, xN}, with x ∈ X
  o K(x, x′) a symmetric function, K : X × X → R
• K is a kernel function iff the matrix K = (K(xi, xj))i,j=1..N is positive semi‐definite
• There are several equivalent characterizations of a kernel function
  o the matrix K is symmetric positive semi‐definite iff any of the following properties holds
    • vᵀKv ≥ 0 ∀v
    • All the eigenvalues of the real matrix K are real and non‐negative
    • K = BᵀB for some real matrix B
      o B is not unique in general, and different decompositions may exist

Page 122

• K = (K(xi, xj))i,j=1..N is called the kernel matrix
• For n data points {x1,…, xn}, the Gram matrix is G = (K(xi, xj))i,j=1..n

Page 123

Symmetric matrices — useful properties

• A symmetric matrix has only real eigenvalues
• The eigenvectors of a symmetric matrix are orthogonal and can then be chosen orthonormal
• If a symmetric n×n matrix A has k non‐zero eigenvalues, then it can be diagonalized and expressed as
  o A = UΛUᵀ
  o With
    • Λ the diagonal k×k matrix of (non‐zero) eigenvalues
      o Usually ordered in decreasing order
    • U the n×k orthonormal matrix of corresponding eigenvectors
• Another expression for A is
  o A = (UΛ1/2)(UΛ1/2)ᵀ = XXᵀ

Page 124

• The data matrix associated with a kernel matrix
  o Let
    • K a kernel matrix
    • K = (UΛ1/2)(UΛ1/2)ᵀ = XXᵀ its eigenvalue decomposition
    • xi the ith column vector of Xᵀ
  o Then
    • (K)ij = xiᵀxj
    • xi is an r‐dimensional vector (r = rank of K) : this is the feature vector associated with the ith pattern
    • xi is the Euclidean representation of pattern i in this space
  o When pattern i characterizes a node in a graph, xi is its “Euclidean representation”
  o X is called the data matrix associated with the kernel matrix K
  o Note
    • This means that data points in complex spaces (e.g. graphs) may be represented in a Euclidean space using this data‐matrix representation
    • Classical Euclidean operations (dot products, distances, projections) can be defined on these complex objects directly via the kernel matrix
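A sketch of this construction: start from any PSD kernel matrix on five abstract objects, recover the data matrix via the eigendecomposition, and check that dot products and distances in the embedding come from K alone:

```python
import numpy as np

rng = np.random.default_rng(7)
M = rng.normal(size=(5, 5))
K = M @ M.T                          # any PSD matrix is a valid kernel matrix

lam, U = np.linalg.eigh(K)
lam = np.clip(lam, 0, None)          # guard against tiny negative round-off
X = U * np.sqrt(lam)                 # data matrix: row x_i = embedding of object i

# (K)_ij = x_i . x_j : kernel values are plain dot products in the embedding
assert np.allclose(X @ X.T, K)

# Euclidean distances are recoverable from K: d(i,j)^2 = K_ii + K_jj - 2 K_ij
i, j = 1, 3
dist2 = np.sum((X[i] - X[j]) ** 2)
assert np.isclose(dist2, K[i, i] + K[j, j] - 2 * K[i, j])
```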

Page 125

• Angles
  o cos(xi, xj) = Kij / √(Kii Kjj)
• Distances
  o d(xi, xj)² = ‖xi − xj‖² = Kii + Kjj − 2Kij = (ei − ej)ᵀK(ei − ej)
    • With ei = (0,…, 0, 1, 0,…, 0)ᵀ, with the 1 in position i

Page 126

Kernels on graphs
Similarity between graph nodes

Graph → “Metric” space

Page 127

How to define meaningful kernels on graphs

Page 128

Example: kernels based on the adjacency matrix

• Graph on vertices v1, v2, v3, v4 with edges v1–v2, v1–v3, v1–v4, v2–v4

      0 1 1 1             3 1 0 1
  A = 1 0 0 1       A² =  1 2 1 1
      1 0 0 0             0 1 1 1
      1 1 0 0             1 1 1 2

• (Aⁿ)ij = # paths of length n between i and j (e.g. A² above counts paths of length 2)
• K = A + A² + … + Aⁿ : all paths of length 1 to n
• K = Σn λⁿAⁿ : infinite discounted sum
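The path-counting interpretation and the discounted sum can be checked on this 4-node graph (λ = 0.2 is an illustrative choice below the inverse spectral radius):

```python
import numpy as np

# Adjacency matrix of the example graph (edges v1-v2, v1-v3, v1-v4, v2-v4)
A = np.array([[0, 1, 1, 1],
              [1, 0, 0, 1],
              [1, 0, 0, 0],
              [1, 1, 0, 0]])

A2 = A @ A                     # (A^2)_ij = number of length-2 paths i -> j
assert A2[0, 0] == 3           # v1 -> {v2, v3, v4} -> v1
assert A2[0, 2] == 0           # no length-2 path from v1 to v3
assert np.array_equal(np.diag(A2), A.sum(axis=1))   # diagonal = node degrees

# Infinite discounted sum K = sum_n lambda^n A^n = (I - lambda A)^-1
lam = 0.2                      # needs lambda < 1 / spectral_radius(A)
assert lam < 1 / np.max(np.abs(np.linalg.eigvals(A)))
K = np.linalg.inv(np.eye(4) - lam * A)
approx = sum(np.linalg.matrix_power(lam * A, n) for n in range(60))
assert np.allclose(K, approx)  # truncated series matches the closed form
```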

Page 129

Diffusion kernels on graphs (Shawe‐Taylor et al. 2004)

• Let B = (bij)i,j=1..n denote a similarity matrix between the graph nodes s.t. bij is the similarity between nodes i and j; B is symmetric.
• Consider the following similarity :
  o bij(2) = Σk bik bkj
  o i.e. bij(2) is the sum of the similarities of all length‐2 paths between i and j, where the similarity of a path is the product of the similarities of its edges
  o Then B(2) = B², and B² is a kernel matrix
• In the same way
  o Bk is a kernel matrix (for even k, or for any k when B is itself positive semi‐definite)
    • It gives the sum of all length‐k path similarities
  o Linear combinations of such power matrices of B, with non‐negative coefficients, are kernel matrices.

Page 130

• Von Neumann diffusion kernel
o K_VN = ∑_{k=1..∞} αᵏ Aᵏ
o Aᵏ measures the number of paths of length k between any pair of nodes
o The similarity between two nodes i and j, (K_VN)ij, integrates contributions from all the paths from i to j in the graph, with a discounting factor αᵏ decreasing with k
o Here the importance of a path decreases geometrically with its length
o Converges if 0 < α < (ρ(A))⁻¹, with ρ(A) the spectral radius of A
o Converges to K_VN = (I − αA)⁻¹ − I
• Note
o Similar kernels can be defined for any symmetric matrix B
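A minimal numpy sketch of the Von Neumann kernel on the 4-node example graph. Here the series is taken from k = 0, so K = (I − αA)⁻¹ (starting at k = 1 simply subtracts I); the choice α = 0.9/ρ(A) is an arbitrary value inside the convergence range.

```python
import numpy as np

A = np.array([[0, 1, 1, 1],
              [1, 0, 0, 1],
              [1, 0, 0, 0],
              [1, 1, 0, 0]], dtype=float)

# alpha must satisfy 0 < alpha < 1 / rho(A), rho = spectral radius
rho = max(abs(np.linalg.eigvals(A)))
alpha = 0.9 / rho

# Von Neumann diffusion kernel: K = sum_{k>=0} alpha^k A^k = (I - alpha A)^{-1}
K_vn = np.linalg.inv(np.eye(4) - alpha * A)

# K is symmetric positive definite here, hence a valid kernel matrix
assert np.allclose(K_vn, K_vn.T)
assert np.all(np.linalg.eigvalsh(K_vn) > 0)
```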


Other graph kernels (Fouss et al. 2009)

• Exponential diffusion kernel
o K_ED = ∑_{k=0..∞} (αᵏ/k!) Aᵏ = exp(αA), with A the adjacency matrix of graph G
o (Aᵏ)ij is the number of length-k paths between i and j
o Similar to the Von Neumann kernel, with a different (factorial) discounting rate; the series converges for any α
• The Laplacian exponential diffusion kernel is the same as the exponential diffusion kernel, except that the adjacency matrix A is replaced with minus the Laplacian matrix L
o K_LED = ∑_{k=0..∞} (αᵏ/k!) (−L)ᵏ = exp(−αL)
• The regularized Laplacian kernel is similar to the Von Neumann kernel, with minus the unnormalized Laplacian −L substituted for A
o K_RL = ∑_{k=0..∞} αᵏ (−L)ᵏ = (I + αL)⁻¹
o This kernel also appears in the regularization approach to semi-supervised learning (slide XX)
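A sketch of the three Laplacian-based kernels above in numpy, on the 4-node example graph; α = 0.5 is an arbitrary illustrative value, and the matrix exponential is computed through an eigendecomposition (valid because A and L are symmetric).

```python
import numpy as np

A = np.array([[0, 1, 1, 1],
              [1, 0, 0, 1],
              [1, 0, 0, 0],
              [1, 1, 0, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A        # unnormalized Laplacian
alpha = 0.5

def sym_expm(M):
    """exp(M) for a symmetric matrix, via eigendecomposition."""
    w, Q = np.linalg.eigh(M)
    return (Q * np.exp(w)) @ Q.T

K_ed  = sym_expm(alpha * A)                     # exponential diffusion kernel
K_led = sym_expm(-alpha * L)                    # Laplacian exponential diffusion
K_rl  = np.linalg.inv(np.eye(4) + alpha * L)    # regularized Laplacian kernel

# all three are symmetric with positive eigenvalues, hence valid kernels
for K in (K_ed, K_led, K_rl):
    assert np.allclose(K, K.T)
    assert np.all(np.linalg.eigvalsh(K) > 0)
```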


• Random walk with restart kernel
o Consider a random walker which, with probability α, jumps from node i to node j according to a row-stochastic transition matrix P, and at each step jumps back to its starting node i with probability 1 − α (Gori et al. 2006)
o The RW is described by the following process
• x(0) = eᵢ
• x(t+1) = αPᵀx(t) + (1 − α)eᵢ
• The steady-state solution for a walk starting at node i is
• x = (1 − α)(I − αPᵀ)⁻¹eᵢ
• x is the iᵗʰ column of (1 − α)(I − αPᵀ)⁻¹; it provides a similarity between node i and the other nodes of the graph
• The random walk with restart matrix is K_RWR = (1 − α)(I − αPᵀ)⁻¹
• The transpose appears because x(t) is a column distribution vector: with a row-stochastic P, the distribution update reads x(t+1)ᵀ = x(t)ᵀP, i.e. x(t+1) = Pᵀx(t)
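A small numpy check of the closed form above, on the 4-node example graph: each column of K_RWR is indeed the fixed point of the restart iteration; α = 0.85 is an illustrative choice.

```python
import numpy as np

A = np.array([[0, 1, 1, 1],
              [1, 0, 0, 1],
              [1, 0, 0, 0],
              [1, 1, 0, 0]], dtype=float)
P = A / A.sum(axis=1, keepdims=True)   # row-stochastic transition matrix
alpha, n = 0.85, 4

# closed form: column i of K is the steady state of a walk restarting at node i
K = (1 - alpha) * np.linalg.inv(np.eye(n) - alpha * P.T)

# check column 0 against the fixed-point iteration
# x <- alpha * P^T x + (1 - alpha) * e_0
e0 = np.zeros(n); e0[0] = 1.0
x = e0.copy()
for _ in range(500):
    x = alpha * P.T @ x + (1 - alpha) * e0
assert np.allclose(x, K[:, 0])
```

Since P is row-stochastic, each column of K is a probability distribution (it sums to 1).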


Using graph kernels for recommendation

• Collaborative filtering
o U a set of users
o I a set of items
o Each user rates some of the items
• User-item matrix (sparse)
o Collaborative filtering:
• Recommend items to users
• Usually based on the similarity of the users' ratings
• Predict the missing ratings for users
• Many different techniques


Known ratings (rows = users 1–5, columns = items 1–4):

Users \ Items   1  2  3  4
1               5  –  –  3
2               –  –  –  –
3               –  3  –  –
4               –  –  –  –
5               1  2  –  –

The same matrix with the missing entries to predict marked "?":

Users \ Items   1  2  3  4
1               5  ?  ?  3
2               ?  ?  ?  ?
3               ?  3  ?  ?
4               ?  ?  ?  ?
5               1  2  ?  ?


• Popular challenges on movie recommendation
o CAMRa2010
o Netflix Prize
• "On September 21, 2009 we awarded the $1M Grand Prize to team 'BellKor's Pragmatic Chaos'."
• "There are currently 51051 contestants on 41305 teams from 186 different countries. We have received 44014 valid submissions from 5169 different teams."


Using graph kernels for recommendation

• Let G = (V, E) be the user-item bipartite graph
o Nodes V are the users and the items
o Links: A adjacency matrix
• aij = 1 if user i has rated item j
• aij = 0 otherwise
o The graph is bipartite: links only connect users to items


Using graph kernels for recommendation

• The different graph kernels can be computed on this bipartite graph
o With N users and M items, the kernel matrix K, of size (N+M)×(N+M), provides the similarities between graph nodes
o It can be partitioned into 4 blocks:
• K = [ K_UU  K_UI ; K_IU  K_II ]
o K_UU is the N×N user-user similarity matrix
o K_II is the M×M item-item similarity matrix
o K_UI is the N×M user-item preference matrix, and K_IU = K_UIᵀ


Using graph kernels for recommendation

• Three ways of computing recommendations (Fouss et al. 2009)
o Direct
• Use sim(Useri, Itemj) = (K_UI)ij for direct ranking of the recommendations
o User based
• Compute sim(Useri, Userp) = (K_UU)ip
• Keep the k nearest neighbors KNN(i) of Useri
o k is a hyperparameter
o The recommendation score of item j for user i is
• score(Useri, Itemj) = ∑_{p∈KNN(i)} sim(Useri, Userp) · a_pj / ∑_{p∈KNN(i)} sim(Useri, Userp)
• a_pj = 1 if Userp rated Itemj and 0 otherwise


Using graph kernels for recommendation

o Item based
• Compute sim(Itemi, Itemj) = (K_II)ij
• Keep the k nearest neighbors NN(j) of Itemj
o k is a hyperparameter
o The recommendation score of item j for user i is
• score(Useri, Itemj) = ∑_{p∈NN(j)} sim(Itemj, Itemp) · a_ip / ∑_{p∈NN(j)} sim(Itemj, Itemp)
• a_ip = 1 if Useri rated Itemp and 0 otherwise
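A sketch of the user-based scoring rule on a tiny hypothetical user-item matrix (4 users, 3 items, invented for illustration). Any graph kernel can play the role of sim(·,·); the Von Neumann kernel of the bipartite graph is used here.

```python
import numpy as np

# toy setting: a_ij = 1 if user i rated item j (hypothetical data)
A_ui = np.array([[1, 1, 0],
                 [1, 0, 1],
                 [0, 1, 0],
                 [1, 0, 0]], dtype=float)
n_users, n_items = A_ui.shape

# adjacency matrix of the bipartite user-item graph, users first
A = np.block([[np.zeros((n_users, n_users)), A_ui],
              [A_ui.T, np.zeros((n_items, n_items))]])

# any graph kernel works here; Von Neumann kernel as an example
alpha = 0.9 / max(abs(np.linalg.eigvals(A)))
K = np.linalg.inv(np.eye(len(A)) - alpha * A)
K_uu = K[:n_users, :n_users]          # the user-user block

def user_based_score(i, j, k=2):
    """Score of item j for user i, averaged over the k most similar users."""
    sims = K_uu[i].copy()
    sims[i] = -np.inf                 # exclude the user itself
    knn = np.argsort(sims)[-k:]       # indices of the k nearest neighbors
    num = sum(K_uu[i, p] * A_ui[p, j] for p in knn)
    den = sum(K_uu[i, p] for p in knn)
    return num / den

score = user_based_score(0, 2)        # a weighted average of 0/1 ratings
assert 0.0 <= score <= 1.0
```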


Using graph kernels for classification

• Kernels may be used for semi-supervised classification
o Several of the matrices obtained via transductive approaches to semi-supervised learning are indeed kernels, or similarity matrices
o Given an n×n kernel matrix K, a simple classification rule is:
• Let y_c be an n×1 vector s.t. y_{c,i} = 1 if node i is labeled with class c and 0 otherwise (unknown or other class)
• K·y_c is the vector of class-c scores for the graph nodes
o It computes the similarity of each node with the labeled nodes of class c
• Finally, a node is assigned to the class with the highest score
o This is similar to what we did with the regularization-based approaches
• Note
o Other, more sophisticated classification rules might be used
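The K·y_c rule can be sketched in a few lines of numpy on an invented toy graph (two triangles joined by one edge), with the regularized Laplacian as an example kernel and one labeled node per class.

```python
import numpy as np

# toy graph: two triangles (nodes 0-2 and 3-5) joined by the edge 2-3
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

# any graph kernel works; regularized Laplacian as an example
L = np.diag(A.sum(axis=1)) - A
K = np.linalg.inv(np.eye(6) + 0.5 * L)

# one labeled node per class, encoded as indicator vectors y_c
y_a = np.array([1., 0., 0., 0., 0., 0.])   # node 0 labeled with class a
y_b = np.array([0., 0., 0., 0., 0., 1.])   # node 5 labeled with class b

scores = np.stack([K @ y_a, K @ y_b])      # class scores for every node
pred = scores.argmax(axis=0)               # assign the highest-scoring class
```

By the graph's mirror symmetry (i ↔ 5−i), the two score vectors are reverses of each other, and each labeled node is assigned its own class.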


References

• Shawe-Taylor J., Cristianini N., Kernel Methods for Pattern Analysis, Cambridge University Press, 2004.
• Yen L., Pirotte A., Saerens M., An Experimental Investigation of Graph Kernels on Collaborative Recommendation and Semisupervised Classification, submitted, 2009, pp. 1-39.


Latent models

Non Negative Matrix Factorization

Apprentissage Statistique     ‐ P. Gallinari 140


Non Negative Matrix Factorization

• Idea
o Project the data vectors into a latent space of dimension k < m, the size of the original space
o The axes of this latent space represent a new basis for data representation
o Each original data vector is approximated as a linear combination of the k basis vectors of this new space
o Each data point is assigned to the nearest axis
o This provides a clustering of the data


o Data: {x1, …, xn}, xi ∈ ℝᵐ, xi ≥ 0
o X: the m × n non-negative matrix whose columns are the xi's
o Find non-negative factors U, V such that X ≈ UV
• With U an m × k matrix, V a k × n matrix, k < m, n

X    ≈    U   ×   V
m×n      m×k     k×n


• Approximation
o xi ≈ ∑_j v_{ji} u_j
o The columns u_j of U are the basis vectors; the v_{ji} are the coefficients of xi in this basis
• Learning
o Solve min_{U,V} ‖X − UV‖² under the constraints U ≥ 0, V ≥ 0
o The loss function is convex in U and in V separately, but not jointly in U and V


• Algorithm
o Constrained optimization problem
o Can be solved by a Lagrangian formulation
o Iterative algorithm (Xu et al. 2003), with multiplicative updates:
• U, V initialized at random values
• Iterate until convergence
o u_{ij} ← u_{ij} (XVᵀ)_{ij} / (UVVᵀ)_{ij}
o v_{ij} ← v_{ij} (UᵀX)_{ij} / (UᵀUV)_{ij}
o The solution U, V is not unique: if U, V is a solution, then UD, D⁻¹V for any positive diagonal matrix D is also a solution


• Clustering
o Normalize U so that each column vector has norm 1, rescaling V so that the product UV is unchanged:
• u_{ij} ← u_{ij} / ‖u_j‖
• v_{ji} ← v_{ji} ‖u_j‖
o Under the constraint "U normalized", the solution U, V is unique
o Associate xi to cluster j if v_{ji} = max_l v_{li}
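The multiplicative updates and the clustering step can be sketched in numpy; this is the standard Lee & Seung update family for the Frobenius loss, run on a synthetic exactly-rank-k non-negative matrix (sizes and iteration count are illustrative choices).

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic non-negative data matrix X (m x n), to be factored as X ~ U V
m, n, k = 8, 20, 3
X = rng.random((m, k)) @ rng.random((k, n))      # exactly rank k, nonnegative

U = rng.random((m, k)) + 0.1
V = rng.random((k, n)) + 0.1
eps = 1e-12                                      # avoids division by zero
err0 = np.linalg.norm(X - U @ V)

# multiplicative updates for min ||X - UV||^2  s.t.  U, V >= 0
for _ in range(500):
    U *= (X @ V.T) / (U @ V @ V.T + eps)
    V *= (U.T @ X) / (U.T @ U @ V + eps)

err = np.linalg.norm(X - U @ V)
assert err < err0                    # the loss is non-increasing

# clustering step: normalize the columns of U, rescale V so UV is unchanged
d = np.linalg.norm(U, axis=0)
U, V = U / d, V * d[:, None]
clusters = V.argmax(axis=0)          # data point i goes to cluster argmax_j v_ji
```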


• Note
o There are many different versions of NMF
o Different loss functions
• e.g. different constraints on the decomposition
o Different algorithms
• Applications
o Clustering
o Recommendation
o Link prediction
o etc.
• Specific forms of NMF can be shown to be equivalent to
o PLSA
o Spectral clustering


Illustration (Lee & Seung 1999)
• Basis images for:
o NMF
o Vector Quantization
o Principal Component Analysis


Latent models

Probabilistic Latent Semantic Analysis‐ PLSA


Preliminaries: unigram model

• Generative model of a document
o Select the document length
o Pick a word w with probability p(w|d)
o Continue until the end of the document

p(d) = ∏_i p(wi|d)

• Applications
o Classification
o Clustering
o Ad-hoc retrieval (language models)


Preliminaries – unigram model – geometric interpretation

A document d is represented as a point on the word simplex, with coordinates p(wi|d); e.g. with three words, p(w1|d) = 1/2, p(w2|d) = 1/4, p(w3|d) = 1/4.


Latent models for document generation

• Several factors influence the creation of a document (authors, topics, mood, etc.)
o They are usually unknown
• Generative statistical models
o Associate the factors with latent variables
o Identifying (learning) the latent variables allows us to uncover (by inference) complex latent structures


Probabilistic Latent Semantic Analysis ‐ PLSA (Hofmann 99)

• Motivations
o Several topics may be present in a document or in a document collection
o Learn the topics from a training collection
o Applications
• Identify the semantic content of documents, document relationships, trends, …
• Segment documents, ad-hoc IR, …


PLSA

• The latent structure is a set of topics
o Each document is generated as a set of words chosen from selected topics
o A latent variable z (topic) is associated with each word occurrence in the document
• Generative process
o Select a document d with probability P(d)
o Iterate over the word positions:
• Choose a latent class z with probability P(z|d)
• Generate a word w with probability P(w|z)
o Note: P(w|z) and P(z|d) are multinomial distributions over the V words and the T topics respectively

Page 155: Mining Content Information Networksgallinar/gallinari/uploads/... · o Comments, messages, backlinks, linkbacks o Micro blogging: followers • E‐mails o To, from, subject, date,

Apprentissage Statistique     ‐ P. Gallinari 154

PLSA ‐ Topic

• A topic is a distribution P(w|z) over words

word          P(w|z)
machine       0.04
learning      0.01
information   0.09
retrieval     0.02
…             …

• Remark
o A topic is shared by several words
o A word is associated with several topics


PLSA as a graphical model

P(w|d) = ∑_z P(w|z) P(z|d)
P(d, w) = P(d) · P(w|d)

Plate notation (boxes represent repeated sampling): d → z → w; the word plate (z, w) is repeated Nd times inside each document, and the document plate D times over the corpus. P(z|d) is a document-level parameter, P(w|z) a corpus-level parameter.


PLSA model

• Hypotheses
o The number of values of z is fixed a priori
o Bag of words
o Documents are independent
• No specific distribution over the documents
o Conditional independence
• z being known, w and d are independent
• Learning
o Maximum likelihood: maximize p(Doc-collection)
o EM algorithm and variants
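The EM iteration for PLSA can be sketched in numpy on a small invented term-document count matrix (the counts and the number of topics T = 2 are illustrative choices): the E step computes P(z|w,d) ∝ P(w|z)P(z|d), and the M step re-estimates the two multinomials from the expected counts.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy term-document count matrix n(w, d): V words x D documents
N = np.array([[4, 3, 0, 0],
              [2, 5, 1, 0],
              [0, 1, 6, 3],
              [0, 0, 2, 5]], dtype=float)
V, D = N.shape
T = 2                                   # number of topics, fixed a priori

# random initialization of the multinomials P(w|z) and P(z|d)
Pw_z = rng.random((V, T)); Pw_z /= Pw_z.sum(axis=0)
Pz_d = rng.random((T, D)); Pz_d /= Pz_d.sum(axis=0)

for _ in range(100):
    # E step: P(z|w,d) proportional to P(w|z) P(z|d)
    Pz_wd = Pw_z[:, :, None] * Pz_d[None, :, :]        # shape V x T x D
    Pz_wd /= Pz_wd.sum(axis=1, keepdims=True) + 1e-12
    # M step: re-estimate the multinomials from expected counts
    C = N[:, None, :] * Pz_wd                          # expected n(w, z, d)
    Pw_z = C.sum(axis=2); Pw_z /= Pw_z.sum(axis=0)
    Pz_d = C.sum(axis=0); Pz_d /= Pz_d.sum(axis=0)

Pw_d = Pw_z @ Pz_d                       # the model's P(w|d), one column per doc
assert np.allclose(Pw_d.sum(axis=0), 1.0)
```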


PLSA ‐ geometric interpretation

• Each topic is a point on the word simplex
• Documents are constrained to lie on the topic simplex, the convex hull of the topics
• This creates a bottleneck in the document representation

P(w|d) = ∑_z P(w|z) P(z|d)


Applications

• Thematic segmentation
• Creating document hierarchies
• IR: the PLSI model
• Clustering and classification
• Image annotation
o Learn and infer P(w|image)
• Collaborative filtering
• Note: many variants and extensions
o E.g. hierarchical PLSA (see Gaussier et al.)


An introduction to recommender systems:
• Collaborative filtering
• Local neighborhood methods
• Matrix factorization methods


Example ‐ Amazon


Example (2) ‐ Amazon


Example (3) ‐ Netflix


Example (4) ‐ Recsys


Personalized recommendation

• Two main strategies
o Content filtering (not in this course)
• Use product or user characteristics to recommend a list of products to a user
• Learn to associate users with products
o Collaborative filtering: this course
• Use users' previous transactions or ratings to associate users with products
o Introduced by researchers from Xerox PARC in 1992: "Using collaborative filtering to weave an information Tapestry", D. Goldberg, D. Nichols, B.M. Oki, D. Terry
o Domain free
o Implements the "word-of-mouth" principle: the interests of a given user for a product are predicted using taste information from the other users who, globally, have the same tastes as the current user
• Methods
o Neighborhood methods
o Factorization methods


Collaborative filtering : Data

• The data take the form of a rating matrix
o m users u1, …, um
o n products i1, …, in
o The m × n rating matrix R contains values characterizing the interest of users for products
• r_ij measures the interest of user i for item j
• r_ij = ? if the value is unknown
• R is very sparse (often almost empty)
o Measures of interest
• Ratings, binary values (e.g. "like")
• Clicks on search results, purchases, etc.


Collaborative filtering: neighborhood methods

• Ideas
o Define a similarity between users or between items
• Two users are similar if they share the same tastes or have similar interactions with the system
• Two items are similar if they were given similar ratings by many users, or if the user-product interactions are similar for the two items
o Predict an unknown rating for a product
• User based
o The rating of product p for user u is a weighted average, over the users similar to u, of their known ratings for product p
• Item based
o The rating of product p for user u is a weighted average, over the products similar to p, of the known ratings of user u


User based collaborative filtering

• Similarity measures between users
o Cosine measure, computed over the items rated by both users:
• cos(ui, uk) = ∑_{j: rij≠?, rkj≠?} rij·rkj / ( √(∑_{j: rij≠?, rkj≠?} rij²) · √(∑_{j: rij≠?, rkj≠?} rkj²) )
o Correlation coefficient
• Let r̄i be the average of the known ratings of user i
• corr(ui, uk) = ∑_{j: rij≠?, rkj≠?} (rij − r̄i)(rkj − r̄k) / ( √(∑_j (rij − r̄i)²) · √(∑_j (rkj − r̄k)²) )


User based collaborative filtering

• Prediction function
o Let
• r̂ij be the predicted rating for item j and user i
• U(i) the K users most similar to user i
o Prediction function:
• r̂ij = ∑_{k∈U(i); rkj≠?} sim(ui, uk) · rkj / ∑_{k∈U(i); rkj≠?} sim(ui, uk)
• where sim(·,·) is one of the similarity functions above
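The user-based scheme above can be sketched in numpy on the toy rating matrix from the earlier slide, using the cosine over co-rated items and a top-K weighted average (K = 2 is an illustrative choice):

```python
import numpy as np

# rating matrix from the toy example; np.nan marks the unknown entries ("?")
R = np.array([[5, np.nan, np.nan, 3],
              [np.nan, np.nan, np.nan, np.nan],
              [np.nan, 3, np.nan, np.nan],
              [np.nan, np.nan, np.nan, np.nan],
              [1, 2, np.nan, np.nan]], dtype=float)

def cos_users(R, i, k):
    """Cosine similarity over the items rated by BOTH users i and k."""
    both = ~np.isnan(R[i]) & ~np.isnan(R[k])
    if not both.any():
        return 0.0
    a, b = R[i, both], R[k, both]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict(R, i, j, K=2):
    """Rating of item j for user i from the K most similar users who rated j."""
    raters = [k for k in range(len(R)) if k != i and not np.isnan(R[k, j])]
    raters.sort(key=lambda k: cos_users(R, i, k), reverse=True)
    top = raters[:K]
    num = sum(cos_users(R, i, k) * R[k, j] for k in top)
    den = sum(cos_users(R, i, k) for k in top)
    return num / den if den else np.nan

r = predict(R, 2, 0)   # predicted rating of item 1 for user 3 (0-indexed)
```

Here user 3 only co-rates item 2 with user 5, so the prediction is driven by user 5's rating of item 1.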


Item based collaborative filtering

• Similarity measures between products
o Cosine measure, computed over the users who rated both items:
• cos(ij, il) = ∑_{u: ruj≠?, rul≠?} ruj·rul / ( √(∑_{u} ruj²) · √(∑_{u} rul²) )
o Adjusted cosine
• Let r̄u be the average of the known ratings of user u
• adjcos(ij, il) = ∑_{u: ruj≠?, rul≠?} (ruj − r̄u)(rul − r̄u) / ( √(∑_u (ruj − r̄u)²) · √(∑_u (rul − r̄u)²) )
o Correlation coefficient
o Other measures


Item based collaborative filtering

• Prediction function
o Let
• r̂ij be the predicted rating for item j and user i
• N(j) the k products most similar to item j
o Prediction function:
• r̂ij = ∑_{l∈N(j); ril≠?} sim(ij, il) · ril / ∑_{l∈N(j); ril≠?} sim(ij, il)
• where sim(·,·) is one of the similarity functions above


Quality of the predicted ratings


Matrix Factorization methods

• Data
o The interaction (rating) matrix R
• Users and items are mapped onto a common latent factor space of dimensionality d
• User-item interactions are measured as inner products in this space


Example (Koren et al. 2009)


Basic model

• User i is represented by a d-dimensional vector ui
• Item j is represented by a d-dimensional vector vj
• Predicted rating: r̂ij = uiᵀvj
• Optimization problem
o min_{U∈ℝ^{m×d}, V∈ℝ^{n×d}} ‖M ⊙ (R − UVᵀ)‖²
• where ⊙ denotes the elementwise multiplication, and M is the mask of known ratings
• i.e. only the known ratings are considered


• Avoiding overfitting using regularization terms:
o min_{U∈ℝ^{m×d}, V∈ℝ^{n×d}} ∑_{(i,j) known} (rij − uiᵀvj)² + λ(‖ui‖² + ‖vj‖²)


Algorithms

• Two popular approaches
o Stochastic gradient descent
o Alternating Least Squares
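The SGD variant can be sketched in numpy on synthetic ratings generated from a ground-truth rank-d model (sizes, learning rate and λ are illustrative choices); each observed entry contributes one gradient step on its squared error plus the regularization terms.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic ratings from a ground-truth rank-d model, with ~60% observed
n_users, n_items, d = 30, 40, 3
U_true = rng.normal(size=(n_users, d))
V_true = rng.normal(size=(n_items, d))
R = U_true @ V_true.T
mask = rng.random(R.shape) < 0.6
obs = list(zip(*np.nonzero(mask)))           # list of observed (i, j) pairs

# SGD on sum_{(i,j) observed} (r_ij - u_i . v_j)^2 + lam (|u_i|^2 + |v_j|^2)
U = 0.1 * rng.normal(size=(n_users, d))
V = 0.1 * rng.normal(size=(n_items, d))
lr, lam = 0.02, 0.01

for epoch in range(100):
    rng.shuffle(obs)                         # visit observations in random order
    for i, j in obs:
        e = R[i, j] - U[i] @ V[j]            # prediction error on this entry
        U[i] += lr * (e * V[j] - lam * U[i])
        V[j] += lr * (e * U[i] - lam * V[j])

rmse = np.sqrt(np.mean([(R[i, j] - U[i] @ V[j])**2 for i, j in obs]))
assert rmse < np.std(R[mask])                # better than predicting the mean
```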
