Today lecture: 1. Overview of social network analysis ...
Transcript of Today lecture: 1. Overview of social network analysis ...
Social network analysis
Today lecture:
1. Overview of social network analysis
Properties of social networks
Analysis tasks
2. Martino: Community detection in temporal networks
3. Bruno: Community detection in signed networks
Aggarwal Ch. 19
MDM course Aalto 2020 – p.1/20
Many types of social networks
online networks (Twitter, LinkedIn, Facebook)
indirect communication networks (telecommunications,email, chat messages)
media sharing sites (Youtube, Instagram, Tiktok)
interaction networks in professional communities (e.g.,citation networks between researchers)
MDM course Aalto 2020 – p.2/20
Presentation as a graph G = (V,E)
V set of nodes corresponding to actors
may have labels or content (attributes, documents)
E set of edges corresponding to links
usually undirected (friendship)
sometimes directed (“following” in twitter)
may have weights wi j
a node is a neighbour of another node, if they areconnected by an edge
MDM course Aalto 2020 – p.3/20
Example of a social network
edge weight = number of exchanged messagesnode attributes: city and education
Image source: Zardi et al.: A Multi-agent homophily-based approachfor community detection in social networks, ICTAI 2014
MDM course Aalto 2020 – p.4/20
Two basic properties
1. Homophily: Nodes connected to one another tend toshare similar (content-based) properties.
e.g., same background, hobbies or interests
2. Triadic closure: If two actors share common neighbours,it is more likely they are already connected or willeventually become connected
MDM course Aalto 2020 – p.5/20
Typical dynamic properties
1. Preferential attachment: likelihood of a node to receivenew edges increases with node degree→ smallnumber of very high degree nodes (hubs)
2. Small world property: average path lengths betweenany two nodes are quite small
3. Densification: Groups become denser over time (morenew links than nodes)⇒
4. Shrinking diameters: distances between nodes shrinkover time
5. Giant connected component: often a giant connectedcomponent emerges in the network
MDM course Aalto 2020 – p.6/20
Analysis tasks
I Social influence analysis (influential nodes andinfluence spread)
II Community detection (graph clustering)
III Collective classification (predict missing node labels)
IV Link prediction (predict future links between nodes)
MDM course Aalto 2020 – p.7/20
I Social influence analysis
Which nodes have most influence? How influence(information, ideas, opinions) spreads?
A valuable advertising channel!
1. Measures for evaluating which nodes are influential:
centrality of a node in an undirected graph
prestige of a node in a directed graph
2. Influence propagation or diffusion models
given influence weights on edges and a model toevaluate total influence of a set of nodes
determine a set of seed nodes such that spread ofinfluence is maximal
MDM course Aalto 2020 – p.8/20
Measures for the centrality of node v
Degree centrality: CD(v) =Degree(v)
n−1
Closeness centrality: CC(v) = 1avgu∈V {Dist(v,u)}
Betweenness centrality: CB(v) =
∑u,w∈V,u,w
#{shortest-paths(u,w) through v}
#{shortest-paths(u,w)}
( n2 )
Image source: Aggarwal Ch 19MDM course Aalto 2020 – p.9/20
II Community detection: cluster the graph
Given G = (V,E). Each edge (vi, v j) has weight wi j
if cost ci j, transform, e.g. by wi j =1
ci j(ci j , 0)
Common objective: Cluster V into groups V1, . . . ,VK
such that the edge-cut cost
cost(V1, . . . ,Vk) =∑
(vi,v j)∈E,vi∈Vp,v j∈Vq,p,q
wi j
is minimal.
many variants and extra constraints!
in general NP-hard problem, but polynomially solvable, if∀i, j : wi j = 1, K = 2 and no balancing requirements
MDM course Aalto 2020 – p.10/20
Example
Clustering based on both structural and content-basedfeatures
Image source: Zardi et al.: A Multi-agent homophily-based approachfor community detection in social networks, ICTAI 2014 MDM course Aalto 2020 – p.11/20
Some community detection methods
Kerningham-Lin: balanced 2-way partitioning
at each iteration, test a set of possible swap sequencesand choose the one with greatest improvement
Girwan-Newman: Remove
“bridge edges” until K con-
nected components remain
edges with highbetweenness: largeproportion of shortestpaths go through them
e.g., edges incident onhubs
Image source: Aggarwal Ch 19 MDM course Aalto 2020 – p.12/20
METIS algorithm
1. Coarsen the graph by combining nodes and edges
2. Partition the coarsened representation (easier)
3. Refine partitioning when expanding graphs back
Image sourceAggarwal Ch 19
MDM course Aalto 2020 – p.13/20
METIS Graph coarsening: Combine tightly
interconnected nodes
simple strategies: choose two neighbours along a randomedge, heavy edge or “high-density” edge
new node weight = number of contracted nodes it presents
combine parallel edges by summing their weights
image source
Aggarwal Ch 19.3
MDM course Aalto 2020 – p.14/20
III Collective classification: predict missing labels
could look at the most common label in theneighbourhood, but node labels very sparse→ look atalso connections through unlabeled nodes
semisupervised classification, where unlabeled data isutilized in learning
+ suits for any data type given a similarity graph!
Image source: Aggarwal Ch 19MDM course Aalto 2020 – p.15/20
Collective classification: Common methods
1. Iterative classification algorithm (ICA)
of test nodes and movePredict labels
most certain to thetraining set
e.g., class distribution of neighbour
nodes, weighted by edge weights
nodescurrently labeled
for all nodes utilizingExtract link features Use link features
and content features totrain a probabilistic
classifier
repeat ITER times
e.g., naive Bayesif it has high probability
class prediction of a node "certain"
2. Label propagation with random walks
What is the most probable class of a node where arandom walk (through unlabeled nodes) leads?
3. Supervised spectral methods
spectral embedding + classify multidimensionaldata or solve directly spectral optimization problem
MDM course Aalto 2020 – p.16/20
IV Link prediction: predict future links
Utilize especially structural features
Image source: Aggarwal Ch 19
MDM course Aalto 2020 – p.17/20
Link prediction approaches
1. Evaluate potential connections with node similaritymeasures
+ easy and fast to compute
2. Learn a classifier for predicting links or their absence
+ more accurate– computationally more expensive
3. Use missing value estimation methods (like matrixfactorization)
MDM course Aalto 2020 – p.18/20
Measures for similarity between nodes vi and v j
a) Neighbourhood-based measures: (normalized) numberof common neighbours
– not good, if number of common neighbours small
b) Walk-based measures
Katz measure produces often excellent results
Katz(vi, v j) =
∞∑
t=0
βt · n(t)
i j,
where n(t)
i j= number of walks of length t between vi and
v j and β < 1 discount factor (punishes long walks)
SimRank and PageRank-based measuresMDM course Aalto 2020 – p.19/20
Classification approach
create multidimensional data, where each node pair isone instance
in training use all linked pairs (class 1) and a sample ofunlinked pairs (class 0)
features: similarity measures between nodes, otherstructural features (e.g., node degrees), possiblycontent-based features
learn a classifier and make predictions for remainingpairs
e.g. logistic regression
MDM course Aalto 2020 – p.20/20