Jerzy KORCZAK, email: [email protected]
http://www.korczak-leliwa.pl
http://citi-lab.pl
Mining of Financial Databases
3. Clustering
Contents
Introduction - problem definition
Similarity - Distances
Clustering algorithms
Classical algorithms: K-means
Hierarchical algorithms
Scalable algorithms: CURE, DBSCAN
Self Organizing Maps
Neural Gas
Semi-supervised methods
Reminder: what is clustering?
Clustering is a process of partitioning a set of data (or
objects) into a set of meaningful sub-classes, called clusters.
Clustering is an unsupervised classification: no predefined classes
Related issues: space reduction, outlier detection, understanding of clusters, user engagement,
background knowledge, …
Introduction –
what is a natural grouping among these objects?
[Figure: the same objects (the Simpsons characters) grouped three ways: School Employees, Simpson's Family, Males/Females]
Clustering is subjective
What is a natural grouping among these objects?
How to do Hierarchical Clustering?
The number of dendrograms with n leaves = (2n − 3)! / [2^(n−2) (n−2)!]

Number of leaves   Number of possible dendrograms
2                  1
3                  3
4                  15
5                  105
…                  …
10                 34,459,425
Since we cannot test all possible trees, we
will have to use a heuristic search over the possible trees. We could do:
Bottom-Up (agglomerative): Starting with each item in its own cluster, find the
best pair to merge into a new cluster.
Repeat until all clusters are fused together.
Top-Down (divisive): Starting with all the data in a single cluster, consider every
possible way to divide the cluster into
two. Choose the best division and recursively operate on both sides.
Problem of clustering
Given:
Data points and number of
desired clusters K
Group the data points into K clusters
Data points within clusters are
more similar than across clusters
Sample applications:
Customer profiles/segmentation
Market basket customer analysis
Clustering countries/companies
Security – suspicious transactions
Churn analysis
Getting best value of K
Try different K, looking at the change in the average
distance to centroid
Average falls rapidly until right K, then changes little
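A minimal sketch of this heuristic, assuming NumPy and scikit-learn are available (neither is prescribed by the slides); the synthetic data and the tested range of K are illustrative:

import numpy as np
from sklearn.cluster import KMeans

def avg_dist_to_centroid(X, k, random_state=0):
    """Average Euclidean distance of points to their assigned centroid."""
    km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
    d = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    return d.mean()

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, size=(100, 2)) for c in (0, 5, 10)])
for k in range(1, 8):
    print(k, round(avg_dist_to_centroid(X, k), 3))
# The average falls sharply up to the "right" K (here 3), then changes little.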
Best K
[Figure: average distance to centroid plotted against K; the elbow marks the best K]
Too few clusters: many long distances to centroid.
Too many clusters: little improvement in average distance.
Data Types in Cluster Analysis
Financial reports - numerical and categorical data
Publications, expert reports - text data
Product images, videos - multimedia data
Stock time series, sequences of transactions
Social network information
Blogs, uncertain data
Dimensionality Reduction
Dimensionality reduction approaches are capable of
improving learning performance, lowering computational complexity, building better generalizable models, and
decreasing required storage
Feature extraction: PCA, LDA,…
Feature selection: Information Gain, Relief, χ², …
Selection strategies: filter, wrapper and hybrid
Similarity measures
Distances indicate the similarity between objects.
Minkowski distance (not only Euclidean space):
d(i,j) = (|x_i1 − x_j1|^q + |x_i2 − x_j2|^q + … + |x_ip − x_jp|^q)^(1/q)
where i = (x_i1, x_i2, …, x_ip) and j = (x_j1, x_j2, …, x_jp) are two p-dimensional objects and q is a positive number.
If q = 1, d is the Manhattan distance:
d(i,j) = |x_i1 − x_j1| + |x_i2 − x_j2| + … + |x_ip − x_jp|
If q = 2, d is the Euclidean distance:
d(i,j) = √(|x_i1 − x_j1|² + |x_i2 − x_j2|² + … + |x_ip − x_jp|²)
Properties
d(i,j) ≥ 0
d(i,i) = 0
d(i,j) = d(j,i)
d(i,j) ≤ d(i,k) + d(k,j)
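A small sketch of the Minkowski family of distances defined above, assuming NumPy; the example vectors are illustrative:

import numpy as np

def minkowski(x, y, q):
    return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

def manhattan(x, y):          # q = 1
    return np.sum(np.abs(x - y))

def euclidean(x, y):          # q = 2
    return np.sqrt(np.sum((x - y) ** 2))

x, y = np.array([1.0, 2.0, 3.0]), np.array([4.0, 0.0, 3.0])
print(manhattan(x, y), euclidean(x, y), minkowski(x, y, q=3))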
Example: Manhattan distance

Raw data:
          Age   Salary
Person1    50   11000
Person2    70   11100
Person3    60   11122
Person4    60   11074
d(p1,p2) = 120, d(p1,p3) = 132
Conclusion: p1 resembles p2 more than p3.

Normalized data:
          Age   Salary
Person1    -2   -0.5
Person2     2    0.175
Person3     0    0.324
Person4     0    0
d(p1,p2) = 4.675, d(p1,p3) = 2.324
Conclusion: p1 resembles p3 more than p2.
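A small sketch reproducing this example, assuming NumPy; it uses z-score normalization, so the normalized values differ slightly from those on the slide, but the conclusion flips in the same way:

import numpy as np

P = np.array([[50, 11000],    # Person1: Age, Salary
              [70, 11100],    # Person2
              [60, 11122],    # Person3
              [60, 11074]])   # Person4

def manhattan(a, b):
    return np.sum(np.abs(a - b))

print(manhattan(P[0], P[1]), manhattan(P[0], P[2]))   # 120, 132 -> p1 closer to p2

Z = (P - P.mean(axis=0)) / P.std(axis=0)              # normalize each column
print(manhattan(Z[0], Z[1]), manhattan(Z[0], Z[2]))   # now p1 is closer to p3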
Time-series data – measures of similarity
Distance between time-series data
Measures (of magnitude, shape, subsequences):
- Euclidean distance
- DTW (Dynamic Time Warping) distance
- Fréchet distance
- Longest Common Subsequence
- …

int DTWDistance(s: array [1..n], t: array [1..m]) {
    DTW := array [0..n, 0..m]
    for i := 1 to n
        DTW[i, 0] := infinity
    for i := 1 to m
        DTW[0, i] := infinity
    DTW[0, 0] := 0
    for i := 1 to n
        for j := 1 to m
            cost := d(s[i], t[j])
            DTW[i, j] := cost + minimum(DTW[i-1, j  ],   // insertion
                                        DTW[i  , j-1],   // deletion
                                        DTW[i-1, j-1])   // match
    return DTW[n, m]
}
Distances between clusters
Single link: smallest distance between an element in one cluster
and an element in the other, i.e., dist(Ki, Kj) = min(tip, tjq)
Complete link: largest distance between an element in one
cluster and an element in the other, i.e., dist(Ki, Kj) = max(tip, tjq)
Average: avg distance between an element in one cluster and an
element in the other, i.e., dist(Ki, Kj) = avg(tip, tjq)
Centroid: distance between the centroids of two clusters, i.e.,
dist(Ki, Kj) = dist(Ci, Cj)
Medoid: distance between the medoids of two clusters, i.e.,
dist(Ki, Kj) = dist(Mi, Mj)
Medoid: a chosen, centrally located object in the cluster
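Minimal sketches of these inter-cluster distances, assuming NumPy; the two example clusters and the medoid rule used here (the point with the smallest total distance to the others) are illustrative:

import numpy as np
from itertools import product

def pairwise(Ki, Kj):
    return np.array([np.linalg.norm(p - q) for p, q in product(Ki, Kj)])

def single_link(Ki, Kj):   return pairwise(Ki, Kj).min()
def complete_link(Ki, Kj): return pairwise(Ki, Kj).max()
def average_link(Ki, Kj):  return pairwise(Ki, Kj).mean()

def centroid_dist(Ki, Kj):
    return np.linalg.norm(Ki.mean(axis=0) - Kj.mean(axis=0))

def medoid(K):
    # the point with the smallest total distance to the other points
    return K[np.argmin([np.linalg.norm(K - p, axis=1).sum() for p in K])]

def medoid_dist(Ki, Kj):
    return np.linalg.norm(medoid(Ki) - medoid(Kj))

Ki = np.array([[0.0, 0.0], [1.0, 0.0]])
Kj = np.array([[4.0, 0.0], [5.0, 1.0]])
print(single_link(Ki, Kj), complete_link(Ki, Kj), average_link(Ki, Kj),
      centroid_dist(Ki, Kj), medoid_dist(Ki, Kj))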
Mahalanobis distance
Normalized Euclidean distance from centroid.
For point (x1,…,xk) and centroid (c1,…,ck):
1. Normalize in each dimension: yi=(xi-ci)/σi
2. Take sum of the squares of the yi’s.
3. Take the square root.
If clusters are normally distributed in d dimensions,
then after transformation, one standard deviation = √d,
i.e., about 70% of the points of the cluster will have a Mahalanobis
distance < √d.
Accept a point for a cluster if its MD is < some threshold,
e.g. 4 standard deviations.
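A minimal sketch of this normalized (per-dimension) Mahalanobis distance, assuming NumPy; the cluster, the test point, and the threshold of 4 are illustrative:

import numpy as np

def mahalanobis(point, centroid, sigma):
    """sqrt(sum(((x_i - c_i) / sigma_i)^2)): normalize in each dimension first."""
    y = (point - centroid) / sigma
    return np.sqrt(np.sum(y ** 2))

rng = np.random.default_rng(0)
cluster = rng.normal(loc=[10.0, 50.0], scale=[1.0, 5.0], size=(500, 2))
centroid, sigma = cluster.mean(axis=0), cluster.std(axis=0)

p = np.array([11.0, 55.0])
md = mahalanobis(p, centroid, sigma)
print(md, "accept" if md < 4 else "reject")   # threshold: 4 standard deviations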
Clustering algorithms
Statistical algorithms: K-means, k-medoids,…
Hierarchical algorithms (agglomerative and divisive)
COBWEB, BIRCH, Chameleon, …
Density-based algorithms
DBSCAN, OPTICS and DENCLUE, …
Grid-based algorithms
STING and CLIQUE (subspace clustering)
Evaluation of clusters
K-Means: Example
A={1,2,3,6,7,8,13,15,17}. Create 3 clusters in A.
Take randomly 3 objects, e.g. 1, 2 and 3.
C1={1}, M1=1; C2={2}, M2=2; C3={3} and M3=3
Each object is assigned to the closest cluster.
So 6 is assigned to C3 because dist(M3,6)<dist(M2,6)
and dist(M3,6)<dist(M1,6)
The result is: C1={1}, M1=1
C2={2}, M2=2
C3={3, 6, 7,8,13,15,17}, M3=69/7=9.86
K-Means: Example (cont.)
dist(3,M2) < dist(3,M3), so 3 moves to C2. The other objects stay in C3.
C1={1}, M1=1; C2={2,3}, M2=2.5; C3={6,7,8,13,15,17} and M3=66/6=11
dist(6,M2) < dist(6,M3), so 6 moves to C2. The other objects do not move.
C1={1}, M1=1; C2={2,3,6}, M2=11/3=3.67; C3={7,8,13,15,17}, M3=12
dist(2,M1) < dist(2,M2), so 2 moves to C1.
dist(7,M2) < dist(7,M3), so 7 moves to C2. The other objects do not move.
C1={1,2}, M1=1.5; C2={3,6,7}, M2=16/3=5.33; C3={8,13,15,17}, M3=13.25
dist(3,M1) < dist(3,M2), so 3 moves to C1.
dist(8,M2) < dist(8,M3), so 8 moves to C2.
C1={1,2,3}, M1=2; C2={6,7,8}, M2=7; C3={13,15,17}, M3=15
Nothing changes. End.
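A minimal sketch of K-means on the same set A, assuming NumPy. It uses the batch (Lloyd) variant rather than the object-by-object updates of the example, but it converges to the same final clusters; the initial centers 1, 2, 3 mirror the example:

import numpy as np

A = np.array([1, 2, 3, 6, 7, 8, 13, 15, 17], dtype=float)
centers = np.array([1.0, 2.0, 3.0])          # M1, M2, M3

while True:
    labels = np.argmin(np.abs(A[:, None] - centers[None, :]), axis=1)
    new_centers = np.array([A[labels == k].mean() for k in range(len(centers))])
    if np.allclose(new_centers, centers):     # "Nothing changes. End."
        break
    centers = new_centers

for k in range(3):
    print(f"C{k+1} = {A[labels == k]}, M{k+1} = {centers[k]:.2f}")
# Expected result: {1,2,3}, {6,7,8}, {13,15,17} with means 2, 7 and 15.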
Algorithm K-Means
Example
[Figure: K-means with K = 2 on a 2-D point set; each iteration assigns observations to the nearest centroid and then updates the centroids]
Initial data set
Partition into k non-empty subsets
Repeat
Compute the centroid (i.e., the mean point) of each subset
Assign each observation to the nearest centroid
Until there is no change
Variations of K-means
K-Medoids Clustering
K-Medians Clustering
K-Modes Clustering
Fuzzy K-Means Clustering
X-Means Clustering
Intelligent K-Means Clustering
Bisecting K-Means Clustering
Kernel K-Means Clustering
Mean Shift Clustering
Weighted K-Means Clustering
Genetic K-Means Clustering
Hierarchical Clustering
Use distance matrix as clustering criteria. This method
does not require the number of clusters k as an input, but needs a termination condition
[Figure: dendrogram over objects a, b, c, d, e; agglomerative clustering merges clusters bottom-up (Step 0 to Step 4), divisive clustering splits them top-down (Step 4 to Step 0)]
Hierarchical Clustering – terminology
A cluster hierarchy can be interpreted using standard binary tree terminology.
The root represents the complete set of data objects to be clustered and forms the apex of the hierarchy (level 0). At each level, the child entries (or nodes), which are subsets of the entire dataset, correspond to the clusters.
This cluster hierarchy is also called a dendrogram.
Agglomerative Clustering
(a) Dissimilarity matrix:
      1     2     3     4
1   0.00  0.20  0.15  0.30
2   0.20  0.00  0.40  0.50
3   0.15  0.40  0.00  0.10
4   0.30  0.50  0.10  0.00
[Figure: (b) single-link and (c) complete-link dendrograms built from this matrix]
Group Averaged Agglomerative Clustering (GAAC) considers the
similarity between all pairs of points present in both the clusters and diminishes the drawbacks associated with single and complete link methods.
Ward: For any two clusters, Ca and Cb, Ward's criterion is calculated by measuring the increase in the value of the Sum of Squared Errors (SSE) criterion
for the clustering obtained by merging them into Ca ∪ Cb.
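A minimal sketch of agglomerative clustering on the 4-object dissimilarity matrix above, assuming SciPy is available; the linkage names correspond to the single, complete, and group-average criteria (Ward linkage in SciPy expects raw observation vectors rather than a precomputed dissimilarity matrix):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

D = np.array([[0.00, 0.20, 0.15, 0.30],
              [0.20, 0.00, 0.40, 0.50],
              [0.15, 0.40, 0.00, 0.10],
              [0.30, 0.50, 0.10, 0.00]])

condensed = squareform(D)                     # upper-triangle vector form
for method in ("single", "complete", "average"):
    Z = linkage(condensed, method=method)     # (n-1) x 4 merge table
    print(method, fcluster(Z, t=2, criterion="maxclust"))   # cut into 2 clusters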
Divisive Clustering – top-down approach
Factors affecting the performance
1. Splitting criterion: Ward's K-means squared error criterion. The greater the reduction in the SSE criterion, the better the split. The SSE criterion can be applied to numerical data only!
2. Splitting method: The method used to obtain the binary split of the parent node is also critical, since it can reduce the time taken for evaluating Ward's criterion. The Bisecting K-means approach can be used here (with K = 2) to obtain good splits, since it is based on the same criterion of maximizing Ward's distance between the splits.
3. Choosing the cluster to split: The choice of cluster to split may not be as important as the first two factors, but it can still be useful to choose the most appropriate cluster to split further when the goal is to build a compact dendrogram.
4. Handling noise: Since noise points present in the dataset might result in aberrant clusters, a threshold can be used to determine the termination criterion rather than splitting the clusters further.
Clustering: Summary of Drawbacks of Traditional Methods
Partition-based algorithms split large clusters
Centroid-based method splits large and non-hyperspherical
clusters
Centers of subclusters can be far apart
Minimum spanning tree algorithm is sensitive to outliers and
slight change in position
Exhibits chaining effect on string of outliers
Cannot scale up for large databases
Outline of Advanced Clustering Analysis
Probability Model-Based Clustering
Each object may take a probability to belong to a cluster
Clustering High-Dimensional Data
Curse of dimensionality: Difficulty of distance measure in high-D
space
Clustering Graphs and Network Data
Similarity measurement and clustering methods for graph and
networks
Clustering with Constraints
Cluster analysis under different kinds of constraints, e.g., those raised
from background knowledge or the spatial distribution of the objects
Probability Model-Based Clustering
A hidden category (i.e., probabilistic cluster) is a distribution
over the data space, which can be represented using a
probability density function (or distribution function).
Ex. Categories of digital cameras
EM (Expectation Maximization) Algorithm
The k-means algorithm has two steps at each iteration:
Expectation Step (E-step): Given the current cluster centers, each
object is assigned to the cluster whose center is closest to the
object: An object is expected to belong to the closest cluster
Maximization Step (M-step): Given the cluster assignment, for
each cluster, the algorithm adjusts the center so that the sum of
the distances from the objects assigned to this cluster to the new
center is minimized
The EM algorithm: A framework to approach maximum likelihood or maximum a posteriori estimates of parameters in statistical models.
E-step assigns objects to clusters according to the current fuzzy
clustering or parameters of probabilistic clusters
M-step finds the new clustering or parameters that maximize the sum of squared error (SSE) or the expected likelihood
EM (Expectation Maximization) Algorithm
Given a statistical model consisting of a set of observed variables X, a set
of unobserved latent variables Z, and a vector of unknown parameters Θ, the goal is to maximize the log-likelihood with respect to the parameters Θ.
1: Start with an initial guess for the parameters Θ(0) and compute the initial log-likelihood log p(X|Θ(0)).
2: E-step: Evaluate q(t) = argmax_q L(q, Θ(t)), i.e., q(t)(z_n) = p(z_n|x_n, Θ(t)).
3: M-step: Update the parameters: Θ(t+1) = argmax_Θ Q(Θ, Θ(t)).
4: Compute the log-likelihood log p(X|Θ(t+1)) and check for convergence
of the algorithm. If the convergence criterion is not satisfied, then repeat
steps 2-4, otherwise, return the final parameters.
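A minimal sketch of this EM loop for a Gaussian mixture, assuming NumPy and scikit-learn; the synthetic data and the number of components are illustrative, and the alternating E- and M-steps are carried out internally by fit():

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, 300), rng.normal(6, 1.5, 200)]).reshape(-1, 1)

gm = GaussianMixture(n_components=2, max_iter=100, tol=1e-4, random_state=0)
gm.fit(X)                                    # alternates E-steps and M-steps

print("means:", gm.means_.ravel())
print("weights:", gm.weights_)
print("log-likelihood per sample:", gm.score(X))   # monitored for convergence
print(gm.predict_proba(X[:3]))               # E-step output: soft assignments p(z|x, Θ)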
Main applications of the EM Algorithm
when the data has missing values, due to problems with or limitations of the observation process
when optimizing the likelihood function is analytically intractable, but the likelihood function can be simplified by assuming the existence of values for additional but missing (or hidden) parameters
EM is becoming a useful tool to price and manage risk of a
portfolio
Density-Based Clustering: DBSCAN
Two parameters:
Eps: maximum radius of the neighbourhood
MinPts: minimum number of points in an Eps-neighbourhood of that point
NEps(p): {q belongs to D | dist(p,q) ≤ Eps}
Directly density-reachable: A point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
p belongs to NEps(q)
core point condition:
|NEps (q)| ≥ MinPts
[Figure: p is directly density-reachable from the core point q; MinPts = 5, Eps = 1 cm]
Chameleon - idea
Graph partitioning – k-NN graph: p and q are connected if q is among the k closest neighbors of p
Chameleon - idea
Merging
Evaluation of Clustering Quality
Assessing Clustering Tendency
Assess if non-random structure exists in the data by measuring the probability that the data is generated by a uniform data distribution
Determine the Number of Clusters
Empirical method: # of clusters ≈ √(n/2) for a dataset of n points
Elbow method: use the turning point in the curve of the sum of within-cluster variance w.r.t. the # of clusters
Cross validation method
Measuring Clustering Quality
Extrinsic: supervised
Compare a clustering against the ground truth using certain clustering quality measure
Intrinsic: unsupervised
Evaluate the goodness of a clustering by considering how well the clusters are separated, and how compact the clusters are
Clustering High-Dimensional Data
Clustering high-dimensional data (How high is high-D in clustering?)
Many applications: text documents, DNA micro-array data
Major challenges:
Many irrelevant dimensions may mask clusters
Distance measure becomes meaningless—due to equi-distance
Clusters may exist only in some subspaces
Methods
Subspace-clustering: Search for clusters existing in subspaces of
the given high dimensional data space
CLIQUE, ProClus, and bi-clustering approaches
Dimensionality reduction approaches: Construct a much lower
dimensional space and search for clusters there (may construct new
dimensions by combining some dimensions in the original data)
Dimensionality reduction methods and spectral clustering
Traditional Distance Measures May Not Be
Effective on High-D Data
Traditional distance measure could be dominated by noises in many
dimensions
Ex. Which pairs of customers are more similar?
By Euclidean distance over all dimensions, a different pair comes out as the most similar,
even though Ada and Cathy look more similar on the relevant attributes
Clustering should not only consider dimensions but also attributes
(features)
Feature transformation: effective if most dimensions are relevant
(PCA & SVD useful when features are highly correlated/redundant)
Feature selection: useful to find a subspace where the data have
nice clusters
Clustering – scalability (from the Database and Machine Learning communities)
Scalable Clustering Algorithms
CLARANS – sampling database
DBSCAN – density based method
BIRCH – partitions objects hierarchically using tree structure
CLIQUE – integrates density-based and grid-based method
STING – grid-based method
CURE
ROCK – merges clusters based on their interconnectivity
COBWEB and CLASSIT
Neural networks: SOM, GNG
…
Density-based Clustering
Criterion: Density-connected points
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
DBSCAN estimates the density by counting the number of points in a
fixed-radius neighborhood and considers two points as connected if they lie within each other’s neighborhood.
A point is called core point if the neighborhood of radius Eps contains
at least MinPts points, i.e., the density in the neighborhood has to exceed some threshold. A point q is directly density-reachable from a
core point p if q is within the Eps-neighborhood of p, and density-
reachability is given by the transitive closure of direct density-reachability.
Two points p and q are called density-connected if there is a third point
o from which both p and q are density-reachable.
A cluster is then a set of density-connected points which is maximal with respect to density-reachability.
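A minimal sketch of DBSCAN, assuming NumPy and scikit-learn; eps and min_samples play the roles of Eps and MinPts, and the synthetic data are illustrative:

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal((0, 0), 0.3, (100, 2)),     # dense cluster
               rng.normal((4, 4), 0.3, (100, 2)),     # second dense cluster
               rng.uniform(-2, 6, (20, 2))])          # scattered noise

db = DBSCAN(eps=0.5, min_samples=5).fit(X)            # Eps, MinPts
labels = db.labels_                                    # -1 marks noise points
print("clusters:", set(labels) - {-1}, "noise points:", np.sum(labels == -1))
print("core points:", len(db.core_sample_indices_))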
DBSCAN
Example: A point is called core point if the neighborhood of radius Eps
contains at least MinPts points. In the diagram, MinPts = 4.
Point A and the other red points are core points, because the area surrounding each of them within an Eps radius contains at least 4 points. Because they are all reachable
from one another, they form a single cluster. Points B and C are not core points, but are density-reachable from A (via other core points) and thus belong to the cluster as well.
Point N is a noise point that is neither a core point nor density-reachable.
CURE (Clustering Using REpresentatives )
CURE is an algorithm which incorporates a novel feature of
representing a cluster using a set of well-scattered representative points. The distance between two clusters is
calculated by looking at the minimum distance between the
representative points chosen.
Stops at k clusters
Based on representative points
The classical methods would generate the clusters shown in (b)
CURE: merging representative points
shrinking the representative points toward the center by a factor (to reduce the effect of outliers)
the representative points allow the shape of the cluster to be captured
Conceptual Clustering - COBWEB
Conceptual clustering
A form of clustering in machine learning
Produces a classification scheme for a set of unlabeled
objects
Finds characteristic description for each concept (class)
COBWEB (Fisher’87)
A popular and simple method of incremental conceptual
learning
Creates a hierarchical clustering in the form of a
classification tree
Each node refers to a concept and contains a
probabilistic description of that concept
Clustering: Self-Organizing Maps
SOM is one of the algorithms used to interpret and visualize high-dimensional data sets. The map consists of a grid of neurons representing all available observations (data).
Types of application:
Clustering : data classification
Vector Quantization : space discretisation
Reduction of data dimension
Data Preprocessing
Feature extraction
Concepts
SOMs are competitive networks that provide a ‘topological’ mapping from the input space to the clusters.
Neuron – data structure
Neural network – connected neurons
Unsupervised learning
Competitive learning
Rule : «winner-take-all»
Kohonen’s Self-Organizing Maps
Network with a fixed dimension (grid 2D)
Data space mapping onto the grid 2D of the network
Growing Neural Gas
growing network; neurons are inserted where the error is the
highest
Self-Organizing Maps
Inspiration : … topographic maps in the visual cortex
The network topology is a 2-dimensional grid that does not
change during self-organization
Each neuron of SOM is linked to all neurons of the map.
Learning principle:
Generate an input and determine the winner.
The distance on the grid is used to determine how strongly a neuron
is adapted when the neuron is the winner.
Mechanism of lateral interaction: «Mexican hat»
SOM: «Mexican hat»
The weights are updated according to the «Mexican hat» function:
[Figure: lateral interaction; excitatory action close to the winner, inhibitory action further away]
Hints: - define a large neighborhood range in the beginning
- the adaptation rate is a linearly decreasing function
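A minimal sketch of this SOM learning rule, assuming NumPy; the grid size and the schedules for the learning rate and the neighborhood radius are illustrative, and a Gaussian neighborhood is used instead of the full «Mexican hat»:

import numpy as np

rng = np.random.default_rng(0)
grid_h, grid_w, dim = 10, 10, 3
W = rng.random((grid_h, grid_w, dim))                  # one weight vector per neuron
coords = np.stack(np.meshgrid(np.arange(grid_h), np.arange(grid_w),
                              indexing="ij"), axis=-1) # grid positions of neurons

data = rng.random((1000, dim))
n_iter = 5000
for t in range(n_iter):
    x = data[rng.integers(len(data))]                  # generate an input
    d = np.linalg.norm(W - x, axis=-1)
    winner = np.unravel_index(np.argmin(d), d.shape)   # best-matching unit
    # both the neighborhood radius and the learning rate decrease linearly over time
    radius = 5.0 * (1 - t / n_iter) + 0.5
    lr = 0.5 * (1 - t / n_iter) + 0.01
    grid_dist = np.linalg.norm(coords - np.array(winner), axis=-1)
    h = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))  # neighborhood function
    W += lr * h[..., None] * (x - W)                   # move neurons toward x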
Kohonen’s Self-organizing Map
Map of 10*10 neurons
GNG
Neural Gas and Growing Neural Gas
[Fritzke, 1994] [Martinetz, Schulten, 1991]
NG
The NG algorithm sorts, for each input signal, the neurons of the network according to the distance of their reference vectors to the signal.
Based on this ‘rank order’, a certain number of units is adapted.
Both the number of adapted neurons and the adaptation strength are
decreased according to a fixed schedule.
Neurons are not interconnected.
GNG
Self-organization: starting with very few neurons, new neurons are inserted
successively.
Each new neuron is inserted near the neuron which has accumulated most
errors.
Neurons are connected dynamically: age of connections is used to delete a
connection.
Growing Neural Gas: Algorithm
1. Initialize the set to contain two units A = {c1, c2}, t = 0. Initialize the connection set C.
2. Generate at random an input signal x.
3. Determine the winner s1 and the second-nearest unit s2 (the two units closest to x).
4. If a connection between s1 and s2 does not exist already, create it and set its age to 0:
C = C ∪ {(s1,s2)}, age(s1,s2) = 0.
5. Add the squared distance between the input signal and the winner to a local error variable: ΔE_s1 = ||x − w_s1||².
6. Adapt the reference vectors of the winner and its direct topological neighbors by the fractions εb and εn:
Δw_s1 = εb·(x − w_s1), Δw_n = εn·(x − w_n)
7. Increment the age of all edges emanating from s1.
8. Remove edges with an age larger than a_max. If this results in units having no more emanating edges, remove those units as well.
9. If the number of input signals generated so far is an integer multiple of a parameter λ, add a new unit r to the network, interpolate its reference vector from the unit q with the maximum accumulated error and its neighbor f with the largest error, and decrease the error variables of q and f by a fraction α.
10. If a stopping criterion (e.g., net size or some performance measure) is not yet fulfilled, continue with step 2.
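A compact sketch of these steps, assuming NumPy; all parameter values are illustrative, and the removal of units that lose all their edges (part of step 8) is omitted for brevity:

import numpy as np

def gng(data, max_units=100, lam=100, eps_b=0.05, eps_n=0.006,
        alpha=0.5, decay=0.995, a_max=50, n_steps=20000, seed=0):
    rng = np.random.default_rng(seed)
    W = [data[rng.integers(len(data))].astype(float) for _ in range(2)]  # step 1
    E = [0.0, 0.0]                                      # accumulated errors
    edges = {}                                          # (i, j), i < j -> age
    for t in range(1, n_steps + 1):
        x = data[rng.integers(len(data))]               # step 2
        d2 = [float(np.sum((x - w) ** 2)) for w in W]
        s1, s2 = np.argsort(d2)[:2]                     # step 3
        edges[tuple(sorted((int(s1), int(s2))))] = 0    # step 4
        E[s1] += d2[s1]                                 # step 5
        W[s1] += eps_b * (x - W[s1])                    # step 6 (winner)
        for (i, j) in list(edges):
            if s1 in (i, j):
                n = j if i == s1 else i
                W[n] += eps_n * (x - W[n])              # step 6 (neighbors)
                edges[(i, j)] += 1                      # step 7
        edges = {e: a for e, a in edges.items() if a <= a_max}   # step 8 (edges only)
        if t % lam == 0 and len(W) < max_units:         # step 9
            q = int(np.argmax(E))
            nbrs = [j if i == q else i for (i, j) in edges if q in (i, j)]
            if nbrs:
                f = max(nbrs, key=lambda n: E[n])
                W.append(0.5 * (W[q] + W[f]))           # interpolate new unit r
                r = len(W) - 1
                edges.pop(tuple(sorted((q, f))), None)
                edges[tuple(sorted((q, r)))] = 0
                edges[tuple(sorted((f, r)))] = 0
                E[q] *= alpha; E[f] *= alpha
                E.append(E[q])
        E = [e * decay for e in E]                      # decrease all errors
    return np.array(W), edges

rng = np.random.default_rng(1)
data = rng.random((2000, 2))                            # uniform square
units, edges = gng(data)
print(len(units), "units,", len(edges), "edges")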
Growing Neural Gas: example
GNG 100 neurons max.
GNG: Coverage of the data space
[Figure: the same data covered by a SOM grid and by a Growing Neural Gas network]
GNG: Growing Neural Gas – Economic Maps
Data: World Bank 1992, 39 indicators of quality of life.
Oil from Italy
572 samples of olive
oil were taken from 9 Italian provinces.
SOM 20 x 20, trained on the percentage of 8 fatty acids contained in the oils
Map: 8D => 2D.
Accuracy 95-97%.
Why Not Semi-Supervised Clustering?
Much information (in multiple relations) is needed to judge
whether two tuples are similar
A user may not be able to provide a good training set
It is much easier for a user to specify an attribute as a hint,
such as a student’s research area
Tuples to be compared (the user hint is one specified attribute):
Tom Smith  | SC1211 | TA
Jane Chang | BI205  | RA
Comparing with Semi-Supervised Clustering
Semi-supervised clustering: User provides a training set
consisting of “similar” (“must-link”) and “dissimilar” (“cannot-link”) pairs of objects
User-guided clustering: User specifies an attribute as a hint, and more relevant features are found for clustering
[Figure: both approaches cluster all tuples; semi-supervised clustering vs. user-guided clustering]
Link analysis
Link analysis techniques are applied to data that can be
represented as nodes and links
A node (vertex): person, bank account, document, …
A link: a relationship between two bank accounts
Link analysis - measures
Degree - # of connected nodes
Closeness – average distance from the node to all other
nodes
Betweenness – the extent to which the node lies on the shortest paths between pairs of other nodes
Cutpoints – nodes whose removal divides the network into unconnected
systems
Clique – small, highly-interconnected subgraph within a
larger network
Equivalence – structural and regular
Link analysis applications
Social network analysis
Which people are powerful?
Which people influence other people?
How does information spread within the network?
Who is relatively isolated, and who is well connected?
…
Internet search engines
Google search engine: PageRank algorithm
Marketing
Viral marketing: „word-of-mouth” advertising
Hotmail – free email service
Fraud detection
AML systems
…
Network Clustering
Networks: social networks, the web, or biological
interaction networks.
Networks can naturally be modeled as graphs.
Let G = (V,E) be a graph with a set of vertices V and a
set of edges E. Vertices represent objects, and edges
represent relationships between pairs of objects.
Intuitively, vertices sharing a lot of neighbors should
belong to the same cluster.
Social Networks
A social network is a social structure made up of a set of social actors
(such as individuals or organizations) and a set of the dyadic ties between these actors.
A Social Network Model
Cliques, hubs and outliers
Individuals in a tight social group, or clique, know many of the
same people, regardless of the size of the group
Individuals who are hubs know many people in different groups
but belong to no single group. Politicians, for example, bridge
multiple groups
Individuals who are outliers reside at the margins of society.
Hermits, for example, know few people and belong to no group
The Neighborhood of a Vertex
Define Γ(v) as the immediate neighborhood of a vertex v (i.e., the set of people that an individual knows).
Network Clustering
A similarity function for pairs of vertices v and w, denoted
by sim(v,w), is based on the intersection of their sets of neighbors:
sim(v,w) = |Γ(v)∩Γ(w)| / √(|Γ(v)| · |Γ(w)|)
where Γ(v) denotes the set of all (direct) neighbors of
vertex v, i.e., Γ(v) = { w|(v,w) ∈ E} ∪ {v}
The ε-neighborhood of a vertex v is given by the set of all
neighbors whose similarity exceeds the threshold of ε, i.e.,
Nε(v) = {w ∈ Γ(v) | sim(v,w) ≥ ε}
A vertex v is called a core, if its ε-neighborhood has a
cardinality of at least μ. If a vertex is not a member of any
cluster, it is either a hub or an outlier.
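A minimal sketch of this structural similarity, the ε-neighborhood, and the core test, assuming plain Python; the small example graph is illustrative:

import math

def gamma(adj, v):
    """Γ(v): v together with its direct neighbors."""
    return adj[v] | {v}

def sim(adj, v, w):
    gv, gw = gamma(adj, v), gamma(adj, w)
    return len(gv & gw) / math.sqrt(len(gv) * len(gw))

def eps_neighborhood(adj, v, eps):
    return {w for w in gamma(adj, v) if sim(adj, v, w) >= eps}

def is_core(adj, v, eps, mu):
    return len(eps_neighborhood(adj, v, eps)) >= mu

# adjacency sets of a tiny network: a triangle {0,1,2} plus a pendant vertex 3
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(sim(adj, 0, 1), eps_neighborhood(adj, 0, eps=0.7), is_core(adj, 0, 0.7, 3))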
Sample social network dataset with feature vectors
Latvian political parties and donations
Corruption on the Cuyahoga River
Large Graph Mining [C. Faloutsos et al., KDD 2009]
Social networks
Summary
Clustering is one of the most fundamental data mining
problems because of its numerous applications to customer segmentation, target marketing, and data summarization.
Challenges
Leveraging Dimensionality Reduction Methods
High Dimensional Scenario
Scalable Techniques for Cluster Analysis
I/O Issues in Database Management
Streaming Algorithms
Big Data Framework