Jerzy KORCZAK, email: [email protected]
http://www.korczak-leliwa.pl
http://citi-lab.pl
Mining of Financial Databases
3. Clustering
Contents
Introduction - problem definition
Similarity - Distances
Clustering algorithms
Classical algorithms: K-means
Hierarchical algorithms
Scalable algorithms: CURE, DBSCAN
Self Organizing Maps
Neural Gas
Semi-supervised methods
Reminder: what is clustering?
Clustering is a process of partitioning a set of data (or
objects) into a set of meaningful sub-classes, called clusters.
Clustering is an unsupervised classification: no predefined classes
Related issues: space reduction, outlier detection, understanding of clusters, user engagement,
background knowledge, …
Introduction –
what is a natural grouping among these objects?
[Figure: the same objects (the Simpsons characters) grouped three ways: School Employees, Simpson's Family, Males/Females]
Clustering is subjective
What is a natural grouping among these objects?
How to do Hierarchical Clustering?
The number of dendrograms with n leaves = (2n − 3)! / [2^(n−2) (n−2)!]

Number of leaves   Number of possible dendrograms
2                  1
3                  3
4                  15
5                  105
…                  …
10                 34,459,425
Since we cannot test all possible trees, we
will have to use a heuristic search over the possible trees. We could do:
Bottom-Up (agglomerative): Starting with each item in its own cluster, find the
best pair to merge into a new cluster.
Repeat until all clusters are fused together.
Top-Down (divisive): Starting with all the data in a single cluster, consider every
possible way to divide the cluster into
two. Choose the best division and recursively operate on both sides.
Problem of clustering
Given:
Data points and number of
desired clusters K
Group the data points into K clusters
Data points within clusters are
more similar than across clusters
Sample applications:
Customer profiles/segmentation
Market basket customer analysis
Clustering countries/companies
Security – suspicious transactions
Churn analysis
Getting best value of K
Try different K, looking at the change in the average
distance to centroid
Average falls rapidly until right K, then changes little
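A minimal sketch of this heuristic, assuming NumPy and scikit-learn are available (neither is prescribed by the slides); the synthetic data and the tested range of K are illustrative:

import numpy as np
from sklearn.cluster import KMeans

def avg_dist_to_centroid(X, k, random_state=0):
    """Average Euclidean distance of points to their assigned centroid."""
    km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
    d = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    return d.mean()

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, size=(100, 2)) for c in (0, 5, 10)])
for k in range(1, 8):
    print(k, round(avg_dist_to_centroid(X, k), 3))
# The average falls sharply up to the "right" K (here 3), then changes little.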
Best K
[Figure: average distance to centroid plotted against K; the elbow marks the best K]
Too few clusters: many long distances to centroid.
Too many clusters: little improvement in average distance.
Data Types in Cluster Analysis
Financial reports - numerical and categorical data
Publications, expert reports - text data
Product images, videos - multimedia data
Stock time series, sequences of transactions
Social network information
Blogs, uncertain data
Dimensionality Reduction
Dimensionality reduction approaches are capable of
improving learning performance, lowering computational complexity, building better generalizable models, and
decreasing required storage
Feature extraction: PCA, LDA,…
Feature selection: Information Gain, Relief, χ², …
Selection strategies: filter, wrapper and hybrid
Similarity measures
Distances indicate the similarity between objects.
Minkowski distance (not only Euclidean space):
d(i,j) = (|x_i1 − x_j1|^q + |x_i2 − x_j2|^q + … + |x_ip − x_jp|^q)^(1/q)
where i = (x_i1, x_i2, …, x_ip) and j = (x_j1, x_j2, …, x_jp) are two p-dimensional objects and q is a positive number.
If q = 1, d is the Manhattan distance:
d(i,j) = |x_i1 − x_j1| + |x_i2 − x_j2| + … + |x_ip − x_jp|
If q = 2, d is the Euclidean distance:
d(i,j) = √(|x_i1 − x_j1|² + |x_i2 − x_j2|² + … + |x_ip − x_jp|²)
Properties
d(i,j) ≥ 0
d(i,i) = 0
d(i,j) = d(j,i)
d(i,j) ≤ d(i,k) + d(k,j)
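A small sketch of the Minkowski family of distances defined above, assuming NumPy; the example vectors are illustrative:

import numpy as np

def minkowski(x, y, q):
    return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

def manhattan(x, y):          # q = 1
    return np.sum(np.abs(x - y))

def euclidean(x, y):          # q = 2
    return np.sqrt(np.sum((x - y) ** 2))

x, y = np.array([1.0, 2.0, 3.0]), np.array([4.0, 0.0, 3.0])
print(manhattan(x, y), euclidean(x, y), minkowski(x, y, q=3))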
Example: Manhattan distance

Raw data:
          Age   Salary
Person1    50   11000
Person2    70   11100
Person3    60   11122
Person4    60   11074
d(p1,p2) = 120, d(p1,p3) = 132
Conclusion: p1 resembles p2 more than p3.

Normalized data:
          Age   Salary
Person1    -2   -0.5
Person2     2    0.175
Person3     0    0.324
Person4     0    0
d(p1,p2) = 4.675, d(p1,p3) = 2.324
Conclusion: p1 resembles p3 more than p2.
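A small sketch reproducing this example, assuming NumPy; it uses z-score normalization, so the normalized values differ slightly from those on the slide, but the conclusion flips in the same way:

import numpy as np

P = np.array([[50, 11000],    # Person1: Age, Salary
              [70, 11100],    # Person2
              [60, 11122],    # Person3
              [60, 11074]])   # Person4

def manhattan(a, b):
    return np.sum(np.abs(a - b))

print(manhattan(P[0], P[1]), manhattan(P[0], P[2]))   # 120, 132 -> p1 closer to p2

Z = (P - P.mean(axis=0)) / P.std(axis=0)              # normalize each column
print(manhattan(Z[0], Z[1]), manhattan(Z[0], Z[2]))   # now p1 is closer to p3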
Time-series data – measures of similarity
Distance between time-series data
Measures (of magnitude, shape, subsequences):
- Euclidean distance
- DTW (Dynamic Time Warping) distance
- Fréchet distance
- Longest Common Subsequence
- …

int DTWDistance(s: array [1..n], t: array [1..m]) {
    DTW := array [0..n, 0..m]
    for i := 1 to n
        DTW[i, 0] := infinity
    for i := 1 to m
        DTW[0, i] := infinity
    DTW[0, 0] := 0
    for i := 1 to n
        for j := 1 to m
            cost := d(s[i], t[j])
            DTW[i, j] := cost + minimum(DTW[i-1, j  ],   // insertion
                                        DTW[i  , j-1],   // deletion
                                        DTW[i-1, j-1])   // match
    return DTW[n, m]
}
Distances between clusters
Single link: smallest distance between an element in one cluster
and an element in the other, i.e., dist(Ki, Kj) = min(tip, tjq)
Complete link: largest distance between an element in one
cluster and an element in the other, i.e., dist(Ki, Kj) = max(tip, tjq)
Average: avg distance between an element in one cluster and an
element in the other, i.e., dist(Ki, Kj) = avg(tip, tjq)
Centroid: distance between the centroids of two clusters, i.e.,
dist(Ki, Kj) = dist(Ci, Cj)
Medoid: distance between the medoids of two clusters, i.e.,
dist(Ki, Kj) = dist(Mi, Mj)
Medoid: a chosen, centrally located object in the cluster
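Minimal sketches of these inter-cluster distances, assuming NumPy; the two example clusters and the medoid rule used here (the point with the smallest total distance to the others) are illustrative:

import numpy as np
from itertools import product

def pairwise(Ki, Kj):
    return np.array([np.linalg.norm(p - q) for p, q in product(Ki, Kj)])

def single_link(Ki, Kj):   return pairwise(Ki, Kj).min()
def complete_link(Ki, Kj): return pairwise(Ki, Kj).max()
def average_link(Ki, Kj):  return pairwise(Ki, Kj).mean()

def centroid_dist(Ki, Kj):
    return np.linalg.norm(Ki.mean(axis=0) - Kj.mean(axis=0))

def medoid(K):
    # the point with the smallest total distance to the other points
    return K[np.argmin([np.linalg.norm(K - p, axis=1).sum() for p in K])]

def medoid_dist(Ki, Kj):
    return np.linalg.norm(medoid(Ki) - medoid(Kj))

Ki = np.array([[0.0, 0.0], [1.0, 0.0]])
Kj = np.array([[4.0, 0.0], [5.0, 1.0]])
print(single_link(Ki, Kj), complete_link(Ki, Kj), average_link(Ki, Kj),
      centroid_dist(Ki, Kj), medoid_dist(Ki, Kj))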
Mahalanobis distance
Normalized Euclidean distance from centroid.
For point (x1,…,xk) and centroid (c1,…,ck):
1. Normalize in each dimension: yi=(xi-ci)/σi
2. Take sum of the squares of the yi’s.
3. Take the square root.
If clusters are normally distributed in d dimensions,
then after transformation, one standard deviation = √d,
i.e., about 70% of the points of the cluster will have a Mahalanobis
distance < √d.
Accept a point for a cluster if its MD is < some threshold,
e.g. 4 standard deviations.
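A minimal sketch of this normalized (per-dimension) Mahalanobis distance, assuming NumPy; the cluster, the test point, and the threshold of 4 are illustrative:

import numpy as np

def mahalanobis(point, centroid, sigma):
    """sqrt(sum(((x_i - c_i) / sigma_i)^2)): normalize in each dimension first."""
    y = (point - centroid) / sigma
    return np.sqrt(np.sum(y ** 2))

rng = np.random.default_rng(0)
cluster = rng.normal(loc=[10.0, 50.0], scale=[1.0, 5.0], size=(500, 2))
centroid, sigma = cluster.mean(axis=0), cluster.std(axis=0)

p = np.array([11.0, 55.0])
md = mahalanobis(p, centroid, sigma)
print(md, "accept" if md < 4 else "reject")   # threshold: 4 standard deviations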
Clustering algorithms
Statistical algorithms: K-means, k-medoids,…
Hierarchical algorithms (agglomerative and divisive)
COBWEB, BIRCH, Chameleon, …
Density-based algorithms
DBSCAN, OPTICS and DENCLUE, …
Grid-based algorithms
STING and CLIQUE (subspace clustering)
Evaluation of clusters
K-Means: Example
A={1,2,3,6,7,8,13,15,17}. Create 3 clusters in A.
Take randomly 3 objects, e.g. 1, 2 and 3.
C1={1}, M1=1; C2={2}, M2=2; C3={3} and M3=3
Each object is assigned to the closest cluster.
So 6 is assigned to C3 because dist(M3,6)<dist(M2,6)
and dist(M3,6)<dist(M1,6)
The result is: C1={1}, M1=1
C2={2}, M2=2
C3={3, 6, 7,8,13,15,17}, M3=69/7=9.86
K-Means: Example (cont.)
dist(3,M2) < dist(3,M3), so 3 moves to C2. The other objects stay in C3.
C1={1}, M1=1; C2={2,3}, M2=2.5; C3={6,7,8,13,15,17} and M3=66/6=11
dist(6,M2) < dist(6,M3), so 6 moves to C2. The other objects do not move.
C1={1}, M1=1; C2={2,3,6}, M2=11/3=3.67; C3={7,8,13,15,17}, M3=12
dist(2,M1) < dist(2,M2), so 2 moves to C1.
dist(7,M2) < dist(7,M3), so 7 moves to C2. The other objects do not move.
C1={1,2}, M1=1.5; C2={3,6,7}, M2=16/3=5.33; C3={8,13,15,17}, M3=13.25
dist(3,M1) < dist(3,M2), so 3 moves to C1.
dist(8,M2) < dist(8,M3), so 8 moves to C2.
C1={1,2,3}, M1=2; C2={6,7,8}, M2=7; C3={13,15,17}, M3=15
Nothing changes. End.
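A minimal sketch of K-means on the same set A, assuming NumPy. It uses the batch (Lloyd) variant rather than the object-by-object updates of the example, but it converges to the same final clusters; the initial centers 1, 2, 3 mirror the example:

import numpy as np

A = np.array([1, 2, 3, 6, 7, 8, 13, 15, 17], dtype=float)
centers = np.array([1.0, 2.0, 3.0])          # M1, M2, M3

while True:
    labels = np.argmin(np.abs(A[:, None] - centers[None, :]), axis=1)
    new_centers = np.array([A[labels == k].mean() for k in range(len(centers))])
    if np.allclose(new_centers, centers):     # "Nothing changes. End."
        break
    centers = new_centers

for k in range(3):
    print(f"C{k+1} = {A[labels == k]}, M{k+1} = {centers[k]:.2f}")
# Expected result: {1,2,3}, {6,7,8}, {13,15,17} with means 2, 7 and 15.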
Algorithm K-Means
Example
[Figure: K-means with K = 2 on a 2-D point set; each iteration assigns observations to the nearest centroid and then updates the centroids]
Initial data set
Partition into k non-empty subsets
Repeat
Compute the centroid (i.e., the mean point) of each subset
Assign each observation to the nearest centroid
Until there is no change
Variations of K-means
K-Medoids Clustering
K-Medians Clustering
K-Modes Clustering
Fuzzy K-Means Clustering
X-Means Clustering
Intelligent K-Means Clustering
Bisecting K-Means Clustering
Kernel K-Means Clustering
Mean Shift Clustering
Weighted K-Means Clustering
Genetic K-Means Clustering
Hierarchical Clustering
Use distance matrix as clustering criteria. This method
does not require the number of clusters k as an input, but needs a termination condition
[Figure: dendrogram over objects a, b, c, d, e; agglomerative clustering merges clusters bottom-up (Step 0 to Step 4), divisive clustering splits them top-down (Step 4 to Step 0)]
Hierarchical Clustering – terminology
A cluster hierarchy can be interpreted using standard binary tree terminology.
The root represents the complete set of data objects to be clustered and forms the apex of the hierarchy (level 0). At each level, the child entries (or nodes), which are subsets of the entire dataset, correspond to the clusters.
This cluster hierarchy is also called a dendrogram.
Agglomerative Clustering
(a) Dissimilarity matrix:
      1     2     3     4
1   0.00  0.20  0.15  0.30
2   0.20  0.00  0.40  0.50
3   0.15  0.40  0.00  0.10
4   0.30  0.50  0.10  0.00
[Figure: (b) single-link and (c) complete-link dendrograms built from this matrix]
Group Averaged Agglomerative Clustering (GAAC) considers the
similarity between all pairs of points present in both the clusters and diminishes the drawbacks associated with single and complete link methods.
Ward: For any two clusters, Ca and Cb, Ward's criterion is calculated by measuring the increase in the value of the Sum of Squared Errors (SSE) criterion
for the clustering obtained by merging them into Ca ∪ Cb.
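A minimal sketch of agglomerative clustering on the 4-object dissimilarity matrix above, assuming SciPy is available; the linkage names correspond to the single, complete, and group-average criteria (Ward linkage in SciPy expects raw observation vectors rather than a precomputed dissimilarity matrix):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

D = np.array([[0.00, 0.20, 0.15, 0.30],
              [0.20, 0.00, 0.40, 0.50],
              [0.15, 0.40, 0.00, 0.10],
              [0.30, 0.50, 0.10, 0.00]])

condensed = squareform(D)                     # upper-triangle vector form
for method in ("single", "complete", "average"):
    Z = linkage(condensed, method=method)     # (n-1) x 4 merge table
    print(method, fcluster(Z, t=2, criterion="maxclust"))   # cut into 2 clusters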
Divisive Clustering – top-down approach
Factors affecting the performance
1. Splitting criterion: Ward's K-means squared error criterion. The greater the reduction in the SSE criterion, the better the split. The SSE criterion can be applied to numerical data only!
2. Splitting method: The method used to obtain the binary split of the parent node is also critical, since it can reduce the time taken for evaluating Ward's criterion. The Bisecting K-means approach can be used here (with K = 2) to obtain good splits, since it is based on the same criterion of maximizing Ward's distance between the splits.
3. Choosing the cluster to split: The choice of cluster to split may not be as important as the first two factors, but it can still be useful to choose the most appropriate cluster to split further when the goal is to build a compact dendrogram.
4. Handling noise: Since noise points present in the dataset might result in aberrant clusters, a threshold can be used to determine the termination criterion rather than splitting the clusters further.
Clustering: Summary of Drawbacks of Traditional Methods
Partition-based algorithms split large clusters
Centroid-based method splits large and non-hyperspherical
clusters
Centers of subclusters can be far apart
Minimum spanning tree algorithm is sensitive to outliers and
slight change in position
Exhibits chaining effect on string of outliers
Cannot scale up for large databases
Outline of Advanced Clustering Analysis
Probability Model-Based Clustering
Each object may take a probability to belong to a cluster
Clustering High-Dimensional Data
Curse of dimensionality: Difficulty of distance measure in high-D
space
Clustering Graphs and Network Data
Similarity measurement and clustering methods for graph and
networks
Clustering with Constraints
Cluster analysis under different kinds of constraints, e.g., those raised
from background knowledge or the spatial distribution of the objects
Probability Model-Based Clustering
A hidden category (i.e., probabilistic cluster) is a distribution
over the data space, which can be represented using a
probability density function (or distribution function).
Ex. Categories of digital cameras
EM (Expectation Maximization) Algorithm
The k-means algorithm has two steps at each iteration:
Expectation Step (E-step): Given the current cluster centers, each
object is assigned to the cluster whose center is closest to the
object: An object is expected to belong to the closest cluster
Maximization Step (M-step): Given the cluster assignment, for
each cluster, the algorithm adjusts the center so that the sum of
the distances from the objects assigned to this cluster to the new
center is minimized
The EM algorithm: A framework to approach maximum likelihood or maximum a posteriori estimates of parameters in statistical models.
E-step assigns objects to clusters according to the current fuzzy
clustering or parameters of probabilistic clusters
M-step finds the new clustering or parameters that maximize the sum of squared error (SSE) or the expected likelihood
EM (Expectation Maximization) Algorithm
Given a statistical model consisting of a set of observed variables X, a set
of unobserved latent variables Z, and a vector of unknown parameters Θ, the goal is to maximize the log-likelihood with respect to the parameters Θ.
1: Start with an initial guess for the parameters Θ(0) and compute the initial log-likelihood log p(X|Θ(0)).
2: E-step: Evaluate q(t) = argmax_q L(q, Θ(t)), i.e., q(t)(z_n) = p(z_n|x_n, Θ(t)).
3: M-step: Update the parameters: Θ(t+1) = argmax_Θ Q(Θ, Θ(t)).
4: Compute the log-likelihood log p(X|Θ(t+1)) and check for convergence
of the algorithm. If the convergence criterion is not satisfied, then repeat
steps 2-4, otherwise, return the final parameters.
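A minimal sketch of this EM loop for a Gaussian mixture, assuming NumPy and scikit-learn; the synthetic data and the number of components are illustrative, and the alternating E- and M-steps are carried out internally by fit():

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, 300), rng.normal(6, 1.5, 200)]).reshape(-1, 1)

gm = GaussianMixture(n_components=2, max_iter=100, tol=1e-4, random_state=0)
gm.fit(X)                                    # alternates E-steps and M-steps

print("means:", gm.means_.ravel())
print("weights:", gm.weights_)
print("log-likelihood per sample:", gm.score(X))   # monitored for convergence
print(gm.predict_proba(X[:3]))               # E-step output: soft assignments p(z|x, Θ)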
Main applications of the EM Algorithm
when the data has missing values, due to problems with or limitations of the observation process
when optimizing the likelihood function is analytically intractable, but the likelihood function can be simplified by assuming the existence of values for additional but missing (or hidden) parameters
EM is becoming a useful tool to price and manage risk of a
portfolio
Density-Based Clustering: DBSCAN
Two parameters:
Eps: maximum radius of the neighbourhood
MinPts: minimum number of points in an Eps-neighbourhood of that point
NEps(p): {q belongs to D | dist(p,q) ≤ Eps}
Directly density-reachable: A point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
p belongs to NEps(q)
core point condition:
|NEps (q)| ≥ MinPts
[Figure: p is directly density-reachable from the core point q; MinPts = 5, Eps = 1 cm]
Chameleon - idea
Graph partitioning – k-NN graph: p and q are connected if q is among the k closest neighbors of p
Chameleon - idea
Merging
Evaluation of Clustering Quality
Assessing Clustering Tendency
Assess if non-random structure exists in the data by measuring the probability that the data is generated by a uniform data distribution
Determine the Number of Clusters
Empirical method: # of clusters ≈ √(n/2) for a dataset of n points
Elbow method: use the turning point in the curve of the sum of within-cluster variance w.r.t. the # of clusters
Cross validation method
Measuring Clustering Quality
Extrinsic: supervised
Compare a clustering against the ground truth using certain clustering quality measure
Intrinsic: unsupervised
Evaluate the goodness of a clustering by considering how well the clusters are separated, and how compact the clusters are
Clustering High-Dimensional Data
Clustering high-dimensional data (How high is high-D in clustering?)
Many applications: text documents, DNA micro-array data
Major challenges:
Many irrelevant dimensions may mask clusters
Distance measure becomes meaningless—due to equi-distance
Clusters may exist only in some subspaces
Methods
Subspace-clustering: Search for clusters existing in subspaces of
the given high dimensional data space
CLIQUE, ProClus, and bi-clustering approaches
Dimensionality reduction approaches: Construct a much lower
dimensional space and search for clusters there (may construct new
dimensions by combining some dimensions in the original data)
Dimensionality reduction methods and spectral clustering
Traditional Distance Measures May Not Be
Effective on High-D Data
Traditional distance measure could be dominated by noises in many
dimensions
Ex. Which pairs of customers are more similar?
By Euclidean distance over all dimensions, a different pair comes out as the most similar,
even though Ada and Cathy look more similar on the relevant attributes
Clustering should not only consider dimensions but also attributes
(features)
Feature transformation: effective if most dimensions are relevant
(PCA & SVD useful when features are highly correlated/redundant)
Feature selection: useful to find a subspace where the data have
nice clusters
Clustering – scalability (from the Database and Machine Learning communities)
Scalable Clustering Algorithms
CLARANS – sampling database
DBSCAN – density based method
BIRCH – partitions objects hierarchically using tree structure
CLIQUE – integrates density-based and grid-based method
STING – grid-based method
CURE
ROCK – merges clusters based on their interconnectivity
COBWEB and CLASSIT
Neural networks: SOM, GNG
…
Density-based Clustering
Criterion: Density-connected points
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
DBSCAN estimates the density by counting the number of points in a
fixed-radius neighborhood and considers two points as connected if they lie within each other’s neighborhood.
A point is called core point if the neighborhood of radius Eps contains
at least MinPts points, i.e., the density in the neighborhood has to exceed some threshold. A point q is directly density-reachable from a
core point p if q is within the Eps-neighborhood of p, and density-
reachability is given by the transitive closure of direct density-reachability.
Two points p and q are called density-connected if there is a third point
o from which both p and q are density-reachable.
A cluster is then a set of density-connected points which is maximal with respect to density-reachability.
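A minimal sketch of DBSCAN, assuming NumPy and scikit-learn; eps and min_samples play the roles of Eps and MinPts, and the synthetic data are illustrative:

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal((0, 0), 0.3, (100, 2)),     # dense cluster
               rng.normal((4, 4), 0.3, (100, 2)),     # second dense cluster
               rng.uniform(-2, 6, (20, 2))])          # scattered noise

db = DBSCAN(eps=0.5, min_samples=5).fit(X)            # Eps, MinPts
labels = db.labels_                                    # -1 marks noise points
print("clusters:", set(labels) - {-1}, "noise points:", np.sum(labels == -1))
print("core points:", len(db.core_sample_indices_))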
DBSCAN
Example: A point is called core point if the neighborhood of radius Eps
contains at least MinPts points. In the diagram, MinPts = 4.
Point A and the other red points are core points, because the area surrounding each of them within an Eps radius contains at least 4 points. Because they are all reachable
from one another, they form a single cluster. Points B and C are not core points, but are density-reachable from A (via other core points) and thus belong to the cluster as well.
Point N is a noise point that is neither a core point nor density-reachable.
CURE (Clustering Using REpresentatives )
CURE is an algorithm which incorporates a novel feature of
representing a cluster using a set of well-scattered representative points. The distance between two clusters is
calculated by looking at the minimum distance between the
representative points chosen.
Stops at k clusters
Based on representative points
The classical methods would generate the clusters shown in (b)
CURE: merging representative points
shrinking the representative points toward the center by a factor (to reduce the effect of outliers)
the representative points allow the shape of the cluster to be captured
Conceptual Clustering - COBWEB
Conceptual clustering
A form of clustering in machine learning
Produces a classification scheme for a set of unlabeled
objects
Finds characteristic description for each concept (class)
COBWEB (Fisher’87)
A popular and simple method of incremental conceptual
learning
Creates a hierarchical clustering in the form of a
classification tree
Each node refers to a concept and contains a
probabilistic description of that concept
Clustering: Self-Organizing Maps
SOM is one of the algorithms used to interpret and visualize high-dimensional data sets. The map consists of a grid of neurons representing all available observations (data).
Types of application:
Clustering : data classification
Vector Quantization : space discretisation
Reduction of data dimension
Data Preprocessing
Feature extraction
Concepts
SOMs are competitive networks that provide a ‘topological’ mapping from the input space to the clusters.
Neuron – data structure
Neural network – connected neurons
Unsupervised learning
Competitive learning
Rule : «winner-take-all»
Kohonen’s Self-Organizing Maps
Network with a fixed dimension (grid 2D)
Data space mapping onto the grid 2D of the network
Growing Neural Gas
growing network; neurons are inserted where the error is the
highest
Self-Organizing Maps
Inspiration : … topographic maps in the visual cortex
The network topology is a 2-dimensional grid that does not
change during self-organization
Each neuron of SOM is linked to all neurons of the map.
Learning principle:
Generate an input and determine the winner.
The distance on the grid is used to determine how strongly a neuron
is adapted when the neuron is the winner.
Mechanism of lateral interaction: «Mexican hat»
SOM: «Mexican hat»
The weights are updated according to the «Mexican hat» function:
[Figure: lateral interaction; excitatory action close to the winner, inhibitory action further away]
Hints: - define a large neighborhood range in the beginning
- the adaptation rate is a linearly decreasing function
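A minimal sketch of this SOM learning rule, assuming NumPy; the grid size and the schedules for the learning rate and the neighborhood radius are illustrative, and a Gaussian neighborhood is used instead of the full «Mexican hat»:

import numpy as np

rng = np.random.default_rng(0)
grid_h, grid_w, dim = 10, 10, 3
W = rng.random((grid_h, grid_w, dim))                  # one weight vector per neuron
coords = np.stack(np.meshgrid(np.arange(grid_h), np.arange(grid_w),
                              indexing="ij"), axis=-1) # grid positions of neurons

data = rng.random((1000, dim))
n_iter = 5000
for t in range(n_iter):
    x = data[rng.integers(len(data))]                  # generate an input
    d = np.linalg.norm(W - x, axis=-1)
    winner = np.unravel_index(np.argmin(d), d.shape)   # best-matching unit
    # both the neighborhood radius and the learning rate decrease linearly over time
    radius = 5.0 * (1 - t / n_iter) + 0.5
    lr = 0.5 * (1 - t / n_iter) + 0.01
    grid_dist = np.linalg.norm(coords - np.array(winner), axis=-1)
    h = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))  # neighborhood function
    W += lr * h[..., None] * (x - W)                   # move neurons toward x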
Kohonen’s Self-organizing Map
Map of 10*10 neurons
GNG
Neural Gas and Growing Neural Gas
[Fritzke, 1994] [Martinetz, Schulten, 1991]
NG
The NG algorithm sorts, for each input signal, the neurons of the network according to the distance of their reference vectors to the signal.
Based on this ‘rank order’, a certain number of units is adapted.
Both the number of adapted neurons and the adaptation strength are
decreased according to a fixed schedule.
Neurons are not interconnected.
GNG
Self-organization: starting with very few neurons, new neurons are inserted
successively.
Each new neuron is inserted near the neuron which has accumulated most
errors.
Neurons are connected dynamically: age of connections is used to delete a
connection.
Growing Neural Gas: Algorithm
1. Initialize the set to contain two units A = {c1, c2}, t = 0. Initialize the connection set C.
2. Generate at random an input signal x.
3. Determine the winner s1 and the second-nearest unit s2 (the two units closest to x).
4. If a connection between s1 and s2 does not exist already, create it and set its age to 0:
C = C ∪ {(s1,s2)}, age(s1,s2) = 0.
5. Add the squared distance between the input signal and the winner to a local error variable: ΔE_s1 = ||x − w_s1||².
6. Adapt the reference vectors of the winner and its direct topological neighbors by the fractions εb and εn:
Δw_s1 = εb·(x − w_s1), Δw_n = εn·(x − w_n)
7. Increment the age of all edges emanating from s1.
8. Remove edges with an age larger than a_max. If this results in units having no more emanating edges, remove those units as well.
9. If the number of input signals generated so far is an integer multiple of a parameter λ, add a new unit r to the network, interpolate its reference vector from the unit q with the maximum accumulated error and its neighbor f with the largest error, and decrease the error variables of q and f by a fraction α.
10. If a stopping criterion (e.g., net size or some performance measure) is not yet fulfilled, continue with step 2.
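A compact sketch of these steps, assuming NumPy; all parameter values are illustrative, and the removal of units that lose all their edges (part of step 8) is omitted for brevity:

import numpy as np

def gng(data, max_units=100, lam=100, eps_b=0.05, eps_n=0.006,
        alpha=0.5, decay=0.995, a_max=50, n_steps=20000, seed=0):
    rng = np.random.default_rng(seed)
    W = [data[rng.integers(len(data))].astype(float) for _ in range(2)]  # step 1
    E = [0.0, 0.0]                                      # accumulated errors
    edges = {}                                          # (i, j), i < j -> age
    for t in range(1, n_steps + 1):
        x = data[rng.integers(len(data))]               # step 2
        d2 = [float(np.sum((x - w) ** 2)) for w in W]
        s1, s2 = np.argsort(d2)[:2]                     # step 3
        edges[tuple(sorted((int(s1), int(s2))))] = 0    # step 4
        E[s1] += d2[s1]                                 # step 5
        W[s1] += eps_b * (x - W[s1])                    # step 6 (winner)
        for (i, j) in list(edges):
            if s1 in (i, j):
                n = j if i == s1 else i
                W[n] += eps_n * (x - W[n])              # step 6 (neighbors)
                edges[(i, j)] += 1                      # step 7
        edges = {e: a for e, a in edges.items() if a <= a_max}   # step 8 (edges only)
        if t % lam == 0 and len(W) < max_units:         # step 9
            q = int(np.argmax(E))
            nbrs = [j if i == q else i for (i, j) in edges if q in (i, j)]
            if nbrs:
                f = max(nbrs, key=lambda n: E[n])
                W.append(0.5 * (W[q] + W[f]))           # interpolate new unit r
                r = len(W) - 1
                edges.pop(tuple(sorted((q, f))), None)
                edges[tuple(sorted((q, r)))] = 0
                edges[tuple(sorted((f, r)))] = 0
                E[q] *= alpha; E[f] *= alpha
                E.append(E[q])
        E = [e * decay for e in E]                      # decrease all errors
    return np.array(W), edges

rng = np.random.default_rng(1)
data = rng.random((2000, 2))                            # uniform square
units, edges = gng(data)
print(len(units), "units,", len(edges), "edges")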
Growing Neural Gas: example
GNG 100 neurons max.
GNG: Coverage of the data space
[Figure: the same data covered by a SOM grid and by a Growing Neural Gas network]
GNG: Growing Neural Gas – Economic Maps
Data: World Bank 1992, 39 indicators of quality of life.
Oil from Italy
572 samples of olive
oil were taken from 9 Italian provinces.
SOM 20 x 20, trained on the percentage of 8 fatty acids contained in the oils
Map: 8D => 2D.
Accuracy 95-97%.
Why Not Semi-Supervised Clustering?
Much information (in multiple relations) is needed to judge
whether two tuples are similar
A user may not be able to provide a good training set
It is much easier for a user to specify an attribute as a hint,
such as a student’s research area
Tuples to be compared (the user hint is one specified attribute):
Tom Smith  | SC1211 | TA
Jane Chang | BI205  | RA
Comparing with Semi-Supervised Clustering
Semi-supervised clustering: User provides a training set
consisting of “similar” (“must-link”) and “dissimilar” (“cannot-link”) pairs of objects
User-guided clustering: User specifies an attribute as a hint, and more relevant features are found for clustering
[Figure: both approaches cluster all tuples; semi-supervised clustering vs. user-guided clustering]
Link analysis
Link analysis techniques are applied to data that can be
represented as nodes and links
A node (vertex): person, bank account, document, …
A link: a relationship between two bank accounts
Link analysis - measures
Degree - # of connected nodes
Closeness – average distance from the node to all other
nodes
Betweenness – the extent to which the node lies on the shortest paths between pairs of other nodes
Cutpoints – nodes whose removal divides the network into unconnected
systems
Clique – small, highly-interconnected subgraph within a
larger network
Equivalence – structural and regular
Link analysis applications
Social network analysis
Which people are powerful?
Which people influence other people?
How does information spread within the network?
Who is relatively isolated, and who is well connected?
…
Internet search engines
Google search engine: PageRank algorithm
Marketing
Viral marketing: „word-of-mouth” advertising
Hotmail – free email service
Fraud detection
AML systems
…
Network Clustering
Networks: social networks, the web, or biological
interaction networks.
Networks can naturally be modeled as graphs.
Let G = (V,E) be a graph with a set of vertices V and a
set of edges E. Vertices represent objects, and edges
represent relationships between pairs of objects.
Intuitively, vertices sharing a lot of neighbors should
belong to the same cluster.
Social Networks
A social network is a social structure made up of a set of social actors
(such as individuals or organizations) and a set of the dyadic ties between these actors.
A Social Network Model
Cliques, hubs and outliers
Individuals in a tight social group, or clique, know many of the
same people, regardless of the size of the group
Individuals who are hubs know many people in different groups
but belong to no single group. Politicians, for example, bridge
multiple groups
Individuals who are outliers reside at the margins of society.
Hermits, for example, know few people and belong to no group
The Neighborhood of a Vertex
Define Γ(v) as the immediate neighborhood of a vertex v (i.e., the set of people that an individual knows).
Network Clustering
A similarity function for pairs of vertices v and w, denoted
by sim(v,w), is based on the intersection of their sets of neighbors:
sim(v,w) = |Γ(v)∩Γ(w)| / √(|Γ(v)| · |Γ(w)|)
where Γ(v) denotes the set of all (direct) neighbors of
vertex v, i.e., Γ(v) = { w|(v,w) ∈ E} ∪ {v}
The ε-neighborhood of a vertex v is given by the set of all
neighbors whose similarity exceeds the threshold of ε, i.e.,
Nε(v) = {w ∈ Γ(v) | sim(v,w) ≥ ε}
A vertex v is called a core, if its ε-neighborhood has a
cardinality of at least μ. If a vertex is not a member of any
cluster, it is either a hub or an outlier.
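A minimal sketch of this structural similarity, the ε-neighborhood, and the core test, assuming plain Python; the small example graph is illustrative:

import math

def gamma(adj, v):
    """Γ(v): v together with its direct neighbors."""
    return adj[v] | {v}

def sim(adj, v, w):
    gv, gw = gamma(adj, v), gamma(adj, w)
    return len(gv & gw) / math.sqrt(len(gv) * len(gw))

def eps_neighborhood(adj, v, eps):
    return {w for w in gamma(adj, v) if sim(adj, v, w) >= eps}

def is_core(adj, v, eps, mu):
    return len(eps_neighborhood(adj, v, eps)) >= mu

# adjacency sets of a tiny network: a triangle {0,1,2} plus a pendant vertex 3
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(sim(adj, 0, 1), eps_neighborhood(adj, 0, eps=0.7), is_core(adj, 0, 0.7, 3))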
Sample social network dataset with feature vectors
Latvian political parties and donations
Corruption on the Cuyahoga River
Large Graph Mining [C. Faloutsos et al., KDD 2009]
Social networks
Summary
Clustering is one of the most fundamental data mining
problems because of its numerous applications to customer segmentation, target marketing, and data summarization.
Challenges
Leveraging Dimensionality Reduction Methods
High Dimensional Scenario
Scalable Techniques for Cluster Analysis
I/O Issues in Database Management
Streaming Algorithms
Big Data Framework