Overview Of Clustering Techniques D. Gunopulos, UCR.

Overview Of Clustering Techniques

D. Gunopulos, UCR

Clusteting Data

• Clustering Algorithms– K-means and K-medoids algorithms– Density Based algorithms– Density Approximation

• Spatial Association Rules (Koperski et al, 95)• Statistical techniques (Wang et al, 1997)• Finding proximity relationships (Knorr et at, 96,

Clustering Data

• The clustering problem:

Given a set of objects, find groups of similar objects

• What is similar?

Define appropriate metrics • Applications in marketing, image processing,

biology

Clustering Methods• K-Means and K-medoids algorithms:

– CLARANS, [Ng and Han, VLDB 1994] • Hierarchical algorithms

– CURE, [Guha et al, SIGMOD 1998]– BIRCH, [Zhang et al, SIGMOD 1996]– CHAMELEON, [Kapyris et al, COMPUTER, 32]

• Density based algorithms – DENCLUE, [Hinneburg, Keim, KDD 1998]– DBSCAN, [Ester et al, KDD 96]

• Clustering with obstacles, [Tung et al, ICDE 2001]• Excellent survey: [Han et al., 2000]

K-means and K-medoids algorithms

• Minimizes the sum of square distances of points to cluster representative

• Efficient iterative algorithms (O(n))

Problems with K-means type algorithms

• Clusters are approximately spherical

• High dimensionality is a problem

• The value of K is an input parameter

Hierarchical Clustering

• Quadratic algorithms

• Running time can be improved using sampling [Guha et al, SIGMOD 1998] [Kollios et al, ICDE 2001]

Density Based Algorithms

• Clusters are regions of space which have a high density of points

• Clusters can have arbitrary shapes

Dimensionality Reduction

• Reduce the dimensionality of the space, while preserving distances

• Many techniques (SVD, MDS)

• May or may not help

Dimensionality Reduction

• Dimensionality reduction does not work always

Speeding up the clustering algorithms: Data Reduction

• Data Reduction:

– approximate the original dataset using a small representation

– ideally, the representation must be stored in main memory

– summarization, compression

• The accuracy loss must be as small as possible.

• Use the approximated dataset to run the clustering algorithms

Random Sampling as a Data Reduction Method

• Random Sampling is used as a data reduction method

• Idea: Use a random sample of the dataset and run the clustering algorithm over the sample

• Used for clustering and association rule detection [Ng and Han 94][Toivonen 96][Guha et al 98]

• But:

– For datasets that contain clusters with different densities, we may miss some sparse ones

– For datasets with noise we may include significant amount of noise in our sample

A better idea: Biased Sampling

• Use biased sampling instead of random sampling

• In biased sampling, the prob that a point is included in the sample depends on the local density

• We can oversample or undersample regions in our datasets depending on the DM task at hand

Example: NorthEast Dataset

NorthEast Dataset, 130K postal addresses in North Eastern USA

Random Sample

Random Sampling fails to find the clusters

Biased Sampling

Biased Sampling finds the clusters

The Biased Sampling Technique

• Basic idea:

– First compute an approximation of the density function of the dataset

– Use the density function to define the bias for each point and perform the sampling

[Kollios et al, ICDE 2001]

[Domeniconi and Gunopulos, ICML 2001]

[Palmer and Faloutsos, SIGMOD 2000]

Density Estimation

• We use kernels to approximate the probability density function (pdf)

• We scan the dataset and we compute an initial random sample and standard deviation

• For each sample we use a kernel. The approximate pdf is the sum of all kernels

Kernel Estimator

Example of a Kernel Estimator

The sampling step

• Let f(p) the pdf value for the point

p =(x1,x2, …, xd)

• We define L(p) = f(p)where is parameter

• We compute the normalization parameter k (in one scan):

pLk )(

The sampling step (cont.)

• The sampling bias is proportional to:

Where b is the size of the sample and k the normalization factor

• In another scan we perform the sampling (two scans)

• We can combine the above two steps into one scan

The variable • If = 0 then we have uniform random sampling

• If > 0 then regions with higher density are sampled at a higher rate

• If < 0 then regions with higher density are sampled at a lower rate

• We can show that if > -1, relative densities are preserved in the sample

Bias ~ af

Biased vs Uniform random sampling

DataSet 5 clusters With 1000 Uniform RS

With 1000 Biased RS, a=-0.5

Overview Of Clustering Techniques D. Gunopulos, UCR.

Documents

Transcript of Overview Of Clustering Techniques D. Gunopulos, UCR.

UCR-2013-presentation sent to doe Library/events/2013/hbcu-ucr/UCR-2013...Task1. Development of a CFD ... impregnation solution, mol/lit (M) 1 Durationof impregnation,hr 20 ... UCR-2013-presentation

UCR Extension

Service Level Agreement - UCR

UCR Huawei

Ucr. - aera.gov.in

Welcome to UCR

Noncommon Breaks - UCR

UCR-STAR: The UCR Spatio-temporal Active Repositoryeldawy/publications/19-UCR-Star.pdf · datasets that are tagged as geospatial. Similarly, other governments, non-government organizations

UCR CKI Bulletin 2

UCR transcripts.

UCR FS0310 Ondas

Ucr summon

Automatic Subspace Clustering Of High Dimensional Data For Data Mining Application Author ： Rakesh Agrawal Johannes Gehrke Dimitrios Gunopulos Prabhakar.

UCR Biochemistry Department Chemical Hygiene Planehs.ucr.edu/laboratory/CHP/BiochemistryCHP.2012.09.10.pdf · UCR Chemical Hygiene Plan, ver 2012.09.10 page 1 of 55 UCR Biochemistry

D. Zeinalipour et al. (UCR) “A Local Search Mechanism for Peer-to-Peer Networks” Vana Kalogeraki, Dimitrios Gunopulos & Demetris Zeinalipour (University.

NERVOUS [Recovered] - UCR

Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)

E-UCR Procedures

UCR-HS Counselor Workshop

A.Mitra, A.Banerjee,W.Najjar, D. Zeinalipour-Yazti, V. Kalogeraki, D. Gunopulos