
Very large data sets

Pasi Fränti

Clustering methods: Part 10

Speech and Image Processing Unit, School of Computing

University of Eastern Finland

5.5.2014

Methods for large data sets

• BIRCH

• CLARANS

• On-line EM

• Scalable EM

• GMG

Let’s study this (no material for the others)

Gradual model generator (GMG) [Kärkkäinen & Fränti, 2007: Pattern Recognition]

[Block diagram of the GMG pipeline, with the stages: Data, Buffer, Model, Model generation, Model size reduction, Generated model, Post-processing, Output models, Selection]
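To make the block diagram easier to follow, a skeletal outline of the stages is sketched below in Python. Only the stage names come from the diagram; the class layout, method signatures, and the buffer-size parameter are illustrative assumptions, not part of the original method description.

```python
# Skeletal outline of the GMG pipeline stages named in the block diagram.
# Only the stage names follow the slides; the class layout, method
# signatures and the buffer-size parameter are illustrative assumptions.

class GradualModelGenerator:
    def __init__(self, max_buffer_size=1000):
        self.model = []           # mixture components generated so far
        self.buffer = []          # data points not yet covered by the model
        self.output_models = []   # candidate models kept for final selection
        self.max_buffer_size = max_buffer_size

    def process(self, point):
        """Single-pass entry point: map the point to the model or buffer it."""
        if not self.map_to_model(point):
            self.buffer.append(point)
        if len(self.buffer) >= self.max_buffer_size:
            self.generate_components()   # "Model generation" stage
            self.reduce_model_size()     # "Model size reduction" stage

    def map_to_model(self, point):
        """Try to assign the point to an existing component (see 'Model update')."""
        raise NotImplementedError

    def generate_components(self):
        """Create new components from buffered points (see 'Generating new components')."""
        raise NotImplementedError

    def reduce_model_size(self):
        """Keep the number of components in the model bounded."""
        raise NotImplementedError

    def finalize(self):
        """'Post-processing' and 'Selection': return the chosen output model."""
        raise NotImplementedError
```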

Goal of the GMG algorithm

[Figure: contours of probability density distributions, comparing EM and GMG]

Model update

• New data points are mapped immediately when input.
• Points too far from any model component remain in the buffer.
• Buffered points are re-tested when new components are created.

[Figure: model contours before and after the update]
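The mapping rule in the bullets above can be sketched as follows. This assumes Gaussian components stored as dictionaries with a mean, covariance, and point count, and a Mahalanobis-distance threshold; the threshold value and the incremental mean update are illustrative choices, not taken from the slides or the paper.

```python
import numpy as np

# Sketch of the immediate point-mapping rule: assign the point to the nearest
# component if it is close enough, otherwise signal that it should be buffered.
# Components are assumed to be dicts with keys "mean", "cov" and "n";
# the Mahalanobis threshold of 3.0 is an illustrative assumption.

def map_point(x, components, threshold=3.0):
    best, best_dist = None, np.inf
    for c in components:
        diff = x - c["mean"]
        d = np.sqrt(diff @ np.linalg.inv(c["cov"]) @ diff)  # Mahalanobis distance
        if d < best_dist:
            best, best_dist = c, d

    if best is None or best_dist > threshold:
        return False   # too far from every component: caller buffers the point

    # Simple incremental mean update (covariance update omitted for brevity).
    best["n"] += 1
    best["mean"] = best["mean"] + (x - best["mean"]) / best["n"]
    return True
```

Points for which `map_point` returns False stay in the buffer and, as the slide states, are re-tested against the model whenever new components have been generated.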

Generating new components

• When the buffer is full, selected points are used to generate new components.
• The most compact k-neighborhood is selected as the seed for a new component.

[Figure: data in buffer; selected points and a new component]
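A sketch of this seed-selection step is given below: scan the buffered points, measure how tight each point's k-neighborhood is, and take the tightest one as the seed of a new component. The value of k, the use of the sum of Euclidean distances as the compactness measure, and the way the seed is turned into a Gaussian component are illustrative assumptions.

```python
import numpy as np

# Sketch of selecting the most compact k-neighborhood from the buffer and
# turning it into a new Gaussian component.  k, the compactness measure
# (sum of Euclidean distances to the k nearest neighbours) and the component
# representation are assumptions made for illustration.

def most_compact_k_neighborhood(buffer, k=10):
    """Return indices of the tightest (point + k nearest neighbours) group."""
    X = np.asarray(buffer)
    # Pairwise Euclidean distances between all buffered points.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

    best_idx, best_score = None, np.inf
    for i in range(len(X)):
        neighbors = np.argsort(dists[i])[:k + 1]   # the point itself + k nearest
        score = dists[i, neighbors].sum()          # compactness of this neighborhood
        if score < best_score:
            best_idx, best_score = neighbors, score
    return best_idx

def seed_component(buffer, k=10):
    """Create a new Gaussian component from the most compact neighborhood."""
    idx = most_compact_k_neighborhood(buffer, k)
    pts = np.asarray(buffer)[idx]
    return {"mean": pts.mean(axis=0),
            "cov": np.cov(pts, rowvar=False),
            "n": len(pts)}
```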

Example

[Series of figures over several slides: the 2-D example data set and model in the unit square, axes 0 to 1]

Post-processing

[Figures over three slides: the model before post-processing, the updated model, and the updated model shown together with the data, all in the unit square]

Literature

1. I. Kärkkäinen, P. Fränti, "Gradual model generator for single-pass clustering", Pattern Recognition 40(3) (2007) 784-795.

2. P. Bradley, U. Fayyad, C. Reina, "Clustering very large databases using EM mixture models", Proc. 15th Int. Conf. on Pattern Recognition, vol. 2, 2000, pp. 76-80.

3. R. Ng, J. Han, "CLARANS: a method for clustering objects for spatial data mining", IEEE Trans. Knowledge and Data Engineering 14(5) (2002) 1003-1016.

4. M. Sato, S. Ishii, "On-line EM algorithm for the normalized Gaussian network", Neural Computation 12(2) (2000) 407-432.

5. T. Zhang, R. Ramakrishnan, M. Livny, "BIRCH: a new data clustering algorithm and its applications", Data Mining and Knowledge Discovery 1(2) (1997) 141-182.