Clustering Theory

Clustering Theory: Data Mining for Quality Improvement with Nonsmooth Optimization vs. PAM and k-Means
4th International Summer School "Achievements and Applications of Contemporary Informatics, Mathematics and Physics", National University of Technology of the Ukraine, Kiev, Ukraine, August 5-16, 2009
Gerhard-Wilhelm Weber* and Başak Akteke-Öztürk, Institute of Applied Mathematics, Middle East Technical University, Ankara, Turkey
* Faculty of Economics, Management and Law, University of Siegen, Germany; Center for Research on Optimization and Control, University of Aveiro, Portugal

Description

AACIMP 2009 Summer School lecture by Gerhard-Wilhelm Weber, part of the course "Modern Operational Research and Its Mathematical Methods".

Transcript of Clustering Theory

Page 1: Clustering Theory

Clustering Theory

Data Mining for Quality Improvement with Nonsmooth Optimization

4th International Summer School "Achievements and Applications of Contemporary Informatics, Mathematics and Physics", National University of Technology of the Ukraine, Kiev, Ukraine, August 5-16, 2009

with Nonsmooth Optimization vs. PAM and k-Means

Gerhard-Wilhelm Weber* and Başak Akteke-Öztürk

Institute of Applied Mathematics, Middle East Technical University, Ankara, Turkey

* Faculty of Economics, Management and Law, University of Siegen, Germany

Center for Research on Optimization and Control, University of Aveiro, Portugal

Page 2: Clustering Theory

Outline

• Quality Analysis

• Data Mining for Quality Analysis

• Clustering Methods

• Results and Comparison

• Decision Tree Analysis of a Cluster

• Conclusion

Page 3: Clustering Theory

Quality Analysis

• Quality is an essential requirement of – products, – processes, and – services.

• This study is part of a project whose main focus is on quality analysis: the relationship between input and output.

• Modern quality analysis takes advantage of Data Mining tools.

Page 4: Clustering Theory

Data Mining for Quality Analysis

Data mining tools such as – decision trees (e.g., classification and regression trees (CART)), – neural networks (NN), – self-organizing maps (SOM), and – support vector machines (SVM) are highly preferred for modeling and producing rules for the output.

However, applications of such tools are not yet widespread enough for industry practitioners to prefer and make use of them for their quality analysis needs.

Page 5: Clustering Theory

Aim of Our Data Mining Studies

• To identify the data mining approaches that can effectively improve product and process quality in industrial organizations: – classification / prediction, – clustering, and – association analysis.

• To develop new data mining software and improve the existing software for quality analysis.

• Initial study: to identify the most influential variables that cause defects in the items produced by a casting company located in Turkey.

Page 6: Clustering Theory

Our Data Set

• Our data set: 92 objects (rows), 35 process variables (columns).

• It belongs to a particular product with a high percentage of defectives, collected during the first five months of the 2006 production period.

• Missing values were filled with the averages of the corresponding columns.
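A minimal sketch of this imputation step (illustrative only; the array names below are hypothetical and not from the slides):

```python
import numpy as np

def impute_column_means(X):
    """Fill missing entries (np.nan) with the average of the corresponding column."""
    X = X.astype(float).copy()
    col_means = np.nanmean(X, axis=0)        # per-column averages, ignoring missing values
    rows, cols = np.where(np.isnan(X))       # positions of the missing values
    X[rows, cols] = col_means[cols]          # replace each NaN by its column mean
    return X

# usage: data = impute_column_means(raw_data)   # raw_data: 92 x 35 array with NaNs for missing values
```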

Page 7: Clustering Theory

Clustering - 2 Algorithms (Model Free)

Minimal distance procedure:
– choose a random start partition
– compute the centroids
– create the minimal distance partition
– end partition
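As an illustration of the procedure above (random start partition, centroid computation, minimal-distance reassignment, repeated until the partition no longer changes), here is a minimal NumPy sketch; it is not the authors' implementation:

```python
import numpy as np

def minimal_distance_clustering(X, k, max_iter=100, seed=0):
    """Lloyd-type k-means: alternate centroid computation and minimal-distance partitioning."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=X.shape[0])                 # random start partition
    centroids = np.empty((k, X.shape[1]))
    for _ in range(max_iter):
        for j in range(k):                                       # compute the centroids
            members = X[labels == j]
            centroids[j] = members.mean(axis=0) if len(members) else X[rng.integers(X.shape[0])]
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)                        # minimal distance partition
        if np.array_equal(new_labels, labels):                   # end partition reached
            break
        labels = new_labels
    return labels, centroids
```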

Page 8: Clustering Theory

Clustering - 2 Algorithms (Model Free)

Exchange procedure:
– choose a random start partition
– test an object in all clusters
– minimal distance procedure
– update the centroids
– end partition
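A simplified sketch of the exchange idea (the exact exchange criterion of the original algorithm may differ; here each object is tested against all clusters, moved to the one with the nearest centroid, and the centroids are then updated):

```python
import numpy as np

def exchange_procedure(X, labels, k, max_passes=10):
    """Repeatedly test each object in all clusters and exchange it if a nearer centroid exists.

    Assumes every cluster is initially non-empty."""
    for _ in range(max_passes):
        changed = False
        for i in range(X.shape[0]):
            centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])  # update the centroids
            nearest = np.linalg.norm(centroids - X[i], axis=1).argmin()
            if nearest != labels[i] and np.sum(labels == labels[i]) > 1:           # keep clusters non-empty
                labels[i] = nearest                                                 # exchange the object
                changed = True
        if not changed:                                                             # end partition
            break
    return labels
```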

Page 9: Clustering Theory

Our Clustering

• The data set was scaled to the interval [0,1] before the clustering analysis:

$x_i' = \dfrac{x_i - x_{\min}}{x_{\max} - x_{\min}}$

• We used k-means, PAM (Partitioning Around Medoids), and a modified k-means by Nonsmooth Analysis:

• to understand the data set by examining the groups in the data,
• to find the outliers of the data set,
• our data set was not big.

• These methods use Euclidean metric by default.
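A minimal sketch of the [0,1] scaling formula above (constant columns are left unchanged to avoid division by zero):

```python
import numpy as np

def scale_to_unit_interval(X):
    """Min-max scale each column to [0, 1]: x' = (x - x_min) / (x_max - x_min)."""
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    span = np.where(x_max > x_min, x_max - x_min, 1.0)   # avoid division by zero for constant columns
    return (X - x_min) / span
```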

Page 10: Clustering Theory

About the Methods

• PAM is more robust than k-means in the presence of noise and outliers.

• PAM minimizes a sum of dissimilarities instead of a sum of squared Euclidean distances.

• Medoids are less influenced by the presence of noise and outliers.

• A medoid can be defined as the object of a cluster whose average distance (dissimilarity) to all the objects in the cluster is minimal.
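Following this definition, a medoid can be computed directly (illustrative sketch using Euclidean dissimilarity):

```python
import numpy as np

def medoid(cluster):
    """Return the object whose average distance (dissimilarity) to all objects in the cluster is minimal."""
    dists = np.linalg.norm(cluster[:, None, :] - cluster[None, :, :], axis=2)  # pairwise distances
    return cluster[dists.mean(axis=1).argmin()]                                # object with minimal average distance
```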

Page 11: Clustering Theory

Nonsmooth Analysis

• k-means takes as input: the number of clusters and the initial cluster centers.

• This problem can be reduced to a nonsmooth optimization problem, which serves as the initial problem for the modified k-means.

• One can apply – global optimization techniques, – nonsmooth optimization algorithms, and – derivative-free optimization to the modified k-means algorithm.

• The minimum sum-of-squares problem is a nonsmooth and nonconvex optimization problem.
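For reference, the minimum sum-of-squares clustering problem can be stated as the following optimization problem (in the spirit of Bagirov et al. [2]; here $a^1,\dots,a^m$ are the data points and $x^1,\dots,x^k$ are the cluster centers to be found):

```latex
\min_{x^1,\dots,x^k \in \mathbb{R}^n} \ f(x^1,\dots,x^k),
\qquad
f(x^1,\dots,x^k) \;=\; \frac{1}{m} \sum_{i=1}^{m} \min_{j=1,\dots,k} \big\| x^j - a^i \big\|_2^2 .
```

The inner minimum over the cluster centers makes $f$ nonsmooth, and for $k \ge 2$ the function is also nonconvex.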

Page 12: Clustering Theory

k-Means Results (proximity values between cluster pairs)

k=2
cluster_1 (70 objects) – cluster_2 (22 objects): 1.113769

k=3
cluster_1 (68 objects) – cluster_2 (22 objects): 1.111567
cluster_1 (68 objects) – cluster_3 (2 objects): 1.593595
cluster_2 (22 objects) – cluster_3 (2 objects): 1.968277

k=4
cluster_1 (68 objects) – cluster_2 (6 objects): 1.44533
cluster_1 (68 objects) – cluster_3 (2 objects): 1.593595
cluster_1 (68 objects) – cluster_4 (16 objects): 1.104353
cluster_2 (6 objects) – cluster_3 (2 objects): 2.197992
cluster_2 (6 objects) – cluster_4 (16 objects): 1.055844
cluster_3 (2 objects) – cluster_4 (16 objects): 1.95292

Page 13: Clustering Theory

k-Means Results

• Best result is for k=2.

• The proximities of clusters for k=3 and k=4 are higher.

• But the results for k=3 and k=4 are artificial: one of the clusters contains only 2 objects.

• These objects are outliers.

Page 14: Clustering Theory

PAM Results

PAM Results (proximity values between cluster pairs)

2 clusters
cluster_1 (40 objects) – cluster_2 (52 objects): 1.2838

3 clusters
cluster_1 (33 objects) – cluster_2 (34 objects): 1.2838
cluster_1 (33 objects) – cluster_3 (25 objects): 1.2729
cluster_2 (34 objects) – cluster_3 (25 objects): 1.1242

4 clusters
cluster_1 (20 objects) – cluster_2 (34 objects): 1.2838
cluster_1 (20 objects) – cluster_3 (25 objects): 1.2729
cluster_1 (20 objects) – cluster_4 (13 objects): 1.1374
cluster_2 (34 objects) – cluster_3 (25 objects): 1.1242
cluster_2 (34 objects) – cluster_4 (13 objects): 1.5336
cluster_3 (25 objects) – cluster_4 (13 objects): 1.5523

Page 15: Clustering Theory

PAM Results

• The proximities of clusters for k=4 are higher, i.e., the clusters are better separated.

• The numbers of objects in the clusters are 20, 34, 25, and 13.

• This is a quite natural grouping of the data.

• Best result is for k=4.

• We can say that the clustering conducted by PAM is a fine tuning of the one done by k-means.

Cross-tabulation of PAM (k=4) vs. k-means (k=2) cluster memberships:

                 PAM 1.00   PAM 2.00   PAM 3.00   PAM 4.00   Total
k-Means 1.00        20         12         25         13        70
k-Means 2.00         0         22          0          0        22
Total               20         34         25         13        92

Page 16: Clustering Theory

Modified k-Means Results

k=2:
cluster_1: 61 objects
cluster_2: 31 objects

k=3:
cluster_1: 59 objects
cluster_2: 31 objects
cluster_3: 2 objects

k=4:
cluster_1: 45 objects
cluster_2: 24 objects
cluster_3: 2 objects
cluster_4: 21 objects

Cross-tabulation of modified global k-means (k=2) vs. k-means (k=2) cluster memberships:

                 Modified 1.00   Modified 2.00   Total
k-Means 1.00          61               9           70
k-Means 2.00           0              22           22
Total                 61              31           92

• For k=4, k-means has 2 clusters of fewer than 10 objects.

• Modified k-means has only 1 cluster of fewer than 10 objects; the others all have more than 20 objects.

• Best result is for k=2.

Page 17: Clustering Theory

Modified k-Means Results

• Modified k-means gave more natural results than k-means.

• The clusters found by this modified method are more balanced in terms of the numbers of objects.

• As k increases, k-means gives artificial results; however, modified global k-means gives reasonable clusters, except for one cluster.

• This new algorithm can be used when k is not known a priori.

• It is easy to use, and the running time of the algorithm is very short (seconds in all of our runs).

Page 18: Clustering Theory

Studies on Found Clusters

We obtained the rule sets for k-means when k = 2, 3, and 4.

These rule sets show which values of the process variables together characterize a given class of objects.

These results are meaningful for the decision maker, which in our case is the company.

Instead of rule sets, it is more meaningful here to see the decision tree analysis of the clusters.

We applied CART (classification and regression trees) of SPSS Clementine® 10.1 on the group we found from k-means for k=2.

Page 19: Clustering Theory

Results

• We chose the big cluster of 70 objects as our data set for CART.

• We formed 7 different training sets of 60 objects randomly and 7 test sets from the remaining 10 objects.

• There is one output variable (i.e., response variable), which represents the total number of defective items.

• We obtained 7 decision tree models from these training and test sets.
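The slides report results obtained with CART in SPSS Clementine® 10.1. Purely as an illustration of the same kind of experiment (scikit-learn instead of Clementine, placeholder data instead of the real process variables), one of the 7 splits could look like this:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# placeholder data: 70 objects of the big cluster, 35 process variables,
# y = total number of defective items (response variable)
rng = np.random.default_rng(0)
X = rng.random((70, 35))
y = rng.random(70) * 30

# one of the 7 random splits: 60 training objects, 10 test objects
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=10, random_state=1)

tree = DecisionTreeRegressor(min_samples_leaf=5, random_state=1)   # CART-style regression tree
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)

print("most influential variables:", np.argsort(tree.feature_importances_)[::-1][:4])
```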

Page 20: Clustering Theory

We used the following main measures to compare these models:

– Mean error (ME) – Mean absolute error (MAE) – Correlation
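These measures can be computed as follows (illustrative sketch; the sign convention of the mean error may differ from the one used in the slides):

```python
import numpy as np

def mean_error(y_true, y_pred):
    """ME: average signed prediction error."""
    return np.mean(y_pred - y_true)

def mean_absolute_error(y_true, y_pred):
    """MAE: average magnitude of the prediction errors."""
    return np.mean(np.abs(y_pred - y_true))

def correlation(y_true, y_pred):
    """Pearson correlation between actual and predicted values."""
    return np.corrcoef(y_true, y_pred)[0, 1]
```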

Results

                      Average   Model 1   Model 2   Model 3   Model 4   Model 5   Model 6   Model 7
Training ME             0         0.0       0.0       0.0       0.0       0.0       0.0       0.0
Training MAE            2.8       2.6       3.1       3.0       2.5       3.2       2.4       2.8
Training correlation    0.887     0.922     0.840     0.871     0.917     0.874     0.911     0.872
Test ME                -0.004     0.008     0.031     0.053    -0.064     0.002    -0.02     -0.04
Test MAE                7.74      5.2       7.7       6.9       9.5       5.5       7.7      11.7
Test correlation        0.040    -0.453    -0.046     0.555     0.146    -0.378     0.535    -0.08

Page 21: Clustering Theory

Results

                      Cluster of 70 objects   Whole data set of 92 objects
Training ME                  0                        0
Training MAE                 2.8                      3.23
Training correlation         0.887                    0.8098
Test ME                     -0.004                   -0.21
Test MAE                     7.74                     6.85
Test correlation             0.040                    0.0757

• Our studies show that it is better to perform clustering before building models and extracting rule sets.

• We obtained the 4 most important variables for the response variable.

• 2 of these important variables are also among the most important ones for the whole data set.

Page 22: Clustering Theory

Conclusion

• When the data mining techniques used for classification / prediction cannot produce accurate results or cannot build models that are capable of predicting correctly, it is better to find the homogeneous groups in the data set.

• Clustering algorithms produce highly different results; one should choose the most efficient and natural one.

• Modified k-Means can be preferred instead of k-Means.

Page 23: Clustering Theory

References

[1] Akteke-Öztürk, B., Weber, G.-W., and Kropat, E., Continuous optimization approaches for minimum sum of squares, in: ISI Proceedings of the 20th Mini-EURO Conference "Continuous Optimization and Knowledge-Based Technologies" (Neringa, Lithuania, May 20-23, 2008), 253-258.

[2] Bagirov, A.M., Rubinov, A.M., Soukhoroukova, N.V., and Yearwood, J., Unsupervised and supervised data classification via nonsmooth and global optimization, TOP 11, 1 (2003), 1-93.

[3] Bakır, B., Batmaz, İ., Güntürkün, F.A., İpekçi, İ.A., Köksal, G., and Özdemirel, N.E., Defect cause modeling with decision tree and regression analysis, Proceedings of the XVII. International Conference on Computer and Information Science and Engineering, Cairo, Egypt, December 08-10, 2006, Volume 17, pp. 266-269, ISBN 975-00803-7-8.

[4] Sugar, C.A., and James, G.M., Finding the number of clusters in a dataset: an information-theoretic approach, Journal of the American Statistical Association 98, 463 (2003), 750-763.

[5] Volkovich, Z., Barzily, Z., Weber, G.-W., and Toledano-Kitai, D., Cluster stability estimation based on a minimal spanning trees approach, Proceedings of the Second Global Conference on Power Control and Optimization, AIP Conference Proceedings 1159, Bali, Indonesia, 1-3 June 2009, Subseries: Mathematical and Statistical Physics, ISBN 978-0-7354-0696-4 (August 2009), 299-305; Hakim, A.H., Vasant, P., and Barsoum, N., guest eds.