http://www.iaeme.com/IJCET/index.asp 217 [email protected]

International Journal of Computer Engineering & Technology (IJCET) Volume 9, Issue 4, July-Aug 2018, pp. 217-228, Article IJCET_09_04_024

Available online at http://www.iaeme.com/IJCET/issues.asp?JType=IJCET&VType=9&IType=4

Journal Impact Factor (2016): 9.3590 (Calculated by GISI) www.jifactor.com

ISSN Print: 0976-6367 and ISSN Online: 0976–6375

© IAEME Publication

CURE IMPLEMENTATION

Anchal Chauhan and Seema Maitrey

Krishna Institute of Engineering and Technology, Uttar Pradesh, India

ABSTRACT

Data mining is the extraction of relevant knowledge and interesting patterns from large
amounts of available information. Clustering is one of many data mining techniques: it is the
unsupervised classification of patterns (data items, feature vectors, or observations) into
groups, i.e. clusters. This paper discusses the CURE hierarchical clustering algorithm and
presents an implementation of it.

Keywords: Data mining, clustering, CURE hierarchical clustering.

Cite this Article: Anchal Chauhan and Seema Maitrey, Cure Implementation.
International Journal of Computer Engineering & Technology, 9(4), 2018, pp. 217-228.

http://www.iaeme.com/IJCET/issues.asp?JType=IJCET&VType=9&IType=4

1. INTRODUCTION

Data mining is a technique for extracting hidden patterns from voluminous databases. It is
sometimes called KDD (knowledge discovery in databases). Its main goals include fast retrieval
of information, identification of hidden and previously unexplored patterns, reduced
complexity, knowledge discovery from databases, and time savings [1]. Classification is
supervised learning: class labels are defined in advance, and incoming data is categorized
according to those labels. Clustering, on the other hand, is unsupervised learning: data is
grouped according to similarity, and the groups are labelled afterwards [2]. Clustering can be
performed with different families of algorithms, such as partitioning, grid-based,
density-based, and hierarchical algorithms. Hierarchical clustering algorithms are categorised
as agglomerative or divisive [3], and the agglomerative algorithms include CURE, BIRCH, ROCK,
and CHAMELEON [4]. This paper focuses on the CURE hierarchical clustering algorithm and its
implementation.

Figure 1 Phases of data mining

2. RELATED WORK

Many researchers have studied the CURE hierarchical clustering technique. Some papers that
worked on the clustering process and on the CURE clustering algorithm are listed below.

Sudipto Guha et al. [5] proposed the CURE algorithm and showed its efficiency on large
databases. Qian Yuntao, Wang Qi, and Shi Qingsong [6] found a relation between the shrinking
scheme of the CURE algorithm and its hidden assumption that clusters are spherical.
G. Adomavicius et al. [7] proposed a new approach for discovering clusters in very large
volumes of continuously arriving data, using a sampling technique to cluster the dataset.
M. Kaya and R. Alhajj [8] introduced an automated method for mining fuzzy association rules
with the help of a genetic algorithm and the CURE algorithm. Parthasarathy, Zaki, Ogihara, and
Dwarkadas [9] worked on the discovery of clusters from database updates. They proposed a
method based on the SPADE algorithm for interactive and incremental frequent sequence mining.

3. CURE CLUSTERING ALGORITHM

CURE is an improved hierarchical algorithm. In data mining, clustering is useful for
discovering groups and identifying interesting distributions in the underlying data.
Traditional clustering algorithms favour clusters with spherical shapes and similar sizes, or
are fragile in the presence of outliers. CURE is robust to outliers and performs well in
identifying clusters that have non-spherical shapes or wide variances in size. It achieves this
by representing each cluster by a fixed number of points, which are generated by selecting
well-scattered points from the cluster and then shrinking them towards the centre of the
cluster by a specified fraction. Having more than one representative point per cluster allows
CURE to adjust well to the geometry of non-spherical shapes, and the shrinking helps dampen
the effects of outliers. To handle large databases, CURE employs a combination of random
sampling and partitioning [5].
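The selection and shrinking of representative points described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this paper: the function names, the farthest-point heuristic for choosing well-scattered points, and the values c = 4 and alpha = 0.5 are all assumptions made for the example.

```python
import math

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))

def scattered_points(points, c):
    """Pick up to c well-scattered points with a farthest-point heuristic (assumed)."""
    centre = centroid(points)
    # Start with the point farthest from the centroid.
    chosen = [max(points, key=lambda p: math.dist(p, centre))]
    while len(chosen) < min(c, len(points)):
        remaining = [p for p in points if p not in chosen]
        if not remaining:  # only duplicates of already chosen points are left
            break
        # Next: the point whose nearest already-chosen point is farthest away.
        chosen.append(max(remaining, key=lambda p: min(math.dist(p, q) for q in chosen)))
    return chosen

def shrink(reps, centre, alpha):
    """Move each representative a fraction alpha of the way towards the centre."""
    return [tuple(x + alpha * (m - x) for x, m in zip(p, centre)) for p in reps]

# Hypothetical square-shaped cluster with one interior point.
cluster = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0), (2.0, 2.0)]
reps = shrink(scattered_points(cluster, c=4), centroid(cluster), alpha=0.5)
print(reps)
```

With alpha = 0 the representatives stay on the cluster boundary, and with alpha = 1 they all collapse onto the centroid, so the shrinking fraction moves the behaviour between the two extremes.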

CURE implements a novel hierarchical algorithm that adopts a middle ground between the
centroid-based approach and the approach based on a representative object. Instead of using a
single centroid or object to represent a cluster, a fixed number of representative points in
the space are chosen. The representative points of a cluster are generated by selecting
well-scattered objects and then shrinking them towards the cluster centre by a specified
fraction, the shrinking factor. At each step, the two clusters with the closest pair of
representative points are selected and merged. Having more than one representative point per
cluster allows CURE to adjust well to the geometry of non-spherical shapes, and shrinking the
cluster helps dampen the effects of outliers. As a result, CURE is robust to outliers and can
identify clusters with non-spherical shapes and wide variances in size. It also scales well to
large or voluminous databases without sacrificing clustering quality: a random sample drawn
from the dataset is first partitioned, and each partition is partially clustered. The partial
clusters are then clustered again in a second pass to obtain the required clusters. The quality
of the clusters produced by CURE is reported to be much better than that of clusters found by
other algorithms [10].
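The merge step, in which the two clusters with the closest pair of representative points are combined, can be illustrated as follows. This is a simplified sketch under stated assumptions: representatives are taken as the points farthest from the centroid instead of mutually well-scattered points, distances are recomputed from scratch at every step with no heap or k-d tree, and the names `representatives`, `cluster_distance`, and `cure_merge` are hypothetical.

```python
import math
from itertools import combinations

def representatives(points, c=3, alpha=0.3):
    """Up to c points farthest from the centroid, shrunk towards it by alpha.

    Simplification (assumption): the selection is by distance from the centroid only.
    """
    dim = len(points[0])
    centre = tuple(sum(p[d] for p in points) / len(points) for d in range(dim))
    far = sorted(points, key=lambda p: math.dist(p, centre), reverse=True)[:c]
    return [tuple(x + alpha * (m - x) for x, m in zip(p, centre)) for p in far]

def cluster_distance(a, b):
    """Distance between the closest pair of representative points of a and b."""
    return min(math.dist(p, q) for p in representatives(a) for q in representatives(b))

def cure_merge(points, k):
    """Start from singleton clusters and merge the closest pair until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
    return clusters

# Two hypothetical, well-separated groups of points.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
result = cure_merge(data, k=2)
print(result)
```

The sketch is quadratic in the number of clusters at every step, which is acceptable for illustration but is exactly the cost that sampling and partitioning are meant to keep manageable on large inputs.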

Figure 2 Overview of CURE Algorithm

4. PROPOSED WORK

In the field of data mining, it is well known that handling voluminous data can be difficult.
To address this issue, we chose to implement the CURE (Clustering Using REpresentatives)
hierarchical algorithm from among the many available clustering algorithms, because CURE finds
clusters in voluminous databases, is robust to outliers, and can identify clusters with
non-spherical shapes. The CURE algorithm is implemented by combining data collection with data
reduction through random sampling and partitioning.

Algorithm:

Input:

Table A: main table
Table B: join table
Column A: join column from the main table
Column B: join column from the join table
n: number of clusters
Column C: filter column
Value: filter value

1. Join table A and table B on equality of column A and column B.
2. Count the rows of the result set and store the count in a variable T (total number of rows).
3. Start random sampling by calculating the size of each clustered partition: divide the total
number of rows T by the number of clusters n, i.e. size of cluster = T / n.
4. Store this value in a variable s.
5. Starting from the 0th row, select every nth row of the result set that satisfies the filter
criteria built from the user-supplied filter column and filter value.
6. Create the other partitions by selecting every nth row starting from the ith row, where i
ranges from 1 to n-1.
7. Analyse all partitions and perform clustering.
8. Merge the relevant partitions to obtain meaningful, knowledge-based data.
9. Repeat steps 3 to 8 if further clustering is required.
10. End.
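Steps 1 to 6 of the algorithm can be sketched against an in-memory SQLite database. The table names `table_a` and `table_b`, the columns `id`, `a_id`, `category`, and `detail`, and the sample rows are hypothetical. The sketch also assumes that the filter is applied before the strided selection, which is one possible reading of step 5.

```python
import sqlite3

def strided_partitions(conn, n, filter_value):
    """Join, count, and split the filtered result set into n strided partitions."""
    # Step 1: join the main table and the join table on their join columns.
    joined = """
        SELECT a.id, a.category, b.detail
        FROM table_a AS a JOIN table_b AS b ON a.id = b.a_id
    """
    # Step 2: T = total number of rows in the joined result set.
    total = conn.execute(f"SELECT COUNT(*) FROM ({joined})").fetchone()[0]
    # Steps 3-4: s = T / n, the nominal partition size (integer division here).
    size = total // n
    # Filter criteria: filter column = filter value, bound as a query parameter.
    rows = conn.execute(
        f"SELECT * FROM ({joined}) WHERE category = ? ORDER BY id", (filter_value,)
    ).fetchall()
    # Steps 5-6: partition i holds every nth row starting from row i (i = 0 .. n-1).
    return total, size, [rows[i::n] for i in range(n)]

# Hypothetical sample data: 12 rows, every fourth one in category "y".
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE table_b (a_id INTEGER, detail TEXT);
""")
conn.executemany("INSERT INTO table_a VALUES (?, ?)",
                 [(i, "x" if i % 4 else "y") for i in range(1, 13)])
conn.executemany("INSERT INTO table_b VALUES (?, ?)",
                 [(i, f"d{i}") for i in range(1, 13)])

total, size, parts = strided_partitions(conn, n=3, filter_value="x")
for number, part in enumerate(parts, start=1):
    print(f"partition {number}: ids {[row[0] for row in part]}")
```

Because the partitions are taken with a stride of n, every partition is spread over the whole result set, so each one is a systematic sample rather than a contiguous block of rows.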

5. FLOWCHART

6. STEPS OF ALGORITHM

STEP 1: Install the software by clicking the install application icon. When installation is
complete, the welcome page appears on screen. The connection URL for the database is built
from the database name, username, and password.

Figure 4 Initial Database Settings

STEP 2: Next, set the initial properties of the process, which take a few inputs from the
user. The main table is entered as table A. A checkbox indicates whether a joined data set is
needed for this procedure. If a join is required, more input is taken from the user: the join
table as table B, the join column of table A as column A, and the join column of table B as
column B. All of these settings are populated using the connection string provided in the
previous step. The application queries the system tables to list all existing tables with
their relationship fields, from which the initial settings for this process can be chosen.
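How the list of tables and their relationship fields might be read from the system catalogue is sketched below for SQLite. The paper does not name the database engine, so the use of `sqlite_master` and `PRAGMA foreign_key_list`, as well as the example tables `customers` and `orders`, are assumptions for illustration only. Other engines expose the same information through different catalogue views.

```python
import sqlite3

def list_tables_with_relations(conn):
    """Map each table to its (column, referenced table, referenced column) links."""
    names = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    relations = {}
    for name in names:
        # Result columns of foreign_key_list: id, seq, table, from, to, ...
        fks = conn.execute(f'PRAGMA foreign_key_list("{name}")').fetchall()
        relations[name] = [(fk[3], fk[2], fk[4]) for fk in fks]
    return relations

# Hypothetical schema with one foreign-key relationship.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id)
    );
""")
catalog = list_tables_with_relations(conn)
print(catalog)
```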

STEP 3: The next step shows the row count of the result set that was created. This count is
used when applying the request values for the sampling, partitioning, and clustering steps.
The result set is stored in a temporary table, and the random sampling request values are
provided: the number of clusters (n), the filter value as Value, and the filter column as
column C. A list of conditional operators, with entries such as EQUAL, IN, and BETWEEN, can be
used to filter the data further. Once all settings for the random sampling procedure have been
provided, click Next to check the partitioning results.
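The filter criteria built from the filter column, the conditional operator, and the filter value can be sketched as a small helper that returns a parameterised SQL fragment. The function name `build_filter`, the `?` placeholder style (as used by Python's `sqlite3` module), and the example columns are assumptions. Only the three operators named above are handled.

```python
def build_filter(column, operator, values):
    """Return a (SQL fragment, parameters) pair for one filter condition."""
    # Column names cannot be bound as parameters, so accept plain identifiers only.
    if not column.isidentifier():
        raise ValueError(f"unsafe column name: {column!r}")
    if operator == "EQUAL":
        (value,) = values  # exactly one value expected
        return f"{column} = ?", [value]
    if operator == "IN":
        if not values:
            raise ValueError("IN needs at least one value")
        marks = ", ".join("?" for _ in values)
        return f"{column} IN ({marks})", list(values)
    if operator == "BETWEEN":
        low, high = values  # exactly two values expected
        return f"{column} BETWEEN ? AND ?", [low, high]
    raise ValueError(f"unsupported operator: {operator}")

# Hypothetical filter settings.
print(build_filter("city", "EQUAL", ["Delhi"]))
print(build_filter("city", "IN", ["Delhi", "Agra"]))
print(build_filter("age", "BETWEEN", [18, 30]))
```

Binding the filter values as parameters instead of concatenating them into the query text keeps user input out of the SQL string itself.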

Figure 5 Setup Table Name

Figure 6 Setup Cure Properties

STEP 4: Step 4 shows the different partitions, each consisting of every nth value of the
result set starting from row number i, where i ranges from 0 to n-1. The result set is grouped
by the value of column C provided in the previous steps into different partitions such as
partition 1, partition 2, partition 3, etc. This shows that the clusters are separated from
the outliers. The values filtered according to the specified criteria can also be seen.
See Fig.

Figure 7 Partition 1 results

Figure 8 Partition 2 results

STEP 5: In this step, after all partitions have been analysed and processed, they are merged
into a single table as the result. To perform more clustering on the same database instance,
return to the previous setup page as the next step and start a further clustering process.

Figure 9 Final Result Set

STEP 6: Steps 2 to 5 can be repeated if more clustering is required on the same database
instance. New initial and CURE settings can be provided, and the data processed with the new
settings to obtain more relevant and better clustering results.

Figure 10 Setup Table and Join Table Name

Figure 11 Setup Join Clusters and CURE properties

Figure 12 Join Partition 1 Results

Figure 13 Join Partition 2 Results

Figure 14 Join Partition 3 Results

Figure 15 Join Final Results

STEP 7: Once analysis and processing are complete, the process can be finished by clicking the
Finish button on the Merge Results page.

Figure 16 Filtered Result based on parameter

7. CONCLUSION

In this paper we saw that the CURE clustering algorithm can identify clusters with
non-spherical shapes and wide variance in size. By using random sampling and partitioning,
CURE provides better execution time on large databases than other algorithms. CURE also works
well when the data contains outliers: the outliers are detected first and then eliminated.
Each step is important for achieving efficiency, scalability, and improved concurrency. It can
therefore be concluded that the CURE algorithm is suitable for handling voluminous data.

8. FUTURE SCOPE

In future, parallel programming can be introduced into the CURE algorithm to obtain results
with greater accuracy in much less time. During random sampling, the CURE algorithm breaks the
result set into several partitions. As an enhancement to the CURE hierarchical algorithm,
these partitions can be processed in a parallel, multi-threaded environment. This would
improve the performance of CURE and make it more efficient than other hierarchical algorithms.
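The proposed enhancement can be sketched with a thread pool that processes each partition independently and then merges the partial results. This is only an illustration of the idea: `cluster_partition` is a placeholder that groups rows by their first field instead of running CURE, and the function names, the sample rows, and the worker count are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def cluster_partition(partition):
    """Placeholder for per-partition clustering: group rows by their first field."""
    groups = {}
    for row in partition:
        groups.setdefault(row[0], []).append(row)
    return groups

def process_in_parallel(partitions, max_workers=4):
    """Cluster every partition in a worker thread, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input partitions.
        partials = list(pool.map(cluster_partition, partitions))
    merged = {}
    for partial in partials:
        for key, rows in partial.items():
            merged.setdefault(key, []).extend(rows)
    return merged

# Hypothetical result set, split into n strided partitions as in the algorithm.
rows = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5), ("c", 6)]
n = 3
partitions = [rows[i::n] for i in range(n)]
merged = process_in_parallel(partitions)
print(merged)
```

In CPython, threads mainly help when the per-partition work waits on I/O such as database queries. For CPU-bound clustering, a process pool (`concurrent.futures.ProcessPoolExecutor`) would be the closer equivalent.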

REFERENCES

[1] Smita, Priti Sharma, "Use of Data Mining in Various Field: A Survey Paper", May-Jun. 2014.

[2] Megha Mandloi, "A Survey on Clustering Algorithms and K-Means", July 2014.

[3] G. Thilagavathi, D. Srivaishnavi, N. Aparna, "A Survey on Efficient Hierarchical Algorithm
used in Clustering", IJERT, 2013.

[4] Marjan Kuchaki Rafsanjani, Zahra Asghari Varzaneh, Nasibeh Emami Chukanlo, "A survey of
Hierarchical clustering algorithms", The Journal of Mathematics and Computer Science, 2012.

[5] Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim, "CURE: An Efficient Clustering Algorithm
for Large Databases", In Proc. of the 1998 ACM SIGMOD Intl. Conf. on Management of Data, 1998,
pp. 73-84.

[6] Qian Yuntao, Shi Qingsong, Wang Qi, "CURENS: A Hierarchical Clustering Algorithm with New
Shrinking Scheme", ICMLC'2002, Beijing, Nov. 4-5, 2002, pp. 895-899.

[7] G. Adomavicius, J. Bockstedt, and V. Parimi, "Scalable Temporal Clustering for Massive
Multidimensional Data Streams", Proceedings of the 18th Workshop on Information Technology and
Systems (WITS'08), Paris, France, December 2008.

[8] M. Kaya, R. Alhajj, "Genetic Algorithm Based Framework for Mining Fuzzy Association Rules",
Fuzzy Sets and Systems, 152 (3), 2005, pp. 587-601.

[9] Srinivasan Parthasarathy, Mohammed J. Zaki, Mitsunori Ogihara, and Sandhya Dwarkadas,
"Incremental and Interactive Sequence Mining", Proc. of the 8th ACM International Conference on
Information and Knowledge Management, Nov. 1999.

[10] Seema Maitrey, C.K. Jha, Rajat Gupta & Jaiveer Singh, "Enhancement of CURE Clustering
Technique in Data Mining", Proceedings in International Journal of Computer Application,
Published by Foundation of Computer Science, New York, USA, April 2012.