Application of Cluster Analysis Submitted by: Rohit Kansal 10103543, CSE

Transcript of Rohit 10103543

Page 1: Rohit 10103543

Application of Cluster Analysis

Submitted by: Rohit Kansal 10103543, CSE

Page 2: Rohit 10103543

A loose definition of clustering could be: the process of organising objects into groups whose members are similar in some way. The task is to group a set of objects so that objects in the same group are more similar to each other than to objects in other groups.

The goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. Thus, cluster analysis is sometimes referred to as “unsupervised classification” and is distinct from “supervised classification”, or more commonly just “classification”.

Introduction

Page 3: Rohit 10103543

Hierarchical clustering- It is based on the core idea that objects are more related to nearby objects than to objects farther away.

Centroid-based clustering- In this method clusters are represented by a central vector, which may not necessarily be a member of the data set.

Distribution-based clustering- The clustering model most closely related to statistics is based on distribution models. Clusters are defined as sets of objects that most likely belong to the same distribution.

Density-based clustering- Clusters are defined as areas of higher density than the remainder of the data set. Objects in the sparse areas that separate clusters are usually considered to be noise or border points.
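As an illustration of the centroid-based idea above, here is a minimal sketch of k-means (Lloyd's algorithm) using only the Python standard library. It is not the algorithm used in this project; the function name `kmeans`, its parameters, and the sample points are assumptions made for the example. Note that the returned centroids are means, so they need not be members of the data set.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Basic k-means on a list of equal-length numeric tuples (illustrative only)."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct points drawn at random.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid
        if updated == centroids:
            break  # converged: centroids stopped moving
        centroids = updated
    return centroids, labels


sample = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(sample, k=2)
print(centroids, labels)
```

On this sample the two centroids settle near the middle of each group of three points, at coordinates that do not appear in the data.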

Clustering Techniques

Page 4: Rohit 10103543

Market Research: Market researchers use cluster analysis to partition the general population of consumers into market segments.

Social network analysis: In the study of social networks, clustering may be used to recognize communities within large groups of people.

Computer Science: Clustering is useful in software evolution as it helps to reduce legacy properties in code by reforming functionality that has become dispersed.

Applications of Clustering

Page 5: Rohit 10103543

Contour tracing is used to extract boundaries: the border pixels of a pattern are extracted. Contour tracing is one of many pre-processing techniques performed on digital images in order to extract information about their general shape.

Contour detection is used because contour pixels are generally a small subset of the total number of pixels representing a pattern. Thus, the amount of computation is reduced when a feature extraction algorithm is run on the contour instead of on the whole pattern. Also, the contour shares many features with the original pattern; hence, the feature extraction process becomes much more efficient.

Contour Detection

Page 6: Rohit 10103543

Clustering Algorithm

Moore’s Neighbour

The Moore neighbourhood of a pixel, P, is the set of 8 pixels which share a vertex or edge with that pixel.
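The definition above can be sketched in a few lines of Python. The names `MOORE_OFFSETS` and `moore_neighbourhood` are hypothetical, and the sketch assumes pixels are addressed as (row, column) pairs on a grid of known size, with neighbours outside the grid simply dropped.

```python
# Offsets (row, col) of the 8 cells sharing an edge or a vertex with a pixel.
MOORE_OFFSETS = [
    (-1, -1), (-1, 0), (-1, 1),
    (0, -1),           (0, 1),
    (1, -1),  (1, 0),  (1, 1),
]


def moore_neighbourhood(row, col, n_rows, n_cols):
    """Return the Moore neighbours of (row, col) that lie inside the grid."""
    return [
        (row + dr, col + dc)
        for dr, dc in MOORE_OFFSETS
        if 0 <= row + dr < n_rows and 0 <= col + dc < n_cols
    ]


print(len(moore_neighbourhood(1, 1, 3, 3)))  # interior pixel
print(len(moore_neighbourhood(0, 0, 3, 3)))  # corner pixel
```

An interior pixel has all 8 neighbours, while pixels on the edge or corner of the grid have fewer because part of their neighbourhood falls outside it.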

Square Tracing

Given a digital pattern, locate a black pixel and declare it as your "start" pixel. Locating a "start" pixel can be done in a number of ways; we'll start at the bottom left corner of the grid and scan each column of pixels from the bottom going upwards, starting from the leftmost column and proceeding to the right, until we encounter a black pixel.
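The scan for the "start" pixel described above might look like the following sketch. It assumes the pattern is stored as a list of rows with row 0 at the top, and that black pixels are encoded as 1 and white pixels as 0; both the encoding and the name `find_start_pixel` are assumptions for illustration.

```python
def find_start_pixel(grid):
    """Scan columns left to right, each from the bottom up, for the first black pixel."""
    if not grid:
        return None
    n_rows, n_cols = len(grid), len(grid[0])
    for col in range(n_cols):                   # leftmost column first
        for row in range(n_rows - 1, -1, -1):   # bottom row first
            if grid[row][col] == 1:             # assumption: 1 encodes black
                return (row, col)
    return None  # the pattern contains no black pixel


pattern = [
    [0, 0, 1],
    [0, 1, 0],
    [0, 1, 0],
]
print(find_start_pixel(pattern))
```

Because of the scan order, the cell directly below the returned pixel is always white or outside the grid, which is what tracing algorithms rely on when they begin.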

Page 7: Rohit 10103543

After reading books, research papers on clustering and its applications, and reference material, I gathered that although clustering is widely used in many fields, including contour detection, to turn a data set into a more understandable one by removing noise and grouping useful information, it still has many drawbacks. These include the choice of an effective clustering technique, the selection of the data set, the number of clusters, and the validation of results. In market segmentation especially, result validation is often neglected, and when it is done the procedure is usually ambiguous.

The clustering techniques used are very sensitive to the selection of the data set, the number of clusters, the size of the data set, etc. Different techniques vary accordingly in speed, time and space complexity, and accuracy of the final clusters.

Though many algorithms and methods exist for contour extraction, these methods still lack efficiency. Also, they are not universal solutions; they need to be customized for each new data set. In addition, a better clustering method that can be used with contour detection does not exist.

Literature Survey

Page 8: Rohit 10103543

Data Selection- Selecting the appropriate variables used in the clustering process is one of the most fundamental steps because the inclusion of irrelevant variables may distort and render useless an otherwise useful segmentation solution.

Clustering algorithm selection- Cluster analysis (CA) encompasses a number of different algorithms and methods for grouping objects or subjects. The increasing number of CA methods available, combined with their specific properties, has led some researchers to consider the bewildering problem of selecting the best method in some sense. Because each technique is different and has specific properties that lead to different segmentation solutions, it is very important to carefully select the algorithm that will be used.

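Inefficiency of contour extraction algorithm- In the original description of the algorithm used in Moore-Neighbour tracing, the stopping criterion is visiting the start pixel for a second time. With this criterion the trace can stop before the whole contour has been covered for some patterns.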
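A minimal sketch of Moore-Neighbour tracing with the stopping criterion mentioned above (stop when the start pixel is reached a second time) is given below. It assumes a grid of rows with row 0 at the top, black encoded as 1, and a clockwise search around the current pixel; the helper names are hypothetical and this is not necessarily the implementation used in the project.

```python
# Clockwise order of the 8 Moore neighbours as (row, col) offsets, with rows
# growing downwards: W, NW, N, NE, E, SE, S, SW.
CLOCKWISE = [(0, -1), (-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1)]


def is_black(grid, cell):
    row, col = cell
    return 0 <= row < len(grid) and 0 <= col < len(grid[0]) and grid[row][col] == 1


def find_start_pixel(grid):
    """Scan columns left to right, each from the bottom up."""
    if not grid:
        return None
    for col in range(len(grid[0])):
        for row in range(len(grid) - 1, -1, -1):
            if grid[row][col] == 1:
                return (row, col)
    return None


def moore_trace(grid):
    """Return boundary pixels in visiting order, stopping on re-entering the start."""
    start = find_start_pixel(grid)
    if start is None:
        return []
    boundary = [start]
    current = start
    # The start pixel was entered from the cell below it, which is white.
    backtrack = (start[0] + 1, start[1])
    max_steps = 8 * len(grid) * len(grid[0])  # safety bound for the sketch
    for _ in range(max_steps):
        offset = (backtrack[0] - current[0], backtrack[1] - current[1])
        first = CLOCKWISE.index(offset)
        found = None
        previous = backtrack  # last white cell seen before a black one
        for step in range(1, 8):
            dr, dc = CLOCKWISE[(first + step) % 8]
            candidate = (current[0] + dr, current[1] + dc)
            if is_black(grid, candidate):
                found = candidate
                break
            previous = candidate
        # Stop on an isolated pixel, or when the start pixel is visited again.
        if found is None or found == start:
            return boundary
        boundary.append(found)
        backtrack, current = previous, found
    return boundary


square = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
print(moore_trace(square))
```

A stricter criterion, such as requiring the start pixel to be re-entered from the same direction as the first time, could be substituted in the `found == start` check.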

Practical problems in cluster analysis and contour detection

Page 9: Rohit 10103543

The basic scenario is as follows: extract the coordinates of regions from a 2D grid. The value in each cell is the intensity of the area represented by that cell. If this value is zero, the cell represents an empty area. Each connected set of cells with the same intensity represents a region of that intensity. A region can have holes; that is, the interior of a region can contain cells of another intensity or of intensity zero. So, the problem is to extract each such region together with its set of hole cycles.
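The first half of this problem, grouping connected cells of equal intensity into regions, can be sketched with a breadth-first flood fill. The sketch assumes 4-connectivity (cells sharing an edge), which the problem statement does not specify, and it does not attempt to recover hole cycles; `extract_regions` is a hypothetical name.

```python
from collections import deque


def extract_regions(grid):
    """Return a list of (intensity, sorted cell coordinates) for each region."""
    n_rows = len(grid)
    n_cols = len(grid[0]) if grid else 0
    seen = [[False] * n_cols for _ in range(n_rows)]
    regions = []
    for r in range(n_rows):
        for c in range(n_cols):
            if seen[r][c] or grid[r][c] == 0:   # zero means empty area
                continue
            intensity = grid[r][c]
            cells = []
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                cr, cc = queue.popleft()
                cells.append((cr, cc))
                # Assumption: 4-connectivity (up, down, left, right).
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nr, nc = cr + dr, cc + dc
                    if (0 <= nr < n_rows and 0 <= nc < n_cols
                            and not seen[nr][nc] and grid[nr][nc] == intensity):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            regions.append((intensity, sorted(cells)))
    return regions


example = [
    [1, 1, 0],
    [1, 0, 2],
    [0, 2, 2],
]
for intensity, cells in extract_regions(example):
    print(intensity, cells)
```

Each region's boundary and holes could then be traced separately, for instance by running a contour tracing routine on a mask of that region.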

 

Many approaches are available for the study of data; these include representing the data in a well-defined form, reducing noise, etc. Although various methods have been developed for these purposes, some complications remain. Sometimes these methods cannot be applied to all kinds of data sets, such as data sets with varying noise, dimensions, or variables.

Problem Statement

Page 10: Rohit 10103543

In contour detection, cluster analysis is used to study and organize the data obtained from a survey. In this case, however, the clustering algorithm is applied to all the objects of the data set, including objects that do not belong to any cluster. The work dealt with a test data set and a data set downloaded from the UCI repository. Satisfactory results were obtained with the test data set; however, the coordinates from the contour detection data set show ambiguity. One advantage of the proposed algorithm is that it is effectively applicable to large data sets with few dimensions. The validation of clusters is also done effectively. This makes the method highly robust against possible attacks. Further attacks, such as clustering high-dimensional data sets, can be carried out.

Conclusion

Page 11: Rohit 10103543

Experimentation with varied data sets and different algorithms will enable a better understanding of the proposed clustering scheme.

 

In contour detection cluster analysis, I applied the clustering algorithm to the test data set, forming clusters with distance as the similarity measure. Variations of this approach can be considered; for example, any other existing method can be used instead of my algorithm.

 

The clustering technique for the test data set was extended to test the validation and stability of clusters. Various types of attacks can be carried out to test the robustness of the scheme.

 

Also, besides the proposed clustering method, another method can be used to carry out the clustering, contour extraction and validation effectively.

Future Work

Page 12: Rohit 10103543

Difficulty in comparing the quality of the clusters produced (e.g. different initial partitions or values of K affect the outcome).

The fixed number of clusters makes it difficult to predict what K should be.

 

Different initial partitions can result in different final clusters. It is helpful to rerun the program using the same as well as different K values, to compare the results achieved.
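One way to make such reruns comparable is to record a single quality score per run, for example the within-cluster sum of squared distances (WCSS). The sketch below assumes a basic k-means as the clustering program, which may differ from the one used here; `kmeans_wcss`, the seeds, and the sample points are hypothetical.

```python
import math
import random


def kmeans_wcss(points, k, seed):
    """Run a basic k-means from a seeded random start; return (labels, WCSS)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # the seed fixes the initial partition

    def assign(cents):
        return [min(range(k), key=lambda c: math.dist(p, cents[c])) for p in points]

    for _ in range(100):
        labels = assign(centroids)
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                updated.append(centroids[c])
        if updated == centroids:
            break
        centroids = updated
    labels = assign(centroids)
    wcss = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, wcss


sample = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
for k in (1, 2, 3):
    for seed in range(3):
        _, wcss = kmeans_wcss(sample, k, seed)
        print(f"K={k} seed={seed} WCSS={wcss:.3f}")
```

Runs with the same K but different seeds show how much the outcome depends on the initial partition, while runs across K values show where adding clusters stops reducing the score substantially.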

  

Euclidean distance measures can unequally weight underlying factors. If two clusters overlap heavily, the algorithm will not be able to resolve that there are two clusters.
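The unequal weighting can be seen when one variable is measured on a much larger scale than another. The sketch below uses made-up (age, income) rows and z-score standardisation as one common remedy; the data, the column meanings, and the function name `standardise` are assumptions for illustration only.

```python
import math
import statistics


def standardise(rows):
    """Rescale every column to mean 0 and population standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Fall back to 1.0 for a constant column to avoid dividing by zero.
    stdevs = [statistics.pstdev(col) or 1.0 for col in columns]
    return [
        tuple((v - m) / s for v, m, s in zip(row, means, stdevs))
        for row in rows
    ]


# Hypothetical rows: (age in years, income in currency units).
rows = [(25, 30000), (60, 30500), (27, 60000)]

raw_ab = math.dist(rows[0], rows[1])   # large age gap, small income gap
raw_ac = math.dist(rows[0], rows[2])   # small age gap, large income gap

scaled = standardise(rows)
scaled_ab = math.dist(scaled[0], scaled[1])
scaled_ac = math.dist(scaled[0], scaled[2])

print(raw_ab, raw_ac)
print(scaled_ab, scaled_ac)
```

On the raw values the income column dominates the distance almost entirely; after standardisation both columns contribute on a comparable scale.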

 

The output file generated may contain mixed coordinates of holes and pixels of different intensities.

 

Limitations

Page 13: Rohit 10103543

Thank You