EFFECTIVE TERM BASED TEXT CLUSTERING ALGORITHMS


NIBAS P.P

EPAHECS033

Government Engineering College

Sreekrishnapuram

Palakkad

November 25, 2010

CONTENTS

INTRODUCTION

REQUIREMENT OF INFORMATION RETRIEVAL

DOCUMENT PREPROCESSING

TEXT CLUSTERING ATTRIBUTES SELECTION

PROBLEM DEFINITION

FTC (Frequent Term-based Clustering)

CLUSTERING ALGORITHMS

APPLICATION

CONCLUSION

REFERENCES



INTRODUCTION

In every industry, almost all paper documents have electronic copies. This is because the electronic format provides:

a) safer storage

b) smaller size

c) quick access to documents

Text clustering methods can be used to group large sets of text documents.

Document clustering is the automatic organization of documents into clusters or groups. Grouping is based on the principle of maximizing intra-cluster similarity and minimizing inter-cluster similarity.



REQUIREMENT OF INFORMATION RETRIEVAL

To improve information retrieval results through document clustering, the following requirements are stated:

The document model should preserve the sequential relationship between words in the document.

Associating a meaningful label with each final cluster is essential.

Overlapping between document clusters should be allowed.

The high dimensionality of text documents should be reduced.



DOCUMENT PREPROCESSING

All text clustering methods require several preprocessing steps.

Non-textual information such as HTML tags and punctuation is removed from the documents.

The context of a document is mostly represented by its nouns.



Contd...

Based on this, the following assumptions were made to achieve document dimension reduction:

Elimination of words with fewer than 3 characters.

Elimination of general words.

Elimination of adverbs and adjectives.

Elimination of verbs.

To achieve frequent term generation:

For a small document, each line is treated as a record.

For a large document, each paragraph is treated as a record.
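The filtering and record-splitting rules above can be sketched in Python. This is a minimal illustration under stated assumptions, not the presenter's implementation: the `GENERAL_WORDS` list, the function names, and the `small_limit` cut-off between small and large documents are hypothetical. Removing adverbs, adjectives and verbs would require a part-of-speech tagger and is left out.

```python
import re

# Hypothetical list of "general words"; the slides do not give the actual list.
GENERAL_WORDS = {"the", "and", "for", "are", "with", "this", "that", "from"}


def preprocess(text):
    """Return the candidate terms of one record after simple filtering."""
    text = re.sub(r"<[^>]+>", " ", text)       # drop HTML tags
    text = re.sub(r"[^A-Za-z\s]", " ", text)   # drop punctuation and digits
    words = text.lower().split()
    # keep words with at least 3 characters that are not general words
    return [w for w in words if len(w) >= 3 and w not in GENERAL_WORDS]


def split_records(document, small_limit=20):
    """Split a document into records: lines if small, paragraphs if large.

    small_limit (number of non-empty lines) is an assumed cut-off.
    """
    lines = [ln for ln in document.splitlines() if ln.strip()]
    if len(lines) <= small_limit:
        return lines
    return [p for p in re.split(r"\n\s*\n", document) if p.strip()]


terms = preprocess("<p>The cat is on the mat, purring loudly!</p>")
print(terms)
```

Each record returned by `split_records` would then be passed through `preprocess` before frequent terms are counted.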



TEXT CLUSTERING ATTRIBUTES SELECTION

Text clustering is performed in two stages:

Frequent term set generation.

Grouping of frequent term documents.

Frequent term set generation is characterised by the attribute minimum support threshold.

Grouping of frequent term documents is characterised by the attribute matching threshold.



Contd...

Minimum Support Threshold

The document database is reduced based on a value called the minimum support threshold.

If the minimum support threshold is low, the dimension reduction is small. To get a greater reduction in size, the minimum support value should be high.

Matching Threshold

The grouping of documents is carried out by finding the match of frequent terms between the documents, which is measured against a value called the matching threshold.

Matching is the ratio of the number of common terms between documents to the total number of terms.

For a low matching threshold, more documents are grouped together; for a high matching threshold, fewer documents are grouped together.
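The effect of the minimum support threshold can be illustrated with a short Python sketch. It assumes that support is the fraction of records containing a term and, for brevity, handles single terms only rather than term sets; the function name and the sample records are hypothetical.

```python
from collections import Counter


def frequent_terms(records, min_support):
    """Return terms whose support (fraction of records containing them)
    is at least min_support. Support as a fraction is an assumption."""
    counts = Counter()
    for terms in records:
        counts.update(set(terms))  # count each term once per record
    n = len(records)
    return {t for t, c in counts.items() if c / n >= min_support}


# Made-up records, already preprocessed into term lists.
records = [
    ["cluster", "text", "term"],
    ["cluster", "document"],
    ["cluster", "text"],
    ["retrieval", "document"],
]

low = frequent_terms(records, 0.25)   # low threshold: little reduction
high = frequent_terms(records, 0.75)  # high threshold: strong reduction
print(len(low), len(high))
```

With the low threshold every term survives, while the high threshold keeps only the term that appears in at least three of the four records.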



PROBLEM DEFINITION

Let D = {d1, d2, d3, . . . , dn} be the set of text documents.

Let T be the set of all terms occurring in the documents of D.

Let d1 = {t11, t12, . . . , t1m} and d2 = {t21, t22, . . . , t2m} be the sets of frequent terms in documents d1 and d2.

Let F = {f1, f2, . . . , fk} be the set of all frequent term sets in D with respect to min-support, where min-support is a real number.

The cover of each element fi of F, that is, the set of documents containing all the terms of fi, can be regarded as a cluster.



Contd...

Let the clustering of D into m sets be defined as R = {C1, C2, C3, . . . , Cm} such that each cluster Ci contains at least one document: Ci ≠ NULL, i = 1, . . . , m.
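As a small illustration of how the cover of a frequent term set yields a candidate cluster, consider the following Python sketch. The function name, document ids and term sets are made up for the example.

```python
def cover(term_set, documents):
    """Return the ids of documents that contain every term in term_set."""
    return {doc_id for doc_id, terms in documents.items()
            if term_set <= terms}


# Hypothetical document database D: id -> set of terms.
documents = {
    "d1": {"cluster", "text", "term"},
    "d2": {"cluster", "document"},
    "d3": {"cluster", "text"},
}

# Hypothetical frequent term sets F.
frequent_term_sets = [frozenset({"cluster"}), frozenset({"cluster", "text"})]

# Each cover is a candidate cluster; covers may overlap.
candidate_clusters = {f: cover(f, documents) for f in frequent_term_sets}
for f, docs in candidate_clusters.items():
    print(sorted(f), sorted(docs))
```

Here d1 and d3 appear in both covers, which shows why overlapping has to be handled when clusters are selected.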



FTC (Frequent Term-based Clustering)

Text clustering poses problems such as:

Very high dimensionality of the data.

Understandability of the clustering descriptions.

To address these, a frequent term-based approach to clustering has been introduced.

Frequent Term-based Clustering (FTC) is a text clustering technique that uses frequent term sets and dramatically decreases the dimensionality of the document vector space.



CLUSTERING ALGORITHMS

Algorithms for effective text clustering are:

1. Min-match cluster algorithm

2. Max-match cluster algorithm

3. Min-Max match cluster algorithm



Min-match Cluster Algorithm

Let A and B be two frequent term sets of documents d1 and d2 represented as vectors.

The matching, denoted min(Vm), is defined as the ratio of the number of common elements between vectors A and B to the number of elements in the smaller of the two sets.

Example
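As an illustration of min(Vm), here is a minimal Python sketch. The function name `min_match` and the two term lists are made up for this illustration and do not come from the original slide; both sets are assumed to be non-empty.

```python
def min_match(a, b):
    """Common terms divided by the size of the smaller set, as a percentage."""
    set_a, set_b = set(a), set(b)
    common = len(set_a & set_b)
    return common * 100 / min(len(set_a), len(set_b))


A = ["cluster", "document", "term", "text"]
B = ["cluster", "text"]

# 2 common terms, smaller set has 2 elements
print(min_match(A, B))  # 100.0
```

Because the denominator is the smaller set, a short term set fully contained in a longer one always scores 100%.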



Algorithm

D: document database
FTL: frequent term list
CL: cluster list
FT: frequent terms

Min-Cluster(CL, FTL, D)

1. For each FT i in FTL do
2. t1 = ith index frequent terms
3. Initialise high_percent_matching = -1 and cluster_index = -1
4. For each FT j in FTL do
5. if (i ≠ j) then t2 = jth index frequent terms
6. if (t1.length < t2.length) then total_terms = t1.length
7. Else total_terms = t2.length End if
8. match = Calculate matching terms between vector i and j using Binary Search
9. matching_percent = match * 100 / total_terms
10. if (matching_percent > matching_threshold) and (high_percent_matching < matching_percent) then high_percent_matching = matching_percent and cluster_index = j
11. End if
12. End if
13. Next loop (j)
14. if (cluster_index ≠ -1) then
15. Add frequent_term_list(cluster_index) to frequent_term_list(i)
16. Add cluster_list(cluster_index) to cluster_list(i)
17. Remove cluster_list(cluster_index) from cluster_list
18. Remove frequent_term_list(cluster_index) from frequent_term_list
19. End if
20. Next loop (i)



Contd...

In this algorithm:

Step 2 selects a vector as the comparable vector.

Steps 5 to 7 find the smaller of the two input vectors selected in steps 2 and 5 and assign its length as the minimum vector count.

In step 8, the matching terms between the two vectors are counted using binary search.

In step 9, the matching percentage between the vectors is calculated using the minimum vector count.

In step 10, if the current vector matches better than the highest match found so far, it is selected and the highest match vector is updated.

Steps 5 to 11 are repeated until the comparable vector has been compared with all the remaining vectors.



Contd...

In steps 15 and 16, if a highest match vector is found, then:

a) Its frequent terms are added to the terms of the comparable vector selected in step 2 (step 15).

b) The highest match cluster is added to the comparable cluster (step 16).

In steps 17 and 18:

a) The highest match cluster is removed from the cluster list (step 17).

b) The highest match cluster terms are removed from the frequent term list (step 18).
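The steps described above can be sketched in Python as follows. This is one possible reading of Min-Cluster, not the authors' code. It assumes that every term list is sorted and non-empty, that "adding" term lists means taking their sorted union, and that the loop index is adjusted when a cluster before position i is removed; all function and variable names are hypothetical.

```python
from bisect import bisect_left


def count_common(t1, t2):
    """Count terms of t1 that occur in the sorted list t2 (binary search)."""
    match = 0
    for term in t1:
        k = bisect_left(t2, term)
        if k < len(t2) and t2[k] == term:
            match += 1
    return match


def min_cluster(cluster_list, term_list, matching_threshold):
    """One greedy merging pass in the spirit of Min-Cluster(CL, FTL, D).

    cluster_list[i] holds document names, term_list[i] the sorted
    frequent terms of cluster i. Both lists are modified in place.
    """
    i = 0
    while i < len(term_list):
        t1 = term_list[i]
        high_percent, cluster_index = -1, -1
        for j, t2 in enumerate(term_list):
            if i == j:
                continue
            total_terms = min(len(t1), len(t2))       # minimum vector count
            percent = count_common(t1, t2) * 100 / total_terms
            if percent > matching_threshold and percent > high_percent:
                high_percent, cluster_index = percent, j
        if cluster_index != -1:
            # merge the best match into cluster i, then remove it
            term_list[i] = sorted(set(t1) | set(term_list[cluster_index]))
            cluster_list[i] = cluster_list[i] + cluster_list[cluster_index]
            del term_list[cluster_index]
            del cluster_list[cluster_index]
            if cluster_index < i:
                i -= 1  # cluster i shifted left after the deletion
        i += 1
    return cluster_list, term_list


# Made-up input: three single-document clusters.
terms = [["cluster", "text"], ["cluster", "document", "text"], ["image", "pixel"]]
clusters = [["d1"], ["d2"], ["d3"]]
print(min_cluster(clusters, terms, matching_threshold=50))
```

In this run d1 and d2 are merged because all of d1's terms occur in d2, while d3 shares no terms with either and stays on its own.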



Max-match cluster algorithm

Let A and B be two frequent term sets of documents d1 and d2 represented as vectors.

The matching, denoted max(Vm), is defined as the ratio of the number of common elements between vectors A and B to the number of elements in the larger of the two sets.

Example
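As an illustration of max(Vm), here is a minimal Python sketch. The function name `max_match` and the two term lists are made up for this illustration and do not come from the original slide; both sets are assumed to be non-empty.

```python
def max_match(a, b):
    """Common terms divided by the size of the larger set, as a percentage."""
    set_a, set_b = set(a), set(b)
    common = len(set_a & set_b)
    return common * 100 / max(len(set_a), len(set_b))


A = ["cluster", "document", "term", "text"]
B = ["cluster", "text"]

# 2 common terms, larger set has 4 elements
print(max_match(A, B))  # 50.0
```

Dividing by the larger set makes the measure stricter: a short term set contained in a much longer one no longer scores 100%.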



Algorithm

D: document database
FTL: frequent term list
CL: cluster list
FT: frequent terms

Max-Cluster(CL, FTL, D)

1. For each FT i in FTL do
2. t1 = ith index frequent terms
3. Initialise high_percent_matching = -1 and cluster_index = -1
4. For each FT j in FTL do
5. if (i ≠ j) then t2 = jth index frequent terms
6. if (t1.length < t2.length) then total_terms = t2.length
7. Else total_terms = t1.length End if
8. match = Calculate matching terms between vector i and j using Binary Search
9. matching_percent = match * 100 / total_terms
10. if (matching_percent > matching_threshold) and (high_percent_matching < matching_percent) then high_percent_matching = matching_percent and cluster_index = j
11. End if
12. End if
13. Next loop (j)
14. if (cluster_index ≠ -1) then
15. Add frequent_term_list(cluster_index) to frequent_term_list(i)
16. Add cluster_list(cluster_index) to cluster_list(i)
17. Remove cluster_list(cluster_index) from cluster_list
18. Remove frequent_term_list(cluster_index) from frequent_term_list
19. End if
20. Next loop (i)



Contd..

The only difference here is that the maximum vector count of the two input vectors is used.

The rest of the steps are the same as in the previous algorithm.



Min-Max match cluster algorithm

The matching, denoted min-max(Vm), is defined as the ratio of twice the number of matching terms to the total number of elements in the two sets.

Example
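As an illustration of min-max(Vm), here is a minimal Python sketch. It assumes that "the number of elements of two sets" means the sum of the two set sizes, so that identical sets score 100%. The function name `min_max_match` and the term lists are made up for this illustration, and both sets are assumed to be non-empty.

```python
def min_max_match(a, b):
    """Twice the common terms divided by the combined size of both sets,
    as a percentage. The denominator |A| + |B| is an assumed reading."""
    set_a, set_b = set(a), set(b)
    common = len(set_a & set_b)
    return common * 2 * 100 / (len(set_a) + len(set_b))


A = ["cluster", "document", "term", "text"]
B = ["cluster", "text"]

# 2 common terms, 4 + 2 = 6 elements in total
print(round(min_max_match(A, B), 1))  # 66.7
```

For the same pair of term sets the value falls between the max-match and min-match percentages, since the denominator averages the two set sizes.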



Algorithm

D: document database
FTL: frequent term list (a set of frequent term sets)
CL: cluster list (a set of sets of input file names)
FT: frequent terms
t1, t2: frequent term sets

Min-MaxCluster(CL, FTL, D)

1. For each FT i in FTL do
2. t1 = ith index frequent terms
3. Initialise high_percent_matching = -1 and cluster_index = -1
4. For each FT j in FTL do
5. if (i ≠ j) then t2 = jth index frequent terms
6. t3 = ith FTL UNION jth FTL
7. total_terms = t3.length
8. match = Calculate matching terms between vector i and j using Binary Search
9. matching_percent = match * 2 * 100 / total_terms
10. if (matching_percent > matching_threshold) and (high_percent_matching < matching_percent) then high_percent_matching = matching_percent and cluster_index = j
11. End if
12. End if
13. Next loop (j)
14. if (cluster_index ≠ -1) then
15. Add frequent_term_list(cluster_index) to frequent_term_list(i)
16. Add cluster_list(cluster_index) to cluster_list(i)
17. Remove cluster_list(cluster_index) from cluster_list
18. Remove frequent_term_list(cluster_index) from frequent_term_list
19. End if
20. Next loop (i)



Contd...

The first difference here is that the total number of items present in both sets is considered.

The other main difference is that the numerator is multiplied by the number of vectors (two).



APPLICATION

Document clustering has wide application in areas such as:

Web mining: the process of discovering patterns from the web.

Search engines: systems designed to search for information on the World Wide Web.

Information retrieval: the science of searching for documents, for information within documents, and for metadata about documents, as well as searching relational databases and the World Wide Web.



CONCLUSION

For effective text clustering, three new clustering algorithms were proposed.

All three algorithms are compared with the standard FTC algorithm to show their competency.

The three proposed algorithms achieve better cluster quality than the FTC algorithm.



References

1 Ponmuthuramalingam P. et al., Effective Term Based Text Clustering Algorithms, (IJCSE) Vol. 02, No. 05, 2010, 1665-1673.

2 Beil F., Ester M. and Xu X., Frequent Term-based Text Clustering, Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002, 436-442.

3 Dubes R.C. and Jain A.K., Algorithms for Clustering Data, Prentice Hall, Englewood Cliffs, N.J., U.S.A., 1988.

4 Fung B.C.M., Wang K. and Ester M., Hierarchical Document Clustering using Frequent Item sets, Proceedings of SIAM International Conference on Data Mining, 2003, 180-304.



THANK YOU.
