C.Watters CS6403 1
Clustering
Clustering
• What
• Why
• How
• Results
Clustering
• Assign items to groups based on some calculation of degree of likeness between items
• Groups are not known beforehand
• Uses multivariate analysis techniques
• Feature set determination critical
Example
• News data
• Sports, world news, entertainment, etc.
• Short items, items with photos, items with names
Why
• Improve efficiency of retrieval
• Improve effectiveness of retrieval
• Ranking of retrieved results
• Visualization of results
• Kohonen maps and SOM (self-organizing maps)
• Discovery of content
• Discovery of relationships
How
• Put items into groups so that members have a high degree of association within the group
• AND items have low degree of association with items in other groups
• Association for IR documents?
• Feature set?
Feature Sets for IR Clustering
• Term occurrences
• Citations
• Names
• Structure (tags)
• Co-occurrences (thesaurus construction)
Problems
• Choosing the best feature set
• Choosing the similarity measure
• Evaluation of results
• Updates
• Searching clusters
Measures of Similarity
• Need to quantify the degree of association of an item with others
• Generally want a measure that is normalized by document vector length
• Not clear that weighted document terms are better than binary ones in clustering
General Measures
• Dice coefficient
• Jaccard Coefficient
• Cosine Coefficient
Dice Coefficient
• Binary weights
Dice(i, j) = 2C / (A + B), where C = terms in common, A = terms in i, and B = terms in j
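A minimal sketch of the Dice coefficient with binary weights, using the standard form 2C / (A + B). The function name `dice`, the sample documents, and the choice to return 0.0 for two empty documents are assumptions made for illustration, not part of the slides.

```python
def dice(terms_i, terms_j):
    """Dice coefficient for two documents given as collections of terms.

    Binary weights: a term either occurs in a document or it does not.
    """
    a, b = set(terms_i), set(terms_j)
    if not a and not b:
        return 0.0  # assumption: two empty documents count as dissimilar
    common = len(a & b)  # C = terms in common
    return 2 * common / (len(a) + len(b))  # 2C / (A + B)


# Hypothetical sample documents: C = 2, A = 3, B = 4
doc_i = ["cluster", "document", "term"]
doc_j = ["cluster", "term", "query", "rank"]
print(dice(doc_i, doc_j))
```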
Jaccard Coefficient
• Binary Weights
Jaccard(i, j) = C / (A + B - C), where C = terms in common, A = terms in i, and B = terms in j
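A minimal sketch of the Jaccard coefficient with binary weights, using the standard form C / (A + B - C), which is the size of the intersection divided by the size of the union. The function name `jaccard`, the sample documents, and the handling of empty documents are assumptions for illustration.

```python
def jaccard(terms_i, terms_j):
    """Jaccard coefficient for two documents given as collections of terms."""
    a, b = set(terms_i), set(terms_j)
    union = len(a | b)  # A + B - C
    if union == 0:
        return 0.0  # assumption: two empty documents count as dissimilar
    return len(a & b) / union  # C / (A + B - C)


# Hypothetical sample documents: C = 2, A = 3, B = 4
doc_i = ["cluster", "document", "term"]
doc_j = ["cluster", "term", "query", "rank"]
print(jaccard(doc_i, doc_j))
```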
Cosine Coefficient
• Binary weights
Cosine(i, j) = C / sqrt(A × B), where C = terms in common, A = terms in i, and B = terms in j
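A minimal sketch of the cosine coefficient for the binary-weight case, where the dot product of two 0/1 term vectors reduces to C and each vector length to the square root of its term count, giving C / sqrt(A × B). The function name `cosine_binary` and the sample documents are assumptions for illustration.

```python
from math import sqrt


def cosine_binary(terms_i, terms_j):
    """Cosine coefficient for two documents with binary term weights."""
    a, b = set(terms_i), set(terms_j)
    if not a or not b:
        return 0.0  # assumption: an empty document is similar to nothing
    return len(a & b) / sqrt(len(a) * len(b))  # C / sqrt(A * B)


# Hypothetical sample documents: C = 2, A = 3, B = 4
doc_i = ["cluster", "document", "term"]
doc_j = ["cluster", "term", "query", "rank"]
print(cosine_binary(doc_i, doc_j))
```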
Now what?
• Need to be able to compare any doc to any other doc
• Need?
11 12 13 14 15
21 22 23 24 25
31 32 33 34 35
41 42 43 44 45
51 52 53 54 55
Doc-Doc Similarity Matrix
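A small sketch of how a full doc-doc similarity matrix could be built, where entry (i, j) holds the similarity of document i to document j. The Jaccard measure is used here only as one possible choice; the names `jaccard` and `similarity_matrix` and the sample documents are assumptions for illustration.

```python
def jaccard(a, b):
    """Jaccard coefficient on two term sets (binary weights)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similarity_matrix(docs, sim=jaccard):
    """Return an N x N list-of-lists of pairwise document similarities."""
    term_sets = [set(d) for d in docs]
    n = len(term_sets)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # The measure is symmetric, so compute only the upper triangle
        # (including the diagonal) and mirror it.
        for j in range(i, n):
            matrix[i][j] = matrix[j][i] = sim(term_sets[i], term_sets[j])
    return matrix


# Hypothetical sample collection
docs = [
    ["cluster", "term"],
    ["term", "query"],
    ["photo", "sports"],
]
for row in similarity_matrix(docs):
    print(row)
```

Because the matrix is symmetric, only N(N+1)/2 of the N × N entries are actually computed.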
Generating Similarity Matrix
• Use inverted file
• Documents with no terms in common do not need similarity calculation
• Generally generate only one row at a time as needed
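The points above can be sketched as follows: an inverted file maps each term to the documents containing it, so one row of the similarity matrix can be produced by visiting only documents that share at least one term with the given document. The function names, the sample documents, and the use of the binary cosine measure are assumptions for illustration.

```python
from collections import defaultdict
from math import sqrt


def build_inverted_file(docs):
    """Map each term to the set of document ids that contain it."""
    inverted = defaultdict(set)
    for doc_id, terms in enumerate(docs):
        for term in set(terms):
            inverted[term].add(doc_id)
    return inverted


def similarity_row(doc_id, docs, inverted):
    """Return {other_id: cosine similarity} for one row of the matrix.

    Documents with no terms in common never appear in the postings
    visited here, so no calculation is spent on them.
    """
    own_terms = set(docs[doc_id])
    common = defaultdict(int)  # other_id -> number of shared terms (C)
    for term in own_terms:
        for other_id in inverted[term]:
            if other_id != doc_id:
                common[other_id] += 1
    return {
        other_id: c / sqrt(len(own_terms) * len(set(docs[other_id])))
        for other_id, c in common.items()
    }


# Hypothetical sample collection
docs = [
    ["cluster", "term"],
    ["term", "query"],
    ["photo", "sports"],
]
inverted = build_inverted_file(docs)
print(similarity_row(0, docs, inverted))
```

Entries missing from the returned dictionary are implicitly zero, which keeps each row sparse.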
Algorithms
• Problem: sort N things into M groups, where 1 ≤ M ≤ N
• Choice of algorithm determines
– M
– membership
General Classes of Algorithms
• Hierarchical
– Nested groups
– Pairwise connections made
• Non-hierarchical
– No overlap
– Centroid
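As one illustration of the hierarchical class, the sketch below uses single-link agglomerative clustering, a technique chosen here as an example and not named on the slides. It repeatedly joins the two most similar clusters, so each merge nests earlier groups inside a larger one. The function name `single_link`, the stopping parameter `m`, and the sample similarity matrix are assumptions.

```python
def single_link(sim, m=1):
    """Agglomerative single-link clustering on a symmetric similarity matrix.

    Merges the two most similar clusters until m clusters remain.
    Returns (clusters, merges); merges records each pairwise join.
    """
    clusters = [frozenset([i]) for i in range(len(sim))]
    merges = []
    while len(clusters) > max(m, 1):
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                # Single link: cluster similarity is that of the
                # most similar pair of members.
                s = max(sim[i][j] for i in clusters[x] for j in clusters[y])
                if best is None or s > best[0]:
                    best = (s, x, y)
        s, x, y = best
        merges.append((set(clusters[x]), set(clusters[y]), s))
        merged = clusters[x] | clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return clusters, merges


# Hypothetical 4-document similarity matrix
sim = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.3, 0.1],
    [0.1, 0.3, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
clusters, merges = single_link(sim, m=2)
print(clusters)
print(merges)
```

Running to `m=1` yields the full merge history, which can be read as a hierarchy; stopping earlier yields a flat set of M groups.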
Evaluation of results
• Was the method appropriate for the data set?
• Do the clusters represent the data well?
• Are the docs in the right cluster?
How to test?
• Overlap test: run a known query set and evaluate against known results
• Randomly select docs and judge relevance to group members
• Examine distribution of docs in groups
• Density test = term occurrences / (docs × unique terms)
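A minimal sketch of the density calculation, term occurrences divided by (docs × unique terms). It assumes "term occurrences" means postings, that is, each distinct term counted once per document in which it appears; the function name `term_density` and the sample documents are also assumptions.

```python
def term_density(docs):
    """Fraction of the doc-term matrix that is non-zero."""
    doc_terms = [set(d) for d in docs]
    vocabulary = set().union(*doc_terms)  # unique terms in the collection
    if not vocabulary:
        return 0.0  # assumption: an empty collection has zero density
    postings = sum(len(terms) for terms in doc_terms)  # term occurrences
    return postings / (len(doc_terms) * len(vocabulary))


# Hypothetical sample collection: 6 postings, 3 docs, 5 unique terms
docs = [
    ["cluster", "term"],
    ["term", "query"],
    ["photo", "sports"],
]
print(term_density(docs))
```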
Concepts to keep in mind
• Cluster hypothesis
• Nearest neighbour
• Centroid
Cluster Hypothesis
• Associations between documents are related to the relevance of documents to queries
• Van Rijsbergen, 1979
Nearest Neighbour
• Find the document most similar to the given one
• This one is most likely closely related
• Works with terms, citations, & clusters
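A small sketch of a nearest-neighbour lookup over term features: score every other document against the given one and keep the best. The Jaccard measure, the function names, and the sample documents are assumptions for illustration; any similarity measure on term sets could be passed in.

```python
def jaccard(a, b):
    """Jaccard coefficient on two term sets (binary weights)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def nearest_neighbour(doc_id, docs, sim=jaccard):
    """Return (id, score) of the document most similar to docs[doc_id].

    Returns (None, -1.0) if there is no other document.
    """
    target = set(docs[doc_id])
    best_id, best_score = None, -1.0
    for other_id, other in enumerate(docs):
        if other_id == doc_id:
            continue
        score = sim(target, set(other))
        if score > best_score:
            best_id, best_score = other_id, score
    return best_id, best_score


# Hypothetical sample collection
docs = [
    ["cluster", "term", "document"],
    ["cluster", "term", "query"],
    ["photo", "sports"],
]
print(nearest_neighbour(0, docs))
```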
Centroids
• Representative of a cluster
• May be a document from that cluster
• May be a composite of doc features from that cluster
• Why:
– query-centroid calculations
– higher-level representations of data set
– build ontologies and thesauri
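The two kinds of centroid described above can be sketched as follows. The composite centroid here is the mean of the binary term vectors, and the representative document is the member with the highest average similarity to the other members; both definitions, the Jaccard measure, the function names, and the sample cluster are assumptions chosen for illustration.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard coefficient on two term sets (binary weights)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def composite_centroid(cluster_docs):
    """Term -> fraction of cluster documents containing that term."""
    counts = Counter(term for doc in cluster_docs for term in set(doc))
    n = len(cluster_docs)
    return {term: count / n for term, count in counts.items()}


def representative_doc(cluster_docs, sim=jaccard):
    """Index of the member most similar, on average, to the other members."""
    sets = [set(d) for d in cluster_docs]

    def avg_similarity(i):
        others = [sim(sets[i], s) for j, s in enumerate(sets) if j != i]
        return sum(others) / len(others) if others else 0.0

    return max(range(len(sets)), key=avg_similarity)


# Hypothetical cluster of three documents
cluster = [
    ["cluster", "term"],
    ["cluster", "term", "query"],
    ["cluster", "query"],
]
print(composite_centroid(cluster))
print(representative_doc(cluster))
```

A query can then be compared against one centroid per cluster before any individual documents are examined.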
Visualization of Clusters
• Kohonen Maps
• Star maps
• SOM (self-organizing maps)
• Etc
Samples
Cluster Map
Starfield