Web search results clustering
• Web search results clustering is a version of document clustering, but…
• Billions of pages
• Constantly changing
• Data mainly unstructured and heterogeneous
• Additional information to consider (e.g. links, click-through data)
Some requirements
• Fast
  – Immediate response to query
• Flexible
  – Web content changes constantly
  – Overlapping
• User-oriented
  – Main goal is to aid the user in finding information
  – Meaningful labels
  – Visualization: GUI
Architecture of Web clustering engine
Main issues
• Online or offline clustering?
• What to use as input?
  – Entire documents
  – Snippets
  – Structure information (links)
  – Other data (e.g. click-through)
  – Use stop word lists, stemming, etc.
• How to define similarity?
  – Content (e.g. vector-space model)
  – Link analysis
  – Usage statistics
• How to group similar documents?
• How to label the groups?
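A minimal sketch of content-based similarity in the vector-space model: each snippet becomes a term-frequency vector and two snippets are compared by the cosine of the angle between their vectors. The stop word list and the function names are illustrative assumptions, not taken from the slides; a real engine would also apply stemming and term weighting.

```python
import math
from collections import Counter

# Illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "of", "in", "and", "for", "to", "is"}

def term_vector(text):
    """Bag-of-words term-frequency vector: lowercased, stop words removed."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return Counter(tokens)

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

if __name__ == "__main__":
    dealer = term_vector("jaguar car dealer")
    price = term_vector("jaguar car price")
    animal = term_vector("jaguar habitat in the rainforest")
    # Snippets about the same sense of "jaguar" should score higher.
    print(cosine_similarity(dealer, price), cosine_similarity(dealer, animal))
```

Link analysis or usage statistics would replace `term_vector` with a different representation while the grouping step could stay the same.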
Clustering algorithms
• Flat or hierarchical?
• Overlapping?
• Hard or soft? (One object to one cluster or multiple clusters)
• Incremental?
• Predefined cluster number?
• Requiring explicit similarity measure? Distance measure?
Clustering algorithms
• Distance-based
  – Hierarchical
    • Agglomerative Hierarchical Clustering (AHC)
  – Flat
    • K-means (can be fuzzy)
    • Single-pass (incremental)
• Other
  – Suffix Tree Clustering (Grouper)
  – Query directed clustering
  – Self-organizing (Kohonen) maps (neural networks)
  – Latent Semantic Indexing (LSI) (reducing the dimensionality of the vector-space)
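As an illustration of the flat, incremental family, here is a rough sketch of single-pass clustering: each snippet is compared with the existing clusters once and either joins the most similar one or opens a new cluster. The threshold value, the use of cosine similarity on raw term counts, and all names are assumptions made for this example.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def single_pass(snippets, threshold=0.3):
    """Return clusters as lists of snippet indices, built in one pass over the input."""
    clusters = []  # each entry: {"centroid": Counter, "members": [indices]}
    for i, text in enumerate(snippets):
        vec = Counter(text.lower().split())
        best, best_sim = None, threshold
        for c in clusters:
            sim = cosine(vec, c["centroid"])
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            # Nothing is similar enough: start a new cluster.
            clusters.append({"centroid": Counter(vec), "members": [i]})
        else:
            # Summed term counts act as the centroid; cosine ignores the scale.
            best["centroid"].update(vec)
            best["members"].append(i)
    return [c["members"] for c in clusters]

if __name__ == "__main__":
    docs = ["jaguar car dealer", "jaguar car price",
            "python snake habitat", "python snake food"]
    print(single_pass(docs))
```

The number of clusters is not fixed in advance, but the result depends on the order in which snippets arrive and on the chosen threshold.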
Clustering algorithms
• Data-centric algorithms
  – E.g. Scatter/Gather
  – Vector space model, k-means or AHC
  – Cluster labels are not good enough
• Description-aware algorithms
  – STC
• Description-centric algorithms
  – E.g. Vivisimo, Lingo, SRC
  – Good labels
Research Prototype systems
Commercial systems
Scatter/Gather from Lin and Pantel (2002) and Hearst and Pedersen (1996)
KeySRC
KartOO
Others
• More at http://www.folden.info/searchengineclustertechnology.shtml
Scatter/Gather
• (Cutting et al. 1992)
• Designed for browsing
• Based on two novel clustering algorithms
  – Buckshot: fast for online clustering
  – Fractionation: accurate for offline initial clustering of the entire set
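A rough sketch of the idea behind Buckshot as it is usually described: cluster only a random sample of about sqrt(k·n) documents with a slower agglomerative method to obtain k centres, then assign every document to its nearest centre. This example is a simplification under stated assumptions: it merges by centroid cosine similarity rather than group-average similarity, skips any refinement pass, and all names are hypothetical.

```python
import math
import random
from collections import Counter

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    """Sum of term-count vectors (cosine is insensitive to the scale)."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total

def agglomerate(vectors, k):
    """Naive agglomerative clustering of a small sample down to k groups."""
    groups = [[v] for v in vectors]
    while len(groups) > k:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                s = cosine(centroid(groups[i]), centroid(groups[j]))
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        groups[i].extend(groups.pop(j))  # j > i, so index i stays valid
    return [centroid(g) for g in groups]

def buckshot(snippets, k, seed=0):
    """Return one cluster label per snippet."""
    vectors = [Counter(s.lower().split()) for s in snippets]
    n = len(vectors)
    sample_size = min(n, max(k, math.ceil(math.sqrt(k * n))))
    sample = random.Random(seed).sample(vectors, sample_size)
    centres = agglomerate(sample, k)
    # Assign every document, sampled or not, to its most similar centre.
    return [max(range(len(centres)), key=lambda c: cosine(v, centres[c]))
            for v in vectors]

if __name__ == "__main__":
    docs = ["jaguar car dealer", "jaguar car price",
            "python snake habitat", "python snake food"]
    print(buckshot(docs, k=2))
```

Because the quadratic step runs only on the sample, the overall cost stays close to linear in the number of documents, which is what makes the approach usable online.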
Grouper
• (Zamir and Etzioni 1997, 1999)
• Online
• Operates on query result snippets
• Clusters together documents with large common subphrases
• Suffix Tree Clustering (STC): linear, incremental, overlapping, can be extended to hierarchical
• STC induces labeling
Iwona Białynicka-Birula - Clustering Web Search Results
STC algorithm
• Step 1: Cleaning
  – Stemming
  – Sentence boundary identification
  – Punctuation elimination
• Step 2: Suffix tree construction
  – Produces base clusters (internal nodes)
  – Base clusters are scored based on size and phrase score (which depends on length and word "quality")
• Step 3: Merging base clusters
  – Highly overlapping clusters are merged
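The three steps can be sketched roughly as follows. This is not the STC implementation: a plain phrase index stands in for the suffix tree (so it is not linear time), cleaning is reduced to lowercasing and punctuation removal with no stemming, the scoring function is a hypothetical placeholder, and the 0.5 overlap threshold is an assumed parameter.

```python
import re
from collections import defaultdict

def phrases(tokens, max_len=3):
    """All word n-grams of up to max_len words."""
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

def base_clusters(snippets, min_docs=2):
    """Steps 1-2: clean each snippet, then map every shared phrase to its documents."""
    index = defaultdict(set)
    for doc_id, text in enumerate(snippets):
        tokens = re.findall(r"[a-z0-9]+", text.lower())  # drops punctuation
        for p in phrases(tokens):
            index[p].add(doc_id)
    return {p: docs for p, docs in index.items() if len(docs) >= min_docs}

def score(phrase, docs):
    """Hypothetical score: cluster size times capped phrase length."""
    return len(docs) * min(len(phrase), 6)

def merge(clusters, overlap=0.5):
    """Step 3: merge base clusters whose document sets overlap heavily in both directions."""
    items = list(clusters.items())
    parent = list(range(len(items)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            a, b = items[i][1], items[j][1]
            common = len(a & b)
            if common / len(a) > overlap and common / len(b) > overlap:
                parent[find(i)] = find(j)

    merged = defaultdict(lambda: {"labels": [], "docs": set()})
    for i, (phrase, docs) in enumerate(items):
        group = merged[find(i)]
        group["labels"].append(" ".join(phrase))
        group["docs"] |= docs
    return list(merged.values())

if __name__ == "__main__":
    docs = ["Jaguar car dealer.", "Jaguar car price!",
            "Python snake habitat", "Python snake food"]
    for cluster in merge(base_clusters(docs)):
        print(sorted(cluster["docs"]), cluster["labels"])
```

The shared phrases double as candidate labels, which is the sense in which the clustering induces labeling.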
Carrot2
• (Stefanowski and Weiss 2003)
• http://www.cs.put.poznan.pl/dweiss/carrot/
• Component framework
• Allows substituting components for
  – Input (e.g. snippets from other search engines)
  – Filter
    • Stemming
    • Distance measure
    • Clustering
  – Output
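To illustrate the component idea in general terms, the sketch below wires an input, a chain of filters, and an output together so that any stage can be swapped. It does not use or reproduce the Carrot2 API; every class and function name here is hypothetical, and the toy components only stand in for real stemming and clustering stages.

```python
from dataclasses import dataclass
from typing import Any, Callable, List

@dataclass
class Pipeline:
    source: Callable[[str], Any]         # input component: query -> raw data
    filters: List[Callable[[Any], Any]]  # ordered filter components
    sink: Callable[[Any], str]           # output component: data -> presentation

    def run(self, query: str) -> str:
        data = self.source(query)
        for stage in self.filters:
            data = stage(data)
        return self.sink(data)

# Toy components (assumptions for the example only).
def canned_input(query):
    return ["Jaguar car dealer", "Jaguar habitat", "Python snake"]

def lowercase(snippets):
    return [s.lower() for s in snippets]

def first_word_clusters(snippets):
    groups = {}
    for s in snippets:
        groups.setdefault(s.split()[0], []).append(s)
    return groups

def as_text(groups):
    return "\n".join(f"{label}: {len(members)}" for label, members in groups.items())

if __name__ == "__main__":
    pipeline = Pipeline(canned_input, [lowercase, first_word_clusters], as_text)
    print(pipeline.run("jaguar"))
```

Replacing `canned_input` with a component that fetches snippets from another search engine, or `first_word_clusters` with a different clustering stage, leaves the rest of the pipeline unchanged.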
Vivísimo
• Commercial
• http://www.vivisimo.com/
• Online
• Hierarchical
• Conceptual
Other
• Mapuccino (IBM)
  – (Maarek et al. 2000)
  – http://www.alphaworks.ibm.com/tech/mapuccino
  – Relatively efficient AHC (O(n^2))
  – Similarity based on vector-space model
• (Su et al. 2001)
  – Only usage statistics used as input
  – Recursive Density Based Clustering
• SHOC
  – (Zhang and Dong 2004)
  – Grouper-like
  – Key phrase discovery
Problems
• The performance is far from perfect
  – Incompleteness of clusters
    • Hard to tell why some clusters are generated and others are missing
  – Different cluster granularity
    • Some clusters are very specific, some are very broad
  – Inconsistency between the contents and the label
    • Lack of intra- and inter-cluster consistency
  – Label expressiveness
  – Lack of evaluation and data sets
Future research trends
• To extract more powerful features: hyperlinks, external info, temporal attributes
• To generate more expressive or effective descriptions of clusters
• To improve the accuracy of the hierarchy structure
• To consider user characteristics, web usage data
• Integration with ontology
• Better visualisation of the clusters
• To apply to mobile search
• XML documents clustering