Web clustring engine

WEB CLUSTERING ENGINES

Page 1: Web clustring engine

WEB CLUSTERING ENGINES

Page 2: Web clustring engine

Search Engine?

• Search engines are an invaluable tool for retrieving information from the Web. In response to a user query, they return a list of results ranked in order of relevance to the query.

• E.g.: Google, Yahoo, etc.

Page 3: Web clustring engine

Flat Ranked vs. Clustered

• Google (Flat Ranked Search Engine)

Page 4: Web clustring engine

• Northern Light (Clustered Search Engine)

Page 5: Web clustring engine

Why Web Clustering Engines?

• Conventional engines are not very effective for ‘ambiguous’ queries.

• The results a conventional search engine returns for such a query are mixed together in a single list, so irrelevant items appear among the relevant ones.

Page 6: Web clustring engine

• These systems group the results returned by a search engine into a hierarchy of labeled clusters (also called categories).

Web clustering engines:

1. Northern Light - predefined set of clusters

2. Credo Reference

3. KartOO

4. Eyeplorer

Page 7: Web clustring engine

Main advantages of the cluster hierarchy

• It provides shortcuts to the items that relate to the same meaning.

• It allows better topic understanding.

Page 8: Web clustring engine

Issues in Implementation of Clusters

• Short input data description.

• Meaningful labels.

• Selection of similarity measure.

• Grouping of objects into clusters.

• Computational efficiency.

• Unknown number of clusters.

Page 9: Web clustring engine

Architecture & Techniques

Page 10: Web clustring engine

1. Search Results Acquisition

• Provides input for the rest of the system.

• Based on the query, the acquisition component must deliver 50 to 500 results, each of which should contain a title, a contextual snippet, and the URL.

• The source of search results can be any public search engine, such as Google, Yahoo, etc.

• The results are fetched from these other search engines.
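The shape of the acquisition output can be sketched as a small record type. This is only an illustration: the names `SearchResult` and `acquire` are hypothetical, and the input is assumed to be a list of (title, snippet, URL) tuples already fetched from some engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchResult:
    """One search hit with the three fields named on the slide."""
    title: str
    snippet: str
    url: str


def acquire(raw_hits, min_results=50, max_results=500):
    """Wrap raw (title, snippet, url) tuples as SearchResult objects.

    Keeps at most max_results hits and reports whether at least
    min_results were available (the 50 to 500 range from the slide).
    """
    results = [SearchResult(t, s, u) for (t, s, u) in raw_hits[:max_results]]
    enough = len(results) >= min_results
    return results, enough
```

A caller could fall back to a plain ranked list when `enough` is false, since clustering a handful of results is rarely useful.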

Page 11: Web clustring engine

2. Preprocessing of Search Results

• The primary aim is to convert the search results into ‘features’.

Steps:

i. Language identification

ii. Tokenization

iii. Stemming

iv. Feature selection

Page 12: Web clustring engine

ii. Tokenization: The text of each search result is split into a sequence of basic independent units called tokens, each of which represents a word, a number, or a symbol.
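A minimal tokenizer along these lines can be sketched with a regular expression. The pattern below is an assumption made for illustration (ASCII letters only, simple decimal numbers); a real engine would need language-aware rules.

```python
import re

# A token is a word (letters, optional internal apostrophes),
# a number (optionally with a decimal part), or a single symbol.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*|\d+(?:\.\d+)?|[^\sA-Za-z\d]")


def tokenize(text):
    """Split text into word, number, and symbol tokens."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenize("Web clustering: 50 to 500 results!")
# tokens == ['Web', 'clustering', ':', '50', 'to', '500', 'results', '!']
```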

Page 13: Web clustring engine

iii. Stemming: Remove the inflectional prefixes and suffixes of each word to reduce its different grammatical forms to a common base form called a ‘stem’.

E.g.: connected, connecting & interconnection

↓ ↓ ↓

‘connect’
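The example above can be reproduced with a toy affix stripper. This is a deliberately naive sketch: the prefix and suffix lists are assumptions chosen to cover the three example words, and established stemmers such as the Porter stemmer apply far more careful rules.

```python
# Hypothetical affix lists, just large enough for the slide's example.
PREFIXES = ("inter",)
SUFFIXES = ("ing", "ion", "ed", "s")


def toy_stem(word, min_stem_len=3):
    """Strip at most one known prefix and one known suffix from word."""
    stem = word.lower()
    for prefix in PREFIXES:
        if stem.startswith(prefix) and len(stem) - len(prefix) >= min_stem_len:
            stem = stem[len(prefix):]
            break
    for suffix in SUFFIXES:
        if stem.endswith(suffix) and len(stem) - len(suffix) >= min_stem_len:
            stem = stem[:-len(suffix)]
            break
    return stem


stems = [toy_stem(w) for w in ("connected", "connecting", "interconnection")]
# stems == ['connect', 'connect', 'connect']
```

Such blind stripping over-stems many words (for instance, it would cut "interest" down to "est"), which is why real systems rely on rule sets tuned per language.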

Page 14: Web clustring engine

iv. Feature selection:

• Extract features for each search result present in the input.

• Features are atomic entities by which we can describe an object and represent its most important characteristics to an algorithm.

• Features vary from single words to tuples of words.
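One simple way to obtain features that range from single words to tuples of words is to enumerate word n-grams. The sketch below assumes the tokens have already been produced by the earlier steps; the function name and the `max_n` parameter are hypothetical.

```python
def extract_features(tokens, max_n=2):
    """Return all contiguous word tuples of length 1..max_n as features."""
    features = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            features.append(tuple(tokens[i:i + n]))
    return features


features = extract_features(["web", "clustering", "engine"])
# Single words come first, then the two-word tuples.
```

In practice a system would keep only a subset of these candidates, for example by dropping very rare or very common ones.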

Page 15: Web clustring engine

How can we represent a feature/text?

• Vector Space Model (VSM)

• A document d is represented in the VSM as a vector [w_t0, w_t1, ..., w_tn], where t0, t1, ..., tn is a set of words/features and w_ti is the weight/importance of feature ti.

E.g.: d → “Polly had a dog and the dog had Polly”

[Figure: VSM representation]
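The example document can be turned into a vector as follows. The slide does not say which weighting is used, so this sketch assumes the simplest choice, raw term frequency, with the vocabulary in alphabetical order; `vsm_vector` is a hypothetical helper name.

```python
from collections import Counter


def vsm_vector(document, vocabulary=None):
    """Return (vocabulary, weights), where weights[i] counts vocabulary[i]."""
    counts = Counter(document.lower().split())
    if vocabulary is None:
        vocabulary = sorted(counts)
    # Counter returns 0 for terms that do not occur in the document.
    return vocabulary, [counts[term] for term in vocabulary]


vocab, weights = vsm_vector("Polly had a dog and the dog had Polly")
# vocab   == ['a', 'and', 'dog', 'had', 'polly', 'the']
# weights == [1, 1, 2, 2, 2, 1]
```

Passing a shared `vocabulary` for all documents keeps their vectors aligned, which is what a similarity measure needs.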

Page 16: Web clustring engine

3. Cluster Construction & Labelling

• The set of search results, along with their features, is the input to the clustering algorithm, which builds the clusters and labels them.

Three types of algorithms:

1. Data-centric

2. Description-aware

3. Description-centric

Page 17: Web clustring engine

Data Centric Clustering Algorithm

• It starts with an initial clustering of a collection of documents into a set of k clusters (scatter).

• At query time, the user selects the clusters of interest (gather) and the system re-clusters those documents.

• The process repeats until a small cluster of relevant documents is found.
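The scatter and gather loop described above can be sketched as follows. Everything here is an assumption made for illustration: the clustering step is a tiny k-means-style pass over word counts seeded with the first k documents, the user's choice is modelled as a callback, and the sample documents are made up.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def scatter(docs, k, iterations=5):
    """Partition docs into at most k clusters."""
    vectors = [Counter(d.lower().split()) for d in docs]
    centroids = [Counter(v) for v in vectors[:k]]  # assumption: first k docs as seeds
    assignment = [0] * len(docs)
    for _ in range(iterations):
        # Assign each document to its most similar centroid.
        assignment = [
            max(range(len(centroids)), key=lambda c: cosine(v, centroids[c]))
            for v in vectors
        ]
        # Recompute each centroid as the summed counts of its members.
        for c in range(len(centroids)):
            members = [vectors[i] for i, a in enumerate(assignment) if a == c]
            if members:
                centroids[c] = sum(members, Counter())
    clusters = [[] for _ in centroids]
    for doc, a in zip(docs, assignment):
        clusters[a].append(doc)
    return [c for c in clusters if c]


def gather(clusters, chosen):
    """Merge the clusters the user picked into one smaller collection."""
    return [doc for index in chosen for doc in clusters[index]]


def scatter_gather(docs, k, choose, stop_size):
    """Repeat scatter and gather until the working set is small enough."""
    while len(docs) > stop_size:
        clusters = scatter(docs, k)
        selected = gather(clusters, choose(clusters))
        if not selected or len(selected) == len(docs):
            break  # nothing was narrowed down, so stop instead of looping forever
        docs = selected
    return docs


# Made-up documents; the "user" keeps every cluster that mentions "cat".
sample = ["jaguar car speed", "jaguar cat jungle", "jaguar car engine", "jaguar cat prey"]
pick_cats = lambda clusters: [i for i, c in enumerate(clusters) if any("cat" in d.split() for d in c)]
narrowed = scatter_gather(sample, k=2, choose=pick_cats, stop_size=2)
```

The result depends heavily on the seeds and the similarity measure, which is one reason the choice of clustering step matters in a data-centric design.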

Page 18: Web clustring engine

Difficulties in Data centric algorithms

• None of these algorithms is incremental in nature: they cannot process documents one at a time, where each document that arrives from the web is “cleaned” and added to the available model.

• Lack of meaningful labels.

Page 19: Web clustring engine

4. Visualization of Clustered Results

• One prominent approach is based on hierarchical folders.

• Clusty, CREDO, Lingo3G - hierarchical folder visualization approach

• Grokker - nesting, zooming approach

• KartOO - graph-based interfaces

Page 20: Web clustring engine

THANK YOU