
Page 1: Motivation

1

Motivation

• A Web query is usually only two or three words long.
  – Prone to ambiguity
  – Example

• “keyboard”
  – An input device for a computer
  – A musical instrument

• How can we ease the document selection process for the user?

Page 2: Motivation

2

Motivation

• Use clusters to represent each topic!
  – The user can quickly disambiguate the query or drill down into a specific topic.

Page 3: Motivation

3

Motivation

• Moreover, with a precomputed clustering of the corpus, the search for documents similar to a query can be carried out efficiently.
  – Cluster pruning

• We will learn how to cluster a collection of documents into groups.

Page 4: Motivation

4

Cluster pruning: preprocessing

• Pick √N docs at random: call these leaders

• For every other doc, pre-compute its nearest leader.
  – Docs attached to a leader: its followers.
  – Likely: each leader has ~ √N followers.

Page 5: Motivation

5

Cluster pruning: query processing

• Process a query as follows:
  – Given a query Q, find its nearest leader L.
  – Seek the K nearest docs from among L’s followers.
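A minimal sketch of both phases, assuming dense tf-idf-style document vectors and cosine similarity; the function and variable names here are illustrative, not from the slides:

```python
import numpy as np

def _normalize(X):
    """Row-normalize so that dot products become cosine similarities."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, 1e-12)

def preprocess(doc_vectors, rng=np.random.default_rng(0)):
    """Cluster-pruning preprocessing: pick ~sqrt(N) random leaders and
    attach every other document to its nearest leader."""
    N = len(doc_vectors)
    leader_ids = rng.choice(N, size=max(1, int(np.sqrt(N))), replace=False)
    sims = _normalize(doc_vectors) @ _normalize(doc_vectors[leader_ids]).T
    nearest = sims.argmax(axis=1)                  # nearest leader per document
    followers = {j: np.where(nearest == j)[0] for j in range(len(leader_ids))}
    return leader_ids, followers

def process_query(q, doc_vectors, leader_ids, followers, K=10):
    """Query processing: find the nearest leader L, then rank only L's followers."""
    qn = q / max(np.linalg.norm(q), 1e-12)
    L = int((_normalize(doc_vectors[leader_ids]) @ qn).argmax())
    cand = followers[L]
    scores = _normalize(doc_vectors[cand]) @ qn
    return cand[np.argsort(scores)[::-1][:K]]      # K nearest docs among followers
```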

Page 6: Motivation

6

Visualization

[Figure: a query point shown among leaders and their attached followers in the vector space]

Page 7: Motivation

7

What is Cluster Analysis?

• Cluster: a collection of data objects
  – Similar to one another within the same cluster
  – Dissimilar to the objects in other clusters

• Cluster analysis
  – Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters

• Unsupervised learning: no predefined classes

Page 8: Motivation

8

Quality: What Is Good Clustering?

• A good clustering method will produce high-quality clusters with
  – high intra-class similarity
  – low inter-class similarity

• The quality of a clustering result depends on both the similarity measure used by the method and its implementation.

Page 9: Motivation

9

Similarity measures

• Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, typically a metric d(i, j).

• The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, ratio-scaled, and vector variables.

• Weights should be associated with different variables based on applications and data semantics.

• It is hard to define “similar enough” or “good enough” – the answer is typically highly subjective.

Page 10: Motivation

10

Vector Objects

• Vector objects: keywords in documents.

• Cosine measure (similarity)
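For reference, the cosine measure is the dot product of two term-weight vectors divided by the product of their norms; a minimal sketch (the example vectors are illustrative):

```python
import numpy as np

def cosine_similarity(d1, d2):
    """Cosine of the angle between two term-weight vectors."""
    denom = np.linalg.norm(d1) * np.linalg.norm(d2)
    return float(d1 @ d2) / denom if denom else 0.0

# Two tiny bag-of-words vectors over the same vocabulary
print(cosine_similarity(np.array([1.0, 2.0, 0.0]), np.array([2.0, 1.0, 1.0])))
```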

Page 11: Motivation

11

Similarity within a cluster and between clusters

Page 12: Motivation

12

Centroid, Radius and Diameter of a Cluster (for numerical data sets)

• Centroid: the “middle” of a cluster
• Radius: square root of the average squared distance from any point of the cluster to its centroid
• Diameter: square root of the average squared distance between all pairs of points in the cluster

$$ C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N} $$

$$ R_m = \sqrt{\frac{\sum_{i=1}^{N} \left(t_{ip} - c_m\right)^2}{N}} $$

$$ D_m = \sqrt{\frac{\sum_{i=1}^{N}\sum_{j=1}^{N} \left(t_{ip} - t_{jq}\right)^2}{N\,(N-1)}} $$
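A tiny numeric check of the three quantities, assuming Euclidean distance in 2-D (the points are illustrative only):

```python
import numpy as np
from itertools import combinations

points = np.array([[1.0, 1.0], [2.0, 1.0], [1.5, 2.0]])
N = len(points)

centroid = points.mean(axis=0)                                   # C_m
radius = np.sqrt(((points - centroid) ** 2).sum(axis=1).mean())  # R_m
# D_m: the double sum over ordered pairs equals twice the sum over unordered pairs
pair_sq = sum(((p - q) ** 2).sum() for p, q in combinations(points, 2))
diameter = np.sqrt(2 * pair_sq / (N * (N - 1)))                  # D_m

print(centroid, radius, diameter)
```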

Page 13: Motivation

13

Typical Alternatives to Calculate the Similarity between Clusters

• Single link: largest similarity between an element in one cluster and an element in the other.
• Complete link: smallest similarity between an element in one cluster and an element in the other.
• Average: average similarity between an element in one cluster and an element in the other.
• Centroid: distance between the centroids of two clusters, i.e., dis(Ki, Kj) = dis(Ci, Cj).
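A minimal sketch of the four alternatives for two document clusters, using cosine similarity; the function names are illustrative:

```python
import numpy as np

def cosine_matrix(A, B):
    """Pairwise cosine similarities between rows of A and rows of B."""
    An = A / np.linalg.norm(A, axis=1, keepdims=True)
    Bn = B / np.linalg.norm(B, axis=1, keepdims=True)
    return An @ Bn.T

def cluster_similarity(X, Y, method="single"):
    """Similarity between clusters X and Y (rows are document vectors)."""
    S = cosine_matrix(X, Y)
    if method == "single":    # largest pairwise similarity
        return S.max()
    if method == "complete":  # smallest pairwise similarity
        return S.min()
    if method == "average":   # mean pairwise similarity
        return S.mean()
    if method == "centroid":  # similarity between the two centroids
        return cosine_matrix(X.mean(axis=0, keepdims=True),
                             Y.mean(axis=0, keepdims=True))[0, 0]
    raise ValueError(f"unknown method: {method}")
```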

Page 14: Motivation

14

Documents Clustering

Page 15: Motivation

15

Hierarchical Clustering

• Uses the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition.

[Figure: objects a–e are merged step by step (Steps 0–4) under agglomerative clustering (AGNES), and split in the reverse order (Steps 4–0) under divisive clustering (DIANA)]

Page 16: Motivation

16

Hierarchical agglomerative clustering (HAC)

• HAC is widely used in document clustering

Page 17: Motivation

17

Nearest Neighbor, Level 2, k = 7 clusters.

From http://www.stat.unc.edu/postscript/papers/marron/Stat321FDA/RimaIzempresentation.ppt

Page 18: Motivation

18

Nearest Neighbor, Level 3, k = 6 clusters.

Page 19: Motivation

19

Nearest Neighbor, Level 4, k = 5 clusters.

Page 20: Motivation

20

Nearest Neighbor, Level 5, k = 4 clusters.

Page 21: Motivation

21

Nearest Neighbor, Level 6, k = 3 clusters.

Page 22: Motivation

22

Nearest Neighbor, Level 7, k = 2 clusters.

Page 23: Motivation

23

Nearest Neighbor, Level 8, k = 1 cluster.

Page 24: Motivation

24

Hierarchical Clustering

• Calculate the similarity between all possible combinations of two profiles.
• The two most similar clusters are grouped together to form a new cluster.
• Calculate the similarity between the new cluster and all remaining clusters.

Keys
• Similarity
• Clustering

Page 25: Motivation

25

HAC

• The hierarchical merging process leads to a tree called a dendrogram.
  – The earlier mergers happen between groups with a large similarity.
  – This value becomes lower and lower for later merges.
  – The user can cut across the dendrogram at a suitable level to get any desired number of clusters (see the sketch below).
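A minimal HAC sketch with SciPy, cutting the dendrogram to a desired number of clusters; the toy vectors are illustrative only:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy tf-idf-like document vectors (rows = documents).
docs = np.array([[1.0, 0.2, 0.0],
                 [0.9, 0.1, 0.1],
                 [0.0, 1.0, 0.8],
                 [0.1, 0.9, 1.0]])

# Agglomerative clustering with average linkage on cosine distance.
Z = linkage(docs, method="average", metric="cosine")

# "Cut" the dendrogram to obtain a desired number of clusters, here 2.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)   # e.g. [1 1 2 2]
```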

Page 26: Motivation

26

Partitioning Algorithms: Basic Concept

• Partitioning method: construct a partition of a database D of n objects into a set of k clusters such that the sum of squared distances is minimized.

• Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion.
  – Global optimum: exhaustively enumerate all partitions
• Heuristic method: k-means
  – k-means (MacQueen ’67): each cluster is represented by the center of the cluster
    • Hard assignment
    • Soft assignment

$$ E = \sum_{m=1}^{k} \sum_{t_{mi} \in C_m} \left(t_{mi} - c_m\right)^2 $$

Page 27: Motivation

27

K-Means with hard assignment

• Given k, the k-means algorithm is implemented in four steps:
  – Partition objects into k nonempty subsets
  – Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)
  – Assign each object to the cluster with the nearest seed point
  – Go back to Step 2; stop when there are no more new assignments
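A minimal numpy sketch of these steps, assuming Euclidean distance; the names and toy data are illustrative:

```python
import numpy as np

def kmeans(X, k, max_iter=100, rng=np.random.default_rng(0)):
    """Plain k-means with hard assignment (illustrative sketch).
    X is an (n, d) array of object vectors; returns labels and centroids."""
    # Pick k objects at random as the initial cluster centers.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float).copy()
    labels = None
    for _ in range(max_iter):
        # Assign each object to the cluster with the nearest seed point.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                              # no more new assignments
        labels = new_labels
        # Recompute each centroid as the mean point of its cluster.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Usage on toy 2-D points (values are illustrative only)
X = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [8.2, 7.9]])
print(kmeans(X, k=2))
```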

Page 28: Motivation

28

The K-Means Clustering Method

• Example

[Figure: K-means iterations on a 2-D scatter plot with K = 2 – arbitrarily choose K objects as the initial cluster centers, assign each object to the most similar center, update the cluster means, reassign, and repeat]

Page 29: Motivation

29

K-means with “soft” assignment

• Each cluster c is represented as a vector c in term space.
  – It is not necessarily the centroid of some documents.

• The goal of soft k-means is to find a c so as to minimize the quantization error

$$ \min \sum_{d} \left\| d - c(d) \right\|^2, \quad \text{where } c(d) \text{ is the cluster vector closest to document } d $$

Page 30: Motivation

30

K-means with “soft” assignment

• A simple strategy is to iteratively reduce the error between the mean vectors and the documents they are closest to.

• We can repeatedly scan through the documents and, for each document d, accumulate a “correction” Δc for the c that is closest to d:
• After scanning once through all the documents, all the c’s are updated in a batch: c ← c + Δc
• η is called the learning rate.

$$ \Delta c = \begin{cases} \eta\,(d - c), & \text{if } c \text{ is closest to } d \\ 0, & \text{otherwise} \end{cases} $$
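A minimal numpy sketch of one batch update under this rule, with eta standing in for the learning rate η (names are illustrative):

```python
import numpy as np

def soft_kmeans_batch_update(docs, centers, eta=0.1):
    """Accumulate a correction eta * (d - c) for the center c closest to each
    document d, then apply all corrections at once: c <- c + delta_c."""
    corrections = np.zeros_like(centers)
    for d in docs:
        j = ((centers - d) ** 2).sum(axis=1).argmin()   # center closest to d
        corrections[j] += eta * (d - centers[j])
    return centers + corrections
```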

Page 31: Motivation

31

K-means with “soft” assignment

• The contribution from d need not be limited to only the c that is closest to it.
  – The contribution can be shared among many clusters, the portion for cluster c being directly related to the current similarity between c and d.
  – For example

$$ \Delta c \;\propto\; \frac{1/\|d - c\|^{2}}{\sum_{c'} 1/\|d - c'\|^{2}}\;(d - c) $$

Page 32: Motivation

32

Comments on the K-Means Method

• Strength: relatively efficient: O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations. Normally, k, t << n.

• Comment: Often terminates at a local optimum.

Page 33: Motivation

33

Dimensionality reduction

• A significant fraction of the running time is spent in computing similarities between documents and clusters.
  – The time taken for one similarity calculation is proportional to the total number of nonzero components of the two vectors involved.

• For example, the total number of unique terms in the Reuters collection is over 30,000!

• A simple way to reduce the running time is to truncate each document vector to a fixed number of its largest-magnitude coordinates.
  – Subspace projection
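A minimal sketch of this truncation, keeping the m largest-magnitude coordinates of a document vector (m and the names are illustrative):

```python
import numpy as np

def truncate_vector(v, m=50):
    """Zero out all but the m largest-magnitude coordinates of v,
    a crude projection onto a coordinate subspace."""
    out = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-m:]
    out[keep] = v[keep]
    return out
```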

Page 34: Motivation

34

Dimensionality reduction

• Projections to orthogonal subspaces, a subset of dimensions, may not reveal clustering structure in the best possible way.

• This is because there are usually many ways to express a given concept (synonymy), and most words have multiple meanings (polysemy).

Page 35: Motivation

35

Latent Semantic Indexing (LSI)

• Goal
  – A better approach would allow users to retrieve information on the basis of the conceptual topic or meaning of a document.
  – LSI tries to overcome the problems of lexical matching by using statistically derived conceptual indices instead of individual words for retrieval.
  – The latent semantic space has fewer dimensions than the original space.

• LSI is thus a method for dimensionality reduction.

Page 36: Motivation

36

Latent Semantic Indexing (LSI)

• Introduction
  – Singular value decomposition (SVD) is used to estimate the structure in word usage across documents.
  – Performance data show that these statistically derived vectors are more robust indicators of meaning than individual terms.

Page 37: Motivation

37

Basic concepts of LSI

• LSI is a technique that projects queries and documents into a space with “latent” semantic dimensions.

• In the latent semantic space, a query and a document can have high cosine similarity even if they do not share any terms
  – as long as their terms are semantically similar in some sense.

Page 38: Motivation

38

Basic concepts of LSI

• Latent semantic indexing is the application of a particular mathematical technique, called Singular Value Decomposition or SVD, to a word-by-document matrix.

• SVD takes a matrix A and represents it as Â in a lower-dimensional space such that the “distance” between the two matrices, as measured by the 2-norm, is minimized:

$$ \Delta = \| A - \hat{A} \|_{2} $$

Page 39: Motivation

39

Basic concepts of LSI

• The projection into the latent semantic space is chosen such that the representations in the original space are changed as little as possible, as measured by the sum of the squares of the differences.
  – SVD (and hence LSI) is a least-squares method.

Page 40: Motivation

40

Basic concepts of LSI

• SVD projects an n-dimensional space onto a k-dimensional space, where n >> k.
  – In applications (word-document matrices), n is the number of word types in the collection.
  – Values of k that are frequently chosen are 100 and 150.

Page 41: Motivation

41

Basic concepts of LSI

• There are many different mappings from high dimensional to low-dimensional spaces.

• Latent Semantic Indexing chooses the mapping that is optimal in the sense that it minimizes the distance Δ.
  – This setup has the consequence that the dimensions of the reduced space correspond to the axes of greatest variation.

Page 42: Motivation

42

Basic concepts of LSI

• The SVD projection is computed by decomposing the term–document matrix A_{t×d} into the product of three matrices, T_{t×n}, S_{n×n}, and D_{d×n}:
  – A_{t×d} = T_{t×n} S_{n×n} (D_{d×n})^T
  – t is the number of terms, d is the number of documents, n = min(t, d)
  – T and D have orthonormal columns, i.e., T^T T = D^T D = I
  – S = diag(σ1, σ2, …, σn), with σi ≥ σj ≥ 0 for 1 ≤ i ≤ j ≤ n

• SVD can be viewed as a method for rotating the axes of the n-dimensional space such that the first axis runs along the direction of largest variation among the documents, the second dimension runs along the direction with the second largest variation and so forth.
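A minimal numpy check of this decomposition on a toy term-document matrix; numpy's U, s, and Vt play the roles of T, the diagonal of S, and D^T in the slide's notation (the matrix values are illustrative):

```python
import numpy as np

# Toy term-document matrix A (t = 4 terms, d = 3 documents).
A = np.array([[2.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

print(np.allclose(A, U @ np.diag(s) @ Vt))        # True: A = T S D^T
print(np.allclose(U.T @ U, np.eye(U.shape[1])))   # T has orthonormal columns
print(s)                                          # singular values, descending
```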

Page 43: Motivation

43

Basic concepts of LSI

• SVD
  – T and D represent terms and documents in this new space.
  – S contains the singular values of A in descending order.
  – The ith singular value indicates the amount of variation along the ith axis.
• By restricting the matrices T, S, and D to their first k < n dimensions,
  – we obtain Â_{t×d} = T_{t×k} S_{k×k} (D_{d×k})^T,
  – which is the best approximation of A by a matrix of rank k in the least-squares sense defined by the equation Δ = ||A − Â||₂.
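A minimal sketch of the rank-k approximation, keeping only the first k singular triplets (the toy matrix and k are illustrative):

```python
import numpy as np

def rank_k_approximation(A, k):
    """Best rank-k approximation of A in the 2-norm / least-squares sense,
    obtained by keeping only the first k singular triplets."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

A = np.array([[2.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
A_hat = rank_k_approximation(A, k=2)
# The 2-norm error equals the (k+1)-th singular value (up to rounding).
print(np.linalg.norm(A - A_hat, 2))
```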

Page 44: Motivation

44

Basic concepts of LSI• Choosing the number of dimensions (k) is an interesting

problem.– While a reduction in k can remove much of the noise – Keeping too few dimensions or factors may loose important

information.

• Using a test database of medical abstracts, LSI performance can improve considerably after 10 or 20 dimensions, peaks between 70 and 100 dimensions, and then begins to diminish slowly. – This pattern of performance (initial large increase and slow

decrease to word-based performance) is observed with other datasets as well.

– Eventually performance must approach the level of performance attained by standard vector methods, since with k = n factors Aˆ will exactly reconstruct the original term by document matrix A.

Page 45: Motivation

45

Basic concepts of LSI

• Document-to-document similarities:

$$ A^{T} A \approx \hat{A}^{T} \hat{A} = D S^{2} D^{T} $$

• Term-to-term similarities:

$$ A A^{T} \approx \hat{A} \hat{A}^{T} = T S^{2} T^{T} $$

Page 46: Motivation

46

Basic concepts of LSI

• Query-to-document similarities
  – Idea
    • The user query is represented as a vector in the k-dimensional space and can then be compared to the documents in that space.
      – q is simply the vector of words in the user’s query, multiplied by the appropriate term weights.
    • The sum of these k-dimensional term vectors is reflected by the term q^T T_{t×k}, and the right multiplication by S⁻¹_{k×k} differentially weights the separate dimensions.
    • Thus, the query vector is located at the weighted sum of its constituent term vectors.
    • The query vector can then be compared to all existing document vectors, and the documents ranked by their similarity (nearness) to the query.
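A minimal sketch of folding a query into the LSI space as q^T T_{t×k} S⁻¹_{k×k} and ranking documents by cosine similarity; comparing against the rows of D_k is one common choice, and all names and the value of k are illustrative:

```python
import numpy as np

def lsi_query_ranking(A, q, k=2):
    """Fold query q into the k-dimensional LSI space of term-document matrix A
    (shape t x d) and return document indices ranked by cosine similarity."""
    T, s, Dt = np.linalg.svd(A, full_matrices=False)
    Tk, Sk_inv, Dk = T[:, :k], np.diag(1.0 / s[:k]), Dt[:k, :].T   # Dk is d x k
    q_k = q @ Tk @ Sk_inv                     # q^T T_{t x k} S^{-1}_{k x k}
    sims = (Dk @ q_k) / (np.linalg.norm(Dk, axis=1) * np.linalg.norm(q_k) + 1e-12)
    return np.argsort(sims)[::-1]             # documents ranked by similarity
```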

Page 47: Motivation

47

Advantages of LSI

• True (latent) dimensions
  – Synonymy
    • Synonymy refers to the fact that the same underlying concept can be described using different terms.
    • In LSI, all documents that are related to a topic are likely to be represented by a similar weighted combination of indexing variables.

Page 48: Motivation

48

Advantages of LSI

– Polysemy
  • Polysemy describes words that have more than one meaning, which is a common property of language.
  • Large numbers of polysemous words in a query can reduce the precision of a search significantly.
  • By using a reduced representation in LSI, one hopes to remove some “noise” from the data, which could be described as rare and less important usages of certain terms.
    – This works only when the real meaning is close to the average meaning.
    – Since the LSI term vector is just a weighted average of the different meanings of the term, when the real meaning differs from the average meaning, LSI may actually reduce the quality of the search.

Page 49: Motivation

49

Advantages of LSI

• Robust to noisy input
  – Because LSI does not depend on literal keyword matching, it is especially useful when the text input is noisy, as with OCR (optical character recognition), open input, or spelling errors.
  – For example:
    • Suppose there are scanning errors and a word (Dumais) is misspelled (as Duniais).
    • If correctly spelled context words also occur in documents that contained “Duniais”, then Dumais will probably be near Duniais in the k-dimensional space.