
Page 1: Motivation

1

Motivation

• A Web query is usually only two or three words long.
  – Prone to ambiguity
  – Example

• “keyboard”
  – An input device for a computer
  – A musical instrument

• How can we ease the document selection process for the user?

Page 2: Motivation

2

Motivation

• Use clusters to represent each topic!
  – The user can quickly disambiguate the query or drill down into a specific topic.

Page 3: Motivation

3

Motivation

• Moreover, with a precomputed clustering of the corpus, the search for documents similar to a query can be carried out efficiently.
  – Cluster pruning

• We will learn how to cluster a collection of documents into groups.

Page 4: Motivation

4

Cluster pruning: preprocessing

• Pick √N docs at random: call these leaders

• For every other doc, pre-compute its nearest leader.
  – Docs attached to a leader: its followers.
  – Likely: each leader has ~ √N followers.

Page 5: Motivation

5

Cluster pruning: query processing

• Process a query as follows:
  – Given a query Q, find its nearest leader L.
  – Seek the K nearest docs from among L’s followers.
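A minimal sketch of both phases, assuming dense tf-idf-style document vectors and cosine similarity; the function and variable names here are illustrative, not from the slides:

```python
import numpy as np

def _normalize(X):
    """Row-normalize so that dot products become cosine similarities."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, 1e-12)

def preprocess(doc_vectors, rng=np.random.default_rng(0)):
    """Cluster-pruning preprocessing: pick ~sqrt(N) random leaders and
    attach every other document to its nearest leader."""
    N = len(doc_vectors)
    leader_ids = rng.choice(N, size=max(1, int(np.sqrt(N))), replace=False)
    sims = _normalize(doc_vectors) @ _normalize(doc_vectors[leader_ids]).T
    nearest = sims.argmax(axis=1)                  # nearest leader per document
    followers = {j: np.where(nearest == j)[0] for j in range(len(leader_ids))}
    return leader_ids, followers

def process_query(q, doc_vectors, leader_ids, followers, K=10):
    """Query processing: find the nearest leader L, then rank only L's followers."""
    qn = q / max(np.linalg.norm(q), 1e-12)
    L = int((_normalize(doc_vectors[leader_ids]) @ qn).argmax())
    cand = followers[L]
    scores = _normalize(doc_vectors[cand]) @ qn
    return cand[np.argsort(scores)[::-1][:K]]      # K nearest docs among followers
```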

Page 6: Motivation

6

Visualization

[Figure: a query point shown among leaders and their attached followers in the vector space]

Page 7: Motivation

7

What is Cluster Analysis?

• Cluster: a collection of data objects
  – Similar to one another within the same cluster
  – Dissimilar to the objects in other clusters

• Cluster analysis
  – Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters

• Unsupervised learning: no predefined classes

Page 8: Motivation

8

Quality: What Is Good Clustering?

• A good clustering method will produce high-quality clusters with
  – high intra-class similarity
  – low inter-class similarity

• The quality of a clustering result depends on both the similarity measure used by the method and its implementation.

Page 9: Motivation

9

Similarity measures

• Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, typically a metric d(i, j).

• The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, ratio-scaled, and vector variables.

• Weights should be associated with different variables based on applications and data semantics.

• It is hard to define “similar enough” or “good enough” – the answer is typically highly subjective.

Page 10: Motivation

10

Vector Objects

• Vector objects: keywords in documents.

• Cosine measure (similarity)
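For reference, the cosine measure is the dot product of two term-weight vectors divided by the product of their norms; a minimal sketch (the example vectors are illustrative):

```python
import numpy as np

def cosine_similarity(d1, d2):
    """Cosine of the angle between two term-weight vectors."""
    denom = np.linalg.norm(d1) * np.linalg.norm(d2)
    return float(d1 @ d2) / denom if denom else 0.0

# Two tiny bag-of-words vectors over the same vocabulary
print(cosine_similarity(np.array([1.0, 2.0, 0.0]), np.array([2.0, 1.0, 1.0])))
```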

Page 11: Motivation

11

Similarity within a cluster and between clusters

Page 12: Motivation

12

Centroid, Radius and Diameter of a Cluster (for numerical data sets)

• Centroid: the “middle” of a cluster
• Radius: square root of the average squared distance from any point of the cluster to its centroid
• Diameter: square root of the average squared distance between all pairs of points in the cluster

$$ C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N} $$

$$ R_m = \sqrt{\frac{\sum_{i=1}^{N} \left(t_{ip} - c_m\right)^2}{N}} $$

$$ D_m = \sqrt{\frac{\sum_{i=1}^{N}\sum_{j=1}^{N} \left(t_{ip} - t_{jq}\right)^2}{N\,(N-1)}} $$
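A tiny numeric check of the three quantities, assuming Euclidean distance in 2-D (the points are illustrative only):

```python
import numpy as np
from itertools import combinations

points = np.array([[1.0, 1.0], [2.0, 1.0], [1.5, 2.0]])
N = len(points)

centroid = points.mean(axis=0)                                   # C_m
radius = np.sqrt(((points - centroid) ** 2).sum(axis=1).mean())  # R_m
# D_m: the double sum over ordered pairs equals twice the sum over unordered pairs
pair_sq = sum(((p - q) ** 2).sum() for p, q in combinations(points, 2))
diameter = np.sqrt(2 * pair_sq / (N * (N - 1)))                  # D_m

print(centroid, radius, diameter)
```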

Page 13: Motivation

13

Typical Alternatives to Calculate the Similarity between Clusters

• Single link: largest similarity between an element in one cluster and an element in the other.
• Complete link: smallest similarity between an element in one cluster and an element in the other.
• Average: average similarity between an element in one cluster and an element in the other.
• Centroid: distance between the centroids of two clusters, i.e., dis(Ki, Kj) = dis(Ci, Cj).
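A minimal sketch of the four alternatives for two document clusters, using cosine similarity; the function names are illustrative:

```python
import numpy as np

def cosine_matrix(A, B):
    """Pairwise cosine similarities between rows of A and rows of B."""
    An = A / np.linalg.norm(A, axis=1, keepdims=True)
    Bn = B / np.linalg.norm(B, axis=1, keepdims=True)
    return An @ Bn.T

def cluster_similarity(X, Y, method="single"):
    """Similarity between clusters X and Y (rows are document vectors)."""
    S = cosine_matrix(X, Y)
    if method == "single":    # largest pairwise similarity
        return S.max()
    if method == "complete":  # smallest pairwise similarity
        return S.min()
    if method == "average":   # mean pairwise similarity
        return S.mean()
    if method == "centroid":  # similarity between the two centroids
        return cosine_matrix(X.mean(axis=0, keepdims=True),
                             Y.mean(axis=0, keepdims=True))[0, 0]
    raise ValueError(f"unknown method: {method}")
```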

Page 14: Motivation

14

Documents Clustering

Page 15: Motivation

15

Hierarchical Clustering

• Uses the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition.

[Figure: objects a–e are merged step by step (Steps 0–4) under agglomerative clustering (AGNES), and split in the reverse order (Steps 4–0) under divisive clustering (DIANA)]

Page 16: Motivation

16

Hierarchical agglomerative clustering (HAC)

• HAC is widely used in document clustering

Page 17: Motivation

17

Nearest Neighbor, Level 2, k = 7 clusters.

From http://www.stat.unc.edu/postscript/papers/marron/Stat321FDA/RimaIzempresentation.ppt

Page 18: Motivation

18

Nearest Neighbor, Level 3, k = 6 clusters.

Page 19: Motivation

19

Nearest Neighbor, Level 4, k = 5 clusters.

Page 20: Motivation

20

Nearest Neighbor, Level 5, k = 4 clusters.

Page 21: Motivation

21

Nearest Neighbor, Level 6, k = 3 clusters.

Page 22: Motivation

22

Nearest Neighbor, Level 7, k = 2 clusters.

Page 23: Motivation

23

Nearest Neighbor, Level 8, k = 1 cluster.

Page 24: Motivation

24

Hierarchical Clustering

• Calculate the similarity between all possible combinations of two profiles.
• The two most similar clusters are grouped together to form a new cluster.
• Calculate the similarity between the new cluster and all remaining clusters.

Keys
• Similarity
• Clustering

Page 25: Motivation

25

HAC

• The hierarchical merging process leads to a tree called a dendrogram.
  – The earlier mergers happen between groups with a large similarity.
  – This value becomes lower and lower for later merges.
  – The user can cut across the dendrogram at a suitable level to get any desired number of clusters (see the sketch below).
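A minimal HAC sketch with SciPy, cutting the dendrogram to a desired number of clusters; the toy vectors are illustrative only:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy tf-idf-like document vectors (rows = documents).
docs = np.array([[1.0, 0.2, 0.0],
                 [0.9, 0.1, 0.1],
                 [0.0, 1.0, 0.8],
                 [0.1, 0.9, 1.0]])

# Agglomerative clustering with average linkage on cosine distance.
Z = linkage(docs, method="average", metric="cosine")

# "Cut" the dendrogram to obtain a desired number of clusters, here 2.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)   # e.g. [1 1 2 2]
```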

Page 26: Motivation

26

Partitioning Algorithms: Basic Concept

• Partitioning method: construct a partition of a database D of n objects into a set of k clusters such that the sum of squared distances is minimized.

• Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion.
  – Global optimum: exhaustively enumerate all partitions
• Heuristic method: k-means
  – k-means (MacQueen ’67): each cluster is represented by the center of the cluster
    • Hard assignment
    • Soft assignment

$$ E = \sum_{m=1}^{k} \sum_{t_{mi} \in C_m} \left(t_{mi} - c_m\right)^2 $$

Page 27: Motivation

27

K-Means with hard assignment

• Given k, the k-means algorithm is implemented in four steps:
  – Partition objects into k nonempty subsets
  – Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)
  – Assign each object to the cluster with the nearest seed point
  – Go back to Step 2; stop when there are no more new assignments
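A minimal numpy sketch of these steps, assuming Euclidean distance; the names and toy data are illustrative:

```python
import numpy as np

def kmeans(X, k, max_iter=100, rng=np.random.default_rng(0)):
    """Plain k-means with hard assignment (illustrative sketch).
    X is an (n, d) array of object vectors; returns labels and centroids."""
    # Pick k objects at random as the initial cluster centers.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float).copy()
    labels = None
    for _ in range(max_iter):
        # Assign each object to the cluster with the nearest seed point.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                              # no more new assignments
        labels = new_labels
        # Recompute each centroid as the mean point of its cluster.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Usage on toy 2-D points (values are illustrative only)
X = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [8.2, 7.9]])
print(kmeans(X, k=2))
```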

Page 28: Motivation

28

The K-Means Clustering Method

• Example

[Figure: K-means iterations on a 2-D scatter plot with K = 2 – arbitrarily choose K objects as the initial cluster centers, assign each object to the most similar center, update the cluster means, reassign, and repeat]

Page 29: Motivation

29

K-means with “soft” assignment

• Each cluster c is represented as a vector c in term space.
  – It is not necessarily the centroid of some documents.

• The goal of soft k-means is to find a c so as to minimize the quantization error

$$ \min \sum_{d} \left\| d - c(d) \right\|^2, \quad \text{where } c(d) \text{ is the cluster vector closest to document } d $$

Page 30: Motivation

30

K-means with “soft” assignment

• A simple strategy is to iteratively reduce the error between the mean vectors and the documents they are closest to.

• We can repeatedly scan through the documents and, for each document d, accumulate a “correction” Δc for the c that is closest to d:
• After scanning once through all the documents, all the c’s are updated in a batch: c ← c + Δc
• η is called the learning rate.

$$ \Delta c = \begin{cases} \eta\,(d - c), & \text{if } c \text{ is closest to } d \\ 0, & \text{otherwise} \end{cases} $$
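A minimal numpy sketch of one batch update under this rule, with eta standing in for the learning rate η (names are illustrative):

```python
import numpy as np

def soft_kmeans_batch_update(docs, centers, eta=0.1):
    """Accumulate a correction eta * (d - c) for the center c closest to each
    document d, then apply all corrections at once: c <- c + delta_c."""
    corrections = np.zeros_like(centers)
    for d in docs:
        j = ((centers - d) ** 2).sum(axis=1).argmin()   # center closest to d
        corrections[j] += eta * (d - centers[j])
    return centers + corrections
```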

Page 31: Motivation

31

K-means with “soft” assignment

• The contribution from d need not be limited to only the c that is closest to it.
  – The contribution can be shared among many clusters, the portion for cluster c being directly related to the current similarity between c and d.
  – For example

$$ \Delta c \;\propto\; \frac{1/\|d - c\|^{2}}{\sum_{c'} 1/\|d - c'\|^{2}}\;(d - c) $$

Page 32: Motivation

32

Comments on the K-Means Method

• Strength: relatively efficient: O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations. Normally, k, t << n.

• Comment: Often terminates at a local optimum.

Page 33: Motivation

33

Dimensionality reduction

• A significant fraction of the running time is spent in computing similarities between documents and clusters.
  – The time taken for one similarity calculation is proportional to the total number of nonzero components of the two vectors involved.

• For example, the total number of unique terms in the Reuters collection is over 30,000!

• A simple way to reduce the running time is to truncate each document vector to a fixed number of its largest-magnitude coordinates.
  – Subspace projection
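A minimal sketch of this truncation, keeping the m largest-magnitude coordinates of a document vector (m and the names are illustrative):

```python
import numpy as np

def truncate_vector(v, m=50):
    """Zero out all but the m largest-magnitude coordinates of v,
    a crude projection onto a coordinate subspace."""
    out = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-m:]
    out[keep] = v[keep]
    return out
```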

Page 34: Motivation

34

Dimensionality reduction

• Projections to orthogonal subspaces, a subset of dimensions, may not reveal clustering structure in the best possible way.

• This is because there are usually many ways to express a given concept (synonymy), and most words have multiple meanings (polysemy).

Page 35: Motivation

35

Latent Semantic Indexing (LSI)

• Goal
  – A better approach would allow users to retrieve information on the basis of the conceptual topic or meaning of a document.
  – LSI tries to overcome the problems of lexical matching by using statistically derived conceptual indices instead of individual words for retrieval.
  – The latent semantic space has fewer dimensions than the original space.

• LSI is thus a method for dimensionality reduction.

Page 36: Motivation

36

Latent Semantic Indexing (LSI)

• Introduction
  – Singular value decomposition (SVD) is used to estimate the structure in word usage across documents.
  – Performance data show that these statistically derived vectors are more robust indicators of meaning than individual terms.

Page 37: Motivation

37

Basic concepts of LSI

• LSI is a technique that projects queries and documents into a space with “latent” semantic dimensions.

• In the latent semantic space, a query and a document can have high cosine similarity even if they do not share any terms
  – as long as their terms are semantically similar in some sense.

Page 38: Motivation

38

Basic concepts of LSI

• Latent semantic indexing is the application of a particular mathematical technique, called Singular Value Decomposition or SVD, to a word-by-document matrix.

• SVD takes a matrix A and represents it as Â in a lower-dimensional space such that the “distance” between the two matrices, as measured by the 2-norm, is minimized:

$$ \Delta = \| A - \hat{A} \|_{2} $$

Page 39: Motivation

39

Basic concepts of LSI

• The projection into the latent semantic space is chosen such that the representations in the original space are changed as little as possible, as measured by the sum of the squares of the differences.
  – SVD (and hence LSI) is a least-squares method.

Page 40: Motivation

40

Basic concepts of LSI

• SVD projects an n-dimensional space onto a k-dimensional space, where n >> k.
  – In applications (word-document matrices), n is the number of word types in the collection.
  – Values of k that are frequently chosen are 100 and 150.

Page 41: Motivation

41

Basic concepts of LSI

• There are many different mappings from high dimensional to low-dimensional spaces.

• Latent Semantic Indexing chooses the mapping that is optimal in the sense that it minimizes the distance Δ.
  – This setup has the consequence that the dimensions of the reduced space correspond to the axes of greatest variation.

Page 42: Motivation

42

Basic concepts of LSI

• The SVD projection is computed by decomposing the term–document matrix A_{t×d} into the product of three matrices, T_{t×n}, S_{n×n}, and D_{d×n}:
  – A_{t×d} = T_{t×n} S_{n×n} (D_{d×n})^T
  – t is the number of terms, d is the number of documents, n = min(t, d)
  – T and D have orthonormal columns, i.e., T^T T = D^T D = I
  – S = diag(σ1, σ2, …, σn), with σi ≥ σj ≥ 0 for 1 ≤ i ≤ j ≤ n

• SVD can be viewed as a method for rotating the axes of the n-dimensional space such that the first axis runs along the direction of largest variation among the documents, the second dimension runs along the direction with the second largest variation and so forth.
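A minimal numpy check of this decomposition on a toy term-document matrix; numpy's U, s, and Vt play the roles of T, the diagonal of S, and D^T in the slide's notation (the matrix values are illustrative):

```python
import numpy as np

# Toy term-document matrix A (t = 4 terms, d = 3 documents).
A = np.array([[2.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

print(np.allclose(A, U @ np.diag(s) @ Vt))        # True: A = T S D^T
print(np.allclose(U.T @ U, np.eye(U.shape[1])))   # T has orthonormal columns
print(s)                                          # singular values, descending
```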

Page 43: Motivation

43

Basic concepts of LSI

• SVD
  – T and D represent terms and documents in this new space.
  – S contains the singular values of A in descending order.
  – The ith singular value indicates the amount of variation along the ith axis.
• By restricting the matrices T, S, and D to their first k < n dimensions,
  – we obtain Â_{t×d} = T_{t×k} S_{k×k} (D_{d×k})^T,
  – which is the best approximation of A by a matrix of rank k in the least-squares sense defined by the equation Δ = ||A − Â||₂.
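A minimal sketch of the rank-k approximation, keeping only the first k singular triplets (the toy matrix and k are illustrative):

```python
import numpy as np

def rank_k_approximation(A, k):
    """Best rank-k approximation of A in the 2-norm / least-squares sense,
    obtained by keeping only the first k singular triplets."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

A = np.array([[2.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
A_hat = rank_k_approximation(A, k=2)
# The 2-norm error equals the (k+1)-th singular value (up to rounding).
print(np.linalg.norm(A - A_hat, 2))
```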

Page 44: Motivation

44

Basic concepts of LSI• Choosing the number of dimensions (k) is an interesting

problem.– While a reduction in k can remove much of the noise – Keeping too few dimensions or factors may loose important

information.

• Using a test database of medical abstracts, LSI performance can improve considerably after 10 or 20 dimensions, peaks between 70 and 100 dimensions, and then begins to diminish slowly. – This pattern of performance (initial large increase and slow

decrease to word-based performance) is observed with other datasets as well.

– Eventually performance must approach the level of performance attained by standard vector methods, since with k = n factors Aˆ will exactly reconstruct the original term by document matrix A.

Page 45: Motivation

45

Basic concepts of LSI

• Document-to-document similarities:

$$ A^{T} A \approx \hat{A}^{T} \hat{A} = D S^{2} D^{T} $$

• Term-to-term similarities:

$$ A A^{T} \approx \hat{A} \hat{A}^{T} = T S^{2} T^{T} $$

Page 46: Motivation

46

Basic concepts of LSI

• Query-to-document similarities
  – Idea
    • The user query is represented as a vector in the k-dimensional space and can then be compared to the documents in that space.
      – q is simply the vector of words in the user’s query, multiplied by the appropriate term weights.
    • The sum of these k-dimensional term vectors is reflected by the term q^T T_{t×k}, and the right multiplication by S⁻¹_{k×k} differentially weights the separate dimensions.
    • Thus, the query vector is located at the weighted sum of its constituent term vectors.
    • The query vector can then be compared to all existing document vectors, and the documents ranked by their similarity (nearness) to the query.
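A minimal sketch of folding a query into the LSI space as q^T T_{t×k} S⁻¹_{k×k} and ranking documents by cosine similarity; comparing against the rows of D_k is one common choice, and all names and the value of k are illustrative:

```python
import numpy as np

def lsi_query_ranking(A, q, k=2):
    """Fold query q into the k-dimensional LSI space of term-document matrix A
    (shape t x d) and return document indices ranked by cosine similarity."""
    T, s, Dt = np.linalg.svd(A, full_matrices=False)
    Tk, Sk_inv, Dk = T[:, :k], np.diag(1.0 / s[:k]), Dt[:k, :].T   # Dk is d x k
    q_k = q @ Tk @ Sk_inv                     # q^T T_{t x k} S^{-1}_{k x k}
    sims = (Dk @ q_k) / (np.linalg.norm(Dk, axis=1) * np.linalg.norm(q_k) + 1e-12)
    return np.argsort(sims)[::-1]             # documents ranked by similarity
```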

Page 47: Motivation

47

Advantages of LSI

• True (latent) dimensions
  – Synonymy
    • Synonymy refers to the fact that the same underlying concept can be described using different terms.
    • In LSI, all documents that are related to a topic are likely to be represented by a similar weighted combination of indexing variables.

Page 48: Motivation

48

Advantages of LSI

– Polysemy
  • Polysemy describes words that have more than one meaning, which is a common property of language.
  • Large numbers of polysemous words in a query can reduce the precision of a search significantly.
  • By using a reduced representation in LSI, one hopes to remove some “noise” from the data, which could be described as rare and less important usages of certain terms.
    – This works only when the real meaning is close to the average meaning.
    – Since the LSI term vector is just a weighted average of the different meanings of the term, when the real meaning differs from the average meaning, LSI may actually reduce the quality of the search.

Page 49: Motivation

49

Advantages of LSI

• Robust to noisy input
  – Because LSI does not depend on literal keyword matching, it is especially useful when the text input is noisy, as with OCR (optical character recognition), open input, or spelling errors.
  – For example:
    • Suppose there are scanning errors and a word (Dumais) is misspelled (as Duniais).
    • If correctly spelled context words also occur in documents that contained “Duniais”, then Dumais will probably be near Duniais in the k-dimensional space.