Advanced Multimedia
Text Clustering
Tamara Berg
Reminder - Classification
• Given some labeled training documents
• Determine the best label for a test (query) document
What if we don’t have labeled data?
• We can’t do classification.
• What can we do?
– Clustering: the assignment of objects into groups (called clusters) so that objects from the same cluster are more similar to each other than objects from different clusters.
– Often similarity is assessed according to a distance measure: any of the similarity metrics we talked about before (SSD, angle between vectors).
– Clustering is a common technique for statistical data analysis, used in many fields, including machine learning, data mining, pattern recognition, image analysis, and bioinformatics.
Document Clustering
• Clustering is the process of grouping a set of documents into clusters of similar documents.
• Documents within a cluster should be similar.
• Documents from different clusters should be dissimilar.
Source: Hinrich Schutze
[Figure slides: real-world clustering examples, including Google News and Flickr clusters. Source: Hinrich Schutze]
How to Cluster Documents
Reminder - Vector Space Model
• Documents are represented as vectors in term space.
• A vector distance/similarity measure between two documents is used to compare documents.
Slide from Mitch Marcus
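As an illustration of the "angle between vectors" idea mentioned above, here is a small sketch (not from the slides) of cosine similarity between two term-count vectors; the toy counts are invented for the example:

```python
# Cosine similarity: measures the angle between two term-count vectors.
# The vectors below are toy numbers invented for this example.
import math

def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

print(cosine_similarity([10, 5, 3, 0], [5, 10, 0, 0]))  # ≈ 0.77
```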
Document Vectors: one location for each word.

      nova  galaxy  heat  h'wood  film  role  diet  fur
A      10     5      3
B       5    10
C                           10     8     7
D                            9    10     5
E                                       10    10
F                                              9   10
G       5            7                   9
H                            6    10     2          8
I                                  7     5     1    3

Rows A-I are document ids. “Nova” occurs 10 times in text A, “Galaxy” occurs 5 times in text A, “Heat” occurs 3 times in text A. (Blank means 0 occurrences.)
Slide from Mitch Marcus
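A minimal sketch of how such term-count vectors might be built; the tokenizer (lowercase plus whitespace split) and the sample sentences are my own simplifications, not part of the slides:

```python
# Build a term-count vector per document over a fixed vocabulary,
# one slot per word, as in the table above (absent word = 0).
from collections import Counter

vocab = ["nova", "galaxy", "heat", "h'wood", "film", "role", "diet", "fur"]

def doc_vector(text):
    counts = Counter(text.lower().split())   # naive tokenizer
    return [counts[term] for term in vocab]  # Counter returns 0 if absent

raw_docs = {
    "A": "nova nova galaxy heat nova galaxy",  # made-up sample text
    "B": "galaxy nova galaxy heat galaxy",
}
for name, text in raw_docs.items():
    print(name, doc_vector(text))
```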
TF x IDF Calculation

$w_{ik} = tf_{ik} \cdot \log(N / n_k)$

where:
• $T_k$ = term k in document $D_i$
• $tf_{ik}$ = frequency of term $T_k$ in document $D_i$
• $idf_k$ = inverse document frequency of term $T_k$ in C, defined as $idf_k = \log(N / n_k)$
• N = total number of documents in the collection C
• $n_k$ = the number of documents in C that contain $T_k$

Slide from Mitch Marcus
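A hedged sketch of the weighting above; the three-document toy corpus is invented, and since the slide leaves the log base unspecified, the natural log is assumed here:

```python
# w_ik = tf_ik * log(N / n_k), following the definitions above.
import math

corpus = [                                # toy collection C (made up)
    {"nova": 10, "galaxy": 5, "heat": 3},
    {"nova": 5, "galaxy": 10},
    {"film": 10, "role": 8},
]
N = len(corpus)                           # total number of documents

def tf_idf(doc, term):
    tf = doc.get(term, 0)                 # tf_ik
    n_k = sum(term in d for d in corpus)  # number of docs containing term
    return tf * math.log(N / n_k)         # assumes term occurs somewhere in C

print(tf_idf(corpus[0], "nova"))          # 10 * log(3/2) ≈ 4.05
```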
Features

A document A can be represented by its word counts (W1 W2 W3 … Wn) or, more generally, by any feature vector (F1 F2 F3 … Fn).

Define whatever features you like:
• Length of longest string of CAPs
• Number of $'s
• Useful words for the task
• …
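A quick sketch of the two example features named on the slide; the regular expression and function name are my own choices:

```python
# Two of the slide's example features: length of the longest run of
# capital letters, and the number of '$' characters in the document.
import re

def extract_features(text):
    caps_runs = re.findall(r"[A-Z]+", text)
    longest_caps = max((len(run) for run in caps_runs), default=0)
    num_dollars = text.count("$")
    return [longest_caps, num_dollars]

print(extract_features("BREAKING: shares fell from $5 to $3"))  # [8, 2]
```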
Similarity between documents

A = [10 5 3 0 0 0 0 0]
G = [5 0 7 0 0 9 0 0]
E = [0 0 0 0 0 10 10 0]

Sum of Squared Distances (SSD) = $\sum_{i=1}^{n} (X_i - Y_i)^2$

SSD(A,G) = ?
SSD(A,E) = ?
SSD(G,E) = ?
Which pair of documents is the most similar?
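Working the question out directly (the vectors are copied from the slide):

```python
# SSD(X, Y) = sum_i (X_i - Y_i)^2, applied to the slide's vectors.
A = [10, 5, 3, 0, 0, 0, 0, 0]
G = [5, 0, 7, 0, 0, 9, 0, 0]
E = [0, 0, 0, 0, 0, 10, 10, 0]

def ssd(x, y):
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

print(ssd(A, G))  # 147
print(ssd(A, E))  # 334
print(ssd(G, E))  # 175
# Smallest SSD wins: A and G are the most similar pair.
```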
K-means clustering

• Want to minimize the sum of squared Euclidean distances between points $x_i$ and their nearest cluster centers $m_k$:

$D(X, M) = \sum_{k} \sum_{i \in \text{cluster } k} (x_i - m_k)^2$

source: Svetlana Lazebnik
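A compact sketch of the algorithm that minimizes this objective; random initialization and the iteration cap are my own choices, since the slides do not fix them:

```python
# K-means: alternate (1) assign each point to its nearest center and
# (2) recompute each center as the mean of its assigned points.
import numpy as np

def kmeans(X, k, num_iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(num_iters):
        # Assignment step: squared Euclidean distance to every center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recomputation step (keep the old center if a cluster goes empty).
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):  # fixed point reached
            break
        centers = new_centers
    return labels, centers

X = np.array([[10, 5, 3], [9, 6, 2], [0, 1, 9], [1, 0, 8]], dtype=float)
labels, centers = kmeans(X, k=2)
print(labels)  # e.g. two tight groups: the first two docs vs. the last two
```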
Convergence of K Means

• K-means converges to a fixed point in a finite number of iterations.

Proof:
• The sum of squared distances (RSS) decreases during reassignment (because each vector is moved to a closer centroid).
• RSS decreases during recomputation.
• There are only finitely many possible assignments, so a monotonically decreasing RSS cannot cycle.
• Thus: we must reach a fixed point.
• But we don’t know how long convergence will take!
• If we don’t care about a few docs switching back and forth, then convergence is usually fast (< 10-20 iterations).

Source: Hinrich Schutze
Hierarchical clustering strategies

• Agglomerative clustering
  – Start with each point in a separate cluster
  – At each iteration, merge the two “closest” clusters
• Divisive clustering
  – Start with all points grouped into a single cluster
  – At each iteration, split the “largest” cluster

(A sketch of each strategy appears below.)

source: Svetlana Lazebnik
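A minimal agglomerative sketch using SciPy (a library choice of mine; the slide names the strategy, not an implementation). Single-link is one way to define “closest”:

```python
# Agglomerative clustering: every point starts alone; the two closest
# clusters merge at each step. SciPy builds the full merge tree.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

docs = np.array([            # toy document vectors (made up)
    [10, 5, 3, 0],
    [9, 6, 2, 0],
    [0, 1, 9, 10],
    [1, 0, 8, 9],
], dtype=float)

Z = linkage(docs, method="single", metric="euclidean")  # merge tree
labels = fcluster(Z, t=2, criterion="maxclust")         # cut into 2 clusters
print(labels)  # e.g. [1 1 2 2]
```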
Divisive Clustering

• Top-down (instead of bottom-up as in agglomerative clustering)
• Start with all docs in one big cluster
• Then recursively split clusters
• Eventually each node forms a cluster on its own

Source: Hinrich Schutze
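A hedged sketch of the top-down recipe via repeated 2-means splits (bisecting k-means is one common realization; the slide does not prescribe a specific split rule). “Largest” is taken here to mean “most documents”:

```python
# Divisive clustering: start with one big cluster, then repeatedly
# split the largest remaining cluster in two with k-means.
import numpy as np
from sklearn.cluster import KMeans

def divisive_cluster(X, num_clusters):
    clusters = [np.arange(len(X))]  # row indices, one big cluster
    while len(clusters) < num_clusters:
        # Pick the largest cluster (assumed to have at least 2 points).
        largest = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        idx = clusters.pop(largest)
        halves = KMeans(n_clusters=2, n_init=10).fit_predict(X[idx])
        clusters.append(idx[halves == 0])
        clusters.append(idx[halves == 1])
    return clusters

X = np.array([[10, 5, 3], [9, 6, 2], [0, 1, 9], [1, 0, 8]], dtype=float)
print(divisive_cluster(X, 2))  # e.g. [array([0, 1]), array([2, 3])]
```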
Flat or hierarchical clustering?

• For high efficiency, use flat clustering (e.g., k-means)
• For deterministic results, use hierarchical clustering
• When a hierarchical structure is desired, use a hierarchical algorithm
• Hierarchical clustering can also be applied if K cannot be predetermined (can start without knowing K)

Source: Hinrich Schutze
For Thurs
• Read Chapter 6 of textbook