Advanced Information-Retrieval Models

Transcript of Advanced Information-Retrieval Models
Advanced Information-Retrieval Models
Hsin-Hsi Chen
Mathematical Models for IR
• Boolean model
  Compare Boolean query statements with the term sets used to identify document content.
• probabilistic model
  Compute the relevance probabilities for the documents of a collection.
• vector-space model
  Represent both queries and documents by term sets, and compute global similarities between queries and documents.
Basic Vector Space Model

• Term vector representation of
  documents: Di = (ai1, ai2, ..., ait)
  queries: Qj = (qj1, qj2, ..., qjt)
• t distinct terms are used to characterize content.
• Each term is identified with a term vector T.
• T vectors are linearly independent.
• Any vector is represented as a linear combination of the t term vectors.
• The rth document Dr can be represented as a document vector, written as

$$D_r = \sum_{i=1}^{t} a_{ri} T_i$$
Document representation in vector space

[Figure: a document vector in a two-dimensional vector space]
Similarity Measure
• measured by the product of two vectors: x · y = |x| |y| cos θ
• document-query similarity
• how to determine the vector components and term correlations?

document vector:

$$D_r = \sum_{i=1}^{t} a_{ri} T_i$$

query vector:

$$Q_s = \sum_{j=1}^{t} q_{sj} T_j$$

$$D_r \cdot Q_s = \sum_{i,j=1}^{t} a_{ri} \, q_{sj} \, (T_i \cdot T_j)$$
Similarity Measure (Continued)
• vector components: each document Di corresponds to a row, and each term Tj to a column, of the term-document matrix A

$$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1t} \\ a_{21} & a_{22} & \cdots & a_{2t} \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nt} \end{bmatrix}$$
Similarity Measure (Continued)
• term correlations Ti · Tj are not available
  assumption: term vectors are orthogonal
  Ti · Tj = 0 (i ≠ j), Ti · Tj = 1 (i = j)
• Assume that terms are uncorrelated.
• Similarity measurements between documents and queries, and between pairs of documents:

$$sim(D_r, Q_s) = \sum_{i=1}^{t} a_{ri} \, q_{si}$$

$$sim(D_r, D_s) = \sum_{i=1}^{t} a_{ri} \, a_{si}$$
Sample query-document similarity computation

• D1 = 2T1 + 3T2 + 5T3
  D2 = 3T1 + 7T2 + 1T3
  Q = 0T1 + 0T2 + 2T3
• similarity computations for uncorrelated terms:
  sim(D1, Q) = 2·0 + 3·0 + 5·2 = 10
  sim(D2, Q) = 3·0 + 7·0 + 1·2 = 2
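The uncorrelated-term computation above is just an inner product of the weight vectors. A minimal Python sketch, with the vectors taken from the example:

```python
# Inner-product similarity with uncorrelated (orthogonal) terms.
# D1, D2, and Q are the weight vectors over T1..T3 from the example.

def sim(doc, query):
    """Inner product: sum of pairwise products of term weights."""
    return sum(d * q for d, q in zip(doc, query))

D1 = [2, 3, 5]
D2 = [3, 7, 1]
Q  = [0, 0, 2]

print(sim(D1, Q))  # 10
print(sim(D2, Q))  # 2
```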
Sample query-document similarity computation (Continued)

• term correlations:

|    | T1  | T2   | T3   |
|----|-----|------|------|
| T1 | 1   | 0.5  | 0    |
| T2 | 0.5 | 1    | -0.2 |
| T3 | 0   | -0.2 | 1    |

• similarity computations for correlated terms:
  sim(D1, Q) = (2T1 + 3T2 + 5T3) · (0T1 + 0T2 + 2T3)
             = 4 T1·T3 + 6 T2·T3 + 10 T3·T3 = 6·(−0.2) + 10·1 = 8.8
  sim(D2, Q) = (3T1 + 7T2 + 1T3) · (0T1 + 0T2 + 2T3)
             = 6 T1·T3 + 14 T2·T3 + 2 T3·T3 = 14·(−0.2) + 2·1 = −0.8
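With a term-correlation matrix available, the similarity becomes a double sum over all term pairs. A sketch reproducing the example (floating-point rounding aside):

```python
# Similarity with correlated terms:
#   sim(D, Q) = sum over i, j of D[i] * (Ti · Tj) * Q[j]
# using the term-correlation matrix from the example above.

T = [[1.0,  0.5,  0.0],
     [0.5,  1.0, -0.2],
     [0.0, -0.2,  1.0]]

def sim_corr(doc, query, corr):
    return sum(doc[i] * corr[i][j] * query[j]
               for i in range(len(doc)) for j in range(len(query)))

D1, D2, Q = [2, 3, 5], [3, 7, 1], [0, 0, 2]
print(sim_corr(D1, Q, T))  # ≈ 8.8
print(sim_corr(D2, Q, T))  # ≈ -0.8
```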
Advantages of similarity coefficients
• The documents can be arranged in decreasing order of corresponding similarity with the query.
• The size of the retrieved set can be adapted to the users’ requirement by retrieving only the top few items.
• Items retrieved early in a search may help generate improved query formulations using relevance feedback.
Association Measures

• Inner product: |X ∩ Y|
• Dice coefficient: 2|X ∩ Y| / (|X| + |Y|)
• Cosine coefficient: |X ∩ Y| / (|X|^(1/2) · |Y|^(1/2))
• Jaccard coefficient: |X ∩ Y| / (|X| + |Y| − |X ∩ Y|)
• Overlap coefficient: |X ∩ Y| / min(|X|, |Y|)

(X and Y are the term sets of the two items being compared.)
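A quick set-based sketch of these measures, assuming X and Y are term sets; the sample sets below are illustrative:

```python
# Set-based forms of the association measures above; X and Y are the
# term sets of the two items compared (the sample sets are made up).

def inner(X, Y):   return len(X & Y)
def dice(X, Y):    return 2 * len(X & Y) / (len(X) + len(Y))
def cosine(X, Y):  return len(X & Y) / (len(X) ** 0.5 * len(Y) ** 0.5)
def jaccard(X, Y): return len(X & Y) / len(X | Y)   # |X|+|Y|-|X∩Y| = |X∪Y|
def overlap(X, Y): return len(X & Y) / min(len(X), len(Y))

X = {"retrieval", "vector", "model"}
Y = {"vector", "model", "boolean", "query"}
print(inner(X, Y), jaccard(X, Y))  # 2 0.4
```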
Vector Modifications
• How to generate a query statement that reflects the information need?
• How to generate improved query formulations? → the relevance-feedback process
• Basic ideas
  – Documents relevant to a particular query resemble each other.
  – The reformulated query is expected to retrieve additional relevant items that are similar to the originally identified relevant item.
relevance-feedback process
• Maximize the average query document similarity for the relevant documents.
• Minimize the average query-document similarity for the nonrelevant documents.
$$Q_{opt} = \frac{1}{R}\sum_{D_i \in Rel}\frac{D_i}{|D_i|} \, - \, \frac{1}{N-R}\sum_{D_i \in Nonrel}\frac{D_i}{|D_i|}$$

where R and N−R are the assumed numbers of relevant and nonrelevant documents w.r.t. the query.
relevance-feedback process (Continued)
• problem: the sets of relevant and nonrelevant documents w.r.t. the query are not known.
• Approximation:

$$Q^{(i+1)} = Q^{(i)} + \beta\,\frac{1}{|R'|}\sum_{D_i \in R'}\frac{D_i}{|D_i|} - \gamma\,\frac{1}{|N'|}\sum_{D_i \in N'}\frac{D_i}{|D_i|}$$

or, in unnormalized form,

$$Q^{(i+1)} = Q^{(i)} + \beta\sum_{D_i \in R'} D_i - \gamma\sum_{D_i \in N'} D_i$$

where R' and N' are the subsets of the R relevant and N−R nonrelevant documents identified by the user.
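A minimal sketch of this approximate feedback update: the new query is the old query plus weighted sums of the user-identified relevant and nonrelevant document vectors. The weights, the helper names, and the toy vectors below are illustrative:

```python
# Sketch of the relevance-feedback update:
#   Q(i+1) = Q(i) + beta * (1/|R'|) * sum(D/|D| for D in R')
#                 - gamma * (1/|N'|) * sum(D/|D| for D in N')
# Function names and example vectors are illustrative.

def norm(v):
    """Length-normalize a vector (assumes a nonzero vector)."""
    length = sum(x * x for x in v) ** 0.5
    return [x / length for x in v]

def rocchio(query, rel, nonrel, beta=0.5, gamma=0.5):
    new_q = list(query)
    for docs, weight in ((rel, beta), (nonrel, -gamma)):
        for d in docs:
            for i, x in enumerate(norm(d)):
                new_q[i] += weight * x / len(docs)
    return new_q

# documents the user judged relevant / nonrelevant (toy values)
Q = [0.0, 0.0, 2.0]
print(rocchio(Q, rel=[[2.0, 3.0, 5.0]], nonrel=[[3.0, 7.0, 1.0]]))
```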
The parameters β and γ

• equal weight: β = 0.5 and γ = 0.5
• positive relevance feedback: β = 1 and γ = 0
The parameters β and γ (Continued)
• “dec hi” method: use all relevant information, but subtract only the highest-ranked nonrelevant document
• feedback with query splitting
  solves two problems: (1) the relevant documents identified do not form a tight cluster; (2) nonrelevant documents are scattered among certain relevant ones

[Figure: query splitting directs separate subqueries at distinct groups of homogeneous relevant items]
Residual Collection with Partial Rank Freezing
• The previously retrieved items identified as relevant are kept “frozen”, and the previously retrieved nonrelevant items are simply removed from the collection.
Assume 10 documents are relevant.
Evaluating relevance feedback: Test-and-control collection evaluation
Refer to CLIR
Document-Space Modification
• The relevance-feedback process improves query formulations without modifying document vectors.
• Document-space modification improves document indexes.
• The documents will resemble each other more closely than before, and can be retrieved easily by a similar query.
Relevant documents resemble each other more closely than before. Nonrelevant documents are shifted away from the query.
Document-Space Modification (Continued)
• Add the terms from the query vector to the documents previously identified as relevant, and subtract the query terms from the documents previously identified as nonrelevant.
• This operation can remove unimportant items from the collection, or shift them to an auxiliary portion.
• Problem: relevance assessments by users are subjective.
Document-Space Modification (Continued)
• Only small modifications of the term weights are allowed at each iteration.
• Document-space-modification methods are difficult to evaluate in the laboratory, where no users are available to dynamically control the space modifications by submitting queries.
Automatic Document Classification
• Searching vs. Browsing
• Disadvantages in using inverted index files
  – information pertaining to a document is scattered among many different inverted-term lists
  – information relating to different documents with similar term assignments is not in close proximity in the file system
• Approaches
  – inverted-index files (for searching) + clustered document collection (for browsing)
  – clustered file organization (for searching and browsing)
Typical Clustered File Organization

[Figure: hierarchy over the complete space — hypercentroid, supercentroids (superclusters), centroids (clusters), documents]
Search Strategy for Clustered Documents
[Figure: typical search path through the centroid hierarchy — highest-level centroid → supercentroids → centroids → documents]
Cluster Generation vs. Cluster Search

• Cluster structure is generated only once.
• Cluster maintenance can be carried out at relatively infrequent intervals.
• The cluster generation process may be slower and more expensive.
• Cluster search operations may have to be performed continually.
• Cluster search operations must be carried out efficiently.
Hierarchical Cluster Generation

• Two strategies
  – pairwise item similarities
  – heuristic methods
• Models
  – Divisive clustering (top down)
    • The complete collection is assumed to represent one complete cluster.
    • The collection is subsequently broken down into smaller pieces.
  – Agglomerative clustering (bottom up)
    • Individual item similarities are used as a starting point.
    • A gluing operation collects similar items, or groups, into larger groups.
The term-document matrix A (documents D1..Dn as rows, terms T1..Tt as columns):

$$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1t} \\ a_{21} & a_{22} & \cdots & a_{2t} \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nt} \end{bmatrix}$$

Term clustering: from the column viewpoint
Document clustering: from the row viewpoint
A Naive Program for Hierarchical Agglomerative Clustering
1. Compute all pairwise document-document similarity coefficients (N(N−1)/2 coefficients).
2. Place each of the N documents into a class of its own.
3. Form a new cluster by combining the most similar pair of current clusters i and j; update the similarity matrix by deleting the rows and columns corresponding to i and j; calculate the entries in the row corresponding to the new cluster i+j.
4. Repeat step 3 if the number of clusters left is greater than 1.
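The four steps above can be sketched directly. This toy version uses a single-link (maximum-similarity) row update in step 3; the dictionary layout and function name are assumptions, not part of the original algorithm statement:

```python
# A minimal sketch of the naive agglomerative procedure above.
# pair_sims: {(i, j): similarity} for 0 <= i < j < n (step 1's output).

def naive_hac(pair_sims, n):
    """Returns the merge history as (members_a, members_b, similarity)."""
    # Step 2: each of the n items starts as a cluster of its own.
    clusters = [frozenset([i]) for i in range(n)]
    sims = {frozenset((frozenset([i]), frozenset([j]))): s
            for (i, j), s in pair_sims.items()}
    history = []
    while len(clusters) > 1:                      # step 4's loop condition
        # Step 3: combine the most similar pair of current clusters.
        a, b = max(((x, y) for idx, x in enumerate(clusters)
                    for y in clusters[idx + 1:]),
                   key=lambda p: sims[frozenset(p)])
        s = sims[frozenset((a, b))]
        merged = a | b
        clusters = [c for c in clusters if c not in (a, b)]
        for c in clusters:  # single-link row for the new cluster i+j
            sims[frozenset((merged, c))] = max(sims[frozenset((a, c))],
                                               sims[frozenset((b, c))])
        clusters.append(merged)
        history.append((set(a), set(b), s))
    return history

# three items: 0 and 1 merge first, then 2 joins at max(0.2, 0.5) = 0.5
print(naive_hac({(0, 1): 0.9, (0, 2): 0.2, (1, 2): 0.5}, 3))
```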
How to Combine Clusters?
• Single-link clustering
  – Each document must have a similarity exceeding a stated threshold value with at least one other document in the same class.
  – The similarity between a pair of clusters is taken to be the similarity between the most similar pair of items.
  – Each cluster member will be more similar to at least one member in that same cluster than to any member of another cluster.
How to Combine Clusters? (Continued)
• Complete-link clustering
  – Each document has a similarity to all other documents in the same class that exceeds the threshold value.
  – The similarity between the least similar pair of items from the two clusters is used as the cluster similarity.
  – Each cluster member is more similar to the most dissimilar member of that cluster than to the most dissimilar member of any other cluster.
How to Combine Clusters? (Continued)
• Group-average clustering
  – a compromise between the extremes of single-link and complete-link systems
  – each cluster member has a greater average similarity to the remaining members of that cluster than it does to all members of any other cluster
Example for Agglomerative Clustering
A-F (6 items): 6(6−1)/2 = 15 pairwise similarities, listed in decreasing order:

| Step | Pair | Similarity | Step | Pair | Similarity |
|------|------|------------|------|------|------------|
| 1 | AF | 0.9 | 9  | BC | 0.4 |
| 2 | AE | 0.8 | 10 | DE | 0.4 |
| 3 | BF | 0.8 | 11 | AB | 0.3 |
| 4 | BE | 0.7 | 12 | CD | 0.3 |
| 5 | AD | 0.6 | 13 | EF | 0.3 |
| 6 | AC | 0.5 | 14 | CF | 0.2 |
| 7 | BD | 0.5 | 15 | DF | 0.1 |
| 8 | CE | 0.5 |    |    |     |
Single-link Clustering

sim(AF, X) = max(sim(A, X), sim(F, X))
sim(AEF, X) = max(sim(AF, X), sim(E, X))

1. AF 0.9 — A and F are merged at level 0.9.

|   | A  | B  | C  | D  | E  | F  |
|---|----|----|----|----|----|----|
| A | .  | .3 | .5 | .6 | .8 | .9 |
| B | .3 | .  | .4 | .5 | .7 | .8 |
| C | .5 | .4 | .  | .3 | .5 | .2 |
| D | .6 | .5 | .3 | .  | .4 | .1 |
| E | .8 | .7 | .5 | .4 | .  | .3 |
| F | .9 | .8 | .2 | .1 | .3 | .  |

2. AE 0.8 — E joins AF at level 0.8.

|    | AF | B  | C  | D  | E  |
|----|----|----|----|----|----|
| AF | .  | .8 | .5 | .6 | .8 |
| B  | .8 | .  | .4 | .5 | .7 |
| C  | .5 | .4 | .  | .3 | .5 |
| D  | .6 | .5 | .3 | .  | .4 |
| E  | .8 | .7 | .5 | .4 | .  |
Single-link Clustering (Continued)

3. BF 0.8 — B joins AEF at level 0.8.

|     | AEF | B  | C  | D  |
|-----|-----|----|----|----|
| AEF | .   | .8 | .5 | .6 |
| B   | .8  | .  | .4 | .5 |
| C   | .5  | .4 | .  | .3 |
| D   | .6  | .5 | .3 | .  |

4. BE 0.7 — already in the same cluster. (Note: E and B are on the same level.)

|      | ABEF | C  | D  |
|------|------|----|----|
| ABEF | .    | .5 | .6 |
| C    | .5   | .  | .3 |
| D    | .6   | .3 | .  |

sim(ABEF, X) = max(sim(AEF, X), sim(B, X))
sim(ABDEF, X) = max(sim(ABEF, X), sim(D, X))
Single-link Clustering (Continued)
5. AD 0.6 — D joins ABEF at level 0.6.

6. AC 0.5 — C joins ABDEF at level 0.5, completing the hierarchy (A-F at 0.9; E and B at 0.8; D at 0.6; C at 0.5).

|       | ABDEF | C  |
|-------|-------|----|
| ABDEF | .     | .5 |
| C     | .5    | .  |
Single-Link Clusters
• Similarity level 0.7 (i.e., similarity threshold)
  [Figure: chain E –.8– A –.9– F –.8– B with B –.7– E; C and D remain unclustered]
• Similarity level 0.5 (i.e., similarity threshold)
  [Figure: the same chain with D attached at .6 and C at .5]
Complete-link cluster generation

sim(AF, X) = min(sim(A, X), sim(F, X))

Initial similarity matrix:

|   | A  | B  | C  | D  | E  | F  |
|---|----|----|----|----|----|----|
| A | .  | .3 | .5 | .6 | .8 | .9 |
| B | .3 | .  | .4 | .5 | .7 | .8 |
| C | .5 | .4 | .  | .3 | .5 | .2 |
| D | .6 | .5 | .3 | .  | .4 | .1 |
| E | .8 | .7 | .5 | .4 | .  | .3 |
| F | .9 | .8 | .2 | .1 | .3 | .  |

| Step | Similarity pair | Check operations | Complete-link structure & pairs covered |
|------|-----------------|------------------|------------------------------------------|
| 1 | AF 0.9 | — | A-F merged at 0.9 |
| 2 | AE 0.8 | new; check EF | (A,E)(A,F) |
| 3 | BF 0.8 | check AB | (A,E)(A,F)(B,F) |
Complete-link cluster generation (Continued)

| Step | Similarity pair | Check operations | Complete-link structure & pairs covered |
|------|-----------------|------------------|------------------------------------------|
| 4 | BE 0.7 | new | B-E merged at 0.7 |
| 5 | AD 0.6 | check DF | (A,D)(A,E)(A,F)(B,E)(B,F) |
| 6 | AC 0.5 | check CF | (A,C)(A,D)(A,E)(A,F)(B,E)(B,F) |
| 7 | BD 0.5 | check DE | (A,C)(A,D)(A,E)(A,F)(B,D)(B,E)(B,F) |

Similarity matrix:

|    | AF | B  | C  | D  | E  |
|----|----|----|----|----|----|
| AF | .  | .3 | .2 | .1 | .3 |
| B  | .3 | .  | .4 | .5 | .7 |
| C  | .2 | .4 | .  | .3 | .5 |
| D  | .1 | .5 | .3 | .  | .4 |
| E  | .3 | .7 | .5 | .4 | .  |
Complete-link cluster generation (Continued)

| Step | Similarity pair | Check operations | Complete-link structure & pairs covered |
|------|-----------------|------------------|------------------------------------------|
| 8  | CE 0.5 | check BC | (A,C)(A,D)(A,E)(A,F)(B,D)(B,E)(B,F)(C,E) |
| 9  | BC 0.4 | check CE 0.5 (in the checklist) | C joins B-E at 0.4 |
| 10 | DE 0.4 | check BD 0.5 | (A,C)(A,D)(A,E)(A,F)(B,C)(B,D)(B,E)(B,F)(C,E)(D,E) |
| 11 | AB 0.3 | check AC 0.5, AE 0.8, BF 0.8, CF, EF | (A,B)(A,C)(A,D)(A,E)(A,F)(B,C)(B,D)(B,E)(B,F)(C,E)(D,E) |

Similarity matrix:

|    | AF | BE | C  | D  |
|----|----|----|----|----|
| AF | .  | .3 | .2 | .1 |
| BE | .3 | .  | .4 | .4 |
| C  | .2 | .4 | .  | .3 |
| D  | .1 | .4 | .3 | .  |
Complete-link cluster generation (Continued)

| Step | Similarity pair | Check operations | Complete-link structure & pairs covered |
|------|-----------------|------------------|------------------------------------------|
| 12 | CD 0.3 | check BD 0.5, DE 0.4 | D joins B-C-E at 0.3 |
| 13 | EF 0.3 | check BF 0.8, CF, DF | (A,B)(A,C)(A,D)(A,E)(A,F)(B,C)(B,D)(B,E)(B,F)(C,D)(C,E)(D,E)(E,F) |
| 14 | CF 0.2 | check BF 0.8, EF 0.3, DF | (A,B)(A,C)(A,D)(A,E)(A,F)(B,C)(B,D)(B,E)(B,F)(C,D)(C,E)(C,F)(D,E)(E,F) |

Similarity matrix:

|     | AF | BCE | D  |
|-----|----|-----|----|
| AF  | .  | .2  | .1 |
| BCE | .2 | .   | .3 |
| D   | .1 | .3  | .  |

Structure so far: B-E at 0.7, C joined at 0.4, D at 0.3.
Complete-link cluster generation (Continued)

| Step | Similarity pair | Check operations | Complete-link structure |
|------|-----------------|------------------|--------------------------|
| 15 | DF 0.1 | last pair | AF (at 0.9) and BCDE (B-E 0.7, C 0.4, D 0.3) joined at 0.1 |

|      | AF | BCDE |
|------|----|------|
| AF   | .  | .1   |
| BCDE | .1 | .    |
Complete-link clusters

• Similarity level 0.7: clusters {A, F} (0.9) and {B, E} (0.7); C and D unclustered.
• Similarity level 0.4: clusters {A, F} (0.9) and {B, C, E} (B-E 0.7, C joined at 0.4); D unclustered.
• Similarity level 0.3: clusters {A, F} (0.9) and {B, C, D, E} (B-E 0.7, C at 0.4, D at 0.3).
The Behavior of Single-Link Cluster
• The single-link process tends to produce a small number of large clusters that are characterized by a chaining effect.
• Each element is usually attached to only one other member of the same cluster at each similarity level.
• It is sufficient to remember the list of previously clustered single items.
The Behavior of Complete-Link Cluster
• Complete-link process produces a much larger number of small, tightly linked groupings.
• Each item in a complete-link cluster is guaranteed to resemble all other items in that cluster at the stated similarity level.
• It is necessary to remember the list of all item pairs previously considered in the clustering process.
The Behavior of Complete-Link Cluster(Continued)
• The complete-link clustering system may be better adapted to retrieval than the single-link clusters.
• A complete-link cluster generation is more expensive to perform than a comparable single-link process.
How to Generate Similarity
Di = (di1, di2, ..., dit)   document vector for Di
Lj = (lj1, lj2, ..., ljnj)  inverted list for term Tj
lji denotes the document identifier of the ith document listed under term Tj
nj denotes the number of postings for term Tj

for j = 1 to t (for each of t possible terms)
    for i = 1 to nj (for all nj entries on the jth list)
        compute sim(D(lji), D(ljk)) for i+1 <= k <= nj
    end for
end for
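A runnable sketch of this loop, assuming simple term-weight dictionaries (the document and term names are illustrative). Only document pairs sharing at least one term ever have a similarity accumulated:

```python
# Pairwise similarities computed by walking inverted lists: each term's
# posting list contributes one product per document pair on that list.
from collections import defaultdict

docs = {                       # document -> {term: weight}
    "D1": {"T1": 2, "T2": 3, "T3": 5},
    "D2": {"T1": 3, "T2": 7, "T3": 1},
    "D3": {"T4": 4},
}

# build the inverted lists L_j
inverted = defaultdict(list)
for doc, terms in docs.items():
    for term in terms:
        inverted[term].append(doc)

sims = defaultdict(float)
for term, postings in inverted.items():
    for i, d1 in enumerate(postings):
        for d2 in postings[i + 1:]:          # pairs sharing this term
            sims[tuple(sorted((d1, d2)))] += docs[d1][term] * docs[d2][term]

print(dict(sims))  # {('D1', 'D2'): 32.0}; D3 shares no term with anyone
```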
Similarity without Recomputation
for j = 1 to N (for each document in the collection)
    set S(i) = 0 for 1 <= i <= N
    for k = 1 to nj (for each term in document Dj)
        take up inverted list Lk
        for i = 1 to nk (for each document identifier on the list)
            if i < j or S(i) = 1
                take up the next document Di
            else
                compute sim(Dj, Di)
                set S(i) = 1
        end for
    end for
end for
Heuristic Clustering Methods
• Hierarchical clustering strategies
  – use all pairwise similarities between items
  – the cluster-generation process is relatively expensive
  – produce a unique set of well-formed clusters for each set of data, regardless of the order in which the similarity pairs are introduced into the clustering process
• Heuristic clustering methods
  – produce rough cluster arrangements at relatively little expense
Single-Pass Clustering Heuristic Methods
• Item 1 is first taken and placed into a cluster of its own.
• Each subsequent item is then compared against all existing clusters.
• It is placed in a previously existing cluster whenever it is similar to any existing cluster.
  – Compute the similarities between all existing centroids and the new incoming item.
  – When an item is added to an existing cluster, the corresponding centroid must then be appropriately updated.
• If a new item is not sufficiently similar to any existing cluster, the new item forms a cluster of its own.
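The heuristic above can be sketched as follows; the inner-product similarity, the threshold value, and the running-mean centroid update are illustrative choices, not prescribed by the method:

```python
# Single-pass clustering: each item joins the best existing cluster whose
# centroid similarity reaches the threshold, else it starts a new cluster.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def single_pass(items, threshold):
    clusters = []                              # list of [centroid, members]
    for item in items:
        best, best_sim = None, threshold
        for cluster in clusters:               # compare with every centroid
            s = dot(cluster[0], item)
            if s >= best_sim:
                best, best_sim = cluster, s
        if best is None:                       # not similar enough: own cluster
            clusters.append([list(item), [item]])
        else:                                  # join and update the centroid
            best[1].append(item)
            n = len(best[1])
            best[0] = [sum(m[i] for m in best[1]) / n
                       for i in range(len(item))]
    return [members for _, members in clusters]

groups = single_pass([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]], threshold=0.5)
print([len(g) for g in groups])  # [2, 1]
```

Note that the result depends on the order in which items arrive, which is exactly the weakness discussed next.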
Single-Pass Clustering Heuristic Methods(Continued)
• Produce uneven cluster structures.
• Solutions
  – cluster splitting: controls cluster sizes
  – variable similarity thresholds: control the number of clusters and the overlap among clusters
• Produce cluster arrangements that vary according to the order of individual items.
Cluster Splitting
[Figure: addition of one more item to cluster A; splitting cluster A into two pieces A' and A''; splitting supercluster S into two pieces S' and S'']
Cluster Searching
• Cluster centroid: the average vector of all the documents in a given cluster
• strategies
  – top down: the query is first compared with the highest-level centroids
  – bottom up: only the lowest-level centroids are stored; the higher-level cluster structure is disregarded
Top-down entire-clustering search
1. Initialize by adding the top item to the active node list.
2. Take the centroid with the highest query similarity from the active node list;
   if the number of singleton items in the subtree headed by that centroid is not larger than the number of items wanted,
   then retrieve these singleton items and eliminate the centroid from the active node list;
   else eliminate the centroid with the highest query similarity from the active node list and add its sons to the active node list.
3. If the number retrieved equals the number wanted, then stop;
   else repeat step 2.
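A sketch of this top-down search, assuming the hierarchy is stored as a child map whose leaves are documents and that each centroid's query similarity is already computed (the tree and scores below are illustrative):

```python
# Top-down entire-clustering search: expand the best centroid until a
# subtree is small enough to retrieve whole. The active node list is a
# max-heap keyed on query similarity.
import heapq

def leaves(tree, node):
    kids = tree.get(node, [])
    return [node] if not kids else [x for k in kids for x in leaves(tree, k)]

def top_down_search(tree, root, score, wanted):
    retrieved = []
    active = [(-score.get(root, 0.0), root)]   # max-heap via negated scores
    while active and len(retrieved) < wanted:
        _, node = heapq.heappop(active)        # highest query similarity
        items = leaves(tree, node)
        if len(items) <= wanted - len(retrieved):
            retrieved.extend(items)            # subtree small enough: take it
        else:
            for child in tree.get(node, []):   # otherwise expand its sons
                heapq.heappush(active, (-score.get(child, 0.0), child))
    return retrieved

tree = {"1": ["2", "3"], "2": ["A", "B"], "3": ["C"]}
score = {"1": 0.2, "2": 0.7, "3": 0.1}
print(top_down_search(tree, "1", score, 2))  # ['A', 'B']
```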
| Active node list | Number of singleton items in subtree | Retrieved items |
|------------------|---------------------------------------|-----------------|
| (1, 0.2) | 14 (too big) | |
| (2, 0.5), (4, 0.7), (3, 0) | 6 (too big) | |
| (2, 0.5), (8, 0.8), (9, 0.3), (3, 0) | 2 | I, J |
| (2, 0.5), (9, 0.3), (3, 0) | 4 (too big) | |
| (5, 0.6), (6, 0.5), (9, 0.3), (3, 0) | 2 | A, B |
Bottom-up Individual-Cluster Search
Take a specified number of low-level centroids;
if there are enough singleton items in those clusters to equal the number of items wanted,
then retrieve the number of items wanted in ranked order;
else add additional low-level centroids to the list and repeat the test.
Active centroid list: (8, .8), (4, .7), (5, .6)
Ranked documents from clusters: (I, .9), (L, .8), (A, .8), (K, .6), (B, .5), (J, .4), (N, .4), (M, .2)
Retrieved items: I, L, A