
Duygu Durmuş


Computer Engineering Department Bilkent University

CS533 – Information Retrieval Systems Homework 2

Question 1 Solution:

The minimum number of target clusters to be accessed is 1 (best case). The maximum number of target clusters to be accessed is the number of relevant documents, 3 (worst case).

n_c = 4 (number of clusters)
m = 30 (number of documents)
k = 3 (number of relevant documents to be accessed)
|C_1| = 2, |C_2| = 4, |C_3| = 8, |C_4| = 16
m_1 = m - |C_1| = 30 - 2 = 28
m_2 = m - |C_2| = 30 - 4 = 26
m_3 = m - |C_3| = 30 - 8 = 22
m_4 = m - |C_4| = 30 - 16 = 14

To find the expected number of target clusters (n_tr), Yao's formula is used:

$$P_j = 1 - \prod_{i=1}^{k} \frac{m_j - i + 1}{m - i + 1}, \qquad n_{tr} = \sum_{j=1}^{n_c} P_j$$

$$P_1 = 1 - \frac{28 - 1 + 1}{30 - 1 + 1} \cdot \frac{28 - 2 + 1}{30 - 2 + 1} \cdot \frac{28 - 3 + 1}{30 - 3 + 1} = 1 - \frac{28}{30} \cdot \frac{27}{29} \cdot \frac{26}{28} = 0.19$$

$$P_2 = 1 - \frac{26 - 1 + 1}{30 - 1 + 1} \cdot \frac{26 - 2 + 1}{30 - 2 + 1} \cdot \frac{26 - 3 + 1}{30 - 3 + 1} = 1 - \frac{26}{30} \cdot \frac{25}{29} \cdot \frac{24}{28} = 0.36$$

$$P_3 = 1 - \frac{22 - 1 + 1}{30 - 1 + 1} \cdot \frac{22 - 2 + 1}{30 - 2 + 1} \cdot \frac{22 - 3 + 1}{30 - 3 + 1} = 1 - \frac{22}{30} \cdot \frac{21}{29} \cdot \frac{20}{28} = 0.62$$

$$P_4 = 1 - \frac{14 - 1 + 1}{30 - 1 + 1} \cdot \frac{14 - 2 + 1}{30 - 2 + 1} \cdot \frac{14 - 3 + 1}{30 - 3 + 1} = 1 - \frac{14}{30} \cdot \frac{13}{29} \cdot \frac{12}{28} = 0.91$$

$$n_{tr} = \sum_{j=1}^{n_c} P_j = P_1 + P_2 + P_3 + P_4 = 0.19 + 0.36 + 0.62 + 0.91 = 2.08 \approx 2 \text{ target clusters}$$
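To double-check the arithmetic, a minimal Python sketch of this computation is given below; the function name and code are illustrative, not part of the assignment.

```python
# Expected number of target clusters via Yao's formula, as computed above.
def yao_expected_clusters(cluster_sizes, m, k):
    """Expected number of clusters containing at least one of the k relevant documents."""
    n_tr = 0.0
    for c in cluster_sizes:
        m_j = m - c                      # documents outside cluster j
        prob_none = 1.0                  # probability that no relevant doc falls in cluster j
        for i in range(1, k + 1):
            prob_none *= (m_j - i + 1) / (m - i + 1)
        n_tr += 1.0 - prob_none          # P_j = probability that cluster j is a target cluster
    return n_tr

print(yao_expected_clusters([2, 4, 8, 16], m=30, k=3))   # ~2.08
```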


Question 2 Solution:

a. How can we use the concepts of C3M in data stream environments?

Clustering is beneficial for both searching and browsing of very large document databases. In data stream environments the document database changes continuously (documents arrive and leave), so the clusters need to be updated periodically, which calls for cluster maintenance. An incremental clustering method that extends C3M can be suggested to handle this dynamism on the generated clustering structures. C2ICM (Cover-Coefficient-based Incremental Clustering Methodology) is such an extension of C3M. Like C3M, C2ICM provides a measure of similarity among documents based on the cover-coefficient concept. This concept is first used to determine the number of clusters and the cluster seeds; then the non-seed documents are assigned to the clusters initiated by the seed documents. Thus, we can use C3M (Cover-Coefficient-based Clustering Methodology) to form the initial clusters: incremental clustering starts with a set of documents, which are clustered with the C3M algorithm. After that, as the environment changes dynamically, the clusters are updated with the C2ICM algorithm given in the "Incremental clustering for dynamic information processing" paper mentioned in the assignment description [1].
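For concreteness, a small sketch of the cover-coefficient computation that C3M/C2ICM build on is given below. It follows the standard definition (c_ij = alpha_i * sum_k d_ik * beta_k * d_jk); the example matrix is made up, so treat it only as an illustration.

```python
# Cover-coefficient matrix: c[i][j] is the extent to which document i is "covered"
# by document j; the sum of the diagonal entries estimates the number of clusters.
import numpy as np

def cover_coefficients(D):
    """D: m x n binary document-term matrix (numpy array)."""
    alpha = 1.0 / D.sum(axis=1, keepdims=True)   # 1 / (number of terms in each document)
    beta = 1.0 / D.sum(axis=0, keepdims=True)    # 1 / (number of documents containing each term)
    return (alpha * D) @ (beta * D).T            # c_ij = alpha_i * sum_k d_ik * beta_k * d_jk

D = np.array([[1, 1, 0, 0],       # toy 3-document, 4-term matrix
              [1, 0, 1, 0],
              [0, 0, 1, 1]])
C = cover_coefficients(D)
n_c = C.diagonal().sum()          # estimated number of clusters (sum of decoupling coefficients)
print(round(n_c, 2))              # 2.0 for this toy matrix
```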

b. How can we modify the single link clustering approach for data streams? Is it possible or not? Please explain.

It is possible: single link clustering can be modified for a data stream environment. The modification must take the dynamically changing document collection into account, so that arriving and departing objects trigger addition and deletion updates of the clusters within the latest time window. Single link clustering defines the similarity between clusters as the maximum similarity between any two of their documents. To modify single-link clustering for this purpose, each incoming document is considered for addition: the similarity between the incoming document and every existing document in the environment is computed. Then the best (in this case maximum) of these similarity values is found. If this similarity value is higher than the similarity values of the current clusters, a new subtree is needed: a new subtree is created under the existing subtree, connecting the existing and the incoming document at this similarity value. Otherwise, if this similarity value is lower than the similarity values of the current clusters, a new branch is needed: a branch is added that connects the subtree with the incoming document at the given similarity value. These steps can be applied repeatedly for each incoming document, similar to the C2ICM algorithm. A simplified sketch of this idea is given below.
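The sketch below is my own illustrative variant rather than the exact procedure: it keeps flat clusters and uses a fixed similarity threshold in place of the dendrogram levels described above.

```python
# Incremental single-link insertion (simplified): the incoming document is compared with
# all existing documents; if its best similarity exceeds a threshold it joins that
# document's cluster, otherwise it starts a new cluster.
def single_link_insert(doc_id, existing_docs, clusters, sim, threshold=0.5):
    """existing_docs: list of doc ids already seen; clusters: dict doc_id -> cluster_id;
    sim(a, b): similarity function; returns the updated clusters dict."""
    if not existing_docs:
        clusters[doc_id] = 0
        return clusters
    # single-link view: similarity to a cluster = max similarity to any of its members
    best_doc = max(existing_docs, key=lambda d: sim(doc_id, d))
    if sim(doc_id, best_doc) >= threshold:
        clusters[doc_id] = clusters[best_doc]          # join the nearest cluster
    else:
        clusters[doc_id] = max(clusters.values()) + 1  # start a new cluster
    return clusters
```

In a full dendrogram-based version, the fixed threshold would be replaced by the similarity levels of the existing subtrees, as described above.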

Question 3 Solution:

a. Is this problem similar to document snippet generation by search engines for their search results?

Cluster labeling is a problem similar to document snippet generation by search engines for their search results. Document snippet generation aims to summarize a document in several ways in order to obtain a description of a page that can be shown as a snippet. Similarly, the cluster labeling problem is based on generating multilevel summaries of a set of documents, much like the snippets returned by a search engine. Snippet generation by search engines is driven by queries and search results; for example, a search engine may use the meta description created for a page as a snippet summarizing the page's content, which parallels the problem of producing multilevel summaries of a set of documents.

b. Find two papers from literature on cluster label generation. Explain each one separately with two sentences.

• "Cluster Generation and Cluster Labeling for Web Snippets: A Fast and Accurate Hierarchical Solution" by Filippo Geraci, Marco Pellegrini, Marco Maggini and Fabrizio Sebastiani.

This paper describes a meta-search engine that groups the Web snippets returned by auxiliary search engines into disjoint labelled clusters, where the labeling step depends strongly on the clusters found. Cluster generation is performed with a modified version of the furthest-point-first algorithm (M-FPF), and cluster labeling, which aims to extract from the set of snippets assigned to each cluster a sequence of words highly descriptive of the corresponding group of items, is achieved by combining intra-cluster and inter-cluster term extraction based on a variant of the information gain measure [2].

• “Automatically Labeling Hierarchical Clusters” by Pucktada Treeratpituk and Jamie Callan

This paper describes a simple algorithm that automatically assigns labels to hierarchical clusters, similar to the labels created by the Open Directory Project (ODP). The algorithm evaluates candidate labels using information from the cluster, its parent cluster and corpus statistics, together with a trainable threshold, which enables the assignment of high-quality labels to each cluster [3].

c. Suggest a cluster labeling method of yours briefly. Give a step-by-step explanation. Explain your intuition and explain why you would expect that it would work.

First, the data objects should be divided into groups. Then, the groups should be labelled with the best labels derived from the data, as explained step by step below (a small code sketch follows the list). I expect it to work because the possible labels are extracted from the documents in the cluster and scored based on their occurrences in the cluster's documents and on their distances to the cluster centroid. The candidates are then evaluated with mutual information, which measures the importance of the absence/presence of a term in a document, so it shows how important a word is for the cluster and how well it can summarize it.

Clustering Procedure
• Discover clusters with a chosen method such as k-means clustering, where each cluster is represented by the centroid of the cluster's documents.

Cluster Labelling Procedure
• Find the list of terms in each cluster, excluding stop words and punctuation.
• Score each term based on the sum of its occurrences in the cluster and its distance to the cluster centroid.
• Select the highest-scored terms as the important terms of the cluster.
• Extract candidate labels for each cluster by using the top-k important terms of the cluster.
• Evaluate and score the candidate labels for each cluster with mutual information, so that labels closer to the cluster content are preferred.
• Choose the best labels for each cluster based on this evaluation and the scores.
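A minimal sketch of this labeling procedure is given below. It assumes the documents are already clustered and tokenized, omits the centroid-distance component for brevity, and uses a simplified mutual-information-style weight, so it only illustrates the steps above.

```python
# Candidate-label scoring for one cluster: frequent terms inside the cluster that are
# rare outside it are preferred.
from collections import Counter
import math

def label_cluster(cluster_docs, all_docs, stopwords, top_k=5):
    """cluster_docs / all_docs: lists of token lists; returns ranked candidate labels."""
    # Step 1: term frequencies inside the cluster, stop words removed
    tf = Counter(t for doc in cluster_docs for t in doc if t not in stopwords)
    # Steps 2-4: keep the top-k most frequent terms as the important terms
    important = [t for t, _ in tf.most_common(top_k)]
    # Step 5: weight each candidate by how much more likely it is inside the cluster
    #         than in the whole corpus (a pointwise mutual-information-style score)
    n_total, n_cluster = len(all_docs), len(cluster_docs)
    scored = []
    for t in important:
        in_cluster = sum(1 for d in cluster_docs if t in d)
        in_corpus = sum(1 for d in all_docs if t in d)
        p_t_given_c = in_cluster / n_cluster
        p_t = max(in_corpus / n_total, 1e-9)
        scored.append((t, p_t_given_c * math.log(p_t_given_c / p_t + 1e-9)))
    # Step 6: best-scoring terms become the cluster label
    return [t for t, _ in sorted(scored, key=lambda x: -x[1])]

docs = [["ranking", "retrieval", "fusion"], ["retrieval", "ranking"], ["soccer", "goal"]]
print(label_cluster(docs[:2], docs, stopwords={"the"}, top_k=3))
```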

d. Can we do user-oriented cluster labeling? Under what condition(s) would such a thing make sense?

The problem with cluster labeling is that it is difficult to pick descriptive and human-readable cluster labels. After extracting candidate labels and choosing the best ones, if the chosen labels are not descriptive enough for the users, then user-oriented cluster labeling should be considered. Also, clusters might contain many documents with a wide range of terms; in this case the cluster labels have the potential to mislead the user and it might be hard to obtain descriptive labels, which again calls for user-oriented cluster labeling.


Question 4 Solution:

a. Which problem do you think may give the users of machine learning a false happiness?

Overfitting occurs when the model fits the training data too well [4]. In this case the model memorizes the training data and loses its ability to generalize. It gives good accuracy on the training set; however, this is a false happiness, because the model performs well only because it fits the training data perfectly, and since it has lost its generalizability it gives poor accuracy on new data. Unbalanced class distributions may also cause a false happiness because of the accuracy paradox: the overall accuracy looks perfect, but the per-class accuracies are not nearly as good. The reason is that the model is biased towards the majority class, which results in good classification of the majority class and poor classification of the minority classes, as illustrated below.
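A tiny illustration of the accuracy paradox with made-up numbers:

```python
# A majority-class classifier looks good on overall accuracy but fails on the minority class.
y_true = [0] * 95 + [1] * 5          # 95% majority class, 5% minority class
y_pred = [0] * 100                   # classifier that always predicts the majority class

overall_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
minority_acc = sum(t == p for t, p in zip(y_true, y_pred) if t == 1) / 5

print(overall_acc)   # 0.95 -> looks very good
print(minority_acc)  # 0.0  -> the minority class is never recognized
```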

b. Do we have problems similar to overfitting and underfitting in clustering? Please explain. If you use a resource for your explanation, please cite that work(s).

Overfitting occurs when the model fits the training data too well. The model then gives good accuracy only on the training data; since it memorizes noise and details of the training data and loses its ability to generalize, its performance on new data suffers. Underfitting occurs when the model can neither model the training data nor generalize to new data; besides its negative impact on new data, it also performs poorly on the training data. The ideal case is a model at a good spot between overfitting and underfitting. There are similar problems in clustering. The overfitting-like worst case is that, while clustering the documents, each document is assigned to its own cluster, so the number of clusters equals the number of documents. The clustering then "memorizes" the documents and loses its generalizability; in this case clusters should be merged according to the similarities of their documents.


The underfitting-like worst case is that, while clustering the documents, all documents are assigned to a single cluster, so the number of clusters is 1. The clustering again loses its usefulness because it cannot separate the documents effectively; in this case the cluster should be split according to the dissimilarities of its documents.

Question 5 Solution:

$$S = \begin{bmatrix} 1.00 & 0.67 & 0.50 & 0.20 \\ - & 1.00 & 0.80 & 0.10 \\ - & - & 1.00 & 0.00 \\ - & - & - & 1.00 \end{bmatrix}$$

The similarity matrix for the four documents is given above. Maximal Marginal Relevance (MMR) is defined as follows:

$$\mathrm{MMR} = \arg\max_{d_i \in R \setminus S} \left[ \lambda \cdot \mathrm{sim}(d_i, q) - (1 - \lambda) \cdot \max_{d_j \in S} \mathrm{sim}(d_i, d_j) \right]$$

Similarities of these documents to a given query:

sim (q, d1) = 0.70

sim (q, d2) = 0.40

sim (q, d3) = 0.60

sim (q, d4) = 0.80


At first, the set S is always empty. The document most similar to the given query is chosen as its first element; in this example, d4 is picked as the first element of S. After each part, the corresponding diversity value is calculated with the following formula:

$$div(S) = 1 - \frac{1}{\binom{|S|}{2}} \sum_{\forall d_i, d_j \in S} s(d_i, d_j)$$

a. Use λ = 1.00

When λ = 1.00, the diversity term is always zero due to the multiplication and the relevance term dominates the calculation. Thus, only the similarity of documents with respect to the query is considered.

$$\mathrm{MMR}(\lambda = 1) = \arg\max_{d_i \in R \setminus S} \left[ 1 \cdot \mathrm{sim}(d_i, q) - (1 - 1) \cdot \max_{d_j \in S} \mathrm{sim}(d_i, d_j) \right] = \arg\max_{d_i \in R \setminus S} \ \mathrm{sim}(d_i, q)$$

At first, S is empty. Then, d4 is picked because the similarity of d4 to a given query is maximum, sim (q, d4) = 0.80. The similarity of d1 is second highest with sim (q, d1) = 0.70. Thus, S = {d4, d1} and R\S = {d2, d3}. The corresponding diversity value is given below:

div(S) = 1 − s(d4, d1) = 1 − 0.20 = 0.80

b. Use λ = 0

When λ = 0, the relevance term is always zero due to multiplication and the diversity term dominates the calculation. Thus, only the diversity term is considered.

$$\mathrm{MMR}(\lambda = 0) = \arg\max_{d_i \in R \setminus S} \left[ 0 \cdot \mathrm{sim}(d_i, q) - (1 - 0) \cdot \max_{d_j \in S} \mathrm{sim}(d_i, d_j) \right] = \arg\max_{d_i \in R \setminus S} \left[ - \max_{d_j \in S} \mathrm{sim}(d_i, d_j) \right] = \arg\min_{d_i \in R \setminus S} \ \max_{d_j \in S} \mathrm{sim}(d_i, d_j)$$

At first, S is empty. Then, d4 is picked because its similarity to the given query is maximum, sim (q, d4) = 0.80. To maximize the MMR expression we have to minimize the maximum similarity to the documents already in S = {d4}; the candidate similarities are given below, and the smallest one maximizes the expression.

sim (d4, d1) = 0.20

sim (d4, d2) = 0.10

sim (d4, d3) = 0


The minimum similarity is between d4 and d3. Therefore, d3 is picked. The resulting set is S = {d4, d3}. The corresponding diversity value is given below:

div(S) = 1 − s(d4, d3) = 1 − 0 = 1

c. Use λ = 0.5

When λ = 0.5, both diversity and relevance terms are considered equally.

$$\mathrm{MMR}(\lambda = 0.5) = \arg\max_{d_i \in R \setminus S} \left[ 0.5 \cdot \mathrm{sim}(d_i, q) - 0.5 \cdot \max_{d_j \in S} \mathrm{sim}(d_i, d_j) \right]$$

At first, S is empty. Then, d4 is picked because its similarity to the given query is maximum, which means it is the most relevant document to the query: sim (q, d4) = 0.80. Then each document in R\S = {d1, d2, d3} is considered in the MMR formula and the one with the highest MMR value is chosen.

S = {d4}

R\S = {d1, d2, d3}

Consider d1:

S = {d1, d4}

sim (q, d1) = 0.70

sim (d4, d1) = 0.20

MMR(S) = (0.5) * (0.70) − (0.5) * (0.20) = 0.25

Consider d2:

S = {d2, d4}

sim (q, d2) = 0.40

sim (d4, d2) = 0.10

MMR(S) = (0.5) * (0.40) − (0.5) * (0.10) = 0.15

Consider d3:

S = {d3, d4}

sim (q, d3) = 0.60

sim (d4, d3) = 0


MMR(S) = (0.5) * (0.60) − (0.5) * (0) = 0.30

Based on the considerations above, d3 is picked since it has the highest MMR.

Thus, S = {d3, d4}. The corresponding diversity value is given below:

div(S) = 1 − s(d4, d3) = 1 − 0 = 1

d. Use MMR to select the best three documents for λ = 0.5

S = {d3, d4} is selected from part c.

R\S = {d1, d2}

Consider d1:

max {sim (d1, d3), sim (d1, d4)} = max {0.50,0.20} = 0.50 => sim (d1, d3)

sim (q, d1) = 0.70

MMR(S) = (0.5) * (0.70) − (0.5) * (0.50) = 0.10

Consider d2:

max {sim (d2, d3), sim (d2, d4)} = max {0.80,0.10} = 0.80 => sim (d2, d3)

sim (q, d2) = 0.40

MMR(S) = (0.5) * (0.40) − (0.5) * (0.80) = −0.20

Therefore, we should select d1 since it has the highest MMR, and add it to the set S, so it becomes S = {d1, d3, d4}.

div(S) = 1 − [s(d4, d3) + s(d1, d3) + s(d1, d4)] / 3 = 1 − (0 + 0.50 + 0.20) / 3 = 0.766
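A minimal sketch reproducing the MMR selections of parts a–d, with the similarity values of this question hard-coded (the helper names are illustrative):

```python
# sim_q holds sim(q, d_i); sim_dd holds the pairwise similarities s(d_i, d_j) from S.
sim_q = {"d1": 0.70, "d2": 0.40, "d3": 0.60, "d4": 0.80}
sim_dd = {frozenset(p): v for p, v in {
    ("d1", "d2"): 0.67, ("d1", "d3"): 0.50, ("d1", "d4"): 0.20,
    ("d2", "d3"): 0.80, ("d2", "d4"): 0.10, ("d3", "d4"): 0.00}.items()}

def s(a, b):
    return sim_dd[frozenset((a, b))]

def mmr_select(lmbda, n):
    selected = [max(sim_q, key=sim_q.get)]            # first pick: most relevant to q
    while len(selected) < n:
        remaining = [d for d in sim_q if d not in selected]
        best = max(remaining, key=lambda d: lmbda * sim_q[d]
                   - (1 - lmbda) * max(s(d, j) for j in selected))
        selected.append(best)
    return selected

print(mmr_select(0.5, 3))   # ['d4', 'd3', 'd1'], matching part d
```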

e. How can we use an approach like MMR for a task other than summarization and ranking? In your answer, you may replace relevance and diversity with other concepts.

The purpose of Maximal Marginal Relevance is to improve user satisfaction by providing useful information to the user while minimizing redundancy: it merges relevance and diversity. According to the paper [5], users showed a preference for MMR reordering when navigating and locating relevant candidate documents quickly, and for pure-relevance ranking when looking at closely related documents within the same relevance band; most users discovered the differential utility of diversity-based versus relevance-only search. Based on the results obtained in parts a, b and c, decreasing the λ constant increases the diversity of the resulting set S, while increasing λ increases relevance and accuracy with respect to the query. This tradeoff between relevance and diversity can therefore be mapped onto other pairs of concepts.

In the study, most people preferred the MMR method for search tasks in the financial domain and for giving their opinion about interesting topics. In finance, the tradeoff can be between risk and return. In statistics and machine learning, it can be between bias and variance, where high bias with low variance causes underfitting and low bias with high variance causes overfitting. In economics, it can be expressed in terms of the opportunity cost of a choice, i.e., the sacrifice that must be made to obtain one good or service rather than others that could be obtained with the same resources. An MMR-like approach can be used to balance such concept pairs in these areas.

Question 6 Solution: Given search engines A, B, C and D and the rankings provided by them for the documents a, b, c, d, e and f:

A = {b, a, c, d}, B = {b, d, a, f}, C = {b, d, c, a}, D = {a, c, d, e}

a. Reciprocal rank

$$r(d_i) = \frac{1}{\sum_{j} \frac{1}{\mathrm{position}(d_{ij})}}$$

where d_i is document i and position(d_ij) is the rank of document i in system j. If a document does not appear in a system's ranking list, the corresponding term in the denominator is zero. The reciprocal rank formula above is used to calculate a rank value for each document (smaller values are better).


A = {b, a, c, d}, B = {b, d, a, f}, C = {b, d, c, a}, D = {a, c, d, e}

For document a: $r(a) = \dfrac{1}{\frac{1}{2} + \frac{1}{3} + \frac{1}{4} + \frac{1}{1}} = 0.48$

For document b: $r(b) = \dfrac{1}{\frac{1}{1} + \frac{1}{1} + \frac{1}{1} + 0} = 0.33$

For document c: $r(c) = \dfrac{1}{\frac{1}{3} + 0 + \frac{1}{3} + \frac{1}{2}} = 0.85$

For document d: $r(d) = \dfrac{1}{\frac{1}{4} + \frac{1}{2} + \frac{1}{2} + \frac{1}{3}} = 0.63$

For document e: $r(e) = \dfrac{1}{0 + 0 + 0 + \frac{1}{4}} = 4$

For document f: $r(f) = \dfrac{1}{0 + \frac{1}{4} + 0 + 0} = 4$

Ranking the documents in increasing order of r (smaller values rank higher):

r(b) = 0.33, r(a) = 0.48, r(d) = 0.63, r(c) = 0.85, r(e) = 4, r(f) = 4

According to Reciprocal Rank, b > a > d > c > e = f
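A small sketch of this reciprocal-rank computation, with the four rankings hard-coded:

```python
rankings = {"A": ["b", "a", "c", "d"], "B": ["b", "d", "a", "f"],
            "C": ["b", "d", "c", "a"], "D": ["a", "c", "d", "e"]}
docs = ["a", "b", "c", "d", "e", "f"]

def reciprocal_rank(doc):
    # sum 1/position over the systems that retrieve the document; missing documents add 0
    total = sum(1.0 / (r.index(doc) + 1) for r in rankings.values() if doc in r)
    return 1.0 / total

for d in sorted(docs, key=reciprocal_rank):
    print(d, round(reciprocal_rank(d), 3))   # best (smallest r) to worst: b, a, d, c, e, f
```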

b. Borda Count

Borda Count ranks documents based on votes determined by the total number of documents: the top-ranked document gets n votes, where n is the total number of documents, and each subsequent document gets one vote less. In this example there are 6 documents (a, b, c, d, e, f), so the top document of each search engine gets 6 votes, the next one 5, and so on; a document that a search engine does not retrieve gets 0 votes from that engine. The Borda Count (BC) of a document is calculated by summing its values over all search engines.

A = {b, a, c, d}, B = {b, d, a, f}, C = {b, d, c, a}, D = {a, c, d, e}


For document a: BC(a) = BC_A(a) + BC_B(a) + BC_C(a) + BC_D(a) = 5 + 4 + 3 + 6 = 18
For document b: BC(b) = BC_A(b) + BC_B(b) + BC_C(b) + BC_D(b) = 6 + 6 + 6 + 0 = 18
For document c: BC(c) = BC_A(c) + BC_B(c) + BC_C(c) + BC_D(c) = 4 + 0 + 4 + 5 = 13
For document d: BC(d) = BC_A(d) + BC_B(d) + BC_C(d) + BC_D(d) = 3 + 5 + 5 + 4 = 17
For document e: BC(e) = BC_A(e) + BC_B(e) + BC_C(e) + BC_D(e) = 0 + 0 + 0 + 3 = 3
For document f: BC(f) = BC_A(f) + BC_B(f) + BC_C(f) + BC_D(f) = 0 + 3 + 0 + 0 = 3

According to Borda Count, a = b > d > c > e = f
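A small sketch of this Borda count variant (un-retrieved documents get 0 votes, as in the calculations above):

```python
def borda_scores(rankings, docs):
    n = len(docs)
    scores = {d: 0 for d in docs}
    for ranking in rankings.values():
        for pos, d in enumerate(ranking):
            scores[d] += n - pos      # top document gets n votes, next one n-1, ...
    return scores

rankings = {"A": ["b", "a", "c", "d"], "B": ["b", "d", "a", "f"],
            "C": ["b", "d", "c", "a"], "D": ["a", "c", "d", "e"]}
docs = ["a", "b", "c", "d", "e", "f"]
print(borda_scores(rankings, docs))   # {'a': 18, 'b': 18, 'c': 13, 'd': 17, 'e': 3, 'f': 3}
```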

c. Condorcet

Condorcet counts how many times each document wins, loses, or ties against every other document. After counting, these values are placed in a matrix, and the total numbers of wins, losses and ties for each document are computed from that table to obtain the ranking.

A = {b, a, c, d}, B = {b, d, a, f}, C = {b, d, c, a}, D = {a, c, d, e}

i.e., A: b > a > c > d, B: b > d > a > f, C: b > d > c > a, D: a > c > d > e.

Each non-diagonal entry (i, j) of the matrix shows the number of votes for i over j.

      a          b          c          d          e          f
a     -          (1, 3, 0)  (3, 1, 0)  (2, 2, 0)  (4, 0, 0)  (4, 0, 0)
b     (3, 1, 0)  -          (3, 1, 0)  (3, 1, 0)  (3, 1, 0)  (3, 0, 1)
c     (1, 3, 0)  (1, 3, 0)  -          (2, 2, 0)  (3, 0, 1)  (3, 1, 0)
d     (2, 2, 0)  (1, 3, 0)  (2, 2, 0)  -          (4, 0, 0)  (4, 0, 0)
e     (0, 4, 0)  (1, 3, 0)  (0, 3, 1)  (0, 4, 0)  -          (1, 1, 2)
f     (0, 4, 0)  (0, 3, 1)  (1, 3, 0)  (0, 4, 0)  (1, 1, 2)  -

Table 1: Condorcet Win-Lose-Tie table for documents (win, lose, tie)


*Cell [i, j] shows the number of wins, losses and ties of document i against document j, respectively.

Reminder:

• If both documents exist in a search engine's ranking, the one with higher precedence wins and the other loses.
• If both documents exist in a search engine's ranking with the same precedence, it is a tie.
• If neither document exists in a search engine's ranking, it is a tie.
• If only one of the documents exists in a search engine's ranking, the one that exists wins.

     Win  Lose  Tie
a    3    1     1
b    5    0     0
c    2    2     1
d    2    1     2
e    0    4     1
f    0    4     1

Table 2: Condorcet Count Results

According to Condorcet, b > a > c = d > e = f
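A sketch of the Condorcet counting above; for every pair of documents it tallies wins, losses and ties across the four systems using the rules listed in the reminder, then aggregates each pairwise contest into the per-document totals of Table 2:

```python
from itertools import combinations

rankings = {"A": ["b", "a", "c", "d"], "B": ["b", "d", "a", "f"],
            "C": ["b", "d", "c", "a"], "D": ["a", "c", "d", "e"]}
docs = ["a", "b", "c", "d", "e", "f"]

wins = {d: 0 for d in docs}
losses = {d: 0 for d in docs}
ties = {d: 0 for d in docs}

for x, y in combinations(docs, 2):
    x_votes = y_votes = 0
    for r in rankings.values():
        if x in r and y in r:                 # both retrieved: earlier position wins
            if r.index(x) < r.index(y):
                x_votes += 1
            else:
                y_votes += 1
        elif x in r:                          # only one retrieved: it wins the vote
            x_votes += 1
        elif y in r:
            y_votes += 1
        # neither retrieved: this system contributes a tie (no vote)
    if x_votes > y_votes:
        wins[x] += 1; losses[y] += 1
    elif y_votes > x_votes:
        wins[y] += 1; losses[x] += 1
    else:
        ties[x] += 1; ties[y] += 1

for d in docs:
    print(d, wins[d], losses[d], ties[d])   # matches Table 2: b 5-0-0, a 3-1-1, ...
```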

References

[1] F. Can, "Incremental clustering for dynamic information processing," ACM Transactions on Information Systems, 1993.

[2] F. Geraci, M. Pellegrini, M. Maggini and F. Sebastiani, "Cluster Generation and Cluster Labeling for Web Snippets: A Fast and Accurate Hierarchical Solution."

[3] P. Treeratpituk and J. Callan, "Automatically Labeling Hierarchical Clusters," ACM.

[4] P. Domingos, "A few useful things to know about machine learning," Communications of the ACM, 2012.

[5] J. G. Carbonell and J. Goldstein, "The use of MMR, diversity-based reranking for reordering documents and producing summaries," ACM SIGIR Conference, 1998.

[6] R. Nuray and F. Can, "Automatic ranking of information retrieval systems using data fusion," Information Processing and Management, 2006.