Score-based ranking of the documents
Submitted By: Kriti Khanna(9910103499)
F4, CSE, 4th year
OUTLINE
• Introduction
• Literature Survey
• Objective
• Flowchart
• Implementation
• Tools and techniques
• References
INTRODUCTION
• Information Retrieval
• Ranking
• Weight
• Score
Information Retrieval
• Information retrieval (IR) is the task of obtaining resources relevant to an information need from a collection of information resources.
• It is used to reduce information overload.
• Typical applications : web search engines; public libraries also use IR systems to provide access to books, journals and other documents.
Abstract Model of IR
Brief working of IR system
• The user enters the query in his or her own language.
• The query development function converts the user query into a formal query so that it matches the system's vocabulary of retrieval commands. It is one of the important intermediary steps that take place inside the database.
• The retrieved data, which may be complete or incomplete, is then sorted to generate the final result set.
Ranking
• Ranking orders the matching documents according to their relevance to a given search query.
• Each document is assigned a numerical score by a ranking function, which incorporates features of the document, the query, and the overall document collection.
Some simple ranking functions
• Constant ranking function : the same score is assigned to all documents.
• Term frequency ranking function : counting the number of times that each query term occurs in the document, then summing these.
• The tf-idf ranking function : computing the product of the term frequency and inverse document frequency for each query term, then summing these.
• Okapi BM25 : computing, for each query term, the product of its idf and a saturated, length-normalized term frequency, then summing these.
• Machine-learned ranking formulas, obtained automatically from training data by machine learning methods.
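As an illustration of the term frequency and tf-idf ranking functions listed above, here is a minimal Python sketch. The function names, the toy corpus and the query are made up for the example, and a base-10 logarithm is assumed for idf; it is not the project's implementation.

```python
import math
from collections import Counter

def tf_score(query_terms, doc_terms):
    """Term frequency ranking: sum the raw counts of each query term."""
    counts = Counter(doc_terms)
    return sum(counts[t] for t in query_terms)

def tf_idf_score(query_terms, doc_terms, corpus):
    """tf-idf ranking: sum tf * idf over the query terms."""
    counts = Counter(doc_terms)
    n_docs = len(corpus)
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in corpus if t in d)  # documents containing t
        if df == 0:
            continue  # a term absent from the collection contributes nothing
        score += counts[t] * math.log10(n_docs / df)
    return score

# Toy corpus (hypothetical): each document is a list of tokens.
corpus = [
    ["ranking", "of", "documents"],
    ["score", "based", "ranking", "ranking"],
    ["public", "library", "catalogue"],
]
query = ["ranking", "score"]

# Document indices ordered from highest to lowest tf-idf score.
ranked = sorted(range(len(corpus)),
                key=lambda i: tf_idf_score(query, corpus[i], corpus),
                reverse=True)
```

Here the second document ranks first because it contains both query terms, and the third ranks last because it contains neither.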
Score Calculation
• The score of each document is calculated by multiplying, for each query term, the term's weight in the document by its weight in the query, then summing these products.
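The score calculation can be sketched in Python as a sum of products over the query terms. The weights below are invented purely for illustration, and storing them in dictionaries keyed by term is an assumption of this sketch.

```python
def score(doc_weights, query_weights):
    """Sum, over the query terms, of query weight times document weight."""
    return sum(w_q * doc_weights.get(term, 0.0)
               for term, w_q in query_weights.items())

# Hypothetical term weights for one document and one query.
doc_weights = {"ranking": 0.8, "score": 0.6}
query_weights = {"ranking": 1.0, "score": 0.5, "library": 2.0}

result = score(doc_weights, query_weights)  # 1.0*0.8 + 0.5*0.6 + 2.0*0.0
```

A query term that does not occur in the document ("library" here) contributes nothing to the score.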
Literature Survey
List of sources
• Paper 1 : Document similarity search based on manifold ranking of TextTiles.
• Paper 2 : TextTiling: Segmenting Text into Multi-paragraph Subtopic Passages.
• Paper 3 : Comparison of rank-based vs score-based aggregation for ensemble gene selection.
• Paper 4 : Several methods of ranking retrieval systems with partial relevance judgment.
Document similarity search based on manifold ranking of TextTiles
• In this paper, documents are ranked using the TextTiles (text tiling) concept.
• Conclusion : the approach improves retrieval performance for different retrieval functions.
• Authors : Xiaojun Wan, Jianwu Yang, and Jianguo Xiao.
• Place : Institute of Computer Science and Technology, Peking University, Beijing 100871, China.
TextTiling: Segmenting Text into Multi-paragraph Subtopic Passages
• In this paper, TextTiling is used to divide each document into multi-paragraph subtopic passages.
• Conclusion : this technique has been useful for many text analysis tasks, including information retrieval and summarization.
• Authors : Marti A. Hearst
Comparison of rank-based vs score-based aggregation for ensemble gene selection
• This paper compares rank-based and score-based aggregation using different ranking techniques (RF, MI, Dev, GM, ROC, PRC, S2N), applied to different datasets and subsets.
• Conclusion : these two aggregation approaches work differently on different rankers.
• Authors : David J. Dittman, Taghi M. Khoshgoftaar, Randall Wald, and Amri Napolitano
Several methods of ranking retrieval systems with partial relevance judgment.
• This paper demonstrates that precision and recall suffer from certain shortcomings when ranking is done with partial relevance judgment.
• Conclusion : with partial relevance judgment, the evaluated results can be significantly different from the results with complete relevance judgment.
• Authors : Shengli Wu and Sally McClean.
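To make the effect of partial relevance judgment concrete, here is a small Python sketch that computes precision and recall at a cutoff k. It assumes that unjudged documents are treated as non-relevant, which is a common convention but an assumption here; the document ids and judgments are invented and are not taken from the paper.

```python
def precision_recall(ranked_ids, relevant_ids, k):
    """Precision and recall over the top k results of a ranked list."""
    top_k = ranked_ids[:k]
    hits = sum(1 for d in top_k if d in relevant_ids)
    precision = hits / k
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

ranked = ["d1", "d2", "d3", "d4"]

# Complete judgments: every relevant document is known.
complete = {"d1", "d2", "d4"}
# Partial judgments: d2 and d4 were never judged, so they count as non-relevant.
partial = {"d1"}

p_full, r_full = precision_recall(ranked, complete, k=2)
p_part, r_part = precision_recall(ranked, partial, k=2)
```

For the same ranked list, precision drops from 1.0 to 0.5 and recall rises from 2/3 to 1.0 when only the partial judgments are used.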
Objective
• The project aims to find documents similar to a query document in a text corpus and return a ranked list of similar documents.
• Ranking is done by calculating the query-document score.
Problem statement
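• Documents are ranked based on the standard score calculation, i.e. using the tf-idf concept.
• Formula for weighted tf : $w_{t,d} = 1 + \log_{10}(\mathrm{tf}_{t,d})$ if $\mathrm{tf}_{t,d} > 0$, and $w_{t,d} = 0$ otherwise.
• Formula for idf : $\mathrm{idf}_t = \log_{10}(N / \mathrm{df}_t)$, where $N$ is the number of documents in the collection and $\mathrm{df}_t$ is the number of documents containing term $t$.
• Another way of ranking the documents, TextTiling, is also being studied. In addition, a precision-recall graph will be plotted.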
Steps involved
• Collection of files
• Determining term frequency
• Determining document frequency
• (query, document) set
• Score calculation based on 4 different techniques.
Design till now
Control flow graph
Description of functions
• Main : It calls all other functions by making objects of the subclasses.
• remWord : It is used to check whether the program is reading the files.
• deleteWords : It is used to delete the list of stop words from all the files and store the unique words of all files in a separate file.
Description of flowchart functions
• countWords : It reads the unique terms from the file and stores them in a map along with their frequency.
• documentFreqVector : It builds a document vector. For each (term, document) pair it sets a 1 or a 0, marking whether the term occurs in that document.
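The behaviour described for deleteWords, countWords and documentFreqVector might look roughly like the Python sketch below. The snake_case functions are hypothetical analogues written for illustration, the stop-word list and the sample texts are invented, and documents are held in memory where the project reads and writes files.

```python
from collections import Counter

STOP_WORDS = {"the", "of", "a", "is"}  # illustrative list, not the project's

def delete_words(tokens, stop_words=STOP_WORDS):
    """Drop stop words from a token list."""
    return [t for t in tokens if t not in stop_words]

def count_words(tokens):
    """Map each unique term to its frequency."""
    return dict(Counter(tokens))

def document_freq_vector(vocabulary, docs):
    """For each term, a list with 1 where the document contains it, else 0."""
    return {term: [1 if term in doc else 0 for doc in docs]
            for term in vocabulary}

texts = ["The ranking of the documents", "Score of a document is a number"]
docs = [delete_words(text.lower().split()) for text in texts]
vocabulary = sorted(set(t for d in docs for t in d))

incidence = document_freq_vector(vocabulary, docs)
# Document frequency of a term is the number of 1s in its row.
df = {term: sum(row) for term, row in incidence.items()}
```

Summing each row of the 0/1 vector gives the document frequency needed later for idf.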
Weight Calculation
• Weighting differs between documents and queries.
• We use the ddd.qqq notation to depict this calculation.
• Example : lnc.ltn
document : logarithmic tf, no df weighting, cosine normalization
query : logarithmic tf, idf, no normalization
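The lnc.ltn example can be sketched in Python as follows. The function names, the document-frequency table `df` and the collection size `n_docs` are assumptions made for the example; query terms that never occur in the collection are simply skipped.

```python
import math
from collections import Counter

def lnc_document_weights(doc_terms):
    """lnc: logarithmic tf, no df weighting, cosine normalization."""
    raw = {t: 1 + math.log10(c) for t, c in Counter(doc_terms).items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

def ltn_query_weights(query_terms, df, n_docs):
    """ltn: logarithmic tf, idf, no normalization."""
    return {t: (1 + math.log10(c)) * math.log10(n_docs / df[t])
            for t, c in Counter(query_terms).items() if df.get(t, 0) > 0}

def lnc_ltn_score(doc_terms, query_terms, df, n_docs):
    """Sum of query weight times document weight over the query terms."""
    d = lnc_document_weights(doc_terms)
    q = ltn_query_weights(query_terms, df, n_docs)
    return sum(w * d.get(t, 0.0) for t, w in q.items())

# Hypothetical collection statistics: 1000 documents, "ranking" occurs in 10.
doc = ["ranking", "score"]
result = lnc_ltn_score(doc, ["ranking"], df={"ranking": 10}, n_docs=1000)
```

Cosine normalization gives the document vector unit length, so long documents are not favoured just for containing more terms.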
Weight Variants
Approaches
anc.btn and anc.ltn approaches
Approaches
nnc.btn and nnc.ltn approaches
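The four approaches can be compared with one small scorer driven by the ddd.qqq notation. This Python sketch assumes the usual SMART meaning of the letters (tf: n = natural, l = logarithmic, a = augmented, b = boolean; df: n = none, t = idf; normalization: n = none, c = cosine); the slides do not spell these out, and the sample data is invented.

```python
import math
from collections import Counter

def tf_component(letter, tf, max_tf):
    """Term-frequency part of the weight for a term with tf >= 1."""
    if letter == "n":
        return float(tf)                  # natural
    if letter == "l":
        return 1 + math.log10(tf)         # logarithmic
    if letter == "a":
        return 0.5 + 0.5 * tf / max_tf    # augmented
    if letter == "b":
        return 1.0                        # boolean
    raise ValueError("unknown tf letter: " + letter)

def weights(scheme, terms, df, n_docs):
    """Weights for one side (document or query) of a scheme such as 'anc'."""
    tf_letter, df_letter, norm_letter = scheme
    counts = Counter(terms)
    max_tf = max(counts.values(), default=0)
    w = {}
    for t, tf in counts.items():
        value = tf_component(tf_letter, tf, max_tf)
        if df_letter == "t":
            if df.get(t, 0) == 0:
                continue  # skip terms unseen in the collection
            value *= math.log10(n_docs / df[t])
        w[t] = value
    if norm_letter == "c":
        length = math.sqrt(sum(v * v for v in w.values()))
        if length > 0:
            w = {t: v / length for t, v in w.items()}
    return w

def score(notation, doc_terms, query_terms, df, n_docs):
    """Score a (query, document) pair under a ddd.qqq scheme."""
    doc_scheme, query_scheme = notation.split(".")
    d = weights(doc_scheme, doc_terms, df, n_docs)
    q = weights(query_scheme, query_terms, df, n_docs)
    return sum(v * d.get(t, 0.0) for t, v in q.items())

doc = ["ranking", "ranking", "score"]
query = ["ranking", "score"]
df = {"ranking": 10, "score": 100}  # hypothetical document frequencies
scores = {n: score(n, doc, query, df, n_docs=1000)
          for n in ["anc.btn", "anc.ltn", "nnc.btn", "nnc.ltn"]}
```

With this query the btn and ltn variants agree because every query term occurs once; they diverge as soon as a term is repeated in the query.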
Tools and techniques
• NetBeans : it is an integrated development environment (IDE) for developing primarily with Java, but also with other languages, in particular PHP, C/C++, and HTML5. It is also an application platform framework for Java desktop applications and others. The NetBeans IDE is written in Java and can run on Windows, OS X, Linux, Solaris and other platforms supporting a compatible JVM. The NetBeans Platform allows applications to be developed from a set of modular software components called modules.
• Java : it is a computer programming language that is concurrent, class-based, object-oriented, and specifically designed to have as few implementation dependencies as possible. It is intended to let application developers "write once, run anywhere" (WORA), meaning that code that runs on one platform does not need to be recompiled to run on another. Java applications are typically compiled to bytecode (class file) that can run on any Java virtual machine (JVM) regardless of computer architecture. Java is, as of 2014, one of the most popular programming languages in use, particularly for client-server web applications.
References
• Wan, X., Yang, J., Xiao, J. (2001). Document Similarity Search Based on Manifold-Ranking of TextTiles. Institute of Computer Science and Technology, Peking University, Beijing 100871, China.
• Hearst, M.A. TextTiling: Segmenting Text into Multi-paragraph Subtopic Passages. Xerox PARC, California, USA.
• Dittman, D.J., Khoshgoftaar, T.M., Wald, R., Napolitano, A. (2013). Comparison of Rank-Based vs. Score-Based Aggregation for Ensemble Gene Selection. Florida Atlantic University, Boca Raton, FL 33431.
• Wu, S., McClean, S. Several Methods of Ranking Retrieval Systems with Partial Relevance Judgment. School of Computing and Mathematics, University of Ulster, UK.