Fuzzy Logic Information Retrieval
MINOR PROJECT SYNOPSIS
Implementation of the Fuzzy Logic based IR system for TREC dataset.
Submitted in partial fulfilment of the requirements
For the award of degree of
Bachelor of Technology
In
Computer Science Engineering
TEAM MEMBERS:
Prabhjot Singh (03311502711)
Sumit Dhawan (04311502711)
Shubham Agarwal (04011502711)
PROJECT GUIDE:
Mrs Narina Thakur
BHARATI VIDYAPEETH'S COLLEGE OF ENGINEERING
A-4, PASCHIM VIHAR, ROHTAK ROAD, NEW DELHI- 110063
AFFILIATED TO
GURU GOBIND SINGH INDRAPRASTHA UNIVERSITY, DELHI
-
ABSTRACT
Relevance, evaluation, and information needs are key issues in Information Retrieval. Relevance is the relational value of a given user query to the documents within the database. The relevance of a document is normally determined by a document ranking algorithm. These algorithms define how relevant a document is to a user query by using functions that relate the given query to the documents collected in the index.
Information needs concern how the user interacts with the information retrieval system. The data within the system should be easy to access in a way that is convenient to the user. Retrieving too much information might be inconvenient in some systems, while in others failing to return all relevant information may be unacceptable. Research shows that fuzzy logic is effective at handling the uncertainty and vagueness of queries and documents.
In the present project, a fuzzy logic based similarity measure is implemented using the TREC data collection. The performance of the proposed similarity measure is evaluated and compared with the Cosine similarity measure on the basis of Precision-Recall curves for individual queries.
-
INDEX
1. Objective
2. Introduction
3. Fuzzy Logic based information system
3.1 TREC Dataset
3.2 Information Retrieval
3.3 Fuzzy Logic
3.4 Hard Science with If/then rules
3.5 Propositional Fuzzy Logic
3.6 Information Retrieval on the web
4. Evaluation of performance of IR systems
5. References
-
1. OBJECTIVE
This project implements an improved IR system using a fuzzy logic based similarity
measure. Implementing a fuzzy logic based similarity measure is expected to improve the IR
system's performance. The performance of the fuzzy logic based similarity measure is
compared with the Cosine similarity measure on the TREC data collection.
-
2. INTRODUCTION
In the last couple of decades, various methods have been suggested to improve the performance of
Information Retrieval (IR) systems. An IR system generally deals with the retrieval of relevant
documents against user defined queries. Baeza-Yates defined a general Information Retrieval model
as a quadruple [D, Q, F, R(q_i, d_j)], where D is a set composed of logical views of the documents
in the collection, Q is a set composed of logical views of the user information needs expressed as
queries, F is a framework for modeling document representations, queries and their relationships,
and R(q_i, d_j) is a similarity measure or ranking function which associates a real number with a
query q_i ∈ Q and a document representation d_j ∈ D. Such a ranking defines an ordering among the
documents with regard to the query q_i. The similarity measure therefore plays an important role in
developing a quality IR system.
An IR system evaluates relevancy using some representation of a document and a query. There
are different models for representing documents and queries. Each model has its pros and cons.
The Boolean model was the first model; it was adopted by most of the earlier systems, and even
today some commercial systems use it. It makes use of the concepts of Boolean
logic and set theory.
The documents and queries are collections of terms, and each term from the document is indexed.
The presence or absence of a term in a document is represented by 1 or 0 respectively. To
match documents against a query, we maintain an inverted index of the terms, i.e. for each term
we store a list of the documents that contain the term. However, the Boolean model has some
major limitations, such as a binary decision criterion without any notion of a grading scale, and overloading
of documents. While some researchers have tried to overcome the weaknesses of the Boolean model
by building refinements to the existing Boolean model, others have approached IR with a different
search strategy called the Vector Space model.
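The inverted index and the binary matching described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the project's implementation: the function names, the whitespace tokenisation and the three sample documents are all assumptions made for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: lower-cased whitespace tokenisation, no stemming.
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def boolean_and(index, query):
    """Return the ids of documents that contain every query term."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    if not postings:
        return set()
    return set.intersection(*postings)

# Hypothetical toy collection.
docs = {
    "d1": "fuzzy logic for information retrieval",
    "d2": "vector space model for retrieval",
    "d3": "fuzzy set theory",
}
index = build_inverted_index(docs)
matches = boolean_and(index, "fuzzy retrieval")
print(matches)
```

Every document either matches or does not; there is no grading between the two, which is the limitation noted above.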
-
The Vector Space Model, as the name implies, represents documents and queries internally in the
form of vectors. In the vector space model all queries and documents are represented as vectors in
|V|-dimensional space, where V is the set of all distinct terms in the collection (the vocabulary).
Some of the advantages of the Vector Space Model are that it is a simple and fast model, that it can
handle weighted terms, that it produces a ranked list as output and that the indexing process is
automated, which means a significantly lighter workload for the administrator of the collection.
Also, it is easy to modify individual vectors, which is essential for the query expansion technique
and the logic based similarity measure. Therefore, the vector space model is used as the base model in this
paper.
An IR system needs to calculate the similarity between the query and a particular document in order to
decide the relevancy of that document to the query. When a document retrieval system is used to
query a collection of documents with n terms, the system computes a vector D_i = (d_i1, d_i2, ..., d_in) of
size n for each document i. The vectors are filled with term weights and, similarly, a vector Q = (w_q1,
w_q2, ..., w_qn) is constructed for the terms found in the query. In recent years, some efforts have been
made to construct an effective similarity measure for enhancing the performance of IR systems.
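As a point of reference for the comparison made in this project, the Cosine similarity between a document vector and a query vector can be sketched as follows. The weights in the example are invented for illustration, and the helper names are assumptions rather than part of the project.

```python
import math

def cosine_similarity(d, q):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(x * y for x, y in zip(d, q))
    norm_d = math.sqrt(sum(x * x for x in d))
    norm_q = math.sqrt(sum(y * y for y in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # an all-zero vector shares no terms with anything
    return dot / (norm_d * norm_q)

def rank(doc_vectors, q):
    """Return (doc_id, score) pairs sorted by decreasing similarity."""
    scored = [(doc_id, cosine_similarity(vec, q))
              for doc_id, vec in doc_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical weights over a three-term vocabulary.
doc_vectors = {"d1": [1.0, 0.0, 2.0], "d2": [0.0, 3.0, 0.0]}
query = [1.0, 0.0, 1.0]
ranking = rank(doc_vectors, query)
print(ranking)
```

Here d1 shares two terms with the query and is ranked first, while d2 shares none and scores 0.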
Fan presented similarity functions as trees and a classical generational scheme. Pathak et al.
proposed a combined similarity measure: a linear
combination of various similarity measures in which the weight of each similarity measure is optimized
using a genetic algorithm (GA). Mehran Sahami proposed a novel method for measuring the similarity between short text
snippets by leveraging web search results to provide greater context for the short texts.
Vincent Schickel-Zuber et al. presented a novel approach that allows similarities to be asymmetric
while still using only information contained in the structure of the ontology. Torra et al. presented a
method to calculate similarity between words based on dictionaries using fuzzy graphs.
Chen presented a new similarity measure based on the geometric mean averaging
operator to handle the similarity problems of generalized fuzzy numbers. Usharani et al. proposed a
genetic algorithm based method for finding the similarity of web documents based on cosine similarity.
In the past, the most popular similarity measures used in IR systems have been Cosine, Euclidean, Jaccard and
Okapi.
-
In the present project, a Fuzzy Logic based Similarity Measure is proposed for the vector space IR
model. The performance of the proposed similarity measure is evaluated and compared with the
above-mentioned similarity measures on the basis of Precision-Recall curves.
3. Proposed Fuzzy Logic Based Similarity Measure
The IR system retrieves documents based on a query given by the user. In most cases, both
queries and documents are vague or imprecise and are usually expressed in Natural Language (NL).
Sometimes the user may change the query during the information retrieval process, or may not be
conscious of their exact information needs [1].
Fuzzy Logic is therefore well suited to handling this uncertainty, vagueness and imprecision.
Fuzzy logic is based on Fuzzy Set theory and membership functions.
Documents retrieved by a query are evaluated by the rules of a Fuzzy Inference System (FIS). The Vector
Space Model is used as the base model due to its advantages over other models. This FIS uses
three input variables, term frequency (tf), inverse document frequency (idf) and overlap, and one
output variable, relevance. These input variables are very useful for determining the relevancy of a
document to a particular query. TF is the number of occurrences of a term in a given
document of the corpus [2].
IDF is given by log(N/n), where N is the total number of documents in the corpus and n is the
number of documents that contain the term. Overlap reflects how many of the query terms are
found in a document. A Mamdani type fuzzy inference system is used in the Fuzzy Logic Based
Similarity Measure (FLBSM), built with the help of the Matlab Fuzzy Logic Toolbox.
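The three input variables can be computed as in the sketch below. Several details are assumptions, because the text does not fix them: the logarithm is taken as the natural log, overlap is normalised to the fraction of distinct query terms found in the document, and the toy corpus is invented.

```python
import math

def term_frequency(term, doc_terms):
    """Number of occurrences of a term in one document."""
    return doc_terms.count(term)

def inverse_document_frequency(term, corpus):
    """log(N/n): N documents in the corpus, n of them contain the term."""
    n_docs = len(corpus)
    n_containing = sum(1 for doc_terms in corpus if term in doc_terms)
    if n_containing == 0:
        return 0.0  # assumption: an unseen term contributes nothing
    return math.log(n_docs / n_containing)  # natural log is an assumption

def overlap(query_terms, doc_terms):
    """Fraction of distinct query terms present in the document (assumed form)."""
    distinct = set(query_terms)
    if not distinct:
        return 0.0
    return len(distinct & set(doc_terms)) / len(distinct)

# Hypothetical tokenised corpus.
corpus = [
    ["fuzzy", "logic", "retrieval", "fuzzy"],
    ["vector", "space", "retrieval"],
    ["fuzzy", "sets"],
    ["boolean", "model"],
]
query = ["fuzzy", "retrieval", "ranking"]

print(term_frequency("fuzzy", corpus[0]))
print(inverse_document_frequency("fuzzy", corpus))
print(overlap(query, corpus[0]))
```

Before being passed to a fuzzy inference system, tf and idf would typically be scaled to a common range; that scaling step is not described in the text and is left out here.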
-
The ranges of the input variables tf and idf and of the output variable relevance are each described by the fuzzy sets LOW,
MEDIUM and HIGH, while the range of the input variable overlap is described by LOW and HIGH.
Triangular membership functions are used to map the input space to a degree of
membership in a fuzzy set. The details of the membership functions for the input and output variables of
FLBSM are shown in Fig. 1 [3].
Fig 1: Membership functions for input and output variables of FLBSM
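A triangular membership function is fully described by three breakpoints. The sketch below shows the general form; the breakpoints chosen for LOW, MEDIUM and HIGH are assumptions on a normalised [0, 1] scale, since the actual values appear only in Fig. 1.

```python
def triangular(x, a, b, c):
    """Triangular membership with feet at a and c and peak at b (a <= b <= c)."""
    if x < a or x > c:
        return 0.0
    if x == b:
        return 1.0  # also covers shoulder shapes where a == b or b == c
    if x < b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

# Assumed breakpoints; the real ones are those of Fig. 1.
FUZZY_SETS = {
    "LOW": (0.0, 0.0, 0.5),
    "MEDIUM": (0.0, 0.5, 1.0),
    "HIGH": (0.5, 1.0, 1.0),
}

def fuzzify(x):
    """Degree of membership of x in each fuzzy set."""
    return {name: triangular(x, *abc) for name, abc in FUZZY_SETS.items()}

print(fuzzify(0.25))
```

With these breakpoints a value of 0.25 belongs equally to LOW and MEDIUM and not at all to HIGH, which illustrates how one crisp input can belong to several fuzzy sets at once.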
Fuzzy rules are derived from the tf.idf weighting scheme, i.e. if a query term has high tf and high idf in
a document, then relevance is likely to be high. If many of the terms of the query are found in the
document (overlap), then relevance is likely to be high. It is known that adding rules that penalize
low feature values improves the performance of the system. So the following rules are
constructed for each query term:
If (tf is High) and (idf is High) then (relevance is High).
If (tf is Medium) and (idf is Medium) then (relevance is Medium).
If (tf is Low) and (idf is Low) then (relevance is Low).
Two fuzzy rules are also defined for overlap as follows:
If (overlap is High) then (relevance is High).
If (overlap is Low) then (relevance is Low).
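The five rules above can be evaluated with a small Mamdani-style inference loop. The sketch below is not the Matlab model used in the project: it assumes inputs already normalised to [0, 1], assumed triangular breakpoints, min for AND and for implication, max for aggregation, and centroid defuzzification over a sampled output range.

```python
def triangular(x, a, b, c):
    """Triangular membership with feet at a and c and peak at b."""
    if x < a or x > c:
        return 0.0
    if x == b:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

# Assumed breakpoints on a normalised [0, 1] scale.
THREE_LEVEL = {"low": (0.0, 0.0, 0.5), "medium": (0.0, 0.5, 1.0), "high": (0.5, 1.0, 1.0)}
TWO_LEVEL = {"low": (0.0, 0.0, 1.0), "high": (0.0, 1.0, 1.0)}

def mu(sets, label, x):
    return triangular(x, *sets[label])

def relevance(tf, idf, overlap, steps=101):
    """Crisp relevance in [0, 1] for normalised tf, idf and overlap."""
    # Firing strength of each rule, paired with its output set; AND is min.
    fired = [
        ("high", min(mu(THREE_LEVEL, "high", tf), mu(THREE_LEVEL, "high", idf))),
        ("medium", min(mu(THREE_LEVEL, "medium", tf), mu(THREE_LEVEL, "medium", idf))),
        ("low", min(mu(THREE_LEVEL, "low", tf), mu(THREE_LEVEL, "low", idf))),
        ("high", mu(TWO_LEVEL, "high", overlap)),
        ("low", mu(TWO_LEVEL, "low", overlap)),
    ]
    numerator = denominator = 0.0
    for i in range(steps):
        y = i / (steps - 1)
        # Clip each output set at its rule strength, then aggregate with max.
        aggregated = max(min(strength, mu(THREE_LEVEL, label, y))
                         for label, strength in fired)
        numerator += y * aggregated
        denominator += aggregated
    # Centroid of the aggregated output set.
    return numerator / denominator if denominator else 0.0

print(relevance(1.0, 1.0, 1.0), relevance(0.5, 0.5, 0.5), relevance(0.0, 0.0, 0.0))
```

Because no rule uses ELSE, all five rules contribute to the aggregated output; an input that is partly MEDIUM and partly HIGH pulls the centroid somewhere between the two.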
-
3.1 TREC Dataset
The Text REtrieval Conference (TREC) is an on-going series of workshops focusing on a list of
different information retrieval (IR) research areas, or tracks. It is co-sponsored by the National
Institute of Standards and Technology (NIST) and the Intelligence Advanced Research Projects
Activity (part of the office of the Director of National Intelligence), and began in 1992 as part of the
TIPSTER Text program. Its purpose is to support and encourage research within the information
retrieval community by providing the infrastructure necessary for large-scale evaluation of text
retrieval methodologies and to increase the speed of lab-to-product transfer of technology.
Each track has a challenge wherein NIST provides participating groups with data sets and test
problems. Depending on track, test problems might be questions, topics, or target extractable
features. Uniform scoring is performed so the systems can be fairly evaluated. After evaluation of
the results, a workshop provides a place for participants to collect together thoughts and ideas and
present current and future research work [4].
TREC systems often provide a baseline for further research. Examples include:
Hal Varian, Chief Economist at Google, says "Better data makes for better science. The history of information retrieval illustrates this principle well," and describes TREC's contribution.
TREC's Legal track has influenced the e-Discovery community both in research and in evaluation of commercial vendors.
The IBM research team building IBM Watson (aka DeepQA), which beat the world's best Jeopardy! players, used data and systems from TREC's QA Track as baseline performance measurements.
-
3.2 Information Retrieval
Information retrieval is the activity of obtaining information resources relevant to an information
need from a collection of information resources. Searches can be based on metadata or on full-text
(or other content-based) indexing.
Automated information retrieval systems are used to reduce what has been called "information
overload". Many universities and public libraries use IR systems to provide access to books,
journals and other documents. Web search engines are the most visible IR applications.
An information retrieval process begins when a user enters a query into the system. Queries are
formal statements of information needs, for example search strings in web search engines. In
information retrieval a query does not uniquely identify a single object in the collection. Instead,
several objects may match the query, perhaps with different degrees of relevancy.
An object is an entity that is represented by information in a database. User queries are matched
against the database information. Depending on the application the data objects may be, for
example, text documents, images, audio, mind maps or videos. Often the documents themselves are
not kept or stored directly in the IR system, but are instead represented in the system by document
surrogates or metadata.
Most IR systems compute a numeric score on how well each object in the database matches the
query, and rank the objects according to this value. The top ranking objects are then shown to the
user. The process may then be iterated if the user wishes to refine the query.
For effectively retrieving relevant documents by IR strategies, the documents are typically
transformed into a suitable representation. Each retrieval strategy incorporates a specific model for
its document representation purposes.
-
Fig 2: Categorisation of IR models
Applications of IR include:
Digital libraries
Information filtering
Recommender systems
Media search
Blog search
Image retrieval
Speech retrieval
Video retrieval
To measure ad hoc information retrieval effectiveness in the standard way, we need a test collection
consisting of three things:
A document collection
A test suite of information needs, expressible as queries
A set of relevance judgments, standardly a binary assessment of either relevant or
nonrelevant for each query-document pair.
-
3.3 Fuzzy Logic
Fuzzy logic is a form of many-valued logic; it deals with reasoning that is approximate
rather than fixed and exact. Unlike traditional binary logic, where a variable is either true or
false, a fuzzy logic variable may have a truth value anywhere between 0 and 1.
Fuzzy logic has been extended to handle the concept of partial truth, where the truth value
may range between completely true and completely false. Furthermore, when linguistic
variables are used, these degrees may be managed by specific functions.
The term "fuzzy logic" was introduced with the 1965 proposal of fuzzy set theory by Lotfi
A. Zadeh. Fuzzy logic has been applied to many fields, from control theory to artificial
intelligence. Fuzzy logics had, however, been studied since the 1920s, as infinite-valued
logics, notably by Łukasiewicz and Tarski.
A basic application might characterize subranges of a continuous variable. For instance, a
temperature measurement for anti-lock brakes might have several separate membership
functions defining particular temperature ranges needed to control the brakes properly. Each
function maps the same temperature value to a truth value in the 0 to 1 range. These truth
values can then be used to determine how the brakes should be controlled.
Fig 3: Fuzzy logic temperature
-
3.4 Hard science with IF-THEN rules
Fuzzy set theory defines fuzzy operators on fuzzy sets. The problem in applying this is that the
appropriate fuzzy operator may not be known. For this reason, fuzzy logic usually uses IF-THEN
rules, or constructs that are equivalent, such as fuzzy associative matrices.
Rules are usually expressed in the form:
IF variable IS property THEN action
For example, a simple temperature regulator that uses a fan might look like this:
IF temperature IS very cold THEN stop fan
IF temperature IS cold THEN turn down fan
IF temperature IS normal THEN maintain level
IF temperature IS hot THEN speed up fan
There is no "ELSE": all of the rules are evaluated, because the temperature might be "cold" and
"normal" at the same time to different degrees.
The AND, OR, and NOT operators of boolean logic exist in fuzzy logic, usually defined as the
minimum, maximum, and complement; when they are defined this way, they are called the Zadeh
operators. So for the fuzzy variables x and y:
NOT x = (1 - truth(x))
x AND y = minimum(truth(x), truth(y))
x OR y = maximum(truth(x), truth(y))
There are also other operators, more linguistic in nature, called hedges that can be applied. These
are generally adverbs such as "very", or "somewhat", which modify the meaning of a set using a
mathematical formula.
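The Zadeh operators translate directly into code. The two hedges in the sketch use a common convention that the text does not specify: "very" squares the truth value and "somewhat" takes its square root. The truth values for "cold" and "normal" are invented.

```python
def fuzzy_not(x):
    return 1.0 - x

def fuzzy_and(x, y):
    return min(x, y)

def fuzzy_or(x, y):
    return max(x, y)

def very(x):
    """Hedge that sharpens a set (assumed definition: squaring)."""
    return x ** 2

def somewhat(x):
    """Hedge that loosens a set (assumed definition: square root)."""
    return x ** 0.5

# Hypothetical truth values for one temperature reading.
cold, normal = 0.7, 0.3

print(fuzzy_and(cold, normal))
print(fuzzy_or(cold, normal))
print(fuzzy_not(cold))
print(very(cold), somewhat(cold))
```

The reading is "cold" to degree 0.7 and "normal" to degree 0.3 at the same time, so both fan rules fire, each with its own strength.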
-
3.5 Propositional fuzzy logics
The most important propositional fuzzy logics are:
Monoidal t-norm-based propositional fuzzy logic MTL is an axiomatization of logic where
conjunction is defined by a left continuous t-norm, and implication is defined as the residuum of the
t-norm. Its models correspond to MTL-algebras that are prelinear commutative bounded integral
residuated lattices.
Basic propositional fuzzy logic BL is an extension of MTL logic where conjunction is defined by a
continuous t-norm, and implication is also defined as the residuum of the t-norm. Its models
correspond to BL-algebras.
Łukasiewicz fuzzy logic is the extension of basic fuzzy logic BL where standard conjunction is the
Łukasiewicz t-norm. It has the axioms of basic fuzzy logic plus an axiom of double negation, and
its models correspond to MV-algebras.
Gödel fuzzy logic is the extension of basic fuzzy logic BL where conjunction is the Gödel t-norm. It
has the axioms of BL plus an axiom of idempotence of conjunction, and its models are called
G-algebras.
Product fuzzy logic is the extension of basic fuzzy logic BL where conjunction is product t-norm. It
has the axioms of BL plus another axiom for cancellativity of conjunction, and its models are called
product algebras.
Fuzzy logic with evaluated syntax (sometimes also called Pavelka's logic), denoted by EVŁ, is a
further generalization of mathematical fuzzy logic. While the above kinds of fuzzy logic have
traditional syntax and many-valued semantics, in EVŁ the syntax is also evaluated. This means that
each formula has an evaluation. The axiomatization of EVŁ stems from Łukasiewicz fuzzy logic. A
generalization of the classical Gödel completeness theorem is provable in EVŁ.
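The three t-norms named above, together with their residua (the implications they induce), have standard closed forms on the unit interval. The sketch below writes them out; the function names are chosen for this example only.

```python
def lukasiewicz_tnorm(x, y):
    return max(0.0, x + y - 1.0)

def godel_tnorm(x, y):
    return min(x, y)

def product_tnorm(x, y):
    return x * y

def lukasiewicz_residuum(x, y):
    return min(1.0, 1.0 - x + y)

def godel_residuum(x, y):
    return 1.0 if x <= y else y

def product_residuum(x, y):
    return 1.0 if x <= y else y / x

x, y = 0.7, 0.6
for name, tnorm, residuum in [
    ("Lukasiewicz", lukasiewicz_tnorm, lukasiewicz_residuum),
    ("Godel", godel_tnorm, godel_residuum),
    ("product", product_tnorm, product_residuum),
]:
    print(name, tnorm(x, y), residuum(x, y))
```

Two of the extra axioms can be seen directly in these formulas: the Gödel t-norm is idempotent, since min(x, x) = x, and in Łukasiewicz logic the negation x -> 0 evaluates to 1 - x, so negating twice returns x.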
-
3.6 Information Retrieval On the Web
Retrieving information from the web can be difficult because of the size and abstractness
of the data it contains. Estimates for 2011 put the web at 50
billion pages or more. Web retrieval becomes harder still when factors such
as word ambiguity (where a single word can take on multiple meanings) and the large number of
typographical errors in web content are taken into account. It is estimated that one in every two
hundred words on an average web site contains a textual error.
There are several key issues in information retrieval, chiefly relevance, evaluation,
and information needs. These are not the only ones, however:
performance, scalability and the frequency of page updates are also common
concerns.
Relevance is the relational value of a given user query to the documents within the database.
Relevance of a document is normally based on a document ranking algorithm. These algorithms
define how relevant a document is to a user query by using functions that define relations between
the query given and the documents collected in the index.
The evaluation of the feedback given by the information retrieval system is another issue with
information retrieval. The behavior of the system may not meet the expectations of the user or the
documents returned from the system may not all be relevant to a query. Depending on the system
and the user, the results of a query should be in a format that most fits the data being searched and
returned.
Web information retrieval is an area open to many research opportunities. The larger problems
with web information retrieval (relevance, evaluation, and information needs, amongst others) are
still important topics that require attention.
-
Fuzzy information retrieval has proven to be a suitable solution for many areas of
information retrieval in which the data can be uncertain, such as the web. The individual
sections of the system were developed. These include a crawler system that obeys standard internet
etiquette rules, and an indexing application that stores all information retrieved from the web in the
form of a standard inverted index. Each section entered its gathered data into the database.
Information needs concern how the user interacts with the information retrieval system. The data within
the system should be easy to access in a way that is convenient to the user.
Retrieving too much information might be inconvenient in some systems, while in other systems
failing to return all relevant information may be unacceptable.
-
4. Evaluation of Performance of Information Retrieval System
In the past, various researchers have used following parameters to evaluate the performance of IR
Systems:
1. Precision: The fraction of all retrieved documents that are relevant.
In practice it measures the accuracy of the result.
Precision = |Ra| / |A| (1)
where,
Ra: Set of relevant documents retrieved
A: Set of documents retrieved
2. Recall: The fraction of all relevant documents that are retrieved.
In practice it measures the coverage of the result.
Recall = |Ra| / |R| (2)
where,
Ra: Set of relevant documents retrieved
R: Set of all relevant documents
3. Precision-Recall Curve: This curve plots precision (y-axis) against recall (x-axis).
Instead of using precision and recall at each rank position,
the curve is commonly plotted using the 11 standard recall levels 0%, 10%, 20%, ..., 100%.
Moreover, average similarity value of documents for individual query and average number of
retrieved relevant documents can also be used as parameters to check the performance of IR
System. If the values for both of these parameters are high then the performance of IR System will
be good.
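Equations (1) and (2) and the 11-point curve can be computed as in the sketch below. It assumes a ranked list without duplicate documents and uses the usual interpolation rule, in which the precision at a recall level is the highest precision observed at that level or any higher one; the ranking and the relevance judgments are invented.

```python
def precision(retrieved, relevant):
    """Equation (1): |Ra| / |A|."""
    if not retrieved:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(set(retrieved))

def recall(retrieved, relevant):
    """Equation (2): |Ra| / |R|."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(set(relevant))

def eleven_point_curve(ranking, relevant):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    relevant = set(relevant)
    if not relevant:
        return [(i / 10, 0.0) for i in range(11)]
    points = []  # (recall, precision) after each rank position
    hits = 0
    for k, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / k))
    curve = []
    for i in range(11):
        level = i / 10
        reachable = [p for r, p in points if r >= level]
        curve.append((level, max(reachable, default=0.0)))
    return curve

# Hypothetical ranked output and relevance judgments for one query.
ranking = ["d1", "d3", "d2", "d7"]
relevant = {"d1", "d2"}

print(precision(ranking, relevant), recall(ranking, relevant))
for level, value in eleven_point_curve(ranking, relevant):
    print(level, value)
```

Computing one such curve per query for each similarity measure gives the per-query comparison that this project is based on.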
-
5. REFERENCES
1. Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. Addison Wesley (1999)
2. Cooper, W.S.: Getting beyond Boole. Information Processing and Management 24, 243-248
(1988)
3. Harman, D.: Ranking Algorithms. Information Retrieval: Data Structures and Algorithms,
pp. 363-392. Prentice-Hall (1992)
4. Salton, G.: Automatic Text Processing: The Transformation, Analysis, and Retrieval of
Information by Computer. Addison Wesley (1998)