Fuzzy Logic Information Retrieval

MINOR PROJECT SYNOPSIS

Implementation of the Fuzzy Logic based IR system for TREC dataset.

Submitted in partial fulfilment of the requirements

For the award of degree of

Bachelor of Technology

In

Computer Science Engineering

TEAM MEMBERS:

Prabhjot Singh (03311502711)

Sumit Dhawan (04311502711)

Shubham Agarwal (04011502711)

PROJECT GUIDE:

Mrs Narina Thakur

BHARATI VIDYAPEETH'S COLLEGE OF ENGINEERING

A-4, PASCHIM VIHAR, ROHTAK ROAD, NEW DELHI- 110063

AFFILIATED TO

GURU GOBIND SINGH INDRAPRASTHA UNIVERSITY, DELHI

ABSTRACT

Relevance, evaluation, and information needs are key issues associated with Information Retrieval. Relevance is the relational value of a given user query to the documents within the database. Relevance of a document is normally based on a document ranking algorithm. These algorithms define how relevant a document is to a user query by using functions that define relations between the query given and the documents collected in the index.

Information needs concern how the user interacts with the information retrieval system. The data within the system should be easy to access in a way that is convenient to the user. Retrieving too much information might be inconvenient in some systems, while in others failing to return all relevant information may be unacceptable. Research reveals the effectiveness of fuzzy logic in handling the uncertainty and vagueness of queries and documents.

In the present project, a fuzzy logic based similarity measure is implemented using the TREC data collection. The performance of the proposed similarity measure is evaluated and compared with the Cosine similarity measure on the basis of Precision-Recall curves for individual queries.

INDEX

1. Objective
2. Introduction
3. Fuzzy Logic based information system
3.1 TREC Dataset
3.2 Information Retrieval
3.3 Fuzzy Logic
3.4 Hard Science with If/then rules
3.5 Propositional Fuzzy Logic
3.6 Information Retrieval on the web
4. Evaluation of performance of IR systems
5. References

1. OBJECTIVE

This project implements an improved IR system using a fuzzy logic based similarity measure. Implementing a fuzzy logic based similarity measure is expected to improve the IR system's performance. The performance of the fuzzy logic based similarity measure is compared with the Cosine similarity measure on the TREC data collection.

2. INTRODUCTION

In the last couple of decades, various methods have been suggested to improve the performance of Information Retrieval (IR) systems. An IR system generally deals with the retrieval of relevant documents against user-defined queries. Baeza-Yates defined a general Information Retrieval model as a quadruple [D, Q, F, R(q_i, d_j)], where D is a set composed of logical views for the documents in the collection, Q is a set composed of logical views for the user information needs expressed as queries, F is a framework for modeling document representations, queries and their relationships, and R(q_i, d_j) is a similarity measure/ranking function which associates a real number with a query q_i ∈ Q and a document representation d_j ∈ D. Such a ranking defines an ordering among the documents with regard to the query q_i. Therefore the similarity measure plays an important role in developing a quality IR system.

An IR system evaluates relevancy using some representation of a document and a query. There are different models for representing documents and queries, and each model has its pros and cons. The Boolean model was the first model to be adopted by most of the earlier systems, and even today some commercial systems use this model, which makes use of the concepts of Boolean logic and set theory.

Documents and queries are collections of terms, and each term from a document is indexed. The presence or absence of a term in a document is represented by 1 or 0 respectively. To match the terms of a document and a query we maintain an inverted index of the terms, i.e. for each term we store a list of the documents that contain it. However, the Boolean model has some major limitations, such as a binary decision criterion without any notion of a grading scale, and overloading of documents. While some researchers have tried to overcome the weaknesses of the Boolean model by building refinements to it, others have approached IR with a different search strategy called the Vector Space Model.
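
The inverted index and the binary match described above can be sketched in a few lines of Python. This is only an illustration: the function names, the whitespace tokenization and the three sample documents are assumptions made for the example, not part of the project.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def boolean_and(index, query):
    """Return ids of documents containing every query term (Boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical toy collection, for illustration only.
docs = {
    1: "fuzzy logic for information retrieval",
    2: "boolean logic and set theory",
    3: "vector space retrieval model",
}
index = build_inverted_index(docs)
matches = boolean_and(index, "logic retrieval")  # only document 1 has both terms
```

A document either matches or it does not; there is no score to rank by, which is the grading limitation noted above.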

The Vector Space Model, as the name implies, represents documents and queries internally in the form of vectors. All queries and documents are represented as vectors in |V|-dimensional space, where V is the set of all distinct terms in the collection (the vocabulary). Some of the advantages of the Vector Space Model are that it is a simple and fast model, that it can handle weighted terms, that it produces a ranked list as output, and that the indexing process is automated, which means a significantly lighter workload for the administrator of the collection. Also, it is easy to modify individual vectors, which is essential for the query expansion technique and for a logic based similarity measure. Therefore, the vector space model is used as the base model in this project.

An IR system needs to calculate the similarity between the query and a particular document in order to decide the relevancy of that document to the query. When a document retrieval system is used to query a collection of documents with n terms, the system computes a vector D_i = (d_i1, d_i2, ..., d_in) of size n for each document. The vectors are filled with the term weights and, similarly, a vector Q = (w_q1, w_q2, ..., w_qn) is constructed for the terms found in the query. In recent years, some efforts have been made to construct an effective similarity measure for enhancing the performance of IR systems.
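
As a point of reference, the Cosine measure that this project uses as its baseline can be computed over such weight vectors roughly as follows. This is a minimal sketch: the vectors d1, d2 and q hold made-up weights over a hypothetical 4-term vocabulary, and the function name is an assumption.

```python
import math


def cosine_similarity(doc_vec, query_vec):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(d * q for d, q in zip(doc_vec, query_vec))
    norm_d = math.sqrt(sum(d * d for d in doc_vec))
    norm_q = math.sqrt(sum(q * q for q in query_vec))
    if norm_d == 0.0 or norm_q == 0.0:
        return 0.0  # an all-zero vector shares no terms with anything
    return dot / (norm_d * norm_q)


# Hypothetical weights over a 4-term vocabulary (n = 4).
d1 = [0.5, 0.0, 0.8, 0.0]
d2 = [0.0, 0.9, 0.0, 0.1]
q = [1.0, 0.0, 1.0, 0.0]

# Rank the documents by decreasing similarity to the query.
ranking = sorted(
    [("d1", cosine_similarity(d1, q)), ("d2", cosine_similarity(d2, q))],
    key=lambda pair: pair[1],
    reverse=True,
)
```

Here d1 shares both query terms and is ranked first, while d2 shares none and scores zero.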

Fan presented similarity functions as trees and a classical generational scheme. Pathak et al. proposed the idea of a combined similarity measure, a linear combination of various similarity measures in which the weight of each measure is optimized using a genetic algorithm (GA). Mehran Sahami proposed a novel method for measuring the similarity between short text snippets by leveraging web search results to provide greater context for the short texts. Vincent Schickel-Zuber et al. presented a novel approach that allows similarities to be asymmetric while still using only information contained in the structure of the ontology. Torra et al. presented a method to calculate similarity between words based on dictionaries using fuzzy graphs. Chen presented a new similarity measure based on the geometric mean averaging operator to handle the similarity problems of generalized fuzzy numbers. Usharani et al. proposed a genetic algorithm based method for finding the similarity of web documents based on cosine similarity. In the past, the most popular similarity measures used in IR systems have been Cosine, Euclidean, Jaccard and Okapi.

In the present project, a Fuzzy Logic Based Similarity Measure (FLBSM) is proposed for the vector space IR model. The performance of the proposed similarity measure will be evaluated and compared with the above mentioned similarity measures on the basis of Precision-Recall curves for individual queries.

3. Proposed Fuzzy Logic Based Similarity Measure

The IR system retrieves documents based on a query given by the user. In most cases, both queries and documents are vague or imprecise and are usually expressed in Natural Language (NL). Sometimes users may change their query during the information retrieval process, and they may not be fully aware of their exact information needs [1].

Fuzzy Logic is therefore very suitable for handling this uncertainty, vagueness and imprecision. Fuzzy logic is based on fuzzy set theory and membership functions.

Documents retrieved by a query are evaluated by the rules of a Fuzzy Inference System (FIS). The Vector Space Model is used as the base model due to its advantages over other models. In this FIS, we use three input variables, term frequency (tf), inverse document frequency (idf) and overlap, and one output variable, relevance. These input variables are very useful for determining the relevancy of a document to a particular query. TF is the number of occurrences of a term in a given document of the corpus [2].

IDF is given as log(N/n), where N is the total number of documents in the corpus and n is the number of documents that contain the term. Overlap reflects how many of the terms of the query are found in a document. A Mamdani type fuzzy inference system is used in the proposed similarity measure (FLBSM), with the help of the Matlab Fuzzy Logic Toolbox.
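
The three input variables can be computed from tokenized text roughly as shown below. This is a sketch under stated assumptions, not the project's implementation: the natural logarithm is assumed for idf (the text does not fix a base), overlap is assumed to be normalized to a fraction of the query terms, and the helper names and the toy corpus are hypothetical.

```python
import math


def term_frequency(term, doc_tokens):
    """Raw count of `term` in one tokenized document."""
    return doc_tokens.count(term)


def inverse_document_frequency(term, corpus):
    """idf = log(N / n); returns 0.0 when no document contains the term."""
    n_docs = len(corpus)
    n_containing = sum(1 for doc in corpus if term in doc)
    if n_containing == 0:
        return 0.0  # assumption: avoid division by zero for unseen terms
    return math.log(n_docs / n_containing)


def overlap(query_tokens, doc_tokens):
    """Fraction of distinct query terms found in the document.

    Normalizing to [0, 1] is an assumption made for this sketch.
    """
    query_terms = set(query_tokens)
    if not query_terms:
        return 0.0
    return len(query_terms & set(doc_tokens)) / len(query_terms)


# Hypothetical toy corpus and query.
corpus = [
    "fuzzy logic handles vague queries".split(),
    "cosine similarity ranks documents".split(),
    "fuzzy sets and fuzzy rules".split(),
    "documents are ranked by relevance".split(),
]
query = "fuzzy documents".split()

tf_value = term_frequency("fuzzy", corpus[2])               # "fuzzy" occurs twice
idf_value = inverse_document_frequency("fuzzy", corpus)     # log(4 / 2)
overlap_value = overlap(query, corpus[0])                   # one of two query terms
```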

The ranges of the input variables tf and idf and of the output variable relevance are represented by LOW, MEDIUM and HIGH, while the range of the input variable overlap is represented by LOW and HIGH. In this project, a triangular membership function is used to map the input space to a degree of membership of a fuzzy set. The details of the membership functions for the input and output variables of the FLBSM are shown in Fig. 1 [3].

Fig 1: membership functions for input and output variables of FLBSM
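
A triangular membership function of the kind referred to above can be written as follows. The breakpoints in TF_SETS are placeholders chosen for illustration on a tf value normalized to [0, 1]; the actual breakpoints are those of Fig. 1 and are not reproduced here.

```python
def triangular(x, a, b, c):
    """Triangular membership with feet at a and c and peak at b (a <= b <= c)."""
    if x == b:
        return 1.0  # also covers shoulder shapes where a == b or b == c
    if x <= a or x >= c:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


# Hypothetical breakpoints for a tf value normalized to [0, 1].
TF_SETS = {
    "LOW": (0.0, 0.0, 0.5),
    "MEDIUM": (0.0, 0.5, 1.0),
    "HIGH": (0.5, 1.0, 1.0),
}


def fuzzify(x, sets):
    """Degree of membership of x in every linguistic term."""
    return {name: triangular(x, *abc) for name, abc in sets.items()}


memberships = fuzzify(0.25, TF_SETS)
```

With these assumed breakpoints, a value of 0.25 belongs to LOW and to MEDIUM with degree 0.5 each and not at all to HIGH, which is the kind of graded membership the Boolean model cannot express.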

Fuzzy rules are derived from the tf.idf weighting scheme, i.e. if a query term has high tf and high idf in a document, then relevance is likely to be high. If many of the terms of the query are found in the document (overlap), then relevance is likely to be high. It is known that if rules that penalize low features are added, the performance of the system increases. So the following rules are constructed for each query term:

If (tf is High) and (idf is High) then (relevance is High).

If (tf is Medium) and (idf is Medium) then (relevance is Medium).

If (tf is Low) and (idf is Low) then (relevance is Low).

Two fuzzy rules are also defined for overlap as follows:

If (overlap is High) then (relevance is High).

If (overlap is Low) then (relevance is Low).
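
To make the five rules concrete, the sketch below evaluates them for a single query term in the usual Mamdani style: minimum for AND, clipping of each output set at its rule strength, maximum for aggregation, and a discretized centroid for defuzzification. It is an illustrative Python stand-in, not the Matlab Fuzzy Logic Toolbox model used in the project. The inputs are assumed to be normalized to [0, 1], and every membership breakpoint is a placeholder because the values of Fig. 1 are not reproduced in the text.

```python
def tri(x, a, b, c):
    """Triangular membership with feet a, c and peak b."""
    if x == b:
        return 1.0
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)


# Assumed breakpoints on [0, 1]; the real ones are those of Fig. 1.
THREE = {"Low": (0.0, 0.0, 0.5), "Medium": (0.0, 0.5, 1.0), "High": (0.5, 1.0, 1.0)}
TWO = {"Low": (0.0, 0.0, 1.0), "High": (0.0, 1.0, 1.0)}


def relevance(tf, idf, overlap, steps=100):
    """Mamdani inference: min for AND, clip, max-aggregate, centroid."""
    # Firing strength of the rules, grouped by their output term.
    strength = {
        "High": max(min(tri(tf, *THREE["High"]), tri(idf, *THREE["High"])),
                    tri(overlap, *TWO["High"])),
        "Medium": min(tri(tf, *THREE["Medium"]), tri(idf, *THREE["Medium"])),
        "Low": max(min(tri(tf, *THREE["Low"]), tri(idf, *THREE["Low"])),
                   tri(overlap, *TWO["Low"])),
    }
    numerator = denominator = 0.0
    for i in range(steps + 1):
        y = i / steps
        # Clip each output set at its rule strength, then take the maximum.
        mu = max(min(strength[name], tri(y, *THREE[name])) for name in THREE)
        numerator += y * mu
        denominator += mu
    return numerator / denominator if denominator else 0.0


high_score = relevance(0.9, 0.9, 0.9)
low_score = relevance(0.1, 0.1, 0.1)
```

How the per-term relevance values are combined into one document score is not specified in this section, so the sketch stops at a single term.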

3.1 TREC Dataset

The Text REtrieval Conference (TREC) is an on-going series of workshops focusing on a list of

different information retrieval (IR) research areas, or tracks. It is co-sponsored by the National

Institute of Standards and Technology (NIST) and the Intelligence Advanced Research Projects

Activity (part of the office of the Director of National Intelligence), and began in 1992 as part of the

TIPSTER Text program. Its purpose is to support and encourage research within the information

retrieval community by providing the infrastructure necessary for large-scale evaluation of text

retrieval methodologies and to increase the speed of lab-to-product transfer of technology.

Each track has a challenge wherein NIST provides participating groups with data sets and test

problems. Depending on track, test problems might be questions, topics, or target extractable

features. Uniform scoring is performed so the systems can be fairly evaluated. After evaluation of

the results, a workshop provides a place for participants to collect together thoughts and ideas and

present current and future research work [4].

TREC systems often provide a baseline for further research. Examples include:

Hal Varian, Chief Economist at Google, says "Better data makes for better science. The history of information retrieval illustrates this principle well," and describes TREC's contribution.

TREC's Legal track has influenced the e-Discovery community both in research and in

evaluation of commercial vendors.

The IBM research team building IBM Watson (aka DeepQA), which beat the world's best

Jeopardy! players, used data and systems from TREC's QA Track as baseline performance

measurements.

3.2 Information Retrieval

Information retrieval is the activity of obtaining information resources relevant to an information

need from a collection of information resources. Searches can be based on metadata or on full-text

(or other content-based) indexing.

Automated information retrieval systems are used to reduce what has been called "information

overload". Many universities and public libraries use IR systems to provide access to books,

journals and other documents. Web search engines are the most visible IR applications.

An information retrieval process begins when a user enters a query into the system. Queries are

formal statements of information needs, for example search strings in web search engines. In

information retrieval a query does not uniquely identify a single object in the collection. Instead,

several objects may match the query, perhaps with different degrees of relevancy.

An object is an entity that is represented by information in a database. User queries are matched

against the database information. Depending on the application the data objects may be, for

example, text documents, images, audio, mind maps or videos. Often the documents themselves are

not kept or stored directly in the IR system, but are instead represented in the system by document

surrogates or metadata.

Most IR systems compute a numeric score on how well each object in the database matches the

query, and rank the objects according to this value. The top ranking objects are then shown to the

user. The process may then be iterated if the user wishes to refine the query.

For effectively retrieving relevant documents by IR strategies, the documents are typically

transformed into a suitable representation. Each retrieval strategy incorporates a specific model for

its document representation purposes.

Fig 2: Categorisation of IR models

Applications of IR include:

Digital libraries

Information filtering

Recommender systems

Media search

Blog search

Image retrieval

Speech retrieval

Video retrieval

To measure ad hoc information retrieval effectiveness in the standard way, we need a test collection

consisting of three things:

A document collection

A test suite of information needs, expressible as queries

A set of relevance judgments, standardly a binary assessment of either relevant or

nonrelevant for each query-document pair.

3.3 Fuzzy Logic

Fuzzy logic is a form of many-valued logic; it deals with reasoning that is approximate

rather than fixed and exact. Compared to traditional binary sets, fuzzy logic variables may

have a truth value that ranges in degree between 0 and 1.

Fuzzy logic has been extended to handle the concept of partial truth, where the truth value

may range between completely true and completely false. Furthermore, when linguistic

variables are used, these degrees may be managed by specific functions.

The term "fuzzy logic" was introduced with the 1965 proposal of fuzzy set theory by Lotfi

A. Zadeh. Fuzzy logic has been applied to many fields, from control theory to artificial

intelligence. Fuzzy logics had, however, been studied since the 1920s, as infinite-valued

logics - notably by Łukasiewicz and Tarski.

A basic application might characterize subranges of a continuous variable. For instance, a

temperature measurement for anti-lock brakes might have several separate membership

functions defining particular temperature ranges needed to control the brakes properly. Each

function maps the same temperature value to a truth value in the 0 to 1 range. These truth

values can then be used to determine how the brakes should be controlled.

Fig 3: Fuzzy logic temperature

3.4 Hard science with IF-THEN rules

Fuzzy set theory defines fuzzy operators on fuzzy sets. The problem in applying this is that the

appropriate fuzzy operator may not be known. For this reason, fuzzy logic usually uses IF-THEN

rules, or constructs that are equivalent, such as fuzzy associative matrices.

Rules are usually expressed in the form:

IF variable IS property THEN action

For example, a simple temperature regulator that uses a fan might look like this:

IF temperature IS very cold THEN stop fan

IF temperature IS cold THEN turn down fan

IF temperature IS normal THEN maintain level

IF temperature IS hot THEN speed up fan

There is no "ELSE"; all of the rules are evaluated, because the temperature might be "cold" and "normal" at the same time to different degrees.

The AND, OR, and NOT operators of Boolean logic exist in fuzzy logic, usually defined as the

minimum, maximum, and complement; when they are defined this way, they are called the Zadeh

operators. So for the fuzzy variables x and y:

NOT x = (1 - truth(x))

x AND y = minimum(truth(x), truth(y))

x OR y = maximum(truth(x), truth(y))
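
The three Zadeh operators translate directly into code. In the sketch below the truth values for "cold" and "normal" are hypothetical numbers picked for illustration.

```python
def fuzzy_not(x):
    """Complement: NOT x = 1 - truth(x)."""
    return 1.0 - x


def fuzzy_and(x, y):
    """Conjunction: minimum of the two truth values."""
    return min(x, y)


def fuzzy_or(x, y):
    """Disjunction: maximum of the two truth values."""
    return max(x, y)


# Hypothetical truth values for "temperature IS cold" and "temperature IS normal".
cold, normal = 0.75, 0.25

cold_and_normal = fuzzy_and(cold, normal)
cold_or_normal = fuzzy_or(cold, normal)
not_cold = fuzzy_not(cold)
# Unlike Boolean logic, x AND (NOT x) need not be 0.
cold_and_not_cold = fuzzy_and(cold, fuzzy_not(cold))
```

When the truth values are restricted to 0 and 1, these definitions reduce to the ordinary Boolean operators.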

There are also other operators, more linguistic in nature, called hedges that can be applied. These

are generally adverbs such as "very", or "somewhat", which modify the meaning of a set using a

mathematical formula.
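
The text does not give the formulas behind hedges. A common convention, assumed here, is to model "very" as squaring the membership degree (concentration) and "somewhat" as taking its square root (dilation). The value used for "hot" is hypothetical.

```python
import math


def very(mu):
    """Concentration: squaring lowers every partial membership."""
    return mu ** 2


def somewhat(mu):
    """Dilation: the square root raises every partial membership."""
    return math.sqrt(mu)


hot = 0.64  # hypothetical degree to which a temperature is "hot"
very_hot = very(hot)
somewhat_hot = somewhat(hot)
```

Under this convention "very hot" is harder to satisfy than "hot", and "somewhat hot" is easier, while full membership (1) and non-membership (0) are left unchanged.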

3.5 Propositional fuzzy logics

The most important propositional fuzzy logics are:

Monoidal t-norm-based propositional fuzzy logic MTL is an axiomatization of logic where

conjunction is defined by a left continuous t-norm, and implication is defined as the residuum of the

t-norm. Its models correspond to MTL-algebras that are prelinear commutative bounded integral

residuated lattices.

Basic propositional fuzzy logic BL is an extension of MTL logic where conjunction is defined by a

continuous t-norm, and implication is also defined as the residuum of the t-norm. Its models

correspond to BL-algebras.

Łukasiewicz fuzzy logic is the extension of basic fuzzy logic BL where standard conjunction is the Łukasiewicz t-norm. It has the axioms of basic fuzzy logic plus an axiom of double negation, and its models correspond to MV-algebras.

Gödel fuzzy logic is the extension of basic fuzzy logic BL where conjunction is Gödel t-norm. It has the axioms of BL plus an axiom of idempotence of conjunction, and its models are called G-algebras.

Product fuzzy logic is the extension of basic fuzzy logic BL where conjunction is product t-norm. It

has the axioms of BL plus another axiom for cancellativity of conjunction, and its models are called

product algebras.

Fuzzy logic with evaluated syntax (sometimes also called Pavelka's logic), denoted by EVŁ, is a further generalization of mathematical fuzzy logic. While the above kinds of fuzzy logic have traditional syntax and many-valued semantics, in EVŁ the syntax is also evaluated. This means that each formula has an evaluation. The axiomatization of EVŁ stems from Łukasiewicz fuzzy logic. A generalization of the classical Gödel completeness theorem is provable in EVŁ.
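
The three t-norms named in this section, together with the residuum that each one induces as its implication, have simple closed forms on [0, 1]. The sketch below states them using the standard textbook definitions; the function names and the sample truth values are chosen for the example.

```python
def lukasiewicz_t_norm(x, y):
    """Lukasiewicz conjunction: max(0, x + y - 1)."""
    return max(0.0, x + y - 1.0)


def godel_t_norm(x, y):
    """Godel conjunction: min(x, y)."""
    return min(x, y)


def product_t_norm(x, y):
    """Product conjunction: x * y."""
    return x * y


def lukasiewicz_residuum(x, y):
    """Implication x -> y induced by the Lukasiewicz t-norm."""
    return min(1.0, 1.0 - x + y)


def godel_residuum(x, y):
    """Implication x -> y induced by the Godel t-norm."""
    return 1.0 if x <= y else y


def product_residuum(x, y):
    """Implication x -> y induced by the product t-norm."""
    return 1.0 if x <= y else y / x


# Hypothetical truth values.
x, y = 0.75, 0.5
conjunctions = (lukasiewicz_t_norm(x, y), product_t_norm(x, y), godel_t_norm(x, y))
```

For any pair of truth values the Łukasiewicz result is never larger than the product result, which in turn is never larger than the Gödel result, and only the Gödel t-norm is idempotent (x AND x equals x).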

3.6 Information Retrieval On the Web

Retrieving information from the web can prove difficult because of the size and abstractness of the data contained on the web. Estimates for 2011 put the web at 50 billion web pages or more. Web retrieval becomes even more difficult when factors such as word ambiguity (where a single word can take on multiple meanings) and the large number of typographical errors contained within web information are added in. It is estimated that one in every two hundred words on an average web site contains a textual error.

There are several key issues involving information retrieval: relevance, evaluation, and information needs. However, these are not the only ones. Performance, scalability and the occurrence of page updates are other common information retrieval issues.

Relevance is the relational value of a given user query to the documents within the database.

Relevance of a document is normally based on a document ranking algorithm. These algorithms

define how relevant a document is to a user query by using functions that define relations between

the query given and the documents collected in the index.

The evaluation of the feedback given by the information retrieval system is another issue with

information retrieval. The behavior of the system may not meet the expectations of the user or the

documents returned from the system may not all be relevant to a query. Depending on the system

and the user, the results of a query should be in a format that most fits the data being searched and

returned.

Web information retrieval is an area open to many research opportunities. The larger problems with web information retrieval (relevance, evaluation, and information needs, amongst others) are still important topics that require attention.

Fuzzy information retrieval has proven to be a suitable solution for many areas involving

information retrieval that may have data that can be uncertain, such as the web. The individual

sections of the system were developed. This includes a crawler system that obeys standard internet

etiquette rules, and an indexing application that stores all information retrieved from the web in the

form of a standard inverted index. Each section entered its gathered data to the database.

Information needs concern how the user interacts with the information retrieval system. The data within the system should be easy to access in a way that is convenient to the user. Retrieving too much information might be inconvenient in some systems, while in others failing to return all relevant information may be unacceptable.

4. Evaluation of Performance of Information Retrieval System

In the past, various researchers have used the following parameters to evaluate the performance of IR systems:

1. Precision: The fraction of the retrieved documents that are relevant. In practice it gives the accuracy of the result.

Precision = |Ra| / |A| (1)

where,

Ra: Set of relevant documents retrieved

A: Set of documents retrieved

2. Recall: The fraction of all relevant documents that are retrieved. In practice it gives the coverage of the result.

Recall = |Ra| / |R| (2)

where,

Ra: Set of relevant documents retrieved

R: Set of all relevant documents

3. Precision-Recall Curve: This curve is based on the values of precision and recall, where the x-axis is recall and the y-axis is precision. Instead of using precision and recall at each rank position, the curve is commonly plotted using the 11 standard recall levels 0%, 10%, 20%, ..., 100%.
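
Equations (1) and (2) and the 11-point curve can be computed for a single query as sketched below. The interpolation rule used here (take the highest precision observed at any recall greater than or equal to each level) is the usual convention and is an assumption, since the text does not spell it out. The ranking and the relevance judgments are invented for the example.

```python
def precision_recall(retrieved, relevant):
    """Precision |Ra|/|A| and recall |Ra|/|R| for one query."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


def eleven_point_curve(ranking, relevant):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    relevant_set = set(relevant)
    if not relevant_set:
        return [0.0] * 11
    points = []  # (recall, precision) after each rank position
    hits = 0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant_set:
            hits += 1
        points.append((hits / len(relevant_set), hits / rank))
    curve = []
    for level in range(11):
        recall_level = level / 10
        # Interpolation: best precision at any recall >= this level.
        candidates = [p for r, p in points if r >= recall_level]
        curve.append(max(candidates) if candidates else 0.0)
    return curve


ranking = ["d3", "d7", "d1", "d9", "d4"]  # hypothetical system output
relevant = ["d3", "d1", "d8"]             # hypothetical relevance judgments
precision, recall = precision_recall(ranking, relevant)
curve = eleven_point_curve(ranking, relevant)
```

In this example two of the five retrieved documents are relevant and two of the three relevant documents are found, so the curve drops to zero for recall levels above two thirds.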

Moreover, the average similarity value of documents for an individual query and the average number of retrieved relevant documents can also be used as parameters to check the performance of an IR system. If the values of both parameters are high, the performance of the IR system is considered good.

5. REFERENCES

1. Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. Addison Wesley (1999)

2. Cooper, W.S.: Getting beyond Boole. Information Processing and Management 24, 243-248 (1988)

3. Harman, D.: Ranking Algorithms. Information retrieval: data structures and algorithms, pp. 363-392. Prentice-Hall (1992)

4. Salton, G.: Automatic text processing: the transformation, analysis, and retrieval of information by computer. Addison Wesley (1998)
