Fuzzy Logic Information Retrieval

Information retrieval using Fuzzy Logic on TREC Dataset.

  • MINOR PROJECT SYNOPSIS

    Implementation of the Fuzzy Logic based IR system for TREC dataset.

    Submitted in partial fulfilment of the requirements

    For the award of degree of

    Bachelor of Technology

    In

    Computer Science Engineering

    TEAM MEMBERS:

    Prabhjot Singh (03311502711)

    Sumit Dhawan (04311502711)

    Shubham Agarwal (04011502711)

    PROJECT GUIDE:

    Mrs Narina Thakur

    BHARATI VIDYAPEETH'S COLLEGE OF ENGINEERING

    A-4, PASCHIM VIHAR, ROHTAK ROAD, NEW DELHI- 110063

    AFFILIATED TO

    GURU GOBIND SINGH INDRAPRASTHA UNIVERSITY, DELHI

  • ABSTRACT

    Relevance, evaluation, and information needs are key issues in Information Retrieval. Relevance is the relational value of a given user query to the documents within the database. The relevance of a document is normally determined by a document ranking algorithm. These algorithms define how relevant a document is to a user query by using functions that relate the given query to the documents collected in the index.

    Information needs concern how the user interacts with the information retrieval system. The data within the system should be easy to access in a way that is convenient to the user. Retrieving too much information might be inconvenient in some systems, while in others failing to return all relevant information may be unacceptable. Research reveals the effectiveness of fuzzy logic in handling the uncertainty and vagueness of queries and documents.

    In the present project, a fuzzy logic based similarity measure is implemented using the TREC data collection. The performance of the proposed similarity measure is evaluated and compared with the Cosine similarity measure on the basis of Precision-Recall curves for individual queries.

  • INDEX

    1. Objective

    2. Introduction

    3. Fuzzy Logic based information system

    3.1 TREC Dataset

    3.2 Information Retrieval

    3.3 Fuzzy Logic

    3.4 Hard Science with If/then rules

    3.5 Propositional Fuzzy Logic

    3.6 Information Retrieval on the web

    4. Evaluation of performance of IR systems

    5. References

  • 1. OBJECTIVE

    This project implements an improved IR system using a fuzzy logic based similarity measure. Implementing a fuzzy logic based similarity measure is expected to improve the IR system's performance. The performance of the fuzzy logic based similarity measure is compared with that of the Cosine similarity measure on the TREC data collection.

  • 2. INTRODUCTION

    In the last couple of decades, various methods have been suggested to improve the performance of Information Retrieval (IR) systems. An IR system generally deals with the retrieval of relevant documents against user-defined queries. Baeza-Yates defined a general Information Retrieval model as a quadruple $[D, Q, F, R(q_i, d_j)]$, where $D$ is a set composed of logical views for the documents in the collection, $Q$ is a set composed of logical views for the user information needs expressed as queries, $F$ is a framework for modeling document representations, queries and their relationships, and $R(q_i, d_j)$ is a similarity measure/ranking function which associates a real number with a query $q_i \in Q$ and a document representation $d_j \in D$. Such a ranking defines an ordering among the documents with regard to the query $q_i$. The similarity measure therefore plays an important role in developing a quality IR system.

    An IR system evaluates relevancy using some representation of a document and a query. There are different models for representing documents and queries, and each model has its pros and cons. The Boolean model was the first model adopted by most of the earlier systems, and even today some commercial systems use it. It makes use of the concepts of Boolean logic and set theory.

    Documents and queries are collections of terms, and each term from a document is indexed. The presence or absence of a term in a document is represented by 1 or 0 respectively. To match the terms of documents and queries, an inverted index of the terms is maintained, i.e. for each term a list of the documents that contain it is stored. However, the Boolean model has some major limitations, such as a binary decision criterion without any notion of a grading scale, and overloading of documents. While some researchers have tried to overcome the weaknesses of the Boolean model by building refinements to it, others have approached IR with a different search strategy called the Vector Space Model.
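
    The inverted index and binary matching described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the function names, the whitespace tokenisation and the three toy documents are all assumptions made for the example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: lower-casing and whitespace splitting stand in for real tokenisation.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def boolean_and(index, terms):
    """Return the ids of documents that contain every query term (binary decision)."""
    postings = [set(index.get(term.lower(), [])) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


# Toy collection (hypothetical data).
docs = {
    1: "fuzzy logic handles vague queries",
    2: "boolean logic uses exact matching",
    3: "fuzzy sets and membership functions",
}
index = build_inverted_index(docs)
print(index["logic"])
print(boolean_and(index, ["fuzzy", "logic"]))
```

    A document either matches or does not; there is no grading between the two, which is the limitation the Boolean model is criticised for above.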

  • The Vector Space Model, as the name implies, represents documents and queries internally in the form of vectors. All queries and documents are represented as vectors in $|V|$-dimensional space, where $V$ is the set of all distinct terms in the collection (the vocabulary). Some of the advantages of the Vector Space Model are that it is a simple and fast model, that it can handle weighted terms, that it produces a ranked list as output, and that the indexing process is automated, which means a significantly lighter workload for the administrator of the collection. Also, it is easy to modify individual vectors, which is essential for the query expansion technique and the logic based similarity measure. Therefore, the vector space model is used as the base model in this project.

    An IR system needs to calculate the similarity between the query and a particular document in order to decide the relevancy of that document to the query. When a document retrieval system is used to query a collection of documents with $n$ terms, the system computes a vector $D = (d_{i1}, d_{i2}, \ldots, d_{in})$ of size $n$ for each document. The vectors are filled with the term weights and, similarly, a vector $Q = (W_{q1}, W_{q2}, \ldots, W_{qn})$ is constructed for the terms found in the query. In recent years, some efforts have been made to construct an effective similarity measure for enhancing the performance of IR systems.
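
    As a small illustration of the document and query vectors described above, the sketch below builds one weight per vocabulary term. It assumes raw term counts as the weights and whitespace tokenisation; the helper names and the sample texts are hypothetical and not taken from the project.

```python
from collections import Counter


def build_vocabulary(texts):
    """Collect the distinct terms of the collection in a fixed (sorted) order."""
    return sorted({term for text in texts for term in text.lower().split()})


def to_vector(text, vocabulary):
    """Return one weight per vocabulary term; here the weight is the raw term count."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


documents = ["fuzzy logic retrieval", "fuzzy fuzzy sets"]
query = "fuzzy retrieval"

vocabulary = build_vocabulary(documents)
doc_vectors = [to_vector(doc, vocabulary) for doc in documents]
query_vector = to_vector(query, vocabulary)

print(vocabulary)
print(doc_vectors)
print(query_vector)
```

    Every vector has the same length as the vocabulary, so a document and a query can be compared position by position.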

    Fan presented similarity functions as trees and used a classical generational scheme. Pathak et al. proposed a combined similarity measure, a linear combination of various similarity measures in which the weight of each measure is optimized using a genetic algorithm (GA). Mehran Sahami proposed a novel method for measuring the similarity between short text snippets by leveraging web search results to provide greater context for the short texts.

    Vincent Schickel-Zuber et al. presented a novel approach that allows similarities to be asymmetric while still using only information contained in the structure of the ontology. Torra et al. presented a method to calculate similarity between words based on dictionaries using fuzzy graphs. Chen presented a new similarity measure based on the geometric mean averaging operator to handle the similarity problems of generalized fuzzy numbers. Usharani et al. proposed a genetic algorithm based method for finding the similarity of web documents based on cosine similarity. In the past, the most popular similarity measures used in IR systems have been Cosine, Euclidean, Jaccard and Okapi.
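
    Two of the measures named above can be sketched in a few lines. The Cosine measure is computed on weight vectors and the Jaccard measure on term sets; the function names and the sample vectors are assumptions made for this illustration only.

```python
import math


def cosine_similarity(doc_vector, query_vector):
    """Cosine of the angle between two weight vectors of equal length."""
    dot = sum(d * q for d, q in zip(doc_vector, query_vector))
    norm = math.sqrt(sum(d * d for d in doc_vector)) * math.sqrt(sum(q * q for q in query_vector))
    return dot / norm if norm else 0.0


def jaccard_similarity(doc_terms, query_terms):
    """Size of the intersection of two term sets divided by the size of their union."""
    doc_terms, query_terms = set(doc_terms), set(query_terms)
    union = doc_terms | query_terms
    return len(doc_terms & query_terms) / len(union) if union else 0.0


# Hypothetical vectors over the vocabulary (fuzzy, logic, retrieval, sets).
doc_vector = [1, 1, 1, 0]
query_vector = [1, 0, 1, 0]
print(cosine_similarity(doc_vector, query_vector))
print(jaccard_similarity({"fuzzy", "logic", "retrieval"}, {"fuzzy", "retrieval"}))
```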

  • In the present project, a Fuzzy Logic based Similarity Measure is proposed for the vector space IR model. The performance of the proposed similarity measure is evaluated and compared with the above-mentioned similarity measures on the basis of Precision-Recall curves.

    3. Proposed Fuzzy Logic Based Similarity Measure

    The IR system retrieves documents based on a query given by the user. In most cases, both queries and documents are vague or imprecise and are usually expressed in Natural Language (NL). Sometimes the user may change the query during the information retrieval process, or may not be conscious of the exact information need [1].

    Fuzzy logic is therefore very suitable for handling this uncertainty, vagueness and imprecision. Fuzzy logic is based on fuzzy set theory and membership functions.

    Documents retrieved by a query are evaluated by the rules of a Fuzzy Inference System (FIS). The Vector Space Model is used as the base model due to its advantages over other models. This FIS uses three input variables, term frequency (tf), inverse document frequency (idf) and overlap, and one output variable, relevance. These input variables are very useful for determining the relevancy of a document to a particular query. TF is the number of occurrences of a term in each document of the corpus [2].

    IDF is given by $\log(N/n)$, where $N$ is the total number of documents in the corpus and $n$ is the number of documents that contain the term. Overlap reflects how many of the query terms are found in a document. A Mamdani-type fuzzy inference system is used in the Fuzzy Logic Based Similarity Measure (FLBSM), with the help of the Matlab Fuzzy Logic Toolbox.
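
    The three input variables can be computed as in the sketch below. It rests on stated assumptions: the logarithm is taken as the natural logarithm, overlap is expressed as the fraction of distinct query terms found in the document (the text does not give an exact formula), and the function names and toy corpus are hypothetical.

```python
import math
from collections import Counter


def term_frequency(term, doc_tokens):
    """Number of occurrences of the term in one document."""
    return Counter(doc_tokens)[term]


def inverse_document_frequency(term, corpus):
    """log(N / n): N documents in the corpus, n of them containing the term."""
    n = sum(1 for doc_tokens in corpus if term in doc_tokens)
    # Assumption: a term that occurs nowhere gets idf 0 instead of a division error.
    return math.log(len(corpus) / n) if n else 0.0


def overlap(query_tokens, doc_tokens):
    """Assumed definition: fraction of distinct query terms present in the document."""
    query_terms = set(query_tokens)
    return len(query_terms & set(doc_tokens)) / len(query_terms) if query_terms else 0.0


corpus = [
    ["fuzzy", "logic", "retrieval"],
    ["fuzzy", "sets"],
    ["boolean", "logic"],
]
query = ["fuzzy", "retrieval"]

print(term_frequency("fuzzy", corpus[0]))
print(inverse_document_frequency("retrieval", corpus))
print(overlap(query, corpus[1]))
```

    In a fuzzy inference system these raw values would normally be normalised to a common range before being mapped to LOW, MEDIUM and HIGH.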

  • The ranges of the input variables tf and idf and of the output variable relevance are represented by LOW, MEDIUM and HIGH, while the range of the input variable overlap is represented by LOW and HIGH. A triangular membership function is used to map the input space to a degree of membership of a fuzzy set. The details of the membership functions for the input and output variables of FLBSM are shown in Fig. 1 [3].

    Fig 1: membership functions for input and output variables of FLBSM
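
    A triangular membership function can be sketched as below. The break points used for LOW, MEDIUM and HIGH are illustrative assumptions on a normalised [0, 1] scale; the actual parameters are those of Fig. 1 and are not reproduced here.

```python
def triangular(x, a, b, c):
    """Triangular membership: 0 at the feet a and c, 1 at the peak b."""
    if x == b:
        return 1.0  # also covers shoulder shapes where a == b or b == c
    if x <= a or x >= c:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


# Assumed break points, not the values from Fig. 1.
FUZZY_SETS = {
    "LOW": (0.0, 0.0, 0.5),
    "MEDIUM": (0.0, 0.5, 1.0),
    "HIGH": (0.5, 1.0, 1.0),
}


def fuzzify(value):
    """Map a crisp value in [0, 1] to its degree of membership in each fuzzy set."""
    return {label: triangular(value, *params) for label, params in FUZZY_SETS.items()}


print(fuzzify(0.25))
```

    A single crisp value can belong to two neighbouring sets at once, each to a different degree.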

    Fuzzy rules are derived from the tf.idf weighting scheme, i.e. if a query term has high tf and high idf in a document, then relevance is likely to be high. If many of the terms of the query are found in the document (overlap), then relevance is likely to be high. It is known that adding rules that penalize low features increases the performance of the system. So the following rules are constructed for each query term:

    If (tf is High) and (idf is High) then (relevance is High).

    If (tf is Medium) and (idf is Medium) then (relevance is Medium).

    If (tf is Low) and (idf is Low) then (relevance is Low).

    Two fuzzy rules are also defined for overlap as follows:

    If (overlap is High) then (relevance is High).

    If (overlap is Low) then (relevance is Low).
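
    The five rules above can be evaluated with a small Mamdani-style sketch in plain Python. It is not the Matlab Fuzzy Logic Toolbox model used in the project: the membership break points, the normalisation of all inputs to [0, 1], the use of min for AND, max for aggregation and a sampled centroid for defuzzification are assumptions chosen for illustration, and the sketch scores a single query term.

```python
def triangular(x, a, b, c):
    """Triangular membership: 0 at the feet a and c, 1 at the peak b."""
    if x == b:
        return 1.0
    if x <= a or x >= c:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


# Assumed fuzzy sets on a normalised [0, 1] scale (the real ones are in Fig. 1).
SETS = {"low": (0.0, 0.0, 0.5), "medium": (0.0, 0.5, 1.0), "high": (0.5, 1.0, 1.0)}
OVERLAP_SETS = {"low": (0.0, 0.0, 1.0), "high": (0.0, 1.0, 1.0)}


def degree(x, label):
    return triangular(x, *SETS[label])


def overlap_degree(x, label):
    return triangular(x, *OVERLAP_SETS[label])


def infer_relevance(tf, idf, overlap, steps=100):
    """Fire the five rules, clip the output sets, and return the centroid."""
    # Firing strength per output label: AND is min, rules sharing a consequent combine by max.
    strength = {
        "high": max(min(degree(tf, "high"), degree(idf, "high")), overlap_degree(overlap, "high")),
        "medium": min(degree(tf, "medium"), degree(idf, "medium")),
        "low": max(min(degree(tf, "low"), degree(idf, "low")), overlap_degree(overlap, "low")),
    }
    numerator = denominator = 0.0
    for i in range(steps + 1):
        y = i / steps
        # Aggregated output membership at y: each output set clipped at its rule strength.
        mu = max(min(strength[label], degree(y, label)) for label in SETS)
        numerator += y * mu
        denominator += mu
    return numerator / denominator if denominator else 0.0


print(infer_relevance(1.0, 1.0, 1.0))
print(infer_relevance(0.5, 0.5, 0.5))
print(infer_relevance(0.0, 0.0, 0.0))
```

    With these assumed sets, inputs that are high on every feature give a relevance near the top of the scale and inputs that are low on every feature give a relevance near the bottom, which is the behaviour the rules are meant to encode.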

  • 3.1 TREC Dataset

    The Text REtrieval Conference (TREC) is an on-going series of workshops focusing on a list of

    different information retrieval (IR) research areas, or tracks. It is co-sponsored by the National

    Institute of Standards and Technology (NIST) and the Intelligence Advanced Research Projects

    Activity (part of the office of the Director of National Intelligence), and began in 1992 as part of the

    TIPSTER Text program. Its purpose is to support and encourage research within the information

    retrieval community by providing the infrastructure necessary for large-scale evaluation of text

    retrieval methodologies and to increase the speed of lab-to-product transfer of technology.

    Each track has a challenge wherein NIST provides participating groups with data sets and test

    problems. Depending on track, test problems might be questions, topics, or target extractable

    features. Uniform scoring is performed so the systems can be fairly evaluated. After evaluation of

    the results, a workshop provides a place for participants to collect together thoughts and ideas and

    present current and future research work [4].

    TREC systems often provide a baseline for further research. Examples include:

    Hal Varian, Chief Economist at Google, says "Better data makes for better science. The history of information retrieval illustrates this principle well," and describes TREC's contribution.

    TREC's Legal track has influenced the e-Discovery community both in research and in

    evaluation of commercial vendors.

    The IBM researcher team building IBM Watson (aka DeepQA), which beat the world's best

    Jeopardy! players, used data and systems from TREC's QA Track as baseline performance

    measurements.

  • 3.2 Information Retrieval

    Information retrieval is the activity of obtaining information resources relevant to an information

    need from a collection of information resources. Searches can be based on metadata or on full-text

    (or other content-based) indexing.

    Automated information retrieval systems are used to reduce what has been called "information

    overload". Many universities and public libraries use IR systems to provide access to books,

    journals and other documents. Web search engines are the most visible IR applications.

    An information retrieval process begins when a user enters a query into the system. Queries are

    formal statements of information needs, for example search strings in web search engines. In

    information retrieval a query does not uniquely identify a single object in the collection. Instead,

    several objects may match the query, perhaps with different degrees of relevancy.

    An object is an entity that is represented by information in a database. User queries are matched

    against the database information. Depending on the application the data objects may be, for

    example, text documents, images, audio, mind maps or videos. Often the documents themselves are

    not kept or stored directly in the IR system, but are instead represented in the system by document

    surrogates or metadata.

    Most IR systems compute a numeric score on how well each object in the database matches the

    query, and rank the objects according to this value. The top ranking objects are then shown to the

    user. The process may then be iterated if the user wishes to refine the query.

    For effectively retrieving relevant documents by IR strategies, the documents are typically

    transformed into a suitable representation. Each retrieval strategy incorporates a specific model for

    its document representation purposes.

  • Fig 2: Categorisation of IR models

    Applications of IR include:

    Digital libraries

    Information filtering

    Recommender systems

    Media search

    Blog search

    Image retrieval

    Speech retrieval

    Video retrieval

    To measure ad hoc information retrieval effectiveness in the standard way, we need a test collection

    consisting of three things:

    A document collection

    A test suite of information needs, expressible as queries

    A set of relevance judgments, standardly a binary assessment of either relevant or

    nonrelevant for each query-document pair.

  • 3.3 Fuzzy Logic

    Fuzzy logic is a form of many-valued logic; it deals with reasoning that is approximate

    rather than fixed and exact. Compared to traditional binary sets, fuzzy logic variables may

    have a truth value that ranges in degree between 0 and 1.

    Fuzzy logic has been extended to handle the concept of partial truth, where the truth value

    may range between completely true and completely false. Furthermore, when linguistic

    variables are used, these degrees may be managed by specific functions.

    The term "fuzzy logic" was introduced with the 1965 proposal of fuzzy set theory by Lotfi A. Zadeh. Fuzzy logic has been applied to many fields, from control theory to artificial intelligence. Fuzzy logics had, however, been studied since the 1920s as infinite-valued logics, notably by Łukasiewicz and Tarski.

    A basic application might characterize subranges of a continuous variable. For instance, a

    temperature measurement for anti-lock brakes might have several separate membership

    functions defining particular temperature ranges needed to control the brakes properly. Each

    function maps the same temperature value to a truth value in the 0 to 1 range. These truth

    values can then be used to determine how the brakes should be controlled.

    Fig 3: Fuzzy logic temperature

  • 3.4 Hard science with IF-THEN rules

    Fuzzy set theory defines fuzzy operators on fuzzy sets. The problem in applying this is that the

    appropriate fuzzy operator may not be known. For this reason, fuzzy logic usually uses IF-THEN

    rules, or constructs that are equivalent, such as fuzzy associative matrices.

    Rules are usually expressed in the form:

    IF variable IS property THEN action

    For example, a simple temperature regulator that uses a fan might look like this:

    IF temperature IS very cold THEN stop fan

    IF temperature IS cold THEN turn down fan

    IF temperature IS normal THEN maintain level

    IF temperature IS hot THEN speed up fan
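
    The fan rules above can be sketched as a function that evaluates every rule and reports how strongly each one fires. The temperature ranges in degrees Celsius, the triangular shapes and the clamping of the input are hypothetical choices made for this example only.

```python
def triangular(x, a, b, c):
    """Triangular membership: 0 at the feet a and c, 1 at the peak b."""
    if x == b:
        return 1.0
    if x <= a or x >= c:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


# Hypothetical temperature sets in degrees Celsius.
TEMPERATURE_SETS = {
    "very cold": (-10.0, -10.0, 5.0),
    "cold": (0.0, 10.0, 20.0),
    "normal": (15.0, 22.0, 29.0),
    "hot": (25.0, 40.0, 40.0),
}

# One action per rule: IF temperature IS <label> THEN <action>.
RULES = {
    "very cold": "stop fan",
    "cold": "turn down fan",
    "normal": "maintain level",
    "hot": "speed up fan",
}


def fire_rules(temperature):
    """Evaluate all rules and return the firing strength of each rule that applies."""
    # Assumption: readings outside the modelled range are clamped to its ends.
    t = min(max(temperature, -10.0), 40.0)
    fired = {}
    for label, action in RULES.items():
        strength = triangular(t, *TEMPERATURE_SETS[label])
        if strength > 0.0:
            fired[action] = strength
    return fired


print(fire_rules(18.0))
```

    At 18 degrees, under these assumed sets, both the "cold" rule and the "normal" rule fire, each to a different degree, so more than one action contributes to the final control decision.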

    There is no "ELSE"; all of the rules are evaluated, because the temperature might be "cold" and "normal" at the same time to different degrees.

    The AND, OR, and NOT operators of boolean logic exist in fuzzy logic, usually defined as the

    minimum, maximum, and complement; when they are defined this way, they are called the Zadeh

    operators. So for the fuzzy variables x and y:

    NOT x = (1 - truth(x))

    x AND y = minimum(truth(x), truth(y))

    x OR y = maximum(truth(x), truth(y))

    There are also other operators, more linguistic in nature, called hedges that can be applied. These

    are generally adverbs such as "very", or "somewhat", which modify the meaning of a set using a

    mathematical formula.
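
    The Zadeh operators defined above translate directly into code. The two hedges are shown with a commonly used convention (squaring for "very", square root for "somewhat"); that convention is an assumption here, since the text does not fix a formula for them.

```python
import math


def fuzzy_not(x):
    """Complement: NOT x = 1 - truth(x)."""
    return 1.0 - x


def fuzzy_and(x, y):
    """Zadeh conjunction: the minimum of the two truth values."""
    return min(x, y)


def fuzzy_or(x, y):
    """Zadeh disjunction: the maximum of the two truth values."""
    return max(x, y)


def very(x):
    """Assumed hedge: concentrate the set by squaring the truth value."""
    return x ** 2


def somewhat(x):
    """Assumed hedge: dilate the set by taking the square root."""
    return math.sqrt(x)


cold, normal = 0.25, 0.75
print(fuzzy_and(cold, normal), fuzzy_or(cold, normal), fuzzy_not(cold))
print(very(normal), somewhat(cold))
```

    With min and max, De Morgan's laws still hold: NOT (x AND y) gives the same value as (NOT x) OR (NOT y).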

  • 3.5 Propositional fuzzy logics

    The most important propositional fuzzy logics are:

    Monoidal t-norm-based propositional fuzzy logic MTL is an axiomatization of logic where

    conjunction is defined by a left continuous t-norm, and implication is defined as the residuum of the

    t-norm. Its models correspond to MTL-algebras that are prelinear commutative bounded integral

    residuated lattices.

    Basic propositional fuzzy logic BL is an extension of MTL logic where conjunction is defined by a

    continuous t-norm, and implication is also defined as the residuum of the t-norm. Its models

    correspond to BL-algebras.

    Łukasiewicz fuzzy logic is the extension of basic fuzzy logic BL where standard conjunction is the Łukasiewicz t-norm. It has the axioms of basic fuzzy logic plus an axiom of double negation, and its models correspond to MV-algebras.

    Gödel fuzzy logic is the extension of basic fuzzy logic BL where conjunction is the Gödel t-norm. It has the axioms of BL plus an axiom of idempotence of conjunction, and its models are called G-algebras.

    Product fuzzy logic is the extension of basic fuzzy logic BL where conjunction is product t-norm. It

    has the axioms of BL plus another axiom for cancellativity of conjunction, and its models are called

    product algebras.

    Fuzzy logic with evaluated syntax (sometimes also called Pavelka's logic), denoted by EVŁ, is a further generalization of mathematical fuzzy logic. While the above kinds of fuzzy logic have traditional syntax and many-valued semantics, in EVŁ the syntax is evaluated as well. This means that each formula has an evaluation. The axiomatization of EVŁ stems from Łukasiewicz fuzzy logic. A generalization of the classical Gödel completeness theorem is provable in EVŁ.
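
    The three t-norms named in this section, together with the residuum that each logic uses as its implication, can be written out as a short sketch. The formulas are the standard textbook definitions on the unit interval; the function names are chosen for this example.

```python
def godel_tnorm(x, y):
    """Gödel t-norm: the minimum."""
    return min(x, y)


def product_tnorm(x, y):
    """Product t-norm: ordinary multiplication."""
    return x * y


def lukasiewicz_tnorm(x, y):
    """Łukasiewicz t-norm: max(0, x + y - 1)."""
    return max(0.0, x + y - 1.0)


def godel_residuum(x, y):
    """Implication x -> y under the Gödel t-norm."""
    return 1.0 if x <= y else y


def product_residuum(x, y):
    """Implication x -> y under the product t-norm."""
    return 1.0 if x <= y else y / x


def lukasiewicz_residuum(x, y):
    """Implication x -> y under the Łukasiewicz t-norm."""
    return min(1.0, 1.0 - x + y)


x, y = 0.75, 0.5
print(godel_tnorm(x, y), product_tnorm(x, y), lukasiewicz_tnorm(x, y))
print(godel_residuum(x, y), product_residuum(x, y), lukasiewicz_residuum(x, y))
```

    Only the Gödel t-norm is idempotent (conjoining a value with itself returns the same value), which corresponds to the extra axiom mentioned for Gödel fuzzy logic above.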

  • 3.6 Information Retrieval On the Web

    Retrieving information from the web can prove difficult because of the size and abstractness of the data contained on the web. Estimates for 2011 put the web at 50 billion web pages or more. Web retrieval is made increasingly difficult by factors such as word ambiguity (where a single word can take on multiple meanings) and the large number of typographical errors contained within web information. It is estimated that one in every two hundred words on an average web site contains a textual error.

    There are several key issues involving information retrieval. These issues are relevance, evaluation,

    and information needs. However, these are not the only issues involving information retrieval.

    Other issues such as performance, scalability and occurrences of paging update are other common

    information retrieval issues.

    Relevance is the relational value of a given user query to the documents within the database.

    Relevance of a document is normally based on a document ranking algorithm. These algorithms

    define how relevant a document is to a user query by using functions that define relations between

    the query given and the documents collected in the index.

    The evaluation of the feedback given by the information retrieval system is another issue with

    information retrieval. The behavior of the system may not meet the expectations of the user or the

    documents returned from the system may not all be relevant to a query. Depending on the system

    and the user, the results of a query should be in a format that most fits the data being searched and

    returned.

    Web information retrieval is an area open to many research opportunities. The larger problems with web information retrieval (relevance, evaluation, and information needs, amongst others) are still important topics that require attention.

  • Fuzzy information retrieval has proven to be a suitable solution for many areas involving

    information retrieval that may have data that can be uncertain, such as the web. The individual

    sections of the system were developed. This includes a crawler system that obeys standard internet

    etiquette rules, and an indexing application that stores all information retrieved from the web in the

    form of a standard inverted index. Each section entered its gathered data to the database.

    Information needs is how the user interacts with the information retrieval system. The data within

    the system should be able to be accessed easily and in a way that is convenient to the user.

    Retrieving too much information might be inconvenient in certain systems, also in other systems

    not returning all relevant information may be unacceptable.

  • 4. Evaluation of Performance of Information Retrieval System

    In the past, various researchers have used the following parameters to evaluate the performance of IR systems:

    1. Precision: The fraction of the retrieved documents that are relevant. In practice it gives the accuracy of the result.

    $\text{Precision} = |R_a| / |A|$ (1)

    where,

    $R_a$: Set of relevant documents retrieved

    $A$: Set of documents retrieved

    2. Recall: The fraction of all relevant documents that are retrieved. In practice it gives the coverage of the result.

    $\text{Recall} = |R_a| / |R|$ (2)

    where,

    $R_a$: Set of relevant documents retrieved

    $R$: Set of all relevant documents

    3. Precision-Recall Curve: This curve is based on the values of precision and recall, where the x-axis is recall and the y-axis is precision. Instead of using precision and recall at each rank position, the curve is commonly plotted using the 11 standard recall levels 0%, 10%, 20%, ..., 100%.
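
    Equations (1) and (2) and the 11-point curve can be sketched as follows. The interpolation rule used here (at each recall level, take the highest precision observed at that recall or beyond) is the usual convention for 11-point curves and is an assumption, since the text does not spell it out; the document ids are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved documents."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # |Ra|
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def eleven_point_curve(ranking, relevant):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0 for a ranked list."""
    relevant = set(relevant)
    if not relevant:
        return [0.0] * 11
    points = []  # (recall, precision) after each rank position
    hits = 0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / rank))
    curve = []
    for level in range(11):
        recall_level = level / 10
        # Assumed interpolation: best precision at any recall >= this level.
        reachable = [p for r, p in points if r >= recall_level]
        curve.append(max(reachable, default=0.0))
    return curve


ranking = ["d1", "d2", "d3", "d4", "d5"]  # system output, best first
relevant = {"d1", "d4"}                  # relevance judgments for the query

print(precision_recall(ranking[:3], relevant))
print(eleven_point_curve(ranking, relevant))
```

    Plotting the eleven values against their recall levels gives the Precision-Recall curve for one query; curves from two similarity measures can then be compared on the same axes.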

    Moreover, average similarity value of documents for individual query and average number of

    retrieved relevant documents can also be used as parameters to check the performance of IR

    System. If the values for both of these parameters are high then the performance of IR System will

    be good.

  • 5. REFERENCES

    1. Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. Addison Wesley (1999)

    2. Cooper, W.S.: Getting beyond Boole. Information Processing and Management 24, 243-248 (1988)

    3. Harman, D.: Ranking Algorithms. Information retrieval: data structures and algorithms, pp. 363-392. Prentice-Hall (1992)

    4. Salton, G.: Automatic text processing: the transformation, analysis, and retrieval of information by computer. Addison Wesley (1998)