Set Theoretic Models 1. IR Models Non-Overlapping Lists Proximal Nodes Structured Models Retrieval:...

26
Set Theoretic Models 1

description

Set Theoretic Models  The Boolean model imposes a binary criterion for deciding relevance  The question of how to extend the Boolean model to accomodate partial matching and a ranking has attracted considerable attention in the past  Two set theoretic models for this:  Fuzzy Set Model  Extended Boolean Model 3

Transcript of Set Theoretic Models 1. IR Models Non-Overlapping Lists Proximal Nodes Structured Models Retrieval:...

Page 1: Set Theoretic Models 1. IR Models Non-Overlapping Lists Proximal Nodes Structured Models Retrieval: Adhoc Filtering Browsing U s e r T a s k Classic Models.

Set Theoretic Models

1

Page 2: Set Theoretic Models 1. IR Models Non-Overlapping Lists Proximal Nodes Structured Models Retrieval: Adhoc Filtering Browsing U s e r T a s k Classic Models.

IR Models

Non-Overlapping ListsProximal Nodes

Structured Models

Retrieval: Adhoc

Filtering

Browsing

U s e r

T a s k

Classic Models

Boolean Vector

Probabilistic

Set Theoretic

Fuzzy Extended Boolean

Probabilistic

Inference Network Belief Network

Algebraic

Generalized Vector Lat. Semantic Index

Neural Networks

Browsing

Flat Structure Guided

Hypertext 2

Page 3: Set Theoretic Models 1. IR Models Non-Overlapping Lists Proximal Nodes Structured Models Retrieval: Adhoc Filtering Browsing U s e r T a s k Classic Models.

Set Theoretic Models

The Boolean model imposes a binary criterion for deciding relevance

The question of how to extend the Boolean model to accomodate partial matching and a ranking has attracted considerable attention in the past

Two set theoretic models for this: Fuzzy Set Model Extended Boolean Model

3

Page 4: Set Theoretic Models 1. IR Models Non-Overlapping Lists Proximal Nodes Structured Models Retrieval: Adhoc Filtering Browsing U s e r T a s k Classic Models.

Fuzzy Set Model

Queries and docs represented by sets of index terms: matching is approximate from the start

This vagueness can be modeled using a fuzzy framework, as follows:with each term is associated a fuzzy seteach doc has a degree of membership in this fuzzy

set This interpretation provides the foundation for many

models for IR based on fuzzy theory In here, we discuss the model proposed by Ogawa,

Morita, and Kobayashi (1991)4

Page 5: Set Theoretic Models 1. IR Models Non-Overlapping Lists Proximal Nodes Structured Models Retrieval: Adhoc Filtering Browsing U s e r T a s k Classic Models.

Fuzzy Set Theory

Framework for representing classes whose boundaries are not well defined

Key idea is to introduce the notion of a degree of membership associated with the elements of a set

This degree of membership varies from 0 to 1 and allows modeling the notion of marginal membership

Thus, membership is now a gradual notion, contrary to the crispy notion enforced by classic Boolean logic

5

Page 6: Set Theoretic Models 1. IR Models Non-Overlapping Lists Proximal Nodes Structured Models Retrieval: Adhoc Filtering Browsing U s e r T a s k Classic Models.

6

Fuzzy Set Theory

Model A query term: a fuzzy set A document: degree of membership in this test Membership function

Associate membership function with the elements of the class

0: no membership in the test 1: full membership 0 ~1: marginal elements of the test

documents

Page 7: Set Theoretic Models 1. IR Models Non-Overlapping Lists Proximal Nodes Structured Models Retrieval: Adhoc Filtering Browsing U s e r T a s k Classic Models.

Fuzzy Set Theory

A fuzzy subset A of a universe of discourse U is characterized by a membership function µA: U[0,1] which associates with each element u of U a number µA(u) in the interval [0,1]

– complement:

– union:

– intersection:

)(1)( uu AA

))(),(max()( uuu BABA

))(),(min()( uuu BABA

a class document collectionfor query term

7

Page 8: Set Theoretic Models 1. IR Models Non-Overlapping Lists Proximal Nodes Structured Models Retrieval: Adhoc Filtering Browsing U s e r T a s k Classic Models.

Examples

Assume U={d1, d2, d3, d4, d5, d6} Let A and B be {d1, d2, d3} and {d2, d3, d4}, respectively. Assume A={d1:0.8, d2:0.7, d3:0.6, d4:0, d5:0, d6:0}

and B={d1:0, d2:0.6, d3:0.8, d4:0.9, d5:0, d6:0} = {d1:0.2, d2:0.3, d3:0.4, d4:1, d5:1, d6:1} =

{d1:0.8, d2:0.7, d3:0.8, d4:0.9, d5:0, d6:0} =

{d1:0, d2:0.6, d3:0.6, d4:0, d5:0, d6:0}

)(1)( uu AA ))(),(max()( uuu BABA

))(),(min()( uuu BABA

8

Page 9: Set Theoretic Models 1. IR Models Non-Overlapping Lists Proximal Nodes Structured Models Retrieval: Adhoc Filtering Browsing U s e r T a s k Classic Models.

Fuzzy Information Retrieval basic idea

– Expand the set of index terms in the query with related terms (from the thesaurus) such that additional relevant documents can be retrieved

– A thesaurus can be constructed by defining a term-term correlation matrix c whose rows and columns are associated to the index terms in the document collection

keyword connection matrix

9

Page 10: Set Theoretic Models 1. IR Models Non-Overlapping Lists Proximal Nodes Structured Models Retrieval: Adhoc Filtering Browsing U s e r T a s k Classic Models.

Fuzzy Information Retrieval(Continued)

normalized correlation factor ci,l between two terms ki and kl (0~1)

In the fuzzy set associated to each index term ki, a document dj has a degree of membership µi,j

lili

lili nnn

nc

,

,,

)1(1 ,,

jdlk

liji c

where ni is # of documents containing term ki

nl is # of documents containing term kl

ni,l is # of documents containing ki and kl

10

Page 11: Set Theoretic Models 1. IR Models Non-Overlapping Lists Proximal Nodes Structured Models Retrieval: Adhoc Filtering Browsing U s e r T a s k Classic Models.

Fuzzy Information Retrieval(Continued)

physical meaning– A document dj belongs to the fuzzy set associated to the

term ki if its own terms are related to ki, i.e., i,j=1.

– If there is at least one index term kl of dj which is strongly related to the index ki, then i,j1.

ki is a good fuzzy index

– When all index terms of dj are only loosely related to ki, i,j0.

ki is not a good fuzzy index

11

Page 12: Set Theoretic Models 1. IR Models Non-Overlapping Lists Proximal Nodes Structured Models Retrieval: Adhoc Filtering Browsing U s e r T a s k Classic Models.

Example

q = (ka (kb kc))= (ka kb kc) (ka kb kc) (ka kb kc)= cc1+cc2+cc3

Da

Db

Dc

cc3cc2

cc1

Da: the fuzzy set of documents associated to the index ka

djDa has a degree of membership a,j > a predefined threshold K

Da: the fuzzy set of documents associated to the index ka

(the negation of index term ka)

12

Page 13: Set Theoretic Models 1. IR Models Non-Overlapping Lists Proximal Nodes Structured Models Retrieval: Adhoc Filtering Browsing U s e r T a s k Classic Models.

Example

))1)(1(1())1(1()1(1

)1(1

,,,,,,,,,

3

1,

,321,

jcjbjajcjbjajcjbja

ijicc

jccccccjq

Query q=ka (kb kc)

disjunctive normal form qdnf=(1,1,1) (1,1,0) (1,0,0)

(1) the degree of membership in a disjunctive fuzzy set is computed using an algebraic sum (instead of max function) more smoothly(2) the degree of membership in a conjunctive fuzzy set is computed

using an algebraic product (instead of min function)

Recall )(1)( uu AA

13

Page 14: Set Theoretic Models 1. IR Models Non-Overlapping Lists Proximal Nodes Structured Models Retrieval: Adhoc Filtering Browsing U s e r T a s k Classic Models.

Fuzzy Set Model

– Q: “gold silver truck”D1: “Shipment of gold damaged in a fire”D2: “Delivery of silver arrived in a silver truck”D3: “Shipment of gold arrived in a truck”

– IDF (Select Keywords)• a = in = of = 0 = log 3/3

arrived = gold = shipment = truck = 0.176 = log 3/2

damaged = delivery = fire = silver = 0.477 = log 3/1

– 8 Keywords (Dimensions) are selected• arrived(1), damaged(2), delivery(3), fire(4), gold(5),

silver(6), shipment(7), truck(8)

14

Page 15: Set Theoretic Models 1. IR Models Non-Overlapping Lists Proximal Nodes Structured Models Retrieval: Adhoc Filtering Browsing U s e r T a s k Classic Models.

Fuzzy Set Model

15

Page 16: Set Theoretic Models 1. IR Models Non-Overlapping Lists Proximal Nodes Structured Models Retrieval: Adhoc Filtering Browsing U s e r T a s k Classic Models.

Fuzzy Set Model

16

Page 17: Set Theoretic Models 1. IR Models Non-Overlapping Lists Proximal Nodes Structured Models Retrieval: Adhoc Filtering Browsing U s e r T a s k Classic Models.

Fuzzy Set Model

Sim(q,d): Alternative 1

Sim(q,d3) > Sim(q,d2) > Sim(q,d1) Sim(q,d): Alternative 2

Sim(q,d3) > Sim(q,d2) > Sim(q,d1)17

Page 18: Set Theoretic Models 1. IR Models Non-Overlapping Lists Proximal Nodes Structured Models Retrieval: Adhoc Filtering Browsing U s e r T a s k Classic Models.

Extended Boolean Model

• Disadvantages of “Boolean Model” :• No term weight is used• Counterexample: query q=Kx AND Ky. Documents containing just one term, e,g, Kx is considered as

irrelevant as another document containing none of these terms.• No term weight is used• The size of the output might be too large or too small

18

Page 19: Set Theoretic Models 1. IR Models Non-Overlapping Lists Proximal Nodes Structured Models Retrieval: Adhoc Filtering Browsing U s e r T a s k Classic Models.

Extended Boolean Model

• The Extended Boolean model was introduced in 1983 by Salton, Fox, and Wu

• The idea is to make use of term weight as vector space model.

• Strategy: Combine Boolean query with vector space model.

• Why not just use Vector Space Model?• Advantages: It is easy for user to provide query.

19

Page 20: Set Theoretic Models 1. IR Models Non-Overlapping Lists Proximal Nodes Structured Models Retrieval: Adhoc Filtering Browsing U s e r T a s k Classic Models.

Extended Boolean Model

• Each document is represented by a vector (similar to vector space model.)

• Remember the formula.• Query is in terms of Boolean formula.• How to rank the documents?

ii

xjxjx

idfidffw

max*,,

20

Page 21: Set Theoretic Models 1. IR Models Non-Overlapping Lists Proximal Nodes Structured Models Retrieval: Adhoc Filtering Browsing U s e r T a s k Classic Models.

Fig. Extended Boolean logic considering the space composed of two terms kx and ky only.

dj

dj +1dj +1

dj

kx and ky

kx or ky

( 0, 1) ( 0, 1)( 1, 1) ( 1, 1)

( 0, 0) ( 1, 0) ( 0, 0) ( 1, 0)

ky ky

kx kx21

Page 22: Set Theoretic Models 1. IR Models Non-Overlapping Lists Proximal Nodes Structured Models Retrieval: Adhoc Filtering Browsing U s e r T a s k Classic Models.

Extended Boolean Model

• For query q=Kx or Ky, (0,0) is the point we try to avoid. Thus, we can use

to rank the documents• The bigger the better.

2),(

22 yxdqsim or

22

Page 23: Set Theoretic Models 1. IR Models Non-Overlapping Lists Proximal Nodes Structured Models Retrieval: Adhoc Filtering Browsing U s e r T a s k Classic Models.

Extended Boolean Model

• For query q=Kx and Ky, (1,1) is the most desirable point.

• We use

to rank the documents.• The bigger the better.

2

1(1),(

))1( 22 yxdqsim and

23

Page 24: Set Theoretic Models 1. IR Models Non-Overlapping Lists Proximal Nodes Structured Models Retrieval: Adhoc Filtering Browsing U s e r T a s k Classic Models.

Extend the idea to m terms

• qor=k1 p k2 p … p Km

• qand=k1 p k2 p … p km

)...( 21

/1

),(m

xxx pm

pp p

jor dqsim

))1(...)1()1(

(1 21

/1

),(m

xxx mppp

jand

p

dqsim

24

Page 25: Set Theoretic Models 1. IR Models Non-Overlapping Lists Proximal Nodes Structured Models Retrieval: Adhoc Filtering Browsing U s e r T a s k Classic Models.

Properties

• The p norm as defined above enjoys a couple of interesting properties as follows. First, when p=1 it can be verified that

• Second, when p= it can be verified that• Sim(qor,dj)=max(xi)

• Sim(qand,dj)=min(xi)

mxxdqsimdqsim m

jandjor

...),(),( 1

25

Page 26: Set Theoretic Models 1. IR Models Non-Overlapping Lists Proximal Nodes Structured Models Retrieval: Adhoc Filtering Browsing U s e r T a s k Classic Models.

Example

• For instance, consider the query q=(k1 k2) k3. The similarity sim(q,dj) between a document dj and this query is then computed as

• Any boolean can be expressed as a numeral formula.

)2

))(1((

321

/1 /1

2)1()1(

),( x pp p p

xxdqsim

pp

26