05359477


Transcript of 05359477

Page 1: 05359477

The Case Retrieval Strategy Based on Predicting Query Performance

Shixia Ma, Dan Liu, Bo Sun Computer Science and Technology Department

Henan Mechanic and Electric Engineering College Xinxiang, Henan, China

e-mail:{ msx123456,liudan1005}@126.com

Abstract—The case retrieval is the core of the case-based reasoning system. The paper applies performance prediction into the case retrieval based on case reasoning system, designs a retrieval strategy based on heterogeneous case base. The paper analyzes the organization fashion of the heterogeneous case, and mainly discusses the dynamic allocation strategy of characteristic items’ weight and construction method of the case retrieval log. The results of the experiment prove that the method can support users to build the target case and then efficiently enhance the success rate of case retrieval as well as the utilization rate of case base.

Keywords- case-based reasoning system; target case; characteristic items’ weight; case retrival log

I. INTRODUCTION

Case-Based Reasoning (CBR) is a strategy that solves the current problem by revisiting the processes and results stored in a repository that solved similar past problems, and that acquires new knowledge from each new case. It is an important reasoning method in AI [1] [2] [3].

Case retrieval is the core of a CBR system; retrieval quality determines the quality of the entire system [4]. The target case generated from a user query can be mapped to a point in the corresponding space. Depending on the eigenvalues the user selects for the characteristic items, the target case is mapped to a different region of the space, so the cases within similar distance differ, and so does the output result set. When the target case does not fully reflect the user's needs, the returned result set cannot satisfy the user either, and the user must issue several queries to obtain the required result set. This not only increases the load on the system but also makes the system harder to use.

Predicting Query Performance (PQP), also called Predicting Query Difficulty [5] [6] [7], estimates how good the results returned by a retrieval system for a query are when no relevance information is available. Query performance prediction is receiving more and more attention and has become one of the key functions of a retrieval system [8]. Its methods fall into two categories: pre-retrieval and post-retrieval [9].

In this paper, post-retrieval query performance prediction is applied to case retrieval in a case-based reasoning system. The main idea is as follows: after the first case retrieval, the user-generated target case is adjusted according to the dynamically allocated weights of the characteristic items and the retrieval logs, which assists the user in rebuilding the target case.

II. THE PARTITION OF CASE BASE

The case base is an important part of a case-based reasoning system. Let $C = (c_1, c_2, \ldots, c_n)$ be a non-empty finite set of $n$ cases, where each $c_i$ $(1 \le i \le n)$ represents a case in the case set $C$.

The structure of a case base can be classified as isomorphic or heterogeneous according to the cases' characteristic items, the characteristic items' weights, and the relationship between the sets of characteristic items [10]. This paper uses a fully heterogeneous case base to organize cases. In a fully heterogeneous case base, the number and types of characteristic items may differ from case to case, and the weight of the same characteristic item may also differ between cases.

Let $F = (f_1, f_2, \ldots, f_m)$ be a non-empty finite set of $m$ characteristic items, where each $f_i$ $(1 \le i \le m)$ represents a characteristic item of a case. Let $W = (w_1, w_2, \ldots, w_m)$ be a non-empty finite set of $m$ weights with $0 \le w_i \le 1$, $1 \le i \le m$, where $m$ is the number of elements in $F$, $w_i$ is the weight corresponding to $f_i$, and $\sum_{i=1}^{m} w_i = 1$.

$F$ is the collection of global characteristic items in the case base, $m$ is the largest number of characteristic items, and each $c_i$ can be regarded as a subset of $F$. The collection of a case's characteristic-item weights, $w_i = (w_1, w_2, \ldots, w_v)$, can be regarded as a subset of $W$. A case $c_i = (f_1, f_2, \ldots, f_k)$ $(1 \le k \le m)$ in the case base can be mapped to a point in $k$-dimensional space in

2009 Sixth International Conference on Fuzzy Systems and Knowledge Discovery

978-0-7695-3735-1/09 $25.00 © 2009 IEEE

DOI 10.1109/FSKD.2009.399

395


Page 2: 05359477

accordance with its characteristic items and their eigenvalues.

According to the differences in the number and type of characteristic items contained in the cases of the case base, let $S = (s_1, s_2, \ldots, s_p)$ be a non-empty finite set of $p$ elements. Each $s_i$ $(1 \le i \le p)$ is a collection of elements of $F$, the collections may differ in size, and each $s_i$ represents one case structure category in the case base.
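The partition into case structure categories can be illustrated with a minimal Python sketch. The characteristic item names and eigenvalues below are hypothetical, and representing a case as a dictionary and a category as a `frozenset` of item names is an assumption made for illustration, not the authors' implementation.

```python
from collections import defaultdict

# Hypothetical global characteristic items F (names are illustrative only).
F = ("price", "brand", "color", "weight")

# Heterogeneous cases: each case uses only a subset of F.
cases = [
    {"price": 0.2, "brand": 0.7},
    {"price": 0.9, "brand": 0.1},
    {"color": 0.5},
    {"price": 0.4, "brand": 0.3, "color": 0.8},
]


def structure_category(case):
    """Return the case structure category: the set of characteristic items used."""
    return frozenset(case)


def partition(case_base):
    """Group cases by structure category so retrieval scans only one group."""
    groups = defaultdict(list)
    for case in case_base:
        groups[structure_category(case)].append(case)
    return dict(groups)


# S maps each structure category s_i to the cases that share it.
S = partition(cases)
```

Under this representation, a target case built from `price` and `brand` would be compared only with the two cases in `S[frozenset({"price", "brand"})]`.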

From the viewpoint of retrieval efficiency, organizing cases in a heterogeneous case base amounts to dimensionality reduction. Dimensionality reduction not only lowers the cost of computing similarity, but also maps the target case to one case structure category so that it is matched only against cases of the same structure category. This mitigates the loss of retrieval efficiency caused by the growth of the case base.

From the viewpoint of building a target case, users are not aware of, or not interested in, all $m$ characteristic items, and they do not select all $m$ of them when they build the target case. Organizing cases heterogeneously therefore describes the users' needs more accurately and also makes the system easier to use.

III. THE TARGET CASE RECOMMENDATION BASED ON POST-RETRIEVAL

A. The Distribution Strategy of Characteristic Item's Weight

The target case is composed of the characteristic items selected by the user, and the importance level of the characteristic items can be used to evaluate the merits of the target case. Each element of $F$ is given a weight that reflects how heavily it is used in case retrieval: the more it is used, the greater its weight. After retrieval, a new target case can therefore be recommended to the user according to the characteristic items the user selected and the weights of the elements of $F$. To this end, the paper designs a weight updating strategy based on the ant colony algorithm, which adjusts the weights of the elements of $F$ according to the characteristic items selected by the user.

In the 1990s, the Italian scholars M. Dorigo, V. Maniezzo, A. Colorni and others proposed a novel simulated evolutionary algorithm, the ant colony algorithm [11], by simulating how ants search for paths in nature. Its principle is that ants leave pheromone on the paths they travel while searching for food. Ants communicate and cooperate through the pheromone and thereby find a shorter path. The more ants pass along a path, the stronger its pheromone, and ants tend to choose the direction in which the pheromone is stronger. Meanwhile, the pheromone evaporates over time, which keeps the search from being trapped in a local optimum.

Following the ant colony algorithm, all the elements of the collection $F$ are regarded as the global path, and the characteristic items selected by the user, which make up the newly added case, are regarded as ants. Completing a case retrieval and recommending the matching results to the user is regarded as the ants passing along the path. When the ants pass along the path, the weight collection $W$ corresponding to $F$ is updated according to all the characteristic items contained in $s_i$. The weight updating algorithm for the characteristic items is as follows:

(1) After the case retrieval is complete, extract the collection of characteristic items $s_i = (f_1, f_2, \ldots, f_v)$ $(1 \le v \le m)$ contained in the case structure category.

(2) Assign the initial weight $w_i = \frac{1}{v}$ $(1 \le i \le v)$ to every element of $s_i$.

(3) Update the weights of the elements of $F$ with the formula $w_i = \tau(t+1) = \rho \cdot \tau(t) + \Delta\tau$, $1 \le i \le v$. Here $\rho$ is a parameter with $0 < \rho < 1$, and $1 - \rho$ is the evaporation factor of the characteristic items' weights between retrieval $t$ and retrieval $t+1$; $\tau(t)$ is the weight of the element of $F$ at the $t$-th case retrieval; and $\Delta\tau = (1 - \rho)\, w'$ is the pheromone left after one successful case match. For elements of $F$ that were not assigned a weight in step (2), $w' = 0$; otherwise $w' = \frac{1}{v}$.
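The update rule in step (3) can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function name `update_weights`, the dictionary representation of $W$, and the item names are assumptions, and the sketch further assumes that every selected item is an element of $F$. Under that assumption, if the weights sum to 1 before the update, they still sum to 1 afterwards, because the decayed part contributes $\rho$ and the deposited part contributes $1 - \rho$.

```python
def update_weights(weights, selected, rho=0.98):
    """Apply tau(t+1) = rho * tau(t) + (1 - rho) * w' to every item in F.

    weights  -- dict mapping each characteristic item in F to its weight
    selected -- the v items of the matched case structure category s_i
    w' is 1/v for items in `selected` and 0 for all other items.
    """
    v = len(selected)
    if v == 0:
        raise ValueError("the structure category must contain at least one item")
    updated = {}
    for item, tau in weights.items():
        w_prime = 1.0 / v if item in selected else 0.0
        updated[item] = rho * tau + (1.0 - rho) * w_prime
    return updated


# Four items with equal weights; the user selected f1 and f2.
weights = {"f1": 0.25, "f2": 0.25, "f3": 0.25, "f4": 0.25}
new_weights = update_weights(weights, {"f1", "f2"}, rho=0.98)
```

With these inputs the selected items rise slightly (to about 0.255) and the unselected ones decay (to about 0.245), so frequently selected items gradually accumulate weight.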

B. Case Retrieval Log

After every case retrieval, the case retrieval log stores the generated target case and its similar cases as a collection, and all cases in the space corresponding to the case structure category to which the target case is mapped are organized in an inverted order. Every case in the space maintains a list of the regions it has belonged to after retrievals. The longer the list, the more times the case has been output.

The result set recommended after a case retrieval satisfies the following definition.

Definition 1: The similarity between a recommended case and the target case must be larger than the prescribed threshold $sim'$.

When a case is retrieved, the target case, according to the characteristic items it contains, corresponds to one element $s_i = (f_1, f_2, \ldots, f_v)$ $(1 \le v \le m)$ in collection


Page 3: 05359477

$S$. Then, according to the eigenvalues of the characteristic items contained in the target case, the target case is mapped to a point in $v$-dimensional space. After the target case has been matched against all cases in this space, the collection of similar cases is output. Therefore, after every case retrieval, the target case and the collection of similar cases form a region whose center is the target case and whose distance from the center to the border is $dis = 1 - sim'$.

Definition 2: After a case retrieval, the region formed by the target case and the collection of similar cases is expressed as $area = (case, caselist, s)$, where $case$ is the stored target case, $caselist$ is the list of similar cases that were output, and $s$ is the case structure category to which the region belongs.

Definition 3: The inverted structure of a case is $ca = (c, arealist)$, where $c$ is the case and $arealist$ is the list of regions corresponding to the case; within that list, $area$ is a region and $count$ is the number of times the region occurred.

When a case is retrieved, the following rule must be followed.

Definition 4: If the similarity between a recommended case and the target case is larger than the prescribed threshold $sim''$, the two cases can be regarded as the same case.

In the case base, the distance between any two cases with the same case structure category is larger than $dis = 1 - sim''$. If the similarity between the target case and the $case$ of a region is larger than $sim''$, the user is considered to have performed a retrieval in the same region, and $count$ is increased by 1 after the retrieval is complete. The cases in the same case structure category can therefore be sorted in descending order by the number of regions each case belongs to. If two cases belong to the same number of regions, they are sorted by the sum of the $count$ values of their regions. After sorting, the case collection contained in the space of this case structure category is expressed as $s_i' = (c_1, c_2, \ldots, c_v)$ $(1 \le v < n)$.
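The inverted structure of Definition 3 and the sorting rule above can be sketched as follows. The class name `InvertedCase`, the use of string identifiers for cases and regions, and the dictionary from region to $count$ are assumptions made for illustration; the paper does not specify a concrete data layout.

```python
from dataclasses import dataclass, field


@dataclass
class InvertedCase:
    """ca = (c, arealist): a case plus the regions it appeared in, with counts."""

    case_id: str
    # Maps a region identifier to count, the number of times that region occurred.
    arealist: dict = field(default_factory=dict)

    def record(self, area_id):
        """Register one more retrieval that placed this case in the given region."""
        self.arealist[area_id] = self.arealist.get(area_id, 0) + 1


def sort_cases(inverted_cases):
    """Sort by number of regions, then by the sum of counts, both descending."""
    return sorted(
        inverted_cases,
        key=lambda ca: (len(ca.arealist), sum(ca.arealist.values())),
        reverse=True,
    )


a = InvertedCase("c1")
a.record("area1")
a.record("area1")  # same region hit twice: one region, count 2
b = InvertedCase("c2")
b.record("area1")
b.record("area2")  # two distinct regions
c = InvertedCase("c3")
c.record("area3")  # one region, count 1

ranked = [ca.case_id for ca in sort_cases([a, b, c])]
```

Here `c2` ranks first because it belongs to two regions, and `c1` precedes `c3` because its single region has the larger count.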

C. Recommendation Ways of Target Case

Depending on the characteristic items and their corresponding eigenvalues, the target case recommended to the user after retrieval can be extracted by the following three methods.

Method 1: The recommended target case $targetcase_1''$ has the same case structure category as the target case $targetcase'$ generated by the user's retrieval, but its eigenvalues differ from those of $targetcase'$. That is, the recommended $targetcase_1''$ is mapped to a different region of the same space. The eigenvalues of the characteristic items of $targetcase_1''$ are those of the case with the smallest label in the sorted collection $s_i'$ contained in the space corresponding to $targetcase_1''$.

Method 2: From the collection $F$, extract the characteristic item $f_i$ that has the largest weight and is not contained in the target case $targetcase'$ generated by the user's retrieval, and build a new target case $targetcase_2''$ from $f_i$ and the characteristic items contained in $targetcase'$. The eigenvalues of the characteristic items of $targetcase_2''$ are those of the case with the smallest label in the sorted collection $s_i'$ contained in the space corresponding to $targetcase_2''$. If $targetcase'$ contains as many characteristic items as there are elements in $F$, $targetcase_2''$ cannot be built.

Method 3: From the target case $targetcase'$ generated by the user's retrieval, extract its collection of characteristic items and remove the item with the smallest weight; the remaining items form $targetcase_3''$. The eigenvalues of the characteristic items of $targetcase_3''$ are those of the case with the smallest label in the sorted collection $s_i'$ contained in the space corresponding to $targetcase_3''$. If $targetcase'$ contains only one characteristic item, $targetcase_3''$ cannot be built.
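How Methods 2 and 3 choose the characteristic items of a recommended target case can be sketched as follows. The function names and item names are hypothetical, and the sketch covers only the selection of items; filling in the eigenvalues from the top-ranked case of the sorted collection is left out. Returning `None` stands for the two situations in which the text says the recommended case cannot be built.

```python
def method2_items(target_items, weights):
    """Method 2: add the highest-weight item of F that the target case lacks."""
    missing = [f for f in weights if f not in target_items]
    if not missing:
        return None  # the target case already uses every item in F
    best = max(missing, key=lambda f: weights[f])
    return set(target_items) | {best}


def method3_items(target_items, weights):
    """Method 3: drop the lowest-weight item from the target case."""
    if len(target_items) <= 1:
        return None  # a case with a single item cannot be reduced further
    worst = min(target_items, key=lambda f: weights[f])
    return set(target_items) - {worst}


# Hypothetical global weights W over F and a user-built target case.
weights = {"f1": 0.4, "f2": 0.3, "f3": 0.2, "f4": 0.1}
target = {"f1", "f3"}

extended = method2_items(target, weights)  # adds f2, the heaviest missing item
reduced = method3_items(target, weights)   # removes f3, the lightest selected item
```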

IV. CASE RETRIEVAL STRATEGY

The case retrieval process must also satisfy the following definition:

Definition 5: If the similarity between the target case and any element of the sub-case collection contained in the corresponding space exceeds the prescribed threshold $sim''$, the target case is not stored. By Definition 2, to guarantee that the stored cases are representative of the space and to reduce the storage density of the space, a target case that is too similar to cases already in the space is not stored.

The case retrieval process is as follows:


Page 4: 05359477

(1) Extract the characteristic items contained in the newly generated target case $targetcase$ and map $targetcase$ into the corresponding space $s_i$.

(2) Match $targetcase$ against all cases in the space and calculate the distances $dis_i, dis_j, \ldots, dis_n$.

(3) Compare $dis_i, dis_j, \ldots, dis_n$ with $sim'$, sort the cases that satisfy Definition 1 by similarity and output them, and update the weights of all the elements of collection $F$. If the space contains a case whose similarity to $targetcase$ is larger than $sim''$, update the $count$ value of the $area$ in that case's inverted structure $ca$.

(4) Determine from the user's feedback whether the recommendation succeeded. If it succeeded, compare $dis_i, dis_j, \ldots, dis_n$ with $sim''$ to determine whether $targetcase$ and all cases in the space satisfy Definition 5. If they do, store $targetcase$, update the inverted structure $ca$ of the cases in the space corresponding to $targetcase$ according to the new region generated by $targetcase$, and end the retrieval process; if they do not, $targetcase$ is not stored. If the recommendation failed, go to step (5).

(5) Using the methods for extracting recommended target cases, recommend the target cases $targetcase_1''$, $targetcase_2''$ and $targetcase_3''$ to the user.

(6) Based on the recommended target cases $targetcase_1''$, $targetcase_2''$ and $targetcase_3''$, the user rebuilds the target case, and the process returns to step (2).

The process of case retrieval is shown in Figure 1.
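Steps (2) and (3) and the storage check of Definition 5 can be sketched in Python. The paper does not state its similarity function, so the weighted mean absolute difference used below is an assumption, as are the function names and the requirement that eigenvalues are normalized to $[0, 1]$. The default thresholds are the values of $sim'$ and $sim''$ reported in the experiment.

```python
def similarity(target, case, weights):
    """Assumed measure: 1 minus the weighted mean absolute difference."""
    items = target.keys()
    total = sum(weights[f] for f in items)
    dist = sum(weights[f] * abs(target[f] - case[f]) for f in items) / total
    return 1.0 - dist


def retrieve(target, space, weights, sim_threshold=0.35):
    """Score every case in the space, keep those above sim', sort descending."""
    scored = [(similarity(target, c, weights), c) for c in space]
    hits = [(s, c) for s, c in scored if s > sim_threshold]
    return sorted(hits, key=lambda sc: sc[0], reverse=True)


def may_store(target, space, weights, store_threshold=0.95):
    """Definition 5 as read here: store only if no case exceeds sim''."""
    return all(similarity(target, c, weights) <= store_threshold for c in space)


weights = {"f1": 0.5, "f2": 0.5}
target = {"f1": 0.1, "f2": 0.1}
near = {"f1": 0.1, "f2": 0.12}   # almost identical to the target
mid = {"f1": 0.5, "f2": 0.5}     # moderately similar
far = {"f1": 0.9, "f2": 0.9}     # below the sim' threshold

hits = retrieve(target, [far, mid, near], weights)
storable = may_store(target, [far, mid, near], weights)
```

In this example `near` and `mid` are returned in that order, `far` is filtered out, and the target case would not be stored because `near` is already closer than $sim''$ allows.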

V. EXPERIMENTAL RESULTS AND ANALYSIS

A simulation experiment was developed for the algorithm. XML is used as the storage medium for the case collections, the collection of global characteristic items has 12 elements, and 5000 cases were added manually. There are 30 case structure categories. The experiment comprises 600 retrievals. The parameter $sim'$ is 0.35, the parameter $sim''$ is 0.95, and the parameter $\rho$ is 0.98.

TABLE I. THE EXPERIMENT RESULT OF RETRIEVAL

Retrieval category    Success times    Failure times    Success rate of recommendation
First retrieval       314              286              52.3%
Second retrieval      122              164              42.7%

According to the results in Table I, $sim'$ determines the quantity and quality of the results of a single recommendation; $sim''$ affects both the quality of the stored cases and the case retrieval logs; and $\rho$, as the adjustment factor of the characteristic items' weights, affects how fast the weights change. The success rate of the first recommendation depends on the quantity and quality of the cases in the case base. Repeated retrieval not only takes up system resources but also degrades the user's experience. Providing users with recommended target cases helps them reconstruct the target case. The experiment records only the second retrieval; the user's second retrieval effectively improves the overall recommendation success rate of the system and increases the utilization rate of the cases in the case base.
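The rates in Table I, and the overall success rate they imply, can be checked with a short calculation. The sketch assumes that the second retrieval was attempted only for the 286 queries that failed the first time, which is consistent with 122 + 164 = 286; the overall figure is derived here and is not reported in the table.

```python
total_queries = 600
first_success, first_failure = 314, 286
second_success, second_failure = 122, 164

# Success rate of the first retrieval over all queries.
first_rate = first_success / total_queries
# Success rate of the second retrieval over the queries that failed at first.
second_rate = second_success / first_failure
# Overall success rate once the second retrieval is counted.
overall_rate = (first_success + second_success) / total_queries

print(f"{first_rate:.1%} {second_rate:.1%} {overall_rate:.1%}")
# prints: 52.3% 42.7% 72.7%
```

Under this reading, one round of target case recommendation raises the share of successful queries from 52.3% to about 72.7%.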

Figure 1. Flowchart of case retrieval

VI. CONCLUSIONS

The efficiency of case retrieval and the quality of its results are important factors in judging the merits of an intelligent recommendation system. This paper applies post-retrieval query performance prediction to case retrieval in a case-based reasoning system and uses the user's feedback after the initial case retrieval to help users reconstruct the target case. The experiment shows that the method gives good results in improving the utilization of cases and the recall rate of case retrieval. The next step of this research is to apply query performance prediction to the maintenance of the case base.

[Figure 1 flowchart. Node labels recovered from the extraction: User input; collecting characteristic items; generating the target case; computing case similarity; collection of characteristic items; case collection; filtering by case similarity; collection of similar cases; Success? (Yes/No); Store? (Yes/No); Store case; Not storing; updating overall characteristic item weights and the corresponding cases' inverted structures; producing recommended target cases; user rebuilding target case; extracting cases; Case library; Case retrieval log; Retrieval end.]


Page 5: 05359477

ACKNOWLEDGMENT

This work was supported in part by the Natural Science Research Project of the Henan Province Education Department (2007 520008, 2007 520009).

REFERENCES

[1] Schafer, J. B., Konstan, J. A., and Riedl, J. Recommender Systems in E-Commerce. In ACM Conference on Electronic Commerce (EC99), 1999.

[2] R. T. McIvor, P. K. Humphreys. A case-based reasoning approach to the make or buy decision [J]. Integrated Manufacturing Systems, 2000, 11(5):295-310.

[3] Leake D. B. CBR in Context: the present and future [A]. Case-Based Reasoning: Experiences, Lessons & Future Directions [C]. Menlo Park, CA, USA: AAAI Press/the MIT Press, 1996. 1-30.

[4] Luo Zhongliang, Wang Keyun, Kang Renke, Guo Dongming. Study on a Case Retrieval Algorithm in Case-based Reasoning System [J]. Computer Engineering and Applications, 2005, (25):230-232.

[5] Lang Hao, Wang Bin, Li Jin-Tao, and Ding Fan. Predicting Query Performance for Text Retrieval [J]. Journal of Software, 2008, 19(2):291-300.

[6] Zhou Y, Croft WB. Ranking robustness: A novel framework to predict query performance. In: Proc. of the 15th ACM Int'l Conf. on Information and Knowledge Management. Arlington: ACM Press, 2006. 567-574.

[7] Cronen-Townsend S, Zhou Y, Croft WB. Predicting query performance. In: Proc. of the 25th Annual Int'l ACM SIGIR Conf. on Research and Development in Information Retrieval. Tampere: ACM Press, 2002. 299-306.

[8] He B, Ounis I. Inferring query performance using pre-retrieval predictors. In: Apostolico A, Melucci M, eds. String Processing and Information Retrieval, 11th Int'l Conf., SPIRE 2004. LNCS 3246, 2004. 43-54.

[9] Xu JX, Croft WB. Query expansion using local and global document analysis. In: Proc. of the 19th Annual Int'l ACM SIGIR Conf. on Research and Development in Information Retrieval. Zürich: ACM Press, 1996. 4-11.

[10] Jia Shi jie, Huang Song qing, Liu Li jun. Constructive strategy of non-isomorphic case set based on pheromone theory of ant colony algorithm [J]. Computer Engineering and Applications, 2008, 44(25):210-211.

[11] Zhang Jia-hua, Zhao Dong-dong, Jiang He, and Zhang Jian chao. An Ant Colony Clustering Algorithm Based on Pheromone [J]. Computer Engineering and Applications, 2006, 20(2):157-163.
