
Usability Guidelines for Preference-Based Product Recommenders

Pearl Pu1, Boi Faltings2, Li Chen1 and Jiyong Zhang1

1Human Computer Interaction Group 2Artificial Intelligence Laboratory

School of Computer and Communication Sciences

Swiss Federal Institute of Technology in Lausanne (EPFL)

CH-1015, Lausanne, Switzerland

{pearl.pu,boi.faltings,li.chen,jiyong.zhang}@epfl.ch

Abstract

In this chapter, we survey the state of the art and present successful methods for preference-based product recommenders in three important facets of user interaction: preference elicitation, preference revision, and search result presentation. Each method was carefully selected based on an evaluation framework we call ACE. It relies on three criteria: how a system increases users’ decision accuracy (A) and their confidence (C) in choosing items, and the amount of interaction effort (E) required of them. After synthesizing and presenting the main results for each facet, we develop a set of best-practice design guidelines applicable in general to recommenders used today. The chapter concludes with a theoretical model to clarify the role of the guidelines in the ACE evaluation and thus their application in the interaction process. The ACE evaluation framework presented here is the first in the field to validate the performance of preference-based recommenders based on a set of user-centric criteria.

Keywords: preference-based recommender systems, product search, interactive preference elicitation, example critiquing, evaluation methodologies, interface design guidelines, preference-based product search, decision support, tradeoff support, e-commerce, interface design for recommender systems.


1. Introduction

According to Jakob Nielsen, the first usability principle of e-commerce is that if users cannot find the product, they cannot buy it either. The word “finding” has a broader meaning than just performing a targeted search for a product whose name is already known. It refers to the identification of an ideal product to satisfy the user’s needs, implying a decision making process during the search.

Both commercial and non-commercial online sites now offer an almost “infinite” shelf space of available items. The task of locating a desired choice can appear too daunting for the average customer. Many efforts have been made to develop highly interactive and intelligent tools to assist users. As a result, preference-based recommenders have emerged and are broadly recognized as an effective search and navigation mechanism for guiding users to their ideal products. Many novel interaction methods and systems have been developed in the past decade and evaluated with successful outcomes. However, due to the lack of synthesis of these methods, their wide adoption remains limited.

Our goal is to survey these novel methods and systems discussed in various scientific publications and industrial reports. We focus on three facets of user-system interaction in such preference recommenders: initial preference elicitation, preference revision, and presentation of recommendation results. The three components do not have to be simultaneously present in a given system. For example, initial preference elicitation is an optional step when users are presented with a set of recommendations (e.g., best sellers in different categories) as soon as they visit a site. Other systems, on the other hand, elicit users’ initial preferences but do not provide the option to revise them.

In order to facilitate the adoption of these methods in a wider and more general context, it is also necessary to synthesize the accumulated knowledge into a set of best-practice guidelines. To select the successful methods for the derivation of the guidelines, we will not be using the methods’ accuracy alone as our evaluation criterion. While accuracy is an important and central aspect of recommenders, user benefits cannot be captured entirely by accuracy alone [35]. We need to take into account a basic human condition: we have limited cognitive resources and are not likely to achieve a high level of accuracy if the required effort is excessive. How can we identify and select systems that produce high recommendation accuracy while requiring an effort level that users are predisposed to make? We propose a multi-criteria evaluation framework called ACE: 1) the system’s ability to help users find their most preferred item (accuracy), 2) its ability to inspire users’ confidence in selecting the items that were recommended to them (confidence), and 3) the amount of user effort it requires for achieving the relative accuracy (effort). Moreover, we show how these guidelines indeed offer improvements in system usability based on the outcome and experiences of user studies.

This chapter therefore contributes to the field in two main areas. From an academic point of view, it establishes for the first time a set of user-centric criteria to evaluate the performance of preference recommenders. From a practical point of view, it surveys the state of the art of new interaction technologies in this field and derives usability guidelines that can be applied in a wider and more scalable way. Since these guidelines are derived from interaction methods that have been proven to work effectively in user studies, we hope to help future designers implement systems with predictably good performance, without having to test them again in costly user trials.

2. Preliminaries

Preference-based recommenders suggest items to users based on their explicitly stated preferences, either in the form of ratings of items or preferences over the attributes of the items. We do not consider behavior-based recommenders, which generate recommendations based on users’ accumulated interaction behaviors, for example, the items users have examined, purchased, or both [70], nor demographic-based recommenders [26,53].

The systems described in this chapter share some overall characteristics even though they may be designed with different system architectures and evaluated on different data domains. Figure 1 represents a generic model. A user interacts with such systems by stating a set of initial preferences, either via a graphical user interface or a natural language dialog. After obtaining that information, the system filters the space of options and selects the items to be recommended to users based on their stated preferences. This set is called the recommendation set. At that point, either the user finds her most preferred item in the recommendation set and thus terminates her interaction with the system, or she revises the preference model, e.g., “I would like a cheaper item”, in order to obtain more accurate recommendations. This last step is called preference revision, or the user feedback step. In many scenarios that we will consider, users do not just pick a single item by the time they terminate the interaction, but construct a list of items known as the consideration set.

Fig. 1. The generic system-user interaction model of a preference-based recommender system. [The figure shows the preference model mediating between the user and the space of all options. Step 1: the user specifies initial preferences. Step 2: the product search tool filters the options and displays the k-best items as the recommendation set based on the stated preferences. Step 3: the user revises preferences to receive more recommendations. Step 4: the user picks the final choice.]

Once a recommendation set has been determined, a system may use various display strategies to show the results. A typical tool presents the user with a set of k items (1 ≤ k ≤ n, where n is the total number of products) in each of a total of m interactions. In each display of these k items, the user identifies target choices to be included in the consideration set. The more options displayed, the more effort the user must expend to examine them. On the other hand, with a small display set users could easily overlook their target choice and engage in more interaction cycles. This tradeoff of effort versus accuracy will be discussed in further detail when individual systems are presented.

Within the area of preference-based recommenders, there are four subtypes: rating-based, case-based, utility-based and critiquing-based. See [7] and [1] for other ways to classify recommender systems. The criterion used here relates to how users’ preferences are introduced to the system: via ratings on items that users have experienced, or via their preferences on features of their ideal product (e.g., I would like to rent an apartment with a private bathroom and at least 70 square meters of living space).



2.1 Rating-based Systems

Users explicitly express their preferences (even though they may not know it) by giving either binary or multi-scale scores to items that they have experienced. Either the system proposes that a user rate a set of items, or users select on their own a set of items to rate. These initial ratings constitute the user profile. Systems that fall into this category are most commonly known as collaborative recommenders, mainly because the user is recommended items that people with similar tastes and preferences liked in the past. For this reason, this type of system is also called a social recommender. The details of how the collaborative algorithms work can be found in [1]. Lately some websites, such as tripadvisor.com, have started collecting users’ ratings on multiple attributes of an item to obtain a more refined preference profile.

2.2 Case-based Systems

This type of system recommends items that are similar to what users have indicated as interesting. A product is treated as a case having multiple attributes. Content-based [1] and case-based [7] technologies are used to analyze the attribute values of available products and the stated preferences of a user, and then identify one or several best-ranked options according to a ranking scheme.

2.3 Utility-based Systems

Utility-based recommenders propose items based on users’ stated preferences on multi-attribute products. Multi-attribute products refer to the encoding scheme used to represent all available data with the same set of attributes $\{a_1, \ldots, a_k\}$, where each attribute $a_i$ can take any value $v$ from a domain of values $d(a_i)$. For example, a data set comprising all digital cameras in an e-store can be represented by the same set of attributes: manufacturer, price, resolution, optical zoom, memory, screen size, thickness, weight, etc. The list of attributes as well as the domain ranges varies among product domains. We assume that users’ preferences depend entirely on the values of these attributes, so that two items that are identical in all attributes would be equally preferred. Furthermore, the products considered here, such as digital cameras, portable PCs, and apartments, demand a significant financial commitment. They are called high involvement products because users are expected to possess a reasonable amount of willingness to interact with the system, participate in the selection process and expend a certain amount of effort to process information [59]. Users are also expected to exhibit slightly more complex decision behaviors in such environments than they would in selecting a simpler item, such as a book, a DVD, or a news article.

Tools using these technologies have also been referred to as knowledge-based recommenders [7] and utility-based decision support interface systems (DSIS) [59]. Utility refers to the multi-attribute utility theory that such technologies use to calculate a product’s suitability to a user’s stated preferences. A related technology, specializing in searching configurable products, uses constraint satisfaction techniques [46]. The difference between utility-based recommenders (UBRs) and case-based recommenders (CBRs) lies in the notion of utility. While the weight of a user preference is important for a UBR, it is ignored in CBR. Further, the notion of value tradeoff is an essential part of decision making for UBRs. Essentially, UBRs recommend decisions, rather than just similar products. See the section on tradeoff reasoning for more details on this topic.

2.4 Critiquing-based Systems

Both case- and utility-based recommenders can be improved by adding the additional interaction step of critiquing. A critiquing-based product recommender simulates an artificial salesperson that recommends options based on users’ current preferences and then elicits their feedback in the form of critiques such as “I would like something cheaper” or “with faster processor speed.” These critiques help the agent improve its accuracy in predicting users’ needs in the next recommendation cycle. For a user to finally identify her ideal product, a number of such cycles are often required. Since users are unlikely to state all of their preferences up front, especially for products that are unfamiliar to them, the preference critiquing agent is an effective way to help them incrementally construct their preference model and refine it as they see more options.

2.5 How Does a Utility-Based Recommender Work?


Since this topic has not been treated elsewhere, we provide an overview of the underlying recommender algorithm for utility-based recommenders.

The fundamental assumption underlying these recommenders is that people prefer items because of their attributes. Different values of an attribute correspond to different degrees of preference depending on the situation: for example, a large apartment is useful when there are guests, but not useful when it has to be cleaned. A user will determine preferences for an item by weighing the advantages and disadvantages of each feature according to how often it is beneficial and how often it is not. Thus, preference is a weighted function of attributes.

Formally, the preference-based search problem can be formulated as a Multi-Attribute Decision Problem (hereafter MADP) $\langle A, D, O, P \rangle$, where $A = \{a_1, a_2, \ldots, a_n\}$ is a finite set of attributes that the product catalog has, $D = d(a_1) \times d(a_2) \times \cdots \times d(a_n)$ indicates the product domain space (each $d(a_i)$ is a set of domain values for attribute $a_i$), $O$ is a finite set of available products that the system may provide, and $P$ denotes a set of preferences that the user may have. The objective of a MADP is to find the product (or products) most preferred by the user. A MADP can be solved by constraint-based approaches or utility-based approaches. Below we introduce the approach based on multi-attribute utility theory (MAUT) to solve a given MADP. Please see [46] for the constraint-based approach.

Multi-Attribute Utility Theory (MAUT)

Utility theory dates back to 1738, when Bernoulli proposed his explanation of the St. Petersburg paradox in terms of the utility of monetary value [4]. Two centuries later, von Neumann and Morgenstern (1944) revived this method to solve problems they encountered in economics [66]. Later, in the early 1950s, in the hands of Marschak [29] and of Herstein and Milnor [21], Expected Utility Theory was established on the basis of a set of axioms known as the von Neumann-Morgenstern theorem (VNM theorem) [39, 56].

In the 1970s, Keeney and Raiffa [24] extended utility theory to the case of multiple attributes. The main idea of multi-attribute utility theory is that user preferences can be represented as a utility function.

Let the symbol $\succeq$ denote a user’s preference order, e.g. $A \succeq B$ means “A is preferred or indifferent to B”. According to utility theory, for a given MADP there exists a utility function $U: O \rightarrow \mathbb{R}$ such that for any two possible products $O_x$ and $O_y$,

$O_x \succeq O_y \iff U(O_x) \geq U(O_y)$.

More specifically, a product $O_x$ can be represented by its vector of attribute values $\langle v_1^x, v_2^x, \ldots, v_n^x \rangle$; thus the above formula can be rewritten as

$\langle v_1^x, \ldots, v_n^x \rangle \succeq \langle v_1^y, \ldots, v_n^y \rangle \iff U(v_1^x, \ldots, v_n^x) \geq U(v_1^y, \ldots, v_n^y)$.

Usually the utility function is scaled from zero to one. If the utility function is given, the utility of each product can be calculated and the preference order of all products can be sorted according to the utility they gain.

Finding the proper utility function to represent users’ preferences is a challenging task. While in theory the utility function can take any form to represent user preferences, a special case is commonly used to reduce computation effort. If the attributes are mutually preferentially independent¹, the utility function has the additive form

$U(O) = \sum_{i=1}^{n} w_i \, u_i(a_i)$,

where $u_i$ is a value function of attribute $a_i$ ranging in [0,1], and $w_i$ is the weight value of $a_i$ satisfying $\sum_{i=1}^{n} w_i = 1$.

In other words, the utility function for a product O is the weighted sum of the value functions of each of its attributes. The weight of each attribute can be given the default value 1/n, and the user can be allowed to specify the weights of selected attributes. The value function $u_i$ can be determined so as to satisfy the user’s preferences related to the attribute $a_i$. Usually a linear function of the form $u_i(x) = \alpha x + \beta$ is enough to represent the user’s preference on each attribute.

¹ An attribute X is said to be preferentially independent of another attribute Y if preferences for levels of attribute X do not depend on the level of attribute Y. If Y is also preferentially independent of X, then the two attributes are said to be mutually preferentially independent. See [22] for more details.

Once the utility of each product is determined, we are able to rank all the products based on their overall utilities and select the top k products with the highest utilities as the recommendation set. In practice, we assume that the attributes of any product are mutually preferentially independent, so the additive form of the utility function can always be applied.
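To make the additive model concrete, the following sketch ranks a small catalog with a weighted sum of per-attribute value functions. It is a minimal illustration of the technique described above, not code from any of the cited systems; all names (additive_utility, top_k, the camera data) are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# A product is a mapping from attribute name to attribute value,
# e.g. {"price": 450.0, "weight": 180.0}.
Product = Dict[str, float]

def additive_utility(product: Product,
                     value_fns: Dict[str, Callable[[float], float]],
                     weights: Dict[str, float]) -> float:
    """U(O) = sum_i w_i * u_i(a_i), with each u_i scaled to [0, 1]
    and the weights summing to 1."""
    return sum(weights[a] * value_fns[a](product[a]) for a in value_fns)

def top_k(products: List[Product],
          value_fns: Dict[str, Callable[[float], float]],
          weights: Dict[str, float],
          k: int) -> List[Tuple[Product, float]]:
    """Rank all products by overall utility and return the k best
    as the recommendation set."""
    scored = [(p, additive_utility(p, value_fns, weights)) for p in products]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Example: cheaper and lighter cameras are preferred; price matters most.
cameras = [
    {"price": 450.0, "weight": 180.0},
    {"price": 300.0, "weight": 220.0},
    {"price": 600.0, "weight": 150.0},
]
value_fns = {
    "price":  lambda v: 1.0 - (v - 300.0) / 300.0,  # linear, best at 300
    "weight": lambda v: 1.0 - (v - 150.0) / 70.0,   # linear, best at 150
}
weights = {"price": 0.6, "weight": 0.4}  # the default would be 1/n = 0.5 each

for product, utility in top_k(cameras, value_fns, weights, k=2):
    print(product, round(utility, 3))
```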

2.6 Definition of the ACE Evaluation Framework

We give a more precise definition of the ACE (Accuracy, Confidence, Effort) framework as well as ways of measuring these variables.

Accuracy refers to the objective accuracy of a recommender. For rating-based systems, the most often used measure is the mean absolute error (MAE). It is measured by an offline procedure known as leave-one-out on a previously acquired dataset. Leave-one-out involves leaving one rating out and then trying to predict it with the recommender algorithm being evaluated. The predicted rating is then compared with the real rating and the difference in absolute value is computed. The procedure is repeated for all the ratings, and the average of all the errors is called the mean absolute error.
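The following sketch shows the leave-one-out procedure for a generic rating predictor; `predict` stands in for whatever recommender algorithm is under evaluation, and the toy item-mean predictor is only there to make the example runnable.

```python
def leave_one_out_mae(ratings, predict):
    """Hide each (user, item, rating) triple in turn, predict it from the
    remaining ratings, and average the absolute errors."""
    errors = []
    for i, (user, item, actual) in enumerate(ratings):
        train = ratings[:i] + ratings[i + 1:]  # every rating except this one
        errors.append(abs(predict(train, user, item) - actual))
    return sum(errors) / len(errors)

def item_mean(train, user, item):
    """Toy predictor: the item's mean rating in the training data,
    falling back to the global mean when the item is unseen."""
    item_ratings = [r for (_, i, r) in train if i == item]
    pool = item_ratings or [r for (_, _, r) in train]
    return sum(pool) / len(pool)

ratings = [("u1", "m1", 4), ("u2", "m1", 5), ("u1", "m2", 2), ("u2", "m2", 3)]
print(leave_one_out_mae(ratings, item_mean))  # 1.0 on this toy data
```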

Recommendations generated by case- or utility-based systems are based on users’ preference profiles. Since such profiles cannot be simulated, offline methods to measure accuracy are not possible. One method currently used in this field is the switching task as defined in [43]. It measures how often users’ truly preferred items are selected from the ones recommended to them. For example, if 70 out of 100 users found their preferred items during the recommendation process without changing their decisions after the experiment administrator presented all of the available options to them, then we say the accuracy of the system is 70%. This implies an experimental procedure where each user first interacts with a recommender to pick items for her consideration set. Her task time is measured. In a second phase of this procedure, the experiment administrator shows all of the available items and then asks her whether she still finds the items in the consideration set attractive. If she switches to other items, an event measured as the switching rate, the system has failed to help her make an accurate decision. Such procedures were already employed in consumer decision research to measure decision quality, where they are known as the switching task [20]. This measure gives the only precise account of accuracy for a personalized recommender tool, even though it is very time consuming. This is the reason that switching tasks are only performed in empirical studies.

User confidence, the second evaluation criterion in our ACE framework, is the system’s ability to inspire users to select the items recommended to them. It can be assessed using a post-study questionnaire item such as: “I am confident that the items recommended to me are the ones I truly want”. Users answer this question by giving an agreement score. When the objective accuracy cannot be feasibly obtained, user confidence can be used to assess the perceived accuracy of a system.

By user effort, we refer to the actual task time users take to finish an instructed action, such as the preference elicitation and preference revision procedures, and the time they take to construct the consideration set. When actual time is not preferred, we can also measure the number of interaction cycles for any of the interaction activities.

2.7 Organization of this Chapter

We structure the guidelines according to the generic components of the model presented in Figure 1: initial preference elicitation (step 1), preference revision (step 3), and display strategy (step 2). The rest of this article is organized as follows: Section 3 (Related Work) reviews design guidelines for preference elicitation and personalized product recommender tools in other fields; Section 4 (Initial Preference Elicitation) presents guidelines for motivating users to state their initial preferences as accurately as possible using principles from behavioral decision research; Section 5 (Stimulating Preference Elicitation with Examples) continues the discussion on preference elicitation and identifies concrete methods to help users state complete and sound preferences; Section 6 (Preference Revision) describes strategies to help users resolve conflicting preferences and perform tradeoff decisions; Section 7 (Display Strategies) presents guidelines for devising display strategies that achieve a good balance between minimizing users’ information processing effort and maximizing decision accuracy and user confidence; Section 8 shows the cost-effect model on the relevance and use of the guidelines; and Section 9 concludes this article.


3. Related Work

This chapter derives a set of usability design guidelines based on our recent and related works in the domain of interaction technologies for preference-based search and recommender tools. Therefore, we will not review the related works that contribute to the accumulated list of guidelines in this section. Rather, discussions of these works will be provided throughout the chapter in areas that correspond to the established guidelines. We will, however, describe two papers which share our goal of deriving good design guidelines for decision support systems. The first proposed a set of recommendations derived from marketing research in order to increase users’ motivation to interact with the recommender agent of an online store and its website. The second describes a list of “building code” guidelines for an effective preference construction procedure in decision problems involving higher-stake outcomes.

Based on a critical but well-justified view of the preference-based product search and recommender tools available at the time, Spiekermann and Paraschiv proposed a set of nine design recommendations to augment and stimulate the interaction readiness between a user and such systems [59]. For example, the first of the nine says that systems should ask users for what purposes they seek a product and what their purchase goals are. There is no overlap between these recommendations and our guidelines.

Much of the basis for these recommendations is insights from the marketing literature on information search, and perceived risk theory, which defines a user’s readiness to interact with a product recommender as her motivation to reduce the functional, financial, and emotional risks associated with the purchase decision. The design recommendations were therefore derived from methods concerned with reducing user risk in all of these dimensions. Compared to our work, this is a “horizontal” approach that covers the general design of an entire e-commerce website, in which the recommender agent is the principal technical component. We perform an in-depth examination of the recommender engine’s interaction technologies using a more “vertical” approach, ensuring that consumers are offered optimal usability support for preference elicitation and decision making when interacting with product recommender tools.

Although their subject matter is more concerned with making high-stake decisions, the valuation process described by Payne et al. [41] is similar to our goal of addressing the needs and preferences of a consumer facing a purchase decision. Their work challenged many of the traditional assumptions about well-defined and preconceived preferences and strongly promoted a theory of preference construction. Under that new framework, the authors proposed a set of “building codes” (similar to guidelines) to help professional decision makers establish value functions and make high quality decisions. Their discussions on the nature of human preferences, the way people construct and measure preferences, and how to face tradeoffs to obtain rational decisions are especially influential to our work. References to the specific details of this and other works in behavioral decision research will be given at relevant points in this chapter.

4. Initial Preference Elicitation

We consider preference elicitation for utility-based recommenders since it is the most user-involving case. We use $P = \{(a_i, w_i)\}$, where $1 \leq i \leq n$, to specify a user’s preferences over a total of n attributes of her desired product. Here $a_i$ represents the desired characteristics of the ith attribute and $w_i$ the degree to which such characteristics should be satisfied. This model is also known as the value function in [24]; it is called the preference model in most systems discussed here. Some methods assume the same weights for all attributes and therefore do not elicit such information [9,51]. Preference elicitation (also known as query specification) is the initial acquisition of this model for a given user.
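A minimal sketch of this preference model as a data structure, assuming nothing beyond the definition above: attributes the user has not yet mentioned are simply absent, which is what makes the incremental elicitation discussed below possible. The class and method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class PreferenceModel:
    """P = {(a_i, w_i)}: a desired characteristic per attribute plus a
    weight expressing how strongly it should be satisfied."""
    desired: Dict[str, object] = field(default_factory=dict)  # a_i -> desired value
    weights: Dict[str, float] = field(default_factory=dict)   # a_i -> w_i

    def state(self, attribute: str, value: object, weight: float = 1.0) -> None:
        """Add or revise one preference; may be called at any point
        in the interaction, in any order."""
        self.desired[attribute] = value
        self.weights[attribute] = weight

# A user might start with a single concern and add more later:
model = PreferenceModel()
model.state("price", "under 1200 CHF", weight=0.7)
model.state("bathroom", "private", weight=0.3)
```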

It would seem apparent that a user’s preferences could be elicited simply by asking her to state them. Many online search tools, for example the form-filling type of graphical user interface, practice such an approach. Users are asked to state their preferences on every aspect, such as departure and arrival date and time, airlines, intermediate airports, etc., and are given the impression that all fields must be filled. We call such approaches non-incremental since all preferences must be obtained up front.

To understand the nature of user preference expression in order to design elicitation tools, we turned to the behavioral decision theory literature for some answers. According to the available literature, users’ preferences are context-dependent and are constructed gradually as a user is exposed to more information regarding his or her desired product [40,41]. For example, Tversky et al. reported a user study that asked subjects to buy a microwave oven [63]. Participants were divided into two groups of 60 users each. In the first group, each user was asked to choose between an Emerson priced at $110 and a Panasonic priced at $180. Both items were on sale, and these prices represented a discount of one third off the regular price. In this case 43% of the users chose the more expensive Panasonic at $180. A second group was presented with the same choices plus an even more expensive item: a $200 Panasonic, which represented a 10% discount. In this context, 60% of the users chose the Panasonic priced at $180. In other words, more subjects preferred the same item just because the context had changed. This finding demonstrates that people are not likely to reveal preferences as if they were innate to them, but construct them based on contextual information. More examples of this nature can be found in Ariely’s recent book, Predictably Irrational.

These studies on human behavior suggest that a novice user is unlikely to state all of her preferences and their relative importance in the beginning. To validate these findings in our field of research, namely preference elicitation, we conducted some empirical studies ourselves. 22 subjects were asked to interact with a preference elicitation interface [65]. The average user stated preferences on only 2.1 attributes out of a total of 10 when free to state all preferences. This only increased to 4.19 attributes by the time of selecting the final choice.

Another study was conducted soon afterwards to further confirm the finding that users are unlikely to state all of their preferences in the beginning. It compared how users perform product search tasks, in terms of decision accuracy and effort, while interacting with a non-incremental procedure versus an incremental one [64]. With the former procedure, users were required to specify all of their preferences in a single graphical user interface (called form filling), whereas with the latter approach, each preference was constructed by the user. All 40 users were randomly and evenly divided into two groups, and each group was assigned one system (non-incremental or incremental) to evaluate. In the non-incremental approach, users stated on average 7.5 preferences on the 10 attributes, while the results for the incremental approach remained the same as before. However, the non-incremental approach had an accuracy of only 25%, while the incremental method achieved 70% accuracy with comparable user effort. That is, only 25% of users found their target products when they had to state preferences in the non-incremental form-filling style. Thus, while the non-incremental approach may produce the required data, the quality of what has been elicited may be questionable. There is no guarantee that users will provide correct and consistent answers on attributes for which their preferences are still uncertain.

Similar findings were reported in preference elicitation for collaborative filtering based recommender systems. McNee et al. compared three interface strategies for eliciting movie ratings from new users [34]. In the first strategy, the system asked the user to rate movies that were chosen based on entropy comparisons to obtain a maximally informative preference model. In another strategy, the user was allowed to freely propose the movies they wanted to rate. In a mixed strategy, the user had both possibilities. A total of 225 new users participated in the experiment, which found, surprisingly to its authors, that the user-controlled strategy obtained the best recommendation accuracy compared to the other two strategies in spite of a lower number of ratings completed by each user (14 vs. 36 for the system-controlled interface). Furthermore, the user-controlled interface was more likely to motivate users to return to the system to assign more ratings. This demonstrates that a higher level of user control over the amount of interaction effort gives rise to more accurate preference models.

Incremental elicitation methods, as described in [1] and [5], can be used to revise users’ stated preferences (or value functions) in general cases. Other incremental methods improve decision quality in specific areas. In [6,36,42], researchers addressed preference uncertainty by emphasizing the importance of displaying a diverse set of options in the early stage of user-system interaction. Faltings et al. [13] described ways to stimulate users to express more preferences in order to help them increase decision accuracy. Pu and Faltings showed how additional preference information can be acquired via preference revision, especially as users perform tradeoff tasks [46]. More detail on these topics will be given in Sections 5.2 and 6.

Based on both classical and recent empirical findings, we have derived several design guidelines concerning the initial preference elicitation process:

Guideline 1 (any effort): Consider novice users’ preference fluency. Allow them to reveal preferences incrementally. It is best to elicit the initial preferences that concern them the most and choose an effort level that is compatible with their knowledge and experience of the available options.

A rigid elicitation procedure obtains users’ preferences in a system-predesigned order of elicitation. When users are forced to formulate preferences in a particular order or using an attribute that does not correspond to their actual decision process, they can fall prey to incorrectly formulating the means objectives. Consequently, they will only partially achieve their fundamental objectives [23]. For example, when planning a trip by air, a user may have to first select the airline before selecting a flight. Thus, the fundamental objective, such as a desired flight time, is replaced by another objective, the airline to use; this objective is called a means objective. To correctly translate the true objective into means objectives, the user needs detailed knowledge of the product offering, in this case the flight times offered by the different airlines. But this knowledge is often wrong, and thus the search can produce suboptimal results.


Thus we propose

Guideline 2 (any order): Consider allowing users to state their preferences in any order they choose.

An elicitation procedure can also be regarded as rigid if it requires users to state preferences on a set of system-designed attributes. For example, in a travel planning system, suppose that the user’s objective is to be at his destination at 15:00, but that the tool only asks him for a desired departure time. The user might erroneously believe that the trip requires a plane change and takes about 5 hours, thus forming a means objective of a 10:00 departure in order to answer the question. However, the best option might be a new direct flight that leaves at 12:30 and arrives at 14:30. Means objectives are intermediary goals formulated to achieve fundamental decision objectives. Assume that a user has a preference on attribute $a_i$ (e.g., the arrival time), but the tool requires expressing preferences on attribute $a_j$ (e.g., the departure time). Using beliefs about the available products, the user will estimate a transfer function $t(a_j)$ that maps values of attribute $a_j$ to values of attribute $a_i$ (e.g., arrival time = departure time + 5 hours). The true objective $p(a_i)$ is then translated into the means objective $q(a_j) = p(t(a_j))$. When the transfer function is inaccurate, the means objective often leads to very inaccurate results.

Note also that unless there is a strong correlation between attributes $a_i$ and $a_j$, an accurate transfer function may not even exist. A recent visit to a comparison shopping website (www.pricegrabber.com) showed that the site does not include the weight attribute of Tablet PCs among the search fields, even though such information is encoded in its catalog. For many portability-conscious users, the weight of a Tablet PC is one of the most important decision parameters. Unable to examine products directly by the weight attribute, a consumer might infer that weight is correlated with the screen size and perhaps the amount of storage in the hard drive. Consequently, she may consider searching for a Tablet PC with less disk space and a smaller screen size than she actually desires. Such users could make unnecessary sacrifices, because many manufacturers now offer light-weight Tablet PCs with considerable disk space and comfortable screen sizes. Thus we propose

Guideline 3 (any preference): Consider allowing users to state preferences on any attributes they choose.


When designing interfaces based on these three guidelines, a good balance is necessary between giving the user the maximum amount of control and not overwhelming her with interface complexity. We recommend the use of adaptive interfaces, where users can click on the attributes for which they want to state preferences and leave the others unclicked or on default values if they do not have strong preferences at that point (see the initial query interface found at activedecision.com). Alternatively, designers can use a “ramp-up” approach for initial query specification, where users specify preferences over a small number of attributes initially and are prompted to specify more once some results are displayed (see the initial query interface at www.cars.com).

5. Stimulate Preference Expression with Examples

Note that incorrect means objectives arise mainly from users’ unfamiliarity with the available options. To overcome this limitation, researchers have developed preference elicitation methods that use actual products as examples, known as the example critiquing method [45, 49]. Indeed, it has been observed for some time in behavioral decision theory that people find it easier to construct a model of their preferences when considering examples of actual options [41]. This constructive view of human decision making also applies to experts. According to Tversky [62], people do not maximize a precomputed preference order, but construct their choices in light of the available options. Therefore, to educate users about the domain knowledge and help them construct complete and sound preferences, we propose the following guideline:

Guideline 4: Consider showing example options to help users gain preference fluency.

We call such an interaction model example critiquing since users build their preferences by critiquing the example products that are shown. This allows users to understand their preferences in the context of the available options. Example critiquing was first mentioned in [67] as a new interface paradigm for database access, especially for novice users to specify queries. Recently, example critiquing has been used in two principal forms by several researchers: systems supporting product catalog navigation and systems supporting product search based on an explicit preference model.

In the first type of system, for example the FindMe systems [8,9], search is described as a combination of search and browsing called assisted browsing. The system first retrieves and displays the best matching product from the database based on a user’s initial query. It then retrieves other products based on the user’s critiques of the current best item. The interface implementing the critiquing model is called tweaking, a technique that allows users to express preferences with respect to a current example, such as “look for an apartment similar to this, but with a better ambiance.” Under this concept, a user navigates the space of available products by tweaking the current best option to find her target choice. The preference model is implicitly represented by the current best product, i.e., what a user chooses reflects her preference for its attribute values. Reilly et al. have recently proposed dynamic critiquing [51], based on some improvements of the tweaking model. In addition to the unit-value tweaking operators, compound critiques allow users to choose products which differ from the current best item in two or more attribute values. For example, the system would suggest a digital camera based on the initial query; it would also recommend cameras produced by different manufacturers, with less optical zoom but with more storage. Compound critiques are generated by the Apriori algorithm [2] and allow users to navigate to their target choice in bigger steps. In fact, users who more frequently used the compound critiques were able to reduce their interaction cycles from 29 to 6 in a study involving real users [32].

In the second type of example-critiquing system, an explicit preference model is maintained. Each piece of user feedback, in the form of a critique, is added to the model to refine the original preference model. An example of a system with an explicit preference model is the SmartClient system used for travel planning [45, 60]. It shows up to 30 examples of travel itineraries as soon as a set of initial preferences has been established. By critiquing the examples, users state additional preferences. These preferences are accumulated in a model that is visible to the user through the interface (see the bottom panel under “Preferences” of Figure 6 in [61]) and can be revised at any time. ATA [28], ExpertClerk [57], the Adaptive Place Advisor [16], and incremental dynamic critiquing systems [30] function similarly. One advantage of maintaining an explicit model is avoiding the recommendation of products which have already been ruled out by the users. Another advantage is that a system can suggest products for which preferences are still missing in the stated model, as further discussed in Section 5.2.

5.1 How Many Examples to Show


Two issues are critical in designing effective example-based interfaces: how many examples and which examples to show in the display. Faltings et al. investigated the minimum number of items to display so that the target choice is included even when the preference model is inaccurate [14]. Various preference models were analyzed. If preferences are expressed by numerical penalty functions whose error is bounded by a factor of epsilon (ε) above or below, and they are combined using either the weighted sum or the min-max rule, then the number t of displayed items needed to guarantee that the target solution is included can be bounded as a function of ε and d, the maximum number of stated preferences. Since this bound is independent of the total number of available items, this technique of compensating for inaccurate preferences by showing a sufficient number of solutions scales to very large collections. For a moderate number of preferences (up to 5), the correct number of display items typically falls between 5 and 20. When the preference model becomes more complex, inaccuracies have much larger effects, and a much larger number of examples is required to cover the model inaccuracy.

5.2 What Examples to Show

The most obvious examples to include in the display are those that best match the user’s current preferences. However, this strategy proves insufficient to guarantee optimality. Since most users are often uncertain about their preferences and are more likely to construct them as options are shown to them, it becomes important for a recommender system to guide the user to develop a preference model that is as complete and accurate as possible. At the same time, it is important to keep the initiative to state more preferences on the user’s side. We therefore call examples chosen to stimulate users to state preferences suggestions. We present two suggestion strategies: diversity-based and model-based techniques.

The ATA system was the first to show suggestions [28], in the form of extreme-valued examples where some attribute, for example departure time or price, took an extreme value such as earliest or cheapest. However, a problem with this technique is that extreme options are not likely to appeal to many users. For example, a user looking for a digital camera with good resolution might not want to consider a camera that offers 4 times the usual resolution but also has 4 times the usual weight and price. In fact, a tool that suggests this option will discourage the user from even asking for such a feature, since it implies that high resolution can only exist at the expense of many other advantages.

Thus, it is better to select the suggestions among examples that are already good given the currently known preferences, and to focus on showing diverse rather than extreme examples. Bradley and Smyth were the first to recognize the need to recommend diverse examples, especially in the early stages of using a recommender tool [6]. They proposed the bounded greedy algorithm for retrieving the set of cases most similar to a user’s query but at the same time most diverse among themselves. Thus, instead of picking the k best examples according to the preference ranking r(x), a measure d(x,Y) is used to calculate the relative diversity of an example x from the already selected set Y, and candidates are scored by the weighted sum

s(x,Y) = α r(x) + (1-α) d(x,Y)

where α can be varied to account for the varying importance of optimality and diversity. For example, as a user approaches the final target, α can be set to a higher value (e.g. 0.75 in the experimental setup) so that the system privileges the display set’s similarity rather than its diversity. In their implementation, the ranking r(x) is the similarity sim(x,t) of x to an ideal example t on a scale of 0 to 1, and the relative diversity is derived as

$d(x,Y) = 1 - \frac{1}{|Y|}\sum_{y \in Y} sim(x,y)$.
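A sketch of bounded greedy selection in the spirit of Bradley and Smyth [6], using the two formulas above. `rank` and `sim` are placeholders for the application's ranking and pairwise similarity functions (both on a 0-1 scale); the bound of 3k candidates is illustrative.

```python
def relative_diversity(x, selected, sim):
    """d(x, Y) = 1 - average similarity of x to the already selected set Y;
    taken as 1 (maximally diverse) while Y is still empty."""
    if not selected:
        return 1.0
    return 1.0 - sum(sim(x, y) for y in selected) / len(selected)

def bounded_greedy(candidates, rank, sim, k, alpha=0.75, bound=3):
    """Keep only the bound*k best-ranked candidates, then repeatedly pick
    the one maximizing s(x, Y) = alpha*r(x) + (1-alpha)*d(x, Y)."""
    pool = sorted(candidates, key=rank, reverse=True)[:bound * k]
    selected = []
    while pool and len(selected) < k:
        best = max(pool, key=lambda x: alpha * rank(x)
                   + (1 - alpha) * relative_diversity(x, selected, sim))
        selected.append(best)
        pool.remove(best)
    return selected
```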

The performance of diversity generation was evaluated in simulations in terms of its relative benefit, i.e. the maximum gain in diversity achieved by giving up similarity [58]. Subsequently, McSherry has shown that diversity can often be increased without sacrificing similarity [36]. A threshold θ is fixed on the ranking function, and then a maximally diverse subset is selected among all products x for which r(x) > θ. When k options are shown, the threshold might be chosen as the value of the k-th best option, thus allowing no decrease in similarity, or at some value that does allow a certain decrease.
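McSherry's variant can be sketched as follows, reusing relative_diversity from the previous example; the pool_size parameter is an illustrative way of expressing how far below the k-th best option the threshold is allowed to drop.

```python
def diverse_above_threshold(candidates, rank, sim, k, pool_size=None):
    """Fix a threshold theta on the ranking function and pick a maximally
    diverse subset among the candidates with rank(x) >= theta.  Taking
    theta at the k-th best candidate (the default) allows no decrease in
    similarity; a larger pool_size trades some similarity for diversity."""
    ranked = sorted(candidates, key=rank, reverse=True)
    theta = rank(ranked[min(pool_size or k, len(ranked)) - 1])
    pool = [x for x in ranked if rank(x) >= theta]
    selected = []
    while pool and len(selected) < k:
        best = max(pool, key=lambda x: relative_diversity(x, selected, sim))
        selected.append(best)
        pool.remove(best)
    return selected
```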

We thus propose the following guideline:

Page 20: Usability Guidelines for Preference-Based Product …Usability Guidelines for Preference-Based Product Recommenders Pearl Pu1, Boi Faltings2, Li Chen1 and Jiyong Zhang1 1Human Computer

20

Guideline 5: Consider showing diverse examples to stimulate preference expression, especially when users are still uncertain about their final preferences.

The adaptive search algorithm used in [33] alternates between a strategy that privileges similarity and one that privileges diversity to implement the interaction “show me more like this”, by varying the α in the ranking measure. At each point, a set of example products is displayed and the user is instructed to choose her most preferred option among them. Whenever the user chooses the same option twice consecutively, the system considers diversity when proposing the next examples in order to refocus the search. Otherwise, the system assumes that the user is making progress and continues to suggest new options based on optimality. Evaluations with simulated users show that this technique is likely to reduce the length of the recommendation cycles by up to 76% compared to a pure similarity-based recommender.
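The alternation can be sketched as a thin layer over the bounded greedy selection above. The α values are illustrative, not those used in [33]; the only logic taken from the description is the switch to diversity when the user picks the same option twice in a row.

```python
def next_examples(candidates, rank, sim, k, last_pick, current_pick):
    """If the user chose the same option twice consecutively, assume the
    search has stalled and refocus with a diverse display; otherwise keep
    recommending on optimality alone."""
    stalled = last_pick is not None and current_pick == last_pick
    alpha = 0.5 if stalled else 1.0  # illustrative weighting
    return bounded_greedy(candidates, rank, sim, k, alpha=alpha)
```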

More recent work on diversity was motivated by the desire to compensate for users’ preference uncertainty [42] and to cover different topic interests in collaborative filtering recommenders [69]. For general preference models, it is less clear how to define a diversity measure. Pu et al. considered the user’s motivation to state additional preferences when a suggestion is displayed [50]. A suggestion is a choice that may not be optimal under the current preference model, but that has a high likelihood of becoming optimal when an additional preference is added. For example, a user may add a preference for “an apartment with a balcony” after seeing examples of such apartments. 40 subjects (9 females) from 9 different nationalities took part in a user study to search for an apartment. The results show that the use of suggestions almost doubled decision accuracy, allowing users to find the most preferred option 80% of the time. A user is likely to be opportunistic and will only bother to formulate new preferences if she believes that this might lead to a better choice. Thus, the following look-ahead principle is proposed [50]:

Guideline 6: Consider suggesting options that may not be optimal under the current preference model, but have a high likelihood of optimality when additional preferences are added.

The look-ahead principle can be applied to construct model-based suggestions by explicitly computing, for each attribute $a_i$, a difference measure $diff(a_i, x)$ that corresponds to the probability that a preference on this attribute would make option x most preferred. Items are then ranked according to the expected difference measure over all possible attributes:

$F(x) = \sum_{a_i \in A_u} p_i \cdot diff(a_i, x)$,

where $p_i$ is the probability that the user is motivated to state a preference on attribute $a_i$, and $A_u$ is the set of attributes on which the user has not yet expressed a preference. The best suggestions to display are therefore those items possessing the highest probability of becoming optimal after considering hidden preferences. It is possible to adapt these techniques to generate a set of suggestions that jointly maximize the probability of an optimal item. More details are given in [13,50].

To investigate the importance of suggestions in producing accurate decisions, several empirical user studies were carried out [50,65]. One was conducted in an unsupervised setting, where users’ behavior was monitored on a publicly accessible online system. The scientists conducting the experiment collected logs from 63 active users who went through several cycles of preference revision. Another study was carried out in a supervised setting. The scientists recruited 40 volunteers and divided them into two groups. One group evaluated the interface with model-based suggestions, and the other group evaluated the one without. Both user studies showed significant effects of using model-based suggestions: users who used the suggestion interfaces stated significantly more preferences than those who did not (an increase of 2.09 preferences vs. only 0.62 without suggestions, p < 0.01, in the supervised studies [65], and an increase of 1.46 vs. 0.64 without suggestions, p < 0.002, for online users [50]), and users who used the suggestion interfaces also reached significantly higher decision accuracy (80 vs. 45 percent without suggestions, p < 0.01, in the supervised user studies [50]).
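A sketch of the ranking step for model-based suggestions; `diff` and `p_state` are stand-ins for the difference measure and the attribute-preference probabilities whose actual estimation is detailed in [13,50].

```python
def suggestion_score(x, diff, p_state, unstated):
    """Expected difference measure of option x: sum, over the attributes
    the user has not yet constrained, of p_i * diff(a_i, x)."""
    return sum(p_state[a] * diff(a, x) for a in unstated)

def best_suggestions(options, diff, p_state, stated, k):
    """Return the k options most likely to become optimal once hidden
    preferences are stated."""
    unstated = [a for a in p_state if a not in stated]
    return sorted(options,
                  key=lambda x: suggestion_score(x, diff, p_state, unstated),
                  reverse=True)[:k]
```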

6. Preference Revision

Preference revision is the process of changing one or more desired characteristics of a product that a user has stated previously, the degree to which such characteristics should be satisfied, or any combination of the two. In [43], 28 subjects (10 females) were recruited to participate in a user study in which each user was asked to find his or her most preferred apartment from a list of available candidates. The user’s preferences could be specified on a total of six attributes: type, price, area, bathroom, kitchen and distance to the work place. Each participant was first asked to make a choice, and then used the decision aid tool to perform tradeoffs among his or her preferences until the desired item was chosen. In this user study, every user changed at least one initial preference during the search process. Many users change preferences because there is rarely an outcome that satisfies all of the initial preferences. Two frequently encountered cases often require preference revision: 1) when a user cannot find an outcome that satisfies all of her stated preferences and must choose a partially satisfied one, or 2) when a user has too many possibilities and must further narrow down the space of solutions. Even though both activities can be treated as a process of query refinement, the real challenge is to help users specify the correct query in order to find the target item. Here we present a unified framework treating both cases as a tradeoff process, because finding an acceptable solution requires choosing an outcome that is desirable in some respects but perhaps not so attractive in others.

6.1 Preference Conflicts and Partial Satisfaction

A user who inputs a query for a spacious apartment with a low price range and obtains "nothing found" as a reply learns very little about how to state more suitable preferences.

Current industry practice manages preference conflicts with browsing-based interaction techniques. A user is only allowed to enter her preferences one at a time, starting from the point where the entire product space is available. As she specifies more preferences, she essentially drills down to a sub product space until either she selects her target among the displayed options or no product space remains. For example, if someone desires a notebook with minimal weight (less than 2 kilos), then after specifying the weight requirement she is only allowed to choose among notebooks weighing less than 2 kilos. If the price of these lightweight notebooks is very high, she is likely to miss a tradeoff alternative that may weigh 2.5 kilos and cost much less. This interaction style has become very popular in comparison shopping websites (see www.shopping.com, www.pricegrabber.com, www.yahoo.shopping.com). Even though such designs prevent users from specifying conflicting preferences outright, this interaction style is very limited. Users are unable to specify contextual preferences and especially tradeoffs among several attributes. If a user enters preferences successively for each attribute, the space of matching products can suddenly become null with the message "no matching products can be found." At this point, the user may not know which attribute value to revise among the set of values that she has specified so far, requiring her to backtrack several steps and try different combinations of preference values on the concerned attributes.
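The brittleness of this drill-down style is easy to reproduce. The following Python fragment uses an invented three-notebook catalog and invented thresholds purely for illustration:

# Invented mini-catalog; weights in kilos, prices in dollars.
notebooks = [
    {"name": "A", "weight": 1.8, "price": 2900},
    {"name": "B", "weight": 2.5, "price": 1100},  # the missed tradeoff option
    {"name": "C", "weight": 3.1, "price": 900},
]

# Drill-down filtering: each constraint permanently removes items.
light = [n for n in notebooks if n["weight"] < 2.0]
affordable = [n for n in light if n["price"] < 1500]

print(affordable)  # [] -- "no matching products", although B was a near-miss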

A more sensible method, such as the one used in SmartClient [45,61], manages a user's preference conflicts by first allowing her to state all of her preferences and then showing her options that maximally satisfy subsets of the stated preferences, based on partial constraint satisfaction techniques [15]. These maximally satisfying products educate users about available options and help them specify more reasonable preferences. In the same spirit, McCarthy et al. propose to educate users about product knowledge by explaining the products that do exist instead of justifying why the system failed to produce a satisfactory outcome [31]. FindMe systems rely on background information from the product catalog and explain preference conflicts on a higher level [8,9]. In the case of a user wanting both a fuel-efficient and high-powered car, FindMe attempts to illustrate the tradeoff between horsepower and fuel efficiency. This method of showing partially satisfied solutions is also called soft navigation by Stolze [60].
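As a rough approximation of this idea, one can rank items by how many stated preferences they satisfy. The count-based scoring below is a deliberate simplification of the partial constraint satisfaction techniques in [15], and all names in it are hypothetical:

def rank_by_partial_satisfaction(items, preferences):
    # `preferences` is a list of boolean predicates over an item; the real
    # system solves a partial constraint satisfaction problem, which this
    # satisfied-count scoring only approximates.
    def score(item):
        return sum(1 for pref in preferences if pref(item))
    return sorted(items, key=score, reverse=True)

# Example preferences over apartments (illustrative thresholds only).
prefs = [
    lambda apt: apt["price"] <= 1200,
    lambda apt: apt["area"] >= 60,
    lambda apt: apt["distance"] <= 15,
]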

To convince users of partially satisfied results, we can also adopt the approach used by activedecision.com. It not only shows the partial solutions, but also explains in detail how the system satisfies some of the user's preferences and not others. A qualitative user survey about such explanation mechanisms was conducted with a carefully constructed questionnaire based on a series of hypotheses and corresponding questions. 53 participants completed the survey, and most of them strongly agreed that the explanation components are likely to inspire their trust in the recommended solutions [10]. In addition, an alternative explanation technique, the organization interface, where partially satisfied products are grouped into a set of categories (Figure 2), was preferred by most subjects over the traditional method where each item is accompanied by an explanation construct [10]. A follow-up comparative user study (with 72 participants) further showed that this interface method can significantly inspire competence-induced user trust in terms of the user's perceived competence, intention to return and intention to save effort (see some details of the experiment in section 7.3) [44].


Fig. 2. Partially Satisfied Products in an Organization Interface.

Guideline 7 (preference conflict management): Consider resolving preference conflicts by showing partially satisfied results with compromises clearly explained to the user.

6.2 Tradeoff Assistance

As catalogs grow in size, it becomes increasingly difficult to find the target item. Users may achieve relatively low decision accuracy unless a tool helps them efficiently view and compare many potentially interesting products. Even though a recommender agent is able to improve decision quality by providing filtering and comparison matrix components [20], a user can still face the bewildering task of selecting the right items to include in the consideration set.

Researchers found that online tools could increase decision accuracy by up to 57% by helping users select and compare options which share tradeoff properties [43]. 28 subjects (10 females) took part in the experiment; each participant was first asked to make a choice, and then used the decision aid tool to perform a set of tradeoff navigation tasks. The results showed that after a user has considered an item as the final candidate, the tool can help her reach higher decision accuracy by prompting her to view a set of tradeoff alternatives. The same example critiquing interfaces discussed in Section 5 can be used to assist users in viewing tradeoff alternatives, for example, "I like this portable PC, but can I find something lighter?" This style of interaction is called tradeoff navigation and is enabled by the "modify" widget together with the "tweaking panel" (see Figure 4 in [49]). Tweaking (used in FindMe [8,9]) was the first tool to implement this tradeoff assistance. It was originally designed to help users navigate to their targets by modifying stated preferences, one at a time. Example critiquing (used in SmartClient [43,49]) is more intentional about its tradeoff support, especially for tradeoffs involving more than two participating attributes. In a single interaction, a user can state her desire to improve the values of certain attributes, compromise on others, or any combination of the two.
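In code, a unit critique of this kind reduces to a directional filter anchored at the currently displayed example. The sketch below is illustrative only, with an invented two-item catalog:

def critique(items, reference, attribute, direction):
    # User-motivated unit critique: keep items that improve on the
    # displayed reference example along one attribute.
    if direction == "lower":
        return [it for it in items if it[attribute] < reference[attribute]]
    return [it for it in items if it[attribute] > reference[attribute]]

catalog = [{"name": "PC-1", "weight": 2.4}, {"name": "PC-2", "weight": 1.9}]
shown = catalog[0]
# "I like this portable PC, but can I find something lighter?"
lighter = critique(catalog, shown, "weight", "lower")  # -> [PC-2]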

Reilly et al. introduced another style of tradeoff support with dynamic critiquing methods [51]. Critiques are directional feedback at the attribute level that users can select in order to improve a system's recommendation accuracy. For example, after recommending a Canon digital camera, the system may display "we have more matching cameras with the following: 1) less optical zoom, thinner and lighter; 2) different manufacturer, lower resolution and cheaper; 3) larger screen size, more memory and heavier." Dynamic critiquing is an approach for automatically generating useful compound critiques so that users can indicate their preferences on multiple attributes simultaneously. The experiment in [51] shows that the dynamic critiquing approach can reduce the interaction session length by up to 40% compared to an approach with only unit critiques.
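In outline, dynamic critiquing compares each remaining product to the current recommendation, turns the differences into critique patterns, and surfaces patterns that recur often. The sketch below substitutes a simple pair-counting step for the Apriori-style mining described in [51], and its data layout is assumed:

from collections import Counter
from itertools import combinations

def critique_pattern(item, current, attributes):
    # Direction of each attribute of `item` relative to the current
    # recommendation, e.g. ("price", "<") reads as "cheaper".
    return [(a, "<") if item[a] < current[a] else (a, ">")
            for a in attributes if item[a] != current[a]]

def compound_critiques(items, current, attributes, top=3):
    # Count how often pairs of unit critiques co-occur across the remaining
    # items; the most frequent pairs become candidate compound critiques.
    counts = Counter()
    for item in items:
        pattern = critique_pattern(item, current, attributes)
        counts.update(combinations(sorted(pattern), 2))
    return counts.most_common(top)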

Although originally designed to support navigation in recommender systems, the unit and compound critiques described in [51] correspond to the simple and complex tradeoffs defined in [49]. Both are mechanisms to help users compare and evaluate the recommended item against a set of tradeoff alternatives. However, the dynamic critiquing method provides system-proposed tradeoff support, because it is the system which produces and suggests the tradeoff categories, whereas example critiquing provides a mechanism for users to initiate their own tradeoff navigation (called user-motivated critiques in [11]).

A recent study compared the performance of user-motivated vs. system-proposed approaches [11]. A total of 36 volunteers (5 females) participated in the experiment. It used a within-subjects design, and each participant was asked to evaluate two interfaces implementing the respective approaches, one after the other. All three evaluation criteria stated in section 2.2 were used: decision accuracy, user interaction effort and user confidence. The results indicate that the user-motivated tradeoff method enables users to achieve a higher level of decision accuracy with less cognitive effort, mainly due to its flexibility in allowing users to freely combine unit and compound critiques. In addition, the confidence in choices made with the user-motivated critique method is higher, resulting in users' increased intention to purchase the product they have found and to return to the agent in the future. We thus propose:

Guideline 8 (tradeoff assistance): In addition to providing the search function, consider providing users with tradeoff assistance in the interface using either system-proposed or user-motivated approaches. The latter is likely to provide users with more flexibility in choosing their tradeoff desires and thus enable them to achieve higher decision accuracy and confidence.


7. Display Strategies

At least three display strategies are currently employed in preference-based search and recommender tools: recommending items one at a time, showing the top k matching results (where k is a small number between 3 and 30), or displaying products with explanations of how ranking scores are computed. We discuss these strategies using the effort, accuracy, and confidence evaluation framework discussed in Section 2.

7.1 Recommending One Item at a Time

The advantages of such recommender systems are that the display is relatively easy to design, users are not likely to be overwhelmed by excessive information, and the interface can be easily adapted to small display devices such as mobile phones. The obvious disadvantage is that a user may not be able to find her target choice quickly. As mentioned in Section 5, a novice user's initial preferences are likely to be uncertain, so the initially recommended results may not include her target choice. Either the user has to interact with the system much longer due to the small result set, or, if she exhausts her interaction effort before reaching the final target, she is likely to achieve very low decision accuracy. Thus we propose:

Guideline 9: Showing one search result or recommending one item at a time allows for a simple display strategy which can be easily adapted to small-display devices; however, it is likely to engage users in longer interaction sessions or only allow them to achieve relatively low decision accuracy.

7.2 Recommending K-best Items

Some product search tools present a set of top-k alternatives to the users. We call this style of display the k-best interface. Commercial tools employing this strategy can be found at ActiveDecision.com (k > 10). Academic prototypes include those used by SmartClient (7 <= k <= 30) [14,49], ATA (k=3) [28], ExpertClerk (k=3) [57], FirstCase (k=3) [37] and Top Case (k=3) [38].

When k approaches 10, the issue of ordering the alternatives becomes important. The most commonly used method is to select the best k items based on how well they match users' stated preferences using utility scores (see multi-attribute utility theory [24]). We can also use the "k nearest neighbor" retrieval algorithm (or simply k-NN) [12] to rank the k items, as done in the case-based reasoning field [25]. The k items are displayed in descending order from the highest utility score or rank to the lowest (activedecision.com, SmartClient). This method has the advantages of displaying a relatively high number of options without overwhelming the users, pre-selecting the items based on how well they match the stated preferences of a user, and achieving relatively high decision accuracy [43,49].
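For concreteness, an additive multi-attribute utility ranking of this kind can be sketched as follows; the weight and value-function representations are assumptions for the sketch, not the cited systems' actual data structures:

def additive_utility(item, weights, value_fns):
    # Additive multi-attribute utility [24]: a weighted sum of per-attribute
    # value functions, each mapping an attribute level into [0, 1].
    return sum(w * value_fns[a](item[a]) for a, w in weights.items())

def k_best(items, weights, value_fns, k=7):
    # Display the k best-matching items in descending utility order.
    return sorted(items,
                  key=lambda it: additive_utility(it, weights, value_fns),
                  reverse=True)[:k]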

Pu and Kumar compared an example critiquing based system (k=7, rank ordered by utility scores) with a system using the ranked list display method (k=n, rank ordered on user-selected attribute values such as price) [49]. 22 volunteers participated in the user study. Each of them was asked to test the two interfaces (example critiquing and ranked list) in random order by performing a list of given tasks. The results showed that while users performed the instructed search tasks more easily with example critiquing (less task time and a smaller error rate, with statistical significance) and achieved higher decision accuracy [43], more of them expressed a higher level of confidence that the answers they found were correct for the ranked list interface. Further analysis of users' comments recorded during the study revealed that the confidence issue depends largely on how the items were ordered and how many of them were displayed. Many users felt that the EC system (displaying only 7 items) was hiding something from them and that the results returned by the EC interface did not correspond to their own ranking of products. A pilot study showed that when more items were displayed, users generally did not scroll down to view the additional products, but their confidence level increased and the interaction time was not affected. Therefore we suggest the following guideline for the top-k display strategy:

Guideline 10: Displaying more products and ranking them in a natural order is likely to increase users’ sense of control and confidence.

7.3 Explanation Interfaces

When it comes to suggesting decisions, such as which camera to buy, the recommender system's ability to establish trust with users and convince them of its recommendations is a crucial design factor. Researchers have started investigating the user confidence issue and other subjective factors in a more formal framework involving trust relationships between the system and the user. It is widely accepted that trust in a technological artifact (like the recommender agent) can be conceptualized in terms of competence, benevolence, and integrity, similar to trust in a person. Trust is further seen as a long-term relationship between the user and the organization that the recommender system represents [10]. When a user trusts a recommender system, she is more likely to purchase items and return to the system in the future. A carefully designed qualitative survey with 53 users revealed that an important construct of trust formation is an interface's ability to explain its results [10], as mentioned in section 6.1.

The explanation interface can be implemented in various ways. For example, ActiveDecision.com uses a tool tip with a "why" label to explain how each of the recommended products matches a user's stated preferences, similar to the interface shown in Figure 3. Alternatively, it is possible to design an organization-based explanation interface where the best matching item is displayed at the top of the interface along with several categories of tradeoff alternatives [44]. Each category is labeled with a title explaining the characteristics of the items the category contains (Figure 4).
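One plausible reading of the category-generation step, reduced to code: group tradeoff alternatives by the direction in which they differ from the best match, and derive the category title from those differences. This is only a sketch; the actual organization algorithm is given in [44].

from collections import defaultdict

def organize(alternatives, best, attributes):
    # Group tradeoff alternatives by how they differ from the best match,
    # so each category carries an explanatory title such as
    # "lower price, higher weight".
    categories = defaultdict(list)
    for alt in alternatives:
        diffs = [f"{'lower' if alt[a] < best[a] else 'higher'} {a}"
                 for a in attributes if alt[a] != best[a]]
        categories[", ".join(diffs) or "comparable"].append(alt)
    return categories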


Fig. 3. A generic recommendation interface with simple “why” labels.


Fig. 4. The more trust-inspiring organization interface.

In order to understand whether the organization interface is a more effective way to explain recommendations, an empirical study of significant scale was conducted comparing the organization interface with the traditional "why" interface in a within-subjects design. A total of 72 volunteers (19 females) were recruited as participants. The results showed that the organization interface significantly increases user perception of its competence, which more effectively inspires users' trust and enhances their intention to save cognitive effort and to use the interface again in the future [44]. Moreover, the study found that the actual time spent looking for a product did not have a significant impact on users' subjective perceptions. This indicates that less time spent on the interface, while very important in reducing decision effort, cannot alone predict what users will subjectively experience. Five principles for the effective design of organization interfaces were developed and an algorithm was presented for generating the content of such interfaces [44]. Here we propose:

Guideline 11: Consider designing interfaces which are capable of explaining how ranking scores are computed, because they are likely to inspire user trust.

8. A Cost-Effect Model on Interaction Guidelines

We have developed a set of guidelines to ensure the design of usable product search tools, relying on a general framework of three evaluation criteria: (i) decision accuracy, (ii) user interaction effort, and (iii) user decision confidence. As these three criteria cannot be optimized independently of each other, we offer a model of tradeoff (or cost-effect analysis) that encapsulates the interaction of these parameters, with the goal of keeping users engaged in the interaction process and helping them achieve an optimal decision outcome. We first consider the fact that product search tools must serve the needs of a large and heterogeneous user population. They must adapt to the characteristics of individual users, in particular to their willingness to put in continuous interaction effort in order to obtain their desired results. In our theoretical model, we let e be the user's increasing interaction effort over time. With this effort, users hope to obtain an increasingly high perceived decision accuracy, a. Each individual user may have different values for the following parameters:

• Confidence threshold θ: the amount of perceived accuracy that is required for the user to be satisfied with the current result of the search process;

• Effort threshold ε: the amount of effort the user is willing to spend to obtain a recommendation;

• Effort increment threshold δ: the additional effort the user is willing to expend after observing an increase in perceived accuracy.

A poorly designed tool can lose users due to an insufficient level of perceived accuracy relative to the confidence threshold, or an increase in perceived accuracy that does not justify the interaction effort. Figure 5 refers to a hypothetical tool where users are asked to specify their preferences in a dialogue interface consisting of a list of questions. It shows the perceived accuracy achieved by the tool as a function of effort, measured in interaction cycles. The number of cycles the user is willing to spend is given as the effort threshold ε plus the perceived accuracy achieved so far multiplied by the effort increment threshold δ, i.e., the effort limit is ε + a·δ. Thus, the user will leave the process as soon as the effort limit, indicated as the dashed line, crosses the accuracy/effort curve.

Fig. 5. Perceived accuracy as a function of interaction effort for an interview-based tool.
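The stopping rule can be made concrete with a small simulation. This is a minimal sketch under the model's assumptions: discrete interaction cycles, and invented parameter values for θ, ε and δ.

def interaction_outcome(accuracy_by_cycle, theta, epsilon, delta):
    # The user stops at the first cycle where cumulative effort exceeds
    # epsilon + (perceived accuracy so far) * delta; the session succeeds
    # only if perceived accuracy reaches theta before that happens.
    accuracy = 0.0
    for effort, a in enumerate(accuracy_by_cycle, start=1):
        if effort > epsilon + accuracy * delta:
            return "abandoned", accuracy  # effort limit crossed
        accuracy = a
        if accuracy >= theta:
            return "satisfied", accuracy  # confidence threshold reached
    return "abandoned", accuracy

# A concave accuracy curve front-loads the gains, keeping the effort
# limit ahead of the effort actually spent.
print(interaction_outcome([0.5, 0.7, 0.8, 0.85], theta=0.8, epsilon=2, delta=4))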

When a user encounters such an interface, she is likely to perceive a significant amount of effort to complete the preference questions before obtaining the first response from the system. The nature of the questions may also cause her to feel frustration that her stated preferences do not correspond to the available options, because at this point no product domain knowledge has been revealed. Furthermore, the interface does not show marked increases in perceived accuracy and thus does not motivate the user to increase her patience. Even though she believes that she can revise her preferences, she may not be willing to put in the required effort, and consequently leaves the tool without being sufficiently confident about the real benefit of the preference elicitation process. Note that in this example, the user leaves the process before reaching the confidence threshold, so the interaction has not been a success. With slightly more patience, the user could have obtained a result above the confidence threshold and with acceptable effort, namely at the third intersection of the effort limit line with the curve. However, since this was not apparent to the user in this example, this point will not be reached. Similar problems can occur with other interface designs that do not pay attention to ensuring a significant increase in perceived accuracy per unit of effort invested.

To avoid this pitfall, the tool should ensure that perceived accuracy is a concave function of effort, as shown in Figure 6. Such a function exploits a user's effort threshold to the best possible degree: if the increase in perceived accuracy at some point becomes insufficient to keep the user interested, she would not have continued interacting with the tool at any later stage either.

Guidelines for phase (a): 1. Any effort; 2. Any order; 3. Any preferences.

Guidelines for phase (b): 4. Showing example options; 5. Showing diverse examples; 6. Suggesting options with the look-ahead principle.

Guidelines for phase (c): 7. Preference conflict management; 8. Tradeoff assistance.

Guidelines for all phases (a)-(c): 9. Showing one search result at a time is good for small-display devices, but it is likely to achieve relatively low decision accuracy; 10. Displaying more products and ranking them in a natural order is likely to increase users' sense of control and confidence; 11. Designing interfaces which are capable of explaining how ranking scores are computed can inspire user trust.


Fig. 6. Perceived accuracy as a function of interaction effort for an example-based tool, and guidelines that apply to achieving the desired concave shape in the different stages.

A concave function can only be achieved by ensuring instant feedback on the user's effort, and by placing the steps that yield the greatest increase in perceived accuracy at the beginning of the interaction. An early increase in perceived accuracy also serves to convince the user to stay with the system longer, as indicated by the dashed line in Figure 6. Instant feedback is ensured by example-based interaction and by the general guidelines 9-11 of showing multiple solutions in a structured and confidence-inspiring way. In general, it can be assumed that users will themselves choose to add the information that they believe maximizes their decision accuracy, so user initiative is key to achieving a concave curve. In the first phase (a), the system can achieve the biggest accuracy gains by exploiting users' initial preferences. However, it is important at this stage to avoid asking them questions that they cannot accurately answer (guideline 1). Furthermore, the curve can be made steeper by letting the user formulate these initial preferences with as little effort as possible. We therefore derived guidelines 2 and 3 to make this possible.

Once these initial preferences have been obtained, the biggest increase in perceived accuracy during phase (b) can be obtained by completing the initial preferences with others of which the user was not initially aware. This can be stimulated by showing examples (guideline 4), and by choosing them to specifically educate the user about available options (guidelines 5 and 6). This provides the main cost-effect tradeoff for the second phase of a typical interaction.

Finally, in the third phase (c), the set of preferences can be fine-tuned by adjusting their relative weights and making tradeoffs. This can be supported by tools that show partial solutions (guideline 7) and actively support decision tradeoffs among preferences (guideline 8).

As the tool cannot verify when the user transitions between the phases, and in fact the transition may be gradual, it should provide continuous support for each of them, but always encourage the actions that are likely to increase perceived accuracy the most. Thus, adjustment of tradeoff weights should be shown less prominently than the possibility to add new preferences.

These requirements are best addressed by the example-based recommender tools described in Section 5. More precisely, the incremental establishment and refinement of the user's preference model increases the true decision accuracy. To keep the user engaged in the interaction process and convinced to accept the result of the search, this true accuracy must also be perceived by the user. This is supported by showing several results at the same time (guideline 9), which tends to correct inaccuracies, by providing structure to their display (guideline 10), and by providing explanations (guideline 11). These are necessary elements to motivate users to invest enough effort to reach an accurate decision.

9. Conclusion

This article presents eleven essential guidelines that should be observed when designing interactive preference-based recommender systems. In presenting and justifying the guidelines, we provided a broad and in-depth review of our prior work and that of other researchers in the field on user interaction issues with such recommender systems. Most importantly, a framework of three evaluation criteria was proposed to determine the usability of such systems: decision accuracy, user interaction effort, and user confidence. Within this framework, we have selected techniques, validated through empirical studies, that demonstrate how to implement the guidelines, with emphasis on those that achieve a good balance across all of the criteria. Adopting these guidelines should therefore significantly increase the usability of product search systems and consequently promote their wide adoption in e-commerce environments.

References

[1]. G. Adomavicius, A. Tuzhilin. Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering 17 (6) (2005) 734-749.

[2]. R. Agrawal, T. Imielinski, A. Swami. Mining association rules between sets of items in large databases. Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, ACM Press, 1993, 207–216.

[3]. N.J. Belkin, W.B. Croft. Information filtering and information retrieval: two sides of the same coin? Communications of the ACM 35 (12) (1992) 29-38.

[4]. D. Bernoulli. Exposition of a new theory on the measurement of risk (original 1738). Econometrica 22 (1) (1954) 23-36.

[5]. J. Blythe. Visual exploration and incremental utility elicitation. Proceedings of the 18th National Conference on Artificial Intelligence, AAAI press, 2002, 526-532.

[6]. K. Bradley, B. Smyth. Improving recommendation diversity. Proceedings of the 12th Irish Conference on Artificial Intelligence and Cognitive Science, 2001, 85-94.

[7]. R. Burke. Hybrid recommender systems: survey and experiments. User Modeling and User-Adapted Interaction 12 (4) (2002) 331-370.


[8]. R. Burke, K. Hammond, E. Cooper. Knowledge-based navigation of complex information spaces. Proceedings of the 13th National Conference on Artificial Intelligence, AAAI press, 1996, 462-468.

[9]. R. Burke, K. Hammond, B. Young. The FindMe approach to assisted browsing. IEEE Expert: Intelligent Systems and Their Applications 12 (4) (1997) 32-40.

[10]. L. Chen, P. Pu. Trust building in recommender agents. Proceedings of the Workshop on Web Personalization, Recommender Systems and Intelligent User Interfaces at the 2nd International Conference on E-Business and Telecommunication Networks, 2005, 135-145.

[11]. L. Chen, P. Pu. Evaluating critiquing-based recommender agents. Proceedings of Twenty-first National Conference on Artificial Intelligence (AAAI-06), 2006, 157-162.

[12]. T.M. Cover, P.E. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, IT-13, 1967, 21-27.

[13]. B. Faltings, P. Pu, M. Torrens, P. Viappiani. Designing example-critiquing interaction. Proceedings of the 9th International Conference on Intelligent User Interfaces (IUI’04), ACM Press, 2004, 22-29.

[14]. B. Faltings, M. Torrens, P. Pu. Solution generation with qualitative models of preferences. Computational Intelligence 20 (2) (2004), 246-263.

[15]. E.C. Freuder, R.J. Wallace. Partial constraint satisfaction. Artificial Intelligence 58 (1-3) (1992) 21-70.

[16]. M. Goker, C. Thompson. The adaptive place advisor: a conversational recommendation system. Proceedings of the 8th German Workshop on Case Based Reasoning, 2000.

[17]. D. Goldberg, D. Nichols, B.M. Oki, D. Terry. Using collaborative filtering to weave an information tapestry. Communications of the ACM 35 (12), Special issue on information filtering (1992) 61-70.

[18]. N. Good, J.B. Schafer, J.K. Konstan, A. Borchers, B.M. Sarwar, J.L. Herlocker, J. Riedl. Combining collaborative filtering with personal agents for better recommendations. Proceedings of the 16th National Conference on Artificial Intelligence (AAAI'99), AAAI press, 1999, 439-446.

[19]. V.A. Ha, P. Haddawy. Problem-focused incremental elicitation of multi-attribute utility models. In Shenoy, P. (ed.), Proceedings of the 13th Conference on Uncertainty in Artificial Intelligence (UAI'97), 1997, 215-222.

[20]. G. Haubl, V. Trifts. Consumer decision making in online shopping environments: the effects of interactive decision aids. Marketing Science 19 (1) (2000) 4-21.

[21]. I.N. Herstein, J. Milnor. An axiomatic approach to measurable utility. Econometrica 21 (2) (1953) 291-297.

[22]. W. Hill, L. Stead, M. Rosenstein, G. Furnas. Recommending and evaluating choices in a virtual community of use. Proceedings of the CHI '95 Conference on Human Factors in Computing Systems, 1995, 194-201.

[23]. R.L. Keeney. Value-Focused Thinking: A Path to Creative Decision Making, Harvard University Press (1992).

[24]. R.L. Keeney, H. Raiffa. Decisions with Multiple Objectives: Preferences and Value Tradeoffs, New York: Wiley (1976).

[25]. J. L. Kolodner. Case-Based Reasoning. San Mateo, CA: Morgan Kaufmann (1993).

[26]. B. Krulwich. Lifestyle finder: intelligent user profiling using large-scale demographic data. Artificial Intelligence Magazine 18 (2) (1997) 37-45.

[27]. K. Lang. Newsweeder: learning to filter news. Proceedings of the 12th International Conference on Machine Learning, 1995, 331-339.

[28]. G. Linden, S. Hanks, N. Lesh. Interactive assessment of user preference models: the automated travel assistant. Proceedings of the 6th International Conference on User Modeling (UM'97), New York: Springer Wien New York, 1997, 67-78.

[29]. J. Marschak. Rational Behavior, Uncertain Prospects, and Measurable Utility. Econometrica 18 (2) (1950) 111-141.


[30]. K. McCarthy, L. McGinty, B. Smyth, J. Reilly. A live-user evaluation of incremental dynamic critiquing. Proceedings of the 6th International Conference on Case-Based Reasoning (ICCBR'05), 2005, 339-352.

[31]. K. McCarthy, J. Reilly, L. McGinty, B. Smyth. Thinking positively – explanatory feedback for conversational recommender systems. Proceedings of the Workshop on Explanation in CBR at the 7th European Conference on Case-Based Reasoning (ECCBR'04), 2004, 115-124.

[32]. K. McCarthy, J. Reilly, L. McGinty, B. Smyth. Experiments in dynamic critiquing. Proceedings of the 10th International Conference on Intelligent User Interfaces (IUI'05), New York: ACM Press, 2005, 175-182.

[33]. L. McGinty, B. Smyth. On the role of diversity in conversational recommender systems. Proceedings of the 5th International Conference on Case-Based Reasoning (ICCBR’03), 2003, 276-290.

[34]. S.M. McNee, S.K. Lam, J. Konstan, J. Riedl. Interfaces for eliciting new user preferences in recommender systems. Proceedings of the User Modeling Conference, Springer, 2003, 178-187.

[35]. S.M. McNee, J. Riedl, J. Konstan. Being accurate is not enough: how accuracy metrics have hurt recommender systems. CHI '06 Extended Abstracts on Human Factors in Computing Systems (CHI '06), ACM, New York, NY, 2006, 1097-1101.

[36]. D. McSherry. Diversity-conscious retrieval. In Craw, S., Preece, A. (eds.), Proceedings of the 6th European Conference on Advances in Case-Based Reasoning, London: Springer-Verlag, 2002, 219-233.

[37]. D. McSherry. Similarity and compromise. Proceedings of the 5th International Conference on Case-Based Reasoning (ICCBR’03), Springer-Verlag, 2003, 291-305.

[38]. D. McSherry. Explanation in recommender systems. Workshop Proceedings of the 7th European Conference on Case-Based Reasoning (ECCBR’04), 2004, 125-134.

[39]. P. Mongin. Expected Utility Theory. Handbook of Economic Methodology, Edward Elgar, 1998, 342-350.

[40]. J.W. Payne, J.R. Bettman, E.J. Johnson. The Adaptive Decision Maker, Cambridge University Press (1993).

[41]. J.W. Payne, J.R. Bettman, D.A. Schkade. Measuring constructed preferences: towards a building code. Journal of Risk and Uncertainty 19 (1999) 243-270.

[42]. B. Price, P.R. Messinger. Optimal recommendation sets: covering uncertainty over user preferences. Proceedings of the 20th National Conference on Artificial Intelligence (AAAI’05), 2005, 541-548.

[43]. P. Pu, L. Chen. Integrating tradeoff support in product search tools for e-commerce sites. Proceedings of the 6th ACM Conference on Electronic Commerce (EC'05), ACM Press, 2005, 269-278.

[44]. P. Pu, L. Chen. Trust building with explanation interfaces. Proceedings of the 11th International Conference on Intelligent User Interfaces (IUI'06), 2006, 93-100.

[45]. P. Pu, B. Faltings. Enriching buyers' experiences: the SmartClient approach. Proceedings of the SIGCHI conference on Human factors in computing systems (CHI’00), New York: ACM Press, 2000, 289-296.

[46]. P. Pu, B. Faltings. Decision tradeoff using example-critiquing and constraint programming. Constraints: an International Journal 9 (4) (2004) 289-310.

[47]. P. Pu, B. Faltings, M. Torrens. Effective interaction principles for online product search environments. Proceedings of the IEEE/WIC/ACM International Joint Conference on Intelligent Agent Technology and Web Intelligence, 2004, 724-727.

[48]. P. Pu, B. Faltings, M. Torrens. User-involved preference elicitation. Working Notes of the Workshop on Configuration, Eighteenth International Joint Conference on Artificial Intelligence (IJCAI'03), 2003, 56-63.

[49]. P. Pu, P. Kumar. Evaluating example-based search tools. Proceedings of the 5th ACM Conference on Electronic Commerce (EC'04), ACM Press, 2004, 208-217.


[50]. P. Pu, P. Viappiani, B. Faltings. Stimulating decision accuracy using suggestions. Proceedings of the SIGCHI conference on Human factors in computing systems (CHI'06), 2006, 121-130.

[51]. J. Reilly, K. McCarthy, L. McGinty, B. Smyth. Dynamic critiquing. Proceedings of the 7th European Conference on Case-Based Reasoning (ECCBR’04), Springer, 2004, 763-777.

[52]. P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, J. Riedl. GroupLens: an open architecture for collaborative filtering of Netnews. CSCW '94: Conference on Computer Supported Cooperative Work, ACM, 1994, 175-186.

[53]. E. Rich. User modeling via stereotypes. Cognitive Science 3 (1979) 329-354.

[54]. J.B. Schafer, J.A. Konstan, J. Riedl. Recommender systems in e-commerce. Proceedings of the ACM Conference on Electronic Commerce, ACM, 1999, 158-166.

[55]. U. Shardanand, P. Maes. Social information filtering: algorithms for automating "Word of Mouth". Proceedings of the Conference on Human Factors in Computing Systems (CHI '95), 1995, 210-217.

[56]. P. Schoemaker. The Expected Utility Model: Its Variants, Purposes, Evidence and Limitations. Journal of Economic Literature 20 (2) (1982) 529-563.

[57]. H. Shimazu. ExpertClerk: navigating shoppers' buying process with the combination of asking and proposing. Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI'01), 2001, 1443-1448.

[58]. B. Smyth, P. McClave. Similarity vs. diversity. Proceedings of the 4th International Conference on Case-Based Reasoning (ICCBR'01), Springer-Verlag, 2001, 347-361.

[59]. S. Spiekermann, C. Paraschiv. Motivating human-agent interaction: transferring insights from behavioral marketing to interface design. Journal of Electronic Commerce Research 2 (3) (2002) 255-285.

[60]. M. Stolze. Soft navigation in electronic product catalogs. International Journal on Digital Libraries 3 (1) (2000) 60-66.

[61]. M. Torrens, B. Faltings, P. Pu. SmartClients: constraint satisfaction as a paradigm for scaleable intelligent information systems. International Journal of Constraints 7 (1) (2002) 49-69.

[62]. A. Tversky. Contrasting rational and psychological principles in choice. Wise Choices: Decisions, Games, and Negotiations, Boston, MA: Harvard Business School Press (1996) 5-21.

[63]. A. Tversky, I. Simonson. Context-dependent preferences. Management Science 39 (10) (1993) 1179-1189.

[64]. P. Viappiani, B. Faltings, P. Pu. Evaluating preference-based search tools: a tale of two approaches. Proceedings of the Twenty-first National Conference on Artificial Intelligence (AAAI-06), 2006, 205-210.

[65]. P. Viappiani, B. Faltings, V. Schickel-Zuber, P. Pu. Stimulating preference expression using suggestions. Mixed-Initiative Problem-Solving Assistants, AAAI Fall Symposium Series, AAAI press, 2005, 128-133.

[66]. J. von Neumann, O. Morgenstern. The Theory of Games and Economic Behavior, Princeton University Press (1944).

[67]. M.D. Williams, F.N. Tou. RABBIT: an interface for database access. Proceedings of the ACM '82 Conference, ACM Press, 1982, 83-87.

[68]. J. Zhang, P. Pu. Performance evaluation of consumer decision support systems. International Journal of E-Business Research 2 (2006), Idea Group Publishing.

[69]. C.N. Ziegler, S.M. McNee, J.A. Konstan, G. Lausen. Improving recommendation lists through topic diversification. Proceedings of the 14th International World Wide Web Conference (WWW'05), 2005, 22-32.

[70]. I. Zukerman, D.W. Albrecht. Predictive statistical models for user modeling. User Modeling and User-Adapted Interaction 11 (1-2) (2001) 5-18.