Guidelines for Designing and Evaluating Explanations for Recommender Systems

Nava Tintarev and Judith Masthoff

Nava Tintarev, Telefónica Research, Barcelona, Spain, e-mail: [email protected]

Judith Masthoff, University of Aberdeen, Aberdeen, U.K., e-mail: [email protected]

Abstract This chapter gives an overview of the area of explanations in recommender systems. We approach the literature from the angle of evaluation: that is, we are interested in what makes an explanation “good”, and suggest guidelines as to how best to evaluate this. We identify seven benefits that explanations may contribute to a recommender system, and relate them to criteria used in evaluations of explanations in existing systems, and how these relate to evaluations with live recommender systems. We also discuss how explanations can be affected by how recommendations are presented, and the role the interaction with the recommender system plays w.r.t. explanations. Finally, we describe a number of explanation styles, and how they are related to the underlying algorithms. Examples of explanations in existing systems are mentioned throughout.

1 Introduction

In recent years, there has been an increased interest in more user-centered evaluation metrics for recommender systems such as those mentioned in [38]. It has also been recognized that many recommender systems functioned as black boxes, providing no transparency into the working of the recommendation process, nor offering any additional information to accompany the recommendations beyond the recommendations themselves [26].

Explanations can provide that transparency, exposing the reasoning and data behind a recommendation. This is the case with some of the explanations hosted on Amazon, such as: “Customers Who Bought This Item Also Bought . . . ”.



Explanations can also serve other aims such as helping to inspire user trust and loyalty, increase satisfaction, make it quicker and easier for users to find what they want, and persuade them to try or purchase a recommended item. In this way, we distinguish between different explanation criteria, such as explaining the way the recommendation engine works (transparency) and explaining why the user may or may not want to try an item (effectiveness). An effective explanation may be formulated along the lines of “You might (not) like Item A because...”. In contrast to the Amazon example above, this explanation does not necessarily describe how the recommendation was selected, in which case it is not transparent.

This chapter offers guidelines for designing and evaluating explanations in recommender systems, as summarized in Section 2. Expert systems can be said to be the predecessors of recommender systems. In Section 3 we therefore briefly relate research on evaluating explanations in expert systems to evaluations of explanations in recommender systems. We also identify the developments in recommender systems which may have caused a revived interest in explanation research since the days of expert systems.

Up until now there has been little consensus as to how to evaluate explanations, or why to explain at all. In Section 4, we list seven explanatory criteria, and describe how these have been measured in previous systems. These criteria can also be understood as advantages that explanations may offer to recommender systems, answering the question of why to explain. In the examples for effective and transparent explanations above, we saw that the two evaluation criteria could be mutually exclusive. In this section, we will describe cases where explanation criteria could be considered contradictory, and cases where they could be considered complementary.

In Section 5, we consider that the underlying recommender system affects the evaluation of explanations, and discuss this in terms of the evaluation metrics normally used for recommender systems (e.g. accuracy and coverage). We mention and illustrate examples of explanations throughout the chapter, and offer an aggregated list of examples in commercial and academic recommender systems in Table 6. We will see that explanations have been presented in various forms, using both text and graphics. Additionally, explanations are not decoupled from recommendations themselves or the way in which users can interact with recommendations. In Section 6 we mention different ways of presenting recommendations and their effect on explanations. Section 7 describes how users can interact and give input to a recommender system, and how this affects explanations. Both are factors that need to be considered in the evaluation of explanations.

In addition, the underlying algorithm of a recommender engine will influence the types of explanations that can be generated, even though the explanations selected by the system developer may or may not reflect the underlying algorithm. This is particularly the case for computationally complex algorithms for which explanations may be more difficult to generate, such as collaborative filtering [26, 29]. In this case, the developer must consider the trade-offs between e.g. satisfaction (as an extension of understandability) and transparency. In Section 8, we describe the most common explanation styles and how they relate to the underlying algorithms. Finally, we conclude with a summary and future directions in Section 9.

2 Guidelines

The content of this chapter is divided into sections which each elaborate on the following design guidelines for explanations in recommender systems.

• Consider the benefit(s) you would like to obtain from the explanations, and the best metric to evaluate on the associated criteria (Section 4).

• Be aware that the evaluation of explanations is related to, and may be confounded with, the functioning of the underlying recommendation engine, as measured by criteria commonly used for evaluating recommender systems (Section 5).

• Think about how the way that you present the recommendations themselves affects which types of explanations are more suitable (Section 6).

• Keep in mind how the interaction model that you select affects and interacts with the explanations (Section 7).

• Last, but certainly not least, consider how the underlying algorithm may affect the type of explanations you can generate (Section 8).

3 Explanations in Expert Systems

Explanations in intelligent systems are not a new idea: explanations have often been considered as part of the research in the area of expert systems [3, 30, 34, 24, 65]. This research has largely been focused on what kind of explanations can be generated and how these have been implemented in real world systems [3, 30, 34, 65]. Overall, there are few evaluations of the explanations in these systems. When they did occur, evaluations of explanations largely focused on user acceptance of the system, such as [10], or acceptance of the systems’ conclusions [66]. An exception is an evaluation in MYCIN which considered the decision support of the system as a whole [24]. In contrast, the commercial applications of recommender systems, previously unseen in expert systems, have extended the evaluation criteria for explanations beyond acceptance.

Also, developments in recommender systems have revived explanation research, after a decline of studies in expert systems in the 90’s. One such development is the increase in data: due to the growth of the web, there are now more users using the average (recommender) system. Systems are also no longer developed in isolation of each other, making the best possible reuse of code (open source projects) and datasets (e.g. the MovieLens1 and Netflix datasets2). In addition, new algorithms have been adapted and developed (e.g. kNN [8, 49], latent semantic analysis [17, 28]), which mitigate domain dependence, and allow for greater generalizability. One sign of the revived interest in explanation research is the success of a recent series of workshops on explanation aware computing (see e.g. [50, 51]).

For further reading, see the following reviews on expert systems with explanatory capabilities for three of the most common inference methods: heuristic-based methods [33], Bayesian networks [32], and case-based reasoning [18].

4 Defining Criteria

Surveying the literature for explanations in recommender systems, we see that recommender systems with explanatory capabilities have been evaluated according to different criteria, and identify seven different criteria for explanations of single item recommendations. Table 1 states these criteria, which are similar to those desired (but not evaluated on) in expert systems, cf. MYCIN [5]. In Table 2, we summarize previous evaluations of explanations in recommender systems, and the criteria by which they have been evaluated. Works that have no clear criteria stated, or have not evaluated the system on the explanation criteria which they state, are omitted from this table.

For example, in Section 3 we mentioned that expert systems were commonly evaluated in terms of user acceptance and the decision support of the system as a whole. User acceptance can be defined in terms of our criteria of satisfaction or persuasion. If the evaluation measures acceptance of the system as a whole, such as [10] who asked questions such as “Did you like the program?”, this reflects user satisfaction. If rather the evaluation measures user acceptance of advice or explanations, as in [66], the criterion can be said to be persuasion.

It is important to identify these criteria as distinct, even if they may interact, or require certain trade-offs. Indeed, it would be hard to create explanations that do well on all criteria; in reality it is a trade-off. For instance, in our work we have found that while personalized explanations may lead to greater user satisfaction, they do not necessarily increase effectiveness [60]. Other times, criteria that seem to be inherently related are not necessarily so; for example, it has been found that transparency does not necessarily aid trust [15]. For these reasons, while an explanation in Table 2 may have been evaluated for several criteria, it may not have achieved them all.

1 http://www.grouplens.org/node/73, retrieved July 2009

2 http://www.netflixprize.com/, retrieved July 2009


The type of explanation that is given to a user is likely to depend on the criteria of the designer of a recommender system. For instance, when building a system that sells books one might decide that user trust is the most important aspect, as it leads to user loyalty and increases sales. For selecting TV shows, user satisfaction could be more important than effectiveness. That is, it is more important that a user enjoys the service than that they are presented with the best available shows.

In addition, some attributes of explanations may contribute toward achieving multiple goals. For instance, one can measure how understandable an explanation is, which can contribute to e.g. user trust, as well as satisfaction.

In this section we describe seven criteria for explanations, and suggest evaluation metrics based on previous evaluations of explanation facilities, or offer suggestions of how existing measures could be adapted to evaluate the explanation facility in a recommender system.

Table 1 Explanatory criteria and their definitions

Aim                     Definition
Transparency (Tra.)     Explain how the system works
Scrutability (Scr.)     Allow users to tell the system it is wrong
Trust                   Increase users’ confidence in the system
Effectiveness (Efk.)    Help users make good decisions
Persuasiveness (Pers.)  Convince users to try or buy
Efficiency (Efc.)       Help users make decisions faster
Satisfaction (Sat.)     Increase the ease of use or enjoyment

4.1 Explain How the System Works: Transparency

An anecdotal article in the Wall Street Journal titled “If TiVo Thinks You Are Gay, Here’s How to Set It Straight” describes users’ frustration with irrelevant choices made by a video recorder that records programs it assumes its owner will like, based on shows the viewer has recorded in the past3. For example, one user, Mr. Iwanyk, suspected that his TiVo thought he was gay since it inexplicably kept recording programs with gay themes. This user clearly deserved an explanation.

An explanation may clarify how a recommendation was chosen. In expert systems, such as in the domain of medical decision making, the importance of transparency has also been recognized [5]. Transparency, or the heuristic of “Visibility of System Status”, is also an established usability principle [40], and its importance has also been highlighted in user studies of recommender systems [52].

3 http://online.wsj.com/article_email/SB1038261936872356908.html, retrieved Feb. 12, 2009

Page 6: Guidelines for Designing and Evaluating Explanations for ...

6 Nava Tintarev and Judith Masthoff

Table 2 The criteria by which explanations in recommender systems have been evaluated. System names are mentioned if given, otherwise we only note the type of recommended items. Works that have no clear criteria stated, or have not evaluated the system on the explanation criteria which they state, are omitted from this table. Note that while a system may have been evaluated for several criteria, it may not have achieved all of them. Also, for the sake of completeness we have distinguished between multiple studies using the same system.

System (type of items)                          Tra. Scr. Trust Efk. Pers. Efc. Sat.

(Internet providers) [20] X X X

(Digital cameras, notebook computers) [45] X

(Digital cameras, notebook computers) [46] X X

(Music) [52] X

(Movies) [60] X X X

Adaptive Place Advisor (restaurants) [57] X X

ACORN (movies) [64] X

CHIP (cultural heritage artifacts) [14] X X X

CHIP (cultural heritage artifacts) [15] X X X

iSuggest-Usability (music) [27] X X

LIBRA (books) [6] X

MovieLens (movies) [26] X X

Moviexplain (movies) [55] X X

myCameraAdvisor [62] X

Qwikshop (digital cameras) [35] X X

SASY (e.g. holidays) [16] X X X

Tagsplanations (movies) [61] X X


Vig et al. differentiate between transparency and justification [61]. While transparency should give an honest account of how the recommendations are selected and how the system works, justification can be descriptive and decoupled from the recommendation algorithm. The authors cite several reasons for opting for justification rather than genuine transparency: some algorithms are difficult to explain (e.g. latent semantic analysis, where the distinguishing factors are latent and may not have a clear interpretation), system designers may wish to protect trade secrets, and designers may desire greater freedom in designing the explanations.

Cramer et al. have investigated the effects of transparency on other evaluation criteria such as trust, persuasion (acceptance of items) and satisfaction (acceptance) in an art recommender [14, 15]. Transparency itself was evaluated in terms of its effect on actual and perceived understanding of how the system works [15]. While actual understanding was based on user answers to interview questions, perceived understanding was extracted from self-reports in questionnaires and interviews.

The evaluation of transparency has also been coupled with scrutability (Section 4.2) and trust (Section 4.3), but we will see in these sections that these criteria can be distinct from each other.

4.2 Allow Users to Tell the System it is Wrong: Scrutability

Explanations may help isolate and correct misguided assumptions or steps. When the system collects and interprets information in the background, as is the case with TiVo, it becomes all the more important to make the reasoning available to the user. Following transparency, a second step is to allow a user to correct reasoning, or make the system scrutable [16]. Explanations should be part of a cycle, where the user understands what is going on in the system and exerts control over the type of recommendations made, by correcting system assumptions where needed [53]. Scrutability is related to the established usability principle of User Control [40]. See Figure 1 for an example of a scrutable holiday recommender.

While scrutability is very closely tied to the criterion of transparency, it deserves to be uniquely identified. There are explanations that are transparent, but not scrutable, such as the explanation in Table 4. Here, the user cannot change (scrutinize) the ratings that affected this recommendation directly in the interface. In addition, the scrutability may reflect certain portions, rather than the entire workings, of the recommender engine. The explanations in this table are scrutable, but not (fully) transparent, even if they offer some form of justification. For example, there is nothing in Table 4 that suggests that the underlying recommendations are based on a Bayesian classifier. In such a case, we can imagine that a user attempts to scrutinize a recommender system, and manages to change their recommendations, but still does not understand exactly what happens within the system.

Czarkowski found that users were not likely to scrutinize on their own, and that extra effort was needed to make the scrutability tool more visible [16]. In addition, it was easier to get users to perform a given scrutinization task such as changing the personalization (e.g. “Change the personalisation so that only Current Affairs programs are included in your 4:30-5:30 schedule.”). Their evaluation included metrics such as task correctness, and whether users could express an understanding of what information was used to make recommendations for them. They understood that adaptation in the system was based on their personal attributes stored in their profile, that their profile contained information they volunteered about themselves, and that they could change their profile to control the personalization [16].


Fig. 1 Scrutable holiday recommender [16]. The explanation is in the circled area, and the user profile can be accessed via the “why” links.

4.3 Increase Users’ Confidence in the System: Trust

Trust is sometimes linked with transparency: previous studies indicate that transparency and the possibility of interaction with recommender systems increases user trust [20, 52]. Trust in the recommender system could also be dependent on the accuracy of the recommendation algorithm [37]. A study of users’ trust (defined as perceived confidence in a recommender system’s competence) suggests that users intend to return to recommender systems which they find trustworthy [11]. We note, however, that there is a case where transparency and trust were not found to be related [15].

We do not claim that explanations can fully compensate for poor recommendations, but good explanations may help users make better decisions (see Section 4.5 on effectiveness). A user may also be more forgiving, and more confident in recommendations, if they understand why a bad recommendation has been made and can prevent it from occurring again. A user may also appreciate when a system is “frank” and admits that it is not confident about a particular recommendation.

In addition, the interface design of a recommender system may affect its credibility. In a study of factors determining web page credibility, the largest proportion of users’ comments (46.1%) referred to the appeal of the overall visual design of a site, including layout, typography, font size and color schemes [22]. Likewise, the perceived credibility of a Web article was significantly affected by the presence of a photograph of the author [21]. So, while recommendation accuracy and the criterion of transparency are often linked to the evaluation of trust, design is also a factor that needs to be considered as part of the evaluation.

Questionnaires can be used to determine the degree of trust a user places in a system. An overview of trust questionnaires can be found in [41], which also suggests and validates a five-dimensional scale of trust. Note that this validation was done with the aim of using celebrities to endorse products, but was not conducted for a particular domain. Additional validation may be required to adapt this scale to a particular recommendation domain.


A model of trust in recommender systems is proposed in [11, 46], and the questionnaires in these studies consider factors such as intent to return to the system, and intent to save effort. Also, [62] query users about trust, but focus on trust-related beliefs such as the perceived competence, benevolence and integrity of a virtual adviser. Although questionnaires can be very focused, they suffer from the fact that self-reports may not be consistent with user behavior. In these cases, implicit measures (although less focused) may reveal factors that explicit measures do not.

One such implicit measure could be loyalty, a desirable by-product of trust. One study compared different interfaces for eliciting user preferences in terms of how they affected factors such as loyalty [37]. Loyalty was measured in terms of the number of logins and interactions with the system. Among other things, the study found that allowing users to independently choose which items to rate affected user loyalty. It has also been thought that Amazon’s conservative use of recommendations, mainly recommending familiar items, enhances user trust and has led to increased sales [54].
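Measuring loyalty of this kind amounts to simple counting over an interaction log. The sketch below is a minimal illustration of that idea; the log format and function name are our own assumptions, not those used in [37].

```python
from collections import defaultdict
from typing import Iterable, Tuple, Dict

# Hypothetical log records: (user_id, session_id, event_type), where event_type
# is e.g. "login", "rating", "click". The format is illustrative only.
def loyalty_metrics(log: Iterable[Tuple[str, str, str]]) -> Dict[str, Dict[str, int]]:
    sessions = defaultdict(set)      # user -> distinct sessions (a proxy for logins)
    interactions = defaultdict(int)  # user -> number of interaction events
    for user, session, event in log:
        sessions[user].add(session)
        if event != "login":
            interactions[user] += 1
    return {user: {"logins": len(sessions[user]), "interactions": interactions[user]}
            for user in sessions}
```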

4.4 Convince Users to Try or Buy: Persuasiveness

Explanations may increase user acceptance of the system or the given recommendations [26]. Both definitions qualify as persuasion, as they are attempts to gain benefit for the system rather than for the user.

[15] evaluated the acceptance of recommended items in terms of how many recommended items were present in a final selection of six favorites. In a study of a collaborative filtering- and rating-based recommender system for movies, participants were given different explanation interfaces (e.g. Figure 2) [26]. This study directly inquired how likely users were to see a movie (with identifying features such as title omitted) for 21 different explanation interfaces. Persuasion was thus a numerical rating on a 7-point Likert scale.

In addition, it is possible to measure if the evaluation of an item has changed, i.e. if the user rates an item differently after receiving an explanation. Indeed, it has been shown that users can be manipulated to give a rating closer to the system’s prediction [13]. This study was in the low investment domain of movie rental, and it is possible that users may be less influenced by incorrect predictions in high(er) cost domains such as cameras4. It is also important to consider that too much persuasion may backfire once users realize that they have tried or bought items that they do not really want.

Persuasiveness can be measured in a number of ways. For example, it can be measured as the difference between two ratings: the first being a previous rating, and the second a re-rating for the same item but with an explanation interface [13].

4 In [59] participants reported that they found incorrect overestimation less useful in high cost domains compared to low cost domains.


Another possibility would be to measure how much users actually try or buy items compared to users in a system without an explanation facility. These metrics can also be understood in terms of the concept of “conversion rate” commonly used in e-Commerce, operationally defined as the percentage of visitors who take a desired action.
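Both measurements reduce to simple arithmetic once the data is collected. The sketch below illustrates them under our own naming; it is not code from any of the cited studies.

```python
def persuasion_shift(pre_ratings, post_ratings):
    """Mean change in rating after seeing an explanation (re-rating setup as in [13]).
    pre_ratings and post_ratings map item_id -> rating given by the same user."""
    common = pre_ratings.keys() & post_ratings.keys()
    return sum(post_ratings[i] - pre_ratings[i] for i in common) / len(common)

def conversion_rate(visitors_who_acted, all_visitors):
    """Percentage of visitors who took the desired action (tried or bought an item)."""
    return 100.0 * visitors_who_acted / all_visitors
```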

Fig. 2 One out of twenty-one interfaces evaluated for persuasiveness - a histogram summarizing the ratings of similar users (neighbors) for the recommended item grouped by good (5s and 4s), neutral (3s), and bad (2s and 1s), on a scale from 1 to 5 [26].

4.5 Help Users Make Good Decisions: Effectiveness

Rather than simply persuading users to try or buy an item, an explanation may also assist users to make better decisions. Effectiveness is by definition highly dependent on the accuracy of the recommendation algorithm. An effective explanation would help the user evaluate the quality of suggested items according to their own preferences. This would increase the likelihood that the user discards irrelevant options while helping them to recognize useful ones. For example, a book recommender system with effective explanations would help a user to buy books they actually end up liking. Bilgic and Mooney emphasize the importance of measuring the ability of a system to assist the user in making accurate decisions about recommendations based on explanations such as those in Figure 3 and Tables 3, 4 and 5 [6]. Effective explanations could also serve the purpose of introducing a new domain, or the range of products, to a novice user, thereby helping them to understand the full range of options [20, 45].

Vig et al. measure perceived effectiveness: “This explanation helps me determine how well I will like this movie” [61].


Fig. 3 The Neighbor Style Explanation - a histogram summarizing the ratings of similar users (neighbors) for the recommended item grouped by good (5s and 4s), neutral (3s), and bad (2s and 1s), on a scale from 1 to 5. The similarity to Figure 2 in this study was intentional, and was used to highlight the difference between persuasive and effective explanations [6].

Table 3 The keyword style explanation by [6]. This recommendation is explained in terms of keywords that were used in the description of the item, and that have previously been associated with highly rated items. “Count” identifies the number of times the keyword occurs in the item’s description, and “strength” identifies how influential this keyword is for predicting liking of an item.

Word        Count  Strength  
HEART       2      96.14     Explain
BEAUTIFUL   1      17.07     Explain
MOTHER      3      11.55     Explain
READ        14     10.63     Explain
STORY       16     9.12      Explain

Table 4 A more detailed explanation for the “strength” of a keyword, which shows after clicking on “Explain” in Table 3. In practice, “strength” probabilistically measures how much more likely a keyword is to appear in a positively rated item than a negatively rated one. It is based on the user’s previous positive ratings of items (“rating”), and the number of times the keyword occurs in the description of these items (“count”) [6].

Title                              Author                        Rating  Count
Hunchback of Notre Dame            Victor Hugo, Walter J. Cobb   10      11
Till We Have Faces: A Myth Retold  C.S. Lewis, Fritz Eichenberg  10      10
The Picture of Dorian Gray         Oscar Wilde, Isobel Murray    8       5

Effectiveness of explanations can also be calculated as the absence of a difference between the liking of the recommended item prior to, and after, consumption. For example, in a previous study, users rated a book twice, once after receiving an explanation, and a second time after reading the book [6]. If their opinion on the book did not change much, the system was considered effective. This study explored the effect of the whole recommendation process, explanation inclusive, on effectiveness. The same metric was also used to evaluate whether personalization of explanations (in isolation of a recommender system) increased their effectiveness in the movie domain [59].

While this metric considers the difference between the before and after ratings, it does not discuss the effects of over- versus underestimation. In our work we found that users considered overestimation to be less effective than underestimation, and that this varied between domains. Specifically, overestimation was judged more severely in high investment domains compared to low investment domains. In addition, the strength of the effect on perceived effectiveness varied depending on where on the scale the prediction error occurred [59].

Another way of measuring the effectiveness of explanations has been to test the same system with and without an explanation facility, and evaluate if subjects who receive explanations end up with items more suited to their personal tastes [14].

Other work evaluated explanation effectiveness using a metric from marketing [25], with the aim of finding the single best possible item (rather than “good enough items” as above) [12]. Participants interacted with the system until they found the item they would buy. They were then given the opportunity to survey the entire catalog and to change their choice of item. Effectiveness was then measured by the fraction of participants who found a better item when comparing with the complete selection of alternatives in the database. So, using this metric, a low fraction represents high effectiveness.

Effectiveness is the criterion that is most closely related to accuracy measures such as precision and recall [14, 55, 57]. In systems where items are easily consumed, e.g. internet news, these can be translated into recognizing relevant items and discarding irrelevant options respectively. For example, there have been suggestions for an alternative metric of “precision” based on the number of profile concepts matching with user interests, divided by the number of concepts in their profile [14].
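The effectiveness measures discussed in this section are straightforward to operationalize. The sketch below shows three of them under assumed names and data layouts; it is an illustration, not the instrumentation used in [6], [12] or [14].

```python
def effectiveness_gap(rating_after_explanation, rating_after_consumption):
    """Absolute change between the rating given after seeing the explanation and
    the rating given after consuming the item; a smaller gap suggests a more
    effective explanation (cf. the book re-rating study in [6])."""
    return abs(rating_after_consumption - rating_after_explanation)

def switch_fraction(participants_who_switched, all_participants):
    """Marketing-style metric from [12, 25]: fraction of participants who found a
    better item once shown the full catalog. Lower means more effective."""
    return participants_who_switched / all_participants

def profile_precision(profile_concepts, user_interest_concepts):
    """Alternative 'precision' suggested in [14]: matching profile concepts divided
    by the total number of concepts in the profile."""
    profile_concepts = set(profile_concepts)
    if not profile_concepts:
        return 0.0
    return len(profile_concepts & set(user_interest_concepts)) / len(profile_concepts)
```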

4.6 Help Users Make Decisions Faster: Efficiency

Explanations may make it faster for users to decide which recommended item is best for them. Efficiency is another established usability principle, i.e. how quickly a task can be performed [40]. This criterion is one of the most commonly addressed in the recommender systems literature (see Table 2), given that the task of recommender systems is to find needles in haystacks of information.

Efficiency may be improved by allowing the user to understand the relation between competing options. [35, 39, 45] use so-called critiquing, a sub-class of knowledge-based algorithms based on trade-offs between item properties, which lends itself well to the generation of explanations. In the domain of digital cameras, competing options may for example be described as “Less Memory and Lower Resolution and Cheaper” [35]. This way users are quickly able to use this explanation to find a cheaper camera if they are willing to settle for less memory and lower resolution.

Efficiency is often used in the evaluation of so-called conversational recommender systems, where users continually interact with a recommender system, refining their preferences (see also Section 7.1). In these systems, the explanations can be seen to be implicit in the dialog. Efficiency in these systems can be measured by the total amount of interaction time, and the number of interactions needed to find a satisfactory item [57]. Evaluations of explanations based on improvements in efficiency are not limited to conversational systems, however. Pu and Chen, for example, compared completion time for two explanatory interfaces, and measured completion time as the amount of time it took a participant to locate a desired product in the interface [45].

Other metrics for efficiency also include the number of inspected explanations, and the number of activations of repair actions when no satisfactory items are found [20, 48]. Normally, it is not sensible to expose users to all possible recommendations and their explanations, and so users can choose to inspect (or scrutinize) a given recommendation by asking for an explanation. In a more efficient system, the users would need to inspect fewer explanations. Repair actions consist of feedback from the user which changes the type of recommendation they receive, as outlined in the section on scrutability (Section 4.2). Examples of user feedback/repair actions can be found in Section 7.
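All of these efficiency measures can be logged per session and averaged across participants. The sketch below is a minimal illustration; the session fields are our own names chosen to mirror the measures above, not those of any cited system.

```python
from dataclasses import dataclass
from typing import List, Dict

@dataclass
class Session:
    seconds_to_satisfactory_item: float  # completion time
    num_interactions: int                # dialog turns, clicks, critiques, ...
    num_explanations_inspected: int      # explanations the user asked to see
    num_repair_actions: int              # times the user relaxed/changed requirements

def mean_efficiency(sessions: List[Session]) -> Dict[str, float]:
    """Average each efficiency measure over a set of user sessions."""
    n = len(sessions)
    return {
        "completion_time": sum(s.seconds_to_satisfactory_item for s in sessions) / n,
        "interactions": sum(s.num_interactions for s in sessions) / n,
        "explanations_inspected": sum(s.num_explanations_inspected for s in sessions) / n,
        "repair_actions": sum(s.num_repair_actions for s in sessions) / n,
    }
```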

4.7 Make the Use of the System Enjoyable: Satisfaction

Explanations have been found to increase user satisfaction with, or acceptance of, the overall recommender system [20, 26, 52]. The presence of longer descriptions of individual items has been found to be positively correlated with both the perceived usefulness [58] and ease of use [52] of the recommender system. Also, many commercial recommender systems, such as those seen in Table 6, are primarily sources of entertainment. In these cases, any extra facility should take notice of the effect on user satisfaction. Figure 4 gives an example of an explanation evaluated on the criterion of satisfaction.

When measuring satisfaction, one can directly ask users whether the system is enjoyable to use. Tanaka-Ishii and Frank, in their evaluation of a multi-agent system describing a RoboCup soccer game, ask users whether they prefer the system with or without explanations [56]. Satisfaction can also be measured indirectly by measuring user loyalty [37, 20] (see also Section 4.3), and the likelihood of using the system for a search task [15].

In measuring explanation satisfaction, it is important to differentiate between satisfaction with the recommendation process5, and the recommended products (persuasion) [15, 20]. One (qualitative) way to measure satisfaction with the process would be to conduct a user walk-through for a task such as finding a satisfactory item. In such a case, it is possible to identify usability issues and even apply quantitative metrics such as the ratio of positive to negative comments; the number of times the evaluator was frustrated; the number of times the evaluator was delighted; the number of times and where the evaluator worked around a usability problem, etc.

It is also arguable that users would be satisfied with a system that offers effective explanations, confounding the two criteria. However, a system that aids users in making good decisions may have other disadvantages that decrease the overall satisfaction (e.g. requiring a large cognitive effort on the part of the user). Fortunately, these two criteria can be measured by distinct metrics.

Fig. 4 An explanation for an internet provider, describing the provider in terms of user requirements: “This solution has been selected for the following reasons . . . ” [20]

5 Evaluating the Impact of Explanations on the Recommender System

We have now identified seven criteria by which explanations in recommender systems can be evaluated, and given suggestions of how such evaluations can be performed. To some extent, these criteria assume that we are evaluating only the explanation component. It also seems reasonable to evaluate the system as a whole. In that case we might measure the general system usability and accuracy, which will depend on both the recommendation algorithm and the impact of the explanation component. Therefore, in this section, we describe the interaction between the recommender engine and our explanation criteria, organized by the evaluation metrics commonly used in recommender system evaluations: accuracy, learning rate, coverage, novelty/serendipity and acceptance.

5 Here we mean the entire recommendation process, inclusive of the explanations. However, in Section 5 we highlight that evaluations of explanations in recommender systems are seldom fully independent of the underlying recommendation process.


Fig. 5 Confidence display for a recommendation [26] - the movie is strongly recommended (5/5), and there is a large amount of information to support the recommendation (4.5/5).

5.1 Accuracy Metrics

Accuracy metrics regard the ability of the recommendation engine to predict correctly, but accuracy is likely to interact with explanations too. For example, the relationship between transparency and accuracy is not self-evident: Cramer et al. found that transparency led to changes in user behavior that ultimately decreased recommendation accuracy [14].

The system’s own confidence in its recommendations is also related to accuracy and can be reflected in explanations. An example of an explanation aimed to help users understand (lack of) accuracy can be found in confidence displays such as Figure 5. These can be used to explain e.g. poor recommendations in terms of insufficient information used for forming the recommendation. For further work on confidence displays see also [36].

Explanations can also help users understand how they would relate to a particular item, possibly supplying additional information that helps the user make more informed decisions (effectiveness). In the case of poor accuracy, the risk of missing good items or trying bad ones increases, while explanations can help decrease this risk. By helping users to correctly identify items as good or bad, the accuracy of the recommender system as a whole may also increase.
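A confidence display like Figure 5 only needs a mapping from the amount of evidence behind a prediction to a level on a small scale. The sketch below illustrates one such mapping; the thresholds are invented for illustration and are not those used by MovieLens [26].

```python
def support_level(num_neighbor_ratings: int, thresholds=(5, 15, 40, 100)) -> int:
    """Map the number of neighbor ratings supporting a prediction to a 1-5
    'support' level for a confidence display. Thresholds are illustrative only."""
    level = 1
    for t in thresholds:
        if num_neighbor_ratings >= t:
            level += 1
    return level  # 1 = very little supporting data, 5 = ample supporting data
```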

5.2 Learning Rate

The learning rate represents how quickly a recommender system learns a user’s preferences, and how sensitive it is to changes in preferences. Learning rate is likely to affect user satisfaction, as users would like a recommender system to quickly learn their preferences, and be sensitive to short-term as well as long-term interests. Explanations can increase satisfaction by clarifying or hinting that the system considers changes in the user’s preferences. For example, the system can flag that the value for a given variable is getting close to its threshold for incurring a change, but that it has not reached it yet. A system can also go a step further, and allow the user to see just how it is learning and changing preferences (transparency), or make it possible for a user to delete old preferences (scrutability). For example, the explanation facility can request information that would help it learn or change more quickly, such as asking if a user’s favorite movie genre has changed from action to comedy.

5.3 Coverage

Coverage regards the range of items which the recommender system is able to recommend. Explanations can help users understand where they are in the search space. By directing the user to rate informative items in under-explored parts of the search space, explanations may increase the overlap between certain items or features (compared to sparsity). Ultimately, this may increase the overall coverage for potential recommendations. Understanding the remaining search options is related to the criterion of transparency: a recommender system can explain why certain items are not recommended. It may be impossible or difficult to retrieve an item (e.g. for items that have a very particular set of properties in a knowledge-based system, or the item does not have many ratings in a collaborative-filtering system). Alternatively, the recommender system may function under the assumption that the user is not interested in the item (e.g. if their requirements are too narrow in a knowledge-based system, or if they belong to a very small niche in a collaborative-based system). An explanation can explain why an item is not available for recommendation, and even how to remedy this and allow the user to change their preferences (scrutability).

Coverage may also affect evaluations of the explanatory criterion of effectiveness. For example, if a user’s task is not only to find a “good enough” item, but the best item for them, the coverage needs to be sufficient to ensure that “best” items are included in the recommendations. Depending on how much time retrieving these items takes, coverage may also affect efficiency.
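Catalog coverage itself is easy to quantify, and items that fall outside it are natural targets for a “why was this not recommended?” explanation. The sketch below is a minimal illustration under our own naming.

```python
def catalog_coverage(recommendable_items, catalog_items) -> float:
    """Fraction of the catalog the engine can currently recommend. Items outside
    this set (e.g. with too few ratings, or too specific properties) are candidates
    for an explanation of why they are not available for recommendation."""
    catalog = set(catalog_items)
    return len(set(recommendable_items) & catalog) / len(catalog)
```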

5.4 Acceptance

It is possible to confound acceptance, or satisfaction with a system, with other types of satisfaction. If users are satisfied with a system with an explanation component, it remains unclear whether this is due to satisfaction with the explanation component, satisfaction with recommendations, or general design and visual appeal. Satisfaction with the system due to the recommendations is connected to accuracy metrics, or even novelty and diversity, in the sense that sufficiently good recommendations need to be given to a user in order to keep them satisfied. Although explanations may help increase satisfaction, or tolerance toward the system, they cannot function as a substitute for e.g. good accuracy. Indeed, this is true for all the mentioned explanatory criteria. An example of an explanation striving toward the criterion of satisfaction may be: “Please bear with me, I still need to learn more about your preferences before I can make an accurate recommendation.”

6 Presenting Recommendations

We summarize the ways of presenting recommendations that we have seen for the systems summarized in this chapter. While there are a number of deviations in the appearance of the graphical user interface, the actual structure of offering recommendations can be classified into the following categories: top item, top n-items, similar to top item(s), predicted ratings for all items, and structured overview. We describe these in further detail below, and illustrate how explanations may be used in each case.

6.1 Top Item

Perhaps the simplest way to present a recommendation is by offering the user the best item. The way in which this item is selected could then be used as part of the explanation. Let us imagine a user who is interested in sports items, and appreciates football, but not tennis or hockey. The recommender system could then offer a recent football item, for example regarding the final in the World Cup. The generated explanation may then be the following: “You have been watching a lot of sports, and football in particular. This is the most popular and recent item from the World Cup.”

Note that this example uses the user’s viewing history in order to form the explanation. A system could also use information that the user specifies more directly, e.g. how much they like football.

6.2 Top N-Items

The system may also present several items at once. In a large domain such as news, it is likely that a user has many interests. In this case there are several items that could be highly interesting to the user. If the football fan mentioned above is also interested in technology news, the system might present several sports stories alongside a couple of technology items. Thus, the explanation generated by the system might be along the lines of: “You have watched a lot of football and technology items. You might like to see the local football results and the gadget of the day.”

The system may also present several items on the same theme, such as several football results. Note that while this system should be able to explain the relation between chosen items, it should still be able to explain the rationale behind each single item.

6.3 Similar to Top Item(s)

Once a user shows a preference for one or more items, the recommender system can offer similar items. The recommender system could present a single item, a list of preferred items, or even each user’s most recently viewed items. For each item, it can present one or more similar items, and may show explanations similar to the ones in Sections 6.1 and 6.2. For example, given that a user liked a book by Charles Dickens such as Great Expectations, the system may present a recommendation in the following way: “You might also like... Oliver Twist by Charles Dickens”.

A recommender system can also offer recommendations in a social context, taking into account users that are similar to the current user. For example, a recommendation can be presented in the following manner: “People like you liked... Oliver Twist by Charles Dickens”.

6.4 Predicted Ratings for All Items

Rather than forcing selections on the user, a system may allow its users to browse all the available options. Recommendations are then presented as predicted ratings on a scale (say from 0 to 5) for each item. A user may then still find items with low predicted ratings, and can counteract predictions by rating the affected items, or by directly modifying the user model, i.e. changing the system’s view of their preferences. This allows the user to tell the system when it is wrong, fulfilling the criterion of scrutability (see Section 4.2). Let us re-use our example of the football and technology fan. This type of system might on average offer higher predictions for football items than hockey items. A user might then ask why a certain item, for example local hockey results, is predicted to have a low rating. The recommender system might then generate an explanation like: “This is a sports item, but it is about hockey. You do not seem to like hockey!”.

If the user is interested in local hockey results, but not in results from other countries, they might modify their user model to limit their interest in hockey to local sports.


6.5 Structured Overview

Pu and Chen [45] suggest a structure which displays trade-offs between items. The best matching item is displayed at the top. Below it, several categories of trade-off alternatives are listed. Each category has a title explaining the characteristics of the items in it, e.g. “[these laptops]...are cheaper and lighter, but have lower processor speed”. The order of the titles depends on how well the category matches the user’s requirements.

Yee et al. [67] used a multi-faceted approach for museum search and browsing. This approach considers several aspects of each item, such as location, date and material, each with a number of levels. The user can see how many items there are available at each level for each aspect. Using multiple aspects might be a suitable approach for a large domain with many varying objects.

The advantage of a structured overview is that the user can see “where” they are in the search space, and possibly how many items can be found in each category. This greatly facilitates both navigation and user comprehension of the available options.
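A category title of the kind used by Pu and Chen can be generated by comparing each alternative to the best matching item attribute by attribute. The sketch below is a simplified illustration of that idea, not their algorithm; the attribute names and phrasing are invented.

```python
def tradeoff_title(top_item: dict, candidate: dict,
                   lower_is_better=("price", "weight")) -> str:
    """Describe a candidate relative to the top item, in the spirit of the category
    titles in [45], e.g. 'lower price, lower weight, but lower cpu_ghz'."""
    pros, cons = [], []
    for attr, value in candidate.items():
        best = top_item.get(attr)
        if best is None or value == best:
            continue
        improved = (value < best) if attr in lower_is_better else (value > best)
        phrase = f"{'lower' if value < best else 'higher'} {attr}"
        (pros if improved else cons).append(phrase)
    if pros and cons:
        return ", ".join(pros) + ", but " + ", ".join(cons)
    return ", ".join(pros or cons)

# Example (hypothetical laptops):
# tradeoff_title({"price": 900, "weight": 2.4, "cpu_ghz": 2.6},
#                {"price": 700, "weight": 1.9, "cpu_ghz": 2.0})
# -> "lower price, lower weight, but lower cpu_ghz"
```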

Fig. 6 Organizational Structure [45]

7 Interacting with the Recommender System

This section is dedicated to different ways in which a user can interact with a recommender system to influence the recommendations that they are given. This interaction is what distinguishes conversational systems from “single-shot” recommendations. They allow users to elaborate their requirements over the course of an extended dialog [47] rather than each user interaction being treated independently of previous history.

While interaction can occur in a recommender system regardless of the explanatory criterion, it may be argued that this type of interaction is particularly relevant to the explanation criteria of transparency, effectiveness, and scrutability. Justifying recommendations, or making them transparent to the user, is only half of the solution; the second half is making the system scrutable (see Section 4.2) by allowing the user to make changes. By allowing users to change their preferences, the explanation facility may also increase effectiveness when this leads to more relevant recommendations.

There are different ways in which a user can give feedback to the system to let it know how it is doing. Here we expand on the four ways suggested by [23], supplying examples of current applications6. Note that although there are more unobtrusive ways to elicit user preferences, e.g. via usage data [42] or demographics [1], this section focuses on explicit feedback from users.

7.1 The User Specifies their Requirements

By allowing a user to directly specify their requirements it is possible to circumvent the type of faulty assumptions that can be made by a system where the interests of a user are based on the items they decide to see, or how they interact with them. An indirect approach depends on the way it collects data, and is sometimes hard to change in the intended direction. Remember Mr. Iwanyk? This user tried to tame TiVo’s gay fixation by recording war movies and other “guy stuff”. As a result it started to give him documentaries on Joseph Goebbels and Adolf Eichmann. It stopped thinking he was gay and decided he was a crazy guy reminiscing about the Third Reich.

One movie recommender system allows users to specify their preferences in plain English [63], p. 77:

“User: I feel like watching a thriller.
System: Can you tell me one of your favorite thriller movies.
User: Uhm, I’m not sure.
System: Okay. Can you tell me one of your favorite actors or actresses?
User: I think Bruce Willis is good.
System: I see. Have you seen Pulp Fiction?
User: No.
System: Pulp Fiction is a thriller starring Bruce Willis.”

This dialog does not make use of the user’s previous interests, nor does it explain directly. That is, there is no sentence that claims to be a justification of the recommendation. It does, however, do so indirectly, by reiterating (and satisfying) the user’s requirements. The user should then be able to interact with the recommender system, and give their opinion of the recommendation, thus allowing further refinement.

6 A fifth section on mixed interaction interfaces is appended to the end of this original list.



7.2 The User Asks for an Alteration

A more direct approach is to allow users to explicitly ask for alterations to recommended items, for instance using a structured overview (see also Section 6.5). This approach helps users find what they want more quickly. Users can see how items compare, and see what other items are still available if the current recommendation should not meet their requirements. Have you ever put in a search for a flight, and been told to try other dates, other airports or destinations? This answer does not explain which of your criteria needs to be changed, requiring you to go through a tiring trial-and-error process. If you can see the trade-offs between alternatives from the start, the initial problem can be circumvented.

Some feedback facilities allow users to see how requirement constraints or critiques affect their remaining options. One such system explains the difference between a selected camera and the remaining cameras. For example, it describes competing cameras with “Less Memory and Lower Resolution and Cheaper” [35].

Instead of simply explaining to a user that no items fitting the description exist, these systems show what types of items do exist. These methods have the advantage of helping users find good enough items, even if some of their initial requirements were too strict.

7.3 The User Rates Items

To change the type of recommendations they receive, the user may want to correct predicted ratings, or modify a rating they made in the past. Ratings may be explicitly entered by the user, or inferred from usage patterns. In a book recommender system, a user could see the influence (as a percentage) their previous ratings had on a given recommendation [6]. The influence-based explanation showed which rated titles influenced the recommended book the most (see Table 5). Although this particular system did not allow the user to modify previous ratings, or the degree of influence, in the explanation interface, it can be imagined that users could directly change their rating here. Note, however, that it would be much harder to modify the degree of influence, as it is computed: any modification is likely to interfere with the regular functioning of the recommendation algorithm.
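One plausible way to obtain an influence score like the one in Table 5 is leave-one-out: re-predict the recommendation score without a given rating and measure the drop. The sketch below illustrates this; it is our own assumption for illustration and not necessarily how LIBRA [6] computes influence.

```python
def rating_influences(predict, user_ratings, target_item):
    """Estimate how much each previously rated item contributed to the score of
    target_item. predict(ratings, item) -> score is any prediction function;
    user_ratings maps item_id -> rating."""
    full_score = predict(user_ratings, target_item)
    influences = {}
    for item in user_ratings:
        reduced = {i: r for i, r in user_ratings.items() if i != item}
        influences[item] = full_score - predict(reduced, target_item)
    return influences  # larger positive values = more influential ratings
```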


Table 5 Influence of previous book ratings on the current book recommendation [6]

Book                               Your Rating (out of 5)  Influence (out of 100)
Of Mice and Men                    4                       54
1984                               4                       50
Till We Have Faces: A Myth Retold  5                       50
Crime and Punishment               4                       46
The Gambler                        5                       11

7.4 The User Gives Their Opinion

A common usability principle is that it is easier for humans to recognize items than to recall them from memory. Therefore, it is sometimes easier for a user to say what they want or do not want when they have options in front of them. The options mentioned can be simplified to be mutually exclusive, e.g. either a user likes an item or they do not. It is equally possible to create an explanation facility using a sliding scale.

In previous work in recommender systems, a user could for example specify whether they think an item is interesting or not, whether they would like to see more similar items, or whether they have already seen the item previously [7, 54].

7.5 Mixed Interaction Interfaces

[12] evaluated an interface where users could both specify their requirements from scratch and make alterations to existing requirements generated by the system, and found that the mixed interface increases efficiency, satisfaction and effectiveness of decisions compared to only making alterations to existing sets of requirements.

[37] evaluated a hybrid interface for rating items. In this study users were able to rate items suggested by the system, as well as search for items to rate themselves. The mixed-initiative model did not outperform either rating model in terms of the accuracy of resulting recommendations. On the other hand, allowing users to search for items to rate increased loyalty (measured in terms of returns to the system, and number of items rated).


Table 6 Examples of explanations in commercial and academic systems, ordered by explanation style (case-based, collaborative, content-based, conversational, demographic, and knowledge/utility-based).

System | Example explanation | Explanation style

iSuggest-Usability [27] | See e.g. Figure 9 | Case-based
LoveFilm.com | “Because you have selected or highly rated: Movie A” | Case-based
LibraryThing.com | “Recommended By User X for Book A” | Case-based
Netflix.com | A list of similar movies the user has rated highly in the past | Case-based
Amazon.com | “Customers Who Bought This Item Also Bought . . . ” | Collaborative
LIBRA [6] | Keyword style (Tables 3 and 4); Neighbor style (Figure 3); Influence style (Table 5) | Collaborative
MovieLens [26] | Histogram of neighbors (Figure 2) and Confidence display (Figure 5) | Collaborative
Amazon.com | “Recommended because you said you owned Book A” | Content-based
CHIP [15] | “Why is ‘The Tailor’s Workshop’ recommended to you? Because it has the following themes in common with artworks that you like: * Everyday Life * Clothes . . . ” | Content-based
Moviexplain [55] | See Table 7 | Content-based
MovieLens: “Tagsplanations” [61] | Tags ordered by relevance or preference (see Figure 8) | Content-based
News Dude [7] | “This story received a [high/low] relevance score, because it contains the words f1, f2, and f3.” | Content-based
OkCupid.com | Graphs comparing two users according to dimensions such as “more introverted”; comparison of how users have answered different questions | Content-based
Pandora.com | “Based on what you’ve told us so far, we’re playing this track because it features a leisurely tempo . . . ” | Content-based
Adaptive Place Advisor [57] | Dialog, e.g. “Where would you like to eat?” “Oh, maybe a cheap Indian place.” | Conversational
ACORN [64] | Dialog, e.g. “What kind of movie do you feel like?” “I feel like watching a thriller.” | Conversational
INTRIGUE [1] | “For children it is much eye-catching, it requires low background knowledge, it requires a few seriousness and the visit is quite short. For yourself it is much eye-catching and it has high historical value. For impaired it is much eye-catching and it has high historical value.” | Demographic
Qwikshop [35] | “Less Memory and Lower Resolution and Cheaper” | Knowledge/utility-based
SASY [16] | “. . . because your profile has: *You are single; *You have a high budget” (Figure 1) | Knowledge/utility-based
Top Case [39] | “Case 574 differs from your query only in price and is the best case no matter what transport, duration, or accommodation you prefer” | Knowledge/utility-based
(Internet Provider) [20] | “This solution has been selected for the following reasons: *Webspace is available for this type of connection . . . ” (Figure 4) | Knowledge/utility-based
“Organizational Structure” [45] | Structured overview: “We also recommend the following products because: *they are cheaper and lighter, but have lower processor speed.” (Figure 6) | Knowledge/utility-based
myCameraAdvisor [62] | e.g. “. . . cameras capable of taking pictures from very far away will be more expensive . . . ” | Knowledge/utility-based

8 Explanation Styles

Table 6 summarizes the most commonly used explanation styles (case-based, content-based, collaborative-based, demographic-based, and knowledge and utility-based) with examples of each. In this section we describe each style: their corresponding inputs, processes and generated explanations. For commercial systems where this information is not public, we offer educated guesses. While conversational systems are included in the table, we refer to Section 7.1, which treats explanations in conversational systems as more of an interaction style than a particular algorithm.

The explanation style of a given explanation may, or may not, reflect the underlying algorithm by which the recommendations are computed. That is to say, explanations may follow the "style" of a particular algorithm irrespective of whether or not this is how the recommendations have actually been retrieved or computed. For example, it is possible to give a content-based explanation for a recommendation engine using collaborative filtering. Such an explanation would not be consistent with the criterion of transparency, but may support other explanatory criteria. In the following sections we will give further examples of how explanation styles can be inspired by these common algorithms.

For describing the interface between the recommender system and explanation component we use the notation used in [9]: U is the set of users whose preferences are known, and u ∈ U is the user for whom recommendations need to be generated. I is the set of items that can be recommended, and i ∈ I is an item for which we would like to predict u's preferences.
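
To make this notation concrete, the illustrative sketches in the remainder of this section assume a simple in-memory representation along the following lines (the variable names are our own and purely hypothetical, not part of the notation in [9]):

# Known preferences: a nested dict mapping each user to their item ratings.
ratings = {
    "u1": {"i1": 4, "i2": 5},
    "u2": {"i1": 5, "i3": 2},
}
U = set(ratings)                  # users whose preferences are known
I = {"i1", "i2", "i3", "i4"}      # items that can be recommended
u, i = "u1", "i4"                 # predict user u's preference for item i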

8.1 Collaborative-Based Style Explanations

For collaborative-based style explanations the assumed input to the recommender engine is user u's ratings of items in I. These ratings are used to identify users that are similar in their ratings to u. These similar users are often called "neighbors", as nearest-neighbor approaches are commonly used to compute similarity. Then, a prediction for the recommended item is extrapolated from the neighbors' ratings of i.
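
As a rough illustration of this pipeline (not the algorithm of any particular system cited here), a minimal user-based nearest-neighbor prediction over the ratings structure sketched above might look as follows:

from math import sqrt

def pearson(r_a, r_b):
    # Pearson correlation between two users' rating dicts, over co-rated items only.
    common = set(r_a) & set(r_b)
    if len(common) < 2:
        return 0.0
    mean_a = sum(r_a[i] for i in common) / len(common)
    mean_b = sum(r_b[i] for i in common) / len(common)
    num = sum((r_a[i] - mean_a) * (r_b[i] - mean_b) for i in common)
    den = (sqrt(sum((r_a[i] - mean_a) ** 2 for i in common))
           * sqrt(sum((r_b[i] - mean_b) ** 2 for i in common)))
    return num / den if den else 0.0

def predict(ratings, u, i, k=3):
    # Take the k users most similar to u who rated i (the "neighbors"), and
    # extrapolate a prediction as their similarity-weighted average rating of i.
    neighbors = sorted(
        ((pearson(ratings[u], ratings[v]), v)
         for v in ratings if v != u and i in ratings[v]),
        reverse=True)[:k]
    num = sum(sim * ratings[v][i] for sim, v in neighbors if sim > 0)
    den = sum(sim for sim, v in neighbors if sim > 0)
    return num / den if den else None   # None if no positively correlated neighbor rated i

The neighbors selected here are also exactly the users that a neighbor-style explanation could surface to the user.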

Commercially, the most well-known usage of collaborative-style explanations is the one used by Amazon.com: "Customers Who Bought This Item Also Bought . . . ". This explanation assumes that the user is viewing an item which they are already interested in. The system finds similar users (who bought this item), and retrieves and recommends items that these similar users bought. The recommendations are presented in the format of similar to top item. In addition, this explanation assumes an interaction model whereby ratings are implicitly inferred through purchase behavior.
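
One plausible, deliberately simplified reading of this kind of "also bought" explanation is a co-purchase count; the sketch below is only illustrative and makes no claim about Amazon's actual implementation:

from collections import Counter

def also_bought(purchases, current_item, top_n=5):
    # Users who bought the current item ...
    buyers = [user for user, items in purchases.items() if current_item in items]
    # ... and the other items those users bought, ranked by co-purchase frequency.
    counts = Counter(item for user in buyers
                     for item in purchases[user] if item != current_item)
    return counts.most_common(top_n)

# Hypothetical purchase data: each user maps to the set of items they bought.
purchases = {"u1": {"A", "B"}, "u2": {"A", "C"}, "u3": {"B", "C"}}
print(also_bought(purchases, "A"))   # e.g. [('B', 1), ('C', 1)]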

Herlocker et al. suggested 21 explanation interfaces using text as well as graphics [26]. These interfaces varied with regard to content and style, but a number of these explanations directly referred to the concept of neighbors. Figure 4.4, for example, shows how neighbors rated a given (recommended) movie: a bar chart with "good", "ok" and "bad" ratings clustered into distinct columns. Again, we see that this explanation is given for a specific way of recommending items, and a particular interaction model: this is a single recommendation (either top item or one item out of a top-N list), and it assumes that the users are supplying rating information for items.
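
The data behind such a histogram-style explanation can be assembled directly from the neighbors found above; the rating thresholds below are our own illustrative choice, not those of [26]:

def neighbor_rating_histogram(ratings, neighbors, item):
    # Cluster the neighbors' ratings of `item` into the three columns
    # of a histogram-style explanation ("good", "ok", "bad").
    bins = {"good (4-5)": 0, "ok (3)": 0, "bad (1-2)": 0}
    for v in neighbors:
        r = ratings.get(v, {}).get(item)
        if r is None:
            continue
        if r >= 4:
            bins["good (4-5)"] += 1
        elif r == 3:
            bins["ok (3)"] += 1
        else:
            bins["bad (1-2)"] += 1
    return bins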

8.2 Content-Based Style Explanations

For content-based style explanations the assumed input to the recommender engine is user u's ratings for (a subset of) the items in I. These ratings are then used to generate a classifier that fits u's rating behavior, which is then applied to i. A prediction for the recommended item is based on how well it fits this classifier, e.g. whether it is similar to other highly rated items.
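
A minimal sketch of this idea, assuming each item is described by a set of features (keywords, actors, tags, and so on; the data structures and threshold are hypothetical):

from collections import Counter

def build_profile(ratings_u, item_features, like_threshold=4):
    # Aggregate the features of the items u rated highly into a weighted profile.
    profile = Counter()
    for item, rating in ratings_u.items():
        if rating >= like_threshold:
            profile.update(item_features.get(item, []))
    return profile

def score(profile, item_features, item):
    # Score a candidate by how strongly its features appear in the profile;
    # the overlapping features double as a content-based explanation.
    overlap = [f for f in item_features.get(item, []) if profile[f] > 0]
    return sum(profile[f] for f in overlap), overlap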

If we simplify this further, we could say that content-based algorithms consider the similarity between items, based on user ratings but taking item properties into account. In the same spirit, content-based style explanations are based on the items' properties. For example, [55] justifies a movie recommendation according to what they infer is the user's favorite actor (see Table 7). While the underlying approach is in fact a hybrid of collaborative and content-based approaches, the explanation style suggests that they compute the similarity between movies according to the presence of features in highly rated movies. They elected to present users with several recommendations and explanations (top-N), which may be more suitable if the user would like to make a selection between movies depending on the information given in the explanations (e.g. feeling more like watching a movie with Harrison Ford over one starring Bruce Willis). The interaction model is based on ratings of items.

A more domain-independent approach is suggested by [61], who use the relationship between tags and items (tag relevance) and the relationship between tags and users (tag preference) to make recommendations (see Figure 8). Tag preference can be seen as a form of content-based explanation, as it is based on a user's ratings of movies with that tag. Here, showing recommendations as a single top item allows the user to view many of the tags that are related to the item. The interaction model is again based on numerical ratings.
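
The two tag quantities can be approximated very roughly as below; [61] define both quantities more carefully, so this is only a hedged sketch with hypothetical data structures:

def tag_preference(ratings_u, item_tags, tag):
    # A user's inferred preference for a tag: here, their mean rating
    # of the items that carry that tag.
    rated = [r for item, r in ratings_u.items() if tag in item_tags.get(item, set())]
    return sum(rated) / len(rated) if rated else None

def tag_relevance(tag_applications, item, tag):
    # How strongly the community associates the tag with the item: here, the
    # share of the item's tag applications that use this particular tag.
    apps = tag_applications.get(item, {})
    total = sum(apps.values())
    return apps.get(tag, 0) / total if total else 0.0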

The commercial system Pandora explains its recommendations of songs according to musical properties such as tempo and tonality. These features are inferred from users' ratings of songs. Figure 7 shows an example of this.7 Here, the user is offered one song at a time (top item) and gives their opinion as "thumbs-up" or "thumbs-down", which can also be considered a numerical rating.

7 http://www.pandora.com - retrieved Nov. 2006

Fig. 7 Pandora explanation: "Based on what you've told us so far, we're playing this track because it features a leisurely tempo . . ."

Table 7 Example of an explanation in Moviexplain, using features such as actors, which occur in movies previously rated highly by this user, to justify a recommendation [55]

Recommended movie title                     The reason is the participant   who appears in
Indiana Jones and the Last Crusade (1989)   Ford, Harrison                  5 movies you have rated
Die Hard 2 (1990)                           Willis, Bruce                   2 movies you have rated

Fig. 8 Tagsplanation with both tag preference and relevance, but sorted by tag relevance

8.2.1 Case-Based Style Explanations

A content-based explanation can also omit mention of significant properties and focus primarily on the items used to make the recommendation. The items used are thus considered cases for comparison, resulting in case-based style explanations.

In this chapter we have already seen a type of case-based style explanation, the "influence-based style explanation" of [6] in Figure 5. Here, the influence of an item on the recommendation is computed by looking at the difference in the score of the recommendation with and without that item. In this case, recommendations were presented as top item, assuming a rating-based interaction. [27] computed the similarity between recommended items8, and used these similar items as justification for a top item recommendation in the "learn by example" explanations (see Figure 9).
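
Computed this way, the influence scores are a straightforward leave-one-out difference; a sketch of that calculation, with `predict(ratings_u, item)` standing in for any numeric scoring function of the user's ratings (the function name is our own), might be:

def influence(ratings_u, item, predict):
    # Influence of each previously rated item on the score of `item`: the change
    # in the prediction when that single rating is left out.
    base = predict(ratings_u, item)
    scores = {}
    for rated in ratings_u:
        reduced = {k: v for k, v in ratings_u.items() if k != rated}
        scores[rated] = abs(base - predict(reduced, item))
    # Most influential rated items first, as in the influence-style explanation.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)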

8.3 Knowledge and Utility-Based Style Explanations

For knowledge and utility-based style explanations the assumed input to the recommender engine is a description of user u's needs or interests. The recommender engine then infers a match between the item i and u's needs. One knowledge-based recommender system takes into consideration how camera properties such as memory, resolution and price reflect the available options as well as a user's preferences [35]. Their system may explain a camera recommendation in the following manner: "Less Memory and Lower Resolution and Cheaper". Here recommendations are presented as a form of structured overview describing the competing options, and the interaction model assumes that users ask for alterations in the recommended items.
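
Explanations of this compound-critique flavor can be generated by comparing a candidate's attributes against the item currently under consideration; the attribute names and wording below are our own illustration, not the implementation of [35]:

def compound_critique(reference, candidate):
    # Describe how a candidate differs from the reference item, attribute by attribute.
    phrasing = {            # attribute -> (label if lower, label if higher)
        "memory":     ("Less Memory", "More Memory"),
        "resolution": ("Lower Resolution", "Higher Resolution"),
        "price":      ("Cheaper", "More Expensive"),
    }
    parts = []
    for attr, (lower, higher) in phrasing.items():
        if candidate[attr] < reference[attr]:
            parts.append(lower)
        elif candidate[attr] > reference[attr]:
            parts.append(higher)
    return " and ".join(parts)

print(compound_critique({"memory": 32, "resolution": 12, "price": 300},
                        {"memory": 16, "resolution": 10, "price": 250}))
# -> "Less Memory and Lower Resolution and Cheaper"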

Similarly, in the system described in [39] users gradually specify (and modify) their preferences until a top recommendation is reached. This system can generate explanations such as the following for a recommended holiday titled "Case 574": "Top Case: Case 574 differs from your query only in price and is the best case no matter what transport, duration, or accommodation you prefer".

It is arguable that there is a certain degree of overlap between knowledge-based and content-based explanations. This is particularly the case for case-based style explanations (Section 8.2.1), which can be derived from either type of algorithm depending on the details of the implementation.

8 The author does not specify which similarity metric was used, though it is likely to be a form of rating-based similarity measure such as cosine similarity.

Fig. 9 Learn by example, or case-based reasoning [27]

8.4 Demographic Style Explanations

For demographic-based style explanations, the assumed input to the recommender engine is demographic information about user u. From this, the recommendation algorithm identifies users that are demographically similar to u. A prediction for the recommended item i is extrapolated from how the similar users rated this item, and how similar they are to u.
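
A bare-bones version of such a prediction, under the assumption that demographic attributes are categorical and weighted equally (again purely illustrative):

def demographic_similarity(d_a, d_b):
    # Share of demographic attributes (e.g. age band, country) two users have in common.
    keys = set(d_a) & set(d_b)
    return sum(d_a[k] == d_b[k] for k in keys) / len(keys) if keys else 0.0

def demographic_predict(demographics, ratings, u, item):
    # Weight other users' ratings of `item` by their demographic similarity to u.
    weighted = [(demographic_similarity(demographics[u], demographics[v]), r[item])
                for v, r in ratings.items() if v != u and item in r]
    den = sum(w for w, _ in weighted)
    return sum(w * r for w, r in weighted) / den if den else None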

Surveying a number of systems which use a demographic-based filter, e.g. [1, 31, 44], we could only find one which offers an explanation facility: "For children it is much eye-catching, it requires low background knowledge, it requires a few seriousness and the visit is quite short. For yourself it is much eye-catching and it has high historical value. For impaired it is much eye-catching and it has high historical value." [1]. In this system recommendations were offered as a structured overview, categorizing places to visit according to their suitability to different types of travelers (e.g. children, impaired). Users can then add these items to their itinerary, but there is no interaction model that modifies subsequent recommendations.

To our knowledge, there are no other systems that make use of demographic style explanations. It is possible that this is due to the sensitivity of demographic information; anecdotally, we can imagine that many users would not want to be recommended an item based on their gender, age or ethnicity (e.g. "We recommend you the movie Sex and the City because you are a female aged 20-40.").

9 Summary and Future Directions

In this chapter, we offer guidelines for the designers of explanations in recommender systems. Firstly, the designer should consider what benefit the explanations offer, and thus which criteria they are evaluating the explanations for (e.g. transparency, scrutability, trust, efficiency, effectiveness, persuasion or satisfaction). The developer may select several criteria, which may be related to each other, but may also be conflicting. In the latter case, it is particularly important to distinguish between these evaluation criteria. It is only in more recent work that these trade-offs are being shown and becoming more apparent [15, 60].

In addition, the system designer should consider the metrics they are going to use when evaluating the explanations, and the dependencies the explanations may have on different parts of the system, such as the way recommendations are presented (e.g. top item, top N-items, similar to top item(s), predicted ratings for all items, structured overview), the way users interact with the explanations (e.g. the user specifies their requirements, asks for an alteration, rates items, gives their opinion, or uses a hybrid interaction interface), and the underlying recommender engine.

To offer a single example of the relation between explanations and other recommender system factors, we can imagine a recommender engine with low recommendation accuracy. This may affect all measurements of effectiveness in the system, as users do not really like the items they end up being recommended. These measurements do not, however, reflect the effectiveness of the explanations themselves. In this case, a layered approach to evaluation [43], where explanations are considered in isolation from the recommendation algorithm as seen in [60], may be warranted. Similarly, thought should be given to how the method of presenting recommendations, and the method of interaction, may affect the (evaluation of) explanations.

We offered examples of the most common explanation styles from existing systems, and explained how these can be related to the underlying algorithm (e.g. content-based, collaborative, demographic, or knowledge/utility-based). To a certain extent these types of explanations can be reused (likely at the cost of transparency) for hybrid recommendations and other complex recommendation methods such as latent semantic analysis, but these areas of research remain largely open. Preliminary work for some of these areas can be found in e.g. [19] (explaining Markov decision processes) and [29] (explaining latent semantic analysis models).

Also, future work will likely involve more advanced interfaces for explanations. For example, the "treemap" structure (see Figure 10)9 offers an overview of the search space [4]. This type of overview may also be used for explanation. Assume, for example, that a user is being recommended the piece "The Votes Obama Truly Needs", so that this rectangle is highlighted. This interface "explains" that this item is being recommended because the user is interested in current US news (orange color), it is popular (big square), and it is recent (bright color).

9 http://www.marumushi.com/apps/newsmap/index.cfm, retrieved Jan. 2009

Fig. 10 Newsmap - a treemap visualization of news. Different colors represent topic areas, square and font size represent importance to the current user, and shades of each topic color represent recency.

Last, but certainly not least, researchers are starting to find that explanations are part of a cyclical process. The explanations affect a user's mental model of the recommender system, and in turn the way they interact with the explanations. In fact, this may also impact the recommendation accuracy negatively [2, 15]. For example, [2] saw that recommendation accuracy decreased as users removed keywords from their profile for a news recommender system. Understanding this cycle will likely be one of the future strands of research.

References

1. Ardissono, L., Goy, A., Petrone, G., Segnan, M., Torasso, P.: Intrigue: Personalized recommendation of tourist attractions for desktop and handheld devices. Applied Artificial Intelligence 17, 687–714 (2003)
2. Ahn, J.W., Brusilovsky, P., Grady, J., He, D., Syn, S.Y.: Open user profiles for adaptive news systems: help or harm? In: WWW '07: Proceedings of the 16th International Conference on World Wide Web, pp. 11–20. ACM Press, New York, NY, USA (2007). DOI 10.1145/1242572.1242575
3. Andersen, S.K., Olesen, K.G., Jensen, F.V.: HUGIN - a shell for building Bayesian belief universes for expert systems. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (1990)
4. Bederson, B., Shneiderman, B., Wattenberg, M.: Ordered and quantum treemaps: Making effective use of 2D space to display hierarchies. ACM Transactions on Graphics 21(4), 833–854 (2002)

5. Bennett, S.W., Scott, A.C.: Rule-Based Expert Systems: The MYCIN Experiments of the Stanford Heuristic Programming Project, chap. 19 - Specialized Explanations for Dosage Selection, pp. 363–370. Addison-Wesley Publishing Company (1985)
6. Bilgic, M., Mooney, R.J.: Explaining recommendations: Satisfaction vs. promotion. In: Proceedings of the Workshop Beyond Personalization, in conjunction with the International Conference on Intelligent User Interfaces, pp. 13–18 (2005)
7. Billsus, D., Pazzani, M.J.: A personal news agent that talks, learns, and explains. In: Proceedings of the Third International Conference on Autonomous Agents, pp. 268–275 (1999)
8. Breese, J.S., Heckerman, D., Kadie, C.: Empirical analysis of predictive algorithms for collaborative filtering. In: Conference on Uncertainty in Artificial Intelligence, pp. 43–52 (1998). URL citeseer.ist.psu.edu/breese98empirical.html
9. Burke, R.: Hybrid recommender systems: Survey and experiments. User Modeling and User-Adapted Interaction 12(4), 331–370 (2002)
10. Carenini, G., Mittal, V., Moore, J.: Generating patient-specific interactive natural language explanations. Proc. Annu. Symp. Comput. Appl. Med. Care, pp. 5–9 (1994)
11. Chen, L., Pu, P.: Trust building in recommender agents. In: WPRSIUI in conjunction with Intelligent User Interfaces, pp. 93–100 (2002)
12. Chen, L., Pu, P.: Hybrid critiquing-based recommender systems. In: Intelligent User Interfaces, pp. 22–31 (2007)
13. Cosley, D., Lam, S.K., Albert, I., Konstan, J.A., Riedl, J.: Is seeing believing?: How recommender system interfaces affect users' opinions. In: CHI, Recommender Systems and Social Computing, vol. 1, pp. 585–592 (2003). URL http://doi.acm.org/10.1145/642611.642713
14. Cramer, H., Evers, V., van Someren, M., Ramlal, S., Rutledge, L., Stash, N., Aroyo, L., Wielinga, B.: The effects of transparency on perceived and actual competence of a content-based recommender. In: Semantic Web User Interaction Workshop, CHI (2008)
15. Cramer, H.S.M., Evers, V., Ramlal, S., van Someren, M., Rutledge, L., Stash, N., Aroyo, L., Wielinga, B.J.: The effects of transparency on trust in and acceptance of a content-based art recommender. User Modeling and User-Adapted Interaction 18(5), 455–496 (2008). URL http://dx.doi.org/10.1007/s11257-008-9051-3
16. Czarkowski, M.: A scrutable adaptive hypertext. Ph.D. thesis, University of Sydney (2006)
17. Deerwester, S., Dumais, S., Furnas, G., Landauer, T., Harshman, R.: Indexing by latent semantic analysis. Journal of the Society for Information Science 6, 391–407 (1990)
18. Doyle, D., Tsymbal, A., Cunningham, P.: A review of explanation and explanation in case-based reasoning. Tech. rep., Department of Computer Science, Trinity College, Dublin (2003)
19. Elizalde, F., Sucar, E.: Expert evaluation of probabilistic explanations. In: Workshop on Explanation-Aware Computing associated with IJCAI (2009)
20. Felfernig, A., Gula, B.: Consumer behavior in the interaction with knowledge-based recommender applications. In: ECAI 2006 Workshop on Recommender Systems, pp. 37–41 (2006)
21. Fogg, B., Marshall, J., Kameda, T., Solomon, J., Rangnekar, A., Boyd, J., Brown, B.: Web credibility research: A method for online experiments and early study results. In: CHI 2001, pp. 295–296 (2001)
22. Fogg, B.J., Soohoo, C., Danielson, D.R., Marable, L., Stanford, J., Tauber, E.R.: How do users evaluate the credibility of web sites? A study with over 2,500 participants. In: Proceedings of DUX'03: Designing for User Experiences, no. 15 in Focusing on User-to-Product Relationships, pp. 1–15 (2003). URL http://doi.acm.org/10.1145/997078.997097

23. Ginty, L.M., Smyth, B.: Comparison-based recommendation. Lecture Notes in Computer Science 2416, 731–737 (2002). URL http://link.springer-ny.com/link/service/series/0558/papers/2416/24160575.pdf
24. Hance, E., Buchanan, B.: Rule-based expert systems: The MYCIN experiments of the Stanford Heuristic Programming Project. Addison-Wesley (1984)
25. Haubl, G., Trifts, V.: Consumer decision making in online shopping environments: The effects of interactive decision aids. Marketing Science 19, 4–21 (2000)
26. Herlocker, J.L., Konstan, J.A., Riedl, J.: Explaining collaborative filtering recommendations. In: ACM Conference on Computer Supported Cooperative Work, pp. 241–250 (2000)
27. Hingston, M.: User friendly recommender systems. Master's thesis, Sydney University (2006)
28. Hofmann, T.: Collaborative filtering via Gaussian probabilistic latent semantic analysis. In: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (2003)
29. Hu, Y., Koren, Y., Volinsky, C.: Collaborative filtering for implicit feedback datasets. In: ICDM (2008)
30. Hunt, J.E., Price, C.J.: Explaining qualitative diagnosis. Engineering Applications of Artificial Intelligence 1(3), 161–169 (1988)
31. Krulwich, B.: The InfoFinder agent: Learning user interests through heuristic phrase extraction. IEEE Intelligent Systems 12, 22–27 (1997)
32. Lacave, C., Diez, F.J.: A review of explanation methods for Bayesian networks. The Knowledge Engineering Review 17(2), 107–127 (2002)
33. Lacave, C., Diez, F.J.: A review of explanation methods for heuristic expert systems. The Knowledge Engineering Review 17(2), 107–127 (2004)
34. Lopez-Suarez, A., Kamel, M.: Dykor: A method for generating the content of explanations in knowledge systems. Knowledge-Based Systems 7(3), 177–188 (1994)
35. McCarthy, K., Reilly, J., McGinty, L., Smyth, B.: Thinking positively - explanatory feedback for conversational recommender systems. In: Proceedings of the European Conference on Case-Based Reasoning (ECCBR-04) Explanation Workshop, pp. 115–124 (2004)
36. McNee, S., Lam, S.K., Guetzlaff, C., Konstan, J.A., Riedl, J.: Confidence displays and training in recommender systems. In: INTERACT IFIP TC13 International Conference on Human-Computer Interaction, pp. 176–183 (2003)
37. McNee, S.M., Lam, S.K., Konstan, J.A., Riedl, J.: Interfaces for eliciting new user preferences in recommender systems. In: User Modeling, pp. 178–187 (2003)
38. McNee, S.M., Riedl, J., Konstan, J.A.: Being accurate is not enough: How accuracy metrics have hurt recommender systems. In: Extended Abstracts of the 2006 ACM Conference on Human Factors in Computing Systems (CHI 2006) (2006)
39. McSherry, D.: Explanation in recommender systems. Artificial Intelligence Review 24(2), 179–197 (2005)
40. Nielsen, J., Molich, R.: Heuristic evaluation of user interfaces. In: ACM CHI'90, pp. 249–256 (1990)
41. Ohanian, R.: Construction and validation of a scale to measure celebrity endorsers' perceived expertise, trustworthiness, and attractiveness. Journal of Advertising 19(3), 39–52 (1990)
42. O'Sullivan, D., Smyth, B., Wilson, D.C., McDonald, K., Smeaton, A.: Improving the quality of the personalized electronic program guide. User Modeling and User-Adapted Interaction 14, 5–36 (2004)
43. Paramythis, A., Totter, A., Stephanidis, C.: A modular approach to the evaluation of adaptive user interfaces. In: S. Weibelzahl, D.N. Chin, G. Weber (eds.) Evaluation of Adaptive Systems, in conjunction with UM'01, pp. 9–24 (2001)
44. Pazzani, M.J.: A framework for collaborative, content-based and demographic filtering. Artificial Intelligence Review 13, 393–408 (1999)

45. Pu, P., Chen, L.: Trust building with explanation interfaces. In: IUI'06, Recommendations I, pp. 93–100 (2006). URL http://doi.acm.org/10.1145/1111449.1111475
46. Pu, P., Chen, L.: Trust-inspiring explanation interfaces for recommender systems. Knowledge-Based Systems 20, 542–556 (2007)
47. Rafter, R., Smyth, B.: Conversational collaborative recommendation - an experimental analysis. Artificial Intelligence Review 24(3-4), 301–318 (2005). URL http://dx.doi.org/10.1007/s10462-005-9004-8
48. Reilly, J., McCarthy, K., McGinty, L., Smyth, B.: Dynamic critiquing. In: P. Funk, P.A. Gonzalez-Calero (eds.) ECCBR, Lecture Notes in Computer Science, vol. 3155, pp. 763–777. Springer (2004)
49. Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., Riedl, J.: GroupLens: An open architecture for collaborative filtering of netnews. In: Proceedings of the ACM 1994 Conference on Computer Supported Cooperative Work (1994)
50. Roth-Berghofer, T., Schulz, S., Leake, D.B., Bahls, D.: Workshop on explanation-aware computing. In: ECAI (2008)
51. Roth-Berghofer, T., Tintarev, N., Leake, D.B.: Workshop on explanation-aware computing. In: IJCAI (2009)
52. Sinha, R., Swearingen, K.: The role of transparency in recommender systems. In: Conference on Human Factors in Computing Systems, pp. 830–831 (2002)
53. Sørmo, F., Cassens, J., Aamodt, A.: Explanation in case-based reasoning - perspectives and goals. Artificial Intelligence Review 24(2), 109–143 (2005)
54. Swearingen, K., Sinha, R.: Interaction design for recommender systems. In: Designing Interactive Systems, pp. 25–28 (2002). URL http://citeseer.nj.nec.com/598450.html
55. Symeonidis, P., Nanopoulos, A., Manolopoulos, Y.: Justified recommendations based on content and rating data. In: WebKDD Workshop on Web Mining and Web Usage Analysis (2008)
56. Tanaka-Ishii, K., Frank, I.: Multi-agent explanation strategies in real-time domains. In: 38th Annual Meeting of the Association for Computational Linguistics, pp. 158–165 (2000). URL citeseer.ist.psu.edu/542829.html
57. Thompson, C.A., Goker, M.H., Langley, P.: A personalized system for conversational recommendations. Journal of Artificial Intelligence Research (JAIR) 21, 393–428 (2004). URL http://www.cs.washington.edu/research/jair/abstracts/thompson04a.html
58. Tintarev, N., Masthoff, J.: Effective explanations of recommendations: User-centered design. In: Recommender Systems, pp. 153–156 (2007)
59. Tintarev, N., Masthoff, J.: Over- and underestimation in different product domains. In: Workshop on Recommender Systems associated with ECAI (2008)
60. Tintarev, N., Masthoff, J.: Personalizing movie explanations using commercial meta-data. In: Adaptive Hypermedia (2008)
61. Vig, J., Sen, S., Riedl, J.: Tagsplanations: Explaining recommendations using tags. In: Intelligent User Interfaces (2009)
62. Wang, W., Benbasat, I.: Recommendation agents for electronic commerce: Effects of explanation facilities on trusting beliefs. Journal of Management Information Systems 23, 217–246 (2007)
63. Warnestal, P.: Modeling a dialogue strategy for personalized movie recommendations. In: Beyond Personalization Workshop, pp. 77–82 (2005)
64. Warnestal, P.: User evaluation of a conversational recommender system. In: Proceedings of the 4th Workshop on Knowledge and Reasoning in Practical Dialogue Systems, pp. 32–39 (2005)
65. Wick, M.R., Thompson, W.B.: Reconstructive expert system explanation. Artificial Intelligence 54(1-2), 33–70 (1992). DOI 10.1016/0004-3702(92)90087-E
66. Ye, L.R., Johnson, P.E.: The impact of explanation facilities on user acceptance of expert systems advice. MIS Quarterly 19(2), 157–172 (1995). URL http://www.jstor.org/stable/249686
67. Yee, K.P., Swearingen, K., Li, K., Hearst, M.: Faceted metadata for image search and browsing. In: ACM Conference on Computer-Human Interaction (2003)