
CSA4080: Adaptive Hypertext Systems II

Dr. Christopher Staff
Department of Computer Science & AI

University of Malta

Topic 8: Evaluation Methods

2 of [email protected] University of Malta

CSA4080: Topic 8© 2004- Chris Staff

Aims and Objectives

• Background to evaluation methods in user-adaptive systems

• Brief overviews of the evaluation of IR, QA, User Modelling, Recommender Systems, Intelligent Tutoring Systems, and Adaptive Hypertext Systems

3 of [email protected] University of Malta

CSA4080: Topic 8© 2004- Chris Staff

Background to Evaluation Methods

• Systems need to be evaluated to demonstrate (prove) that the hypothesis on which they are based is correct

• In IR, we need to know that the system is retrieving all and only relevant documents for the given query

4 of [email protected] University of Malta

CSA4080: Topic 8© 2004- Chris Staff

Background to Evaluation Methods

• In QA, we need to know the correct answers to questions so that we can measure performance

• In User Modelling, we need to determine that the model is an accurate reflection of information needed to adapt to the user

• In Recommender Systems, we need to associate user preferences either with other similar users, or with product features

5 of [email protected] University of Malta

CSA4080: Topic 8© 2004- Chris Staff

Background to Evaluation Methods

• In Intelligent Tutoring Systems we need to know that learning through an ITS is beneficial or at least not (too) harmful

• In Adaptive Hypertext Systems, we need to measure the system’s ability to automatically represent user interests, to direct the user to relevant information, and to present the information in the best way

6 of [email protected] University of Malta

CSA4080: Topic 8© 2004- Chris Staff

Measuring Performance

• Information Retrieval:
  – Recall and Precision (overall, and also at top-n)

• Question Answering:
  – Mean Reciprocal Rank

7 of [email protected] University of Malta

CSA4080: Topic 8© 2004- Chris Staff

Measuring Performance

• User Modelling
  – Precision and Recall: if user is given all and only relevant info, or if system behaves exactly as user needs, then model is probably correct
  – Accuracy and predicted probability: to predict a user’s actions, location, or goals (see the sketch after this list)
  – Utility: the benefit derived from using system
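As a concrete illustration of the accuracy measure, a user model that predicts the user's next action can be scored by comparing its predictions against the actions the user actually took. The sketch below is illustrative only; the function name and the logged actions are made up.

```python
# Illustrative sketch: accuracy of a user model that predicts the user's next action.
# The data below is hypothetical; a real evaluation would log predictions and actions.

def prediction_accuracy(predicted_actions, actual_actions):
    """Fraction of interactions where the model's prediction matched the user's action."""
    assert len(predicted_actions) == len(actual_actions)
    correct = sum(1 for p, a in zip(predicted_actions, actual_actions) if p == a)
    return correct / len(actual_actions)

predicted = ["open_help", "follow_link", "search", "follow_link"]
actual    = ["open_help", "search",      "search", "follow_link"]
print(prediction_accuracy(predicted, actual))  # 0.75
```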

8 of [email protected] University of Malta

CSA4080: Topic 8© 2004- Chris Staff

Measuring Performance

• Recommender Systems:
  – Content-based may be evaluated using precision and recall
  – Collaborative is harder to evaluate, because it depends on other users the system knows about
    • Quality of individual item prediction
    • Precision and Recall at top-n

9 of [email protected] University of Malta

CSA4080: Topic 8© 2004- Chris Staff

Measuring Performance

• Intelligent Tutoring Systems:
  – Ideally, being able to show that student can learn more efficiently using ITS than without
  – Usually, show that no harm is done
    • Then, “releasing the tutor” and enabling self-paced learning becomes a huge advantage
  – Difficult to evaluate
    • Cannot compare same student with and without ITS
    • Students who volunteer are usually very motivated

10 of [email protected] University of Malta

CSA4080: Topic 8© 2004- Chris Staff

Measuring Performance

• Adaptive Hypertext Systems:
  – Can mix UM, IR, RS (content-based) methods of evaluation
  – Use empirical approach
    • Different sets of users solve same task, one group with adaptivity, the other without
    • How to choose participants?

11 of [email protected] University of Malta

CSA4080: Topic 8© 2004- Chris Staff

Evaluation Methods: IR

• IR systems’ performance is normally measured using precision and recall (see the sketch below)
  – Precision: percentage of retrieved documents that are relevant
  – Recall: percentage of relevant documents that are retrieved

• Who decides which documents are relevant?
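For illustration, precision and recall for a single query can be computed directly from the retrieved list and the set of relevance judgements. A minimal sketch with made-up document IDs:

```python
# Minimal sketch: precision and recall for a single query.
# Document IDs are hypothetical.

def precision_recall(retrieved, relevant):
    """retrieved: list of doc IDs returned by the system;
    relevant: set of doc IDs judged relevant for the query."""
    retrieved_set = set(retrieved)
    hits = retrieved_set & relevant
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = ["d3", "d7", "d1", "d9"]
relevant = {"d1", "d2", "d3"}
print(precision_recall(retrieved, relevant))  # (0.5, 0.666...)
```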

12 of [email protected] University of Malta

CSA4080: Topic 8© 2004- Chris Staff

Evaluation Methods: IR

• Query Relevance Judgements
  – For each test query, the document collection is divided into two sets: relevant and non-relevant
  – Systems are compared using precision and recall
  – In early collections, humans would classify documents (p3-cleverdon.pdf)
    • Cranfield collection: 1400 documents/221 queries
    • CACM: 3204 documents/50 queries

13 of [email protected] University of Malta

CSA4080: Topic 8© 2004- Chris Staff

Evaluation Methods: IR

• Do humans always agree on relevance judgements?
  – No: can vary considerably (mizzaro96relevance.pdf)
  – So only use documents on which there is full agreement

14 of [email protected] University of Malta

CSA4080: Topic 8© 2004- Chris Staff

Evaluation Methods: IR

• Text REtrieval Conference (TREC) (http://trec.nist.gov)
  – Runs competitions every year
  – QRels and document collection made available in a number of tracks (e.g., ad hoc, routing, question answering, cross-language, interactive, Web, terabyte, ...)

15 of [email protected] University of Malta

CSA4080: Topic 8© 2004- Chris Staff

Evaluation Methods: IR

• What happens when collection grows?
  – E.g., Web track has 1GB of data! Terabyte track in the pipeline
  – Pooling (see the sketch below)
    • Give different systems the same document collection to index, and the same queries
    • Take the top-n retrieved documents from each
    • Documents that are present in all retrieved sets are relevant, others not, OR
    • Assessors judge the relevance of unique documents in the pool
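A pooling run can be sketched as follows: take the top-n documents from each system, form the union as the pool to be judged (or, in the cruder variant, keep only the documents every system returned). The run names and rankings below are hypothetical.

```python
# Sketch of pooling: combine top-n results from several systems into one judgement pool.
# System names and rankings are hypothetical.

def build_pool(runs, n=3):
    """runs: dict mapping system name -> ranked list of doc IDs.
    Returns (union_pool, intersection) of each system's top-n documents."""
    top_n_sets = [set(ranking[:n]) for ranking in runs.values()]
    union_pool = set.union(*top_n_sets)          # unique docs to be judged by assessors
    intersection = set.intersection(*top_n_sets) # docs retrieved by every system
    return union_pool, intersection

runs = {
    "systemA": ["d1", "d4", "d2", "d9"],
    "systemB": ["d4", "d1", "d7", "d2"],
    "systemC": ["d1", "d5", "d4", "d8"],
}
pool, agreed = build_pool(runs, n=3)
print(sorted(pool))   # ['d1', 'd2', 'd4', 'd5', 'd7']
print(sorted(agreed)) # ['d1', 'd4']
```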

16 of [email protected] University of Malta

CSA4080: Topic 8© 2004- Chris Staff

Evaluation Methods: IR

• Advantages:
  – Possible to compare system performance
  – Relatively cheap
    • QRels and document collection can be purchased for a moderate price rather than organising expensive user trials
  – Can use standard IR systems (e.g., SMART) and build another layer on top, or build a new IR model
  – Automatic and repeatable

17 of [email protected] University of Malta

CSA4080: Topic 8© 2004- Chris Staff

Evaluation Methods: IR

• Common criticisms:
  – Judgements are subjective
    • Same assessor may change judgement at different times!
    • Doesn’t affect the relative ranking of systems
  – Judgements are binary
  – Some relevant documents are missed by pooling (QRels are incomplete)
    • Doesn’t affect measured system performance

18 of [email protected] University of Malta

CSA4080: Topic 8© 2004- Chris Staff

Evaluation Methods: IR

• Common criticisms (contd.):
  – Queries are too long
    • Queries under test conditions can have several hundred terms
    • Average Web query length 2.35 terms (p5-jansen.pdf)

19 of [email protected] University of Malta

CSA4080: Topic 8© 2004- Chris Staff

Evaluation Methods: IR

• In massive document collections there may be hundreds, thousands, or even millions of relevant documents

• Must all of them be retrieved?

• Instead, measure precision at top-5, 10, 20, 50, 100, 500, and summarise a ranking by averaging the precision at each relevant document retrieved, then taking the mean over all queries (Mean Average Precision)
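For illustration, precision at top-k and (uninterpolated) average precision for one query can be computed as below; Mean Average Precision is then the mean of the per-query average precision. The ranking and judgements are made up.

```python
# Sketch: precision at top-k and average precision for a single query.
# Document IDs are hypothetical; MAP is the mean of average precision over all queries.

def precision_at_k(ranking, relevant, k):
    top_k = ranking[:k]
    return sum(1 for d in top_k if d in relevant) / k

def average_precision(ranking, relevant):
    hits, score = 0, 0.0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            score += hits / i        # precision at this relevant document's rank
    return score / len(relevant) if relevant else 0.0

ranking = ["d2", "d5", "d1", "d8", "d3"]
relevant = {"d1", "d2", "d3"}
print(precision_at_k(ranking, relevant, 5))   # 0.6
print(average_precision(ranking, relevant))   # (1/1 + 2/3 + 3/5) / 3 ≈ 0.756
```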

20 of [email protected] University of Malta

CSA4080: Topic 8© 2004- Chris Staff

The E-Measure

Combine Precision and Recall into one number
(http://www.dcs.gla.ac.uk/Keith/Chapter.7/Ch.7.html)

E = 1 − ((b² + 1) · P · R) / (b² · P + R)

P = precision
R = recall
b = measure of relative importance of P or R; e.g., b = 0.5 means user is twice as interested in precision as recall

Equivalently, E = 1 − 1 / (α(1/P) + (1 − α)(1/R)), where α = 1/(b² + 1)
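A direct translation of the formula above into code (a minimal sketch; note that the familiar F-measure is simply 1 − E):

```python
# Sketch: van Rijsbergen's E-measure, E = 1 - ((b^2 + 1) * P * R) / (b^2 * P + R).
# The F-measure is 1 - E. The precision/recall values below are illustrative.

def e_measure(precision, recall, b=1.0):
    if precision == 0.0 and recall == 0.0:
        return 1.0  # worst case: nothing relevant retrieved
    num = (b * b + 1) * precision * recall
    den = b * b * precision + recall
    return 1.0 - num / den

print(e_measure(0.5, 0.25, b=1.0))   # 0.666... (i.e., F1 = 0.333...)
print(e_measure(0.5, 0.25, b=0.5))   # b = 0.5 weights precision twice as heavily as recall
```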

21 of [email protected] University of Malta

CSA4080: Topic 8© 2004- Chris Staff

Evaluation Methods: QA

• The aim in Question Answering is not to ensure that the overwhelming majority of relevant documents are retrieved, but to return an accurate answer

• Precision and recall are not appropriate measures for this

• Usual measure is Mean Reciprocal Rank

22 of [email protected] University of Malta

CSA4080: Topic 8© 2004- Chris Staff

Evaluation Methods: QA

• MRR is the mean, over all queries, of the reciprocal rank of the first correct answer (1/rank, or 0 if no correct answer appears in the top 5); see the sketch below

• Ideally, the first correct answer is put into rank 1

qa_report.pdf
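For illustration, MRR can be computed from the rank of the first correct answer for each question, as sketched below with made-up ranks:

```python
# Sketch: Mean Reciprocal Rank over a set of questions.
# Each entry is the rank of the first correct answer, or None if it is not in the top 5.
# The ranks below are hypothetical.

def mean_reciprocal_rank(first_correct_ranks, cutoff=5):
    scores = []
    for rank in first_correct_ranks:
        if rank is not None and rank <= cutoff:
            scores.append(1.0 / rank)
        else:
            scores.append(0.0)   # correct answer not returned within the cutoff
    return sum(scores) / len(scores)

print(mean_reciprocal_rank([1, 3, None, 2]))  # (1 + 1/3 + 0 + 1/2) / 4 ≈ 0.458
```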

23 of [email protected] University of Malta

CSA4080: Topic 8© 2004- Chris Staff

Evaluation Methods: UM

• Information Retrieval evaluation has matured to the extent that it is very unusual to find an academic publication without a standard approach to evaluation

• On the other hand, up to 2001, only about one-third of the user models presented in UMUAI (the journal User Modeling and User-Adapted Interaction) had been evaluated, and most of those were ITS-related (see later)

p181-chin.pdf

24 of [email protected] University of Malta

CSA4080: Topic 8© 2004- Chris Staff

Evaluation Methods: UM

• Unlike IR systems, it is difficult to evaluate UMs automatically
  – Unless they are stereotypes/coarse-grained classification systems

• So they tend to need to be evaluated empirically
  – User studies
  – Want to measure how well participants do with and without a UM supporting their task

25 of [email protected] University of Malta

CSA4080: Topic 8© 2004- Chris Staff

Evaluation Methods: UM

• Difficulties/problems include:
  – Ensuring a large enough number of participants to make results statistically meaningful
  – Catering for participants improving during rounds
  – Failure to use a control group
  – Ensuring that nothing happens to modify participant’s behaviour (e.g., thinking aloud)

26 of [email protected] University of Malta

CSA4080: Topic 8© 2004- Chris Staff

Evaluation Methods: UM

• Difficulties/problems (contd.):
  – Biasing the results
  – Not using blind-/double-blind testing when needed
  – ...

27 of [email protected] University of Malta

CSA4080: Topic 8© 2004- Chris Staff

Evaluation Methods: UM

• Proposed reporting standards
  – No., source, and relevant background of participants
  – independent, dependent and covariate variables
  – analysis method
  – post-hoc probabilities
  – raw data (in the paper, or on-line via WWW)
  – effect size and power (at least 0.8)

p181-chin.pdf

28 of [email protected] University of Malta

CSA4080: Topic 8© 2004- Chris Staff

Evaluation Methods: RS

• Recommender Systems

• Two types of recommender system
  – Content-based
  – Collaborative

• Both (tend to) use VSM to plot users/product features into n-dimensional space (see the sketch below)
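As an illustration of the vector-space view, a content-based recommender can rank items by the cosine similarity between a user-profile vector and each item's feature vector. This is a minimal sketch; the feature weights and item names are made up.

```python
# Sketch: content-based recommendation via cosine similarity in a vector space.
# Feature vectors are hypothetical (e.g., weights over genres/keywords).
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

user_profile = [0.9, 0.1, 0.4]          # user's inferred interest in three features
items = {
    "item1": [1.0, 0.0, 0.5],
    "item2": [0.0, 1.0, 0.2],
}
ranked = sorted(items, key=lambda i: cosine(user_profile, items[i]), reverse=True)
print(ranked)  # ['item1', 'item2']
```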

29 of [email protected] University of Malta

CSA4080: Topic 8© 2004- Chris Staff

Evaluation Methods: RS

• If we know the “correct” recommendations to make to a user with a specific profile, then we can use Precision, Recall, E-measure, F-measure, Mean Average Precision, MRR, etc.

30 of [email protected] University of Malta

CSA4080: Topic 8© 2004- Chris Staff

Evaluation Methods: ITS

• Intelligent Tutoring Systems

• Evaluation to demonstrate that learning through ITS is at least as effective as traditional learning
  – Cost benefit of freeing up tutor, and permitting self-paced learning

• Show at a minimum that student is not harmed at all or is minimally harmed

31 of [email protected] University of Malta

CSA4080: Topic 8© 2004- Chris Staff

Evaluation Methods: ITS

• Difficult to “prove” that individual student learns better/same/worse with ITS than without
  – Cannot make student unlearn material in between experiments!

• Attempt to use statistically significant number of students, to show probable overall effect

32 of [email protected] University of Malta

CSA4080: Topic 8© 2004- Chris Staff

Evaluation Methods: ITS

• Usually suffers from same problems as evaluating UMs, and ubiquitous multimedia systems

• Students volunteer to evaluate ITSs
  – So are more likely to be motivated and so perform better
  – Novelty of system is also a motivator
  – Too many variables that are difficult to cater for

33 of [email protected] University of Malta

CSA4080: Topic 8© 2004- Chris Staff

Evaluation Methods: ITS

• However, usually empirical evaluation is performed

• Volunteers work with system

• Pass rates, retention rates, etc., may be compared to conventional learning environment (quantitative analysis)

• Volunteers asked for feedback about, e.g., usability (qualitative analysis)

34 of [email protected] University of Malta

CSA4080: Topic 8© 2004- Chris Staff

Evaluation Methods: ITS

• Frequently, students are split into groups (control and test) and performance measured against each other (see the sketch below)

• Control is usually ITS without the “I” - students must find their own way through learning material
  – However, this is difficult to assess, because performance of control group may be worse than traditional learning!
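When control and test groups are compared, a two-sample significance test on, say, post-test scores is one common way to judge whether the observed difference is likely to be real. A minimal sketch, assuming SciPy is available and using made-up scores:

```python
# Sketch: comparing test-group and control-group scores with Welch's two-sample t-test.
# Scores are hypothetical; assumes SciPy is installed.
from scipy import stats

its_group     = [72, 85, 78, 90, 66, 81, 77]   # post-test scores with the adaptive ITS
control_group = [70, 74, 69, 82, 60, 73, 71]   # post-test scores without the "I"

t_stat, p_value = stats.ttest_ind(its_group, control_group, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# A small p-value suggests the difference between groups is unlikely to be due to chance alone.
```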

35 of [email protected] University of Malta

CSA4080: Topic 8© 2004- Chris Staff

Evaluation Methods: ITS

• “Learner achievement” metric (Muntean, 2004)
  – How much has student learnt from ITS?
  – Compare pre-learning knowledge to post-learning knowledge (see the sketch below)

• Can compare different systems (as long as they use same learning material), but with different users: so same problem as before
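One simple way to operationalise the pre-/post-learning comparison is a normalised learning gain per student. The sketch below is illustrative and is not claimed to be the exact formula of Muntean (2004).

```python
# Sketch: normalised learning gain from pre-test and post-test scores (0-100 scale).
# Illustrates the pre-/post-learning comparison only; not claimed to be Muntean's exact metric.

def learning_gain(pre, post, max_score=100):
    """Fraction of the possible improvement actually achieved."""
    if max_score == pre:
        return 0.0  # nothing left to learn on this test
    return (post - pre) / (max_score - pre)

print(learning_gain(40, 70))  # 0.5: the student gained half of the possible improvement
print(learning_gain(80, 90))  # 0.5 as well, despite a smaller raw gain
```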

36 of [email protected] University of Malta

CSA4080: Topic 8© 2004- Chris Staff

Evaluation Methods: AHS

• Adaptive Hypertext Systems

• There are currently no standard metrics for evaluating AHSs

• Best practices are taken from fields like ITS, IR, and UM and applied to AHS

• Typical evaluation is “experiences” of using system with and without adaptive features

37 of [email protected] University of Malta

CSA4080: Topic 8© 2004- Chris Staff

Evaluation Methods: AHS

• If a test collection existed for AHS (like TREC) what might it look like?
  – Descriptions of user models + relevance judgements for relevant links, relevant documents, relevant presentation styles
  – Would we need a standard “open” user model description? Are all user models capturing the same information about the user?

38 of [email protected] University of Malta

CSA4080: Topic 8© 2004- Chris Staff

Evaluation Methods: AHS

– What about following paths through hyperspace to pre-specified points and then having sets of relevance judgements at those points?

– Currently, adaptive hypertext systems appear to be performing very different tasks. Even if we take just one of the two things that can be adapted (e.g., links), it appears to be beyond our current ability to agree on how adapting links should be evaluated, mainly because of differences in the underlying user models!

39 of [email protected] University of Malta

CSA4080: Topic 8© 2004- Chris Staff

Evaluation Methods: AHS

• HyperContext (HCT) (HCTCh8.pdf)

• HCT builds a short-term user model as a user navigates through hyperspace

• We evaluated HCT’s ability to make “See Also” recommendations

• Ideally, we would have had a hyperspace with independent relevance judgements at particular points in the path of traversal

40 of [email protected] University of Malta

CSA4080: Topic 8© 2004- Chris Staff

Evaluation Methods: AHS

• Instead, we used two mechanisms for deriving UM (one using interpretation, the other using whole document)

• After 5 link traversals we automatically generated a query from each user model, submitted it to a search engine, and retrieved a relevant interpretation or document respectively

41 of [email protected] University of Malta

CSA4080: Topic 8© 2004- Chris Staff

Evaluation Methods: AHS

• Users asked to read all documents in the path and then give a relevance judgement for each “See Also” recommendation

• Recommendations shown in random order

• Users didn’t know which was HCT recommended and which was not

• Assumed that if user considered doc to be relevant, then UM is accurate

42 of [email protected] University of Malta

CSA4080: Topic 8© 2004- Chris Staff

Evaluation Methods: AHS

• Not really enough participants to make strong claims about HCT approach to AH

• Not really significant differences in RJs between different ways of deriving UM (although both performed reasonably well!)

• However, significant findings if reading time is indication of skim-/deep-reading!

43 of [email protected] University of Malta

CSA4080: Topic 8© 2004- Chris Staff

Evaluation Methods: AHS

• Should users have been shown both documents?
  – Could reading two documents, instead of just one, have affected the judgement of the doc read second?

• Were users disaffected because it wasn’t a task that they needed to perform?

44 of [email protected] University of Malta

CSA4080: Topic 8© 2004- Chris Staff

Evaluation Methods: AHS

• Ideally, systems are tested in “real world” conditions in which evaluators are performing tasks

• Normally, experimental set-ups require users to perform artificial tasks, and it is difficult to measure performance because relevance is subjective!

45 of [email protected] University of Malta

CSA4080: Topic 8© 2004- Chris Staff

Evaluation Methods: AHS

• This is one of the criticisms of the TREC collections, but it does allow systems to be compared - even if the story is completely different once the system is in real use

• Building a robust enough system for use in the real world is expensive

• But then, so is conducting lab-based experiments

46 of [email protected] University of Malta

CSA4080: Topic 8© 2004- Chris Staff

Modular Evaluation of AUIs

• Adaptive User Interfaces, or User-Adaptive Systems

• Difficult to evaluate “monolithic” systems

• So break up UASs into “modules” that can be evaluated separately

47 of [email protected] University of Malta

CSA4080: Topic 8© 2004- Chris Staff

Modular Evaluation of AUIs

• Paramythis et al. recommend
  – identifying the “evaluation objects” - that can be evaluated separately and in combination
  – presenting the “evaluation purpose” - the rationale for the modules and criteria for their evaluation
  – identifying the “evaluation process” - methods and techniques for evaluating modules during the AUI life cycle

paramythis.pdf

48 of [email protected] University of Malta

CSA4080: Topic 8© 2004- Chris Staff

Modular Evaluation of AUIs

49 of [email protected] University of Malta

CSA4080: Topic 8© 2004- Chris Staff

Modular Evaluation of AUIs