CSA4080: Topic 8 © 2004- Chris Staff
CSA4080: Adaptive Hypertext Systems II
Dr. Christopher Staff, Department of Computer Science & AI
University of Malta
Topic 8: Evaluation Methods
Aims and Objectives
• Background to evaluation methods in user-adaptive systems
• Brief overviews of the evaluation of IR, QA, User Modelling, Recommender Systems, Intelligent Tutoring Systems, Adaptive Hypertext Systems
Background to Evaluation Methods
• Systems need to be evaluated to demonstrate (prove) that the hypothesis on which they are based is correct
• In IR, we need to know that the system is retrieving all and only relevant documents for the given query
Background to Evaluation Methods
• In QA, we need to know the correct answer to questions, and measure performance
• In User Modelling, we need to determine that the model is an accurate reflection of information needed to adapt to the user
• In Recommender Systems, we need to associate user preferences either with other similar users, or with product features
Background to Evaluation Methods
• In Intelligent Tutoring Systems we need to know that learning through an ITS is beneficial or at least not (too) harmful
• In Adaptive Hypertext Systems, we need to measure the system’s ability to automatically represent user interests, to direct the user to relevant information, and to present the information in the best way
Measuring Performance
• Information Retrieval:
  – Recall and Precision (overall, and also at top-n)
• Question Answering:
  – Mean Reciprocal Rank
Measuring Performance
• User Modelling:
  – Precision and Recall: if user is given all and only relevant info, or if system behaves exactly as user needs, then model is probably correct
  – Accuracy and predicted probability: to predict a user’s actions, location, or goals
  – Utility: the benefit derived from using system
Measuring Performance
• Recommender Systems:
  – Content-based may be evaluated using precision and recall
  – Collaborative is harder to evaluate, because it depends on other users the system knows about
    • Quality of individual item prediction
    • Precision and Recall at top-n
Measuring Performance
• Intelligent Tutoring Systems:
  – Ideally, being able to show that student can learn more efficiently using ITS than without
  – Usually, show that no harm is done
    • Then, “releasing the tutor” and enabling self-paced learning becomes a huge advantage
  – Difficult to evaluate
    • Cannot compare same student with and without ITS
    • Students who volunteer are usually very motivated
Measuring Performance
• Adaptive Hypertext Systems:
  – Can mix UM, IR, RS (content-based) methods of evaluation
  – Use empirical approach
    • Different sets of users solve same task, one group with adaptivity, the other without
    • How to choose participants?
Evaluation Methods: IR
• IR systems’ performance is normally measured using precision and recall
  – Precision: percentage of retrieved documents that are relevant
  – Recall: percentage of relevant documents that are retrieved
• Who decides which documents are relevant?
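The two definitions above can be sketched in a few lines. This is a minimal illustration only: the function name `precision_recall` and the document ids are made up for the example, and the relevance judgements are assumed to be supplied as a plain set of ids.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: document ids returned by the system
    relevant:  document ids judged relevant for the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents retrieved, 3 judged relevant, 2 in common.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(p, r)
```

Here half of the retrieved documents are relevant (precision) and two of the three relevant documents were found (recall).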
Evaluation Methods: IR
• Query Relevance Judgements
  – For each test query, the document collection is divided into two sets: relevant and non-relevant
  – Systems are compared using precision and recall
  – In early collections, humans would classify documents (p3-cleverdon.pdf)
    • Cranfield collection: 1400 documents/221 queries
    • CACM: 3204 documents/50 queries
Evaluation Methods: IR
• Do humans always agree on relevance judgements?
  – No: can vary considerably (mizzaro96relevance.pdf)
  – So only use documents on which there is full agreement
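Keeping only the documents with full agreement might look like the sketch below. The data layout (one boolean vote per assessor, keyed by document id) and the name `full_agreement` are assumptions made for the example, not something the cited paper prescribes.

```python
def full_agreement(judgements):
    """Keep only documents on which every assessor gave the same judgement.

    judgements: dict mapping doc id -> list of booleans (one per assessor)
    Returns a dict mapping doc id -> the agreed judgement.
    """
    return {
        doc: votes[0]
        for doc, votes in judgements.items()
        if votes and len(set(votes)) == 1  # all votes identical
    }


# Hypothetical judgements from two assessors.
votes = {"d1": [True, True], "d2": [True, False], "d3": [False, False]}
print(full_agreement(votes))
```

Document `d2` is dropped because the assessors disagree; `d1` and `d3` keep their agreed judgement.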
Evaluation Methods: IR
• Text REtrieval Conference (TREC) (http://trec.nist.gov)
  – Runs competitions every year
  – QRels and document collection made available in a number of tracks (e.g., ad hoc, routing, question answering, cross-language, interactive, Web, terabyte, ...)
Evaluation Methods: IR
• What happens when collection grows?
  – E.g., Web track has 1GB of data! Terabyte track in the pipeline
  – Pooling
    • Give different systems same document collection to index and queries
    • Take the top-n retrieved documents from each
    • Documents that are present in all retrieved sets are relevant, others not OR
    • Assessors judge the relevance of unique documents in the pool
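The pooling steps can be sketched as follows. The run format (system name mapped to a ranked list of document ids) and the function names are assumptions for illustration; the first function builds the pool that assessors would judge, the second implements the alternative of treating documents found by every system as relevant.

```python
def build_pool(runs, n):
    """Union of the top-n documents from every system: the set assessors judge."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:n])
    return pool


def retrieved_by_all(runs, n):
    """Documents that appear in the top-n of every system."""
    tops = [set(ranking[:n]) for ranking in runs.values()]
    return set.intersection(*tops) if tops else set()


# Hypothetical runs from three systems for one query.
runs = {
    "A": ["d1", "d2", "d3"],
    "B": ["d2", "d4", "d1"],
    "C": ["d2", "d5", "d6"],
}
print(sorted(build_pool(runs, 2)))
print(sorted(retrieved_by_all(runs, 2)))
```

With n = 2 the pool holds four unique documents, but only `d2` is in every system's top 2.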
Evaluation Methods: IR
• Advantages:
  – Possible to compare system performance
  – Relatively cheap
    • QRels and document collection can be purchased for moderate price rather than organising expensive user trials
  – Can use standard IR systems (e.g., SMART) and build another layer on top, or build new IR model
  – Automatic and Repeatable
Evaluation Methods: IR
• Common criticisms:
  – Judgements are subjective
    • Same assessor may change judgement at different times!
    • Doesn’t affect the ranking of systems
  – Judgements are binary
  – Some relevant documents are missed by pooling (QRels are incomplete)
    • Doesn’t affect system performance
Evaluation Methods: IR
• Common criticisms (contd.):
  – Queries are too long
    • Queries under test conditions can have several hundred terms
    • Average Web query length 2.35 terms (p5-jansen.pdf)
Evaluation Methods: IR
• In massive document collections there may be hundreds, thousands, or even millions of relevant documents
• Must all of them be retrieved?
• Measure precision at top-5, 10, 20, 50, 100, 500 and take weighted average over results (Mean Average Precision)
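A sketch of precision at fixed cut-offs, alongside average precision. Note the assumption: `average_precision` below uses the commonly used definition (precision averaged over the ranks at which relevant documents appear, then averaged over queries for MAP), which is not identical to averaging over the fixed cut-offs listed above. All names and data are hypothetical.

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def average_precision(ranking, relevant):
    """Mean of precision values at each rank where a relevant document occurs.

    Relevant documents that are never retrieved contribute zero.
    """
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(queries):
    """queries: list of (ranking, relevant set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)


ranking = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d6"}
print(precision_at_k(ranking, relevant, 5))
print(average_precision(ranking, relevant))
```

In this example two of the top five documents are relevant, and the third relevant document (`d6`) is never retrieved, which lowers average precision.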
The E-Measure
Combine Precision and Recall into one number (http://www.dcs.gla.ac.uk/Keith/Chapter.7/Ch.7.html)

$$E = 1 - \frac{b^2 + 1}{\dfrac{b^2}{R} + \dfrac{1}{P}}$$

P = precision
R = recall
b = measure of relative importance of P or R
E.g., b = 0.5 means user is twice as interested in precision as recall

$$E = 1 - \frac{1}{\alpha \left( \dfrac{1}{P} \right) + (1 - \alpha) \dfrac{1}{R}}, \qquad \alpha = \frac{1}{\beta^2 + 1}$$
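A small numerical check of the E-measure, assuming the weight b in the first form and β in the second form are the same quantity, so that α = 1/(b² + 1). The function names are made up for the example, and returning E = 1 when P or R is zero is a convention chosen here to avoid division by zero.

```python
def e_measure(precision, recall, b=1.0):
    """E = 1 - (b^2 + 1) / (b^2 / R + 1 / P). Lower is better."""
    if precision == 0 or recall == 0:
        return 1.0  # assumed convention: worst score when nothing useful is retrieved
    return 1.0 - (b ** 2 + 1) / (b ** 2 / recall + 1 / precision)


def e_measure_alpha(precision, recall, alpha):
    """E = 1 - 1 / (alpha / P + (1 - alpha) / R)."""
    if precision == 0 or recall == 0:
        return 1.0
    return 1.0 - 1.0 / (alpha / precision + (1 - alpha) / recall)


b = 0.5                      # precision counts twice as much as recall
alpha = 1 / (b ** 2 + 1)     # equivalent weight in the alpha form
print(e_measure(0.5, 0.25, b))
print(e_measure_alpha(0.5, 0.25, alpha))
```

Both calls give the same value, and with b = 1 the quantity 1 − E is the familiar balanced F-measure.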
Evaluation Methods: QA
• The aim in Question Answering is not to ensure that the overwhelming majority of relevant documents are retrieved, but to return an accurate answer
• Precision and recall are not accurate enough
• Usual measure is Mean Reciprocal Rank
Evaluation Methods: QA
• MRR is the mean, over all queries, of the reciprocal rank of the first correct answer (1/rank, or 0 if the correct answer is not in the top-5)
• Ideally, the first correct answer is put into rank 1
qa_report.pdf
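MRR with a top-5 cut-off can be sketched as below. The input format (a ranked answer list plus a set of acceptable answers per question) and the function names are assumptions for the example.

```python
def reciprocal_rank(answers, correct, cutoff=5):
    """1/rank of the first correct answer within the cutoff, else 0."""
    for rank, answer in enumerate(answers[:cutoff], start=1):
        if answer in correct:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(questions, cutoff=5):
    """questions: list of (ranked answers, set of correct answers) pairs."""
    scores = [reciprocal_rank(a, c, cutoff) for a, c in questions]
    return sum(scores) / len(scores)


# Hypothetical results for three questions.
questions = [
    (["a", "b", "c"], {"b"}),                 # first correct at rank 2 -> 1/2
    (["x", "y"], {"x"}),                      # first correct at rank 1 -> 1
    (["p", "q", "r", "s", "t", "u"], {"u"}),  # correct at rank 6, outside top-5 -> 0
]
print(mean_reciprocal_rank(questions))
```

The three questions score 1/2, 1, and 0, so the mean is 0.5.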
Evaluation Methods: UM
• Information Retrieval evaluation has matured to the extent that it is very unusual to find an academic publication without a standard approach to evaluation
• On the other hand, up to 2001, only one-third of user models presented in UMUAI had been evaluated, and most of those were ITS-related (see later)
p181-chin.pdf
Evaluation Methods: UM
• Unlike IR systems, it is difficult to evaluate UMs automatically
  – Unless they are stereotypes/coarse-grained classification systems
• So they tend to need to be evaluated empirically
  – User studies
  – Want to measure how well participants do with and without a UM supporting their task
Evaluation Methods: UM
• Difficulties/problems include:
  – Ensuring a large enough number of participants to make results statistically meaningful
  – Catering for participants improving during rounds
  – Failure to use a control group
  – Ensuring that nothing happens to modify participant’s behaviour (e.g., thinking aloud)
Evaluation Methods: UM
• Difficulties/problems (contd.):
  – Biasing the results
  – Not using blind-/double-blind testing when needed
  – ...
Evaluation Methods: UM
• Proposed reporting standards
  – No., source, and relevant background of participants
  – independent, dependent and covariant variables
  – analysis method
  – post-hoc probabilities
  – raw data (in the paper, or on-line via WWW)
  – effect size and power (at least 0.8)
p181-chin.pdf
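As one illustration of reporting an effect size, the sketch below computes Cohen's d for two independent groups. This is an assumption: the slide does not say which effect-size measure is intended, and Cohen's d is simply a common choice. The scores are invented, and statistical power is not computed here.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardised difference between two group means (pooled sample variance)."""
    na, nb = len(group_a), len(group_b)
    pooled_var = (
        (na - 1) * variance(group_a) + (nb - 1) * variance(group_b)
    ) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical task scores: one group with a user model, one without.
with_um = [7, 8, 9]
without_um = [5, 6, 7]
print(cohens_d(with_um, without_um))
```

Both groups have unit sample variance and the means differ by 2, so d is 2.0 for this toy data.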
Evaluation Methods: RS
• Recommender Systems
• Two types of recommender system
  – Content-based
  – Collaborative
• Both (tend to) use VSM to plot users/product features into n-dimensional space
Evaluation Methods: RS
• If we know the “correct” recommendations to make to a user with a specific profile, then we can use Precision, Recall, E-measure, F-measure, Mean Average Precision, MRR, etc.
Evaluation Methods: ITS
• Intelligent Tutoring Systems
• Evaluation to demonstrate that learning through ITS is at least as effective as traditional learning
  – Cost benefit of freeing up tutor, and permitting self-paced learning
• Show at a minimum that student is not harmed at all or is minimally harmed
Evaluation Methods: ITS
• Difficult to “prove” that individual student learns better/same/worse with ITS than without
  – Cannot make student unlearn material in between experiments!
• Attempt to use statistically significant number of students, to show probable overall effect
Evaluation Methods: ITS
• Usually suffers from same problems as evaluating UMs, and ubiquitous multimedia systems
• Students volunteer to evaluate ITSs
  – So are more likely to be motivated and so perform better
  – Novelty of system is also a motivator
  – Too many variables that are difficult to cater for
Evaluation Methods: ITS
• However, usually empirical evaluation is performed
• Volunteers work with system
• Pass rates, retention rates, etc., may be compared to conventional learning environment (quantitative analysis)
• Volunteers asked for feedback about, e.g., usability (qualitative analysis)
Evaluation Methods: ITS
• Frequently, students are split into groups (control and test) and performance measured against each other
• Control is usually ITS without the I - students must find their own way through learning material
  – However, this is difficult to assess, because performance of control group may be worse than traditional learning!
Evaluation Methods: ITS
• “Learner achievement” metric (Muntean, 2004)
  – How much has student learnt from ITS?
  – Compare pre-learning knowledge to post-learning knowledge
• Can compare different systems (as long as they use same learning material), but with different users: so same problem as before
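A pre-test/post-test comparison could be sketched as below. This is only an assumed reading of "learner achievement" as the raw gain between the two test scores; the exact formula used by Muntean (2004) is not given on the slide, and the scores are invented.

```python
def learning_gains(pre_scores, post_scores):
    """Per-student gain: post-learning score minus pre-learning score."""
    return [post - pre for pre, post in zip(pre_scores, post_scores)]


def mean_gain(pre_scores, post_scores):
    """Average gain over all students in a group."""
    gains = learning_gains(pre_scores, post_scores)
    return sum(gains) / len(gains)


# Hypothetical test scores (percentages) for three students using one system.
pre = [40, 55, 60]
post = [70, 65, 90]
print(learning_gains(pre, post))
print(mean_gain(pre, post))
```

The mean gain of one group can then be set against that of a group using a different system, subject to the caveat above that the two groups contain different students.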
Evaluation Methods: AHS
• Adaptive Hypertext Systems
• There are currently no standard metrics for evaluating AHSs
• Best practices are taken from fields like ITS, IR, and UM and applied to AHS
• Typical evaluation is “experiences” of using system with and without adaptive features
Evaluation Methods: AHS
• If a test collection existed for AHS (like TREC) what might it look like?
  – Descriptions of user models + relevance judgements for relevant links, relevant documents, relevant presentation styles
  – Would we need a standard “open” user model description? Are all user models capturing the same information about the user?
Evaluation Methods: AHS
– What about following paths through hyperspace to pre-specified points and then having the sets of judgements?
– Currently, adaptive hypertext systems appear to be performing very different tasks, but even if we take just one of the two things that can be adapted (e.g., links), it appears to be beyond our current ability to agree on how adapting links should be evaluated, mainly due to UM!
Evaluation Methods: AHS
• HyperContext (HCT) (HCTCh8.pdf)
• HCT builds a short-term user model as a user navigates through hyperspace
• We evaluated HCT’s ability to make “See Also” recommendations
• Ideally, we would have had hyperspace with independent relevance judgements at particular points in path of traversal
Evaluation Methods: AHS
• Instead, we used two mechanisms for deriving UM (one using interpretation, the other using whole document)
• After 5 link traversals we automatically generated a query from each user model, submitted it to search engine and found a relevant interpretation/document respectively
Evaluation Methods: AHS
• Users asked to read all documents in the path and then give relevance judgement for each “See Also” recommendation
• Recommendations shown in random order
• Users didn’t know which was HCT recommended and which was not
• Assumed that if user considered doc to be relevant, then UM is accurate
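Presenting recommendations blind and in random order might be set up as in the sketch below. This is not the HCT evaluation code; the function name, the document ids, and the source labels are hypothetical. The participant sees only the shuffled ids, while the experimenter keeps the mapping back to the mechanism that produced each one.

```python
import random


def blind_order(recommendations, rng=None):
    """Shuffle recommendations and hide which mechanism produced each.

    recommendations: list of (doc id, source label) pairs
    Returns (shown, key): the shuffled doc ids shown to the participant,
    and a dict doc id -> source label kept by the experimenter.
    """
    rng = rng or random.Random()
    shuffled = list(recommendations)
    rng.shuffle(shuffled)
    shown = [doc for doc, _ in shuffled]
    key = dict(recommendations)
    return shown, key


# Hypothetical pair of "See Also" recommendations from the two user models.
recs = [("docA", "interpretation"), ("docB", "whole document")]
shown, key = blind_order(recs, random.Random(0))
print(shown)
```

Passing a seeded `random.Random` makes the presentation order reproducible when the experiment needs to be audited.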
Evaluation Methods: AHS
• Not really enough participants to make strong claims about HCT approach to AH
• Not really significant differences in RJs between different ways of deriving UM (although both performed reasonably well!)
• However, significant findings if reading time is indication of skim-/deep-reading!
Evaluation Methods: AHS
• Should users have been shown both documents?
  – Could reading two documents, instead of just one, have affected judgement of doc read second?
• Were users disaffected because it wasn’t a task that they needed to perform?
Evaluation Methods: AHS
• Ideally, systems are tested in “real world” conditions in which evaluators are performing tasks
• Normally, experimental set-ups require users to perform artificial tasks, and it is difficult to measure performance because relevance is subjective!
Evaluation Methods: AHS
• This is one of the criticisms of the TREC collections, but it does allow systems to be compared - even if the story is completely different once the system is in real use
• Building a robust enough system for use in the real world is expensive
• But then, so is conducting lab-based experiments
Modular Evaluation of AUIs
• Adaptive User Interfaces, or User-Adaptive Systems
• Difficult to evaluate “monolithic” systems
• So break up UASs into “modules” that can be evaluated separately
Modular Evaluation of AUIs
• Paramythis et al. recommend
  – identifying the “evaluation objects” - that can be evaluated separately and in combination
  – presenting the “evaluation purpose” - the rationale for the modules and criteria for their evaluation
  – identifying the “evaluation process” - methods and techniques for evaluating modules during the AUI life cycle
paramythis.pdf