SBD: Usability Evaluation
Chris North
cs3724: HCI
[Figure: the Scenario-Based Design process. ANALYZE: analysis of stakeholders, field studies → problem scenarios, claims about current practice. DESIGN: activity scenarios, information scenarios, interaction scenarios, informed by metaphors, information technology, HCI theory, guidelines. PROTOTYPE & EVALUATE: usability specifications, formative evaluation, summative evaluation, with iterative analysis of usability claims and re-design.]
Scenario-Based Design
Evaluation
• Formative vs. Summative
• Analytic vs. Empirical
Usability Engineering
[Diagram: Reqs Analysis → Design → Develop → Evaluate, cycled over many iterations.]
Usability Engineering
[Diagram: the same cycle, annotated with formative evaluation (during the iterations) and summative evaluation (at the end).]
Usability Evaluation
• Analytic Methods: usability inspection, expert review
  • Heuristic Evaluation
  • Cognitive walkthrough
  • GOMS analysis
• Empirical Methods:
  • Usability Testing
    – Field or lab
    – Observation, problem identification
  • Controlled Experiment
    – Formal controlled scientific experiment
    – Comparisons, statistical analysis
User Interface Metrics
• Ease of learning: learning time, …
• Ease of use: perf time, error rates, …
• User satisfaction: surveys, …
Not “user friendly”
Usability Testing
• Formative: helps guide design
• Early in design process: once the architecture is finalized, it's too late!
• A few users
• Usability problems, incidents
• Qualitative feedback from users
• Quantitative usability specification
Usability Specification Table
Scenario task                         Worst case   Planned Target   Best case (expert)   Observed
Find most expensive house for sale?   1 min.       10 sec.          3 sec.               ??? sec
…
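The spec table drives the test: for each benchmark task you record observed times and check them against the planned levels. A minimal Python sketch of that comparison; the task name, spec levels, and observed times are hypothetical, mirroring the row above:

```python
# Minimal sketch: compare observed task times against a usability spec.
# Spec levels follow the table above (worst 60 s, target 10 s, best 3 s).
spec = {
    "Find most expensive house for sale": {
        "worst": 60.0,   # worst case (seconds)
        "target": 10.0,  # planned target
        "best": 3.0,     # best case (expert)
    },
}

observed = {"Find most expensive house for sale": [14.2, 9.8, 22.5]}  # invented times

for task, levels in spec.items():
    times = observed[task]
    mean = sum(times) / len(times)
    if mean <= levels["target"]:
        status = "meets planned target"
    elif mean <= levels["worst"]:
        status = "acceptable, but misses target"
    else:
        status = "fails even the worst case"
    print(f"{task}: mean {mean:.1f}s -> {status}")
```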
Usability Test Setup
• Set of benchmark tasks
  • Easy to hard, specific to open-ended
  • Coverage of different UI features
  • E.g. "find the 5 most expensive houses for sale"
  • Different types: learnability vs. performance
• Consent forms
  • Not needed unless video-taping user's face (new rule)
• Experimenters:
  • Facilitator: instructs user
  • Observers: take notes, collect data, video-tape screen
  • Executor: runs the prototype if faked
• Users
  • 3-5 users, quality not quantity
Usability Test Procedure
• Goal: mimic real life
• Do not cheat by showing them how to use the UI!
• Initial instructions: "We are evaluating the system, not you."
• Repeat:
  • Give user a task
  • Ask user to "think aloud"
  • Observe, note mistakes and problems
  • Avoid interfering, hint only if completely stuck
• Interview
  • Verbal feedback
  • Questionnaire
• ~1 hour / user
Usability Lab
• E.g. McBryde 102
Data
• Note taking
  • E.g. "&%$#@ user keeps clicking on the wrong button…"
• Verbal protocol: think aloud
  • E.g. user thinks that button does something else…
• Rough quantitative measures
  • HCI metrics: e.g. task completion time, …
• Interview feedback and surveys
• Video-tape screen & mouse
• Eye tracking, biometrics?
Analyze
• Initial reaction:
  • "stupid user!", "that's developer X's fault!", "this sucks"
• Mature reaction:
  • "how can we redesign the UI to solve that usability problem?"
  • the user is always right
• Identify usability problems
  • Learning issues: e.g. can't figure out or didn't notice a feature
  • Performance issues: e.g. arduous, tiring to solve tasks
  • Subjective issues: e.g. annoying, ugly
• Problem severity: critical vs. minor
Cost-Importance Analysis
• Importance 1-5: (task effect, frequency)
  • 5 = critical, major impact on user, frequent occurrence
  • 3 = user can complete task, but with difficulty
  • 1 = minor problem, small speed bump, infrequent
• Ratio = importance / cost
  • Sort by this (see the sketch below)
  • 3 categories: must fix, next version, ignored

Problem   Importance   Solutions   Cost   Ratio I/C
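The ratio computation and three-way binning are simple enough to script. A sketch in Python; the problems, scores, costs, and category cut-offs are all invented, since the slide gives no specific thresholds:

```python
# Sketch of cost-importance analysis: ratio = importance / cost,
# sort descending, bin into the 3 categories.
problems = [
    ("User didn't notice zoom feature", 5, 2),  # (description, importance, cost)
    ("Ugly icon on search button",      1, 1),
    ("Task needs too many clicks",      3, 4),
]

for desc, importance, cost in sorted(problems, key=lambda p: p[1] / p[2], reverse=True):
    ratio = importance / cost
    if ratio >= 2.0:          # cut-offs are arbitrary, for illustration only
        category = "must fix"
    elif ratio >= 1.0:
        category = "next version"
    else:
        category = "ignored"
    print(f"{desc}: I/C = {ratio:.2f} -> {category}")
```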
Refine UI
• Simple solutions vs. major redesigns
• Solve problems in order of: importance/cost
• Example:
  • Problem: user didn't know he could zoom in to see more…
  • Potential solutions:
    – Better zoom button icon, tooltip
    – Add a zoom bar slider (like Moosburg)
    – Icons for different zoom levels: boundaries, roads, buildings
    – NOT: more "help" documentation!!! You can do better.
• Iterate
  • Test, refine, test, refine, test, refine, …
  • Until? Meets usability specification
Project: Usability Evaluation
• Usability Evaluation:
  • >= 3 users: not (tainted) HCI students
  • Simple data collection (biometrics optional!)
  • Exploit this opportunity to improve your design
• Report:
  • Procedure (users, tasks, specs, data collection)
• Usability problems identified, specs not met
• Design modifications
Controlled Experiments
Usability test vs. Controlled Expm.
• Usability test:
  • Formative: helps guide design
  • Single UI, early in design process
  • Few users
  • Usability problems, incidents
  • Qualitative feedback from users
• Controlled experiment:
  • Summative: measure final result
  • Compare multiple UIs
  • Many users, strict protocol
  • Independent & dependent variables
  • Quantitative results, statistical significance
What is Science?
• Measurement
• Modeling
Scientific Method
1. Form Hypothesis
2. Collect data
3. Analyze
4. Accept/reject hypothesis
• How to "prove" a hypothesis in science?
  • Easier to disprove things, by counterexample
  • Null hypothesis = opposite of hypothesis
  • Disprove the null hypothesis
  • Hence, the hypothesis is "proved"
Empirical Experiment
• Typical question:
  • Which visualization is better in which situations?
Spotfire vs. TableLens
Cause and Effect
• Goal: determine "cause and effect"
  • Cause = visualization tool (Spotfire vs. TableLens)
  • Effect = user performance time on task T
• Procedure:
  • Vary cause
  • Measure effect
• Problem: random variation
  • Cause = vis tool OR random variation?
[Diagram: real world → collected data; random variation → uncertain conclusions.]
Stats to the Rescue
• Goal:
  • Show the measured effect is unlikely to result from random variation
• Hypothesis:
  • Cause = visualization tool (e.g. Spotfire ≠ TableLens)
• Null hypothesis:
  • Visualization tool has no effect (e.g. Spotfire = TableLens)
  • Hence: cause = random variation
• Stats:
  • If the null hypothesis were true, the measured effect would occur with probability < 5% (e.g. measured effect >> random variation)
• Hence:
  • Null hypothesis unlikely to be true
  • Hence, hypothesis likely to be true
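One way to make this logic concrete is a permutation test: if the null hypothesis holds and the tool does not matter, shuffling the tool labels should often produce a difference at least as large as the measured one. A sketch with invented timing data (this is one of several possible tests; the slides later use a t-test):

```python
# Permutation-test sketch of the null-hypothesis logic.
import random

spotfire  = [37.2, 41.0, 35.8, 39.5, 36.1]  # invented per-user times
tablelens = [29.8, 31.2, 28.5, 33.0, 30.4]
observed = abs(sum(spotfire) / len(spotfire) - sum(tablelens) / len(tablelens))

pooled = spotfire + tablelens
hits, trials = 0, 10_000
for _ in range(trials):
    random.shuffle(pooled)                   # relabel tools at random
    a, b = pooled[:len(spotfire)], pooled[len(spotfire):]
    if abs(sum(a) / len(a) - sum(b) / len(b)) >= observed:
        hits += 1

p = hits / trials
print(f"p = {p:.4f}")  # p < 0.05: the null hypothesis is unlikely
```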
Variables
• Independent Variables (what you vary), and treatments (the variable values):
  • Visualization tool » Spotfire, TableLens, Excel
  • Task type » Find, count, pattern, compare
  • Data size (# of items) » 100, 1000, 1000000
• Dependent Variables (what you measure):
  • User performance time
  • Errors
  • Subjective satisfaction (survey)
  • HCI metrics
Example: 2 x 3 design
• n users per cell
Ind Var 1: Vis. Tool (rows) × Ind Var 2: Task Type (columns); each cell holds measured user performance times (dep var):

            Task1   Task2   Task3
Spotfire
TableLens
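A 2 x 3 design is just the cross product of the treatments. A trivial sketch enumerating the six cells, with n users per cell as a placeholder value:

```python
# The 2 x 3 factorial design as a cross product of treatments.
from itertools import product

tools = ["Spotfire", "TableLens"]    # Ind Var 1: Vis. Tool
tasks = ["Task1", "Task2", "Task3"]  # Ind Var 2: Task Type
n = 20                               # users per cell (example value)

for tool, task in product(tools, tasks):
    print(f"cell ({tool}, {task}): collect {n} performance times")
```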
Groups
• "Between subjects" variable
  • 1 group of users for each variable treatment
  • Group 1: 20 users, Spotfire
  • Group 2: 20 users, TableLens
  • Total: 40 users, 20 per cell
• "Within subjects" (repeated) variable
  • All users perform all treatments
  • Counterbalance the order effect (see the sketch below)
  • Group 1: 20 users, Spotfire then TableLens
  • Group 2: 20 users, TableLens then Spotfire
  • Total: 40 users, 40 per cell
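Counterbalancing assigns half the users to each presentation order so that learning or fatigue effects cancel out across treatments. A sketch with placeholder user IDs:

```python
# Sketch of counterbalancing a within-subjects variable: randomize
# users, then give half of them each presentation order.
import random

users = [f"user{i:02d}" for i in range(40)]  # placeholder IDs
random.shuffle(users)                        # randomize group assignment

half = len(users) // 2
orders = [["Spotfire", "TableLens"], ["TableLens", "Spotfire"]]
for group, order in zip((users[:half], users[half:]), orders):
    for user in group:
        print(user, "->", " then ".join(order))
```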
Issues
• Eliminate or measure extraneous factors
• Randomized
• Fairness
  • Identical procedures, …
• Bias
• User privacy, data security
• IRB (Institutional Review Board)
Procedure
• For each user:
  • Sign legal forms
  • Pre-survey: demographics
  • Instructions
    » Do not reveal true purpose of experiment
  • Training runs
  • Actual runs
    » Give task
    » Measure performance
  • Post-survey: subjective measures
• Repeat for all n users
Data
• Measured dependent variables
• Spreadsheet:

User     Spotfire                      TableLens
         task 1   task 2   task 3     task 1   task 2   task 3
Step 1: Visualize it
• Dig out interesting facts
• Qualitative conclusions
• Guide stats
• Guide future experiments
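For instance, a quick way to eyeball every individual time per tool (rather than only averages) is a dot plot. This sketch uses matplotlib, and the data values are invented:

```python
# Dot plot of all per-user times for each tool, not just the averages.
import matplotlib.pyplot as plt

spotfire  = [37.2, 41.0, 35.8, 39.5, 36.1, 44.0, 33.2]  # invented data
tablelens = [29.8, 31.2, 28.5, 33.0, 30.4, 52.1, 27.9]

plt.plot([1] * len(spotfire), spotfire, "o", label="Spotfire")
plt.plot([2] * len(tablelens), tablelens, "o", label="TableLens")
plt.xticks([1, 2], ["Spotfire", "TableLens"])
plt.xlim(0.5, 2.5)
plt.ylabel("Perf time (secs)")
plt.show()
```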
Step 2: Stats
Average user performance times (dep var); Ind Var 1: Vis. Tool (rows) × Ind Var 2: Task Type (columns):

            Task1   Task2   Task3
Spotfire    37.2    54.5    103.7
TableLens   29.8    53.2    145.4
TableLens better than Spotfire?
• Problem with averages: lossy
  • Compares only 2 numbers
  • What about the 40 data values? (Show me the data!)

[Chart: average perf time (secs), Spotfire vs. TableLens.]
The real picture
• Need stats that compare all data

[Chart: the full distribution of perf times (secs) for Spotfire vs. TableLens, not just the averages.]
Statistics
• t-test
  • Compares 1 dep var on 2 treatments of 1 ind var
• ANOVA: Analysis of Variance
  • Compares 1 dep var on n treatments of m ind vars
• Result:
  • p = probability that the difference between treatments is random (null hypothesis)
  • "statistical significance" level
  • typical cut-off: p < 0.05
  • Hypothesis confidence = 1 - p
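The slides run the test in Excel (next slide); for reference, a minimal equivalent in Python using scipy.stats.ttest_ind, with invented per-user times:

```python
# Minimal t-test sketch: 1 dep var (perf time), 2 treatments of 1 ind var.
from scipy import stats

spotfire  = [37.2, 41.0, 35.8, 39.5, 36.1]  # invented data
tablelens = [29.8, 31.2, 28.5, 33.0, 30.4]

t, p = stats.ttest_ind(spotfire, tablelens)
print(f"t = {t:.2f}, p = {p:.4f}")
if p < 0.05:
    print("statistically significant difference between the tools")
```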
In Excel
p < 0.05
• Woohoo!
• Found a “statistically significant” difference
• Averages determine which is ‘better’
• Conclusion:
  • Cause = visualization tool (e.g. Spotfire ≠ TableLens)
  • Vis tool has an effect on user performance for task T …
  • "95% confident that TableLens is better than Spotfire …"
  • NOT "TableLens beats Spotfire 95% of the time"
• 5% chance of being wrong!
• Be careful about generalizing
p > 0.05
• Hence, no difference?
  • Vis tool has no effect on user performance for task T…?
  • Spotfire = TableLens?
• NOT!
  • Did not detect a difference, but they could still be different
  • A potential real effect did not overcome random variation
  • Provides evidence for Spotfire = TableLens, but not proof
  • Boring, basically found nothing
• How?
  • Not enough users (see the sketch below)
  • Need better tasks, data, …
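"Not enough users" can be estimated up front with a power analysis. A sketch using statsmodels; the effect size here is an assumed value, not a number from the slides:

```python
# Power-analysis sketch: how many users per group would be needed to
# detect an assumed effect with 80% power at the 0.05 level?
from statsmodels.stats.power import TTestIndPower

n = TTestIndPower().solve_power(
    effect_size=0.5,  # assumed medium effect (Cohen's d), illustration only
    alpha=0.05,       # significance cut-off
    power=0.8,        # desired chance of detecting the effect
)
print(f"need ~{n:.0f} users per group")
```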
Data Mountain
• Robertson, “Data Mountain” (Microsoft)
Data Mountain: Experiment
• Data Mountain vs. IE favorites
• 32 subjects
• Organize 100 pages, then retrieve based on cues
• Indep. vars:
  • UI: Data Mountain (old, new), IE
  • Cue: title, summary, thumbnail, all 3
• Dependent variables:
  • User performance time
• Error rates: wrong pages, failed to find in 2 min
• Subjective ratings
Data Mountain: Results
• Spatial memory!
• Limited scalability?