CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.


Transcript of CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Page 1: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

CS276A Text Information Retrieval, Mining, and Exploitation

Lecture 10, 7 Nov 2002

Page 2: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Information Access in Context

[Flow diagram relating the user, a high-level goal, information access, analyze, synthesize, and a "Done?" decision (yes → stop, no → loop back).]

Page 3: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Exercise

Observe your own information seeking behavior:
WWW
University library
Grocery store

Are you a searcher or a browser? How do you reformulate your query?
Read bad hits, then minus terms
Read good hits, then plus terms
Try a completely different query
…

Page 4: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Correction: Address Field vs. Search Box

Are users who type URLs into the search box ignorant?
.com / .org / .net / international URLs
cnn.com vs. www.cnn.com
Full URL with protocol qualifier vs. partial URL

Page 5: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Today’s Topics

Information design and visualization
Evaluation measures and test collections
Evaluation of interactive information retrieval
Evaluation gotchas

Page 6: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Information Visualization and Exploration

Tufte
Shneiderman
Information foraging: Xerox PARC / PARC Inc.

Page 7: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Edward Tufte

Information design bible: The Visual Display of Quantitative Information

The art and science of how to display (quantitative) information visually

Significant influence on User Interface design

Page 8: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

The Challenger Accident

On January 28, 1986, the space shuttle Challenger explodes shortly after takeoff.

Seven crew members die.

One of the causes: an O-ring failed due to cold temperatures.

How could this happen?

Page 9: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

How O-Rings were presented

Time scale is shown – instead of temperature scale!

“Needless junk” (rockets don’t show information)

The graphic does not help answer the question: why do O-rings fail?

Page 10: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Tufte: Principles for Information Design

Omit needless junk
Show what you mean
Don't obscure the meaning and order of scales
Make comparisons of related images possible
Claim authorship, and think twice when others don't
Seek truth

Page 11: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Tufte’s O-Ring Visualization

Page 12: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Tufte: Summary

“Like poor writing, bad graphical displays distort or obscure the data, make it harder to understand or compare, or otherwise thwart the communicative effect which the graph should convey.”

Bad decisions are made based on bad information design.

Tufte’s influence on UI design
Examples of the best and worst in information visualization:
http://www.math.yorku.ca/SCS/Gallery/noframes.html

Page 13: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Shneiderman: Information Visualization

How to design user interfaces
How to engineer user interfaces for software
Task by type taxonomy

Page 14: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Shneiderman on HCI

Well-designed interactive computer systems promote:
Positive feelings of success, competence, and mastery.
Allow users to concentrate on their work, rather than on the system.

Marti Hearst

Page 15: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Task by Type Taxonomy: Data Types

1-D linear: seesoft
2-D map: multidimensional scaling (terms, docs, etc.)
3-D world: cat-a-cone
Multi-dim: table lens
Temporal: topic detection
Tree: hierarchies a la Yahoo
Network: network graphs of sites (kartoo)

Page 16: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Task by Type Taxonomy: Tasks

Overview: gain an overview of the entire collection
Zoom: zoom in on items of interest
Filter: filter out uninteresting items
Details-on-demand: select an item or group and get details when needed
Relate: view relationships among items
History: keep a history of actions to support undo, replay
Extract: allow extraction of subcollections and the query parameters

Page 17: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Exercise

If your project has a UI component:
Which data types are being displayed?
Which tasks are you supporting?

Page 18: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Xerox PARC: Information Foraging

Metaphor from ecology/biology
People looking for information = animals foraging for food
Predictive model that allows a principled way of designing user interfaces
The main focus is:
What will the user do next?
How can we support a good choice for the next action?
Rather than:
Evaluation of a single user-system interaction

Page 19: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Foraging Paradigm

Food Foraging: Biological, behavioral, and cultural designs are adaptive to the extent they optimize the rate of energy intake.

George Robertson, Microsoft

Page 20: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Information Foraging Paradigm

Information Foraging: Information access and visualization technologies are adaptive to the extent they optimize the rate of gain of valuable information.

George Robertson, Microsoft

Page 21: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Searching Patches

George Robertson, Microsoft

Page 22: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Information Foraging: Theory

G – information/food gained
g – average gain per doc/patch
TB – total time between docs/patches
tb – average time between docs/patches
TW – total time within docs/patches
tw – average time to process a doc/patch
lambda = 1/tb – prevalence of information/food

Page 23: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Information Foraging: Theory

R = G / (TB + TW) – rate of gain
R = lambda TB g / (TB + lambda TB tw)
R = lambda g / (1 + lambda tw)
Goodness measure of UI = R = rate of gain
Optimize UI by increasing R:
Increase prevalence lambda (asymptotic improvement)
Decrease tw (time it takes to absorb doc/food)
Better model: different types of docs/patches
Model can be used to find optimal UI parameters
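To make this concrete, here is a minimal numeric sketch of the rate-of-gain formula above; the values of lambda, g, and tw are invented purely for illustration. It shows why raising prevalence only helps asymptotically (R can never exceed g/tw), whereas lowering tw raises that ceiling itself.

```python
# A sketch of the foraging rate-of-gain formula from this slide:
#   R = lambda * g / (1 + lambda * t_w)
# All numbers below are made up for illustration.

def rate_of_gain(lam, g, t_w):
    """Rate of gain R, given prevalence lam (= 1/t_b), average gain g
    per doc/patch, and average time t_w to process a doc/patch."""
    return lam * g / (1 + lam * t_w)

baseline       = rate_of_gain(lam=0.2, g=1.0, t_w=10.0)  # ~0.067
more_prevalent = rate_of_gain(lam=0.4, g=1.0, t_w=10.0)  # ~0.080, capped below g/t_w = 0.1
faster_to_read = rate_of_gain(lam=0.2, g=1.0, t_w=5.0)   # ~0.100, ceiling raised to g/t_w = 0.2

print(baseline, more_prevalent, faster_to_read)
```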

Page 24: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Cost-of-Knowledge Characteristic Function

Improve productivity: Less time or more output

Card, Pirolli, and Mackinlay

Page 25: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Creating Test Collections for IR Evaluation

Page 26: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Test Corpora

Page 27: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Kappa Measure

Kappa measures:
Agreement among coders
Designed for categorical judgments
Corrects for chance agreement

Kappa = [ P(A) – P(E) ] / [ 1 – P(E) ]
P(A) – proportion of time coders agree
P(E) – what agreement would be by chance
Kappa = 0 for chance agreement, 1 for total agreement.

Page 28: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Kappa Measure: Example

Number of docs   Judge 1       Judge 2
300              Relevant      Relevant
70               Nonrelevant   Nonrelevant
20               Relevant      Nonrelevant
10               Nonrelevant   Relevant

P(A)? P(E)?

Page 29: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Kappa Example

P(A) = 370/400 = 0.925
P(nonrelevant) = (10+20+70+70)/800 = 0.2125
P(relevant) = (10+20+300+300)/800 = 0.7875
P(E) = 0.2125^2 + 0.7875^2 = 0.665
Kappa = (0.925 – 0.665)/(1 – 0.665) = 0.776

For >2 judges: average pairwise kappas
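As a sanity check, the calculation above can be reproduced directly from the 2×2 table of judgment counts (a small sketch; the counts are those from the example slide):

```python
# Reproduce the kappa example: two judges, 400 documents.
counts = {
    ("Relevant", "Relevant"): 300,
    ("Nonrelevant", "Nonrelevant"): 70,
    ("Relevant", "Nonrelevant"): 20,
    ("Nonrelevant", "Relevant"): 10,
}
n = sum(counts.values())  # 400 documents, i.e. 800 individual judgments

# P(A): proportion of documents the two judges agree on
p_a = sum(v for (j1, j2), v in counts.items() if j1 == j2) / n

# P(E): chance agreement, from pooled marginals over all 2n judgments
p_rel = sum(v * ((j1 == "Relevant") + (j2 == "Relevant"))
            for (j1, j2), v in counts.items()) / (2 * n)
p_e = p_rel ** 2 + (1 - p_rel) ** 2

kappa = (p_a - p_e) / (1 - p_e)
print(p_a, p_e, kappa)  # 0.925, ~0.665, ~0.776
```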

Page 30: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Kappa Measure

Kappa > 0.8: good agreement
0.67 < Kappa < 0.8: “tentative conclusions” (Carletta 96)
Depends on purpose of study

Page 31: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Interjudge Disagreement: TREC 3

Page 32: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.
Page 33: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Impact of Interjudge Disagreement

Impact on absolute performance measure can be significant (0.32 vs 0.39)

Little impact on ranking of different systems or relative performance

Page 34: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Evaluation Measures

Page 35: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Recap: Precision/Recall

Evaluation of ranked results:
You can return any number of results, ordered by similarity
By taking various numbers of documents (levels of recall), you can produce a precision-recall curve

Precision: #correct&retrieved / #retrieved
Recall: #correct&retrieved / #correct
The truth, the whole truth, and nothing but the truth.
Recall 1.0 = the whole truth; precision 1.0 = nothing but the truth.
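A minimal sketch of these set-based definitions; the document IDs are hypothetical:

```python
# Precision and recall from sets of retrieved and relevant documents.
retrieved = {"d1", "d2", "d3", "d4", "d5"}    # what the system returned
relevant  = {"d1", "d3", "d6", "d7"}          # ground truth ("the whole truth")

correct_and_retrieved = retrieved & relevant

precision = len(correct_and_retrieved) / len(retrieved)  # 2/5 = 0.4 ("nothing but the truth")
recall    = len(correct_and_retrieved) / len(relevant)   # 2/4 = 0.5 ("the whole truth")
print(precision, recall)
```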

Page 36: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Recap: Precision-recall curves

Page 37: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

F Measure

The F measure is the harmonic mean of precision and recall (strictly speaking, F1)
1/F = ½ (1/P + 1/R), equivalently F1 = 2PR / (P + R)
Use the F measure if you need to optimize a single measure that balances precision and recall.
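Continuing the precision/recall sketch above, a short illustration of F1 as the harmonic mean; it is pulled toward the smaller of the two values, which is why optimizing it forces a balance between precision and recall:

```python
# F1 as the harmonic mean of precision and recall.
def f1(p, r):
    return 2 * p * r / (p + r) if p + r > 0 else 0.0

print(f1(0.4, 0.5))  # ~0.444: close to the smaller value
print(f1(0.9, 0.1))  # 0.18: high precision cannot compensate for very low recall
```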

Page 38: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

F-Measure

[Chart: Recall vs. Precision and F1 — precision and F1 plotted against recall; the F1 curve reaches its maximum of about 0.96 (annotated as F1(0.956) = max = 0.96).]

Page 39: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Breakeven Point

[Chart: Recall vs. Precision and F1, as on the previous slide, with the breakeven point marked.]

Breakeven point is the point where precision equals recall.

Alternative single measure of IR effectiveness.

How do you compute it?
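One way to answer that question, as a sketch: walk down the ranked list, compute precision and recall at each cutoff, and take the cutoff where the two are (closest to) equal. The ranking and relevance labels below are hypothetical.

```python
# Find the (approximate) breakeven point on a ranked result list.
ranking  = ["d3", "d9", "d1", "d4", "d6", "d2", "d7", "d5"]  # system output, best first
relevant = {"d1", "d3", "d6", "d7"}

best = None
hits = 0
for k, doc in enumerate(ranking, start=1):
    hits += doc in relevant
    precision, recall = hits / k, hits / len(relevant)
    if best is None or abs(precision - recall) < abs(best[1] - best[2]):
        best = (k, precision, recall)

k, p, r = best
print(f"breakeven near rank {k}: precision={p:.2f}, recall={r:.2f}")  # rank 4: 0.50, 0.50
```

Note that precision and recall always coincide at the cutoff equal to the number of relevant documents, so precision at that rank (R-precision) is a common way to report the breakeven value.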

Page 40: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Area under the ROC Curve

True positive rate = recall = sensitivity

False positive rate = fp / (tn + fp). Related to precision: fpr = 0 <-> p = 1.

Why is the blue line “worthless”?
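A sketch of the quantities on this slide: the true and false positive rates at one score cutoff, and the area under an ROC curve via the trapezoidal rule. The counts and ROC points are invented for illustration, and the "worthless" blue line is assumed to be the diagonal of random guessing, whose AUC is 0.5.

```python
# True/false positive rates at a single cutoff (hypothetical counts).
tp, fp, tn, fn = 40, 10, 90, 60
tpr = tp / (tp + fn)   # true positive rate = recall = sensitivity -> 0.4
fpr = fp / (fp + tn)   # false positive rate; fpr = 0 implies precision = 1 -> 0.1

# Area under an ROC curve: (fpr, tpr) points collected while sweeping the threshold.
roc = [(0.0, 0.0), (0.1, 0.4), (0.3, 0.7), (0.6, 0.9), (1.0, 1.0)]
auc = sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(roc, roc[1:]))
print(tpr, fpr, auc)   # random guessing (the diagonal) would give auc = 0.5
```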

Page 41: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Precision-Recall Curve vs. ROC Curve

[Chart: precision-recall curve, ROC curve, and the ROC curve's mirror image on one plot; x-axis: recall = true positive rate (for the PR curve and ROC mirror) or false positive rate (for the ROC curve).]

Page 42: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Unit of Evaluation

We can compute precision, recall, F, and ROC curve for different units.

Possible units:
Documents (most common)
Facts (used in some TREC evaluations)
Entities (e.g., car companies)

May produce different results. Why?

Page 43: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Critique of Pure Relevance

Relevance vs. marginal relevance
A document can be redundant even if it is highly relevant:
Duplicates
The same information from different sources
Marginal relevance is a better measure of utility for the user.
Using facts/entities as evaluation units more directly measures true relevance.
But harder to create an evaluation set
See Carbonell reference

Page 44: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Evaluation of Interactive Information Retrieval

Page 45: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Evaluating Interactive IR

Evaluating interactive IR poses special challenges:
Obtaining experimental data is more expensive
Experiments involving humans require careful design:
Control for confounding variables
Questionnaire to collect relevant subject data
Ensure that the experimental setup is close to the intended real-world scenario
Approval for human subjects research

Page 46: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

IIR Evaluation Case Study 1

TREC-6 interactive track report
9 participating groups (US, Europe, Australia)
Control system (simple IR system)
Each group ran their system and the control system
4 users at each site
6 queries (= topics)
Goal of evaluation: find the best performing system
Why do you need a control system for comparing groups?

Page 47: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Queries (= Topics)

Page 48: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Latin Square Design

Page 49: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Analysis of Variance

Page 50: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Analysis of Variance

Page 51: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Analysis of Variance

Page 52: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Observations

Query effect is largest: largest std for each site, high degree of query variability
Searcher effect negligible for 4 out of 10 sites
Interactions are small compared to overall error.
None of the 10 sites statistically better than the control system!

Best model:   M1  M2  M3  M4
#sites:        3   4   2   1

Page 53: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

IIR Evaluation Case Study 2

Evaluation of relevance feedback Koenemann & Belkin 1996

Page 54: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Why Evaluate Relevance Feedback?

Page 55: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Questions Being Investigated (Koenemann & Belkin 96)

How well do users work with statistical ranking on full text?
Does relevance feedback improve results?
Is user control over operation of relevance feedback helpful?
How do different levels of user control affect results?

Credit: Marti Hearst

Page 56: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

How much of the guts should the user see?

Opaque (black box) (like web search engines)

Transparent (see available terms after the r.f.)

Penetrable (see suggested terms before the r.f.)

Which do you think worked best?

Credit: Marti Hearst

Page 57: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Credit: Marti Hearst

Page 58: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Terms available for relevance feedback made visible (from Koenemann & Belkin)

Credit: Marti Hearst

Page 59: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Details on User Study (Koenemann & Belkin 96)

Subjects have a tutorial session to learn the system
Their goal is to keep modifying the query until they’ve developed one that gets high precision
This is an example of a routing query (as opposed to ad hoc)
Reweighting:
They did not reweight query terms
Instead, only term expansion:
pool all terms in rel docs
take top N terms, where N = 3 + (number-marked-relevant-docs * 2)
(the more marked docs, the more terms added to the query)

Credit: Marti Hearst
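A sketch of the term-expansion rule just described: pool all terms from the documents the user marked relevant and add the top N, with N = 3 + 2 × (number of marked relevant documents). Ranking the pooled terms by frequency is an assumption made here for illustration; the slide only says the top N terms are taken.

```python
from collections import Counter

def expansion_terms(marked_relevant_docs):
    """marked_relevant_docs: list of documents, each a list of terms."""
    n = 3 + 2 * len(marked_relevant_docs)
    pool = Counter(term for doc in marked_relevant_docs for term in doc)
    return [term for term, _ in pool.most_common(n)]  # frequency ranking is an assumption

docs = [
    ["car", "recall", "defect", "brake"],
    ["recall", "automobile", "safety", "brake"],
]
print(expansion_terms(docs))  # up to N = 3 + 2*2 = 7 terms added to the query
```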

Page 60: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Details on User Study (Koenemann & Belkin 96)

64 novice searchers: 43 female, 21 male, native English speakers
TREC test bed: Wall Street Journal subset
Two search topics:
Automobile Recalls
Tobacco Advertising and the Young
Relevance judgements from TREC and experimenter
System was INQUERY (vector space with some bells and whistles)

Credit: Marti Hearst

Page 61: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Sample TREC query

Credit: Marti Hearst

Page 62: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Evaluation

Precision at 30 documents
Baseline (Trial 1):
How well does the initial search go?
One topic has more relevant docs than the other
Experimental condition (Trial 2):
Subjects get a tutorial on relevance feedback
Modify query in one of four modes: no r.f., opaque, transparent, penetrable

Credit: Marti Hearst

Page 63: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Precision vs. RF condition (from Koenemann & Belkin 96)

Credit: Marti Hearst

Can we conclude from this chart that RF is better?

Page 64: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Effectiveness Results

Subjects with R.F. performed 17-34% better than those with no R.F.

Subjects in the penetrable case did 15% better as a group than those in the opaque and transparent cases.

Credit: Marti Hearst

Page 65: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Number of iterations in formulating queries (from Koenemann & Belkin 96)

Credit: Marti Hearst

Page 66: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Number of terms in created queries (from Koenemann & Belkin 96)

Credit: Marti Hearst

Page 67: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Behavior Results

Search times approximately equal
Precision increased in first few iterations
Penetrable case required fewer iterations to make a good query than transparent and opaque
R.F. queries much longer, but fewer terms in penetrable case -- users were more selective about which terms were added in.

Credit: Marti Hearst

Page 68: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Evaluation Gotchas

No statistical test (!)
Lots of pairwise tests
Wrong evaluation measure
Query variability
Unintentionally biased evaluation

Page 69: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Gotchas: Evaluation Measures

KDD Cup 2002
Optimize model parameter: balance factor
Area under ROC curve and BEP have different behaviors
These two measures intuitively measure the same property.

[Charts for the KDD Cup 2002 Yeast Gene Dataset: (a) breakeven points and (b) area under the ROC curve, each plotted against B (balance factor) for the narrow (1.3%), control (1.5%), and broad (2.8%) task variants; additional panel labels: average, std dev, random.]

Page 70: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Gotchas: Query variability

Eichmann et al. claim that for their approach to CLIR French is harder than Spanish.

French average precision: 0.149
Spanish average precision: 0.173

Page 71: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Gotchas: Query variability

Queries with Spanish > baseline: 14
Queries with Spanish = baseline: 40
Queries with Spanish < baseline: 53
Queries with French > baseline: 20
Queries with French = baseline: 22
Queries with French < baseline: 64

Page 72: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Gotchas: Biased Evaluation

Compare two IR algorithms:
1. send query, present results
2. send query, cluster results, present clusters
Experiment was simulated (no users):
Results were clustered into 5 clusters
Clusters were ranked according to percentage of relevant documents
Documents within clusters were ranked according to similarity to query

Page 73: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Sim-Ranked vs. Cluster-Ranked

Does this show superiority of cluster ranking?

Page 74: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Relevance Density of Clusters

Page 75: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Summary

Information visualization: a good visualization is worth a thousand pictures. But making information visualization work for text is hard.
Evaluation measures: F measure, break-even point, area under the ROC curve
Evaluating interactive systems is harder than evaluating algorithms.
Evaluation gotchas: begin with the end in mind

Page 76: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Resources

FOA 4.3
MIR Ch. 10.8 – 10.10
Ellen Voorhees, Variations in Relevance Judgments and the Measurement of Retrieval Effectiveness, ACM SIGIR 98.
Harman, D.K. Overview of the Third REtrieval Conference (TREC-3). In: Overview of The Third Text REtrieval Conference (TREC-3). Harman, D.K. (Ed.). NIST Special Publication 500-225, 1995, pp. 1-19.
Jean Carletta, "Assessing agreement on classification tasks: the kappa statistic", Computational Linguistics 22(2):249-254, 1996.
Marti A. Hearst, Jan O. Pedersen, Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval Results, Proceedings of SIGIR-96, 1996.
http://gim.unmc.edu/dxtests/ROC3.htm
Pirolli, P. and Card, S. K. (1999). Information Foraging. Psychological Review 106(4): 643-675.
Paul Over, TREC-6 Interactive Track Report, NIST, 1998.

Page 77: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.

Resources

http://www.acm.org/sigchi/chi96/proceedings/papers/Koenemann/jk1_txt.htm
http://otal.umd.edu/olive
Jaime Carbonell, Jade Goldstein, The use of MMR, diversity-based reranking for reordering documents and producing summaries, Proceedings of the 21st annual international ACM SIGIR conference on Research and Development in Information Retrieval, pp. 335-336, August 24-28, 1998, Melbourne, Australia.