Page 1

© Paul Kantor 2002

A Potpourri of topics
Paul Kantor

• Project overview and cartoon

• How we did at TREC this year

• Generalized Performance Plots

• Remarks on the formal model of decision

Page 2

Rutgers DIMACS: Automatic Event Finding in Streams of Messages

[Cartoon: two workflows deliver documents to the analysts.]

Retrospective/Supervised/Tracking:
1. Accumulated documents
2. Unexpected event
3. Initial Profile
4. Guided Retrieval
5. Clustering
6. Revision and Iteration
7. Track new documents

Prospective/Unsupervised/Detection:
1. Accumulated documents
2. Clustering
3. Initial Profile
4. Anticipated event
5. Guided Retrieval

Page 3

Communication

• The process converges….

• Central limit theorem …

• What???

• Pretty good fit

• Confidence levels

• What???

• And so on

Page 4

Measures of performance: Effectiveness

• 1. Batch post-hoc learning. Here there is a large set of already discovered documents, and the system must learn to recognize future instances from the same family.

• 2. Adaptive learning of defined profiles. Here there is a small group of "seed documents," and thereafter the system must learn while it works. Realistic measures must penalize the system for sending the human experts documents that are of no interest to any analyst.

• 3. Discovery of new regions of interest. Here the focus is on unexpected patterns of related documents, which are far enough from established patterns to warrant sending them for human evaluation.

Page 5

Measures of performance: Effectiveness

• Efficiency is measured in both the time and space resources required to accomplish a given level of effectiveness. Results are best visualized in a set of two- or three-dimensional plots, as suggested on the following page.

Page 6

Efficiency-Effectiveness Plots

[Plot: y-axis = Measure of Effectiveness (up to 100%); x-axis = Measure of Time Required (best baseline method / method plotted, up to 100%). Quadrant labels: "Strong and slow", "Strong and fast", "Weak but fast", and "Not good enough for government work".]
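As a rough sketch (not the original figure), the following Python snippet draws such a plot for a few made-up methods, taking the x-coordinate to be the relative speed (best baseline time / method time) and the y-coordinate to be effectiveness:

```python
import matplotlib.pyplot as plt

# Sketch only: made-up methods placed on an efficiency-effectiveness plot.
# x = best baseline time / method time (1.0 means as fast as the best baseline),
# y = measure of effectiveness.
methods = {"A": (0.9, 0.85), "B": (0.3, 0.9), "C": (0.95, 0.4), "D": (0.2, 0.3)}

fig, ax = plt.subplots()
for name, (speed, eff) in methods.items():
    ax.scatter(speed, eff)
    ax.annotate(name, (speed, eff))
ax.axhline(0.5, linestyle="--")   # informal boundary between "strong" and "weak"
ax.axvline(0.5, linestyle="--")   # informal boundary between "fast" and "slow"
ax.set_xlabel("Time required: best baseline method / method plotted")
ax.set_ylabel("Measure of effectiveness")
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
plt.show()
```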

Page 7

The process

[Diagram: N incoming documents arrive, of which G are relevant; our system sends n of them to the analyst; the analyst reports that g of those are relevant.]

Page 8

Typical Effectiveness measures

• Basic Concepts:

• Precision p = g/n, where g = the number of relevant documents flagged by our system and n = the number of documents the analyst must examine.

• Recall R = g/G, where G = the total number that "should" be sent to the analyst, that is, the number of relevant documents.

– F-measures: the harmonic mean of precision and recall.

• 1/F = a/p + (1-a)/R = (1/g)(a·n + (1-a)·G), so

• F = g/[a·n + (1-a)·G]

• There is no persuasive argument for using this.

• In TREC 2002, a = 0.8: a 4:1 weighting of precision over recall.
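As a concrete check of these formulas, here is a minimal Python sketch; the counts g, n, G are hypothetical:

```python
# Hypothetical counts: 30 of the 50 documents our system flagged are relevant,
# and 80 relevant documents exist in total.
g, n, G = 30, 50, 80

precision = g / n                      # p = g/n = 0.60
recall = g / G                         # R = g/G = 0.375

a = 0.8                                # TREC 2002 weighting
F = g / (a * n + (1 - a) * G)          # F = g/[a*n + (1-a)*G] = 30/56 ≈ 0.536
print(precision, recall, F)
```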

Page 9

Typical measures used

• Utility-Based measures
– Pure Measure: U = v·g - c·(n - g) = -c·n + g·(v + c)
– Note that sending irrelevant documents drives the score negative. Here v = 2, c = 1.
– "Training Wheels": to protect groups from having to report astronomically negative results, U is replaced by
– T11SU = [max{U/(2G), -0.5} + 0.5] / 1.5
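A minimal Python sketch of these utility measures with the TREC 2002 settings v = 2, c = 1 (the counts below are hypothetical):

```python
def pure_utility(g: int, n: int, v: float = 2.0, c: float = 1.0) -> float:
    """U = v*g - c*(n - g): credit for relevant documents sent, cost for irrelevant ones."""
    return v * g - c * (n - g)

def t11su(U: float, G: int) -> float:
    """T11SU = [max{U/(2G), -0.5} + 0.5] / 1.5, which rescales U into [0, 1]."""
    return (max(U / (2.0 * G), -0.5) + 0.5) / 1.5

U = pure_utility(g=30, n=50)     # 2*30 - 1*20 = 40
print(U, t11su(U, G=80))         # 40 and (0.25 + 0.5)/1.5 = 0.5
```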

Page 10

How we have done: TREC2002

• Disclaimers and caveats
– We report here only on those results that were achieved and validated at the TREC 2002 conference. These were done primarily to convince ourselves that we can manage the entire data pipeline, and were not selected to represent the best conceptual approaches we can think of.

Page 11

Disclaimers and caveats (cont).

• The TREC Adaptive rules are quite confusing to newcomers. It appears, from conference and post-conference discussions, that the two top-ranked systems may not have followed the same set of rules as the other competitors. If this is the case, our results are actually better than those reported here.

Page 12

Using measure T11SU

• Adaptive: Assessor topics - 9th among all 14 teams; 7th among those known to have followed the rules.

• Intersection topics - 7th among all 14 teams; 5th among those known to have followed the rules.

• Batch: 6th among all 10 groups on Assessor topics; 3rd among all 10 groups on Intersection topics. Scored above the median on almost all topics; tops on 24 of 50.

Page 13

Fusion of Methods

Paul Kantor and Dmitriy Fradkin (supported in part by ARDA)

Page 14

Fusion Models

• Each of several systems gives scores to documents; call these s_j(d). Can these be combined so that the resulting score is a more accurate indication of the relevance of the document? The underlying mathematical concept is the conditional score distribution f(s, h) = Prob(document has score s, given relevance h), where the "hypothesis" h = R, N ("Relevant" or "Not").
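The slides do not fix a particular combination rule. As one illustrative sketch, if each system's conditional distributions f_j(s | h) are known (or estimated) and the systems are naively treated as independent given relevance, the scores can be fused by summing log-likelihood ratios; everything below (the Gaussian score models, the prior) is a made-up assumption:

```python
import math
from typing import Callable, Sequence

def fused_log_odds(scores: Sequence[float],
                   f_rel: Sequence[Callable[[float], float]],
                   f_non: Sequence[Callable[[float], float]],
                   prior_rel: float = 0.01) -> float:
    """log P(R | s_1..s_k) - log P(N | s_1..s_k), assuming independent systems."""
    log_odds = math.log(prior_rel / (1.0 - prior_rel))
    for s, fr, fn in zip(scores, f_rel, f_non):
        log_odds += math.log(fr(s)) - math.log(fn(s))
    return log_odds

def gauss(mu: float, sigma: float) -> Callable[[float], float]:
    """A hypothetical Gaussian model for a system's conditional score density."""
    return lambda s: math.exp(-0.5 * ((s - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Two systems score a document 0.7 and 0.4; fuse the evidence into one log-odds score.
print(fused_log_odds([0.7, 0.4],
                     f_rel=[gauss(0.8, 0.2), gauss(0.6, 0.3)],
                     f_non=[gauss(0.3, 0.2), gauss(0.2, 0.3)]))
```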

Page 15

Tools

• We have built visualization tools to show these two distributions. It can be shown that all decision making needs to know only the so-called ROC curve, which is invariant under any monotone change of the score variable. We have also built tools that show the ROC.

• The simplest form gives a curve with coordinates (d(t), f(t)), defined on the next page.

Page 16

ROC

• d(t) = Prob(score > t | document relevant)

• f(t) = Prob(score > t | document not relevant)
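A minimal Python sketch (not from the slides) that estimates these two functions empirically from hypothetical scored, judged documents, sweeping the threshold t over the observed scores:

```python
def empirical_roc(scores, labels):
    """Return (d(t), f(t)) points, in the order used above; labels: 1 = relevant, 0 = not."""
    n_rel = sum(labels)
    n_non = len(labels) - n_rel
    points = []
    for t in sorted(set(scores)):
        d_t = sum(1 for s, y in zip(scores, labels) if s > t and y == 1) / n_rel
        f_t = sum(1 for s, y in zip(scores, labels) if s > t and y == 0) / n_non
        points.append((d_t, f_t))
    return points

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2]   # made-up system scores
labels = [1,   1,   0,   1,   0,   0,   1  ]   # made-up relevance judgments
print(empirical_roc(scores, labels))
```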

Page 17

Score Distributions

Page 18

ROC Display Applet

Page 19

ROC Display Applet

Page 20

Formal Models

David Madigan and Paul Kantor

Page 21

Formal Models

• The BinWorld model

• Some heuristic ideas

Page 22

BinWorlds: very simple models

• Documents live in some number (L) of bins. Some bins have only (b) irrelevant (bad) documents; a few also have (g) relevant (good) documents. Documents are delivered randomly from the world, labeled only by their bin numbers. The work has a horizon H, with a payoff v for good documents sent to be judged and a cost c for bad documents sent to be judged. We consider a hierarchy of models. For example, if only one bin contains good documents, the optimum strategy is either to QUIT, or to continue until seeing one good document and thereafter submit only documents from that bin to be judged.

• The expected value of the game is given by:

• EV = -CostToLearnRightBin + GainThereafter.

• Since the expected time to learn the right bin is 1 + L·b/g,

• EV = -c·(1 + L·b/g) + (H - (1 + L·b/g))·(v·g - c·b)/(b + g).

• Increasing the horizon H increases EV, while increasing the number of candidate bins, L, makes the game harder.
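A small Python sketch evaluating this EV formula for hypothetical parameter values; the second call illustrates the last point, that a larger number of candidate bins L lowers the expected value:

```python
def binworld_ev(L: int, b: float, g: float, H: int, v: float = 2.0, c: float = 1.0) -> float:
    """EV = -c*(1 + L*b/g) + (H - (1 + L*b/g)) * (v*g - c*b)/(b + g)."""
    learn_time = 1.0 + L * b / g            # expected time to learn the right bin
    gain_rate = (v * g - c * b) / (b + g)   # expected gain per document thereafter
    return -c * learn_time + (H - learn_time) * gain_rate

print(binworld_ev(L=5, b=1, g=1, H=200))    # small L:  EV = -6 + 194*0.5 = 91.0
print(binworld_ev(L=50, b=1, g=1, H=200))   # larger L: EV = -51 + 149*0.5 = 23.5
```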

Page 23

The essential math

• However, if we have failed once on a bin, perhaps it is not wise to test it again.

• At any step on the way to the horizon H, the decision maker can know only these things:

• The judgments on submitted documents, and the stage at which they were submitted. Let j_i = j(b, i) be the judgment received when a document from bin b was submitted at time step i.

Page 24

The challenge

• As a result of these judgments, the decision maker has a present Bayesian estimate of the chance that each bin is the right bin.

• Can we find a simple and effective heuristic based on the available history j_1 … j_i and the time remaining, H = K - i?
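As an illustrative sketch only, assume a document from the right bin is judged good with probability p = g/(b+g) and a document from any other bin is never judged good. Then the decision maker's estimate can be updated after each judgment j(b, i) as follows:

```python
def update_posterior(posterior, bin_idx, judged_good, p):
    """Return the new P(bin k is the right bin) after one judgment on bin_idx."""
    new = []
    for k, prior in enumerate(posterior):
        if k == bin_idx:
            likelihood = p if judged_good else (1.0 - p)
        else:
            likelihood = 0.0 if judged_good else 1.0
        new.append(prior * likelihood)
    total = sum(new)
    return [x / total for x in new]

posterior = [1 / 5] * 5                                    # 5 bins, uniform prior
posterior = update_posterior(posterior, 0, False, p=0.2)   # bin 0 submitted, judged bad
posterior = update_posterior(posterior, 1, True,  p=0.2)   # bin 1 submitted, judged good
print(posterior)                                           # all mass now on bin 1
```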

Page 25

Example Heuristic

[Chart: two series (yellow and mauve) plotted over time steps 1-190; the y-axis runs from about -120 to 80.]

• 5 bins, p = 0.2.

• Heuristic: if the current bin is the one with the largest number of failures to date, do not send its document for judgment. (Yellow line; see the sketch below.)

• It gains slowly until the correct bin is discovered.

• The alternative is to submit always. (Mauve line.)
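The following is a rough simulation sketch of the comparison, under assumed model details (one good bin, p = 0.2, payoff v = 2, cost c = 1); it is illustrative only, not the simulation behind the chart:

```python
import random

def simulate(policy: str, steps: int = 200, L: int = 5, p: float = 0.2,
             v: int = 2, c: int = 1, seed: int = 0) -> int:
    rng = random.Random(seed)
    failures = [0] * L
    found = None                       # bin known to contain good documents
    gain = 0
    for _ in range(steps):
        b = rng.randrange(L)           # a document arrives, labeled only by its bin
        good = (b == 0 and rng.random() < p)   # only bin 0 ever yields good documents
        if found is not None and b != found:
            continue                   # once the right bin is known, submit only from it
        if policy == "skip_worst" and found is None and failures[b] == max(failures) > 0:
            continue                   # heuristic: skip the bin with the most failures so far
        gain += v if good else -c      # the analyst's judgment determines the payoff
        if good:
            found = b
        else:
            failures[b] += 1
    return gain

print(simulate("always"), simulate("skip_worst"))
```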

Page 26

Future work

• Such a heuristic should exist, because the decision rule must be of the form: if the current estimate that a bin is the right one is below some critical value, don't submit it.

• Note: this is "obvious but not yet proved."

• In more complex models, the chance of success in the right bin (g), the number of bins L, and even the number of good bins may be unspecified.