Approximate Randomization tests


Page 1: Approximate Randomization  tests

Approximate Randomization tests

February 5th, 2013

Page 2: Approximate Randomization  tests

Classic t-test

Page 3: Approximate Randomization  tests

Why AR testing?

• Classic tests often assume a given distribution (Student's t, normal, …) of the variable

• This is approximately OK for recall, but not for precision or F-score

• The set of hypotheses that can be tested with non-parametric tests is limited

Page 4: Approximate Randomization  tests

Illustration

• 30,000 runs, 1000 instances, 500 of class A

• True positives (TP): 400 (stdev: 80)

• False positives (FP): 60 (stdev: 15)

• Assumption: true and false positives for class A are normally distributed. This is already an approximation, since TP and FP are bounded by 0 and the number of instances.

Page 5: Approximate Randomization  tests

Definitions

• Recall = correctly predicted A / A in reference = TP / constant. If TP is normal, recall is normal.

• Precision = correctly predicted A / A in system = TP / (TP + FP). This is a non-linear combination of TP and FP. Precision is not normal.

• F-score: non-linear combination of recall and precision. Not normal.
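
The effect of these definitions can be checked with a small simulation. What follows is a minimal sketch in Python, assuming the parameters of the illustration (30,000 runs, 500 instances of class A, TP ~ N(400, 80), FP ~ N(60, 15)). The `skewness` helper, the seed, and the choice not to clip TP and FP to the valid range are assumptions made here, not part of the slides.

```python
import random
import statistics

def skewness(values):
    """Third standardized moment; close to 0 for a symmetric (e.g. normal) sample."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

def simulate(runs=30_000, n_class_a=500, seed=0):
    """Draw TP and FP per run and derive recall, precision and F-score."""
    rng = random.Random(seed)
    tp = [rng.gauss(400, 80) for _ in range(runs)]
    fp = [rng.gauss(60, 15) for _ in range(runs)]
    recall = [t / n_class_a for t in tp]               # TP / constant: linear in TP
    precision = [t / (t + f) for t, f in zip(tp, fp)]  # ratio: non-linear in TP and FP
    f_score = [2 * p * r / (p + r) for p, r in zip(precision, recall)]
    return recall, precision, f_score

recall, precision, f_score = simulate()
for name, values in (("recall", recall), ("precision", precision), ("F-score", f_score)):
    print(f"{name:9s} skewness = {skewness(values):+.3f}")
```

Recall is a rescaled copy of TP, so its skewness stays near zero, while the ratio in precision should produce a clearly asymmetric distribution.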

Page 11: Approximate Randomization  tests

Approximate randomization test

• No assumption on distribution

• Can handle complicated statistics

• Only assumption: independence between shuffled elements

• References:
– Computer Intensive Methods for Testing Hypotheses, Noreen, 1989.
– More accurate tests for the statistical significance of result differences, Yeh, 2000.

Page 12: Approximate Randomization  tests

Basic idea

• Exact randomization test

           Glass 1   Glass 2   Glass 3   Glass 4
Contents   Polish    Premium   Russian   Budget
Expert     Polish    Premium   Budget    Russian

Page 13: Approximate Randomization  tests

Exact probability

H0: expert is independent of contents

P(ncorrect ≥ 2) = 7/24 = 0.29

Thus, do not reject H0 because the probability is larger than alpha=0.05.
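
The exact probability can be reproduced by enumerating all 4! = 24 ways of assigning the expert's answers to the glasses. A minimal sketch, assuming the glass example from the previous slide; the function names are illustrative only.

```python
from fractions import Fraction
from itertools import permutations

contents = ["Polish", "Premium", "Russian", "Budget"]
expert = ["Polish", "Premium", "Budget", "Russian"]

def n_correct(guess, truth):
    """Number of glasses for which the guess matches the contents."""
    return sum(g == t for g, t in zip(guess, truth))

def exact_p_value(guess, truth):
    """Share of all permutations of the guesses that score at least as well as the actual guess."""
    actual = n_correct(guess, truth)
    stats = [n_correct(perm, truth) for perm in permutations(guess)]
    n_ge = sum(s >= actual for s in stats)
    return Fraction(n_ge, len(stats))

p = exact_p_value(expert, contents)
print(p, float(p))  # 7/24, about 0.29
```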

Page 14: Approximate Randomization  tests

Approximate probability

• The number of permutations is n!, so it grows very quickly with n

• If there are too many permutations to compute, approximate: P = (nge + 1) / (NS + 1)
– nge: number of times the pseudostatistic ≥ the actual statistic
– NS: number of shuffles
– +1: correction for validity
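
The formula can be sketched on the glass example, replacing the full enumeration with NS random shuffles. This is a minimal illustration, not the code of the script referenced in these slides; the default of 9999 shuffles and the fixed seed are assumptions.

```python
import random

contents = ["Polish", "Premium", "Russian", "Budget"]
expert = ["Polish", "Premium", "Budget", "Russian"]

def approximate_p_value(guess, truth, n_shuffles=9999, seed=0):
    """P = (nge + 1) / (NS + 1), estimated from random shuffles of the guesses."""
    rng = random.Random(seed)
    actual = sum(g == t for g, t in zip(guess, truth))
    shuffled = list(guess)
    n_ge = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        pseudo = sum(g == t for g, t in zip(shuffled, truth))
        if pseudo >= actual:   # pseudostatistic at least as extreme as the actual one
            n_ge += 1
    return (n_ge + 1) / (n_shuffles + 1)

print(approximate_p_value(expert, contents))
```

With enough shuffles the estimate should land close to the exact value 7/24.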

Page 15: Approximate Randomization  tests

DIFFERENT SETUPS

Page 16: Approximate Randomization  tests

Translation to instances

• Each glass is an instance

• Contents and expert are two labeling systems

• Contents has an accuracy of 100%, expert has an accuracy of 50%

• Statistic is precision, F-score, recall, … instead of accuracy

Page 17: Approximate Randomization  tests

Stratified shuffling

• For labeled instances, it makes no sense to move the class label of one instance to another instance

• Only shuffle labels per instance, i.e. swap the labels that the two systems assigned to the same instance
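
Stratified shuffling can be sketched as follows: for every instance, the labels of the two systems are swapped with probability 0.5, and the statistic is recomputed on the shuffled outputs. This is a minimal sketch under stated assumptions: the statistic is the absolute difference in precision for one label (a two-sided test), precision is taken to be 0 when the label is never predicted, and all names are illustrative.

```python
import random
from fractions import Fraction

def precision(predicted, reference, label):
    """Precision for `label`; taken to be 0 when the label is never predicted."""
    hits = [r == label for p, r in zip(predicted, reference) if p == label]
    return Fraction(sum(hits), len(hits)) if hits else Fraction(0)

def stratified_art(reference, system1, system2, label, n_shuffles=9999, seed=0):
    """Approximate randomization test with per-instance label swaps."""
    rng = random.Random(seed)

    def diff(a, b):
        return abs(precision(a, reference, label) - precision(b, reference, label))

    actual = diff(system1, system2)
    n_ge = 0
    for _ in range(n_shuffles):
        a, b = [], []
        for l1, l2 in zip(system1, system2):
            if rng.random() < 0.5:   # swap the two labels of this instance only
                l1, l2 = l2, l1
            a.append(l1)
            b.append(l2)
        if diff(a, b) >= actual:
            n_ge += 1
    return (n_ge + 1) / (n_shuffles + 1)
```

Labels never move from one instance to another, so the reference label of each instance keeps its meaning under every shuffle.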

Page 18: Approximate Randomization  tests

MBT

• Assumption of independence between instances

• Shuffle per sentence rather than per token

        System 1   System 2
This    DT         NNS
is      VBZ        VB
nice    JJ         RB
.       .          .
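
Since the tags within a sentence depend on each other, the unit that is swapped between the systems is the whole sentence. A minimal sketch of that shuffle step, assuming each system's output is a list of sentences and each sentence is a tuple of tags; the second sentence in the example data is hypothetical.

```python
import random

def shuffle_per_sentence(system1, system2, rng):
    """Swap the two systems' outputs for whole sentences, each with probability 0.5."""
    new1, new2 = [], []
    for tags1, tags2 in zip(system1, system2):
        if rng.random() < 0.5:
            tags1, tags2 = tags2, tags1   # all tags of the sentence move together
        new1.append(tags1)
        new2.append(tags2)
    return new1, new2

# First sentence taken from the slide, second sentence hypothetical.
system1 = [("DT", "VBZ", "JJ", "."), ("PRP", "VBZ", "RB", ".")]
system2 = [("NNS", "VB", "RB", "."), ("PRP", "VBP", "RB", ".")]
shuffled1, shuffled2 = shuffle_per_sentence(system1, system2, random.Random(0))
```

The statistic (for example tagging accuracy against a reference) is then recomputed on the shuffled outputs, as in any other randomization test.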

Page 19: Approximate Randomization  tests

Term extraction

• Shuffle the extracted terms between the outputs of two term extraction systems

Reference System 1 System 2

happy happy sad

good good

lively happy

angry
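
One possible reading of this shuffle, given here as an assumption rather than as the behaviour of the script: every candidate term is treated as an instance, and swapping the two systems for that term only changes something when exactly one system extracted it. The example sets below are hypothetical and are not meant to reproduce the table above.

```python
import random

def shuffle_terms(terms1, terms2, rng):
    """Terms found by both systems stay in both; the others go to either system with probability 0.5."""
    common = terms1 & terms2
    new1, new2 = set(common), set(common)
    for term in sorted(terms1 ^ terms2):   # sorted so a fixed seed gives a fixed result
        if rng.random() < 0.5:
            new1.add(term)
        else:
            new2.add(term)
    return new1, new2

# Hypothetical outputs of two term extraction systems.
terms1 = {"happy", "good", "lively"}
terms2 = {"happy", "sad"}
shuffled1, shuffled2 = shuffle_terms(terms1, terms2, random.Random(0))
```

Precision, recall or F-score against the reference term list can then be recomputed on the shuffled sets to obtain the pseudostatistic.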

Page 20: Approximate Randomization  tests

Script

• http://www.clips.ua.ac.be/~vincent/software.html#art

• http://www.clips.ua.ac.be/scripts/art

• Options:
– Exact and approximate randomization tests
– Instance based, also for MBT
– Term extraction based
– Stratified shuffling
– Two-sided / one-sided (check code!)

Page 21: Approximate Randomization  tests

Remarks on usage

• It makes no sense to shuffle if exact randomization can be computed

• The value of p depends on NS: the smallest attainable p is 1 / (NS + 1), so the larger NS, the lower p can be

• Validity check
– Sign test
– Re-test: to alleviate bad randomization

Page 22: Approximate Randomization  tests

Sign test

• Its result can be compared with P for accuracy

• H0: correctness is independent of system, i.e. P(green) = 0.5

• Binomial test

[Figure: System 1 / System 2]
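
A sign test of this kind can be sketched with a plain binomial computation. The sketch assumes that only instances on which exactly one of the two systems is correct are counted, and that the test is two-sided; both choices, and the function name, are assumptions made for illustration.

```python
from math import comb

def sign_test(reference, system1, system2):
    """Two-sided binomial test of H0: P(system 1 is the correct one) = 0.5."""
    triples = list(zip(reference, system1, system2))
    wins1 = sum(a == r and b != r for r, a, b in triples)
    wins2 = sum(b == r and a != r for r, a, b in triples)
    n = wins1 + wins2            # instances where exactly one system is correct
    if n == 0:
        return 1.0
    k = max(wins1, wins2)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

print(sign_test(["A"] * 5, ["A"] * 5, ["B"] * 5))  # 0.0625
```

For accuracy, this p-value and the one from the randomization test should be of similar size; a large gap is a reason to re-test.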

Page 23: Approximate Randomization  tests

Interpretation (1)

Reference   System 1   System 2
A           A          B
B           A          B
C           A          B

How much do these two systems differ based on precision for the A label?

– Maximally
– Intermediate
– Minimally

Page 24: Approximate Randomization  tests

Interpretation (2)

Labels (System 1, System 2)     Precision for A
A     B     C     System 1   System 2   Δ
AB    AB    AB    1/3        0          1/3
BA    AB    AB    0          1          -1
AB    AB    BA    1/2        0          1/2
BA    BA    AB    0          1/2        -1/2
BA    AB    BA    0          1/2        -1/2
AB    BA    BA    1          0          1
BA    BA    BA    0          1/3        -1/3
AB    BA    AB    1/2        0          1/2
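
The table can be regenerated by enumerating all 2³ = 8 stratified shuffles of the three instances. A minimal sketch, assuming (as in the table) that precision is 0 when a system never predicts A; the one-sided and two-sided p-values at the end are computed here for illustration and do not appear on the slide.

```python
from fractions import Fraction
from itertools import product

reference = ["A", "B", "C"]
system1 = ["A", "A", "A"]
system2 = ["B", "B", "B"]

def precision(predicted, reference, label="A"):
    """Precision for `label`; 0 when the label is never predicted, as in the table."""
    hits = [r == label for p, r in zip(predicted, reference) if p == label]
    return Fraction(sum(hits), len(hits)) if hits else Fraction(0)

def all_stratified_shuffles(system1, system2):
    """Yield every combination of per-instance swaps; the first one is 'no swaps'."""
    for swaps in product([False, True], repeat=len(system1)):
        a = [l2 if s else l1 for l1, l2, s in zip(system1, system2, swaps)]
        b = [l1 if s else l2 for l1, l2, s in zip(system1, system2, swaps)]
        yield a, b

deltas = [precision(a, reference) - precision(b, reference)
          for a, b in all_stratified_shuffles(system1, system2)]
actual = deltas[0]
p_one_sided = Fraction(sum(d >= actual for d in deltas), len(deltas))
p_two_sided = Fraction(sum(abs(d) >= abs(actual) for d in deltas), len(deltas))
print(deltas, p_one_sided, p_two_sided)
```

Under these assumptions every shuffle gives an absolute difference of at least 1/3, so the actual difference is the smallest one that can occur.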

Page 25: Approximate Randomization  tests

Conclusion

• Approximate randomization testing can be used for many applications.

• The basic idea is to check how (im)probable the actual difference between two systems is among the differences obtained when all possible permutations of the outputs are evaluated.

• The difference can be computed in many ways, as long as the shuffled elements are independent.