New Timo Berthold - Zuse Institute Berlin · 2020. 9. 7. · © 2019 Fair Isaac Corporation....

© 2019 Fair Isaac Corporation. Confidential. This presentation is provided for the recipient only and cannot be reproduced or shared without Fair Isaac Corporation’s express consent. 1

© 2019 Fair Isaac Corporation. Confidential.

This presentation is provided for the recipient only and cannot be reproduced or shared without Fair Isaac Corporation’s express consent.

Timo Berthold



• Performance variability

• Statistical tests

• Best practices


0,25

0,5

1

2

4

8

16

32

64ra

il50

7n

s12

08

40

0g

mu

-35

-40

sa

telli

tes

1-2

5e

ilB1

01

ran

16

x16

cs

ch

ed

01

0s

p9

8ic

roll3

00

0m

ik-2

50

-1-1

00

-1n

eo

s1

3a

flo

w4

0b

no

sw

ot

ne

wd

an

om

ine

-90

-10

ac

c-t

igh

t5n

eo

s-1

10

98

24

ne

os

-84

97

02

da

no

int

ne

os

-16

01

93

6u

nit

ca

l_7

tim

tab

1b

na

tt3

50

pg

5_3

4iis

-bu

pa

-co

vn

eo

s1

8n

s18

30

65

3b

ley_

xl1

ms

pp

16

n3

seq

24

ma

cro

ph

ag

erm

ine

6m

sc

98

-ip

iis-1

00

-0-c

ov

op

m2

-z7

-s2

bie

ns

t2b

inka

r10

_1c

ov1

07

5d

fn-g

win

-UU

Mle

cts

ch

ed

-4-o

bj

ns1

68

83

47

ns1

76

60

74

pig

eo

n-1

0q

iurm

atr

10

0-p

5n

3d

iv3

6n

eo

s-4

76

28

3m

ap

18

pw

-myc

iel4

roc

II-4

-11

ma

p2

0n

et1

2n

eo

s-9

34

27

8rm

atr

10

0-p

10

ns1

75

89

13

min

e-1

66

-5iis

-pim

a-c

ov

ap

p1

-2n

etd

ive

rsio

nm

cs

ch

ed

ne

os

-13

96

12

5a

sh6

08

gp

ia-3

co

lm

zzv1

1zi

b5

4-U

UE

be

as

leyC

3re

blo

ck6

7b

iella

1e

il33

-2n

eo

s-9

16

79

2ro

co

co

C1

0-0

01

00

0n

4-3

sp

98

irtr

ipti

m1

tan

gle

gra

m2

gla

ss4

co

re2

53

6-6

91

ex9

ne

os

-68

61

90

air

04

ba

b5

ne

os

-13

37

30

7vp

ph

ard

en

ligh

t13

en

ligh

t14

m1

00

n5

00

k4r1

tan

gle

gra

m1

30

n2

0b

8

Single solver version-to-version improvement: appx. 30%. HOWEVER:

• ~40% of the instances got slower (~50% got faster)

• Performance drop by up to a factor of 4 (versus improvements of up to factor 45)

Picture courtesy of Thorsten Koch.

Test set: MIPLIB2010 Benchmark (87 instances)

© 2018 Fair Isaac Corporation. Confidential. 5

• Performance variability is a change in performance by seemingly performance-neutral changes, like changing the

• Input format

• Essentially, this leads to a permutation of the problem

• Operating system (Does NOT occur with Xpress)

• Different number of threads (can be overcome by control setting in Xpress)

• This occurs even though the underlying software is deterministic

• Reasons are imperfect tie-breaking, numerical round-off differences on different platforms…

• Related: Adding redundant information (E.g., a constraint σ𝑖=1𝑛 𝑥𝑖 ≤ 𝑛, 𝑥𝑖 ∈ {0,1})


• Performance variability

• Can indicate improvements where there are none

• Can shadow improvements

• Can lead to wrong conclusions (“on some instances our method was bad, but on some, it was good”)

• Running Xpress with ten different random permutations:

• 118/240 instances have a variability of more than a factor of 2

• For 30, it is more than a factor of 10

• Most extreme case: 0.02 seconds versus timeout

• For large test sets, the effect averages out

• Still 4% difference between „best“ and „worst“ permutation on MIPLIB2017


• Performance variability can be utilized to

• Distinguish between structural improvements and random noise

• The more permutations/seeds you run, the clearer the picture for each single instance

• Mimic performance on unknown instances from same application

• Very, VERY often, customers give you one single example

• Performance variability gives you a poor man’s parallelization:

• Fire off X permutations of the problem, stop when the first one solves

• Doesn’t scale very well ;-)


• Performance variability for benchmarking is typically induced by

• Permuting the matrix: High variability 😊 but structural performance “loss” 😒

• Also: more cache misses

• Random seeds: Lower variability 😒 but preserves average performance 😊

• New method: Cyclic shifts

• Light-weight permutation that shifts the rows and columns

• Only two „breakpoints“

• High variability 😊 and preserves average performance😊

• Can be combined with random seeds

• As a solver developer, after each release:

• We change the benchmarking permutation seeds

• The offset to the random seed


• To test a hypothesis, we compare it to the null-hypothesis: “There is no such relationship”

• When the null-hypothesis can be rejected with high probability, the result is deemed significant

• Essentially, this expresses the confidence in the result. How safe is it to draw conclusions from the test results?

• “How likely is it that a random draw would have delivered the same (or an even more extreme) result?”

• Typically, p-values less than 5% are considered significant.

• For a normal distribution, 95% are within mean +/- two standard deviations

• There is a huge variety of statistical tests, require different assumptions, e.g., concerning the distribution. In our case typically: distribution-free (non-parametric)

• Distinguish between nominal data (yes/no) and rational data


• Applied to 2x2 contigency table of paired nominal data

• E.g., does the solver find a solution with/without a certain setting (yes/no, yes/no)

• Considers the cases where both differ (the counter diagonal of the table)

• Compute 𝜒2 =𝑏−𝑐 2

𝑏+𝑐

• For random drawed 𝑏, 𝑐, this would follow a chi-squared-distribution.

• Lookup p-value

• Well-suited for categorical data (found a solution yes/no, solved to optimality yes/no)


• Non-parametric difference test for paired rational data

• Sorts observations by absolute value of the logarithm of their ratio

• Assigns ranks from 1 to n

• Split into positive and negative group, compute rank sums

• The more different those sums, the more likely that the setting with the larger sum outperforms the one with the lower sum

• Wilcoxon statistic: 𝑧 =min 𝑊−,𝑊+ −

𝑁(𝑁+1)

4

𝑁(𝑁+1)(2𝑁+1)

24

• Would follow a normal distribution for randomly drawn data

• Well-suited for numerical data (runtime, number of nodes, PDI)


• Choose a suitable test set

• Are there standard test sets in your field?

• Use large test sets

• Use diverse test sets (within your target domain)

• Use „real“ instances if possible, not randomly generated

• Be careful, what to exclude and how to split test sets

• NEVER EVER do „easy“ vs. „hard“ only w.r.t. one „baseline“ setting

• Some measures might only make sense on „all optimal“ or „all timeout“

• Report number of solved instances

• Explain why certain instances had to be excluded and how many were affected


• Report machine specification, software versions and working limits used

• Make fair comparisons

• What is the state-of-the-art?

• Not only software, but also models

• Does the benchmark solver have the same purpose (e.g., heuristics vs exact solvers)


• State the question(s) that you want to answer and the means of doing so

• @Reviewers: This might not be the same question and the same methodology that you would have chosen....

• Run on otherwise idle machines

• Run at most one job per CPU (and bind to CPU)

• Never trust your own results

• Investigate outliers, negative and positive ones

• Use methods like permutations, statistical tests etc to double-check your results

• Use diverse measures

• Also report negative results


• Performance variability describes

a) Large runtime differences caused by seemingly neutral changes

b) A method to tell whether a result is significant

c) A performance measure

• What is NOT a best practice for computational experiments?

a) Use diverse machine setups

b) Use diverse performance measures

c) Run on otherwise idle machines

• When comparing runtimes, the significance should be checked by

a) A McNemar test

b) A Cyclic shift test

c) A Wilcoxon signed rank test


© 2019 Fair Isaac Corporation. Confidential.

This presentation is provided for the recipient only and cannot be reproduced or shared without Fair Isaac Corporation’s express consent.

Timo Berthold

New Timo Berthold - Zuse Institute Berlin · 2020. 9. 7. · © 2019 Fair Isaac Corporation....

Documents

Transcript of New Timo Berthold - Zuse Institute Berlin · 2020. 9. 7. · © 2019 Fair Isaac Corporation....