How did this get published? Pitfalls in experimental evaluation of computing systems
Transcript of How did this get published? Pitfalls in experimental evaluation of computing systems
[Slide 1]
How did this get published? Pitfalls in experimental evaluation of computing systems
José Nelson Amaral, University of Alberta
Edmonton, AB, Canada
[Slide 2]
- Thing #1: Aggregation
- Thing #2: Learning
- Thing #3: Reproducibility
- SPEC Research
- Evaluate
[Slide 3]
http://archive.constantcontact.com/fs042/1101916237075/archive/1102594461324.html
http://bitchmagazine.org/post/beyond-the-panel-an-interview-with-danielle-corsetto-of-girls-with-slingshots
So, a computing scientist entered a Store….
[Slide 4]
So, a computing scientist entered a Store….
$ 3,000.00 (server)
$ 200.00 (iPod)
“They want $2,700 for the server and $100 for the iPod. I will get both and pay only $2,240 altogether!”
[Slide 5]
So, a computing scientist entered a Store….
“Ma’am, you are $560 short.”
http://www.businessinsider.com/10-ways-to-fix-googles-busted-android-app-market-2010-1?op=1
$ 3,000.00
$ 200.00
“But the average of 10% and 50% is 30%, and 70% of $3,200 is $2,240.”
[Slide 6]
So, a computing scientist entered a Store….
“Ma’am, you cannot take the arithmetic average of percentages!”
$ 3,000.00
$ 200.00
“But… I just came from a top CS conference in San Jose where they do it!”
[Slide 7]
The Problem with Averages
[Slide 8]
A Hypothetical Experiment

| Benchmark | Baseline (min) | Transformed (min) |
|---|---|---|
| benchA | 1 | 5 |
| benchB | 2 | 10 |
| benchC | 10 | 2 |
| benchD | 20 | 4 |

*With thanks to Iain Ireland
[Slide 9]
Speedup
[Slide 10]
Performance Comparison

| Benchmark | Speedup |
|---|---|
| benchA | 0.2 |
| benchB | 0.2 |
| benchC | 5 |
| benchD | 5 |
| Arith Avg | 2.6 |

The transformed system is, on average, 2.6 times faster than the baseline!
[Slide 11]
Normalized Time
[Slide 12]
Normalized Time

| Benchmark | Normalized Time |
|---|---|
| benchA | 5 |
| benchB | 5 |
| benchC | 0.2 |
| benchD | 0.2 |
| Arith Avg | 2.6 |

The transformed system is, on average, 2.6 times slower than the baseline!
[Slide 13]
Latency × Throughput
- What matters is latency: [timeline diagram]
- What matters is throughput: [timeline diagram]
[Slide 14]
Aggregation for Latency: Geometric Mean
[Slide 15]
Speedup

| Benchmark | Speedup |
|---|---|
| benchA | 0.2 |
| benchB | 0.2 |
| benchC | 5 |
| benchD | 5 |
| Arith Avg | 2.6 |
| Geo Mean | 1 |

The performance of the transformed system is, on average, the same as the baseline!
[Slide 16]
Normalized Time

| Benchmark | Normalized Time |
|---|---|
| benchA | 5 |
| benchB | 5 |
| benchC | 0.2 |
| benchD | 0.2 |
| Arith Avg | 2.6 |
| Geo Mean | 1.0 |

The performance of the transformed system is, on average, the same as the baseline!
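The arithmetic-mean contradiction, and the geometric mean's consistency, can be checked directly. A minimal sketch using the slides' hypothetical benchmark times (the variable and function names are mine):

```python
import math

# Hypothetical per-benchmark execution times (minutes) from the slides.
baseline    = {"benchA": 1, "benchB": 2, "benchC": 10, "benchD": 20}
transformed = {"benchA": 5, "benchB": 10, "benchC": 2, "benchD": 4}

def arith_mean(xs):
    return sum(xs) / len(xs)

def geo_mean(xs):
    # n-th root of the product of n values.
    return math.prod(xs) ** (1 / len(xs))

# Speedup = baseline time / transformed time; normalized time is its reciprocal.
speedups = [baseline[b] / transformed[b] for b in baseline]          # 0.2, 0.2, 5, 5
normalized_times = [transformed[b] / baseline[b] for b in baseline]  # 5, 5, 0.2, 0.2
```

The arithmetic mean of the speedups and the arithmetic mean of the normalized times are both 2.6, so the same data "shows" the transformed system is both 2.6× faster and 2.6× slower. The geometric mean gives 1.0 either way, because the geometric mean of reciprocals is the reciprocal of the geometric mean.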
[Slide 17]
Aggregation for Throughput

| Benchmark | Baseline (min) | Transformed (min) |
|---|---|---|
| benchA | 1 | 5 |
| benchB | 2 | 10 |
| benchC | 10 | 2 |
| benchD | 20 | 4 |
| Arith Avg | 8.25 | 5.25 |

The throughput of the transformed system is, on average, 1.6 times faster than the baseline.
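When throughput is what matters, a sensible aggregate is total work over total time, i.e. the ratio of summed execution times. A minimal sketch with the slides' numbers (the helper name is mine):

```python
# Hypothetical execution times (minutes) from the slides.
baseline    = [1, 2, 10, 20]
transformed = [5, 10, 2, 4]

def throughput_ratio(old_times, new_times):
    """Aggregate for throughput: compare the total time needed to
    finish the whole workload under each system."""
    return sum(old_times) / sum(new_times)
```

Here the whole workload takes 33 minutes on the baseline and 21 minutes transformed, so the ratio is 33/21 ≈ 1.57, the "1.6 times faster" on the slide. Note this equals the ratio of the arithmetic averages (8.25/5.25), which is why the arithmetic mean of raw times is the right tool for throughput.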
![Page 18: How did this get published? Pitfalls in experimental evaluation of computing systems](https://reader035.fdocuments.us/reader035/viewer/2022062521/56816694550346895dda775b/html5/thumbnails/18.jpg)
The Evidence• A careful reader will find the use of arithmetic average to
aggregate normalized numbers in many top CS conferences.
• Papers that have done that have appeared in:– LCTES 2011– PLDI 2012 (at least two papers)– CGO 2012
• A paper where the use of the wrong average changed a negative conclusion into a positive one.
– 2007 SPEC Workshop• A methodology paper by myself and a student that won the best paper
award.
[Slide 19]
This is not a new observation…
Communications of the ACM, March 1986, pp. 218-221.
[Slide 20]
No need to dig up dusty papers…
[Slide 21]
So, the computing scientist returns to the Store….
“???”
$ 3,000.00
$ 200.00
“Hello. I am just back from Beijing. Now I know that we should take the geometric average of percentages.”
[Slide 22]
So, a computing scientist entered a Store….
$ 3,000.00
$ 200.00
“Thus I should get √(10% × 50%) = 22.36% discount and pay 0.7764 × $3,200 = $2,484.48.”
“Sorry, Ma’am, we don’t average percentages…”
[Slide 23]
So, a computing scientist entered a Store….
$ 3,000.00
$ 200.00
“The original price is $3,200. You pay $2,700 + $100 = $2,800. If you want an aggregate summary, your discount is 400/3,200 = 12.5%.”
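The clerk's rule generalizes: apply each percentage to its own quantity, aggregate the dollar amounts, and only then derive a single summary percentage. A minimal sketch (the names are mine, not from the slides):

```python
prices    = {"server": 3000.00, "ipod": 200.00}
discounts = {"server": 0.10, "ipod": 0.50}  # 10% off and 50% off

def total_due(prices, discounts):
    # Apply each discount to its own item, then sum the dollar amounts.
    return sum(price * (1 - discounts[item]) for item, price in prices.items())

def aggregate_discount(prices, discounts):
    # The only meaningful summary discount: total saved over total original price.
    original = sum(prices.values())
    return (original - total_due(prices, discounts)) / original
```

This yields $2,800 due and a 12.5% aggregate discount, matching the slide; neither the arithmetic (30%) nor the geometric (22.36%) average of the two percentages gives the right answer, because the percentages apply to different bases.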
[Slide 24]
Thing #2: Learning
Disregard for methodology when using automated learning
[Slide 25]
Example: Evaluation of Feedback-Directed Optimization (FDO)
[Slide 26]
http://www.orchardoo.com
We have: a compiler, application code, and application inputs.
We want to measure the effectiveness of an FDO-based code transformation.
[Slide 27]
[Diagram: the compiler, run with -I, turns the application code into instrumented code; running the instrumented code on the training-set inputs produces a profile.]
[Slide 28]
[Diagram: the compiler, run with -FDO and the profile, produces optimized code; running it on a single input from the evaluation set yields one performance number, leaving the remaining application inputs un-evaluated.]
The FDO transformation produces code that is XX faster for this application.
![Page 29: How did this get published? Pitfalls in experimental evaluation of computing systems](https://reader035.fdocuments.us/reader035/viewer/2022062521/56816694550346895dda775b/html5/thumbnails/29.jpg)
The Evidence
• Many papers that use a single input for training and a single input for testing appeared in conferences (notably CGO).
• For instance, a paper that uses a single input for training and a single input for testing appears in:– ASPLOS 2004
[Slide 30]
[Diagram: the same -FDO pipeline, but the optimized code is run on every input in the evaluation set, yielding one performance measurement per input.]
[Slide 31]
[Diagram: several training-set inputs each produce a profile; the profiles feed the -FDO compilation, and a separate evaluation set is used for measurement.]
Cross-Validated Evaluation (Berube, SPEC07)
Combined Profiling (Berube, ISPASS12)
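Cross-validated evaluation can be sketched as leave-one-out over the workload inputs; `train` and `measure` below are hypothetical stand-ins for the profile-then-FDO-compile step and the timing step:

```python
def cross_validated_performance(inputs, train, measure):
    """Leave-one-out cross-validation over inputs: each input is
    measured with a binary whose profile it did NOT help train."""
    results = []
    for held_out in inputs:
        # Profile and compile using every input except the one under test.
        training_set = [i for i in inputs if i != held_out]
        binary = train(training_set)
        results.append(measure(binary, held_out))
    return results
```

Every input gets evaluated, yet no input is ever part of its own training data, which avoids the train-on-test mistake flagged in the talk.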
[Slide 32]
[Diagram: the same input is used both to generate the profile and to measure the optimized code.]
Wrong Evaluation!
[Slide 33]
The Evidence
- For instance, a paper that incorrectly uses the same input for training and testing appeared in:
  - PLDI 2006
[Slide 34]
Thing #3: Reproducibility
Expectation: When reproduced, an experimental evaluation should produce similar results.
![Page 35: How did this get published? Pitfalls in experimental evaluation of computing systems](https://reader035.fdocuments.us/reader035/viewer/2022062521/56816694550346895dda775b/html5/thumbnails/35.jpg)
IssuesThing #3
Reproducibility
Availability of code, data, and precise descriptionof experimental setup.
Lack of incentives for reproducibility studies.
Have the measurements been repeated a sufficientnumber of time to capture measurement variations?
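The last issue is addressed by reporting a mean together with a confidence interval over repeated runs, instead of a single measurement. A minimal sketch assuming roughly normal measurement noise (the function name is mine):

```python
import statistics

def summarize_runs(run_times, z=1.96):
    """Mean and approximate 95% confidence half-width for repeated
    measurements of the same experiment (z=1.96 for ~95%)."""
    mean = statistics.mean(run_times)
    # Standard error of the mean: sample stdev over sqrt(n).
    half_width = z * statistics.stdev(run_times) / len(run_times) ** 0.5
    return mean, half_width
```

For example, five runs of 10.1, 9.9, 10.0, 10.2, and 9.8 seconds summarize as roughly 10.0 ± 0.14 s; if a competing system's interval overlaps, the measured difference may be noise rather than a real improvement.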
![Page 36: How did this get published? Pitfalls in experimental evaluation of computing systems](https://reader035.fdocuments.us/reader035/viewer/2022062521/56816694550346895dda775b/html5/thumbnails/36.jpg)
ProgressThing #3
Reproducibility
Program committees/reviewers starting to askquestions about reproducibility.
Steps toward infrastructure to facilitate reproducibility.
![Page 37: How did this get published? Pitfalls in experimental evaluation of computing systems](https://reader035.fdocuments.us/reader035/viewer/2022062521/56816694550346895dda775b/html5/thumbnails/37.jpg)
SPEC
Research14 industrial organizations
20 universities or research institutes
http://research.spec.org/
SPEC Research Group
[Slide 38]
SPEC Research Group
- Performance Evaluation
- Benchmarks for New Areas
- Performance Evaluation Tools
- Evaluation Methodology
- Repository for Reproducibility
http://research.spec.org/
http://icpe2013.ipd.kit.edu/
[Slide 39]
Evaluate Collaboratory
- Anti-Patterns
- Evaluation in CS education
- Open Letter to PC Chairs
http://evaluate.inf.usi.ch/
[Slide 40]
Parting Thoughts….
Creating a culture that enables full reproducibility seems daunting…
Initially we could aim for: a reasonable expectation by a reasonable reader that, if reproduced, the experimental evaluation would produce similar results.