
Java Benchmarking
as easy as two timestamps

Aleksey Shipilёv
aleksey.shipilev@oracle.com, @shipilev

The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.


Intro


Intro: Warming up...

«How much for instantiating a String?»

long time1 = System.nanoTime();
for (int i = 0; i < 1000; i++) {
    String s = new String("");
}
long time2 = System.nanoTime();
System.out.println("Time: " + (time2 - time1));
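A quick way to see why this is treacherous: run the same timed loop several times in one JVM and watch the numbers shift as the JIT warms up and, eventually, eliminates the dead allocation. A minimal sketch of my own (the run and loop counts are arbitrary choices, not from the slides):

public class NaiveStringBench {
    public static void main(String[] args) {
        for (int run = 0; run < 10; run++) {
            long time1 = System.nanoTime();
            for (int i = 0; i < 1000; i++) {
                String s = new String("");
            }
            long time2 = System.nanoTime();
            // Times typically drop across runs as the JIT compiles the loop,
            // and may collapse to near zero once the unused allocation
            // is optimized away entirely.
            System.out.println("Run " + run + ": " + (time2 - time1) + " ns");
        }
    }
}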


Theory


Theory: Why would people benchmark?

In the name of...
1. Holywar: Node.js – But Java... – Node.js!
2. Marketing: check we are meeting the (release) criteria
3. Engineering: isolate a performance phenomenon, make a reference point for improvements
4. Science: understand the performance model, and predict the future behavior

Theory: In the name of Holywar

My favorite example: the Computer Language Benchmarks Game¹

- Most comparisons are hardly fair: e.g. AOT vs. JIT
- Measures what, exactly? E.g. pidigits measures the speed of FFI to GNU GMP
- Lots of disclaimers that these results misrepresent the real world (alas, nobody reads them or cares enough)
- People love it, since it gives you numbers, which you can then take as your shield and sword in Internet debates

¹ http://benchmarksgame.alioth.debian.org/

Theory: In the name of Marketing

My favorite example: SPEC benchmarks

- Reference benchmark suites, agreed upon by the vendors
- Provide the reference points for which one can set the success criteria, use in adverts, tweet obnoxious competitive data, etc.
- It does not matter how representative they are – it matters that they are The Benchmarks Born at the Fiery Summit of Orodruin

Theory: In the name of Engineering

«If you can’t measure it, you can’t optimize it»

- Need the conditions where the system is running in a predictable state, so we are able to quantify improvements
- These benchmarks usually focus on particular pieces of the system, and have more resolution than «marketing» benchmarks

Theory: In the name of Science

«Science Town PD: To Explain and Predict»

- Derive a sound performance model from the results
- Use the performance model to predict future behavior: keep calm and deploy to production
- The most sweaty, and the most reliable, target for benchmarking


Theory: «Scientific» approach

Ultimate Question

How does a benchmark react to changing the external conditions?

Or: how far is the actual performance model from the mental one?

1. Fool-proofing: do these results even make any sense?
2. Negative control: the benchmark reacts to a change, but shouldn’t?
3. Positive control: the benchmark should react to a change, but doesn’t?
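To make the negative control concrete: introduce a parameter that should be irrelevant to the workload, and check that the score stays put when it changes. A minimal JMH sketch of that idea (the padding parameter is my illustration, not from the slides):

import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
public class ControlBench {
    // Negative control: this parameter should not affect the score.
    // If the score moves with it, the experimental setup is suspect.
    @Param({"1", "1000"})
    int irrelevantPadding;

    int v;

    @Benchmark
    public int work() {
        return v++;
    }
}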

Theory: «Engineering» approach

Ultimate Question

Why doesn’t my benchmark run faster?

Directly observe if our experimental setup is sane:
1. Where are the bottlenecks?
2. Do we expect those things to be bottlenecks?
3. Are these benchmarks running in the same mode?

Theory: JMH

JMH is a Serious Business:
http://openjdk.java.net/projects/code-tools/jmh/

- When used properly, helps to mitigate VM quirks
- Aids running lots of benchmarks in different conditions
- Internal profiling to quickly triage the issues
- JVM language support: Java, Scala, Groovy, Kotlin
- ...or anything else callable from Java (e.g. Nashorn, etc.)
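For reference, here is roughly what the String example from the intro looks like as a JMH benchmark. A minimal sketch, assuming the JMH annotations and runtime are on the classpath; it would be run from the generated benchmarks.jar, as with the perfasm example later in the deck:

import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
public class StringBench {
    @Benchmark
    public String newString() {
        // Returning the object hands it to the harness,
        // so the JIT cannot eliminate the allocation as dead code.
        return new String("");
    }
}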

Scientific


Scientific: Story

In this section, we explore some of the methodological implications of doing benchmarks. People tend to think this story is a deal-breaker when trying to build their own benchmark harnesses.

The complete story and narrative is here:
http://shipilev.net/blog/2014/nanotrusting-nanotime/

Models: Model Problem

«Jessie, it’s time to cook some benchmarks...»

«What is the cost of a volatile write?»

It seems like a very easy question... Let’s measure it! Shall we?

Models: Easy...

public class VolatileWrite {
    int v; volatile int vv;

    @Benchmark
    int baseline1() { return 42; }

    @Benchmark
    int incrPlain() { return v++; }

    @Benchmark
    int incrVolatile() { return vv++; }
}

Models: ...does it!

public class VolatileWrite {
    int v; volatile int vv;

    @Benchmark
    int baseline1() { return 42; }      // 2.0 ns

    @Benchmark
    int incrPlain() { return v++; }     // 3.5 ns

    @Benchmark
    int incrVolatile() { return vv++; } // 15.1 ns
}

Models: Fatal Flaw

volatile int vv;

@Benchmark
int incrVolatile() { return vv++; }

- We are measuring a very unfavorable case, when the benchmark is choked by volatiles. We are pushing the system to its «edge» condition. This almost never happens in production.
- What we really need to know is: «What is the volatile cost in realistic conditions?»

Models: Backoffs

@Param int tokens;

volatile int vv;

@Benchmark
int incrVolatile() {
    Blackhole.consumeCPU(tokens); // burn time
    return vv++;
}

- «Burn off» a few cycles before doing the heavy-weight op
- Juggle tokens ⇒ juggle the operation mix
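Assembled into a complete class, the backoff experiment might look like the sketch below; the parameter values swept here are my choice, not from the slides:

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Thread)
public class VolatileBackoff {
    @Param({"0", "1", "2", "4", "8", "16", "32"})
    int tokens;

    int v;
    volatile int vv;

    @Benchmark
    public int incrPlain() {
        Blackhole.consumeCPU(tokens); // burn roughly `tokens` worth of time
        return v++;
    }

    @Benchmark
    public int incrVolatile() {
        Blackhole.consumeCPU(tokens);
        return vv++;
    }
}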

Models: Backoffs

- Take a few baselines while we are at it: which one is correct?

@Benchmark
void baseline_Plain()
    { BH.consumeCPU(tokens); }

@Benchmark
int baseline_Return42()
    { BH.consumeCPU(tokens); return 42; }

@Benchmark
int baseline_ReturnPlain()
    { BH.consumeCPU(tokens); return v; }

Models: Measuring...

«Bender B. Rodriguez regrets using Excel to draw the charts»

[Chart: nsec/op (0–100) vs. backoff (0–30) for baseline_Plain, baseline_Return42, baseline_ReturnV, incrPlain, incrVolatile]

Models: Subtracting baseline_Plain

- Absolute volatile cost gets compensated very well!
- Can we really subtract the baselines?

[Chart: nsec/op (−5 to 15) vs. backoff (0–30) for baseline_Plain, incrPlain, incrVolatile]

Models: Subtracting baseline_Return42

- We added some code in the baseline, and it runs faster?
- Nothing surprising: performance is not usually composable

[Chart: nsec/op (−5 to 15) vs. backoff (0–30) for baseline_Return42, incrPlain, incrVolatile]

Models: WTF is different?

[Chart: nsec/op vs. backoff for baseline_Plain, incrPlain, incrVolatile]

@Benchmark
void base_Plain() {
    BH.consumeCPU(tkns);
}

[Chart: nsec/op vs. backoff for baseline_Return42, incrPlain, incrVolatile]

@Benchmark
int base_Ret42() {
    BH.consumeCPU(tkns);
    return 42;
}

Models: WTF is different?

[Chart: nsec/op vs. backoff for baseline_ReturnV, incrPlain, incrVolatile]

@Benchmark
int base_RetV() {
    BH.consumeCPU(tkns);
    return v;
}

[Chart: nsec/op vs. backoff for baseline_Return42, incrPlain, incrVolatile]

@Benchmark
int base_Ret42() {
    BH.consumeCPU(tkns);
    return 42;
}

Models: Bottom Line

- Different baselines act differently: they are tests themselves!
- Therefore, we can just compare plain and volatile:

[Chart: nsec/op (−5 to 15) vs. backoff (0–30) for incrPlain, incrVolatile]

Models: Conclusion

This is what models are for!

- Explore the system behavior outside the (randomly) chosen configuration points
- Allow predicting the system behavior under future conditions
- Catch experimental setup problems (control!)
- Combinatorial experiments help to create different operation mixes, and derive the individual op costs from their composite performance

Models: You Are Joking, Right?

«Combinatorial experiments help to create different operation mixes, and derive the individual op costs from their composite performance»

System.nanoTime! Measure each part individually!

Timers: Verifying infrastructure

Why not?

// call continuously
public long measure() {
    long startTime = System.nanoTime();
    work();
    return System.nanoTime() - startTime;
}

Timers: Measuring Latency

Latency = time to call System.nanoTime()

@Benchmark
public long latency_nanotime() {
    return System.nanoTime();
}

Timers: Measuring Granularity

Granularity = the minimum non-zero difference between two consecutive calls

private long lastValue;

@Benchmark
public long granularity_nanotime() {
    long cur;
    do {
        cur = System.nanoTime();
    } while (cur == lastValue);
    lastValue = cur;
    return cur;
}
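Outside JMH, the same two probes can be approximated with a plain main() to get a feel for the timer on a given machine. A quick-and-dirty sketch of my own, not a substitute for the harness (loop counts are arbitrary):

public class TimerProbe {
    public static void main(String[] args) {
        // Latency: average cost of a System.nanoTime() call.
        int calls = 1_000_000;
        long sink = 0;
        long start = System.nanoTime();
        for (int i = 0; i < calls; i++) {
            sink += System.nanoTime();
        }
        long end = System.nanoTime();
        System.out.println("latency ~ " + (end - start) / calls + " ns/call");

        // Granularity: minimum non-zero difference between consecutive readings.
        long min = Long.MAX_VALUE;
        long last = System.nanoTime();
        for (int i = 0; i < calls; i++) {
            long cur = System.nanoTime();
            if (cur != last) {
                min = Math.min(min, cur - last);
                last = cur;
            }
        }
        // Printing the sink keeps the first loop from being dead code.
        System.out.println("granularity ~ " + min + " ns (sink=" + sink + ")");
    }
}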

Timers: Typical Case [Linux]

Java(TM) SE Runtime Environment, 1.7.0_45-b18
Java HotSpot(TM) 64-Bit Server VM, 24.45-b08
Linux, 3.13.8-1-ARCH, amd64

Running with 1 threads and [-client]:
  granularity_nanotime: 26.300 +- 0.205 ns
  latency_nanotime:     25.542 +- 0.024 ns

Running with 1 threads and [-server]:
  granularity_nanotime: 26.432 +- 0.191 ns
  latency_nanotime:     26.276 +- 0.538 ns

Timers: Typical Case [Solaris]

Java(TM) SE Runtime Environment , 1.8.0- b132Java HotSpot(TM) 64-Bit Server VM, 25.0-b70SunOS , 5.11, amd64

Running with 1 threads and [-client ]:granularity_nanotime: 29.322 +- 1.293 ns

latency_nanotime: 29.910 +- 1.626 ns

Running with 1 threads and [-server ]:granularity_nanotime: 28.990 +- 0.019 ns

latency_nanotime: 30.862 +- 6.622 ns

Slide 35/75. Copyright c○ 2014, Oracle and/or its affiliates. All rights reserved.

Timers: Typical Case [Windows]

Java(TM) SE Runtime Environment , 1.7.0_51 -b13Java HotSpot(TM) 64-Bit Server VM, 24.51-b03Windows 7, 6.1, amd64

Running with 1 threads and [-client ]:granularity_nanotime: 371,419 +- 1,541 ns

latency_nanotime: 14,415 +- 0,389 ns

Running with 1 threads and [-server ]:granularity_nanotime: 371,237 +- 1,239 ns

latency_nanotime: 14,326 +- 0,308 ns

Slide 36/75. Copyright c○ 2014, Oracle and/or its affiliates. All rights reserved.

Timers: Epic Case [Windows]

Java(TM) SE Runtime Environment , 1.8.0- b132Java HotSpot(TM) 64-Bit Server VM, 25.0-b70Windows Server 2008, 6.0, amd64

Running with 32 threads and [-client ]:granularity_nanotime: 15137.504 +- 97.132 ns

latency_nanotime: 15190.080 +- 1760.500 ns

Running with 32 threads and [-server ]:granularity_nanotime: 15118.159 +- 121.671 ns

latency_nanotime: 15176.690 +- 1504.406 ns

Slide 37/75. Copyright c○ 2014, Oracle and/or its affiliates. All rights reserved.

Timers: Model Experiment

- But if System.nanoTime() is heavy and potentially non-scaling, can we run the system into oblivion?
- Let’s figure out when it starts to Detroit:

@Param
int backoff;

@Benchmark
public long nanotime() {
    Blackhole.consumeCPU(backoff);
    return System.nanoTime();
}

Timers: Seems OK [Linux]

[Chart: System.nanoTime() latency vs. backoff [Linux]; y: System.nanoTime + backoff, nsec (10^1–10^5, log scale); x: threads (0–48); series: log10(backoff) = 0–4]

Timers: Double U. Tee. Eff. [Windows]

[Chart: System.nanoTime() latency vs. backoff [Windows]; y: System.nanoTime + backoff, nsec (10^1–10^5, log scale); x: threads (0–32); series: log10(backoff) = 0–4]

Timers: Paying for Monotonicity [Solaris]

[Chart: System.nanoTime() latency vs. backoff [Solaris]; y: System.nanoTime + backoff, nsec (10^1–10^5, log scale); x: threads (0–32); series: log10(backoff) = 0–4]

Timers: Typical Case [Mac OS X]

Java(TM) SE Runtime Environment, 1.8.0-b132
Java HotSpot(TM) 64-Bit Server VM, 25.0-b70
Mac OS X, 10.9.2, x86_64

Running with 1 threads and [-server]:
  granularity_nanotime: 1009.623 +- 2.140 ns
  latency_nanotime:       44.145 +- 1.449 ns

Running with 4 threads and [-server]:
  granularity_nanotime: 1044.703 +- 32.103 ns
  latency_nanotime:       56.111 +- 3.397 ns

Timers: Summing Up

System.nanoTime is the new String.intern!

- Giving users nanoTime is handing over a loaded gun
- nanoTime may and should be used only in selected cases, when you can foresee all the disadvantages
- Most frequently, the direct measurement is not available, and we have to derive the models from collateral evidence

Timers: Stop Kidding Already?

Our code blocks are heavy enough to keep nanoTime() granularity and latency at bay!

Omission: Heavy Benchmark is Heavy

public long measure() {
    long ops = 0;
    long startTime = System.nanoTime();
    while (!isDone) {
        setup(); // want to skip this
        work();
        ops++;
    }
    return ops / (System.nanoTime() - startTime);
}

Omission: Measuring the Separate Block

public long measure() {
    long ops = 0;
    long realTime = 0;
    while (!isDone) {
        setup(); // skip this
        long time = System.nanoTime();
        work();
        realTime += (System.nanoTime() - time);
        ops++;
    }
    return ops / realTime;
}

Omission: Checking Empty setup()...

Measuring the throughput... it grows past the CPU count?!

[Chart: throughput, ops/us (0–600) vs. # threads (0–32); series: external loop timestamps, sum over per-iteration timestamps]

Omission: Hint

public long measure() {
    long ops = 0;
    long realTime = 0;
    while (!isDone) {
        setup(); // skip this
        long time = System.nanoTime();
        work();
        realTime += (System.nanoTime() - time);
        ops++;
        // ...WHOOPS, WE DE-SCHEDULE HERE...
    }
    return ops / realTime;
}

Omission: Basic Example

- Measuring the operation time, 10 ms/op on average ⇒ each i-th thread thinks its individual throughput is λᵢ = 100 ops/sec
- We have two threads, and therefore ∑ᵢ₌₁ᴺ λᵢ = 200 ops/sec

Omission: A Fistful of Threads More

- Each thread still believes λᵢ = 100 ops/sec!
- Now we have four threads ⇒ ∑ᵢ₌₁ᴺ λᵢ = 400 ops/sec


Omission: Conclusion

"Phillip J. Fry is experiencingthe major safepoint event"

Timers skip the beats, andmay grossly

under/overestimate thedurations.

� Every performance metric thatincludes time is at fault

� Very easy to blow up onoverloaded systems

� Very easy to blow up whenmeasurers coordinate withworkload

Slide 51/75. Copyright c○ 2014, Oracle and/or its affiliates. All rights reserved.

S.S.: (TGIF) Thank God It’s Fibonacci

Is there a problem, officer?

public class FibonacciGen {
    BigInteger n1 = ONE; BigInteger n2 = ZERO;

    @Benchmark
    public BigInteger next() {
        BigInteger cur = n1.add(n2);
        n2 = n1; n1 = cur;
        return cur;
    }
}

S.S.: Timing Each Call...

Whoops, this benchmark has no steady state, indeed:

[Chart: time to call, usec (2.5–12.5) vs. call # (0–100000)]

S.S.: Pitfalls

No steady state – cannot use time-based benchmarks! The longer we measure, the «slower» the result appears:

duration, sec    score, us/op
      1           5.013 ± 0.006
      2           7.087 ± 0.009
      4          10.021 ± 0.017
      8          14.159 ± 0.010

S.S.: Pick Your Poison

Time-based benchmarks:
- Measuring in God knows what conditions
- How should one compare two implementations? (if you are lucky, and your performance model is linear...)

Work-based benchmarks:
- Burning ourselves with timer latency/granularity
- Burning ourselves with omission
- Burning ourselves with transients

S.S.: Conclusion

«The only winning move is not to play at all»

Non-steady-state benchmarks force you to choose between all the bad options.

Non-steady-state benchmarks are a large P.I.T.A.!

S.S.: Palliative Relief

Measure in large batches!

@Setup(Level.Iteration)
public void setup() {
    n1 = BigInteger.ZERO; n2 = BigInteger.ONE;
}

@Benchmark
@Measurement(batchSize = 5000)
public BigInteger next() {
    BigInteger cur = n1.add(n2);
    n2 = n1; n1 = cur;
    return cur;
}
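Batching usually goes hand in hand with single-shot mode, so that each measured sample covers one whole batch and no per-call timestamps are taken. A sketch of how the complete class might be annotated; the iteration counts here are my placeholders, not values from the slides:

import java.math.BigInteger;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
@BenchmarkMode(Mode.SingleShotTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 5, batchSize = 5000)
@Measurement(iterations = 20, batchSize = 5000)
public class FibonacciBatch {
    BigInteger n1, n2;

    @Setup(Level.Iteration)
    public void setup() {
        // Reset the sequence so every batch starts from the same state.
        n1 = BigInteger.ZERO; n2 = BigInteger.ONE;
    }

    @Benchmark
    public BigInteger next() {
        BigInteger cur = n1.add(n2);
        n2 = n1; n1 = cur;
        return cur;
    }
}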

Engineering


Engineering: Comparisons

You want your results to be comparable.

- Every tiny little uncontrolled detail is a free variable
- Libraries are large complexes of tiny details
- Language runtimes are galaxies of tiny details

Engineering: Story

This is a weird story of a Java vs. Scala comparison coming from StackOverflow, where people are bound to the belief that tail-recursion optimization is the best thing that happened in computer science since sliced bread.

The complete story and narrative is here:
http://shipilev.net/blog/2014/java-scala-divided-we-fail/

Engineering: Scala’s @tailrec

@tailrec private def
isDivisible(v: Int, d: Int, l: Int): Boolean = {
  if (d > l) true
  else (v % d == 0) && isDivisible(v, d + 1, l)
}

@Benchmark
def test(): Int = {
  var v = 10
  while (!isDivisible(v, 2, l))
    v += 2
  v
}

Engineering: Java’s absence-of-tailrec

private boolean isDivisible(int v, int d, int l) {
    if (d > l) return true;
    else
        return (v % d == 0) && isDivisible(v, d + 1, l);
}

@Benchmark
public int test() {
    int v = 10;
    while (!isDivisible(v, 2, l))
        v += 2;
    return v;
}

Engineering: Measuring

Benchmark    lim        Score  Score error  Units
-----------------------------------------------------
ScalaBench     1        0.002        0.000  us/op
ScalaBench     5        0.494        0.005  us/op
ScalaBench    10       24.228        0.268  us/op
ScalaBench    15     3457.733       33.070  us/op
ScalaBench    20  2505634.259    15366.665  us/op
JavaBench      1        0.002        0.000  us/op
JavaBench      5        0.252        0.001  us/op
JavaBench     10       12.782        0.325  us/op
JavaBench     15     1615.890        7.647  us/op
JavaBench     20  1053187.731    20502.217  us/op

Engineering: Profiling Java

Result: 12.719 +-(99.9%) 0.284 us/op [Average]

....[Thread state distributions].......................
 91.3% RUNNABLE
  8.7% WAITING

....[Thread state: RUNNABLE]...........................
 58.0%  63.5% n.s.JavaBench.isDivisible
 32.9%  36.1% n.s.JavaBench.test

....[Thread state: WAITING]............................
  8.7% 100.0% <irrelevant>

Engineering: Profiling Scala

Result: 24.076 +-(99.9%) 0.728 us/op [Average]

....[Thread state distributions].......................
 91.4% RUNNABLE
  8.6% WAITING

....[Thread state: RUNNABLE]...........................
 90.6%  99.1% n.s.ScalaBench.test
  0.9%   0.9% n.s.generated.ScalaBench_test.test_avgt_jmhLoop

....[Thread state: WAITING]............................
  8.6% 100.0% <irrelevant>

Engineering: Coarse-grained profilers

Coarse-grained (method-level) profilers are useless in diagnosing the problems in nano- and micro-benchmarks.

Additional penalty points if they are sampling at safepoints.

Engineering: JMH perfasm

java -jar benchmarks.jar ... -prof perfasm

Surprisingly easy to marry these three things:
1. Linux perf provides light-weight PMU sampling
2. JVM debug info maps events back to VM methods
3. -XX:+PrintAssembly maps events back to Java code

Actually, there are lots of good profilers already, but most of the time you don’t need «big guns» to quickly analyze benchmarks.

Engineering: Hottest thing in Scala

One true and solid x86 division:

 clocks   insns  code
-------------------------------------------------------
                 ; n.s.g.ScalaBench_test::test_avgt_jmhLoop
                 ...
  0.27%   0.17%  cltd
  2.24%  17.26%  idiv %ecx
 77.99%  66.44%  test %edx,%edx
                 ...

How can you possibly be 2x faster than this?

Engineering: Hottest thing in Java

 clocks   insns  code
-------------------------------------------------------
                 ; n.s.JavaBench::isDivisible
                 ...
  1.68%   2.76%  cltd
  0.06%   0.16%  idiv %ecx
 27.59%  36.37%  test %edx,%edx
                 ...
  0.04%          cltd
                 idiv %r10d
 12.24%   1.54%  test %edx,%edx
                 ...
  0.01%          callq <recursive-call>

Engineering: Second hottest thing in Java

 clocks   insns  code
-------------------------------------------------------
                 ; n.s.g.JavaBench_test::test_avgt_jmhLoop
                 ...
  1.34%   0.21%  imul $0x55555556,%rdx,%rdx
  1.25%   0.20%  sar  $0x20,%rdx
  1.15%   2.36%  mov  %edx,%esi
  0.95%   1.51%  sub  %r10d,%esi  ; irem
                 ...

Beautiful trick of substituting the remainder with constant multiplication and binary shift!²

² http://www.hackersdelight.org/divcMore.pdf
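To make the trick concrete: for a non-negative int v, v / 3 can be computed with the constant 0x55555556 = ceil(2^32 / 3), and v % 3 then falls out by back-multiplication. A small sketch of the general technique in plain Java (my illustration, not the exact code C2 emits):

public class MagicDivide {
    public static void main(String[] args) {
        int v = 100;
        // q = floor(v * ceil(2^32/3) / 2^32) == v / 3 for v >= 0
        long q = (v * 0x55555556L) >> 32;
        long r = v - q * 3; // remainder, computed without any idiv
        System.out.println(q + " " + r); // prints "33 1"
    }
}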

Engineering: Second hottest thing in Java

clocks insns code-------------------------------------------------------; n.s.g.JavaBench_test::test_avgt_jmhLoop...1.34% 0.21% imul $0x55555556,%rdx,%rdx1.25% 0.20% sar $0x20,%rdx1.15% 2.36% mov %edx,%esi0.95% 1.51% sub %r10d,%esi ; irem

...

Beautiful trick of substituting the remainder with constantmultiplication and binary shift! 2

2http://www.hackersdelight.org/divcMore.pdfSlide 70/75. Copyright c○ 2014, Oracle and/or its affiliates. All rights reserved.

Engineering: Quick Explanation

// inlines twice, specializes for d={2,3}
private boolean isDivisible(int v, int d, int l) {
    ...
    return (v % d == 0) && isDivisible(v, d + 1, l);
}

@Benchmark
public int test() {
    int v = 10;
    while (!isDivisible(v, 2, l))
        v += 2;
    return v;
}

Engineering: Make «d» unpredictable

Benchmark    lim        Score  Score error  Units
-----------------------------------------------------
ScalaBench     1        0.002        0.000  us/op
ScalaBench     5        0.489        0.002  us/op
ScalaBench    10       23.777        0.116  us/op
ScalaBench    15     3379.870        5.737  us/op
ScalaBench    20  2468845.944     2413.573  us/op
JavaBench      1        0.003        0.000  us/op
JavaBench      5        0.465        0.001  us/op
JavaBench     10       22.989        0.095  us/op
JavaBench     15     3453.116       16.390  us/op
JavaBench     20  2518726.451     4374.482  us/op
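The deck does not show the exact change; one plausible way to defeat the d={2,3} specialization is to launder the starting divisor through a mutable field the JIT cannot constant-fold at the call site. A hedged sketch of that idea (the field names are mine):

import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
public class JavaBenchUnpredictable {
    int l = 20;
    // Reading the starting divisor from a mutable field, rather than
    // passing the literal 2, should prevent the JIT from specializing
    // the recursion for the first constant d values.
    int startD = 2;

    private boolean isDivisible(int v, int d, int l) {
        if (d > l) return true;
        return (v % d == 0) && isDivisible(v, d + 1, l);
    }

    @Benchmark
    public int test() {
        int v = 10;
        while (!isDivisible(v, startD, l))
            v += 2;
        return v;
    }
}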

Engineering: Conclusion

«Days since the last benchmarking accident: 0» (@gvsmirnov)

Benchmarks without analysis make me a really sad panda.

You show me nice charts: Language A vs. Language B, Nashorn vs. Rhino, Graal vs. C2, etc., and all I see is BAYESIAN NOISE.

Fin


Fin: Conclusion

«If you don’t analyze the benchmarks, you’re gonna waste a good time»

Superficial conclusions almost always feed on existing biases, and are almost always wrong.

Benchmarks are for understanding Reality, not for reinforcing your prejudices about the Universe.