Online Detection and Repair of Cache Contention for the JVMdevietti/slides/remix-pldi2016.pdf ·...

84
REMIX Online Detection and Repair of Cache Contention for the JVM Ariel Eizenberg, Joseph Devietti Shiliang Hu, Gilles Pokam

Transcript of Online Detection and Repair of Cache Contention for the JVMdevietti/slides/remix-pldi2016.pdf ·...

REMIXOnline Detection and Repair of Cache Contention for the JVM

Ariel Eizenberg,

Joseph Devietti

Shiliang Hu,

Gilles Pokam

What’s wrong with this code?

public class Test extends AtomicLongimplements Runnable {

static void main(final String[] args) {Test test1 = new Test(); Test test2 = new Test();Thread t1 = new Thread(test1);Thread t2 = new Thread(test2);t1.start();t2.start();t1.join();t2.join();

}public void run() {

for(int i = 0; i < 100000000; ++ i) set(i);}}

313ms

1 Summary Background REMIX Detection Repair Performance

What’s wrong with this code?

public class Test extends AtomicLongimplements Runnable {

static void main(final String[] args) {Test test1 = new Test(); Thread t1 = new Thread(test1);Test test2 = new Test();Thread t2 = new Thread(test2);t1.start();t2.start();t1.join();t2.join();

}public void run() {

for(int i = 0; i < 100000000; ++ i) set(i);}}

151ms

1 Summary Background REMIX Detection Repair Performance

What happened?

2 Summary Background REMIX Detection Repair Performance

What happened?

• Version A: 313ms

test2test1 t1 t2

2 Summary Background REMIX Detection Repair Performance

test1

What happened?

• Version A: 313ms

test2test1 t1 t2

• Version B: 151ms

test2test1 t1 t2

2 Summary Background REMIX Detection Repair Performance

test1

What happened?

• Version A: 313ms

test2test1 t1 t2

• Version B: 151ms

test2

test1

t1 t2

• More surprises!

test2test1 t1 t2

GC, Class Loader, …

2 Summary Background REMIX Detection Repair Performance

Managed Languages

Runtime has increased

control over execution

opportunities for

dynamic optimization

Less programmer

control over memory

more vulnerable to

performance bugs

3 Summary Background REMIX Detection Repair Performance

REMIX

• First system to automatically detect and repair cache contention in managed languages

• Uses hardware performance counters to detect cache contention

• Automatically repairs cache contention bugs, significantly simplifying programmers job

• Implemented as a GC-like pass within the HotSpot JVM

4 Summary Background REMIX Detection Repair Performance

Outline

Cache Coherency

The Remix system

Evaluation

5 Summary Background REMIX Detection Repair Performance

Cache Coherency

• Multicore processors implement a cache

coherence protocol to keep private caches in sync

• Operates on whole cache lines (usually 64 bytes)

• Cache lines have three key states:

• Read Shared

• Write Exclusive

• Invalid

6 Summary Background REMIX Detection Repair Performance

True vs False sharing

Same Bytes

True Sharing

Core

1

Core

0

X

Core

1

Core

0

X YDifferent Bytes

False Sharing

Cache Line

7 Summary Background REMIX Detection Repair Performance

Intel PEBS Events

• PEBS – Precise Event-Based Sampling

• Available in recent Intel multiprocessors

• Log detailed information about architectural events

8 Summary Background REMIX Detection Repair Performance

PEBS HitM Events

• “Hit-Modified” - A cache miss due to a cache line in Modified state on a different core

Core

0

L1$ X:M

Core

1

L1$X:I

Load X

MissHit

9 Summary Background REMIX Detection Repair Performance

Related Work

• Sheriff [Liu and Berger, OOPSLA 2011]

• Plastic [Nanavati et al., EuroSys 2013]

• Laser [Luo et al., HPCA 2016]

• Cheetah [Liu and Liu, CGO 2016]

• Predator [Liu et. al., PPoPP 2014]

• Intel vTune Amplifier XE

• Oracle Java8 @Contended

detect & repairunmanaged

languages

detection only

manual repair

10 Summary Background REMIX Detection Repair Performance

REMIX

• A modified version of the Oracle HotSpot JVM

• Detects & repairs cache contention bugs at runtime

• Works with all JVM languages, no source code modification required

• Provides performance matching hand-optimization

11 Summary Background REMIX Detection Repair Performance

REMIX System Overview

Java, Scala, etc’

JVM

Linux Kernel

CPU w/HITM PEBS HITM events

Application

Runtime system

OS

Hardware

perf API

REMIX

12 Summary Background REMIX Detection Repair Performance

Detection

13 Summary Background REMIX Detection Repair Performance

Detection

HITM

Events

13 Summary Background REMIX Detection Repair Performance

Detection

HITM

Eventsclassify track complete

native heap & stack

13 Summary Background REMIX Detection Repair Performance

Detection

managed heap

HITM

Eventsclassify track

>threshold

complete

native heap & stack

13 Summary Background REMIX Detection Repair Performance

Detection

managed heap

HITM

Eventsclassify track

>threshold

complete

native heap & stack

no

13 Summary Background REMIX Detection Repair Performance

Detection

managed heap

HITM

Eventsclassify track

>threshold

complete

model

$ lines

native heap & stack

no

yes

13 Summary Background REMIX Detection Repair Performance

Detection

managed heap

HITM

Eventsclassify track

>threshold

complete

model

$ linesclassify

native heap & stack

no

yes

map to

objects

13 Summary Background REMIX Detection Repair Performance

Detection

managed heap

HITM

Eventsclassify track

>threshold

complete

model

$ linesclassify

report

native heap & stack

no

yes

map to

objects

true sharing

13 Summary Background REMIX Detection Repair Performance

Detection

managed heap

HITM

Eventsclassify track

>threshold

complete

model

$ linesclassify

repair

report

native heap & stack

no

yes

map to

objects

false sharing

true sharing

13 Summary Background REMIX Detection Repair Performance

Cache Line Modelling

• Cache lines are modelled with 64-bit bitmaps

• HITM event set the address bit, count hit

• Multiple bits set potential false sharing

• Repair is cheaper than more complete modelling!

• Repair when counter exceeds threshold

14 Summary Background REMIX Detection Repair Performance

Top Level Flow

15 Summary Background REMIX Detection Repair Performance

Top Level Flow

......

......

......

Foo ......

......

......

OriginalHeap

OriginalClasses

NewHeap

ModifiedClasses

Foo

Bar Bar

15 Summary Background REMIX Detection Repair Performance

Top Level Flow

......

......

......

Foo ......

......

......

OriginalHeap

OriginalClasses

NewHeap

ModifiedClasses

Foo

Bar Bar

......

......

......

Foo ......

......

......

OriginalHeap

OriginalClasses

NewHeap

ModifiedClasses

Foo

Bar Bar

15 Summary Background REMIX Detection Repair Performance

Top Level Flow

......

......

......

Foo ......

......

......

OriginalHeap

OriginalClasses

NewHeap

ModifiedClasses

Foo

Bar Bar

......

......

......

Foo ......

......

......

OriginalHeap

OriginalClasses

NewHeap

ModifiedClasses

Foo

Bar Bar

......

......

......

Foo ......

......

......

OriginalHeap

OriginalClasses

NewHeap

ModifiedClasses

Foo

Bar Bar

15 Summary Background REMIX Detection Repair Performance

Top Level Flow

......

......

......

Foo ......

......

......

OriginalHeap

OriginalClasses

NewHeap

ModifiedClasses

Foo

Bar Bar

......

......

......

Foo ......

......

......

OriginalHeap

OriginalClasses

NewHeap

ModifiedClasses

Foo

Bar Bar

......

......

......

Foo ......

......

......

OriginalHeap

OriginalClasses

NewHeap

ModifiedClasses

Foo

Bar Bar

......

......

......

Foo ......

......

......

OriginalHeap

OriginalClasses

NewHeap

ModifiedClasses

Foo

Bar Bar

15 Summary Background REMIX Detection Repair Performance

Padding

• Instance data

• Instance size

• Field list

• OOPMap (reference block map)

• Constant-pool cache

16 Summary Background REMIX Detection Repair Performance

Padding - Inheritance

class A

header (16b)

byte a1

byte a2

class B

extends A

header (16b)

byte a1

byte a2

byte b

class C

extends A

header (16b)

byte a1

byte a2

byte c

17 Summary Background REMIX Detection Repair Performance

Padding - Inheritance

class A

header (16b)

byte a1

byte a2

class B

extends A

header (16b)

byte a1

byte a2

byte b

class C

extends A

header (16b)

byte a1

byte a2

byte c

17 Summary Background REMIX Detection Repair Performance

Padding - Inheritance

class A

header (16b)

byte a1

byte a2

class B

extends A

header (16b)

byte a1

byte a2

byte b

pad 48b

pad 62b

class C

extends A

header (16b)

byte a1

byte a2

byte c

17 Summary Background REMIX Detection Repair Performance

Padding - Inheritance

class A

header (16b)

byte a1

byte a2

class B

extends A

header (16b)

byte a1

byte a2

byte b

pad 48b

pad 62b

pad 48b

pad 62b

class C

extends A

header (16b)

byte a1

byte a2

byte c

17 Summary Background REMIX Detection Repair Performance

Padding - Inheritance

class A

header (16b)

byte a1

byte a2

class B

extends A

header (16b)

byte a1

byte a2

byte b

pad 48b

pad 62b

pad 48b

pad 62b

pad 48b

pad 62b

class C

extends A

header (16b)

byte a1

byte a2

byte c

17 Summary Background REMIX Detection Repair Performance

Repair

• Trace all strong+weak roots in the system

• Traverse heap and find targeted instances

• Live Relocate & pad, store forwarding pointer

• Dead Fix size mismatch

• Adjust all pointers to forwarded objects

• Deoptimize all relevant stack frames

18 Summary Background REMIX Detection Repair Performance

Disruptor & Spring Reactor

• Both are libraries for high-speed inter-thread message passing

19 Summary Background REMIX Detection Repair Performance

Disruptor & Spring Reactor

• Both are libraries for high-speed inter-thread message passing

class Sequence {protected volatile long value;

}

value

19 Summary Background REMIX Detection Repair Performance

56 bytes56 bytes

Disruptor & Spring Reactor

• Both are libraries for high-speed inter-thread message passing

class Sequence {protected long a,b,c,d,e,f,g; // pad beforeprotected volatile long value;protected long i,j,k,l,m,n,o; // pad after

}

• Is this enough? valuea-g i-o

19 Summary Background REMIX Detection Repair Performance

56 bytes56 bytes

Disruptor & Spring Reactor

• Both are libraries for high-speed inter-thread message passing

class Sequence {protected long a,b,c,d,e,f,g; // pad beforeprotected volatile long value;protected long i,j,k,l,m,n,o; // pad after

}

• Is this enough?

• No! – Class loader optimizes aggressively, lays out value to before a.

value a-g i-o

19 Summary Background REMIX Detection Repair Performance

Disruptor & Spring Reactor

• A complex hierarchy forces field order

class LhsPadding {protected long a,b,c,d,e,f,g;

}class Value extends LhsPadding {protected volatile long value;

}class RhsPadding extends Value {protected long i,j,k,l,m,n,o;

}class Sequence extends RhsPadding {

// actual work}

20 Summary Background REMIX Detection Repair Performance

Disruptor + Spring Reactor

0

1

2

3

4

SP

EED

UP

(HIG

HER

IS B

ETTER

)

REMIX Manual Pad

21 Summary Background REMIX Detection Repair Performance

Disruptor + Spring Reactor

0

1

2

3

4

SP

EED

UP

(HIG

HER

IS B

ETTER

)

REMIX Manual Pad

21 Summary Background REMIX Detection Repair Performance

Disruptor + Spring Reactor

0

1

2

3

4

SP

EED

UP

(HIG

HER

IS B

ETTER

)

REMIX Manual Pad

21 Summary Background REMIX Detection Repair Performance

Disruptor + Spring Reactor

0

1

2

3

4

SP

EED

UP

(HIG

HER

IS B

ETTER

)

REMIX Manual Pad

REMIX performs as well as manual

fixes in most cases

21 Summary Background REMIX Detection Repair Performance

Disruptor + Spring Reactor

0

1

2

3

4

SP

EED

UP

(HIG

HER

IS B

ETTER

)

REMIX Manual Pad

REMIX repair threshold is

adjustable

21 Summary Background REMIX Detection Repair Performance

Disruptor + Spring Reactor

0

1

2

3

4

SP

EED

UP

(HIG

HER

IS B

ETTER

)

REMIX Manual Pad

REMIX outperforms when manual

fix adds unnecessary padding

21 Summary Background REMIX Detection Repair Performance

NAS Parallel Benchmarks

• REMIX reveals true-sharing running NAS suite• HITM from true-sharing on JIT counters

• JIT-ing fixes contention

• NAS workloads exceed JIT size limit, not JIT-ed

• REMIX detects this and forces compilation

22 Summary Background REMIX Detection Repair Performance

NAS Parallel Benchmarks

• REMIX reveals true-sharing running NAS suite• HITM from true-sharing on JIT counters

• JIT-ing fixes contention

• NAS workloads exceed JIT size limit, not JIT-ed

• REMIX detects this and forces compilation

7.7 9.0

25.3

0

10

20

30

BT CG FT IS LU MG SP

speed

up

(hig

her

is b

ett

er)

22 Summary Background REMIX Detection Repair Performance

NAS Parallel Benchmarks

• REMIX reveals true-sharing running NAS suite• HITM from true-sharing on JIT counters

• JIT-ing fixes contention

• NAS workloads exceed JIT size limit, not JIT-ed

• REMIX detects this and forces compilation

7.7 9.0

25.3

0

10

20

30

BT CG FT IS LU MG SP

speed

up

(hig

her

is b

ett

er)

22 Summary Background REMIX Detection Repair Performance

No Contention == No Impact

23 Summary Background REMIX Detection Repair Performance

No Contention == No Impact Sp

eed

up

(h

igh

er

is b

ett

er)

0.9

1

1.1

ap

para

t

blo

at

chart

an

tlr

avr

ora

fact

ori

e

h2

hsq

ldb

jyth

on

luse

arc

h

pm

d

scala

axb

act

ors

scala

c

scala

do

c

scala

p

scala

rifo

rm

scala

test

specs

sun

flo

w

tmt

tom

cat

trad

eb

ean

s

trad

eso

ap

xala

n

0.9

1

1.1

com

pre

ss

cryp

to.a

es

cryp

to.r

sa

cryp

to.s

ign

derb

y

fft.la

rge

fft.sm

all

lu.la

rge

lu.s

mall

mo

nte

-carl

o

mp

eg

au

dio

seri

al

sor.

larg

e

sor.

small

spars

e.la

rge

spars

e.s

mall

sun

flo

w

xml.t

ran

sfo

xml.v

alid

ati…

(DaCapo & ScalaBench)

(SpecJVM)

23 Summary Background REMIX Detection Repair Performance

No Contention == No Impact Sp

eed

up

(h

igh

er

is b

ett

er)

0.9

1

1.1

ap

para

t

blo

at

chart

an

tlr

avr

ora

fact

ori

e

h2

hsq

ldb

jyth

on

luse

arc

h

pm

d

scala

axb

act

ors

scala

c

scala

do

c

scala

p

scala

rifo

rm

scala

test

specs

sun

flo

w

tmt

tom

cat

trad

eb

ean

s

trad

eso

ap

xala

n

0.9

1

1.1

com

pre

ss

cryp

to.a

es

cryp

to.r

sa

cryp

to.s

ign

derb

y

fft.la

rge

fft.sm

all

lu.la

rge

lu.s

mall

mo

nte

-carl

o

mp

eg

au

dio

seri

al

sor.

larg

e

sor.

small

spars

e.la

rge

spars

e.s

mall

sun

flo

w

xml.t

ran

sfo

xml.v

alid

ati…

(DaCapo & ScalaBench)

(SpecJVM)

23 Summary Background REMIX Detection Repair Performance

No Contention == No Impact Sp

eed

up

(h

igh

er

is b

ett

er)

0.9

1

1.1

ap

para

t

blo

at

chart

an

tlr

avr

ora

fact

ori

e

h2

hsq

ldb

jyth

on

luse

arc

h

pm

d

scala

axb

act

ors

scala

c

scala

do

c

scala

p

scala

rifo

rm

scala

test

specs

sun

flo

w

tmt

tom

cat

trad

eb

ean

s

trad

eso

ap

xala

n

0.9

1

1.1

com

pre

ss

cryp

to.a

es

cryp

to.r

sa

cryp

to.s

ign

derb

y

fft.la

rge

fft.sm

all

lu.la

rge

lu.s

mall

mo

nte

-carl

o

mp

eg

au

dio

seri

al

sor.

larg

e

sor.

small

spars

e.la

rge

spars

e.s

mall

sun

flo

w

xml.t

ran

sfo

xml.v

alid

ati…

(DaCapo & ScalaBench)

(SpecJVM)

23 Summary Background REMIX Detection Repair Performance

No Contention == No Impact Sp

eed

up

(h

igh

er

is b

ett

er)

0.9

1

1.1

ap

para

t

blo

at

chart

an

tlr

avr

ora

fact

ori

e

h2

hsq

ldb

jyth

on

luse

arc

h

pm

d

scala

axb

act

ors

scala

c

scala

do

c

scala

p

scala

rifo

rm

scala

test

specs

sun

flo

w

tmt

tom

cat

trad

eb

ean

s

trad

eso

ap

xala

n

0.9

1

1.1

com

pre

ss

cryp

to.a

es

cryp

to.r

sa

cryp

to.s

ign

derb

y

fft.la

rge

fft.sm

all

lu.la

rge

lu.s

mall

mo

nte

-carl

o

mp

eg

au

dio

seri

al

sor.

larg

e

sor.

small

spars

e.la

rge

spars

e.s

mall

sun

flo

w

xml.t

ran

sfo

xml.v

alid

ati…

(DaCapo & ScalaBench)

(SpecJVM)

23 Summary Background REMIX Detection Repair Performance

Padding

0

5

10

15

KB

Original Size Padding

24 Summary Background REMIX Detection Repair Performance

Padding

0

5

10

15

KB

Original Size Padding

24 Summary Background REMIX Detection Repair Performance

Padding

0

5

10

15

KB

Original Size Padding

24 Summary Background REMIX Detection Repair Performance

Performance

0

200

400

600

800

Tim

e (

ms)

Collector Detector Repair

25 Summary Background REMIX Detection Repair Performance

Performance

0

200

400

600

800

Tim

e (

ms)

Collector Detector Repair

25 Summary Background REMIX Detection Repair Performance

Performance

0

200

400

600

800

Tim

e (

ms)

Collector Detector Repair

25 Summary Background REMIX Detection Repair Performance

Performance

0

200

400

600

800

Tim

e (

ms)

Collector Detector Repair

25 Summary Background REMIX Detection Repair Performance

Conclusions

• Cache contention afflicts managed languages

• Dynamic information opens up new avenues for optimization

• Performance counters are a gold mine

• REMIX can simplify programmers’ lives by automatically fixing false sharing bugs

• https://github.com/upenn-acg/REMIX (GPLv2)

26 Summary Background REMIX Detection Repair Performance

Q&A

27 Summary Background REMIX Detection Repair Performance

PEBS HITM Event Accuracy

7.40%

98.60%

W-W Contention

R-W Contention

data address accuracy

28 Summary Background REMIX Detection Repair Performance

Objects Moved

0102030405060708090

33 Summary Background REMIX Detection Repair Performance

sun.misc.Unsafe

• Tracks unsafe access, prevents padding

• REMIX Extended Unsafe :

• Just as fast, enables padding

37 Summary Background REMIX Detection Repair Performance

Time-to-fix

0%

5%

10%

15%

20%

Late

ncy

(%

ru

ntim

e)

34 Summary Background REMIX Detection Repair Performance

Time-to-fix

0%

5%

10%

15%

20%

Late

ncy

(%

ru

ntim

e)

34 Summary Background REMIX Detection Repair Performance

Performance

Benchmark(s) Classes

Loaded

Processing

(ms)

Heap Size

(MB)

DaCapo

Sunflow

1879 1.47 854

Disruptor 700-900 0.5-1.3 40-85

SpringReactor

WorkQueue

1617 0.748 41

SpecJVM serial 1498 1.21 1603

32 Summary Background REMIX Detection Repair Performance

REMIX Speedup vs HITM rate

0

1000

2000

3000

75% 100% 125% 150% 175% 200%

Speedup

HIT

M r

eco

rds/s

ec

Dacapo 9.12Dacapo2006DisruptorScalaBenchSPECjvm2008

31 Summary Background REMIX Detection Repair Performance

Size-Preserving-Klass

• Instances are stored contiguously in memory

• Object sizes not stored directly in the heap• Calculated through the klass pointer in the object

header

• Padding breaks this assumption• Object left behind at old location has wrong size!

• Solution – create a ghost klass with the original size• Modify dead instances to point to this ghost

• Take care not to traverse padded objects in heap until klass size is updated

30 Summary Background REMIX Detection Repair Performance

Padding - Inheritance

class A

header (16b)

byte a

class B extends

A

header (16b)

byte a

byte b

class C extends

A

header (16b)

byte a

byte c

class E extends

C

header (16b)

byte a

byte c

byte d

class E extends

C

header (16b)

byte a

byte c

byte e

29 Summary Background REMIX Detection Repair Performance

Padding - Inheritance

class A

header (16b)

byte a

class B extends

A

header (16b)

byte a

byte b

class C extends

A

header (16b)

byte a

byte c

class E extends

C

header (16b)

byte a

byte c

byte d

class E extends

C

header (16b)

byte a

byte c

byte e

29 Summary Background REMIX Detection Repair Performance

Padding - Inheritance

class A

header (16b)

byte a

class B extends

A

header (16b)

byte a

byte b

class C extends

A

header (16b)

byte a

byte c

class E extends

C

header (16b)

byte a

byte c

byte d

class E extends

C

header (16b)

byte a

byte c

byte e

29 Summary Background REMIX Detection Repair Performance

Padding - Inheritance

class A

header (16b)

byte a

class B extends

A

header (16b)

byte a

byte b

class C extends

A

header (16b)

byte a

byte c

class E extends

C

header (16b)

byte a

byte c

byte d

class E extends

C

header (16b)

byte a

byte c

byte e

pad

pad

29 Summary Background REMIX Detection Repair Performance

Padding - Inheritance

class A

header (16b)

byte a

class B extends

A

header (16b)

byte a

byte b

class C extends

A

header (16b)

byte a

byte c

class E extends

C

header (16b)

byte a

byte c

byte d

class E extends

C

header (16b)

byte a

byte c

byte e

pad

pad

pad

pad

29 Summary Background REMIX Detection Repair Performance

Padding - Inheritance

class A

header (16b)

byte a

class B extends

A

header (16b)

byte a

byte b

class C extends

A

header (16b)

byte a

byte c

class E extends

C

header (16b)

byte a

byte c

byte d

class E extends

C

header (16b)

byte a

byte c

byte e

pad

pad

pad

pad

pad

pad

pad

29 Summary Background REMIX Detection Repair Performance

Padding - Inheritance

class A

header (16b)

byte a

class B extends

A

header (16b)

byte a

byte b

class C extends

A

header (16b)

byte a

byte c

class E extends

C

header (16b)

byte a

byte c

byte d

class E extends

C

header (16b)

byte a

byte c

byte e

pad

pad

pad

pad

pad

pad

pad

pad

pad

pad

pad

pad

pad

29 Summary Background REMIX Detection Repair Performance

Savina Actors Benchmark

• Evaluates actor libraries across 30 benchmarks

• REMIX detects false and true sharing bugs in libraries & benchmarks:

AkkaActor

FuncJava

GparsActor

HabaneroActor

HabaneroSelector

LiftActor

ScalaActor

Scalaz

apsp

asta

r

bank

ing

barb

er big

bito

nics

ort

bndb

uffe

r

cham

eneo

s

cigsm

ok

conc

dict

conc

sll

coun

t

facloc fib

filte

rban

k

fjcre

ate

fjthr

put

logm

ap

nque

enk

philo

sophe

r

ping

pong

pipr

ecision

quicks

ort

radixs

ort

recm

atm

ul

sieve so

r

thre

adrin

g

trapez

oid

uct

0 HITM/s

100 HITM/s

>10K HITM/s

36 Summary Background REMIX Detection Repair Performance

Virtualization

• Although Xen supports perf, it does not yet support virtualization of HW PEBS events

35 Summary Background REMIX Detection Repair Performance