EazyHTM: Eager-Lazy Hardware Transactional Memory

Saša Tomić, Cristian Perfumo, Chinmay Kulkarni, Adrià Armejach, Adrián Cristal, Osman Unsal, Tim Harris, Mateo Valero

Barcelona Supercomputing Center, UPC

BITS Pilani

Microsoft Research Cambridge

Why Transactional Memory?

• Lock-based parallel programming has problems
  – Deadlocks, races, complexity, performance, …
• Transactional Memory (TM) to the rescue
  – Optimistic concurrency control mechanism
  – Easy to use
  – Deadlock free
  – Supports composability
  – Protects data in critical sections

• Hardware-TM (HTM), Software-TM (STM) and hybrid

HTM terminology

• Atomic section/transaction: a group of instructions that appear to take effect instantaneously
• Version management (where speculative values are stored):
  – in place, logging the original value, or
  – buffered in private storage and published on commit
• Conflict: a TX writes a location that another TX reads or writes
  – Detection: the action of checking for conflicts
  – Resolution: the action performed to resolve a conflict
    • Can be an abort, stalling the execution, …
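The conflict definition above can be sketched as a check on read and write sets. This is only an illustration in software: the `Tx` record and `conflicting_lines` are hypothetical names, and real HTMs track these sets in cache metadata rather than in Python sets.

```python
from dataclasses import dataclass, field


@dataclass
class Tx:
    """Hypothetical transaction record holding sets of line addresses."""
    read_set: set = field(default_factory=set)
    write_set: set = field(default_factory=set)


def conflicting_lines(a: Tx, b: Tx) -> set:
    """Return the lines on which a and b conflict.

    Two transactions conflict on a line when at least one of them
    writes it; two readers never conflict.
    """
    a_writes_b_touches = a.write_set & (b.read_set | b.write_set)
    b_writes_a_reads = b.write_set & a.read_set
    return a_writes_b_touches | b_writes_a_reads


tx0 = Tx(read_set={0xA0})   # reads line 0xA0
tx2 = Tx(write_set={0xA0})  # writes line 0xA0
tx3 = Tx(read_set={0xA0})   # also only reads line 0xA0

print(conflicting_lines(tx0, tx2))  # reader vs. writer: conflict on 0xA0
print(conflicting_lines(tx0, tx3))  # reader vs. reader: empty set
```

Detection corresponds to evaluating such a check; resolution is whatever is done with a non-empty result (abort, stall, …).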

Eager HTM

• A.k.a. pessimistic
• Writes in place; detects and resolves conflicts on every access
• LogTM [Moore, HPCA06], LogTM-SE [Yen, HPCA07]
• Fast commit, slow abort, limited concurrency

[Figure: timeline of TX 1, TX 2 and TX 3 with read (R) and write (W) accesses; labels: stall, fast commit.]

Lazy HTM

• A.k.a. optimistic
• Writes buffered; detects and resolves conflicts on commit
• TCC [Hammond, ISCA04], Scalable-TCC [Chafi, HPCA07]
• Fast abort, complex commit, good concurrency

[Figure: timeline of TX 1, TX 2 and TX 3 with read (R) and write (W) accesses; label: complex commit (validate + write).]
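The difference between writing in place (eager) and buffering writes (lazy) can be illustrated with a small software sketch. The classes `EagerTx` and `LazyTx` are hypothetical and model only version management on a dictionary standing in for memory; conflict detection is left out.

```python
class EagerTx:
    """Eager version management: write in place and log the original value."""

    def __init__(self, memory):
        self.memory = memory
        self.undo_log = []  # (address, original value) pairs

    def write(self, addr, value):
        self.undo_log.append((addr, self.memory[addr]))
        self.memory[addr] = value  # new value is already in place

    def commit(self):
        self.undo_log.clear()  # fast: nothing to copy

    def abort(self):
        # Slow: walk the log backwards and restore every original value.
        while self.undo_log:
            addr, original = self.undo_log.pop()
            self.memory[addr] = original


class LazyTx:
    """Lazy version management: buffer writes and publish them on commit."""

    def __init__(self, memory):
        self.memory = memory
        self.write_buffer = {}

    def write(self, addr, value):
        self.write_buffer[addr] = value  # memory is untouched

    def read(self, addr):
        return self.write_buffer.get(addr, self.memory[addr])

    def commit(self):
        # More work: every buffered value has to be written back.
        self.memory.update(self.write_buffer)
        self.write_buffer.clear()

    def abort(self):
        self.write_buffer.clear()  # fast: just drop the buffer


memory = {"A": 1}
eager = EagerTx(memory)
eager.write("A", 2)
eager.abort()
lazy = LazyTx(memory)
lazy.write("A", 3)
lazy.commit()
print(memory)
```

The sketch mirrors the trade-off on the two slides: the eager variant pays on abort, the lazy variant pays on commit.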

The Motivation: Splitting conflict management

• Eager-Lazy hardware-software TM exists (FlexTM [Shriraman, ISCA08]):
  – Software begin, commit and abort
  – Probabilistic (signature-based) conflict detection
• EazyHTM is the first pure-hardware eager-lazy TM

                     Eager resolution   Lazy resolution
Eager detection      LogTM              EazyHTM (fast commit, good concurrency)
Lazy detection       Impossible         TCC, S-TCC

Outline

• Motivation
• Contributions
• Hardware changes
• The Protocol
• Evaluation
• Conclusions

EazyHTM Contributions

• The best of two worlds
  – Eager conflict detection: simple commit, exact list of conflicts known in advance
  – Lazy conflict resolution: good concurrency
• Parallel commits of non-conflicting TXs
• Designed for CMPs (Chip-Multiprocessors)
  – Exploits core proximity
  – Upgrades the MESI/MOESI protocol (easier verification)

Hardware changes

• CPU
  – Register file checkpoint
  – Racers list: 1 bit per core
  – Killers list: 1 bit per core
  – Both lists track conflicts as bit-vectors (32 bits for 32 cores)
• Private cache(s), next to the existing cache logic
  – SR: 1 bit per line
  – SM: 1 bit per line
  – SR and SM hold the read/write set
• Directory, next to the existing directory logic
  – TD: 1 bit per line; read-only optimization bit (details in the paper)

[Figure: block diagram of one of several cores, its private cache(s) and the directory.]
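As a rough back-of-the-envelope check, the per-core state listed on this slide can be added up. The helper names below are hypothetical, the register file checkpoint is left out because its size is not given, and the 512-line cache in the example is only an illustrative assumption (for instance a 32 KB cache with 64 B lines).

```python
def extra_bits_per_core(num_cores: int, private_cache_lines: int) -> dict:
    """Extra state bits added to one core and its private cache."""
    return {
        "racers_list": num_cores,    # 1 bit per core
        "killers_list": num_cores,   # 1 bit per core
        "SR": private_cache_lines,   # 1 bit per cache line
        "SM": private_cache_lines,   # 1 bit per cache line
    }


def extra_bits_in_directory(tracked_lines: int) -> int:
    """Extra state bits in the directory: one TD bit per tracked line."""
    return tracked_lines


bits = extra_bits_per_core(num_cores=32, private_cache_lines=512)
total = sum(bits.values())
print(bits)
print(total, "bits =", total // 8, "bytes")
```

Under these assumptions the lists and the SR/SM bits come to 1088 bits (136 bytes) per core; a second private cache level with its own SR/SM bits would add to that.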

Racers and killers list

• If a line is shared between two TXs:
  – Read-Read
    • No conflict
  – Write-Read, Read-Write, Write-Write
    • Writer adds reader TX into “racers” list
      – “TXs that I have to abort” list, if I commit first
    • Reader adds writer TX into “killers” list
      – “TXs that can abort me” list, if they commit first
• We illustrate only the Write-after-Read (WAR) conflict
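The rule above can be sketched with one bit per core, as on the hardware slide. `CoreLists` and `record_shared_access` are hypothetical names; treating a Write-Write pair symmetrically (each TX ends up in the other's racers and killers lists) is an assumption of this sketch, not something the slide spells out.

```python
from dataclasses import dataclass

READ, WRITE = "R", "W"


@dataclass
class CoreLists:
    core_id: int
    racers: int = 0   # bit i set: abort the TX on core i if I commit first
    killers: int = 0  # bit i set: the TX on core i can abort me

def record_shared_access(a: CoreLists, a_access: str,
                         b: CoreLists, b_access: str) -> None:
    """Update both cores' lists when their TXs share a line."""
    if a_access == READ and b_access == READ:
        return  # Read-Read: no conflict, nothing to record
    if a_access == WRITE:
        a.racers |= 1 << b.core_id   # writer remembers whom to abort
        b.killers |= 1 << a.core_id  # the other TX remembers who can abort it
    if b_access == WRITE:
        b.racers |= 1 << a.core_id
        a.killers |= 1 << b.core_id


# Write-after-Read example: TX on core 0 read the line, TX on core 2 writes it.
tx0 = CoreLists(core_id=0)
tx2 = CoreLists(core_id=2)
record_shared_access(tx0, READ, tx2, WRITE)
print(bin(tx2.racers), bin(tx0.killers))
```

After the call, core 2 has bit 0 set in its racers list and core 0 has bit 2 set in its killers list; the other two lists stay empty.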

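EazyHTM Protocol: Conflict Detection (1/2)

Example: TX 0 executes BTX; RD A; CTX while TX 2 executes BTX; WR A; CTX.

[Figure: TX 0 and TX 2, each with racers and killers lists, and the directory with the sharers of @A.]

• TX 0 reads A and sends txMark @A to the directory (txMark replaces GETS/GETX)
• The directory answers ACK @A, 0: no other sharers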

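EazyHTM Protocol: Conflict Detection (2/2)

Example (continued): TX 0 executes BTX; RD A; CTX while TX 2 executes BTX; WR A; CTX.

[Figure: TX 0 and TX 2, each with racers and killers lists, and the directory with the sharers of @A; five numbered steps.]

• TX 2 writes A and sends txMark @A to the directory
• The directory answers ACK @A, 1: 1 other sharer
• The directory sends txAccessor #2, @A to the sharer, TX 0
• TX 0 answers Reader #0, @A: potential conflict
• TX 2 answers Writer #2, @A
• TX 2 remembers: abort TX#0 on commit
• TX 0 remembers: TX#2 can abort me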
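The directory's part in the two detection slides can be sketched as follows. The `Directory` class, the tuple layout of the messages and the method name `tx_mark` are hypothetical; only the message names (txMark, ACK, txAccessor) and the sharer count in the ACK come from the slides.

```python
from collections import defaultdict


class Directory:
    """Toy directory that tracks which cores share each line."""

    def __init__(self):
        self.sharers = defaultdict(set)  # line address -> set of core ids

    def tx_mark(self, core_id, addr):
        """Handle a txMark request from core_id for line addr.

        Returns the ACK for the requester (carrying the number of other
        sharers) and one txAccessor message per other sharer.
        """
        others = sorted(self.sharers[addr] - {core_id})
        self.sharers[addr].add(core_id)
        ack = ("ACK", addr, len(others))
        forwards = [("txAccessor", core_id, addr, dest) for dest in others]
        return ack, forwards


directory = Directory()
first = directory.tx_mark(core_id=0, addr="A")   # TX 0 reads A
second = directory.tx_mark(core_id=2, addr="A")  # TX 2 writes A
print(first)
print(second)
```

The first request gets an ACK with 0 other sharers and no forwarded messages; the second gets an ACK with 1 other sharer plus a txAccessor message addressed to core 0, which is where the Reader/Writer exchange between the two cores would start.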

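EazyHTM Protocol: Conflict Resolution

Example (continued): TX 0 executes BTX; RD A; CTX while TX 2 executes BTX; WR A; CTX.

[Figure: TX 0 and TX 2, each with racers and killers lists, and the directory with the sharers of @A; three numbered steps.]

• TX#2 reaches the commit point first, so it aborts TX#0
• TX 2 sends Abort to TX 0
• TX 0 replies with Abort Ack
• TX 2 commits its write: WR @A (commit)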
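The resolution step can be sketched as a commit routine that first aborts every TX in the committer's racers list and then publishes the buffered writes. The `Tx` class and `commit` function are hypothetical, sets are used instead of bit-vectors for readability, and the ordering (collect all Abort Acks, then write) is an assumption of this sketch.

```python
class Tx:
    def __init__(self, core_id):
        self.core_id = core_id
        self.racers = set()     # cores whose TX I abort if I commit first
        self.killers = set()    # cores whose TX can abort me
        self.write_buffer = {}  # speculative writes, published on commit
        self.aborted = False

    def receive_abort(self, from_core):
        """Discard speculative state and acknowledge the abort."""
        self.aborted = True
        self.write_buffer.clear()
        self.racers.clear()
        self.killers.clear()
        return ("AbortAck", self.core_id)


def commit(tx, all_txs, memory):
    """Try to commit tx; returns False if tx was aborted earlier."""
    if tx.aborted:
        return False
    # Lazy resolution: conflicts recorded earlier are resolved only now.
    acks = [all_txs[core].receive_abort(tx.core_id)
            for core in sorted(tx.racers)]
    # In this sketch every racer acknowledges before the writes are published.
    if len(acks) == len(tx.racers):
        memory.update(tx.write_buffer)
    tx.write_buffer.clear()
    tx.racers.clear()
    return True


memory = {"A": 0}
tx0, tx2 = Tx(0), Tx(2)
txs = {0: tx0, 2: tx2}
tx2.write_buffer["A"] = 42
tx2.racers.add(0)    # recorded during conflict detection
tx0.killers.add(2)
print(commit(tx2, txs, memory), tx0.aborted, memory)
```

Because the racers list is already known at the commit point, the committer does not need a validation phase; it only has to notify the listed cores.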

EazyHTM Protocol: Disjoint data => parallel commit

Example: TX 0 executes BTX; WR A; CTX while TX 2 executes BTX; WR B; CTX. TX#0 works with line @A; TX#2 works with line @B.

[Figure: TX 0 and TX 2, each with racers and killers lists, and the directory with the sharers of @A and @B; three numbered steps per TX.]

• TX 0 sends txMark @A; TX 2 sends txMark @B
• The directory answers ACK @A, 0 and ACK @B, 0: 0 other sharers for each line
• WR @A (commit) and WR @B (commit) proceed in parallel: NO SERIALIZATION

Implementation

• Implemented in M5, a full-system simulator (Alpha)
• Private L1 (32KB, 4-way, 64B CL, 2 cycles)
• Private L2 (512KB, 8-way, 64B CL, 10 cycles)
• Memory (with directory, 100 cycles)
• ICN (2D Mesh, 10 cycles per hop)
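The simulated configuration can be written down as a small parameter record to derive the cache geometry. `CacheConfig` and `mesh_latency` are hypothetical helpers, not part of M5; the mesh latency assumes minimal (Manhattan-distance) routing, which the slide does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheConfig:
    size_bytes: int
    ways: int
    line_bytes: int
    latency_cycles: int

    @property
    def lines(self) -> int:
        return self.size_bytes // self.line_bytes

    @property
    def sets(self) -> int:
        return self.lines // self.ways


# Values taken from the slide.
L1 = CacheConfig(size_bytes=32 * 1024, ways=4, line_bytes=64, latency_cycles=2)
L2 = CacheConfig(size_bytes=512 * 1024, ways=8, line_bytes=64, latency_cycles=10)
MEMORY_LATENCY_CYCLES = 100
HOP_LATENCY_CYCLES = 10


def mesh_latency(src, dst):
    """One-way latency between two (x, y) mesh nodes, assuming minimal routing."""
    hops = abs(src[0] - dst[0]) + abs(src[1] - dst[1])
    return hops * HOP_LATENCY_CYCLES


print(L1.lines, L1.sets)
print(L2.lines, L2.sets)
print(mesh_latency((0, 0), (2, 3)))
```

With these numbers the L1 has 512 lines in 128 sets and the L2 has 8192 lines in 1024 sets; a message crossing 5 hops costs 50 cycles under the routing assumption.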

Evaluation

• Evaluated STAMP benchmarks
• Compared with a Scalable-TCC-like HTM
  – Same base simulator
  – Implemented specialized directory protocol
• Compared with an ideal lazy HTM (MESI based)
  – magical conflict detection
  – instant conflict resolution
  – parallel write-back commit

Kmeans Low

• Small TXs (read set (RS): 15 cache lines (CL); write set (WS): 5 CL)
• Low contention (10% aborts)

• Similar profile to “replacing locks with atomic”

• Near ideal performance

• K-means: groups N-dimensional space into K clusters

• Most of the SPLASH-2 suite has a similar profile

[Figure: Kmeans-Low, speedup vs. processors for Ideal, EazyHTM and STCC.]

SSCA2

• Small TXs (RS 50 CL, WS 10 CL)

• Low contention (1.2% aborts)

• Near ideal performance

• Scalability affected by barriers, not by contention

• SSCA2: large directed graph operations

[Figure: SSCA2, speedup vs. processors for Ideal, EazyHTM and STCC.]

Yada

• Large TXs (260 CL RS, 140 CL WS)

• Moderate contention (35% aborts)

• Good performance for large TXs as well
• Yada: Delaunay mesh refinement

[Figure: Yada, speedup vs. processors for Ideal, EazyHTM and STCC.]

Intruder

• Medium TXs (53 CL RS, 20 CL WS)

• High contention (85% aborts)

• Very bad scalability for all HTMs

• Every transaction detects conflicts over and over again; the many conflict detection messages slow down the execution
• Intruder: signature-based network intrusion detection system

[Figure: Intruder, speedup vs. processors for Ideal, EazyHTM and STCC.]

Only high-conflict STAMP

• >50% abort rate only

• High-contention, high-core-count configurations should be optimized
• Averages:
  – Labyrinth
  – Intruder
  – Kmeans-Hi

• Results highly affected by Intruder

[Figure: High-conflict STAMP, speedup vs. processors for Ideal, EazyHTM and STCC.]

Only low-conflict STAMP

• <50% abort rate only

• Low abort rate necessary for scaling

• Excludes:
  – Labyrinth 8-32
  – Intruder 16-32
  – Kmeans-Hi 32
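The exclusion list above can be expressed as a simple filter over (benchmark, processor count) pairs. Reading "8-32" as a range of processor counts is an assumption of this sketch, and `EXCLUDED` and `in_low_conflict_set` are hypothetical names.

```python
# Configurations excluded from the low-conflict average, read as
# inclusive ranges of processor counts (an assumption).
EXCLUDED = {
    "Labyrinth": range(8, 33),
    "Intruder": range(16, 33),
    "Kmeans-Hi": range(32, 33),
}


def in_low_conflict_set(benchmark: str, processors: int) -> bool:
    """True if this configuration counts towards the low-conflict average."""
    return processors not in EXCLUDED.get(benchmark, ())


for bench, procs in [("Intruder", 8), ("Intruder", 16), ("SSCA2", 32)]:
    print(bench, procs, in_low_conflict_set(bench, procs))
```

Benchmarks that do not appear in the table, such as SSCA2 or Yada, are kept at every processor count.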

[Figure: Scaling STAMP, speedup vs. processors for Ideal, EazyHTM and STCC.]

Conclusions

• Introduced EazyHTM, a new HTM implementation
  – Eager conflict detection, lazy conflict resolution
  – Fast: performs well for low-conflict parallel applications
  – Minimal changes to directory protocols (easier verification)
  – As scalable as a standard directory protocol
• The EazyHTM mechanism could allow (future work):
  – Simpler transaction prioritization
  – Less wasted work
  – Better performance optimization
  – Power-efficient TM mechanisms

Thank you!

Questions?