EazyHTM: Eager-Lazy Hardware Transactional Memory

Saša Tomić, Cristian Perfumo, Chinmay Kulkarni,

Adrià Armejach, Adrián Cristal, Osman Unsal,

Tim Harris, Mateo Valero

Barcelona Supercomputing Center, UPC

BITS Pilani

Microsoft Research Cambridge

Why Transactional Memory?

• Lock-based parallel programming has problems

– Deadlocks, races, complexity, performance, …

• Transactional Memory (TM) to the rescue

– Optimistic concurrency control mechanism

– Easy to use

– Deadlock free

– Supports composability

– Protects data in critical sections

• Hardware-TM (HTM), Software-TM (STM) and hybrid

HTM terminology

• Atomic section/transaction: group of instructions that appear to take effect instantaneously

• Version management (where speculative values are stored):

– in place, logging the original value, or

– buffered in private storage and published on commit

• Conflict: one TX writes a line that another TX reads

– Detection: the action of checking for conflicts

– Resolution: the action performed to resolve a conflict

• Can be an abort, stalling the execution, …
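The two version-management options above can be sketched in software. This is only an illustration of the idea, not how any HTM implements it: the class names `EagerVersioning` and `LazyVersioning` are hypothetical, memory is modelled as a plain dict, and every address is assumed to exist already.

```python
class EagerVersioning:
    """In-place writes with an undo log (hypothetical sketch)."""

    def __init__(self, memory):
        self.memory = memory
        self.undo_log = []  # (address, original value) pairs

    def write(self, addr, value):
        # Log the original value, then update memory in place.
        self.undo_log.append((addr, self.memory[addr]))
        self.memory[addr] = value

    def commit(self):
        # New values are already in place: only the log is dropped.
        self.undo_log.clear()

    def abort(self):
        # Restore original values in reverse order of the writes.
        for addr, old in reversed(self.undo_log):
            self.memory[addr] = old
        self.undo_log.clear()


class LazyVersioning:
    """Writes buffered privately and published on commit (hypothetical sketch)."""

    def __init__(self, memory):
        self.memory = memory
        self.write_buffer = {}

    def read(self, addr):
        # A transaction sees its own buffered writes first.
        if addr in self.write_buffer:
            return self.write_buffer[addr]
        return self.memory[addr]

    def write(self, addr, value):
        self.write_buffer[addr] = value

    def commit(self):
        # Publish all buffered values to memory.
        self.memory.update(self.write_buffer)
        self.write_buffer.clear()

    def abort(self):
        # Memory was never touched: discarding the buffer is enough.
        self.write_buffer.clear()
```

In this sketch the eager variant pays on abort (it walks the log) and the lazy variant pays on commit (it copies the buffer).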

Eager HTM

• A.k.a. pessimistic

• Writes in place; detects and resolves conflicts on every access

• LogTM [Moore, HPCA06], LogTM-SE [Yen, HPCA07]

• Fast commit, slow abort, limited concurrency

[Figure: timeline of TX 1, TX 2 and TX 3 with reads (R), a write (W), a stall and a fast commit.]

Lazy HTM

• A.k.a. optimistic

• Writes buffered; detects and resolves conflicts on commit

• TCC [Hammond, ISCA04], Scalable-TCC [Chafi, HPCA07]

• Fast abort, complex commit, good concurrency

[Figure: timeline of TX 1, TX 2 and TX 3 with reads (R) and a write (W); the commit is complex: validate + write.]

The Motivation

Splitting conflict management

• Eager-Lazy hardware-software TM exists (FlexTM [Shriraman, ISCA08]):

– Software begin, commit and abort

– Probabilistic (signature based) conflict detection

• EazyHTM is the first pure-hardware Eager-Lazy TM

Conflict detection vs. conflict resolution:

                      Eager resolution    Lazy resolution
  Eager detection     LogTM               EazyHTM
  Lazy detection      Impossible          TCC, S-TCC

EazyHTM: fast commit, good concurrency

Outline

• Motivation

• Contributions

• Hardware changes

• The Protocol

• Evaluation

• Conclusions

EazyHTM Contributions

• The best of two worlds

– Eager conflict detection: simple commit, exact list of conflicts known in advance

– Lazy conflict resolution: good concurrency

• Parallel commits of non-conflicting TXs

• Designed for CMPs (Chip-Multiprocessors)

– Exploits core proximity

– MESI/MOESI protocol upgrade (easier verification)

Hardware changes

Per-core additions (CPU):

– Register file checkpoint

– Racers list: 1 bit per core; tracks conflicts

– Killers list: 1 bit per core; tracks conflicts

– Each list is a bit-vector: 32 bits for 32 cores

Private cache(s), next to the existing cache logic:

– SR: 1 bit per line

– SM: 1 bit per line

– SR and SM hold the read/write set

Directory, next to the existing directory logic:

– TD: 1 bit per line; read-only optimization bit (details in the paper)

[Figure: block diagram of one core (CPU, private caches, directory), replicated across cores.]

Racers and killers list

• If line is shared between two TXs:

– Read-Read

• No conflict

– Write-Read, Read-Write, Write-Write

• Writer adds reader TX into “racers” list

– “TXs that I have to abort” list, if I commit first

• Reader adds writer TX into “killers” list

– “TXs that can abort me” list, if they commit first

• We illustrate only the Write-after-Read (WAR) conflict
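The bookkeeping above can be sketched with one integer per list, used as a bit-vector with 1 bit per core. This is a software illustration only; `CoreConflictLists`, `record_conflict` and `listed_cores` are hypothetical names, and the core count of 32 is taken from the hardware slide.

```python
NUM_CORES = 32  # the slides size each list at 1 bit per core


class CoreConflictLists:
    """Racers and killers lists of one core, each stored as a bit-vector."""

    def __init__(self, core_id):
        self.core_id = core_id
        self.racers = 0   # bit i set: the TX on core i is aborted if I commit first
        self.killers = 0  # bit i set: the TX on core i can abort me if it commits first


def record_conflict(writer, reader):
    """Record one conflict on a shared line between a writer and a reader."""
    writer.racers |= 1 << reader.core_id
    reader.killers |= 1 << writer.core_id


def listed_cores(bits):
    """Return the core ids whose bit is set in a racers or killers vector."""
    return [i for i in range(NUM_CORES) if (bits >> i) & 1]
```

For example, after `record_conflict(writer=tx2, reader=tx0)` with cores 2 and 0, `listed_cores(tx2.racers)` gives `[0]` and `listed_cores(tx0.killers)` gives `[2]`. Handling a Write-Write conflict by calling `record_conflict` once in each direction is an assumption of this sketch, not something the slides state.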

EazyHTM Protocol: Conflict Detection (1/2)

Running example: TX 0 executes BTX, RD A, CTX; TX 2 executes BTX, WR A, CTX.

• Step 1: TX 0 reads A and sends txMark @A to the directory (txMark replaces GETS/GETX)

• Step 2: the directory finds no other sharers of @A and answers ACK @A, 0

[Figure: racers and killers lists of TX 0 and TX 2, and the directory's sharers list for @A.]

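EazyHTM Protocol: Conflict Detection (2/2)

• TX 2 writes A and sends txMark @A to the directory

• The directory finds 1 other sharer of @A: ACK @A, 1

• Messages exchanged for the potential conflict: txAccessor #2, @A; Reader #0, @A; Writer #2, @A

• TX 2 remembers: abort TX#0 on commit (racers list)

• TX 0 remembers: TX#2 can abort me (killers list)

[Figure: five numbered protocol steps between TX 0, TX 2 and the directory's sharers list for @A.]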

EazyHTM Protocol: Conflict Resolution

Running example: TX 0 executes BTX, RD A, CTX; TX 2 executes BTX, WR A, CTX.

TX#2 came to the commit point first, so it aborts TX#0:

• Abort from TX#2

• Abort Ack from TX#0

• WR @A (commit)

[Figure: three numbered steps between TX 2, TX 0 and the directory's sharers list for @A.]
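The detection and resolution steps above can be replayed in a small toy model. It is a sketch under simplifying assumptions, not the hardware protocol: the class `EazySketch` and its methods are hypothetical, messages are replaced by direct method calls, the directory is assumed to know which sharers wrote a line, and an aborted TX is simply marked instead of being restarted.

```python
class EazySketch:
    """Toy model: eager conflict detection, lazy conflict resolution."""

    def __init__(self, num_cores):
        self.sharers = {}                                     # line -> {core: has_written}
        self.racers = {c: set() for c in range(num_cores)}    # TXs I abort when I commit
        self.killers = {c: set() for c in range(num_cores)}   # TXs that can abort me
        self.aborted = set()

    def access(self, core, line, is_write):
        """Stand-in for txMark: register the access and record conflicts at once."""
        line_sharers = self.sharers.setdefault(line, {})
        for other, other_wrote in line_sharers.items():
            if other == core:
                continue
            if is_write:        # I write a line that `other` has accessed
                self.racers[core].add(other)
                self.killers[other].add(core)
            if other_wrote:     # `other` wrote the line I access
                self.racers[other].add(core)
                self.killers[core].add(other)
        line_sharers[core] = line_sharers.get(core, False) or is_write
        return len(line_sharers) - 1  # other sharers, like the n in "ACK @A, n"

    def commit(self, core):
        """The first TX to reach commit aborts every TX in its racers list."""
        if core in self.aborted:
            return False
        victims = set(self.racers[core])
        self.aborted |= victims
        for tx in victims | {core}:
            self._forget(tx)
        return True

    def _forget(self, core):
        """Drop a finished or aborted TX from all tracking state."""
        for line_sharers in self.sharers.values():
            line_sharers.pop(core, None)
        self.racers[core].clear()
        self.killers[core].clear()
        for c in self.racers:
            self.racers[c].discard(core)
            self.killers[c].discard(core)
```

Replaying the running example: `access(0, "A", is_write=False)` returns 0, `access(2, "A", is_write=True)` returns 1 and puts TX 0 in the racers list of TX 2, and `commit(2)` then succeeds while marking TX 0 as aborted.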

EazyHTM Protocol: Disjoint data => parallel commit

Example: TX 0 executes BTX, WR A, CTX; TX 2 executes BTX, WR B, CTX. TX#0 works with line @A, TX#2 works with line @B.

• Step 1: txMark @A (TX 0) and txMark @B (TX 2)

• Step 2: ACK @A, 0 and ACK @B, 0 (0 other sharers)

• Step 3: WR @A (commit) and WR @B (commit)

• NO SERIALIZATION

[Figure: TX 0 and TX 2 with empty racers and killers lists, and the directory's sharers lists for @A and @B.]
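The condition for a parallel commit can be written as a check on read and write sets. This is a software sketch only: the function names are hypothetical, each TX is assumed to be given as a pair of Python sets of line addresses, and the hardware described in the slides reaches the same answer through per-line bits and the racers/killers lists rather than set intersection.

```python
def conflicts(tx_a, tx_b):
    """True if one TX writes a line that the other TX reads or writes."""
    reads_a, writes_a = tx_a
    reads_b, writes_b = tx_b
    return bool(writes_a & (reads_b | writes_b)) or bool(writes_b & (reads_a | writes_a))


def can_commit_in_parallel(tx_a, tx_b):
    """Non-conflicting TXs have nothing to resolve, so neither waits for the other."""
    return not conflicts(tx_a, tx_b)
```

With TX 0 writing only line A and TX 2 writing only line B, `can_commit_in_parallel` returns `True`; if TX 0 reads A while TX 2 writes A, it returns `False`; two TXs that only read A do not conflict.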

Implementation

• Implemented in M5, full-system simulator (Alpha)

• Private L1 (32KB, 4-way, 64B CL, 2 cycles)

• Private L2 (512KB, 8-way, 64B CL, 10 cycles)

• Memory (with directory, 100 cycles)

• ICN (2D Mesh, 10 cycles per hop)
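As a rough back-of-the-envelope check, the cache sizes above can be combined with the per-core and per-line bits from the hardware slide. This is an estimate under stated assumptions, not a figure from the slides: it assumes 32 cores, assumes that SR and SM bits are kept for every line of both the private L1 and the private L2, and leaves out the TD directory bits because the memory size is not given. The function name is hypothetical.

```python
CACHE_LINE_BYTES = 64
L1_BYTES = 32 * 1024
L2_BYTES = 512 * 1024
NUM_CORES = 32  # assumption: matches "32 bits for 32 cores"


def extra_bits_per_core(num_cores=NUM_CORES, tracked_cache_bytes=(L1_BYTES, L2_BYTES)):
    """Estimate the added state per core, in bits."""
    # Racers list + killers list: 1 bit per core each.
    list_bits = 2 * num_cores
    # SR + SM: 1 bit each per tracked cache line (assumed: L1 and L2).
    lines = sum(size // CACHE_LINE_BYTES for size in tracked_cache_bytes)
    line_bits = 2 * lines
    return list_bits + line_bits
```

Under these assumptions the call `extra_bits_per_core()` gives 17472 bits, about 2.1 KB per core; tracking only the L1 (`tracked_cache_bytes=(L1_BYTES,)`) gives 1088 bits.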

Evaluation

• Evaluated STAMP benchmarks

• Compared with Scalable-TCC-like HTM

– Same base simulator

– Implemented specialized directory protocol

• Compared with ideal lazy HTM (MESI based)

– magical conflict detection

– instant conflict resolution

– parallel write-back commit

Kmeans Low

• Small TXs (RS 15 CL; WS 5 CL)

• Low contention (10% aborts)

• Similar profile to “replacing locks with atomic”

• Near ideal performance

• K-means: groups N-dimensional space into K clusters

• Most of the SPLASH-2 suite has a similar profile

[Figure: Kmeans-Low, speedup vs. number of processors for Ideal, EazyHTM and STCC.]

SSCA2

• Small TXs (RS 50 CL, WS 10 CL)

• Low contention (1.2% aborts)

• Near ideal performance

• Scalability affected by barriers, not by contention

• SSCA2: large directed graph operations

[Figure: SSCA2, speedup vs. number of processors for Ideal, EazyHTM and STCC.]

Yada

• Large TXs (260 CL RS, 140 CL WS)

• Moderate contention (35% aborts)

• Good performance for large TXs as well

• Yada: Delaunay mesh refinement

[Figure: Yada, speedup vs. number of processors for Ideal, EazyHTM and STCC.]

Intruder

• Medium TXs (53 CL RS, 20 CL WS)

• High contention (85% aborts)

• Very bad scalability for all HTMs

• Every transaction detects conflicts over and over again; the many conflict-detection messages slow down the execution

• Intruder: signature-based network intrusion detection system

[Figure: Intruder, speedup vs. number of processors for Ideal, EazyHTM and STCC.]

Only high-conflict STAMP

• >50% abort rate only

• High-contention, high-core-count executions should be optimized

• Averages:

– Labyrinth

– Intruder

– Kmeans-Hi

• Results highly affected by Intruder

[Figure: High-conflict STAMP, speedup vs. number of processors for Ideal, EazyHTM and STCC.]

Only low-conflict STAMP

• <50% abort rate only

• Low abort rate necessary for scaling

• Excludes:

– Labyrinth 8-32

– Intruder 16-32

– Kmeans-Hi 32

[Figure: Scaling STAMP, speedup vs. number of processors for Ideal, EazyHTM and STCC.]

Conclusions

• Introduced EazyHTM, a new HTM implementation

– Eager conflict detection, lazy conflict resolution

– Fast: performs well for low-conflict parallel applications

– Minimal changes to directory protocols (easier verification)

– As scalable as a standard directory protocol

• EazyHTM mechanism could allow (future work):

– Simpler transaction prioritization

– Less wasted work

– Better performance optimization

– Power efficient TM mechanisms

Thank you!

Questions?

sasa.tomic@bsc.es
