LogTM-SE: Decoupling Hardware Transactional Memory from Caches

45
© 2007 Multifacet Project University of Wisconsin- Madison LogTM-SE: Decoupling Hardware Transactional Memory from Caches Luke Yen, Jayaram Bobba, Michael R. Marty, Kevin E. Moore, Haris Volos, Mark D. Hill, Michael M. Swift, and David A. Wood Multifacet Project ( www.cs.wisc.edu/multifacet ) Computer Sciences Dept., Univ. of Wisconsin-Madison

description

LogTM-SE: Decoupling Hardware Transactional Memory from Caches. Luke Yen , Jayaram Bobba, Michael R. Marty, Kevin E. Moore, Haris Volos, Mark D. Hill, Michael M. Swift, and David A. Wood. Multifacet Project ( www.cs.wisc.edu/multifacet ) Computer Sciences Dept., Univ. of Wisconsin-Madison. - PowerPoint PPT Presentation

Transcript of LogTM-SE: Decoupling Hardware Transactional Memory from Caches

Page 1: LogTM-SE:  Decoupling Hardware Transactional  Memory from Caches

© 2007 Multifacet Project University of Wisconsin-Madison

LogTM-SE: Decoupling Hardware Transactional

Memory from Caches

Luke Yen, Jayaram Bobba,Michael R. Marty, Kevin E. Moore, Haris Volos,

Mark D. Hill, Michael M. Swift, and David A. Wood

Multifacet Project (www.cs.wisc.edu/multifacet)Computer Sciences Dept., Univ. of Wisconsin-Madison

Page 2: LogTM-SE:  Decoupling Hardware Transactional  Memory from Caches

04/22/23 LogTM-SE Wisconsin Multifacet Project2

Executive Summary

• Hardware Transactional Memory (HTM) Fast– HW handles old/new versions (e.g., write buffer)– HW handles conflict detection (R/W bits & coherence)

• But Closely Coupled to L1 cache– On critical paths & hard for SW to save/restore

• Our Approach: Decoupled, Simple HW, SW control• LogTM Signature Edition (LogTM-SE)

– HW: LogTM’s Log + Signatures (from Illinois Bulk)– SW: Unbounded nesting, thread switching, & paging

Page 3: LogTM-SE:  Decoupling Hardware Transactional  Memory from Caches

04/22/23 LogTM-SE Wisconsin Multifacet Project3

Outline

• Motivation – How HTMs accelerate TM– HTM Issues– Our Approach – Logs for version management– Signatures for conflict detection

• LogTM Signature Edition (LogTM-SE) Hardware

• LogTM-SE in Operation

• Conclusions and future work

Page 4: LogTM-SE:  Decoupling Hardware Transactional  Memory from Caches

04/22/23 LogTM-SE Wisconsin Multifacet Project4

How HTMs Accelerate TM

• Version Management– Save old values (for abort) & new values (for commit)– Write buffer, cache incoherence, overflow structures

• Conflict Detection– Record read/write sets & detect overlaps– Read/Write (R/W) bits on cache blocks & coherence

Page 5: LogTM-SE:  Decoupling Hardware Transactional  Memory from Caches

04/22/23 LogTM-SE Wisconsin Multifacet Project5

Some HTM Issues

• R/W bits in precious L1 cache design

• Replicate R/W bits for SMT?

• Replicate again for (bounded) nesting?

• Save/restore R/W for thread switch?

• Modify for paging (virtual page moves)?

Page 6: LogTM-SE:  Decoupling Hardware Transactional  Memory from Caches

04/22/23 LogTM-SE Wisconsin Multifacet Project6

Our Approach

• Decoupled: Decouple HTM state from L1 caches

• Simple HW: Keep HW state simple & SW accessible

• SW Control: Have SW manage rare, complex events

• (Apply classic “systems” principles to HTMs)

Page 7: LogTM-SE:  Decoupling Hardware Transactional  Memory from Caches

04/22/23 LogTM-SE Wisconsin Multifacet Project7

Decoupling R / W bits from Cache

BEFORE:

Data Caches

Registers

SMT Thread Context

WriteRead

Tag DataR W R W

R W

R W

AFTER:

R W

R W

Page 8: LogTM-SE:  Decoupling Hardware Transactional  Memory from Caches

04/22/23 LogTM-SE Wisconsin Multifacet Project8

Version Management

• Save old values (for abort) & new values (for commit)

• Use LogTM’s Log– Before writing new values into memory, HW writes

old values (and their virtual addresses) into log– Allocated per-thread in virtual memory (like pthread’s stacks)– Log exposed to SW & not tied to a processor

• Why?– Decoupled, Simple HW, SW Control

Page 9: LogTM-SE:  Decoupling Hardware Transactional  Memory from Caches

04/22/23 LogTM-SE Wisconsin Multifacet Project9

Version Management Processor Hardware

Data Caches

Registers Register Checkpoint

LogFrame

LogPtr

TMcount

SMT Thread Context

Tag Data

Page 10: LogTM-SE:  Decoupling Hardware Transactional  Memory from Caches

04/22/23 LogTM-SE Wisconsin Multifacet Project10

Conflict Detection

• Record read/write sets & detect overlaps

• Adapt Signatures from Bulk [Ceze et al., ISCA 2006]– Over-approximate read/write sets in per-proc. Bloom Filters– Check signatures on coherence events (unlike Bulk)– Replicate for SMT– Exposed to SW for nesting, thread switching, etc.

• Why?– Decoupled, Simple HW, SW Control

Page 11: LogTM-SE:  Decoupling Hardware Transactional  Memory from Caches

04/22/23 LogTM-SE Wisconsin Multifacet Project11

Signature Operation

Program:xbegin

LD AST BLD CLD DST C

0000000000000100

00000010

0010010000100100

00100010

Hash Function(s)

00000000

R

W

ABCDExternal ST E

00100100

00100010

ALIASFALSE POSITIVE:CONFLICT!

External ST F

00100100

00100010

NO CONFLICT

Page 12: LogTM-SE:  Decoupling Hardware Transactional  Memory from Caches

04/22/23 LogTM-SE Wisconsin Multifacet Project12

Outline

• Motivation

• LogTM Signature Edition (LogTM-SE) Hardware– Single-CMP system– Processor hardware– Experimental Results

• LogTM-SE in Operation

• Conclusions and future work

Page 13: LogTM-SE:  Decoupling Hardware Transactional  Memory from Caches

04/22/23 LogTM-SE Wisconsin Multifacet Project13

Single-CMP System

L1 $

Core1

L1$

Core2

L1$

Core14

L1$

Core15

L1$

Core16…

L2 $

DRAM

Interconnect

Page 14: LogTM-SE:  Decoupling Hardware Transactional  Memory from Caches

04/22/23 LogTM-SE Wisconsin Multifacet Project14

LogTM-SE Processor Hardware

• Segmented log, like LogTM

• Track R / W sets withR / W signatures– Over-approximate R / W sets– Tracks physical addresses– Summary signature used for

virtualization

• Conflict detection by coherence protocol– Check signatures on every

memory access for SMT

Data Caches

Registers Register Checkpoint

LogFrame

LogPtr

TMcount

Write

SMT Thread Context

Tag Data

Read

SummaryReadSummaryWrite

NO TM STATE

Page 15: LogTM-SE:  Decoupling Hardware Transactional  Memory from Caches

04/22/23 LogTM-SE Wisconsin Multifacet Project15

Experimental Methodology

• Infrastructure– Virtutech Simics full-system simulation– Wisconsin GEMS timing modules

• System– 32 transactional threads (16 cores x 2 SMT threads/core)– 32kB 4-way L1 I and D, 64-byte blocks, 1cycle latency– 8MB 8-way unified L2, 34 cycle latency– L2 directory for coherence, maintains full sharer bit vector

• Workloads– Radiosity, Raytrace, Mp3d, Cholesky Berkeley DB

Page 16: LogTM-SE:  Decoupling Hardware Transactional  Memory from Caches

04/22/23 LogTM-SE Wisconsin Multifacet Project16

Lock Results

Page 17: LogTM-SE:  Decoupling Hardware Transactional  Memory from Caches

04/22/23 LogTM-SE Wisconsin Multifacet Project17

Perfect Signature Results

Perfect signatures similar or better than Locks

Page 18: LogTM-SE:  Decoupling Hardware Transactional  Memory from Caches

04/22/23 LogTM-SE Wisconsin Multifacet Project18

Realistic Signature Results

Realistic Signatures similar to Perfect Signatures and Locks

For our workloads, false positives are not a problem

Page 19: LogTM-SE:  Decoupling Hardware Transactional  Memory from Caches

04/22/23 LogTM-SE Wisconsin Multifacet Project19

What about scalability?

• Bigger system• Bigger transactions

• False positives are a function of:– Transaction size– Transactional duty cycle– Number of concurrent transactional threads– Filtering due to on-chip directory protocol

• Signatures gracefully degrade to serialization

Page 20: LogTM-SE:  Decoupling Hardware Transactional  Memory from Caches

04/22/23 LogTM-SE Wisconsin Multifacet Project20

HTM Issues (Solutions)

• R/W bits in precious L1 cache design– R/W signatures: out of L1 cache & software-accessible

• Replicate R/W bits for SMT?– Easier to replicate signatures for SMT, but false positives

• Replicate again for (bounded) nesting?

• Save/restore R/W for thread switch?

• Modify for paging (virtual page moves)?

Page 21: LogTM-SE:  Decoupling Hardware Transactional  Memory from Caches

04/22/23 LogTM-SE Wisconsin Multifacet Project21

Outline

• Motivation

• LogTM Signature Edition (LogTM-SE) Hardware

• LogTM-SE in Operation– Unbounded Nested Transactions– Thread Switching– (Paging)

• Conclusions and future work

Page 22: LogTM-SE:  Decoupling Hardware Transactional  Memory from Caches

04/22/23 LogTM-SE Wisconsin Multifacet Project22

Unbounded Nesting Support

• Why?– Composability: libraries– Software Constructs: Retry, OrElse [Harris, PPoPP ‘05]

• What?– Signatures for each nesting level

• How?– One R / W signature set per SMT thread– Save / Restore signatures using Transaction Log

Page 23: LogTM-SE:  Decoupling Hardware Transactional  Memory from Caches

04/22/23 LogTM-SE Wisconsin Multifacet Project23

Nested Begin

01001000

01010010 Xact header

Transaction Log

RW

1TMCount

Log Frame

Log Ptr Xact header

Undo entry

Undo entry

Undo entry

xbegin

LD …

ST …

xbegin

0100100001010010

Program Processor State

00000000

00000000

Page 24: LogTM-SE:  Decoupling Hardware Transactional  Memory from Caches

04/22/23 LogTM-SE Wisconsin Multifacet Project24

Nested Begin

0100100001010010 Xact header

Transaction Log

RW

2TMCount

Log Frame

Log PtrXact header

Undo entry

Undo entry

Undo entry

xbegin

LD …

ST …

xbegin

0100100001010010

Program Processor State

Page 25: LogTM-SE:  Decoupling Hardware Transactional  Memory from Caches

04/22/23 LogTM-SE Wisconsin Multifacet Project25

Partial Abort

0100100001010010 Xact header

Transaction Log

RW

2TMCount

Log Frame

Log PtrXact header

Undo entry

Undo entry

Undo entry

xbegin

LD …

ST …

xbegin

LD …

ST …

ABORT!

0100100001010010

Program Processor State

Undo entry

Undo entry

01110110

01001001

1

Page 26: LogTM-SE:  Decoupling Hardware Transactional  Memory from Caches

04/22/23 LogTM-SE Wisconsin Multifacet Project26

Nested Commit

Xact header

Transaction Log

RW

2TMCount

Log Frame

Log PtrXact header

Undo entry

Undo entry

Undo entry

xbegin

LD …

ST …

xbegin

LD …

ST …

xend

0100100001010010

Program Processor State

Undo entry

Undo entry

01110110

01001001

1

01001000

01010010

Page 27: LogTM-SE:  Decoupling Hardware Transactional  Memory from Caches

04/22/23 LogTM-SE Wisconsin Multifacet Project27

Unbounded Nesting Support Summary

• Closed nesting:– Begin: save signatures– Abort: restore signatures– Commit: No signature action

• Open nesting:– Begin: save signatures– Abort: restore signatures– Commit: restore signatures

Page 28: LogTM-SE:  Decoupling Hardware Transactional  Memory from Caches

04/22/23 LogTM-SE Wisconsin Multifacet Project28

Thread Switching Support

• Why?– Support long-running transactions

• What?– Conflict Detection for descheduled transactions

• How?– Summary Read / Write signatures:

If thread t of process P is scheduled to use an active signature,the corresponding summary signature holds the union of the saved signatures from all descheduled threads from process P.

Updated using TLB-shootdown-like mechanism

Page 29: LogTM-SE:  Decoupling Hardware Transactional  Memory from Caches

04/22/23 LogTM-SE Wisconsin Multifacet Project29

Summary0000000000000000

WRSummary

0000000000000000

WR

Handling Thread Switching

01001000

01010010

W

R

0100000

01010010

W

R

0100000

01000010

W

R

00000000

00000000

W

R

Summary00000000

00000000

W

R

Summary0000000000000000

WR

P1 P2

Summary0000000000000000

WR

P3 P4

T1 T2 T3

OS

Page 30: LogTM-SE:  Decoupling Hardware Transactional  Memory from Caches

04/22/23 LogTM-SE Wisconsin Multifacet Project30

Handling Thread Switching

01001000

01010010

W

R

0100000

01010010

W

R

0100000

01000010

W

R

00000000

00000000

W

R

Summary00000000

00000000

W

R

Summary0000000000000000

WRSummary

0000000000000000

WRSummary

0000000000000000

WR Summary

0000000000000000

WR

P1 P2 P3 P4

Deschedule

01001000

01010010

01001000

01010010

T1 T2 T3

OS

Page 31: LogTM-SE:  Decoupling Hardware Transactional  Memory from Caches

04/22/23 LogTM-SE Wisconsin Multifacet Project31

Handling Thread Switching

01001000

01010010

W

R

0100000

01010010

W

R

0100000

01000010

W

R

00000000

00000000

W

R

Summary01001000

01010010

W

R

Summary0000000000000000

WRSummary

0000000000000000

WRSummary

0000000000000000

WR Summary

0000000000000000

WR

P1 P2 P3 P4

Summary0100100001010010

WR

Summary0100100001010010

WR

Deschedule

T1 T2 T3

OS

Page 32: LogTM-SE:  Decoupling Hardware Transactional  Memory from Caches

04/22/23 LogTM-SE Wisconsin Multifacet Project32

Handling Thread Switching

00000000

00000000

W

R

0100000

01010010

W

R

0100000

01000010

W

R

00000000

00000000

W

R

Summary01001000

01010010

W

R

Summary0000000000000000

WR

P1 P2 P3 P4

Summary0100100001010010

WR Summary

0100100001010010

WRSummary

0000000000000000

WR

T2 T3

T1OS

Page 33: LogTM-SE:  Decoupling Hardware Transactional  Memory from Caches

04/22/23 LogTM-SE Wisconsin Multifacet Project33

Thread Switching Support Summary

• Summary Read / Write signatures– Summarizes descheduled threads with active

transactions

• One OS structure per process

• Check summary signature on every memory access

• Updated on transaction deschedule– Similar to TLB shootdown

Coherence

Page 34: LogTM-SE:  Decoupling Hardware Transactional  Memory from Caches

04/22/23 LogTM-SE Wisconsin Multifacet Project34

Paging Support Summary

Problem:– Changing page frames– Need to maintain isolation on transactional blocks

Solution: On Page-Out:– Save Virtual -> Physical mappingOn Page-In:– If different page frame, update signatures with physical

address of transactional blocks in new page frame.

Page 35: LogTM-SE:  Decoupling Hardware Transactional  Memory from Caches

04/22/23 LogTM-SE Wisconsin Multifacet Project35

HTM Issues (Solutions)

• R/W signatures: out of L1 & software-accessible

• Replicate signatures for SMT, but false positives

• Replicate again for (bounded) nesting?– Signatures saved/restored on log

• Save/restore R/W for thread switch?– Signatures saved & distributed in summary signatures

• Modify for paging (virtual page moves)?– (Physical) signatures updated on page-in

Page 36: LogTM-SE:  Decoupling Hardware Transactional  Memory from Caches

04/22/23 LogTM-SE Wisconsin Multifacet Project36

Future Work

• Explore scalability of signatures– Bigger system– Bigger transactions

• Modify OS for LogTM-SE

• Optimize OS-support for Reduced Overheads

Page 37: LogTM-SE:  Decoupling Hardware Transactional  Memory from Caches

04/22/23 LogTM-SE Wisconsin Multifacet Project37

Executive Summary

• Hardware Transactional Memory (HTM) Fast– HW handles old/new versions (e.g., write buffer)– HW handles conflict detection (R/W bits & coherence)

• But Closely Coupled to L1 cache– On critical paths & hard for SW to save/restore

• LogTM Signature Edition (LogTM-SE)– HW: LogTM’s Log + Signatures (from Illinois Bulk)– SW: Unbounded nesting, thread switching, & paging

• Our Approach: Decoupled, Simple HW, SW control

Page 38: LogTM-SE:  Decoupling Hardware Transactional  Memory from Caches

04/22/23 LogTM-SE Wisconsin Multifacet Project38

Google “Wisconsin GEMS”

Works w/ Simics (free to academics) commercial OS/appsSPARC out-of-order, x86 in-order, CMPs, SMPs, & LogTMGPL release of GEMS used in four HPCA 2007 papers

DetailedProcessor

Model

OpalSimics

Microbenchmarks

RandomTester

Dete

rmin

istic

Cont

ende

d lo

cks

Trac

e fli

e

Page 39: LogTM-SE:  Decoupling Hardware Transactional  Memory from Caches

04/22/23 LogTM-SE Wisconsin Multifacet Project39

GEMS News

• Version 1.4 available Friday:– LogTM bugfixes– MESI Single-CMP protocol– Fixes for GCC 4.X Compilation

See www.cs.wisc.edu/gems

• GEMS 2.0: A future major LogTM release will include:– Single-CMP TM protocol– Signature support– Software abort handler

Page 40: LogTM-SE:  Decoupling Hardware Transactional  Memory from Caches

04/22/23 LogTM-SE Wisconsin Multifacet Project40

Backup

Page 41: LogTM-SE:  Decoupling Hardware Transactional  Memory from Caches

04/22/23 LogTM-SE Wisconsin Multifacet Project41

Before Virtualization After Virtualization

$Miss

Com

mit

Abort

$Eviction

$Miss

Com

mit

Abort

$Eviction

Paging

Thread S

witch

UTM - - - H H H HC H H H

VTM - - - S S SC S S S SWV

UnrestrictedTM - - - A B B B B AS AS

XTMXTM-g

--

--

--

ASCSC

--

SCVSCV

SS

SCSC

SCSC

ASAS

PTM-CopyPTM-Select

--

--

--

SCS

SH

SS

SCS

SCS

SS

SS

LogTM-SE - - SC - - S SC - S S

Shaded = virtualization event- = handled in simple HWH = complex hardware

S = handled in softwareA = abort transactionC = copy values

W = walk cacheV = validate read setB = block other transactions

HTM Virtualization Mechanisms

Page 42: LogTM-SE:  Decoupling Hardware Transactional  Memory from Caches

04/22/23 LogTM-SE Wisconsin Multifacet Project42

LogTM Overview [HPCA ’06]

• Version Management– Keep old values for abort AND new values for commit– LogTM:

• Eager: record old values “elsewhere”; update “in place”– Other HTMs:

• Lazy: update “elsewhere”; keep old values “in place”

• Conflict Detection– Find read-write, write-read or write-write conflicts

among concurrent transactions– LogTM:

• Eager: detect conflict on every read/write– Other HTMs:

• Lazy: detect conflict at end (commit/abort)

FastCommit

Allows Stalling

Page 43: LogTM-SE:  Decoupling Hardware Transactional  Memory from Caches

04/22/23 LogTM-SE Wisconsin Multifacet Project43

Benchmark Details

Benchmark Input Unit of Work Units Measured

BerkeleyDB 1000 words 1 database read 128

Cholesky Tk14.O Factorization 1

Radiosity batch 1 task 64

Raytrace teapot Whole parallel phase

1

Mp3d 128 molecules 1 step 1024

Page 44: LogTM-SE:  Decoupling Hardware Transactional  Memory from Caches

04/22/23 LogTM-SE Wisconsin Multifacet Project44

Signature Implementations

Results assume 2048-bit signatures, and operates on physical addresses

Page 45: LogTM-SE:  Decoupling Hardware Transactional  Memory from Caches

04/22/23 LogTM-SE Wisconsin Multifacet Project45

Paging Support Summary

• At Page Out,– Remember VP->PP

• At Page In,if (VP->PP’) for each thread t in process if t is in transaction:

Update signatures of t