Effective and Inexpensive (Memory) Race Recording

Min Xu

Thesis Defense

05/04/2006

Electrical and Computer Engineering Department, UW-Madison

Advisors: Mark Hill, Rastislav Bodik

Committee: Remzi Arpaci-Dusseau, Mikko Lipasti, Barton Miller, David Wood

Overview

Increasingly useful to replay multithreaded code
• Race recording: key to dealing with nondeterminism

A Case Study
• Long recording: 1 byte/kilo-instr
• Always-on recording: less than 2% overhead
• Low cost: 24 KB RAM/core
• Support both SC & TSO (x86-like)

Thesis Contributions

An Effective & Inexpensive Race Recorder
• RTR Algorithm -> Small Log Size
• Coherence Piggyback -> Low Runtime Overhead
• Set/LRU Approximation -> Low Cost Hardware
• Order-Value Hybrid -> SC & TSO Applicability

Outline

Motivation & Problem

An Effective and Inexpensive Race Recorder
• RTR Algorithm
• Coherence Piggyback
• Order-Value Hybrid

Evaluation Method & Results

Conclusion & My Other Research

Motivation & Problem

Multithreaded Debugging

% gcc hash.c
% a.out
Segmentation fault
%

% gdb a.out
gdb> run
Program received SIGSEGV.
In get() at hash.c:45
45 a = bucket->d;

% gcc para-hash.c
% a.out
Segmentation fault
%

% gdb a.out
gdb> run
Program exited normally.
gdb>

% gcc para-hash.c
% a.out
Segmentation fault
Race recorded in “log”
%

% gdb a.out log
gdb> run
Program received SIGSEGV.
In get() at para-hash.c:67
67 a = bucket->d;

Race Recording

[Figure: Thread I and Thread J race on X (X = X*5 vs. print(X)); panels: Original, Replay, Recording.]

Recording for Multithreaded Replay

Race Recording
• Not an issue for a single thread
• Create the same general & data races

Checkpointing
• Provide a snapshot of the program state
• Many proposals (e.g., SafetyNet), not the focus

Input Recording
• Provide repeatable inputs
• Some proposals (e.g., part of FDR), not the focus

A Good Race Recorder

• Long recording: small log
• Low runtime overhead
• Low cost
• Applicability

Desired & Existing Race Recorders

[Figure: race recorders compared along four axes: recording length, applicability, overhead, and cost.]

Desired recorder: small log size; MP, racy programs, TSO; negligible slowdown; little hardware

Existing recorders: InstRply ’87, R&C ’90, Bacon ’91, Netzer ’93, Déjà Vu ’98, RecPlay ’00, JaRec ’04

Our Recorder

Order-Value Hybrid

RTR Algorithm

Coherence Piggyback

Small Log Size

Problem Formulation

Reproduce exact same conflicts: no more, no less

[Figure: Thread I and Thread J during recording and during replay; conflicts drawn in red, dependences in black.]

Recording: detect conflicts, write log

Log All Conflicts

Assign IC (logical timestamps)

[Figure: replay of Thread I and Thread J driven by the dependence log; each dependence entry takes 16 bytes.]

Log J: 2->3, 1->4, 3->5, 4->6

Log I: 2->3

Log Size: 5*16 = 80 bytes (10 integers)

But too many conflicts

Netzer’s Transitive Reduction

[Figure: replay of Thread I and Thread J with the TR reduced log.]

TR reduced Log J: 2->3

Log I: 2->3

Log Size: 64 bytes (8 integers)

The Intuition of the New RTR Algorithm

[Figure: dependences from I to J and from J to I after reduction, grouped into vectors.]

Regulate Replay (RTR)

Stricter Dependences to Aid Vectorization

[Figure: replay of Thread I and Thread J; a stricter dependence takes the place of a reduced one.]

Log J: 2->3, 4->5

Log I: 2->3

New Reduced Log

Compress Vectorized Dependencies

[Figure: replay of Thread I and Thread J with vectorized dependences.]

Log J: x=3,5, ∆=1

Log I: x=3, ∆=1

Vectorized Log

Reduce log size to KB/core/second
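As a rough illustration of the compression step, the sketch below groups dependences that share the same ∆ (destination IC minus source IC) into a single record. The `(src_ic, dst_ic)` tuple layout, the function name `vectorize`, and the output record format are assumptions made for this example; the actual vector encoding in the thesis may differ.

```python
from collections import OrderedDict

def vectorize(deps):
    """Group (src_ic, dst_ic) dependences by delta = dst_ic - src_ic.

    Returns a list of (delta, [dst_ic, ...]) records, one per distinct delta,
    in order of first appearance. Hypothetical encoding, for illustration only.
    """
    groups = OrderedDict()
    for src_ic, dst_ic in deps:
        groups.setdefault(dst_ic - src_ic, []).append(dst_ic)
    return list(groups.items())

# Log J from the slide, read as 2->3 and 4->5 (arrow notation assumed).
log_j = [(2, 3), (4, 5)]
print(vectorize(log_j))  # one record: delta 1 with x = 3, 5
```

Two dependences collapse into one record here; the saving grows when many dependences between a thread pair share the same ∆.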

Order-Value Hybrid

RTR Algorithm

Coherence Piggyback

Low Runtime Overhead

Detect Conflicts

[Figure: Thread I and Thread J during recording, with per-block metadata such as A.readers and A.writer.]

Thread I, IC 1:
A.readers.add(I, 1)

Thread I, IC 2: B.writer = (I, 2)
Thread J, IC 2: C.writer = (J, 2)

Thread I, IC 3:
if (C.writer != I) log(WAW)
foreach C.readers
  if (reader != I) log(WAR)
C.readers.clear()
C.writer = (I, 3)

Thread J, IC 3:
if (B.writer != J) log(RAW)
B.readers.add(J, 3)

Expensive in software
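The bookkeeping on this slide can be sketched in software as follows. This is a minimal sketch, not the recorder itself: the class names `Block` and `ConflictDetector`, the log tuple layout, and the rule that no conflict is logged before a block has a writer are all assumptions made for the example.

```python
class Block:
    """Per-block metadata: last writer and the readers since that write."""
    def __init__(self):
        self.writer = None   # (thread, ic) of the last write, or None
        self.readers = {}    # thread -> ic of its latest read since the last write

class ConflictDetector:
    """Hypothetical software model of the readers/writer checks on the slide."""
    def __init__(self):
        self.blocks = {}
        self.log = []        # (kind, src_thread, src_ic, dst_thread, dst_ic)

    def _block(self, addr):
        return self.blocks.setdefault(addr, Block())

    def read(self, thread, ic, addr):
        b = self._block(addr)
        if b.writer is not None and b.writer[0] != thread:
            self.log.append(("RAW", *b.writer, thread, ic))
        b.readers[thread] = ic

    def write(self, thread, ic, addr):
        b = self._block(addr)
        if b.writer is not None and b.writer[0] != thread:
            self.log.append(("WAW", *b.writer, thread, ic))
        for reader, reader_ic in b.readers.items():
            if reader != thread:
                self.log.append(("WAR", reader, reader_ic, thread, ic))
        b.readers.clear()
        b.writer = (thread, ic)

# Replaying the accesses shown on the slide.
d = ConflictDetector()
d.read("I", 1, "A")
d.write("I", 2, "B")
d.write("J", 2, "C")
d.write("I", 3, "C")   # WAW against J's write at IC 2
d.read("J", 3, "B")    # RAW against I's write at IC 2
print(d.log)
```

Every load and store pays for a metadata lookup and update, which is why the slide calls this approach expensive in software.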

Use Cache and Cache Coherence

Tag  State  Data  Timestamp
A    S      …     1
B    M      …     4

Tag  State  Data  Timestamp
A    S      …     3
B    I      …     2

A.readers, A.writer

B.readers, B.writer

Get/S Request

Data Response + Timestamp

RAW Detected & Logged

Detect conflict in hardware with little runtime cost

Cache Evictions and Writebacks

OK with nonsilent eviction & directory eviction

[Figure: cache line C in state M with timestamp 3 and a line in state M with timestamp 4; Directory of A: Shared(I,J) Owner(); messages: Get/S, Inv, Ack + Timestamp?]

WAR Detected & Logged

Implement TR and RTR in Hardware

Ideal TR requires vector timestamps
• Too expensive
• New idea: Pairwise-TR (use scalar timestamp)
• Enable pairwise transitive reduction

Optimal RTR algorithm is likely expensive
• Implement a greedy RTR algorithm
• One-pass, online algorithm
• Keep a sliding window of vectorizable dependencies
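One way to picture pairwise reduction with scalar timestamps is the sketch below. It is an illustration under assumptions, not the hardware algorithm: the function name `pairwise_tr`, the tuple layout, and the requirement that each destination thread sees its dependences in program order are all choices made for this example. A dependence is dropped only when an earlier logged dependence between the same pair of threads already implies it, so reductions through a third thread are not found.

```python
def pairwise_tr(deps):
    """Drop dependences implied by an earlier logged one for the same thread pair.

    deps: iterable of (src_thread, src_ic, dst_thread, dst_ic), listed in the
    order the destination thread detects them (dst_ic non-decreasing per thread).
    """
    latest_src = {}   # (src_thread, dst_thread) -> largest src_ic already logged
    kept = []
    for src, src_ic, dst, dst_ic in deps:
        key = (src, dst)
        if src_ic <= latest_src.get(key, -1):
            continue  # implied by program order plus an earlier logged dependence
        latest_src[key] = src_ic
        kept.append((src, src_ic, dst, dst_ic))
    return kept

# Dependences from the "Log All Conflicts" slide, read as src->dst pairs
# (arrow notation assumed): Log J holds I->J entries, Log I holds J->I entries.
all_deps = [("I", 2, "J", 3), ("I", 1, "J", 4), ("I", 3, "J", 5),
            ("I", 4, "J", 6), ("J", 2, "I", 3)]
reduced = pairwise_tr(all_deps)
print(len(reduced), "dependences,", len(reduced) * 16, "bytes")
```

Under that reading of the logs, the sketch keeps four of the five dependences, which agrees with the 64-byte (8-integer) log size quoted for transitive reduction.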

Hardware Implementation

Issue                                   Status
Cache eviction/writeback                Solved, more details
Directory protocols                     Solved
Snooping protocols                      Partly solved
Two-level coherence                     Not yet solved
Processor out-of-order/prefetching      Solved
Unordered message                       Solved
Counter overflow                        Solved
Thread migration                        Not yet solved

Order-Value Hybrid

RTR Algorithm

Coherence Piggyback

Low Cost Hardware

Timestamp Approximation

[Figure: one set of I’s cache ($) during recording, with Thread I and Thread J; labels: st A, I, ld D; cache line C in state M with timestamp 3; Directory of A: Shared(I).]

Use current IC of thread

Correct, but more evictions -> more logged conflicts

Trade-off: hardware cost vs. log size

Set/LRU Approximation

[Figure: the same set of I’s cache during recording, comparing “use current IC of thread” with Set/LRU.]

LRU guarantees B’s TS > A’s TS

Set/LRU better preserves reducibility

Small $ -> more misses but still small log
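The Set/LRU idea can be sketched as a small set-associative table. The details here are assumptions for illustration: the class name `SetLruTimestampMemory`, the modulo set index, and the rule that a miss in a full set returns the timestamp of that set's current LRU entry. The rule relies on instruction counts never decreasing, so an evicted entry is always at least as old as whatever is LRU now.

```python
from collections import OrderedDict

class SetLruTimestampMemory:
    """Hypothetical model of a timestamp memory with Set/LRU approximation."""
    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.ways = ways
        # Each set keeps entries in LRU order: oldest first, newest last.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def touch(self, block, ic):
        """Record that `block` was accessed at instruction count `ic`."""
        s = self.sets[block % self.num_sets]
        if block in s:
            s.move_to_end(block)
        elif len(s) >= self.ways:
            s.popitem(last=False)          # evict the LRU entry
        s[block] = ic

    def lookup(self, block):
        """Return the exact timestamp if present, else a conservative bound."""
        s = self.sets[block % self.num_sets]
        if block in s:
            return s[block]
        if len(s) < self.ways:
            return 0                       # set never overflowed: block never touched
        return next(iter(s.values()))      # LRU entry's timestamp bounds any evicted one

# One set with two ways; blocks 0, 1, 2 stand for A, B, C.
tsm = SetLruTimestampMemory(num_sets=1, ways=2)
tsm.touch(0, 1)   # A at IC 1
tsm.touch(1, 2)   # B at IC 2
tsm.touch(2, 3)   # C at IC 3, evicts A
print(tsm.lookup(0))  # bound for A: B's timestamp, which is >= A's true timestamp
```

In this run the thread's current IC (3) would also be a correct answer for A, but the LRU timestamp (2) is closer to the true value (1), which is the sense in which Set/LRU better preserves reducibility.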

Hardware Cost of Timestamps

Coupled timestamp memory: overhead grows with cache size
• Not flexible
• 64B line + 64b (24b) timestamp -> 12.5% (4.7%) overhead
• 192 KB for a 4MB L2

Need to modify cache

[Figure: Coupled Timestamp Memory]

Decoupled Timestamp Memory

Decoupling -> small timestamp memory (Set/LRU)
• e.g., 32-set, 64-way -> 99% transitive reduction
• Timestamp memory -> 24 KB

No need to modify cache

Cache:
Tag  State  Data
A    S      …
B    M      …

Timestamp Memory:
Tag  Timestamp
A    1
B    2

From 192 KB (coupled timestamp memory) to 24 KB: 8x reduction
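The cost figures on the two timestamp-memory slides can be checked with a few lines of arithmetic. The constants come from the slides (64-byte lines, 4 MB L2, 64-bit or 24-bit timestamps, a 24 KB decoupled memory); the function names are made up for this sketch.

```python
LINE_BYTES = 64
L2_BYTES = 4 * 1024 * 1024
DECOUPLED_TSM_BYTES = 24 * 1024

def coupled_overhead(ts_bits):
    """Timestamp bits as a fraction of the data bits in one cache line."""
    return ts_bits / (LINE_BYTES * 8)

def coupled_tsm_bytes(ts_bits):
    """Total timestamp storage when every L2 line carries one timestamp."""
    lines = L2_BYTES // LINE_BYTES
    return lines * ts_bits // 8

print(f"{coupled_overhead(64):.1%}")                  # 64-bit timestamp per line
print(f"{coupled_overhead(24):.1%}")                  # 24-bit timestamp per line
print(coupled_tsm_bytes(24) // 1024, "KB")            # coupled storage for a 4 MB L2
print(coupled_tsm_bytes(24) // DECOUPLED_TSM_BYTES)   # coupled / decoupled ratio
```

The results reproduce the slide's numbers: 12.5% and about 4.7% per-line overhead, 192 KB of coupled timestamp storage, and an 8x reduction with the 24 KB decoupled memory.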

Order-Value Hybrid

RTR Algorithm

Coherence Piggyback

SC & TSO Applicability

Recording with Total Store Order (TSO)

Majority of existing MPs are non-SC

TSO is well defined, x86-like

[Figure: Thread I executes st A,1 and Thread J executes st B,1; execution panels labeled A=1 B=0, A=0 B=1, A=1 B=1, and A=0 B=0.]

TSO Execution

[Figure: Thread I (st A,1) and Thread J (st B,1, ld A) with the memory system; the loads observe A=0 B=0 while the stores are still pending, and the memory system moves from A=0 B=0 to A=1 B=1.]
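The difference between SC and TSO on this example can be checked with a short sketch. Most of the loads on the slide were lost in extraction, so the programs below assume the standard store-buffering pattern: Thread I runs st A,1 then ld B, and Thread J runs st B,1 then ld A. The store-buffer model is a deliberately simplified stand-in for TSO, not a full memory-model checker.

```python
from itertools import permutations

# Assumed programs (standard store-buffering litmus test).
PROGRAMS = {
    "I": [("st", "A"), ("ld", "B")],
    "J": [("st", "B"), ("ld", "A")],
}

def sc_outcomes():
    """Loaded values (I's ld B, J's ld A) over all SC interleavings."""
    results = set()
    ops = [("I", 0), ("I", 1), ("J", 0), ("J", 1)]
    for order in permutations(ops):
        # Keep only interleavings that respect each thread's program order.
        if order.index(("I", 0)) > order.index(("I", 1)):
            continue
        if order.index(("J", 0)) > order.index(("J", 1)):
            continue
        mem = {"A": 0, "B": 0}
        loaded = {}
        for thread, i in order:
            kind, addr = PROGRAMS[thread][i]
            if kind == "st":
                mem[addr] = 1
            else:
                loaded[thread] = mem[addr]
        results.add((loaded["I"], loaded["J"]))
    return results

def tso_store_buffer_outcome():
    """One TSO-legal run: both stores sit in store buffers while the loads execute."""
    mem = {"A": 0, "B": 0}
    buffers = {"I": [("A", 1)], "J": [("B", 1)]}   # buffered, not yet globally visible
    loaded = (mem["B"], mem["A"])                  # each load misses the other core's buffer
    for buf in buffers.values():                   # buffers drain afterwards
        for addr, val in buf:
            mem[addr] = val
    return loaded, mem

print(sorted(sc_outcomes()))
print(tso_store_buffer_outcome())
```

Under SC at least one load sees a 1, whereas the store-buffer run lets both loads return 0 even though memory ends with A=1 B=1. That extra outcome is what a recorder for TSO has to capture.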

Order-Value-Hybrid Recording

[Figure: Recording and Replay panels for Thread I (st A,1) and Thread J (st B,1) with the memory system moving from A=0 B=0 to A=1 B=1. Recording events: Start Monitor A, Start Monitor B, A Changed!, Stop Monitor B.]

WAR omitted, value logged

Replay: value used, A=0

Hybrid Recording with TR and RTR

Hybrid recording
• All loads get correct values
• Hardware similar to OoO SC [Gharachorloo et al. ’91]

Hybrid + TR & RTR
• TR will not use the omitted WAR in reduction
• RTR vectorizes dependencies more conservatively

Evaluation Method & Results

Put-it-together: Determinizer/CMP

[Figure: CMP block diagram. Per core: L1 I$, L1 D$, L1 coherence controller, TSM, log, TR register, RTR register. Shared: L2 cache (L1 directory).]

Simulation Method

Commercial server hardware
• GEMS: http://www.cs.wisc.edu/gems
• Full-system (OS + application) executions
• 4-core CMP (sequentially consistent)
• 1-way in-order issue, 2 GHz
• 64KB I/D L1, 4MB L2, 64-byte lines, MOSI directory

Commercial server software
• Apache: static web serving
• SpecJBB: middleware
• OLTP: TPC-C like
• Zeus: static web serving

Log Size: 1 byte/kilo-instr

Well within the capability of current machines
• Long recording (days – months) needs improvement

[Figure: log size for Apache, JBB, OLTP, Zeus, and AVG, plotted in byte/core/kilo-instr (axis up to 2.0) and in KB/core/s (axis up to 200).]

Runtime Overhead

[Figure: execution time and interconnection message bandwidth for Apache, JBB, OLTP, and Zeus; baseline vs. with race recorder.]

Our recorder can be “always-on”

Benefits of RTR and Set/LRU (Log Size)

[Figure: improvement by RTR: Pairwise-TR vs. our RTR for Apache, JBB, OLTP, Zeus, and AVG.]

[Figure: effectiveness of Set/LRU: perfect TSM vs. 24KB Set/LRU TSM for Apache, JBB, OLTP, Zeus, and AVG.]

Why Do RTR and Set/LRU Work Well?

RTR
• Processors execute instructions at similar speed
• Therefore, we can find “vectorizable” dependencies

Set/LRU
• Temporal locality makes the LRU timestamps old
• We only need to know if a timestamp is “old enough”

Sensitivity and Scalability

A design space of the timestamp memory (TSM)
• Size: smaller TSM -> larger log
• Read/write timestamp: should be used when TSM is large
• Partial timestamp: 24-bit enough
• Associativity: higher better for RTR

Scalability of the recorder
• Studied with modest numbers of processors (2p – 16p)
• Commercial workloads, not scientific workloads
• Log size increases slowly with number of cores

Conclusion & My Other Research

Race Recording

Race recording -> Key to combat nondeterminism

My thesis -> An effective & inexpensive recorder
• RTR algorithm -> small log size
• Coherence piggyback -> negligible slowdown
• Timestamp approximation -> low hardware cost
• Order-value hybrid -> support SC & TSO

Future work
• Improve race recording algorithm
• Improve race recorder implementation
• Study race replay

Serializability Violation Detector [PLDI’05]

Like a race detector

No a priori annotation requirement
• “critical sections” are inferred

Intended to detect bugs that “actually” happen
• Check for a 2-Phase-Locking condition

[Figure: a “critical section” over shared variables; labels: Read in1, Read in2, Write out1, Write out2, Write local, Read local.]

Publications

FDR (ISCA’03)
• Adopted by UCSD BugNet (ISCA’05)

SVD (PLDI’05)
• Cited by Vaziri et al. (POPL’06)
• Influenced new data race definition

RTR, Set/LRU & Hybrid
• Submitted for publication

Thank you!

Acknowledgements

Joint work with my advisors
• Mark Hill, Ras Bodik

Ph.D. Committee
• David Wood, Mikko Lipasti, Remzi Arpaci-Dusseau, Barton Miller

Multifacet Group
• Milo Martin, Dan Sorin, Carl Mauer, Brad Beckmann, Kevin Moore, Alaa Alameldeen, Mike Marty, Luke Yen

Affiliates & Companies
• Joe Emer, CJ Newburn, Peter Hsu, Bob Zak, Eric Bach, Gang Luo, Alex Chow, IBM, Intel, Microsoft, Sun

Deterministic Replay is Useful

Deterministic replay is logically recreating a program execution

Present applications
• Cyclic Debugging ([Pancake & Netzer ’93])
• Fault Tolerance (ExtraVirt [Lucchetti et al. ’05])
• Intrusion Analysis (ReVirt [Dunlap et al. ’02])

Future applications
• Data Recovery
• Replay-based Synchronization

Multicore and Multithreading

Multicore is common
• AMD X2
• IBM Power 5/6, Cell
• Intel Pentium D, Core Duo
• Sun SPARC T1

Multithreading is common
• Server: high throughput
• Scientific: high performance
• Desktop/embedded: low response time

Race Recording: Key to Determinism

Races: general race & data race [Netzer & Miller]
• Both cause nondeterminism
• Race recording can help, but

Existing race recorders are inadequate
• Some generate large logs
• Some have high runtime overhead
• Some have high hardware cost (space overhead)
• Most support only sequential consistency

Need a better race recorder

Recording/Replay & Debugging

[Figure: an online recorder takes Checkpoints A, B, and C and stores logs A, B, and C until a “core” is dumped; a deterministic replayer reads Checkpoint B and replays from logs B and C.]

Deterministic Replay & Fault Tolerance

Fault Recovery
• Replay after a failure

Fault Detection
• Replay then compare

(Courtesy of VMware)

Future: Record/Replay & Undo/Redo

VM as a software platform
• Ease software development
• Fine granularity in Undo and Redo

[Figure: screenshot of a Windows XP virtual machine.]

Future: Replay-based Synchronization

Three steps
• Coarse-grain sync. -> fine-grain sync. -> hardware sync.

Results: higher performance

Works only if static control flow & fixed data addr
• DSP kernels

[Figure: Recording panel with lock(), st A, ld B, Unlock() in one thread and ld A, st B in the other; Replay panel with ld A, st B, st A, ld B.]

Race Recording Related Work

[Table: existing recorders compared by ordering (total-order vs. partial-order), mechanism, log size, runtime overhead, and replay parallelism; the column alignment did not survive extraction.]

Recorders: Instant Replay ’87, R&C ’90, Bacon ’91 (Hardware), Netzer ’93, Déjà Vu ’98, RecPlay ’00, JaRec ’04

Mechanisms: Lamport clocks, scheduling, bus transaction groups, variable version, vector clocks

Log size (by column): large, small, small, large, large, small

Runtime overhead (by column): low (sync only), low (non-MP), low, high, high

Replay parallelism: low for total-order recorders, high for partial-order recorders

Correctness of Order-Value-Hybrid

Removing WAR dependencies
• Say thread I reads, thread J writes
• Removing the WAR affects I’s read, not J’s write
• But, for every dependence removed, thread I reads the correct value from the value log
• Therefore, all reads get the correct value

TR and TSO

TR affects dependencies reduced by a WAR
• The WAR itself may later be removed during replay
• Solution: do not use a WAR in TR if the WAR can be removed
• Respond with a special flag when a loaded cache line is stolen

[Figure: Thread I and Thread J during recording, each at IC 3, executing ld B and ld A; the marked dependence must not be reduced.]

RTR and TSO

The sliding window may expose the ordered loads
• Shrink the sliding window to avoid it

[Figure: Thread I and Thread J during recording. IC 3: st B (in write buffer) and ld A; IC 4: ld C and ld B (ordered). Labels: old window for j:3, new window for j:3, not allowed by new window.]

Deadlock Avoidance of RTR

Avoid deadlock by adhering to an SC total order

[Figure: Thread I and Thread J during recording, with a replay cycle among the points i:4, j:1, j:2, i:3, i:4.]

Recording Race-free Executions

No data races

Only need to record synchronization races

Deterministic replay up until the first data race

Replay Parallelism

Replay performance depends on

(1) Number of synchronizations
(2) Extra wait incurred by the synchronizations

Directory Protocols

Add sticky states in the directory
• Retain states after writebacks
• Need extra acknowledgements

Or, add extra timestamp memory in the directory
• Helps to avoid extra acknowledgements

A tradeoff
• Sticky states can be cheaper
• But extra timestamp memory can be faster

Snooping Protocols

Key problem is combined/implicit response
• Not a problem for AMD Hammer

[Figure: messages Get/X and Pull Shared + Current IC.]

WAR Detected & Logged

Nonsilent Evictions

Directory eviction: more false conflicts, like snooping

[Figure: eviction of cache line C (state M, timestamp 3) and a line in state M with timestamp 4; Directory of A: Shared(J) Owner() StickyS(I,J); Ack + Timestamp; timestamp memory.]

Out-of-Order & Hardware Prefetching

Speculative execution
• No IC assigned yet

Hardware prefetching
• No IC assigned

Key idea: receive observation
• Can associate a ld/st with the currently committing instruction

Unordered Messages in Interconnect

Messages arrive out of order

Can affect reduction

But better to add a sequence number
• Reconstruct the message order
• Enable IC compression by sending deltas

Integer Overflow

IC and timestamps may overflow

IC: make it 64-bit, will not overflow for a long time

Timestamps: use approximation techniques
• MSB of IC + LSB of Timestamps
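The “MSB of IC + LSB of Timestamps” idea can be sketched as below. This is an illustration under stated assumptions rather than the hardware design: the function name `reconstruct_timestamp` is hypothetical, and the sketch assumes the true timestamp is no larger than the current IC and no more than 2**width instructions behind it.

```python
def reconstruct_timestamp(partial, current_ic, width=24):
    """Rebuild a full timestamp from its low `width` bits.

    Assumes the true timestamp is <= current_ic and lies within 2**width
    instructions of it; older timestamps would alias and need other handling.
    """
    span = 1 << width
    candidate = (current_ic & ~(span - 1)) | partial   # MSBs of IC + LSBs of timestamp
    if candidate > current_ic:                          # low bits wrapped since the timestamp
        candidate -= span
    return candidate

# 8-bit partial timestamps keep the example readable.
print(hex(reconstruct_timestamp(0x30, 0x1234, width=8)))  # same 256-instruction epoch
print(hex(reconstruct_timestamp(0xF0, 0x1234, width=8)))  # previous epoch
```

The first call returns 0x1230 and the second 0x11f0. A timestamp older than the window would be reconstructed as something more recent, which fits the Set/LRU observation that the recorder mainly needs to know whether a timestamp is old enough.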

Varying TSM Size

[Figure: four panels (Apache, OLTP, SPECjbb, Zeus) plotting log size per second against the size of the timestamp memory (2 KB to 2048 KB); series: 1TS-RTR, 1TS-TR, 2TS-RTR, 2TS-TR; configuration: 64 ways, full timestamps, Set/LRU.]

Varying Associativity

[Figure: four panels (Zeus, SPECjbb, OLTP, Apache) plotting log size per second against the associativity of the timestamp memory (2 to 1024); series: CurrentIC-RTR, CurrentIC-TR, SetLRU-TR, SetLRU-RTR; configuration: 64KB, full R/W timestamps.]

Varying Partial Timestamp Width

[Figure: four panels (Zeus, SPECjbb, OLTP, Apache) plotting log size per second against the partial timestamp width (10 to 30 bits); series: TR, RTR; configuration: 64 sets, 64 ways, Set/LRU.]

Log Size Scaling

[Figure: log size against the number of cores (2, 4, 8, 16) for Apache, SPECjbb, OLTP, and Zeus.]

In Retrospect …

What are you most proud of?
• RTR improves TR after 13 years

What would you do differently if doing it again?
• “replaying me is deterministic” (just kidding)
• I wish I had focused on race recording earlier

What should the industry do?
• Implement the recorder as a VMM extension
