Instruction and Data Address Trace Compression

Instruction and Data Address Trace Compression Aleksandar Milenković (collaborative work with Milena Milenković and Martin Burtscher) Electrical and Computer Engineering Department The University of Alabama in Huntsville Email: [email protected] Web: http://www.ece.uah.edu/~milenka http://www.ece.uah.edu/~lacasa


Page 2: Instruction and Data Address  Trace Compression

Outline

- Program Execution Traces
- Trace Compression
- Trace Compression in Hardware
  - Stream caches and predictors for instruction address trace compression
  - Data address stride caches for data address trace compression
- Results
- Conclusions

Page 3: Instruction and Data Address  Trace Compression

Program Execution Traces

Streams of recorded events
- Basic block traces
- Address traces
- Instruction words
- Operands

Trace uses
- Computer architects: evaluation of new architectures
- Computer analysts: workload characterization
- Software developers: program tuning, optimization, and debugging

Page 4: Instruction and Data Address  Trace Compression

Instruction and Data Address Traces: An Example

Source loop:

    for (i = 0; i < 100; i++) {
        c[i] = s*a[i] + b[i];
        sum = sum + c[i];
    }

Execution trace (one loop iteration):

    @ 0x020001f4: mov r1,r12, lsl #2
    @ 0x020001f8: ldr r2,[r4, r1]
    @ 0x020001fc: ldr r3,[r14, r1]
    @ 0x02000200: mla r0,r2,r8,r3
    @ 0x02000204: add r12,r12,#1 (1 >>> 0)
    @ 0x02000208: cmp r12,#99 (99 >>> 0)
    @ 0x0200020c: add r6,r6,r0
    @ 0x02000210: str r0,[r5, r1]
    @ 0x02000214: ble 0x20001f4

Dinero+ trace (Type: 0 = data read, 1 = data write, 2 = instruction fetch only):

    Type  Instruction Address  Data Address
    2     0x020001f4
    0     0x020001f8           0xbfffbe24
    0     0x020001fc           0xbfffbc94
    2     0x02000200
    2     0x02000204
    2     0x02000208
    2     0x0200020c
    1     0x02000210           0xbfffbb04
    2     0x02000214

Page 5: Instruction and Data Address  Trace Compression

Trace Issues

Trace issues
- Capture
- Compression
- Processing

Traces tend to be very large
- In terabytes for a minute of program execution
- Expensive to store, transfer, and use

Effective reduction techniques
- Lossless
- High compression ratio
- Fast decompression

Page 7: Instruction and Data Address  Trace Compression

Trace Compression

General-purpose compression algorithms
- Ziv-Lempel (gzip)
- Burrows-Wheeler transformation (bzip2)
- Sequitur

Trace-specific compression techniques
- Tuned to exploit redundancy in traces
- Better compression, faster
- Can be further combined with general-purpose compression algorithms

Page 8: Instruction and Data Address  Trace Compression

Trace-Specific Compression Techniques: Lossless Compression

Instructions Instructions + data

- Acyclic path (WPP [Larus 1999], Time Stamped WPP [Zhang and Gupta 2001])

- N-tuple [Milenkovic, Milenkovic and Kulick 2003]

- Instruction (PDI [Johnson, Ha and Zaidi 2001])

Graph with number of repetitions in nodes

Replacing an execution sequence with its identifier

Control flow graph + trace of transitions

Offset

Offset + repetitions

Link data addresses to dynamic basic block

Link data addresses to loop

Regenerate addresses

Abstract execution

Value Predictor

Mache [Samples 1989], LBTC [Luo and John 2004]

QPT [Larus 1993]

[Hamou-Lhadj and Lethbridge 2002]

PDATS [Johnson, Ha and Zaidi 2001]

[Pleszkun 1994], SBC [Milenkovic and Milenkovic 2003]

[Elnozahy 1999], SIGMA [DeRose, et al. 2002]

[Eggers, et al. 1990], [Larus 1993]

VPC [Burtscher and Jeeradit 2003], TCGEN [Burtscher and Sam 2005]

Page 10: Instruction and Data Address  Trace Compression

Why Trace Compression in Hardware?

Problem #1: capturing program traces
- In software: trap after each instruction or taken branch
  - E.g., IBM's Performance Inspector
  - Slowdown of more than 100 times
- Multiple cores on a single chip, plus more detailed information needed (e.g., time stamps of events)

Problem #2: debugging is far from fun
- Stop execution on breakpoints, examine the state
- Time-consuming and difficult; may miss a critical state leading to erroneous behavior
- Stopping the CPU may perturb the sequence of events, making your bugs disappear

=> Need an unobtrusive real-time tracing mechanism

Page 11: Instruction and Data Address  Trace Compression

Trace Compression in Hardware

Goals
- Small on-chip area and small number of pins
- Real-time compression (never stall the processor)
- Achieve a good compression ratio

Solution
- A set of compression algorithms targeting on-the-fly compression of instruction and data address traces

Page 12: Instruction and Data Address  Trace Compression

Exploiting Streams and Strides

Instruction address trace compression
- Limited number and strong temporal locality of instruction streams
- => Replace an instruction stream with its identifier

Data address trace compression
- Spatial and temporal locality of data addresses
- => Recognize regular strides

    CINT           #Streams   Max.L   Dyn.SL
    164.gzip           1437     229     13.6
    176.gcc           30162     315     11.4
    181.mcf            1181      88      7.4
    186.crafty         5347     191     13.3
    197.parser         6116     189     10.0
    252.eon            4389     169     13.7
    253.perlbmk       11542     868     11.8
    254.gap            3530     284     11.1
    255.vortex         8254     126     11.0
    300.twolf          4902     185     14.4

    CFP            #Streams   Max.L   Dyn.SL
    168.wupwise        1912     229     27.4
    171.swim           1839     707    130.8
    172.mgrid          1725    1944    420.8
    173.applu          1752    3162    462.4
    177.mesa           1938     550    18.15
    178.galgel         4153     264     21.8
    179.art             976     561      9.0
    183.equake         1355     623     27.7
    188.ammp           1810     422     38.5
    189.lucas          1414     427    113.3
    191.fma3d          5007    1158     34.3
    200.sixtrack       6515     580    170.5
    301.apsi           2989     894     50.7

Page 13: Instruction and Data Address  Trace Compression

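Trace Compressor: System Overview

[Figure: system overview. The System Under Test contains the processor core, memory, and the trace compressor. The trace compressor receives the program counter, data address, and task switch signals from the processor core. Its instruction path consists of the Stream Cache (SC), a predictor, and a byte repetition FSM, producing the SCIT and SCMT traces. Its data path consists of a data address buffer, the Data Address Stride Cache (DASC), and a byte repetition FSM, producing the DT and DMT traces (the figure also carries a DAPC label). A Trace Output Controller sends the traces through the trace port to an External Trace Unit for storing/processing (PC or intelligent drive).]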

Page 15: Instruction and Data Address  Trace Compression

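Stream Detector + Stream Cache

[Figure: stream detector and stream cache. The stream detector compares the PC with the previous PC (PPC); when PC - PPC != 4 the current stream ends and its descriptor (S.SA, S.L) is placed in the Instruction Stream Buffer. The Stream Cache (SC) has NSET sets and NWAY ways, each entry holding (SA, L); entry 0 is reserved. The set is selected by iSet = F(S.SA, S.SL) and the descriptor is compared against the entries of that set. On a hit, the entry index (iSet, iWay) goes to the Stream Cache Index Trace (SCIT); on a miss, '00…0' goes to SCIT and (SA, SL) goes to the Stream Cache Miss Trace (SCMT).]

Example (the loop from the earlier trace example; one stream covering 0x020001f4, 0x020001f8, ..., 0x02000214):

    Iteration   SCIT    SCMT
    0           0x00    (0x020001f4, 0x09)
    1           0x0E
    ...
    99          0x0E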
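The stream detector can be sketched in a few lines of Python. This is an illustration only, not the hardware design from the slides: the function name `detect_streams`, the fixed 4-byte instruction size, and the 255-instruction cap (chosen so that the length fits in one byte) are assumptions made for the example.

```python
def detect_streams(pcs, instr_size=4, max_len=255):
    """Split a sequence of instruction addresses into (SA, SL) stream records.

    A stream ends when the next PC is not PPC + instr_size (a taken branch),
    or when the hypothetical length cap max_len is reached.
    """
    streams = []
    sa = ppc = None
    sl = 0
    for pc in pcs:
        sequential = ppc is not None and pc - ppc == instr_size
        if sequential and sl < max_len:
            sl += 1
        else:
            if sl:
                streams.append((sa, sl))  # close the previous stream
            sa, sl = pc, 1                # start a new stream at this PC
        ppc = pc
    if sl:
        streams.append((sa, sl))          # flush the last stream
    return streams


# Three iterations of the 9-instruction loop body from the example.
loop_body = [0x020001F4 + 4 * i for i in range(9)]
records = detect_streams(loop_body * 3)
```

Each iteration of the loop yields the same record, (0x020001f4, 9), which is what makes replacing a stream by a short identifier effective.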

Page 16: Instruction and Data Address  Trace Compression

SC Itrace Compression

Design Decisions:
- Instruction Stream Buffer size: must not stall the processor (e.g., when there are consecutive very short instruction streams)
- Stream cache: size, associativity, replacement policy, mapping function

// Compress instruction stream
Get the next instruction stream record (S.SA, S.SL) from the instruction stream buffer;
Lookup in the stream cache with iSet = F(S.SA, S.SL);
if (hit)
    Emit (iSet, iWay) to SCIT;    // set and way indices concatenated
else {
    Emit reserved value 0 to SCIT;
    Emit stream descriptor (S.SA, S.SL) to SCMT;
    Select an entry (iWay) in the iSet set to be replaced;
    Update stream cache entry:
        SC[iSet][iWay].Valid = 1;
        SC[iSet][iWay].SA = S.SA;
        SC[iSet][iWay].SL = S.SL;
}
Update stream cache replacement indicators;
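The steps above can be modeled in software as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the class name `StreamCache`, the FIFO replacement policy, the placeholder mapping function `_f`, and the way the reserved entry (set 0, way 0) is skipped are all choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    valid: bool = False
    sa: int = 0  # stream starting address
    sl: int = 0  # stream length in instructions


class StreamCache:
    """Software model of a set-associative stream cache (FIFO replacement assumed)."""

    def __init__(self, n_sets, n_ways):
        assert n_ways >= 2, "way 0 of set 0 is reserved for the miss code"
        self.n_sets, self.n_ways = n_sets, n_ways
        self.sets = [[Entry() for _ in range(n_ways)] for _ in range(n_sets)]
        # FIFO victim pointer per set; set 0 starts at way 1 (entry 0 is reserved).
        self.victim = [1] + [0] * (n_sets - 1)
        self.scit = []  # stream cache index trace
        self.scmt = []  # stream cache miss trace

    def _f(self, sa, sl):
        # Placeholder mapping function F (an assumption for this sketch).
        return ((sa >> 6) ^ sl) % self.n_sets

    def compress(self, sa, sl):
        i_set = self._f(sa, sl)
        for i_way, e in enumerate(self.sets[i_set]):
            if e.valid and (e.sa, e.sl) == (sa, sl):
                self.scit.append(i_set * self.n_ways + i_way)  # hit: emit index
                return
        self.scit.append(0)          # miss: reserved value 0 to SCIT
        self.scmt.append((sa, sl))   # and the stream descriptor to SCMT
        i_way = self.victim[i_set]
        self.sets[i_set][i_way] = Entry(True, sa, sl)
        nxt = (i_way + 1) % self.n_ways
        if i_set == 0 and nxt == 0:
            nxt = 1                  # never overwrite the reserved entry
        self.victim[i_set] = nxt


sc = StreamCache(n_sets=16, n_ways=4)
for _ in range(100):                 # 100 iterations of the example loop
    sc.compress(0x020001F4, 9)
```

After the run, SCMT holds a single descriptor and SCIT holds one miss code followed by 99 copies of the same index. The exact index value depends on the assumed mapping function and index encoding.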

Page 17: Instruction and Data Address  Trace Compression

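SC Itrace Compression: An Analytical Model

Legend:
- CR(SC.I): compression ratio
- N: number of instructions
- SL.Dyn: average stream length (dynamic)
- SC.Hit(N_SET, N_WAYS): SC hit rate

Assumptions:
- Stream length < 256 (1 byte for SL)
- 4 bytes for stream starting address

\[
\mathrm{CR(SC.I)} = \frac{\mathrm{Size(Dinero.I)}}{\mathrm{Size(SCIT)} + \mathrm{Size(SCMT)}}
\]

\[
\mathrm{Size(Dinero.I)} = 4 \cdot N \ \text{bytes}
\]

\[
\mathrm{Size(SCIT)} = \frac{N}{\mathrm{SL.Dyn}} \cdot \frac{\log_2(N_{SET} \cdot N_{WAYS})}{8} \ \text{bytes}
\]

\[
\mathrm{Size(SCMT)} = \frac{N}{\mathrm{SL.Dyn}} \cdot (1 - \mathrm{SC.Hit}) \cdot 5 \ \text{bytes}
\]

\[
\mathrm{CR(SC.I)} = \frac{4 \cdot \mathrm{SL.Dyn}}{\frac{1}{8}\log_2(N_{SET} \cdot N_{WAYS}) + 5 \cdot (1 - \mathrm{SC.Hit})}
\]

\[
\lim_{\mathrm{SC.Hit} \to 1} \mathrm{CR(SC.I)} = \frac{32 \cdot \mathrm{SL.Dyn}}{\log_2(N_{SET} \cdot N_{WAYS})}
\]

\[
N_{SET} \cdot N_{WAYS} = 256: \quad \lim_{\mathrm{SC.Hit} \to 1} \mathrm{CR(SC.I)} = 4 \cdot \mathrm{SL.Dyn}
\]

\[
N_{SET} \cdot N_{WAYS} = 128: \quad \lim_{\mathrm{SC.Hit} \to 1} \mathrm{CR(SC.I)} \approx 4.57 \cdot \mathrm{SL.Dyn}
\]

\[
N_{SET} \cdot N_{WAYS} = 64: \quad \lim_{\mathrm{SC.Hit} \to 1} \mathrm{CR(SC.I)} \approx 5.33 \cdot \mathrm{SL.Dyn}
\]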
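The model is easy to evaluate numerically. The sketch below assumes the per-stream costs stated on this slide (a log2(N_SET * N_WAYS)-bit index per stream, 5 bytes per missed stream descriptor, and 4 bytes per instruction address in the uncompressed trace); the function name `cr_sc_i` is made up for the example.

```python
import math


def cr_sc_i(sl_dyn, n_set, n_ways, hit_rate):
    """Compression ratio of the stream cache stage, per the analytical model."""
    index_bytes = math.log2(n_set * n_ways) / 8  # SCIT bytes per stream
    miss_bytes = 5 * (1 - hit_rate)              # expected SCMT bytes per stream
    return 4 * sl_dyn / (index_bytes + miss_bytes)


# 256-entry stream cache, average dynamic stream length of 10 instructions.
ideal = cr_sc_i(10, 64, 4, 1.0)       # upper bound: 4 * SL.Dyn
realistic = cr_sc_i(10, 64, 4, 0.98)  # 98% hit rate
```

With a 98% hit rate the miss trace adds 0.1 bytes per stream to the 1-byte index, so the ratio drops from 40 to roughly 36.4 in this example.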

Page 18: Instruction and Data Address  Trace Compression

2nd Level Itrace Compression

Size(SCIT) >> Size(SCMT)
- HitRate = 98%, 8-bit index => Size(SCIT) = 10*Size(SCMT)
  (per stream, SCIT takes 1 byte while SCMT takes on average 0.02 * 5 = 0.1 bytes)

Redundancy in SCIT
- Temporal and spatial locality of instruction streams

Reduce SCIT trace
- Global Predictor
- N-tuple compression using Tuple History Table
- N-tuple compression using SCIT History Buffer

Page 19: Instruction and Data Address  Trace Compression

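Global Predictor Structure

[Figure: global predictor. A function F of the indices in the History Buffer gives pindex, which selects one of the predictor entries (0 to MaxP-1). The selected entry is compared with the incoming SCIT index, next.sid. On a hit, '1' is emitted to the SCIT PRED trace; on a miss, '0' is emitted to the SCIT PRED trace and next.sid goes to the SCIT PRED miss trace.]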

Page 20: Instruction and Data Address  Trace Compression

SCIT Compression

Design Decisions:
- Length of history buffer
- Global predictor size
- Mapping function

// Predict SCIT index
Get the incoming index, next.sid, from the SCIT trace;
Calculate the SCIT predictor index, pindex, using indices in the History Buffer:
    pindex = F(indices in the History Buffer);
Perform lookup in the SCIT Predictor with pindex;
if (SCIT.Predictor[pindex] == next.sid)
    Emit('1') to SCIT PRED trace;
else {
    Emit('0') to SCIT PRED trace;
    Emit next.sid to SCIT PRED Miss trace;
    SCIT.Predictor[pindex] = next.sid;
}
Shift next.sid into the History Buffer;
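A software model of this predictor might look like the sketch below. The class name `GlobalPredictor` and the shift-and-XOR mapping function are assumptions made for illustration; the slides leave F as a design decision. With a history length of 1, F reduces to the previous SCIT index.

```python
class GlobalPredictor:
    """Predicts the next SCIT index from the most recent indices."""

    def __init__(self, n_entries, history_len=1):
        self.table = [0] * n_entries
        self.history = [0] * history_len
        self.pred_trace = []  # SCIT PRED trace: one bit per incoming index
        self.miss_trace = []  # SCIT PRED miss trace: mispredicted indices

    def _pindex(self):
        # Hypothetical mapping function F: shift-and-XOR over the history.
        h = 0
        for sid in self.history:
            h = (h << 1) ^ sid
        return h % len(self.table)

    def compress(self, next_sid):
        p = self._pindex()
        if self.table[p] == next_sid:
            self.pred_trace.append(1)          # correct prediction: 1 bit only
        else:
            self.pred_trace.append(0)
            self.miss_trace.append(next_sid)   # emit the actual index
            self.table[p] = next_sid           # train the predictor
        self.history = self.history[1:] + [next_sid]


gp = GlobalPredictor(n_entries=256, history_len=1)
for sid in [5, 7] * 5:   # two streams alternating
    gp.compress(sid)
```

After a short warm-up the alternating pattern is predicted perfectly, so each further index costs a single bit in the SCIT PRED trace.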

Page 21: Instruction and Data Address  Trace Compression

Redundancy in SCIT Pred Trace

- High predictor hit rates, and therefore long runs of 0xFF bytes, are expected in the Predictor Hit Trace
- Use a simple FSM to exploit byte repetitions

[Figure: byte repetition FSM. Each byte of the PRED hit trace is compared with Prev.BYTE and a counter CNT tracks repetitions. Outputs: SCIT PRED Header and SCIT PRED Repetition Trace.]

// Detect byte repetitions in SCIT pred
Get next SCIT Pred byte, Next.BYTE;
if (Next.BYTE == Prev.BYTE) CNT++;
else {
    if (CNT == 0) {
        Emit Prev.BYTE to SCIT.REP.Trace;
        Emit '0' to SCIT Header;
    } else {
        Emit (Prev.BYTE, CNT) pair to SCIT.REP.Trace;
        Emit '1' to SCIT Header;
    }
    Prev.BYTE = Next.BYTE;
    CNT = 0;
}
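The byte repetition step is a form of run-length encoding with a separate 1-bit header stream. The sketch below is an illustration under assumptions: the function names are invented, CNT is taken to count the additional repetitions of a byte, the counter is capped at 255 so it fits in one byte, and a final flush is added for the last run. A decoder is included to show that the scheme is lossless.

```python
def _emit(prev, cnt, header, rep):
    if cnt == 0:
        rep.append(prev)          # single byte, header bit 0
        header.append(0)
    else:
        rep.extend((prev, cnt))   # (byte, repeat count) pair, header bit 1
        header.append(1)


def byte_rep_encode(data):
    """Return (header bits, repetition trace) for a byte sequence."""
    header, rep = [], []
    if not data:
        return header, rep
    prev, cnt = data[0], 0
    for b in data[1:]:
        if b == prev and cnt < 255:   # assumed 1-byte counter
            cnt += 1
            continue
        _emit(prev, cnt, header, rep)
        prev, cnt = b, 0              # start a new run and reset the counter
    _emit(prev, cnt, header, rep)     # flush the last run
    return header, rep


def byte_rep_decode(header, rep):
    out, i = bytearray(), 0
    for flag in header:
        if flag:
            out += bytes([rep[i]]) * (rep[i + 1] + 1)
            i += 2
        else:
            out.append(rep[i])
            i += 1
    return bytes(out)


pred_bytes = b"\xff" * 10 + b"\x7f" + b"\xff\xff"
hdr, rep_trace = byte_rep_encode(pred_bytes)
```

Here thirteen input bytes shrink to five trace bytes plus three header bits; long runs of 0xFF compress much further.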

Page 23: Instruction and Data Address  Trace Compression

Data Address Trace Compression

- More challenging task
- Data addresses rarely stay constant during program execution
- However, they often have a regular stride
- => Use a Data Address Stride Cache (DASC) to exploit locality of memory referencing instructions and regularity in data address strides

Page 24: Instruction and Data Address  Trace Compression

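Data Address Stride Cache (DASC)

DASC
- Tagless structure
- Indexed by the PC of the corresponding instruction: index = G(PC)
- Entry fields
  - LDA: Last Data Address
  - Stride

[Figure: DASC with N entries (0 to N-1). The current stride DA - LDA is compared with the stored Stride. On a stride hit, '1' is emitted to the data trace (DT); otherwise '0' is emitted to DT and DA goes to the data miss trace (DMT).]

Example: the instruction at PC 0x020001f8 accesses 0xbfffbe24, 0xbfffbe20, and 0xbfffbe1c. The corresponding DT bits are 0, 0, 1, and the two missed addresses (0xbfffbe24, 0xbfffbe20) go to DMT.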

Page 25: Instruction and Data Address  Trace Compression

DASC Compression

Design Decisions:
- Number of entries
- Index function G
- Stride length
- Data address buffer depth

// Compress data address stream
Get the next pair (PC, DA) from the data buffers;
Lookup in the data address stride cache with iSet = G(PC);
cStride = DA - DASC[iSet].LDA;
if (cStride == DASC[iSet].Stride) {
    Emit('1') to DT;    // 1-bit info
} else {
    Emit('0') to DT;
    Emit DA to DMT;
    DASC[iSet].Stride = lsb(cStride);
}
DASC[iSet].LDA = DA;
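The DASC steps can be modeled as below. This is a sketch under assumptions rather than the authors' design: the class name `DASC`, the index function `_g` (word-aligned PC modulo the table size), and the interpretation of lsb() as keeping a signed `stride_bits`-wide stride are all choices made for the example.

```python
class DASC:
    """Software model of a tagless data address stride cache."""

    def __init__(self, n_entries=1024, stride_bits=8):
        self.lda = [0] * n_entries     # last data address per entry
        self.stride = [0] * n_entries  # last stride per entry
        self.stride_bits = stride_bits
        self.dt = []    # data trace: one bit per memory reference
        self.dmt = []   # data miss trace: full addresses on stride misses

    def _g(self, pc):
        # Hypothetical index function G: word-aligned PC modulo table size.
        return (pc >> 2) % len(self.lda)

    def _lsb(self, value):
        # Keep the low stride_bits of the stride, interpreted as signed.
        mask = (1 << self.stride_bits) - 1
        low = value & mask
        return low - (mask + 1) if low > mask >> 1 else low

    def compress(self, pc, da):
        i = self._g(pc)
        c_stride = da - self.lda[i]
        if c_stride == self.stride[i]:
            self.dt.append(1)                      # stride hit: 1 bit
        else:
            self.dt.append(0)                      # stride miss
            self.dmt.append(da)
            self.stride[i] = self._lsb(c_stride)
        self.lda[i] = da


dasc = DASC()
for da in (0xBFFFBE24, 0xBFFFBE20, 0xBFFFBE1C):
    dasc.compress(0x020001F8, da)
```

The first two references miss (the stride is not yet known), and the third hits with a stride of -4, which matches the 0, 0, 1 pattern in the DASC example.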

Page 26: Instruction and Data Address  Trace Compression

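DASC Dtrace Compression: An Analytical Model

Legend:
- CR(SC.D): compression ratio
- N_memref: number of memory referencing instructions
- DASC.Hit: DASC hit rate

Assumptions:
- 4 bytes per data address
- 1 bit (0.125 bytes) in DT per memory reference

\[
\mathrm{CR(SC.D)} = \frac{\mathrm{Size(Dinero.D)}}{\mathrm{Size(DT)} + \mathrm{Size(DMT)}}
\]

\[
\mathrm{Size(Dinero.D)} = 4 \cdot N_{memref} \ \text{bytes}
\]

\[
\mathrm{Size(DT)} + \mathrm{Size(DMT)} = N_{memref} \cdot \left[ (1 - \mathrm{DASC.Hit}) \cdot 4 + 0.125 \right] \ \text{bytes}
\]

\[
\mathrm{CR(SC.D)} = \frac{1}{1.03125 - \mathrm{DASC.Hit}}
\]

\[
\lim_{\mathrm{DASC.Hit} \to 1} \mathrm{CR(SC.D)} = \frac{1}{0.03125} = 32
\]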
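As a quick numerical check of this model, the sketch below evaluates the closed form for a few hit rates. The function name `cr_sc_d` is made up for the example; the costs are the ones assumed by the model (4 bytes per missed address, 1 bit per reference, 4 bytes per address in the uncompressed trace).

```python
def cr_sc_d(hit_rate):
    """Compression ratio of the DASC stage, per the analytical model."""
    bytes_per_ref = (1 - hit_rate) * 4 + 0.125  # DMT share plus the DT bit
    return 4 / bytes_per_ref


upper_bound = cr_sc_d(1.0)   # every stride predicted
no_hits = cr_sc_d(0.0)       # every reference misses
```

The ratio is bounded by 32 because every memory reference costs at least one bit, and a cache that never hits slightly expands the trace.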

Page 27: Instruction and Data Address  Trace Compression

Redundancy in DT Trace

- High DASC hit rates, and therefore long runs of 0xFF bytes, are expected in the DT trace
- Use a simple FSM to exploit byte repetitions

[Figure: byte repetition FSM. Each DT byte is compared with Prev.DT and a counter CNT tracks repetitions. Outputs: Data Header (DH) and Data Repetition Trace (DRT).]

// Detect data repetitions
Get next DT byte;
if (DT == Prev.DT) CNT++;
else {
    if (CNT == 0) {
        Emit Prev.DT to DRT;
        Emit '0' to DH;
    } else {
        Emit (Prev.DT, CNT) pair to DRT;
        Emit '1' to DH;
    }
    Prev.DT = DT;
    CNT = 0;
}

Page 29: Instruction and Data Address  Trace Compression

Experimental Evaluation

Goals
- Assess the effectiveness of the proposed algorithms
- Explore the feasibility of the proposed hardware implementations
- Determine optimal size and organization of HW structures

Workload
- 16 MiBench benchmarks
- ARM architecture

    Benchmark                 IC    NUS   maxSL   SL.Dyn
    cjpeg            104,607,812   1636     239    10.89
    djpeg             23,391,628   1324     206    21.81
    lame           1,285,111,635   3410     252    27.81
    tiff2bw          143,254,646   1058      43    12.79
    tiff2rgba        151,691,275   1146      75    27.54
    tiffmedian       541,260,067   1431      75    22.22
    tiffdither       832,951,018   1831      51    12.57
    mad              286,974,899   1659    1055    20.09
    sha              140,885,982    495      62    15.15
    bf_e             544,053,846    413     300     5.85
    rijndael_e       319,977,971    542     254    18.94
    ghostscript      708,090,638   6900     187     8.70
    rsynth           824,942,227   1323     180    15.77
    stringsearch       3,675,745    439      62     5.61
    adpcm_c          732,513,651    347      71    54.63
    gsm_d          1,299,270,245    845     401    11.07

Legend:
• IC – Instruction count
• NUS – Number of unique instruction streams
• maxSL – Maximum stream length
• SL.Dyn – Average stream length (dynamic)

Page 30: Instruction and Data Address  Trace Compression

Findings about SC Size/Organization

Good compression ratio
- Outperforms fast GZIP
- High stream cache hit rates for all applications (>98%)
- Smaller SCs work well too

Replacement policy
- Pseudo-LRU vs. FIFO

Associativity
- 4-way is a reasonable choice
- 8-way and 16-way desirable

Mapping function
- S.SA<5+n:6> xor S.L<n-1:0>, where n = log2(NSET)

CR(SC.I) by number of entries (rows) and ways (columns):

    Entries   1-way   2-way   4-way   8-way
    8          16.3    17.6    17.0    15.8
    16         21.1    22.1    27.8    26.6
    32         23.9    28.0    34.4    34.0
    64         27.5    36.9    44.1    47.1
    128        29.0    47.6    54.1    57.4
    256        28.0    47.8    53.6    54.2

[Figure: CR = f(Complexity), 4-way SC. x-axis: #SC entries (0 to 300); y-axis: CR/Max CR (0 to 1.2).]
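The mapping function S.SA<5+n:6> xor S.L<n-1:0> can be written out directly. The sketch below reads the bit-range notation as "n bits of the stream starting address, from bit 6 up to bit 5+n, XORed with the n low bits of the stream length"; that reading, the function name `sc_set_index`, and the requirement that NSET be a power of two are assumptions for the example.

```python
def sc_set_index(sa, sl, n_set):
    """Stream cache set index: S.SA<5+n:6> xor S.L<n-1:0>, n = log2(NSET)."""
    n = n_set.bit_length() - 1        # log2(NSET) for a power-of-two NSET
    mask = (1 << n) - 1
    return ((sa >> 6) & mask) ^ (sl & mask)


# The example stream (0x020001f4, 9) in a stream cache with 16 sets.
i_set = sc_set_index(0x020001F4, 9, 16)
```

Mixing the stream length into the index helps distinguish streams that start at nearby addresses but have different lengths.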

Page 31: Instruction and Data Address  Trace Compression

Findings about Global Predictor

- Number of entries should not exceed the number of entries in SC
- Having longer histories and larger predictors gives only marginal improvements for all applications except ghostscript, blowfish, and stringsearch
- History length = 1: index GPRED using the previous SCIT index

CR(SC+GP.I) by SC size (rows) and number of predictor entries (columns):

    SC Entries      P32      P64     P128     P256
    8x4           47.64
    16x4          72.17    81.19
    32x4          91.91   113.22   145.79
    64x4         100.32   115.09   150.54   207.64

Page 32: Instruction and Data Address  Trace Compression

Putting It All Together (SC+GPRED+BREP): Itrace Compression

Compression ratio (CR). The first four columns are SC,GPRED configurations; the remaining columns are general-purpose compressors.

    Benchmark      8x8,64  16x8,128  32x8,256  64x4,256  DEF. I.GZ  FAST I.GZ  BEST I.GZ  BEST I.BZ2  DEF. GZGZ
    cjpeg           263.7     316.7     315.0     277.1      109.6       54.5      124.5       342.0      265.7
    djpeg           287.1     443.3     539.4     492.3       71.8       39.8       73.7       202.0      232.5
    lame            214.0     238.6     255.2     250.6       60.5      128.5      333.9        87.6      174.2
    tiff2bw         351.5    1111.5    3062.2    1493.0      114.1       83.9      114.4       376.8      615.2
    tiff2rgba       517.6    3713.1    3592.0    1834.0      121.3       20.3      122.0       529.6     1292.7
    tiffmedian      649.4    1229.4    1827.4    1601.2      152.8       92.3      155.5       472.9     1017.5
    tiffdither       54.8     120.9     184.8     154.3       91.1       46.4       99.8       170.9      147.1
    mad             221.0     230.4     257.2     253.4       73.5       37.8       78.5        94.3      206.2
    sha             348.5     339.6     322.4     322.3      211.4       54.4      221.8       656.5     4112.1
    bf_e            100.2     100.2      92.6      92.6      170.4       41.0      182.3       352.0     4065.9
    rijndael_e      142.1     298.6     290.1     285.6      143.8       12.6      150.6       141.8     2392.9
    ghostscript      30.4     106.4     123.6     119.4      100.6       39.7      111.2       212.5      434.5
    rsynth           97.0     152.8     246.0     211.5       46.7       30.6       48.0       143.2      191.2
    stringsearch     21.8      78.5     114.0      74.9       82.1       32.3      100.6       202.5      132.8
    adpcm_c       29972.5   28663.9   27457.8   27456.6      233.1      107.3      233.6      1862.6    12764.7
    gsm_d           234.9     292.3     401.2     376.0       85.4       59.2       87.2       165.6      507.1
    TOTAL           113.2     209.0     254.4     237.8       87.5       47.2      112.9       172.0      321.6

Page 33: Instruction and Data Address  Trace Compression

Findings about DASC

Stride size
- 1 byte is optimal
- A 2-byte stride improves compression by 10%

DASC with 1K entries is an optimal choice

Tagged (multi-way) DASC further improves overall compression ratio
- Increased complexity

[Figure: CR = f(Complexity). x-axis: # DASC entries (0 to 5000); y-axis: CR (0 to 7).]

Page 34: Instruction and Data Address  Trace Compression

DASC Compression Ratio

    Benchmark     DASC 32  DASC 64  DASC 128  DASC 256  DASC 512  DASC 1024  DEF. D.GZ  FAST D.GZ  BEST D.GZ   D.BZ2  D.GZGZ
    cjpeg            3.35     4.60      5.14      5.77      6.54       7.11       5.98       4.50       6.11   18.20    9.57
    djpeg            2.81     3.57      4.28      4.96      5.22       5.29       4.22       3.78       4.22    8.62    4.92
    lame             1.20     1.52      2.81      3.82      4.49       4.88       6.56       4.01       6.63    8.80    8.60
    tiff2bw         76.31    78.04     84.28    105.04    128.84     134.23       2.14       2.55       2.10   14.28    3.07
    tiff2rgba        5.98    79.81     91.24    107.49    127.05     139.57       2.10       2.79       2.09    4.06    4.03
    tiffmedian       8.64     8.70      8.74      8.81      8.87       8.89       4.40       4.37       4.53   11.16    6.03
    tiffdither       2.61     6.08      7.21      8.69      9.65      10.06       4.51       4.41       4.51    7.87    6.77
    mad              1.30     1.59      1.96      2.07      2.35       2.64       4.08       3.60       4.22   13.47    6.97
    sha              6.58     7.94      9.38     10.79     11.36      11.36      44.91       8.36      45.61  172.71  591.69
    bf_e             1.58     1.95      2.38      2.61      2.75       2.91       7.58       4.86       7.83   16.35    9.08
    rijndael_e       1.10     1.10      1.10      1.13      1.29       2.06       4.24       3.22       4.27    7.31    4.49
    ghostscript      1.07     1.19      1.56      2.19      2.93       5.27      27.21      18.58      27.46   47.42   40.83
    rsynth           1.22     1.36      1.76      3.81      8.30      32.43      24.44      21.46      25.27   57.40   43.88
    stringsearch     1.80     2.04      2.70      4.13      4.44       5.16      11.12       8.57      11.23   15.03   11.47
    adpcm_c          3.13     3.13      3.13      3.13      3.13       3.13       6.57       3.64       7.15   12.27   11.42
    gsm_d            2.67     4.48     11.30     13.60     14.81      16.78      21.60      18.05      23.29   63.53   33.15
    TOTAL            1.66     2.04      2.80      3.77      4.67       6.12       6.78       5.51       6.90   13.29    9.70

Page 35: Instruction and Data Address  Trace Compression

Hardware Complexity Estimation

CPU model
- In-order, XScale-like
- Vary SC and DASC parameters

SC and DASC timings
- SC: hit latency = 1 clock, miss latency = 2 clocks
- DASC: hit latency = 2 clocks, miss latency = 2 clocks

To avoid any stalls
- Instruction stream input buffer: MIN = 2 entries
- Data address input buffer: MIN = 8 entries
- Results are relatively independent of SC and DASC organization

    Component                        Entries   Complexity    Bytes
    Instruction stream buffer        2         2x5              10
    Stream detector                  2         2x4               8
    Stream cache                     64x4      256x5          1280
    Global Predictor                 256       256 + 1(h)      257
    Data address buffer              8         8x8              64
    Data address stride cache        1024      1024x5         5120
    Byte repetition state machines   -         4                 4

Page 36: Instruction and Data Address  Trace Compression

Trace Port Bandwidth Analysis

[Figures: trace port bandwidth in bits per instruction versus instructions executed (millions), for CJPEG and MAD. The instruction trace plots compare SC, SC+PRED, and SC+PRED+BREP; the data trace plots compare TDASC and TDASC+BREP.]

Page 38: Instruction and Data Address  Trace Compression

Conclusions

A set of algorithms and hardware structures for instruction and data address trace compression
- Stream Caches + Global Predictor + Byte repetition FSM for instruction traces
- Data Address Stride Cache + Byte repetition FSM for data traces

Benefits
- Enabling real-time trace compression with high compression ratio
- Low complexity (small structures, small number of external pins)

Analytical and simulation analysis focusing on compression ratio, optimal sizing and organization of the structures, and real-time trace port bandwidth requirements

Page 39: Instruction and Data Address  Trace Compression

Laboratory for Advanced Computer Architectures and Systems

at Alabama: Research Overview

Aleksandar Milenković

The LaCASA Laboratory
Electrical and Computer Engineering Department
The University of Alabama in Huntsville
Email: [email protected]
Web: http://www.ece.uah.edu/~milenka http://www.ece.uah.edu/~lacasa

Page 40: Instruction and Data Address  Trace Compression

Secure Processors

Computer Security is Critical
- Software & physical attacks

Sign & Verify for Guaranteed Integrity and Confidentiality of Code

Improvements
- PMAC (Parallel MACs) for reduced cryptographic latency
- A variation of the one-time-pad for code encryption
- Instruction Verification Buffer for conditional execution before verification

Buffer overflow in MMClient.exe in Indiatimes Messenger 6.0 allows remote attackers to cause a denial of service (application crash) and possibly execute arbitrary code via a long group name argument to the RenameGroup function in the MMClient.MunduMessenger.1 ActiveX object.

Multiple format string vulnerabilities in (1) neon 0.24.4 and earlier, and other products that use neon including (2) Cadaver, (3) Subversion, and (4) OpenOffice, allow remote malicious WebDAV servers to execute arbitrary code.

Buffer overflow in the JPEG (JPG) parsing engine in the Microsoft Graphic Device Interface Plus (GDI+) component, GDIPlus.dll, allows remote attackers to execute arbitrary code via a JPEG image.

Multiple buffer overflows in RealOne Player, RealOne Player 2.0, RealOne Enterprise Desktop, and RealPlayer Enterprise allow remote attackers to execute arbitrary code via malformed (1) .RP, (2) .RT, (3) .RAM, (4) .RPM or (5) .SMIL files.

Multiple heap-based buffer overflows in the imlib BMP image handler allow remote attackers to execute arbitrary code via a crafted BMP file.

Integer overflow in pixbuf_create_from_xpm (io-xpm.c) in the XPM image decoder for gtk+ 2.4.4 (gtk2) and earlier, and gdk-pixbuf before 0.22, allows remote attackers to execute arbitrary code via certain n_col and cpp values that enable a heap-based buffer overflow.

Stack-based buffer overflow in the URL parsing function in Gaim before 1.3.0 allows remote attackers to execute arbitrary code via an instant message (IM) with a large URL.

Buffer overflow in WIDCOMM Bluetooth Connectivity Software, as used in products such as BTStackServer 1.3.2.7 and 1.4.2.10, Windows XP and Windows 98 with MSI Bluetooth Dongles, and HP IPAQ 5450 running WinCE 3.0, allows remote attackers to execute arbitrary code via certain service requests.

[Figure: sign-and-verify flow. Secure installation turns the original code into signed code: program keys (Key1, Key2, Key3) are generated and stored encrypted as EKey.CPU(Key1), EKey.CPU(Key2), EKey.CPU(Key3); a signature is calculated for each I-Block, and the I-Block is encrypted as EKey3(I-Block). Secure execution of trusted code: at program loading (secure mode) the program keys are decrypted; on instruction fetch the I-Block is decrypted, its signature is calculated and compared with the fetched signature (signature match). The figure also carries the labels Yesterday, Today, Tomorrow.]

http://www.ece.uah.edu/~lacasa/research.htm#secure_processors

Page 41: Instruction and Data Address  Trace Compression

Microbenchmarks for Architectural Analysis

- Small programs for uncovering architectural parameters (usually not publicly disclosed) of modern processors
- Relatively simple, so their behavior can be understood

Benefits
- Architecture-aware compiler optimization
- Processor design evaluation and verification
- Testing
- Competitive analysis

[Figure: microbenchmarks together with performance counters (branch-related events) are used to uncover BTB size, BTB organization, BTB indexing, and the local and global history of the outcome predictor.]

Results
- Microbenchmarks for BTB analysis
- Experimental flow for outcome predictor
- Tested on P6 and NetBurst (Northwood core)

Challenge
- Dothan (Pentium M) predictor

http://www.ece.uah.edu/~lacasa/bp_mbs/bp_microbench.htm

Page 42: Instruction and Data Address  Trace Compression

TinyHMS: Concept, Prototype, Software

[Figure: TinyHMS software organization. Sensor nodes, ActiS (Tmote sky) with IAS/ISPM boards: sensor interface, data acquisition, filtering/pre-processing, signal processing, ActiS interface, ActiS application layer, main control (messaging, fusion, buffering), time sync, flash storage, ActiS protocol, wireless transceiver. Network Coordinator (Telos): wireless transceiver, time sync, messaging, buffering, flash storage, ActiS protocol, interface (USB/CF). PS (PDA): interface (USB/CF), network coordinator, messaging control, storage, user interface, WWAN/WLAN communication.]

http://www.ece.uah.edu/~lacasa/research.htm#tinyHMS

Page 43: Instruction and Data Address  Trace Compression

TinyHMS

[Figure: TinyHMS sensor signals over the interval 105 to 107 on the horizontal axis: accelerometer signals (accX, accY, accZ) and heart beat signal, with detected Heart Beat and Step events. Timing diagram: frames i-1 and i each start with a beacon message from the network coordinator (NC), followed by time slots TS1 (ECG sensor), TS2 (motion sensor), and TS3; sensors send event messages with timestamps.]