Simulation of Decode Filter Cache using SimpleScalar simulator

Presented by Fei Hong

Posted on 21-Jan-2016


Motivation & Goals

• Instruction fetches and decodes are major on-chip power consumers.

• Reduce power consumption by reducing the number of instruction fetches and decodes.

• Simulate the decode filter cache (DFC) architecture using SimpleScalar.

• Evaluate the performance of the DFC.

Prediction Mechanism

Each sector in the DFC has the following fields:

(tag, sector_valid, next_address)

If A is not equal to C, a different control path will be taken:

tag(A) != tag(C)   (1)

A and B are accessed consecutively. If they belong to a small loop:

tag(A) == tag(B)   (2)

Based on (1) and (2), the prediction for the next fetch is:

tag(C) == tag(B)   (3)

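[Figure: DFC organization. Each sector holds Tag, Valid bits, Data, and Next Address fields; example entries are labeled X: A, Y: B, and X: C, with B appearing as a next-address value.]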
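The sector fields listed above can be sketched as a small Python data structure. This is a minimal sketch under stated assumptions, not the simulator code used in this work: the geometry (32 direct-mapped sectors, 4 decoded instructions of 4 B each) comes from the memory hierarchy parameters, while the per-slot valid bits, the hypothetical `split`, `lookup`, and `fill` helpers, and the use of strings as stand-ins for decoded instructions are assumptions. The `next_address` field is stored but not used, since its exact role in the prediction is not modeled here.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Geometry from the memory hierarchy parameters: 4 B instructions and a
# direct-mapped DFC with 32 sectors of 4 decoded instructions each.
INSTR_BYTES = 4
INSTRS_PER_SECTOR = 4
NUM_SECTORS = 32
SECTOR_BYTES = INSTR_BYTES * INSTRS_PER_SECTOR  # 16 B of code per sector


@dataclass
class Sector:
    tag: int = 0
    sector_valid: bool = False
    # Assumption: the address that followed this sector on its last visit.
    # Stored for illustration only; this sketch does not use it for prediction.
    next_address: Optional[int] = None
    # Hypothetical per-slot valid bits and decoded-instruction payloads.
    valid: List[bool] = field(default_factory=lambda: [False] * INSTRS_PER_SECTOR)
    decoded: List[Optional[str]] = field(default_factory=lambda: [None] * INSTRS_PER_SECTOR)


def split(addr: int) -> Tuple[int, int, int]:
    """Split a byte address into (tag, sector index, slot within the sector)."""
    slot = (addr // INSTR_BYTES) % INSTRS_PER_SECTOR
    index = (addr // SECTOR_BYTES) % NUM_SECTORS
    tag = addr // (SECTOR_BYTES * NUM_SECTORS)
    return tag, index, slot


class DecodeFilterCache:
    def __init__(self) -> None:
        self.sectors = [Sector() for _ in range(NUM_SECTORS)]

    def lookup(self, addr: int) -> Optional[str]:
        """Return the decoded instruction on a DFC hit, or None on a miss."""
        tag, index, slot = split(addr)
        s = self.sectors[index]
        if s.sector_valid and s.tag == tag and s.valid[slot]:
            return s.decoded[slot]
        return None

    def fill(self, addr: int, decoded_instr: str) -> None:
        """Store a freshly decoded instruction, evicting a conflicting line."""
        tag, index, slot = split(addr)
        s = self.sectors[index]
        if not (s.sector_valid and s.tag == tag):
            # Direct-mapped: a different line in this sector is replaced.
            s = Sector(tag=tag, sector_valid=True)
            self.sectors[index] = s
        s.valid[slot] = True
        s.decoded[slot] = decoded_instr
```

Because 32 sectors × 16 B cover 512 B, addresses 512 B apart (for example 0x400 and 0x600) map to the same sector with different tags, so filling one evicts the other.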

Working Process

[Figure: DFC working process with the NFPT. Labels: fetch address, last_table_entry, next_fetch_src, "Fetch from DFC or I-cache", update, predict; numbered steps 1 to 3.]
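The fetch, update, and predict steps of the working process can be approximated by the following Python model. It is a hedged sketch, not the author's implementation: the NFPT organization (one entry per DFC sector holding `next_fetch_src`), the meaning given to `last_table_entry`, the `"DFC"`/`"IL1"` labels, the fill policy, and the `DfcFetchModel` class itself are assumptions made for illustration.

```python
from typing import List, Optional

INSTR_BYTES = 4
INSTRS_PER_SECTOR = 4
NUM_SECTORS = 32
SECTOR_BYTES = INSTR_BYTES * INSTRS_PER_SECTOR


class DfcFetchModel:
    """Fetch-source accounting for a DFC + L1 I-cache front end.

    Hypothetical organization (not taken from the slides):
      * the NFPT has one entry per DFC sector, holding next_fetch_src,
        the source ("DFC" or "IL1") predicted for the fetch that follows;
      * last_table_entry is the NFPT entry that made the current prediction;
      * every instruction fetched from the I-cache is decoded and written
        into the DFC.
    """

    def __init__(self) -> None:
        self.tags: List[int] = [-1] * NUM_SECTORS  # -1 marks an empty sector
        self.valid = [[False] * INSTRS_PER_SECTOR for _ in range(NUM_SECTORS)]
        self.nfpt: List[str] = ["IL1"] * NUM_SECTORS
        self.last_table_entry: Optional[int] = None
        self.next_fetch_src = "IL1"
        self.stats = {"insn": 0, "dfc_accesses": 0, "dfc_hits": 0,
                      "dfc_misses": 0, "il1_accesses": 0, "pred_hits": 0}

    def fetch(self, addr: int) -> None:
        st = self.stats
        st["insn"] += 1
        index = (addr // SECTOR_BYTES) % NUM_SECTORS
        tag = addr // (SECTOR_BYTES * NUM_SECTORS)
        slot = (addr // INSTR_BYTES) % INSTRS_PER_SECTOR
        in_dfc = self.tags[index] == tag and self.valid[index][slot]

        # 1. Fetch from the source predicted by the previous fetch.
        if self.next_fetch_src == "DFC":
            st["dfc_accesses"] += 1
            if in_dfc:
                st["dfc_hits"] += 1  # skips both the I-cache access and decode
            else:
                st["dfc_misses"] += 1
                st["il1_accesses"] += 1  # fall back to the I-cache
        else:
            st["il1_accesses"] += 1
        if not in_dfc:
            # Decode, then fill the DFC sector, evicting a conflicting line.
            if self.tags[index] != tag:
                self.tags[index] = tag
                self.valid[index] = [False] * INSTRS_PER_SECTOR
            self.valid[index][slot] = True

        # 2. Update the NFPT entry that made this prediction with the outcome.
        actual = "DFC" if in_dfc else "IL1"
        if actual == self.next_fetch_src:
            st["pred_hits"] += 1
        if self.last_table_entry is not None:
            self.nfpt[self.last_table_entry] = actual

        # 3. Predict the source of the next fetch from this sector's entry.
        self.last_table_entry = index
        self.next_fetch_src = self.nfpt[index]
```

On a hypothetical 8-instruction loop starting at 0x400 and run for 100 iterations, the model fetches the first iteration from the I-cache, mispredicts twice while the NFPT warms up, and then serves every remaining fetch from the DFC (790 of 800). Each fetch is counted either as a DFC hit or as an I-cache access, so the instruction count always equals dfc_hits + il1_accesses.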

The Platform

• Host computer: ACPI x86-based PC

• Host operating system: Microsoft Windows Vista Ultimate

• Virtual machine: VMware Workstation version 6.03

• Guest operating system: Linux (Fedora Core 6)

• Simulator: SimpleScalar version 3.0

Work done so far

• Set up the platform

• Read the SimpleScalar source code

• Applied my DFC structure and working process to SimpleScalar

• Found benchmarks and compiled them on the platform

• Ran simulations using the given memory hierarchy parameters

MiBench

• dijkstra: constructs a large graph in an adjacency matrix representation and then calculates the shortest path between every pair of nodes using repeated applications of Dijkstra's algorithm.

• stringsearch: searches for given words in phrases using a case-insensitive comparison algorithm.

• rijndael encrypt/decrypt: encrypts and decrypts data with Rijndael, the algorithm selected by the National Institute of Standards and Technology (NIST) as the Advanced Encryption Standard (AES).

• CRC32: performs a 32-bit Cyclic Redundancy Check (CRC) on a file. CRC checks are often used to detect errors in data transmission.

Memory hierarchy parameters

Parameter      Value
Instr. size    4 B
DFC            direct-mapped, 32 sectors, 4 decoded instr. per sector, 8 B per decoded instr.
L1 I-cache     16 KB, 2-way, 32 B line, 1-cycle hit latency
L1 D-cache     8 KB, 2-way, 32 B line, 1-cycle hit latency
Memory         30-cycle latency
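As a worked example of these parameters, the following Python sketch derives SimpleScalar-style cache specifications and the DFC's size and address split. It assumes the `<name>:<nsets>:<bsize>:<assoc>:<repl>` spec format accepted by sim-outorder's `-cache:il1` and `-cache:dl1` options, and assumes LRU replacement (`l`), which the table does not state. The DFC has no stock SimpleScalar option, so its numbers are computed only for comparison, and the memory latency flag is left out because the table gives no inter-chunk latency. `cache_spec` and `log2` are hypothetical helpers.

```python
KB = 1024


def cache_spec(name: str, size_bytes: int, line_bytes: int, assoc: int, repl: str = "l") -> str:
    """Format a SimpleScalar cache spec <name>:<nsets>:<bsize>:<assoc>:<repl>."""
    nsets = size_bytes // (line_bytes * assoc)
    return f"{name}:{nsets}:{line_bytes}:{assoc}:{repl}"


def log2(n: int) -> int:
    """Exact log2 for a power of two (raises if n is not one)."""
    if n <= 0 or n & (n - 1):
        raise ValueError(f"{n} is not a power of two")
    return n.bit_length() - 1


# L1 caches from the table; LRU replacement ("l") is an assumption.
il1 = cache_spec("il1", 16 * KB, 32, 2)
dl1 = cache_spec("dl1", 8 * KB, 32, 2)

# DFC: 32 sectors x 4 decoded instructions x 8 B per decoded instruction.
dfc_bytes = 32 * 4 * 8
dfc_offset_bits = log2(4 * 4)  # each sector covers 4 instructions of 4 B
dfc_index_bits = log2(32)

print("-cache:il1", il1, "-cache:dl1", dl1, "-cache:il1lat 1 -cache:dl1lat 1")
print(f"DFC storage: {dfc_bytes} B ({16 * KB // dfc_bytes}x smaller than the I-cache)")
print(f"DFC address split: {dfc_offset_bits} offset bits, {dfc_index_bits} index bits")
```

With these numbers the DFC holds only 128 decoded instructions, against 4096 raw instructions in the 16 KB I-cache, which is why it can be so small and cheap to access.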

Simulation results

[Figure: % reduction in instruction fetches and decodes (axis from 0 to 100%) for dijkstra, stringsearch, rijndael, and CRC32.]

Simulation results

[Figure: prediction hit rate (axis from 97% to 100%) for dijkstra, stringsearch, rijndael, and CRC32.]

Simulation results

Statistic        dijkstra     stringsearch  rijndael     CRC32
sim_num_insn     255620304    4437612       391487315    533385529
il1.accesses     43508918     1605417       236160209    972328
il1.hits         43399500     1568976       228694324    971600
il1.misses       109418       36441         7465885      728
il1.miss_rate    0.0025       0.0227        0.0316       0.0007
dfc.accesses     215740165    3269067       232531480    532674172
dfc.hits         212111386    2832195       155327106    532413201
dfc.misses       3628779      436872        77204374     260971
dfc.miss_rate    0.0168       0.1336        0.3320       0.0005
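These counters can be cross-checked with a few lines of Python. The sketch below recomputes both miss rates from the raw counts and confirms they round to the reported values; it also checks that dfc.hits + il1.accesses equals sim_num_insn, which holds for all four benchmarks. The `dfc_served` ratio (dfc.hits / sim_num_insn) assumes that a DFC hit is exactly what avoids an I-cache fetch and a decode; it is an illustration and may not be how the reduction chart was computed. `STATS`, `REPORTED`, and `derived` are hypothetical names.

```python
# Raw counters copied from the results table.
STATS = {
    "dijkstra": dict(sim_num_insn=255620304, il1_accesses=43508918, il1_misses=109418,
                     dfc_accesses=215740165, dfc_hits=212111386, dfc_misses=3628779),
    "stringsearch": dict(sim_num_insn=4437612, il1_accesses=1605417, il1_misses=36441,
                         dfc_accesses=3269067, dfc_hits=2832195, dfc_misses=436872),
    "rijndael": dict(sim_num_insn=391487315, il1_accesses=236160209, il1_misses=7465885,
                     dfc_accesses=232531480, dfc_hits=155327106, dfc_misses=77204374),
    "CRC32": dict(sim_num_insn=533385529, il1_accesses=972328, il1_misses=728,
                  dfc_accesses=532674172, dfc_hits=532413201, dfc_misses=260971),
}

# (il1.miss_rate, dfc.miss_rate) as reported in the table, rounded to 4 decimals.
REPORTED = {
    "dijkstra": (0.0025, 0.0168),
    "stringsearch": (0.0227, 0.1336),
    "rijndael": (0.0316, 0.3320),
    "CRC32": (0.0007, 0.0005),
}


def derived(s: dict) -> dict:
    return {
        "il1_miss_rate": s["il1_misses"] / s["il1_accesses"],
        "dfc_miss_rate": s["dfc_misses"] / s["dfc_accesses"],
        # Assumption: a DFC hit is what skips the I-cache fetch and the decode.
        "dfc_served": s["dfc_hits"] / s["sim_num_insn"],
    }


for name, s in STATS.items():
    # Every instruction is either a DFC hit or an I-cache access.
    assert s["dfc_hits"] + s["il1_accesses"] == s["sim_num_insn"], name
    d = derived(s)
    assert (round(d["il1_miss_rate"], 4), round(d["dfc_miss_rate"], 4)) == REPORTED[name], name
    print(f"{name:<13} il1 miss {d['il1_miss_rate']:.4f}  dfc miss {d['dfc_miss_rate']:.4f}  "
          f"served by DFC {d['dfc_served']:.1%}")
```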

Conclusion

• The DFC stores decoded instructions and can be very small and energy-efficient.

• On a DFC hit, both the access to the much larger instruction cache and the entire decoding step are eliminated.

• The simulation results show that most instruction fetches and decodes can be eliminated by using the DFC. The DFC is therefore a very efficient way to reduce the power consumption of embedded processors.

Thank you!