Dynamic Loop Caching Meets Preloaded Loop Caching – A Hybrid Approach Ann Gordon-Ross and Frank Vahid* Department of Computer Science and Engineering University of California, Riverside *Also with the Center for Embedded Computer Systems, UC Irvine This work was supported in part by the U.S. National Science Foundation and a U.S. Dept. of Education GAANN Fellowship International Conference on Computer Design, 2002

Dynamic Loop Caching Meets Preloaded Loop Caching – A Hybrid Approach

Ann Gordon-Ross and Frank Vahid*
Department of Computer Science and Engineering
University of California, Riverside
*Also with the Center for Embedded Computer Systems, UC Irvine

This work was supported in part by the U.S. National Science Foundation and a U.S. Dept. of Education GAANN Fellowship

International Conference on Computer Design, 2002


Introduction

• Memory access can consume 50% of an embedded microprocessor’s system power
  – Instruction fetching usually accounts for more than half of that power
  – Caches tend to be power hungry
    • ARM920T: caches consume half of total power (Segars 01)
    • M*CORE: unified cache consumes half of total power (Lee/Moyer/Arends 99)

[Figure: ARM920T. Source: Segars ISSCC’01]

[Figure: Processor, L1 Cache, I-Mem]


Filter Cache

• Tiny L0 cache (~64 instructions)
  – Kin/Gupta/Mangione-Smith 97
  – Has very low dynamic power
    • Short internal bitlines
    • Close to microprocessor
• Power/energy savings, but:
  – Performance penalty of 21% due to high miss rate (Kin’97)
  – Tag comparisons consume power

[Figure: Processor, Filter Cache (L0), L1 Cache]


Dynamically Loaded Tagless Loop Cache

• Tiny cache that passively fills with loops as they execute (Lee/Moyer/Arends 99)
• Not really first level of memory
  – Rather, an alternative
• Operation
  – Filled when short backwards branch detected in instruction stream
• Compared to filter cache...
  – No tags – even lower power
  – Missless – no performance penalty

[Figure: Processor, Mux, Dynamic Loop Cache, L1 Cache]


Example: a two-instruction loop closed by a short backwards branch (sbb)

mov r1, 2
sbb -2
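The fill-on-sbb behaviour can be sketched as a small trace-driven state machine. This is a minimal illustration, not the controller from Lee/Moyer/Arends: the three states (IDLE, FILL, ACTIVE), the word-indexed addresses, the 64-instruction default capacity, and the function name `simulate_dynamic_loop_cache` are all assumptions made for the example.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"      # fetching from L1, no loop being tracked
    FILL = "fill"      # fetching from L1 while copying the loop into the loop cache
    ACTIVE = "active"  # fetching from the loop cache


def simulate_dynamic_loop_cache(trace, capacity=64):
    """Count fetches served by L1 and by the loop cache.

    trace    -- executed instruction addresses (word-indexed), in order
    capacity -- loop cache size in instructions (hypothetical default)
    """
    state = State.IDLE
    start = end = None  # bounds of the loop being filled or fetched
    l1_fetches = loop_cache_fetches = 0

    for i, pc in enumerate(trace):
        if state is State.ACTIVE:
            loop_cache_fetches += 1
        else:
            l1_fetches += 1
        if i + 1 == len(trace):
            break
        nxt = trace[i + 1]

        if nxt == pc + 1:
            # Sequential fetch. Falling through the loop's final branch means
            # the sbb was not taken, so the loop has exited.
            if state is not State.IDLE and pc == end:
                state = State.IDLE
            continue

        # Otherwise this is a change of flow (cof).
        if state is not State.IDLE and pc == end and nxt == start:
            # The triggering sbb was taken again: the loop body is now cached.
            state = State.ACTIVE
        elif nxt < pc and pc - nxt + 1 <= capacity:
            # A taken short backwards branch starts a fill on the next iteration.
            state, start, end = State.FILL, nxt, pc
        else:
            # Any other cof ends filling or fetching.
            state = State.IDLE

    return l1_fetches, loop_cache_fetches


# The two-instruction loop above placed at addresses 10 and 11, run 4 times.
print(simulate_dynamic_loop_cache([9, 10, 11, 10, 11, 10, 11, 10, 11, 12]))  # (6, 4)
```

In this sketch the first iteration triggers the fill, the second iteration fills, and only the third and later iterations are served by the loop cache, so short-running loops gain little.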


Dynamically Loaded Tagless Loop Cache – Results

• We ran 10 Powerstone benchmarks (from Motorola) on a MIPS processor instruction-set simulator
  – Average L1 fetch reduction was 30%
  – Closely matched results of [Lee et al 99]
  – L1 fetch reductions translate to system power savings of 10-15%

[Figure: Processor, Mux, Dynamic Loop Cache, L1 Cache]
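To see roughly how a fetch reduction turns into an energy reduction, a first-order model can help. The sketch below assumes every fetch costs either one L1 access or one loop cache access, and that a loop cache access costs a fixed fraction of an L1 access. That fraction (`loop_cache_energy_ratio`) is a hypothetical parameter, not a value from the slides, and the model ignores controller overhead and static power, so it is not expected to reproduce the 10-15% system figure.

```python
def fetch_energy_savings(loop_cache_fraction, loop_cache_energy_ratio):
    """Fraction of instruction fetch energy saved, under a first-order model.

    loop_cache_fraction     -- share of fetches served by the loop cache (0..1)
    loop_cache_energy_ratio -- energy of one loop cache access divided by the
                               energy of one L1 access (hypothetical value)
    """
    # Baseline energy:       N * E_L1
    # With loop cache:       (1 - f) * N * E_L1 + f * N * r * E_L1
    # Relative savings:      f * (1 - r)
    return loop_cache_fraction * (1.0 - loop_cache_energy_ratio)


# 30% of fetches moved to a loop cache assumed to cost 10% of an L1 access
savings = fetch_energy_savings(0.30, 0.10)
print(f"{savings:.0%} of instruction fetch energy saved")
```

Since instruction fetching is only part of system power, the system-level saving is smaller than the fetch-level saving computed here.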


Dynamically Loaded Tagless Loop Cache - Limitation

• Does not support loops with control of flow changes (cof)
  – Only supports sequential instruction fetching since it was filled passively during a loop iteration
    • Does not see instructions not executed on that iteration
  – A cof thus terminates loop cache filling or fetching
  – cof’s unfortunately include common if-then-else statements within a loop

[Figure: Processor, Mux, Dynamic Loop Cache, L1 Cache]

mov r1, 2
mov r2, 3
bne r1, r2, 2
sbb -4
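The limitation can be illustrated with a small static check on a loop body. This is a conservative sketch: it rejects any loop containing a branch before the closing sbb, whereas the hardware reacts only to a cof that actually occurs at run time. The mnemonic set `COF_MNEMONICS` and the function names are hypothetical and cover only the instructions shown on these slides plus a few common ones.

```python
# Hypothetical set of mnemonics that can change control flow; a real ISA has more.
COF_MNEMONICS = {"sbb", "bne", "beq", "jmp", "call", "ret"}


def mnemonic(instruction):
    """Return the opcode mnemonic of an assembly line such as 'bne r1, r2, 2'."""
    return instruction.split()[0]


def dynamic_loop_cache_supports(loop_body):
    """True if the loop is straight-line code closed by a single sbb."""
    if not loop_body:
        return False
    *body, last = loop_body
    if mnemonic(last) != "sbb":
        return False
    # Any branch inside the body is treated as a cof that would stop the fill.
    return all(mnemonic(ins) not in COF_MNEMONICS for ins in body)


simple_loop = ["mov r1, 2", "sbb -2"]
if_then_loop = ["mov r1, 2", "mov r2, 3", "bne r1, r2, 2", "sbb -4"]

print(dynamic_loop_cache_supports(simple_loop))   # True
print(dynamic_loop_cache_supports(if_then_loop))  # False
```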


Dynamically Loaded Tagless Loop Cache - Limitation

• Because cof’s are not supported, only half of the small, frequent loops in the benchmarks can be cached


Preloaded Tagless Loop Cache

• Embedded systems typically execute a fixed application
  – Can determine critical loops/subroutines through profiling
  – Can preload critical regions into loop cache – whose contents will not change
• Preloaded loop cache (Gordon-Ross/Cotterell/Vahid CAL’02)
  – Tagless, missless
  – Supports more loops than dynamic loop cache

[Figure: Processor, Mux, Preloaded Loop Cache, L1 Cache; system with Processor, Dmem., Pmem., Periph.]

mov r1, 2
mov r2, 3
bne r1, r2, 2
sbb -4
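Choosing which profiled regions to preload can be sketched as a packing problem. The slides do not say how regions are selected, so the greedy rule below (rank by fetches per instruction of storage, then take whatever still fits) is an assumption, as are the function name `select_regions`, the region limit of 2, and the profile numbers in the usage example.

```python
def select_regions(profile, capacity=128, max_regions=2):
    """Pick regions to preload, greedily by fetch density.

    profile     -- list of (name, size_in_instructions, fetch_count) tuples
    capacity    -- loop cache size in instructions
    max_regions -- how many regions the hardware can track (hypothetical limit)
    """
    # Regions that deliver the most fetches per stored instruction come first.
    ranked = sorted(profile, key=lambda region: region[2] / region[1], reverse=True)

    chosen, used = [], 0
    for name, size, _fetches in ranked:
        if len(chosen) == max_regions:
            break
        if used + size <= capacity:
            chosen.append(name)
            used += size
    return chosen


# Made-up profile data for illustration only.
profile = [
    ("fir_loop", 40, 80_000),
    ("crc_loop", 100, 90_000),
    ("init_loop", 20, 200),
]
print(select_regions(profile))  # ['fir_loop', 'init_loop']
```

The example also shows a weakness of the greedy rule: `crc_loop` is skipped because it no longer fits after `fir_loop`, even though it accounts for more fetches than `init_loop`. An exact knapsack formulation would avoid that at the cost of more work at design time.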


Preloaded Tagless Loop Cache - Results

• Results
  – A 128-instruction preloaded loop cache reduces L1 fetches by nearly twice as much as the dynamic loop cache (30%) for the benchmarks studied (Powerstone and Mediabench)


Preloaded Tagless Loop Cache - Disadvantages

• Preloaded loop cache has some limitations too
  – Occasionally dynamic loop cache is actually better
  – Preloading also requires
    • Fixed application
    • Profiling
  – Limited number of loops can be preloaded
  – We really want both!

[Figure: Instruction fetch power savings]


Solution: A Hybrid Loop Cache

• Functions as both a dynamic and a preloaded loop cache
• 2 levels of cache storage
  – Main Loop Cache – for instruction fetching
  – 2nd Level Storage – preloaded loops are stored here

[Figure: hybrid loop cache block diagram. Blocks: Microprocessor, Mux, L1 Cache, Memory, Main Loop Cache, 2nd Level Storage, Loop Cache Controller, Preloaded Loop Filler, Loop Match, LARs. Connections: Addr, Data, Control signals]


Hybrid Loop Cache - Functionality

• Dynamic Loop Cache functionality
  – On a short backwards branch, main loop cache is filled dynamically
• Preloaded Loop Cache functionality
  – On a cof, if the next instruction falls within a preloaded region of code, that region is filled into the main loop cache from 2nd level storage
  – After being filled, instructions can be fetched from the main loop cache
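The two fill paths can be sketched as a single decision made at each change of flow. This is an illustration under assumptions, not the controller described in the paper: the function names, the inclusive `(start, end)` region representation, the 64-instruction main cache size, and the choice to check preloaded regions before the short-backwards-branch rule are all hypothetical.

```python
from typing import List, Optional, Tuple

Region = Tuple[int, int]  # inclusive (start, end) instruction addresses


def find_preloaded_region(addr: int, regions: List[Region]) -> Optional[Region]:
    """Return the preloaded region containing addr, or None."""
    for start, end in regions:
        if start <= addr <= end:
            return (start, end)
    return None


def on_change_of_flow(pc: int, target: int, regions: List[Region],
                      main_capacity: int = 64):
    """Decide what the controller does when execution jumps from pc to target."""
    region = find_preloaded_region(target, regions)
    if region is not None:
        # Preloaded path: copy the region from 2nd level storage.
        return ("fill_from_second_level", region)
    if target < pc and pc - target + 1 <= main_capacity:
        # Dynamic path: a short backwards branch whose loop fits in the cache.
        return ("fill_dynamically", (target, pc))
    # Neither applies: keep fetching from L1.
    return ("fetch_from_l1", None)


preloaded = [(100, 140)]
print(on_change_of_flow(50, 120, preloaded))  # ('fill_from_second_level', (100, 140))
print(on_change_of_flow(20, 10, preloaded))   # ('fill_dynamically', (10, 20))
print(on_change_of_flow(20, 500, preloaded))  # ('fetch_from_l1', None)
```

With an empty `regions` list only the dynamic path can fire, so under this sketch the hybrid design degenerates to a dynamic loop cache.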


Hybrid Loop Cache - Results

• Hybrid performance
  – Best in 9 out of 13 benchmarks
  – Performed equally well in 1
  – In the remaining 3, the hybrid loop cache performed nearly as well as or better than the strictly dynamic approach but was outperformed by the preloaded approach

[Figure: Instruction fetch power savings]


Hybrid Loop Cache – Additional Consideration

• Hybrid loop cache can behave like a dynamic loop cache
  – If designer does not wish to profile/preload
  – Power savings are almost identical to the dynamic loop cache when no loops are preloaded


Conclusions

• Hybrid loop cache reduced embedded system instruction fetch power by an average of 51%
  – 90% savings in several cases
  – Outperformed dynamic and preloaded loop caches
    • Dynamic 23%, Preloaded 35%, Hybrid 51%
• Can work as dynamic, preloaded, or both
  – More capacity than a preloaded loop cache
  – Can be used transparently as dynamic loop cache
    • With nearly identical results
• Hybrid loop cache may be a good addition to low power embedded microprocessor architectures