
Page 1: Title Slide

Low-Power Cache Organization Through Selective Tag Translation for Embedded Processors with Virtual Memory Support

Xiangrong Zhou and Peter Petrov
Proceedings of the 16th ACM Great Lakes Symposium on VLSI (GLSVLSI '06), pp. 398-403, Apr. 2006
Citation Count: 6
Presenter: Chun-Hung Lai
112/04/21

Page 2: Abstract

In this paper we present a novel cache architecture for energy efficient data caches in embedded processors with virtual memory. Application knowledge regarding the nature of memory references is used to eliminate tag address translations for most of the cache accesses. We introduce a novel cache tagging scheme, where both virtual and physical tags co-exist in the cache tag arrays. Physical tags and special handling for the super-set cache index bits are used for references to shared data regions in order to avoid cache consistency problems.

By eliminating the need for address translation on cache access for the majority of references, a significant power reduction is achieved. We outline an efficient hardware architecture for the proposed approach, where the application information is captured in a reprogrammable way and the cache architecture is minimally modified. Our experimental results show energy reductions for the address translation hardware in the range of 90%, while the reduction for the entire cache architecture is within the range of 25%-30%.


Page 3: What's the Problem

Cache organization with virtual memory support is very power consuming:
- Address translation (a TLB lookup) is performed each time the cache is accessed (for PIPT and VIPT caches).
- TLB power constitutes 20-25% of the total cache power.

Goal: reduce power by minimizing the number of address translations on cache accesses. The paper proposes a selective tag translation cache architecture:
- Private data can be handled with virtual tags (without address translation).
- Shared data requires physical tags (address translation needed), because of the synonym problem: different virtual addresses are mapped to the same physical address.

Page 4: Background: VIVT & PIPT

Virtually Indexed, Virtually Tagged (VIVT) cache:
- Pros: fast and low power (no address translation on cache access).
- Cons: synonyms. Different virtual addresses (from more than one task) are mapped to the same physical address (shared data), e.g., for inter-process communication.
- Cache consistency problem: since the virtual address is used to access the cache, shared data ends up in different cache blocks (see the sketch below).

Physically Indexed, Physically Tagged (PIPT) cache:
- Pros: synonyms are no longer an issue.
- Cons: delay and power overhead (address translation for each cache access).
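To make the synonym hazard concrete, here is a minimal C sketch of a direct-mapped VIVT cache. It is not from the paper; the page size, cache geometry, and the translate() page table are illustrative assumptions. Two virtual addresses alias the same physical address, but the cache tags them by virtual address, so a write through one synonym is lost to a read through the other:

    #include <stdio.h>
    #include <stdint.h>

    #define LINE_BYTES 32
    #define NUM_LINES  16   /* tiny direct-mapped cache: 512 bytes */

    struct line { int valid; uint32_t vtag; uint32_t data; };
    static struct line cache[NUM_LINES];

    /* Toy page table: virtual pages 0x10 and 0x20 alias physical page 0x5. */
    static uint32_t translate(uint32_t va) {
        uint32_t vpn = va >> 12;                      /* 4 KB pages */
        uint32_t ppn = (vpn == 0x10 || vpn == 0x20) ? 0x5 : vpn;
        return (ppn << 12) | (va & 0xFFF);
    }

    /* VIVT access: index and tag both come from the virtual address, so
     * no translation happens -- that is the power win, and the hazard. */
    static uint32_t *vivt_access(uint32_t va) {
        uint32_t idx = (va / LINE_BYTES) % NUM_LINES;
        uint32_t tag = va / (LINE_BYTES * NUM_LINES);
        struct line *l = &cache[idx];
        if (!l->valid || l->vtag != tag) {            /* miss: refill */
            l->valid = 1; l->vtag = tag; l->data = 0;
        }
        return &l->data;
    }

    int main(void) {
        uint32_t va1 = 0x10000, va2 = 0x20000;        /* synonym pair */
        printf("PA(va1)=%x PA(va2)=%x\n",
               (unsigned)translate(va1), (unsigned)translate(va2));
        *vivt_access(va1) = 42;       /* write via the first synonym */
        /* Different virtual tag -> miss -> the write above is lost. */
        printf("read via va2: %u\n", (unsigned)*vivt_access(va2));
        return 0;
    }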

Page 5: Background: VIPT

Virtually Indexed, Physically Tagged (VIPT) cache:
- Pros:
  - Hides the address translation latency: translation is performed only for the tags, so cache indexing can be overlapped with the tag translation.
  - Can eliminate the cache synonym problem by imposing certain restrictions on the OS memory manager.
- Cons: power overhead (address translation for each cache access).

This is the most typical cache architecture for general-purpose processors, and the one discussed in this paper.

Page 6: Selective Tag Translation Cache Architecture

Both virtual and physical tags are utilized at the same time; all cache lines are virtually indexed:
- Non-shared data is tagged with virtual tags.
- Shared data is tagged with physical tags.

[Figure: selectively tagged cache. Every line is virtually indexed and carries a mode bit selecting a virtual or a physical tag. Non-shared data keeps virtual tags, saving power since no address translation is required. Shared data (different VAs mapped to the same PA) can be identified in advance and is physically tagged when placed in the cache; its virtual index needs special care. A sketch of the mode-bit tag match follows.]
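A minimal sketch of the per-line mode-bit tag match (the mode bit is the paper's mechanism; the struct layout and field widths are assumptions):

    #include <stdbool.h>
    #include <stdint.h>

    struct cache_line {
        bool     phys_tagged;  /* mode bit: set for shared data     */
        uint32_t tag;          /* holds a virtual OR a physical tag */
        /* ... data array, valid/dirty bits ... */
    };

    /* vtag comes straight from the virtual address (free); ptag comes
     * from translation and is needed only when the line is physically
     * tagged, i.e., only for shared data. */
    static bool tag_hit(const struct cache_line *l,
                        uint32_t vtag, uint32_t ptag)
    {
        return l->phys_tagged ? (l->tag == ptag) : (l->tag == vtag);
    }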

Page 7: The Proposed Technique Can Work Correctly When Synonyms Are Aligned

What is an aligned synonym?
- The superset bits of the virtual address are identical to the superset bits of the physical address; thus the virtual index is the same as the physical index. In that case, Virtually Indexed Physically Tagged (VIPT) behaves exactly like Physically Indexed Physically Tagged (PIPT).

What are the superset bits (or color bits)?
- The intersection of the cache index bits and the VPN: when the cache way size is larger than the page size, the MSBs of the virtual index overlap with the VPN (see the sketch below).

To eliminate the synonym problem in VIPT, the OS memory manager can align the synonyms.
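A small sketch of where the superset bits come from, assuming 4 KB pages and a 16 KB direct-mapped cache with 32-byte lines (the geometry is illustrative, not from the paper):

    #include <stdio.h>
    #include <stdint.h>

    #define PAGE_BITS  12   /* 4 KB pages              */
    #define LINE_BITS   5   /* 32-byte lines           */
    #define INDEX_BITS  9   /* 16 KB / 32 B = 512 sets */
    /* Index+offset span 14 bits but the page offset covers only 12,
     * so the top 2 index bits fall inside the VPN: superset bits. */
    #define SUPERSET_BITS (LINE_BITS + INDEX_BITS - PAGE_BITS)

    static uint32_t superset(uint32_t addr) {
        return (addr >> PAGE_BITS) & ((1u << SUPERSET_BITS) - 1);
    }

    int main(void) {
        uint32_t va = 0x0001F0A0, pa = 0x0005F0A0;
        /* Aligned synonym: VA and PA agree on the superset bits, so
         * the virtual index equals the physical index (VIPT == PIPT). */
        printf("superset(va)=%u superset(pa)=%u aligned=%d\n",
               (unsigned)superset(va), (unsigned)superset(pa),
               superset(va) == superset(pa));
        return 0;
    }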

Page 8: However, When the Synonyms Are Not Aligned

The virtual superset bits are not identical to the physical superset bits, so an access may conflict with other virtual indexes that do not belong to the same synonym group.

[Figure: under VIPT, two virtual addresses with the same virtual superset bits have the same virtual index and therefore point to the same cache line. However, they have different PPNs in which only the physical superset bits differ; the stored physical tag part is the same, so the cache would mistake them for the same data.]

Page 9: To Avoid the Previous Conflict When Synonyms Are Not Aligned

Goal: translate the virtual superset bits to the physical superset bits with minimal cost, by adding an offset to the virtual superset bits.
- This works because a shared data buffer is allocated at consecutive physical addresses and is also mapped to consecutive virtual addresses, so a single constant offset per buffer suffices (see the sketch below).

[Figure: a superset offset adder translates the virtual superset bits into the physical superset bits with little delay, while a page offset adder replaces the TLB in translating the virtual tag into the physical tag (power efficient); the two results are concatenated.]
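A sketch of the offset-adjustment translation (field widths and names are illustrative; the two-adder split is the paper's idea). Because PA = VA + constant within a shared buffer, two small adders can replace the TLB lookup:

    #include <stdint.h>

    #define PAGE_BITS     12
    #define SUPERSET_BITS  2

    /* vpn_offset = PPN - VPN, one constant per shared buffer. */

    /* Superset offset adder: sits on the cache index path but is only
     * SUPERSET_BITS wide, so its delay is tiny. */
    static uint32_t phys_superset(uint32_t va, uint32_t vpn_offset) {
        uint32_t vpn = va >> PAGE_BITS;
        return (vpn + vpn_offset) & ((1u << SUPERSET_BITS) - 1);
    }

    /* Page offset adder: produces the physical tag (the PPN bits above
     * the superset bits); wider, but still faster than a TLB lookup. */
    static uint32_t phys_tag(uint32_t va, uint32_t vpn_offset) {
        uint32_t vpn = va >> PAGE_BITS;
        return (vpn + vpn_offset) >> SUPERSET_BITS;
    }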

Page 10: Compiler and OS Support

To apply the proposed scheme:
- The shared data buffers and the hot-spots are identified during the program profiling, compilation, and loading phases.
- Two extra bits are encoded in each memory reference instruction that accesses a shared data buffer:
  - Case 1: for the most frequently accessed shared data buffers, use the offset-adjustment address translation method; the index into the offset table is encoded in the memory reference instruction as well. This is where the benefit comes from.
  - Case 2: for less frequently accessed shared data buffers, translate the physical tag through the D-TLB.
  - Case 3: non-shared data is handled with virtual tags.

Offset table: one entry is reserved for each shared buffer, and the offset is determined by the OS (a sketch follows).
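A sketch of the software-visible pieces (the table size, label encoding values, and the os_register_shared_buffer helper are assumptions, not the paper's exact format):

    #include <stdint.h>

    enum label {              /* 2 extra bits per memory instruction   */
        L_PRIVATE    = 0,     /* case 3: virtual tag, no translation   */
        L_SHARED_TLB = 1,     /* case 2: physical tag via the D-TLB    */
        L_SHARED_OFF = 2,     /* case 1: physical tag via offset table */
    };

    #define OFFSET_TABLE_ENTRIES 8
    uint32_t offset_table[OFFSET_TABLE_ENTRIES];

    /* Hypothetical OS hook, called when shared buffer `id` is mapped:
     * the VA-to-PA distance, in pages, is one constant per buffer. */
    void os_register_shared_buffer(int id, uint32_t vpn, uint32_t ppn) {
        offset_table[id] = ppn - vpn;  /* added on each case-1 access */
    }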

Page 11: Hardware Support

First, one additional bit is associated with each cache line to indicate whether it holds a physical or a virtual tag.

Second, the offset table is implemented. The label part (L) of each memory instruction carries the index into the offset table and the synonym bits (the previous three cases) that select a virtual or a physical tag. Offset table access and cache access are pipelined, so the table is not on the critical path: no performance overhead.

Third, the superset offset adder and the page offset adder translate the physical superset bits and the PPN for synonym accesses. The introduced delay is small:
- The superset offset adder on the cache access path is small (typically 2 bits).
- The page offset adder is longer, but its delay is still less than that of the TLB it replaces.

Page 12: Overall Hardware Organization

The different address translation paths are controlled by the L field of the memory instruction (a dispatch sketch follows).

[Figure: three translation paths producing the physical superset bits and physical tag from the VPN, virtual tag, and virtual superset bits. Case 1: frequently used shared data uses the offset adjustment. Case 2: infrequently used shared data uses the default D-TLB. Case 3: non-shared data spends no power on address translation.]
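A sketch of the per-access dispatch on the L field (names and widths are illustrative, building on the earlier sketches; dtlb_translate is a stand-in for the real D-TLB):

    #include <stdint.h>

    #define PAGE_BITS 12
    enum label { L_PRIVATE, L_SHARED_TLB, L_SHARED_OFF };

    extern uint32_t offset_table[];            /* filled by the OS   */
    extern uint32_t dtlb_translate(uint32_t);  /* stub for the D-TLB */

    /* Only cases 1 and 2 pay for any translation; case 3, the common
     * case (private data), uses the virtual tag unchanged. */
    uint32_t tag_for_access(enum label l, uint32_t va, int idx) {
        uint32_t vpn = va >> PAGE_BITS;
        switch (l) {
        case L_SHARED_OFF:   /* case 1: adder instead of TLB lookup */
            return vpn + offset_table[idx];
        case L_SHARED_TLB:   /* case 2: default D-TLB translation   */
            return dtlb_translate(va) >> PAGE_BITS;
        default:             /* case 3: no translation at all       */
            return vpn;
        }
    }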

Page 13: Experimental Results: Energy Reduction for the Selective Tag Translation Cache (1/2)

Assuming the D-TLB is used for physical tag translation, the energy reduction for a direct-mapped cache is:
- Address translation only: 77.8% ~ 99.3%
- Entire cache, including address translation: 22.1% ~ 29.4%

(dm: direct-mapped cache; 2way: 2-way set-associative cache. Each result is a pair of numbers: the first for address translation only, the second for the entire cache.)

Page 14: Experimental Results: Energy Reduction for the Selective Tag Translation Cache (2/2)

Assuming the offset-adjustment translation is applied for the physical superset bits and PPNs, the energy reduction for a direct-mapped cache is:
- Address translation only: 82.1% ~ 99.9%
- Entire cache, including address translation: 23.6% ~ 29.6%

(dm: direct-mapped cache; 2way: 2-way set-associative cache. Each result is a pair of numbers: the first for address translation only, the second for the entire cache.)

Page 15: Conclusions

This paper proposed a selectively tagged cache architecture for low-power processors with virtual memory support:
- References to private data use virtual tags, so the power-consuming address translation is eliminated.
- References to shared data use physical tags to avoid synonym problems.
- Furthermore, due to the consecutive allocation of shared buffers, the address translation can be performed by an adder instead of a TLB lookup, improving the energy reduction further.

Results show that the proposed scheme reduces the energy of the entire cache by 25%~30%.

Page 16: Comments for This Paper

The instruction set extension may not be easy: the proposed scheme adds a label field to memory instructions. Are the unused bits in the instruction encoding sufficient for the label field?

The relationship between the related works and the proposed work is not well connected: how the proposal advances beyond the related works is not concrete, and the experimental results lack a comparison with the related works.

The area and performance overheads are not listed.

Page 17: Related Works

Techniques for minimizing the power/performance overhead of the TLB:
- Add a page sharing table to the TLB [2].
- Replace the TLB with a more scalable Synonym Lookaside Buffer [4].
- A TLB that supports up to two pages per entry [7].
- Redirect TLB accesses to a register which holds recent TLB entries [8].

This paper: reduce the amount of TLB activity through the selective tag translation cache architecture.