
Page associative caches on Futurebus

Current technology will support cost-effective implementation of page associative caches in high-performance systems. Paul Dixon explains how such a scheme could work on Futurebus

A cache scheme which uses page associative cache descriptors can offer advantages in terms of its impact on cache coherence in the presence of paged transactions, and on the use of local memory to minimize bus loading. It can also be used to preserve cache coherence when processor accesses are cached, i.e. a logical cache, in the presence of an external memory management unit. This paper proposes a mechanism for page associative cache operation for use with Futurebus systems, and analyses how such a cache would overcome the problems common to cache memory designs. Cache protocols are not included in the current Futurebus specification IEEE 896.1, recently standardized, but will be added after further development. Performance of page associative caches is compared with directly mapped caches and fully associative caches.

Keywords: microsystems, caches, Futurebus

Ferranti Computer Systems Ltd, Simonsway, Wythenshawe, Manchester M22 5LA, UK
Paper revised: 4 January 1988

Cache memories are being used to improve processor performance in an increasing number of high-performance computer systems. A cache memory is a block of fast memory placed close to the processor that it serves. Its purpose is to maintain a copy of instructions and/or data that are frequently accessed by the processor, thereby reducing access time to that data on subsequent accesses. The data held in a cache memory is a copy which exists in parallel with a real location in memory, which may or may not be up to date, and possibly with other copies in cache memories belonging to other processors in the system.

In a system where cache memories are employed it is imperative that no processor is allowed to operate on a copy of data which may no longer be valid, due to a change that has occurred elsewhere in the system. This is referred to as the problem of cache coherence. The protocols required to maintain cache coherence in a system with multiple cache memories are supported by the IEEE 896 Futurebus specification [1]. The following discussion relates to a Futurebus environment.

A variety of different cache memory architectures may be implemented, ranging from a simple, directly mapped cache to a fully associative cache. Each cache scheme has its own advantages and disadvantages in terms of its effect on processor performance and its interaction with the architecture of the overall system. In the implementation of a cache memory it is assumed that the following will apply.

• When data is loaded into the cache memory it is loaded a line at a time. Within a given system, the line length is fixed at 2^n byte. Typical line lengths will range from a minimum of 4 byte up to 64 byte.

• A dual set of descriptors is used, one for processor references and one for bus references (see Figure 1). This removes the need to arbitrate for the single descriptor resource on each bus access and, therefore, improves the efficiency of the cache memory.

Figure 1. Dual descriptor cache (the processor address is checked against a processor check descriptor block and the bus address against a bus check descriptor block; a buffer or MMU links the processor and bus address paths to the cache memory)



• Paged memory management is used, where logical addresses generated by the program are translated to physical addresses by a memory management unit. This address translation is based on pages of size 2^p, where address bits A((p-1)...0) define an index into the page and are not affected by translation, while address bits A(31...p) are subject to translation by the memory management unit, allowing the relocation of instructions and data to arbitrary physical locations. Large objects may, in addition to this relocation, be fragmented, easing memory allocation algorithms. (The address split assumed here is illustrated in the short code sketch following this list.)
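
As a concrete illustration of the address split assumed in this list, the short sketch below separates a 32-bit address into its translated page number and untranslated page offset, and the offset into a line index and byte index. The page size (p = 12, i.e. 4 kbyte pages) and line size (n = 4, i.e. 16 byte lines) are illustrative values within the ranges quoted above, not figures taken from the text.

#include <stdint.h>
#include <stdio.h>

#define P 12u                /* page size = 2^p byte (assumed) */
#define N 4u                 /* line size = 2^n byte (assumed) */

int main(void)
{
    uint32_t addr = 0x80123456u;                     /* arbitrary example address */

    uint32_t page_offset = addr & ((1u << P) - 1u);  /* A((p-1)...0), untranslated */
    uint32_t page_number = addr >> P;                /* A(31...p), translated by the MMU */
    uint32_t line_index  = page_offset >> N;         /* selects a line within the page */
    uint32_t byte_index  = page_offset & ((1u << N) - 1u);

    printf("page number 0x%05x, line %u, byte %u\n",
           (unsigned)page_number, (unsigned)line_index, (unsigned)byte_index);
    return 0;
}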

Investigating the interaction of the cache memory with the system architecture, the following advantages of page associative caches become obvious. These advantages are all related to the associative nature of the scheme and could equally be applied to a fully associative cache memory. Current technology does not permit cost-effective implementation of a large, fully associative cache memory, but it is adequate for a cost-effective implementation of a page associative cache memory scheme. While page associative caches do have disadvantages in terms of processor performance, it is felt that these are outweighed by the advantages that can be gained by making use of the features outlined below.

PAGE ASSOCIATIVE CACHES

Cache memory is simply a fast memory significantly smaller than the total address field of the processor and the data it holds is a copy of the real data. A scheme of cache descriptors is therefore required to allow the data in the cache memory to be identified.

The simplest scheme to implement is a directly mapped scheme, where a subset of the processor address lines is used to form an index into the cache memory to define the single location in cache where that reference, if it exists, will be found. Additional memory bits are provided in each location, where the address bits not used as part of the index are stored for comparison on subsequent accesses. The result of this comparison defines whether a 'hit' or 'miss' has occurred.
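
A minimal sketch of this directly mapped lookup follows. The cache size (2^m byte with m = 14) and line size (2^n byte with n = 4) are assumptions for illustration; the stored tag holds the address bits not used as part of the index.

#include <stdbool.h>
#include <stdint.h>

#define M 14u                            /* cache size = 2^m byte (assumed) */
#define N 4u                             /* line size  = 2^n byte (assumed) */
#define NUM_LINES (1u << (M - N))

struct dm_line {
    bool     valid;
    uint32_t tag;                        /* address bits A(31...m), stored for comparison */
    uint8_t  data[1u << N];
};

struct dm_line dm_cache[NUM_LINES];

/* The index bits A((m-1)...n) select the single location where the
 * reference, if present, must reside; comparing the stored tag then
 * decides whether a hit or a miss has occurred. */
bool dm_hit(uint32_t addr)
{
    uint32_t index = (addr >> N) & (NUM_LINES - 1u);
    uint32_t tag   = addr >> M;
    return dm_cache[index].valid && dm_cache[index].tag == tag;
}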

More complex cache memory schemes may use associativity in the descriptors. In this case, each reference from the processor has a number of different locations to which it could be allocated. A parallel associative match is performed on each access from the processor to determine if and where that reference resides. Page associative caches have their memory divided into a number of pages, where the size of a page is defined by the system and will, probably, be related to the memory management scheme, which may also employ demand paging, and uses pages of size 2^p. Typical page sizes will lie in the range 1 kbyte to 8 kbyte.
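
The two-step page associative lookup described here might be organized as in the sketch below: an associative match on the page portion of the reference (together with a process identification, introduced later in the text), followed by a check of a per-line valid bit within the matching page. The sizes (k = 5, i.e. 32 page descriptors; p = 12; n = 4) and field names are illustrative assumptions.

#include <stdbool.h>
#include <stdint.h>

#define K 5u                                  /* 2^k page descriptors (assumed) */
#define P 12u                                 /* page size = 2^p byte (assumed) */
#define N 4u                                  /* line size = 2^n byte (assumed) */
#define NUM_PAGES (1u << K)
#define LINES_PER_PAGE (1u << (P - N))

struct pa_desc {
    bool     in_use;
    uint16_t task_id;                         /* process identification  */
    uint32_t page_addr;                       /* address bits A(31...p)  */
    bool     line_valid[LINES_PER_PAGE];      /* indexed by A((p-1)...n) */
};

struct pa_desc pa_desc[NUM_PAGES];

/* In hardware all descriptors are compared in parallel; this loop stands in
 * for that associative match. Returns the descriptor index on a page hit. */
int pa_page_match(uint16_t task_id, uint32_t addr)
{
    uint32_t page_addr = addr >> P;
    for (unsigned i = 0; i < NUM_PAGES; i++)
        if (pa_desc[i].in_use && pa_desc[i].task_id == task_id &&
            pa_desc[i].page_addr == page_addr)
            return (int)i;
    return -1;
}

/* A line hit requires both a page hit and a valid line within that page. */
bool pa_line_hit(uint16_t task_id, uint32_t addr)
{
    int i = pa_page_match(task_id, addr);
    if (i < 0)
        return false;
    uint32_t line = (addr >> N) & (LINES_PER_PAGE - 1u);
    return pa_desc[i].line_valid[line];
}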

When the requirement to load a new page descriptor arises, a cache page is selected and the new reference loaded. The whole of that cache page is now invalid and available to hold data from the physical page referred to. Entries in this cache page become valid, a line at a time, as data is entered into the cache in response to a miss from the processor. Consequently, the utilization of the cache page, i.e. the proportion of lines that hold valid data, is typically lower than in other cache architectures.

Figure 2. Descriptor block for a page associative cache (2^k associative page descriptors, each holding a task ID, page status and address bits A(31...p), with per-line attribute bits indexed by address bits A((p-1)...n))

To implement the above scheme requires a cache controller which consists of 2^k associative descriptors (see Figure 2), which will hold the page portion of the address and a process identification. A minimum of 32 descriptors is required for adequate performance; more, if technology permits, would be desirable. For efficiency, the controller should consist of two similar descriptor blocks, one for processor references and one for bus references. The processor descriptor block must also contain a copy of the page status bit from local memory for each valid page entry.

Associated with each page in the processor descriptor block there must, to implement the MOESI scheme (which assigns five characteristics, modified, owned, exclusive, shared and invalid, that define cache data) as defined by the IEEE 896 Futurebus literature [1], be the attribute bits, valid, exclusive and owned, for each line in that page, as shown in Figure 2. These bits will encode to form a sixth state as a page containing modified lines is made invalid to the processor. This reference may not be overwritten until all modified lines have been restored to memory. It must be possible to perform an ORed read of the attribute bits from each line to ensure that no modified lines exist before a page can be replaced.

The ability to detect pages with the 'owned' attribute permits the discarding of unmodified pages without a sequential search. Modified lines should be restored to memory, as required, by the cache controller, using the physical reference in the bus check descriptor block. To assist the restoration of a limited, but fragmented, number of lines, a fast search to find the next modified line would be an advantage.
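
The attribute handling described above might be arranged as in the following sketch, which packs the valid, exclusive and owned bits for each line into words so that the ORed read (is any line in the page still modified?) and the search for the next modified line are both cheap. The packing and the page and line sizes are assumptions for illustration.

#include <stdbool.h>
#include <stdint.h>

#define LINES_PER_PAGE 256u                 /* e.g. 4 kbyte page, 16 byte lines (assumed) */
#define ATTR_WORDS ((LINES_PER_PAGE + 31u) / 32u)

struct page_attrs {
    uint32_t valid[ATTR_WORDS];             /* line holds a copy of memory                */
    uint32_t exclusive[ATTR_WORDS];         /* no other cache holds a copy                */
    uint32_t owned[ATTR_WORDS];             /* line modified; must be restored to memory  */
};

/* ORed read of the 'owned' attribute: true if any line in the page is still
 * modified, in which case the page cannot be replaced until those lines
 * have been written back. */
bool page_has_owned_lines(const struct page_attrs *a)
{
    uint32_t any = 0;
    for (unsigned w = 0; w < ATTR_WORDS; w++)
        any |= a->owned[w];
    return any != 0;
}

/* Fast search for the next modified line at or after 'from', to assist the
 * restoration of a limited but fragmented set of lines. Returns -1 if none. */
int next_owned_line(const struct page_attrs *a, unsigned from)
{
    for (unsigned line = from; line < LINES_PER_PAGE; line++)
        if (a->owned[line / 32u] & (1u << (line % 32u)))
            return (int)line;
    return -1;
}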

To minimize the interaction between the processor and bus check descriptor blocks, the bus check descriptor block requires a copy of the attribute bits. This permits the bus check descriptor block to interact with the bus during the connection phase with a minimum of delay and without the need to delay the processor by unnecessary reference to its descriptor block. Interaction with the processor descriptor is required only in cases where the modification of an attribute is required.

As a new reference is allocated to a physical cache page, all attribute bits in both blocks are cleared and the physical reference is loaded into the bus check descriptor. The processor reference, which may be either logical or physical, depending on the location of the memory management unit, is loaded into the processor descriptor block.


A reference to an invalid line in the cache will result in loading of that line and the acquisition of the 'valid' attributes in both processor and bus check descriptor blocks. Loading of any line in a page which is either external to the module or else is local but has the page status bit set will result in the generation of a bus transaction. The modification of an item in a line which does not possess the 'owned' attribute will result in the acquisition of this attribute in both the processor and bus check descriptor blocks. For each bus transaction generated due to the state of the page status bit, a disconnection command may be used to verify that an external cache reference to that page still exists. Where no reference exists, the page status bit may be cleared.
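
The sketch below restates these transitions for a single processor access, assuming both descriptor blocks have already located the page. The bus interface call is a hypothetical stand-in and is invoked only where the text above requires a bus transaction.

#include <stdbool.h>

struct line_attrs { bool valid, exclusive, owned; };

extern void bus_line_transaction(void);   /* hypothetical stand-in for the Futurebus access */

void processor_access(struct line_attrs *proc, struct line_attrs *bus_chk,
                      bool is_write, bool external_or_status_bit_set)
{
    if (!proc->valid) {
        /* Line miss: a bus transaction is generated if the page is external
         * to the module, or local with its page status bit set. */
        if (external_or_status_bit_set)
            bus_line_transaction();
        proc->valid    = true;            /* 'valid' acquired in both blocks */
        bus_chk->valid = true;
    }
    if (is_write && !proc->owned) {
        proc->owned    = true;            /* 'owned' acquired in both blocks */
        bus_chk->owned = true;
    }
}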

PROBLEMS

The following problems were investigated to ensure that there were no fundamental problems that could compromise the successful implementation of a high-performance multiprocessor system.

Memory management

The potential value of page associative caches first becomes obvious in the solution to the memory management problem, where to maximize processor performance it is necessary to connect the cache memory directly to the processor. At the time of the investigation, only external memory management units were available. To connect the cache after the memory management unit results in a loss of performance due to the memory management unit translation time. To connect the cache before the memory management unit results in the caching of logical references, which can present difficulties with cache coherence when monitoring physical system addresses.

Previous cache memory solutions, due to implementation difficulties, have usually employed directly mapped schemes. A directly mapped cache of size 2^m uses address bits A((m-1)...0) as an index into the cache to define the location at which a reference will, if present, reside. Where m > p, i.e. the cache size is larger than the page size, translation of address bits A((m-1)...p) will occur in a manner such that two or more entries, which may exist simultaneously in the logical cache, may translate to produce the same conflicting index into the physical descriptors. The result of this conflict is to render it impossible, within a directly mapped cache scheme, to use two similar descriptor blocks to maintain coherence between physical and logical references.
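
The conflict can be seen with a small worked example, using assumed sizes m = 16 (a 64 kbyte cache), p = 12 (4 kbyte pages) and an arbitrary pair of logical-to-physical page mappings: the two logical addresses occupy different locations in the logical cache, yet their physical images index the same location, so a single directly mapped physical descriptor block cannot track both.

#include <stdint.h>
#include <stdio.h>

#define M 16u   /* cache size = 2^m byte (assumed) */
#define N 4u    /* line size  = 2^n byte (assumed) */

static uint32_t dm_index(uint32_t addr)
{
    return (addr >> N) & ((1u << (M - N)) - 1u);     /* index bits A((m-1)...n) */
}

int main(void)
{
    /* Two logical pages and the physical pages an MMU might map them to
     * (the mappings are invented for the example). */
    uint32_t logical_a  = 0x00003000u, physical_a = 0x0004A000u;
    uint32_t logical_b  = 0x0000B000u, physical_b = 0x0007A000u;

    printf("logical  indices: %u, %u\n",
           (unsigned)dm_index(logical_a), (unsigned)dm_index(logical_b));
    printf("physical indices: %u, %u\n",
           (unsigned)dm_index(physical_a), (unsigned)dm_index(physical_b));
    /* The logical indices differ (768 and 2816) so both entries can coexist
     * in the logical cache, but both physical indices are 2560: one physical
     * descriptor location would have to describe two different entries. */
    return 0;
}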

If directly mapped caches are employed, a scheme of reverse translation would be one solution. The only solution to the problem of coherence in logical caches that does not require a scheme of reverse translation lies in the use of associative descriptors, where references are placed, arbitrarily with respect to address, anywhere within the descriptor block.

For an associative cache with 2^k associative descriptors, a reference may be allocated to the ith descriptor in the processor descriptor block. The corresponding reference is also loaded into the ith descriptor in the bus check descriptor block at the same time. There is now no requirement that the references loaded into the processor and bus check descriptor blocks be identical. The buffer (in Figure 1) required to allow a path between processor and bus addresses may include a memory management unit. All that is required to maintain cache coherence across a memory management unit, where translation of the addresses has occurred, is to load the translated, physical address into the bus check descriptor block. When coincidence is detected during a bus access, cooperation between the logical and physical blocks allows the physical site of the reference to be identified. Coherence may now be preserved and the need for reverse translation is eliminated.
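
A sketch of this arrangement follows: when descriptor i is allocated, the (possibly logical) processor reference is written into the processor block and its translation into the bus check block at the same index, so a bus-side associative match identifies the cache page directly and no reverse translation is needed. The translate() call stands in for the external memory management unit; all names and sizes are illustrative.

#include <stdbool.h>
#include <stdint.h>

#define NUM_DESC 32u                    /* 2^k descriptors (assumed)        */
#define P 12u                           /* page size = 2^p byte (assumed)   */

bool     desc_in_use[NUM_DESC];
uint32_t proc_desc[NUM_DESC];           /* logical page numbers, A(31...p)  */
uint32_t bus_desc[NUM_DESC];            /* physical page numbers, A(31...p) */

extern uint32_t translate(uint32_t logical_page);    /* hypothetical MMU lookup */

/* Allocate descriptor i for a new page: both blocks are loaded together. */
void allocate_page(unsigned i, uint32_t logical_addr)
{
    uint32_t logical_page = logical_addr >> P;
    proc_desc[i]   = logical_page;
    bus_desc[i]    = translate(logical_page);
    desc_in_use[i] = true;
}

/* Bus snoop: an associative match on the physical page number. A hit at
 * index i refers to the same cache page that the processor knows by its
 * logical reference, so coherence actions can be applied directly. */
int bus_snoop(uint32_t physical_addr)
{
    uint32_t physical_page = physical_addr >> P;
    for (unsigned i = 0; i < NUM_DESC; i++)
        if (desc_in_use[i] && bus_desc[i] == physical_page)
            return (int)i;
    return -1;
}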

Since associative descriptors hold the solution, and large, fully associative caches are beyond current technology, the only solution appears to lie in some form of partially associative scheme. The most logical partial division for the cache descriptors which avoids an arbitrary division of the address field appears to be the already existing partition into pages. This produces a scheme where there are 2^k associative page descriptors, and associated with each is sufficient fast cache memory to contain a complete page of data. The index into each page is by the least significant page index address bits A((p-1)...0), which have already been defined as undergoing no translation.

Block transfers

Block transfers, where data is transferred from sequentially ascending addresses in memory, are used to improve the bandwidth available on the bus. During a block transfer on Futurebus, only connected slaves are able to keep track of the progress of the transaction and are aware of the size of the address block transferred. Where the length of a block transfer exceeds the length of a line, coherence may be compromised. To avoid this it is necessary either to employ some, probably complex, high-level scheme to ensure that coherence is not compromised, or else to restrict the length of block transfers to a single line.

The restriction to line size transactions has a serious impact on potential bus bandwidth, particularly for line lengths as short as 16 byte, which may be required in conjunction with a 68030 processor, for example. In this case, when the length of time required for the connection and disconnection phases is added to the data phase the effective bus bandwidth could easily be halved.

Neither a line length restriction nor a high-level scheme is desirable, but the need for nonconnected slaves to detect coincidence by monitoring only the start address requires an arbitrary limit on the block size. The size of this arbitrary limit is governed by the ability of the cache descriptor scheme to identify references to lines other than the one addressed at the start of the transaction.

Page associative caches, again, offer a potential solution, since they are able to detect an existing reference within a given page. The requirement when beginning a long block transfer is only to be sure that cache coherence will not be compromised if the block transfer is completed.

When checking for coincidence, a page associative cache must first check whether a reference to that page exists, and then whether there is a reference to the addressed line.


The ability to test for the presence of a descriptor for a given page allows an indication to be given if coherence is compromised by a transfer within the page specified. This permits block transfers to proceed without compromise to coherence as long as a page boundary is not crossed. Where coherence would be compromised by a transfer within that page, the transfer must be broken down into line-size transfers.

Since many long block transfers will be of pages, particularly in a demand paged system, the division of the cache scheme into pages appears to impose no restriction. To implement this scheme the master must, as the transaction begins, indicate that the size of the transfer may exceed a single line. This may be achieved by the use of the cache command bit CC* which, if it is asserted, indicates that the transaction will not exceed a line and that line wrapround, if used in that system, should be used; also that the master will retain a copy of the line in its cache. Slaves with a copy of the line in their own cache should respond by asserting the cache status bit CS*.

If the cache command bit CC* is not asserted this should indicate that line wrapround is not required and that the length of the transaction may exceed the length of a line, but may not cross a page boundary. Any slave with a copy of that page in its cache may indicate, by asserting the cache status bit CS*, if there would be any compromise to coherence if a transfer of the type indicated were to proceed within the page. If the master detects the cache status bit CS* asserted, to preserve coherence it must break down the transfer into a number of transactions which do not cross a line boundary. The coherence of each line may now be preserved by nonconnected slaves.
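
The following sketch gives a master-side view of this use of CC* and CS*, as proposed in the text rather than as defined by the published standard; the connection, data and disconnection phases are represented by hypothetical helper functions, and the line length is an assumed 16 byte.

#include <stdbool.h>
#include <stdint.h>

#define LINE_BYTES 16u

extern bool start_transfer(uint32_t addr, uint32_t length, bool cc_asserted);
/* hypothetical connection phase; returns the observed state of CS* */
extern void transfer_data(uint32_t addr, uint32_t length);
extern void disconnect(void);

void block_transfer(uint32_t addr, uint32_t length)      /* within one page */
{
    if (length <= LINE_BYTES) {
        /* Single-line transaction: CC* asserted, master retains a copy. */
        (void)start_transfer(addr, length, true);
        transfer_data(addr, length);
        disconnect();
        return;
    }

    /* Longer transfer: CC* not asserted. A slave holding a copy of this page
     * asserts CS* if the transfer would compromise coherence. */
    if (!start_transfer(addr, length, false)) {
        transfer_data(addr, length);                /* full block, one transaction */
        disconnect();
        return;
    }
    disconnect();

    /* Potential compromise: fall back to transactions bounded by a line, so
     * that nonconnected slaves can preserve the coherence of each line.
     * (Asserting CC* for each fallback transaction is an assumption.) */
    for (uint32_t off = 0; off < length; off += LINE_BYTES) {
        (void)start_transfer(addr + off, LINE_BYTES, true);
        transfer_data(addr + off, LINE_BYTES);
        disconnect();
    }
}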

The majority of page transfers will find no potential compromise of coherence and the full page will be transferred in a single transaction. In the minority of cases, where potential compromise exists, coherence is preserved by the breakdown of the transfer into line-size transactions. Where a page transfer modifies the global memory image, e.g. during page replacement in a demand paged system, commands may be employed in the disconnection phase to invalidate cache references to the discarded page.

Cache flush

At the termination of a process it may be desirable to flush from the cache the descriptors of all references relating to that process. The use of a process identification of an associative nature in the descriptor block permits a selective flush to take place.

The alternatives to a selective associative flush are either a selective flush by a sequential search or a complete flush. A scheme requiring a sequential search is undesirable due to the time taken (although this search may be limited, in the case of a page as opposed to a process identification, to the locations which that page may occupy). A complete flush will never compromise coherence, but if a write-back algorithm is employed this solution will itself require a sequential search through the whole descriptor block to identify modified lines which must be restored to memory.

In principle, an associative cache can support the associative flush of references, either by process identification or by page. If the flush arises as a result of the termination of a process it is possible that modified data may not be required (portions of the stack may be deleted by the termination of the process, for example) and the data may be discarded using associative techniques without the need to restore to memory.
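
A selective flush of this kind might look like the sketch below: one associative pass over the page descriptors removes every entry carrying the terminating process's identification, writing modified pages back only where they must be preserved. The structure, field names and sizes are illustrative.

#include <stdbool.h>
#include <stdint.h>

#define NUM_DESC 32u

struct flush_desc {
    bool     in_use;
    uint16_t task_id;                   /* process identification              */
    bool     has_owned_lines;           /* result of the ORed attribute read   */
};

struct flush_desc fdesc[NUM_DESC];

extern void write_back_page(unsigned i);   /* hypothetical restore-to-memory path */

/* discard_modified corresponds to the case where data belonging to the
 * terminated process (its stack, for example) need not be restored. */
void flush_task(uint16_t task_id, bool discard_modified)
{
    for (unsigned i = 0; i < NUM_DESC; i++) {
        if (!fdesc[i].in_use || fdesc[i].task_id != task_id)
            continue;                   /* the match is associative in hardware */
        if (fdesc[i].has_owned_lines && !discard_modified)
            write_back_page(i);
        fdesc[i].in_use = false;        /* descriptor is now free for reuse     */
    }
}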

Memory management coherence

In a demand paged system, pages are removed from memory to make way for new pages required by the current processes. As a page is removed there is an interaction with the cache. All references to the page to be removed, particularly where write access to that page is permitted, must be safely removed from all caches in a manner that restores all modified lines of data to memory and prevents further access by the process to the page. This may be achieved if the entry in the memory management tables is invalidated, followed by a flush of that page from cache. Associative techniques would, again, assist in the identification of modified pages, removing the need for a sequential search.

Local memory

Many slow bus systems have adopted local memory as a means of minimizing bus traffic and to reduce the latency of access to data, thereby improving performance. In any system where bus bandwidth and latency become a limiting factor, local memory offers a valid technique for squeezing a little extra from the system.

The use of a cache memory with a processor, now that the problems of coherence have been solved, allows the use of off-card data and code with acceptable latency, allowing a closely coupled, shared memory system to be implemented. The addition of memory on each processor card is a sensible step as it reduces the number of cards required for a minimum system and also results in an increase in system memory with each increment in processing power added to the system.

This configuration raises the question of whether any benefit, in terms of bus utilization, can be gained by using the local memory. The problem here is associated with the system-wide requirements of the cache coherence scheme, and at first glance it would appear that the majority of accesses which produce a cache miss will still need to generate traffic on the bus to maintain cache coherence. With a small amount of additional logic, however, the characteristics of page associative caches may be used to reduce this traffic.

To gain this benefit the local memory needs an additional status bit per page of physical memory. The page status bit is set by an external access to the page which results in a copy of any line from that page being held in a cache. Accesses to local memory where the page status bit is clear require no external bus traffic to preserve cache coherence. An access to local memory where the page status bit is set warns that there may be a copy of a line from this page resident in an external cache. In the case of a line read, for example, a bus transaction must be initiated which, unless a slave indicates its desire to intervene or reflect, may be an address-only transaction. The cache status bit CS* permits the correct condition to be allocated to the attribute of exclusiveness during the address phase. The use of commands during the disconnection phase may allow slaves to report if a reference to that page exists in cache.


Where no external cache holds a reference to that page, the page status bit may be cleared to indicate that the page is exclusive to its local processor.
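
A sketch of this use of the page status bit follows. The bus-side helper is a hypothetical stand-in for an address-only transaction whose disconnection phase reports whether any external cache still references the page; the mapping of addresses to local page indices is likewise illustrative.

#include <stdbool.h>
#include <stdint.h>

#define P 12u                            /* page size = 2^p byte (assumed)     */
#define LOCAL_PAGES 1024u                /* pages of local memory (assumed)    */

bool page_status[LOCAL_PAGES];           /* set: an external cache may hold a copy */

extern bool bus_page_still_referenced(uint32_t addr);
/* hypothetical: address-only transaction; the disconnection phase reports
 * whether some external cache still holds a reference to the page */

static unsigned local_page(uint32_t addr)
{
    return (addr >> P) & (LOCAL_PAGES - 1u);   /* illustrative index mapping */
}

void local_line_read(uint32_t addr)
{
    unsigned page = local_page(addr);

    if (!page_status[page])
        return;                          /* no external copies: no bus traffic */

    /* An external cache may hold a line from this page: a transaction is
     * initiated (address-only unless a slave intervenes or reflects), and
     * the status bit is cleared if no external reference remains. */
    if (!bus_page_still_referenced(addr))
        page_status[page] = false;
}

/* Called when an external bus access takes a copy of a line from a local page. */
void note_external_copy(uint32_t addr)
{
    page_status[local_page(addr)] = true;
}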

The use of local memory also permits a write-through algorithm to be implemented on writes to exclusive data from local pages, reducing the potential write-back overhead for lines from local pages. The write-back overhead for a page associative cache is greater than it is for most other cache architectures, since as a page is replaced it may be necessary to restore that full page to memory before its reference can be removed and before the physical cache page can be reallocated. A write-through algorithm may operate on all local and exclusive lines of data with a minimum of latency, which may be further reduced by write buffering into local memory. External and shared lines would, for efficiency, operate a write-back scheme. The use of the write-through algorithm eliminates the need for a line to assume the 'owned' attribute.
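
The resulting write policy choice can be summarized as in the sketch below, with illustrative names: write-through for exclusive lines from local pages, so that such lines never need the 'owned' attribute, and write-back for external or shared lines.

#include <stdbool.h>

struct write_state { bool local_page, exclusive, owned; };

extern void write_through_to_local_memory(void);   /* may be write-buffered (assumed) */

void on_processor_write(struct write_state *l)
{
    if (l->local_page && l->exclusive) {
        /* Write-through: local memory stays up to date, so no later
         * write-back of this line is required. */
        write_through_to_local_memory();
    } else {
        /* Write-back: the line becomes owned and must be restored to memory
         * before its cache page can be reallocated. */
        l->owned = true;
    }
}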

CONCLUSIONS

A page associative cache has a number of disadvantages with respect to some popular cache architectures:

• low utilization, where a high proportion of available lines may not contain valid data

• the potentially large amount of cached data discarded by the load of a new reference

• the limited number of page descriptors, especially in a multitasking environment

• the potential need to write back a large quantity of modified data, although this can be minimized with no loss in performance by the use of a write-through algorithm to local memory.

Weighed against these are the system advantages that can be gained within a paged architecture:

• the ability to maintain coherence with an external memory management unit while caching logical references from the processor

• the ability to transfer complete pages in a single transaction offers an improvement in the page transfer time, which allows a higher bus bandwidth and reduces the period of time for which a local processor may be halted by a page transfer

• the reduction in the number of sequential searches required to satisfy the page-structured requirements of demand paging and the process-identification-based requirements of process termination

• the ability to use local memory in a manner that removes the need, in most cases, to inform the bus allows a considerable reduction in latency on a cache miss.

There are, therefore, advantages available to outweigh the disadvantages: advantages which also overcome the disadvantages of directly mapped schemes and which may increase the potential performance of a system beyond that available using conventional techniques. If and when technology can support large, fully associative caches, partially associative facilities may be used within these controllers to retain compatibility with the paging facilities offered by page associative caches.

If in the future advantage is to be taken of associative techniques or of reference to local memory without bus transactions, then the Futurebus cache coherence protocols must be specified in a way that does not exclude the use of associative caches that operate in the manner described. The correct set of disconnection phase commands must also be specified.

REFERENCES

1 IEEE 896.1-1987, Futurebus specification, IEEE Standards Office, New York, NY, USA (1987)

Paul Dixon graduated with a BSc from Leeds University, UK, in 1967. Since then he has worked for Ferranti Computer Systems at Wythenshawe. He has been involved in memory design, bus design and in the design of multiprocessor architectures, more recently using microprocessors. He is a member of the working group for the IEEE 896 Futurebus standard and has been involved with a recently formed subcommittee of the VME International Trade Association (VITA) producing the specification for a VMEbus control IC. His interests are memory management, including demand paging, cache memory systems and multiprocessor architectures.
