"Subsettable Top Level Cache" was issued as U.S. Patent #6,092,153. (Please search at http://www.uspto.gov for the as-issued version.)
Some of the ideas here are covered by the above patent.
The compiler would do some of the cache allocation and an adaptive cache allocator would consider data sizes and available cache memory in allocating the rest of the cache memory at each level.
Subsets can be used as direct mapped caches.
Subsets can be used as local memory. These subsets are n blocks long. The advantage is the potential savings in memory, e.g. an array 33 blocks in size requires 33 blocks in a local memory subset, but 64 blocks in a direct mapped subset. (Note that even though this kind of subset is used as a local memory, the cache like behavior is retained such that redundant block loads are avoided.)
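The sizing arithmetic above can be sketched in a couple of lines of Python (the helper names are mine, for illustration only):

```python
def direct_mapped_size(n_blocks):
    """Blocks consumed by a direct mapped subset: the next power of two."""
    size = 1
    while size < n_blocks:
        size *= 2
    return size

def local_memory_size(n_blocks):
    """A local memory subset needs exactly n blocks."""
    return n_blocks

# A 33 block array: 33 blocks as local memory, but 64 blocks direct mapped.
```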
Subsets can be used as small buffers for sequential access to larger arrays lower in memory. The "Stream Engine", described later, would keep the buffer partly filled for read accesses, or partly emptied for write accesses.
A direct mapped subset with cache like behavior would be used to avoid redundant block loads, e.g. in table lookups when the desired word may be already present in a block in the subset.
Consider the simple example of computing the checksum of an array that is larger than the cache. The array would be accessed through a small subset used as a buffer. The "Stream Engine" would provide the array elements much like an input/output controller outputs data to a peripheral. This avoids thrashing the rest of the cache.
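As a rough illustration, here is a minimal Python sketch of the checksum example. The 16-word block size and the refill loop standing in for the Stream Engine are assumptions, not part of the design:

```python
BLOCK_WORDS = 16   # assumed block size in words

def checksum_streamed(array, buffer_blocks=2):
    """Sum a large array through a small buffer subset.

    The Stream Engine would keep refilling the buffer one block at a
    time; here a slice copy stands in for each block transfer.  Only
    the buffer subset is touched, so the rest of the cache is
    undisturbed.
    """
    buffer = [0] * (buffer_blocks * BLOCK_WORDS)   # the buffer subset
    total = 0
    for start in range(0, len(array), BLOCK_WORDS):
        block = array[start:start + BLOCK_WORDS]        # one block transfer
        slot = (start // BLOCK_WORDS) % buffer_blocks   # wraps in the subset
        buffer[slot * BLOCK_WORDS:slot * BLOCK_WORDS + len(block)] = block
        total += sum(buffer[slot * BLOCK_WORDS:slot * BLOCK_WORDS + len(block)])
    return total
```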
In a cache with subsets, all data is accessed through subsets.
At each phase of a program, the program would use a subset for each of the few to several most active data sets, and then use a catchall subset for whatever is left, if anything.
Each direct mapped subset behaves as a standalone cache, and shares the cache memory access hardware. The direct mapped subsets come in sizes that are a power of two. A cache with subsets naturally has the same large block size that a direct mapped cache has. The subsets cannot thrash each other; this eliminates most thrashing misses. (Thrashing can still occur within a subset.)
Generally, each base register would have a descriptor of a direct mapped subset. Memory accesses using a given base register would go through the subset specified by the descriptor associated with that base register. Each subset is described by a start address in the cache memory and the subset's size.
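A minimal Python model of a subset descriptor and its direct mapped behavior (the field and method names are mine; block-granular addresses are assumed):

```python
class Subset:
    """Descriptor for a direct mapped subset of the cache memory."""

    def __init__(self, start_block, size_blocks):
        assert size_blocks & (size_blocks - 1) == 0, "size must be a power of two"
        self.start = start_block           # start address in the cache memory
        self.size = size_blocks            # subset size, a power of two
        self.tags = [None] * size_blocks   # which memory block occupies each slot

    def cache_block(self, mem_block):
        """Cache memory block a memory block maps to (wraps within the subset)."""
        return self.start + (mem_block & (self.size - 1))

    def access(self, mem_block):
        """True on a hit; on a miss the block is loaded, replacing the old one."""
        slot = mem_block & (self.size - 1)
        hit = self.tags[slot] == mem_block
        if not hit:
            self.tags[slot] = mem_block    # load the block
        return hit
```

Because two subsets never share a cache block, traffic through one base register cannot evict a block that another base register's subset depends on; conflict misses remain possible only within a single subset.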
The memory base register could have a real memory address, a virtual memory address, a segment offset, etc. depending on how the memory system was being managed.
The top level cache would typically be 16 blocks, and often be used as 8 pairs of direct mapped subsets. This size cache would provide fast access to data, providing an alternative for some of the general register usage. This arrangement would allow either one load/store operand per clock or a block read/write from lower in the memory hierarchy. Also, an extra set of 16 blocks could be included to allow rapid thread switching. The top level cache could be split in two, thereby doubling the bandwidth. Instruction accessing could be through a top level cache.
The compiler would be responsible for planning the use of the subsets in the top level cache during the execution of the program, i.e. which data uses which subsets, the size of each subset, and how the cache memory is allocated to subsets. The compiler would generate instructions as needed to configure the cache subsets. The planning would be somewhat similar to a compiler planning the use of a register set.
For cache levels below the top level, the compiler would plan the usage of the cache for the smaller data sets that are fixed in size. An adaptive cache allocator would handle the rest.
For performance critical applications, programmer planned allocation of the cache could be used.
Most of the time, each kind of data will have its own subset. Often, a single data item, e.g. a record, will be loaded into a subset, then when processing is finished, the next data item of the same kind will be loaded into the same subset. With several subsets to utilize, the compiler can load a data item into a subset, then not disturb it until processing is finished. Because the loaded data item will not be disturbed, the start address of the data item in the cache subset can be passed to the processor, placed alongside the corresponding base register, and then the processor can access the data item using cache addresses. (Cache address wraparound within a direct mapped subset must be handled properly.)
Prefetch like load string instructions move the blocks containing the string from some lower level of memory to the specified subset. (Only the blocks not already in the cache subset need to be moved.) Either the string must fit in the subset or the access must be sequential.
The processor can access the most recent string loaded into each subset.
Processor accesses to the cache are usually for single operands and use cache addresses.
The compiler keeps track of what was last loaded in each subset. Also, the processor keeps a copy of the last load string instruction executed for each subset. This may be needed for thread switch recovery.
Implied in the preceding is that the string can span a number of blocks, and be fetched as a group, e.g. a record which may span two blocks. On average, I'd expect that this would result in getting ~2 blocks per DRAM row access. (A DRAM row access followed by a 2 block transfer takes significantly less time than 2 independent DRAM row accesses, each followed by a 1 block transfer.) In some cases, the transfer time of blocks after the first one will be masked by processing.
Two base registers could specify the same subset, e.g. a subset which holds part of an activation record stack. One base register could point to the current activation record and the other base register could point to the next level up activation record in a stack. The compiler would need to ensure that the subset was large enough to hold both activation records at once.
Interlocks prevent attempted usage of data before it's available.
For high usage code, the programmer should write his code such that the working set fits in the minimum level one cache size. (The top level cache is the level zero cache.) Generally speaking, each set of randomly accessed data should have its own cache subset. For sequentially accessed data, e.g. vectors, a small buffer in the level one cache could hold two blocks worth of the level two cache blocks, serving as a buffer for data stored in lower levels of memory.
The problem of optimizing the cache memory usage is similar to the problem of optimizing general register usage. Both usages require considering competing demands for resources at each phase of a program's execution as well as considering the transitions between phases.
When a general register usage optimizer runs out of general registers, it can use load/store instructions instead. When there is not enough room in a cache level for all of the data needed for a phase of program execution, first allocate for the high frequency uses, then allocate a direct mapped subset through which the remaining data will be accessed. (The remaining data would be allocated further down in the memory hierarchy.)
For a cache with subsets, I'm assuming that the compiler will use program flow analysis to do a good job of reusing the blocks containing data that is no longer needed. There are two cases here: (1) the data currently in the subset is no longer needed, after which new data can be loaded over it, and (2) the subset itself is no longer needed, e.g. when leaving an area of the program.
A subset used with an array of data items will typically be either large enough to hold one data item or all of them. An exception is table lookups where the frequency and locality of the lookups affect the optimum size of the subset. Level two cache subsets will tend to be larger, e.g. where the first level cache holds one to several data item(s) of an array, the level two cache may hold the whole array.
C++ with its pointers, etc. can be messy. A C++ compiler would need to add code such that when a C++ pointer accesses memory, the access is to the correct cache subset.
Java gets rid of pointers. Using Java and the Java VM as the framework for a computer design gets rid of much of the messiness.
If aliasing is possible, it'd help if aliasing was explicitly declared.
OOP (Object Oriented Programming) allows hiding actual data structures, e.g. one could have a matrix routine that had small, medium and large data structures. The choice of which data structure to use would be determined by the actual size of the matrices read in, e.g. small matrix operations could fit in the cache and large matrix operations could use submatrices (tiling?). Then, JIT (Just In Time compiling) would compile code that used the chosen data structure. With this strategy, there is apparent inefficiency when one looks at the source code, but most of the inefficiency can be optimized away.
Cache allocation is affected by how many threads are running, e.g. 1 or 2, the chip's cache size and the size of the data that is encountered. JIT (Just In Time compiling) also allows tuning the cache allocation to the size of the data encountered and the cache memory available to the thread. The plan is compiler allocation of smaller data sets and adaptive cache allocation of larger data sets, depending on the cache memory available.
Most of the time, data of a kind will go through a single subset. For this, there is no coherency problem. However, when two or more subsets are used to process the same kind of data, and updating occurs in one of the subsets, then coherency between subsets needs to be maintained. Various techniques can be used to maintain coherency between the subsets. Only the data traffic in the subsets subject to incoherency needs to be monitored.
To maintain cache coherency in a multiprocessor system, similar techniques could be used. Usually, only one or two subsets in each processor would be subject to incoherency.
In some cases, cache data may be stale; if so, a forced load would ensure that fresh data was loaded.
The Stream Engine moves data between levels in memory, block parallel, sometimes doing striding and/or scatter/gather, such that the processor can access code/data in direct mapped, local memory and small buffer subsets.
The general pattern would be that small strings are moved at the higher levels of memory, sometimes streamed, and longer strings at the lower levels of memory.
A memory controller would provide stride 1 access for accessing complete blocks. Stride n accesses to memory could be provided easily. Striding through a DRAM row allows fetching just the needed data.
It may be useful to access n columns of a matrix instead of just one column while striding through memory. Later, a stride n pass through the resultant array picks up a single column.
An 8x8 submatrix could be stored row 1, row 2, etc. in a DRAM row, then the submatrix row accesses would be by contiguous bytes and submatrix column accesses would be by strided access through the DRAM row, picking up a value from each submatrix row. In the level two cache, the matrix rows and columns would be arrays of stride 1.
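A small Python sketch of that layout, with an 8x8 submatrix stored row-major within one DRAM row (the helper names are illustrative):

```python
N = 8
# An 8x8 submatrix stored row 1, row 2, ... within one DRAM row.
dram_row = [r * N + c for r in range(N) for c in range(N)]

def submatrix_row(i):
    """Row access: contiguous values within the DRAM row."""
    return dram_row[i * N:(i + 1) * N]

def submatrix_col(j):
    """Column access: stride N through the DRAM row, one value per submatrix row."""
    return dram_row[j::N]
```

Both results land in the level two cache as stride 1 arrays, which is the point of the layout.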
A field from sequentially stored records could be picked up and stored in the level two cache, allowing stride 1 processor access to the fields. This technique would provide a significant reduction in the bandwidth needed to transfer the strided data, i.e. only the data needed in a block is transferred, not the whole block every time. Plus, it groups the DRAM accesses such that one DRAM row access can often provide many data values. (This assumes that the DRAM access can include the striding specification.)
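The field pickup can be sketched the same way; the record size and field offset below are assumptions for illustration:

```python
REC_WORDS = 8       # assumed record size in words
FIELD_OFFSET = 3    # assumed offset of the wanted field within each record

def gather_field(memory, n_records):
    """Transfer only the wanted field of each record, not whole blocks.

    The result is a stride 1 array in the level two cache; the
    bandwidth used is one word per record instead of one block per
    record.
    """
    return memory[FIELD_OFFSET::REC_WORDS][:n_records]
```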
When following linked lists, it may be more efficient to transfer the link field and the field being checked to the level two cache, then follow the links in the level two cache.
On the processor side, a vector operation usually becomes a simple matter of sequentially accessing stride 1 arrays in the level one cache, or for arrays lower in the memory hierarchy, streaming the arrays through a small buffer. By the time streaming is used to do >1 strides, and arrays are accessed sequentially by the processor, there's not much left in a vector like loop, just the vector operation instruction which can specify two sources and a sink, a branch back and a loop end test. The loop end test could be a conditional branch back, the test being on whether the end of data has been reached. This model generalizes easily to more complex loops.
The real payoff of a Stream Engine is when you can average several uses of each data set streamed.
Techniques used to avoid striding through a disc memory could be used to avoid striding through DRAM.
A cache with subsets can be retrofitted to an existing ISA, be backward compatible and provide significant cache performance gains when executing programs compiled to exploit the advantages of a cache with subsets.
The compiler would generate instructions as needed to specify the cache subsets for each base register, e.g. two consecutive load immediates to the same register could contain subset specifications, yet be transparent to existing ISA implementations.
While executing software that was not compiled for a cache with subsets, all of the base registers' subset descriptors could point to the same large subset, giving the effect of a single direct mapped cache. It'd probably be necessary to include a victim cache to obtain adequate performance.
Pointers in C++ may point to different data, depending on program flow. There is a risk that the processor would look for data in the wrong subset. The safe approach would be to use a single direct mapped subset when the compiler encounters a program where a pointer might point into two or more subsets. (Java doesn't use pointers, so this problem would not arise in Java.)
After a thread switch, a conventional cache doesn't immediately save the cache contents. However, when an updated (dirty) block is replaced, the updated data is correctly stored away. Later, when the thread is restarted, a conventional cache refetches the data as needed. A cache with subsets can use a somewhat similar approach.
If a thread suspension, then resumption occurs between the time that a load string instruction is executed and when the processor accesses any of the loaded string, then the correct block may not be there due to the thread switch.
A thread ID attached to each block's address enables doing a check that prevents erroneously using a block from another thread. When the thread ID mismatches, the correct block must be fetched. (This only happens after a thread suspension has allowed an intervening thread to overwrite cache blocks.) The simplest recovery is to just reexecute the last load string that was executed for that subset. (The processor can only directly access the words from the last load string instruction.)
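A Python sketch of the thread ID check and the simple recovery; the tuple layout and the reload callback (standing in for reexecuting the last load string) are my own illustrative choices:

```python
def access_block(cache, slot, thread_id, mem_block, reload_block):
    """Check the thread ID and block tag before using a cached block.

    cache[slot] is a (thread_id, mem_block, data) tuple or None.  On a
    mismatch -- e.g. an intervening thread overwrote the slot -- the
    reload_block callback refetches the block, modeling reexecution of
    the last load string for the subset, and the access then proceeds.
    """
    entry = cache[slot]
    if entry is None or entry[0] != thread_id or entry[1] != mem_block:
        cache[slot] = (thread_id, mem_block, reload_block(mem_block))
    return cache[slot][2]
```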
Partial executions of the last load are also possible. The memory address of the missing block can be calculated from the last load string instruction, i.e. the memory address of the missing byte is the memory address of the first byte loaded into the accessed subset plus the offset, if any, from the first byte. The block can then be loaded.
Adjacent blocks (on the same DRAM access) that contain some of the loaded string could also be fetched, if missing from the cache.
If the compiler knows the needs of external functions, to some extent it can plan its allocation such that the external function is accommodated. The compiler could allocate the current program's most active subsets bottom up in the cache. And the operating system and library calls could be allocated top down in the cache (when they execute). Library and operating system routines could use the program's thread ID plus one. This strategy should minimize the thread switch recovery time for those calls.
Alternatively, the operating system could have its own area of the cache in which it allocated its subsets.
Fast response to interrupts can be assured by holding some portion of the interrupt processing code in an inactive subset, ready to be used in processing an interrupt.
I'm advocating managing the cache memory at each cache level, e.g. levels one, two and three, much as a main memory is managed: allocating it, accessing it with cache memory addresses, etc., while also providing cache like behavior within subsets of the cache memory when cache like behavior is needed.
The compiler controlled cache with direct mapped, local memory and buffer subsets provides the framework that enables the compiler's knowledge of program flow and data usage to be utilized in planning the use of the cache memory. Cache like behavior is provided where it is useful, e.g. in table lookups, and more generally, in avoiding redundant block loads, i.e. when the block is already in the cache subset.
By using direct mapping in subsets, the check for a redundant block load is nearly as simple as the check in a direct mapped cache.
Much of the time, the cache memory would be used like a main memory; data would be brought in from DRAM much like files or records are brought in from disc and cache memory addresses would be used to access it.
Regarding program size, consider that doing "load multiple words" into general registers can avoid individual RISC "load"s for each operand, thereby decreasing program size. (ARM uses this technique to improve code density.) In a similar way, a load string into a cache subset, with the cache address of the first byte placed in a register for the subset, allows an operand to be specified by the active subset number (perhaps 3 bits), which accesses the cache address of the first byte, plus the offset from the first byte (perhaps 7 bits), for a total of 10 bits. It should be easy enough to fit one cache access into a 32 bit ALU instruction.
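The 10 bit operand arithmetic can be sketched in Python (the 3/7 bit split is the one suggested above; the register-file model is an assumption):

```python
SUBSET_BITS, OFFSET_BITS = 3, 7   # 3 bit subset number + 7 bit offset = 10 bits

def encode_operand(subset, offset):
    """Pack an active subset number and a byte offset into 10 bits."""
    assert 0 <= subset < (1 << SUBSET_BITS)
    assert 0 <= offset < (1 << OFFSET_BITS)
    return (subset << OFFSET_BITS) | offset

def cache_address(first_byte_regs, operand):
    """Cache address = first byte of the subset's last load string + offset."""
    subset = operand >> OFFSET_BITS
    offset = operand & ((1 << OFFSET_BITS) - 1)
    return first_byte_regs[subset] + offset
```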
A level two cache with subsets can support a number of processors on the same chip without thrashing degradation. Code and data to be shared among processors can reside in subsets which are shared. Code and data not to be shared can reside in private subsets.
Overall, I estimate that in exchange for the added compiler complexity, one can reduce cache size and improve cache performance, netting a significant reduction in processor cost for comparable performance.
To achieve maximum performance/cost, one needs a well balanced memory hierarchy with the number of levels and the size at each level tuned to the available technology. Also, as you go down in the memory hierarchy, the optimum transfer size, on average, gets larger. If one strays very far from the optimums, one pays in having less performance/cost.
For PIM (Processor In Memory) designs, the onchip DRAM could be another cache level in the memory hierarchy.
Compared to a conventional cache, a cache with subsets is intuitively estimated to have ~40% of the miss delay and need ~40-50% as much cache memory. The basis for the estimates follows. Miss delay versus cache size is a tradeoff.
With current caches, the different sets of data within a program are all mingled in the same cache, into one large set, where the different sets of data thrash each other occasionally, and the only real defense against the thrashing is to make the cache so large that the thrashing is infrequent. Going to higher levels of associativity, e.g. from a direct mapped cache to a four-way set associative cache helps some.
From a different viewpoint, a conventional cache's hardware does not know the likelihood of a block being accessed soon. A blind strategy is used to choose the block to replace when a miss has occurred, e.g. choose the "least recently used" block. This is often a bad choice, perhaps causing a thrashing miss ~45% of the time. (A thrashing miss occurs when the block which was replaced must itself be reloaded.) One thrashing miss can lead to another thrashing miss. There needs to be a sizable number of blocks in the cache which won't cause a thrashing miss, perhaps on the order of 50%, corresponding to half of the cache blocks being unused on average. (The "unused" cases are that the block won't be accessed for a long time or the block won't be accessed before being replaced.)
The above assumes that the demand for blocks is spread around evenly. In fact, the mapped addresses of active blocks will sometimes cluster, e.g. three active blocks mapping to the same cache set will produce thrashing in a two way cache. If you double the number of sets by doubling the cache size, you lessen the frequency of clustering. I estimate that clustering results in even more of the cache being unused, perhaps 50-60% on average.
I estimate that ~50% of all misses are thrashing misses, and that most of these can be avoided. The estimate of ~40% as much miss delay comes from eliminating most thrashing misses, and the grouping of DRAM accesses. The latter both reduces the average block access time and sometimes masks some of the access delay.
For a cache with subsets retrofitted to an existing ISA, there would not be the further reduction in miss delay due to grouping of DRAM accesses.