Last revised 2-23-98.
All data transfers in a memory hierarchy can be viewed as string transfers, from data files on disc, e.g. load modules and data files, on up to the top level cache. The size of the string transfers are large at the low end, e.g. to/from disc, on average, and get smaller as move up the memory hierarchy, on average. Sequentially accessed strings can be streamed, e.g. to/from the processor.
A load string instruction moves data up the memory hierarchy. Checks can be made to prevent redundant loads, as done in a cache.
A store string instruction moves data down the memory hierarchy.
On the disc, strings are called files.
In the DRAM, strings would be segments. For example, a load module file read in could contain a code segment and a data segment.
In the caches, strings would be subsegments.
To/from the arithmetic unit, the strings would have the usual operand lengths of 1, 2, 4 & 8 bytes.
Often, a single load string or store string instruction will be able to replace more than one RISC load or store instruction.
In the above example, the technological opportunity that the load string & store string instructions exploit is that in current cache designs, a cache access reads a block, but only one operand is selected out for transfer to the alu. However, there will often be other needed operands in the same block. A design that made the other operands available to the alu would save some cache accesses.
The above load string & store string instructions are much like "load/store multiple word" instructions except that the string approach allows operands of different lengths. An implementation requirement is that operands can occur anywhere within a block on byte boundaries, and need to be extracted accordingly.
Besides some general registers, the CPU could operate with up to 8 byte strings accessible to it at any given time.
A predicated (conditional) store string instruction would often allow the results of several instructions to be saved conditionally. This should be a fairly effective substitute for fully predicated execution and requires fewer code bits.
For "store string", only the updated bytes get written to memory. This avoids some otherwise nasty problems, e.g. two different processors updating non-overlapping strings that are in the same block.
In much the same way that a compiler can collapse two or more load instructions into a single load multiple registers instruction, a compiler could collapse two or more load instructions into a load string instructions, and then compile accesses to the operands within the string as accesses to the string.
For data compression, string loads/stores to/from the processor may be somewhat short for good compression. Accordingly, they could be grouped into larger strings for compression. The compression/decompression boundary could be at the L1-L2 cache interface. The larger strings would in turn be grouped for storage in DRAM and on disk.
To assess the effectiveness of load string & store string instructions, it'd help to have some statistics. If someone out there has convenient access to memory trace acquisition software and is interested in the problem, please e-mail me if you have questions. The algorithm to gather statistics could be:
Assume 32 byte blocks and 32 byte max string length.
To process each load/store in a trace:
Convert the memory reference to a string instruction, e.g. a load 4 byte word becomes a load string of length 4 bytes.
For loads, if within the last n (say 12) memory references, you find a load which can be merged with the current load to form a load string of length <= 32 bytes, merge them. Note that the current load can be in front, behind, or in the middle of a prior load.
For stores, treat similarly to the loads above.
Keep counts of loads and stores, merged loads and stores, and also load string and store string instructions which cross a block boundary.
The above should provide a ballpark estimate of the usefulness of load string & store string instructions. I'd expect cache friendly programs to score well and cache unfriendly programs to score less well. (The load string & store string instructions need both spatial locality of access and temporal locality of access.)
Back to Computer Architecture Page