Last revised 7-8-04.
This is some conjecture on how semiconductor process trends and advances in computer architecture will influence the design of high-performance computers. In particular, much of the control-hardware complexity can be moved into software.
Current out-of-order-execution superscalar designs consume considerable power and require substantial design and verification effort.
In a sense, an out-of-order-execution (OOE) processor scans ahead in a program, using predicted branches to direct the scan, looking for loads to execute. Non-indexed loads (those whose addresses do not depend on pending computation) can be executed immediately, with a renamed register providing the destination for each load. These loads can occur across a few levels of calls. Along the way, the code predicted to be executed is also fetched.
To avoid the above scan, the non-indexed loads could be abstracted into separate data structures, e.g. grouped at the beginning of a function. Loads would then be executed indirectly by an instruction that indexes into such a structure; the instruction could specify a range of loads to execute. Many indexed loads within loops can be converted into streams.
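As a rough software analogue of that "execute loads in a range" instruction, the sketch below gathers a function's non-indexed loads into a table at the top of the function and issues them as a batch of prefetch hints. The names (prefetch_range, sum_children, struct node) are illustrative, and GCC/Clang's __builtin_prefetch stands in for the proposed hardware instruction; it is only a hint to the memory system.

```c
#include <stddef.h>

/* Issue prefetch hints for entries [first..last] of an address table,
 * standing in for an instruction that specifies a range of loads. */
static void prefetch_range(void *const table[], int first, int last)
{
    for (int i = first; i <= last; i++)
        __builtin_prefetch(table[i], 0 /* read */, 3 /* keep in cache */);
}

struct node { int weight; struct node *left, *right; };

int sum_children(struct node *n)
{
    /* The function's pointer loads, grouped up front... */
    void *const loads[] = { n->left, n->right };
    prefetch_range(loads, 0, 1);

    /* ...then the real work; the demand loads below should hit cache. */
    int s = 0;
    if (n->left)  s += n->left->weight;
    if (n->right) s += n->right->weight;
    return s;
}
```

The batched hints overlap the two child-node misses with each other instead of serializing them behind the branches that consume them.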
A function could be executed as a separate micro thread, with the prefetches moved to the beginning of the micro thread. When the micro thread is activated, the prefetches would be executed, and the micro thread would then stall until conditions are met for it to proceed. (Note that activating a micro thread would also serve to prefetch the start of the function's code.)
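A minimal sketch of that activate-prefetch-stall pattern, using a POSIX thread and a condition variable as stand-ins for hardware micro threads. All names here (micro_arg, data_ready, the doubling computation) are hypothetical illustrations, not from the original text.

```c
#include <pthread.h>

/* State shared between the activator and the micro thread. */
struct micro_arg {
    int *input;               /* data the micro thread will consume      */
    int  result;
    int  ready;               /* set by the producer when input is valid */
    pthread_mutex_t lock;
    pthread_cond_t  cond;
};

static void *micro_thread(void *p)
{
    struct micro_arg *a = p;

    /* Step 1: on activation, issue the prefetches immediately. */
    __builtin_prefetch(a->input, 0, 3);

    /* Step 2: stall until conditions are met to proceed. */
    pthread_mutex_lock(&a->lock);
    while (!a->ready)
        pthread_cond_wait(&a->cond, &a->lock);
    pthread_mutex_unlock(&a->lock);

    /* Step 3: the demand load should now be cheap. */
    a->result = *a->input * 2;
    return NULL;
}
```

The point of the structure is that the prefetch in step 1 runs concurrently with whatever work the activating thread still has to do before it sets `ready`.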
There is the hazard of prefetching a data item before a new value has been stored into it; in that case, the old value is prefetched. The JIT compiler should know when this is possible and could simply fetch the data item again, ensuring that the latest value is used. Even then, the prefetch is not wasted: if the data item was in a lower level of cache or memory, the prefetch will have moved it to a higher-level cache.
The reference pointers in Java avoid the possibility of a C-like pointer pointing anywhere in memory. With Java, it should be much easier to detect aliasing and to generate safe code. This aspect of Java fits well with a cache organized into subsets: the processor only has to look in one subset for a data item.
In some cases, prefetches could be grouped, e.g. two adjacent data items could be fetched as one.
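A small sketch of grouped prefetching: when two data items are adjacent, they typically fall in the same cache line (64 bytes is a common size, assumed here), so a single hint covers both. The struct and function names are illustrative.

```c
/* Two adjacent fields occupy 16 bytes, well inside one 64-byte cache
 * line, so one prefetch of the struct's address fetches both "loads". */
struct pair { long first, second; };

long sum_pair(const struct pair *p)
{
    __builtin_prefetch(p, 0, 3);   /* one hint covers both fields */
    return p->first + p->second;
}
```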
Streams will effectively provide much of the prefetching.
Consider that as the processor-memory gap widens by, say, a factor of four, one could reasonably reduce the instructions per clock by a factor of four. This would allow a small, compact processor; compared to a complex OOE processor, it would allow a faster clock, less power spent charging and discharging wires, and a much simpler design. Further, many such processors could be put on a chip, each essentially a copy of the processor next to it.
Perhaps 90% of the chip would be memory (including cache memory). Data compression could be used to make more effective use of that memory.
The arithmetic units could be tuned to the intended applications area.
The prefetch strategy can be very conservative when the other processors on the chip are mostly busy. Conversely, when only a few processors are busy, the prefetch strategy can be much more aggressive.
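That adaptive policy could be as simple as the sketch below: when most cores are busy, memory bandwidth is contended, so prefetch conservatively; when few are busy, prefetch far ahead. The thresholds and distances are illustrative assumptions, not values from the original text.

```c
/* Choose how many iterations ahead to prefetch, given chip load. */
int prefetch_distance(int busy_cores, int total_cores)
{
    if (busy_cores * 4 >= total_cores * 3)   /* >= 75% of cores busy */
        return 1;                            /* very conservative    */
    if (busy_cores * 4 >= total_cores)       /* >= 25% of cores busy */
        return 4;                            /* moderate             */
    return 16;                               /* aggressive           */
}
```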
In some cases, it would be helpful to migrate data or code between memory levels, including cache levels. This is something that would need to be planned by software. The mindset would be something like file input/output: moving data or code about as files.
This approach is very different from an out-of-order processor. Processors would be much smaller, and caches with subsets should allow supporting more processors on the chip. Since much of the power consumed is in a processor's logic, the chip should use less power than an out-of-order design.
My sense is that even a moderately complex design, including the software aspects, would be more cost effective than current out of order execution designs.