The primary compute-intensive application that most consumers will care about is graphics (3D animation, MPEG decoding, and so on), and there is plenty of parallelism in graphics to enable effective use of multiple processors on a chip.
With coming graphics programs requiring thousands of MIPS, I'd estimate the utilization of multiple processors to be over 90%.
Comparing wide-issue processors to multiprocessors: each time you increase the issue rate by one, you must add logic to support the increased issue rate, and the overall average utilization of that logic falls off.

Similarly, each time you add a processor to a multiprocessor system, the overall average utilization falls off. However, it's easier to find exploitable parallelism across a whole program than within a single thread, so the fall-off should be less rapid in a multiprocessor system, given the coming graphics applications. For this reason, I think that processors issuing more than a few instructions per clock will be less than optimum for graphics applications.
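The shape of this argument can be sketched with a toy model. All numbers here are hypothetical, chosen only to illustrate the claim that marginal utilization falls off faster per added issue slot than per added processor; they are not measurements.

```python
# Toy diminishing-returns model (hypothetical parameters, not data).
# Assumption: the k-th resource added is busy only r**(k-1) of the time,
# with r lower for issue slots (scarce single-thread ILP) than for
# processors (plentiful cross-thread parallelism in graphics code).

def throughput(units, marginal_utilization):
    """Effective work per clock from `units` parallel resources, where
    the k-th resource is utilized marginal_utilization**(k-1) of the time."""
    return sum(marginal_utilization ** k for k in range(units))

# Hypothetical fall-off rates: issue slots saturate faster than processors.
wide_issue = throughput(8, 0.60)   # one 8-wide core
multiproc  = throughput(8, 0.90)   # eight narrow cores

print(f"8-wide issue : {wide_issue:.2f} effective units per clock")
print(f"8 processors : {multiproc:.2f} effective units per clock")
```

With these made-up rates the eight narrow processors deliver roughly twice the effective throughput of the single 8-wide core, which is the direction of the tradeoff being argued, not its actual magnitude.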
Supercomputers, as I recall, went to multiprocessors at less than 1000 MIPS in the early '80s. Apparently, it was the path of least cost to higher performance.

It seems that a certain threshold must be crossed before multiple processors are the better choice. Supercomputers crossed that threshold in the early '80s. I surmise that microprocessors will soon cross it because graphics programs have a lot of parallelism.
As CMOS processes shrink, wiring delay is becoming a larger fraction of the cycle time. This tends to reduce the performance of wide-issue processors.
At some point, multi-threading becomes attractive as a way to keep the processor busy. However, each thread needs cache space, and it seems to me that once you're up to 2-4 threads, it's better to add more processors than to run still more threads to keep one processor busy.
Multi-threading a processor can mask some of the DRAM access time. One could even execute another thread while waiting for a branch condition to become available. This could be a viable option when there are plenty of threads.
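The latency-masking point can be put as back-of-the-envelope arithmetic. The cycle counts below are hypothetical; the rule of thumb is just that if a thread does useful work for `run` cycles between stalls of `stall` cycles, about 1 + stall/run threads keep the pipeline busy.

```python
import math

def threads_to_hide(run_cycles, stall_cycles):
    """Threads needed so other threads' run periods cover each stall.
    Rule of thumb: 1 + stall/run, rounded up."""
    return math.ceil(1 + stall_cycles / run_cycles)

# Hypothetical numbers: 25 cycles of work between 50-cycle DRAM stalls.
print(threads_to_hide(25, 50))
```

On these made-up numbers, three threads suffice, which is consistent with the suggestion above that 2-4 threads is about where the payoff of adding more threads runs out.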
As the amount of logic that can be put on a chip increases, it's only a matter of time before multiple processors become more viable than a single really wide-issue processor. Then it becomes a matter of optimizing the issue width, against its cost in utilization, within the individual processors used in a multiprocessor system.
There is a tradeoff between low utilization of hardware in a high-ILP processor and low utilization of hardware in a multiprocessor due to a lack of parallelism. The balance will soon tip in favor of the multiple processors, if it hasn't already.
Looking down the road, the ILP needed for good multiprocessor performance is inversely proportional to the clock speed, so as clock rates increase, the required ILP can decrease. For example, by the time clock rates have increased by a factor of 5, the required ILP may have dropped by a factor of 2, down to where it is near optimum with respect to cost-performance.
Once you're up into > 1 GHz clock rates, an ILP < 2 would seem to be optimum from the standpoint of cost-performance.
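The arithmetic behind "clock up 5x, ILP down 2x" can be made explicit. The MIPS and MHz figures below are hypothetical, picked only so the ratios match the example: the per-processor performance target also rises (here by 2.5x), so the required ILP falls by only a factor of 2 rather than the full factor of 5.

```python
def ilp_needed(target_mips, clock_mhz):
    # Required instructions per clock to hit a MIPS target:
    # MIPS = MHz * IPC, so IPC = MIPS / MHz.
    return target_mips / clock_mhz

today = ilp_needed(1000, 250)    # hypothetical: 1000 MIPS at 250 MHz
later = ilp_needed(2500, 1250)   # clock up 5x, target up 2.5x
print(today, later)
```

Here the required ILP drops from 4 to 2, and a further rise in clock rate (or a fall in the per-processor target as more processors are added) would push it below 2, the regime argued above to be near optimum for cost-performance.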
The Stanford Hydra project puts 4 processors on a chip to work cooperatively on a single thread. This improves single-thread execution speed when that is the bottleneck; when it isn't, the multiple processors can execute code more efficiently than a single wide-issue processor could.