A Perspective on DRAM and Packaging

Last revised 5-5-04.

The conclusion is that a package-to-package messaging interface makes sense for packaging DRAM, processors and interface chips.

Single-chip multiprocessors with on-chip DRAM would provide the main compute power. An on-chip DRAM of 128 Mbit (16 MB), if operated as a huge cache, would significantly reduce the off-chip bandwidth needed per processor. However, with many processors on a chip, the overall bandwidth need would increase.

For at least a few years, it seems that satisfying cache misses will account for the majority of DRAM reads. Accordingly, a DRAM interface should handle this usage with reasonable efficiency. The block addresses might be 6 bytes long, with the block transfers being 64 to 128 bytes long.

However, I believe that it would be more effective to operate the 16 MB as a main memory, with much of the access to off-chip DRAM being like file I/O, including streams of data. (File I/O naturally groups data for transfer, as opposed to fetching the data one cache miss at a time.)

DRAM chips used mainly for memory would include a simple processor which could perform database searches of its DRAM, access data structures within the DRAM, e.g. linked lists, and do strided accesses. A result of putting a processor on each DRAM chip is that some data will be processed on chip, e.g. database searches, and only the results transmitted. As a consequence, the average data transfer size will increase significantly. On a 1 Gbit DRAM, the simple processor might add 0.1% to the chip area, and use the logic capabilities of a standard DRAM process.

The chips could communicate with point-to-point connections, e.g. with a 2-byte-wide data path. Data flow would be in/out through pads on the upper side of a chip package edge, to the chip itself, then out/in to the lower side of the chip package edge. This avoids crossing the chip. It's easy to come up with a variety of stackable arrangements. (The main compute chip could have up to eight 2-byte-wide data paths, 4 on the top edges and 4 on the bottom edges.)

Alternatively, if transmission lines are used to cross a chip, the delay in crossing the chip would be about 0.1 ns. This would be tolerable. For moderate-size computers, this would allow a comparatively flat layout, e.g. as one would like in a PDA. (Note, I'm not an engineer. I assumed signals travel at about 1/3 of the speed of light in getting to the other side of the chip. Also, it costs more to make a workable transmission line.) The added delay for a string of 11 DRAMs could be (11-1)*0.1, or 1 ns. Double this for the out-and-back delay to get 2 ns. If 11 DRAMs are put on each of 8 channels, that gives 88 DRAM packages. Further, if each DRAM package has two 1 Gbit DRAM chips, then each package holds 0.25 GB. The overall memory size could then be 22 GB.
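The arithmetic in the paragraph above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch; the constant and function names are mine, and the numbers are the ones assumed in the text.

```python
# Figures taken from the paragraph above; the names are illustrative only.
CHIP_CROSSING_NS = 0.1    # assumed delay to cross one chip
DRAMS_PER_CHANNEL = 11
CHANNELS = 8
CHIPS_PER_PACKAGE = 2
GBIT_PER_CHIP = 1


def string_delay_ns(drams_in_string, crossing_ns=CHIP_CROSSING_NS):
    """One-way added delay: reaching the last DRAM means crossing all the others."""
    return (drams_in_string - 1) * crossing_ns


def total_memory_gb(channels, drams_per_channel, chips_per_package, gbit_per_chip):
    """Total capacity in gigabytes, at one package per DRAM position."""
    packages = channels * drams_per_channel
    gb_per_package = chips_per_package * gbit_per_chip / 8  # 8 bits per byte
    return packages * gb_per_package


one_way_ns = string_delay_ns(DRAMS_PER_CHANNEL)
round_trip_ns = 2 * one_way_ns
total_gb = total_memory_gb(CHANNELS, DRAMS_PER_CHANNEL, CHIPS_PER_PACKAGE, GBIT_PER_CHIP)

print(f"one-way delay:  {one_way_ns:.1f} ns")
print(f"round trip:     {round_trip_ns:.1f} ns")
print(f"total memory:   {total_gb:.0f} GB")
```

Changing DRAMS_PER_CHANNEL shows the trade: each extra DRAM in a string adds 0.2 ns of round-trip delay and 2 GB of total capacity across the 8 channels.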

In many applications, many of the DRAM packages would be in the leaf position, or last in a stack or string, and then would only use one side of the interface.

I'd estimate the average data transfer size to be greater than 1 KB. For transfers of this size, it makes sense to use a single set of data lines for both addresses and data. The addresses may only occupy the data lines 1-3% of the time. An exception to this is following linked lists, where each step needs a new address and returns little data. The techniques developed for storing databases on hard drives may be useful here.
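To see why larger transfers make shared address/data lines attractive, here is a rough sketch of the share of line time spent on the address. It assumes only the bare 6-byte block address mentioned earlier, sent on the same lines as the data; the function name is mine.

```python
ADDRESS_BYTES = 6  # block address size suggested earlier in the text


def address_share(transfer_bytes, address_bytes=ADDRESS_BYTES):
    """Fraction of shared-line time spent carrying the address."""
    return address_bytes / (address_bytes + transfer_bytes)


# Cache-block sized transfers versus a 1 KB transfer.
for size in (64, 128, 1024):
    print(f"{size:5d}-byte transfer: {address_share(size):.1%} of line time is address")
```

For cache-block sized transfers the address takes several percent of the line time, while at 1 KB the bare address drops below 1%. The 1-3% estimate above presumably leaves room for command or length fields, which this sketch doesn't model.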

To keep latency minimal on DRAM reads, there could be two data lines dedicated to outbound data and the same number of data lines dedicated to inbound data. Then addresses for DRAM reads can often be sent ahead of the actual need, and only a short "read using already sent address" message needs to be sent to initiate the DRAM read.
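The "send the address ahead, then trigger the read" idea might look something like the toy model below. The class and method names are made up for illustration; a real interface would define its own message formats.

```python
class DramEndpoint:
    """Toy model of a DRAM chip that accepts a read address ahead of time."""

    def __init__(self, memory: bytes):
        self.memory = memory
        self.pending = None  # (address, length) sent ahead, if any

    def send_address(self, address: int, length: int) -> None:
        """Long message, sent early while the lines are otherwise idle."""
        if address < 0 or length < 0 or address + length > len(self.memory):
            raise ValueError("read falls outside the DRAM")
        self.pending = (address, length)

    def read_presented(self) -> bytes:
        """Short 'read using already sent address' message."""
        if self.pending is None:
            raise RuntimeError("no address has been sent ahead")
        address, length = self.pending
        self.pending = None
        return self.memory[address:address + length]


dram = DramEndpoint(bytes(range(256)))
dram.send_address(16, 4)        # sent before the data is actually needed
data = dram.read_presented()    # only this short trigger is on the critical path
print(list(data))
```

The point of the split is that the multi-byte address is off the critical path; only the short trigger message adds to the read latency.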

In addition to the above data lines (which are unidirectional after power-up), a bidirectional set of lines could be used to provide higher overall bandwidth for longer messages.

The chip interface could be symmetrical about a center ground (or power) line. Then upon power up, the minimal interface processor on each chip would detect how it was oriented to the next chip, i.e. normal or rotated 180 degrees.
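One way to picture the orientation check is sketched below. Everything here is an assumption for illustration: pads are numbered 0 to n-1 with the center pad as ground, and each chip drives a probe signal on one known pad at power-up so its neighbor can see where the probe lands.

```python
def mate_pad(pad: int, pad_count: int, rotated: bool) -> int:
    """Pad on the neighboring chip that a given local pad touches."""
    return pad_count - 1 - pad if rotated else pad


def detect_rotation(probe_seen_on: int, pad_count: int, probe_pad: int = 0) -> bool:
    """Return True if the neighbor is rotated 180 degrees.

    The neighbor drives a probe on its pad `probe_pad`; we report which
    orientation explains where that probe showed up locally.
    """
    if probe_pad == pad_count - 1 - probe_pad:
        raise ValueError("the center pad cannot reveal orientation")
    if probe_seen_on == probe_pad:
        return False
    if probe_seen_on == pad_count - 1 - probe_pad:
        return True
    raise ValueError("probe seen on an unexpected pad")


PAD_COUNT = 9            # hypothetical: 4 signal pads, ground, 4 signal pads
CENTER = PAD_COUNT // 2

# The center ground pad mates with itself in either orientation.
print(mate_pad(CENTER, PAD_COUNT, rotated=False), mate_pad(CENTER, PAD_COUNT, rotated=True))
print(detect_rotation(probe_seen_on=8, pad_count=PAD_COUNT))
```

Because the center pad maps to itself either way, ground (or power) is always connected correctly, and the interface processor only has to relabel the remaining pads once it knows the orientation.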

A sample interface could be:

Note the symmetry in the above.

There could be a variety of chip to chip interface standards, e.g. 4, 8, 16 & 32 data bits wide. Any data width interface could communicate with any other data width by using the smaller data width. The interface processor would detect the data width and the maximum safe data transfer speed, then data would be transferred at that speed.
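The width and speed matching could be as simple as taking the smaller of each, as in this sketch. The LinkCaps type and the use of a rated speed in MHz are my assumptions; in practice the safe speed would more likely be found by testing the link than by comparing ratings.

```python
from dataclasses import dataclass

STANDARD_WIDTHS = (4, 8, 16, 32)  # data widths named in the text


@dataclass(frozen=True)
class LinkCaps:
    width_bits: int
    max_mhz: int  # hypothetical rated speed


def negotiate(a: LinkCaps, b: LinkCaps) -> LinkCaps:
    """Settle on the narrower width and the slower speed of the two sides."""
    for caps in (a, b):
        if caps.width_bits not in STANDARD_WIDTHS:
            raise ValueError(f"unsupported width: {caps.width_bits}")
    return LinkCaps(min(a.width_bits, b.width_bits), min(a.max_mhz, b.max_mhz))


compute_chip = LinkCaps(width_bits=32, max_mhz=800)
small_io_chip = LinkCaps(width_bits=8, max_mhz=200)
link = negotiate(compute_chip, small_io_chip)
print(link)
```

The wider chip simply leaves its extra data lines idle on that link, which is what lets any width talk to any other.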

While the usual connection would be chip to chip, it may be useful to allow up to four chips to be butted together, forming a small bus. This could operate at a lower frequency.

Another variation is to allow one or two chips at each end of a cable. Again, this could operate at a lower frequency.

A mechanical assembly would be used to hold the chips together in the mated position, obviating the need for a PCB for interconnecting components. However, a PCB may be needed for mounting external connectors, e.g. a USB connector.

A standardized interface would allow many embedded processor applications to be assembled Tinkertoy-like from an assortment of chips. For many low to mid volume applications, this would likely be significantly cheaper than designing an ASIC.

The discussion so far assumes one chip per package. Chips could also be stacked within a package, e.g. as in cell phones. This would be mainly useful in processor and DRAM packages.
