Is RAM becoming a misnomer?
Computer hardware architectures and the performance of various components are co-evolving in a complex way, but one of the results is that “RAM” is becoming less well suited to random access – or at least relatively better optimized for sequential access. In the last decade or so, from SDRAM to DDR3, memory transfer rates have improved from 800MB/second for PC100 memory to 17066MB/second for DDR3-2133, a little better than a 21x improvement. Access time, however, has barely improved 2x, from around 60ns to around 30ns. So the relative difference between sequential and random memory access performance has increased by more than 10x. It’s not quite the random/sequential speed difference of tape, but that’s a very significant change.
What’s going on? I’ll go through memory speeds in the “old days” (which I suspect everyone defines by when they learned to program; for me that’s the early 80s) and how things have evolved more recently. Warning: unless you are really a serious geek, having heard the main point you should probably stop reading here.
I started programming on an Apple II+, first in BASIC and then in assembly language (after that I moved to a VAX, programming mostly in C). I still remember that on a 6502, opcode A9 would load the accumulator with an immediate value (one byte). It’s all a little hazy now, but the length of time to perform an operation was determined in large part by memory accesses. Most immediate operations (operations where the operand was specified as a constant, for example LDA #$AA, which would load the accumulator with the value $AA) ran in two clock cycles. Zero-page addressing (the first 256 bytes of memory) would add an extra cycle, so LDA $AA, which loads the accumulator with whatever value is in memory location $00AA, would run in 3 clock cycles. Absolute addressing, for example LDA $AAAA, which loads the accumulator with whatever value is in memory location $AAAA, would run in 4 clock cycles. It didn’t actually cost anything extra to offset those absolute pointers by the X or Y register, as long as you didn’t cross a page boundary when you added the offset.
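For anyone who never wrote 6502 code, here is roughly what those modes looked like, with cycle counts as I remember them (a sketch from memory, not a datasheet; the $AA and $AAAA operands are just placeholder addresses):

    ; LDA through the addressing modes discussed above
    LDA #$AA      ; immediate:  2 cycles, operand is part of the instruction
    LDA $AA       ; zero page:  3 cycles, loads the value at $00AA
    LDA $AAAA     ; absolute:   4 cycles, loads the value at $AAAA
    LDA $AAAA,X   ; absolute,X: still 4 cycles unless $AAAA+X crosses a page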
Indirect zero-page addressing would take 5 or 6 cycles, and is worth explaining because it was the sensible way to do arrays and pointers: LDA ($AA),Y would find the two-byte address stored starting at $00AA, add Y to that, and load the value at the resulting address. In an asymmetry which I transposed far too often (creating many bugs in the process), but which makes sense when you only have 256 opcodes to play with, indirect addressing with the X register was completely different: the X register was added to the immediate value used for the zero-page address before dereferencing. This was useful for tables of pointers, but I found the Y-register style of indirection useful much more often in my programming.
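Side by side, the placement of the parenthesis is the only visual clue that these are two very different operations (again a sketch from memory, with placeholder addresses):

    LDA ($AA),Y   ; indirect indexed: 5 cycles (6 crossing a page);
                  ; fetch the pointer at $00AA/$00AB, add Y, load from there
    LDA ($AA,X)   ; indexed indirect: 6 cycles; add X to $AA, fetch the
                  ; pointer at that zero-page location, load through it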
Anyway, the overhead of accessing memory was anywhere from 1 to 4 CPU clock cycles, the equivalent of 0.5–2 immediate (no memory access) instructions. Memory access times were significant but didn’t absolutely dwarf processing time.
Fast forward to today’s hardware. From the much-publicized “numbers that every programmer should know”: L1 cache access is 0.5 ns, L2 cache access is 7 ns, and main memory access is 100 ns. Rather than the 4:1 range of the old days, we now have a 200:1 range. Even worse, in the old days you knew statically (within 1 cycle, depending on page boundaries) how much time a memory access would take; now you don’t know – it depends on whether you get a cache hit.
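You can watch the gap for yourself with a crude experiment. This little C program (my own throwaway, deliberately unscientific) pointer-chases through an array far bigger than any cache, once in sequential order and once around a random cycle. Compiled with optimization on (e.g. cc -O2), the random walk typically comes out an order of magnitude or two slower per access:

    /* Crude sequential vs. random access comparison: pointer-chase
       through an array of hundreds of megabytes so caches and prefetch
       can't hide DRAM. Each load depends on the previous one, so
       access time dominates the loop. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (64 * 1024 * 1024)   /* 64M entries */

    static volatile size_t sink;   /* defeats dead-code elimination */

    static double ns_per_access(const size_t *next) {
        clock_t t0 = clock();
        size_t i = 0;
        for (size_t n = 0; n < N; n++)
            i = next[i];           /* each load depends on the last */
        sink = i;
        return (double)(clock() - t0) / CLOCKS_PER_SEC / N * 1e9;
    }

    int main(void) {
        size_t *next = malloc(N * sizeof *next);
        if (!next) return 1;

        /* Sequential: each entry points at the next one. */
        for (size_t i = 0; i < N; i++) next[i] = (i + 1) % N;
        printf("sequential: %.1f ns/access\n", ns_per_access(next));

        /* Random: Sattolo's shuffle of the identity yields one big
           N-entry cycle that visits the array in random order. */
        for (size_t i = 0; i < N; i++) next[i] = i;
        for (size_t i = N - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i;   /* j in [0, i) */
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }
        printf("random:     %.1f ns/access\n", ns_per_access(next));
        free(next);
        return 0;
    }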
100 ns is a long time; let’s dig in and see what’s behind it. Accessing memory nowadays is a complicated affair, and honestly, until I sat down to write this I didn’t have a clear understanding of how that time was spent or how long it would take with different types of memory. Here’s what I found.
Using today’s memory, you don’t just ask for an address. Memory is organized in rows and columns, and before you can access a new row, you first have to precharge, then strobe the new row address, then strobe the column address that you want. If you’re looking at memory performance specs, the times for these three operations are listed as tRP, tRCD, and CL. You add them together to determine how many memory I/O bus cycles it takes to read from an address not on the same row as the last read. Careful though, because a memory I/O bus cycle nowadays is twice as long as you might think: DDR3-1600 memory actually has an I/O bus speed of 800 MHz (the “1600” counts transfers, two per cycle), so each cycle is 1.25 ns. DDR3-1600G rated memory takes 8 cycles each for tRP, tRCD, and CL, so altogether a memory access can happen in 30ns – not as bad as I feared. Well, 100 ns was an order-of-magnitude figure; it’s still quite a bit longer than 0.5 ns for an L1 cache hit, and I’m not including everything that has to occur in the whole cycle (e.g., you need to know you have a cache miss before you even start the process, and there are multiple layers of cache).
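To make that arithmetic concrete, here’s a throwaway helper (my own back-of-the-envelope model: just tRP + tRCD + CL at the I/O bus clock, ignoring the memory controller and everything else in the path, so it’s a floor rather than a measurement):

    /* Estimate DRAM random-access latency as tRP + tRCD + CL cycles
       of the I/O bus clock, converted to nanoseconds. */
    #include <stdio.h>

    static double access_ns(int trp, int trcd, int cl, double io_bus_mhz) {
        double cycle_ns = 1000.0 / io_bus_mhz;   /* one bus cycle in ns */
        return (trp + trcd + cl) * cycle_ns;
    }

    int main(void) {
        /* DDR3-1600G: 8-8-8 timings on an 800 MHz I/O bus. Prints 30.0. */
        printf("DDR3-1600G: %.1f ns\n", access_ns(8, 8, 8, 800.0));
        return 0;
    }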
If DDR3-1600 memory can return a result in 30ns, how much can we shave off with DDR3-2133 memory? The answer may surprise you – it might not be faster at all for random access. There is no JEDEC standard for DDR3-2133G; the fastest DDR3-2133 standard is DDR3-2133K. Rather than 8 cycles each for tRP, tRCD, and CL, it costs 11 cycles each, so a random memory access takes 33 cycles overall. 33 cycles at 1066MHz is about 31ns – no faster at all, and in fact slightly slower. Of course, the absence of a JEDEC standard doesn’t mean faster timings aren’t possible – a true weenie will buy his memory and test it with various BIOS settings. I’ve seen tests showing stable operation in some cases with 9/11/9 cycle timings at DDR3-2133, for a total of 29 cycles, or just over 27 ns.
That said, DDR3-2133 memory has a significantly higher transfer rate than DDR3-1600: 17066MB/second vs 12800MB/second. Again, you see throughput increasing but not latency. Going back to the PC100 memory that was common in PCs a decade ago, you see the same pattern: an 800MB/second transfer rate based on a 100MHz memory I/O bus and one transfer per cycle (rather than the current two), but tRP, tRCD, and CL of only 2 cycles each. A random memory access was 6 cycles at 100MHz, or 60 ns – only twice the access time of today’s high-end memory, while its transfer rate was 21 times slower.
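Dropping the rest of the line-up into the main() of the earlier sketch ties the whole progression together; the timings and bus speeds are the ones quoted above:

    /* PC100 is single data rate, so its I/O bus is the full 100 MHz;
       DDR3-2133's I/O bus runs at 1066.67 MHz. */
    printf("PC100 (2-2-2):         %.1f ns\n", access_ns(2, 2, 2, 100.0));      /* 60.0 */
    printf("DDR3-2133K (11-11-11): %.1f ns\n", access_ns(11, 11, 11, 1066.67)); /* 30.9 */
    printf("DDR3-2133 (9-11-9):    %.1f ns\n", access_ns(9, 11, 9, 1066.67));   /* 27.2 */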
Where this may be headed in the future will be the subject of another post.