Is RAM becoming a misnomer?

Posted August 30, 2011
Filed under: Uncategorized |

Computer hardware architectures and the performance of various components are co-evolving in a complex way, but one of the results is that “RAM” is becoming less well suited to random access – or at least relatively better optimized for sequential access. In the last decade or so from SDRAM to DDR3, memory transfer rates have improved from 800MB/second for PC100 memory to 17066MB/second for DDR3-2166, or a little better than 21x improvement. Access time, however, has barely improved 2x, from around 60ns to around 30ns. So the relative difference between sequential and random memory access performance has increased more than 10x. Its not quite the random/sequential speed difference of tape, but that’s a very significant change.

What’s going on? I’ll go through memory speeds in the “old days” (I think for everyone based on when they learned to program; for me that’s the early 80s) and how things have evolved more recently. Warning: unless you are really a serious geek, having heard the main point you should probably stop reading here.

I started programming on an Apple II+, first in BASIC then in assembly language (after that I moved to a VAX, programming mostly in C). I still remember on a 6502 A9 would load the accumulator with an immediate value (one byte). Its all a little hazy now, but the length of time to perform an operation had to do in large part with memory accesses. Most immediate operations (operations where the operand was specified as a constant, for example LDA #$AA, which would load the accumulator with the value $AA) ran in two clock cycles. Zero-page addressing (the first 256 bytes of memory) would add an extra cycle, so LDA $AA, which loads the accumulator with whatever value was in memory location $00AA, would run in 3 clock cycles. Absolute addressing, for example LDA $AAAA which loads the accumulator with whatever value is in memory location $AAAA, would run in 4 clock cycles. It didn’t actually cost anything extra to offset those absolute pointers by the X or Y register as long as you didn’t cross a page boundary when you added the offset.

Indirect zero page addressing would take 5 or 6 cycles, and is worth explaining because it was the sensible way to do arrays and pointers: LDA ($AA), Y would find the two byte address stored starting at $00AA, add Y to that, and load the value at that address. In an asymmetry which I transposed far too often (and created many bugs in the process) but makes sense when you only have 256 instructions to play with, indirect addressing with the X register was completely different: the X register was added to the immediate value used for the zero page address before dereferencing. This was useful for tables of pointers but I found the Y register style of indirection useful much more often in my programming.

Anyway, the overhead of accessing memory was anywhere from 1 to 4 CPU clock cycles, which was the equivalent of .5-2 immediate (no memory access) instructions. Memory access times were significant but didn’t absolutely dwarf processing time.

Fast forward to today’s hardware. Much publicized “numbers that every programmer should know:” L1 cache access is .5 ns, L2 cache access is 7 ns, and main memory access is 100 ns. Rather than a 4:1 range in the old days, we now have a 200:1 range. And even worse, in the old days you knew statically (within 1 cycle depending on page boundaries anyway) how much time you would take for memory access; now you don’t know, it depends on whether you get a cache hit.

100 ns is a long time; lets dig in and see what’s behind it. Accessing memory nowadays is a complicated affair, and honestly until I sat down to write this I didn’t have a clear understanding of how that time was spent or how long it would take with different types of memory. Here’s what I found.

Using today’s memory, you don’t just ask for an address. Memory is organized in rows and columns, and before you can access an new row, you first have to precharge, then you strobe the new row address, then you strobe the column address that you want. If you’re looking at memory performance specs, the times for each of these operations are listed as tRP, tRCD, and CL. You have to add these together to determine how many memory I/O bus cycles it takes to read a new address not on the same row as the last read. Careful though, because a memory I/O bus cycle nowadays is twice as long as you think: DDR3-1600 memory actually has an I/O bus speed of 800 MHz, so each cycle is 1.25 ns. A DDR3-1600G rated memory takes 8 cycles each for tRP, tRCD, and CL, so altogether a memory access can happen in 30ns – not as bad as I feared. Well, 100 ns was an order of magnitude figure; its still quite a bit longer than .5 ns for an L1 cache hit and I’m not quite including everything that has to occur in the whole cycle (eg, you need to know you have a cache miss before you even start the process, and there are multiple layers of cache).

If DDR3-1600 memory can return a result in 30ns, how much can we shave the time with DDR3-2133 memory? The answer will surprise you – it might not be faster at all for random access. There is no JEDEC standard for DDR3-2133G; instead it starts at DDR3-2133K as the fastest DDR3-2133 standard. Rather than 8 cycles each for tRP, tRCD, and CL, it costs 11 cycles. Overall a random memory access is 33 cycles. 33 cycles at 1066MHz is about 31ns – no faster at all, and in fact slightly slower. Of course the absence of a JEDEC standard doesn’t mean it isn’t possible – a true weenie will buy his memory and test it with various BIOS settings. I’ve seen tests which show stable operation in some cases with 9/11/9 cycle timings at DDR3-2133, for a total of 29 cycles or just over 27 ns.

That said, DDR3-2133 memory has a significantly higher transfer rate than DDR3-1600: 17066MB/second vs 12800MB/second. Again, you see throughput increasing but not latency. Going back to the PC100 memory, that was common in PCs a decade ago, you see the same patter: an 800MB/second transfer rate based on a 100MHz memory I/O bus and one transfer per cycle (rather than the current two), but tRP, tRCD, and CL were all 2 cycles. A random memory access was 6 cycles of 100MHz, or 60 ns, only twice the access time of today’s high end memory as compared to transfer rates 21 times slower than today.

Where this may be headed in the future will be the subject of another post.

Thanks,

— Max

18 comments so far

Cliff Elam on August 31, 2011

Yes, you are a geek. Is now the time we compare paper tape to punch cards? But, I remember when we upgraded our test sets from a 2 to a 4 bit processor (details fuzzy, but I think we were going from 128bit to 1Kbit RAM testing) and we had to re-write everything how fast it all seemed. And how frustrating that all the little ‘tricks’ didn’t work anymore. And made it slower sometimes.

I think we need to start thinking about system speed and more complex optimizations for the future. How do you optimize your overall speed inside a cloud composed of VM handling different tasks across a set of geographic data centers?

-XC

Reply
Dwight Merriman (@dmerr) on August 31, 2011

really?

Reply
- Dwight Merriman (@dmerr) on August 31, 2011
  
  by ‘really’ i meant “is that really right?” not “duh”. 🙂 but i had a context in mind thus my surprise — larger reads. say a 4KB page. if access time is 30ns:
  
  1.70E+10 bytes to read
  4096 bytes per random read
  4.15E+06 accesses
  0.012 access time (seconds for the above reads)
  
  1.70E+10 bytes to read
  32 bytes per random read
  5.31E+08 accesses
  1.594 access time (sec)
  
  so in the latter case, if the 30ns is right, access time dominates. for a larger read such as at the top, it’s not bad.
  
  it’s very clear that cpu speed : ram speed ratio period is changing such that ram will be the bottleneck (or already is). the random penalty is interesting.
  
  i’ll throw out that the solution (at least as long as reading say 4KB randomly is not too expensive) is cache aware data structures.
  
  btw: a TLB miss is pretty expensive too, right???
  
  Reply
  - Dwight Merriman (@dmerr) on August 31, 2011
    
    perhaps ‘cache aware’ isn’t quite the right term — cache aware + ram behavior aware?
  - Benjamin Darfler on August 31, 2011
    
    Cache aware structures can require a lot of tuning for different hardware. Cache oblivious structures such as streaming b-trees (i.e. cache oblivious lookahead arrays and shuttle trees) are almost as efficient without the necessary tuning.
  - Max Schireson on August 31, 2011
    
    Sorry for misinterpreting your initial question.
    
    You have factored just the access time, which will typically get you 64 bits.
    
    My understanding is that half the address bits on the chip address the row and half address the column. So a 1 gigabit x 1 chip is 32k columns by 32k rows. Within a row, the chip can stream data twice per cycle (leading and trailing clock edges) which is why the “nominal” clock speeds are twice the actual clock speeds in DDR (double data rate) memory.
    
    So DDR-1600 memory 64 bits wide streams12800 megabytes bytes of memory per second, and would take about 1.3 seconds to sequentially stream your 1.70E.+10 bytes. The overhead for switching rows is pretty minimal. But if you only grabbed 64 bits with each access, you’d get 240 Megabytes per second and it would take a little over a minute to stream the same amount of data.
    
    Yes, if you look at a system with say 16 cores running at 2 GHz, even with multiple banks of memory you really need to have high cache hit rates to keep the CPUs busy. Yes, cache aware data structures make sense. I think the system will grab more than the 64 bits you ask for into cache. You have to balance cache usage with memory access; I think Intel chips have mostly migrated from 64 byte cache lines to 128 byte cache lines as the caches have gotten bigger. I think they also do some predictive cache loading where they may grab more than one line of cache, but I don’t understand the details there.
    
    — Max
  - Dwight Merriman (@dmerr) on August 31, 2011
    
    @max i had a typo above. the times are off by a factor of 10. reading 4KB at a time (.125sec overhead for 17GB) would work well. 8 bytes, no. seems your thesis holds – implying e.g. tree structures should be n-ary not binary going forward.
Harold on August 31, 2011

Well, ignoring the new paradigm of “Cache is the new RAM, RAM is the new disk, disk is the new tape, and tape is still pretty awesome” (the latter is my contribution, e.g. I’ve heard there’s no single spinning magnetic media hard disk on earth that will keep up with the latest LTO-5 generation of tapes):

Is it the case that RAM is a misnomer, or have we instead simply created optimizations for sequential reading of memory when we couldn’t do much of anything to improve random access? (I don’t know.) In a world where serious processors have 3-24MB of shared Level 3 cache, how much is this a problem? Or to look at one of my interests (concurrent parallel GC), isn’t it nice that I can read a LOT of RAM sequentially, with the whole memory hierarchy supporting that?

On the other hand, we haven’t touched on cNUMA systems….

Reply
- Max Schireson on August 31, 2011
  
  Certainly its more precise to say that current memory is much better optimized for sequential reading. Yes, the shared L3 cache mitigates the impact and enables a different set of economic tradeoffs in memory design. SRAM would be really fast but you’d need 6 transistors per bit instead of 1 so it would be more expensive. Really its a case where the whole system evolved together to reach an equilibrium where cost/performance tradeoffs are optimized across a broad set of usage. I’m not suggesting there’s anything wrong with this, only that it is very different and that RAM is now somewhat ironically named.
  
  Sequential transfer rates for memory nowadays are amazingly high, very cool for lots of applications.
  
  — Max
  
  Reply
  - Harold on August 31, 2011
    
    One nit: while SRAM indeed takes 6 transistors as I recall, a DRAM cell is a capacitor (that’s why it needs the refresh cycle and to be rewritten after reading). I know they’ve gone to a vertical orientation of DRAM cells; I wouldn’t be surprised if those were now smaller than a transistor (although they’re of course working on making those more vertical as well).
    
    Bringing this back to our discussion, the IBM processor with 24 MB of L3 cache uses a process they developed to make it out of DRAM. There’s interesting potential there, and, hey, Intel might one day get back into the DRAM business this way ^_^.
  - Max Schireson on August 31, 2011
    
    I don’t know about the relative size of a capacitor and a transistor today; it may be hard to get an apples-apples comparison across different fab technologies, power levels, etc and I’m definitely not up to speed on the latest manufacturing techniques.
    
    It would not surprise me if there’s a lot of performance envelope available for DRAM. When its halfway across a motherboard, you’re not going to get a round trip of half a nanosecond. Intel back in the market would be interesting.
Benjamin Darfler on August 31, 2011

I’m dying to know if/how MongoDB will take advantage of sequential I/O. It seems to me that HBase/Cassandra’s Log Structured Merge Tree / Cache Oblivious Lookahead Array have a lot to offer to MongoDB

Reply
- Max Schireson on August 31, 2011
  
  I haven’t looked at those structures in detail, but I have worked on other databases with log structured merge systems. In general they do a good job of reducing random I/O and converting it to sequential. This is a great tradeoff for writes, but having multiple index segments is costly for reads when you support secondary indexes. Any offhand advice on that issue?
  
  Reply
  - Benjamin Darfler on August 31, 2011
    
    In general BigTable reduces random I/O by using bloom filters to prevent reading unnecessary SSTables and indexes within the SSTables to prevent reading unnecessary blocks of the SSTable. Of course this is only for point queries. I’m less clear on how they optimize range queries, they might have to check every SSTable index though they could still exploit the index to ignore unnecessary blocks. Luckily the actual data is all sorted so it ends up being primarily sequential I/O.
    
    My understanding of how Cassandra supports secondary indexes is that they are just mappings from the secondary index value to a primary index value. In this case the index is basically a hash index and point queries would look just like point queries on the primary index.
    
    Range queries on a secondary index would be more difficult. You could use a secondary set of SSTables to store the mapping from secondary key to primary key. You would have to do a range query on the secondary index SSTables to get the necessary primary keys. At that point it would probably make sense to sort the primary keys and use some heuristics to decide what ranges vs points to pull in from the primary key SSTables.
    
    Its worth pointing out that cache oblivious lookahead arrays use lookahead pointers to reduce I/O though I haven’t fully grocked the design. Additionally, they are cache oblivious which more directly addresses the RAM discussion. I’m not sure how well SSTables or LSM Trees stack up in the cache oblivious memory model. Finally, I think they can benefit from bloom filters as well.
    
    One last point, reads are really easy to cache, so simple caching mechanisms can go a long way.
    
    Two really good links I’ve found on the subject:
    
    Click to access socc11chisl.pdf
    
    http://mysqlha.blogspot.com/2011/08/read-amplification-factor.html
  - Benjamin Darfler on September 2, 2011
    
    The issue here is really random I/Os (in RAM and on disk); we want to get rid of them as much as possible. In your basic b-tree, with pointers between the nodes, there is a random I/O for each level of the tree and then a random I/O to get the actual data. In a log structured index where the data is stored with the key each segment is random I/O and the reading the data is free. While it might seem that querying multiple segements is inefficent its actually on par with walking a b-tree. However, the winner goes to the sequential structures since bloom filters can be used to prevent reading most if not all of the irrelevant index segments.
    
    Secondary indexes can be created using the same sequential structures to map between the secondary key and the primary key. They add another level of indirection but bloom filters help significantly.
    
    This link might be of interest: http://mysqlha.blogspot.com/2011/08/read-amplification-factor.html
  - Benjamin Darfler on September 15, 2011
    
    Looks like Basho is using LevelDB (Google’s open source embedded log structured db) for Riak’s secondary index storage.
    
    http://blog.basho.com/2011/09/14/Secondary-Indexes-in-Riak/
Albert Wenger on November 11, 2011

Very late to this post but love it. Would be great to see a chart that shows over the last 20+ years the relative increase in CPU speed compared to RAM and disk. Should break out main memory and cache of course and also for disk should break out local disk versus something like Amazon EBS (obviously the line for that doesn’t go quite as far back). I will look around to see if someone has put this together already.

PS You should use disqus from comments 🙂

Reply
- Max Schireson on November 11, 2011
  
  Yeah, been meaning to move to disqus for comments but hasn’t made it to the top of my list yet 🙂
  
  Reply