Moore’s Law marches on: transistor counts continue to increase at the predicted rate and will do so for the foreseeable future. What has changed is where those transistors are going: instead of a single core, they are appearing in multi-core designs, which place a much higher premium on hardware and software parallelism. This is hardly news, I know. However, before we get back to arguing about the “correct” parallelism and concurrency abstractions (threads, events, actors, channels, and so on) for our software and runtimes, it helps to step back and take a closer look at the actual hardware and where it is heading.
Single Core Architecture & Optimizations
The conceptual architecture of a single core system is deceptively simple: a single CPU connected to a block of memory and a collection of I/O devices. Turns out, simple is not practical. Even with modern architectures, the latency of a main memory reference (~100 ns roundtrip) is prohibitively high, which, combined with highly unpredictable control flow, has led CPU manufacturers to introduce multi-level caches directly onto the chip: a Level 1 (L1) cache reference costs ~0.5 ns, a Level 2 (L2) cache reference ~7 ns, and so on.
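To make those numbers tangible, here is a minimal C++ pointer-chasing sketch; the buffer sizes, hop count, and random seed are arbitrary choices of mine, not anything canonical. Each load depends on the previous one, so a buffer that fits in L1 costs a nanosecond or two per hop, while a buffer far larger than any on-chip cache pushes each hop out to main memory (TLB misses will inflate the large-buffer numbers somewhat):

```cpp
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Chase a chain of indices through a buffer of n elements and return the
// average time per dependent load, in nanoseconds.
static double chase(std::size_t n) {
    // Build a single-cycle random permutation (Sattolo's algorithm) so the
    // chain visits every slot before repeating, defeating the prefetcher.
    std::vector<std::size_t> next(n);
    std::iota(next.begin(), next.end(), 0);
    std::mt19937_64 rng{42};
    for (std::size_t i = n - 1; i > 0; --i) {
        std::uniform_int_distribution<std::size_t> pick(0, i - 1);
        std::swap(next[i], next[pick(rng)]);
    }

    const std::size_t hops = 10'000'000;
    std::size_t i = 0;
    auto start = std::chrono::steady_clock::now();
    for (std::size_t h = 0; h < hops; ++h)
        i = next[i];  // each load depends on the result of the previous one
    auto stop = std::chrono::steady_clock::now();

    volatile std::size_t sink = i;  // keep the loop from being optimized away
    (void)sink;
    return std::chrono::duration<double, std::nano>(stop - start).count() / hops;
}

int main() {
    // 4K 8-byte elements = 32 KB, small enough for a typical L1;
    // 32M elements = 256 MB, far larger than any on-chip cache.
    std::printf("L1-sized buffer:  %.1f ns per load\n", chase(4 * 1024));
    std::printf("RAM-sized buffer: %.1f ns per load\n", chase(32 * 1024 * 1024));
}
```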
However, even that is not enough. To keep the CPU busy, most manufacturers have also introduced cache prefetching and management schemes (e.g., Intel’s SmartCache), and have invested billions of dollars into branch prediction, instruction pipelining, and other tricks to squeeze out every ounce of performance. After all, if the CPU has separate floating-point and integer units, then there is no reason why two threads of execution could not run simultaneously on the same chip; this is the idea behind simultaneous multithreading (SMT). Remember Intel’s Hyper-Threading? As another point of reference, Sun’s Niagara chips are designed to run four execution threads per core.
But wait, how did threads get in here? Turns out, threads are a way to expose the potential (and desired) hardware parallelism to the rest of the system. Put another way, threads are a low-level hardware and operating system feature, which we need in order to take full advantage of the underlying capabilities of our hardware.
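As a concrete illustration, here is a minimal C++11 sketch that asks the runtime how many hardware execution contexts are available (on a Hyper-Threaded dual-core this reports four) and spawns one software thread per context:

```cpp
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    // Number of concurrent hardware execution contexts: physical cores
    // times SMT threads per core (may return 0 if it cannot be determined).
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 1;
    std::printf("hardware threads: %u\n", n);

    // One software thread per hardware context: this is how the OS-level
    // thread abstraction exposes the chip's parallelism to our code.
    std::vector<std::thread> workers;
    for (unsigned id = 0; id < n; ++id)
        workers.emplace_back([id] { std::printf("thread %u running\n", id); });
    for (auto& t : workers) t.join();
}
```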
Architecting for the Multi-core World
Since manufacturers could no longer keep scaling the single core (power, density, and communication constraints), designs have shifted to the next logical architecture: multiple cores on a single chip. After all, hardware parallelism existed all along, so the conceptual shift wasn’t that large: shared memory, multiple cores, more concurrent threads of execution. There is just one gotcha: remember those L1 and L2 caches we introduced earlier? Turns out, they may well be the Achilles’ heel of multi-core.
If you were to design a multi-core chip, would you allow your cores to share the L1 or L2 cache, or should they all be independent? Unfortunately, there is no single answer to this question. A shared cache allows higher utilization, which may lead to power savings (great for laptops), as well as higher hit rates in certain scenarios. However, that same shared cache can easily create resource contention if one is not careful (DMA is a known offender). Intel’s Core Duo and Xeon processors use a shared L2, whereas AMD’s Opteron and Athlon, and Intel’s Pentium D, opted for independent L1 and L2 caches. Even more interestingly, Intel’s recent Itanium 2 gives each core an independent L1, L2, and L3 cache! Different workloads benefit from different layouts.
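One easy way to see cache contention for yourself is false sharing: two cores hammering adjacent counters that happen to sit on the same cache line force the coherence protocol to bounce that line back and forth between their private caches. A rough C++ sketch, assuming the common 64-byte line size (a different mechanism than the DMA case above, but the same class of problem):

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <functional>
#include <thread>

// Two counters packed onto (very likely) the same cache line: every
// increment on one core invalidates the line in the other core's cache.
struct Packed { std::atomic<long> a{0}, b{0}; };

// The same counters padded onto separate cache lines (64 bytes is typical).
struct Padded {
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

// Run two threads, each incrementing its own counter, and time the result.
template <typename T>
static double run() {
    T c;
    auto work = [](std::atomic<long>& x) {
        for (long i = 0; i < 50'000'000; ++i)
            x.fetch_add(1, std::memory_order_relaxed);
    };
    auto start = std::chrono::steady_clock::now();
    std::thread t1(work, std::ref(c.a)), t2(work, std::ref(c.b));
    t1.join();
    t2.join();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
}

int main() {
    std::printf("same cache line:      %.2f s\n", run<Packed>());
    std::printf("separate cache lines: %.2f s\n", run<Padded>());
}
```

The two threads never touch each other’s data, yet the packed layout is typically several times slower: the contention happens at cache-line granularity, not variable granularity.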