As a kid growing up in Chicago during the ’90s, the sonic artistry stemming from that period’s “Renaissance Era of East Coast Hip Hop” (e.g., Nas, Biggie, Jay-Z, etc.) formed the soundtrack for many of the pivotal moments of my adolescence. But I never could’ve fathomed that the Wu-Tang Clan’s most critically acclaimed single, “Cash Rules Everything Around Me (C.R.E.A.M.),” off their debut album, Enter the Wu (36 Chambers), would remain so relevant in my chosen profession of Performance Engineering. Homophone considerations aside, The Wu forewarned me long before I’d even chosen my career path that CPU cache would play a central role in my performance analysis, benchmarking, optimization, and. . . well, everything around me.
But maybe you’re not a hip-hop fan. Perhaps you simply were raised on the idea that Asymptotic Complexity Analysis and Big O Notation are the end-all be-all of coding for optimal performance. And then you found yourself scratching your head as you noted the stark performance difference in searching a std::vector vs. a std::list, even though both operations have O(N) complexity. Or maybe you encountered this phenomenon when swapping out a simple chaining hash table with an open-addressing based one with linear probing. No matter which software project or Conference Talk apprised you of the interplay between common data structures and CPU caches, we all have since come to the realization that “algorithmic complexity,” while important, is incomplete from a performance perspective without consideration for “communication complexity.” Memory access hierarchy from the L1 to L2 to L3 to L4 to main memory, with its concomitant latency differences, cannot be ignored.
But we all know that now, right? We all know the common tips and tricks to get the most out of our caches. Some of us even replace standard STL containers with more cache-friendly ones (e.g., flat_map, flat_set, etc.). But is it possible to do everything right and still have our efforts derailed by a rogue application running elsewhere? Yes, it most certainly is. Enter the Inclusive vs. Non-inclusive Last Level Cache (LLC).
INCLUSIVE vs. NON-INCLUSIVE
CPU manufacturers provide chips with LLCs that generally fall into the Inclusive (e.g., Intel) or Non-inclusive (e.g., older AMD CPUs) camp. In the former case, everything that is contained in the L1 and L2 is guaranteed to exist in the LLC; in the latter case, no such relationship is guaranteed. The benefit of Non-inclusive LLCs is that it saves on cache capacity as storage is not duplicated; however, it complicates the cache coherency protocol and requires extra hardware for snoop filtering. Inclusive LLCs simplify cache coherency since much less protocol traffic must traverse the various levels; however, it wastes capacity and, more insidiously, it inflicts “backward invalidations” to maintain that inclusiveness; that is, if a cache line in the LLC is targeted for eviction for any reason, then it must ensure that the corresponding cache line is also evicted from the cache levels beneath it.
Ok, that’s bad. . . but not that bad, right? Except that while the L1 and L2 are private to each CPU core, the LLC is shared among many cores on a CPU socket – in the case of the Broadwell E5-2699 chip, it is shared across 22 cores. This means that another developer, not nearly as technically savvy or as debonair or as good-looking as yourself, can run his memory-hungry, LLC-thrashing application on a completely separate core from the one you’ve chosen, and yet still wreck the performance of your own application as your nicely laid out data structures repeatedly get evicted and reloaded into your private L1/L2 caches! And you thought the next door neighbors in your apartment building were bad! What can be done about this?
KEEPING YOUR HARD EARNED CACHE
Intel has tried a few things to combat this phenomenon. One of them was called Data Reuse Optimization – it appears to have been based on Temporal Locality Hints, whereby cache hits in the L1 would initiate an update to the history table of the LLC. Sounds pretty chatty to me. How does it perform? I wouldn’t know since that feature didn’t survive past Westmere, and I have no access to anything that old in my lab.
Then, Intel introduced a new Snoop Mode in Haswell called Cluster-On-Die – it effectively splits a CPU socket into two (2) separate NUMA nodes, each with its own half of the socket’s LLC. While this does not solve the problem entirely, it offers a mitigating option.
Then, Intel toyed with Query Based Selection (QBS) in Broadwell – it worked by having the LLC send “backward invalidation requests” to the lower level caches once a cache line had been targeted for eviction from the LLC. The lower level caches could either accept or reject the request – in the latter case, the LLC would issue a new request for a different cache line until it arrived upon one that the cache levels could agree upon. I was excited for this feature until I received an evaluation pre-release Broadwell chip early last year and, after much anguish during testing, learned that Intel decided to drop the feature due to “unresolved issues.”
Finally, Intel added Cache QoS Enforcement, included as part of Intel Resource Director Technology, which allows user-defined quotas for the LLC. This technology defines several Classes of Service where each class is configured to allow access to a certain number of ways in the 20-way LLC. You can assign one of these Classes of Service to each core via an MSR write. The MSRs for each Class of Service begins at 0xc90 for Class 0, 0xc91 for Class 1, and so on and so forth. MSR 0xc8f (bits 32 – 63) is used to assign a Class of Service to a specific core. Here’s an example:
### Define 2 classes of service (CoS) ### CoS 0 has mask 0xffffc - i.e., 18 ways $ wrmsr --all 0xc90 0xffffc ### CoS 1 has mask 0x3 - i.e., 2 ways $ wrmsr --all 0xc91 0x3 ### Assign CoS 1 to core #2 - all other cores default to using whatever you have defined for CoS 0 $ wrmsr -p 2 0xc8f 0x100000000
With this configuration you can pin your application to core #2 and safely run it with its streamlined data structure, which fits comfortably within 10% (2 ways out of the total 20 ways) of the LLC, with full confidence that no other application running on any other core can ever evict your working set. Not bad, huh?
As we’ve discussed, Inclusive LLCs simplify cache coherence at the expense of allowing costly “backward invalidations,” which can become troublesome in high core-count CPUs. Broadwell offers both Cluster-On-Die and Cache QoS Enforcement to tame these inclusion bugaboos. Maybe Skylake, Intel’s next “tock” in its tick-tock release schedule, will add further wrinkles to this cache game (and if you have an NDA relationship with Intel then you’re already pretty familiar with these upcoming new wrinkles). But whatever the future holds, for as long as non-uniform access to memory is the lay of the land, The Wu’s words of wisdom from ’93 will remain true. “Dollar, dollar bill, y’all. . .”