Once upon a time I worked for a rather eccentric computational geophysicist who, on my first day, asked if I knew anything about Graphics Processing Units (GPUs). As I shrugged my shoulders, he handed me a fresh Nvidia GTX 480 and said, “here, figure out how to make this thing do science.” It was my first foray into the fascinating world of heterogeneous high-performance computing, something I’d eventually come to specialize in.
Fast forward a few years and imagine my excitement when, on my first day at Belvedere, I got an account on a shiny new server packed to the gills with Intel Xeon Phi co-processors. I had played around with these devices before, but this would be the first time I’d have an opportunity to get beyond “Hello, World!”
Now, almost a year later, I’ve had the chance to dig into what these peculiar little things can do. Presented here, for your consideration, are my considerations of the Xeon Phi co-processor.
First of all, who cares?
The unfortunate reality in computer engineering is that processor clock rates have barely budged since about 2005. It turns out that if you push them much higher you risk the magic blue smoke escaping from your chip.
The trick to squeezing more performance out of computing machinery is to go wide. Don’t waste your time building a faster chip; rather, spend your transistor budget building a computer with *as many* processing elements as you can.
One way to add compute capacity to a system is to use so-called “co-processors.” It’s not a new idea at all. In fact, Intel released their first co-processor, the 8087, in 1980 as a hardware floating point unit for the 8086 and 8088, and as early as the 1970s engineers were adding specialized graphics co-processors to computers.
What is new and exciting about co-processors today is the density of computing elements manufacturers are able to pack into them. For example, in 2006, Nvidia was fabricating chips with 128 stream processors. By 2008 they were designing chips with over 500 *general purpose* compute cores tuned for floating point arithmetic. Now we have the Xeon Phi, a massively parallel x86 processor.
The Knights Corner Device
Intel jumped into the world of massively parallel processors in 2006 with their Larrabee architecture, which, by 2012, had evolved into their first production Xeon Phi co-processor, code-named “Knights Corner” (KNC).
In a single KNC device you get 61 x86 cores. They are based on the Intel P54C (i.e. the original Pentium), each 4-way hyper-threaded for a total of 244 hardware contexts. Intel also bolted a 512-bit wide (!!) vector unit onto each core, which lets you make use of some pretty nifty AVX512-esque instructions that bang out 8 double precision (or 16 single precision) operations per cycle.
Feeding these cores is 16 GB of high-speed memory with a very respectable 350 GB/s of bandwidth. A bi-directional ring interconnect provides core-to-core communication and memory access, and communication between the device and host travels over the PCIe bus.
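The arithmetic behind those figures is simple enough to spell out. Here is a minimal sketch in C; the function names are purely illustrative and not part of any Intel API:

```c
/* Illustrative helpers (hypothetical names, not from any vendor API). */

/* Hardware thread contexts = physical cores x threads per core. */
unsigned hardware_contexts(unsigned cores, unsigned threads_per_core)
{
    return cores * threads_per_core;
}

/* Number of elements that fit in one vector register of the given width. */
unsigned vector_lanes(unsigned vector_bits, unsigned element_bits)
{
    return vector_bits / element_bits;
}
```

For KNC, `hardware_contexts(61, 4)` gives the 244 contexts quoted above, while `vector_lanes(512, 64)` and `vector_lanes(512, 32)` give the 8 double precision and 16 single precision lanes.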
Just like almost every other accelerator, the standard programming model for Xeon Phi is offload: the host’s main CPU holds the primary thread of control and every so often sends portions of the application to the co-processor. For better or worse, where the Xeon Phi differentiates itself from other accelerators is that the co-processor itself runs a stripped-down Linux-based “micro-OS.” The card is, all things considered, a full-fledged Linux computer in its own right: it has a filesystem, it pretends to have network interfaces, and you can even ssh to it.
So really, that “standard” accelerator programming model I just mentioned isn’t necessarily how you need to treat the Xeon Phi at all. You can let the Phi be its own fully autonomous “thing” and talk with the host computer or even to the outside world. It turns the standard accelerator model on its head and provides a ton of flexibility for how applications can be architecturally structured.
The fact that Xeon Phi is “just x86” and runs “regular Linux” is super attractive. Having had the pleasure of writing HPC applications in Nvidia’s proprietary CUDA language for many years, I feel reasonably confident saying that no one actually writes CUDA because they want to. You write CUDA because you have to.
So I will hand it to Intel here: Xeon Phi cards are extremely easy to program. Remember: it’s *just* x86. Take your vanilla C/C++ code, sprinkle some compiler directives on it, shove it through the Intel compiler and bingo, you’ve got something that will run on a Xeon Phi!
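To give a flavor of what “sprinkling directives” can look like, here is a minimal sketch in C. It is an illustration, not code from our port. It assumes a compiler that understands the standard OpenMP 4.0 `target` directives; Intel’s compiler also has its own offload pragmas, which are not shown. The function name `saxpy_offload` is made up for the example.

```c
/*
 * y = a*x + y, marked for offload with standard OpenMP 4.0 directives.
 * Assumption: the compiler supports "omp target". A compiler that does
 * not (or that is run without OpenMP enabled) ignores the pragmas, and
 * the loop simply runs serially on the host.
 */
void saxpy_offload(int n, float a, const float *x, float *y)
{
    /* Copy x to the device; copy y there and back again. */
    #pragma omp target map(to: x[0:n]) map(tofrom: y[0:n])
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

The loop body is untouched vanilla C. The only accelerator-specific parts are the two pragmas, and the `map` clauses are where the cost of moving data across the PCIe bus becomes visible.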
I’ve intentionally decided to not put any benchmark numbers here. You see, depending on what you read (and more importantly, who wrote it), the Xeon Phi is both the greatest thing since sliced bread and total rubbish (here are just a couple of my favorites: https://blogs.nvidia.com/blog/2016/08/16/correcting-some-mistakes, http://www.intel.com/content/www/us/en/benchmarks/server/xeon-phi/xeon-phi-competitive-performance.html). So what I will say is this: while the straightforward programming model lets you get almost any code up and running on a Xeon Phi in no time flat, unless your algorithm is perfectly parallelizable (and vectorizable!) the performance probably isn’t going to knock your socks off.
The key to targeting an application to the Xeon Phi architecture (and this is maybe even more true for other accelerators like GPUs) is exposing as much fine-grained parallelism as you possibly can. If your algorithm isn’t massively parallel or, let’s be honest, just a huge DGEMM, you’re pretty much screwed. But c’est la vie, and such is life in co-processor land.
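To make “fine-grained parallelism” concrete, here is a small sketch in C contrasting a loop the hardware loves with one it does not. Both functions are hypothetical examples rather than code from a real application, and the OpenMP pragma assumes a compiler with OpenMP 4.0 support (otherwise it is ignored and the loop runs serially).

```c
/* Every iteration is independent of the others, so the loop can be split
 * across threads and vector lanes. Assumes in and out do not overlap. */
void scale_and_shift(int n, double a, double b, const double *in, double *out)
{
    #pragma omp parallel for simd
    for (int i = 0; i < n; ++i) {
        out[i] = a * in[i] + b;
    }
}

/* Each iteration needs the result of the previous one (a loop-carried
 * dependence), so this naive form has to run one element at a time. */
void running_sum(int n, const double *in, double *out)
{
    double acc = 0.0;
    for (int i = 0; i < n; ++i) {
        acc += in[i];
        out[i] = acc;
    }
}
```

The first loop can keep hundreds of hardware contexts and wide vector units busy. The second cannot in this form, and would need to be restructured as a parallel scan before a co-processor could help.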
Here at Belvedere we have been able to port a “mostly vectorizable” chunk of code to the Xeon Phi with decent results. Profiling revealed just how sensitive these devices are to cache alignment. Additionally, a good amount of time was spent hand tuning the vector instructions in key computational kernels to achieve timings that could beat the same code compiled with AVX2 for the Haswell architecture.
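For readers who haven’t had to care about alignment before, here is a minimal sketch in C of allocating a buffer on a cache-line boundary. It is not our production code. It assumes 64-byte cache lines (which also happens to be the width of a 512-bit vector register) and a C11 standard library; the helper names are made up for the example.

```c
#include <stdint.h>
#include <stdlib.h>

/* Assumption: 64-byte cache lines, which also matches the 64-byte width
 * of a 512-bit vector register. */
#define CACHE_LINE_BYTES 64

/* Allocate n doubles starting on a cache-line boundary (C11 aligned_alloc).
 * Returns NULL on failure; release the buffer with free(). */
double *alloc_aligned_doubles(size_t n)
{
    size_t bytes = n * sizeof(double);
    /* C11 wants the size to be a multiple of the alignment, so round up. */
    size_t padded = (bytes + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES * CACHE_LINE_BYTES;
    return aligned_alloc(CACHE_LINE_BYTES, padded);
}

/* Returns nonzero if p sits on a cache-line boundary. */
int is_cache_aligned(const void *p)
{
    return ((uintptr_t)p % CACHE_LINE_BYTES) == 0;
}
```

With buffers allocated this way, the first element of each array starts a fresh cache line, so a full-width vector load from the start of the buffer never straddles two lines.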
The next generation of Xeon Phi devices is now right around the corner. The chip, code-named “Knights Landing” (KNL), promises to increase the performance of the Xeon Phi architecture by swapping the P54C cores in Knights Corner for Intel Airmont (Atom) cores and adding some interesting bells and whistles. Moving to a more modern core in KNL that supports out-of-order execution is clearly a step in the right direction. We are also promised more freedom in memory access patterns, as KNL will not be as sensitive to cache alignment. Intel is also building into these chips a tiered memory hierarchy that lets the developer explicitly place data into high-capacity “far” RAM or high-bandwidth “near” RAM. It’s a nice feature which will give developers more fine-grained control over their data pipelines.
Intel also plans on releasing two versions of KNL, one in the PCIe add-on form factor and one as a socketed chip. With the latter, the concept of accelerator goes away entirely: the accelerator *is* the host. It’s an interesting concept, as what you’ll get is a massively parallel server, plain and simple, with no PCIe bottleneck or partitioned memory space to wrestle with. With the socketed chip variant, Intel also provides an on-die Omni-Path interconnect, letting the devices connect directly to a network.
The KNL variant of the Xeon Phi is certainly a promising follow-on to the KNC chip that is already in the wild. With the upgraded compute cores, tiered memory model, and on-die interconnect combined, we should have greater flexibility in the computing patterns KNL can accelerate.