The VPU – Our New Processing Unit Overlord

September 23, 2010 by John Ellis

Does anyone happen to recall the days of deciding between a 486/SX and a 486/DX2? Ah, those were indeed the days. The additional floating point processor offered by the DX-branded Intel chips was initially pitched as merely helpful in "certain mathematical operations" like spreadsheets. It was a rather weak pitch… in truth there was no killer application to flaunt its power.

That all changed when John Carmack revealed QTest.

After id Software released Doom, everyone wanted to see where "Engine John" was going to go next. His more immersive, truly three-dimensional environments required floating point calculations… and the killer application for the 486/DX2 was born. I remember this vividly because I did not have a floating point unit… neither one living in a stand-alone socket (back when FPUs could be discrete parts) nor one on the die itself. I was a sad monkey.

We’re on a similar precipice today with vector processing units. The technology behind vector processing definitely isn’t new; it has been commercially available for decades, and today we see hundreds of VPUs within the discrete graphics cards available at retail stores everywhere. The power of vector processing extends far beyond graphics cards… indeed, NVIDIA has been offloading physics calculations to its GPUs, and the OpenCL standard has been working to give programmers a uniform way to tap parallel vector processing units.
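To make that concrete, here’s a minimal sketch of the data-parallel style these units expose – written in CUDA rather than OpenCL purely for brevity, with illustrative names of my own (saxpy, dx, dy) rather than anything taken from a particular SDK. Instead of one loop grinding through an array, each element gets its own lightweight thread:

```
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Each GPU thread handles one array element: the "same instruction,
// many data elements" style that vector/stream processors are built for.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 20;                 // one million elements
    const size_t bytes = n * sizeof(float);

    // Host-side data
    float *hx = (float *)malloc(bytes);
    float *hy = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }

    // Device-side buffers on the graphics card
    float *dx, *dy;
    cudaMalloc(&dx, bytes);
    cudaMalloc(&dy, bytes);
    cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover every element
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    saxpy<<<blocks, threads>>>(n, 2.0f, dx, dy);

    cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost);
    printf("y[0] = %f (expect 4.0)\n", hy[0]);

    cudaFree(dx); cudaFree(dy);
    free(hx); free(hy);
    return 0;
}
```

Whether you express it in CUDA or OpenCL, the shape is the same: copy data to the card, launch thousands of threads at once, copy the results back.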

On-die vector processing is starting to take hold in the traditional CPU market. IBM’s Cell architecture attempted to make vector processing the hallmark of its processing units – something that granted the PS3 enormous processing power but gave software engineers heartburn. AMD has announced that vector processing will obviate the standalone CPU by means of the APU: an "Accelerated Processing Unit" that weaves floating point, vector and integer instructions into one processor. Intel originally planned to offer vector processing as part of its Larrabee architecture in 2010, but has since pushed that work off to other, yet-to-be-released product lines.

While the hardware is in place, the VPU hasn’t had a "killer application" just yet. True, massively parallel physics calculations and cracking GSM encryption are neat, but it will take a mainstream application to really drive demand for VPU silicon.

Cloud computing may just be the killer application that the interwoven VPU/FPU/CPU has been looking for. Analysis presented at the 2010 GPU Technology Conference demonstrated how a 1,000-GPU Fermi cluster offered greater density, much lower power consumption, greatly reduced cost and far higher reliability than a 250,000-CPU cluster. And by "far higher" we’re talking about ten times the reliability, one tenth the cost and a mere 3.5% of the data center real estate. Dang.

The big problem is, as The Register put it, making your code "GPU-riffic." Engineers need to write applications geared towards the instruction sets of vector processors, not the traditional instruction sets of today’s CPUs. I half wonder if the proper hypervisor might help bridge this gap… although I know there are plenty of PhDs who just threw their pipes at me for insinuating such a thing. Still – what if the vCPU not only provided a hardware abstraction on top of a physical CPU, but also did a bit of transformation, spotting sequences of CISC instructions well suited to vector processing and offloading them to a vector processing unit? Kinda a convoluted version of what Transmeta did for x86 support in its low-power processor architecture? Yes, I’m completely oversimplifying things, and this introduces a slew of problems. But still… what if? What if our cloud technology transparently leveraged new compute architectures that achieved far higher density and much lower power consumption?
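To show what I mean, here’s a purely hypothetical illustration (no such translation layer exists, and the function names are mine): the scalar loop below is the sort of hot, independent-iteration code such a layer – or, today, a programmer – would have to spot and re-map onto the vector hardware.

```
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// What the CPU runs today: a tight scalar loop, one iteration at a time.
void scale_cpu(int n, float a, float *data)
{
    for (int i = 0; i < n; ++i)
        data[i] *= a;
}

// What it would become on the vector side: every iteration is its own thread.
__global__ void scale_gpu(int n, float a, float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= a;
}

int main()
{
    const int n = 1 << 16;
    const size_t bytes = n * sizeof(float);

    float *host = (float *)malloc(bytes);
    float *cpu  = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) host[i] = (float)i;
    memcpy(cpu, host, bytes);

    // CPU path
    scale_cpu(n, 3.0f, cpu);

    // GPU path: copy over, launch one thread per element, copy back
    float *dev;
    cudaMalloc(&dev, bytes);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    scale_gpu<<<(n + 255) / 256, 256>>>(n, 3.0f, dev);
    cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);

    // Same arithmetic, two very different execution models
    printf("cpu[1000] = %f, gpu[1000] = %f\n", cpu[1000], host[1000]);

    cudaFree(dev);
    free(host); free(cpu);
    return 0;
}
```

Today that re-mapping is done by hand; doing it transparently, underneath the guest, is exactly the trick the hypothetical hypervisor would have to pull off.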

That would be one killer cloud.