Why C is faster than assembly
I am very interested in the pros and cons of various programming languages for embedded applications. Although I mostly favor C, I sometimes prefer C++, but I am open to suggestions. I started out writing assembly language and still feel that this is the “real thing”. This is a topic I posted about some time ago. I expanded my thoughts in a more recent article on embedded.com
I am always pleased when one of my colleagues offers to write a guest blog, as I think that different views can be interesting. Brooks Moses [who has kindly contributed a number of times before, like here] has some very interesting views and experience to share …
The debates about which programming language is fastest are never ending. Is Fortran faster than C? Is C++ slower? What about Java? The only thing people seem to agree on is that when you really want the absolute best performance on some critical inner loop, you use assembly code.
Like most things people say in these debates, this is wrong.
That’s a pretty audacious claim to be making, so let me back this up with some real-world numbers. The Embedded Software Division at Mentor Graphics has recently been writing a computational library for Freescale’s new e6500 Power-architecture processors. We call it the Mentor Embedded Performance Library, so it’s our reputation (as well as the reputation of Freescale’s e6500 chips) that’s on the line for this to perform well.
I’ve recently been working on the elementwise functions in this library — these are the functions that compute the square root or the cosine or so forth for each element in an array. They’re pretty simple functions, but there are also about 190 of them in the library, and we have an aggressive schedule that meant I had only a couple weeks to get all of these done.
The e6500 has an AltiVec SIMD vector unit, so the inner loop of one of these functions looks about like this:
for (i = 0; i < N; i+=4)
load a 4-element SIMD vector of input
process it with several AltiVec instructions
save a 4-element SIMD vector of output
Whether you write that in C or assembly, it’s going to be slow if you actually write it like that. (For example, a square root function written this way takes 26 cycles for each SIMD vector, while a better-optimized version can do it in 11.) The e6500 CPU can start executing a typical AltiVec floating-point instruction every cycle, but there is a 6-cycle latency between when the instruction starts executing and when the output register can be used by another instruction. If you try to use the output register sooner than that, the CPU waits in a “pipeline stall” until it’s ready. That’s what’s happening here; each AltiVec instruction in the chain of instructions that processes the input is working with the output from the previous instruction, so the CPU is spending most of its time stalled.
Any good assembly programmer knows the way to solve that. The computation for each SIMD vector of four elements is independent, so if we partly “unroll” the loop and compute four of them on each pass through the loop, we can interleave the instructions and avoid stalls. The CPU can start the first AltiVec instruction for the first vector, and then rather than stalling on the next cycle it can start the first AltiVec instruction for the next vector, and so on. Oh, and there are some additional nuances; an e6500 CPU can start a computation instruction and a load-store instruction on the same cycle, so we want to adjust the instruction order to take advantage of that as much as we can. Then, because all these instructions are operating at once, we need to allocate which register is used by which set of computations and make sure that’s all kept straight.
Or we could just let the compiler do all the instruction ordering and register allocation. This is one of the things that compilers are particularly good at and have been for decades. Compilers aren’t (currently) very good at unrolling loops or converting simple C code to AltiVec instructions, but we can write an unrolled loop in C with GCC “compiler intrinsic” functions to tell it what AltiVec instructions to use. The compiler will then find a near-optimal order for the instructions and allocate the registers in a fraction of a second rather than a fraction of an hour, and we can then take a minute or two to have a look at the generated assembly code to make sure it hasn’t done something dumb. The result will be just as good as what someone might write by hand in assembly code.
But that’s not “faster”, I hear you objecting; it’s “just as good.” Where does “faster” come in?
Earlier, I said we’d compute four vectors on each pass through the loop. Why four? It turns out that four is a reasonable guess for the best choice, but for some of these functions doing eight at once is better, while for some it’s better to do only two, or occasionally only one.
And that’s where “faster” comes in. In assembly, changing the number of vectors we do on each pass through the loop requires laboriously going through that ordering and register-allocation step all over again, and probably debugging a few bugs. But, in C, it’s a simple cut-and-paste. So simple even a Python script could do it — and that’s what I did, and generated a table of the performance results for a bunch of different options for all of the various functions. Here’s an excerpt — remember, there are over a hundred lines of this and several additional columns:
|loop unroll amount||1||2||4||8|
Loop execution times (cycles per SIMD vector)
It’s immediately clear that for square root, division, and addition we want to do eight vectors at a time, but only two for arctangent and cosine, and only one for tangent and arccosine. If we’d written everything in assembly, there’s no way we could have done this sort of experimentation in the time we had. About the best we could have done was pick a sample function and try a few things, and then use that to guess the right number to use for all the other functions — and we’d probably get it wrong on many of them. As you can see in the table, guessing wrong can lead to measurably slower code.
So. if you want the absolute best performance on a critical inner loop, write it in C. Or Fortran, or C++; it doesn’t matter. Read the generated assembly, and adjust the C code until it looks good. Then experiment with unrolling loops and trying different variations, and measure each one until you get the best numbers. The reason C is faster than assembly is because the only way to write optimal code is to measure it on a real machine, and with C you can run many more experiments, much faster.
Oh, and use the right algorithm; that matters more than everything else put together. Last week I was working on our floating-point sort routine, and after I had a rather good quicksort implementation working, I tried a better sort algorithm and cut the execution time by more than half — again, a win for experimentation. But that’s a story for another time!
Posted February 18th, 2013, by Colin Walls
- Choose your weapons – options for debugging
- Dissatisfaction, customer service and surprises
- Video blog about getting into embedded software
- Embedded software article: RTOS Revealed #9
- Lost in translation
- One return from a function: a good idea?
- The A380 experience
- Embedded Software Masterclass
- How to get rich