Why C is faster than assembly

I am very interested in the pros and cons of various programming languages for embedded applications. Although I mostly favor C, I sometimes prefer C++, but I am open to suggestions. I started out writing assembly language and still feel that this is the “real thing”. This is a topic I posted about some time ago. I expanded my thoughts in a more recent article on embedded.com

I am always pleased when one of my colleagues offers to write a guest blog, as I think that different views can be interesting. Brooks Moses [who has kindly contributed a number of times before, like here] has some very interesting views and experience to share …

The debates about which programming language is fastest are never ending. Is Fortran faster than C? Is C++ slower? What about Java? The only thing people seem to agree on is that when you really want the absolute best performance on some critical inner loop, you use assembly code.

Like most things people say in these debates, this is wrong.

That’s a pretty audacious claim to be making, so let me back this up with some real-world numbers. The Embedded Software Division at Mentor Graphics has recently been writing a computational library for Freescale’s new e6500 Power-architecture processors. We call it the Mentor Embedded Performance Library, so it’s our reputation (as well as the reputation of Freescale’s e6500 chips) that’s on the line for this to perform well.

I’ve recently been working on the elementwise functions in this library — these are the functions that compute the square root or the cosine or so forth for each element in an array. They’re pretty simple functions, but there are also about 190 of them in the library, and we have an aggressive schedule that meant I had only a couple weeks to get all of these done.

The e6500 has an AltiVec SIMD vector unit, so the inner loop of one of these functions looks about like this:

for (i = 0; i < N; i+=4)
{
load a 4-element SIMD vector of input
process it with several AltiVec instructions
save a 4-element SIMD vector of output
}

Whether you write that in C or assembly, it’s going to be slow if you actually write it like that. (For example, a square root function written this way takes 26 cycles for each SIMD vector, while a better-optimized version can do it in 11.) The e6500 CPU can start executing a typical AltiVec floating-point instruction every cycle, but there is a 6-cycle latency between when the instruction starts executing and when the output register can be used by another instruction. If you try to use the output register sooner than that, the CPU waits in a “pipeline stall” until it’s ready. That’s what’s happening here; each AltiVec instruction in the chain of instructions that processes the input is working with the output from the previous instruction, so the CPU is spending most of its time stalled.

Any good assembly programmer knows the way to solve that. The computation for each SIMD vector of four elements is independent, so if we partly “unroll” the loop and compute four of them on each pass through the loop, we can interleave the instructions and avoid stalls. The CPU can start the first AltiVec instruction for the first vector, and then rather than stalling on the next cycle it can start the first AltiVec instruction for the next vector, and so on. Oh, and there are some additional nuances; an e6500 CPU can start a computation instruction and a load-store instruction on the same cycle, so we want to adjust the instruction order to take advantage of that as much as we can. Then, because all these instructions are operating at once, we need to allocate which register is used by which set of computations and make sure that’s all kept straight.

Or we could just let the compiler do all the instruction ordering and register allocation. This is one of the things that compilers are particularly good at and have been for decades. Compilers aren’t (currently) very good at unrolling loops or converting simple C code to AltiVec instructions, but we can write an unrolled loop in C with GCC “compiler intrinsic” functions to tell it what AltiVec instructions to use. The compiler will then find a near-optimal order for the instructions and allocate the registers in a fraction of a second rather than a fraction of an hour, and we can then take a minute or two to have a look at the generated assembly code to make sure it hasn’t done something dumb. The result will be just as good as what someone might write by hand in assembly code.

But that’s not “faster”, I hear you objecting; it’s “just as good.” Where does “faster” come in?

Earlier, I said we’d compute four vectors on each pass through the loop. Why four? It turns out that four is a reasonable guess for the best choice, but for some of these functions doing eight at once is better, while for some it’s better to do only two, or occasionally only one.

And that’s where “faster” comes in. In assembly, changing the number of vectors we do on each pass through the loop requires laboriously going through that ordering and register-allocation step all over again, and probably debugging a few bugs. But, in C, it’s a simple cut-and-paste. So simple even a Python script could do it — and that’s what I did, and generated a table of the performance results for a bunch of different options for all of the various functions. Here’s an excerpt — remember, there are over a hundred lines of this and several additional columns:

loop unroll amount 1 2 4 8
addition 9 8 7 5
arccosine 155 260 311 357
arctangent 56 34 42 64
complex multiply 39 28 23 30
cosine 66 53 73 82
division 21 15 11 8
error function 89 102 107 104
exponent 57 34 34 48
natural log 56 33 42 66
reciprocal 15 12 9 6
square root 26 16 13 11
tangent 149 172 189 224

Loop execution times (cycles per SIMD vector)

It’s immediately clear that for square root, division, and addition we want to do eight vectors at a time, but only two for arctangent and cosine, and only one for tangent and arccosine. If we’d written everything in assembly, there’s no way we could have done this sort of experimentation in the time we had. About the best we could have done was pick a sample function and try a few things, and then use that to guess the right number to use for all the other functions — and we’d probably get it wrong on many of them. As you can see in the table, guessing wrong can lead to measurably slower code.

So. if you want the absolute best performance on a critical inner loop, write it in C. Or Fortran, or C++; it doesn’t matter. Read the generated assembly, and adjust the C code until it looks good. Then experiment with unrolling loops and trying different variations, and measure each one until you get the best numbers. The reason C is faster than assembly is because the only way to write optimal code is to measure it on a real machine, and with C you can run many more experiments, much faster.

Oh, and use the right algorithm; that matters more than everything else put together. Last week I was working on our floating-point sort routine, and after I had a rather good quicksort implementation working, I tried a better sort algorithm and cut the execution time by more than half — again, a win for experimentation. But that’s a story for another time!

Post Author

Posted February 18th, 2013, by

Post Tags

, , , , , ,

Post Comments

2 Comments

About The Colin Walls Blog

This blog is a discussion of embedded software matters - news, comment, technical issues and ideas, along with other passing thoughts about anything that happens to be on my mind. The Colin Walls Blog

Comments

2 comments on this post | ↓ Add Your Own

Commented on 24 February 2013 at 19:05
By bandit gangwere

Good analysis. People rush to “optimize the code” but they don’t bother to measure or look for the best algorithm.

I have had to optimize code four times in the last 20 years, and knew, by design, where that was going to be. I got the code to work, measured – yep, that needs to run faster – and optimized it. Either re-arraigned the code to help the compiler, or cloned the code and cut out the error checking. I could run the error checking version when time was not critical, then the really fast when time was critical.

Commented on 24 February 2013 at 19:06
By bandit gangwere

BTW, this architecture reminds me of the Floating Point Systems array processors. Adds to one cycle, mults two (and you had to push data thru the pipe, and memory fetches were three.

Add Your Comment

You must be logged in to post a comment.

Archives

September 2014
  • ARM TechCon – See you there?
  • “Six of the Best” – Cities
  • Multiple constructors, the :: operator and memory speed – more questions answered
  • “Six of the Best” – iPad apps
  • A hypervisor on a multicore system
  • Do I spit it out? A wine tasting 101
  • Embedded software articles – from incrementing in C/C++ to using dynamic memory
  • Pulling the cork
  • Default parameters, C++ determinism and other C++ questions
  • August 2014
  • How about a “Penny Lick”? Ice-cream the Victorian way
  • Multicore Linux, DO-178B and RTOS performance
  • Two steps forward, one step back – learning from setbacks
  • Embedded software articles – from assembly language to software IP
  • The 5:2 way of eating
  • Undefined behavior and other delights of (bad) C programming
  • July 2014
  • Deafness, sign language and babies
  • Quick and dirty code checking and execution
  • Words on tap – are streaming books next?
  • Multicore systems: heterogenous architectures – untangling the technology and terminology
  • Words – new, old and odd
  • Another programming language survey – what are you using?
  • What to do with all those pictures
  • Uniqueness of C++ methods and class member variables
  • The Big Mac question
  • June 2014
  • C++ – yet more Questions and Answers
  • IQ – myth or meaningful?
  • C++ – more Questions and Answers
  • Asia [the band, not the continent]
  • Embedded software resources: a reading list for engineers
  • Doing a good turn
  • struct vs class in C++
  • May 2014
  • RIP Gordon Cameron
  • Problems with pointers: out of scope, out of mind
  • 1st world and 3rd world technology
  • C++ Exception handling continued – words from an insider
  • Internet shopping and customer service
  • C++ Q&A
  • Exception handling in C++ – developers are wary
  • 1000 words
  • Dynamic memory in real time systems – a solution?
  • Genetic modification – where will it all end?
  • April 2014
  • C++ for embedded – your input needed
  • The Power Distance Index and safe flying
  • Wi-Fi, ZigBee, Miracast and Bluetooth: wireless networking on the move – without a router
  • Are you good enough?
  • USB 3.1 – more speed, more power and new connectors
  • California drought: how water is wasted in huge volumes
  • How was EE Live! for you?
  • Teaching kids to program – why would we want to do that?
  • March 2014
  • See you at EE live!
  • Alien technology?
  • A date at DATE [the software perspective on virtual prototyping]
  • Is it normal to be normal?
  • Selecting a CPU
  • Set in stone
  • Embedded software engineering priorities – a response
  • Eggheads
  • Embedded World – “This is not EDA”
  • February 2014
  • Astrology
  • Power Management, Multicore: Embedded World 2014
  • An RTOS with a GUI
  • Me and my iPad
  • Google hangout on IoT
  • Embedded software engineering priorities
  • I am an ambassador
  • Embedded systems in cars – challenges and solutions
  • January 2014
  • A time for celebration?
  • Low power multicore from a software designer’s perspective
  • Learning to swim
  • Low power modes
  • The 4th dimension
  • Safety critical sensors in cars
  • Goose bumps
  • Selecting an embedded operating system
  • December 2013
  • Christmas traditions
  • Using an unsigned integer to store a pointer in C
  • How selfish are you? Is there such a things as altruism?
  • GENIVI Diagnostic Log and Trace
  • Gravity – a memorable movie?
  • Embedded virtualization: Out-of-the-Box and into-the-fire?
  • November 2013
  • The harpist
  • Hypervisor applications
  • Back to Berlin
  • Conferences in perspective
  • Mars again
  • Embedded hypervisors
  • Stockholm taxi rip-off – a warning
  • ECS & IP-SoC
  • ARM Tech Con in Sixteen Hours: Top Five Moments
  • October 2013
  • The 100 Year Old Startup
  • ARM TechCon
  • How does this shape taste?
  • Busy, busy, busy
  • Donating blood
  • The demise of the reset button
  • (It’s not) All about me
  • The next Big Thing
  • The Week Plan
  • September 2013
  • Embedded file systems
  • There was no big bang
  • Smart Energy panel
  • How old is your car?
  • On vacation
  • Ah! Lua
  • Knots and splices
  • Software integrity testing
  • August 2013
  • Forgetting stuff
  • Self-test
  • How to live longer
  • big.LITTLE
  • Is it Summer yet?
  • Resistive RAM
  • No calls please!
  • Measuring scheduling latency
  • Careers and ambition
  • July 2013
  • An interview
  • Lego and the Pyramids
  • Programming languages and prize winners
  • Shopping woes
  • Munich conferences
  • The Nun’s Prayer
  • Three degrees of freedom
  • Water
  • How long is a piece of string?
  • June 2013
  • QR codes
  • Which embedded programming language?
  • Which way is up?
  • SEP 2.0
  • Genetic art
  • Device Firmware Upgrade through USB
  • My genes?
  • Non-intrusive debug
  • May 2013
  • Two heads are better than one
  • Using an SMTP client
  • First non-contact
  • Book review [part 1]
  • Everyday rhetoric
  • Embedded education
  • Gone flying
  • Hardware and software development in synch
  • Why is traveling so hard?
  • April 2013
  • After Design West
  • Great expectations
  • Design WEST
  • Get lost!
  • What is an FPGA?
  • How big is a sheet of paper?
  • Debugging with printf() or not …
  • New thoughts on Evernote
  • Power management webinar
  • March 2013
  • Time annoyance
  • Reading the meter
  • Integer money
  • Endianness
  • An intimate evening with Al
  • Counting on your fingers
  • Getting sorted
  • Judging distance
  • Innovate!
  • Wildebeest
  • February 2013
  • Embedded World 2013
  • A 1000 days with an iPad
  • Why C is faster than assembly
  • Chocolate is good
  • Reentrant write-only ports
  • Losing the penny
  • More on low power CPU design
  • January 2013
  • The Hype Cycle
  • Write-only ports in C++
  • An alternative to online dating
  • Designing a low power CPU
  • The mayonnaise jar and the two pints of beer
  • Write-only ports
  • The new Box Brownie
  • OS configuration
  • New Year in Berlin
  • December 2012
  • Holiday time!
  • This cannot continue
  • Not so much of a puzzle
  • Another article
  • Why bother with a DSLR?
  • In the news
  • Back to Mars
  • Less puzzled
  • November 2012
  • Connecting online
  • USB 3.0
  • Rant: insurance and sex discrimination
  • Power and compilers at the NMI
  • Are you shy?
  • Webinars
  • Big Issue!
  • Blinking is good
  • Just Julian
  • October 2012
  • ARM TechCon 2012
  • Repair or replace?
  • A puzzle
  • On the weekend
  • VSIPL++ Standard goes Global
  • A surprise meeting
  • HPEC ‘12
  • Choice = stress?
  • ECS2012
  • September 2012
  • Viva Las Vegas
  • Embedded software tools – then and now
  • Dimensionally intelligent
  • ESC Boston
  • Doh!
  • The floating point argument
  • iPad apps
  • The Power Pyramid
  • August 2012
  • The smallest room
  • Floating point
  • Left to right
  • Evaluation boards
  • 5 + 5 + 5 – 5 + 5 + 5 – 5 + 5 x 0 = ?
  • Curiously embedded
  • Infected by Olympic fever
  • RTOS memory footprint
  • Are you happy?
  • July 2012
  • Why should I care about software standards?
  • I give up
  • Going to sleep
  • Having fun
  • Munich Embedded Developers’ Forum
  • Evernote hints & tips
  • A more powerful phone
  • A very special day
  • HPC
  • June 2012
  • Reinventing the wheel
  • More on low power
  • Power Suckers
  • Too many pixels
  • In a state
  • Multicore thread synchronization
  • Let’s all hold hands
  • Freescale Technology Forum
  • Don’t jump!
  • Measuring interrupt latency
  • May 2012
  • PowerPoint hints and tips #3
  • RTOS in demand
  • Snaps and splicing
  • GTC – mission accomplished
  • Coffee in Italy
  • Measuring RTOS performance – a Web seminar
  • The Italian cuckoo
  • Preparing for GTC
  • Photography rebooted
  • April 2012
  • Who needs a Web server?
  • PowerPoint hints and tips #2
  • New book
  • Video
  • How do you tell the time?
  • In an open-source world, it’s all about integration
  • What are meetings for?
  • Comparing apples with apples
  • Poison yogurt
  • Embedded software performance optimization: Forget about it!
  • March 2012
  • Universal time
  • ESC imminent
  • PowerPoint hints and tips #1
  • Not the Embedded Systems Conference
  • Signs of the times
  • Trends
  • Hail the straw man
  • Embedded World visited
  • Seeing the light
  • February 2012
  • Looking forward to Embedded World
  • Death by PowerPoint
  • A slick UI
  • The first 12 notes of a song
  • IPv6 is really coming
  • Am I fit?
  • More on System-C
  • A random act of kindness
  • January 2012
  • Q&A
  • Who do I think I am?
  • Interview
  • Joining the embedded bandwagon
  • Not going metric
  • Goodbye ICT
  • Collectomania
  • Everything is wireless
  • Crystal ball gazing
  • December 2011
  • Season’s greetings
  • The invisible RTOS continued
  • Stop, thief!
  • The invisible RTOS
  • Good health to you
  • On the road again
  • Culture
  • November 2011
  • System-C for embedded systems programming
  • Who am I?
  • What is the plural of Linux?
  • Data in the cloud
  • Device drivers on SMP systems
  • The death of the CD?
  • Who needs a debugger anyway?
  • How many?
  • October 2011
  • Hardware designers and software
  • Becoming a millionaire
  • Dennis Ritchie and C for embedded
  • How green is my aircraft?
  • The value of software
  • Winter fuel
  • Efficient code – quiz answers
  • The alarm
  • Conference season
  • September 2011
  • All the data in the world
  • Efficient code – a quiz
  • Sonos
  • C libraries
  • A key to success
  • VDC survey
  • And where were you?
  • Embedded expectations
  • Being rich
  • August 2011
  • Stuffing bits
  • Pecha Kucha
  • Web seminar
  • Instrumentation
  • Working with fire and steel
  • Monolithic or not
  • How did we do it?
  • Get packing
  • Monopoly
  • The ideal programming language?
  • July 2011
  • Festival fever
  • Embedded tools – the third way
  • Drawing on the right side of the brain
  • What language?
  • Swifts
  • Data types and code portability
  • Traveling light
  • ARM Development Conference
  • June 2011
  • Giving it away
  • USB 3.0 and data storage
  • Tempus fugit
  • On the move
  • OGWT
  • After DAC
  • How would you like it done?
  • System power
  • Tilting at windmills
  • More on DAC
  • May 2011
  • Memory signature follow-up
  • Handedness follow-up
  • DAC
  • USB power
  • E-reading
  • Web seminar
  • Memory signature
  • Non-human handedness
  • And the winner is …
  • What time is it?
  • ESC Silicon Valley
  • April 2011
  • Do e-books make me read more?
  • USB 3.0
  • Too many mega-XXX
  • Choosing an embedded operating system
  • Fireworks
  • Web seminar
  • USB – class drivers
  • Legal anachronisms
  • Why does power matter?
  • Web seminar
  • March 2011
  • Precise or random?
  • Device registers in C
  • Moderate cycling on the road to Damascus
  • Share and share alike
  • Evernote
  • Why move to C++?
  • Perspectives on a missed opportunity
  • Embedded World 2011
  • Web seminar
  • The parcel that wasn’t stolen
  • February 2011
  • Hiring embedded software engineers
  • Cool Katie
  • A question of honor
  • C++ reference parameters – the downside
  • The happy ending
  • A different angle on multicore
  • Jubilee!
  • Who needs hardware designers?
  • Big vs little
  • January 2011
  • Where does embedded software come from?
  • The King’s Speech
  • C++ – for loops
  • iPad/iPhone iOS 4 closing apps
  • Secret codes
  • A movement
  • Software saves lives
  • Best tech of 2010
  • December 2010
  • Have a drink on me
  • The power of a greeting
  • Heavy elements
  • Embedded Software Engineering Kongress 2010
  • Why grow?
  • More uses for an MMU
  • A one way trip
  • November 2010
  • Embedded software in 2011
  • Friends, Romans and Dunbar’s number
  • Firmly in line
  • In judgment
  • ARM Tech Con 2010
  • Bad food
  • The UI debate
  • And you are?
  • USB – even more
  • October 2010
  • What’s my iPad for?
  • Scandinavian embedded
  • Back to school
  • Go Forth!
  • Up hill
  • The interview
  • Starting a business
  • A busy season
  • September 2010
  • Moving house
  • RTOS deployment
  • Element 13
  • What date is it?
  • USB – real speed
  • The wrong word
  • The one line RTOS
  • Handedness
  • August 2010
  • C++ at fault
  • RPN
  • Embedded Linux – why?
  • Thanks for the memory
  • USB – the need for speed
  • What is stress?
  • Function parameters
  • The wardrobe
  • Chinese embedded
  • July 2010
  • iPad – how I feel
  • Why write code?
  • More laziness
  • Electronics for the sick
  • Can you do the placebo?
  • AMP & SMP revisited
  • iPad – an interim report
  • When compilers do magic
  • Arthur Smith and the Italians
  • The Soviet Empire – my part in its downfall
  • June 2010
  • EDA and embedded software
  • No comment on the World Cup
  • Laziness as a business asset
  • MCAPI – lessons learned
  • Being a mentor
  • How did I get into this?
  • Uniquely annoying
  • AMP vs SMP
  • Reality
  • Static or static
  • May 2010
  • A good day out
  • Hacking your car
  • What sign are you?
  • Using an MMU
  • Books and e-books
  • OS influence on power consumption
  • iPad impressions
  • ESC Silicon Valley 2010
  • April 2010
  • What is my PC for?
  • Introducing MCAPI
  • The perfect day
  • Vintage multi-core – the IPC
  • The 4 Ds principle
  • Vintage multi-core – introduction
  • Out of gas?
  • RTS Embedded Systems 2010
  • Learning a language
  • March 2010
  • USB – what we learned
  • Perfect chocolate
  • Graphical medicine
  • Thinking is important
  • Technology trends
  • Words on the move
  • EW 2010
  • The AIRpod
  • What is “real time”?
  • February 2010
  • A taxi in Munich
  • Staying in line
  • The Names Game
  • Android beyond mobile
  • Avatar
  • Thanks for the memory
  • What is my phone for?
  • Overloading or obfuscation?
  • January 2010
  • Indefinite
  • Product quality: belief or proof?
  • Vacation for the PC repair man
  • Agile revisited
  • Time’s arrow
  • Small or fast?
  • aMAZEd
  • C, C++ and the family tree
  • December 2009
  • A Google mobile lab, anyone?
  • Crystal ball
  • Stand-by or boot-up
  • There, their
  • Agile embedded
  • Innovation as an Asset
  • IP/ESC’09
  • Keeping the lights on
  • November 2009
  • Blocking and non-blocking APIs
  • Cars made from wood
  • Yes!
  • 8 bits anyone?
  • Driving on the right side of the road
  • C is great, but…
  • Droids dropping like bombs
  • Parlez vous Fortran?
  • Palindromic numbers
  • Multi-core, multi-OS confusion
  • October 2009
  • Muffins in Colorado
  • Where does Ford make its paint?
  • Time travel
  • Introducing the iBrush
  • Sex, money and the weather
  • Heap contiguity revisited
  • Going green
  • Simulation – better than the real thing?
  • Mind your language
  • September 2009
  • MMU and heap contiguity
  • What kind of pillows do you want?
  • Creeping elegance
  • Government sponsored charity
  • How many mobile phones?
  • Stand aside! I’m an expert
  • Seeing is believing
  • A plea
  • August 2009
  • volatile
  • Stolen!
  • Multi-core/multi-OS – terminology
  • Feedback
  • The RTOS business – a story
  • Reading aloud
  • ++i or i++?
  • The .99 phenomenon
  • USB: easy, but …
  • July 2009
  • Problem solving
  • Devices that phone home
  • An odd coincidence
  • Y2K redux
  • Mind mapping
  • The Works
  • Best barman in the world
  • When a fancy UI is not a luxury
  • Live music
  • June 2009
  • Language extensions
  • The 8 day week
  • Fix it in the software!
  • How to sell more toothpaste
  • What is an embedded system?
  • Glass blowing
  • There are 10 kinds of people in the world
  • You want a macchiato?
  • Who needs OS source code?
  • May 2009
  • By royal appointment
  • Assembly language is always smallest/fastest – not!
  • Exhibitionism
  • Would you buy a TCP/IP stack from me?
  • Linux in the embedded world
  • How much of a geek am I?
  • Why medical systems need WiFi
  • Boldly splitting infinitives
  • ESC Silicon Valley 2009: The Fashion Parade