Fun with toolchain versions

Upgrading Your Toolchain

I am often reminded by my customers that changing toolchain versions can be fraught with peril. Indeed, as any software project (open or closed source) progresses, new “features” inevitably render old ones obsolete, or worse. One of the most common issues you might encounter with a new toolchain version is the class of compile errors caused by warnings being promoted (or would that be demoted 😉 ) to errors in newer versions of gcc. The gcc developers seem to be trying to force good coding practice by converting some classes of warnings to errors. In this case, the application developer has two possible courses of action: fix the code, or use a -W compiler option to stop treating the warning as an error. We’ll call these the right way and the lazy way!
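To make the two options concrete, here is a minimal sketch (the file name is hypothetical, and exactly which warnings get promoted varies by gcc version and by the flags your build system passes):

/* demote.c: with "gcc -Wall -Werror demote.c", the unused-variable
 * warning below is promoted to a hard error.
 */
int main(void)
{
    int unused = 0;   /* triggers -Wunused-variable under -Wall */
    return 0;
}

/* The right way: delete the unused variable (fix the code).
 * The lazy way: demote just that one error back to a warning:
 *     gcc -Wall -Werror -Wno-error=unused-variable demote.c
 */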

Then there are the inevitable compile-time errors due to header file changes. This is another common category of error when a new toolchain is applied to an existing codebase, especially when the codebase in question is large and has been around for some time. Often these changes are due simply to changes in the C/C++ standards. These errors are usually fairly easy to locate and resolve, though they are compounded when moving to a newer version of glibc along with the newer compiler.
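Here is a sketch of one classic pattern (the specifics are hypothetical, but the failure mode is common): newer toolchain headers often stop including other headers transitively, so code that used to compile by accident suddenly fails with an implicit-declaration error until the missing include is added.

/* On an older toolchain this may have compiled because some other
 * header happened to pull in <string.h> indirectly; with a newer
 * toolchain's cleaned-up headers, memcpy() is suddenly undeclared.
 */
#include <stdio.h>
#include <string.h>   /* the fix: include what you use directly */

int main(void)
{
    char dst[8];

    memcpy(dst, "hello", 6);   /* copies the string plus its NUL */
    printf("%s\n", dst);
    return 0;
}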

The Really Tough Issues

The most insidious errors are the runtime errors with no obvious cause. An application that worked when compiled with one gcc version crashes or otherwise fails when compiled with a later one. Frequently there are no unusual compiler warnings, no runtime diagnostic messages, etc. Small changes in compiler behavior can introduce subtle runtime issues, or expose a software bug that has been lurking undetected. I recently encountered one such case.

I had been working on a side project to get a Yocto Project Poky image running on my old Dell Mini 10. The platform has an Intel Atom Z530 processor and a graphics controller (GMA500/Poulsbo) based on PowerVR technology. Let’s just say I don’t think this graphics controller has enjoyed the attention from developers that some of the more popular ones have. I’ve had a lot of difficulty getting graphics to work on this platform with modern distributions such as Ubuntu.

My first Poky build for this platform was remarkably trouble-free. I added the meta-intel and meta-emenlow layers and built for MACHINE=emenlow. This booted on my Z530, but when the X server started, it crashed and left the display unreadable. It looked like an old black-and-white TV with the horizontal sync out of adjustment. Thank goodness for the dropbear ssh server and ssh logins!

Xorg Crashes

Following that exercise, I decided to try another build based on Mentor Embedded Linux 5 technology (our Yocto Project-based product), using our commercially supported gcc 4.6 toolchain, Sourcery CodeBench 2011.09-101 (http://www.mentor.com/embedded-software/sourcery-tools). To my surprise, this time it booted to a nice pretty Sato user interface screen. After some trial and error, I determined that the Xorg binary itself was to blame. Xorg compiled with gcc 4.7 crashed and left the display unusable, but the exact same source compiled with Sourcery CodeBench 2011.09 (gcc 4.6) produced a working Xorg binary. I’ll say right up front that the 4.7 compiler was not to blame. Read on.

I started by comparing the numerous compiler warnings from Xorg (way too many to even count, but that’s another story!), but that yielded nothing interesting. Just as I was about to give up, Gary Thomas posted a patch to the oe-core mailing list with a fix to xserver-kdrive. The subject line mentioned xserver and gcc 4.7, and the content looked promising enough that it could in fact be the issue I was facing. If you want the gory details, the patch came from this bug: https://bugs.freedesktop.org/show_bug.cgi?id=18451. Let’s look at the offending code:

This listing is shortened for the purposes of this discussion, with uninteresting lines removed and indicated by an ellipsis:

int XaceHook(int hook, ...)
{
    pointer calldata;   /* data passed to callback */
    int *prv = NULL;    /* points to return value from callback */
    va_list ap;         /* argument list */
    va_start(ap, hook);
    ...
    switch (hook)
    {
        case XACE_RESOURCE_ACCESS: {
            XaceResourceAccessRec rec;
            rec.client = va_arg(ap, ClientPtr);
            rec.id = va_arg(ap, XID);
            ...
            calldata = &rec;
            prv = &rec.status;
            break;
        }
            ...
    }
    /* call callbacks and return result, if any. */
    CallCallbacks(&XaceHooks[hook], calldata);
    return prv ? *prv : Success;
}

Notice the local stack variable declared inside the case statement, the rec structure:

XaceResourceAccessRec rec;

This variable is only valid in the scope in which it is declared, that is, between the braces of the case statement. In that same scope, the structure is filled in, and a pointer to it is stored in the variable calldata. (Each case in this switch construct has similar logic.) Later, near the end of the function and outside the switch statement, calldata is passed to the callback function via CallCallbacks(). By that time, the pointer stored in calldata is invalid, because the structure it points to is out of scope.

We can only presume that some subtle change in compiler behavior exposed this coding error in gcc 4.7, while in gcc 4.6 the bug did not produce a runtime error. Technically, the compiler is free to reuse the memory (in this case, stack memory) that held the structure once that variable goes out of scope, and we assume gcc 4.7 did exactly that. With 4.6, for reasons that elude me, we got away with it. I am the farthest thing from a compiler expert, but it isn’t hard to imagine that as compilers get better at what they do, a bug like this can be exposed.
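To make the failure mode concrete, here is a minimal, self-contained sketch of the same pattern (the names are hypothetical, and this is not the actual Xorg patch, which you can read in the bug report above): a pointer to a block-scoped variable escapes its scope, and whether the stale read “works” depends entirely on how the compiler chooses to reuse stack slots.

#include <stdio.h>

static void call_callback(const int *data)
{
    printf("callback saw %d\n", *data);
}

static void broken(int hook)
{
    int *calldata = NULL;

    switch (hook) {
    case 0: {
        int rec = 42;        /* lifetime ends at the closing brace */
        calldata = &rec;     /* the pointer escapes the block scope */
        break;
    }
    }

    /* Undefined behavior: rec is out of scope, and the compiler is
     * free to have reused its stack slot by now. One compiler version
     * may appear to work; another may not. */
    call_callback(calldata);
}

static void fixed(int hook)
{
    int rec = 0;             /* hoisted to function scope */
    int *calldata = NULL;

    switch (hook) {
    case 0:
        rec = 42;
        calldata = &rec;
        break;
    }

    call_callback(calldata); /* fine: rec is still in scope here */
}

int main(void)
{
    broken(0);
    fixed(0);
    return 0;
}

The repair shown in fixed() mirrors the spirit of the XaceHook() fix: give the object a lifetime at least as long as every pointer that refers to it.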

Summary

One day soon, I’m going to count the actual individual compile and link operations that go into building a typical embedded Linux distribution; suffice it to say that it’s probably in the tens of thousands. That starts to illustrate the scope of the problem, and the potential for failure when anything changes. A purely non-scientific observation shows that the Poky-derived Sato image on which this article was based contains somewhere around 135,000 C files, performs around 450 compile tasks at the recipe level, and logs over 70,000 calls to the compiler. (And you wondered why it took so long to build?)

If you ever wondered why development organizations loathe the thought of changing compilers, this example might help clear that up. Serendipity led me to the solution of the problem described in this article. Had I not gotten lucky, it could easily have taken days, weeks, or worse to resolve, especially since I have no particular expertise with Xorg or X servers in general.

The toolchain (along with the C runtime library) is the foundation of your project. The best advice I can offer is to choose wisely when making this decision.
