lots of little pieces

Observations and opinions from a software guy about embedded systems, especially virtualization and partitioning.

2 December, 2013

Ever had to work closely with someone else to debug a problem? Maybe there’s a nasty bug that Alice can’t reproduce, but Bob seems to hit it every time. In large software projects, Alice and Bob might not even work for the same company: one company doing integration reports problems with a component that another company built… but problems that are only visible in the integrated system.

In situations like that, it’s often a good approach for Alice to use trace analysis in Sourcery CodeBench to understand the problem. If Alice could directly access Bob’s system over the network, Sourcery CodeBench would automate collection of the trace data, and then she could use Sourcery Analyzer’s powerful trace analysis features to debug the problem. However, Bob is behind a different firewall, and he doesn’t know how to collect the trace data. What to do?

In the old days, Bob would need to knuckle down and learn how to manually collect the data Alice needs, and probably do it repeatedly (and without typos) as Alice narrows in on the problem. Bob would be annoyed at the time commitment, Alice would get frustrated at the errors and slow turnaround, and they’d be unhappy with each other.

No longer. Now Alice can use Sourcery CodeBench to generate a standalone script that takes care of all the trace data collection. She emails it to Bob, Bob runs it and emails the results back, and everybody’s happy.

Oh, and the bug gets fixed faster. What’s not to like?

Try it out now as part of a free evaluation of Sourcery CodeBench!


31 August, 2012

I was in San Diego yesterday for the Tracing Summit, part of the Linux Plumbers’ Conference. Topics included SystemTap, ftrace, perf, and GDB tracepoints… and of course LTTng.

Even better than putting developers from all those projects in the same room, the morning featured a couple of presentations from tracing users: Frank Rowand and Vinod Kutty, who use tracing on embedded systems and financial trading servers, respectively. The danger of inward-facing development conferences is that design decisions get made based on what developers think the end-users want… but talking to those end-users directly is essential. Kudos to the organizers, Dominique Toupin and Mathieu Desnoyers, for arranging those talks.

So what did they have to say?

  • the term “embedded” spans the range from low-end cameras to wall-sized video editing consoles (this isn’t news to anybody doing embedded development)
  • embedded devices are often very storage-constrained, so they can’t hold a lot of trace data, and CPU-constrained, so the tracing needs to be low-overhead
  • desirable things to trace in embedded systems include power consumption, memory usage, and boot time
  • high-frequency trading is about minimizing latency (on the order of microseconds), to the point that they disable power-saving because CPU speed transitions take too long
  • recent technologies like RDMA, OpenOnload, and DPAA are bypassing the kernel, so tracing solutions must cover userspace too

Some additional details were recorded in an etherpad (a communal note-taking page). Slides should eventually be available on the Tracing Summit wiki page.


30 March, 2011

Mentor recently shared OpenMCAPI, our MCAPI implementation, with the world under an open source (BSD) license. We’re proud of it because it was designed especially for portability, and while it’s sophisticated enough to offer advanced features like asynchronous communication, it’s simple enough to be easily understood and deployed on low-resource embedded systems.

So what is MCAPI? We’ve written about it before, published whitepapers and presented webinars, but basically it’s an IPC library designed for closely-distributed systems (think AMP with shared memory). OK, actually it’s a specification and there can be many implementations, but we’re pretty fond of ours. Using MCAPI, you can communicate between two different operating systems, such as Linux or Android and an RTOS, on a multi-core processor.
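To give a feel for the programming model, here’s a rough sketch of one node sending a message to another. Fair warning: this is written from memory and loosely follows the MCAPI 1.0 flavor of the API; function names, argument lists, and even the header name vary between spec revisions and implementations, so treat it as pseudocode and check the OpenMCAPI headers before copying anything.

#include <mcapi.h>   /* header name may differ; see the OpenMCAPI tree */

void send_hello(void)
{
    mcapi_version_t version;
    mcapi_status_t status;
    mcapi_endpoint_t local, remote;
    char msg[] = "hello from node 0";

    /* Join the MCAPI "fabric" as node 0. */
    mcapi_initialize(0, &version, &status);

    /* Create our local endpoint (port 1), then look up the remote
     * endpoint (node 1, port 1); the lookup blocks until the other
     * side has created it. */
    local = mcapi_create_endpoint(1, &status);
    remote = mcapi_get_endpoint(1, 1, &status);

    /* Connectionless message send; the other side would call
     * mcapi_msg_recv() on its own endpoint. */
    mcapi_msg_send(local, remote, msg, sizeof(msg), 1, &status);

    mcapi_finalize(&status);
}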

OpenMCAPI includes only Linux OS support right now, but obviously we’ve run it on our Nucleus RTOS as well, and adapting it to other RTOSes or software environments should be quite simple. Our choice of the BSD open source license allows easy integration with the wide variety of software that can be found in embedded systems. However, unlike some open source announcements, this isn’t just an “as-is” code drop; we continue to maintain and improve the project, and our developers regularly participate in discussions on the mailing list.

If this technology sounds like it might fit your application’s needs, take a look at our wiki, download the code, and join the mailing list with any questions, feature proposals, or (even better) patches. See you there!


11 September, 2010

ARM Ltd recently unveiled the virtualization capabilities in ARMv7-A, and they are impressive. Taking a step back though, here’s what impresses me the most: these guys jumped in with both feet.

Consider for a moment the very measured approach Intel took with their virtualization extensions to the x86 architecture. They started by adding privilege modes beyond the traditional “ring” model. A later generation added nested MMU support, allowing guests to manipulate their page tables directly without host software intervention. Later still came an IOMMU, allowing direct guest access to IO devices without sacrificing isolation.

In contrast, ARM is doing all of these, and more, in their very first foray into hardware virtualization.

If you talk to me long enough, you’ll undoubtedly hear me say something about priorities and engineering tradeoffs. In the real world, real sacrifices must be made to do almost anything new, and adding virtualization support to hardware is another good example. Numerous projects have demonstrated that one can virtualize the ARM architecture without hardware support, and in those cases the engineering tradeoffs boil down to performance vs isolation vs code changes. ARM is dramatically simplifying that equation, and while that’s great for us software people, they’re paying a price in hardware: design complexity, die size and cost, power consumption, and the all-important verification process. Making life easier for the software comes at a steep cost elsewhere in the ecosystem, in the hope of a net benefit to the overall system.

Of course, having the hardware capabilities is just the start; you also need software to drive it. That’s where it starts to get fun for me…


10 August, 2010

Some people believe code should be self-evident and that, like debuggers, comments in source code are a crutch for the weak developer. I am not one of those people. Commenting code is good. Comments not only help other people understand my code, they also help me understand my code when I re-read it six months later.

But sometimes I see comments that make me cringe. Some examples:

/* Check to see if this header file has been included already.  */
#ifndef _FOO_
#define _FOO_
...
#endif

Any programmer who’s taken Introduction to C 101 should know this construct. If they have not taken the class, they should not be editing this code. So depending on the reader, the comment is either obvious or isn’t nearly helpful enough; either way it’s not useful.

/* This routine implements a spinlock to acquire the lock of a shared resource present in the system. */

If the function name weren’t enough (“foo_spinlock”), the pedantic author in me wants to know why all the extra words were deemed important enough to include. After all, effective communication is supposed to be concise, so does that mean there is another function somewhere that acquires a lock that isn’t for a shared resource? Or a shared resource that isn’t present in the system?

Comments like these don’t help the reader, but they do hurt in a few ways. First, they waste the author’s time; in my experience, people spend as much time reformatting the whitespace to make such comments pretty as they spend writing them. That’s basic opportunity cost. Second, they decrease code density. The benefit of a large display to a programmer is being able to see more code at once, which means not having to remember what that function returns in case of an error; it’s right in front of them. These days we have wonderful displays with resolutions that can show us lots of code at once, and now we’re filling them with wasted pixels.

What would be better? In the last example, a brief note about how it differs from a normal spinlock would be helpful. For example, “This spinlock uses the legacy ARMv5 swp instruction” could actually help a future reader who’s not too familiar with the ARM architecture, but is trying to figure out why code that worked fine on the old core doesn’t work on the new one. Or information like “Lock instruction sequence copied from Power ISA 2.06” tells the reader that the existing or missing barrier instructions are probably correct because they came from an authoritative source.
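To make that concrete, here’s a sketch of the kind of comment I’m arguing for (the function and its contents are hypothetical, not lifted from any real codebase): it records a design fact the reader can’t recover from the name alone.

/*
 * Minimal test-and-set spinlock.  Built on the GCC __sync builtins:
 * __sync_lock_test_and_set() acts as an acquire barrier and
 * __sync_lock_release() as a release barrier, so no explicit barrier
 * instructions are needed here.  If you port this to intrinsics with
 * weaker ordering, you get to add them yourself.
 */
static inline void foo_spinlock(volatile int *lock)
{
    /* Spin until we're the one that changed *lock from 0 to 1. */
    while (__sync_lock_test_and_set(lock, 1))
        ;
}

static inline void foo_spinunlock(volatile int *lock)
{
    __sync_lock_release(lock);   /* stores 0 with release semantics */
}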

I think that’s really the trick: when writing to communicate, you need to figure out who your audience is. The examples I showed above seemed to assume a lower level of competence in the reader than is useful. If the reader doesn’t know basic C, or doesn’t already know why to use a spinlock, a comment isn’t going to teach them. Instead, you need to assume a base level of competence, and once you do that, your comments will become far more valuable.


4 August, 2010

There has been a lot of chatter recently about deploying ARM servers in the data center. This page, for example, chronicles the formation of a small startup devoted to creating such systems. Most speculation about the Microsoft ARM license includes this concept as well.

I joined IBM as a PowerPC Linux developer in 2001, a time when IBM was trying to use Linux to expand the market for their pSeries (now “System p”) servers. The hardware had some advantages (and disadvantages) compared to Intel servers of the day, but AIX was a limiting factor and Linux was a small but quickly growing ecosystem. It didn’t have a lot of legacy software, and it could be co-opted with minimal effort: port and maintain the Linux kernel and toolchain, and that whole juicy open source world is just a recompile away.

Although most open source software was indeed just a recompile away, it turns out that wasn’t good enough. For a long time, Linux on pSeries suffered from the lack of key enterprise applications, such as Oracle, SAP, and even IBM’s own DB2 and WebSphere suites. (In hindsight, it’s hard to fault those software providers for their reticence to port to yet another tiny platform: even when the code is just a recompile away, there is a lot more to launching a product.) As recently as 2008, IBM launched PowerVM Lx86, which performs runtime binary translation to run Linux x86 apps on Linux PowerPC servers. This was an explicit acknowledgment that PowerPC hardware sales were still being lost due to an incomplete software ecosystem. That was near the height of the PowerPC development ecosystem, when low-cost high-powered workstations were still generally available to developers.

I think the parallels with ARM are clear. Let’s say ARM servers would have advantages (power consumption) and disadvantages (poor single-thread performance, missing enterprise IO adapters) compared to traditional server hardware. Let’s also say that the ARM Linux infrastructure (kernel, toolchain, etc) is good enough or can be improved to the point that all the open source code out there “just works.” That’s still not good enough for the data center. It may not even be good enough for in-house software without dependencies on the hardware instruction set.

To be sure, there are specific use cases that could work really well for ARM servers (or MIPS servers, for that matter). Highly parallel applications (the web-serving tier, for example) can run on large clusters of relatively slow cores, especially if existing open source software is all that’s needed. Job postings suggest that a number of big-name companies are toying with the idea. It could happen. But ecosystems take a very long time to grow, and while we may indeed “see something out there” in less than a year, it will make far more ripples in the trade press than in the data center.


21 July, 2010

POSIX signals have a long history and at least a couple of unpleasant limitations. For one thing, with some threading implementations (those with fewer processes than threads) you can’t reliably target a specific thread as a signal recipient. However, luckily for me, that is not my problem.

My problem is both organizational and technical. Signal dispositions are shared by the entire process (and a new thread inherits its creator’s signal mask), which means that handling or masking a signal in your code may unintentionally impact code elsewhere in your application that expected signal delivery to work. This could in theory affect any codebase written by more than one person, but it really becomes an issue when your process uses code written by third parties.

The Background

We recently did some work to enable Android applications to use our MCAPI library. Most Android developers work with Java, with each application running in its own Dalvik virtual machine. However, our MCAPI library is “native code” (i.e. C, not Java), and for that Android uses its own C library, called “bionic”, and its own threading implementation. The first problem is that bionic doesn’t implement one of the POSIX thread APIs: pthread_cancel().

As it so happens, we use pthreads in MCAPI for internal control messages. When the user de-initializes MCAPI, we need to shut those threads down, and so on Linux we ordinarily use pthread_cancel(). Since that’s unavailable in an Android environment, we implemented our own by sending a signal to wake our control thread. The thread is typically blocked in the kernel waiting for a hardware interrupt, so a signal causes it to be scheduled again, at which point it notices it should exit. Not a lot of code; tested on Linux and worked great; problem solved. When we ran it on Android though, it did nothing at all.
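For reference, here’s roughly what that workaround looks like, stripped down and with hypothetical names; this is a sketch of the mechanism, not the actual MCAPI code.

#include <pthread.h>
#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t shutdown_requested;

static void wakeup_handler(int sig)
{
    (void)sig;   /* nothing to do; the point is interrupting the syscall */
}

static void *control_thread(void *arg)
{
    (void)arg;
    while (!shutdown_requested) {
        /* Normally blocked in the kernel waiting for work (think a
         * read() on a device fd).  A signal makes the blocking call
         * return early so the loop condition gets re-checked. */
        pause();   /* stand-in for the real blocking call */
    }
    return NULL;
}

void cancel_control_thread(pthread_t tid)
{
    struct sigaction sa = { .sa_handler = wakeup_handler };

    sigemptyset(&sa.sa_mask);
    /* Deliberately no SA_RESTART, so blocking syscalls return EINTR
     * instead of transparently restarting. */
    sigaction(SIGUSR1, &sa, NULL);

    shutdown_requested = 1;
    pthread_kill(tid, SIGUSR1);   /* wake the thread so it can exit */
    pthread_join(tid, NULL);
}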

Remember how signals are process-wide? Well, as it turns out, Dalvik uses some signals for itself, including the signal we chose for MCAPI: SIGUSR1. When it came time to kill our thread, we sent the signal… but unbeknown to us, Dalvik code elsewhere in the application had masked SIGUSR1. Our thread never woke up and never exited.

The Solution (?)

The fix? Use SIGUSR2 instead. Works great; problem solved. ;) Longer term, though, there’s no guarantee that Dalvik won’t start using that too, or that the application will link with some other library that (like us) tries to use SIGUSR2. Since there is no standard API to request and reserve signal numbers, conflicts seem inevitable.

So what to do? The best general solution I can come up with is one that embedded software developers should be familiar with: punt the problem to the integrator. The developer who builds the application on top of our library should be able to configure MCAPI to use an arbitrary signal, which they ensure (through code inspection) won’t conflict with the rest of the application and its libraries. (Sure hope their third-party libraries come with source code.)

That doesn’t feel very satisfying to me either.


24 June, 2010

When I learned Python I became enamored with the idiom of dispatching, which looks something like this:

name = "foo"
function = getattr(object, "prefix_" + name)
function()

In this way we can call object.prefix_foo() without big switch statements or if/else if/else if constructs.

Of course, I usually program in C. While we can’t do exactly the same thing there, the closest analogy is the function pointer:

void (*func_ptr)(void) = foo;
func_ptr();

If you know whether func_ptr should be foo or bar ahead of time, the above code is far more elegant than this:

if (condition_a)
    foo();
else if (condition_b)
    bar();
...

The Test

I’ve always heard that branches are to be avoided at all costs, since on some processors even a correctly-predicted branch can still stall the pipeline. A switch statement can be implemented by the compiler as a series of compare and conditional branches, and that would compound the problem. On the other hand, an indirect function call could be more difficult for branch prediction, and of course still requires that branch even when predicted correctly. So I ran some completely artificial and unscientific tests to see what the performance break-even point is on a sampling of real hardware: how many conditional branches can you use before it would be faster (and prettier) to use a function pointer?

I wrapped a conditional function call (as shown above) in a loop of one billion iterations, roughly the shape sketched below, and averaged the runtime of that executable over a number of runs. I found some surprises.
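A minimal sketch of the kind of harness, not the actual test code: build once with -DUSE_FUNCTION_POINTER and once without, and remember that a sufficiently clever compiler can defeat a toy loop like this, so check the generated assembly.

#include <stdio.h>
#include <time.h>

#define ITERATIONS 1000000000UL

static unsigned long counter;

static void foo(void) { counter += 1; }
static void bar(void) { counter += 2; }

int main(int argc, char **argv)
{
    /* Derive the condition from argc so the compiler can't fold it away. */
    int condition_a = (argc == 1);
    void (*func_ptr)(void) = condition_a ? foo : bar;
    struct timespec start, end;
    unsigned long i;

    (void)argv;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < ITERATIONS; i++) {
#ifdef USE_FUNCTION_POINTER
        func_ptr();                     /* indirect call */
#else
        if (condition_a)                /* conditional direct calls */
            foo();
        else
            bar();
#endif
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    printf("counter=%lu time=%.3fs\n", counter,
           (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9);
    return 0;
}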

Surprises

One surprise: on an Intel Core2, there’s almost no performance difference between using a function pointer and even a single conditional.

Another surprise was how much variation there was in the numbers. In the code sequence above, I expected that a run where condition_a was true would be faster than one where only condition_b was true. Not always true. I expected that runs with fewer conditions would always be faster than runs with more conditions. Again, not always true, thanks to differences in the instruction sequences generated by the compiler and to cache effects.

Summary

I won’t post the full data here because it’s so informal and hasn’t been subjected to rigorous performance analysis. The summary though is this: talking only about performance, after only 3 or so conditions on a Freescale PowerPC e500v2 core, function pointers are worth it. They’re always worth it on a Core2. If you’re curious, try your own tests and let me know how your core behaves. I’d love to know if there’s a reason not to heavily adopt function pointers in all my code…

(Oh, and as for comparison between a non-conditional direct branch vs an indirect function pointer: on the e500v2, the function pointer added about 33% overhead. The difference was negligible on the Core2.)


22 March, 2010

I noticed an interesting virtualization article at EETimes today. Some good general points were made, but I do have a couple of comments…

This “type 1” and “type 2” (a.k.a. “bare-metal” and “hosted”) distinction is very popular in presentations and trade magazines. However, as a colleague has previously pointed out, it is also an academic issue and has nothing to do with the properties of actual virtualization implementations. Why is it so popular, then? I think it’s because it provides a taxonomy for a very complicated area, and that brings (false) comfort to people confronted with virtualization for the first time. It also provides a nice straw man for marketing material: “in contrast to those Type 2 hypervisors, ours is Type 1 and therefore much better!”

Also, can we all agree that binary translation has almost no applicability to embedded systems? The performance and memory-footprint tradeoffs are quite large, and of course it throws any sort of determinism out the window. The only hypervisors that employ binary translation are VMware’s products on older hardware, and I’ve never heard anybody bring those up in the context of embedded systems… and I’ve heard a lot of far-flung embedded virtualization use cases!

In the article, isolation gets just a single paragraph of discussion, in which only aerospace and defense (A&D) applications are cited. There is a lot to be said about the engineering tradeoffs around isolation in virtualization, but I will just note that in most embedded systems, hardware vendors have seen fit to provide isolation enforcement mechanisms, and software designers have seen fit to use them, in markets far more diverse than A&D. I personally have been very thankful for those properties for bringup and debugging, regardless of vertical market.

At this point in the virtualization hype/adoption curve, I would like to see meatier discussion of these issues. It may be my skewed perspective, but I think most systems designers have heard something about virtualization by now, and it’s time to move the discussion from the abstract (“you should think about performance”) to specific technology issues that apply to real use cases.

That said, I do welcome all efforts to help dispel the myth that virtualization can solve everybody’s problems without drawbacks.
