lots of little pieces
Even better than putting developers from all those projects in the same room, in the morning there were a couple presentations from tracing users: Frank Rowand and Vinod Kutty who use tracing on embedded systems and financial trading servers, respectively. The danger of inward-facing development conferences is that design decisions can be made based on what developers think the end-users want… but talking to those end-users directly is essential. Kudos to the organizers, Dominique Toupin and Mathieu Desnoyers, for arranging for those talks.
So what did they have to say?
- the term “embedded” spans the range from low-end cameras to wall-sized video editing consoles (this isn’t news to anybody doing embedded development)
- embedded devices are often very storage-constrained, so they can’t hold a lot of trace data, and CPU-constrained, so the tracing needs to be low-overhead
- desirable things to trace in embedded systems include power consumption, memory usage, and boot time
- high-frequency trading is about minimizing latency (on the order of microseconds), to the point that they disable power-saving because CPU speed transitions take too long
- recent technologies like RDMA, OpenOnload, and DPAA are bypassing the kernel, so tracing solutions must cover userspace too
ARM Ltd recently unveiled the virtualization capabilities in ARMv7-A, and they are impressive. Taking a step back though, here’s what impresses me the most: these guys jumped in with both feet.
Consider for a moment the very measured approach Intel took with their virtualization extensions to the x86 architecture. They started by adding privilege modes beyond the traditional “ring” model. A later implementation added nested MMU support, allowing guests to directly manipulate their page tables without host software intervention. A later implementation added an IOMMU, allowing direct guest access to IO devices without sacrificing isolation.
In contrast, ARM is doing all of these, and more, in their very first foray into hardware virtualization.
If you talk to me long enough, you’ll undoubtedly hear me say something about priorities and engineering tradeoffs. In the real world, real sacrifices must be made to do almost anything new, and adding virtualization support to hardware is another good example. Numerous projects have demonstrated that one can virtualize the ARM architecture without hardware support, and in those cases the engineering tradeoffs boil down to performance vs isolation vs code changes. ARM is dramatically simplifying that equation, and while it’s great for us software people, they’re paying a price in hardware: design complexity, die size and cost, power consumption, and the all-important verification process. Improving life for the software comes at a steep cost elsewhere in the ecosystem, in the hope of a net benefit to the overall system.
Of course, having the hardware capabilities is just the start; you also need software to drive it. That’s where it starts to get fun for me…
Some people believe code should be self-evident and that, like debuggers, comments in source code are a crutch for the weak developer. I am not one of those people. Commenting code is good. Comments not only help other people understand my code, they also help me understand my code when I re-read it 6 months later.
But sometimes I see comments that make me cringe. Some examples:
/* Check to see if this header file has been included already. */
Any programmer who’s taken Introduction to C 101 should know this construct. If they have not taken the class, they should not be editing this code. So depending on the reader, the comment is either obvious or isn’t nearly helpful enough; either way it’s not useful.
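For reference, the construct in question is an ordinary include guard; a quick sketch (file and macro names made up for illustration) shows how little the comment adds:

/* foo.h */
#ifndef FOO_H
#define FOO_H

int foo_init(void);

#endif /* FOO_H */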
/* This routine implements a spinlock to acquire the lock of a shared resource present in the system. */
If the function name weren’t enough (“foo_spinlock”), the pedantic author in me wants to know why all the extra words were deemed important enough to include. After all, effective communication is supposed to be concise, so does that mean there is another function somewhere that acquires a lock that isn’t for a shared resource? Or a shared resource that isn’t present in the system?
Comments like these don’t help the reader, but they do hurt in a few ways. First, they waste the time of the author, who in my experience spends as much time reformatting the whitespace to make it pretty as they spend writing it. This is basic opportunity cost. Second, they decrease code density. The benefit of a large display to a programmer is that they can see more code at once, which means they don’t have to try to remember what that function returns in case of an error; it’s right in front of them. These days we have wonderful displays with resolutions that can show us lots of code at once, and now we’re filling them with wasted pixels.
What would be better? In the last example, a brief note about how it differs from a normal spinlock would be helpful. For example, “This spinlock uses the legacy ARMv5 swp instruction” could actually help a future reader who’s not too familiar with the ARM architecture, but is trying to figure out why code that worked fine on the old core doesn’t work on the new one. Or information like “Lock instruction sequence copied from Power ISA 2.06” tells the reader that the existing or missing barrier instructions are probably correct because they came from an authoritative source.
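As a sketch of the difference (the function and its swp-based lock are made up for illustration, not taken from any real project), compare:

/* Spin until we atomically swap a 1 into *lock and read back the old value 0.
 * This uses the legacy ARMv5 swp instruction, so it will not work on newer
 * cores where swp is deprecated or disabled. */
static void foo_spinlock(volatile unsigned int *lock)
{
        unsigned int old;

        do {
                __asm__ __volatile__("swp %0, %1, [%2]"
                                     : "=&r" (old)
                                     : "r" (1), "r" (lock)
                                     : "memory");
        } while (old != 0);
}

That comment earns its keep because it records a constraint the code itself can’t express.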
I think that’s really the trick: when writing to communicate, you need to figure out who your audience is. The examples I showed above seemed to assume a lower level of competence in the reader than is useful. If the reader doesn’t know basic C, or doesn’t already know why to use a spinlock, a comment isn’t going to teach them. Instead, you need to assume a base level of competence, and once you do that, your comments will become far more valuable.
There has been a lot of chatter about deploying ARM servers in the data center recently. This page, for example, chronicles the formation of a small startup devoted to creating such systems. Most speculation about the Microsoft ARM license includes this concept as well.
I joined IBM as a PowerPC Linux developer in 2001, a time when IBM was trying to use Linux to expand the market for their pSeries (now “System p”) servers. The hardware had some advantages (and disadvantages) compared to Intel servers of the day, but AIX was a limiting factor and Linux was a small but quickly growing ecosystem. It didn’t have a lot of legacy software, and it could be co-opted with minimal effort: port and maintain the Linux kernel and toolchain, and that whole juicy open source world is just a recompile away.
Although most open source software was indeed just a recompile away, it turns out that wasn’t good enough. For a long time, Linux on pSeries suffered from the lack of key enterprise applications, such as Oracle, SAP, and even IBM’s own DB2 and WebSphere suites. (In hindsight, it’s hard to fault those software providers for their reticence to port to yet another tiny platform: even when the code is just a recompile away, there is a lot more to launching a product.) As recently as 2008, IBM launched PowerVM Lx86, which performs runtime binary translation to run Linux x86 apps on Linux PowerPC servers. This was an explicit acknowledgment that PowerPC hardware sales were still being lost due to an incomplete software ecosystem. That was near the height of the PowerPC development ecosystem, when low-cost high-powered workstations were still generally available to developers.
I think the parallels with ARM are clear. Let’s say ARM servers would have advantages (power consumption) and disadvantages (poor single-thread performance, missing enterprise IO adapters) compared to traditional server hardware. Let’s also say that the ARM Linux infrastructure (kernel, toolchain, etc) is good enough or can be improved to the point that all the open source code out there “just works.” That’s still not good enough for the data center. It may not even be good enough for in-house software without dependencies on the hardware instruction set.
To be sure, there are specific use cases that could work really well for ARM servers (or MIPS servers, for that matter). Highly parallel applications (web serving tier 1) can be run on large clusters of relatively slow cores, especially if existing open source software is all that’s needed. Job postings suggest that a number of big-name companies are toying with the idea. It could happen. But ecosystems take a very long time to grow, and while we may indeed “see something out there” in less than a year, it will make far more ripples in the trade press than in the data center.
POSIX signals have a long history and at least a couple unpleasant limitations. For one thing, with some threading implementations (those with fewer processes than threads) you can’t reliably target a specific thread as a signal recipient. However, luckily for me, that is not my problem.
My problem is both organizational and technical. Signal dispositions are shared by the entire process, and signal masks are inherited when threads are created, which means that masking a signal in your code may unintentionally impact code elsewhere in your application that expected signal delivery to work. This could in theory affect any codebase written by more than one person, but it really becomes an issue when your process uses code written by third parties.
We recently did some work to enable Android applications to use our MCAPI library. Most Android developers work with Java, with each application running in its own virtual machine. However, our MCAPI library is “native code” (i.e. C, not Java), and for that Android uses its own C library called “bionic” and its own threading implementation. The first problem is that bionic doesn’t implement one of the POSIX thread APIs: pthread_cancel().
As it so happens, we use pthreads in MCAPI for internal control messages. When the user de-initializes MCAPI, we need to shut those threads down, and so on Linux we ordinarily use pthread_cancel(). Since that’s unavailable in an Android environment, we implemented our own by sending a signal to wake our control thread. The thread is typically blocked in the kernel waiting for a hardware interrupt, so a signal causes it to be scheduled again, at which point it notices it should exit. Not a lot of code; tested on Linux and worked great; problem solved. When we ran it on Android though, it did nothing at all.
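For the record, the Linux version amounts to roughly this (a minimal sketch with made-up names, not the actual OpenMCAPI code; pause() stands in for the real blocking call):

#include <pthread.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

#define WAKE_SIGNAL SIGUSR1     /* the signal we originally picked */

static volatile sig_atomic_t should_exit;

static void wake_handler(int sig)
{
        (void)sig;      /* nothing to do; the point is to interrupt the blocking call */
}

static void *control_thread(void *arg)
{
        (void)arg;
        while (!should_exit)
                pause();        /* stands in for blocking in the kernel on an interrupt;
                                   a real implementation would also have to close the race
                                   between the check and the block, e.g. with sigsuspend() */
        return NULL;
}

int main(void)
{
        pthread_t tid;
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = wake_handler;
        sigemptyset(&sa.sa_mask);
        sigaction(WAKE_SIGNAL, &sa, NULL);      /* install a handler so the signal isn't fatal */

        pthread_create(&tid, NULL, control_thread, NULL);
        sleep(1);

        /* The pthread_cancel() substitute: set the flag, then poke the thread
         * so it wakes up and notices it. */
        should_exit = 1;
        pthread_kill(tid, WAKE_SIGNAL);
        pthread_join(tid, NULL);
        return 0;
}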
Remember how signals are process-wide? Well, as it turns out, Dalvik uses some signals for itself, including the signal we chose for MCAPI: SIGUSR1. When it came time to kill our thread, we sent the signal… but unbeknown to us, Dalvik code elsewhere in the application had masked SIGUSR1. Our thread never woke up and never exited.
The Solution (?)
The fix? Use SIGUSR2 instead. Works great; problem solved. Longer term though, there’s no guarantee that Dalvik won’t start using that too, or that the application won’t link with some other library that (like us) tries to use SIGUSR2. Since there is no standard API to request and reserve signal numbers, conflicts seem inevitable.
So what to do? The best general solution I can come up with is one that embedded software developers should be familiar with: punt the problem to the integrator. The developer who writes the application using our library should be able to configure MCAPI to use an arbitrary signal, which they ensure won’t conflict with the rest of the application and libraries through code inspection. (Sure hope their third-party libraries come with source code.)
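A configuration hook for that might look something like this (a hypothetical API, not the actual OpenMCAPI interface):

#include <signal.h>

static int mcapi_wake_signal = SIGUSR2;         /* default if the integrator says nothing */

/* Hypothetical hook: the integrator calls this once, before initializing MCAPI,
 * with a signal number they have verified is unused by the rest of the
 * application and its libraries. */
void mcapi_set_wake_signal(int signo)
{
        mcapi_wake_signal = signo;
}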
That doesn’t feel very satisfying to me either.
About lots of little pieces
Observations and opinions from a software guy about embedded systems, especially virtualization and partitioning.