It's all about the software.

31 October, 2011

For Halloween this year I’ll be blogging a real horror story.  And when I say “real”, I mean real – everything you are about to read is the truth.  Not some Spielberg made up scary stuff.  Knowing that it’s real makes it just a bit scarier.  I’ve obfuscated some details to protect the identity of both the innocent and the guilty.

It all started some months ago, I was getting off a train in a European capital city and stepping into a thick swirling fog.  It was close to midnight. As I crossed the station platform I heard an owl hoot and fly away.  Climbing into a taxi, off in the distance, I could hear a wolf’s lonely howl – or could it be a werewolf? A shiver ran down my spine as I knew that I would be rising early in the morning to attend a post mortem. (Cue that creepy music from “The Exorcist”).

Well, not a real “dead body” post mortem, but I was asked to sit in on a review of a project; a project which was quite definitely dead.  Racked with delays, cost overruns, and lots of blame, the project did not have a happy life.  I had been involved in the project from close to its outset, and even early on I could see that it was destined for trouble.  Needless to say, the corpse was not a pretty sight, and this meeting was not going to be a fun one.

The project had two major elements.  But the project teams working on the different elements were not coordinated.  They did not work together at all – in fact, they rarely communicated.  They started off by writing a specification of how the two sides would interface.  The specification was incomplete.  It contained errors, and was out of date by the time it was published.  As the project progressed, no one ever went back to update it – of course, this is inexcusable after the invention of the wiki.  But that wasn’t all that bad, since few people read the spec.  And fewer still even expected it would be complete or accurate.

Upper management decided not to co-locate these engineering organizations.  While both engineering groups had some contributors in far flung corners of the world, the bulk of one group was in a city in the U.S. and the locus of the other was in Europe.  There were many hours of time zone difference, and plenty of cultural differences, between the two.  The two groups did not even start on the project at the same time; one started a full 8 months after the other.

After many staff years of effort on the part of both groups – literally, a multi-million dollar/euro investment by the parent company – the momentous day came when the two parts of the system were brought together to form the whole.  The upper management of the project was shocked – shocked I tell you – that the whole thing did not come together and work the first time.  Shocked that there would be months more development and verification effort before the product could be released.

Here’s where the delays and blame started to pile up. Both sides accused the other of gross failures. Functionality was thrown overboard, as there were features that simply could not be implemented without redesigning major parts of the system.  The delays piled up.  Promised delivery dates came and went.  The marketeers saw their precious market window closing on them.  Everyone wanted someone to blame.  Smart folks were moving on to other projects.

A horror story, no doubt.  You may be wondering how a competent engineering organization could execute so poorly. Surely your own organization would never do anything like this.  But there is one small detail I left out.

One group was designing software and the other was designing hardware.

Yes, lots of systems are still designed this way.  And I hear you saying “Well, that’s just software”.  Sure, software is more malleable (at least in theory).  It’s layered, so you only need to worry about the layer that touches hardware, right?  If you prove that A works with B, and B works with C, no need to verify that A, B and C all work and play well together, right?  And, my favorite: “That’s the software guy’s problem.”  As a hardware developer, you were charged with developing the hardware – and you did it.  If the software team can’t get their job done, well, that’s not your issue.  I’m guessing those shocked upper management types would beg to differ.

In my next blog post I’ll describe why you might want to worry about this, if you’re not already.  There is functionality which is starting to break the traditional boundaries between hardware and software. I’ll describe it, and what you can do about it.


25 May, 2011

…my producer, my agent, and all the little people that made this possible.

Yes, my product won an “Embeddie”. Awesome! Which is about as close as I’m ever going to get to an Emmy. An “Embeddie” is the award that VDC gives to the best new product at the Embedded Systems Conference. They give out one for a hardware product and one for a software product. This year for the software product, they chose Mentor Graphics’ Embedded Systems Division System Analyzer, one of the products I have been spending most of my time on lately.

This is a real honor. I was chosen to accept the award on stage on behalf of the team. And I would like to thank my colleagues who made this possible. First, thanks to the marketing team behind System Analyzer, especially Brad Dixon and Shay Behchorin. Brad and Shay took my geek speak explanations of System Analyzer and translated it into an eloquent description that the editors could grok (like I said – geek speak). Kudos also go to Manfred Kreutzer and his team who patiently listen to my rants and turn my crazy ideas into working product – even when it’s impossible. From what they tell me the impossible just takes longer. Finally, thanks to Glenn Perry and Mark Mitchell for their vision, for giving me this opportunity and trusting me to lead this project.

So what’s so special about Mentor’s System Analyzer? A couple of things.

Steve Wozniak gave a “fireside chat” as one of the Keynotes at the Embedded Systems Conference. As he mused about what makes a great product, he encouraged the folks in the audience to build things that they themselves would want to use. Our System Analyzer truly falls into this category. I actually use it – a lot. And so do all the engineers that work on it. Other groups inside of Mentor have also used it. The other day I was talking about System Analyzer in the hallway, and an engineer overheard and asked if he could use it, too. I gotta agree with the “Woz” on this one. I believe that this makes the difference between a good product and a great one.

But there are lots of profiling tools out there. Windriver has one, Green Hills has one, and there are a plethora of open source projects working on profiling both desktop and embedded systems. The difference our System Analyzer brings to the party is that it’s more than just a pretty face – it has a brain.

Think about the difference between Adobe’s Acrobat and Microsoft’s Excel. “Huh?” you might ask. How about comparing oranges and DSPs? No one would even think to compare Acrobat and Excel – they do completely different things, right? Well, that’s my point. As an example, Adobe’s Acrobat can show you some text and some numbers – say a list of capital cities and their populations, longitude, and latitude. Excel can show you the same thing. But Excel can also total the populations, put them in a bar graph, or a pie chart. It can compute the standard deviation of the populations. Using the longitude and latitude, it can calculate the distance between the cities. And so much more.

Acrobat lets you see. Excel lets you figure out.

That’s what our System Analyzer can do for you. It gives you a really powerful computational capability which can operate on the data that you collect on your system. So you decide what bar graphs you want to see. You’re not stuck with the ones that come pre-packaged with a lesser profiling package. You decide what data matters to you, and you can collect it, correlate it, sort it, filter it, and anything else you want to do.

Where this gets interesting is that you can trace and operate on application data, not just artifacts of the application’s execution. This, in my humble opinion, is the game changer. And I trust that this is what the folks at VDC saw when they chose System Analyzer for the award. This, also, is where we are straying from the conventional into new territory. So to fully explain it will take longer blog posts (and probably a few of them). But don’t worry, Brad will help me translate that geek speak into something comprehensible. Stay tuned.

14 February, 2011

As Yogi Berra once said, “It’s déjà vu, all over again.”

I was out at a customer helping them debug a multi-core system. Unlike my Sudoku solver, this was not a multi-threaded SMP application. The customer had 4 processes, each running on a different core and specifically tied to that core. While all the programs were running under the same operating system, the designer did not rely on the OS to schedule the processes – so this was more like a “managed AMP” type of setup.

As we were trying to hunt down this really elusive nasty bug, which only raised its ugly head occasionally, I was struck by how similar the process was to a debugging session from a few years ago. Funny thing is, back then I was debugging a hardware design written in Verilog.

Is debugging hardware and debugging software becoming the same activity?

Well, no – hardware and software are still very different domains. Designing and debugging them are worlds apart. But there are some striking similarities creeping in, especially when we get to the world of multi-core. When designing and debugging hardware you have lots of things all going on at the same time. Often, functions are implemented as state machines. These state machines will interact, allowing for more complex behaviors than could practically be implemented in a single state machine. Hardware designers have tools which support the implementation, debug, and verification of state machines. There are state machine tools for software developers, too. We even have some here at Mentor. However, I’ve noticed that software folks tend to implement state machines on a more ad-hoc basis.

The programmers who coded up the Linux FTP client seemed to be real fans of state machines, as one can see just by using it. There is a state machine which keeps track of the state of the connection, and a state machine which keeps track of the mode (ASCII or binary); “bell” toggles a state, “prompt” toggles a state, and there are scads of other very simple state machines in this program. All these state machines interact to produce a rich set of behavior. Of course, it makes sense to implement a connection as a collection of state machines – as this is a good way to represent, both to the programmer and the user, the operation of the system.

The problem comes when one needs to validate or debug these state machines. Debugging a single state machine, pretty easy; debugging a collection of them, not so much. As I mentioned, on the hardware side there are tools to support this – even formal verification, a method of proving the correctness of a set of state machines based on more complex math than I was brave enough to attempt in college. There are some simpler tools available, too. One approach which is tremendously valuable in debugging state machines is the waveform trace. One can look at a collection of state variables over time and identify the transitions between states and the interactions between the machines.

As I watch software folks trying to debug state machines implemented in software, more often than not they resort to some type of logging or printfs (see my earlier post on “Guilty Pleasures”). A log of what all the state machines are doing is a good way to start to unravel what’s going wrong.

Of course, I’m getting awfully abstract here. I always find it a lot easier to explain things, and to understand them, with concrete examples. Now I could share with you the example from my aforementioned customer, but if I did – I’m pretty sure he’d be required to kill me (and I’m neither saying nor denying that this customer is in any way related to a government or a security apparatus attached to the said – or um, uh, “unsaid” government – which I didn’t mention. Conspiracy theorists draw your own conclusions, especially if I inexplicably disappear in the next few days). Besides, that example would be way too complicated to follow. And all kidding aside, here at Mentor Graphics we do take the confidentiality of our customers very seriously, and I would never divulge anything learned from any customer in confidence.

So here’s my simplified state machine example, the “Speed” card game. This game is a really interesting exercise in multi-core programming, so for those of you playing along at home, break out your IDEs – this is a fun one. The defining feature of the game is that, unlike most card games, players’ turns do not follow a round robin scheduling algorithm. Each player makes a move whenever they can – even if this means several moves in succession. In the classic game, there are two “play piles” so two players can actually move simultaneously. I’ll give you a brief description of the game; you can read more about it on Wikipedia or review the rules according to Hoyle.

In the game of Speed, each player is dealt 20 cards from a standard deck of cards. From those 20, they take 5 as a “hand” and leave 15 face down as a replenish pile. Two cards are placed face up between the players – we’ll call these the “play piles” – and there are 2 piles of 5 cards, placed face down. These face down cards are turned up on the play piles if neither player has a move, allowing the game to proceed. Players play a card by taking a card from their hand which is one higher or one lower in rank than the card on the top of the play pile. As the player uses the cards in their hand, they replace them from the replenish pile. Play ends when one player runs out of cards or when neither player can move and there are no remaining cards to put on the play pile.

My implementation of this game consists of 4 threads: the main thread that controls everything, 2 player threads, and a referee thread that shuffles, deals, and keeps the game play fair. The referee thread also watches the game to see if both players get stuck; if so, it turns over 2 new cards on the play piles. It also notes the end of the game and declares a winner. The 2 player threads act as the players, each operating on a hand, which is held privately, and the 2 play piles, which are common to both players and the referee.

I created a few classes – the “card”, a “stack”, and a “deck”. A “stack” is a collection of “cards” which is accessible as a FIFO or randomly. The “deck” is a special case of the stack class which is initialized to 52 elements which represent the standard deck of playing cards. Using these classes it was pretty easy to define and program the game.

The state machine for the player is fairly simple and is shown in figure #1. Each player starts out in the state “wait for start”. The referee will signal the start of the game by setting a flag which both players watch. Each player transitions from “wait for start” to “play”. Both players will look at their cards and the play stacks and determine if they have a move. If they are unable to find a move, they transition to a state of “stuck”, where they remain until a new card is placed on one of the play piles. This can be done by the other player or, if both players are stuck, by the referee.

Figure #1 - Player State Machine

You’ll find the code that implements this in listing #1. As with most state machines that get implemented in software, it was not explicitly written as a state machine – it just ended up that way. Doing a formal implementation of a state machine using switch statements or a C++ template is a good way to avoid bugs and enhance maintainability – especially when they get big. This was written more for simplicity and clarity.

Listing #1 - Player Function
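
If you can’t see the listing, here is roughly the shape the player function takes – a simplified sketch with illustrative names, not the exact code (the helpers stand in for the card/stack classes described above):

/* Simplified sketch of the player thread; names and helpers are illustrative */
#include <pthread.h>
#include <stdbool.h>

typedef enum { WAIT_FOR_START, PLAY, STUCK, GAME_OVER } player_state_t;

extern volatile bool start_game;                  /* set by the referee */
extern volatile player_state_t player_state[2];

/* hypothetical helpers built on the card/stack/deck classes */
extern bool try_to_play_a_card(int me);           /* true if a card was played */
extern bool hand_is_empty(int me);
extern bool play_piles_changed(int me);

void *player_thread(void *arg)
{
    int me = *(int *)arg;

    player_state[me] = WAIT_FOR_START;
    while (!start_game)
        ;                                         /* spin until the referee starts the game */

    player_state[me] = PLAY;
    while (!hand_is_empty(me)) {
        if (!try_to_play_a_card(me)) {
            player_state[me] = STUCK;             /* no move: wait for a new card on a play pile */
            while (!play_piles_changed(me))
                ;
            player_state[me] = PLAY;
        }
    }

    player_state[me] = GAME_OVER;
    return NULL;
}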

As usual, my implementation had bugs. Most of the bugs I was able to shake out the normal way: running the debugger, inspecting the code and the variables. I set the game to run in a loop – re-shuffling, re-dealing, and replaying over and over again. To give myself confidence that everything was programmed correctly, I ran 1,000 games and logged how often player #1 won, how often player #2 won, and how often there was a tie. If the system favored one player over another, then there probably was a bug. I also wrung out a few bugs by checking the deck at the end of each game. By virtually moving the cards from one stack to another under the control of 3 concurrent processes, it was possible to occasionally lose a card or duplicate a card. By checking the consistency of the deck at the end of each game I was able to spot and fix a couple of race conditions.

But one bug was fairly tough, and similar to my customer’s bug. In running a loop of 1,000 games, once in a while, the program would pick up some extra cards, and once and only once it dumped core. I was unable to reproduce this fault. Attaching a debugger to the resulting core file, I saw something very interesting – there were 5 threads running. Huh!?!? What I saw was the main thread, the referee thread, a player #1, a player #2, and a second player #2. Now 3 players were not invited to the game. Looking at the code, it was not clear to me how two player #2 threads could possibly be running at the same time. The core dump was the result of a bad pointer access. Inspection of the core files led me to believe that some memory corruption was going on.

So let’s go on a bug hunt. For those of you whose debugging skills are truly Jedi-level, you’ll only need the player thread above, listing #2 – the main function of the program, and listing #3 – the referee thread. That’s all the code which is involved in this bug. Recall the symptoms: a core dump (not reproducible) where there are 3 player threads, not 2 as expected – more frequently, gaining extra cards in the deck. Initially, I thought that the debugger was in error when it displayed 5 active threads on the core file. Sometimes memory corruption can mess up the state of a program so badly that the debugger can’t make sense of what’s going on. Looking at the code I did not think it was possible for 3 player threads to exist at the same time. The main thread starts the referee thread, and then the two player threads. Then it simply does a pthread join, and waits for all 3 threads to terminate before repeating the loop. The player threads don’t start doing anything until the referee thread goes through the shuffle and deal functions.

Listing #2 - main function
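
The main loop described above boils down to something like this sketch (simplified, with the error handling trimmed – which, as you’ll see, turns out to be part of the story):

/* Rough sketch of the main loop; not the exact listing */
#include <pthread.h>

extern void *referee_thread(void *arg);
extern void *player_thread(void *arg);

#define NUM_GAMES 1000

int main(void)
{
    static int id[2] = { 0, 1 };
    pthread_t ref, p1, p2;
    int game;

    for (game = 0; game < NUM_GAMES; game++) {
        pthread_create(&ref, NULL, referee_thread, NULL);    /* shuffles, deals, watches the game */
        pthread_create(&p1,  NULL, player_thread, &id[0]);
        pthread_create(&p2,  NULL, player_thread, &id[1]);

        /* wait for all 3 threads before re-shuffling and playing again;
           note that the return values are not checked here */
        pthread_join(ref, NULL);
        pthread_join(p1,  NULL);
        pthread_join(p2,  NULL);
    }
    return 0;
}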

Listing #3 - referee function

Now, the good old-fashioned symbolic debugger is probably not going to be much help here. The game fails about 1 in 10,000 runs. I can set a breakpoint in the code where I check the deck and find extra cards have mysteriously appeared. But at that point all the player threads are gone, and no amount of interrogation of the code was helping. It was amazingly frustrating. I assumed that it was a race condition. All code which references the shared play piles uses a mutex to ensure that only one player or the referee is accessing the stack at any one point in time. And I tried to make the operations on these stacks properly atomic. My initial stab at trying to fix this was to lock larger and larger sections of the player code to see if the problem would go away. No luck there. I had the player threads almost fully locked, and the program for all practical purposes fully serial. The problem seemed to be less frequent but did not go away. Putting all the threads on one core, using affinity, I was able to make the problem go away (or made it so unlikely that I did not see it any more). This reinforced my belief that I had a race condition.

This is where I got the idea of trying to debug this using the same approach that I would if I were debugging the same problem in Verilog. It’s pretty simple: I would start by tracing the state variables which drove each of the state machines in question.

Now, C++ does not have a “trace” command, like my favorite logic simulator does. (Hey, maybe we should add that – note to self: check with the new boss on that one.) But I can add an LTTng user trace point to each location in the code where the variable is assigned. Fortunately, the code is pretty clean, and the state variables are an enumerated type (yeah, the influence of doing years of HDL coding) with a small scope. If you are a software person, your state machines’ state variables may not be as cleanly defined, but in 10 or 15 minutes you should be able to find all the state transition points in the code and annotate them.

Listing #4 - LTTng User Trace annotated player function

Listing #4 shows the instrumented state machine. The trace_mark calls are the LTTng user space tracing calls. These will put a record into the LTTng database when that line of code is executed. The code is admittedly a bit messy and it needs to be recompiled, but it does allow us to trace the progress of the state machine as it executes. There is some work going on in the gcc community to add tracepoints; these could be used in a similar fashion, but would eliminate the need to instrument the code and recompile, which will be nice. Pedro Alves was showing this to me a couple weeks ago. I’m looking forward to being able to use it.
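
As a rough sketch of what the annotation looks like, each bare state assignment gets wrapped so that the transition lands in the trace. The exact trace_mark() header and argument list depend on the LTTng/UST version you have installed, so treat the details below as assumptions rather than the real listing:

/* Sketch of instrumenting the state assignments (channel/marker names are
   placeholders; check your LTTng-UST headers for the exact trace_mark() form) */
#include <ust/marker.h>                     /* LTTng user-space markers (version dependent) */

#define SET_STATE(me, new_state)                                        \
    do {                                                                \
        player_state[(me)] = (new_state);                               \
        trace_mark(speed, player_state_change,                          \
                   "player %d state %d", (me), (int)(new_state));       \
    } while (0)

/* so an assignment like
       player_state[me] = STUCK;
   becomes
       SET_STATE(me, STUCK);
   and every transition shows up as an event in the LTTng trace */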

Figure #2 - Player and Referee Events on a Timeline

Figure #3 - One Game, Up Close

Once this newly instrumented version of the program is run, we can look at the time line of the events. Figure #2 shows the events displayed on a time line in the System Analyzer. These events can be sorted, selected, and searched. But looking at it from a high level can sometimes allow you to see patterns and anomalies, which can reveal where bugs are hiding. Each game is a cluster of events for each player and the referee, and the gaps between the games are where the playing card deck data structures are checked for data integrity. At the end of each game the referee thread checks to see that all the cards are present and accounted for, and that we didn’t accidentally duplicate any. Figure #3 shows a close up view of one game. Each vertical line is where a “trace_mark” call was executed. If you hover the cursor over an event, a tool tip will pop up and show its contents.

Here’s the first clue. Figure #4 shows a closer view of several games. What is notable here is that the player 2 events are completely absent from the center game. Another glaringly obvious feature is that there is a state transition for player 2 completely outside of the cluster of events which constitutes a game.

Figure #4 - Events That Don't Fit the Pattern

Zooming in on the events in the game following the one with a missing player 2 (figure #5), we can see that the density of events for player 2 is higher than for player 1. Based on this I’m concluding that there are in fact two player 2 threads at work at the same time – looks like the debugger was right, after all. I will also conclude that it is the player 2 from the prior game which is joining the final game uninvited – as he apparently missed his own game.

Figure #5 - The Bad Game, Up Close

Why is a player able to skip one game and then join another? Looking closely at the events on the timeline, I can work out what’s going on. Let’s start by looking at the game prior to the one in figure #5, and the sequence of events. First, player 1 goes into the initial state of “wait_for_start”. A short time later, the referee signals the start of the game. The state variable of player 2 has not been altered; its value is still “game_over” from the prior game. It should kick over to “wait_for_start” just after the thread starts, but, of course, we cannot predict when the OS will launch the thread. Linux is obviously off doing something more important and not starting my thread as I expected. But with the start of the game flagged by the referee thread, player 1 begins to play its cards. A bit of an unfair advantage, but that’s life. Also unexpected, but I guess possible, player 1 finishes playing all of its cards. It then transitions its state to “game_over”. The referee notices that both players are in a state of “game_over”, so it dutifully checks the deck and exits the game.

Here’s where the really bad luck happens – the referee and player 1 threads have exited, but apparently, the player 2 thread did not yet get started. When I call pthread_join to wait for the player 2 thread to exit (and foolishly neglect to check the error code), pthread_join returns, and I can only surmise that it returns with the error code ESRCH – no such thread could be found – because it hasn’t started yet. But this does not terminate the thread. The player 2 thread goes happily along and soon starts. We can see that while the referee is off checking the deck, player 2 comes to life. It checks to see if the referee has started the game; by now the referee has set the “start_game” bit back to false, so the thread waits. A new referee and new player 1 and player 2 threads are all created. Thus the mystery of the 3 players is solved. The extra cards occur as a result of player 2 taking his stack of cards from one game to another.

There are several failure points here. First, I did not check the return status on the pthread_join call – this would have alerted me to a problem. But the root cause of the problem would have been hard to find without a timeline or log view of events. Probably more germane is that the starting state for the player state machine was the last state from the prior game. My state machine was missing a state. Also, while the referee thread prevented players from starting too early, it did not check that all players had reached a state of being ready to play. With these issues fixed, the program runs flawlessly.
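
For what it’s worth, the kind of checking and hand-shaking that closes these holes is only a few lines. Here is a sketch of the idea (illustrative, not the project’s actual fix):

/* Check the result of every join -- an ESRCH here is a loud warning */
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <stdbool.h>

typedef enum { WAIT_FOR_START, PLAY, STUCK, GAME_OVER } player_state_t;
extern volatile player_state_t player_state[2];
extern volatile bool start_game;

static void join_or_complain(pthread_t t, const char *who)
{
    int rc = pthread_join(t, NULL);
    if (rc != 0)
        fprintf(stderr, "pthread_join(%s): %s\n", who, strerror(rc));
}

/* Referee side: don't flag the start until both players are really ready */
static void wait_for_players_ready(void)
{
    while (player_state[0] != WAIT_FOR_START ||
           player_state[1] != WAIT_FOR_START)
        ;                               /* spin until both player threads have started */
    start_game = true;
}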

The key point here (besides *always* checking your system call return statuses) is how the cause of the problem is exposed and almost obvious when looking at a timeline of the events. Contrast that with how you might try to debug this type of problem with a traditional debugger. Of course, this is a really simple state machine – 5 states in all (once I got the bugs worked out). I’m betting your designs are probably a bit more complicated. This approach might help you with debugging where you have multiple interacting state machines.


14 January, 2011

I never cared for really long blog entries, so I’m splitting this up into a couple of posts. If you missed my last post you’ll want to go back and read it – or you’ll probably be hopelessly lost as we dive into my multi-core Sudoku solving dilemma.

As you’ll recall, last time I took my clever Sudoku solver and ran it on an 8 core PPC design, and found I got a 3 X speed up. When I looked into why, it turned out that the process which was creating the worker threads was being preempted. This limited the number of active worker threads to between 3 and 5, instead of the expected 25.

I solved the problem by adding a pthread barrier to the created threads; this had the effect of getting all 25 threads launched before diving into the solving phase of the program. And it worked: instead of only a few active threads, I have 25. They are all runnable. And I have lots of cores – 8 to be exact. But things ran more slowly. Using the System Analyzer I could easily see why:
Figure #1

All of the threads are created and quickly reach a state of blocking on I/O – waiting for the pthread barrier to be released by Linux. But if you look at the CPU utilization (at the top of the graph) you’ll see that we have lots of idle CPUs (recall grey is idle and blue is running a user thread). Each thread goes through each open square in the puzzle and determines if its digit goes in that square. Then it returns a list of solved squares to the main process and exits. The main process collates all the results and then launches a new set of threads if the puzzle is not solved. For each group of 25 threads, Linux appears to be clustering those threads on only 2 or 3 cores. I assume that Linux presumes the threads will be communicating with each other, and that communication across different cores is expensive, and is therefore clumping them together on just a few cores. This is just a guess – I have some friends who are experts in this and I’m going to consult them to try to get to the bottom of this.

In the meantime, I’m still trying to get to that elusive multi-core performance and have my Sudoku puzzles solved faster. So I decide to try something – processor affinity. As I create each thread, I can force the thread to run on a particular core. This is done using the function call sched_setaffinity(). As a quick start to this I decide to force the threads in a round robin fashion to the cores. The thread solving the 1st digit (‘A’) is forced to core 0, the thread solving for ‘B’ to core 1, and so on. The 9th thread goes on core 0, and the 10th on core 1. I take the bottom 3 bits of the thread number and set the processor affinity of that thread to that core. Here’s how I coded it:
Listing #1

Note: this only works if “CORES” is a power of 2.
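
If you want to try this yourself, the affinity call itself is only a few lines. Here is a sketch of the idea (the variable names are mine, not the listing’s):

/* Pin the calling worker thread to a core chosen from its thread number */
#define _GNU_SOURCE
#include <sched.h>

#define CORES 8                              /* the P4080 has 8 cores */

static void pin_to_core(int thread_num)
{
    cpu_set_t mask;
    int core = thread_num & (CORES - 1);     /* bottom 3 bits when CORES is 8 */

    CPU_ZERO(&mask);
    CPU_SET(core, &mask);
    sched_setaffinity(0, sizeof(mask), &mask);   /* 0 means the calling thread */
}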

When tracing it and viewing it with the System Analyzer, here’s what I see:
Figure #2

Much, much better. You’ll notice that we get all 8 cores running full steam at the start of a round, but as they run out of work to do, the cores start to idle. This happens about halfway through the round.

What’s happening is that the time to solve each digit differs based on the configuration of the puzzle. Some digits get solved quickly, and some take more time. Inevitably, one of the cores will have a couple of digits which take a long time to solve, while other cores will have a couple of digits which can be solved quickly – which leaves some cores idle at the end of a round. This implementation is more than 5X faster than a single core solution, so this is pretty good, but if I can spread the work more evenly across the cores, clearly, I can get closer to 8X.

The next thing I tried was to create a “work queue”. Instead of each core solving a predetermined set of digits, each worker thread will take a digit from a queue. Using affinity I fixed each worker thread to a core. As long as there are unsolved digits in the queue the thread will keep on working. If one core gets several digits which can be solved quickly, it will ask for more work. This approach should effectively “level” the load across the cores. Once the queue is empty, the worker thread returns its solved squares to the main thread and expires.
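
The work queue itself is tiny – essentially a counter behind a mutex. A sketch of the idea (illustrative names, not the actual solver code):

/* Each worker pulls the next unsolved digit until the queue runs dry */
#include <pthread.h>

#define DIGITS 25

static int next_digit = 0;
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;

extern void solve_for_digit(int digit);      /* hypothetical solver entry point */

/* returns the next digit to work on, or -1 when the round is done */
static int get_work(void)
{
    int digit = -1;

    pthread_mutex_lock(&queue_lock);
    if (next_digit < DIGITS)
        digit = next_digit++;
    pthread_mutex_unlock(&queue_lock);

    return digit;
}

static void *worker(void *arg)
{
    int d;
    (void)arg;
    while ((d = get_work()) >= 0)            /* keep asking for work until the queue is empty */
        solve_for_digit(d);
    return NULL;
}
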
Now the trace looks like this:
Figure #3

This multi-core version of the solver runs over 7 times faster than the original single threaded version. As the work gets completed there are still some idle CPU cycles at the end of a round of solving; I can think of a couple of ways these could be put to good use, but I think I am at a point of diminishing returns – I am running the algorithm as a multi-threaded SMP program in about 13.8% of the time that it took to run the single threaded version. The theoretical upper limit is 12.5% (1/8th) – so if I were to squeeze out every last CPU cycle, I only have about 1% of the original run time left to optimize. At this point, it’s not worth the effort (and my boss wants me to get back onto “real” work :-) ). You might notice that I am launching a new set of threads on each “round” of solving – it was easy to code that way. I was originally concerned that the overhead of launching and tearing down so many threads would impact performance – some urban legend I remember from my youth. It turns out that the time taken to launch the threads is less than 0.1% of the run time of the program. Again, this is so small that there is no point in convoluting the code to optimize this out.

I ended up running 7.2 times faster on 8 cores – not a perfect utilization of the additional cores, but pretty good. Without the visibility of seeing how the threads were being handled I don’t think I could have done it. In the end, the code changes were really pretty minimal, and mostly related to thread launching and the management of a work queue. The main solving algorithm was completely untouched. Doing a diff between the original solver and the final implementation shows 123 lines added and 17 lines changed and 8 deleted. Just for the record, that is out of 1557 total lines of code.

There is one thing left to investigate. All 8 worker threads are making lots of read references to the puzzle array, which is in a common memory area shared by all 8 threads. Only the main thread updates this area once all the worker threads have completed, so there are no race conditions. I can’t help but wonder if the cache coherency logic in the processor is getting thrashed by all these threads accessing a common area. My thought is to make a local copy of the puzzle for each thread and have it reference only the copy as it goes about finding the solution. It will be interesting to see if that has any impact on performance. Given the small amount of performance left to reach the theoretical limit, I’m guessing I’ll find the shared caches working pretty efficiently in this case.


10 January, 2011

Yeah, it’s been a while since my last blog post. Of course, it’s all about the software, but in blogging it’s all about content – and keeping it fresh. So, here I am, with some fresh content.

I took a new job here at Mentor and now I’m working in our Embedded Software Division. Despite being employed by a hardware tools company I think that this software stuff is going to be around for a while. My new job has given me access to a couple of cool things. One is a brand spankin’ new Freescale P4080DS development board. This is a development board which has a P4080 processor running at 1.2 GHz with 4 gigabytes of RAM. The P4080 has 8 PowerPC e500 cores. The other thing I have access to, and also brand spankin’ new, is a system analysis tool – which reveals just gobs of details about what your program and system are up to.

Of course, my most deeply devoted readers will recall that I made a presentation last year at the Multi-Core Expo in San Jose. There I talked about several programs which I tried to run on a hypothetical multi-core system (actually, I ran a simulation of a design with 8 ARM cores) and described the power savings that could be achieved by running on smaller, simpler cores but more of them. When most of those programs were run on a multi-core system they were not able to achieve anywhere near a linear scaling with the addition of more cores. “That’s to be expected” – the experts say. But these were “embarrassingly parallel” applications, and being a reasonably smart guy I expected to be able to, well, to do better than I did.

So with this new 8 core chip and the visibility promised by the System Analyzer at my disposal, I thought I would take another try at getting these programs to scale well on a multi-core system. So I dusted off my multi-threaded Sudoku solver. Hoping it had not suffered too much code rot, I ported it to the P4080DS and a fairly fresh version of embedded SMP Linux (2.6.30, to be specific).

Sudoku, for those of you who have lived in a cave for the past decade or so, is a number puzzle similar to a magic square. I’m not going to go into details on the puzzle or strategies for solving it; there are lots of websites that do a better job of that than I ever could – such as this one and this one – or just check the puzzle section of your local paper. Sudoku has some very interesting computational properties (well, if you’re a real nerd): at first glance it appears that it should be mostly parallelizable, but there are portions of any solver implementation which have to be serial.

If you want to solve a Sudoku puzzle in parallel, one way to do it is to make “n” copies of the puzzle – where “n” is the number of digits in the puzzle. Then get “n” of your closest friends together (friends who put up with your nerdiness). Give each friend a copy of the puzzle you want to solve. Have your first friend solve for as many squares as they can for the first digit – but ignore the rest of the digits in the puzzle. Have your second friend solve for the second digit, but, again, ignore everything else. Repeat until you have used up all your friends. When your friends have solved as many squares as possible, have them hand you back their copy of the puzzle. You collate their answers onto one puzzle – and again make “n” copies of that and pass them out again – with more spaces filled in, your friends will be able to make further progress on the puzzle. Repeat this process until you find the complete solution. Of course, the bottleneck in this process will be the collation of the puzzles, copying them and passing them out again. But you probably will solve the puzzle much faster this way than doing it yourself.

I basically implemented the same thing – but using pthreads instead of friends. When I ran it on a beefy quad core Pentium, I saw about a 30% improvement in throughput. But I was expecting at least a 50% improvement – since each of the pthreads is completely independent and I had a lot more threads than cores, which in theory should be able to take advantage of all that raw compute power. Except for the collation and copy step, this should be an embarrassingly parallel application.

Next I tried this on the P4080 – the octo-core beast. Here, the multi-threaded version of the program ran in about 1/3rd the time of the single threaded version. Much better – but why not 1/8th? Or at least 1/7th? Traditional *nix utilities – time, top, gprof, etc. – show me how long the program takes to run, and they also show that I am not completely utilizing all the cores. But why it is not performing as I expected is inscrutable.

So I tried out my new System Analyzer on it. One source of data for the System Analyzer is LTTng, or the Linux Trace Toolkit – next generation. This shows what the Linux system is doing – processes, threads, cores, interrupts, and such. The System Analyzer puts this data on a timeline and allows easy viewing and computation on that data. See Figure #1. In this diagram, the top 8 lines represent the 8 cores in the P4080. Blue is busy running user code, grey is idle. Below these lines are the threads (in this case, actually, processes) – here green is running, red is waiting on I/O events, and yellow is runnable, but waiting for a core to become free.
Figure #1

So I looked at the “resource view” of what was going on on the P4080. One thing that was clear was that there were lots of idle cores. As you might guess from the story so far, there were 3 cores busy and 5 cores just sitting there idling. Idling, I tell you, while there was work to be done. Interestingly, it was not always the same 3 cores that were busy – in fact, the threads were pretty evenly distributed across all 8 cores over the duration of the program.

The program I’m describing here is chewing on a super monster Sudoku puzzle, that is not your ordinary 9×9 Sudoku, or 16×16 monster Sudoku, but a 25×25 super monster Sudoku (for those playing along at home the actual super monster Sudoku is posted at the end of this blog entry). There are 25 digits in the puzzle, and I am creating one thread which solves the puzzle for each digit. So I expected to see 25 threads all get created and be runnable – while the Linux SMP operating system would dutifully and efficiently put those runnable threads on all the cores. That didn’t happen. In fact, at any one point in time – viewing this on my handy timeline feature of my System Analyzer – I could plainly see that there were no more than 5 threads at any one time. Often there were only one or two threads active. WT_?!? (Aagh, censored by my corporate overlords again!)
Figure #2

Next I looked at the gantt chart of the process which was launching the threads (bright green bar in figure #2), paired along with the state of that thread (top Sudoku_0 process in figure #2) and I had that slap-the-hand-on-the-forehead moment. It is amazing how obvious a problem is once you “get it” – and how confounding it can be before that. Here’s the code that launches the threads:
Listing #1

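In rough outline, the launch loop looks like this (an illustrative sketch – create_thread() in the real code boils down to a pthread_create() call):

/* Launch one solver thread per digit of the puzzle */
#include <pthread.h>

#define DIGITS 25

extern void *solver_thread(void *arg);        /* solves the puzzle for one digit */

static pthread_t tid[DIGITS];
static int       digit[DIGITS];

static void launch_workers(void)
{
    int i;
    for (i = 0; i < DIGITS; i++) {
        digit[i] = i;
        pthread_create(&tid[i], NULL, solver_thread, &digit[i]);
    }
}
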
See the problem? I didn’t. Take a close look at figure #2 – you might have the “aha” moment. Note that the main thread – running the snippet of code above – is the top “./sudoku_0” trace, and the created threads are the “./sudoku_0” traces below.

The program creates a thread on each iteration of the (above) loop. Notice that create_thread() calls pthread_create() and returns quickly – as per the gantt chart – and at the same time you see a new thread created. Then the thread goes into a state of “cpu_wait” (yellow) and remains in the function create_thread. Just after returning from pthread_create() the main thread is suspended. What’s happening is that the thread being created will be placed by the operating system on the same core where the main thread is running – preempting it. The process creating the threads gets preempted. No more threads get created until this process gets swapped back in. For some reason, which is baffling to me, it does not get swapped in until some or all of the worker threads are done with their work. But while the main thread is suspended, there are idle CPUs. I probably need to start to understand the mysteries of the scheduler – but I don’t want to know that stuff and I don’t think I should have to get involved in that to write an efficient multi-threaded program. To me “SMP” means never having to think about your scheduler.

I did come up with a solution which I thought would allow me to remain blissfully ignorant of the scheduler. The main thread simply needs to run to the end of the thread creation loop before any of the created threads run. So we need to either start the threads in a suspended state, or have them block on something just after being created. Alternatively, I think, if the priority of the process creating the threads were higher than the priority of the created threads, then they would all be created before the main process was preempted. After some quick googling I decide on the pthread_barrier approach – a simple way to ensure that a set of threads all reach the same point in the code before any proceed. The listing is below – rather clever, if I do say so myself.
Listing #2

And at the start of each worker thread, I added a pthread_barrier_wait() call:
Listing #3

This minor change will force all the worker threads to suspend at the point of the pthread_barrier_wait() call until all the worker threads reach that point. This will ensure that all the threads are created by the main thread before running.
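
For reference, the barrier pattern is just two calls – a sketch of the idea (listings #2 and #3 above are the real code; this is illustrative):

/* One barrier, sized for the 25 worker threads */
#include <pthread.h>

#define DIGITS 25

static pthread_barrier_t start_barrier;

/* main thread, before the creation loop */
static void init_start_barrier(void)
{
    pthread_barrier_init(&start_barrier, NULL, DIGITS);
}

/* first thing each worker does: park here until all 25 threads exist */
static void wait_for_everyone(void)
{
    pthread_barrier_wait(&start_barrier);
}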

When I run this I find that the program doesn’t run any faster; in fact, it runs slower – considerably slower. About 50% slower. Why?

More in the next post…

Sudoku Puzzle (from http://www.colinj.co.uk – where you can print this or play online):



19 October, 2009

Constrained random seems all the rage recently.  Last week I was visiting one of our field offices.  I was discussing with some of the applications engineers how to use a processor to drive stimulus into a design.  One chimed in and said that this was all very interesting, but it could not be used to drive constrained random stimulus.

<Cue sound of a record scratching>

What??

“You can’t have a processor generate random traffic like you can with SystemVerilog,” he asserted (no pun intended).

The thing that got me is that this is about the 20th person who works for Mentor to tell me this.  So, this week’s Mythbusting blog entry is devoted to generating random stimulus from a processor.

Now let me start by saying that it is my belief that when hunting bugs you are far better off with a rifle than a shotgun.  Testing (or verification) of computer languages is a pretty well defined science – these things should not be left to chance.  The folks who work for me know that they will be chastised severely if they go about testing with a “spray and pray” approach.  So, I never thought much of the constrained random approach to HDL verification.

However, after reading a paper by Richard McGee, Paul Furlong, and Fabian Delguste called “VHDL to SystemVerilog: Constrained Random Verification of a USB 2.0 Host Controller Sub-System,” I started to understand the approach better and will grudgingly admit that there may be some merit to this approach.  And, I can see the appeal from a developer’s point of view. The paper was presented at the Synopsys User’s Group meeting in Europe in 2006 and is available here.  While I admit there is some merit here, I do think that systematic approaches to the problem are superior and worth a look.  Information about Mentor’s offering in this area can be found here.

Now, on to the programming side of things. Contrary to widely held belief, the standard C run time library includes the function rand().  So, any C program that you want to run on a processor in simulation should have access to the rand() function – an adequate random number generator.  I say “should” because embedded compiler vendors do remove functions from the standard C library when they retarget the compiler for embedded purposes.  Oftentimes functions like printf() and fopen() are dropped to conserve space – since they are not likely to be used.  I know that rand() is present in the C runtime library for ARM’s RealView development suite.

If your C runtime library does not contain rand(), here is a simple one you can use:

int rand(void)
{
    static long a = 0x1234;   /* seed value */
    a = (a * 32719 + 3) % 32749;
    return a;
}

This is an implementation of “Gerhard’s generator” and I stole it from this website. It seems to do a good job of efficiently generating the lower 16 bits of pseudo-random numbers.

The C run time library function rand() returns a pseudo-random number between 0 and RAND_MAX – at least 15, and typically 31, random bits.  To get a random number in a smaller range, you need to do a bit of math.  The simplest way I know of to do this is to use the modulus operator, p = rand()%N – this sets p to a random number between 0 and N-1.  So if you want a number from 1 to 10, you could use the expression 1+rand()%10.  You can use the following macro to make the code look a bit cleaner

#define RAND(x) (rand()%(x))

Now “RAND(10)” placed anywhere in your code will get you a random number between 0 and 9.

Of course you will need your simulations to be repeatable – this can be done with a random number seed.  C provides a function srand(), which takes an integer seed as an argument.  By calling this with the same seed at the start of a run of a program, you can get exactly the same sequence of random numbers every time you run the program.
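
For example, something along these lines gives you the same “random” stimulus on every run:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int i;
    srand(42);                         /* any fixed seed makes the run repeatable */
    for (i = 0; i < 5; i++)
        printf("%d\n", rand() % 10);   /* same five numbers every time */
    return 0;
}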

Now let’s get a little fancier.  Say you want a particular distribution of numbers.  For example, you want a 1/3rd probability for a value of 46, a 25% probability of 128, and for the remaining numbers, a normal distribution from 50 to 100, with a standard deviation of 25.  Here is a function that would deliver this:

int func(void)
{
    int i, n;
    if (0==RAND(3)) return 46;
    if (0==RAND(4)) return 128;
    i = 0;
    n = rand() & 0xFFFF;   /* 16 random bits for the table lookup */
    while (n<lookup_table[i++]);
    return i+46;
}

where lookup_table is initialized with data for an appropriate cumulative distribution, which can be generated pretty easily in Excel.  If you run this for a while, it produces this distribution:

Distribution of N

You can even generate Poisson, binomial, uniform, or geometric distributions.  With a little programming you can even create something as obscure as Fisher’s noncentral hypergeometric distribution. (I knew that advanced probability and statistics class would pay off in the real world.)  Wikipedia has a good discussion of probability distributions and their applications.  If you have forgotten everything from MATH-251 in college, this website has a good refresher tutorial… worth brushing up on if you’re going to be doing a lot of constrained random programming.

OK, so now let’s apply this to the classic constrained random example.  We’ll make a routine to create some random Ethernet packets.  Here’s the SystemVerilog example (taken from Wikipedia here)

class eth_frame;
    rand bit [47:0] dest;
    rand bit [47:0] src;
    rand bit [15:0] type;
    rand byte       payload[];
    bit [31:0]      fcs;
    rand bit [31:0] fcs_corrupt;
    constraint basic {
        payload.size inside {[46:1500]};
    }
    constraint good_fr {
        fcs_corrupt == 0;
    }
endclass

So in C we would define the structure as follows:

struct eth_frame {
    unsigned char  dest[6];
    unsigned char  src[6];
    unsigned short type;
    unsigned char  payload[1500];
    unsigned long  fcs;
};

Then we can define a random function to populate the structure:

int eth_size(void);   /* forward declaration */

void random_eth_packet(struct eth_frame *ep)
{
    int i;
    for(i=0; i<6; i++) {
        ep->dest[i] = RAND(256);
        ep->src[i] = RAND(256);
    }
    ep->type = eth_size();            /* payload length, 46..1500 */
    for (i=0; i<ep->type; i++) {
        ep->payload[i] = RAND(256);
    }
    ep->fcs = cyc_check(ep);          /* compute a good checksum */
}

int eth_size(void)
{
    int i, n;
    if (0==RAND(3)) return 46;
    if (0==RAND(4)) return 1500;
    i = 0;
    n = rand() & 0xFFFF;
    while (n<lookup_table[i++]);
    return i+46;
}

This is equivalent to the random method on the SystemVerilog class.

Now I will admit that the application of constraints is far more elegant in SystemVerilog than it can be in C.  Using C++ one can probably construct something which approaches what SystemVerilog has implemented.  But C++ can be a high overhead language; its use on a simulated processor should be limited to Jedi-level programmers who really understand the performance implications of any particular construct…or those who have some type of simulated processor acceleration. (There’s an app for that, which I have, but that’s another post.)

For C one would need to either use #defines to apply constraints, or plumb the logic into a particular random function.  As an example, if we wanted to have random_eth_packet generate a bad checksum for 1% of the packets, we could add a parameter “bad_packet” to the function signature, and then change the assignment to ep->fcs as:

if (bad_packet) {
    ep->fcs = RAND(0x10000);
} else {
    ep->fcs = cyc_check(ep);
}

We would then call the function as:

random_eth_packet(ep, 0==RAND(100));

To change the percentage of bad packets from the compilation command line, we can replace the “0==RAND(100)” with “RAND(100) < PCT_BAD_CRC”, where “PCT_BAD_CRC” is a macro you set at compilation time with a -DPCT_BAD_CRC=20. With PCT_BAD_CRC set to 0, we will get no bad CRCs; with 100, we will get all bad CRCs (except in the very rare occurrence where the RAND(0x10000) value actually equals the correct CRC).
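
Concretely, that might look something like this (a sketch – the wrapper function is mine, not part of the earlier example):

#include <stdlib.h>

#ifndef PCT_BAD_CRC
#define PCT_BAD_CRC 1                 /* percent of bad packets; override with -DPCT_BAD_CRC=n */
#endif

#define RAND(x) (rand()%(x))

struct eth_frame;                     /* defined earlier in this post */
extern void random_eth_packet(struct eth_frame *ep, int bad_packet);

/* build a packet, corrupting the FCS roughly PCT_BAD_CRC percent of the time */
static void random_packet_with_error_rate(struct eth_frame *ep)
{
    random_eth_packet(ep, RAND(100) < PCT_BAD_CRC);
}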

Now as the constraints become more sophisticated, C runs out of gas when compared to SystemVerilog.  You can apply a Boolean expression to a set of random values (as below) – but there is nothing like the real constraint solver that you get with SystemVerilog, and not enough CPU power to implement one.

do {
    a = RAND(32);
    b = RAND(32);
    c = RAND(32);
} while (!((a<b) && (b<c)));   /* re-roll until the constraint a < b < c holds */

A simple do-while loop works, but you should make sure the probability of success is at least 1 in 4 or you could burn a lot of simulation time spinning here.

Of course, random stimulus requires that we set up cover groups and bins to appropriately record the coverage.  With Questa or ModelSim we have a way to export software variables in the code on the processor for inclusion in the UCDB (Unified Coverage DataBase).  Then they can be included in coverage analysis.  I’m not sure how you’d do this with VCS or NCSim – if anyone has some ideas please post a comment.

So now for the sixty-four kilo-buck question: if we can create random Ethernet packets from either a processor or from a SystemVerilog task and an AMBA transactor, why would we pick one over the other?  The overriding criterion in this decision should be where you will get the highest return on your coding efforts.  Let me give you a couple of reasons why you might consider writing this type of stimulus in C running on a processor, and a couple reasons why you might not.

The first has to do with verification IP reuse.  While your job may not be affected by it, verification of a design extends well beyond the RTL coding phase.  Yes, the RTL coding phase is a significant and expensive step, but there may be FPGA prototypes or emulators in your design’s future.  You’ll need to drive stimulus (and check responses) in those environments, too.  Undoubtedly, there is a physical prototype, and at some point someone is going to need to think about verification of manufactured systems (though this often comes under the heading of “test”).  Also, there are diagnostics which will need to run on released systems.  Verification IP written in SystemVerilog doesn’t fit, or doesn’t fit well, in these environments.  Verification IP written to run on the processor can be leveraged across all of these environments.

The second reason you might want to run stimulus from the processor is that it will provide a better grade of, well – for lack of a better term, randomness.  A SystemVerilog task randomly driving stimulus on an AXI transactor has a large state space to run in.  There are a lot of different transactions which can be driven, and if we consider the possible permutations of sequences of 3 or 4 transactions, we are way beyond what can be run in simulation—which means that you’re only going to hit a fraction of the state space.  The actual processor (and corresponding software) cannot and will not drive anywhere near the universe of possible permutations.  In fact, it will tend to repeat the same patterns over and over again – the same patterns that will occur in the real system.

So, do you want your random walk through the stimulus to spend most of its simulation cycles exercising conditions that will exactly match the activity of a specific processor in the final system?  Or, do you want to spend your time simulating the broad spectrum of possible activity that might possibly occur?  Your actual answer will depend on your verification goals. If you are trying to validate that a particular block of IP will work with a specific processor, you should drive the stimulus from that processor.  Conversely, if you are trying to validate that a block of IP will work with any valid bus master, then you will not want to use the processor.

Another reason you might not want to put your stimulus generation on the processor is if you have no debug environment for code which is running on the processor in simulation.  Your productivity writing and debugging complex code without access to a modern graphical debugger will be horrendous.  Would you write SystemVerilog if you had no debug environment for it?  Probably not.  In this case you are better off writing the HDL verification in a language you can be productive in and let someone else in your organization write and debug it again for the downstream environments.

So, you can generate constrained random stimulus while running code on a processor.  In some cases this can be a better way to create stimulus, as it can be reused in downstream applications and provides a more realistic set of stimulus than you’ll get otherwise.  This myth is totally busted; and we didn’t even have to use a crash test dummy.

28 September, 2009

They say one of the first steps to fixing a problem is to acknowledge that you have one. Hi, I’m Russ, and I have a porting problem. If you work on software in the embedded world, you probably have a porting problem, too. Your code may be an embedded application, drivers, or diagnostics, but at some point the code probably started out on a desktop machine of some sort. When you moved it to run on the real hardware, you needed to port it.

I find that most folks underestimate the effort involved in porting code…especially C code. C and C++ are particularly pernicious in this area, because, unlike many other languages, they leave quite a few details up to the compiler writer. This is probably one of the reasons C is so prevalent today – it is easy to move the compiler from one platform to another. And, this makes folks think that they can move a working C program from one platform to another.

Last week I spent a lot of time helping one of my customers figure out why a piece of code that was running perfectly on his Dell was now failing when running on his ARM platform. The funny thing is that this code was not touching the hardware at all – this was simply some C code that was running in memory. Usually when code breaks as it is moved from a Dell to an ARM, the cause is the boundaries where hardware gets involved. So, this had us stumped.

Ultimately, we tracked it down to this snippet of code:

Listing #1
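The original listing was an image and hasn’t survived, but based on the description below it boiled down to comparing a plain char against a -1 sentinel. A reconstruction (the array size and the printf are my own inventions) would look something like this:

#include <stdio.h>

#define NUM_CHANNELS 4      /* assumed size -- not from the original code */
#define UNUSED       (-1)   /* marker for "never connected"               */

/* "count" holds the number of connections made to each channel, or UNUSED
   if the channel has never been connected.  Crucially, it is an array of
   plain char, whose signedness C leaves up to the compiler writer.        */
char count[NUM_CHANNELS] = { UNUSED, 0, 3, UNUSED };

int main(void)
{
    int chan;

    for (chan = 0; chan < NUM_CHANNELS; chan++) {
        if (count[chan] == UNUSED) {
            /* Channel was never connected -- skip it.  The Dell build
               (signed char) takes this branch for unused channels.     */
            continue;
        } else {
            /* The ARM build (unsigned char) sends even the unused channels
               down here, because (char)0xFF never compares equal to -1.   */
            printf("processing connections for channel %d\n", chan);
        }
    }
    return 0;
}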

What we found was that the code was incorrectly trying to process the channel connections of channels that had never been used.  When the program was originally written, the array “count” kept track of the number of connections made to each channel. Later, they needed to distinguish between a connection count of 0 and a channel that had never had a connection. Rather than create another variable, the programmer added a constant called “UNUSED” with the value -1 to denote that a channel had never been connected. He thought this was safe, since a count can never take on a negative value. “UNUSED” was assigned to the “count” array on initialization and then compared against later.

Just a quick aside on programming style – you should NEVER overload the meaning of a variable like this. It is an invitation for bugs to crawl into your code. However, I see this done all over the place. Consider the standard C library function fgetc – it returns data, unless it returns EOF, which is -1 (note the similarity here), a status denoting an end-of-file condition. Good coding practice would dictate that the status of a function should always be returned through one path and the data through another.
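As a generic illustration of the trap this invites (not code from the project in question), here is why fgetc’s return value has to be kept in an int:

#include <stdio.h>

int main(void)
{
    int c;   /* must be an int, not a char, so EOF (-1) stays distinguishable */

    /* If c were a plain char, the comparison against EOF would never succeed
       on a platform where char is unsigned -- the same trap as the count[]
       overloading above, just baked into the standard library's interface.  */
    while ((c = fgetc(stdin)) != EOF)
        fputc(c, stdout);

    return 0;
}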

I have had developers defend this type of variable overloading, citing Unix or Linux as “really good code” that uses this technique. Yes, Linux does it. But that does not make it a good example. Don’t even get me started about their use of a global variable called “errno”. My freshman computer science professor would have thrown me out of the degree program, with prejudice, had I ever proposed such a monstrosity.

Anyway, back to the debugging. The code in Listing #1, when running on the ARM, took the else path. But when running on the Dell, it took the if clause. We double-checked this, and it was indeed happening. Next we checked to see if “count[chan]” had changed, or if “chan” had changed between the assignment and the comparison – neither had changed. For folks who like puzzles and consider themselves competent in C coding, think about how this could be the case before reading on.

This just looked broken to me. I wrote a small snippet of code that replicated the problem:

Listing #2
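The listing itself hasn’t survived either, but from the description that follows it was essentially a plain char initialized to -1 and then compared against -1:

#include <stdio.h>

int main(void)
{
    char c = -1;   /* plain char -- the compiler chooses its signedness */

    if (c == -1)
        printf("The world is sane.\n");
    else
        printf("The world has gone insane!\n");

    return 0;
}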

On the Dell machine, as you would expect, we see that the world is sane. However, when we are on the ARM system, the code takes the else clause. The world has gone insane! How is it possible for this simple code to run differently between the ARM and the Pentium?

Next I looked at the assembly code that was produced by the ARM compiler:

Listing #3

I looked at the constant being loaded into R0, and it was in fact a -1. The CMN instruction is a compare negative. It will negate the second argument and then compare it with the first. By my math, a -1 and a -1 (or a 1 negated) are equal. I started to suspect a compiler bug.

Now, let me say that I have never in my life found a compiler bug… despite seeing a lot of unbelievable things happen in code – and often wanting to blame the compiler for not implementing “what I meant” (rather than “what I coded”). Compiler bugs are really quite rare – and the solution is often tied back to an incorrect understanding of the language or a simple programmer error. But, in this case, it really looked to me like the load instruction was incorrect. It was doing an unsigned load (LDRB) of the -1 (or 0xFF) in memory, which was not sign extending it as it was placed in R0. So the CMN instruction, which did perform a sign extension, would result in a comparison of 0xFF with 0xFFFFFFFF – which are not equal. Doing an unsigned load here is incorrect; as every high school student knows, the default type for integer variables in C is “signed.”

To test this theory, I punched the LDRSB (load register signed byte) opcode in place of the LDRB and re-ran the program. The code executed as I expected. The world was once again sane.

So why was the ARM compiler broken in this case? Had I found the first compiler bug in my career?

Some careful reading of the K and R C book was in order. Before I claimed a compiler bug, I wanted to review the rules for type promotion and sign extension.

It turns out that both the ARM compiler and the GCC we used on the Dell machine were compiling the code correctly. Differently, yes, but both are correct as per the C language definition.

If the variable is defined as a “signed char” the code will run the same on both compilers.  Likewise, if the variable is declared as “unsigned char” the code will run the same.  And, it is true that the default type for integers in C is “signed.”  However, for “chars” and “bit fields” whose signedness (yeah, I know that’s not a word) is not declared, it is left to the compiler writer to decide whether they are treated as signed – and therefore whether sign extension takes place when the variable is promoted for assignment or comparison.

So GCC treats a plain char as signed, and the ARM compiler treats it as unsigned – and because of this, you get different results. This seems broken to me. But greater minds than mine have pondered this question. It turns out that Dennis Ritchie himself has weighed in on this one, and he feels that the language specification is correct. You can read his justification here (also, here is the GCC view on this problem). The solution, Dennis says, is that if you want portable code you must declare your chars as either signed or unsigned. That ensures the code will be portable – well, at least you won’t run into this particular problem; C has a couple of dozen other gotchas.

The moral of today’s post is that you need to validate your software on the target system where it will be run. No matter how much validation you do on your laptop, you cannot be sure that there are no porting or platform dependencies in your code. The only way to find these is to test on the actual design where the code will finally run. Another example of this happened when I was validating code that was working fine on a desktop machine and on a development card using the actual processor. But when put on the real hardware, it failed miserably. Turned out that there was a region of memory on the design which was only accessible 8 bits at a time (narrow bus, for some hardware centric reason); all the other platforms executed 32 bit accesses just fine to all of memory. But, it would bus fault on the real hardware. This was a really hard problem to find. Yeah, the hardware team did mention this limitation – once on one line on page 386 (of a 487 page technical specification). Somehow we missed it.

Bottom line is – there is no substitute for running on the actual design.


16 September, 2009

You know you shouldn’t, but you do. I do. We all do. Though we don’t like to admit it. Yes, I use print statements to debug my code. And I work on debug tools for programmers and hardware engineers. I guess that makes me even guiltier.

Despite having a plethora of debugging tools and scripting languages, I’ll admit it – one of the first things I do to try to get a handle on a bug is to throw in a few printfs. Even though I can use valgrind with the best of ‘em, there is something comforting about seeing exactly where your code is as it makes progress (or doesn’t).

But what about embedded folks? If you’re putting an application together for a desktop machine, you have a file system and terminal windows – you can create log files and printouts galore. In the embedded world you may not have that luxury. You might have a display or a file system, but more likely you don’t. Many of today’s embedded systems do have a serial port that you can connect to a terminal (often a virtual one). One common alternative I’ve seen used – especially with systems based on embedded Linux – is to create a file system in flash memory. This can later be accessed by the debug system.

ARM has put together an approach called “semi-hosting” that they have integrated into their debugger and compiler tools. Semi-hosting allows the embedded developer to put printfs in their embedded code. The linker redirects these calls to the host system (for simulated designs) or to the debugger (for prototype boards) for output. Not only can you use printfs with semi-hosting, but you can call just about any standard I/O function. You can open log files – or even input files. I’m not going to teach you all about ARM’s semi-hosting capability in this post, but you can read about it here.
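For example, with semi-hosting enabled, perfectly ordinary host-style I/O runs from the target. The file name and messages below are just placeholders:

#include <stdio.h>

int main(void)
{
    /* With semi-hosting, these standard I/O calls are serviced by the host
       (for simulated designs) or by the debugger (for boards), not by the
       target itself. */
    FILE *log = fopen("run.log", "w");   /* the file lands on the host */
    if (log != NULL) {
        fprintf(log, "reached main() on the target\n");
        fclose(log);
    }

    printf("hello from the target\n");   /* ends up on the host console */
    return 0;
}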

So ARM has you covered if you are running one of their “programmer’s view” virtual prototypes or one of their development boards, but what about running the processor in a logic simulation? ARM never bothered to add support for semi-hosting to their logic simulation models. And I don’t think they should – these are silicon sign-off models that need to perform exactly as the real gates will. Tweaking them to support debug features would not be a good idea. But if you’re designing hardware for an ARM processor, at some point you’re going to need to run some code on that simulation model. And – thanks to Murphy’s Law – it’s not going to run the first time. Then you’re going to need to debug it. While my company has a fancy debugger that solves this problem, your first instinct is probably to “throw in a few printfs.”

One common way to support some kind of a print statement is to add a virtual character device to the logic simulation of the design. Basically, this is just an address where you can write a series of characters which, through the magic of HDL, will end up in the simulation transcript. With this you can write a “P” if the test passes or an “F” if the test fails. Of course, you now need to hunt through your simulation transcript to find all your characters.
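On the software side, all it takes is retargeting the low-level output routine so that printf lands on that address. Here is a sketch following the ARM convention of supplying your own fputc; the device address matches the customer example below, but treat the whole thing as an assumption about your own design:

#include <stdio.h>

/* Address of the virtual character device in the logic simulation. */
#define VIRTUAL_CHAR_DEV (*(volatile unsigned char *)0x80000000)

/* Retargeted low-level output: every character printf produces is written
   as a byte to the device address, where the HDL can log or display it.  */
int fputc(int ch, FILE *f)
{
    (void)f;                              /* only one output "device" here */
    VIRTUAL_CHAR_DEV = (unsigned char)ch;
    return ch;
}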

One clever approach I saw one of my customers using was to grep the log.eis file for writes to that address. In his software he wrote a series of characters to address 0x80000000. This was done with a simple modification to the retarget.c file supplied by ARM. After the program ran in the logic simulation he did a grep of the log.eis file (on newer processors this is now the “tarmac.log” file – I really wish ARM could be consistent with these things).

% grep -i -e "Write Byte.*80000000" log.eis > foo

Then he lined up the characters using cut and a simple filter program.

% cat foo | cut -c 111- | to_string > bar

Now the file “bar” has a nicely formatted set of characters as they were written to address 0x80000000. For reference, here is the listing of the “to_string” program:

% cat to_string.c

#include <stdio.h>

/* Read whitespace-separated hex byte values (as extracted from the log)
   and emit each one as a character. */
int main(void)
{
    int n;

    while (EOF != scanf("%x", &n))
        printf("%c", n);

    return 0;
}


Example - extracting printfs from log.eis

One of my engineers found a way to add a virtual terminal to an ARM processor that works with this same approach – and with the semi-hosting, too – to put the characters in a terminal window live during the simulation (rather than post processing, as above). It’s a pretty neat approach, and it lets you keep track of what that pesky code is doing as the simulation proceeds. It works with no changes in the software at all; it just gives the printf a place to go. However, with a slight modification to the code he can eliminate all the printf processing (which is really sloooow) in the logic simulation. If you’re interested, shoot me an e-mail and I can set you up with it.

Printfs and log files are often used as debug tools. They are particularly useful when you don’t want to stare at a debugger while a system is running for long periods of time. Although the computer science purists may denigrate them, they are simple, easy, and they work. With a bit of ingenuity you can use them on a wide variety of targets.


28 August, 2009

I spent most of this week working with a customer debugging a bus matrix. Yeah, I’m a manager, but debugging HDL code is a whole lot more fun than finishing up those performance reviews I need to get to. Anyway, the design is a 3 processor design – the first processor comes out of reset and runs for several thousand instructions. Then we see a problem. For no apparent reason the processor branches back to the reset vector. It does this over and over again. The customer and I were trying to figure out what was causing this.

A data value overwrites the opcode during a burst access

The software running on the processor was fully validated with the old bus interconnect – so we doubted it was a software bug. But the new bus matrix was fully validated as a stand alone block of HDL. I know the guys who did the verification, and they are very thorough. They did run every possible bus cycle from every master port, and exercised every slave port. There are 6 master ports and 12 slave ports on the matrix. It has 2 active channels, with full connectivity – that is any of the 6 masters can reach any of the 12 slaves across either of the two channels. Both channels can transfer data concurrently.

The data from this load instruction clobbers the opcode

After a bunch of debugging we found out that an opcode that was being fetched was getting lost between the boot ROM and the processor. A burst transaction on the instruction port of the processor was getting a good opcode in the first word, but the second opcode was bad. The 3rd and 4th words were OK. The second word in the burst transaction was getting the data value from the second channel, in this case a 0 – which on this processor results in a no-op. This no-op was causing the problem we were seeing.

Some work with the author of the bus matrix ultimately exposed the RTL coding problem, and we got things up and running. It turned out that to expose the problem, 4 of the 6 masters on the matrix needed to be idle. One of the remaining 2 masters needed to drive a burst read and the other needed to drive a non-sequential word read cycle. The second master needed to start the transaction at the same time as the address phase of the second beat of the burst started. Also, the master driving the single cycle needed to have a higher priority than the master driving the burst cycle. The priorities of the masters are changed dynamically as the bus matrix processes transactions. One final requirement for the bug to manifest itself is that the two masters must be accessing different slaves and the slaves must have the same response time, i.e. both need to return after the same number of wait states. If the data phases of the bus cycles were not aligned, then the problem did not occur.

This seemed to me a pretty unlikely combination of events.  So I did a little math to figure out what the likelihood of hitting this exact combination would have been if we had driven an even distribution of random bus cycles into the 6 master ports on the bus matrix.

I came out with a probability of about 1 in 10 to the 24th power. Those are about the same odds as winning the lottery – 3 times in a row. But the strange thing is that during one of our debugging sessions we forced the good opcode into the processor’s instruction bus – to get past this point and see what else would happen with the software. We hit the same problem after another 100 or so instructions – at a different place in the program. Patching that problem, we saw it again after another 200 instructions. How could such an unlikely combination of events happen so frequently in such a short time?

Recall that we need 4 of the 6 masters to be idle for the problem to be exposed.  The design is based on Harvard architecture CPUs, so each processor brings an instruction port and a data port to the party.  In this design one processor boots up, and then brings up the others later.  So there is a long period of time where two master ports are active, but the other 4 are idle.  But exposing the bug also required a particular alignment of bus cycles on the 2 active buses.  While we may think of the activity on the 2 active bus ports as random, they are controlled by the same IP block (the processor) and are therefore coordinated.

It turned out that the instruction pattern laid down by the compiler (with no optimizations enabled) for a “for” loop elicited the exact conditions that exposed the problem – but only if the loop’s initial instruction sat on a burst boundary (an address divisible by 16).  A quick scan through the executable image for the design uncovered literally hundreds of the problematic instruction sequences.  Despite appearing impossibly unlikely, the problem we uncovered would be very common when executing code compiled with our compiler.  Interestingly (or perhaps more scarily), this combination of instructions was never seen in the hand-written assembly code used in most of the tests.

Now, as the developer of the bus matrix, how could you ever be expected to know what combination of bus cycles would be driven into your design? I’m not sure. It does help to have the larger view of what’s going on in the design. In this example, if the bus matrix verification team knew that only two buses would be active during part of system execution, it might be practical to exhaustively test the activity of only 2 active masters. But the coordination of the bus cycles from the processor across the instruction and data buses is only known to the folks who designed the processor – and they’re not telling. So either learn to be psychic, or drive an actual processor as part of your verification.

I have an ongoing debate with a colleague about verification of an HDL module. If the HDL module will interface (either directly or indirectly) to a processor, I argue that you should run some code on the processor that accesses that module. This will expose combinations of stimulus that you won’t find using any other method. My colleague argues that it’s too much trouble and won’t expose any problems that couldn’t be found using traditional verification methods. But the state space you’re searching is large, and processor activity fills it in its own peculiar way: a processor can take a very narrow, well-worn path through that space, or it may wander broadly through it. Shouldn’t you test that path – at least once – before committing your HDL?

I’m not advocating throwing away all your verification tools and only driving activity from a processor. But I am saying you might want to include a processor running a program as part of your verification strategy for those parts of the design that interact with the processor.

One final thought: every embedded system project I have ever been involved with has had hardware problems exposed when software is first run. Perhaps your experience is different, but I think it is pretty rare for a system to come together without software exposing some kind of problem in the hardware.

There are a bunch of products and languages on the market to help hardware designers find problems in the HDL they are writing, and some of them have been very successful.  It seems odd to me that, with a source of design stimulus that will almost always expose real bugs (not just unusual corner cases), only a small percentage of designers take advantage of software as part of their verification.


11 August, 2009

The inspiration for this week’s blog entry comes from a comment left by Jason Andrews from Cadence. Jason is an all-around good guy who has been working in this area of hardware and software for quite a few years, too. Here’s a link to his blog over in Cadence-land – Jason Andrews’ Blog.

Anyway, Jason pointed out that the EDA companies need to get with the program and realize that software is a part of the product that their customers deliver. And he’s right. (Although, if you listen to what Aart, Wally, and the executives from Cadence are saying, there is some recognition of a problem here.) But, it’s not just the EDA companies that need to make this connection. There are folks in the design community who need to wake up to this reality, too.

Since I work with both hardware and software folks, I keep up with what’s going on in both worlds. I try to go to the big conferences for both hardware and software developers. On the hardware side this is the Design Automation Conference (DAC) and DVCon. On the software it is the Embedded Systems Conferences (ESC).

Recently at DAC, I was talking with a hardware developer. I was explaining to him the benefits of Seamless, one of the products that I work on. One of the key value points for Seamless is that it ensures that the hardware/software integration time will be minimized. Essentially, it allows the software developer to exercise their code against the hardware before it is built. This averts lots of nasty surprises when the software guys get their hands on a prototype.

The hardware developer, we’ll call him Mr. H, was unimpressed. He kinda shrugged. I was curious about this reaction, so I asked him what he thought. He said that it wasn’t really his job to make sure that software could run on the hardware. If the software guys can’t get things to run, that’s really their issue – not his. But he thought that the software team should really see a demo of Seamless, since they really do have a lot of trouble getting things running once they get their hands on a prototype board. He thought something like Seamless would really help them out.

This reminded me of a conversation I had at ESC last Spring. I was explaining how Seamless works to a software developer; we’ll call him Mr. S. Mr. S nodded a lot and thought it was a pretty good tool. But, he didn’t think he could use it. He never gets access to the hardware design early on. Even if he did, he wouldn’t know what to do with it.

“You should really show this to our hardware team,” he told me enthusiastically. “We take a long time to get hardware and software integrated, usually because of problems in the hardware. If they could use something like Seamless, I think it would help them a lot.” He explained to me that it wasn’t his job to make sure that the hardware was built right. “If they screw up the hardware design, it makes the project late – but there’s not much I can do about that.”

Both Mr. S and Mr. H are right, of course.  They have been hired to perform a specific task in the development of an embedded system.  It’s enough for most of us to accomplish the tasks we’ve been assigned by our employer – let alone take on another person’s work, even if that work has not been assigned to anyone.  While it’s nice (for his employer) that Mr. H worries about the software guys, it’s also nice that Mr. S worries about the hardware development.  However, until management makes it a requirement – part of their jobs – integrating hardware and software will be an afterthought in the design process.

The real problem is that if neither Mr. S nor Mr. H take on the task of getting hardware and software to work together, it doesn’t get done…and then it gets left to the end of the project. We’ve all seen the studies that show how the costs to fix problems increase exponentially as they are found later in the design cycle.

There are a lot of folks in the industry who have seen the problem here. Yet 75 to 80 percent of all embedded systems are delivered late, and almost half do not meet their performance goals (according to a 2009 survey done by Embedded Market Forecasters). These numbers have not budged in the decade that I’ve been reading this survey. If you want to start improving the outcome of these projects, one of the areas you’ll need to look at is getting hardware and software integrated earlier.