Hunting Multi-core Bugs
As Yogi Berra once said, “It’s déjà vu, all over again.”
I was out at a customer helping them debug a multi-core system. Unlike my Sudoku solver, this was not a multi-threaded SMP application. The customer had 4 processes, each running on a different core and specifically tied to that core. While all the programs were running under the same operating system – the designer did not rely on the OS to schedule the processes – so this was more like a “managed AMP” type of set up.
As we were trying to hunt down this really elusive nasty bug, which only raised its ugly head occasionally, I was struck by how similar the process was to a debugging session from a few years ago. Funny thing is, back then I was debugging a hardware design written in Verilog.
Is debugging hardware and debugging software becoming the same activity?
Well, no – hardware and software are still very different domains. Designing and debugging them are worlds apart. But there are some striking similarities creeping in, especially when we get to the world of multi-core. When designing and debugging hardware you have lots of things all going on at the same time. Often, functions are implemented as state machines. These state machines will interact, allowing for more complex behaviors than could practically be implemented in a single state machine. Hardware designers have tools which support the implementation, debug, and verification of state machines. There are state machine tools for software developers, too. We even have some here at Mentor. However, I’ve noticed that software folks tend to implement state machines on a more ad-hoc basis.
The programmers who coded up the linux FTP clients seemed to be real fans of state machines, as one can see just by using it. There is a state machine which keeps track of the state of the connection. A state machine which keeps track of the mode (ascii or binary), “bell” toggles a state, “prompt” toggles a state, and there are scads of other very simple state machines in this program. All these state machines interact to produce a rich set of behavior. Of course, it makes sense to implement a connection as a collection of state machines – as this is a good way to represent, both to the programmer and the user, the operation of the system.
The problem comes when one needs to validate or debug these state machines. Debugging a single state machine, pretty easy; debugging a collection of them, not so much. As I mentioned, on the hardware side there are tools to support this – even formal verification, a method of proving the correctness of a set of state machines based on more complex math than I was brave enough to attempt in college. There are some simpler tools available, too. One approach which is tremendously valuable in debugging state machines is the waveform trace. One can look at a collection of state variables over time and identify the transitions between states and the interactions between the machines.
As I watch software folks trying to debug state machines implemented in software, more often than not they resort to some type of logging or printfs (see my earlier post on “Guilty Pleasures”). A log of what all the state machines are doing is a good way to start to unravel what’s going wrong.
Of course, I’m getting awfully abstract here. I always find it a lot easier to explain things, and to understand them, with concrete examples. Now I could share with you the example from my aforementioned customer, but if I did – I’m pretty sure he’d be required to kill me (and I’m neither saying nor denying that this customer is in any way related to a government or a security apparatus attached to the said – or um, uh, “unsaid” government – which I didn’t mention. Conspiracy theorists draw your own conclusions, especially if I inexplicably disappear in the next few days). Besides, that example would be way too complicated to follow. And all kidding aside, here at Mentor Graphics we do take the confidentiality of our customers very seriously, and I would never divulge anything learned from any customer in confidence.
So here’s my simplified state machine example, the “Speed” card game. This game is a really interesting exercise in multi-core programming, so for those of you playing along at home, break out your IDEs – this is a fun one. The defining feature of the game is that unlike most card games, player’s turns do not follow a round robin scheduling algorithm. Each player makes a move whenever they can – even if this means several moves in succession. In the classic game, there are two “play piles” so two players can actually move simultaneously. I’ll give you a brief description of the game; you can read more about it at either Wikipedia or review the rules according to Hoyle.
In the game of Speed, each player is dealt 20 cards from a standard deck of cards. From those 20, they take 5 as “hand” and leave 15 face down as a replenish pile. Two cards are placed face up between the players, we’ll call these the “play piles” and there are 2 piles of 5 cards, placed face down. These face down cards are turned up on the play piles if neither player has a move, allowing the game to proceed. Players play a card by taking a card from their hand which is one higher or one lower in rank than the card on the top of the play pile. As the player uses the cards in their hand, they replace them from the replenish pile. Play ends when one player runs out of cards or when neither player can move and there are no remaining cards to put on the play pile.
My implementation of this game consists of 4 threads. The main thread that controls everything, 2 player threads and a referee thread that shuffles, deals, and keeps the game play fair. The referee thread also watches the game to see if both players get stuck, if so it turns over 2 new cards on the play piles. It also notes the end of the game and declares a winner. There are 2 player threads which act as the players, each operating on a hand, which is held privately, and the 2 play piles, which are common to both players and the referee.
I created a few classes – the “card”, a “stack”, and a “deck”. A “stack” is a collection of “cards” which is accessible as a FIFO or randomly. The “deck” is a special case of the stack class which is initialized to 52 elements which represent the standard deck of playing cards. Using these classes it was pretty easy to define and program the game.
The state machine for the player is fairly simple and is shown in figure #1. It is quite a simple state machine – each player starts out in the state “wait for start”. The referee will signal the start of the game by setting a flag which both players watch. Each player transitions from “wait for start” to “play”. Both players will look at their cards and the play stacks and determine if they have a move. If they are unable to find a move, they transition to a state of “stuck”, where they remain until a new card is placed on one of the play piles. This can be done by another player or if all players are stuck, the referee.
You’ll find the code that implements this in listing #1. As with most state machines that get implemented in software, it was not explicitly written as a state machine – it just ended up that way. Doing a formal implementation of a state machine using switch statements or a C++ template is a good way to avoid bugs and enhance maintainability – especially when they get big. This was written more for simplicity and clarity.
As usual, my implementation had bugs. Most of the bugs I was able to shake out the normal way, running the debugger and inspecting the code and the variables I set the game to run in a loop – re-shuffling and re-dealing and replaying over and over again. To give myself confidence that everything was programmed correctly, I ran 1,000 games and logged how often player #1 won, how often player #2 won, and how often there was a tie. If the system favored one player over another, then there probably was a bug. I also wrung out a few bugs by checking the deck at the end of each game. By virtually moving the cards from one stack to another under the control of 3 concurrent processes it was possible to occasionally lose a card or duplicate a card. By checking the consistency of the deck at the end of each game I was able to spot and fix a couple of race conditions.
But one bug was fairly tough, and similar to my customer’s bug. In running a loop of 1,000 games, once in a while, the program would pick up some extra cards, and once and only once it dumped core. I was unable to reproduce this fault. Attaching a debugger to the resulting core file, I saw something very interesting – there were 5 threads running. Huh!?!? What I saw was the main thread, the referee thread, a player #1, a player #2, and a second player #2. Now 3 players were not invited to the game. Looking at the code, it was not clear to me how two player #2 threads could possibly be running at the same time. The core dump was the result of a bad pointer access. Inspection of the core files led me to believe that some memory corruption was going on.
So let’s go on a bug hunt. For those of you whose debugging skills are truly Jedi-level, you’ll only need the player thread above, listing #2 – the main function of the program, and listing #3 – the referee thread. That’s all the code which is involved in this bug. Recall the symptoms: a core dump (not reproducible) where there are 3 player threads, not 2 as expected – more frequently, gaining extra cards in the deck. Initially, I thought that the debugger was in error when it displayed 5 active threads on the core file. Sometimes memory corruption can mess up the state of a program so badly that the debugger can’t make sense of what’s going on. Looking at the code I did not think it was possible for 3 player threads to exist at the same time. The main thread starts the referee thread, and then the two player threads. Then it simply does a pthread join, and waits for all 3 threads to terminate before repeating the loop. The player threads don’t start doing anything until the referee thread goes through the shuffle and deal functions.
Now, the good old fashioned symbolic debugger is probably not going to be much help here. The game fails about 1 in 10,000 runs. I can set a breakpoint in the code where I check the deck and find extra cards have mysteriously appeared. But at that point all the player threads are gone, and no amount of interrogation of the code was helping. It was amazingly frustrating. I assumed that it was a race condition. All code which references the shared play piles use a mutex to ensure that only one player or the referee is accessing the stack at any one point in time. And I tried to make the operations on these stacks properly atomic. My initial stab at trying to fix this was to lock larger and larger sections of the player code to see if the problem would go away. No luck there. I had the player threads almost fully locked, and the program for all practical purposes fully serial. The problem seemed to be less frequent but did not go away. Putting all the threads on one core, using affinity, I was able to make the problem go away (or had become so unlikely that I did not see it any more). This reinforced my belief that I had a race condition.
This is where I got the idea of trying to debug this using the same approach that I would if I were debugging the same problem in Verilog. It’s pretty simple, I would start by tracing the state variables which drove each of the state machines in question.
Now, C++ does not have a “trace” command, like my favorite logic simulator does. (Hey, maybe we should add that – note to self: check with the new boss on that one.) But I can add an LTTng user trace point to each location in the code where the variable is assigned. Fortunately, the code is pretty clean, and the state variables are an enumerated type (yeah, the influence of doing years of HDL coding) with a small scope. If you are a software person, your state variables state machines may not be as cleanly defined, but in 10 or 15 minutes you should be able to find all the state transition points in the code and annotate them.
Listing #4 shows the instrumented state machine. The trace_mark calls are the LTTng user space tracing calls. These will put a record into the LTTng database when that line of code is executed. The code is admittedly a bit messy and it needs to be recompiled, but it does allow us to trace the progress of the state machine as it executes. There is some work going on the gcc community to add tracepoints, these could be used in a similar fashion – but would eliminate the need to instrument the code and recompile, which will be nice. Pedro Alves was showing this to me a couple weeks ago. I’m looking forward to being able to use it.
Once this newly instrumented version of the program is run, we can look at the time line of the events. Figure #2 shows the events displayed on a time line in the System Analyzer. These events can be sorted, selected, and searched. But looking at it from a high level can sometimes allow you to see patterns and anomalies, which can reveal where bugs are hiding. Each game is a cluster of events for each player and the referee, and the gaps between the games are where the playing card deck data structures are checked for data integrity. At the end of each game the referee thread checks to see that all the cards are present and accounted for, and we didn’t accidently duplicate any. Figure #3 shows a close up view of one game. Each vertical line is where a “trace_mark” call was run over. If you hover the cursor over an event, a tool tip will pop up and show its contents.
Here’s the first clue. Figure #4 shows a closer in view of several games. What is notable here is that the player 2 events are completely absent from the center game. Another glaringly obvious feature is that there is a state transition for player 2 completely outside of the cluster of events which constitutes a game.
Zooming in on the events in the game following the one with a missing player 2 (figure #5) we can see that the density of events for player 2 is higher than for player 1. Based on this I’m concluding that there are in fact two player 2 threads at work at the same time – looks like the debugger was right, after all. I will also conclude that it is the player 2 from the prior game which is joining the final game uninvited – as he apparently missed his own game.
Why is a player able to skip one game and then join another? Looking closely at the events on the timeline, I can work out what’s going on. Let’s start by looking at the game prior to the one in figure #5, and the sequence of events. First, player 1 goes into the initial state of “wait_for_start”. A short time later, the referee, signals the start of the game. The state variable of player 2 has not been altered, its value is still “game_over” from the prior game. It should kick over the “wait_for_start” just after the thread starts, but, of course, we cannot predict when the OS will launch the thread. Linux is obviously off doing something more important and not starting my thread as I expected. But with the start of the game flagged by the referee thread, player 1 begins to play its cards. A bit of an unfair advantage, but that’s life. Also unexpected, but I guess possible, player 1 finishes playing all of its cards. It then transitions its state to “game_over”. The referee notices that both players are in a state of “game_over” so it dutifully checks the deck and exits the game.
Here’s where the really bad luck happens – the referee and player 1 threads have exited, but apparently, the player 2 thread did not yet get started. When I call pthread_join to wait for the player 2 thread to exit (and foolishly neglect to check the error code) pthread_join returns, and I can only surmise that it returns with the error code ESRCH – no such thread could be found – because it hasn’t started yet. But this does not terminate the thread. The player 2 thread goes happily along and soon starts. We can see that while the referee is off checking the deck, player 2 comes to life. It checks to see if the referee has started the game, it has. The referee has set the “start_game” bit to false, so the thread waits. A new referee, and a new player 1 and player 2 threads are all created. Thus the mystery of the 3 players is solved. The extra cards occur as a result of player 2 taking his stack of cards from one game to another.
There are several failure points here. First, I did not check the return status on pthread_join call – this would have alerted me to a problem. But root cause of the problem would have been hard to find without a timeline or log view of events. Probably more germane, is that the starting state for the player state machine was the last state from the prior game. My state machine was missing a state. Also, while the referee thread prevented players from starting too early, it did not check that all players had reached a state of being ready to play. Fixing these issues, the program runs flawlessly.
The key point here (besides *always* check your system call return statuses) is how the cause of the problem is exposed and almost obvious when looking at a timeline of the events. Contrast that with how you might try to debug this type of problem with a traditional debugger. Of course, this is a really simple state machine – 5 states in all (once I got the bugs worked out). I’m betting your designs are probably a bit more complicated. This approach might help you with debugging where you have multiple interacting state machines.