Dude, Where’s My Multi-Core Performance?
Yeah, it’s been a while since my last blog post. Of course, it’s all about the software, but in blogging it’s all about content – and keeping it fresh. So, here I am, with some fresh content.
I took a new job here at Mentor and now I’m working in our Embedded Software Division. Despite being employed by a hardware tools company I think that this software stuff is going to be around for a while. My new job has given me access to a couple of cool things. One is a brand spankin’ new Freescale P4080DS development board. This is a development board which has a P4080 processor running at 1.2 GHz with 4 gigabytes of RAM. The P4080 has 8 PowerPC e500 cores. The other thing I have access to, and also brand spankin’ new, is a system analysis tool – which reveals just gobs of details about what your program and system are up to.
Of course, my most deeply devoted readers will recall that I made a presentation last year at the Multi-Core Expo in San Jose. There I talked about several programs which I tried to run on a hypothetical multi-core system (actually, I ran a simulation of a design with 8 ARM cores) and described the power savings that could be achieved by running on smaller, simpler cores but more of them. When most of those programs were run on a multi-core system they were not able to achieve anywhere near a linear scaling with the addition of more cores. “That’s to be expected” – the experts say. But these were “embarrassingly parallel” applications, and being a reasonably smart guy I expected to be able to, well, to do better than I did.
So with this new 8 core chip and the visibility promised by the System Analyzer at my disposal, I thought I would take another try at getting these programs to scale well on a multi-core system. So I dusted off my multi-threaded Sudoku solver. Hoping it had not suffered too much code rot, I ported it to the P4080DS and a fairly fresh version of embedded SMP Linux (2.6.30, to be specific).
Sudoku, for those of you who have lived in a cave for the past decade or so, is a number puzzle similar to a magic square. I’m not going to go into details on the puzzle or strategies for solving them, there are lots of websites that do a better job of that than I ever could – such as this one and this one. – or just check the puzzle section of your local paper. Sudoku has some very interesting computational properties (well, if you’re a real nerd) at first glance it appears that it should be mostly parallelizable, but there are portions of any solver implementation which have to be serial.
If you want to solve a Sudoku puzzle in parallel, one way to do it is to make “n” copies of the puzzle – where “n” is the number of digits in the puzzle. Then get “n” of your closest friends together (friends who put up with you nerdiness). Give each friend a copy of the puzzle you want to solve. Have your first friend solve for as many squares as they can for the first digit – but ignore the rest of the digits in the puzzle. Have your second friend solve for the second digit, but, again, ignore everything else. Repeat until you have used up all your friends. When your friends have solved as many squares as possible, have them hand you back their copy of the puzzle. You collate their answers onto one puzzle – and again make “n” copies of that and pass them out again – with more spaces filled in, your friends will be able to make further progress the puzzle. Repeat this process until you find the complete solution. Of course, the bottleneck in this process will be the collation of the puzzles, copying them and passing them out again. But you probably will solve the puzzle much faster this way than doing it yourself.
I basically implemented the same thing – but using pthreads instead of friends. When I ran it on a beefy quad core Pentium, I saw about a 30% improvement in throughput. But I was expecting at least a 50% improvement – since each of the pthreads is completely independent and I had a lot more threads than cores, which in theory should be able to take advantage of all that raw compute power. Except for the collation and copy step, this should be an embarrassingly parallel application.
Next I tried this on the P4080 – the octo-core beast. Here, the multi-threaded version of the program ran in about 1/3rd the time as the single threaded version. Much better – but why not 1/8th? Or at least 1/7th? Using traditional *nux utilities – time, top, gprof, etc. show me how long the program takes to run, and they also show that I am not completely utilizing all the cores. But “why” is it not performing as I expected is inscrutable.
So it tried out my new system analyzer on it. One source of data for the System Analyzer is LTTng, or the Linux Trace Tool – next generation. This shows what the Linux system is doing – processes, threads, cores, interrupts, and such. The System Analyzer puts this data on a timeline and allows easy viewing and computation on that data. See Figure #1. In this diagram, the top 8 lines represent the 8 cores in the P4080. Blue is busy running user code, grey is idle. Below these lines, are the threads (in this case, actually, processes) – here green is running, red is waiting on I/O events, and yellow is runnable, but waiting for a core to become free.
So I looked at the “resource view” of what was going on on the P4080. One thing that was clear was that there were lots of idle cores. As you might guess from the story so far, there were 3 cores busy and 5 cores just sitting there idling. Idling, I tell you, while there was work to be done. Interestingly, it was not the same 3 cores that were idle – in fact, the threads were pretty evenly distributed across all 8 cores over the duration of the program.
The program I’m describing here is chewing on a super monster Sudoku puzzle, that is not your ordinary 9×9 Sudoku, or 16×16 monster Sudoku, but a 25×25 super monster Sudoku (for those playing along at home the actual super monster Sudoku is posted at the end of this blog entry). There are 25 digits in the puzzle, and I am creating one thread which solves the puzzle for each digit. So I expected to see 25 threads all get created and be runnable – while the Linux SMP operating system would dutifully and efficiently put those runnable threads on all the cores. That didn’t happen. In fact, at any one point in time – viewing this on my handy timeline feature of my System Analyzer – I could plainly see that there were no more than 5 threads at any one time. Often there were only one or two threads active. WT_?!? (Aagh, censored by my corporate overlords again!)
Next I looked at the gantt chart of the process which was launching the threads (bright green bar in figure #2), paired along with the state of that thread (top Sudoku_0 process in figure #2) and I had that slap-the-hand-on-the-forehead moment. It is amazing how obvious a problem is once you “get it” – and how confounding it can be before that. Here’s the code that launches the threads:
See the problem? I didn’t. Take a close look at figure #2 – you might have the “aha” moment. Note that the main thread – running the snippet of code above – is the top “./sudoku_0” trace, and the created threads are the “./sudoku_0” traces below.
The program creates a thread on each iteration of the (above) loop. Notice that create_thread() calls pthread_create() and returns quickly – as per the gantt chart – and at the same time you see a new thread created. Then the thread goes into a state of “cpu_wait” (yellow) and remains in the function create_thread. Just after returning from pthread_create() the main thread is suspended. What’s happening is that the thread being created will be placed by the operating system on the same core where the main thread is running – preempting it. The process creating the threads gets preempted. No more threads get created until this process gets swapped back in. For some reason, which is baffling to me, it does not get swapped in until some or all of the worker threads are done with their work. But while the main thread is suspended, there are idle CPUs. I probably need to start to understand the mysteries of the scheduler – but I don’t want to know that stuff and I don’t think I should have to get involved in that to write an efficient multi-threaded program. To me “SMP” means never having to think about your scheduler.
I did come up with a solution which I thought would allow me to remain blissfully ignorant of the scheduler. The main thread simply needs to run to the end of the thread creation loop before any of the created threads run. So we need to either start the threads in a suspended state, or have them block on something just after being created. Alternatively, I think, if the priority of the process creating the threads is higher than the priority of the created threads that they would all be created before the main process was preempted. After some quick googling I decide on the pthread_barrier approach – a simple way to ensure that a set of threads all reach the same point in the code before any proceed. The listing is below, rather clever if I don’t say so myself.
And at the start of each worker thread, I added a pthread_barrier_wait() call:
This minor change will force all the worker threads to suspend at the point of the pthread_barrier_wait() call until all the worker threads reach that point. This will ensure that all the threads are created by the main thread before running.
When I run this I find that program doesn’t run any faster, in fact, it runs slower – considerably slower. About 50% slower. Why?
More in the next post…
Sudoku Puzzle (from http://www.colinj.co.uk – where you can print this or play online):