Why does restricting a multithreaded application to one core make it run faster?

I have a native multithreaded Win32 application written in C++ which has about 3 relatively busy threads and 4 to 6 threads that don't do much. When it runs in normal mode, total CPU usage adds up to about 15% on an 8-core machine and the application finishes in about 30 seconds. When I restrict the application to only one core by setting the affinity mask to 0x01, it completes faster, in 23 seconds.
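For reference, a minimal sketch of how such a restriction can be applied from code, assuming the Win32 `SetProcessAffinityMask` call is used on the current process. The helper names `mask_for_cpu` and `pin_process_to_mask` are hypothetical, and the non-Windows branch is only there so the snippet compiles elsewhere:

```cpp
#include <cassert>
#include <cstdint>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: a mask with only the bit for logical processor
// `index` set. Assumes index < 64. mask_for_cpu(0) is the 0x01 mask
// mentioned in the question.
constexpr std::uint64_t mask_for_cpu(unsigned index) {
    return std::uint64_t{1} << index;
}

// Hypothetical helper: restricts the current process to the given mask.
// Returns true on success. On non-Windows builds it does nothing and
// returns false.
bool pin_process_to_mask(std::uint64_t mask) {
#ifdef _WIN32
    // In a 32-bit process DWORD_PTR is 32 bits wide, so only the low
    // 32 bits of the mask survive the cast.
    return SetProcessAffinityMask(GetCurrentProcess(),
                                  static_cast<DWORD_PTR>(mask)) != 0;
#else
    (void)mask;
    return false;
#endif
}
```

Calling `pin_process_to_mask(mask_for_cpu(0))` early in `main` would correspond to the one-core run described above.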

I'm guessing it has something to do with the synchronization being cheaper when restricted to one physical core and/or some concurrent memory access issues.

I'm running Windows 7 x64; the application is 32-bit. The CPU is a Xeon X5570 with 4 cores and HT enabled.

Could anyone explain that behavior in detail? Why does it happen, and how can I predict that kind of behavior ahead of time?

Update: I guess my question wasn't very clear. I would like to know why it gets faster on one physical core, not why it doesn't get above 15% on multiple cores.

Answers


Without knowing more about the application, it is difficult to do more than guess at what is slowing it down. For a detailed analysis, consider the following factors:

  • Inter-processor communication : How much do the threads in your application communicate with each other? If they communicate very often, you will pay an overhead for it when they run on different cores.

  • Processor cache architecture : This is another important factor. You should know how the processor's caches are affected when threads run on different cores, and how much thrashing happens in the shared caches.

  • Page faults : Maybe running on a single processor causes fewer page faults because of the sequential nature of your program?

  • Locks : Is there lock overhead in your code? On its own this should not cause a slowdown, but combined with the factors above it may add some overhead.

  • NoC on the processor : If you allocate different threads to different processor cores and they communicate, you need to know which path that communication takes. Is there a dedicated connection between the cores? Perhaps you should have a look at this link.

  • Processor load : Last but not least, I hope you do not have other tasks running on the other processor cores and causing a lot of context switches. A context switch is typically very expensive.

  • Temperature : One effect to consider is the processor clock being slowed down when a CPU core heats up. I don't think you are seeing this effect, but it also depends largely on the ambient temperature.


The question is extremely vague, so just some random guesses based on typical threading problems.

An obvious candidate is contention: the threads fight over a lock and in effect run serially instead of in parallel. You end up paying for the thread context switches and gaining no benefit. This problem is easy to miss in C++, because there's a lot of low-level locking going on in the CRT and the C++ standard library, both of which were originally designed without any regard for threading.
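A minimal sketch of what such contention can look like, and one common way to reduce it. The function names are hypothetical and the workload (counting) is a stand-in for real work; the first version takes the same lock for every step, the second lets each thread work privately and takes the lock once:

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Every increment takes the same lock, so the threads mostly run one
// at a time and spend their time waiting on each other.
std::uint64_t count_with_shared_lock(unsigned threads, std::uint64_t per_thread) {
    std::mutex m;
    std::uint64_t total = 0;
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (std::uint64_t i = 0; i < per_thread; ++i) {
                std::lock_guard<std::mutex> guard(m);
                ++total;
            }
        });
    }
    for (auto& th : pool) th.join();
    return total;
}

// Each thread counts privately and takes the lock once to publish
// its result, so the lock is rarely contended.
std::uint64_t count_with_local_totals(unsigned threads, std::uint64_t per_thread) {
    std::mutex m;
    std::uint64_t total = 0;
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            std::uint64_t local = 0;
            for (std::uint64_t i = 0; i < per_thread; ++i) ++local;
            std::lock_guard<std::mutex> guard(m);
            total += local;
        });
    }
    for (auto& th : pool) th.join();
    return total;
}
```

Both functions return the same total; timing them with `std::chrono::steady_clock` on the machine in question is one way to see how much the shared lock costs there.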

A problem that's common on CPU cores with a strong memory model, like x86 and x64, is "false sharing". It occurs when multiple threads update memory locations that are within the same L1 cache line. The processor then spends a lot of horsepower keeping the core caches synchronized.
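A minimal sketch of the layout difference, assuming 64-byte cache lines (typical for x86/x64, but an assumption here). The struct and function names are hypothetical. In `PackedCounters` the two counters very likely share a cache line; in `PaddedCounters` each is forced onto its own line:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>

// Assumption: 64-byte cache lines.
constexpr std::size_t kCacheLine = 64;

// Two counters stored next to each other: very likely in the same
// cache line, so writes from two cores keep invalidating each other.
struct PackedCounters {
    std::atomic<std::uint64_t> a{0};
    std::atomic<std::uint64_t> b{0};
};

// Each counter is aligned to its own cache line.
struct PaddedCounters {
    alignas(kCacheLine) std::atomic<std::uint64_t> a{0};
    alignas(kCacheLine) std::atomic<std::uint64_t> b{0};
};

static_assert(sizeof(PaddedCounters) >= 2 * kCacheLine,
              "each counter should occupy its own cache line");

// Runs two threads; one increments only `a`, the other only `b`.
// The threads never touch the same variable, yet with the packed
// layout they can still slow each other down.
template <typename Counters>
std::uint64_t hammer(Counters& c, std::uint64_t iterations) {
    std::thread t1([&] {
        for (std::uint64_t i = 0; i < iterations; ++i)
            c.a.fetch_add(1, std::memory_order_relaxed);
    });
    std::thread t2([&] {
        for (std::uint64_t i = 0; i < iterations; ++i)
            c.b.fetch_add(1, std::memory_order_relaxed);
    });
    t1.join();
    t2.join();
    return c.a.load() + c.b.load();
}
```

Timing `hammer` on a `PackedCounters` and on a `PaddedCounters` object with a large iteration count is one way to check whether false sharing matters on a given machine; the results are the same either way, only the run time may differ.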

You only gain a benefit from multiple execution cores if the program is actually execution bound. You cannot get a benefit if it's memory bound. Your machine still has only one memory bus, and it's a strong bottleneck if the data you manipulate cannot fit in the CPU caches. The cores will simply stall, waiting for the bus to catch up. That is still counted as CPU time, so it won't be visible in CPU usage statistics, but little real work is getting done.

Clearly you'll need a good profiler to chase this kind of problem.


It's almost certainly to do with caching, given the huge effect memory latency has on performance.

By being on a single core, the first and second level caches are kept particularly hot - much more so than when you're spreading over multiple cores.

The third level cache will be shared between all cores, so it won't be any different, but it is of course a lot slower, so you gain a lot by moving locality to the first and second level caches.


"When it runs in a normal mode total CPU usage adds up to about 15% on an 8-core machine"

The usage of only 15% suggests another possible explanation to me: do your threads do I/O? My guess is that the I/O operations, not the CPU usage, determine the overall run time of your application. In most cases I/O-intensive apps become slower when the I/O jobs are multithreaded (just think about copying two files at the same time vs. one after the other).


When the threads run on multiple cores, they have to communicate with each other across cores, which makes the process run relatively slowly. Limiting the threads to a single physical core removes that cross-core communication, so the process speeds up.

This also depends on the tasks being performed: it may hold when the threads need few resources, but otherwise limiting the process to one physical core will not pay off in every case.

