C# ThreadPool Implementation / Performance Spikes
In an attempt to speed up processing of physics objects in C# I decided to change a linear update algorithm into a parallel algorithm. I believed the best approach was to use the ThreadPool as it is built for completing a queue of jobs.
When I first implemented the parallel algorithm, I queued up a job for every physics object. Keep in mind, a single job completes fairly quickly (updates forces, velocity, position, checks for collision with the old state of any surrounding objects to make it thread safe, etc). I would then wait on all jobs to be finished using a single wait handle, with an interlocked integer that I decremented each time a physics object completed (upon hitting zero, I then set the wait handle). The wait was required as the next task I needed to do involved having the objects all be updated.
The first thing I noticed was that performance was crazy. When averaged, the thread pooling seemed to be going a bit faster, but had massive spikes in performance (on the order of 10 ms per update, with random jumps to 40-60ms). I attempted to profile this using ANTS, however I could not gain any insight into why the spikes were occurring.
My next approach was to still use the ThreadPool, however instead I split all the objects into groups. I initially started with only 8 groups, as that was how any cores my computer had. The performance was great. It far outperformed the single threaded approach, and had no spikes (about 6ms per update).
The only thing I thought about was that, if one job completed before the others, there would be an idle core. Therefore, I increased the number of jobs to about 20, and even up to 500. As I expected, it dropped to 5ms.
So my questions are as follows:
- Why would spikes occur when I made the job sizes quick / many?
- Is there any insight into how the ThreadPool is implemented that would help me to understand how best to use it?
Here's my take on your two questions:
I'd like to start with question 2 (how the thread pool works) because it actually holds the key to answering question 1. The thread pool is implemented (without going into details) as a (thread-safe) work queue and a group of worker threads (which may shrink or enlarge as needed). As the user calls QueueUserWorkItem the task is put into the work queue. The workers keep polling the queue and taking work if they are idle. Once they manage to take a task, they execute it and then return to the queue for more work (this is very important!). So the work is done by the workers on-demand: as the workers become idle they take more pieces of work to do.
Having said the above, it's simple to see what is the answer to question 1 (why did you see a performance difference with more fine-grained tasks): it's because with fine-grain you get more load-balancing (a very desirable property), i.e. your workers do more or less the same amount of work and all cores are exploited uniformly. As you said, with a coarse-grain task distribution, there may be longer and shorter tasks, so one or more cores may be lagging behind, slowing down the overall computation, while other do nothing. With small tasks the problem goes away. Each worker thread takes one small task at a time and then goes back for more. If one thread picks up a shorter task it will go to the queue more often, If it takes a longer task it will go to the queue less often, so things are balanced.
Finally, when the jobs are too fine-grained, and considering that the pool may enlarge to over 1K threads, there is very high contention on the queue when all threads go back to take more work (which happens very often), which may account for the spikes you are seeing. If the underlying implementation uses a blocking lock to access the queue, then context switches are very frequent which hurts performance a lot and makes it seem rather random.
Using threads has a price - you need context switching, you need locking (the job queue is most probably locked when a thread tries to fetch a new job) - it all comes at a price. This price is usually small compared to the actual work your thread is doing, but if the work ends quickly, the price becomes meaningful.
Your solution seems correct. A reasonable rule of thumb is to have twice as many threads as there are cores.
As you probably expect yourself, the spikes are likely caused by the code that manages the thread pools and distributes tasks to them.
For parallel programming, there are more sophisticated approaches than "manually" distributing work across different threads (even if using the threadpool).
See Parallel Programming in the .NET Framework for instance for an overview and different options. In your case, the "solution" may be as simple as this:
Parallel.ForEach(physicObjects, physicObject => Process(physicObject));