CriticalSection: what are performance issues on multicore CPUs?
I'm not concerned about the case when multiple threads try to acquire a CS. What about the case where one thread locks and unlocks a CS? I'm particularly interested in how it works on current multicore CPUs (x86 and ARM).
Based on my simple guess, I think there is a lock count, and when a thread tries to acquire a CS it does an interlocked exchange/add. Does this exchange/add affect other cores/threads in any way? For example, each core has its own memory cache (am I correct?), and an interlocked exchange/add would need to interact with all the other cores to make sure that piece of RAM isn't cached somewhere else.
I assume that it's better to add a CS in a place where multiple threads aren't going to access the data than to miss a CS in a place that might rarely be used by multiple threads. But I'd like to verify that if a thread enters/exits a CS in a loop, other cores/threads won't stall because of inter-CPU synchronization going on behind the scenes.
SMP systems whose cores don't share a single cache need cache-coherency logic for precisely this eventuality: a memory location being live in more than one cache at once. If your critical section is implemented as a spin-lock, this isn't merely theoretical; it's highly likely.
An interlocked exchange/add is potentially very expensive due to the synchronisation of caches. This was particularly the case on NUMA systems, where the links between some pairs of cores were considerably slower than between others.
ARM has replaced the SWP instruction (interlocked exchange) with a different strategy, the LDREX/STREX exclusive-monitor pair, which ought to be considerably cheaper: LDREX marks the address for exclusive access, and STREX simply fails (so the software retries) if another core wrote to it in between, rather than locking anything up front.
To answer your question about the relative performance of a critical section: just how heavy it is depends on how it's implemented. Modern Linux systems check the lock in user-space and only call the kernel in the contended cases (e.g. to wait, or to unblock waiters). This means that you can use critical sections with gay abandon and not worry too much about the cost (particularly on ARMv7 systems).
Other operating systems implement the whole lot in kernel-space, which makes any operation on a critical section expensive.
If there's any potential for a race condition occurring, you need to protect against it. If the performance of this worries you, consider changing your architecture to use queuing to decouple your threads.