Intro
This is the 3rd project in less than a year where I’ve seen issues with spin-loops. I’ve been dealing with spinning threads for many years now, and I won’t lie: over the years I’ve been on both the offender and the victim side.
I’m getting tired of seeing the same issues again and again, which usually makes for a good reason to write a blog post so that, hopefully, people will read it and stop making the same mistakes others did.
Actually, many others have written about this, covering various issues related to spin locks 1 2 3 4 5 6. But I guess there’s never enough material on those subjects. Some are about speed, others about fairness, a few about priority inversion, NUMA, and sometimes even about actually broken code.
If this list hasn’t convinced you that things do spin out of control when using spin-locks, and that you should use OS primitives instead, keep reading. I’ll cover what you should not do when implementing your own spin-lock. Notice I said what you should NOT do, because, again, you should probably not use a spin-lock at all these days.
And if you do… make sure you really, REALLY, REALLY know what you’re doing (spoiler: it will always come back to bite you when you least expect it).
Note this is a story about spin loops in general, not about locking algorithms for which there are many 5.
The broken spin-lock
Let’s start with the basics: you want to implement your own spin-lock.
🤪 “It’s easy! You simply have a boolean, a `lock` and an `unlock` function.”
Right…
For demonstration purposes, we are using int instead of bool as you might have something more complicated to do with it, such as storing metadata (for example: the thread ID). There are also quite a few pieces of code around that do not implement a spin-lock per se, but mutate some other content such as pointers.
class BrokenSpinLock
{
    // Using int32_t instead of bool on purpose, don't mind it.
    int32_t isLocked = 0;

public:
    void lock()
    {
        while (isLocked != 0) // (1)
        {
            // Loop again until not locked anymore
        }
        // (2)
        isLocked = 1; // (3)
    } // (4)

    void unlock()
    {
        isLocked = 0;
    }
};
Those who have dealt with multi-threading before will immediately spot the issue. The code is not thread-safe: if multiple threads attempt to use this lock, we could read invalid values of isLocked (in theory, on a CPU where a word-sized read could tear).
Worse, even if this could not happen, a wild race-condition could appear.
Consider the following example where two threads would call lock at the exact same time:
| Step | Thread A | Thread B |
|---|---|---|
| (1) | Sees `isLocked == 0` | Sees `isLocked == 0` |
| (2) | Leaves the loop | Leaves the loop |
| (3) | Writes 1 to `isLocked` | Writes 1 to `isLocked` |
Now we have two threads who think they have successfully acquired the lock!
Some may also have heard about this shiny little thing called atomic variables/operations.
To oversimplify: atomic operations guarantee that other threads cannot observe a partial/intermediate state of the operation and thus race-conditions can not occur (on those specific operations and memory).
💡 While named after the Greek atomos, meaning “that which cannot be divided”, atomic operations might as well be as dangerous and difficult to use as nuclear energy.
Let’s replace isLocked by an atomic version: std::atomic<int>. Though our code does not suffer from a race-condition on the data itself, we still do not know if the thread that sets isLocked to 1 is the one that now owns the lock. But we can now do an exchange operation atomically, which solves our little problem!
Instead of first checking if the lock is locked, then writing, we actually write our value and get the previous value, in a single atomic operation! If the previous value was 0, then it means we’re the one who actually did the locking. Otherwise we will see a 1, meaning the lock was already held either before we tried, or because another thread’s exchange completed before ours.
void lock()
{
    while (isLocked.exchange(1) != 0) {}
}
Let’s replay the scenario. Even if both threads execute the exchange simultaneously, atomicity guarantees one will finish before the other, for example Thread B’s:
| Thread A | Thread B |
|---|---|
| `isLocked.exchange(1)` | `isLocked.exchange(1)` |
| Writes 1, sees 1 | Writes 1, sees 0 |
| Writes 1, sees 1 | Now owns the lock! |
| ... | ... |
| ... | `unlock()`, writes 0 |
| Writes 1, sees 0 | ... |
| Now owns the lock! | ... |
Good, we now have a working spin-lock, but we still have a long way to go.
💡 In CPU lingo, a memory read/write is called a memory load/store.
The spin-lock that burned CPUs
You may have realized that our spin-lock will… spin doing nothing: the loop body is empty.
🤪 “Great, it’ll attempt to take ownership faster”
Well, that’s only true if you want to burn your CPU. Since the CPU has no way of knowing that you are waiting and not doing any meaningful work, it might stay at a high frequency. Modern CPUs can change the frequency of the cores to save energy, and effectively also lower the CPU core temperature. This is clearly not desirable behavior, especially on mobile/embedded devices.
Not convinced or do not care about the planet? (shame on you!) Then at least think about your users’ power bill. Still not convinced? What if I told you this can actually be slower than doing something in the loop?
Imagine that a lot of threads are attempting to lock your spin-lock. Only one can win. But worse, due to its nature you always do memory writes, which need to be synchronized between the different cores of your CPU!
From Intel’s Optimization Reference Manual 3 11.4.2:
On a modern microprocessor with a superscalar speculative execution engine, a loop like this results in the issue of multiple simultaneous read requests from the spinning thread. These requests usually execute out-of-order with each read request being allocated a buffer resource. On detection of a write by a worker thread to a load that is in progress, the processor must guarantee no violations of memory order occur. The necessity of maintaining the order of outstanding memory operations inevitably costs the processor a severe penalty that impacts all threads.
And the issue will keep getting bigger with recent CPUs that have many cores and sometimes NUMA memory.
This penalty occurs on the Intel Core Solo and Intel Core Duo processors. However, the penalty on these processors is small compared with penalties suffered on the Intel Xeon processors. There the performance penalty for exiting the loop is about 25 times more severe.
If you still need some convincing… this is even worse if you enable SMT (hyperthreading):
On a processor supporting Intel HT Technology, spin-wait loops can consume a significant portion of the execution bandwidth of the processor. One logical processor executing a spin-wait loop can severely impact the performance of the other logical processor.
Now that I hopefully have your attention, here’s how to ~~solve~~ mitigate the issue:
The best way to avoid “bothering” your neighbours is to ~~avoid spin loops~~ tell the CPU you are waiting to be notified of a memory change/doing a spin loop!
On x86 CPUs, this is done with the PAUSE instruction. It was designed exactly for this use-case!
The penalty of exiting from a spin-wait loop can be avoided by inserting a `PAUSE` instruction in the loop. In spite of the name, the `PAUSE` instruction improves performance by introducing a slight delay in the loop and effectively causing the memory read requests to be issued at a rate that allows immediate detection of any store to the synchronization variable. This prevents the occurrence of a long delay due to memory order violation.
You can modify the code to use this instruction with compiler intrinsics:
// Headers for the intrinsics (assumption: MSVC exposes both through <intrin.h>,
// GCC/Clang expose _mm_pause via <immintrin.h> and __yield via <arm_acle.h>).
#if defined(_MSC_VER)
#include <intrin.h>
#elif defined(__i386__) || defined(__x86_64__)
#include <immintrin.h>
#elif defined(__arm__) || defined(__aarch64__)
#include <arm_acle.h>
#endif

void cpu_pause()
{
#if defined(__i386__) || defined(__x86_64__) || defined(_M_IX86) || defined(_M_X64)
    _mm_pause();
#elif defined(__arm__) || defined(__aarch64__) || defined(_M_ARM) || defined(_M_ARM64) || defined(_M_ARM64EC)
    __yield();
#else
#error "unknown instruction set"
#endif
}
void lock()
{
    while (isLocked.exchange(1) != 0)
    {
        cpu_pause();
    }
}
The spin-lock that didn’t wait enough
As already mentioned, the cost of synchronizing data between CPU cores keeps growing as new CPUs gain more cores, multiple core complexes, or NUMA architectures.
Conflicts (multiple cores trying to do atomic stores at the same time) thus need to be mitigated in some way.
A traditional approach is to use a backoff strategy that increases the number of PAUSE instructions for each attempt at locking.
The one you will find most often (recommended by the Intel Optimization Manual, 2.7.4) is the exponential backoff:
void lock()
{
    const int maxPauses = 64; // MAX_BACKOFF
    int nbPauses = 1;
    while (isLocked.exchange(1) != 0)
    {
        for (int i = 0; i < nbPauses; i++)
            cpu_pause();
        // Multiply the number of pauses by 2 until we reach the max backoff count.
        nbPauses = nbPauses < maxPauses ? nbPauses * 2 : nbPauses;
    }
}
As mentioned by Intel:
The number of `PAUSE` instructions are increased by a factor of 2 until some `MAX_BACKOFF` is reached which is subject to tuning.
We also mix in a bit of randomness by using `rdtsc`, and let’s refactor the yielding part into a structure that can easily be swapped:
struct Yielder
{
    static const int maxPauses = 64; // MAX_BACKOFF
    int nbPauses = 1;

    void do_yield()
    {
        // jitter is in the range [0; nbPauses-1].
        // We can use bitwise AND since nbPauses is a power of 2.
        const int jitter = static_cast<int>(__rdtsc() & (nbPauses - 1));
        // So after subtracting we get a value in [1; nbPauses].
        const int nbPausesThisLoop = nbPauses - jitter;
        for (int i = 0; i < nbPausesThisLoop; i++)
            cpu_pause();
        // Multiply the number of pauses by 2 until we reach the max backoff count.
        nbPauses = nbPauses < maxPauses ? nbPauses * 2 : nbPauses;
    }
};
void lock()
{
    Yielder yielder;
    while (isLocked.exchange(1) != 0)
    {
        yielder.do_yield();
    }
}
The spin-lock that waited too long
Remember the comment above about MAX_BACKOFF being subject to tuning?
Well you’d better make sure to tune it for the exact CPU you’ll be working on.
Let’s have a look at the following table listing the measured7 duration of PAUSE in cycles:
| Microarchitecture | PAUSE duration (cycles) |
|---|---|
| Sandy Bridge | 11.00 |
| Ivy Bridge | 10.00 |
| Haswell | 9.00 |
| Broadwell | 9.00 |
| Skylake | 140.00 |
| Kaby Lake | 140.00 |
| Coffee Lake | 152.50 |
| Cannon Lake | 157.00 |
| Cascade Lake | 40.00 |
| Ice Lake | 138.20 |
| Rocket Lake | 138.20 |
| Alder Lake-P | 160.17 |
| Tremont | 176.00 |
| Alder Lake-E | 61.80 |
| AMD Zen+ | 3.00 |
| AMD Zen2 | 65.00 |
| AMD Zen3 | 65.00 |
| AMD Zen4 | 65.00 |
And that’s where the issue lies. Depending on the architecture, you may get more than 10x changes in cycles per PAUSE.
Old CPUs tended to have a small PAUSE duration of ~10 cycles on Intel and ~3 on AMD, whereas newer architectures have a duration of 100-160 cycles on Intel and ~60 cycles on AMD.
And this might get worse in the future!
This actually is also now part of the latest Intel Optimization Reference Manual 3 2.7.4:
The latency of the `PAUSE` instruction in prior generation microarchitectures is about 10 cycles, whereas in Skylake Client microarchitecture it has been extended to as many as 140 cycles.
How to fix this, you ask? I’ll defer to Intel’s advice again and limit the duration of the PAUSE loop using CPU cycles instead of a counter:
static inline bool before(uint64_t a, uint64_t b)
{
    return ((int64_t)b - (int64_t)a) > 0;
}

void pollDelay(uint32_t clocks)
{
    uint64_t endTime = __rdtsc() + clocks;
    while (before(__rdtsc(), endTime))
        cpu_pause();
}
As the `PAUSE` latency has been increased significantly, workloads that are sensitive to `PAUSE` latency will suffer some performance loss.
[…]
Notice that in the Skylake Client microarchitecture the `RDTSC` instruction counts at the machine’s guaranteed P1 frequency independently of the current processor clock (see the INVARIANT TSC property), and therefore, when running in Intel® Turbo-Boost-enabled mode, the delay will remain constant, but the number of instructions that could have been executed will change.
Let’s implement this:
struct Yielder
{
    static const int maxPauses = 64; // MAX_BACKOFF
    int nbPauses = 1;
    const int maxCycles = /*Some value*/;

    void do_yield()
    {
        uint64_t beginTSC = __rdtsc();
        uint64_t endTSC = beginTSC + maxCycles; // Max duration of the yield
        // jitter is in the range [0; nbPauses-1].
        // We can use bitwise AND since nbPauses is a power of 2.
        const int jitter = static_cast<int>(beginTSC & (nbPauses - 1));
        // So after subtracting we get a value in [1; nbPauses].
        const int nbPausesThisLoop = nbPauses - jitter;
        for (int i = 0; i < nbPausesThisLoop && before(__rdtsc(), endTSC); i++)
            cpu_pause();
        // Multiply the number of pauses by 2 until we reach the max backoff count.
        nbPauses = nbPauses < maxPauses ? nbPauses * 2 : nbPauses;
    }
};
void lock()
{
    Yielder yield;
    while (isLocked.exchange(1) != 0)
    {
        yield.do_yield();
    }
}
This method has two main advantages:
- We define the max duration of a `PAUSE` loop in terms of `TSC` cycles, which is (on most modern CPUs) independent of the actual frequency of the core or the duration of `PAUSE`.
- If the operating system happens to preempt our thread in the middle of the loop, it will stop yielding after being rescheduled if the maximum duration has been exceeded. Otherwise we could call `PAUSE` more than necessary on a thread wakeup.
You’ll notice that we kept the exponential backoff as a plain counter. This is to avoid having to compute the duration of a single PAUSE (this would require getting rid of the jitter).
However, we still need to choose a value for maxCycles. This again is purely empirical and needs tuning, but one may assume the duration of a context switch is about 3µs. Depending on the system and the actual switch it can be more or less, but it should be in the same order of magnitude.
We can then estimate the TSC cycles/µs conversion to be ~3200 cycles/µs for a 3.2GHz clock. Another common frequency for the TSC is 2.5GHz.
While obviously incorrect, this is a good guesstimate for a default value on PC. At worst, you’ll most likely be off by about 2x from the real value, which is way better than the 10x you could get with the varying PAUSE durations!
I did however mention this is a default value, and the best thing to do is to retrieve the real value, either from the OS or by measuring it. Sadly TSC calibration is not officially exposed by Linux/Windows, so the best way is to measure the TSC against the system high resolution clock. Ideally this should be done asynchronously (don’t do it on your application main thread at boot, please).
// Please, do this asynchronously and not during your main thread init.
// Otherwise you will make your application boot longer for nothing!
// Note there are more accurate ways to do this, but we do not need very high precision nor accuracy.
// You may also split this function in two and do some meaningful amount of work instead of sleeping.
// (Needs <chrono>, <thread> and the compiler's __rdtsc intrinsic header.)
uint64_t MeasureCyclesPerUs()
{
    const auto clockBefore = std::chrono::high_resolution_clock::now();
    const uint64_t cyclesBefore = __rdtsc();
    std::this_thread::sleep_for(std::chrono::microseconds{10});
    const auto clockAfter = std::chrono::high_resolution_clock::now();
    const uint64_t cyclesAfter = __rdtsc();

    const auto clockDelta = clockAfter - clockBefore;
    const uint64_t cyclesDelta = cyclesAfter - cyclesBefore;
    const uint64_t cyclesPerUs = (1000 * cyclesDelta) / std::chrono::nanoseconds(clockDelta).count();
    return cyclesPerUs;
}
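Putting the two guesstimates together, a default maxCycles can then be derived from the measured (or assumed) TSC rate. A minimal sketch, using the ~3µs context-switch figure mentioned above:

// Rough derivation of maxCycles: cap the PAUSE loop at roughly one context switch (~3µs).
// With the ~3200 cycles/µs guesstimate this lands around 10000 TSC cycles.
uint64_t ComputeMaxCycles()
{
    const uint64_t cyclesPerUs = MeasureCyclesPerUs(); // Or the ~3200 default while the measurement runs.
    const uint64_t contextSwitchDurationUs = 3;        // Guesstimate, see above.
    return cyclesPerUs * contextSwitchDurationUs;
}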
💡 Windows actually “exposes” this value as `CyclesPerYield` in the kernel shared data at offset `0x2D6`. This is used internally by synchronization primitives to determine how many `PAUSE` instructions it should issue. However I wouldn’t recommend using those internals unless your code sanitizes the value.
The spin-lock that used too many barriers
We’ve only briefly touched on the topic of atomics. All atomic operations actually take an optional parameter which is the memory order.
I don’t want to spend too much time on this as entire talks are dedicated to it, and it’s not an easy topic.
However do know this: not providing the parameter is equivalent to using std::memory_order_seq_cst (sequentially consistent) which enforces the most restrictions. On some platforms this may even flush your cache via memory barriers!
Our previous example can actually be re-written using acquire/release semantics:
void lock()
{
    Yielder yield;
    while (isLocked.exchange(1, std::memory_order_acquire) != 0)
    {
        yield.do_yield();
    }
}

void unlock()
{
    isLocked.store(0, std::memory_order_release);
}
On my x64 machine, with the exponential backoff:
| Lock Type | Uncontended (ops/s) | Contended (ops/s) |
|---|---|---|
| ExpBackoff+SeqCst | 313M | 55.3M |
| ExpBackoff+AcqRel | 612M | 58.7M |
| ExpBackoff+Acquire | 652M | 65.3M |
You may have a look at the various assemblies generated on this compiler explorer example https://godbolt.org/z/GjEEWPsj8.
The spin-lock that saturated the load ports
A spin-lock should be fast, otherwise you would just use your average system lock. While we mitigated the inter-core synchronization with the jitter and exponential backoff, there are ways to reduce the cache coherency 8 traffic under contention. This has been mentioned by many in the past 9 10 11 but it doesn’t hurt to repeat it. Instead of looping over a Test-And-Set (aka Compare-And-Swap) operation, prefer using both Test and Test-And-Set operations! The same applies to our Load-And-Test (aka Exchange) operation.
So instead of:
void lock()
{
    Yielder yield;
    while (isLocked.exchange(1, std::memory_order_acquire) != 0)
    {
        yield.do_yield();
    }
}
Do:
void lock()
{
    Yielder yield;
    // Actually start with an exchange: we assume the lock is not already taken,
    // because the main use case of a spinlock is when there's no contention!
    while (isLocked.exchange(1, std::memory_order_acquire) != 0)
    {
        // To avoid locking the cache line with a write access, only read before attempting the writes.
        do {
            yield.do_yield(); // Yield while we fail to obtain the lock.
        } while (isLocked.load(std::memory_order_relaxed) != 0);
    }
}
The spin-lock that didn’t handle priority inversion
Priority inversion is one of the worst things that could (and will) happen with a spinlock. And it impacts most severely the platforms that need them the most! (Embedded, real-time OSes, …) Let’s have a look at the issue:
- A low-priority thread acquires your spinlock
- A high-priority thread tries to acquire the lock and starts spinning
- The OS scheduler preempts the low-priority thread to run another thread with medium/high priority (anything higher than “low”)
- There are no cores left to run the low priority thread as they are all used by higher priority threads.
- The high-priority thread burns CPU cycles spinning forever.
🤪 “Let’s use `std::this_thread::yield()`?”
Meh, did you test it on multiple systems? I’ll play along and give it a try.
struct Yielder
{
    // Same members as before: maxPauses, nbPauses, ...

    void do_yield_expo_and_jitter()
    {
        // Same as before: exponential backoff and jitter
    }

    void do_yield()
    {
        do_yield_expo_and_jitter();
        if (nbPauses >= maxPauses)
        {
            std::this_thread::yield(); // Yield the thread back to the OS
            nbPauses = 1;
        }
    }
};
Now when we reach the maximum number of iterations, we make the thread yield its quantum to the operating system (SwitchToThread on Windows, sched_yield on Linux) so that another thread may be scheduled.
While in practice this may, sometimes, solve the issue, since the OS is now free to schedule other threads including the low-priority one, nothing guarantees it will!
Some implementations may end up just rescheduling the thread that just yielded since it’s of higher priority.
You may have also seen implementations that use Sleep(0) on Windows. This is better than SwitchToThread (which can only yield to a thread ready to run on the current core, per the docs12; the same goes for normal Linux schedulers13). However this used to14 only yield to threads of the same or higher priority, and still does on the real-time versions of the OS! For example on an embedded device, or a console.
The only way to schedule any thread on real-time kernels, be it Windows or Linux, is to sleep for a non-zero duration… which we obviously would like to avoid!
So the solution that the DotNet runtime team came up with is to start with SwitchToThread, then Sleep(0) then Sleep(1)!
// We prefer to call Thread.Yield first, triggering a SwitchToThread. This
// unfortunately doesn't consider all runnable threads on all OS SKUs. In
// some cases, it may only consult the runnable threads whose ideal processor
// is the one currently executing code. Thus we occasionally issue a call to
// Sleep(0), which considers all runnable threads at equal priority. Even this
// is insufficient since we may be spin waiting for lower priority threads to
// execute; we therefore must call Sleep(1) once in a while too, which considers
// all runnable threads, regardless of ideal processor and priority, but may
// remove the thread from the scheduler's queue for 10+ms, if the system is
// configured to use the (default) coarse-grained system timer.
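Transposed to code, that escalation could look roughly like the sketch below. This is only an illustration: the thresholds and the `iteration` parameter are made-up placeholders, and the calls are the Windows ones (from <windows.h>) mentioned above.

// Rough sketch of a DotNet-style yield escalation on Windows (thresholds are arbitrary placeholders).
void yield_to_os(int iteration)
{
    if (iteration % 20 == 19)
        Sleep(1);          // Once in a while: considers every runnable thread, but may park us for 10+ms.
    else if (iteration % 5 == 4)
        Sleep(0);          // Occasionally: considers all runnable threads of equal priority.
    else
        SwitchToThread();  // Default: cheapest, but may only consider threads ready on the current core.
}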
The spin-locks that woke something unrelated
So we dealt with the priority inversion at the cost of potential sleeps.
🤪 “Ship it!”
Please god no… Yes, you (most likely) avoid the worst case scenario (the livelock), but really, is it fine?
Let’s stop for a second here and assume we never did more than yield.
As you may have already guessed, a livelock is only half the story (this is starting to be a recurring pattern, isn’t it?). The fact is, the issue could happen even if all your threads have the same priority! (Yes, I saw you coming asking for an easy fix by removing priorities.) Consider the following scenario:
- 4 cores machine
- 4 high priority threads: A, B, C, D (your thread pool)
- 4 other high priority threads: X, Y, W, Z (controlled by a 3rd party, those suck. Please library writers, don’t spawn threads on your own, thank you!).
- Thread A acquires the lock
- Threads B, C, D spin, trying to acquire it.
At this point, we have the following:
| Core 0 | Core 1 | Core 2 | Core 3 |
|---|---|---|---|
| Thread A | Thread B | Thread C | Thread D |
- Thread X gets scheduled (A somehow released its quantum, still holds the lock)
| Core 0 | Core 1 | Core 2 | Core 3 |
|---|---|---|---|
| Thread X | Thread B | Thread C | Thread D |
- Thread B yields, Y is scheduled
| Core 0 | Core 1 | Core 2 | Core 3 |
|---|---|---|---|
| Thread X | Thread Y | Thread C | Thread D |
- Thread Y yields, B is scheduled again, C and D yield to W and Z
| Core 0 | Core 1 | Core 2 | Core 3 |
|---|---|---|---|
| Thread X | Thread B | Thread W | Thread Z |
I could continue this for a long time. Even though thread A might get scheduled again, it might not! This depends on your scheduler’s internals. Especially since yielding may yield only to the ready threads of the current core12 13. At the time of writing this article, this actually is a known issue with Address Sanitizer!
Oh, and even if it did get scheduled, you probably lost a lot of time switching from one thread to the other; this is your typical lock convoy15 and is what Linus Torvalds more or less hints at here4 16 13:
And no, adding random “sched_yield()” calls while you’re spinning on the spinlock will not really help. It will easily result in scheduling storms while people are yielding to all the wrong processes.
So no, simply using the same priority for all threads or sleeping is not fine. Let’s see what we can do about it.
The spin-lock that spoke to the OS
The real problem, when you spin in a loop, is that you expect things to go fast so that your thread may continue.
But by yielding this way you defeat a lot of the kernel heuristics. It has no way to know what you actually meant, and may schedule anything (or nothing) but threads from your process. Worse, it may degrade your thread priority, move it to lower frequency cores, and you lose any kind of priority boost when waking up due to the lock being released…
That’s clearly not what we want. If only there was a way to communicate our intent to the OS…
Well that’s exactly what Linux did when introducing the futex API! Since we’re waiting in a loop for a value to change, just notify the OS about it and let it handle things from there.
Windows also implements this with the WaitOnAddress API, which we’ll be demonstrating here:
// Needs <windows.h> and linking against Synchronization.lib.
// We take a volatile void* so we can pass the address of our std::atomic<int32_t> directly,
// relying on it having the same representation as a plain int32_t.
void do_yield(volatile void* address, int32_t comparisonValue, uint32_t timeoutMs)
{
    do_yield_expo_and_jitter();
    if (nbPauses >= maxPauses)
    {
        // The thread will stay asleep while the value at the given address doesn't change
        // and `WakeByAddressSingle`/`WakeByAddressAll` isn't called.
        // We might have a spurious wakeup though, so the value needs to be checked afterward,
        // which we already do since we spin.
        WaitOnAddress(address, &comparisonValue, sizeof(comparisonValue), timeoutMs);
        nbPauses = 1;
    }
}
void lock()
{
    Yielder yield;
    while (isLocked.exchange(1, std::memory_order_acquire) != 0)
    {
        do {
            yield.do_yield(&isLocked, 1 /*while locked*/, 1 /*ms*/);
        } while (isLocked.load(std::memory_order_relaxed) != 0);
    }
}

void unlock()
{
    isLocked = 0;
    WakeByAddressSingle(&isLocked); // Notify a potential thread waiting, if any.
}
Windows’ WaitOnAddress internally does a single iteration before issuing the system call, but Linux’s futex API is a direct syscall. That’s why we call WaitOnAddress only after spinning a bit.
This lets us have a similar spinning strategy on all platforms, which ensures a more consistent behavior.
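For reference, here is roughly what the equivalent wait/wake pair looks like on Linux. A minimal sketch going through the raw futex syscall (there is no libc wrapper), assuming the lock value is a plain 32-bit integer as the futex word requires:

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdint>
#include <ctime>

// Sleep while *address still equals comparisonValue, until woken or timed out.
void futex_wait(int32_t* address, int32_t comparisonValue, uint32_t timeoutMs)
{
    timespec timeout{};
    timeout.tv_sec = timeoutMs / 1000;
    timeout.tv_nsec = (timeoutMs % 1000) * 1000000L;
    syscall(SYS_futex, address, FUTEX_WAIT_PRIVATE, comparisonValue, &timeout, nullptr, 0);
}

// Wake a single thread waiting on this address, if any.
void futex_wake_one(int32_t* address)
{
    syscall(SYS_futex, address, FUTEX_WAKE_PRIVATE, 1, nullptr, nullptr, 0);
}

The `_PRIVATE` variants are fine here since the lock isn’t shared across processes.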
🤪 “Wait! Wasn’t `std::atomic_wait` added to the standard recently?”
Yes! And this is what one should have used if implementers had done the right thing from the get-go (and, more importantly, had done the same thing in each implementation), but this was not the case…17
- libc++ (clang) used to do exponential backoff with thread yields before `futex`. At least it got fixed in January 2025, but it still does exponential backoff.
- MSVC STL does the right thing™ imho and goes almost straight to the OS since the first implementation. Good job!
- So does libstdc++ (GCC)!
So if you use it, you may get a built-in exponential backoff, or not. Both implementations actually make sense from an implementer’s point of view (Do you expect std::atomic_wait users to use it with their own backoff strategies? Or directly as condition variables?), but this difference ends up being problematic since the code behaves differently between implementations.
In the end, as usual with the std library, you’re better off using the OS primitives directly if you want portable behaviour that you control.
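For completeness, here is a minimal sketch of what the lock looks like with the C++20 interface (assumes <atomic>); whether it spins before parking the thread is up to the implementation, as discussed above:

std::atomic<int32_t> isLocked{0};

void lock()
{
    while (isLocked.exchange(1, std::memory_order_acquire) != 0)
    {
        // Blocks while the value is still 1; spurious wakeups are handled by the outer loop.
        isLocked.wait(1, std::memory_order_relaxed);
    }
}

void unlock()
{
    isLocked.store(0, std::memory_order_release);
    isLocked.notify_one(); // Wake one potential waiter, if any.
}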
As mentioned, Windows’ `WaitOnAddress` will do a single spin before doing a syscall. The duration of `PAUSE` is computed on process start by the loader in `LdrpInitializeProcess` and stored in `ntdll.dll!RtlpWaitOnAddressSpinCycleCount`.
The spin-lock that was unfair
An issue with some lock algorithms is that they may be unfair: under contention, a thread may never actually grab ownership of the lock if other threads are faster.
This time I’ll simply give a warning and ask you to trust me as this article is starting to be lengthy. You may have encountered some “ticket” locks that attempt to enhance the fairness of the lock. While it may look good on paper, it’s actually not so good in practice.
Not only is it slower due to its complexity, but as mentioned before only the OS really knows what’s good for scheduling. And if you want to use a futex-like API you end up having to wake up all potential waiters instead of just the one you want. So please, rely on the OS primitives for fairness instead. (Even if we didn’t have those primitives, a random+exponential backoff may perform better than a ticket lock anyway!)
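For reference, this is roughly what such a ticket lock looks like; a sketch shown only so you know what to avoid, reusing the cpu_pause helper from earlier:

// Minimal ticket spin-lock sketch: fair (FIFO) on paper, but every waiter polls nowServing,
// and a futex-based wake would have to wake all waiters instead of just the next one.
struct TicketLock
{
    std::atomic<uint32_t> nextTicket{0};
    std::atomic<uint32_t> nowServing{0};

    void lock()
    {
        const uint32_t myTicket = nextTicket.fetch_add(1, std::memory_order_relaxed);
        while (nowServing.load(std::memory_order_acquire) != myTicket)
            cpu_pause();
    }

    void unlock()
    {
        nowServing.store(nowServing.load(std::memory_order_relaxed) + 1, std::memory_order_release);
    }
};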
The spin-locks that were falsely sharing
Here comes another tidbit of CPU architecture: even if you write to different variables, they may share the same cacheline!
And this is really bad for performance when you do atomic operations on the same cacheline, even if the addresses are different.
To fix this issue, you may enforce alignment of your variables or use padding in a struct. False sharing is also known as destructive interference, which led to the standard’s std::hardware_destructive_interference_size value!
alignas(std::hardware_destructive_interference_size) MyLock lock1;
alignas(std::hardware_destructive_interference_size) MyLock lock2;
This is however not a silver bullet!
While you will avoid false sharing, you may also fill your TLB and L1 cache faster which may lead to more cache thrashing.
You may even encounter cache bank conflicts. Cache bank conflicts only exist on some CPUs, but don’t trust manufacturers to avoid them. From 3.6.1.3 of the Intel Optimization Reference Manual:
- “In the Sandy Bridge microarchitecture, the internal organization of the L1D cache may manifest […]”
- “The L1D cache bank conflict issue does not apply to Haswell microarchitecture.”
- “In the Golden Cove microarchitecture, bank conflicts often happen when multiple loads access […]”
💡 So this was once an issue, then fixed, then it came back in another form.
These are thankfully mitigated thanks to the random+exponential backoff, but are getting worse (this pattern of “yes, but” should really annoy you by now, that’s the whole point of this article).
Whenever possible, avoid reading the same memory location within a tight loop or using multiple load operations.
And the only way to really fix that is to… actually park the thread by calling an OS primitive such as a futex! You should also avoid doing multiple loads per loop, as recommended previously.
What about specialized instructions?
🤪 “I’ve read about `MWAIT` and `TPAUSE`.”
And you should probably have read further as those are privileged instructions! But yes they do have the same look as a futex wait/wake, which is very tempting.
And, to be fair, AMD does offer a userland alternative, monitorx and mwaitx, that we can use!
One advantage of mwaitx is that you can tell the CPU to wait for a given TSC count instead of having to loop! So it can be used to replace the _mm_pause loop when supported, and that’s actually what Windows’ locking primitives such as WaitOnAddress or AcquireSRWLockExclusive do internally!
Not only is the “API” easier (you provide a timestamp for the wakeup date) but it can save power! 18 19
Just do not use it for long periods since you are still delaying potential work from other threads by not explicitly yielding to the OS.
💡 `mwaitx` can spuriously wake up, but this is fine for our usage since we’ll just spin and try again!
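Here is a rough sketch of how the PAUSE loop could be replaced when mwaitx is available. This assumes GCC/Clang’s `_mm_monitorx`/`_mm_mwaitx` intrinsics (compile with -mmwaitx) and my reading of AMD’s documentation for the flag bits, so double-check it against the manual:

#include <x86intrin.h> // _mm_monitorx / _mm_mwaitx on GCC/Clang, compile with -mmwaitx

// Wait for a store to the monitored cache line, or until roughly maxCycles TSC ticks elapse.
// Real code must first check CPUID for MONITORX support and fall back to the PAUSE loop otherwise.
// Spurious wakeups are fine: the caller re-checks the lock value and loops anyway.
void mwaitx_pause(const volatile void* address, uint32_t maxCycles)
{
    _mm_monitorx(const_cast<void*>(address), 0, 0);      // Arm the monitor on the lock's cache line.
    _mm_mwaitx(0x2 /* bit 1: enable the TSC-based timeout */, 0, maxCycles);
}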
Conclusion
You’ll notice I barely mentioned ARM; that’s because I do not have enough experience with that architecture to give any advice other than that you should use the proper memory ordering for decent performance.
If you read this far, I’ll say it again: in most (and pretty much all) cases you should not even need to worry about the performance of your locks. The best lock is the one you don’t use.
Again, from Linus: 4
Because you should never ever think that you’re clever enough to write your own locking routines.. Because the likelihood is that you aren’t (and by that “you” I very much include myself - we’ve tweaked all the in-kernel locking over decades, and gone through the simple test-and-set to ticket locks to cacheline-efficient queuing locks, and even people who know what they are doing tend to get it wrong several times).
There’s a reason why you can find decades of academic papers on locking. Really. It’s hard.
But if you do, even after all those warnings, at least make sure you follow best practices, and especially the prerequisites for a spinlock to be efficient:
- There is low contention
- The critical section (work done under the lock) is very small. (Consider that “small” varies with the number of threads competing for the lock…)
- Notify your OS about what you’re doing (`futex`, `WaitOnAddress`, …)
Bonus!
List of projects/libraries that do (or did) it wrong and that I happened to stumble upon:
- RPMalloc: the one that led to this rant; we had a dependency using it on console, and it caused livelocks. It only loops with a single CPU yield. Bad for perf, impossible (read: will break) to use on embedded platforms with a realtime scheduler.
- OpenBSD’s libc goes straight to an OS thread yield
- Glibc goes straight to futex by default
  - The default mutex is the “simple” one and only checks the value once before going straight to the OS
  - PTHREAD_MUTEX_ADAPTIVE_NP is mostly good! The number of spins is fixed by default but can be tweaked using the tunable glibc.pthread.mutex_spin_count
- Intel TBB:
  - tbb::spinlock: simple backoff with 16 yields
  - tbb::mutex: pure _mm_pause backoff with a fixed count, then thread yield backoff, then a futex wait on Linux, or a plain semaphore on other platforms
- Webkit:
  - OS thread yield (SwitchToThread/sched_yield) on CAS failure
  - Hardcoded spin count
  - Duplicated here
  - Same thing here
  - Even though they have benchmarks (both for speed and fairness)!
- AddressSanitizer (ASAN)
  - The Issue => can livelock with a realtime scheduler
  - The Implementation
  - Yield is done by sched_yield on Linux and Sleep(0) on Windows
- DotNet runtime still uses Sleep instead of WaitOnAddress
- And so many others…
Footnotes
- Measuring Mutexes, Spinlocks and how Bad the Linux Scheduler Really Is - Malte Skarupke ↩
- Spinlocks Considered Harmful - matklad ↩
- Intel® 64 and IA-32 Architectures Optimization Reference Manual ↩ ↩2 ↩3
- Linus Torvalds on spinlocks (1) - Real World Technologies Forum ↩ ↩2 ↩3
- Lock–Unlock: Is That All? A Pragmatic Analysis of Locking in Software Systems - Rachid Guerraoui, Hugo Guiroux, Renaud Lachaize, Vivien Quéma, Vasileios Trigonakis. ACM Transactions on Computer Systems, 2019, pp.1-149. ↩ ↩2
- Spin Locks Considered Harmful, and How to Write Them When We Must - Intel (via The Wayback Machine) ↩
- Cache coherency primer - Fabian Giesen ↩
- AMD Ryzen Processor Software Optimization (GDC 2024) - Ken Mitchell ↩
- AMD RYZEN™ CPU Optimization (GDC 2018) - Ken Mitchell & Elliot Kim ↩
- Correctly implementing a spinlock in C++ - Erik Rigtorp ↩
- SwitchToThread docs - Microsoft ↩ ↩2
- Linus Torvalds on spinlocks (3) - Real World Technologies Forum ↩ ↩2 ↩3
- Sleeping vs. Yielding - Ken Henderson ↩
- A Complete Guide to Lock Convoys - Dave Kilian ↩
- Linus Torvalds on spinlocks (2) - Real World Technologies Forum ↩
- Implementing atomic wait and notify - Blat Blatnik ↩
- SpecialK’s commit automating mwaitx’s usage - Andon M. Coleman ↩
- An Analysis of User-space Idle State Instructions on x86 Processors - Malte-Christian Kuns, Hannes Tröpgen, Robert Schöne. ICPE ‘25: Proceedings of the 16th ACM/SPEC International Conference on Performance Engineering, 2025, pp.232-239. ↩