So I had so much fun writing the Adaptive Radix Tree and later SIMD-optimizing it that I decided to implement a concurrent version of it: ART with optimistic lock coupling, from a paper by the original ART authors [1]. That took the fun to a whole new level.
Each tree node has its own lock, of the kind described below. While traversing the tree, the parent lock is taken, then the child lock, then the parent lock is released, and the process repeats ("lock coupling", also known as "crabbing"). Unlike in a B-tree, any modification in ART is contained in at most three nodes, which limits how much of the tree is write-locked in the worst case.
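Roughly, the descent looks like the sketch below. It is purely illustrative: the node layout and find_child() are made-up placeholders, a plain std::mutex stands in for the optimistic lock described next, and a real ART descent consumes successive key bytes rather than a single one.

```cpp
#include <array>
#include <cstdint>
#include <mutex>

// Hypothetical node: one child slot per key byte, a plain mutex as the lock.
struct node {
  std::mutex lock;
  bool is_leaf = false;
  std::array<node *, 256> children{};
  node *find_child(std::uint8_t key_byte) { return children[key_byte]; }
};

// Lock coupling ("crabbing"): at most two locks are ever held at once.
// The reached node is returned still locked; the caller does its work on it
// and then unlocks.
node *descend(node &root, std::uint8_t key_byte) {
  node *parent = &root;
  parent->lock.lock();
  while (!parent->is_leaf) {
    node *child = parent->find_child(key_byte);
    if (child == nullptr) break;
    child->lock.lock();     // take the child lock first...
    parent->lock.unlock();  // ...then release the parent lock
    parent = child;
  }
  return parent;
}
```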
Now it would also be good for reads to scale, and for that a regular RW lock, where even read locking writes to memory, is not good enough. So OLC ART uses a different synchronization primitive, which the paper calls the Optimistic Lock: a mutex with a version counter, implemented in a single machine word. Writes lock the mutex, bumping the counter, and at the end they unlock, bumping the counter again. Reads check the counter, copy out what they want to read, then check the counter again to see whether their copy is consistent and usable, or whether they have to restart. This is just like the Linux kernel seqlocks (sequence locks), which the paper fails to mention. There is one extra feature compared to seqlocks: an unlock can mark the node as obsolete, which forces all concurrent readers and waiting writers to restart their algorithms and stop trying to lock this node.
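Here is a minimal sketch of such a single-word lock, assuming the bit layout from the paper (bit 0 = obsolete, bit 1 = write-locked, the rest is the version counter). The method names, the memory orders, and the reader usage snippet are my illustration, not the actual code; see [2] for the real subtleties.

```cpp
#include <atomic>
#include <cstdint>
#include <optional>

class optimistic_lock {
  std::atomic<std::uint64_t> word{0};

 public:
  // Reader: remember the version, or ask for a restart if the node is
  // write-locked or obsolete.
  bool try_read_lock(std::uint64_t &version) const {
    version = word.load(std::memory_order_acquire);
    return (version & 0b11) == 0;
  }

  // Reader: after copying the data out, confirm the version did not change.
  bool read_unlock(std::uint64_t version) const {
    return word.load(std::memory_order_acquire) == version;
  }

  // Writer: upgrade from the previously observed version to an exclusive
  // lock (adding 0b10 sets the write-locked bit).
  bool try_write_lock(std::uint64_t version) {
    return word.compare_exchange_strong(version, version + 0b10,
                                        std::memory_order_acquire);
  }

  // Writer: adding 0b10 clears the write-locked bit and carries into the
  // version counter, bumping it.
  void write_unlock() { word.fetch_add(0b10, std::memory_order_release); }

  // Writer: same, but also set the obsolete bit so everyone restarts.
  void write_unlock_and_obsolete() {
    word.fetch_add(0b11, std::memory_order_release);
  }
};

// Reader-side usage: copy the protected value out, then validate the
// version; on failure the caller restarts its whole operation from the root.
struct counter_node {
  optimistic_lock lock;
  std::atomic<int> payload{0};  // protected data, accessed with relaxed loads
};

std::optional<int> optimistic_read(const counter_node &n) {
  std::uint64_t version;
  if (!n.lock.try_read_lock(version)) return std::nullopt;     // restart
  const int copy = n.payload.load(std::memory_order_relaxed);  // copy out...
  if (!n.lock.read_unlock(version)) return std::nullopt;       // restart
  return copy;  // ...and only act on the copy after the version check passed
}
```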
Now those obsolete nodes have to be garbage-collected somehow. For that, I implemented Quiescent-State-Based Reclamation (QSBR), in which each thread periodically declares that it holds no live pointers to the shared data structures. Once all threads have done so, the epoch advances and the oldest deleted nodes, those from two epochs ago, are reclaimed. Ironically, my QSBR implementation uses one big global mutex for itself, killing scalability, so no fun OLC benchmarks unless I get around to rewriting that part.
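A heavily simplified sketch of the scheme just described; the names, the bookkeeping, and the deleter are illustrative, and a real implementation tracks per-thread epochs instead of merely counting quiescent-state calls. The single global mutex guarding everything is exactly the scalability problem mentioned above.

```cpp
#include <cstddef>
#include <mutex>
#include <utility>
#include <vector>

class qsbr {
  std::mutex mutex_;  // the one big global mutex
  std::size_t threads_ = 0;
  std::size_t quiescent_threads_ = 0;
  std::vector<void *> previous_epoch_garbage_;  // deleted two epochs ago
  std::vector<void *> current_epoch_garbage_;   // deleted in this epoch

 public:
  void register_thread() {
    const std::lock_guard<std::mutex> guard{mutex_};
    ++threads_;
  }

  // Writers call this instead of freeing an obsolete node directly.
  void defer_free(void *node) {
    const std::lock_guard<std::mutex> guard{mutex_};
    current_epoch_garbage_.push_back(node);
  }

  // Each thread calls this periodically, at points where it holds no live
  // pointers into the tree.
  void quiescent_state() {
    const std::lock_guard<std::mutex> guard{mutex_};
    if (++quiescent_threads_ < threads_) return;
    // Every thread has passed through a quiescent state: reclaim the oldest
    // garbage and advance the epoch.
    for (void *node : previous_epoch_garbage_) ::operator delete(node);
    previous_epoch_garbage_ = std::move(current_epoch_garbage_);
    current_epoch_garbage_.clear();
    quiescent_threads_ = 0;
  }
};
```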
Optimistic Locks / seqlocks are not that straightforward to express in the C++11 memory model. Luckily for me, the hard work has been done in [2], which discusses the correctness and the trade-offs of different implementations. A fun consequence is that all the protected data in the critical sections has to be declared as relaxed C++11 atomics. But I also keep the original single-threaded ART around and did not want to copy-paste otherwise identical node-level algorithm implementations between the two. This led to the following C++ template gem (well, maybe a turd) of overengineering: I wrote two class templates, relaxed_atomic and not_atomic. The former is like std::atomic, except that all operations use std::memory_order_relaxed instead of std::memory_order_seq_cst. The latter is like std::atomic, except that its implementation is not atomic at all. Then I templatized the node classes on this atomic policy and, one thousand lines of glue code later, avoided the copy-paste. Profit!
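A condensed sketch of the idea; the real templates cover far more of the std::atomic interface, and the node class and its members shown here are hypothetical.

```cpp
#include <atomic>
#include <cstdint>

// Like std::atomic, except every operation uses std::memory_order_relaxed.
template <typename T>
class relaxed_atomic {
  std::atomic<T> value_;

 public:
  constexpr relaxed_atomic(T desired) noexcept : value_{desired} {}
  T load() const noexcept { return value_.load(std::memory_order_relaxed); }
  void store(T desired) noexcept {
    value_.store(desired, std::memory_order_relaxed);
  }
  operator T() const noexcept { return load(); }
};

// Same interface, but not atomic at all -- for the single-threaded ART.
template <typename T>
class not_atomic {
  T value_;

 public:
  constexpr not_atomic(T desired) noexcept : value_{desired} {}
  T load() const noexcept { return value_; }
  void store(T desired) noexcept { value_ = desired; }
  operator T() const noexcept { return load(); }
};

// The node classes are then templatized on the policy, so the node-level
// algorithms are written once (class and member names are made up):
template <template <typename> class AtomicPolicy>
class basic_node_4 {
  AtomicPolicy<std::uint8_t> children_count{0};
  // ... keys, child pointers, and the shared node-level algorithms ...
};

using node_4 = basic_node_4<not_atomic>;          // single-threaded ART
using olc_node_4 = basic_node_4<relaxed_atomic>;  // OLC ART
```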
Debugging this whole contraption took me a while, and I cannot say with certainty that I'm done. I know I have advanced a bit into the long tail of possible concurrency bugs, but I have no idea what's out there that I haven't seen yet. I attacked this in force: wrote stress tests, employed all the sanitizers, ThreadSanitizer included, got everything running in CI (BTW, moved from Travis CI to GitHub Actions like the rest of the world), and yet. Usually writing your own concurrency primitive is about as good an idea as a land war in Asia, but here it caused only the smaller part of the issues (thanks to following [2], I guess): reading the protected data and not acting on it until a successful read unlock is for some reason harder than it sounds. But it was QSBR where the dragons lived.