Wednesday, October 27, 2021

Optimistic lock coupling overhead for single thread Adaptive Radix Tree

UnoDB has two Adaptive Radix Tree implementations: a regular one, which is not safe for concurrent use, and an optimistic lock coupling one (OLC ART), which enables concurrency and, hopefully, read scalability by using per-node seqlocks. When I set out to implement both, one interesting question for me was how much overhead OLC adds in single-thread workloads; in other words, can OLC ART be used instead of ART even when concurrency is not needed?
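To make the seqlock part concrete, here is a minimal sketch of the optimistic read protocol on a single node. The names and bit layout are illustrative only, not UnoDB's actual code:

```cpp
// A minimal sketch of the per-node seqlock read protocol that OLC builds on.
// Names (olc_node, read_optimistically) and the bit layout are illustrative.
#include <atomic>
#include <cstdint>
#include <optional>

struct olc_node {
  // Low two bits of the version flag "obsolete" / "write-locked";
  // the remaining bits count committed updates.
  std::atomic<std::uint64_t> version{0};
  int payload{42};  // stand-in for the actual node contents
};

// Readers take no lock: snapshot the version, read, then verify the version
// did not change (and was not locked) in between; on failure the caller restarts.
std::optional<int> read_optimistically(const olc_node& node) {
  const auto v_before = node.version.load(std::memory_order_acquire);
  if ((v_before & 0b11U) != 0) return std::nullopt;  // locked or obsolete
  const int result = node.payload;                   // speculative read
  std::atomic_thread_fence(std::memory_order_acquire);
  const auto v_after = node.version.load(std::memory_order_relaxed);
  if (v_after != v_before) return std::nullopt;      // concurrent write: restart
  return result;
}
```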

To try to make the answer more meaningful I did a round of optimizations for the latter, the biggest one being a rewrite of the try_read_lock implementation to do a single comparison in the fast path instead of the two that the pseudocode in the OLC ART paper does. Along the way I also logged my first missed-optimization compiler bug. The changes did not result in dramatic improvements (the try_read_lock one reduced branch mispredictions by 25%), but every little bit helps.
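Roughly, the difference between the two fast paths looks like this. The bit assignments here are my illustration, not necessarily UnoDB's actual version word layout:

```cpp
// Sketch of the fast-path change, assuming a version word where bit 0 marks
// "obsolete" and bit 1 marks "write-locked" (bit assignments are illustrative).
#include <atomic>
#include <cstdint>

// Two-comparison fast path, as in the OLC ART paper's pseudocode:
// test the locked bit, then test the obsolete bit separately.
bool try_read_lock_two_cmp(const std::atomic<std::uint64_t>& version,
                           std::uint64_t& seen) {
  seen = version.load(std::memory_order_acquire);
  if ((seen & 0b10U) != 0) return false;  // write-locked
  if ((seen & 0b01U) != 0) return false;  // obsolete
  return true;
}

// Single-comparison fast path: one masked test covers both bits,
// removing a branch from the hot read path.
bool try_read_lock_one_cmp(const std::atomic<std::uint64_t>& version,
                           std::uint64_t& seen) {
  seen = version.load(std::memory_order_acquire);
  return (seen & 0b11U) == 0;
}
```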

So the current answer is: in my implementation, OLC makes things 2x-3x slower than the regular one. That is a rough average across various benchmarks, which range from ~10% overhead on random gets to 200%-400% (even 600% once) for everything else. Drilling down into these numbers a bit, the relatively good numbers for random gets are easy enough to explain: the execution is memory-bound, and gets have the fewest changes in the OLC algorithm compared to the baseline one. The update algorithms, especially delete, are relatively heavy in locking operations: a delete might lock three nodes at once, with restart logic if any one of the three write locks fails. This, however, does not explain why scans have as much overhead as updates, even though they share the same algorithm as random gets.
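For a sense of the shape of that delete locking, here is a rough sketch with a hypothetical API, not UnoDB's actual delete code:

```cpp
// Rough sketch of the delete-path locking shape: up to three nodes must be
// write-locked, and if any lock upgrade fails everything is dropped and the
// whole operation restarts from the root. Names and layout are hypothetical.
#include <atomic>
#include <cstdint>

struct olc_node {
  std::atomic<std::uint64_t> version{0};  // bit 1 = write-locked, bits 2+ = counter
};

// Try to upgrade an optimistic read (version seen earlier) to a write lock by
// CAS-ing the lock bit in; fails if the node changed or is already locked.
bool try_upgrade_to_write_lock(olc_node& node, std::uint64_t seen) {
  return node.version.compare_exchange_strong(
      seen, seen | 0b10U, std::memory_order_acquire, std::memory_order_relaxed);
}

void write_unlock(olc_node& node) {
  // Adding 0b10 clears the lock bit and carries into the update counter,
  // so concurrent optimistic readers see a changed version and restart.
  node.version.fetch_add(0b10U, std::memory_order_release);
}

// Lock parent, node, and (when the parent shrinks away) the grandparent;
// return false to make the caller restart the traversal if any lock fails.
bool delete_entry(olc_node& grandparent, std::uint64_t gp_seen,
                  olc_node& parent, std::uint64_t p_seen,
                  olc_node& node, std::uint64_t n_seen) {
  if (!try_upgrade_to_write_lock(parent, p_seen)) return false;
  if (!try_upgrade_to_write_lock(node, n_seen)) {
    write_unlock(parent);
    return false;
  }
  if (!try_upgrade_to_write_lock(grandparent, gp_seen)) {
    write_unlock(node);
    write_unlock(parent);
    return false;
  }
  // ... unlink the node, possibly shrink the parent ...
  write_unlock(grandparent);
  write_unlock(node);
  write_unlock(parent);
  return true;
}
```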

Another elephant in the room in this comparison is direct heap allocation/deallocation in the regular case vs. Quiescent State-Based Reclamation (QSBR) for OLC. I see a lot of QSBR in the single-thread profiles, but I cannot really optimize it at this point: the implementation needs a full rewrite anyway, because it does not scale due to a Big QSBR Lock.
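What I mean by a Big QSBR Lock is a structure roughly like the following. This is a hypothetical sketch, not the actual UnoDB implementation:

```cpp
// Hypothetical sketch of QSBR serialized by one global mutex: every deferred
// free and every quiescent-state report goes through the same lock, which is
// what shows up even in single-thread profiles and limits multi-thread scaling.
#include <functional>
#include <mutex>
#include <thread>
#include <unordered_set>
#include <vector>

class qsbr {
 public:
  // Defer freeing a node until every registered thread has passed through
  // a quiescent state (i.e. holds no pointers into the tree).
  void defer_free(std::function<void()> deleter) {
    std::lock_guard<std::mutex> guard{big_qsbr_lock_};
    current_interval_.push_back(std::move(deleter));
  }

  // Called by each thread when it is not inside any tree operation.
  void quiescent_state(std::thread::id thread_id) {
    std::lock_guard<std::mutex> guard{big_qsbr_lock_};
    threads_not_yet_quiescent_.erase(thread_id);
    if (threads_not_yet_quiescent_.empty()) {
      // Every thread has gone quiescent since the previous interval, so the
      // previous interval's deferred frees can no longer be referenced.
      for (auto& deleter : previous_interval_) deleter();
      previous_interval_ = std::move(current_interval_);
      current_interval_.clear();
      threads_not_yet_quiescent_ = registered_threads_;
    }
  }

  void register_thread(std::thread::id thread_id) {
    std::lock_guard<std::mutex> guard{big_qsbr_lock_};
    registered_threads_.insert(thread_id);
    threads_not_yet_quiescent_.insert(thread_id);
  }

 private:
  std::mutex big_qsbr_lock_;  // the "Big QSBR Lock"
  std::vector<std::function<void()>> current_interval_;
  std::vector<std::function<void()>> previous_interval_;
  std::unordered_set<std::thread::id> registered_threads_;
  std::unordered_set<std::thread::id> threads_not_yet_quiescent_;
};
```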

And that is the next fun thing to do. I did not find any lock-free QSBR implementation on the internet, so I guess I'll see about writing one.
