Thursday, December 23, 2021

Lock-free quiescent state-based reclamation

As I mentioned previously, the current QSBR implementation in UnoDB does not scale due to being completely serialized by a big lock. It was still a useful implementation as it allowed developing a non-concurrent testsuite for correctness and to integrate the rest of code, but it had outlived its purpose and it was time to replace it. At the outset I was not able to find a reference implementation to use, so I developed my own.

The starting point implementation was similar to the one given here: maintain three shared data structures:

  1. a map of thread IDs to how many times a quiescent state was observed in the current epoch with a counter of threads still not having seen the current epoch;
  2. a collection of previous interval deallocation requests - while global, this one is used only by the epoch-advancing thread, with no sharing with the rest;
  3. a collection of current interval deallocation requests.

Ignoring the case of single thread execution, all deallocation requests are added to the current interval request collection. Whenever a thread goes through a quiescent state, the thread id map is checked whether it's a first time in this epoch for this thread. If yes, the previous epoch thread counter is decremented and the epoch is advanced, while the previous interval deallocation requests are executed, and the current interval requests are moved to previous. All of this happened under the lock. To slightly complicate things, threads can join and leave at any time.

I developed a lock-free version of the above step-by-step with some dead-ends explored on the way. So what were the steps?

  1. Move the "how many times a quiescent state was observed in the current epoch" counter from the shared map to the thread itself, and reduce the thread ID -> counter map to a thread ID set. Microbenchmarks showed slowdowns across the board, sometimes dramatic ones, but this was done as a prerequisite for next steps.
  2. Get rid of the thread ID set altogether, replaced by a total thread count. This loses the ability to catch unbalanced thread register/unregister calls, which is acceptable. 0%-33% speedups, mostly on the low side.
  3. Read the current epoch number atomically. Previously, the threads checking what is the global epoch still had to take the big lock. This was the first of big impact changes, with speedups up to 99% in high contention tests.
  4. (The committed part of "make-things-atomic-but-separate" dead-end). Make total thread count atomic. Like the current epoch number, while it could be read without taking the lock now, it still had to be updated under lock protection, and later I learned the hard way that this design will not work. 4% slowdowns to 25% speedups.
  5. After the previous step I spent time trying to make the threads-in-previous-epoch counter atomic as it was begging to do so (subtract 1 atomically, if it's zero, change the epoch). I'd have three separate atomic vars - epoch, thread count, threads-in-previous. After some work I realized this will never work because I cannot update two or three of them at the same time and maintain consistency. So I packed all three into a single atomic QSBR state variable. To squeeze three 64-bit words into one, I shrank the epoch number to a two-bit wrapping-around counter – it even could have been a one bit counter but some debugging asserts need two bits – and limited the maximum thread count to 2^29-1. Half a billion threads should be enough for everybody! 25% slowdowns to 45% speedups, mostly on the low speedup side.
  6. The above commit did not shrink all the QSBR lock critical sections as much as possible, so cleaned it up next. Perf impact was from a 14% slowdown in the single-threaded case to 30% speedups, but mostly unchanged.
  7. At this point the only shared data structures were the deferred deallocation request collections. I started looking into making them shared but lock-free data structures, but then realized it is likely to be better not to share them in the first place by keeping, maintaining, and finally executing them per-thread. However, this is not enough for threads quitting while still having requests. Freeing them at that point would be premature, so shared lock-free lists of orphan request collections were still necessary for correctness. With this change the big QSBR lock was finally removed, with -25% performance regression for the single threaded case to 80% speedups under high contention. 

By removing the lock I was able to beat the performance numbers into showing something resembling linear-ish scalability. This is most visible for read-only workloads where there is no contention for the OLC ART tree nodes themselves and no deallocation requests, just the QSBR state updates. Then the delete workloads - where the previous/current interval request collections are exercised - benefited too. The impact is least visible with the insert workloads where QSBR does not do much except maintain its state, just like with gets, but there is OLC ART node contention too. It was also nice to compare the numbers with old implementation: there is a 0%-5% regression for the single-thread case and progressively larger speedups for each concurrency increase.

Somewhere in the middle of the work I finally found what looked to be a production-strength QSBR library, but I did not look at its implementation. It will be interesting to compare design decisions. Debugging this work was quite hard, I was glad I was adding asserts everywhere and that I had an already-existing non-concurrent correctness testcase.

A side trip: statistics. I had added some stats for QSBR, such as mean times a thread goes through a quiescent state before the epoch advances, mean deallocation request counts and bytes, etc. These were also maintained under the big QSBR lock. I tried to get rid of it by implementing lock-free stats accumulators, and while the implementation worked, it did not perform nor scale: 0% to 25% regression across all concurrency levels, with a single case of speedup. I guess this is due to combination of several reasons:

  • As a bare minimum, a lock-free stats accumulator needs to maintain mean and count atomically together. I could not shrink both to a single 64-bit number, so a 128-bit double-width CAS operation (LOCK CMPXCHG16B on x86_64) was needed.
  • Two more CASes are needed if variance and maximum need to be calculated too.
  • So instead of a contended mutex we have four contended 64-bit atomic words, two of them in a 128-bit pair.

So, no lock-free stats. I still did some work to reduce stats bottleneck:

  1. Replace the stats rwlock with a regular mutex and a set of atomic stat value variables, so that rwlock does not have to be taken to read the stats. 1%-17% speedups at low concurrency, some slowdowns too.
  2. Do not query per-interval deallocation request collections for their sizes, but maintain separate atomic size vars. Up to 80% speedups under high concurrency, with some slowdowns too.
  3. Split the stats lock into two by the usage pattern. Some slowdowns in get benchmarks, minor speedups elsewhere.
  4. Move those split locks into separate cache lines. Fixed the above get benchmark regression.
  5. Increase the false-sharing-avoidance padding to 128 bits after reading in Folly sources that Intel CPUs fetch cache lines in pairs. Some more improvement on the read-only benchmarks.

To see how bad the overall stat overhead is, I tested a branch that removes all stats, which appears to result in a 11% slowdown (high-concurrency gets) to 32% speedup, with maybe ~10% speedup on average. Clearly, the idea of global frequently-updated stats needs re-thinking. Maybe I'll do per-thread stats that aggregate on threads quitting, since the current microbenchmarks only need them at the end. Maybe something else? I don't know yet.

No comments: