Laurynas Biveinis: 2021

Thursday, December 23, 2021

Lock-free quiescent state-based reclamation

As I mentioned previously, the current QSBR implementation in UnoDB does not scale due to being completely serialized by a big lock. It was still a useful implementation as it allowed developing a non-concurrent testsuite for correctness and to integrate the rest of code, but it had outlived its purpose and it was time to replace it. At the outset I was not able to find a reference implementation to use, so I developed my own.

The starting point implementation was similar to the one given here: maintain three shared data structures:

a map of thread IDs to how many times a quiescent state was observed in the current epoch with a counter of threads still not having seen the current epoch;
a collection of previous interval deallocation requests - while global, this one is used only by the epoch-advancing thread, with no sharing with the rest;
a collection of current interval deallocation requests.

Ignoring the case of single thread execution, all deallocation requests are added to the current interval request collection. Whenever a thread goes through a quiescent state, the thread id map is checked whether it's a first time in this epoch for this thread. If yes, the previous epoch thread counter is decremented and the epoch is advanced, while the previous interval deallocation requests are executed, and the current interval requests are moved to previous. All of this happened under the lock. To slightly complicate things, threads can join and leave at any time.

I developed a lock-free version of the above step-by-step with some dead-ends explored on the way. So what were the steps?

Move the "how many times a quiescent state was observed in the current epoch" counter from the shared map to the thread itself, and reduce the thread ID -> counter map to a thread ID set. Microbenchmarks showed slowdowns across the board, sometimes dramatic ones, but this was done as a prerequisite for next steps.
Get rid of the thread ID set altogether, replaced by a total thread count. This loses the ability to catch unbalanced thread register/unregister calls, which is acceptable. 0%-33% speedups, mostly on the low side.
Read the current epoch number atomically. Previously, the threads checking what is the global epoch still had to take the big lock. This was the first of big impact changes, with speedups up to 99% in high contention tests.
(The committed part of "make-things-atomic-but-separate" dead-end). Make total thread count atomic. Like the current epoch number, while it could be read without taking the lock now, it still had to be updated under lock protection, and later I learned the hard way that this design will not work. 4% slowdowns to 25% speedups.
After the previous step I spent time trying to make the threads-in-previous-epoch counter atomic as it was begging to do so (subtract 1 atomically, if it's zero, change the epoch). I'd have three separate atomic vars - epoch, thread count, threads-in-previous. After some work I realized this will never work because I cannot update two or three of them at the same time and maintain consistency. So I packed all three into a single atomic QSBR state variable. To squeeze three 64-bit words into one, I shrank the epoch number to a two-bit wrapping-around counter – it even could have been a one bit counter but some debugging asserts need two bits – and limited the maximum thread count to 2^29-1. Half a billion threads should be enough for everybody! 25% slowdowns to 45% speedups, mostly on the low speedup side.
The above commit did not shrink all the QSBR lock critical sections as much as possible, so cleaned it up next. Perf impact was from a 14% slowdown in the single-threaded case to 30% speedups, but mostly unchanged.
At this point the only shared data structures were the deferred deallocation request collections. I started looking into making them shared but lock-free data structures, but then realized it is likely to be better not to share them in the first place by keeping, maintaining, and finally executing them per-thread. However, this is not enough for threads quitting while still having requests. Freeing them at that point would be premature, so shared lock-free lists of orphan request collections were still necessary for correctness. With this change the big QSBR lock was finally removed, with -25% performance regression for the single threaded case to 80% speedups under high contention.

By removing the lock I was able to beat the performance numbers into showing something resembling linear-ish scalability. This is most visible for read-only workloads where there is no contention for the OLC ART tree nodes themselves and no deallocation requests, just the QSBR state updates. Then the delete workloads - where the previous/current interval request collections are exercised - benefited too. The impact is least visible with the insert workloads where QSBR does not do much except maintain its state, just like with gets, but there is OLC ART node contention too. It was also nice to compare the numbers with old implementation: there is a 0%-5% regression for the single-thread case and progressively larger speedups for each concurrency increase.

Somewhere in the middle of the work I finally found what looked to be a production-strength QSBR library, but I did not look at its implementation. It will be interesting to compare design decisions. Debugging this work was quite hard, I was glad I was adding asserts everywhere and that I had an already-existing non-concurrent correctness testcase.

A side trip: statistics. I had added some stats for QSBR, such as mean times a thread goes through a quiescent state before the epoch advances, mean deallocation request counts and bytes, etc. These were also maintained under the big QSBR lock. I tried to get rid of it by implementing lock-free stats accumulators, and while the implementation worked, it did not perform nor scale: 0% to 25% regression across all concurrency levels, with a single case of speedup. I guess this is due to combination of several reasons:

As a bare minimum, a lock-free stats accumulator needs to maintain mean and count atomically together. I could not shrink both to a single 64-bit number, so a 128-bit double-width CAS operation (LOCK CMPXCHG16B on x86_64) was needed.
Two more CASes are needed if variance and maximum need to be calculated too.
So instead of a contended mutex we have four contended 64-bit atomic words, two of them in a 128-bit pair.

So, no lock-free stats. I still did some work to reduce stats bottleneck:

Replace the stats rwlock with a regular mutex and a set of atomic stat value variables, so that rwlock does not have to be taken to read the stats. 1%-17% speedups at low concurrency, some slowdowns too.
Do not query per-interval deallocation request collections for their sizes, but maintain separate atomic size vars. Up to 80% speedups under high concurrency, with some slowdowns too.
Split the stats lock into two by the usage pattern. Some slowdowns in get benchmarks, minor speedups elsewhere.
Move those split locks into separate cache lines. Fixed the above get benchmark regression.
Increase the false-sharing-avoidance padding to 128 bits after reading in Folly sources that Intel CPUs fetch cache lines in pairs. Some more improvement on the read-only benchmarks.

To see how bad the overall stat overhead is, I tested a branch that removes all stats, which appears to result in a 11% slowdown (high-concurrency gets) to 32% speedup, with maybe ~10% speedup on average. Clearly, the idea of global frequently-updated stats needs re-thinking. Maybe I'll do per-thread stats that aggregate on threads quitting, since the current microbenchmarks only need them at the end. Maybe something else? I don't know yet.

Wednesday, October 27, 2021

Optimistic lock coupling overhead for single thread Adaptive Radix Tree

UnoDB has two Adaptive Radix Tree implementations: a regular one, which is not safe for concurrent use, and optimistic lock coupling one (OLC ART), which enables concurrency and, hopefully, scalability for reads, by having per-node seqlocks. When I set out to implement them both, one interesting question for me was how much overhead does OLC add in single-thread workloads, in other words - can OLC ART be used instead of ART even if concurrency is not needed?

To try to make the answer more meaningful I did a round of optimizations for the latter, with the biggest one being a rewrite of try_read_lock implementation to do a single comparison in the fast path instead of two. Two comparisons is what the pseudocode in the OLC ART paper does. I have also logged my first missed-optimization compiler bug. The changes did not result in dramatic improvements (the try_read_lock one reduced branch mispredictions by 25%) but every little bit helps.

So the current answer is - OLC makes things slower 2x-3x compared to the regular one, in my implementation. This is a rough mean/average of various benchmarks, which vary from ~10% overhead on random gets to 200%-300%-400% (even 600% once) for everything else. Drilling down into these numbers a bit, it's somewhat easy to explain the relatively good numbers for random gets - memory-bound execution, least changes in the OLC algorithm compared to the baseline one. The update algorithms, especially the delete one, are relatively heavy in locking operations. The delete might lock three nodes at once, with restart logic if any one of the three write locks fail. This, however, does not explain, why scans have as much overhead as updates, sharing the same algorithm as random gets.

Another elephant in the room in this comparison is direct heap allocation/deallocation for the regular case vs. Quiescent State-Based Reclamation (QSBR) for OLC. I see a lot of QSBR in the single-thread profiles and I cannot really optimize it, at this point, due to the implementation needing a full rewrite as it does not scale due to a Big QSBR Lock.

And that is the next fun thing to do. I did not find any lock-free QSBR implementation on the internet, so I guess I'll see about writing one.

Friday, October 01, 2021

C++ linters and static analysis tools I tried for UnoDB

I love C++ static analysis. I also like linters, and so will happily try away and integrate any reasonable tool out there into CI/CD for UnoDB. So, in the last three years I got to play with quite a few of them.

Let's start with the simplest of them all, the linters. While the simplest, they still vary in complexity, from regex check collections (cpplint) to LLVM-infrastructure-based ones (clang-tidy).

cpplint. Good for having unified header guard style, "// namespace" comments at their ends, wrapping lines at 80 chars. It also tries to enforce Google C++ guidelines of 2012 (?) vintage, with less than impressive results: '<mutex> is an unapproved C++11 header'. OTOH, it integrates with Emacs through flycheck-google-cpplint, making it easy to keep the code compliant while it's being written.
include-what-you-use. Cleans up #include directives. A great tool, must-have for C++ until modules take over, but some of the diagnostics suggest to include some internal header for a C++ standard symbol. As a result, it's somewhat labor-intensive to go through results and apply fixes.
clang-tidy. A source-code linter with local code analysis. Enforced the Rule of 0-3-5 for me many times, helped with Almost Always Auto declarations (sorry Justinai), replaced v.size() == 0 with v.empty() etc. Overall, does not catch bugs, but helps to write better C++ idioms. Integrates with Emacs through clangd for instant feedback.

Next up, compiler warnings, GCC & clang. I went through the docs of both and enabled every reasonable non-default warning, but did not use clang's -Weverything, which is apparently not very reasonable. The warnings made me do several non-default-for-me things:

All the small constants have U suffixes if used with unsigned variables. You don't do unsigned y = 1U, x = y << 3; you do y = 1U, x = y << 3U;
GCC function attributes cold, pure, etc. were suggested and applied. But also there are in-line warning suppressions for false positives.
GCC made me write a class template deduction guide once. I have looked into them before and am happy user of them when somebody else writes them, but as for actually writing one, I had thought hell will freeze over first. Yet, here we are.

Honorable omission: MSVC /W4.

Finally, the actual static analysis tools:

Coverity. I couldn't get the damn thing ("Coverity Build Tool") to run, and believe me, I tried. Guess will check again in six months.
cppcheck. (Hey, an actual project still hosted on SourceForge!) This one punches above its weight in that it's implements its own C++ parser, but has better diagnostics than one would expect from that. Sadly, "better" is not always "good enough." I added a tweak here & there in my code, but most of the time I add suppressions. I also managed to crash it once, unfortunately this means I found more bugs in cppcheck than it did in my code. I tried to report this bug, but the procedure is unfriendly for new reporters, to put it mildly. On the bright side, it is being actively developed and it recently got an LLVM-based parser, so I expect more good things out of it in the future.
Compiler static analysis, GCC & clang. GCC static analyzer is very new, the clang one is a bit older but still not very complete, and so at this point it was more of integrating them into the pipeline, suppressing false positives, and waiting for the new versions to come out. Honorable omission: MSVC /ANALYZE.
Sonatype Lift, called muse.dev originally and then acquired. Its developers were extremely helpful, popping up in my project to comment on false positives and "oh yeah we will fix this ASAP" remarks before I could even review my own run results. As for the actual checks, for C++ code the main backend is FBInfer. It produced several non-obvious-at-first diagnostics which were not false positives nonetheless and required thinking and refactoring to address, resulting in better code structure.
Sonarcloud. This one tries to become the JIRA of everything by including test results, coverage results, and converting diagnostics to tasks. For the actual diagnostics, it has an opinion about everything ("never use std::unique_lock, always use std::lock_guard"), all that done in a web portal. I applied a lot of minor fixes to make this one happy - adding missing 'const', removing redundant template args, tightening class access specifiers, etc, etc, etc.

Omissions: PVS-Studio. Their licence for OSS used to be a rather strange one–requiring to add source code comments "This project is checked by PVS-Studio for free!", but I see it has changed since, and so I might try it next.

Tuesday, June 29, 2021

art_map: Adaptive Radix Tree implemented as a std::set/std::map container

TL;DR: https://github.com/justinasvd/art_map, 10x faster searches than std::map, patches welcome!

I was happily hacking away at my ART implementation UnoDB when Justinas V. Daugmaudis noticed it and decided to implement a C++ container-style data structure using it. This thought had crossed my mind previously too, but the project looked definitely not-fun to me due to the level of C++ knowledge required, which I, a finger-tracking reader of "Effective Modern C++" do not possess. Now Justinas, a Boost contributor, knows his C++ at the language lawyer level, in fact I don't know anyone better suited to pull off a proper C++ container implementation.

There are significant differences between C++ container needs and what UnoDB provides with its in-memory-DB-ish interface and whole Optimistic Lock Coupling business, so Justinas could not just wrap my implementation. He ended up taking my source code, removing OLC and leaf node code, and writing some actual proper C++ in its place, with my surviving code being mostly internal node algorithms. He also replaced the node type bytes with tagged pointers, reducing node sizes, something I also plan to do. In other words, UnoDB now has all the attributes of a successful open source project-it was forked! /s

We exchanged ideas in the process, with me implementing several performance-related suggestions of his (I lost one bet), him following my development too, all for the better.

I also provided a fuzzing service for Justinas-by wrapping art_map into UnoDB-compatible interface and plugging it into my fuzzers. The API surface tested this is too narrow for my liking but still getting the important insert/delete/search code paths. And the fuzzers, the same ones I praised in my previous blog post, came up with absolutely nothing, to my shock. That tells something about writing code in a, uhm, mathematical way. Yet, he claims it's alpha and could kill your cat.

Besides collaborating with art_map I also added several OLC-specific optimizations, did major internal cleanups from Justinas code review, and changed the license to Apache 2.0 because someone asked me to.

Wednesday, April 14, 2021

On fuzzing

I am getting convinced that fuzzing is the third best thing after writing tests and using sanitizers a developer can do for their project quality. I'll start with MySQL, but it was my Adaptive Radix Tree implementation that really drove the point home.

I first encountered fuzzing - well maybe not fuzzing as such but random testing - in late 2011, when at Percona developer meeting in Prague Stewart Smith told us about Random Query Generator (RQG), introducing it as "90s Microsoft technology" - referring to a VLDB'98 paper. Stewart then proceeded by picking a random "crash happening intermittently, no clear reproduction steps" Percona Server bug, writing random test grammar for it, and getting a test case - in less than an hour.

As nice as it was, RQG still required maintaining test grammar, which as far as I can tell, happened very rarely, so Percona did not come close to using its full potential.

Then, a few years later, Roel Van de Paar had a realization that instead of bespoke maintenance-hungry grammar we could use a simpler mutating tool which then would need a large corpus to be effective - and Roel had this idea of taking the whole of MySQL MTR test suite as the seed corpus. He named the tool pquery. It was brutal, easily producing, IIRC, 60-80 crashing test cases per hour of fuzzing. Roel and his team duly sorted them, filed the bugs for MySQL and Percona Server, and then proceeded to watch in frustration as release crashers were being made private by Oracle (at least they were getting fixed), and debug build bugs were being ignored both by Oracle and Percona, myself included. This was not good - for effective fuzzing one must fix the initial crashes to let the fuzzer proceed past them. I see that currently Roel is working with pquery over at MariaDB, and I sincerely hope they do better than we did.

Later I wanted to try out fuzzing for my Adaptive Radix Tree. Most fuzzers work with the notion of input data strings rather than sequences of API calls, which suggests not to bother with the fuzzer framework to avoid the need to write the string-to-API mapping code. Instead, do a random sequence of API calls from scratch like John Regehr does, which will catch (some? most of?) bugs, but will not help with reducing test cases nor will be coverage-driven.

Enter DeepState. A data structure fuzzer written in it will look about the same as John Regehr's one, so no hairy string-to-API conversions, and it will have an actual pluggable fuzzing framework (AFL, libfuzzer, etc.) underneath, with the option of a brute force fuzzer too. There is also a tool to reduce interesting test cases. It all sounded like a very good fit for data structure fuzzing and so I used DeepState to write an ART fuzzer, with brute-force fuzzing at first for simplicity.

I had thought ART was in a reasonably good shape. The implementation was small and not too hard to reason about. I had written ART code with test-driven approach, with close-to-100% test line coverage and reasonable branch coverage too. Sanitizers and Valgrind were silent. How bad could it be?

Turned out, two-nasty-crashing-bugs-bad [1], [2], with an extra non-crashing logic bug [3] on the top.

Unfortunately, DeepState was–and still is–not without its share of issues, thus I stopped at that point, and did not bundle DS as a dependency. I did run it on my laptop regularly though.

Recently I implemented Optimistic Lock Coupling ART that uses Quiescent State-Based Reclamator (QSBR) for managing memory and find myself struggling with intermittent QSBR crashes. I try to assert() and sanitize my way out of it, but make no progress. Then I get an idea to fuzz it. Now QSBR is inherently multithreaded and fuzzing strongly dislikes that because it's a major source of non-determinism. But my QSBR implementation is serialized by a big internal lock anyway, which is bad in general, but for fuzzing, there is a only a small number of possible thread switch points, and no two threads run in parallel inside QSBR. Thus a fuzzer can launch many threads, block them all, and wake a selected thread for a selected action one at a time. Excellent, let's code that. It is likely that such fuzzer will stop being useful if/when I remove the QSBR global mutex later, but hopefully it will have produced some test cases by then.

At the same time I try to switch to libfuzzer as the DeepState backend. I chose it over alternatives because it's a part of LLVM, avoiding one more external dependency. DeepState is not without some issues in this area too, however there are strong advantages over brute force fuzzing:

The test corpus can be persisted. In brute force fuzzing it is randomly generated every time anew.
Fuzzing saturation can be inferred when it stops producing new paths through code for a while. With brute force fuzzing there is no insight to that.
Presumably it will go through new paths sooner than later because it's coverage-driven.
It can dump its coverage info, which can then be checked for suspicious omissions to improve the fuzzer.

With all that, and after a couple of iterations to improve the fuzzer, it found what I was looking for: a bug in QSBR, where the second-to-last thread quitting will cause premature deallocations. I had introduced the bug as a mistaken optimization and had been staring at that code for a long time without suspecting a thing.

Lessons learned:

If it's fuzzable, fuzz it, no matter how "bug-free" you think it is. It is not bug-free.
For fuzzing a data structure, I haven't found a better introduction than John Regehr's one. There is also a DeepState-specific one.
Start with the simplest fuzzer to catch the low hanging fruit first and to get the motivation to use more advanced ones.
Fuzzer is another client to the thing being fuzzed. While being developed, unnatural and broken things in API will be caught, leading to improvements there.
A fuzzer works when it actually checks and catches incorrect state. For that, assert the heck out of the internal state in the implementation, and external one in the fuzzer.
If there is, say, an API to dump some internal state–and where it is not practical to cross-check that the output is consistent–dump it anyway to a null sink for its internal asserts and for AddressSanitizer.
Dereference all the valid pointers even if you don't know their contents, again for AddressSanitizer.
If fuzzing a data structure, it's easy to have an oracle: an alternative implementation that mirrors all the operations and catches any differences.

Thursday, March 04, 2021

Optimistic Lock Coupling for Adaptive Radix Tree

So I had so much fun writing the Adaptive Radix Tree and later SIMD'ing-optimizing it, that I decided to implement a concurrent version of it: ART with optimistic lock coupling, from a paper by the original ART authors [1]. That took the fun to the whole new level.

Each tree node has its own lock, of the kind described below. While traversing the tree, the parent lock is taken, then a child lock, then the parent lock is released, then it repeats ("lock coupling/crabbing"). ART is not a B-tree in that any modification will be contained in three nodes at most, which limits how much of the tree will be write-locked in the worst case.

Now it also would be good for reads to scale, and for that a regular RW lock where read locking writes to memory will not be good enough. So OLC ART uses a different synchronization primitive, which the paper calls the Optimistic Lock: it is a mutex with a version counter, implemented in a single machine word. Writes lock the mutex, bumping the counter, at the end they unlock and bump the counter again. Reads check the counter, copy out what they want to read, and check the counter again to see whether their copy is consistent and usable or whether they have to restart. This is just like Linux kernel seqlocks (sequence locks), which the paper fails to mention. There is one extra feature compared to seqlocks: an unlock can mark the node as obsolete, which forces all concurrent readers and waiting writers to restart their algorithms and no longer try to lock this node.

Now those obsolete nodes have to be garbage-collected somehow. For that, I implemented Quiscent-State Based Reclamation (QSBR), in which each thread periodically declares that it holds no live pointers to the shared data structures. Once all threads do that, the epoch advances and the oldest deleted nodes from two epochs ago are reclaimed. Ironically my QSBR implementation uses a big global mutex for itself, killing scalability, so no fun OLC benchmarks unless I get around to rewriting that part.

Optimistic Locks / seqlocks are not that straightforward to express in the C++11 memory model. Luckily for me the hard work has been done in [2], discussing the correctness and the trade-offs of different implementations. A fun thing is that all protected data in the critical sections needed to be declared as relaxed C++11 atomics. But I also keep the original single-threaded ART around and did not want to copy-paste otherwise identical node-level algorithm implementations between the two. This led to the following C++ template gem (well maybe a turd) of overengineering: I wrote two class templates: relaxed_atomic and not_atomic. The former is like std::atomic with a difference that all operations use std::memory_order_relaxed instead of std::memory_order_seq_cst. The latter is like std::atomic except that its implementation is not atomic at all. Then I templatized node classes on this atomic policy, and, one thousand glue code lines later, avoided the copy paste. Profit!

Debugging this whole contraption took me a while, and I cannot say with certainty that I'm done. I know I have advanced a bit into the long tail of possible concurrency bugs, but I have no idea what's out there that I haven't seen yet. I attacked this in force, wrote the stress tests, employed all the sanitizers, ThreadSanitizer included, everything is running in CI (BTW moved from Travis CI to Github Actions like the rest of world), and yet. Usually it is that writing own concurrency primitive is as good an idea as a land war in Asia, but here it caused only a smaller part (thanks to following [2], I guess) of the issues: reading protected data and not acting on it until successful read unlock is for some reason harder than it sounds. But it was QSBR where dragons lived.

Now what? I think I should rewrite QSBR to remove its Big Kernel Lock, then look at the Optimistic Lock spin loop, which currently consists of _mm_pause(), then check if it actually scales? More fun awaiting!

[1]: Viktor Leis et al, The ART of Practical Synchronization

[2]: Hans-J. Boehm, Can Seqlocks Get Along with Programming Language Memory Models?

[3]: Sorry for not figuring out how to make proper footnotes in Blogger