Wednesday, April 14, 2021

On fuzzing

I am getting convinced that fuzzing is the third best thing after writing tests and using sanitizers a developer can do for their project quality. I'll start with MySQL, but it was my Adaptive Radix Tree implementation that really drove the point home.

I first encountered fuzzing - well maybe not fuzzing as such but random testing - in late 2011, when at Percona developer meeting in Prague Stewart Smith told us about Random Query Generator (RQG), introducing it as "90s Microsoft technology" - referring to a VLDB'98 paper. Stewart then proceeded by picking a random "crash happening intermittently, no clear reproduction steps" Percona Server bug, writing random test grammar for it, and getting a test case - in less than an hour.

As nice as it was, RQG still required maintaining test grammar, which as far as I can tell, happened very rarely, so Percona did not come close to using its full potential.

Then, a few years later, Roel Van de Paar had a realization that instead of bespoke maintenance-hungry grammar we could use a simpler mutating tool which then would need a large corpus to be effective - and Roel had this idea of taking the whole of MySQL MTR test suite as the seed corpus. He named the tool pquery. It was brutal, easily producing, IIRC, 60-80 crashing test cases per hour of fuzzing. Roel and his team duly sorted them, filed the bugs for MySQL and Percona Server, and then proceeded to watch in frustration as release crashers were being made private by Oracle (at least they were getting fixed), and debug build bugs were being ignored both by Oracle and Percona, myself included. This was not good - for effective fuzzing one must fix the initial crashes to let the fuzzer proceed past them. I see that currently Roel is working with pquery over at MariaDB, and I sincerely hope they do better than we did.

Later I wanted to try out fuzzing for my Adaptive Radix Tree. Most fuzzers work with the notion of input data strings rather than sequences of API calls, which suggests not to bother with the fuzzer framework to avoid the need to write the string-to-API mapping code. Instead, do a random sequence of API calls from scratch like John Regehr does, which will catch (some? most of?) bugs, but will not help with reducing test cases nor will be coverage-driven.

Enter DeepState. A data structure fuzzer written in it will look about the same as John Regehr's one, so no hairy string-to-API conversions, and it will have an actual pluggable fuzzing framework (AFL, libfuzzer, etc.) underneath, with the option of a brute force fuzzer too. There is also a tool to reduce interesting test cases. It all sounded like a very good fit for data structure fuzzing and so I used DeepState to write an ART fuzzer, with brute-force fuzzing at first for simplicity.

I had thought ART was in a reasonably good shape. The implementation was small and not too hard to reason about. I had written ART code with test-driven approach, with close-to-100% test line coverage and reasonable branch coverage too. Sanitizers and Valgrind were silent. How bad could it be?

Turned out, two-nasty-crashing-bugs-bad [1], [2], with an extra non-crashing logic bug [3] on the top.

Unfortunately, DeepState was–and still is–not without its share of issues, thus I stopped at that point, and did not bundle DS as a dependency. I did run it on my laptop regularly though.

Recently I implemented Optimistic Lock Coupling ART that uses Quiescent State-Based Reclamator (QSBR) for managing memory and find myself struggling with intermittent QSBR crashes. I try to assert() and sanitize my way out of it, but make no progress. Then I get an idea to fuzz it. Now QSBR is inherently multithreaded and fuzzing strongly dislikes that because it's a major source of non-determinism. But my QSBR implementation is serialized by a big internal lock anyway, which is bad in general, but for fuzzing, there is a only a small number of possible thread switch points, and no two threads run in parallel inside QSBR. Thus a fuzzer can launch many threads, block them all, and wake a selected thread for a selected action one at a time. Excellent, let's code that. It is likely that such fuzzer will stop being useful if/when I remove the QSBR global mutex later, but hopefully it will have produced some test cases by then.

At the same time I try to switch to libfuzzer as the DeepState backend. I chose it over alternatives because it's a part of LLVM, avoiding one more external dependency. DeepState is not without some issues in this area too, however there are strong advantages over brute force fuzzing:

  • The test corpus can be persisted. In brute force fuzzing it is randomly generated every time anew.
  • Fuzzing saturation can be inferred when it stops producing new paths through code for a while. With brute force fuzzing there is no insight to that.
  • Presumably it will go through new paths sooner than later because it's coverage-driven.
  • It can dump its coverage info, which can then be checked for suspicious omissions to improve the fuzzer.

With all that, and after a couple of iterations to improve the fuzzer, it found what I was looking for: a bug in QSBR, where the second-to-last thread quitting will cause premature deallocations. I had introduced the bug as a mistaken optimization and had been staring at that code for a long time without suspecting a thing.

Lessons learned:

  • If it's fuzzable, fuzz it, no matter how "bug-free" you think it is. It is not bug-free.
  • For fuzzing a data structure, I haven't found a better introduction than John Regehr's one. There is also a DeepState-specific one.
  • Start with the simplest fuzzer to catch the low hanging fruit first and to get the motivation to use more advanced ones.
  • Fuzzer is another client to the thing being fuzzed. While being developed, unnatural and broken things in API will be caught, leading to improvements there.
  • A fuzzer works when it actually checks and catches incorrect state. For that, assert the heck out of the internal state in the implementation, and external one in the fuzzer.
  • If there is, say, an API to dump some internal state–and where it is not practical to cross-check that the output is consistent–dump it anyway to a null sink for its internal asserts and for AddressSanitizer.
  • Dereference all the valid pointers even if you don't know their contents, again for AddressSanitizer.
  • If fuzzing a data structure, it's easy to have an oracle: an alternative implementation that mirrors all the operations and catches any differences.