Tuesday, October 25, 2022

UnoDB and exception safety

Nobody uses exceptions in C++, which makes C++ an exceptional (sorry) language. Everybody uses exceptions in Java, for example. I cannot think of any other language whose practitioners ignore the main prescribed error handling method so thoroughly, to the point of standardization work now adding alternate (not replacement! merely alternate) error handling mechanisms, which then don't cover all use cases, resulting in a fine mess IMHO.

I find the main accepted C++ way of handling errors, checking every return value at every call site for every possible error, incredibly verbose, ugly, and error-prone, and I actually like C++ exceptions. Maybe that's because I did not have to use them for embedded development, did not care about error messages carrying full stack traces, about nesting them, about their performance, or about catching "..." in a useful way. In other words, maybe I find them to be OK because I never actually had to use them in production.

Since I am developing UnoDB as a library that could be used in an exception-using or exception-ignoring codebase, I have to make it exception-safe. Now, ever since reading the Effective C++ series, I have had a vague idea of being exception-safe as:

  • noexcept all the things that can be noexcept'ed
  • RAII all the things
  • for any non-trivial state, build it in a local variable and std::swap it into place (see the sketch after this list)
  • never throw from a destructor (I know that's technically not an absolute rule: compare the std::uncaught_exceptions() values captured in the constructor and checked in the destructor, and if they are equal, fail away!)
  • ???
  • Profit!
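
To make the build-and-swap bullet concrete, here is a minimal sketch with a hypothetical int_set class (not UnoDB code): construct the new state in a local variable, where any exception leaves the object untouched, and then commit with a non-throwing swap.

    #include <utility>
    #include <vector>

    // Hypothetical class illustrating build-and-swap: a std::bad_alloc thrown
    // while building the temporary leaves *this unchanged.
    class int_set {
     public:
      void assign(const std::vector<int>& src) {
        std::vector<int> tmp{src};  // may throw; *this still untouched
        std::swap(values_, tmp);    // non-throwing commit point
      }

     private:
      std::vector<int> values_;
    };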

See, the levels of exception safety guarantees (nothrow, basic, strong) do not even really enter the picture. Just try to do things in a safe way, and the result will be safe-ish.

And then this "safe-ish" code will break in subtle or not-so-subtle ways on the first actual exception thrown. I was ignoring this until Justinas casually mentioned, while we were discussing something else: "to test exceptions, run a test in a loop, injecting std::bad_alloc on the 1st, 2nd, ... allocation, until the test succeeds." I noted this down and some time later set off to work:

  • Created an allocation_failure_injector class that maintains an allocation counter and throws std::bad_alloc once the counter reaches a preset value, and called it from the global operator new
  • Easy tests first: for non-memory-allocating operations, wrapped their tests in a must_not_allocate helper that sets the failure injector to fail on the very first allocation, and observed that it never fires
  • Added a helper to do Justinas' loop (sketched below)
  • In the state-changing, potentially-failing test harness operations (such as ART test insert), caught any exception and asserted that nothing in the observable state changed (i.e. node counter values). Look, we are going for the strong exception safety guarantee here!
  • Made the fuzzer tests inject and handle OOM too.
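
Roughly what such an injector and loop helper can look like is sketched below. This is a simplified stand-in rather than UnoDB's actual code: the names and details are illustrative (assert_oom_safety is made up), only the basic operator new/delete overloads are replaced, and the test body is assumed to let std::bad_alloc propagate after checking its own invariants. With this in place, a must_not_allocate helper is just arming the injector to fail on the very first allocation around a block that must not allocate.

    #include <atomic>
    #include <cstdint>
    #include <cstdlib>
    #include <new>

    namespace test {

    // Global allocation counter plus a failure point; the replaced global
    // operator new consults it on every allocation.
    class allocation_failure_injector {
     public:
      static void fail_on_nth_allocation(std::uint64_t n) noexcept {
        counter.store(0, std::memory_order_relaxed);
        fail_on.store(n, std::memory_order_relaxed);
      }

      static void reset() noexcept { fail_on_nth_allocation(0); }

      static void maybe_fail() {
        const auto fail_at = fail_on.load(std::memory_order_relaxed);
        if (fail_at == 0) return;
        if (counter.fetch_add(1, std::memory_order_relaxed) + 1 == fail_at) {
          fail_on.store(0, std::memory_order_relaxed);  // disarm after one failure
          throw std::bad_alloc{};
        }
      }

     private:
      static inline std::atomic<std::uint64_t> counter{0};
      static inline std::atomic<std::uint64_t> fail_on{0};
    };

    }  // namespace test

    // Replaced global allocation functions for the test binary
    // (the remaining new/delete overloads are omitted for brevity).
    void* operator new(std::size_t size) {
      test::allocation_failure_injector::maybe_fail();
      if (void* const result = std::malloc(size)) return result;
      throw std::bad_alloc{};
    }

    void operator delete(void* ptr) noexcept { std::free(ptr); }

    // Justinas' loop: rerun the test body, failing the 1st, 2nd, ...
    // allocation, until it completes without the injector firing.
    template <typename TestBody>
    void assert_oom_safety(TestBody test_body) {
      for (std::uint64_t fail_at = 1;; ++fail_at) {
        try {
          test::allocation_failure_injector::fail_on_nth_allocation(fail_at);
          test_body();
          break;  // no allocation left to fail, we are done
        } catch (const std::bad_alloc&) {
          // The harness asserts here that observable state did not change.
        }
      }
      test::allocation_failure_injector::reset();
    }
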
Both ART and QSBR were treated this way, and it did not take long for the "safe-ish" code to show serious bugs:
  • QSBR would leak memory, bump counters prematurely, and generally end up in inconsistent states (one, two, three), including a fuzzer-found bug that was non-deterministic due to the unpredictable number of C++ standard library allocations
  • In the case of growing or shrinking an ART tree node, if the allocation of the new node failed, the existing node would still get reclaimed. Yikes.
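
The fix for that second bug is a question of ordering, as in the hypothetical sketch below (toy node types, not UnoDB's actual ART nodes): do the allocation that can throw before touching anything observable, and reclaim the old node only on the no-longer-throwing path, which is exactly what the strong guarantee asks for.

    #include <algorithm>
    #include <array>
    #include <memory>

    // Toy stand-ins for ART internal node types.
    struct node4 {
      std::array<int, 4> children{};
    };

    struct node16 {
      explicit node16(const node4& source) {
        std::copy(source.children.cbegin(), source.children.cend(),
                  children.begin());
      }
      std::array<int, 16> children{};
    };

    // Grow a node with the strong guarantee: the only throwing step comes
    // first, so a std::bad_alloc leaves the caller's node untouched.
    std::unique_ptr<node16> grow_node(std::unique_ptr<node4>& old_node) {
      auto new_node = std::make_unique<node16>(*old_node);  // may throw
      old_node.reset();  // commit point: non-throwing reclamation
      return new_node;
    }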

As a result, the code is no longer exception "safe-ish", but actually safe to a specific level with respect to std::bad_alloc. There are other exceptions that might be thrown, but I couldn't think of any reasonable, not-immediately-fatal way to handle std::mutex::lock throwing std::system_error.

As I was writing the above, I noticed that half of the commit messages say "crash safety" instead of "exception safety." At least I haven't found "exception isolation levels" or the rest of ACID in my commit messages yet.