Monday, November 16, 2020

Optimizing Adaptive Radix Tree Implementation for CPU

For a long time in my career I have wanted to learn how to optimize better for the CPU. I was lucky enough to work in places where performance matters, but it was almost exclusively I/O performance, scalability, big-O, maybe an occasional memory access pattern issue, not quite the CPU itself.

So I took my Adaptive Radix Tree implementation and wrote a bunch of isolating microbenchmarks for all the interesting code paths. Then I tried to see how I could make things faster, on a bare metal Sandy Bridge server (AVX instruction set, state of the art, I know). Some random things I learned (or re-confirmed for myself) in the process:

  • ART, as a node-jumping tree structure, becomes cache-miss-bound as the tree size grows. Yet it is possible to do CPU-bound work with tree sizes that fit into L3/L2/L1D.
  • An "obviously faster" code change is not such frequently enough so that not re-measuring will bite you. The more "obvious", the more discipline is required to rerun the numbers. This point is beaten to death and yet never goes away.
  • I couldn't really figure out the restrict keyword. I understand the theory behind it, yet when I put it in the code, surprises happen. Most of the time it perturbs code generation in some random, insignificant direction, including cases where it should have been a no-op. It does not help that it is absent from the C++ language and exists only as a compiler extension. (A sketch of what I mean by restrict is after this list.)
  • The compiler generates different code for, say, unsigned vs. uint8_t local variables even when all of a variable's allowed values fit into uint8_t. It should be able to pick the optimal width itself, but it does not appear to do that. (A minimal illustration is after this list.)
  • Well-predicted branching code appears to be faster than branchless code, which in turn appears to be faster than poorly-predicted branching code. It might be easier to tell that some code is poorly predicted than that it is well predicted.
  • Some branchless code will trade conditional jumps for conditional data dependency chains (e.g. Intel CMOVcc), which is still an improvement, but a smaller one. (A branchy-vs-branchless sketch is after this list.)
  • Looping is faster than recursing.
  • Sprinkling SIMD intrinsics into code that handles not-quite-vector data is a thing, and it is also a thing the compiler is not likely to do for you. The smallest amount of data I saw being processed faster with SIMD was four bytes (the Node4 search in ART, using the exact same algorithm that the paper proposed for the Node16 search, just narrower; see the sketch after this list). It helps to view SIMD instructions not as something exclusively reserved for long vector loops, but as specific tools in the general-purpose instruction set.
  • Conversely, putting a bunch of bytes in a general-purpose register and treating them as a vector of independent bytes is also a thing, and it even has a name: SWAR (SIMD within a register). (A small SWAR example is below as well.)
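
A minimal sketch of the restrict point, to be concrete. The function and its names are made up for illustration, not taken from my ART code; __restrict is the spelling that GCC, Clang and MSVC accept in C++. The qualifier promises the compiler that the two pointers never alias, which in theory lets it keep the accumulator in a register instead of reloading and storing *out on every iteration:

    #include <cstddef>
    #include <cstdint>

    // Hypothetical example: __restrict promises the compiler that `out` and
    // `in` never point at the same memory, so it may keep the running sum in
    // a register instead of writing *out back on every iteration.
    void sum_bytes(std::uint64_t* __restrict out,
                   const std::uint8_t* __restrict in, std::size_t len) {
      for (std::size_t i = 0; i < len; ++i) *out += in[i];
    }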
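
A minimal illustration of the variable width point, again with a made-up function rather than an excerpt from my code. The node here holds at most 16 keys, so every value the index takes fits into a byte; swapping the declaration of i between std::uint8_t and unsigned changes the generated code even though it should not matter:

    #include <cstdint>

    // Hypothetical sketch: a linear key search over a small node with at most
    // 16 keys. Every value of `i` fits into a byte, yet declaring it as
    // std::uint8_t vs. unsigned produces different machine code.
    int find_child_linear(const std::uint8_t keys[16], unsigned children_count,
                          std::uint8_t key_byte) {
      for (std::uint8_t i = 0; i < children_count; ++i)  // try `unsigned i` too
        if (keys[i] == key_byte) return i;
      return -1;
    }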
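
A sketch of the branchy vs. branchless trade-off, with made-up function names. The ternary form is the kind of code compilers tend to turn into CMOVcc: the conditional jump disappears, but the result now has a data dependency on both inputs. A modern compiler may well emit the same instructions for both versions, which is one more reason to re-measure rather than assume:

    #include <cstdint>

    // Branching version: cheap when the branch predictor guesses right,
    // expensive on a mispredict.
    std::uint64_t clamp_branchy(std::uint64_t x, std::uint64_t limit) {
      if (x > limit) return limit;
      return x;
    }

    // Branchless-style version: typically a comparison plus CMOVcc. There is
    // no jump left to mispredict, but the result cannot be computed before
    // both `x` and `limit` are ready.
    std::uint64_t clamp_branchless(std::uint64_t x, std::uint64_t limit) {
      return x > limit ? limit : x;
    }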
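
The Node4 search mentioned above, sketched with SSE2 intrinsics. This follows the Node16 algorithm from the ART paper (replicate the key byte, compare all lanes at once, turn the comparison into a bitmask, mask off the unused lanes, take the first set bit), narrowed to four bytes. The node layout of four contiguous key bytes is an assumption of this sketch, and __builtin_ctz is the GCC/Clang spelling (std::countr_zero in C++20):

    #include <cstdint>
    #include <cstring>
    #include <emmintrin.h>  // SSE2

    // Returns the index of the child whose key byte matches, or -1 if none.
    int node4_find_child(const std::uint8_t keys[4], unsigned children_count,
                         std::uint8_t key_byte) {
      std::uint32_t keys_u32;
      std::memcpy(&keys_u32, keys, sizeof(keys_u32));        // 4 key bytes
      const __m128i keys_vec =
          _mm_cvtsi32_si128(static_cast<int>(keys_u32));     // into low 4 lanes
      const __m128i match = _mm_cmpeq_epi8(
          keys_vec, _mm_set1_epi8(static_cast<char>(key_byte)));  // lane-wise ==
      const unsigned mask = static_cast<unsigned>(_mm_movemask_epi8(match)) &
                            ((1U << children_count) - 1U);   // keep valid lanes
      return mask != 0 ? static_cast<int>(__builtin_ctz(mask)) : -1;
    }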
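
And the SWAR counterpart: the same kind of byte search done entirely in a general-purpose register, using the well-known zero-byte-detection bit trick rather than anything taken from my ART code. XOR turns matching bytes into zero bytes, and the subtract-and-mask expression sets the high bit of each zero byte. Borrows can flag extra bytes above the first match, but the lowest set bit always lands in the first matching byte, so counting trailing zeros gives the right index:

    #include <cstdint>

    // SWAR sketch: find a key byte among eight bytes packed into a 64-bit
    // general-purpose register. Returns the byte index of the first match,
    // or -1 if the byte is not present.
    int swar_find_byte(std::uint64_t packed_keys, std::uint8_t key_byte) {
      const std::uint64_t pattern = 0x0101010101010101ULL * key_byte;
      const std::uint64_t x = packed_keys ^ pattern;   // matching bytes -> 0x00
      const std::uint64_t found =
          (x - 0x0101010101010101ULL) & ~x & 0x8080808080808080ULL;
      if (found == 0) return -1;
      return __builtin_ctzll(found) / 8;               // index of first match
    }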