Monday, November 16, 2020

Optimizing Adaptive Radix Tree Implementation for CPU

For a long time in my career I have wanted to learn how to optimize for the CPU better. I have been lucky enough to work in places where performance matters, but it was almost exclusively I/O performance, scalability, big-O, maybe an occasional memory access pattern issue - not quite the CPU itself.

So I took my Adaptive Radix Tree implementation and wrote a bunch of isolating microbenchmarks for all the interesting code paths. Then I tried to see how I could make things faster on a bare-metal Sandy Bridge server (AVX instruction set, state of the art, I know). Some random things I learned (or re-confirmed for myself) in the process:

  • ART, as a node-jumping tree structure, becomes cache-miss-bound as sizes grow. Yet there is still CPU-bound work worth optimizing at tree sizes that fit into L3/L2/L1D.
  • An "obviously faster" code change is not such frequently enough so that not re-measuring will bite you. The more "obvious", the more discipline is required to rerun the numbers. This point is beaten to death and yet never goes away.
  • I couldn't really figure out the restrict keyword. I understand the theory behind it, yet when I put it in the code, surprises happen. Most of the time it perturbs code generation in a random, insignificant direction - including the cases where it should have been a no-op. It does not help that it is absent from the C++ language and exists only as a compiler extension.
  • The compiler generates different code for, say, an unsigned vs a uint8_t local variable even when all of its allowed values fit into uint8_t. It should be able to pick the optimal width itself, but it does not appear to do that.
  • Well-predicted branching code appears to be faster than branchless code, which appears to be faster than poorly-predicted branching code. It might be easier to tell if some code is poorly-predicted than if some code is well-predicted.
  • Some branchless code will trade conditional jumps for conditional data dependency chains (e.g. Intel CMOVcc), which is still an improvement, but less so.
  • Looping is faster than recursing.
  • Sprinkling SIMD intrinsics into code that handles not-quite-vector data is a thing; it is also a thing the compiler is not likely to do for you. The smallest amount of data I saw processed faster with SIMD was four bytes (Node4 search in ART - the exact same algorithm the paper proposes for Node16 search, just narrower; see the first sketch after this list). It helps to view SIMD instructions not as something exclusively reserved for long vector loops, but as specific tools in the general-purpose instruction set.
  • Conversely, putting a bunch of bytes into a general-purpose register and treating them as a vector of independent bytes is also a thing; it even has a name: SWAR (SIMD within a register). See the second sketch below.
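
For illustration, here is roughly what such a narrow SIMD search can look like. This is only a sketch: the Node4 layout and the function name below are made up for the example, not taken from my implementation, but the broadcast-compare-movemask sequence is the one the paper describes for Node16 search.

#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical Node4 layout: up to four sorted key bytes plus child pointers.
struct Node4 {
    uint8_t num_children;
    uint8_t keys[4];
    void *children[4];
};

// Find the child matching key_byte, or return nullptr if there is none.
void *node4_find_child(const Node4 *node, uint8_t key_byte) {
    // Broadcast the searched byte into all 16 lanes of an XMM register.
    const __m128i needle = _mm_set1_epi8(static_cast<char>(key_byte));
    // Build a vector out of the four key bytes; the remaining lanes are unused.
    const __m128i keys = _mm_setr_epi8(
        static_cast<char>(node->keys[0]), static_cast<char>(node->keys[1]),
        static_cast<char>(node->keys[2]), static_cast<char>(node->keys[3]),
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
    // Compare all lanes at once, turn the per-byte results into a bitmask,
    // and mask off the slots that do not hold valid children.
    const unsigned mask = _mm_movemask_epi8(_mm_cmpeq_epi8(needle, keys))
                          & ((1u << node->num_children) - 1);
    if (mask == 0)
        return nullptr;
    // The lowest set bit is the index of the matching key slot.
    return node->children[__builtin_ctz(mask)];
}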
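
And a classic SWAR illustration, unrelated to ART: checking whether any of the eight bytes packed into a 64-bit word is zero, with no per-byte loop or branch. The function below is a textbook bit trick, shown here just to make the idea concrete.

#include <cstdint>

// Returns true if any of the eight bytes in v is zero. Subtracting 1 from a
// zero byte sets its high bit, and ~v keeps that high bit only for bytes below
// 0x80, so the AND leaves a 0x80 marker exactly where a byte was zero.
bool has_zero_byte(uint64_t v) {
    return ((v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL) != 0;
}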


Wednesday, October 28, 2020

My dotfiles

I started using Emacs in 2001 and have been using it continuously for two decades, circumstances permitting (in the 00s and early 10s its support for Java and C++ was terrible). I have kept its configuration under source control since 2009.

All that time my Emacs knowledge was quite superficial, and the configuration was mostly copy-and-paste from various online bits. Two years ago I decided that I could not afford not to know my main editor properly, and read its manual and the Emacs Lisp reference cover to cover. I did not become an expert, but at least I got a map of the territory. This let me note down and fix numerous long-standing annoyances instead of just accepting them.

At the same time I started adding various work-related (MySQL back then, Aerospike now) utilities. Still later I started adding new system configuration notes for macOS and Linux, and untangled the whole mess with the help of GNU Stow.

Now I have added a README.md and so my dotfiles are in good enough shape to be public!

Sunday, May 17, 2020

The various forms of __builtin_prefetch on x86_64

Recently I learned that the GCC compiler intrinsic for prefetching data into the CPU cache, __builtin_prefetch, can actually take two optional extra arguments: rw and locality. This is slightly embarrassing for me because, as far as I can tell, the intrinsic has had its three-argument form from day one, starting with GCC 3.1, released in 2002.

The rw argument can take the value 0 or 1, hinting whether the anticipated memory access is going to be a read (0, the default) or a write (1). locality can be 0, 1, 2, or 3, with 0 hinting that the memory will be accessed just once, 3 (the default) asking to keep it cached as much as possible, and 1 and 2 being in between.

The next thing I didn't know was what the different argument values compile to on x86_64. With GCC 10 on Godbolt I found:
__builtin_prefetch(ptr);        // prefetcht0
__builtin_prefetch(ptr, 1);     // prefetchw
__builtin_prefetch(ptr, 0);     // prefetcht0
__builtin_prefetch(ptr, 1, 0);  // prefetchw
__builtin_prefetch(ptr, 1, 1);  // prefetchw
__builtin_prefetch(ptr, 1, 2);  // prefetchw
__builtin_prefetch(ptr, 1, 3);  // prefetchw
__builtin_prefetch(ptr, 0, 0);  // prefetchnta
__builtin_prefetch(ptr, 0, 1);  // prefetcht2
__builtin_prefetch(ptr, 0, 2);  // prefetcht1
__builtin_prefetch(ptr, 0, 3);  // prefetcht0

Next, let's check the Intel docs on these instructions:
  • PREFETCHT0: prefetch into all cache levels.
  • PREFETCHT1: prefetch into L2 and higher. Of course, if a particular implementation has an inclusive L3 cache, the line will end up there too.
  • PREFETCHT2: Intel's Software Developer's Manual states: prefetch into L3 and higher, "or an implementation-specific choice." Intel's Optimization Reference Manual states that it is "identical to PREFETCHT1," that is, prefetch into L2, and into L3 depending on its inclusiveness.
  • PREFETCHNTA: the SDM tries to abstract from the actual hardware a bit by describing this as "prefetch into a non-temporal cache structure close to the CPU, trying not to pollute the regular caches." The ORM explains what this means in practice: on non-Xeon, prefetch into L1, bypassing L2, and on Xeon, prefetch into "L3 with fast replacement" - I have no idea what "fast replacement" means here.
  • PREFETCHW: according to the SDM, prefetch into L1 or L2 and invalidate other copies of this cache line; according to the ORM, prefetch into all levels and invalidate the same. The invalidation is of course why this is a write-specific prefetch instruction - it saves an invalidation later, at the actual write time.

Some properties are common to all of these instructions. First, the prefetch stride, that is, how much data a single prefetch brings in. I had always assumed it was 64 bytes - the L1 cache line size. The docs add a few wrinkles to that, hopefully all historical: 1) the minimum is 32 bytes; 2) NetBurst Pentium 4 used to prefetch two cache lines; 3) the actual size should be queried through CPUID. In practice everyone seems to agree it is 64 bytes nowadays.

Second, instruction scheduling. The Intel docs seem to have been last updated for the Pentium 4 (twenty years ago); for that CPU they state: "insert a PREFETCH instruction every 20 to 25 cycles." Finding no newer information in the official sources, I consulted Agner Fog's instruction tables, where these instructions are listed with a reciprocal throughput of 0.5-1 clock cycles per instruction per thread - except on Ivy Bridge, where it is 43 for some reason. Anyway, it does not seem to be particularly harmful these days to issue several prefetches in a row without interleaving them with computational instructions.
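
To make that concrete, here is the usual prefetch-ahead pattern as a sketch; the pointer-chasing workload and the prefetch distance are made up for the example, and the distance would need tuning on real code.

#include <cstddef>

// Sum values reached through an array of pointers, prefetching ahead.
// PREFETCH_DISTANCE is a tuning knob picked arbitrarily for this sketch.
constexpr std::size_t PREFETCH_DISTANCE = 8;

long sum_indirect(const long *const *ptrs, std::size_t n) {
    long sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + PREFETCH_DISTANCE < n)
            // Read prefetch (rw = 0) into all cache levels (locality = 3):
            // compiles to PREFETCHT0 on x86_64.
            __builtin_prefetch(ptrs[i + PREFETCH_DISTANCE], 0, 3);
        sum += *ptrs[i];
    }
    return sum;
}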