Wednesday, November 09, 2022

Porting UnoDB to ARM64

ARM is the most common non-Intel instruction set, and UnoDB has enough Intel-specific code to make porting it an interesting project. I started out on AWS Graviton 2 hardware (the ARMv8.2 instruction set plus extensions) and finished on an Apple M1 (ARMv8.5 plus extensions). The latter became my daily development platform.

I believe any porting effort goes through similar steps: 1) make it build; 2) get it tested; 3) make it correct; 4) make it fast. Let's review them.

1) Make it build. While I had been trying to properly isolate Intel code with conditional compilation, some bits had slipped through, which was to be expected while it was an Intel-only build: some Intel code was missing preprocessor guards, and Node16 search did not have a platform-independent fallback implementation. Both were easy to fix. Then only three ARM-specific bits were required: the cache line size constants, the spinlock spin loop body, and the platform-specific static_asserts confirming the internal node sizes (the last one being entirely optional). For the spin loop I went with the YIELD instruction; the optimistic lock spinlock is probably the most underdeveloped part of UnoDB anyway (it is a single PAUSE instruction on Intel), so I didn't sweat it too much.
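
To illustrate, here is a minimal sketch of such a platform-dependent spin hint, assuming GCC/Clang builtins and inline assembly; the function name is illustrative, not UnoDB's actual one:

    // Spin-wait hint for a spinlock busy loop: PAUSE on x86, YIELD on ARM64.
    inline void spin_wait_hint() noexcept {
    #if defined(__x86_64__) || defined(__i386__)
      __builtin_ia32_pause();         // the single PAUSE instruction
    #elif defined(__aarch64__)
      __asm__ __volatile__("yield");  // the closest ARM64 equivalent
    #else
      // portable fallback: no hint at all
    #endif
    }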

2) Get it tested. I needed a free public CI/CD service. The Internet said there were two options available: Travis-CI and CircleCI. GitHub Actions, which I was already using, supports ARM, but only if you provide your own runner VMs, so, nope. Travis-CI was something I had used before and then stopped using, together with the rest of the OSS world. CircleCI was fine, so I set up a simple job there at first and added more compilers, tests, and sanitizers later.

3) Make it correct. Well, the tests passed on the first run attempt. All of them. With sanitizers. Under Valgrind. This includes the parallel tests for the relaxed-atomics-heavy QSBR and Optimistic Lock Coupling ART, on an architecture with a weaker memory model than Intel's. I still haven't seen a crash since. Either I am lucky, or all that consistent testing on Intel over time with sanitizers, including ThreadSanitizer, pays off.

4) Make it fast. In this case that means porting the code that uses Intel vectorization intrinsics. There are libraries for writing vectorized code at a slightly higher abstraction level (sse2neon, simde, and others), but I wanted to learn the actual architecture. That architecture has several vectorization instruction set extensions: NEON, SVE, and SVE2. NEON very roughly corresponds to, say, SSE4: it provides 128-bit vectors and is the simplest one to use. SVE (and SVE2) is something else altogether. It provides the means to write vector-width-independent code, that is, the same code would run unmodified on a CPU with 128-bit vectors and on a CPU with 512-bit vectors. Naturally this comes with the overhead of querying the runtime vector width and of handling data sizes that do not fit evenly into vectors. It appears to be best suited for processing large amounts of data, which UnoDB internal nodes are not. Thus I went with the simpler NEON.

All the UnoDB vectorized code loop bodies follow the same pattern:

  1. Load the next part of data
  2. Compare it in some way against something, getting a result vector
  3. Take the mask of the result vector and process it as needed (sketched right after this list).
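
For reference, here is a minimal sketch of this pattern with the x86 SSE intrinsics; the key layout and the function name are illustrative, not UnoDB's actual ones:

    #include <emmintrin.h>  // SSE2 intrinsics
    #include <cstdint>

    // Find the index of the first of 16 key bytes equal to key_byte, or -1.
    inline int find_first_match(const std::uint8_t *keys,
                                std::uint8_t key_byte) noexcept {
      // 1. Load the next part of the data.
      const __m128i keys_vec =
          _mm_loadu_si128(reinterpret_cast<const __m128i *>(keys));
      // 2. Compare it against the broadcast search byte, getting a result vector.
      const __m128i cmp =
          _mm_cmpeq_epi8(keys_vec, _mm_set1_epi8(static_cast<char>(key_byte)));
      // 3. Take the mask of the result vector and process it as needed.
      const unsigned mask = static_cast<unsigned>(_mm_movemask_epi8(cmp));
      return (mask != 0) ? __builtin_ctz(mask) : -1;
    }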

That "take the mask of a vector" part is handled by PMOVMSKB/VPMOVSKB instructions (_mm_movemask_epi8 & _mm256_movemask_epi8 intrinsics), and so it happens that NEON does not have a direct replacement. I tried some emulating implementations from sse2neon and simde, getting slower than baseline results every time. Nevertheless, I managed to implement a faster Node4 search in NEON by observing that the useful part of the result vector is so small it can be copied to a general purpose register directly instead of movemask'ing it. This resulted in up to 14% higher throughput in the related microbenchmarks over the SWAR-in-a-general purpose register-optimized baseline.

At this point I had thought I was done, because I couldn't overcome the slow movemask fallback implementations for the rest of the code. Then someone on Twitter (dead link; the account has been deleted since, and I believe it was him) posted a new movemask replacement based on SHRN (shift right and narrow). This operation can be thought of as a "halfway movemask": it does not get down to a single bit per vector element, but it does not have to. Once we get something that fits into a GP register (or a pair of them, in the initial Node16 search implementation), we can work with that.
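
A minimal sketch of that SHRN trick, together with how a Node16-style key search could use it (the key layout is an assumption, not necessarily UnoDB's): each comparison result byte is turned into four result bits, which still fits the whole 16-byte result into one 64-bit GP register.

    #include <arm_neon.h>
    #include <cstdint>

    // "Halfway movemask": four result bits per input byte instead of one.
    inline std::uint64_t neon_movemask_nibbles(uint8x16_t cmp_result) noexcept {
      // View the 16 comparison bytes as eight 16-bit lanes, shift each right by
      // four and narrow back to eight bits: every source byte (0x00 or 0xFF)
      // contributes one nibble (0x0 or 0xF) to the output.
      const uint8x8_t narrowed = vshrn_n_u16(vreinterpretq_u16_u8(cmp_result), 4);
      return vget_lane_u64(vreinterpret_u64_u8(narrowed), 0);
    }

    // Find the first of 16 key bytes equal to key_byte, or -1 if none match.
    inline int node16_find_key(const std::uint8_t *keys,
                               std::uint8_t key_byte) noexcept {
      const uint8x16_t keys_vec = vld1q_u8(keys);
      const uint8x16_t cmp = vceqq_u8(keys_vec, vdupq_n_u8(key_byte));
      const std::uint64_t mask = neon_movemask_nibbles(cmp);
      if (mask == 0) return -1;
      return __builtin_ctzll(mask) / 4;  // four result bits per key byte
    }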

With this, a straightforward Node16 NEON search resulted in up to 50% higher throughput (and an up to 8% regression for minimal-sized Node16; I took that trade-off). Node48 insert position search became up to 8% faster in the single-load-per-iteration implementation, and then I unrolled that loop to load four registers (eight elements) per iteration. Unfortunately, I have misplaced the benchmark results for that one; I recall it being up to 10% faster on top of the baseline NEON implementation.

Interestingly, this code is unrolled exactly the same way in all three vector implementations (SSE4, AVX2, NEON) to process four vector registers per iteration, which corresponds to handling eight pointers per iteration for SSE4 & NEON and 16 pointers for AVX2.
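
Purely to illustrate that unrolling shape in NEON, here is a hypothetical free-slot scan over an array of 64-bit child pointer slots, four registers (eight slots) per iteration; UnoDB's actual Node48 layout, empty-slot encoding, and loop structure differ in the details.

    #include <arm_neon.h>
    #include <cstdint>

    // Find the first null entry in an array of 64-bit slots; count is assumed
    // to be a multiple of eight (as the 48 Node48 child slots are).
    inline int find_first_null_slot(const std::uint64_t *slots,
                                    unsigned count) noexcept {
      const uint64x2_t zero = vdupq_n_u64(0);
      for (unsigned i = 0; i < count; i += 8) {
        // Load four 128-bit registers = eight 64-bit slots per iteration and
        // compare each slot against zero: null slots become all-ones lanes.
        const uint64x2_t c0 = vceqq_u64(vld1q_u64(slots + i), zero);
        const uint64x2_t c1 = vceqq_u64(vld1q_u64(slots + i + 2), zero);
        const uint64x2_t c2 = vceqq_u64(vld1q_u64(slots + i + 4), zero);
        const uint64x2_t c3 = vceqq_u64(vld1q_u64(slots + i + 6), zero);
        // Narrow the eight 64-bit comparison results to one byte each, ending
        // with all eight packed into a single 64-bit general-purpose register.
        const uint32x4_t n01 = vcombine_u32(vmovn_u64(c0), vmovn_u64(c1));
        const uint32x4_t n23 = vcombine_u32(vmovn_u64(c2), vmovn_u64(c3));
        const uint16x8_t n = vcombine_u16(vmovn_u32(n01), vmovn_u32(n23));
        const std::uint64_t mask =
            vget_lane_u64(vreinterpret_u64_u8(vmovn_u16(n)), 0);
        if (mask != 0) return static_cast<int>(i + __builtin_ctzll(mask) / 8);
      }
      return -1;  // no free slot found
    }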

So, ARM64 is now a first-class platform for UnoDB, which is also convenient for me, having switched to an Apple M1 as my main machine.
