Laurynas Biveinis: ART: how much faster AVX2 is than SSE4?

The original Adaptive Radix Tree paper, with vectorization as one of the main selling points, only requires SSE2 intrinsics, introduced with the first Pentium 4 in year 2000. And I developed UnoDB on an AVX-supporting Intel CPU, Sandy Bridge, 2nd Core generation, introduced in 2011. It might be considered a bit long in the tooth now, but still a bit newer than SSE2, and I have used SSE4.1 intrinsics too.

Then I upgraded the Sandy Bridge machine to Kaby Lake (7th generation, 2016), and started looking into an AVX2 port.

The vectorized ART algorithms were:

Node4 search and insert position search (both regular and while expanding N4 to N16), all data fitting into a single XMM register
Node16 search and insert position search, the original ART paper algorithm, again all data fitting into a single XMM register
Node48 insert free child slot search, which is a first nullptr search in an array of 48 pointers

For the first two, I couldn't think of any ways to improve them. The data is already in a single XMM register. Maybe the compiler uses AVX2 VPBROADCAST* to do loads, stores, and, well, broadcasts, I am not sure, it's up to it.

The last one, however, is a textbook vectorization case, even if the amount of data is relatively small. The SSE4 implementation loads four XMM registers to process eight pointers per loop iteration. The array length being 48 easily permits unrolling the loop once more, to handle 16 pointers per iteration, but that implementation was slower for me.

For AVX2 I started with the simplest implementation and then tried to improve it step by step, using synthetic microbenchmarks for the N48 insert.

The simplest implementation is to load and process a single YMM register (four pointers) per loop iteration. 2% to 5% speedup for N48 insert over the baseline which handled twice as much per iteration
Then, unroll loop once for eight pointers per iteration. 1% to 8% speedup over the previous step
Then, I experimented with prefetching the next loop iteration, but that was slower, I guess it's a trivial case for the hardware prefetcher.
Then, unroll loop twice for 16 pointers per iteration. 2% to 10% speedup over the loop unrolled once.

The three steps together amount to 2% to 40% speedups (and there's a 2% regression in a single hopefully rare case) over SSE4, and that's the benefit of AVX2 I was able to get.

If that loop were unrolled once more, it would disappear: the first "iteration" would handle 32 pointers, and then separate code would optionally handle the remaining 16. I did not go down this road.

The next project would be to add AVX-512 support, but I don't have the hardware nor I am really motivated to get my hands on it. A fun fact: did you know that "512" in AVX-512 refers to the total number of different instruction sets all calling themselves AVX-512? But then again, the base AVX-512F set should be enough for me.

(edit: replaced GitHub line links with line range links)

Laurynas Biveinis

Friday, November 04, 2022

ART: how much faster AVX2 is than SSE4?

No comments: