Laurynas Biveinis: November 2022

Monday, November 28, 2022

Rust for a C++ Engineer

Collected from reading https://doc.rust-lang.org/book/. The notes are organized by topic rather than by the book order.

Primitive types

Integer types are sized, except for isize and usize, whose width is architecture-specific, and corresponds to C++ std::ssize_t/std::intptr_t and std::size_t/std::uintptr_t respectively.
char != u8 != i8, oh thank god
String literals "foo" in UTF-8. They are slices (see below).
Tuples are built into language, use parentheses syntax, roughly correspond to std::tuple and std::pair. Can also do .i syntax to access the i-th element. Unit tuple ().
Fixed-size arrays are built into language, use bracket syntax, roughly correspond to std::array with mandatory bounds-checking.
Simple types implement Copy trait (i.e. scalars, correspond to value types in Java) and their assignment copies. Tuples copy if all their elements copy, and this may include arbitrary large tuples. Otherwise, assignment moves. A copy is done by an explicit .clone() operation.
String slices have type &str, can be created with &var[x..y] syntax, x and/or y may be omitted if they are zero and length respectively. Slice consists of a pointer and length.
General slices are very similar. They have &[element_type] type.
! is an empty type with no values. Used as return type for functions that never return. It can be coerced into any other type, which is used for i.e. match arms which e.g. continue or panic!.

Variables & type inference

Variables (let) are immutable by default, but their names can be shadowed, allowing sequences of changes in what a given variable name means. The type can be changed too. Mutable variables can be declared with let mut.
Even immutable variables may be declared without initialization.
Variable types are inferred, but can be annotated when/if needed.
Constants are declared with const, and require type annotations.
Function parameter and return types always must be provided.

Lifetimes, references & borrow checker

All variables are owned by their enclosing scope. The ownership may be passed around, when the last owner scope exits, the variable is destroyed (drop is executed if the Drop trait is implemented, memory released).
Passing variable into a function moves it (or copies), transferring its ownership into the function. Likewise returning it moves it (or copies), transferring the return value ownership to the caller.
Creating references is called borrowing. Immutable references do not allow modifying the pointed-to variable and are borrowed with &, can be passed around without transferring ownership. Several of them may be active at the same time. Mutable references are borrowed with &mut, and no other mutable or immutable references may exist at the same time.
Each reference has a lifetime. Most cases are handled by implicit lifetimes. Lifetimes are named by 'a, usually very short names, placed after &.
For function signatures, generic lifetime parameters use angle brackets:

fn foo<'a>(x: &'a str, y: &'a str) -> &'a str

Lifetime annotations in struct definitions limit struct lifetime to that of its fields.
If there are multiple input lifetime parameters, but one of them is &self or &mut self, its lifetime is assigned to all output lifetime parameters.
Deref coercion converts a reference to a Deref-implementing type to a reference to a different type. I.e. &String to &str. Happens automatically on parameter-argument type mismatch: from &T to &U when T: Deref<Target=U>. Internally as many .deref() calls are inserted as needed.
For mutable references, implement DerefMut trait. Two extra deref coercion rules: from &mut T to &mut U when T: DerefMut<Target=U> and from &mut T to &U when T: Deref<Target=U>.
The Drop trait is the closest thing to a C++ destructor, adds a drop method that takes a mutable reference to self.
For structs, field lifetimes are part of the struct type, and should be specified where the struct type is specified.
'static lifetime is the lifetime of the whole program.
Lifetimes are a type of generics, so a function with both lifetimes and generic type parameters lists both together in angle brackets.

On paper Rust lifetimes appear to be a genius idea. Besides manual resource management and GC, this is a viable third option that combines the advantages of the two while avoiding their disadvantages, althought only partially so.

Google Chrome developers C++ tried this and did not succeed enough for it to be viable: Borrowing Trouble: The Difficulties of a C++ Borrow-Checker.

Statements & expressions

Last nonterminal symbol in a block may be expression (lacking the final semicolon), in which case the whole block is an expression with this return value. This makes return expression; in functions replaceable with expression. This also merges if statement and ternary operator to a single if expression.
loop starts an infinite loop. break may return an expression, making the loop an expression. Nested loops may have labels 'label: loop {, then possible to do break 'label;.
for is a range loop. When possible, for loops seem to be more idiomatic than while loops.

Type aliases, structs & enums

Rust type alias type Foo = ExistingType is like C++ using Foo = ExistingType. Can be generic: type Foo<T> = std::result::Result<T, std::io::Error>.
Structures use struct keyword, contain only type-annotated fields.
If during a struct variable construction a field and a var it is initialized from have the same name, one of the can be omitted (field init shorthand).
..var in the struct variable construction takes all the unspecified fields from var of the same struct type.
Tuple structs struct foo(i64, i64, i64) are structs that are very similar to tuples. Fields are unnamed.
Unit structs struct Foo;
If structs need to store references, then lifetimes have to be used, that's for later.
Attribute #[derive(Debug)] for struct allows doing {:?} in println! to dump the fields.
dbg!(value) macro maybe inserted as an expression to dump value
struct may have methods attached to them, in separate impl StructName blocks.
The first arg of a method may be one of
- &self, corresponds to a C++ const method;
- &mut self, corresponds to a regular C++ method;
- self, consumes the object;
A function in an impl StructName not taking a self is an associated function, not a method, corresponding a C++ static method. Called through StructName:: syntax.
There may be muliple impl blocks.
The simplest Rust enum roughly matches C++ enum class. But then each enum variant may have different associated data with it, making it similar to std::variant
Standard library Option enum handles the use cases for nullptr (a null reference does not exist in Rust). Similar to C++ std::optional.

Closures & function pointers

Closures: || with parameters inside followed by an (optionally bracketed) expression. Parameter and return types are not annotated usually.
Once closure types are inferred, they don't change, cannot call the same closure with different ones.
Variables are captured by different types of borrowing / ownership taking implicitly depending on what the code does. move keyword before || forces taking ownership, when the body does not need it implicitly. One use case is passing data to a new thread.
All closures implement FnOnce trait, meaning they can be called once.
Closures that mutate captured values but don't move them out implement FnOnce and FnMut.
Closures that don't mutate captured values and don't move them out implement FnOnce, FnMut, and Fn.
All functions coerce to the fn type, which is the function pointer type. It implements all of FnOnce, FnMut, Fn.
To return a closure, use a trait object, e.g. -> Box<dyn Fn ... >

Generics & traits

Generics (types) and traits (behavior) resemble C++ templates. Traits also resemble interfaces in other languages.
Generic type uses must be constrained by traits–no SFINAE. C++ concepts.
impl Foo<f32> {...} adds implementation for a specific type, similar to C++ template specialization.
Separate traits are implemented for structs in separate blocks: impl Trait for Struct { ... }
Traits need to be brought into scope too, pulling in an implementing type is not sufficient to call trait methods.
Can implement a local trait on an external type or an external trait on an local type, but not an external trait on an external type (so no C++ std::hash specialization for a std:: type). This is to avoid allowing multiple trait implementations for the same type.
Trait methods may have default implementations, which may call other, possibly unimplemented methods in the same trait. The default implementation may not be called from an overriding implementation.
Trait-type parameters without generics syntax: fn foo(bar: &impl Trait).
Trait bounds, using generics syntax: fn foo<T: Trait>(bar: &T). Same as above.
Multiple trait bounds: fn foo(bar: &(impl Trait1 + Trait2)) and fn foo<T: Trait1 + Trait2>(bar: &T)
In the case trait bounds become long, where clauses pull them aside:

fn foo<T, U>(t: &T, u: &U) -> Result
where
    T: Trait1 + Trait2,
    U: Trait1 + Trait3,
{
    ...
}

Can use fn ... -> impl Trait to return a trait-implementing type, as long as it's a single type.
Can conditionally implement methods for generic structs by adding trait bounds to their implementation: impl<T: Trait> Type<T> { ... }. These are called blanket implementations.
Rust does not have OOP inheritance. Some form is available through default trait method implementations. Dynamic dispatch (C++ virtual methods) is through trait objects.
A trait object is pointer to an instance of a type and a pointer to a vtable.
Struct and enum vars in Rust are not objects, trait objects come close, but they cannot contain data.
Must be a reference (or a smart pointer) to dyn trait type, i.e. Box<dyn Trait>.
Associated types. type Name allows to use Name as a type in a trait before its declaration is given by the trait implementors. In C++ one would use template argument dependent typenames.
Default generic type parameters: <T=DefaultType>.
foo.bar() can be replaced by Type::bar(&foo) when bar is implemented by more than trait to disambiguate. If even more disambiguation is needed, i.e. for associated methods without self parameter, <Type as Trait>::bar calls the method from Trait as implemented for Type.
If a trait depends on another trait, the latter is called a supertrait: trait Foo: SuperTraitBar.
Newtype pattern. One use case: implement external traits on external types, declare a new thin wrapper tuple struct. There are other use cases.

Error handling

panic! macro exits (or aborts, depending on config) on unrecoverable error.
Errors are handled using Result enum, which can be Ok or Err.
unwrap returns the success variant of Result or panics.
unwrap_or_else executes given code instead of panicking.
expect is like unwrap with a given error message for panicking.
? operator after a call, e.g. let foo = bar()?; unwraps returned Result, or returns from the caller with the error. If the error types do not match, From trait converts.
? operator works with Option return types too.

Iterators

Rust iterators correspond to C++ ranges (or iterator pairs).
Calling .iter() on a collection roughly corresponds to a C++ .cbegin(), except that the latter is not a range. Other options are .into_iter() to take ownership of values–not sure what a direct C++ mapping would be–and .iter_mut() over mutable references (C++ .begin()).
Iterators implement the Iterator trait.
Iterator .collect() method returns a new collection from iterating.
Code using iterator adapters might be faster than equivalent loop-based code. An example of Rust zero-cost abstractions, which of course is found in C++ as well.

Pattern matching & related control flow

Pattern matching can decompose a tuple to local vars, and do many other things.
match is a generalized switch with pattern matching, variable binding, and more.
_ is a catch-all non-binding pattern, like the default in switch. Ignores the entire value.
if let behaves like a single match arm, combining if with variable binding in the case of true condition.
while let loop repeats until its pattern matches.
The value after for keyword in a for loop is a pattern.
let keyword takes a pattern, not a variable id.
Function parameters are patterns.
Patterns are refutable and irrefutable, the latter ones matching any possible passed value. Function parameters, let, and for take irrefutable patterns. if let and while let take both kinds, with a compiler warning if irrefutable (as that creates always-true if or an infinite loop while). match arms must be refutable except for the last one, which should be irrefutable (if the possibilities were not exhausted until then).
Multiple patterns can be combined with |.
An inclusive range of values can be matched with ..=. The range cannot be empty.
Struct destructuring: Foo { x: a, y: b } = v gets a and b. If field and var names match, then Foo { x, y } = v. Literals can be used too.
Enum destructuring Foo::Variant { x, y }, Foo::VariantWithNoData.
Can destructure arbitrarily deep nested structs and enums.
Nested _ ignores just that part.
Starting a variable name with an _ suppressed unused variable warnings for it.
.. is a greedy sequence of _, i.e.

let numbers = (1, 5, 7, 20, 30);
match numbers {
    (first, .., last) => ...
}

match arms may have match guards which are extra if conditions that can use the bound vars. Some(x) if x > 5. Exhaustiveness is not checked.
@ bindings allow to create a var holding the tested value at the match time, i.e. Message::Hello { id: id_var @ 3..=7 }.

Standard library types

Dynamic strings: String, would correspond to std::string type, but UTF-8. Display trait adds to_string method. Not indexable to avoid byte/UTF-8 encoding mixup. Slicing is allowed but runtime-checked to fall on char boundary. To disambiguate byte/char interpretation, use .chars() or .bytes().
Standard library vectors match std::vector. A macro vec![1, 2, 3] to create a vector with given contents. Ownership/borrowing rules apply to whole vector, i.e. if a mutable reference to the first element is taken, a new one cannot be pushed to the back.
Standard library hash maps correspond to std::unordered_map.
std::thread::spawn(closure) -> std::thread::thread(callable).

Smart pointers & dynamically-sized (unsized) types

Smart pointers own data. They implement Deref and Drop traits. String and Vec<T> are smart pointers.
Box<T> is like std::unique_ptr<T> in C++, except that Rust is more likely to use plain references and lifetimes, so no 1:1 mapping in i.e. rewrite. Box::new is std::make_unique.
Implementing Deref trait enables dereferencing with the * operator, like overloading C++ * and -> operators does.
Under the hood *x is transformed to *(x.deref()) exactly once.
std::mem::drop corresponds to C++ std::unique_ptr::reset or other early destruction.
Rc<T> matches std::shared_ptr. Rc::clone method matches std::shared_ptr copy constructor.
Interior mutability pattern: unsafe code to mutate data inside an immutable value even with immutable references present.
RefCell<T> does borrow checking at runtime instead of compile time. borrow and borrow_mut methods.
Rc<RefCell<T>> pattern implements multiple owners to potentially-mutable data.
Weak<T> matches std::weak_ptr. Constructed by Rc::downgrade. Upgraded by upgrade method, corresponding to std::weak_ptr::lock.
Dynamically sized types (DST) or unsized types, whose sizes are only known at the runtime. Cannot create variables of such types directly, naturally always hidden in some pointer + size structure. Rust automatically implements Sized trait for every non-DST, and implicitly bounds by it for every generic function. To relax the latter, fn foo<T: ?Sized>.... The ?Trait syntax is only available for Sized trait.

Operator overloading

Operator overloading by implementing the desired traits in std::ops.

Concurrency

thread::spawn returns a JoinHandle, which has a join method, similar to C++ std::thread::join.
Message passing for inter-thread communication, like in Go. Channels, std::sync::mpsc::channel(). The endpoints have send, recv, try_recv methods. The receiver implements Iterator too. The channels may have multiple transmitters (it's MPSC), which can be created by .clone.
Messages must implement Send trait. If a type is composed of Send types only, it becomes Send automatically.
Types whose variables are safe to be referenced from multiple threads implement Sync trait. If &T is Send, then T is Sync. A type made of Sync types only is Sync automatically.
Mutex<T> is a mutex-guarded variable of T. .lock() returns a LockResult, which has a (potentially mutable) reference MutexGuard to the guarded data. The guard unlocks when it goes out of scope. Mutex implements interior mutability.
Arc<T> corresponds to C++ std::atomic<std::shared_ptr>.
To actually share mutexes between threads, wrap them: Arc<Mutex<T>>.

Assorted standard library functionality

std::env::args is for int argc, char *argv[]. It's Unicode-checking, if that hurts then std::env::args_os.
std::process::exit is for exit
std::env::var is for getenv
println! prints to stdout, eprintln! to stderr.

Build and dependency management

cargo seems to be a much better story than CMake hell or its alternatives.
A crate is the smallest compilation unit, either a library crate, or a binary crate. Usually means the former. A crate root is the starting source file in it. A package is a bundle of crates with at most one library crate. Standard paths inside a package: src/main.rs, src/lib.rs, src/bin.
Release profiles correspond to a mixture of CMake build configurations, NDEBUG define, etc. in C++. dev profile corresponds to Debug, and release~to ~Release (or RelWithDebInfo?).
Can customize profiles in Cargo.toml, i.e. optimization levels.
Workspaces organize related packages together in large projects, to share directory root, Cargo.toml, and Cargo.lock.

Modules

Not familiar enough with C++ modules to compare.

Modules (and submodules) inside a crate do namespaceing and public/private. src/modulename.rs, src/modulename/submodulename.rs. Modules can be private or public, declared with pub mod and mod.
super:: as a part of name path goes one level up.
use keyword imports. Idiomatically functions are imported through their parent module, everything else directly.
use ... as ... creates name alias.
pub use re-exports. Used to organize and collect public API from several potentially nested submodules.
Nested path syntax: use foo::{bar, baz, self};, globs use foo::*;

Tooling

rustfmt formats, so does clang-format.
Clippy the linter.
rust-analyzer for LSP support.

Documentation

Documentation header comments start with /// and support Markdown. Built by cargo doc [--open].
Typical API doc sections: Examples, Panics, Errors, Safety.
Contained documentation comments start with //!, typically used for crates and modules.

Testing & benchmarking

#[test] annotates a function to be a test, i.e. Google Test TEST macro in C++. Tests run in parallel by default.
assert_eq! is like gtest ASSERT_EQ, except that the args are 'left' and 'right' instead of 'expected' and 'actual' or similar. Equality asserts may be applied on types implementing PartialEq and Debug traits.
assert! may take 2nd and subsequent args for a message in the case of failure.
Tests annotated with #[should_panic] test that the annotated function panics, similar but not identical to gtest death tests. Best to add expected parameter to the attribute to specify the reason for panic.
Tests may also be implemented by returning a Result<T, E>.
Benchmark tests correspond to Google Benchmark, but unstable ATM.
Documentation tests can compile API examples automatically.
Unit tests go with the code they test, mod tests annotated with #[cfg(test)]
Visibility rules happen to allow the testing of private functions.
Integration tests go to a top-level tests directory, no configuration annotation. Each file there is a separate (test) crate–if that's not what's needed, i.e. for setup code, use foo/mod.rs naming convention for non-tests.
cargo test runs in sequence: unit, integration, doc, does not go to the next category if failure.
Binary crates cannot have integration tests directly. The usual thing to do is to always have a library crate with a binary crate as minimal as possible.

Macros

There are macros, names trailing with exclamation mark (println!).
Macros can take Rust code and expand to a different Rust code. A difference from C++ preprocessor that it works on the AST, not textually. While powerful, how well does this work with tooling? Do they run macros? Can they refactor macros?
Declarative macros (macro_rules!) pattern-match given code to produce code.
#[macro_export] annotation for public macros.
Procedural macros take token stream input and produce token stream output.
One kind is custom derive macros that add code to a struct implementation.
Attribute-like macros allow creating new attributes.
Function-like macros are close to C preprocessor function-like macros, except that they also operate on TokenStream and not on arguments directly. Can take variable number of arguments.

Unsafe Rust & FFI

unsafe { ... }: allows some, well, unsafe features
unsafe can dereference raw pointers *const T, *mut T. Raw pointers are just like C raw pointers.
unsafe fn foo() {}, then fn can be called from unsafe code.
extern "C" { fn putenv ... } for FFI, may only be called from unsafe code.
To make Rust function callable from external code, add #[no_mangle] annotation and pub extern "C" before the fn.
Static variables may be declared with static FOO_BAR: type = value; Immutable static vars have an address in memory; constants don't. All mutable static vars are unsafe.
unsafe trait Foo, unsafe impl Foo for Bar.
union types exist, mainly used for interfacing with C unions, accessing fields is unsafe.
Raw identifier syntax r#while allows using e.g. a keyword for an identifier. Useful for FFI and different Rust edition interfacing.

Wednesday, November 09, 2022

Porting UnoDB to ARM64

ARM is the most common non-Intel instruction set, and UnoDB has enough of Intel-specific code to make its port an interesting project. I started out on AWS Graviton 2 hardware (ARMv8.2+extensions instruction set) and finished on an Apple M1 (ARMv8.5+extensions). The latter became my daily development platform.

I believe any porting effort goes through similar steps: 1) make it build; 2) get it tested; 3) make it correct; 4) make it fast. Let's review them.

1) Make it build. While I had been trying to properly isolate Intel code with conditional compilation, some bits slipped through, which was to be expected while it was an Intel-only build. Some Intel code was missing preprocessor conditional compilation guards. Node16 search did not have a platform-independent fallback implementation. Both were easy to fix. Then, only three ARM-specific bits were required: cache line size constants, and the spinlock spin loop body. For the latter I went with the YIELD instruction. Optimistic lock spinlock implementation is probably the most underdeveloped feature of UnoDB anyway (it is a single PAUSE instruction on Intel), I didn't sweat it too much. The last ARM-specific bit was the platform-specific static_asserts to confirm the internal node sizes, which is entirely optional too.

2) Get it tested. I needed a free public CI/CD service. Internet said there are two options available: Travis-CI, and CircleCI. GitHub Actions, the one I was using already, supports ARM, but only if you provide your own runner VMs, so, nope. Now Travis-CI was something I used before, and then stopped, together with the rest of the OSS world. CircleCI was OK, thus I set up a simple job there at first, and added different compilers, tests, and sanitizers later.

3) Make it correct. Well, the tests passed on the first run attempt. All of them. With sanitizers. Under Valgrind. This includes the parallel tests for relaxed-atomics-heavy QSBR and Optimistic Lock Coupling ART. On ARM having a weaker memory model than Intel. I still haven't seen a crash since. Either I am lucky, or all that consistent testing over time on Intel using sanitizers, including ThreadSanitizer, pays off.

4) Make it fast. In this case this means porting the code that uses Intel vectorization intrinsics. There are libraries to write vectorized code at a slightly higher abstraction level (sse2neon, simde, and others), but I wanted to learn the actual architecture. That actual architecture has several vectorization instruction set extensions: NEON, SVE, SVE2. NEON very roughly corresponds to, say, SSE4, provides 128-bit vectors, and is the simplest one to use. Now SVE (and SVE2) is something else altogether. They provide means to write vector width-independent code, that is, the same code would run unmodified on a CPU with 128-bit vectors and on a CPU with 512-bit vectors. Naturally this comes with an overhead to query the runtime vector width and handling the data sizes not fitting evenly into vectors. This appears to be best suited for processing large amounts of data, which UnoDB internal nodes aren't. Thus I went with simpler NEON.

All the UnoDB vectorized code loop bodies follow the same pattern:

Load the next part of data
Compare it in some way against something, getting a result vector
Take the mask of the result vector and process it as needed.

That "take the mask of a vector" part is handled by PMOVMSKB/VPMOVSKB instructions (_mm_movemask_epi8 & _mm256_movemask_epi8 intrinsics), and so it happens that NEON does not have a direct replacement. I tried some emulating implementations from sse2neon and simde, getting slower than baseline results every time. Nevertheless, I managed to implement a faster Node4 search in NEON by observing that the useful part of the result vector is so small it can be copied to a general purpose register directly instead of movemask'ing it. This resulted in up to 14% higher throughput in the related microbenchmarks over the SWAR-in-a-general purpose register-optimized baseline.

At this point I had thought I was done because I couldn't overcome slow movemask fallback implementations for the rest of the code. Then, someone on Twitter (dead link–the account has been deleted since, I believe it's him) posted a new movemask replacement based on SHRN (shift right and narrow). This operation can be considered as a "halfway-movemask" which does not get down to a single bit per vector element, but it does not have to. Once we get something that fits in a GP register (or a pair of them, in the initial Node16 search implementation), we can work with that.

With this, a straightforward Node16 NEON search resulted in up to 50% higher throughput (and in up to 8% regression in the case of minimal-sized Node16, I took that trade-off). Node48 insert position search became up to 8% faster in the single-load-per-iteration implementation, and then I unrolled that loop to load four registers (8 elements per iteration). Unfortunately I misplaced the benchmark results of that, I recall it being something up to 10% faster on the top of baseline NEON.

Interestingly this code is unrolled exactly the same in all three (SSE4, AVX2, NEON) vector implementations to process four vector registers per iteration, corresponding to handling eight pointers per iteration for SSE4 & NEON, and 16 pointers for AVX2.

So, ARM64 is now a first-class platform for UnoDB, which is also convenient for me due to switching to Apple M1 as my main machine.

Friday, November 04, 2022

ART: how much faster AVX2 is than SSE4?

The original Adaptive Radix Tree paper, with vectorization as one of the main selling points, only requires SSE2 intrinsics, introduced with the first Pentium 4 in year 2000. And I developed UnoDB on an AVX-supporting Intel CPU, Sandy Bridge, 2nd Core generation, introduced in 2011. It might be considered a bit long in the tooth now, but still a bit newer than SSE2, and I have used SSE4.1 intrinsics too.

Then I upgraded the Sandy Bridge machine to Kaby Lake (7th generation, 2016), and started looking into an AVX2 port.

The vectorized ART algorithms were:

Node4 search and insert position search (both regular and while expanding N4 to N16), all data fitting into a single XMM register
Node16 search and insert position search, the original ART paper algorithm, again all data fitting into a single XMM register
Node48 insert free child slot search, which is a first nullptr search in an array of 48 pointers

For the first two, I couldn't think of any ways to improve them. The data is already in a single XMM register. Maybe the compiler uses AVX2 VPBROADCAST* to do loads, stores, and, well, broadcasts, I am not sure, it's up to it.

The last one, however, is a textbook vectorization case, even if the amount of data is relatively small. The SSE4 implementation loads four XMM registers to process eight pointers per loop iteration. The array length being 48 easily permits unrolling the loop once more, to handle 16 pointers per iteration, but that implementation was slower for me.

For AVX2 I started with the simplest implementation and then tried to improve it step by step, using synthetic microbenchmarks for the N48 insert.

The simplest implementation is to load and process a single YMM register (four pointers) per loop iteration. 2% to 5% speedup for N48 insert over the baseline which handled twice as much per iteration
Then, unroll loop once for eight pointers per iteration. 1% to 8% speedup over the previous step
Then, I experimented with prefetching the next loop iteration, but that was slower, I guess it's a trivial case for the hardware prefetcher.
Then, unroll loop twice for 16 pointers per iteration. 2% to 10% speedup over the loop unrolled once.

The three steps together amount to 2% to 40% speedups (and there's a 2% regression in a single hopefully rare case) over SSE4, and that's the benefit of AVX2 I was able to get.

If that loop were unrolled once more, it would disappear: the first "iteration" would handle 32 pointers, and then separate code would optionally handle the remaining 16. I did not go down this road.

The next project would be to add AVX-512 support, but I don't have the hardware nor I am really motivated to get my hands on it. A fun fact: did you know that "512" in AVX-512 refers to the total number of different instruction sets all calling themselves AVX-512? But then again, the base AVX-512F set should be enough for me.

(edit: replaced GitHub line links with line range links)