Laurynas Biveinis: 2022

Wednesday, December 21, 2022

The Zoo of MySQL Storage Engine Flags

MySQL has had pluggable storage engine support for decades now. Their capabilities vary wildly. Some of those storage engines are not even storage engine-like, for example BLACKHOLE does not store data at all. For OLTP, this herd thins very quickly if we only consider ACID-supporting ones: InnoDB, MyRocks, NDB, (up until recently) TokuDB, and (up until ten years ago) PBXT. I am not too familiar with OLAP, but I know there's Warp.

The core server layer has accumulated dozens of engine feature flags over those decades to accommodate those varying capabilities, let's take a look at what they can be. To make this post look less like a source code annotation dump, I'll try to sort them by feature area. If I classify something as an "implementation detail", it still could be more than that, there could be e.g. performance implications or something user-visible like forced closing of a table, but I'll be not figuring that out at this time.

Handlerton Flags

To recap, a "handlerton" is a "handler singleton", and a "handler" applies to a single opened table. So a handlerton is a singleton for all the tables in a particular storage engine, the capabilities for it as the whole, although later we'll see that some of the table flags are engine-level too. Perhaps that happened because handlerton concept was introduced later, in 5.0 series AFAIK, and it was all handler earlier.

The list in the source code starts with #define and ends with constexpr, showing its evolution as the coding standards evolve. Although it all should be somehow made into an enum class bitset, I think, with a healthy part of handler flags moved over here too.

Features

HTON_ALTER_NOT_SUPPORTED: cannot do ALTER TABLE, i.e. performance_schema.
HTON_TEMPORARY_NOT_SUPPORTED: cannot do CREATE TEMPORARY TABLE, i.e. performance_schema again.
HTON_SUPPORT_LOG_TABLES: can host mysql.general_log and slow_log tables. CSV and MyISAM!
HTON_NO_PARTITION: tables cannot be partitioned. performance_schema, Warp.
HTON_SUPPORTS_EXTENDED_KEYS: all the secondary keys include the primary key columns in the end. This is a feature for the query optimizer. InnoDB, MyRocks, TokuDB.
HTON_SUPPORTS_FOREIGN_KEYS: self-descriptive. InnoDB, NDB.
HTON_SUPPORTS_ATOMIC_DDL: DDL statements are atomic, nice. InnoDB, NDB.
HTON_SUPPORTS_PACKED_KEYS: MyISAM something something.
HTON_SUPPORTS_SECONDARY_ENGINE, HTON_SECONDARY_ENGINES_DDL: too early to tell, the secondary engine feature is under development, and even then it might not be public, i.e. something for HeatWave.
HTON_SUPPORTS_TABLE_ENCRYPTION: self-descriptive. InnoDB.
HTON_SUPPORTS_ENGINE_ATTRIBUTE: whether CREATE and ALTER can take ENGINE_ATTRIBUTE for engine-specific attributes provided by user.
HTON_SUPPORTS_GENERATED_INVISIBLE_PK: will create a hidden primary key if you don't provide one. InnoDB.

Hidden Storage Engines

HTON_HIDDEN: do not advertise as a storage engine at all, do not allow CREATE TABLE in it etc. Binary log is such storage engine!
HTON_NOT_USER_SELECTABLE: do not allow user access any tables in this engine. Binlog again. Not sure why it cannot be merged with HTON_HIDDEN.

Implementation Details

HTON_CLOSE_CURSORS_AT_COMMIT: last in-tree engine that used it was BDB in 5.0, which I am not familiar with. MariaDB docs (here goes Planet MySQL) say this was BerkeleyDB. But also TokuDB.
HTON_CAN_RECREATE: map TRUNCATE TABLE to DROP & CREATE internally. MyRocks, Warp.
HTON_FLUSH_AFTER_RENAME: introduced somewhere between 5.0 and 5.1, removed in 8.0. An implementation detail for the BDB storage engine.
HTON_NO_BINLOG_ROW_OPT: if using row-based binlog format, do not do anything about before/after row images for this engine, meaning all the columns will be always included. NDB.

Handlerton Flags for Foreign Keys

Foreign key capabilities are finer-grained. Archeologically we are above the #define layer, but below the constexpr layer. These are static const, so they are emitted in every single compilation unit including handler.h, unless the compilers manage to clean it up.

HTON_FKS_WITH_PREFIX_PARENT_KEYS: whether the foreign key can have arbitrary columns with resulting non-uniqueness etc. as long as the unique parent key columns are in its prefix. InnoDB.
HTON_FKS_WITH_SUPPORTING_HASH_KEYS: the foreign key can be a hash key. NDB.
HTON_FKS_NEED_DIFFERENT_PARENT_AND_SUPPORTING_KEY: can't have a self-referring foreign key. InnoDB.
HTON_FKS_WITH_EXTENDED_PARENT_KEYS: if the keys have primary key columns in its suffix, use that. InnoDB.

Optional Handlerton Function Pointers

We are done with handlerton flags, but we are not done with handlerton. Turns out, some of its function pointers are optional, and capabilities are inferred from them being non-null. A lot of them are transaction-related. These capabilities are:

Savepoints

savepoint_set: transaction savepoints. InnoDB, MyRocks, TokuDB.
savepoint_rollback_can_release_mdl: ability to rollback MDL locks taken after a savepoint on rollback there. InnoDB, MyRocks

XA Transactions

commit_by_xid, rollback_by_xid, set_prepared_in_tc, set_prepared_in_tc_by_xid: XA. InnoDB, MyRocks, TokuDB.
recover: can recover XA transactions on startup. InnoDB, MyRocks.
recover_prepared_in_tc: can recover prepared XA transactions on startup. InnoDB.

Other Features

prepare, set_prepared_in_tc: 2PC support. InnoDB, MyRocks.
start_consistent_snapshot: supports START TRANSACTION WITH CONSISTENT SNAPSHOT. InnoDB, MyRocks.
flush_logs: self-descriptive. InnoDB, MyRocks, TokuDB.
show_status: Likewise.
alter_tablespace, get_tablespace_type_by_name: supports tablespaces. InnoDB.
is_supported_system_table: can host at least some of the mysql schema tables. InnoDB.
get_table_statistics, get_index_column_cardinality, get_tablespace_statistics, prepare_secondary_engine, optimize_secondary_engine, compare_secondary_engine_cost: self-descriptive. InnoDB, MyRocks.
clone_interface: can clone. InnoDB. Watch this space for more exciting announcements!
lock_hton_log, unlock_hton_log, collect_hton_log_info: can participate in performance_schema.LOG_STATUS table. InnoDB, MyRocks

Implementation Details

replace_native_transaction_in_thd, ddse_dict_init, dict_...: native data dictionary support. InnoDB.
page_track: can be used with the page tracking service.
drop_database: wants to participate in DROP DATABASE implementation. NDB.
panic: can be shutdown, even if it's statically compiled in.
fill_is_schema: wants to fill in some of INFORMATION_SCHEMA tables by itself.
get_tablespace: unused.
post_ddl, post_recover, check_fk_column_compat, se_before_commit, se_after_commit, se_before_rollback: just let's say it's all implementation details.

Handler Table Flags

These flags are per-table, but before handlerton was a thing, a lot of per-engine flags were added too.

Features

HA_NO_TRANSACTIONS: this is not a real OLTP database storage engine. MyISAM.
HA_CAN_SQL_HANDLER: supports HANDLER interface. InnoDB.
HA_NO_AUTO_INCREMENT: does not support auto increment fields. performance_schema, Warp.
HA_HAS_CHECKSUM: CREATE TABLE supports CHECKSUM option. MyISAM.
HA_CAN_REPAIR: can REPAIR TABLE. MyISAM, Warp.
HA_CAN_EXPORT: supports FLUSH TABLE FOR EXPORT. InnoDB.
HA_SUPPORTS_DEFAULT_EXPRESSION: DEFAULT column clauses may be expressions. InnoDB, MyRocks, MyISAM.
HA_UPDATE_NOT_SUPPORTED: self-descriptive. ARCHIVE.
HA_DELETE_NOT_SUPPORTED: likewise.

Supported Column and Index Types

HA_NO_PREFIX_CHAR_KEYS: does not support indexing CHAR column prefixes. NDB.
HA_NO_VARCHAR: does not support VARCHAR type. No engine sets this.
HA_CAN_FULLTEXT: supports fulltext indexes. InnoDB.
HA_NO_BLOBS: does not support BLOBs.
HA_CAN_INDEX_BLOBS: can index BLOBs. InnoDB, MyRocks, TokuDB.
HA_BLOB_PARTIAL_UPDATE: BLOBs can be updated partially. InnoDB.
HA_CAN_GEOMETRY: can handle spatial data. InnoDB.
HA_CAN_RTREEKEYS: supports R-Tree indexes.
HA_SUPPORTS_GEOGRAPHIC_GEOMETRY_COLUMN: geometry supports can do not only coordinates on a flat surface, but geographic coordinates too. InnoDB.
HA_CAN_BIT_FIELD: supports BIT type. MyISAM.
HA_GENERATED_COLUMNS: supports generated columns. InnoDB, MyRocks, MyISAM.
HA_CAN_INDEX_VIRTUAL_GENERATED_COLUMN: can index them. InnoDB, MyRocks.
HA_ANY_INDEX_MAY_BE_UNIQUE: self-descriptive. MRG_MyISAM.
HA_DESCENDING_INDEX: supports descending indexes. InnoDB.
HA_MULTI_VALUED_KEY_SUPPORT: can index JSON arrays. InnoDB.

Query Optimizer Features

HA_PARTIAL_COLUMN_READ: can return a subset of row columns. InnoDB, MyRocks.
HA_TABLE_SCAN_ON_INDEX: a clustered index always exists and should be used for table scans. InnoDB, TokuDB.
HA_FAST_KEY_READ: random key order is as fast as sequential. Set for in-memory engines, but probably should be set for everything on SSD?
HA_NULL_IN_KEY: can index NULL. InnoDB, MyRocks, TokuDB.
HA_STATS_RECORDS_IS_EXACT: table record count in statistics is exact. InnoDB.
HA_PRIMARY_KEY_IN_READ_INDEX: when doing an index-only read on a secondary key, primary key columns will be returned too. InnoDB, MyRocks, TokuDB.
HA_PRIMARY_KEY_REQUIRED_FOR_POSITION: to position() a handler, primary key is needed. Otherwise, position works as a row counter. InnoDB, MyRocks, TokuDB.
HA_COUNT_ROWS_INSTANT: not sure what the difference from HA_STATS_RECORDS_IS_EXACT is. The name suggests that counting records is instant, and the comment suggests that it's exact. MyISAM.
HA_READ_BEFORE_WRITE_REMOVAL: supports blind writes, i.e. to write a row you don't have to read it first. NDB. But, this has no relation to a similar optimization in MyRocks and TokuDB.
HA_BLOCK_CONST_TABLE: disables const-table optimization. NDB.
HA_CAN_FULLTEXT_HINTS: can provide extra FULLTEXT-related info in EXPLAIN SELECT? InnoDB.

Implementation Details

HA_UNUSED3: used to be HA_REC_NOT_IN_SEQ before 8.0. Implementation detail.
HA_REQUIRES_KEY_COLUMNS_FOR_DELETE: need to read all the key columns before deleting a row by that key. Unused in MySQL, used by MariaDB MyISAM and InnoDB.
HA_DUPLICATE_POS: on duplicate key error, provide the position of the existing key.
HA_AUTO_PART_KEY: multi-part keys can include auto increment columns. TokuDB.
HA_REQUIRE_PRIMARY_KEY: a user-provided primary key must exist. No SE uses this.
HA_UNUSED14: until 5.7 it was HA_CAN_INSERT_DELAYED. MyISAM.
HA_PRIMARY_KEY_REQUIRED_FOR_DELETE: DELETE and UPDATE work by primary key. performance_schema.
HA_FILE_BASED: each table is stored in a separate file. MyISAM, TokuDB.
HA_NO_COPY_ON_ALTER: the name suggests that "ALTER TABLE does not copy data", but this is different from online ALTER. MRG_MyISAM.
HA_HAS_OWN_BINLOGGING: NDB.
HA_BINLOG_ROW_CAPABLE, HA_BINLOG_STMT_CAPABLE: what binary log formats are supported.
HA_DUPLICATE_KEY_NOT_IN_ORDER: multiple key conflicts in a multiple value-REPLACE statements are reported not in the ascending key name order. Unused in MySQL, used by CONNECT storage engine in MariaDB.
HA_CAN_FULLTEXT_EXT: supports "extended fulltext API", i.e. it is InnoDB and not MyISAM.
HA_READ_OUT_OF_SYNC: the storage engine is BLACKHOLE. I cannot make more sense out of the description: "what you read will not be what is expected to be in the table".
HA_NO_READ_LOCAL_LOCK: does not support LOCK TABLE READ LOCAL, and does not want them to be upgraded to LOCK TABLE READ by the server. InnoDB.
HA_ATTACHABLE_TRX_COMPATIBLE: the storage engine is "compatible" with attachable transactions. It is a superset of "supporting" them. InnoDB and MyISAM.
HA_NO_INDEX_ACCESS: the indexes are not really indexes, because they don't support access through them. MOCK (that's a storage engine name, not a suggestion).

Index Flags

These are capabilities for individual indexes.

HA_READ_NEXT: the index can go from one record to the next. Not used, always assumed.
HA_READ_PREV: the index can go to the previous record.
HA_READ_ORDER: the index is ordered. Usually might mean not a hash index.
HA_READ_RANGE: the index supports ranges, again not a hash index, usually.
HA_ONLY_WHOLE_INDEX: the index does not support key prefix search. Hash indexes again.
HA_TABLE_SCAN_ON_NULL: the index does not store NULLs even if the underlying column does, forcing a switch to table scan. NDB.
HA_KEYREAD_ONLY: index-only scans are supported. InnoDB, MyRocks, TokuDB.
HA_KEY_SCAN_NOT_ROR: Not sure. Will update if I figure it out. InnoDB.
HA_DO_INDEX_COND_PUSHDOWN: supports Index Condition Pushdown. InnoDB, MyRocks, TokuDB.

Handler Partitioning Flags

Storage engines can implement partitioning with different levels of capabilities.

HA_CAN_UPDATE_PARTITION_KEY: can do UPDATE involving the partitioning key. NDB.
HA_CAN_PARTITION_UNIQUE: can handle unique indexes across partitions correctly. NDB.
HA_USE_AUTO_PARTITION: not sure what "automatic partitioning" is, but this is something NDB supports.
HA_CAN_EXCHANGE_PARTITION: can ALTER TABLE EXCHANGE PARTITION with a non-partitioned table. InnoDB.
HA_CANNOT_PARTITION_FK: a foreign key cannot be the partition key. MyRocks, TokuDB.
HA_TRUNCATE_PARTITION_PRECLOSE: TRUNCATE PARTITION requires closing the involved tables. InnoDB.

Writing this post resulted in three bug reports asking to remove get_tablespace, HA_REQUIRES_KEY_COLUMNS_FOR_DELETE, and HA_DUPLICATE_KEY_NOT_IN_ORDER.

Monday, November 28, 2022

Rust for a C++ Engineer

Collected from reading https://doc.rust-lang.org/book/. The notes are organized by topic rather than by the book order.

Primitive types

Integer types are sized, except for isize and usize, whose width is architecture-specific, and corresponds to C++ std::ssize_t/std::intptr_t and std::size_t/std::uintptr_t respectively.
char != u8 != i8, oh thank god
String literals "foo" in UTF-8. They are slices (see below).
Tuples are built into language, use parentheses syntax, roughly correspond to std::tuple and std::pair. Can also do .i syntax to access the i-th element. Unit tuple ().
Fixed-size arrays are built into language, use bracket syntax, roughly correspond to std::array with mandatory bounds-checking.
Simple types implement Copy trait (i.e. scalars, correspond to value types in Java) and their assignment copies. Tuples copy if all their elements copy, and this may include arbitrary large tuples. Otherwise, assignment moves. A copy is done by an explicit .clone() operation.
String slices have type &str, can be created with &var[x..y] syntax, x and/or y may be omitted if they are zero and length respectively. Slice consists of a pointer and length.
General slices are very similar. They have &[element_type] type.
! is an empty type with no values. Used as return type for functions that never return. It can be coerced into any other type, which is used for i.e. match arms which e.g. continue or panic!.

Variables & type inference

Variables (let) are immutable by default, but their names can be shadowed, allowing sequences of changes in what a given variable name means. The type can be changed too. Mutable variables can be declared with let mut.
Even immutable variables may be declared without initialization.
Variable types are inferred, but can be annotated when/if needed.
Constants are declared with const, and require type annotations.
Function parameter and return types always must be provided.

Lifetimes, references & borrow checker

All variables are owned by their enclosing scope. The ownership may be passed around, when the last owner scope exits, the variable is destroyed (drop is executed if the Drop trait is implemented, memory released).
Passing variable into a function moves it (or copies), transferring its ownership into the function. Likewise returning it moves it (or copies), transferring the return value ownership to the caller.
Creating references is called borrowing. Immutable references do not allow modifying the pointed-to variable and are borrowed with &, can be passed around without transferring ownership. Several of them may be active at the same time. Mutable references are borrowed with &mut, and no other mutable or immutable references may exist at the same time.
Each reference has a lifetime. Most cases are handled by implicit lifetimes. Lifetimes are named by 'a, usually very short names, placed after &.
For function signatures, generic lifetime parameters use angle brackets:

fn foo<'a>(x: &'a str, y: &'a str) -> &'a str

Lifetime annotations in struct definitions limit struct lifetime to that of its fields.
If there are multiple input lifetime parameters, but one of them is &self or &mut self, its lifetime is assigned to all output lifetime parameters.
Deref coercion converts a reference to a Deref-implementing type to a reference to a different type. I.e. &String to &str. Happens automatically on parameter-argument type mismatch: from &T to &U when T: Deref<Target=U>. Internally as many .deref() calls are inserted as needed.
For mutable references, implement DerefMut trait. Two extra deref coercion rules: from &mut T to &mut U when T: DerefMut<Target=U> and from &mut T to &U when T: Deref<Target=U>.
The Drop trait is the closest thing to a C++ destructor, adds a drop method that takes a mutable reference to self.
For structs, field lifetimes are part of the struct type, and should be specified where the struct type is specified.
'static lifetime is the lifetime of the whole program.
Lifetimes are a type of generics, so a function with both lifetimes and generic type parameters lists both together in angle brackets.

On paper Rust lifetimes appear to be a genius idea. Besides manual resource management and GC, this is a viable third option that combines the advantages of the two while avoiding their disadvantages, althought only partially so.

Google Chrome developers C++ tried this and did not succeed enough for it to be viable: Borrowing Trouble: The Difficulties of a C++ Borrow-Checker.

Statements & expressions

Last nonterminal symbol in a block may be expression (lacking the final semicolon), in which case the whole block is an expression with this return value. This makes return expression; in functions replaceable with expression. This also merges if statement and ternary operator to a single if expression.
loop starts an infinite loop. break may return an expression, making the loop an expression. Nested loops may have labels 'label: loop {, then possible to do break 'label;.
for is a range loop. When possible, for loops seem to be more idiomatic than while loops.

Type aliases, structs & enums

Rust type alias type Foo = ExistingType is like C++ using Foo = ExistingType. Can be generic: type Foo<T> = std::result::Result<T, std::io::Error>.
Structures use struct keyword, contain only type-annotated fields.
If during a struct variable construction a field and a var it is initialized from have the same name, one of the can be omitted (field init shorthand).
..var in the struct variable construction takes all the unspecified fields from var of the same struct type.
Tuple structs struct foo(i64, i64, i64) are structs that are very similar to tuples. Fields are unnamed.
Unit structs struct Foo;
If structs need to store references, then lifetimes have to be used, that's for later.
Attribute #[derive(Debug)] for struct allows doing {:?} in println! to dump the fields.
dbg!(value) macro maybe inserted as an expression to dump value
struct may have methods attached to them, in separate impl StructName blocks.
The first arg of a method may be one of
- &self, corresponds to a C++ const method;
- &mut self, corresponds to a regular C++ method;
- self, consumes the object;
A function in an impl StructName not taking a self is an associated function, not a method, corresponding a C++ static method. Called through StructName:: syntax.
There may be muliple impl blocks.
The simplest Rust enum roughly matches C++ enum class. But then each enum variant may have different associated data with it, making it similar to std::variant
Standard library Option enum handles the use cases for nullptr (a null reference does not exist in Rust). Similar to C++ std::optional.

Closures & function pointers

Closures: || with parameters inside followed by an (optionally bracketed) expression. Parameter and return types are not annotated usually.
Once closure types are inferred, they don't change, cannot call the same closure with different ones.
Variables are captured by different types of borrowing / ownership taking implicitly depending on what the code does. move keyword before || forces taking ownership, when the body does not need it implicitly. One use case is passing data to a new thread.
All closures implement FnOnce trait, meaning they can be called once.
Closures that mutate captured values but don't move them out implement FnOnce and FnMut.
Closures that don't mutate captured values and don't move them out implement FnOnce, FnMut, and Fn.
All functions coerce to the fn type, which is the function pointer type. It implements all of FnOnce, FnMut, Fn.
To return a closure, use a trait object, e.g. -> Box<dyn Fn ... >

Generics & traits

Generics (types) and traits (behavior) resemble C++ templates. Traits also resemble interfaces in other languages.
Generic type uses must be constrained by traits–no SFINAE. C++ concepts.
impl Foo<f32> {...} adds implementation for a specific type, similar to C++ template specialization.
Separate traits are implemented for structs in separate blocks: impl Trait for Struct { ... }
Traits need to be brought into scope too, pulling in an implementing type is not sufficient to call trait methods.
Can implement a local trait on an external type or an external trait on an local type, but not an external trait on an external type (so no C++ std::hash specialization for a std:: type). This is to avoid allowing multiple trait implementations for the same type.
Trait methods may have default implementations, which may call other, possibly unimplemented methods in the same trait. The default implementation may not be called from an overriding implementation.
Trait-type parameters without generics syntax: fn foo(bar: &impl Trait).
Trait bounds, using generics syntax: fn foo<T: Trait>(bar: &T). Same as above.
Multiple trait bounds: fn foo(bar: &(impl Trait1 + Trait2)) and fn foo<T: Trait1 + Trait2>(bar: &T)
In the case trait bounds become long, where clauses pull them aside:

fn foo<T, U>(t: &T, u: &U) -> Result
where
    T: Trait1 + Trait2,
    U: Trait1 + Trait3,
{
    ...
}

Can use fn ... -> impl Trait to return a trait-implementing type, as long as it's a single type.
Can conditionally implement methods for generic structs by adding trait bounds to their implementation: impl<T: Trait> Type<T> { ... }. These are called blanket implementations.
Rust does not have OOP inheritance. Some form is available through default trait method implementations. Dynamic dispatch (C++ virtual methods) is through trait objects.
A trait object is pointer to an instance of a type and a pointer to a vtable.
Struct and enum vars in Rust are not objects, trait objects come close, but they cannot contain data.
Must be a reference (or a smart pointer) to dyn trait type, i.e. Box<dyn Trait>.
Associated types. type Name allows to use Name as a type in a trait before its declaration is given by the trait implementors. In C++ one would use template argument dependent typenames.
Default generic type parameters: <T=DefaultType>.
foo.bar() can be replaced by Type::bar(&foo) when bar is implemented by more than trait to disambiguate. If even more disambiguation is needed, i.e. for associated methods without self parameter, <Type as Trait>::bar calls the method from Trait as implemented for Type.
If a trait depends on another trait, the latter is called a supertrait: trait Foo: SuperTraitBar.
Newtype pattern. One use case: implement external traits on external types, declare a new thin wrapper tuple struct. There are other use cases.

Error handling

panic! macro exits (or aborts, depending on config) on unrecoverable error.
Errors are handled using Result enum, which can be Ok or Err.
unwrap returns the success variant of Result or panics.
unwrap_or_else executes given code instead of panicking.
expect is like unwrap with a given error message for panicking.
? operator after a call, e.g. let foo = bar()?; unwraps returned Result, or returns from the caller with the error. If the error types do not match, From trait converts.
? operator works with Option return types too.

Iterators

Rust iterators correspond to C++ ranges (or iterator pairs).
Calling .iter() on a collection roughly corresponds to a C++ .cbegin(), except that the latter is not a range. Other options are .into_iter() to take ownership of values–not sure what a direct C++ mapping would be–and .iter_mut() over mutable references (C++ .begin()).
Iterators implement the Iterator trait.
Iterator .collect() method returns a new collection from iterating.
Code using iterator adapters might be faster than equivalent loop-based code. An example of Rust zero-cost abstractions, which of course is found in C++ as well.

Pattern matching & related control flow

Pattern matching can decompose a tuple to local vars, and do many other things.
match is a generalized switch with pattern matching, variable binding, and more.
_ is a catch-all non-binding pattern, like the default in switch. Ignores the entire value.
if let behaves like a single match arm, combining if with variable binding in the case of true condition.
while let loop repeats until its pattern matches.
The value after for keyword in a for loop is a pattern.
let keyword takes a pattern, not a variable id.
Function parameters are patterns.
Patterns are refutable and irrefutable, the latter ones matching any possible passed value. Function parameters, let, and for take irrefutable patterns. if let and while let take both kinds, with a compiler warning if irrefutable (as that creates always-true if or an infinite loop while). match arms must be refutable except for the last one, which should be irrefutable (if the possibilities were not exhausted until then).
Multiple patterns can be combined with |.
An inclusive range of values can be matched with ..=. The range cannot be empty.
Struct destructuring: Foo { x: a, y: b } = v gets a and b. If field and var names match, then Foo { x, y } = v. Literals can be used too.
Enum destructuring Foo::Variant { x, y }, Foo::VariantWithNoData.
Can destructure arbitrarily deep nested structs and enums.
Nested _ ignores just that part.
Starting a variable name with an _ suppressed unused variable warnings for it.
.. is a greedy sequence of _, i.e.

let numbers = (1, 5, 7, 20, 30);
match numbers {
    (first, .., last) => ...
}

match arms may have match guards which are extra if conditions that can use the bound vars. Some(x) if x > 5. Exhaustiveness is not checked.
@ bindings allow to create a var holding the tested value at the match time, i.e. Message::Hello { id: id_var @ 3..=7 }.

Standard library types

Dynamic strings: String, would correspond to std::string type, but UTF-8. Display trait adds to_string method. Not indexable to avoid byte/UTF-8 encoding mixup. Slicing is allowed but runtime-checked to fall on char boundary. To disambiguate byte/char interpretation, use .chars() or .bytes().
Standard library vectors match std::vector. A macro vec![1, 2, 3] to create a vector with given contents. Ownership/borrowing rules apply to whole vector, i.e. if a mutable reference to the first element is taken, a new one cannot be pushed to the back.
Standard library hash maps correspond to std::unordered_map.
std::thread::spawn(closure) -> std::thread::thread(callable).

Smart pointers & dynamically-sized (unsized) types

Smart pointers own data. They implement Deref and Drop traits. String and Vec<T> are smart pointers.
Box<T> is like std::unique_ptr<T> in C++, except that Rust is more likely to use plain references and lifetimes, so no 1:1 mapping in i.e. rewrite. Box::new is std::make_unique.
Implementing Deref trait enables dereferencing with the * operator, like overloading C++ * and -> operators does.
Under the hood *x is transformed to *(x.deref()) exactly once.
std::mem::drop corresponds to C++ std::unique_ptr::reset or other early destruction.
Rc<T> matches std::shared_ptr. Rc::clone method matches std::shared_ptr copy constructor.
Interior mutability pattern: unsafe code to mutate data inside an immutable value even with immutable references present.
RefCell<T> does borrow checking at runtime instead of compile time. borrow and borrow_mut methods.
Rc<RefCell<T>> pattern implements multiple owners to potentially-mutable data.
Weak<T> matches std::weak_ptr. Constructed by Rc::downgrade. Upgraded by upgrade method, corresponding to std::weak_ptr::lock.
Dynamically sized types (DST) or unsized types, whose sizes are only known at the runtime. Cannot create variables of such types directly, naturally always hidden in some pointer + size structure. Rust automatically implements Sized trait for every non-DST, and implicitly bounds by it for every generic function. To relax the latter, fn foo<T: ?Sized>.... The ?Trait syntax is only available for Sized trait.

Operator overloading

Operator overloading by implementing the desired traits in std::ops.

Concurrency

thread::spawn returns a JoinHandle, which has a join method, similar to C++ std::thread::join.
Message passing for inter-thread communication, like in Go. Channels, std::sync::mpsc::channel(). The endpoints have send, recv, try_recv methods. The receiver implements Iterator too. The channels may have multiple transmitters (it's MPSC), which can be created by .clone.
Messages must implement Send trait. If a type is composed of Send types only, it becomes Send automatically.
Types whose variables are safe to be referenced from multiple threads implement Sync trait. If &T is Send, then T is Sync. A type made of Sync types only is Sync automatically.
Mutex<T> is a mutex-guarded variable of T. .lock() returns a LockResult, which has a (potentially mutable) reference MutexGuard to the guarded data. The guard unlocks when it goes out of scope. Mutex implements interior mutability.
Arc<T> corresponds to C++ std::atomic<std::shared_ptr>.
To actually share mutexes between threads, wrap them: Arc<Mutex<T>>.

Assorted standard library functionality

std::env::args is for int argc, char *argv[]. It's Unicode-checking, if that hurts then std::env::args_os.
std::process::exit is for exit
std::env::var is for getenv
println! prints to stdout, eprintln! to stderr.

Build and dependency management

cargo seems to be a much better story than CMake hell or its alternatives.
A crate is the smallest compilation unit, either a library crate, or a binary crate. Usually means the former. A crate root is the starting source file in it. A package is a bundle of crates with at most one library crate. Standard paths inside a package: src/main.rs, src/lib.rs, src/bin.
Release profiles correspond to a mixture of CMake build configurations, NDEBUG define, etc. in C++. dev profile corresponds to Debug, and release~to ~Release (or RelWithDebInfo?).
Can customize profiles in Cargo.toml, i.e. optimization levels.
Workspaces organize related packages together in large projects, to share directory root, Cargo.toml, and Cargo.lock.

Modules

Not familiar enough with C++ modules to compare.

Modules (and submodules) inside a crate do namespaceing and public/private. src/modulename.rs, src/modulename/submodulename.rs. Modules can be private or public, declared with pub mod and mod.
super:: as a part of name path goes one level up.
use keyword imports. Idiomatically functions are imported through their parent module, everything else directly.
use ... as ... creates name alias.
pub use re-exports. Used to organize and collect public API from several potentially nested submodules.
Nested path syntax: use foo::{bar, baz, self};, globs use foo::*;

Tooling

rustfmt formats, so does clang-format.
Clippy the linter.
rust-analyzer for LSP support.

Documentation

Documentation header comments start with /// and support Markdown. Built by cargo doc [--open].
Typical API doc sections: Examples, Panics, Errors, Safety.
Contained documentation comments start with //!, typically used for crates and modules.

Testing & benchmarking

#[test] annotates a function to be a test, i.e. Google Test TEST macro in C++. Tests run in parallel by default.
assert_eq! is like gtest ASSERT_EQ, except that the args are 'left' and 'right' instead of 'expected' and 'actual' or similar. Equality asserts may be applied on types implementing PartialEq and Debug traits.
assert! may take 2nd and subsequent args for a message in the case of failure.
Tests annotated with #[should_panic] test that the annotated function panics, similar but not identical to gtest death tests. Best to add expected parameter to the attribute to specify the reason for panic.
Tests may also be implemented by returning a Result<T, E>.
Benchmark tests correspond to Google Benchmark, but unstable ATM.
Documentation tests can compile API examples automatically.
Unit tests go with the code they test, mod tests annotated with #[cfg(test)]
Visibility rules happen to allow the testing of private functions.
Integration tests go to a top-level tests directory, no configuration annotation. Each file there is a separate (test) crate–if that's not what's needed, i.e. for setup code, use foo/mod.rs naming convention for non-tests.
cargo test runs in sequence: unit, integration, doc, does not go to the next category if failure.
Binary crates cannot have integration tests directly. The usual thing to do is to always have a library crate with a binary crate as minimal as possible.

Macros

There are macros, names trailing with exclamation mark (println!).
Macros can take Rust code and expand to a different Rust code. A difference from C++ preprocessor that it works on the AST, not textually. While powerful, how well does this work with tooling? Do they run macros? Can they refactor macros?
Declarative macros (macro_rules!) pattern-match given code to produce code.
#[macro_export] annotation for public macros.
Procedural macros take token stream input and produce token stream output.
One kind is custom derive macros that add code to a struct implementation.
Attribute-like macros allow creating new attributes.
Function-like macros are close to C preprocessor function-like macros, except that they also operate on TokenStream and not on arguments directly. Can take variable number of arguments.

Unsafe Rust & FFI

unsafe { ... }: allows some, well, unsafe features
unsafe can dereference raw pointers *const T, *mut T. Raw pointers are just like C raw pointers.
unsafe fn foo() {}, then fn can be called from unsafe code.
extern "C" { fn putenv ... } for FFI, may only be called from unsafe code.
To make Rust function callable from external code, add #[no_mangle] annotation and pub extern "C" before the fn.
Static variables may be declared with static FOO_BAR: type = value; Immutable static vars have an address in memory; constants don't. All mutable static vars are unsafe.
unsafe trait Foo, unsafe impl Foo for Bar.
union types exist, mainly used for interfacing with C unions, accessing fields is unsafe.
Raw identifier syntax r#while allows using e.g. a keyword for an identifier. Useful for FFI and different Rust edition interfacing.

Wednesday, November 09, 2022

Porting UnoDB to ARM64

ARM is the most common non-Intel instruction set, and UnoDB has enough of Intel-specific code to make its port an interesting project. I started out on AWS Graviton 2 hardware (ARMv8.2+extensions instruction set) and finished on an Apple M1 (ARMv8.5+extensions). The latter became my daily development platform.

I believe any porting effort goes through similar steps: 1) make it build; 2) get it tested; 3) make it correct; 4) make it fast. Let's review them.

1) Make it build. While I had been trying to properly isolate Intel code with conditional compilation, some bits slipped through, which was to be expected while it was an Intel-only build. Some Intel code was missing preprocessor conditional compilation guards. Node16 search did not have a platform-independent fallback implementation. Both were easy to fix. Then, only three ARM-specific bits were required: cache line size constants, and the spinlock spin loop body. For the latter I went with the YIELD instruction. Optimistic lock spinlock implementation is probably the most underdeveloped feature of UnoDB anyway (it is a single PAUSE instruction on Intel), I didn't sweat it too much. The last ARM-specific bit was the platform-specific static_asserts to confirm the internal node sizes, which is entirely optional too.

2) Get it tested. I needed a free public CI/CD service. Internet said there are two options available: Travis-CI, and CircleCI. GitHub Actions, the one I was using already, supports ARM, but only if you provide your own runner VMs, so, nope. Now Travis-CI was something I used before, and then stopped, together with the rest of the OSS world. CircleCI was OK, thus I set up a simple job there at first, and added different compilers, tests, and sanitizers later.

3) Make it correct. Well, the tests passed on the first run attempt. All of them. With sanitizers. Under Valgrind. This includes the parallel tests for relaxed-atomics-heavy QSBR and Optimistic Lock Coupling ART. On ARM having a weaker memory model than Intel. I still haven't seen a crash since. Either I am lucky, or all that consistent testing over time on Intel using sanitizers, including ThreadSanitizer, pays off.

4) Make it fast. In this case this means porting the code that uses Intel vectorization intrinsics. There are libraries to write vectorized code at a slightly higher abstraction level (sse2neon, simde, and others), but I wanted to learn the actual architecture. That actual architecture has several vectorization instruction set extensions: NEON, SVE, SVE2. NEON very roughly corresponds to, say, SSE4, provides 128-bit vectors, and is the simplest one to use. Now SVE (and SVE2) is something else altogether. They provide means to write vector width-independent code, that is, the same code would run unmodified on a CPU with 128-bit vectors and on a CPU with 512-bit vectors. Naturally this comes with an overhead to query the runtime vector width and handling the data sizes not fitting evenly into vectors. This appears to be best suited for processing large amounts of data, which UnoDB internal nodes aren't. Thus I went with simpler NEON.

All the UnoDB vectorized code loop bodies follow the same pattern:

Load the next part of data
Compare it in some way against something, getting a result vector
Take the mask of the result vector and process it as needed.

That "take the mask of a vector" part is handled by PMOVMSKB/VPMOVSKB instructions (_mm_movemask_epi8 & _mm256_movemask_epi8 intrinsics), and so it happens that NEON does not have a direct replacement. I tried some emulating implementations from sse2neon and simde, getting slower than baseline results every time. Nevertheless, I managed to implement a faster Node4 search in NEON by observing that the useful part of the result vector is so small it can be copied to a general purpose register directly instead of movemask'ing it. This resulted in up to 14% higher throughput in the related microbenchmarks over the SWAR-in-a-general purpose register-optimized baseline.

At this point I had thought I was done because I couldn't overcome slow movemask fallback implementations for the rest of the code. Then, someone on Twitter (dead link–the account has been deleted since, I believe it's him) posted a new movemask replacement based on SHRN (shift right and narrow). This operation can be considered as a "halfway-movemask" which does not get down to a single bit per vector element, but it does not have to. Once we get something that fits in a GP register (or a pair of them, in the initial Node16 search implementation), we can work with that.

With this, a straightforward Node16 NEON search resulted in up to 50% higher throughput (and in up to 8% regression in the case of minimal-sized Node16, I took that trade-off). Node48 insert position search became up to 8% faster in the single-load-per-iteration implementation, and then I unrolled that loop to load four registers (8 elements per iteration). Unfortunately I misplaced the benchmark results of that, I recall it being something up to 10% faster on the top of baseline NEON.

Interestingly this code is unrolled exactly the same in all three (SSE4, AVX2, NEON) vector implementations to process four vector registers per iteration, corresponding to handling eight pointers per iteration for SSE4 & NEON, and 16 pointers for AVX2.

So, ARM64 is now a first-class platform for UnoDB, which is also convenient for me due to switching to Apple M1 as my main machine.

Friday, November 04, 2022

ART: how much faster AVX2 is than SSE4?

The original Adaptive Radix Tree paper, with vectorization as one of the main selling points, only requires SSE2 intrinsics, introduced with the first Pentium 4 in year 2000. And I developed UnoDB on an AVX-supporting Intel CPU, Sandy Bridge, 2nd Core generation, introduced in 2011. It might be considered a bit long in the tooth now, but still a bit newer than SSE2, and I have used SSE4.1 intrinsics too.

Then I upgraded the Sandy Bridge machine to Kaby Lake (7th generation, 2016), and started looking into an AVX2 port.

The vectorized ART algorithms were:

Node4 search and insert position search (both regular and while expanding N4 to N16), all data fitting into a single XMM register
Node16 search and insert position search, the original ART paper algorithm, again all data fitting into a single XMM register
Node48 insert free child slot search, which is a first nullptr search in an array of 48 pointers

For the first two, I couldn't think of any ways to improve them. The data is already in a single XMM register. Maybe the compiler uses AVX2 VPBROADCAST* to do loads, stores, and, well, broadcasts, I am not sure, it's up to it.

The last one, however, is a textbook vectorization case, even if the amount of data is relatively small. The SSE4 implementation loads four XMM registers to process eight pointers per loop iteration. The array length being 48 easily permits unrolling the loop once more, to handle 16 pointers per iteration, but that implementation was slower for me.

For AVX2 I started with the simplest implementation and then tried to improve it step by step, using synthetic microbenchmarks for the N48 insert.

The simplest implementation is to load and process a single YMM register (four pointers) per loop iteration. 2% to 5% speedup for N48 insert over the baseline which handled twice as much per iteration
Then, unroll loop once for eight pointers per iteration. 1% to 8% speedup over the previous step
Then, I experimented with prefetching the next loop iteration, but that was slower, I guess it's a trivial case for the hardware prefetcher.
Then, unroll loop twice for 16 pointers per iteration. 2% to 10% speedup over the loop unrolled once.

The three steps together amount to 2% to 40% speedups (and there's a 2% regression in a single hopefully rare case) over SSE4, and that's the benefit of AVX2 I was able to get.

If that loop were unrolled once more, it would disappear: the first "iteration" would handle 32 pointers, and then separate code would optionally handle the remaining 16. I did not go down this road.

The next project would be to add AVX-512 support, but I don't have the hardware nor I am really motivated to get my hands on it. A fun fact: did you know that "512" in AVX-512 refers to the total number of different instruction sets all calling themselves AVX-512? But then again, the base AVX-512F set should be enough for me.

(edit: replaced GitHub line links with line range links)

Tuesday, October 25, 2022

UnoDB and exception safety

Nobody uses exceptions in C++, which makes C++ an exceptional (sorry) language. Everybody uses exceptions in Java, for example. I cannot think of any other language whose practitioners ignore the main prescribed error handling method so thoroughly, to the point of standardization work now adding alternate (not replacement! merely alternate) error handling mechanisms, which then don't cover all use cases, resulting in a fine mess IMHO.

I find the main accepted C++ way of handling errors by checking every return value at every call site for every possible error incredibly verbose, ugly, error prone, and I actually like C++ exceptions. Maybe that's because I did not have to use them for embedded development, did not care about error messages having full stacktraces, nesting them, their performance, and dealing with catching "..." in a useful way. In other words, maybe I find them to be OK because I never actually had to use them in production.

Since I am developing UnoDB as a library which could be used in an exception-using or exception ignoring codebase, I have to make it exception-safe. Now ever since reading Effective C++ series, I got the vague idea of being exception-safe as

noexcept all the things that can be noexcept'ed
RAII all the things
for any non-trivial state, build it in a local variable and std::swap into place
never throw from a destructor (I know it's technically not true, you just have to compare std::uncaught_exceptions values in the constructor and the destructor, if equal, fail away!)
???
Profit!

See, the levels of exception safety guarantees (nothrow, basic, strong) do not even really enter the picture. Just try to do things in a safe way, and the result will be safe-ish.

And then this "safe-ish" code will break in subtle or not-subtle ways on the first actual exception thrown. I was ignoring this until Justinas casually mentioned while discussing something else: "to test exceptions, run a test in a loop with injecting std::bad_alloc on the 1st, 2nd, ... allocation, until the test succeeds." I noted this and some time later set off to work:

Created an allocation_failure_injector class that maintains an allocation counter and throws std::bad_alloc on reaching a preset value, call it from the global operator new
Easy tests first: for non-memory allocating operations, wrap their tests in a must_not_allocate helper method that sets the failure injector to fail on the very first allocation, observe that it never happens
Have a helper to do Justinas' loop
In the state-changing potentially-failing test harness operations (such as ART test insert), catch any exception and assert that nothing in the observable state changed (i.e. node counter values). Look, we are going for the strong exception safety guarantee here!
Make fuzzer tests inject and handle OOM too.

Both ART and QSBR were treated this way, and it did not take long for the "safe-ish" code to show serious bugs:

QSBR would leak memory, bump counters prematurely, and generally end up in inconsistent states (one, two, three), including a fuzzer-found bug that was non-deterministic due to unpredictable C++ standard library allocation count
In the case of growing/shrinking an ART tree node, if the allocation of the new node failed, the old existing node would still get reclaimed. Yikes.

As a result, the code is no longer exception "safe-ish", but actually safe to some actual level with respect to std::bad_alloc. There are other exceptions that might be thrown, but I couldn't think of any reasonable not-immediately-fatal way to handle std::mutex::lock throwing std::system_error.

As I was writing above I noticed that half of the commit messages say "crash safety" instead of "exception safety." At least I haven't found "exception isolation levels" or the rest of ACID in my commit messages yet.

Wednesday, February 09, 2022

UnoDB ported to MSVC

I have ported UnoDB to Microsoft Visual Studio, the third major C++ compiler. The other two, GCC and clang, are reasonably close to each other, while MSVC likes to do its own thing, not only due to different runtime platform, but also because of a different C++ standard interpretation in the compiler and library. MSVC 6.0 (the Internet Explorer 6.0 of C++ compilers, don't try to use the C++98 standard library there) was the first compiler I used in my professional C++ work two decades ago. Luckily 6.0 is not the current version, and Microsoft tried to clean up its act with respect to standard C++ since (not fully – why std::exception::what() is not noexcept?) So, the port forced to write more, uhm, portable code. Most GCC and clang-specific compiler extensions and builtins mapped straightforwardly to MSVC ones, and a few that didn't could be #ifdef'ed away trivially. There was only one incompatible API call – posix_memalign – which mapped to _aligned_malloc trivially too. Then the port pointed out actual bugs in my assumptions, mostly in the standard library use:

Do not assume that std::array::iterator is a raw pointer - it can be anything that satisfies C++ iterator requirements.
Do not instantiate std::uniform_int_distribution with std::uint8_t - the distribution only supports integral types while uint8_t is of course a character type!
Do not assume that different compilers will lay out the same structs the same in memory, even if architecture and field alignment requirements do not differ.

Besides the actual source code, I was most worried about CMake script compatibility. Turns out, MSVC and CMake work quite nicely together if one uses CMake presets, and porting the build scripts was relatively easy.

At this point the basic port was completed with tests and benchmarks passing. Even though my main development platform is macOS and test/benchmark one is Linux, I wanted to do a first-class port that matches the existing platforms as much as possible and makes the most of MSVC features. For that, I did more work:

CI. Github Actions has good Windows runner support. It was a bit alien to learn and write PowerShell script bits and not to revert to UNIX'isms, otherwise it works and runs great.
Clean compiler diagnostics. I have enabled /W4 warning level and fixed a few unremarkable issues. I did have to add some CMake kludge to remove the default /W3. I am using CMake 3.12 and this issue has been fixed in 3.15.
AddressSanitizer. It's very nice that MSVC has it, and it mostly works as expected. There are two open MSVC bugs though, so I am unable to fully test it in CI yet.
clang/LLVM in MSVC. Now that's interesting – a clang compiler integrated with the rest of MSVC toolchain. For code, that required adding #ifdefs in the style of defined(_MSC_VER) && !defined (__clang__), which takes some time to get used to. For build system, things got even more interesting. There is a clang-cl.exe compiler driver that translates MSVC cl.exe command line options to clang ones internally. Which is great, except that there are a few cl.exe flags that do not have translations and cause errors, and that there are a few clang-style flags that have to passed together with cl.exe-style flags too. And then this whole one-toolchain-two-compilers business took CMake by surprise – I had to use incorrect-if-ever-ported-to-native-LLVM-on-windows CMake workarounds. CMake 3.14 introduces CMAKE_CXX_COMPILER_FRONTEND to address this. Anyway, this is also tested in CI.
clang/LLVM in MSVC with AddressSanitizer. Yep, that's also a thing, but here I hit the wall of incompatible runtime libs and stopped.

But by far the biggest payoff in doing the port were the MSVC static analyzer fixes. The analyzer was a honorable mention in my static analyzer post and now that I can run it on my code, the results exceeded my expectations. It started innocently enough, a few trivial changes (and some more), a few new asserts added to help out data flow analysis. Similar to asserts, I added some assumptions – which are facts provided to compiler about variable values in release build too. (One of my previous experimental branches did the same - lock-free 128-bit atomic stats used assumptions to limit double values to 2^63 for integer conversion to avoid branching to the highest-bit-set case handling). Some code was de-duplicated, dead code was removed. Then it caused me to discover a typo, which invalidated many months of runs of Node48 and Node256 random get benchmarks. Luckily there were no code changes dependent on the affected benchmarks.

The static analyzer also checks for C++ Core Guidelines violations, and there is a slightly sad story there, of the "why we can't have nice things in C++" kind. To suppress violation diagnostics, MSVC provides a well-meaning gsl::suppress(x.1) attribute. To make suppressions portable, LLVM well-meaningly ported it but made the argument a string literal – gsl::suppress("x.1") – because that's what portability means! To make both compilers work, the Core Guidelines introduces a well-meaning GSL_SUPPRESS(x.1) macro. If it is present in the code, clang-format will insert a space after dot – GSL_SUPPRESS(x. 1) – breaking the compilation. There is no .clang-format option to stop this behavior, and so all the actual uses of this macro in the actual portable codebases look like this:

// clang-format off
GSL_SUPPRESS(f.4) // NO-FORMAT: attribute
// clang-format on

Where NO-FORMAT: attribute seems to be a thing to handle MSVC own formatter (?). Luckily, I could replace GSL suppressions with direct MSVC compiler suppressions, and that's a minor inconvenience in the big picture.

All in all, I learned a lot by doing an MSVC port, working with Windows was a bit of a change of regular programming environment, and improved the code more than expected – not too bad.