Monday, September 25, 2023

Implementing durability in a MySQL storage engine

update 2023-09-28: edited for non-durable SE commits under group commit, and fixed the trx->flush_log_later discussion.

update 2023-09-27: Binlog group commits asks the storage engines to commit non-durably, will edit the post even more.

update 2023-09-26: trx->flush_log_later is actually used. Will edit the post.

Let's review how a MySQL storage engine should implement transaction durability by flushing / syncing WAL writes to disk. For performance reasons (group 2PC), let's also review when it specifically should not sync writes to disk. The reference durability implementation is, of course, InnoDB.

The main storage engine entry point is handlerton::commit. Since in general the storage engines participate in two-phase commit protocol with the binary log, there is also handlerton::prepare, and handlerton::flush_logs participates too. Let's ignore rollbacks, savepoints, explicit XA transactions, read only transactions, transactions on temporary tables only, crash recovery, and transaction coordinators other than the binlog.

Background: Group Commit

It was implemented (WL#5223) in its current form in MySQL 5.6, and its internals are described in this Mats Kindahl's blog post. I will not repeat everything here (and I'm sure I'd miss a lot of details), but for durability discussion, from the storage engine side, the group commit looks as follows:

  • prepare(t1) with reduced durability;
  • prepare(t2) with reduced durability;
  • prepare(tn) with reduced durability;
  • flush_logs(), making all the prepares above durable;
  • commit(t1) with reduced durability;
  • commit(t2) with reduced durability;
  • commit(tn) with reduced durability.

A surprise here is that the commits are performed with reduced durability too. How do reduced-durability commits implement full durability for the committed transactions, then? Turns out, the design of binlog group commit is only the commit of binlog itself is durable, and for the storage engines, prepares are made durable in batches and that's it. If their commits are lost, binlog crash recovery will roll forward the prepared transactions.

This design is counterintuitive if one thinks that innodb-flush-log-at-trx-commit=1, as documented, makes InnoDB commits durable in this setup, which it does not, and it is possible to see binlog crash recovery in action. Davi Arnaut reported this as bug #75519 in 2015, and IMHO few users are aware of this behavior.

Anyway, back to the implementation. Apparently the server developers did not want to change the prepare/commit handlerton interface, so the server durability request (full or reduced) is not passed in as an argument, but must be queried by thd_get_durability_property returning an enum with two possible values HA_REGULAR_DURABILITY and HA_IGNORE_DURABILITY.

Later, in 8.0, this durability property was reused to implement correct & performant commit order on multithreaded replicas, when binlog is disabled (WL#7846).

InnoDB: handlerton::commit

Implemented by innobase_commit.

Comes last in the group commit, but let's review it first. In other setups it might be the only entry point.

Wherever I say "write [to the disk] and sync|flush", the mental model is that of a buffered write with a separate flush/sync afterwards. If O_SYNC or O_DSYNC is used to write the log instead, then the write and the sync are a single operation.

Let's ignore non-default innobase_commit_concurrency setups.

First the code sets trx->flush_log_later and then goes through the call stack innobase_commit -> innobase_commit_low -> trx_commit_for_mysql -> trx_commit -> trx_commit_low. The last one calls trx_write_serialisation_history, which makes the necessary commit writes to a mini-transaction, then trx_commit_low commits the mini-transaction by creating the redo log records. Nothing is done for durability yet at this point. Finally trx_commit_low calls trx_commit_in_memory, which sees that trx->flush_log_later is set and sets trx->must_flush_log_later. (if trx_commit is called from other API than SE commit, then flush_log_later will not be set and the durability will be ensured in this function).

At this point the callstack returns all the way back to innobase_commit, which calls trx_complete_for_mysql, which now checks trx->must_flush_log_later (set), durability request (reduced), and whether this is a DDL transaction. If it is not, then nothing is done, and InnoDB reports the commit as successful. If it is a DDL transaction, then log is flushed ignoring the reduced durability request and innodb_flush_log_at_trx_commit setting..

The above mentioned that DDL transactions are flushed more than regular ones, regardless of innodb_flush_log_at_trx_commit setting. This is a deliberate design decision, which has to do with the data dictionary, I believe. To understand why, consider the relevant parts of server startup sequence:

  1. InnoDB comes up, and performs its own recovery from its redo log.
  2. Server data dictionary is initialized.
  3. Binlog crash recovery runs.
If any DD transactions are trapped in prepared state by the time of the data dictionary initialization, they will be invisible, while their disk changes (e.g. a tablespace renamed on disk) will be present on disk. This inconsistency is likely to be fatal for the DD, and binlog crash recovery runs too late to recover from that.

InnoDB: handlerton::prepare

Implemented by innobase_xa_prepare.

It calls trx_prepare_for_mysql -> trx_prepare -> trx_prepare_low, which updates the undo log state for the transaction a mini-transaction, committing which makes the top-level transaction prepared. Then trx_prepare calls trx_flush_logs, which will either do nothing or write and flush the redo log up to the mini-transaction's commit LSN, depending on the server durability request.

InnoDB: handlerton::flush_logs

Implemented by innobase_flush_logs.

It has a bool argument telling whether it was invoked as a part of binlog group commit, which is the interesting case here, ignoring the other option of it being invoked by FLUSH LOGS SQL statement. It writes the redo log buffer to disk and flushes it according it to innodb_flush_log_at_trx_commit value.

Bugs reported while writing this:

Bugs found while writting this:

Bugs that made me write this: