update 2023-09-28: edited for non-durable SE commits under group commit, and fixed the
update 2023-09-27: Binlog group commits asks the storage engines to commit non-durably, will edit the post even more.
trx->flush_log_later is actually used. Will edit the post.
Let's review how a MySQL storage engine should implement transaction durability by flushing / syncing WAL writes to disk. For performance reasons (group 2PC), let's also review when it specifically should not sync writes to disk. The reference durability implementation is, of course, InnoDB.
The main storage engine entry point is
handlerton::commit. Since in general the
storage engines participate in two-phase commit protocol with the binary log, there
handlerton::flush_logs participates too. Let's
ignore rollbacks, savepoints, explicit XA transactions, read only transactions,
transactions on temporary tables only, crash recovery, and transaction coordinators
other than the binlog.
Background: Group Commit
It was implemented (WL#5223) in its current form in MySQL 5.6, and its internals are described in this Mats Kindahl's blog post. I will not repeat everything here (and I'm sure I'd miss a lot of details), but for durability discussion, from the storage engine side, the group commit looks as follows:
prepare(t1)with reduced durability;
prepare(t2)with reduced durability;
prepare(tn)with reduced durability;
flush_logs(), making all the prepares above durable;
commit(t1)with reduced durability;
commit(t2)with reduced durability;
commit(tn)with reduced durability.
A surprise here is that the commits are performed with reduced durability too. How do reduced-durability commits implement full durability for the committed transactions, then? Turns out, the design of binlog group commit is only the commit of binlog itself is durable, and for the storage engines, prepares are made durable in batches and that's it. If their commits are lost, binlog crash recovery will roll forward the prepared transactions.
This design is counterintuitive if one thinks that
innodb-flush-log-at-trx-commit=1, as documented, makes InnoDB commits durable in this setup, which it does not, and it is possible to see binlog crash recovery in action. Davi Arnaut reported this as bug #75519 in 2015, and IMHO few users are aware of this behavior.
Anyway, back to the implementation. Apparently the server developers did not want to change the prepare/commit handlerton
interface, so the server durability request (full or reduced) is not passed in as an
argument, but must be queried by
thd_get_durability_property returning an
with two possible values
Later, in 8.0, this durability property was reused to implement correct & performant commit order on multithreaded replicas, when binlog is disabled (WL#7846).
Comes last in the group commit, but let's review it first. In other setups it might be the only entry point.
Wherever I say "write [to the disk] and sync|flush", the mental model is that of a
buffered write with a separate flush/sync afterwards. If
used to write the log instead, then the write and the sync are a single operation.
Let's ignore non-default
First the code sets
trx->flush_log_later and then goes through the call stack
trx_commit_low. The last one calls
trx_write_serialisation_history, which makes the necessary commit writes to a
trx_commit_low commits the mini-transaction by creating the
redo log records. Nothing is done for durability yet at this point. Finally
trx_commit_in_memory, which sees that
trx->flush_log_later is set and sets
trx_commit is called from other API than SE commit, then
flush_log_later will not be set and the durability will be ensured in this function).
At this point the callstack returns all the way back to
innobase_commit, which calls
trx_complete_for_mysql, which now checks
trx->must_flush_log_later (set), durability request (reduced), and whether this is a DDL transaction. If it is not, then nothing is done, and InnoDB reports the commit as successful. If it is a DDL transaction, then log is flushed ignoring the reduced durability request and
The above mentioned that DDL transactions are flushed more than regular ones,
innodb_flush_log_at_trx_commit setting. This is a deliberate design
decision, which has to do with the data dictionary, I believe. To understand why, consider the relevant parts of server startup sequence:
- InnoDB comes up, and performs its own recovery from its redo log.
- Server data dictionary is initialized.
- Binlog crash recovery runs.
trx_prepare_low, which updates
the undo log state for the transaction a mini-transaction, committing which makes the
top-level transaction prepared. Then
trx_flush_logs, which will
either do nothing or write and flush the redo log up to the mini-transaction's commit
LSN, depending on the server durability request.
It has a bool argument telling whether it was invoked as a part of binlog group
commit, which is the interesting case here, ignoring the other option of it being
FLUSH LOGS SQL statement. It writes the redo log buffer to disk and
flushes it according it to
Bugs reported while writing this:
Bugs that made me write this: