update 2023-09-28: edited for non-durable SE commits under group commit, and fixed the trx->flush_log_later
discussion.
update 2023-09-27: Binlog group commits asks the storage engines to commit non-durably, will edit the post even more.
update 2023-09-26: trx->flush_log_later
is actually used. Will edit the post.
Let's review how a MySQL storage engine should implement transaction durability by flushing / syncing WAL writes to disk. For performance reasons (group 2PC), let's also review when it specifically should not sync writes to disk. The reference durability implementation is, of course, InnoDB.
The main storage engine entry point is handlerton::commit
. Since in general the
storage engines participate in two-phase commit protocol with the binary log, there
is also handlerton::prepare
, and handlerton::flush_logs
participates too. Let's
ignore rollbacks, savepoints, explicit XA transactions, read only transactions,
transactions on temporary tables only, crash recovery, and transaction coordinators
other than the binlog.
Background: Group Commit
It was implemented (WL#5223) in its current form in MySQL 5.6, and its internals are described in this Mats Kindahl's blog post. I will not repeat everything here (and I'm sure I'd miss a lot of details), but for durability discussion, from the storage engine side, the group commit looks as follows:
prepare(t1)
with reduced durability;prepare(t2)
with reduced durability;- …
prepare(tn)
with reduced durability;flush_logs()
, making all the prepares above durable;commit(t1)
with reduced durability;commit(t2)
with reduced durability;- …
commit(tn)
with reduced durability.
A surprise here is that the commits are performed with reduced durability too. How do reduced-durability commits implement full durability for the committed transactions, then? Turns out, the design of binlog group commit is only the commit of binlog itself is durable, and for the storage engines, prepares are made durable in batches and that's it. If their commits are lost, binlog crash recovery will roll forward the prepared transactions.
This design is counterintuitive if one thinks that innodb-flush-log-at-trx-commit=1
, as documented, makes InnoDB commits durable in this setup, which it does not, and it is possible to see binlog crash recovery in action. Davi Arnaut reported this as bug #75519 in 2015, and IMHO few users are aware of this behavior.
Anyway, back to the implementation. Apparently the server developers did not want to change the prepare/commit handlerton
interface, so the server durability request (full or reduced) is not passed in as an
argument, but must be queried by thd_get_durability_property
returning an enum
with two possible values HA_REGULAR_DURABILITY
and HA_IGNORE_DURABILITY
.
Later, in 8.0, this durability property was reused to implement correct & performant commit order on multithreaded replicas, when binlog is disabled (WL#7846).
InnoDB: handlerton::commit
Implemented by innobase_commit
.
Comes last in the group commit, but let's review it first. In other setups it might be the only entry point.
Wherever I say "write [to the disk] and sync|flush", the mental model is that of a
buffered write with a separate flush/sync afterwards. If O_SYNC
or O_DSYNC
is
used to write the log instead, then the write and the sync are a single operation.
Let's ignore non-default innobase_commit_concurrency
setups.
First the code sets trx->flush_log_later
and then goes through the call stack innobase_commit
-> innobase_commit_low
->
trx_commit_for_mysql
-> trx_commit
-> trx_commit_low
. The last one calls
trx_write_serialisation_history
, which makes the necessary commit writes to a
mini-transaction, then trx_commit_low
commits the mini-transaction by creating the
redo log records. Nothing is done for durability yet at this point. Finally
trx_commit_low
calls trx_commit_in_memory
, which sees that trx->flush_log_later
is set and sets trx->must_flush_log_later
. (if trx_commit
is called from other API than SE commit, then flush_log_later
will not be set and the durability will be ensured in this function).
At this point the callstack returns all the way back to innobase_commit
, which calls
trx_complete_for_mysql
, which now checks trx->must_flush_log_later
(set), durability request (reduced), and whether this is a DDL transaction. If it is not, then nothing is done, and InnoDB reports the commit as successful. If it is a DDL transaction, then log is flushed ignoring the reduced durability request and innodb_flush_log_at_trx_commit
setting..
The above mentioned that DDL transactions are flushed more than regular ones,
regardless of innodb_flush_log_at_trx_commit
setting. This is a deliberate design
decision, which has to do with the data dictionary, I believe. To understand why, consider the relevant parts of server startup sequence:
- InnoDB comes up, and performs its own recovery from its redo log.
- Server data dictionary is initialized.
- Binlog crash recovery runs.
InnoDB: handlerton::prepare
Implemented by innobase_xa_prepare
.
It calls trx_prepare_for_mysql
-> trx_prepare
-> trx_prepare_low
, which updates
the undo log state for the transaction a mini-transaction, committing which makes the
top-level transaction prepared. Then trx_prepare
calls trx_flush_logs
, which will
either do nothing or write and flush the redo log up to the mini-transaction's commit
LSN, depending on the server durability request.
InnoDB: handlerton::flush_logs
Implemented by innobase_flush_logs
.
It has a bool argument telling whether it was invoked as a part of binlog group
commit, which is the interesting case here, ignoring the other option of it being
invoked by FLUSH LOGS
SQL statement. It writes the redo log buffer to disk and
flushes it according it to innodb_flush_log_at_trx_commit
value.
Bugs reported while writing this:
Bugs that made me write this: