Tuesday, January 30, 2024

Introducing patch2testlist for MySQL development

I wrote a small shell utility, patch2testlist, that might be useful for fellow MySQL developers. It reads a diff and outputs the list of tests touched by the diff, in a format suitable for mysql-test-run.pl consumption. Furthermore, when given the path to the diff's source tree, it also handles included files.

There are two ways to invoke it.

  1. Quick-and-dirty mode that does not handle included files:

    $ ./mtr `git diff | patch2testlist` ...
    
  2. Thorough mode that considers included files, if the source tree path is given:

    $ ./mtr `git diff | patch2testlist ../..` ...
    

What does it do? Let's consider an example:

$ git diff | diffstat
 mysql-test/extra/rpl_tests/rpl_replica_start_after_clone.inc                                    |    2 
 mysql-test/include/keyring_tests/binlog/rpl_encryption_master_key_rotation_at_startup.inc       |    5 -
 mysql-test/include/keyring_tests/mats/rpl_encryption.inc                                        |    2 
 mysql-test/include/keyring_tests/mats/rpl_encryption_master_key_generation_recovery.inc         |    2 
 mysql-test/suite/auth_sec/include/acl_tables_row_locking_test.inc                               |    4 
 mysql-test/suite/binlog/t/binlog_restart_server_with_exhausted_index_value.test                 |    1 
 mysql-test/suite/component_keyring_file/inc/rpl_setup_component.inc                             |    1 
 mysql-test/suite/innodb/t/log_8_0_11_case1.test                                                 |    1 
 mysql-test/suite/rocksdb/r/sys_tables.result                                                    |    2 
 mysql-test/suite/rocksdb/r/sys_tables_acl_tables_row_locking.result                             |  384 +++++++++++++++++++---------------------------------------------------------------
 mysql-test/suite/rocksdb/r/sys_tables_is_statistics_mysql.result                                |    4 
 mysql-test/suite/rocksdb/r/sys_tables_mysqlcheck.result                                         |    8 -
 mysql-test/suite/rpl/t/rpl_cloned_slave_relay_log_info.test                                     |    4 
 mysql-test/suite/rpl/t/rpl_encryption.test                                                      |    3 
 mysql-test/suite/rpl/t/rpl_encryption_master_key_generation_recovery.test                       |    3 
 mysql-test/suite/rpl/t/rpl_encryption_master_key_rotation_at_startup.test                       |    5 -
 mysql-test/suite/rpl/t/rpl_gtid_innodb_sys_header.test                                          |    2 
 mysql-test/suite/rpl_gtid/t/rpl_gtid_xa_commit_failure_before_gtid_externalization.test         |    1 
 mysql-test/suite/rpl_gtid/t/rpl_gtid_xa_commit_one_phase_failure_before_prepare_in_engines.test |    1 
 mysql-test/suite/rpl_gtid/t/rpl_gtid_xa_prepare_failure_before_prepare_in_engines.test          |    1 
 mysql-test/suite/rpl_gtid/t/rpl_gtid_xa_rollback_failure_before_gtid_externalization.test       |    1 
 mysql-test/suite/rpl_nogtid/t/rpl_assign_gtids_to_anonymous_transactions_clone.test             |    4 
 mysql-test/suite/rpl_nogtid/t/rpl_gtid_mode.test                                                |    5 -
 mysql-test/suite/rpl_nogtid/t/rpl_nogtid_encryption_read.test                                   |    3 
 mysql-test/suite/test_services/t/test_host_application_signal_plugin.test                       |    3 
 mysql-test/t/basedir.test                                                                       |    5 -
 mysql-test/t/mysqld_daemon.test                                                                 |    3 
 mysql-test/t/mysqld_safe.test                                                                   |   27 ++---
 mysql-test/t/restart_server.test                                                                |    3 
 mysql-test/t/restart_server_no_acl.test                                                         |    3 
...
$ git diff | patch2testlist
binlog.binlog_restart_server_with_exhausted_index_value innodb.log_8_0_11_case1 main.basedir
main.mysqld_daemon main.mysqld_safe main.restart_server main.restart_server_no_acl
rocksdb.sys_tables rocksdb.sys_tables_acl_tables_row_locking
rocksdb.sys_tables_is_statistics_mysql rocksdb.sys_tables_mysqlcheck
rpl.rpl_cloned_slave_relay_log_info rpl.rpl_encryption
rpl.rpl_encryption_master_key_generation_recovery
rpl.rpl_encryption_master_key_rotation_at_startup rpl.rpl_gtid_innodb_sys_header
rpl_gtid.rpl_gtid_xa_commit_failure_before_gtid_externalization
rpl_gtid.rpl_gtid_xa_commit_one_phase_failure_before_prepare_in_engines
rpl_gtid.rpl_gtid_xa_prepare_failure_before_prepare_in_engines
rpl_gtid.rpl_gtid_xa_rollback_failure_before_gtid_externalization
rpl_nogtid.rpl_assign_gtids_to_anonymous_transactions_clone rpl_nogtid.rpl_gtid_mode
rpl_nogtid.rpl_nogtid_encryption_read test_services.test_host_application_signal_plugin

The quick-and-dirty mode above does not require a hundred-line script; a ten-line one will do. But notice that several of the changed files in the diffstat output are test include files (e.g. rpl_replica_start_after_clone.inc). Ideally we'd want to run any tests that include such files, directly or indirectly, and the ten-line script does not handle this case.
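
For illustration, the ten-line version boils down to something like this sketch (not the actual script: it takes file names from git instead of parsing the diff on stdin, and it only knows the standard t/ and r/ layouts):

# Map changed test and result files to suite.test names and dedupe.
git diff --name-only \
    | sed -n -e 's|^mysql-test/t/\(.*\)\.test$|main.\1|p' \
             -e 's|^mysql-test/r/\(.*\)\.result$|main.\1|p' \
             -e 's|^mysql-test/suite/\([^/]*\)/t/\(.*\)\.test$|\1.\2|p' \
             -e 's|^mysql-test/suite/\([^/]*\)/r/\(.*\)\.result$|\1.\2|p' \
    | sort -u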

That's what the other ninety lines of the script do. If the optional source tree path argument is given, the script greps under mysql-test/ for files that include the changed ones, then greps for the newly-found files, and so on until no more are found:
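
In shell pseudocode, the fixed-point search is roughly the following (heavily simplified: converting the found test paths to suite.test names is omitted, and the real script is more careful):

# Transitive search for tests affected by changed include files.
# "pending" holds basenames of changed or newly-discovered include files.
pending="rpl_replica_start_after_clone.inc"
seen=""
while [ -n "$pending" ]; do
  next=""
  for inc in $pending; do
    seen="$seen $inc"
    # Files referencing $inc on a non-comment "source" line:
    for f in $(grep -rlE "^([^#].*)?source.*$inc" mysql-test); do
      case "$f" in
        *.inc)  b=$(basename "$f")
                case " $seen$next " in
                  *" $b "*) ;;              # already known
                  *) next="$next $b" ;;     # new include file, keep iterating
                esac ;;
        *.test) echo "$f" ;;                # an affected test (as a path)
      esac
    done
  done
  pending="$next"
done | sort -u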

$ git diff | patch2testlist ../..
auth_sec.acl_tables_row_locking binlog.binlog_restart_server_with_exhausted_index_value
component_keyring_file.rpl_binlog_cache_encryption
component_keyring_file.rpl_binlog_cache_temp_file_encryption
component_keyring_file.rpl_default_table_encryption component_keyring_file.rpl_encryption
component_keyring_file.rpl_encryption_master_key_generation_recovery
component_keyring_file.rpl_encryption_master_key_rotation_at_startup innodb.log_8_0_11_case1
main.basedir main.mysqld_daemon main.mysqld_safe main.restart_server main.restart_server_no_acl
rocksdb.sys_tables rocksdb.sys_tables_acl_tables_row_locking rocksdb.sys_tables_is_statistics_mysql
rocksdb.sys_tables_mysqlcheck rpl.rpl_cloned_slave_relay_log_info rpl.rpl_encryption
rpl.rpl_encryption_master_key_generation_recovery rpl.rpl_encryption_master_key_rotation_at_startup
rpl.rpl_gtid_innodb_sys_header rpl.rpl_slave_start_after_clone
rpl_gtid.rpl_gtid_only_start_replica_after_clone
rpl_gtid.rpl_gtid_xa_commit_failure_before_gtid_externalization rpl_gtid.rpl_gtid_xa_commit_one_phase_failure_before_prepare_in_engines
rpl_gtid.rpl_gtid_xa_prepare_failure_before_prepare_in_engines
rpl_gtid.rpl_gtid_xa_rollback_failure_before_gtid_externalization
rpl_nogtid.rpl_assign_gtids_to_anonymous_transactions_clone rpl_nogtid.rpl_gtid_mode
rpl_nogtid.rpl_nogtid_encryption_read test_services.test_host_application_signal_plugin

As you can see, the list is now significantly longer, giving more thorough test coverage of the diff. All this extra grepping takes about 90 seconds on my machine if some popular include files are touched. I have no idea whether that's with a hot or cold FS cache. I also don't know whether replacing grep with rg would make it faster.

To minimize false positives in the included-file search, grep only considers lines that do not start with the MTR language comment character # and that look like ...source...basename-of-included-file. This still allows false positives in indented comments and inside string literals (the latter should be rare), and it cannot tell apart files with the same name in different directories. In theory it also allows false negatives, if an include file is referenced through a string variable holding its name. Any suggestions for better regexps are welcome.
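
For example, with rpl_encryption.inc as the changed file, the matching rule behaves approximately like this (an illustrative approximation, not the script's exact regex):

# Which lines count as referencing rpl_encryption.inc:
#   --source include/rpl_encryption.inc            <- match
#   source suite/rpl/include/rpl_encryption.inc    <- match
#     # source rpl_encryption.inc                  <- match: indented-comment false positive
#   # source rpl_encryption.inc                    <- no match: full-line comment
grep -rlE '^([^#].*)?source.*rpl_encryption\.inc' mysql-test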

It goes without saying that it is best applied to test-only patches. If you touch the source code, then you should be looking at whole MTR runs or, if possible, MTR runs of selected suites. But if you are indeed working on a test-only patch, this script effectively reduces the required test time.

It should be portable, but is currently tested on macOS only. Feedback is welcome!

Wednesday, January 24, 2024

Building and testing MySQL 8.0.36 and 8.3.0 on macOS

The previous releases (8.0.35 and 8.2.0) resulted in me reporting fifteen bugs. Let's find out whether 8.0.36 and 8.3.0 will fare better on an M1 Mac.

Let's start with the build. Boost goes away as an external dependency in 8.3.0, removing the need to specify Boost-related CMake options: good. The server continues to build successfully with -DWITH_SYSTEM_LIBS=ON, but now requires -DWITH_ZLIB=bundled, because 8.3.0 made the system libraries option govern zlib too, and the version in Xcode is one patch level too old, while the Homebrew-installed one is ignored.
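
So the configuration now has to look roughly like this (source path hypothetical, unrelated options omitted):

# System libraries everywhere except the now-governed zlib:
cmake ../mysql-8.3.0 -DWITH_SYSTEM_LIBS=ON -DWITH_ZLIB=bundled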

8.0.36 Release configuration builds with a single potentially-fatal warning: bug #113662 (NDB compilation error on macOS Release build). Finding this made me wonder why NDB is built at all if I did not add -DWITH_NDB=ON. This resulted in bug #113661 (NDB storage engine built ignoring -DWITH_NDB=OFF (which is OFF by default too)).

The most serious build-related issue I saw previously was incorrect query results if compiled with LLVM 15 and newer, reported as bug #113049 (MTR test json.array_index fails with a result difference) and bug #113046 (MTR tests for SELECT fail with ICP, MRR, possibly other flags). This issue has been fixed, although the bugs are still open (and thus have no release notes entries either). As Tor Didriksen explained, they remain open due to still-unresolved issues with recent MSVC compilers. But LLVM works fine for me now, and that's great.

The previous releases also required the -ffp-contract=off compilation flag as a workaround for some failures: bug #113047 (MTR test main.derived_limit fails with small cost differences) and bug #113048 (MTR test gis.gis_bugs_crashes fails with a result difference). This has been mostly addressed, except that #113047 is fixed in 8.0.36 and 8.4.0 but not in 8.3.0, so that failure remains there if the workaround is dropped.

The previous releases could not be compiled with LLVM 17, and nothing has changed here: bug #113123 (Compilation fails with LLVM 17) still applies.

Moving on to tests in the Release, Debug, and Debug+ASan+UBSan configurations. Things are looking better than last time; this is what I had to report:

So, to sum up, 10 bugs reported, 4 bugs confirmed fixed, 5 bugs (#113123, #113260, #113189, #113190, #113258) have no changes, and 1 bug (#113023) I did not test.

All in all, this looks OK. While these are not the perfectly clean testsuite results I was used to in some older releases, there are no miscompilation-like bugs either, and that's fine.

Thursday, January 11, 2024

MySQL clone plugin internals and MyRocks clone design

I just realized that I never actually published anything more serious about MySQL and MyRocks clone than a single-emoji post on Facebook, a link to an Oracle umbrella bug, and this tweet.

So, let's talk about clone. MySQL has a clone plugin, which can be used to create new instances by copying existing ones, and it is also integrated into group replication for the same purpose. In Oracle releases this plugin copies only InnoDB tables, making the feature unsuitable for MyRocks instances.

Now MyRocks is the first storage engine besides InnoDB to get clone support, and it works on mixed MyRocks/InnoDB instances while ensuring that the cloned instances are consistent across engines too. The code is in Meta's branch. I don't believe there are any user docs (but the MyRocks support is so seamless that the Oracle docs suffice! Only half-joking here), yet there are, IMHO, extensive internals docs at Meta's wiki: MyRocks Clone Plugin.

Their scope is broader than the title might suggest. Not only is the MyRocks clone design discussed, but there is also a clone background section, which explains how clone works internally and fully applies to the Oracle branches too.

As with all things third-party storage engines, it is rare to develop a feature without having to patch the server (or, in this case, the server, the clone plugin, and InnoDB). The details of this patch are discussed in the wiki, and also in the aforementioned Oracle umbrella bug.

Last but not least, with background and patches elsewhere out of the way, the Wiki has the design of MyRocks clone proper.

I hope the feature will reach MyRocks downstreams one day. Since MariaDB currently has no clone plugin, that leaves Percona Server. Maybe these docs will help with the porting, and with advanced end-user troubleshooting. Clone away!

Thursday, December 14, 2023

MySQL 8.0.35 and 8.2.0 are out, here are my 15 compilation/test bug reports

I'm only a month and a half late to the party. That's, unfortunately, because I tried to build them and run their tests, on macOS, of all things. First the good news: they build, and do so with the maximum possible set of third-party libraries.

Next I tried running the testsuite. I am used to clean test results in Oracle releases, under good conditions at least (not too heavy a load on the system, not too high a --parallel setting), with only occasional issues. This time I saw dozens of failures under the debug, debug+sanitizers, and release configurations, and tried to convert them to bug reports, best-effort.

First I identified a Homebrew-packaged Perl incompatibility with a test script: https://bugs.mysql.com/bug.php?id=113023.

Then I had a couple of test output differences where the difference was in floating point values: https://bugs.mysql.com/bug.php?id=113047 (MTR test main.derived_limit fails with small cost differences) and https://bugs.mysql.com/bug.php?id=113048 (MTR test gis.gis_bugs_crashes fails with a result difference). I am not a floating point programming expert, but somewhat luckily I remembered that there is a GCC option -ffp-contract=off, and that the MySQL CMake script checks whether to add it. On a hunch that the CMake test might be incomplete (it is Linux-only and I was on macOS), I tried adding the flag as a workaround, and it worked!
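
The workaround amounts to passing the flag at configure time, roughly like this (build type and other options omitted):

# Hypothetical configure line adding the workaround to both C and C++ flags:
cmake .. -DCMAKE_C_FLAGS=-ffp-contract=off -DCMAKE_CXX_FLAGS=-ffp-contract=off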

The next set of bugs was nastier. A bunch of query optimizer tests were failing with incorrect query results (https://bugs.mysql.com/bug.php?id=113046), and so did a JSON array test (https://bugs.mysql.com/bug.php?id=113049). To find the triggering conditions I tried different compilers and found that the tests pass if compiled with LLVM 14 and fail with LLVM 15, 16, 17, and Xcode 15. I had no idea whether this was a compiler bug, MySQL undefined behavior, or something else, but Tor Didriksen posted on #113049 that "Recent versions of Clang have changed their implementation of std::sort(), and our own 'varlen_sort()' function returns wrong results." One less mystery, then.

Checking those different compiler versions was not trivial, because Homebrew-packaged LLVM 14 to 17 fail to build MySQL: https://bugs.mysql.com/bug.php?id=113113. There is some incompatibility between the system ar and the LLVM ranlib utilities, with a workaround of using the ar coming from LLVM, i.e. -DCMAKE_AR=/opt/homebrew/opt/llvm@16/bin/llvm-ar. My build script is at 700 lines now, and that's already with some parts factored out.

On top of the previous bug, LLVM 17, being new, had its regular and expected share of new warnings/errors: https://bugs.mysql.com/bug.php?id=113123.

Back from the build-with-different-compilers detour, there were still some test failures unaccounted for: a debug assertion in group replication (https://bugs.mysql.com/bug.php?id=113257), all the TLS 1.3-using tests failing (https://bugs.mysql.com/bug.php?id=113258), and spam in the replicating server error log (https://bugs.mysql.com/bug.php?id=113260).

At this point I stopped processing MTR tests, as I had already logged many bugs and it became harder to avoid duplicates, so I thought I'd look at the unit tests. Here I'll just give a list of partial findings:

That's why it took me ~six weeks (and fifteen bug reports) to celebrate the new MySQL releases. That's halfway to the next expected release date on the quarterly schedule, and I hope I will be able to write a much shorter blog post much sooner after that release, as usual!


Monday, October 23, 2023

Strong typing: comparing Rust newtype to C++

Rust source code often makes heavy use of the newtype idiom, where a new type is created for an underlying primitive type to differentiate it from other uses of the same underlying type. The term "newtype" comes from Haskell, where it's not an idiom but a keyword, thus a built-in language feature.

I was wondering why I had never heard of newtype as a C++ developer, because C++ is obviously a strongly-typed language where the same issue exists. There are different ways to approach it, and this is my attempt to get my thoughts on the topic in order; there will be no earth-shattering insights.

Rust: newtype

Suppose you are developing a database and have transaction IDs and log sequence numbers. Both are u64 but are not interoperable in any way. So in Rust, a natural implementation would be to apply the newtype idiom twice:

pub struct TransactionId(u64);

impl TransactionId {
    fn new(id: u64) -> Self {
        Self(id)
    }

    fn get(&self) -> u64 {
        self.0
    }
    ...
}

impl fmt::Display for TransactionId { ... }

pub struct LogSequenceNumber(u64);

impl LogSequenceNumber { ... }
...

Let's enumerate the options in C++.

C++: do nothing

Do nothing, and use std::uint64_t for both types. No compiler protection, no documentation at the type name, thus the most bug-prone option. Obviously there is nothing to stop us from using this same option in Rust too.

C++: use type aliases

Introduce type aliases:

// Can also be done with typedef, but let's stick to modern C++:
using transaction_id = std::uint64_t;
using log_sequence_number = std::uint64_t;

This expresses the intent, documents things wherever the type name appears, and is not too verbose. The downside is that it does not introduce new types, only aliases for existing ones, meaning that transaction IDs can be assigned to LSNs and back freely.

C++: introduce new types

Introduce new types. Like in Rust, differently-named structs with identical fields can be used.

struct transaction_id {
  std::uint64_t val;
};

struct log_sequence_id {
  std::uint64_t val;
};

Now type safety is increased, and type mix-ups are prevented by the compiler. But so are most operations on variables of these types, requiring extra code to get the desired functionality back, compared to the first two options. Writing this extra code will be more verbose than the Rust equivalent, because Rust has support for traits, which can have default implementations.

struct log_sequence_id {
  ...
  // explicit is important, we don't want to make the incompatible types
  // implicitly-convertible again inadvertently
  explicit log_sequence_id(std::uint64_t v) : val{v} {}

  log_sequence_id& operator += (std::size_t log_delta) {
    val += log_delta;
    return *this;
  }
  ...
};

Naturally, limiting available operations is advantageous too, in both languages. For example, it makes no sense to add two transaction IDs together.

Since this is C++, meaning that we have the template hammer, making all problems look like template nails for better or worse, we could try to avoid spelling out the structs every time:

// Written this way only to show a point. The actual implementation would be more
// complex to be able to handle move-only types and wrap large objects efficiently.
template<typename T, typename Tag>
class newtype {
 public:
  explicit newtype(T v) : val{v} {}
  void set(T v) { val = v; }
  T get() const { return val; }
 private:
  T val;
};

struct log_sequence_id_tag{};
using log_sequence_id = newtype<std::uint64_t, log_sequence_id_tag>;

struct transaction_id_tag{};
using transaction_id = newtype<std::uint64_t, transaction_id_tag>;

Now introducing a newtype is reduced to two lines of code. Again, C++ developers do not usually discuss newtype, but they do discuss strongly-typed using and typedefs, which are the same thing under a different name.

In most cases we are wrapping a single value of a primitive or string type. Those wrapped values are then operated on using free functions or methods of some other classes. Thus, in this setting, this is a great option and we are done. But suppose we want to add some methods to the newly-introduced type instead of using free functions. The newtype template will not allow this, not unless we introduce inheritance:

using log_sequence_id_base = newtype<std::uint64_t, log_sequence_id_tag>;

class log_sequence_id : public log_sequence_id_base {
   ...
};

At this point the use of the newtype template becomes questionable, and the code is simpler with the value folded directly into the class:

class log_sequence_id {
 public:
  ...
 private:
  std::uint64_t value;
};

Here we are back to creating a new type manually, just like before, without templates. This is different from Rust, where a single-field struct will clearly show its newtype origins in the declaration, regardless of how much functionality it acquires later on.

So, there you have it. Both languages are strongly typed and have means to introduce new distinct types built on existing ones, with Rust calling this newtype, and C++ developers having a choice among type aliases (which don't actually increase type safety), succinct templates, and verbose hand-written types, each with some trade-offs.

Monday, September 25, 2023

Implementing durability in a MySQL storage engine

update 2023-09-28: edited for non-durable SE commits under group commit, and fixed the trx->flush_log_later discussion.

update 2023-09-27: Binlog group commit asks the storage engines to commit non-durably; will edit the post even more.

update 2023-09-26: trx->flush_log_later is actually used. Will edit the post.

Let's review how a MySQL storage engine should implement transaction durability by flushing / syncing WAL writes to disk. For performance reasons (group 2PC), let's also review when it specifically should not sync writes to disk. The reference durability implementation is, of course, InnoDB.

The main storage engine entry point is handlerton::commit. Since in general the storage engines participate in the two-phase commit protocol with the binary log, there is also handlerton::prepare, and handlerton::flush_logs participates too. Let's ignore rollbacks, savepoints, explicit XA transactions, read-only transactions, transactions on temporary tables only, crash recovery, and transaction coordinators other than the binlog.

Background: Group Commit

It was implemented (WL#5223) in its current form in MySQL 5.6, and its internals are described in this blog post by Mats Kindahl. I will not repeat everything here (and I'm sure I'd miss a lot of details), but for the durability discussion, from the storage engine side, group commit looks as follows:

  • prepare(t1) with reduced durability;
  • prepare(t2) with reduced durability;
  • prepare(tn) with reduced durability;
  • flush_logs(), making all the prepares above durable;
  • commit(t1) with reduced durability;
  • commit(t2) with reduced durability;
  • commit(tn) with reduced durability.

A surprise here is that the commits are performed with reduced durability too. How do reduced-durability commits implement full durability for the committed transactions, then? It turns out that in the binlog group commit design, only the commit of the binlog itself is durable; for the storage engines, the prepares are made durable in batches, and that's it. If their commits are lost, binlog crash recovery will roll the prepared transactions forward.

This design is counterintuitive: one would think that innodb-flush-log-at-trx-commit=1, as documented, makes InnoDB commits durable in this setup, which it does not, and it is possible to observe binlog crash recovery in action. Davi Arnaut reported this as bug #75519 in 2015, and IMHO few users are aware of this behavior.

Anyway, back to the implementation. Apparently the server developers did not want to change the prepare/commit handlerton interface, so the server durability request (full or reduced) is not passed in as an argument, but must be queried with thd_get_durability_property, which returns an enum with two possible values: HA_REGULAR_DURABILITY and HA_IGNORE_DURABILITY.

Later, in 8.0, this durability property was reused to implement correct & performant commit order on multithreaded replicas when the binlog is disabled (WL#7846).

InnoDB: handlerton::commit

Implemented by innobase_commit.

Comes last in the group commit, but let's review it first. In other setups it might be the only entry point.

Wherever I say "write [to the disk] and sync|flush", the mental model is that of a buffered write with a separate flush/sync afterwards. If O_SYNC or O_DSYNC is used to write the log instead, then the write and the sync are a single operation.

Let's ignore non-default innobase_commit_concurrency setups.

First the code sets trx->flush_log_later, and then goes through the call stack innobase_commit -> innobase_commit_low -> trx_commit_for_mysql -> trx_commit -> trx_commit_low. The last one calls trx_write_serialisation_history, which makes the necessary commit writes in a mini-transaction; then trx_commit_low commits the mini-transaction by creating the redo log records. Nothing is done for durability yet at this point. Finally, trx_commit_low calls trx_commit_in_memory, which sees that trx->flush_log_later is set and sets trx->must_flush_log_later. (If trx_commit is called from an API other than SE commit, then flush_log_later will not be set, and durability will be ensured in this function.)

At this point the call stack returns all the way back to innobase_commit, which calls trx_complete_for_mysql, which now checks trx->must_flush_log_later (set), the durability request (reduced), and whether this is a DDL transaction. If it is not, then nothing is done, and InnoDB reports the commit as successful. If it is a DDL transaction, then the log is flushed, ignoring the reduced durability request and the innodb_flush_log_at_trx_commit setting.

The above mentioned that DDL transactions are flushed more eagerly than regular ones, regardless of the innodb_flush_log_at_trx_commit setting. This is a deliberate design decision, which has to do with the data dictionary, I believe. To understand why, consider the relevant parts of the server startup sequence:

  1. InnoDB comes up, and performs its own recovery from its redo log.
  2. Server data dictionary is initialized.
  3. Binlog crash recovery runs.

If any DD transactions are trapped in the prepared state by the time of the data dictionary initialization, they will be invisible, while their disk changes (e.g. a tablespace renamed on disk) will be present. This inconsistency is likely to be fatal for the DD, and binlog crash recovery runs too late to recover from it.

InnoDB: handlerton::prepare

Implemented by innobase_xa_prepare.

It calls trx_prepare_for_mysql -> trx_prepare -> trx_prepare_low, which updates the undo log state for the transaction in a mini-transaction, the commit of which makes the top-level transaction prepared. Then trx_prepare calls trx_flush_logs, which will either do nothing or write and flush the redo log up to the mini-transaction's commit LSN, depending on the server durability request.

InnoDB: handlerton::flush_logs

Implemented by innobase_flush_logs.

It has a bool argument telling whether it was invoked as a part of binlog group commit, which is the interesting case here, ignoring the other option of it being invoked by the FLUSH LOGS SQL statement. It writes the redo log buffer to disk and flushes it according to the innodb_flush_log_at_trx_commit value.

Bugs reported while writing this:

Bugs found while writing this:

Bugs that made me write this:

Friday, August 25, 2023

MySQL Build Times: Use Ninja

I had noticed Ninja as one of the possible CMake generators a long time ago, but never paid attention to it, as I could not imagine it being so much better than Make that it'd be worth switching. Then, when I posted my MySQL -ftime-trace results, I got a comment on LinkedIn that Ninja visualizes build time nicely. I tried that, and it did, and I went back to Make builds.

A few months later, I am looking at Vittorio Romeo's "Improving Compilation Times" presentation slides (download them, do not read inline on GitHub; there is also the talk video itself), and the very first low-hanging fruit advice is "use Ninja".

OK, so let's actually try, say, a Debug build of Facebook MySQL 8.0.28:

  • make -j13: 4m43s
  • ninja: 4m17s

A 10% improvement with roughly zero effort is nice. There are other niceties too: you don't have to figure out the right parallelism argument for make -j, as Ninja handles that automatically, and the terminal is not spammed with the build log of all the source files that were built uneventfully. Only compiler warnings and any irregular build output remain.

To use it, add -G Ninja to the CMake invocation, and then use ninja instead of make to build. I have patched my scripts accordingly.
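
That is, roughly:

# Configure with the Ninja generator instead of the default Make one
# (paths hypothetical), then build; no -j guesswork needed:
cmake -G Ninja /path/to/mysql/source
ninja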