MySQL 8.0 replaced the legacy FRM
files with a transactional data dictionary, stored in
InnoDB. Here I'll describe how any other storage engine could become the DDSE, that
is, serve as the backend for the data dictionary, replacing InnoDB. This is purely an
internals post: it will not look into the user-facing motivations and implications of
the change (which can be found here), nor will it discuss storing any internal
SE-specific data dictionary-like information in the server DD; nor should the DD
tables be confused with system tables, a distinct category.
This is based on the experience of implementing this feature in MyRocks.
The needed changes broadly fall into three areas:
- Changes to the SE to make it able to work as the DDSE
- Changes to the core server layer which no longer can assume InnoDB being the DDSE
- Changes to InnoDB so it can keep on working while not being the DDSE
Changes to the SE to make it the DDSE
The DDSE must:
- be transactional
- be linked statically into the server
- know how to assign any internal IDs to DD tables
- implement the handlerton API for DDSEs
- make DD transactions durable
- be able to do non-locking reads
- support attachable transactions
The first two items are straightforward in the motivation, if not in the
implementation. The storage engine must be transactional to maintain the
transactional data dictionary. It also must be linked statically into the server, as
dynamic plugins are registered in the mysql.plugin
table, which is hard to access
without the data dictionary. This can be done by adding MANDATORY
as an argument to
MYSQL_ADD_PLUGIN
in the CMakeLists.txt
for the SE.
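For illustration, the registration might look roughly like this; the mystorage target name and the variable names are placeholders, and InnoDB's own storage/innobase/CMakeLists.txt shows the canonical MANDATORY usage:

```cmake
# Hypothetical storage/mystorage/CMakeLists.txt fragment. MANDATORY makes the
# plugin a statically linked, always-on part of the server binary, so it is
# available before the mysql.plugin table can be read.
MYSQL_ADD_PLUGIN(mystorage
  ${MYSTORAGE_SOURCES}
  STORAGE_ENGINE
  MANDATORY
  LINK_LIBRARIES ${MYSTORAGE_LIBS})
```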
Internal ID assignment
The main task for the DDSE is to know how to access the DD tables without needing to
refer to the DD. That is achieved by the DDSE hardcoding its internal IDs and any
other needed data for the DD schema metadata. To get an ID,
handler::get_se_private_id
is called for each DD table, allowing the DDSE to update
the passed dd::Table
object to set the private IDs. For example:
```cpp
dd_table->set_se_private_id(next_dd_index_id++);
for (auto *index : *dd_table->indexes()) {
  auto &properties = index->se_private_data();
  properties.set("index_id", next_dd_index_id++);
}
```
The create table implementation can then access this metadata to use the allocated IDs.
One question may arise: why bother going through get_se_private_id
at all? Can't
the CREATE TABLE
code recognize that it's being called for a DD table, allocate the
ID itself, and store it in the passed dd::Table
object there? In principle, yes, it
could, but: 1) there are asserts in the server layer that the DD tables have their SE
private data set; 2) modifying the passed dd::Table
objects in the handler CREATE TABLE is
only allowed for HTON_SUPPORTS_ATOMIC_DDL
engines, while there is no requirement
for a DDSE to be one.
Handlerton API for DDSEs
There are several handlerton APIs that a DDSE must implement:
- dict_get_server_version, dict_set_server_version: read and write the server version of the DD schema in an SE-specific way. Called by the server layer during upgrades.
- ddse_dict_init: the name suggests that it performs any DD initialization needed in the DDSE. But it is also tasked with returning any extra SE-specific DD tables and tablespaces (in the InnoDB sense) back to the server DD layer. It also has a mode argument with four possible values. While it may appear daunting to implement DDSE initialization in four different ways, in MyRocks this function does nothing. Not having SE-specific tables and tablespaces helps.
- dict_recover: also performs any DD initialization in the DDSE, governed by the mode argument. The name and docs suggest that these are recovery-related actions. In MyRocks this function does nothing.
- dict_register_dd_table_id: allows the DDSE to note which DD object IDs belong to DD tables. It is up to the engine to decide how to use this information. For instance, MyRocks checks that DROP TABLE and similar operations are not executed against a DD table, for defense in depth.
- is_dict_read_only: if the SE has a read-only mode (InnoDB does), this allows telling the server layer when the DD can be read but not updated. Otherwise, return false; is a great implementation.
- dict_cache_reset_tables_and_tablespaces: tells the DDSE to invalidate any cached metadata about tables and tablespaces, because it is going to be reloaded anew. This call happens in the server startup sequence right after binlog crash recovery.
- dict_cache_reset: remove metadata for a single table from any internal SE caches. Called in upgrades. A no-op in MyRocks, because there is no need to do any DDSE upgrades in MyRocks yet.
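For an engine without SE-specific DD tables or caches, most of these hooks reduce to near no-ops. Below is a self-contained sketch of that shape; the Mock_ddse class, the simplified dict_init_mode values, and the member-function signatures are illustrative stand-ins for the real handlerton function pointers, not the actual MySQL API:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified stand-in for the server-side mode argument (not the real enum).
enum class dict_init_mode { create_files, create_tables, check_files, ignore_files };

struct Mock_ddse {
  uint32_t dd_server_version = 0;
  std::vector<uint64_t> dd_table_ids;  // DD object IDs registered by the server

  // dict_set_server_version / dict_get_server_version: persist the server
  // version of the DD schema in an SE-specific way. false == success.
  void dict_set_server_version(uint32_t v) { dd_server_version = v; }
  bool dict_get_server_version(uint32_t *v) const {
    *v = dd_server_version;
    return false;
  }

  // ddse_dict_init: an engine with no extra SE-specific DD tables or
  // tablespaces has nothing to hand back to the server layer.
  bool ddse_dict_init(dict_init_mode) { return false; }

  // dict_recover: likewise a no-op for such an engine.
  bool dict_recover(dict_init_mode) { return false; }

  // dict_register_dd_table_id: remember which DD object IDs are DD tables,
  // e.g. to reject DROP TABLE against them later (defense in depth).
  void dict_register_dd_table_id(uint64_t id) { dd_table_ids.push_back(id); }
  bool is_dd_table(uint64_t id) const {
    for (auto v : dd_table_ids)
      if (v == id) return true;
    return false;
  }

  // is_dict_read_only: no read-only mode in this sketch.
  bool is_dict_read_only() const { return false; }

  // dict_cache_reset_tables_and_tablespaces: nothing cached here.
  void dict_cache_reset_tables_and_tablespaces() {}
};
```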
Durability for DD transactions
I have previously discussed SE durability implementation, and it touches upon DD transactions: if a transaction involves the DD, it should be persisted regardless of any reduced durability settings (MyRocks implementation). While reduced durability for user transactions may result in lost data for the last transactions before a crash, in the case of the DD it may result in a desynced DD, making the whole instance nonviable, which I strongly suspect was the reason for this requirement.
Non-locking reads for user transactions for the DD tables
INFORMATION_SCHEMA
views provide read-only access to the underlying DD tables for
users. Now this access must be non-blocking, both ways: a DDL statement in progress
such as ALTER TABLE
must not block any INFORMATION_SCHEMA
queries and conversely
an INFORMATION_SCHEMA
query, if done in a session with a high transaction isolation
level, must not block any DDL statements from proceeding.
That is implemented by a new parameter for the handler::extra
method:
HA_EXTRA_NO_READ_LOCKING
. Having received this, the SE should skip acquiring
read locks, if any, for the handler's table. MyRocks implementation: setting,
reading. For the InnoDB implementation, you can take a look at WL#7446 (InnoDB: provide a
way to do non-locking reads); note that the parameter was named differently there.
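As an illustration of the contract, here is a toy model of the flag's effect; Mock_handler and its members are hypothetical, and the real handler::extra takes many other ha_extra_function values:

```cpp
#include <cassert>

// Simplified stand-in for the server's ha_extra_function enumeration.
enum ha_extra_function { HA_EXTRA_NO_READ_LOCKING, HA_EXTRA_OTHER };

class Mock_handler {
  bool m_no_read_locking = false;  // set per table handle by the server layer
  int m_read_locks_taken = 0;

 public:
  int extra(ha_extra_function op) {
    if (op == HA_EXTRA_NO_READ_LOCKING) m_no_read_locking = true;
    return 0;
  }
  // The read path skips lock acquisition when the flag is set, so an
  // INFORMATION_SCHEMA query neither blocks nor is blocked by concurrent DDL.
  void read_row() {
    if (!m_no_read_locking) ++m_read_locks_taken;
  }
  int read_locks_taken() const { return m_read_locks_taken; }
};
```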
Attachable transactions
All the DD accesses are wrapped in DD transactions. These DD transactions differ from
regular user transactions in that they can and will happen in the middle of those
regular transactions. The server layer takes care of starting and stopping them, and
of swapping the transaction context in the thread handle. The storage engine must be
prepared to handle this swapped context correctly: if it observes
THD::ha_data == nullptr
in the middle of a transaction, it should allocate a new
object for this field and invalidate any cached assumptions about its contents. Once
that is ensured, add HA_ATTACHABLE_TRX_COMPATIBLE
to the handler table_flags return
value bitset. The server implementation design is WL#8003 (Server support for
attachable transactions) and the InnoDB implementation is WL#7828.
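To illustrate the pattern, here is a stripped-down model of the ha_data dance; Mock_thd, SeTrx, and get_or_create_trx are invented names for this sketch, not server or MyRocks code:

```cpp
#include <cassert>

// Mocked per-session slot; the real THD keeps one entry per SE plugin slot.
struct Mock_thd {
  void *ha_data = nullptr;
};

// The engine's per-transaction state.
struct SeTrx {
  int writes = 0;
};

// On every entry point, re-resolve the transaction object from the THD slot:
// the server may have swapped in a fresh context for an attachable DD
// transaction, leaving ha_data == nullptr mid-statement, so the engine must
// not cache the pointer across calls.
SeTrx *get_or_create_trx(Mock_thd *thd) {
  if (thd->ha_data == nullptr) thd->ha_data = new SeTrx();
  return static_cast<SeTrx *>(thd->ha_data);
}
```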
Changes to the core server
The server assumes several things about the DDSE, and also that there is exactly one engine involved.
- What the DDSE is
- Early initialization of static plugins
- READ COMMITTED and READ UNCOMMITTED
- Foreign keys
- Implicit primary keys
- TABLESPACE support
- handler::delete_row is bug-for-bug compatible with the InnoDB implementation
- Duplicate entry or lock wait timeout?
- Supporting DD tables in multiple SEs
- Changing the DDSE of an existing instance
- Clone plugin compatibility
What the DDSE is
The DD layer is full of ha_resolve_by_legacy_type(thd, DB_TYPE_INNODB)
calls, one for
every time it needs to get the DDSE handlerton. Naturally, you'll have to grep all of
sql
for DB_TYPE_INNODB
and refactor out some helper function to return the right
handlerton for you. I wish grepping only sql/dd
were enough, but there is
sql/resourcegroups
too.
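The refactoring boils down to one helper that the DD layer (and sql/resourcegroups) calls instead of hardcoding DB_TYPE_INNODB. A minimal sketch, with invented names (ddse_hton, the_ddse) and mocked server types:

```cpp
#include <cassert>

// Simplified stand-ins for the server's plugin types.
enum legacy_db_type { DB_TYPE_INNODB, DB_TYPE_OTHER_SE };
struct handlerton { legacy_db_type db_type; };
struct THD {};

static handlerton innodb_hton{DB_TYPE_INNODB};
static handlerton other_hton{DB_TYPE_OTHER_SE};

// Hypothetical selection of the DDSE; a runtime server variable works too
// if one binary must support both engines.
static handlerton *the_ddse = &other_hton;

// Mocked version of the existing server lookup by legacy type.
handlerton *ha_resolve_by_legacy_type(THD *, legacy_db_type t) {
  return t == DB_TYPE_INNODB ? &innodb_hton : &other_hton;
}

// The helper the DD layer calls instead of hardcoding DB_TYPE_INNODB.
handlerton *ddse_hton(THD *) { return the_ddse; }
```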
That's not all. The SQL definitions assume the engine too: the default constructor of
Object_table_impl
adds a hardcoded "ENGINE=InnoDB"
string, which will need
patching.
Early initialization of static plugins
Linking the DDSE statically into the server binary is not enough to ensure it is
initialized early enough. The initialization of early plugins is handled by
plugin_register_builtin_and_init_core_se
in sql/plugin.cc
. It has a loop that
iterates over all plugins known at this stage and initializes some of them. You'll
have to add logic to initialize your SE as well.
Transaction isolation level requirements
The main level used in DD operations is READ COMMITTED
. Luckily, this is sort of a
"default" isolation level for transactional engines, and even if the SE implements
something else, like snapshot isolation, it still has to map that to the levels MySQL
understands, likely to READ COMMITTED
again.
But the server data dictionary also expects the DDSE to support READ UNCOMMITTED
.
Even though I spent the better part of a year working on this, I still don't fully
understand the reason behind it (any transaction can read its own uncommitted
changes in any isolation level in every storage engine I know about, and reading
uncommitted changes from other transactions for correctness looks suspicious). I
tried an experiment where I patched the server to always use READ COMMITTED
with
InnoDB still being the DDSE, to see what breaks. What broke was foreign key
operations, but only when there was more than one of them in a single ALTER TABLE
.
You have to evaluate how much of a showstopper that is for you, and whether to go back
to the SE and implement READ UNCOMMITTED
, or patch the server to always use READ
COMMITTED
with the DDSE.
A further complication here is that a SE may or may not support atomic DDL
(handlerton flag HTON_SUPPORTS_ATOMIC_DDL
), and the DDSE could go either way. If it
does not, then there are interim DD transaction commits in the middle of a DDL
statement, complicating this picture. I only worked on the case where the DDSE does
not support atomic DDL.
Last but not least, if the unsupported isolation levels are strictly rejected, users may find it surprising that their DDL statements fail if their session uses one of the unsupported levels, for instance, if there is another engine for user tables that does support them. Consider accepting an unsupported level for the scope of any DDL statements. I am not fully certain why this issue happens at all – the attachable transactions should run in one of the accepted levels in the DDSE – but maybe the DDSE gets to see the DDL statement in an outer transaction first.
Foreign keys
The DD table definitions in .cc
files under sql/dd/impl/tables
feature many
foreign keys. If your SE does not support them, you'll be glad to learn that the DD
schema has no referential actions, so the keys effectively serve as defense in depth
against incorrect operations; their presence is optional, and omitting their
definitions works just fine. This commit also handles the hardcoded "ENGINE=InnoDB"
discussed above and the next two items.
Implicit primary keys
Some of the DD tables, for example mysql.index_column_usage
, have no explicit
primary keys declared. But they do have several unique non-NULL indexes, one of
which is promoted to a primary key. As an implementation detail, it might be hard
(and brittle) to track this promotion logic and repeat the decision correctly in the
SE. If that's the case, patch the SQL definitions to make one of those unique keys a
primary key explicitly.
TABLESPACE
support
All the DD tables are created in the mysql
tablespace. If your SE has no notion of
tablespaces, patch the SQL definitions to remove this.
delete_row
bug-for-bug compatibility with the DD-InnoDB interface
This was a fun one. MySQL deletes rows through handler::delete_row(const uchar
*buf)
, where buf
points to either record[0]
or record[1]
. But the reference
implementation, InnoDB, does not even look at what buf is and relies on a
pre-positioned cursor to delete the row. And the DD layer added a new call to
delete_row
that always passes buf == record[1]
, pointing to a memory block of
zeros. Benign with InnoDB; anything can happen with another engine. Patch
sql/dd/impl/raw/raw_record.cc
so that your Raw_record::update
passes
m_table->record[0]
instead of m_table->record[1]
.
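To see why the buffer contents matter, here is a toy model of an engine that, unlike InnoDB, derives the row to delete from the passed buffer; Mock_handler and its string-keyed storage are purely illustrative:

```cpp
#include <cassert>
#include <map>
#include <string>

// A mock engine that extracts the key from the record buffer passed to
// delete_row, instead of relying on a pre-positioned cursor like InnoDB does.
class Mock_handler {
  std::map<std::string, bool> m_rows;

 public:
  void insert_row(const char *key) { m_rows[key] = true; }

  // Returns 0 on success, non-zero if the buffer names no existing row.
  int delete_row(const unsigned char *buf) {
    auto it = m_rows.find(reinterpret_cast<const char *>(buf));
    if (it == m_rows.end()) return 1;
    m_rows.erase(it);
    return 0;
  }

  size_t rows() const { return m_rows.size(); }
};
```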
Duplicate entry or lock wait timeout
The data dictionary holds the statistics tables, which usually have a higher volume of
writes than the rest of the DD tables. They are also atypical DD tables in that it is
possible, and legal, to write statistics data for the same object from two different
threads concurrently. One of them will win the race and the other may get an
ER_DUP_ENTRY
error, which is then silently discarded.
The above is true if InnoDB is the DDSE. Other engines may handle duplicate index
entries by returning a different error, for example, MyRocks returns
ER_LOCK_WAIT_TIMEOUT
instead and cannot be easily made to return ER_DUP_ENTRY
.
Thus the server has to accommodate by silently discarding either error.
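On the server side, the accommodation can be as small as one predicate that treats both codes as the benign lost-the-race outcome when storing statistics. A sketch, with the error numbers hardcoded for illustration (the real constants live in the server's error headers):

```cpp
#include <cassert>

// Mocked server error codes; illustrative values matching the classic
// MySQL client error numbers.
enum { ER_DUP_ENTRY = 1062, ER_LOCK_WAIT_TIMEOUT = 1205 };

// When two sessions race to store statistics for the same object, InnoDB
// reports the loser with ER_DUP_ENTRY, while an engine like MyRocks may
// report ER_LOCK_WAIT_TIMEOUT; either is a benign, discardable outcome.
bool is_discardable_stats_error(int err) {
  return err == ER_DUP_ENTRY || err == ER_LOCK_WAIT_TIMEOUT;
}
```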
Supporting DD tables in multiple SEs
While only one storage engine can host the data dictionary, it still makes sense for other storage engines to be able to have their own additional private tables in the DD, as discussed below in the context of InnoDB. This requires some changes in the server layer:
- The handlerton method dict_register_dd_table_id needs to be called for the correct SE. The DD layer can no longer assume the engine of a DD table, and must get the handlerton for the dict_register_dd_table_id call from the table object.
- dict_init, dict_recover, and dict_cache_reset_tables_and_tablespaces need to be called for all participating SEs. Their implementations, especially for the first two, must handle both cases of the SE being the DDSE and not being one. But note that this code will have to be further adjusted if the DDSE change needs to be supported on existing instances.
Changing the DDSE of an existing instance
If you want to change the DDSE on existing instances, you'll have to design and write the code to perform this change. I was not the main author of this part of the project, thus I'll skip discussing it.
Clone
Technically not the core server, but a plugin. Clone is an InnoDB-centric plugin, but if your storage engine already supports clone, then almost everything will work. One thing that will not, without a larger patch for clone than I thought was worth developing, is clone rollbacks on startup. This does not happen on successful clones, and a failed clone instance with a non-InnoDB DDSE will just fail in a different way.
Changes to InnoDB
It is hard to blame InnoDB for not working straight out of the box when it does not store the data dictionary. Several changes need to be made there:
- Rewrite InnoDB-specific INFORMATION_SCHEMA DD view implementations
- Allow InnoDB to register its DD tables even when it's not the DDSE
- Mark DDL transactions as read-write
- Register DDL transactions with server 2PC coordinator
- GTID persistence with binlog disabled
- Allowing UNDO tablespace truncation to run without RW workload
Reimplementing InnoDB-specific INFORMATION_SCHEMA DD views
There are several INFORMATION_SCHEMA
views that are specific to InnoDB, that is,
they are not server DD-layer views over DD tables: INNODB_TABLES
,
INNODB_TABLESPACES
, INNODB_DATAFILES
, INNODB_COLUMNS
, INNODB_VIRTUAL
. InnoDB
implements those views by opening the underlying DD tables using low-level InnoDB
APIs. Since those tables are in a different engine now, the views have to be
reimplemented (main commit, fix 1, fix 2) using server-layer APIs; luckily, those are
rather straightforward to use:
```cpp
template <typename O, typename F>
int for_all_innodb_objects_in_dd(THD *thd, F f) {
  auto *const dc = dd::get_dd_client(thd);
  const dd::cache::Dictionary_client::Auto_releaser releaser(dc);
  std::vector<const O *> dd_objs;
  if (dc->fetch_global_components(&dd_objs)) {
    return 1;
  }
  for (const auto *obj : dd_objs) {
    if (obj->engine() != innobase_hton_name) continue;
    const auto ret = f(thd, *obj);
    if (ret) return ret;
  }
  return 0;
}
```
where the first template argument is dd::Table
or dd::Tablespace
and f
reads
the needed information and fills out the result row, possibly also looking into
partitions, if needed:
```cpp
[&tables](THD *thd, const dd::Table &table) {
  const auto table_id = table.se_private_id();
  if (dict_sys_t::is_dd_table_id(table_id)) return 0;
  if (table_id != dd::INVALID_OBJECT_ID) {
    fill_i_s_innodb_tables_row(thd, *tables, table_id);
  } else {
    assert(!table.leaf_partitions().empty());
    for (const auto *partition : table.leaf_partitions()) {
      assert(partition->engine() == innobase_hton_name);
      const auto table_id = partition->se_private_id();
      if (table_id != dd::INVALID_OBJECT_ID)
        fill_i_s_innodb_tables_row(thd, *tables, table_id);
    }
  }
  return 0;
}
```
InnoDB-specific DD tables
A storage engine should be allowed to have its private DD tables even if it is not
the DDSE of the instance. InnoDB in particular needs its own tables in the mysql
schema: innodb_dynamic_metadata
, innodb_table_stats
, innodb_index_stats
, and
innodb_ddl_log
. They get created and registered in the ddse_dict_init
handlerton
method implementation, and the handler method get_se_private_data
will be called for
them too.
There's more to it if DDSE changes need to be considered. The private data for DD
tables in InnoDB is the table ID, index ID, B-tree root page ID, etc. The first two
are assigned sequentially. If InnoDB is not the DDSE, a naive implementation would
give out IDs that belong to different tables when InnoDB is the DDSE. To avoid this
can of worms, it is prudent to play it safe and make sure that in both cases the same
IDs are assigned to the private InnoDB DD tables. Last but not least, one of the
InnoDB private data pieces assigned in advance is the B-tree root page ID, which must
be in sync with what the underlying code will actually do at a later point. And that
varies with innodb_page_size
, so make sure to test and handle all possible settings
for this system variable. Everything discussed in this paragraph is handled here and
also in dict_table_build_def with a btr_create
debugging counterpart.
Mark DDL transactions as read-write
For some DDL operations, InnoDB did not bother marking their transactions as
read-write in the DDL implementation proper, because the operation involved a write
to a DD table in InnoDB anyway, taking care of that RW bit. With the DD table being
in another SE, the DDL transaction incorrectly becomes RO in InnoDB, which had the
effect of skipping the prepare
call in the 2PC, breaking crash recovery. The
existing handler::mark_trx_read_write
fails to work here, being a private method. I
ended up a copy-paste-shortening it to a new helper ha_innodb::mark_trx_rw
, which
is called from the various InnoDB DDL entry points (commit):
```cpp
/** Mark transaction as read-write in the server layer. Normally this is
handled by handler::mark_trx_read_write, but when InnoDB is not the DDSE
(and transactions do not call that method many times), the few remaining
calls happen too early, when the transaction is not started yet, or with
an incompatible table_share value.
@param thd thread handle */
void mark_trx_rw(THD &thd) {
  auto &ha_info = thd.get_ha_data(innodb_hton_ptr->slot)->ha_info[0];
  if (ha_info.is_started()) ha_info.set_trx_read_write();
}

// ...

if (!innobase_is_ddse()) {
  mark_trx_rw(*thd);
}
```
Register DDL transactions with server 2PC coordinator
With binlog disabled, when InnoDB is the DDSE and it is, for example, creating an InnoDB table, it correctly assumes that it is the sole engine in this transaction and does not bother to register itself with the server 2PC coordinator. This assumption does not hold when InnoDB is asked to create its own table while not being the DDSE.
Fixing this consists of adding numerous instances of
```cpp
if (!innobase_is_ddse()) {
  innobase_register_trx(hton, thd, trx);
}
```
to the InnoDB DDL entry points (same commit as for the previous item).
GTID persistence with binlog disabled
If binlog is disabled, InnoDB persists the transaction GTID info itself. This logic needed some tweaking.
Allowing UNDO tablespace truncation to run without RW workload
With the DDSE being another engine, InnoDB purge may get stuck and leave undo
tablespaces untruncated. After I recovered from the initial "huh? Why is there a
relationship between purge and DD?", I found that undo tablespace truncation runs
only if there is something to purge (makes sense, doesn't it?). In the case of
executing DROP UNDO TABLESPACE
, InnoDB as the DDSE would execute some read-write DD transactions,
potentially creating something to purge. If it's not the DDSE, no actual writes to
InnoDB happen, so it has nothing to purge, and consequently it never gets to the undo
tablespace truncation and the final drop.
One way to address this is to call trx_purge_truncate_undo_spaces
from
srv_do_purge
. Doing that unconditionally if InnoDB is not the DDSE works, but might
be less than optimal.
Testing
If your branch of MySQL needs to support both InnoDB and your SE as DDSEs, one way to
do MTR testing for DDSE regressions is to have a server variable to choose the DDSE
(which is likely what you'll already have if you need to support both DDSEs), and
some MTR include files to skip some tests with one or the other choice. Then adding
--mysqld=--ddse-var=non-default-one
to MTR invocations will do the trick. Any
failing tests can then be saved separately with --initialize --mysqld=--ddse-var
in
their option files and added as regression tests to the main testsuite runs.
Conclusion
A checklist for everything discussed above:
- Changes to the SE to make it able to work as the DDSE
- be transactional
- be linked statically into the server
- know how to assign any internal IDs to DD tables
- implement the handlerton API for DDSEs
- make DD transactions durable
- be able to do non-locking reads
- support attachable transactions
- Changes to the core server layer which no longer can assume InnoDB being the DDSE
- What the DDSE is
- Early initialization of static plugins
- READ COMMITTED and READ UNCOMMITTED
- Foreign keys
- Implicit primary keys
- TABLESPACE support
- handler::delete_row is bug-for-bug compatible with the InnoDB implementation
- Duplicate entry or lock wait timeout?
- Supporting DD tables in multiple SEs
- Changing the DDSE of an existing instance
- Clone plugin compatibility
- Changes to InnoDB so it can keep on working while not being the DDSE
- Rewrite InnoDB-specific INFORMATION_SCHEMA DD view implementations
- Allow InnoDB to register its DD tables even when it's not the DDSE
- Mark DDL transactions as read-write
- Register DDL transactions with server 2PC coordinator
- GTID persistence with binlog disabled
- Allowing UNDO tablespace truncation to run without RW workload
- Testing
Reference
Oracle design documents:
- WL#6380 (Formulate framework for API for DD)
- WL#6394 (Bootstrap code for new DD)
- WL#6599 (New Data Dictionary and I_S integration)
- WL#7464 (InnoDB: provide a way to do non-locking reads)
- WL#7828 (InnoDB: attachable transaction support)
- WL#8003 (Server support for attachable transactions)