Monday, October 23, 2023

Strong typing: comparing Rust newtype to C++

Rust source code often makes heavy use of the newtype idiom, where a new type is created for an underlying primitive type to differentiate it from other uses of the same underlying type. The term "newtype" comes from Haskell, where it's not an idiom but a keyword, and thus a built-in language feature.

I was wondering why I never heard of newtype as a C++ developer, because C++ is obviously a strongly-typed language, where the same issue exists. There are different ways to approach it, and this is my attempt to get my thoughts on the topic in order. There will be no earth-shattering insights.

Rust: newtype

Suppose you are developing a database and have transaction IDs and log sequence numbers. Both are u64 but are not interoperable in any way. So in Rust, a natural implementation would be to apply the newtype idiom twice:

use std::fmt;

pub struct TransactionId(u64);

impl TransactionId {
    // Wrap a raw u64 in the distinct type.
    pub fn new(id: u64) -> Self {
        Self(id)
    }

    // Read the raw value back out.
    pub fn get(&self) -> u64 {
        self.0
    }
    ...
}

impl fmt::Display for TransactionId { ... }

pub struct LogSequenceNumber(u64);

impl LogSequenceNumber { ... }
...

Let's enumerate the options in C++.

C++: do nothing

Do nothing, and use std::uint64_t for both types. There is no compiler protection and the type name documents nothing, which makes this the most bug-prone option. Obviously, nothing stops us from using this same option in Rust too.
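To make the risk concrete, here is a minimal sketch. The names log_record and make_log_record are hypothetical, invented only for this illustration. It shows that passing the two values in the wrong order compiles cleanly:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical record type, used only for this illustration.
struct log_record {
  std::uint64_t transaction_id;
  std::uint64_t lsn;
};

// Both parameters have the same type, so nothing ties an argument to its role.
inline log_record make_log_record(std::uint64_t transaction_id,
                                  std::uint64_t lsn) {
  return log_record{transaction_id, lsn};
}

inline void demo_swapped_arguments() {
  const std::uint64_t transaction_id = 7;
  const std::uint64_t lsn = 1000;
  // The arguments are in the wrong order, yet the compiler accepts the call.
  const log_record rec = make_log_record(lsn, transaction_id);
  // The record now holds the LSN where the transaction ID should be.
  assert(rec.transaction_id == 1000);
  assert(rec.lsn == 7);
}
```

The only defense in this option is careful naming and code review.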

C++: use type aliases

Introduce type aliases:

// Can also be done with typedef, but let's stick to modern C++:
using transaction_id = std::uint64_t;
using log_sequence_number = std::uint64_t;

This expresses the intent, documents things whenever the type name appears, and is not too verbose. The downside is that it does not introduce new types, only aliases for existing ones, meaning that transaction IDs assign to LSNs and back freely.
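A quick way to check this is to ask the compiler directly. The sketch below repeats the two aliases; mix_up and demo_alias_mix_up are hypothetical helper names added for the illustration:

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

using transaction_id = std::uint64_t;
using log_sequence_number = std::uint64_t;

// Both aliases name the very same type.
static_assert(std::is_same<transaction_id, log_sequence_number>::value,
              "aliases do not create distinct types");

inline log_sequence_number mix_up(transaction_id id) {
  // Assigning a transaction ID to an LSN compiles, because underneath
  // it is just uint64_t assigned to uint64_t.
  log_sequence_number lsn = id;
  return lsn;
}

inline void demo_alias_mix_up() {
  const transaction_id id = 42;
  assert(mix_up(id) == 42);
}
```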

C++: introduce new types

Introduce new types. Like in Rust, differently-named structs with identical fields can be used.

struct transaction_id {
  std::uint64_t val;
};

struct log_sequence_id {
  std::uint64_t val;
};

Now type safety is increased and the compiler prevents the type mix-up. But it also rejects most operations on variables of these types, so, compared to the first two options, extra code is needed to get the desired functionality. Writing this extra code is more verbose than in Rust, because Rust has traits, which can have default implementations.

struct log_sequence_id {
  ...
  // explicit is important: we don't want to inadvertently make the
  // incompatible types implicitly convertible again
  explicit log_sequence_id(std::uint64_t v) : val{v} {}

  log_sequence_id& operator+=(std::size_t log_delta) {
    val += log_delta;
    return *this;
  }
  ...
};

Naturally, limiting available operations is advantageous too, in both languages. For example, it makes no sense to add two transaction IDs together.
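As a rough sketch of both points, the two structs below are distinct, not implicitly convertible, and expose only the operations that make sense. The operator== on transaction_id and the is_addable detection helper are my assumptions, added only to let the compiler confirm the claims:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>
#include <utility>

struct transaction_id {
  explicit transaction_id(std::uint64_t v) : val{v} {}
  // Comparing IDs is meaningful; adding them is not, so there is no operator+.
  friend bool operator==(transaction_id a, transaction_id b) {
    return a.val == b.val;
  }
  std::uint64_t val;
};

struct log_sequence_id {
  explicit log_sequence_id(std::uint64_t v) : val{v} {}
  log_sequence_id& operator+=(std::size_t log_delta) {
    val += log_delta;
    return *this;
  }
  std::uint64_t val;
};

// Hypothetical detection helper: true only if `T + T` compiles.
template <typename T, typename = void>
struct is_addable : std::false_type {};
template <typename T>
struct is_addable<T, decltype(void(std::declval<T>() + std::declval<T>()))>
    : std::true_type {};

// The explicit constructor blocks implicit conversion from the raw integer.
static_assert(!std::is_convertible<std::uint64_t, log_sequence_id>::value, "");
// The two wrappers cannot be converted into each other.
static_assert(!std::is_convertible<transaction_id, log_sequence_id>::value, "");
static_assert(!std::is_constructible<log_sequence_id, transaction_id>::value, "");
// Raw integers can be added, transaction IDs cannot.
static_assert(is_addable<std::uint64_t>::value, "");
static_assert(!is_addable<transaction_id>::value, "");

inline std::uint64_t advance_demo() {
  log_sequence_id lsn{100};
  lsn += 28;  // advancing an LSN by a byte count is allowed
  return lsn.val;
}
```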

Since this is C++, we have the template hammer, which makes every problem look like a template nail, for better or worse. So we could try to avoid spelling out the structs every time:

// Written this way only to show a point. The actual implementation would be more
// complex to be able to handle move-only types and wrap large objects efficiently.
template<typename T, typename Tag>
class newtype {
 public:
  explicit newtype(T v) : val{v} {}
  void set(T v) { val = v; }
  T get() const { return val; }
 private:
  T val;
};

struct log_sequence_id_tag{};
using log_sequence_id = newtype<std::uint64_t, log_sequence_id_tag>;

struct transaction_id_tag{};
using transaction_id = newtype<std::uint64_t, transaction_id_tag>;

Now introducing a newtype is reduced to two lines of code. Again, C++ developers do not usually discuss newtype, but they do discuss strongly-typed using aliases and typedefs, which are the same thing under a different name.
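Here is a small usage sketch of the template. It repeats the simplified newtype from above and adds a hypothetical free function, advance, to show how the wrapped value is typically handled:

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

template <typename T, typename Tag>
class newtype {
 public:
  explicit newtype(T v) : val{v} {}
  void set(T v) { val = v; }
  T get() const { return val; }
 private:
  T val;
};

struct log_sequence_id_tag {};
using log_sequence_id = newtype<std::uint64_t, log_sequence_id_tag>;

struct transaction_id_tag {};
using transaction_id = newtype<std::uint64_t, transaction_id_tag>;

// Same underlying type, different tags: two unrelated types.
static_assert(!std::is_same<log_sequence_id, transaction_id>::value, "");
static_assert(!std::is_convertible<transaction_id, log_sequence_id>::value, "");

// Hypothetical free function operating on the wrapped value.
inline log_sequence_id advance(log_sequence_id lsn, std::uint64_t delta) {
  return log_sequence_id{lsn.get() + delta};
}

inline void demo_newtype() {
  const log_sequence_id next = advance(log_sequence_id{100}, 28);
  assert(next.get() == 128);
  // advance(transaction_id{1}, 28) would not compile: wrong tag.
}
```

The tag structs are never instantiated. They exist only to make each instantiation of the template a different type.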

In most cases we are wrapping a single value of a primitive or string type. Those wrapped values are then operated on by free functions or by methods of other classes. In this setting, the template is a great option and we are done. But suppose we want to add methods to the newly introduced type instead of using free functions. The newtype template does not allow this unless we introduce inheritance:

using log_sequence_id_base = newtype<std::uint64_t, log_sequence_id_tag>;

class log_sequence_id : public log_sequence_id_base {
   ...
};

At this point the use of the newtype template becomes questionable, and the code gets simpler if the value is folded into the class:

class log_sequence_id {
 public:
  ...
 private:
  std::uint64_t value;
};

Here we are back to creating a new type manually, just like before, without templates. This seems to be different from Rust, where a single-field struct will clearly show its newtype origins in the declaration, regardless of how much functionality it acquired later on.
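For illustration, such a hand-written class might look like the sketch below. The member function advanced_by and the comparison operator are hypothetical, picked only to show methods living on the type itself:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

class log_sequence_id {
 public:
  explicit log_sequence_id(std::uint64_t v) : value{v} {}

  std::uint64_t get() const { return value; }

  // Hypothetical member function: the LSN reached after writing
  // `bytes` more bytes to the log.
  log_sequence_id advanced_by(std::size_t bytes) const {
    return log_sequence_id(value + bytes);
  }

  // Ordering LSNs is meaningful, so it is provided explicitly.
  friend bool operator<(log_sequence_id a, log_sequence_id b) {
    return a.value < b.value;
  }

 private:
  std::uint64_t value;
};

inline void demo_hand_written_type() {
  const log_sequence_id start{100};
  const log_sequence_id next = start.advanced_by(28);
  assert(next.get() == 128);
  assert(start < next);
}
```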

So, there you have it. Both languages are strongly typed and have means to introduce new distinct types built on existing ones. Rust calls this newtype, while C++ developers choose among type aliases, which don't actually increase type safety, succinct templates, and verbose hand-written types, each with its own trade-offs.
