char8_t Was a Bad Idea
We recently pondered whether to change all our chars to C++20 char8_t and decided against it, at least for now.
On the face of it, adding char8_t seems straightforward and appealing. There are already char16_t and char32_t for UTF-16 and UTF-32 strings, so why not have a char8_t for UTF-8 strings?
And I am all for strong types, that is, aliases of basic (or other) types that share the same representation but do not implicitly convert to other types, so that mistakes are avoided. If we are counting apples, we may as well have a type for it, so we do not accidentally assign the count of apples to a variable counting oranges. If we want to express, for whatever reason, buying the same number of apples as oranges, we can still convert explicitly between counts of apples and oranges.
Unfortunately, C++ support for strong types is not great and requires a lot of boilerplate. Nothing helpful has made it into the standard library so far. Rust, being a more modern language, is much better in this regard.
So why not applaud the introduction of char*_t, so that we at least have strong types for Unicode-encoded characters?
First of all, nowadays most strings are Unicode-encoded anyway. So we go through the effort of introducing a new keyword to distinguish the 1% of non-Unicode-encoded characters from the 99% of Unicode characters.
And we do it while the real problem is elsewhere: strings do not only differ in their encoding but, actually much more frequently, in other invariants. File paths may not contain certain characters. Strings may be escaped, for example according to JSON or XML or SQL rules, which are all different. The UI may only support printing a subset of control characters. Do you want to ring a bell when encountering BEL?
It gets worse. For file paths, Windows is happy with any sequence of 16-bit values; it does not even have to be valid UTF-16. This is actually a good reason not to settle on a single character encoding (UTF-8 being the most attractive) for the whole program: some strings we encounter in the wild may not be proper Unicode, but as long as our program is aware of these idiosyncrasies and uses these strings only in their specific contexts, everything works just fine.
Ideally, all these strings with different invariants should really be different types. But keep in mind that we are talking about string types, not single character types. Many of the invariants, including encoding and escaping, are invariants of a sequence of characters, not of a single character. Having a type for a single character does not help to maintain the invariant.
Making the strings different types is not so easy either. We already use different string types because we want to store strings in different ways. String constants are C arrays, dynamic strings are std::basic_string, and they may be passed by reference as std::span or std::basic_string_view. At think-cell, in our cross-process shared heap, we store strings in std::vector because it works with allocators with custom (in this case, relocatable) pointer types.
The invariants of a string are orthogonal to its storage. We would need to take any of these storage types and restrict their conversion rules and other operations, for example by some sort of tagging. We have not done this at think-cell. Instead, we rely on conventions. We use character type aliases such as tc::filechar and prefix each variable according to its content, for example "path" for those containing OS file or path names and "html" for those containing HTML.
Of course, this is really weak. But as you can see, introducing char8_t does not help a bit with these problems and just creates more conversions. This is why we decided against it.
Do you have feedback? Send us a message at devblog@think-cell.com !