> Given any problem: if it involves text, it's worse than you think; if it involves C, it's worse than you think. If it involves both, run.
>
> — Daniel Keep
`CStr`/`CString` are broken because they assume all C strings are encoded as UTF-8. Fixing this requires either breaking stable APIs, or contorting them beyond reasonable limits. As such, they should be deprecated and replaced.
I'm posting this as a pre-RFC because I'm not sure whether the approach I have in mind is the right way to go about this.
What follows is a long, somewhat rambly description of the problem. If you're already convinced, feel free to skip to the proposed solution.
## Problem
For those who aren't aware: yes, they really do operate under the assumption that all C strings everywhere are encoded using UTF-8. This is so wrong, I'm kind of amazed it hasn't blown up in anyone's face yet. Strings intended for C have to be transcoded from Unicode into the current C multibyte encoding, since that's what pretty much every C API expects. The exceptions are libraries that specifically use UTF-8 internally, but this does not apply to things like the C runtime itself.
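To see what that means in practice, here's a small demonstration, assuming the `libc` crate: `printf`'s `%s` (like `fopen`'s path argument, and most of the CRT) interprets `char*` data in the locale's encoding, not UTF-8.

```rust
use std::ffi::CString;

fn main() {
    let s = CString::new("naïve").unwrap(); // stored as UTF-8 bytes
    unsafe {
        // In a Latin-1 locale this prints "naÃ¯ve": the CRT/terminal
        // decodes the two UTF-8 bytes of 'ï' as two Latin-1 characters.
        libc::printf(b"%s\n\0".as_ptr() as *const libc::c_char, s.as_ptr());
    }
}
```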
It's also super, duper, totally wrong on Windows, which can't even set UTF-8 as the C multibyte encoding at all. Ever. Heck, stock Windows is configured to support CP500, which isn't even a superset of ASCII.
Also, as far as I can tell, the default locale for C only admits ASCII, so unless Rust programs are routinely setting the C locale properly, they're probably broken in that way as well.
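For reference, opting in to the environment's locale is something C programs normally do at startup with `setlocale(LC_ALL, "")`; a minimal sketch of the Rust equivalent, again assuming the `libc` crate:

```rust
use std::ffi::CString;

fn main() {
    // "" means "use the locale from the environment"; the default,
    // if you never call this, is the ASCII-only "C" locale.
    let empty = CString::new("").unwrap();
    let loc = unsafe { libc::setlocale(libc::LC_ALL, empty.as_ptr()) };
    assert!(!loc.is_null(), "the environment's locale is not supported");
}
```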
This means that almost all extant code written in Rust using `CStr[ing]` only works by coincidence. I don't think that's a very good position for the language to be in when it cares so much about safety and correctness.
I've known about this for a little while, and recently started writing a library to handle string FFI of all kinds.
Unfortunately, that library's design currently ICEs the compiler, so I decided to instead see if I could just fix `CStr[ing]`. I figured I could just paper over the UTF-8 assumption.
Take a guess, from the title of this post, how that went. Here are the problems I've found:
`fn CString::new<T: Into<Vec<u8>>>(t: T) -> Result<CString, NulError>`
This is designed to let you pass a `String` or `Vec<u8>` into it without introducing unnecessary copies. Sadly, because you will very often have to re-encode strings, this is misleading. Worse, if you pass a `&str` or `&[u8]`, you will now be doing a minimum of two allocations.

Even worse, you can pass `&[u8]`, which is really weird, since the only way to properly encode the string data is to have Unicode text in the first place. Which means it has to take the `Vec<u8>` it constructs internally and re-validate it as UTF-8, even if it was originally a `String`.
Oh, and as a final kick in the pants: the only error you can indicate is that there was an interior zero terminator. Not that, say, the string cannot be encoded into the current C multibyte encoding.
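To make that concrete, here is the behaviour as it stands today (nothing hypothetical here; these are the stable APIs):

```rust
use std::ffi::CString;

fn main() {
    // Accepted without complaint, even though a C API in a non-UTF-8
    // locale will read these UTF-8 bytes as mojibake:
    let s = CString::new("naïve").unwrap();
    assert_eq!(s.as_bytes(), "naïve".as_bytes());

    // ...while this is the *only* error the API can ever report:
    assert!(CString::new("foo\0bar").is_err());
}
```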
`fn CString::from_vec_unchecked(v: Vec<u8>) -> CString`
I could have added this one under the previous method, but: the way this is documented implies that `v` should be in whatever the current C multibyte encoding is. This is both true and false. It is true if you pass a pointer into C code. It is false if you do anything that causes `CStr[ing]` to read the string back, in which case it has to be valid UTF-8. You can't win, and I don't know which encoding I should be accepting.
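Here's a minimal illustration of the ambiguity, assuming the process is running in a Windows-1252 locale (where `0xE9` is `é`):

```rust
use std::ffi::CString;

fn main() {
    // "café", encoded per the docs' apparent intent: in the current
    // C multibyte encoding (Windows-1252 here), not UTF-8.
    let bytes = vec![b'c', b'a', b'f', 0xE9];
    let s = unsafe { CString::from_vec_unchecked(bytes) };

    // Handing `s.as_ptr()` to C does the right thing... but reading
    // the string back on the Rust side demands UTF-8, so this fails:
    assert!(s.to_str().is_err());
}
```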
`fn CString::into_string(self) -> Result<String, IntoStringError>`
First, the wording requires that the `CString` be valid UTF-8, making it definitionally useless for working with actual C strings, unless the current encoding happens to be UTF-8. In other words, this method should fail on any non-ASCII data 100% of the time on Windows machines, which sucks.

Secondly, the error type only allows for one kind of error: UTF-8 validity errors. This can be handwaved, but only by perverting the meaning of `str::Utf8Error` to mean something far broader.
`fn CStr::to_str(&self) -> Result<&str, Utf8Error>`
Again, this specifically only works for UTF-8. Again, this means that the second you throw a valid, but non-ASCII, string at it on Windows, it fails. Again, this sucks, because it encourages writing programs that work on Mac/Linux, but don't work on Windows. I could get around this if it just returned `Cow<str>` instead. Having said that...
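A demonstration covering both this method and `into_string` above; the bytes are "über" as a C library would produce it in a Latin-1 or Windows-1252 locale:

```rust
use std::ffi::CString;

fn main() {
    // Valid in the C encoding (0xFC = 'ü'), but not valid UTF-8:
    let s = CString::new(vec![0xFC, b'b', b'e', b'r']).unwrap();

    // Both conversions reject a perfectly ordinary C string:
    assert!(s.to_str().is_err());        // via Deref<Target = CStr>
    assert!(s.into_string().is_err());
}
```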
`fn CStr::to_string_lossy(&self) -> Cow<str>`
This has the opposite problem: it unconditionally succeeds.
Now, you could kind of squint your eyes and let it process non-UTF-8 strings, but you run into a problem with the "replacement" part: the standard C string-conversion APIs don't let you resume from conversion failures. Once something goes wrong mid-conversion, you have to either start again from the beginning or give up.

Now, I suspect it's possible to get around this by restarting conversions in the middle of strings... except nothing I've found says you're allowed to do that. So it would be a case of writing more code that only works by coincidence on some platforms, and maybe not on others. It also doesn't help that when something goes wrong, there's no obvious, correct way to work out how much of the input is in error, and how far to skip ahead.
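And the flip side, using the same Windows-1252 bytes as above: no error channel at all, just silent replacement, regardless of what the bytes meant in the actual C encoding:

```rust
use std::ffi::CStr;

fn main() {
    // "über" in Windows-1252, with the terminator included:
    let c = CStr::from_bytes_with_nul(b"\xFCber\0").unwrap();

    // 0xFC isn't valid UTF-8, so it silently becomes U+FFFD:
    assert_eq!(c.to_string_lossy(), "\u{FFFD}ber");
}
```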
`impl Debug for CString`
Basically, because you have things like CP500, you can't even assume that the encoding is a superset of ASCII. I'm not especially worried about this one: even if the contents of the string are gibberish, they're human-readable gibberish, and it's only intended for debugging anyway.
## Sometimes, UTF-8 is right
As I noted above, libraries can specifically ignore the C encoding and specify that all strings they deal with be in UTF-8. In those cases, using `CStr[ing]` is the correct thing to do, and "fixing" them would break such code in a backwards-incompatible fashion.
In other words: damned if we do, damned if we don't.
## Proposed Solution
First, `CStr` and `CString` should both be deprecated. In addition, their documentation should be adjusted to explicitly spell out, at the beginning of all relevant sections, that they only work with zero-terminated UTF-8 strings, and do not interop correctly with regular C strings, except by coincidence.
Second, introduce a new pair of types that are specifically for interop with C strings, and a new pair of types that implement the semantics of `CStr` and `CString`. This allows all old code to be migrated correctly.
The library I've been working on already contains most of the code needed for these types, and the ICE-triggering parts can be avoided for these types. They implement most of the functionality from `CStr[ing]`, just with different signatures.
As for names, since the obvious ones are already taken, I'm partial to `ZMbStr`/`ZMbString` and `ZUtf8Str`/`ZUtf8String`; `Z` = zero-terminated, `Mb` = multibyte.
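To give a feel for the shape of the API, here's a hypothetical sketch; the type names are the ones proposed above, but the method set and error type are purely illustrative, not a settled design:

```rust
/// Placeholder error: "not representable in the current C encoding,
/// or contains an interior NUL". (Hypothetical.)
pub struct MbError;

/// Zero-terminated string in the current C multibyte encoding.
pub struct ZMbString {
    bytes: Vec<u8>, // encoded bytes, including the terminator
}

impl ZMbString {
    /// Transcode from Unicode into the current C encoding; unlike
    /// `CString::new`, this can fail for encoding reasons.
    pub fn from_str(_s: &str) -> Result<ZMbString, MbError> {
        unimplemented!("platform-specific transcoding goes here")
    }

    /// Transcode back to Unicode, instead of assuming UTF-8.
    pub fn to_string(&self) -> Result<String, MbError> {
        unimplemented!()
    }
}

/// Zero-terminated UTF-8 string: the semantics `CString` actually
/// implements today, under an honest name.
pub struct ZUtf8String {
    bytes: Vec<u8>, // guaranteed UTF-8, including the terminator
}
```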
For those wondering, I currently have two code-paths: one for C95, where `wchar_t` is UTF-16 (i.e. Windows), and one for C11, where `char32_t` is UTF-32 [1]. They do not rely on any additional libraries or dependencies.
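A rough sketch of how those two paths might be selected; the `cfg` split and function names here are my own illustration of the approach, not the actual library code:

```rust
#[cfg(windows)]
fn encode_for_c(s: &str) -> Vec<u8> {
    // C95 path: on Windows, wchar_t is 16 bits and holds UTF-16, so
    // go &str -> UTF-16, then convert with wcstombs (or the Win32
    // WideCharToMultiByte) into the current code page.
    let _wide: Vec<u16> = s.encode_utf16().collect();
    unimplemented!("wcstombs / WideCharToMultiByte goes here")
}

#[cfg(not(windows))]
fn encode_for_c(s: &str) -> Vec<u8> {
    // C11 path: char32_t holds UTF-32, so code points map directly,
    // and c32rtomb converts them one at a time.
    let _wide: Vec<u32> = s.chars().map(|c| c as u32).collect();
    unimplemented!("c32rtomb in a loop goes here")
}
```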
### Why not keep `CStr[ing]`?
Because there's a huge body of text on the internet now that answers the question "how do I call C from Rust with strings?" with "use `CString`". No matter how blatantly you note in the docs that this is wrong, the people who most need to see those notices are the ones least likely to read them.
Also, the names are just wrong.
### Why not add new, more useful methods?
The primary problem is that these methods would be creating non-UTF-8 C strings, whilst the existing methods would still be creating UTF-8 C strings. This would make it ludicrously easy to mess up and use the wrong methods.
Also, as noted previously, the existing semantics aren't always the wrong ones to use.
Finally, the good names are all taken. It's hard to come up with evocative names that aren't unwieldy and long, and that also distinguish themselves from the semantics of the existing types.
[1]: I have gone through so much depressing material in researching this. Perhaps the most depressing of all is that when the C standards committee was defining C11, they introduced `char16_t` and `char32_t`, along with a full family of conversion functions... but specifically did not require that these types be any kind of Unicode. I don't even...
The second most depressing was finding out that Windows has at least five different "active" code pages in a process, all of which do different things. There might be a sixth, but I could never work out what that one was for, so I gave up.