[pre-RFC] Deprecate and replace CStr/CString

Given any problem: if it involves text, it's worse than you think; if it involves C, it's worse than you think. If it involves both, run.

—DanielKeep

CStr/CString are broken because they assume all C strings are encoded as UTF-8. Fixing this requires either breaking stable APIs, or contorting them beyond reasonable limits. As such, they should be deprecated and replaced.

I'm posting this as a pre-RFC because I'm not sure whether the approach I have in mind is the right way to go about this.

What follows is a long, somewhat rambly description of the problem. If you're already convinced, feel free to skip to the proposed solution.

Problem

For those who aren't aware: yes, they really do operate under the assumption that all C strings everywhere are encoded using UTF-8. This is so wrong, I'm kind of amazed this hasn't blown up in anyone's face yet. Strings intended for C have to be transcoded from Unicode into the current C multibyte encoding, since that's what pretty much every C API is going to expect. The exception is libraries that specifically use UTF-8 internally, but this does not apply to things like the C runtime itself.

It's also super, duper, totally wrong on Windows, which can't even set UTF-8 as the C multibyte encoding at all. Ever. Heck, stock Windows is configured to support CP500, which isn't even a superset of ASCII.

Also, the default locale for C, as far as I can tell, only admits ASCII, so unless Rust programs are routinely setting the C locale properly, they're probably broken in that way, as well.
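"Setting the C locale properly" amounts to something like the following sketch. This assumes the libc crate; the empty string asks the C runtime to use whatever the environment specifies:

use std::ffi::CStr;

// Sketch: switch the C runtime from the default "C" locale (whose multibyte
// conversions only accept ASCII) to whatever the environment specifies.
unsafe {
    let loc = libc::setlocale(libc::LC_ALL, b"\0".as_ptr() as *const libc::c_char);
    if loc.is_null() {
        eprintln!("the C runtime rejected the requested locale");
    } else {
        println!("now using C locale: {:?}", CStr::from_ptr(loc));
    }
}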

This means that almost all extant code written in Rust using CStr[ing] only works by coincidence. I don't think that's a very good position for the language to be in when it cares so much about safety and correctness.

I've known about this for a little while, and recently started writing a library to handle string FFI of all kinds.

Unfortunately, that library's design currently ICEs the compiler, so I decided to instead see if I could just fix CStr[ing]. I figured I could just paper over the UTF-8 assumption.

Take a guess, from the title of this post, how that went. Here are the problems I've found:

fn CString::new<T: Into<Vec<u8>>>(t: T) -> Result<CString, NulError>

This is designed to let you pass a String or Vec<u8> into it without introducing unnecessary copies. Sadly, because you will very often have to re-encode strings anyway, that promise is misleading. Worse, if you pass a &str or &[u8], you will now be doing a minimum of two allocations.

Even worse, you can pass &[u8], which is really weird since the only way to properly encode the string data is to have Unicode text in the first place. That means it has to take the Vec<u8> it constructs internally and re-validate it as UTF-8, even if it was originally a String.

Oh, and as a final kick in the pants: the only error you can indicate is that there was an interior zero terminator. Not that, say, the string cannot be encoded into the current C multibyte encoding.
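To make the "no transcoding happens" point concrete, here's a small illustration; Windows-1252 is just an example of a non-UTF-8 target encoding:

use std::ffi::CString;

// "café" goes in as its UTF-8 bytes; new() only checks for interior NULs,
// it never re-encodes into the current C multibyte encoding.
let s = CString::new("café").unwrap();
assert_eq!(s.as_bytes(), &[0x63u8, 0x61, 0x66, 0xC3, 0xA9][..]);
// A C API expecting, say, Windows-1252 wants a single 0xE9 byte for 'é',
// and will read the 0xC3 0xA9 pair as two separate characters.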

fn CString::from_vec_unchecked(v: Vec<u8>) -> CString

I could have added this one under the previous method, but: the way this is documented implies that v should be in whatever the current C multibyte encoding is. This is both true and false. It is true if you pass a pointer into C code. It is false if you do anything that causes CStr[ing] to read the string back, in which case it has to be valid UTF-8. You can't win, and I don't know which encoding I should be accepting.
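A sketch of the ambiguity, using Latin-1 as a stand-in for "whatever the current C multibyte encoding is":

use std::ffi::CString;

// Bytes already in the (hypothetical Latin-1) C multibyte encoding: "é".
let c = unsafe { CString::from_vec_unchecked(vec![0xE9]) };
let _ptr = c.as_ptr();        // fine when handed to C code expecting Latin-1,
assert!(c.to_str().is_err()); // but reading it back from Rust demands UTF-8.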

fn CString::into_string(self) -> Result<String, IntoStringError>

First, the wording requires that the CString be valid UTF-8, making it definitionally useless for working with actual C strings, unless the current encoding happens to be UTF-8. In other words, this method should fail on any non-ASCII data 100% of the time on Windows machines, which sucks.

Secondly, the error type only allows for one kind of error: UTF-8 validity errors. This can be handwaved, but only by perverting the meaning of str::Utf8Error to mean something far broader.
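Concretely (0xFC is 'ü' in Windows-1252, used here as an example of a valid non-UTF-8 C string):

use std::ffi::CString;

// A perfectly reasonable C string in a Windows-1252 locale: "ü".
let c = CString::new(vec![0xFCu8]).unwrap();
// The only failure into_string() can express is "not valid UTF-8".
assert!(c.into_string().is_err());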

fn CStr::to_str(&self) -> Result<&str, Utf8Error>

Again, this specifically only works for UTF-8. Again, this means that the second you throw a valid, but non-ASCII, string at it on Windows, it fails. Again, this sucks, because it encourages writing programs that work on Mac/Linux, but don't work on Windows. I could get around this if it just returned Cow<str> instead. Having said that...

fn CStr::to_string_lossy(&self) -> Cow<str>

This has the opposite problem: it unconditionally succeeds.

Now, you could kind of squint your eyes and let it process non-UTF-8 strings, but you run into a problem with the "replacement" part. The standard APIs for doing string conversions in C don't let you resume from conversion failures. The conversion functions all effectively say that once something goes wrong, you cannot continue the conversion, and have to either start again or give up.

Now, I suspect it's possible to get around this by restarting conversions in the middle of strings... except there's nothing in anything I've found that says you're allowed to do this. So it would be a case of writing more code that only works by coincidence on some platforms, maybe not on others. It also doesn't help that when something goes wrong, there's no obvious, correct way to work out how much of the input is in error, and how far to skip ahead.
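For example (again using Latin-1's 0xE9 for 'é'):

use std::ffi::CStr;

// "café" in Latin-1, as a zero-terminated C string.
let c = CStr::from_bytes_with_nul(b"caf\xE9\0").unwrap();
// to_string_lossy() cannot report a failure; it silently substitutes U+FFFD.
assert_eq!(c.to_string_lossy(), "caf\u{FFFD}");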

impl Debug for CString

Basically, because you have things like CP500, you can't even assume that the encoding is a superset of ASCII. I'm not especially worried about this one, since even if the contents of the string are gibberish, it's human-readable gibberish, and it's only intended for debugging anyway.

Sometimes, UTF-8 is right

As I noted above, libraries can specifically ignore the C encoding and specify that all strings they deal with be in UTF-8. In those cases, using CStr[ing] is the correct thing to do, and "fixing" them would break such code in a backward incompatible fashion.

In other words: damned if we do, damned if we don't.

Proposed Solution

First, CStr and CString should both be deprecated. In addition, their documentation should be adjusted to explicitly spell out, at the beginning of all relevant sections, that they only work with zero-terminated UTF-8 strings, and do not interop correctly with regular C strings, except by coincidence.

Second, introduce a new pair of types that are specifically for interop with C strings, and a new pair of types that implement the semantics of CStr and CString. This allows all old code to be migrated correctly.

The library I've been working on already contains most of the code needed, and the ICE-triggering parts can be avoided for these types. They implement most of the functionality from CStr[ing], just with different signatures.

As for names, since the obvious ones are already taken, I'm partial to ZMbStr/ZMbString and ZUtf8Str/ZUtf8String; Z = zero-terminated, Mb = multibyte.

For those wondering, I currently have two code-paths: one for C95, where wchar_t is UTF-16 (i.e. Windows), and one for C11, where char32_t is UTF-32 [1]. They do not rely on any additional libraries or dependencies.

Why not keep CStr[ing]?

Because there's a huge body of text on the internet now that answers the question "how do I call C from Rust with strings?" with "use CString". No matter how blatantly you note that this is wrong in the docs, the people who most need to see those notices are the ones least likely to read them.

Also, the names are just wrong.

Why not add new, more useful methods?

The primary problem is that these methods would be creating non-UTF-8 C strings, whilst the existing methods would still be creating UTF-8 C strings. This would make it ludicrously easy to mess up and use the wrong methods.

Also, as noted previously, the existing semantics aren't always the wrong ones to use.

Finally, the good names are all taken. It's hard to come up with evocative names that aren't unwieldy and long, that also distinguish themselves from the semantics of the existing ones.


[1]: I have gone through so much depressing stuff while researching this. Perhaps the most depressing of all is that when the C standard committee was defining C11, they introduced char16_t and char32_t, along with a full family of conversion functions... but specifically did not require that these types be any kind of Unicode. I don't even...

The second most depressing was finding out that Windows has at least five different "active" code pages in a process, all of which do different things. There might be a sixth, but I could never work out what that one was for, so I gave up.

11 Likes

I read through this twice, but I still don't actually know what problem you're trying to solve. Could you be more specific? I feel like you've assumed all of your readers know how C strings are used with respect to multibyte encodings, particularly on Windows. I for one feel like I'm missing a lot of context while reading your post. Could you add a bit more background please?

I'm also left wondering why you didn't address the purely byte-based APIs of CStr/CString. In particular, I was surprised at what appears to be your central thesis:

because they assume all C strings are encoded as UTF-8

But I feel like you never actually explained why this was true. In particular, I don't need to ever materialize a String or anything related to UTF-8 to roundtrip bytes through CStr/CString. As you mentioned, CString::new permits creating a CString from a Vec<u8> (and CStr has similar constructors for &[u8]), but the methods you left out, into_bytes and as_bytes and into_bytes_with_nul and as_bytes_with_nul all permit extracting the raw bytes out of a CString without ever caring about UTF-8 at all.

Stylistically speaking, I would also like to note that while I was reading your comment, I felt like I was being screamed at. It's not pleasant.

8 Likes

Every string has to be encoded somehow. If you're passing a string to, or getting a string from some C code, you have to account for however the C code expects that string to be encoded. CStr[ing] currently ignores this entirely, assuming Rust's string encoding and C's string encoding always match.

This isn't true in the abstract (C doesn't mandate any particular encoding for strings), and it's not true in practice. The C runtimes on Windows will not allow you to set UTF-8 as the C string encoding.

Beyond that, the encoding can be changed on the fly on other platforms to pretty much anything. Also consider that whilst this isn't really an issue for Rust embedding C (since the Rust code can just set the encoding once), it's less clear-cut for C embedding Rust. In particular, I've had encounters in the past with software that only works with a specific encoding, not whatever the system is configured to use by default. If the author of such (hopefully) legacy software wanted to integrate some Rust code, they'd currently find CStr[ing] very difficult to use.

The problem with CStr[ing] is that they don't actually provide an abstraction around C strings. They provide an abstraction around UTF-8 encoded, zero-terminated strings.

This is true, but one of the major uses for CStr[ing] is to interop with C code. They're not much use if you cannot look at their contents, or create one from a string. Here's the very first example for CString:

use std::ffi::CString;
use std::os::raw::c_char;

extern {
    fn my_printer(s: *const c_char);
}

let c_to_print = CString::new("Hello, world!").unwrap();
unsafe {
    my_printer(c_to_print.as_ptr());
}

Every use I've ever seen of CStr[ing] is either doing this (passing a string from Rust to C), or doing the opposite. The conversion functions going both ways require UTF-8.

Even the standard library makes this mistake, passing Rust strings directly into CString without first transcoding them. It also passes through OsStrs, which I'm less certain about.

I do have to admit: it is entirely possible to use CStr[ing] correctly: you just have to not use it with strings. I feel this rather conflicts with the purpose implied by the name.

I just have no idea where this comes from, but it's come up enough times that barring some conspiracy, it's a problem with me. Maybe I should give up on text and just post audio. That or just stop posting.

3 Likes

What I'm hearing from your second comment is that you think the CString API makes it too easy for folks to just assume that all C strings are UTF-8 strings. This is very different from "The CString API assumes that all C strings are UTF-8 encoded." For example, one might say that the existing API is pretty faithful to what C strings are---a sequence of bytes terminated by a NUL---with convenience routines for safely converting between C strings and Rust Strings in the case where the C string is valid UTF-8. Otherwise, the CString API completely punts on the issue of encoding altogether and leaves it up to the callers to deal with Vec<u8>. This seems like a reasonable design point to be at, so I guess I feel like I'm still missing where your pain is coming from. Maybe you'd be interested in a higher level more encoding aware API?

I'm also not sure I see the footguns here. If you're using the CString API with the various String methods on non-UTF-8 data, then I feel like you're going to run into an error pretty quickly. e.g., If your CString is UTF-16 encoded, then all of your Rust String methods are going to fail immediately and the lossy string conversion method is going to give you total junk.

Please don't stop posting. :slight_smile: Your second comment was much much nicer. :slight_smile:

6 Likes

When I’ve worked on CString I have not assumed it is a UTF-8 sequence, but I’ve been worried that this isn't documented more explicitly. One example: its Debug impl prints it as a “byte string” using the common way to represent byte strings – printable ASCII bytes as they are, and others with \xXY escapes. Which seems correct.

I didn’t see this as a problem with CString, but with Rust overall: it supports interchange with UTF-8 data everywhere (and only that) because it’s easy that way.

You have a good point, Rust libstd doesn’t care about other encodings and it creates an API that can mislead.

2 Likes

Zero-terminated is how CString defines the c-string format that it works with, so that is consistent. That's basically the extent of what it should care about in the format.

2 Likes

Adding more bullet points to https://github.com/rust-lang/rust/issues/29354 would be great, if we can turn this thread into some.

1 Like

How about this: if you want to call a libc function that takes a string, the only thing CStr[ing] helps you with is the zero-termination. Unless you are willing to implement the transcoding yourself (which is non-trivial and requires at least two code paths, possibly more), or you don't care about supporting Windows or other systems that don't use UTF-8 by default, CStr[ing] cannot be used to construct C strings or to pull the contents out of them.
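To give a sense of what "implement the transcoding yourself" entails, here's a minimal sketch of just one of those code paths. It assumes the libc crate, a locale already set via setlocale, and a platform where wchar_t holds one Unicode scalar value per unit (glibc-style; not true on Windows). It ignores interior NULs and error recovery entirely:

fn to_c_multibyte(s: &str) -> Option<Vec<u8>> {
    // Rust str -> NUL-terminated wide string, one Unicode scalar per wchar_t.
    let wide: Vec<libc::wchar_t> = s.chars()
        .map(|c| c as libc::wchar_t)
        .chain(std::iter::once(0))
        .collect();

    // Deliberately generous buffer; a real implementation would size it properly.
    let mut out = vec![0u8; wide.len() * 6];
    let n = unsafe {
        libc::wcstombs(out.as_mut_ptr() as *mut libc::c_char, wide.as_ptr(), out.len())
    };
    if n == usize::MAX {
        // (size_t)-1: some character has no representation in the current locale.
        return None;
    }
    out.truncate(n); // n excludes the terminating NUL that wcstombs wrote
    Some(out)
}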

A quick search of libstd reveals:

  • sys/unix/process/process_common.rs
    • CString::new("<string-with-nul>").unwrap().
    • CString::new("foo=bar").unwrap()
  • sys/unix/weak.rs
    • CString::new(name)
  • sys/windows/compat.rs
    • CString::new(symbol).unwrap();
  • sys/windows/dynamic_lib.rs
    • CString::new(symbol)?;
  • sys_common/net.rs
    • CString::new(host)?;
  • thread/mod.rs
    • CString::new(n)

All of these are constructing CStrings using Rust strings. Most appear to be cases where those strings are destined for C libraries, and not simple round-tripping back to Rust. That would make them wrong. If even the Rust developers themselves are getting this wrong, how is anyone reading "use CString for C strings" on StackOverflow supposed to get it right?

The reason I suspect this hasn't caused bigger problems before now comes down to three things. First, so far as I know, Linux and Mac systems default to UTF-8 these days. Second, most strings are ASCII, and non-ASCII-compatible encodings are probably very rare in practice; I've seen my fair share of programs that worked fine until you fed them Unicode, and then they exploded. Third, most libstd code specific to Windows (where you cannot use UTF-8) goes through totally different APIs that explicitly take UTF-16.

However, that doesn't cover third-party libraries people might want to use, or when being embedded by C code, that rely on regular C strings. For example, the cairo API explicitly uses UTF-8 for most text (so CString is fine there), but uses C strings for things like file paths.

My first comment was written when calm and in a fairly good mood. My second was written while feeling judged, frustrated and miserable. I consider not posting because I seem to consistently be interpreted as aggressive, and I don't understand why. If I can't communicate properly, I shouldn't communicate at all.

Technically, even that is wrong, but non-ASCII-compatible encodings are hopefully rare enough that we'll never have to care. Just as an example, "gªrçon" encodes as [135, 154, 153, 72, 150, 149, 0] in codepage 500, which I have a test for.

As an aside: what's really crazy is that once you set the codepage to 500, you have to encode C strings correctly, or you can't call setlocale to specify a different locale, because it won't understand the parameter any more!

I've hemmed and hawed about this a lot. I would contend (sadly without evidence) that most people would assume "C string" means "a string I can pass to libc functions", not purely that it's zero-terminated. And I say that in the sense that even if they say the latter, they behave as though they meant the former.

Sure.

I think that Rust is trying to leave this world of legacy encodings behind. I don't blame it.

One thing that comes up is that one could assume that "C String" means it must be allocated with malloc and freed with free, right? That too would be the user's misconception.

1 Like

I could only think of one pertinent one, which I've added.

Neither do I, but we're not there yet. Tragically.

This is actually something I tried to address with my (currently shelved) strffi crate. I had an abstraction for switching between allocators, but the default was malloc/free, so that transferring ownership between Rust and C was slightly more likely to work.

This is also something you see on StackOverflow. Although, you also see other problems, like translating void blah(const char *); as extern "C" { fn blah(s: CString); }.

I see two alternative solutions that don’t involve scrapping the whole thing:

  1. Document that CString is a bag of bytes of unknown encoding and add the necessary methods for working with non-UTF-8 CStrings.

  2. Change CString to CString<Encoding=UTF8>. This will allow CString<Encoding=CP500> and such to exist.

I agree that the UTF-8 default is pretty bad on Windows, but OTOH it’s perfectly fine on macOS and a sensible option in many Linux configurations.
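As a rough illustration of option 2 (all type names here are made up; nothing like this exists in std), the shape might be:

use std::marker::PhantomData;

// Marker types standing in for encodings known at compile time.
trait Encoding {}
struct Utf8;
struct Cp500;
impl Encoding for Utf8 {}
impl Encoding for Cp500 {}

// A zero-terminated byte string tagged with its encoding; defaulting the
// parameter to Utf8 keeps the current behaviour for existing code.
struct EncodedCString<E: Encoding = Utf8> {
    bytes: Vec<u8>,          // includes the trailing NUL
    _encoding: PhantomData<E>,
}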

2 Likes

The major worry I have with this approach is that the existing methods are named such that people are more likely to use those by default. This will lead to programs that appear to work; that's always the failure mode I fear the most, because it's not obvious to the programmer that something is wrong.

If the problematic methods were deprecated, and new ones written, that could work. My concern then would be that users now have to track what encoding a given string is in by themselves, but I have to admit that's less of an issue.

I actually did this in strffi, although what you actually want is CString<Encoding=CMultiByte>. The C multibyte encoding is not any specific encoding, since it can change on the fly.

(Just to be clear, I'm not proposing we support code that insists on flipping between encodings willy-nilly; but a program can start with one encoding, and end with another.)

Assuming you're talking about setting the Encoding parameter based on platform: again, I worry about people unintentionally writing platform-dependent code because they think "everyone uses UTF-8 these days, and it works for me!" Or the reverse, for that matter.

If code can be made to seamlessly work, by default, across all supported platforms... then why not?

1 Like

This sounds like you're implying CP500 is the default, which it's not.

I'm not sure why you're referencing wide characters all of a sudden. Wide-character strings are also an interesting issue (wholly unsupported by Rust atm, except on Windows through OsString) but they have little to do with the standard byte-based multi-byte C string.

Also I'm fairly certain wchar_t on Windows is UCS-2, not UTF-16 (i.e. codepoints above U+FFFF not allowed)

You've mentioned multibyte, but that's not compatible with char * APIs. Sorry, I've since learned that "multibyte" is the terminology the C stdlib uses for encodings that don't use "wide" (multi-byte wchar :unamused:) encodings.

As for support of code pages (encodings with 1 byte per character), I don't think it can be both seamless and correct.

Specifically, if you make CString use "system default codepage" it will be less broken than UTF-8, but it will still be broken (as someone who's written in Polish on computers in the '90s, I can insert the Bane meme here). The codepage is not a global constant, and the console can have its own encoding changed at run time. There won't be a correct way to use strings saved in files or exchanged with another machine.

If you want it correct, then it's necessary to very diligently label encoding of every string, with full awareness that every input and output may have its own different codepage.

I was thinking about using type system to specify the encoding (as PhantomData), so there'd be 0 overhead in memory for it (e.g. CString<Encoding=CP1250>), but I suppose it'd be too inflexible for the dynamism of Windows codepages, so it's better if CString stores an enum specifying actual encoding of its bytes.
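A sketch of that enum-tagged shape (names invented purely for illustration):

// The encoding travels with the bytes at runtime, so it can record whatever
// the codepage happened to be when the string was captured.
enum StringEncoding {
    Utf8,
    WindowsCodepage(u16), // e.g. 1250, 1252, 500...
    CLocale,              // "whatever the C runtime is currently set to"
}

struct TaggedCString {
    bytes: Vec<u8>,            // zero-terminated payload
    encoding: StringEncoding,  // recorded at construction time
}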

2 Likes

You sound like you didn't mean this seriously, but I think it's the Right Thing. The observation that the C runtime might be expecting (narrow) strings in some locale-specific encoding other than UTF-8 is really only the tip of the iceberg here. The locale, just as specified in C2011, has four active encodings (one each for arrays of char, wchar_t, char16_t, and char32_t), all of which may be different, and none of which is necessarily even Unicode-compatible — maybe we can ignore the "not necessarily Unicode-compatible" part until someone wants to use Rust on MVS, but still. (And yes, there have been operating systems where wchar_t not only wasn't Unicode, it varied with the locale. I recall tripping over this on Solaris in the early 2000s.)

On top of that, POSIX and Windows both add facilities for working with strings and files in encodings that don't correspond to the active locale, and there are third-party libraries that only work with a particular legacy encoding which doesn't necessarily correspond to any available locale. I have personal experience of this with libthai, and if you count UTF-16 as a legacy encoding, JavaScript interpreters also count.

So I suggest that we want the following family of types, names all provisional:

  • CharEncoding: A trait which defines character encodings, providing iconv-like functionality. (I don't think libstd has this now, but I could be wrong.)
    • UTF8, UTF16, UTF32, ISO8859_1, etc: use when the encoding is known at compile time.
    • DynamicEncoding: use when the encoding will only be known at runtime.
  • Text<E: CharEncoding> — Text in an encoding E. O(1) length calculation; not guaranteed to be NUL-terminated; may contain terminating NULs. Comes in str-like and String-like variants.
  • NulTerminatedText<E: CharEncoding> — as Text, but is NUL-terminated, may not contain terminating NULs, and length calculation may be O(n).
  • str, String — typedef for Text<UTF8>.
  • OsStr, OsString — typedef for NulTerminatedText<DynamicEncoding>?

CStr(ing) are scrapped.
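Roughly, the shape I have in mind, with placeholder definitions only (none of this exists in libstd, the conversion trait is pared down to a bare minimum, and only the owned variants are shown):

use std::marker::PhantomData;

// iconv-like conversion, defined per encoding.
trait CharEncoding {
    fn decode(bytes: &[u8]) -> Option<String>;
    fn encode(text: &str) -> Option<Vec<u8>>;
}

struct UTF8;
impl CharEncoding for UTF8 {
    fn decode(bytes: &[u8]) -> Option<String> {
        String::from_utf8(bytes.to_vec()).ok()
    }
    fn encode(text: &str) -> Option<Vec<u8>> {
        Some(text.as_bytes().to_vec())
    }
}

// Owned text in encoding E: O(1) length, may contain NULs, not NUL-terminated.
struct Text<E: CharEncoding> {
    bytes: Vec<u8>,
    _e: PhantomData<E>,
}

// As Text, but guaranteed NUL-terminated and free of interior NULs.
struct NulTerminatedText<E: CharEncoding> {
    bytes: Vec<u8>, // includes the trailing NUL
    _e: PhantomData<E>,
}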

3 Likes

As a non-expert in matters of text encoding, I'd like to point out that:

All of this encoding stuff is about legacy encodings and/or supporting interop with other systems outside the Rust ecosystem. As such, it doesn't belong in the stdlib, but rather in an external crate. But since the stdlib does have a few places where it itself needs to interop with Windows, that would be a good opportunity to rely on the benefits of rustbuild, which uses cargo and allows the stdlib to depend on external crates. In other words, non-UTF-8 string support is currently an implementation detail of the stdlib but IMO doesn't belong on its API surface.

1 Like

I think of char / wchar_t / char16_t as distinct string types, and not as string encodings (in contrast with e.g. ISO-8859-1 and CP1252 both of which can be used with char*).

As far as I understand CString is explicitly only for char* (8-bit 0-byte-terminated strings), and is not meant to be used in any case with any wchar_t, char32_t or any other multibyte string type.

Lack of compatibility between CString and wchar_t is not a design flaw in CString. It’s just completely unsupported by design. It’d be clearer if the type was called ZeroTerminated8BitString :wink: If that’s the point of confusion, perhaps the documentation/rust book could make it clearer that it was never meant to work with 16-bit chars at all?

You could create CString16, CString32, CWString and such to handle other kinds of strings.

2 Likes

it's UTF-16: