[pre-RFC] Deprecate and replace CStr/CString

Absolutely not. I just find it somewhat amazing that there's an EBCDIC codepage available to use in a stock Windows install, when it is basically the same thing as CP1252 (just arranged differently). That it's there implies to me that someone is still using it, and thus I can't completely discount its existence.

Because the goal is to go between char* and Unicode in a portable, supported fashion, and wide strings are a part of the "how". C11 gives us char32_t*, which can be UTF-32, but not all versions of MSVC supported for Rust have those functions. The closest I could find was using the wide string conversion functions to get to wchar_t*, which I know is UTF-16 on Windows, which I can then finish decoding.

The use of wchar_t* and char32_t* is just an implementation detail.

It's both and neither. It's supposed to be UTF-16, but nothing ever seems to validate this, so you can end up with stuff that's not valid UTF-16 but would be valid UCS-2. However, since I'm trying to get to Unicode, if I run into a wchar_t* string that isn't valid UTF-16, then I can't decode it anyway, so it's a moot point. For all practical intents and purposes, wchar_t* on Windows is unvalidated UTF-16.

I am saying "multibyte" because that's how the C functions for string conversions refer to them. For example: mbrtowc. I know the current encoding is not necessarily multibyte (either in the "size of individual units" or "length of multi-unit points" sense), but that's the name the functions are using, so it's the name I'm sticking to in lieu of something more accurate which is still succinct enough to use in practice.
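For anyone who wants to picture the round trip, here's a minimal sketch of the "multibyte → wide → Unicode" conversion I'm describing. It assumes a Unix-like libc where wchar_t is 32 bits and holds Unicode scalar values (glibc does this; Windows does not), and that setlocale has already been called so the C runtime knows its multibyte encoding:

```rust
// Sketch only: decode a NUL-terminated string in the current C multibyte
// encoding by going through wchar_t. Assumes a Unix-like libc with a 32-bit
// wchar_t holding Unicode scalar values, and that setlocale(LC_ALL, "") has
// already been called.
use std::os::raw::c_char;

type WChar = i32; // assumption: 32-bit wchar_t (not true on Windows)

extern "C" {
    // Standard C prototype: size_t mbstowcs(wchar_t *dest, const char *src, size_t n);
    fn mbstowcs(dest: *mut WChar, src: *const c_char, n: usize) -> usize;
}

unsafe fn decode_current_mb(src: *const c_char) -> Option<String> {
    // POSIX: a null dest asks for the converted length without writing anything.
    let len = mbstowcs(std::ptr::null_mut(), src, 0);
    if len == usize::MAX {
        return None; // not valid in the current multibyte encoding
    }
    let mut wide = vec![0 as WChar; len + 1];
    mbstowcs(wide.as_mut_ptr(), src, len + 1);
    wide.truncate(len);
    // Finish decoding: each wchar_t is (assumed to be) a Unicode scalar value.
    wide.iter()
        .map(|&w| char::from_u32(w as u32))
        .collect::<Option<String>>()
}
```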

Less broken is still an improvement. Also, I'm not proposing we use the "system default codepage". I'm proposing we use the current C runtime codepage. The thing that the C runtime itself is using.

I mean, yeah, there are a ton of other places that can have their own settings. On Windows alone you have the ANSI, OEM, console output, console input, and CRT code pages. This is something I was trying to deal with in the crate I was writing. But that crate's mothballed for now, so I wanted to at least do what I could for std and the people using it for the common case: talking to C code using run-of-the-mill C strings.

I would absolutely love to get more information on where all this can go wrong. I've tried to test the conversions under unusual settings, but it's hard to know if I'm missing something.

Ok, I wanted to keep this out of this particular proposal, but I kind of already wrote this library. As I've said above, though, it crashes the compiler so I've put it on hold for now.

The design I have is two main types: SeStr<S, E> and SeaString<S, E, A>. "S.E.A." sounds like "sea", sounds like "C", and it's for interop with C strings, among others. It also works as a mnemonic for remembering what the parameters are, and the order they appear in. The parameters are "Structure" (zero-term, double zero-term, unit prefix, byte prefix, slice, etc.), "Encoding" (C multibyte, C wide, UTF-X, Java's weird modified UTF-8, etc.), and "Allocator" (C malloc/free, Rust, the weird BSTR-specific allocator, etc.). That ensures you can more or less build any kind of foreign string from the component pieces.
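To make that a bit more concrete, here's a rough, purely illustrative sketch of the parameterization — the names are stand-ins, not the actual strffi API:

```rust
// Illustrative only; these names stand in for the Structure / Encoding /
// Allocator parameters described above, not the real strffi types.
use std::marker::PhantomData;

// "Structure": how the string's extent is represented.
struct ZeroTerminated; // classic NUL-terminated
struct SliceStructure; // pointer + length

// "Encoding": what the code units mean.
struct CMultibyte;     // whatever the C runtime's locale currently says
struct UnvalidatedUtf8;

// "Allocator": who owns and frees the buffer.
struct CMalloc;
struct RustAlloc;

/// Borrowed foreign string, parameterized by structure and encoding.
struct SeStr<S, E>(PhantomData<(S, E)>); // data pointer/length would live here

/// Owned foreign string: structure + encoding + allocator.
struct SeaString<S, E, A>(PhantomData<(S, E, A)>); // owned buffer would live here

// A flattened convenience alias in the spirit of the proposed ZMbString:
type ZMbString = SeaString<ZeroTerminated, CMultibyte, RustAlloc>;
```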

The only thing it doesn't do at the moment is support runtime-variable encodings (hard to work out where to attach that information without custom DST).

I also started making wrappers for "common" combinations, to make the documentation a little less incomprehensible. The proposed ZMbStr and ZMbString would just be wrapper types from strffi copy+pasted into std, with the generics flattened.

The issue I have with this is that this has already been stabilised, meaning people are using it and depend on it. I don't think it's very nice to deprecate an std API with the message "uh, use an external crate, I guess? lol". Maybe not worded exactly like that. It also sucks to know the code you need exists, and is shipped with the compiler, but you can't use it.

And as I've said above: said crate exists, it just breaks rustc at the moment. I'd have published it before making this post if I'd been smart enough to get it to not do that.

I think mine is called ZRaw8Str. As I intimated above, I intended to solve this problem very thoroughly.

I see CString is definitely not flexible enough for your needs. I still wouldn't go as far as deprecating CString because:

  • While C has lots of string types, the 8-bit 0-terminated char * of ambiguous code page is the “classic” one, and it’s the primary type used in the C standard library.

  • ZRaw8Str would be a more precise name for it, but the Z prefix is mostly Windows jargon. The CString name is not unusual: macOS calls it a C string, Python's FFI calls it a C string, and Java has GetStringUTFChars/NewStringUTF whose documentation doesn't even bother to mention it's 0-terminated.

  • Rust uses a single allocator. It is a problem for all kinds of buffers, not only strings. Adding an allocator parameter to types may be a good idea, but it'd probably be better to add it to Vec first, and then String and CString would follow. For now it's only an efficiency issue, and doesn't prevent being correct (if you make copies).

  • Rust already has separate String, OsString and CString types. It could have been one String<Encoding=Utf8>, String<Encoding=Os>, String<Encoding=ZRaw8>, but the arbitrary decision went the other way. So now a consistent way would be to add new types for new kinds of strings, e.g. C16String or WideString.

1 Like

If you want a string type for wchar_t / char16_t, it should be fine.

If you want a string type for the legacy 8-bit codepages, oh, that's a mess. Fortunately probably the worst hacks are now obsolete, like non-Unicode fonts that determine the de-facto encoding (Wingdings being the last oddity that can be ignored).

The biggest problem is that the program will inevitably create pipelines where strings of different encodings are mixed and converted back and forth, sometimes even unintentionally, and there's no single good strategy to deal with that.

You need to decide what encoding string concatenation uses. If you concat CP1250 + ISO-8859-7, does it result in the first encoding, the second encoding, the default codepage, or UTF-8, or UTF-16?

  • if you always pick one of the 8-bit encodings, you'll end up dropping some characters in some situations.
  • if you always pick a Unicode encoding, it'll tend to spread and most of your program will end up using that encoding, and have to convert back to 8-bit on output.
  • if you make the decision dynamically to avoid needlessly converting encodings, then the type will end up being known only at run time, and all your string functions will need to have a match for the encodings, which may be another performance problem.
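To make that last point concrete, here's a tiny sketch (the encodings and shape are picked arbitrarily) of what run-time-tagged strings force on every operation:

```rust
// Illustration only: once the encoding is known only at run time, every
// operation has to dispatch on it, and mixed-encoding cases need a policy.
enum Encoding {
    Cp1250,
    Iso88597, // ISO-8859-7
    Utf8,
}

struct DynString {
    encoding: Encoding,
    bytes: Vec<u8>,
}

impl DynString {
    fn concat(self, other: DynString) -> DynString {
        match (self.encoding, other.encoding) {
            // Same encoding: cheap append.
            (Encoding::Utf8, Encoding::Utf8) => DynString {
                encoding: Encoding::Utf8,
                bytes: {
                    let mut b = self.bytes;
                    b.extend(other.bytes);
                    b
                },
            },
            // Mixed encodings: something has to be transcoded, and the
            // result's encoding is a policy decision, not a given.
            _ => unimplemented!("pick a target encoding and transcode"),
        }
    }
}
```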

In-place string modification is doomed to fail (str[i] = char), because it has to use the encoding of the destination.

In the end most internationalized systems and programs have switched to the strategy of converting all input to a single superset encoding (UTF-8 or UTF-16), using single encoding throughout, and converting to a specific encoding on output. It costs to encode to and from the internal encoding, but it's the most manageable solution, and you avoid risk of doing the conversion accidentally multiple times in the program (e.g. imagine a scenario where your program's config file ends up being encoded in a different encoding than the default system codepage — now every interaction of the system with your config causes string encoding changes back and forth).

1 Like

This is why I didn't want to bring the design of strffi up. I'm not saying CString needs to be replaced by SeaString. CString should be the "one size fits most people" knife that does the job 80+% of the time because it cuts stuff good enough. SeaString is the giant cabinet filled with every conceivable kind of blade with interchangeable handles and weights that crazy people use when they need to de-bone a rhinoceros.

I don't think CString needs to support every encoding, and every structure. I do think it should support the most obvious one implied by its name. If that can't be done, it should be replaced so that people don't unknowingly use the wrong one.

To be clear: as much as I don't like the incorrectness of CStr[ing], even less do I like the potential for people to unwittingly make a mistake that's easy to go unnoticed. Given that there are numerous cases in Rust's standard library that appear to make this very mistake, I don't think that point can simply be ignored.

But it's not of an ambiguous code page, in the sense that we don't know what it is. C strings are encoded in the current multibyte encoding, unless you're dealing with an exception like cairo's text functions. Every example I've seen of Rust code that currently uses CStr[ing] and creates one from a Rust string wants the C string to be in this encoding, but currently it isn't.

Also, if a C function wants a string not in this encoding, these days the odds are that it wants UTF-8, which is also why I proposed adding a type specifically for that. If it's not the current C runtime multibyte encoding or UTF-8, then that's where you can break out the strffi box of toys.

It's just an abbreviation for the structure parameter to keep names from being ridiculously long. Technically, the closest analog to the current CString in strffi would be ZUtf8RString, which is a wrapper around SeaString<ZeroTerminated, UnvalidatedUtf8, RustAlloc>. Its selection had nothing to do with anything in Windows; it was just the first letter of "ZeroTerminated" and was distinct. For comparison, the same thing using slices for the structure would be SUtf8RString, but this is all kinda tangential. There's a method to the madness.

Speaking in terms of the proposed CString replacement for libstd, MbString would also be fine. For me, the important thing is not being in a situation where old documentation on the internet directs people to the wrong type. If more focused changes are made (such as deprecating just the problematic methods), then MbString wouldn't be needed, and we could keep CString. Again, I just want to avoid people being bitten by making the "obvious" choice.

Neither of those say that "C strings" don't have any particular encoding. The Python one in particular specifically says that Unicode strings will be transcoded using the current default encoding (though I don't know how the Python runtime's encoding and the C runtime's encoding interact). Python also separates "unknown lump of 8-bit codes" into a totally distinct type code: et.

I feel like we're talking at cross purposes here. My primary concern is that the type called CString, which one might reasonably want to use to pass a Rust string you have to C, does not handle encoding that string, making it unfit for that purpose unless the current environment happens to be configured the right way.

To put it another way: I believe CString the type interprets the name "C string" too broadly, and its interface is trivial to use incorrectly, and such incorrect use will not cause any obvious problems on less than comprehensive auditing.

I don't think that's reasonable. Of the ones I can recall off the top of my head, there's ZMbCString, ZWCString, ZUtf8CString, ZUtf16CString, ZUtf16beCString, ZzMutf8CString, PbUtf16BstrString, PbRaw8BstrString, ZAnsiCString, ZOemCString, SAnsiCString, SOemCString, SWCString, SAnsiWinString, SWWinString, GoUtf8RString. Oh! And there's bstrings (which are not BSTRs). And also the twin console codepages on Windows, too.

This is why I wanted to build the toolbox instead of trying to enumerate all of them. If I left any one out, people might be forced to either try and push fixes upstream, or reimplement it themselves when 80% of the code probably already exists. And I absolutely do not think this level of complexity belongs in libstd.

If we're talking about C (if you're doing this in Rust, you do it with Rust strings in Unicode as far as I'm concerned), then my understanding is that all your char* strings should be in the current multibyte encoding (as set by setlocale). If you have strings in different encodings, and you try to concatenate them, then that's a bug in your program that libc cannot be expected to deal with.

Again, I feel like we're arguing about different things. I'm not concerned with turning CString into the be-all, end-all representation of foreign strings. I just want it to use the encoding that functions in libc are going to expect. Like, if someone constructs a path in Rust, and wants to pass it to the cairo function that saves a PNG to disk, they need to encode the path into the encoding libc expects.

Yes, it would be nice if they used entirely Unicode-aware APIs. That's not always an option, though. If they're binding cairo, and they're on Windows, or for some reason running in an environment where setlocale returns something other than UTF-8... well, better to work as best it can, rather than cause weird behaviour or to have the image saved under a garbled filename.

That's pretty much exactly what I think should be done. I view CStr[ing] as an "input/output" step. String manipulation in Rust code should be done using Rust strings in Unicode. Once you go to call a C function, you need to transcode the string to whatever that function is going to expect. It's possible that's UTF-8, or Latin-1, or that really old DOS codepage that ZIP uses, but the reasonable default would be the encoding that libc itself is using for all of the strings it deals with.
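For illustration, here's a hedged sketch of that output step — the mirror image of the decoding sketch earlier in the thread — again assuming a Unix-like libc with a 32-bit wchar_t holding Unicode scalars and a locale already set up via setlocale:

```rust
// Sketch only: encode a Rust string into the C runtime's current multibyte
// encoding before handing it to a C function. Assumes a Unix-like libc with a
// 32-bit wchar_t holding Unicode scalar values and setlocale already called.
use std::ffi::CString;
use std::os::raw::c_char;

extern "C" {
    // Standard C prototype: size_t wcstombs(char *dest, const wchar_t *src, size_t n);
    fn wcstombs(dest: *mut c_char, src: *const i32, n: usize) -> usize;
}

/// Encode `s` into the current C multibyte encoding, NUL-terminated.
/// Returns None if some character has no representation in that encoding.
fn encode_current_mb(s: &str) -> Option<CString> {
    // wcstombs wants a NUL-terminated wide string.
    let wide: Vec<i32> = s.chars().map(|c| c as i32).chain(Some(0)).collect();
    unsafe {
        // POSIX: a null dest asks for the required byte length without writing.
        let len = wcstombs(std::ptr::null_mut(), wide.as_ptr(), 0);
        if len == usize::MAX {
            return None; // unrepresentable character
        }
        let mut bytes = vec![0u8; len + 1];
        wcstombs(bytes.as_mut_ptr() as *mut c_char, wide.as_ptr(), len + 1);
        bytes.truncate(len); // drop the NUL; CString::new adds it back
        CString::new(bytes).ok()
    }
}
```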

1 Like

Why should CString be tied to the C runtime and its current encoding or whatever? It’s a container for a 0-terminated chunk of bytes.

I think this is just an addition to the not-list:

  • Not necessarily valid UTF-8
  • Not necessarily allocated with libc’s malloc
  • Not necessarily in the libc’s current encoding

The point about to_str assuming UTF-8 is taken, but I don’t think CString is that complicated. A complicated different type could be developed, sure…

4 Likes

Perhaps there should be more variants of to_str method (to_str_encoding(Encoding enum) or to_str_current_libc_locale() and such) to make authors stop and think which one is needed for a particular situation?
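Something like the following, purely hypothetical, signatures — nothing like this exists in std, and the names are guesses:

```rust
// Hypothetical sketch of the idea above; none of this exists in std.
use std::borrow::Cow;

enum Encoding {
    Utf8,
    Latin1,
    CurrentCLocale,
    // ...
}

struct DecodeError; // hypothetical error type

// Would be implemented for CStr.
trait CStrDecodeExt {
    /// Decode assuming a caller-specified encoding.
    fn to_str_with_encoding(&self, enc: Encoding) -> Result<Cow<'_, str>, DecodeError>;
    /// Decode assuming whatever encoding the C runtime's locale currently names.
    fn to_str_current_libc_locale(&self) -> Result<Cow<'_, str>, DecodeError>;
}
```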

4 Likes

I thought I had a basic grasp of how text encodings worked, and I’ve read all the posts so far, but I still don’t understand what is wrong with CString.

I think the main claim is that CString is supposed to be using whatever the “C multibyte encoding” is, but given how much that can vary I don’t see how that’s an improvement over “unknown encoding”. To clarify: “unknown encoding” is what I thought the current CString type represented, and after re-reading its docs I’m still under that impression (though it ought to say something more explicit). Does it actually mean something else?

1 Like

Large parts of UNIX also treat strings as simply 0-terminated chunks of bytes, so CString is consistent with that. For example the filesystem doesn’t care if you put invalid UTF-8 or control characters into filenames. So the only criticism that seems valid is about making it too easy to put UTF-8 into a CString, which may be incompatible with how those bytes will actually be used by whatever C call is made. So perhaps there’s a case for deprecating the UTF-8-related calls?
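For what it's worth, the "just bytes" behaviour is easy to see from Rust today (Unix-only, using real std APIs):

```rust
// On Unix, a filename that is not valid UTF-8 is perfectly legal, and
// OsString/CString carry it unharmed.
#[cfg(unix)]
fn main() -> std::io::Result<()> {
    use std::ffi::OsStr;
    use std::fs::File;
    use std::os::unix::ffi::OsStrExt;

    // 0xC0 0xAF is not valid UTF-8, but it's a fine byte sequence for a name.
    let name = OsStr::from_bytes(b"not-utf8-\xC0\xAF.txt");
    File::create(name)?;              // works
    assert!(name.to_str().is_none()); // but it is not a &str
    Ok(())
}
```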

3 Likes

How about the following:

  • Deprecate CStr::to_str and CStr::to_str_lossy in favor of the more descriptive CStr::to_utf8 and CStr::to_utf8_lossy.
  • Deprecate CString::new in favor of fn from_bytes(bytes: Vec<u8>) (note the lack of Into, so strings won't work directly).
  • Deprecate CString::into_string in favor of CString::into_utf8_string (or a better name that somehow makes it explicit that it assumes UTF-8).

The documentation for the new methods would explicitly state that UTF-8 may not be the correct encoding. There should also be a way to encode strings in non-Unicode encodings, but I think that should have a uniform interface for encoding to byte arrays, CStrings, and possibly OsStrings (I don't know if that makes sense for Windows). However, I'm not sure if such encoding belongs in the standard library or an external crate. Maybe an Encoding trait and some basic encodings (e.g. UTF-8, ASCII, and maybe UTF-16 and ISO-8859-1) could be part of std, and additional encodings are added by crates?
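A rough sketch of what that surface could look like, using the names from this post (hypothetical; the signatures are guesses):

```rust
// Hypothetical sketch of the deprecation proposal above; nothing here exists
// in std under these names.
use std::borrow::Cow;
use std::ffi::{IntoStringError, NulError};
use std::str::Utf8Error;

// Would be implemented for CStr.
trait ProposedCStrExt {
    /// Today's to_str, renamed so the UTF-8 assumption is visible.
    fn to_utf8(&self) -> Result<&str, Utf8Error>;
    fn to_utf8_lossy(&self) -> Cow<'_, str>;
}

// Would be implemented for CString.
trait ProposedCStringExt: Sized {
    /// Replaces CString::new; takes bytes only, so a &str won't coerce silently.
    fn from_bytes(bytes: Vec<u8>) -> Result<Self, NulError>;
    /// Replaces into_string; the name makes the UTF-8 assumption explicit.
    fn into_utf8_string(self) -> Result<String, IntoStringError>;
}
```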

1 Like

C strings are encoded in the current multibyte encoding, unless you’re dealing with an exception like cairo’s text functions.

I believe, as others have pointed out, that this statement is simply false, and is at the root of the confusion. It assumes that a nul-terminated char* will never exist unless it came directly from a human via a Windows-mediated GUI.

In reality, the standard C library, as well as many other libraries, use C strings to hold all sorts of things that may or may not have any relationship to Unicode. Environment variables, file paths on posix systems, URLs, file paths on other computers, bytes of net protocols, etc.

Whenever a CString is used with C code, you are writing unsafe code, and you get exactly the guarantees regarding encoding that C gives you, which are none, except that it is nul-terminated…

5 Likes

Except that you’re the one that seems to be confusing terminology…

C apparently uses the term “multibyte-encoding” to mean whatever the encoding is of the bytes contained in a char* which is configured by calling the setlocale() function.

That means that on Windows, paths, environment variables, etc. will be assumed to be encoded in some Windows code page as configured in your regional settings, and it may or may not be Unicode.

Don't forget that Windows has strong backwards compatibility guarantees and it still allows one to execute DOS programs from the 80s. Those predate the invention of Unicode. For that matter, C itself and its terminology obviously predates Unicode as well.

1 Like

Here’s my perspective on the Windows side of things:

On Windows, any C program with an int main(int argc, char** argv) already incurs data loss before the program even starts running (when the command-line arguments are converted to MBCS by the C runtime), and thus may be unable to open argv[1] as a file.

Given the existence of files that cannot be named in MBCS, any correct way to handle text on Windows MUST NOT involve MBCS. There are three main approaches to handle text on Windows:

  1. The Microsoft-recommended way: build with -DUNICODE, use TCHAR/WCHAR everywhere. Rust code interfacing with C code written in this manner will have to use OsString / UTF-16 wchar_t.

  2. Use UTF-8 char* internally; convert from/to UTF-16 wchar_t when interfacing with the Windows API or C-runtime. This lets most of the code use char* consistently across platforms. The current CString works great to interface with such C code.

  3. The legacy way: Use MBCS char*, and accept the data loss on inputs. Blame the users when they put non-MBCS characters in their filenames. Such software also often has internal encoding bugs where it confuses UTF-8 file contents with MBCS. Simple solution: blame the users as soon as they use non-ASCII characters. (this approach works well because most users can’t tell the difference between non-MBCS and non-ASCII)

Rust can kinda interface with such C code – CString works, but the CString<->String conversion “works” only if you use the “blame users on non-ASCII” approach.

I think Rust should add MBCS conversion functions to CString. Rust could also deprecate+rename the existing UTF-8 CString conversion functions, but I don’t think that’s necessary.
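For reference, the UTF-16 side is already reasonably served by std; here's a small Windows-only sketch of the conversion helpers that approaches 1 and 2 need, using real std APIs:

```rust
// Windows-only sketch: going between Rust strings and the NUL-terminated
// UTF-16 that W-suffixed Windows APIs expect.
#[cfg(windows)]
mod wide {
    use std::ffi::{OsStr, OsString};
    use std::os::windows::ffi::{OsStrExt, OsStringExt};

    /// OsStr -> NUL-terminated UTF-16 buffer for passing to a *W API.
    pub fn to_wide_nul(s: &OsStr) -> Vec<u16> {
        s.encode_wide().chain(std::iter::once(0)).collect()
    }

    /// UTF-16 units coming back from Windows -> OsString (may contain
    /// unpaired surrogates, which OsString can hold but &str cannot).
    pub fn from_wide(units: &[u16]) -> OsString {
        OsString::from_wide(units)
    }
}
```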

2 Likes

Adding parameters to CString conversion methods indicating only one of the encodings involved seems incorrect to me, since perhaps the most important fact about CString encoding (I think) is that CStrings don’t have a programmatically type-system-specified encoding. So, for instance, to_utf8 wouldn’t be correct, because it wouldn’t indicate what assumptions are made about the encoding used in the CString itself.

If standard library functionality for converting to/from arbitrary encodings is provided in the Rust standard library, I’d suggest that the API must clearly indicate (or parameterize) both the source and the destination encodings. I’d also suggest that while CString might be a good choice of types to use for the inputs and outputs of conversion functions, it’s not clear to me that the functions themselves should necessarily be methods of CString. In particular, going from 8-bit encodings to Windows wide encodings and vice-versa should probably involve different data types, one (possibly CString) containing u8s and the other containing u16s.

The encoding of the Rust string is always UTF-8

I’m sorry, I don’t really see what that’s responding to.

Sorry, I mean you don't need to specify both the source and destination encodings, because one of them (the Rust string) will always be UTF-8. In fact no encoding/decoding is necessary if the CString is in UTF-8.

I agree with your point about not having the methods on CString though. I think CString should be treated like a Vec<u8>, but with a terminating null byte. As such I think it makes sense for the conversion methods to be owned by String (and &str) as they are for byte arrays.

Here are likely explanations why this hasn't blown up:

  • As BurntSushi points out, Rust's CString doesn't actually assume anything about the encoding (as it mostly deals with [u8] and Vec<u8>). It only makes it easy to convert to Rust strings when you can assume UTF-8.

  • On Windows, as you note, UTF-8 cannot be set as the encoding used by system APIs that take strings consisting of 8-bit code units, so using CString with Windows system APIs is wrong (except when working around specific bugs in Windows) and, instead, on Windows, system APIs that take UTF-16 strings should be (and are) used instead.

  • On non-Windows platforms, the reason why Rust code needs to call into libc is to perform system calls. The part of libc that wraps system calls does not perform locale-aware text processing and is, therefore, oblivious to the codeset of strings. (The wide-character IO part of libc should never be used.) The part of libc that is locale-aware performs userland operations, and those operations should be implemented in Rust instead of being delegated to the unreasonably-designed C standard library functions. (C programs shouldn't call those functions, either, to escape the fundamental misdesign of the C standard library!)

  • The bogus i18n approach of the C standard library, where the interpretation of strings in APIs depends on the configuration environment, does not apply to all C libraries. As a particularly prominent example, in all GLib-based code (Gnome, etc., including Cairo that you mention) strings are always UTF-8 (except for filenames, which are opaque bytes unless displayed to the user and which are interpreted as UTF-8 for display purposes by default; to override the default display interpretation, you need to set the G_BROKEN_FILENAMES environment variable, whose name indicates a clueful attitude towards these issues).

  • On macOS, iOS, Android, Sailfish and OpenBSD the string encoding in system C string text APIs is UTF-8 (though OpenBSD also supports the POSIX C locale that makes the encoding US-ASCII).

  • Red Hat has defaulted to UTF-8 since 2002, SuSE since 2004 and Debian since 2007 (the most prominent Debian derivative, Ubuntu, defaulted to UTF-8 before Debian itself).

  • Solaris has at least supported UTF-8 locales since Solaris 8. Non-OpenBSD BSDs at least support UTF-8 locales. (It's unclear to me how exactly the defaults work.)

As noted above, libc should only be used as a wrapper for system calls.

It is indeed the case that Windows doesn't allow UTF-8 as the codepage of a process. Concluding that Rust's CString is wrong is the wrong conclusion though. The right conclusion is that on Windows only the UTF-16 APIs should be used to interact with the system.

Again, libc should only be used as a system call wrapper, and that part of libc doesn't care about the character encoding.

Here we agree. The conclusion shouldn't be for Rust to accommodate the bogosity of the C standard library but the conclusion should be to treat the text processing parts of the C standard library as so fundamentally wrong as to not provide any accommodations for using them. (C code shouldn't use the text processing parts of the C standard library, either.)

Don't use codepage 500. It's not the default either for terminal or for "ANSI" mode APIs for any Windows locale. Setting the code page for a process to something weird that's supported by Microsoft's converters and that isn't prohibited like UTF-8 is a self-inflicted problem. Rust doesn't need to fix it.

I strongly disagree with adding encoding conversion functionality for legacy encodings other than UTF-16 to the standard library. As noted above, UTF-16 is the right way to interface with Windows, and UTF-8 is either the only way, the default way, or the only sensible way to deal with non-Windows systems. APIs that take non-UTF strings should be shunned and avoided. To the extent there exists legacy-encoded data in the world, Rust programs should perform conversion to UTF-8 immediately after performing byte-oriented input operations, but the standard library should not be bloated with the conversion tables.

The scope of what encodings someone might find a fringe use case for is too vast for the standard library. Doing away with legacy encodings is a feature, and it's great that Rust has that feature.

It is a good thing if software developers stop allowing fringe legacy configurations (which is what Posixish platforms with non-UTF-8 locale configurations are) to inflict negative externalities on them.

People who configure a non-Windows system with a non-UTF-8 locale this day and age are engaging in anti-social (towards software developers) fringe activity, and I strongly think that they should bear the cost themselves and the Rust community should refuse to complicate the standard library to accommodate such behavior.

(For Windows, use UTF-16 to sidestep the legacy locale-dependent stuff.)

That C11 doesn't unambiguously make the interpretation of char16_t strings UTF-16 and the interpretation of char32_t strings UTF-32 highlights how badly in the weeds the C standards committee is when it comes to i18n and text processing. Systems where signed integers are not two's complement have at least existed, but at the time char16_t and char32_t were added there was no present or foreseeable reasonable interpretation for them other than UTF-16 and UTF-32, respectively; refusing to commit there makes even less sense than refusing to commit to two's complement, which shows how utterly unreasonable the C standard is on these matters. Considering that Rust already doesn't seek to interoperate with non-two's-complement C implementations, it shouldn't be a radical idea that Rust shouldn't try to interoperate with C implementations that give char16_t and char32_t an interpretation other than UTF-16 and UTF-32.

But the issue is mostly moot, since, again, Rust should only use libc as a system call interface and avoid the userland text processing parts.

wchar_t is such a disaster that it shouldn't be used for any purpose other than calling into UTF-16 Windows APIs declared to take wchar_t (in which case the Rust type is known to always be u16).

Out of curiosity, was this something like taking a two-byte EUC sequence as a big-endian integer and putting that into wchar_t?

More likely it means that it's a leftover from the era when Microsoft worked for IBM (and IBM wants its catalog of legacy encodings supported) and nobody thought to explicitly prohibit it as the code page of a Windows process (like UTF-8 is explicitly prohibited).

One might take the view that Windows is very large like the Web is very large and at that scale there is always someone who does every ill-advised thing. Still, I think that Rust should not in any way accommodate someone doing this particular ill-advised thing.

Furthermore, I think that software designers should resist the lure of legacy encodings. They are an attractive nuisance. Legacy encodings fascinate software developers, because there's so much opportunity to geek out about all manner of quirks and weirdness. But it's wrong to think that there is value to legacy encodings or that you gotta catch them all. They are a problem that should be dealt with only to the extent needed to make sense of legacy data and no more.

(Note how CONTRIBUTING.md for encoding_rs sets clear expectations of scope.)

Yeah, if the current UTF-16 facilities in the standard library aren't convenient enough for interfacing with Windows, introducing something that makes dealing with the fact that NT uses UTF-16 but Rust uses UTF-8 more convenient would be appropriate.

15 Likes

I don't know of any useful POSIX functions that care about the setlocale encoding - file I/O functions, getaddrinfo and dlsym expect the input to be in the ill-defined "system encoding" (which is not affected by LC_CTYPE).

It seems that it's mostly Windows A functions that care about the process locale, and they are deprecated (Rust programs should call the W functions instead).

4 Likes

Am I the only person here getting a headache trying to understand the subtleties of the C standard and how it affects Rust? How would a user like myself know, in an approachable and portable way, which C functions depend on setlocale() without reading the C standard?

I think the idea to remove Rust’s dependency on libc to interact with the underlying system just got my support. That would require wrapping the platform specific APIs ourselves in Rust instead of relying on libc to be the portable layer, but at least the semantics will be properly defined and safe without crazy C loopholes.

To sum up, there are two cases: Windows, and non-UTF-8 Unix.

  • On Windows, some system APIs may expect strings encoded in the current code page. The code page can be changed, but not to UTF-8 or UTF-32, and not to UTF-16 except in managed applications (apparently). Therefore there is always the potential for data loss, so the correct answer is to not use those functions at all, preferring the wchar_t-based 'W' versions. There should probably be a wchar_t variant of CString in the standard library; CString itself is still useful for libraries that use UTF-8 everywhere.

  • Non-UTF-8 Unix: If there is a problem, it is nowhere near limited to CString. OsStr is just an arbitrary collection of bytes, but OsStr::to_str and friends assume UTF-8; this is also what you get from env::args, File::read_to_string, and others. On the output side, io::Write::write_fmt assumes UTF-8, so plain old println! is broken in a non-UTF-8 locale. Properly supporting non-UTF-8 systems would require adding conversions in all these places, which I suspect is not going to happen. If it does, I suppose it would be worth distinguishing “theoretically current-C-encoding-encoded bag of bytes” C strings, as used by some libc functions, and “theoretically UTF-8-encoded bag of bytes”, as used by glib and other libraries. But to be clear, this only matters for display and user input purposes. For all other purposes, you want to preserve the original binary blobs. (In theory it also matters for hardcoded strings, such as standard path names (e.g. /dev/null). But I don’t think there is any non-negligible use of non-ASCII-superset encodings as C locales, such as, e.g., UTF-16 or EBCDIC; so it should be safe to encode in ASCII.)

3 Likes