Absolutely not. I just find it somewhat amazing that there's an EBCDIC codepage available to use in a stock Windows install, when it is basically the same thing as CP1252 (just arranged differently). That it's there implies to me that someone is still using it, and thus I can't completely discount its existence.
Because the goal is to go between *char and Unicode in a portable, supported fashion, and wide strings are a part of the "how". C11 gives us *char32_t which can be UTF-32, but not all versions of MSVC supported for Rust have those functions. The closest I could find was using the wide string conversion functions to get to *wchar_t, which I know is UTF-16 on Windows, which I can then finish decoding.
The use of *wchar_t and *char32_t is just an implementation detail.
It's both and neither. It's supposed to be UTF-16, but nothing ever seems to validate this, so you can end up with stuff that's not valid UTF-16 but would be valid UCS-2. However, since I'm trying to get to Unicode, if I run into a *wchar_t that isn't valid UTF-16, then I can't decode it anyway, so it's a moot point. For all practical intents and purposes, *wchar_t on Windows is unvalidated UTF-16.
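To make the "unvalidated UTF-16" point concrete, here is a minimal sketch using only std. The helper name decode_wide is made up for illustration and is not part of the proposal; it just shows that a buffer of 16-bit units containing a lone surrogate cannot be decoded to Unicode, which is why the UCS-2 distinction stops mattering once Unicode is the target.

```rust
/// Hypothetical helper: decode 16-bit units as UTF-16, rejecting invalid input.
fn decode_wide(units: &[u16]) -> Option<String> {
    String::from_utf16(units).ok()
}

fn main() {
    // "hi" followed by U+1F600 encoded as a surrogate pair.
    let valid = [0x0068, 0x0069, 0xD83D, 0xDE00];
    // A lone high surrogate between two ASCII units: not valid UTF-16.
    let lone = [0x0068, 0xD800, 0x0069];

    assert_eq!(decode_wide(&valid).as_deref(), Some("hi\u{1F600}"));
    assert_eq!(decode_wide(&lone), None);

    // Lossy decoding substitutes U+FFFD instead of failing.
    assert_eq!(String::from_utf16_lossy(&lone), "h\u{FFFD}i");
}
```

Whether to fail or to substitute U+FFFD is a policy choice; the sketch only shows that both options are already available in std.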
I am saying "multibyte" because that's how the C functions for string conversions refer to them. For example: mbrtowc. I know the current encoding is not necessarily multibyte (either in the "size of individual units" or "length of multi-unit points" sense), but that's the name the functions are using, so it's the name I'm sticking to in lieu of something more accurate which is still succinct enough to use in practice.
Less broken is still an improvement. Also, I'm not proposing we use the "system default codepage". I'm proposing we use the current C runtime codepage. The thing that the C runtime itself is using.
I mean, yeah, there are a ton of other places that can have their own settings. On Windows alone you have the ANSI, OEM, console output, console input, and CRT code pages. This is something I was trying to deal with in the crate I was writing. But that crate's mothballed for now, so I wanted to at least do what I could for std and the people using it for the common case: talking to C code using run-of-the-mill C strings.
I would absolutely love to get more information on where all this can go wrong. I've tried to test the conversions under unusual settings, but it's hard to know if I'm missing something.
Ok, I wanted to keep this out of this particular proposal, but I kind of already wrote this library. As I've said above, though, it crashes the compiler so I've put it on hold for now.
The design I have is two main types: SeStr<S, E> and SeaString<S, E, A>. "S.E.A." sounds like "sea", sounds like "C", and it's for interop with C strings, among others. It also works as a mnemonic for remembering what the parameters are, and the order they appear in. The parameters are "Structure" (zero-term, double zero-term, unit prefix, byte prefix, slice, etc.), "Encoding" (C multibyte, C wide, UTF-X, Java's weird modified UTF-8, etc.), and "Allocator" (C malloc/free, Rust, the weird BSTR-specific allocator, etc.). That ensures you can more or less build any kind of foreign string from the component pieces.
The only thing it doesn't do at the moment is support runtime-variable encodings (hard to work out where to attach that information without custom DST).
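For readers trying to picture the S/E/A split, here is a rough sketch of how such a design could be laid out with marker types. Everything in it is an assumption made for illustration: the trait names, the marker structs, the NAME constants, the describe helper, and the choice of C malloc as the allocator for the wrapper are all invented here, and the borrowed type uses a lifetime instead of being a real unsized type. It is not the actual crate's API.

```rust
use std::marker::PhantomData;

// Hypothetical marker traits, one per type parameter.
trait Structure { const NAME: &'static str; }
trait Encoding { const NAME: &'static str; }
trait Allocator { const NAME: &'static str; }

// Hypothetical markers for one common combination.
struct ZeroTerm;
struct CMultibyte;
struct CMalloc;

impl Structure for ZeroTerm { const NAME: &'static str = "zero-terminated"; }
impl Encoding for CMultibyte { const NAME: &'static str = "C multibyte"; }
impl Allocator for CMalloc { const NAME: &'static str = "C malloc"; }

/// Borrowed form: only structure and encoding matter (simplified to a slice).
struct SeStr<'a, S: Structure, E: Encoding> {
    units: &'a [u8],
    _marker: PhantomData<(S, E)>,
}

/// Owned form: additionally records which allocator owns the buffer.
/// A Vec stands in for the foreign allocation in this sketch.
struct SeaString<S: Structure, E: Encoding, A: Allocator> {
    units: Vec<u8>,
    _marker: PhantomData<(S, E, A)>,
}

impl<S: Structure, E: Encoding, A: Allocator> SeaString<S, E, A> {
    fn from_units(units: Vec<u8>) -> Self {
        SeaString { units, _marker: PhantomData }
    }

    fn as_se_str(&self) -> SeStr<'_, S, E> {
        SeStr { units: &self.units, _marker: PhantomData }
    }

    fn describe() -> String {
        format!("{} / {} / {}", S::NAME, E::NAME, A::NAME)
    }
}

/// A wrapper with the generics flattened away (allocator choice assumed).
struct ZMbString(SeaString<ZeroTerm, CMultibyte, CMalloc>);

fn main() {
    let s = ZMbString(SeaString::from_units(b"hello\0".to_vec()));
    assert_eq!(s.0.as_se_str().units.len(), 6);
    println!("{}", SeaString::<ZeroTerm, CMultibyte, CMalloc>::describe());
}
```

The sketch also shows why runtime-variable encodings are awkward in this shape: the encoding lives purely in the type, so there is nowhere in the value to store a codepage chosen at runtime.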
I also started making wrappers for "common" combinations, to make the documentation a little less incomprehensible. The proposed ZMbStr and ZMbString would just be wrapper types from strffi copy+pasted into std, with the generics flattened.
The issue I have with this is that this has already been stabilised, meaning people are using it and depend on it. I don't think it's very nice to deprecate an std API with the message "uh, use an external crate, I guess? lol". Maybe not worded exactly like that. It also sucks to know the code you need exists, and is shipped with the compiler, but you can't use it.
And as I've said above: said crate exists, it just breaks rustc at the moment. I'd have published it before making this post if I'd been smart enough to get it to not do that.
I think mine is called ZRaw8Str. As I intimated above, I intended to solve this problem very thoroughly.