The problem is that while UTF-8 is almost always the right choice on Unix these days, three options are significantly represented on Windows: UCS-2, UTF-8 (and occasionally its messed-up variants) and legacy charsets. When Microsoft pushed the switch to UCS-2, other software vendors usually lacked the manpower to convert everything or needed to maintain backward compatibility, so they kept using legacy charsets. In fact, even Microsoft does in many places: I don’t remember ever seeing UCS-2-encoded XML from them, but I’ve seen a lot of it encoded in Windows-1250. And then multiplatform software often switched to UTF-8 despite the lack of support for it on Windows (it has been assigned codepage number 65001, but Windows does not actually understand that value in many places like the
So on Windows the programmer really needs to be able to choose the encoding for input and output.
Setting the encoding can’t be handled in a trait without complicating all implementations of it, so the obvious way seems to be creating a wrapper to do the transcoding. This could be done completely in a library, except that it would be useful to provide a special implementation for the Windows console, and that makes more sense in the standard library. The issue with Windows stdin/stdout is that they have a wide API for when they are attached to an actual console, but otherwise should use transcoded data, usually in the legacy charset (except for Cygwin, which uses UTF-8).
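To make the wrapper idea a bit more concrete, here is a minimal sketch of what such a transcoding wrapper might look like. `Latin1Writer` and its `write_str` method are names I made up for illustration, not anything in `std`. ISO-8859-1 stands in for a legacy charset only because it needs no mapping table; a real Windows-1250 wrapper would need one.

```rust
use std::io::{self, Write};

/// Hypothetical wrapper: takes Rust strings (UTF-8 in memory) and writes
/// them to `inner` as ISO-8859-1 bytes.
struct Latin1Writer<W: Write> {
    inner: W,
}

impl<W: Write> Latin1Writer<W> {
    fn new(inner: W) -> Self {
        Latin1Writer { inner }
    }

    /// Transcode `text` and write it to the underlying byte stream.
    /// Characters outside U+0000..=U+00FF are replaced with `?`
    /// (an assumption of this sketch; returning an error is another option).
    fn write_str(&mut self, text: &str) -> io::Result<()> {
        let bytes: Vec<u8> = text
            .chars()
            .map(|c| {
                let cp = c as u32;
                if cp <= 0xFF { cp as u8 } else { b'?' }
            })
            .collect();
        self.inner.write_all(&bytes)
    }

    /// Give back the wrapped byte stream.
    fn into_inner(self) -> W {
        self.inner
    }
}
```

The underlying `Write` implementation stays a plain byte sink and knows nothing about encodings, which is the point of doing this in a wrapper rather than in the trait.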
So I’d like to propose:
- Moving the `std::io::Write::write_fmt` functions to separate traits (
- Similarly for `std::io::BufRead::read_line`. Could it be in `StringRead` and optimized if the underlying source provides `BufRead` and unoptimized (should still be doable with `chars()`) when not?
- Providing an implementation for UTF-8-encoded streams; other encodings can be provided in a library and possibly merged into the standard distribution later if desired.
- Special implementation for Windows `stderr`. Given the general inconsistency, I think using the UTF-8 encoding when the streams are redirected would be perfectly fine for the time being.
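A rough sketch of how the separate traits and the UTF-8 implementation could fit together. `StringRead` is the name used above; `StringWrite`, `Utf8Stream` and the exact method signatures are assumptions of mine for illustration, not a worked-out design. Only the `BufRead`-backed case is shown; covering plain `Read` sources too would need a separate wrapper type or specialization, which this sketch does not attempt.

```rust
use std::fmt;
use std::io::{self, BufRead, Write};

/// Hypothetical text-level output trait, separate from byte-level `Write`.
trait StringWrite {
    fn write_str(&mut self, text: &str) -> io::Result<()>;

    /// Having `write_fmt` here lets `write!` work on text streams.
    fn write_fmt(&mut self, args: fmt::Arguments<'_>) -> io::Result<()> {
        self.write_str(&args.to_string())
    }
}

/// Hypothetical text-level input trait.
trait StringRead {
    /// Appends one line (with its newline, if present) to `buf` and
    /// returns the number of bytes consumed from the source.
    fn read_line(&mut self, buf: &mut String) -> io::Result<usize>;
}

/// Hypothetical wrapper marking a byte stream as UTF-8 encoded.
struct Utf8Stream<T> {
    inner: T,
}

impl<W: Write> StringWrite for Utf8Stream<W> {
    fn write_str(&mut self, text: &str) -> io::Result<()> {
        // Rust strings are already UTF-8, so no transcoding is needed.
        self.inner.write_all(text.as_bytes())
    }
}

impl<R: BufRead> StringRead for Utf8Stream<R> {
    fn read_line(&mut self, buf: &mut String) -> io::Result<usize> {
        // `BufRead::read_line` already validates UTF-8 and fails with
        // `InvalidData` on malformed input.
        self.inner.read_line(buf)
    }
}
```

With these traits in scope, `write!(stream, "{}", value)` resolves to `StringWrite::write_fmt` on a `Utf8Stream`, while the wrapped raw stream keeps only its byte-level methods.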
I think the methods should be removed from the base traits to avoid confusion between the raw streams and the transcoding ones. This also matches what Python does: stdin/stdout/stderr come preconfigured for Unicode input, but files have to be explicitly created via the codecs module when Unicode input/output is desired.
Thoughts? Should I try to turn it into a formal RFC (or a formal update to RFC 517)? Or prototype it as a library first?