I’ve read through the IO Reform RFC 517 and I would like to discuss how the read_to_string/write_fmt methods should work. PR #575 does not address this.
The problem is that while on Unix UTF-8 is almost always the right choice these days, on Windows three options are significantly represented: UCS-2, UTF-8 (and occasionally its messed-up variants), and legacy charsets. When Microsoft pushed the switch to UCS-2, other software vendors usually lacked the manpower to convert everything or needed to maintain backward compatibility, so they kept using legacy charsets. In fact, even Microsoft does in many places: I don’t remember ever seeing UCS-2-encoded XML from them, but I’ve seen a lot of it encoded in Windows-1250. And multiplatform software often switched to UTF-8 despite the lack of support for it on Windows (it has been assigned codepage number 65001, but many places, such as the wcstombs() family of functions, do not actually understand that value).
So on Windows the programmer really needs to be able to choose encoding for input and output.
Setting the encoding can’t be handled in a trait without complicating all implementations of it, so the obvious way seems to be a wrapper that does the transcoding. This could be done completely in a library, except it would be useful to provide a special implementation for the Windows console, and that makes more sense in the standard library. The issue with Windows stdin/stdout is that they have a wide API for when they are attached to an actual console, but otherwise should use transcoded data, usually in the legacy charset (except under Cygwin, which uses UTF-8).
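To make the wrapper idea concrete, here is a minimal sketch. The name `DecodingReader` is hypothetical, and the codec is hard-wired to UTF-8; a real version would be parameterized over an encoding (Windows-1250, the console codepage, etc.) and decode incrementally rather than slurping the whole stream:

```rust
use std::io::{self, Read};

// Hypothetical wrapper: turns a raw byte stream into decoded strings.
// The codec is fixed to UTF-8 here purely for illustration.
struct DecodingReader<R: Read> {
    inner: R,
}

impl<R: Read> DecodingReader<R> {
    fn new(inner: R) -> Self {
        DecodingReader { inner }
    }

    // Read the whole stream and decode it; invalid sequences are reported
    // as an error rather than silently replaced.
    fn read_to_string(&mut self) -> io::Result<String> {
        let mut buf = Vec::new();
        self.inner.read_to_end(&mut buf)?;
        String::from_utf8(buf)
            .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
    }
}

fn main() -> io::Result<()> {
    // &[u8] implements Read, so it stands in for a file or socket here.
    let mut r = DecodingReader::new(&b"hello \xc5\xa1"[..]); // "hello š" in UTF-8
    let s = r.read_to_string()?;
    assert_eq!(s, "hello š");
    Ok(())
}
```

The point is that the base stream stays a plain byte `Read`, and the choice of encoding lives entirely in the wrapper.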
So I’d like to propose:
- Moving std::io::Read::read_to_string (and std::io::ReadExt::chars) and std::io::Write::write_fmt to separate traits (std::io::StringRead and std::io::StringWrite?).
- Similarly for std::io::BufRead::read_line. Could it live in StringRead, optimized when the underlying source provides BufRead and unoptimized (still doable via chars()) when it does not?
- Providing an implementation for UTF-8-encoded streams; other encodings can be provided in a library and possibly merged into the standard distribution later if desired.
- A special implementation for Windows stdin, stdout, and stderr. Given the general inconsistency, I think using UTF-8 when the streams are redirected would be perfectly fine for the time being.
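The proposed split could look roughly like the sketch below. The trait and type names (`StringRead`, `StringWrite`, `Utf8Stream`) follow the proposal above but the exact signatures are my assumption, written against current std conventions; the UTF-8 implementation simply delegates to the byte-level API, and read_line takes the optimized path because the inner source is BufRead:

```rust
use std::io::{self, BufRead, Read, Write};

// Sketch: string-level operations live in their own traits instead of
// sitting on the raw Read/Write traits. Names are placeholders.
trait StringRead {
    fn read_to_string(&mut self, buf: &mut String) -> io::Result<usize>;
    fn read_line(&mut self, buf: &mut String) -> io::Result<usize>;
}

trait StringWrite {
    fn write_fmt(&mut self, fmt: std::fmt::Arguments) -> io::Result<()>;
}

// Implementation for UTF-8-encoded streams; other encodings would come
// from wrapper types in an external crate.
struct Utf8Stream<S>(S);

impl<S: BufRead> StringRead for Utf8Stream<S> {
    fn read_to_string(&mut self, buf: &mut String) -> io::Result<usize> {
        // Delegates to the byte-level Read, which already validates UTF-8.
        self.0.read_to_string(buf)
    }
    fn read_line(&mut self, buf: &mut String) -> io::Result<usize> {
        // Optimized path: the inner source is BufRead, so we can scan
        // its buffer for the newline instead of going char by char.
        self.0.read_line(buf)
    }
}

impl<S: Write> StringWrite for Utf8Stream<S> {
    fn write_fmt(&mut self, fmt: std::fmt::Arguments) -> io::Result<()> {
        self.0.write_fmt(fmt)
    }
}

fn main() -> io::Result<()> {
    // &[u8] implements BufRead, Vec<u8> implements Write.
    let mut input = Utf8Stream(&b"first line\nsecond"[..]);
    let mut line = String::new();
    input.read_line(&mut line)?;
    assert_eq!(line, "first line\n");

    let mut out = Utf8Stream(Vec::new());
    StringWrite::write_fmt(&mut out, format_args!("x = {}", 42))?;
    assert_eq!(out.0, b"x = 42".to_vec());
    Ok(())
}
```

A Windows console implementation of the same traits would use the wide API when attached to a console and transcode otherwise, without the caller having to care.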
I think the methods should be removed from the base traits to avoid confusion between the raw streams and the transcoding ones. This also matches what Python does: stdin/stdout/stderr come preconfigured for Unicode, but files have to be explicitly created via the codecs module when Unicode input/output is desired.
Thoughts? Should I try to turn this into a formal RFC (or a formal update to RFC 517)? Or prototype it as a library first?