Pre-RFC: Separate reading/writing String from std::io::Read/std::io::Write

I’ve read through the IO reform RFC 517 and I would like to discuss how the read_to_string/write_fmt methods should work. PR #575 does not address this.

The problem is that while on Unix these days UTF-8 is almost always the right choice, on Windows there are three significantly represented options: UCS-2, UTF-8 (and occasionally its messed-up variants) and legacy charsets. When Microsoft pushed the switch to UCS-2, other software vendors usually lacked the manpower to convert everything or needed to maintain backward compatibility, so they kept using legacy charsets. In fact, even Microsoft does in many places: I don’t remember ever seeing UCS-2-encoded XML from them, but I’ve seen a lot of it encoded in Windows-1250. And multiplatform software often switched to UTF-8 despite the lack of support for it on Windows (it has been assigned codepage number 65001, but that value is not actually understood in many places, such as the wcstombs & co. functions).

So on Windows the programmer really needs to be able to choose the encoding for input and output.

Setting the encoding can’t be handled in a trait without complicating all implementations of it, so the obvious approach seems to be a wrapper that does the transcoding. This could be done completely in a library, except that it would be useful to provide a special implementation for the Windows console, and that makes more sense in the standard library. The issue with Windows stdin/stdout is that they have a wide API for when they are attached to an actual console, but otherwise should use transcoded data, usually in the legacy charset (except for Cygwin, which uses UTF-8).

So I’d like to propose:

  • Moving std::io::Read::read_to_string (and std::io::ReadExt::chars) and std::io::Write::write_fmt to separate traits (std::io::StringRead and std::io::StringWrite?); a rough sketch follows after this list.
  • Similarly for std::io::BufRead::read_line. Could it live in StringRead, optimized when the underlying source provides BufRead and unoptimized (still doable via chars()) when it does not?
  • Providing an implementation for UTF-8-encoded streams; other encodings can be provided in a library and possibly merged into the standard distribution later if desired.
  • A special implementation for Windows stdin, stdout and stderr. Given the general inconsistency, I think using UTF-8 when the streams are redirected would be perfectly fine for the time being.
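
To make the proposal concrete, here is a rough sketch of what the split traits could look like. The names StringRead/StringWrite and the exact signatures are placeholders for discussion, not part of RFC 517 or any existing design:

    use std::io;

    /// Hypothetical trait for string-oriented input, split out of std::io::Read.
    /// A transcoding wrapper (or a plain UTF-8 reader) would implement this.
    pub trait StringRead {
        /// Read the remaining input, transcode it and append it to `buf`,
        /// returning the number of bytes appended.
        fn read_to_string(&mut self, buf: &mut String) -> io::Result<usize>;

        /// Read a single line; can be optimized when the source is buffered,
        /// otherwise implementable on top of a chars()-style iterator.
        fn read_line(&mut self, buf: &mut String) -> io::Result<usize>;
    }

    /// Hypothetical trait for string-oriented output, split out of std::io::Write.
    pub trait StringWrite {
        /// Encode `s` into the target encoding and write it to the sink.
        fn write_str(&mut self, s: &str) -> io::Result<()>;

        /// Encode and write formatted text (what Write::write_fmt does today).
        fn write_fmt(&mut self, fmt: std::fmt::Arguments<'_>) -> io::Result<()>;
    }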

I think the methods should be removed from the base traits to avoid confusion between the raw streams and the transcoding ones. This also matches what Python does: stdin/stdout/stderr come preconfigured for Unicode input, but files have to be explicitly created via the codecs module when Unicode input/output is desired.

Thoughts? Should I try to make it into formal RFC (or formal update for RFC 517)? Or prototype it as a library first?

I have started a project to implement streaming codecs that would interoperate with the new I/O. But I have no working code yet.

I’ve only looked very briefly now. However:

  • There is no point in doing UTF-8; it already exists. Strings are UTF-8 in memory and already have to/from UTF-16 conversion and a codepoint iterator and collector.
  • I’d start with existing transcoders. iconv is in the standard library on Linux; on Darwin (Mac OS X) it is in a separate library, but seems to be always available. For Windows there is a stand-alone implementation, or we can use the native MultiByteToWideChar and WideCharToMultiByte functions at the cost of a bit of efficiency, since that conversion goes through UTF-16…
  • Why are you decoding into a custom struct instead of a String? That would seem much more practical.

I think I’ve understood the Decoded struct, but it still seems weird to have the output buffer owned by the decoder.

I would simply create these layers:

  • In the lowest layer, just replicate the iconv(3) interface, except with a return value:
    convert(input: &[u8], output: &mut [u8]) -> (Status, &[u8], &mut [u8])
    
  • Above it, a layer with the same interface but added error-handling options: Stop (stop at the invalid sequence, like the lower layer), Panic, Replace(char), ReplacePlusValue(char), Escape(&str, i32) (prefix + radix).
  • And on top of this, convenience methods to encode &str to Vec<u8> and decode &[u8] to String, plus wrappers for std::io::BufRead and std::io::Write to read and write strings; see the sketch below.

There is no need to fiddle with lifetimes anywhere and it should still be efficient and reasonably easy to use.
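
A minimal sketch of the two lower layers under those assumptions; Status, Converter and ErrorPolicy are names made up for illustration, and the details are only meant to show the shape of the interface:

    /// Outcome of one conversion step, loosely mirroring iconv(3)'s result codes.
    pub enum Status {
        /// All input was consumed and converted successfully.
        Finished,
        /// The output buffer is full; call again with more room.
        OutputFull,
        /// The input ends in the middle of a multi-byte sequence.
        InputIncomplete,
        /// An invalid sequence starts at the head of the returned input slice.
        InvalidSequence,
    }

    /// Lowest layer: a stateful converter with an iconv-like interface.
    /// The returned slices are the unconsumed tail of `input` and the unused
    /// tail of `output`, so the caller never manages offsets itself.
    pub trait Converter {
        fn convert<'i, 'o>(
            &mut self,
            input: &'i [u8],
            output: &'o mut [u8],
        ) -> (Status, &'i [u8], &'o mut [u8]);
    }

    /// Second layer: the same interface plus a policy for invalid sequences.
    pub enum ErrorPolicy {
        /// Stop at the invalid sequence, like the lower layer does.
        Stop,
        /// Panic on invalid input.
        Panic,
        /// Emit the given replacement character instead.
        Replace(char),
        /// Emit the replacement character followed by the offending value.
        ReplacePlusValue(char),
        /// Emit the prefix followed by the offending value in the given radix
        /// (&'static str only to keep the sketch free of lifetime parameters).
        Escape(&'static str, i32),
    }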

I have responded on the users’ forum, as discussion about a third-party crate does not have much to do with Rust internals.

Yes, the implementation can be discussed on users. However, I posted here because the main point is splitting the std::io traits. (I’ve re-titled the thread to make that clearer.)

True. I don’t think Read needs splitting, though. As my exercise shows (pending some yet-unpublished commits), a decoder can be composed as a general-purpose reader, and its implementation of read_to_string can be optimized by relying on the fact that decoded content is valid UTF-8. There may be use for an add-on trait to BufRead that would fill and return the internal buffer as a &str when UTF-8 content can be guaranteed (in my design that would be the decoding reader combining an inner BufRead source and a decoder).
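
One possible shape for that add-on trait; BufStrRead and fill_buf_str are hypothetical names, not anything that exists in std or in the crate under discussion:

    use std::io::{self, BufRead};

    /// Hypothetical extension trait for buffered readers whose internal buffer
    /// is guaranteed to hold valid UTF-8 (e.g. a decoding reader combining an
    /// inner BufRead source and a decoder).
    pub trait BufStrRead: BufRead {
        /// Fill the internal buffer and expose its contents as a &str without
        /// copying or re-validating; the caller consume()s what it has used.
        fn fill_buf_str(&mut self) -> io::Result<&str>;
    }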

For writing, I’m thinking of having the encoding writer implement something akin to std::fmt::Write, but with an error variant that exposes a possible std::io::Error, which is close to what you are suggesting. In fact, I don’t see why std::fmt::Error itself couldn’t carry an optional underlying I/O error.
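
As a sketch of what that could look like (TextWrite and TextWriteError are illustrative names only, not the actual API of any crate):

    use std::io;

    /// Hypothetical error type for a text-oriented writer: either a plain
    /// formatting failure or an underlying I/O error from the byte sink.
    #[derive(Debug)]
    pub enum TextWriteError {
        Fmt(std::fmt::Error),
        Io(io::Error),
    }

    /// A writer that only accepts &str, so UTF-8 validity is enforced by the
    /// type system rather than validated at run time (and unlike std::fmt::Write,
    /// its error can carry an io::Error).
    pub trait TextWrite {
        fn write_str(&mut self, s: &str) -> Result<(), TextWriteError>;

        fn write_fmt(&mut self, fmt: std::fmt::Arguments<'_>) -> Result<(), TextWriteError>;
    }

An encoding writer implementing such a trait could still drive the std::fmt machinery through write_str internally, via a small adapter that stashes the io::Error whenever std::fmt::Write reports a failure.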

I meant it not because Read would need splitting (the wrapper simply does not use those methods and implements its own versions), but because it could be confusing to have the raw streams do one thing and the wrappers another.

But then since the recoding is done in a wrapper it probably isn’t that confusing after all.

On second thought, std::fmt::Write is hardly sufficient for a stateful encoder, because it lacks a method to finalize the encoding, which may be necessary for some legacy encodings (or some broader applications of the encoding framework).

Would PR #770 on the RFCs repository solve that?

It proposes adding close() and flush() methods to std::io::Write and making the call to close() mandatory to ensure the output actually goes where it should.

That may be good for std::io, but it does not say anything about string-oriented output.

It does not have to say anything about string-oriented output. What matters is whether the methods can be used for finalizing the encoding.

If an encoding writer must implement std::io::Write just to make close available, that would go against the idea of providing only string-oriented output methods so that UTF-8 input is enforced at compile time rather than validated at run time. If there were an orthogonal trait just for close (or perhaps something with the semantics of into_inner, returning the underlying writer after the encoding is finalized), that would be more useful.
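
To illustrate, the orthogonal finalizer could be as small as this (Finish is a made-up name, and returning the writer alongside the error is just one option, similar in spirit to BufWriter::into_inner):

    use std::io;

    /// Hypothetical trait for finalizing a stateful encoding writer: write any
    /// closing sequence the encoding requires, flush, and hand back the
    /// underlying writer.
    pub trait Finish {
        /// The wrapped writer type returned once encoding is finalized.
        type Inner;

        /// Finalize the encoding and return the inner writer; on failure the
        /// writer is returned together with the error so it is not lost.
        fn finish(self) -> Result<Self::Inner, (Self::Inner, io::Error)>;
    }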

A parallel thread suggests @SimonSapin’s TextWriter as a proposed standard text-oriented writer API. For a finalizing method, I’m leaning towards an into_inner approach detailed here.

Are you aware of the encoding crate on crates.io? We use it in Servo.


Yes; I wanted a baggage-free experiment to plug into the new I/O and to try enabling low-copy codecs. encoding seems to be rather heavy and duplicates, by its own means, a lot of what became the new I/O. I don’t mind this being merged into the encoding crate at some point, or, vice versa, I could borrow some codec code from there.
