[Pre-RFC] Stabilize UTF-16 encoding in std

Proposals:

  • Rename str::utf16_units and Utf16Units to encode_utf16 and Utf16Encoder
  • Stabilize them

Motivation

  • While “code unit” is the technically correct term, I suspect many users don’t know or care what exactly a code unit is, so encode_utf16 is the more discoverable name.
  • UTF-16 handling is simple and stable enough that it belongs in std IMO.
  • People are asking for it
  • UTF-16 decoding is already stable, through String::from_utf16.
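
To make the symmetry with String::from_utf16 concrete, here is a minimal sketch of a round trip, assuming the proposed name str::encode_utf16 (this is the name std ended up with). The helper names to_utf16 and from_utf16_checked are made up for illustration and are not part of std.

```rust
/// Hypothetical helper: collect the UTF-16 code units of a string slice.
fn to_utf16(s: &str) -> Vec<u16> {
    // `str::encode_utf16` yields one `u16` per code unit; characters
    // outside the Basic Multilingual Plane produce a surrogate pair.
    s.encode_utf16().collect()
}

/// Hypothetical helper: decode UTF-16, returning `None` on invalid input
/// such as an unpaired surrogate.
fn from_utf16_checked(units: &[u16]) -> Option<String> {
    String::from_utf16(units).ok()
}
```

For example, to_utf16("a𝄞") gives three code units (one for "a", two for U+1D11E), and feeding them back through from_utf16_checked returns the original string.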

Bonus proposals

Only slightly related, but while we’re at it: change char::encode_utf8 and char::encode_utf16 to return iterators, and stabilize them.

There is precedent for returning “short” iterators, e.g. char::to_lowercase. This also means there is no need to expose a generalized Utf16Encoder that takes any char iterator as input: one can use Iterator::flat_map with char::encode_utf16.
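
A rough sketch of the flat_map idea, for illustration only. It assumes the buffer-based signature that char::encode_utf16 has in stable Rust today (it writes into a &mut [u16] and returns the written part) rather than the iterator-returning version proposed here, so the closure copies the code units out of a small array. The function name encode_chars_utf16 is hypothetical.

```rust
/// Hypothetical helper (not part of std): UTF-16-encode any `char` iterator.
fn encode_chars_utf16<I>(chars: I) -> impl Iterator<Item = u16>
where
    I: Iterator<Item = char>,
{
    chars.flat_map(|c| {
        // A single `char` needs at most two UTF-16 code units.
        let mut buf = [0u16; 2];
        // Stable `char::encode_utf16` fills `buf` and returns the written slice.
        let len = c.encode_utf16(&mut buf).len();
        // Yield the units by value so the iterator does not borrow `buf`.
        IntoIterator::into_iter(buf).take(len)
    })
}
```

With this, encode_chars_utf16("a𝄞".chars()) should yield the same code units as "a𝄞".encode_utf16(). If char::encode_utf16 returned an iterator directly, the closure body would shrink to a single call.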

CC @aturon, @alexcrichton

Note that we'd probably call the iterator EncodeUtf16 to match iterator naming conventions. I don't have too many opinions on this name, but it does seem to match char::encode_utf16 well.

In the past these haven't returned iterators due to performance concerns, but I agree that returning an iterator would be more idiomatic. Perhaps LLVM has improved since then, or mistakes were made in the original implementation?

EncodeUtf16 sounds fine.

I don’t know about performance of iterators vs a &mut [u16] parameter. If it’s a concern, let’s defer the “bonus” part of this thread.

Sorry for the very late reply. I’m in favor of this change. @brson originally investigated using iterators here; he may be able to tell you more about the perf issues he ran into. Assuming those can be dealt with, moving to an iterator would be great.

I don’t remember any details, but the iterator I wrote had a lot of new branches compared to the existing code, and it did not optimize well.
