[Pre-RFC] Stabilize UTF-16 encoding in std

SimonSapin · May 28, 2015, 1:02pm

Proposals:

Rename str::utf16_units and Utf16Units to str::encode_utf16 and Utf16Encoder, respectively
Stabilize them

Motivation

While “code unit” is the technically correct term, I suspect many users don’t know or care exactly what one is.
UTF-16 handling is simple and stable enough that it belongs in std IMO.
People are asking for it.
UTF-16 decoding is already stable, through String::from_utf16.
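
To illustrate the round trip this would enable, here is a minimal sketch. It assumes the proposed `encode_utf16` is a method on `str` that yields `u16` code units the way `utf16_units` does, and it decodes with the already-stable `String::from_utf16`. The helper name `utf16_round_trip` is hypothetical and not part of the proposal.

```rust
/// Hypothetical helper: encode `s` to UTF-16 code units, then decode it back.
fn utf16_round_trip(s: &str) -> (Vec<u16>, String) {
    // One u16 per character in the Basic Multilingual Plane,
    // two (a surrogate pair) for anything above U+FFFF.
    let units: Vec<u16> = s.encode_utf16().collect();

    // Decoding fails on unpaired surrogates. That cannot happen here,
    // because the units were produced from a valid `str`.
    let decoded = String::from_utf16(&units).expect("units came from a valid str");

    (units, decoded)
}
```

Calling `utf16_round_trip("a𝄞")` should give back the original text plus three code units: one for `a` and a surrogate pair for `𝄞` (U+1D11E).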

Bonus proposals

Only slightly related, but while we’re at it: change char::encode_utf8 and char::encode_utf16 to return iterators, and stabilize them.

There is precedent for returning “short” iterators, e.g. in char::lowercase. This also means there is no need to expose a generalized Utf16Encoder that takes any char iterator as input: one can use Iterator::flat_map with char::encode_utf16 instead.
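
A rough sketch of the flat_map idea follows. The bonus proposal assumes `char::encode_utf16` would itself return an iterator. In Rust today it instead writes into a caller-supplied `&mut [u16]` and returns the filled part, so the closure below adapts that form. The helper name `encode_chars_utf16` is hypothetical.

```rust
/// Hypothetical helper: UTF-16 encode any sequence of chars
/// without a dedicated encoder type.
fn encode_chars_utf16<I>(chars: I) -> Vec<u16>
where
    I: IntoIterator<Item = char>,
{
    chars
        .into_iter()
        .flat_map(|c| {
            // A char needs at most two UTF-16 code units.
            let mut buf = [0u16; 2];
            // Buffer-based form: returns the sub-slice that was written.
            let len = c.encode_utf16(&mut buf).len();
            // Yield only the code units that were written (1 or 2).
            // `buf` is Copy, so the closure owns its own copy.
            (0..len).map(move |i| buf[i])
        })
        .collect()
}
```

With an iterator-returning `char::encode_utf16`, the closure body would reduce to a single call. The extra copying here is exactly the kind of overhead the performance question in this thread is about.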

CC @aturon, @alexcrichton

alexcrichton · May 28, 2015, 4:48pm

Note that we'd probably call the iterator EncodeUtf16 to match iterator naming conventions. I don't have too many opinions on this name, but it does seem to match char::encode_utf16 well.

In the past these haven't returned iterators due to performance concerns, but I agree that returning an iterator would be more idiomatic, and perhaps LLVM has improved or mistakes were made originally?

SimonSapin · May 28, 2015, 10:12pm

EncodeUtf16 sounds fine.

I don’t know about performance of iterators vs a &mut [u16] parameter. If it’s a concern, let’s defer the “bonus” part of this thread.

aturon · June 3, 2015, 8:55pm

Sorry for the very late reply. I’m in favor of this change. @brson originally investigated using iterators here; he may be able to tell you more about the perf issues he ran into. Assuming those can be dealt with, moving to an iterator would be great.

brson · June 4, 2015, 6:07pm

I don’t remember any details, but the iterator I wrote had a lot of new branches compared to the existing code, and it did not optimize well.

system · March 25, 2019, 8:10am

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.
