ASCII methods for u16

ColinFinck · January 9, 2021, 5:17pm

Hi all! Is there a specific reason why is_ascii, to_ascii_uppercase, to_ascii_lowercase, and friends are implemented for u8 but not for u16? Would a PR to provide similar implementations for u16 be accepted upstream?

The Windows as well as the UEFI world use UTF-16 strings extensively, which makes it hard at times to interact with UTF-8 strings in Rust code. On the other hand, Rust has implemented str::encode_utf16 since 1.8.0, so these issues shouldn't be unknown to you.

I take it that the methods were originally implemented to match ctype.h from C (see "<ctype.h> functions for AsciiExt", issue #39658 in rust-lang/rust). An implementation on u16 would thereby match C's wctype.h.

Best regards,

Colin Finck

Tom-Phinney · January 9, 2021, 6:11pm

Are you querying about methods on u16, or on UTF-16? They are not identical.

ASCII, or US-ASCII, is a 7-bit character code for the more common characters in US English. As such it can be conveyed by [u8] and is often used in programming languages such as C.

ColinFinck · January 9, 2021, 6:49pm

@Tom-Phinney Rust's encode_utf16 returns an iterator over u16, each value representing a single character or surrogate. A wide-string (UTF-16) originating from a Windows API would also be represented as an array of u16 values.
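To make the code-unit view concrete, here is a minimal sketch of what encode_utf16 yields. The helper name utf16_units is made up for illustration; it only wraps the standard str::encode_utf16 iterator.

```rust
/// Hypothetical helper: collects the UTF-16 code units of a &str.
/// Characters in the Basic Multilingual Plane produce one u16 each;
/// everything above U+FFFF produces a surrogate pair (two u16 values).
fn utf16_units(s: &str) -> Vec<u16> {
    s.encode_utf16().collect()
}
```

For a string such as "A𝄞", the ASCII letter occupies a single unit while the musical symbol (U+1D11E) occupies two, so the unit count differs from the character count.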

Which is why I would find it very convenient to have wctype.h-like methods on u16 to convert between uppercase/lowercase for characters within the 7-bit ASCII range. Just like this is already possible for single-byte characters represented as u8.

chrisd · January 9, 2021, 7:21pm

I don't think this is merely an oversight. I think Rust purposely doesn't do UTF-16 beyond conversions. The slogan is "UTF-8 everywhere". So you're meant to immediately convert to UTF-8 before you operate on a string and then convert back to UTF-16 only at the point where the string leaves your program (e.g. it's passed through FFI).

In short I think there are philosophical reasons why these aren't in the standard library. Not everybody agrees with this reasoning though.

ColinFinck · January 9, 2021, 9:15pm

@chrisd I know about Rust's preference for UTF-8 and it's obviously right for a language of the 21st century. Nevertheless, it also supports the already mentioned encode_utf16 and various ASCII functions on pure byte strings (u8).

I'm not asking for more than the same set of ASCII functions on 2-byte strings (u16). This would be a tremendous help when dealing with UTF-16 in no_std environments (like UEFI), where allocations and therefore conversions aren't always possible.
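As a rough sketch of what such a function could look like without any allocation, the following uppercases ASCII letters in a UTF-16 buffer in place. The name make_ascii_uppercase_u16 is hypothetical (modelled on the existing make_ascii_uppercase for byte slices) and is not a standard library API.

```rust
/// Hypothetical helper, not part of std: uppercases ASCII letters in a
/// buffer of UTF-16 code units in place. All other units, including
/// surrogates and non-ASCII letters, are left untouched.
fn make_ascii_uppercase_u16(units: &mut [u16]) {
    for unit in units.iter_mut() {
        // 0x61..=0x7A is 'a'..='z'; subtracting 0x20 maps to 'A'..='Z'.
        if matches!(*unit, 0x61..=0x7A) {
            *unit -= 0x20;
        }
    }
}
```

The function uses only slice iteration and integer arithmetic, so it should work the same way in a no_std environment.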

jdahlstrom · January 9, 2021, 9:25pm

C’s wchar_t is not defined to be uint16_t or indeed have any specific size or encoding. On Linux, for example, it’s typically uint32_t (but rarely used in practice because UTF-8 is the norm).

Having ”character” methods on builtin integer types (as opposed to semantically proper wrapper types) is dubious anyway; with u8 it’s somewhat justifiable for historical reasons, standard C interop, and because u8 is the code unit of Rust’s standard string encoding. UTF-16 support, outside of basic conversion to/from UTF-8, is something better relegated to an external crate.

kornel · January 10, 2021, 5:52pm

This is an answer under the assumption that the question meant methods for 16-bit encodings similar to methods for 8-bit encodings, not merely methods for 8-bit encodings in 16-bit units.

It's because for Unicode Rust has these methods on the char type. Rust doesn't have a dedicated char type for ASCII, so it used u8 for it.

A single u16 can't represent every UTF-16 code point. UTF-16 is not a fixed-width 16-bit encoding. It's a variable-width encoding, and therefore something like to_uppercase would be ill-defined on a lone u16, because that could be only one half of a surrogate pair.

The confusion between u16 and UTF-16 is the root cause of the whole encoding mess Windows got itself into. It predates the UTF-16 encoding, and started off with what we now call UCS-2. Since UCS-2 is now a legacy encoding that isn't actually used anywhere other than by mistake, it's very unlikely that Rust would add special support for it.

If you're really dealing with UCS-2, not UTF-16, then you can use something like this:

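fn is_ucs2_uppercase(c: u16) -> bool {
    // Values in 0xD800..=0xDFFF are surrogates, not valid chars, so
    // from_u32 returns None for them; treat those as "not uppercase"
    // instead of panicking on unwrap().
    std::char::from_u32(u32::from(c)).map_or(false, |ch| ch.is_uppercase())
}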

but if you're working with Windows' encoding, then it is UTF-16 now, so you should use char for code points, not u16. std::char::decode_utf16 is the right function for this.
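A minimal sketch of that approach, using only std::char::decode_utf16 and std::char::REPLACEMENT_CHARACTER from the standard library. The two helper names are invented for illustration.

```rust
use std::char::{decode_utf16, REPLACEMENT_CHARACTER};

/// Hypothetical helper: decodes UTF-16 code units into a String,
/// replacing every unpaired surrogate with U+FFFD.
fn decode_lossy(units: &[u16]) -> String {
    decode_utf16(units.iter().copied())
        .map(|r| r.unwrap_or(REPLACEMENT_CHARACTER))
        .collect()
}

/// Hypothetical helper: counts uppercase code points, skipping any
/// unpaired surrogates instead of failing on them.
fn count_uppercase(units: &[u16]) -> usize {
    decode_utf16(units.iter().copied())
        .filter_map(Result::ok)
        .filter(|c| c.is_uppercase())
        .count()
}
```

Because decode_utf16 yields a Result per code point, an unpaired surrogate surfaces as an error value rather than a panic, and the caller decides whether to replace, skip, or reject it.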

withoutboats · January 10, 2021, 6:26pm

I see a lot of lecturing about unicode in this thread, but I don't see any reason why Rust shouldn't have ASCII conveniences for u16 in the same way we have them for u8. They would exist for the exact same reason the u8 ones do: sometimes users are just dealing with ASCII, and choose not to support unicode for whatever reason specific to their situation. Sometimes that ASCII is in u8s and sometimes it's in u16s, depending on the platform. Why should users with u16s have to suffer?

These methods are given a clear name that they are limited to ASCII to warn users that using them is not compatible with supporting unicode, and their use is not encouraged by our documentation or our API, which gives strong preference for the UTF8 string types.

I would be inclined to merge a PR that adds these to nightly. The only reason they don't exist yet, as far as I am aware, is that no one has tried to add them.

withoutboats · January 10, 2021, 6:39pm

Also note that Rust definitely does not have a "UTF-8 everywhere" philosophy: it has a "UTF-8 by default" philosophy, but we actually put users through a lot of pain so that non-UTF-8 data is handled transparently, without special work, wherever it can appear. The primary example of this is our Path API, which has several unpleasant aspects that derive from the fact that paths do not have to be UTF-8. We absolutely don't expect users to convert data to UTF-8 at the program boundary regardless of their situation, we just want to make dealing with unicode correctly the most obvious thing to do as often as possible.

ColinFinck · January 10, 2021, 7:09pm

Thanks for all the replies! Indeed, I was only referring to the 128 characters in the ASCII range when talking about additional methods for u16.

However, I did some experiments today and found out that I don't just need is_ascii, to_ascii_uppercase, to_ascii_lowercase in my code, but also need to detect invalid UTF-16 code points. Therefore, I will use char::decode_utf16 and the already existing ASCII methods for char in my code.
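One way this combination could look is sketched below: decode, apply the existing char::to_ascii_uppercase, and re-encode into a caller-provided buffer, so no allocation is needed. The function name and the choice to panic on a too-short output buffer are assumptions made for the sketch; in a no_std build the same items are available under core::char.

```rust
use std::char::{decode_utf16, DecodeUtf16Error};

/// Hypothetical helper: ASCII-uppercases valid UTF-16 from `input` into
/// `out` and returns the number of code units written. Stops at the
/// first unpaired surrogate and returns the decode error.
/// Panics if `out` is shorter than `input`.
fn ascii_uppercase_utf16(
    input: &[u16],
    out: &mut [u16],
) -> Result<usize, DecodeUtf16Error> {
    let mut written = 0;
    for decoded in decode_utf16(input.iter().copied()) {
        // `?` propagates the error for an unpaired surrogate.
        let c = decoded?.to_ascii_uppercase();
        // encode_utf16 returns the sub-slice it filled (1 or 2 units).
        written += c.encode_utf16(&mut out[written..]).len();
    }
    Ok(written)
}
```

ASCII case mapping never changes how many code units a character needs, so the output is always exactly as long as the valid input.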

Would still love to see ASCII methods for u16 someday though.

gbutler · January 10, 2021, 7:13pm

I wonder if it could be argued with little to no doubt that detecting invalid code units would always be a requirement when dealing with UCS-2/UTF-16 as a u16 array?

ColinFinck · January 10, 2021, 7:31pm

Just like UTF-8, UTF-16 guarantees that the numeric values < 128 always refer to an ASCII character. They are never part of a surrogate pair.

As such, one could perform operations like is_ascii or lettercase conversions by just looking at each u16 individually, without doing a full validation or conversion to char first.
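A small sketch of that per-unit approach follows. All three functions are hypothetical stand-ins for the methods discussed in this thread, not existing standard library APIs.

```rust
/// Hypothetical: true if the code unit is in the 7-bit ASCII range.
/// Surrogates live in 0xD800..=0xDFFF, so they can never pass this test.
fn is_ascii_u16(unit: u16) -> bool {
    unit < 0x80
}

/// Hypothetical: maps 'A'..='Z' to 'a'..='z', leaves everything else alone.
fn to_ascii_lowercase_u16(unit: u16) -> u16 {
    if matches!(unit, 0x41..=0x5A) { unit + 0x20 } else { unit }
}

/// Hypothetical: compares two UTF-16 buffers, ignoring ASCII case only.
/// No validation or decoding is performed; units are compared one by one.
fn eq_ignore_ascii_case_u16(a: &[u16], b: &[u16]) -> bool {
    a.len() == b.len()
        && a.iter()
            .zip(b)
            .all(|(&x, &y)| to_ascii_lowercase_u16(x) == to_ascii_lowercase_u16(y))
}
```

Non-ASCII letters such as "É" and "é" compare as different here, which matches the behaviour of the existing ASCII-only methods on u8 and char.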

gbutler · January 10, 2021, 8:16pm

For the reasons given for u16, would not the same arguments apply to [i8], [i16], [u32], [i32], [u64], [i64], etc.? If so, shouldn't any PR for is_ascii/to_ascii_lowercase/to_ascii_uppercase include those types as well? Perhaps not u64/i64 or larger, but maybe.

H2CO3 · January 11, 2021, 10:27am

Also note that because of the aforementioned guarantee, it is already possible to check for u16s representing ASCII character codes (however odd, underspecified, or misunderstood that requirement may be), safely, and without the need for an as cast:

let flag = u8::try_from(the_u16).map_or(false, |b| b.is_ascii());
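The same one-liner can be generalized over other integer types, which also speaks to the question about i8, u32, and friends. This is only a sketch; the function name is_ascii_code_unit is made up, and it relies solely on the standard TryFrom conversions into u8.

```rust
use std::convert::TryFrom;

/// Hypothetical helper: true if `value` fits in a u8 and that byte is ASCII.
/// Works for any integer type with a checked conversion into u8
/// (u16, u32, i8, i32, ...), with no `as` cast and no wrap-around.
fn is_ascii_code_unit<T>(value: T) -> bool
where
    u8: TryFrom<T>,
{
    u8::try_from(value).map_or(false, |b| b.is_ascii())
}
```

Negative values and anything above 0x7F are rejected, either by the failed conversion or by the is_ascii check on the resulting byte.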

chrisd · January 11, 2021, 4:34pm

To be honest, the Path API was one reason for my "UTF-8 everywhere" comment. There's no way to interact with Paths via the standard library without converting paths to UTF-8 (or UTF-8 like).

For example, on Windows, UTF-16 paths are converted to "WTF-8" as soon as they enter the program. They are only converted back to UTF-16 when they are passed to the Windows API.

Nokel81 · January 11, 2021, 5:15pm

I don't understand why you think you have to convert paths to String to work with them. PathBuf has several methods for modifying itself.

chrisd · January 11, 2021, 5:29pm

A Path on Windows is a wrapper around a WTF-8 buffer. It is not UTF-16.

system · April 11, 2021, 5:29pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.
