Implement From<char> for u64

FWIW, that has never surprised me because I don't think of char as being a "numeric" type. That said, the existing From conversion is a "newtype-unwrapping" conversion to me, not an "integer-widening" one. (char is essentially a struct char(u32); newtype, with some extra magic restricting its value range.)

Notably, there also isn't char: TryFrom<u16> even though there's char: TryFrom<u32>, nor is there char: Into<f32> even though a 21-bit USV can also be exactly represented as a single-precision float.
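For reference, a quick sketch of what std does and doesn't provide today (my summary, assuming a recent toolchain):

fn main() {
    // Conversions that exist in std today:
    let n: u32 = u32::from('🦀');          // char -> u32, infallible
    let c = char::try_from(0x1F980_u32);   // u32 -> char, fallible (surrogates, values above 0x10FFFF)
    let a = char::from(b'A');              // u8 -> char, infallible
    println!("{n:#x} {c:?} {a}");

    // These do not compile today:
    // let _ = char::try_from(0x20AC_u16); // no TryFrom<u16> for char
    // let _: u64 = u64::from('a');        // no From<char> for u64
}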

(But as boats said, one can always send a PR and see what happens.)

3 Likes

Is that not a very different thing from merely widening? A u64 is a superset of a u32. It's not that a u32 just so happens to be representable as a u64. A value of u32::MAX has the exact same meaning as u32::MAX as u64.

A u32 into f32 changes that meaning, even when the value happens to be exactly representable.

Yet we have From<u8> for char and TryFrom<u32> for char. Why not TryFrom<u16> too? Is that not inconsistent? I guess I don't understand the reasoning here.

1 Like

I think this is the core of where people can have different perspectives here, since they'll have different definitions for when something "changes its meaning".

An f64 is a superset of a u32 (or an i32) just as "a u64 is a superset of a u32". Does that change its meaning? I don't know, and reasonable people can probably disagree on it. Someone used to JavaScript, where an f64 often serves as an i54, would probably say no.
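For what it's worth, std already draws this line for integer-to-float widenings where every value is exactly representable; a small sketch (mine, not from the thread):

fn main() {
    // Every u32 is exactly representable as an f64, and std provides the impl:
    let x: f64 = u32::MAX.into();
    // Every u16 is exactly representable as an f32, and std provides that too:
    let y: f32 = u16::MAX.into();
    // But there is no From<u32> for f32, since large u32 values would be rounded:
    // let z: f32 = u32::MAX.into(); // does not compile
    println!("{x} {y}");
}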

1 Like

Sure, but if we're saying a char is an integer (which we are, no?), then I don't understand why an integer of 1 or 4 bytes is the only acceptable representation. Why not 2 or 8 as well? What's the distinction we're making here?

I think that, for want of a real numeric hierarchy in Rust (maybe someday, maybe never), every meaningful error-free widening operation should probably have a From impl, and the missing ones just haven't been hit by someone who was willing to make a std PR. At some point the type system may evolve enough that these impls can be somehow automatic.

Even on LE systems you still have to change the alignment and pad the value with zeros, so I would have to disagree here. This operation can be done in one instruction (usually), but so can the conversion to f64.

That's the point under debate. I think both of these are consistent and reasonable viewpoints:

  • char is an integer, so it should have conversions to other numeric things, including floats and big integers and such, because it's better to have those once in the library than to make everyone figure them out themselves. Making people call an extra conversion method is annoying, and we should just have all the transitive impls -- even if they're only situationally useful -- since it's From and thus not lossy.
  • char is about text encoding, so it should only have the conversions needed in that context: to/from u32 for Unicode code points, and from u8 for ASCII (as that type has methods like u8::to_ascii_lowercase). The library should guide people toward handling text correctly, and it's not a big deal for code doing something unusual to do another conversion; that's certainly better than having a ton of extra From implementations in the docs that people would have to scroll past to get to the one they should actually be using.

They're opposed, but I don't think either one is wrong.


The thing I do feel strongly about is that it would be wrong to add just u64: From<char>. If that one's reasonable, then at least u128: From<char> is also reasonable. And if those are reasonable, I think it's also clear that i32: From<char> and similar are just as reasonable as well.

3 Likes

Well, I mean we already do for u8:

fn main() {
    // Note: `0xff` ('ÿ') is not ASCII, yet the `From<u8>` impl accepts it.
    let c: char = 0xff_u8.into();
    println!("{c}");
}


And for what it's worth, the docs only say that a char is a Unicode scalar value, not that it's UTF-32 (or any other text encoding). Of course this isn't canonical, but the fact that it's documented as a Unicode scalar value rather than as a specific encoding strikes me as an important distinction.

This isn't like str, which is a specific encoding of Unicode text.

And yes, I don't see anything special about u64 in particular. Although, as toc mentioned, the lack of a real numeric hierarchy in Rust may make conversions to signed integers arguable even if u64 were accepted.

Oh, I wasn't aware that From<u8> had also been added, despite the earlier PR warning that it may not be the right thing to do from an encoding perspective.

So in that case the precedent of "chars are just numbers" has been set, and u64 conversion would fit the existing ones.

And for Unicode code points, f32 is just as adequate as f64: f32 can represent 24-bit integers exactly, and Unicode code points only go up to 21 bits.
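A quick check of that claim (my sketch): char::MAX is U+10FFFF, which is well inside f32's exact-integer range.

fn main() {
    let max = char::MAX as u32;      // 0x10FFFF = 1_114_111, a 21-bit value
    let as_f32 = max as f32;         // below 2^24, so exactly representable in f32
    assert_eq!(as_f32 as u32, max);  // round-trips without loss
    println!("{max:#x} -> {as_f32}");
}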

I'm working on a PR.

EDIT: Submitted #79502

3 Likes

I'd argue that char is not a number type, because it doesn't implement any arithmetic operations (in contrast to Java, where a char is just a two-byte unsigned integer).

char isn't stored in a text encoding; it's stored as a plain number (big-endian or little-endian, depending on the platform). Of course this is an implementation detail, but it's highly unlikely to ever change.

I think it makes sense to implement conversions between char and u16/u64/u128, not because it's the correct thing to do, but because it might prove useful, and I can't see any downsides.

It may be a fairly trivial encoding method, but it's certainly a text encoding. UTF-32 is a Unicode Encoding Form in which Unicode scalar values are encoded as 32-bit numbers, with big-endian and little-endian variations. The char type in Rust corresponds precisely to UTF-32BE or UTF-32LE, depending on the platform.
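A minimal sketch of that correspondence (the byte values in the comment assume a little-endian target):

fn main() {
    // A `char` has the same size and alignment as a `u32`, and its scalar
    // value is stored in native byte order, i.e. UTF-32LE or UTF-32BE
    // depending on the platform.
    let c = '€'; // U+20AC
    let bytes = u32::from(c).to_ne_bytes();
    println!("{bytes:02x?}"); // [ac, 20, 00, 00] on little-endian targets
}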

2 Likes

What's the rationale for char: TryFrom<u16>? Is it just a glorified version of char: TryFrom<u8>, where the former might work for most non-Chinese languages while the latter works only for slightly-extended ASCII? Philosophically, should Rust standardize support for these opinionated uses, which intrinsically cannot be language-agnostic?

2 Likes

FWIW, I actually have neither viewpoint. I thought From/Into were simply for infallible conversions where there's only one possible (or one obvious default) way of doing the conversion, such that no one could reasonably need to ask why this char value gets turned into that u32 value instead of some other u32 value. On that view, whether a char "is an integer" is simply irrelevant (or at least not directly relevant) to the question of whether these conversions should exist.

But I completely agree with this. In general, I think the only reason we shouldn't simply add every unambiguous From/Into impl we can is to avoid creating overlapping impls that no one can actually use (without UFCS), but these all seem pretty safe.

Admittedly, I'm not familiar enough with Unicode to be 100% confident that every char value fits in the positive i32s, but I think they do, and if they do then that impl's clearly fine.

1 Like

The char docs link to http://www.unicode.org/glossary/#unicode_scalar_value:

In other words, the ranges of integers 0 to D7FF₁₆ and E000₁₆ to 10FFFF₁₆, inclusive.

So it definitely fits in an i32, since i32::MAX is 7FFFFFFF₁₆.
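A trivial one-line check of that:

fn main() {
    // char::MAX is U+10FFFF = 1_114_111, far below i32::MAX = 0x7FFF_FFFF.
    assert!(char::MAX as u32 <= i32::MAX as u32);
    println!("{:#x} <= {:#x}", char::MAX as u32, i32::MAX);
}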

1 Like

Hardly the only issue with this crate.

I did a very, very brief look over it in the past when deciding whether to use it, found 5 obviously unsound functions in that time, and filed https://github.com/RustSec/advisory-db/blob/master/crates/ncurses/RUSTSEC-2019-0006.md. The author's opinion at the time was that it's okay because it's a thin wrapper.

I have almost no doubt that there are more issues too, but haven't had the time to look further.

2 Likes

Since people coming from non-Unicode-aware languages seem to be universally confused about what a "character" or a "byte" or a "code point" is, and how it's "just a number" or "just a byte" or "just a 16/32-bit int", I think perpetuating these myths is entirely the wrong and irresponsible thing to do. Based on this reasoning, the language might as well lack a char type completely and just use u32 for code points everywhere instead. That doesn't help with correct string manipulation at all, however.

I find the "but the conversion is lossless" argument weak at best, vacuous at worst. There are many, many surjective or bijective conversions between types that could technically be defined without loss of information. This doesn't mean that their semantics need to match automatically.

When I'm developing a database model, I often use newtypes of larger integers (u32 to u128) as primary keys, and so do many others. Does that mean it would make sense to convert back and forth between a struct UserID(u32) and a char? Certainly not.
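To make that concrete with a hypothetical version of such a newtype (the name and value here are made up for illustration):

struct UserId(u32); // hypothetical primary-key newtype, as in the database example above

fn main() {
    let id = UserId(1_114_111); // numerically equal to char::MAX as u32
    // A lossless mapping to `char` exists for this value, but nothing about a
    // primary key makes that conversion meaningful, which is the point above.
    println!("user id: {}", id.0);
}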

Rather than trying to use From as a sledgehammer, we should think about whether we should, even if we can. Conversions are the very sign that types don't match up exactly, and thus they need extra caution in the general case. Sometimes they don't, but that's only when we're lucky.


I'm otherwise a big advocate of ensuring type-level interoperability by implementing as many of the std traits as possible, but only so long as their correctness is immediately obvious. In a context as riddled with misconceptions as Unicode, the right choice is to let users ask themselves "am I doing it right?", instead of letting them pull the trigger on a footgun they don't even know exists.

11 Likes

The only use cases I can think of for treating a char as a number are (a) checking whether a code point is within a particular range, for which char already implements PartialOrd, and (b) printing the code point value, for which u32 already implements From<char>.

So I agree: I see no reasonable use case which would be made simpler by implementing other conversions. For niche corner cases such as the example given of interfacing with the ncurses library, the two-step conversion u32::from(' ').into() is enough.
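Spelled out, the two-step conversion looks like this (my sketch):

fn main() {
    // char -> u32 is infallible, and u32 -> u64 already has a From impl,
    // so this spelling needs no new trait implementations:
    let ch: u64 = u32::from(' ').into();
    assert_eq!(ch, 0x20);
    println!("{ch:#x}");
}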

In fact, the type required by waddch is not a normal C character, but a C character logical-ORed with video attributes. From the waddch man page:

Video attributes can be combined with a character argument passed to addch() or related functions by logical-ORing them into the character.

So the use case in the original post is invalid in my view.

3 Likes

Similar to the reasoning in Add non-`unsafe` `get_mut` for `UnsafeCell`, I'd suggest that the absence of the infallible conversion implies that there isn't a correct one. The error message could reasonably point us to use try_into.

I'm also sympathetic to the viewpoint that this particular conversion should be spelled something like

let n: u64 = 'a'.encode_utf32().into();

but I don't know if that's workable into a consistent API without deprecating some From impls.
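As a rough illustration only: encode_utf32 is not a real std method, but the spelling could be sketched today as a user-side extension trait, assuming nothing beyond the existing char → u32 and u32 → u64 impls.

// Hypothetical extension trait; `encode_utf32` does not exist in std.
trait EncodeUtf32 {
    fn encode_utf32(self) -> u32;
}

impl EncodeUtf32 for char {
    fn encode_utf32(self) -> u32 {
        u32::from(self)
    }
}

fn main() {
    let n: u64 = 'a'.encode_utf32().into();
    assert_eq!(n, 97);
}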

5 Likes

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.