Is it necessary to limit Unicode escapes to at most six hex digits?

Currently, a Unicode escape can have at most six hex digits. In C++, there is no such restriction, which arguably makes the escape easier to expand if the range of Unicode scalar values were ever increased. Should we remove this restriction in Rust?

Unicode has pretty extensively committed to never expanding past U+10FFFF; that seems exceedingly unlikely to change. And if it did, we can expand Rust's Unicode support at that time.

9 Likes

Yes, we could do that at that time. However, I think such a restriction is unnecessary. We could let \u{ hex-digit ... } accept an arbitrary number of hex digits, so long as the value designated by the sequence of hex digits is within the (current) range of Unicode scalar values. There is no need to impose restrictions in the grammar of the Unicode escape itself.

But, by definition, the range of Unicode scalar values is currently 0 to 0x10FFFF - any value larger than 0x10FFFF is not a valid Unicode scalar value. So the only thing that you get by permitting more than 6 hex digits is that you can add lots of leading zeroes to a valid scalar value - but then we have to have the compiler handle those leading zeroes sensibly, even though we could simply insist that you don't supply them.
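
To illustrate: a minimal sketch (not rustc's actual lexer) of what "handle those leading zeroes sensibly" amounts to - however many digits you allow, everything reduces to the same scalar-value check:

```rust
// A sketch (not rustc's actual lexer) of parsing the digits of a
// `\u{...}` escape when an arbitrary number of hex digits is allowed.
// Leading zeroes change nothing: the parser still reduces the digit
// sequence to one u32 and applies the same scalar-value check.
fn parse_unicode_escape_digits(digits: &str) -> Option<char> {
    // Overflow (value > u32::MAX) fails here, the same as any other
    // out-of-range value.
    let value = u32::from_str_radix(digits, 16).ok()?;
    // `char::from_u32` rejects surrogates and values above 0x10FFFF,
    // so this is exactly the "valid Unicode scalar value" check.
    char::from_u32(value)
}

fn main() {
    assert_eq!(parse_unicode_escape_digits("1F600"), Some('😀'));
    // Leading zeroes: same value, same result.
    assert_eq!(parse_unicode_escape_digits("0001F600"), Some('😀'));
    // Out of range: extra digits buy you nothing.
    assert_eq!(parse_unicode_escape_digits("110000"), None);
}
```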

And expanding Unicode scalar values to the point where you need more than 6 hex characters is highly unlikely; UTF-16 can only represent code points from 0 to 0x10FFFF, and we'd need to add more surrogates to expand UTF-16.
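
For concreteness, here is the surrogate arithmetic behind that ceiling, as a quick check in plain Rust:

```rust
// Worked arithmetic for the UTF-16 ceiling: a surrogate pair combines
// one of 0x400 high surrogates (U+D800..U+DC00) with one of 0x400 low
// surrogates (U+DC00..U+E000), offset from U+10000.
fn main() {
    let max_pairable = 0x10000 + 0x400 * 0x400 - 1;
    assert_eq!(max_pairable, 0x10FFFF); // exactly Unicode's upper limit
    // ...and that limit fits in exactly 6 hex digits:
    assert_eq!(format!("{:X}", max_pairable).len(), 6);
}
```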

Note, too, that as of Unicode 15.0, we still have more than 2 unassigned codepoints for every assigned codepoint. This means that we're nowhere near running out of codepoints, and thus there's no reason to want to expand the number of supported codepoints. We have 10 completely unallocated planes, and only one of the 7 allocated planes is nearly fully assigned, and we don't need more than 6 hex characters until we expand beyond 256 planes - but since the beginning of Unicode, we've not even got 50% of the way through the 17 planes we have today.

1 Like

Having a more specific, limited grammar allows typos (like accidentally deleting the }) to be caught earlier in parsing; that is a reason.
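
A minimal sketch (a hypothetical scanner, not the real lexer) of how the digit cap bounds how far a typo can propagate:

```rust
// A hypothetical scanner fragment: with at most `max_digits` hex
// digits, a missing `}` is reported right at the escape, instead of
// the lexer consuming every hex-looking character that follows.
fn scan_escape(rest: &str, max_digits: usize) -> Result<(char, usize), String> {
    let digits: String = rest
        .chars()
        .take_while(|c| c.is_ascii_hexdigit())
        .take(max_digits)
        .collect();
    let after = &rest[digits.len()..];
    if !after.starts_with('}') {
        return Err(format!("expected `}}` after at most {max_digits} digits"));
    }
    let value = u32::from_str_radix(&digits, 16).map_err(|e| e.to_string())?;
    let c = char::from_u32(value).ok_or("not a Unicode scalar value")?;
    Ok((c, digits.len() + 1)) // char plus bytes consumed
}

fn main() {
    // Typo: the `}` in "\u{2764}abc123" was accidentally deleted,
    // leaving "\u{2764abc123". With a 6-digit cap, the error is
    // reported after "2764ab" rather than after all ten digits.
    assert!(scan_escape("2764abc123", 6).is_err());
    assert_eq!(scan_escape("2764}rest", 6).unwrap().0, '❤');
}
```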

Is this an actual problem you're running into?

2 Likes

No, I meant we shouldn't limit the number of hex digits in \u{ hex-digit } to at most six.

What problem is being caused by limiting you to at most 6 digits?

You can represent all valid Unicode scalar values in 6 hex digits.

3 Likes

I gave one possible reason why we should limit the number of hex digits: earlier error detection.

Why do you think we should not?

1 Like

When Java decided to use 2 bytes to represent Unicode code points, its designers thought that storage was adequate to represent all code points, because the maximum was U+FFFF. Similarly, today we think six hex digits are adequate to represent all code points...

Yes. And when Unicode expanded from a maximum of U+FFFF to U+10FFFF, Java did the work to cope with that expansion.

But that was approximately 20 years ago, and we have not yet expanded beyond 7 allocated planes, of 17 total. We won't need more than 6 hex digits until we need more than 256 planes.

It's taken us 20 years to allocate 7 planes. If we continue to allocate at the same rate as we have since the beginning (which is unlikely), it'll take us 700 years to allocate 256 planes. Why should Rust prepare now for something that won't happen for another 700 years?

4 Likes

But, unless I'm mistaken, it would be a non-breaking change to support longer escape sequences whenever in the future they have some meaning. Is there any reason for making this change now?

6 Likes

Yes, I think we should remove the restriction now, in anticipation of future meaning. After all, permitting an arbitrary number of hex digits in a Unicode escape does not introduce any problem, nor does it hinder the early error detection mentioned in the comment above. We just need to check that the value specified by the sequence of hex digits is within the range of the currently defined code points.

If someone changes the restriction from "a Unicode escape is up to 6 hex digits" to "a Unicode escape is up to the minimum number of hex digits needed to represent the largest permissible Unicode scalar value as of Unicode 15.0", would you be happier?

If so, why?

[Did you mean "more than 14 planes"? I've actually done the linear extrapolation and found that if allocation continues at the same rate it has since the beginning, we run out of non-PUA space below U+10FFFF in roughly 600 years.]

I would argue that there are compelling reasons to remove the artificial upper limit on Unicode scalar values -- throughout the language, not just in \u escapes -- as soon as practical, as a willful deviation from Unicode, even though we don't expect to actually need them within the lifetime of anyone reading this in 2023:

  1. The imposition of an artificial upper limit was a mistake. It was driven solely by implementation concerns (the space addressable by UTF-16) that are no longer relevant. Enforcing it adds needless complexity to every implementation of UTF-8. This isn't a perfect case of "just delete the limit", because you'd also have to add back support for 5-byte and 6-byte UTF-8 sequences, per the original RFC 2044 spec (see the sketch after this list), but it's still worth it, because:

  2. It will be orders of magnitude easier to change this now than when we actually need it. Witness how putting it off as long as possible has made the IPv6 transition take substantially longer, cost more money, and do more collateral damage to things like the end-to-end principle than it would have if we'd collectively ripped the band-aid off in the early 2000s. Contrariwise, witness how smoothly the Y2038 transition is going because LP64 ABIs brought in 64-bit time_t quietly.
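
Here is that sketch: the original RFC 2044 byte layout, including the 5- and 6-byte forms that RFC 3629 later banned. This is emphatically not what Rust's char/str implement today; it's what point 1 says we would have to add back:

```rust
// A sketch of RFC 2044's original UTF-8 encoding, including the 5- and
// 6-byte sequences that RFC 3629 later removed. NOT what Rust's
// `char`/`str` implement today.
fn encode_utf8_rfc2044(cp: u32, out: &mut Vec<u8>) -> Result<(), ()> {
    match cp {
        0..=0x7F => out.push(cp as u8),
        0x80..=0x7FF => {
            out.push(0xC0 | (cp >> 6) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
        0x800..=0xFFFF => {
            out.push(0xE0 | (cp >> 12) as u8);
            out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
        0x1_0000..=0x1F_FFFF => {
            out.push(0xF0 | (cp >> 18) as u8);
            out.push(0x80 | ((cp >> 12) & 0x3F) as u8);
            out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
        0x20_0000..=0x3FF_FFFF => { // 5-byte form, banned by RFC 3629
            out.push(0xF8 | (cp >> 24) as u8);
            out.push(0x80 | ((cp >> 18) & 0x3F) as u8);
            out.push(0x80 | ((cp >> 12) & 0x3F) as u8);
            out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
        0x400_0000..=0x7FFF_FFFF => { // 6-byte form, banned by RFC 3629
            out.push(0xFC | (cp >> 30) as u8);
            out.push(0x80 | ((cp >> 24) & 0x3F) as u8);
            out.push(0x80 | ((cp >> 18) & 0x3F) as u8);
            out.push(0x80 | ((cp >> 12) & 0x3F) as u8);
            out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
        _ => return Err(()), // beyond U+7FFF_FFFF even RFC 2044 gives up
    }
    Ok(())
}

fn main() {
    let mut buf = Vec::new();
    encode_utf8_rfc2044(0x7FFF_FFFF, &mut buf).unwrap();
    assert_eq!(buf, [0xFD, 0xBF, 0xBF, 0xBF, 0xBF, 0xBF]);
}
```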

2 Likes

Unicode has been around for about 30 years. In Unicode 15.0, we have 17 planes defined, of which 7 have been allocated (planes 0, 1, 2, 3, 14, 15 and 16 have been allocated, planes 4 through 13 are unallocated).

A 6 hex digit limit on Unicode scalar escapes means that we can go to 256 planes defined before we need to revisit the definition of a Unicode scalar escape. We've taken 30 years to allocate 7 planes (including the 2 private use planes); thus, I'd expect that if the rate of plane allocation in the first 30 years is typical of the growth rate of Unicode, it'll take us 256 / 7 * 30 years to get to 256 planes allocated. Make the assumption that the first 10 years of Unicode had us engaging in contortions to keep within the BMP, because of legacy software that didn't handle supplementary planes, and I get to 256 / 7 * 20 years = 731 years before we allocate our last plane.
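
Spelling that arithmetic out, for anyone who wants to check it:

```rust
// 7 planes allocated in ~30 calendar years; the first 10 years are
// discounted as BMP-constrained, leaving 20 "effective" years.
fn main() {
    let years_per_plane = 20.0 / 7.0;
    let years_to_256_planes = 256.0 * years_per_plane;
    println!("{years_to_256_planes:.0} years"); // prints "731 years"
}
```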

And there's always going to be some upper limit on Unicode scalar values - we can't use them if they're an arbitrary member of the set of natural numbers. For Unicode scalar escapes, we've set that limit at 2**24 for now, since this is a convenient limit if we're representing Unicode 15.0 codepoints with hex digits - we need at least 2 hex digits to represent all possible plane numbers from 0 to 0x10, so we might as well let the plane number part of the escape range up to 0xff, even though planes 17 to 255 are not possible right now.
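
As a concrete reading of that limit, here's a hypothetical decomposition of a 6-hex-digit escape value into plane number and within-plane offset (my illustration, not anything in the Unicode spec):

```rust
// The 2**24 value space of a 6-hex-digit escape, split into a plane
// byte and a 16-bit within-plane offset.
fn main() {
    let value: u32 = 0xFF_FFFF;  // largest value 6 hex digits can express
    let plane = value >> 16;     // high byte: plane number, 0x00..=0xFF
    let offset = value & 0xFFFF; // low 16 bits: position within the plane
    assert_eq!((plane, offset), (0xFF, 0xFFFF));
}
```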

As and when it looks like we'll run out of 24 bit codepoints (bearing in mind that today, a Unicode scalar value can be represented in 21 bits with 14 planes unusable), it'll be time to work out what the expansion looks like - do we want Unicode scalar escapes to be 32 bit? 36 bit? 96 bit? - but until then, we've chosen a perfectly reasonable value.

The difference is that the Y2038 change can be done program-by-program; as soon as I support 64-bit time_t, I'm ready, even if everyone I communicate with uses 32-bit time_t. With the IPv6 migration, not only do you need to migrate me to IPv6, you also need to migrate everyone I communicate with to IPv6 before I can drop IPv4 support.

The change to a larger Unicode codepoint limit is closer to the time_t change than to the IPv6 change - if Unicode 41 increases us from 17 planes to 65536 planes (making a codepoint into a 32 bit number), I can change over internally to working with all 65536 planes, even though I can only interchange with you with the 17 planes you understand.

Oh, I see what you mean now. Yes, I agree that if we leave the U+10FFFF limit on code points in place, then there is no need to revisit the syntax of \u escapes.

What I'm saying is that I think Rust (and the software industry in general) should revert to RFC 2044 / original ISO 10646, with the maximum codepoint value set at U+7FFF_FFFF, as soon as practically feasible (in particular, not waiting for the 17-plane space to be anywhere near full). That work would naturally involve, among other things, changing \u escapes to allow up to eight hex digits.

This is true; when I say "it will be orders of magnitude easier to change this now" I'm primarily thinking about data storage formats. Maybe a better example is MySQL's utf8mb3 and utf8mb4, where you can think you're fine with mb3 until someone wants to put an emoji in the column and then suddenly you have a painful database migration on your hands. If they'd gone directly to RFC 2044 UTF-8 from the -mb3 variant, there wouldn't be another painful database migration in the future of every database intended to still be around for centuries.

2 Likes

A reason to remove it is the Robustness Principle:

"Be conservative in what you send, be liberal in what you accept."

1 Like

Original ISO 10646 set the maximum codepoint value to U+FFFF - U+7FFF_FFFF was a proposal that got watered down to U+10_FFFF to handle the constraint of compatibility with past revisions of ISO 10646 (via UTF-16 and surrogates).

And while I agree that we should plan for more planes in the long run, that's not something Rust should be doing by itself - this is something that should be happening at the Unicode level. It'll need serious thought - not least, how do we deal with systems that want to use a 16 bit code unit for characters - but it's doable.

In the meantime, though, when the absolute upper limit on valid Unicode scalar values is 0x10_FFFF, why should we increase the maximum scalar value you can encode in an escape from 0xFF_FFFF, and what should the upper limit actually be? We're going to have trouble if we increase the upper limit in the escape to 0xFFFF_FFFF, but the eventual Unicode change to support more planes increases the upper limit to 0x10_FFFE_FFFF_FFFF - we end up in the same situation people who chose UCS-2 when they implemented the original ISO 10646 character set were in, having to do yet more work to handle a case that we didn't foresee. Might as well delay doing the work until we know what the real upper limit is going to turn out to be.

2 Likes

Rust is liberal in what it accepts - you can enter a Unicode escape with a scalar value of 0x110000 (the grammar admits it), even though the upper limit on USVs is 0x10FFFF.
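
Concretely, with char::from_u32 standing in for the compiler's validation:

```rust
// The split between "fits the 6-digit grammar" and "is a valid USV":
fn main() {
    assert!(char::from_u32(0x10FFFF).is_some()); // the largest USV
    assert!(char::from_u32(0x110000).is_none()); // writable in 6 digits, not a USV
}
```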

If we're going to raise that limit, where should we raise it to?

My 2 cents would be to pick a number greater than 6, maybe 32 or 256. You could also provide a way for it to be specified at compile time or run time, and let the user decide how liberal or conservative to be.