Do we need unicode whitespace?

As you may or may not know, Rust uses the Unicode Pattern_White_Space property to define what the grammar considers whitespace. It is defined as follows:

pub(crate) fn pattern_white_space(c: char) -> bool {
    match c {
        '\u{0009}'
        | '\u{000A}'
        | '\u{000B}'
        | '\u{000C}'
        | '\u{000D}'
        | '\u{0020}'
        | '\u{0085}'
        | '\u{200E}'
        | '\u{200F}'
        | '\u{2028}'
        | '\u{2029}' => true,
        _ => false,
    }
}

Do we really need those Unicode characters there? ASCII-only whitespace seems more user-friendly: all tools understand it, and many more programmers understand \r and \n than U+200F (I don’t understand the latter, for example). It should also be easier for tools (this post is inspired by this bug(?) in rustc: https://github.com/rust-lang/rust/issues/60209).

Also, is discussing the lexical structure of whitespace the fourth clause of Wadler’s law?

2 Likes

For the reference, Swift (https://docs.swift.org/swift-book/ReferenceManual/LexicalStructure.html) and Go (https://golang.org/ref/spec#Tokens) use ASCII-only whitespace.

I especially like Go’s definition, which is space, \t, \n, \r, without weird stuff like \v.

Wouldn’t it technically be a breaking change to reduce this?

6 Likes

Yeah, it would be technically a breaking change (and even just fixing that rustc issue will also be breaking).

EDIT: I personally feel though that in this case this is justified: I expect that, de-facto, tools wouldn’t honor our precise definition anyway.

1 Like

@matklad As a first step, could you fix it in a PR and crater? That way we at least have some data to base decisions on.

2 Likes

Looks like the “non-ASCII idents” RFC mentions this - https://github.com/rust-lang/rfcs/blob/master/text/2457-non-ascii-idents.md.
Perhaps the RFC discussion thread has something about this as well, but it’s hard to find thanks to GitHub with its “631 hidden items”.

(I’d personally be entirely happy with a-zA-Z0-9_ idents and \t\n\r whitespace.)

Is it really a problem that Rust can parse input with characters which are technically whitespace but that some tools fail to recognize?

4 Likes

Changing this would be a clear breaking change and contrary to our pretty well-established Unicode-first philosophy. I think it’s very unlikely that this will change.

EDIT: I thought this was about the behavior of the std is_whitespace methods in general, not strictly rustc parsing. I still think we should fix the bug the other direction, but let me downgrade my statement to just pretty unlikely.

3 Likes

The fact that the compiler itself uses two different definitions of whitespace is evidence that this is a problem.

2 Likes

It’s evidence that a programmer made a mistake by not using the is_whitespace method provided in the standard library - it’s plausible this code predates that method existing, since newline escaping in Rust strings probably predates rustc, and was ported from the bootstrapping compiler. This is why we provide Unicode-based methods: to avoid the sort of bugs that arise when users attempt to define character categories themselves instead of delegating those definitions to a standards body.

8 Likes

If I understand correctly, char::is_whitespace is not the definition of whitespace from the language reference. One is White_Space, the other is Pattern_White_Space. So, using char::is_whitespace would also be an error.
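To make the difference concrete, here is a minimal sketch comparing the three predicates discussed so far. `classify` is a hypothetical helper written for this post, and `pattern_white_space` is copied from the list at the top of the thread; only `char::is_whitespace` (White_Space) and `char::is_ascii_whitespace` are real standard library methods.

```rust
/// Pattern_White_Space, as listed at the top of the thread.
pub fn pattern_white_space(c: char) -> bool {
    matches!(
        c,
        '\u{0009}'..='\u{000D}'
            | '\u{0020}'
            | '\u{0085}'
            | '\u{200E}'
            | '\u{200F}'
            | '\u{2028}'
            | '\u{2029}'
    )
}

/// Hypothetical helper: returns
/// (White_Space via `char::is_whitespace`, Pattern_White_Space, ASCII whitespace).
pub fn classify(c: char) -> (bool, bool, bool) {
    (c.is_whitespace(), pattern_white_space(c), c.is_ascii_whitespace())
}
```

For example, NO-BREAK SPACE (U+00A0) is White_Space but not Pattern_White_Space, LEFT-TO-RIGHT MARK (U+200E) is the other way around, and VERTICAL TABULATION (U+000B) is in both Unicode sets but is rejected by `is_ascii_whitespace`.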

4 Likes

I suspect there’s little to no usage of non-ascii-whitespace.

Specifically, the Pattern_White_Space characters are:

  • ASCII whitespace
    • HORIZONTAL TABULATION
    • NEW LINE
    • VERTICAL TABULATION
    • FORM FEED
    • CARRIAGE RETURN
    • SPACE
  • NEXT LINE (Latin-1 Supplement)
  • LEFT-TO-RIGHT MARK, RIGHT-TO-LEFT MARK
  • LINE SEPARATOR, PARAGRAPH SEPARATOR

The last two are effectively not used in favor of the ASCII whitespace. LTR and RTL marks are definitely used, but rarely: these are the “soft” versions (as opposed to the overrides), which means they only reverse “weak bidi” scripts (such as, iirc, Arabic numerals) but not “strong bidi” characters (such as, iirc, the Latin alphabet), and most of the time other “strong bidi” characters will be around to do the reversing (though that may not hold in a language with Latin-script keywords). I don’t know enough about NEXT LINE to comment on its usage, and I figure we all know ASCII whitespace.

All that said, we (eventually) agreed on TR31 XID identifiers, so I see no reason we shouldn’t use TR31 Pattern_White_Space as well. That also said, I could see a line of a string starting with an LTR/RTL MARK, so I don’t know the best solution here.

DISCLAIMER: this knowledge comes from being a hobbyist and not from actually working with scripts that use these marks. You’ll probably still want to ask someone who uses an RTL script how we can best handle it.

1 Like

Of course, the wonders of unicode deepen. I don't know the reason both of these categories exist, but my point is that we should use one (presumably pattern_white_space since that's what we decided once) instead of ad hoc defining our own category that is arbitrarily what we think programmers are used to. You even have noted already that Swift and Go use distinct definitions of "ascii whitespace."

3 Likes

From the reference:

Unicode productions

A few productions in Rust's grammar permit Unicode codepoints outside the ASCII range. We define these productions in terms of character properties specified in the Unicode standard, rather than in terms of ASCII-range codepoints. The section Special Unicode Productions lists these productions.

It states that what matters is the "character properties specified in the Unicode standard". Clearly the list of productions is simply incomplete. :wink:

White_Space is a binary Unicode property (not a General_Category value) covering codepoints "generally perceived as" whitespace. It is a mutable set: codepoints can be added to and removed from it, and have been in both directions.

Pattern_White_Space is defined in TR31 to be a closed, stable set of codepoints recommended for use as whitespace for the purpose of computer language syntax.

The TR31 spec requires a language to describe its whitespace as a diff from the Pattern_White_Space set of codepoints. This can be any number of added or removed codepoints, but barring special circumstances, the Unicode Consortium's recommendation is to use it unmodified.
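As a rough illustration of what "a diff from Pattern_White_Space" could look like in code, here is a sketch. The names `REMOVED`, `ADDED` and `profile_whitespace` are made up for this example, and the particular profile shown (keep only space, tab, CR and LF) is just one hypothetical choice, not something TR31 or Rust prescribes.

```rust
/// The full Pattern_White_Space set (11 codepoints).
pub const PATTERN_WHITE_SPACE: [char; 11] = [
    '\u{0009}', '\u{000A}', '\u{000B}', '\u{000C}', '\u{000D}', '\u{0020}',
    '\u{0085}', '\u{200E}', '\u{200F}', '\u{2028}', '\u{2029}',
];

/// Hypothetical profile: codepoints removed from the base set.
pub const REMOVED: [char; 7] = [
    '\u{000B}', '\u{000C}', '\u{0085}', '\u{200E}', '\u{200F}', '\u{2028}', '\u{2029}',
];

/// Hypothetical profile: codepoints added to the base set (none here).
pub const ADDED: [char; 0] = [];

/// Whitespace for this profile: the base set, minus removals, plus additions.
pub fn profile_whitespace(c: char) -> bool {
    ADDED.contains(&c) || (PATTERN_WHITE_SPACE.contains(&c) && !REMOVED.contains(&c))
}
```

Using the set unmodified corresponds to leaving both `REMOVED` and `ADDED` empty.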

(Side note: I'd consider TR31 required reading before discussing lexical syntax.)

7 Likes

cc @Manishearth here as well

2 Likes

I think the most consistent decision is to make both strings and the lexer use Pattern_White_Space as their definition of white space, and then have the string whitespace removal be:

  • A backslash followed by an explicit line breaking codepoint causes that backslash, new line codepoint, and any following uninterrupted run of Pattern_White_Space codepoints to be removed from the resulting string.
  • If this run is ended by a backslash immediately followed by a Pattern_White_Space character which is not an explicit newline, the backslash is removed from the resulting string and the whitespace character (and any following it) are retained. (If the whitespace terminating character does not meet this condition, it is rendered in the resulting string as if this transformation had not occurred (e.g. a backslash escape is processed normally).)

This allows lines to start with Pattern_White_Space characters (useful for leading ASCII whitespace as well) and keeps the string concept of whitespace the same as the lexer’s.
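A minimal sketch of the two rules above, under some loudly stated assumptions: `strip_continuations` is a hypothetical name, it works on the raw text between the quotes, it treats only \n and \r\n as "explicit line breaks" (the proposal does not pin this down), and it copies every other escape sequence through unprocessed. It is not how rustc implements this.

```rust
/// Pattern_White_Space, as listed at the top of the thread.
pub fn pattern_white_space(c: char) -> bool {
    matches!(
        c,
        '\u{0009}'..='\u{000D}'
            | '\u{0020}'
            | '\u{0085}'
            | '\u{200E}'
            | '\u{200F}'
            | '\u{2028}'
            | '\u{2029}'
    )
}

/// Hypothetical helper: applies the proposed line-continuation rules to the
/// raw body of a string literal. Other escapes are left untouched.
pub fn strip_continuations(raw: &str) -> String {
    let chars: Vec<char> = raw.chars().collect();
    let mut out = String::with_capacity(raw.len());
    let mut i = 0;
    while i < chars.len() {
        // Assumption: only "\n" and "\r\n" count as explicit line breaks.
        let skip = match &chars[i..] {
            ['\\', '\n', ..] => 2,
            ['\\', '\r', '\n', ..] => 3,
            _ => 0,
        };
        if skip == 0 {
            out.push(chars[i]);
            if chars[i] == '\\' {
                // Copy the escaped character too, so that `\\` followed by a
                // newline is not mistaken for a continuation.
                if let Some(&next) = chars.get(i + 1) {
                    out.push(next);
                }
                i += 2;
            } else {
                i += 1;
            }
            continue;
        }
        // Rule 1: drop the backslash, the line break, and the following run
        // of Pattern_White_Space.
        i += skip;
        while i < chars.len() && pattern_white_space(chars[i]) {
            i += 1;
        }
        // Rule 2: a backslash directly followed by non-newline whitespace ends
        // the run; drop the backslash and keep the whitespace after it.
        if chars.get(i) == Some(&'\\') {
            if let Some(&next) = chars.get(i + 1) {
                if pattern_white_space(next) && next != '\n' && next != '\r' {
                    i += 1;
                }
            }
        }
    }
    out
}
```

With this sketch, a literal body consisting of `a`, a backslash, a newline, two spaces, a backslash, a space and `b` comes out as `a b`: the indentation is dropped, and the backslash marks where the retained leading whitespace begins.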

3 Likes

What should we treat as an explicit line-breaking codepoint? Only \n and \r\n (a bare \r in string literals is forbidden)? Or should NEXT LINE, LINE SEPARATOR and PARAGRAPH SEPARATOR be included as well?

So, let me try to write down arguments for why we should consider defining whitespace as space, \t, \r, \n in a somewhat more structured form :slight_smile:

This is a relatively minor issue, and, if someone from the lang team feels strongly that discussion is not helpful here, that would be enough for me to consider this settled (modulo fixing the actual bug in the compiler). That said, given that Rust could become the backbone language of the future, nailing down trivial details seems worthwhile to me.

The first problem with Pattern_White_Space is that it seems unlikely that all smaller tools will support it. It’s just too easy to reach for char::is_ascii_whitespace or char::is_whitespace (Unicode White_Space), both of which are wrong. So, I am pretty sure that, even if we rule in favor of Pattern_White_Space, throwing NEXT LINE or PARAGRAPH SEPARATOR into the source code will break some tools. Given that compiler bug, I think it will even break all existing tools :slight_smile:
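Here is a small sketch of how that breakage could show up in a naive tool that splits source text on whitespace. `chunks` is a hypothetical helper made up for this post; the predicates passed to it are the standard library ones plus the `pattern_white_space` function from the top of the thread.

```rust
/// Pattern_White_Space, as listed at the top of the thread.
pub fn pattern_white_space(c: char) -> bool {
    matches!(
        c,
        '\u{0009}'..='\u{000D}'
            | '\u{0020}'
            | '\u{0085}'
            | '\u{200E}'
            | '\u{200F}'
            | '\u{2028}'
            | '\u{2029}'
    )
}

/// Hypothetical helper: split `src` into non-empty chunks separated by
/// whatever `is_ws` considers whitespace.
pub fn chunks(src: &str, is_ws: fn(char) -> bool) -> Vec<&str> {
    src.split(is_ws).filter(|s| !s.is_empty()).collect()
}
```

Given the input `"let\u{2028}x"`, splitting with `|c| c.is_ascii_whitespace()` yields a single chunk, while splitting with `pattern_white_space` yields `let` and `x`. With `"let\u{200E}x"`, even `char::is_whitespace` keeps the text as one chunk, because LEFT-TO-RIGHT MARK is not White_Space.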

The second problem with Pattern_White_Space is that it seems to cater for use cases which are irrelevant for Rust. IIUC, one of the goals of Unicode is to be able to represent pre-existing documents in legacy encodings losslessly, but there is no Rust code in legacy encodings. Specifically, FORM FEED and VERTICAL TABULATION seem like they are included only for compatibility with old stuff. Note that \v is presumably one of the things Rust shipped without :slight_smile: NEXT LINE is an especially egregious case: it’s not even ASCII, it’s Latin-1! I found little motivation in TR31 for why Pattern_White_Space is defined this way. It gives a nod to XML 1.0 (Fifth Edition), which itself uses S ::= (#x20 | #x9 | #xD | #xA)+ =/

2 Likes

In the issue, Manish makes a good point that supporting the LTR and RTL marks might be important for supporting Unicode identifiers.