While discussing the "dedented string literals" RFC I noticed that Rust's treatment of newlines in source text does not appear to match what the Unicode annex that we're using as a normative reference for 'whitespace' says is supposed to happen.
The Rust reference says that "whitespace" is defined as any sequence of Unicode Pattern_White_Space characters with a reference to UAX#31 for the definition of Pattern_White_Space. The Rust reference doesn't, as far as I can tell, make very much distinction between types of whitespace, although in another section it does say that CR LF sequences (U+000D immediately followed by U+000A) are "normalized" to just U+000A very early in lexical analysis.
In UAX#31, by contrast, Pattern_White_Space is subdivided into three classes which I'm going to call "horizontal whitespace", "ignorable format controls", and "end-of-line markers." The "end-of-line markers" class is quite extensive:
A sequence of one or more of any of the following characters shall be interpreted as a sequence of one or more end of line:
I'm wondering whether we should change the newline normalizer so that all of the above characters are normalized to LF, as well as CR LF sequences. The motivation would just be to align ourselves more precisely with Unicode and not to have to explain that, despite talking about Pattern_White_Space, we ignored most of the actual UAX#31 recommendations; actual usage of any of these characters (other than CRLF line endings) is vanishingly rare. It would technically be a breaking change, but only for programs that put a literal instance of one of these characters inside a string (including a doc comment).
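To make the proposal concrete, here is a rough sketch of what such a normalizer could look like. It is not rustc's actual code; `is_uax31_eol` and `normalize_eol` are names made up for illustration, and the character set is my reading of the UAX#31 end-of-line list (LF, VT, FF, CR, NEL, LINE SEPARATOR, PARAGRAPH SEPARATOR).

```rust
/// End-of-line markers as listed in UAX#31 (assumed set, LF included).
fn is_uax31_eol(c: char) -> bool {
    matches!(
        c,
        '\u{000A}' | '\u{000B}' | '\u{000C}' | '\u{000D}' | '\u{0085}' | '\u{2028}' | '\u{2029}'
    )
}

/// Hypothetical normalizer: CR LF becomes a single LF, and every other
/// end-of-line marker becomes LF on its own.
fn normalize_eol(src: &str) -> String {
    let mut out = String::with_capacity(src.len());
    let mut chars = src.chars().peekable();
    while let Some(c) = chars.next() {
        if c == '\r' && chars.peek() == Some(&'\n') {
            chars.next(); // consume the LF so CR LF yields exactly one newline
            out.push('\n');
        } else if is_uax31_eol(c) {
            out.push('\n');
        } else {
            out.push(c);
        }
    }
    out
}
```

With this sketch, `normalize_eol("a\r\nb")` and `normalize_eol("a\u{2028}b")` both give `"a\nb"`. A string literal containing a literal U+2028 would change value under such a normalizer, which is the breaking change mentioned above.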
I appreciate the principle of this proposal, but given the hazards of line mis-interpretation (e.g. changing whether a line is part of a comment or not), it might be even better to prohibit literal instances of characters in the “end-of-line marker” class that are not either "\n" or "\r\n".
Or maybe a slightly more lenient rule: any such character may appear as long as it is preceded or followed by at least one \n. This is a generalization of allowing "\r\n" line endings, and would be appreciated by those old-school folks who like using form-feeds in code formatting (because it expresses “put this section on a new page”).
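A sketch of how that lenient rule could be checked, purely as an illustration: `is_unusual_eol` and `lone_eol_offsets` are hypothetical names, and the set of "unusual" characters is assumed to be the UAX#31 end-of-line list minus LF.

```rust
/// End-of-line markers other than LF (assumed set, taken from UAX#31).
fn is_unusual_eol(c: char) -> bool {
    matches!(
        c,
        '\u{000B}' | '\u{000C}' | '\u{000D}' | '\u{0085}' | '\u{2028}' | '\u{2029}'
    )
}

/// Byte offsets of unusual end-of-line characters that are neither
/// immediately preceded nor immediately followed by LF.
fn lone_eol_offsets(src: &str) -> Vec<usize> {
    let chars: Vec<(usize, char)> = src.char_indices().collect();
    let mut bad = Vec::new();
    for (i, &(offset, c)) in chars.iter().enumerate() {
        if !is_unusual_eol(c) {
            continue;
        }
        let prev_is_lf = i > 0 && chars[i - 1].1 == '\n';
        let next_is_lf = chars.get(i + 1).map_or(false, |&(_, n)| n == '\n');
        if !prev_is_lf && !next_is_lf {
            bad.push(offset); // would be rejected under the lenient rule
        }
    }
    bad
}
```

Under this check, `"a\r\nb"` and a form feed on its own line (`"a\n\x0C\nb"`) pass, while a bare `"a\rb"` reports the offset of the CR.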
It would be painful if source code line numbers didn't align between rustc and other tools and editors.
I don't expect other tools to support anything besides (CR)LF. If some do, they're a minority, and use of other line break characters is going to give inconsistent results.
I think it would be better for Rust to walk back that definition: disallow other line breaks outside of string literals, and warn by default when they appear unescaped in string literals.
The part of me that grew up hacking on classic Mac OS has nostalgia for bare \r as a line terminator, but nonetheless I could get behind disallowing the unusual line break characters unless preceded/followed by \n, as you suggest.
It's the current "silently treated as horizontal whitespace" behavior that I think is at least potentially problematic (e.g. if someone expected one of these characters, by itself, to terminate a // comment).
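A toy model of that hazard, not rustc's lexer: `line_comment_body` is a hypothetical helper that returns the part of `src` a `//` comment would swallow, either when only LF ends the line or when the other UAX#31 end-of-line characters (assumed set) also end it.

```rust
/// Returns the prefix of `src` up to (not including) the first character
/// that ends the line under the chosen policy.
fn line_comment_body(src: &str, unicode_eols: bool) -> &str {
    let end = src
        .find(|c: char| {
            c == '\n'
                || (unicode_eols
                    && matches!(
                        c,
                        '\u{000B}' | '\u{000C}' | '\u{000D}' | '\u{0085}' | '\u{2028}' | '\u{2029}'
                    ))
        })
        .unwrap_or(src.len());
    &src[..end]
}
```

For `"// note\u{2028}launch();\nnext();"` the LF-only policy treats `launch();` as part of the comment, whereas an editor that renders U+2028 as a line break shows it as a separate line of code.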
Nobody implements the standard, so neither can we? I verified that Emacs, busybox-vi, and VS Code don’t, though the latter recognises and warns about the last two. Four different terminal emulators all do vtab & form-feed. I’m stunned that they list U+0085: AFAIK it is not valid Latin-1 and thus not valid Unicode.
We could warn if they are encountered unless there is either of
If these would be seen too late, they could be special cased as having to be on the very first line. To avoid checking every file for this, it could be done when first encountering one of these line ends.
To be consistent we would also need a new method str::unicode_lines().
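No such method exists in std today. As a rough idea of what it might do, here is a free-function sketch (`unicode_lines` and `is_line_break` are hypothetical, and the character set is the assumed UAX#31 end-of-line list). It mirrors `str::lines()` in not yielding a trailing empty line.

```rust
/// Assumed UAX#31 end-of-line set, LF included.
fn is_line_break(c: char) -> bool {
    matches!(
        c,
        '\u{000A}' | '\u{000B}' | '\u{000C}' | '\u{000D}' | '\u{0085}' | '\u{2028}' | '\u{2029}'
    )
}

/// Splits `s` at every end-of-line marker, treating CR LF as one break.
fn unicode_lines(s: &str) -> Vec<&str> {
    let mut lines = Vec::new();
    let mut start = 0;
    let mut iter = s.char_indices();
    while let Some((i, c)) = iter.next() {
        if !is_line_break(c) {
            continue;
        }
        lines.push(&s[start..i]);
        let mut next = i + c.len_utf8();
        if c == '\r' && s[next..].starts_with('\n') {
            iter.next(); // skip the LF half of a CR LF pair
            next += 1;
        }
        start = next;
    }
    if start < s.len() {
        lines.push(&s[start..]); // final line without a terminator
    }
    lines
}
```

For comparison, `"a\rb\u{85}c".lines()` yields a single item, since `str::lines()` only splits on LF (with an optional preceding CR), while the sketch above yields three. That difference is the kind of line-number mismatch between tools raised earlier in the thread.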
To expand on that a little, the original ISO 8859-1 spec did leave the 0x80 .. 0x9F range undefined, but they did that with the expectation that the range would be used to transmit the C1 control character set from ISO/IEC 6429. Unicode adopted that expectation as part of its definition of the U+0080..00FF block. So if someone sends you U+0085, that is officially supposed to be the C1 control character NEXT LINE, even though if they send you 0x85 in single-byte text it's far more likely to have the Windows-1252 meaning (U+2026 HORIZONTAL ELLIPSIS).
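A small Rust illustration of the distinction between the code point U+0085 and the byte 0x85. The helper names `latin1_to_char` and `utf8_bytes` are made up for this post; std has no Windows-1252 decoder, so that mapping is not shown in code.

```rust
/// Decode one byte as ISO 8859-1: `char::from(u8)` maps byte N to U+00NN.
fn latin1_to_char(byte: u8) -> char {
    char::from(byte)
}

/// UTF-8 encoding of a single char, as a byte vector.
fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}
```

Here `latin1_to_char(0x85)` is `'\u{85}'`, for which `char::is_control` returns true, and `utf8_bytes('\u{85}')` is the two-byte sequence `[0xC2, 0x85]`. A lone 0x85 byte is not valid UTF-8, so `String::from_utf8(vec![0x85])` returns an error, and a source file that really contains NEL has to contain `C2 85`.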