Unicode vs. Rust definition of newline for source text

zackw · June 7, 2025, 4:33pm

While discussing the "dedented string literals" RFC I noticed that Rust's treatment of newlines in source text does not appear to match what the Unicode annex that we're using as a normative reference for 'whitespace' says is supposed to happen.

The Rust reference says that "whitespace" is defined as any sequence of Unicode Pattern_White_Space characters, with a reference to UAX#31 for the definition of Pattern_White_Space. The Rust reference doesn't, as far as I can tell, draw much distinction between types of whitespace, although in another section it does say that CR LF sequences (U+000D immediately followed by U+000A) are "normalized" to just U+000A very early in lexical analysis.

In UAX#31, by contrast, Pattern_White_Space is subdivided into three classes which I'm going to call "horizontal whitespace", "ignorable format controls", and "end-of-line markers." The "end-of-line markers" class is quite extensive:

A sequence of one or more of any of the following characters shall be interpreted as a sequence of one or more end of line:

U+000A (line feed)

U+000B (vertical tabulation)

U+000C (form feed)

U+000D (carriage return)

U+0085 (next line)

U+2028 LINE SEPARATOR

U+2029 PARAGRAPH SEPARATOR
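
For concreteness, membership in this list can be written as a small predicate. This is only a sketch: the name `is_uax31_line_end` is made up for illustration and is not something in rustc or the standard library.

```rust
/// Returns true if `c` is one of the seven characters listed above as
/// end-of-line markers. Hypothetical helper, not part of rustc.
fn is_uax31_line_end(c: char) -> bool {
    matches!(
        c,
        '\u{000A}'      // line feed
            | '\u{000B}' // vertical tabulation
            | '\u{000C}' // form feed
            | '\u{000D}' // carriage return
            | '\u{0085}' // next line
            | '\u{2028}' // LINE SEPARATOR
            | '\u{2029}' // PARAGRAPH SEPARATOR
    )
}
```

Note that space and tab are Pattern_White_Space too, but they are not in this class, so the predicate returns false for them.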

Looking at the actual lexer implementation (compiler/rustc_lexer/src/lib.rs on master in rust-lang/rust), I see a bunch of places where it looks specifically for \n but not for any of the other characters in this list. And the newline normalizer (compiler/rustc_span/src/lib.rs at commit 5e0bdaa9dde845b8e44fd93bf0c09d21ca60daa1 in rust-lang/rust) doesn't convert anything besides CR LF into LF.

I'm wondering whether we should change the newline normalizer so that all of the above characters are normalized to LF, as well as CR LF sequences. The motivation would just be to align ourselves more precisely with Unicode, so that we don't have to explain that, despite talking about Pattern_White_Space, we ignored most of the actual UAX#31 recommendations. Actual usage of any of these characters (other than CR LF line endings) is vanishingly rare. It would technically be a breaking change, but only for programs that put a literal instance of one of these characters inside a string (including a docstring).
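
A rough sketch of what such a normalizer could look like follows. It is not the rustc implementation: the function name `normalize_line_ends` is hypothetical, and it assumes that each lone marker maps to exactly one LF while CR LF collapses to a single LF.

```rust
/// Hypothetical normalizer: maps CR LF and every lone end-of-line marker
/// from the list above to a single LF. Not the actual rustc code.
fn normalize_line_ends(src: &str) -> String {
    let mut out = String::with_capacity(src.len());
    let mut chars = src.chars().peekable();
    while let Some(c) = chars.next() {
        match c {
            '\r' => {
                // CR LF collapses to one LF; a bare CR also becomes LF.
                if chars.peek() == Some(&'\n') {
                    chars.next();
                }
                out.push('\n');
            }
            // The remaining non-LF markers each become one LF.
            '\u{000B}' | '\u{000C}' | '\u{0085}' | '\u{2028}' | '\u{2029}' => out.push('\n'),
            other => out.push(other),
        }
    }
    out
}
```

Because this runs before lexing, a literal U+2028 inside a string literal would also turn into LF, which is where the breaking change comes from.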

(Part of the "Do we need unicode whitespace?" thread touches on this issue but it does not seem to have been resolved.)

kpreid · June 7, 2025, 4:41pm

I appreciate the principle of this proposal, but given the hazards of line mis-interpretation (e.g. changing whether a line is part of a comment or not), it might be even better to prohibit literal instances of characters in the “end-of-line marker” class that are not either "\n" or "\r\n".

Or maybe a slightly more lenient rule: any such character may appear as long as it is preceded or followed by at least one \n. This is a generalization of allowing "\r\n" line endings, and would be appreciated by those old-school folks who like using form-feeds in code formatting (because it expresses “put this section on a new page”).
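
To make the lenient rule concrete, here is a sketch of a checker for it. The function name `unanchored_line_breaks` is invented for illustration, and it assumes "preceded or followed" means immediately adjacent to a \n.

```rust
/// Returns the byte offsets of unusual line-break characters that are
/// neither immediately preceded nor immediately followed by '\n'.
/// Hypothetical helper; one possible reading of the lenient rule.
fn unanchored_line_breaks(src: &str) -> Vec<usize> {
    let chars: Vec<(usize, char)> = src.char_indices().collect();
    let mut bad = Vec::new();
    for (i, &(offset, c)) in chars.iter().enumerate() {
        let unusual = matches!(
            c,
            '\u{000B}' | '\u{000C}' | '\u{000D}' | '\u{0085}' | '\u{2028}' | '\u{2029}'
        );
        if !unusual {
            continue;
        }
        let prev_is_lf = i > 0 && chars[i - 1].1 == '\n';
        let next_is_lf = chars.get(i + 1).map_or(false, |&(_, next)| next == '\n');
        if !prev_is_lf && !next_is_lf {
            bad.push(offset);
        }
    }
    bad
}
```

Under this reading, "\r\n" and a form feed on a line of its own both pass, while a U+2028 in the middle of a // comment is reported.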

kornel · June 7, 2025, 4:42pm

It would be painful if source code line numbers didn't align between rustc and other tools and editors.

I don't expect other tools to support anything besides (CR)LF. If some do, they're a minority, and use of other line break characters is going to give inconsistent results.

I think it would be better for Rust to walk back that definition, and disallow other line breaks outside of string literals, and warn by default about them appearing non-escaped in string literals.

zackw · June 7, 2025, 4:46pm

The part of me that grew up hacking on classic MacOS has nostalgia for bare \r as a line terminator, but nonetheless I could get behind disallowing the unusual line break characters unless preceded/followed by \n, as you suggest.

It's the current "silently treated as horizontal whitespace" behavior that I think is at least potentially problematic (e.g. if someone expected one of these characters, by itself, to terminate a // comment).

daniel-pfeiffer · June 7, 2025, 8:28pm

Nobody implements the standard, so neither can we? I verified that Emacs, busybox-vi, and VS Code don’t, though the latter recognises and warns about the last two. Four different terminal emulators all do vtab & form-feed. I’m stunned that they list U+0085: AFAIK it is not valid Latin-1 and thus not valid Unicode.

We could warn when they are encountered, unless one of these attributes is present:

#![respect_unicode_line_end(true)]
#![respect_unicode_line_end(false)]

If these would be seen too late, they could be special cased as having to be on the very first line. To avoid checking every file for this, it could be done when first encountering one of these line ends.

To be consistent we would also need a new method str::unicode_lines().
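
As a sketch of what that method might do, here is a free function with the same shape as str::lines(). No `unicode_lines` exists in the standard library today; the name and the exact behaviour (CR LF counts as one break, a trailing break does not produce an empty final line) are assumptions.

```rust
/// Hypothetical counterpart to `str::lines()` that splits on every
/// end-of-line marker from the UAX#31 list, treating CR LF as one break.
fn unicode_lines(s: &str) -> Vec<&str> {
    let mut lines = Vec::new();
    let mut start = 0;
    let mut iter = s.char_indices().peekable();
    while let Some((i, c)) = iter.next() {
        let is_break = matches!(
            c,
            '\n' | '\u{000B}' | '\u{000C}' | '\r' | '\u{0085}' | '\u{2028}' | '\u{2029}'
        );
        if !is_break {
            continue;
        }
        lines.push(&s[start..i]);
        let mut end = i + c.len_utf8();
        // Swallow the LF of a CR LF pair so it counts as a single break.
        if c == '\r' && matches!(iter.peek(), Some(&(_, '\n'))) {
            iter.next();
            end += 1;
        }
        start = end;
    }
    // Keep a final line that has no terminator, as `str::lines()` does.
    if start < s.len() {
        lines.push(&s[start..]);
    }
    lines
}
```

For comparison, the existing str::lines() only splits on \n and \r\n, so "a\u{2028}b".lines() yields a single line where this sketch yields two.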

Jules-Bertholet · June 7, 2025, 10:16pm

U+0085 is a valid Unicode character (technically a “Unicode scalar value”) in the Latin_1_Supplement block, with General_Category “Control”.

zackw · June 8, 2025, 1:44am

To expand on that a little, the original ISO 8859-1 spec did leave the 0x80 .. 0x9F range undefined, but they did that with the expectation that the range would be used to transmit the C1 control character set from ISO/IEC 6429. Unicode adopted that expectation as part of its definition of the U+0080..U+00FF block. So if someone sends you U+0085, that is officially supposed to be the C1 control character NEXT LINE, even though if they send you 0x85 in unibyte text it's far more likely to have the Windows-1252 meaning (U+2026 HORIZONTAL ELLIPSIS).
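
The byte-level side of this can be checked with the standard library alone. The two helper names below are made up for illustration; the Windows-1252 reading is not shown because std has no decoder for that encoding.

```rust
/// Interprets one byte as ISO 8859-1, where the byte value is the code point.
/// Hypothetical helper name; `char::from(u8)` performs exactly this mapping.
fn decode_latin1_byte(b: u8) -> char {
    char::from(b)
}

/// Encodes a character as its UTF-8 byte sequence.
fn utf8_bytes(c: char) -> Vec<u8> {
    c.to_string().into_bytes()
}
```

With these, decode_latin1_byte(0x85) gives U+0085, which char::is_control() reports as a control character, and utf8_bytes('\u{0085}') is the two-byte sequence 0xC2 0x85. A lone 0x85 byte is not valid UTF-8 at all, so std::str::from_utf8(&[0x85]) returns an error.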
