Should BufRead::lines() et al recognize more than just LF?

BufRead::lines(), BufRead::read_line(), and other line-oriented interfaces in the standard library recognize only LF as a line ending. However, Unicode defines several other line endings.

Specifically, the following codepoints (and one grapheme) are supposed to be recognized as line endings according to Unicode Standard Annex #14:

  • LF (U+000A)
  • CR (U+000D)
  • CR+LF (U+000D + U+000A)
  • NEL (U+0085)
  • LS (U+2028)
  • PS (U+2029)
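To make the current behavior concrete, here is a quick check (using std::io::Cursor as an in-memory reader) showing that BufRead::lines() splits on LF but leaves a NEL embedded in the line:

```rust
use std::io::{BufRead, Cursor};

/// Collect the lines that BufRead::lines() produces for an in-memory string.
fn stdlib_lines(data: &str) -> Vec<String> {
    Cursor::new(data).lines().map(|l| l.unwrap()).collect()
}

fn main() {
    // LF splits, but NEL (U+0085) is left embedded in the middle line.
    let lines = stdlib_lines("one\ntwo\u{0085}still two\nthree");
    assert_eq!(lines, vec!["one", "two\u{0085}still two", "three"]);
    println!("{:?}", lines);
}
```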

Given the strong Unicode support throughout the rest of the standard library, it seems inconsistent to recognize only LF in line-oriented functionality. But consistency aside, Windows conventionally uses CRLF for line endings, so Rust programs written with these stdlib functions won't behave as expected when dealing with Windows text files. That seems bad.

Is the current behavior just an oversight? And if not, I’m curious what the rationale behind this LF-only behavior is.


I would assume that it was designed from a UNIX perspective. One could argue that recognizing only LF is the correct behavior on UNIX systems. By that logic, though, the method would need to recognize only CR on Mac (is that still the case, or have they joined UNIX?) and only CR+LF on Windows. The current behavior is also likely simpler and faster to implement.

That said, recognizing all Unicode line endings should IMHO be the default, system-agnostic API, while the others could be made available in the os::* modules.


I believe that CR for newlines is only for classic Mac OS prior to OSX.

If the code is indeed UNIX-specific even on other platforms, it should be fixed.

On a more careful reading of Unicode Standard Annex #14, FF (U+000C) and VT (U+000B) are also line-breaking characters. FF support is mandatory, while VT is optional. (See the section "BK: Mandatory Break".)

So the complete list of required line endings that a conforming implementation must support is:

  • LF (U+000A)
  • FF (U+000C)
  • CR (U+000D)
  • CR+LF (U+000D + U+000A)
  • NEL (U+0085)
  • LS (U+2028)
  • PS (U+2029)

If someone could sanity-check that, I would really appreciate it! I may still have missed something or misread something.

@llogiq

As DanielKeep noted, CR on its own is Mac OS pre-OSX.

That said, recognizing all unicode line endings should IMHO be the default system-agnostic API, while the others could be made available in the os:* crates.

I agree. Especially because software often deals with text that wasn't created on its host OS. Just because I'm running Linux doesn't mean I'll never see a CRLF text file, and just because someone else is running Windows doesn't mean they'll never see an LF file.

Also, it's relatively easy to get the current lines() behavior with split(), e.g.:

for l in some_reader.split('\n' as u8) {
    // `l` is an io::Result<Vec<u8>>; unwrap for brevity.
    let line = String::from_utf8(l.unwrap());
    // Do whatever you want with line
}

CR on its own is Mac OS pre-OSX

Duly noted.

I still think it doesn't hurt to have lines() methods in the os-specific modules. It makes the intent clearer (e.g. "I want to read a file by line in Unix format" instead of "I want to split my reader by 10u8"). The split over all newlines shouldn't be too hard to implement, right?

CRLF is pretty popular in network protocols.

Well, \r does complicate things in one specific case: reading a single line without lookahead or context [1]. It makes it impossible, which is inconvenient.

Given that classic Mac OS is only used in legacy contexts, I've never felt too bad about just not supporting it.


[1] : This is the case when you want read_line(stdin) to work.

But don’t we already need some kind of buffer to keep the contents of the line? If so, I’d suggest that splitting on \r\n should be possible with little added complexity (just skip \n at the start of the line).

I fear you’ve missed the point: by definition, you don’t have any external state. Yes, there’s a buffer, but it’s created within read_line and is never passed back in. Also, I didn’t say you can’t split on \r\n, I said you can’t split on \r: you cannot possibly know if it will be followed by an \n or not. You have to assume one way or the other, sacrificing either classic Mac newlines (which are vanishingly rare) or Windows newlines (which would be an apocalyptically terrible idea).

@DanielKeep

I think @llogiq means that for things like the Lines iterator which don’t include the line endings in the returned strings, you can simply skip the LF if the last character was a CR. I’ve whipped up a simple (but inefficient and doesn’t-handle-errors-properly) example of how that could work:

struct UnicodeLines<B: BufRead> {
    buf_chars: Chars<B>, // relies on the (unstable) Read::chars adapter
    last_char: char,
}

impl<B: BufRead> Iterator for UnicodeLines<B> {
    type Item = io::Result<String>;

    fn next(&mut self) -> Option<io::Result<String>> {
        let mut line = String::new();

        while let Some(Ok(c)) = self.buf_chars.next() {
            match c {
                '\u{000A}' => {
                    // An LF directly after a CR is the second half of a
                    // CRLF pair; skip it rather than emitting an extra line.
                    if self.last_char == '\u{000D}' {
                        self.last_char = c;
                        continue;
                    } else {
                        self.last_char = c;
                        break;
                    }
                },

                // FF, CR, NEL, LS, PS: each ends the current line.
                '\u{000C}'
                | '\u{000D}'
                | '\u{0085}'
                | '\u{2028}'
                | '\u{2029}' => {
                    self.last_char = c;
                    break;
                },

                _ => {
                    line.push(c);
                    self.last_char = c;
                },
            }
        }

        // Note: an empty buffer is treated as end-of-stream here, so
        // empty lines are silently dropped.
        if !line.is_empty() {
            Some(Ok(line))
        } else {
            None
        }
    }
}

For something like read_line(), however, which does include the line ending, that obviously doesn't work. But I feel it's an acceptable trade-off, then, to support only CRLF and not CR on its own. Where we reasonably can, we should support the full Unicode set of line endings, IMHO, and only make compromises where technical reasons require it.

There is one problem with my proposal: Our version will fail to recognize an empty line at the start of our stream if the newline is LF-denoted.
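For what it's worth, when the whole text is already in memory, a splitter that handles CRLF and preserves empty lines needs only one character of lookahead via peekable(); `unicode_lines` below is a hypothetical helper, not an existing API:

```rust
/// Hypothetical helper: split `s` on the required UAX #14 line endings,
/// treating CR+LF as a single terminator. A trailing line ending does
/// not produce a final empty line (matching BufRead::lines()).
fn unicode_lines(s: &str) -> Vec<&str> {
    let mut out = Vec::new();
    let mut start = 0;
    let mut iter = s.char_indices().peekable();
    while let Some((i, c)) = iter.next() {
        let is_break = matches!(
            c,
            '\u{000A}' | '\u{000C}' | '\u{000D}' | '\u{0085}' | '\u{2028}' | '\u{2029}'
        );
        if !is_break {
            continue;
        }
        out.push(&s[start..i]);
        let mut end = i + c.len_utf8();
        // One character of lookahead: an LF right after a CR belongs
        // to the same CRLF terminator.
        if c == '\u{000D}' {
            if let Some(&(j, '\u{000A}')) = iter.peek() {
                iter.next();
                end = j + 1;
            }
        }
        start = end;
    }
    if start < s.len() {
        out.push(&s[start..]);
    }
    out
}

fn main() {
    assert_eq!(unicode_lines("a\r\nb\rc\nd\u{0085}e"), vec!["a", "b", "c", "d", "e"]);
    assert_eq!(unicode_lines("\nx"), vec!["", "x"]); // leading empty line preserved
}
```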

Note: I am only saying it like this to ensure I cannot be misunderstood, I am not trying to berate you.

You cannot distinguish between \r\n and \r without external state. Solutions that involve external state do not count.

The whole point of the exercise was to write a read_line function that does not depend on any state, because people kept using stdin wrongly: no one expected stdin to construct a new, totally independent buffer on every call, which often meant that the first call consumed all available input, even input that wasn't used. I needed something that was not unreasonably demanding for my scanln! macro, so I needed a way to extract a line without constructing any buffers or maintaining internal state.

I am aware that stdin now always returns the same buffered reader (or so I believe).

However, it's still worth keeping in mind that most people don't seem to realise or understand that when you wrap a reader, either with a buffer or some other kind of hidden state, you fundamentally cannot extricate that reader ever again without potentially losing information.

I think I get your point now. The problem is not iter_lines(), but read_line(), right?

You cannot distinguish between \r\n and \r without external state. Solutions that involve external state do not count.

Hmm... perhaps I'm misunderstanding what you mean by "external state". The example I just presented only has internal state, as far as I can tell (at least, that's the only additional state compared to how stdlib already functions). The state is entirely contained within the iterator. But maybe I'm still missing your point. If so, can you point out a use-case where the solution I presented doesn't work, but the current Lines iterator does?

In any case, as I already noted, the solution I presented depends on a specific property of the Lines iterator: it doesn't include the line endings in its returned strings. So this doesn't work for read_line, which does include line endings. Is that what you're getting at?

However, it's still worth keeping in mind that most people don't seem to realise or understand that when you wrap a reader, either with a buffer or some other kind of hidden state, you fundamentally cannot extricate that reader ever again without potentially losing information.

Indeed. And this is why creating a BufReader from a reader of some kind consumes the reader, correct? In fact, producing a Lines iterator from a BufReader consumes the BufReader as well.

But I guess what I'm trying to get at is that just because we can't give perfect Unicode support here in all situations doesn't mean we shouldn't give the best Unicode support we reasonably can in each case. I'm suggesting that the Lines iterator can handle CR/CRLF just fine, and so it should, whereas read_line cannot, and so it shouldn't. And each should document what limitations (if any) it has, and why.

Alternatively, we could also give read_line a boolean parameter specifying whether it should wait for the next character (or EOF) after meeting a CR or not. Then the calling code can decide whether to have fully proper unicode support vs better behavior in an interactive/streaming setting.
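To sketch that idea (`read_line_flex` and its `greedy_crlf` flag are hypothetical, not a proposed std API): with the flag set, the reader waits for one more byte after a CR and swallows a following LF; with it unset, it returns immediately at the CR.

```rust
use std::io::{self, BufRead};

/// Hypothetical read_line variant. With `greedy_crlf`, a CR causes one
/// extra byte of lookahead so that a following LF is consumed as part
/// of the terminator; without it, the function returns at the CR.
fn read_line_flex<R: BufRead>(r: &mut R, greedy_crlf: bool) -> io::Result<String> {
    let mut buf = Vec::new();
    loop {
        let byte = match r.fill_buf()?.first().copied() {
            Some(b) => b,
            None => break, // EOF
        };
        r.consume(1);
        if byte == b'\n' {
            break;
        }
        if byte == b'\r' {
            if greedy_crlf {
                // On interactive input this blocks until one more byte arrives.
                if r.fill_buf()?.first() == Some(&b'\n') {
                    r.consume(1);
                }
            }
            break;
        }
        buf.push(byte);
    }
    String::from_utf8(buf).map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}

fn main() -> io::Result<()> {
    use std::io::Cursor;
    let mut input = Cursor::new("one\r\ntwo");
    assert_eq!(read_line_flex(&mut input, true)?, "one");
    assert_eq!(read_line_flex(&mut input, true)?, "two");

    // Without the flag, the LF of the pair shows up as an empty line.
    let mut input = Cursor::new("one\r\ntwo");
    assert_eq!(read_line_flex(&mut input, false)?, "one");
    assert_eq!(read_line_flex(&mut input, false)?, "");
    assert_eq!(read_line_flex(&mut input, false)?, "two");
    Ok(())
}
```

The trade-off the flag expresses is exactly the one discussed above: greedy CRLF handling is right for files, but needs lookahead that can stall an interactive stream.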

DanielKeep means state external to the "reader" stream. Std-input is an OS-provided stream. If you wrap it with a buffer and then later want to remove the stream from that buffer, you lose any bytes taken by the buffer but not yet consumed, since they cannot be put back into the stream. And if you add state to a wrapper (external to the stream but internal to the wrapper), that state is lost were you to remove the stream from the wrapper.

That said, I don't see why that is a major issue in the specific case of a state flag saying "I just read \r as an end-of-line character, so if the next character is \n, ignore it". But I still think keeping things simple by not supporting \r is a better idea.

To re-emphasise, the context of needing read_line is a scanln! macro that expands (very roughly) to this:

{
    let line = read_line(::std::io::stdin());
    scan!(&*line, $patterns)
}

The whole point is to be an argument-free, easy-to-use shorthand for parsing a line of input from STDIN. It can’t have any external state because that would defeat the reason for having it in the first place. The only thing that’s allowed to escape the expanded block is the result.

And yes, &self counts as external; it’s external to the function.

@DanielKeep: In this case it makes sense for read_line to ignore '\r' (at least unless running on Mac OS pre-X). The Lines iterator could include it, but I doubt the additional code would pay off.

The formats may be os-derived but the files are not os-specific – they cross the system boundaries all the time.

Supporting other kinds of newline (maybe through a parameter for BufRead::lines?) makes sense, but this should not be implicitly based on the platform a program happens to be compiled for.

This can be written more simply with a byte literal:

for l in some_reader.split(b'\n') {
    // ...
}