Idea: Opt-out Stdout synchronization

Many command-line programs don't use multi-threading, and stdout synchronization is overkill for them. It would be nice if it could be opted out of, just like panic=abort.

Such a solution would have to go quite deep and recompile the standard library (which even panic=abort doesn't do yet).

Why isn't locking it and keeping it locked for the whole lifetime of the program enough?

https://doc.rust-lang.org/std/io/struct.Stdout.html#method.lock
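For reference, a minimal sketch of that approach. Since Rust 1.61, `StdoutLock` is `'static`, so the guard can be held for the whole program; the `emit` helper is just an illustration of writing through any `impl Write` so the same code also works against an in-memory buffer:

```rust
use std::io::{self, Write};

// Write through any `Write` sink, so the same code works with a held
// StdoutLock or with a Vec<u8> in tests.
fn emit(out: &mut impl Write) -> io::Result<()> {
    for i in 0..3 {
        writeln!(out, "line {}", i)?;
    }
    Ok(())
}

fn main() -> io::Result<()> {
    // StdoutLock<'static> (Rust 1.61+): the lock is taken once here and
    // held until `out` is dropped at the end of main.
    let mut out = io::stdout().lock();
    emit(&mut out)
}
```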


Because then you have to pass it around to every function that wants to print.

The buffered stdout is a global resource and you've probably seen that Rust insists on synchronization or equivalent for static variables and thread locals, even in a single threaded program. So we don't really get around the need for synchronization that easily.

Can this be solved by using a different logging framework that skips using std stdout at all?


Stdout is already going to need to require indirection or branching in order to support an API to disable line-buffering. So perhaps a third option to further disable block buffering could also be added; in this case, no synchronization on the Rust end should be necessary. (only that which is inherent to the kernel)

If we want both block buffering and unsynchronized access, well... Yeah. That's rough.


What happens currently when multiple threads try to lock Stdout? Does the second one block? (I'm confused by the implementation of lock, which looks like it should always succeed immediately...). Perhaps if we added a Stdout::try_lock if necessary, then an application could store StdoutLock as a thread-local inside a log implementation?

> What happens currently when multiple threads try to lock Stdout? Does the second one block? (I'm confused by the implementation of lock, which looks like it should always succeed immediately...)

Stdout contains a ReentrantMutex, which blocks when locking. It allows multiple calls to lock it from a single thread without blocking though.
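A quick way to see the reentrancy (the function name here is made up for illustration): taking the lock twice from the same thread succeeds, whereas a second thread calling `lock()` while a guard is held would block.

```rust
use std::io;

// Demonstrates that Stdout's internal lock is reentrant: locking twice
// from the same thread succeeds instead of deadlocking.
fn relock_same_thread() -> bool {
    let stdout = io::stdout();
    let _a = stdout.lock();
    let _b = stdout.lock(); // would deadlock here with a plain Mutex
    true
}

fn main() {
    assert!(relock_same_thread());
}
```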

https://doc.rust-lang.org/stable/src/std/io/stdio.rs.html#404

Once parking_lot is merged, it should be possible to make the overhead negligible.

On Unix writes to stdout can return partial writes (depending on what the stdout fd points to), so synchronization cannot be removed for multithreaded programs while preserving current behavior regardless of buffering mode or absence of buffering.
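To illustrate why partial writes matter: `write_all` in std is essentially a retry loop like the sketch below, and two threads running such loops concurrently without a lock could interleave their chunks mid-message.

```rust
use std::io::{self, Write};

// The shape of std's write_all: write(2) may accept only part of the
// buffer, so we loop until everything is written.
fn write_all_loop(out: &mut impl Write, mut buf: &[u8]) -> io::Result<()> {
    while !buf.is_empty() {
        match out.write(buf) {
            Ok(0) => {
                return Err(io::Error::new(io::ErrorKind::WriteZero, "zero-length write"))
            }
            Ok(n) => buf = &buf[n..], // partial write: retry with the rest
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    write_all_loop(&mut io::stdout().lock(), b"hello\n")
}
```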

Making Rust behavior conditional on multithreading is problematic because C code could spawn threads via pthread_create and then call Rust code, and this can probably only be detected in hacky ways that aren't really worth the risk.


If the unbuffered mode requires opt-in, preservation of current behavior is not a requirement.

I want to endorse doing something to make single-threaded pipeline programs easy to write yet still fast. I have a pipeline program in C++ that sped up by at least an order of magnitude when I added std::cin.tie(0) to the beginning of main. (Passing a StdoutLock instance around doesn't seem that bad off the top of my head, though. I've been meaning to try rewriting that program in Rust and see how it compares, maybe this is a good excuse :wink:)
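For comparison, the Rust version of that pipeline pattern might look like the sketch below (`filter` is a made-up name): both handles are locked once, and the output gets block buffering on top, so a line-by-line program doesn't pay a lock acquisition plus a write syscall per line.

```rust
use std::io::{self, BufRead, BufWriter, Write};

// A rough Rust analog of the C++ cin.tie(0) trick: lock once, buffer the
// output, flush at the end.
fn filter(input: impl BufRead, output: impl Write) -> io::Result<()> {
    let mut out = BufWriter::new(output);
    for line in input.lines() {
        writeln!(out, "{}", line?)?;
    }
    out.flush()
}

fn main() -> io::Result<()> {
    let stdin = io::stdin();
    filter(stdin.lock(), io::stdout().lock())
}
```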

On UNIX, writes below a certain size (PIPE_BUF) are inherently atomic and don't require synchronization (beyond what the kernel does internally). Could we avoid synchronization below that size?

It would be necessary to block on pending writes bigger than that size. Also, what about pipes with O_NONBLOCK set?

Yes, exactly; you'd still need synchronization, but only for large writes. Eliminating synchronization for small writes would be well worth doing.
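A sketch of the dispatch being discussed (this is not how std behaves, and `needs_lock` is a made-up name). POSIX guarantees that writes to a pipe of at most PIPE_BUF bytes are atomic, and 512 (`_POSIX_PIPE_BUF`) is the minimum value the standard allows, so it serves as a conservative portable bound; a real implementation would also have to verify that stdout actually is a pipe.

```rust
// The minimum PIPE_BUF value POSIX permits; the real value for a given
// system may be larger (commonly 4096 on Linux).
const PIPE_BUF_MIN: usize = 512;

// Only writes larger than the atomicity bound would need the Rust-side
// lock; smaller writes could rely on kernel atomicity for pipes.
fn needs_lock(write_len: usize) -> bool {
    write_len > PIPE_BUF_MIN
}

fn main() {
    assert!(!needs_lock(512));
    assert!(needs_lock(4096));
}
```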

And I'm just talking about stdout and similar Rust-level constructs, where you don't need to worry about a file descriptor unexpectedly in nonblocking mode.

Imagine a short write from one thread racing with a large write from another thread. If the runtime does no synchronization around the short write, it could land in the middle of the large write. It might be possible to use atomic scalars in some clever way to prevent that from happening, without needing to take a full-fledged mutex around the short writes in the uncontended case, but I don't see how to do that off the top of my head and I'm not sure it would wind up being enough cheaper to be worth the design work—the cost of locking a well-designed mutex in the uncontended case is already just a single compare-and-swap.

Also, if I'm reading the spec correctly, POSIX requires this behavior only for pipes [search the page for "PIPE_BUF"]; not for regular files, terminals, sockets, or character devices. That means we would have to check what stdout actually is before enabling this optimization, and we'd have to decide on some rules for when you are allowed to change what file descriptor 1 points to. (It has to be allowed at least some of the time or some programs become unimplementable.)

Also also, I'm inclined to think that Unix-C-style smart autobuffering of stdin and stdout is likely to get us more performance win for our efforts. The semantics of that are: on the first use of either stdin or stdout, you check whether isatty is true for them. If either is a tty, it is line-buffered, otherwise it is fully buffered. If they both refer to the same tty, reading from stdin automagically flushes stdout first; otherwise there is no synchronization between the two. This can be overridden by calling setvbuf. I'd have to think about it a bit more to figure out how to make this fit into the existing Rust I/O API where the bare stdin and stdout are always unbuffered and you have to wrap them in BufReader/BufWriter if you want buffering, but I don't see any reason it couldn't be done.
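The isatty half of that could be sketched today with `std::io::IsTerminal` (stable since Rust 1.70); `auto_buffered_stdout` is a hypothetical name, and this doesn't cover the stdin/stdout tie-flushing or `setvbuf`-style overrides:

```rust
use std::io::{self, BufWriter, IsTerminal, Write};

// Sketch of Unix-C-style auto-buffering: line-buffered when stdout is a
// terminal (output appears promptly), block-buffered otherwise (fast in
// pipelines).
fn auto_buffered_stdout() -> Box<dyn Write> {
    let out = io::stdout();
    if out.is_terminal() {
        // Stdout's built-in LineWriter already line-buffers.
        Box::new(out)
    } else {
        Box::new(BufWriter::new(out.lock()))
    }
}

fn main() -> io::Result<()> {
    let mut out = auto_buffered_stdout();
    writeln!(out, "hello")?;
    out.flush()
}
```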


What's the bare stdout in Rust? The default stdout is already line-buffered via std::io::LineWriter.


Only if you are writing to a pipe, and stdout may be any kind of file object, although I suppose you could detect and specially handle pipes.

libstdc++ (gcc) actually detects whether pthread is linked and switches between atomic and non-atomic increments/decrements of the std::shared_ptr counter accordingly... and I find it incredibly wacky.

I'm all for high-performance single-threaded Rust programs, but let's do it correctly: use a flag to choose between the two modes, disable thread-spawning and build a special version of std where all atomics compile to non-atomic instructions, ...

On leetcode I always write this:

#include <iostream>

// Runs before main: disable C stdio synchronization and untie cin from
// cout, so reads no longer flush cout first.
const auto io_speed_up = []() {
    std::ios::sync_with_stdio(false);
    std::cin.tie(nullptr);
    return nullptr;
}();

on top of my C++ programs, and it speeds them up by at least an order of magnitude, often by two or more. Now that leetcode supports Rust, I see my solutions being orders of magnitude slower than my C++ ones. The Rust standard library mentions the time and space complexity of its operations pretty much nowhere, so every now and then when something is suspiciously slow I end up going back to my C++ solutions and removing that speed-up code to slow them down, just so I can make a more apples-to-apples comparison.

You're right, sorry about that. I mixed up issue 64413 with issue 60673.

It looks like part of what I proposed is already being implemented in PR 64681; the missing pieces are stdin/stdout synchronization and manual overrides.