Idea: expose Linux raw syscall interface in std::os::linux

Static linking is fine, but I don’t see why this couldn’t be done as a separate no_std crate if the nightly features were stabilized. You can still statically link if you use third-party crates.

I like the idea of including this in core. It seems more useful there than in std, where you already have higher-level OS abstractions (and can fall back to libc::syscall if you really need to). The constants for Linux syscall numbers, however, seem like a much better fit for an external library.

I'd be willing to believe that there are applications targeting exotic OSes that would need to make syscalls, but otherwise wouldn't need any fancy assembly. With this proposal they'd be able to run on a stable compiler, which otherwise wouldn't be possible because inline assembly has been stuck in nightly only for years now and doesn't look to be changing any time soon. Further, even if they run on nightly, it isn't like there'd be a ton of variability in the implementation of a syscall() function; I have an almost identical one to the proposal here in my own code.


I've published a Rust PR with the changes: https://github.com/rust-lang/rust/pull/63745

Why can't you use a #![no_std] crate from crates.io that exposes this?

See my reply in the PR thread.

I was alluding to the idea there could be separate targets as an alternative to musl, and an alternative impl of std which uses it. This sounds great for Rust unikernels or other use cases where minifying code (particularly C code) is desirable.

Yes, that can all be done out-of-tree as third-party crates and swapping out std in sysroot, but it'd be so much better if done in-tree, IMO.


FYI: the PR is currently in FCP with "disposition close", see this comment.


A lot of feedback has been given in the PR. I think that there is a real problem here affecting users, and that this problem is worth solving. It's very unclear to me that this solution is the best solution to the problem, and I think it would be a better idea to step back and write down precisely what the problem is, what the general properties of a good solution are (e.g. it should be possible to inline the syscall wrappers), what solutions we currently have available for this problem in stable/unstable Rust and in std/#![no_std] binaries, and what their pros and cons are.

The constraint that it should be possible to inline the thin "system call" wrapper would require justification. For example, C code cannot inline syscall. The most that inlining here does is remove a single function call that jumps into an assembly blob. We can't inline through the assembly, and we can't inline further into the kernel. Also, inlining isn't always desirable, so it should at most be left up to the optimizer.


For example, some of the current solutions with their pros / cons would be:

use linked syscall wrapper from the C library

Binaries that link libstd can, on stable Rust, without any dependencies, reliably write:

extern "C" { fn syscall(...) -> ...; }

and use the libc syscall API. This solution can be more easily used on nightly, without any extern declaration, by enabling the unstable feature(libc) and just writing use libc::syscall.

Pros of this solution are that it is the stable solution of the platform, it is always guaranteed to work with all codegen backends, that it provides a nice variadic API, and that it can be made to work with a libc-free standard (the standard library would just provide a syscall symbol).

Cons of this solution are the lack of inlining of the syscall wrapper, that these wrappers unnecessarily use errno, and that this solution does not work on #![no_std].
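To make the libc route concrete, here is a minimal sketch, assuming a Linux target on x86_64 or aarch64 and a toolchain recent enough to accept `unsafe extern` blocks. The helper names `getpid_via_libc` and `close_via_libc` and the hard-coded syscall numbers are mine, for illustration only. It also shows the errno detour mentioned in the cons: the wrapper returns -1 and the real error code has to be fetched separately.

```rust
use std::io;
use std::os::raw::c_long;

unsafe extern "C" {
    // The variadic libc entry point: long syscall(long number, ...);
    fn syscall(number: c_long, ...) -> c_long;
}

// Syscall numbers differ per architecture; these are the Linux values.
#[cfg(target_arch = "x86_64")]
const SYS_GETPID: c_long = 39;
#[cfg(target_arch = "x86_64")]
const SYS_CLOSE: c_long = 3;
#[cfg(target_arch = "aarch64")]
const SYS_GETPID: c_long = 172;
#[cfg(target_arch = "aarch64")]
const SYS_CLOSE: c_long = 57;

/// getpid cannot fail, so the return value is used directly.
pub fn getpid_via_libc() -> c_long {
    unsafe { syscall(SYS_GETPID) }
}

/// close can fail; libc reports that as -1 and stores the code in errno.
pub fn close_via_libc(fd: c_long) -> io::Result<()> {
    if unsafe { syscall(SYS_CLOSE, fd) } == -1 {
        // last_os_error reads the thread-local errno set by libc.
        Err(io::Error::last_os_error())
    } else {
        Ok(())
    }
}
```

Calling `close_via_libc(-1)` should come back as an `Err` carrying EBADF, recovered from errno rather than from the return value itself.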

compile and link an assembly blob

We can do this with the cc crate on stable, or with global_asm! on nightly.

Pros: always guaranteed to work on stable Rust (the cc version), works with all codegen backends (the cc version), multiple syscalls can be provided, works with libc-free standard, works with #![no_std].

Cons of this solution: requires a C compiler (which is always available for Linux targets), can't be inlined.

compile and link a C wrapper

Stable Rust binaries with or without std can implement these wrappers in C with inline assembly, and link them into their own binary using xLTO.

Pros of this solution are that this is always guaranteed to work on stable Rust, the syscall wrappers can be inlined, works with all codegen backends, multiple syscalls can be provided, a nice variadic syscall wrapper that does not use errno can also be provided and inlined, works with libc-free standard, works with #![no_std].

Cons of this solution: requires writing C code (bad, but it only must be done once), requires a C compiler (which is always available for Linux targets), requires building with xLTO, which isn't straightforward and requires using an appropriate linker.

using inline asm!

On unstable Rust we can always use asm!.

Pros: the syscall wrappers can be inlined, works for #![no_std], works with libc-free standard.

Cons: asm! is unstable and code using it can and does break, can't implement a variadic syscall API that does not use errno, doesn't work for different codegen backends (e.g. it would abort when used with librustc_codegen_cranelift)
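For reference, a zero-argument wrapper of this kind could look roughly like the sketch below. It assumes a Linux target on x86_64 or aarch64 and a toolchain where the `asm!` macro from `std::arch` is available without a feature gate, which is newer than the state of affairs described above. The name `syscall0` and the `SYS_GETPID` constants are illustrative, not taken from the PR.

```rust
use std::arch::asm;

// Linux syscall number for getpid on each supported architecture.
#[cfg(target_arch = "x86_64")]
pub const SYS_GETPID: usize = 39;
#[cfg(target_arch = "aarch64")]
pub const SYS_GETPID: usize = 172;

/// Raw syscall with no arguments (x86_64: number and result in rax).
#[cfg(target_arch = "x86_64")]
#[inline]
pub unsafe fn syscall0(nr: usize) -> usize {
    let ret: usize;
    unsafe {
        asm!(
            "syscall",
            inlateout("rax") nr => ret,
            // The syscall instruction clobbers rcx and r11.
            lateout("rcx") _,
            lateout("r11") _,
            options(nostack),
        );
    }
    ret
}

/// Raw syscall with no arguments (aarch64: number in x8, result in x0).
#[cfg(target_arch = "aarch64")]
#[inline]
pub unsafe fn syscall0(nr: usize) -> usize {
    let ret: usize;
    unsafe {
        asm!("svc 0", in("x8") nr, lateout("x0") ret, options(nostack));
    }
    ret
}
```

The returned value is the raw register content; nothing here touches errno, and the function is an ordinary `#[inline]` Rust function as far as the optimizer is concerned.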


Some of the ways in which we could solve this problem are:

Add this to core (somewhere)

Pros: all users might be able to just include them and use them, would inline wrappers, would work with #![no_std], would work with libc-free standard, we could add any API we'd want

Cons: increases the surface area of the standard library (requires convincing the libs team of the value this API adds), unclear how to implement this such that it works for backends that do not support inline assembly like rustc_codegen_cranelift (might require linking C code to core), duplicates an API that libstd already exposes (libstd already exposes syscall on all Linux targets, and core would be re-exposing a different way of doing syscalls).

Making xLTO easier and simpler to use

We could add support for xLTO to the cc crate to make it easy to use on linux for this, which is something probably worth doing anyways, and would immediately work on stable.

Stabilize inline assembly or global assembly

One could always push for an RFC that stabilizes inline assembly. There is no time pressure for this given the existence of the other solutions.


There are probably many other ways to solve this problem currently, things one can do to improve the current situation in the meantime, and probably many other ways to fully solve it with a great solution in the future.

From all the solutions I see, using the cc crate to compile C code that uses inline assembly sounds like the best way to do this today. It's unclear to me whether inlining the wrappers is worth it, but if it is, using -Clinker-plugin-lto when building the crate would solve the issue and would work with thinLTO so all --release builds would benefit from it.

From all other solutions, stabilizing inline assembly is something that we want to do anyways, and would allow solving this problem in pure Rust, making any other solution that we come up with here obsolete.


Regarding errno, that exists entirely because C doesn't have anything like Result; I'm surprised to see that the syscall crate's interface returns a bare usize instead of Result<usize, Errno>. Distinguishing between an error return and a valid result is not as simple as "is it negative?", and it should properly be handled by the syscall wrappers.

(IIRC, the exact rule on Linux is that the value in the return-value register after a system call is an error code if and only if it is in the interval [−4095, 0). I don't know how the current generation of BSDs do it, but several historical Unixes used the carry flag to indicate an error, so you absolutely had to check it from assembly language immediately after the trap instruction.)

I feel like there should be a crate (using nightly inline assembly, at least for proof of concept) that has properly typed wrappers for all Linux system calls and also knows how to invoke the vDSO for clock_gettime and stuff like that. And a similar one for OSX and FreeBSD and any other popular hosted platforms. (Not Windows, where it's not even possible to bypass the lowest level of system-provided DLLs.) Does that already exist?


It's possible but it's extremely unwise to do so as system calls have an unstable interface and in any case aren't officially documented.

I could be wrong but I was also under the impression that OSX behaved similarly to Windows in this regard. Apple has stated that:

A statically linked binary assumes binary compatibility at the kernel system call interface, and we do not make any guarantees on that front. Rather, we strive to ensure binary compatibility in each dynamically linked system library and framework.

I think FreeBSD also reserves the right to change kernel ABI, but has compat symbol versions in its libc. Linux is unusual in having its kernel developed independently.

More precisely what I meant is that, as far as I know, the NT kernel will always load ntdll.dll into each process's address space and there's no way to turn that off. So, even if the raw system call ABI was stable, and even if you knew how to bypass everything else (I have seen conflicting bits of leaked and/or reverse-engineered information regarding whether it's possible or useful to avoid having kernel32.dll loaded as well), there wouldn't be any point to bypassing ntdll.dll.

(On the other hand, a Rust runtime dependent only on ntdll and kernel32, not any of the various C runtimes for Windows, would be perfectly sensible and perhaps is even what we already have; I don't actually ever develop on Windows myself.)

Huh, that must be a policy change since the last time I had a reason to mess around with low-level programming on FreeBSD. Back in the 2000s they were pretty diligent about keeping old kernel entry points around. Maybe they still are and they just don't get all shouty about it the way Linus does.

With my glibc developer hat on, we have occasionally kicked around the possibility of a minimal low-level interface library that doesn't drag all of C's baggage with it. The problem is knowing where to draw the line. "Just the system calls" is not much good by itself; you would still want at least the dynamic loader and enough of the pthreads infrastructure for thread-local storage to work, but already that needs things like malloc and the ability to emit error messages, and before you know it ... well, I don't think it's an accident that ntdll.dll contains more bytes of machine code than libc.so.6 does, let me put it that way.


I feel like this is getting off topic so I'll cut a long story short. Kernel32 is part of the win32 subsystem which NT "native" applications don't use (there are other subsystems too). There are also minimal/pico processes that don't even load ntdll (currently only used by WSL version 1, I think).

In any case the original point I was trying to make is that in practice (if not implementation) Windows and OS X aren't so different from each other in this particular regard. Raw syscalls only really make sense on systems (i.e. Linux) that provide stability guarantees at that level.

Assuming that only the syscall numbers change and not the ABI or semantics, you could extract them from ntdll at runtime.

@zackw

I think @newpavlov properly explained this above. But the Linux kernel syscalls return a bare usize, which is zero for success, negative for error, and each negative number corresponds to an error condition.

glibc could have just made their syscall API return that, but instead, they take that error condition, potentially translate it into a user-space error code, put it in a thread local, and only make syscall return 0 or -1. This gives glibc a consistent error API, even for syscalls. This used to be ok-ish for single-threaded code, but errno hasn't aged well for multi-threaded applications.

I'm surprised to see that the syscall crate's interface returns a bare usize instead of Result<usize, Errno> .

First, it should be Result<(), NonZeroErrno>, because the success representation is always zero, and the error representation is always non-zero. This allows the Result to fit into a single usize, instead of requiring two registers, one for the discriminant, and one for the payload. Second, while we do perform this niche optimization on Result-like enums, we do not guarantee that this optimization is always performed (we only guarantee it for Option-like enums), so using this on FFI is probably not a good idea until that changes.

That is, code that uses this type on FFI might or might not map to a single usize, depending on the compiler version.
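A quick way to see the size difference being discussed is to ask the compiler directly. This is only a sketch: the type aliases are made up for illustration, `NonZeroUsize` stands in for the hypothetical `NonZeroErrno`, and the sizes are what current compilers produce, not something the post above claims is guaranteed.

```rust
use std::mem::size_of;
use std::num::NonZeroUsize;

/// Stand-in for Result<(), NonZeroErrno>: zero can encode Ok(()).
pub type UnitOrErrno = Result<(), NonZeroUsize>;

/// Stand-in for Result<usize, NonZeroErrno>: every usize value is a
/// valid Ok payload, so no bit pattern is left over for the discriminant.
pub type ValueOrErrno = Result<usize, NonZeroUsize>;

/// Returns (size of usize, size of UnitOrErrno, size of ValueOrErrno).
pub fn sizes() -> (usize, usize, usize) {
    (
        size_of::<usize>(),
        size_of::<UnitOrErrno>(),
        size_of::<ValueOrErrno>(),
    )
}
```

With the niche optimization applied, the first two sizes match, while the third needs extra room for a discriminant.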

@cuviper

I think FreeBSD also reserves the right to change kernel ABI, but has compat symbol versions in its libc.

I can confirm that FreeBSD adds new incompatible versions of kernel APIs on every release using new symbol names, and changes its C library to dispatch to those. Most of the time, it also provides the older versions under the old symbol names, such that binaries compiled against those "just work" on newer kernel versions "as is". Rarely, FreeBSD does make backward incompatible changes to the symbols in place. This happens when the FreeBSD maintainers "believe" that no code is using these symbols. In my experience, the burden of proof for those beliefs is quite low. The consequence of those changes is that, e.g., Rust's libc crate ends up only being able to provide "guaranteed undefined behavior depending on your FreeBSD version" for those APIs. I can only hope that nobody is really using these.

That's not actually true, though, which is the whole reason I started talking about this in the first place.
It's true that the error codes returned by Linux system calls are always negative and never zero, but successful calls do not always return zero, and in fact successful calls can return negative numbers (when interpreted as isize).

The exact rule (see the definition of IS_ERR in linux/err.h) is that an unsuccessful system call will return an error code between −4095 and −1 (inclusive on both ends). Any other value is some sort of success, and often a successful value is meaningful. For instance, open returns a new file descriptor number on success, and mmap returns a pointer to the newly allocated memory area on success.

Your proposed optimization Result<(), NonZeroErrno> is applicable to system calls like setuid that always return 0 on success, but the wrapper library would have to know which ones those are. If we had NonNegativeI32, that could be used for the success case for open and similar; again, the library would have to know which ones those are. For an untyped syscall() wrapper, though, it really does have to be Result<usize, NonZeroErrno>.
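As a sketch of what the untyped case could look like, here is the [−4095, −1] rule written out on the raw register value. `Errno`, `decode`, and `MAX_ERRNO` are hypothetical names for illustration; this is not the syscall crate's API.

```rust
use std::num::NonZeroU16;

/// Hypothetical error type; Linux error codes lie in 1..=4095.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Errno(NonZeroU16);

impl Errno {
    /// The positive error code, e.g. 9 for EBADF.
    pub fn code(self) -> u16 {
        self.0.get()
    }
}

const MAX_ERRNO: usize = 4095;

/// Split a raw syscall return value into success or error.
pub fn decode(raw: usize) -> Result<usize, Errno> {
    // Read as isize, errors are -4095..=-1; read as usize, those are
    // exactly the top 4095 values.
    if raw > usize::MAX - MAX_ERRNO {
        // Two's-complement negation recovers the positive error code.
        let code = raw.wrapping_neg() as u16;
        Err(Errno(NonZeroU16::new(code).expect("code is in 1..=4095")))
    } else {
        // Everything else is a success, including values that would look
        // negative as isize (an mmap address in the upper half, say).
        Ok(raw)
    }
}
```

Typed wrappers for calls like setuid or open could then narrow the `Ok` side further, since they know what their particular success values look like.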

And I brought this up because I think this is an actual bug in the interface provided by the syscall crate; it should be responsible for figuring out which return values are errors. And any hypothetical raw syscall interface in std should do the same.

You phrase this as if glibc had a choice in the matter, but the use of errno for C library system call wrappers was set in stone long before anyone started working on GNU's implementation of a C library.


Thanks @zackw, I did not know that.

We could use a three-variant enum (LowerThanMinus4095, FromMinus4095ToMinus1, and FromZeroOnwards) which fits in a usize (the variants are easy to create on nightly), and that would have two success variants. A bit weird, but just like the API I guess.

Windows definitely reserves the right to change the ABI and semantics of syscalls, not just their numbers.

FWIW the main reason we want to remove all kernel APIs from libc is that it is very easy to accidentally use an API that does not work correctly for the minimum kernel version that you want to support.

This is because Linux evolves its kernel APIs in such a way that old code keeps working, but obviously not in such a way that new code still works on older kernel versions (that would pretty much prevent any improvements).

A consequence of this is that, even though the ABI of a syscall remains the same, its API changes. For example, a syscall might accept an integer with values 0 or 1 in some kernel version, and the kernel headers might provide a const with the largest value this API accepts (i.e. 1). The next kernel version adds the value 2 to it. So now there is a new const for the value 2, and the const specifying the largest value changes.

It is too easy to check the oldest kernel version that supports a syscall and conclude that this is the oldest kernel version your code supports. That would be wrong if you were to inadvertently use the value 2 above in your code, which might require a much newer kernel version.

In C this is a non-issue. If you want to target kernel version X, just use the headers of that version, and you are golden. But in Rust we don't have this kind of versioning, and that's sort of out-of-scope for the libc crate.

I imagine that a linux-kernel crate could do something like what the llvm crate does: have one cargo feature per kernel version (e.g. --features=v5_0_1,v4_1_15, etc.), and have the build.rs check what the smallest requested version is (multiple crates can enable different cargo features, and cargo's feature unification enables them all when building the crate). That's the API you get.
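The "smallest version requested" step could be sketched like this. The feature naming scheme (`v5_0_1`) comes from the example above; the function names are made up. In a real build script the enabled features would come from the `CARGO_FEATURE_<NAME>` environment variables that Cargo sets, which this sketch leaves out so that it stays a pure function.

```rust
/// Parse a feature name such as "v5_0_1" into (5, 0, 1).
/// Returns None for features that are not version-shaped.
pub fn parse_version_feature(name: &str) -> Option<(u32, u32, u32)> {
    let mut parts = name
        .strip_prefix('v')?
        .split('_')
        .map(|p| p.parse::<u32>().ok());
    let major = parts.next()??;
    let minor = parts.next()??;
    let patch = parts.next()??;
    // Reject trailing components such as "v5_0_1_2".
    if parts.next().is_some() {
        return None;
    }
    Some((major, minor, patch))
}

/// Pick the oldest kernel version among all enabled features; after
/// feature unification this is the version the exported API must fit.
pub fn minimum_kernel<'a, I>(features: I) -> Option<(u32, u32, u32)>
where
    I: IntoIterator<Item = &'a str>,
{
    // Tuples compare lexicographically, which matches version ordering.
    features.into_iter().filter_map(parse_version_feature).min()
}
```

With both `v5_0_1` and `v4_1_15` enabled, the result is `(4, 1, 15)`, so only items available in 4.1.15 would be exported.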

We could then use ctest or similar to verify the library against all kernel-header versions, making sure that the library only exports what's available for a particular kernel header.

This doesn't seem like the right place to discuss this, but what you describe isn't best practice for conditional use of new kernel features in C. Best practice is you always compile against the newest headers you can get your hands on, and at runtime you check whether the hypothetical system call fails when invoked using mode "2", and fall back to some other implementation strategy. (If this isn't possible then of course you have a hard requirement for a kernel that does support mode "2".) Glibc (and I think also musl) extend this to system calls that may or may not even be available; the wrapper function is made available unconditionally, but might fail with an errno code indicating that it's not supported by the running kernel (ENOSYS).
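In Rust terms, the runtime-probing pattern might be sketched as follows. Everything here is hypothetical: errors are plain `i32` codes, `with_fallback` is an invented helper, and only ENOSYS is treated as "not supported" (a rejected mode may surface as a different code, which a real implementation would have to decide on per call).

```rust
/// ENOSYS as defined on x86 and most other Linux architectures.
pub const ENOSYS: i32 = 38;

/// Try the newer interface first; if the running kernel reports that it
/// does not exist, use the older strategy instead. Any other error from
/// the preferred path is a real failure and is returned unchanged.
pub fn with_fallback<T>(
    preferred: impl FnOnce() -> Result<T, i32>,
    fallback: impl FnOnce() -> Result<T, i32>,
) -> Result<T, i32> {
    match preferred() {
        Err(ENOSYS) => fallback(),
        other => other,
    }
}
```

The point of the design is that the decision is made against the running kernel, not against whichever headers the code was built with.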

I'd be interested in discussing how this pattern could be most sensibly supported by Rust stdlib and/or crates, but, again, this thread doesn't seem like the right place.
