Idea: expose Linux raw syscall interface in std::os::linux

Regarding errno, that exists entirely because C doesn't have anything like Result; I'm surprised to see that the syscall crate's interface returns a bare usize instead of Result<usize, Errno>. Distinguishing between an error return and a valid result is not as simple as "is it negative?", and it should properly be handled by the syscall wrappers.

(IIRC, the exact rule on Linux is that the value in the return-value register after a system call is an error code if and only if it is in the interval [−4095, 0). I don't know how the current generation of BSDs do it, but several historical Unixes used the carry flag to indicate an error, so you absolutely had to check it from assembly language immediately after the trap instruction.)

I feel like there should be a crate (using nightly inline assembly, at least for proof of concept) that has properly typed wrappers for all Linux system calls and also knows how to invoke the vDSO for clock_gettime and stuff like that. And a similar one for OSX and FreeBSD and any other popular hosted platforms. (Not Windows, where it's not even possible to bypass the lowest level of system-provided DLLs.) Does that already exist?
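For a concrete starting point, here's a minimal sketch of what the lowest layer of such a crate might look like on x86-64, using `core::arch::asm!` (nightly-only when this was written); `syscall1` is just an illustrative name, and the clobbers follow the Linux syscall ABI:

```rust
use core::arch::asm;

/// Raw one-argument Linux syscall on x86-64. The kernel returns its
/// result (or a negated errno) in rax, and clobbers rcx and r11.
unsafe fn syscall1(nr: usize, a1: usize) -> usize {
    let ret: usize;
    asm!(
        "syscall",
        inlateout("rax") nr => ret,
        in("rdi") a1,
        out("rcx") _, // clobbered by the syscall instruction
        out("r11") _, // clobbered by the syscall instruction
        options(nostack),
    );
    ret
}
```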


It's possible, but it's extremely unwise to do so, as the system calls have an unstable interface and in any case aren't officially documented.

I could be wrong but I was also under the impression that OSX behaved similarly to Windows in this regard. Apple has stated that:

A statically linked binary assumes binary compatibility at the kernel system call interface, and we do not make any guarantees on that front. Rather, we strive to ensure binary compatibility in each dynamically linked system library and framework.

I think FreeBSD also reserves the right to change kernel ABI, but has compat symbol versions in its libc. Linux is unusual in having its kernel developed independently.

More precisely what I meant is that, as far as I know, the NT kernel will always load ntdll.dll into each process's address space and there's no way to turn that off. So, even if the raw system call ABI was stable, and even if you knew how to bypass everything else (I have seen conflicting bits of leaked and/or reverse-engineered information regarding whether it's possible or useful to avoid having kernel32.dll loaded as well), there wouldn't be any point to bypassing ntdll.dll.

(On the other hand, a Rust runtime dependent only on ntdll and kernel32, not any of the various C runtimes for Windows, would be perfectly sensible, and perhaps is even what we already have; I don't actually ever develop on Windows myself.)

Huh, that must be a policy change since the last time I had a reason to mess around with low-level programming on FreeBSD. Back in the 2000s they were pretty diligent about keeping old kernel entry points around. Maybe they still are and they just don't get all shouty about it the way Linus does.

With my glibc developer hat on, we have occasionally kicked around the possibility of a minimal low-level interface library that doesn't drag all of C's baggage with it. The problem is knowing where to draw the line. "Just the system calls" is not much good by itself; you would still want at least the dynamic loader and enough of the pthreads infrastructure for thread-local storage to work, but already that needs things like malloc and the ability to emit error messages, and before you know it ... well, I don't think it's an accident that ntdll.dll contains more bytes of machine code than libc.so.6 does, let me put it that way.


I feel like this is getting off topic so I'll cut a long story short. Kernel32 is part of the win32 subsystem which NT "native" applications don't use (there are other subsystems too). There are also minimal/pico processes that don't even load ntdll (currently only used by WSL version 1, I think).

In any case the original point I was trying to make is that in practice (if not implementation) Windows and OS X aren't so different from each other in this particular regard. Raw syscalls only really make sense on systems (i.e. Linux) that provide stability guarantees at that level.

Assuming that only the syscall numbers change and not the ABI or semantics, you could extract them from ntdll at runtime.

@zackw

I think @newpavlov properly explained this above. But the Linux kernel syscalls return a bare usize, which is zero for success, negative for error, and each negative number corresponds to an error condition.

glibc could have just made their syscall API return that, but instead they take that error condition, potentially translate it into a user-space error code, put it in a thread-local, and only make syscall return 0 or -1. This gives glibc a consistent error API, even for syscalls. This used to be ok-ish for single-threaded code, but errno hasn't aged well for multi-threaded applications.

I'm surprised to see that the syscall crate's interface returns a bare usize instead of Result<usize, Errno>.

First, it should be Result<(), NonZeroErrno>, because the success representation is always zero, and the error representation is always non-zero. This allows the Result to fit into a single usize, instead of requiring two registers, one for the discriminant, and one for the payload. Second, while we do perform this niche optimization on Result-like enums, we do not guarantee that this optimization is always performed (we only guarantee it for Option-like enums), so using this on FFI is probably not a good idea until that changes.

That is, code that uses this type on FFI might or might not map to a single usize, depending on the compiler version.
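As a quick check of the current (but, again, unguaranteed) behavior, with `NonZeroUsize` standing in for the hypothetical `NonZeroErrno`:

```rust
use core::mem::size_of;
use core::num::NonZeroUsize;

// Stand-in for the hypothetical NonZeroErrno payload type.
type NonZeroErrno = NonZeroUsize;

fn main() {
    // Today's compilers pack Result<(), NonZeroErrno> into one word via
    // the niche optimization, but this layout is not guaranteed for
    // Result-like enums.
    assert_eq!(size_of::<Result<(), NonZeroErrno>>(), size_of::<usize>());
}
```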

@cuviper

I think FreeBSD also reserves the right to change kernel ABI, but has compat symbol versions in its libc.

I can confirm that FreeBSD adds new incompatible versions of kernel APIs on every release using new symbol names, and changes its C library to dispatch to those. Most of the time, it also provides the older versions under the old symbol names, such that binaries compiled against those "just work" on newer kernel versions "as is". Rarely, FreeBSD does make backward incompatible changes to the symbols in place. This happens when the FreeBSD maintainers "believe" that no code is using these symbols. In my experience, the burden of proof for those beliefs is quite low. The consequence of those changes is that, e.g., Rust's libc crate ends up only being able to provide "guaranteed undefined behavior depending on your FreeBSD version" for those APIs. I can only hope that nobody is really using these.

That's not actually true, though, which is the whole reason I started talking about this in the first place.
It's true that the error codes returned by Linux system calls are always negative and never zero, but successful calls do not always return zero, and in fact successful calls can return negative numbers (when interpreted as isize).

The exact rule (see the definition of IS_ERR in linux/err.h) is that an unsuccessful system call will return an error code between −4095 and −1 (inclusive on both ends). Any other value is some sort of success, and often a successful value is meaningful. For instance, open returns a new file descriptor number on success, and mmap returns a pointer to the newly allocated memory area on success.

Your proposed optimization Result<(), NonZeroErrno> is applicable to system calls like setuid that always return 0 on success, but the wrapper library would have to know which ones those are. If we had NonNegativeI32, that could be used for the success case for open and similar; again, the library would have to know which ones those are. For an untyped syscall() wrapper, though, it really does have to be Result<usize, NonZeroErrno>.
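For illustration, a minimal sketch of that untyped conversion, where `Errno` is a hypothetical newtype around the raw error number:

```rust
use core::num::NonZeroUsize;

/// Hypothetical newtype around a raw Linux error number (1..=4095).
#[derive(Debug)]
pub struct Errno(pub NonZeroUsize);

/// Apply the IS_ERR rule: a raw return value in [-4095, -1]
/// (interpreted as isize) is a negated errno; anything else,
/// including other negative values, is a successful result.
pub fn check(raw: usize) -> Result<usize, Errno> {
    let v = raw as isize;
    if (-4095..0).contains(&v) {
        Err(Errno(NonZeroUsize::new(v.unsigned_abs()).unwrap()))
    } else {
        Ok(raw)
    }
}
```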

And I brought this up because I think this is an actual bug in the interface provided by the syscall crate; it should be responsible for figuring out which return values are errors. And any hypothetical raw syscall interface in std should do the same.

You phrase this as if glibc had a choice in the matter, but the use of errno for C library system call wrappers was set in stone long before anyone started working on GNU's implementation of a C library.


Thanks @zackw, I did not know that.

We could use a three-variant enum (LowerThanMinus4095, FromMinus4095ToMinus1, and FromZeroOnwards) which fits in a usize (the variants are easy to create on nightly), and that would have two success variants. A bit weird, but just like the API, I guess.

Windows definitely reserves the right to change the ABI and semantics of syscalls, not just their numbers.

FWIW the main reason we want to remove all kernel APIs from libc is that it is very easy to accidentally use an API that does not work correctly for the minimum kernel version that you want to support.

This is because Linux evolves its kernel APIs in such a way that old code keeps working, but obviously not in such a way that new code still works on older kernel versions (that would pretty much prevent any improvements).

A consequence of this is that, even though the ABI of a syscall remains the same, its API changes. For example, a syscall might accept an integer with values 0 or 1 in some kernel version, and the kernel headers might provide a const with the largest value this API accepts (i.e. 1). The next kernel version adds the value 2 to it, so now there is a new const for the value 2, and the const specifying the largest value changes.
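A hypothetical illustration of that pattern (none of these constant names are real kernel UAPI):

```rust
// As shipped in kernel version N:
pub const FAKE_MODE_OFF: u32 = 0;
pub const FAKE_MODE_ON: u32 = 1;
pub const FAKE_MODE_MAX: u32 = 1; // largest value the API accepts

// As shipped in kernel version N+1, which adds a new mode:
// pub const FAKE_MODE_EXTRA: u32 = 2;
// pub const FAKE_MODE_MAX: u32 = 2; // the MAX const changed in place
```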

It is too easy to check the oldest kernel version that provides a syscall and conclude that that's the oldest kernel version your code supports. But that would be wrong if you inadvertently used the value 2 above in your code, which might require a much newer kernel version.

In C this is a non-issue. If you want to target kernel version X, just use the headers of that version, and you are golden. But in Rust we don't have this kind of versioning, and that's sort of out-of-scope for the libc crate.

I imagine that a linux-kernel crate could do something like what the llvm crate does: have one cargo feature per kernel version (e.g. --features=v5_0_1,v4_1_15), and have build.rs check what's the smallest version requested (multiple crates can enable different cargo features, and cargo's feature unification enables them all when building the crate), and that's the API you get.
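As a sketch (crate and feature names hypothetical), the build.rs could look something like this:

```rust
// build.rs: pick the *oldest* kernel version among the enabled
// features, since cargo's feature unification may turn on several.
fn main() {
    // Ordered from oldest to newest, so the first enabled feature wins.
    let versions = [
        ("CARGO_FEATURE_V4_1_15", "4_1_15"),
        ("CARGO_FEATURE_V5_0_1", "5_0_1"),
    ];
    let min = versions
        .iter()
        .find(|(var, _)| std::env::var_os(var).is_some())
        .map(|(_, version)| *version)
        .expect("enable at least one kernel-version feature");
    // Expose the selected version to the crate as a cfg.
    println!("cargo:rustc-cfg=kernel_v{}", min);
}
```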

We could then use ctest or similar to verify the library against all kernel-header versions, making sure that the library only exports what's available for a particular kernel header.

This doesn't seem like the right place to discuss this, but what you describe isn't best practice for conditional use of new kernel features in C. Best practice is you always compile against the newest headers you can get your hands on, and at runtime you check whether the hypothetical system call fails when invoked using mode "2", and fall back to some other implementation strategy. (If this isn't possible then of course you have a hard requirement for a kernel that does support mode "2".) Glibc (and I think also musl) extend this to system calls that may or may not even be available; the wrapper function is made available unconditionally, but might fail with an errno code indicating that it's not supported by the running kernel (ENOSYS).
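In Rust terms the pattern might look like this (both functions are hypothetical stand-ins, and the libc crate is assumed for the ENOSYS constant):

```rust
use std::io;

// Hypothetical wrapper around a syscall that older kernels lack.
fn new_fancy_syscall() -> io::Result<()> {
    // Stub for illustration; a real wrapper would invoke the syscall.
    Err(io::Error::from_raw_os_error(libc::ENOSYS))
}

// Hypothetical emulation path for older kernels.
fn older_fallback() -> io::Result<()> {
    Ok(())
}

// Always try the new call first; fall back only when the running
// kernel reports ENOSYS ("function not implemented").
fn do_operation() -> io::Result<()> {
    match new_fancy_syscall() {
        Err(e) if e.raw_os_error() == Some(libc::ENOSYS) => older_fallback(),
        result => result,
    }
}
```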

I'd be interested in discussing how this pattern could be most sensibly supported by Rust stdlib and/or crates, but, again, this thread doesn't seem like the right place.


So where would be the appropriate place to discuss this? I found lrs-lang/lib, which seems unmaintained, and spent a while (15 min) trying to extract the syscall crate and port it to modern Rust. It seems to be a larger project, since it implements its own CStr, wrapping ints, atomics, etc.

I'm having difficulty understanding why you would say that. This is a thread about syscall interfaces for Linux in Rust. Why would this not be the perfect place to discuss this? I really have a problem with unilaterally declaring something off-topic like this.

I'm having difficulty understanding why you would say that.

The general problem being considered in this issue is how to call functions that follow a custom calling convention, of which the syscall_ APIs are just one example, and inline assembly is just one way to solve this problem.

A crate for the linux kernel APIs would need to allow users to perform syscalls somehow, but how to organize such a crate is orthogonal to that issue. It would probably be confusing to try to mix both discussions in one thread, but we could open a different thread to talk about the other thing.

Do you think inline assembly could ever be safe?
(That would require rustc/LLVM to understand the assembly itself.)

P.S. I'm not talking about stabilization; that will obviously happen someday.

No? I don't even think it would matter if Rust understood the assembly itself; a lot of the operations are inherently unsafe. A "safe assembly" wouldn't really be assembly any more. It would be closer to MIR.


@notriddle why? Do you agree that assembly that does nothing is safe?
So is asm that just checks equality, and so is asm that adds a u8 to a u32 (as registers).
My point is that there's a line where asm becomes unsafe, and if we could push that line forward, it would be pretty awesome for safety (as all programs ultimately rest on unsafe code; the question is how far we can push it).

Safety has semantic constraints that aren't represented at the register level. Maybe that register you're adding to is a NonZeroU32, and the addition changed it to 0, which would be undefined behavior. Or maybe that register is a 32-bit reference (safe pointer) and you're adding an offset. etc. etc. I suspect if you constrained the asm block such that it could be "type" analyzed at this level, you'd have code that doesn't really need asm at all.
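To make that concrete, here's a sketch (x86-64, `core::arch::asm!`) where the register-level operation is perfectly well-formed, yet the type-level invariant lives entirely outside the asm block:

```rust
use core::arch::asm;

/// Adds 1 to a 32-bit value. Fine at the register level, but if `x`
/// came out of a NonZeroU32 and equals u32::MAX, re-wrapping the
/// result with NonZeroU32::new_unchecked would be undefined behavior,
/// and no analysis of the asm text alone could tell you that.
unsafe fn add_one_raw(x: u32) -> u32 {
    let out: u32;
    asm!(
        "add {0:e}, 1",
        inlateout(reg) x => out,
        options(nomem, nostack),
    );
    out
}
```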

The safe way to expose assembly is through intrinsics that LLVM will understand directly, with suitable wrappers at the Rust level as safety requires.


I would really like to see Rust's libstd drop libc as a dependency one day. @gnzlbg do you have any thoughts on that? :slight_smile: