Pre-RFC: std::os::unix::env::{argc, argv}

rtfeldman · December 27, 2023, 8:21pm

Summary

This would make it possible from std to access the original NUL-terminated UNIX process arguments, without having to convert to and from OsString - which reallocates them, removes the NUL terminators, and on Windows changes their encoding.

Today std does not provide a way to access process args or env vars without these allocations and reencodings. This means in use cases such as FFI calls, further allocations and reencodings are needed to get them back to the original (currently inaccessible) representation.

This proposal makes the original representations available using std::os::unix::env::argc() and argv(), new functions which follow the pattern of OS-specific std::os functions like std::os::unix::fs::chown.

There could be a separate case made for doing this for other operating systems besides UNIX, and also for doing it for environment variables as well. This proposal puts those out of scope because there are direct FFI workarounds for those use cases, whereas those workarounds are not available on UNIX targets.

Motivation

When making FFI calls from Rust on UNIX targets, it's common to need NUL-terminated UTF-8 strings. The same is true of NUL-terminated UTF-16 strings on Windows FFI calls. If these strings are obtained from environment variables or process arguments, on both UNIX and Windows targets, they already exist in the required format in memory.

Today in std it's only possible to access these values via VarsOs and ArgsOs, both of which are iterators over OsString values. These strings are not in the original format; they have been reallocated and had their NUL terminators dropped, meaning that further allocations and conversions are necessary to get them back into their original form.

On Windows, these allocations and conversions can be avoided through a direct FFI call to GetCommandLineW. There is an equivalent for this on some UNIX systems (e.g. macOS) but on others, there is no direct FFI call which exposes these.

This proposal would make all of these unnecessary allocations and conversions avoidable on UNIX using only std and no FFI.

Guide-level explanation

When writing FFI code that targets a particular OS, you may find that the function you're calling requires strings in a NUL-terminated format. Rust's String and OsString are not NUL-terminated, so if you have one of these, you'll need to do some conversions to use them in these FFI calls.
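As a rough sketch of the conversions described above, this is what preparing arguments for a C API tends to look like today. The helper names `to_c_strings` and `to_argv` are hypothetical and not part of std; the point is only that each OsString has to be turned back into an owned, NUL-terminated string (which may reallocate) before a pointer array can be built.

```rust
use std::ffi::{CString, OsString};
use std::os::raw::c_char;
use std::os::unix::ffi::OsStringExt;

/// Hypothetical helper: turns owned args back into NUL-terminated C strings.
/// Returns None if any argument contains an interior NUL byte.
pub fn to_c_strings(args: impl IntoIterator<Item = OsString>) -> Option<Vec<CString>> {
    args.into_iter()
        // CString::new appends the NUL terminator, growing the buffer if needed.
        .map(|arg| CString::new(arg.into_vec()).ok())
        .collect()
}

/// Hypothetical helper: builds the null-terminated pointer array that C APIs
/// such as execvp expect. The pointers are only valid while `owned` is alive.
pub fn to_argv(owned: &[CString]) -> Vec<*const c_char> {
    owned
        .iter()
        .map(|s| s.as_ptr())
        .chain(std::iter::once(std::ptr::null()))
        .collect()
}
```

A caller would typically feed `std::env::args_os()` into `to_c_strings` and keep the returned Vec alive for as long as the pointer array is in use.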

Whenever the strings you're passing happen to come directly from environment variables or process arguments, you can potentially avoid these conversions. For example, UNIX stores both env vars and process arguments in NUL-terminated strings, so for process arguments you can avoid reencoding them to and from OsString or String by accessing pointers to the original strings using the UNIX-specific std::os::unix::env::argc() and argv() functions.

Here's an example on UNIX of using argc() and argv() to avoid reencoding and allocations when making an FFI call to execvp:

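use std::os::raw::{c_char, c_int};
use std::os::unix::env;

extern "C" {
    fn execvp(file: *const c_char, argv: *const *const c_char) -> c_int;
}

fn main() {
    let argv = env::argv();
    // argv() may be null, and from_raw_parts must not be given
    // a null pointer. We also need at least two arguments.
    if argv.is_null() || env::argc() < 2 {
        return;
    }
    let args: &[*const c_char] = unsafe {
        std::slice::from_raw_parts(argv, env::argc())
    };

    // Skip the first argument (it's usually the path
    // to this executable), and treat the second one
    // as the path. Forward it and the remaining args to execvp:
    // by convention the new program's argv[0] is its own name.
    // The original argv array is already null-terminated.
    unsafe {
        execvp(args[1], args[1..].as_ptr());
    }
}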

Keep in mind that these are raw pointers to mutable data. Both environment variables and process arguments can be mutated, and any of these pointers may be null.

Proposed Design

Introduce these functions to a new module, std::os::unix::env:

fn argc() -> usize;
fn argv() -> *const *const c_char;

These functions would read from the ARGC and ARGV atomics that std's UNIX args implementation already maintains, which is why they take no arguments.

Today, these atomics are not exposed, and there is no direct FFI-based workaround to access the values they hold. That's in part because they rely on non-standard link_section extensions.
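To make the shape of the proposal concrete, here is a minimal sketch of two free functions reading from a pair of atomics. The statics below are stand-ins declared for illustration, not std's private ones, and `init` is a hypothetical substitute for the platform startup code that would normally record the values. Returning 0 from `argc` when no pointer was recorded is also an assumption.

```rust
use std::os::raw::c_char;
use std::ptr;
use std::sync::atomic::{AtomicIsize, AtomicPtr, Ordering};

// Stand-ins for the private statics; in std they are filled in at startup.
static ARGC: AtomicIsize = AtomicIsize::new(0);
static ARGV: AtomicPtr<*const c_char> = AtomicPtr::new(ptr::null_mut());

/// Hypothetical startup hook that records the process arguments.
///
/// Safety: `argv` must point to `argc` pointers followed by a null pointer,
/// and must stay valid for as long as `argv()` results are used.
pub unsafe fn init(argc: isize, argv: *const *const c_char) {
    ARGC.store(argc, Ordering::Relaxed);
    ARGV.store(argv as *mut *const c_char, Ordering::Relaxed);
}

/// Sketch of the proposed `argc`: plain load, no allocation.
pub fn argc() -> usize {
    // Assumption: report zero arguments if argv was never recorded.
    if ARGV.load(Ordering::Relaxed).is_null() {
        0
    } else {
        ARGC.load(Ordering::Relaxed).max(0) as usize
    }
}

/// Sketch of the proposed `argv`: may return a null pointer.
pub fn argv() -> *const *const c_char {
    ARGV.load(Ordering::Relaxed) as *const *const c_char
}
```

Both functions are safe to call; it is dereferencing the returned pointer that needs unsafe code.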

Alternate Designs

These functions could use CStr over *const c_char, but then they would have to be unsafe because CStr requires that the pointers be non-null, which is not a guarantee in this case. Additionally, since the motivation for this is FFI, the CStrs would likely need to be converted into *const c_chars anyway, so overall CStr seems both unsafe and unhelpful here.

It might sound reasonable to have a function which returns a slice instead of separate functions for argc and argv. However, as a comment in the current UNIX args implementation notes, argc is not necessarily an accurate length for argv, meaning that building a safe slice would require traversing the argv until a null pointer is encountered—which would be undesirable given that the motivation for this use case is to avoid overhead.

As an alternative, it could make sense to have an Iterator which iterates over argv until it encounters a null, and uses argc for a size_hint only. That said, as shown in the guide-level explanation example, there are certain FFI use cases where having access to the raw pointers is more helpful than an iterator. So it seems like the minimal proposal here would be to expose the pointers, and then optionally an iterator convenience method could be discussed on top of that.
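The iterator alternative could look roughly like the sketch below. `ArgvIter` is a hypothetical name, and the code assumes argc can only overcount the non-null entries, so it is used purely as an upper bound in size_hint while the walk itself stops at the first null pointer.

```rust
use std::ffi::CStr;
use std::marker::PhantomData;
use std::os::raw::c_char;

/// Hypothetical iterator over a null-terminated argv array.
pub struct ArgvIter<'a> {
    next: *const *const c_char,
    remaining_hint: usize,
    _strings: PhantomData<&'a CStr>,
}

impl<'a> ArgvIter<'a> {
    /// Safety: `argv` must be null or point to a null-terminated array of
    /// valid NUL-terminated strings that outlive 'a and are not mutated
    /// while the iterator or its items are in use.
    pub unsafe fn new(argv: *const *const c_char, argc: usize) -> Self {
        ArgvIter { next: argv, remaining_hint: argc, _strings: PhantomData }
    }
}

impl<'a> Iterator for ArgvIter<'a> {
    type Item = &'a CStr;

    fn next(&mut self) -> Option<Self::Item> {
        if self.next.is_null() {
            return None;
        }
        // SAFETY: the caller of `new` promised a null-terminated array.
        let entry = unsafe { *self.next };
        if entry.is_null() {
            // End of the array; do not advance past the terminator.
            return None;
        }
        self.next = unsafe { self.next.add(1) };
        self.remaining_hint = self.remaining_hint.saturating_sub(1);
        Some(unsafe { CStr::from_ptr(entry) })
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        // argc is only a hint: treat it as an upper bound, promise no minimum.
        (0, Some(self.remaining_hint))
    }
}
```

Because construction is unsafe, the burden of ruling out concurrent mutation stays with the caller, which matches the concerns raised about raw access in general.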

Prior Art

There are various OS-specific functions in std::os already, like std::os::unix::fs::chown.

Future Additions

Even though there are already FFI workarounds for them, it could be worthwhile to offer equivalent functions for other target OSes, such as Windows and WASI.

Doing something similar for environment variables could be worthwhile, as they have the same characteristic today of always needing to be converted to OsString even if the desirable format is the one the OS already has in memory. However, there are already direct FFI workarounds to access this on all OSes, which is why this proposal leaves env vars out of scope.

matklad · December 27, 2023, 9:11pm

Feels like these should be just free functions in std::os::unix?

We need FileExt because File holds a file descriptor inside, it’s an OS-level abstraction.

In contrast, ArgsOs is purely a language abstraction for memory management. If one ignores the naming, it seems there’s no reason at all to tack extra functionality onto it via extension traits?

See, e.g., chown in std::os::unix::fs for a free function precedent.

Maybe I am missing a reason why we need a trait here?

harmic · December 27, 2023, 10:07pm

One thing I've wanted to do in the past is set argv[0] so as to change the way a process is presented in ps, top, etc.

prctl does allow setting the process name, but it only modifies the process name as shown in /proc/$pid/status. To change /proc/$pid/cmdline you need to overwrite the string pointed to by argv[0], which is currently not possible in Rust.
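For illustration only, the in-place overwrite being described amounts to something like the sketch below. `overwrite_in_place` is a hypothetical helper that works on any writable NUL-terminated buffer; it stays within the existing string's length, which is the conservative variant of the technique. Whether the real argv[0] should ever be written through a pointer obtained from std is a separate question.

```rust
use std::ffi::CStr;
use std::os::raw::c_char;

/// Hypothetical helper: overwrites the NUL-terminated string at `dst` with
/// `name`, truncating to the existing length and zero-filling the remainder.
/// Returns the number of bytes of `name` that were written.
///
/// Safety: `dst` must point to a valid, writable, NUL-terminated string, and
/// nothing else may read or write that string concurrently.
pub unsafe fn overwrite_in_place(dst: *mut c_char, name: &[u8]) -> usize {
    // The existing string length is the only space known to be available.
    let capacity = unsafe { CStr::from_ptr(dst) }.to_bytes().len();
    let len = name.len().min(capacity);
    unsafe {
        std::ptr::copy_nonoverlapping(name.as_ptr(), dst.cast::<u8>(), len);
        // Zero the leftover bytes; the original terminator stays in place.
        std::ptr::write_bytes(dst.add(len), 0, capacity - len);
    }
    len
}
```

A longer name than the original is silently truncated here, since writing past the original terminator would run into whatever follows it in memory.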

the8472 · December 27, 2023, 10:47pm

Is it worth it though? How common are FFI calls which essentially only consume envs or args and are otherwise allocation-free? I'm wondering what a program has to look like where the allocations in shoveling arg/env into FFI is a bottleneck.

And I think on unix the allocation is necessary anyway due to locking. We only access the environment under a lock, which means the data has to be copied out so we can release the lock.

fn argv() -> *const *const c_char;

No, exposing those pointers is considered a mistake since there's no way to avoid UAFs when going through those pointers instead of getenv/setenv.

rtfeldman · December 27, 2023, 11:20pm

Yeah, good point! I didn't know about the chown precedent.

I updated the proposal to use plain functions instead of traits.

Can you elaborate on this? The proposed design is to use ARGV.load() in exactly the way the current implementation does, so I'm not sure why it would be more prone to UAFs.

Regarding getenv, its notes mention that it typically doesn't copy, it just returns a pointer directly into the original env string:

As typically implemented, getenv() returns a pointer to a string within the environment list. The caller must take care not to modify this string, since that would change the environment of the process.

So I'm also not sure why using getenv would affect chances of UAF!

the8472 · December 27, 2023, 11:38pm

Ah sorry I got things mixed up a bit since you were talking about both env and args. Environment vars need locking and the environ pointer shouldn't be accessed. Args is less often mutated but I think some C libraries still do that so in principle it'd have to be locked too but glibc doesn't provide a way to do that.

josh · December 28, 2023, 12:04am

In particular, as someone noted earlier in this thread, there are reasons people want to be able to mutate argv[0]. Which means that if these are pointers to the originals, this probably isn't safe.

Either these would always involve unsafe code that would have similar prerequisites as the environment functions ("make sure nothing that could possibly be running in parallel is mutating this"), or we should make a copy in the appropriate form and use that so that it can be safer.

carbotaniuman · December 28, 2023, 1:24am

Returning raw pointers would be safe regardless as the unsafety is when you mutate or read, or am I missing something?

josh · December 28, 2023, 3:35am

Sorry, I was being loose with terminology there. Yes, the functions themselves don't have to be marked unsafe. I mean that the use of them would always involve unsafe code, with prerequisites similar to those of the environment functions. Edited to clarify.

(Also, it ought to go without saying, but the *const *const argv pointer should not be cast and used for mutation.)

rtfeldman · December 28, 2023, 3:54am

Makes sense!

As an aside, this is (or I guess will be) my first RFC, so I'm not sure how to capture that - I guess just mention somewhere that documentation should note it?

pitaj · December 28, 2023, 4:14am

This doesn't need to be an RFC. You can just open an ACP instead.

rtfeldman · December 28, 2023, 5:33am

Oh, I wasn't aware ACPs existed! Are there any guidelines I could read about how to follow the process for them correctly?

bjorn3 · December 28, 2023, 3:15pm

What should argc and argv return when libstd can't get them either? For example, when it is called from C and is not on a target where libstd can get them from, e.g., a static initializer. std::env::args() returns an empty list in that case.

Vorpal · December 28, 2023, 5:08pm

Here is another idea (for the use case of setting arg for process name / args): just expose a function to set the arguments (on platforms that support that), instead of exposing raw access. This would be a safe rust abstraction on top of the underlying C "API".

Seems to avoid many of the potential issues with the full raw access approach.

rtfeldman · December 28, 2023, 5:22pm

Seems like, to match the behavior of std::env::args, it should return a null pointer for argv and 0 for argc.

Nilstrieb · December 31, 2023, 11:38am

Then it should return Option<NonNull<>> instead of a raw pointer, to explicitly signal that it may be null.
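A small sketch of what that could mean in practice. The inner type `NonNull<*const c_char>` is an assumption (the suggestion leaves it unspecified), and `checked_argv` is a hypothetical wrapper rather than a proposed std function.

```rust
use std::os::raw::c_char;
use std::ptr::NonNull;

/// Hypothetical wrapper: makes the "argv may be null" case explicit in the
/// type instead of leaving it to documentation.
pub fn checked_argv(raw: *const *const c_char) -> Option<NonNull<*const c_char>> {
    // NonNull::new takes a *mut pointer and returns None for null.
    NonNull::new(raw as *mut *const c_char)
}
```

Callers would then have to handle None before they can reach the pointer at all, though the individual entries inside the array can still be null.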

rtfeldman · January 9, 2024, 11:22pm

@josh does this sound like the right next step to you?

josh · January 10, 2024, 4:59am

Yes.

Sky9 · January 19, 2024, 7:18pm

Some more prior art could include the argv crate, which allows accessing the args as an iterator over &'static OsStr.

system · April 18, 2024, 7:18pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.
