Planned CStr changes seem inefficient

For a long time there have been warnings in the documentation of ffi::CStr that say that certain methods are constant-time now, but in the future they will do a strlen operation. For example, CStr::to_bytes:

Note: This method is currently implemented as a constant-time cast, but it is planned to alter its definition in the future to perform the length calculation whenever this method is called.

So, right now this function is optimal (in the sense that it performs only one strlen operation), but in the future, it may perform two -- one in from_bytes_until_nul, another in to_bytes.

/// Extract a C string from the first portion of a byte slice.
fn cstr_from_buffer(buf: &[u8]) 
    -> Result<(&[u8], &[u8]), ffi::FromBytesUntilNulError>
{
    let cslice = ffi::CStr::from_bytes_until_nul(buf)?.to_bytes();
    Ok((cslice, &buf[cslice.len() + 1..]))
}

And I don't see any good alternative. With unstable library features you can use split_once instead and it's easier to read but you don't get the hand-optimized memchr that backs from_bytes_until_nul; as far as I can tell, there's no way to access that hand-optimized memchr short of feature(slice_internals), which the compiler yells at you for using at all, even on nightly. Without unstable library features, I think I'd have to use either the memchr crate, or libc::strlen, to get equivalent efficiency to from_bytes_until_nul.

It seems to me that this means the plan to make CStr::to_bytes not constant-time is maybe not a good plan and it should be reconsidered.

(Playground link with all three variants of the above function I've come up with so far.)

3 Likes

The idea behind this is to make CStr a single pointer. This makes conversion to and from a char* free. Then getting the length would be a lazy operation, counting the bytes.

I think this would be the way to do it with a "thin" CStr

/// Extract a C string from the first portion of a byte slice.
fn cstr_from_buffer(buf: &[u8]) 
    -> Result<(&[u8], &[u8]), ffi::FromBytesUntilNulError>
{
    let cslice = unsafe {
        ffi::CStr::from_ptr(buf.as_ptr().cast())?.to_bytes()
    };
    Ok((cslice, &buf[cslice.len() + 1..]))
}

If we do move this direction, I think it would make sense for us to add variants of the from_bytes_* methods that just return the length alongside the CStr.

pub const fn from_bytes_until_nul_with_length(
    bytes: &[u8],
) -> Result<(&CStr, usize), FromBytesUntilNulError>
1 Like

This has undefined behavior if there isn't a NUL within buf, unlike from_bytes_until_nul.

5 Likes

I think that broadly, all this comes down to the two use cases of “parsing data in a buffer” vs. “FFI”, and the intent has always been that CStr is for FFI, where:

  • You want to be able to declare “my function's parameter is a null-terminated string” and get a safe Rust type out of that, so only the function declaration is unsafe. This requires that CStr be a thin pointer (without length).
  • You have to rely on the null being present anyway; there’s no UB-free way to check.
7 Likes

It would be great that if/when those changes are actually done, the stdlib added another string type that represents a C string with length. It's often convenient to carry around the length even for nul-terminated strings, like C++ strings do; and also it may be error prone to do this manually (like is done in C by some programs); and also it would be a hassle to introduce a new dependency for it, specially because the stdlib had this behavior for at least 10 years.

6 Likes

Is it really a safe-to-call function if CStr doesn't enforce the presence of the NUL upon construction?

I agree that it’s not a good plan, and in fact I suspect it won’t happen. This is because, in current Rust, it’s assumed that you can get the size of any value from a pointer to it, without dereferencing said pointer. This would no longer be true with a thin CStr, though. And that means that, if the thin CStr is inside a mutex of some sort, you can no longer soundly get its length without locking the mutex. Therefore, the proposed change would be breaking.

5 Likes

CStr has an invariant, which can be unsafely asserted or checked at construction time. Either is a valid approach. When accepting a C string from foreign code, it is necessary to make the unsafe assumption that a NUL is present; there is no way to check the assumption without reading out-of-bounds if it is false.

3 Likes

Oh, I see, I was imagining the FFI going the other way around (Rust calling C).

Sometimes there's a higher level structure telling you when to stop scanning for the NUL, but you're probably stuck trusting the caller to set that correctly, too.

Again, if it won't happen, there's another string type that could be added: a c string without length, and then use that for zero-overhead FFI. The stdlib already has many string types, but it's probably a mistake to not distinguish between those two strings.

6 Likes

I thought it did enforce that. Is there some way to create a non-nul-terminated CStr without unsafe?

I agree with the opinions above that we probably should have two separate types. Wouldn't it be better to introduce a separate CStrPtr type and keep the current CStr behavior and layout to prevent potential unpleasant surprises?

1 Like

Would be nice if the hypothetical CStrPtr could be just a regular &c_char or *const c_char. I guess it's not feasible given that c_char is just i8 and changing it to an opaque type would break things.

Generalizing a bit, might be useful to have something like &thin Unsized or Thin<Unsized> that you couldn't do much with except pass to FFI or unsafely construct a fat reference/pointer given the metadata part.

1 Like

Well, a separate type is useful as a "container" for string-specific methods like to_str, to_bytes_with_nul, etc.

2 Likes

Yes, but you could impl those for &c_char. That's why it would be good to have c_char be something else than i8, it would be confusing if &i8 had a bunch of string-related methods. The API would of course be almost entirely unsafe because there's not much that you can safely do with a thin reference to a C string, besides accessing the first character.

A &c_char doesn’t have the invariant that it points to a zero-terminated slice (and might have memory model implications on whether it can read more than one byte).

6 Likes

Ah, true. I overlooked the fact that it's actually possible to uphold that invariant if you can only ever safely create a thin pointer from an existing CStr or something else that guarantees the nul terminator. I tacitly assumed that such a pointer would only be used/useful in unsafe FFI contexts where the thinness matters, and safe APIs on top would give and take normal fat &CStrs.

It does now, but if we made it a thin pointer we'd either have to give that up at least some of the time, or eat the cost of even more strlen scans.

Work's under way to address that: Hierarchy of Sized traits by davidtwco · Pull Request #3729 · rust-lang/rfcs · GitHub

That would address the issue for a new ThinCStr type, but wouldn’t make it any less breaking to change the extant CStr.