`&CStr` from/to `&[c_char]`, safely

I think that ffi::CStr is missing some useful safe constructor to easily interact with C code. In details, AFAIK it is not possible to create a &CStr from &[c_char] in a completely safe and zero-cost way. I would like to propose something like this:

impl CStr {
    fn from_c_chars_with_nul(chars: &[c_char]) -> Result<&CStr, FromBytesWithNulError>;
    unsafe fn from_c_chars_with_nul_unchecked(chars: &[c_char]) -> &CStr;
    fn to_c_chars(&self) -> &[c_char];
    fn to_c_chars_with_nul(&self) -> &[c_char];
}

The following is a possible simple implementation: Playground (obviously, with an Ext trait just for demonstration).

I already asked on URLO for a preliminary feedback.

Questions:

  • Should the error of from_c_chars_with_nul be FromBytesWithNulError or a brand new error like FromCCharsWithNulError?
  • An RFC is necessary?

Feedback appreciated!

EDIT 1 @tspiteri highlighted that for now there is no reason to propose to_bytes{, with_nul}.

EDIT 2 I did not realize that a buffer with trailing null terminators are not valid as well. Obviously if there are bytes != b'\0' after the first terminator, CStr is considered invalid. However, when interacting with C strings with max fixed length, it is common to reuse the buffer, leading with garbage after the first terminator.

I would propose an additional function, in order to get the C string with the first valid null terminator:

fn from_c_chars_with_first_nul(chars: &[c_char]) -> Result<&CStr, FromBytesWithNulError>;

Playground

to_c_chars() might be called as_slice() instead, following the naming conventions.

One question would be whether as_slice() includes the null byte by default or not. I would expect it to include that by default. I would also spell nul as null for these APIs.

1 Like

I am puzzled about this, because you are right about naming conventions, on the other hand we lose the direct relationship with CStr::to_bytes. Moreover, I am not sure if the user would expect that as_slice returns &[c_char] instead of &[u8]. Said that, I really like your idea, and I think that as_slice would be better.

There are already two separated functions for this. Using as_slice naming convenction, we will have as_slice() and as_slice_with_nul().

That would break the naming convention of CStr::from_bytes_with_nul and CStr::from_bytes_with_nul_unchecked.

I am not familiar with the CStr API, and I guess it shows...

Ah yes, it looks like maybe to_bytes() predates the naming conventions? You're also right that as_slice() could mean both &[u8] or &[c_char]. Maybe the case for as_slice() or using as_ over to_ isn't as strong then.

Yeah, I was referring to an alternative where you have as_slice() and as_slice_without_nul(). But again, I was not aware of the corresponding to_bytes() methods, so if those already exist it makes sense to follow their structure.

1 Like

to_bytes is correct here. The to_ is supposed to indicate some kind of expense. The CStr::to_bytes method currently drops the NUL terminator, but the more important bit is that the intent was that the NUL check would be moved to the to_bytes call instead of at construction. This is mentioned in the docs and is probably where the to_ prefix is coming from.

Also, things like as_bytes() are not without precedent either. For example &str has an as_bytes method instead of as_slice because there is a type conversion (at least conceptually) happening here. I think the same applies in this circumstance.

4 Likes

What is the use case which makes these useful? (I'm not saying there isn't one, just that I don't obviously see it.)

The way I see CStr is that it is intended to look like a slice of u8 on the Rust side, and a pointer to c_char on the C side. The way you could normally get c_char slices rather than u8 slices would be through some other C code, but then you would have a pointer to c_char, not a slice of c_char, and these methods would still not be helpful without unsafe code.

Good question, maybe I should have explained why I pointed out this.

There are many cases in which a C string is stored in a fixed size array, which should be null terminated. In these cases, something like libc crate exposes a &[c_char] that should contain a \0 somewhere.

In real life, a typical bug in C libs is related to copying a string with the same size of the buffer, leaving a malformed non-null terminated representation. Now, if you use CStr::from_ptr you obviously get UB, but if you can pass the maximum size of the buffer, then a \0 is not found and an error is returned. And this is the reason I would use unsafe to cast &[c_char] to &[u8] and then use CStr::from_bytes_with_nul. However, this unsafe block could be avoided with the APIs I proposed to add.

Ah, so for example libc::utsname::sysname is of type [c_char; _] and you want to create a CStr for that. So I see the value of having the constructors from_c_chars_with_nul{,_unchecked} as counterparts to from_bytes_with_nul{,_unchecked}.

I still don't see the usefulness of to_c_chars{,_with_nul} as counterparts to to_bytes{,_with_nul}. And these are kind of a different feature from the constructors; to_bytes{,_with_nul} were in CStr since 1.0, while the constructors from_c_chars_with_nul{,_unchecked} were added later in 1.10, indicating that they're not really the same feature.

1 Like

Hey, you took exactly the API I stumbled on!!! :astonished:

I think that you are totally right: I proposed these two functions because they are straightforward to implement and they provide some sort of symmetry... but I don't have a real usage in mind. Maybe it is not worth to add them now, if someone find a valid reason they could be added in the future.

1 Like

Has anyone thought about how this would interact with TBAA, if we ever get that?

We won't. Rust as its own kind of aliasing story, and it's very different from TBAA (and much more powerful).

But also I fail to see the relation to TBAA here, when the only type used is u8?

I know the Rust aliasing narrative is mostly based on &mut, but I recall reading some GitHub discussion where TBAA was at least under serious consideration. I don't remember where it was or what came out of it, however.

There are two types involved: u8 and c_char. While the latter may be defined to alias the former currently, I don't believe this was ever promised anywhere. In fact, on most C targets char is signed, so the closest native Rust type for c_char to be defined as would be i8. And if TBAA ends up mattering, it need not mean that Rust u8 and i8 will necessarily be given the same ‘may alias anything’ special treatment that C character types are subject to.

As far as I know, everyone involved in Rust langauge design agrees that we do not want TBAA. It has certainly been my own personal position from the start, and I never got any pushback for that. I am not sure which GH discussion you mean, and which people were arguing for TBAA in Rust there.

6 Likes

The current CStr implementation uses c_char internally and casts pointers in its to/from [u8] methods, so unless the implementation is changed for whatever reason, constructing from c_char slices is even simpler.

And even if CStr internals are changed to u8 pointers, CStr already has a constructor from *const c_char, so I can't see how the proposed change would introduce any new issues.

3 Likes

Rust won't have Type Based Aliasing Analysis at any point; we've defined "reinterpret_cast" (mem::transmute) as working for "layout-compatible" types that are guaranteed to have the bytes in mutually intelligible locations.

The place you saw TBAA mentioned was probably around padding; TBAA tables in LLVM allow consumers to tell it about padding, which may allow it to skip copying padding in its memcpy intrinsic, if it sees that as beneficial. However, doing so currently comes with all of TBAA, so would miscompile Rust. Of course, whether LLVM actually uses that metadata from TBAA tables doesn't seem clear.

You can see all of the discussions involving TBAA by searching for TBAA (alternative search).

7 Likes