Suggestion: str::to_c_string_lossy or CStr::from_bytes_may_clone

Overview

The idea is that when a -sys crate requires a &::std::ffi::CStr as input, most Rust wrappers choose to require a &str instead, and then call CString::new(), which always clones the slice.

This becomes obnoxious when the conversions happen back and forth, even though after the first time we know there is a terminating null byte.

On top of that, sometimes the string may contain an interior null byte, in which case the CString construction fails, even though truncation could be acceptable.

So I suggest adding the following API to the ::std lib, which only clones the given data when it contains no null byte, and with “optimal” performance (the string is not scanned multiple times).

It’s all here: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=5cecbed40e6755d003d72f7069c89a05

Documentation example

impl str {

fn to_c_string_lossy(&'_ self) -> Cow<'_, CStr>

Converts a string slice into a valid C string slice.

Since the returned C string slice must uphold the C string invariants,

  • the conversion may be lossy: the returned C string slice is truncated at the first null byte encountered in the input string;

  • if no null byte is found, the whole input string slice is copied, with a null byte appended.

Hence the Cow<CStr>.

Complexity-wise, the string is always scanned exactly once, up to the first encountered null byte (or to the end of the string if none is found). If it contains no null byte, it is then copied into the heap. In other words, its complexity is optimal when nothing is known in advance about the null bytes contained in the input string slice.
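For concreteness, here is one way the proposed method could be implemented, written as a free function (a real ::std implementation would presumably use the unchecked CString constructor to skip the second scan in the owned case, which this safe sketch does not):

```rust
use std::borrow::Cow;
use std::ffi::{CStr, CString};

// Sketch of the proposed `str::to_c_string_lossy`: one scan up to the
// first null byte; the input is only cloned when it contains no null byte.
fn to_c_string_lossy(s: &str) -> Cow<'_, CStr> {
    let bytes = s.as_bytes();
    match bytes.iter().position(|&b| b == b'\0') {
        // Terminating or interior null byte: borrow up to and including it.
        Some(i) => Cow::Borrowed(
            CStr::from_bytes_with_nul(&bytes[..=i])
                .expect("the slice ends with its first null byte"),
        ),
        // No null byte: clone the whole slice and append one.
        // (`CString::new` re-scans; std could use the unchecked
        // constructor here, since we already know there is no null byte.)
        None => Cow::Owned(CString::new(bytes).expect("no interior null byte")),
    }
}

fn main() {
    assert_eq!(to_c_string_lossy("Hello, world!\0").to_bytes_with_nul(), b"Hello, world!\0");
    assert_eq!(to_c_string_lossy("Hello, world!").to_bytes_with_nul(), b"Hello, world!\0");
    assert_eq!(to_c_string_lossy("Hell\0, world!").to_bytes(), b"Hell");
}
```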

Examples

Basic usage

// a string slice with a null byte terminator
let hello_world = "Hello, world!\0";

// This just scans the string and casts it.
let c_hello_world = hello_world.to_c_string_lossy();

assert_eq!(
    c_hello_world.to_bytes_with_nul(),
    b"Hello, world!\0",
);

Appending the null byte if needed.

// a string slice with no null bytes
let hello_world = "Hello, world!";

// Since the string contains no null bytes, it is copied and a null byte is appended.
let c_hello_world = hello_world.to_c_string_lossy();

assert_eq!(
    c_hello_world.to_bytes_with_nul(),
    b"Hello, world!\0",
);

“Incorrect” interior null byte:

// a string slice containing an interior null byte
let hello_world = "Hell\0, world!";

let c_hello_world = hello_world.to_c_string_lossy();

assert_eq!(
    c_hello_world.to_bytes(),
    b"Hell", // This is indeed *lossy* !
);

This seems like a pretty bad idea. If you’re interfacing with nul-intolerant libraries, you should surface to the caller that they fed you nuls, rather than quietly dropping them. This screams potential RCE to me.

Moreover, it seems extremely unlikely to me that this is the shape you actually want, and makes me suspect a poor design choice in your wrapper, especially with

which in particular makes me believe you just want some impl Deref<Target=str> that is secretly a String that is always terminated with a nul, allowing you to support something like std::basic_string::c_str in C++.
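Such a type could look roughly like this (a minimal sketch; the type and method names are hypothetical, not part of any existing crate):

```rust
use std::ffi::CStr;
use std::ops::Deref;

// A String that is guaranteed to end with a single nul byte, dereferencing
// to the `str` without that nul, in the spirit of C++'s basic_string::c_str.
struct NulTerminatedString(String);

impl NulTerminatedString {
    fn new(mut s: String) -> Option<Self> {
        if s.as_bytes().contains(&b'\0') {
            return None; // refuse interior nuls instead of truncating
        }
        s.push('\0');
        Some(Self(s))
    }

    fn as_c_str(&self) -> &CStr {
        CStr::from_bytes_with_nul(self.0.as_bytes())
            .expect("constructor guarantees exactly one trailing nul")
    }
}

impl Deref for NulTerminatedString {
    type Target = str;
    fn deref(&self) -> &str {
        &self.0[..self.0.len() - 1] // hide the trailing nul
    }
}

fn main() {
    let s = NulTerminatedString::new(String::from("Hello")).unwrap();
    assert_eq!(&*s, "Hello");
    assert_eq!(s.as_c_str().to_bytes_with_nul(), b"Hello\0");
    assert!(NulTerminatedString::new(String::from("He\0llo")).is_none());
}
```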


How can subslicing lead to an RCE? It cannot even panic! The truncation is literally the following


fn as_bytes_maybe_truncated(s: &'_ str) -> &'_ [u8]
{
    let bytes = s.as_bytes();
    if let Some(first_null) = bytes.iter().position(|&x| x == b'\0') {
        assert!(first_null < bytes.len(), "by design of `.iter().position()` this cannot fail.");
        &bytes[..= first_null] // RCE? Where??
    } else {
        bytes
    }
}

The very point of truncating/subslicing is that it is the best “lazy” choice to recover from an intrusive null byte. If you want to handle that kind of error in a different / more sophisticated manner, the current API already lets you do that.

The only way there could be an RCE is if this lossy function was used somewhere where injectivity would be crucial for further unsafe code, and that’s clearly the programmer’s fault for using a lossy function.

Ok, “back and forth” was a little bit exaggerated. The motivation behind this was the following safe wrapper around a -sys crate: https://docs.rs/z3/0.3.2/src/z3/symbol.rs.html#32-43

That is, its author found it more convenient for their users that the function take s: &str as input instead of a &CStr, just to call CString::new(s).unwrap() afterwards. This is a case where s.to_c_string_lossy() could have been used instead, since truncation is acceptable for a debug string. Obviously the best thing would have been to take a &CStr and let the user feed it one, for instance with the ::byte_strings::c_str! macro.

I’m thinking of some input-validation scenario, where user-supplied input is shoved into something nul-intolerant. This is not the stack-smashing flavor of RCE, but rather the SQL-injection or XSS flavor (maybe you don’t consider this RCE?). I’ve definitely seen examples of this in the past at work (though for obvious reasons I can’t share them). One of the values of CStr, for me, is that it is nul-free.

I think making this a convenient-and-easy-to-reach-for thing is a Bad Idea. Since the “safe” way of getting a CStr is CString::new, not an inherent method of str, I expect this method would be reached for more (after all, who doesn’t reach for the lossy OsStr -> Cow<str> method over the one that fails on non-UTF-8, most of the time?).

Also, scanning a string’s bytes for the first nul and shoving it into the unchecked CStr ctor can be implemented in a single stanza if you really really need it.
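That stanza could look something like this (a hypothetical helper name; the unsafe is justified only because the scan just established the invariant):

```rust
use std::ffi::CStr;

// Scan for the first null byte and hand the (just-validated) prefix to the
// unchecked constructor; returns None when there is no null byte at all.
fn truncate_to_c_str(bytes: &[u8]) -> Option<&CStr> {
    let end = bytes.iter().position(|&b| b == b'\0')?;
    // SAFETY: `bytes[..=end]` contains exactly one null byte, at its end.
    Some(unsafe { CStr::from_bytes_with_nul_unchecked(&bytes[..=end]) })
}

fn main() {
    assert_eq!(truncate_to_c_str(b"abc\0def").unwrap().to_bytes(), b"abc");
    assert!(truncate_to_c_str(b"no null byte").is_none());
}
```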

Ok so this is what you are talking about. An acceptable change would be to have this method as an inherent function in CStr instead of a str method, to reduce its “convenience”.

Saying that a memory-safe construction can be replicated at user level using unsafe is actually considered more dangerous.

Lastly, since lossy may not be scary enough for some people, I suggest the following:

pub struct TruncatedCStr<'a> {
  pub truncated: &'a CStr,
}

fn to_c_string_lossy (
    self: &'_ str,
) -> Result<Cow<'_, CStr>, TruncatedCStr<'_>>;

// use case
let s: &str = "Hell\0 World!";
let c_str: Cow<'_, CStr> =
    s.to_c_string_lossy().unwrap_or_else(|e| Cow::Borrowed(e.truncated));

This looks like a more explicit API, do you prefer it?

No, my thesis is that this API shape is not the one you want.

You are not using CStr outside of unsafe code anyways, since you’re already talking to a C library that wants random char const*s. My point was that it is not overly inconvenient, but sufficiently inconvenient that it will make users consider propagating nul-intolerance rather than silently truncating.

Moreover, the alleged performance gains of this new function are probably negligible, since that library is guaranteed to scan those strings, either by printing them (super slow) or by doing strlen on them (unless the library takes size_t len, char const str[len], in which case it seems insane that it would be taking a C-string).

(Not alleged: it does prevent a second scan when the null byte is missing.) But the point that C libraries do indeed perform countless redundant scans, because of their unfortunate nul-terminated char * convention (instead of char *, size_t), is a good one.

You have convinced me that it may indeed not deserve to be in ::std (=> crates.io += 1), although I’d like to see other people’s position on this.

PHP is NUL-preserving, and it got flak for allowing hacks such as file.exe\0.gif, which passes the extension check on the PHP side and then goes wrong when passed to fopen().

Rejecting all such strings seems safer, and I don’t think it removes any useful functionality. If you’re expecting NULs as a possibility, you’d rather use byte slices.

A CString-like type that is guaranteed UTF-8, so that it can deref to &str, would be nice to have, but libstd would collapse into a black hole if it got one more string type :wink:


That makes sense.

You’re right, although there is a subtle difference here: in Rust we are forced to explicitly change the type from a String to a CString (versus the implicitness of PHP, if I have understood that correctly), and the proposed method does have the word lossy in its name. Maybe I lack programming experience, but are there so many developers that perform safety checks before a lossy transformation? And if that is the case, what about the following suggested API, which forces one to explicitly choose to truncate?

pub struct InnerNullTruncation<'a> {
  pub truncated: &'a CStr,
}

// impl CStr {
    fn from_bytes_may_clone (
        bytes: &'_ [u8],
    ) -> Result<Cow<'_, Self>, InnerNullTruncation<'_>>
    {
        /* ... */
    }
// }

// use case
let s: &str = "Hell\0 World!";
let c_str = CStr::
    from_bytes_may_clone(
        s.as_bytes()
    )
    .unwrap_or_else(|e| Cow::Borrowed(e.truncated))
;

This way, the API is more explicit about what it does (on top of a very explicit documentation), while being handy for those looking for the described use case.
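For reference, a sketch of what the elided body could look like, written here as a free function since inherent methods cannot be added to CStr outside of ::std (Debug is derived only for the asserts):

```rust
use std::borrow::Cow;
use std::ffi::{CStr, CString};

#[derive(Debug)] // only so the asserts below can use `unwrap`/`unwrap_err`
pub struct InnerNullTruncation<'a> {
    pub truncated: &'a CStr,
}

// Sketch of the proposed `CStr::from_bytes_may_clone`: one scan, borrowing
// when a terminating nul is already there, reporting interior nuls,
// and cloning only when a nul must be appended.
fn from_bytes_may_clone(bytes: &[u8]) -> Result<Cow<'_, CStr>, InnerNullTruncation<'_>> {
    match bytes.iter().position(|&b| b == b'\0') {
        // The first null byte is the terminator: borrow, no clone needed.
        Some(i) if i == bytes.len() - 1 => Ok(Cow::Borrowed(
            CStr::from_bytes_with_nul(bytes).expect("exactly one null byte, at the end"),
        )),
        // Interior null byte: report the truncation instead of hiding it.
        Some(i) => Err(InnerNullTruncation {
            truncated: CStr::from_bytes_with_nul(&bytes[..=i])
                .expect("the slice ends with its first null byte"),
        }),
        // No null byte: clone and append one.
        None => Ok(Cow::Owned(CString::new(bytes).expect("no interior null byte"))),
    }
}

fn main() {
    assert_eq!(from_bytes_may_clone(b"Hell\0 World!").unwrap_err().truncated.to_bytes(), b"Hell");
    assert_eq!(from_bytes_may_clone(b"Hi\0").unwrap().to_bytes_with_nul(), b"Hi\0");
    assert_eq!(from_bytes_may_clone(b"Hi").unwrap().to_bytes_with_nul(), b"Hi\0");
}
```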

Thoughts?
