Str vs slice APIs

At first glance, slice and str have a similar API surface (which is what I expected, especially since str is called a string slice). Unfortunately, they behave quite differently in practice, which surprised even some very experienced Rust users I just talked to about splitting a slice by a subslice. Is there a way forward where the APIs can be more unified? I think efficiency concerns could probably be addressed by specialization, with generic implementations provided for slices. Unfortunately, the naming conflict already exists in stable Rust, so the names would have to differ, but at least the functionality gap could be closed.
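To make the mismatch concrete, here is a minimal sketch of the two `split` methods side by side. The helper names (`split_str_on_dashes`, `split_bytes_on_dash`) are made up for illustration; only `str::split` and `slice::split` come from std.

```rust
/// Splits a string on the multi-character separator "--".
/// `str::split` accepts a substring as its pattern.
fn split_str_on_dashes(text: &str) -> Vec<&str> {
    text.split("--").collect()
}

/// Splits bytes with `slice::split`, which only accepts a per-element
/// predicate, so every single '-' becomes its own split point.
fn split_bytes_on_dash(bytes: &[u8]) -> Vec<&[u8]> {
    bytes.split(|&b| b == b'-').collect()
}

fn main() {
    // Three pieces: the separator is matched as a whole.
    println!("{:?}", split_str_on_dashes("a--b--c"));
    // Five pieces, two of them empty: there is no way to pass "--" here.
    println!("{:?}", split_bytes_on_dash(b"a--b--c"));
}
```

The same method name gives three pieces on str and five on the byte slice, because the slice version has no notion of a subslice separator.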

My main motivation is that I work with byte slices quite a lot (mostly through C FFI). These are explicitly allowed to be invalid UTF-8 in almost all cases, so simply “converting” a &[u8] to str with from_utf8() does not work here.
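A small sketch of why the conversion route falls over, assuming a made-up input with a stray 0xFF byte (which can never appear in valid UTF-8). The helper `try_as_str` is hypothetical; `std::str::from_utf8` and `Utf8Error::valid_up_to` are the std calls.

```rust
use std::str;

/// Returns the bytes as a &str only if they are valid UTF-8.
fn try_as_str(bytes: &[u8]) -> Option<&str> {
    str::from_utf8(bytes).ok()
}

fn main() {
    let valid: &[u8] = b"key=value";
    let invalid: &[u8] = b"key=\xFFvalue"; // 0xFF is never valid UTF-8

    // Valid input converts without copying.
    println!("{:?}", try_as_str(valid));

    // Invalid input is rejected, so none of the str APIs are reachable.
    match str::from_utf8(invalid) {
        Ok(s) => println!("unexpectedly valid: {s}"),
        Err(e) => println!("invalid after {} bytes", e.valid_up_to()),
    }
}
```

`String::from_utf8_lossy` would not help either, since it replaces the offending bytes and so changes the data.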

So, the reason I’m posting: Is anyone working on closing that gap? Is there something that fills it at least for byte slices? The only thing I’ve found is this issue, https://github.com/rust-lang/rust/issues/27721, but it seems to have been stalled for a while now.

This is something I think would be fun to work on, but if the endgame is to get it into std, then it’s a monumental amount of work that I probably won’t get to until… I don’t know when. 🙂 However, as a member of the library team, I would wholeheartedly support this effort.

A small consolation prize is that the regex crate has a &[u8] interface, and the path I chose was “duplicate the entire public API.” This seems to be the path we are on in std as well. (Of course, the implementation may be reused.) The regex crate is a fairly large dependency to take on, depending on what you’re doing, but it’s smart about handling simple literal cases, so if you can amortize the compile time, then searching, splitting, and replacement on &[u8]/Vec<u8> can be done with the regex crate today.

I would love it if there were a way to ease this in. The Pattern overhaul has not been finished, but it would be great if we could gradually add substring search for &[u8].

How do we decide what to do about substring search for general &[T]? Algorithms exist for T: Ord, but they are likely to be slower. (In fact, our two-way algorithm works fine for any T: Ord, but our implementation uses a byte-specific trick to speed it up.)
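For reference, here is roughly what a generic fallback could look like. This is only a naive sketch, not the two-way algorithm and not anything that exists in std: the names `find_subslice` and `split_by_subslice` are hypothetical, it needs only T: PartialEq, and it runs in O(n·m) in the worst case.

```rust
/// Naive search: index of the first occurrence of `needle` in `haystack`.
/// Hypothetical helper, not a std API.
fn find_subslice<T: PartialEq>(haystack: &[T], needle: &[T]) -> Option<usize> {
    if needle.is_empty() {
        return Some(0); // `windows(0)` would panic, so handle this up front
    }
    haystack.windows(needle.len()).position(|w| w == needle)
}

/// Splits `haystack` on every occurrence of the subslice `sep`.
/// Hypothetical helper; panics on an empty separator.
fn split_by_subslice<'a, T: PartialEq>(mut haystack: &'a [T], sep: &[T]) -> Vec<&'a [T]> {
    assert!(!sep.is_empty(), "separator must not be empty");
    let mut parts = Vec::new();
    while let Some(i) = find_subslice(haystack, sep) {
        parts.push(&haystack[..i]);
        haystack = &haystack[i + sep.len()..];
    }
    parts.push(haystack); // whatever remains after the last separator
    parts
}

fn main() {
    // Works on bytes that are not valid UTF-8.
    let data: &[u8] = b"a\r\nb\r\n\xFFc";
    let sep: &[u8] = b"\r\n";
    println!("{:?}", split_by_subslice(data, sep));

    // Works on any element type that implements PartialEq.
    let nums: &[i32] = &[1, 2, 3, 4];
    println!("{:?}", find_subslice(nums, &[3, 4]));
}
```

Something like this would be the generic baseline, with a specialized fast path for bytes layered on top if specialization allows it.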
