Getting rid of string slices for better ergonomics

Hey,

Wild thought: what if we wanted to get rid of string slices? Let's ignore backward compatibility in this thread. I still don’t get why String and string slices have to be separate, and I’m getting tired of explaining the difference to my colleagues. To me that’s a huge ergonomics issue in the language, one that not only confuses new Rustaceans but also annoys me as an experienced developer, because I constantly have to convert from one to the other.

String slices are different from string mostly due to where they are stored, but I am questioning whether that could have been abstracted over. Java, for example, even if the problem is different there, also has two kinds of string, and string slices in Rust are similar to Java's String pool, except that Java makes it almost transparent. Haskell, for its part, has the OverloadedStrings extension.

From my developer perspective a string slice and a String reference are the same. The fact that they are stored differently, with its performance implications, is not something I want to have to think about. (I want zero cost abstraction here!)

Forgetting backward compatibility, is there anything preventing that?

1 Like

Rust String = Java/C# StringBuilder

Rust str = Java/C# string

It’s as simple as that. Don’t overcomplicate it.

2 Likes

String slices are different from string mostly due to where they are stored

That's not really an important difference. The important difference is ownership. String slices are non-owned and Strings are owned. That would be true no matter where they are stored. There isn't any fundamental reason why Strings couldn't allocate themselves on the stack or in static memory, aside from the fact that it would be inefficient to do it in a coherent and safe way.
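To make the ownership point concrete, here is a minimal sketch (the function name `first_word` is made up for illustration). The same borrowed `&str` type can view static memory, a heap buffer owned by a `String`, or bytes on the stack; what stays constant is that the view never owns its data.

```rust
/// Returns the first word of `text` as a borrowed view; nothing is copied.
/// (`first_word` is a hypothetical helper, not a standard library function.)
fn first_word(text: &str) -> &str {
    text.split_whitespace().next().unwrap_or("")
}

fn main() {
    // Borrowed view into static memory (the literal lives in the binary).
    let from_static: &str = "static data";

    // Borrowed view into a heap buffer that is owned by a String.
    let owner: String = String::from("heap data");
    let from_heap: &str = owner.as_str();

    // Borrowed view into bytes that live on the stack.
    let stack_bytes: [u8; 10] = *b"stack data";
    let from_stack: &str = std::str::from_utf8(&stack_bytes).expect("valid UTF-8");

    // One function handles all three, because it only needs a non-owning view.
    assert_eq!(first_word(from_static), "static");
    assert_eq!(first_word(from_heap), "heap");
    assert_eq!(first_word(from_stack), "stack");
}
```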

From my developer perspective a string slice and a String reference are the same.

You shouldn't use &String, because it is basically useless. String and &str are very different and that difference is important to many of us.
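A small sketch of why `&String` is rarely the right parameter type (both function names are hypothetical): a `&str` parameter accepts everything a `&String` parameter does, plus literals and sub-slices, without requiring the caller to own a `String`.

```rust
/// Taking `&str` accepts string literals, `&String` (via deref coercion),
/// and sub-slices alike.
fn count_vowels(s: &str) -> usize {
    s.chars().filter(|c| "aeiouAEIOU".contains(*c)).count()
}

/// Taking `&String` forces every caller to have an owned `String` first.
fn count_vowels_strict(s: &String) -> usize {
    count_vowels(s) // `&String` coerces to `&str` here
}

fn main() {
    let owned = String::from("ownership");

    assert_eq!(count_vowels(&owned), 3); // &String coerces to &str
    assert_eq!(count_vowels("borrow"), 2); // literal, no allocation
    assert_eq!(count_vowels(&owned[..3]), 1); // sub-slice "own"

    assert_eq!(count_vowels_strict(&owned), 3);
    // A literal only fits the `&String` version after allocating a String:
    assert_eq!(count_vowels_strict(&"borrow".to_string()), 2);
}
```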

Let's ignore backward compatibility

I wouldn't really advise anyone to participate in a discussion premised on ignoring very real and practical concerns. It's like asking people to come to a meeting to specifically discuss inactionable decisions...

I mean, if the question is "Is it possible to make a new language with strings that act like Java's or Haskell's", then obviously the answer is yes. If it's "could we do it in Rust", then also, yes. I don't see how either of these questions is interesting for Rust unless we're discussing them in the context of maintaining Rust's backward-compatibility guarantees.

3 Likes

If a backwards-compatibility break were an option, perhaps some renaming could help:

  • String -> StringBuf
  • &str -> &String
  • Box<str> -> String

Or maybe there could be a special syntax for the “owned + growable” and “borrowed + fixed” variants of types ([T] vs Vec<T>, Path vs PathBuf, etc.).

But I’d not get rid of the concept. It still has to be explained, but it’s a useful thing to have.

Languages that try to blur the distinction between owned string and a string “view” end up with problems:

  • A tiny substring of a large string ends up keeping the large string alive. You parse a line out of a file, and the whole file still sits in memory (in languages without proper immutable references there’s the added bonus of surprise modifications of memory shared by strings).
  • Or you don’t share memory between strings, and end up wastefully allocating and copying data every time you take a slice, search, etc.
  • Or you try to have the best of both worlds, and the string type is a dynamically switched hybrid of anything from a small-string optimization to a rope with some clever tree of string chunks.

In the end hiding the distinction between owned strings and views ends up with bad side effects and/or needing a complex string type with heroic optimizations to be acceptable in both roles.

In Rust you need to understand the distinction, and unfortunately people do stumble over that, but once you learn it you get a super-simple String implementation without any magic, plus zero-cost safe views into it.
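As a rough illustration of the trade-off described above (the helpers `first_line` and `is_view_into` are invented for this sketch), a `&str` view points straight into the parent buffer, and the programmer decides explicitly when to pay for a copy so the large buffer can be freed.

```rust
/// Returns the first line of `contents` as a view into the same buffer.
fn first_line(contents: &str) -> &str {
    contents.lines().next().unwrap_or("")
}

/// Reports whether `view` points into the memory occupied by `parent`.
/// (Hypothetical helper, used only to make the sharing visible.)
fn is_view_into(parent: &str, view: &str) -> bool {
    let start = parent.as_ptr() as usize;
    let end = start + parent.len();
    let v = view.as_ptr() as usize;
    v >= start && v + view.len() <= end
}

fn main() {
    // Stands in for the contents of a large file.
    let file = String::from("name = demo\nversion = 1\n");

    // The view costs a pointer and a length; no bytes are copied.
    let line = first_line(&file);
    assert_eq!(line, "name = demo");
    assert!(is_view_into(&file, line));

    // Keeping the line beyond the file's lifetime is an explicit choice:
    // copy the small piece, then release the big buffer.
    let kept: String = line.to_owned();
    drop(file);
    assert_eq!(kept, "name = demo");
}
```

The borrow checker is what prevents the first problem in the list: `line` cannot outlive `file`, so a tiny view can never silently keep the whole buffer alive.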

12 Likes

Rust String = Java/C# StringBuilder

Rust str = Java/C# string

It’s as simple as that. Don’t overcomplicate it.

That's over-simplistic. Java doesn't have the concept of mutability in the language, so it has to offer another type. Rust is very different in that sense. In other words, Java doesn't have a &StringBuilder, so it's not a good comparison.

The important difference is ownership.

I am not comparing String and &str here but &String and &str.

You shouldn’t use &String , because it is basically useless.

That's exactly my point: if it's useless, can we get rid of it?

I wouldn’t really advise anyone to participate in a discussion premised on ignoring very real and practical concerns.

Backward compatibility is a huge subject here, and I am interested in the concept first. Let's see what works in theory, and then see whether it can be done in practice.

&String doesn’t exist explicitly though. It exists only because & is a type compositor that can be applied to any type. Getting rid of it would add special handling code, it would not simplify anything.

(There are legitimate use cases for it as well, for instance in generics or across abstract layers of design.)
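One way to see how `&String` arises without anyone asking for it is a generic function; the sketch below uses a made-up `largest` helper. The function only knows about `&T`, so with `T = String` its result is necessarily `&String`.

```rust
/// Generic over any ordered element type; knows nothing about strings.
/// (`largest` is a hypothetical helper written for this example.)
fn largest<T: Ord>(items: &[T]) -> Option<&T> {
    items.iter().max()
}

fn main() {
    let names: Vec<String> = vec![
        "carol".to_string(),
        "alice".to_string(),
        "bob".to_string(),
    ];

    // With T = String, the generic return type `&T` is `&String`.
    let top: Option<&String> = largest(&names);
    assert_eq!(top.map(|s| s.as_str()), Some("carol"));
}
```

Removing `&String` would mean special-casing `&T` for one particular `T`, which is the extra handling the post above refers to.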

4 Likes

In the end hiding the distinction between owned strings and views ends up with bad side effects

I think Rust is special here because it has the ownership concept. If ownership was more accurate (owning just a slice for example) then this could be handled like everything else in Rust.

But even to start with, I don't think you need that. As of now, slices are used almost entirely for static strings, and when that's not the case I believe the entire String is borrowed, so I don't think we need to be that smart here.

&String doesn’t exist explicitly though. It exists only because & is a type compositor that can be applied to any type. Getting rid of it would add special handling code, it would not simplify anything.

I am not suggesting removing this one, but removing &str instead. Basically, I am asking whether &str could become &String.

There are definitely other APIs plausible with a single unified string type, but they are problematic for a variety of reasons. This is an interesting subject and deserves serious consideration, though I don’t know what we could do.

You could imagine a language in which pointers to str are 3 words, instead of 2, carrying also a capacity. One might imagine this would allow String to go away, becoming just Box<str>. This introduces some problems, though, having to do with &mut str, presuming you want this type to have push APIs:

  • What happens when you pass &mut s[x..y]? How do you push to the end of that, if y is not the end of the string?
  • Even if that weren’t the case, this doesn’t work at all because if you push, you have to edit the len and capacity. But they’re copied with the pointer: pointers to the string in outer functions won’t have their len and capacity edited when you push.

This means the push APIs would have to take &mut Box<str>. At that point, have you really improved the situation? You’ve just renamed String to Box<str> and made references to str pass an extra word they have no use for.
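The second bullet can be modelled with a toy type. In the sketch below, `FatRefMeta` is purely hypothetical (it stands for the len and capacity words of the imagined three-word reference, with the data pointer left out); it shows that a push through a by-value copy of the metadata leaves the caller's copy stale, whereas `&mut String` updates the single authoritative copy. The size assertions reflect how current Rust implementations lay these types out, which is an observation rather than a documented guarantee.

```rust
use std::mem::size_of;

/// Hypothetical metadata of a three-word reference to str.
/// This type is NOT part of Rust; it only models the proposal.
#[derive(Clone, Copy, Debug, PartialEq)]
struct FatRefMeta {
    len: usize,
    cap: usize,
}

/// Models "push one byte" through a reference that was passed by value:
/// only the callee's copy of len is updated.
fn push_through_copy(mut meta: FatRefMeta) -> FatRefMeta {
    if meta.len < meta.cap {
        meta.len += 1;
    }
    meta
}

/// Real Rust: pushing goes through `&mut String`, so the one and only
/// copy of len and capacity is updated in place.
fn push_through_owner(s: &mut String) {
    s.push('!');
}

fn main() {
    // Today a `&str` is two words (pointer, len) and a `String` is three
    // (pointer, capacity, len) on current implementations.
    assert_eq!(size_of::<&str>(), 2 * size_of::<usize>());
    assert_eq!(size_of::<String>(), 3 * size_of::<usize>());

    let callers_view = FatRefMeta { len: 3, cap: 8 };
    let callees_view = push_through_copy(callers_view);
    assert_eq!(callees_view.len, 4);
    assert_eq!(callers_view.len, 3); // the caller never sees the push

    let mut owned = String::from("abc");
    push_through_owner(&mut owned);
    assert_eq!(owned, "abc!");
}
```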

9 Likes

As of now, slices are used almost entirely for static strings, and when that’s not the case I believe the entire String is borrowed, so I don’t think we need to be that smart here.

I've written a couple of parsers and text handling libraries, and I'm always slicing up strings. For example, when returning a syntax error, I'll return a slice over the part of the text that is wrong. Even when starting a parse, one doesn't need to own the string buffer, so &str is ideal. Using &String and getting a double indirection would be wasteful.
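A minimal sketch of that pattern, with an invented `parse_numbers` function: the parser borrows its input, and on failure the error is a slice over the offending part of that same input, so the caller can recover its exact position without any copying.

```rust
/// Parses a comma-separated list of unsigned integers.
/// On failure, returns the exact piece of `input` that failed to parse,
/// as a slice borrowed from `input`. (Hypothetical example function.)
fn parse_numbers(input: &str) -> Result<Vec<u32>, &str> {
    let mut out = Vec::new();
    for token in input.split(',') {
        let trimmed = token.trim();
        match trimmed.parse::<u32>() {
            Ok(n) => out.push(n),
            Err(_) => return Err(trimmed),
        }
    }
    Ok(out)
}

fn main() {
    assert_eq!(parse_numbers("10, 20, 30"), Ok(vec![10, 20, 30]));

    let text = "10, 2o, 30";
    let bad = parse_numbers(text).unwrap_err();
    assert_eq!(bad, "2o");

    // Because `bad` is a view into `text`, its byte offset can be recovered
    // from the two pointers, e.g. to underline the error in a message.
    let offset = bad.as_ptr() as usize - text.as_ptr() as usize;
    assert_eq!(offset, 4);
}
```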

On the other hand, I suppose you could try to make a type that 'knows' that it is superfluous in a non-mutable context, but I would be wary about discounting a use case like that. Someone is going to want to do the inefficient thing, if only to benchmark something or work with some archaic interface. It's very difficult to say that some code is 'wrong' only because it is inefficient or because it is redundant in one set of applications.

One could maybe try to design a string type that compiles differently depending on whether it's mutable or not? I'm skeptical, because I'm sure it would frustrate somebody who's trying to do legitimate work and doesn't want things slipping around under their fingers like that.

Probably worth mentioning that the easy_strings crate exists (GitHub: Storyyeller/easy_strings, “Ergonomic, garbage collected strings for Rust”).

EZString is similar to the strings in high level languages such as Python and Java. It is designed to be as easy to use as possible by always returning owned values, using reference counting and copy-on-write under the hood in order to make this efficient.
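The general idea of reference counting plus copy-on-write can be sketched with the standard library alone. The example below does not use or imitate the easy_strings API; it is an assumption-laden illustration built on `Rc<String>` and `Rc::make_mut`, with a made-up `append` function.

```rust
use std::rc::Rc;

/// Returns a new handle whose contents are `s` followed by `suffix`.
/// The caller's handle is left untouched. (Hypothetical helper; this is
/// not how easy_strings is implemented, only the same general idea.)
fn append(s: &Rc<String>, suffix: &str) -> Rc<String> {
    // Cheap: bumps a reference count, shares the buffer.
    let mut handle = Rc::clone(s);
    // Copy-on-write: the buffer is shared here, so `make_mut` clones it
    // before the mutation.
    Rc::make_mut(&mut handle).push_str(suffix);
    handle
}

fn main() {
    let a: Rc<String> = Rc::new(String::from("hello"));

    // "Copying" a string is just another handle to the same allocation.
    let shared = Rc::clone(&a);
    assert!(Rc::ptr_eq(&a, &shared));

    // Writing produces a separate buffer; the original is unchanged.
    let b = append(&a, ", world");
    assert_eq!(a.as_str(), "hello");
    assert_eq!(b.as_str(), "hello, world");
    assert!(!Rc::ptr_eq(&a, &b));
}
```

The convenience is paid for at run time: every handle carries a count, and every write to a shared string copies it.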

Though I agree that Rust's current design is the least problematic one available for a language where expensive string copies are considered a problem that should be solved optimally by the programmer rather than solved pretty-good-ly by a single heroic implementation.

2 Likes

To me, changing str to something like StringSlice would be a big improvement. That way the difference between a string and a string slice would be obvious immediately if you have already learned the same distinction for a vector and a slice. I get that this name is longer and that it would be hard to change (it could be done gradually by making str a type alias of StringSlice), but for me the difference was something I understood quite late compared to other things, and I think the naming is the reason it isn’t intuitive. I always thought that the difference was more like the difference between a c_str (char array) and String in C++.
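For what it's worth, the naming can be tried out in user code today with an ordinary type alias. Note that this sketch goes in the opposite direction from the proposal above (here `StringSlice` is a hypothetical alias for the built-in `str`, not the other way around), so it only shows how signatures would read.

```rust
/// Hypothetical alias, defined here only for illustration.
type StringSlice = str;

/// Reads like the vector/slice pairing: `&StringSlice` is to `String`
/// what `&[T]` is to `Vec<T>`.
fn shout(s: &StringSlice) -> String {
    s.to_uppercase()
}

fn main() {
    let owned = String::from("hello");
    let view: &StringSlice = &owned[1..4];
    assert_eq!(view, "ell");
    assert_eq!(shout(view), "ELL");
    assert_eq!(shout("hi"), "HI");
}
```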

3 Likes

Sure. You just mentioned:

You could say goodbye to zero-cost read-only substring views if &str were removed. At that point, forming substrings would have to introduce either cloning or e.g. reference counting the parent string – neither of these options being zero cost. It's one of the chief reasons why &str exists in the first place.

Then you are probably not using the idioms and the power of the language to their fullest extent. &String coerces into &str implicitly in many contexts, which is probably the most frequent conversion one has to perform. The other direction is also trivial, using into(), to_owned() or to_string(). It's not implicit because it's potentially costly/slow, so being able to see it is a feature, not a bug.
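A short sketch of the conversions mentioned above (the `greet` function is invented for the example). Going from `&String` to `&str` is implicit and free; going from `&str` to `String` is spelled out because it allocates and copies.

```rust
/// Takes a borrowed view, so both literals and `&String` can be passed.
fn greet(name: &str) -> String {
    format!("Hello, {}!", name)
}

fn main() {
    let owned: String = String::from("Ferris");

    // &String -> &str: implicit deref coercion, no allocation.
    assert_eq!(greet(&owned), "Hello, Ferris!");
    let view: &str = &owned;

    // &str -> String: explicit, and each call allocates a new buffer.
    let a: String = view.into();
    let b: String = view.to_owned();
    let c: String = view.to_string();
    assert_eq!(a, b);
    assert_eq!(b, c);

    // The copy lives in its own allocation, separate from `owned`.
    assert_ne!(a.as_ptr(), owned.as_ptr());
}
```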

3 Likes

Ignoring the naming scheme (which cannot really be changed at this point, and which I don’t even think is bad), I think it’s actually a learning benefit for beginners.

Rust is all about ownership, and strings are most likely the first more complex data type a beginner encounters. So it’s natural that they are confusing. But we have to explain ownership and borrowing anyway, and strings are not a bad example for that. If we made strings “easier” through special casing, we would have the same questions about other types, plus the confusion about why strings are handled differently from every other data type.

5 Likes

Maybe it should be explicitly pointed out that everything being said here about &str / “string slices” versus String is pretty much equally true of regular [T] slices and Vec<T>. There is no zero-cost abstraction that hides the distinction between the two; every existing language I know of either already has that distinction or has committed to a non-zero abstraction cost; and the distinction is so fundamental that it’s impossible to write good code in languages that make the distinction if you don’t understand it.
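The parallel can be shown side by side. In this sketch (`total` and `word_count` are made-up helpers), the borrowed forms `&[u32]` and `&str` accept the whole container or any sub-range, while growing the data requires the owning `Vec<u32>` or `String`.

```rust
/// Works on any borrowed run of numbers: a whole Vec or a sub-range.
fn total(values: &[u32]) -> u32 {
    values.iter().sum()
}

/// Works on any borrowed run of text: a whole String or a sub-range.
fn word_count(text: &str) -> usize {
    text.split_whitespace().count()
}

fn main() {
    let mut v: Vec<u32> = vec![1, 2, 3, 4];
    let mut s: String = String::from("one two three four");

    // Whole container, via deref coercion (&Vec<u32> -> &[u32], &String -> &str).
    assert_eq!(total(&v), 10);
    assert_eq!(word_count(&s), 4);

    // Zero-copy views over a part of each.
    assert_eq!(total(&v[1..3]), 5); // [2, 3]
    assert_eq!(word_count(&s[4..13]), 2); // "two three"

    // Growing needs the owned, growable type in both cases.
    v.push(5);
    s.push_str(" five");
    assert_eq!(total(&v), 15);
    assert_eq!(word_count(&s), 5);
}
```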

IMO the only thing that’s special about Rust here is that it made string slices a distinct type from regular slices, so you get the UTF-8 guarantee enforced by the type system. AFAIK that change has been largely welcomed, and is unrelated to anything the OP was arguing.

6 Likes

In fact, AFAIK the only reason we haven't killed the str builtin and replaced it with struct str([u8]) is annoying issues with the module system and std::str. There is a closed PR floating around that goes into detail.

1 Like

This isn’t quite true. Rust’s str and String types are guaranteed to be valid UTF-8, so turning str into [u8] and String into Vec<u8> wouldn’t be correct, as it would permit arbitrary bytes.

@alilleybrinker String is defined as

pub struct String {
    vec: Vec<u8>,
}

Yes, a String guarantees valid UTF-8, but that can be enforced at the struct interface; note the private inner type. @mcy was talking about replacing the current magic str with a similar struct.
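Here is a minimal sketch of enforcing the invariant at the struct interface. `Utf8Buf` and its methods are hypothetical names for this example, not standard library items; the private field means code outside the module can only obtain a value through the validating constructor.

```rust
mod utf8 {
    /// Hypothetical stand-in for a string type built on plain bytes.
    pub struct Utf8Buf {
        bytes: Vec<u8>, // private: cannot be set from outside this module
    }

    impl Utf8Buf {
        /// The only way to construct a value; rejects invalid UTF-8.
        pub fn from_bytes(bytes: Vec<u8>) -> Result<Self, std::str::Utf8Error> {
            std::str::from_utf8(&bytes)?;
            Ok(Utf8Buf { bytes })
        }

        /// Re-validates for simplicity so the sketch stays free of `unsafe`;
        /// the standard library relies on the invariant instead.
        pub fn as_str(&self) -> &str {
            std::str::from_utf8(&self.bytes).expect("invariant: bytes are valid UTF-8")
        }
    }
}

fn main() {
    let ok = utf8::Utf8Buf::from_bytes(b"caf\xc3\xa9".to_vec()).unwrap();
    assert_eq!(ok.as_str(), "café");

    // Arbitrary bytes are rejected at the interface.
    assert!(utf8::Utf8Buf::from_bytes(vec![0xff, 0xfe]).is_err());

    // Writing `utf8::Utf8Buf { bytes: vec![0xff] }` here would not compile,
    // because the field is private to the `utf8` module.
}
```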

2 Likes

Do you have the link? Would be nice to read through that :)

2 Likes