Make manipulating strings less painful

Summary

Some API additions and changes to make the string API cleaner and handling strings less painful.

Motivation

Currently, when I want to take, let’s say, the 10th character out of a string, I need to do something like this:

s.chars().skip(9).next().unwrap()

I think we can make it simpler.

Detailed design

What if we could do something like this:

s[Byte(0)] // first byte
s[Char(1)] // second character
s[Graph(2)] // third grapheme
s[Word(-1)] // last word
s[Line(-2)] // second to last line
s[Word(1)..] // everything except the first word
s[Char(1)..Word(1)] // first word without the first char
s[Line(0)][Word(-2)..] // last 2 words of the first line
s[Byte(n)..][Char(0)] // char at byte offset n, old indexing behavior
s.iter::<Graph>() // iterator over graphemes
s.iter::<Word>() // iterator over words

// get number of words in string:
let word_num = s.len::<Word>();

// erase everything between lines containing "begin" and "end"
let (a, b): (Line, Line) = (s.find("begin"), s.find("end"));
s2 = s[...a] + s[b..];

Basically, we could use a few predefined newtypes to specify whether we’re interested in characters, graphemes, etc. I haven’t thought about how to implement it yet, but I’m sure it wouldn’t be too hard.
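A rough sketch of the newtypes used in the examples above (purely illustrative; nothing about the exact representation is settled):

// Illustrative only: single-value tuple structs naming the unit of indexing.
// Signed so that negative values can count from the end, as in Word(-1).
struct Byte(isize);
struct Char(isize);
struct Graph(isize);
struct Word(isize);
struct Line(isize);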

Drawbacks

  • Adds some complexity to the std API
  • Encourages use of potentially more expensive operations

Alternatives

Do nothing and force the programmer to use iterators. This keeps it explicit how much work the processor must do to get the nth thing out of the string.

Unresolved questions

  • Is [Char(a)..Char(b)] even possible? Should it be limited to [Char(a..b)]?
    • It seems that Char(a)..Char(b) would be possible, but not Char(a)..Graph(b)
  • What should the return type of the indexing operation be?
    • Should the result of [Char(n)] be a char, or a &str like the others?
    • The result of [Word(a..b)] could be a Vec<&str>
    • Or everything must be a str slice
  • Isn’t the syntax too verbose?
  • How do we make it clearer when an operation is significantly more expensive (nth byte vs. nth anything else)?
4 Likes

Very interesting. Currently, str's implementation of Index is impl Index<Range<usize>> for str, so we could change it to something like impl Index<Range<StrIndex>> for str, with StrIndex defined as:

enum StrIndex {
    Byte(isize),
    Char(isize),
    Grapheme(isize),
    Word(isize),
}
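As a very rough sketch of the non-range case, handling only the Char variant and ignoring negative indices (illustrative, not a real implementation):

use std::ops::Index;

impl Index<StrIndex> for str {
    type Output = str;

    fn index(&self, idx: StrIndex) -> &str {
        match idx {
            // Return the one-character sub-slice at character offset n.
            // Note: this is an O(n) scan from the start of the string.
            StrIndex::Char(n) => {
                let (start, c) = self
                    .char_indices()
                    .nth(n as usize)
                    .expect("char index out of bounds");
                &self[start..start + c.len_utf8()]
            }
            _ => unimplemented!("other variants left out of this sketch"),
        }
    }
}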

I agree that it’s interesting, but I don’t think string indexing is useful enough to warrant this complexity. It’s very rare that code needs to grab the second-to-last line or tenth character outside of a first-year CS assignment.

Code that does need that sort of thing, like parsers and regex engines and font renderers, either works with plain byte ranges or is specialized enough that it would implement its own logic anyway.

In addition, I tend to think of indexing as a relatively fast operation (at most O(log n)), so having it degrade into a linear scan is surprising.

Overall, I find this solution much too complicated for a problem which I’m not sure exists.

6 Likes

It’s actually a really interesting idea. The best thing you could do is make a crate implementing this, put it on crates.io, and gauge whether people actually want an API like this.

2 Likes

Nooooo nonononono.

Warning: ranting ahead.

The last thing the world needs is more programmers with the mistaken notion that random indexing into a string is in any way a good idea. It’s hard enough to break the notion that code unit = “character”, let alone code point = “character” or u16 = “character” or that [a-zA-Z] is in any way a sufficient “letter” pattern, or that all digits are in [0-9] or that you can substring at arbitrary positions or that \n is always a newline or that [ \t\r\n] is a comprehensive whitespace pattern or… *collapses, gasping for air*

I think what I’m trying to get at: there are enough really frustrating misconceptions about how text works that I don’t think we should be adding to or reinforcing them, even slightly.

People asking for the ability to random access a string for the “nth character” are almost always doing it wrong. They shouldn’t be enabled, they should be slapped on the hand. Heck, aside from text formats that are actually ASCII (and thus shouldn’t be stored as UTF-8 [1]), I don’t think I’ve ever seen a legitimate need for this sort of thing. “I want the 10th character from a string” Why?! I know I probably sound hysterical, but I would legitimately love to see an actual reason for wanting this. It would be like finding a real, live Unicorn.

Quite aside from that, I strongly believe that indexing should never be bound to any O(n) or worse operation, because you will get people who will assume it’s fast. It doesn’t matter how many warnings you put up, they will assume it anyway.

I say keep it inconvenient, because that should be a strong indication to the user that they’re doing it wrong. Because they are.


[1] Treating ASCII as UTF-8 is fine, it’s treating UTF-8 as ASCII that’s the problem. I have no issue with a hypothetical AsciiStr defining indexing, because that really is O(1) and correct.

15 Likes

For example, when you’re implementing word wrap you need to access indexes. When you’re selecting a percentage of the text by user request, you want to go to the start of the word if you fell in the middle of one, so you need to index backwards looking for a space.

Just because you haven’t seen reasons in your daily life doesn’t mean they don’t exist. The API shown in the OP as the status quo is shocking, and saying ‘so just use ASCII for international text from a book’ is less than helpful. That’s not to say you shouldn’t use a custom array for large text (probably a gap vector), but pretending people will not want to take pieces of that (hopefully indexable without O(n)) text and put them in a string to use with regex etc. is asking a bit much.

You are going to have one hilariously slow algorithm if you're seriously re-computing word offsets from a string every single time you need to look at a new word. It would be insane to do this; instead, you'd split the line into words, then index on that. At that point, we're talking about a completely different problem.

Again, doing this by random indexing would be insanity. You would write or use code that does a backward scan from a starting code unit position, through the UTF-8 looking for an acceptable whitespace sequence. Unless it's in a more efficient data structure, but then we're no longer talking about strings!
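For illustration, here's a minimal version of that backward scan using only byte offsets and the existing rfind (a sketch, nothing more):

// Given a byte offset `pos` into `s` (on a char boundary), return the byte
// offset where the word containing `pos` starts, by scanning backwards for
// the nearest whitespace character.
fn word_start(s: &str, pos: usize) -> usize {
    match s[..pos].rfind(char::is_whitespace) {
        // Step past the whitespace char itself (it may be more than one byte).
        Some(i) => i + s[i..].chars().next().unwrap().len_utf8(),
        None => 0,
    }
}

// e.g. word_start("the quick brown fox", 12) == 10, the start of "brown"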

I have never seen a reason for doing one-off random indexing into a string by chars, graphemes or words that was reasonable ("because my assignment says so" doesn't count). Every reason I've ever seen actually boiled down to "but a string is the wrong data structure; you should be using something else!"

I never said or implied that. I said that if you have some sort of text protocol defined entirely in terms of ASCII, then fine, use ASCII. But strings in Rust are UTF-8, and you can't just blithely assume they don't contain non-ASCII code points or composite characters.

I don't understand what you're trying to say here. The first part looks like you're admitting that you shouldn't be using strings for these sorts of tasks (on which I agree!). The next bit confuses me: how does anything I've said imply that people should not be able to create strings, or feed them to regex? I can't imagine that any sane regex implementation would be using the proposed API; it'd have to be specialised repeatedly for different access patterns.

And anyway, the proposed API is not for gap vectors (which the std lib doesn't have), it's for strings.

7 Likes

I also think this would make it far too easy to do inefficient operations that are actually pretty rare anyway. Personally, I’ve never needed to get the nth character/grapheme from a string in Rust. My favourite character-accessing methods are char_range_at, char_range_at_reverse, and the iterators chars/graphemes and char_indices/grapheme_indices, and they have so far neatly provided efficient ways of accessing characters while still functioning well with non-ASCII-range characters. (I would like grapheme_[range_]_at[_reverse], though.) Almost always, that’s all you need. In the rare case that you ever do need the nth character in some string, n is normally relative to some other point in the string (other than the start)—in the simplest (and commonest) case, you just want the next character after one you already know the position of. In that case, I fear that easy character indexing will lure people into doing s[Char(n+1)], which is just a terribly inefficient way of writing s.char_range_at(n).next.
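(For comparison, a sketch of getting the character at a known byte offset today, without any new API: an O(1) slice plus one iterator step, no scan from the start.)

// `n` must be a char boundary; the helper name is made up for the example.
fn char_at_byte(s: &str, n: usize) -> Option<char> {
    s[n..].chars().next()
}

// e.g. char_at_byte("naïve", 2) == Some('ï')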

Also, s[Char(1)] couldn’t give a char unless we got some kind of IndexGet trait which takes &self but yields Output, because the current Index returns &Output, and there’s no way to get a &char from a &str.
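Something like this hypothetical shape (not in std; just to illustrate what would be needed):

// Hypothetical, not part of std: an indexing trait that returns by value
// instead of by reference, which is what yielding a char would require.
trait IndexGet<Idx> {
    type Output;
    fn index_get(&self, index: Idx) -> Self::Output;
}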

...which is only slightly marred by the fact that neither of those return characters; they return chars, which are mis-labelled code points. But honestly, I've given up trying to fight that battle... :stuck_out_tongue:

</insufferable-unicode-pedantry>

3 Likes

Technically, a char isn’t even a code point, it’s a Unicode scalar value (try std::char::from_u32(0xD800)). But I must say that writing char_range_at is a lot easier than unicode_scalar_value_at :stuck_out_tongue:

1 Like

I think this is better served by a separate datatype that wraps String / &str, that specifically makes random access operations easier.

For me it feels as though most of the applications come from totally made-up toy examples like “reverse all words in the string”, the sort of thing you’ll find in an introduction to programming but never really have any application for.

1 Like

The values are supposed to be separate types (single-value tuple structs), so functions like find() and len() can return them too.

// get number of words in string:
let word_num = s.len::<Word>();
// erase everything between lines containing "begin" and "end"
let (a, b): (Line, Line) = (s.find("begin"), s.find("end"));
s2 = s[...a] + s[b..];

I will add these examples to the OP.


  • I needed it for my base85 converter, where I translate a character to its numeric value based on its position in a character_map string. I still have no idea how to convert the byte offset returned by find() to a char offset. With my proposal it would be just find::<Char>(chr).

  • What if you have a string that is always formatted like {XXXX} and you always want to get rid of the first and last character? The same goes for word/line offsets.

  • And what about all the cases where I just want to write a small program without caring about the most CPU-efficient way of doing this?


That is certainly a bad thing, but it's in no way enough reason not to provide such basic functionality, IMO. Also, the current string API is a mess, and this change would simplify it greatly, because we wouldn't need separate methods for operating on bytes/chars/graphemes etc.; it would be dictated by the argument/return types or the ::<type> notation. If the programmer is smart, he will either use byte offsets or split the relevant parts of the string into a vector of str slices where possible, but that's not always worth it, especially for small programs.

// Added note to OP about API clean-up as motivation

I'm with the nay-sayers. Every real string processing algorithm I've seen can easily be amended to use UTF-8 with byte offsets and/or appropriate iterators. Usually it also becomes more efficient, more general (e.g. comes to support code points outside the BMP with no extra effort), or both. See also: UTF-8 Everywhere. I'll pick on your examples to demonstrate:

I'm not sure I understand where those requirements come from; is the code public? What do you need the char offset for, or in other words, why can't you use an iterator? Alternatively, if you have the byte offset, you can get the char at that offset, and the byte offset of the next char, with char_range_at. If you somehow need a char offset, you can keep track of that in a separate counter, or with enumerate() if you use the chars() iterator.
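For instance (a sketch; the helper names are mine):

// Character offset of the first occurrence of `needle`, using the chars
// iterator directly -- no byte-offset conversion needed.
fn char_position(haystack: &str, needle: char) -> Option<usize> {
    haystack.chars().position(|c| c == needle)
}

// Or, converting a byte offset you already have into a char offset.
// (byte_offset must lie on a char boundary, as find() guarantees.)
fn byte_to_char_offset(s: &str, byte_offset: usize) -> usize {
    s[..byte_offset].chars().count()
}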

Come to think of it, I also wonder why a base85 converter works with strings in the first place. Base85 is binary <-> ASCII. To encode a &str, I'd just encode its bytes (&[u8]). To decode to a &str, decode to UTF-8 bytes and convert them to a &str with one of the from_* constructors.

// The "{XXXX}" example from the list above: strip a known prefix and suffix
// using nothing but byte offsets.
fn strip<'a>(s: &'a str, front: &str, back: &str) -> &'a str {
    assert!(s.starts_with(front));
    assert!(s.ends_with(back));
    &s[front.len() .. s.len() - back.len()]
}
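For example:

assert_eq!(strip("{XXXX}", "{", "}"), "XXXX");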

I haven't thought too hard about lines and words, but (1) those seem to be even rarer, in my experience, and (2) a Vec<&str> constructed from the right iterator (lines/words) allows an implementation very similar to the above.
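Something like this (my own toy values):

// Sketch: build the indexable views once, then use plain Vec indexing.
let s = "first line\nsecond line";
let lines: Vec<&str> = s.lines().collect();
let words: Vec<&str> = s.split_whitespace().collect();
assert_eq!(lines[1], "second line");
assert_eq!(words[3], "line");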

I know that situation, and I have sympathy. But that alone should not dictate API design, especially not for something as fundamental as string manipulation. Doubly so when it makes the Right Way™ more annoying to follow. In addition, while Rust does fairly well for small programs, I don't think it is, or should be, actively optimized for quick and dirty scripts.

Finally, you (or anyone else who supports this proposal) are free to implement it in a library, put it on crates.io, and see how well it turns out in practice. You won't even need to define a different string type: if your crate defines Byte, Word, etc., then impl Index<Range<Word>> for str passes the orphan check just fine.

5 Likes

Let me be the first one to point out that it's not even implementable. The Index trait is defined in a way that it can only return a “loan” that refers to something owned by the receiver. chars, on the other hand, would have to be created on the fly. So, given a string slice, we are restricted to returning string slices or bytes. Of course, s[Char(1)] could return a slice referring to the UTF-8 encoded character, but I don't think that would be very valuable. If you want to return something that you compute on the fly, like a char based on a &str, the Index trait is not what you want/need.
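For reference, the trait looks roughly like this:

// The std trait, roughly as defined today: index() can only hand back a
// borrow into `self`, so there is nowhere to materialise a fresh char.
pub trait Index<Idx: ?Sized> {
    type Output: ?Sized;
    fn index(&self, index: Idx) -> &Self::Output;
}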

I also don't see a way of changing Index into something more general that would support that. Maybe it'll be possible with HKT. Dunno.

2 Likes

I disagree; this unicode-pedantry is completely sufferable.

You can hopefully do it in Python or any other high-level Unicode-aware language. No one here (besides you, it appears) wants to turn Rust into a scripting language.

Please address the arguments instead. I think we need to be able to handle and discuss the tradeoffs between safety, convenience and performance. If we can't do that, we can't justify our safe abstractions when they have costs. (Note that often the cost is very small; bounds checking has very little impact in most situations.)

2 Likes

Sorry for that. I guess I just wanted to say that Rust does not need to be everything to everyone. The mission statement reads “a systems programming language that runs blazingly fast, prevents nearly all segfaults, and guarantees thread safety” – this does not include scripting features (though, as others have remarked, it would certainly be possible to add those features in a library).

1 Like

String supports by-byte slicing, and in most interesting cases you know the byte indices.

Could you find me a case where you don’t know them?

1 Like

I don’t think “scripting feature” is a terribly useful term. Most languages’ levels of adoption of “batteries included” are pretty independent of their "scripting"ness - compare Lua to Python, C++ to C#, Rust to Java.

That is not a particularly useful comparison. Java is twenty years old and has a mature, very large ecosystem. Lua is purposely kept small for easy embedding. Rust is a newcomer in the systems language space, and it's positioned as a systems language, which puts it more in the C++ camp than near the likes of Python or Java.

That said, there is nothing that prevents us from creating libraries to make certain tasks easier. And perhaps, with the right tooling, Rust may really become a language well suited to writing one-off scripts (if it isn't already).