`str` method for slicing code-point (i.e. `char`) ranges

Proposal

As the title says, I would like to plea the Rust standard library developers to add a “simpler” way to obtain code-point (i.e. char) ranges, like for example:

let string = "some-string-containing-UTF-8-characters-not-only-ASCII";
let substring = string.get_char_range(5..7).unwrap();

Previous discussions (with negative feedback)

I know that this has been discussed at lengths before in the following threads:

Moreover I understand that this operation is CPU intensive, and won’t ever be O(1). I even understand that given the complexities of Unicode, a “character” can mean many things… However in this thread I’m referring just to “code-points” as returned by str::chars.


Use-case

However, contrary to previous discussions, there are valid use-cases for such a slicing method, although uncommon.

For example I’m writing a Scheme interpreter in Rust, which by following the Scheme R7RS standard does require “character”-based indexing. (We can debate if that is wise or not, however the R7RS standard is a given, and a compliant implementation has to abide by it, therefore there I have no “choice”… And I bet there are other similar examples of users implementing some “required” interface.)


Naive implementation – a “gotcha” in disguise

So my first try was to use str::char_indices:

let string = ...;
let (char_start, char_end) = ...;
let (byte_start, byte_end) = {
    let mut indices = string.char_indices();
    let (byte_start, _) = indices.nth(char_start).unwrap ();
    let (byte_end, _) = indices.nth(char_end - char_start).unwrap();
    (byte_start, byte_end)
};
let substring = string[byte_start .. byte_end];

However there are a couple of problems… (besides the unwrap):

  • one has to take into account as a special case when the char_start (or char_end, or both) are equal to the number of characters in the string; (i.e. the resulting slice is always an empty slice;)
  • in order to detect if char_start or char_end is outside the “character” boundary of the string, one has to enhance that implementation as seen below;

“Complete” solution – A.K.A. “I can’t believe I have to write this in 2018…”

let (range_start, range_end) = {
	let mut indices = string.char_indices () .enumerate ();
	let mut byte_range_start = 0;
	let mut byte_range_end;
	let mut character_index_last = 0;
	let mut byte_index_last = 0;
	loop {
		let (character_index, byte_index, reached_end) = match indices.next () {
			Some ((character_index, (byte_index, _))) => {
				character_index_last = character_index;
				byte_index_last = byte_index;
				(character_index, byte_index, false)
			},
			None =>
				(character_index_last + 1, byte_index_last + 1, true),
		};
		if character_index == range_start {
			byte_range_start = byte_index;
		}
		if character_index == range_end {
			byte_range_end = byte_index;
			break;
		}
		if reached_end {
			fail! (0x22393af0);
		}
	}
	(byte_range_start, byte_range_end)
};
let substring = try_some! (string.get (range_start .. range_end), 0x5c4c5d20)

The “use a crate!” road

I do love Rust for it’s “no-batteries-included” approach as opposed to other languages (say Go), in which the user is provided by default to just the “bare minimum” functionality to get started, and for others to build upon.

However, in today’s world, strings (especially Unicode ones) are everywhere, and I feel that leaving out of the standard library the tools to easily manipulate strings on a “character” boundary is similar to not having a vector or hash-map data-structure…

One could always use such an external crate “that only handles these string-character-based operations”, but then once the user’s project has reached a certain development point, and enough dependencies have gathered, one starts to wonder: “why do I need these 20 crates (that is 75% of my dependencies) that provide only 0.1% of my overall functionality? do I still need this left_pad crate that has a single function called left_pad?”


Closing words

And related to this topic, and to my Scheme interpreter use-case, there are a few other missing functionality regarding the character-based addressing of strings:

  • str::get_at(char_index : usize) -> (Option<char>) – which could be a wrapper for str::chars().nth(index);
  • str::get_byte_boundary(char_index : usize) -> (Option<usize>) – which could be a wrapper for str::char_indices().nth(char_index);
  • the reverse of the above str::get_char_boundary(byte_index : usize) -> (Option<usize>) – which could be a wrapper for str::char_indices().find(|offset| byte_index == offset);
  • similar methods for getting the range start and end in one call;

As a bonus, one could hope that most of the str methods that take byte offsets would have a character offset counterpart.

1 Like

While I acknowledge there are use cases for slicing-by-codepoint, I would much rather see those operations put into a crate and let them evolve naturally in the ecosystem. After some period of evolution, we could re-evaluate whether they belong in std or not. Deciding the std question today without prior ecosystem experience with this sort of thing is, IMO, premature.

5 Likes

It’s weird that you enumerate the indices. You can use .skip(n) to rewind the iterator to the start of the range, and then end of the range.

This is as efficient as it is possible for arbitrary UTF-8 or UTF-16 (i.e. O(n)), and of course it’s incorrect, because a code point is not a character. There are many characters that are built from multiple code points (emoji are the most prominent example, but it also affects decomposed diacritics/combinators that are super common in some languages).

If you really need indexing by things humans would recognize as characters, you have to iterate over grapheme clusters.

extern crate  unicode_segmentation;
…
let mut start_iter = UnicodeSegmentation::grapheme_indices("foobar", true)
   .map(|(idx,_)| idx)
   .skip(start);
let start = start_iter.next().unwrap();
let end = start_iter.skip(len-1).next().unwrap();
1 Like

No, one can't use skip just as one can't use nth, for the simple reason, that if the "count" to be skipped by is larger than the actual number of characters, then the user won't be issued an error, thus hiding a potential buffer over-run, which although less likely in Rust, will still potentially cause a panic down the road...


Please believe me that I "understand" all this "theoretical correctness" mantra... However the "world" is not as "theoretically correct" as one would like it to be... And in the end one has to deal with actual problems... And -- as unlikely as it might sound -- in all my 15 years of programming I never stumbled upon an issue in which the string I was slicing and dicing contained a "weird character" that was composed of multiple code-points...

Moreover almost all programming languages I have used so far do support such code-point indexed operations:

Thus as astonishing as it may sound, I would consider such a facility to be a "core" one, if not by its theoretical correctness, then at least for its widespread usage.


Nop. I just need to implement an Scheme R7RS compliant builtin function to my interpreter, which requires one to be able to slice and dice a string on a code-point boundary.

But perhaps one day I'll implement a more advanced "unicode library" for my interpreter, and then I'll use the crate pointed by you. Thanks.

Let’s not devolve into a long and heated debate over the definition of “character” and the merits of codepoint-based processing. Needing to conform to implement an existing interface that includes indexing and slicing by code point index is sufficient motivation for indexing and slicing strings by code point index.

However, if that’s what one need to do, then UTF-8 is a pretty bad string representation to do it with. I don’t think an implementation of Scheme, for example, would be well-served with convenient but slow methods for code point based access. Rather, I’d expect such an implementation to store Scheme strings as an array of code points.

Indeed UTF-8 encoding is a bad choice in terms of efficiency, however the standard does clearly state that:

(string-ref string k) It is an error if k is not a valid index of string. The string-ref procedure returns character k of string using zero-origin indexing. There is no requirement for this procedure to execute in constant time.

Therefore if one knows it will do a lot of character-based slicing, one would better use string->vector to obtain a vector of characters. (Not as efficient as a proper string encoded as Vec<char>, but good enough.)

However most of the time the usage of such a builtin would be to obtain a sub-string based on some indices obtained from somewhere...

OK, so you could technically get away with it for Scheme. It's not clear whether it's a good choice though. And for other cases where one is forced to do code point indexing (e.g., implementing another language where there is an expectation of constant-time indexing), even that excuse doesn't apply.

Aside from that, I want to second that even if you use String, the implementation of code point indexing/slicing doesn't have to be anywhere near as complex as your earlier posts claim. In particular, the problem you pointed out here is quite surmountable:

For indexing, nth gives you None if the index is out of bounds. For slicing, you can use chars().skip().take() and check afterwards whether you got the expected number of code points. If you are fine with computing the number of code points beforehand, or are caching it for each Scheme string, you can also manually do bounds checking.

2 Likes

I have one in my last name (in NFD, which macOS uses in filenames), so I keep seeing broken software regularly. They are common in Indic scripts. In the west people tend to care more about flag emoji, which are also "weird".

2 Likes

I also regularly encounter filenames where the katakana ガギグゲゴ, commonly written with a codepoint each, are split into two (combining form, unlike what I’m going to write here for the effect:) カ゛キ゛ク゛ケ゛コ゛. MacOS default filesystem does this for filenames. Slicing at the wrong boundary is going to land you in the bug land.

2 Likes

At least one of iOS’s “this message crashes your phone” bugs was likely due to ObjC code splitting in-between codepoints such that the text layout engine for notifications crashed, due to a trailing incomplete cluster. Of course we don’t know the exact reason but a trailing zero width joiner sounds likely.

Proliferation of small libraries always feels strange, since you’re relying on more people’s code. Especially for something as difficult as string manipulation, it helps you have some authority behind the implementation you use.

But I think the correct direction is for String to remain a byte oriented UTF8 guaranteeing interface, and for external crates to provide windows for more convenient “character” based APIs for whatever definition of character is most fitting to a problem domain.

Text manipulation is hard. Basically the only guaranteed safe thing you can do is treat it as a binary blob: look it up in I18n and display it without modification. However, the real world is messy, and that’s what libraries that are more able to iterate than the standard are great for dealing with.

That said, though, the existing API isn’t that bad for working on codepoint indices. You just have to make the mental shift to using iterators instead of collections.

Yes it is... And it's not for the fact that one has to "wrap" his mind to use iterator operations like let mut inices = string.char_indices(); let start_byte = indices.nth(start).unwrap(); let stop_byte = indices.nth(stop - start).unwrap(), it's for the fact that the previous snippet -- as explained in my original post -- is incorrect, and the "correct" solution is almost a page long (and I'm sure I got some bugs in there...)


I promise this is the last time I make an attempt in "defending" "theoretically incorrect" API's, which when used wrong could lead to obnoxious bugs or even worse to crashes, but let's leave code-point boundaries for a minute and think about the following: I bet there are countless bugs caused by division by zero, however this didn't prompt any programming language (or standard library) from "forbidding" either the number zero or the division operation. Instead the developers have to check and sanitize their inputs, just like a developer that slices a string on a character basis should have in mind.


To put it from another perspective: from what I understand the main issue with code-point slicing is that there are "characters" that are made of multiple code-points, and slicing a string in the middle of a character like that would lead to bugs.

Then if this is the case then I bet such situations are easy to check and guard against, by changing the signature of a "would be" slicing method to str::get_char_boundary(Range) -> Result<str, CharBoundaryError>, where CharBoundaryError would be an enum that has arms like OutOfRange, CharacterIsMultiCodePoint, etc.

In this way we get a "correct" and useful API at the same time.

I previously explained how I think the error checking can be done in a much simpler way. If I missed something and was wrong about that, I'd like to hear the reason.

This isn't really a helpful analogy. Most everybody knows that division by zero is meaningless. But people regularly ask for ways to treat rust strings as an "indexable array of characters", while thinking they should get O(1) performance out of it. So if that's provided, people will undoubtably reach for it without the clear understanding of the tradeoffs that you have.

With that said, I don't see why a non-std crate can't be the solution here.

Please provide the code that correctly and efficiently implements the slicing with the bounds check (and does at most a single pass through the char or char_indices pass). I tried my best to implement it and the posted variant is the shortest and simplest one I found.

Indeed it can be done with a few calls to chars or char_indices, however those require multiple passes through the string.

I find it very curious that you’re hand-wringing over a constant factor here when the operation is fundamentally linear and could easily be constant time with a more appropriate data representation.

Admittedly, there is one tricky part I didn’t fully realize at first: you need the total code point count of the string (or at least up to the end point you’re considering) to distinguish “end point lands on end of string” from “end point is out of bounds”. However, this can be solved without any additional passes through the string by counting code points as they come out of the iterator. Here’s the implementation you requested: https://play.rust-lang.org/?gist=0434fb6f91de5435921569c3ac96f01f&version=stable&mode=debug

I did omit the start..end -> (start, count) conversion, but that’s because it’s trivial, independent of the core problem, and could be done in multiple different ways depending on what you want to do if start > end (e.g. error, empty substring, or swap start and end).

[I wanted to make the following observation in my previous reply, but I didn't think anyone would raise this false issue...]

So yes, indeed from a "theoretical" point of view both O(n) and O(c*n) where c is a known constant are equivalent. However in real life, one takes x seconds to execute, and the other one (if the compiler can't optimize it) roughly c times x seconds.


Let me paste your solution for the record:

fn code_point_slice(s: &str, code_point_start: usize, code_point_count: usize) -> Result<&str, SlicingError> {
    let mut seen_code_points: usize = 0;
    let (byte_start, byte_end) = {
        let mut iter = s
            .char_indices()
            .inspect(|_| seen_code_points += 1)
            .skip(code_point_start)
            .peekable();
        (iter.peek().map(|x| x.0), iter.nth(code_point_count).map(|x| x.0))
    };
    if code_point_start > seen_code_points {
        return Err(SlicingError::StartOutOfRange);
    }
    if code_point_start + code_point_count > seen_code_points {
        return Err(SlicingError::EndOutOfRange);
    }
    let byte_range = byte_start.unwrap_or(s.len()) .. byte_end.unwrap_or(s.len());
    return Ok(&s[byte_range]);
}

Anyone finds this as "a much simpler way" (@hanna-kruppe) or "[...] existing API isn’t that bad [...] just a mental shift [...]" (@CAD97)?

I definitively don't... Thus my proposal to include this into the standard library...

Take issue with framing it solely in terms of asymptotic complexity if you want, but at the end of the day any pass through the string takes x seconds (depending on the length of the string) while slicing a Vec<char> takes a few cycles, which certainly is a much larger gap than a factor of two or three except for the shortest of strings. If performance of this operation is important enough to worry about a factor of two, it would be a complete no brainer to use a different representation. Worrying about tuning the "pass through whole string" approach seems to miss the forest for the trees.

I don't deny it's tricky, so it should definitely be encapsulated in a library function somewhere. It would clearly be insane to open code that in any location where you need to slice a string. But at the same time, I've not seen any reason why it should be in the standard library rather than any other library. Furthermore, even if we all agreed this functionality should be in the standard library, @burntsushi's advice to first prototype it in a library and see how it work in practice would still apply.

Finally, keep in mind that this complexity is pretty much the worst case because you want all of the following:

  • the full feature set of slicing
  • on a representation completely unsuited for it
  • optimized as much as humanly possible

With a more suitable representation like Vec<char> it would be absolutely trivial. A slightly less efficient implementation, such as allowing yourself to count the code points separately is much simpler, comparable to byte-based slicing. Less general operations (e.g., only drop the first N code points) can also be much simpler, especially if they don't require much error checking (e.g., you only need to truncate to K code points if it's longer, and leave the string alone if it's shorter).

3 Likes

I really do agree that Vec<char> would be a better solution if one has to fiddle a lot with the characters in a string. However there are cases when one just needs a sub-string (for example as an argument to another function, and for reasons which we all agree might lead to buggy behavior). Therefore converting the string to a vector, slicing it, and then converting it back to a string is just painfully inefficient in so many ways...

For example I stumbled upon this use-case when I had to implement the (write-string <string> <port> <string-start> <string-end>) Scheme R7RS function. Thus I don't need to fiddle with the string, I just need to slice it. (Granted that internally I could have made so that my port trait should have a method that accepts character iterators, however at the moment -- due to simplicity -- it accepts only &str and &[u8]...)

I think that useful, but rarely functionality like this should live inside a crate. Otherwise the standard library just gets large in size and also difficult to browse. If you ask me, the library should only contain things that are either common, require standardization or cannot be implemented outside of it. This seems not to be the case here.

Having a bloated standard library is even worse. Also, having many crates as dependencies isn't a downside at all. Small focused crates are the way to go. They are easy to maintain and they can be easily swapped out if needed. BTW 20 crates are nothing. I've a Node.JS project with 1273 packages and npm installs them in 16s. A lot of modules in a few packages or a few modules in a lot of packages makes no difference to the module count. The only difference is that the former is hard to maintain. Maintaining a small package is much easier. And don't hate on "left-pad". It's a perfect example why small packages/crates are the way to go: The package is well tested and it has benchmarks that ensure that it is the fastest way to pad a string to a specified length in JavaScript.

So I guess you haven't done lately any software license compliance audits... :slight_smile: