Proposal
As the title says, I would like to ask the Rust standard library developers to add a “simpler” way to obtain code-point (i.e. char) ranges, for example:
let string = "some-string-containing-UTF-8-characters-not-only-ASCII";
let substring = string.get_char_range(5..7).unwrap();
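To make the request concrete, here is a minimal sketch of how such a method could behave, written as an extension trait. The trait name CharRangeExt is hypothetical, get_char_range is the proposed method and not an existing standard-library API, and returning an Option is an assumption taken from the unwrap in the example above.

```rust
use std::iter;
use std::ops::Range;

/// Hypothetical extension trait; `get_char_range` is NOT a standard-library method.
pub trait CharRangeExt {
    /// Returns the sub-slice covering the code points in `range`,
    /// or `None` if the range does not fit inside the string.
    fn get_char_range(&self, range: Range<usize>) -> Option<&str>;
}

impl CharRangeExt for str {
    fn get_char_range(&self, range: Range<usize>) -> Option<&str> {
        if range.start > range.end {
            return None;
        }
        // Byte offset of every code point, followed by the end of the string,
        // so that an index equal to the number of code points is accepted too.
        let mut boundaries = self
            .char_indices()
            .map(|(byte_index, _)| byte_index)
            .chain(iter::once(self.len()));
        let byte_start = boundaries.nth(range.start)?;
        let byte_end = if range.end == range.start {
            byte_start
        } else {
            // The `nth` call above already consumed `range.start + 1` items.
            boundaries.nth(range.end - range.start - 1)?
        };
        self.get(byte_start..byte_end)
    }
}
```

This still walks the string once, so it is O(n) in the number of code points; the point is only that the boundary handling is written once instead of by every caller.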
Previous discussions (with negative feedback)
I know that this has been discussed at length before in the following threads:
- Make manipulating strings less painful
- (please point to others to add them here for reference);
Moreover, I understand that this operation requires scanning the string, so it is CPU intensive and won’t ever be O(1). I even understand that, given the complexities of Unicode, a “character” can mean many things… However, in this thread I’m referring just to “code-points” as returned by str::chars.
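As a small illustration of the distinction, the sketch below compares the byte length of a string with the number of code points that str::chars yields. The helper name byte_and_char_counts is made up for this example.

```rust
/// Returns (number of bytes, number of code points) for a string.
/// The function name is hypothetical; it only wraps `str::len` and `str::chars`.
fn byte_and_char_counts(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}
```

For instance, "\u{e9}" (a precomposed é) is 2 bytes and 1 code point, while "e\u{301}" (e followed by a combining acute accent) is 3 bytes and 2 code points, even though both render as one visible character. Only the code-point count matters for this proposal.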
Use-case
However, contrary to previous discussions, there are valid use-cases for such a slicing method, although uncommon.
For example, I’m writing a Scheme interpreter in Rust which, in order to follow the Scheme R7RS standard, requires “character”-based indexing. (We can debate whether that is wise or not, but the R7RS standard is a given and a compliant implementation has to abide by it, so I have no “choice” there… And I bet there are other similar examples of users implementing some “required” interface.)
Naive implementation – a “gotcha” in disguise
So my first try was to use str::char_indices:
let string = ...;
let (char_start, char_end) = ...;
let (byte_start, byte_end) = {
    let mut indices = string.char_indices();
    let (byte_start, _) = indices.nth(char_start).unwrap();
    // `nth` above already consumed `char_start + 1` items, so skip one fewer here;
    // this assumes `char_start < char_end`.
    let (byte_end, _) = indices.nth(char_end - char_start - 1).unwrap();
    (byte_start, byte_end)
};
let substring = &string[byte_start .. byte_end];
However there are a couple of problems… (besides the unwrap):
- one has to treat as a special case char_start or char_end (or both) being equal to the number of characters in the string (i.e. the resulting slice is always an empty slice);
- in order to detect whether char_start or char_end is outside the “character” boundary of the string, one has to enhance that implementation as seen below;
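The first problem can be shown in isolation with a short sketch. Both function names below are hypothetical; the second one uses one possible workaround, appending the string length as a final boundary, which is not necessarily how a standard-library method would do it.

```rust
use std::iter;

/// Byte offset of the code point at `char_index`, using only `char_indices`.
/// Returns `None` for the one-past-the-end index, even though `s.len()`
/// would be a perfectly valid slice boundary there.
fn naive_byte_offset(s: &str, char_index: usize) -> Option<usize> {
    s.char_indices()
        .nth(char_index)
        .map(|(byte_index, _)| byte_index)
}

/// Same lookup, but the end of the string counts as a final boundary,
/// so only indices strictly greater than the code-point count yield `None`.
fn byte_offset_or_end(s: &str, char_index: usize) -> Option<usize> {
    s.char_indices()
        .map(|(byte_index, _)| byte_index)
        .chain(iter::once(s.len()))
        .nth(char_index)
}
```

With the three-code-point string "a\u{f1}b" (4 bytes), index 3 gives None from the naive version but Some(4) from the second one, and index 4 gives None from both.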
“Complete” solution – A.K.A. “I can’t believe I have to write this in 2018…”
let (range_start, range_end) = {
    let mut indices = string.char_indices () .enumerate ();
    let mut byte_range_start = 0;
    let mut byte_range_end;
    let mut character_count = 0;
    loop {
        let (character_index, byte_index, reached_end) = match indices.next () {
            Some ((character_index, (byte_index, _))) => {
                character_count = character_index + 1;
                (character_index, byte_index, false)
            },
            // Past the last character: the only valid index left is the
            // character count, whose byte offset is the length of the string.
            None =>
                (character_count, string.len (), true),
        };
        if character_index == range_start {
            byte_range_start = byte_index;
        }
        if character_index == range_end {
            byte_range_end = byte_index;
            break;
        }
        if reached_end {
            // `fail!` and `try_some!` are not standard macros; they are assumed
            // to be the author's own error helpers and are not defined here.
            fail! (0x22393af0);
        }
    }
    (byte_range_start, byte_range_end)
};
let substring = try_some! (string.get (range_start .. range_end), 0x5c4c5d20);
The “use a crate!” road
I do love Rust for its “no-batteries-included” approach (as opposed to other languages, say Go), in which the user is provided by default with just the “bare minimum” functionality to get started, and for others to build upon.
However, in today’s world, strings (especially Unicode ones) are everywhere, and I feel that leaving out of the standard library the tools to easily manipulate strings on a “character” boundary is similar to not having a vector or hash-map data-structure…
One could always use such an external crate “that only handles these string-character-based operations”, but then once the user’s project has reached a certain development point, and enough dependencies have gathered, one starts to wonder: “why do I need these 20 crates (that is 75% of my dependencies) that provide only 0.1% of my overall functionality? do I still need this left_pad crate that has a single function called left_pad?”…
Closing words
And related to this topic, and to my Scheme interpreter use-case, a few other pieces of functionality for character-based addressing of strings are missing:
- str::get_at(char_index : usize) -> (Option<char>) – which could be a wrapper for str::chars().nth(char_index);
- str::get_byte_boundary(char_index : usize) -> (Option<usize>) – which could be a wrapper for str::char_indices().nth(char_index), keeping only the byte offset;
- the reverse of the above, str::get_char_boundary(byte_index : usize) -> (Option<usize>) – which could be a wrapper for str::char_indices().position(|(offset, _)| offset == byte_index);
- similar methods for getting the range start and end in one call;
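The three lookups in this list could be sketched as free functions like the ones below. The names mirror the proposed methods, which do not exist in the standard library; only chars, char_indices, nth and position are real APIs here.

```rust
/// Code point at `char_index`, if any (proposed `str::get_at`).
fn get_at(s: &str, char_index: usize) -> Option<char> {
    s.chars().nth(char_index)
}

/// Byte offset at which the code point `char_index` starts
/// (proposed `str::get_byte_boundary`).
fn get_byte_boundary(s: &str, char_index: usize) -> Option<usize> {
    s.char_indices()
        .nth(char_index)
        .map(|(byte_index, _)| byte_index)
}

/// Index of the code point that starts at `byte_index`, or `None` if that
/// byte offset is not the start of a code point (proposed `str::get_char_boundary`).
fn get_char_boundary(s: &str, byte_index: usize) -> Option<usize> {
    s.char_indices()
        .position(|(offset, _)| offset == byte_index)
}
```

As written, none of these accept the one-past-the-end position (the code-point count, or s.len() as a byte offset); whether they should is a design choice the proposal would have to settle.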
As a bonus, one could hope that most of the str methods that take byte offsets would have a character offset counterpart.