`str` method for slicing code-point (i.e. `char`) ranges

MajorBreakfast · May 25, 2018, 12:48pm

In JavaScript-land there's a package for that as well, called "legally". It's a license-checking tool. It gives you a list of all the different licenses in use, which usually amount to just a handful.

gbutler · May 25, 2018, 12:51pm

Could this problem be better addressed by making the String and &str data types generic over encoding? For example, what if you could do:

let x = String::<UTF16>::new();
let y = String::<UTF32>::new();
...
etc

If this were possible, and &str were similarly generic, could this solve this problem in the best, most efficient manner? Would this be a possible way forward?
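For illustration only, here is a rough sketch of what such an encoding-generic string might look like as a library type. Everything in it is hypothetical: `Encoding`, `Utf16`, `Utf32`, and `EncodedString` are made-up names, not std types, and a real design would also need a borrowed counterpart to `&str`.

```rust
use std::marker::PhantomData;

/// Hypothetical trait describing an encoding; not part of std.
trait Encoding {
    /// Code unit stored in the backing buffer.
    type Unit: Copy;
    /// Encode one `char`, appending its code units to `out`.
    fn encode(c: char, out: &mut Vec<Self::Unit>);
}

/// Hypothetical marker types for the two encodings.
struct Utf16;
struct Utf32;

impl Encoding for Utf16 {
    type Unit = u16;
    fn encode(c: char, out: &mut Vec<u16>) {
        // A char needs one or two UTF-16 code units.
        let mut buf = [0u16; 2];
        out.extend_from_slice(c.encode_utf16(&mut buf));
    }
}

impl Encoding for Utf32 {
    type Unit = u32;
    fn encode(c: char, out: &mut Vec<u32>) {
        // Exactly one code unit per code point.
        out.push(c as u32);
    }
}

/// Hypothetical owned string, generic over its encoding.
struct EncodedString<E: Encoding> {
    units: Vec<E::Unit>,
    _encoding: PhantomData<E>,
}

impl<E: Encoding> EncodedString<E> {
    fn new() -> Self {
        EncodedString { units: Vec::new(), _encoding: PhantomData }
    }

    fn push(&mut self, c: char) {
        E::encode(c, &mut self.units);
    }

    /// Number of code units, which is not necessarily the number of code points.
    fn len_units(&self) -> usize {
        self.units.len()
    }
}
```

Note that in a sketch like this only the UTF-32 variant would give constant-time code-point indexing; UTF-16 still has surrogate pairs, so code-point slicing there would remain a linear scan.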

comex · May 25, 2018, 1:38pm

Thing is, getting substrings is an operation one often performs in a loop – e.g. in order to split a string by a delimiter. If extracting the substring takes linear time, and the number of substrings you extract depends on the length of the string, then you end up quadratic, which can be really painful.
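To make the cost concrete, here is a minimal sketch of what slicing a UTF-8 `&str` by code-point range has to do. The helper `slice_chars` is hypothetical, not a std method; the point is only that it must walk the string from the start to translate code-point positions into byte offsets.

```rust
/// Hypothetical helper: slice by the code-point range [start, end).
/// Not part of std. Each call scans from the beginning of `s`,
/// so its cost is linear in `end`.
fn slice_chars(s: &str, start: usize, end: usize) -> Option<&str> {
    if start > end {
        return None;
    }
    // Byte offsets of every char boundary, including the end of the string.
    let mut boundaries = s
        .char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(s.len()));
    let byte_start = boundaries.nth(start)?;
    let byte_end = if end == start {
        byte_start
    } else {
        // `nth` is relative to what is left of the iterator.
        boundaries.nth(end - start - 1)?
    };
    Some(&s[byte_start..byte_end])
}
```

Calling something like this once per token, with positions that grow along with the input, rescans the prefix every time, which is where the quadratic behaviour comes from.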

Anyway, in your average code, if you just "need a substring" (and are dealing with UTF-8 encoded strings), you should be using byte indices instead of codepoint indices. After all, they're both equally useless if you need a semantic notion of "character", so you may as well go with the metric that's more efficient. A Scheme interpreter may be a special case due to the requirements of the Scheme standard, but, well, special cases are special and don't justify adding easily-misused functionality to the standard library. And even then, you'd be better off using a non-UTF-8 string type throughout your interpreter (not just converting at the boundaries as you're suggesting).
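As a sketch of the byte-index approach, the function below splits on a delimiter using only the byte offsets that `str::find` returns. The name `split_by_bytes` is made up for this example; in real code `str::split` already does the same job.

```rust
/// Split `s` on `delim` using byte offsets only.
/// `str::find` returns a byte index that lies on a char boundary,
/// so each slice below is O(1) and cannot panic.
fn split_by_bytes(s: &str, delim: char) -> Vec<&str> {
    let mut parts = Vec::new();
    let mut rest = s;
    while let Some(i) = rest.find(delim) {
        parts.push(&rest[..i]);
        // Skip the delimiter itself, which may be several bytes long.
        rest = &rest[i + delim.len_utf8()..];
    }
    parts.push(rest);
    parts
}
```

Each byte of the input is scanned once, so the whole split is linear no matter how many pieces it produces.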

system · March 25, 2019, 8:30am

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.
