`str` method for slicing code-point (i.e. `char`) ranges

In JavaScript-land there's a package for that as well, called "legally". It's a license-checking tool. It lists all the different licenses in use, which usually amount to just a handful.

Could this problem be better addressed by making the `String` and `&str` data types generic over encoding? For example, what if you could do:

let x = String::<UTF16>::new();
let y = String::<UTF32>::new();
...
etc

If this were possible, and `&str` were similarly generic, could this solve the problem in the best, most efficient manner? Would this be a possible way forward?

The thing is, getting substrings is an operation one often performs in a loop, e.g. in order to split a string by a delimiter. If extracting each substring takes linear time, and the number of substrings you extract grows with the length of the string, then you end up with quadratic time overall, which can be really painful.
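To make the cost concrete, here is a rough sketch of what slicing by code-point range has to do on a UTF-8 `&str`. The name `slice_chars` is hypothetical (it is not a standard library method), and this is just one way to write it: the point is that finding the byte offset of the n-th `char` means walking the string from the start.

```rust
/// Hypothetical helper (not in std): returns the substring covering the
/// code points in `start..end`. Out-of-range indices are clamped to the end.
fn slice_chars(s: &str, start: usize, end: usize) -> &str {
    if start >= end {
        return "";
    }
    // Byte offset of the n-th code point, or `s.len()` if n is past the end.
    // `char_indices().nth(n)` scans from the beginning, so each call is O(n).
    let byte_at = |n: usize| s.char_indices().nth(n).map_or(s.len(), |(i, _)| i);
    let from = byte_at(start);
    let to = byte_at(end);
    // Both offsets come from `char_indices`, so they sit on char boundaries.
    &s[from..to]
}
```

Calling something like `slice_chars(text, i, i + 1)` for every `i` in a loop repeats that scan each time, which is where the quadratic behaviour comes from.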

Anyway, in your average code, if you just "need a substring" (and are dealing with UTF-8 encoded strings), you should be using byte indices instead of codepoint indices. After all, they're both equally useless if you need a semantic notion of "character", so you may as well go with the metric that's more efficient. A Scheme interpreter may be a special case due to the requirements of the Scheme standard, but, well, special cases are special and don't justify adding easily-misused functionality to the standard library. And even then, you'd be better off using a non-UTF-8 string type throughout your interpreter (not just converting at the boundaries as you're suggesting).
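For comparison, a minimal sketch of the byte-index approach, assuming a non-empty delimiter. The function name `split_by_bytes` is made up for illustration; in real code `str::split` already does this job. `str::find` returns a byte offset that is guaranteed to lie on a char boundary, so each slice is a constant-time operation and the whole loop only scans the string once.

```rust
/// Illustrative only: splits `s` on `delim` using byte offsets.
/// Panics if `delim` is empty, since `find("")` would match forever.
fn split_by_bytes<'a>(s: &'a str, delim: &str) -> Vec<&'a str> {
    assert!(!delim.is_empty(), "delimiter must not be empty");
    let mut parts = Vec::new();
    let mut rest = s;
    // `find` yields the byte offset of the next match within `rest`.
    while let Some(i) = rest.find(delim) {
        // Slicing by byte range is O(1): no re-scanning from the start.
        parts.push(&rest[..i]);
        rest = &rest[i + delim.len()..];
    }
    parts.push(rest);
    parts
}
```

The byte offsets never need to be translated into code-point counts, so non-ASCII input such as `"naïve,café"` is handled without any extra work.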
