The `str` type can only be sliced directly with octet (byte) positions (`&s[1..3]`, for example), and those positions must fall on character boundaries or the slice panics. The nearest thing to slicing with code point positions is something like `s.chars().collect::<Vec<char>>()[1..3].iter().collect::<String>()`. In my language (which is not yet buildable, sorry!) I thought of indexing with either code point indexes (plain integers) or immutable `StringIndex` objects, while keeping a UTF-8 encoding. Here's more information:
https://violetscript.github.io/docs/language_reference/types/string_type.html
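For comparison, here is a minimal sketch of what code point slicing looks like in today's Rust without allocating a `Vec<char>`. The helper name `slice_code_points` is made up for this post; it is not a standard library function.

```rust
/// Returns the substring covering code points `start..end` (end exclusive),
/// or `None` if the range is reversed or runs past the end of the string.
/// Hypothetical helper for illustration; not part of std.
fn slice_code_points(s: &str, start: usize, end: usize) -> Option<&str> {
    if start > end {
        return None;
    }
    // Byte offset of every char boundary, plus the end of the string.
    let mut boundaries = s
        .char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(s.len()));
    // Walk to the start boundary, then continue on to the end boundary.
    let from = boundaries.nth(start)?;
    let to = if end == start {
        from
    } else {
        boundaries.nth(end - start - 1)?
    };
    Some(&s[from..to])
}
```

With `let s = "\u{10F01}ab";`, `slice_code_points(s, 1, 2)` gives `Some("a")`, while `s.get(1..3)` gives `None` because octet 1 is in the middle of the first character. Both lookups scan from the start of the string, which is the cost that integer indexes carry.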
So in my language one will be able to write `'\u{10F01}a'.substr(1, 1) == 'a'` and `'\u{10F01}'.charCodeAt(0) == 0x10F01` (indexes are zero-based); and yes, the string will still be encoded in UTF-8. For large strings and large indexes, passing integers to methods like `substr()` can be inefficient, since a code point position has to be located by scanning the UTF-8 data. But if you get a `StringIndex` object from an operation like `indexOf()`, from a character iterator, or from a RegExp capture, these operations are efficient. `StringIndex` contains `{default:Int, utf8:Int}`, where `default` is the index in code points and `utf8` is the index in UTF-8 octets. It keeps both for different purposes, but `utf8` is always the one used for indexing.
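To make the idea concrete for Rust readers, here is a rough sketch of how such an index could look as a library type. Everything in it (`StringIndex` as a Rust struct, `index_of`, `substr_from`) is an assumption for illustration, not my language's actual implementation and not an existing Rust API. The `default` field is called `code_point` here only for readability.

```rust
/// Hypothetical Rust analogue of the StringIndex idea; not a std type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct StringIndex {
    /// Position counted in code points (the `default` field in the post).
    code_point: usize,
    /// Position counted in UTF-8 octets; this is what slicing uses.
    utf8: usize,
}

/// Like `str::find`, but also reports the code point position of the match.
fn index_of(haystack: &str, needle: &str) -> Option<StringIndex> {
    let utf8 = haystack.find(needle)?;
    // Count the code points before the match once, while we are here.
    let code_point = haystack[..utf8].chars().count();
    Some(StringIndex { code_point, utf8 })
}

/// Takes `len` code points starting at `start`. Only the requested part is
/// scanned, not the prefix before `start`. Panics if `start.utf8` is not a
/// char boundary of `s`, e.g. when the index came from a different string.
fn substr_from(s: &str, start: StringIndex, len: usize) -> &str {
    let rest = &s[start.utf8..];
    // Octet offset just past `len` code points, or the end of the string.
    let end = rest.char_indices().nth(len).map_or(rest.len(), |(i, _)| i);
    &rest[..end]
}
```

For `"\u{10F01}a\u{E9}b"`, `index_of(s, "a")` yields `StringIndex { code_point: 1, utf8: 4 }`, and `substr_from(s, idx, 2)` returns `"a\u{E9}"` without rescanning the four octets before the match.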
This is flexible in that you can work with code point positions directly, which is handy when the string is short or when you are prototyping.
Have you considered something similar in Rust? What if it were part of the standard library?