What if strings were Code Point aware?

The str type can only be directly sliced with octet positions (s[1..3], for example). The nearest to slicing with Code Point positions is something like s.chars().collect::<Vec<char>>()[1..3].to_string();. In my language โ€• which is not yet buildable, sorry! โ€• I thought of the approach of indexing with either Code Points indexes (simply integers) or immutable StringIndex objects, while maintaining an UTF-8 encoding. Here's more information:

https://violetscript.github.io/docs/language_reference/types/string_type.html

So in my language one will be able to do '\u{10F01}a'.substr(1, 1) == 'a' and '\u{10F01}'.charCodeAt(0) == 0x10F01 (indexes are zero-based); and, yes, it'll still be encoded in UTF-8. For large strings and large indexes, specifying integers to these methods like substr() can be inefficient, but if you get a StringIndex object from an operation like indexOf(), from a character iterator or from a RegExp capture, these operations are efficient. StringIndex contains {default:Int, utf8:Int}, where default is the index in Code Points, and utf8 is the index in UTF-8 octets. StringIndex contains these 2 for different purposes, but utf8 is always used for indexing.

This is flexible in that you can manipulate Code Points directly. It can be handy when the string is short or it is useful for prototyping purposes.

Have you considered something similiar in Rust? What if it were part of the standard library?

Two thoughts:

  • You can use char_indices() to get the offset of each codepoint, allowing you to slice into the string without having to do an expensive collect().
  • Operating on code points can be useful at times but in my experience what people usually want to be operating on are graphemes, not code points. IMO Rust has sufficient API coverage to support interacting with code points, and I would not encourage additional APIs to simplify this given that code points probably aren't what people want to be working with anyway.
18 Likes

As a member of libs-api, I can just put the kabash on this now: it will almost certainly never happen. And that it hasn't happen is no mistake and very intentional.

I strongly urge you to spend more time on the high level use cases here and what problems you're trying to solve.

As someone who has a lot of experience working with and implementing strings, I can say that working with codepoints directly is usually the wrong thing to do in some manner or other, unless you're working on the implementation of Unicode algorithms or otherwise doing something rather niche.

To get the heart of the matter, you'll want to spend more time writing about use cases than implementation details.

13 Likes

Indeed it'd be more useful if I were exploring use cases. Still, in my language I'd like to take this approach because I'm inspired by Python and also by EcmaScript 4 Draft Reference Implementation (where they're either Code Point or Scalar Value based).

It's more an aesthetic detail than an use-case.

I was responding to the bit about adding them to the standard library. Given that I don't know your language, its design goals or what problems you're trying to solve, I don't have much to add. If you're going for zero cost integration with Python, indeed, it might be appropriate to adopt its implementation design.

I don't know what you mean by aesthetic choice though. It seems like a category error to me, and I don't see any discussion of its aesthetics here. So I'll have to bow out of that one. But the idea of implementing a string type without any care about use cases seems like a red flag to me. I'm completely unfamiliar with that development style.

Well, I think it's going a bit off-topic, but there's a time since I'm developing this language. The goal is to be multi-purpose, but unfortunately I mostly spent time with parser and symbol solving rather than pratical usage, so I never got concrete on built-in objects (sometimes I did reach at least the bytecode generation, but I'm reworking in its verifier).

This forum's purpose is the discussion of development and implementation of the Rust programming language, so other languages are on-topic only insomuch as case studies in the choices others have made and the outcome of that. So if there's any part of the OP that's about the design of your language, that's not going to be touched on here.

For comparison purposes, Swift also actually has a separate StringIndex type for its strings, which help manage the fact that Swift actually provides a relatively high-level view of strings as a sequence of graphemes as part of the language.

The main problem with any sort of opaque index is twofold: a) indexing still has to be checked anyway, since there's no guarantee that an index is used on the same string that created it (and you don't really want that requirement anyway, as doing something at the same position in multiple strings is a reasonably common example string manipulation task); and b) after a string is changed, that changes what indices are "good" as the string representation shifts to accommodate the edit. (It's just a memcpy, but it still invalidates "known good" indices.)

Bytewise indexing offers the full benefit without any tangible costs. You can treat indices as opaque and get fast reindexing just fine. You can still count/search/slice by codepoints (though doing so is extremely rarely what you want[1]) even if it takes a bit more effort to do so.

The cost is that the developer not just treating indices as opaque needs to understand that String uses a variable length encoding. But papering over UTF-8 and presenting a [char]-like view doesn't remove that โ€” Unicode is still a variable-length encoding at the codepoint level because of grapheme clusters.

It's very unfortunate that string manipulation is such a common task to use as a beginner task, because text manipulation is anything but. Rust's String API is the best design I've personally seen which doesn't try to artificially simplify text by pretending some of the complexities don't exist.

This isn't to put down other stdlib String types, though โ€” just as an example, Java's String class API was designed when people legitimately thought UCS-2 was going to be the final form of Unicode. It just unfortunately wasn't a bet that paid off and UCS-2 isn't a thing that formally exists anymore.

If designing a new stdlib from scratch, I'd crib Rust's String practically wholesale (though maybe rename it StrBuf). The main difference is I'd consider making a higher level and more opaque Text type my default type, which treats text more like a localization system would; mostly opaque with only a limited amount of manipulation without dropping down an abstraction level.


  1. Unless you're specifically implementing a Unicode Algorithm and thus in that case hopefully understand the subtleties as a prerequisite. โ†ฉ๏ธŽ

10 Likes

I agree in different ways;

but in my language the assumption would be that StringIndex is used in the same string it was extracted from, so the only check would be decoding the code point at the UTF-8 index. About mutable strings, my language doesn't have these. And even if there was a mutable String type, the StringIndex is an assumption.

And for memory-efficiency reasons, StringIndex only contains 2 fields (the ones I mentioned earlier, default and utf8) and then the programmer uses assumptions. I don't know about Swift's one...

About graphemes, I must tell I've not much idea of what they are, maybe it's because English isn't my primary language. But by graphemes you mean a speaking-language-specific sequence that has to be decoded, right?

The "graphemes" being talked about are a Unicode concept. I would highly recommend reading up on the Unicode specification for your own language. E.g. see Annex #29.

8 Likes

Rust has that too; it's called "\u{10F01}".chars().nth(0) == '\u{10F01}'.

But it's almost never what you actually want, as other posts here have said.

If you're making a new language, rather than proposing a change to Rust, you might have better luck at somewhere like Programming Languages (reddit.com).

Hm... couldn't an opaque index remember from which string it came and skip the bounds check if it's used to index the same string, while still being usable to index other strings (with bounds checking)?

Isn't this iterator invalidation, or at least amenable to the same solution (lifetimes)?

The string could have changed. How do you make sure the string is still in its original state? If the answer is this:

Then if you link indexes to lifetimes why not use reference directly?

2 Likes

For some background reading on why code point indexing is usually not that useful, and languages relying on it instead of byte indexing (like Python 3) is often seen as a mistake: https://utf8everywhere.org/#myth.strlen

For someone implementing a new language, I would strongly recommend reading the entire document, as it analyses many API pitfalls around string encoding. (FWIW, it seems like whoever designed Rust's String did read the document, as Rust seems to get all of the points right.)

8 Likes

I was imagining something like struct StringIndex<'a> { origin: &'a str, byte_index: usize }. If "use reference" means not that (a struct containing a reference) but simply references &u8 to bytes of the string, the problem is

which, I think, would require communicating the origin somehow and computing the byte_index from the &u8 and the origin.

I've written about this before in Let's Stop Ascribing Meaning to Code Points: codepoints are not as useful an abstraction as people think they are, and Rust's choice to expose them through iteration APIs is a sound one since almost every useful thing you can do with them depends on context.

Almost any time someone wants to directly access codepoints and they're not implementing a Unicode algorithm the question I want to ask is: what do you actually want to do? Because 99% of such approaches are flawed when you start thinking about other writing systems.

18 Likes

I rather should not have created this topic, but it was here that I heard of graphemes... I think that people into English only aren't used to hear of graphemes and often think that characters can only be code points. But whether a language allows direct code point access or not doesn't matter much, I think. If one language allows, the user has to be aware of the cost. In case, in Rust, if you're dealing with code points, there is a difference from my language in that you're always aware about the cost of indexing a string with the language's built-in objects. In my language one has to be aware that large indexes that aren't StringIndex might be inefficient.

Take a look at how ghost_cell does it as an example.

There are multiple layers of complexity here:

  • Backwards-compatibility with 1990's programming languages is one thing. Early UCS-2 covered fewer languages, and wasn't as complex, so "code point == character" assumption was close enough. If you have to have zero-cost interoperability with Python, Java, or JS, you have to inherit their technical debt of exposing UCS-2 assumptions which are no longer true in modern Unicode. This gives you a particular sort-of-buggy flavor of UTF-16 which can't be interoperable at zero cost with UTF-8.

  • Unicode chose ability to round-trip every other encoding, so it inherited quirks for every other encoding (and added its own too). It has decomposed forms and combining marks (one character "รค" can be made from two separate code points "a" and "ยจ", or it can be one code point, it depends), code points for ligatures (one codepoint can mean multiple letters), control code points that set rendering style, and emoji that have things like code points for skin tone, flags made out of special country-code "letter" code points, and arbitrary emoji that can be composed from sequences of codepoints. All of these totally break "codepoint = character" assumption, and even in applications for only English speakers you'll get bug reports about broken emoji :slight_smile:

  • Human languages are complex, and Unicode inherits this complexity. It needs to support different writing directions and mixing them, it has various approaches to context-dependent letter shapes, and all kinds of quirks of whatever writing systems they've managed to capture. Not all languages even have a clear concept of a letter/character like latin (e.g. Korean has jamo that seem like letters, but groups them in blocks that are important, and Unicode has codepoints for both. Arabic doesn't have letters for vowels, only modifies consonants, etc.)

This gets even messier if you need to edit text, sort it, split it into words, etc. Unicode has specialized algorithms for these, but proper Unicode handling requires substantial amount of code and supporting data, which is too heavy for Rust's standard library.

In all of these cases, code points are not the correct level of abstraction, and are awfully misleading, because they may seem fine in basic cases. I think Rust made a terrible mistake naming Unicode Scalar Value a char.

20 Likes

Indexing by code point (that is what char is in Rust) is almost always a mistake, but it would be useful to have a string type that allowed for O(1) index on grapheme clusters. it would be something like

struct IndexedString {
    normalized_str: String,
    indices: Vec<usize>,
}

where the String is guaranteed to be passed through some Unicode normalization pass

but I'm not sure if this would be appropriate for the stdlib. for one thing, we would need a type for grapheme clusters too (mystr[i] would return a grapheme cluster)

3 Likes

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.