Wild idea: deprecating APIs that conflate str and [u8]

It can be more memory efficient than UTF-8 when dealing with e.g. text in East Asian languages, where the majority of characters take 3 bytes in UTF-8 but only 2 bytes in UTF-16.
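For a rough concrete comparison, here is a quick sketch (the sample text is arbitrary):

// Compare the encoded size of a short CJK string in UTF-8 and UTF-16.
fn main() {
    let s = "吾輩は猫である"; // 7 CJK codepoints
    let utf8_bytes = s.len();                       // 3 bytes per codepoint here
    let utf16_bytes = s.encode_utf16().count() * 2; // 2 bytes per codepoint here
    println!("UTF-8: {utf8_bytes} B, UTF-16: {utf16_bytes} B"); // 21 B vs 14 B
}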

Depending on what you consider a “character”, they aren’t in UTF-32 either. Sure, codepoints are, but when handling Unicode properly, you quickly have to deal with combining characters / diacritical marks, too.

If you construct an artificial example, yes, but:

  • most interchange text has markup as well, which typically falls in the ASCII range, and
  • if you're at the scale where this is actually meaningfully noticeable, you should probably be using a streaming general-purpose compression algorithm, which will definitely do better (and about equally well on UTF-8 and UTF-16).

If anyone here hasn't already, I'd suggest reading the UTF-8 Everywhere manifesto, which makes a great argument for why UTF-8 should be the default text encoding in the absence of specific (typically legacy) design constraints.

The Chinese translation of this manifesto takes 58.8 KiB in UTF-16, and only 51.7 KiB in UTF-8.

In overview:

  • In most cases, text is treated as a mostly-opaque blob that you read from one location and pass along to another API.
  • In the cases where it isn't completely opaque, command characters typically fall into the ASCII range.
  • Because (transitively) codepoint ≠ grapheme ≠ glyph ≠ user-perceived character ≠ user-perceived character (in a different locale), there is no such thing as a constant-time text algorithm.
8 Likes

Honestly I think encoding is beside the point. Rust uses UTF-8. That's fine. What's not so great is the API and the terminology it uses.

Sometimes a string is a UTF-8 byte buffer. Sometimes a string is a Unicode string (in the abstract sense). Which is which depends on the context. And this distinction is often not very clear to new users, despite heroic efforts by the book and documentation.

1 Like

Folks, this thread isn't about the relative merits of utf8 vs utf16.

9 Likes

Naive question: what practical difference is there between the two?

1 Like

I'd say it is both: It is a Unicode string, represented as a byte buffer. Its methods operate on char when the operation is about single elements of the string:

  • push, pop
  • insert, remove, retain

Most other methods operate on byte ranges. I think this is the correct and most efficient thing to do. The only exception is drain, which returns an Iterator<Item = char> but shouldn't IMO.
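For illustration, here is a quick sketch of how these APIs mix char-based and byte-index-based parameters (the string contents are arbitrary):

fn main() {
    let mut s = String::from("héllo");

    // char-based element operations
    s.push('!');                    // append one char
    assert_eq!(s.pop(), Some('!')); // remove and return the last char
    s.insert(0, '¡');               // char inserted at a byte index (must be a char boundary)
    s.remove(0);                    // remove the char starting at that byte index

    // byte-range-based operations
    assert_eq!(&s[0..3], "hé");     // slicing uses byte indices; 'é' is 2 bytes

    // drain takes a byte range but yields chars
    let drained: String = s.drain(0..3).collect();
    assert_eq!(drained, "hé");
    assert_eq!(s, "llo");
}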

Would you perhaps elaborate on what drain should do instead, in your opinion? Or should it not exist at all?

@steffahn I mentioned it in this comment (sorry I was too lazy to link it).

2 Likes

I imagine APIs which provide substring indexing on Unicode strings provide codepoint-based indices if they're meant to be used across languages and libraries which may not be consistent in their preferred byte encoding. To provide a concrete example, the Twitter API provides codepoint indices when referring to the position of hashtags or URLs in a tweet. In egg-mode, I convert these into byte-based indices when providing the strings back to Rust code. I do this because Rust uses byte-based indexing, and I didn't want users of my library to try to slice a string with codepoint indices.
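As a rough illustration of that conversion (a sketch, not egg-mode's actual code), char_indices makes it straightforward:

// Convert a codepoint index (as delivered by e.g. the Twitter API) into a
// byte index that can be used to slice a Rust &str.
fn codepoint_to_byte_index(s: &str, cp_index: usize) -> Option<usize> {
    s.char_indices()
        .map(|(byte_idx, _)| byte_idx)
        .chain(std::iter::once(s.len())) // allow an index one past the last char
        .nth(cp_index)
}

fn main() {
    let tweet = "漢字 #rust";
    // Codepoints 3..8 cover "#rust"; the corresponding byte range is 7..12.
    let start = codepoint_to_byte_index(tweet, 3).unwrap();
    let end = codepoint_to_byte_index(tweet, 8).unwrap();
    assert_eq!(&tweet[start..end], "#rust");
}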

2 Likes

Grapheme cluster indexing comes up all the time when you are trying to present text to the user and determine its length in terms of human-readable "characters". When printing text on a terminal, even Unicode double-width graphemes (e.g. some emojis) need to be handled. I'm currently writing a compiler in Rust, and counting grapheme clusters and their width is absolutely essential for user-friendly error reporting. If the compiler used byte offsets for error reporting, then the user would have to translate those to visible character positions in their head, which is clearly undesirable.

That said, I find dealing with Unicode in Rust painless enough, thanks to crates like unicode_segmentation. I agree with burntsushi that deprecating byte APIs would have a disproportionately high cost, and it would get rid of genuinely useful functionality (since for lower-level operations, byte views are what one actually needs). I don't think we should make this step.
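For reference, a minimal sketch of that kind of counting, assuming the unicode-segmentation and unicode-width crates:

use unicode_segmentation::UnicodeSegmentation;
use unicode_width::UnicodeWidthStr;

fn main() {
    let s = "a😀漢x";
    // What the user most likely perceives as "characters":
    println!("{} grapheme clusters", s.graphemes(true).count()); // 4
    // Estimated terminal columns (emoji and CJK are double-width):
    println!("{} columns", s.width()); // 1 + 2 + 2 + 1 = 6
    // Bytes, by contrast:
    println!("{} bytes", s.len()); // 1 + 4 + 3 + 1 = 9
}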

10 Likes

Really, the col:line you report an error at should be "whatever col:line the user's IDE uses" for that location. If you report an error at 177:23, I want to be able to "Go To Line > 177:23" to get to where the error is. I think most IDEs use UTF-16 code unit index for this, as unfortunate as that is.
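For what it's worth, a hypothetical sketch of computing such a column, counted in UTF-16 code units (which the Language Server Protocol also defaults to):

// `byte_offset` is assumed to lie on a char boundary within `line`.
fn utf16_column(line: &str, byte_offset: usize) -> usize {
    line[..byte_offset].chars().map(char::len_utf16).sum()
}

fn main() {
    let line = "let s = \"😀\";";
    let semicolon = line.find(';').unwrap();
    // The emoji is one codepoint (4 bytes in UTF-8) but two UTF-16 code units.
    assert_eq!(utf16_column(line, semicolon), 12);
}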

2 Likes

I tested the tools I have on my system.

It seems many count codepoints, too. At least vscode does, as do the Rust and Haskell compilers, Notepad++, and gedit.

Kate, QtCreator, and Eclipse seem to count UTF-16 code units.

4 Likes

The actual column position of each character, computed properly, may be affected by things like: tab expansion, the choice of half-width versus full-width code point variant, text direction (LTR vs RTL, and whether the rendering device supports bidirectional text in the first place), the actual font metrics (how many columns it takes to render, say, U+FDFA) and what ligatures are supported (say, whether that particular emoji ZWJ sequence will render as a single two-column glyph or fall back to rendering as separate emoji). At least the latter two of these are impossible to know without access to the rendering side of things (wcswidth will sometimes do well enough, but it’s hardly ideal).

And I don’t think that counts as ‘indexing’ in the first place anyway – if it’s just for generating error messages and formatting text on the terminal, then all you really do is compute column numbers for output, without actually using them yourself to process text. It’s not like you’re doing .graphemes(true).count() or slicing strings from the j-th to the k-th grapheme cluster; you just iterate over the entire string left to right from the start to the end to appropriately measure it, and it seems like having indexing based on (whichever encoding’s) code units versus individual code points versus grapheme clusters makes a modest difference in this situation (if any at all).

So really, interoperability with languages using other indexing schemes seems to me like the only legitimate use case. (Well, when doing interactive text editing one may also need to traverse strings between grapheme cluster boundaries, but I don’t think that needs to be in a general-purpose crate used directly by applications.)

3 Likes

I disagree with that, because if the IDE gets it wrong, I shouldn't have to get it wrong. But even if I agreed, how am I supposed to know what IDE (and what settings) the user uses and how it interprets source text?

No. Ligatures surely don't affect how many characters the user perceives. Often, the combination ft is rendered as a single ligature, but it is still counted as "two letters". Even if one uses fancy coding fonts (there's one, for example, that contains >>= as a ligature for Haskellers), the user ought to be able to separate the constituent characters, because they need to be able to position the cursor between the individual components of the ligature.

Wrong. What if I want to generate the popular caret-style markers to pinpoint the location of the error? The Rust compiler itself generates errors like this:

foo.bar()
    ^   ^
    +---+
move occurs here

In order to render these error messages correctly, one needs to take into account grapheme clusters and grapheme width. I don't care if it's called "indexing" or something else – being able to distinguish between bytes, code units, and grapheme clusters (at least) is absolutely essential.

Note that I'm still not advertising the deprecation of byte-based APIs, and I'm actually perfectly satisfied with the state of Unicode in Rust. I'm just trying to point out that Unicode processing is not a niche or insignificant detail that we should or could start ignoring and/or radically changing overnight.

3 Likes

Good question, and yet another reason why computing column positions is a problem that is practically impossible to solve correctly. But the issue is actually deeper: it’s not that most people get the answer wrong, it’s that there isn’t even a clear ‘right’ answer in the first place. This is the case with many issues concerning ‘user-perceived characters’.

Yes, they do. I mentioned emoji ZWJ sequences, which are implemented as ligatures in fonts. Take a sequence like U+1F469 U+200D U+2764 U+FE0F U+200D U+1F469: a font may contain a special ligature glyph for this sequence or not. On my machine, Firefox renders it as a single glyph, while VTE displays it as three partially overlapping, but otherwise separate two-column glyphs. In Firefox, if I copy-paste this sequence into an editable field and hit backspace, the last two code points are removed and the remainder splits into two separate glyphs, but I cannot put a caret between them even after I do so; in VTE I can always put the caret between each two of the three glyphs. How many ‘user-perceived characters’ are in this sequence? I don’t think there is an obvious answer.
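To make that concrete, a small sketch (assuming the unicode-segmentation crate with a reasonably recent Unicode version):

use unicode_segmentation::UnicodeSegmentation;

fn main() {
    // U+1F469 U+200D U+2764 U+FE0F U+200D U+1F469
    let s = "\u{1F469}\u{200D}\u{2764}\u{FE0F}\u{200D}\u{1F469}";
    println!("{} code points", s.chars().count());                        // 6
    println!("{} extended grapheme clusters", s.graphemes(true).count()); // 1
    // Six code points, one grapheme cluster, and still no single obvious
    // answer for "how many user-perceived characters".
}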

By the way, the glibc wcswidth function returns 5 for this string: apparently it considers U+2764 U+FE0F to occupy a single column, despite actually rendering in two (which is probably why VTE renders it so badly). A great illustration of how you actually need to know your rendering device to measure the width of a piece of text.

(Later edit: if you consider emoji too outlandish an example, take instead Dutch ‘IJ’, which depending on who you ask may be either a digraph, a ligature or an independent letter. How many characters is ‘IJ’? There are a number of reasonable answers to this question, and none of them seems obviously better than the others.)

And to do that, it’s enough to split the line into three parts: prefix, the highlighted part and the suffix, and measure the widths of each in order to generate the appropriate amount of spacing. (Or properly: compute the spans of column indices of highlighted parts, because mere width may fail to handle mixed-directionality text.) The information where each part starts and ends comes from the parser, which usually works with byte or code point indices. You don’t actually need to address individual grapheme clusters. You may need to understand where grapheme cluster boundaries are, but that can be done with byte indices just as easily – which is probably what you started with in the first place.
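In Rust terms, that split-and-measure approach might look something like this (a sketch using the unicode-width crate; it assumes left-to-right text and no tabs, and the Python example further down demonstrates the same idea and where it breaks down):

use unicode_width::UnicodeWidthStr;

// `start` and `end` are byte indices coming from the parser.
fn underline(line: &str, start: usize, end: usize) -> String {
    let prefix_cols = line[..start].width();
    let highlight_cols = line[start..end].width();
    format!("{line}\n{}{}", " ".repeat(prefix_cols), "^".repeat(highlight_cols))
}

fn main() {
    let line = "I have read ‘吾輩は猫である’ recently.";
    let start = line.find("である").unwrap();
    println!("{}", underline(line, start, start + "である".len()));
}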

I think you should care what words mean, because otherwise we might end up talking past each other, as I believe happened just now. I never claimed it is useless to process strings by grapheme cluster; I just don’t see the point of indexing strings by grapheme clusters – i.e. numbering positions in the string by counting grapheme clusters surrounding them and using such positions even in the short term, never mind in long-term storage.

10 Likes

They certainly do. For example, some coding fonts render -> as a single character. These are monospace fonts and that single character has a width of 1 unit. I had trouble in the past because this means text that is properly aligned (by manual indentation with spaces) for some people is misaligned for others.

String length in Unicode is a very subtle topic, and there often is no clear answer. If you are asking "how long is this string", you are asking the wrong question. You might be asking "how many spaces do I have to print to align this line with some character in the line before" -- that's the question you were getting at with the error message formatting -- and that's one notion of "string length", but certainly not the only one. And answering this question without knowing the font and software rendering the output is not possible in general, I think.

3 Likes

Yes you do. When you have something like

something "😀😃😄😁" something
                     ^^^^^^^^^

You need to know how many grapheme clusters come before the highlighted span in the same line to insert the correct number of spaces. Furthermore, you need to know if the grapheme clusters are rendered as half-width or full-width characters in a typical terminal font.

I specifically addressed this issue in my previous reply.

I'm not interested in discussing definitions. You don't need to call it string length – call it whatever suits you, the point still is that it's a useful thing to ask, it comes up often, and it's a real problem to solve. And the same applies to the length of byte representations, etc. The argument of "you need to ask the right question" confirms exactly what I'm saying – we shouldn't suddenly cut off one kind of API that is being used all over the place, just because it's not the only possible interpretation.

1 Like

Read my message again.

First of all, ‘typical’ is doing a lot of work here, especially since half-width versus full-width rendering depends on the current locale. Second, if you’re weighing grapheme clusters by their expected column width, you’re not merely counting them or indexing by them. And third, counting grapheme clusters, even weighted, is not enough. Take this program:

#!/usr/bin/env python3
import wcwidth

def highlight(t0, t1, t2=''):
	print(t0 + t1 + t2)
	print(' ' * wcwidth.wcswidth(t0) + '^' * wcwidth.wcswidth(t1))
	print()

# noooo, you can’t just count grapheme clusters to measure the width of text on a particular display device, it will fail to handle bidirectional text correctly, nooooo
highlight("haha, ", "Latin-based assumptions", " go brrrr")
highlight("something \"😀😃😄😁\" ", "something")
highlight("I have read ‘吾輩は猫", "である", "’ recently.")
# keyword arguments added to alleviate direction confusion in the syntax-highlighted source
highlight(t0="The inscription read ‘מנא מנא תקל ", t1="ופרסין", t2="’; I didn’t understand what it meant.")

When I run the above example under a VTE-based terminal, the first three samples display reasonably, but in the last one the wrong fragment is highlighted:

[screenshot mene-vte3: output under a VTE-based terminal]

Under xterm (and uxterm) the Hebrew text is rendered backwards:

[screenshot mene-xterm: output under xterm]

It’s not enough to count grapheme clusters, or even weigh them by expected column width; to underline text correctly you need to know whether your output device can render bidirectional text in logical order, and basically implement Unicode bidirectional algorithms on your own (to convert text into left-to-right order and/or to compute proper column spans). At this point you’re enough removed from grapheme cluster indexes that using those doesn’t really afford you any advantages.

(Amusingly enough, rustc seems to use basically the same wrong algorithm I posted here. Also note that I haven’t even mentioned tabulation in this post, which is also relevant to this problem.)

Well, the caveats we keep pointing out do cast some doubt on the usefulness of certain string APIs, and given that such APIs’ presence would encourage devising half-, ahem, -hearted solutions to common problems, they provide some argument for those APIs to be considered harmful™ and therefore excluded.

6 Likes