To be fair, most other languages also don't return the count of code points/grapheme clusters/glyphs from their string length getter. Instead, they just return the length of the underlying `u16` slice, since they store strings in UTF-16 encoding underneath. Should we adopt this behavior?
Python 3 stores strings as sequences of code points (up to UTF-32 internally), so the length always matches the number of code points.
Should we adopt this behavior?
Returning the length in UTF-16 code units makes no sense for Rust, since:
- It is undesired in 99.99% of cases.
- It is expensive to compute for UTF-8 strings.
- Changing the representation of `String` to UTF-16 is not possible.
- Besides, why would we want to do that? UTF-16 is the worst of all worlds: it is not very memory efficient, it isn't compatible with ASCII, and characters aren't fixed width either.
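A minimal sketch (std only) of the three counts in question; the numbers in the comments apply to this particular example string:

```rust
fn main() {
    let s = "héllo 🌍";

    println!("{}", s.len());                  // 11: bytes in the UTF-8 representation, O(1)
    println!("{}", s.chars().count());        // 7:  Unicode code points, O(n)
    println!("{}", s.encode_utf16().count()); // 8:  UTF-16 code units, also O(n)
}
```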
It can be more memory efficient than UTF-8 when dealing with, e.g., text in East Asian languages, where the majority of characters take 3 bytes in UTF-8 but only 2 bytes in UTF-16.
Depending on what you consider a “character”, they aren’t fixed width in UTF-32 either. Sure, code points are, but when handling Unicode properly, you quickly have to deal with combining characters / diacritical marks, too.
If you construct an artificial example, yes, but:
- most interchange text has markup as well, which typically falls in the ASCII range, and
- if you're at the scale where this is actually meaningfully noticeable, you should probably be using a streaming general-purpose compression algorithm, which will definitely do better (and roughly equally well on UTF-8 and UTF-16).
If anyone here hasn't already, I'd suggest reading the UTF-8 Everywhere manifesto, which makes a great argument for why UTF-8 should be the encoding of text in the absence of specific (typically legacy) design constraints.
The Chinese translation of this manifesto takes 58.8 KiB in UTF-16, and only 51.7 KiB in UTF-8.
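For what it's worth, here is a rough sketch (std only, made-up example strings) of how the comparison can flip once ASCII markup surrounds the CJK text:

```rust
fn utf8_vs_utf16_bytes(s: &str) -> (usize, usize) {
    let utf8 = s.len();                       // bytes as UTF-8
    let utf16 = s.encode_utf16().count() * 2; // bytes as UTF-16 (ignoring any BOM)
    (utf8, utf16)
}

fn main() {
    // CJK-only text: UTF-16 is smaller (UTF-8: 9 bytes, UTF-16: 6).
    println!("{:?}", utf8_vs_utf16_bytes("统一码"));
    // The same text wrapped in ASCII markup: UTF-8 is smaller (UTF-8: 29 bytes, UTF-16: 46).
    println!("{:?}", utf8_vs_utf16_bytes("<p class=\"note\">统一码</p>"));
}
```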
- In most cases, text is treated as a mostly-opaque blob that you read from one location and pass along to another API.
- In the cases where it isn't completely opaque, command characters typically fall into the ASCII range.
- Because (transitively) codepoint ≠ grapheme ≠ glyph ≠ user-perceived character ≠ user-perceived character (in a different locale), there is no such thing as a constant-time text algorithm.
Honestly I think encoding is beside the point. Rust uses UTF-8. That's fine. What's not so great is the API and the terminology it uses.
Sometimes a string is a UTF-8 byte buffer. Sometimes a string is a Unicode string (in the abstract sense). Which is which depends on the context. And this distinction is often not very clear to new users, despite heroic efforts by the book and documentation.
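A tiny illustration (std only) of those two views of the same `String`:

```rust
fn main() {
    let s = String::from("naïve");

    // UTF-8 byte-buffer view: what slicing, indexing, and I/O operate on.
    println!("{} bytes", s.as_bytes().len()); // 6

    // "Abstract Unicode string" view: iterate by code point.
    println!("{} code points", s.chars().count()); // 5
}
```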
Folks, this thread isn't about the relative merits of UTF-8 vs UTF-16.
Naive question: what practical difference is there between the two?
I'd say it is both: It is a Unicode string, represented as a byte buffer. Its methods operate on `char` when the operation is about single elements of the string:
- `push`, `pop`
- `insert`, `remove`, `retain`

Most other methods operate on byte ranges. I think this is the correct and most efficient thing to do. The only exception is `drain`, which returns an `Iterator<Item = char>` but shouldn't, IMO.
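A small sketch of that split, using only std methods on `String`; the comments describe the state after each call:

```rust
fn main() {
    let mut s = String::from("héllo");

    // Single-element operations take or return `char`s…
    s.push('!');                    // "héllo!"
    assert_eq!(s.pop(), Some('!')); // back to "héllo"
    s.retain(|c| c != 'l');         // "héo"

    // …while positional arguments are byte indices.
    s.insert(0, '¡');               // "¡héo" ('¡' occupies 2 bytes)
    assert_eq!(s.remove(0), '¡');   // removes by byte index, returns the `char`

    // `drain` takes a byte range but yields `char`s.
    let first: String = s.drain(..3).collect(); // drains "hé" (1 + 2 bytes)
    assert_eq!(first, "hé");
    assert_eq!(s, "o");
}
```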
Would you perhaps elaborate on what `drain` should do instead, in your opinion? Or should it not exist at all?
I imagine APIs which provide substring indexing on Unicode strings provide code-point-based indices if they're meant to be used across languages and libraries which may not be consistent in their preferred byte encoding. To provide a concrete example, the Twitter API provides code point indices when referring to the position of hashtags or URLs in a tweet. In egg-mode, I convert these into byte-based indices when providing the strings back to Rust code. I do this because Rust uses byte-based indexing, and I didn't want users of my library to try to slice a string with code point indices.
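A hedged sketch of such a conversion (this is not egg-mode's actual code; the helper name and the example text are made up):

```rust
/// Convert a code-point index into a byte index for `s` (a sketch, not
/// egg-mode's actual code). An index one past the last code point maps to
/// `s.len()`; anything further out of range yields `None`.
fn codepoint_to_byte_index(s: &str, cp_index: usize) -> Option<usize> {
    s.char_indices()
        .map(|(byte_idx, _)| byte_idx)
        .chain(std::iter::once(s.len())) // allow an end-of-string index
        .nth(cp_index)
}

fn main() {
    let tweet = "tag: #héllo";
    // Suppose the API says the hashtag spans code points 5..11.
    let start = codepoint_to_byte_index(tweet, 5).unwrap();
    let end = codepoint_to_byte_index(tweet, 11).unwrap();
    assert_eq!(&tweet[start..end], "#héllo"); // byte-based slicing now works
}
```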
Grapheme cluster indexing comes up all the time when you are trying to present text to the user and determine its length in terms of human-readable "characters". When printing text on a terminal, even Unicode double-width graphemes (e.g. some emoji) need to be handled. I'm currently writing a compiler in Rust, and counting grapheme clusters and their width is absolutely essential for user-friendly error reporting. If the compiler used byte offsets for error reporting, the user would have to translate those to visible character positions in their head, which is obviously undesirable.
That said, I find dealing with Unicode in Rust painless enough, thanks to crates like `unicode_segmentation`. I agree with burntsushi that deprecating byte APIs would have a disproportionately high cost, and it would get rid of genuinely useful functionality (since for lower-level operations, byte views are what one actually needs). I don't think we should take this step.
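For illustration, a small sketch using the `unicode_segmentation` and `unicode-width` crates; the counts in the comments assume this exact string and recent crate versions:

```rust
use unicode_segmentation::UnicodeSegmentation;
use unicode_width::UnicodeWidthStr;

fn main() {
    // 'é' written as 'e' + U+0301 (combining acute), plus a wide CJK character.
    let s = "e\u{301}a漢";

    println!("{}", s.len());                   // 7: UTF-8 bytes
    println!("{}", s.chars().count());         // 4: code points
    println!("{}", s.graphemes(true).count()); // 3: extended grapheme clusters
    println!("{}", s.width());                 // 4: terminal columns (combining mark = 0, 漢 = 2)
}
```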
The `col:line` you report an error at should be "whatever `col:line` the user's IDE uses" for that location. If you report an error at `177:23`, I want to be able to "Go To Line > `177:23`" to get to where the error is. I think most IDEs use UTF-16 code unit indices for this, as unfortunate as that is.
I tested what I have on my system. It seems many count code points, too: at least VS Code does, as do the Rust and Haskell compilers, Notepad++, and gedit. Kate, QtCreator, and Eclipse seem to count UTF-16 code units.
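To make the disagreement concrete, here is a sketch (std only, hypothetical helper) that reports the same position in the three units tools tend to pick from; whether columns are then 0- or 1-based is yet another choice:

```rust
/// Report the "column" of a byte offset into `line` in three common units.
fn columns(line: &str, byte_offset: usize) -> (usize, usize, usize) {
    let prefix = &line[..byte_offset];
    (
        prefix.len(),                  // UTF-8 bytes
        prefix.chars().count(),        // Unicode code points
        prefix.encode_utf16().count(), // UTF-16 code units
    )
}

fn main() {
    let line = "let x = \"𝄞é\"; // error reported at the semicolon";
    let offset = line.find(';').unwrap();
    println!("{:?}", columns(line, offset)); // (16, 12, 13): three different answers
}
```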
The actual column position of each character, computed properly, may be affected by things like: tab expansion, the choice of half-width versus full-width code point variant, text direction (LTR vs RTL, and whether the rendering device supports bidirectional text in the first place), the actual font metrics (how many columns it takes to render, say, U+FDFA), and what ligatures are supported (say, whether that particular emoji ZWJ sequence will render as a single two-column glyph or fall back to rendering as separate emoji). At least the latter two of these are impossible to know without access to the rendering side of things (`wcswidth` will sometimes do well enough, but it’s hardly ideal).
And I don’t think that counts as ‘indexing’ in the first place anyway – if it’s just for generating error messages and formatting text on the terminal, then all you really do is compute column numbers for output, without actually using them yourself to process text. It’s not like you’re doing `.graphemes(true).count()` or slicing strings from the j-th to the k-th grapheme cluster; you just iterate over the entire string left to right, from the start to the end, to measure it appropriately, and it seems like indexing based on (whichever encoding’s) code units versus individual code points versus grapheme clusters makes little difference in this situation (if any at all).
So really, interoperability with languages using other indexing schemes seems to me like the only legitimate use case. (Well, when doing interactive text editing one may also need to traverse strings between grapheme cluster boundaries, but I don’t think that needs to be in a general-purpose crate used directly by applications.)
I disagree with that, because if the IDE gets it wrong, I shouldn't have to get it wrong. But even if I agreed, how am I supposed to know what IDE (and what settings) the user uses and how it interprets source text?
No. Ligatures surely don't affect how many characters the user perceives. Often, the combination `ft` is rendered as a single ligature, but it is still counted as "two letters". Even if one uses fancy coding fonts (there's one, for example, that contains `>>=` as a ligature for Haskellers), the user ought to be able to separate the constituent characters, because they need to be able to position the cursor between the individual components of the ligature.
Wrong. What if I want to generate the popular anchor-style marker to pinpoint the location of the error? The Rust compiler itself generates errors like this:
```
foo.bar()
^   ^
+---+ move occurs here
```
In order to render these error messages correctly, one needs to take into account grapheme clusters and grapheme width. I don't care if it's called "indexing" or something else – being able to distinguish between bytes, code units, and grapheme clusters (at least) is absolutely essential.
Note that I'm still not advertising the deprecation of byte-based APIs, and I'm actually perfectly satisfied with the state of Unicode in Rust. I'm just trying to point out that Unicode processing is not a niche or insignificant detail that we should or could start ignoring and/or radically changing overnight.
Good question, and yet another reason why computing column positions is a problem that is practically impossible to solve correctly. But the issue is actually deeper: it’s not that most people get the answer wrong, it’s that there isn’t even a clear ‘right’ answer in the first place. This is the case with many issues concerning ‘user-perceived characters’.
Yes, they do. I mentioned emoji ZWJ sequences, which are implemented as ligatures in fonts. Take a sequence like U+1F469 U+200D U+2764 U+FE0F U+200D U+1F469: a font may contain a special ligature glyph for this sequence or not. On my machine, Firefox renders it as a single glyph, while VTE displays it as three partially overlapping, but otherwise separate two-column glyphs. In Firefox, if I copy-paste this sequence into an editable field and hit backspace, the last two code points are removed and the remainder splits into two separate glyphs, but I cannot put a caret between them even after I do so; in VTE I can always put the caret between each two of the three glyphs. How many ‘user-perceived characters’ are in this sequence? I don’t think there is an obvious answer.
By the way, the glibc `wcswidth` function returns 5 for this string: apparently it considers U+2764 U+FE0F to occupy a single column, despite actually rendering in two (which is probably why VTE renders it so badly). A great illustration of how you actually need to know your rendering device to measure the width of a piece of text.
(Later edit: if you consider emoji too outlandish an example, take instead Dutch ‘Ĳ’, which depending on who you ask may be either a digraph, a ligature or an independent letter. How many characters is ‘Ĳ’? There are a number of reasonable answers to this question, and none of them seems obviously better than the others.)
And to do that, it’s enough to split the line into three parts: the prefix, the highlighted part, and the suffix, and measure the width of each to generate the appropriate amount of spacing. (Or, properly: compute the spans of column indices of the highlighted parts, because mere width may fail to handle mixed-directionality text.) The information about where each part starts and ends comes from the parser, which usually works with byte or code point indices. You don’t actually need to address individual grapheme clusters. You may need to understand where grapheme cluster boundaries are, but that can be done with byte indices just as easily – which is probably what you started with in the first place.
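A hedged sketch of exactly that approach, using byte indices as a (hypothetical) parser would report them and the `unicode-width` crate for display width; tabs, bidirectional text, and font quirks are ignored, as discussed above:

```rust
use unicode_width::UnicodeWidthStr;

/// Build the caret line for an error span given as byte indices into `line`.
fn caret_line(line: &str, span: std::ops::Range<usize>) -> String {
    let prefix_width = line[..span.start].width(); // columns before the highlight
    let span_width = line[span].width().max(1);    // columns under the highlight
    format!("{}{}", " ".repeat(prefix_width), "^".repeat(span_width))
}

fn main() {
    let line = "let 名前 = foo.bar();";
    // Byte indices as a parser might report them (hypothetical values).
    let start = line.find("foo").unwrap();
    let end = start + "foo.bar()".len();
    println!("{}", line);
    println!("{}", caret_line(line, start..end));
}
```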
I think you should care what words mean, because otherwise we might end up talking past each other, as I believe happened just now. I never claimed it is useless to process strings by grapheme cluster; I just don’t see the point of indexing strings by grapheme clusters – i.e. numbering positions in the string by counting grapheme clusters surrounding them and using such positions even in the short term, never mind in long-term storage.
They certainly do. For example, some coding fonts render `->` as a single character. These are monospace fonts, and that single character has a width of one unit. I had trouble in the past because this means text that is properly aligned (by manual indentation with spaces) for some people is misaligned for others.
String length in Unicode is a very subtle topic, and there often is no clear answer. If you are asking "how long is this string", you are asking the wrong question. You might be asking "how many spaces do I have to print to align this line with some character in the line before" -- that's the question you were getting at with the error message formatting -- and that's one notion of "string length", but certainly not the only one. And answering this question without knowing the font and software rendering the output is, I think, not possible in general.
Yes you do. When you have something like:

```
something "😀😃😄😁" something
          ^^^^^^^^^
```
You need to know how many grapheme clusters come before the highlighted span in the same line to insert the correct number of spaces. Furthermore, you need to know if the grapheme clusters are rendered as half-width or full-width characters in a typical terminal font.