Wild idea: deprecating APIs that conflate str and [u8]

Grapheme cluster indexing comes up all the time when you are trying to present text to the user and determine its length in terms of human-readable "characters". When printing text on a terminal, even Unicode double-width graphemes (e.g. some emojis) need to be handled. I'm currently writing a compiler in Rust, and counting grapheme clusters and their width is absolutely essential for user-friendly error reporting. If the compiler used byte offsets for error reporting, the user would have to translate those to visible character positions in their head, which is obviously undesirable.

That said, I find dealing with Unicode in Rust painless enough, thanks to crates like unicode_segmentation. I agree with burntsushi that deprecating byte APIs would have a disproportionately high cost, and it would get rid of genuinely useful functionality (since for lower-level operations, byte views are what one actually needs). I don't think we should make this step.
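
To make that concrete, here is a minimal sketch of the kind of counting I mean, assuming the unicode-segmentation and unicode-width crates:

use unicode_segmentation::UnicodeSegmentation;
use unicode_width::UnicodeWidthStr;

fn main() {
    let line = "let crab = \"🦀\";";

    // Bytes: what str::len() gives you -- useful for slicing, useless for display.
    println!("bytes:     {}", line.len());
    // Code points: chars() iterates over Unicode scalar values.
    println!("chars:     {}", line.chars().count());
    // Grapheme clusters: the closest thing to user-perceived "characters".
    println!("graphemes: {}", line.graphemes(true).count());
    // Display columns in a typical terminal (the crab is double-width).
    println!("columns:   {}", line.width());
}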

10 Likes

Really, the col:line you report an error at should be "whatever col:line the user's IDE uses" for that location. If you report an error at 177:23, I want to be able to "Go To Line > 177:23" to get to where the error is. I think most IDEs use UTF-16 code unit index for this, as unfortunate as that is.
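
For what it's worth, if you did want to report a UTF-16-based column, something along these lines would do it. This is only a sketch: utf16_column is a hypothetical helper (not anything rustc actually does), and byte_offset is assumed to lie on a char boundary within the line.

/// Column (0-based, in UTF-16 code units) of a byte offset within `line`.
fn utf16_column(line: &str, byte_offset: usize) -> usize {
    line[..byte_offset].chars().map(char::len_utf16).sum()
}

fn main() {
    let line = "let 🦀 = ();";
    // The crab is 4 bytes but only 2 UTF-16 units, so the two columns diverge after it.
    let eq_byte = line.find('=').unwrap();
    println!("byte offset {} is UTF-16 column {}", eq_byte, utf16_column(line, eq_byte));
}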

2 Likes

I tested the editors I could find on my system.

It seems many count code points, too: at least vscode does, as do the Rust and Haskell compilers, Notepad++, and gedit.

Kate, QtCreator, and Eclipse count UTF-16 units, it seems.

4 Likes

The actual column position of each character, computed properly, may be affected by things like: tab expansion, the choice of half-width versus full-width code point variant, text direction (LTR vs RTL, and whether the rendering device supports bidirectional text in the first place), the actual font metrics (how many columns it takes to render, say, U+FDFA) and what ligatures are supported (say, whether that particular emoji ZWJ sequence will render as a single two-column glyph or fall back to rendering as separate emoji). At least the latter two of these are impossible to know without access to the rendering side of things (wcswidth will sometimes do well enough, but it’s hardly ideal).

And I don’t think that counts as ‘indexing’ in the first place anyway – if it’s just for generating error messages and formatting text on the terminal, then all you really do is compute column numbers for output, without actually using them yourself to process text. It’s not like you’re doing .graphemes(true).count() or slicing strings from the j-th to the k-th grapheme cluster; you just iterate over the entire string left to right from the start to the end to appropriately measure it, and it seems like having indexing based on (whichever encoding’s) code units versus individual code points versus grapheme clusters makes a modest difference in this situation (if any at all).

So really, interoperability with languages using other indexing schemes seems to me like the only legitimate use case. (Well, when doing interactive text editing one may also need to traverse strings between grapheme cluster boundaries, but I don’t think that needs to be in a general-purpose crate used directly by applications.)

3 Likes

I disagree with that, because if the IDE gets it wrong, I shouldn't have to get it wrong. But even if I agreed, how am I supposed to know what IDE (and what settings) the user uses and how it interprets source text?

No. Ligatures surely don't affect how many characters the user perceives. Often the combination ft is rendered as a single ligature, but it is still counted as "two letters". Even if one uses fancy coding fonts (there's one, for example, that contains >>= as a ligature for Haskellers), the user ought to be able to separate the constituent characters, because they need to be able to position the cursor between the individual components of the ligature.

Wrong. What if I want to generate the popular anchor-style marker to pinpoint the location of the error? The Rust compiler itself generates errors like this:

foo.bar()
    ^   ^
    +---+
move occurs here

In order to render these error messages correctly, one needs to take into account grapheme clusters and grapheme width. I don't care if it's called "indexing" or something else – being able to distinguish between bytes, code units, and grapheme clusters (at least) is absolutely essential.

Note that I'm still not advocating the deprecation of byte-based APIs, and I'm actually perfectly satisfied with the state of Unicode in Rust. I'm just trying to point out that Unicode processing is not a niche or insignificant detail that we should or could start ignoring and/or radically changing overnight.

3 Likes

Good question, and yet another reason why computing column positions is a problem that is practically impossible to solve correctly. But the issue is actually deeper: it’s not that most people get the answer wrong, it’s that there isn’t even a clear ‘right’ answer in the first place. This is the case with many issues concerning ‘user-perceived characters’.

Yes, they do. I mentioned emoji ZWJ sequences, which are implemented as ligatures in fonts. Take a sequence like U+1F469 U+200D U+2764 U+FE0F U+200D U+1F469: a font may contain a special ligature glyph for this sequence or not. On my machine, Firefox renders it as a single glyph, while VTE displays it as three partially overlapping, but otherwise separate two-column glyphs. In Firefox, if I copy-paste this sequence into an editable field and hit backspace, the last two code points are removed and the remainder splits into two separate glyphs, but I cannot put a caret between them even after I do so; in VTE I can always put the caret between each two of the three glyphs. How many ‘user-perceived characters’ are in this sequence? I don’t think there is an obvious answer.
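
For what it's worth, the usual counting schemes already disagree about that sequence; a quick sketch, assuming the unicode-segmentation crate (whose grapheme count follows UAX #29 extended grapheme clusters):

use unicode_segmentation::UnicodeSegmentation;

fn main() {
    // U+1F469 U+200D U+2764 U+FE0F U+200D U+1F469
    let s = "\u{1F469}\u{200D}\u{2764}\u{FE0F}\u{200D}\u{1F469}";
    println!("bytes:             {}", s.len());                   // 20
    println!("code points:       {}", s.chars().count());         // 6
    println!("UTF-16 code units: {}", s.encode_utf16().count());  // 8
    println!("grapheme clusters: {}", s.graphemes(true).count()); // 1
}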

By the way, the glibc wcswidth function returns 5 for this string: apparently it considers U+2764 U+FE0F to occupy a single column, despite actually rendering in two (which is probably why VTE renders it so badly). A great illustration of how you actually need to know your rendering device to measure the width of a piece of text.

(Later edit: if you consider emoji too outlandish an example, take instead Dutch ‘IJ’, which depending on who you ask may be either a digraph, a ligature or an independent letter. How many characters is ‘IJ’? There are a number of reasonable answers to this question, and none of them seems obviously better than the others.)

And to do that, it’s enough to split the line into three parts: prefix, the highlighted part and the suffix, and measure the widths of each in order to generate the appropriate amount of spacing. (Or properly: compute the spans of column indices of highlighted parts, because mere width may fail to handle mixed-directionality text.) The information where each part starts and ends comes from the parser, which usually works with byte or code point indices. You don’t actually need to address individual grapheme clusters. You may need to understand where grapheme cluster boundaries are, but that can be done with byte indices just as easily – which is probably what you started with in the first place.
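
A minimal sketch of that three-part approach, assuming the unicode-width crate (and, as noted, ignoring tabs and mixed-directionality text):

use unicode_width::UnicodeWidthStr;

/// Print `line` and underline the byte range `span` with carets.
/// Naive on purpose: it measures terminal columns of the prefix and of the
/// highlighted part, and handles neither tabs nor bidirectional text.
fn underline(line: &str, span: std::ops::Range<usize>) {
    let prefix = &line[..span.start];
    let highlighted = &line[span];
    println!("{line}");
    println!("{}{}", " ".repeat(prefix.width()), "^".repeat(highlighted.width().max(1)));
}

fn main() {
    let line = "something \"😀😃😄😁\" something";
    let start = line.rfind("something").unwrap();
    underline(line, start..start + "something".len());
}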

I think you should care what words mean, because otherwise we might end up talking past each other, as I believe happened just now. I never claimed it is useless to process strings by grapheme cluster; I just don’t see the point of indexing strings by grapheme clusters – i.e. numbering positions in the string by counting grapheme clusters surrounding them and using such positions even in the short term, never mind in long-term storage.

10 Likes

They certainly do. For example, some coding fonts render -> as a single character. These are monospace fonts and that single character has a width of 1 unit. I had trouble in the past because this means text that is properly aligned (by manual indentation with spaces) for some people is misaligned for others.

String length in Unicode is a very subtle topic, and there often is no clear answer. If you are asking "how long is this string", you are asking the wrong question. You might be asking "how many spaces do I have to print to align this line with some character in the line before" -- that's the question you were getting at with the error message formatting -- and that's one notion of "string length", but certainly not the only one. And answering this question without knowing the font and software rendering the output is not possible in general, I think.

3 Likes

Yes you do. When you have something like

something "😀😃😄😁" something
                     ^^^^^^^^^

You need to know how many grapheme clusters come before the highlighted span in the same line to insert the correct number of spaces. Furthermore, you need to know if the grapheme clusters are rendered as half-width or full-width characters in a typical terminal font.

I specifically addressed this issue in my previous reply.

I'm not interested in discussing definitions. You don't need to call it string length – call it whatever suits you, the point still is that it's a useful thing to ask, it comes up often, and it's a real problem to solve. And the same applies to the length of byte representations, etc. The argument of "you need to ask the right question" confirms exactly what I'm saying – we shouldn't suddenly cut off one kind of API that is being used all over the place, just because it's not the only possible interpretation.

1 Like

Read my message again.

First of all, ‘typical’ is doing a lot of work here, especially since half-width versus full-width rendering depends on the current locale. Second, if you’re weighing grapheme clusters by their expected column width, you’re not merely counting them or indexing by them. And third, counting grapheme clusters, even weighted, is not enough. Take this program:

#!/usr/bin/env python3
import wcwidth

def highlight(t0, t1, t2=''):
	print(t0 + t1 + t2)
	print(' ' * wcwidth.wcswidth(t0) + '^' * wcwidth.wcswidth(t1))
	print()

# noooo, you can’t just count grapheme clusters to measure the width of text on a particular display device, it will fail to handle bidirectional text correctly, nooooo
highlight("haha, ", "Latin-based assumptions", " go brrrr")
highlight("something \"😀😃😄😁\" ", "something")
highlight("I have read ‘吾輩は猫", "である", "’ recently.")
# keyword arguments added to alleviate direction confusion in the syntax-highlighted source
highlight(t0="The inscription read ‘מנא מנא תקל ", t1="ופרסין", t2="’; I didn’t understand what it meant.")

When I run the above example under a VTE-based terminal, the first three samples display reasonably, but in the last one the wrong fragment is highlighted:

[screenshot: mene-vte3]

Under xterm (and uxterm) the Hebrew text is rendered backwards:

[screenshot: mene-xterm]

It’s not enough to count grapheme clusters, or even weigh them by expected column width; to underline text correctly you need to know whether your output device can render bidirectional text in logical order, and basically implement the Unicode bidirectional algorithm on your own (to convert text into left-to-right order and/or to compute proper column spans). At this point you’re far enough removed from grapheme cluster indices that using them doesn’t really afford you any advantages.

(Amusingly enough, rustc seems to use basically the same wrong algorithm I posted here. Also note that I haven’t even mentioned tabulation in this post, which is also relevant to this problem.)

Well, the caveats we keep pointing out do cast some doubt on the usefulness of certain string APIs, and given that such APIs’ presence would encourage devising half-, ahem, -hearted solutions to common problems, they provide some argument for those APIs to be considered harmful™ and therefore excluded.

6 Likes

On my screen the ^^^^^^^^^ are slightly right-shifted with respect to the something immediately above them. The obvious cause is that the widths of some glyphs on the first line, as rendered on my screen, are not an exact integer multiple of those on the second line, even though all the graphemes are being rendered in a nominally-monospaced font.

Until the late 1970s, computer printers were almost always line printers, producing only fixed-width upper-case English alphanumerics and punctuation. (We later sometimes punningly referred to this as "half-ASCII".) By 1980 low-speed computer output had progressed to daisy wheel printers that were capable of proportional-font output of upper- and lower-case characters. Readable full-ASCII output was available!

However, for visual alignment in their code, programmers mostly stayed with fixed-width fonts. The example quoted at the start of this post shows that those rules are changing even for code, when the code uses glyphs outside the basic ASCII character set. As pointed out by others, the problem of maintaining a mono-spaced character width becomes much more complicated when there are ligatures in Roman fonts, or non-Roman scripts such as Arabic that are typically rendered as connected cursive.

IMO, much of this thread has been about counting graphemes as an approximant to computing the display width of rendered glyphs. If we posit that multi-lingual programming fonts are following the same evolutionary path as other non-programming computer output, then it will no longer be possible to precisely position pseudo-underlining by "counting characters" on a separate line added under the text. The above example shows, on my screen, that this is already the case. The programming community needs to find a different approach that anchors added emphasis to the actual text to be emphasized, so that the emphasis renders properly aligned no matter the font or type of output device.

Edit: Corrected "monotonic" to "mono-spaced"

3 Likes

I wouldn't consider getting the buffer size of a string a half-assed solution, yet the OP argues for its removal. This operation is very common because any relatively low-level string processing, such as serialization, requires it. It's quite an extreme viewpoint that "it is possible to abuse this API, therefore it must be removed no matter how high the ecosystem cost is".

No, the OP argues to make it "more explicit". Many have pointed out that the OP's way of doing that (using as_bytes indirection instead) is far from ideal, for the reasons stated. But removal of len was a means to an end, not the end goal itself.

2 Likes

I'd like to pull this back to the original topic. It appears to be (nearly?) universally agreed upon that deprecating these methods is far too much churn and not worth it. If that is the case, should this topic be closed?

9 Likes

At first I didn't understand exactly what was being discussed in this topic, but after taking a look I more or less got it. If it's not a distraction, and if you don't mind, I would like to write this post for anyone who might be interested in the matter, mostly for other visitors like me.

.

A code point is an abstract number whose associated values and/or behaviors are determined by a coding standard.

This means that code points are not characters, and treating them as such is a huge error that other languages have made while trying to support Unicode.

In the Unicode standard, a code point is written in the hexadecimal format U+HHHHHH. The first two positions determine a classification range called a plane (there are 17 possible planes, numbered 0-16).

The code unit is the unit of storage. In Unicode it can be 8 bits (UTF-8), 16 bits (UTF-16 BE/LE) or 32 bits (UTF-32 BE/LE). A code unit may represent a full code point, or only an incomplete part of one.

The crab glyph [ :crab: ] in Unicode has one code point: U+1F980

with UTF-8     is stored in 4 code units F0 9F A6 80
with UTF-16 BE is stored in 2 code units D8 3E DD 80
with UTF-32 BE is stored in 1 code unit  00 01 F9 80

The hot coffee glyph [ :coffee: ] in Unicode has one code point: U+2615

with UTF-8     is stored in 3 code units    E2 98 95
with UTF-16 BE is stored in 1 code unit        26 15
with UTF-32 BE is stored in 1 code unit  00 00 26 15
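
The same counts can be checked from Rust's standard library alone (the string literals below are the crab and coffee glyphs):

fn main() {
    for s in ["🦀", "☕"] {
        println!(
            "{s}: {} UTF-8 code units, {} UTF-16 code units, {} code point(s)",
            s.len(),                  // UTF-8 code units (bytes)
            s.encode_utf16().count(), // UTF-16 code units
            s.chars().count(),        // code points (Unicode scalar values)
        );
    }
}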

.

A grapheme is a sequence of one or more code points that is displayed as a single graphical unit, which a reader recognizes as a single element of the writing system.

The glyph [ å ] in Unicode is a grapheme with one code point: U+00E5

with UTF-8     is stored in 2 code units       C3 A5
with UTF-16 BE is stored in 1 code unit        00 E5
with UTF-32 BE is stored in 1 code unit  00 00 00 E5

The glyph [ å ] in Unicode can also be a grapheme with two code points: U+0061 U+030A

[ a ] U+0061
UTF-8 		         61 	
UTF-16BE 	      00 61
UTF-32BE 	00 00 00 61

[ ̊  ] U+030A  'Combining Ring Above'
UTF-8 		      CC 8A
UTF-16BE 	      03 0A
UTF-32BE 	00 00 03 0A

with UTF-8     is stored in 3 code units                61 CC 8A
with UTF-16 BE is stored in 2 code units             00 61 03 0A
with UTF-32 BE is stored in 2 code units 00 00 00 61 00 00 03 0A

So one Unicode grapheme, besides being composed of more than one code point, can also occupy more than one code unit, even when UTF-32 is used.
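
A quick check of the two-code-point å above, as a sketch assuming the unicode-segmentation crate:

use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let composed = "\u{00E5}";    // å as a single code point
    let decomposed = "a\u{030A}"; // a + Combining Ring Above
    // Both are a single grapheme cluster, but they differ in code points and bytes.
    assert_eq!(composed.graphemes(true).count(), 1);
    assert_eq!(decomposed.graphemes(true).count(), 1);
    assert_eq!(composed.chars().count(), 1);
    assert_eq!(decomposed.chars().count(), 2);
    assert_eq!(composed.len(), 2);   // C3 A5
    assert_eq!(decomposed.len(), 3); // 61 CC 8A
}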

The glyph [ ậ ] code point  U+1EAD
The glyph [ ậ ] code points U+0061 + U+0302 + U+0323  
The glyph [ ậ ] code points U+0061 + U+0323 + U+0302 

UTF-8:        E1 BA AD (3 bytes); UTF-32:                   00001EAD ( 4 bytes)
UTF-8:  61 CC 82 CC A3 (5 bytes); UTF-32: 00000061 00000302 00000323 (12 bytes)
UTF-8:  61 CC A3 CC 82 (5 bytes); UTF-32: 00000061 00000323 00000302 (12 bytes)

[ a ] U+0061
[ ̂  ] U+0302  'Combining Circumflex Accent'
[ ̣  ] U+0323  'Combining Dot Below'

.

There are also code points that have no visible shape of their own and only define behaviors, such as No-break space U+00A0, Soft hyphen U+00AD, Zero width space U+200B, Zero width joiner U+200D, Word joiner U+2060, Left-To-Right embedding U+202A, Right-To-Left embedding U+202B, Left-To-Right override U+202D, Right-To-Left override U+202E, etc.

.

A glyph is an image representing a grapheme, or part of one, whose shape is determined by the selected font and by how the code points are interpreted.

.

assert!( "Café" == "Café" ); // homoglyph example, what would be expected?

'C' + 'a' + 'f' + 'e' + '́ ' << Unicode UTF-8 = 43 61 66 65 CC 81
'C' + 'a' + 'f' + 'é'       << Unicode UTF-8 = 43 61 66 C3 A9
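
In Rust, comparing two &str values compares their UTF-8 bytes, so the two spellings above are not equal unless they are normalized first; a minimal sketch, assuming the unicode-normalization crate:

use unicode_normalization::UnicodeNormalization;

fn main() {
    let composed = "Caf\u{00E9}";    // 'é' as a single code point (NFC form)
    let decomposed = "Cafe\u{0301}"; // 'e' + Combining Acute Accent (NFD form)

    assert_ne!(composed, decomposed);           // byte-wise comparison: not equal
    assert_eq!(
        composed.nfc().collect::<String>(),     // normalize both to NFC...
        decomposed.nfc().collect::<String>()    // ...and they compare equal
    );
}
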
6 Likes

Amendment: A glyph is a font shape representing one or more graphemes. A common case of glyphs representing several graphemes is ligatures: a special rendering of a cluster of several letters, like the common ligature fi. In this ligature a slightly elongated ' f ' can have its 'tail' go over the ' i '. The reason ligatures exist is that letters are not rectangles, so some overlap of their bounding rectangles is acceptable and sometimes even desired, as long as they can still be clearly recognized. Slight overlaps allow for more letters per line, increasing reading speed and preserving paper. Sometimes a slight tweak of letter shape allows for higher overlap, hence the ligatures.

Ligatures are rarely a concern except for text layout/rendering engines for proportional fonts, but their use can improve the presentation of text considerably, and it is the only place where text length in anything other than bytes matters.

Why do you say that? The example earlier in this thread of a mis-aligned (slightly offset to the right) pseudo-underscore demonstrates that the display width of rendered text does matter. Looking at that example, it appears that the underlying placement algorithm for the pseudo-underscore is measuring characters (perhaps at half-, full-, and double-width) and using that as an approximant for actual glyph width.

I consider the example you point at a subcase of text rendering, but yeah, it seems I was inaccurate in my wording.

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.