Consensus check: Asking the Unicode Technical Committee to revert their decision to change the preferred UTF-8 error handling


#1

(If you don’t care about the details of UTF-8 error handling, it’s safe to stop reading.)

In reference to https://hsivonen.fi/broken-utf-8/ , I think it would be appropriate to submit that post to the Unicode Consortium with a cover note asking the Unicode Technical Committee to do two things, for the reasons given in the post: revert their decision to change the preferred UTF-8 error handling for Unicode 11, and retract the action item to draft corresponding new text for Unicode 11.

I think it would be preferable to do this via Mozilla’s liaison membership of the Unicode Consortium rather than me doing it as a random member of the public, because submission via Mozilla’s liaison membership allows for visibility into the process and opportunity for follow-up whereas if I do it on my own, it’s basically a matter of dropping a note into a one-way black box. (It seems that this kind of thing is exactly what Mozilla’s liaison membership is for.)

However, submitting via Mozilla’s liaison membership raises the question of whether the submission would properly represent a Mozilla consensus. I estimate this to be noncontroversial, because deliberate effort has been expended to make the Mozilla-affiliated implementations that I am aware of (uconv, encoding_rs and the Rust standard library) behave according to the pre-Unicode 11 version of the guidance, either directly by looking at the Unicode Standard or by way of implementing the WHATWG Encoding Standard, which elevates the pre-Unicode 11 preferred approach into a requirement.
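To make the pre-Unicode 11 guidance concrete, here is a minimal sketch using only the Rust standard library's `String::from_utf8_lossy`. The helper name `lossy` and the sample byte sequences are picked purely for illustration. A truncated but otherwise valid prefix collapses into a single U+FFFD, while a sequence that can never be valid (an overlong form or an encoded surrogate) produces one U+FFFD per byte.

```rust
/// Thin wrapper around the standard library's lossy decoder.
/// (`lossy` is a hypothetical helper name used only in this sketch.)
fn lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

fn main() {
    // E2 82 is the start of U+20AC (E2 82 AC) but is cut short by 'A'.
    // The valid prefix is treated as one error: a single U+FFFD.
    assert_eq!(lossy(b"\xE2\x82A"), "\u{FFFD}A");

    // E0 may only be followed by A0..BF, so E0 is an error on its own,
    // and each stray continuation byte is an error too: three U+FFFD.
    assert_eq!(lossy(b"\xE0\x80\x80A"), "\u{FFFD}\u{FFFD}\u{FFFD}A");

    // An encoded surrogate (ED A0 80) is handled the same way.
    assert_eq!(lossy(b"\xED\xA0\x80"), "\u{FFFD}\u{FFFD}\u{FFFD}");

    println!("ok");
}
```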

If I have mis-guessed that the above-contemplated submission should be non-controversial from the Mozilla perspective and you believe that the above-contemplated submission should not be made via Mozilla’s liaison membership, please let me know.

(My understanding is that a reversal of the decision is quite possible, but actually making the above-contemplated submission is a process prerequisite for a reversal to take place.)


#2

I read your post with much interest and very much agree with your position. I hope your appeal will be successful.


#3

Your opinion here seems right to me. I suggest making sure @simonsapin agrees, as we often defer to him on Unicode matters.


#4

I fully agree with Henri. A while ago I submitted https://github.com/rust-lang/rust/pull/35947 to make more std APIs align with this Unicode recommendation (some already did).


#5

Great survey, @hsivonen! Thanks for writing it up in extensive detail.

I have some comments/suggestions for submitting a proposal to the UTC.

  • The ICU becoming a Unicode project was actually a move towards making it more open. And a main incentive for it was the CLDR depending on ICU for generating data.

  • There’s sometimes an implicit assumption that ICU does the right thing when it comes to corner cases and optional behavior. That’s not always the case, of course. Here, it’s not clear why ICU has made this decision. Maybe there’s a good reason on the application side. I think that reason needs to be looked at and measured.

  • I think there’s room for showing how conformance matters in this case. The numbers are good, too, but a case of “X and Y need to agree, and therefore it’s important that they take the same approach” is also needed to show why it is desirable for a larger number of libraries to agree here. For example, a client and a server may need to agree on character counts and indices, even if the bytes are malformed.

  • I would separate the process issues from the technical issue and post them as separate matters. I think it’s a good idea to raise both via the liaison membership. Maybe, in both areas, you can also expand on how the issues have affected, or still affect, Mozilla.

I hope you find this helpful.


#6

The explanation is given in the linked post.

See: Explaining the new preference in terms of implementation.

It’s an abstraction leak of ICU implementing decoding with bit shifts.
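As a rough illustration of what “decoding with bit shifts” can lead to, here is a simplified sketch. It is not ICU’s actual code, and the function names are made up for this example. The decoder picks a length from the lead byte, shifts in every continuation byte it finds, and only checks the resulting value at the end. An overlong or surrogate sequence then naturally comes out as one error covering the whole sequence, whereas the Rust standard library reports one error per byte for the same input.

```rust
/// Decode one sequence from the start of `bytes` (must be non-empty).
/// Returns the decoded char (None on error) and the number of bytes consumed.
/// Hypothetical "shift first, validate last" model, not taken from any library.
fn decode_shift_then_check(bytes: &[u8]) -> (Option<char>, usize) {
    let lead = bytes[0];
    // Trail count, payload bits of the lead, and the smallest value
    // that legitimately needs this sequence length.
    let (trail_count, mut cp, min) = match lead {
        0x00..=0x7F => return (Some(lead as char), 1),
        0xC0..=0xDF => (1, (lead & 0x1F) as u32, 0x80),
        0xE0..=0xEF => (2, (lead & 0x0F) as u32, 0x800),
        0xF0..=0xF7 => (3, (lead & 0x07) as u32, 0x1_0000),
        _ => return (None, 1), // stray continuation byte or invalid lead
    };
    let mut consumed = 1;
    for _ in 0..trail_count {
        match bytes.get(consumed) {
            // Any continuation byte is shifted in without a range check.
            Some(&b) if b & 0xC0 == 0x80 => {
                cp = (cp << 6) | (b & 0x3F) as u32;
                consumed += 1;
            }
            // Truncated sequence: report what was read so far as one error.
            _ => return (None, consumed),
        }
    }
    if cp < min {
        return (None, consumed); // overlong form, rejected only at the end
    }
    // `char::from_u32` rejects surrogates and values above U+10FFFF.
    (char::from_u32(cp), consumed)
}

/// Lossy decode with one U+FFFD per error reported by the sketch above.
fn lossy_shift_then_check(mut bytes: &[u8]) -> String {
    let mut out = String::new();
    while !bytes.is_empty() {
        let (decoded, consumed) = decode_shift_then_check(bytes);
        out.push(decoded.unwrap_or('\u{FFFD}'));
        bytes = &bytes[consumed..];
    }
    out
}

fn main() {
    let overlong = b"\xE0\x80\x80A";
    // Shift-then-check: the whole three-byte sequence is one error.
    assert_eq!(lossy_shift_then_check(overlong), "\u{FFFD}A");
    // Rust standard library: one error per byte.
    assert_eq!(String::from_utf8_lossy(overlong), "\u{FFFD}\u{FFFD}\u{FFFD}A");
    println!("ok");
}
```

The two decoders agree on well-formed input and on truncated sequences such as E2 82; they differ only on sequences that have the right shape but an invalid value.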


#7

Thank you for the feedback. The submission was already made, however.

There’s the claim that it makes sense for in-memory processing, but in the light of the Go iterator behavior, I don’t find the argument convincing.


#8

Cool! So it will be in the agenda for the August UTC meeting. Are you planning to call in for the discussion?

The Chrome bug is a good example of why it’s a real issue on the technical/conformance side. I’m wondering if there was a similar reported or potential case in Mozilla/Rust. (Besides the higher-level standards aspect that’s already discussed.)


#9

FYI, another piece of feedback on the subject matter: http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf


#10

I wasn’t aware that was a possibility. Looks like I missed the meeting. However, it appears that the consensus that I requested be retracted got retracted.

That’s the proposal that triggered the change (that now appears to be retracted) in the first place.


#11

Update from Markus Scherer on the Unicode internal list:

FYI, I changed the ICU behavior for the upcoming ICU 60 release (pending code review).

Proposal & description: https://sourceforge.net/p/icu/mailman/message/35990833/

Code changes: http://bugs.icu-project.org/trac/review/13311