UTF-8 BOM Handling

While searching through some issues I noticed that at the moment there is no special treatment for BOM, which causes issues with parser libraries.
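To make the current behavior concrete, here is a minimal sketch. The function name `decode_keeps_bom` is made up for illustration; the std calls are the ordinary `String::from_utf8`, `str::trim` and `char::is_whitespace`.

```rust
/// Decodes a byte buffer that starts with the UTF-8 encoding of U+FEFF.
/// (`decode_keeps_bom` is a hypothetical name used only for this example.)
fn decode_keeps_bom() -> String {
    // EF BB BF is the UTF-8 encoding of U+FEFF, followed by "hi".
    let bytes = vec![0xEF, 0xBB, 0xBF, b'h', b'i'];
    // `from_utf8` only validates; it does not remove or alter any bytes.
    let text = String::from_utf8(bytes).expect("the bytes are valid UTF-8");
    // U+FEFF does not have the White_Space property, so `trim` keeps it too.
    text.trim().to_owned()
}
```

The returned string still begins with U+FEFF and is five bytes long, so comparing it with `"hi"` fails. That is the kind of surprise a parser can run into.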

I would argue that Rust should strip a BOM sequence at the beginning when creating a string from bytes, as indicated by IETF RFC 3629 §6 (RFC 3629 - UTF-8, a transformation format of ISO 10646):

When interpreted as a signature, the Unicode standard suggests than an initial U+FEFF character may be stripped before processing the text.

It's trivial to do, and the main purpose of a &str/String is to be used for string operations.

The main reason against handling BOM sequences is that this effectively changes the data, and when hashed for signatures can result in a different signature. But a string is not a container for this kind of data, at least from my point of view.

But it's complicated to decide, as it depends on the protocol. For example, a file system does not store encoding information, so a BOM should be allowed there. On the other hand, some network protocols mandate a specific encoding and therefore don't allow a BOM, so an initial U+FEFF must be handled as "ZERO WIDTH NO-BREAK SPACE".

There is also an old issue: "RFC: std::string::String could provide options about UTF BOM" (Issue #2428, rust-lang/rfcs on GitHub).

That same RFC recommends against it:

It is therefore RECOMMENDED to avoid stripping an initial U+FEFF interpreted as a signature without a good reason, to ignore it instead of stripping it when appropriate (such as for display) and to strip it only when really necessary.


But it was already deprecated in 2003, and Unicode 3.2 advises against using it for any purpose other than as a BOM.

And yeah, the RFC is vague about it, as it also describes its usage in protocols and says that protocols which already specify the charset should not strip the BOM.

ZERO WIDTH NO-BREAK SPACE is also not a whitespace?

What issues?

If a parser wants to allow BOM at the beginning of a file, it should specify it explicitly in its language syntax rather than rely on it being silently dropped earlier in the pipeline.
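As a rough sketch of what "specify it explicitly in the syntax" could look like, here is a parser for a made-up `key=value` format whose grammar is `document = [ U+FEFF ] *( key "=" value newline )`. The format, the function name `parse_document` and the error messages are all invented for illustration.

```rust
/// Parses a hypothetical `key=value` document.
/// Assumed grammar: document = [ U+FEFF ] *( key "=" value newline )
fn parse_document(input: &str) -> Result<Vec<(&str, &str)>, String> {
    // The optional signature is a grammar rule: it is consumed at most once,
    // and only at the very start of the document.
    let body = input.strip_prefix('\u{FEFF}').unwrap_or(input);

    let mut pairs = Vec::new();
    for (index, line) in body.lines().enumerate() {
        if line.is_empty() {
            continue;
        }
        let (key, value) = line
            .split_once('=')
            .ok_or_else(|| format!("line {}: expected `key=value`", index + 1))?;
        // Anywhere else, U+FEFF is not a signature, so this grammar rejects it in keys.
        if key.contains('\u{FEFF}') {
            return Err(format!("line {}: U+FEFF is not allowed in a key", index + 1));
        }
        pairs.push((key, value));
    }
    Ok(pairs)
}
```

With this approach the decision lives in the parser, where it is documented and testable, and the string type stays a plain container.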


Regardless of the merits or lack thereof of this idea, it would break backwards compatibility to do this stripping automatically. So that would be a non-starter.

You could argue for a strip_left_bom method, though, if you believe that to be useful (rather than calling one of the existing stripping methods with the appropriate parameter). It's not really a common need, I would argue (I have never seen a UTF-8 file with a BOM; I have only seen that in UTF-16).


Yeah, agreed, stripping is definitely not a solution.

And the JSON spec for example doesn't allow a BOM to be emitted for the sake of interoperability.

I don't think a BOM makes sense in UTF-8 anyway. Endianness is meaningless for UTF-8. The only reason it would end up in UTF-8 is because some software isn't written properly (which does happen, of course).

What @tczajka suggested seems to be the best course forward for parsers that do have to deal with this. That or strip the string using strip_prefix first.
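For reference, the `strip_prefix` route is a one-liner. The helper name `strip_bom` below is hypothetical (it is not a std method); `str::strip_prefix` itself is stable std API.

```rust
/// Returns `text` without a single leading U+FEFF, if one is present.
/// (`strip_bom` is a hypothetical helper name, not part of std.)
fn strip_bom(text: &str) -> &str {
    // `strip_prefix` returns `None` when the prefix is absent,
    // in which case the input is handed back unchanged.
    text.strip_prefix('\u{FEFF}').unwrap_or(text)
}
```

Note that this removes at most one U+FEFF and never touches one in the middle of the text, which matches the "only an initial U+FEFF is a signature" reading.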


Such a method would be very useful. Some Windows programs, for example, unfortunately still create UTF-8 with a BOM by default (link):

As for why Microsoft cares about saving UTF-8 with a BOM in Notepad? This explains it well; seems to be a specific requirement of Microsoft programming tools and not any other non-Microsoft tool out there:

“Microsoft compilers and interpreters, and many pieces of software on Microsoft Windows such as Notepad treat the BOM as a required magic number rather than use heuristics. These tools add a BOM when saving text as UTF-8, and cannot interpret UTF-8 unless the BOM is present or the file contains only ASCII. Google Docs also adds a BOM when converting a document to a plain text file for download.”

I'm not too sure, but I think that among the Windows programs with this behavior, Visual Studio also saves files as UTF-8 with BOM by default, link. That would be really unfortunate.

The default “Save As” encoding for UTF-8 includes a BOM Signature - if source files are used for command line interface, the command prompt does not correctly translate the header.


Unicode strings and string sequences usually neither mandate nor forbid the use of U+FEFF characters. The (optional) use of BOMs is part of the higher-level protocol above plain Unicode text.

It's fine to skip the BOM during the “reading -> decoding -> building” (read_to_string) process if you have the motivation to do this. (You might also need to deal with UTF-16 and UTF-32 BOMs and do some encoding conversion.) But this functionality is really outside the scope of the &str and String types themselves.
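A rough sketch of such a decoding step, using only std. The names `Bom`, `detect_bom` and `decode_text` are invented for this example, and treating input without a BOM as UTF-8 is an assumption, not something any spec requires.

```rust
/// Encodings that can be identified by a byte order mark (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Bom { Utf8, Utf16Le, Utf16Be, Utf32Le, Utf32Be }

/// Returns the detected BOM and its length in bytes, if any.
fn detect_bom(bytes: &[u8]) -> Option<(Bom, usize)> {
    // UTF-32 LE is checked before UTF-16 LE because FF FE 00 00 also starts
    // with FF FE. This is ambiguous with UTF-16 LE text that begins with NUL.
    if bytes.starts_with(&[0xFF, 0xFE, 0x00, 0x00]) {
        Some((Bom::Utf32Le, 4))
    } else if bytes.starts_with(&[0x00, 0x00, 0xFE, 0xFF]) {
        Some((Bom::Utf32Be, 4))
    } else if bytes.starts_with(&[0xEF, 0xBB, 0xBF]) {
        Some((Bom::Utf8, 3))
    } else if bytes.starts_with(&[0xFF, 0xFE]) {
        Some((Bom::Utf16Le, 2))
    } else if bytes.starts_with(&[0xFE, 0xFF]) {
        Some((Bom::Utf16Be, 2))
    } else {
        None
    }
}

fn decode_utf16(payload: &[u8], to_unit: fn([u8; 2]) -> u16) -> Option<String> {
    if payload.len() % 2 != 0 {
        return None; // truncated code unit
    }
    let units: Vec<u16> = payload
        .chunks_exact(2)
        .map(|pair| to_unit([pair[0], pair[1]]))
        .collect();
    String::from_utf16(&units).ok()
}

fn decode_utf32(payload: &[u8], to_unit: fn([u8; 4]) -> u32) -> Option<String> {
    if payload.len() % 4 != 0 {
        return None; // truncated code unit
    }
    // Collecting into `Option<String>` yields `None` on the first invalid scalar value.
    payload
        .chunks_exact(4)
        .map(|quad| char::from_u32(to_unit([quad[0], quad[1], quad[2], quad[3]])))
        .collect()
}

/// Decodes `bytes` to a `String` and drops the BOM, if one was found.
/// Assumption: input without a BOM is treated as UTF-8.
fn decode_text(bytes: &[u8]) -> Option<String> {
    match detect_bom(bytes) {
        None => String::from_utf8(bytes.to_vec()).ok(),
        Some((Bom::Utf8, len)) => String::from_utf8(bytes[len..].to_vec()).ok(),
        Some((Bom::Utf16Le, len)) => decode_utf16(&bytes[len..], u16::from_le_bytes),
        Some((Bom::Utf16Be, len)) => decode_utf16(&bytes[len..], u16::from_be_bytes),
        Some((Bom::Utf32Le, len)) => decode_utf32(&bytes[len..], u32::from_le_bytes),
        Some((Bom::Utf32Be, len)) => decode_utf32(&bytes[len..], u32::from_be_bytes),
    }
}
```

All of this happens on raw bytes before a `String` exists, which is why it fits a decoding layer better than `&str`/`String`.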


I’m not advocating for the proposal in this thread, but I wanted to highlight that, from experience, you need a BOM for Excel to properly parse CSVs as UTF-8. I’m sure there are other similar idiosyncrasies that force all sorts of software to insert BOMs for reliable interoperability.
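The writing side of that workaround might look like the sketch below. The function name `write_csv_with_bom` is made up, and the code deliberately skips CSV quoting and escaping, so it only works for fields without commas, quotes or newlines.

```rust
use std::io::{self, Write};

/// The UTF-8 encoding of U+FEFF.
const UTF8_BOM: &[u8] = b"\xEF\xBB\xBF";

/// Writes `rows` as CSV, preceded by a UTF-8 BOM.
/// (Hypothetical helper; no quoting or escaping of fields is performed.)
fn write_csv_with_bom<W: Write>(mut out: W, rows: &[&[&str]]) -> io::Result<()> {
    // The signature goes out exactly once, before any other byte.
    out.write_all(UTF8_BOM)?;
    for row in rows {
        out.write_all(row.join(",").as_bytes())?;
        out.write_all(b"\r\n")?;
    }
    Ok(())
}
```

Here the BOM is written explicitly by the code that knows the consumer wants it, which is another case where silent handling inside the string type would get in the way.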


I agree that the string data types should not be doing any kind of automatic stripping of the BOM.

Some data points:

A UTF-8 BOM is one of those cursed things that you usually don't plan around, but inevitably, someone will file an issue about it somehow sneaking into their data. It's usually pretty simple to just look for it and strip/ignore it if it's there. But I don't think it's a common enough issue for us to be doing something about it in std automatically. Futzing with the data inside a core primitive like &str is potentially very surprising behavior. It would also inhibit cases where you specifically want to keep the BOM around for whatever reason, maybe because you want to search something that you know is bit-for-bit identical to whatever data the user has.


It would be wrong to strip the BOM in String/str, as that would make them unusable to store arbitrary sequences of Unicode scalar values. What if I want a ZERO WIDTH NO-BREAK SPACE at the start of my string?


Then you are a person with very niche requirements who potentially should have to use a different type (like BString).

Note: I don't think it's reasonable to change this behavior now, and I don't think it's reasonable for Rust's String to do much other than whatever the Unicode standard says. But a BOM for UTF-8 only exists as a thing to strip out when it happens.

Unicode also discourages the use of "ZERO WIDTH NO-BREAK SPACE"

Unicode 3.2 adds a new character, U+2060 "WORD JOINER", with exactly the same semantics and usage as U+FEFF except for the signature function and strongly recommends its exclusive use for expressing word-joining semantics. Eventually, following this recommendation will make it all but certain that any initial U+FEFF is a signature, not an intended "ZERO WIDTH NO-BREAK SPACE".

But yeah, it's also a very unexpected thing to happen (even in File's read_to_string). It would just be for convenience, and because not everyone knows that a String can start with a BOM.

Ignoring all the prior discussion about back-compat and user confusion, which I also agree with, the RFC explicitly states:

It is important to understand that the character U+FEFF appearing at any position other than the beginning of a stream MUST be interpreted with the semantics for the zero-width non-breaking space, and MUST NOT be interpreted as a signature.

A String is not a data stream; it's just a fragment of text, so that would be the wrong position to handle it. The correct place to handle BOM stripping would be somewhere around the IO streams themselves. You could have a File wrapper for known-to-be-text files that automatically detects it and skips it when reading the file in.
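A minimal sketch of that idea, written as a free function over any `Read` rather than a full `File` wrapper. The name `read_text_skipping_bom` is hypothetical, and it only handles UTF-8.

```rust
use std::io::{self, Read};

/// Reads a whole UTF-8 stream into a `String`, dropping a signature
/// at the start of the stream only. (Hypothetical helper, not a std API.)
fn read_text_skipping_bom<R: Read>(mut source: R) -> io::Result<String> {
    let mut text = String::new();
    source.read_to_string(&mut text)?;
    if text.starts_with('\u{FEFF}') {
        // U+FEFF occupies three bytes in UTF-8; remove just that prefix.
        text.replace_range(..'\u{FEFF}'.len_utf8(), "");
    }
    Ok(text)
}
```

Because the function sees the beginning of the stream, it can tell a signature from a U+FEFF that appears later, which keeps its ZERO WIDTH NO-BREAK SPACE meaning and is left in place.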


I could maybe see this being in OpenOptions, but honestly it's more a thing for a text encoding API to handle, e.g. encoding::UTF8.with_bom(Bom::Strip). I don't think Rust wants to handle encodings in std any time soon?