Proposed lint: `rustdoc::unprefixed_html_id`

  • Feature Name: check_unprefixed_html_id and check_unprefixed_html_class

Summary

Have rustdoc produce a warning when inline HTML has a class= or id= attribute that doesn't match the pattern CRATENAME_name, and isn't part of a list of exceptions that go through FCP to stabilize as an actual API.

Motivation

In an HTML page, id and class both form global namespaces [1]. Rustdoc already uses this namespace, which means that doc authors may inadvertently write code that conflicts with newer versions of rustdoc, which has sometimes introduced new classes and IDs when adding or refactoring features.

This can also cause problems with various forms of inlining. For example, impl blocks will inline the documentation of the trait, including the HTML, which could result in ID conflicts between two different crate authors, even if neither of them conflict with rustdoc itself. Inlining was also one of the original motivating examples for intra-doc links, because rustdoc needs to compute different relative URLs when inlining into different pages.

Guide-level explanation

HTML section on how to write documentation

Inline HTML

As a standard Commonmark parser with no special restrictions, rustdoc allows you to write HTML whenever the regular markup isn't sufficient, such as advanced table layouts:

<table>
    <thead>
        <tr>
            <th>Name</th>
            <th>Description</th>
            <th>Category</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>traits</td>
            <td>static and dynamic dispatch</td>
            <td>generics</td>
        </tr>
        <tr>
            <td colspan="3">enums</td>
        </tr>
        <tr>
            <td rowspan="2">doc comments</td>
            <td colspan="2">reference documentation</td>
        </tr>
        <tr>
            <td>markdown comments</td>
            <td>attributes with custom syntax</td>
        </tr>
    </tbody>
</table>

This will render the same way Markdown tables do, even though Markdown tables don't support colspan or rowspan:

This works fine in rustdoc, GitHub, and mdBook, but Discourse doesn't allow rowspan or colspan for some reason.

Name Description Category
traits static and dynamic dispatch generics
enums
doc comments reference documentation
markdown comments attributes with custom syntax

HTML is not sanitized when included in the documentation, though improperly-nested tags will produce a build-time warning. It can be turned into an error by adding #![deny(rustdoc::invalid_html_tags)] to your crate.

warning: unclosed HTML tag `h2`
  --> $DIR/invalid-html-tags.rs:19:7
   |
LL | ///   <h2>
   |       ^^^^

warning: unclosed quoted HTML attribute on tag `p`
  --> $DIR/invalid-html-self-closing-tag.rs:19:14
   |
LL | /// <p style="x/></p>
   |              ^

Emphasis added for this proposal.

Additionally, IDs and classes should be prefixed with your crate's name, followed by an underscore. Since rustdoc sometimes includes excerpts of the documentation of your dependencies in your crate's documentation, this ensures you don't conflict with them. It can be turned into an error by adding #![deny(rustdoc::unprefixed_html_class)] and #![deny(rustdoc::unprefixed_html_id)] (this lint is currently a nightly-only feature).

warning: unprefixed HTML class `entry`
 --> lib.rs:1:17
   |
 1 | /// <div class="entry">Hello there! This DIV will warn!</div>
   |                 ^ add prefix: `mycrate_`

warning: unprefixed HTML class `entry`
 --> lib.rs:2:14
   |
 2 | /// <div id="entry">Hello there! This DIV will warn!</div>
   |              ^ add prefix: `mycrate_`

warning: 2 warnings emitted

We also recommend the following additional restrictions:

  • Start all doc comments with a one-sentence summary that doesn't use inline HTML. This summary will be used in contexts where arbitrary HTML cannot, such as tooltips.
  • Though JavaScript is allowed, many viewers won't run it. Ensure your docs are readable without JavaScript.
  • Do not embed CSS or JavaScript in doc comments to customize rustdoc's UI. If you want to publish documentation with a customized UI, invoke rustdoc with the --html-in-header command-line parameter to generate it with your custom stylesheet or script, then publish the result as pre-built HTML.

Reference-level explanation

New rustdoc-specific lints

unprefixed_html_class

This lint is warn by default. It detects inline HTML classes that do not start with the crate name, followed by an underscore _, and that haven't been stabilized as CSS classes available to doc authors.

For example, if your crate is named mycrate, then the first DIV will produce a warning, but the second DIV will not:

#![warn(rustdoc::unprefixed_html_class)]

/// <div class="entry">Hello there! This DIV will warn!</div>
///
/// <div class="mycrate_entry">This code is okay.</div>
warning: unprefixed HTML class `entry`
 --> lib.rs:1:17
   |
 1 | /// <div class="entry">Hello there! This DIV will warn!</div>
   |                 ^^^^^

warning: 1 warning emitted

The following classes are considered stable features that doc authors can use, and are exempt from this warning:

Name Recommended use Example
stab Mark up information about stability, platform support, and deprecation. <div class="stab portability">Linux only</div>
portability Use with stab for portability notes.
deprecated Use with stab for deprecation notes. <div class="stab deprecated">Use <code>foobar</code> instead.</div>

unprefixed_html_id

This lint is warn by default. It detects inline HTML IDs that do not start with the crate name, followed by an underscore _.

For example, if your crate is named mycrate, then the first DIV will produce a warning, but the second DIV will not:

#![warn(rustdoc::unprefixed_html_id)]

/// <div id="entry">Hello there! This DIV will warn!</div>
///
/// <div id="mycrate_entry">This code is okay.</div>
warning: unprefixed HTML class `entry`
 --> lib.rs:1:14
   |
 1 | /// <div id="entry">Hello there! This DIV will warn!</div>
   |              ^^^^^

warning: 1 warning emitted

This lint only checks places where IDs are defined, such as the id= attribute and the deprecated name= attribute on <a> tags. Attributes where IDs are used, such as <a href=> and aria-labeledby, are not checked.

Drawbacks

This is going to require extending rustdoc's HTML parser in rust/html_tags.rs at master · rust-lang/rust · GitHub. That ad-hoc parser is pretty complex, and given some of the features it already has (like sniffing for Rust paths) and its relationship to the Markdown specification (which has slightly stricter restrictions on what counts as HTML compared to HTML5 itself) an off-the-shelf parser won't be a complete solution in any case.

The bigger question is how much rustdoc should actually support inline HTML at all. The purpose of the invalid_html_tags lint was partially to help people out who never intended to use inline HTML. This lint is strictly intended for people who wrote something that looks like an ID or CLASS attribute, which means they were almost certainly doing it on purpose.

Rationale and alternatives

Why not rewrite the HTML id= attribute?

Rustdoc avoids generating conflicting IDs for headers and other sections by keeping track of all IDs on a page and adding suffixes. Further stability is achieved with a name-mangling-like scheme of adding extra information to the generated ID to avoid producing conflicts at all.

While it wouldn't actually be that hard to rewrite the id= attribute itself, making it work would also require rustdoc to rewrite all the places where IDs are referenced. I don't actually know all the places they can be used in an HTML document, but I know there's a lot:

  • CSS and JS
  • SVG <use>
  • aria-labeledby= and <label for=>
  • <button form=>
  • Anchor references, like https://doc.rust-lang.org/stable/std/index.html#keywords

Since anchor references can be outside the page, we would also want to try to keep the generated IDs as stable as possible, probably by adding more information rather than just adding numbers to the end. This guarantees that all the places where the ID is referenced need to be rewritten, though.

Also, this approach only works for IDs, not classes.

Why not have rustdoc prefix all its classes and IDs, and let the doc author do whatever they want?

This only solves the problem for conflicts between doc authors and rustdoc. It does not help with conflicts between two different doc authors getting inlined into a single page.

The other, bigger problem is that it breaks most existing anchor links. For example, https://doc.rust-lang.org/stable/std/index.html#keywords works today, and it's a rustdoc-generated ID. If all rustdoc IDs were prefixed, this link would break.

The third downside is that rustdoc classes and IDs are very common, so adding a prefix to all of them would bloat the page heavily. Most doc authors don't use inline HTML, and would not appreciate large amounts of overhead for a feature they don't use.

Why not shadow dom?

There are two main reasons why Shadow DOM is not a realistic solution here:

  • It requires JavaScript. While rustdoc does use JavaScript, it should be readable without it. Conflicting classes and IDs can result in unreadable docs.
  • It doesn't help with scrolling to a fragment.

Why CRATENAME_ and not CRATENAME-?

Rustdoc-builtin IDs and classes don't include underscores. They always use hyphens for separators, so if user-defined ones always contain an underscore, they can't conflict.

Rustdoc generates two other kinds of IDs:

Prior art

This is mostly formalizing a convention that's already being used to avoid conflicts. It's why hljs classes always start with hljs-, and fontawesome classes start with fa-.

The design choice that ensures user-defined IDs and classes in raw HTML always match the regex /^[[:ident:]]_/, while rustdoc never generates IDs that match this pattern, is a bit weird and subtle, but it's similar to the rule that Custom HTML Elements always contain a hyphen, while builtins never do.

Unresolved questions

This won't help with people including things like Mermaid JS and KaTeX. If people are going to use them, we'll need to just make sure we don't conflict with the classes generated by those JavaScript add-ons. Luckily, both of them are designed to be included into existing pages with minimal conflicts, but what sort of recommendation should we provide to anyone trying to make reusable JS files for rustdoc?

Rustdoc has poor support for two different crates in the same dependency tree with the same name. This can happen, though, for several reasons:

  • Two different major versions of the same crate might be transitive dependencies. This could easily be fixed by requiring the major version number to be part of the prefix, but in practice that would make bumping the major version of a crate annoying if the author made heavy usage of custom classes or IDs.
  • A Cargo package can rename its crate using lib.name (forked crates do this a lot). If a crate transitively depends on both the fork and the original, it can result in a conflict. Forcing someone to rename all the classes and IDs when forking a crate would also be annoying (and rustdoc currently doesn't know the Cargo package name anyway).
  • Two different crates within different cratespaces can have the same name. Whatever solution the Rust language itself arrives at for this will probably inform rustdoc's solution.
  • It can also have a binary and a library in the same workspace with the same name. That one probably isn't too difficult to cope with. Two crates in the same workspace might as well be one crate, since the problem can be fixed as long as it can be detected without having to worry about forking, [patch]ing, or otherwise removing dependencies that break the docs. Unfortunately, while it would be relatively easy to detect ID conflicts by scanning the document and warning on any duplicates, classes are supposed to allow duplicates, so there's no easy way to detect this.

Currently, running cargo doc on a crate that transitively depends on two crates with the same name will cause whichever one gets documented last to "win" and clobber the other. Coming up with a solution to that problem will inform the solution to the ID/class prefix problem as well.

Future possibilities

Currently, rustdoc checks for imbalanced HTML, unclosed attribute quotes, and, if this lint is written and merged, IDs and classes that are likely to create conflicts.

Should rustdoc check for HTML ID conflicts (presumably to warn about them)?

Should rustdoc warn on deprecated HTML, like <keygen> (which doesn't work in Firefox or Chrome) and <plaintext> (which is basically just going to break the page)?

Rustdoc has poor support for a lot of things people might want to do in HTML, like images and free-text pages that aren't intended to document a particular item (think overviews and tutorials). What goes between the quotes in <img src="">?


  1. see "alternatives" for why Shadow DOM isn't a solution ↩︎

7 Likes

(CSS does have a feature called namespaces and the @namespace at-rule, but this is for XML tag namespaces, not namespacing of CSS rules.)

Since a goal of this is to avoid conflicts, does rustdoc do mangling of crate names, or are crate name conflicts just considered to be rare enough in practice to ignore? Along the same line, are we using the library crate name, or the package name? Are binary crates required to use their name instead of using the same prefix as used by the library crate?

The output of cargo doc already doesn't handle multiple versions of the same package in the dependency tree; when this happens, only one version of the documentation ends up available, since they're put at the same path. I don't know exactly what the result is -- probably overlaying the two documentation builds arbitrarily based on what order they're performed. This is also probably the case if documenting two packages which have distinct names but both set lib.name[1] to the same identifier.

Using the crate name matches the path that rustdoc puts the built documentation under, including binary crate documentation when a package includes both bin and lib targets with different names. The multiple crates situation isn't just limited to binaries, either; in a workspace, multiple library crates exist, and there's imho no good reason to push them to use separate prefixes when they're developed together as a collaborative unit.

This seems tangentially related to the "cratespace" proposals; when a crate is a blessed member of a cratespace, it should probably be able to use that cratespace's prefix.

Is the intent for this to be added without intent to stabilize? If not, RFCs typically just describe the feature as if it's been stabilized, with the implicit understanding that all features go through a stabilization period. Any specific stabilization path notes are discussed separately, if there's specific information to add (e.g. portions which aren't intended to be stabilized, or if the described changes are expected to stabilize incrementally rather than as a unit).

Not directly related to the RFC (and an unresolved question), but this remains unfortunate. It would be nice to be able to publish stylesheets which apply only to your crate with the crate, such that downstream uses of cargo doc will see your crate documentation with the additional styling.

Thankfully, the solution is actually fairly simple in theory: as a cargo feature, allow specifying rustdoc configuration in the manifest which are used by cargo doc when documenting that package, similar to how docsrs allows RUSTDOCFLAGS today.


  1. It's very easy to forget that the library crate name used in code is allowed to be set independently of the package name used in the manifest. This is necessary for the automatic mapping of some-name to some_name, but packages don't have to use the automatic mapping and are allowed to set whatever unrelated value of lib.name they want to. ↩︎

If we're checking id/style prefixes, it probably also makes sense to do the same for the prefixes of any custom elements, since pseudonamespacing is actually a part of the HTML spec there.

Though since custom elements currently still require JS to utilize, rustdoc probably doesn't want to do anything that could be construed as encouraging their use, since nonreliance on JS execution is a deliberate feature of rustdoc.

I like this idea a lot, but why does the lint need to be restricted to nightly? I don't see any harm in letting people use it on stable, we can always remove it later if we choose to.

None of the hardcoded IDs found in the default map contain any underscores, so they won't be a problem.

Some IDs get inserted into the id_map when derived from headers. Currently, underscores are passed through when header names are slugified. That should be fixed to turn the underscores into dashes, to easily ensure these two things can't conflict.

The other IDs that rustdoc derives are related to associated items. These conventionally start with a single-word item kind (which contains no underscores) followed by a . period (which can't be part of a rust identifier, so there's no way it could be in a crate name). So there's no conflict there, either.

I'm imagining this lint being based on the --crate-name that gets passed to rustdoc (or, if none is passed, the one that gets inferred by the file name). This is the name that rustdoc actually has, unlike the package, workspace, or cratespace name, which only exist within Cargo AFAIK.

A new CLI parameter could be added to communicate these other names. What other things would rustdoc use this kind of information for?

You're right. I'll go ahead and fix that.

I'm going to tweak the proposal to not mention whether it's available on stable or nightly. This part can probably be discussed on the PR.

In any case, don't lints usually get merged as nightly-only first? invalid_html_tags did.

Because underscores are valid in crate names, if crates foo and foo_bar are being documented, then an ID like foo_bar_interaction is valid for both crates, under this lint rule. Some further protection is needed.

One possible solution would be to prohibit underscores that are not part of the crate name prefix. Another would be to require an explicit separator/namespace that is more verbose than one punctuation character.

4 Likes

Wouldn't any one punctuation character other than _ and - also work? / and : come to mind, although I recall that : would need to be escaped in CSS.

Rustdoc lints usually do, but not any other kind of lint, and I'm not sure why rustdoc should be special. Clippy handles this with "nursery" lints that are allow-by-default but can be enabled on any channel, which IMO seems like a better model: it avoids issues like the recent one where rustdoc::all enabled a nightly lint, and allows people on stable to give us feedback.

In general, I think the question we should be asking for nightly-only features is "what do we gain by preventing people from using it on stable"? In the case of doc(cfg), we avoid people adding doc(cfg) where they could potentially use doc_auto_cfg in the future; in the case of normalize_docs we avoid people enabling a flag that will be removed in the future, so we don't have to keep around a useless flag forever.

I'm not sure what we gain by making lints nightly-only, since we'll need to call register_removed for them even if they're removed before being turned on by default.

1 Like

To be clear, I'm not referring to crate names conflicting with IDs in use by rustdoc — I'm referring to two crates having the same --crate-name.

This occurs when using cargo doc where

  • Two versions of the same package are in the dependency tree, e.g. radium = "0.7" and radium = "1.0"
  • Two different packages set the same lib.name, e.g. forked-oldlib sets lib.name = "oldlib" to make migration just a manifest change
  • A local workspace crate has both a library and a binary target without custom names

Other than for this lint, the main thing rustdoc would utilize the other identifying information for is disambiguation when there's a crate name conflict. docsrs does its postprocessing to separate out multiple versions of package docs, putting docs at /package/version/crate/.

Rustdoc probably doesn't need to do anything like this, and cargo doc might take some of the responsibility (since cargo knows the package tree but rustdoc doesn't), but it's certainly beneficial for the package and version number to be available in the docs, even if they're not used to detect conflicts.

If documenting a library crate's dependency graph and that crate has multiple documented binary targets, it's probably more helpful to present the binaries differently than the libraries; perhaps nested under the library crate they're attached to, but at least somehow differentiated from the library crates, since they have no public API. (Binaries get documented with --document-private-items by default. cargo doc will document at least both the library and default binary target if they have distinct names.)

If cratespaces become a thing, that absolutely will need to be visible in the docs.

2 Likes

I don't like the situation here, either, but rustdoc already has bigger problems with handling these, and forcing people to include version numbers as prefixes in their inline HTML would be annoying (forcing you to update a potentially widely-used custom class whenever you bump the major version) and probably excessive (after all, this is only a problem if you not only have both crates in your dependency graph, but have them both inlined into a single HTML file).

Two crates within a shared workspace might as well be one crate — we give them a warning, and they fix it.

The situation with two versions of the same crate, or a crate that changes its lib.name, is not so easy. Sure, rustdoc can scan the HTML it produces and warn whenever it sees an ID conflict, but it might be buried somewhere deep in the dependency hierarchy with an unresponsive crate author, so fixing it would require a lot of forking, [patch]ing, or replacing dependencies.

We also can't really produce a warning for class conflicts.

1 Like

The complexities of possible name conflicts here suggest to me that it might be wise to take a totally different strategy: instead of requiring the author to provide a namespace prefix, automatically add namespace prefixes. Note that in particular, an intra-doc link foo::Foo#some-section already has to be resolved to a specific crate to generate the correct URL, so it's possible to add the ID prefix without needing any more name analysis than is already done.

There might need to be a syntax to distinguish links to rustdoc-generated sections (e.g. some_module#traits) from user-defined sections, though.

This is more work for rustdoc, but makes it easier and less verbose for authors to write correct documentation.

2 Likes

If I'm not mistaken, the current only cases where user documentation from one crate gets inlined into another is

  • the whole documentation for #[doc(inline)];
  • trait body documentation, with some limitations; and
  • the summary paragraph for type reëxports (which are recommended to not contain inline HTML).

Intra-doc links themselves only reference other crates. The main motivation of intra-doc links AIUI was less about inlining docs across crates and more just about making doc references position-independent. (This applies equally as much within a crate and includes cross-crate inlining as a case of general position independence.)

An interesting observation to be made is that even if inlined documentation doesn't cause any name conflicts, the host crate is generally unlikely to have the extra --html-in-header that the inlined crate has, meaning the documentation will still render wrong (although ideally it will degrade gracefully to unstyled but semantic HTML).

What about things like JavaScript enhanced rendering/behavior? Do we automatically provide a root .crate_name class that crates are expected to check for, or are we relying on crates manually adding a predicate prefixed name to every documented item to scope JS processing to in order to avoid JS conflicts?

And what about code blocks? ```mermaid,js produces <code class="lang-mermaid,js"> or similar; are unrecognized code block language tags also going to be linted against unless they're crate_name prefixed?

Perhaps the better behavior then is to "fix" inlining when extra --html-in-header is used, either by just disabling it (sad) or doing smarter inlining — namely, giving inlined doc pages the --html-in-header of the crate they're inlined from. If we do this, there's still a choice to be made between merging the two crates' --html-in-header (requiring some sort of cooperation to prevent conflicts) or only using the source crate's when documenting that crate's item, omitting the inliner crate's. This can't address trait function documentation or any future sub-page inlining (e.g. delegation), though; those pages do need to be able to fully merge extra --html-in-header relied on by inlined portions.

Giving rustdoc more information will of course be required in order to do anything like this. Even having a concept of crate-specific --html-in-header will require adding extra crate metadata to the manifest and then getting that metadata to rustdoc. Command-line --html-in-header without a crate association would of course continue to apply to all doc generation, being merged with any crate-specific --html-in-header.

Some sort of prefixing scheme is still necessary to separate rustdoc's namespace from the crates', but perhaps it's not even necessary to try to have rustdoc address intercrate conflicts. It should be good enough if rustdoc just enforces its own namespace and leaves handling crate conflicts to convention (and/or isolation). Some effort to encourage conflict avoidance is better than no conflict avoidance, even if it's not perfect, but given the drawbacks of encouraging specific prefixes (difficulty sharing between cooperating crates or using 3rd party resources), the limited potential for conflict (inlined docs where both have separate added --html-in-header without the outer being aware of the inner's), that the inlined documentation still degrades even without conflicts since its without its --html-in-header (since the outer doesn't know about it to provide it), and that we're only really addressing anchor and CSS conflicts but not JS, it seems sufficient to leave dealing with crate conflicts to crate convention.

Another namespacing option less restrictive on crates which also allows crates to utilize 3rd party sources using their own prefixes (e.g. fontawesome) is to limit rustdoc to using some fixed set of hyphen-prefixes (e.g. doc- and keyword- for any Rust keyword) for extensibility and generated names, along with reserving all unprefixed names without hyphens/underscores[1]. Rustdoc needs to move any existing kebab-case identifiers to being doc- prefixed, but this leaves crates free to utilize whatever prefixing scheme they want, so long as they use one (that wasn't one of the set rustdoc claimed).

I think this version of the lint keeps the spirit of the original formulation, if we treat addressing intercrate conflicts as a nongoal (i.e. leave it to crate convention). This could be made an opt-in (edition?) switch if it's considered too breaking of a restructure, or it could be done just as a part of the update which promises some stability in what names rustdoc uses. (I believe all the names used by rustdoc are currently formally unstable but defacto relied on?) All the standard tradeoffs of course still apply.

TL;DR: I have a small personal preference for having rustdoc claim a set of namespaces it will stick to, encouraging crates to avoid those (unless added to a stable list), and letting intercrate cooperation be handled purely by convention, instead of encouraging crates to stick to specifically crate_name_ prefixes.


  1. My rough expectation is that rustdoc would use doc- for anything unstable, unprefixed names for stable styles/anchors, and keyword- for generated anchor names for items defined with that keyword. ↩︎

I think there might be a bit of misunderstanding about what this point means:

  • Do not embed CSS or JavaScript in doc comments to customize rustdoc's UI. If you want to publish documentation with a customized UI, invoke rustdoc with the --html-in-header [command-line parameter] to generate it with your custom stylesheet or script, then publish the result as pre-built HTML.

If you are using custom classes or IDs in your doc comments themselves, those aren't actually part of rustdoc's UI. They're part of the doc author's UI, and their scripts should probably be embedded in the doc comment itself so that it gets inlined along with the documentation itself.

Does this new version of the bullet point make the idea a bit clearer?

  • Embedded CSS and JavaScript in doc comments should touch classes and IDs that are declared in the doc comment itself, and should not be used to customize rustdoc's UI. If you want to publish documentation with a customized UI, invoke rustdoc with the --html-in-header [command-line parameter] to generate it with your custom stylesheet or script, then publish the result as pre-built HTML.
2 Likes