Pre-RFC: Stable rustdoc URLs

jyn514 · September 18, 2020, 10:32pm

Summary

Make the URLs that rustdoc generates stable relative to the docs being generated, not just relative to the rustdoc version.

Motivation

Rustdoc generates a separate HTML page for each item in a crate. The URL for this page is currently stable relative to rustdoc; in other words, Rustdoc guarantees that updating rustdoc without changing the source code will not change the URL generated. This is a 'de facto' guarantee - it's not documented, but there's been no breaking change to the format since pre-1.0.

However, Rustdoc does not currently guarantee that making a semver-compatible change to your code will preserve the same URL. This means that, for instance, making a type an enum instead of a struct will change the URL, even if your change is in every other way semver-compatible. After this RFC, Rustdoc will guarantee that the URL stays the same.

The primary motivation for this feature is to allow linking to a semantic version of the docs, rather than an exact version. This has several applications:

docs.rs could link to /package/0.2/path instead of /package/0.2.5/path, making the documentation users see more up-to-date (rust-lang/docs.rs#1055)
blogs could link to exact URLs without fear of the URL breaking (rust-lang/rust#55160 (comment))
URLs in the standard library documentation would change less often (rust-lang/rust#55160)

Note that this is a different, but related, use case than intra-doc links. Intra-doc links allow linking consistently in the presence of re-exports for relative links. This is intended to be used for absolute links. Additionally, this would allow linking consistently outside of Rust code.

Guide-level explanation

Rustdoc will make the following changes to URL structure:

Item pages will be dependent only on the namespace, not the type of the item.

Consider the struct std::process::Command. Currently, the URL for it looks like std/process/struct.Command.html. This RFC proposes to change the URL to std/process/t.Command.html. Pages named kind.name.html would still be generated (to avoid breaking existing links), but would immediately redirect to the new URL.

Re-exports will generate a page pointing to the canonical version of the documentation.

Consider the following Rust code:
```
pub struct Foo;
```
Rustdoc currently generates a page for this at struct.Foo.html. Now, consider what happens when you move the struct to a different module and re-export it (which is a semver-compatible change):
```
pub mod foo { pub struct Foo; }
pub use foo::Foo;
```
This generates a page at foo/struct.Foo.html, but not at struct.Foo.html. After this change, rustdoc will generate a page at the top level which redirects to the version nested in the module.
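As a rough illustration of the first change, the sketch below maps a legacy file name to the name proposed here, which is what a redirect page at the old location would point to. The function name is hypothetical, and the list of legacy `kind` strings (`constant`, `attr`, `derive`, and so on) is an assumption about rustdoc's current output rather than something this proposal specifies.

```rust
/// Maps a legacy rustdoc file name such as `struct.Command.html` to the
/// namespace-based name proposed in this RFC (`t.Command.html`).
/// Returns `None` if the name does not look like `kind.Name.html`.
/// Hypothetical helper; the set of legacy kind strings is an assumption.
fn new_name_for_legacy(legacy: &str) -> Option<String> {
    let stem = legacy.strip_suffix(".html")?;
    let (kind, name) = stem.split_once('.')?;
    let prefix = match kind {
        // Type namespace.
        "struct" | "enum" | "union" | "trait" | "type" => "t.",
        // Macro namespace.
        "macro" | "attr" | "derive" => "m.",
        // Value namespace: no prefix in this proposal.
        "fn" | "constant" | "static" => "",
        // Anything else (for example `index.html`) is left alone.
        _ => return None,
    };
    Some(format!("{}{}.html", prefix, name))
}
```

For example, `new_name_for_legacy("struct.Command.html")` yields `t.Command.html`, and an enum of the same name would map to the same page.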

Reference-level explanation

Item pages will be dependent only on the namespace

Rust has three namespaces. For simplicity, this will only consider items that can be at the module level, since function locals cannot be documented.

The value namespace. This includes fn, const, and static.
The type namespace. This includes mod, struct, union, enum, trait, and type.
The macro namespace. This includes macro_rules!, attribute macros, and derive macros.

Rust does not permit there to be overlaps within a namespace; overlaps in globbing cause the glob import to be shadowed and unusable. This means that a name and namespace is always sufficient to identify an item.

Rustdoc will use the following links, depending on the namespace:

Name.html for values
t.Name.html for types
m.Name.html for macros

Rustdoc will continue to use directories (and index.html) for modules.
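The mapping above can be sketched as a small function. The `ItemKind` and `Namespace` enums and the `page_name` function are hypothetical names for illustration only, not rustdoc internals; modules are left out because they stay as directories.

```rust
/// Module-level item kinds considered by this RFC (modules excluded,
/// since they remain directories with an `index.html`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ItemKind {
    Fn,
    Const,
    Static,
    Struct,
    Union,
    Enum,
    Trait,
    TypeAlias,
    MacroRules,
    AttrMacro,
    DeriveMacro,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Namespace {
    Value,
    Type,
    Macro,
}

/// Which namespace an item kind lives in.
fn namespace_of(kind: ItemKind) -> Namespace {
    match kind {
        ItemKind::Fn | ItemKind::Const | ItemKind::Static => Namespace::Value,
        ItemKind::Struct
        | ItemKind::Union
        | ItemKind::Enum
        | ItemKind::Trait
        | ItemKind::TypeAlias => Namespace::Type,
        ItemKind::MacroRules | ItemKind::AttrMacro | ItemKind::DeriveMacro => Namespace::Macro,
    }
}

/// The page name depends only on the namespace and the item name.
fn page_name(kind: ItemKind, name: &str) -> String {
    match namespace_of(kind) {
        Namespace::Value => format!("{}.html", name),
        Namespace::Type => format!("t.{}.html", name),
        Namespace::Macro => format!("m.{}.html", name),
    }
}
```

Because the kind is only consulted through its namespace, changing `struct Command` to `enum Command` leaves the page name at `t.Command.html`.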

Re-exports will generate a page pointing to the canonical version

The redirect page will go in the same place as the re-export would be if it were inlined with #[doc(inline)] after this RFC.

There will not be a page generated at kind.name.html at the level of the re-export, since it's not possible for there to be any existing links there that were not broken.
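A minimal sketch of what generating such a redirect could involve: computing the relative link from the re-export's location to the canonical page, and emitting a tiny HTML page. Both function names are hypothetical, and the meta-refresh markup is an assumption about how a redirect might be written, not a description of rustdoc's actual output.

```rust
/// Relative URL from a page in `from_module` to `file` in `to_module`.
/// Module paths are given as directory segments below the crate root.
fn relative_target(from_module: &[&str], to_module: &[&str], file: &str) -> String {
    // Number of leading directory segments the two locations share.
    let common = from_module
        .iter()
        .zip(to_module.iter())
        .take_while(|(a, b)| a == b)
        .count();
    let mut out = String::new();
    // Climb out of the directories that are not shared.
    for _ in common..from_module.len() {
        out.push_str("../");
    }
    // Descend into the canonical module.
    for segment in &to_module[common..] {
        out.push_str(segment);
        out.push('/');
    }
    out.push_str(file);
    out
}

/// A small redirect page (assumed markup) pointing at `target`.
fn redirect_page(target: &str) -> String {
    format!(
        "<!DOCTYPE html>\n<html><head><meta http-equiv=\"refresh\" content=\"0; url={0}\"></head>\
         <body><a href=\"{0}\">Redirecting to {0}</a></body></html>\n",
        target
    )
}
```

For the `pub use foo::Foo;` example, the page at the crate root would redirect to `relative_target(&[], &["foo"], "t.Foo.html")`, that is, `foo/t.Foo.html`.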

Drawbacks

Rust is case-sensitive, but some filesystems (especially on Windows) are not, so there are naming collisions in the files Rustdoc generates (#76922). If Rustdoc combines several 'kinds' into one namespace, there will be more conflicts than currently:

struct Command; // page generated at `t.Command.html`
enum command {} // page generated at `t.command.html`

@nemo157 has kindly conducted a survey of the docs.rs documentation and found that there are about 700,000 items that currently overlap. After this change, that would go up to about 850,000 items. docs.rs has 308,064,859 total items in the inventory, so about 0.23% of files conflict today and about 0.28% would conflict after this RFC.

In the opinion of the author, since this is an existing problem, it does not need to be solved in order to go forward with the RFC.
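To make the collision problem and the percentages concrete, here is a small sketch that groups file names differing only by case and recomputes the two figures quoted above. The function names are hypothetical, and `to_lowercase` is only an approximation of how a case-insensitive filesystem compares names.

```rust
use std::collections::BTreeMap;

/// Groups of file names that would collide on a case-insensitive filesystem.
/// Lowercasing is used as an approximation of the filesystem's comparison.
fn case_collisions(files: &[&str]) -> Vec<Vec<String>> {
    let mut groups: BTreeMap<String, Vec<String>> = BTreeMap::new();
    for file in files {
        groups
            .entry(file.to_lowercase())
            .or_default()
            .push(file.to_string());
    }
    // Only groups with more than one member are actual conflicts.
    groups.into_values().filter(|g| g.len() > 1).collect()
}

/// `part` as a percentage of `total`.
fn percent(part: u64, total: u64) -> f64 {
    part as f64 / total as f64 * 100.0
}
```

With the survey numbers, `percent(700_000, 308_064_859)` and `percent(850_000, 308_064_859)` round to 0.23 and 0.28 respectively.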

Rationale and alternatives

How were the URLs chosen?

There were three main criteria for choosing the URLs (in vague order of priority):

They should be based on the namespace, not the 'kind' of the item. Otherwise there's not much point to the RFC, because the URLs won't be stable.
They should make sense when viewed; for example a, b, c would be bad choices for the names.
They should be fairly short, so they're easy to type; for example type_namespace. would not be a great choice.

t. and m. were partly chosen based on precedent in #35236 (but see #naming-alternatives below for the main reason).

Naming alternatives

Note that these names are easy to 'bikeshed' and don't substantially change the RFC.

Rustdoc could add a v. prefix for items in the value namespace. This would be more consistent with the other namespaces, at the cost of making the URLs for functions slightly confusing (favoring criterion 2 over criterion 3).
Rustdoc could lengthen the prefixes to type. and macro.. This makes the URLs easier to read, at the cost of making them more confusing for traits (consider type.Trait.html).
Rustdoc could use the existing specific names only when there is no risk of a semver-compatible change being able to change the kind. This would need careful inspection to make sure there is in fact no risk. It would also be slightly inconsistent with other URLs.

Alternatives

These alternatives are substantial changes to the RFC.

Rustdoc could stabilize the links it uses but drop backwards compatibility by not generating kind.name.html. This has little benefit over the RFC, other than slightly lower disk usage and implementation complexity.
Rustdoc could keep the status quo. This causes no new naming conflicts on Windows, but has the drawback that links could silently break even for semver-compatible changes.
Rustdoc could choose to make URLs stable neither across rustdoc versions nor the version of the code being documented, for example by using kind.name.SHA256SUM(rustdoc version).html. This makes it more clear that the URLs are not intended to be stable, at the cost of breaking links across much of the ecosystem.

Prior art

go doc generates all documentation on one page and uses URL hashes, without namespacing. This causes conflicts when two items from different namespaces are in the same package.
Java only allows classes at the top level, so javadoc has no need for namespacing. To distinguish between methods and fields, javadoc includes () in the URL fragment for methods.
Racket only allows functions at the top level, and so has no need for namespacing.
Doxygen names HTML pages after their C++ source files, and appends a random hash in the URL fragment to avoid namespace conflicts.

Unresolved questions

Is there a way to resolve the naming conflicts on Windows? If not, is that worth blocking the RFC, given there are existing conflicts?

Future possibilities

Rustdoc could stabilize page hashes:

Associated items for traits will use the same hash as for types, unless there is a conflict with the hash for a type.

A change from
```
struct S;
impl S { fn f() {} }
```
to
```
struct S;
trait T { fn f(); }
impl T for S { fn f() {} }
```
is semver compatible, but currently breaks the hash (it changes from #method.f to #tymethod.f). Rustdoc could change it to use #method.f when there is no conflict with other traits or inherent associated items. For example, the second version of the code above would use #method.f, but the code below would use #tymethod.f for the version in the trait:
```
struct S;
impl S { fn f() {} }

trait T { fn f(); }
impl T for S { fn f() {} }
```
This matches Rust semantics: S::f() refers to the function for the type if it exists, and the method for a trait it implements if not.
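The rule described here could be sketched as follows. The function and parameter names are hypothetical; the caller is assumed to already know which inherent methods and which methods from other traits exist on the type.

```rust
/// Fragment for a trait method `name` shown on a type's page.
/// Uses `method.` unless an inherent method or a method from another
/// trait has the same name, in which case it falls back to `tymethod.`.
fn trait_method_fragment(name: &str, inherent: &[&str], other_traits: &[&str]) -> String {
    let conflict = inherent.contains(&name) || other_traits.contains(&name);
    if conflict {
        format!("tymethod.{}", name)
    } else {
        format!("method.{}", name)
    }
}
```

For the second version of the code above (no inherent `f`), this gives `method.f`; for the last snippet, where `S` has both an inherent `f` and `T::f`, the trait's version gets `tymethod.f`.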
Associated items for traits will contain the name of the trait if there is a conflict.

Currently, the from function in both of the trait implementations has the same hash:
```
enum Int {
	A(usize),
	B(isize),
}
impl From<usize> for Int {
	fn from(u: usize) -> Self {
		Int::A(u)
	}
}
impl From<isize> for Int {
	fn from(i: isize) -> Self {
		Int::B(i)
	}
}
```
This means it is impossible to refer to one or the other (which has caused trouble for intra-doc links). Rustdoc could instead include the name of the trait and its generic parameters in the hash: #method.from-usize.from and #method.from-isize.from. It is an unresolved question how this would deal with multiple traits with the same name, or with types containing characters that can't go in URL hashes (such as ()). Rustdoc could possibly use percent-encoding for the second issue.
All other URL fragments would be kept the same:
- #variant.{name} for enum variants
- #structfield.{name} for struct fields
- #variant.{parent}.field.{name} for anonymous structs in enums (enum Parent { A { field: usize }}). This may require redesign to avoid conflicts in fields between different variants.
- #associatedconstant.{name} for associated constants in traits. This may require redesign when RFC 195 is implemented.
- #associatedtype.{name} for associated types (same as above)
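Returning to the two `From` impls above, one possible reading of the `method.from-usize.from` scheme is sketched below: lowercase the trait name, append each generic argument after a `-`, and percent-encode anything outside the URL-unreserved set. All names here are hypothetical, and the exact format (including the lowercasing) is an assumption inferred from the example, not something the proposal pins down.

```rust
/// Percent-encodes every byte that is not an RFC 3986 "unreserved" character.
fn percent_encode(input: &str) -> String {
    let mut out = String::new();
    for byte in input.bytes() {
        match byte {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' => {
                out.push(byte as char)
            }
            _ => out.push_str(&format!("%{:02X}", byte)),
        }
    }
    out
}

/// Fragment for `method` in `impl Trait<generic_args> for ...`.
/// Assumed format: `method.{trait-lowercased}-{arg}-{arg}.{method}`.
fn impl_method_fragment(trait_name: &str, generic_args: &[&str], method: &str) -> String {
    let mut impl_id = trait_name.to_lowercase();
    for arg in generic_args {
        impl_id.push('-');
        impl_id.push_str(&percent_encode(arg));
    }
    format!("method.{}.{}", impl_id, method)
}
```

Under these assumptions the two impls get `method.from-usize.from` and `method.from-isize.from`, and a hypothetical `impl From<()>` would get `method.from-%28%29.from`. Two different traits with the same name would still collide, which is the open question noted above.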

elidupree · September 18, 2020, 11:13pm

It's definitely possible.

We could technically encode the capitalization in the URL. For example, struct Name could become t.Name.clll.html (c for capital, l for lowercase, for each letter). Of course, this would lead to ugly URLs.

An adaptation would be to only do this encoding for items with nonstandard capitalization, which means that in practice, almost all items would have normal-looking URLs. The only catch is modules, which have a different capitalization convention than the other items in their namespace. Maybe modules could simply have a separate prefix. (Is it ever semver-compatible to replace a module with a struct/trait, or vice versa?)

There's also plenty of room for bikeshedding on how to represent the encoding.
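A minimal sketch of the per-letter encoding, for concreteness. The function names are hypothetical, and treating every character that is not uppercase (digits, underscores) as `l` is an assumption the post does not spell out.

```rust
/// One marker per character: `c` for uppercase, `l` for everything else.
fn case_suffix(name: &str) -> String {
    name.chars()
        .map(|c| if c.is_uppercase() { 'c' } else { 'l' })
        .collect()
}

/// Page name for a type-namespace item with the capitalization encoded.
fn mangled_type_page(name: &str) -> String {
    format!("t.{}.{}.html", name, case_suffix(name))
}
```

With this, `struct Name` becomes `t.Name.clll.html`, and two names that differ only by case always get different suffixes, so the files stay distinct on a case-insensitive filesystem.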

jyn514 · September 18, 2020, 11:17pm

We could technically encode the capitalization in the URL. For example, struct Name could become t.Name.clll.html (c for capital, l for lowercase, for each letter). Of course, this would lead to ugly URLs.

Oof, this solution might be worse than the problem. Ideally the links would be easy to type, not just easy to copy.

(Is it ever semver-compatible to replace a module with a struct/trait, or vice versa?)

I think so - if the module only exposed one type you could replace it with a struct that had that associated type.

mod M {
   type T = usize;
}

->

struct M;
trait MyTrait { type T; }
impl MyTrait for M {
    type T = usize;
}

elidupree · September 18, 2020, 11:23pm

You can't do the replacement in that direction, because use M::T; is allowed for a module, but not for a struct.

In fact, I'm pretty sure you can't ever replace a module with a struct or vice versa. If you have a module, the user is always permitted to say

use M::*;

which is never permitted for a struct. And if you have a struct, the user is always permitted to say

type K = M;

which is never permitted for a module.

EDIT: The one case that still has me concerned is, you can say use M::*; for an enum, and I don't know anything that would prevent you from replacing mod M {} with enum M {}.

EDIT 2: Of course, it seems thoroughly improbable that someone would need to include a module with no items in it in their API, and then have links to it be forever backwards compatible after they change it to an enum.
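The argument above can be checked with a small compilable example. The names `m`, `E`, `S`, `K` and `demo` are made up for illustration; the two commented-out lines are the ones the compiler rejects.

```rust
mod m {
    pub fn helper() -> u8 {
        1
    }
}

enum E {
    A,
    B,
}

struct S;

use m::*; // glob import from a module: allowed
use E::*; // glob import of enum variants: also allowed
type K = S; // a type alias needs a type on the right-hand side

// use S::*;    // rejected: a struct cannot be glob-imported from
// type K2 = m; // rejected: a module is not a type

fn demo() -> (u8, bool) {
    let _alias: K = S;
    // `helper` comes from `m::*`, `A` comes from `E::*`.
    (helper(), matches!(A, E::A))
}
```

This is why a module can plausibly only be swapped for an enum (both accept a glob import), and even then only if downstream code never used the name as a type.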

camelid · September 19, 2020, 2:11am

Looks like a neat idea! What do you mean by this though?

Did you mean something like this?

i.Name.html for items

camelid · September 19, 2020, 2:12am

I understand why this is necessary, but it's unfortunate that it will generate so many HTML files. Although I guess they will be pretty small.

elidupree · September 19, 2020, 3:07am

After thinking it over more, I think the only reasonable cases of semver-compatible changes are:

switching between struct, enum, union, and type.
switching whether an item is a pub use or the original.
MAYBE converting a static to a const, if your documentation warned users not to rely on it to have a single address.

For the other conversions:

You essentially can't convert between modules and types/traits, as above.
You can't convert a trait to anything else (someone could have used it as a bound).
You can only convert a type to a trait if you're willing to make your users use deprecated trait objects without dyn.
You can't convert a function to a const or static (someone might have called it).
You can't convert a const to a static (someone might have used it in a const context).
…I don't technically see why you couldn't convert a const or static to a function, but just like with the "empty module to enum" case, any const or static that could be converted to a function would be useless.

This means we can mostly keep the current URLs, only unifying struct, enum, union, and type. That should address the main issue, while avoiding the confusion of "Why is my fn item written with a v?"

So my proposal would be:

struct, enum, union, and type all use the format type.[Name].html. (With redirects from [kind].[Name].html for backwards compatibility.)
MAYBE unify const and static the same way (to value.[NAME].html, I guess?)
pub use declarations will generate redirects.
All other URLs remain the same as at present.
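For comparison with the namespace-based scheme, this narrower proposal could be sketched as below. The function name is hypothetical, and the legacy kind strings (`constant`, `fn`, `trait`, and so on) are assumed to match what rustdoc emits today.

```rust
/// Page name under the narrower proposal: only the interchangeable kinds
/// are unified, everything else keeps its current `kind.` prefix.
fn alt_page_name(kind: &str, name: &str) -> String {
    let prefix = match kind {
        // Kinds that can be swapped for each other in a semver-compatible way.
        "struct" | "enum" | "union" | "type" => "type",
        // The "MAYBE" part of the proposal.
        "constant" | "static" => "value",
        // Traits, functions, macros, etc. are left untouched.
        other => other,
    };
    format!("{}.{}.html", prefix, name)
}
```

Here `struct Path` and `enum Path` both land on `type.Path.html`, while `trait.Read.html` and `fn.read.html` stay as they are today.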

As for the name collision issue… It turns out I was wrong above, we can't just mangle names with nonstandard capitalization, because Took and ToOk are both perfectly reasonable CamelCase names. So it's either mangle all CamelCase names ever, or only mangle names in case of collisions (which sacrifices URL stability in that case, because it means that when you add a colliding name - which is a semver-compatible thing to do - you change the URL of the item that was there before). I'm guessing that "mangle all names ever" would be a nonstarter at this point, so it's just a question of – in the case of collisions, which would we rather sacrifice, "working correctly on Windows" or "having the URLs look normal and be stable"? I'd personally be inclined not to touch that decision in this RFC (and given that my proposal is slightly more limited than yours, it would probably cause a smaller increase in collisions)

jyn514 · September 19, 2020, 3:38am

No - everything rustdoc generates documentation for is an item. The v is to distinguish the namespace that it's in; if rustdoc doesn't distinguish namespaces there could be naming collisions.

jyn514 · September 19, 2020, 3:40am

So my proposal would be:

struct, enum, union, and type all use the format type.[Name].html. (With redirects from [kind].[Name].html for backwards compatibility.)

MAYBE unify const and static the same way (to value.[NAME].html, I guess?)

pub use declarations will generate redirects.

All other URLs remain the same as at present.

This seems a lot more inconsistent. The rule is now "usually type. or value., except in edge-cases rustdoc thought wouldn't be important". I'm not sure what benefit that brings over the scheme I proposed.

jyn514 · September 19, 2020, 3:43am

Rustdoc actually used to generate this many files until recently: https://github.com/rust-lang/rust/pull/70563. So in itself it doesn't seem a giant drawback. If after a while the old URLs fall out of use, rustdoc could think about removing the redirects to save space, but that's definitely a future extension and I don't propose it here.

elidupree · September 19, 2020, 3:51am

The benefit would be that there are fewer collisions (e.g. mod foo and struct Foo do not collide), and---

Wait a minute, silly me - modules aren't called mod.name.html in the first place because they are actually subdirectories instead.

Well, it'd still avoid collisions between fn foo and const FOO. And avoid the confusion of why functions are v when one doesn't normally think of a function as a value. But I agree that these benefits aren't enormous and there are reasons to prefer your way.

ogoffart · September 19, 2020, 8:26am

Regarding the case conflict, rustdoc could generate mangled version only in the unlikely event a conflict actually occurs for this name. When that happens, it could generate a "disambiguation" page for the un-mangled name that would have links to the different options. (similar to wikipedia's disambiguation page)
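A rough sketch of how that could work for type-namespace items: names that are unique up to case keep their plain page, while colliding names get mangled pages plus one disambiguation page linking to them. Everything here is hypothetical, including the `Page` type, the reuse of the `c`/`l` capitalization suffix suggested earlier in the thread, and the choice to name the disambiguation page after the lowercased name.

```rust
use std::collections::BTreeMap;

#[derive(Debug, PartialEq)]
enum Page {
    /// A normal item page.
    Item { file: String },
    /// A page that only lists links to the mangled item pages.
    Disambiguation { file: String, targets: Vec<String> },
}

/// `c` for uppercase characters, `l` for everything else (assumed encoding).
fn case_suffix(name: &str) -> String {
    name.chars()
        .map(|c| if c.is_uppercase() { 'c' } else { 'l' })
        .collect()
}

/// Plans the pages to emit for a set of type-namespace item names.
fn plan_type_pages(names: &[&str]) -> Vec<Page> {
    // Group names that differ only by case.
    let mut groups: BTreeMap<String, Vec<&str>> = BTreeMap::new();
    for &name in names {
        groups.entry(name.to_lowercase()).or_default().push(name);
    }
    let mut pages = Vec::new();
    for (folded, members) in groups {
        if members.len() == 1 {
            // No conflict: keep the pretty URL.
            pages.push(Page::Item { file: format!("t.{}.html", members[0]) });
        } else {
            // Conflict: mangle each member and link them from one page.
            let targets: Vec<String> = members
                .iter()
                .map(|n| format!("t.{}.{}.html", n, case_suffix(n)))
                .collect();
            pages.push(Page::Disambiguation {
                file: format!("t.{}.html", folded),
                targets: targets.clone(),
            });
            pages.extend(targets.into_iter().map(|file| Page::Item { file }));
        }
    }
    pages
}
```

Note that adding a colliding name later still turns the existing plain page into a disambiguation page, so old links keep resolving but take one extra click.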

jyn514 · September 19, 2020, 11:52am

Nice, I like that! Then the URL is still 'stable' (because there's a link to the correct page on the disambiguation page) and in the vast majority of cases when it's not an issue, the URLs look pretty.

dhm · September 19, 2020, 12:29pm

It generally isn't.

Obviously replacing a type with a module is not semver compatible (e.g., mod String { ... }), since people could have used that name where a type is expected, which is not possible with a module.
And replacing a module with a type is not semver compatible either since we lose the ability to use name::item.

I feel like some of these things won't read "very well". Take, for instance, v.read:

https://doc.rust-lang.org/std/fs/v.read.html

I have thus two suggestions / remarks:

It feels to me like one of these three namespaces could be "promoted" to a non-prefix version. And I am, of course, mainly thinking of function items, which although they occupy the value namespace, technically, it is not something "obvious" for many Rustaceans.
- https://doc.rust-lang.org/std/fs/read.html
As for the other two namespaces, do we explicitly need them to be shorthands? What about:
- https://doc.rust-lang.org/std/macro.panic.html as is currently the case
- https://doc.rust-lang.org/std/path/type.Path.html

These are just my .02, regarding my "first impression" and where to nitpick. I may very well be the only one feeling this way, so some poll could be advisable

Also, for the initiative as a whole. Being able to provide URLs to items that are less likely to get "not up to date" (click here for the latest version banner on top), is an awesome thing to have!

elidupree · September 19, 2020, 12:40pm

The trouble with type is that a trait is not a type.

This is essentially the rationale for my proposal above. A type is a type, and a trait is a trait, a fn is a fn, and a macro is a macro; those names reflect how they're used, they'll make sense to readers, and you would never need to swap one of them for a different one. Conversely, it DOES make sense to swap the different categories of type (struct, enum, union, type alias) with each other.

The only complication is const and static because both of those are used as values, so it seems vaguely plausible that a library might document "this is currently a static but I might change it to a const later".

jyn514 · September 19, 2020, 12:51pm

I think I've even used type.* by accident in one of my comments so I'm happy to expand it out from t in the RFC.

It feels to me like one of these three namespaces could be "promoted" to a non-prefix version. And I am, of course, mainly thinking of function items, which although they occupy the value namespace, technically, it is not something "obvious" for many Rustaceans.

I like this! I agree 'functions are in the value namespace' is not super intuitive.

Also, the initiative as a whole. Being able to provide URLs to items that are less likely to get "not up to date" (click here for the latest version banner on top), is an awesome thing to have!

(edit: whoops, posted too early, will be editing a lot)

jyn514 · September 19, 2020, 12:57pm

The trouble with type is that a trait is not a type.

I don't think the names need to match exactly how they're used in Rust code. The criteria I used when coming up with the URLs is:

They should be based on the namespace, not the 'kind' of the item. Otherwise there's not much point to the RFC, because the URLs won't be stable.
They should be fairly short, so they're easy to type (I'm thinking of poor @RalfJung writing LaTeX by hand).
They should make sense when viewed. Choosing a, b, c for the namespaces is out for this reason; macro, value, and type are a pretty clear improvement over m, v, and t for this reason.

I think marking 'traits' as type in the URL might not be perfectly clear, because you don't normally think of them as types, but it's more consistent than 'only some things in the type namespace are marked as types'. 'type' is also consistent with how rustc talks about this internally. In general I don't think the URL needs to match your intuition perfectly because most of the time you won't be looking at it.

elidupree · September 19, 2020, 1:14pm

The scenario I'm imagining is that someone will be looking for a trait, click to the trait page, glance at the URL bar, think "oh wait, I was looking for a trait but I accidentally clicked to a type", and go back to look elsewhere for the trait. To be fair, I don't think this will happen very often, but I think it's worth worrying about for the design.

This sounds like a great idea!

While we're talking about name mangling, though… do we need to think about how this might interact with non_ascii_idents? From the non_ascii_idents RFC:

This RFC keeps out-of-line modules without a #[path] attribute ASCII-only. The allowed character set for names on crates.io is not changed.

Note: This is to avoid dealing with file systems on different systems right now. A future RFC may allow non-ASCII characters after the file system issues are resolved.

It seems plausible that we might ultimately want to mangle Unicode names in documentation URLs in order to be compatible with old filesystems; if we're going to do that, it might make sense for the "case insensitive name collision" mangling to use the same mangling scheme. But that's still an open question, and I don't know if we'd want to block this RFC on making that decision, and I also don't know if we'd want to pick a mangling scheme for this RFC if it might make us end up with 2 different mangling schemes later. So I feel like the pragmatic thing is for this RFC to leave filesystem compatibility as a question to resolve later.

jyn514 · September 19, 2020, 1:27pm

I updated the RFC with some of the discussion. I'm still not convinced that type.Trait.html is so confusing it should be changed, but I added it to the unresolved questions section.

jyn514 · September 19, 2020, 1:29pm

I'm also interested in opinions on the 'URL fragments' section - does it seem fleshed out enough to add to the main RFC? Or should I keep it under 'Future Extensions'?

My main use case is that I really want to make intra-doc links work for this (https://github.com/rust-lang/rust/issues/76895), but I wouldn't want to block the rest of the RFC on it.
