Deadlock about fixing outdated documentation links in search engines


#1

There are two serious problems with Rust’s documentation:

  1. Rust-related search queries are polluted by outdated copies of old books, old libstd, unfinished content that shipped with docs in 2015, and a ton of second edition redirects and stubs lacking content.

  2. There’s some deadlock or organizational blind spot about this issue. It’s been reported many, many times, but the reports haven’t resulted in any meaningful action.


The SEO problem:

Many Rust-related queries in search engines bring up the old first edition of the book, and stubs from a removed copy of the second edition that was meant only for offline doc users. Occasionally, libstd links from historical Rust versions pop up as well.

In some cases the situation is so bad that the current version of the book isn’t even in the search results at all! I presume it’s being incorrectly removed as duplicated content, losing to the “there’s no book here” stubs.


The unfixability problem:

One way to solve this problem would be to block undesirable copies in robots.txt (there are also other ways like redirects and canonical links, but these solutions ended up being rejected/postponed for various reasons).

However, the robots.txt file is not in any public repo. I’ve heard on Discord that it’s manually uploaded to an S3 bucket. Is there a procedure for changing it? If not, could we create one?
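
To make the robots.txt idea concrete, here is a rough sketch of rules that would hide the old copies, plus a check with Python's standard `urllib.robotparser`. The three directory names are assumptions based on the edition URLs discussed in this thread, and `is_crawlable` is a made-up helper. This is not the file that is actually deployed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: the directory names are assumptions taken from the
# edition URLs mentioned in this thread, not the real deployed robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /book/first-edition/
Disallow: /book/second-edition/
Disallow: /book/2018-edition/
"""


def is_crawlable(url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a generic crawler may fetch `url` under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("*", url)


if __name__ == "__main__":
    # Example URLs only; the page names are illustrative.
    for url in (
        "https://doc.rust-lang.org/book/index.html",
        "https://doc.rust-lang.org/book/first-edition/index.html",
        "https://doc.rust-lang.org/book/second-edition/index.html",
    ):
        print(url, "crawlable" if is_crawlable(url) else "blocked")
```

Note that robots.txt only stops crawling. Pages that are already indexed may take a while to drop out of results.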

Having the exact same copy of the book for packaging (offline use) and for the online version makes sense in general, but in the case of the book reorganization it has created an absurd situation where the #1 thing you find online is a message that there’s no book for offline use.

Responses I’ve heard to this imply that doc deployment is unfixable, and docs are doomed to use the most inflexible primitive hosting that can’t do anything. Really? Can’t online deployment do rm -rf of the folder intended for offline use only? Can I set up a server for you that supports 301 redirects? Can I give you a new robots.txt file to upload? Maybe Rust needs an SEO-WG?


#2

We’d have to ask @rust-lang/infra about it; I thought it would have been in src/doc, but it appears not! That seems bad.

That’s a mis-characterization, and continuing to be so aggressive doesn’t help your cause.

What I said was, there are a number of constraints on the design, and many of the suggestions have ignored those constraints. We need either a design that fits into those constraints, or an RFC that changes the constraints.

I am not 100% sure what exactly this proposal is suggesting.

Doing this is part of the yet-unimplemented https://github.com/rust-lang/rust/issues/44687


#3

Sorry.

Can’t online deployment do rm -rf of the folder intended for offline use only?

I am not 100% sure what exactly this proposal is suggesting.

This issue is about the book online being confusing, because users find these pages:

The 2018 edition of the book is no longer distributed with Rust’s documentation. If you came here via a link or web search, you may want to check out the current version of the book instead. If you have an internet connection, you can find a copy distributed with Rust 1.30.

This text displayed online on doc.rust-lang.org doesn’t make any sense. The book is online. The user clearly is online. It’s information that’s technically-correct on file://localhost, but being on doc.rust-lang.org makes it irrelevant and confusing.

One possible fix for this would be to do rm -rf book/*-edition when uploading documentation from Rust’s distribution to doc.rust-lang.org. This way users would not stumble upon “the book is no longer distributed with Rust’s documentation” placeholders. If these pages didn’t exist, search engines wouldn’t be picking them and unhelpfully directing users to them.
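
As a rough sketch of that cleanup step, assuming the docs are staged in a local directory before upload: the function name `prune_offline_stubs` and the staging layout are made up for illustration, and this is not how the real deployment is scripted.

```python
import shutil
from pathlib import Path


def prune_offline_stubs(doc_root: Path) -> list[str]:
    """Delete every book/*-edition directory under doc_root.

    Returns the names of the removed directories, sorted, so the deploy
    log can show what was dropped. `doc_root` is a hypothetical local
    staging copy of the docs, not a real path from Rust's infrastructure.
    """
    removed = []
    for stub_dir in sorted((doc_root / "book").glob("*-edition")):
        if stub_dir.is_dir():
            shutil.rmtree(stub_dir)
            removed.append(stub_dir.name)
    return removed
```

Only the online copy would be pruned this way. The offline distribution would keep its placeholders, where the message is accurate.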


#4

It’s all good.

Okay, so I disagree with your characterization here, but I can also appreciate that people may find it confusing. We need something that makes sense in both places and is accurate in both places. Maybe two sentences would be better. What do you think about that?

This is generally considered not acceptable because it would break all previous links. And there are a lot of these links. That’s the whole root of this problem! Getting a 404 would be even more upsetting than the current page.


#5

Getting a 404 would be even more upsetting than the current page.

I admire that you don’t want to break old links. If these links were handled with a 301 redirect, that would be perfect.

However, the HTML-based solution links only to the Foreword in the current edition, and Google is too confused about these stubs to find chapters of the current edition, so even though it’s not technically a 404, I still can’t find the content I was looking for, and I’m upset when I land on these URLs. If you don’t want to delete them, then hiding them with robots.txt would be fine too.

The more text there is on these pages, the more they look like real content, so it only makes sifting through search results take longer.

Apart from a 301 redirect, the next best thing would be something like:

<meta http-equiv="refresh" content="0; url=/new-url">
<a href="/new-url">Moved</a>

Nothing to read and misunderstand, and too little text to confuse search engines into returning this page instead of the up-to-date book.
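
Generating such stubs could look roughly like this. It's a sketch with a made-up helper name (`redirect_stub`) and example URLs. It emits the standard `http-equiv="refresh"` form, where the content is the delay in seconds followed by `url=` and the target.

```python
from html import escape


def redirect_stub(new_url: str) -> str:
    """Build a near-empty HTML page that immediately forwards to new_url.

    Hypothetical helper for illustration. The target is escaped so that
    characters like & and " stay valid inside the HTML attributes.
    """
    target = escape(new_url, quote=True)
    return (
        "<!DOCTYPE html>\n"
        f'<meta http-equiv="refresh" content="0; url={target}">\n'
        f'<a href="{target}">Moved</a>\n'
    )


if __name__ == "__main__":
    # Example target only; the real mapping from old to new pages
    # would have to come from the book's maintainers.
    print(redirect_stub("/book/index.html"))
```

Each old page would get its own stub pointing at the closest matching page, wherever such a match exists.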


#6

Yep! This is why it’s a bunch of intertwined issues; if we could 301, then it’d be no big deal, but since we can’t right now, this is what we did.

Yes, this was a mistake, basically. Some people stepped up and sent in PRs, and so this will eventually ride to stable and be fixed!

That said, I agree that a robots.txt seems good in this situation.

So, the reason we didn’t go with this is basically an accident of history: when we had the first, second, and 2018 editions, we wanted to point the first edition to either second or 2018. That meant two different URLs, which meant that picking this one wouldn’t work. Furthermore, not every page has an exact correspondence in the new book either, as their contents are not identical.

Now that we’d be going to one single URL, this is more feasible, I think.


#7

I see that there are plans and ongoing work on proper long-term improvements and fixes in these areas. But I’m worried that this also expands the scope and adds dependencies to the work beyond directly addressing the urgent SEO problem (for example, https://github.com/rust-lang/rust/issues/44687 is taking time to develop a version selection widget), so Rust will remain without Googlable documentation for months or years.