Full text search for rustdoc and doc.rs

Vorpal · March 3, 2024, 12:05am

Consider this post a way to gather information before proposing a change (pre-pre-pre rfc?)

Is there any reason not to have full text search for rustdoc/docs.rs? I imagine client-side FTS (I know that exists; mdbook supports it, for example).

This would help in many cases where the term you are looking for isn't used in the name of a type but only in the documentation text itself. My most recent example was looking for how to "replace" using regex-automata (since regex supports this). It turns out the word I should have used was "interpolate". I ended up asking the maintainer in GitHub Discussions, taking up part of their valuable time. I have had similar issues with crates like bpaf and clap before (trying to search their tutorial chapters).

If we want to keep server bandwidth down, we could not enable FTS by default but require a prefix symbol in the search for it. Then the majority of searches wouldn't use it.

So my questions really are:

What is the reason this hasn't been done already? Is it just that no one got around to it, or is there a good technical reason?
What is the correct process for proposing something like this? I'm not a rustc developer and I haven't really contributed anything beyond some bug reports and some discussions so far.
I assume this would need to be implemented on the client side (js) which isn't my expertise at all, but I could probably implement the generation of indexes etc on the rust side. But perhaps the logic can be mostly lifted wholesale from mdbook.

drmason13 · March 3, 2024, 7:16pm

I can't comment on reasoning or process, but I think meilisearch (GitHub: meilisearch/meilisearch, "A lightning-fast search API that fits effortlessly into your apps, websites, and workflow") could be a good fit for the problem at hand.

Vorpal · March 3, 2024, 7:31pm

Thank you.

I took a look earlier today to see what rustdoc used. It was some library that is in maintenance mode and recommends that another one be used for new things. I went and looked, and that one too is now in maintenance mode, with a few recommendations of its own. Some of those seemed interesting, but I didn't spend too much time on them. At least a few used Rust + wasm.

I think it is perhaps too early to discuss tech stack choices though, and that should be decided upon once we know that we want to have this at all. But I'll keep your suggestion in mind for when that point comes.

Well, I know I want it, and I also got a few likes on my post above. But I would really like input from someone on the rustdoc and/or docs.rs teams. I have this nagging feeling there must be more to it, or someone would already have implemented this long ago.

Nemo157 · March 3, 2024, 8:22pm

Rustdoc vs docs.rs would be very different features for such a thing. For rustdoc it seems like it should be possible to build an expanded full-text search-index, download that and query it client-side; for docs.rs doing such a thing site-wide would result in far too big an index to do client-side. Doing just the per-crate index in rustdoc doesn't require any docs.rs input (except maybe ensuring we support the appropriate mime-types if it's not using js+json).

Vorpal · March 3, 2024, 8:43pm

EDIT: misread you, too late here. We basically agree I think.

I thought the search in the generated documentation on docs.rs was per-crate anyway? The site wide search is just for crate names and crate metadata, right?

I was only talking about the in-crate search. Then whatever solution works for rustdoc should work for the per-crate search on docs.rs. We just need to ensure that the downloaded index isn't needlessly large:

Maybe we should pre-compress it with gzip, and use a compact format in the first place (not json).
We could shard the index. Off the top of my head (and I'm sure there are smarter ways to do this, but I'm not an expert on this, yet at least): You have one file with n-gram to page id index, then you shard the full text data that you use to confirm your n-gram matches. This could make sense for large crates (tokio for example). Sharding should be based on size.
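
A minimal sketch of the n-gram idea above, assuming trigrams and an in-memory page list. The names `Page`, `trigrams`, `build_index`, and `search` are hypothetical and not part of rustdoc; in a real design the text used for confirmation would live in separately downloaded shards rather than in memory.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical page record: an id plus the full documentation text.
pub struct Page {
    pub id: u32,
    pub text: String,
}

/// Split lowercased text into overlapping 3-character n-grams.
pub fn trigrams(text: &str) -> BTreeSet<String> {
    let chars: Vec<char> = text.to_lowercase().chars().collect();
    chars
        .windows(3)
        .map(|w| w.iter().collect::<String>())
        .collect()
}

/// Build the compact "n-gram -> page ids" index that would be fetched first.
pub fn build_index(pages: &[Page]) -> HashMap<String, BTreeSet<u32>> {
    let mut index: HashMap<String, BTreeSet<u32>> = HashMap::new();
    for page in pages {
        for gram in trigrams(&page.text) {
            index.entry(gram).or_default().insert(page.id);
        }
    }
    index
}

/// Return the ids of pages whose text contains `query` (case-insensitive).
pub fn search(
    pages: &[Page],
    index: &HashMap<String, BTreeSet<u32>>,
    query: &str,
) -> Vec<u32> {
    let needle = query.to_lowercase();
    let grams = trigrams(&needle);

    // Step 1: intersect the posting lists of every trigram in the query.
    let mut sets = grams
        .iter()
        .map(|g| index.get(g).cloned().unwrap_or_default());
    let candidates: BTreeSet<u32> = match sets.next() {
        Some(first) => sets.fold(first, |acc, set| {
            acc.intersection(&set).copied().collect()
        }),
        // Queries shorter than three characters have no trigrams,
        // so every page stays a candidate.
        None => pages.iter().map(|p| p.id).collect(),
    };

    // Step 2: confirm each candidate against the full text, since sharing
    // all trigrams does not guarantee that the phrase itself occurs.
    pages
        .iter()
        .filter(|p| candidates.contains(&p.id))
        .filter(|p| p.text.to_lowercase().contains(needle.as_str()))
        .map(|p| p.id)
        .collect()
}
```

Only step 2 needs the full text, so that is the part that could be sharded and fetched on demand.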

I don't believe it makes sense to make a docs.rs specific server side solution for the in-crates search.

Lonami · March 3, 2024, 11:03pm

Not a full solution, but a doc alias would help in this situation, without the complexity and size cost of full-text search:

#[doc(alias = "replace")]
pub fn interpolate(...);

IMO a controlled list of well-known aliases is better than a full-text search bringing everything and polluting the results.
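
For reference, a compilable variant of the snippet above. The function body and the second alias are made up for illustration; only the `#[doc(alias = "...")]` attribute itself is the rustdoc feature being discussed.

```rust
/// Expands every `{}` placeholder in `template` with `value`.
///
/// This is a hypothetical stand-in, not the regex-automata API.
/// Each alias below makes rustdoc's search suggest this item when the
/// user types that word.
#[doc(alias = "replace")]
#[doc(alias = "substitute")]
pub fn interpolate(template: &str, value: &str) -> String {
    template.replace("{}", value)
}
```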

Vorpal · March 4, 2024, 7:31am

I know of this feature. But:

It requires extra effort from the maintainer of the code, effort that often will not be put in I suspect. It would be interesting to get some statistics on how often the alias feature is used in practice.
It also requires them to try to think outside their own mindset ("I think A is the most logical name, but some people may think B makes more sense").
It seems impractical for crates using a module as a tutorial chapter (such as clap, which has many tutorial and reference chapters using modules built just on docs.rs). There would be a large list of keywords to try to keep in sync manually. (Perhaps we could get @epage input on this one.)

What could be done is to require a specific prefix (such as !f, or maybe I should suggest !bikeshed instead for right now) to enable full text search. This avoids the full text matches by default for both the reason you specified as well as (potential) performance benefits of not having to download the index.

(We could even make it an option, stored on the client side, rustdoc already has those for theme and a few other things.)
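
A rough sketch of how such a prefix could be routed, assuming `!f` as the placeholder prefix from above. `SearchMode`, `FTS_PREFIX`, and `parse_query` are hypothetical names; rustdoc's real search is written in JavaScript and has no such hook today.

```rust
/// Which search backend a query should be routed to.
#[derive(Debug, PartialEq)]
pub enum SearchMode<'a> {
    /// Default name/alias search; no full text index is needed.
    Api(&'a str),
    /// Opt-in full text search; only this path would fetch the FTS index.
    FullText(&'a str),
}

/// Placeholder prefix (still subject to bikeshedding).
pub const FTS_PREFIX: &str = "!f";

/// Decide which search mode a raw query string asks for.
pub fn parse_query(input: &str) -> SearchMode<'_> {
    let trimmed = input.trim();
    match trimmed.strip_prefix(FTS_PREFIX) {
        // Require whitespace (or nothing) after the prefix, so that a
        // query such as "!foo" is still treated as a normal API search.
        Some(rest) if rest.is_empty() || rest.starts_with(char::is_whitespace) => {
            SearchMode::FullText(rest.trim_start())
        }
        _ => SearchMode::Api(trimmed),
    }
}
```

With this shape, the default path never touches the full text index, which is where the potential bandwidth saving would come from.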

epage · March 4, 2024, 2:40pm

A pretty big, more general limitation with alias is that you only get suggestions for aliases on exact word matches. For example, if you go to the winnow docs and type in many, you won't see any aliases, but change it to many0 and you'll see it.

Another limitation for this use case is an alias can only exist on an API item, like a mod. For longer-form documentation, like in Clap or Winnow, you can't have an alias to a specific point in the document. Using an alias for this is kind of a "here is the general area, good luck finding what you are looking for".

I also feel like using an alias like this kind of messes with its intent which can negatively impact people finding what they are looking for (which is also related to the next item).

I do think some UX would need to be worked out as full-text search results are a different category than named / aliased search results. If I type in a generic word that is also an item in the API, I don't want the generic entries flooding out the API item.

Vorpal · March 4, 2024, 3:49pm

This is a very good point. How about showing the results in two sections: API matches on top, then a clear horizontal divider, followed by FTS results?

Add a suitable heading to each section, of course, and a message for when one or the other section has no results.

We could even deduplicate the results, so don't show full text matches for things that were already found with API search. Not sure if that one would hurt or help though.
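
A small sketch of the two-section layout with deduplication, assuming results are identified by item path. `SectionedResults` and `section_results` are hypothetical names, and whether deduplication actually helps is left open, as noted above.

```rust
use std::collections::HashSet;

/// Search results split into the two proposed sections.
#[derive(Debug, PartialEq)]
pub struct SectionedResults {
    /// Name/alias matches, shown on top.
    pub api: Vec<String>,
    /// Full text matches, shown below the divider.
    pub full_text: Vec<String>,
}

/// Keep API hits as they are and drop any full text hit that the API
/// search already found, preserving the original order of both lists.
pub fn section_results(api_hits: Vec<String>, fts_hits: Vec<String>) -> SectionedResults {
    let full_text: Vec<String> = {
        let seen: HashSet<&str> = api_hits.iter().map(String::as_str).collect();
        let kept: Vec<String> = fts_hits
            .into_iter()
            .filter(|path| !seen.contains(path.as_str()))
            .collect();
        kept
    };
    SectionedResults {
        api: api_hits,
        full_text,
    }
}
```

An empty `api` or `full_text` vector is the case where the UI would show the "no results" message for that section.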

Nemo157 · March 4, 2024, 4:23pm

Or it could be a fourth tab ("In Description" maybe) next to the three that already exist

Vorpal · March 4, 2024, 4:24pm

I had not even noticed that feature. I should probably take a closer look at the UI if I'm considering changing things.

notriddle · March 8, 2024, 9:00pm

The team has talked about this before, but I'm only really speaking for myself here.

https://rust-lang.zulipchat.com/#narrow/stream/266220-t-rustdoc/topic/full.20text.20search.20in.20rustdoc.20.2F.20doc.2Ers

It's size and quality.

Size

rust-lang/rust issue: "rustdoc's search-index.js file is huge for large projects" (opened by jandem, 03:13PM - 03 Feb 16 UTC; labels: T-rustdoc, C-enhancement)

See https://bugzilla.mozilla.org/show_bug.cgi?id=1245213#c3 We should fix SpiderMonkey / the profiler to be smarter, but 15.2 MB is a _lot_ of JS to load on every page load. Looks like this file will also create a ton of JS objects/strings. Can we load this file only when we're using the search bar? Maybe we can split it up somehow and load only the relevant parts? Or come up with a more efficient format for it?

On most crates, the largest part of the search index is the truncated descriptions. Except windows-rs, which is an outlier and should not be counted. Full text search would undoubtedly make the total size much bigger, because it would need to store the whole description, and not just a truncated one. mdBook has not solved this issue: The Book's search index is bigger than the standard library's.

We really should implement sharding at some point (at least for truncated descriptions), but that would only help with some of the problems discussed in the above issue. It wouldn't reduce the total size of the generated docs, the time spent compiling them, or the worst-case amount of data that the client downloads.

Quality

Do you actually use the built-in mdBook search? I rarely find what I'm looking for with it. I've spent ten minutes trying to get it to find what I expected to be there, only to instantly bring up the right page in DuckDuckGo.

It's instructive to look at other PL communities. Elixir's ExDoc has a client-side-js search engine, but they're talking about building a second search engine using a server and sqlite3 instead because of scaling issues (they want to be able to search across all of hex.pm).

the8472 · March 9, 2024, 9:23pm

One could apply database techniques such as building a keyword index, matching on that, and only then loading complete descriptions from sharded files. Though file:/// support would make this difficult.
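
A sketch of that lookup order under simple assumptions: descriptions are packed greedily into size-bounded shards, and a keyword index records which shards mention each word. `Item`, `assign_shards`, `keyword_index`, and `shards_to_fetch` are hypothetical names, and the file:/// loading problem is not addressed here.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical documented item with its full description text.
pub struct Item {
    pub id: u32,
    pub description: String,
}

/// Greedily pack item ids into shards of at most `max_bytes` of description
/// text each. An item larger than `max_bytes` gets a shard of its own.
pub fn assign_shards(items: &[Item], max_bytes: usize) -> Vec<Vec<u32>> {
    let mut shards: Vec<Vec<u32>> = Vec::new();
    let mut current: Vec<u32> = Vec::new();
    let mut used = 0usize;
    for item in items {
        let size = item.description.len();
        if !current.is_empty() && used + size > max_bytes {
            shards.push(std::mem::take(&mut current));
            used = 0;
        }
        current.push(item.id);
        used += size;
    }
    if !current.is_empty() {
        shards.push(current);
    }
    shards
}

/// Build the small "keyword -> shard numbers" index that is loaded first.
pub fn keyword_index(items: &[Item], shards: &[Vec<u32>]) -> BTreeMap<String, BTreeSet<usize>> {
    let mut index: BTreeMap<String, BTreeSet<usize>> = BTreeMap::new();
    for (shard_no, ids) in shards.iter().enumerate() {
        for item in items.iter().filter(|i| ids.contains(&i.id)) {
            let words = item
                .description
                .split(|c: char| !c.is_alphanumeric())
                .filter(|w| !w.is_empty());
            for word in words {
                index.entry(word.to_lowercase()).or_default().insert(shard_no);
            }
        }
    }
    index
}

/// Shards whose full descriptions must be loaded to answer `keyword`.
pub fn shards_to_fetch(index: &BTreeMap<String, BTreeSet<usize>>, keyword: &str) -> Vec<usize> {
    index
        .get(&keyword.to_lowercase())
        .map(|shards| shards.iter().copied().collect::<Vec<usize>>())
        .unwrap_or_default()
}
```

A query for a word that appears in no description fetches no shard at all, so only the keyword index is downloaded in that case.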

Vorpal · March 10, 2024, 1:32pm

You make some good points, that will definitely have to be considered for a concrete RFC. Some short notes though:

I find per-crate full text search more useful than ecosystem-wide search. I want to know if this library has what I'm looking for, not if someone mentioned it somewhere else. I don't know of any easy and reliable way to make DuckDuckGo or Google search only a single crate (and a specific version of said crate), especially if it is a recently released crate that has yet to be fully indexed.
We definitely need efficient storage formats and sharding. I don't believe this is insurmountable though. As part of making an RFC on this I would definitely look into what the best existing options are.
A server based search would be suboptimal. I like that I can run cargo doc locally and use it even if I don't have Internet currently. And I'm not up for trying to implement two different searches.
Quality: this could need some work, but it isn't obvious to me that a server-based FTS would give better results than a pre-generated one. I have not used FTS in the official Rust book (nor do I remember googling because I couldn't find things there), but I have in other mdbooks and it worked fine. I have also used it with things based on mkdocs, and there too it worked fine.

As for worst-case download size: you are right. But that is just it, the worst case. If we design things properly, the worst case should be rare, and would amount to a somewhat subpar search experience for some particular searches (results might appear slowly, with more results appearing as you look).

And if we are worried about malicious actors, there are much easier ways to DoS than to perform searches in browsers. (I hear botnets are pretty effective, but don't tell any bad guys.)

As for server load, I like @Nemo157's suggestion of adding an extra tab to the search. We could opt to not show the number of matches until you click that tab (and so not perform full text search in the background). We should also not make full text search the default tab. API search is better when it works, and this way users would only go look at the FTS results when they don't find what they are looking for in the main results.

notriddle · March 10, 2024, 11:55pm

When I speak of the "total size," I don't actually mean the amount of data that's downloaded by someone running a search on docs.rs.

I'm talking about the complaints coming from people who push their docs to GitHub Pages, the S3 bill that docs.rs has to pay, and the rust-docs.tar.gz bundle that you download with rustup. Those groups are hurt when the total amount of rustdoc output grows in size, regardless of whether the web browser has to load it.

If you didn't notice the tab bar, then a hundred other people probably didn't notice it either. It isn't discoverable enough. If it's not the default tab, it won't be helpful to the new users who need it the most.

system · June 8, 2024, 11:55pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.
