Ideas to enhance rustdoc search speed


#1

The search field in rustdoc is quite slow, and grows slower the larger the project is. It would be really cool if the speed could be improved.

E.g. take some larger projects like rustc, or servo to see the very slow speed when searching something:

ATM it seems to iterate over all words in the search index and check whether the searched word matches based on substring and levenshtein criteria for every single word. This might be okay for small projects, but not for huge projects like above.

Maybe we could implement fuzzy search via tries or similar means? Or maybe even switch to a fully fledged search engine that we compile to wasm, like tantivy?


#2

A link to the code in question, and one to the core algorithm.

Another link to an awesome crate by @burntsushi, maybe it is useful: https://github.com/BurntSushi/fst


#3

Another approach that may be worth considering is using multiple web workers to parallelize the search.


#4

I imagine there’s room to come up with a very simple index here that doesn’t use a full blown search engine. The core ideas in tantivy (and even Lucene) are deliciously simple. Here’s a brief sketch. Please ask questions if anything is unclear!

  • rustdoc can build an n-gram index (probably with n={2,3,4}). This index is a simple map from every n-gram to a list corresponding to the words/functions/types that contain that n-gram.
  • A search is broken into three phases. The first phase extracts all of the n-grams from the query. The second phase uses the aforementioned index to find all words/functions/types corresponding to each n-gram in the query. The third phase executes the same search that is done today on the results from the second phase.

If you stopped right there, I bet you’d wind up with a nice win over the current state. Further tweaking may be desirable:

  • If there is an n-gram that occurs in the vast majority of all possible search results, then a query that contains that n-gram will cause this search technique to degrade to the current implementation of search. (This is typically where a search engine will use something called stop words. The cost is a drop in recall, but it’s usually worth it.)
  • If you wanted to get really fancy, you could add the documentation text itself to the n-gram index and then apply tf-idf to rank the search results.
  • Experiment with tokenizing schemes other than n-grams, particularly if you’re indexing natural language rather than just programming language identifiers. (A search engine will use stemming or lemmatization, depending on your sophistication level.)

I think this could be a fun project for anyone who wanted to take it on. I’d be happy to help mentor it.

Of course, parallelizing the search might go a long way to solving the most pressing problem. :slight_smile: And also, FWIW, the searches in the servo docs aren’t that slow for me. There’s clearly a little bit of a delay, but it’s not awful.


#5

In Javascript? Or would we accept reading from an index on disk? If so, sqlite has a full text search system so people can search from their local installs. Or move to use a documentation server which can use rg as a library.


#6

Yes. The point of my post is that there is a simple indexing scheme that someone can implement. That does not rule out use of heavier tools, but it’s not at all obvious that they are necessary. Of course, I always encourage anyone to do their own experiments and form their own conclusions. :slight_smile:


#7

Shot in the dark: Wouldn’t this be the perfect for creating a rust to wasm showcase? Using all the existing high performance crates implementing an optimized search would be very easy, and wasm will give an additional performance boost for free. The current search could stay as a fallback for older browsers.