I imagine there’s room to come up with a very simple index here that doesn’t use a full-blown search engine. The core ideas in tantivy (and even Lucene) are deliciously simple. Here’s a brief sketch. Please ask questions if anything is unclear!
- rustdoc can build an n-gram index (probably with n={2,3,4}). This index is a simple map from every n-gram to a list of the words/functions/types that contain that n-gram.
- A search is broken into three phases. The first phase extracts all of the n-grams from the query. The second phase uses the aforementioned index to find all words/functions/types corresponding to each n-gram in the query. The third phase executes the same search that is done today on the results from the second phase.
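The index and the first two phases can be sketched in a few lines of Rust. This is just an illustration, not how tantivy does it; the trigram choice and all names here are mine:

```rust
use std::collections::{HashMap, HashSet};

/// Extract all n-grams from a name, lowercased. Names shorter than `n`
/// are indexed whole so they remain findable.
fn ngrams(s: &str, n: usize) -> HashSet<String> {
    let chars: Vec<char> = s.to_lowercase().chars().collect();
    if chars.len() < n {
        return std::iter::once(chars.iter().collect()).collect();
    }
    chars.windows(n).map(|w| w.iter().collect()).collect()
}

/// Phase 0 (done once, at doc build time): map each n-gram to the
/// indices of the names that contain it.
fn build_index(names: &[&str], n: usize) -> HashMap<String, Vec<usize>> {
    let mut index: HashMap<String, Vec<usize>> = HashMap::new();
    for (i, name) in names.iter().enumerate() {
        for g in ngrams(name, n) {
            index.entry(g).or_default().push(i);
        }
    }
    index
}

/// Phases 1 and 2 of a search: extract the query's n-grams, then
/// intersect the posting lists to get candidate names. Phase 3 (today's
/// search) would then run only over these candidates.
fn candidates(index: &HashMap<String, Vec<usize>>, query: &str, n: usize) -> HashSet<usize> {
    let mut result: Option<HashSet<usize>> = None;
    for g in ngrams(query, n) {
        let hits: HashSet<usize> = index
            .get(&g)
            .map(|v| v.iter().copied().collect())
            .unwrap_or_default();
        result = Some(match result {
            None => hits,
            Some(r) => r.intersection(&hits).copied().collect(),
        });
    }
    result.unwrap_or_default()
}
```

The point is that the intersection in phase 2 shrinks the candidate set dramatically before the (comparatively expensive) existing matching logic ever runs.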
If you stopped right there, I bet you’d wind up with a nice win over the current state. Further tweaking may be desirable:
- If there is an n-gram that occurs in the vast majority of all possible search results, then a query that contains that n-gram will cause this search technique to degrade to the current implementation of search. (This is typically where a search engine will use something called stop words. The cost is a drop in recall, but it’s usually worth it.)
- If you wanted to get really fancy, you could add the documentation text itself to the n-gram index and then apply tf-idf to rank the search results.
- Experiment with tokenizing schemes other than n-grams, particularly if you’re indexing natural language rather than just programming language identifiers. (A search engine will use stemming or lemmatization, depending on your sophistication level.)
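For the tf-idf idea above, the standard scoring formula is small enough to sketch directly. This is the textbook formulation with a smoothed idf, not anything tantivy-specific, and the function name is mine:

```rust
/// Classic tf-idf: how often a term appears in one document, weighted by
/// how rare the term is across the whole collection. `doc_freq` is the
/// number of documents containing the term; `num_docs` is the collection
/// size.
fn tf_idf(term_count: usize, doc_len: usize, doc_freq: usize, num_docs: usize) -> f64 {
    let tf = term_count as f64 / doc_len as f64;
    // Smoothed idf: a term appearing in every document still scores > 0,
    // but rare terms are weighted much more heavily.
    let idf = ((num_docs as f64 + 1.0) / (doc_freq as f64 + 1.0)).ln() + 1.0;
    tf * idf
}
```

A query term that appears in only a handful of docs (say, `BufWriter`) would then outrank a term that appears nearly everywhere (say, `new`), which is exactly the behavior you want from ranked results.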
I think this could be a fun project for anyone who wanted to take it on. I’d be happy to help mentor it.
Of course, parallelizing the search might go a long way to solving the most pressing problem.
And also, FWIW, the searches in the servo docs aren’t that slow for me. There’s clearly a little bit of a delay, but it’s not awful.