I've invested some time to research this, mostly by comparing with the Go version, that uses Rabin-Karp: https://cs.opensource.google/go/go/+/refs/tags/go1.18.3:src/internal/bytealg/bytealg.go;l=129
In order to compare the results, I've created haystack strings of different sizing, having from 500 to 1365470 characters, containing the MDs from the Rust book and the nomicon. I've also created different sizes of needles, having from 20 to 50 characters. I've searched for needles existing and not existing within the haystack.
Here is the Rust code:
use std::time::Instant;
const EXISTING_NEEDLE: &str = "Example: Implementing Vec";
const MISSING_NEEDLE: &str = "this definitely does not exist in the rust book";
fn find_needle(haystack: &str) -> bool {
haystack.contains(&EXISTING_NEEDLE)
}
fn search_missing_needle(haystack: &str) -> bool {
haystack.contains(&MISSING_NEEDLE)
}
fn print_duration_of_f(name: &str, f: impl FnOnce() -> bool) {
let now = Instant::now();
let found = f();
println!("{name} took {} nanoseconds and returned {found}", now.elapsed().as_nanos());
}
fn main() {
let haystack = include_str!("../data/text.md");
print_duration_of_f("existing needle", || { find_needle(haystack) });
print_duration_of_f("missing needle", || { search_missing_needle(haystack) });
}
And the Go code:
package main
import (
"fmt"
"os"
"strings"
"time"
)
var EXISTING_NEEDLE = "Example: Implementing Vec"
var MISSING_NEEDLE = "this definitely does not exist in the rust book"
func main() {
data, err := os.ReadFile("../data/text.md")
if err != nil {
fmt.Println("Could not open file:", err)
os.Exit(1)
}
data_string := string(data)
start := time.Now()
fmt.Print("Finding the needle returned ",
strings.Contains(data_string, EXISTING_NEEDLE),
" in ")
fmt.Print(time.Since(start).Nanoseconds(), " nanoseconds\n")
start = time.Now()
fmt.Print("Searching a missing needle returned ",
strings.Contains(data_string, MISSING_NEEDLE),
" in ")
fmt.Print(time.Since(start).Nanoseconds(), " nanoseconds\n")
}
For small haystacks, there is close to no difference between the two languages, Rust being in some cases a bit faster. The bigger the haystack gets, the faster Go gets. Another very important aspect that influences the speed difference is the number of collisions between the substrings that are a prefix of the needle and the substrings of the haystack. For example, this string will be a lot faster to find using Rabin-Karp: This means that our current design will always
. The reason is a haystack containing This
and This means
and This means that
will trigger a lot of collisions when bruteforcing, causing the Rust algorithm to check all chars for nothing. The string 99999999999999
does not contain prefixes that can trigger any partial match, so the brute force will not get slowed down by false positives on collisions for substrings of the needle.
When testing for a haystack of 1365470 characters and a needle with a lot of collisions, Go was up to 70 times faster, since that is the perfect scenario for Rabin-Karp.
We should also take into consideration that the implementation of Rabin-Karp would also speed up any situation when we are searching for a small array of bytes in a bigger array of bytes, so this implementation is not limited only to strings.
Using Rabin-Karp (or any rolling hash algorithm) for small strings is a bad decision, so if we want to add this to str.contains
or String.contains
we should add a limit of max characters for brute force, like this.
My personal opinion is that this should be implemented in the Searcher as an alternative to the brute-force, and we should also offer the method that allows rust to automatically decide between brute-force and a rolling-hash algorithm.