Using a more efficient string matching algorithm

fn main() {
    // `str` is allowed as a variable name, but it shadows the type name,
    // so `s` is used here instead.
    let s = "hello";
    println!("{}", s);

    let rs = s.contains("el");
    println!("{}", rs);
}

It seems that Rust's string matching is done by a brute-force loop. I recommend using a more efficient approach, such as Boyer-Moore or Rabin-Karp.

When both the haystack and the needle are very large, this can significantly improve efficiency.

Why do you think string matching is done by a brute force loop?

Inspecting the code, it appears to me that string matching is supposed to use the Two-Way Algorithm, which is quite efficient in general. (See rust/pattern.rs at 946a88a989acdcc3b0d05a666eaac0db414ec2cd · rust-lang/rust · GitHub et seq.) It's possible that this algorithm is not being used because of a bug somewhere; it's also possible that it has large constant factors that make your very simple test case slower than ideal. We need more detail about what you observed to be slow, and under what conditions.


As with most advanced string-search algorithms, the naïve implementation may be more efficient on small-enough instances

Is my wild guess.
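For context, here is a sketch of what such a naive implementation looks like; this is purely illustrative, not the code std actually uses:

```rust
// A brute-force substring search, shown only to illustrate what the
// "naive" strategy means; std's str::contains does NOT work this way.
fn naive_contains(haystack: &str, needle: &str) -> bool {
    let (h, n) = (haystack.as_bytes(), needle.as_bytes());
    if n.is_empty() {
        return true;
    }
    if n.len() > h.len() {
        return false;
    }
    // Try every alignment: O(h.len() * n.len()) comparisons in the worst
    // case, but almost zero setup cost, which is why it can win on tiny
    // inputs.
    (0..=h.len() - n.len()).any(|i| &h[i..i + n.len()] == n)
}

fn main() {
    assert!(naive_contains("hello", "el"));
    assert!(!naive_contains("hello", "xyz"));
}
```

The "almost zero setup cost" point is the crux: fancier algorithms pay per-search preprocessing that a tiny input never amortizes.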

Yup, that is something I'd agree with (although unsure if it applies to the OP's case specifically, since they didn't give enough details). The principle also generalizes fairly well. It's reasonably common to have to sacrifice latency (or something) to get throughput improvements. See also: memchr - Rust


I've invested some time to research this, mostly by comparing with the Go version, that uses Rabin-Karp: https://cs.opensource.google/go/go/+/refs/tags/go1.18.3:src/internal/bytealg/bytealg.go;l=129

In order to compare the results, I've created haystack strings of different sizes, from 500 to 1365470 characters, containing the Markdown files from the Rust book and the nomicon. I've also created needles of different sizes, from 20 to 50 characters, and searched for needles both present in and absent from the haystack.

Here is the Rust code:

use std::time::Instant;

const EXISTING_NEEDLE: &str = "Example: Implementing Vec";
const MISSING_NEEDLE: &str = "this definitely does not exist in the rust book";

fn find_needle(haystack: &str) -> bool {
    haystack.contains(&EXISTING_NEEDLE)
}

fn search_missing_needle(haystack: &str) -> bool {
    haystack.contains(&MISSING_NEEDLE)
}

fn print_duration_of_f(name: &str, f: impl FnOnce() -> bool) {
    let now = Instant::now();
    let found = f();
    println!("{name} took {} nanoseconds and returned {found}", now.elapsed().as_nanos());
}

fn main() {
    let haystack = include_str!("../data/text.md");
    print_duration_of_f("existing needle", || { find_needle(haystack) });
    print_duration_of_f("missing needle", || { search_missing_needle(haystack) });
}

And the Go code:

package main

import (
	"fmt"
	"os"
	"strings"
	"time"
)

var EXISTING_NEEDLE = "Example: Implementing Vec"
var MISSING_NEEDLE = "this definitely does not exist in the rust book"

func main() {
	data, err := os.ReadFile("../data/text.md")
	if err != nil {
		fmt.Println("Could not open file:", err)
		os.Exit(1)
	}

	data_string := string(data)

	start := time.Now()
	fmt.Print("Finding the needle returned ",
		strings.Contains(data_string, EXISTING_NEEDLE),
		" in ")
	fmt.Print(time.Since(start).Nanoseconds(), " nanoseconds\n")

	start = time.Now()
	fmt.Print("Searching a missing needle returned ",
		strings.Contains(data_string, MISSING_NEEDLE),
		" in ")
	fmt.Print(time.Since(start).Nanoseconds(), " nanoseconds\n")
}

For small haystacks there is close to no difference between the two languages, with Rust being a bit faster in some cases. The bigger the haystack gets, the faster Go gets. Another very important factor in the speed difference is the number of collisions between substrings of the haystack and prefixes of the needle. For example, the string "This means that our current design will always" will be a lot faster to find using Rabin-Karp: a haystack containing "This", "This means", and "This means that" triggers a lot of partial matches when brute-forcing, causing the Rust algorithm to compare characters for nothing. A needle like "99999999999999" has no prefixes that can trigger a partial match, so brute force is not slowed down by false positives.

When testing for a haystack of 1365470 characters and a needle with a lot of collisions, Go was up to 70 times faster, since that is the perfect scenario for Rabin-Karp.
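For readers unfamiliar with it, the core of Rabin-Karp is a rolling hash: the window hash is updated in constant time per shift, and bytes are compared only when the hashes collide. Below is a minimal sketch; the base and modulus are arbitrary choices for illustration, and this is not Go's actual implementation, which layers other tricks on top:

```rust
// Minimal Rabin-Karp over byte slices. Illustration only.
fn rabin_karp_find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    const BASE: u64 = 256;
    const MOD: u64 = 1_000_000_007;
    let (n, m) = (haystack.len(), needle.len());
    if m == 0 {
        return Some(0);
    }
    if m > n {
        return None;
    }

    // pow = BASE^(m-1) mod MOD, used to remove the outgoing byte.
    let mut pow: u64 = 1;
    for _ in 0..m - 1 {
        pow = pow * BASE % MOD;
    }

    let mut nh: u64 = 0; // needle hash
    let mut wh: u64 = 0; // rolling hash of the current window
    for i in 0..m {
        nh = (nh * BASE + needle[i] as u64) % MOD;
        wh = (wh * BASE + haystack[i] as u64) % MOD;
    }

    for i in 0..=n - m {
        // Compare bytes only when the hashes collide.
        if wh == nh && &haystack[i..i + m] == needle {
            return Some(i);
        }
        if i + m < n {
            // Roll the window: drop haystack[i], append haystack[i + m].
            wh = ((wh + MOD - haystack[i] as u64 * pow % MOD) * BASE
                + haystack[i + m] as u64)
                % MOD;
        }
    }
    None
}
```

The "collision" discussion above maps directly onto this sketch: a partial prefix match costs brute force up to m comparisons, while here it costs nothing unless the full window hash happens to match.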

We should also take into consideration that an implementation of Rabin-Karp would speed up any search for a small byte slice within a bigger byte slice, so it would not be limited to strings.

Using Rabin-Karp (or any rolling-hash algorithm) for small strings is a bad decision, so if we want to add this to str::contains or String::contains we should add a maximum length below which brute force is kept, like this.

My personal opinion is that this should be implemented in the Searcher as an alternative to brute force, and we should also offer a method that lets Rust automatically decide between brute force and a rolling-hash algorithm.

Try a comparison with memchr::memmem. It uses Two-Way (the same thing std uses), but also has SIMD acceleration in a large number of common cases, which tends to speed it up considerably. It's what ripgrep uses for what I would guess to be the vast majority of all searches, and it is crucial to its apparent speed.


Also note that saying "Go uses Rabin-Karp" is somewhat misleading. Like memchr::memmem, Rabin-Karp is just a component of the strategy, not the strategy. See the broader impl here: https://cs.opensource.google/go/go/+/refs/tags/go1.18.3:src/strings/strings.go;l=1097

And in particular, it has a SIMD accelerated skip loop (effectively a memchr) here: https://cs.opensource.google/go/go/+/refs/tags/go1.18.3:src/strings/strings.go;l=1125

The substring search in Rust core doesn't have anything like an AVX accelerated skip loop because the infrastructure hasn't been built to enable such shenanigans in core yet unfortunately. Hence, that's partially why I wrote memchr::memmem.


@burntsushi Congrats on your crate. I will test it out and report back.

I think this functionality is required in the Rust core, though.

Everyone agrees here that this is desirable, but

(It's potentially worth testing with -Zbuild-std -Ctarget-cpu=native to see exactly how well the std contains does or doesn't perform if optimizations have access to SIMD. It's quite possible that this doesn't have any impact, but it very possibly does as well.)

I'd be mildly curious as well, but it's worth noting that the memchr::memmem search uses explicit SIMD. So "simple" autovectorization is unlikely to help much. A quick browse of the code in core suggests there's not much to be gained by -C target-cpu=native. You might get a little something, but I'd be super shocked if it was much.

@hiram-abif And yes, I don't know of anyone who is actively arguing against adding SIMD acceleration to important primitives like substring search. It's a matter of someone championing the cause, understanding the problem, coming up with a solution to the problem and getting buy-in from relevant stakeholders.


I've added this as follows:

fn find_needle_memmem(haystack: &str) -> bool {
    memmem::find_iter(haystack.as_bytes(), &EXISTING_NEEDLE).next().is_some()
}

fn search_missing_memmem(haystack: &str) -> bool {
    memmem::find_iter(haystack.as_bytes(), &MISSING_NEEDLE).next().is_some()
}

However, it is in most cases slower than str::contains():

existing needle took 3762367 nanoseconds and returned true
missing needle took 4139124 nanoseconds and returned false
existing needle MEMMEM took 7182597 nanoseconds and returned true
missing needle MEMMEM took 6921075 nanoseconds and returned false

Am I using the crate correctly?

For the problem of using SIMD in core? I personally can't, no. Be warned, as far as I know, this is a pretty open ended problem. And it might require collaboration with multiple teams, depending on the solution. It's a tough nut to crack, which is why it hasn't been done yet. There is no clear path, as far as I know, that has been laid out yet.

In fact, I can't even find a good issue for it. The best one I can find is this one about speeding up UTF-8 validation with SIMD. You can see that a good bit of the issue is how to actually get SIMD into core: SIMD-enabled utf-8 validation · Issue #68455 · rust-lang/rust · GitHub

I don't know. I need a program I can build and run to test it. You also should disclose your environment. For example, if you aren't on x86_64, things might not be as fast.

I am running on AMD Ryzen 9 3900X 12-Core Processor with 64GB of memory, on archlinux, on x86_64. memchr::memmem is for me twice as slow as str.contains(). Go is still multiple times faster.

Well, where's the program you're running? Give me a repro. Like this:

$ cat Cargo.toml
[package]
name = "adhoc-substring"
version = "0.1.0"
edition = "2021"

[dependencies]
memchr = "2.5.0"

[[bin]]
name = "adhoc-substring"
path = "main.rs"
$ cat main.rs
// See:
// https://internals.rust-lang.org/t/using-a-more-efficient-string-matching-algorithm/16719/4

use std::time::Instant;

use memchr::memmem;

const EXISTING_NEEDLE: &str = "Example: Implementing Vec";
const MISSING_NEEDLE: &str = "this definitely does not exist in the rust book";

fn find_needle(haystack: &str) -> bool {
    haystack.contains(&EXISTING_NEEDLE)
}

fn find_memmem_needle(haystack: &str) -> bool {
    memmem::find(haystack.as_bytes(), EXISTING_NEEDLE.as_bytes()).is_some()
}

fn search_missing_needle(haystack: &str) -> bool {
    haystack.contains(&MISSING_NEEDLE)
}

fn search_memmem_missing_needle(haystack: &str) -> bool {
    memmem::find(haystack.as_bytes(), MISSING_NEEDLE.as_bytes()).is_some()
}

fn print_duration_of_f(name: &str, f: impl FnOnce() -> bool) {
    let now = Instant::now();
    let found = f();
    println!("{name} took {} nanoseconds and returned {found}", now.elapsed().as_nanos());
}

fn main() {
    let haystack = include_str!("text.md");
    print_duration_of_f("existing needle", || { find_needle(haystack) });
    print_duration_of_f("missing needle", || { search_missing_needle(haystack) });

    print_duration_of_f("existing memmem needle", || { find_memmem_needle(haystack) });
    print_duration_of_f("missing memmem needle", || { search_memmem_missing_needle(haystack) });
}
$ cargo build --release
$ ./target/release/adhoc-substring
existing needle took 855929 nanoseconds and returned true
missing needle took 3354082 nanoseconds and returned false
existing memmem needle took 11434 nanoseconds and returned true
missing memmem needle took 98650 nanoseconds and returned false

And I generated text.md like so:

$ git clone https://github.com/rust-lang/book
$ git clone https://github.com/rust-lang/nomicon
$ find ./book ./nomicon -name '*.md' -print0 | xargs -0 cat > text.md

So in my case, memmem is an order of magnitude faster in both benchmarks.

I don't know why your results are different. Sorry. It's worth noting that there is a lot of noise (your benchmark is pretty weird, to be honest; usually you do things in a loop, or at least use a bigger haystack, because these ops are pretty fast):

$ ./target/release/adhoc-substring
existing needle took 5900145 nanoseconds and returned true
missing needle took 748813 nanoseconds and returned false
existing memmem needle took 57574 nanoseconds and returned true
missing memmem needle took 78040 nanoseconds and returned false

$ ./target/release/adhoc-substring
existing needle took 4405502 nanoseconds and returned true
missing needle took 798790 nanoseconds and returned false
existing memmem needle took 59305 nanoseconds and returned true
missing memmem needle took 78671 nanoseconds and returned false

$ ./target/release/adhoc-substring
existing needle took 6920965 nanoseconds and returned true
missing needle took 4676532 nanoseconds and returned false
existing memmem needle took 312380 nanoseconds and returned true
missing memmem needle took 367378 nanoseconds and returned false

$ ./target/release/adhoc-substring
existing needle took 4706928 nanoseconds and returned true
missing needle took 766127 nanoseconds and returned false
existing memmem needle took 56876 nanoseconds and returned true
missing memmem needle took 94669 nanoseconds and returned false

However, memmem is still an order of magnitude faster in this specific example.
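As a sketch of the "do it in a loop" advice from above: timing each search many times and keeping the minimum reduces the noise considerably (`black_box` keeps the optimizer from eliding the call; iteration count and haystack here are made up for illustration):

```rust
use std::hint::black_box;
use std::time::Instant;

// Run `f` many times and report the minimum elapsed time, which is
// typically the least noisy statistic for a fast, deterministic operation.
fn bench(name: &str, iters: u32, f: impl Fn() -> bool) {
    let mut best = u128::MAX;
    let mut result = false;
    for _ in 0..iters {
        let start = Instant::now();
        result = black_box(f());
        best = best.min(start.elapsed().as_nanos());
    }
    println!("{name}: best of {iters} runs = {best} ns (returned {result})");
}

fn main() {
    let haystack = "x".repeat(1_000_000);
    bench("missing needle", 100, || haystack.contains("not there"));
}
```

A dedicated harness such as criterion does this properly (warm-up, statistics, outlier detection), but even this crude loop removes most run-to-run variance.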

See also, benchmark runs that are logged in the memchr repo: memchr/bench/runs/2022-04-29_wasm-changes at master · BurntSushi/memchr · GitHub


If I bump up the corpus to 100x its size, here are my numbers:

$ ./main
Finding the needle returned true in 215283 nanoseconds
Searching a missing needle returned false in 167364593 nanoseconds

$ ./target/release/adhoc-substring
existing needle took 4371582 nanoseconds and returned true
missing needle took 77984320 nanoseconds and returned false
existing memmem needle took 107877 nanoseconds and returned true
missing memmem needle took 12966119 nanoseconds and returned false

I increased the corpus by doing this:

$ for ((i=0; i<100; i++)); do cat text.md; done > text.x100.md

And then updating the programs to use text.x100.md instead of text.md.

It's also worth noting that measuring a substring algorithm with a single benchmark is pure insanity. There is no substring algorithm in existence, to my knowledge, that beats every other substring algorithm on all benchmarks. That's why memchr::memmem's benchmark suite is so freaking huge. It takes hours to run.


Yea, I found where it's slow:

➤ cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.00s
     Running `target/debug/rust-string`
existing needle took 8684726 nanoseconds and returned true
missing needle took 7700827 nanoseconds and returned false
existing memmem needle took 11906207 nanoseconds and returned true
missing memmem needle took 11123639 nanoseconds and returned false
➤ ./target/release/rust-string
existing needle took 2377698 nanoseconds and returned true
missing needle took 796844 nanoseconds and returned false
existing memmem needle took 43382 nanoseconds and returned true
missing memmem needle took 41639 nanoseconds and returned false

Why is it so slow with cargo run?

Because cargo run compiles in debug mode. You need cargo run --release.

go build has no analog to this split, which makes it a bit of an interesting oddball and reflects Go's priority on keeping compilation fast. But most compilers, including Rust's, have a "fast compilation but slow program" mode and a "slow compilation but fast program" mode.


There is no need to be offensive here. When some tests are 70 times faster, it doesn't really matter whether you run one test or two hours of tests.

I understand you consider this "weird" and would probably like to see something fancy with criterion, but I think creating a CI/CD setup that runs batches of tests is out of the scope of this ticket at this point.

Can I suggest that some test cases should be written specifically trying to elicit worst-case behavior from both Two-Way and Rabin-Karp? I see that "This means that our current design will always" has been called out as bad for Two-Way[1], but I don't see anything tailored as worst-case for Rabin-Karp[2]. I'd do this myself if I wasn't completely snowed under with day-job responsibilities this week.


  1. I'm not sure this is accurate, though. It seems to me that the backward-going phase of Two-Way would weed out partial matches on "This", "This means", etc. ↩︎

  2. e.g. searching for "aaa" in "aaaaaaaaaaaaaaa...a" ↩︎
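As a starting point, the kind of case footnote 2 describes can be generated as below. A needle of a's ending in a mismatching byte forces a truly naive loop to do O(n·m) work, since every alignment matches m-1 bytes before failing; both Two-Way and a rolling hash stay linear on it, which is exactly why each algorithm needs its own tailored worst case:

```rust
use std::time::Instant;

// Build the classic quadratic worst case for a naive search: at every
// alignment the first m-1 bytes match and only the last one fails.
fn naive_worst_case(n: usize, m: usize) -> (String, String) {
    let haystack = "a".repeat(n);
    let mut needle = "a".repeat(m - 1);
    needle.push('b');
    (haystack, needle)
}

fn main() {
    let (haystack, needle) = naive_worst_case(1_000_000, 1_000);
    let start = Instant::now();
    let found = haystack.contains(&needle);
    println!("found={found} in {} ns", start.elapsed().as_nanos());
}
```

Constructing an analogous adversarial input for Rabin-Karp means forcing hash collisions on non-matching windows, which depends on the specific hash function used, so it would have to be tailored to the implementation under test.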