Constant slices deduplication

Hi! Sorry if I chosen inproper category.

I am want to talk about constant slices deduplication. Consider this code (playground link ).

fn main() {
    let arg = std::env::args().next().unwrap();
    if arg == "Hello"{
        println!("Hello");
    }
    else if arg == "World"{
        println!("World");
    }
    else{
        println!("Hello World!");
    }
}

We have here three str constants: "Hello World!", "World", "Hello". Having str reference be something like pair of pointers to the beginning and the end (or pointer to beginning or size), I think it is possible to deduplicate all these contants into the first. So let be reference to first be (ptr1, ptr1+12), reference to second be (ptr1 + 6, ptr1 + 11) and reference to last be (ptr1, ptr1 + 5). It would allow result binaries to be more compact at least.

Currently I see that assembly has independed blocks for each value:

.L__unnamed_2:
	.ascii	"Hello"

.L__unnamed_11:
	.ascii	"Hello\n"

.L__unnamed_6:
	.quad	.L__unnamed_11
	.asciz	"\006\000\000\000\000\000\000"

.L__unnamed_5:

.L__unnamed_3:
	.ascii	"World"

.L__unnamed_12:
	.ascii	"World\n"

.L__unnamed_7:
	.quad	.L__unnamed_12
	.asciz	"\006\000\000\000\000\000\000"

.L__unnamed_13:
	.ascii	"Hello World!\n"

Is such kind of optimization desirable and possible?

This is a possible optimization, yes. &str is a (ptr, len) pair, so nothing prevents this; you could do it manually with a static and slicing.

But this problem is a hard problem. In fact, it's NP-hard in the general case.

But even in the simple case of string literals that are substrings of other string literals, checking this rapidly gets more expensive as you put more string literals in your source. The compiler would like to avoid doing these things that take strongly nonlinear effort for typically marginal gain, as it's a large investment of compile time for a small binary size improvement.

1 Like

There are representations that would potentially allow, at a minimum, simple tail-merging of strings. For instance, "xyz" and "abcxyz" could result in the single string "abcxyz" in the binary. Rust can do better than C here, since it uses lengths rather than NUL-termination; that increases both the potential for string merging and the complexity of doing so.

I think the biggest question to ask is how much benefit this would provide, for most programs. This could be reasonably answered by scanning the binaries of existing programs on Linux distributions, as well as by scanning the source code of Rust programs for string literals and checking for potential merging opportunities.

I think the first step for considering any optimization like this would be some statistics that suggest how much benefit it could provide for various Rust code.

(On a different note, similar algorithms also allow tail-merging of identical code, which some compilers already do.)

7 Likes

I think Java compiler might be doing some string "interning" - merging string literals that are fully identical, essentially via a hashmap.

I am not agree with this. I wrote sample implementation:

use fxhash::FxHashMap;

#[derive(Clone, Copy)]
struct DuplicatedFragment<'a>{
    owner: &'a [u8],
    offset: usize,
}

fn dedup_slices(mut slices: Vec<&[u8]>)->FxHashMap<&[u8], DuplicatedFragment>{
    if slices.is_empty() {
        return Default::default();
    }
    // Adding empty string to any case should be trivial later
    assert!(slices.iter().all(|x|x.len() > 0), "This algorithm can handle only nonempty strings.");
    
    slices.sort_unstable_by_key(|&x|std::cmp::Reverse(x.len()));
    // I use min_length because this allows to avoid iterating over a lot of short subslices
    let min_length = slices.last().unwrap().len();

    let mut result = FxHashMap::default();
    let mut dedup: FxHashMap<&[u8], DuplicatedFragment> = FxHashMap::default();
    
    for current in slices.iter().copied(){
        if let Some(d) = dedup.get(current){
            result.insert(current, d.clone());
            continue;
        }
        dedup.insert(current, DuplicatedFragment{owner:current, offset: 0});
        result.insert(current, DuplicatedFragment{owner:current, offset: 0});
        
        for offset in 0..current.len()-min_length{
            let offsetted = &current[offset..];
            for subslice in (min_length..offsetted.len()).rev().map(|x|&offsetted[..x]){
                match dedup.entry(subslice) {
                    // Other string already has same subslice so there is no need to go deeper here
                    std::collections::hash_map::Entry::Occupied(_) => break,
                    std::collections::hash_map::Entry::Vacant(entry) => {
                        entry.insert(DuplicatedFragment{owner:current, offset});
                    }
                }
            }
        }
    }
    result
}

This algorithm has O(N*L*L) (N - number of slices, L - len of longest slice) for both memory and time usage which could be a lot but it is polynomial, not NP-hard.

And I think that my algorithm solve the deduplication task successfully because it puts all smaller strings into bigger ones if bigger ones can contain them (if they can't there is no need).

Tail merging (which was suggested by josh) would have O(N*L) complexity. However, it is not necessary to merge tails, we can do "head merging" too.

1 Like

AFAIK, it doesn't deduplicate string subslices into bigger strings but just make same strings to have same String instance. Rust compiler do this already.

I would look what I can collect in this weekend, probably.

I support continuing this exploration. But just a couple notes, @CAD97 did acknowledge a difference between this approach (using preexisting string literals) and the optimal benefit approach (minimal storage size). Also I think your algorithm is O(N*L*L*L) time since you hash the subslice inside your double loop.

The NP-hard solver stores "abc kmn" and "kmn xyz" both as substrings of "abc kmn xyz". Not that that is likely to be a huge win. A simple solver will get you 99.9% of cases, and people encoding arbitrarily hard problems in their static strings are just going to be out of luck.

I believe LLVM is already doing interning for identical strings.

1 Like

A O(NLL) approximation to a NP-hard problem can be said good, but I think it is already questionable to belong into a compiler. I have not found whether there is already a policy about time complexity on rustc, but I have found one example issue of a quadratic time: https://github.com/rust-lang/rust/issues/7462

It is possible that there is a good enough loglinear algorithm. It appears to me as too much work for little benefit. Although I may be underestimating the effect.

And one small note, a time complexity of O(NL) could appear to be lineal on the whole size. However, in the case of having N-1 strings of size 1 and a single string of size N, we get a total size of 2N-1 but a O(NN) complexity.

1 Like

It's definitely scary to see O(L²) for things that could be coming from include_bytes! and similar.

That said, there might be reasonable options like building a trie of prefixes and suffixes to catch the easy cases. Would need a demonstration that it comes up enough to be worth bothering to try, though.

I've implemented such optimization (in a rough brute-force way) for deunicode crate, but it wasn't a big win.

The biggest optimization for me was reducing 64-bit string lengths to 16-bit. Of course that needed custom representation in memory.

But it's worth keeping in mind that just storing the length of "Hello" takes more than the string itself! Maybe Rust could do something clever about the lengths?

2 Likes

It looks like we're doing ok on this: https://rust.godbolt.org/z/9qMdTj. The 5 for "hello" lives as an immediate in the code, not as a separate constant. It's only if you &"hello" that there's an extra 2×usize constant.

Relatedly, I've been thinking of things like having a ShortVec that would have size 2×usize instead of 3×usize, too, by keeping the length and capacity under √usize::MAX. That's probably too niche for alloc, though.

I had in mind this monstrosity that is an array of strings. Last time I checked (admittedly a while ago) it seemed to store each &str as 2×usize in the executable.

Ah, yeah -- it's [&str], so it gives out &&strs and thus has the problem I mentioned.

And oof, it didn't seem that bad originally, but looking around more I ended up finding "\u{7b}\u{42}\u{69}\u{73}\u{6d}\u{69}\u{6c}\u{6c}\u{61}\u{68}\u{20}\u{41}\u{72}\u{2d}\u{52}\u{61}\u{68}\u{6d}\u{61}\u{6e}\u{20}\u{41}\u{72}\u{2d}\u{52}\u{61}\u{68}\u{69}\u{6d}\u{69}\u{7d}" which is less fun :fearful:

(Edit: fixed Rust algorithm)

I found a paper (PS format) (Google Scholar) with a relatively simple approximation algorithm for computing the smallest common supersequence:

Algorithm Majority-Merge

  1. Input: n sequences, each of length n.
  2. Set supersequence s := null string;
  3. Let a be the majority among the leftmost letters of the remaining sequences. Set s := sa and delete the front a from these sequences. Repeat this step until no sequences are left.
  4. Output s.

Translated to Rust (not tested):

fn common_supersequence(mut sequences: Vec<&[u8]>) -> Vec<u8> {
    let mut first_byte_counts = [0usize; 0x100];
    sequences.drain_filter(|seq| {
        if let Some(&first) = seq.first() {
            first_byte_counts[first as usize] += 1;
            false // keep
        } else {
            true
        }
    });
    let mut retval = Vec::new()
    while !sequences.is_empty() {
        let byte = first_byte_counts.iter().copied().enumerate().max_by_key(|v| v.1).unwrap().0 as u8;
        retval.push(byte);
        sequences.drain_filter(|seq| {
            let (&first, rest) = seq.split_first().unwrap();
            if first != byte {
                return false; // keep
            }
            first_byte_counts[first as usize] -= 1;
            *seq = rest;
            if let Some(&first) = seq.first() {
                first_byte_counts[first as usize] += 1;
                false // keep
            } else {
                true
            }
        });
    }
    retval
}

Though, now that I wrote that all out, I don't think the shortest common supersequence is what we're looking for, since b"abcd" is the shortest common supersequence generated by the above algorithm with input vec![b"ac", b"ab", b"a", b"b", b"c", b"bd"] yet neither b"ac" nor b"bd" appear literally in abcd. Either that, or the above paper is incorrect, which seems unlikely since it has been peer-reviewed.

If you read the Wikipedia link that @CAD97 provided above carefully you will notice that the main topic of that article, the “shortest common supersequence” problem is not actually the one that’s relevant to this thread, but instead the “shortest common superstring” problem. Perhaps the link would’ve been less ambiguous if it directly linked to that relevant part of the Wikikpedia article.

2 Likes