Constant slices deduplication

AngelicosPhosphoros · November 19, 2020, 10:57pm

Hi! Sorry if I chosen inproper category.

I am want to talk about constant slices deduplication. Consider this code (playground link ).

fn main() {
    let arg = std::env::args().next().unwrap();
    if arg == "Hello"{
        println!("Hello");
    }
    else if arg == "World"{
        println!("World");
    }
    else{
        println!("Hello World!");
    }
}

We have here three str constants: "Hello World!", "World", "Hello". Having str reference be something like pair of pointers to the beginning and the end (or pointer to beginning or size), I think it is possible to deduplicate all these contants into the first. So let be reference to first be (ptr1, ptr1+12), reference to second be (ptr1 + 6, ptr1 + 11) and reference to last be (ptr1, ptr1 + 5). It would allow result binaries to be more compact at least.

Currently I see that assembly has independed blocks for each value:

.L__unnamed_2:
	.ascii	"Hello"

.L__unnamed_11:
	.ascii	"Hello\n"

.L__unnamed_6:
	.quad	.L__unnamed_11
	.asciz	"\006\000\000\000\000\000\000"

.L__unnamed_5:

.L__unnamed_3:
	.ascii	"World"

.L__unnamed_12:
	.ascii	"World\n"

.L__unnamed_7:
	.quad	.L__unnamed_12
	.asciz	"\006\000\000\000\000\000\000"

.L__unnamed_13:
	.ascii	"Hello World!\n"

Is such kind of optimization desirable and possible?

CAD97 · November 19, 2020, 11:15pm

This is a possible optimization, yes. &str is a (ptr, len) pair, so nothing prevents this; you could do it manually with a static and slicing.

But this problem is a hard problem. In fact, it's NP-hard in the general case.

But even in the simple case of string literals that are substrings of other string literals, checking this rapidly gets more expensive as you put more string literals in your source. The compiler would like to avoid doing these things that take strongly nonlinear effort for typically marginal gain, as it's a large investment of compile time for a small binary size improvement.

josh · November 19, 2020, 11:39pm

There are representations that would potentially allow, at a minimum, simple tail-merging of strings. For instance, "xyz" and "abcxyz" could result in the single string "abcxyz" in the binary. Rust can do better than C here, since it uses lengths rather than NUL-termination; that increases both the potential for string merging and the complexity of doing so.

I think the biggest question to ask is how much benefit this would provide, for most programs. This could be reasonably answered by scanning the binaries of existing programs on Linux distributions, as well as by scanning the source code of Rust programs for string literals and checking for potential merging opportunities.

I think the first step for considering any optimization like this would be some statistics that suggest how much benefit it could provide for various Rust code.

(On a different note, similar algorithms also allow tail-merging of identical code, which some compilers already do.)

atagunov · November 20, 2020, 12:14am

I think Java compiler might be doing some string "interning" - merging string literals that are fully identical, essentially via a hashmap.

AngelicosPhosphoros · November 20, 2020, 8:51am

I am not agree with this. I wrote sample implementation:

use fxhash::FxHashMap;

#[derive(Clone, Copy)]
struct DuplicatedFragment<'a>{
    owner: &'a [u8],
    offset: usize,
}

fn dedup_slices(mut slices: Vec<&[u8]>)->FxHashMap<&[u8], DuplicatedFragment>{
    if slices.is_empty() {
        return Default::default();
    }
    // Adding empty string to any case should be trivial later
    assert!(slices.iter().all(|x|x.len() > 0), "This algorithm can handle only nonempty strings.");
    
    slices.sort_unstable_by_key(|&x|std::cmp::Reverse(x.len()));
    // I use min_length because this allows to avoid iterating over a lot of short subslices
    let min_length = slices.last().unwrap().len();

    let mut result = FxHashMap::default();
    let mut dedup: FxHashMap<&[u8], DuplicatedFragment> = FxHashMap::default();
    
    for current in slices.iter().copied(){
        if let Some(d) = dedup.get(current){
            result.insert(current, d.clone());
            continue;
        }
        dedup.insert(current, DuplicatedFragment{owner:current, offset: 0});
        result.insert(current, DuplicatedFragment{owner:current, offset: 0});
        
        for offset in 0..current.len()-min_length{
            let offsetted = &current[offset..];
            for subslice in (min_length..offsetted.len()).rev().map(|x|&offsetted[..x]){
                match dedup.entry(subslice) {
                    // Other string already has same subslice so there is no need to go deeper here
                    std::collections::hash_map::Entry::Occupied(_) => break,
                    std::collections::hash_map::Entry::Vacant(entry) => {
                        entry.insert(DuplicatedFragment{owner:current, offset});
                    }
                }
            }
        }
    }
    result
}

This algorithm has O(N*L*L) (N - number of slices, L - len of longest slice) for both memory and time usage which could be a lot but it is polynomial, not NP-hard.

And I think that my algorithm solve the deduplication task successfully because it puts all smaller strings into bigger ones if bigger ones can contain them (if they can't there is no need).

Tail merging (which was suggested by josh) would have O(N*L) complexity. However, it is not necessary to merge tails, we can do "head merging" too.

AngelicosPhosphoros · November 20, 2020, 8:58am

AFAIK, it doesn't deduplicate string subslices into bigger strings but just make same strings to have same String instance. Rust compiler do this already.

I would look what I can collect in this weekend, probably.

quinedot · November 20, 2020, 9:25am

I support continuing this exploration. But just a couple notes, @CAD97 did acknowledge a difference between this approach (using preexisting string literals) and the optimal benefit approach (minimal storage size). Also I think your algorithm is O(N*L*L*L) time since you hash the subslice inside your double loop.

toc · November 20, 2020, 5:59pm

The NP-hard solver stores "abc kmn" and "kmn xyz" both as substrings of "abc kmn xyz". Not that that is likely to be a huge win. A simple solver will get you 99.9% of cases, and people encoding arbitrarily hard problems in their static strings are just going to be out of luck.

I believe LLVM is already doing interning for identical strings.

nakacristo · November 20, 2020, 9:15pm

A O(NLL) approximation to a NP-hard problem can be said good, but I think it is already questionable to belong into a compiler. I have not found whether there is already a policy about time complexity on rustc, but I have found one example issue of a quadratic time: Match checking has quadratic average complexity · Issue #7462 · rust-lang/rust · GitHub

It is possible that there is a good enough loglinear algorithm. It appears to me as too much work for little benefit. Although I may be underestimating the effect.

And one small note, a time complexity of O(NL) could appear to be lineal on the whole size. However, in the case of having N-1 strings of size 1 and a single string of size N, we get a total size of 2N-1 but a O(NN) complexity.

scottmcm · November 20, 2020, 11:07pm

It's definitely scary to see O(L²) for things that could be coming from include_bytes! and similar.

That said, there might be reasonable options like building a trie of prefixes and suffixes to catch the easy cases. Would need a demonstration that it comes up enough to be worth bothering to try, though.

kornel · November 20, 2020, 11:07pm

I've implemented such optimization (in a rough brute-force way) for deunicode crate, but it wasn't a big win.

The biggest optimization for me was reducing 64-bit string lengths to 16-bit. Of course that needed custom representation in memory.

But it's worth keeping in mind that just storing the length of "Hello" takes more than the string itself! Maybe Rust could do something clever about the lengths?

scottmcm · November 20, 2020, 11:37pm

It looks like we're doing ok on this: Compiler Explorer. The 5 for "hello" lives as an immediate in the code, not as a separate constant. It's only if you &"hello" that there's an extra 2×usize constant.

Relatedly, I've been thinking of things like having a ShortVec that would have size 2×usize instead of 3×usize, too, by keeping the length and capacity under √usize::MAX. That's probably too niche for alloc, though.

kornel · November 21, 2020, 12:39am

I had in mind this monstrosity that is an array of strings. Last time I checked (admittedly a while ago) it seemed to store each &str as 2×usize in the executable.

scottmcm · November 21, 2020, 12:59am

Ah, yeah -- it's [&str], so it gives out &&strs and thus has the problem I mentioned.

And oof, it didn't seem that bad originally, but looking around more I ended up finding "\u{7b}\u{42}\u{69}\u{73}\u{6d}\u{69}\u{6c}\u{6c}\u{61}\u{68}\u{20}\u{41}\u{72}\u{2d}\u{52}\u{61}\u{68}\u{6d}\u{61}\u{6e}\u{20}\u{41}\u{72}\u{2d}\u{52}\u{61}\u{68}\u{69}\u{6d}\u{69}\u{7d}" which is less fun

programmerjake · November 22, 2020, 10:51pm

(Edit: fixed Rust algorithm)

I found a paper (PS format) (Google Scholar) with a relatively simple approximation algorithm for computing the smallest common supersequence:

Algorithm Majority-Merge

Input: n sequences, each of length n.

Set supersequence s := null string;

Let a be the majority among the leftmost letters of the remaining sequences. Set s := sa and delete the front a from these sequences. Repeat this step until no sequences are left.

Output s.

Translated to Rust (not tested):

fn common_supersequence(mut sequences: Vec<&[u8]>) -> Vec<u8> {
    let mut first_byte_counts = [0usize; 0x100];
    sequences.drain_filter(|seq| {
        if let Some(&first) = seq.first() {
            first_byte_counts[first as usize] += 1;
            false // keep
        } else {
            true
        }
    });
    let mut retval = Vec::new()
    while !sequences.is_empty() {
        let byte = first_byte_counts.iter().copied().enumerate().max_by_key(|v| v.1).unwrap().0 as u8;
        retval.push(byte);
        sequences.drain_filter(|seq| {
            let (&first, rest) = seq.split_first().unwrap();
            if first != byte {
                return false; // keep
            }
            first_byte_counts[first as usize] -= 1;
            *seq = rest;
            if let Some(&first) = seq.first() {
                first_byte_counts[first as usize] += 1;
                false // keep
            } else {
                true
            }
        });
    }
    retval
}

Though, now that I wrote that all out, I don't think the shortest common supersequence is what we're looking for, since b"abcd" is the shortest common supersequence generated by the above algorithm with input vec![b"ac", b"ab", b"a", b"b", b"c", b"bd"] yet neither b"ac" nor b"bd" appear literally in abcd. Either that, or the above paper is incorrect, which seems unlikely since it has been peer-reviewed.

steffahn · November 23, 2020, 12:22am

If you read the Wikipedia link that @CAD97 provided above carefully you will notice that the main topic of that article, the “shortest common supersequence” problem is not actually the one that’s relevant to this thread, but instead the “shortest common superstring” problem. Perhaps the link would’ve been less ambiguous if it directly linked to that relevant part of the Wikikpedia article.

system · February 21, 2021, 12:22am

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
A little proposal for string concatenation language design	13	1465	March 25, 2019
Pre-RFC: Constant slicing language design	3	978	March 25, 2019
Const slicing	10	7491	August 4, 2020
Slice equality? libs	7	6316	August 26, 2021
More musings about slices and arrays language design	13	1453	January 3, 2021

Constant slices deduplication

Related topics