A case for rank-2 polymorphism

This post describes a use case for higher-ranked polymorphism for which I was unable to find a workaround that does not sacrifice performance or flexibility.

In RustCrypto we have traits which look roughly like this (all snippets are significantly simplified):

trait Encrypt {
    const BLOCK_SIZE: usize;
    const PAR_BLOCKS: usize;
    
    fn encrypt(&self, block: &mut [u8; Self::BLOCK_SIZE]);

    fn par_encrypt(
        &self,
        block: &mut [[u8; Self::BLOCK_SIZE]; Self::PAR_BLOCKS],
    );
}

This trait allows you to encrypt data and exposes the ability of some implementations to efficiently encrypt several blocks in parallel using ILP and SIMD. It works fine when we have one backend, but, unfortunately, it is quite sub-optimal when we have several backends and switch between them at runtime (e.g. when we have different backends for SSE2 and AVX2).

This approach has the following issues:

  • PAR_BLOCKS often differs between the underlying backends, i.e. it can not be a single associated constant. Using a "wrong" value for PAR_BLOCKS results in inefficient code.
  • We perform the CPU target-feature check on every method call, and it's currently not possible to force the compiler to hoist this check out.
  • The aforementioned check prevents the compiler from applying some important optimizations, e.g. it results in data spilling to the stack when it could've stayed in registers.
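To make the last two points concrete, here is a minimal runnable sketch of the per-call dispatch (all names are hypothetical stand-ins; `aesni_available` stands in for a real CPU feature check such as `is_x86_feature_detected!("aes")`):

```rust
// Minimal sketch (hypothetical names) of the per-call dispatch described
// above. `aesni_available` stands in for a real CPU feature check.
fn aesni_available() -> bool {
    false // stand-in: pretend the feature is absent
}

struct Cipher;

impl Cipher {
    fn encrypt(&self, block: &mut [u8; 16]) {
        // This branch runs on EVERY call; its mere presence prevents
        // optimizations such as keeping block data in registers.
        if aesni_available() {
            // accelerated path would live here (same stand-in operation)
            for b in block.iter_mut() {
                *b = b.wrapping_add(1);
            }
        } else {
            // portable software fallback (stand-in operation)
            for b in block.iter_mut() {
                *b = b.wrapping_add(1);
            }
        }
    }
}
```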

This trait has to be composable with other algorithms: e.g. block ciphers can be plugged into block modes, which in turn get combined with MACs or universal hashes into AEAD modes. Each algorithm may or may not support parallel block processing, and not only the number of blocks processed in parallel may differ between algorithms, but also their block sizes.

If we are to take a higher-level look at the problem, we want to generate a variant of the program for each supported backend and perform the runtime target-feature check only once, before calling the best variant. And if we are combining several algorithms, then ideally we want to generate a program variant for each combination of backends. In other words, if the cipher supports 3 backends and the MAC supports two, then we want to have 6 variants of the program.

There is an elegant way of solving this problem using rank-2 polymorphism without sacrificing either performance or flexibility:

trait Backend {
    const BLOCK_SIZE: usize;
    const PAR_BLOCKS: usize;

    fn encrypt(&self, block: &mut [u8; Self::BLOCK_SIZE]);
    fn par_encrypt(
        &self,
        block: &mut [[u8; Self::BLOCK_SIZE]; Self::PAR_BLOCKS],
    );
}

trait Encrypt {
    const BLOCK_SIZE: usize;

    fn encrypt_with_backend<F>(&self, f: F)
    where F: for<B: Backend<BLOCK_SIZE = Self::BLOCK_SIZE>> FnOnce(&B);
}

High-level cipher implementation code:

impl Encrypt for Aes128 {
    const BLOCK_SIZE: usize = 16;
    
    fn encrypt_with_backend<F>(&self, f: F)
    where F: for<B: Backend<BLOCK_SIZE = Self::BLOCK_SIZE>> FnOnce(&B)
    {
        if aesni_available() {
            // This backend uses AES-NI intrinsics, so we have to enable
            // the required target features
            #[target_feature(enable = "aes")]
            unsafe fn inner(f: impl FnOnce(&AesniBackend), backend: &AesniBackend) {
                f(backend)
            }

            // SAFETY: we have checked that AES-NI is available
            unsafe { inner(f, &AesniBackend::from(self)) }
        } else {
            f(&SoftwareBackend::from(self))
        }
    }
}

Note that the concrete backend type gets selected by the method itself, not by its user. Also, AesniBackend and SoftwareBackend can, and usually should, be private.

User code:

let cipher = Aes128::new(key);
cipher.encrypt_with_backend(|backend| {
    // in some cases it's important to be able to create such temporaries
    let mut temp_block = [0u8; backend::PAR_BLOCKS];
    let (chunks, tail) = blocks.as_chunks_mut::<backend::PAR_BLOCKS>();
    for chunk in chunks {
        backend.par_encrypt(chunk);
    }
    for block in tail {
        backend.encrypt(block);
    }
});

The closure which the user passes to encrypt_with_backend is effectively a "template" of a program, variants of which we want to generate. Because those templates get inlined during monomorphization, and in the case of the first branch we even explicitly enable additional target features, the compiler should be able to optimize the hell out of such "templates" without any obstacles.
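The monomorphization effect can be illustrated with a toy, compilable model (hypothetical backend names; XOR stands in for real cipher work): the "template" is just a generic function, and each backend type gets its own fully specialized copy in which `B::PAR_BLOCKS` is a compile-time constant.

```rust
// Toy model: two hypothetical backends with different PAR_BLOCKS, and a
// generic "template" function that gets a fully specialized copy per
// backend type during monomorphization.
trait Backend {
    const PAR_BLOCKS: usize;
    fn encrypt(&self, block: &mut [u8; 16]);
}

struct SoftBackend;
impl Backend for SoftBackend {
    const PAR_BLOCKS: usize = 1;
    fn encrypt(&self, block: &mut [u8; 16]) {
        for b in block.iter_mut() { *b ^= 0x55; } // stand-in for real work
    }
}

struct SimdBackend;
impl Backend for SimdBackend {
    const PAR_BLOCKS: usize = 4;
    fn encrypt(&self, block: &mut [u8; 16]) {
        for b in block.iter_mut() { *b ^= 0x55; } // stand-in for real work
    }
}

// Inside each monomorphized copy B::PAR_BLOCKS is a compile-time
// constant, so the chunking below gets specialized per backend.
fn template<B: Backend>(backend: &B, blocks: &mut [[u8; 16]]) {
    for chunk in blocks.chunks_mut(B::PAR_BLOCKS) {
        for block in chunk {
            backend.encrypt(block);
        }
    }
}
```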

Here is how a combination of several primitives could look:

let mut cipher = Aes128Ctr::new(key1);
let mut hasher = Ghash::new(key2);
// note: AES-CTR and GHASH both operate over 128-bit blocks
cipher.encrypt_with_backend(|cback| {
    hasher.update_with_backend(|hback| {
        const PAR_BLOCKS: usize = lcm(cback::PAR_BLOCKS, hback::PAR_BLOCKS);
        let (chunks, tail) = blocks.as_chunks_mut::<PAR_BLOCKS>();
        for chunk in chunks {
            let (cchunks, _) = chunk.as_chunks_mut::<cback::PAR_BLOCKS>();
            for cchunk in cchunks {
                cback.par_encrypt(cchunk);
            }
            let (hchunks, _) = chunk.as_chunks_mut::<hback::PAR_BLOCKS>();
            for hchunk in hchunks {
                hback.par_update(hchunk);
            }
        }
        for block in tail {
            cback.encrypt(block);
            hback.update(block);
        }
    });
});

As you may notice, there is still a tiny sub-optimality: we check CPU capabilities (target features) twice, once for the cipher and once for the hasher. But since we perform those checks only once per processed slice of blocks, in practice their impact will be virtually non-existent.

As for the impact on the final binary size, it should not be too big, since the main meat lives inside the backend code, while the "templates" are usually relatively lightweight in practice.

Well, this is all well and good, but for one small problem: Rust currently does not support rank-2 polymorphism for anything other than lifetimes.

Yes, it's possible to work around the lack of rank-2 polymorphism by using dynamic dispatch, but, unfortunately, not only are such solutions somewhat ugly (e.g. you can not easily create temporary buffers), they are also quite slow performance-wise (in some cases on the level of a 40-70% hit), since it's impossible to force the compiler to inline through dynamic dispatch even when a concrete type is available to it.
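A minimal sketch of such a dyn-based workaround (all names hypothetical), which illustrates both problems:

```rust
// Hypothetical sketch of the dynamic-dispatch workaround. PAR_BLOCKS
// becomes a runtime value, so it can no longer size a stack array, and
// every backend call goes through a vtable.
trait DynBackend {
    fn par_blocks(&self) -> usize;
    fn encrypt(&self, block: &mut [u8; 16]);
}

struct SoftwareBackend;
impl DynBackend for SoftwareBackend {
    fn par_blocks(&self) -> usize { 4 }
    fn encrypt(&self, block: &mut [u8; 16]) {
        for b in block.iter_mut() { *b ^= 0xAA; } // stand-in for real work
    }
}

fn encrypt_with_backend(f: &mut dyn FnMut(&dyn DynBackend)) {
    // in the real code a runtime feature check would pick the backend
    f(&SoftwareBackend)
}
```

Inside the closure, a temporary sized by `backend.par_blocks()` has to be heap-allocated (e.g. `vec![[0u8; 16]; backend.par_blocks()]`) instead of living on the stack, and the `encrypt` call cannot be inlined through the vtable even though the concrete backend type is statically known at the call site.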

How difficult would it be to extend HRTB to types? Is it possible to add such feature without fully committing to HKT? Could GATs help here somehow (personally, I don't see how)?


Regarding workarounds without a performance penalty, something like defining a trait

trait EncryptWithBackendCallback<const BLOCK_SIZE: usize> {
    fn call<B: Backend<BLOCK_SIZE = BLOCK_SIZE>>(self, backend: &B);
}

should be possible; you'd then use

trait Encrypt {
    const BLOCK_SIZE: usize;

    fn encrypt_with_backend<F>(&self, f: F)
    where F: EncryptWithBackendCallback<{Self::BLOCK_SIZE}>;
}
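For concreteness, here is a compilable miniature of this pattern, with `BLOCK_SIZE` fixed to 16 to sidestep the const-generic parameter, and with a hypothetical XOR "backend" standing in for real cipher code:

```rust
// Compilable miniature of the callback-trait encoding (hypothetical
// names, BLOCK_SIZE fixed to 16, XOR standing in for real cipher work).
trait Backend {
    const PAR_BLOCKS: usize;
    fn encrypt(&self, block: &mut [u8; 16]);
}

struct SoftwareBackend;
impl Backend for SoftwareBackend {
    const PAR_BLOCKS: usize = 1;
    fn encrypt(&self, block: &mut [u8; 16]) {
        for b in block.iter_mut() { *b ^= 0xAA; } // stand-in for real work
    }
}

// The rank-2 behaviour is encoded as a trait: `call` is generic over B,
// so the *callee* gets to pick the concrete backend type.
trait EncryptWithBackendCallback {
    fn call<B: Backend>(self, backend: &B);
}

struct Aes128;
impl Aes128 {
    fn encrypt_with_backend<F: EncryptWithBackendCallback>(&self, f: F) {
        // a runtime feature check would select between backends here
        f.call(&SoftwareBackend)
    }
}
```

The cost shows up at the use site: instead of passing a closure, the caller has to define a struct that captures its environment by hand and implement `EncryptWithBackendCallback` for it.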

The main downside is that user code like

let mut cipher = Aes128Ctr::new(key1);
let mut hasher = Ghash::new(key2);
// note: AES-CTR and GHASH both operate over 128-bit blocks
cipher.encrypt_with_backend(|cback| {
    hasher.update_with_backend(|hback| {
        const PAR_BLOCKS = lcd(cback::PAR_BLOCKS, hback::PAR_BLOCKS);
        let (chunks, tail) = blocks.as_chunks_mut::<PAR_BLOCKS>();
        for chunk in chunks {
            let (cchunks, _) = chunk.as_chunks_mut::<cback::PAR_BLOCKS>();
            for cchunk in cchunks {
                cback.par_encrypt(cchunk);
            }
            let (hchunks, _) = chunk.as_chunks_mut::<hback::PAR_BLOCKS>();
            for hchunk in hchunks {
                hback.par_update(hchunk);
            }
        }
        for block in tail {
            cback.encrypt(block);
            hback.update(block);
        }
    });
});

now needs to do something syntactically unpleasant along the lines of:

let mut cipher = Aes128Ctr::new(key1);
let mut hasher = Ghash::new(key2);
// note: AES-CTR and GHASH both operate over 128-bit blocks
cipher.encrypt_with_backend({
    struct Cb<'a> { hasher: &'a mut Ghash }
    impl EncryptWithBackendCallback<{Aes128Ctr::BLOCK_SIZE}> for Cb<'_> {
        fn call<B: Backend<BLOCK_SIZE = {Aes128Ctr::BLOCK_SIZE}>>(self, cback: &B) {
            self.hasher.update_with_backend({
                struct Cb2<'a, B> { cback: &'a B }
                impl<B> UpdateWithBackendCallback for Cb2<'_, B>
                where
                    B: Backend<BLOCK_SIZE = {Aes128Ctr::BLOCK_SIZE}>,
                {
                    fn call<B2: GhashBackend>(self, hback: &B2) {
                        let (chunks, tail) = blocks.as_chunks_mut::<{lcd(B::PAR_BLOCKS, B2::PAR_BLOCKS)}>();
                        for chunk in chunks {
                            let (cchunks, _) = chunk.as_chunks_mut::<{B::PAR_BLOCKS}>();
                            for cchunk in cchunks {
                                self.cback.par_encrypt(cchunk);
                            }
                            let (hchunks, _) = chunk.as_chunks_mut::<{B2::PAR_BLOCKS}>();
                            for hchunk in hchunks {
                                hback.par_update(hchunk);
                            }
                        }
                        for block in tail {
                            self.cback.encrypt(block);
                            hback.update(block);
                        }
                    }
                }
                Cb2 { cback }
            });
        }
    }
    Cb { hasher: &mut hasher }
});

Basically like defining closure types by hand.

(The lcd part (whatever "lcd" exactly means; perhaps you mean "gcd" or "lcm") seems somewhat problematic, because what if that const operation fails?)


Hm, interesting. I will try to play with this workaround. I can not use it directly because of the lack of equality constraints on associated constants, but in practice we currently use typenum either way, so it should not be a problem.

Yeah, it should've been lcm for Least Common Multiple. Fixed.

Unfortunately, it looks like it will indeed be problematic, since there is no place to add a bound for this const computation... But we could work around it by using chunks_exact_mut instead of as_chunks_mut for the outer loop. The compiler should handle constant propagation for such code just fine.
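A sketch of this chunks_exact_mut variant (hypothetical stand-ins; after monomorphization both PAR_BLOCKS values would be compile-time constants, so lcm and the chunk size fold to constants as well):

```rust
// Sketch of the chunks_exact_mut workaround (hypothetical stand-ins).
const fn gcd(a: usize, b: usize) -> usize {
    if b == 0 { a } else { gcd(b, a % b) }
}

const fn lcm(a: usize, b: usize) -> usize {
    a / gcd(a, b) * b
}

fn process(blocks: &mut [[u8; 16]], c_par: usize, h_par: usize) {
    let par = lcm(c_par, h_par);
    // chunks_exact_mut takes an ordinary runtime value, so no const
    // bound is required; constant propagation still specializes the loop.
    let mut chunks = blocks.chunks_exact_mut(par);
    for chunk in chunks.by_ref() {
        for block in chunk.iter_mut() {
            for b in block.iter_mut() { *b ^= 0xAA; } // stand-in for work
        }
    }
    for block in chunks.into_remainder() {
        for b in block.iter_mut() { *b ^= 0xAA; }
    }
}
```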

Totally neither here nor there (kind of off-topic, I think), but, since you mention it...

This patch largely addresses this issue, actually, although I'm not sure I had seen other people notice the problem: wip unordered pr · thomcc/stdarch@ca095d8 · GitHub

I haven't submitted it largely due to:

  • lack of time (although you've reminded me so I'll try to get back to it)
  • it being kind of a pain to collect benchmarks/asm.
  • I know I'll have to defend the use of unordered load/store.
  • Relaxed is already in hot water for the memory model, and Unordered is even worse (although this pattern is correct even if unordered loads return any previously stored value, which IIUC should be true given that this usage is acyclic).