Expose LLVM's "or disjoint" instruction

Alion · April 23, 2024, 11:16pm

The instruction performs a bitwise or under the assumption that there are no pairs of bits where both bits are set. This is useful for optimization purposes because it acts like an or, an xor, and an unchecked add, all at the same time. It also helps the code be self-documenting.

As far as I'm aware, there is no bulletproof way to get this instruction in user code. I propose adding a core::intrinsics::disjoint_or function, analogously to the unchecked_* functions.

Advantages:

unlock optimization opportunities
save time spent choosing between or, xor, or add to implement a conceptually disjoint or

Drawbacks:

bigger intrinsic surface area
niche optimization route

ACP

scottmcm · April 23, 2024, 11:56pm

You should make an API change proposal for how it would be exposed to stable, like how unchecked_mul and friends are going to be stable in 1.79.

intrinsics are internal and can change whenever, so are relatively uninteresting. (Also, or disjoint is so new -- it's not in https://releases.llvm.org/17.0.1/docs/LangRef.html#or-instruction -- that we might not be able to use it everywhere yet.)

tczajka · April 24, 2024, 11:37am

There is already a way to specify any assumption:

if !condition {
    unsafe { core::hint::unreachable_unchecked() };
}

The assert-unchecked crate wraps it into a macro assert_unchecked!.

So you can already write this as:

if a & b == 0 {
    a | b
} else {
   unsafe { unreachable_unchecked() }
}

Similarly, you can already write:

unsafe { a.checked_mul(b).unwrap_unchecked() }

so it's strange to me that additional functions like unchecked_mul are being added to u32 for this specific operation.

The rationale for adding unchecked_mul is that the above doesn't optimize equally well, but wouldn't a better solution be to improve the optimizer rather than add redundant functions?

josh · April 24, 2024, 2:08pm

What are some of the specific circumstances in which this operation 1) comes up, and 2) provides a useful optimization? What kind of code needs it, what does code generation look like without it, and what does code generation look like with it?

Alion · April 24, 2024, 2:22pm

LLVM does not recognize this right now. Ostensibly that's a bug, but...

I think having those functions is just a concession that we don't live in the world with a "sufficiently smart compiler." I don't know much about LLVM internals, but I can observe their work and suggest things that will help me achieve what I want to. And ultimately, it will always be less work for the optimizer to just say exactly what you mean, so it's much harder for it to break or become brittle or anything.

Alion · April 24, 2024, 2:25pm

Mentioned in the ACP.

tczajka · April 24, 2024, 2:29pm

But we have to assume that the compiler is somewhat smart, otherwise we can't have clean APIs.

This could just be an optimization pass that recognizes the pattern.

I'd say that a.checked_mul(b).unwrap_unchecked() is "saying exactly what we mean". That's kind of the whole point of returning an Option -- the caller decides what to do with it next. Otherwise, for every function that returns an Option we would add an "unchecked" version that does unwrap_unchecked, a "strict" version that does unwrap (strict_mul is also on nightly), etc.

Alion · April 24, 2024, 3:31pm

By the same logic, a.checked_add(b).unwrap_or(!0) is saying u32::saturating_add "exactly." In the stabilization PR for unchecked math, scott writes:

Semantially these are already available on stable [...]. So IMHO we might as well just let people write them directly [..].

As you mentioned, it is possible to semantically write this operation on stable as well... except you don't even get any benefits from doing that right now.

tczajka · April 24, 2024, 3:36pm

That's a judgement call of course, but in this case there is an extra bit of logic that would have to be written by the user, namely that the saturating value in case of overflow is !0 (or better u32::MAX). For signed types the logic is even more complicated.

Whereas unwrap() / unwrap_unchecked() is basically saying directly what we want and nothing more: the bad case should panic / is impossible.

scottmcm · April 24, 2024, 4:02pm

The disjoint flag on or is actually quite new to LLVM, so you might be interested in [RFC] Add or disjoint flag - IR & Optimizations - LLVM Discussion Forums

My understanding is that it's useful because of a confluence of two things:

In general, LLVM would rather use the weaker operation (bitor) than the stronger one (add), since checks like "is this bit set?" are easier without carries.
However, some CPU operations exist with addition but not with bitwise ops, like LEA.

*goes to try to make a demo*

Ah, here we go. These two compile differently in x86

#[no_mangle]
pub unsafe fn demo1(p: *const u8, a: usize, b: usize) -> *const u8 {
    p.add((4*a) + b)
}

#[no_mangle]
pub unsafe fn demo2(p: *const u8, a: usize, b: usize) -> *const u8 {
    p.add((4*a) | b)
}

giving https://rust.godbolt.org/z/f93rPh9EP

demo1:
        lea     rax, [rdi + 4*rsi]
        add     rax, rdx
        ret

demo2:
        lea     rax, [4*rsi]
        or      rax, rdx
        add     rax, rdi
        ret

So the better one here depends on knowing about your target's instructions, which is something that we ideally wouldn't make people know. And if you used disjoint_or here they could compile to the same thing.

Thus this method to say "look, it can be implemented with or, add, or xor; whichever is convenient".

Do I have any good examples of where I'd actually use this? TBH no. It's very important for LLVM to preserve knowledge rather than needing to re-prove it every time, but using it for things like (a << 16) | b in surface rust wouldn't be needed because (assuming small enough inputs) LLVM will prove that itself and then keep it around later for the backend to use.

But I do think it would be worth having in nightly to try to find those things, like how we have exact_div.

scottmcm · April 24, 2024, 4:06pm

That's an interesting example as it's a place where LLVM has said they're not going to optimize that to mul n[su]w: Simplify `sadd_overflow`+`assume` to `add nsw` [InstCombine] · Issue #80637 · llvm/llvm-project · GitHub

So I've done it manually in the library for some cases (Make `checked` ops emit *unchecked* LLVM operations where feasible by scottmcm · Pull Request #124114 · rust-lang/rust · GitHub), but checked_mul isn't one of the places where that style of library change is reasonable.

tczajka · April 24, 2024, 4:22pm

As Rust gains popularity, other compilers will (hopefully) appear. So the argument "rustc + LLVM doesn't optimize X" seems like poor rationale for expanding stable API (core::intrinsics is another story) with redundant functionality. Stable API is forever, compilers can change (including changes to LLVM, to rustc, or completely new compilers), you don't want to tie u32 API to the details of LLVM.

scottmcm · April 24, 2024, 4:55pm

Certainly. But the reason that LLVM has it may well apply to other backends as well.

For example, while I don't think this can be specified with cranelift's InstBuilder today, cranelift's BB optimization works on an e-graph that allows multiple representations of the same instruction so that it can pick the best one given the other things around it.

Thus one could imagine lowering a u32::disjoint_bitor to a cranelift egraph that includes bitor, add, and xor as possibilities, so it can then pick the best representations based on what else is happening in the function.

tczajka · April 24, 2024, 5:04pm

Sure, but it can also do that for regular bitor in cases where it has proved the bits are disjoint, perhaps after you write assert_unchecked!(a & b == 0).

But I was commenting more on unchecked_mul than disjoint_bitor. disjoint_bitor seems like a reasonable operation, but I think if it is added, it should return Option<u32>, so that the caller can decide how to handle the error.

So you would still have the same question as with multiplication: you could also add unchecked_disjoint_bitor and strict_disjoint_bitor, or you could have users write a.disjoint_bitor(b).unwrap_unchecked() and a.disjoint_bitor(b).unwrap(). I'm saying the latter is preferable because the former is additional redundant API.

Alion · April 24, 2024, 5:40pm

Can we come up with a use case that doesn't involve just unwrapping the value? I'm a fan of API symmetry, but I don't remember ever needing that operation for anything other than the performance, and I work with bitwise operations a lot. And if I wanted to write safe code, I would just use a normal bitor or addition. It smells like premature generalization to me.

tczajka · April 24, 2024, 5:44pm

If the disjoint or operation exists, it would be weird if it's unusable in safe code. One advantage is that you'd get an error if the bits aren't actually disjoint.

You listed it yourself in OP as an advantage beyond performance optimization:

Alion · April 24, 2024, 5:48pm

You don't really want to spell that as .disjoint_bitor().unwrap() though, unless you really don't care about the performance. I'm not sure if there is such a niche. Maybe there is.

I'm being slightly defensive because I don't want to pepper the LLVM IR with 3 instructions for every one of my or disjoints, in case I end up needing to write .unwrap_unchecked(). You would probably argue that's for LLVM to figure out, but I'm trying to be pragmatic.

josh · April 24, 2024, 6:47pm

This is the kind of thing I wonder: under what circumstances will it be the case but LLVM can't already prove it? Such as by knowing the exact value of a constant or a constant-folded computation, or by knowing the range of a type?

bjorn3 · April 24, 2024, 6:51pm

You can technically use builder.func.dfg.union(option_a, option_b) to create a value which can be computed in one of two ways. Unfortunately the egraph optimization pass documents that it requires no union values to exist before it starts optimizing. I don't know in what way it uses this assumption.

CAD97 · April 24, 2024, 9:20pm

The primitive ops are special here due to their privileged status of being primitive and conditionally checked based on cfg(overflow_checks). It's because of this that the wide range of different kind_op variants are justifiable, e.g. a.strict_add(b) gives a better panic message than a.checked_add(b).unwrap(), equivalent to the one generated for a + b with -Coverflow-checks=true.

Primitives are also a special case in that they use unchecked_op naming and not op_unchecked like essentially the rest of std. They're different and proud.

For an option-returning fn f, imo the variants are justified if self is legitimately a primitive where the variants have potential impact, or roughly:

checked — if f has a 90% default behavior for handling the None case that would just be unhelpful clutter at call sites if it returned Option. More commonly just an effective synonym, e.g. index/get.
lossy — if f has a canonical default behavior for handling the None case but calling out that this default fallback behavior is being applied is beneficial.
unchecked — if f checking the None case requires asymptotic overhead over just assuming^[1] or checking is intertwined in such a way that the optimizer fails to optimize out the checking given unwrap_unchecked AND this is profiled as performance relevant.
strict — rarely, if f is only conditionally checked (e.g. by debug_assert!) and producing the specific panic on check failure instead of unwrap is beneficial somehow.
wrapping/saturating/overflowing — basically never for non-primitives.
- Although, a "partial" version when partial results are available (e.g. -> (T, Result<()>) and not -> Result<T>) is often beneficial, as partial results being available correlates with partial results being partially usable.

It's ultimately a similar API design question to deciding when API should panic versus return Result. There's no universally applicable answer to either, and it's always a subjective judgement call with tradeoffs either way. Returning reified effects is and should be the default choice, but specific scenarios will justify special cases.

Generally I agree — if some unsafe fn f_unchecked exists, then fn f should exist and do the check for which the prior is unchecked. But as with everything in API design, there are exceptions where the checked version is impossible or just not particularly useful.

The safe version of unchecked_bitmerge (what the ACP names the publicly visible version of the disjoint_or intrinsic) is just bitor.

My crate ptr-union (which niches pointers together via tagging in the low alignment bytes) is perhaps a perfect use case for bitmerge, but it doesn't ever want to use the checked version, because the condition it needs to hold isn't ptr.addr() & TAG == 0 but instead ptr.addr() & TAG_MASK == 0.

I suppose there's also the assert_unsafe_precondition! check to consider, which means if you optimize with cfg!(ub_checks) set, much detectible UB won't remove a control flow arm completely but instead fold to ud2.

Although it is an abuse of notation, for this comparison, consider sized stack moves as "O(0)", which justifies e.g. NonNull::new_unchecked, since the nonnull check is O(1), but if elided, construction is O(0). ↩︎

Topic		Replies	Views
Expose LLVM integer intrinsics for arbitrarily-large integers compiler	16	1553	August 31, 2023
Taming Undefined Behavior in LLVM Unsafe Code Guidelines	6	2151	March 25, 2019
`unordered` as a solution to "Bit-wise reasoning for atomic accesses" Unsafe Code Guidelines	23	3336	October 15, 2019
Optimizing branched assignings based on possible states compiler	5	683	April 26, 2021
Could we have `fetch_not` for `AtomicBool`s? libs	4	985	September 22, 2022

Expose LLVM's "or disjoint" instruction

Related Topics