LLVM's add, sub, mul, div, and rem intrinsics can operate on integers of any bit width. Using these intrinsics instead of equivalent Rust code can yield performance benefits in excess of 20%*. In Rust, these intrinsics are currently only available for integers of width 8, 16, 32, 64, and 128 bits, but it should be possible to use them on integers of arbitrary size.
For an example of the convoluted code currently required to use some of these basic LLVM intrinsics, see the ethnum-rs crate. Not only is the code needlessly convoluted, but it also performs worse than the non-intrinsic alternative unless an extra compiler option is passed. This is unacceptable.
Options:
1. Add new functions to the array types which use LLVM intrinsics to treat the entire array as a single integer.
2. Expose all LLVM intrinsics through a mechanism similar to the asm! macro.
The use of these intrinsics is very niche; I don't believe option 1 makes much sense. Option 2 seems far more robust, and may even be helpful within the compiler itself. To be honest, I don't know how Rust got an asm! macro before it got an llvm! macro.
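To make option 1 concrete, here is a purely hypothetical sketch of the shape such an API could take (the name and the portable body are mine, not an existing or proposed interface); the point of the proposal is that the loop below would instead lower to one wide LLVM add:

```rust
/// Hypothetical shape for option 1: treat a little-endian limb array as one
/// wide integer. The portable loop below is what an intrinsic-backed version
/// would replace with a single wide LLVM add.
fn overflowing_add_limbs<const N: usize>(a: [u64; N], b: [u64; N]) -> ([u64; N], bool) {
    let mut out = [0u64; N];
    let mut carry = false;
    for i in 0..N {
        let (s1, c1) = a[i].overflowing_add(b[i]);
        let (s2, c2) = s1.overflowing_add(carry as u64);
        out[i] = s2;
        carry = c1 | c2;
    }
    (out, carry)
}

fn main() {
    // 2^64 - 1 plus 1 carries into the second limb.
    assert_eq!(overflowing_add_limbs([u64::MAX, 0], [1, 0]), ([0, 1], false));
}
```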
*The 20% number comes from my own experimentation. Compare u64::widening_mul with equivalent code that does not use any u128 representations. The Karatsuba trick doesn't help.
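For reference, the comparison behind that footnote looks roughly like this (my reconstruction of the setup, not the exact benchmark code): the u128-free replacement splits each operand into 32-bit halves and recombines the partial products by hand.

```rust
/// 64x64 -> 128-bit multiply without ever forming a u128: split both operands
/// into 32-bit halves and recombine the four partial products manually.
/// Returns (low, high), matching the unstable u64::widening_mul.
fn widening_mul_no_u128(a: u64, b: u64) -> (u64, u64) {
    let (a0, a1) = (a & 0xFFFF_FFFF, a >> 32);
    let (b0, b1) = (b & 0xFFFF_FFFF, b >> 32);

    let p00 = a0 * b0;
    let p01 = a0 * b1;
    let p10 = a1 * b0;
    let p11 = a1 * b1;

    // Middle column; this sum cannot overflow: (2^32-1)^2 + (2^32-1) < 2^64.
    let mid = p01 + (p00 >> 32);
    let (mid, carried) = mid.overflowing_add(p10);

    let lo = (p00 & 0xFFFF_FFFF) | (mid << 32);
    let hi = p11 + (mid >> 32) + ((carried as u64) << 32);
    (lo, hi)
}

fn main() {
    // Sanity check against a u128-based reference.
    let (a, b) = (0xDEAD_BEEF_CAFE_F00Du64, 0x0123_4567_89AB_CDEFu64);
    let wide = (a as u128) * (b as u128);
    assert_eq!(widening_mul_no_u128(a, b), (wide as u64, (wide >> 64) as u64));
}
```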
LLVM is an implementation detail, not intrinsically part of the language itself. On that particular example: asm! originally hewed very close to LLVM's quirky assembly syntax, but it wasn't stabilized until it had been redesigned into something toolchain-independent. For a while the old form lived on as llvm_asm! to ease the transition, but it was never stabilized.
How do the multipliers emitted by LLVM compare with the assembly in gmp? The implementation space for arbitrary-precision arithmetic is really broad (just see how much code is in gmp or flint to pick two examples!). Practically, you aren't going to get good performance at all sizes without reimplementing a serious chunk of gmp, which rustc developers would have to do from scratch because gmp is under a less permissive license. And as part of the standard library it would have to be supported on the exact same architectures rust itself is supported on.
Well from a quick check the GMP core x86_64 add/sub implementations are, afaict, still basically just adc/sbb loops with fancy load reordering and loop branching, depending on the exact core. You can't really beat a single ALU op, so this is hardly surprising: all the fanciness is to avoid pipeline stalls.
Multiplication, and especially division, can get fancier at the algorithmic level, but that wouldn't affect the intrinsics you'd want, which is all this proposal is. In fact being able to use a fancy language with those intrinsics would make developing the fancy algorithms much nicer!
Avoiding pipeline stalls and loop trickery is LLVM's problem, not rustc's, and it is crazy good at this sort of thing. I doubt anyone expects it to match the hypertuned assembly loops of GMP (but those are machine-generated too, so who knows?), let alone beat them, but most likely it would beat anything you or I could pull off in inline assembly.
A modern x86/x86_64 implementation will use the ADX extension (i.e. adcx and adox) instead of adc, and if you're using ADX alongside multiplication you'll want mulx to avoid clobbering the flags that ADX depends on. This buys you some instruction-level parallelism when working with bignums.
All of that needs to be guarded with runtime CPU feature detection (ADX + BMI2 for MULX).
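A minimal sketch of that gate (the branch bodies are placeholders): mulx lives in BMI2 and adcx/adox live in ADX, so both get checked at runtime before dispatching to the hand-tuned path.

```rust
#[cfg(target_arch = "x86_64")]
fn main() {
    if is_x86_feature_detected!("adx") && is_x86_feature_detected!("bmi2") {
        println!("dispatch to the adcx/adox/mulx inner loop");
    } else {
        println!("fall back to a portable adc-style loop");
    }
}

#[cfg(not(target_arch = "x86_64"))]
fn main() {
    println!("non-x86 target: portable path only");
}
```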
TIL that --emit=asm is not a panacea and that I really should disassemble the actual x86/x86_64 output. I also learned that I have a lot to learn about deep assembly optimizations.
That makes sense. Those that need more power can just use the asm! macro, circumventing the LLVM implementation detail.
Yes, I am trying to write just such a library, but every algorithm for large numbers relies on high-performance algorithms for smaller numbers, and I was hoping that I could use LLVM's expertise in the matter. I'm clearly not doing a great job yet, but I found that increasing the bit-width of my limb increased performance across the board. I tried three algorithms:
1. u64::widening_mul
2. A custom replacement for u64::widening_mul
3. The same replacement, but using a u128 limb
My u128 algorithm beat u64::widening_mul, but u64::widening_mul beat my u64 algorithm by 20%. You can probably see why that caused me some frustration. Even if u128 is the right limb, I thought I was leaving 20% of my performance on the table just because I couldn't use a u256! Now I understand that I wasn't doing enough optimization, and that there are optimization techniques I simply didn't know about. Back to the drawing board!
So far they don't explicitly do so, but the next step of the RFC is to write some assembly checks. At that point they'll know whether Rustc is optimizing their relatively-high-level code well enough. (I would hazard a guess that Rustc is doing a fine job of it.)
It looks like LLVM doesn't currently generate adox or adcx:
#40891, where topperc commented on Apr 20, 2019:
The X86 backend isn't currently set up to model the C flag and O flag separately. We model all of the flags as one register. Because of this we can't interleave the flag dependencies. We would need to do something about that before it makes sense to implement _addcarryx_u64 as anything other than plain adc.
I don't know if anything has changed since, but at least couldn't immediately find anything more recent.
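For completeness, this is roughly how you would reach _addcarryx_u64 from Rust today (just a sketch; per the issue above, LLVM currently lowers it to a plain adc chain anyway):

```rust
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::_addcarryx_u64;

/// Add two 4-limb little-endian integers, returning the carry-out.
/// Safety: the caller must ensure the CPU actually supports ADX.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "adx")]
unsafe fn add4(a: &[u64; 4], b: &[u64; 4], out: &mut [u64; 4]) -> u8 {
    let mut carry = 0u8;
    for i in 0..4 {
        carry = unsafe { _addcarryx_u64(carry, a[i], b[i], &mut out[i]) };
    }
    carry
}

#[cfg(target_arch = "x86_64")]
fn main() {
    if !is_x86_feature_detected!("adx") {
        println!("no ADX on this CPU");
        return;
    }
    let mut out = [0u64; 4];
    // (2^256 - 1) + 1 wraps to zero with a final carry of 1.
    let carry = unsafe { add4(&[u64::MAX; 4], &[1, 0, 0, 0], &mut out) };
    assert_eq!((out, carry), ([0; 4], 1));
}

#[cfg(not(target_arch = "x86_64"))]
fn main() {}
```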
Note that while that's theoretically true, we saw when enabling u128 that it often wasn't true in practice: the operations were commonly broken outside of mainline targets when it came to codegen.
The embarrassingly-parallel operations (like bitwise and, or, and xor) are probably fine -- and we could lower them in rustc if absolutely needed -- but I'm somewhat skeptical about directly using the more complicated ones, especially difficult things like division.
The bigint_helper_methods seem like the way to go here -- and I'd love to see more in that vein, like funnel shifts.
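As a small illustration of what building on those helpers looks like, here's a limb-slice-by-word multiply on top of the nightly carrying_mul (gated behind the bigint_helper_methods feature) -- a sketch, not a tuned implementation:

```rust
// Nightly-only: u64::carrying_mul is gated behind bigint_helper_methods.
#![feature(bigint_helper_methods)]

/// Multiply a little-endian limb slice by a single word in place,
/// returning the final carry-out limb.
fn mul_by_word(limbs: &mut [u64], word: u64) -> u64 {
    let mut carry = 0u64;
    for limb in limbs.iter_mut() {
        // carrying_mul computes limb * word + carry as a full double-width
        // result, split into (low, high) words.
        let (lo, hi) = limb.carrying_mul(word, carry);
        *limb = lo;
        carry = hi;
    }
    carry
}

fn main() {
    let mut n = [u64::MAX, u64::MAX]; // 2^128 - 1
    let carry = mul_by_word(&mut n, 2); // doubling gives 2^129 - 2
    assert_eq!((n, carry), ([u64::MAX - 1, u64::MAX], 1));
}
```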
Addition and subtraction are not the bottlenecks in any algorithm though. If they appear to be bottlenecks, it's really the loads and stores that are taking time. (With all due respect, "just" is doing a lot of heavy lifting in what you wrote.)
I think multiprecision arithmetic, like BLAS, is best left to libraries outside the language. I don't think LLVM is "crazy good" at it either. icc is sometimes better, but this is one place where handwritten code tuned for the application at hand is a win.
I feel like you've missed the essential point though: the language is only adding the intrinsic for a single adc/sbb. So the "just" is only doing work on the "adc/sbb loop" part, and I addressed the "fancy load reordering" further down: essentially, that's already LLVM's problem, not rustc's.
If this were talking about adding a standard library bigint class, now that's a huge amount of work to get a really good quality implementation. I don't see anything controversial about bigint_helper_methods.
I am all for intrinsics for add with carry, subtract with borrow, integer FMA, etc. and would use them all over the place. But the OP was asking about adding 256-bit multipliers to the language. That seems, at least to me, like it belongs in a library.
Most C compilers already have the intrinsics you need for multiprecision arithmetic, and in fact GMP has a backend using them (the generic mpn code). It doesn't compile to code as efficient as the handwritten assembly. This is just something compilers aren't very good at.
I'm interested in being able to create a constant-width integer with any number of bits, similar to what LLVM supports intrinsically. I thought that access to the LLVM intrinsics would be my best option, but apparently multiple-precision integers aren't well-supported OR well-optimized in LLVM. I think my best bet is to use inline assembly for the most common targets, with a pure-Rust fallback (see the sketch below). It would be really nice if Rust's bigint helper methods were well-optimized so I didn't have to write assembly, but I'm also happy to get my hands dirty if I need to.
In other words, I'm fine with closing this feature request.
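For what it's worth, that plan might look something like this purely illustrative sketch: a one-limb add-with-carry in inline asm on x86_64, with a pure-Rust fallback everywhere else.

```rust
/// Add-with-carry for one limb, via inline asm on x86_64.
#[cfg(target_arch = "x86_64")]
fn adc64(a: u64, b: u64, carry_in: bool) -> (u64, bool) {
    use std::arch::asm;
    let sum: u64;
    let carry_out: u8;
    unsafe {
        asm!(
            "add {c}, 255",  // 255 + 1 wraps the 8-bit register, so CF = carry_in
            "adc {a}, {b}",  // a + b + CF
            "setc {c}",      // capture the carry-out
            a = inout(reg) a => sum,
            b = in(reg) b,
            c = inout(reg_byte) carry_in as u8 => carry_out,
            options(pure, nomem, nostack),
        );
    }
    (sum, carry_out != 0)
}

/// Pure-Rust fallback for every other target.
#[cfg(not(target_arch = "x86_64"))]
fn adc64(a: u64, b: u64, carry_in: bool) -> (u64, bool) {
    let (s1, c1) = a.overflowing_add(b);
    let (s2, c2) = s1.overflowing_add(carry_in as u64);
    (s2, c1 | c2)
}

fn main() {
    assert_eq!(adc64(u64::MAX, 0, true), (0, true));
    assert_eq!(adc64(1, 2, false), (3, false));
}
```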
Ah, sorry! I missed the context of the OP referring to actually-arbitrary integer intrinsics, not just the various integer zoo types exposed by Rust. My mistake.
If this is compile-time arbitrary integer width, though, you might be able to do a decent amount better (guessing out of my butt: ~1-5%) than a GMP-style heap-allocated number library, but it's still quite weird and not really something I'd expect LLVM to do a great job at, mostly because it's just such an untested edge case. I'm unsurprised that these paths are apparently even buggy right now!
As far as I know, the arbitrary-precision integers work fine in LLVM's middle end -- when its optimizations encounter them, it doesn't care how big they are, and things like constant propagation work great on them.
It's only the operations without a simple lowering (a bitor on i192, say, lowers trivially to multiple bitor i64) that seem to not work great in codegen.