There are two things you could mean by this, with very different implications. One possibility is that you are just saying that LLVM guarantees not to inline a function that is annotated with "this function uses feature X" into a function that doesn't have this annotation. This would still mean that we are on our own for detecting features and we cannot have a feature-dependent block in the middle of a feature-independent function. The other possibility is that LLVM actually does have a "detect runtime presence of feature" intrinsic, tracks feature-dependency at the level of individual machine instructions, and won't hoist a feature-dependent instruction past a detection intrinsic. That would be great if it were true, and would mean that an awful lot of the complexity of this RFC was unnecessary, but it would entail so much complexity inside LLVM that I suspect it's not true.
Some concreteness would probably help here, and since I don't know anything about the guts of LLVM, I am going to express said concreteness with x86(-64) assembly language. The AVX2-specific variant of memchr in GNU libc (sysdeps/x86_64/multiarch/memchr-avx2.S) begins like this (after stripping out a bunch of macro goo that's not relevant):
__memchr_avx2:
/* Check for zero length. */
testq %rdx, %rdx
jz .Lnull
movl %edi, %ecx
/* Broadcast CHAR to YMM0. */
vmovd %esi, %xmm0
vpbroadcastb %xmm0, %ymm0
And the version that assumes only SSE2, which is always available on a 64-bit x86 (sysdeps/x86_64/memchr.S), begins like this:
__memchr_sse2:
movd %esi, %xmm1
mov %edi, %ecx
punpcklbw %xmm1, %xmm1
test %rdx, %rdx
jz .Lreturn_null
punpcklbw %xmm1, %xmm1
(Yes, there are two punpcklbw instructions in a row. No, that's not a mistake, as far as I can tell.)
Now, imagine that x86 had a test_feature_set instruction that let you branch directly on whether or not AVX2 is available. You could then choose between these two implementations using it:
memchr:
test_feature_set #avx2
beq .Lmemchr_sse2 /* if not available */
/* Check for zero length. */
testq %rdx, %rdx
jz .Lnull
movl %edi, %ecx
/* Broadcast CHAR to YMM0. */
vmovd %esi, %xmm0
vpbroadcastb %xmm0, %ymm0
...
.Lmemchr_sse2:
movd %esi, %xmm1
mov %edi, %ecx
punpcklbw %xmm1, %xmm1
test %rdx, %rdx
jz .Lreturn_null
punpcklbw %xmm1, %xmm1
...
Let us further imagine that this is late-stage LLVM intermediate representation for a Rust memchr compiled using the runtime if cfg!(avx2) { ... } else { ... } thing, and the compiler is now looking for shared code that it can factor out. With adjustments to the register allocation choices, it could do this:
memchr:
/* Check for zero length. */
testq %rdx, %rdx
jz .Lreturn_null
movl %edi, %ecx
/* Legacy SSE2 encoding: the VEX-encoded vmovd itself requires AVX. */
movd %esi, %xmm0
test_feature_set #avx2
beq .Lno_avx2 /* AVX2 not available */
/* Broadcast CHAR to YMM0. */
vpbroadcastb %xmm0, %ymm0
...
.Lno_avx2:
punpcklbw %xmm0, %xmm0
punpcklbw %xmm0, %xmm0
...
But the vpbroadcastb instruction must not be moved up any further, because it's only known to be available after the test_feature_set / beq pair. And that's the key question: Does LLVM, in some way, know that? My suspicion is that it doesn't, and that it would cheerfully move the vpbroadcastb up and render the code unsafe to run on non-AVX2 processors.
Incidentally, cpuid is slow and clobbers an awful lot of integer registers. You notice neither version of memchr needs a stack frame? That wouldn't be true if it were invoking cpuid every time it's called, and that's a big point in favor of the ifunc (i.e. "function pointers set up at the beginning of main") approach to architecture-specific code.