Worth optimizing ilog10?

Would it be worth submitting a patch to speed up ilog10? I'm not sure how performance-critical it's considered. I implemented a slightly tweaked version of Steve Cannon's design that isn't quite as biased toward being fast for large numbers. It's about 1.5-2x faster than the existing stdlib implementation for u32, the only width I've implemented so far.

// Map the leading-zero count of a nonzero u32 to a lower-bound estimate
// of its base-10 logarithm.
#[inline]
const fn ilogmap(val_lz: u32) -> u32 {
    // Bitmap of the power-of-two positions at which ilog10 increases.
    const LZ_GUESSMASK: u32 = 0b01001001000100100100010010010000;
    let guess = (LZ_GUESSMASK << val_lz).count_ones();
    if guess > LZ_GUESSMASK.count_ones() {
        // SAFETY: shifting never increases the count of ones
        unsafe {
            std::hint::unreachable_unchecked()
        }
    }
    guess
}

// Requires val > 0; this is ensured by the caller, checked_ilog10.
const fn ilog10(val: u32) -> u32 {
    let guess = ilogmap(val.leading_zeros());
    const TEN_THRESHOLDS: [u32; 10] = [
        9, 99, 999, 9999, 99999, 
        999999, 9999999, 99999999, 999_999_999, u32::MAX,
    ];
    let ttg = TEN_THRESHOLDS[guess as usize];
    guess + (val > ttg) as u32
}
fn benchmark_ilog() {
    const LOOPS: usize = 1;
    const UPTO: u32 = u32::MAX;
    let now = std::time::Instant::now();
    for _ in 0..LOOPS {
        for i in 1..=UPTO {
            std::hint::black_box(i.ilog10());
        }
    }
    let elapsed_real = now.elapsed();
    let now = std::time::Instant::now();
    for _ in 0..LOOPS {
        for i in 1..=UPTO {
            std::hint::black_box(ilog10(i));
        }
    }
    let elapsed_mine = now.elapsed();
    // convert to microseconds
    let elapsed_mine = elapsed_mine.as_micros();
    let elapsed_real = elapsed_real.as_micros();
    println!("new: {elapsed_mine}  real: {elapsed_real}");
}


Compiled with: RUSTFLAGS="-C target-cpu=native" cargo build --release

Benchmark times in microseconds for calculating ilog10 of 1..=u32::MAX:

MacBook Pro M1: new: 2255391 real: 4799908

AMD EPYC 7543: new: 2518772 real: 6827481

Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz: new: 2531450 real: 6062788

Edited to include @sahnehaeubchen's optimization to eliminate the panic possibility in the ilogmap function ("new"; older results now indicated as "w/out unsafe").

Edited 2 to include a further optimization of the second check, based on Hank Warren's solution: comparing against thresholds like 999 (val > 999) instead of val >= 1000, which lets the last table entry be u32::MAX rather than a sentinel value that doesn't fit in a u32.

If it seems worthwhile I can generalize this to u64. The existing implementation for u16 and u8 is probably faster than this one, though I can certainly test.

(I should explain the algorithm: LZ_GUESSMASK is a bitmap of the power-of-two positions at which the log10 value increases -- 16, 128, 1024, etc. The first step shifts the mask by the value's leading-zero count and counts the surviving bits, which equals the number of times log10 has increased below the value's highest set bit, producing an approximate log10. Because there's a gap between the log2 approximation and the true log10 for values between 10^x and the next power of two, the second step's threshold table cleans these up by possibly adding 1 to the estimate.)
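Here's a worked example of the two steps for val = 1000 (my own illustration; the constants come from the code above):

fn main() {
    // 1000 = 0b1111101000, a 10-bit number, so it has 22 leading zeros.
    let val: u32 = 1000;
    let lz = val.leading_zeros();
    assert_eq!(lz, 22);
    // Step 1: shifting the mask left by 22 keeps only the mask bits below
    // 1000's highest set bit; two of those bits are set, so guess = 2.
    const LZ_GUESSMASK: u32 = 0b01001001000100100100010010010000;
    let guess = (LZ_GUESSMASK << lz).count_ones();
    assert_eq!(guess, 2); // an underestimate: ilog10(1000) is 3
    // Step 2: the threshold table corrects the estimate upward.
    const TEN_THRESHOLDS: [u32; 10] = [
        9, 99, 999, 9999, 99999,
        999999, 9999999, 99999999, 999_999_999, u32::MAX,
    ];
    let result = guess + (val > TEN_THRESHOLDS[guess as usize]) as u32;
    assert_eq!(result, 3);
}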


I think it's absolutely worth optimizing standard library functions like this; people will call ilog10 in performance-sensitive code.


Does that mean this version, while faster in aggregate for all numbers, is slower than the status quo for large numbers?

I've no idea for what distribution of inputs log functions are typically invoked, but I can certainly imagine there's a bias toward larger inputs.

No, it's still faster for large numbers; it's just a little slower than Steve Cannon's original table-lookup version that I based the design on: c++ - performance of log10 function returning an int - Stack Overflow

His version uses a u8 lookup table to implement the first "guess", mapping the count of leading zeros to a power-of-ten estimate. It's really fast if you benchmark it sequentially, because it gets very good locality in the lookup table, but the performance depends on input ordering. The mask version is faster if you don't have cache locality for the lookup table.
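For reference, the table-based first guess looks roughly like this (my reconstruction of the idea, not Steve Cannon's exact code; each entry is floor(log10) of the smallest u32 with that leading-zero count):

// Indexed by leading_zeros() of a nonzero u32, i.e. 0..=31.
const LZ_GUESS_TABLE: [u8; 32] = [
    9, 9, 8, 8, 8, 7, 7, 7, 6, 6, 6, 6, 5, 5, 5, 4,
    4, 4, 3, 3, 3, 3, 2, 2, 2, 1, 1, 1, 0, 0, 0, 0,
];

const fn ilogmap_table(val_lz: u32) -> u32 {
    LZ_GUESS_TABLE[val_lz as usize] as u32
}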


Mostly off-topic, but I was just amazed by how fast exhaustively checking the whole range of u32 is these days. On my (still kinda new) laptop, a simple

use rayon::prelude::*;

fn main() {
    (1..=u32::MAX)
        .into_par_iter()
        .for_each(|x| assert_eq!(ilog10(x), x.ilog10()));
}

/* code from OP here… */

check takes about 1.5 seconds.


The current compiler is not smart enough to see that the bit shift cannot increase the count of ones. I have adapted your code so that it compiles to a single-branch version without panics. I guess the performance impact is barely measurable in tight loops, because the branch predictor should have an easy time with the original code as well. The biggest impact is probably the freed cache space, since the code is a bit smaller.


Cool! Thank you. I was wondering how to get rid of that check.

Note that in the context where the internal functions are called, the value has already been filtered to be > 0:

(checked_ilog10)

The change to the initial map function makes a substantial difference, very nice:

ARM M1: 3532556 -> 2984163 microseconds

EPYC: 3469384 -> 2894589


This is the big question for me.

If you assume that the inputs are evenly spread over u32, the basic linear comparison tree works great, since the most common answer is returned more than half the time.
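For concreteness, that linear comparison tree is roughly the following (my illustration, not the actual std implementation); under a uniform distribution the first comparison alone answers about 77% of inputs:

fn ilog10_linear(x: u32) -> u32 {
    // Check the widest (most likely, for uniform inputs) ranges first.
    if x >= 1_000_000_000 { 9 }
    else if x >= 100_000_000 { 8 }
    else if x >= 10_000_000 { 7 }
    else if x >= 1_000_000 { 6 }
    else if x >= 100_000 { 5 }
    else if x >= 10_000 { 4 }
    else if x >= 1_000 { 3 }
    else if x >= 100 { 2 }
    else if x >= 10 { 1 }
    else { 0 }
}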

But if you think of uses for ilog10 like "how many decimal digits do I need?", then smaller numbers could easily be more common.

I have absolutely no idea how to solve this question :slightly_frowning_face:

And if you wish to make this more obvious, you could consider having the internal things take NonZeroUN instead of uN.

Then u32::checked_ilog10 can be Some(NonZeroU32::new(self)?.ilog10()) since those methods are on the non-zero types now too.
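Spelled out as a standalone function, that suggestion would look something like this (a minimal sketch; NonZeroU32::ilog10 is a stable method):

use core::num::NonZeroU32;

fn checked_ilog10(x: u32) -> Option<u32> {
    // NonZeroU32::new returns None for 0, and `?` propagates it.
    Some(NonZeroU32::new(x)?.ilog10())
}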

If you are comparing against the standard library, I would suggest not using -Ctarget-cpu since the standard library is not built with that.


What are the right flags to use to compare with the std library as installed by rustup?

The results are quite different by platform. On EPYC, the new version is 3.5x slower than the stdlib version if compiled without any flags other than --release. On the Mac M1, it's still 2x faster. I presume this relates to whether or not the default flags give access to lzcnt/clz and popcnt (provided alongside SSE4.2 and BMI1).

It's certainly the case that this version is optimized for platforms that have fast implementations of count_ones() and leading_zeros().

On x64, with -C target-feature=+popcnt,+lzcnt the resulting code is fast. Without those features, it's slower than the original. Notably, popcnt and lzcnt have both been available since Haswell in 2013, so they should be fairly widely supported at this point.

Not sure, but to get a fairer comparison, you can copy the stdlib implementation into your benchmark code.


Mac M1, only --release: new: 2299237 real: 4821680 (2x faster)

EPYC, +popcnt, +lzcnt: new: 2310963 real: 5651960 (2.4x faster)

EPYC, only --release: new: 28891184 real: 5670912 (5x slower)

Skylake Xeon, +popcnt, +lzcnt: new: 3681202 real: 6301415 (1.7x faster)

Skylake Xeon, only --release: new: 31592585 real: 6302126 (5x slower)

Definitely needs to be gated with a config flag for popcnt and lzcnt.
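One possible shape for that gating (a hypothetical sketch, not how std actually selects implementations; it assumes the ilog10 from the original post is in scope):

// Use the mask-and-popcount version only when both features are
// enabled at compile time.
#[cfg(all(target_feature = "popcnt", target_feature = "lzcnt"))]
fn ilog10_impl(val: u32) -> u32 {
    ilog10(val)
}

// Otherwise defer to the existing std implementation.
#[cfg(not(all(target_feature = "popcnt", target_feature = "lzcnt")))]
fn ilog10_impl(val: u32) -> u32 {
    val.checked_ilog10().unwrap_or(0)
}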


Since it needs to special-case 0 anyway, is lzcnt important? I'd have guessed that bsr would suffice.

Here's an alternative log10 approximation that uses an integer multiply instead of popcnt.

src

// Returns the lower-bound log10 based only on the highest set bit:
// 1..=15     => 0
// 16..=127   => 1
// 128..=1023 => 2, etc.
// SAFETY: x must be nonzero
#[inline(always)]
const unsafe fn guess_log10(x: u32) -> u32 {
    if x == 0 {
        std::hint::unreachable_unchecked()
    }
    let log2 = x.ilog2();
    // 77/256 ~= log(2)/log(10), with enough accuracy for u32
    log2.wrapping_mul(77) >> 8
}

// SAFETY: x must be nonzero
pub const unsafe fn ilog10(val: u32) -> u32 {
    let guess = guess_log10(val);
    if guess >= 10 {
        std::hint::unreachable_unchecked()
    }
    const TEN_THRESHOLDS: [u32; 10] = [
        9, 99, 999, 9999, 99999, 
        999999, 9999999, 99999999, 999_999_999, u32::MAX,
    ];
    let ttg = TEN_THRESHOLDS[guess as usize];
    guess + (val > ttg) as u32
}

x86 asm

example::ilog10:
        bsr     eax, edi
        imul    eax, eax, 77
        shr     eax, 8
        lea     rcx, [rip + .Lanon.86a2cbdc947e1bc57e9e23fca962766a.0]
        cmp     dword ptr [rcx + 4*rax], edi
        adc     eax, 0
        ret
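As a sanity check on the 77/256 constant (my addition, not from the post), one can verify that it matches floor(log2 * log10(2)) for every possible ilog2 result of a u32:

fn main() {
    for log2 in 0u32..=31 {
        let approx = log2.wrapping_mul(77) >> 8;
        let exact = (log2 as f64 * std::f64::consts::LOG10_2).floor() as u32;
        assert_eq!(approx, exact, "mismatch at log2 = {log2}");
    }
}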

However, in the benchmarks I ran, std's current implementation was still much faster than this. Since that implementation seems much more involved, I haven't looked into it any further.

edit: Some tuning led to an expression that avoids the imul as well. Compared to std's impl, this version seems to have better throughput and equivalent latency, although I haven't done very thorough benchmarking.

src v2

// Equivalent to x.checked_ilog10().unwrap_or(0)
pub const fn ilog10(x: u32) -> u32 {
    // Set the least significant 3 bits so that 0..=6 get the same treatment
    // as 7; this changes nothing about the result for x >= 7.
    let log2 = ((x | 7) >> 1).ilog2();
    debug_assert!(log2 < 31);
    // the guess is close enough for all u32
    let guess = log2.wrapping_mul(5) >> 4;
    debug_assert!(guess < 10);
    if guess >= 10 {
        unsafe {
            std::hint::unreachable_unchecked()
        }
    }
    const TEN_THRESHOLDS: [u32; 10] = [
        9, 99, 999, 9999, 99999, 
        999999, 9999999, 99999999, 999_999_999, u32::MAX,
    ];
    let ttg = TEN_THRESHOLDS[guess as usize];
    guess + (x > ttg) as u32
}

x86 asm v2

example::ilog10:
        mov     eax, edi
        shr     eax
        or      eax, 3
        bsr     eax, eax
        lea     eax, [rax + 4*rax]
        shr     eax, 4
        lea     rcx, [rip + .Lanon.deee93eb7c3754e35991badc71729755.0]
        cmp     dword ptr [rcx + 4*rax], edi
        adc     eax, 0
        ret
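A quick demonstration of the x | 7 trick above (my addition): every x in 0..=7 produces the same shifted value, so ilog2 is well defined even when x is zero.

fn main() {
    for x in 0u32..=7 {
        // (x | 7) >> 1 is 3 for all of 0..=7, so ilog2 returns 1
        // and the guess works out to 0.
        assert_eq!(((x | 7) >> 1).ilog2(), 1);
    }
}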

With no assumption on the uses, a reasonable heuristic could be that all outputs are equally likely, because that's the situation where the function extracts the most information.

If you know that your inputs are uniformly distributed or have a specific likely range, it might be reasonable to write

if x >= 1_000_000_000 {
    // covers over 75% of the cases
    need_all_10_digits()
} else {
    do_the_general_thing(x.ilog10())
}

but if you know your inputs are mostly small but the library function is optimized for uniform inputs, you can't really do anything about that without substituting a different implementation.
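If one did want to benchmark under the "all outputs equally likely" heuristic above, inputs could be drawn by picking the digit count first; a minimal sketch, assuming the rand crate (0.8):

use rand::Rng;

fn log_uniform_u32(rng: &mut impl Rng) -> u32 {
    // Pick the ilog10 result 0..=9 uniformly, then a value in that decade.
    let d = rng.gen_range(0u32..10);
    let lo = 10u32.pow(d);
    let hi = if d == 9 { u32::MAX } else { 10u32.pow(d + 1) - 1 };
    rng.gen_range(lo..=hi)
}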


This is similar to Hank Warren's version, but is faster on ARM; I wonder if that multiply by 5 is getting optimized better. After looking more at which ARM processors support a fast popcount (mostly the A64 series and up), I think I'm becoming convinced of the merit of avoiding it. More ARM CPUs support clz, which all of these optimized versions need anyway, so we'd have something that works more broadly. Your version seems fantastic.

Performance ("mul" is the above version), where the x64's were compiled with +popcount and +lzcnt, times in microseconds for 1..=u32::MAX

Machine Popcount Mul Stdlib
Mac M1 2930456 2718608 4785064
Epyc 2311016 3338139 5649915
Skylake Xeon 4935130 3863517 6301556

Only EPYC seems to prefer the popcount version. Given the improved generality of the multiplication version, it seems worth going with. For comparison, if compiled with only +lzcnt:

EPYC: new: 6704599 mul: 3306040 real: 5668675

And if compiled with neither:

EPYC: new: 28949662 mul: 5817262 real: 5705284 (note the 28M vs 5M).


EPYC's success with popcount makes a lot of sense given that AMD's Zen architectures can manage up to 4 popcnt per cycle with single-cycle latency, whereas typical Intels can do 1 per cycle with 3 cycles of latency (uops.info: popcnt). So yes, there is some amount of variance in these. ^^


And interestingly, using your code to avoid the checks, the Warren version is now even faster on ARM: about 17% faster than the above, and now as fast as the popcount version on AMD EPYC. This might be the winner.

// Requires x > 0 (x.ilog2() would panic on zero).
pub const fn ilog10_mul(x: u32) -> u32 {
    // 9/32 approximates log10(2) from below; for ilog2 results 0..=31
    // the guess is never more than 1 below the true value.
    let guess = x.ilog2().wrapping_mul(9) >> 5;
    debug_assert!(guess < 10);
    if guess >= 10 {
        unsafe { std::hint::unreachable_unchecked() }
    }
    const TEN_THRESHOLDS: [u32; 10] = [
        9,
        99,
        999,
        9999,
        99999,
        999999,
        9999999,
        99999999,
        999_999_999,
        u32::MAX,
    ];
    let ttg = TEN_THRESHOLDS[guess as usize];
    guess + (x > ttg) as u32
}

I think that when I'd previously implemented it, I didn't use the wrapping_mul, so left an extra check there, and didn't use the unreachable_unchecked, both of which caused it to be slower.

It's pretty nice now. And amusingly, it's stock from Hacker's Delight.

(Notably it's still slower than stdlib for x86 platforms that lack lzcnt.)

I don't think this should matter -- * and wrapping_mul are identical in release.

How about this version? https://rust.godbolt.org/z/jhPf7c8GE

pub const fn ilog10_mul_tweaked(x: core::num::NonZeroU32) -> u32 {
    let guess = (x.ilog2() * 9) >> 5;
    const TEN_THRESHOLDS: [u32; 10] = [
        9,
        99,
        999,
        9999,
        99999,
        999999,
        9999999,
        99999999,
        999_999_999,
        u32::MAX,
    ];
    let ttg = unsafe { *TEN_THRESHOLDS.get_unchecked(guess as usize) };
    guess + (x.get() > ttg) as u32
}

Notably, that looks fine on just normal x64, nothing special:

example::ilog10_mul_tweaked:
        bsr     eax, edi
        lea     eax, [rax + 8*rax]
        shr     eax, 5
        lea     rcx, [rip + .L__unnamed_1]
        cmp     dword ptr [rcx + 4*rax], edi
        adc     eax, 0
        ret

Beautiful that LLVM figures out how to do the "add one sometimes" with the carry flag :exploding_head: (The cmp sets the carry flag exactly when the table entry is below x, and adc eax, 0 folds that carry back into the guess.)

(The u32 version will need a pre-check for zero, but that's ok.)

Normally I'd be iffy on a lookup-table version, since those often microbench better than they run in real workloads, but here the current std version has a bunch of big constants anyway, so the cache differences are probably minor.


Also, it's annoying that LLVM can't optimize away that array bounds check itself. It knows the ranges on all the values, but apparently that's not enough :frowning:


Drawback: get_unchecked is not available in stable as a const fn yet.

I wrote earlier that the compiler explorer wasn't getting the optimizations right, but now it is -- maybe I had a caching problem with it.

Now I'm wondering if the difference I saw is that, when I compile my benchmark version, the compiler figures out that x is never zero because of the way it's used, so the code is identical, while your version is free of that check no matter what. Ahh, compilers. :slight_smile: It does make me believe that the two versions will perform identically when inlined into checked_ilog10, though, because the same test for non-zeroness will be present.


The precompiled std is usually built with the release profile plus:

codegen-units = 1
debug-assertions = true # except Apple
overflow-checks = true # except Apple
debug = 1

It only uses the default target feature levels.
