Would it be worth submitting a patch to speed up ilog10? Not sure how performance-critical it's considered. I implemented a slightly tweaked version of Steve Cannon's design that isn't quite as biased towards being fast for large numbers. It's about 1.5-2x faster compared to the existing stdlib implementation for u32, the only one I've implemented so far.
#[inline]
const fn ilogmap(val_lz: u32) -> u32 {
const LZ_GUESSMASK: u32 = 0b01001001000100100100010010010000;
let guess = (LZ_GUESSMASK << val_lz).count_ones();
if guess > LZ_GUESSMASK.count_ones() {
// SAFETY: shifting never increases the count of ones
unsafe {
std::hint::unreachable_unchecked()
}
}
guess
}
// Requires 0 < val <= u32::MAX
// This guarantee is ensured by the caller, checked_ilog10
const fn ilog10(val: u32) -> u32 {
let guess = ilogmap(val.leading_zeros());
const TEN_THRESHOLDS: [u32; 10] = [
9, 99, 999, 9999, 99999,
999999, 9999999, 99999999, 999_999_999, u32::MAX,
];
let ttg = TEN_THRESHOLDS[guess as usize];
guess + (val > ttg) as u32
}
fn benchmark_ilog() {
const LOOPS: usize = 1;
const UPTO: u32 = u32::MAX;
let now = std::time::Instant::now();
for _ in 0..LOOPS {
for i in 1..=UPTO{
std::hint::black_box(i.ilog10());
}
}
let elapsed_real = now.elapsed();
let now = std::time::Instant::now();
for _ in 0..LOOPS {
for i in 1..=UPTO {
std::hint::black_box(ilog10(i));
}
}
let elapsed_mine = now.elapsed();
// convert to microseconds
let elapsed_mine = elapsed_mine.as_micros();
let elapsed_real = elapsed_real.as_micros();
println!("new: {elapsed_mine} real: {elapsed_real}");
}
Compiling with RUSTFLAGS="-C target-cpu=native" cargo build --release
Benchmark times in microseconds for calculating ilog10 of 1 .. u32::MAX:
Macbook pro m1:
new:2255391 real: 4799908
AMD EPYC 7543:
new: 2518772 real: 6827481
Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz:
new: 2531450 real: 6062788
Edited to include @sahnehaeubchen 's optimization to eliminate the panic possibility in the ilogmap function ("new", older results now indicated as "w/out unsafe").
Edited 2 to include further optimization of the second check based on Hank Warren's solution (using 999 comparisons instead of >= 1000 to avoid a sentinel value).
If it seems worthwhile I can generalize to u64. The existing implementation for u16 and u8 is probably faster than this one though I can certainly test.
(I should explain the algorithm - LZ_GUESSMASK is a bitmap of the power-of-two locations at which the log10 value increases -- 16, 128, 1024, etc. So the first lookup just counts the number of bits in the mask, which is equal to the number of times the log10 value increases, producing the approximate log10 value. Because there's a gap between the log2 approximation and the true log10 values for things between 10^x and 2^y, the second lookup table cleans them up by possibly adding 1 to the log10 estimate.)