Worth optimizing ilog10?

matt1985 · February 9, 2023, 3:08am

Unchecked indexing can be done in const contexts on stable with raw pointers: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=c8542ca9228fda1d9909aaa8ef3e0938

let ttg = unsafe { *TEN_THRESHOLDS.as_ptr().add(guess as usize) };

(only this line changes)

dave-andersen · February 11, 2023, 1:40am

Following up, a 64 bit version. It was necessary to slightly increase the precision of the Warren log2(10) approximation, going from *9/32 to *19/64; the effect on performance is minor. On x64 this turns from one lea instruction to two. The size of the lookup table increases from 10 32 bit elements (44 bytes; one cache line) to 19 64 bit elements (152 bytes; three cache lines), which is a bit more serious. I could see arguments either way for either keeping the existing u64 version (one test + div by a constant, which will compile to an imul and some cleanup) or using the 64 bit warren version, depending on how hot the use of the function is. Note that even just adopting the u32 version will still speed the 64 bit codepath, as the existing ilog10 for u64 divides by a large number and then invokes the u32 code.

Since it's getting a little long, I've put the code for all of these versions on github: GitHub - dave-andersen/ilog: playing with integer log implementations in rust and c

Here's the 64 bit version and a test function:

const U64_THRESHOLDS: [u64; 19] = [
    9,
    99,
    999,
    9999,
    99999,
    999999,
    9999999,
    99999999,
    999999999,
    9999999999,
    99999999999,
    999999999999,
    9999999999999,
    99999999999999,
    999999999999999,
    9999999999999999,
    99999999999999999,
    999999999999999999,
    9999999999999999999,
];

pub fn ilog10_u64_mul(x: u64) -> u32 {
    // Use slightly more accurate approximation of log2(10) for u64;
    // this takes two lea instructions on x64 instead of just 1 but not bad.
    let guess: u32 = x.ilog2().wrapping_mul(19) >> 6;
    let ttg = unsafe { *U64_THRESHOLDS.get_unchecked(guess as usize) };
    guess + (x > ttg) as u32
}

// the warren mapping follows a slightly unintuitive invariant:
// The warren map value must be correctable to the real log10 value
// with the addition of at most 1.
fn test_warren_64bit() {
    let mut test_values: Vec<u64> = (0..62).map(|x| 1u64 << x).collect();
    for i in 2..64 {
        test_values.push(((1u128 << i) - 1) as u64);
    }
    test_values.push(u64::MAX);

    for val in test_values {
        let log2val = val.ilog2();
        let real_log10val = val.ilog10();
        // This is unfortunate. The cheap warren map doesn't work.
        // We have to mul by 19, which turns into
        // x << 4 + x << 1 + x
        // which is a little more expensive.
        // on x64 it's .. two lea's. *grin* not bad at all.
        let warren_map = log2val.wrapping_mul(19) >> 6;
        assert!(warren_map == real_log10val || warren_map == real_log10val - 1);
    }
}

fn test_ilog64() {
    println!("Testing warren mapping function");
    test_warren_64bit();
    println!("Testing log of u32s to sanity check");
    let start = std::time::Instant::now();
    (1..=u32::MAX)
        .into_par_iter()
        .map(|x| x as u64)
        .for_each(|x| assert_eq!(ilog10_u64_mul(x), x.ilog10()));
    let elapsed = start.elapsed();
    println!(
        "passed exhaustive u32 test in {:.2} seconds",
        elapsed.as_secs_f64()
    );
    println!("Testing boundary values");
    assert_eq!(ilog10_u64_mul(1u64 << 62), (1u64 << 62).ilog10());
    assert_eq!(ilog10_u64_mul(u64::MAX), u64::MAX.ilog10());
    // Now test the 64 bit version using random 64 bit values
    println!("Testing random u64s");
    let start = std::time::Instant::now();
    (1..128).into_par_iter().for_each(|_| {
        let mut rng = rand::thread_rng();
        for _ in 0..10000000 {
            let x = rng.gen::<u64>();
            assert_eq!(ilog10_u64_mul(x), x.ilog10());
        }
    });
    let elapsed = start.elapsed();
    println!(
        "passed random u64 test in {:.2} seconds",
        elapsed.as_secs_f64()
    );
}

system · May 12, 2023, 1:40am

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Array indexes strong typing	6	1220	March 20, 2021
Possible optimization for boundary check when accessing without side effects in-between? compiler	1	668	August 6, 2020
Missed optimization opportunity for bounds checks	14	1085	April 5, 2022
Suggestion for a low-effort way to take advantage of SIMD and other architecture specific tricks LLVM knows about internals	19	3029	March 25, 2019
Faster PartialEq for small arrays of non 2^n length compiler	25	724	September 14, 2024

Worth optimizing ilog10?

Related topics