RFC: Extending Beyond Base 36 for Free or Cheaper

This proposal extends dec and hex schemes upto radix 64. The common case, decimal, retains its cost of 1 sub and 2 tests. In bigger radices numeric digits save one test and letters save even more. Without affecting those costs, it can handle 28 more digits. As a bonus, I don’t need mut. (This is not to be confused with internet base64 encoding, RFC 4648.)

Beyond 36 (unlike RFC 4648 and mathematicians) most software chose to put lowercase first. Beyond 62, various libraries let the caller supply 2 more “digits” of their choice. Some give at least a default: Bash, to not collide with arithmetic, chose unusual digits @, _. The others took only these last two digits from that RFC. Base2n went with the Url variant -, _. Basencode and convertBase have the normal +, /. I follow that.

N.b. For easy testability as a normal function, I have temporarily renamed to selfie. I have fixed the bug of not checking the radix lower bound, but see below. For the same effort it took to add a FIXME, couldn’t we have gotten const then_some? Restricting such a simple function would never cause a regret.

const fn to_digit(selfie: char, radix: u32) -> Option<u32> {
    let digit =
        if radix <= 36 {
            assert!(radix >= 2, "to_digit: radix is too low (minimum 2)");
            match selfie {
                // If not a digit, a number greater than radix will be created.
                ..='9' => (selfie as u32).wrapping_sub('0' as u32),
                // Force the 6th bit to be set to ensure ascii is lower case.
                _ => (selfie as u32 | 0b10_0000).wrapping_sub('a' as u32).saturating_add(10)
            }
        } else {
            assert!(radix <= 64, "to_digit: radix is too high (maximum 64)");
            match selfie {
                '+' => 62, // as in RFC 2045 & 4648
                '/' => 63,
                ..='9' => (selfie as u32).wrapping_sub('0' as u32),
                // Test these in Ascii order.
                ..='Z' => (selfie as u32).wrapping_sub('A' as u32).saturating_add(36),
                ..='z' => (selfie as u32).wrapping_sub('a' as u32).saturating_add(10),
                _ => return None
            }
        };
    // FIXME: once then_some is const fn, use it here
    if digit < radix { Some(digit) } else { None }
}

The major caller, from_str_radix, checks the radix bounds. So I propose this more efficient panic-free alternative. If the almost doubling of code is an issue, the identical matches could be factored out. Btw. they have better error diagnostics when not in const context. I would be cool to make those accessible above.

const fn to_digit_unchecked(selfie: char, radix: u32) -> Option<u32> {
    let digit =
        if radix <= 36 {
            match selfie {
                // If not a digit, a number greater than radix will be created.
                ..='9' => (selfie as u32).wrapping_sub('0' as u32),
                // Force the 6th bit to be set to ensure ascii is lower case.
                _ => (selfie as u32 | 0b10_0000).wrapping_sub('a' as u32).saturating_add(10)
            }
        } else {
            match selfie {
                '+' => 62, // as in RFC 2045 & 4648
                '/' => 63,
                ..='9' => (selfie as u32).wrapping_sub('0' as u32),
                // Test these in Ascii order.
                ..='Z' => (selfie as u32).wrapping_sub('A' as u32).saturating_add(36),
                ..='z' => (selfie as u32).wrapping_sub('a' as u32).saturating_add(10),
                _ => return None
            }
        };
    // FIXME: once then_some is const fn, use it here
    if digit < radix { Some(digit) } else { None }
}

fn test() {
    for c in ' '..='~' {
        print!("'{c}'");
        for r in [2, 8, 10, 16, 36, 62, 64] {
            // Directly {:8?} is inconsistent on Option and nesting w/o to_string ignores width.
            print!("  {r}: {:8} {:8}",
                format_args!("{:?}", to_digit(c, r)).to_string(),
                format_args!("{:?}", to_digit_unchecked(c, r)).to_string());
        }
        println!("");
    }
}

Only doc updates and for from_str_radix, also minor panic string and 36 → 64 changes are needed.

What are the use cases for this? Why can't this live in a crate?

Rust generally does not adapt features without concrete use cases. A feature needs to carry it's weight. And I'm arguing this is incredibly niche. Even in languages which do support unusual bases I have never seen them actually used.

10 Likes

I’ve had a use case at work, which could of course have been served by a crate (in which case an internet-base64 could have been a compromise too.)

I’m not happy with Rust’s crates.io mentality! Just about every useful crate is either itself a V0.x, or at least it has many such dependencies. In this case for something that can be done in a few lines of code, and at no extra cost, the equation doesn’t balance.

A (from computing perspective) random radix like 36 was ok to implement, even though it falls under your “nobody needs it”. Then surely rounding it up to 64, while making the function partially cheaper, is a win!

Proposing incorporation of a syntax change into an established language generally encounters a high bar with regard to its prospects of acceptance. One option here is to present some compelling use cases. However, if such use cases are not available as examples, perhaps this is an opportunity for you to develop a useful crate, and see it through to version 1.0, without its relying on dependencies that have not been properly completed.

4 Likes

Conventions beyond base 36 don't seem to be standardized. That indicates this doesn't belong in std. Bash is the only widely used example here, the other three libraries don't seem popular.

Creating a 1.0 third party library is a much lower bar than inclusion in std, so why not start with that?

6 Likes

While not at all prepared to weigh in on whether or not this proposal should be adopted, I will note instead that new tools of some sort for working with a variety of number systems would be quite interesting.

Creating a library, shepherded to completion of version 1.0, does have the advantage that the developer would have all the power over what to include. If this proposal were instead to be adopted as a syntax change to the language, then judging by what folks have been saying here, it seems likely that it would be a highly pared down version of what is being proposed.

In addition to mainstream number systems, a library could include some interesting niche number systems, which would be unlikely to happen as a language syntax change. Would anyone care to guess what number this might represent, either as a literal, if it were to be adopted as a syntax feature, or as a stringy input and output representation available in a library?

0z10010000

Answer:

Summary

42 in Zeckendorf positional notation

Up until recently the answer to this question would be that the language was insufficiently powerful to express these functions; you could only use non-blessed bases as "literals" outside performance critical contexts. But now they are/can/will be const, so it's only really if baseN became just overwhelmingly popular that it should really be considered for std.

As this seems to be going nowhere, I have further optimised it (by eliminating the branch above 36 and going branchless) and submitted Optimise and bug-fix `to_digit` · Issue #132428 · rust-lang/rust · GitHub.

1 Like

That's because you haven't established a use case or why this needs to go in the standard library directly. There are a number of issues that you're simply ignoring. They have been brought up repeatedly.

4 Likes

I’m just coping constructively with the situation. Whereas you seem to be implying that I’m insisting against better advice that this be implemented. Did you even read what I linked?

1 Like